Distributed filesystem for Debian clusters?
Posted by jooray on Tue 7 Feb 2006 at 09:58
I'm looking for a way to make a Debian web cluster completely fault-tolerant. I already have heartbeat, a MySQL cluster, and two firewalls in a redundant setup. The only thing missing is a filesystem that is completely distributed (i.e. symmetric).
I have tried several solutions to this problem. I looked at GFS and OCFS2, which require shared storage. I was told that running a cluster filesystem over NBD, which is not cluster-aware, is a serious risk -- GFS and OCFS2 should be used with real shared storage. It is possible to build cheap shared storage using FireWire. See this article.
But what I'm really looking for is a stable, working, distributed filesystem. This way I could have a backup machine in a different datacenter, connected over some form of fast link. I was looking into AFS, but it seems to be read-only for 2.6 (or at least that's what the kernel option says; maybe the OpenAFS tarball works on 2.6).
Does anyone have a good tip for a distributed filesystem and/or a HOWTO for AFS on Debian?
[ Parent | Reply to this comment ]
Other GFS'es
www.lustre.org
[ Parent | Reply to this comment ]
Lustre also needs shared storage for metadata servers.
[ Parent | Reply to this comment ]
0.8 is not released.
Well... I must say I'm looking forward to it. This would be a perfect solution for me.
[ Parent | Reply to this comment ]
Someone brought this up to me: Is there a particular reason why the syncing can't be done within the web app (to the other web app) or via scp's?
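For what it's worth, here is a minimal sketch of what scp-style syncing could look like, using rsync over ssh to push a document root from one node to its peers. The host names, the path, and the DRY_RUN switch are all assumptions made for illustration; none of them come from this thread. Note that this is one-way replication, not a symmetric filesystem.

```shell
#!/bin/sh
# Sketch only: one-way push of a docroot to peer web nodes.
# PEERS, DOCROOT and DRY_RUN are hypothetical names, not from the thread.
PEERS="${PEERS:-web2.example.org web3.example.org}"
DOCROOT="${DOCROOT:-/var/www/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

# Print the rsync command line used for one peer.
sync_cmd() {
    printf 'rsync -az --delete -e ssh %s %s:%s\n' "$DOCROOT" "$1" "$DOCROOT"
}

# Push to every peer; a failure on one peer does not stop the others.
push_all() {
    for peer in $PEERS; do
        if [ "$DRY_RUN" = "1" ]; then
            sync_cmd "$peer"
        else
            rsync -az --delete -e ssh "$DOCROOT" "$peer:$DOCROOT" ||
                echo "sync to $peer failed" >&2
        fi
    done
}

push_all
```

With the defaults it only prints the two rsync commands it would run. Writes made on a peer are overwritten by the next push, so this only suits setups where one node is the single place content gets changed.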
[ Parent | Reply to this comment ]
DRBD + GFS seems like a waste of processing power to me, from a design point of view.
DRBD just makes sure your block-level data is redundant.
GFS just makes sure you have a common (shared) block device, and allows multiple systems to operate a filesystem on it concurrently. (Correct me if I'm wrong...)
Now, both are great feats in themselves (and I don't want to diminish the accomplishments of those projects in any way), but wouldn't it be better if we just had filesystem-level replication?
Just as an NFS client doesn't need to care about the filesystem on the remote side, the FS replication client (and possibly server) wouldn't need to know the specifics of the filesystem involved.
[ Parent | Reply to this comment ]
There are source packages in sid and etch:
http://packages.qa.debian.org/o/openafs.html
[ Parent | Reply to this comment ]
The server side, I don't know, I have not tried that. Running AFS servers on AIX here.
[ Parent | Reply to this comment ]
We use it here on a 4-node cluster. But be aware that you can have only one rw copy of a volume at a time, but many ro copies. So you have to take care of switching one copy to rw when the node containing the rw volume dies.
Also, you need a minimal Kerberos setup.
But if you want a network fs, it is IMHO the way to go.
As you mentioned web serving, I assume you need only read access to the data most of the time, right? If so, AFS really has everything you need. But if you need byte-range locking, or you have many apps writing to the fs all the time... well, then this might not be what you want.
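The manual switch described above might look roughly like the following sketch. It assumes an OpenAFS version whose vos command suite includes convertROtoRW, and the server, partition and volume names are placeholders, not anything from this poster's cluster. By default it only prints the commands; check `vos help` on your version before trusting any of the options.

```shell
#!/bin/sh
# Sketch only: keep a read-only replica of a volume, and promote it
# when the node holding the read/write copy is gone.
# All names below are hypothetical placeholders.
VOLUME="${VOLUME:-web.docs}"
SURVIVOR="${SURVIVOR:-afs2.example.org}"
PARTITION="${PARTITION:-a}"
ACTION="${ACTION:-replicate}"   # replicate | promote
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Normal operation: register a read-only site, then push the
# current read/write contents out to the replicas.
replicate() {
    run vos addsite -server "$SURVIVOR" -partition "$PARTITION" -id "$VOLUME" &&
    run vos release -id "$VOLUME"
}

# Failover: turn the surviving read-only copy into the read/write volume.
promote() {
    run vos convertROtoRW -server "$SURVIVOR" -partition "$PARTITION" -id "$VOLUME"
}

case "$ACTION" in
    replicate) replicate ;;
    promote)   promote ;;
    *)         echo "unknown ACTION: $ACTION" >&2 ;;
esac
```

Anything written to the old rw copy after the last release is not in the replica, so the promoted volume can be slightly stale.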
[ Parent | Reply to this comment ]
The version supplied with Debian Stable is fairly crappy. If you want to run OpenAFS then backport the version from Testing/Unstable if you can.
I use wajig. Set up a deb-src line for testing in your sources.list, then run:
wajig build-depend openafs-fileserver
wajig build openafs-fileserver
Then install the resulting .deb files.
That way you can run the OpenAFS version from testing without having to pull most of testing into stable.
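If wajig is not installed, roughly the same backport can be sketched with plain apt tools. This assumes a deb-src line for testing is already in sources.list and that the source package is called openafs (as the packages.qa.debian.org link elsewhere in this thread suggests). The DRY_RUN switch and the function names are illustration only.

```shell
#!/bin/sh
# Sketch only: rebuild a package from testing sources on stable.
# Assumes /etc/apt/sources.list already has a deb-src line for testing.
PKG="${PKG:-openafs}"       # assumed source package name
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Refresh package lists, install build dependencies, then fetch and
# build the source package in the current directory.
backport() {
    run apt-get update &&
    run apt-get build-dep "$1" &&
    run apt-get -b source "$1"
}

backport "$PKG"
```

With DRY_RUN=0 the first two steps need root; the resulting .deb files land in the current directory and can be installed with dpkg -i.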
OpenAFS is very nice and is secure enough to use even over the internet. The downside is that the non-Linux client support for it (i.e. OS X or Windows) isn't going to be very good. Its permissions model is not POSIX, although it maps to the owner's read/write/execute permissions well. Also, it's not good for large files.
But for lots of small files and such it is fast and reliable over slow links, thanks to its caching and file-change monitoring features. It's ideal for many different situations, such as sharing out lots of files to lots of people on a campus-wide network with a complex topology and lots of wireless stuff.
[ Parent | Reply to this comment ]
CARP with pfsync is beautiful!
[ Parent | Reply to this comment ]
Not really distributed, but kind of RAID1 over the network.
Has worked great in production for over 5 years!
http://www.drbd.org/
Happy clustering !
Nicolas@Bouthors.org
[ Parent | Reply to this comment ]
Ian Blenke has posted some short notes on most of the filesystems mentioned. You can find it here: http://ian.blenke.com/projects/cornfs/braindump/braindump.html
[ Parent | Reply to this comment ]
From the PVFS2 Website: (http://www.pvfs.org/pvfs2/)
========================
Parallel I/O continues to be a topic of active development. Recent years have seen the creation of many new options. Even with these new choices, certain factors remain constant. Parallel applications need a fast I/O subsystem. Clusters need a parallel file system that can scale as the number of nodes increases to the thousands and tens of thousands. PVFS2 is our answer.
Many institutions and researchers have used the first generation of the Parallel Virtual File System (PVFS) with much success. The time has come for the second generation. PVFS2 continues to serve as both a platform for parallel I/O research as well as a production file system for the cluster computing community.
The PVFS project is conducted jointly between The Parallel Architecture Research Laboratory at Clemson University and The Mathematics and Computer Science Division at Argonne National Laboratory. Funding for the PVFS project includes the following sources:
* NASA Goddard Space Flight Center Code 930
* the National Computational Science Alliance through the National Science Foundation's Partnership for Advanced Computational Infrastructure
* the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
PVFS2 provides the following features:
* Ease of installation
* User-controlled striping of files across nodes, and a well defined interface for defining new distribution schemes
* Multiple interfaces, including a MPI-IO interface via ROMIO
* Utilizes commodity network and storage hardware
* Very modular design
* Native support for several popular networking technologies like Myrinet, Infiniband, and TCP/IP
* Support for user-defined access patterns
* Support for heterogeneous clusters
* Distributed metadata
PVFS2 provides a Linux kernel module which supports the UNIX I/O interface and allows existing UNIX I/O programs to use PVFS2 files without recompiling. The familiar UNIX file tools (ls, cp, rm, etc.) will all operate on PVFS2 files and directories as well.
PVFS2 is easy to install. There is no need for extra kernel patches. The Documentation page describes how to set up a simple installation. Scripts and test applications are included to help with configuration, testing for correct operation, and performance evaluation.
PVFS2 stripes file data across multiple disks in different nodes in a cluster. By spreading out file data in this manner, larger files can be created, potential bandwidth is increased, and network bottlenecks are minimized.
Multiple user interfaces are available. This includes:
* MPI-IO support through ROMIO (see the ROMIO homepage for details)
* Traditional Linux file system access through the use of a linux kernel driver module
[ Parent | Reply to this comment ]
Can PVFS2 tolerate server failures?
Yes. We currently have a recipe describing the hardware and software needed to set up PVFS2 in a high availability cluster. Our method is outlined in the `pvfs2-ha.{ps,pdf}' file in the doc subdirectory of the PVFS2 distribution. This configuration relies on shared storage and commodity ``heartbeat'' software to provide means for failover.
Software redundancy offers a less expensive solution to redundancy, but usually at a non-trivial cost to performance. We are studying how to implement software redundancy with lower overhead, but at this time we provide no software-only server failover solution.
So for redundancy, I need shared storage.
Also, it does not support locking and is generally aimed more at high-speed use than at redundancy and data security.
[ Parent | Reply to this comment ]
might have something useful there.
It's straightforward enough for me.
[ Parent | Reply to this comment ]
The downside is that, even with only one node accessing the RAID, you must take care of resynchronization when the timestamps of the RAID components differ, and take preventive measures for when the mirrored machines come up at different times or speeds, so that the RAID comes up with all components functional and does not need to resynchronize.
And if more than one node accesses the RAID, you must additionally have one machine elected as master and export the contents via NFS to the other nodes.
GNBD, fencing and CLVM sort of work. Sometimes. It's a work in progress.
[ Parent | Reply to this comment ]
http://www.coda.cs.cmu.edu/
From the web site:
1. disconnected operation for mobile computing
2. is freely available under a liberal license
3. high performance through client side persistent caching
4. server replication
5. security model for authentication, encryption and access control
6. continued operation during partial network failures in server network
7. network bandwidth adaptation
8. good scalability
9. well defined semantics of sharing, even in the presence of network failures
Looks like it can do mirroring (replication) and sustain network failures.
Would like to read the author's opinion on this fs.
-- Maxim.
[ Parent | Reply to this comment ]
I have not been following Coda recently, since it really looked like a dead project, but it seems that there has been some development lately.
When I tried it, it crashed the filesystem an hour after installation, so I dropped it. I would be interested in any current experience with Coda.
[ Parent | Reply to this comment ]
I'm searching for the same thing (but haven't found a suitable filesystem either) and am thus quite interested if you find something.
[ Parent | Reply to this comment ]
If you do want a truly shared fs (not NFS), I'm not sure why gfs/gnbd/drbd is insufficient. You would need 2 servers, and would have to do some funniness with heartbeat or GFS clustering to bring drbd up rw on the secondary node before GFS/GNBD fails over, but I don't think that would be too challenging. NBD might not be cluster-aware, but GNBD is for sure.
Another option, mentioned already, is to test drbd 0.8, which can run as primary/primary, and do multipath gnbd/gfs. There's finally a release candidate (pre-release?), and it's been in development for a while, so I hope that's a good sign. drbd 0.8 just got put into Debian testing.
[ Parent | Reply to this comment ]
Export the /dev/drbd0 device to the clients using iSCSI Enterprise Target.
Use heartbeat to control the iSCSI target and DRBD.
On the clients:
Access the iSCSI device you exported, using the sourceforge.net iSCSI initiator.
Format this iSCSI device with GFS or OCFS2 (once, from one client).
Now you can use this device on as many clients as you want.
I call it a cheap SAN.
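For the target side, the export step above could be sketched as a small script that prints a stanza for iSCSI Enterprise Target's ietd.conf. The IQN is a made-up example; only /dev/drbd0 comes from the steps above. Options such as Type=fileio should be checked against the ietd.conf manual for your version.

```shell
#!/bin/sh
# Sketch only: print an ietd.conf stanza exporting a DRBD device.
# The IQN below is a hypothetical example name.
IQN="${IQN:-iqn.2006-02.org.example:storage.drbd0}"
DEVICE="${DEVICE:-/dev/drbd0}"

# Emit one target with a single LUN backed by the given block device.
ietd_stanza() {
    printf 'Target %s\n' "$1"
    printf '    Lun 0 Path=%s,Type=fileio\n' "$2"
}

ietd_stanza "$IQN" "$DEVICE"
```

The output belongs in ietd.conf on both DRBD nodes, since heartbeat has to move the DRBD primary role and the target service together to whichever node is active.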
I am running, and will keep running, some tests with distributed storage.
The page is
http://guialivre.governoeletronico.gov.br/seminario/index.php/DistributedMassStorage
Best regards
Leonardo Rodrigues de Mello
[ Parent | Reply to this comment ]
The system is (as the name indicates) based on peer-to-peer technology - there's no one master server. Well, there actually is, more or less, but it's elected and if it goes down, a new one is elected immediately.
There are still a few things I'm not all that happy with, like the full synchronisation it does whenever a client connects to the grid, but all in all I like the concept. It's payware, though - they'll probably try to charge about $2k per active node and $200 per passive node (active = local storage, passive = forwards requests to another node), but you should be able to shave some off that.
Still hoping for something free to come along :-)
[ Parent | Reply to this comment ]
http://openmosix.sourceforge.net/
[ Parent | Reply to this comment ]
so it's like "easy NFS"
[ Parent | Reply to this comment ]
As it does not guarantee the order of writes, you cannot reliably use it with a cluster filesystem.
So this project is only useful for transporting one physical device, i.e. connecting it over Ethernet to another device (a Linux box). That physical device can be a disk, a logical volume, or any other block device on another Linux machine, or a remote disk array supporting ATA over Ethernet (though iSCSI is more common in this case).
[ Parent | Reply to this comment ]
The latest version is currently in pre-release but looks very promising. (the current release does not have the replication support)
Only problem I have had with it so far is that it requires a slightly more recent version of fuse than etch provides, but this is available in lenny (testing)
[ Parent | Reply to this comment ]
the mirror is:
deb http://guialivre.governoeletronico.gov.br/guiaonline/downloads/pacotes-cluster etch glusterfs
deb-src http://guialivre.governoeletronico.gov.br/guiaonline/downloads/pacotes-cluster etch glusterfs
Leonardo Rodrigues de Mello
[ Parent | Reply to this comment ]
deb http://lmello.virt-br.org/debian/ ./
deb-src http://lmello.virt-br.org/debian/ ./
Leonardo Rodrigues de Mello
[ Parent | Reply to this comment ]
AoE and GFS, since GFS is now mature and included in stable.
I will let you know the results.
[ Parent | Reply to this comment ]
http://www.linuxquestions.org/questions/linux-software-2/posix-compliant-distributed-file-systems-without-using-shared-storage-774633/
Any comments?
[ Parent | Reply to this comment ]