Distributed filesystem for Debian clusters?

Posted by jooray on Tue 7 Feb 2006 at 09:58

I'm looking for a way to make a Debian web cluster completely fault-tolerant. There is heartbeat, a MySQL cluster, and I have two firewalls in a redundant setup. The only thing missing is a filesystem that is completely distributed (i.e. symmetric).

I have tried several solutions to this problem. I looked at GFS and OCFS2, which require shared storage. I was told that running a cluster filesystem over NBD, which is not cluster-aware, is a serious risk -- GFS and OCFS2 should be used with real shared storage. It is possible to build cheap shared storage using FireWire; see this article. But what I'm really looking for is a stable, working, distributed filesystem. That way I could have a backup machine in a different datacenter connected over some form of fast link. I looked at AFS, but it seems to be read-only on 2.6 (or at least that's what the kernel option says; maybe the OpenAFS tarball works on 2.6).

Does anyone have a good tip for a distributed filesystem and/or a HOWTO for AFS on Debian?

Posted by Anonymous (82.235.xx.xx) on Tue 7 Feb 2006 at 10:06
NBD used bidirectionally (each server backing up the other) works fine with heartbeat. But such a solution is only possible when the two servers hold different data.
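For anyone wanting to try this, a minimal sketch of one direction of that setup (hostnames, the port and the partitions below are made-up examples, not from the poster's configuration):

# on server B: export a spare partition over NBD on port 2000
nbd-server 2000 /dev/sdb1

# on server A: attach the remote device and mirror it with a local partition
nbd-client serverB 2000 /dev/nbd0
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/nbd0

# heartbeat then only needs to fail over the filesystem on /dev/md0 and the service IP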

[ Parent | Reply to this comment ]

Posted by Anonymous (195.85.xx.xx) on Tue 7 Feb 2006 at 10:57
I have seen that GFS from Red Hat has entered unstable. I've worked with it in the past and it works like a charm. It's a bit of a pain to configure, but workable.

Other GFS-like filesystems:
www.lustre.org

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Tue 7 Feb 2006 at 11:29
[ Send Message ]
GFS needs shared storage, which I don't have (and can't easily get), since the two servers should be physically several kilometres apart.

Lustre also needs shared storage for its metadata servers.

[ Parent | Reply to this comment ]

Posted by Anonymous (195.212.xx.xx) on Tue 7 Feb 2006 at 10:44
Perhaps the Distributed Replicated Block Device (DRBD) is what you're looking for:
http://www.drbd.org/

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Tue 7 Feb 2006 at 11:31
[ Send Message ]
I use DRBD currently, but it allows only one machine to access the data, and I want both machines to share the same data. At the moment I have DRBD, with the data mounted via NFS on the machine that is secondary. But I consider this a very inefficient and clumsy solution, since all writes have to go over the network connection twice. There's some scripting needed to do it correctly, and lots of things that can go wrong. That's why I consider it a workaround, not a final solution.

[ Parent | Reply to this comment ]

Posted by Anonymous (12.155.xx.xx) on Wed 8 Feb 2006 at 04:02
So waiting for drbd's code base to stabilize enough for active/active applications (drbd 0.8 + gfs) is not an option?

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Wed 8 Feb 2006 at 14:12
[ Send Message ]
That's an option, once it starts working. Or do they already support active/active setups?

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Wed 8 Feb 2006 at 23:30
[ Send Message ]
Ah, it's on the roadmap, marked as 99% done.

0.8 is not released yet.

Well... I must say I'm looking forward to it. This would be a perfect solution for me.

[ Parent | Reply to this comment ]

Posted by Anonymous (12.155.xx.xx) on Thu 9 Feb 2006 at 01:43
I would definitely get some feedback on the code's current state from one or more of their mailing lists. A CVS version might be stable enough for you to test it out.

Someone brought this up to me: is there a particular reason why the syncing can't be done within the web app (to the other web app) or via scp?

[ Parent | Reply to this comment ]

Posted by Anonymous (82.146.xx.xx) on Sun 12 Feb 2006 at 16:48
Probably not adding much to the discussion, but:

DRBD + GFS seems like a waste of processing power to me, from a design point of view.

DRBD just makes sure your block-level storage is redundant.

GFS just assumes you have a common (shared) block device, and allows multiple systems to operate a filesystem on it concurrently (correct me if I'm wrong...).

Now, both are really great feats in themselves (and I don't want to diminish the accomplishments of those projects in any way), but wouldn't it be better if we just had filesystem-level replication?

Just as an NFS client doesn't need to care about the filesystem on the remote side, an FS-replication client (and possibly server) wouldn't need to know about the specifics of the filesystem involved.

[ Parent | Reply to this comment ]

Posted by Anonymous (217.189.xx.xx) on Tue 7 Feb 2006 at 11:10
OpenAFS 1.4 works well with kernel 2.6, at least on the client side.

There are source packages in sid and etch:


http://packages.qa.debian.org/o/openafs.html

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Tue 7 Feb 2006 at 11:33
[ Send Message ]
And is there some problem with running it as the server side, or have you just not tried? I'll probably stick with OpenAFS. Does anyone have experience with OpenAFS on 2.6 and Debian?

[ Parent | Reply to this comment ]

Posted by Anonymous (217.189.xx.xx) on Tue 7 Feb 2006 at 12:57
The client side of OpenAFS works very stably for me on a 2.6 kernel on sarge (though I am not running the standard sarge kernel but 2.6.14 currently).

The server side I don't know about; I have not tried it. We are running our AFS servers on AIX here.

[ Parent | Reply to this comment ]

Posted by Anonymous (80.171.xx.xx) on Fri 10 Feb 2006 at 02:13
The server part works stably, at least as of 1.4.0. We use it here on a four-node cluster. But be aware that you can have only one rw volume at a time, though many ro volumes, so you have to take care of switching a volume to rw when the node holding the rw volume dies. You also need a minimal Kerberos setup. But if you want a network filesystem, it is IMHO the way to go. As you mentioned web serving, I assume most of the time you only need read access to the data, right? If so, AFS really has everything you need; but if you need byte-range locking, or you have many apps writing to the filesystem all the time... well... then this might not be what you want.
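To make the rw/ro volume handling concrete, the usual pattern looks roughly like this (the server names, the /vicepa partition and the volume name are invented for the example, not taken from our setup):

# create the read-write volume on one fileserver
vos create fs1.example.com /vicepa web.data

# add read-only replica sites (including one next to the rw copy)
vos addsite fs1.example.com /vicepa web.data
vos addsite fs2.example.com /vicepa web.data

# push the current rw contents out to the ro replicas
vos release web.data

# mount the volume in the AFS namespace
fs mkmount /afs/example.com/web web.data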

[ Parent | Reply to this comment ]

Posted by Anonymous (12.108.xx.xx) on Fri 10 Mar 2006 at 22:29
I use AFS at home.

The version supplied with Debian stable is fairly crappy. If you want to run OpenAFS, backport the version from testing/unstable if you can.

I use wajig. Set up a deb-src line for testing in your sources.list, then:

wajig build-depend openafs-fileserver
wajig build openafs-fileserver

Then install the resulting deb files.
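(That last step is just a dpkg -i of whatever .debs the build produced; the file names below are only an example.)

dpkg -i openafs-client_*.deb openafs-fileserver_*.deb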

That way you can run the OpenAFS version from testing without having to pull most of testing back into stable.

OpenAFS is very nice and is even secure enough to use over the internet. The downside is that the non-Linux client support (i.e. OS X or Windows) isn't going to be very good. Its permission setup is not POSIX, although it maps the owner's read/write/execute permissions well. Also, it's not good for large files.

But for lots of small files it is fast and reliable over slow links, thanks to its caching and file-change monitoring features. It's ideal for many different situations, such as sharing lots of files with lots of people on a campus-wide network with complex topology and lots of wireless links.

[ Parent | Reply to this comment ]

Posted by Anonymous (195.14.xx.xx) on Tue 7 Feb 2006 at 11:33

[ Parent | Reply to this comment ]

Posted by Anonymous (145.242.xx.xx) on Tue 7 Feb 2006 at 11:53
Out of scope, but how did you configure the firewalls in a redundant way? Netfilter doesn't allow synchronisation of the connection table, right? Or perhaps you are only running stateless?

[ Parent | Reply to this comment ]

Posted by Anonymous (85.216.xx.xx) on Tue 7 Feb 2006 at 12:08
It's stateful. It's a webserver, so losing a few connections is not very important; people can refresh.

[ Parent | Reply to this comment ]

Posted by Anonymous (80.126.xx.xx) on Tue 7 Feb 2006 at 18:45
Or get OpenBSD.
CARP with pfsync is beautiful!

[ Parent | Reply to this comment ]

Posted by Anonymous (195.7.xx.xx) on Tue 7 Feb 2006 at 12:32
Look into DRBD!

Not really distributed, but kind of a RAID 1 over the network.
It has worked great in production here for over 5 years!

http://www.drbd.org/
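For a rough idea of what a two-node DRBD resource definition looks like (the hostnames, disks and addresses here are placeholders, not a tested config):

resource r0 {
  protocol C;                     # synchronous replication
  on alpha {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on beta {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}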

Happy clustering !
Nicolas@Bouthors.org

[ Parent | Reply to this comment ]

Posted by Aike (217.76.xx.xx) on Tue 7 Feb 2006 at 20:23
[ Send Message ]
Hmm, no proper solution posted yet. I have the same problem, but haven't found a solution either.

Ian Blenke has posted some short notes on most of the filesystems mentioned. You can find them here: http://ian.blenke.com/projects/cornfs/braindump/braindump.html

[ Parent | Reply to this comment ]

Posted by Anonymous (69.15.xx.xx) on Tue 7 Feb 2006 at 22:35
Grab PVFS2 (Parallel Virtual File System). It works like a champ. Absolutely beautiful.

From the PVFS2 Website: (http://www.pvfs.org/pvfs2/)
========================
Parallel I/O continues to be a topic of active development. Recent years have seen the creation of many new options. Even with these new choices, certain factors remain constant. Parallel applications need a fast I/O subsystem. Clusters need a parallel file system that can scale as the number of nodes increases to the thousands and tens of thousands. PVFS2 is our answer.

Many institutions and researchers have used the first generation of the Parallel Virtual File System (PVFS) with much success. The time has come for the second generation. PVFS2 continues to serve as both a platform for parallel I/O research as well as a production file system for the cluster computing community.

The PVFS project is conducted jointly between The Parallel Architecture Research Laboratory at Clemson University and The Mathematics and Computer Science Division at Argonne National Laboratory. Funding for the PVFS project includes the following sources:

* NASA Goddard Space Flight Center Code 930
* the National Computational Science Alliance through the National Science Foundation's Partnership for Advanced Computational Infrastructure
* the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy

PVFS2 provides the following features:

* Ease of installation
* User-controlled striping of files across nodes, and a well defined interface for defining new distribution schemes
* Multiple interfaces, including a MPI-IO interface via ROMIO
* Utilizes commodity network and storage hardware
* Very modular design
* Native support for several popular networking technologies like Myrinet, Infiniband, and TCP/IP
* Support for user-defined access patterns
* Support for heterogeneous clusters
* Distributed metadata

PVFS2 provides a Linux kernel module which supports the UNIX I/O interface and allows existing UNIX I/O programs to use PVFS2 files without recompiling. The familiar UNIX file tools (ls, cp, rm, etc.) will all operate on PVFS2 files and directories as well.

PVFS2 is easy to install. There is no need for extra kernel patches. The Documentation page describes how to set up a simple installation. Scripts and test applications are included to help with configuration, testing for correct operation, and performance evaluation.

PVFS2 stripes file data across multiple disks in different nodes in a cluster. By spreading out file data in this manner, larger files can be created, potential bandwidth is increased, and network bottlenecks are minimized.

Multiple user interfaces are available. This includes:

* MPI-IO support through ROMIO (see the ROMIO homepage for details)
* Traditional Linux file system access through the use of a linux kernel driver module

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Tue 7 Feb 2006 at 22:42
[ Send Message ]
These things from the FAQ bother me:

Can PVFS2 tolerate server failures?

Yes. We currently have a recipe describing the hardware and software needed to set up PVFS2 in a high availability cluster. Our method is outlined in the `pvfs2-ha.{ps,pdf}' file in the doc subdirectory of the PVFS2 distribution. This configuration relies on shared storage and commodity ``heartbeat'' software to provide means for failover.

Software redundancy offers a less expensive solution to redundancy, but usually at a non-trivial cost to performance. We are studying how to implement software redundancy with lower overhead, but at this time we provide no software-only server failover solution.

So for redundancy, I need shared storage.

Also, it does not support locking, and it is generally aimed more at high-speed use than at redundancy and data safety.

[ Parent | Reply to this comment ]

Posted by sh4rk (212.39.xx.xx) on Wed 8 Feb 2006 at 07:56
[ Send Message ]
You can use NFS or simple rsync. But my suggestion is DRBD: simple and fast.
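If plain rsync is good enough for your data, a cron entry on the primary can keep the standby roughly in sync (the hostname and paths are just examples):

# push the document root to the standby every five minutes
*/5 * * * * rsync -az --delete /var/www/ standby.example.com:/var/www/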

[ Parent | Reply to this comment ]

Posted by fher98 (216.230.xx.xx) on Wed 8 Feb 2006 at 18:26
[ Send Message | View Weblogs ]
Hi, I would like to see a Debian cluster HOWTO, if anyone is up to the task. Thanks!

[ Parent | Reply to this comment ]

Posted by Anonymous (12.105.xx.xx) on Thu 16 Feb 2006 at 18:31
I would be more than happy to write one for you. Contact me offline (charles@thewybles.com) and we can work out the contract details.

[ Parent | Reply to this comment ]

Posted by Anonymous (134.161.xx.xx) on Wed 27 Feb 2008 at 14:28
There's a Debian Cluster how to at http://debianclusters.org/

[ Parent | Reply to this comment ]

Posted by Anonymous (82.225.xx.xx) on Wed 15 Feb 2006 at 09:38
Has anyone tried csync2?
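For those who haven't seen it: csync2 synchronises files across a group of hosts based on a small config file, roughly like this (the hostnames, key path and directories are only examples):

group web
{
        host node1 node2;
        key /etc/csync2.key;
        include /var/www;
        exclude *.log;
}

Running csync2 -x on a host then pushes its changes to the other members of the group.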

[ Parent | Reply to this comment ]

Posted by Anonymous (194.149.xx.xx) on Mon 20 Feb 2006 at 20:40
The NBD approach simply works, until you need more than one node using the resulting mirrored RAID or its components.

The downside is that even with only one node accessing it, you must take care of resynchronization when the RAID components' timestamps differ, and take a preventive approach when the mirrored machines come up at different times/speeds, so that the RAID comes up with all components functional and without needing to resynchronize.

And for more than one node accessing the RAID you must, in addition to the above, have one machine elected as master and export the contents via NFS to the other nodes.

gnbd, fencing and clvm sort of work. Sometimes. It's a work in progress.
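For reference, GNBD exports/imports look roughly like this (the device and export names are hypothetical):

# on the storage node: export a device under a chosen name
gnbd_export -d /dev/sdb1 -e web_storage

# on each client node: import all exports from that server
gnbd_import -i storage-node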

[ Parent | Reply to this comment ]

Posted by hq4ever (87.69.xx.xx) on Sat 4 Mar 2006 at 19:11
[ Send Message ]
What about Coda?
http://www.coda.cs.cmu.edu/

From the web site :
   1. disconnected operation for mobile computing
   2. is freely available under a liberal license
   3. high performance through client side persistent caching
   4. server replication
   5. security model for authentication, encryption and access control
   6. continued operation during partial network failures in server network
   7. network bandwidth adaptation
   8. good scalability
   9. well defined semantics of sharing, even in the presence of network failures
Looks like it can do mirroring (replication) and sustain network failures.
I would like to read the author's opinion on this fs.

-- Maxim.

[ Parent | Reply to this comment ]

Posted by Anonymous (85.216.xx.xx) on Mon 6 Mar 2006 at 01:03
It is moving very slowly: no stable Linux 2.6 support, only a few commits per year. The development is more or less stalled.

I have not been following Coda recently, since it really looked like a dead project, but it seems there has been some development lately.

When I did try it, it crashed the filesystem an hour after installation, so I dropped it. I would be interested in any current experience with Coda.

[ Parent | Reply to this comment ]

Posted by Anonymous (85.235.xx.xx) on Fri 10 Mar 2006 at 19:01
Did you ever find an appropriate filesystem that was completely distributed (did not require shared storage) and completely fault-tolerant?

I am searching for the same thing (but haven't found a suitable filesystem either) and would thus be quite interested if you find something.

[ Parent | Reply to this comment ]

Posted by Anonymous (69.36.xx.xx) on Fri 7 Apr 2006 at 23:28
I don't see why heartbeat/drbd/nfs is insufficient. See http://linux-ha.org/DRBD/NFS

If you do want a truly shared fs (not NFS), I'm not sure why gfs/gnbd/drbd is insufficient. You would have to have two servers, and do some funny business with heartbeat or GFS clustering to bring DRBD up read-write on the secondary node before GFS/GNBD fails over, but I don't think that would be too challenging. NBD might not be cluster-aware, but GNBD is for sure.

Another option, mentioned already, is to test drbd 0.8, which can run primary/primary, and do multipath gnbd/gfs. There's a release candidate (pre-release?) finally, and it's been in development for a while, so I hope that's a good sign. drbd 0.8 just got put into Debian testing.
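That heartbeat/drbd/nfs recipe boils down to a heartbeat 1.x haresources line of roughly this shape (the node name, IP, DRBD resource and mount point are placeholders, not copied from the HOWTO):

# /etc/ha.d/haresources on both nodes
node1 IPaddr::10.0.0.100 drbddisk::r0 Filesystem::/dev/drbd0::/data::ext3 nfs-kernel-server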

[ Parent | Reply to this comment ]

Posted by Anonymous (200.198.xx.xx) on Wed 3 May 2006 at 14:05
Set up DRBD between two machines, A and B.
Export the /dev/drbd0 device to the clients using the iSCSI Enterprise Target.
Use heartbeat to control the iSCSI target and DRBD.

On the clients:
access the iSCSI device you exported, using the SourceForge iSCSI initiator, and
format this iSCSI device with GFS or OCFS2.

Now you can use this device from as many clients as you want.

I call it a cheap SAN.
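For the export step, an iSCSI Enterprise Target config is a few lines of this shape (the target name and options are invented, not taken from Leonardo's setup):

# /etc/ietd.conf on whichever node is currently the DRBD primary
Target iqn.2006-05.example.org:drbd0
        Lun 0 Path=/dev/drbd0,Type=fileio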

I am doing, and will do, some tests with distributed storage.

The page is:
http://guialivre.governoeletronico.gov.br/seminario/index.php/DistributedMassStorage

Best Regards

Leonardo Rodrigues de Mello

[ Parent | Reply to this comment ]

Posted by vegiVamp (194.78.xx.xx) on Thu 4 May 2006 at 12:44
[ Send Message ]
I've used PeerFS in the past, but beware of the 2.6 kernel version - it seems the guy who developed it left the code in a sorry state when he left the company, which is why we eventually stopped using it. My original tests were on the 2.4 kernel, and that seemed quite OK.

The system is (as the name indicates) based on peer-to-peer technology - there's no one master server. Well, there actually is, more or less, but it's elected and if it goes down, a new one is elected immediately.

There are still a few things I'm not all that happy with, like the full synchronisation whenever a client connects to the grid, but all in all I like the concept. It's payware, though -- they'll probably try to charge about $2k per active node and $200 per passive node (active = local storage, passive = forwards requests to another node), but you should be able to shave some off that.

Still hoping for something free to come along :-)

[ Parent | Reply to this comment ]

Posted by mfidelman (63.139.xx.xx) on Tue 11 Jul 2006 at 20:08
[ Send Message ]
Jooray, what did you end up doing? (I've just started wrestling with exactly the question you started out with.)

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Tue 18 Jul 2006 at 11:47
[ Send Message ]
I ended up with DRBD while waiting for a better solution.

[ Parent | Reply to this comment ]

Posted by Anonymous (194.29.xx.xx) on Tue 18 Jul 2006 at 10:30
Just a quick note: I don't know too much about this, and at present it only works on its own 2.4-based kernel, but isn't this exactly what openMosix does?

http://openmosix.sourceforge.net/

[ Parent | Reply to this comment ]

Posted by jooray (85.216.xx.xx) on Tue 18 Jul 2006 at 11:57
[ Send Message ]
openMosix does not provide a shared filesystem AFAIK; it only allows access to other nodes' filesystems.

So it's like an "easy NFS".

[ Parent | Reply to this comment ]

Posted by Anonymous (82.239.xx.xx) on Fri 4 Aug 2006 at 21:01
Has anyone looked at the ATA-over-Ethernet (AoE) project?
http://en.wikipedia.org/wiki/ATA-over-Ethernet

[ Parent | Reply to this comment ]

Posted by jooray (89.173.xx.xx) on Fri 4 Aug 2006 at 22:07
[ Send Message ]
Yes, but it does not solve any part of this particular problem.

As it does not guarantee the order of writes, you cannot reliably use it with a cluster filesystem.

So this project is only useful for exporting a single device over Ethernet to another machine (a Linux box). That device can be a disk, logical volume or any block device on another Linux host, or a remote disk array supporting ATA over Ethernet (though iSCSI is more common in that case).
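For completeness, exporting a device with the stock AoE userland tools looks roughly like this (the interface and device names are examples):

# on the exporting host: serve /dev/sdb as AoE shelf 0, slot 1 on eth0
vblade 0 1 eth0 /dev/sdb

# on the client: load the aoe driver; the device appears under /dev/etherd/
modprobe aoe
mount /dev/etherd/e0.1 /mnt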

[ Parent | Reply to this comment ]

Posted by Anonymous (210.48.xx.xx) on Fri 29 Jun 2007 at 05:03
Have you tried GlusterFS (http://www.gluster.org)? It works on the idea of bricks, and can replicate data over multiple servers.
The latest version is currently in pre-release but looks very promising (the current stable release does not have the replication support yet).

The only problem I have had with it so far is that it requires a slightly more recent version of FUSE than etch provides, but this is available in lenny (testing).
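At that stage GlusterFS was configured through volume spec files; once those are in place, a client mount is roughly this (the spec path and mount point are just examples):

# start the client; the .vol file describes the remote bricks and any replication
glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs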

[ Parent | Reply to this comment ]

Posted by Anonymous (200.198.xx.xx) on Thu 2 Aug 2007 at 20:25
I have packaged GlusterFS for Debian etch i386.
The mirror is:
deb http://guialivre.governoeletronico.gov.br/guiaonline/downloads/pacotes-cluster etch glusterfs
deb-src http://guialivre.governoeletronico.gov.br/guiaonline/downloads/pacotes-cluster etch glusterfs


Leonardo Rodrigues de Mello

[ Parent | Reply to this comment ]

Posted by Anonymous (200.198.xx.xx) on Tue 18 Sep 2007 at 20:13
The most recent packages are at a different location:
deb http://lmello.virt-br.org/debian/ ./

deb-src http://lmello.virt-br.org/debian/ ./
Leonardo Rodrigues de Mello

[ Parent | Reply to this comment ]

Posted by vasdia (194.219.xx.xx) on Mon 30 Jul 2007 at 18:29
[ Send Message ]
I want to build a cluster the same way, and I will use AoE and GFS, since GFS is now mature and included in stable.
I will let you know the results.

[ Parent | Reply to this comment ]

Posted by nyali (125.209.xx.xx) on Mon 8 Feb 2010 at 07:54
[ Send Message ]
Guys, I want to do the same. Is there any guide or HOWTO on this topic?

[ Parent | Reply to this comment ]
