Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch

Posted by hansivers on Fri 21 Apr 2006 at 11:10

There are a lot of Linux filesystems comparisons available but most of them are anecdotal, based on artificial tasks or completed under older kernels. This benchmark essay is based on 11 real-world tasks appropriate for a file server with older generation hardware (Pentium II/III, EIDE hard-drive).

Since its initial publication, this article has generated
a lot of questions, comments and suggestions to improve it.
Consequently, I'm currently working hard on a new batch of tests
to answer as many questions as possible (within the original scope
of the article).

Results will be available in about two weeks (May 8, 2006)

Many thanks for your interest and keep in touch with
Debian-Administration.org!

Hans

Why another benchmark test?

I found two quantitative and reproducible benchmark studies using the 2.6.x kernel (see References). Benoit (2003) implemented 12 tests using large files (1+ GB) on a Pentium II 500 server with 512MB RAM. This test was quite informative, but its results are beginning to age (kernel 2.6.0) and mostly apply to settings which manipulate large files exclusively (e.g., multimedia, scientific, databases).

Piszcz (2006) implemented 21 tasks simulating a variety of file operations on a PIII-500 with 768MB RAM and a 400GB EIDE-133 hard disk. To date, this testing appears to be the most comprehensive work on the 2.6 kernel. However, since many tasks were "artificial" (e.g., copying and removing 10 000 empty directories, touching 10 000 files, splitting files recursively), it may be difficult to transfer some conclusions to real-world settings.

Thus, the objective of the present benchmark is to complement some of Piszcz's (2006) conclusions by focusing exclusively on real-world operations found in small-business file servers (see Tasks description).

Test settings

    Hardware
  • Processor : Intel Celeron 533
  • RAM : 512MB RAM PC100
  • Motherboard : ASUS P2B
  • Hard drive : WD Caviar SE 160GB (EIDE 100, 7200 RPM, 8MB Cache)
  • Controller : ATA/133 PCI (Silicon Image)
    OS
  • Debian Etch (kernel 2.6.15), distribution upgraded on April 18, 2006
  • All optional daemons killed (cron, ssh, samba, etc.)
    Filesystems
  • Ext3 (e2fsprogs 1.38)
  • ReiserFS (reiserfsprogs 1.3.6.19)
  • JFS (jfsutils 1.1.8)
  • XFS (xfsprogs 2.7.14)

Description of selected tasks

    Operations on a large file (ISO image, 700MB)
  • Copy ISO from a second disk to the test disk
  • Recopy ISO in another location on the test disk
  • Remove both copies of ISO
    Operations on a file tree (7500 files, 900 directories, 1.9GB)
  • Copy file tree from a second disk to the test disk
  • Recopy file tree in another location on the test disk
  • Remove both copies of file tree
    Operations within the file tree
  • List recursively all contents of the file tree and save the listing on the test disk
  • Find files matching a specific wildcard within the file tree
    Operations on the file system
  • Creation of the filesystem (mkfs) (all FS were created with default values)
  • Mount filesystem
  • Umount filesystem

The sequence of 11 tasks (from creating the FS to unmounting it) was run as a Bash script and repeated three times (the average is reported). Each sequence takes about 7 minutes. Time to complete each task (in seconds), the percentage of CPU dedicated to the task, and the number of major/minor page faults during the task were measured with the GNU time utility (version 1.7).
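The harness itself was not published with the article; a minimal Bash sketch of how such a sequence can be driven with GNU time (the paths and log name are placeholder assumptions, not the author's actual script):

```shell
#!/bin/bash
# Minimal sketch of a benchmark harness: each task runs under GNU time,
# which appends elapsed seconds (%e), CPU share (%P) and major/minor
# page faults (%F/%R) to a log. All paths are placeholder assumptions.
LOG=results.log

measure() {                    # measure <label> <command> [args...]
    label=$1; shift
    /usr/bin/time -a -o "$LOG" \
        -f "$label: %e s, %P CPU, %F maj / %R min faults" "$@"
}

# Example tasks from the sequence (placeholder paths):
# measure "copy ISO"   cp /mnt/src/image.iso /mnt/test/image.iso
# measure "recopy ISO" cp /mnt/test/image.iso /mnt/test/copy.iso
# measure "remove ISO" rm /mnt/test/image.iso /mnt/test/copy.iso
```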

RESULTS

Partition capacity

Initial (after filesystem creation) and residual (after removal of all files) partition capacity was computed as the ratio of available blocks to the total number of blocks on the partition. Ext3 has the worst initial capacity (92.77%), while the other filesystems preserve almost the full partition capacity (ReiserFS = 99.83%, JFS = 99.82%, XFS = 99.95%). Interestingly, the residual capacity of Ext3 and ReiserFS was identical to the initial one, while JFS and XFS lost about 0.02% of their partition capacity, suggesting that these filesystems can grow dynamically but do not completely return to their initial state (and size) after file removal.
Conclusion : To use the maximum of your partition capacity, choose ReiserFS, JFS or XFS.
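As an aside, the capacity ratio used here is easy to reproduce from df output; a small sketch (the mount point in the comment is a placeholder assumption):

```shell
# Capacity = available blocks / total blocks, as a percentage.
capacity_pct() {               # capacity_pct <available> <total>
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f", 100 * a / t }'
}

# Feed it the block counts df reports for the test partition, e.g.:
#   set -- $(df -P /mnt/test | awk 'NR == 2 { print $4, $2 }')
#   capacity_pct "$1" "$2"
```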

File system creation, mounting and unmounting

The creation of the FS on the 20GB test partition took 14.7 secs for Ext3, compared to 2 secs or less for the other filesystems (ReiserFS = 2.2, JFS = 1.3, XFS = 0.7). However, ReiserFS took 5 to 15 times longer to mount the FS (2.3 secs) than the others (Ext3 = 0.2, JFS = 0.2, XFS = 0.5), and also 2 times longer to unmount it (0.4 sec). All filesystems took comparable amounts of CPU to create the FS (from 59% for ReiserFS to 74% for JFS) and to mount it (between 6 and 9%). However, Ext3 and XFS took about 2 times more CPU to unmount (37% and 45%) than ReiserFS and JFS (14% and 27%).
Conclusion : For quick FS creation and mounting/unmounting, choose JFS or XFS.

Operations on a large file (ISO image, 700MB)

The initial copy of the large file took longer on Ext3 (38.2 secs) and ReiserFS (41.8) than on JFS and XFS (35.1 and 34.8). The recopy on the same disk favoured XFS (33.1 secs) over the other filesystems (Ext3 = 37.3, JFS = 39.4, ReiserFS = 43.9). ISO removal was about 100 times faster on JFS and XFS (0.02 sec for both), compared to 1.5 sec for ReiserFS and 2.5 sec for Ext3! All filesystems took comparable amounts of CPU to copy (between 46 and 51%) and to recopy the ISO (between 38 and 50%). ReiserFS used 49% of CPU to remove the ISO, whereas the other filesystems used about 10%. There was a clear trend of JFS using less CPU than any other filesystem (about 5 to 10% less). The number of minor page faults was quite similar across filesystems (ranging from 600 for XFS to 661 for ReiserFS).
Conclusion : For quick operations on large files, choose JFS or XFS. If you need to minimize CPU usage, prefer JFS.

Operations on a file tree (7500 files, 900 directories, 1.9GB)

The initial copy of the tree was quicker for Ext3 (158.3 secs) and XFS (166.1) than for ReiserFS and JFS (172.1 and 180.1). Similar results were observed during the recopy on the same disk, which favoured Ext3 (120 secs) over the other filesystems (XFS = 135.2, ReiserFS = 136.9 and JFS = 151). However, the tree removal took about 2 times longer on Ext3 (22 secs) than on ReiserFS (8.2 secs), XFS (10.5 secs) and JFS (12.5 secs)! All filesystems took comparable amounts of CPU to copy (between 27 and 36%) and to recopy the file tree (from 29% for JFS to 45% for ReiserFS). Surprisingly, ReiserFS and XFS used significantly more CPU to remove the file tree (86% and 65%), whereas the other filesystems (Ext3 and JFS) used about 15%. Again, there was a clear trend of JFS using less CPU than any other filesystem. The number of minor page faults was significantly higher for ReiserFS (total = 5843) than for the other filesystems (1400 to 1490). This difference appears to come from a 5 to 20 times higher rate of page faults for ReiserFS during the recopy and removal of the file tree.
Conclusion : For quick operations on a large file tree, choose Ext3 or XFS. Benchmarks from other authors have supported the use of ReiserFS for operations on large numbers of small files. However, the present results on a tree comprising thousands of files of various sizes (10KB to 5MB) suggest that Ext3 or XFS may be more appropriate for real-world file server operations. Even if JFS minimizes CPU usage, it should be noted that this FS comes with significantly higher latency for large file tree operations.

Directory listing and file search into the previous file tree

The complete (recursive) directory listing of the tree was quicker for ReiserFS (1.4 secs) and XFS (1.8) when compared to Ext3 and JFS (2.5 and 3.1). Similar results were observed during the file search, where ReiserFS (0.8 sec) and XFS (2.8) yielded quicker results compared to Ext3 (4.6 secs) and JFS (5 secs). Ext3 and JFS took comparable amounts of CPU for directory listing (35%) and file search (6%). XFS took more CPU for directory listing (70%) but comparable amount for file search (10%). ReiserFS appears to be the most CPU-intensive FS, with 71% for directory listing and 36% for file search. Again, the number of minor page faults was 3 times higher for ReiserFS (total = 1991) when compared to other FS (704 to 712).
Conclusion : Results suggest that, for these tasks, filesystems can be grouped as (a) quick but more CPU-intensive (ReiserFS and XFS) or (b) slower but less CPU-intensive (Ext3 and JFS). XFS appears to be a good compromise, with relatively quick results, moderate usage of CPU and an acceptable rate of page faults.

OVERALL CONCLUSION

These results replicate previous observations from Piszcz (2006) about the reduced disk capacity of Ext3, the longer mount time of ReiserFS and the longer FS creation time of Ext3. Moreover, both earlier reviews, like this report, observed that JFS has the lowest CPU usage. Finally, this report appears to be the first to show the high page-fault activity of ReiserFS on most usual file operations.

While recognizing the relative merits of each filesystem, only one filesystem can be installed per partition/disk. Based on all the testing done for this benchmark essay, XFS appears to be the most appropriate filesystem to install on a file server for home or small-business needs:

  • It uses the maximum capacity of your server hard disk(s)
  • It is the quickest FS to create, mount and unmount
  • It is the quickest FS for operations on large files (>500MB)
  • This FS gets a good second place for operations on a large number of small to moderate-size files and directories
  • It constitutes a good CPU vs time compromise for large directory listing or file search
  • It is not the least CPU-demanding FS, but its use of system resources is quite acceptable for older generation hardware

While Piszcz (2006) did not explicitly recommend XFS, he concluded: "Personally, I still choose XFS for filesystem performance and scalability". I can only support this conclusion.

References

Benoit, M. (2003). Linux File System Benchmarks.

Piszcz, J. (2006). Benchmarking Filesystems Part II. Linux Gazette, 122 (January 2006).


Posted by Anonymous (213.164.xx.xx) on Fri 21 Apr 2006 at 11:52
Nice article, but one important benchmark is missing: compatibility.

I use ext3 because most tools are written for it, and everything Linux supports it.

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 14:09
Good point! I was thinking about this one when I selected the tasks. In the end, since it would have required selecting a "representative" sample of applications to interact with each FS, I chose to stay within the scope of previously published tests (focusing on performance and CPU usage).

[ Parent | Reply to this comment ]

Posted by Anonymous (219.88.xx.xx) on Tue 17 Apr 2007 at 07:20
How do your results compare with these from:

http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm

The first column names the filesystem tested. The second column records the total time (in seconds) it took to run the filesystem benchmarking software bonnie++ (Version 1.93c). The third column records the total number of megabytes needed to store 655 megabytes of raw data.

SMALLER is better.

FILESYSTEM       TIME     DISK USAGE
REISER4 (lzo)    1,938    278
REISER4 (gzip)   2,295    213
REISER4          3,462    692
EXT2             4,092    816
JFS              4,225    806
EXT4             4,408    816
EXT3             4,421    816
XFS              4,625    799
REISER3          6,178    793
FAT32            12,342   988
NTFS-3g          >10,414  772


Each test was performed 5 times and the average value recorded.

The Reiser4 filesystem clearly had the best test results.

The FAT32 filesystem had the worst test results.

The bonnie++ tests were performed with the following parameters:

bonnie++ -n128:128k:0

[ Parent | Reply to this comment ]

Posted by wouter (87.244.xx.xx) on Thu 27 Apr 2006 at 02:40
Compatibility was something that kept me away from trying other filesystems, but I can honestly say that it hasn't been an issue in the last years or so when I've been using different filesystems than ext2/ext3 on Linux.

These days, tools and integration have been setup quite nicely by the distribution maintainers, and a fsck-interface is used by all of the ones I've tried anyway.

Ofcourse, if you want to play with more advanced options, dump filesystems or do anything out of the ordinary your findings may be different -- I would not know since I rarely, if ever, use these features.

On the other hand, IMO it rarely matters which filesystem you use anyway. I would challenge anybody to guess the filesystem running on a light to medium loaded desktop or server. Differences (in speed of mature common journaling filesystems) really are rather small for general use, and it's not until you have very specific tasks to be done or very i/o loaded systems to be managed that the choice of journaling filesystem becomes a real issue.

I believe that XFS has upcoming (or perhaps already has some) support in FreeBSD, though. FreeBSD has ext2 (read) support too. And IIRC, there was a (non-microsoft, obviously) driver adding ext2 (hence, ext3) support to Windows -- if you run that OS and want to allow it to touch your nice Linux system, that is. I suppose that falls under compatibility too.

[ Parent | Reply to this comment ]

Posted by Anonymous (200.229.xx.xx) on Tue 8 May 2007 at 13:11
Other things missing:

- What did you use to compare the times? What tools? Which commands?
- The fact that Ext3 reserves 5% of the disk for the root user was not mentioned (see http://ubuntuforums.org/showthread.php?t=215177).
- Some points in the article just say that the others did better, which is too generic (e.g.: "The ReiserFS used 49% of CPU to remove ISO, when other FS used about 10%"). Which other filesystems took about 10% of CPU?

The rest of the article I found good. My tip is to try a better server, like a Core Duo or an AMD X2, or even a Xeon or Opteron, since we're talking about business servers (but that's good enough if you don't have one to test). Maybe a good thing would be to test on SATA or SCSI drives...


Note: I use ext3 mostly because of compatibility. I like new stuff, but I don't like to play with FSs.

[ Parent | Reply to this comment ]

Posted by Anonymous (80.78.xx.xx) on Fri 21 Apr 2006 at 13:00
Very good article!

I use ReiserFS because it's the only filesystem that supports shrinking a filesystem - veeeery useful with LVM! With JFS, XFS, ... it's backup - resize - restore, and with 1 TB (that's not that much anymore nowadays) that's a joke. Measuring the performance of this operation is missing from the article.

[ Parent | Reply to this comment ]

Posted by Anonymous (212.254.xx.xx) on Fri 21 Apr 2006 at 13:57
FYI, ext3 supports resizing too! And even online resizing (while mounted).
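A sketch of what growing an ext3 volume looks like, offline and online; the volume name /dev/vg0/data and mount point /srv/data are hypothetical, and online growing needs kernel and e2fsprogs support (on 2.6-era systems the ext2online tool, later folded into resize2fs):

```shell
# Grow the underlying LVM volume first (hypothetical names):
lvextend -L +10G /dev/vg0/data

# Offline resize: unmount, force a check, resize to fill the volume.
umount /srv/data
e2fsck -f /dev/vg0/data
resize2fs /dev/vg0/data
mount /srv/data

# Online grow: the filesystem stays mounted while it is enlarged
# (ext2online on older systems, resize2fs on newer ones).
resize2fs /dev/vg0/data
```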

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 13:58
Excellent suggestion!
The tasks selected here were not intended to be comprehensive, since I was focusing more on adding some new hard data to Piszcz's excellent benchmark series. I will surely add your suggestion in a follow-up testing.. Thanks!

[ Parent | Reply to this comment ]

Posted by Anonymous (86.82.xx.xx) on Wed 26 Apr 2006 at 08:35
XFS supports this too.

[ Parent | Reply to this comment ]

Posted by Anonymous (82.232.xx.xx) on Fri 28 Apr 2006 at 16:40
growing ok, but shrinking xfs ? this must be pretty recent then :) have you any reference to it ?

[ Parent | Reply to this comment ]

Posted by isilmendil (140.78.xx.xx) on Fri 21 Apr 2006 at 13:20
Regarding the capacity of the filesystems compared, shouldn't it be said that ext3 is the only fs which reserves blocks for use by root alone?

You stated that all filesystems were created using default values. So ext3 loses approx. 5% of its capacity because of its reserved-blocks feature. For a fileserver you would create your data-partition without reserved blocks, as it is not needed there.
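For reference, the reserved fraction is set at mkfs time and can be changed afterwards; a sketch with a placeholder device name:

```shell
# Create ext3 with no reserved blocks (-m 0), or adjust it later:
mkfs.ext3 -m 0 /dev/hdb1          # 0% reserved at creation time
tune2fs -m 0 /dev/hdb1            # same change on an existing filesystem
tune2fs -l /dev/hdb1 | grep -i 'reserved block count'
```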

Or was this taken into account already?

Cheers,
Johannes

[ Parent | Reply to this comment ]

Posted by Anonymous (83.227.xx.xx) on Fri 21 Apr 2006 at 15:02
Those are used for defragmentation, so I keep them there as well. Only 1% or .5% though, but I'll make sure to keep those blocks around.

/Nafallo

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 14:45
«So ext3 loses approx. 5% of its capacity because of its reserved-blocks feature. For a fileserver you would create your data-partition without reserved blocks,»

Why would you want to make a fileserver to perform badly? The 5% default reserve is in part to have some slack for use by 'root', but mostly because when a filesystem is nearly full its performance becomes very bad as fragmentation increases nonlinearly.

My experience is that 'ext3' requires at least 10% free space to perform decently over time (the 5% default is the absolute minimum that should be done) and 20-30% free space reserve is a lot better.

The problem is indeed intrinsic: as the filesystem nears 100% usage the chances of finding contiguous or nearby free blocks when writing or extending a file becomes a lot smaller. This applies to both extent based filesystems like JFS and XFS and to block based filesystems like 'ext3' (even if usually extent based filesystems do a bit better).

[ Parent | Reply to this comment ]

Posted by simonw (84.45.xx.xx) on Mon 24 Apr 2006 at 01:21
[ View Weblogs ]
"The problem is indeed intrinsic"

Intrinsic perhaps, but highlights what is missing, such as how filesystems address the issue. Part of the cost of writing the ReiserFS is all that messing with binary trees, the report doesn't attempt to understand, or address, why Reiser thinks it is worth doing all this (admittedly a lot of people said it was just too expensive to be worth trying).

Very little in the way of "aging" the filesystem.

Nothing on consistency, or limits, or features.

I think the test choice is not great; filesystem creation, mounting, and manipulating ISOs are generally not time critical tasks (well, if you don't use NTFS they aren't!). I'd happily use a file system that takes 100 times longer to create than any of those tested if it conveyed other discernible benefits, and it takes my CD writer over 5 minutes to write a full ISO, so a few seconds here or there matter not at all to me when manipulating ISOs.

It would be interesting to see how representative people think 7500 files totalling 1.9GB is. My understanding was that mean file size, whilst on the way up, hadn't got to that sort of size yet. Certainly I have 15GB (df -h) of files on this box, and just under 0.5 million files (find / -type f), clocking in at just under 32KB average file size, rather than the 253KB used in the test. Perhaps someone is hoarding ISOs?
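The average quoted above can be reproduced with a one-liner; a sketch using GNU find (the directory in the comment is just an example):

```shell
# Mean file size under a directory: GNU find prints each size in bytes,
# awk accumulates and averages.
mean_size() {
    find "$1" -type f -printf '%s\n' 2>/dev/null |
        awk '{ s += $1; n++ }
             END { if (n) printf "%d files, %.1f KB mean\n", n, s / n / 1024 }'
}

# mean_size /home
```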

It is well known there is a cost-benefit trade off in ReiserFS that means it performs relatively less well on larger files than XFS. So something like mean file size is likely to explain the difference between the results here, and of other authors, on the performance of ReiserFS.

I'd also prefer to see more edge cases examined -- what happens when 100,000 emails are delivered, and then sorted, and a selection deleted, to a single maildir? For most people it is probably performing in a sensible manner under these edge cases, that matters far more than if it takes 130 or 135s to copy 7000 files.

I'd like to see blocking I/O cases, and similar examined, email delivery being a classic, and fairly easy to test.

Hardest of all to test: I want to know that the filesystem journalling "does what it says on the tin", and have someone pull the plug in the middle of these transactions, and see that nothing is corrupted, that everything is in a consistent state, how long the recovery to that consistent state takes, and that it is automatic.

Why no "bonnie++" statistics -- I'd have thought as a test it was trivial to run, and might show up something, even if the I/O types measured are a tad artificial.

Then again I appreciate these tests take a lot of time and effort to do.

Nothing here will shake my own choice of ReiserFS for most general purpose filesystems, it has a good performance on real world benchmarks, and is the most mature of the journalling file systems presented. Although I'm looking at XFS for a project, but not because of its performance, but because of other features it brings to the table.

[ Parent | Reply to this comment ]

Posted by Anonymous (209.76.xx.xx) on Mon 24 Apr 2006 at 10:31
No file system can protect against corruption on power outages. They can only protect against (most) corruption in the event of system crashes. If the drive was completely quiescent when the plug was pulled, you're OK; otherwise, all bets are off.

There's a myth abroad that modern drives detect falling supply voltage and do stuff to protect the media. (You might read of using the motor as a generator to provide power to "park" the heads.) If it was ever true, it was only in high-end drives, and not any more. Even making sure data really is physically on the disk surface before reporting the write complete, something we like to think of as the basic promise, is no longer widely supported, though the drives will claim otherwise. In practice you need up to a few seconds of power after a crash to drain sectors from the cache to the disk surface.

The mantra is, if reliability matters, replicate and use battery ("UPS") power.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.166.xx.xx) on Wed 26 Apr 2006 at 09:25
> There's a myth abroad that modern drives detect falling supply voltage and do stuff to protect the media. (You might read of using the motor as a generator to provide power to "park" the heads.) If it was ever true, it was only in high-end drives, and not any more.

This is from WD; the feature is called "auto park". AFAIK every "modern" drive has it. http://support.wdc.com/dlg/

[ Parent | Reply to this comment ]

Posted by Anonymous (24.203.xx.xx) on Sat 29 Apr 2006 at 21:47
That is only partially true, though. Of course, if your hard drive does not allow you to make sure that data is really on the physical platters, you are screwed, but let's assume that it works. Just use ReiserFS with journaling. I would really like to see these tests with data=ordered/data=journal for ReiserFS to compare it.

I personally lost half of my homedir to XFS because it does only metadata journaling. Sure, all my files were there because of the journaling, but they were filled with binary zeroes. Oh and yes, the XFS FAQ says that this may happen (well at least it did when I was searching for an explanation of that behaviour back then).

I really don't want to use any "journaling" filesystem that does not journal my data, because then it's worthless. I don't need a filesystem that has a clean tree and can be mounted if I lose half of my data in it. ReiserFS was pretty good at garbling file contents ("WTF, is my mp3 playlist now that video I just downloaded?") too, before they had data journaling.

[ Parent | Reply to this comment ]

Posted by Anonymous (91.77.xx.xx) on Sun 28 Oct 2007 at 02:58
> You might read of using the motor as a generator to provide power to "park" the heads
Just to let you know, Hitachi does use this technology to move the heads into the proper place if power loss is detected. However, I have no idea if the drive is able to flush its buffers in this scenario. Perhaps it is, since Hitachi intentionally(?) restricts the write buffer size, while the read buffer is allowed to be the whole size of the installed RAM IC.

[ Parent | Reply to this comment ]

Posted by Anonymous (20.133.xx.xx) on Thu 27 Apr 2006 at 11:55
I use reiserFS for the simple reason that when my box loses power I can get it up and running again in a few seconds or minutes.

I'm fairly new to linux and may well have been doing something wrong, but my box regularly had the power pulled on it (my area used to be prone to power dips and I couldn't afford a UPS).

When I was using ext3 it sometimes took me a few hours to get the system to even boot, because it refused while there were errors. Once I switched to ReiserFS all those problems went away.

In a single user environment, which is realistic enough for me, I'd choose the FS that allows me to recover quickest over an FS that might take a second or two longer to do something. (Can't remember the last time I copied 7000 files, if ever.)

[ Parent | Reply to this comment ]

Posted by Anonymous (82.71.xx.xx) on Fri 27 Oct 2006 at 12:57
If you use ReiserFS without a UPS on the server then you are crazy. There are masses of warnings on the Internet about the problems with ReiserFS' tree information being in unrestricted locations on the disk, so rebuilds with disk corruption are extremely risky.

[ Parent | Reply to this comment ]

Posted by drdebian (194.208.xx.xx) on Thu 30 Aug 2007 at 15:32
> Once I switched to ReiserFS all those problems went away.

I once used ReiserFS on a fileserver, and after a broken PSU took the server down, it was the data that went away, because the tree ReiserFS uses was corrupted.
The recovery utilities supplied for ReiserFS tried their best to recover the filesystem, but in the end I had to restore from the backup of the night before.
After this incident, I decided to give Ext3 and XFS a try. While XFS seems to be the more modern filesystem, it lacks the ability to shrink, which is a real problem in times of LVM2 and software RAID.
Ext3 hasn't let me down ever since. Its disaster recovery tools are the most mature of all the filesystems tested (in addition to being included in every live/recovery CD on the planet) and its online resizing capabilities go together well with virtualized (as well as real) infrastructure.
On a side note, I also found Ext3 to be the most tolerant filesystem for use on "flaky" hardware. If, for example, a part of the binary tree used in ReiserFS happens to land on a defective sector of the hard disk, then it's bye-bye time for your entire FS. Ext3, on the other hand, will cope quite well and allows for full recovery using one of its redundantly stored superblocks.

[ Parent | Reply to this comment ]

Posted by Anonymous (62.253.xx.xx) on Fri 21 Apr 2006 at 13:23
File system creation time is not an important consideration as you do it only once. Likewise, mount and dismount speed aren't nearly as important as things like 'Operations on a file tree'.

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 14:02
Agree with you!

I added these data to replicate other previous observations about filesystem creation and mounting time.

[ Parent | Reply to this comment ]

Posted by Anonymous (213.64.xx.xx) on Sat 22 Apr 2006 at 09:56
I agree that filesystem creation time is not an important metric, but I do have to say that mount speed is. I have a home server with an LVM volume of approx 1.2 TB using reiserfs 3.6 and it takes a long (comparatively) time to mount it - perhaps closer to a minute. This can be a factor on servers which require high uptime but have to be rebooted (for whatever reason) every now and again. The purpose of this article was to test in a small business environment, so "five nines" is probably not a factor, but it should be mentioned nonetheless.

[ Parent | Reply to this comment ]

Posted by Anonymous (86.139.xx.xx) on Thu 27 Apr 2006 at 09:23
The answer is quite simple, really - think about it: leave it mounted, unless you switch your server off every time you turn your back on it. But then of course you may as well run the rubbish from M$ Corp ..

Pete .

[ Parent | Reply to this comment ]

Posted by Anonymous (205.153.xx.xx) on Tue 12 Jun 2007 at 21:23
Your "simple answer" is to create toxins that are already being overproduced to the detriment of everyone on the planet.

Every time you choose to leave your computer turned on, you are choosing to disregard a finite chance that pollution will make this planet uninhabitable for future generations. That's not such a "simple answer" any more is it?

Wasting electricity despoils the commons. Turn the computer off when you aren't using it.

[ Parent | Reply to this comment ]

Posted by Anonymous (193.94.xx.xx) on Sun 23 Apr 2006 at 18:22
Especially in home usage, mount time is very important. ReiserFS sucked so badly I had to buy a spare hard disk to do a data copy & reformat (as ext3fs) for my main data partition. Mounting the one (250GB) partition took half of my machine's boot time using ReiserFS, which is unacceptable for a desktop. With ext3, I can't notice the time it takes to mount the same partition.

[ Parent | Reply to this comment ]

Posted by Anonymous (81.187.xx.xx) on Wed 26 Apr 2006 at 00:16
Hmm - why would you reboot it except after power-outages, kernel upgrades, and (unfortunately, they do happen occasionally), system crashes?
It's so much more useful to leave the machine on all the time!

Perhaps you are shutting it down to reduce noise, in which case, I commend quietpc.com to you - I can hear birdsong over mine, with the windows shut.

[ Parent | Reply to this comment ]

Posted by Anonymous (195.135.xx.xx) on Wed 26 Apr 2006 at 13:17
Because it draws power, which costs money and causes useless pollution?
Why keep it running if you don't need it or can easily wake it up if you need it?

[ Parent | Reply to this comment ]

Posted by Anonymous (70.171.xx.xx) on Wed 26 Apr 2006 at 15:58
Why keep it running if you don't need it or can easily wake it up if you need it?

Slow the CPU (especially if it's a P4 or old Athlon) and hibernate the monitor. That itself will save a good amount of energy.

Or, better, host your home server on a passively-cooled Via system. Then you can shut-down your PC any time you want, while the server stays up, sipping watts.

[ Parent | Reply to this comment ]

Posted by Anonymous (63.116.xx.xx) on Wed 26 Apr 2006 at 21:09
Slow the CPU (especially if it's a P4 or old Athlon) and hibernate the monitor. That itself will save a good amount of energy.

Actually, SWSUSP2 works well enough now that I hibernate all of the machines at home (except the server) when they're not going to be used, such as overnight. Hibernation on my HP laptop is almost infallible -- I've got an "uptime" of over a month, hibernating once or twice (and sometimes more) every day -- and, while it takes a good minute to go into hibernation, it comes out of it within 35 seconds... that's from hitting the power switch to having my KDE desktop back up.

I just wish S3 worked as well, and that the kernel folks would adopt SWSUSP2, which works so much better than the default hibernate mechanism.

--- SER

[ Parent | Reply to this comment ]

Posted by Anonymous (68.115.xx.xx) on Mon 25 Feb 2008 at 04:51
Liar

[ Parent | Reply to this comment ]

Posted by Anonymous (87.244.xx.xx) on Wed 10 May 2006 at 03:47
Mount/unmount speed is important for desktop systems.

File system creation time can't really be an issue to anyone, I guess.

[ Parent | Reply to this comment ]

Posted by Anonymous (88.100.xx.xx) on Fri 21 Apr 2006 at 13:55
Some graphs available???

[ Parent | Reply to this comment ]

Posted by Anonymous (212.254.xx.xx) on Fri 21 Apr 2006 at 14:00
What is really missing for real use is a concurrent file modification benchmark. On a real server (and that's what this bench is for) you have tens of processes reading/writing *at* *the* *same* *time* on the disk! What about:
- Create 4 threads that do:
  - Operations on a file tree
  - Operations within the file tree
  - Remove the tree
- 3 times in a row.
That would be an interesting bench IMO.
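A plain-Bash sketch of such a concurrent run (source, destination and thread count are placeholder assumptions; the whole batch would be timed from outside):

```shell
# N processes copy the same tree into separate destinations at once;
# wait returns only when every background cp has finished.
concurrent_copy() {            # concurrent_copy <src> <dest> <n>
    i=1
    while [ "$i" -le "$3" ]; do
        cp -r "$1" "$2/copy$i" &
        i=$((i + 1))
    done
    wait
}

# time concurrent_copy /mnt/src/tree /mnt/test 4
```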

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 14:32
Yes, you're right! Bryant et al. (2002) published extensive data about concurrent performance with the 2.4.17 kernel (the Filemark benchmark: 1, 8, 64, 128 threads). So it's clear that it would be a great addition to the initial benchmarks. You suggested 4 threads. Since they tested up to 128 threads, what do you feel would be a "representative" test for a file server? Something like 1, 8 and 24 threads?

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 14:19
That would not be much of a filesystem test as such, it would be mostly an IO subsystem test, except in ideal conditions.

The problem is that unless the IO subsystem supports mailboxing and tagged queueing, which are only available in practice on SCSI and SCSI/ATA host adapters (3ware and up), multiple concurrent accesses have awful performance.

However there are already some filesystem speed tests for suitable IO subsystems, alluded to by some other comment, for example:

http://ext2.SourceForge.net/2005-ols/ols-presentation-html/img38.html

BTW, in this graph the JFS performance comes out badly; I think an older version of JFS was used that had excessive locking, as 'ext3' did for most of its life.

There are more links to filesystem speed tests here:

http://WWW.sabi.co.UK/Notes/anno05-3rd.html#050911

[ Parent | Reply to this comment ]

Posted by mcphail (62.6.xx.xx) on Fri 21 Apr 2006 at 14:31
I have to say, I'm not very interested in how long it takes for a filesystem to mount/umount. Nor am I interested in "once only" filesystem creation. I'd rather know that my filesystem will be stable for as long as my harddisk keeps spinning. Any benchmarks for this?

NMP

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 14:37
As I said before, FS creation time and mounting/umounting were reported only to replicate previous observations. About data integrity, everybody would agree with you. However, it's the kind of data I've never found before... How would you test it? One approach could be to do a bunch of operations over and over, and measure the time until the first data corruption. Anybody, feel free to suggest! Thanks.

[ Parent | Reply to this comment ]

Posted by Anonymous (66.179.xx.xx) on Fri 21 Apr 2006 at 16:03
How about tests like:

* During a file tree copy, pull the plug on the machine (could be simulated by running the test under vmware and killing the virtual machine) - then check how well the fs recovers the data: which files got corrupted (if any), and how long it takes to fix (replaying journals, etc).

* Call your initial tree t0. Make a new tree of approximately the same size called t1. For a concurrency test:
cp -a t0 t2 & # one tree
cp -a t1 t3 & # other tree
cp -a t0 t4 & # merge the trees
cp -a t1 t4 &
Hrm, should probably test mixed concurrency (deletes too!) so:
cp -a t0 t5 && rm -rf t5 &



I'm sure there's more to be added, maybe this will give you some ideas.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.92.xx.xx) on Sat 22 Apr 2006 at 17:35
Coincidentally, today I also came across some tests on the impact of write caching that relate to data integrity: http://sr5tech.com/write_back_cache_experiments.htm (Last updated October 27, 2003).

In these experiments the test variable was disk configuration rather than file system. A similar test across different file systems might produce a worthwhile indication of comparative reliability.

I suspect write caching would need to be disabled in the disk system to prevent corruptions of the kind being investigated in the link above from affecting the results. This would have an impact on absolute performance, but relative measurements could still be made.
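For illustration, one hedged way to take the drive's write-back cache out of the picture on an (E)IDE disk might look like this; the device name is an example only and these commands need root:

```shell
# Hypothetical sketch: disable the drive's write-back cache before a
# corruption/reliability run, so cached-but-unwritten data cannot mask
# (or cause) the corruption being measured. Device name is an example.
hdparm -W0 /dev/hda   # turn the IDE drive's write-back cache off
hdparm -W  /dev/hda   # report the current write-cache setting
# ... run the power-failure / corruption tests here ...
hdparm -W1 /dev/hda   # re-enable caching afterwards
```

As the comment notes, this costs absolute throughput, but relative comparisons between filesystems remain valid.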

[ Parent | Reply to this comment ]

Posted by Anonymous (70.171.xx.xx) on Wed 26 Apr 2006 at 16:03
How would you test it? One approach could be to do a bunch of operations over and over, and test the time before the first data corruption? Anybody, feel free to suggest!

Yank the plug while multiple processes are updating the disks. See what happens.

Repeat 8 or 10 times.

Yes, it's manual and time-consuming.

[ Parent | Reply to this comment ]

Posted by Anonymous (193.219.xx.xx) on Fri 21 Apr 2006 at 14:40
I'm personally using XFS on all my servers and desktop systems, mostly because I trust it - I've had dozens of power failures/unexpected reboots and I've yet to be disappointed by how XFS handles such unclean unmounts.

With ReiserFS 3 on the other hand, I've had two such events, and both times it managed to somehow completely destroy multiple files which were not even open at the time of the incident.
(Yes, this is anecdotal evidence, but I'm not using it anymore because of those incidents)

[ Parent | Reply to this comment ]

Posted by Anonymous (213.224.xx.xx) on Fri 21 Apr 2006 at 20:49
Same thing here... I have had at least 4 incidents where ReiserFS stopped functioning properly, ending in data loss, reinstalling the server, etc. I don't know if I would trust XFS, since most distributions support Ext3 or ReiserFS as the default filesystem (I go for Ext3).

It just isn't fun to see the filesystem break and read a publicity message about being able to ask questions for $25... (ReiserFS)

[ Parent | Reply to this comment ]

Posted by Anonymous (24.203.xx.xx) on Sat 29 Apr 2006 at 21:56
Don't trust your XFS too much. I lost half a homedir to XFS because of this: http://oss.sgi.com/projects/xfs/faq.html#nulls

Get a recent ReiserFS and mount it with data=journal!

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 15:14
I've yet to see a benchmark methodology that tests filesystem reliability, not only performance or CPU usage. It's a bit surprising, since reliability is surely one of the most important factors when an admin has to select a FS. But I feel that we are left most of the time with anecdotal evidence and our own (bad) experiences.

An interesting (and simple) test would be to simulate power failures during copy/delete file operations (a large ISO file and a file tree) and see how each FS handles each situation. But I'm aware that this is only a small part of real data integrity testing.

If anybody has seen hard data about FS reliability, feel free to post a link here. I would be very interested to investigate this, in order to produce more comprehensive and real-world benchmarks.

[ Parent | Reply to this comment ]

Posted by Anonymous (85.76.xx.xx) on Thu 27 Apr 2006 at 07:10
I would also like to see some reliability tests. I've been using all the Linux file systems, and nowadays use only ext2 for /boot and xfs for the rest. I also use lvm extensively on all servers under my administration.

However, I have found one annoying feature of xfs: whenever there is a power failure, all open text files end up filled with "^@^@^@^@^@^@^@^@...". You can easily replicate this by opening /etc/fstab in emacs and then unplugging the power cord. Why /etc/fstab... well, then you know why I find this feature REALLY annoying. So this power-failure test would be the first on my test list for real-world servers.

Anyway, I enjoyed reading your article. And being a professional researcher myself, I know that there is always room for improvement. Looking forward to reading your new comparison.

[ Parent | Reply to this comment ]

Posted by Anonymous (130.156.xx.xx) on Fri 21 Apr 2006 at 16:11
Does GRUB support /boot on XFS yet?

[ Parent | Reply to this comment ]

Posted by Anonymous (159.53.xx.xx) on Fri 21 Apr 2006 at 22:52
Why would you want to boot off an XFS partition anyway? It's my understanding that everything in the boot partition is read into memory anyway, so file system speed really isn't that important. Also, since most boot recovery tools only work with ext2 (and therefore ext3), you'll be much better off using one of those. Then you can use XFS on the other partitions.

[ Parent | Reply to this comment ]

Posted by Anonymous (80.219.xx.xx) on Fri 21 Apr 2006 at 23:46
Where have you been hiding these last years?? GRUB has been able to boot from an XFS drive for a very long time:
mail / # mount|grep -i xfs
/dev/hda3 on / type xfs (rw,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-usr on /usr type xfs (rw,nodev,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-home on /home type xfs (rw,nosuid,nodev,noatime,usrquota,grpquota,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-opt on /opt type xfs (rw,nodev,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-var on /var type xfs (rw,nodev,noatime,usrquota,grpquota,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-tmp on /tmp type xfs (rw,noexec,nosuid,nodev,noatime,usrquota,grpquota,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/hda1 on /boot type xfs (rw,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
mail / # ls -lah /boot/grub/*xfs*
-rw-r--r-- 1 root root 11K Jul 1 2005 /boot/grub/xfs_stage1_5
mail / # grub --version
grub (GNU GRUB 0.96)
mail / #

[ Parent | Reply to this comment ]

Posted by Anonymous (72.88.xx.xx) on Thu 12 Oct 2006 at 04:57
"GRUB can since very long time boot off from a XFS drive" ... For you, yes. For me, not often. Even the new Debian installer will tell you to use LILO for booting off a XFS partition.

[ Parent | Reply to this comment ]

Posted by Anonymous (24.6.xx.xx) on Sat 22 Apr 2006 at 00:59
Yes, it does.

I have xfs on all my nfs servers. OS is SuSE 9.2/9.3/10.0.

[ Parent | Reply to this comment ]

Posted by Anonymous (192.121.xx.xx) on Mon 24 Apr 2006 at 08:27
As said, GRUB has supported XFS on /boot for a long time. The only thing that's not supported is if you install the bootloader to an XFS partition (as opposed to installing it to MBR)

- Peder

[ Parent | Reply to this comment ]

Posted by rmcgowan (143.127.xx.xx) on Fri 21 Apr 2006 at 17:15

You said "While recognizing the relative merits of each filesystem, an system administrator has no choice but to install only one filesystem...". I don't understand why you believe there is or should be this sort of restriction.

Can't the administrator decide to 'partition' usage onto different volumes, using different fs types, based on their performance for the usage?

For example, I might create a volume to hold users homes, expecting many small files while requiring maximum speed, and so choose to use XFS, while a volume to hold large files (video, audio, backups, still images, etc.), I might choose to use JFS instead.

Note that I'm not advocating or even suggesting that the above is in some way an optimal setup, it's just an 'off the top of my head' example. The question is "Why shouldn't I be able to do this sort of thing if I so choose?" Is there something I'm missing?

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Fri 21 Apr 2006 at 17:37
Sorry, this sentence was a bit imprecise... The idea was that, ultimately, for every partition a choice has to be made, since only one FS can be installed. This sentence was put there to underline the fact that, in some benchmark tests, the authors tend to conclude something like "every FS has its own merits", which leaves the reader with no real answer to the question: what is the best FS to install on my partition(s)? Thanks for your comment!

[ Parent | Reply to this comment ]

Posted by Anonymous (87.2.xx.xx) on Sat 22 Apr 2006 at 00:52
From Piszcz's results, JFS seems faster than XFS; indeed, JFS seems to be the fastest filesystem (EXT2 and EXT3 also seem faster than XFS). Look at the Total Test Time ( http://linuxgazette.net/122/misc/piszcz/group002/image037.png ). So... who is telling the truth?

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 01:58
A pretty awful article... In particular the method used is not clearly explained.

For example, did you unmount the relevant partition before every single operation? You don't say, but if you did not (and not many people know that this is essential) your results are largely meaningless.

For far more sensible, documented and informative tests look at mine here:

http://WWW.sabi.co.UK/Notes/anno05-3rd.html#050908

and in a few entries around that date. Some amusing updates here:

http://WWW.sabi.co.UK/Notes/anno05-3rd.html#050913
http://WWW.sabi.co.UK/Notes/anno06-2nd.html#060416

[ Parent | Reply to this comment ]

Posted by Anonymous (70.171.xx.xx) on Sat 22 Apr 2006 at 02:02
I assume you used ext3 defaults here, which is not really a fair comparison of ext3's potential. Justin Piszcz allowed me to use his script to run his tests on my machine. I compared the ext3 "tuned" modes, and found that ext3 with dir_index, and dir_index combined with data=writeback or data=journal, improved most tests - remarkably so in some cases involving directories. My conclusion is that ext3 with dir_index (and, depending on your usage, journal or writeback) wins out for normal desktop performance, across the board. Changing the commit=n interval to longer than the default 5 seconds also improves performance.
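For reference, a sketch of how those tuned ext3 modes might be applied; the device and mount point below are placeholders, not from the comment, and the filesystem must be unmounted for the first two steps:

```shell
# Enable dir_index on an existing, unmounted ext3 filesystem, rebuild
# the directory indexes, then mount with writeback journaling and a
# longer commit interval. Device/mount point are examples only.
tune2fs -O dir_index /dev/hda6
e2fsck -fD /dev/hda6          # -D optimizes (reindexes) directories
mount -t ext3 -o data=writeback,commit=30 /dev/hda6 /mnt/data
```

Note that data=writeback trades some crash consistency of file contents for speed, which matters for the reliability discussion elsewhere in this thread.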

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 14:29
«My conclusion is that ext3 with dir_index (and depending on your usage journal or writeback) wins out for normal desktop performance, across the board.»

I used to think much the same, but over time I found that I'd rather use JFS across the board (except for filesystems that need to be accessible from MS Windows, where I use 'ext2', as there is an excellent Windows driver for it).

The first reason is that 'ext3' performance is awesome when the filesystem has just been created and loaded, but degrades very badly over time while JFS degrades significantly but a lot less:

http://WWW.sabi.co.UK/Notes/anno06-2nd.html#060416

The second reason is that probably because of some happenstance 'dir_index' can slow down things pretty significantly:

http://WWW.sabi.co.UK/Notes/anno05-4th.html#051204

A rather less significant advantage of JFS is that, since it uses extents and dynamically allocated inodes, it usually uses a lot less space for metadata, often only 3-5% of the total filesystem space.

[ Parent | Reply to this comment ]

Posted by fsateler (201.214.xx.xx) on Sat 22 Apr 2006 at 02:03
[ View Weblogs ]
Good article. I'd like to note one thing though: you mentioned the initial and residual capacity of the filesystems, but not the capacity when the drive is in use. That is, if I have 1000 files of 1MB each, is the used space 1000MB? I think that is the most important figure, since your file server isn't useful when its drives are empty, but rather when they are being used (I don't care if I have wasted 5% of my drive when it is empty, but I do care when the drive is almost full).
--------
Felipe Sateler

[ Parent | Reply to this comment ]

Posted by Anonymous (68.124.xx.xx) on Sat 22 Apr 2006 at 02:27
..except that XFS does not yet support ACLs/SELinux....

[ Parent | Reply to this comment ]

Posted by Anonymous (62.0.xx.xx) on Sat 22 Apr 2006 at 08:29
Sure it does, just take a look at the kernel configuration:

grep -i acl config-2.6.15:

CONFIG_XFS_POSIX_ACL=y

[ Parent | Reply to this comment ]

Posted by Anonymous (87.2.xx.xx) on Sat 22 Apr 2006 at 09:29
Yeah... it supports them.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.30.xx.xx) on Sat 22 Apr 2006 at 14:14
IIRC XFS was actually the first FS to implement ACLs. The rest followed suit later that year.

[ Parent | Reply to this comment ]

Posted by Anonymous (212.2.xx.xx) on Tue 16 May 2006 at 13:28
It does, both. I have been using XFS with ACLs and SELinux (extended attributes are important here) for some time (CentOS).

[ Parent | Reply to this comment ]

Posted by Anonymous (24.98.xx.xx) on Sat 22 Apr 2006 at 06:16
EXT3 has severe limitations once you go beyond 1TB partition sizes, in the number of files/directories per directory, etc. None of that matters if you are using a relatively small system. We are using systems with partition sizes up to 6.4TB (the largest). For those it is impossible to use any file system other than XFS: formatting would take several days, if not more, with ext3, and mounting would take half a day with reiserfs; jfs just wasn't stable enough. Recovery is excellent (for XFS). The one correct criticism is: no SELinux. But on a system with that much activity and disk space I wouldn't use SELinux anyway - too much overhead; the system is busy enough on its own. So my summary would be: for a small system it doesn't matter; for superlarge systems you are limited to XFS (with JFS trailing - I've also heard that JFS support was removed from Fedora 5, but I am not sure).

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 14:17
Your comments on scalability are very appropriate.

The biggest problem however is not making or formatting a filesystem, it is how long it takes to 'fsck' it, and how much memory is necessary.

Times of over two months to 'fsck' a filesystem have been reported for 'ext3', and XFS sometimes requires more than 4GB of memory to run 'fsck' (it is possible to create and use an XFS filesystem on a system with a 32-bit CPU that can only be 'fsck'ed on a 64-bit CPU, and at least one such case has actually happened).

The basic problem is that while very large filesystems using JFS or XFS (or very recent 'ext3') perform well on RAID storage, because they take advantage of the parallel nature of the underlying storage system, 'fsck' is single threaded in every Linux file system design that I have seen. Bad news.

More details here:

http://www.sabi.co.uk/Notes/anno05-4th.html#051012
http://www.sabi.co.uk/Notes/anno05-4th.html#051009

I am very surprised that your experience is that «jfs just wasn't stable enough»; perhaps you may want to report it to the JFS mailing list, as the authors of JFS are very responsive to reports of instability, and usually find a fix pretty quickly.

As to FC5 support, all Red Hat systems only support 'ext3', at least officially and in the installer, but after installation you can use any of the filesystems included in the kernel. I typically install to a small temporary partition which is 'ext3' formatted, and then convert it to JFS by copying its contents over to the real ''root'' partition which is JFS formatted.

[ Parent | Reply to this comment ]

Posted by Anonymous (193.230.xx.xx) on Tue 2 May 2006 at 08:54
IIRC, for FC you can add 'reiserfs', 'xfs', 'jfs' as parameters to the prompt (right after the cd/dvd boots) and those filesystems are available to create.

[ Parent | Reply to this comment ]

Posted by Anonymous (202.1.xx.xx) on Sat 22 Apr 2006 at 07:33
It's a pity that this article doesn't consider 'shredding' times for large files...

[ Parent | Reply to this comment ]

Posted by Anonymous (130.127.xx.xx) on Sat 22 Apr 2006 at 18:10
That would be because everyone figured out years ago that "shred" doesn't work on journalled file systems.

[ Parent | Reply to this comment ]

Posted by Anonymous (208.54.xx.xx) on Sun 23 Apr 2006 at 20:10
Stupid newb question: Why can't journalled file systems be shredded? Thanks in advance.

Fig

[ Parent | Reply to this comment ]

Posted by Anonymous (69.128.xx.xx) on Mon 24 Apr 2006 at 05:32

The shred command works by writing random data, zeros, and ones over and over to the spots on the disk where the file you want to shred was located. The hope is that with enough writes, the data will actually be overwritten on the disk. (The head of the hard drive varies a small amount as it traces its path over the disk, so the data might not be completely erased.)

The problem is, journaling file systems write data to the journal before they write it to its final location on the disk. So even after shredding the file's blocks, an attacker might be able to recover data from wherever on the disk the journal is located, even if the data blocks themselves are unreadable.

The real issue is that shredding a file even on ext2 does not always work, because modern hard drives sometimes transparently remap bad sectors: the drive may have moved what the operating system thinks is the location it originally wrote mysecret.txt to. An attacker could still read data from the "bad" sector using the right tools.

Realistically, shred should never be relied on. Using dm-crypt to encrypt a full filesystem is a much better solution, and with the power of today's CPUs, the performance cost is a small enough trade-off for secrecy.

For more information about securely destroying data, you can read the paper TKS1 on this page, which is really interesting. (Scroll down to section 3 on page 4.)
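As a rough illustration of the dm-crypt route, a hedged sketch follows; the device, mapping name and mount point are invented, and luksFormat destroys any existing data on the partition:

```shell
# Set up an encrypted volume with dm-crypt/LUKS, put a filesystem on
# the mapped device, and use it like any other partition.
# Device/mapping/mount point are examples only.
cryptsetup luksFormat /dev/hda7            # WARNING: wipes the partition
cryptsetup luksOpen /dev/hda7 cryptdata    # prompts for the passphrase
mkfs.ext3 /dev/mapper/cryptdata
mount /dev/mapper/cryptdata /mnt/crypt
# ... use the filesystem; "shredding" becomes a non-issue, because
# without the key the on-disk data (journal included) is unreadable ...
umount /mnt/crypt
cryptsetup luksClose cryptdata
```

Since everything, including the journal, is encrypted beneath the filesystem, the journal-copy problem described above simply disappears.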

[ Parent | Reply to this comment ]

Posted by Anonymous (211.27.xx.xx) on Tue 25 Apr 2006 at 15:35
Shred does actually work on ext3 with its default settings.

From man shred:

"... Note that shred relies on a very important assumption: that the file system overwrites data in place. This is the traditional way to do things, but many modern file system designs do not satisfy this assumption.

[snip]...

In the case of ext3 file systems, the above disclaimer applies (and shred is thus of limited effectiveness) only in data=journal mode, which journals file data in addition to just metadata. In both the data=ordered (default) and data=writeback modes, shred works as usual."

[ Parent | Reply to this comment ]

Posted by ericbrasseur (62.235.xx.xx) on Sat 22 Apr 2006 at 10:30
I have used all these filesystems on machines that sometimes experience brutal hangs or power failures. Every filesystem is supposed to recover from such events, and they most often did. But Ext3 is the only one that always recovered correctly and never made me lose a file. What's more, with Ext3 I never have to perform the recovery manually. I agree with the conclusions of the article, and for a while I thought I'd adopt XFS, but I was forced back to Ext3 just because of the reliability.

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 13:57
«Every filesystem is supposed to recover from such events and they most often did.»

That's rather wide of the mark: most filesystems are supposed to recover A CONSISTENT STATE of the METADATA ONLY from such events.

'ext3' additionally can make an attempt at recovering the contents of files too, if ordered or data journaling is enabled.

However the proper way to ensure data (as opposed to metadata) recoverability is to ensure the application handles that, using atomic data transactions, because that's the only way, and even if 'ext3' often succeeds blindly, that is not the right way.

Large scale filesystems like JFS and XFS, designed for mission critical applications, don't make any attempt at data recovery, because indeed that should be handled by the applications themselves.

Many people who don't understand this then complain that these two filesystems cause loss of data...

[ Parent | Reply to this comment ]

Posted by Anonymous (24.203.xx.xx) on Sat 29 Apr 2006 at 22:12
Well, that obviously depends on your usage scenario. I think for a home/small-business server your assumption is just not right. If I start my music player, download some large video and fetch my mail, and I suddenly have a power outage, then all those files should be intact when I switch my computer back on. With XFS my playlist might be 0000..., my download might be 0000... and my mails are 0000... and no longer on the server. With ReiserFS my playlist will be part of the video, parts of my mails might be in the video, and the mails could have some other garbage in them. This all assumes that you don't use data=journal for reiser and don't use sync mode with a decent hard drive for XFS. But still: some time ago there was no data=journal, and maybe you just don't have a decent hard drive because your boss wants to spend less money. Now tell me, what is my music player going to do about it? :)

[ Parent | Reply to this comment ]

Posted by drdebian (194.208.xx.xx) on Thu 30 Aug 2007 at 16:38
100% ACK.

In all the cases you mentioned, your chances would indeed be best if you had your data on an Ext3 filesystem, since it's the only one not using some binary-tree structure to manage where your data is stored.

The problem with today's hardware is all the caching that's going on at various levels. The application can't really tell whether a certain file has actually been written to a block on the hard disk, because all of that is completely hidden away in some HAL.

I think that Sun's approach with its ZFS filesystem is suitable to tackle this challenge. It uses end-to-end checksums to detect file corruption from the hard disk right through to the application level. Too bad it isn't GPL'd, so we'll hardly see much of it in the Linux world.

[ Parent | Reply to this comment ]

Posted by Anonymous (217.132.xx.xx) on Sat 22 Apr 2006 at 11:26
Nice article. I would love to see the results of these tests using Reiser4, if and when available.

[ Parent | Reply to this comment ]

Posted by Anonymous (196.25.xx.xx) on Sat 22 Apr 2006 at 14:52
Me too

[ Parent | Reply to this comment ]

Posted by Anonymous (82.208.xx.xx) on Sun 23 Apr 2006 at 08:46
I am afraid that Reiser4 wouldn't particularly shine in this test due to the high CPU usage. A 533MHz Celeron would be a bottleneck for it.

[ Parent | Reply to this comment ]

Posted by Anonymous (217.81.xx.xx) on Fri 13 Oct 2006 at 20:17
Might be true, might not. The Reiser team wrote that they have lately made significant improvements to Reiser4's CPU usage.

[ Parent | Reply to this comment ]

Posted by Anonymous (151.52.xx.xx) on Sat 22 Apr 2006 at 12:47
Do you plan to add Reiser4 to the test as well? It would be nice to see how the improvements to ReiserFS work out...

BTW, your article is awesome.

Good Work :-)

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 13:28
This article says:

«The sequence of 11 tasks (from creation of FS to umounting FS) was run as a Bash script which was completed three times (the average is reported).»

If what is written is true, the tests were run without an 'umount'/'mount' before each of them, so these are mostly buffer-cache tests, not filesystem tests.

How can then the article be awesome?

Also, the reason why JFS etc. leave more free space after formatting than 'ext3' is obvious but not explained (dynamic inode allocation), and so is why some, like JFS, show a little more used space after all files are deleted (the table containing the dynamically allocated inodes can only grow, not shrink).

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Sun 23 Apr 2006 at 10:31
How can then the article be awesome?

It's the clueless leading the clueless.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.150.xx.xx) on Sun 23 Apr 2006 at 17:07
If you're calling someone clueless, you'd have done better to mail Hans privately and rtfm him, including at least terse pointers on what to read and what to improve.

Hans made an effort with this one, and it's a nice -improvable- overview.
And it looks like he might improve it in a followup.

Now what did you do? No data/facts/suggestions but only a personal
insult. Nice indeed, and how full of merit.

Hans, ignore that troll.

Better yet "inappropriate comments will be removed" was
written somewhere, wasn't it :)

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Tue 25 Apr 2006 at 08:44
At least my comment was actually about the topic. The most useless people on the internet are the know-nothing cop-wannabees who spend their time bitching about the improprieties of other posters. Oh well, at least it means less time for you to abuse small children.

The substantive point is that people should be much more cautious and skeptical about this report than many are being.

[ Parent | Reply to this comment ]

Posted by Anonymous (158.125.xx.xx) on Sat 22 Apr 2006 at 13:16
Very interesting article; thanks!

I would have liked to see Reiser4 on the list though (yes, I know it's not fully supported yet and is quite controversial, but it is the latest generation of Reiser tech).

I went from ReiserFS to ext3 for support reasons a number of years ago. At work my box was installed for me with JFS and seems fairly nippy.

[ Parent | Reply to this comment ]

Posted by Anonymous (62.142.xx.xx) on Sat 22 Apr 2006 at 14:17
You should document the test methodology more clearly:
- did you umount the fs between tests, or purge the kernel file cache by other means?
- did you use the same partition for each fs (difference between outer and inner sectors)?
- did you recreate the fs between tests? (Fragmentation - actually, testing a fragmented fs would reflect reality better, but it is very hard to reproduce/do equally for different fs's.)
- what were the mkfs and mount options (block size, root reserved space, reiser notail, extended attributes, etc.)?
- what were the test commands? Did you count the sync (you did do that, right?) after the command in the elapsed time?

and so on...

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Sun 23 Apr 2006 at 10:17
- what were the mkfs and mount options (block size, root reserved space, reiser notail, extended attributes etc etc)

Indeed, notail matters a great deal. Also, mounting all filesystems with noatime should strongly be considered; access times are virtually useless information that is quite expensive to maintain. These, as well as numerous other FS configuration parameters, are commonly used by experienced administrators, and the comparison is meaningless without taking them into account. The author of this comparison means well but is apparently quite lacking in requisite experience and expertise.
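To make the noatime/notail suggestion concrete, a hypothetical example follows; the mount points are placeholders, and notail applies to ReiserFS only:

```shell
# Remount existing filesystems without access-time updates; mount
# points are examples only and require root.
mount -o remount,noatime /home              # any filesystem type
mount -o remount,noatime,notail /var        # a ReiserFS partition
# To make this permanent, add noatime (and notail for ReiserFS) to the
# options field of the corresponding /etc/fstab lines.
```

With atime updates on, every read of a file triggers an inode write, which is exactly the kind of hidden overhead that skews benchmark comparisons.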

[ Parent | Reply to this comment ]

Posted by Anonymous (84.166.xx.xx) on Sat 22 Apr 2006 at 14:41
How does Reiser perform on the recursive directory listing test after a lot of deletions/file creations?

My experience is that reiser's performance seriously degrades over time on partitions that are changing frequently (i.e. /var). apt and dpkg-operations really fly when /var/lib/dpkg is on a fresh reiser partition but crawl after a couple of weeks following debian/unstable.

I guess that is because the repacker that reiser needs is not available in the distributions. IIRC that important tool is only available to customers of Mr. Reiser. I stopped using reiserfs because of this.

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 15:51
«I guess that is due to the repacker that is necessary for reiser not being available in the distributions.»

All filesystems, some more, some less, degrade in performance under heavy rewrites. This is more or less inevitable, in major part because very few speed tests cover this aspect - and why waste time optimizing an issue that is not that obvious?

However, the best way by far to fix the issue is not to use a defragmenter, even a background one, like the ReiserFS repacker.

Defragmenters are both dangerous and slow, because they do same-disk copies and in-place modification.

Also, in any case one should backup before defragmenting.

Now, the best way to defragment is to do a disk-to-disk image backup, followed by a re-format of the original partition and a disk-to-disk tree restore, for example (where '/dev/hda' is the active drive and '/dev/hdc' the backup drive):

umount /dev/hda6
dd bs=4k if=/dev/hda6 of=/dev/hdc6
jfs_fsck /dev/hdc6
jfs_mkfs /dev/hda6
mount /dev/hda6 /mnt/hda6
mount /dev/hdc6 /mnt/hdc6
(cd /mnt/hdc6 && tar -cS -b8 --one -f - .) | \
(cd /mnt/hda6 && tar -xS -b8 -p -f -)
umount /dev/hdc6

This is just a simplified example of the steps... it can be used with the 'root' filesystem too with some modifications (easiest though if done from a liveCD).

Doing this copy has these important advantages:

* A backup is done just before the filesystem is optimized, as part of the process itself.

* Both the backup and the restore are disk-to-disk copies, which is a lot faster than same-disk copying.

* One of the copies is a very fast image copy, and the other is a sequential read and a sequential write, which are about as fast as a logical copy can go.

The risk and slowness of in-place, same-disk defragmentation might have been acceptable when backup was economical only to slow tape; but currently backup to disk is the best value, and one should take advantage of that.

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Sun 23 Apr 2006 at 09:56
A backup is done just before the filesystem is optimized, as part of the process itself.

That's dumb; the "backup" is immediately followed by deleting all the data on the original, so it's not a backup at all, it's just a pointless relocation. It would make more sense to mkfs the second disk, tree copy the first disk to the second disk, and then unmount the first disk and mount the second disk on the mount point (you of course use partition labels rather than absolute device names), which leaves the first disk as a backup that can be copied to tape or other backup media without affecting performance of the live filesystem. Of course, this all assumes that the filesystem can be unmounted in the first place, which often isn't possible -- making background defragmentation the best choice.
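The mkfs-the-second-disk approach described here can be sketched roughly as follows. This is a minimal illustration using two directories to stand in for the two mounted disks; the label and device names in the comments are assumptions, not from the article:

```shell
# Stand-ins for the two mounted filesystems; on a real system these would
# be something like a live /mnt/disk1 and a freshly mkfs'ed /mnt/disk2.
SRC=/tmp/disk1; DST=/tmp/disk2
rm -rf "$SRC" "$DST"
mkdir -p "$SRC/home" "$DST"
echo "payload" > "$SRC/home/file.txt"

# Tree-copy the first disk to the second, preserving sparse files and
# permissions, as in the tar pipeline from the parent comment.
(cd "$SRC" && tar -cS -f - .) | (cd "$DST" && tar -xS -p -f -)

# On real disks one would then swap the mount by label instead of by
# absolute device name, e.g. (hypothetical label and device):
#   e2label /dev/hdc6 data
#   umount /mnt/data && mount LABEL=data /mnt/data
```

That leaves the first disk untouched as the backup, which is the point being made above.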

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sun 23 Apr 2006 at 22:07
«"A backup is done just before the filesystem is optimized, as part of the process itself."
That's dumb; the "backup" is immediately followed by deleting all the data on the original, so it's not a backup at all, it's just a pointless relocation.»

Nahhh, there is an essential detail here: in-place defragmentation is done on the filesystem itself, and there is no backup. If the in-place fails, goodbye data.

Instead by making a copy to another spindle and copying back there is always a valid copy.

«It would make more sense to mkfs the second disk, tree copy the first disk to the second disk, and then unmount the first disk and mount the second disk on the mount point»

Well, in theory one can first image copy and then tree copy back, or vice versa as you suggest.

I would rather do the image copy first, because if the filesystem is damaged it is important to have an exact image copy, including the 'free' bits, from which to attempt recovery. Doing a tree copy first has its advantages, but it only copies the 'reachable from root' subset of the filesystem, so it is not as full a backup as an image copy.

«the filesystem can be unmounted in the first place, which often isn't possible -- making background defragmentation the best choice.»

Well, no, if the filesystem cannot be unmounted then we have a VLDB-style filesystem:

http://WWW.sabi.co.UK/Notes/anno05-4th.html#051009

and for them in-place live restructuring is even more dangerous. It is not clear to me how best to handle 24x7 filesystems, but I suspect tree-based mirroring is the least bad option.

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Tue 25 Apr 2006 at 08:27
«"A backup is done just before the filesystem is optimized, as part of the process itself." That's dumb; the "backup" is immediately followed by deleting all the data on the original, so it's not a backup at all, it's just a pointless relocation.»

Nahhh, there is an essential detail here: in-place defragmentation is done on the filesystem itself, and there is no backup. If the in-place fails, goodbye data.

You're babbling; I didn't say anything about in-place defragmentation in the statement you quoted. And when there is in-place defragmentation, who says there's no backup? One always does a backup before defragmenting. Sheesh.

Instead by making a copy to another spindle and copying back there is always a valid copy.

In your scenario, you copied one disk to another and then immediately did a mkfs on the first disk. That makes the copy STUPID, compared to simply mkfsing the second disk; in both cases you have one disk containing the original FS and the other containing an empty FS. Sheesh.

[ Parent | Reply to this comment ]

Posted by Anonymous (81.57.xx.xx) on Sun 23 Apr 2006 at 10:13
Now, the best way to defragment is to do a disk-to-disk image backup followed by a re-format of the original partition and a disk-to-disk tree restore
I disagree. A good defragmenter would defragment based on usage statistics. Frequently used files would be placed where the disk is faster (outer tracks), while unused files would be placed where it is slower. Files would be grouped based on usage patterns. For example, files used at boot time would be placed together.

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Sun 23 Apr 2006 at 10:25
Your so-called "good" defragmenter is a figment of your imagination, because the information it would require isn't available and would be expensive to maintain. And if that information were available, it could just as well be used by a "good" disk-to-disk copier, so it's irrelevant to which is better.

[ Parent | Reply to this comment ]

Posted by Anonymous (203.166.xx.xx) on Fri 12 May 2006 at 15:43
It's a bit late for this article; however, your magical auto-defragmenter does exist, just not for Linux. :) It's available starting with OS X 10.3. It looks for commonly used files, moves them to the beginning of a partition, and does on-the-fly defragmenting. The technology is called "Hot-File-Adaptive-Clustering".

[ Parent | Reply to this comment ]

Posted by Anonymous (66.81.xx.xx) on Wed 6 Feb 2008 at 00:58
Even Windows XP does this, at least for boot-time optimization.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.166.xx.xx) on Wed 26 Apr 2006 at 08:45
I am well aware that filesystems must degrade over time. But with reiser that is somewhat extreme in my experience. It would be nice to have some data to back that empirical impression up :-)

I am well aware that copying the data (better: all the files containing the data) will get me back to optimal performance. Maybe you can afford the downtime necessary to copy data back and forth on the server; I absolutely need some runtime mechanism for that.

[ Parent | Reply to this comment ]

Posted by Anonymous (65.175.xx.xx) on Sat 22 Apr 2006 at 15:03
Interesting results. It seems, to me anyway, that JFS may be better suited for database use. I'd be very interested to see results that included some database action and perhaps a simple raid setup in addition to the aforementioned concurrent thread tests.

Good stuff. Oh, a table summarizing the data would be cool.

[ Parent | Reply to this comment ]

Posted by Anonymous (71.224.xx.xx) on Sat 22 Apr 2006 at 16:12
I use ReiserFS for my desktop (which I program on). It's a great FS. Once, I stupidly screwed up the FS, but within one hour the ReiserFS tools repaired the FS (200GB partition) without a hitch, and no data was lost. ReiserFS is very stable for me, and much faster than ext3 in my testing. I don't mind the long mount times... the speed of the FS overall more than makes up for it.

I've tried XFS, and I honestly think it's pretty slow. I used it on a similar machine as mine, and un-tarring the Linux source tree took much longer on the XFS FS than on my ReiserFS (v3.6) partition. Also, XFS caches a lot of data in RAM, so there's high risk of data loss upon improper shutdowns.
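An un-tar comparison like the one described can be reproduced in miniature and run once per filesystem under test. This is a sketch: the paths are placeholders, and OUT would point at a directory on the filesystem being measured:

```shell
# Build a tree of many small files, tar it, then time the extraction.
SRC=/tmp/untar-src; OUT=/tmp/untar-out
rm -rf "$SRC" "$OUT"; mkdir -p "$SRC" "$OUT"
for i in $(seq 1 200); do echo "data $i" > "$SRC/f$i"; done
tar -cf /tmp/tree.tar -C "$SRC" .

# Portable wall-clock timing via date, so no GNU time is required.
start=$(date +%s)
tar -xf /tmp/tree.tar -C "$OUT"
echo "extract took $(( $(date +%s) - start )) seconds"
```

With a real kernel tarball the effect would be far more pronounced, since the source tree has tens of thousands of files.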

I must note: ReiserFS was designed to take advantage of the CPU to make a faster filesystem. The creator of ReiserFS noticed that most other filesystems barely used any CPU at all when doing I/O operations. Therefore, maybe if you tested all of the filesystems on a faster CPU, the results would be different.

Great article, though. :)

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sat 22 Apr 2006 at 19:05
«I've tried XFS, and I honestly think it's pretty slow. I used it on a similar machine as mine, and un-tarring the Linux source tree took much longer on the XFS FS than on my ReiserFS (v3.6) partition.»

This seems a reasonable conclusion from several tests I have seen: XFS has relatively high overheads but scales better than others, so it is less suitable for small scale filesystems.

In particular it has high CPU overheads. Now, I have also noticed that the whole Linux IO subsystem has pretty high CPU overheads, which make it rather CPU-bound except on the latest and greatest (Athlon 64 3000+ and similar) CPUs. If one has a slower CPU and adds a relatively CPU-hungry filesystem like XFS (or ReiserFS), that is not going to be that exciting...

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Sun 23 Apr 2006 at 09:27
There's a lot of nonsense here. Notably, the high initial ext3 space usage and the long mkfs time is due to the very large number of inodes that mkfs.ext3 creates by default -- designed when disks and average file sizes were much smaller. And the default 5% reserve space pointlessly penalizes ext3. Again, this default was designed when disks were much smaller -- it's much too big today, and its primary effect is to make non-root writes fail. Its anti-fragmentation effects are dubious, and since you don't test the effects of fragmentation, you're comparing apples to oranges. In fact, you apparently did no capacity-sensitive tests. If you're going to have a real world test, you have to use real world parameters based on established best practice, and perform a real world mix of operations, occurring over a real world time span.

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sun 23 Apr 2006 at 21:15
«And the default 5% reserve space pointlessly penalizes ext3. Again, this default was designed when disks were much smaller -- it's much too big today, and its primary effect is to make non-root writes fail. Its anti-fragmentation effects are dubious, and since you don't test the effects of fragmentation, you're comparing apples to oranges.»

Most filesystems, not just ext3, perform pretty badly when they are near capacity, and in particular extent-based filesystems, which derive a fair bit of their advantage from having a lot less metadata, which happens only if there are long extents.

If the anti-fragmentation effects of a 5% reserve are dubious, it is only because 5% is way too low. It should be at least 10%, and ideally 20-30%.
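For reference, the reserve percentage under debate is tunable at any time with tune2fs -m. A small sketch on a throwaway image file (the path, size, and the 10% figure are just examples; a real server would point tune2fs at its block device):

```shell
# Build a throwaway ext3 image (no root needed when working on a file).
dd if=/dev/zero of=/tmp/reserve.img bs=1M count=16 2>/dev/null
mkfs.ext3 -q -F /tmp/reserve.img

# Raise the reserved-block percentage from the default 5% to 10%,
# as the comment above suggests.
tune2fs -m 10 /tmp/reserve.img

# Show the resulting reserved block count.
tune2fs -l /tmp/reserve.img | grep 'Reserved block count'
```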

[ Parent | Reply to this comment ]

Posted by Anonymous (68.6.xx.xx) on Tue 25 Apr 2006 at 08:09
Larger reserves mean non-user writes fail sooner -- sometimes with catastrophic undetected loss of data due to the very large number of programs that a) are written in an ancient programming language that doesn't provide exceptions and b) do not check for write errors. And in desktop systems, even those programs that do check for write errors often write the error message to an unseen log file or produce some inscrutable report -- possibly an obnoxious sound, which may or may not make it through whichever abomination of a linux sound system is running to the user's ears. The write failures result in a lot of perplexity and loss of productivity, but eventually will result in the user deleting files (sometimes with catastrophic data loss) in an attempt to free up space. Such user-executed deletions may or may not help preserve the contiguous free space, but at best it's temporary; eventually the reserved space will be scattered over the disk in small chunks, the reservation of which is pointless.

A large reserve is the stupidest possible way to address the fragmentation problem, and is no substitute for proper defragmentation techniques.

[ Parent | Reply to this comment ]

Posted by Anonymous (81.57.xx.xx) on Sun 23 Apr 2006 at 10:41
Do you know filesystem benchmarks which focus on software development usage, both on server-side and client-side storage?

Here is a simple scenario for CVS:
- multiple cvs commit (in particular look at files and directories fragmentation on server)
- cvs checkout
- cvs -n update
- cvs update (in particular look at files and directories fragmentation on client)
- cvs tag TAG
- cvs co ; cvs update -j TAG
- cvs diff
- cvs diff -j TAG1 -j TAG2

To avoid network problems, server and client environments would be on the same machine, but on different disks.

Comparing different SCS would be important as each one has different filesystem usage patterns, so filesystems may be more adapted to one SCS than the others.

Olivier Mengué

[ Parent | Reply to this comment ]

Posted by Anonymous (85.134.xx.xx) on Sun 23 Apr 2006 at 18:46
This is an excellent comparison of the popular filesystems for Linux. My own point is that ext3 is still much more useful, since on file removal it restores the capacity back to its initial state. I wouldn't want to find that some storage capacity had gone missing. Keith

[ Parent | Reply to this comment ]

Posted by hansivers (64.18.xx.xx) on Sun 23 Apr 2006 at 19:47

Hi everybody,

Wow! I *really* did not expect this first contribution to generate so many comments and interesting discussions. Some sound a bit "hard", since this review has a limited and modest objective (to complete some data published by Piszcz with tasks I felt were unclear or missing), but I understand the points made by these authors. I apologize for missing or unclear information.

I'm currently working to rerun all my testing while taking into account many suggestions :

  1. Be explicit about FS creation and mounting options
  2. Umount and remount partition between each test
  3. Report initial, residual as well as full disk capacity
  4. Compute time and CPU usage for fsck on each FS

I'm also investigating the best way to test :

  1. CVS operations (thanks for this excellent suggestion)
  2. concurrent tasks (another excellent one!)

within the scope of small-business file server operations.

Some discussions were initiated about how to test data integrity after unexpected system shutdowns. I feel it will be a very interesting metric to benchmark, since small-business and home servers may be less likely to have power-failure protection (UPS, etc.).

QUESTION TO EXPERIENCED CONTRIBUTORS :

Since it's my first contribution to debian-administration.org, I've restricted myself to the HTML tags suggested in the "Submit an article" section. However, I agree with previous comments that it would be more interesting to publish graphs of the results. How can I do that here (other than uploading graphs to a personal website and linking them here)?

Thanks everybody!

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Sun 23 Apr 2006 at 21:56
Well, most of the improvements you suggest are already in the tests I did a while ago, those at

http://www.sabi.co.uk/Notes/anno05-3rd.html#050908

and later entries.

I suspect that it would be far more interesting to see tests done with larger partition/filesystem sizes than I had the patience to do myself (my tests involved 4GB/8GB). But then I do have a bit of a lull now, and a couple of partitions with 40-70GB in them, so I could do that myself. The main problem is that it takes time and is very tedious... :-)

Even better it would be to find a largish filesystem that has been churned a fair bit and compare the times between its churned state and its freshly reloaded state.

Unless you have some host adapter with tagged queueing, doing concurrent accesses is going to be a waste of time, because it will perform awfully. Simple tests as to that were done many years ago, for example:

http://groups.google.com/group/comp.arch/msg/7da6f8635c3e14db

As someone else mentioned, more sophisticated examples have already been published, though on properly massive IO subsystems (those likely to have host adapters supporting tagged queueing).

[ Parent | Reply to this comment ]

Posted by Anonymous (64.18.xx.xx) on Mon 24 Apr 2006 at 17:32

Well, most of the improvements you suggest are already in the tests I did a while ago

Yes, I'd already noticed your various posts about your work. Really interesting! Too bad I became aware of it after publishing my initial report. It would have helped me to better *bullet-proof* my methodology.. :D

I feel that independent replication of results is as important as good methodology for the advancement of knowledge. I've published my initial results here and, from the beginning, I've invited readers to share comments and suggestions to improve this ongoing work. As in many other scientific fields, it's the accumulation of evidence that helps to establish facts, more than waiting for the "definitive" study to come. It's probably my statistician background speaking here, and I respect that not everybody may share this view.

I suspect that it would be far more interesting to see tests done with large partitions/filesystem sizes than I had the patience to do myself (my tests involved 4GB/8GB).

I'm actually working on a 40GB partition and I'm planning to test on a 160GB partition (transfer sizes and operation counts will be proportionally increased), to see whether some results scale up linearly or exponentially.

[ Parent | Reply to this comment ]

Posted by Anonymous (82.69.xx.xx) on Tue 25 Apr 2006 at 09:50
«But then I do have a bit of a lull now, and a couple of partitions with 40-70GB in them, so I could do that myself.»

Oh well, I was too curious and I have a high tolerance for boredom, so I have redone my informal tests on a 65GiB filesystem:

http://WWW.sabi.co.UK/Notes/anno06-2nd.html#060424

[ Parent | Reply to this comment ]

Posted by Steve (82.41.xx.xx) on Sun 23 Apr 2006 at 22:14
[ View Steve's Scratchpad | View Weblogs ]

Either mail me the images to go along with the article - or host them somewhere yourself and I'll copy them over.

I'm happy to include images in pieces where they are useful and this type of article would benefit from them I agree!

(I'd much prefer to host images here since then I don't have to worry about them disappearing. However if you feel strongly that this is not a good idea I would allow you to host them.)

Thanks for the article, I too was surprised how many readers and commenters appreciated it!

Steve

[ Parent | Reply to this comment ]

Posted by Anonymous (213.237.xx.xx) on Wed 26 Apr 2006 at 01:55
I would like to see one other aspect covered: namely, the time spent on IO!
Maybe recorded as the total time elapsed. Many people focus on the CPU time used, but I think the real figure to focus on is the total elapsed real time, as this is what the user experiences. The FSes may differ here in their efficiency, given their intelligence in grouping IO requests together and, e.g., having inodes and blocks located close together.
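Capturing both figures is straightforward from the shell. A minimal sketch (the dd workload and file path are placeholders for the real task being benchmarked):

```shell
# Wall-clock (elapsed) time for an I/O-heavy step, measured portably.
start=$(date +%s)
dd if=/dev/zero of=/tmp/iotest bs=64k count=128 conv=fsync 2>/dev/null
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"

# CPU seconds consumed so far by this shell and its children;
# the gap between elapsed and user+sys is mostly I/O wait.
times
```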

I would like to see some consideration given to common jobs that take a long time:

searches in a directory tree, like find, or creation of a tar file.

Boot time (for the same OS)

loading of huge applications, like openoffice.org

concurrent reading of files, like a file server or a mirror server.

database handling, like a backup of a database.

Also, the reported disk usage of the partitions for the different FSes on your system would be interesting. The root (/) filesystem particularly, as this is likely to be common to most systems.

A survey of recovery utilities would also be nice. Which FS allows recovery of one or more deleted files?

[ Parent | Reply to this comment ]

Posted by Anonymous (195.39.xx.xx) on Thu 15 Oct 2009 at 07:50
The best recovery chances are with ReiserFS; however, this filesystem is very sensitive to damage to its S+-tree (which is still recoverable).

XFS takes second place. It is more stable than ReiserFS and has about the same recovery chances, however deleted files are recovered without their real size (the recovered file size is larger, rounded up to the FS block size).

Ext3 has almost the worst recovery chances because it wipes inodes immediately. File deletion on Ext3 works through the journal, and a file can be recovered until the 'deletion update' record is pushed off the journal. The number of recoverable deleted files depends on the journal size and journal use.

I didn't use JFS on Linux, simply because it is not intended for it (actually, in my opinion its design is not fully compatible with the Linux VFS).


Data recovery techniques and software:
There are many techniques described on Linux forums using "**fs_repair", "grep" and other similar basic tools; however, read/write access to a block device holding lost information can cause permanent data loss. That's why these can be recommended for non-critical data only.
Any valuable information should be recovered in read-only mode.

There are plenty of tools from different vendors that claim support for lost/deleted file recovery on all these filesystems.

We use Linux in production here, so deleted/lost file recovery eventually became critical...
We did some tests, and the most accurate recovery was with the commercial UFS Explorer Standard Recovery (version 4.x) from SysDev Laboratories (http://www.ufsexplorer.com/download_stdr.php).
The tests included deletion of fragmented files, filesystem damage (partial wipe, metadata damage), etc.
The current version of this tool misses only the most critical thing: the ability to run natively on Linux (it is currently a Windows tool).

If anyone has done more research on this subject, I'd be glad to see other opinions.

[ Parent | Reply to this comment ]

Posted by jak (207.144.xx.xx) on Tue 16 May 2006 at 17:50
"Umount and remount partition between each test"

Maybe reboot too.

I dunno, but I've been told that SuSE's distro-specific patches make reiser better than on other distros. Maybe that's why the SuSE default FS is reiser. It's the only distro where I use reiser. :-)

[ Parent | Reply to this comment ]

Posted by Anonymous (128.153.xx.xx) on Tue 25 Apr 2006 at 00:23
I must say this is an excellent article! I have always relied on Reiser and Ext3 as my main filesystems, but have been annoyed by how long my Reiser /home takes to mount at boot (it is quite a bit longer than my ext3 / and /anime drives (no comment on me having a separate /anime drive :P)), and I have been looking to replace it with another FS. I may have to try out XFS :D

[ Parent | Reply to this comment ]

Posted by Anonymous (70.53.xx.xx) on Wed 26 Apr 2006 at 03:57
What would be great is a comparison with the forthcoming Reiser4. Hans Reiser claims that Reiser4 is the fastest. I'd really like to know the truth here, not some numbers picked from hand-picked artificial benchmarks.

Reiser's benchmarks are here: http://www.namesys.com/benchmarks.html

[ Parent | Reply to this comment ]

Posted by Anonymous (62.104.xx.xx) on Wed 26 Apr 2006 at 09:18
it was not the focus of this article, nevertheless this would be of interest:

To keep wear on Compact Flash/USB Memory Stick/MMC-Card/SD-Card low it would be relevant to know how many blocks were actually changed on the device for writing some smaller and some larger files.

Is this information available somewhere?

[ Parent | Reply to this comment ]

Posted by Anonymous (208.187.xx.xx) on Wed 26 Apr 2006 at 23:10
Good article. I think there needs to be some more benchmarks, though. I would add the following operations:
  • MySQL or Postgres database add/update (typical business application)
  • OpenLDAP or Fedora Directory database add/update (typical business application)
  • Streaming of multiple files (typical for multimedia servers)
  • Reading/writing to (pseudo-)random locations within a file (many tests are inherently sequential or indexed sequential, this guarantees you have a baseline for random access)

It would be good if the CPU load and kernel memory consumption was also tracked (so there was an indication of FS overhead per unit of performance), especially if the tests were run for a normal setup and on two configurations that were deliberately reduced, so that it was possible to extrapolate how the filing system would perform under any other configuration (assuming FS performance follows a simple curve).

[ Parent | Reply to this comment ]

Posted by Anonymous (84.194.xx.xx) on Fri 28 Apr 2006 at 10:53
All these comments ... called the digg effect :)

Fred
Linox.BE

[ Parent | Reply to this comment ]

Posted by Anonymous (66.190.xx.xx) on Sat 29 Apr 2006 at 03:55
Please include ext2. The main advantage of ext3 over ext2 is that you don't potentially have to wait a VERY LONG TIME for your several terabyte array to fsck after a crash or whenever it's passed its mount count. I've seen several analyses indicating that ext3 is not significantly more recoverable from damage than ext2, just repairs the simple stuff much faster.

Also, it would be nice to differentiate between metadata and full-data journalling.


I tend to use ext3 unless performance is critical, then for many things I use ext2. In special cases, I've used JFS and others, but I tend to be very cautious about that these days. I have lost critical and expensive data when a drive with a JFS filesystem went bad. Despite the personal and financial impact of the loss, I could not afford the $30k that was the best data recovery quote I received. The damage was stuff that fsck could have fixed on ext2 or ext3... I would have lost some files to corruption or total loss, but most of the data would have been recovered. I've helped a friend (resign himself to loss) after a similar problem with XFS.

One of the advantages to ext2/ext3 is the very redundancy that tends to slow it down...

[ Parent | Reply to this comment ]

Posted by Anonymous (66.190.xx.xx) on Sat 29 Apr 2006 at 04:04
Another good test would be to run iozone with same settings on the same partition of the same drive, using each filesystem of interest. Since iozone uses the various kernel file-handling calls, if the drive (and offset into the drive) does not change between runs, you get a very good differential analysis of where each filesystem is stronger or weaker. Of course the numbers don't help much if you don't actually know what your workflow is like, but that's a general problem with ALL benchmarks: The more precise they are, the harder you have to work to apply them to your case... ;)

[ Parent | Reply to this comment ]

Posted by Anonymous (24.117.xx.xx) on Sat 29 Apr 2006 at 04:18
I switched about 2 years ago to XFS on all of the servers I administer. It was for a simple reason. I never felt completely comfortable turning off the fsck period on my ext3 servers.

What that meant in practice was that the server would be up for a year or so, and I'd need to do a quick reboot. Time had expired on the fsck and as filesystems have grown, the check became around 30 minutes (too long for quick).

In every way that matters to me, I have found XFS to be superior in speed and flexibility to ext3.

[ Parent | Reply to this comment ]

Posted by Anonymous (80.48.xx.xx) on Mon 1 May 2006 at 14:42
Very good article.

[ Parent | Reply to this comment ]

Posted by Anonymous (192.35.xx.xx) on Tue 16 May 2006 at 14:55
Hi -- nice write-up.

Your article is a benchmark article, but it would be nice to acknowledge that there are other things to consider besides performance when selecting a filesystem. For me, RAS (Reliability, Availability, Serviceability) is very important. Some aspects of this:

1) How robust is the filesystem? How does it react to power failures? Hardware failures?
2) How quickly can it be recovered from a failure? How well does it recover data from a broken filesystem? Broken hardware?
3) What utilities or tools are available for the filesystem? For example, does it have a dump/restore facility?

It is the above that lead me to choose ext3 for my use.

RCP

[ Parent | Reply to this comment ]

Posted by Anonymous (149.168.xx.xx) on Tue 16 May 2006 at 14:56
There should be another category - one that either works for redhat or is a redhat/fedora lover.

This category of users shall choose ext3 and ONLY ext3. They shall remove all other filesystems from the kernel ( except ext2 of course) and re-compile and install that kernel.

[ Parent | Reply to this comment ]

Posted by jsosic (217.198.xx.xx) on Tue 13 Jun 2006 at 14:00
I want to bash this test a little...
I would like to say a few words about Ext3. First of all, if you use the default settings when formatting it, it will potentially be slower. Ext3 needs to be tuned to show its full performance.

Now, some guy earlier said he switched his servers to XFS because of the ext3 fs check. He obviously didn't know that the fsck can be turned off, or postponed:
tune2fs -c [max-mount-counts]
Adjust the maximal mount count between two filesystem checks. If max-mount-counts is 0 or -1, the number of times the filesystem is mounted will be disregarded by e2fsck(8) and the kernel.
tune2fs -i [interval-between-checks(d|m|w)]
Adjust the maximal time between two filesystem checks. No postfix or d results in days, m in months, and w in weeks. A value of zero will disable the time-dependent checking.
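Concretely, the two settings quoted above can be exercised on a scratch image file. The path here is just an example; a real server would point tune2fs at its block device:

```shell
# Make a small ext3 filesystem in a plain file (no root required).
dd if=/dev/zero of=/tmp/ext3.img bs=1M count=16 2>/dev/null
mkfs.ext3 -q -F /tmp/ext3.img

# Disable both the mount-count-based and the time-based fsck triggers.
tune2fs -c 0 -i 0 /tmp/ext3.img

# Verify the result in the superblock.
tune2fs -l /tmp/ext3.img | grep -E 'Maximum mount count|Check interval'
```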


Now that's settled, back to performance. Ext3 supports a few journal modes. You can read more about them in the manual pages, but if you want maximum performance for a workstation, the best mode is writeback:

journal_data_writeback
When the filesystem is mounted with journalling enabled, data may be written into the main filesystem after its metadata has been committed to the journal. This may increase throughput; however, it may allow old data to appear in files after a crash and journal recovery.

Next, if you test ext3 performance, you MUST, and I mean MUST, turn on dir_index!!!

dir_index
Use hashed b-trees to speed up lookups in large directories.

Now, ext3 also supports a few nice mount switches, like commit=[time], although it won't affect performance too much. Yes, default ext3 is painfully slow, but journal_data_writeback and dir_index give it a steroid pump, and it then performs far better, so please, in the next edition, tune ext3! You'll be surprised...

Why do I like ext3 so much? Well, first of all, it's the oldest FS on Linux and has great support, recovery, stability... and best of all, it's really, really tuneable. Also, you can mount it as ext2 (without the journal) if you want!

Now, don't panic when I say this, but I run my workstation on Gentoo. And, because new versions of programs get into portage almost every day, I often upgrade programs, which leads to deletion of old files and creation of new ones - lots of I/O. I was running ReiserFS for a few months, and I noticed pretty bad fragmentation (speed degradation over time). When I forced defragmentation (copied the whole partition to another one, formatted, then copied the data back), Reiser was performing unbelievably! The bad thing is that it degrades quite quickly, and no test can take this into account. Then I switched to JFS, and that's an excellent FS for a workstation, the quickest as far as I can see. Not noticeably slower than Reiser, but it lacks Reiser's degradation "feature". And about XFS, I don't know about you, but it smashed the head of my hard drive all around, and I don't like it... It's like the drive is a little noisier with it.

Now, another thing: why wasn't ext2 (with dir_index) tested too? If you tend to fragment your / into many partitions (/var /tmp /var/tmp /usr/src ...), then ext2 is worth checking. For example, if you have /usr/src, there's little point in having a journal for that partition. The journal gives quite an overhead, so I would suggest not using it if you don't need it. All these recoverable partitions can go along with ext2.

So, my little conclusion: if you want speed and medium security - 1. JFS, 2. ext3 tuned for speed. If you want maximum security for your data - ext3 tuned for data security. If you need pure speed and the data is easily replaceable (/usr/src) - ext2 tuned.

Jakov Sosic jsosic@jsosic.homeunix.org
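The dir_index and data=writeback tuning described above can be sketched as follows, on a scratch image file (the image path, device name, and mount point in the comments are assumptions for illustration):

```shell
# Make a small ext3 filesystem in a plain file (no root required).
dd if=/dev/zero of=/tmp/tune.img bs=1M count=16 2>/dev/null
mkfs.ext3 -q -F /tmp/tune.img

# Turn on hashed b-tree directory lookups on the existing filesystem.
tune2fs -O dir_index /tmp/tune.img

# The writeback journal mode is selected at mount time; a matching
# fstab line might look like (device and mount point are hypothetical):
#   /dev/hda3  /home  ext3  defaults,data=writeback  0  2

# Confirm dir_index is in the feature list.
tune2fs -l /tmp/tune.img | grep -i 'features'
```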

[ Parent | Reply to this comment ]


Next, if you test ext3 preformance, you MUST, and I mean MUST, turn on dir_index!!!

dir_index
Use hashed b-trees to speed up lookups in large directories.
Now, ext3 also supports few nice mount switches, like commit=[time], altough it won't affcet preformance too much. Yes, default ext3 is painfully slow, but journal_data_writeback and dir_indexes give it a steroid pump, and that in preforms far better, so please, in the next edition, tune ext3! You'll be surprised...

Why I like ext3 so much? Well, first of all, it's the oldest FS on Linux and has great support, recovery, stability... and best of all, it's really really tuneable. Also, you can mount it as ext2 (Without journal) if you want!

Now, don't get panic when I say this, but I run my workstation on Gentoo. And, because new versions of programs get into portage almost every day, I often upgrade programs, which leads to deletion of old files and creation of new ones - lot's of I/O. I was running ReiserFS for few months, and I've noticed pretty bad fragmentation (speed degradation over time). When I forced defragmentation (copied whole parition to another one, formatted, then copied data back), Reiser was preforming unbelievably! Bad thing is that it degrades quite quick, and no test can take this into accounting. Then, I've switched to JFS, and that's an excellent FS for a workstation, quickest as far as I can see. Not noticeably slower that Reiser, but it lacks Reiser's degradation "feature". And about XFS, I don't know about you, but it smashed head of my hard drive all around, and I don't like it... It's like drive is little noiser with it.

Now, another thing. Why ext2 (with dir_index) wasnt tested too? If you tend fragment your / into many paritions (/var /tmp /var/tmp /usr/src ....), than ext2 is worth of checking. For example, if you have /usr/src, there's little point of having journal for that partition. Journal gives qiute an overhead, so I would suggest not to use it if you don't need to. All these recoverable partitions can go along with ext2.

So, my little conclusion:
If you want speed and medium security - 1. JFS, 2. ext3 tuned for speed
If you want maximum security for your data - ext3 tuned for data security
If you need pure speed and data is easily replacable (/usr/src) - ext2 tuned

Jakov Sosic jsosic@jsosic.homeunix.org

[ Parent | Reply to this comment ]

Posted by jsosic (217.198.xx.xx) on Tue 13 Jun 2006 at 15:41
I want to bash this test a little...
I would like to say a few words about ext3. First of all, if you use the default settings when formatting it, it will potentially be slower. Ext3 needs to be tuned to show its full performance.

Now, some guy earlier said he switched his servers to XFS because of an ext3 fs check. He obviously didn't know that the fsck can be turned off, or postponed:
tune2fs -c [max-mount-counts]
Adjust the maximal mount count between two filesystem checks. If max-mount-counts is 0 or -1, the number of times the filesystem is mounted will be disregarded by e2fsck(8) and the kernel.
tune2fs -i [interval-between-checks(d|m|w)]
Adjust the maximal time between two filesystem checks. No postfix or d results in days, m in months, and w in weeks. A value of zero will disable the time-dependent checking.
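For example, a sketch of both switches (the device name is illustrative):

```shell
# Never trigger a check based on how often the fs has been mounted:
tune2fs -c 0 /dev/hda1
# ...or keep the checks, but only every 100 mounts / every 6 months:
tune2fs -c 100 -i 6m /dev/hda1
```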


Now that that's settled, back to performance. Ext3 supports a few journal modes. You can read more about them in the manual pages. But if you want maximum performance on a workstation, the best mode is writeback:

journal_data_writeback
When the filesystem is mounted with journalling enabled, data may be written into the main filesystem after its metadata has been committed to the journal. This may increase throughput; however, it may allow old data to appear in files after a crash and journal recovery.

Next, if you test ext3 performance, you MUST, and I mean MUST, turn on dir_index!!!

dir_index
Use hashed b-trees to speed up lookups in large directories.
Now, ext3 also supports a few nice mount switches, like commit=[time], although that won't affect performance too much. Yes, default ext3 is painfully slow, but journal_data_writeback and dir_index give it a steroid pump, and then it performs far better, so please, in the next edition, tune ext3! You'll be surprised...
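Putting the two tunings together, a minimal sketch (device and mount point are illustrative; e2fsck must run on an unmounted filesystem):

```shell
# Turn on hashed b-tree directory indexing on an existing ext3 fs,
# then rebuild the indexes for directories that already exist:
tune2fs -O dir_index /dev/hda2
e2fsck -fD /dev/hda2
# Mount with metadata-only journalling (fast, less safe on a crash):
mount -o data=writeback /dev/hda2 /srv
# Or permanently via /etc/fstab:
#   /dev/hda2  /srv  ext3  defaults,data=writeback  0  2
```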

Why do I like ext3 so much? Well, first of all, it's the oldest FS on Linux and has great support, recovery and stability... and best of all, it's really, really tunable. Also, you can mount it as ext2 (without the journal) if you want!

Now, don't panic when I say this, but I run my workstation on Gentoo. And, because new versions of programs get into portage almost every day, I often upgrade programs, which leads to deletion of old files and creation of new ones - lots of I/O. I was running ReiserFS for a few months, and I noticed pretty bad fragmentation (speed degradation over time). When I forced a defragmentation (copied the whole partition to another one, formatted it, then copied the data back), Reiser was performing unbelievably! The bad thing is that it degrades quite quickly, and no test can take this into account. Then I switched to JFS, and that's an excellent FS for a workstation, the quickest as far as I can see. It's not noticeably slower than Reiser, but it lacks Reiser's degradation "feature". And about XFS - I don't know about you, but it thrashed my hard drive's head all around, and I don't like that... The drive is a little noisier with it.

Now, another thing: why wasn't ext2 (with dir_index) tested too? If you tend to split your / into many partitions (/var, /tmp, /var/tmp, /usr/src, ...), then ext2 is worth checking out. For example, if you have /usr/src, there's little point in having a journal for that partition. A journal gives quite an overhead, so I would suggest not using one where you don't need it. All these easily recoverable partitions can go with ext2.

So, my little conclusion:
1. If you want speed and medium security - JFS first, then ext3 tuned for speed
2. If you want maximum security for your data - ext3 with journal_data
3. If you need pure speed and the data is easily replaceable (for example /usr/src) - ext2 with dir_index

PS. Sorry for previous posts, I'm new to this site, and I tried to edit them, but it seems that can't be done... Jakov Sosic jsosic@jsosic.homeunix.org


Posted by Anonymous (220.114.xx.xx) on Sat 1 Jul 2006 at 09:20
Thanks for your good comments!


Posted by Anonymous (87.75.xx.xx) on Tue 18 Jul 2006 at 14:16
Excellent review! Well done!

I'd love to see a review of the filesystems on RAID (hardware and software) some time in the future!


Posted by Anonymous (153.94.xx.xx) on Mon 31 Jul 2006 at 13:48
I liked some of those tests. Formatting and mount time I find kind of irrelevant: I format once and remount every few months, and only during scheduled downtime. A few minutes more or less doesn't matter for a few hours of downtime.

What I'm missing is a comparison of various tunings for the FSes. For example (since I use ext3, all of these are ext3 features), changing the journal mode has a huge effect on speed at the cost of security. The dir_index flag is also a big plus for directories with many files (and is now the default).

The amount of reserved space can be tuned on ext3 (I always set it to 0 on non-system disks). What effect does that have on speed? How do filesystems behave when they are getting full? How about filling the disk up to X% with files of random size, then creating randomly sized files and deleting random files, keeping the FS always at X%?
How does the speed vary as X approaches 100? How does the speed vary after 0, 1000, 1000000 files have been created/removed?

The number of inodes can be tuned in ext3 too. The default of one inode per block is quite insane for your mp3 or movie collection with an average file size of 1MB or 100MB. For a mail or news spool, on the other hand, a high inode count is vital. Setting -T largefile or -T largefile4 with mke2fs has a huge impact on creation speed and frees up several GiB on large partitions too. Does it affect other things as well?
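As a rough sketch of what the inode ratio means (the 1 MiB bytes-per-inode figure is what -T largefile selects; the device name is illustrative):

```shell
# -T largefile allocates one inode per 1 MiB of space instead of the
# much denser default, which suits big-file collections:
BYTES_PER_INODE=$((1024 * 1024))
DISK_BYTES=$((100 * 1024 * 1024 * 1024))   # a 100 GiB partition
echo $((DISK_BYTES / BYTES_PER_INODE))     # ~102400 inodes in total
# At format time (also dropping the 5% reserved space with -m 0):
#   mkfs.ext3 -T largefile -m 0 /dev/sdb1
```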

MfG
Goswin


Posted by Anonymous (201.19.xx.xx) on Mon 14 Aug 2006 at 22:02
I noted your hardware includes a Silicon controller.

Could it be that there is some sort of hardware-based optimization for XFS, or that XFS could be made to use certain optimizations if some such chip is found?

This would definitely make a case for buying such hardware to achieve even better results, right?


Posted by Anonymous (87.226.xx.xx) on Wed 27 Feb 2008 at 12:52
Sorry to embarrass you (your comment is very lame), but how would the FS driver know about the underlying hardware?

Anyway, you can read linux/fs/xfs to get some insight.

// Artem S. Tashkinov


Posted by Anonymous (66.31.xx.xx) on Wed 19 Aug 2009 at 18:00
Filesystem performance on RAID arrays is improved when mkfs reads the stripe unit size and stripe width from the hardware. mkfs.xfs does this.
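When the geometry is not auto-detected (e.g. on many hardware RAID controllers), it can also be stated by hand; a sketch for a hypothetical 4-disk RAID5 with 64 KiB chunks, i.e. 3 data disks per stripe:

```shell
# su = stripe unit (the per-disk chunk size),
# sw = stripe width in data disks (4 disks minus 1 parity disk)
mkfs.xfs -d su=64k,sw=3 /dev/md0
```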


Posted by Anonymous (87.217.xx.xx) on Thu 1 Mar 2007 at 16:01
Hi there,

I wondered if you could give me your permission to translate this article into Spanish in order to share it with my community (ubuntu-es).


Posted by Anonymous (81.56.xx.xx) on Wed 2 Jan 2008 at 19:28
Hello,

Here is another benchmark report on the ReiserFS, Ext3, JFS and XFS filesystems:
http://arnofear.free.fr/linux/files/contrib/bench/bench.html


Posted by Anonymous (66.240.xx.xx) on Wed 9 Jan 2008 at 17:50
I just wanted to thank you for your effort in this comparison. It echoed my professional experiences in corporate implementations.

I've had all sorts of issues with ReiserFS (SLES 9.x) in moderately loaded applications - an NFS-mounted mail store for 10,000+ Maildir accounts. Recovery from failures also left a lot to be desired compared to EXT3 and XFS. Despite good tools, I always experienced data corruption using ReiserFS.

What most readers don't understand is that there is no perfect filesystem for everyone... although EXT3 comes damn close and should suit almost everyone's general use.

Personally, I'm phasing out my SLES installs and running CentOS 4/5 servers. All my SAMBA/NFS filesystems get XFS and everything else gets EXT3. When I moved my mail store from ReiserFS to XFS, all my IMAP users noticed a dramatic increase in performance. The server didn't change: dual Opteron 64-bit, 4GB RAM, 1TB RAID10 storage array.


Posted by Anonymous (139.174.xx.xx) on Sun 29 Jun 2008 at 10:56
When you did your copy benchmarks, what was the source fs?

If the source was ext3, it had an unfair advantage over ReiserFS and XFS.

Did you enable barriers for ext3?

If not, ReiserFS and XFS had an unfair disadvantage, since ext3 disables barriers by default, but XFS and ReiserFS enable them by default. If enabled, there should be (if you believe the kernel people) a 30% penalty for ext3.
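For anyone re-running the tests, a sketch of how the field could be levelled (option spellings as used by 2.6-era kernels; devices and mount points are illustrative):

```shell
# Enable write barriers on ext3, which leaves them off by default:
mount -o barrier=1 /dev/hda2 /mnt/ext3
# ...or disable them on xfs and reiserfs to match ext3's default:
mount -o nobarrier /dev/sdb1 /mnt/xfs
mount -o barrier=none /dev/sdc1 /mnt/reiser
```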


Posted by Anonymous (74.9.xx.xx) on Thu 4 Sep 2008 at 13:46
Note: you can shrink an ext3 volume, but not an XFS one. I prefer XFS on arrays; the performance and stability have been good.
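Shrinking ext3 is an offline operation; a sketch on a hypothetical LVM volume (the filesystem must be reduced before the volume underneath it):

```shell
umount /mnt/data
e2fsck -f /dev/vg0/data         # resize2fs requires a clean fs check
resize2fs /dev/vg0/data 40G     # shrink the filesystem to 40 GiB first
lvreduce -L 40G /dev/vg0/data   # then shrink the volume to match
```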


Posted by Anonymous (83.34.xx.xx) on Wed 15 Jul 2009 at 15:56
ext4 is the current winner.


Posted by Anonymous (148.87.xx.xx) on Fri 23 Apr 2010 at 04:30
The "initial capacity" can be increased by specifying the "-i" (bytes-per-inode) option to create fewer inodes, e.g. "mkfs -t ext3 -i 32768 /dev/hdb1"

