Creating dynamic volumes with loop devices

Posted by banchieri on Thu 20 Jan 2011 at 09:52

Recently, my interest turned to volume management. Since, on Linux, LVM2 seems to be the only viable solution (unless you go for cluster filesystems), I started reading about LVM. Somewhere, I read about setting up a loop device as a PV, and that's where I got intrigued: couldn't I just use a sparse file consisting of one single large hole, and let the underlying ext4 fill it in only when something is actually written to the loop device?

So, as a proof of concept, I created the volume with

(1) dd if=/dev/zero of=FILE bs=X seek=Y-1 count=1,
(2) losetup /dev/loop0 FILE,
(3) mke2fs /dev/loop0,
(4) mount /dev/loop0 /mnt,
and after each step checked disk usage with du -k FILE and filesystem usage with df -k /mnt.
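
Concretely, with numbers filled in, a run might look like this (a 1 GiB volume is just an example size):

    # a 1 GiB file occupying almost no disk space: one 1 MiB block
    # written at the very end, the rest is a single hole
    dd if=/dev/zero of=FILE bs=1M seek=1023 count=1
    losetup /dev/loop0 FILE
    mke2fs /dev/loop0
    mount /dev/loop0 /mnt
    du -k FILE    # tiny: only the materialized fs metadata counts
    df -k /mnt    # reports the full 1 GiB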

Guess what? It worked!

For further testing, I started copying files into my "dynamic volume", watched it grow, and ultimately unmounted it.

Since that worked out so well, I'm gonna use that trick in the Linux Containers setup I'm currently working on, because it
(a) lets the volume grow dynamically and
(b) puts an upper limit on volume size.

Ah, and: With ext3/4 as underlying filesystem, ext2 within the loop device is enough. Why journal twice?

 

 


Posted by banchieri (2a01:0xx:0xx:0xxx:0xxx:0xxx:xx) on Thu 20 Jan 2011 at 10:03
[ View Weblogs ]

Dear editors,

thanks for the "sparse file".

banchieri

[ Parent | Reply to this comment ]

Posted by Anonymous (194.48.xx.xx) on Thu 20 Jan 2011 at 10:25
What happens if you delete large files? It won't shrink, will it? I guess it will eventually grow until the limit is reached.

[ Parent | Reply to this comment ]

Posted by gwolf (132.248.xx.xx) on Thu 20 Jan 2011 at 15:38
Of course, deleting a file does not zero its contents, so the space stays occupied. Still, this trick sounds like it can be _very_ good for provisioning virtual servers, or sets of limited, throw-away environments - (ab)using the promise of virtualization: sustain the illusion that you have more resources than you actually do.

[ Parent | Reply to this comment ]

Posted by mcortese (20.142.xx.xx) on Thu 20 Jan 2011 at 16:51
[ View Weblogs ]
"deleting a file does not zero its contents, so the space stays occupied."
Where have I heard similar concepts lately? Sure - talking about how filesystems lack a way to inform the underlying device that some blocks are not needed anymore! Can you see the parallel?

Take a solid-state device, make a file system on it, populate it with large files, then delete them. The SSD will still see most of its blocks as used, and its wear-levelling strategy will lose efficiency. Some solutions have been proposed (zeroing the blocks, issuing a TRIM command...) but, to date, none has reached broad consensus.

In the case above, the ideal solution would be to have a top-level file system which zeros or trims blocks upon file deletion, and a loop device that understands that and translates it to holes in the underlying file.
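
Newer kernels did grow exactly this combination: the loop driver can translate the inner filesystem's discards into holes punched in the backing file. A sketch, assuming a kernel with loop discard support (not yet available on 2011-era kernels):

    # inner fs discards freed blocks as you delete...
    mount -o discard /dev/loop0 /mnt
    # ...or trim in one batch on demand
    fstrim /mnt
    du -k FILE    # the backing file shrinks after trimmed deletes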

[ Parent | Reply to this comment ]

Posted by banchieri (2a01:0xx:0xx:0xxx:0xxx:0xxx:xx) on Fri 21 Jan 2011 at 06:33
[ View Weblogs ]

Well, yes, that's how filesystems behave. On the other hand, if you put another large file onto the volume after deleting the first one, the volume won't grow because blocks get reallocated by the "inner" filesystem.

So far, I haven't heard of auto-shrinking filesystems (with the notable exception of tmpfs/ramfs-like ones).

[ Parent | Reply to this comment ]

Posted by Anonymous (62.226.xx.xx) on Thu 27 Jan 2011 at 01:06
Yes, I see the parallel, and I'm hoping for KVM/QEMU's qcow2 to learn to interpret ATA TRIM. Though Wikipedia states that

"The TRIM command does not work with disks which are stored in disk image files. This is caused by the fact that computer files can only be deleted completely or truncated at the end."

in fact a smart qcow2 driver could keep track of trimmed space, swap blocks from the end of the image file into the trimmed space (they're mapped anyway) and finally truncate the image, thus reducing its size live.

As for sparse loop files, I don't know whether the filesystem / kernel has a command to re-sparse a range within an open file. Reclaiming space has always been a file-copy operation for me.
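
As it happens, Linux gained exactly that interface around the time of this thread: fallocate(2) with FALLOC_FL_PUNCH_HOLE (kernel 2.6.38+, filesystem support varies). util-linux exposes it on the command line; a sketch (the offsets and sizes are made up):

    # punch an 8 MiB hole at offset 16 MiB into FILE
    # (needs kernel >= 2.6.38 and a filesystem that supports hole punching)
    fallocate --punch-hole --offset 16M --length 8M FILE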

[ Parent | Reply to this comment ]

Posted by Anonymous (62.226.xx.xx) on Thu 27 Jan 2011 at 01:13
Ah, progress: XFS and OCFS2 have gained exactly that ability just recently! See LWN article 415889 (https://lwn.net/Articles/415889/).

[ Parent | Reply to this comment ]

Posted by marki (188.167.xx.xx) on Thu 20 Jan 2011 at 17:49
Of course it works :) I have used a similar setup to recover a filesystem after the loss of one PV - I just created a new PV as a sparse file (in /dev/shm, so in memory) and restored the original metadata into it.
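
Roughly like this (paths, sizes and the VG name are examples; the PV UUID comes out of the metadata backup):

    # in-memory stand-in for the lost PV
    dd if=/dev/zero of=/dev/shm/pv.img bs=1M seek=10239 count=1
    losetup /dev/loop1 /dev/shm/pv.img
    # recreate the PV with its old UUID, then restore the VG metadata
    pvcreate --uuid <UUID> --restorefile /etc/lvm/backup/vg0 /dev/loop1
    vgcfgrestore -f /etc/lvm/backup/vg0 vg0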

[ Parent | Reply to this comment ]

Posted by banchieri (2a01:0xx:0xx:0xxx:0xxx:0xxx:xx) on Fri 21 Jan 2011 at 06:36
[ View Weblogs ]

I've never stated that the idea is new. It just crossed my mind and I thought that it might be useful for somebody else, so I shared it with you folks.

Obviously, it was useful for you...

[ Parent | Reply to this comment ]

Posted by Anonymous (72.249.xx.xx) on Thu 20 Jan 2011 at 20:21
When you need to recover the zeroed-out space, simply create another PV and add it to your volume group. Then "pvmove" off of the old PV and remove the old PV. During the transition, you might be using a lot of space, but after you're done, your PV will not be wasting space storing unused blocks.
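
As a sketch (device and VG names are examples):

    losetup /dev/loop1 NEWFILE   # fresh sparse backing file for the new PV
    pvcreate /dev/loop1
    vgextend vg0 /dev/loop1
    pvmove /dev/loop0            # migrate all extents off the old PV, online
    vgreduce vg0 /dev/loop0
    pvremove /dev/loop0
    losetup -d /dev/loop0        # the old backing file can now be deleted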

The best part? You can do all of this while your LVM volumes are still in use! That is, the migration from PV1 to PV2 does not affect the volumes that are sitting on top of them. Live migration! LVM rocks.

Alan Porter

[ Parent | Reply to this comment ]

Posted by banchieri (2a01:0xx:0xx:0xxx:0xxx:0xxx:xx) on Fri 21 Jan 2011 at 15:33
[ View Weblogs ]

That's close to another idea that once crossed my mind: A garbage-collecting filesystem which could do that sort of thing "in situ".

The basic idea is as follows:
(1) Filesystems usually divide the allotted space into equally-sized cylinder groups (CGs).
(2) Consequently, they maintain per-CG allocation bitmaps.
(3) If you also monitored per-CG allocation density and detected that some CG's density dropped below a certain limit (e.g., 25%), you could transparently start moving blocks out of that CG into other CGs (until the CG is empty) in order to reduce the overall CG count.
(4) Moreover, that would be a great opportunity to auto-defragment files. You're moving those blocks anyway...

If you combine that with volume management based on CG pools, filesystems could dynamically allocate free CGs when they need them as well as release them after they're garbage-collected.

How does that sound?

[ Parent | Reply to this comment ]

Posted by Anonymous (85.17.xx.xx) on Fri 21 Jan 2011 at 08:46
"With ext3/4 as underlying filesystem, ext2 within the loop device is enough. Why journal twice?"

But since the outer filesystem only sees one single file, wouldn't it make more sense to use ext2 on the outer filesystem and ext3/4 inside? The inner journaling would have a more detailed view of the metadata.

[ Parent | Reply to this comment ]

Posted by banchieri (2a01:0xx:0xx:0xxx:0xxx:0xxx:xx) on Fri 21 Jan 2011 at 15:01
[ View Weblogs ]

I think I'm able to guess what you're hinting at...

You're concerned with the fact that "inner" metadata appears as ordinary data to the "outer" filesystem and thus isn't treated specially. I hadn't thought of that, actually. But maybe you can fix that by altering the "outer" filesystem's type of journalling, as sketched below...
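
For instance (a sketch; data=journal is ext3/4's full-data journalling mode, and the device and mount point are examples):

    # make the outer ext3/4 journal file data as well, so that the inner
    # metadata - ordinary data to the outer fs - goes through the journal
    mount -o data=journal /dev/sdb1 /srv/volumes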

Still, IMHO, journalling only the "inner" filesystem bears a higher risk of (meta)data corruption.

As usual, it's trading performance for data integrity: if you want to play it really safe, you'd journal the "outer" as well as the "inner" filesystem, at the price of degraded performance.

If I'm informed correctly, that's exactly what happens when you run an ext3/4 guest on a VMware ESX host.

[ Parent | Reply to this comment ]

Posted by mcortese (20.142.xx.xx) on Mon 24 Jan 2011 at 16:07
[ View Weblogs ]
Journals (in their usual configuration) track changes to metadata only. Renaming or truncating files in the inner file system, for example, will result in no metadata change in the underlying one. If the inner file system has no journaling, you get no protection against failure (remember that renaming and truncating are allegedly the source of particularly bad corruption). Even options such as "ordered data" are useless when the file system has no clue what is real data and what is metadata.

So I would definitely opt for journaling on the inner file system.

[ Parent | Reply to this comment ]

Posted by Anonymous (212.255.xx.xx) on Fri 26 Jul 2013 at 21:35
Even if you really don't want to journal the inner filesystem, you'd get better performance by using ext4 with its journal disabled (a filesystem feature flag, set at mkfs or tune2fs time, rather than a mount parameter) than by using the old ext2: ext4 has much more advanced data structures than ext2.
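
A sketch of both routes (the device name is an example):

    # create ext4 without a journal from the start
    mkfs.ext4 -O ^has_journal /dev/loop0
    # or drop the journal from an existing, cleanly unmounted ext4
    tune2fs -O ^has_journal /dev/loop0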

[ Parent | Reply to this comment ]

Posted by vegiVamp (81.246.xx.xx) on Fri 21 Jan 2011 at 09:31
For those with the resources for SANs, have a look at 3Par; they have storage solutions that support this. What is known at the FS level as sparse files is known in the storage world as "thin provisioning". Same idea, but on LUNs.

[ Parent | Reply to this comment ]

Posted by banchieri (2a01:0xx:0xx:0xxx:0xxx:0xxx:xx) on Fri 21 Jan 2011 at 15:05
[ View Weblogs ]

That's exactly what I wanted to achieve: thin (over)provisioning, but only with mechanisms provided by "your ordinary Linux distro". ;-)

[ Parent | Reply to this comment ]

Posted by Anonymous (62.226.xx.xx) on Thu 27 Jan 2011 at 00:40
Some thoughts: you can only reclaim (re-sparse) consecutive zeroes from the loop file. E.g. if you fill up the loop filesystem with one big file of random data and then delete it, you can't shrink the backing file. First use a tool to overwrite the free space with zeroes (inside the mounted loop filesystem: cat /dev/zero > zeroes; rm zeroes), then you can re-sparse the loop file.

You can re-sparse offline, using "cp --sparse=always" (though cp will usually detect and preserve sparseness automatically, too), or, as proposed above, by moving the VG to a new file, which is a neat idea - see the sketch below.
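
For example (file names are made up; detach the loop device first):

    umount /mnt && losetup -d /dev/loop0
    # copy with holes re-created wherever blocks are all-zero, then swap
    cp --sparse=always volume.img volume.sparse
    mv volume.sparse volume.img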

If you need more loop devices than are available (8 is the default), detach them all, rmmod loop, and modprobe it again with a parameter (see the docs) to create more device nodes.
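
For example (max_loop is the loop module's parameter; 64 is an arbitrary choice):

    losetup -d /dev/loop0      # repeat for every attached loop device
    rmmod loop
    modprobe loop max_loop=64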

I have no idea how safe loop files are when it comes to transactions. I doubt they're written to synchronously, or that they implement cache flushes or barriers. So possibly, journaled or not, all data may be lost.

There's also an option integrated into LVM, --virtualsize, but it's rather odd because it also takes a --size parameter limiting the real allocation to that size. It prevents overcommitment that way, granted, but a filesystem hitting the limit will probably just stop working. You'd have to monitor usage. Is there a daemon for that?
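
For reference, a sparse LV is created like this (VG and LV names are examples; 10G is what the filesystem will see, 1G is what LVM actually sets aside):

    lvcreate --snapshot --virtualsize 10G --size 1G --name sparselv vg0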

[ Parent | Reply to this comment ]

Posted by Anonymous (193.62.xx.xx) on Tue 8 Feb 2011 at 13:30
Before you use this seriously, watch what happens when you try to use over-committed space: make a small filesystem ($FS), put your sparse file ($FS/$spF) plus loopback device in it, and make a filesystem ($inFS) or PV on it.

Now fill up $FS by writing a non-sparse $FS/junk.dat; finally, write data to $inFS until $FS is almost full. For production use, this is where you must stop!

See that $inFS thinks there is space remaining. Write some data to that space and watch dmesg(1). There is nowhere to write this data, but it is too late to tell the writer of $inFS/my_important_file.dat.
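
A concrete sketch of that experiment (sizes, paths and device names are invented):

    # $FS is a small (say 256 MiB) filesystem; the inner one claims 512 MiB
    dd if=/dev/zero of=$FS/spF bs=1M seek=511 count=1
    losetup /dev/loop0 $FS/spF
    mke2fs /dev/loop0
    mkdir -p /mnt/inner && mount /dev/loop0 /mnt/inner
    # eat up the outer space, then overfill the inner filesystem
    dd if=/dev/zero of=$FS/junk.dat bs=1M count=200
    dd if=/dev/zero of=/mnt/inner/my_important_file.dat bs=1M count=100
    dmesg | tail    # I/O errors on loop0: the data had nowhere to go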

[ Parent | Reply to this comment ]


Posted by gpall (155.207.xx.xx) on Wed 11 Jan 2012 at 06:19
[ View Weblogs ]
And of course, you can add one more layer between (2) and (3) in order to implement encryption:

(2.5): cryptsetup luksFormat ...
(2.6): cryptsetup luksOpen ...
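
Spelled out, the stack might look like this (the "cryptovol" mapping name is just an example):

    cryptsetup luksFormat /dev/loop0
    cryptsetup luksOpen /dev/loop0 cryptovol
    mke2fs /dev/mapper/cryptovol      # step (3) now targets the mapping
    mount /dev/mapper/cryptovol /mnt  # as does step (4)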

--
No-one cares if you backup. Only if you can restore.

[ Parent | Reply to this comment ]

Posted by Anonymous (178.201.xx.xx) on Mon 8 Oct 2012 at 13:15
No, you cannot. This encrypts the data going to the loop device. Zeros will be encrypted, too, obviously. But encrypted zeros end up as pseudo-random data, on which the sparse mechanism fails.

[ Parent | Reply to this comment ]


Posted by Anonymous (2001:0xx:0xx:0xxx:0xxx:0xxx:xx) on Wed 10 Oct 2012 at 15:20
You can encrypt. It is true the NULs will be encrypted to noise; but disk space is only allocated when something is actually written into the hole - and it doesn't matter what is written.

I think it is not possible to recover disk space by "turning it back into a sparse hole", in any case.

[ Parent | Reply to this comment ]

Posted by Anonymous (212.255.xx.xx) on Fri 26 Jul 2013 at 21:43
Not like this, but with EncFS it will work, because only file contents are encrypted, not the free space. However, you obviously get less security then, because many things are not encrypted: file names, file sizes, access times...

If you want encryption, why not just encrypt the host fs? Then you can also transparently enjoy sparse file containers, which don't need any further encryption.

[ Parent | Reply to this comment ]

Posted by axelabs (216.52.xx.xx) on Sat 19 Oct 2013 at 17:48
The following blog describes this process even better:

http://www.anthonyldechiaro.com/blog/2010/12/19/lvm-loopback-how-to/

[ Parent | Reply to this comment ]
