Creating dynamic volumes with loop devices
Posted by banchieri on Thu 20 Jan 2011 at 09:52
Recently, my focus of interest turned to volume management. Since with Linux, LVM2 seems to be the only viable solution (unless you go for cluster filesystems), I started reading about LVM. Somewhere, I read about setting up a loop device as PV and that's where I got intrigued: Couldn't I just use a sparse file with one single large hole and let my ext4 fill that when there's actually something written onto the loop device?
So for proof of concept, I created the volume with
(1) dd if=/dev/zero of=FILE bs=X seek=Y-1 count=1,
did the
(2) losetup /dev/loop0 FILE,
(3) mke2fs /dev/loop0,
(4) mount /dev/loop0 /mnt,
and after each step, checked disk usage with du -k FILE and filesystem usage with df -k /mnt.
Guess what? It worked!
For further testing, I started copying files into my "dynamic volume", watched it grow, and ultimately unmounted it.
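The sparse-file part of step (1) can be tried without root. What follows is only a minimal sketch: the temporary file name and the 64 MiB size (bs=1024, seek=65535) are arbitrary choices for illustration, not values from the post, and it assumes the underlying filesystem supports holes. The losetup, mke2fs and mount steps need root and are left out.

```shell
# Hypothetical file and sizes, chosen only for illustration.
FILE=$(mktemp)

# Write a single 1 KiB block at the very end of a 64 MiB file;
# everything before that block is one large hole.
dd if=/dev/zero of="$FILE" bs=1024 seek=65535 count=1 2>/dev/null

apparent=$(wc -c < "$FILE")                # size the file claims to have, in bytes
allocated_kb=$(du -k "$FILE" | cut -f1)    # space actually allocated, in KiB

echo "apparent size: $apparent bytes, allocated: $allocated_kb KiB"
rm -f "$FILE"
```

On a filesystem with sparse-file support, the allocated figure should stay far below the apparent size until something is actually written into the file.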
Since that worked out so well, I'm gonna use that trick in the Linux Containers setup I'm currently working on, because it
(a) lets the volume grow dynamically and
(b) puts an upper limit on volume size.
Ah, and: With ext3/4 as underlying filesystem, ext2 within the loop device is enough. Why journal twice?
"deleting the file would not zero its contents - So the space will stay occupied."
Where have I heard similar concepts lately? Sure, talking about how file systems lack a way to inform the underlying device that some blocks are not needed anymore! Can you see the parallel?
Take a solid state device, make a file system on it, populate it with large files, then delete them. The SSD will still see most of its blocks as used, and its wear reduction strategy will lose efficiency. Some solutions have been proposed (zeroing the blocks, issuing a trim command...) but, to date, none has reached a big consensus.
In the case above, the ideal solution would be to have a top-level file system which zeros or trims blocks upon file deletion, and a loop device that understands that and translates it to holes in the underlying file.
Well, yes, that's how filesystems behave. On the other hand, if you put another large file onto the volume after deleting the first one, the volume won't grow because blocks get reallocated by the "inner" filesystem.
So far, I haven't heard of auto-shrinking filesystems (with the notable exception of tmp/ramfs-like ones).
"The TRIM command does not work with disks which are stored in disk image files. This is caused by the fact that computer files can only be deleted completely or truncated at the end."
In fact, a smart qcow2 driver could keep track of trimmed space, swap blocks from the end of the image file into the trimmed space (they're mapped anyway) and finally truncate the image, thus reducing its size live.
As for sparse loop files, I don't know whether the filesystem / kernel has a command to turn a range within an open file back into a hole. Reclaiming space was always a file copy operation for me.
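For what it's worth, newer Linux kernels do offer hole punching through fallocate(2), and util-linux ships a fallocate tool with a --punch-hole option. The sketch below is hedged accordingly: it assumes that tool is installed and that the filesystem supports punching holes, and it falls back gracefully if either is missing. The file name and sizes are made up for illustration.

```shell
# Hypothetical image file: 4 MiB of real (written) zero blocks.
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1024 count=4096 2>/dev/null
sync
before_kb=$(du -k "$IMG" | cut -f1)

# Try to deallocate the first 2 MiB in place; the file size stays the same.
if command -v fallocate >/dev/null 2>&1 &&
   fallocate --punch-hole --offset 0 --length 2097152 "$IMG" 2>/dev/null; then
    punched=yes
else
    punched=no    # tool missing, or filesystem without hole-punching support
fi

sync
size_after=$(wc -c < "$IMG")
after_kb=$(du -k "$IMG" | cut -f1)
echo "punched=$punched size=$size_after bytes, allocated: $before_kb -> $after_kb KiB"
rm -f "$IMG"
```

Whether a loop device passes discards from the inner filesystem down to the backing file as holes depends on the kernel version, so treat this as a demonstration of the file-level mechanism only.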
I've never stated that the idea is new. It just crossed my mind and I thought that it might be useful for somebody else, so I shared it with you folks.
Obviously, it was useful for you...
The best part? You can do all of this while your LVM volumes are still in use! That is, the migration from PV1 to PV2 does not affect the volumes that are sitting on top of them. Live migration! LVM rocks.
Alan Porter
That's close to another idea that once crossed my mind: A garbage-collecting filesystem which could do that sort of thing "in situ".
The basic idea is as follows:
(1) Filesystems usually divide the allotted space into equally-sized cylinder groups (CGs).
(2) Consequently, they maintain per-CG allocation bitmaps.
(3) If you'd also monitor per-CG allocation density and detect that some CG's allocation density drops below a certain limit (e.g., 25%), you'd transparently start moving blocks from that CG to other CGs (until the CG is empty) in order to reduce the overall CG count.
(4) Moreover, that would be a great opportunity to auto-defragment files. You're moving those blocks anyway...
If you combine that with volume management based on CG pools, filesystems could dynamically allocate free CGs when they need them as well as release them after they're garbage-collected.
How does that sound?
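The density check in step (3) could be sketched roughly like this. It is a toy model only: the CG size, the per-CG block counts and the variable names are invented for illustration, and nothing here touches a real filesystem or moves any blocks.

```shell
# Toy model: blocks per CG and used blocks in CG 0..4 (all values made up).
cg_size=1000
used_per_cg="900 120 640 0 240"
threshold=25          # evacuate CGs whose allocation density is below 25%

evacuate=""
i=0
for used in $used_per_cg; do
    density=$(( used * 100 / cg_size ))    # allocation density in percent
    # Empty CGs need no evacuation; sparsely used ones are candidates.
    if [ "$used" -gt 0 ] && [ "$density" -lt "$threshold" ]; then
        evacuate="$evacuate $i"
    fi
    i=$(( i + 1 ))
done
evacuate=${evacuate# }    # strip the leading space

echo "CGs to evacuate: $evacuate"
```

A real implementation would also have to check that the remaining CGs have enough free blocks to take over the evacuated ones before it starts moving anything.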
But the outer filesystem only sees one single file. Wouldn't it make sense to use ext2 as the outer filesystem and ext3/4 inside? The inner journaling would have a more detailed view of the metadata.
I think I'm able to guess what you're hinting at...
You're concerned with the fact that "inner" metadata appears as ordinary data to the "outer" filesystem and thus isn't treated specially. Haven't thought of that, actually. But maybe you're able to fix that by altering the "outer" filesystem's type of journalling...
Still IMHO, only journalling the "inner" filesystem bears a higher risk of (meta)data corruption.
As usual, it's trading performance for data integrity: If you want to play really safe, you'd journal the "outer" as well as the "inner" filesystems at the price of degraded performance.
If I'm informed correctly, that's exactly what happens when you run an ext3/4 guest on a VMware ESX host.
So I would definitely opt for journaling on the inner file system.
That's exactly what I wanted to achieve: thin (over)provisioning, but only with mechanisms provided by "your ordinary Linux distro". ;-)
You can resparse offline, using "cp --sparse=always" (plain cp only keeps holes that already exist in the source; it won't turn written zero blocks into holes), or, as proposed, by moving the VG to a new file, which is a neat idea.
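A small sketch of the offline resparsing route, assuming GNU cp (the --sparse option is a GNU coreutils extension) and a filesystem that supports holes. File names and the 4 MiB size are made up for illustration; in real use, the loop device would have to be detached before copying its backing file.

```shell
# Hypothetical source: 4 MiB of zeros that were really written to disk.
SRC=$(mktemp)
DST=$(mktemp)
dd if=/dev/zero of="$SRC" bs=1024 count=4096 2>/dev/null

# Copy and turn every run of zero bytes into a hole in the destination.
cp --sparse=always "$SRC" "$DST"
sync

src_kb=$(du -k "$SRC" | cut -f1)
dst_kb=$(du -k "$DST" | cut -f1)
dst_size=$(wc -c < "$DST")

# The contents must be byte-for-byte identical; only the allocation differs.
if cmp -s "$SRC" "$DST"; then same=yes; else same=no; fi

echo "identical=$same, allocated: source $src_kb KiB, copy $dst_kb KiB"
rm -f "$SRC" "$DST"
```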
If you need more loop files than available (8 is default), detach all, rmmod loop and modprobe with some parameter (see docs) to create more device nodes.
I have no idea how safe loop files are when it comes to transactions. I doubt they're being written to synchronously, or implement cache flushes or barriers. So possibly, journaled or not, all data may be lost.
There's also an integrated option to LVM, --virtualsize, but that's rather weird because it also takes a --size parameter, limiting the filesystem to that size. It prevents overcommitment that way, granted, but filesystems hitting the limit will probably just stop working. You'd have to be monitoring usage. Is there a daemon for that?
Now, fill up $FS by writing a non-sparse $FS/junk.dat; finally, write data to $inFS until $FS is almost full. For production use, this is where you must stop!
See that $inFS thinks there is space remaining. Write some data to that space and watch dmesg(1). There is nowhere to write this data, but it is too late to tell the writer of $inFS/my_important_file.dat.
I've been trying to learn this stuff for a while now. You definitely helped point me in the right direction!
(2.5): cryptsetup luksFormat ...
(2.6): cryptsetup luksOpen ...
--
No-one cares if you backup. Only if you can restore.
I think it is not possible to recover disk space by "turning it back into sparse-hole", in any case.
Dear editors,
thanks for the "sparse file".
banchieri