Weblog entry #48 for ajt
I've been working on a script to find duplicate files[1]. After bothering to write the script I found that there are lots of shareware versions for Windows, and a few GNU versions. Like all things open source, mine differs from the others I've found:
- Mine uses SHA1 digests, not CRC32 or MD5.
- Works across any platform where Perl runs.
- Doesn't try to do fancy delete & hardlink.
- Doesn't have a point and drool interface.
- Doesn't cost money...
I'm working on version 0.2 at the moment, which fixes a bug or two, should run marginally faster, and will include GLOBing for input filtering. I've also promised it to people, so I have to deliver it this week...
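
For readers who want to see the basic idea, here is a minimal sketch of digest-based duplicate detection with an optional glob filter. It is written in Python rather than Perl and is not the script described above; the function names `sha1_of_file` and `find_duplicates` and the `pattern` parameter are made up for illustration.

```python
import fnmatch
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, pattern="*"):
    """Group files under root by SHA-1 digest.

    Only file names matching the glob pattern are considered (hypothetical
    filter, loosely modelled on the globbing mentioned for version 0.2).
    Returns a list of groups, each a sorted list of paths with equal digests.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            by_digest[sha1_of_file(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates("/some/dir", "*.jpg")` would then return one list per set of identical files. Hashing every file is the simple approach; a real tool would probably compare file sizes first to avoid hashing files that cannot match.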
Comments on this Entry
Package: fdupes
Priority: optional
Section: utils
Installed-Size: 80
Maintainer: Adrian Bridgett <bridgett@debian.org>
Architecture: i386
Version: 1.40-4
Depends: libc6 (>= 2.3.2.ds1-4)
Filename: pool/main/f/fdupes/fdupes_1.40-4_i386.deb
Size: 14066
MD5sum: 8e527f7436a6394702d24bb6fd7fabca
Description: Identifies duplicate files within given directories
FDupes uses md5sums and then a byte by byte comparison to find duplicate
files within a set of directories. It has several useful options
including recursion.
I've used it a couple of times and it works great. Hopefully your script is (even) better :)
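
The package description above mentions md5sums followed by a byte-by-byte comparison. A rough sketch of that two-step check, in Python and not taken from the fdupes source, might look like this; `md5_of_file` and `same_content` are illustrative names.

```python
import filecmp
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Cheap digest check first, then a full byte-by-byte confirmation."""
    if md5_of_file(path_a) != md5_of_file(path_b):
        return False
    # shallow=False makes filecmp compare the file contents, not just stat data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The second step costs another read of both files, but it rules out the (unlikely) case of two different files sharing a digest.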
[ Parent | Reply to this comment ]
I use SHA1 rather than MD5, which should make false matches from digest collisions less likely. Mine is also written in Perl, so it will run anywhere Perl does and you don't need a compiler. I don't do a byte-by-byte comparison, and I don't offer to delete files and create hardlinks to a master copy. To each their own...
See also:
http://premium.caribe.net/~adrian2/fdupes.html
http://www.pixelbeat.org/fslint/
http://www.stearns.org/freedups/
My work-in-progress page:
http://iredale.dyndns.org/Perl/find-duplicated-files.html
--
"It's Not Magic, It's Work"
Adam
[ Parent | Reply to this comment ]
http://www.pixelbeat.org/fslint/
Note FSlint compares with both sha1 and md5
to protect against the documented collisions.
If you want to compare raw speed, then
use the CLI version directly.
The easiest way is:
cd /usr/share/fslint/fslint
time ./findup --gui /test/dir/ >/dev/null
I suggest you test both cached
and non cached searches, each of which
fslint explicitly optimizes for.
Note fslint does sophisticated
hardlink processing that allows
the incremental merging of hardlinks efficiently,
so consider this if you do manage to find anything
faster than it (I haven't yet).
cheers,
Pádraig.
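
As an illustration of the sha1-plus-md5 double check mentioned in this comment, here is a small Python sketch that computes both digests in a single read of the file. It is not FSlint's code; the function name `sha1_and_md5` is an assumption made for the example.

```python
import hashlib


def sha1_and_md5(path, chunk_size=65536):
    """Return (sha1_hex, md5_hex) for a file, reading it only once."""
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return sha1.hexdigest(), md5.hexdigest()
```

Two files would be treated as duplicates only if both values in the returned pair match, so a collision in one algorithm alone is not enough to produce a false match.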
[ Parent | Reply to this comment ]
I know it's an absurd situation, but in many places of work you can't compile C/C++ code on Windows, as that counts as development and they don't do development. However, running a Perl (or other similar) script is okay...
I'm just timing your fslint from the Debian Etch repository (2.16-1) against my Perl script on my home directory tree (~36G).
Your tool takes (on its first and second pass):
real ~3m
user ~22s
sys ~9s
My tool (version 0.3) takes (on its first pass):
real ~59s
user ~4s
sys ~9s
(second pass)
real ~5s
user ~4s
sys ~4s
Mine is very sensitive to caching done by the OS: as you can see, the second pass is very much faster. Even so, it's still a lot faster on the first pass.
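
For anyone repeating this kind of first-pass versus second-pass comparison, a small timing helper along these lines may be handy. It is a Python sketch with made-up names (`time_passes`, `scan`), and it only measures wall-clock time, not the user and sys figures that `time` reports.

```python
import time


def time_passes(scan, passes=2):
    """Call scan() repeatedly and return the wall-clock seconds of each pass."""
    timings = []
    for _ in range(passes):
        start = time.perf_counter()
        scan()
        timings.append(time.perf_counter() - start)
    return timings
```

Here `scan` is any zero-argument callable that runs one full search. The first figure is only a true uncached result if the files were not already in the OS cache when the run started.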
--
"It's Not Magic, It's Work"
Adam
[ Parent | Reply to this comment ]
I tested 0.2 of fdf as I couldn't find 0.3
It's nice and fast, well done.
I did notice that it didn't ignore symlinks,
so you would be even faster and more correct
doing that. Also I had to manually install
Digest::SHA, is there an ubuntu package for that?
Anyway my performance tests on cached files
on ubuntu breezy:
$ time ./fdf /usr/share/doc > /dev/null
real 0m1.208s
user 0m0.976s
sys 0m0.208s
$ time findup --gui /usr/share/doc > /dev/null
real 0m1.454s
user 0m1.253s
sys 0m0.175s
#The following with sha1 double check removed
$ time findup --gui /usr/share/doc > /dev/null
real 0m1.168s
user 0m0.965s
sys 0m0.172s
[ Parent | Reply to this comment ]
Digest::SHA isn't in Debian or a Debian derivative, but Digest::SHA1 is, and it will work the same; you just need to tweak the code a little. Digest::SHA will be in the next build of Perl, which is why I'm using it here.
Symlinks are skipped in 0.3, and I'm doing my best to clean up the code at the moment. File::Find is a great module, but it forces you to use globals, which makes for really nasty Perl.
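
To show what skipping symlinks means in practice, here is a minimal Python sketch (not the Perl code from version 0.3, and the name `regular_files` is made up). A generator like this also keeps the traversal state local, with no globals needed.

```python
import os


def regular_files(root):
    """Yield paths of regular files under root, skipping symbolic links."""
    # followlinks=False keeps os.walk from descending into symlinked directories.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a symlink would otherwise show up as a duplicate of its target
            if os.path.isfile(path):
                yield path
```

Skipping links avoids both the extra reads and the false "duplicates" that are really the same file reached by two names.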
--
"It's Not Magic, It's Work"
Adam
[ Parent | Reply to this comment ]
If anyone wants to compare the performance to the alternatives, I am interested in the results. Try it at http://rdfind.paulsundvall.net
[ Parent | Reply to this comment ]
I shall add rdfind to my list of alternatives in the documentation.
--
"It's Not Magic, It's Work"
Adam
[ Parent | Reply to this comment ]