Weblog entry #48 for ajt

Finding Duplicate Files #2
Posted by ajt on Tue 1 Aug 2006 at 21:30

I've been working on a script to find duplicate files[1]. After bothering to write the script, I found that there are lots of shareware versions for Windows, and a few GNU versions. Like all things open source, mine is different from the others I've found:

  • Mine uses SHA1 digests, not CRC32 or MD5.
  • Works across any platform where Perl runs.
  • Doesn't try to do fancy delete & hardlink.
  • Doesn't have a point and drool interface.
  • Doesn't cost money...

I'm working on version 0.2 at the moment, which fixes a bug or two, should run marginally faster, and will include GLOBing for input filtering. I've also promised it to people, so I have to deliver it this week...
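As a rough illustration of the approach described above (SHA1 digests plus glob-style input filtering), here is a minimal Python sketch. It is not the Perl script itself; the function names `sha1_of` and `find_duplicates` and the `pattern` parameter are made up for this example.

```python
import fnmatch
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, pattern="*"):
    """Group files under root by digest and keep groups of two or more.

    `pattern` is a hypothetical glob filter applied to file names only.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            by_digest[sha1_of(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates("/some/dir", "*.jpg")` would return a list of groups, each group being the paths that share one digest. The dictionary keyed on digest does the real work: any key with more than one path is a set of likely duplicates.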

  1. use.perl.org/~ajt/journal/30485

 

Comments on this Entry

Posted by oxtan (80.126.xx.xx) on Wed 2 Aug 2006 at 19:46
natxete@tux:~$ apt-cache show fdupes
Package: fdupes
Priority: optional
Section: utils
Installed-Size: 80
Maintainer: Adrian Bridgett <bridgett@debian.org>
Architecture: i386
Version: 1.40-4
Depends: libc6 (>= 2.3.2.ds1-4)
Filename: pool/main/f/fdupes/fdupes_1.40-4_i386.deb
Size: 14066
MD5sum: 8e527f7436a6394702d24bb6fd7fabca
Description: Identifies duplicate files within given directories
FDupes uses md5sums and then a byte by byte comparison to find duplicate
files within a set of directories. It has several useful options
including recursion.

I've used it a couple of times and it works great. Hopefully your script is (even) better :)
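The two-stage check in that package description (digest first, then a byte-by-byte comparison) can be sketched in Python as below. This is only an illustration of the idea and not fdupes' actual code; `md5_of` and `confirmed_duplicates` are hypothetical names.

```python
import filecmp
import hashlib
from collections import defaultdict


def md5_of(path):
    """Return the hex MD5 digest of a file (reads the whole file at once)."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def confirmed_duplicates(paths):
    """Return (first, other) pairs with equal MD5 and identical bytes."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5_of(path)].append(path)

    pairs = []
    for candidates in by_digest.values():
        first = candidates[0]
        for other in candidates[1:]:
            # shallow=False forces a real content comparison, not just stat().
            # Simplification: each candidate is only compared against the first.
            if filecmp.cmp(first, other, shallow=False):
                pairs.append((first, other))
    return pairs
```

The digest pass is cheap bookkeeping; the byte comparison only runs on files that already look identical, so it rules out digest collisions without comparing every file against every other.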

Posted by ajt (84.12.xx.xx) on Wed 2 Aug 2006 at 20:15
Mine is different...

I use SHA1 rather than MD5, which should give you fewer false matches from digest collisions. Mine is also written in Perl, so it will run anywhere Perl does and you don't need a compiler. I don't do a byte-by-byte comparison, nor do I offer to delete files and create hardlinks to a master copy. To each their own...

See also:

http://premium.caribe.net/~adrian2/fdupes.html
http://www.pixelbeat.org/fslint/
http://www.stearns.org/freedups/

My work-in-progress page:
http://iredale.dyndns.org/Perl/find-duplicated-files.html

--
"It's Not Magic, It's Work"
Adam

Posted by pixelbeat (194.125.xx.xx) on Sat 12 Aug 2006 at 02:33
Hi, I'm glad you noticed my FSlint:
http://www.pixelbeat.org/fslint/

Note FSlint compares both with sha1 and md5
to protect against the documented clashes.
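For readers wondering what checking with two digests looks like, here is a small Python sketch that computes both in a single read of the file. It is an assumption-laden illustration, not how FSlint is implemented; `sha1_and_md5` is a made-up name.

```python
import hashlib


def sha1_and_md5(path, chunk_size=1 << 20):
    """Return (sha1_hex, md5_hex) for a file, reading it only once."""
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return sha1.hexdigest(), md5.hexdigest()
```

Two files are then treated as duplicates only if both values in the tuple match, so a collision in one algorithm alone is not enough to cause a false match.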

If you want to compare raw speed, then
use the CLI version directly.
The easiest way is:

cd /usr/share/fslint/fslint
time ./findup --gui /test/dir/ >/dev/null

I suggest you test both cached
and non cached searches, each of which
fslint explicitly optimizes for.

Note fslint does sophisticated
hardlink processing that allows
the incremental merging of hardlinks efficiently,
so consider this if you do manage to find anything
faster than it (I haven't yet).

cheers,
Pádraig.

Posted by ajt (84.12.xx.xx) on Sat 12 Aug 2006 at 11:01
The original requirement was to scan Microsoft NTFS filesystems via Samba from a Linux box. The reason I did anything more with my script was because a fellow Linux user needed something for his work's Windows boxen. In that case Perl was cross-platform and acceptable; C code, which I accept may be much faster, isn't an option, as C won't run on a Windows machine without buying a compiler...

I know it's an absurd situation, but in many places of work you can't compile C/C++ code on Windows as that's development and they don't do development. However, running a Perl (or other similar) script is okay...

I'm just timing your fslint from the Debian Etch repository (2.16-1) against my script written in Perl, versus my home directory tree (~36G).

Your tool takes (on its first and second pass):
real ~3m
user ~22s
sys ~9s

My tool (version 0.3) takes (on its first pass):
real ~59s
user ~4s
sys ~9s
(second pass)
real ~5s
user ~4s
sys ~4s

Mine is very sensitive to caching done by the OS; as you can see, the second pass is very much faster. Even so, it's still a lot faster on the first pass.

--
"It's Not Magic, It's Work"
Adam

Posted by pixelbeat (213.202.xx.xx) on Sat 12 Aug 2006 at 12:21
There is something seriously wrong there.

I tested 0.2 of fdf as I couldn't find 0.3.
It's nice and fast, well done.

I did notice that it didn't ignore symlinks,
so you would be even faster and more correct
doing that. Also I had to manually install
Digest::SHA. Is there an Ubuntu package for that?

Anyway, my performance tests on cached files
on Ubuntu Breezy:

$ time ./fdf /usr/share/doc > /dev/null

real 0m1.208s
user 0m0.976s
sys 0m0.208s

$ time findup --gui /usr/share/doc > /dev/null

real 0m1.454s
user 0m1.253s
sys 0m0.175s

# The following is with the sha1 double check removed
$ time findup --gui /usr/share/doc > /dev/null

real 0m1.168s
user 0m0.965s
sys 0m0.172s

Posted by ajt (84.12.xx.xx) on Sat 12 Aug 2006 at 14:15
Version 0.3 is "in development" at the moment, so I've not uploaded it to anywhere yet.

Digest::SHA isn't in Debian or any Debian derivative, but Digest::SHA1 is, and it will work the same; you just need to tweak the code a little. Digest::SHA will be in the next release of Perl, which is why I'm using it here.

Symlinks are skipped in 0.3, and I'm doing my best to clean up the code at the moment. File::Find is a great module, but it forces you to use globals, which makes for really nasty Perl.
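To show the symlink-skipping idea in isolation, here is a minimal Python sketch (Python rather than Perl, so it says nothing about how version 0.3 does it). The name `regular_files` is hypothetical.

```python
import os


def regular_files(root):
    """Yield paths of regular files under root, skipping symlinks."""
    # followlinks=False (the default) keeps os.walk from descending
    # into directories that are reached through a symlink.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks to files, and anything that is not a plain file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            yield path
```

Skipping links avoids hashing the same underlying file twice and reporting a link as a duplicate of its own target. Writing the walk as a generator also keeps the traversal state local, with no shared globals.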

--
"It's Not Magic, It's Work"
Adam

Posted by Anonymous (130.237.xx.xx) on Mon 7 Aug 2006 at 09:33
You might also try rdfind. It uses sha1 or md5, and tries to be efficient. No GUI. GPL license.

If anyone wants to compare the performance to the alternatives, I am interested in the results. Try it at http://rdfind.paulsundvall.net

Posted by ajt (204.193.xx.xx) on Mon 7 Aug 2006 at 10:13
By using C it should be fast. My Perl script uses compiled C for all the CPU-intensive pieces from the Digest::SHA module, and Perl for the ease of dealing with hashes for working out what's a dupe and what's not.

I shall add rdfind to my list of alternatives in the documentation.

--
"It's Not Magic, It's Work"
Adam
