A perl script to list duplicate files
Authored by: merlyn on Mar 02, '06 07:16:56AM

This program does a lot more work than it needs to, because it doesn't first group files by length (cheap to compute) before reading their contents to compute the MD5 (expensive to compute). As an alternative, comparing files pair-wise using File::Compare (once you've determined that the lengths are the same) is often faster, as it reads only as much of each file as it takes to find the first difference.
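The size-first approach can be sketched roughly as follows. This is an illustration in Python rather than the Perl of the original script, and it is not the hint author's code: `find_duplicates` is a hypothetical name, and the standard-library `filecmp.cmp(..., shallow=False)` stands in for File::Compare as a byte-wise comparison that stops at the first differing chunk.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1 (cheap): bucket files by size. A file whose size is unique
    # cannot have a duplicate, so its contents are never read.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step 2 (expensive): compare contents only within a size bucket.
        # Each group holds files identical to the group's first member.
        groups = []
        for path in same_size:
            for group in groups:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicate_sets.extend(g for g in groups if len(g) > 1)
    return duplicate_sets
```

Within a size bucket, each file is compared only against one representative per group, so a bucket of mostly distinct files costs a few short reads, while a bucket of many identical files still reads every file in full.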



A perl script to list duplicate files
Authored by: fri2219 on Mar 02, '06 07:57:54AM

Your antecedent reference wasn't clear: File::Compare as an alternative to Digest::MD5?

If so, I'm not sure you can flatly state that File::Compare would be less expensive (in terms of execution time?) than Digest::MD5 for all comparison operations. There must be (pathological) cases where that isn't true. Something like Digest::SHA1 or even Digest::CRC might be a middle path in the case of sparse files.

This would be a nice weekend experiment over a large sample of files, with distributions of sparse files and compressed formats... I'm sure a well-designed experiment run by a team of actuaries and industrial engineers could clear this up empirically. (And no, my 400 MHz G4 isn't going to cut it :)

My working hypothesis is that file composition would largely determine the results, followed by how well the implementation is optimized for a given machine's processor.
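One simple starting point for such an experiment is to count how many bytes each approach has to read. The sketch below is in Python rather than Perl, with `hashlib.md5` standing in for Digest::MD5 and a hand-written chunked comparison standing in for File::Compare; the function names and the 64 KB chunk size are assumptions made for illustration. It measures I/O only, not the CPU cost of the digest.

```python
import hashlib

CHUNK = 64 * 1024  # assumed read size; real libraries choose their own


def bytes_read_by_compare(path_a, path_b, chunk=CHUNK):
    """Compare two files chunk by chunk.

    Returns (equal, bytes_read), where bytes_read counts the bytes taken
    from one file before the comparison could stop.
    """
    read = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            read += max(len(a), len(b))
            if a != b:
                return False, read  # stop at the first differing chunk
            if not a:
                return True, read   # both files exhausted, no difference


def bytes_read_by_digest(path, chunk=CHUNK):
    """Hash a whole file with MD5. Returns (hexdigest, bytes_read)."""
    md5 = hashlib.md5()
    read = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            md5.update(block)
            read += len(block)
    return md5.hexdigest(), read
```

A digest always reads the whole file, but only once per file, and the result can be reused against any number of other files. A direct comparison can stop early when files differ near the start, but reads both files to the end when they are identical, and repeats that work for every pair compared.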



revised version
Authored by: hayne on Mar 02, '06 05:40:19PM
A revised version of the script that follows your suggestion to do an initial collating by data-fork size is now available in the forums thread. This new version runs about twice as fast as the original when I test it on the AIFF loop folders on my machine.

The new version also reports on cases where the resource forks differ even though the data-forks are the same.
