
A perl script to list duplicate files
Whenever you have a collection of data files that came from several different sources, some of them may be duplicates of one another. For example, you might have sound loops that came with several different products (GarageBand, Soundtrack, etc.), with some overlap between them. If you want to save disk space by clearing out the duplicates, the first thing you need to do is identify which files are duplicates of each other. This is not always easy, since the file names are often unrelated.

(There is a previous hint that supplies a script for deleting duplicate sound loops, but that script relied on lists of duplicates that had been prepared beforehand. There is also a hint on finding and removing iTunes duplicates via a Perl script.)

Today's hint supplies a script that will search for duplicate files in specified folders and output a list of the files that are duplicates of each other. (That is, it produces the kind of lists that the first of the hints mentioned above relied on.)

The script is actually very general, so it can be used to search for duplicates of any type of file. What it does is compare the files based on the "MD5 hash," which is a sequence of characters that is computed from the content of the file. It is extremely unlikely (although theoretically possible) that two files with different content would have the same MD5 value. The file comparison does not look at the file names at all, just the content of the files.
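
For the curious, here is a minimal sketch (illustrative only, not taken from the hint's script) of how a file's MD5 can be computed in Perl with the standard Digest::MD5 module:

#!/usr/bin/perl
# Minimal sketch: print the MD5 hex digest of a file's contents.
use strict;
use warnings;
use Digest::MD5;

my $path = shift or die "usage: $0 file\n";
open my $fh, '<', $path or die "Cannot open $path: $!\n";
binmode $fh;                                  # hash the raw bytes of the file
print Digest::MD5->new->addfile($fh)->hexdigest, "\n";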

To use it, first copy and paste the source into your favorite plain-text editor. Save it under some name, and then do the usual things needed for running a script -- see this Unix FAQ in the forums if you have questions on that. Note that there is also a macosxhints forums thread on this topic, which will contain any updates made to the script. To put it another way, if you're reading this hint at some point in the future, you may wish to check that thread for a newer version than the one shown in the above source.

If the script file is called findDupeFiles, and it is in your current folder, then you could run it on the two folders /Documents/Apple Loops for Soundtrack and /Library/Application Support/GarageBand as follows:
./findDupeFiles '.aif|.aiff' "/Documents/Apple Loops for Soundtrack" \
"/Library/Application Support/GarageBand"
The first argument (.aif|.aiff) specifies that you want to look at files with either a .aif or .aiff suffix. You need to have this argument inside quotes, since the vertical bar (|) that separates the two file suffixes is a special character for the shell. You need to have the folder paths (the other two arguments) in quotes because the paths contain spaces. Note that this command will typically take several minutes to finish, and you won't see any output until just before the end.

The output (in the Terminal window) from the above command would list all the duplicates it found in those folders (and sub-folders). Each set of duplicates is separated from the next in the output by a line like this:
-----------------------
The above example showed how to use the script when looking for duplicates of AIFF files. You could use it similarly to find duplicates of any other files that have definite suffixes. But sometimes your data files don't have a uniform set of suffixes, or perhaps any suffixes at all. You can tell the script to search across all files (independent of suffix) by using an empty string (two quotes right next to each other with no characters in between) as the first argument to the script. For example...
./findDupeFiles '' ~/Documents
...would search your Documents folder for duplicate files with any suffix. (You do need to supply that empty string as the first argument, since otherwise the script would interpret ~/Documents as the suffix to search for.) Note also that if you don't supply any folder names when you invoke the script, it searches under the current folder; and if you supply no arguments at all, it searches all files, regardless of suffix, under the current folder.

A perl script to list duplicate files | 9 comments
A perl script to list duplicate files
Authored by: merlyn on Mar 02, '06 07:16:56AM

This program does a lot more work than it needs to, because it doesn't first collate lists of files by length (cheap to compute) before looking at the contents for the MD5 (expensive to compute). As an alternative approach, comparing files pair-wise using File::Compare (once you've determined that the lengths are the same) is often faster, as it reads only enough of each file to find the first difference.
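
A rough sketch of that size-first idea (illustrative only; this is not the hint's script, and names such as %by_size are made up):

use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Collate files by size first; only size groups with more than one member
# get the (much more expensive) MD5 pass.
my %by_size;
find(sub {
    push @{ $by_size{ -s $_ } }, $File::Find::name if -f $_;
}, @ARGV ? @ARGV : '.');

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next if @files < 2;                       # a unique size can't have a duplicate
    my %by_md5;
    for my $file (@files) {
        open my $fh, '<', $file or next;
        binmode $fh;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
        close $fh;
    }
    for my $set (grep { @$_ > 1 } values %by_md5) {
        print join("\n", @$set), "\n-----------------------\n";
    }
}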



A perl script to list duplicate files
Authored by: fri2219 on Mar 02, '06 07:57:54AM

Your antecedent reference wasn't clear: File::Compare as an alternative to Digest::MD5?

If so, I'm not sure you can flatly state that File::Compare would be less expensive (in terms of execution time?) for all comparison operations than Digest::MD5. There must be (pathological) cases where that isn't true -- something like Digest::SHA1 or even Digest::CRC might be a middle path in the case of sparse files.
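
For reference, the pair-wise check might look roughly like this -- an illustrative sketch only; the size pre-check and the helper name same_content are assumptions, not part of the hint's script:

use strict;
use warnings;
use File::Compare qw(compare);

# Byte-by-byte comparison, but only when a cheap size check can't rule the pair out.
sub same_content {
    my ($file1, $file2) = @_;
    return 0 unless -s $file1 == -s $file2;   # different sizes can never be identical
    return compare($file1, $file2) == 0;      # compare() returns 0 when the contents match
}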

This would be a nice weekend experiment over a large sample of files, with distributions of sparse files and compressed formats... I'm sure a well-designed experiment run by a team of actuaries and industrial engineers could clear this up empirically. (And no, my 400MHz G4 isn't going to cut it :)

My working hypothesis is that file composition would largely determine the results, followed by how well the implementation is optimized for any given machine's processor.



revised version
Authored by: hayne on Mar 02, '06 05:40:19PM
A revised version of the script that follows your suggestion to do an initial collating by data-fork size is now available in the forums thread. This new version runs about twice as fast as the original when I test it on the AIFF loop folders on my machine.

The new version also reports on cases where the resource forks differ even though the data-forks are the same.


A perl script to list duplicate files
Authored by: fds on Mar 02, '06 08:14:22AM

Beware: this doesn't appear to compare resource forks.



resource forks
Authored by: hayne on Mar 02, '06 05:43:40PM
I revised the script to report "Resource fork differs" when the resource forks of the duplicate data files are different. You can get the revised version in the forums thread.
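
For anyone curious how such a check might look, here is a rough sketch -- my own illustration, not the revised script. It assumes an HFS+ volume on OS X 10.4 or later, where a file's resource fork can be read at the path <file>/..namedfork/rsrc:

use strict;
use warnings;
use Digest::MD5;

# Hash one fork of a file; an unreadable or absent fork just hashes to ''.
sub fork_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or return '';
    binmode $fh;
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

# True if two files (already known to have identical data forks) have
# resource forks that differ.
sub resource_forks_differ {
    my ($file1, $file2) = @_;
    return fork_md5("$file1/..namedfork/rsrc") ne fork_md5("$file2/..namedfork/rsrc");
}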

revised version that handles special filenames better
Authored by: hayne on Mar 02, '06 01:38:40PM
Someone pointed out to me that the script fails (it hangs) if one of the filenames it encounters starts with "-" followed by some whitespace -- e.g. a filename "- foo". The problem was in the calcMd5 function. It tried to protect against strange filenames by using a well-known trick, but it turns out that trick fails when faced with a filename like "- foo".

I have modified the version of the script that is in the forums thread to fix this problem.


A few more notes
Authored by: hayne on Mar 03, '06 01:13:38AM

I'd like to note that if you are concerned with finding duplicates in your iTunes library, the previous hint on finding duplicates in iTunes would likely serve you better. That hint relied on the song list that you can export from iTunes, so it will find song files that have the same metadata -- even if they aren't exactly the same. It's a complementary sort of thing.

My script finds files that are byte-by-byte identical and ignores all metadata. So two songs (for example) that differ by one millisecond at the end would not be seen as duplicates by my script.

I also note that there are several shareware utilities that can find duplicate files.
E.g. here are a few that I found by searching at www.macupdate.com:
File Buddy: http://www.skytag.com/filebuddy/
TidyUp: http://www.hyperbolicsoftware.com/TidyUp.html



a Python script
Authored by: hayne on Mar 03, '06 06:16:24PM

And I just found out that Bill Bumgarner wrote a Python script to find duplicate files back in 2004:
http://www.pycs.net/bbum/2004/12/29/#200412291
His Python script uses techniques similar to the revised version of my script (collating files by size before using MD5), but his has an added optimization: it starts by doing an MD5 on just the first 1 KB of each file, and thus can discard many possible duplicates without needing to MD5 the whole file -- a significant saving for large files.
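
A partial-digest screen along those lines might look roughly as follows -- my own sketch; the function name partial_md5 and the 1 KB figure are taken only from the description above:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hash only the first $bytes bytes of a file. Files whose partial digests
# differ cannot be duplicates, so only the survivors need a full-file MD5.
sub partial_md5 {
    my ($file, $bytes) = @_;
    $bytes ||= 1024;
    open my $fh, '<', $file or return undef;
    binmode $fh;
    my $head = '';
    read $fh, $head, $bytes;
    close $fh;
    return md5_hex($head);
}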

The above Python script goes one step further and actually deletes the duplicate files automatically. That's something that could easily be added to my Perl script, but I would hesitate to do that, since human judgement is often needed in deciding between two versions of a file -- e.g. based on which folder you prefer, or on the existence or non-existence of a resource fork.
I'd rather write a custom script to do the deleting after examining the output to understand the nature of the duplicates.

And there is apparently a logic error in the above Python script as explained (and corrected) by Andrew Shearer: http://www.shearersoftware.com/personal/weblog/2005/01/14/dupinator-ii

One of the changes made by Andrew Shearer is to avoid issues with symbolic links. I suspect the same problem might exist with my script -- that it might report duplicates when one of the files is merely a symbolic link to the other. I haven't yet checked whether the directory traversal I use in my Perl script will follow symbolic links.
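
For what it's worth, File::Find does not descend into symlinked directories unless asked to, but a symlink to a regular file still satisfies -f, so a link and its target could show up as a "duplicate" pair. A one-line guard in the wanted routine avoids that -- a sketch, not the actual script:

use strict;
use warnings;
use File::Find;

my @files;
find(sub {
    return if -l $_;                          # ignore symbolic links themselves
    push @files, $File::Find::name if -f $_;  # collect regular files only
}, @ARGV ? @ARGV : '.');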



A few more notes
Authored by: Smokin Jake on Mar 05, '06 05:35:51AM

Another Perl script is available here:
http://www.beautylabs.net/software/dupseek.html


