A few more notes
Authored by: hayne on Mar 03, '06 01:13:38AM

I'd like to note that if you are concerned with finding duplicates in your iTunes library, the previous hint on finding duplicates in iTunes would likely serve you better. That hint relies on the song list that you can export from iTunes, so it finds song files that have the same metadata - even if the files themselves aren't exactly the same. The two approaches are complementary.

My script finds files that are byte-by-byte identical and ignores all metadata. So two songs (for example) that differ by one millisecond at the end would not be seen as duplicates by my script.

I also note that there are several shareware utilities that can find duplicate files.
E.g. here are a few that I found by searching at www.macupdate.com:
File Buddy: http://www.skytag.com/filebuddy/
TidyUp: http://www.hyperbolicsoftware.com/TidyUp.html

a Python script
Authored by: hayne on Mar 03, '06 06:16:24PM

And I just found out that Bill Bumgarner wrote a Python script to find duplicate files back in 2004:
http://www.pycs.net/bbum/2004/12/29/#200412291
His Python script uses techniques similar to the revised version of my script (collating files by size before computing MD5 checksums), but it has an added optimization: it first computes an MD5 of just the first 1 KB of each file, and can thus discard many candidate duplicates without hashing the whole file. This can be a significant saving for large files.
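To make the technique concrete, here is a minimal Python sketch of that pipeline (my own illustration, not Bill Bumgarner's actual code): group files by size first, then by an MD5 of the first 1 KB, and only hash whole files for the candidates that survive both filters.

```python
import hashlib
import os
from collections import defaultdict

def md5_of(path, limit=None):
    """MD5 hex digest of the first `limit` bytes of a file (whole file if None)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        remaining = limit
        while True:
            size = 65536 if remaining is None else min(65536, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
                if remaining <= 0:
                    break
    return h.hexdigest()

def find_duplicates(paths):
    """Return groups of byte-identical files.

    Filters in increasing cost: file size, then MD5 of the first 1 KB,
    then MD5 of the full contents. Most non-duplicates are eliminated
    before any full-file hashing happens.
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_head = defaultdict(list)
        for p in same_size:
            by_head[md5_of(p, limit=1024)].append(p)
        for same_head in by_head.values():
            if len(same_head) < 2:
                continue
            by_full = defaultdict(list)
            for p in same_head:
                by_full[md5_of(p)].append(p)
            groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

The ordering matters: a size lookup is nearly free, a 1 KB hash reads one block, and only files that agree on both get the expensive full read.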

The above Python script goes one step further and actually deletes the duplicate files automatically. That could easily be added to my Perl script, but I would hesitate to do so, since human judgement is often needed when deciding between two versions of a file - e.g. based on which folder you prefer, or on the existence or non-existence of a resource fork.
I'd rather write a custom script to do the deleting after examining the output to understand the nature of the duplicates.

And there is apparently a logic error in the above Python script as explained (and corrected) by Andrew Shearer: http://www.shearersoftware.com/personal/weblog/2005/01/14/dupinator-ii

One of the changes made by Andrew Shearer is to avoid issues with symbolic links. I suspect the same problem might exist with my script - that it might report on duplicates when one of the files is merely a symbolic link to the other. I haven't yet checked if the directory traversal I use in my Perl script will follow symbolic links.
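For anyone checking their own traversal code: in Python, a sketch of a symlink-safe walk looks like the following (my illustration, not taken from either script). `os.walk` does not descend into symlinked directories by default, but symlinked *files* still appear in its file lists, so each entry needs an explicit `os.path.islink` check to avoid reporting a link and its target as "duplicates".

```python
import os

def regular_files(root):
    """Yield paths of regular files under root, skipping symbolic links.

    os.walk(followlinks=False) already avoids descending into symlinked
    directories; the islink() check below additionally filters out
    symlinked files, which would otherwise hash identically to their
    targets and show up as false duplicates.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                yield path
```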

A few more notes
Authored by: Smokin Jake on Mar 05, '06 05:35:51AM

Another Perl script is available here:
http://www.beautylabs.net/software/dupseek.html
