
Find duplicate files using the terminal UNIX
[Submitted by victory]

The following will search the current directory (and its subdirectories) for files that have identical content and identical size, regardless of whether they are named differently. Open a terminal shell, 'cd' to the directory you want to search, then type:
find . -size +20 \! -type d -exec cksum {} \; | sort | tee /tmp/f.tmp | 
cut -f 1,2 -d ' ' | uniq -d | grep -hif - /tmp/f.tmp > dup.txt
[Editor's note: I inserted a carriage return for readability -- type the command on one line when entering it!]

This will produce a list of duplicate files (if any) in dup.txt. True, there are some nicely written apps that will do the same thing, but ain't it great that you can do this right from within your OS?
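
For anyone who would rather keep this around as a function, here is a rough sketch of the same pipeline with one stage per line. The function name list_dups and the mktemp scratch file are my own choices (the hint itself uses /tmp/f.tmp and writes to dup.txt), so treat it as an illustration, not the submitter's exact method.

```shell
# list_dups DIR: print the cksum lines of files that share a CRC and size.
# list_dups is a hypothetical name; the scratch file replaces /tmp/f.tmp.
list_dups() {
    dir=${1:-.}
    tmp=$(mktemp "${TMPDIR:-/tmp}/dups.XXXXXX") || return 1
    # 1. checksum every non-directory larger than 20 blocks (10k)
    find "$dir" -size +20 \! -type d -exec cksum {} \; |
        # 2. sort so identical "CRC size" pairs end up next to each other
        sort |
        # 3. keep a full copy (with file names) for the final lookup
        tee "$tmp" |
        # 4. reduce each line to "CRC size"
        cut -f 1,2 -d ' ' |
        # 5. keep one copy of each pair that occurs more than once
        uniq -d |
        # 6. use those pairs as patterns to pull the full lines back out
        grep -hif - "$tmp"
    rm -f "$tmp"
}
```

Something like `list_dups . > dup.txt` should then give the same result as the one-liner. One thing to keep in mind: grep treats each "CRC size" pair as a pattern that may match anywhere in a line, so an occasional stray match is possible.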

Notes:
  • This will ignore files that are smaller than 10k (remove or alter the '-size +20' to change this; the number is counted in 512-byte blocks). But a warning: really small files may produce identical CRCs, i.e. show up as duplicates even if they really aren't.
  • If you want to search a filesystem you don't own (e.g. /) you'll need to sudo or su, or 'find' will complain.
  • The built-in cksum cmd only uses CRC32. MD5 would be better. Anyone know why it's not enabled under OSX?
  • If you're gonna write a script to delete the duplicates from the produced dup.txt list, just remember that it contains ALL instances of the duplicate files.
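
On that last note, here is a minimal sketch of how a script might pick out only the extra copies from dup.txt. It assumes the cksum format produced by the command above (CRC, size, then the path) and that no file name contains a newline. The function name extra_copies is made up for illustration, and it only prints paths; it deletes nothing.

```shell
# extra_copies FILE: print the path of every entry except the first in each
# group of lines that share the same CRC and size.
# extra_copies is a hypothetical helper, not part of the original hint.
extra_copies() {
    awk '{
        key = $1 " " $2                  # CRC and size identify a group
        path = $0
        sub(/^[^ ]+ [^ ]+ /, "", path)   # strip the two leading fields, keep spaces in the path
        if (seen[key]++) print path      # skip the first member, print the rest
    }' "$1"
}
```

Running `extra_copies dup.txt` lists the candidates for removal. Given the CRC32 warning above, it is probably wise to confirm each one with `cmp -s` against the copy you are keeping before deleting anything.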


Find duplicate files using the terminal | 4 comments
MD5
Authored by: wayneyoung on Aug 29, '01 08:18:10PM

I haven't tried this yet, but you can find UNIX source code for MD5 at http://www.fourmilab.to/md5 and it should compile OK. I hope to try it this weekend.



MD5
Authored by: stonematt on Jan 05, '04 02:59:07PM

This seems to work in Panther:

$ find . -size +20 \! -type d -exec md5 -r {} \; | sort | tee /tmp/f.tmp | cut -f 1 -d ' ' | uniq -d | grep -hif - /tmp/f.tmp > dup.txt

Modifications
Authored by: MarcusB on Aug 30, '01 03:28:13AM

Great Tip.
To get it working in ksh, I had to change the syntax as below (I suspect the change is generic). After these changes it's fine.

The correct syntax in ksh is

find . -size +20 ! -type d -exec cksum {} ";" | sort | tee /tmp/f.tmp | cut -f 1,2 -d ' ' | uniq -d | grep -hif - /tmp/f.tmp > dup.txt

The only changes are to the find command.



Modifications
Authored by: qka on Aug 17, '04 04:36:56PM

This example works in bash, as supplied with Panther (10.3).

Note that with a sufficiently large set of files, grep, and hence the whole command, will fail.

I discovered this while trying to find duplicates in a set of approximately 26,000 files, with about 50% duplication.
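
One possible way around this, offered only as a sketch and not something I have run against a set that size: let awk do the grouping, so no pattern list has to be handed to grep at all. The function name dups_awk is made up, and the cksum format (CRC, size, path) is assumed as in the original command.

```shell
# dups_awk DIR: print cksum lines for files that share a CRC and size,
# without the grep -f step. dups_awk is a hypothetical name.
dups_awk() {
    find "${1:-.}" -size +20 \! -type d -exec cksum {} \; |
        # bytewise sort keeps equal "CRC size" pairs on adjacent lines
        LC_ALL=C sort |
        awk '{
            key = $1 " " $2
            if (key == prev) {
                # second or later member of a group: emit the held first
                # member once, then the current line
                if (held != "") { print held; held = "" }
                print
            } else {
                held = $0   # possible first member of a new group
            }
            prev = key
        }'
}
```

Used as `dups_awk . > dup.txt`, this should give the same kind of list. It compares whole fields instead of matching patterns anywhere in the line, so it also avoids stray partial matches.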


