
10.4: Find potential duplicate files via Spotlight metadata UNIX
I often rename files immediately after downloading and stick them in a folder somewhere for later reference. But I also often forget what I've already downloaded. So I wrote this bash shell script to use Spotlight to find files that match a file's name or its size and kind.

Usage, in Terminal:
dupecheck filename
For example, for a file called 0.pdf, output might look like this (line breaks added for a narrower display):
Possible matches based on filename:
/Users/whoever/Desktop/0.pdf
Possible matches based on size and kind:
/Users/whoever/Desktop/0.pdf
/Users/whoever/Desktop/Data/anthro articles/sahlins-1999.pdf
/Users/whoever/Desktop/Data/html/oldcourses/
 intro/secure/sahlins-sweetness.pdf
/Users/whoever/Documents/archives/another backup/
 blahblah/public_html/intro/secure/sahlins1999-sweetness.pdf
/Users/whoever/Desktop/Data/html/blahblah/readings/3/sahlins99.pdf
So it turns out I just downloaded a file that I already have four copies of under different names and locations.
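The two lists in the example can be produced with mdls and mdfind. The following is only a minimal sketch, pieced together from the versions posted in the comments rather than taken from the original script; the function names mdls_value and dupecheck_sketch are made up for illustration, and it assumes mdls prints the requested attribute as "name = value" on its last output line.

```shell
#!/bin/bash
# Sketch only: reconstructed from the scripts in the comments below,
# not the hint's original code.

# Reduce mdls output (read from stdin) to the bare attribute value:
# keep the last line and strip everything up to and including "= ".
mdls_value() {
    tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//'
}

# Print the two candidate lists for one file.
dupecheck_sketch() {
    file="$1"
    name=$(mdls -name kMDItemFSName "$file" | mdls_value)
    size=$(mdls -name kMDItemFSSize "$file" | mdls_value)
    kind=$(mdls -name kMDItemKind "$file" | mdls_value)

    echo "Possible matches based on filename:"
    mdfind "kMDItemFSName == $name"
    echo "Possible matches based on size and kind:"
    mdfind "kMDItemFSSize == $size && kMDItemKind == $kind"
}

# Only run where Spotlight's command-line tools exist (Mac OS X 10.4 or later).
if [ $# -eq 1 ] && command -v mdls >/dev/null 2>&1; then
    dupecheck_sketch "$1"
fi
```

Because mdls prints string values with their double quotes, the name and kind can be dropped into the query as they are.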

I've set this up as a command in OnMyCommand. For this to work, the shell script must be in a folder that's included in your $PATH. Here's the OnMyCommand command (assuming you are using OMCEdit):
cd __OBJ_PARENT_PATH__
dupecheck __OBJ_NAME__
Execution Mode should be set to Terminal.
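
Getting the script into a folder on your $PATH might go roughly as follows. This is a sketch under assumptions: install_script and on_path are made-up helper names, and the folder "$HOME/bin" in the usage note is just an example, since any folder listed in $PATH will do.

```shell
# Sketch: copy a script into a folder and make it executable.
install_script() {
    src="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    chmod +x "$dest_dir/$(basename "$src")"
}

# Report (via exit status) whether a folder is already listed in $PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, run install_script dupecheck "$HOME/bin" from the folder where you saved the script. If on_path "$HOME/bin" then fails, adding a line such as export PATH="$HOME/bin:$PATH" to your shell startup file should make the command available in new Terminal windows.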

10.4: Find potential duplicate files via Spotlight metadata
Authored by: hughescr on Oct 13, '06 12:27:26PM
A nice final step would be to go through the list of possible matches and check whether each file's MD5 sum matches that of $1:

#!/bin/bash
#
# dupecheck - identify potential duplicates of a file using Spotlight metadata
# by Derick Fay, October 2006
# Extended to check md5sums by Craig Hughes, October 2006

if [ -z $1 ]; then      # -n tests to see if the argument is non empty
        echo "usage: $0 filename"
        exit
fi

# Get the to-match MD5 sum
MD5SUM=`md5sum "$1" | awk '{print $1}'`

#extract metadata from the file to be checked
size=`mdls -name kMDItemFSSize "$1" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//'`
name=`mdls -name kMDItemFSName "$1" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//'`
kind=`mdls -name kMDItemKind "$1"   | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//'`

#Get possible matches
echo "MD5-confirmed matches:"
mdfind -0 "kMDItemFSName == $name || (kMDItemFSSize == $size && kMDItemKind=$kind)" | xargs -0 md5sum | grep $MD5SUM | sed -e 's/^[0-9a-f]* *//'


10.4: Find potential duplicate files via Spotlight metadata
Authored by: JadeNB on Apr 25, '07 06:16:51PM
There are several other versions of this script posted below anyway, but, just for anyone who's trying to follow and was confused by this line:
if [ -z $1 ]; then # -n tests to see if the argument is non empty
It is certainly true that -n tests if its argument is non-empty, but, obviously, that's not the test used here. -z is just the opposite test: Is the argument empty?
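
To make the difference concrete, here is a small sketch (describe_arg is a made-up name). It also quotes "$1", which keeps the test well-formed when the argument is missing or contains spaces.

```shell
# -z is true for an empty string, -n for a non-empty one.
describe_arg() {
    if [ -z "$1" ]; then
        echo "empty"
    fi
    if [ -n "$1" ]; then
        echo "non-empty"
    fi
}

describe_arg ""             # prints: empty
describe_arg "my file.pdf"  # prints: non-empty
```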

10.4: Find potential duplicate files via Spotlight metadata
Authored by: S Barman on Oct 13, '06 10:35:53PM
In the previous rewrite of the hint, the md5sum command is not a standard MacOS/Darwin command, so I rewrote the script to use /sbin/md5. Also, rather than calling mdls three times, the script now calls it once. I then fixed the mdfind conditions (one used "=" where "==" was intended). Finally, rather than using sed, which has a lot of processing overhead, I am using cut to do the same thing. Overall, this cut a bit more than one second off the command's execution time on my system.

So, without further ado, here's my updated script:


#!/bin/bash
# dupecheck - identify potential duplicates of a file using Spotlight metadata
# by Derick Fay, October 2006
# Extended to check md5sums by Craig Hughes, October 2006
# Making more MacOS/Darwin standard and added speedups and efficiencies by Scott Barman

# Errors should be written to stderr (file designator 2) and exit with a
# non-zero status. I also like shortening the parsing!
[ -z $1 ] && echo "usage: $0 filename" >&2, exit 1
SEARCHFILE=$1

# Get the to-match MD5 sum
# /sbin/md5 is standard on MacOS/Darwin. The -q option just prints the MD5 value
MD5SUM=$(/sbin/md5 -q $SEARCHFILE)

# extract metadata from the file to be checked
# Let's do it with one command and pull the pieces out of the command.
# I use "set" to replace the command line and just parse the command line!
set $(mdls -name kMDItemFSSize -name kMDItemFSName -name kMDItemKind "$1")
name=$5
size=$8
kind=${11}	# braces needed because position > 9 (more than 2 char)

# Get possible matches
# do this by using $(..) to put the file names on the command line for md5
# which does not require xargs and another pipe!
echo "MD5-confirmed matches:"
mdfind -0 "kMDItemFSName == $name || (kMDItemFSSize == $size && kMDItemKind == $kind)" | xargs -0 /sbin/md5 -r | grep $MD5SUM  | cut -d ' ' -f 2

I love squeezing every last bit of efficiency out of scripts!! :-)

Scott


10.4: Find potential duplicate files via Spotlight metadata
Authored by: S Barman on Oct 14, '06 08:00:22AM
Oops... ignore the comment on using $(..) because it did not work. But other than that, it still cuts a bit more than a second off the search on my PowerBook G4!
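
Whatever the cause was in this case, one general hazard of putting $(..) output on a command line is word splitting: a path containing a space becomes two arguments. The sketch below illustrates this with a made-up stand-in for mdfind -0 (fake_mdfind and count_args are hypothetical names), so it runs without Spotlight.

```shell
# Stand-in for `mdfind -0`: emit two NUL-terminated paths, one containing a space.
fake_mdfind() {
    printf '%s\0' "/tmp/a b.pdf" "/tmp/c.pdf"
}

# Print how many arguments were received.
count_args() { echo $#; }

# Unquoted command substitution: word splitting turns two paths into three arguments.
count_args $(fake_mdfind | tr '\0' '\n')    # prints 3

# xargs -0 hands each NUL-terminated path over as a single argument.
fake_mdfind | xargs -0 sh -c 'echo $#' sh    # prints 2
```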

10.4: Find potential duplicate files via Spotlight metadata
Authored by: shapiro on Oct 14, '06 08:20:54AM
The previous solutions fail if the file name or an attribute contains spaces. Improvements in this version:
  • File names and attributes may contain arbitrary characters.
  • Better argument check.
  • Avoid duplicate outputs.
  • Replace expensive MD5 computation with 'cmp'.
  • Don't compare file to itself.

#!/bin/sh

# dupecheck - identify potential duplicates of a file using Spotlight metadata
# see http://www.macosxhints.com/article.php?story=20061003163429425
#
# by Derick Fay, October 2006
# Extended to check md5sums by Craig Hughes, October 2006 -- removed by Marc Shapiro, Oct 2006.
# Making more MacOS/Darwin standard and added speedups and efficiencies by Scott Barman
# Support filenames and attributes that contain spaces; avoid duplicates; replace expensive MD5 computation with 'cmp'; bug fixes.  Marc Shapiro, Oct 2006 

# Errors should be written to stderr (file designator 2) and exit with a
# non-zero status. 
# [Scott's "," syntax does not work with the bash distributed with 10.4.8 (GNU bash, version 2.05b.0(1)-release (powerpc-apple-darwin8.0))]
SEARCHFILE="$1"
if [ $# == 1 -a -r "$SEARCHFILE" ] 
  then :
  else echo "Usage: $0 filename" >&2
       exit 1
  fi

# extract metadata from the file to be checked.
# [Scott's 'set' doesn't work with spaces, and the output format of mdls is unfriendly.]
declare -i size=$( mdls -name kMDItemFSSize "$SEARCHFILE" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )
name=$( mdls -name kMDItemFSName "$SEARCHFILE" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )
kind=$( mdls -name kMDItemKind "$SEARCHFILE"   | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )

# Get possible matches
# Carefully build query string supporting file names and item kinds containing spaces.
query='kMDItemFSName == '"$name"' || ( kMDItemFSSize == '"$size"' && kMDItemKind == '"$kind"' )' 

# Avoid 'read' breaking file names in the middle, no matter what characters they contain.
IFS=''
# Loop over the results of the query.
# 'sort -u' removes duplicates.
mdfind "$query" | sort -u | while read -r candidate
do
  # The 'ls -i' check removes the input file from consideration.
  # 'cmp' compares the files byte-for-byte; '--bytes 4096' limits the check to the first 4096 bytes (arbitrarily).
  [ $( ls -i "$SEARCHFILE" | sed -E -e 's/ *([0-9]+).*/1/' ) != $( ls -i "$candidate" | sed -E -e 's/ *([0-9]+).*/1/' ) ] 
     && cmp -s --bytes 4096  "$SEARCHFILE" "$candidate" 
     && echo "$candidate"
done


10.4: Find potential duplicate files via Spotlight metadata
Authored by: afb on Oct 15, '06 02:27:52PM

This script isn't working for me. It doesn't bring up any errors, but after running for a few seconds it won't show any dupes.



10.4: Find potential duplicate files via Spotlight metadata
Authored by: bdm on Oct 16, '06 09:29:34AM
The script suffers from the dreaded missing backslash problem. Near the end there are two sed commands. The replacement text for them is given as /1/ but should be /\1/. In case that doesn't look right either, the single digit "1" between the slashes should be "backslash 1". Twice. Brendan.
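
A quick way to see the difference, using a made-up sample line in the style of ls -i output. Without the backslash, both sides of the != test collapse to the literal "1", so the test is never true, which would explain why the script printed no dupes.

```shell
# Made-up sample: inode number followed by a file name.
inode_line="  123456 sahlins-1999.pdf"

# With \1 the replacement is the text captured by ([0-9]+): the inode number.
echo "$inode_line" | sed -E -e 's/ *([0-9]+).*/\1/'   # prints 123456

# With a bare 1 the whole line is replaced by the literal character "1".
echo "$inode_line" | sed -E -e 's/ *([0-9]+).*/1/'    # prints 1
```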

hey thanks
Authored by: deef on Oct 16, '06 07:48:33PM

Very cool to see people with much more shell scripting knowledge and experience than me taking this idea & refining it. I've just switched over to the last version above & the speed gains are huge.

One issue came up -- I cut and pasted the script above into Smultron (my editor of choice) & originally got a syntax error in line 40. When I put the two lines beginning with && right at the end onto a single line with the previous one, i.e.

[ $( ls -i "$SEARCHFILE" | sed -E -e 's/ *([0-9]+).*/\1/' ) != $( ls -i "$candidate" | sed -E -e 's/ *([0-9]+).*/\1/' ) ] && cmp -s --bytes 4096 "$SEARCHFILE" "$candidate" && echo "$candidate"

(having also fixed the sed in the way described above) I got it working. Thanks again to Craig, Scott and Marc.



10.4: Find potential duplicate files via Spotlight metadata
Authored by: shapiro on Oct 20, '06 01:55:57PM

Brendan is correct. That line should read as follows:

[ $( ls -i "$SEARCHFILE" | sed -E -e 's/ *([0-9]+).*/\1/' ) != $( ls -i "$candidate" | sed -E -e 's/ *([0-9]+).*/\1/' ) ]



10.4: Find potential duplicate files via Spotlight metadata
Authored by: shapiro on Oct 20, '06 02:19:42PM
As pointed out by Brendan, backslashes were missing from my previous posting. Fixed in this version.

#!/bin/sh

# dupecheck - identify potential duplicates of a file using Spotlight metadata
# see http://www.macosxhints.com/article.php?story=20061003163429425
#
# by Derick Fay, October 2006
# Extended to check md5sums by Craig Hughes, October 2006 -- removed by Marc Shapiro, Oct 2006.
# Making more MacOS/Darwin standard and added speedups and efficiencies by Scott Barman
# Support filenames and attributes that contain spaces; avoid duplicates; replace expensive MD5 computation with 'cmp'; bug fixes.  Marc Shapiro, Oct 2006 

# Errors should be written to stderr (file designator 2) and exit with a
# non-zero status. 
# [Scott's "," syntax does not work with the bash distributed with 10.4.8 (GNU bash, version 2.05b.0(1)-release (powerpc-apple-darwin8.0))]
SEARCHFILE="$1"
if [ $# == 1 -a -r "$SEARCHFILE" ] 
  then :
  else echo "Usage: $0 filename" >&2
       exit 1
  fi

# extract metadata from the file to be checked.
# [Scott's 'set' doesn't work with spaces, and the output format of mdls is unfriendly.]
declare -i size=$( mdls -name kMDItemFSSize "$SEARCHFILE" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )
name=$( mdls -name kMDItemFSName "$SEARCHFILE" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )
kind=$( mdls -name kMDItemKind "$SEARCHFILE"   | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )

# Get possible matches
# Carefully build query string supporting file names and item kinds containing spaces.
query='kMDItemFSName == '"$name"' || ( kMDItemFSSize == '"$size"' && kMDItemKind == '"$kind"' )' 

# Avoid 'read' breaking file names in the middle, no matter what characters they contain.
IFS=''
# Loop over the results of the query.
# 'sort -u' removes duplicates.
mdfind "$query" | sort -u | while read -r candidate
do
  # The 'ls -i' check removes the input file from consideration.
  # 'cmp' compares the files byte-for-byte; '--bytes 4096' limits the check to the first 4096 bytes (arbitrarily).
  [ $( ls -i "$SEARCHFILE" | sed -E -e 's/ *([0-9]+).*/\1/' ) != $( ls -i "$candidate" | sed -E -e 's/ *([0-9]+).*/\1/' ) ] \
     && cmp -s --bytes 4096  "$SEARCHFILE" "$candidate" \
     && echo "$candidate"
done
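
The self-comparison check in the loop above could also be written without ls and sed. The following is a sketch of an alternative, not part of the posted script: same_file and is_duplicate are made-up names, it assumes your shell's test supports -ef (bash does), and it compares whole files rather than only the first 4096 bytes.

```shell
# True when both paths refer to the same file (same device and inode).
same_file() {
    [ "$1" -ef "$2" ]
}

# True when the candidate is a different file with identical contents.
is_duplicate() {
    ! same_file "$1" "$2" && cmp -s "$1" "$2"
}
```

Inside the loop this would read: is_duplicate "$SEARCHFILE" "$candidate" && echo "$candidate".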


10.4: Find potential duplicate files via Spotlight metadata
Authored by: chuy on Oct 21, '06 05:44:04AM

Could someone provide this script as a downloadable file, with easy install instructions for non-Unix users? This looks like a very good function, but my knowledge of Unix is very slim; I only copy and paste hints into Terminal to enable functionality.
For something like this I still use a Mac OS 9 app that checksums two files when I drag them onto it and tells me whether they are identical. It would be great if this script were a drop box, so that dropping a file on it would find its real dupes.

thank you


