10.4: Find potential duplicate files via Spotlight metadata
Authored by: shapiro on Oct 14, '06 08:20:54AM
The previous solutions fail if the file name or an attribute contains spaces. Improvements in this version:
  • File names and attributes may contain arbitrary characters.
  • Better argument check.
  • Avoid duplicate outputs.
  • Replace expensive MD5 computation with 'cmp'.
  • Don't compare file to itself.
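
To illustrate the first point, here is a minimal sketch of why unquoted variables break on names with spaces. The helper count_args and the sample file name are hypothetical and only serve the demonstration; this assumes bash or another POSIX-style shell.

```shell
# count_args: hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Sample file name containing spaces (made up for this example).
name='My Holiday Photo.jpg'

# Unquoted: the shell splits the value on whitespace into three words.
unquoted=$(count_args $name)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$name")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

This is the reason the script wraps "$SEARCHFILE" and "$candidate" in double quotes wherever they are used.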


#!/bin/bash
# dupecheck - identify potential duplicates of a file using Spotlight metadata
# see
# by Derick Fay, October 2006
# Extended to check md5sums by Craig Hughes, October 2006 -- removed by Marc Shapiro, Oct 2006.
# Making more MacOS/Darwin standard and added speedups and efficiencies by Scott Barman
# Support filenames and attributes that contain spaces; avoid duplicates; replace expensive MD5 computation with 'cmp'; bug fixes.  Marc Shapiro, Oct 2006 

# Errors should be written to stderr (file descriptor 2), and the script should
# exit with a non-zero status.
# [Scott's "," syntax does not work with the bash distributed with 10.4.8 (GNU bash, version 2.05b.0(1)-release (powerpc-apple-darwin8.0))]
SEARCHFILE="$1"
if [ $# -eq 1 -a -r "$SEARCHFILE" ]
  then :
  else echo "Usage: $0 filename" >&2
       exit 1
fi

# extract metadata from the file to be checked.
# [Scott's 'set' doesn't work with spaces, and the output format of mdls is unfriendly.]
declare -i size=$( mdls -name kMDItemFSSize "$SEARCHFILE" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )
name=$( mdls -name kMDItemFSName "$SEARCHFILE" | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )
kind=$( mdls -name kMDItemKind "$SEARCHFILE"   | tail -n 1 | sed -e 's/^[a-zA-Z ]*= *//' )

# Get possible matches
# Carefully build query string supporting file names and item kinds containing spaces.
query='kMDItemFSName == '"$name"' || ( kMDItemFSSize == '"$size"' && kMDItemKind == '"$kind"' )' 

# Avoid 'read' breaking file names in the middle, no matter what characters they contain.
# Loop over the results of the query.
# 'sort -u' removes duplicates.
mdfind "$query" | sort -u | while IFS= read -r candidate
do
  # The 'ls -i' check removes the input file from consideration.
  # The sed replacement must be '\1' (backslash 1), not a bare '1'.
  # 'cmp' compares the files byte-for-byte; '--bytes 4096' limits the check to the first 4096 bytes (arbitrarily).
  [ "$( ls -i "$SEARCHFILE" | sed -E -e 's/ *([0-9]+).*/\1/' )" != "$( ls -i "$candidate" | sed -E -e 's/ *([0-9]+).*/\1/' )" ] \
     && cmp -s --bytes 4096  "$SEARCHFILE" "$candidate" \
     && echo "$candidate"
done
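
As a possible alternative to comparing 'ls -i' output, the shell's own '-ef' test reports whether two paths refer to the same device and inode. This is a sketch of a swapped-in technique, not what the script above does; the function name same_file and the temporary file names are hypothetical.

```shell
# same_file A B: succeeds when both paths refer to the same device and inode.
same_file() {
  [ "$1" -ef "$2" ]
}

# Demonstration with throwaway files (names are made up for this example).
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/original file.txt"
ln "$tmpdir/original file.txt" "$tmpdir/hard link.txt"   # same inode
cp "$tmpdir/original file.txt" "$tmpdir/copy.txt"        # same bytes, new inode

same_file "$tmpdir/original file.txt" "$tmpdir/hard link.txt" && echo "hard link: same file"
same_file "$tmpdir/original file.txt" "$tmpdir/copy.txt" || echo "copy: different file"
```

In the loop, the test would read something like '! same_file "$SEARCHFILE" "$candidate"', which avoids parsing 'ls' output altogether.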

10.4: Find potential duplicate files via Spotlight metadata
Authored by: afb on Oct 15, '06 02:27:52PM

This script isn't working for me. It doesn't bring up any errors, but after running for a few seconds it won't show any dupes.

10.4: Find potential duplicate files via Spotlight metadata
Authored by: bdm on Oct 16, '06 09:29:34AM
The script suffers from the dreaded missing backslash problem. Near the end there are two sed commands. The replacement text for them is given as /1/ but should be /\1/. In case that doesn't look right either, the single digit "1" between the slashes should be "backslash 1". Twice. Brendan.
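
A quick way to see the difference, as a small sketch: the sample line below is a made-up stand-in for 'ls -i' output (inode number followed by the file name), run through both forms of the sed command.

```shell
# Hypothetical sample of 'ls -i' output: inode number, then the file name.
sample='  12345 My File.txt'

# Without the backslash: the whole line is replaced by the literal digit 1.
broken=$(printf '%s\n' "$sample" | sed -E -e 's/ *([0-9]+).*/1/')

# With the backslash: \1 refers to the first parenthesized group (the inode).
fixed=$(printf '%s\n' "$sample" | sed -E -e 's/ *([0-9]+).*/\1/')

printf 'broken: %s\nfixed:  %s\n' "$broken" "$fixed"
```

With a bare 1, both sides of the '!=' test come out as "1" for every file, so the test always fails and no candidate is ever printed. That matches the "no errors, no dupes" behaviour reported above.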
