10.4: Phrase searching using Spotlight and Python System 10.4
One of the most serious problems with Spotlight is that you cannot search for phrases in document content. One solution is to trawl through the results of a Spotlight search, using other utilities to do the phrase matching.

I wrote a Python script to demonstrate this. Running mdgrep.py "frank zappa" first searches the Spotlight index for all documents containing both frank and zappa. It then extracts the text of each document using mdimport -nfd2, which takes advantage of Spotlight plug-ins, and checks that text for the phrase. The script has command-line options for limiting the search to specific directory hierarchies and for searching with regular expressions. The output of the script is a list of all files containing the given phrase.
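
The two-stage approach described above can be sketched roughly as follows. This is not the mdgrep.py script itself, only a minimal Python 3 illustration. The function names are hypothetical, the mdfind -onlyin and mdimport -nfd2 invocations are copied from the commands quoted on this page, and merging stderr into stdout is an assumption about where mdimport writes the extracted text.

```python
import subprocess


def candidate_files(phrase, directory=None):
    """Stage 1: ask Spotlight for files containing all words of the phrase."""
    cmd = ["mdfind"]
    if directory:
        cmd += ["-onlyin", directory]
    cmd.append(phrase)
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line]


def extracted_text(path):
    """Stage 2: dump what the Spotlight importer sees for one file.

    Assumption: the dump may arrive on stderr, so both streams are merged.
    """
    result = subprocess.run(
        ["mdimport", "-nfd2", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        errors="replace",
    )
    return result.stdout


def contains_phrase(text, phrase):
    """Case-insensitive test for the exact phrase (words adjacent, in order)."""
    return phrase.lower() in text.lower()


def phrase_search(phrase, directory=None):
    """Return the candidate files whose extracted text holds the phrase."""
    return [
        path
        for path in candidate_files(phrase, directory)
        if contains_phrase(extracted_text(path), phrase)
    ]
```

Only contains_phrase can be tried anywhere; the other functions need a Mac with Spotlight. Passing the command as a list avoids shell quoting problems with file names that contain spaces.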

You can always find the latest version of the script on my site (the first link above is hosted on macosxhints.com). This was tested using Python 2.3 and 2.4, and obviously requires 10.4.

[robg adds: I tested this script by grabbing a phrase ("wrong one more than once") out of an old Macworld column, then searching for it using Spotlight and mdgrep.py. Spotlight found the right document, but it also found about 15 other columns that contained those very-common words. mdgrep.py found just the one document, but it took a while -- over 10 minutes on my Mac Pro. Execution time would have been much lower if I'd restricted the search to only my Macworld columns folder, but I wanted to see how it did with my whole system.]

10.4: Phrase searching using Spotlight and Python
Authored by: BrentT on Apr 09, '07 08:36:04AM

The freeware app SpotInside (http://www.oneriver.jp/SpotInside/index.html) finds phrases within documents pretty quickly. It shows a preview of documents, including pdfs, that can be launched directly. It will also allow sorting of results by attributes. I find SpotInside more useful than Spotlight because I am usually just looking for a document with a certain phrase I remember, but I don't always know the file name.



10.4: Phrase searching using Spotlight and Python
Authored by: homersimp on Apr 09, '07 10:34:09AM

Spotlight's inability to search for phrases isn't just a "serious problem" -- it all but destroys its usefulness for anyone with a large number of files to search (I have more than 15 years' worth of personal files on my computer).

Does anyone know if Google Desktop Search can find phrases?



10.4: Phrase searching using Spotlight and Python
Authored by: cgnorwood on Apr 11, '07 06:32:51AM

Google Desktop Search definitely will search for exact phrases. Just add quote marks. However, this is often unnecessary, since even an "unquoted" phrase will produce a result list sorted by relevance, and the document with the text that most closely matches the complete phrase will appear on the top of the list. Furthermore, GDS provides a line of text showing the search terms in the context in which they appear in the document, something I've never seen Spotlight do. This is very helpful to me. Lastly, GDS is much faster than Spotlight on my system. I've had none of the slowdowns reported by others. (But I do have a quad 3.0Ghz MacPro.)



one-line perl version
Authored by: SOX on Apr 09, '07 10:43:34AM
perl -e ' open FH, "mdfind -onlyin ~/ \"$ARGV[0]\"|"; while ($s=<FH>) {  chomp ($s) ; @x = grep {m/$ARGV[0]/} `mdimport -nfd2 "$s" \&> /tmp/crap ; cat /tmp/crap ` ; print "$s\n" if @x>0}   ' "centimeter measure"

The one-line Perl code above locates the phrase "centimeter measure" using the same approach as the Python script. You can make an alias out of it. It will goof up if your phrase has an unescaped double-quotation mark in it. Note that it overwrites a temporary file called /tmp/crap when it runs. I had to create that temporary file because mdimport, annoyingly, does not write to a standard stream that is easy to capture.


one-line perl version
Authored by: SOX on Apr 09, '07 10:54:04AM
A couple of usage notes:

1) It's hardcoded to look in ~/ (your home directory). This could obviously be changed to an input parameter as well.

2) Don't forget to quote your phrase.

3) The script is effectively concurrent: mdfind streams its results through a pipe while mdimport processes them. This flourish, however, is overkill in many cases because the slow step in the process is mdimport.

4) It hunts for the phrase in all of the metadata, not just the text content. It would be easy to modify it to search only the text content, but why would you want that?

5) If you want this to run lightning fast, just replace the mdimport with cat, like this:

perl -e ' open FH, "mdfind -onlyin ~/ \"$ARGV[0]\"|"; while ($s=<FH>) {  chomp ($s) ; $x = `cat "$s"` ; print "$s\n" if $x=~m/$ARGV[0]/}   ' "centimeter measure"

This will not use mdimport but will just do a raw text search.


one-line perl version
Authored by: CBrachyrhynchos on Apr 09, '07 07:42:50PM

cat will choke on "files" that are really bundles (like Mellel documents) and on ODF files, which are zipped archives. Not to mention that searching the raw contents of PDF files is an issue. (It doesn't seem to work for me.)

It's a cool one-liner, and my script does suffer from a bit of creeping featurism in its switches.



one-line perl version
Authored by: SOX on Apr 10, '07 10:42:34AM

Well, yeah, the cat version is only good for plain text. But mdimport is dog slow, so cat is worth using when you know the file is plain text. To push things a bit, one could run the files through `strings` first to strip out the binary crud and hope to get lucky finding the phrase in plain-text form even in a PDF or Word .doc file. It's so much faster than mdimport that one could just do it as a pre-screen.
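
As a rough illustration of the pre-screen idea, here is a minimal Python 3 sketch that imitates what `strings` does (pulling out runs of printable ASCII) and looks for the phrase in those runs. The function names and the minimum run length of 4 are assumptions for the example, not part of any script posted here. A miss proves nothing, since compressed or encoded text will not show up.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII bytes, roughly what `strings` prints."""
    pattern = rb"[\x20-\x7e\t]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


def prescreen(path, phrase):
    """Cheap first pass: True if the phrase shows up in the printable runs."""
    with open(path, "rb") as handle:
        data = handle.read()
    needle = phrase.lower()
    return any(needle in run.lower() for run in printable_runs(data))
```

Files that pass the pre-screen can be reported immediately, and only the misses would need the slower mdimport pass.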

One feature that would be fun to add is a concept of "near" in addition to exact phrases.

@g = split /\s+/, $ARGV[0];
$h = join ".{1,20}", @g;
then match m/$h/
to find the words in phrase order while tolerating up to 20 intervening characters
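
The same "near" idea might look like this in Python 3. It is a sketch only: the function name and the gap parameter are hypothetical. Ignoring case and letting the gap span line breaks (re.DOTALL) are choices made for this example that go beyond the Perl fragment.

```python
import re


def near_pattern(phrase, gap=20):
    """Compile a regex matching the phrase's words in order, 1..gap chars apart."""
    # Escape each word so characters like "." are matched literally.
    words = [re.escape(word) for word in phrase.split()]
    joiner = ".{1,%d}" % gap
    return re.compile(joiner.join(words), re.IGNORECASE | re.DOTALL)
```

For example, near_pattern("centimeter measure").search(text) succeeds when the two words are separated by a single space or by a few characters of markup, but not when they appear in the opposite order.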



one-line perl version
Authored by: SOX on Apr 10, '07 10:45:50AM

By the way, did you figure out what the heck is up with mdimport's output streams? It seems to be one of those crazy programs, like top, that can tell whether it's being run in a terminal, redirected to a file, or sent down a pipe, and then changes which stream it writes the data to. For example, if you try to capture its output using backticks in Perl, it still writes to the terminal directly, but if you redirect it to a file, it does not write to the terminal.
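
One explanation consistent with these symptoms, though not confirmed anywhere in this thread, is that the text goes to standard error: backticks capture only standard output, while &> redirects both streams. Under that assumption, a temporary file is not needed. The Python 3 sketch below uses a hypothetical helper name and a stand-in child process instead of mdimport, and captures both streams in memory.

```python
import subprocess
import sys


def run_capturing_all(cmd):
    """Run cmd and return stdout and stderr together as one string."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured stdout
        text=True,
        errors="replace",
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child process that writes only to stderr, as a demonstration.
    demo = [sys.executable, "-c", "import sys; sys.stderr.write('to stderr')"]
    print(run_capturing_all(demo))
```

If the assumption holds, run_capturing_all(["mdimport", "-nfd2", path]) would return the same text that the one-liner routes through /tmp/crap.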



one line perl version with Approximate matching
Authored by: SOX on Apr 10, '07 06:55:31PM
New and improved perl version.

Features:

  1. first parameter is now the origin directory for the search, follow with phrase
  2. search is liberal, matching approximate forms of the phrase: that is it allows up to 20 characters to be inserted between the words of the phrase and it will still match.
  3. this 20 character grace means it will match phrases around carriage returns, tabs, and many kinds of HTML markups
  4. no need to "quote" the phrase unless it contains chars needing shell escapes
  5. phrase can contain special characters

perl -e '($f,@A) = map { quotemeta } @ARGV;  open FH, "mdfind -onlyin $f \"@A\" |"; $A = join ".{1,20}",@A; while ($r=<FH>) {  chomp ($r) ; $s=quotemeta($r); @x = grep {m/$A/} `mdimport -nfd2 $s \&> /tmp/crap ; cat /tmp/crap`; print "$r\n" if @x>0}   ' ./ centimeter measure


10.4: Phrase searching using Spotlight and Python
Authored by: xyz3 on Apr 13, '07 04:50:00PM

Uh - did anyone try using spotlight with frank+zappa ?

Works here when I tested it.. ; )



10.4: Phrase searching using Spotlight and Python
Authored by: xyz3 on Apr 13, '07 04:53:32PM

Looks like frank&zappa works as well.



10.4: Phrase searching using Spotlight and Python
Authored by: larryy on Jun 15, '07 05:15:28PM

I think a few people are confused about what "phrase" means in this context. Spotlight will happily find documents with multiple words (including if they are separated by + or & instead of space, though they do seem to change the results slightly). SpotInside.app behaves the same as Spotlight, except it has a document preview pane.

What is needed is a simple way to find only those documents that contain a "phrase"--a set of contiguous words in exactly the given sequence. (What Spotlight finds is all documents that contain that set of words in any sequence anywhere in the document.)
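
The distinction can be made concrete with a small Python 3 sketch. The function names are made up for this example, and the tokenizer is a simplification that is not meant to reproduce how Spotlight actually indexes text.

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation and whitespace."""
    return re.findall(r"\w+", text.lower())


def has_all_words(text, query):
    """Spotlight-style match: every query word occurs somewhere."""
    return set(tokens(query)) <= set(tokens(text))


def has_phrase(text, query):
    """Phrase match: the query words occur contiguously, in order."""
    doc, want = tokens(text), tokens(query)
    return any(
        doc[i:i + len(want)] == want
        for i in range(len(doc) - len(want) + 1)
    )
```

A document reading "Zappa was wrong more than once, said Frank" satisfies has_all_words for the query "frank zappa" but not has_phrase, which is exactly the kind of false hit a phrase search should filter out.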

The Python script seems to do this, but is quite slow and normally works over the entire disk only. There is a -o directory option to limit the search to a particular directory, but it doesn't appear to work, instead searching at least the directory containing the specified directory. (If that proves to be all that is wrong, then specifying a directory inside the directory you want to search might be a workaround.)

The Perl one-liner seems to work correctly when used as shown here, and it seems fairly quick, and the updated version nicely allows you to specify the starting directory. However, when I convert it to a bash function, it ceases to work for some reason, sadly, but that's probably me. Here's what I did, in case anyone can spot the error (it's a direct copy and paste, except for sticking it in the bash function and replacing the specific directory and search string with $*):

search () { command perl -e '($f,@A) = map { quotemeta } @ARGV; open FH, "mdfind -onlyin $f \"@A\" |"; $A = join ".{1,20}",@A; while ($r=<FH>) { chomp ($r) ; $s=quotemeta($r); @x = grep {m/$A/} `mdimport -nfd2 $s \&> /tmp/crap ; cat /tmp/crap`; print "$r\n" if @x>0} ' $* ; }

Too bad this has to use an intermediate file, as it would probably be quite fast otherwise.


