10.6: How to use OCR with HP multi-function printers

Feb 18, '10 07:30:01AM

Contributed by: sjinsjca

HP's printer software and utilities were replaced with built-in items in OS X Snow Leopard, after a few tense days in which HP initially declined to support the new operating system. Many of us remember our outrage when some HP spokesdrone actually stated that users should buy new printers if they wanted Snow Leopard support. Those of us who didn't run screaming to Canon or some other brand have made do with a reduced feature set for our HP printers. In particular, OCR functionality is no longer supported by HP.

I tried the following, and not only did it work for me, but it was a really sweet solution. At its heart is Tesseract, an open-source OCR engine that originated at HP (how's that for irony) and is currently maintained (and used) by Google.

I found that some of the online instructions for deploying and using Tesseract were a little bit confusing and contradictory. But here's what worked for me. As always: proceed at your own risk; I make no guarantees, and while I have tried to be careful I can't be certain that all the necessary steps are present, or safe, nor can I provide support.

With that caveat out of the way:

  1. Be logged in to your Mac as an Administrator.
  2. Download Tesseract 2.04 [1.1MB download]
  3. Expand the downloaded file. The expansion process will create a new subfolder, tesseract-2.04, inside your Downloads folder.
  4. Download the English dictionary file [992KB download]
  5. Expand the dictionary download. Copy the contents of the resulting folder to the tessdata subfolder inside the aforementioned tesseract-2.04 folder.
  6. Open a Terminal window (sorry), and cd ~/Downloads/tesseract-2.04.
  7. Issue the following three commands (without the $ prompt). Each will take a minute or two to run:
    $ ./configure
    $ make
    $ sudo make install
    (Actually, if you're logged in as an Administrator as I recommended, you may not need the sudo, depending on the permissions of /usr/local; including it does no harm.)
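If you would rather stay in Terminal for step 5, the copy can be sketched as a small shell function. This is only a sketch: copy_language_data is a name made up for illustration, and the folder the dictionary download expands to is whatever appears in your Downloads folder, so it is passed in as an argument rather than assumed.

```shell
# Copy the contents of an expanded language-data folder into the
# tessdata subfolder of the Tesseract source tree (step 5 above).
# copy_language_data is a hypothetical helper, not part of Tesseract.
copy_language_data() {
    lang_dir=$1      # the expanded dictionary folder (you supply the path)
    source_dir=$2    # e.g. ~/Downloads/tesseract-2.04
    if [ ! -d "$lang_dir" ]; then
        echo "no such folder: $lang_dir" >&2
        return 1
    fi
    mkdir -p "$source_dir/tessdata" || return 1
    # "dir/." copies what is inside the folder, not the folder itself
    cp -R "$lang_dir"/. "$source_dir/tessdata/"
}
```

Call it with the two folders as arguments before running ./configure, so the dictionary files are in place when make install runs.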
You have now installed your OCR engine. It can OCR uncompressed .tif files (only) very quickly. However, they must have a .tif extension, not .tiff.
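Since the engine is picky about the extension, a scan saved as .tiff has to be renamed before it will be accepted. Here is a minimal sketch of doing that in bulk; tif_name and rename_tiffs are hypothetical helper names, not part of Tesseract, and the sketch only touches files ending in lowercase .tiff.

```shell
# Print the file name with a trailing ".tiff" changed to ".tif";
# any other name is printed unchanged.
tif_name() {
    case $1 in
        *.tiff) printf '%s.tif\n' "${1%.tiff}" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Rename every .tiff file in the given folder to .tif.
# A file is left alone if the .tif name is already taken.
rename_tiffs() {
    for f in "$1"/*.tiff; do
        [ -e "$f" ] || continue          # the glob matched nothing
        target=$(tif_name "$f")
        if [ -e "$target" ]; then
            echo "skipping $f: $target already exists" >&2
        else
            mv "$f" "$target"
        fi
    done
}
```

Running rename_tiffs on your scan folder leaves existing .tif files untouched, so it is safe to run more than once.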

How to OCR

There's a quick way with Terminal, and a cool way with Folder Actions; I'll describe both.

Terminal: To create a text file named someimage_text.txt from an uncompressed .tif file named someimage.tif, issue this command while cd'd to the folder containing the .tif (Tesseract adds the .txt extension to the output name you give it):
/usr/local/bin/tesseract someimage.tif someimage_text
It's that easy. Now, my HP all-in-one's scanner produces a few graphics formats including .png, but not uncompressed .tif. No matter: Apple's built-in Folder Action that converts images to .tif outputs uncompressed .TIF files when fed a .png file from the scanner. Or you can load your scanned image into Preview and Save As to the .tif format; just be sure to select no compression.
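If you have a whole folder of uncompressed .tif scans, the same command can be wrapped in a loop. The following is a sketch, not something from the Tesseract documentation: ocr_folder is a made-up function name, the TESSERACT variable is my own addition so the path can be overridden, and the _text suffix simply follows the command above.

```shell
# Path used in the article; set TESSERACT beforehand to use another copy.
TESSERACT=${TESSERACT:-/usr/local/bin/tesseract}

# Run the OCR engine on every .tif file in the given folder.
# For scan.tif the text ends up in scan_text.txt, because the engine
# appends .txt to the output name it is given.
ocr_folder() {
    dir=$1
    for image in "$dir"/*.tif; do
        [ -e "$image" ] || continue      # no .tif files in the folder
        base=${image%.tif}
        "$TESSERACT" "$image" "${base}_text" ||
            echo "OCR failed for $image" >&2
    done
}
```

For example, ocr_folder ~/Documents/Scans would process every .tif in a (hypothetical) Scans folder and report any file the engine rejects without stopping the loop.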

Folder Action OCR: I made a sweet Folder Action script (my very first!) by slightly modifying an existing Folder Action script from Apple which converts file formats by calling a shell script. In this case, I modified the script to call the tesseract command shown above. Just open the Folder Action script editor and paste the following code in; it works beautifully.
(*
convert - do ocr via shell script

This Folder Action handler is triggered whenever items are added to
the attached folder.

The script converts files from uncompressed .tif format to text using
the open-source Tesseract OCR engine,
http://code.google.com/p/tesseract-ocr/

Copyright © 2002–2007 Apple Inc. [with mods by sjinsjca]

You may incorporate this Apple sample code into your program(s) without
restriction.  This Apple sample code has been provided "AS IS" and the
responsibility for its operation is yours.  You are not permitted to
redistribute this Apple sample code as "Apple sample code" after having
made changes.  If you're going to redistribute the code, we require
that you make it clear that the code was descended from Apple sample
code, but that you've made changes.  ===> Duly noted, changes have
been made. --sjinsjca
*)

property done_foldername : "OCR Files"
property originals_foldername : "Original Files"
property newimage_extension : ""
-- the list of file types which will be processed
-- eg: {"PICT", "JPEG", "TIFF", "GIFf"}
property type_list : {"TIFF"}
-- since file types are optional in Mac OS X,
-- check the name extension if there is no file type
-- NOTE: do not use periods (.) with the items in the name extensions list
-- eg: {"txt", "text", "jpg", "jpeg"}, NOT: {".txt", ".text", ".jpg", ".jpeg"}
property extension_list : {"tif"}


on adding folder items to this_folder after receiving these_items
  tell application "Finder"
    if not (exists folder done_foldername of this_folder) then
      make new folder at this_folder with properties {name:done_foldername}
    end if
    set the results_folder to (folder done_foldername of this_folder) as alias
    if not (exists folder originals_foldername of this_folder) then
      make new folder at this_folder with properties {name:originals_foldername}
      set current view of container window of this_folder to list view
    end if
    set the originals_folder to folder originals_foldername of this_folder
  end tell
  try
    repeat with i from 1 to number of items in these_items
      set this_item to item i of these_items
      set the item_info to the info for this_item
      if (alias of the item_info is false and the file type of the item_info is in the type_list) or (the name extension of the item_info is in the extension_list) then
        tell application "Finder"
          my resolve_conflicts(this_item, originals_folder, "")
          set the new_name to my resolve_conflicts(this_item, results_folder, newimage_extension)
          set the source_file to (move this_item to the originals_folder with replacing) as alias
        end tell
        process_item(source_file, new_name, results_folder)
      end if
    end repeat
  on error error_message number error_number
    if the error_number is not -128 then
      tell application "Finder"
        activate
        display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
      end tell
    end if
  end try
end adding folder items to

on resolve_conflicts(this_item, target_folder, new_extension)
  tell application "Finder"
    set the file_name to the name of this_item
    set file_extension to the name extension of this_item
    if the file_extension is "" then
      set the trimmed_name to the file_name
    else
      set the trimmed_name to text 1 thru -((length of file_extension) + 2) of the file_name
    end if
    if the new_extension is "" then
      set target_name to file_name
      set target_extension to file_extension
    else
      set target_extension to new_extension
      set target_name to (the trimmed_name & "." & target_extension) as string
    end if
    if (exists document file target_name of target_folder) then
      set the name_increment to 1
      repeat
        set the new_name to (the trimmed_name & "." & (name_increment as string) & "." & target_extension) as string
        if not (exists document file new_name of the target_folder) then
          -- rename to conflicting file
          set the name of document file target_name of the target_folder to the new_name
          exit repeat
        else
          set the name_increment to the name_increment + 1
        end if
      end repeat
    end if
  end tell
  return the target_name
end resolve_conflicts

-- this sub-routine processes files
on process_item(source_file, new_name, results_folder)
  -- NOTE that the variable source_file is a file reference in alias format
  -- FILE PROCESSING STATEMENTS GO HERE
  try
    set the source_item to the quoted form of the POSIX path of the source_file
    -- the target path is the destination folder and the new file name
    set the target_path to the quoted form of the POSIX path of (((results_folder as string) & new_name) as string)
    with timeout of 900 seconds
      do shell script ("/usr/local/bin/tesseract " & source_item & " " & target_path)
    end timeout
  on error error_message
    tell application "Finder"
      activate
      display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
    end tell
  end try
end process_item
You can get fancy and create an output folder for your scanner with a Folder Action selected to convert incoming files to .TIF, and then attach another Folder Action to its TIF output folder to convert incoming files to text via OCR. Then your scans are automatically OCR'd as they arrive from the scanner. Sweet!
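For anyone who wants to try the same workflow without Folder Actions, here is a rough shell sketch of what the script does to a folder: move each .tif into "Original Files" and write the recognized text into "OCR Files". It is a simplification under stated assumptions: process_drop_folder and the TESSERACT variable are names I made up, you have to run it by hand (nothing triggers it automatically), and unlike the AppleScript it does not rename older files that have the same name, it just overwrites them.

```shell
# Path used in the article; set TESSERACT beforehand to use another copy.
TESSERACT=${TESSERACT:-/usr/local/bin/tesseract}

# Process every .tif currently sitting in the given folder.
process_drop_folder() {
    folder=$1
    done_dir="$folder/OCR Files"             # same names as the script's properties
    originals_dir="$folder/Original Files"
    mkdir -p "$done_dir" "$originals_dir" || return 1
    for image in "$folder"/*.tif; do
        [ -f "$image" ] || continue          # nothing to do
        name=$(basename "$image")
        mv "$image" "$originals_dir/$name" || continue
        # The engine appends .txt, so scan.tif yields "OCR Files/scan.tif.txt"
        "$TESSERACT" "$originals_dir/$name" "$done_dir/$name" ||
            echo "OCR failed for $name" >&2
    done
}
```

The output name keeps the .tif part (scan.tif.txt) because, as in the Folder Action with an empty newimage_extension, the original file name is passed to the engine as the output base.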

[robg adds: I haven't tested this one.]

Mac OS X Hints
http://hints.macworld.com/article.php?story=2010021805585497