
10.6: How to use OCR with HP multi-function printers
HP's printer software and utilities were replaced with built-in items in OS X Snow Leopard, after a few tense days where HP initially declined to support the new operating system. Many of us remember our outrage when some HP spokesdrone actually stated that users should buy new printers if they wanted Snow Leopard support. Those of us who didn't run screaming to Canon or some other brand have made do with a reduced feature set for our HP printers. In particular, OCR functionality is no longer supported by HP.

I tried the following, and not only did it work for me, but it was a really sweet solution. At its heart is Tesseract, an open-source OCR engine that originated at HP (how's that for irony) and is currently maintained (and used) by Google.

I found that some of the online instructions for deploying and using Tesseract were a little bit confusing and contradictory. But here's what worked for me. As always: proceed at your own risk; I make no guarantees, and while I have tried to be careful I can't be certain that all the necessary steps are present, or safe, nor can I provide support.

With that caveat out of the way:
  1. Be logged in to your Mac as an Administrator.
  2. Download Tesseract 2.04 [1.1MB download]
  3. Expand the downloaded file. The expansion process will create a new subfolder, tesseract-2.04, inside your Downloads folder.
  4. Download the English dictionary file [992KB download]
  5. Expand the dictionary download. Copy the contents of the resulting folder to the tessdata subfolder inside the aforementioned tesseract-2.04 folder.
  6. Open a Terminal window (sorry), and cd ~/Downloads/tesseract-2.04.
  7. Issue the following three commands (without the $ prompt). Each will take a minute or two to run:
    $ ./configure
    $ make
    $ sudo make install
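    In most setups you will still need the sudo even when logged in as an Administrator, because make install writes to /usr/local; being an Administrator is what allows you to use sudo in the first place.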
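The build steps above assume that a C++ compiler and make are already on your system; as commenters on this hint point out, on OS X they come with the Xcode tools. Here is a minimal sketch of a check you could run before ./configure. The function name missing_tools is hypothetical, not part of Tesseract or OS X.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each listed command that
# cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check for the two tools the build relies on before running ./configure.
missing=$(missing_tools g++ make)
if [ -n "$missing" ]; then
  echo "Install the Xcode tools first; not found:" $missing
fi
```

If nothing is printed, both tools were found and the three build commands should at least get started.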
You have now installed your OCR engine. It can OCR uncompressed .tif files (only) very quickly. However, they must have a .tif extension, not .tiff.

How to OCR

There's a quick way with Terminal, and a cool way with Folder Actions; I'll describe both.

Terminal: To create a text file named someimage_text.txt from an uncompressed .tif file named someimage.tif, issue this command after cd-ing to the folder containing the .tif (Tesseract adds the .txt extension to the output name itself):
/usr/local/bin/tesseract someimage.tif someimage_text
It's that easy. Now, my HP all-in-one's scanner produces a few graphics formats including .png, but not uncompressed .tif. No matter: Apple's built-in Folder Action that converts images to TIFF outputs uncompressed files when fed a .png file from the scanner. Or you can load your scanned image into Preview and use Save As to save it in TIFF format; just be sure to select no compression and to give the file a .tif extension.
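If you have a whole folder of uncompressed .tif files, the single command above can be wrapped in a loop. This is only a sketch: the function name ocr_folder and the TESSERACT variable are made up for illustration, and the path is the one used in this hint.

```shell
#!/bin/sh
# Path used in the hint; set TESSERACT beforehand if yours lives elsewhere.
TESSERACT="${TESSERACT:-/usr/local/bin/tesseract}"

# Hypothetical helper: OCR every .tif file in the given directory.
# scan.tif produces scan_text.txt, because Tesseract appends .txt itself.
ocr_folder() {
  dir="$1"
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue   # no .tif files: the pattern stays literal
    "$TESSERACT" "$img" "${img%.tif}_text" || echo "failed: $img" >&2
  done
}
```

Usage would be something like ocr_folder ~/Documents/Scans (the folder name is just an example). Files are processed one at a time, and a failure on one file is reported without stopping the rest.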

Folder Action OCR: I made a sweet Folder Action script (my very first!) by slightly modifying an existing Folder Action script from Apple which converts file formats by calling a shell script. In this case, I modified the script to call the tesseract command shown above. Just open AppleScript Editor (called Script Editor before Snow Leopard) and paste the following code in; it works beautifully.
(*
convert - do ocr via shell script

This Folder Action handler is triggered whenever items are added to
the attached folder.

The script converts files from uncompressed .tif format to plain text using
the open-source Tesseract OCR engine,
http://code.google.com/p/tesseract-ocr/

Copyright © 2002–2007 Apple Inc. [with mods by sjinsjca]

You may incorporate this Apple sample code into your program(s) without
restriction.  This Apple sample code has been provided "AS IS" and the
responsibility for its operation is yours.  You are not permitted to
redistribute this Apple sample code as "Apple sample code" after having
made changes.  If you're going to redistribute the code, we require
that you make it clear that the code was descended from Apple sample
code, but that you've made changes.  ===> Duly noted, changes have
been made. --sjinsjca
*)

property done_foldername : "OCR Files"
property originals_foldername : "Original Files"
property newimage_extension : ""
-- the list of file types which will be processed
-- eg: {"PICT", "JPEG", "TIFF", "GIFf"}
property type_list : {"TIFF"}
-- since file types are optional in Mac OS X,
-- check the name extension if there is no file type
-- NOTE: do not use periods (.) with the items in the name extensions list
-- eg: {"txt", "text", "jpg", "jpeg"}, NOT: {".txt", ".text", ".jpg", ".jpeg"}
property extension_list : {"tif"}


on adding folder items to this_folder after receiving these_items
  tell application "Finder"
    if not (exists folder done_foldername of this_folder) then
      make new folder at this_folder with properties {name:done_foldername}
    end if
    set the results_folder to (folder done_foldername of this_folder) as alias
    if not (exists folder originals_foldername of this_folder) then
      make new folder at this_folder with properties {name:originals_foldername}
      set current view of container window of this_folder to list view
    end if
    set the originals_folder to folder originals_foldername of this_folder
  end tell
  try
    repeat with i from 1 to number of items in these_items
      set this_item to item i of these_items
      set the item_info to the info for this_item
      if (alias of the item_info is false and the file type of the item_info is in the type_list) or (the name extension of the item_info is in the extension_list) then
        tell application "Finder"
          my resolve_conflicts(this_item, originals_folder, "")
          set the new_name to my resolve_conflicts(this_item, results_folder, newimage_extension)
          set the source_file to (move this_item to the originals_folder with replacing) as alias
        end tell
        process_item(source_file, new_name, results_folder)
      end if
    end repeat
  on error error_message number error_number
    if the error_number is not -128 then
      tell application "Finder"
        activate
        display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
      end tell
    end if
  end try
end adding folder items to

on resolve_conflicts(this_item, target_folder, new_extension)
  tell application "Finder"
    set the file_name to the name of this_item
    set file_extension to the name extension of this_item
    if the file_extension is "" then
      set the trimmed_name to the file_name
    else
      set the trimmed_name to text 1 thru -((length of file_extension) + 2) of the file_name
    end if
    if the new_extension is "" then
      set target_name to file_name
      set target_extension to file_extension
    else
      set target_extension to new_extension
      set target_name to (the trimmed_name & "." & target_extension) as string
    end if
    if (exists document file target_name of target_folder) then
      set the name_increment to 1
      repeat
        set the new_name to (the trimmed_name & "." & (name_increment as string) & "." & target_extension) as string
        if not (exists document file new_name of the target_folder) then
          -- rename to conflicting file
          set the name of document file target_name of the target_folder to the new_name
          exit repeat
        else
          set the name_increment to the name_increment + 1
        end if
      end repeat
    end if
  end tell
  return the target_name
end resolve_conflicts

-- this sub-routine processes files
on process_item(source_file, new_name, results_folder)
  -- NOTE that the variable source_file is a file reference in alias format
  -- FILE PROCESSING STATEMENTS GO HERE
  try
    set the source_item to the quoted form of the POSIX path of the source_file
    -- the target path is the destination folder and the new file name
    set the target_path to the quoted form of the POSIX path of (((results_folder as string) & new_name) as string)
    with timeout of 900 seconds
      do shell script ("/usr/local/bin/tesseract " & source_item & " " & target_path)
    end timeout
  on error error_message
    tell application "Finder"
      activate
      display dialog error_message buttons {"Cancel"} default button 1 giving up after 120
    end tell
  end try
end process_item
You can get fancy and create an output folder for your scanner with a Folder Action selected to convert incoming files to .TIF, and then attach another Folder action to its TIF output folder to convert incoming files to text via OCR. Then your scans are automatically OCR'd as they arrive from the scanner. Sweet!

[robg adds: I haven't tested this one.]

The following comments are owned by whoever posted them. This site is not responsible for what they say.
10.6: How to use OCR with HP multi-function printers
Authored by: Coumerelli on Feb 18, '10 07:49:04AM
Is this really 10.6+ only? And only with HP printers? I'm going to try with my 10.5 iMac and non-HP all-in-one!
---
"The best way to accelerate a PC is 9.8 m/s2"
Edited on Feb 18, '10 07:49:36AM by Coumerelli


10.6: How to use OCR with HP multi-function printers
Authored by: fracai on Feb 18, '10 08:01:37AM

Apart from the introduction dealing with HP dropping support for OCR, this has nothing to do with HP devices. The hint, as it relies on the Tesseract software, will work with any OS that supports the software.

---
i am jack's amusing sig file



10.6: How to use OCR with HP multi-function printers
Authored by: sjinsjca on Feb 18, '10 08:51:21AM

Coumerelli, the folder-actions tricks should work with all OS X versions that support folder actions... I'd imagine that includes 10.5.

The command-line stuff should work with all versions of OS X, can't see any reason it wouldn't.

I've also found a GUI interface to the Tesseract OCR script for 10.5 and later: http://download.dv8.ro/files/TesseractGUI/

Keep in mind that the basic Tesseract script takes uncompressed TIFF files only. So, whatever your scanner produces, you'll need to convert to uncompressed TIFF. The folder action trick does that when fed a .png.

There are ways to make Tesseract work with other formats if you really need to, and you can find those with a little googling and implement them with more command-line fussing. More trouble than it's worth, IMHO, given how easy it is to do uncompressed TIFF conversions under OS X.

One thing I've found is that the folder action for the OCR doesn't like to be fed multiple files all at once. It seems to prefer to have the first file converted and no other folder actions underway. This is no problem if your intent is to have it auto-OCR images as they come from the scanner (and any conversion process). But if you drag a whole bunch of TIFF files into the folder-action-enabled "OCR me" folder, some of the files will be missed. This appears to possibly point to a bug in the folder-actions mechanism.



10.6: How to use OCR with HP multi-function printers
Authored by: sevenfeet on Feb 18, '10 08:15:37AM

Although this was the case when Snow Leopard launched last year, HP quietly updated their printer software and supporting scanning/fax applications for many of their Officejet printers some months later to be compatible with Snow Leopard.



10.6: How to use OCR with HP multi-function printers
Authored by: sjinsjca on Feb 18, '10 02:04:55PM

No OCR is yet available from HP for my top-of-the-line, 2006-vintage OfficeJet multifunction.

Maybe they're helping customers with other models, but not me.

I have no complaint about the printer or what functionality is available right now, but when they declined to support OCR after the dust settled following SL's release, they took away part of what I paid for.

Meanwhile, gotta say, it makes no sense to have to go to System Preferences, Print & Fax, and then hit a "Scan" tab in order to access my scanner. A very un-Mac user experience.



10.6: How to use OCR with HP multi-function printers
Authored by: mchagers on Feb 19, '10 12:50:42AM

This may depend on the specific printer type, but after the upgrade to 10.6.2 (I believe; it may also have been 10.6.1) the driver for my HP 1350 AIO was updated to allow the use of its scanner through the standard SL imaging interface. This means I can now open Preview, choose File->Import from scanner->HP 1350, and it will let me scan from the application (and save as uncompressed tif right away).
This will work for all apps that support the imaging interface, which includes at least Preview and Image Capture.
In addition you can use the Capture Image Service in other apps to scan/import the image through Image Capture. This works well in Pages for instance.



10.6: How to use OCR with HP multi-function printers
Authored by: gregraven on Feb 18, '10 02:18:54PM

I use the HP software too, both for scanning images and for OCR. Works fine. If I remember correctly, VelOCRaptor also worked for me while I was awaiting the new software from HP, but the HP Scan stuff is easier to use, IMHO.

---
--
Greg Raven
Apple Valley, CA



10.6: How to use OCR with HP multi-function printers
Authored by: orca1 on Feb 18, '10 08:21:26AM

I got the HP Scan.app software to work without any problems, including its IRIS OCR functionality. After the upgrade to Snow Leopard, I simply tried to install the full-featured HP software dated Sep 2009 and available on HP's website at http://tinyurl.com/dmjgvn. I was pleasantly surprised by how easy this app is to use and how well it performs. The PDFs that are generated are a bit larger than necessary, but I just post-process the documents with the quartz filter in Preview to reduce the file size.

I am using HP Scan.app v2.1.3 (7) on Mac OS X v10.6.2 on a MacBook Pro.

Hope this helps.



10.6: How to use OCR with HP multi-function printers
Authored by: sjinsjca on Feb 18, '10 02:22:56PM

Y'know, if I had tried that, and it worked, I wouldn't have had to go down the path that led me to Tesseract. But I didn't try it because HP has all sorts of warnings not to do so.

Anybody have any idea why? They lost a ton of customer goodwill in this episode, and why do that if the old software works? Are there hidden consequences somewhere?



10.6: How to use OCR with HP multi-function printers
Authored by: orca1 on Feb 18, '10 02:34:12PM

Yeah, I was confused as well since the HP apps are dated Sep of last year and even now after five months one gets the impression from the chats on the Web that no HP solution exists. But perhaps I just had dumb luck with these drivers, while they may not work for others?

Nevertheless, I much appreciate your effort and that you have shared your workaround with this community. Don't worry, it is just a matter of time until the next upgrade and HP software not working. Also, your hint is handy if one needs to do OCR on existing tiff files.

Again, thanks and keep the hints coming!



10.6: How to use OCR with HP multi-function printers
Authored by: mchagers on Feb 19, '10 12:58:47AM

Apparently it would work sometimes, and not in other cases (models?).
Rather than fixing the problematic cases, HP chose to tell everyone to stop using the old software.

Before the driver update I tried every tip I could find, reinstalled the same HP software, it would install, it would start a scan and then report some obscure error that I couldn't find any info on.

Finally I just gave up, vowing to never buy an HP printer ever again. Then after a SL upgrade suddenly I could access the scanner from Preview (see my other comment of today).



Glad to hear it works
Authored by: jecwobble on Feb 18, '10 08:51:11AM

I don't have an HP printer/scanner, but I tried unsuccessfully to install Google's tesseract last year. I'll try these steps and see if I get better results. Thanks.



Glad to hear it works
Authored by: sjinsjca on Feb 18, '10 08:53:16AM

If you run into trouble, please post all details. It worked well for me but, as you experienced, the instructions found on-line elsewhere are pretty terrible.

Hope the process I documented works for you. As an OCR engine, Tesseract really rocks.



10.6: How to use OCR with HP multi-function printers
Authored by: onkelringelhuth on Feb 18, '10 09:36:40AM

Works like a champ. And in French and German, after I'd downloaded the dictionaries from Google Code. HP not required: I used it with TIFF output obtained from a Canon scanner using Image Capture. It's not entirely house-trained: give it a file name it does not like, and it crashes. Maybe I should be a good citizen and submit a patch...

Can't say for sure if it needs Snow Leopard, but a glance at the code suggests it should be pretty portable. (In particular, it doesn't use threads, which would make it faster on modern Macs. Not that it's a slouch anyway.)



10.6: How to use OCR with HP multi-function printers
Authored by: tcsdoc on Feb 18, '10 09:39:36AM

As you can tell, I'm not a Terminal guy. Did not make it past the make command. What do I need to install to make this work?

iMac:tesseract-2.04 tcsdoc$ ./configure
checking build system type... i686-apple-darwin10.2.0
checking host system type... i686-apple-darwin10.2.0
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name...
configure: error: C++ compiler cannot create executables
See `config.log' for more details.
iMac:tesseract-2.04 tcsdoc$ make
-bash: make: command not found
iMac:tesseract-2.04 tcsdoc$



Install XCode Tools
Authored by: barko192 on Feb 18, '10 12:23:00PM
You have to install the Xcode tools (available for free from Apple).
They include the GNU C/C++ compiler (gcc/g++) and GNU make, which are both required to build software from source.

Install XCode Tools
Authored by: sjinsjca on Feb 18, '10 02:06:48PM

Thanks for pointing that out. I'd installed XCode long ago so had no idea this wasn't part of the standard OS X configuration.



10.6: How to use OCR with HP multi-function printers
Authored by: simondorfman on Feb 18, '10 09:53:01AM
I don't have an HP printer, but I did try out the OCR on a .tif and it worked so-so. If you use macports you can also install it with
$ sudo port install tesseract


10.6: How to use OCR with HP multi-function printers
Authored by: chrischram on Feb 18, '10 03:20:04PM

Also available for Fink users: fink install tesseract



10.6: How to use OCR with HP multi-function printers
Authored by: jubalkessler on Feb 18, '10 10:01:18AM

The Folder Action OCR step needs clarification, please. What is, on Snow Leopard, the exact procedure (from beginning to end) for using the script you've written/modified and shown in this textarea?

When replying, please keep in mind that I am speaking from the standpoint of a complete newbie. Please indicate the applications to open (for example: there is no such thing as a "Folder Actions script editor" in my Utilities folder) and the steps necessary to get to the place where one can paste in the script you've provided.

You might be talking about context menu items, but that still isn't apparent to a complete newbie.

Thanks!



10.6: How to use OCR with HP multi-function printers
Authored by: sjinsjca on Feb 18, '10 02:19:55PM

Oh, Folder Actions are wonnnnderful. Sorry to have been terse on the how-to aspect. (Have you seen my recipe for chicken pie? First you catch a chicken, then you bake it in a pie!)

Some resources and examples that will get you started:

http://www.tuaw.com/2009/03/26/applescript-exploring-the-power-of-folder-actions-part-iii/

http://dougscripts.com/itunes/itinfo/folderaction01.php

...The second one notes, "As of Snow Leopard (OS 10.6), Script Editor.app has been renamed AppleScript Editor.app and is located in your /Applications/Utilities/ folder." So, depending on what version of OS X you're using, now you know what you want to use and where to look for it.

Open that app up. Paste in the code from the post here. Save it. Easiest to just save it in your Documents folder, then drag it to ~/Library/Scripts/Folder Action Scripts ...OS X will ask for authentication if needed. Give that, and voila, your new script is now available to be attached to any folder.

So let's do that. Make a folder somewhere handy. Right-click on it. (On a Mac laptop, press the trackpad with two fingers and click.) In the menu that pops up, scroll all the way down to Folder Actions Setup. In the box that pops up, click on the name of the script you just created. Click the Attach button. Done.

Now anytime you drag a file into that folder, it'll get processed by that script. Gad, it's a wonderful feature of OS X. Go crazy with it, you'll love it.



10.6: How to use OCR with HP multi-function printers
Authored by: codine on Feb 18, '10 10:08:29AM

I just want to point out a little more specifically... TIFF files are generally saved with the tiff extension in OS X. If you use Preview for example to save your JPEG as an uncompressed TIFF for tesseract it'll make a file ending in .tiff which tesseract won't open, it wants .tif only.

Remember kids... .jpg is a JPEG, and .tif is a TIFF. Thank DOS for its 3-character file extensions causing that. If you're going to automate converting JPEG to TIFF and passing it on to this script, be sure to enforce a single letter f in the extension.
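One way to enforce the single f from Terminal is to rename the files before Tesseract sees them. A minimal sketch, assuming the files sit in one folder; the function name tiff_to_tif is made up for illustration, and it only handles the lowercase .tiff spelling.

```shell
#!/bin/sh
# Hypothetical helper: rename every *.tiff file in a directory to *.tif,
# leaving a file alone if the .tif name is already taken.
tiff_to_tif() {
  dir="$1"
  for f in "$dir"/*.tiff; do
    [ -e "$f" ] || continue   # no .tiff files: the pattern stays literal
    target="${f%.tiff}.tif"
    if [ -e "$target" ]; then
      echo "skipping $f: $target already exists" >&2
    else
      mv "$f" "$target"
    fi
  done
}
```

Renaming only changes the extension; the file still has to be an uncompressed TIFF for a stock Tesseract build to read it.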



10.6: How to use OCR with HP multi-function printers
Authored by: paulw on Feb 18, '10 11:45:37AM

To tcsdoc and others who had problems getting it to build:
based on posts I found searching google, I reinstalled XCode and the install command worked.

(I don't know whether XCode has to be installed in the first place to make this work, but it looks like something about my XCode installation was messing with some of the commands or command paths... and my installation had been imported from my previous Mac, maybe that's why.)



Quick edits to folder action script, and a quick folder action scripts how to
Authored by: paulw on Feb 18, '10 02:09:59PM
The folder action script works as long as your TIFF file extension is ".tif" and not ".tiff" (which seems to be the default for images my Mac converts to TIFF). tesseract only likes ".tif" extensions.

sjinsjca's script seems to be set up for making the adjustment, but it doesn't quite do it. Here's what you do to edit the script to change ".tiff" files to ".tif" before they get fed to the tesseract command.

1. change the line:
property newimage_extension : ""
to
property newimage_extension : "tif"

2. change the line:
property extension_list : {"tif"}
to
property extension_list : {"tif", "tiff"}
3. after the line:
set the source_file to (move this_item to the originals_folder with replacing) as alias
add this line:
set name of source_file to new_name

The script should now successfully process files ending in ".tiff" as well as ".tif".

Quick Folder Action Script Creation Steps

1. Copy the script text from the hint.

2. Open the application "AppleScript Editor" (in Applications > Utilities)

3. Paste the script text into the script editing window.
4. Hit "compile" and it will probably give you an error message because there are line breaks from your pasted text that shouldn't be there. In most cases you can just hit "ok" and then hit the space bar to replace the highlighted linebreak with a space. Sometimes it requires manually fixing a linebreak-- in this script, "giving up after 120" should not be on its own line, but should finish the line before it.

5. When you can hit "compile" without an error message, consider making the edits I suggested.

6. Save the script in your User folder > Library > Scripts > Folder Action Scripts. If you don't have a "Folder Action Scripts" folder, create one there.

7. Do a Spotlight search for "Folder Actions Setup.app" and fire it up.

8. Select the folder (create it first in Finder if need be) you want to add a folder action script to. On the right-hand pane, hit the + sign and select the script you just saved from the available list.

9. Be sure "Enable Folder Actions" is checked, and quit.
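Step 6 can also be done from Terminal. This is only a sketch: the function name install_folder_action is hypothetical, and the default destination is the per-user Folder Action Scripts folder named in step 6.

```shell
#!/bin/sh
# Hypothetical helper: copy a saved script into the per-user
# Folder Action Scripts folder, creating that folder if it is missing.
# An optional second argument overrides the destination folder.
install_folder_action() {
  script="$1"
  dest="${2:-$HOME/Library/Scripts/Folder Action Scripts}"
  mkdir -p "$dest" && cp "$script" "$dest/"
}
```

For example, install_folder_action ~/Documents/ocr.scpt would copy a script saved under that (made-up) name into place, after which it should show up in Folder Actions Setup.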

Quick edits to folder action script, and a quick folder action scripts how to
Authored by: sjinsjca on Feb 18, '10 02:24:37PM

Wonderful suggestions, thanks very much.



10.6: How to use OCR with HP multi-function printers
Authored by: tcsdoc on Feb 18, '10 05:30:26PM

Thanks for the help in getting the make command to work. Installing Xcode did the trick. This leads to my next problem. Tesseract runs but I get the error message below:

iMac:SCRATCH tcsdoc$ tesseract scan.tif scan_text
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
tesseract:Error:Read of file failed:Scan.tif
Segmentation fault

I have an Epson scanner and use Image Capture to scan the document. I've loaded the scan.tif file into Preview and saved it with no compression but still get the same error. Any ideas on this?



10.6: How to use OCR with HP multi-function printers
Authored by: TonyT on Feb 23, '10 08:06:13AM

>read_tif_image:Error:Illegal image format:Compression

You need to either save as an uncompressed TIFF (open in Preview and Save As uncompressed TIF), or install libTIFF, then re-install tesseract (see my comment below).



10.6: How to use OCR with HP multi-function printers
Authored by: TonyT on Feb 22, '10 05:39:18PM

Thanks for the tip. An 'out-of-the-box' limitation is the lack of support for multi-page TIFFs. However, if you install libTIFF (BEFORE installing tesseract), you will get support not only for multi-page TIFFs but also for compressed TIFFs.

Get libTIFF 3.9.2 here: http://download.osgeo.org/libtiff/
libTIFF home page: http://www.remotesensing.org/libtiff/

note, this is mentioned in the FAQ: http://code.google.com/p/tesseract-ocr/wiki/FAQ

Does it support multi-page tiff files?

Only with 2.03 and later, and only if you have libtiff installed. See Compressed Tiff above.



10.6: How to use OCR with HP multi-function printers
Authored by: imageshift on Apr 06, '10 03:31:11AM


Here's how I did an OCR scan in Snow Leopard using my HP 7210 all in one:

1st I updated the driver.

2nd I clicked on /Applications/Hewlett-Packard/HP Scan.app

3rd I chose Scan Documents

4th I hit the Save icon at the top, chose format: TXT, and made sure the contents were saved to a single file.

Works like a charm...It still uses Readiris software behind the scene



HP 1312nfi & 10.6: How to use OCR with HP multi-function printers
Authored by: rvamerongen on May 04, '10 03:18:13AM

Hi,

Does anyone know what to do to get HP 1312nfi scanning working under 10.6.x?
For example, I can't scan using Preview.app, and I don't see my scanner in the Print & Fax pane, even after selecting the printer.

Thanks


