
10.4: Batch text conversion with textutil UNIX
Tiger only hint
I first learned about textutil in Mac OS X 10.4 in a tip here on macosxhints.com.

textutil is a Rosetta Stone for converting between different text file formats. For example, I recently wanted to convert 36,000 .doc files into text files, so I needed a way to convert all the files recursively. The Unix find command can be used to feed textutil. In Terminal, navigate to the appropriate directory (the command starts from the current directory, "."), and enter this command:
find . -name *.doc -exec textutil -convert txt '{}' \;
Read aloud: Find recursively in the current directory, by name, all the doc files. Execute the textutil 'convert to text' command with the found files. Bingo, done.
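
Before running this across tens of thousands of files, it may be worth previewing what would be executed. The following is a minimal sketch, not part of the original hint: preview_conversion is a hypothetical helper name, and echo stands in for the real command so nothing is converted. The pattern is quoted so that the shell passes *.doc to find unexpanded.

```shell
# preview_conversion is a hypothetical helper, not part of textutil.
# It prints the textutil command that would run for each .doc file
# under the given directory (default: the current directory).
preview_conversion() {
    # Quoting '*.doc' stops the shell from expanding it before find sees it.
    find "${1:-.}" -type f -name '*.doc' -exec echo textutil -convert txt '{}' \;
}

preview_conversion .
```

Removing the echo from the find line would run the conversions for real.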

10.4: Batch text conversion with textutil
Authored by: GlowingApple on Mar 14, '06 06:54:37AM

Great hint on a great utility. I never knew this command existed. This could easily be used to convert a folder of documents for viewing on an iPod. Now to find a util to pull text out of a pdf file...

---
Jayson --When Microsoft asks you, "Where do you want to go today?" tell them "Apple."



Ghostscript can convert PDF to text
Authored by: TrumpetPower! on Mar 14, '06 09:34:29AM

Ghostscript can convert PDF files to plain text, though you might not be terribly happy with the results. That's not Ghostscript's fault, though--it depends entirely on the nature of the particular PDF in question. For example, if the text was converted to paths before being outputted as PDF, you won't get anything. Often, kerning is done by starting a new block of text at that point, which can r esul t in w eir d gap s in t he t e x t. And so on.

Your best bet may be the full version of Acrobat (not the reader), since it includes OCR and other niceties. But, unless the PDFs were specifically created in a manner to keep the text machine- as well as human-readable (for speakable text, for example), don't plan on it being a fully-automated process.

Cheers,

b&



10.4: Batch text conversion with textutil
Authored by: johnga1t on Mar 14, '06 09:58:02AM
You can try ps2ascii (it works on PS and PDF files) if you have any teTeX or LaTeX packages installed (you can get them from Fink or from ii2). As mentioned above, you may see some funny spacing, but it's better than nothing.

10.4: Batch text conversion with textutil
Authored by: fds on Mar 14, '06 12:39:08PM

textutil actually used to be able to convert from pdf files in the original release of Tiger. However, somewhere around the 10.4.4 update, this feature was taken away. I had to revert to pdftotext from Xpdf: http://www.foolabs.com/xpdf/download.html



10.4: Batch text conversion with textutil
Authored by: tullius on Mar 14, '06 10:31:24AM

Cool hint.

Anyone know how to convert FrameMaker files to something more modern - I used to use Frame as my word processor until OSX, and now I am stuck with using Classic to get at my Frame files. I am worried that when I upgrade to an Intel Mac I will be in deep trouble!

---
---Tom Tullius



10.4: Batch text conversion with textutil
Authored by: fitzgunnar on Mar 15, '06 04:59:34AM

I do not know how many files you have, but you could open the fm files in FrameMaker and save them in some more accessible format, e.g. doc or rtf. HTML could work too.

/MagnusG!



10.4: Batch text conversion with textutil
Authored by: gregraven on Mar 16, '06 05:58:14AM

Opening each FM file and saving it as something else is a painful process, unless you have the FM files in books, and are saving them as HTML files. To convert FM to Word documents, for example, requires a lot of steps, and after converting several files, FM will vapor-lock, so you'll have to force quit and then restart FM.

---
--
Greg Raven
Apple Valley, CA



10.4: Batch text conversion with textutil
Authored by: nerkles on Mar 15, '06 07:45:26AM
Don't worry, count on it. There is no "Classic" on the Intel Macs. I don't have an answer for you other than to try converting them to something else (RTF, PDF, HTML, whatever) before you get your Intel Mac, or to keep a PPC machine around just for this.

10.4: Batch text conversion with textutil
Authored by: pc_junk on Mar 14, '06 12:57:45PM

A very minor suggestion: it would be much more efficient to use xargs to batch up the conversions. The current command executes textutil once for each file. Since textutil accepts multiple files, you can do something like this:

find . -name \*.doc -print0 | xargs -0 textutil -convert txt

This builds a textutil command line up to the maximum length allowed by the OS, so multiple conversions run in a single process rather than forking a separate instance for each file. The -print0 and -0 arguments allow filenames containing spaces to be passed correctly. Also, you need to escape the wildcard in the find statement.

xargs tends to be a rather overlooked utility, but it can be very useful!
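
To see the batching described above without converting anything, here is a small sketch. The file names are invented, and a tiny sh -c script stands in for textutil, reporting how many arguments each invocation receives.

```shell
# Illustrative only: the file names are invented and the sh -c script
# stands in for textutil.
report='echo "invocation with $# file(s)"'

# NUL-delimited names (as find -print0 would emit), all passed to one process.
batched=$(printf '%s\0' 'annual report.doc' 'notes.doc' 'draft.doc' |
    xargs -0 sh -c "$report" sh)

# -n 1 limits each invocation to one argument, like find -exec ... \;
per_file=$(printf '%s\0' 'annual report.doc' 'notes.doc' 'draft.doc' |
    xargs -0 -n 1 sh -c "$report" sh)

printf 'batched:\n%s\nper file:\n%s\n' "$batched" "$per_file"
```

The name containing a space is counted as a single argument in both cases, which is the point of pairing -print0 with -0.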





10.4: Batch text conversion with textutil
Authored by: SOX on Mar 14, '06 10:41:38PM

I don't get it. xargs simply puts one line onto the command line at a time. -0 says the arguments are delimited by nulls, and -print0 replaces the newlines with nulls. So this will put all of the find results onto a single line. How does xargs know when the line buffer has overflowed and it needs to re-invoke the command? I can't find this documented.



10.4: Batch text conversion with textutil
Authored by: hopthrisC on Mar 15, '06 02:45:16AM
Why do you want this documented? Isn't it enough to know that xargs does know about the limits of the command line length?

If you are really interested, read the source.


10.4: Batch text conversion with textutil
Authored by: pc_junk on Mar 16, '06 12:16:49PM

It gets this from sysconf(3) -

_SC_ARG_MAX
The maximum bytes of argument to execve(2).
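
The same value can be inspected from the shell. This is a minimal sketch, assuming getconf is available (it is a standard POSIX utility). The file names are invented, and -n 2 is there only to make the splitting easy to see; in normal use xargs splits by total command-line size rather than by argument count.

```shell
# Ask the system for the limit that sysconf(_SC_ARG_MAX) reports, in bytes.
limit=$(getconf ARG_MAX)
echo "ARG_MAX is $limit bytes"

# Force tiny batches with -n 2 so the re-invocation is visible:
# xargs starts a new echo whenever the current batch is full.
batches=$(printf '%s\n' a.doc b.doc c.doc d.doc e.doc | xargs -n 2 echo)
printf '%s\n' "$batches"
```

Each output line of the second command corresponds to one separate run of echo.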



10.4: Batch text conversion with textutil
Authored by: SDO on Jun 04, '06 01:00:04PM
Would seem that 10.3.x users like myself could get the same functionality by installing darwinports, and issuing the command as root:

# port install coreutils

and then of course using what is in

# /usr/local/bin/ps2ascii filepath.pdf #<-- CLI which outputs pdf as ascii

For instructions on how to install darwinports, just visit www.darwinports.org, where there is a 10.3.x binary distribution available at

http://darwinports.org/getdp/

Hope this helps folks like myself that refuse to pay for 10.4.x.

Good luck.

10.4: Batch text conversion with textutil
Authored by: chacher on Mar 05, '08 05:40:36AM

can someone please tell me how to output the converted files to a new/different directory?
thanks

OR

point me towards a good intro to Terminal – I've been putting off learning it, but after this tip, I'm READY
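
For the first question, textutil's -output option names the output file, so a loop can send each result to another directory. This is a sketch under some assumptions: convert_to_dir and its arguments are hypothetical names, all output files land in one flat directory (same-named files from different subfolders would overwrite each other), and file names are assumed not to contain newlines.

```shell
# convert_to_dir SRC DEST [CONVERTER]
# Hypothetical helper: converts every .doc under SRC and writes the .txt
# files into DEST. CONVERTER defaults to textutil; the parameter exists
# only so the loop can be tried with a stand-in command.
convert_to_dir() {
    src=$1
    dest=$2
    converter=${3:-textutil}
    mkdir -p "$dest" || return 1
    # Reads one path per line, so names containing newlines are not handled.
    find "$src" -type f -name '*.doc' -print |
    while IFS= read -r file; do
        name=$(basename "$file" .doc)
        # -output sets the output file name for this single input file.
        "$converter" -convert txt -output "$dest/$name.txt" "$file" < /dev/null
    done
}
```

After pasting the function into Terminal, something like convert_to_dir . ~/Desktop/converted would convert everything below the current directory (the destination path here is only an example).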



10.4: Batch text conversion with textutil
Authored by: morespace54 on Mar 05, '08 10:47:55AM
10.4: Batch text conversion with textutil – PHP?
Authored by: chacher on Mar 17, '08 03:45:54AM

Any way to use this, or another command/application, to also convert PHP files?

I need to extract plain text from downloaded websites. Any advice would be greatly appreciated.
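
One possible approach, offered as a sketch rather than a tested recipe: pages saved from a website under a .php name usually contain the HTML the server sent, and textutil has a -format option that forces how input files are interpreted. php_pages_to_text is a hypothetical helper name; any extra arguments are placed in front of the command, so passing echo gives a dry run.

```shell
# php_pages_to_text DIR [PREFIX...]
# Hypothetical helper: treats every .php file under DIR as HTML and asks
# textutil to convert it to plain text. Extra arguments are prefixed to
# the command, e.g. "echo" to print the command instead of running it.
php_pages_to_text() {
    dir=${1:-.}
    [ "$#" -gt 0 ] && shift
    find "$dir" -type f -name '*.php' -print0 |
        xargs -0 "$@" textutil -format html -convert txt
}
```

Running php_pages_to_text . echo first shows the command line that would be used; omitting echo performs the conversion.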


