How to retrieve text from Windows Office 2007 Word docs
Dec 08, '06 07:30:00AM
Contributed by: nicbav
Browsing makezine.com turned up a way to pull just the text from a .docx file created by the new Word in Office 2007 for Windows. This page has the info you need -- a simple PHP script that will pull the text from the file.
I think OpenOffice.org 2.0 may be able to help, but I haven't tried it yet, so I would love to hear from anyone who has made this work.
[robg adds: Several other solutions are mentioned on that page. Note that, as of now, all of them strip the formatting from the file, leaving just the text. Microsoft has promised free converters for older versions of Office on the Mac. I'll list the other solutions here for easy reference for anyone searching:
- docx-converter.com -- a website that takes a .docx file as input and spits out the pure text content.
- An Automator script that does the same thing.
- If you own BBEdit or TextMate (or probably others), they have a "strip all tags" function you can use on the Word XML file. To see the XML file, though, you first need to change the .docx extension to .zip, then expand that archive in the Finder. Open the resulting folder, go into the word folder, and open the document.xml file in BBEdit or TextMate, then use the app's strip tags function to pull out the text.
With the recent news that the XML converters won't be out until April or so of next year for current versions of Office, I think tricks like this are going to be increasingly necessary. Hopefully some brilliant coder out there will figure out how to parse the XML before Microsoft does, as losing all formatting is far from ideal.]
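The manual unzip-and-strip-tags route described above can also be scripted. What follows is only a rough sketch in Python, not the PHP script from the linked page. The function name docx_to_text is made up for illustration. It assumes the archive stores its body text in word/document.xml as UTF-8, and that paragraphs close with the usual </w:p> tag. Like the other methods, it throws away all formatting.

```python
import html
import re
import zipfile


def docx_to_text(path):
    """Return the plain text of a .docx file, with all formatting dropped.

    Hypothetical helper for illustration; not taken from the linked page.
    """
    # A .docx file is a zip archive; the body text lives in word/document.xml.
    # Assumption: the XML is stored as UTF-8.
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Assumption: paragraphs close with </w:p> (the conventional "w" prefix).
    # Add a newline there so paragraphs do not run together.
    xml = xml.replace("</w:p>", "\n")
    # Remove every remaining tag, much as a "strip all tags" command would.
    text = re.sub(r"<[^>]+>", "", xml)
    # Turn XML entities such as &amp; back into plain characters.
    return html.unescape(text)
```

Calling docx_to_text("report.docx"), where report.docx is a placeholder file name, should return the document's text as one string with one paragraph per line.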
Mac OS X Hints
http://hints.macworld.com/article.php?story=20061206065508184