
How to retrieve text from Windows Office 2007 Word docs Apps
Looking through makezine.com brings up a way to pull just the text from a new Word for Windows Office 2007 .docx file. This page has the info you need -- a simple PHP script that will pull the text from the file.

I think OpenOffice.org 2.0 may be able to help, but I haven't tried it yet, so I would love to hear from anyone who has made this work.

[robg adds: On that page, several other solutions are mentioned. Note that, as of now, all of them strip the formatting from the file, providing just the text. Microsoft has promised free converters for older versions of Office on the Mac. I'll list the other solutions here for easy reference for anyone searching:
  • docx-converter.com -- a website that takes a .docx file as input and spits out the pure text content.
  • An Automator script that does the same thing.
  • If you own BBEdit or TextMate (or probably others), they have a "strip all tags" function you can use on the Word XML file. To see the XML file, though, you first need to change the .docx extension to .zip, then expand that archive in the Finder. Open the resulting folder, go into the word folder, and open the document.xml file in BBEdit or TextMate, then use each app's strip tags function to pull out the text.
With the recent news that the XML converters won't be out until April or so of next year for current versions of Office, I think tricks like this are going to be increasingly necessary. Hopefully some brilliant coder out there will figure out how to parse the XML before Microsoft does, as losing all formatting is far from ideal.]
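The manual steps in the last bullet (treat the .docx as a zip archive, open word/document.xml, strip the tags) can be sketched in Python. This is a minimal sketch, not the PHP script from the linked page: the function name docx_to_text is made up for illustration, and it assumes the file uses the standard WordprocessingML namespace. It still discards all formatting, but it keeps paragraph breaks, which plain tag stripping loses. Paragraphs nested inside text boxes may appear twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; formatting is discarded.
    """
    # A .docx file is a zip archive, so it can be read without renaming it.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W + "p"):          # <w:p> is one paragraph
        parts = []
        for node in paragraph.iter():
            if node.tag == W + "t":               # <w:t> holds a run of text
                parts.append(node.text or "")
            elif node.tag == W + "tab":           # <w:tab/> is a tab character
                parts.append("\t")
            elif node.tag in (W + "br", W + "cr"):  # explicit line breaks
                parts.append("\n")
        lines.append("".join(parts))
    return "\n".join(lines)
```

Calling print(docx_to_text("some.docx")) writes the text to the terminal; on a Mac the output can be piped to pbcopy to put it on the clipboard.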

How to retrieve text from Windows Office 2007 Word docs | 9 comments
OpenOffice.org from Novell
Authored by: DamienMcKenna on Dec 08, '06 08:09:35AM

Novell are going to be releasing binaries of OpenOffice.org with importers for MSFT's XML formats, and I'm expecting their code to be merged into the main OOo codebases soon thereafter.



How to retrieve text from Windows Office 2007 Word docs
Authored by: dan55304 on Dec 08, '06 08:43:06AM

Kind of simple for me: Word 2007 is vaporware without the converters, and I just won't buy it. Word 2007 should be working in 2009 or so 8-)



How to retrieve text from Windows Office 2007 Word docs
Authored by: macavenger on Dec 08, '06 03:58:31PM

If only it were that easy. Unfortunately, you may well need to open .docx files before then, say from a friend or coworker. That's where tricks like the ones mentioned in this hint will come in handy :)

---
iMac FP 17" 800MHz OS X 10.4.x



How to retrieve text from Windows Office 2007 Word docs
Authored by: ctierney on Dec 08, '06 09:26:30AM
Thanks for the tip! Now I'll be prepared the next time I get one of these files. Here's another method that could be wrapped into an AppleScript droplet:
unzip -p some.docx word/document.xml | perl -pe 's/<[^>]+>|[^[:print:]]+//g'
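
The one-liner above removes every tag and every non-printable character, so paragraphs run together and XML entities such as &amp; stay escaped. A minimal Python sketch of the same tag-stripping idea that handles both cases follows. The function name strip_wordml_tags is made up for illustration, and the sketch assumes the XML uses the usual w: prefix for paragraph tags.

```python
import html
import re


def strip_wordml_tags(xml_text):
    """Reduce WordprocessingML markup to plain text with regular expressions.

    Hypothetical helper for illustration; assumes paragraphs close with </w:p>.
    """
    # Mark the end of each paragraph with a newline before the tags are removed.
    text = re.sub(r"</w:p>", "\n", xml_text)
    # Remove every remaining tag, as the Perl substitution does.
    text = re.sub(r"<[^>]+>", "", text)
    # Turn entities such as &amp; and &lt; back into plain characters.
    return html.unescape(text).strip()
```

The input is the text of word/document.xml, for example the output of the unzip -p command above.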

How to retrieve text from Windows Office 2007 Word docs
Authored by: ctierney on Dec 08, '06 11:45:54AM
Here's a droplet that'll extract plain text to the clipboard:
on open these_items
   -- The open handler receives a list of dropped files; use the first one.
   set docxPath to POSIX path of (item 1 of these_items)
   try
      -- quoted form protects paths that contain spaces or shell metacharacters.
      do shell script "unzip -p " & quoted form of docxPath & " word/document.xml | perl -pe 's/<[^>]+>|[^[:print:]]+//g' | pbcopy"
   end try
end open

on run
   display dialog "Drop a docx file on this AppleScript and its plain text contents will be copied to the clipboard." buttons {"Ok"} giving up after 10 default button 1
end run


How to retrieve text from Windows Office 2007 Word docs
Authored by: hoosker on Dec 13, '06 04:30:26PM

I create lots of forms for my office staff with a lone Mac. I would love to be able to perform this Word watermark trick for them but they all use Windows and their version of Office does not import PDF files. I could convert the PDF to a bitmap image but that would make for big ugly files.



According to rumors...
Authored by: DamienMcKenna on Dec 14, '06 08:09:42PM

According to some rumors doing the rounds on the blogs, the TextEdit in Leopard will support loading docx files. Now if they'd just support ODF in all of their software we wouldn't need NeoOffice.



According to rumors...
Authored by: mnoriega on Apr 10, '07 06:08:59PM
Now that version 2.1 of NeoOffice is out, you can open docx files directly with NeoOffice.

Download it free from:
http://www.neooffice.org/

How to retrieve text from Windows Office 2007 Word docs
Authored by: jimisoft on May 30, '07 06:47:53PM
Try All2Txt 2.0 at:
http://www.jimisoft.com/en/all2txt.html

This software can retrieve text from docx files.
