Word Counter - Count words and characters in text
Authored by: robg on Nov 29, '05 05:01:31PM
We have hints on wc. But wc doesn't work well on .doc files, at least in my experience. For instance, consider my Chapter 14.doc file...
  • According to Word, it's 47,037 total characters (non-blank) and 9,415 words.
  • Word Counter totals it out to 45,553 characters and 7,742 words. Not exact, but close enough.
  • wc -mw returns two numbers: 49701 and 822272, just like that. No column heading, no commas, which makes reading a bit tougher. But if I'm reading it right, it's telling me that there are 49,701 words and 822,272 characters in that file. Clearly that's not correct.
When I ran wc -wc *.doc on the whole book, it told me that there were 4.3 million characters in the files. Word Counter returns a much more accurate figure of just under 1.0 million.

So explain to me what I'm doing wrong?

-rob.

Word Counter - Count words and characters in text
Authored by: daeley on Nov 29, '05 08:47:51PM
You can use the textutil CLI program along with wc to get an accurate wordcount. Here's the command:

textutil -stdout -convert txt foobar.doc | wc -w

To break it down, we're telling textutil to send its output to standard output instead of a file (-stdout) and to convert a Word-formatted file called foobar.doc to plain text (-convert txt). That output is piped to wc, where we ask for just the word count (-w).

Now then, if you have multiple files, you can combine them on the fly and produce a collective word count thusly:

textutil -stdout -cat txt *.doc | wc -w

The new flag -cat, used here in place of -convert, tells textutil to concatenate its input files; the *.doc glob supplies every .doc file in the working directory.

By the way, the commands above leave the original documents untouched.
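
For anyone who also wants the character count, with labels on the numbers, here is a rough sketch building on the same pipeline. The label_counts helper is my own invention, not part of textutil or wc, and foobar.doc is just the example name from above. Note that wc -m counts spaces and newlines too, so its figure will not match Word's "non-blank" total.

```shell
# Label the two numbers wc prints so they are easier to read.
# wc always prints the word count before the character count,
# whatever order the -w and -m flags are given in.
# (label_counts is a made-up helper name, not a standard command.)
label_counts() {
  wc -w -m | awk '{ printf "words: %d\nchars: %d\n", $1, $2 }'
}

# Example file name borrowed from the command above.
doc="foobar.doc"

# textutil ships with Mac OS X; do nothing if it or the file is missing.
if command -v textutil >/dev/null 2>&1 && [ -f "$doc" ]; then
  textutil -stdout -convert txt "$doc" | label_counts
fi
```

The helper reads from standard input, so it works just as well after the -cat variant of the command.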


Word Counter - Count words and characters in text
Authored by: robg on Nov 29, '05 09:56:29PM

I appreciate the CLI solution ... but in this case, I think I'm going to have to say the GUI is somewhat easier and quicker for me -- especially since a given folder may hold much more than just the files I wish to count. So it's a quick drag-n-drop and that's it ... nice to know, though, that I can do this via SSH if the need arises!

Thanks;
-rob.
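
If the folder holds more than the files you want counted, one option is to name the files explicitly rather than using *.doc. A minimal sketch, assuming textutil is installed; doc_to_text and count_each are hypothetical helper names, and "Chapter 15.doc" in the usage comment is a made-up file name.

```shell
# Convert one Word file to plain text on standard output.
# (doc_to_text is a made-up wrapper around the textutil command
# from the earlier comment.)
doc_to_text() {
  textutil -stdout -convert txt "$1"
}

# Print "name: word count" for each file named on the command line.
count_each() {
  for f in "$@"; do
    printf '%s: ' "$f"
    # tr strips the padding spaces wc puts in front of the number.
    doc_to_text "$f" | wc -w | tr -d ' '
  done
}

# Usage (file names are only examples):
#   count_each "Chapter 14.doc" "Chapter 15.doc"
```

Quoting "$@" and "$f" keeps file names with spaces, such as Chapter 14.doc, in one piece.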



Using textutil or antiword to convert .doc to ASCII
Authored by: zojas on Dec 02, '05 04:08:28PM
Another thing to try is 'antiword'. It's a Free program that will convert Word files to ASCII text (even the tables usually come out decent). It would be interesting to compare antiword's output to textutil's.

Word Counter - Count words and characters in text
Authored by: timbos on Nov 30, '05 04:12:30PM
"When I ran wc -wc *.doc on the whole book, it told me that there were 4.3 million characters in the files. Word Counter returns a much more accurate figure of just under 1.0 million. So explain to me what I'm doing wrong?"
Probably nothing. Word stores a heap of text in its files that isn't displayed. Do you use the versioning facility, or track changes? They tend to make the file sizes huge. Also, if you open the docs up in a text editor, you'll probably find lots of extra stuff (like your address etc. all stored in there too!)
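
To see the effect in miniature, here is a toy illustration. The "fake.doc" it builds is plain text standing in for hidden content and is nothing like the real binary Word format; the point is only that wc counts everything in the file, displayed or not.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d "${TMPDIR:-/tmp}/wcdemo.XXXXXX")

# The text a reader would actually see.
printf 'The quick brown fox\n' > "$tmp/visible.txt"

# A stand-in for a .doc: the same text wrapped in made-up "hidden"
# metadata and old revision text that wc cannot tell apart from it.
{
  printf 'AUTHOR rob ADDRESS 1 Main St\n'
  cat "$tmp/visible.txt"
  printf 'REVISION old draft text\n'
} > "$tmp/fake.doc"

visible_words=$(wc -w < "$tmp/visible.txt" | tr -d ' ')
raw_words=$(wc -w < "$tmp/fake.doc" | tr -d ' ')

echo "visible: $visible_words, raw: $raw_words"

rm -r "$tmp"
```

The raw count comes out well above the visible one, which is the same kind of inflation as the 4.3 million versus 1.0 million figures, just on a much smaller scale.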

Word Counter - Count words and characters in text
Authored by: zojas on Dec 02, '05 04:10:06PM

.doc files are most assuredly not ASCII, which is what wc is assuming.


