Use perl to repair a mis-formatted text file
Authored by: sweth on Jan 25, '07 09:56:12AM
perl -pe 's/(w)([.])(w)/$1$2 $3/g' source.txt > final.txt

First, two quick fixes:

  1. Those should, as has been pointed out, be \w, not just w.
  2. Those should in fact be \S, not \w; otherwise the command doesn't work correctly even on the short snippet of sample text the OP gave.

The quick explanation:

  1. perl -pe 'SOMETHING' input_file says to run the perl code "SOMETHING" on every line of the file input_file, printing each line after the code has run on it.
  2. The perl statement s/SEARCH/REPLACE/g, on its own, says to take the current line, find every instance of the regular expression pattern SEARCH, and replace every such instance with REPLACE.
  3. \w in a regular expression matches any one "word" character, meaning any alphanumeric or underscore character. \S matches any "non-whitespace" character, meaning basically anything other than a space, tab, or newline/carriage return.
  4. A dot (.) in a regular expression normally matches any character. A string enclosed by brackets ([string]), however, defines a character class that matches any one instance of any of the characters inside the brackets, so (for example) [abc] matches an "a", or a "b", or a "c". And a dot inside a character class loses its special meaning of matching any character, so "[.]" is just a way to say "match a literal period". You could also use "\.", where the backslash in that context "escapes" the special meaning of the dot.
  5. Parentheses in perl regular expressions "capture" whatever they match and put it into numbered buffers, numbered by the order in which the opening parentheses appear in the pattern. Each buffer is accessed through the scalar variable with the same number (e.g. whatever the first set of parentheses matched can be accessed using $1).

So, roughly translating the original version (assuming the \w fix is made), the perl command being run on every line is:

Match any "word" character, followed by a literal period, followed by another "word" character, putting those three characters into buffers 1, 2, and 3, respectively; then replace the three characters with "the contents of buffer 1" + "the contents of buffer 2" + a space + "the contents of buffer 3".
Or, even more loosely translated:
Any time you see a "word" character followed by a period followed by a "word" character, stick a space after the period.

The problem is that, as the example text shows, sometimes sentences can start/end with characters that aren't "word" characters, like quotation marks. So changing \w to \S would match those, as well.
