
Fix a mis-formatted text file with perl and regex UNIX
Dear readers: Sigh, third try's a charm? I promise this one has no formatting errors ... but then again, that's what I thought the last time, too.

Dear readers: This hint originally appeared yesterday. However, somewhere between Smultron, Geeklog, and publication, I badly munged the contents. I managed to repeat the intro while skipping all the details I intended to explain. My apologies for this; I have nobody to blame but myself for not concentrating on the task at hand. The corrected hint appears below. Given how badly I messed up the original, I chose to re-run it today, just so folks would have a chance to read the corrected version.

So as to not confuse people, I also removed a couple of comments that basically just talked about the formatting issues with the original hint. I've left the remainder of the comments, however, and actually chose to refer to sweth's explanation directly in the hint, as it's much clearer than mine ever was! I also adjusted the command per his comments on "\w" vs. "\S," and re-titled the hint to make it a bit less confusing.

Finally, for those who commented that this hint doesn't belong here, I'd just like to point out that we have 1,226 other Unix tips in the system, and I have no intention of not publishing such tidbits. If you have no interest in seeing Unix tips, registered users can easily disable the entire category in their preferences. But OS X is built on Unix, and to claim that a Unix tip isn't relevant to OS X just isn't accurate.

-rob.

Yesterday, I was doing some global editing on a relatively large text file, and accidentally made one change too many, saved changes, and quit the editor before I noticed the problem. The result? My file was now littered with sentences that ran together at the period:
...my bearers would hurl me.As they bore me along...
...glanced at the thermometer."Gad!" he cried...
...might make reparation.I made up my mind that...
For the curious, those lines are from Edgar Rice Burroughs' book At the Earth's Core, the text of which I'm using in a comment spam blocker I'm writing for my blog site. I was editing the text to remove some of the spurious punctuation that was causing my code to misinterpret the position of word breaks, and I got overly aggressive removing some spaces. Read on to see how I resolved it with some help from a friend, and the Unix underpinnings of OS X.

I knew what I needed to do to fix the problem -- "find all instances of some character, followed by the period, followed by some other character, and add a space before that last character." But I couldn't figure out how to make that seemingly simple change. I tried using BBEdit and a couple other text editors to do my "search and replace back into," but had no luck. So I called on a friend who has tons of Unix, perl, and regular expression experience. He came up with a one-line perl solution for me:
perl -pe 's/(\S)([.])(\S)/$1$2 $3/g' source.txt > final.txt
Though this looks complex, when he explained it to me, it made at least a bit more sense. When I originally published this hint, I intended (and had drafted) my own explanation as to how that worked. But due to mistakes on my part, that explanation never made it online. Thankfully, user sweth provided an excellent writeup in the comments on how the command works -- I'd ask that you just read that explanation in lieu of my feeble attempt.

Fix a mis-formatted text file with perl and regex | 8 comments
Use perl to repair a mis-formatted text file
Authored by: club60.org on Jan 25, '07 08:48:12AM

You could do the same in TextWrangler or BBEdit:

In Find&Replace check "Use GREP", search for: "(\w)\.(\w)", replace with: "\1\. \2"



Use perl to repair a mis-formatted text file
Authored by: ocdinsomniac on Jan 25, '07 08:59:21AM

Hey, where's the explanation?



Use perl to repair a mis-formatted text file
Authored by: sweth on Jan 25, '07 09:56:12AM
perl -pe 's/(w)([.])(w)/$1$2 $3/g' source.txt > final.txt

First, two quick fixes:

  1. Those should, as has been pointed out, be \w, not just w.
  2. Those should, in fact, actually be \S, not \w, or it doesn't even work correctly for the short snippet of sample text the OP gave.

The quick explanation:

  1. perl -pe 'SOMETHING' input_file says to run the perl command "SOMETHING" on every line of the file input_file.
  2. The perl statement s/SEARCH/REPLACE/g, on its own, says to take the current line, find every instance of the regular expression pattern SEARCH, and replace every such instance with REPLACE.
  3. \w in a regular expression matches any one "word" character, meaning any alphanumeric or underscore character. \S matches any "non-whitespace" character, meaning basically anything other than a space, tab, or newline/carriage return.
  4. A dot (.) in a regular expression normally matches any character. A string enclosed by brackets ([string]), however, defines a character class that matches any one instance of any of the characters inside the brackets, so (for example) [abc] matches an "a", or a "b", or a "c". And a dot inside a character class loses its special meaning of matching any character, so "[.]" is just a way to say "match a literal period". You could also use "\.", where the backslash in that context "escapes" the special meaning of the dot.
  5. Parentheses in perl regular expressions "capture" whatever they match, and put the captured text into numbered buffers, numbered by the order in which the opening parentheses appear in the pattern. Each buffer is read through the scalar variable with the same number (e.g. what the first set of parentheses matches can be accessed using $1).

So, roughly translating the original version (assuming the \w fix is made), the perl command being run on every line is:

Match any "word" character, followed by a literal period, followed by any other "word" character, putting each of those characters into buffers 1, 2, and 3, respectively, and then replace those three characters with "the contents of buffer 1" + "the contents of buffer 2" + a space + "the contents of buffer 3".
Or, even more loosely translated:
Any time you see a "word" character followed by a period followed by a "word" character, stick a space after the period.

The problem is that, as the example text shows, sometimes sentences can start/end with characters that aren't "word" characters, like quotation marks. So changing \w to \S would match those, as well.



Use perl to repair a mis-formatted text file
Authored by: unforeseen:X11 on Jan 25, '07 09:07:27AM

Perl is designed to easily work with textfiles, so yes, you can do that with Perl. You can also do the same with grep or SubEthaEdit's find & replace, checking the box labelled "RegExp".
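
For anyone who wants a non-Perl command-line version: grep on its own only reports matching lines, so the sketch below assumes sed as the search-and-replace counterpart. That choice is mine, not something from the hint. It uses the POSIX class [^[:space:]] in place of Perl's \S, and the -E switch for extended regular expressions.

```shell
# Same idea as the Perl one-liner: non-space, period, non-space
# gets a space inserted after the period.
printf 'might make reparation.I made up my mind that\n' |
  sed -E 's/([^[:space:]])[.]([^[:space:]])/\1. \2/g'
```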

This is actually a hint about using regular expressions, not Perl. And yes, there must be a Backslash before every "w" in the code. (Most probably a SQL/PHP problem of macosxhints ;D ).

---
this is not the sig you're looking for.



Use perl to repair a mis-formatted text file
Authored by: rev_karol on Jan 25, '07 09:07:56AM

I honestly don't think this post really belongs here. I don't see the relevance to OSX at all.



Should be
Authored by: lonestar on Jan 25, '07 10:13:49AM

I agree. This hint definitely is not a mac hint and has no place here. But since it is, it should at least be correct. The backslashes are missing and it is a poor solution anyway. It adds a space to every <letter><dot><letter> sequence, which is likely not what you want if your file has abbreviations (e.g. I.B.M.) or filenames.

Much better to make sure the case of the first letter is lowercase and the case of the second letter is uppercase.

perl -pe 's/([a-z]\.)([A-Z])/$1 $2/g'



Should be
Authored by: dzurn on Jan 25, '07 01:54:36PM
I agree. This hint definitely is not a mac hint and has no place here. But since it is, it should at least be correct. The backslashes are missing and it is a poor solution anyway. It adds a space to every <letter><dot><letter> sequence, which is likely not what you want if your file has abbreviations (e.g. I.B.M.) or filenames.

Much better to make sure the case of the first letter is lowercase and the case of the second letter is uppercase.

perl -pe 's/([a-z].)([A-Z])/$1 $2/g'

I disagree that it's not related to Mac OS X. The underpinnings of OS X definitely are Unix (to whatever degree the purists would argue) and these tools are always there.

Besides, the more I learned about Regular Expressions, the better my life became :)

I added a quote to your script so that the offered sample text would at least resolve properly.

perl -pe 's/([a-z].)([A-Z"])/$1 $2/g'

It worked to replace all three messed-up punctuation marks in the sample.

My new sig: "There is more Unix-nature in one line of shell script than there is in ten thousand lines of C" --Rootless Root

---
Madness takes its toll.
Please have exact change.


Should be
Authored by: mastige on Jan 26, '07 11:16:50AM
perl -pe 's/([a-z].)([A-Z"])/$1 $2/g'
I tried this on the supplied text. It puts a space between the exclamation mark and the double quote. Correct is:
perl -pe 's/([a-z.])([A-Z"])/$1 $2/g'

