Speed up perl operations on large files UNIX
Yesterday I was faced with a 21 megabyte text file that needed several find/replace operations. Tex-Edit would open the file, but not change it. BBEdit would chug through the changes, but it was taking forever. I dropped into a Terminal window and used:
perl -pi -e 's/find/replace/g' filename
However, this too was slow -- much slower than I expected.

I was checking the file in Tex-Edit after each find/replace, so the whole process was dragging on, until on a whim I used Tex-Edit to change the line endings from "Mac" (CR) to "Unix" (LF) style.

From that point on, the 'perl s/' command ran blazingly fast.

Nasty...
Authored by: pmccann on Feb 27, '02 02:08:11AM

Ouch! So (before changing line endings) you were asking poor old perl to run its regex engine over a single line that was 21M "long". That is, that whole file was sitting in a single scalar variable, and it was changing and copying this monster string as the substitutions were made. Sounds like a delicious recipe for S-L-O-W to me!!

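It's probably worth noting that you can work with files that have CR line endings quite efficiently by setting both the -0 flag (that's a zero: it sets the input record separator) and the -l flag (that's an 'ell' for line: it chomps each record and sets the output record separator) to the appropriate octal value, in this case 015. The -l flag on its own leaves perl splitting the input on LF, so the whole file would still arrive as one line.

perl -pi -0015 -l015 -e 's/foo/bar/' filename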

You get a file with the same line endings out the other end, and it should be loads more efficient than the nightmare of the first paragraph!

Cheers,
Paul



How long is a piece of a string?
Authored by: pmccann on Feb 27, '02 07:59:08AM

[[I'm sure this was submitted earlier, but there's no sign of it now. Damn: I really hate reconstructing messages. For what it's worth then...]]

Ouch! The original script you detailed in the hint would have been slinging around a 21MB string (in the $_ variable), shuffling whatever copies are necessary to do the substitution, and so forth: a very effective recipe for making things incredibly S-L-O-W.

You can work with files that have "classic" mac line endings by invoking the "-0" (that's a zero) and "-l" (that's "ell" as in line) options on the command line. Each takes an octal value, which in the case of a file with CR line endings should be 015. The -0 option sets the input record separator, so perl reads one CR-terminated line at a time; the -l option chomps that separator off each line on the way in and appends the given character on the way out, so that your output file maintains your original line ending character.

That is, you could use:

perl -0015 -l015 -pi -e 's/foo/bar/g' filename

and it should be about as efficient as converting the file first to LF line endings and then running the script.

Cheers,
Paul (waiting for that earlier phantom message to reappear...)



How long is a piece of a string?
Authored by: pmccann on Feb 27, '02 08:03:39AM

Wouldn't you think that almost a decade of web use would drill in the idea of a "cached page"? What a nong! Sorry for the double up: I'm having a bad, bad, day with forums!

Paul ("what does this button do?")



Three Tips
Authored by: sharumpe on Feb 27, '02 12:57:24PM
Heya, this is something I do ALL the time - glad that someone thought about submitting it.

I have a couple tips for doing this, though:
1) If you think you would like to keep a backup of the file(s) that you are doing this to, you can put an extension after the 'i' option, like this:
perl -pi.bak -e 's/foo/bar/g' filename

This will rename the original file to filename.bak before doing the search/replace.

2) You can search case-insensitively with 's/foo/bar/gi'

3) You can change the line endings to Unix style like this:
perl -pi -e 's/\r//g' filename


Mr. Sharumpe

careful ...
Authored by: Djonli on Feb 27, '02 11:49:32PM

The examples shown above replace "foo" wherever it occurs, even inside a longer word. For example, "fool" --> "barl", etc. To change only the word "foo" to "bar", use the following syntax for the regular expression:

s/\bfoo\b/bar/g # Note the "\b" on each side

Use "s/\bfoo\b/bar/gi" for case-insensitive replacement.



careful ...
Authored by: pmccann on Feb 28, '02 01:52:53AM

Hey Rob, will the new geeklog stop robbing code fragments of their backslashes? Both parent and grandparent of this post have lost such symbols, rendering the code somewhat silly!

In the grandparent it should be "backslash r" I presume, though I should note that this runs into **exactly** the problem that my first (and, ahem, second) post was intended to warn against. If the file is big you're doing a **lot** of heavy lifting that can easily be avoided. In fact, I really don't think that prescription for changing line endings works at all on a file with CR line endings (unless you've only got one line in the file!). The problem is that the whole file is **one line** to perl, so it reads it in, eliminates the CRs in the line, and then prints out the single line as requested, with every original line now run together. Not what you're after.

You really want something like

perl -pi -e 'tr/



careful ...
Authored by: pmccann on Mar 01, '02 12:49:56AM

I surrender!

geeklog wins... (another victim of the backslash in the same post as this problem was being described). Let's try again: I'll use % where there should be a backslash in two places in the following line.

perl -pi -e 'tr/%015/%012/' filename

Now let me out of here!

Cheers,
Paul (anyone else want a forum they want littered with nonsense?)



careful ...
Authored by: Djonli on Mar 02, '02 05:23:07AM

Hmm... The preview showed the "<backslash>b", but I can see that my post lost the <backslash>.

So, for those people who are new to regular expressions, please add a <backslash> in front of each "b".


