Welcome to the world of RegEx's
Authored by: babbage on Apr 25, '01 06:09:02PM
If the weird name throws you, "grep" comes from the ed editor command g/re/p, which stands
for "globally search for a regular expression and print". If that doesn't help, it's probably
because you're wondering what a regular expression ("re" or "regex") is. Basically, it's a
pattern used to describe a string of characters, and if you want to know aaaaaaall about them,
I highly recommend reading Mastering Regular Expressions by Jeffrey Friedl, published by
Unix über-publisher O'Reilly & Associates.

Regexes (regices, regexen, ...the pluralization is a matter of debate) are an extremely
useful tool for any kind of text processing. Searching for patterns with grep is
most people's first exposure to them since, as the article says, you can use them to search
for a literal pattern within any number of text files on your computer. The cool thing is
that it doesn't have to be a literal pattern, but can be as complex as you'd like.

The key to this is understanding that certain characters are "metacharacters", which have
special meaning for the regex-using program. For example, a plus character (+) tells the
program to match one or more instances of whatever immediately precedes it, while parentheses
serve to treat whatever is contained as a unit. Thus, 'ha+' matches "ha", and it also matches
"haa" and "haaaaaaaaaaa", but it does not match "hahaha" as a whole (it only finds the first
"ha" inside it). If you want to match repetitions of the word "ha", you can use
'(ha)+' to match one or more instances of it, such as 'hahaha' and 'hahahahahahahahaha'.
Using a vertical bar allows alternate matching, so '(ha|ho)+' matches 'hohoho', 'hahaha', and
'hahohahohohohaha'. Etc.
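
To see the patterns above in action, here is a small sketch. The sample strings are made up for illustration. It assumes a grep that accepts -E (extended regexes, needed for +, parentheses, and |) and uses -x so that a line counts only when the whole line matches.

```shell
# Made-up sample strings, one per line.
samples='ha
haa
hahaha
hohoho
hahoho'

# -E: extended regexes; -x: the pattern must match the entire line.
# paste joins the matching lines with spaces so they print on one line.
plus_matches=$(printf '%s\n' "$samples" | grep -E -x 'ha+' | paste -s -d ' ' -)
group_matches=$(printf '%s\n' "$samples" | grep -E -x '(ha)+' | paste -s -d ' ' -)
alt_matches=$(printf '%s\n' "$samples" | grep -E -x '(ha|ho)+' | paste -s -d ' ' -)

printf 'ha+       : %s\n' "$plus_matches"
printf '(ha)+     : %s\n' "$group_matches"
printf '(ha|ho)+  : %s\n' "$alt_matches"
```

Without -x, grep reports any line that contains a match anywhere, so 'ha+' would also report the line "hahaha".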

There are many of these metacharacters to keep in mind. Inside brackets ([]), a caret (^) at
the start means that you want to match any single character other than the ones listed. Note
that brackets always stand for exactly one character, so '[^abc]' matches any one character
that is not "a", "b", or "c". For Magritte fans, that means '[^(a cigar)]' does not match "any
text that is not a cigar"; it matches a single character that isn't a parenthesis, a space, or
one of the letters in "a cigar". The rest of the time, the caret tells the program to match
only at the beginning of a line, while a dollar sign ($) matches only at the end. Therefore,
'^everything$' matches the word "everything" only when it is on a line all by itself, and
'^[^#]' matches all lines that begin with some character other than "#".
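
A quick sketch of anchors and negated brackets, using made-up sample lines. The point to notice is that a bracket expression stands for exactly one character, so a negated set keeps any line containing at least one character outside the set.

```shell
# Made-up lines for the anchor examples.
anchors='everything
everything else
not everything'

# ^ and $ pin the pattern to the start and end of the line.
whole_line=$(printf '%s\n' "$anchors" | grep '^everything$')

# ^[^e] keeps lines whose first character is anything other than "e".
starts_not_e=$(printf '%s\n' "$anchors" | grep '^[^e]')

# [^abc] matches ONE character that is not a, b, or c, so "abc" and "cab"
# (made only of those letters) are skipped while "abcd" and "xyz" are kept.
letters='abc
cab
abcd
xyz'
outside_set=$(printf '%s\n' "$letters" | grep '[^abc]' | paste -s -d ' ' -)

printf '%s\n' "$whole_line" "$starts_not_e" "$outside_set"
```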

The period (.) matches any single character, and the asterisk (*) matches zero or more
instances of whatever immediately precedes it. Compare this to the plus, which matches one or
more times -- a subtle but important difference. A lot of regular expressions look for ".*",
which is zero or more of anything (that is, anything at all). This is useful when searching
for two things that might or might not have anything else (that you probably don't care about)
between them: 'foo.*bar' will match on 'foobar', 'foo bar' & 'foo boo a wop bop a lop bam boo
bar'. Changing the previous example to a plus, 'foo.+bar', requires that at least one
character come between foo and bar, but it doesn't matter what, so 'foobar' doesn't match but
the other two examples given do match.
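
The difference between the star and the plus can be checked with a short sketch. The first three sample lines are the ones from the paragraph above; "barfoo" is an extra line added here as a non-match. It assumes grep -E for the plus.

```shell
text='foobar
foo bar
foo boo a wop bop a lop bam boo bar
barfoo'

# .* allows nothing at all between foo and bar; -c counts matching lines.
star_count=$(printf '%s\n' "$text" | grep -c 'foo.*bar')

# .+ needs at least one character in between, so "foobar" drops out.
plus_count=$(printf '%s\n' "$text" | grep -E -c 'foo.+bar')

printf 'foo.*bar matches %s lines, foo.+bar matches %s lines\n' "$star_count" "$plus_count"
```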

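For details, try the man pages -- "man grep". There are a lot of different versions of the
program, so details may vary. In particular, plain grep treats +, |, and parentheses as
literal characters; use "grep -E" (or egrep) to get the meanings described here. With that
caveat, all of this should be valid for OS X.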

Confusing? Maybe, but regular expressions aren't that bad when you get used to them, and
they can be a very useful tool to take advantage of if you know what you're doing. An example.

Let's say you have a website stored on your computer as a series of html documents.
As a cutting edge developer, you've seen the CSS light and want to delete all the <font>
tags wherever they're just saying e.g. face="sans-serif" &/or size="12", because the
stylesheet can now do that for you. On the other hand, it's possible that the patterns
'face="sans-serif"' or 'size="12"' could show up in normal text (though admittedly
that's unlikely). In fact, what you really want to know is wherever those patterns show up in
a font tag, but you don't care about anywhere else that they might appear. Here's one way to
find that pattern:

grep -Eir '<font[^>]*(face="sans-serif"|size="12")' *.htm *.html


This does a number of things. The -E turns on extended regular expressions, so that the
parentheses and the vertical bar act as metacharacters. The -i tells grep to ignore case
(otherwise it's case sensitive, and won't match 'FONT' if you're looking for 'font' or 'Font').
The -r tells it to recursively descend through any directories named on the command line; here
we only name the htm and html files in the current directory, so those are what get searched.
Everything in single quotes is the pattern we're matching. We tell grep to match on any text
that starts with "<font", followed by zero or more characters that are not ">" (thus staying
within the font tag), and then either the face or size definition that we're interested in.
The one glitch here is that line breaks can break things, though there are various ways around
that. Finding them is left as the proverbial exercise for the reader. :)
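
Here is a self-contained sketch for trying the search on throwaway files. The file names and contents are invented, and the pattern assumes the intended expression starts with '<font[^>]*'. It adds -h to drop the file name prefix from the output.

```shell
# Build a scratch directory with three invented pages.
workdir=$(mktemp -d)
printf '<p><font face="sans-serif">Hello</font></p>\n' > "$workdir/index.html"
printf '<P><FONT COLOR="red" SIZE="12">Big</FONT></P>\n' > "$workdir/about.htm"
printf '<p>Set size="12" in your editor.</p>\n' > "$workdir/notes.html"

# -E: extended regex, -i: ignore case, -h: print matching lines without file names.
# notes.html mentions size="12" outside a font tag, so it should not show up.
matches=$(grep -Eih '<font[^>]*(face="sans-serif"|size="12")' "$workdir"/*.htm "$workdir"/*.html)
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

printf '%s\n' "$matches"
printf 'matching lines: %s\n' "$match_count"

rm -rf "$workdir"
```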

The next question is, what do you want to do with this information you've come up with?
Presumably you want to edit those files in order to fix them, right? With that in mind, maybe
it would be useful to just make a list of matches. Grep normally outputs all the lines that
match the pattern, but if you just want the filenames, use the -l switch. If you want to save
the results into a file, redirect the output of the command accordingly. With those changes,
we now have:

grep -Eirl '<font[^>]*(face="sans-serif"|size="12")' *.htm *.html >font_files.txt


Great. But we can do better still. If you are comfortable with the vi editor, you can call vi
with that command directly. The trick is to wrap the command in backticks (`). This is a cool
little Unix trick that runs the contained command & returns the result for whatever you want
to do with it. Thus you can simply enter this command (keeping the -l switch, so that vi
receives filenames rather than matching lines):

vi `grep -Eirl '<font[^>]*(face="sans-serif"|size="12")' *.htm *.html`


The result of this command, as far as your tcsh shell is concerned, is something along the lines
of

vi index.html about.html contact.html music.html......


etc. The beautiful thing here is that if you quit vi & re-run the command later, it will be
able to effectively "pick up where you left off", since files you've already edited will
presumably no longer match the grep command.
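
The "pick up where you left off" behaviour can be sketched without opening an editor. In this example echo stands in for vi, the files are invented, and the "edit" is simulated by overwriting one file. The font-tag pattern is the assumed reconstruction '<font[^>]*...'.

```shell
workdir=$(mktemp -d)
printf '<font face="sans-serif">old</font>\n' > "$workdir/index.html"
printf '<p>already clean</p>\n' > "$workdir/about.html"

pattern='<font[^>]*(face="sans-serif"|size="12")'

# Backticks run the inner command and hand its output (file names, thanks
# to -l) to the outer command. echo is used here in place of vi.
first_run=`cd "$workdir" && grep -Eil "$pattern" *.html`
echo vi $first_run

# Simulate having fixed index.html, then run the same search again.
printf '<p>old</p>\n' > "$workdir/index.html"
second_run=`cd "$workdir" && grep -Eil "$pattern" *.html || true`
echo "files still to edit: ${second_run:-none}"

rm -rf "$workdir"
```

One caveat with this approach: file names containing spaces get split into separate arguments, so it works best on simply named files.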

And if you want to get really ambitious, you can use these techniques in ways that
allow you to do all your editing directly from the command line, without having to go into an
interactive editor such as vi or emacs or whatever. If you make it this far in your experiments,
then the next step is to learn to filter the results of a match and process the filtered data
in some way, using tools such as sed, awk, and perl. Using these tools, you can find all
instances of the pattern in question, break it down however you like, substitute or shuffle the
parts around however you like, and then build it all back up again. This is fun stuff! By this
point, you're getting pretty heavily into Unix arcana, and the best book that I've seen about
these tricks is O'Reilly's Unix Power Tools, by various authors. If you really want to leverage
the power of the tools that all Unixes come with, including OSX, then this is a great place to
both start & end up. There's plenty of material in there to keep you busy for months & years...
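
As a taste of that next step, here is a minimal sketch using sed to strip one attribute from a font tag. The input line is invented, the substitution only handles a lower-case tag with the attribute preceded by a single space, and it assumes a sed that accepts -E for extended regexes. Treat it as a starting point rather than a complete cleanup tool.

```shell
html='<p><font face="sans-serif" color="red">Hi</font></p>'

# \1 puts back everything from "<font" up to the attribute, so only the
# face="sans-serif" part (and the space before it) is removed.
cleaned=$(printf '%s\n' "$html" | sed -E 's/(<font[^>]*) face="sans-serif"/\1/')

printf '%s\n' "$cleaned"
```

Note that this leaves the font tag itself in place; removing tags that end up with no attributes would need a further substitution.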
