My old methods of scraping web pages (extracting particular pieces of data from a page) were to write a parser in Perl, or to use the HTML::TableExtract module. This was difficult to develop and maintain -- every page required different code, and when a site changed its layout, everything broke. This tip shows how to use XML::LibXML and xmllint to make the process much easier and far more repeatable in the event that a layout changes.
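The core trick is to parse the page into a DOM and pull fields out with XPath expressions instead of page-specific parsing code. Here's a quick, hedged taste of that idea (this assumes a libxml2 recent enough that xmllint supports the --html and --xpath options; the sample page and expression are my own illustration, not from Marty's .scrape files):

```shell
# Stand-in for a page saved locally (scrape a local copy while
# experimenting, not the live server).
cat > /tmp/sample.html <<'EOF'
<html><body>
<table><tr><td class="hint">Use XPath, not regexes</td></tr></table>
</body></html>
EOF

# One XPath expression replaces a hand-written parser; if the layout
# changes, you update the expression, not the code.
xmllint --html --xpath 'string(//td[@class="hint"])' /tmp/sample.html
# prints: Use XPath, not regexes
```

The same expression works from Perl via XML::LibXML's findnodes, which is what makes the approach repeatable.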
This webscrape page has my full writeup with a walkthrough of scraping hints. It also has the Pick of the Week from macosxhints, as well as pizza places from Yahoo's Yellow Pages, as examples. It's required reading if you want to write your own .scrape files. Here's the quick and dirty install for the required modules (you will need to be running Panther, since webscrape needs libxml2):
perl -MCPAN -e 'force install WWW::Mechanize'
perl -MCPAN -e 'install XML::LibXML'
cd /tmp
curl -O \
'http://marty.feebleandfrail.org/macosxhints/webscrape/webscrape.tar'
tar xf webscrape.tar
And some examples:
cd /tmp/webscrape
./webscrape -d osxhints.scrape
./webscrape -d osxhints.scrape -fl20
./webscrape -d potw.scrape -fl15 > pickoftheweek.html
./webscrape -d yahooyp.scrape -a 90210 -fcl 60
[robg adds: I haven't tried any of this, nor am I likely to as it's beyond my skill set. Please note that web scraping can be very intensive on a server -- scrapers can request data very often and very quickly, which puts a strain on even the fastest of boxes. So please, if you're going to do this, follow Marty's example to grab a couple pages locally first, then experiment with those, instead of making repeated requests to our server. For those so inclined, scraping can help create useful information pages on your local machine, hence I feel fine running this hint. As the guy who tries to keep this site online, however, I ask that you "scrape gently" so as to not inundate the server with needless requests. Thanks!]
Mac OS X Hints
http://hints.macworld.com/article.php?story=20050412053547615