
Easily scrape data from web pages with XML::LibXML
My old methods of scraping web pages (extracting particular pieces of data from a page) were to write a custom parser in Perl, or to use the HTML::TableExtract module. This was difficult to develop and maintain -- every page required different code, and when a site changed its layout, everything broke. This tip shows how to use XML::LibXML and xmllint to make the process much easier and far more repeatable when a layout changes.
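As a minimal illustration of the XPath idea (this is not Marty's webscrape tool itself), xmllint -- the command-line tool that ships with libxml2 -- can pull pieces out of an HTML page with a single expression. The sample file and XPath below are made up for demonstration, and note that the `--xpath` option only exists in newer xmllint builds; older ones offer the interactive `--shell` instead:

```shell
# save a toy page to experiment with
cat > /tmp/sample.html <<'EOF'
<html><body>
<ul>
<li><a href="/hint/1">First hint</a></li>
<li><a href="/hint/2">Second hint</a></li>
</ul>
</body></html>
EOF

# --html switches on the forgiving HTML parser; --xpath evaluates
# the expression and prints the matching text nodes
xmllint --html --xpath '//li/a/text()' /tmp/sample.html
```

The same `//li/a/text()` expression keeps working even if the site wraps the list in new tables or divs, which is exactly why this beats hand-written parsing code.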

This webscrape page has my full writeup with a walkthrough of scraping hints. It's also got the Pick of the Week from macosxhints, as well as pizza places from Yahoo's Yellow Pages as examples. It's required reading if you want to write your own .scrape files. Here's the quick and dirty install for the required modules (you will need to be running Panther since webscrape needs libxml2):
perl -MCPAN -e 'force install WWW::Mechanize'
perl -MCPAN -e 'install XML::LibXML'
cd /tmp
curl -O \
 'http://marty.feebleandfrail.org/macosxhints/webscrape/webscrape.tar'
tar xf webscrape.tar
And some examples:
cd /tmp/webscrape
./webscrape -d osxhints.scrape
./webscrape -d osxhints.scrape -fl20
./webscrape -d potw.scrape -fl15 > pickoftheweek.html
./webscrape -d yahooyp.scrape -a 90210 -fcl 60
[robg adds: I haven't tried any of this, nor am I likely to, as it's beyond my skill set. Please note that web scraping can be very intensive on a server -- scrapers can request data very often and very quickly, which puts a strain on even the fastest of boxes. So if you're going to do this, please follow Marty's example: grab a couple of pages locally first and experiment with those, instead of making repeated requests to our server. For those so inclined, scraping can help create useful information pages on your local machine, which is why I feel fine running this hint. As the guy who tries to keep this site online, however, I ask that you "scrape gently" and not inundate the server with needless requests. Thanks!]
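In the spirit of robg's advice, a gentle workflow is to fetch a page exactly once with curl and then iterate against the local copy. The URL and XPath here are placeholders, and a stand-in file is written locally so the pattern can be tried entirely offline:

```shell
# fetch the page a single time (placeholder URL) -- not on every test run
# curl -s -o /tmp/page.html 'http://www.example.com/somepage.html'

# stand-in for the fetched copy, so this sketch works offline
printf '%s' '<html><head><title>Cached page</title></head><body></body></html>' > /tmp/page.html

# now experiment against the local file as often as you like --
# no load on the remote server while you refine your expressions
xmllint --html --xpath '//title/text()' /tmp/page.html
```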

Easily scrape data from web pages with XML::LibXML | 3 comments
Easily scrape data from web pages with XML::LibXML
Authored by: pvera on Apr 12, '05 10:29:22AM

1. Consider using Anthracite, by Metafy (http://metafy.com). Out-frickin-standing scraping capabilities.

2. If the source page validates as XML, you can try using unserialize in PHP.

---
Pedro
-
http://pedrovera.com



Easily scrape data from web pages with XML::LibXML
Authored by: jaysoffian on Apr 12, '05 10:51:02PM

I've previously used HTML::TreeBuilder with good success. But I'll take a look at XML::LibXML. Thanks for the tip!




Easily scrape data from web pages with XML::LibXML
Authored by: merlyn on Apr 13, '05 10:45:57AM
For a higher-level wrapper around XML::LibXML, see my column on using xsh to scrape web pages.
