The other day, I was tasked with building a data scraper. Having never built such a contraption, I naturally turned to the Internets for preexisting code. I was horrified by what I found.
The “free” PHP scripts (that’s “free” as in “free baby vomit”) were all infested with the worst sorts of newfangled regex and PHP 4-era DOM traversal.
Making matters worse, the scripts didn’t offer much of an API or interface for data mining. Rather, they provided a rigid and worthless example, leaving their hapless users to mutilate whatever useful lines they could find and create an even more horrid franken-script.*
It didn’t take me long to realize that PHP 5’s SimpleXML was the answer. And indeed, after an hour of practice, SimpleXML turned me into a scraping ninja.
Below is a very simple example (for Drupal 6) that parses the Drupal Planet blogroll and makes a neat little table out of it. Hopefully, you’ll find this method as easy and useful as I did.
*Disclosure: I am not among the sadistic few who think Perl’s regular expressions are the greatest invention since sex. So you call SimpleXML a crutch, and I’ll call you sick.

<?php
function simplexml_drupal_blogroll() {
  $xml = simple_get_xml('http://www.drupal.org/planet');
  // Run an XPath query to find the blogroll block (skipped if the page could not be parsed).
  $block = ($xml instanceof SimpleXMLElement) ? $xml->xpath('//div[@id="block-block-8"]') : array();
  if (empty($block)) {
    return t('The blogroll could not be loaded.');
  }
  /* This looks nasty, and indeed it is. It's as nasty as the curved block markup at drupal.org.
     1. Why $block[0]? Simply because xpath() returns an array even when only one item is found.
     2. Why $block[0]->div? "div" is the name of the child element; if you wanted to access the
        headers instead, you'd use $block[0]->h2.
     3. Why ->div[0]? See number 1: child elements are indexed like an array too.
     4. Why didn't you use XPath to select the item list? Damn you! Because I haven't really learned XPath yet...
  */
  // Get ready to build a table.
  $rows = array();
  $header = array('Blog Name', 'Link', 'RSS Feed');
  foreach ($block[0]->div[0]->div[0]->div[0]->div[0]->ul[0]->li as $item) {
    $row = array();
    // The array key is 0 since it's the first link in the item-list's line.
    $blog_link = $item->a[0];
    // The array key is 1 since (duh) the feed is the second link in the item-list's line.
    $feed_link = $item->a[1];
    // Convert the blog_link hyperlink object into a simple string (its link text).
    $row[] = (string) $blog_link;
    // Access the "href" attributes, cast to strings so the row holds no SimpleXML objects.
    $row[] = (string) $blog_link['href'];
    $row[] = (string) $feed_link['href'];
    $rows[] = $row;
  }
  $output = theme('table', $header, $rows);
  return $output;
}
/**
 * Implementation of hook_menu().
 */
function simplexml_menu() {
  $items = array();
  $items['simplexml'] = array(
    // Drupal 6 translates menu titles itself, so the title is not wrapped in t().
    'title' => 'Drupal Planet: The Blogroll',
    'page callback' => 'simplexml_drupal_blogroll',
    'access arguments' => array('access content'),
    'type' => MENU_NORMAL_ITEM,
  );
  return $items;
}
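function simple_get_xml($url) {
  $html = new DOMDocument();
  // Fetch Drupal Planet and parse it (@ suppresses the warnings that sloppy HTML triggers).
  if (!@$html->loadHTMLFile($url)) {
    // Nothing was loaded, so there is nothing to convert.
    return FALSE;
  }
  // Convert the DOM to SimpleXML.
  $xml = simplexml_import_dom($html);
  return $xml;
}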
/* Useful debugging function. */
function pre_print($input) {
  print '<pre>';
  // print_r() prints on its own; wrapping it in print would also output its return value, a stray "1".
  print_r($input);
  print '</pre>';
}
?>
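On question 4 in the code comments above: the long chain of ->div[0] lookups can probably be collapsed into a single XPath query. What follows is a minimal sketch, not a tested replacement for the module code. The sample markup is a made-up stand-in for the blogroll block (the real drupal.org page may be nested differently), and it is loaded from a string so the snippet runs without a network connection.

```php
<?php
// Hypothetical stand-in for the blogroll block; the real markup may differ.
$html = <<<HTML
<html><body>
<div id="block-block-8"><div><div>
<ul>
<li><a href="http://example.com/blog">Example Blog</a> <a href="http://example.com/feed">feed</a></li>
<li><a href="http://example.org/">Another Blog</a> <a href="http://example.org/rss.xml">feed</a></li>
</ul>
</div></div></div>
</body></html>
HTML;

$doc = new DOMDocument();
// loadHTML() tolerates sloppy markup; @ hides the warnings it may raise.
@$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$rows = array();
// "//" matches descendants at any depth, so the number of wrapper divs no longer matters.
foreach ($xml->xpath('//div[@id="block-block-8"]//ul/li') as $item) {
  $rows[] = array(
    (string) $item->a[0],          // link text of the first link
    (string) $item->a[0]['href'],  // blog URL
    (string) $item->a[1]['href'],  // feed URL
  );
}
print_r($rows);
```

The resulting $rows array has the same shape as the one built in simplexml_drupal_blogroll(), so it could be handed to theme('table', $header, $rows) in the same way.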
Comments
Thanks
Thanks for writing this tip. I was afraid web scraping would be a tedious task, but it was a walk in the park thanks to SimpleXML (and Nick) :)
Really simple definition
Thanks for the article. Simple definition and good examples.
Simple Scrapers
A tricky topic, content scraping. There are a few tools I have found over the past couple of years that are quite good, written in PHP, and are easy to modify for your particular needs:
CaRP from GeckoTribe
RSS2HTML from FeedforAll
and Dapper, the online scraping site.
Cheers,
Karl
Not well-formed XML
When it comes to HTML that is not well-formed, you can give
DOMDocument::loadHTML('some badly shaped html here'); a try. It gives you a DOM object even if you are working with non-XML data.
So you don't have to run HTML Tidy first.
Very nice tutorial and nice example.
Let me also point to another nice tutorial *cough cough* on screen-scraping (http://www.vogel-nest.de/wiki/Main/WebScraping1)
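To illustrate the loadHTML() point, here is a small sketch. The sloppy markup string is invented for the example, and collecting the parser's complaints with libxml_use_internal_errors() is just one option, shown here as an alternative to silencing them with @.

```php
<?php
// Sloppy sample markup (hypothetical): unclosed <li> and <p>, no <html> or <body>.
$bad_html = '<ul><li><a href="/one">One</a><li><a href="/two">Two</a></ul><p>unclosed paragraph';

// Collect libxml's complaints instead of printing them as PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadHTML($bad_html);
$problems = libxml_get_errors();
libxml_clear_errors();
// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);

$links = array();
if ($loaded) {
  // The parser has wrapped the fragment in <html><body>, so XPath works as usual.
  $xml = simplexml_import_dom($doc);
  foreach ($xml->xpath('//li/a') as $a) {
    // Map each href to its link text.
    $links[(string) $a['href']] = (string) $a;
  }
}
print count($problems) . " parser message(s)\n";
print_r($links);
```

How many messages the parser reports depends on the libxml version, so the sketch only counts them rather than relying on a specific number.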
Dataminer API
Try Dataminer API. It makes scraping more fun =)
Yeah, I checked out that
Yeah, I checked out that module, and it looked crazy useful: going over its source code, I can tell that it solves tons of nasty gotchas that one is likely to encounter when scraping. However, I'm kind of an idiot, so I wasn't able to figure out how I was supposed to use it. In particular, it was difficult to figure out how the functions were intended to work together, and the sequence in which they would typically be fired. I think a couple of simple code examples would increase the use of that module tenfold.
Thanks for pointing this
Thanks for pointing this out. Another reason to love SimpleXML.
It would be cool to add a check to see if an RSS feed is available: if so, use the feed's XML to pull in content; otherwise, use your scraper. The only problem with the first option is that an RSS feed offers limited content.
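One way such a check might look, as a rough sketch: pages often advertise their feed with a link element in the head, which the same DOM-to-SimpleXML trick can find. The function name find_rss_feed() and the sample pages are made up for illustration, and the sketch assumes the feed is advertised with type="application/rss+xml".

```php
<?php
// Hypothetical helper: returns the href of the first RSS <link> on the page, or NULL.
function find_rss_feed($html) {
  $doc = new DOMDocument();
  // @ hides warnings about sloppy markup.
  @$doc->loadHTML($html);
  $xml = simplexml_import_dom($doc);
  if (!($xml instanceof SimpleXMLElement)) {
    return NULL;
  }
  // Two predicates in a row mean both conditions must hold.
  $links = $xml->xpath('//link[@rel="alternate"][@type="application/rss+xml"]');
  return $links ? (string) $links[0]['href'] : NULL;
}

// Made-up sample pages: one advertises a feed, the other does not.
$with_feed = '<html><head><title>Blog</title>'
  . '<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/rss.xml">'
  . '</head><body><p>Hi</p></body></html>';
$without_feed = '<html><head><title>Plain</title></head><body><p>Hi</p></body></html>';

var_dump(find_rss_feed($with_feed));
var_dump(find_rss_feed($without_feed));
```

If a feed URL comes back, it could be read with simplexml_load_file(); if NULL comes back, the scraper takes over.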
this is exactly how i've
this is exactly how i've done it. DOM -> SimpleXML. very easy and nice.
Use html tidy on badly formed html pages
This is very cool... until you have to scrape a page which is not well-formed HTML, let alone XHTML, and the XML parser freaks out.
So, probably, you would have to scoop up the page, then pass it through the HTML Tidy API (used by the Drupal htmltidy module and the Drupal "scraper" module "import_html") with settings that will render XHTML, so that the XML parser accepts it.
Then you would be all set.