Stupid Simple Web Scraping with SimpleXML


The other day, I was tasked with building a data scraper. Having never built such a contraption, I naturally turned to the Internets for preexisting code. I was horrified by what I found.

The “free” PHP scripts (that’s “free” as in “free baby vomit”) were all infested with the worst sorts of newfangled regex and PHP 4-era DOM traversal.

Making matters worse, the scripts didn’t offer much of an API or interface for data mining. Instead, they provided a rigid and worthless example, leaving their hapless users to mutilate whatever useful lines they could find and create an even more horrid franken-script.*


It didn’t take me long to realize that PHP 5’s SimpleXML was the answer. And indeed, after an hour of practice, SimpleXML turned me into a scraping ninja.

Below is a very simple example (for Drupal 6) that parses the Drupal Planet blogroll and makes a neat little table out of it. Hopefully, you’ll find this method as easy and useful as I did.

*Disclosure: I am not among the sadistic few who think Perl’s regular expressions are the greatest invention since sex. So you call SimpleXML a crutch, and I’ll call you sick.

function simplexml_drupal_blogroll() {
  $xml = simple_get_xml('');
  // run an XPath query for the blogroll block that wraps the unordered list
  $block = $xml->xpath('//div[@id="block-block-8"]');
  /* This looks nasty -- and indeed it is... It's as nasty as the curved block markup at
     1. Why $block[0]? That's simply because xpath() returns an array even when only one item is found.
     2. Why $block->div? A: "div" is the name of the element; if you wanted to access the headers instead, you'd do:
     3. Why $block->div[0]? A: See number 1.
     4. Why didn't you use xpath to select the item list? A: Damn you! Because I haven't really learned xpath yet...
  */

  // get ready to build a table
  $rows = array();
  $header = array('Blog Name', 'Link', 'RSS Feed');
  foreach ($block[0]->div[0]->div[0]->div[0]->div[0]->ul[0]->li as $item) {
    $row = array();
    // the array key is 0 since it's the first link in the item-list's line
    $blog_link = $item->a[0];
    // the array key is 1 since -- duh -- the feed is the second link in the item-list's line
    $feed_link = $item->a[1];
    // convert the blog_link hyperlink object into a simple string
    $row[] = (string) $blog_link;
    // access the "href" attributes, cast to plain strings as well
    $row[] = (string) $blog_link['href'];
    $row[] = (string) $feed_link['href'];
    $rows[] = $row;
  }
  $output = theme('table', $header, $rows);
  return $output;
}
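
As for question 4 in the comment above: here is a rough sketch of what letting XPath pick out the list items might look like, instead of chaining all those ->div[0] lookups. The markup below is made up to stand in for the blogroll block (the real page may well be structured differently), and it is parsed from an inline string so the snippet runs on its own, outside Drupal.

```php
<?php
// Hypothetical markup standing in for the blogroll block; the real page's
// structure (and the example.com / example.org links) are assumptions.
$markup = <<<HTML
<html><body>
<div id="block-block-8"><div><div><div><div>
<ul>
  <li><a href="http://example.com/">Example Blog</a> <a href="http://example.com/feed">feed</a></li>
  <li><a href="http://example.org/">Another Blog</a> <a href="http://example.org/rss.xml">feed</a></li>
</ul>
</div></div></div></div></div>
</body></html>
HTML;

// Same DOM-to-SimpleXML conversion the article uses, but from a string.
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xml = simplexml_import_dom($dom);

// One query replaces the chain of ->div[0] lookups: every <li> of a <ul>
// anywhere beneath the block, however many wrapper divs sit in between.
$items = $xml->xpath('//div[@id="block-block-8"]//ul/li');

$rows = array();
foreach ($items as $item) {
  $rows[] = array(
    (string) $item->a[0],          // blog name (text of the first link)
    (string) $item->a[0]['href'],  // blog URL
    (string) $item->a[1]['href'],  // feed URL (second link)
  );
}
print_r($rows);
```

The upside is that the query keeps working if somebody adds or removes a wrapper div; the downside is that you have to learn a bit of XPath first.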

/**
 * Implementation of hook_menu().
 */
function simplexml_menu() {
  $items = array();
  $items['simplexml'] = array(
    'title' => t('Drupal Planet: The Blogroll'),
    'page callback' => 'simplexml_drupal_blogroll',
    'access arguments' => array('access content'),
  );
  return $items;
}

function simple_get_xml($url) {
  $html = new DOMDocument();
  // fetch drupal planet and parse it (@ suppresses warnings).
  @$html->loadHTMLFile($url);
  // convert DOM to SimpleXML
  $xml = simplexml_import_dom($html);
  return $xml;
}

/* useful debugging function */
function pre_print($input) {
  print '<pre>';
  // print_r() already prints its argument; wrapping it in print would append a stray "1"
  print_r($input);
  print '</pre>';
}
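
One more thought on simple_get_xml(): the @ operator hides every warning, including the ones you might actually want to see. A possible alternative, sketched below, is to have libxml collect its complaints so you can inspect them later. The function name simple_get_xml_from_string() and its $warnings parameter are my own inventions for illustration, and it parses a string rather than fetching a URL so it can run without a network connection.

```php
<?php
/**
 * Hypothetical variant of simple_get_xml(): parses an HTML string and
 * collects libxml's parse warnings instead of silencing them with @.
 */
function simple_get_xml_from_string($html_string, &$warnings = array()) {
  // Ask libxml to store errors internally; remember the previous setting.
  $previous = libxml_use_internal_errors(TRUE);
  $dom = new DOMDocument();
  $loaded = $dom->loadHTML($html_string);
  // Hand the collected warnings to the caller, then reset libxml's state.
  $warnings = libxml_get_errors();
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  // Convert DOM to SimpleXML, or give up if the document did not load.
  return $loaded ? simplexml_import_dom($dom) : FALSE;
}

$warnings = array();
$xml = simple_get_xml_from_string('<html><body><p>Hello <b>world</b></p></body></html>', $warnings);
print (string) $xml->body->p->b . "\n";
print count($warnings) . " warning(s) collected\n";
```

Each entry in $warnings is a LibXMLError object, so you can log its message and line properties, or ignore the lot once you trust the page.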