Stupid Simple Web Scraping with SimpleXML

The other day, I was tasked with building a data scraper. Having never built such a contraption, I naturally turned to the Internets for preexisting code. I was horrified with what I found.

The “free” PHP scripts (that’s “free” as in “free baby vomit”) were all infested with the worst sorts of newfangled regex and PHP 4-era DOM traversal.

Making matters worse, the scripts didn’t offer much of an API or interface for data mining. Instead, they provided a rigid and worthless example, leaving their hapless users to mutilate whatever useful lines they could find and create an even more horrid franken-script.*

GOD OFFERED SIMPLEXML, AND IT WAS GOOD

It didn’t take me long to realize that PHP 5’s SimpleXML was the answer. And indeed, after an hour of practice, SimpleXML turned me into a scraping ninja.
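
If you haven't met SimpleXML yet, here is a minimal sketch of the access style that won me over. It runs against a tiny made-up XML string: the `blogroll` and `blog` element names and the example.com URLs are invented for illustration and are not anything drupal.org actually serves.

```php
<?php
// A tiny, made-up XML document standing in for a real page.
$source = '<blogroll>'
  . '<blog><a href="http://example.com/">Example Blog</a><a href="http://example.com/feed">feed</a></blog>'
  . '<blog><a href="http://example.org/">Another Blog</a><a href="http://example.org/rss.xml">feed</a></blog>'
  . '</blogroll>';

$xml = simplexml_load_string($source);

$names = array();
$feeds = array();
foreach ($xml->blog as $blog) {
  // Child elements are properties; repeated children are indexed like arrays.
  $names[] = (string) $blog->a[0];
  // Attributes use array syntax with a string key.
  $feeds[] = (string) $blog->a[1]['href'];
}

print implode(', ', $names) . "\n";
print implode(', ', $feeds) . "\n";
```

Elements as properties, attributes as array keys, and a `(string)` cast whenever you want plain text: that is roughly all there is to remember.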

Below is a very simple example [for Drupal 6] that parses the Drupal Planet blogroll and makes a neat little table out of it. Hopefully, you’ll find this method as easy and useful as I did.

*Disclosure: I am not among the sadistic few who think Perl’s regular expressions are the greatest invention since sex. So you call SimpleXML a crutch, and I’ll call you sick.

<?php
function simplexml_drupal_blogroll() {
  $xml = simple_get_xml('http://www.drupal.org/planet');
  // run an XPath query for the blogroll block
  $block = $xml->xpath('//div[@id="block-block-8"]');
  /* This looks nasty, and indeed it is. It's as nasty as the curved block markup at drupal.org.
   1. Why $block[0]? Because xpath() returns an array even when only one item is found.
   2. Why $block[0]->div? A: "div" is the name of the child element; if you wanted to access the headers instead, you'd do:
      $block[0]->h2
   3. Why ->div[0]? A: Same idea as number 1: same-named children come back as a list, and [0] picks the first.
   4. Why didn't you use xpath to select the item list? A: Damn you! Because I haven't really learned xpath yet...
  */

  // bail out if the block is missing (xpath() gives back an empty array or FALSE)
  if (empty($block)) {
    return t('The blogroll block could not be found.');
  }

  // get ready to build a table
  $rows = array();
  $header = array('Blog Name', 'Link', 'RSS Feed');
  foreach ($block[0]->div[0]->div[0]->div[0]->div[0]->ul[0]->li as $item) {
    $row = array();
    // the array key is 0 since it's the first link in the item-list's line
    $blog_link = $item->a[0];
    // the array key is 1 since (duh) the feed is the second link in the item-list's line
    $feed_link = $item->a[1];
    // convert the blog_link hyperlink object into a simple string
    $row[] = (string) $blog_link;
    // access the "href" attributes, again cast to plain strings
    $row[] = (string) $blog_link['href'];
    $row[] = (string) $feed_link['href'];
    $rows[] = $row;
  }
  $output = theme('table', $header, $rows);
  return $output;
}

/**
 * Implementation of hook_menu().
 */
function simplexml_menu() {
  $items = array();
  $items['simplexml'] = array(
    // no t() here: Drupal 6 translates menu titles on its own
    'title' => 'Drupal Planet: The Blogroll',
    'page callback' => 'simplexml_drupal_blogroll',
    'access arguments' => array('access content'),
    'type' => MENU_NORMAL_ITEM,
  );
  return $items;
}

function simple_get_xml($url) {
  $html = new DOMDocument();
  // fetch drupal planet and parse it (@ suppresses warnings).
  @$html->loadHTMLFile($url);
  // convert DOM to SimpleXML
  $xml = simplexml_import_dom($html);
  return $xml;
}

/* useful debugging function */
function pre_print($input) {
  print '<pre>';
  // print_r() echoes the dump itself; wrapping it in print would also output a stray "1"
  print_r($input);
  print '</pre>';
}
?>
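
About question 4 in that big comment: the long chain of `div[0]` hops can probably be collapsed into a single XPath query. Here is a sketch run against made-up markup rather than the live drupal.org page, so the nesting depth, blog names, and URLs are all assumptions. It also swaps the `@` operator for `libxml_use_internal_errors()`, which keeps parser warnings off the screen without hiding every other error on that line.

```php
<?php
// Made-up markup that mimics a list buried under a few wrapper divs.
$markup = '<html><body><div id="block-block-8"><div><div>'
  . '<ul>'
  . '<li><a href="http://example.com/">Example Blog</a> (<a href="http://example.com/feed">feed</a>)</li>'
  . '<li><a href="http://example.org/">Another Blog</a> (<a href="http://example.org/rss.xml">feed</a>)</li>'
  . '</ul>'
  . '</div></div></div></body></html>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(TRUE);
$dom->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xml = simplexml_import_dom($dom);

$rows = array();
// "//" matches descendants at any depth, so the wrapper divs stop mattering.
foreach ($xml->xpath('//div[@id="block-block-8"]//ul/li') as $item) {
  $rows[] = array(
    (string) $item->a[0],
    (string) $item->a[0]['href'],
    (string) $item->a[1]['href'],
  );
}

print_r($rows);
```

The nice side effect is that the query no longer cares how many wrapper divs the theme adds, so it should survive small markup changes better than counting them by hand.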