I’ve recently been researching XML as a back-end data source for a new CMS I’m trying to build, continuing my perpetual quest for a completely database-independent front-end. My usual process for such endeavors is to pick up a couple of books at the Library of Congress and peruse my RSS feeds for relevant insights. I like to start with the books (for example, XML and PHP, circa 2002) and work my way forward so that I can get a feel for how the technology has adapted. Before taking a look at some brand-spanking-new methods from PHP5, I’d like to take the opportunity to go over two classic methods for dealing with XML.
Simple API for XML (SAX)
As the name suggests, the SAX method for parsing XML is a very simple, top-down approach. This method is event-based, in contrast to the tree-based method we’ll go over in a bit. The parser reads the file from the top down and triggers an event every time it encounters an opening tag, a closing tag, or a run of character data. You handle those events to control the data as it’s being processed. This method is valuable because it is fast and light on memory: it never builds a tree of the whole document, and the parser can be fed the data in chunks.
A (very) basic example:
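$file = "example.xml";

// Event handlers: the parser calls these as it meets each tag or run of text.
function startTag($parser, $name, $attrs){
    echo "<b>";
}
function endTag($parser, $name){
    echo "</b><br />";
}
function contents($parser, $data){
    echo $data;
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startTag", "endTag");
xml_set_character_data_handler($xml_parser, "contents");

// Read the whole file into a string; stop if it isn't there.
if (!file_exists($file)) {
    die("Error: cannot find ".$file);
}
$data = file_get_contents($file);

// The third argument (true) tells the parser this is the last piece of data.
if (!xml_parse($xml_parser, $data, true)) {
    die("Error: ".xml_error_string(xml_get_error_code($xml_parser))." on line ".xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);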
What I consider to be the core of a SAX implementation are your three basic event handlers: startTag(), endTag() and contents(). You can do a large part of the manipulation and storage you need to do using these three functions and, if necessary, a few global variables. The rest is simple: read the file into a string variable, set your handlers, and run the parser. I don’t want to go into too much detail here, since I want to save the really good stuff for when we get into SimpleXML next time, but it is helpful to understand how this older approach works first.
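To make the "few global variables" idea concrete, here is a minimal sketch that collects values instead of echoing them. The sample XML, the element names, and the variables $titles, $inTitle and $buffer are all made up for illustration; the parser functions are the same ones used above.

```php
<?php
// Hypothetical sample data; the element names are invented for this sketch.
$xml = "<library><book><title>XML and PHP</title></book>"
     . "<book><title>Learning XML</title></book></library>";

$titles  = array();  // collected results
$inTitle = false;    // true while the parser is inside a <title> element
$buffer  = "";       // text gathered for the current <title>

function startTag($parser, $name, $attrs) {
    global $inTitle, $buffer;
    // Names arrive upper-cased because case folding is on by default.
    if ($name == "TITLE") {
        $inTitle = true;
        $buffer  = "";
    }
}

function endTag($parser, $name) {
    global $inTitle, $buffer, $titles;
    if ($name == "TITLE") {
        $titles[] = $buffer;
        $inTitle  = false;
    }
}

function contents($parser, $data) {
    global $inTitle, $buffer;
    // Character data can arrive in several pieces, so append rather than assign.
    if ($inTitle) {
        $buffer .= $data;
    }
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "startTag", "endTag");
xml_set_character_data_handler($parser, "contents");

if (!xml_parse($parser, $xml, true)) {
    die("Error: " . xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

The handlers only ever see one event at a time, so anything you want to remember between events (here, whether we are inside a title) has to live outside them. That is the trade-off you accept for the low memory use.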
Document Object Model (DOM)
The second PHP4-based method for parsing XML differs from the first in that it is tree-based. Unlike the SAX method, the DOM method reads the entire file at once and builds a hierarchical tree of objects, meaning you can manipulate data based on parent-child relationships. The DOM method also includes functions for creating, editing, and deleting XML content and for writing the result back out. This method has great value for more complicated applications but has a few drawbacks. Because it must read the entire file into memory before you can work with it, the DOM method is very resource-heavy. It also isn’t very stable in PHP 4, so I wouldn’t recommend using it unless you’re running at least PHP 5. If interested, you can find some excellent tutorials and examples at W3Schools.
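For a rough feel of the tree-based style, here is a small sketch using the DOM extension that ships with PHP 5 (not the older PHP 4 one). The sample XML and element names are made up for illustration, and this is only one way to do it.

```php
<?php
// Hypothetical sample data; the element names are invented for this sketch.
$xml = "<library><book><title>XML and PHP</title></book></library>";

$doc = new DOMDocument();
$doc->loadXML($xml);  // parses the whole document into a tree in memory

// Read: start at the root and look through its descendants.
$root  = $doc->documentElement;
$found = array();
foreach ($root->getElementsByTagName("title") as $title) {
    $found[] = $title->nodeValue;
}

// Create: build a new <book> with a <title> child and attach it to the root.
$book = $doc->createElement("book");
$book->appendChild($doc->createElement("title", "Learning XML"));
$root->appendChild($book);

// Delete: remove the first <book> by asking its parent to drop it.
$first = $root->getElementsByTagName("book")->item(0);
$first->parentNode->removeChild($first);

// Serialize the modified tree back to an XML string.
echo $doc->saveXML();
```

Notice that nothing here reacts to events. The whole document is already in memory, so you can jump around it, add to it, and prune it in any order.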
There is, of course, no one-size-fits-all solution here. Each application has a unique set of requirements that will determine which method is best. There are also other methods besides the two described here, such as SimpleXML, which is new to PHP5 and, like DOM, loads the whole document into a tree of objects, though with a much simpler interface. I will be covering this as well as XPath and XSLT in part two of this series.