List Urls in XML Sitemap by Tag Name Using XPATH and PHP

This script lists the URLs in an XML sitemap by tag name, using XPath and PHP. The sample file is a website sitemap that we generated with XML Sitemaps.

The script uses the PHP DOM extension, which has been available since PHP 5. The DOM extension is enabled by default in most PHP installations, so the following should work without extra setup; it does for us. The extension lets you operate on XML documents through the DOM API, and it supports XPath 1.0, which this script relies on. What is XPath? It is a syntax for addressing parts of an XML document (or an HTML or XHTML one). It uses path expressions to navigate documents, and it includes a library of standard functions.
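To make "path expressions" and "standard functions" concrete, here is a minimal sketch. The library/book document is a made-up example (it is not the sitemap used on this page), and it has no namespace, so no registration is needed.

```php
<?php
// Hypothetical inline document, used only for illustration.
$xml = <<<XML
<library>
  <book year="2001"><title>First</title></book>
  <book year="2010"><title>Second</title></book>
</library>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Path expression: every <title> child of a <book> under the root <library>.
$titles = [];
foreach ($xpath->query('/library/book/title') as $node) {
    $titles[] = $node->nodeValue;
}

// Predicate on an attribute: books whose year is greater than 2005.
$recent = $xpath->query('//book[@year > 2005]/title')->item(0)->nodeValue;

// Standard function inside a predicate: titles that start with "S".
$startsWithS = $xpath->query('//book[starts-with(title, "S")]/title')->item(0)->nodeValue;

echo implode(', ', $titles), "\n";
echo $recent, "\n";
echo $startsWithS, "\n";
```

The same three ideas (a path, a predicate, a function) carry over to any XML document; only the element names change.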

The DOMXPath class has a document property, which holds the DOMDocument it works on, and several very useful methods. DOMXPath::__construct creates the object. DOMXPath::evaluate evaluates the given XPath expression and returns a typed result if possible (a number or a string, for example), or a DOMNodeList containing all nodes that match the expression. DOMXPath::query evaluates the given XPath expression and returns a DOMNodeList containing all matching nodes. DOMXPath::registerNamespace registers a prefix for a namespace URI; it is necessary when you use XPath on documents that declare a default namespace in an xmlns declaration, which in the case of a sitemap is in the urlset tag. DOMXPath::registerPhpFunctions makes PHP functions callable from XPath expressions. Many XML files (PAD files, for example) have no xmlns declaration and therefore need no namespace registration.
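The difference between query() and evaluate() is easiest to see side by side. The sketch below uses a made-up two-entry sitemap (the example.com URLs are assumptions) loaded from a string so it runs offline.

```php
<?php
// Hypothetical two-entry sitemap, inlined for illustration.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('m', 'http://www.sitemaps.org/schemas/sitemap/0.9');

// query() returns a DOMNodeList of matching nodes.
$list = $xpath->query('//m:loc');

// evaluate() returns a typed result when the expression yields one:
// count() gives a number (a PHP float), string() gives a PHP string.
$count = $xpath->evaluate('count(//m:loc)');
$first = $xpath->evaluate('string((//m:loc)[1])');

echo get_class($list), ' with ', $list->length, " nodes\n";
echo gettype($count), ' ', $count, "\n";
echo $first, "\n";
```

For a plain listing task query() is enough; evaluate() becomes handy when you want a count or a single string without looping.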

We perform the URL listing task twice: first with DOM only and no XPath, then with XPath. The getElementsByTagName() method is the more straightforward choice for listing URLs in an XML sitemap by tag name. But keep in mind that XPath can do a lot that DOMDocument methods alone cannot. First the non-XPath version:

A new DOMDocument object is created so we can use the Document Object Model to get info from the file. We load the XML file with the load method. Then we use the getElementsByTagName() method in this form:
$nodes = $doc->getElementsByTagName("loc"); and this gives us a DOMNodeList to loop through. Then we get the length of this list and use a for loop to step through the nodes, getting strings we can echo by use of: ->item($i)->nodeValue. We need nodeValue because echo only outputs strings, and a DOM node object cannot be converted to a string directly.
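The steps just described can be tried offline with a small inline document. This is a sketch under assumptions: the two example.com URLs are made up, and loadXML() replaces load() so that no network access is needed.

```php
<?php
// Hypothetical two-entry sitemap, inlined for illustration.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// DOMNodeList of every element whose tag name is "loc".
$nodes = $doc->getElementsByTagName('loc');

$lines = [];
for ($i = 0; $i < $nodes->length; $i++) {
    // item($i) is a DOMElement; nodeValue is its text content as a string.
    $lines[] = ($i + 1) . ' ' . $nodes->item($i)->nodeValue;
}

echo implode("\n", $lines), "\n";
```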

Now the XPath version: A new DOMDocument object is created, because a DOMXPath object needs a DOMDocument to work on.
The $dom->load('http://www.theliquidateher.com/sitemap.xml') call reads the sitemap file's contents into the $dom object. Next we use $xpath = new DOMXPath($dom) to create a DOMXPath object bound to that document. Now we use the registerNamespace() method to register the namespace, because we happen to know about this file's xmlns declaration, which in the case of a sitemap is in the urlset tag. Next we define the $loc array. It is not strictly needed, but it's a convenient place to store XML document info if you need it later. Now we run the XPath query //m:loc, which selects every loc element anywhere in the document. The m: in front of loc is the namespace prefix we passed as the first argument of registerNamespace(), ahead of the namespace URI. The prefix can be whatever you want it to be, but it cannot be omitted. The result of our query is a DOMNodeList, which we loop through, putting the node values into the $loc array. We use another loop to echo each array element. The XPath syntax page needs to explain namespace prefixes better. Right now, they are not even mentioned.
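To show that the prefix is an arbitrary label bound to the namespace URI, here is a small sketch. The helper name listLocs() and the example.com URLs are our own inventions for this illustration; only the namespace URI comes from the sitemap format.

```php
<?php
// Hypothetical two-entry sitemap, inlined for illustration.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>
XML;

// Hypothetical helper: list all <loc> values using whatever prefix is given.
function listLocs(string $xml, string $prefix): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    // Bind the chosen prefix to the sitemap namespace URI.
    $xpath->registerNamespace($prefix, 'http://www.sitemaps.org/schemas/sitemap/0.9');

    $loc = [];
    foreach ($xpath->query("//{$prefix}:loc") as $node) {
        $loc[] = $node->nodeValue;
    }
    return $loc;
}

$withM  = listLocs($xml, 'm');
$withSm = listLocs($xml, 'sm');

echo implode("\n", $withM), "\n";
echo $withM === $withSm ? "same result\n" : "different result\n";
```

Both calls return the same list, because what matters to XPath is the URI behind the prefix, not the prefix text itself.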

As you will see in List Specified Elements in XML Document by Tag Name Using XPATH and PHP, you can get tags one at a time using getElementsByTagName, but this is useful only if there are a lot of unique tags with few or no children. In the script on this page, there are hundreds of loc tags in the sitemap file, so we loop. In the DOM-only part of the script below, we loop through the results of the getElementsByTagName() method, since this method returns a DOMNodeList containing the elements with the given tag name. These are easy to loop through.

The DOM-only version has no need for $xpath = new DOMXPath($doc), because getElementsByTagName() is a DOMDocument method and does not use XPath. For $xpath->query() calls, though, the DOMXPath object is essential. Note also that we did not need to deal with namespace registration with the getElementsByTagName() method, because no XPath was involved, but we needed it for the XPath version. The registerNamespace() method registers the namespace with the DOMXPath object we create in the XPath version of the script below. The query won't work without that registration, nor will it work if we leave the prefix off the element name in the XPath query.
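A quick way to convince yourself of this is to count matches for each approach. The sketch below again assumes a made-up two-entry sitemap loaded from a string.

```php
<?php
// Hypothetical two-entry sitemap, inlined for illustration.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// DOM only: the tag name is matched without any namespace handling.
$domOnly = $dom->getElementsByTagName('loc')->length;

$xpath = new DOMXPath($dom);

// XPath without a prefix: "loc" means a loc element in NO namespace,
// so nothing in this namespaced document matches.
$unprefixed = $xpath->query('//loc')->length;

// XPath with a registered prefix: the loc elements are found.
$xpath->registerNamespace('m', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$prefixed = $xpath->query('//m:loc')->length;

echo "DOM only: $domOnly\n";
echo "XPath, no prefix: $unprefixed\n";
echo "XPath, with prefix: $prefixed\n";
```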

Whenever an XPath query or a DOM method such as getElementsByTagName() returns a node set, you get a DOMNodeList that can be looped through to get values. In the non-XPath version in List Specified Elements in XML Document by Tag Name Using XPATH and PHP, we skip the loop and just get the node values of four different tags found in the file. (We could have used the method that our XPath example used: going after all elements with File_Info as the parent tag; but since there were only three tags with that parent, the way we illustrated was fine.) Fetching tags one by one works well when tag names are unique or when a parent has only a few child tags, as just discussed. But in the script on this page, looping through the elements with a given tag name is essential, because the sitemap contains many elements with the same tag name.
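Selecting all elements under one parent looks like this in practice. The fragment below is a hypothetical PAD-like snippet; its tag names and values are assumptions made for illustration, not taken from a real file.

```php
<?php
// Hypothetical PAD-like fragment; names and values are assumptions.
$xml = <<<XML
<XML_DIZ_INFO>
  <File_Info>
    <File_Size_Bytes>1048576</File_Size_Bytes>
    <File_Size_K>1024</File_Size_K>
    <File_Size_MB>1</File_Size_MB>
  </File_Info>
</XML_DIZ_INFO>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// No xmlns declaration here, so no registerNamespace() call and no prefix.
// "//File_Info/*" selects every element child of File_Info.
$info = [];
foreach ($xpath->query('//File_Info/*') as $child) {
    $info[$child->nodeName] = $child->nodeValue;
}

foreach ($info as $name => $value) {
    echo $name, ': ', $value, "\n";
}
```

With only three children, fetching each tag by name would work just as well; the XPath form pays off as the number of children grows.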

In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. You can get more information on the syntax to use in XPath expressions on the W3Schools XPath expression page.
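Several of these node kinds can be selected directly, as the following sketch shows for four of them. The tiny root/item document is made up for illustration; the XML_*_NODE constants are the ones the DOM extension defines.

```php
<?php
// Hypothetical document with an attribute, a comment, an element and text.
$xml = '<root id="r1"><!-- note --><item>text</item></root>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Each expression targets a different kind of node.
$element   = $xpath->query('/root/item')->item(0);        // element node
$attribute = $xpath->query('/root/@id')->item(0);         // attribute node
$text      = $xpath->query('/root/item/text()')->item(0); // text node
$comment   = $xpath->query('/root/comment()')->item(0);   // comment node

$kinds = [
    'element'   => $element->nodeType === XML_ELEMENT_NODE,
    'attribute' => $attribute->nodeType === XML_ATTRIBUTE_NODE,
    'text'      => $text->nodeType === XML_TEXT_NODE,
    'comment'   => $comment->nodeType === XML_COMMENT_NODE,
];

foreach ($kinds as $kind => $matches) {
    echo $kind, ': ', $matches ? 'ok' : 'unexpected type', "\n";
}
echo $attribute->nodeValue, "\n";
```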

<?php

// DOM-only version: collect every <loc> element by tag name.
$doc = new DOMDocument("1.0", "ISO-8859-1");
$doc->load('http://www.theliquidateher.com/sitemap.xml');
$nodes = $doc->getElementsByTagName("loc");
$nodeListLength = $nodes->length;
for ($i = 0; $i < $nodeListLength; $i++) {
    // nodeValue gives the element's text as a string we can echo.
    $node = $nodes->item($i)->nodeValue;
    echo ($i + 1) . " " . $node . "<BR>";
}
echo "<BR>";

// XPath version: the sitemap declares a default namespace on <urlset>,
// so we register a prefix ("m") for it and use that prefix in the query.
$dom = new DOMDocument("1.0", "ISO-8859-1");
$dom->load('http://www.theliquidateher.com/sitemap.xml');
$xpath = new DOMXPath($dom);
$xpath->registerNamespace("m", "http://www.sitemaps.org/schemas/sitemap/0.9");
$loc = array();
$locNodes = $xpath->query('//m:loc');
for ($i = 0; $i < $locNodes->length; $i++) {
    $loc[$i] = $locNodes->item($i)->nodeValue;
}
$n = count($loc);
for ($i = 0; $i < $n; $i++) {
    echo ($i + 1) . " " . $loc[$i] . "<BR>";
}
echo "<BR>";

?>
