List a Website's Page Titles Alphabetically Using XPATH and PHP
This script will get an alphabetically sorted list of the page titles found on a website. It is limited to .html, .htm, and .php extensions for page urls.
The script uses the PHP DOM extension and PHP 5. The DOM extension is enabled by default in most PHP installations, so the following should work fine (it does for us). The DOM extension lets you operate on XML documents through the DOM API with PHP 5. It supports XPath 1.0, which this script uses extensively. XPath has been around a while. What is it? XPath is a syntax for defining parts of an XML document (or an HTML or XHTML one). It uses path expressions to navigate documents, and it includes a library of standard functions.
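Before digging into the full script, here is a minimal standalone sketch of that technique: load some HTML into a DOMDocument, create a DOMXPath object, and evaluate a path expression against it. The HTML string below is invented purely for illustration.
<?php
// Minimal sketch of the DOM + XPath technique the script relies on.
// The HTML string is invented purely for illustration.
$html = '<html><body><a href="page1.html">One</a> <a href="page2.htm">Two</a></body></html>';
$dom = new DOMDocument();
@$dom->loadHTML($html);                      // @ hides warnings about sloppy markup
$xpath = new DOMXPath($dom);
$links = $xpath->evaluate("/html/body//a");  // every <a> element under <body>
for ($i = 0; $i < $links->length; $i++) {
    echo $links->item($i)->getAttribute('href')."<BR>";
}
?>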
We start by ensuring that fatal run-time errors (errors that cannot be recovered from) are reported. When one occurs, execution of the script halts and a message appears. The likely culprits are running out of memory and timing out. Both happen because a site is too big for the available memory to process (or it is crammed with long page titles), or because the server's script execution time limit (usually 30 seconds) is exceeded. The server seems lenient when the other websites it hosts are making relatively light demands at the moment; we've experienced a nearly 3-minute script execution time as well as the 30-second timeout. The script has to examine every word on the site, yet we've seen 300-page sites take under 10 seconds and a 240-page site with lots of links take over 2 minutes. Feel free to attempt to manipulate the script timeout setting and risk the wrath of the host. We did NOT.
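For reference, the error-reporting line is the one shown below; the set_time_limit() call is included only as the option we chose not to use, since many shared hosts override or frown on it.
<?php
// Report only fatal run-time errors, as the script below does.
error_reporting(E_ERROR);
// Raising the execution time limit is possible in principle, but we left
// the host's default (about 30 seconds) alone:
// set_time_limit(120);
?>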
Next we pull in the config.php data with an include so the connection to the MySQL database will work right, giving our titles-plus-page-links indexing routine the ability to write and then read all the juicy data it will cleverly insert into records in a database table called titles.
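The config.php file itself is not shown in this article. A typical PHP 5 era version, using the same mysql_* functions as the rest of the script, might look something like this sketch; the credentials and database name are placeholders.
<?php
// config.php - hypothetical sketch; the real file is not shown in this article.
// The credentials and database name below are placeholders.
$db_host = "localhost";
$db_user = "your_mysql_user";
$db_pass = "your_mysql_password";
$db_name = "your_database";
mysql_connect($db_host, $db_user, $db_pass) or die("Could not connect: ".mysql_error());
mysql_select_db($db_name) or die("Could not select database: ".mysql_error());
?>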
Next there is an HTML form that gets submitted to this same page to send the site url to the PHP script. The user is instructed that the site url "must include /index.html, /index.htm, /index.php or whatever the home page filename is." The input is checked to see that it ends in htm, html, or php and if not, the script skips to the end and gives the alert "Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL." before reloading the page. The \n\n means skip a couple of lines.
Now the parse_url() function is used to parse a URL and return an associative array containing any of the various components of the URL that are present. Then the $home variable gets filled with the file name of the home page.
Next the strip_tags() function is run on the URL (just in case) and the home page file name is subtracted from the URL to get the remainder: the scheme and the host name containing the domain. If $f now ends with "/", that character is dropped. Then spaces inside the URL are replaced with %20 so the PHP functions that use it do not get errors, and any spaces before or after $f are trimmed away.
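Put together, the URL preparation steps described in the last two paragraphs look roughly like this; the sample URL is just an example.
<?php
// Sketch of the URL preparation steps; the sample URL is an example.
$f = "http://www.example.com/index.html";
$e = parse_url($f, PHP_URL_PATH);       // "/index.html"
$home = "";
if (substr($e, 0, 1) == "/") { $home = substr($e, 1); }    // "index.html"
$f = strip_tags($f);                    // just in case
$f = str_replace($e, "", $f);           // "http://www.example.com"
if (substr($f, -1) == "/") { $f = substr($f, 0, -1); }     // drop a trailing slash
$f = str_replace(" ", "%20", trim($f)); // escape embedded spaces
echo $f."<BR>".$home;                   // http://www.example.com and index.html
?>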
Then any existing MySQL titles table is dumped and a new table of that name is built with a field for the titles and another for the page URLs. The $a array is for page URLs. It may be useful to have a MySQL table full of the site's data regarding page titles and page links; who can say? We use it for storage, to count page titles for statistical purposes, and to take advantage of SQL's ORDER BY clause so we can easily display the table contents in alphabetical order. Now we save $g as $f plus "/".
Now we go to the Internet with file_get_contents($g.$home), which gets the page's contents into the PHP variable $t. A new DOMDocument object is created because XPath use requires a DOMDocument object. The @ in @$dom->loadHTML($t) suppresses error messages from sloppy HTML as the page contents are loaded into the DOM object. The statement $xpath = new DOMXPath($dom) creates an XPath object to use with the evaluate() method, which evaluates the given XPath expression; ours is, in this case, rather simple. If an XPath expression returns a node set, you get a DOMNodeList, which can be looped through to read attribute values such as href. In XPath there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. Our evaluate() argument contains the path /html/body//a, which selects the link anchor elements, and $url = $href->getAttribute('href') is used, in a loop, to read each element's href attribute node value, which holds a link URL. Multiple XPath expression paths can be combined in an evaluate() method argument, but we use only single paths in this script.
The DOMElement class method getAttribute() is essential since attributes are where all the node values with page urls will be found. It is used, in a results loop, to get href attributes. The href attributes get stored in the $a array. In handling hrefs we trim off excess spaces and replace spaces inside the urls with %20 to avoid errors. If there are # anchors in the url, they are trimmed off. If there are ? url query strings, they are dumped as well. Path symbols like ./ and ../ and / are dumped since we only want the links on the page we are on, not elsewhere. We will, however, get to "elsewhere" (other site pages) because of our overall method, which is to get every page link url on every page and store it in the $a array, and then go to every one of these pages, getting more page urls as well as page titles.
Now the site URL (plus /), a.k.a. $g, is stripped from the URL being processed. This handles http://siteurl.com being used on interior links of the website, which should have been relative links without these absolute link characteristics. If the links are to offsite URLs like http://mylink.com, etc., the absolute aspects are purposely retained because the PHP str_replace() function will do nothing (since it will not find $g), which is the desired result. The $ok flag means $url is a page URL. If there is no http (offsite URL) in $url and it is a page URL, we search the $a array; if the URL isn't already in the array, it is added.
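Condensed into a small standalone helper (the function names here are ours, not the script's), the per-link cleanup described in the last two paragraphs works roughly like this.
<?php
// Hypothetical helpers condensing the per-link cleanup the script performs.
// $g is the site URL plus "/", e.g. "http://www.example.com/".
function clean_link($url, $g) {
    $url = str_replace(" ", "%20", trim($url));                       // escape spaces
    $w = strrpos($url, "#"); if ($w) { $url = substr($url, 0, $w); }  // drop # anchors
    $w = strrpos($url, "?"); if ($w) { $url = substr($url, 0, $w); }  // drop query strings
    $url = str_replace("../", "", $url);                              // drop path symbols
    $url = str_replace("./", "", $url);
    if (substr($url, 0, 1) == "/") { $url = substr($url, 1); }
    return str_replace($g, "", $url);     // absolute internal link becomes relative
}
function is_page_url($url) {
    // Offsite links are skipped; the real script also skips index./default./home. aliases.
    if (substr($url, 0, 4) == "http") { return false; }
    $ext = (substr($url, -4) == ".htm" || substr($url, -4) == "html" || substr($url, -4) == ".php");
    return (strlen($url) > 4 && $ext);
}
// Example with a made-up link:
$a = array(); $g = "http://www.example.com/";
$url = clean_link("http://www.example.com/about.html#team", $g);
if (is_page_url($url) && array_search($url, $a) === false) { $a[] = $url; }   // about.html
?>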
Skipping a function, we next go through most of what we just did, but in a function. Why not use just the function from the start? Because the needs are similar, but not the same. We have to deal with folders, since we're no longer on the home page, like we were above. We have to deal with the context parameter for the PHP file_get_contents() function, since network conditions can cause us to lose some page links unless we create a stream context with a timeout.
The reason we call the function add_urls_to_array() is to handle the dozens or hundreds of other page URLs the script will encounter after the home page. Note that the $folder variable picks up any folder found, so that if we find stuff/mystuffs.html, we put stuff/ into $folder. If ./ or ../ or http is found, we ignore any folder we find; otherwise we keep the $url variable's folder portion along with the rest of the URL. The rest of the function is pretty much the same as previously discussed. The stream context timeout gives the network time to catch up to the script rather than skipping links. We found this out the hard way when we at first ignored the stream context. It really is needed.
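Isolated from the rest of the function, the stream context part is just this; the 3-second value is the one arrived at below, and the URL is a placeholder.
<?php
// Create a stream context so a slow response times out after 3 seconds
// instead of hanging or silently losing a page. The URL is a placeholder.
$context = stream_context_create(array('http' => array('timeout' => 3)));
$t = file_get_contents("http://www.example.com/stuff/mystuffs.html", 0, $context);
if ($t === false) { echo "Page could not be fetched within the timeout.<BR>"; }
?>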
Back to the skipped function grab_titles(). It uses the PHP function preg_match() to get the title tags and what's between them (from the page content string $t) into an array $m. The second array element with the key 1 is stuck into the $j variable. The preg_match() function performs a regular expression match on a string. $m[1] will have the text that matched the first captured parenthesized subpattern, which is ([^>]*) and which matches content between the opening and closing title tags.
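In isolation, the title-grabbing regular expression works like this on a made-up page string.
<?php
// Isolated sketch of the title extraction; $t is a made-up page string here.
$t = "<html><head><title>Welcome to My Site</title></head><body>...</body></html>";
// ([^>]*) captures everything between the title tags; the i modifier makes the
// match case-insensitive (the s modifier is harmless since no dot is used).
preg_match('/<title>([^>]*)<\/title>/si', $t, $m);
$j = $m[1];
echo $j;   // Welcome to My Site
?>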
Out of the function and farther down the page now, we call grab_titles(), then put the home page's title and page URL in the titles table, then put a 3-second timeout in the stream context, a value arrived at by trial and error (mostly error!). Next comes the while ($o<$r-1){ that starts the section dealing with parsing one page URL after another until they are all done. Note that $r will have its value increased as more page URLs needing checking are found through trips to the add_urls_to_array() function. $o is the element number being processed. We also call the function grab_titles() in this code block. Each page title goes into the titles table, and any links found go into the $a array.
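The shape of that growing-list loop, stripped of the database and title details, is worth seeing on its own; this is a simplified paraphrase of the logic, not the script's exact code.
<?php
// Simplified paraphrase of the crawl loop: $a grows while we walk it, so the
// loop keeps going until no unvisited page URLs remain. Seed values are made up.
$a = array("about.html", "contact.html");   // page URLs found on the home page
$r = count($a);
$o = -1;
while ($o < $r - 1) {
    $o++;
    $z = $a[$o];        // next page URL to process
    // ...fetch $z, append any new page URLs it links to onto the end of $a...
    $r = count($a);     // $r grows as new URLs are appended
}
?>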
Once the links are processed, the SQL statement "SELECT * FROM titles ORDER BY title" is run and, because of the ORDER BY, the results come back in alphabetical order. The final table echoing shows the titles and page URLs, and the closing message tells how many titles were indexed. The last bit of code handles the case where the site URL entry was unacceptable: an example entry is shown in an alert, and then the page reloads.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>List a Website's Page Titles Alphabetically</TITLE>
<meta name="description" content="List a Website's Page Titles Alphabetically">
<meta name="keywords" content="Index Website titles,List Website Page Titles Alphabetically,List Website Titles, list titles alphabetically,Index a Website,php,CMS,javascript, dhtml, DHTML">
<style type="text/css">
BODY {margin-left:0; margin-right:0; margin-top:0;text-align:left;background-color:#ddd}
p, li {font:13px Verdana; color:black;text-align:left}
h1 {font:bold 28px Verdana; color:black;text-align:center}
h2 {font:bold 24px Verdana;text-align:center}
td {font:normal 13px Verdana;text-align:left;background-color:#ccc}
.topic {text-align:left;background-color:#fff}
.center {text-align:center;}
</style>
</head>
<body>
<?php
error_reporting(E_ERROR);
include_once"config.php";
$f=$_POST['siteurl'];
if (!isset($f)){
echo '<div id="pw" style="position:absolute;top:150px;left:50px;width:950px;text-align:center"><table style="background-color:#8aa;border-color:#00f" border="6" cellspacing=0 cellpadding=6><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="list-a-websites-page-titles-alphabetically.php"><b>home page URL (must include /index.html, /index.htm, /index.php or whatever the home page filename is)</b><BR><label for="URL">URL: </b><input type="text" name="siteurl" size="66" maxlength="99" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';
}else{
if (substr($f,-4)==".htm" || substr($f,-4)=="html" || substr($f,-4)==".php"){
$e=(parse_url($f,PHP_URL_PATH));
if (substr($e,0,1)=="/"){$LLLL=strlen($e);$home=substr($e,1,$LLLL-1);}
$f=strip_tags($f);$f=str_replace($e, "", $f);
$L=strlen($f);if (substr($f,-1)=="/"){$f=substr($f,0,$L-1);}
$f = str_replace(" ", "%20", $f); $f=trim($f);
$sql = "DROP TABLE IF EXISTS titles";
mysql_query($sql);
$sql = "CREATE TABLE titles (
id int(4) NOT NULL auto_increment,
pageurl varchar(255) NOT NULL default '',
title varchar(255) NOT NULL default '',
PRIMARY KEY (id)
) ENGINE=MyISAM AUTO_INCREMENT=1";
mysql_query($sql);
// "mediumtext" allows over 16 million bytes
$a=array();$n=0;$o=-1;$g=$f."/"; echo "<B>".$f."</B><BR>";
$t = file_get_contents($g.$home);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w){$url=substr($url,0,$w);}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
$ok="0";$url=str_replace($g, "", $url);$L=strlen($url);
if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && (substr($url,-4)==".htm" || substr($url,-4)=="html" || substr($url,-4)==".php")){$ok="1";} //dumps offsite, home page or wrong extension links
if($L>4 && $ok=="1"){$a[$n]=$url;$n++;}}
$a=array_keys(array_flip($a)); //dump duplicate array values and fill the holes that are left; array_unique has BUG!
$r = count($a);
function grab_titles(){
global $t; global $z; global $g; unset($m); global $j;
preg_match('/<title>([^>]*)<\/title>/si',$t,$m);
$j=$m[1];
}
function add_urls_to_array(){
global $a; global $g; global $z; global $t; global $r; global $context; $n=$r; $folder=""; // $context must be declared global here or the stream context timeout is never used
$fo=strrpos($z,"/"); if ($fo){$folder=substr($z,0,$fo+1);}
$LLL=strlen($folder);
$t = file_get_contents($g.$z,0,$context);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
if (substr($url,0,4)=="http"){$sf="0";}else{$sf="1";}
if (substr($url,0,3)=="../" || substr($url,0,2)=="./"){$sf="0";}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
if (substr($url,0,4)<>"http" && substr($url,0,$LLL)<>$folder && $sf=="1"){$url=$folder.$url;}
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w){$url=substr($url,0,$w);}
$ok="0";$url=str_replace($g, "", $url);$L=strlen($url);
if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && (substr($url,-4)==".htm" || substr($url,-4)=="html" || substr($url,-4)==".php")){$ok="1";} //dumps offsite, home page or wrong extension links
$q=array_search($url,$a);if ($L>4 && $ok=="1" && $q===false){$a[$n]=$url;$n++;}}
$r = count($a);
}
$z=$home;$NN=1;
grab_titles();
$j=mysql_real_escape_string($j);
$z=mysql_real_escape_string($z);
$sql="INSERT INTO titles(id, pageurl, title)VALUES('', '$z', '$j')";
$result=mysql_query($sql);
$context = stream_context_create(array('http' => array('timeout' => 3))); // Timeout in seconds
$z="";
while ($o<$r-1){
$o++; $z=$a[$o];$NN=$o+2;
add_urls_to_array();
grab_titles();
$j=mysql_real_escape_string($j);
$z=mysql_real_escape_string($z);
$sql="INSERT INTO titles(id, pageurl, title)VALUES('', '$z', '$j')";
$result=mysql_query($sql);
}
$result = mysql_query("SELECT * FROM titles ORDER BY title")
or die(mysql_error());
echo "<table border='1'>";
echo "<tr><th>Title</th><th>Page URL</th></tr>";
while($row = mysql_fetch_array($result)) {
echo "<tr><td>";
echo $row['title'];
echo "</td><td>";
echo $row['pageurl'];
echo "</td></tr>";
}
echo "</table><BR>";
mysql_close();
unset($f);$r=$r+1;
echo $r." titles were indexed. Press Back Button to submit another URL.";
}else{
mysql_close();
unset($f);
echo '<script language="javascript">alert("Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL.");window.location="list-a-websites-page-titles-alphabetically.php"; </script>';}
}
?>
</body>
</html>