List a Website's External Links Alphabetically Using XPATH and PHP

This script will get an alphabetically sorted list of the external links found on a website. It is limited to .html, .htm, and .php extensions for page urls.

The script uses PHP 5 and its DOM extension. The DOM extension is enabled by default in most PHP installations, so the following should work fine; it does for us. The extension lets you operate on XML documents (and, through loadHTML(), HTML ones) using the DOM API. It supports XPath 1.0, which this script relies on. XPath has been around a while. What is it? XPath is a syntax for addressing parts of an XML document (or an HTML or XHTML one). It uses path expressions to navigate documents and includes a library of standard functions.

We start by making sure that fatal run-time errors (errors that cannot be recovered from) are reported, and only those: error_reporting(E_ERROR) silences warnings and notices. When a fatal error occurs, execution of the script halts and a message appears. The likely ones are running out of memory and timing out. Both happen when a site is too big to process with the available memory or time (or when it is crammed with external links). Often the server's script execution time limit, usually 30 seconds, is exceeded. The server appears to be lenient when the other websites it hosts are making relatively light demands at the moment: we've seen a script run for nearly 3 minutes, and we've also hit the 30-second timeout. The script has to fetch and parse every page on the site, but we've seen 300-page sites take under 10 seconds and a 240-page site with lots of external links take over 2 minutes. Feel free to attempt to manipulate the script timeout setting and risk the wrath of the host. We did NOT.

Next we pull in config.php with an include so that the connection to the MySQL database works. This gives our external links indexing routine write and read access to a database table called external_links, where it stores one record per link.

Next there is an HTML form that gets submitted to this same page to send the site url to the PHP script. The user is instructed that the site url "must include /index.html, /index.htm, /index.php or whatever the home page filename is." The input is checked to see that it ends in htm, html, or php. If it does not, the script skips to the end and gives the alert "Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL." before reloading the page. Each \n\n in the alert text produces a blank line between the parts of the message.
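The extension check described above can be sketched as a small function. This is only an illustration: has_page_extension() is a hypothetical name, and unlike the substr() tests in the script it uses a regular expression that requires the dot before "html".

```php
<?php
// Hypothetical helper (not in the original script): true when the
// submitted URL ends in .htm, .html, or .php.
function has_page_extension($url) {
    // Anchored at the end of the string; "html?" matches htm and html.
    return preg_match('/\.(html?|php)$/', $url) === 1;
}

var_dump(has_page_extension('http://www.example.com/index.html')); // bool(true)
var_dump(has_page_extension('http://www.example.com/'));           // bool(false)
```

A bare domain fails the check, which is why the form insists on the home page filename.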

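Now the parse_url() function is used to parse the URL. Called with only a URL, it returns an associative array containing whichever components of the URL are present; called with PHP_URL_PATH, as it is here, it returns just the path as a string. Then the $home variable gets filled with the file name of the home page (the path minus its leading "/").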
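Here is a short look at what parse_url() hands back in both calling styles. The example URL is made up for illustration, and ltrim() stands in for the substr() arithmetic the script uses.

```php
<?php
$f = 'http://www.example.com/index.html'; // example input, not from the article

// Without a component flag: an associative array of the parts present.
$parts = parse_url($f);
// $parts['scheme'] is 'http', $parts['host'] is 'www.example.com',
// and $parts['path'] is '/index.html'.

// With PHP_URL_PATH: just the path, as a string.
$path = parse_url($f, PHP_URL_PATH);

// Drop the leading "/" to get the home page file name.
$home = ltrim($path, '/');

echo $home; // index.html
```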

Next the strip_tags() function is run on the url (just in case) and the home page file name is subtracted from the url to get the remainder of the url—the scheme and the host name containing the domain. If the $f now ends with "/" that character is dumped. Then spaces inside the url are replaced by %20 so PHP functions that use it do not get errors, and spaces before or after the $f variable get trimmed away.
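The clean-up steps above boil down to something like the following sketch. site_base() is a hypothetical name, and one detail differs from the script: the path is removed only where it sits at the end of the URL, rather than with a blanket str_replace().

```php
<?php
// Hypothetical helper: reduce "http://host/index.html" to "http://host".
function site_base($siteUrl) {
    $siteUrl = trim(strip_tags($siteUrl));
    $path = (string) parse_url($siteUrl, PHP_URL_PATH);

    // Remove the home page path when it is the tail of the URL.
    if ($path !== '' && substr($siteUrl, -strlen($path)) === $path) {
        $siteUrl = substr($siteUrl, 0, -strlen($path));
    }

    // Drop any trailing "/" and encode embedded spaces.
    $siteUrl = rtrim($siteUrl, '/');
    return str_replace(' ', '%20', $siteUrl);
}

echo site_base('http://www.example.com/index.html'); // http://www.example.com
```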

Then any existing MySQL external_links table is dropped and a new table of that name is built with a field for the external links. The $a array is for page urls and the $e array is for the external links. It may be useful to have a MySQL table full of the site's data regarding external links. Who can say? We use it for several things: to avoid out of memory errors from the $e array getting too big (the array holds one page's external links data, but if it held a whole site's external links data, it might be too much), for storage, to count external links for statistical purposes, and to let SQL's ORDER BY clause display the table contents in alphabetical order. Now we save $g as $f plus "/".

Now we go to the Internet with file_get_contents($g.$home), which gets the page's contents into the PHP variable $t. A new DOMDocument object is created because XPath queries need one to run against. The @ in @$dom->loadHTML($t) suppresses the warnings that sloppy HTML triggers as the page contents are loaded into the DOM object. The $xpath = new DOMXPath($dom) statement creates an XPath object whose evaluate() method evaluates a given XPath expression, which in this case is rather simple. If an XPath expression returns a node set, you get a DOMNodeList, which can be looped through to get the values of attributes such as href. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. Our evaluate() argument is the path /html/body//a. This gets the link anchor elements, and $url = $href->getAttribute('href') is used in a loop to read each element's href attribute value, which holds the link url. Several paths can be combined in a single evaluate() argument with the | union operator. We used only single paths in this script.
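To see the DOM and XPath calls in isolation, here is a minimal sketch that runs the same /html/body//a query against a hard-coded HTML string instead of a fetched page. The markup and the $links array are invented for the example.

```php
<?php
// Example markup standing in for a downloaded page.
$html = '<html><body><p><a href="about.html">About</a> '
      . '<a href="http://example.org/">Elsewhere</a></p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);          // @ hides warnings from sloppy markup
$xpath = new DOMXPath($dom);

// evaluate() returns a DOMNodeList for a node-set expression.
$links = array();
foreach ($xpath->evaluate('/html/body//a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print_r($links); // about.html, then http://example.org/
```

DOMNodeList can be walked with foreach as shown, or with the length and item() pair that the script uses.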

The DOMElement class method getAttribute() is essential since attributes are where all the node values with page urls will be found. It is used, in a results loop, to get href attributes. The href attributes get stored in the $a and $e arrays. In handling hrefs we trim off excess spaces and replace spaces inside the urls with %20 to avoid errors. If there are # anchors in the url, they are trimmed off. If there are ? url query strings, they are dumped as well. Path symbols like ./ and ../ and / are dumped since we only want the links on the page we are on, not elsewhere. We will, however, get to "elsewhere" (other site pages) because of our overall method, which is to get every page link url on every page and store it in the $a array, and then go to every one of these pages, getting more page urls and external links.
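The href clean-up can be sketched as one function. clean_href() is a hypothetical name, and it cuts at the first # or ? with a regular expression, where the script uses strrpos() and substr().

```php
<?php
// Hypothetical helper mirroring the href clean-up described above.
function clean_href($url) {
    // Trim outer whitespace and encode inner spaces.
    $url = str_replace(' ', '%20', trim($url));

    // Drop any #anchor or ?query string.
    $url = preg_replace('/[#?].*$/s', '', $url);

    // Remove ../ and ./ path symbols, then one leading "/".
    $url = str_replace(array('../', './'), '', $url);
    if (substr($url, 0, 1) === '/') {
        $url = substr($url, 1);
    }
    return $url;
}

echo clean_href(' ../docs/my page.html?x=1#top '), "\n"; // docs/my%20page.html
echo clean_href('/about.html#team'), "\n";               // about.html
```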

Now the site url (plus /), A.K.A. $g, is dumped from the url being processed. This handles http://siteurl.com being used on interior links of the website which should have been relative links without these absolute link characteristics. If the links are to offsite urls like http://myexternallink.com, etc., the absolute aspects are purposely retained because the PHP str_replace() function will do nothing (since it will not find $g)—the desired result. Then we store any of the links that start with http in the $e array where external links live. The $ok flag means $url is a page url. If there is no http (offsite url) in $url and it is a page url we search the $a array. If the url isn't in the array, it is added.
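The sorting of links into external links, page urls, and everything else can be pictured like this. classify_link() and its return strings are hypothetical; the script itself sets the $ok flag and writes straight into $a and $e. The extension test here also requires the dot before "html", which the script's substr() check does not.

```php
<?php
// Hypothetical helper. $g is the site url plus "/", as in the article.
function classify_link($url, $g) {
    // Strip the site's own prefix so absolute internal links become relative.
    $url = str_replace($g, '', $url);

    // Anything still starting with "http" points offsite.
    if (substr($url, 0, 4) === 'http') {
        return 'external';
    }

    // Page urls must end in .htm, .html, or .php and must not be a home page.
    $isPage = preg_match('/\.(html?|php)$/', $url) === 1;
    $isHome = preg_match('/^([Ii]ndex|[Dd]efault|[Hh]ome|placeholder)\./', $url) === 1;

    return ($isPage && !$isHome && strlen($url) > 4) ? 'page' : 'skip';
}

$g = 'http://www.example.com/';
echo classify_link('http://www.example.com/about.html', $g), "\n"; // page
echo classify_link('http://other.example.org/', $g), "\n";         // external
echo classify_link('index.html', $g), "\n";                        // skip
```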

Now we go through most of what we just did, but in a function. Why not use just the function from the start? Because the needs are similar, but not the same. We have to deal with folders, since we're no longer on the home page, like we were above. We have to deal with the context parameter for the PHP file_get_contents() function, since network conditions can cause us to lose some page links unless we create a stream context with a timeout.
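A stream context with a timeout looks roughly like this. fetch_page() is a hypothetical wrapper, and passing the context in as a parameter is our choice for the sketch, since a variable created outside a function is not visible inside it unless it is passed in or declared global.

```php
<?php
// Give up on a slow HTTP response after 3 seconds.
$context = stream_context_create(array('http' => array('timeout' => 3)));

// Hypothetical wrapper: returns the page contents, or '' on failure.
function fetch_page($url, $context) {
    // Second argument is use_include_path; third is the stream context.
    $html = @file_get_contents($url, false, $context);
    return ($html === false) ? '' : $html;
}

// The options can be read back to confirm what was set.
$opts = stream_context_get_options($context);
echo $opts['http']['timeout']; // 3
```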

The reason we call the function add_urls_to_array() is to handle the dozens or hundreds of other page urls the script will encounter after the home page. Note that the $folder variable gets any folders found, so that if we find stuff/mystuffs.html, we put stuff/ into $folder. If ./ or ../ or http is found, we ignore any folders we find; otherwise we keep the $url variable's folder aspect with the rest of the url. The rest of the function is pretty much the same as previously discussed. The stream context timeout gives the network time to catch up to the script rather than skipping links. We found this out the hard way when we ignored stream context at first. It really is needed.
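The folder handling can be sketched with two small functions. Both names are hypothetical, and the sketch covers only the plain relative case; the script interleaves these checks with the rest of its href clean-up.

```php
<?php
// Hypothetical helper: "stuff/mystuffs.html" gives "stuff/"; no slash gives "".
function folder_of($pageUrl) {
    $pos = strrpos($pageUrl, '/');
    return ($pos === false) ? '' : substr($pageUrl, 0, $pos + 1);
}

// Hypothetical helper: prefix a plain relative href with the current
// page's folder. Links starting with http, ./ or ../ are left alone.
function with_folder($href, $pageUrl) {
    $folder = folder_of($pageUrl);
    $isPlainRelative = substr($href, 0, 4) !== 'http'
        && substr($href, 0, 3) !== '../'
        && substr($href, 0, 2) !== './';

    if ($isPlainRelative && $folder !== '' && strpos($href, $folder) !== 0) {
        return $folder . $href;
    }
    return $href;
}

echo with_folder('more.html', 'stuff/mystuffs.html'), "\n";           // stuff/more.html
echo with_folder('http://example.org/', 'stuff/mystuffs.html'), "\n"; // http://example.org/
```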

Out of the function and farther down the page now, we put a 3-second timeout in the stream context—arrived at by trial and error (mostly error!). Next comes the while ($o<$r-1){ that starts the section dealing with parsing one page url after another until they are all done. Note that $r will have its value increased as more page urls are found needing checking out because of trips to the add_urls_to_array() function. $o is the element number being processed. The external_links table gets any external links found and put in the $e array, using a for loop.
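The growing-array loop is easier to follow without any network traffic. In this sketch a made-up $site array plays the role of the website: each key is a page url and each value lists the page urls it links to. The variable names $a, $o, $r, and $z follow the script; everything else is invented for the example.

```php
<?php
// Hypothetical in-memory "site": page url => page urls it links to.
$site = array(
    'a.html' => array('b.html', 'c.html'),
    'b.html' => array('a.html', 'd.html'),
    'c.html' => array(),
    'd.html' => array('b.html'),
);

$a = array('a.html'); // page urls found so far
$o = -1;              // index of the element being processed
$r = count($a);       // grows as new pages are discovered

while ($o < $r - 1) {
    $o++;
    $z = $a[$o];
    $found = isset($site[$z]) ? $site[$z] : array();
    foreach ($found as $url) {
        // Add only page urls not seen before.
        if (array_search($url, $a) === false) {
            $a[] = $url;
        }
    }
    $r = count($a); // re-read the count so the loop reaches the new entries
}

print_r($a); // a.html, b.html, c.html, d.html
```

The loop ends once $o reaches the last element and no visit has added anything new, so every reachable page is visited exactly once.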

Once the links are processed, the SQL statement "SELECT * FROM external_links ORDER BY external_link" is run and the results, because of the ORDER BY, are alphabetical. The final table echoing shows the external links. The final message tells how many external links were indexed. The final code handles the case where the site url entry was unacceptable: an example entry is shown in an alert, and then the page reloads.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>List a Website's External Links Alphabetically</TITLE>
<meta name="description" content="List a Website's External Links Alphabetically">
<meta name="keywords" content="Index Website External Links,List Website External Links Alphabetically,List Website External Links, list External Links alphabetically,Index a Website,php,CMS,javascript, dhtml, DHTML">
<style type="text/css">
BODY {margin-left:0; margin-right:0; margin-top:0;text-align:left;background-color:#ddd}
p, li {font:13px Verdana; color:black;text-align:left}
h1 {font:bold 28px Verdana; color:black;text-align:center}
h2 {font:bold 24px Verdana;text-align:center}
td {font:normal 13px Verdana;text-align:left;background-color:#ccc}
.topic {text-align:left;background-color:#fff}
.center {text-align:center;}
</style>
</head>
<body>
<?php
error_reporting(E_ERROR);
include_once"config.php";

$f=$_POST['siteurl'];
if (!isset($f)){
echo '<div id="pw" style="position:absolute;top:150px;left:50px;width:950px;text-align:center"><table style="background-color:#8aa;border-color:#00f" border="6" cellspacing=0 cellpadding=6><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="list-a-websites-external-links-alphabetically.php"><b>home page URL (must include /index.html, /index.htm, /index.php or whatever the home page filename is)</b><BR><label for="URL">URL: <input type="text" name="siteurl" size="66" maxlength="99" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';

}else{

if (substr($f,-4)==".htm" || substr($f,-4)=="html" || substr($f,-4)==".php"){
$e=(parse_url($f,PHP_URL_PATH));
if (substr($e,0,1)=="/"){$LLLL=strlen($e);$home=substr($e,1,$LLLL-1);}

$f=strip_tags($f);$f=str_replace($e, "", $f);
$L=strlen($f);if (substr($f,-1)=="/"){$f=substr($f,0,$L-1);}
$f = str_replace(" ", "%20", $f); $f=trim($f);

$sql = "DROP TABLE IF EXISTS external_links";
mysql_query($sql);

$sql = "CREATE TABLE external_links (
id int(4) NOT NULL auto_increment,
external_link varchar(255) NOT NULL default '',
PRIMARY KEY (id)
) ENGINE=MyISAM AUTO_INCREMENT=1";
mysql_query($sql);

// "mediumtext" allows over 16 million bytes

$a=array();$e=array();$n=0;$nn=0;$o=-1;$g=$f."/"; echo "<B>".$f."</B><BR>";
$t = file_get_contents($g.$home);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w){$url=substr($url,0,$w);}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
$ok="0";$url=str_replace($g, "", $url);$L=strlen($url);
if(substr($url,0,4)=="http"){$e[$nn]=$url;$nn++;}
if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && (substr($url,-4)==".htm" || substr($url,-4)=="html" || substr($url,-4)==".php")){$ok="1";} //dumps offsite, home page or wrong extension links
if($L>4 && $ok=="1"){$a[$n]=$url;$n++;}}
$a=array_keys(array_flip($a)); //dump duplicate array values and fill the holes that are left; array_unique has BUG!
$r = count($a);

function add_urls_to_array(){
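// $e, $nn and $context must be global too; otherwise the external links found here and the timeout context are lost
global $a, $e, $g, $z, $t, $r, $nn, $context; $n=$r; $folder="";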
$fo=strrpos($z,"/"); if ($fo){$folder=substr($z,0,$fo+1);}
$LLL=strlen($folder);
$t = file_get_contents($g.$z,0,$context);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
if (substr($url,0,4)=="http"){$sf="0";}else{$sf="1";}
if (substr($url,0,3)=="../" || substr($url,0,2)=="./"){$sf="0";}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
if (substr($url,0,4)<>"http" && substr($url,0,$LLL)<>$folder && $sf=="1"){$url=$folder.$url;}
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w){$url=substr($url,0,$w);}
$ok="0";$url=str_replace($g, "", $url);$L=strlen($url);
if(substr($url,0,4)=="http"){$e[$nn]=$url;$nn++;}
if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && (substr($url,-4)==".htm" || substr($url,-4)=="html" || substr($url,-4)==".php")){$ok="1";} //dumps offsite, home page or wrong extension links
$q=array_search($url,$a);if ($L>4 && $ok=="1" && $q===false){$a[$n]=$url;$n++;}}
$r = count($a);
}

$z=$home;$NN=1;

$context = stream_context_create(array('http' => array('timeout' => 3))); // Timeout in seconds
$z="";

while ($o<$r-1){
$o++; $z=$a[$o];$NN=$o+2;
add_urls_to_array();
}

for ($i = 0; $i < count($e); $i++) {
$zz=mysql_real_escape_string($e[$i]);
$sql="INSERT INTO external_links (external_link) VALUES ('$zz')"; // let auto_increment assign the id
$result=mysql_query($sql);}

$result = mysql_query("SELECT * FROM external_links ORDER BY external_link")
or die(mysql_error());

echo "<table border='1'>";
echo "<tr><th>External Links</th></tr>";
while($row = mysql_fetch_array($result)) {
echo "<tr><td>";
echo $row['external_link'];
echo "</td></tr>";
}

echo "</table><BR>";

mysql_close();
unset($f);$r=$r+1;

echo count($e)." external links were indexed. Press Back Button to submit another URL.";

}else{

mysql_close();
unset($f);

echo '<script language="javascript">alert("Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL.");window.location="list-a-websites-external-links-alphabetically.php"; </script>';}

}

?>

</body>
</html>
