Get Links from Web Page
This script will get the links from a web page. It is limited to .html, .htm, .php, and .shtml extensions.
The DOM extension is enabled by default in most PHP installations, so the following should work fine (it does for us). The DOM extension lets you operate on XML documents through the DOM API in PHP 5, and it supports XPath 1.0. What is XPath? XPath is a syntax for addressing parts of an XML document: it uses path expressions to navigate the document, includes a library of standard functions, is a major element of XSLT, and is a W3C recommendation.
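Before the full script, here is a minimal sketch of the DOM-plus-XPath approach: it pulls the `href` attributes out of a small HTML snippet. The markup below is made up purely for illustration.

```php
<?php
// Minimal sketch: query anchor tags with DOMXPath and collect
// their href attributes. The HTML string is a made-up example.
$html = '<html><body>'
      . '<a href="a.html">A</a>'
      . '<p><a href="b.htm">B</a></p>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$nodes = $xpath->evaluate('/html/body//a'); // every <a> anywhere under <body>

$links = array();
for ($i = 0; $i < $nodes->length; $i++) {
    $links[] = $nodes->item($i)->getAttribute('href');
}
print_r($links); // a.html, b.htm
```

The same `evaluate('/html/body//a')` call is the heart of the full script below; everything else is filtering and cleanup of the collected URLs.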
But you can use it to parse web pages as well, as the code below demonstrates. To make the code more useful, we process the retrieved links. The script first defines a URL to search; you can supply the address from an HTML form via POST instead of hardwiring it as we did. Use your own URL in place of http://www.css-resources.com. Next, the for loop reads the href attribute of each anchor into $url. We didn't want anchors in links, so we search for # and discard it and everything after it; likewise we search for ? and discard the query string. Leave out these two lines if you want to keep query strings and anchors. Next, we make sure that any absolute link (one beginning with http) is from the current domain, not another site, by comparing its prefix against $f. We also keep only links with .html, .htm, .php, and .shtml extensions; if you'd like more or fewer extensions, add or remove them in the appropriate place in the script. We also require each link to be at least 5 characters long. We collect the surviving links in an array, remove duplicates, and count the result into $r. We could have used array_unique, but it preserves the original keys, which leaves holes in the array; $a = array_keys(array_flip($a)) both removes duplicates and reindexes from zero.
<?php
$a = array();
$f = 'http://www.css-resources.com'; // replace with your own URL
$html = file_get_contents($f);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $url = $hrefs->item($i)->getAttribute('href');
    // Strip anchors (#...) and query strings (?...); omit these two
    // lines if you want to keep them. Note !== false: strrpos returns
    // 0 (falsy) when the character is at position 0.
    $w = strrpos($url, "#"); if ($w !== false) { $url = substr($url, 0, $w); }
    $w = strrpos($url, "?"); if ($w !== false) { $url = substr($url, 0, $w); }
    // Keep the link only if it is relative or on the current domain,
    // and ends in .htm, .html, .shtml, or .php ("html" also matches
    // the last four characters of ".shtml").
    $onsite = (substr($url, 0, strlen($f)) == $f || substr($url, 0, 4) != "http");
    $ext = substr($url, -4);
    if (strlen($url) > 4 && $onsite
        && ($ext == ".htm" || $ext == "html" || $ext == ".php")) {
        $a[] = $url;
    }
}
$a = array_keys(array_flip($a)); // drop duplicates and reindex from 0
$r = count($a); // count AFTER deduplication, or the loop overruns
for ($i = 0; $i < $r; $i++) { echo $a[$i]; echo "<br>"; }
?>
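The deduplication idiom near the end deserves a quick illustration. The sample array below is made up; it just shows why we reach for array_keys(array_flip()) rather than array_unique when a counter-based loop follows.

```php
<?php
// array_unique removes duplicates but PRESERVES the original keys,
// leaving holes; array_keys(array_flip(...)) removes duplicates AND
// reindexes from 0, so a for loop with a counter is safe afterward.
$a = array('x.html', 'y.htm', 'x.html', 'z.php');

$u = array_unique($a);
// Keys of $u are 0, 1, 3 -- there is no $u[2], so a for loop over
// count($u) would hit an undefined offset.

$b = array_keys(array_flip($a));
// Keys of $b are 0, 1, 2 -- consecutive, safe to loop with a counter.
print_r($b); // x.html, y.htm, z.php
```

This is the same cleanup the script performs on its collected links before echoing them.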