Copyright © 2002 - MCS Investments, Inc.


Grab Web Page Links and Video Links and Audio Links from Web Page Using XPATH and PHP

This script will get alphabetically sorted lists of the links, audio links, and video links found on a web page. It is limited to .html, .htm, and .php extensions for page urls.

The script uses PHP 5 and the PHP DOM extension. The DOM extension is enabled by default in most PHP installations, so the following should work fine; it does for us. The extension lets you operate on XML documents through the DOM API, and it supports XPath 1.0, which this script uses extensively. XPath has been around a while. What is it? XPath is a syntax for addressing parts of an XML document (or an HTML or XHTML one). It uses path expressions to navigate in documents, and it includes a library of standard functions.
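
As a minimal sketch of the DOM and XPath pairing described above, the following loads a small, made-up HTML string (an assumption for illustration, not a page the script was written for) and uses one path expression to collect the href values of its anchors.

```php
<?php
// Hypothetical inline HTML, used only for illustration.
$html = '<html><body><p><a href="one.html">One</a> <a href="two.php">Two</a></p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);            // @ hides warnings from sloppy markup
$xpath = new DOMXPath($dom);

// A path expression that selects every anchor element inside the body.
$nodes = $xpath->evaluate('/html/body//a');

$found = array();
foreach ($nodes as $node) {
    $found[] = $node->getAttribute('href');   // read the attribute off each element node
}
print_r($found);
```

The node list comes back as a DOMNodeList, so it can be walked with foreach or with item() and length.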

To begin with, the script gets the page url POSTed to it from a form, and $whicharray=$_POST['whicharray'] gets the answer to the question of which list undetermined media should go to, video or audio. The HTML form comes next; it submits to this same page, sending the PHP script the page url and the answer to the video-or-audio question. (An example of an undetermined link: a link such as http://www.stuff.com/NS2bV85ZLiA is found in the data attribute of an object tag, it has no file extension, and its domain is not listed in the isitvideo() or isitaudio() functions, so who can say if it is video or audio?)

Next the strip_tags() function is run on the url (just in case). If the $f now ends with "/" that character is dumped. Then spaces inside the url are replaced by %20 so PHP functions that use it do not get errors, and spaces before or after the $f variable get trimmed away.
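
The cleanup steps above can be sketched as one small function. The name cleanPageUrl() is hypothetical, and so is the example.com url. One choice here is ours: the sketch trims first, so that leading and trailing spaces are removed rather than turned into %20.

```php
<?php
// cleanPageUrl() is an illustrative helper name, not part of the original script.
function cleanPageUrl($f) {
    $f = strip_tags($f);                // drop any markup typed into the form field (just in case)
    $f = trim($f);                      // remove spaces before and after the url
    if (substr($f, -1) == "/") {        // dump a trailing slash
        $f = substr($f, 0, -1);
    }
    return str_replace(" ", "%20", $f); // spaces inside the url become %20
}

echo cleanPageUrl("  http://www.example.com/my page.html "), "\n";
```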

Then come the functions isitaudio() and isitvideo(). These check each url to see whether its extension marks it as video (or likely video) or as audio (or likely audio). MIME types are a poor guide: if the MIME type is application/x-shockwave-flash, application/x-oleobject, application/x-mplayer2, application/vnd.rn-realmedia, or application/ogg, for example, the media could be audio or video. So MIME types aren't looked at, except when the link has been found in the data attribute of an object tag. Extensions are not a sure thing either. The extensions rm, ogx, and swf, and perhaps others, can be audio or video, so we may end up with a few links that are "likely to be video" or "likely to be audio" but aren't. Whatever.rm may be video, but that is not a sure thing. Another issue is that embed, object, parameter, anchor, iframe, and HTML5 source tags can all contain video or audio, and so can data, src, and value attributes. And many video links have no extensions at all, such as those on YouTube, Break, Hulu, and plenty more, and the same goes for audio links on some audio sites. In other words, there is no way to get it perfect no matter what one does.

As a result, we put some major video site domains in the isitvideo() function and some major audio site domains in the isitaudio() function, and you may feel free to add your own. A link that starts with one of these domains gets included in the video or audio list displayed onscreen. Few sites host both video and audio, so a link that starts with "http://www.youtube.com" or one of the other video domains is treated as video, and a link that starts with "http://www.mp3.com" or one of the other audio domains is treated as audio. It's too bad that there are so many hundreds of video and audio sites and combinations of MIME types, file extensions, tags, and tag attributes, and few standards to enforce consistency, but that is how it is. HTML5 is trying to simplify matters with its audio and video tags, but its source tag has already confused the issue: it can contain either video or audio!
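
To make the extension and domain checks concrete, here is a shortened sketch of the same idea using lists instead of long chains of substr() tests. The names looksLike() and classifyMedia() and the trimmed-down lists are assumptions for illustration; the script's own functions are isitaudio() and isitvideo(), and their lists are much longer. The $default argument stands for the a-or-v answer from the form and decides which list gets first claim on ambiguous extensions such as rm and swf.

```php
<?php
// Hypothetical, much shortened lists; extend them as you would the script's own.
$lists = array(
    'audio' => array('ext' => array('mp3', 'wav', 'wma', 'rm', 'swf'),
                     'domains' => array('http://www.mp3.com')),
    'video' => array('ext' => array('avi', 'mov', 'mp4', 'rm', 'swf'),
                     'domains' => array('http://www.youtube.com')),
);

// True if $url ends with a listed extension or starts with a listed domain.
function looksLike($url, $list) {
    foreach ($list['ext'] as $e) {
        if (substr($url, -(strlen($e) + 1)) == '.' . $e) { return true; }
    }
    foreach ($list['domains'] as $d) {
        if (substr($url, 0, strlen($d)) == $d) { return true; }
    }
    return false;
}

// $default is 'a' or 'v': the list that is consulted first.
function classifyMedia($url, $default, $lists) {
    $order = (strtolower($default) == 'a') ? array('audio', 'video') : array('video', 'audio');
    foreach ($order as $kind) {
        if (looksLike($url, $lists[$kind])) { return $kind; }
    }
    return null; // neither list claims the link
}

var_dump(classifyMedia('clip.rm', 'a', $lists));
var_dump(classifyMedia('clip.rm', 'v', $lists));
```

The same clip.rm link lands in a different list depending on the default, which is exactly the ambiguity the paragraph above describes.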

Now we go to the Internet with file_get_contents($f), which gets the page's contents into the PHP variable $html. A new DOMDocument object is created because XPath needs a DOMDocument to work on. The @ in @$dom->loadHTML($html) suppresses the warnings that sloppy HTML produces as the page contents are loaded into the DOM object. The $xpath = new DOMXPath($dom) statement creates an XPath object whose evaluate() method evaluates the given XPath expression, which is, in this case, rather complex. If an XPath expression returns a node set, you get a DOMNodeList, which can be looped through to get the values of attributes such as href. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. Our evaluate() argument contains the path /html/body//a. This gets the anchor elements, and $url = $href->getAttribute('href') is used in a loop to read each element's href attribute, which holds the link url. Multiple XPath paths are combined in the one evaluate() argument. To separate the paths, we use "|", the XPath union operator: the result contains the nodes matched by the first path and the nodes matched by the second path and so on, not just the nodes of whichever path matches first.
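
A small sketch of the "|" separator at work, on made-up markup (the file names are placeholders): a single path returns only the anchors, while three paths joined with "|" return the nodes of all three.

```php
<?php
// Hypothetical markup with one anchor, one embed, and one img carrying dynsrc.
$html = '<html><body><a href="page.html">p</a>'
      . '<embed src="clip.swf"></embed><img dynsrc="movie.avi"></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$onlyAnchors = $xpath->evaluate('/html/body//a');
$combined    = $xpath->evaluate('/html/body//a | /html/body//embed[@src] | /html/body//img[@dynsrc]');

// Collect the tag names that the combined expression matched.
$names = array();
foreach ($combined as $node) {
    $names[] = $node->nodeName;
}

echo $onlyAnchors->length, " node(s) from one path\n";
echo $combined->length, " node(s) from three paths joined with |\n";
```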

Look through our multi-path XPath expression. The first path requires application/x-shockwave-flash as the MIME type (the type attribute) when picking up object tags that have a data attribute. The second one is the anchor tag parser. The next path gets embed tags that have a src or qtsrc attribute. The next one gets parameter tags inside object tags, for their value attribute, as long as the parameter tag's name attribute is src, FileName, or movie. The next five get tags with a src attribute: video tags, audio tags, bgsound tags, iframe tags, and source tags, but an iframe is accepted only if its title attribute is "YouTube video player". (Only HTML5 has video and source tags, not to mention audio tags and some other new tags.) The final path parses img tags for the dynsrc attribute, a Microsoft way of sticking a video where it doesn't belong: in an image tag. Go figure!
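
Two of the predicate paths can be tried in isolation on made-up markup. The urls and the "abc" id below are placeholders. The sketch shows that a param tag is only picked up when its name is src, FileName, or movie, and that an iframe is only picked up when its title matches.

```php
<?php
// Hypothetical markup; only the first param and the first iframe should match.
$html = '<html><body>'
      . '<object><param name="movie" value="http://www.example.com/intro.swf">'
      . '<param name="quality" value="high"></object>'
      . '<iframe title="YouTube video player" src="http://www.youtube.com/embed/abc"></iframe>'
      . '<iframe title="Ad" src="http://www.example.com/ad.html"></iframe>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$nodes = $xpath->evaluate(
    "/html/body//object/param[@value and (@name='src' or @name='FileName' or @name='movie')]"
    . " | /html/body//iframe[@src and @title='YouTube video player']"
);

$media = array();
foreach ($nodes as $node) {
    // param tags carry the url in value; iframes carry it in src.
    $url = $node->getAttribute('value');
    if (!$url) { $url = $node->getAttribute('src'); }
    $media[] = $url;
}
print_r($media);
```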

The DOMElement class method getAttribute() is essential since attributes are where all the node values with page urls and video urls will be found. It is used, in a results loop, to get href first, and also dynsrc, src, qtsrc, value, and data attributes. The dynsrc attribute value is put in the $video array since what else can its node value be? The href attributes are tricky in that they can be just a new page to parse or an audio or video url link on a page. The former gets stored in the $a array—the latter in the $video or $audio array. In handling hrefs we trim off excess spaces and replace spaces inside the urls with %20 to avoid errors. If there are # anchors in the url, they are trimmed off. If there are ? url query strings, they are dumped if the url ends with html, htm, or php before the ? symbol, but otherwise left alone as essential aspects of audio or video links. Path symbols like ./ and ../ and / are dumped since we only want the links on the page we are on, not elsewhere. The $ok flag means $url is a page url. The $k=$na; isitaudio(); if($na==$k){isitvideo();} code shows how we relate to the answer input into the form for the question of where to put undetermined links—in the video or the audio array. If the user wanted audio as the default, we check the isitaudio() function first, and if nothing is added to that array we run the isitvideo() function. If the user wanted video as the default, we check the isitvideo() function first, and if nothing is added to that array we run the isitaudio() function.
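
The handling of # anchors and ? query strings can be sketched as a standalone function. The name cleanHref() is hypothetical, and the sketch compares strrpos() results against false, which is our choice here so that a symbol at position 0 is handled too. A query string is dropped only when what comes before it ends in htm, html, or php.

```php
<?php
// cleanHref() is an illustrative helper name, not part of the original script.
function cleanHref($url) {
    $url = str_replace(" ", "%20", trim($url));   // trim first, then encode inner spaces

    $w = strrpos($url, "#");
    if ($w !== false) {                           // drop a # anchor
        $url = substr($url, 0, $w);
    }

    $w = strrpos($url, "?");
    if ($w !== false) {
        $before = substr($url, 0, $w);
        // Only page urls lose their query string; media links keep theirs.
        if (preg_match('/\.(html?|php)$/', $before)) {
            $url = $before;
        }
    }
    return $url;
}

echo cleanHref(" about us.html#team "), "\n";
echo cleanHref("news.php?id=7"), "\n";
echo cleanHref("http://www.example.com/watch?v=abc"), "\n";
```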

If the node is an anchor with an href attribute, $url isn't empty, so control slips to the if($url){ . . . etc.} statement just discussed. An href that does not end in htm, html, or php may be a video or audio link, so we run the isitvideo() and isitaudio() functions (already discussed) to find out, which will get it into the $video or $audio array if it is indeed a video or very likely a video, or an audio or very likely an audio. If no href is involved in the node, one of the other attributes likely is, so we check for a src, qtsrc, value, or data attribute in turn. And, again, if one of these attributes is found on the node, control slips down to the if($url){ . . . etc.} statement, since $url is not empty. The dynsrc attribute is checked last, on its own. After the loop, we dump duplicate array values and fill the holes that are left, then count array elements for the $a, $audio, and $video arrays. Finally, we sort each array and echo each of these alphabetically sorted arrays to the screen.
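
The dedupe-count-sort step can be seen on a small made-up array (the file names are placeholders). Flipping the array turns values into keys, so duplicates collapse, and array_keys() hands the values back with fresh consecutive indexes. This trick relies on the values being strings or integers, which urls are.

```php
<?php
// Hypothetical list with duplicates.
$video = array("b.mp4", "a.avi", "b.mp4", "c.mov", "a.avi");

$video = array_keys(array_flip($video)); // duplicates collapse; indexes run 0, 1, 2 with no holes
$nvr = count($video);
sort($video);                            // alphabetical order

echo "Videos\n";
for ($i = 0; $i < $nvr; $i++) {
    echo ($i + 1) . " " . $video[$i] . "\n";
}
```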

The script works nicely, given an impossible task to try to accomplish . . . well . . . SEMIperfectly! (But, of course, if you find that you know of a bunch of relevant video site domains and you add them to the isitvideo() function, as well as adding relevant audio site domains to the isitaudio() function, this makes the script more perfect!)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>Grab Web Page Links and Video links and Audio Links from Web Page</TITLE>
<meta name="description" content="Grab Web Page Links and Video links and Audio Links from Web Page">
<meta name="keywords" content="Grab Links from Web Page,Get Links from Web Page,Grab Web Page Links and Video links and Audio Links from Web Page,Grab Links on Web Page,Get Links on Web Page,Get Video Links on Web Page,Get Audio Links on Web Page,php,CMS,javascript, dhtml, DHTML">
<center><h2>Grab Web Page Links and Video links and Audio Links from Web Page</h2></center>


<?php
// Read the page url and the audio/video default POSTed by the form below.
$f = isset($_POST['pageurl']) ? $_POST['pageurl'] : null;
$whicharray = isset($_POST['whicharray']) ? $_POST['whicharray'] : "";

// No url yet, a bad a/v answer, or a url not ending in .htm, .html, or .php: show the form and stop.
if (!isset($f) || ($whicharray<>"a" && $whicharray<>"A" && $whicharray<>"v" && $whicharray<>"V") || (substr($f,-4)<>".htm" && substr($f,-5)<>".html" && substr($f,-4)<>".php")){
echo '<div id="pw" style="position:absolute;top:150px;left:50px;width:950px;text-align:center"><table style="background-color:#8aa;border-color:#00f" border="6" cellspacing=0 cellpadding=6><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="Grab-Web-Page-Links-and-Video-links-and-Audio-Links-from-Web-Page.php"><b>web page URL</b><BR><label for="URL">URL: <input type="text" name="pageurl" size="66" maxlength="99" value=""></label><br><br><b>Undetermined extensions go in:</b><BR><label for="Audio or Video">Audio or Video (type a or v): <input type="text" name="whicharray" size="5" maxlength="1" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';
exit;}

$f = strip_tags($f); // just in case
$f = trim($f); // trim first so outer spaces are removed rather than turned into %20
if (substr($f,-1)=="/"){$f=substr($f,0,-1);}
$f = str_replace(" ", "%20", $f);
function isitaudio(){
global $url; global $audio; global $na;
if (substr($url,-4)==".aac" || substr($url,-4)==".aif" || substr($url,-5)==".aifc" || substr($url,-5)==".aiff" || substr($url,-3)==".au" || substr($url,-5)==".funk" || substr($url,-4)==".gsd" || substr($url,-4)==".gsm" || substr($url,-3)==".it" || substr($url,-4)==".jam" || substr($url,-3)==".la" || substr($url,-4)==".lam" || substr($url,-4)==".lma" || substr($url,-4)==".m2a" || substr($url,-4)==".m3u" || substr($url,-4)==".mid" || substr($url,-5)==".midi" || substr($url,-4)==".mod" || substr($url,-4)==".mp2" || substr($url,-4)==".mp3" || substr($url,-4)==".mpa" || substr($url,-4)==".m1a" || substr($url,-5)==".mpga" || substr($url,-3)==".my" || substr($url,-4)==".oga" || substr($url,-4)==".ogg" || substr($url,-4)==".ogx" || substr($url,-6)==".pfunk" || substr($url,-3)==".ra" || substr($url,-4)==".ram" || substr($url,-3)==".rm" || substr($url,-4)==".rmi" || substr($url,-4)==".rmm" || substr($url,-4)==".rmp" || substr($url,-4)==".rnx" || substr($url,-4)==".rpm" || substr($url,-3)==".rv" || substr($url,-4)==".s3m" || substr($url,-4)==".sid" || substr($url,-4)==".snd" || substr($url,-4)==".ssm" || substr($url,-4)==".swf" || substr($url,-4)==".m4a" || substr($url,-4)==".tsi" || substr($url,-4)==".tsp" || substr($url,-4)==".voc" || substr($url,-4)==".vox" || substr($url,-4)==".vqf" || substr($url,-4)==".wav" || substr($url,-4)==".wma" || substr($url,-3)==".xm" || substr($url,0,25)=="http://www.purevolume.com" || substr($url,0,18)=="http://www.mp3.com" || substr($url,0,21)=="http://www.deezer.com" || substr($url,0,14)=="http://mog.com" || substr($url,0,27)=="http://www.jukeboxalive.com" || substr($url,0,25)=="http://www.dopetracks.com" || substr($url,0,28)=="http://www.apple.com/itunes/" || substr($url,0,21)=="http://www.emusic.com" || substr($url,0,16)=="http://bleep.com" || substr($url,0,20)=="http://www.ilike.com"){$audio[$na]=$url;$na++;}}

function isitvideo(){
global $url; global $video; global $nv;
if (substr($url,-4)==".afl" || substr($url,-4)==".asf" || substr($url,-4)==".asx" || substr($url,-4)==".avi" || substr($url,-4)==".dif" || substr($url,-3)==".dl" || substr($url,-3)==".dv" || substr($url,-4)==".fli" || substr($url,-3)==".gl" || substr($url,-4)==".isu" || substr($url,-4)==".m1v" || substr($url,-4)==".m2v" || substr($url,-5)==".mjpg" || substr($url,-4)==".mov" || substr($url,-5)==".moov" || substr($url,-6)==".movie" || substr($url,-4)==".m4v" || substr($url,-4)==".mpe" || substr($url,-5)==".mpeg" || substr($url,-4)==".mpg" || substr($url,-3)==".mv" || substr($url,-4)==".ogv" || substr($url,-4)==".ogx" || substr($url,-3)==".qt" || substr($url,-4)==".qtc" || substr($url,-3)==".rm" || substr($url,-4)==".scm" || substr($url,-4)==".vdo" || substr($url,-4)==".viv" || substr($url,-4)==".flv" || substr($url,-4)==".swf" || substr($url,-4)==".mp4" || substr($url,-5)==".vivo" || substr($url,-4)==".vos" || substr($url,-4)==".wmv" || substr($url,-4)==".xmz" || substr($url,-4)==".xsr" || substr($url,0,22)=="http://www.youtube.com" || substr($url,0,22)=="http://video.yahoo.com" || substr($url,0,23)=="http://www.metacafe.com" || substr($url,0,20)=="http://www.imeem.com" || substr($url,0,22)=="http://embed.break.com" || substr($url,0,19)=="http://www.veoh.com" || substr($url,0,19)=="http://www.hulu.com" || substr($url,0,24)=="http://www.clipshack.com" || substr($url,0,26)=="http://www.dailymotion.com" || substr($url,0,20)=="http://www.vimeo.com" || substr($url,0,23)=="http://www.liveleak.com" || substr($url,0,23)=="http://www.vidilife.com" || substr($url,0,24)=="http://www.livevideo.com" || substr($url,0,22)=="http://www.current.com" || substr($url,0,22)=="http://www.maniatv.com"){$video[$nv]=$url;$nv++;}}

$html = file_get_contents($f); // fetch the page source
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from sloppy HTML
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//object[@data and @type='application/x-shockwave-flash'] | /html/body//a | /html/body//embed[@src or @qtsrc] | /html/body//object/param[@value and (@name='src' or @name='FileName' or @name='movie')] | /html/body//video[@src] | /html/body//audio[@src] | /html/body//bgsound[@src] | /html/body//iframe[@src and @title='YouTube video player'] | /html/body//source[@src] | /html/body//img[@dynsrc]");

// Start with empty result lists and counters; isitaudio() and isitvideo() reach these through global.
$a=array(); $audio=array(); $video=array(); $n=0; $na=0; $nv=0;

for ($i = 0; $i < $hrefs->length; $i++) {
$ok=0; // reset for each node so an earlier page url does not leak into this one
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w && (substr($url,$w-4,4)==".htm" || substr($url,$w-5,5)==".html" || substr($url,$w-4,4)==".php")){$url=substr($url,0,$w);} //third argument of substr() is a length, not an end position
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
if(substr($url,-4)==".htm" || substr($url,-5)==".html" || substr($url,-4)==".php"){$ok=1;}

if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && $ok==1){$a[$n]=$url;$n++;} //dumps offsite, home page or wrong extension links
if(!$url){$url = $href->getAttribute('src');} //audio or video
if(!$url){$url = $href->getAttribute('qtsrc');} //audio or video
if(!$url){$url = $href->getAttribute('value');} //audio or video
if(!$url){$url = $href->getAttribute('data');} //audio or video
if($url){if($whicharray=="a" || $whicharray=="A"){$k=$na; isitaudio(); if($na==$k){isitvideo();}}//audio or video
else {$k=$nv; isitvideo(); if($nv==$k){isitaudio();}}} //audio or video
$url = $href->getAttribute('dynsrc'); if (strlen($url)>4){$video[$nv]=$url;$nv++;} // gets video only
} // end of the loop over matched nodes

//dump duplicate array values and fill holes that are left; array_unique has BUG!
$a=array_keys(array_flip($a));$r = count($a);
$audio=array_keys(array_flip($audio));$nar = count($audio);
$video=array_keys(array_flip($video));$nvr = count($video);

sort($a);echo "Links<BR>";for ($i = 0; $i < $r; $i++) {echo ($i+1)." ".$a[$i]; echo "<BR>";}
echo "<BR>";
sort($audio);echo "Audios<BR>";for ($i = 0; $i < $nar; $i++) {echo ($i+1)." ".$audio[$i]; echo "<BR>";}
echo "<BR>";
sort($video);echo "Videos<BR>";for ($i = 0; $i < $nvr; $i++) {echo ($i+1)." ".$video[$i]; echo "<BR>";}