Count and Alphabetize Words on a Web Page

This script produces an alphabetically sorted list of the words found on a web page, along with how often each word occurs. It only accepts page URLs that end in .html, .htm, or .php.

The script uses the PHP functions file_get_contents(), str_word_count(), preg_replace(), and array_multisort(), as well as nested while loops, which we had never tried before but which worked flawlessly.

We start by calling error_reporting(E_ERROR) so that only fatal run-time errors (errors that cannot be recovered from) are reported. When one occurs, execution of the script is halted and a message appears. Warnings and notices are not shown.
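
As a quick check of what that setting does, here is a minimal sketch. The variable names are ours and are not part of the script.

```php
<?php
// Report only fatal run-time errors; error_reporting() returns the old level.
$previous = error_reporting(E_ERROR);

// Called with no argument, error_reporting() returns the current level.
$level = error_reporting();

// Test individual bits of the level to see which error classes are reported.
$fatalsReported   = (bool) ($level & E_ERROR);
$warningsReported = (bool) ($level & E_WARNING);
$noticesReported  = (bool) ($level & E_NOTICE);
```

With this level, only $fatalsReported is true, which is why a mistyped or unreachable URL produces no warning on the page.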

Next there is an HTML form that gets submitted to this same page to send the page URL to the PHP script. The user is instructed that the page URL "must end with html, htm, or php." The input is checked to see that it ends in .htm, .html, or .php, and if it does not, the script does nothing. It assumes you can read and are smart enough to enter a URL that ends in html, htm, or php. Feel free to add a chastising alert message.
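
If you do want that message, one way to go about it is sketched below. The helper hasAllowedExtension() and the message text are hypothetical and are not in the script, which tests the three endings with substr() instead.

```php
<?php
// Hypothetical helper: true when the URL ends in .htm, .html, or .php.
// Like the script's substr() checks, the comparison is case-sensitive.
function hasAllowedExtension(string $url): bool
{
    // \z anchors the match at the very end of the string.
    return preg_match('/\.(html?|php)\z/', $url) === 1;
}

// Example URL, made up for illustration.
$url = 'http://example.com/notes.pdf';

if (hasAllowedExtension($url)) {
    $message = '';
} else {
    $message = 'The page URL must end with html, htm, or php.';
}
```

Echoing $message in the else branch would give the user some feedback instead of a blank page.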

Next the strip_tags() function is run on the URL (just in case). Leading and trailing spaces in the $f variable get trimmed away, and spaces inside the URL are replaced by %20 so the PHP functions that use it do not get errors.
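
The cleanup can be sketched as one small function. The name cleanPageUrl() is ours, and we assume trimming should happen before the %20 replacement so that spaces at the edges are dropped rather than encoded.

```php
<?php
// Hypothetical wrapper around the three cleanup steps described above.
function cleanPageUrl(string $raw): string
{
    $url = strip_tags($raw);               // drop any markup typed into the field
    $url = trim($url);                     // remove leading and trailing whitespace
    return str_replace(' ', '%20', $url);  // encode the spaces that remain inside
}

// Example input, made up for illustration.
$cleaned = cleanPageUrl('  http://example.com/my page.html ');
```

Here $cleaned ends up as http://example.com/my%20page.html.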

Then $content = file_get_contents($f) goes to the Internet and loads the contents of the page into the $content variable. Next, the $pp array is loaded with a list of regular expressions that filter out much of the unneeded clutter on the web page when used in preg_replace(). The line $content = preg_replace($pp,'',$content) applies all of these replacements to the $content string, one after another.

Note that any of these array elements, such as /<style[^>]*?>.*?<\/style>/si, will work with or without the > that sits after the *? and before the .*?, but the two versions treat sloppy markup differently. With the > in the regular expression at that point, it won't dump a block (tags plus what's between) whose opening tag has accidentally lost its >. Leave out that > and it will dump the block anyway, because <style[^>]*?.*? means "match a style tag followed by any number of characters that aren't >, followed by any number of characters of any type, which may include >." Suit yourself on whether you prefer to be more or less forgiving. Forgotten tag closers (>) do occur, just not very often, and validating your pages at http://validator.w3.org/ will catch them.
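
To see the difference between the two variants, here is a small sketch. The sample markup and variable names are made up for illustration.

```php
<?php
// The pattern as used in the script, and the variant without the > .
$strict = '/<style[^>]*?>.*?<\/style>/si';
$loose  = '/<style[^>]*?.*?<\/style>/si';

// One well-formed block, and one whose opening tag has lost its > .
$wellFormed = '<style type="text/css">p{color:red}</style><p>Hello</p>';
$broken     = '<style type="text/css" p{color:red}</style><p>Hello</p>';

$strictOnGood   = preg_replace($strict, '', $wellFormed);
$looseOnGood    = preg_replace($loose, '', $wellFormed);
$strictOnBroken = preg_replace($strict, '', $broken);
$looseOnBroken  = preg_replace($loose, '', $broken);
```

Both patterns reduce the well-formed sample to <p>Hello</p>. On the broken sample only the loose pattern removes the style block; the strict one leaves the string untouched.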

Note that we dump words found in HTML comments, and replace &nbsp; with spaces. Then, $content = preg_replace("/<[A-Za-z]+[^>]*?>/i", "\\0 ", $content); puts a space after every opening tag like <P> or <SPAN>, including tags with attributes. It does NOT remove the tag or the content between opening and closing tags: the \\0 in the replacement stands for the whole match, so each tag is rewritten as itself plus a space. The + means one or more alphabetic characters follow the <, and the [^>]*?> means zero or more characters that aren't the > character can come before the > character. That covers attributes, such as id="myid". Without this allowance for attributes, no match would be made on <P id="myid">, and the text right after such a tag would not get its separating space.

It's a good thing all this happens BEFORE the strip_tags() function runs. strip_tags() removes every remaining tag on the page but keeps the text between tags, so once it has run there are no style or script tags left for the $pp filters to find, and the CSS and JavaScript would be counted as words. It would also butt words together wherever only a tag separated them: <td>one</td><td>two</td> would become onetwo. The previously discussed opening-tag pattern $content = preg_replace("/<[A-Za-z]+[^>]*?>/i", "\\0 ", $content); and the end-tag pattern $content = preg_replace("/<\/[A-Za-z]+>/", "\\0 ", $content); prevent that by adding a space after each tag while leaving alone everything between tags such as DIVs, SPANs, and Ps. This keeps the words on the page separate for the script to find and sort.
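
The effect of the two spacing patterns can be checked on a tiny piece of markup. The sample string and variable names below are made up for illustration; the patterns are the ones from the script.

```php
<?php
// Two table cells with nothing but tags between their contents.
$html = '<table><tr><td>one</td><td>two</td></tr></table>';

// Without the spacing step, strip_tags() fuses the cell values together.
$fused = strip_tags($html);

// With the spacing step, each tag is rewritten as itself plus a space.
$spaced = preg_replace("/<[A-Za-z]+[^>]*?>/i", "\\0 ", $html);
$spaced = preg_replace("/<\/[A-Za-z]+>/", "\\0 ", $spaced);

// Now strip_tags() leaves the words separated, ready for str_word_count().
$words = str_word_count(strip_tags($spaced), 1);
```

Here $fused is the single string onetwo, while $words holds the two separate words one and two.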

$n=str_word_count($content); and $a=str_word_count($content, 1); look almost alike but do different jobs. The first simply counts the words in the $content string, which by now holds the filtered text of the page loaded by file_get_contents(). The second returns all of those words as the $a array, just because of the optional 1 in the parameters. The PHP function str_word_count() is powerful PHP.
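
A short sketch of the two calls side by side, using a made-up sentence in place of the page text:

```php
<?php
// Sample text standing in for the filtered page contents.
$content = 'The cat saw the other cat';

// No second argument: returns the number of words.
$n = str_word_count($content);

// Second argument 1: returns the words themselves as an array.
$a = str_word_count($content, 1);
```

$n is 6, and $a holds the six words in the order they appear, so count($a) equals $n.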

Check out the nested while loops. The $a array has been sorted with sort($a), so identical words sit next to each other. While $i (the index into $a) is less than $n (the total number of words in the now-sorted $a array), we increment $o (the index for the now-being-filled $b and $c arrays). The nested while loop merely counts how many of the entries that follow are identical to $a[$i], incrementing the frequency variable $d for each one. Then $b[$o] gets the word and $c[$o] gets its frequency. We add this frequency to the $i variable, which jumps past the duplicates, and set $d back to 1 before looping the outer while loop again. The result is a $b array with only unique words and a $c array with the frequency of each of these unique words.
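
Here is the loop pair run on a small made-up word list. The bounds check $i + $d < $n in the inner loop is our addition, so that the comparison never reads past the end of $a; the variable names are the script's own.

```php
<?php
// Sample words standing in for the output of str_word_count($content, 1).
$a = ['the', 'cat', 'saw', 'the', 'other', 'cat'];
sort($a);                 // identical words are now adjacent
$n = count($a);

$b = [];                  // unique words
$c = [];                  // frequency of each unique word
$o = -1; $d = 1; $i = 0;

while ($i < $n) {
    $o++;
    // Count the run of entries equal to $a[$i]; stop at the end of the array.
    while ($i + $d < $n && $a[$i] === $a[$i + $d]) {
        $d++;
    }
    $b[$o] = $a[$i];
    $c[$o] = $d;
    $i += $d;             // jump past the duplicates
    $d = 1;
}
$o++;                     // last index plus one is the unique word count
```

This leaves $b as cat, other, saw, the and $c as 2, 1, 1, 2, with $o equal to 4. PHP's built-in array_count_values($a) produces the same tallies as one word => count array, if you would rather not write the loops by hand.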

Now we increment $o: since arrays start at zero, $o holds the last index, and adding one turns it into the unique word count. Finally it is time to display tables of the results. The first shows the words and frequencies sorted alphabetically, and the second shows the words sorted by frequency. Before the second table is ready for prime time, we use the PHP function array_multisort() to sort the frequencies numerically from highest to lowest, which is what array_multisort($c,SORT_DESC,SORT_NUMERIC,$b); says. Note that the $c array contains the frequencies and it's the one getting sorted on, with the word array, $b, being reordered in parallel to however $c gets sorted.
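
The frequency sort can be tried on its own with a small made-up pair of arrays:

```php
<?php
// Unique words and their frequencies, in alphabetical order.
$b = ['cat', 'other', 'saw', 'the'];
$c = [2, 1, 1, 2];

// Sort $c numerically from highest to lowest and reorder $b to match.
array_multisort($c, SORT_DESC, SORT_NUMERIC, $b);
```

Afterwards $c is 2, 2, 1, 1 and $b is cat, the, other, saw. Where frequencies tie, array_multisort() falls back on the next array, $b, in ascending order, so words with equal counts stay alphabetical.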

Whether you're evaluating web pages during SEO (search engine optimization) or just trying to improve your writing, this script can come in quite handy. Unlike the rest of the scripts in this group, it does not need any XPath, but it does belong with the other scripts that search, index, and evaluate HTML and XML page content.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>Count and Alphabetize Words on a Web Page</TITLE>
<meta name="description" content="Count and Alphabetize Words on a Web Page">
<meta name="keywords" content="Count and Alphabetize Words on a Web Page,php,CMS,javascript, dhtml, DHTML">
<style type="text/css">
BODY {margin-left:0; margin-right:0; margin-top:0;text-align:left;background-color:#ddd}
p, li {font:13px Verdana; color:black;text-align:left}
h1 {font:bold 28px Verdana; color:black;text-align:center}
h2 {font:bold 24px Verdana;text-align:center}
td {font:normal 13px Verdana;text-align:left;background-color:#ccc}
.topic {text-align:left;background-color:#fff}
.center {text-align:center;}
.textbox1 {position:absolute;top:90px;left:100px;width:400px;word-wrap:break-word;white-space:nowrap;overflow:hidden;text-overflow: ellipsis;}
.textbox2 {position:absolute;top:90px;left:550px;width:400px;word-wrap:break-word;white-space:nowrap;overflow:hidden;text-overflow: ellipsis;}
.info {position:absolute;top:0px;left:2px;width:160px;background-color:#bbb;border:1px solid blue;padding:5px}
.ts {background-color:#8aa;border:6px solid blue;padding:6px}
.pw {position:absolute;top:150px;left:100px;width:800px;text-align:center}
</style>
</head>
<body>
<center><h1>Count and Alphabetize Words on a Web Page</h1></center>
<?php
error_reporting(E_ERROR);

$f=isset($_POST['pageurl']) ? $_POST['pageurl'] : null; // no undefined-index notice on first load
if (!isset($f)){
echo '<div class="pw"><table class="ts"><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="count-and-alphabetize-words-on-a-web-page.php"><b>Page URL (must end with html, htm, or php)</b><BR><label for="URL">URL: <input type="text" id="URL" name="pageurl" size="66" maxlength="99" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';

}else{

if (substr($f,-4)==".htm" || substr($f,-5)==".html" || substr($f,-4)==".php"){

$f=strip_tags($f);
$f=trim($f); // trim first so spaces at the edges are not turned into %20
$f = str_replace(" ", "%20", $f);

$content = file_get_contents($f);

$pp=array('/<head[^>]*?>.*?<\/head>/si',
'/<style[^>]*?>.*?<\/style>/si',
'/<script[^>]*?>.*?<\/script>/si',
'/<object[^>]*?>.*?<\/object>/si',
'/<embed[^>]*?>.*?<\/embed>/si',
'/<applet[^>]*?>.*?<\/applet>/si',
'/<noframes[^>]*?>.*?<\/noframes>/si',
'/<frameset[^>]*?>.*?<\/frameset>/si',
'/<noscript[^>]*?>.*?<\/noscript>/si',
'/<noembed[^>]*?>.*?<\/noembed>/si',
'/<form[^>]*?>.*?<\/form>/si',
'/<link[^>]*?>/si',
'/<!--.*?-->/si');
$content = preg_replace($pp,'',$content);
$content = preg_replace('/&nbsp;/si',' ',$content);
$content = preg_replace("/<[A-Za-z]+[^>]*?>/i", "\\0 ", $content);
$content = preg_replace("/<\/[A-Za-z]+>/", "\\0 ", $content);
$content=strip_tags($content);
$content=preg_replace('/\r\n/', ' ', trim($content));

$n=str_word_count($content);
$a=str_word_count($content, 1);
$b=array();$c=array();$o=-1;$d=1;$i=0;
sort($a);

while($i<$n){
$o++;
while($i+$d<$n && $a[$i]==$a[$i+$d]){$d++;} // stop at the end of $a
$b[$o]=$a[$i];$c[$o]=$d;
$i=$i+$d;$d=1;
}

$o++;
echo "<center><B>".$f."      Word Count: ".$n."      Unique Word Count: ".$o."</B></center>";
echo "<table border='1' class='textbox1'><tr><th colspan='2'>Sorted Alphabetically</th></tr>";
for($i=0;$i<$o;$i++){
echo "<tr><td>".$b[$i]."</td><td>".$c[$i]."</td></tr>";}
echo "</table>";
echo "<table border='1' class='textbox2'><tr><th colspan='2'>Sorted By Word Frequency</th></tr>";
array_multisort($c,SORT_DESC,SORT_NUMERIC,$b);
for($i=0;$i<$o;$i++){
echo "<tr><td>".$b[$i]."</td><td>".$c[$i]."</td></tr>";}
echo "</table><BR>";

}}

?>

</body>
</html>
