Regular Expression URL Validator
The source code for the Regular Expression URL Validator is at the bottom of the page. A discussion of what we did precedes the code, and just before the code you'll find a link giving you the opportunity to try out the validator. If you use it somewhere on the Net, please give us a dofollow backlink with link text: "Regular Expression URL Validator". Thank you.

Below, you'll see a grueling series of test URLs we subjected our Regular Expression URL Validator to, followed by the results. First we look at URLs that validated, with our comments about the results. Then we look at URLs that did not validate, with our comments about why these were the correct results.

With new top domains appearing all the time, we couldn't be expected to use (com|info|biz| etc.) as a hugely long, ever-expanding, parenthesis-enclosed set of literals separated by the alternation operator |. Country codes change as well, so we check only for the required 2-letter match. This means plenty of nonexistent URLs will validate, but the same is true of email validators. We check the format, making sure the input seems legitimate; an HTML/JavaScript page cannot practically try out the emails or the URLs to make sure they exist. Please send us comments about how you like our validator.

Be forewarned that we used many bogus names just to keep you on your toes, so subdomain, mysite, and www12345, for example, may end up being read as site name, top domain, and subdomain, depending on where they sit in the URL format.
THESE URLS WILL VALIDATE:
http://mysite.com without server name (www) is fine
http://www.mysite.com with server name (www) is fine too
https://www.mysite.com https is Hypertext Transfer Protocol Secure (HTTP over SSL/TLS) for web security
https://www.mysite.com.co final co is country code
https://www.mysite.com.com 1st .com is seen as site name
https://www12345.com.com.co 1st .com is seen as site name and www12345 is seen as subdomain
https://www12345com.com.com.co 1st .com is seen as site name and www12345com is seen as subdomain
https://www.mysite.com.co?.1233459588%&../123/..??? special characters are allowed as long as they're numbers or -_%&?/.=
http://subdomain.mysite.cggggm.gd cggggm is not a real top domain, but it has OK number of characters, as does the country code
http://www.1234.subdomain.mysite.co 1234 would be seen as subdomain here, subdomain would be seen as the site name, mysite would be seen as the top domain (subdomain.mysite = domain)
http://www.subdomain.mysite.cggggm.aa aa is not currently a country code but 2 is correct number of characters
http://www1234.subdomain.mysite.cggggm.gd www1234 is OK server name (although in the real world we know of none higher than www504)
http://subdomain.mysite.cggggm.gd?1234./.?-_%&.?/.=4567 special characters are allowed as long as they're numbers or -_%&?/.=
http://www212.subdomain.mysite-is-very-cool.museum.uk?1234./.?-_%&.?/.=4567 www212 is a real server name, museum is a real top domain name, and uk is a real country code.
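The results above can be reproduced outside the browser. The following is a minimal sketch, assuming a plain JavaScript runtime such as Node.js; the names ckUrl, shouldValidate, and rejected are ours for illustration, and only the expression itself comes from the page's check() function.

```javascript
// The expression exactly as it appears in the page's check() function.
const ckUrl = /^https?\:\/\/(www\d?\d?\d?\d?\.)?([A-Za-z0-9-_]+\.)?[A-Za-z0-9-_]+((\.[A-Za-z]{2,6})(\.[A-Za-z]{2})?([0-9-_%&\?\/\.=]*))$/;

// A sample of the URLs listed above as validating.
const shouldValidate = [
  "http://mysite.com",
  "https://www.mysite.com.co",
  "https://www12345.com.com.co",
  "https://www.mysite.com.co?.1233459588%&../123/..???",
  "http://www.1234.subdomain.mysite.co",
  "http://www212.subdomain.mysite-is-very-cool.museum.uk?1234./.?-_%&.?/.=4567",
];

// Collect any URL the expression rejects; an empty list means all passed.
const rejected = shouldValidate.filter((url) => !ckUrl.test(url));
console.log(rejected.length === 0 ? "all validated" : rejected);
```

Because the expression has no g flag, calling test() repeatedly on the same object is safe here.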
THESE URLS WILL NOT VALIDATE BECAUSE:
http:/mysite.com missing forward slash after :
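http://ww.mysite.com missing w was meant to fail, but ww fits the subdomain pattern, so the expression as written actually validates this URL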
htts://www.mysite.com missing p in https
https://www.mysite.com.com.com final com is country code which needs to be 2 characters
https://www12345.mywebsite.com.com.co www12345 is ok as name of subdomain (but www1234 is ok for server name) but then one .com at end needs removing
https://www.mysite.com.co?.dhr345d5df%&../jjk/..??? letters not allowed with special characters at end of url
http://subdomain.mysite.cggggm.gdn final gdn is country code which needs to be 2 characters
http://www.subdomain.mysite.cgggggm.gd cgggggm is top domain which must be 2-6 characters, not 7
http://www12345.subdomain.mywebsite.cggggm.gd www12345 is ok as name of subdomain (but www1234 is ok for server name) but then .cggggm at end needs removing (leave .gd) and mywebsite needs to be shortened to mysite, since it's a top domain that cannot be over 6 characters; or simply remove .mywebsite
http://www1234.subdomain.mysite.cggggm.info final info is country code which needs to be 2 characters
http://www.1234.subdomain.mywebsite.co mywebsite is top domain which must be 2-6 characters, not 9
http://subdomain.mysite.cggggm.gd?1234./.?-_%&.?/.=4567, no commas allowed in url
http://www212.subdomain.mysite-is-cool!.museum.uk?1234./.?-_%&.?/.=4567 no ! allowed in url
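The rejections can be checked the same way. This is a minimal sketch, assuming a plain JavaScript runtime such as Node.js; the names shouldFail, accepted, and wwAccepted are ours for illustration. One listed URL, http://ww.mysite.com, is tested separately: ww fits the subdomain character class, so the expression as written accepts it.

```javascript
// The expression exactly as it appears in the page's check() function.
const ckUrl = /^https?\:\/\/(www\d?\d?\d?\d?\.)?([A-Za-z0-9-_]+\.)?[A-Za-z0-9-_]+((\.[A-Za-z]{2,6})(\.[A-Za-z]{2})?([0-9-_%&\?\/\.=]*))$/;

// A sample of the URLs listed above as not validating.
const shouldFail = [
  "http:/mysite.com",
  "htts://www.mysite.com",
  "https://www.mysite.com.com.com",
  "https://www.mysite.com.co?.dhr345d5df%&../jjk/..???",
  "http://subdomain.mysite.cggggm.gdn",
  "http://www.subdomain.mysite.cgggggm.gd",
  "http://www.1234.subdomain.mywebsite.co",
  "http://subdomain.mysite.cggggm.gd?1234./.?-_%&.?/.=4567,",
  "http://www212.subdomain.mysite-is-cool!.museum.uk?1234./.?-_%&.?/.=4567",
];

// Collect any URL the expression accepts; an empty list means all failed.
const accepted = shouldFail.filter((url) => ckUrl.test(url));
console.log(accepted.length === 0 ? "all rejected" : accepted);

// "ww." is read as a subdomain, "mysite" as the site name, ".com" as top domain.
const wwAccepted = ckUrl.test("http://ww.mysite.com");
console.log(wwAccepted); // true
```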
Although "mysite" looks like a site name in the URLs below, they all validate because position decides the reading: com.com, museum.com, and museum.museum are taken as the domains, and mysite falls in the subdomain position. https://www1234.subdomain.domain.com.co gives a truer picture and is formatted like the other URLs, with www1234 as the server (we know of no server number higher than www504, but we allow 4 digits here to be safe and forward compatible). The domain is domain.com, not just domain, with .com as the top domain part.
https://www1234.mysite.com.com
https://www1234.mysite.com.com.co
https://www1234.mysite.museum.com.co
https://www1234.mysite.museum.museum
https://www1234.mysite.museum.museum.co
https://www1234.subdomain.domain.com.co
In the above address, https is the protocol, www1234 is the server, subdomain is the subdomain, domain.com is the domain with .com being the top level domain, and co is the country code.
[protocol][server][subdomain, like in http://images.google.com][domain, ends with top-level domain like .com][2-letter country code][/directory/folder/subfolder/][implied default file name: "index" or "home" or "default", can end in .html, .htm, .sht, .shtml, .asp, .cfm, .php, etc.]
There's no way to know that com or biz or info or museum are top level domains instead of site names except by their position in the URL formatting. Positioning is what our validation script uses to figure out which they are.
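To see the positional reading in action, here is a minimal sketch that labels each slot of the URL. It is not the page's code: the named groups (protocol, server, subdomain, site, top, country, extra) and the helper readParts are our own additions for illustration, wrapped around the same sub-patterns, and it assumes a JavaScript runtime that supports named capture groups.

```javascript
// Same sub-patterns as the validator, with a hypothetical name on each slot.
const labelled = /^(?<protocol>https?):\/\/(?<server>www\d?\d?\d?\d?\.)?(?<subdomain>[A-Za-z0-9-_]+\.)?(?<site>[A-Za-z0-9-_]+)(?<top>\.[A-Za-z]{2,6})(?<country>\.[A-Za-z]{2})?(?<extra>[0-9-_%&?\/.=]*)$/;

// Returns how each part of the URL was read, or null if it does not validate.
function readParts(url) {
  const m = labelled.exec(url);
  if (m === null) return null;
  // Optional groups that were skipped come back undefined; report them as null.
  const strip = (s) => (s === undefined ? null : s.replace(/\./g, ""));
  const g = m.groups;
  return {
    protocol: g.protocol,
    server: strip(g.server),
    subdomain: strip(g.subdomain),
    site: g.site,
    top: strip(g.top),
    country: strip(g.country),
  };
}

// "museum" lands in the site name slot purely because of where it sits.
console.log(readParts("https://www1234.mysite.museum.com.co"));
console.log(readParts("https://www1234.subdomain.domain.com.co"));
```

In the first call, mysite is reported as the subdomain and museum as the site name, which matches the reading described above.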
The code below lets you test various URLs to check what will or will not validate with the check() function. Note that we did not allow the FTP protocol, but limited it to http and https. We also did not allow URLs where letters follow the domain and country code (pages where letters are sent via the address bar), although numbers and these characters are allowed: -_%&?/.=
Letters after the final country code would be indistinguishable from domains and country codes, which would confuse the regular expression until it burst into crocodile tears. In the code, for testing we used: {alert("URL validated OK.");return false}} but when you actually use the code, change this alert code to simply {return true}}, and add an appropriate action in the form tag.
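For real use, the test itself can be separated from the alerts. The following is a minimal sketch under that assumption; isValidUrl is a hypothetical helper name of ours, not part of the page's code, and it simply returns true or false so that a form handler can return its result directly.

```javascript
// The expression exactly as it appears in the page's check() function.
const ckUrl = /^https?\:\/\/(www\d?\d?\d?\d?\.)?([A-Za-z0-9-_]+\.)?[A-Za-z0-9-_]+((\.[A-Za-z]{2,6})(\.[A-Za-z]{2})?([0-9-_%&\?\/\.=]*))$/;

// Returns true when the value passes the format check, false otherwise.
function isValidUrl(value) {
  return ckUrl.test(value);
}

console.log(isValidUrl("http://mysite.com/123/")); // true: digits and / may follow
console.log(isValidUrl("http://mysite.com/page")); // false: letters after the domain
console.log(isValidUrl("ftp://mysite.com"));       // false: only http and https

// In the page, check() could then end with something like:
//   return isValidUrl(document.form.Website.value);
```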
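Now let's look at the regular expression we used:
/^https?\:\/\/(www\d?\d?\d?\d?\.)?([A-Za-z0-9-_]+\.)?[A-Za-z0-9-_]+((\.[A-Za-z]{2,6})(\.[A-Za-z]{2})?([0-9-_%&\?\/\.=]*))$/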
The ^ says to match right after the start of the string (the inputted URL). The https? says to match http or https, so it's the same as (http|https). The \:\/\/ says to make sure they have :// after the protocol (protocol is http, etc.). Note we escaped these characters with backslashes so they are read as literal characters: the forward slash would otherwise end the JavaScript regular expression literal, while the colon has no special meaning here, so escaping it is harmless but not required.

The (www\d?\d?\d?\d?\.)? code matches www or www plus up to 4 digits, and there must be a period if this server designation is used, but the entire server designation may be skipped. The period is escaped since it has a special meaning if not escaped.

All question marks so far have indicated optional matches. The ? in https? means it's okay if the string being matched contains an s character after the http, but it's also okay if it doesn't. Each \d? in the server section means it's okay to either have or not have a single digit at this spot, and since there are four of these, it's okay to have 0 to 4 digits here. The question mark after the entire parenthesis-enclosed server section means we can skip the whole thing and that's okay too.
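The opening part of the expression can be tried on its own. This is a minimal sketch, assuming a plain JavaScript runtime; prefix and serverPart are names of ours for illustration, and prefix is only the first portion of the full expression, not a validator by itself.

```javascript
// Only the protocol and optional server portion of the full expression.
const prefix = /^https?\:\/\/(www\d?\d?\d?\d?\.)?/;

// Returns the server text that was matched, "" if the server group was
// skipped, or null if the protocol or :// is wrong.
function serverPart(url) {
  const m = prefix.exec(url);
  if (m === null) return null;
  return m[1] === undefined ? "" : m[1];
}

console.log(serverPart("http://www.mysite.com"));       // "www."
console.log(serverPart("https://www1234.mysite.com"));  // "www1234."
console.log(serverPart("https://www12345.mysite.com")); // "" (5 digits, group skipped)
console.log(serverPart("http://mysite.com"));           // ""
console.log(serverPart("htts://www.mysite.com"));       // null
```

With five digits the server group cannot reach its period, so it is skipped and www12345 is left for the subdomain section to pick up.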
The ([A-Za-z0-9-_]+\.)? section looks at the optional subdomain in the URL, and the optional is again indicated by a question mark. Note that alphanumeric characters as well as - and _ are allowed in URLs, so we allow them here, but if you're including a subdomain, you must end it with a period. So the \. is obviously indicating just that. But the + is not as straightforward. The []s are a "character class", also called "character set", with which you can tell the regex engine to match only one out of several characters. So if you're including a subdomain, the bracketed area shows the choices, while the + says there must be 1 or more of these characters. Note this does NOT mean you have to have a subdomain in your input string. It means that IF YOU DO have a subdomain, it must contain 1 or more of the bracketed characters (and, of course, the period).
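The character class and the + can be tried in isolation. The following is a minimal sketch; the name subdomainLabel and the added ^ and $ anchors are ours, so that each sample string is tested as a whole subdomain piece.

```javascript
// The subdomain piece on its own: 1 or more allowed characters, then a period.
const subdomainLabel = /^[A-Za-z0-9-_]+\.$/;

console.log(subdomainLabel.test("images."));    // true
console.log(subdomainLabel.test("my-site_1.")); // true: - and _ are in the class
console.log(subdomainLabel.test("."));          // false: + needs at least 1 character
console.log(subdomainLabel.test("images"));     // false: the period is required
console.log(subdomainLabel.test("my site."));   // false: space is not in the class
```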
Next comes [A-Za-z0-9-_]+, the site name section, whose character class is identical to the subdomain section's. Both end with a + sign, meaning at least one of these characters is required. However, the parentheses grouping the subdomain section are followed by a ?, while the site name section has no ?, so the site name is always required, rather than required ONLY if you choose to input a subdomain, above.

The next bunch of code is complex looking, but it's not all that bad. The first parenthesis has its mate just before the $. These grouping operators enclose an area with an open and close parenthesis, and whatever immediately follows the group applies to the whole group. The dollar sign is the opposite of the ^ sign: the $ means to match right before the end of the string. This set of parentheses holds the top domain, then the optional country code, then the optional special characters, which must all be adjacent and, because of the $, at the end.

The next parenthesis set (\.[A-Za-z]{2,6}) forces a period and then a top domain of 2 to 6 letters. The {} brackets mean {min,max}, i.e., the least to the most letter characters that will match the input string's top domain. If they have 0, 1, or more than 6, the match will fail. (The top domains travel and museum both require 6 characters.)
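The effect of {2,6} and of the $ anchor can be seen with the top domain piece alone. This is a minimal sketch; topDomain and unanchored are names of ours, and the ^ anchor is added so each sample is tested from its first character.

```javascript
// The top domain piece on its own, pinned to both ends of the string.
const topDomain = /^\.[A-Za-z]{2,6}$/;

console.log(topDomain.test(".com"));      // true: 3 letters
console.log(topDomain.test(".museum"));   // true: 6 letters, the maximum
console.log(topDomain.test(".c"));        // false: fewer than 2 letters
console.log(topDomain.test(".cgggggm"));  // false: 7 letters

// Without the $, the first 6 letters are enough and the 7th is ignored.
const unanchored = /^\.[A-Za-z]{2,6}/;
console.log(unanchored.test(".cgggggm")); // true
```

The last line shows why the end anchor matters: the length limit only bites when nothing is allowed to be left over.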
The (\.[A-Za-z]{2})? is the optional country code, hence the question mark. Note it requires a period at the start IF it is used, and it must have exactly 2 letters, hence the {2}. Finally, we use ([0-9-_%&\?\/\.=]*) to match any special characters the user happens to stick after the domain and country code. They're optional, hence the *.

The ?, *, and + are repetition operators. The question mark means 0 or 1 repetitions are okay. The * means 0 or more repetitions are okay. The plus means 1 or more repetitions are okay. Since we used the * we know that 0 repetitions of any of the enclosed characters are okay, but so is putting a whole slew of them. Note that ?, / and . are escaped in this character class since we want their literal meaning, not their operator meaning. Inside square brackets ? and . are already literal, so those backslashes are harmless rather than required; escaping the / is a safe habit because / also delimits the JavaScript regular expression literal.
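The three repetition operators can be compared side by side on pieces of the expression. The following is a minimal sketch; countryCode, trailing, and siteName are names of ours, and the ^ and $ anchors are added so each sample string is tested as a whole.

```javascript
// ? : the country code group may appear 0 or 1 times.
const countryCode = /^(\.[A-Za-z]{2})?$/;
console.log(countryCode.test(""));     // true: 0 repetitions
console.log(countryCode.test(".uk"));  // true: 1 repetition
console.log(countryCode.test(".gdn")); // false: 3 letters, {2} needs exactly 2

// * : the trailing characters may appear 0 or more times.
const trailing = /^[0-9-_%&\?\/\.=]*$/;
console.log(trailing.test(""));                       // true: 0 repetitions
console.log(trailing.test("?1234./.?-_%&.?/.=4567")); // true: a whole slew
console.log(trailing.test("?q=abc"));                 // false: letters not in class

// + : the site name needs 1 or more characters.
const siteName = /^[A-Za-z0-9-_]+$/;
console.log(siteName.test(""));       // false: 0 repetitions is not enough
console.log(siteName.test("mysite")); // true
```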
Try the Regular Expression URL Validator.
Good info site for learning about Regular Expressions.
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>Regular Expression URL Validator</TITLE>
<meta name="description" content="Good, Tested, Regular Expression URL Validator">
<meta name="keywords" content="Regular Expression URL Validator,javascript, dhtml, DHTML">
<script language=javascript>
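function check(){
// Anchored pattern: protocol, optional www server, optional subdomain, site name,
// top domain, optional 2-letter country code, optional trailing characters.
var ck_url = /^https?\:\/\/(www\d?\d?\d?\d?\.)?([A-Za-z0-9-_]+\.)?[A-Za-z0-9-_]+((\.[A-Za-z]{2,6})(\.[A-Za-z]{2})?([0-9-_%&\?\/\.=]*))$/;
// search() returns -1 when the pattern does not match the input.
if(document.form.Website.value.search(ck_url)==-1)
{alert("Please only enter 6 to 70 letters, numbers and other allowable characters for URLs");return false}else
// For testing only: in real use, replace this alert with {return true}.
{alert("URL validated OK.");return false}}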
</script>
</HEAD>
<body>
<BR><BR><BR><BR>
<form name='form' action=" " method="POST" onsubmit="return check()">
<INPUT maxLength="70" type="text" name="Website" size="50">
<INPUT TYPE="SUBMIT" value="Submit URL">
<INPUT TYPE="RESET" value="reset">
</form>
</BODY>
</HTML>