Stop searching for regular expressions

The last time I needed to write code to validate URLs, I did a quick search for ‘url regex’ before stopping and thinking “WTF am I doing?!” If you’re anything like me, you probably reach for regular expressions first whenever you need to search within or validate strings. There’s no need to parse each character and write a finite state machine by hand; regular expression libraries do it for us. They are incredibly powerful.

But before you practice code reuse, stop for a moment and think: do I really understand the problem? Searching for a regex to solve your problem is the first sign that you have little to no idea. Not only are regular expressions cryptic to decipher and maintain, but almost all of the ones you can find on developers’ blogs, in tech articles or on sites dedicated to collecting “a regex for all occasions” are incomplete or have bugs.

What you really want, what you can consume quickly and what really helps you understand how to write your own regex, is the protocol’s specification in Backus-Naur Form. The ‘Uniform Resource Locator’ specification is RFC 1738, and it contains easily readable BNF for every part, including http, ftp, and even gopher. You’ll never find a specification written in anything resembling a Perl regex, so why look at regular expressions first?
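To make that concrete, here is a rough sketch of how a few RFC 1738 rules can be transcribed almost word for word into regex fragments. It uses only the standard library's std::regex, which is an assumption made to keep the sketch self-contained. The function name is_numeric_hostport and the fragment variables are made up for illustration, and host is narrowed to hostnumber to keep it short.

```cpp
#include <regex>
#include <string>

// RFC 1738 BNF rules being transcribed:
//   hostnumber = digits "." digits "." digits "." digits
//   port       = digits
//   hostport   = host [ ":" port ]     (host narrowed to hostnumber here)
//
// The function name and fragment names are illustrative, not from any library.
bool is_numeric_hostport(const std::string& text)
{
    // One named fragment per BNF rule, in dependency order.
    const std::string digits     = "[0-9]+";
    const std::string hostnumber = digits + "\\." + digits + "\\." + digits + "\\." + digits;
    const std::string port       = digits;
    const std::string hostport   = hostnumber + "(?::" + port + ")?";

    const std::regex pattern(hostport);   // ECMAScript grammar by default
    return std::regex_match(text, pattern);
}
```

Like the BNF it mirrors, this only checks the shape of the string: the grammar says "digits", so something like 999.1.1.1 still passes. Range checks would have to be layered on top.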

For my solution I had to write it in C++, so naturally I chose Boost, which provides a few options including my new favourite, Xpressive. Xpressive is easy to use, and you can almost copy and paste the BNF right into your editor and compile. All it needs is some reordering and a bit of syntactic sugar and, hey presto, you’re done. It’s also as readable as the original BNF, so if you choose to leave something out (like gopher support) someone else can maintain it without feeling the urge to cause you harm.

#include <string>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

bool validate_url(const std::string& url)
{
    // http://<host>:<port>/<path>?<searchpart>

    sregex domainlabel    = alnum | alnum >> *( alnum | "-" ) >> alnum;
    sregex toplabel       = alpha | alpha >> *( alnum | "-" ) >> alnum;
    sregex hostname       = *( domainlabel >> "." ) >> toplabel;
    sregex hostnumber     = +digit >> "." >> +digit >> "." >> +digit >> "." >> +digit;
    sregex host           = hostname | hostnumber;
    sregex port           = +digit;
    sregex hostport       = host >> !( ":" >> port );

    sregex safe           = as_xpr("$") | "-" | "_" | "." | "+";
    sregex extra          = as_xpr("!") | "*" | "'" | "(" | ")" | ",";
    sregex unreserved     = alpha | digit | safe | extra;
    sregex escape         = as_xpr("%") >> xdigit >> xdigit; // "%" hex hex
    sregex uchar          = unreserved | escape;

    sregex hsegment       = *( uchar | ";" | ":" | "@" | "&" | "=" );
    sregex search         = *( uchar | ";" | ":" | "@" | "&" | "=" );
    sregex hpath          = hsegment >> *( "/" >> hsegment );
    sregex httpurl        = "http://" >> hostport >> !( "/" >> hpath >> !( "?" >> search ));

    sregex url_regex      = httpurl;

    return regex_match(url, url_regex);
}

And just so you know, I coded this in a text editor and didn’t try compiling it, so I’m at no risk of breaking the great internet tradition of buggy regular expressions :)

PS. It’s also incomplete, so don’t think you won’t have to code it up yourself even if it does compile.
