Sunday, June 26, 2011

The Basics of Regex

A while ago it was suggested I write something about regex (regular expressions). At the time I didn't want to because (A) I don't feel very qualified to write about them, and (B) there is already a lot of stuff about them. Well, I have been using regex more and more lately, and while neither A nor B has changed, what's the point of a blog if it's not filled with redundant inaccurate information?

Regex is an extremely useful tool for anyone who uses a computer. For one thing, it is often the most practical way to write a program that parses text. Many other programs also use regex for searching, so even if you don't program it is still worth learning regex. As an example, Notepad++ allows regex in its find and replace feature, which is hugely useful. While regex has a reputation for being extremely complicated, the basics are rather easy.

Before I begin I have to recommend this page.  This is where I go every time I'm writing anything more than a simple regex.  In addition to that reference, this is a great regex tester.  Also I'll be basing this on Perl regex, which is what I'm familiar with.  It should be pretty similar to other languages, particularly at the basic level I'll be going over.

Let's begin with a basic example:
if($string =~ m/test/) {print "True"}
In this example $string is our test string; we are testing to see if something is in $string. 'test' is what we are matching for; if $string contains the substring 'test' then the if statement is true. The =~ and m are Perl specific. The m stands for match, which means the expression is true if Perl finds the search string anywhere in $string. It could also be s for substitute (find and replace) or tr for translate (tr swaps individual characters and does not actually use regex). The forward slash / is what Perl typically uses to separate the parts of the regex. As an example if we wanted to replace 'test' with 'TEST' we would use:
$string =~ s/test/TEST/
Note the only changes are the m becoming an s (substitute), and the second half of the regex being added, again using forward slashes to delimit it. To simplify and make this more general, from here on out I'll just show the actual regex, like this:
s/test/TEST/
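
For anyone who wants to run the two statements above, here is a minimal sketch as a complete script. The sample text and the names $found and $count are made up for illustration.

```perl
use strict;
use warnings;

my $string = "this is a test";   # sample text, made up for illustration

# m// only checks for a match; it leaves $string untouched.
my $found = ($string =~ m/test/) ? 1 : 0;

# s/// edits $string in place and returns how many substitutions it made.
my $count = ($string =~ s/test/TEST/);

print "found=$found count=$count string=$string\n";
# prints: found=1 count=1 string=this is a TEST
```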

Now to begin going over what the characters inside a regex do. To begin, alphanumeric characters (a-z, A-Z, and 0-9) represent themselves. That is why in the above example replacing 'test' with 'TEST' didn't require any special characters. If you wish to match a non alphanumeric character, the safe practice is to escape it. This means you put a backslash \ before it. The backslash says that the character that follows has a special meaning. In the case of non alphanumerics this special meaning is just the actual character matched literally. In the case of alphanumeric it will match a variety of different things I'll get into shortly. As an example this will match on a single period:
m/\./
If you wished to match a double quote followed by a period you would do this:
m/\"\./
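
A small sketch of why escaping matters, assuming a made-up sample sentence. The last two checks show what happens when the backslash is left off the period, which has a special meaning of its own.

```perl
use strict;
use warnings;

my $text = 'She said "stop".';   # sample text, made up for illustration

my $has_period       = ($text =~ m/\./)   ? 1 : 0;  # literal period
my $has_quote_period = ($text =~ m/\"\./) ? 1 : 0;  # double quote, then period

# Unescaped, the period is a wildcard, so it matches a string with no period.
my $unescaped = ('abc' =~ m/./)  ? 1 : 0;
my $escaped   = ('abc' =~ m/\./) ? 1 : 0;

print "$has_period $has_quote_period $unescaped $escaped\n";
# prints: 1 1 1 0
```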

Now I'll go over what meaning the non escaped non alphanumerics and escaped alphanumerics have:
^ - this typically matches only at the start of the string.
m/^The/
That will match on the string 'The day is over', but will not match on 'Is the day over?'. In the second case 'the' is not at the start of the string, which is why it didn't match (the lower case t would not match the capital T either). Confusingly, ^ also serves to negate a character class, which we will go over later.

$ - this is the opposite of ^ in that it matches only at the end of the string.
m/bye$/
This will match on 'Good bye', but will not match 'Good bye.'. The key difference is that the second ends with a period, which isn't included in the regex.

These can be combined:
m/^bye$/
This will only match exactly 'bye'; it will not match on 'bye.' or 'good bye'.
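
Here is a quick sketch that tries the three anchored patterns against a few strings. The sample strings and the %result hash are just for illustration.

```perl
use strict;
use warnings;

my %result;   # hypothetical name: sample string => which patterns matched
for my $s ('bye', 'Good bye', 'Good bye.', 'bye bye now') {
    my @hits;
    push @hits, 'start' if $s =~ m/^bye/;    # begins with bye
    push @hits, 'end'   if $s =~ m/bye$/;    # ends with bye
    push @hits, 'whole' if $s =~ m/^bye$/;   # is exactly bye
    $result{$s} = join(',', @hits) || 'none';
    printf "%-14s %s\n", "'$s'", $result{$s};
}
```

Only 'bye' matches all three. 'Good bye' matches just the end anchor, 'bye bye now' just the start anchor, and 'Good bye.' matches none of them.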

Next come wildcards, i.e., things that will match any of a certain type of character. Note they only match a single character. If you want to match more than one there are other characters we will discuss.

. = Matches any character except a newline, the traditional wildcard.
\n = a newline, or what you get when you press enter.
\t = a tab.
\s = Matches white space, which includes normal spaces, tabs, and newlines.
\S = Matches non white space. A trend is the capitalized version of a wildcard matches anything that wouldn't be matched by the lower case version.
\d = Matches digits (0-9).
\D = Matches non digits.
\w = Matches "word" characters, alphanumeric and underscore _.
\W = Matches non word characters.
\x2a = Matches the ASCII character represented by that hex value. The \x represents hex, and the following two characters are the hex value. In this case 2a is 42 in decimal, or an asterisk *. See here for an ASCII chart of hex values. Using this you can easily represent any character you wish. Note that you can usually get away with not using this, which will be more readable. But this has the advantage of letting you be certain the characters will be matched as you intended.

m/\d\d\-\d\d\-\d\d/
This will match two digits, then a hyphen, then two more digits, then another hyphen, and two more digits, as in '11-06-18', likely a date. Note it will only match two digits in each position. However, '2011-06-18' would also match, simply because the regex doesn't care about the 20 at the start. On the other hand, '06-2011-18' would not match.
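
To see exactly which part of each string the date pattern grabs, here is a small sketch. The %matched hash is a made-up name; $& is Perl's built-in variable holding the text of the last successful match.

```perl
use strict;
use warnings;

my %matched;   # hypothetical name: sample string => the text that matched
for my $s ('11-06-18', '2011-06-18', '06-2011-18') {
    if ($s =~ m/\d\d\-\d\d\-\d\d/) {
        $matched{$s} = $&;   # $& holds the text of the last successful match
        print "$s matched '$matched{$s}'\n";
    }
    else {
        print "$s did not match\n";
    }
}
# prints:
# 11-06-18 matched '11-06-18'
# 2011-06-18 matched '11-06-18'
# 06-2011-18 did not match
```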

As noted above, the wildcards are treated as single characters. If you wish to match more than one character (either wildcards or actual characters) then you must follow them with one of the following:
* = match 0 or more times (as in match even if it's not there).
+ = match 1 or more times (it must be there, but can be repeated any number of times)
? = match 0 or 1 times exactly.
{n} = match exactly n times.
{l,m} = match between l and m times. If you wish to just use the lower bound you can leave the other out. {4,} will match 4 or more.

I should give some more details about ?. By itself matching 0 or 1 times isn't that useful. But when you follow any of the quantifiers above with a question mark it will make the expression "non greedy". By default regex will match the largest substring it can; this is "greedy". If I wanted to match anything that started and ended with a space I could use:
m/\s.+\s/
This will match a space, followed by one or more characters, followed by a space. The problem with this is that it would match from the very first space it found, until the very last. In other words, given a paragraph to match from, it would match almost the entire thing. To make it match on individual words you could force it to be non greedy:
m/\s.+?\s/
Now it will match the smallest string possible, as in a single word.
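
The difference is easiest to see side by side. This is a minimal sketch with a made-up sample string; $& is Perl's built-in variable for the text of the last successful match.

```perl
use strict;
use warnings;

my $text = "one two three four";   # sample text, made up for illustration

$text =~ m/\s.+\s/;    # greedy: runs from the first space to the last one
my $greedy = $&;       # $& holds the text of the last successful match

$text =~ m/\s.+?\s/;   # non greedy: stops at the first space it can
my $lazy = $&;

print "greedy: '$greedy'\n";   # prints: greedy: ' two three '
print "lazy:   '$lazy'\n";     # prints: lazy:   ' two '
```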

Another feature of regex is flags that follow the regex. In Perl the format is:
m/test/g
Here g is the flag. There are a few common flags:
i = Match case insensitive
g = Match global. This is useful for substitutions. Ordinarily, Perl would stop after the first substitution; g tells it to substitute as many times as it can.
m = Treat as multiple lines. This affects how ^ and $ work. Normally ^ would fail to match if it followed a newline. With the m flag each line gets treated as if it were a separate string.
s = Treat as single line. This affects how . matches. Normally it doesn't match on newline itself. This flag will cause it to match newlines as well. The flags s and m can be used at the same time. It will cause . to match newlines while also causing $ and ^ to treat newlines as beginnings and endings.
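
Here is a rough sketch that flips each flag on and off against some made-up strings. The %r hash and its labels are only for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # sample two-line string

my %r;   # hypothetical name: label => 1 if the pattern matched, else 0
$r{i_off} = ("TEST" =~ m/test/)  ? 1 : 0;
$r{i_on}  = ("TEST" =~ m/test/i) ? 1 : 0;

(my $once = "a-b-c") =~ s/-/+/;    # without g: first hyphen only
(my $all  = "a-b-c") =~ s/-/+/g;   # with g: every hyphen

$r{m_off} = ($text =~ m/^second/)  ? 1 : 0;   # ^ only at start of string
$r{m_on}  = ($text =~ m/^second/m) ? 1 : 0;   # ^ also after each newline
$r{s_off} = ($text =~ m/first.+second/)  ? 1 : 0;   # . stops at the newline
$r{s_on}  = ($text =~ m/first.+second/s) ? 1 : 0;   # . crosses the newline

print "$once $all\n";   # prints: a+b-c a+b+c
print join(' ', map { "$_=$r{$_}" } sort keys %r), "\n";
# prints: i_off=0 i_on=1 m_off=0 m_on=1 s_off=0 s_on=1
```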

Now we can start writing some decently powerful regex.
m/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
This will match an IP address, which follows the pattern of 4 groups of 1 to 3 digits separated by periods. It will match on '74.125.224.72', for example. Note that in valid IP addresses each number can't be greater than 255, which this regex won't care about.
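
One way to cover the 255 limit is to let the regex check the shape and then check the numbers in ordinary code. This is only a sketch: looks_like_ipv4 is a made-up helper, and it anchors the pattern with ^ and $ so the whole string has to be the address.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern above, anchored, plus a numeric range check.
sub looks_like_ipv4 {
    my ($s) = @_;
    return 0 unless $s =~ m/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
    for my $octet (split /\./, $s) {
        return 0 if $octet > 255;   # the regex alone cannot check this
    }
    return 1;
}

for my $s ('74.125.224.72', '999.1.1.1', '1.2.3') {
    print "$s: ", (looks_like_ipv4($s) ? "ok" : "rejected"), "\n";
}
# prints:
# 74.125.224.72: ok
# 999.1.1.1: rejected
# 1.2.3: rejected
```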

Regex has two more interesting features (well it has a lot more than two, but two more that I'll be going over): classes and groups. Classes use [] while groups use (). Groups give a way of matching any of several different choices. (a|A) will match either lower case a or upper case A. The pipe | is used as a delimiter. The choices can be full words or anything as complex as you'd like. (apple|peach|banana) will match any of those three fruits. (\d{2}|\d{4}) will match either 2 or 4 digits. Note that unless you anchor it with ^ and $, a run of 3 digits still matches because its first 2 digits do. In addition to matching any of a choice, groups provide a way to access the string that matched later on. Since this is more complex and somewhat Perl specific (although very useful) I won't explain it. See this site for an example.
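
Here is a short sketch of groups used for choices. The sample strings and the %r hash are made up. It also shows that the digit group on its own finds a match inside three digits, and that wrapping it in ^ and $ is what limits it to exactly 2 or 4.

```perl
use strict;
use warnings;

my %r;   # hypothetical name: label => 1 if the pattern matched, else 0
$r{peach} = ('I ate a peach' =~ m/(apple|peach|banana)/) ? 1 : 0;
$r{pear}  = ('I ate a pear'  =~ m/(apple|peach|banana)/) ? 1 : 0;

# Unanchored, three digits still contain a two-digit match.
$r{loose_123} = ('123' =~ m/(\d{2}|\d{4})/) ? 1 : 0;

# With ^ and $ the whole string has to be 2 or 4 digits.
for my $s ('12', '123', '1234') {
    $r{"exact_$s"} = ($s =~ m/^(\d{2}|\d{4})$/) ? 1 : 0;
}

print join(' ', map { "$_=$r{$_}" } sort keys %r), "\n";
# prints: exact_12=1 exact_123=0 exact_1234=1 loose_123=1 peach=1 pear=0
```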

Classes allow you to match anything from a range. This allows you to create your own wildcards. Probably the most common usage (at least for me) is [a-zA-Z], which will match any single letter, but not digits. Note the lack of a space between z and A. If you include a space then it will also match on spaces.

If you remember, way up above I mentioned ^ would also negate a character class. This is where that comes back. Putting a ^ as the first character inside a class will cause it to match anything the class would not otherwise match. In other words [^a-zA-Z] will match on everything except letters. This only applies to classes. Inside a group, ^ keeps its usual meaning of start of string.
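
A small sketch of a class and its negation, using substitution with the g flag to delete everything each one matches. The sample text and variable names are made up.

```perl
use strict;
use warnings;

my $text = 'abc 123';   # sample text, made up for illustration

(my $letters = $text) =~ s/[^a-zA-Z]//g;   # delete every non letter
(my $others  = $text) =~ s/[a-zA-Z]//g;    # delete every letter

# A space inside the brackets becomes part of the class.
my $with_space    = (' ' =~ m/[a-z A-Z]/) ? 1 : 0;
my $without_space = (' ' =~ m/[a-zA-Z]/)  ? 1 : 0;

print "'$letters' '$others' $with_space $without_space\n";
# prints: 'abc' ' 123' 1 0
```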

As an example of this how would we match all three letter words?
m/\s[a-zA-Z]{3}\s/
This might be a first thought, but it has the flaw of only matching letters that are surrounded by spaces. Words can also be immediately followed by a period if they end a sentence (as well as a wide variety of other things). Instead of looking for letters surrounded by spaces we should instead look for letters surrounded by any non letter.
m/[^a-zA-Z][a-zA-Z]{3}[^a-zA-Z]/
This is better, but still has some flaws. It will match on 'www.google.com' when that appears in the middle of a sentence. The 'www' (space before, period after) and the 'com' (period before, space after) will each match. One solution would be to require a space at some point (after any number of non letters). Combining the two previous examples we get:
m/\s[^a-zA-Z]*[a-zA-Z]{3}[^a-zA-Z]*\s/
This reads as, match a space, followed by any number of non letters (0 or more), followed by exactly 3 letters, followed by any number of non letters, followed by a space.
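
To compare the three attempts, here is a sketch that runs each one against a few made-up strings (each padded with a space on both sides, as if it sat in the middle of a sentence). The %r hash stores three digits per string: one for each pattern, in the order they were introduced.

```perl
use strict;
use warnings;

my %r;   # hypothetical name: sample string => results for the three patterns
for my $s (' cat ', ' cat. ', ' www.google.com ') {
    my $spaces   = ($s =~ m/\s[a-zA-Z]{3}\s/) ? 1 : 0;
    my $nonalpha = ($s =~ m/[^a-zA-Z][a-zA-Z]{3}[^a-zA-Z]/) ? 1 : 0;
    my $combined = ($s =~ m/\s[^a-zA-Z]*[a-zA-Z]{3}[^a-zA-Z]*\s/) ? 1 : 0;
    $r{$s} = "$spaces$nonalpha$combined";
    print "'$s' => $r{$s}\n";
}
# prints:
# ' cat ' => 111
# ' cat. ' => 011
# ' www.google.com ' => 010
```

The first pattern misses 'cat.' because of the period, the second wrongly accepts the URL, and the combined one handles both.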
