The “joy” of Reg-ex part 1.

Article
01/14/2010

One of the good things in PowerShell is support for regular expressions – in fact I suspect some Unix sys-admins might laugh their Windows counterparts for not got to grips regular expressions sooner.
The downside is that regular expressions are an area which give a lot people a serious headache.

So lets start from first principles.

1. Regular expressions are about looking for text that matches a pattern. When matching text is found various follow-up operations can be performed such as replacing it.

2. The patterns defined by regular expressions allow “classes” of characters to be specified , for example “word” characters, digits and spaces

3. The patterns can specify groups of alternatives or repetitions.

Everything else stems from that: simple. The thing with simple ideas is we build complex practices on them.

In PowerShell we can use the –match operator to test an expression: so here are some examples

"cat" -match "at" returns true. “cat” contains “at”, we have a match. That was easy. We can specify alternates
" dogma" -match "cat|dog" returns true because the test pattern translates as “cat or dog” (pipe sign is “or”) and dogma is a match for dog.
"AlleyCat" -match "cat|dog" also returns true because it contains cat

If we want to specify the words rather than “substrings” cat or dog we can use the first of the special characters, \b means a word Boundary,
So "dogma" -match "cat\b|dog\b" returns false but "Alleycat” –match “cat\b|dog\b" returns true. We can specify a boundary before as well as after the text to get the exact word

In Regular expressions the Wildcards that most people have been used to divide into classes of characters – another place where we use special characters - and repetitions. \w is a word (alphanumeric) character \s is a space character, \d is a digit. A dot stands for “ANY character” at all. Changing case reverses the meaning. \W is any non-word, \S is any non-space ..
If we want to specify alternates we can write them as [aeiou] or [a-z] for a range. We can reverse the selection of alternates with the ^ , so [^aeiou], is any non vowel
"oat" -match "\b[a-z]at" returns true, but replacing the letter o with a zero as in "0at" -match "\b[a-z]at" returns false.

"oat" -match "\b[a-z]at" will only return true of there is exactly 1 letter between the start of the word and “at”, so chat won’t match. We can specify repetition: {2} means 2 exactly repetitions, {2,10} means at least between 2 and 10 repetitions, {2,} means at-least two. We have short-hands for these: * means any number including 0, + means any non zero number and ? means zero times or once but no more. Incidentally if we need to match a character which has a special use – the different kinds of brackets, . * ? and \ we “escape” them it prefixing with a \

"[a-z]at " will find a match with cat and hat but not at or chat
"[a-z]+at" will find a match with cat, hat and Chat , but not at ,
"[a-z]?at" will find a match with cat, hat and at , but not chat ,
"[a-z]*at" will find a match with cat, hat at and chat

This requires unlearning some automatic behaviours we’ve learnt: at most command lines we can use * to mean “any combination of characters, including none”, so A* means “A followed by anything” in regular expressions “A*” will always match because it means “containing any number of instances ‘A’ including none. In regular expressions the syntax is A .* (A followed by any character, repeated any number of times). Similarly where we use ? to stand for a “a single character” in regular expressions we use .

There are a couple of other special characters worth noting. ^ means start of line and $ means end of line. These last two are very useful in scripts, where you often need test for something which begins or ends with a given piece of text. For example if in PowerShell you declare a variable to hold some text – for example $myString = “The cat sat on the mat” , the .net string type has an endswith() method so $myString.endswith(“at”) returns true. Great. Except, we often want to do something with the text – like replace and PowerShell has a replace operator too. If we want to say “Replace the HTM at the end of a file with HTML” we can do $mystring –replace “HTM$”,”HTML” Similarly if we’re looking at text and we want to cope with trailing spaces strings have a trim method, but regular expressions can get rid of punctuation as well $myString –match “at\W*$” will match even if there is punctuation and spaces between “Cat” and the end of the line.

So far so good – we can also use a –split operator in PowerShell: again, .net strings have a split() method, but if we try this
$string=”The cat sat, on the dog’s mat” ; $string.split(“ “)

It will return blank extra lines for the spaces between “sat” and “on” and the comma will be welded to sat for good. We could split the text at any non-word character. by using –split “\W” Unfortunately Regualar expressions don’t consider ’ to be a word character so ’s will be split off from “dog”. This is easily fixed by using $string -split "[^\w‘] +" which says Split where you find something which is neither a word character nor an apostrophe, treating multiple occurrences as one.

The last thing I wanted to mention is one I ways have to double check, and that is something called “greedy” / “Lazy” behaviour. Suppose I want to change something in the first tag of a piece of XML . I might look for a match on “^<.*>” - which says find the start of the text, then a < then any other characters and finally a >. This will match the whole document because * finds as many characters as it can before the final > if we want the fewest characters the * must be followed by a ? sign.

In the next post I’ll look at a couple of ways we can put regular expressions to work including one of the best tools in PowerShell – select string.

The “joy” of Reg-ex part 1.

Additional resources