PowerShell regex crash course – Part 5 of 5

Doctor Scripto

Summary: Thomas Rayner, Microsoft Cloud and Datacenter Management MVP, shows the basics of working with regular expressions in PowerShell.

Hello! I’m Thomas Rayner, a proud Cloud and Datacenter Management Microsoft MVP, filling in for The Scripting Guy! this week. You can find me on Twitter (@MrThomasRayner), or posting on my blog, workingsysadmin.com. This week, I’m presenting a five-part crash course about how to use regular expressions in PowerShell. Regular expressions are sequences of characters that define a search pattern, mainly for use in pattern matching with strings. Regular expressions are extremely useful to extract information from text such as log files or documents. This isn’t meant to be a comprehensive series but rather, just as the name says, a crash course. So, buckle up!

Today, all I’m going to do is run through examples. Some touch on other posts from this series, but others are brand new. Enjoy!

Here’s a quick way to get the values between quotation marks in a string. Say you have the following string.

$s = @"

Here is: "Some data"

Here's "some other data"

this is "important" data

"@

If you just want the “Some data”, “some other data” and “important” parts, you could do this a couple of ways.

[regex]::matches($s,'(?<=\").+?(?=\")').value

[regex]::matches($s,'".+?"').value.trim('"')

Both return the desired results. The first one uses lookbehinds and lookaheads to search for the characters between quotation marks. The second one does basically the same thing but includes the quotation marks themselves, so I trim them afterwards. In this case, the lookahead/lookbehind example seems to be consistently faster in my tests.
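A third option, which is not one of the two approaches timed above, is to keep the quotation marks in the pattern and wrap the part you want in a capture group. This is only a sketch: the variable name $values is my own, and I have not compared its speed with the other two.

```powershell
# Same sample text as in the article
$s = @"
Here is: "Some data"
Here's "some other data"
this is "important" data
"@

# Group 1 holds whatever sits between a pair of quotation marks.
# [^"]+ means "one or more characters that are not a quotation mark".
$values = [regex]::Matches($s, '"([^"]+)"') |
    ForEach-Object { $_.Groups[1].Value }

$values
```

Because the quotation marks are consumed as part of each match, a closing quote can never be mistaken for the opening quote of the next value.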

How about this quick way of detecting if a string has non-alpha characters in it?

$string1 = 'something'

$string2 = 'some@thing'

$string1 -match '[^a-zA-Z]'  #returns false – no special chars

$string2 -match '[^a-zA-Z]'  #returns true – has special chars

In this example, if there’s a character in either of the strings that doesn’t match lowercase or uppercase a-z, then the statement is true.
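One detail worth knowing here: -match ignores case by default, so the A-Z half of the character class is optional. If you want uppercase letters to count as "special", the case-sensitive -cmatch operator does that. A small sketch, with variable names chosen just for illustration:

```powershell
$lower = 'something'
$upper = 'SOMETHING'

# -match is case-insensitive, so [^a-z] already treats A-Z as letters
$insensitive = $upper -match '[^a-z]'     # False

# -cmatch is case-sensitive, so uppercase letters are now "not a-z"
$sensitiveUpper = $upper -cmatch '[^a-z]' # True
$sensitiveLower = $lower -cmatch '[^a-z]' # False

$insensitive, $sensitiveUpper, $sensitiveLower
```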

How about checking whether a value (it could be anything; in this case, it’s an integer) is a specific length?

[int]$v6 = 849032

[int]$v2 = 23

$v6 -match '^\d{6}$'

$v2 -match '^\d{6}$'

$v6 is an int that is six digits long. $v2 is an int that is only two digits long. On lines three and four, we’re testing to see if each variable matches the pattern ‘^\d{6}$’ (the -match operator converts the int to a string before testing it), which is regex speak for “start of the line, any digit, and six of them, end of the line”. The first one will be true because it’s six digits, and the second one will be false. You could also use something like ‘^\d{4,6}$’ to validate that the int is between four and six digits long.
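Here is a quick sketch of the range version, run against a handful of made-up numbers. The property name IsFourToSix is just something I picked for the output.

```powershell
$numbers = 23, 4821, 849032, 12345678

# Test each number against "between four and six digits, nothing else"
$results = foreach ($n in $numbers) {
    [pscustomobject]@{
        Value       = $n
        IsFourToSix = ($n -match '^\d{4,6}$')
    }
}

$results
```

Only 4821 and 849032 pass. The ^ and $ anchors matter: without them, the eight-digit number would also match because it contains a run of six digits.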

Now, let’s see if a string starts or ends in a specific character (or pattern).

'something\' -match '.+?\\$'  #returns true

'something' -match '.+?\\$'  #returns false

'\something' -match '^\\.+?'  #returns true

'something' -match '^\\.+?'  #returns false

In the first two examples, I’m checking to see if the string ends in a backslash. In the last two examples, I’m seeing if the string starts with one. The regex pattern being matched for the first two is .+?\\$ . What’s that mean? Well, the first part .+? means “any character, and as many of them as it takes to get to the next part of the regex”. The second part \\ means “a backslash”. Because \ is the escape character, we’re basically escaping the escape character. The last part $ is the signal for the end of the line. Effectively, what we have is “anything at all, where the last thing on the line is a backslash” which is exactly what we’re looking for. In the last two examples, I’ve just moved the \\ to the start of the pattern and started with ^ instead of ending with $ because ^ is the signal for the start of the line.
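Since -match succeeds when the pattern is found anywhere in the string, a shorter pattern with just the anchor and the escaped backslash also works for a simple yes/no check. The sketch below uses [regex]::Escape() to build the escaped backslash so you don't have to count slashes by hand. The variable names are my own.

```powershell
# [regex]::Escape turns a literal backslash into the pattern \\
$backslash = [regex]::Escape('\')

$endsWith   = 'something\' -match ($backslash + '$')   # True
$noEnding   = 'something'  -match ($backslash + '$')   # False
$startsWith = '\something' -match ('^' + $backslash)   # True
$noStart    = 'something'  -match ('^' + $backslash)   # False

$endsWith, $noEnding, $startsWith, $noStart
```

One difference to keep in mind: the longer patterns with .+? insist on at least one other character next to the backslash, so a string that is only a backslash matches the short patterns but not the long ones.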

Sometimes, you’re given a path to a file system location that’s poorly formatted. Sometimes, you’re given thousands of them. Well here’s an easy way to normalize those paths.

'c:\some/awful/oops\here\we-go.txt' -replace '/','\'

Quick and easy. Anywhere there’s a “/”, replace it with a “\”.
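For the "thousands of them" case, -replace also works on a whole array at once. The sketch below uses two invented paths and shows the reverse direction as well, where the backslash has to be escaped because it sits on the pattern side.

```powershell
# Two sample paths (made up for this example)
$paths = 'c:\some/awful/oops\here\we-go.txt', 'd:/logs/app/today.log'

# Forward slashes to backslashes: '/' needs no escaping in the pattern,
# and '\' is not special in the replacement string
$windowsStyle = $paths -replace '/', '\'

# Backslashes to forward slashes: the pattern must be '\\'
$unixStyle = $paths -replace '\\', '/'

$windowsStyle
$unixStyle
```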

How about if you want to replace something with the original value but modified?

'this is something' -replace 's[oqr]mething','$0 fun'

'this is sqmething' -replace 's[oqr]mething','$0 fun'

'this is srmething' -replace 's[oqr]mething','$0 fun'

Check that out. Here’s what gets returned.

this is something fun

this is sqmething fun

this is srmething fun

In all three examples, we’re looking for the pattern “s, followed by o or q or r, followed by mething”. What am I replacing it with? Whatever the part of the string was that matched (signified by $0) plus the word “fun”. If you have multiple groups (separate matching groups by enclosing them in round brackets), you can use $1, $2 etc., to indicate which group you want to refer to. Notice that I used single quotes. If you use double quotes, PowerShell tries to expand $0 as one of its own variables before the regex engine ever sees it, so the replacement won’t contain what you expect.
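To make the $1 and $2 part concrete, here is a small sketch that swaps a "Last, First" name around. The sample name and variable names are only for illustration.

```powershell
$name    = 'Rayner, Thomas'
$pattern = '^(\w+),\s*(\w+)$'

# Single quotes: the regex engine receives $2 and $1 untouched
$swapped = $name -replace $pattern, '$2 $1'

# Double quotes only work if each $ is escaped with a backtick
$escaped = $name -replace $pattern, "`$2 `$1"

# $0 still refers to the whole match, not to a group
$wrapped = $name -replace $pattern, '[$0]'

$swapped, $escaped, $wrapped
```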

You could replace using a calculated value, too, using the [regex] accelerator.

[Regex]::Replace('192.168.1.100', '\d{1,3}$', {param($old) [Int]$old.Value + 1})

This will return “192.168.1.101”. The [regex]::replace() method allows you to pass a scriptblock after the pattern. In this example, we’re replacing something in the string “192.168.1.100”. What we’re replacing matches the pattern “1 to 3 digits followed by the end of the string” and we’re replacing it with the old value plus 1. Cool, right?
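The same trick works anywhere the replacement has to be computed from the match. As a sketch, here it is bumping the last number in a made-up build label, which a plain replacement string could not do because 9 has to become 10. I cast the result to a string explicitly so it is clear what the scriptblock hands back.

```powershell
$label = 'build-2.4.9'

# The scriptblock receives the Match object and returns the new text
$bumped = [regex]::Replace($label, '\d+$', {
    param($m)
    [string]([int]$m.Value + 1)
})

$bumped
```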

One last weird one. What if you have a string that reads like “this this is a fun string” and you want to remove the duplicate “this”? Regex to the rescue again!

'this this is a fun string' -replace '\b(\w+)(\s+\1){1,}\b','$1'

Alright, what is going on here? We’re feeding a string into the -replace operator. What’s the pattern we’re looking for? Well it’s \b(\w+)(\s+\1){1,}\b of course. Let’s break it down. The first part of the pattern is \b, “the boundary of a word”. Second is (\w+), which matches one or more word characters, as many as it can until it reaches something that isn’t a word character; the round brackets make this capture group 1. Third is (\s+\1){1,} which means “one or more whitespace characters followed by whatever capture group 1 matched, one or more times”. (Groups are numbered from 1 in the order their opening brackets appear, so \1 is the word captured by (\w+). The word boundaries aren’t groups and don’t get a number.) The fourth part of the pattern is another word boundary. So, where we have a word boundary followed by a word, followed by that word again at least one time, followed by a word boundary, we want to replace it. And we replace it with $1 which equates to the original word we matched. Still with me?
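If it helps, here is the same pattern run against a couple of extra strings of my own. The first has a word repeated more than once and two different repeated words. The second shows why the trailing word boundary is there.

```powershell
$pattern = '\b(\w+)(\s+\1){1,}\b'

# Triple "this" and double "is" are each collapsed to one copy
$cleaned = 'this this this is is a fun string' -replace $pattern, '$1'

# "this thistle" is left alone: the second "this" is only the start
# of a longer word, so the closing \b does not match
$untouched = 'this thistle' -replace $pattern, '$1'

$cleaned
$untouched
```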

This week, also, every PowerTip has been a regex example so check those out too!

These are just some off-the-cuff examples of regex in action. Regex is so robust, and there are so many applications for it that it would take months to do a fully comprehensive series of posts. What I presented this week is merely a crash course – something to get your feet wet, introduce you to some concepts, and give you a jumping off point to dig deeper on your own. There are lots of regex resources out there. You just need to be motivated and look for them.

This wraps up my regex crash course! I hope you learned something. Still confused? That’s okay, too. The best way to get better at regex is to start using it and practice. Don’t be afraid. Regex is complicated but also immensely powerful so it is definitely in your interest to at least get a rudimentary regex education.

See you next time!

Excellent work, Thomas! Thanks to your posts, I’m feeling a lot more up to speed on making regular expressions useful to me!

I invite you to follow the Scripting Guys on Twitter and Facebook. If you have any questions, send email to them at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow.

Until then always remember that with Great PowerShell comes Great Responsibility.

Sean Kearney Honorary Scripting Guy Cloud and Datacenter Management MVP

 
