Expert Solution for 2011 Scripting Games Advanced Event 7: Use PowerShell and Regex to Get Twitter IDs from a Web Page

Summary: Microsoft Windows PowerShell MVP, Tome Tanasovski, uses regular expressions to get Twitter IDs from a web page while solving Advanced Event 7 in the 2011 Scripting Games.

Microsoft Scripting Guy, Ed Wilson, here. Tome Tanasovski is the expert commentator for Advanced Event 7.

Tome is a Windows engineer for a market-leading, global financial services firm in New York City. He is the founder and leader of the New York City PowerShell User group, a cofounder of the NYC Techstravaganza, a blogger, a speaker, and a regular contributor to the Windows PowerShell forum at Microsoft. He is a recipient of the MVP award for Windows PowerShell.
Tome’s contact information:
Blog: Tome’s Land of IT
Twitter: toenuff

Worked solution

Advanced Event 7 is my type of task. The core of this challenge is text parsing with a regular expression, and it presents an opportunity to show how Windows PowerShell shines with its flexibility to filter and convert data to different formats (like CSVs). The truth is, my first stab at this completed the requirements in only a few lines of code, but it would not have been a winner in that state.

Before I talk about the solution, I have to point out that if I were handed this requirement in the real world, there are a few things that I might have fought. In my opinion, the best functions are those that keep things “PowerShell-able”: they keep the pipeline going, they let you use existing cmdlets for filtering and selecting, and they are flexible. Unfortunately, this task has some rigid requirements. I would not normally incorporate something like Import-CSV into my script, because I would rather leave that power in the hands of the person operating it. The same goes for the filtering and displaying of the data. Without any requirements handed to me, my personal approach would have been to create a single function that gathers the data from the web page and emits Windows PowerShell objects with a Name property and a Twitter property. I would then expect others to take that and do things like:

Get-SQLSaturdayNetworking | Export-CSV sqlsaturday.csv

Import-CSV sqlsaturday.csv | Where-Object {$_.Name -like '*Wilson*'} | Select-Object -ExpandProperty Twitter

In my opinion, the preceding two commands are the answer, and the answer is all about knowing how to use Windows PowerShell. I am fortunate that I work for a company with extremely heavy Windows PowerShell users who would expect my functions to do the heavy parsing so that they can take the output and be flexible with it: for example, send it to a database, perform calculations on the objects, or automate direct message tweets to the people who attended the event.

In other parts of the universe, I hear that users do not want to know the ins and outs of Windows PowerShell, and they would rather be given a script or a function with very easy-to-use and intuitive parameters. So, that’s what I set out to provide. I wanted to make sure that the core cmdlet was still flexible enough to return only objects, but also with enough ability to empower the end user without having to bog them down with Windows PowerShell syntax. I wanted to do all of this while meeting the requirements set out in the challenge.

Part One – The GREP – Get-SQLSaturdayNetworking

I decided to create this one function to grab the objects from the web or from a CSV. I also decided that my script should be able to handle optional Twitter and LinkedIn accounts. It only seemed appropriate that I do a little more than the requirements to make it something that was truly useful.

I tackled the web download first by creating a very powerful regular expression that would let me pull the username, Twitter, and LinkedIn accounts from each line of the HTML returned:

$regex = '<font size="3">\s*(?<name>.*?)\s*<a(.*?twitter.com/(?<twitter>\W*\w+))*(.*?linkedin.com/in/(?<linkedin>\w+))*'

A technique in the above that is not widely known is the use of named captures. By using ?<name> inside my capturing parentheses, I can access the captures through the $matches variable as $matches['name'], $matches['twitter'], and $matches['linkedin'] instead of relying on the order in which the groups appear in the regex. Named captures are supported by the .NET regular expression engine, and they are therefore available in Windows PowerShell.
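As a minimal sketch of named captures with the -match operator, the following uses a made-up input line and a much simpler pattern than the one above. The sample text and the variable names are assumptions for illustration only.

```powershell
# Hypothetical sample line; the real page markup is not reproduced here.
$line = 'Name: Ed Wilson, Twitter: ScriptingGuys'

# (?<name>...) and (?<twitter>...) are named capture groups.
if ($line -match 'Name:\s*(?<name>[^,]+),\s*Twitter:\s*(?<twitter>\w+)') {
    # After a successful match, the named groups are keys in $matches.
    $name    = $matches['name']
    $twitter = $matches['twitter']
    "$name -> $twitter"
}
```

Reading the groups by name keeps the code stable if another group is later added to the pattern, because the numeric positions can shift while the names do not.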

Another powerful thing in the regex was to use parentheses () with .*? to group together the sections that may or may not be in each row. Following these parentheses with a * allowed me to capture the Twitter or LinkedIn accounts optionally, that is, only if they exist. This is shown here.

(.*?twitter.com/\W*(?<twitter>\w+))*
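To see the optional group in action, here is a small sketch that runs a simplified pattern against two made-up lines, one with a Twitter link and one without. The pattern and the sample lines are illustrative assumptions, not the markup of the real page.

```powershell
# Simplified, hypothetical pattern: a name followed by an optional Twitter link.
$pattern = '^(?<name>\w+)(.*?twitter\.com/(?<twitter>\w+))*'

# Hypothetical sample lines.
$lines = 'Alice <a href="http://twitter.com/alice">',
         'Bob <a href="http://example.com/bob">'

$results = foreach ($line in $lines) {
    if ($line -match $pattern) {
        # When the optional group did not participate in the match,
        # $matches has no 'twitter' key and the lookup returns $null.
        '{0}: {1}' -f $matches['name'], $matches['twitter']
    }
}
$results
```

Both lines match, so a row without a Twitter account still produces a result; it simply has an empty Twitter value.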

Another regular expression worth calling out is the one I used to split the HTML content into individual lines. This regular expression, which lives in my tool belt, lets me split without knowing for sure whether the line terminator is `n or `r`n. Dealing with unknown text can be tricky because of silly things like this. Fortunately, '(?m)\s*$' handles it: \s* matches any run of white space, including carriage returns and newline characters; (?m) makes the dollar sign ($) match at the end of each line instead of only at the end of the string; and the dollar sign ($) itself anchors the match to the end of the line.
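A rough sketch of that split, using a made-up string with mixed line endings, might look like the following. Depending on the input, the pieces can carry a leftover newline character or be empty, so this sketch also trims each piece and drops the empty ones; that cleanup step is an assumption added here, not something described above.

```powershell
# Hypothetical text that mixes `r`n and `n line endings.
$text = "first`r`nsecond`nthird"

# Split at the end of every line, then tidy up the pieces.
$lines = @($text -split '(?m)\s*$' |
    ForEach-Object { $_.Trim() } |   # remove any leftover newline characters
    Where-Object { $_ })             # drop empty pieces

$lines
```

Because each resulting piece is later tested with its own regular expression, stray white space around a piece is usually harmless, but trimming makes the output easier to inspect.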

If you are a beginner or an advanced PowerShell scripter who has been using cmdlets like Write-Host a lot or a lot of string concatenation, I hope that you take this next one home with you. I am sure you will see this technique repeated over and over by a lot of the guest commentators in the advanced track:

New-Object psobject -Property @{Name=$matches['name'];Twitter=$matches['twitter'];LinkedIn=$matches['linkedin']}

This code on a line by itself pushes the object out of the function as a return value. My function returns a series of these, which makes the output a collection of objects that can easily be piped to other cmdlets like Export-CSV or Out-GridView, or into my second function.
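The pattern can be sketched with a tiny stand-in function. Get-SamplePerson and its hard-coded data are hypothetical and exist only to show objects flowing down the pipeline; they are not part of the actual solution.

```powershell
# Hypothetical stand-in for a function that emits one object per person.
function Get-SamplePerson {
    $data = @{ 'Alice Example' = 'alice_example'; 'Bob Example' = 'bob_example' }
    foreach ($key in $data.Keys) {
        # An expression on a line by itself becomes pipeline output.
        New-Object psobject -Property @{ Name = $key; Twitter = $data[$key] }
    }
}

$people = @(Get-SamplePerson)

# Because the output is objects, the standard cmdlets work on it unchanged.
$handle = $people |
    Where-Object { $_.Name -like 'Alice*' } |
    Select-Object -ExpandProperty Twitter

$people | ConvertTo-Csv -NoTypeInformation
$handle
```

No Write-Host call or string concatenation is needed; formatting and export are left to whatever cmdlet comes next in the pipeline.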

Part Two – The End-User Function – Get-SQLSaturdayPerson

While the first function is easy to use (especially for someone who knows Windows PowerShell), I wanted to ensure that I created a single entry point that would do the filtering and the gathering of the data in one SLAM of the Enter key. Get-SQLSaturdayPerson is that function.

Most of the magic in this cmdlet is in the parameters. The rest is just a -like comparison to ensure that wildcard support exists.

I created three parameter sets that let me break down this function into web requests, CSV imports, and pipeline requests (or those that use the InputObject parameter). The first two call my first function and then pipe the contents back into the Get-SQLSaturdayPerson function. The third is the filter that works on objects in the pipeline.

Notes of interest:

  • I used parameter validation with ValidateSet to restrict users to a few options for specific parameters.
  • I ensured that my InputObject parameter could come from the pipeline.
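A reduced sketch of these ideas follows, with two parameter sets instead of three. Find-SamplePerson, the Path parameter, the parameter set names, and the ValidateSet values are all assumptions for illustration; only Filter, Type, and InputObject are parameter names mentioned in this write-up, and the real script may define them differently.

```powershell
# Hypothetical, simplified filter function.
function Find-SamplePerson {
    [CmdletBinding(DefaultParameterSetName = 'Pipeline')]
    param(
        # Wildcard pattern to look for.
        [string]$Filter = '*',

        # Restrict the property to search to a known list (values assumed).
        [ValidateSet('Name', 'Twitter')]
        [string]$Type = 'Name',

        # CSV parameter set: read the objects from a file (assumed name).
        [Parameter(ParameterSetName = 'Csv', Mandatory = $true)]
        [string]$Path,

        # Pipeline parameter set: accept the objects directly.
        [Parameter(ParameterSetName = 'Pipeline', ValueFromPipeline = $true)]
        [psobject]$InputObject
    )
    process {
        if ($PSCmdlet.ParameterSetName -eq 'Csv') {
            # Load the objects, then pipe them back into this same function.
            Import-Csv $Path | Find-SamplePerson -Filter $Filter -Type $Type
        }
        elseif ($InputObject -and $InputObject.$Type -like $Filter) {
            $InputObject
        }
    }
}

# Hypothetical sample objects.
$people = @(
    New-Object psobject -Property @{ Name = 'Alice Example'; Twitter = 'alice_example' }
    New-Object psobject -Property @{ Name = 'Bob Example'; Twitter = 'bob_example' }
)

$found = @($people | Find-SamplePerson -Filter 'Ali*')
$found
```

Passing a value outside the ValidateSet list, such as -Type Bogus, fails at parameter binding, so the function body never has to check for bad values itself.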

One final note before we get to the payoff: I personally use inline help to make sense of the functions I write. As I begin to write notes and rationalize what each parameter does, it can make things clearer. For example, I originally called the Filter parameter, Name, and the Type parameter, NameType. After a few minutes of trying to explain what these words meant it became clear that they were the wrong words to use. Another technique I try to use on occasion is to write the inline Help before I write the cmdlet. This is probably the best way to approach a new function, but it is not always possible. Regardless of when you decide to write your inline Help, the key point to take away is that the documentation of your function from the perspective of an end-user can help you write a better function.
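For readers who have not written inline Help before, here is a minimal sketch of comment-based help on a hypothetical function. Get-SampleHandle and its help text are invented for illustration and are not taken from the actual script.

```powershell
# Hypothetical function used only to demonstrate comment-based help.
function Get-SampleHandle {
    <#
    .SYNOPSIS
    Returns a sample Twitter handle for a name.

    .PARAMETER Filter
    The name to build a handle from.

    .EXAMPLE
    Get-SampleHandle -Filter 'Ed Wilson'
    #>
    param(
        [string]$Filter
    )
    # Lowercase the name and replace white space with underscores.
    ($Filter.ToLower() -replace '\s+', '_')
}

$synopsis = (Get-Help Get-SampleHandle).Synopsis
$handle   = Get-SampleHandle -Filter 'Ed Wilson'
$synopsis
$handle
```

Writing the .PARAMETER descriptions first is a quick way to notice when a parameter name, such as the original Name and NameType, is hard to explain and probably needs to change.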

Part Three – The Payoff

As you can see from the last three lines of the script, the usage is simple and meets the requirements. If you dig a bit deeper and look at Get-Help -Full for the Get-SQLSaturdayPerson cmdlet, you will see that it can do a lot more. The cmdlet remains flexible by returning objects, but it also allows users to return strings of the data that they want to see by using the OutputType parameter. It lets you use the CSV file to find what you are looking for (as requested), but it also lets you pull the data down from the web on the fly.

Windows PowerShell is powerful, but unless your functions maintain that versatility you may be draining its batteries. At the same time, however, requirements must be met. Even someone like me, who was very skeptical about the requirements, can see that by approaching something rigid with PowerShell flexibility in mind you can create something that not only satisfies requirements, but becomes useful in all of the ways that make Windows PowerShell the greatest scripting language in the world (platform dependency aside).

The complete script can be found on the Scripting Guys Script Repository.

Thank you, Tome; that is a great write-up, and you offer a ton of great advice.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy