Counting and/or identifying unique instances of things in PowerShell

Greetings!

While not technically related directly to Exchange or Office 365, this tip can be used for any number of things, including those two.  Several years ago, in support of what was then called the ExRAP (Exchange Risk Assessment Program), I came up with a neat little trick to count things in PowerShell using something called a hash table.  A good friend of mine, Gary Siepser, did a good write-up of that specific case (counting how many subfolders existed in each public folder in the hierarchy) back in 2009.

Now, let's get the simplest case out of the way first before we make it harder.  :)  To simply count the number of things in a collection (and/or get the Average, Max, and Min of numeric properties), pipe them to Measure-Object.  But that's not what this post is about...  If you're dealing with "pure" .NET or PowerShell objects and just want a quick-and-dirty count of things grouped by a certain property, there's a super simple way to do it: it's called Group-Object.  (This will feel instantly familiar to anyone who knows the 'GROUP BY' clause in T-SQL.)  For example, to see how many files of each extension are in my Downloads folder:

 PS C:\Users\Chris\Downloads> dir | Group-Object Extension
Count Name
----- ----
   24 .pdf
   14 .zip
    4 .mp4
...

I only put the "Count" and "Name" columns here, but for each group, PowerShell handily gives you the collection of objects matching that criterion (in a Group property).  Other use cases come to mind pretty easily... { Get-Service | Group-Object StartType } to show how many services are configured for each startup type, { Get-Process | Group-Object ProcessName } to see how many instances of each process are running (think svchost, wmiprvse, or MicrosoftEdgeCP), etc.
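
Since I mentioned both Measure-Object and that hidden collection of matching objects, here's a quick sketch of each; the -NoElement switch is a standard Group-Object parameter that drops the Group column when all you want is the counts:

 PS C:\Users\Chris\Downloads> dir -File | Measure-Object Length -Average -Maximum   # count, plus stats on a numeric property
PS C:\> Get-Service | Group-Object StartType -NoElement                            # just Count and Name, no Group column
PS C:\Users\Chris\Downloads> (dir | Group-Object Extension)[0].Group               # the actual FileInfo objects behind the first group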

This is all well and good for very simple scenarios.  But what if you need to do some calculations?  Or if you need to group by something that isn't a property of a well-formed PowerShell object? (We'll examine this case in detail in a moment...)  Hash tables to the rescue! Hash tables basically let you store a bunch of values together as a collection of sorts and index into it by name.  For example,

 PS C:\> $NewHashTable = @{}  #This is how you create a new hash table
PS C:\> $NewHashTable["42"] = "the answer to life the universe and everything"
PS C:\> $NewHashTable["MyName"] = "Chris"
PS C:\> $NewHashTable

Name   Value
----   -----
42     the answer to life the universe and everything
MyName Chris
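
And reading a value back out is just indexing by the same key (ContainsKey is a handy way to test whether a key exists yet):

 PS C:\> $NewHashTable["MyName"]
Chris
PS C:\> $NewHashTable.ContainsKey("42")
True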

Now, what does this have to do with counting, you might ask?  Well...  Since everything in PowerShell is an object, the values stored at the keys (or "indexes") you define are full-fledged objects, so you can manipulate them directly just like you would any other variable.  So...  I can do something like this:

 $count = @{}  #Create the hash table
foreach ($myObject in $myCollection)
{
    if ($count[$myObject.Whatever] -eq $null) # If this is the first time we've seen this [whatever],
                                              # meaning if the indexed value is $null...
    {
        $count[$myObject.Whatever] = 1;       # ... then we create an entry for this [whatever] and
                                              # set it to 1 (for the 1st time we've seen it).
    }
    else   # If it's not $null, that means we have seen it before, and the value there is how many times 
           # we've seen it...
    {
        $count[$myObject.Whatever]++;  # ... so since we're seeing it again, we just increment it.
    }
}
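
To make that concrete, here's a self-contained run of the same loop -- I'm just using Get-Process and ProcessName as stand-ins for $myCollection and Whatever -- plus the GetEnumerator trick for viewing the tallies sorted by count:

 $count = @{}
foreach ($myObject in (Get-Process))
{
    if ($count[$myObject.ProcessName] -eq $null) { $count[$myObject.ProcessName] = 1 }
    else                                         { $count[$myObject.ProcessName]++ }
}
$count.GetEnumerator() | Sort-Object Value -Descending   # Busiest process names first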

Now in this example, it would certainly have been simpler just to write { $myCollection | Group-Object Whatever } which would do the exact same thing.  But sometimes it's not that easy, and again, those cases aren't the primary purpose of this post.  For example, one of the things I use this for most frequently is looking for certain things in log files.  If, for example, I need to know a list of unique client IP addresses or how many times each URL was hit in a given IIS log file, I can't use Import-Csv and Group-Object because of the way the file/field headers are stored (the first few lines start with #, and the field headers aren't until line 4).  But, I can:

  1. Pull the log file into a variable using Get-Content
  2. Loop through the lines of the file with foreach
  3. Split each line on its delimiter (W3C-format IIS log fields are space-delimited) and store that line's "fields" in an array
  4. Index into that array by column number to retrieve the values I want from a given column
  5. Use those values to index into my hash table.

So, modifying my above example for an IIS log file:

 $uniqueIPs = @{}    #Create the hash table
$logFileContents = Get-Content E:\iis_logs\u_20160711.log   #Read the log file
$columnNum = 5 #I'm making this up for this example, but let's assume what I want is in the 6th column
               #No, that's not a typo -- indexes are 0-based, so the 1st column is index 0 and the 6th column is index 5.  :)

foreach ($line in $logFileContents)
{
    if ($line.StartsWith("#")) { continue }   # Skip the #-prefixed header/directive lines at the top of the file
    $lineFields = $line.Split(" ");           # W3C-format IIS log fields are space-delimited
    $interestingValue = $lineFields[ $columnNum ];
    if ($uniqueIPs[ $interestingValue ] -eq $null) 
    {
        $uniqueIPs[ $interestingValue  ] = 1; 
    }
    else
    {
        $uniqueIPs[ $interestingValue ]++;  
    }
}
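
One refinement worth a mention: rather than hard-coding $columnNum, you can derive it from the log's own #Fields: directive.  Here's a minimal sketch, assuming a W3C-format log and the standard "c-ip" field name for the client IP:

 $fieldsLine = $logFileContents | Where-Object { $_.StartsWith("#Fields:") } | Select-Object -First 1
$fieldNames = $fieldsLine.Substring("#Fields: ".Length).Split(" ")   # Field names, in column order
$columnNum  = [array]::IndexOf($fieldNames, "c-ip")                  # Index of the column we want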

Assuming the client IP address (or URI query stem, or HTTP response code, or whatever I'm looking for) is actually in the 6th column, the hash table will now give me both a list of all unique values -- which sometimes is all I care about -- and a count of how many times each value occurred -- which may or may not be useful, but hey, it's "free".  So the contents of my hash table might look something like:

 Name        Value
----        -----
1.1.1.1     3456
2.2.2.2     4567
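
And if the unique list really is all I care about, the keys alone (or just a count of them) will do:

 PS C:\> $uniqueIPs.Keys | Sort-Object   # Just the unique values
PS C:\> $uniqueIPs.Count                 # How many unique values we saw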

Want to make it even more interesting?  Say I'm looking at SMTP protocol logs (which I do a LOT to get a list of all unique IPs that are using a Receive connector), where the client IP and port number are in a single field (e.g. "1.2.3.4:5678") but I don't care about the port number.  Since I'm running my own code to evaluate each line, no biggie (those logs are comma-delimited, so the Split would use "," instead).  I would just change this line:

 $interestingValue = $lineFields[ $columnNum ];

into something like this:

 $interestingValue = $lineFields[ $columnNum ];
$interestingValue = $interestingValue.Split(":")[0];

This, of course, is horrible PowerShell form because there's no error checking etc., but essentially here's what we're doing.  $interestingValue = $lineFields[$columnNum] indexes into $lineFields -- the zero-based array containing all of the delimiter-separated values on the line -- at $columnNum (in my case, 5, which gets the 6th value) and stores the result in $interestingValue.  Then we tell PowerShell to split $interestingValue on ":" and take the first result (i.e., everything before the ":", which in this example is the IP address).
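
If you want to patch up that "no error checking" confession, a minimal sketch might look like the following -- the bounds check is my addition, not part of the original one-liner:

 if ($lineFields.Count -gt $columnNum)   # Make sure this line actually has that many columns
{
    $interestingValue = $lineFields[ $columnNum ].Split(":")[0]   # Everything before the ":" -- the IP address
}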

Now, to address the elephant in the room, I'm sure some of you out there are also thinking "Dude, why not just use LogParser?"  And yes, LogParser is definitely a more elegant and powerful solution.  However, in my 10+ years as a PFE, almost never touching the same server twice, I learned to take the "lowest common denominator" approach.  I can't assume that LogParser will be installed, I'm certainly not going to install it on a customer's production server outside of change control, and even though it's on my laptop, I may or may not have network access -- and even if I do, I may or may not be allowed to copy a customer's log file to my PC for analysis.  Also, it goes back to the point of this post -- this technique is generic, can be adapted to pretty much any scenario, and is ubiquitously available.

More tk...
Chris

* PS - I have a photography background, and in photojournalism, "more tk" is frequently short for "more to come".