Compare the letter frequency of two text files by using PowerShell


Summary: Learn how to use Windows PowerShell to compare the letter frequency of two different text files.

This is the fourth post in a multi-part series of blog posts that deal with how to determine letter frequency in text files. To fully understand this post, you should read the entire series in order.

Here are the posts in the series:

  1. Letter frequency analysis of text by using PowerShell
  2. How to skip the beginning and ending of a file with PowerShell
  3. Read a text file and do frequency analysis by using PowerShell
  4. Compare the letter frequency of two text files by using PowerShell
  5. Calculate percentage character frequencies from a text file by using PowerShell
  6. Additional resources for text analysis by using PowerShell

One thing that is kind of cool is that with Windows PowerShell you can do all kinds of stuff and never have to write a script. This also means that it is easy to forget your scripting skills after a few years of never writing scripts. You may say, "WOOHOO!!!" But, then again, you may lament the loss of hard-won skills over time.

Sometimes it is just fun to fire up the Windows PowerShell Integrated Scripting Environment (ISE) and play around for a while. That’s what I have been doing recently, just playing.

The other articles in this series should be read before going over this script. I will admit that you probably are not going to find a lot of mission-critical, at-work solutions here, but the key things from this series of articles are the techniques that I go over. Building a script. Manipulating text. Working with arrays. Creating functions. Sorting, grouping, and all of that stuff are basic scripting techniques. I have not spent much time optimizing the code or bothered to put it into a module because I am still playing around with the code.

Create a function to count letter frequency in a text file

The first thing I need to do is to create a function to count the letter frequency in a text file that I pass to the function. I do this because I will be calling the same code twice, and it is easier to handle input and output with a function. It’s easy to create a function. I use the Function keyword, add a script block, and add a param statement for my input parameter. Here are the basic elements of my function:

Function Get-LtrFrequency
{
Param ([string]$path)
}

The previous code uses the Function keyword to tell Windows PowerShell that I want a function. Then I specify the name for the function -- Get-LtrFrequency. Next I specify that my input parameter will be called $path inside the function and that it will be passed as a string. I added this code around my previous “count the letter frequency” code. The only change I made to my code was to remove the hard-coded path to the text file and replace it with $path. This is shown here:

$a = Get-Content -Path $path
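As an optional variation, the path parameter could be validated before the file is read. The following is only a minimal sketch under stated assumptions: the function name Get-LtrFrequencyInput and the temporary file are hypothetical and are not part of the original script. It uses the standard Parameter and ValidateScript attributes so that a missing or invalid path fails at the call rather than inside Get-Content.

```powershell
# Hypothetical variant: Get-LtrFrequencyInput is an illustrative name, not part of the original script.
Function Get-LtrFrequencyInput
{
    Param (
        # Require a value and reject anything that is not an existing file.
        [Parameter(Mandatory = $true)]
        [ValidateScript({ Test-Path -LiteralPath $_ -PathType Leaf })]
        [string]$path
    )
    Get-Content -Path $path
}

# Usage: write a small temporary file and read it back through the function.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'line one', 'line two'
$lines = Get-LtrFrequencyInput -path $tmp
$lines.Count
Remove-Item -Path $tmp
```

Whether the extra validation is worth it depends on how the function is used; for quick experiments in the ISE, the plain [string]$path parameter is probably enough.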

My complete function is shown here:

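Function Get-LtrFrequency
{
    Param ([string]$path)
    $a = Get-Content -Path $path

    # Record the line numbers of the start and end markers.
    $array = @()
    for ($i = 0; $i -lt $a.Count; $i++)
    {
        If ($a[$i] -cmatch 'START') { $array += $i }
        If ($a[$i] -like "End of *Project*") { $array += $i }
    }

    # Skip the header lines after the start marker and stop before the end marker.
    $start = $array[0] + 7
    $end = $array[1] - 1
    $a = $a[$start .. $end]

    # Join the lines, convert to uppercase, and count each character.
    $ajoined = $a -join "`r"
    $ajoinedUC = $ajoined.ToUpper()
    $ajoinedUC.GetEnumerator() |
        Group-Object -NoElement |
        Sort-Object -Property Count -Descending
}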

A better view of the code appears here in the Windows PowerShell ISE:

Screenshot of the complete function code in the Windows PowerShell ISE.
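The counting step at the end of the function can be tried on its own with an in-memory string. This is a minimal sketch, not the original script: the sample string is made up, and the Where-Object filter that keeps only letters is an added assumption (the function above counts every character, including spaces and punctuation).

```powershell
# Sample text stands in for the book; the original function reads from a file instead.
$sample = 'Hello, World!'

# Assumption: keep letters only, so spaces, punctuation, and line breaks are not counted.
$freq = $sample.ToUpper().GetEnumerator() |
    Where-Object { [char]::IsLetter($_) } |
    Group-Object -NoElement |
    Sort-Object -Property Count -Descending

# Show the two most frequent letters with their counts.
$freq | Select-Object -First 2
```

With this sample, L appears three times and O twice, so those two come first. Letters that tie on count may appear in any order, which is worth remembering when comparing rankings between two texts.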

Call the function and pass the path

Now I need to pass the function the paths to the two text files that I want to compare. I am only interested in the order of the letters, and not the actual percentages of the letter frequency. The point of today's exercise is simply to see whether the most frequently occurring letters in A Tale of Two Cities occur in the same order of frequency in Moby Dick.

To do this, I call the Get-LtrFrequency function, pass the path to each text file, and then select the name property from the returned grouping objects. I then store each resulting array of characters in its own variable. This portion of the code is shown here:

$basetxt = (Get-LtrFrequency 'c:\fso\mobydick.txt').name
$difftxt = (Get-LtrFrequency 'c:\fso\ataleoftwocities.txt').name
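The .name syntax on the result of a function call relies on member enumeration, which is available in Windows PowerShell 3.0 and later: asking an array for a property that the array itself lacks returns that property from every element. The sketch below illustrates this with a made-up sample string instead of the book files, and shows Select-Object -ExpandProperty as an equivalent form.

```powershell
# Made-up sample; the counts are A=3, N=2, B=1, so the sort order is unambiguous.
$groups = 'banana'.ToUpper().GetEnumerator() |
    Group-Object -NoElement |
    Sort-Object -Property Count -Descending

# Member enumeration (PowerShell 3.0 and later): .Name on the array returns each element's Name.
$viaDot = $groups.Name

# Equivalent form that does not depend on member enumeration.
$viaSelect = $groups | Select-Object -ExpandProperty Name

$viaDot -join ' '
$viaSelect -join ' '
```

Both lines print A N B, the letters ordered from most to least frequent.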

Compare two arrays

Now I want to walk through the two arrays and compare the order of the letters as they appear in the array. This will let me see the order of the letters that occur in my two text files. To do this, I use a for loop that begins at 0 (the lower boundary of Windows PowerShell arrays) and continues through the count of the number of elements in my base text. I then increment each loop by 1 to go on to the next element in the array.

If element 0 in base text and element 0 in different text (in this case I am comparing Moby Dick to A Tale of Two Cities) match, I print a string that includes the letter value and state that they are the same. Otherwise, I print that they are different, and I display their values. Here is that portion of code:

For ($i = 0; $i -lt $basetxt.Count; $i++)
{
    # -lt stops at the last valid index; -le would run one element past the end.
    if ($basetxt[$i] -eq $difftxt[$i]) { "$($basetxt[$i]) same" }
    else { "***$($basetxt[$i]) ***$($difftxt[$i])" }
}
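The same comparison could also be wrapped in a function that emits objects instead of strings, which makes the result easier to filter or count. This is only a sketch under stated assumptions: Compare-LetterOrder and its parameter names are hypothetical, the sample arrays are made up rather than taken from the two books, and [PSCustomObject] requires Windows PowerShell 3.0 or later. It also stops at the shorter array, because the two texts need not contain the same number of distinct characters.

```powershell
# Hypothetical helper: Compare-LetterOrder is an illustrative name, not part of the original script.
Function Compare-LetterOrder
{
    Param ([string[]]$base, [string[]]$diff)

    # Stop at the shorter array so no index runs past the end of either one.
    $limit = [Math]::Min($base.Count, $diff.Count)
    for ($i = 0; $i -lt $limit; $i++)
    {
        # Emit one object per rank so the result can be filtered, sorted, or counted.
        [PSCustomObject]@{
            Rank = $i + 1
            Base = $base[$i]
            Diff = $diff[$i]
            Same = ($base[$i] -eq $diff[$i])
        }
    }
}

# Made-up sample data, not the results from the two books.
$result = Compare-LetterOrder -base 'E','T','A','S','H' -diff 'E','T','A','H','S','R'
$result | Format-Table -AutoSize

# Count how many ranks hold the same letter in both arrays.
($result | Where-Object { $_.Same }).Count
```

For the sample arrays, the first three ranks match and the last two are swapped, so the final line reports 3.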

When I run the script, I see the following output:

Screenshot of output.

What is cool is that the following letters occur in exactly the same order of frequency in the two text files:

E T A O N I

Then S and H are reversed between the two texts.

Then R occurs in the same position.

Then L and D are reversed.

U is the same.

M is reversed.

Then W is the same.

Pretty cool stuff.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. Also check out my Microsoft Operations Management Suite Blog. See you tomorrow. Until then, peace.

Ed Wilson
Microsoft Scripting Guy
