Read a text file and do frequency analysis by using PowerShell


Summary: Learn how to read a text file and do a letter-frequency analysis using Windows PowerShell in this article written by the Microsoft Scripting Guy, Ed Wilson.

This is the third post in a multi-part series of blog posts that deal with how to determine letter frequency in text files. To fully understand this post, you should read the entire series in order.

Here are the posts in the series:

  1. Letter frequency analysis of text by using PowerShell
  2. How to skip the beginning and ending of a file with PowerShell
  3. Read a text file and do frequency analysis by using PowerShell
  4. Compare the letter frequency of two text files by using PowerShell
  5. Calculate percentage character frequencies from a text file by using PowerShell
  6. Additional resources for text analysis by using PowerShell

Today I am going to combine the script I wrote yesterday with the script that I wrote on Friday. After I do that, I will be able to get a more accurate letter-frequency analysis of a text file, because the beginning and ending portions of the file will no longer be counted. The code that I wrote the other day reads a text file by using the Get-Content cmdlet. Then I join the strings together so that I have a single string to parse. I then convert the string to all uppercase, get the enumerator, group my results, and sort my results.

So, first of all, here is the basic letter-frequency analysis code that I wrote the other day:

$a = Get-Content C:\fso\ATaleOfTwoCities.txt
$a.Count
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending
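To see what each stage of that pipeline does without needing the text file, here is a minimal sketch that runs the same steps on a small in-memory sample. The sample lines and the $sample, $joined, and $freq names are assumptions made for illustration; they are not part of the original script. Group-Object and Sort-Object are the full names of the group and sort aliases used above.

```powershell
# Hypothetical sample lines standing in for the output of Get-Content
$sample = 'It was the best', 'of times'

# Join the lines into one string (same carriage-return separator as above), then uppercase it
$joined = ($sample -join "`r").ToUpper()

# Enumerate the characters, group identical ones, and sort the groups by count
$freq = $joined.GetEnumerator() | Group-Object -NoElement | Sort-Object Count -Descending

# Show the three most frequent characters
$freq | Select-Object -First 3
```

Note that every character is counted, including the spaces and the carriage return that -join inserts between lines, so the counts always add up to the length of the joined string.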

Put the script together

The first thing I do is copy the code to a blank page in my Windows PowerShell integrated scripting environment (ISE). This is shown here:

Screenshot of the basic letter-frequency analysis code in the Windows PowerShell ISE.

Now I need to take the code that I wrote yesterday. This code removes the beginning and ending portions of the text file.

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
    If ($a[$i] -cmatch 'START') { $array += $i }
    If ($a[$i] -like "End of *Project*") { $array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

This script also reads the text file. It then creates an empty array, loops through the text, and looks for start and end strings. It then saves the line numbers that it finds so that I can use array notation to return a range of text from the file.
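The marker search can be tried on a handful of in-memory lines before pointing it at a real file. The following is a minimal sketch under stated assumptions: the sample lines, the $lines, $markers, and $body names, and the offset of 1 are all hypothetical. The script above uses an offset of 7 because of the layout of its particular file.

```powershell
# Hypothetical lines that mimic the layout of the real text file
$lines = @(
    'Header text',
    '*** START OF THIS PROJECT GUTENBERG EBOOK ***',
    'First line of the book',
    'Second line of the book',
    'End of the Project Gutenberg EBook',
    'License text'
)

$markers = @()
for ($i = 0; $i -lt $lines.Count; $i++)
{
    # -cmatch is case sensitive, so only an uppercase START matches
    if ($lines[$i] -cmatch 'START') { $markers += $i }
    # -like uses wildcards, and the pattern must match the whole line
    if ($lines[$i] -like 'End of *Project*') { $markers += $i }
}

# Assumed offset of 1: the sample has no extra header lines after the START marker
$start = $markers[0] + 1
$end = $markers[1] - 1
$body = $lines[$start .. $end]
$body
```

With this sample, $markers holds the indexes of the two marker lines, and $body holds only the lines between them.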

I paste this code at the beginning of my new script page because I need to grab the correct text BEFORE I convert it all to a single line of text, convert it to uppercase, and count the letters. So, at this point, my script appears as shown here:

Screenshot of yesterday’s code pasted before the basic letter-frequency analysis code in the Windows PowerShell ISE.

Clean up the code

Well, there are some redundancies. The code as it stands is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
    If ($a[$i] -cmatch 'START') { $array += $i }
    If ($a[$i] -like "End of *Project*") { $array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

$a = Get-Content C:\fso\ATaleOfTwoCities.txt
$a.Count
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

So, the obvious duplication is the second Get-Content line. I delete it, and my script is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
    If ($a[$i] -cmatch 'START') { $array += $i }
    If ($a[$i] -like "End of *Project*") { $array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

$a.Count
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The next thing I need to do is to delete the $a.count line because I do not need it either. The script now is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
    If ($a[$i] -cmatch 'START') { $array += $i }
    If ($a[$i] -like "End of *Project*") { $array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The last thing I need to do is to store the range of text that I retrieve by using array notation. So that I do not need to modify my copied frequency code, I simply store the result of $a[$start .. $end] back into the $a variable. This revised line is shown here:

$a = $a[$start .. $end]

The entire script is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
    If ($a[$i] -cmatch 'START') { $array += $i }
    If ($a[$i] -like "End of *Project*") { $array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a = $a[$start .. $end]

$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The script is shown here in the ISE:

Screenshot of the entire edited script in the Windows PowerShell ISE.

The output from this script is shown here:

Screenshot of output of the script.
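One thing to keep in mind when reading the output: the grouping counts every character in the joined string, including spaces, punctuation, and the carriage returns that -join inserts. If only letters are of interest, one possible refinement is to filter the characters before grouping them. The following is a sketch, not part of the script above. It uses the .NET [char]::IsLetter method, and the sample text and the $letters and $letterFreq names are assumptions for illustration.

```powershell
# Hypothetical sample standing in for the trimmed $a array
$a = 'Call me, Ishmael!', 'Some years ago.'

# Join, uppercase, and enumerate as before, but keep only letter characters
$letters = ($a -join "`r").ToUpper().GetEnumerator() |
    Where-Object { [char]::IsLetter($_) }

# Group and sort exactly as in the main script
$letterFreq = $letters | Group-Object -NoElement | Sort-Object Count -Descending
$letterFreq
```

The filter runs before Group-Object, so spaces, punctuation, and line separators never reach the grouping step.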

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. Also check out my Microsoft Operations Management Suite Blog. See you tomorrow. Until then, peace.

Ed Wilson
Microsoft Scripting Guy

Comments (2)

  1. Excellent article. Particularly liked the evolution of the code.

    What would have made it easier to understand is the text you were working with.

  2. mjolinor says:

    Thinking in pipelines:

    $array =
    for ($i = 0; $i -lt $a.Count; $i++)
    {
    If ($a[$i] -cmatch 'START') {$i}
    If ($a[$i] -like "End of *Project*") {$i}
    }
