Calculate percentage character frequencies from a text file by using PowerShell


Summary: Learn how to use Windows PowerShell to calculate the percentage of how often a character appears in a text file.

This is the fifth post in a multi-part series of blog posts that deal with how to determine letter frequency in text files. To fully understand this post, you should read the entire series in order.

Here are the posts in the series:

  1. Letter frequency analysis of text by using PowerShell
  2. How to skip the beginning and ending of a file with PowerShell
  3. Read a text file and do frequency analysis by using PowerShell
  4. Compare the letter frequency of two text files by using PowerShell
  5. Calculate percentage character frequencies from a text file by using PowerShell
  6. Additional resources for text analysis by using PowerShell

Okay, I will admit that I am just playing around, but I wanted to calculate the percentages of letter frequencies in a text file. For this example, I am using A Tale of Two Cities as a text file. You should refer to earlier blog articles about this topic so that what I write will have a chance of making sense.

Create a header for my report

The first thing I want to do is create a header for my report that I will display. To do this, I create an expanding here string. The basic here string is a bit fussy: it begins with @" immediately followed by a return, and it ends on a new line that begins with "@.

Here's the here string I use. One thing to watch: an expanding here string substitutes the values of its variables at the moment it is assigned, not when it is displayed. This means that $path and $total must already hold values when this assignment runs. Otherwise, the header shows empty values, or values left over from an earlier run in the same session. Here's the here string:

$header = @"
****************************************************************
|
| Letter Frequency Analysis
| of $path
| Analyzing $($total) characters …
|
****************************************************************
"@
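A quick way to see when a here string substitutes its variables is to change a variable after the assignment. The following is a minimal sketch, not part of the original script: the $name, $expanded, and $literal variables are hypothetical and exist only for this demonstration. It also shows one possible way to defer expansion, by storing a literal (single-quoted) here string and expanding it later with $ExecutionContext.InvokeCommand.ExpandString.

```powershell
# $name, $expanded, and $literal are hypothetical names for this demo only.
$name = 'first'

# Expanding here string: $name is substituted now, at assignment time.
$expanded = @"
Value: $name
"@

# Literal here string: the text $name is stored unchanged.
$literal = @'
Value: $name
'@

$name = 'second'

$expanded    # still contains 'first'
$literal     # contains the raw text $name
# Expand the literal text on demand, using the current value of $name.
$ExecutionContext.InvokeCommand.ExpandString($literal)
```

The expanding version keeps the value it saw when it was assigned, which is why the order of assignments in the script matters.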

Read the contents of the file and count the characters

The next thing I need to do is read the contents of the text file and count the characters in it. I assign the path of my file to a variable that I name $path. I then use Get-Content to read the contents of the file, and I store the results in the $a variable. Because Get-Content returns an array of lines, the Count property of $a contains the number of lines in the file, not the number of characters. The easy way to obtain the character count is to pipe $a to the Measure-Object cmdlet with the -Character switch, and then read the Characters property of the object it returns. Measure-Object counts the characters in each line, so line endings are not included. I store the number in the $total variable. Now that $path and $total have values, the $header here string can be assigned and displayed. This code is shown here:

$path = 'C:\fso\ATaleOfTwoCities.txt'
$a = Get-Content $path
$total = ($a | measure -Character).characters
# The $header here string must be assigned at this point, after $path and $total are set
$header
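To make the difference between counting lines and counting characters concrete, here is a small sketch that uses an in-memory array instead of a file. The $lines variable and its contents are hypothetical stand-ins for what Get-Content returns.

```powershell
# $lines is a hypothetical stand-in for the array returned by Get-Content.
$lines = 'abc', 'de'

$lines.Count                                      # number of lines: 2
($lines | Measure-Object -Character).Characters   # characters in all lines: 5
($lines -join "`r").Length                        # joined text, one separator added: 6
```

The last line shows that joining the lines with a separator adds characters that Measure-Object did not count, so a total taken from Measure-Object is slightly smaller than the length of the joined text.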

The code that I use to count the frequency of the characters was explained in an earlier article, so the code is shown here without additional comment:

$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() |
group -NoElement | sort count -Descending
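For readers who have not seen the earlier article, here is the same counting technique applied to a short string, as a sketch only. The $sample and $groups variables are hypothetical names, and the aliases are spelled out as Group-Object and Sort-Object.

```powershell
# $sample and $groups are hypothetical names used only for this demo.
$sample = 'Hello'

# Uppercase the text, enumerate its characters, and count each one.
$groups = $sample.ToUpper().GetEnumerator() |
    Group-Object -NoElement |
    Sort-Object Count -Descending

$groups
```

Each object in $groups has a Name property (the character) and a Count property (how often it appears). In this sample, L sorts to the top with a count of 2.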

Use custom properties in Select-Object to get percentages

So, now I use the Select-Object cmdlet, and I compute some custom properties to display. This is a great technique that works well with the pipeline. It takes the form of the following:

@{ LABEL = STRING ; EXPRESSION = SCRIPTBLOCK}

The first custom property simply displays the Name of the character. I add a column heading called “Character”, and under that column heading, I will display each character.

For the second property, I use the Count property that comes from my Group-Object cmdlet that groups all of the characters together. My Sort-Object command sorts these from largest number to smallest number. I place the Count property below a column heading that I call “Frequency”.

The last property is the most complex. I add a column heading called "Percent". I calculate the percentage of representation by dividing the letter frequency by the total number of characters. I then use the built-in Percentage format specifier, p, in conjunction with the -f format operator. The p specifier multiplies the number by 100 and appends a percent sign, and p2 displays the result with two decimal places. The total Select statement is shown here:

Select @{L = 'Character'; E = {$_.Name} },
            @{L = 'Frequency' ; E = {$_.count} },
            @{L = 'Percent' ; E = {"{0:p2}" -f ($_.count / $total)}}
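The custom properties can be tried without a text file. The sketch below feeds two made-up objects through the same three properties. The $rows, $sampleTotal, and $report variables are hypothetical, and the counts are illustrative values, not results from the book.

```powershell
# Hypothetical counts standing in for the output of Group-Object.
$sampleTotal = 8
$rows = [pscustomobject]@{ Name = 'E'; Count = 5 },
        [pscustomobject]@{ Name = 'T'; Count = 3 }

# Same three custom properties as in the script, applied to the sample rows.
$report = $rows | Select-Object @{L = 'Character'; E = {$_.Name}},
    @{L = 'Frequency'; E = {$_.Count}},
    @{L = 'Percent'; E = {'{0:p2}' -f ($_.Count / $sampleTotal)}}

$report
```

The Percent column holds formatted strings rather than numbers. The exact text, such as the decimal separator and whether a space precedes the percent sign, depends on the current culture and the PowerShell version.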

The complete script

The complete code is shown here. Note that the $header here string is assigned after $path and $total, so that both variables expand to their current values:

$path = 'C:\fso\ATaleOfTwoCities.txt'
$a = Get-Content $path
$total = ($a | measure -Character).characters
$header = @"
****************************************************************
|
| Letter Frequency Analysis
| of $path
| Analyzing $($total) characters …
|
****************************************************************
"@
$header
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() |
group -NoElement | sort count -Descending |
Select @{L = 'Character'; E = {$_.Name} },
            @{L = 'Frequency' ; E = {$_.count} },
            @{L = 'Percent' ; E = {"{0:p2}" -f ($_.count / $total)}}

The script is in the figure here:

Screenshot of completed script in PowerShell ISE.

When I run the script, the following output is shown:

Screenshot of output of script.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. Also check out my Microsoft Operations Management Suite Blog. See you tomorrow. Until then, peace.

Ed Wilson
Microsoft Scripting Guy

Comments (1)

  1. Alistair Wall says:

    The script assigns $header before $path and $total, which it depends on. Your output was probably reusing values assigned in previous runs.
