Letter frequency analysis of text by using PowerShell


Summary: Microsoft Scripting Guy, Ed Wilson, talks about using Windows PowerShell to do letter frequency analysis of a text, which shows how often each letter occurs.

This is the first post in a multi-part series of blog posts that deal with how to determine letter frequency in text files. To fully understand this post, you should read the entire series in order.

Here are the posts in the series:

  1. Letter frequency analysis of text by using PowerShell
  2. How to skip the beginning and ending of a file with PowerShell
  3. Read a text file and do frequency analysis by using PowerShell
  4. Compare the letter frequency of two text files by using PowerShell
  5. Calculate percentage character frequencies from a text file by using PowerShell
  6. Additional resources for text analysis by using PowerShell

So, the other night I was watching a show about cryptography. In the episode, the moderator said that all forms of encryption involve some form or type of letter substitution, and they went back over history to talk about some of the more famous encryption methodologies.

While I wasn’t entirely certain about his premise, what really got me thinking was when he said that he analyzed Hamlet’s soliloquy and an article from the Sun newspaper and came up with data about letter frequency. He further asserted that in English, the most common letter was t, then e, and so on, and so forth. The idea is that after you establish the most common letters in English, when you see the most common symbol / letter in your encrypted text, you can assume that it may also be the letter t, or e, and so forth.

But I was not entirely convinced. For one thing, it seems like he had a rather limited data source. Luckily, with Windows PowerShell I can do better -- and as a matter of fact, quite easily.

Determine letter frequency in a text

First of all, I need a decent source for text. Luckily, I can get what I need from Project Gutenberg. For this example, I am going to use Charles Dickens' A Tale of Two Cities. The text file appears here in Notepad:

Text from A Tale of Two Cities in Notepad.

Now, from the previous figure, it is obvious that it takes around 113 lines of text before you arrive at “It was the best of times, it was the worst of times …” If I actually only wanted to work with the text of the book, I would need to delete the first 112 lines of text from the file. There is also end matter that begins with “*** END OF THE Project …” that I could also search for and remove if I wanted to do that as well. But for the purposes of this article, I am simply going to read the entire content of the text file and work with that.
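If you did want only the text of the book, a minimal sketch of the trimming might look like the following. The sample lines and the $skip value are hypothetical stand-ins (the real file would skip 112 lines), and the end marker is matched only on the prefix quoted above:

```powershell
# Hypothetical sample standing in for the Project Gutenberg file.
$lines = @(
    'Header line 1',
    'Header line 2',
    'It was the best of times,',
    'it was the worst of times,',
    '*** END OF THE Project',
    'License text'
)

# Number of front-matter lines to drop (2 for this sample, 112 in the article's file).
$skip = 2
$body = $lines | Select-Object -Skip $skip

# Count the lines that come before the end marker.
$endIndex = 0
foreach ($line in $body) {
    if ($line.StartsWith('*** END OF THE Project')) { break }
    $endIndex++
}

# Keep only the lines before the marker.
$text = $body | Select-Object -First $endIndex
$text
```

If the marker is never found, $endIndex ends up equal to the number of lines in $body, so nothing is trimmed from the end.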

Step one – Read the contents of the file

The first thing I need to do is to read the contents of the text file. I then store the contents in a variable, which I am simply calling $a. This line of code appears here:

$a = Get-Content C:\fso\ATaleOfTwoCities.txt

Now I want to see what I have, so I look at the Count property:

$a.count

I find from the Count property that I have 16,271 lines of text. But if I were to try to group these lines of text, I would simply end up with a mess because I would have 16,271 groups of one line each (unless there were actually duplicated lines in the text file).
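To see why grouping whole lines is not useful, here is a small sketch that uses three hypothetical sample lines in place of the file:

```powershell
# Three distinct lines standing in for the contents of the file.
$sample = @(
    'It was the best of times,',
    'it was the worst of times,',
    'it was the age of wisdom,'
)

# Grouping whole lines yields one group per distinct line.
$groups = $sample | Group-Object -NoElement
$groups.Count   # 3 groups, each with a Count of 1
```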

Step two – Join each of the 16,271 lines together

What I want to do is to join each of the 16,271 lines of text together to form one really, really long string. This will make it easier for me to work with the letters in the text. To do this, I simply use the -join operator and join the strings together with the carriage return character “`r”. I store the results in a different variable. This appears here:

$ajoined = $a -join "`r"
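As a rough illustration of what the join does, the sketch below joins a small hypothetical array and then shows an alternative: on Windows PowerShell 3.0 and later, Get-Content has a -Raw switch that returns the file as a single string. The temporary file name is made up for the example:

```powershell
# A small array standing in for the lines of the file.
$lines = @('one', 'two', 'three')

# Join with a carriage return between each pair of lines.
$joined = $lines -join "`r"
$joined.Length   # 11 letters plus 2 separators = 13

# Alternative (PowerShell 3.0 and later): read a file as one string with -Raw.
# The path is a hypothetical temporary file created just for this sketch.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'freq-sample.txt'
Set-Content -Path $path -Value $lines
$raw = Get-Content -Path $path -Raw
Remove-Item -Path $path
```

Note that with -Raw the string keeps the file's own line endings instead of the “`r” separator, so the characters that show up in the grouped output will differ slightly.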

Step three - Convert everything to uppercase

If I am doing a letter frequency analysis of text, I really do not care if the letter is uppercase or lowercase. So, to make it easier to group everything together, I call the ToUpper() string method. I do this on my joined string and store it in a new variable:

$ajoinedUC = $ajoined.ToUpper()

Step four - Take each letter and group and sort

Now that I have the entire contents of A Tale of Two Cities in a single string and converted to uppercase, the next task is to take each letter, group the letters, and sort them by count. To do this, I call the GetEnumerator() method so that I can pass each character over to the Group-Object cmdlet. I then sort by the count of each character so I can see the most frequently occurring letters in the book. This appears here:

$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending
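To make the pipeline easier to follow, here is the same technique on a short hypothetical sample string, written with the full cmdlet names rather than the group and sort aliases:

```powershell
# A short sample string, converted to uppercase as in step three.
$sample = 'It was the best of times'.ToUpper()

# Enumerate the characters, group identical ones, and sort by frequency.
$counts = $sample.GetEnumerator() |
    Group-Object -NoElement |
    Sort-Object -Property Count -Descending

# Show the two most frequent characters: the space (5) and then T (4).
$counts | Select-Object -First 2
```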

The complete code

The complete code I used appears here:

$a = Get-Content C:\fso\ATaleOfTwoCities.txt

$a.Count

$ajoined = $a -join "`r"

$ajoinedUC = $ajoined.ToUpper()

$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The code and its output from Windows PowerShell appear here:

Screenshot of counts of individual letters.

So, technically, the most common character in the book A Tale of Two Cities was the “ “ (space) character. The most common letters in order are:

E, T, A, O, N, I, H, S, R, D, L, U, …

The ellipsis (…) in the output is an artifact of the character I used to join the lines of text together, and it is not part of the actual text file. So, a quicker and more complete analysis of a text file finds that E, and not T, is the most common letter … and by a pretty good amount.
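If the spaces, punctuation, and join characters get in the way, one possible refinement is to keep only letters before grouping. This is a sketch on a tiny hypothetical sample, not the code used for the screenshot, and it relies on the .NET [char]::IsLetter method:

```powershell
# Two sample lines joined with a carriage return, then converted to uppercase.
$sample = "It was the best of times,`rit was the worst of times".ToUpper()

# Keep only characters that are letters.
$letters = $sample.ToCharArray() | Where-Object { [char]::IsLetter($_) }

# Group and rank the remaining letters.
$ranked = $letters |
    Group-Object -NoElement |
    Sort-Object -Property Count -Descending

$ranked | Select-Object -First 5
```

A sample this small says nothing about English as a whole; the point is only that the space, comma, and carriage return no longer appear in the results.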

*** Further call to action ***

Something that would be pretty easy to do and that might be fun is to read and analyze other text files, and then use Compare-Object to see how the letter frequencies compare across files. A larger and more varied set of texts would provide a better basis and reduce sample bias. In addition, this technique can easily be used with text in other languages.
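As a starting point, a comparison along those lines might be sketched as follows. Get-LetterCount is a hypothetical helper written for this example, and the two short strings stand in for the contents of two text files:

```powershell
# Hypothetical helper: returns one object per letter with its Name and Count.
function Get-LetterCount {
    param([string]$Text)

    $Text.ToUpper().ToCharArray() |
        Where-Object { [char]::IsLetter($_) } |
        Group-Object -NoElement |
        Select-Object -Property Name, Count
}

# Two short strings standing in for two text files.
$first  = Get-LetterCount -Text 'It was the best of times'
$second = Get-LetterCount -Text 'It was the worst of times'

# List the letter/count pairs that appear in only one of the two results.
$diff = Compare-Object -ReferenceObject $first -DifferenceObject $second -Property Name, Count
$diff | Sort-Object -Property Name
```

In the output, a SideIndicator of <= marks a letter and count found only in the first text, and => marks one found only in the second. For texts of different lengths, raw counts will nearly always differ, so comparing percentages or rank order is likely to be more informative.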

Hope you have a great weekend … and keep on scripting.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. Also check out my Microsoft Operations Management Suite Blog. See you tomorrow. Until then, peace.

Ed Wilson
Microsoft Scripting Guy

 

Comments (6)

  1. Compugab says:

    Nice article.

    Why not use Get-Content ‘-Raw’ parameter to skip the “join” step?

  2. Brian Benson says:

    Thanks Ed. That was an interesting exercise.

    Any idea how there are 16,270 … ‘s ?

  3. Piotr Siódmak says:

    Or you could do this:
    type X:\temp\asdfghj.txt | % {$_.ToUpper().ToCharArray()} | group -NoElement

  4. Lars Buchleitner says:

    Wouldn’t Get-Content -Raw save one Step?

  5. Fun stuff!

    Just wanted to point out, though, that you can skip the join step by using Get-Content’s -Raw switch.

  6. psDude says:

    Get-Content’s -Raw switch is available on versions 3.0 and newer only.
