PowerShell Examples: Counting words in a text file


This blog post is part of a series that shows example PowerShell code for those learning the language.

This time we’re using PowerShell to count lines, count words, find the longest word and find the most frequently used words in a text file. To make it interesting, we’re using a plain text version of “Alice in Wonderland” downloaded from the Project Gutenberg site.

This example explores string manipulation and the use of hash tables. It also shows the use of Write-Progress.
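
As a warm-up, here is a minimal sketch of the hash table counting idea on a short in-memory string instead of a file. The sample sentence and the variable names `$Sample` and `$Counts` are made up for illustration and are not part of the script in this post.

```powershell
# Hypothetical sample text standing in for the file contents
$Sample = "the cat saw the dog and the dog saw the cat"

$Counts = @{}
foreach ($Word in $Sample.Split(" ")) {
    $Key = $Word.ToUpper()
    if ($Counts.ContainsKey($Key)) {
        $Counts[$Key]++          # key already present: increment its count
    } else {
        $Counts.Add($Key, 1)     # first occurrence: add the key with a count of 1
    }
}

# List the entries, most frequent first
$Counts.GetEnumerator() | Sort-Object Value -Descending
```

Each distinct word becomes a key and its value is the running count, so checking whether a word has been seen takes a single ContainsKey call.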

 

#
# Counting words in a text file
# Uses the text from Alice in Wonderland
#
# from http://www.gutenberg.org/ebooks/11.txt.utf-8
#

Clear-Host
$FileName = ".\Alice.TXT"
Write-Host "Reading file $FileName..."
$File = Get-Content $FileName
$TotalLines = $File.Count
Write-Host "$TotalLines lines read from the file."

$SearchWord = "WONDERLAND"
$Found = 0
$WordCount = 0
$Longest = ""
$Dictionary = @{}
$LineCount = 0

$File | foreach {
    $Line = $_
    $LineCount++
    Write-Progress -Activity "Processing words..." -PercentComplete ($LineCount*100/$TotalLines)
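    # Split on spaces and punctuation; ToCharArray() makes each character its own delimiter
    $Line.Split(' .,:;?!/()[]{}-`"'.ToCharArray()) | foreach {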
        $Word = $_.ToUpper()
        # Skip empty strings and tokens that do not start with a letter
        If ($Word[0] -ge 'A' -and $Word[0] -le 'Z') {
            $WordCount++
            If ($Word.Contains($SearchWord)) { $Found++ }
            If ($Word.Length -gt $Longest.Length) { $Longest = $Word }
            If ($Dictionary.ContainsKey($Word)) {
                $Dictionary[$Word]++
            } else {
                $Dictionary.Add($Word, 1)
            }
        }
    }
}

Write-Progress -Activity "Processing words..." -Completed
$DictWords = $Dictionary.Count
Write-Host "There were $WordCount total words in the text"
Write-Host "There were $DictWords distinct words in the text"
Write-Host "The word $SearchWord was found $Found times."
Write-Host "The longest word was $Longest"
Write-Host
Write-Host "Most used words with more than 4 letters:"

$Dictionary.GetEnumerator() | Where-Object { $_.Name.Length -gt 4 } |
    Sort-Object Value -Descending | Select-Object -First 20
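
The delimiter string passed to Split is the part of the script most likely to trip people up. In a double-quoted PowerShell string the backtick is the escape character, so a literal backtick is written as two backticks and a literal double quote as a backtick followed by the quote. In a single-quoted string both characters are literal. The sketch below uses the single-quoted form on a made-up sample line (not taken from the book) and converts the string to a character array explicitly. That conversion is an assumption about how you want Split to behave: on newer PowerShell versions built on .NET Core, a plain string argument can be treated as one multi-character separator rather than a set of characters.

```powershell
# Hypothetical sample line, not taken from the book
$Line = 'Oh dear! "I shall be late," thought Alice (very slowly).'

# Inside single quotes the backtick and the double quote are literal characters,
# so no escaping is needed. ToCharArray() makes each character its own delimiter.
$Delimiters = ' .,:;?!/()[]{}-`"'.ToCharArray()

# Split returns empty strings between adjacent delimiters; drop them here
$Words = $Line.Split($Delimiters) | Where-Object { $_ -ne '' }

$Words.Count
$Words -join '|'
```

The main script does not need the extra filter for empty strings, because its check on the first character of each word already rejects them.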

 

In case you were wondering what the output would look like, here it is:

 

Reading file .\Alice.TXT...
3339 lines read from the file.
There were 25599 total words in the text
There were 2616 distinct words in the text
The word WONDERLAND was found 3 times.
The longest word was DISAPPOINTMENT

Most used words with more than 4 letters:

Name                   Value
----                   -----
ALICE                  385
LITTLE                 128
ABOUT                  94 
AGAIN                  83 
HERSELF                83  
WOULD                  78  
COULD                  77  
THOUGHT                74  
THERE                  71  
QUEEN                  68  
BEGAN                  58  
TURTLE                 57  
QUITE                  55  
HATTER                 55 
DON'T                  55 
GRYPHON                55 
THINK                  53 
THEIR                  51  
FIRST                  50 
THING                  49  
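
If you only need a frequency table like the one above, Group-Object can do the counting without an explicit hash table. This is an alternative sketch, not the approach used in the script, and the word list is invented for illustration. Note that the count column is named Count here rather than Value.

```powershell
# Hypothetical word list standing in for the words extracted from the file
$Words = 'ALICE','QUEEN','ALICE','CAT','HATTER','ALICE','QUEEN','THE','THE','THE'

$Top = $Words |
    Where-Object { $_.Length -gt 4 } |     # keep words with more than 4 letters
    Group-Object -NoElement |              # one group per distinct word, with a Count
    Sort-Object Count -Descending |
    Select-Object -Property Name, Count -First 2

$Top
```

The hash table version in the script has the advantage of collecting the total word count, the longest word and the search hits in the same pass over the file.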

Comments (1)

  1. Steve F says:

    Very neat. Fyi, i had to remove the double quote before the escape character to run properly (line 23). I ran this against the summary field of our ticketing system (10,000 entries just for my group). Most of the results are: Can’t, won’t, unable, access, working, please, needs, include lol.
