PSImaging Part 2: Export-Text from Images


Summary: Guest blogger, Ben Vierck, talks about using Windows PowerShell to export text from an image.

Microsoft Scripting Guy, Ed Wilson, is here. Welcome back, guest blogger Ben Vierck, for Part 2 of PSImaging. Read Part 1 before diving into today’s post: PSImaging Part 1: Test-Image.

Now, here’s Ben...

In the first blog post of this series, we wrote the Windows PowerShell function Test-Image to definitively detect whether a file is a known image type by analyzing the first 8 bytes of its header. In this post, we're going to write a Windows PowerShell cmdlet called Export-ImageText that can easily export text from our scanned document images.

Several popular cloud drive offerings have recently begun offering Optical Character Recognition (OCR) as a free add-on to their service.

High-quality OCR was once the sole purview of tremendously expensive enterprise software. Now it's a commoditized add-on feature for cloud services. This raises the question, "How can we take advantage of modern OCR on our own systems?"

Let's start with the most accurate open-source OCR engine available: Tesseract-ocr by Google. After installing the Tesseract runtimes, one option is to automate the executable. Instead, I chose to wrap the SDK in a Windows PowerShell binary module. By doing this, we can bundle the dependencies into the module folder so that distribution is a piece of cake.

Rather than leaving this as an exercise for the reader, I've done the work and open-sourced the project here: Positronic-IO/PSImaging. To get the PSImaging module without the source, you can run this one-liner: 

& ([scriptblock]::Create((iwr -uri http://tinyurl.com/Install-GitHubHostedModule).Content)) -GitHubUserName Positronic-IO -ModuleName PSImaging -Branch 'master' -Scope CurrentUser

Now let's play...

I have a folder with sample scanned documents, including an image with the repeating text of the Quick Brown Fox. Let's start by extracting all of the text from this file: 

Image of command output
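
The screenshot above did not survive as text. Judging from the -Rect example later in this post, the call was presumably the same pipeline without a rectangle. Here is a sketch under that assumption. It also assumes that the PSImaging module is installed and that Quick-Brown-Fox.png sits in the current folder, and it skips the OCR call when either is missing.

```powershell
# Assumption: this mirrors the command in the screenshot, whose text was not preserved.
$image = '.\Quick-Brown-Fox.png'

if ((Get-Module -ListAvailable -Name PSImaging) -and (Test-Path $image)) {
    Import-Module PSImaging
    # No -Rect argument, so the whole page is run through OCR.
    Get-Item $image | Export-ImageText
}
else {
    Write-Warning "PSImaging or $image not found; skipping the OCR call."
}
```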

Right away, I noticed that this command seemed slow. In fact, it clocks in at 1.3 seconds on my machine. Luckily, we can isolate what gets read by passing in a rectangle. Let's see how limiting the scope this way affects performance.

First, we'll isolate an interesting rectangle. Here I've opened the Quick-Brown-Fox.png file in Paint, and I added a rectangle around the word "fox": 

Image showing rectangle

Paint tells us the coordinates: x,y = 172,152 and w,h = 36,33. Let's pass those values to System.Drawing.Rectangle, whose constructor takes x, y, width, and height, in that order:

Add-Type -AssemblyName System.Drawing
$rect = New-Object System.Drawing.Rectangle 172,152,36,33
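
As a quick sanity check, here is a small sketch that reads the rectangle back. System.Drawing.Rectangle takes its arguments in the order x, y, width, height, so the last two numbers are a size, not a second corner. This check is an illustration only and is not part of the PSImaging module.

```powershell
Add-Type -AssemblyName System.Drawing   # make sure the type is loaded

$rect = New-Object System.Drawing.Rectangle 172,152,36,33

# Left/Top are the corner that was passed in; Right/Bottom are derived from the size.
$rect.Left     # 172
$rect.Top      # 152
$rect.Right    # 208 (172 + 36)
$rect.Bottom   # 185 (152 + 33)
```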

Now we'll pass $rect to our Export-ImageText cmdlet:

dir .\Quick-Brown-Fox.png | Export-ImageText -Rect $rect 

Image of command output

Profiling this run of the command shows us that it took just 200 ms. That's a command I can run on a database of a million scanned images and be done in a reasonable amount of time.
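
To put those numbers in perspective, here is a rough sketch of how the timing could be taken with Measure-Command and projected over a large batch. Get-BatchEstimate is a hypothetical helper written for this illustration. The projection assumes sequential processing at a constant per-image cost, and the timing step only runs when PSImaging and the sample image are present.

```powershell
Add-Type -AssemblyName System.Drawing

# Hypothetical helper: total time for a batch at a fixed per-image cost.
function Get-BatchEstimate {
    param([double]$MillisecondsPerImage, [long]$ImageCount)
    [TimeSpan]::FromMilliseconds($MillisecondsPerImage * $ImageCount)
}

$image = '.\Quick-Brown-Fox.png'
$rect  = New-Object System.Drawing.Rectangle 172,152,36,33

if ((Get-Module -ListAvailable -Name PSImaging) -and (Test-Path $image)) {
    Import-Module PSImaging
    # Measure-Command returns a TimeSpan for the script block it runs.
    $elapsed = Measure-Command { Get-Item $image | Export-ImageText -Rect $rect }
    "Cropped run: {0:N0} ms" -f $elapsed.TotalMilliseconds
}

# Projection using the timings quoted in this post (1.3 s full page, 200 ms cropped).
$fullPage = Get-BatchEstimate -MillisecondsPerImage 1300 -ImageCount 1000000
$cropped  = Get-BatchEstimate -MillisecondsPerImage 200  -ImageCount 1000000
[math]::Round($fullPage.TotalDays, 1)   # roughly 15 days
[math]::Round($cropped.TotalHours, 1)   # roughly 55.6 hours
```

By that estimate, cropping turns a job of about two weeks into a little over two days on a single thread.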

Next up in this series, we'll leverage another open source technology within Windows PowerShell to automatically group images by document similarity.

~Ben

Thanks again, Ben. I'm looking forward to tomorrow's post.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy 

Comments (4)

  1. Thanks Ben, its working fine now. Great to play with!

  2. Hi Ben,

    I installed it, and when I test using Export-ImageText on an image I get:

    "Error opening data file C:\Program%20Files\WindowsPowerShell\Modules\PSImaging\tessdata/eng.traineddata"

  3. Personne says:

    Same error as Paul

    PS D:\Scripts> dir $d | Export-ImageText
    Error opening data file C:\Users\Test\Documents\WindowsPowerShell\Modules\PSImaging\tessdata/eng.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
    Failed loading language ‘eng’
    Tesseract couldn’t load any languages!

  4. Ben says:

    Mea Culpa. I missed a folder on check-in. Fixed.;
    https://github.com/Positronic-IO/PSImaging/issues/1
