Parse HTML and pass to Cognitive Services Text-to-Speech

Doctor Scripto

Summary: Having some fun with Abbott and Costello’s “Who’s on first?” comedy routine, and multiple voices with Bing Speech.

——————————-

Hello everyone!

In the last few posts, I showed you all about the Cognitive Services Text-to-Speech API. You learned about the process to authenticate with Windows PowerShell.

It was also a great showcase for Invoke-RestMethod, as it demonstrated how REST API services are accessible with no real code for the IT professional.

Today, as an IT pro, I’m just going to have some fun. Sometimes that’s the best way to learn how to code.

Initially, all of this came about as a challenge from other members of “Hey, Scripting Guy!” I demonstrated a silly little script I wrote to play Abbott and Costello’s most famous comedy sketch, “Who’s on first?” with the internal voices in Windows. It’s a neat trick that many PowerShell people love to play with.

# Connect to the Voice COM object

$voiceAPI=New-Object -comobject SAPI.SPVoice

# Speed up the rate of the Speaker’s voice

$voiceAPI.Rate=3

I proceeded to get the voices and then, depending on Who’s name (yes, that’s his name) I found on a line, I would pick a voice in Windows.

# Obtain the list of voices in Windows 10

$voiceFont=$voiceAPI.GetVoices()

# Establish a table to match the Microsoft voices with the names of the comedians

$nameMatch=@{'Abbott:' = 'ZIRA'; 'Costello:' = 'DAVID' }
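To make the matching step concrete, here is a minimal sketch of how a line's speaker prefix could be turned into a voice keyword. The helper name Get-VoiceKeyword and the sample line are assumptions for illustration, not part of the original script, and the commented SAPI line at the end is only one plausible way to apply the keyword on a Windows machine.

```powershell
# Same lookup table as above: speaker prefix -> keyword found in the Windows voice name
$nameMatch = @{ 'Abbott:' = 'ZIRA'; 'Costello:' = 'DAVID' }

# Hypothetical helper (not in the original script): return the keyword for a line's speaker
function Get-VoiceKeyword {
    param([string]$Line, [hashtable]$Map)
    foreach ($prefix in $Map.Keys) {
        if ($Line.StartsWith($prefix, [System.StringComparison]::OrdinalIgnoreCase)) {
            return $Map[$prefix]
        }
    }
    return $null   # No known speaker at the start of this line
}

$keyword = Get-VoiceKeyword -Line "Abbott: Who's on first." -Map $nameMatch

# Assumption: on Windows, the keyword could then select the installed voice, for example:
# $voiceAPI.Voice = $voiceFont | Where-Object { $_.GetDescription() -match $keyword } | Select-Object -First 1
```

Keeping the lookup separate from the COM call means the matching logic can be tried on any machine, even one without the Windows voices installed.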

So it was neat. I had the text file on the hard drive, and it was all fun and games.

Some people said, “Cool, but you should try the same approach with Cognitive Services!”

It was at this point I read and learned everything I showed you in the last several posts. Today we’re going to have some fun: “Who’s on first?” portrayed by the “Azure Cognitive Services Players.”

Challenge #1 – Learn how to use Text-to-Speech in Azure. Accomplished, and built a function to leverage it. I’ve prepopulated all of the available sound file options, so I could just select from an array in this function.

Function Invoke-AzureTextToSpeech($Region,$Voice,$Content,$Filename)

{

# Obtain Access Token to communicate with Voice API

# I erased mine, you'll have to get your own ;)

$APIKey='00000000000000000000000000000000'

$AccessToken=Invoke-RestMethod -Uri "https://api.cognitive.microsoft.com/sts/v1.0/issueToken" -Method 'POST' -ContentType 'application/json' -Headers @{'Ocp-Apim-Subscription-Key' = $APIKey }

# Generate GUID for Access

# Just use this Cmdlet to generate a new one (New-Guid).tostring().replace('-','')

$XSearchAppId='00000000000000000000000000000000'

# Just use this Cmdlet to generate a new one (New-Guid).tostring().replace('-','')

$XSearchClientId='00000000000000000000000000000000'

# Current list of Audio formats for Azure Text to Speech

# HTTP Headers X-Microsoft-OutputFormat

# https://docs.microsoft.com/en-us/azure/cognitive-services/speech/api-reference-rest/bingvoiceoutput

#

$AudioFormats=( `

'ssml-16khz-16bit-mono-tts', `

'raw-16khz-16bit-mono-pcm', `

'audio-16khz-16kbps-mono-siren', `

'riff-16khz-16kbps-mono-siren', `

'riff-16khz-16bit-mono-pcm', `

'audio-16khz-128kbitrate-mono-mp3', `

'audio-16khz-64kbitrate-mono-mp3', `

'audio-16khz-32kbitrate-mono-mp3' `

)

# WAV File format

$AudioOutputType=$AudioFormats[4]

$UserAgent='PowerShellForAzureCognitiveApp'

$Header=@{ `

'Content-Type' = 'application/ssml+xml'; `

'X-Microsoft-OutputFormat' = $AudioOutputType; `

'X-Search-AppId' = $XSearchAppId; `

'X-Search-ClientId' = $XSearchClientId; `

'Authorization' = $AccessToken `

}

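# Wrap the content in SSML; $Voice is expected to carry its own double quotes

$Body='<speak version="1.0" xml:lang="'+$Region+'"><voice xml:lang="'+$Region+'" name='+$Voice+'>'+$Content+'</voice></speak>'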

Invoke-RestMethod -Uri "https://speech.platform.bing.com/synthesize" -Method 'POST' -Headers $Header -ContentType 'application/ssml+xml' -Body $Body -UserAgent $UserAgent -OutFile $Filename

}

I can now use this function and dynamically supply the region data, as well as the content, in a loop or script!
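Because the request is sent as application/ssml+xml, the spoken text has to be valid XML, and the sketch is full of apostrophes and the odd ampersand. Here is a minimal sketch of building an SSML body with the content escaped first. The helper name New-SsmlBody is hypothetical, and the exact attributes the service expects are an assumption based on the general speak/voice shape of SSML.

```powershell
# Hypothetical helper (not in the original script): build an SSML body with escaped content
function New-SsmlBody {
    param([string]$Region, [string]$VoiceName, [string]$Content)

    # Escape &, <, >, " and ' so the spoken text cannot break the XML
    $safeContent = [System.Security.SecurityElement]::Escape($Content)

    "<speak version='1.0' xml:lang='$Region'><voice xml:lang='$Region' name='$VoiceName'>$safeContent</voice></speak>"
}

$ssml = New-SsmlBody -Region 'en-AU' `
    -VoiceName 'Microsoft Server Speech Text to Speech Voice (en-AU, Catherine)' `
    -Content "Who's on first & What's on second"
```

Note that this version takes the voice name without surrounding quotes and adds them itself.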

Challenge #2 – Get a nice way to play WAV files synchronously, without launching additional applications.

I used a simple function based upon the earlier posted PowerTip to solve this issue.

Function Play-MediaFile($Filename)

{

$PlayMedia=New-object System.Media.Soundplayer

$PlayMedia.SoundLocation=($Filename)

$PlayMedia.playsync()

}

Challenge #3 – Get rid of the text file.  I want to read the content straight from The Abbott and Costello Fan Club.

Connecting was easy. Just use Invoke-WebRequest, and store the content in an object.

$RawSketch=Invoke-WebRequest -Uri 'http://www.abbottandcostellofanclub.com/who.html'

The challenge was that the returned content was one massive string. I needed it broken up into lines for an array.

I’m sure I could have contacted some friends like Tome Tanasovski or Thomas Rayner for some help with regular expressions, but I like trying alternative approaches sometimes.

There were a lot of CRLF (CarriageReturn / LineFeed) and Tabs prefacing the lines. I needed that cleaned up.

$CR=[char][byte]13

$LF=[char][byte]10

$Tab=[char][byte]9

$RawSketchContent=$RawSketch.Content

$RawSketchContent=$RawSketchContent.Replace($cr+$lf+$tab,' ')

Once I completed this, I just had a nice list of content terminating in carriage returns. I could split this up into an array now, in the following fashion:

$SketchArray=$rawsketchcontent.split("`r")
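For comparison, here is a minimal sketch of a different way to reach a clean array of lines: split on the line endings with a small regular expression and trim each piece. This is an alternative, not the approach used in this post, and the sample string is made up to stand in for the downloaded page.

```powershell
# Made-up sample standing in for $RawSketch.Content
$sample = "<PRE>`r`n`tBUD: Who's on first.`r`n`tLOU: What's the guy's name?`r`n</PRE>"

# Split on CRLF or LF, trim leading tabs and spaces, and drop empty lines
$lines = $sample -split '\r?\n' |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ }
```

The trade-off is that trimming changes where the text starts on each line, so any later offsets into the line would need adjusting.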

I took a look at the raw HTML, and found a “Before” and “After” on the sketch content. I passed the array into Select-String and captured the line numbers of the matches. This allowed me to have a “Begin” parsing point, and an “End.”

$StartofSketch=$SketchArray | Select-String -SimpleMatch '<PRE>' | Select-Object -ExpandProperty LineNumber

$EndofSketch=$SketchArray | Select-String -SimpleMatch '</PRE>' | Select-Object -ExpandProperty LineNumber
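One detail worth seeing in action: LineNumber counts from 1, while array indexes count from 0. Here is a minimal sketch on a made-up five-line array (the variable names are for illustration only) showing how the two markers bracket the lines in between.

```powershell
# Made-up sample array standing in for $SketchArray
$sampleArray = '<HTML>', '<PRE>', 'BUD: Who''s on first.', 'LOU: What''s the guy''s name?', '</PRE>'

# LineNumber is the 1-based position of the matching string in the pipeline
$start = $sampleArray | Select-String -SimpleMatch '<PRE>'  | Select-Object -ExpandProperty LineNumber
$end   = $sampleArray | Select-String -SimpleMatch '</PRE>' | Select-Object -ExpandProperty LineNumber

# Walk the lines strictly between the two markers, converting back to 0-based indexes
$sketchLines = for ($a = $start + 1; $a -lt $end; $a++) { $sampleArray[$a - 1] }
```

This assumes each marker appears exactly once. If a page had several PRE blocks, both variables would become arrays and the loop bounds would need more care.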

With this achieved, I needed to select two voices in Cognitive Services Text-to-Speech. If you remember Part 4 in the series, we showed the list to choose from. I decided on an Australian female voice for Bud Abbott, and an Irish male voice for Lou Costello.

I used a simple array to store the data.

$CognitiveSpeakers=@()

$CognitiveSpeakers+='BUD:;en-AU;"Microsoft Server Speech Text to Speech Voice (en-AU, Catherine)"'

$CognitiveSpeakers+='LOU:;en-IE;"Microsoft Server Speech Text to Speech Voice (en-IE, Shaun)"'

We need to initialize certain variables to figure out Who is talking (well yes, of course he is, that’s his job), and to store away the audio content.

$CurrentSpeaker='Nobody'

$TempVoiceFilename='whoisonfirst.wav'

Now for the work to begin. We start our loop from the beginning of the content array to the end, and make sure any temporary WAV file is erased from a previous run.

For ($a=$StartofSketch+1; $a -lt $EndofSketch; $a++)

{

Remove-Item $TempVoiceFilename -Force -ErrorAction SilentlyContinue

We then identify a line of content to parse:

$LinetoSpeak=$sketcharray[$a-1]

Each line that has a speaker on the site began with either BUD: or LOU:, so I used a little RegEx to trap for where the identified speaker name ended. Anything after that would be their speaking content.

$SearchForSpeaker=(($LinetoSpeak | Select-String -Pattern '[a-zA-Z]+(:)').Matches)
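Here is a minimal sketch of what that match gives you to work with. The helper name Split-SpeakerLine and the sample lines are assumptions for illustration. It uses the same pattern, then reads Value, Index, and Length from the first match to separate the speaker from the spoken text.

```powershell
# Hypothetical helper (not in the original script): split a line into speaker and text
function Split-SpeakerLine {
    param([string]$Line)

    $found = $Line | Select-String -Pattern '[a-zA-Z]+(:)'
    if ($found) {
        # First match only: Value is the speaker, Index + Length is where the text starts
        $m = $found.Matches[0]
        [pscustomobject]@{ Speaker = $m.Value; Text = $Line.Substring($m.Index + $m.Length).Trim() }
    }
    else {
        # No speaker prefix, so treat the whole line as a continuation
        [pscustomobject]@{ Speaker = $null; Text = $Line.Trim() }
    }
}

$withSpeaker  = Split-SpeakerLine -Line 'BUD: Who''s on first.'
$continuation = Split-SpeakerLine -Line '  the fellow''s name on first base.'
```

One caveat with this pattern: any word followed by a colon in the middle of a continuation line would be mistaken for a speaker, so anchoring it to the start of the line may be safer on other pages.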

The next scenario to trap for was whether the line contained a speaker name with text, or just text (which meant a continuation of the earlier line).

This variable would be set to 1 (beginning of a line). If a speaker was found, the beginning of the content would naturally be further down the line.

$LinetoSpeakStart=1

Then I had to trap for some “fun situations.” Did the speaker change? Is it the same speaker, but they have more lines to speak?

If ($SearchForSpeaker -ne $NULL)

{

$Speaker=$SearchForSpeaker[0].Value

$LinetoSpeakStart=$SearchForSpeaker[0].Index + $SearchForSpeaker[0].Length + 5

Then of course if the speaker did change, I needed to repopulate objects unique to the speaker for Azure.

If ($Speaker -ne $CurrentSpeaker)

{

$CurrentSpeaker = $Speaker

$RawSpeakerData=$CognitiveSpeakers -match $CurrentSpeaker

$SpeakerData=$RawSpeakerData.split(';')

$Region=$SpeakerData[1]

$Voice=$SpeakerData[2]

$Name=$SpeakerData[0]

}

}

As you can see, I’m pulling in the data needed for Azure, like Voice and Region from the SpeakerData array I created earlier.
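If splitting semicolon-delimited strings feels fiddly, here is a minimal sketch of the same lookup using a hash table of objects keyed by the speaker prefix. This is an alternative design, not the one used in this post, and the variable name $speakerTable is an assumption.

```powershell
# Alternative layout (an assumption, not the original): one object per speaker, keyed by prefix
$speakerTable = @{
    'BUD:' = [pscustomobject]@{ Region = 'en-AU'; Voice = '"Microsoft Server Speech Text to Speech Voice (en-AU, Catherine)"' }
    'LOU:' = [pscustomobject]@{ Region = 'en-IE'; Voice = '"Microsoft Server Speech Text to Speech Voice (en-IE, Shaun)"' }
}

$speaker = 'LOU:'

# Direct key lookup instead of -match plus Split(';')
if ($speakerTable.ContainsKey($speaker)) {
    $Region = $speakerTable[$speaker].Region
    $Voice  = $speakerTable[$speaker].Voice
}
```

A key lookup also avoids the regular expression matching that -match performs, so a speaker name can never accidentally match part of another entry.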

Once we’ve identified the speaker and the content, we can call up the two key functions of Invoke-AzureTextToSpeech and Play-MediaFile:

If ($LinetoSpeak.Length -gt 1)

{

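# Show the line on the console, minus any HTML tags

$LinetoSpeak -replace '<[^>]+>',''

# Keep only the spoken text, and strip any leftover HTML tags before sending it to Azure

$Content=$LineToSpeak.Substring($LinetoSpeakStart) -replace '<[^>]+>',''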

Invoke-AzureTextToSpeech -Region $Region -Content $Content -Voice $Voice -Filename $TempVoiceFilename

Do { } until (Test-Path $TempVoiceFilename)

Play-MediaFile -filename $TempVoiceFilename

Start-Sleep -Milliseconds 1000

}

}

You’ll note that there is a Start-Sleep in the loop. This is because there is a limit on the REST API of how many transactions it can take within a certain timeframe.
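A fixed pause is the simplest approach. Here is a minimal sketch of a slightly more forgiving one: retry a failed call a few times, waiting a little longer after each failure. The helper name Invoke-WithRetry and its parameters are hypothetical, and the delays are placeholders rather than documented service limits.

```powershell
# Hypothetical helper (not in the original script): retry a script block with a growing delay
function Invoke-WithRetry {
    param(
        [scriptblock]$Action,
        [int]$MaxAttempts = 3,
        [int]$DelayMilliseconds = 1000
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            # Give up after the last attempt and surface the original error
            if ($attempt -eq $MaxAttempts) { throw }
            Start-Sleep -Milliseconds ($DelayMilliseconds * $attempt)
        }
    }
}
```

In the loop above, the call to Invoke-AzureTextToSpeech could be wrapped in a script block and passed as -Action. That only helps if the failing call raises a terminating error, for example by adding -ErrorAction Stop to the web request.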

I thank you for sharing your time with me today. Hopefully you had a little fun, and maybe even learned of some ways you, too, can play with HTML content.

If you see a more efficient way of doing this, I’d love to see the results! It could be a really cool blog post itself!

Until next time, remember that the Power of Shell is in you!

I invite you to follow the Scripting Guys on Twitter and Facebook. If you have any questions, send email to them at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum.

Sean Kearney, Premier Field Engineer, Microsoft

Frequent contributor to Hey, Scripting Guy!
