Hey, Scripting Guy! How Can I Read a Text File and Extract All the Text Enclosed in Double Quote Marks?

ScriptingGuy1

Hey, Scripting Guy! I’m trying to write a script that can extract all the values found between a set of double quote marks. I know how to read the text file, and I know how to output any information that I find. However, I can’t figure out how to extract all the characters between a set of double quote marks. Can you help?
— TM

Hey, TM. Before we get started today we need to update everyone on the status of the Dr. Scripto bobblehead dolls that were given out as part of the 2008 Winter Scripting Games: unfortunately, there is no status to update. Without making any excuses (well, other than the excuses we’re about to make) this past month has been an extremely … interesting … one for the Scripting Guys. That’s due, in part, to: 1) carry-over from the Scripting Games; 2) a never-ending series of meetings regarding potential changes to TechNet, changes that could dramatically affect the Script Center and how we do our work; and, 3) the need to start putting together an instructor-led lab for TechEd 2008.

Note. So how are we doing on that lab, which has to be completed in the next 10 days or so? Well, so far we have a single slide labeled Agenda. All we have to do now is actually come up with an agenda, and we’ll be able to finish off that first slide in no time. So at least things are looking up when it comes to TechEd 2008.

Anyway, because of all that the Scripting Guy who writes this column agreed to something that he typically would not agree to: a few weeks ago he agreed to let someone else mail out the bobbleheads for us.

Ah, good question: why doesn’t the Scripting Guy who writes this column typically agree to let people help out? (After all, if anyone could use some help it’s him.) Let’s put it this way: in the past three weeks, how many bobbleheads do you suppose have been sent out? That’s right: zero. And how many shipping boxes have been ordered and received, boxes that are needed before the bobbleheads can be packed up and shipped out? Right again: zero. And how many – well, you get the idea. We’ve sent out hundreds of Certificates of Excellence, we’ve sent out 50 copies of Windows Vista, we’ve sent out T-shirts and books and assorted software. But how many bobbleheads have been sent out so far? Like we said: zero.

Not to mention zilch, nada, and zip. As well as nil, naught, aught, and a big goose egg. And – well, you probably get the idea here, too.

Just to make things even more interesting, the Scripting Guy who writes this column also recently discovered that a ton of the email he sent out over the past few weeks never got delivered, most likely because his email account was moved to three different servers during that period. When it rains, it pours.

Speaking of which, it’s also pouring down rain right now. And yes, as a matter of fact the Scripting Son does have a baseball game this afternoon. This is what life is like when you’re a Scripting Guy.

Anyway, we apologize for the delay, and we are going to try to take care of this as quickly as we can. (Which means we’re going to send out the bobbleheads ourselves, just like we should have done in the first place.) It’s still going to be a week or more before the first bobbleheads go out; after all, we don’t even have any boxes to pack the things in yet. But we’ll start getting bobbleheads shipped out as quickly as we can. Promise.

Fortunately, there is one thing that the Scripting Guys would never outsource, something which we always take care to do ourselves: eat lunch. Oh, wait; there’s a second thing, too: we always write our own scripts that can extract all the text found between double quote marks in a file. Let’s explain that scenario in a little more detail, then show you how we attacked the problem.

And yes, we really will show you how we attacked this problem. And with any luck we’ll show you today, not three or four weeks from today.

TM has a text file that looks something like this (we simplified his file a little to make sure it would fit across the page without needing any line breaks):

192.168.112.88 "CN=Ken Myer,CN=Users,DC=fabrikam,DC=com" 141 "15/Apr/2008" v5 connect 13552
192.168.112.89 "CN=Pilar Ackerman,CN=Users,DC=fabrikam,DC=com" 142 "16/Apr/2008" v5 connect 13631
192.168.112.90 "CN=Jonathan Haas,CN=Users,DC=fabrikam,DC=com" 143 "17/Apr/2008" v5 connect 13987

What TM needs to do is search through this file and find all the text that’s contained between double quote marks. In the first line of the file, that’s going to be the following two pieces of information:

CN=Ken Myer,CN=Users,DC=fabrikam,DC=com

15/Apr/2008

That’s nice, but how do we actually get this information? To tell you the truth, our first thought was to use a regular expression; unfortunately, though, this problem calls for a fairly tricky regular expression. Why? Well, if you look closely at line 1 you’ll see that – as far as the regular expression is concerned – we’re likely to have four items that are enclosed in double quote marks:

"CN=Ken Myer,CN=Users,DC=fabrikam,DC=com"

"15/Apr/2008"

"CN=Ken Myer,CN=Users,DC=fabrikam,DC=com" 141 "15/Apr/2008"

" 141 "

And yes, we know that’s not the way it’s supposed to work, but the computer doesn’t know that, at least not without us giving the machine a hand by writing a pretty complicated little regular expression. (How complicated? Like we said, pretty complicated. See this Web page for a sample regular expression that retrieves the text found between HTML tags, a challenge very similar to our task, which requires us to retrieve the text found between double quote marks.)
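To make the problem a little more concrete, here is a minimal sketch (ours, not part of TM's script) that uses the VBScript.RegExp object against the first line of the sample file. The variable names and both patterns are illustrative assumptions. The first pattern is the obvious one (quote mark, anything, quote mark); because it is greedy it returns a single match that runs from the first quote mark on the line to the last one. The second pattern swaps "anything" for "anything except a quote mark", which appears to behave on this sample line, although we make no claims about how it holds up against messier data.

```vbscript
' Sketch only: the patterns and variable names are illustrative assumptions.
strLine = "192.168.112.88 " & Chr(34) & "CN=Ken Myer,CN=Users,DC=fabrikam,DC=com" & _
    Chr(34) & " 141 " & Chr(34) & "15/Apr/2008" & Chr(34) & " v5 connect 13552"

Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True

' Greedy pattern: a quote mark, any run of characters, a quote mark.
objRegEx.Pattern = Chr(34) & ".*" & Chr(34)
Set colGreedy = objRegEx.Execute(strLine)
Wscript.Echo "Greedy matches found: " & colGreedy.Count
Wscript.Echo colGreedy.Item(0).Value

' Negated character class: a quote mark, any run of non-quote characters, a quote mark.
' The parentheses capture the text between the quote marks as SubMatches(0).
objRegEx.Pattern = Chr(34) & "([^" & Chr(34) & "]*)" & Chr(34)
Set colMatches = objRegEx.Execute(strLine)
For Each objMatch In colMatches
    Wscript.Echo objMatch.SubMatches(0)
Next
```

The greedy pattern hands back the third item in the list above (both quoted values plus the 141 sandwiched between them), which is exactly the kind of result we don't want.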

In other words, our first thought didn’t pan out. And that was a problem; after all, what are the odds that the Scripting Guys would have a second thought? But hey, stranger things have happened, right? (Not that we can think of any, mind you, but we’re sure they’ve happened.) Here’s the Scripting Guys’ Plan B, an approach that is much simpler than the regular expression we’d have to write (and, as a bonus, actually does what it’s supposed to do):

Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

' Outer loop: read one character at a time, looking for an opening quote mark.
Do Until objFile.AtEndOfStream
    strText = ""
    strCharacter = objFile.Read(1)
    If strCharacter = Chr(34) Then
        ' Inner loop: collect characters until the closing quote mark.
        Do Until objFile.AtEndOfStream
            strNewCharacter = objFile.Read(1)
            If strNewCharacter = Chr(34) Then
                Exit Do
            End If
            If strNewCharacter <> "" Then
                strText = strText & strNewCharacter
            End If
        Loop
        ' Echo the collected text, skipping empty pairs such as "".
        If strText <> "" Then
            Wscript.Echo strText
        End If
    End If
Loop

objFile.Close

As you can see, we kick things off by defining a constant named ForReading and setting the value to 1; we’ll need to use this constant when we open our text file for reading. After defining the constant we create an instance of the Scripting.FileSystemObject, then use the following line of code to open the file C:\Scripts\Test.txt for – that’s right, for reading:

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

Now the fun begins. What we’re going to do is parse the text file character-by-character. (Why are we going to parse the text file character-by-character? That should become clear in just a moment.) In order to parse the file we set up a Do Until loop that runs until we’ve read every last character in the file. (Or, if you’re a stickler for technical accuracy, until the file’s AtEndOfStream property is True.) Inside this loop, we set the value of a variable named strText to an empty string (""), then use the Read method to read a single character from the text file, storing that value in a variable named strCharacter:

strCharacter = objFile.Read(1)

Our next step is to determine whether or not this character is a double quote mark; that’s something we can do by checking to see if the character has an ASCII value equal to 34, the value assigned to the double quote mark:

If strCharacter = Chr(34) Then
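In case you're wondering why we bother with Chr(34): a double quote mark typed inside a VBScript string literal has to be doubled, which gets hard to read in a hurry. The following sketch (the variable names are ours) shows that the two spellings produce the same one-character string:

```vbscript
' Sketch: two ways to spell a double quote mark in VBScript.
strQuoteA = Chr(34)    ' the character whose character code is 34
strQuoteB = """"       ' a string literal holding one escaped (doubled) quote mark

If strQuoteA = strQuoteB Then
    Wscript.Echo "Same character, code " & Asc(strQuoteB)
End If
```

Either spelling would work in the comparison; Chr(34) just spares you from counting quote marks.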

Suppose the character isn’t a double quote mark. That’s no big deal; in that case we simply go back to the top of the loop and use the Read method to read the next character in the text file. But suppose that character is a double quote mark; what then? We’re glad you asked that question.

If it turns out that we do have a double quote mark, that’s the “signal” that we need to start grabbing text; after all, our job is to grab all the text found inside a pair of double quote marks, and we just found the first of the two quote marks that make up a pair. With that in mind, the next thing we do is set up a second Do Until loop, this one also designed to run until we reach the end of the file.

Of course you might be thinking, “But if it runs all the way to the end of the file won’t that mess up our script?” And you’re right: if that happened it would mess up our script. But don’t panic; we’ll make sure that won’t happen.

Promise.

Inside this second loop we use the Read method to read the next character in the text file and store it in a variable named strNewCharacter:

strNewCharacter = objFile.Read(1)

No sooner do we grab hold of that character than we check to see if that character happens to be a double quote mark:

If strNewCharacter = Chr(34) Then

What if this character is a double quote mark? In that case, we’ve found the second half of our pair and we need to exit the inner loop, something we do by calling the Exit Do statement:

Exit Do

But what happens if the character isn’t a double quote mark? Well, that means that this is a character we want to hang onto; after all, it’s a piece of text that’s nestled between two double quote marks. With that in mind, we tack the character onto the end of the variable strText:

strText = strText & strNewCharacter

And then we go back to the top of the inner loop and repeat the process with the next character in the text file.
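If you'd like to experiment with this logic without a text file on disk, here is a rough sketch that repackages it as a function. To be clear, this is our own variation rather than the script shown above: the function name GetQuotedText and the flag blnInQuotes are hypothetical, the function walks a string with Mid instead of calling Read on a file, and it returns the pieces it finds joined with vbCrLf. One deliberate difference: if the string ends before a closing quote mark shows up, the partial text is thrown away rather than echoed.

```vbscript
' Hypothetical helper: returns the text found between pairs of double quote
' marks in strInput, one item per line (joined with vbCrLf).
Function GetQuotedText(strInput)
    Dim i, strCharacter, strText, strResult, blnInQuotes
    blnInQuotes = False
    strText = ""
    strResult = ""
    For i = 1 To Len(strInput)
        strCharacter = Mid(strInput, i, 1)
        If strCharacter = Chr(34) Then
            If blnInQuotes Then
                ' Closing quote mark: keep whatever we collected, if anything.
                If strText <> "" Then
                    If strResult <> "" Then strResult = strResult & vbCrLf
                    strResult = strResult & strText
                End If
            End If
            ' Opening or closing, flip the state and empty the buffer.
            blnInQuotes = Not blnInQuotes
            strText = ""
        ElseIf blnInQuotes Then
            ' Inside a pair of quote marks: hang on to this character.
            strText = strText & strCharacter
        End If
    Next
    GetQuotedText = strResult
End Function

strSample = "192.168.112.88 " & Chr(34) & "CN=Ken Myer,CN=Users,DC=fabrikam,DC=com" & _
    Chr(34) & " 141 " & Chr(34) & "15/Apr/2008" & Chr(34) & " v5 connect 13552"
Wscript.Echo GetQuotedText(strSample)
```

The flag does the job that the nested loop does in the file-based script: it remembers whether the last quote mark we saw opened a pair or closed one.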

What does all that mean? Well, when we start reading Test.txt the first character we encounter is a 1; consequently, we ignore this character and try again with the next character in the text file: 9. Because this second character isn’t a double quote mark we skip it as well, and then we try, try again. This process continues until, at long last, we hit a double quote mark.

As soon as we hit that double quote mark we drop into our second Do Until loop. In that loop we read in the next character in the text file: C. Is C a double quote mark? Not as far as we know. Therefore, we add this character to the variable strText. We then loop around and read the next character in the text file: N. This character also gets tacked onto the end of strText. Eventually, strText will be equal to this:

CN=Ken Myer,CN=Users,DC=fabrikam,DC=com

As it turns out, the next character following the m in com is a double quote mark. Because of that we don’t append that character to strText; instead, we drop out of the inner loop and echo back the value of strText (assuming that strText actually has a value, that is):

If strText <> "" Then
    Wscript.Echo strText
End If

And what happens after that? You got it: we go back to the top of the original loop, reset the value of strText to an empty string, then begin searching for the next pair of double quotes. By the time the script is finished we should see the following information echoed back to the screen:

CN=Ken Myer,CN=Users,DC=fabrikam,DC=com
15/Apr/2008
CN=Pilar Ackerman,CN=Users,DC=fabrikam,DC=com
16/Apr/2008
CN=Jonathan Haas,CN=Users,DC=fabrikam,DC=com
17/Apr/2008

And guess what? That just happens to be all the information that was contained between the double quote marks. Success!
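Want to try this without creating C:\Scripts\Test.txt by hand? The sketch below is one way to do that, under a few assumptions of our own: it writes two of the sample lines to a throwaway file in the temp folder (using the FileSystemObject methods GetSpecialFolder, GetTempName, and CreateTextFile), runs a condensed version of the same character-by-character loop against that file, counts the items it echoes in a variable we've named intFound, and then deletes the file.

```vbscript
Const ForReading = 1
Const TemporaryFolder = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")

' Assumption: a throwaway file in the temp folder stands in for C:\Scripts\Test.txt.
strPath = objFSO.BuildPath(objFSO.GetSpecialFolder(TemporaryFolder).Path, objFSO.GetTempName)

Set objSample = objFSO.CreateTextFile(strPath, True)
objSample.WriteLine "192.168.112.88 " & Chr(34) & "CN=Ken Myer,CN=Users,DC=fabrikam,DC=com" & _
    Chr(34) & " 141 " & Chr(34) & "15/Apr/2008" & Chr(34) & " v5 connect 13552"
objSample.WriteLine "192.168.112.89 " & Chr(34) & "CN=Pilar Ackerman,CN=Users,DC=fabrikam,DC=com" & _
    Chr(34) & " 142 " & Chr(34) & "16/Apr/2008" & Chr(34) & " v5 connect 13631"
objSample.Close

' Same character-by-character logic as the main script, plus a counter.
intFound = 0
Set objFile = objFSO.OpenTextFile(strPath, ForReading)
Do Until objFile.AtEndOfStream
    strText = ""
    If objFile.Read(1) = Chr(34) Then
        Do Until objFile.AtEndOfStream
            strNewCharacter = objFile.Read(1)
            If strNewCharacter = Chr(34) Then Exit Do
            strText = strText & strNewCharacter
        Loop
        If strText <> "" Then
            Wscript.Echo strText
            intFound = intFound + 1
        End If
    End If
Loop
objFile.Close

' Clean up the throwaway file.
objFSO.DeleteFile strPath
Wscript.Echo "Items found: " & intFound
```

Run it under CScript (for example, cscript with the name you saved the script under) so the output lands in the command window rather than in a series of message boxes.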

That should do it, TM; let us know if you have any additional questions about this. Again, we apologize to everyone for the delay in getting the bobbleheads sent out; we kind of dropped the ball on this one, but we’ll get things squared away as quickly as we can. On the bright side – well, never mind. After all, the Scripting Guys are technical writers at Microsoft; it’s been so long since we’ve seen the bright side we probably wouldn’t recognize it anymore anyway.

And did we mention the fact that it’s pouring down rain again?

See you all tomorrow.

Editor’s Note. Update: The boxes have arrived! Does that mean the bobbleheads are on the way? Well, not just yet. We still need to get everything over to the people who promised to ship them for us. That’s assuming the Scripting Editor can keep the Scripting Guy who writes this column from stopping all his work (you know, little things like writing this column every day and creating a lab for TechEd), packing each box himself, and hand-delivering them to each of the 250 winners (since he doesn’t actually trust anyone involved in the shipping process to help either).

Oh, and the sun is out now.
