Hey, Scripting Guy! How Can I Extract Three-Digit Numbers From a String?

ScriptingGuy1

Hey, Scripting Guy! I’m trying to use regular expressions to extract three-digit numbers that are enclosed by pipe characters; they’re embedded in string values like this: |159|468|572|843|. However, I can’t figure out the syntax that will get this to work. Can you help me?

— RS

Hey, RS. Before we answer this question we should note that the 2008 Winter Scripting Games are now in full swing: not only have all the main events been available since Friday, but today we’re posting the first event in the Sudden Death Challenge as well. But if you’re just now getting started with the Games, well, relax: the deadline for Events 1 and 2 is still two days away (Wednesday, February 20th), and the deadline for Events 3 and 4 is still four days away (Friday, February 22nd). In other words, there’s no need to panic; you still have plenty of time to complete all the events.

Note. The only ones who should panic are the Scripting Guys. We’re only a couple days into the Games, and the Games themselves opened on a Friday, usually a low-traffic day in the Script Center. Nevertheless, it already looks like we’re going to have way more entries than we did last year. Is that a problem? Let’s put it this way: last year we were barely able to keep up with all the submissions. This year? Uh, we’d rather not talk about this year.

Not because we’d rather not talk about this year. We simply don’t have time to talk about this year. It’s only Monday, and we’re already tired!

But while the Scripting Guys are beginning to freak out, Scripting Games competitors still have plenty of time. Enough time to break away for a few minutes in order to learn how to extract three-digit numbers from a string? Well, we’re about to find out, aren’t we:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True   
objRegEx.Pattern = "\d{3}"

strSearchString = "|159|468|572|843|"

Set colMatches = objRegEx.Execute(strSearchString)  

For Each strMatch in colMatches
    Wscript.Echo strMatch.Value
Next

As it turns out, RS, one problem you were having is that you were perhaps making this into a more complicated scenario than it needed to be. (And trust us, if anyone would recognize the act of making a simple thing complicated it’s the Scripting Guy who writes this column.) You were trying to come up with a way to include the pipe characters in your regular expression. Is that a problem? It can be, because the pipe character is a reserved character in regular expressions (it’s the “or” operator that separates alternatives); that means you can’t just include it in a Pattern without escaping it. As you found out, a regular expression pattern like this isn’t going to return the expected data:

objRegEx.Pattern = "|...|"

As it turns out, however, you don’t even need to worry about the pipe character. Based on the sample data you gave us all you’re really looking for is a three-digit number. The pipe characters don’t matter; these three-digit numbers could be separated by blank spaces, commas, even a bunch of intervening characters:

aaaaa159bbbbb468ccccc572ddddd843eeeee

In other words, all we have to do is search for three-digit numbers and we’re home free.

Note. Technically we don’t even need a regular expression; we could just use the Split function and split the string on the pipe character. So why didn’t we do that? Well, in that case we’d end up with an array that had an empty item at the beginning and the end. Rather than have to deal with that we decided to simply modify RS’ regular expression script and go that route.
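For the curious, here’s a rough sketch of what that Split approach might look like, including the extra check needed to skip the empty items. The function name GetPipeItems is our own invention for this illustration, and the comma-separated output is just one way to show the results:

```vbscript
' Sketch: extract the numbers with Split instead of a regular expression.
' GetPipeItems is a hypothetical helper name used only for this example.
Function GetPipeItems(strText)
    Dim arrItems, strItem, strResult
    arrItems = Split(strText, "|")
    strResult = ""
    For Each strItem In arrItems
        ' Skip the empty items produced by the leading and trailing pipes.
        If strItem <> "" Then
            If strResult <> "" Then strResult = strResult & ","
            strResult = strResult & strItem
        End If
    Next
    GetPipeItems = strResult
End Function

WScript.Echo GetPipeItems("|159|468|572|843|")   ' 159,468,572,843
```

Keep in mind that Split does nothing to verify that each item is a three-digit number; it simply hands back whatever sits between the pipes.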

So how do we go about searching for three-digit numbers? Well, in line 1 we create an instance of the VBScript.RegExp object; that’s the object that enables us to use regular expressions with a VBScript script. After we create the RegExp object we assign values to two key properties of the object:

Global. The Global property determines whether the script should find just one instance of the target text or all instances of the target text. Because we want to find all the three-digit numbers we set the value of this property to True.

Pattern. The Pattern represents the target text; this is where we specify exactly what it is we’re looking for. We want to find three-digit numbers, so we use this syntax: \d{3}. The \d means that we want to look for a digit (a number between 0 and 9); we have to use the \ because a plain d would simply match the letter d. Meanwhile, the {3} tells the script that we want to match exactly three of these characters (digits between 0 and 9) in a row. Note that the pattern doesn’t care what comes before or after those three digits; for RS’ sample data, where every number is exactly three digits long, that’s not a problem.
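If you’d like to see the effect of the Global property for yourself, the following sketch runs the same pattern against RS’ sample string twice, once with Global set to False and once with Global set to True. The helper function CountMatches is something we made up for this illustration:

```vbscript
' Sketch: compare Global = False with Global = True.
' CountMatches is a hypothetical helper used only for this illustration.
Function CountMatches(blnGlobal)
    Dim objRegEx, colMatches
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Global = blnGlobal
    objRegEx.Pattern = "\d{3}"
    Set colMatches = objRegEx.Execute("|159|468|572|843|")
    ' Count tells us how many matches ended up in the collection.
    CountMatches = colMatches.Count
End Function

WScript.Echo "Global = False: " & CountMatches(False)   ' 1 (just 159)
WScript.Echo "Global = True:  " & CountMatches(True)    ' 4
```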

After we assign our search string to a variable named strSearchString we then call the Execute method to search that string for our target text:

Set colMatches = objRegEx.Execute(strSearchString)

Any matches that the RegExp object finds will be stored in a collection we named colMatches. To display a list of the three-digit numbers all we have to do is set up a For Each loop to loop through the items in the collection and, for each such item, echo back the Value property:

For Each strMatch in colMatches
    Wscript.Echo strMatch.Value
Next

When we do that we should get back the following:

159
468
572
843

Which is just what we wanted to get back.

Now, we should note that this particular regular expression is designed to fit RS’ data; it won’t necessarily work with any data. For example, suppose we had the following string:

123aaaa|159|468|572|843|aaaa456

This pattern will correctly find the values 159, 468, 572, and 843 in that string; however, it will also find the values 123 and 456. Why? Because it’s simply searching for any three-digit values, regardless of whether or not those values are enclosed by pipe characters.

To be honest, we could continue to create more and more complicated scenarios that would require more and more complicated regular expression patterns. For now, however, we’ll just show you how to deal with this particular problem. Here’s a script that will extract only those three-digit numbers that “begin” with a pipe character:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True   
objRegEx.Pattern = "\|\d{3}"

strSearchString = "123aaaa|159|468|572|843|aaaa456"

Set colMatches = objRegEx.Execute(strSearchString)  

For Each strMatch in colMatches  
    strValue = strMatch.Value
    strValue = Replace(strValue, "|", "") 
    Wscript.Echo strValue
Next

In order to get this to work we had to make a couple of modifications to our original script. For one thing, we changed the pattern to this:

objRegEx.Pattern = "\|\d{3}"

We’re still looking for a three-digit number here; however, that three-digit number must be preceded by a | character. (Like we said, because the pipe is a reserved character it must be escaped [prefaced] by a \.) With this regular expression we won’t match the number 123 at the beginning of the string. Why not? You got it: because that three-digit number is not preceded by a pipe character.

That raises a good question: wouldn’t it be more correct to search for a three-digit value bounded on both sides by a pipe character? Yes, it would. However, that introduces a new problem. Suppose we match this value |159|. That’s fine, except the trailing | has now been used. That means that we won’t match |468|; as far as the script is concerned |468| doesn’t even exist. Instead, the script believes we have these items in the string (discounting all the extraneous characters):

|159|

468

|572|

843|

Are there ways to work around that? Yes, but they go beyond what we can cover today. And beyond what RS really needs anyway.
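If you want to see that “used-up” pipe character in action, here’s a minimal sketch that runs a pattern requiring a pipe on both sides against RS’ sample string. The pattern and the variable name objMatch are ours, chosen just for this demonstration:

```vbscript
' Sketch: a pattern that consumes a pipe on BOTH sides of the number.
Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = "\|\d{3}\|"

Set colMatches = objRegEx.Execute("|159|468|572|843|")

' Only every other number is found, because each match uses up
' the pipe that the next number would need as its leading pipe.
For Each objMatch In colMatches
    WScript.Echo objMatch.Value
Next
' Output:
' |159|
' |572|
```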

The other thing we had to do was modify our For Each loop. Because the pipe character is part of our Pattern that means the pipe character will be part of the values we find. By default, the script will return the following:

|159
|468
|572
|843

To get rid of those pipe characters we’ve stored the Value property for each match in a variable named strValue, then used the Replace function to remove any instances of the pipe character:

strValue = Replace(strValue, "|", "")

We then echo back the value of strValue, resulting in output that looks like this:

159
468
572
843

Which is a little more like it.
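As an alternative to calling Replace, you could wrap the digits in parentheses and read them from the SubMatches collection of each match. This is a sketch of a different technique rather than the approach used in the script above; the variable name objMatch is ours:

```vbscript
' Sketch: capture the digits in a group instead of stripping the pipe afterward.
Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
' The parentheses capture just the three digits; the pipe stays outside the group.
objRegEx.Pattern = "\|(\d{3})"

strSearchString = "123aaaa|159|468|572|843|aaaa456"

Set colMatches = objRegEx.Execute(strSearchString)

For Each objMatch In colMatches
    ' SubMatches(0) holds the text captured by the first set of parentheses.
    WScript.Echo objMatch.SubMatches(0)
Next
' Output:
' 159
' 468
' 572
' 843
```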

Like we said, we could find a way to break this regular expression as well. However, based on RS’ data this should work just fine.

Oh, what the heck, let’s do one more. (You know how it is with script writing: you can never stop at just one.) Here’s a revised script that is a little more foolproof than the ones we just showed you:

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True   
objRegEx.Pattern = "\|\d{3}(?=\|)"

strSearchString = "12|3aaaa|159333|468|572|843|aaaa|456"

Set colMatches = objRegEx.Execute(strSearchString)  

For Each strMatch in colMatches  
    strValue = strMatch.Value
    strValue = Replace(strValue, "|", "") 
    Wscript.Echo strValue
Next

Notice the pattern that we’re using here; we’ve added this construction to the end: (?=\|). This is “positive lookahead” syntax: it tells the script to look and see if there’s a pipe character following the three-digit number. However, it also tells the script not to include that character in the final match value; thus we get a match like |468 rather than |468|. That has the effect of freeing the pipe character to be used as part of the next search.

Like we said, this starts to get a little more complicated, and we won’t bore you with all the details, at least not today. However, the positive lookahead approach has one major advantage over the script that simply looked for a pipe character followed by three digits. Run against the search string in this new script, that earlier script would include |159 as a match. Why? Because it features a pipe character followed by a three-digit number. Of course, 159 is really just a part of 159333; consequently, it shouldn’t be included in any matches. For better or worse, however, our regular expression simply does what we tell it to: it looks for a pipe character followed by three digits, and it reports back what it finds.

By adding the positive lookahead we avoid that issue; now those three digits must be both preceded and followed by a pipe character. If they aren’t, they won’t be considered a match. Suppose our search string looks like this:

12|3aaaa|159333|468|572|843|aaaa|456

With that search string we’ll get back the following matches:

468
572
843

Now that’s cool.
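To compare the two patterns side by side, here’s a small sketch that runs each one against that same search string. The helper function ListMatches and its comma-separated output are our own additions for illustration purposes:

```vbscript
' Sketch: compare the leading-pipe pattern with the lookahead pattern.
' ListMatches is a hypothetical helper used only for this comparison.
Function ListMatches(strPattern, strText)
    Dim objRegEx, colMatches, objMatch, strResult
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Global = True
    objRegEx.Pattern = strPattern
    Set colMatches = objRegEx.Execute(strText)
    strResult = ""
    For Each objMatch In colMatches
        If strResult <> "" Then strResult = strResult & ","
        ' Strip the leading pipe, just as the scripts above do.
        strResult = strResult & Replace(objMatch.Value, "|", "")
    Next
    ListMatches = strResult
End Function

strSearchString = "12|3aaaa|159333|468|572|843|aaaa|456"

WScript.Echo ListMatches("\|\d{3}", strSearchString)         ' 159,468,572,843,456
WScript.Echo ListMatches("\|\d{3}(?=\|)", strSearchString)   ' 468,572,843
```

Notice that the leading-pipe pattern also picks up 456 at the end of the string; the lookahead pattern skips it because no pipe character follows those digits.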

We hope that helps, RS; if not, let us know. In the meantime, you (and everyone else reading this column) might want to check out the Scripting Games. (That is, if you haven’t already done so.) Remember, you’re under no obligation to enter every event; enter the events you want to enter, and skip the events you don’t want to enter. Alternatively, just enter one event; entering just one event (even if you fail to successfully complete that event) makes you eligible for all the great Scripting Games prizes, including script editors, Dr. Scripto bobblehead dolls, and copies of Windows Vista Ultimate. And here’s a hint for you: Event 5 in the Advanced Division makes extensive use of regular expressions. That means this should be an easy one for you; after all, by now you should know everything there is to know about regular expressions.

Well, sort of, anyway.

At any rate, give the Scripting Games a try; we don’t mind. Sure, we already have several hundred scripts sitting around waiting to be tested. But you know what they say: the more the merrier.

Of course, that was easy for them to say; after all, they don’t have several hundred scripts sitting around waiting to be tested. But, you know how it is with script testing: you can never stop at just one.
