Hey, Scripting Guy! How Can I Remove Duplicate Values From a Pair of Text Files?

ScriptingGuy1

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have two text files that contain numbers. I need to write a script that can determine whether a given number exists in both of these text files. If it does, I need to delete that line from each of the two files. How do I do that?

— KT

SpacerHey, Scripting Guy! AnswerScript Center

Hey, KT. Before we get started today, we have a little business to take care of. First of all, we’d like to thank those of you who have joined our new Scripting Guys Facebook group. We decided to start a Facebook group largely because we wanted to experiment with different ways of interacting with, and getting information to, system administration scripters.

Oh, and, while we were at it, we also wanted to get more friends (and more group members) than either TechNet or MSDN. Well, so far so good: at last report TechNet had 43 group members, MSDN had 84, and the Scripting Guys – despite getting into the game much later than the other two – had 90.

Oops. Make that 94 members.

Note. You know, that’s a good idea: all things considered, maybe TechNet should be a subsidiary of the Script Center rather than the other way around. We’ll bring that up the next time we meet with the folks from TechNet. Can’t wait to see their faces when we suggest that.

Incidentally, we realize that, up to this point, we haven’t done much with our Facebook group. That will change come January. In January, we hope to start posting a few informative videos; we also plan to start using the group as a way to keep people informed about important developments in the scripting world. You know, things like the 2008 Winter Scripting Games. (February 15-March 3, 2008. Thanks for asking.)

Speaking of the Scripting Games, if you received a perfect score in any one of the four divisions in the 2007 Games (VBScript Beginner; VBScript Advanced; PowerShell Beginner; PowerShell Advanced) you still have plenty of time to submit your biography (and picture) for our Profiles in Perfection series. We’ll start featuring people on January 2nd, so the sooner you can get your information to us the better.

Oh, right: just email a little bit about yourself (and any pictures you don’t mind being posted on the Internet) to scripter@microsoft.com (in English, if possible). We’ll take it from there.

One last note: the IT Forum Challenge is now officially over, and the final scores are being tabulated. And yes, we know, we had hoped to post a running tally of scores along the way. Let’s just say that the past couple weeks have been … interesting … ones for the Scripting Guys and leave it at that. We can tell you this, however: if you got a score of 701 then you tied for first. And, as near as we can tell, quite a few of you tied for first.

So what are we going to do about that? Well, according to the rules, ties are broken by a random drawing: we’ll toss the names of all the high scorers in a hat and draw out 10 names. If your name is drawn you’re a winner; if you’re name is not drawn – well, then you’re not a winner.

You know what? That doesn’t seem very fair to us, either. So here’s what we’re going to do. We’re going to draw 10 names and those 10 people will receive a Dr. Scripto bobblehead; that’s pretty much all the bobbleheads we have left. But don’t despair; the other top scorers will receive a comparable prize, but they’ll have to wait until after the Scripting Games to receive that prize. But trust us, it will be worth the wait.

Note. So what is the prize? Sorry, but we can’t tell you that. All we can tell you, for now, is that it will be the same prize as the big giveaway in the 2008 Scripting Games. And you know that’s going to be good.

Whew; that’s more work than the Scripting Guy who writes this column usually does in a month. And that’s not even the half of it; after all, we haven’t begun to address today’s question. Which means that maybe we ought to get started on that.

For the sake of illustration, we’re assuming that KT has two text files that look something like these:

1
2
3
9
15
19
20

1
4
9
15
19
20
21
29

What KT needs to do is delete all the numbers that appear in both files (1, 9, 15, 19, and 20). In other words, when all is said and done, he’d like the two text files to look like this:

2
3

4
21
29

How can he do that? Well, here’s one way:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject(“Scripting.FileSystemObject”)

Set objFile1 = objFSO.OpenTextFile(“C:\Scripts\File1.txt”, ForReading) strText1 = objFile1.ReadAll objFile1.Close

Set objFile2 = objFSO.OpenTextFile(“C:\Scripts\File2.txt”, ForReading) strText2 = objFile2.ReadAll objFile2.Close

arrItems1 = Split(strText1, vbCrLf) arrItems2 = Split(strText2, vbCrLf)

Set objDictionary = CreateObject(“Scripting.Dictionary”)

For Each strItem2 in arrItems2 objDictionary.Add strItem2, strItem2 Next

For i = 0 to Ubound(arrItems1) If objDictionary.Exists(arrItems1(i)) Then objDictionary.Remove(arrItems1(i)) arrItems1(i) = “Void” End If Next

For Each strItem in arrItems1 If strItem <> “Void” Then strNewText1 = strNewText1 & strItem & vbCrLf End If Next

Set objFile1 = objFSO.OpenTextFile(“C:\Scripts\File1.txt”, ForWriting) objFile1.Write strNewText1 objFile1.Close

For Each strKey in objDictionary.Keys strNewText2 = strNewText2 & strKey & vbCrLf Next

Set objFile2 = objFSO.OpenTextFile(“C:\Scripts\File2.txt”, ForWriting) objFile2.Write strNewText2 objFile2.Close

So how does this script work? Let’s see if we can figure that out. We start off by defining a pair of constants, ForReading (which we’ll need in order to open and read from each of our text files) and ForWriting (which we’ll need in order to open and write to each of our text files). After we define the two constants, we use the following line of code to create an instance of the Scripting.FileSystemObject object:

Set objFSO = CreateObject(“Scripting.FileSystemObject”)

That brings us to these three lines of code:

Set objFile1 = objFSO.OpenTextFile(“C:\Scripts\File1.txt”, ForReading)
strText1 = objFile1.ReadAll
objFile1.Close

All we’re doing here is using the FileSystemObject to open the file C:\Scripts\File1.txt for reading. As soon as the file is open, we use the ReadAll method to read the entire contents of the file into memory, storing that information in a variable named strText1. Once we have a virtual copy of the file tucked away into memory, we then call the Close method to close File1.txt.

And then we repeat this process with our second file (C:\Scripts\File2.txt), storing that information in a variable named strText2. The net result? We now have two variables – strText1 and strText2 – that contain the contents of our two text files.

That’s great, but what exactly are we going to do with those two variables? Well, the first thing we’re going to do is execute this line of code:

arrItems1 = Split(strText1, vbCrLf)

What we’re doing here is using the VBScript Split function to convert the values in strText1 into an array, creating a new array item each time we encounter a carriage return-linefeed (vbCrLf). Why would we do that? Well, when we’re all done arrItems1 will be equal to this:

1
2
3
9
15
19
20

In other words, we now have an array that just happens to be a perfect replica of the file File1.txt.

Yes, we know: that still doesn’t tell you what we’re going to do with this array, does it? We’ll get to that in just a second. Before we do, however, we need to mention that we next do the exact same thing with the variable strText2, this time creating an array named arrItems2. That gives us an array that perfectly replicates the text file File2.txt.

Note. Couldn’t we have created these arrays by reading in the files line-by-line, adding each line to the appropriate array as it was read from the text file? Sure. We just thought this approach was a tiny bit easier, especially for people with limited experience working with arrays.

As you may or may not remember, our goal today is to delete any values that appear in both File1.txt and File2.txt. That’s a little trickier than it might first sound. Why? Well, the first value in File1.txt is the number 1. That might lead you to believe that we could just search for the number 1 in both files and, if found, delete any instances of that number. But that’s not going to work. After all, if we simply delete all instances of the number 1 we’re going to turn the value 212 into this:

22

Not good. Of course, the number 1 does appear on a line by itself; can’t we just search for a 1 followed by a carriage return-linefeed? Unfortunately, the answer here is the same: no. Why not? Because a number like 651 includes a 1 followed by a carriage return-linefeed. If we searched for a 1 followed by a carriage return-linefeed our script would end up turning 651 into this:

65

Eep.

So what are we going to do about that?

Oh, you mean other than give up and tackle an easier problem, don’t you?

Well, the solution we came up with (and no doubt there are many other ways to do this) involves creating an instance of the Scripting.Dictionary object. Once we have that object in hand, we then use this block of code to loop through all the values in the array arrItems2 and add each of those values to the Dictionary:

For Each strItem2 in arrItems2
    objDictionary.Add strItem2, strItem2  
Next

Note. We’re assuming that values are unique in each file; that is, File1.txt will contain only one instance of the number 322. If values are not unique, well, that definitely complicates the script. If anyone needs to tackle that problem, let us know and we’ll see what we can do about it.

That brings us to this block of code, a block of code that makes up the heart and soul of the script:

For i = 0 to Ubound(arrItems1)
    If objDictionary.Exists(arrItems1(i)) Then
         objDictionary.Remove(arrItems1(i))
         arrItems1(i) = “Void”
    End If
Next

Admittedly, there are just six lines of code here; however, all sorts of action and excitement are packed into these six lines. As you can see, we kick things off by setting up a For Next loop that loops through all the items in the array arrItems1. (The loop starts at 0, the index number of the first item in the array, and continues on through the Ubound item, the last item in the array.) Inside this loop, we use the Exists method to determine whether or not the array item in question (e.g., the number 1) can be found in the Dictionary:

If objDictionary.Exists(arrItems1(i)) Then

Suppose the Exists method returns False; that means that the value in question appears only in File1.txt; it doesn’t appear in both File1.txt and File2.txt. Because we’re only interested in values that appear in both files we simply go back to the top of the loop and repeat the process with the next item in the array.

However, suppose that the Exists method returns True. That means that the value in question does appear in both files. That also means that we need to execute these two lines of code:

objDictionary.Remove(arrItems1(i))
arrItems1(i) = “Void”

In line 1 we’re using the Remove method to remove the item in question from the Dictionary. If the item in question happens to be the number 1 that means that the number 1 will no longer be included in the Dictionary. In addition, the number 1 will no longer be included in the array arrItems1. Why not? Because in line 2 we set the value of that array item to the string Void.

Note. Why don’t we just remove the item from the array? We’d like to. But removing items from an array isn’t particularly fun. And we Scripting Guys don’t like to do things that aren’t particularly fun. Setting the value to Void is much easier, and ensures that we’ll still be able to quickly pick out the values that are unique to File1.txt.

The point of all this? Well, after we finish looping through all the items in the array arrItems1 our Dictionary will contain the following items:

4
21
29

Those happen to be the numbers that appear only in File2.txt. Any numbers that appear in both File1.txt and File2.txt have been deleted from the Dictionary.

Meanwhile, arrItems1 will look like this:

Void
2
3
Void
Void
Void
Void

As you can see, the only numbers that are left are those values that appear only in File1.txt. Numbers that appeared in both our text files have been replaced with the word Void.

All we have to do now is write these unique values back to our text files. To construct our new text for File1.txt we use this block of code:

For Each strItem in arrItems1
    If strItem <> “Void” Then
        strNewText1 = strNewText1 & strItem & vbCrLf
    End If
Next

Here we’re grabbing all the values in arrItems1 that are not equal to Void (in this case, the numbers 2 and 3). To create the new text we simply add each of these values, plus a carriage return-linefeed to a variable named strNewText1:

strNewText1 = strNewText1 & strItem & vbCrLf

After that we reopen the file C:\Scripts\File1.txt (this time for writing), then use the Write method to write the value of strNewText1 to the file:

objFile1.Write strNewText1

Once the old contents of File1.txt have been replaced with the value of strNewText1 we then close File1.txt.

We then use this block of code to construct a similar string for File2.txt:

For Each strKey in objDictionary.Keys
    strNewText2 = strNewText2 & strKey & vbCrLf
Next

When that’s done, we write the value of strNewText2 to File2.txt. At that point, and at long last, we’re done.

One last bit of business before we go. We should note that tomorrow’s column (December 14, 2007) will be the last column of the year; we’ll resume publication on January 2, 2008. Having filed columns from such diverse locales as Walla Walla, WA; Orlando, FL; Venice, Italy; and Barcelona, Spain the Scripting Guy who writes this column is taking a well-deserved break. (Well, OK. But he’s taking a break anyway, whether he deserves it or not.) But don’t worry; after a brief hiatus he’ll definitely be back on the job in January. After all, 2008 is the year the Scripting Son starts college.

0 comments

Discussion is closed.

Feedback usabilla icon