How Can I Remove All Duplicate Lines From a Text File?

ScriptingGuy1

Hey, Scripting Guy! Question

Hey, Scripting Guy! How can I remove all duplicate lines from a text file?

— SW

SpacerHey, Scripting Guy! AnswerScript Center

Hey, SW. You know, to be a Scripting Guy means to embark on a never-ending quest to find the ultimate solution to a given problem. (Or at least that’s what we tell our manager when he asks why we never actually seem to finish anything: “But boss, never-ending quests take time!”) That’s why we were glad to see your question. Awhile back we answered a similar question, one about removing duplicate names from a text file. The solution we came up with was simple enough, and it worked just fine; we just weren’t convinced that it was the best solution. And now, thanks to your question, we can take another shot at this. As to whether this is a better/faster/easier solution than the one we offered before, well, we’ll leave that up to you.

To begin with, we assume you have a text file where each line represents a separate record. It’s unlikely, but maybe your file looks like this:

This is one of the lines in the text file.
This is one of the lines in the text file.
This is another line in the text file.
This is one of the lines in the text file.
This is yet another line in the text file.
This is another line in the text file.
This is another line in the text file.
This is one of the lines in the text file.

You want a script that can weed out all the duplicate lines and provide you with output similar to this:

This is one of the lines in the text file.
This is another line in the text file.
This is yet another line in the text file.

SW, you came to the right place:

Const adOpenStatic = 3
Const adLockOptimistic = 3
Const adCmdText = &H0001

Set objConnection = CreateObject(“ADODB.Connection”) Set objRecordSet = CreateObject(“ADODB.Recordset”)

strPathToTextFile = “C:\Scripts\” strFile = “Test.txt”

objConnection.Open “Provider=Microsoft.Jet.OLEDB.4.0;” & _ “Data Source=” & strPathtoTextFile & “;” & _ “Extended Properties=””text;HDR=NO;FMT=Delimited”””

objRecordSet.Open “Select DISTINCT * FROM ” & strFile, _ objConnection, adOpenStatic, adLockOptimistic, adCmdText

Do Until objRecordSet.EOF Wscript.Echo objRecordSet.Fields.Item(0).Value objRecordSet.MoveNext Loop

We find this script kind of interesting because we’re using ActiveX Data Objects (ADO) and treating this text file as if it were a database. We won’t spend a lot of time detailing how you treat a text file as if it were a database; if you’d like to learn more about that we have a Scripting Clinic column covering that topic in detail. For now, suffice to say that we’re working with the text file C:\Scripts\Test.txt, something we indicate by assigning the appropriate values to the variables strPathToTextFile and strFile:

strPathToTextFile = “C:\Scripts\”
strFile = “Test.txt”

So how does that enable us to eliminate duplicate lines? Well, there’s a kind of database query known as Select DISTINCT; what Select DISTINCT allows you to do is select all the distinct (or unique) records in a table. Suppose you had a simple database with these entries:

Red
Red
Blue
Red

If you use a Select DISTINCT query, you’ll get back a recordset consisting only of the unique records:

Red
Blue

No doubt you’re thinking, “Wow: getting back the unique records is pretty much the same thing as eliminating duplicate records.” And we’ll admit that – whoa, wait a second: you’re absolutely right. Our text file is constructed like a database table, with each line in the text file representing a single field in a single record. If we run a Select DISTINCT query against this text file, we’ll get back only the unique lines. In fact, we’ll get back a recordset that looks like this:

This is one of the lines in the text file.
This is another line in the text file.
This is yet another line in the text file.

That’s just exactly the information we wanted to get back. Good thing you pointed that out to us!

After retrieving our recordset we then use this code to echo the unique lines back to the screen:

Do Until objRecordset.EOF
    Wscript.Echo objRecordset.Fields.Item(0).Value   
    objRecordset.MoveNext
Loop

If we wanted to, we could also use the FileSystemObject to open the text file and replace the existing contents with only the unique lines; that would have the effect of removing all duplicate lines from the text file. (It would be cool if we could use some sort of Update query to do that, but when it comes to text files ADO is read-only.)

So is this the last word on removing duplicate items – be they names or entire lines – from a text file? Hey, who knows: after all, never-ending quests take time! (Actually, we find that they take about 2-to-3 days. After that we get bored and move on to something else.)

0 comments

Discussion is closed.

Feedback usabilla icon