Hey, Scripting Guy! How Can I List All the Duplicate Files in a Folder and Its Subfolders?

ScriptingGuy1

Hey, Scripting Guy! Question

Hey, Scripting Guy! How can I list all the duplicate files (that is, files with the same file names) in a folder and its subfolders?

— GK

Hey, Scripting Guy! Answer

Hey, GK. We have to apologize, but you picked a bad day to ask a question. That’s because the Scripting Guy who writes this column is working on his biography for inclusion in the “Madison Who’s Who Among Executives and Professionals, Honors Section.” Needless to say, that has to take priority over anything else, including – alas – answering questions about system administration scripting.

In case you’re wondering, the Madison Who’s Who includes “biographies of the world’s most accomplished Professionals …. Inclusion is considered by many as the single highest mark of achievement.” The world’s most highly accomplished Professionals? No wonder the Scripting Guy who writes this column received an invitation!

What’s that? OK, sure, technically the invitation was sent to the scripter@microsoft.com email address; that means that – in theory – it could have been intended for any of the Scripting Guys. But let’s look at the facts:

  • Scripting Guy Dean Tsaltas was once cornered (and held captive) by a herd of cows.

  • Scripting Guy Jean Ross once melted cheese under the broiler by putting the casserole in the oven, turning the broiler on high, then returning several hours later to see if the cheese had melted. (Note. If “incinerated” means the same as “melted” then, yes, the cheese had melted.)

  • Scripting Guy Peter Costantini – oh, come on now, be serious. Peter Costantini?!? In Who’s Who?!? You’re kidding, right?

The point is this: does any of that sound like the actions of the world’s most highly accomplished Professionals? We didn’t think so, either.

Which can mean only one thing: the invitation was meant for the Scripting Guy who writes the column, the only Scripting Guy who truly qualifies as a successful individual “in the fields of Medicine, Business, Education, the Arts & Sciences, Research, Healthcare, Law, Engineering, and many other professions.” It’s about time the Scripting Guy who writes this column gets the recognition he so richly deserves.

Editor’s Note: Didn’t the Scripting Guy who writes this column just recently admit to successfully hiding a present from himself? Does that sound like the action of the world’s most highly accomplished Professional? We didn’t think so, either.

Of course, the downside to receiving this prestigious honor is that we can’t answer your question, GK, at least not today. As long as you’re here, however, could you do us a favor? Could you read the following rough draft and tell us what you think:

The Scripting Guy who writes that column has long been a successful individual in the fields of Medicine, Business, Education, the Arts & Sciences, Research, Healthcare, Law, Engineering, and many other professions. Perhaps his most notable achievement to date is the following script, one that lists all the duplicate files in a folder and its subfolders:

Set objDictionary = CreateObject("Scripting.Dictionary")
Set objFSO = CreateObject("Scripting.FileSystemObject")

strStartFolder = "C:\Scripts"

Set objFolder = objFSO.GetFolder(strStartFolder)

Set colFiles = objFolder.Files

For Each objFile in colFiles
    strName = objFile.Name
    strPath = objFile.Path

    If Not objDictionary.Exists(strName) Then
        objDictionary.Add strName, strPath  
    Else
        objDictionary.Item(strName) = objDictionary.Item(strName) & ";" & strPath
    End If
Next

ShowSubfolders objFSO.GetFolder(strStartFolder)

For Each strKey in objDictionary.Keys
    strFileName = strKey
    If InStr(objDictionary.Item(strFileName), ";") Then
        arrPaths = Split(objDictionary.Item(strFileName), ";")
        Wscript.Echo strFileName
        For Each strFilePath in arrPaths
            Wscript.Echo strFilePath
        Next
        Wscript.Echo
    End If
Next

Sub ShowSubFolders(Folder)
    For Each Subfolder in Folder.SubFolders
        Set objFolder = objFSO.GetFolder(Subfolder.Path)
        Set colFiles = objFolder.Files

        For Each objFile in colFiles
            strName = objFile.Name
            strPath = objFile.Path

            If Not objDictionary.Exists(strName) Then
                objDictionary.Add strName, strPath
            Else
                objDictionary.Item(strName) = objDictionary.Item(strName) & ";" & strPath
            End If
        Next

        ShowSubFolders Subfolder
    Next
End Sub

That’s all we have so far. What do you think?

No doubt many of you are sitting there wondering, “Gee, what is it that separates the successful Scripting Guy from his less-successful peers?” One difference is the fact that the Scripting Guy who writes this column takes the time to explain what he’s doing, and why. For example, none of the other Scripting Guys are going to explain to you how the preceding script works. (Although, in all fairness to Dean he probably would explain the thing if the cows would just let him go.) By contrast, the Scripting Guy who writes this column will note we start out by creating instances of the Scripting.Dictionary and Scripting.FileSystemObject objects. After the objects have been created we then assign the path of the starting folder (C:\Scripts) to a variable named strStartFolder:

strStartFolder = "C:\Scripts"

From there we use the following line of code (and the FileSystemObject) to bind to the starting folder:

Set objFolder = objFSO.GetFolder(strStartFolder)

Once the connection has been made, we reference the Files property in order to retrieve a collection of all the files found in C:\Scripts:

Set colFiles = objFolder.Files

Note. What about the subfolders of C:\Scripts? We’ll get to those in a second. But first things first.

Our next task is to loop through the collection of files; for each file in the collection we use the following two commands to assign the file Name to the variable strName and the file Path to the variable strPath:

strName = objFile.Name
strPath = objFile.Path

That brings us to this block of code:

If Not objDictionary.Exists(strName) Then
    objDictionary.Add strName, strPath  
Else
    objDictionary.Item(strName) = objDictionary.Item(strName) & ";" & strPath
End If

What we’re doing in line 1 is checking to see if we have a Dictionary entry for the first file in the collection. (We’ll assume this is a file named Test.txt.) If no entry with that file name exists yet, we call the Add method to add the file to the Dictionary object, using the file Name as the Dictionary key and the file Path as the Dictionary value.

Note. No, that isn’t silly; after all, successful individuals don’t do silly things. (Well, as a general rule, anyway.) With any luck the method behind our madness will become clear in just a second. Remember, patience is a virtue. We can tell you, however, that we chose to use the Dictionary object because it’s very easy to determine whether a given key (in our case, a file name) already exists in the Dictionary. That’s a key part of this script; after all, we’re looking for duplicate items, items that exist in multiple places.

Now, what if the file does exist in the Dictionary? Obviously, that won’t be the case in the target folder; as everyone knows, you can’t have two files named Test.txt in the same folder. But suppose we found the file C:\Scripts\Test.txt and then, later on, we found the file C:\Scripts\Test Folder\Test.txt. Different file paths, but identical file names. For this exercise, at least, that’s our definition of duplicate files.

If we do find a duplicate file we simply append the file path to the path already in the Dictionary:

objDictionary.Item(strName) = objDictionary.Item(strName) & ";" & strPath

What does that do for us? Well, before we found the second file our Dictionary Item had the following value:

C:\Scripts\Test.txt

Now, it has this value:

C:\Scripts\Test.txt;C:\Scripts\Test Folder\Test.txt

As you can see, both file paths are there, separated by a semicolon. If we find a third Test.txt we’ll simply append a semicolon and that third path to the Dictionary Item. We can continue to add paths as many times as we need to.
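If you’d like to watch that happen without touching the disk, here is a minimal sketch that runs the same Exists/Add/append logic over a few hard-coded paths. The array arrSamplePaths is made-up sample data, not output from a real folder. The sketch also sets the Dictionary’s CompareMode property to vbTextCompare, which the script above does not do; we add it here only as an assumption about what you might want, because Dictionary keys are case-sensitive by default while Windows file names are not.

```vbscript
' Minimal sketch: group sample paths under their file names.
' arrSamplePaths is hypothetical sample data, not a real folder listing.
Set objDictionary = CreateObject("Scripting.Dictionary")

' Assumption (not in the original script): treat Test.txt and test.txt
' as the same key. CompareMode must be set while the Dictionary is empty.
objDictionary.CompareMode = vbTextCompare

arrSamplePaths = Array("C:\Scripts\Test.txt", _
                       "C:\Scripts\Other.txt", _
                       "C:\Scripts\Test Folder\test.txt")

For Each strPath In arrSamplePaths
    ' Everything after the last backslash is the file name.
    strName = Mid(strPath, InStrRev(strPath, "\") + 1)

    If Not objDictionary.Exists(strName) Then
        ' First time we have seen this name: store its path.
        objDictionary.Add strName, strPath
    Else
        ' Name seen before: append the new path after a semicolon.
        objDictionary.Item(strName) = objDictionary.Item(strName) & ";" & strPath
    End If
Next

' Show each key and the path (or paths) stored with it.
For Each strKey In objDictionary.Keys
    Wscript.Echo strKey & " -> " & objDictionary.Item(strKey)
Next
```

Run under CScript, the entry for Test.txt should hold both Test.txt paths separated by a semicolon, while Other.txt keeps its single path.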

OK. That bit of code lets us get at all the files in the target folder (C:\Scripts); however, it doesn’t do anything with the files in any of the target folder’s subfolders. But don’t panic; that’s what this line of code is for:

ShowSubfolders objFSO.GetFolder(strStartFolder)

What we’re doing here is calling a recursive subroutine named ShowSubfolders. We aren’t going to talk about recursion in any detail today; for more information, take a look at this section of the Microsoft Windows 2000 Scripting Guide. For now, suffice to say that a recursive subroutine is a subroutine that can call itself. What does that mean? Well, it means that we can (and will) ask ShowSubfolders to return a collection of all the subfolders of C:\Scripts. Furthermore, by simply calling the subroutine from inside the ShowSubfolders subroutine, we can get ShowSubfolders to return a collection of all the subfolders of each of those subfolders. And then we can get it to return a collection of all the subfolders of all those sub-subfolders. And – well, you get the idea.

Inside the ShowSubfolders subroutine the very first thing we do is set up a For Each loop to loop through each of the “top-level” subfolders in C:\Scripts:

For Each Subfolder in Folder.SubFolders

Note. What’s a “top-level” subfolder? That’s a folder one step down from the main folder. C:\Scripts\Test is a top-level folder. Any subfolders of C:\Scripts\Test (e.g., C:\Scripts\Test\SecondLevel) are not top-level folders, at least not for C:\Scripts. The SubFolders property returns only the immediate subfolders of a folder.
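To see that for yourself, here is a minimal sketch that echoes only the immediate subfolders of a folder. It assumes C:\Scripts exists on your computer; objSubfolder is just the name we picked for the loop variable.

```vbscript
' Minimal sketch: list only the immediate ("top-level") subfolders.
' Assumes C:\Scripts exists; change the path to suit your computer.
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFolder = objFSO.GetFolder("C:\Scripts")

For Each objSubfolder In objFolder.SubFolders
    ' Folders nested more deeply do not appear in this collection.
    Wscript.Echo objSubfolder.Path
Next
```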

So what do we do inside the loop? The exact same thing we just finished doing: we bind to the first subfolder, get the files, and add the files to the Dictionary object. We then call the ShowSubfolders subroutine again, even though we’re currently inside that subroutine. Fortunately, this won’t upset the space-time continuum (as you might have expected); instead, this recursive call simply works through the subfolders of the first subfolder in the collection. Through some sort of magical process we don’t even pretend to understand, the ShowSubfolders subroutine will dutifully search every subfolder found anywhere within C:\Scripts, add those files to the Dictionary object, then – as soon as the search is complete – return us to the main body of the script.

Admittedly, it hurts our head, too; it’s really hard to picture how this all works. Our suggestion? Don’t worry too much about it. The Scripting Guys just accept the fact that it does work and move on to bigger and better things.
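If it helps, the recursion can be easier to picture when a single subroutine does both jobs: record the files in the folder it is handed, then call itself for each immediate subfolder. The following is a sketch of that alternative layout, not the script shown above; the name ScanFolder and its parameter objCurrentFolder are ones we made up for illustration.

```vbscript
' Sketch of an alternative layout (not the original script):
' ScanFolder is a hypothetical name for one recursive subroutine.
Set objDictionary = CreateObject("Scripting.Dictionary")
Set objFSO = CreateObject("Scripting.FileSystemObject")

strStartFolder = "C:\Scripts"

' One call handles the start folder and everything beneath it.
ScanFolder objFSO.GetFolder(strStartFolder)

Wscript.Echo objDictionary.Count & " distinct file names found."

Sub ScanFolder(objCurrentFolder)
    ' Local variables, so each level of recursion has its own copies.
    Dim objFile, objSubfolder

    ' Step 1: record every file in this folder.
    For Each objFile In objCurrentFolder.Files
        If Not objDictionary.Exists(objFile.Name) Then
            objDictionary.Add objFile.Name, objFile.Path
        Else
            objDictionary.Item(objFile.Name) = _
                objDictionary.Item(objFile.Name) & ";" & objFile.Path
        End If
    Next

    ' Step 2: do the same for each immediate subfolder. The recursion
    ' stops by itself when a folder has no subfolders left to visit.
    For Each objSubfolder In objCurrentFolder.SubFolders
        ScanFolder objSubfolder
    Next
End Sub
```

The design choice here is simply to avoid writing the file-handling code twice; the Dictionary ends up with the same contents either way.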

What kind of bigger and better things? Well, how about code that displays the duplicate files and their file paths:

For Each strKey in objDictionary.Keys
    strFileName = strKey
    If InStr(objDictionary.Item(strFileName), ";") Then
        Wscript.Echo strFileName
        arrPaths = Split(objDictionary.Item(strFileName), ";")
        For Each strFilePath in arrPaths
            Wscript.Echo strFilePath
        Next
        Wscript.Echo
    End If
Next

What we’re doing here is looping through each Key in our Dictionary, using this line of code to assign the Key name to a variable named strFileName:

strFileName = strKey

Note. Yes, that is a somewhat-superfluous line of code. But we thought it might help you to keep track of what’s going on.

Next we use the following line of code, and the InStr function, to see if the corresponding Dictionary Item contains a semicolon:

If InStr(objDictionary.Item(strFileName), ";") Then

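Why do we do that? Well, the semicolon is the separator we used when we appended one path to another, and we’re assuming that no path contains a semicolon of its own. (Windows does allow semicolons in file and folder names, so that is an assumption rather than a guarantee.) Therefore, if we encounter a semicolon we take that to mean only one thing: this is a duplicate file (that is, the same file name was found in multiple folders). With that in mind we echo back the file name (e.g., Test.txt), then use the Split function to convert the Dictionary Item to an array, an array consisting of all the file paths for our duplicate file name: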

arrPaths = Split(objDictionary.Item(strFileName), ";")

Once we’ve done that we can set up a For Each loop to echo back each of the values (file paths) in the array. After inserting a blank line (Wscript.Echo) we simply loop around and repeat this process with the next Key in the Dictionary.
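Here is a small sketch of that InStr-then-Split step on a hard-coded value, so you can try it without scanning a folder. Because Windows permits semicolons in file and folder names but does not permit the pipe character, this sketch uses a pipe as the separator; that is our substitution, not what the script above does. The names strSeparator and strItem, and the sample paths, are hypothetical.

```vbscript
' Sketch: detect and split a multi-path value using a pipe separator.
' strSeparator, strItem and the sample paths are hypothetical.
strSeparator = "|"
strItem = "C:\Scripts\a;b.txt|C:\Scripts\Test Folder\a;b.txt"

' InStr returns 0 when the separator is absent, that is, when
' only one path was stored for this file name.
If InStr(strItem, strSeparator) > 0 Then
    arrPaths = Split(strItem, strSeparator)
    Wscript.Echo "Paths found: " & (UBound(arrPaths) + 1)

    For Each strFilePath In arrPaths
        Wscript.Echo strFilePath
    Next
End If
```

To use a pipe in the full script you would have to change the separator everywhere a path is appended as well as where the Item is tested and split.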

The end result is a report listing any duplicate files along with their file paths:

x.vbs
C:\scripts\x.vbs
C:\scripts\New Folder\x.vbs
C:\scripts\Test Folder\x.vbs

y.ps1xml
C:\scripts\y.ps1xml
C:\scripts\New Folder\y.ps1xml
C:\scripts\Test Folder\y.ps1xml

z.ps1
C:\scripts\z.ps1
C:\scripts\New Folder\z.ps1
C:\scripts\Test Folder\z.ps1

Cool, huh?

Two quick notes. First, this script is designed to work only on the local computer; that’s because the FileSystemObject is designed to work only on the local computer. Could we use WMI and run the script against a remote computer? Yes, although the resulting script is more complicated than we wanted to deal with today. For a hint on how this might be done, however, take a look at this Hey, Scripting Guy! column.
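As a rough starting point only, here is a sketch of how WMI can list the files in a single folder by querying the CIM_DataFile class. It is not a remote version of the duplicate finder: it covers one folder and no subfolders, which would need additional queries that we don’t show. The value of strComputer is a placeholder (a dot means the local computer), and the C:\Scripts folder is assumed to exist on the target computer.

```vbscript
' Sketch only: list the files in one folder through WMI.
' strComputer is a placeholder; "." means the local computer.
strComputer = "."

Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\cimv2")

' In a WQL query each backslash inside the Path value is doubled.
Set colFiles = objWMIService.ExecQuery _
    ("Select * From CIM_DataFile Where Drive = 'C:' And Path = '\\Scripts\\'")

For Each objFile In colFiles
    ' FileName leaves out the extension, so rebuild the full file name.
    strName = objFile.FileName
    If objFile.Extension <> "" Then
        strName = strName & "." & objFile.Extension
    End If

    ' For CIM_DataFile the Name property is the full path to the file.
    Wscript.Echo strName & vbTab & objFile.Name
Next
```

The strName and objFile.Name values play the same roles as the file Name and Path in the FileSystemObject script, so they could be fed into the same Dictionary logic.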

Also, keep in mind that this script keeps all the file names and paths in memory. If you have a few hundred, or even a few thousand, files in a folder and its subfolders that’s no big deal. If you have several million files in a folder, well, then you might want to think about trying a different approach (for example, storing names and paths in a database rather than a Dictionary object).

Oh, and you might also want to think about coming up with a better filing system.

So what do you think, GK; does that sound all right to you? After all, we want this to be good; as the invitation noted, many people consider inclusion in the Who’s Who “…the single highest mark of achievement.” (Or, for some of us, their only achievement.) To be honest, this isn’t quite as exciting as the day we won the Spanish lottery. But it’s close.
