Summary: Microsoft Scripting Guy, Ed Wilson, talks about using Windows PowerShell with munged data.
Microsoft Scripting Guy, Ed Wilson, is here. Last week, the Scripting Wife and I had the opportunity to attend the Charlotte SQL Users Group. This is always a highlight, because that user group is one of the largest SQL User Groups around.
I was speaking, and I decided to come up with something a little different. In the past when I have spoken, I have talked about using Windows PowerShell to manage SQL Server, or about best practices for SQL Server admins, and those sorts of things. This time, I wanted to do something that I thought might hit a little closer home. I decided to talk about working with munged data. The topic was a success, and so I thought I would turn it into a series of Hey, Scripting Guy! Blog posts. I am also adding bits, based on feedback from the dozens of questions I received during and following the presentation.
Munged data—we all see it, dude—all the time
It is absolutely no challenge at all to take a CSV file that contains well-formed data, and import it into your SQL Server database or other repository. If the field names line up and the data types line up, dude, no problem at all. In fact, there are dozens of ways to do this, starting with nearly automatic and super simple.
But we do not all live in a well-formed world. In fact, it seems that some databases were designed by people who have absolutely no idea of what a database is, who have never heard about the rules of database normalization, think that Codd was a baseball player, believe that a relational database has something to do with online dating services, and who quite frankly could not even spell DBA. That these databases are destined to become mission critical is a given; that these databases will end up as part of your migration projection is inevitable.
An example of munged data
In the following image, I show a data dump from a rather poorly designed database. The fields are Name, day, month, year, and address.
In addition to the composite fields, we also have data inconsistently entered into the database. For the names, we have first name then last name, a last name then a first name, and all lower-case letters. The year field has a combination of two-digit and four-digit numbers. The address field has all capitalization for the street name, upper case and lower case, and in some cases, a mixture of both. The states are spelled out or two letter abbreviations, with or without a period, and uppercase or lowercase.
The challenge will be to fix this data so it can be imported into a different database.
How did this happen?
To begin, the database was obviously designed by someone who did not know anything about databases. It is common that the name fields would be broken into first name, middle name, and last name. These days, this is even more of a challenge because some people have four or five name parts, without even considering things like Junior, Senior, or ordinals.
How you handle this enters in to your database design, and it should be well thought out prior to actual implementation. The address field is normally broken into street, city, state, country, and postal code. But there may be additional information that should be entered, such as a building name, suite number, or apartment number.
One thing that is painfully obvious is that the data entry interface was basically a text box, and that the database designer has no rules to enforce consistency. As a result, we have the complete state name, upper-case letters, lower-case letters, and periods for abbreviations (or not).
A good interface implements rules to check data to ensure that it is entered correctly. It also takes advantage of options like drop-down lists that ensure the data will be correct. It is even possible to avoid adding a postal code field. Many web services will accept a street address, city, and state, and then automatically return the postal code. Therefore, the postal could be automatically populated and a potential source of error can be avoided.
One thing to keep in mind, however, is that sometimes an interface can be too restrictive. An example, is matching a telephone area code with a city. For example, the Scripting Wife’s cell phone has a Cincinnati area code, and yet we live in Charlotte. Because we do not have a land line for our phone, any web form that requests a telephone number is populated with this cell phone number. There have been many times when a web form tells us that the telephone number is invalid, because it is not a Charlotte area code. Some pizza places that permit online ordering tell us that we are out of their delivery area, so they miss out on our business. So when you are creating your design, keep in mind that you can be too smart and end up with a dumb interface.
Read the data with Windows PowerShell
The first thing I do is use Import-CSV to read the data contained in my data dump. I store it in a variable so I can look around at the data. The following image illustrates this:
I can see that the data is pretty bad and that I will have to spend some time cleaning it up. The first thing I will do is fix the date. Microsoft SQL Server can use a DateTime object for the date field, and it will make future calculations easier to work with if I clean up that field first. I will do that tomorrow. Join me then.
That is all there is to using Windows PowerShell to read munged data. Data Manipulation Week will continue tomorrow when I will talk about fixing the dates from the file.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at firstname.lastname@example.org, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy