Day 15 of my virtual advent calendar, about stuff I like in SQL Server 2008..
If there were no issues with data quality in data warehouses I would have completed many of my BI projects in a third of the time. The client would always insist they had good data, after all it would often be coming form their billing or ERP systems, but I rarely found this to be the case.
Why do I get these assurances? Because while the business user might understand their high level data (in terms of sales by store for example), they probably won’t know how it’s stored or the details of each order, while the DBA knows the storage but not necessarily anything about the data. It’s only when a BI project kicks off that these questions start to be asked. So how does a BI professional quickly get an understanding of the data? Enter the data profiling task in SQL Server 2008 Integration Services (SSIS)…
Not that the tool is a control flow, not a data flow task, but nonetheless requires a data connections. In this demo we are going to look at customer data and output our findings to an xml file. This task has several panes:
The general tab allows you to specify where the profiling data ends up, in this case NewCustomerProfile.xml. BTW the quick profile button here allows you to run a profiling job on the fly with default settings.
The Profile Requests pane is where you setup what you want to analyse..
Here I have three different kinds of analysis and I have focused on the last one to look at Value distribution. Note the connection and table name can be individually specified.
The task can then be run to generate the file, and then you can take a look at the answers. I am not to good at quickly making sense of xml and so it’s good to know there is a special data profile viewer go to ..
All Programs -> Microsoft SQL Server 2008 -> Integration Services -> Data Profile Viewer.
Looking at the distribution of values for gender …
you can see that we have three M F and Male, something we might want to clean up as part of the ETL process. The other interesting thing to do is look at NULL distribution as there is little po9int in loading in a little used column no matter how important it is.
Armed with this tool you can go back to the business and show them how clean there data is and then collaborate on how to resolve each issue.