Data quality is important and data quality mistakes can be expensive. Imagine a situation where a company has two systems, each with their own data set. One system is for CRM, the other is a financial system. Both systems have information about customers. Now an employee needs to send an invoice to a customer. What happens if the two systems have conflicting information about the customer’s address?
Several different things might happen. If they employee notices the inconsistency, he or she might spend some time finding out why and learn that the customer moved a month ago. The employee can then update the incorrect system (or flag it for someone with appropriate permissions to change it) and send out the invoice to the right address. That’s costly in terms of wasted time. If, on the other hand, the employee doesn’t spot the problem, he or she could send an invoice to the wrong address. This could become more costly as someone may start chasing for an unpaid invoice because the customer never received it. In extreme cases, this could escalate to the point where the company is hiring in debt collectors, the customer is getting harassed about not paying a bill they never received, services get discontinued and the customer gets so fed up they go to a competitor.
Hopefully, the root of the problem would get resolved long before the situation reached that extreme but it could still build up costs along the way. At the very least, there would be delays in the payment of an invoice if it went first to the wrong address.
Another example of costs is if there is duplication. I might be listed in someone’s customer database as both Jess Meats and Jessica Meats. If that company sends out marketing fliers to all of their customers, I would get two. That might not seem like much of a problem until you think about the likelihood of this problem happening with multiple records in the systems.
There are all sorts of ways that data quality issues can waste time or add costs to businesses.
Improving the accuracy of data is one area that Microsoft have been working on with SQL Server 2012. There are two toolsets that are used for this end: Master Data Services and Data Quality Services. In this blog post, I’m focusing the first of these.
Master Data Services is a set of functionality that was introduced to SQL Server in the 2008 R2 release and improved in SQL Server 2012. MDS is Microsoft’s tool for Master Data Management. Master Data Management, or MDM, is all about creating a single true version of data. The concept is that you have a set of data, known as the master data, which is up to date and accurate.
There are two main ways that MDM is used. One is for business intelligence. Essentially, data is pulled out of the various systems in the company, cleaned up, checked for consistency, and then published in its accurate form into a data warehouse which can then be used for analysis and reporting. In this usage, MDM is all about making sure that reports are accurate and that the conclusions drawn from reporting can therefore be relied upon.
The second main form of MDM is operation MDM. In this scenario, MDM is used to ensure that data in all the systems is consistent and clean. For example, if someone updates a customer’s address in the CRM system, processes and procedures ensure that the address is also updated in the finance system, avoiding the scenario described at the start.
There are multiple ways to implement MDM, the three main versions being: transactional hub, registry and hybrid.
In the transactional hub model, you essentially replace the content databases of various systems with the master data database. The company systems write data directly to a central database which is then read by all the others. In some ways, this is the ideal MDM scenario because it ensures that all the systems are literally looking at the same source data. In reality, the technical implementation is incredibly difficult because different systems all treat data in different ways and expect it to be stored in a certain manner.
The registry model leaves the data where it is in the source systems but creates a database pointing to the data that should be viewed. In this case, the master data database doesn’t actually include the data, but instead has pointers to the correct information, allowing data to be reported on accurately. The drawback of this method is that it is read-only – information can be queried via the master data registry but not updated.
The third model is the hybrid model. In this model, data remains in the content databases of the source systems, as with the registry model, but you also have a central repository of data, as in the transactional hub model. In this case, the hub contains a cleaned store of the data, synchronised with the original source systems.
One of the key things to note is that MDM isn’t a fix-once solution. When implementing Master Data Management, whether in SQL Server or using another technology, it’s important to remember that data doesn’t stay fixed. These quality issues come into the systems as changes occur so it’s important to build in procedures to maintain the quality of the data going forward. This can be done by a mixture or automatic procedures and notifying users of changes. An MDM implementation should involve the identifying of data stewards, individuals whose job it should be to maintain certain pieces of the data. For example, a customer account manager should be responsible for maintaining that customer’s contact information, so when something changes, they will be the one to approve changes via MDM to all the systems that include that data. It’s vital that the planning of an MDM solution should include policies for how data quality will be maintained going forward.