Written by Chris Tao, Microsoft Premier Field Engineer.
Have you ever been asked the following questions when passing through the immigration counter to enter a foreign country?
- Who are you?
- Where are you from?
- Where are you going to?
- What are you doing here?
Most people can usually answer these questions without thinking about it too much. However, if one were to ask the same questions about the data in a company’s sales database, would it still be that easy? The answer will be probably “No” if enterprise IT doesn’t have a good handle on Metadata management, which is ‘data’ describing the aspects and characteristics of the data. Before introducing more details of metadata, let’s take a look at current metadata management situation in the enterprise.
Metadata Management in the Enterprise can be Painful
Over the past few decades, the technologies that support storing and managing data have improved significantly, but it’s still challenging to track data changes in ad-hoc data flows.
As Figure 1 below indicates, some data flows are controlled by systems while others are controlled manually. And in some occasions there might be new direct data connections or modifications created between systems without notifying IT data teams. It might be caused by the process in an IT organization, or losing control on how end users use the data. Ultimately it will bring various risks to your IT governance and planning.
Let’s take a sales BI reporting scenario as an example. Typically, the sales data lifecycle starts from a Point-of-Sale (POS) system. After a series of integration, cleansing and calculation activities take place in different systems, the data will finally be presented to the end-user in reports, charts, graphs, and/or documents. As Figure 2 illustrates, this lifecycle will usually contain the following components:
Figure 2 – Typical Data Lifecycle
- POS: The sales order record will be originally created by a front-end system such as a POS(Point-of-Sale) system, including all of the original raw sales data.
- ODS (Operational Data Store): POS data will be uploaded into an Operational Data Store (ODS) system. In ODS the sales data will be integrated together, cleansed by data quality tasks, and will then become a trustable and single sales transaction data source in IT systems.
- Data Warehouse: For sales scenario analysis purposes, data will be eventually loaded from an ODS into the enterprise data warehouse, and integrated with other corporate data sources. As data is moved from the OLTP system to the data warehouse, the data may require transformations. For example IT is combing customers from two different systems. In system one customer 12 is Bob and in system two customer 12 is Dave. So to combine those in a single data warehouse IT will need to generate a new customer id to prevent duplicates. Another example may be in the OLTP the primary key is a composite key using string data type and in the data warehouse you generate a single primary key using integer data type.
- SSAS Cube: The data warehouse (DW) is the source to build SSAS cubes from. SSAS database can contain multiple cubes. Each cube can contain multiple dimensions and measure groups. During processing SSAS reads the DW and brings data into the SSAS cubes.
- Reports: Multiple formats of reports will be generated from the SSAS Cube. This could include Excel files, reporting pages, dashboard or documents.
Within this lifecycle, the owner of the data may vary between different phases, the format of the data may vary in different systems as well, and even the definition of data may vary in different analysis scenarios. But usually it’s difficult to track these differences, which will bring potential risks to IT data governance due to:
- Lack of visibility about history of data – Data can be modified during the ETL (Extraction, Transformation and Loading) process. But in most scenarios, if the data is updated or cleansed by an ETL process automatically, you won’t be able to notice the changes and write logs to somewhere automatically. Then people won’t have visibility about how or when the data field was created, updated or loaded. This makes it really difficult to answer some questions, such as when an information worker asks “When was this data last updated?” or “Was this data cleansed?”
- Lack of ownership or responsibility of data – Data ownership information is not missing in most systems. When IT finds something wrong with the data and wants to contact the owner, IT won’t be able to locate the people through an easy and automatic way. This makes it really difficult to answer some questions, such as when an information worker asks “Who authored the data that’s in this report and what’s their phone number?” or a DBA asks “Who is pulling data from my server that I want to retire? Who do I need to notify?”
- Unable to analyze data change impact – whenever data change is happing, it’s always painful to get a whole picture about the change impact. For example, onboarding new systems and decommissioning old systems are quite normal in enterprise IT operations, but lack of system wide visibility will bring a lot of potential risks by affecting other existing systems. Besides this, smaller incremental changes to data are also happening in daily operations, such as changes to column names or types when extracting data by ETL. Without Project Barcelona (more on this later) there’s almost no visibility into these changes. This makes it really difficult to answer some questions, such as when a developer asks “We have three different apps to input customer data. How are they related?”, or a CTO asks “Where is my sensitive information? Who has access? And where is it distributed?”
So how can we solve all of the above risks and questions? A well-designed metadata management strategy and system is needed to help us improve the visibility of data changes, provide more background information to the data, and generate impact estimation about data changes. First of all, let’s get to know what metadata is.
What is Metadata?
As mentioned above, metadata is ‘data’ about data that describes the aspects and characteristics of the data. According to Forrester, “Metadata is information that describes or provides context for data, content, business processes, services, business rules and policies that support an organization’s information systems.”
Metadata already exists everywhere in our enterprise data ecosystems. It can be stored in design tools, management tools, or monitoring tools. It can appear in a variety of forms, including data tables, indexes, partitions, ETL processes, cubes, spreadsheets or Word documents.
The challenge is, even if we know where the metadata is, it’s still difficult to track the changes and update the metadata among different systems. For example, you need to extract the metadata using database tools, or you need to go through the ETL configuration and code to find the metadata change. For spreadsheets and Word documents, you will need to talk to the document owner to get the exact changes, etc.
While a lot of time is spent to extract the metadata, it will also bring extra overhead cost and time to maintain and update the existing metadata, which is quite challenging for Enterprise IT teams. So, a well-designed metadata system is needed to solve the problem.
Project Barcelona: A Revolution in Metadata Management
Solving these problems is one of the key goals of Project Barcelona. The vision of Project Barcelona is to build a metadata information server that provides details on referential and dataflow relationships. Project Barcelona will support the needs of the modern enterprise to understand and facilitate the management of data across the enterprise.
Project Barcelona is all about making enterprise metadata and enterprise data flow information available and consumable across the enterprise for those who need it most (DBAs, ETL developers, and information workers). However, Project Barcelona is not a centrally controlled metadata repository in the traditional sense – in that the overall design embraces the decentralized and web-like nature of the modern enterprise.
Figure 3 – Project Barcelona Logic Architecture
Figure 3 depicts the overall architecture of Project Barcelona. The major components that drive the primary functionality of this new server include three layers, which are:
- The Crawler Layer: Barcelona will provide several crawlers for Microsoft products (i.e., SQL Server, Excel, SharePoint, Reporting Services, Analysis Services, and SSIS) that will be able to extract metadata and enterprise dataflow information from the target sources to be indexed in the Barcelona Index Server. For sources that can’t be crawled (for example, the dataflow is in an executable program to which the crawler doesn’t have permissions, or a crawler doesn’t exist for the target domain) Barcelona will provide a declarative way to describe the metadata and dataflow information. Finally, the crawler infrastructure will support auto discovery of new target sources- further reducing administration costs since IT don’t have to specify every source to be crawled.
- The Index Layer: The Index Server will be the cache for all of the harvested metadata and enterprise dataflow information. This server will also expose an API for querying, augmenting, and annotating the metadata and dataflow information.
- The Interaction Layer and Tools: Barcelona will include a set of tools in the initial release. First of all, there will be an admin experience for managing the crawlers and the Index Server. Second, Barcelona is developing a DBA experience designed to help the DBA with tasks that require enterprise-wide knowledge of the data topology (for example – renaming a column, retiring a server, figuring out where the data comes from).
Last, but not least, there’s one more particular part in this architecture: a 3rd Party API for back-end crawler and front-end data operation, that will not only enable us to build custom crawlers to collect metadata information from other 3rd party systems that could be non-Microsoft platform based, but also allow Barcelona partners to build packaged industry solutions. This will bring a huge opportunity for 3rd party vendors and partners to provide more customized solutions targeting specific customers on top of the Barcelona framework.
In this article I discussed the nature of metadata, the common pain points of metadata management in the modern enterprise, and Project Barcelona. In the next article I’ll discuss the Barcelona project UI, and how to use it to browse the data flow and history. Stay tuned.