Data Warehouse – Anatomy of Extract, Transform, Load (ETL)

In my last post I went over the Deployment process that brings over MP’s from Service Manager and how those MP’s drive the structure, data, and reports for Data Warehouse and Reporting. Once the schema and reports are deployed, we need to populate the database with some actual data for reporting purposes. This is done via the Extract, Transform, and Load (ETL) process.

ETL works in this way (see figure below also):

  • Extract
    • The Extract process begins on a schedule interval. Extract is the process which grabs the raw data from your OLTP store, which in this case is the Service Manager cmdb.
    • Extract queries Service Manager for the “delta” data from it’s last run.
    • This new data is written into the DWStagingandConfig database in the same basic form as it is in the cmdb.
  • Transform
    • The Transform process begins on a scheduled interval.
    • Transform takes the raw data from the staging area and does any cleansing, reformatting, aggregation, etc. that is required to get it into the final format for reporting.
    • This transformed data is written into the DWRepository database.
  • Load
    • The Load process begins on a scheduled interval
    • Load queries the data from the DWRepository database
    • The transformed data from DWRepository is inserted into the DWDatamart database. The DWDatamart is the database used for all end-user reporting needs.

Many folks wonder why it is necessary to have these three different processes. The three processes each serve their own specific purpose:

  • Extract is built specifically for processing high volumes of data from multiple sources and allows for moving the data into an area that is built for manipulating the data.
  • Transform processes are built for optimization of complex logic and integration operations. This is where a lot of the heavy lifting occurs.
  • Load is built for transferring the data that has already been processed into it’s target destination in a bulk manner.

There are a few different reasons for having three different databases, but one of the main reasons is so that folks can optimize their hardware environment more easily. In high volume environments, the DWStagingandConfig and DWRepository databases will need to be on hardware that is optimized for read/write IO whereas the DWDatamart will need to be optimized for read IO. With that in mind, customers will be able to separate out the DW Datamart to a different server/drive from DWStagingandConfig and DWRepository (these two must remain on the same server) with Service Manager v1.

We’ve pretty much covered getting the DW configured, deployed, and populated with data. Next post, I am going to be talking about the reporting side of the house and how it will work within Service Manager. Please feel free to leave any comments/questions/suggestions for future topics.

ETL Anatomy