Data Warehousing: A Open Note To Cousin Tom

Dear Tom,

It's been almost three weeks since you wrote. I've been real busy with work and summer and everything, so I haven't had a chance until now to do the thinking I need to do to answer your questions.

You do remember your questions, don't you? It's been that long? Okay..

My company is in the very earliest stages of creating what they term a "Data Warehouse"*.

Can we talk about what makes a data warehouse better or worse? I think my issue is that I don't understand the premises of what a data warehouse should be. Everything I hear sounds like either "Let's build the perfect solution to problems that don't yet exist" or "Let build a solution that solves only the problems we have already solved". I have better things to do with my time than to tackle either of those projects.

Nothing like a general question to get one's geek beanie atwirl, is there? Here's a brief, general discussion of an OLTP guy's understanding of Data Warehousing.

For our purposes, there are two pieces of the active life cycle of business data: the portion where we collect our data from the field (On-Line Transaction Processing, or OLTP), and the portion where businesspeople (managers, executives, etc.) report against the data (On-Line Analytical Processing, or OLAP). The data warehouse, in general terms, is the database that enables an OLAP system.

There's a pretty decent discussion of OLTP vs. OLAP here, and a nice Webcast regarding what goes into a scalable design here (herein likely lies the answer to the crux of your question; a good data warehouse is there when its clients need it on an ongoing basis). Since I'm an OLTP guy, I need to do some of the same learning you do, and these resources are a good place to start.

What I can tell you, based on everything I'm reading, is that the skills you're looking for are highly prized in the current market. Building a scalable, performant Data Warehouse requires an entirely different technical approach than building a scalable, performant OLTP database because the patterns of use for each are completely different:

  • an OLTP database must quickly perform numerous, simultaneous write operations. Data retrieval is at a minimum.
  • an OLAP database must quickly perform numerous, simultaneous read and/or aggregation operations. Writes are at a minimum.

The optimal schemas for each of these tasks are going to be very different -- third normal form would be a very wrong approach to data warehousing. So, in addition to your OLTP database and your Data Warehouse, you'll want to pay close attention to the interface between the two. How is the data being moved and mapped? How often is the warehouse populated? What aggregations can be prepopulated?

I'd also suggest you check out my friend John Huschka's blog; he's a Data Warehousing guy and he recently started a series of posts on Business Intelligence, a topic which any good Data Warehousing project should address: how is the business going to use the data?

If you answer these questions, then you're well beyond "Let's build the perfect solution to problems that don't yet exist" and "Let build a solution that solves only the problems we have already solved". Worth tackling? I think so.. I hope I've convinced you as well.

I realize that a great deal of this is general. I have a lot of learning to do around this stuff. I'll keep reading, and I'll drop you a line here when I find anything good.

Take care, cousin..

     -wp