Sunday, September 6, 2009

Slowly Changing DImensions, Type 6 = Type 1 + Type 2 + Type 3

Recently, I have designed a multi-tenant data warehouse for AdTrack clients. One key decision in building a data warehouse is choosing how to deal with slowly changing dimensions (SCD). In a standard data warehouse, there a two basic types of data tables: fact tables hold information about a particular event (e.g. a record of one sale) and the dimension tables describe how the facts can be organized (e.g. records about stores, customers, products, date of sale). Basically, you access the data by specifying a 'slice' of each dimension and you then have the database find all the facts that fall within these slices. The database then lists or summarizes (aggregates) these facts which are displayed in tables and charts which are reported to the users.
A major fly in the ointment is that the information in the dimension tables are slowly evolving. Customers move, change phone numbers, addresses, marital status, last names, jobs and income levels. Sometimes, you just want the current values. It is not obvious a priori if clients will care about current values or historical values. If I were selling hot tubs, does 'current' marital status or 'historic' (e.g. at the time of the sales) marital status matter? I can imagine an analyst wanting to track sales by marital status, sex and income, so I can see them caring about tracking the history of a client's marital status. However, if I were to make a report of top 10 married customers for a promotion, I would probably want to add in any sales to that customer when they were still single. So, unless you know the context, there is no general answer to 'historic' vs. 'current' value when building a data warehouse. This is doubly true for a mult-tenant warehouse where one client could be selling construction equipment b3b, another could be selling medical equipment to university hospitals and a third could be selling windows b2c. Any universal assumption as to what each of these clients need now and forever is going to be badly wrong for some users.
Data warehouse designers will recognize this as the Type 1 SCD (current value ) vs. Type 2 SCD (historic value) decision that needs to be addressed in design. There is also a Type 3 SCD that has current and 'some history'. The technorati, or anyone that reads the slowly changing dimensions article in Wikipeida, will know that there is type which can act like Type 1 SCD, Type 2 SCD or Type 3 SCD. Since 1+2+3=6, Ralph Kimball called this a Type 6 SCD. I have come up with a new implementation of a Type 6 SCD. My solution is basically a Type 2 SCD, but with the addition of a single column. I can then build views of this dimension table that function as Type 1, Type 2 or Type 3 SCD. This solution has now been tested, and the query times are roughly equal for all types of behavior, this is not true of older solutions which could really slow down if you asked for historic values. At least in the case of using Pentaho PDI, the changes to ETL are simple to implement and can be retrofitted to an existing Type 2 SCD. If you are interested in how I do this, I have a paper on Google Docs that you may read. If anyone is interested in implementing this, I would be eager to help or to hear if it works for you.

No comments:

Post a Comment