Saturday, March 28, 2009

Dimension tables in data warehouses: Type 2+1

Introduction
In condensed matter physics, a system is often characterized by the dimensionality of the crystal lattice. A bulk crystal has a 3D lattice, while a surface has a 2D lattice. Crystal growth occurs on the surface. There are times when surface growth is described as 2+1D. I used to research this.

In data warehouses, data is stored using a data cube. When a data cube is stored in a relational database, a star schema is built from fact tables and dimension tables. In a fact table, the facts are stable and do not have any time dependence: facts are facts. All of the time dependence is recorded in the dimension tables.

The problem: with example

To make this more concrete, consider a simple data cube for a swim conference. The facts will be the race results. The dimensions will be a swimmer dimension, an event dimension and a meet dimension. Race results include the time recorded. The results are linked to the dimension tables. So, we can build reports on a given swimmer, even if they change teams. Swimmers have a name, photo and birth date. Swimmers also have a link to their current team. Teams have a name, coach, city and league. (This example gives us a simple 'snowflake dimension'). Meets have a location, date, league and season. Events include a stroke (backstroke, butterfly, breaststroke and freestyle) and distance. For now, we will ignore team events, so the schema is quite simple.
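To make the schema concrete, here is one way it could be laid out in PostgreSQL-style DDL. This is only a sketch: the table and column names (race_result_fact, swimmer_dim, team_dim and so on) are my own choices, and the swimmer dimension is shown in its simplest, unversioned form; the rest of this post is about replacing it with a versioned layout.

-- the snowflaked team dimension, referenced by the swimmer dimension
create table team_dim (
  team_id serial primary key,
  name text,
  coach text,
  city text,
  league text
);

create table swimmer_dim (
  swimmer_id serial primary key,
  name text,
  photo_url text,
  birth_date date,
  team_id integer references team_dim (team_id)
);

create table event_dim (
  event_id serial primary key,
  stroke text,        -- backstroke, butterfly, breaststroke or freestyle
  distance_m integer
);

create table meet_dim (
  meet_id serial primary key,
  location text,
  meet_date date,
  league text,
  season text
);

-- the fact table: one row per swimmer per race
create table race_result_fact (
  swimmer_id integer references swimmer_dim (swimmer_id),
  event_id integer references event_dim (event_id),
  meet_id integer references meet_dim (meet_id),
  recorded_time interval   -- the time the swimmer recorded in the race
);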

This is for a Web 2.0 swim league, so we also have to create an iPhone web app to record all of the results during a meet. We can even be cool and have a URL to a 'photo finish' image for each race, pictures for each swimmer, and a Google map showing the location of each meet. After each meet, we can then run a Kettle job to move all of the data into our data warehouse running on Amazon EC2 at 10 cents an hour.
In The Data Warehouse Toolkit, Ralph Kimball describes three types of time dependence for dimension data (each is sketched as SQL after the list):
  1. Type 1 - the past doesn't matter. When data in the dimension tables changes, we just throw away the old data and use the new data. This is appropriate for correcting errors. If we find that we should have spelled Vicky rather than Vickie, we just change the name in the production database for our application. The ETL process will then change the swimmer's name in the data warehouse.
  2. Type 2 - slowly changing dimension. In this case, a single entity in the production system gets mapped to multiple records in the dimension table. At any moment in time, there is only one record that is valid for the entity. When we change the description in the production system, the ETL job will create a new record in the dimension table. If our iPhone-toting parents take a new photo of Vicky every month, Vicky will get a new record in the swimmer dimension table. When we look at the results of a race where Vicky was swimming, we can see what she looked like at the time of the race.
  3. Type 3 - keep a few versions. In this design, we end up with multiple columns for a given attribute in the dimension table. Usually, this is done so we can have a 'current' value and a 'last' value, for something like a company reorganization. Vicky is a superstar, and in the middle of the season she is forced to move to a really weak team to make the league more equal. The parents are upset, so they track Vicky with the current team assignments, but they also track her beginning-of-season team. This example may seem silly, but similar reorganizations happen all the time within companies.
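As a rough SQL illustration of the three behaviours, using Vicky's name correction and her mid-season move. The current_team column exists only for this sketch, and the Type 2 mechanics are spelled out later in the post.

-- Type 1: overwrite in place; the old spelling is gone for good
update swimmer_dim set name = 'Vicky' where name = 'Vickie';

-- Type 2: never overwrite; close the current row and insert a new version
-- (see the ETL rule later in this post for the details)

-- Type 3: keep two columns side by side, the current team and the
-- beginning-of-season team; a mid-season move only touches the current column
update swimmer_dim
   set current_team = 'Weak Team'
 where name = 'Vicky';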
As the database architect, what are you to do? The first thing to realize is that you have to decide on the time dependence for each field in the database. In our case, we want a swimmer dimension with 'name' as a type 1 attribute, 'photo URL' as a type 2 attribute and 'team' as a type 3 attribute. To further complicate matters, we didn't even know that the team should be 'type 3' before the Reorganization Controversy.

In physics and chemistry, we talk about reversible and irreversible transformations. In a computer system, erasing data is irreversible. From this perspective, we see that only the type 2 time dependence does not erase data: type 2 is reversible in the sense that even after we change the data in production, we can still reproduce the old data exactly - if a report was run last year, we can get exactly the same results running the report this year. A type 1 dimension throws data away immediately and a type 3 dimension will throw data away eventually (after two team changes in the example). Since a data warehouse is supposed to warehouse data, I have an immediate preference for the type 2 time dependence.

But as a practical matter, people often prefer the 'current' results, not the results 'at that point in history'. For example, let's say that we have a swim team dashboard that lets you click on a photo of each swimmer to see their stats. Almost certainly, you want to have only one picture of Vicky that links to her entire history, not just her history since that picture was taken. When designing warehouses, I am constantly having to choose between 'current' and 'historical' dimension data.

The New Solution: Type 2+1 dimension

In order to explain this, I need to define the structure of the production database and the data warehouse for the swimmer. In the production database, the swimmer table will have the following columns (sketched as SQL after the list):
  • GUID, an id for each swimmer that is unique for every swimmer in every league that is using my Swim 2.0 web application. This is the primary key of the production table.
  • Name, a string with the swimmer's name. There could be separate fields for first and last names, but this doesn't matter for this discussion.
  • Birthdate, a date field.
  • Photo URL, preferably a URL class
  • Team GUID, a foreign key tying each swimmer to their current team
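A sketch of that production table, assuming PostgreSQL and a uuid type for the GUIDs (the exact types are my guess, not a requirement of the design):

create table swimmer (
  guid uuid primary key,    -- unique for every swimmer in every league using the application
  name text not null,
  birth_date date,
  photo_url text,           -- stored as text; the application wraps it in a URL class
  team_guid uuid not null   -- foreign key to the team table's GUID
);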
In our data warehouse, our swimmer dimension is also represented by a database table. In a table designed for type 2 data, we would have the following columns (again sketched after the list):
  • GUID (same as above, the G is for global, and the data warehouse is part of our world of data)
  • ID, a unique integer determined by the database engine. This is the primary key of the swimmer dimension table.
  • version, an integer that starts at 1 and increments by 1 each time a new record is added for a given GUID. There is a 1-1 mapping between (GUID, version) and ID
  • start date and end date. These two columns define when a given version is valid.
  • name, birthdate and photo URL fields, just like in production
  • Team ID. Foreign keys reference primary keys, so rather than recording the Team GUID, we will record the Team ID. This may be a bad idea; if you have arguments about when to use the GUID and when to use the ID when dimensions reference each other, I would be happy to hear from you.
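Putting those columns together, the swimmer dimension table might be declared like this. It is a sketch: I use Kettle's default names date_from and date_to for the start and end dates, PostgreSQL types, and a timestamp of 'infinity' for a version that is still valid.

create table swimmer_dim (
  id serial primary key,          -- surrogate key assigned by the database engine
  guid uuid not null,             -- the production key, shared by all versions of a swimmer
  version integer not null,       -- starts at 1 and increments for each new record per GUID
  date_from timestamp not null,   -- when this version became valid
  date_to timestamp not null,     -- when it stopped being valid ('infinity' while still current)
  name text,
  birth_date date,
  photo_url text,
  team_id integer,                -- the Team ID (surrogate key) from the team dimension
  unique (guid, version)          -- the 1-1 mapping between (GUID, version) and ID
);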
The rule for a data change is fairly simple. During ETL, any changes in the production data are detected. When a change is found, the current version of the entity (the record with the same GUID) is read and end dated: its end date is set to the time of the change. A new record is then inserted with the GUID, the attributes from production, a new ID, the next version number, a start date equal to the end date of the previous record, and an end date usually set to 'the end of time'.
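Spelled out as SQL for a single change (say Vicky's monthly photo), the rule might look like the two statements below. This is only an illustration: in practice this is the kind of work the Kettle job does for you, the GUID and photo URL here are made up, and I use PostgreSQL's 'infinity' timestamp for 'the end of time'.

-- close the version that is currently valid for this GUID
update swimmer_dim
   set date_to = now()
 where guid = '11111111-1111-1111-1111-111111111111'   -- Vicky's GUID (made up)
   and date_to = 'infinity';

-- insert the next version, carrying the attributes from production
insert into swimmer_dim (guid, version, date_from, date_to,
                         name, birth_date, photo_url, team_id)
select guid,
       version + 1,                              -- the next version number
       date_to,                                  -- start date = end date of the previous record
       'infinity',                               -- 'the end of time'
       name, birth_date,
       'http://example.com/vicky-march.jpg',     -- the changed attribute, as read from production
       team_id
  from swimmer_dim
 where guid = '11111111-1111-1111-1111-111111111111'
 order by version desc
 limit 1;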

In order to identify which version is current, the data warehouse architects at ETL-Tools recommend adding a current field: a Boolean set to true for the current version and false for all others. This field could be determined by a simple rule, such as now() between swimmer_dim.date_from and swimmer_dim.date_to, or swimmer_dim.version = (select max(version) from swimmer_dim sd where sd.GUID = swimmer_dim.GUID). (I use the default field names from Pentaho's Kettle and the date functions from PostgreSQL, but this is easy to rewrite for other environments.) As a practical matter, it is more efficient to store a current field so you can apply the rule once during ETL rather than each time you run a query that needs the current values.
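Refreshing such a flag at the end of each ETL run is a one-line statement; a sketch using the first rule (I call the column is_current here, which is my own name, not a Kettle default):

-- one-time schema change
alter table swimmer_dim add column is_current boolean;

-- only the version valid right now ends up true
update swimmer_dim
   set is_current = (now() between date_from and date_to);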

I argued that a 'proper' design should be reversible and that no data should be lost. Doesn't adding a current field that I am willing to 'throw away' each time I run ETL violate my theoretical argument? No. The reason is that the information in the current field is redundant. Since I can figure out which record would have been current at any time in the past, changing this field does not irreversibly lose data.

Rather than just recording a Boolean for which record is current, I add a field called current_id to my dimension tables; it holds the ID of the current version for that record's GUID. If you want a dimension with Type 2 time dependence, you simply ignore current_id, since the rest of the table was set up to support Type 2; you can write
create view swimmer_dim_type2 as
select GUID, ID, version, date_from, date_to
, name, birth_date, photo_url, team_id
from swimmer_dim
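For completeness, current_id itself costs one extra column and one statement at the end of each ETL run; a sketch, assuming the table layout above:

-- one-time schema change
alter table swimmer_dim add column current_id integer;

-- point every version of a swimmer at that swimmer's current version
update swimmer_dim
   set current_id = curr.id
  from swimmer_dim curr
 where curr.guid = swimmer_dim.guid
   and curr.date_to = 'infinity';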
If you want a Type 1 time dependence that returns the current values for any historical ID, you need a single self join (I alias the historical row as hist and the current row as curr):
create view swimmer_dim_type1 as
select hist.GUID, hist.ID, curr.name
, curr.birth_date, curr.photo_url, curr.team_id
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.ID
This ability to act like a dimension of Type 1 or Type 2 is why I call this Type 2+1. But we know that 2+1=3, so is there a way to get the Type 3 time dependence as well? Yes, but we will need to pick the point in time at which the additional columns are evaluated. In the swimmer example, we want the current team and the team at the beginning of the season. Let's assume that the season began on June 1, 2008. We can create our Type 3 time dependence with:
create view swimmer_dim_type3 as
select hist.GUID, hist.ID
, curr.name, curr.birth_date, curr.photo_url
, curr.team_id, june.team_id as starting_team_id
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.ID
join swimmer_dim june on hist.current_id = june.current_id
where cast('2008-06-01' as date) between june.date_from and june.date_to
So, Type 2+1 can emulate Type 1, Type 2 or Type 3 time dependence. As I noted earlier, the time dependence should be determined at the field level, not the table level. If we want to always use the current name, the photo that was current at the time of each race, and the team as of June 1, 2008, we can write a view to do exactly that without the need to alter the ETL or the table structures.
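For example, a mixed view along these lines would do it; a sketch using the same aliases as above (hist is the version a fact row points at, curr is the current version, june is the version in force at the start of the season), with swimmer_dim_mixed as a name of my own choosing:

create view swimmer_dim_mixed as
select hist.GUID, hist.ID
, curr.name                          -- always the current name (Type 1)
, hist.photo_url                     -- the photo at the time of the race (Type 2)
, june.team_id as starting_team_id   -- the team as of June 1, 2008 (Type 3)
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.ID
join swimmer_dim june on hist.current_id = june.current_id
where cast('2008-06-01' as date) between june.date_from and june.date_to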

I have tested this idea on larger data sets, and each view is almost as fast as a query of the underlying dimension table. In most data warehouse queries, performance is dominated by the reads and aggregates of the data in the fact tables. This technique should be valuable to nearly any data warehouse architect, so I would be pleased to hear from anyone who finds it useful.
