Saturday, March 28, 2009

Dimension tables in data warehouses: Type 2+1

Introduction
In condensed matter physics, a system is often characterized by the dimensionality of the crystal lattice. A bulk crystal has a 3D lattice, while a surface has a 2D lattice. Crystal growth occurs on the surface. There are times when surface growth is described as being 2+1D. I used to research this.

In data warehouses, data is stored using a data cube. When a data cube is stored in a relational database, a star schema is built from fact tables and dimension tables. In a fact table, the facts are stable and do not have any time dependence: facts are facts. All of the time dependence is recorded in the dimension tables.

The problem: with example

To make this more concrete, consider a simple data cube for a swim conference. The facts will be the race results. The dimensions will be a swimmer dimension, an event dimension and a meet dimension. Race results include the time recorded. The results are linked to the dimension tables. So, we can build reports on a given swimmer, even if they change teams. Swimmers have a name, photo and birth date. Swimmers also have a link to their current team. Teams have a name, coach, city and league. (This example gives us a simple 'snowflake dimension'). Meets have a location, date, league and season. Events include a stroke (backstroke, butterfly, breaststroke and freestyle) and distance. For now, we will ignore team events, so the schema is quite simple.
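As a sketch, the fact table for the race results might look something like this in PostgreSQL (the table and column names here are mine, not a spec; the swimmer dimension is detailed later in this post):

-- each row is one race result, pointing at a row in each dimension table
create table race_result_fact (
  swimmer_id        integer not null,   -- surrogate key into the swimmer dimension
  event_id          integer not null,   -- surrogate key into the event dimension
  meet_id           integer not null,   -- surrogate key into the meet dimension
  recorded_time     interval not null,  -- the fact: the swimmer's time for the race
  photo_finish_url  text                -- optional link to the 'photo finish' image
);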

This is for a Web 2.0 swim league, so we also have to create an iPhone web app to record all of the results during a meet. We can even be cool and have a URL to a 'photo finish' image for each race, pictures for each swimmer, and a Google map showing the location of each meet. After each meet, we can then run a Kettle job to move all of the data into our data warehouse running on Amazon EC2 at 10 cents an hour.
In the Data Warehouse Toolkit, Ralph Kimball has described three types of time dependence for dimension data:
  1. Type 1 - the past doesn't matter. When data in the dimension tables changes, we just throw away the old data and use the new data. This is appropriate for correcting errors. If we find that we should have spelled Vicky rather than Vickie, we just change the name in the production database for our application. The ETL process will then change the swimmer's name in the data warehouse.
  2. Type 2 - slowly changing dimension. In this case, a single entity in the production system gets mapped to multiple records in the dimension table. At any moment in time, there is only one record that is valid for the entity. When we change the description in the production system, the ETL job will create a new record in the dimension table. If our iPhone-toting parents take a new photo of Vicky every month, Vicky will get a new record in the swimmer dimension table. When we look at the results of a race where Vicky was swimming, we can see what she looked like at the time of the race.
  3. Type 3 - keep a few versions. In this design, we end up with multiple columns for a given attribute in the dimension table. Usually this is done so we can have a 'current' value and a 'last' value, as in a company reorganization. Vicky is a superstar, and in the middle of the season she is forced to move to a really weak team to make the league more equal. The parents are upset, so they track Vicky with the current team assignments, but they also track her beginning-of-season team. This example may seem silly, but similar reorganizations happen all the time within companies. (A short SQL sketch contrasting the three types follows this list.)
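As a rough SQL sketch (the column names and ids are hypothetical; the real dimension table layout comes later in this post), the three types handle a change like this:

-- Type 1: the ETL simply overwrites the dimension row in place; the old spelling is gone
update swimmer_dim set name = 'Vicky' where name = 'Vickie';

-- Type 2: never overwrite; end date the current row and insert a new version
-- (the exact statements are sketched later in this post)

-- Type 3: an extra column keeps one prior value next to the current one
alter table swimmer_dim add column starting_team_id integer;
update swimmer_dim set starting_team_id = team_id;          -- snapshot the beginning-of-season team
update swimmer_dim set team_id = 17 where name = 'Vicky';   -- 17 is a hypothetical new team id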
As the database architect, what are you to do? The first thing to realize is that you have to decide on the time dependence for each field in the database. In our case, we want a Swimmer dimension with the name as a type 1 attribute, the 'photo URL' as a type 2 attribute, and the 'team' as a type 3 attribute. To further complicate matters, we didn't even know that the team should be type 3 before the Reorganization Controversy.

In physics and chemistry, we talk about reversible and irreversible transformations. In a computer system, erasing data is irreversible. From this perspective, we see that only the type 2 time dependence does not erase data: type 2 is reversible in the sense that even after we change the data in production, we can still reproduce the old data exactly - if a report was run last year, we can get exactly the same results running the report this year. A type 1 dimension throws data away immediately and a type 3 dimension will throw data away eventually (after two team changes in the example). Since a data warehouse is supposed to warehouse data, I have an immediate preference for the type 2 time dependence.

But as a practical matter, people often prefer the 'current' results, not the results 'at that point in history'. For example, let's say that we have a swim team dashboard that lets you click on a photo of each swimmer to see their stats. Almost certainly, you want only one picture of Vicky that links to her entire history, not just her history since that picture was taken. When designing warehouses, I am constantly having to choose between 'current' and 'historical' dimension data.

The New Solution: Type 2+1 dimension

In order to explain this, I need to define the structure of the production database and the data warehouse for the swimmer. In the production database, the swimmer table will have the following columns (a rough SQL sketch follows the list):
  • GUID, an id for each swimmer that is unique for every swimmer in every league that is using my Swim 2.0 web application. This is the primary key of the production table.
  • Name, a string with the swimmer's name. There could be separate fields for first and last names, but this doesn't matter for this discussion.
  • Birthdate, a date field.
  • Photo URL, preferably a URL class
  • Team GUID, a foreign key tying each swimmer to their current team
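In PostgreSQL, the production tables might look roughly like this (the column types and the layout of the team table are my assumptions):

create table team (
  guid    uuid primary key,
  name    text not null,
  coach   text,
  city    text,
  league  text
);

create table swimmer (
  guid        uuid primary key,              -- unique across every league using the app
  name        text not null,
  birth_date  date,
  photo_url   text,
  team_guid   uuid references team (guid)    -- the swimmer's current team
);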
In our data warehouse, our swimmer dimension is also represented by a database table. In a table designed for type 2 data, we would have the following columns:
  • GUID (same as above, the G is for global, and the data warehouse is part of our world of data)
  • ID, a unique integer determined by the database engine. This is the primary key of the swimmer dimension table.
  • version, an integer that starts at 1 and increments by 1 each time a new record is added for a given GUID. There is a 1-1 mapping between (GUID, version) and ID
  • start date and end date. These two columns define when a given version is valid.
  • name, birthdate and photo URL fields, just like in production
  • Team ID. Foreign keys reference primary keys, so rather than recording the Team GUID, we will record the Team ID. This may be a bad idea; if you have arguments about when to use the GUID and when to use the ID when dimensions reference each other, I would be happy to hear from you. (A sketch of this table in SQL follows the list.)
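Here is a sketch of that dimension table in PostgreSQL (the types are my guesses; date_from and date_to are Kettle's default names for the start and end dates):

create table swimmer_dim (
  id          serial primary key,     -- surrogate key generated by the database engine
  guid        uuid not null,          -- the production key
  version     integer not null,
  date_from   timestamp not null,     -- when this version becomes valid
  date_to     timestamp not null,     -- when this version stops being valid
  name        text,
  birth_date  date,
  photo_url   text,
  team_id     integer,                -- the ID (surrogate key) of a row in the team dimension
  unique (guid, version)              -- the 1-1 mapping between (GUID, version) and ID
);

(The current_id column introduced later in this post is added to this same table.)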
The rule for a data change is fairly simple. During ETL, any changes in the production data are detected. When a change is found, the current version of the entity (i.e., the record with the same GUID) is end dated: its end date is set to the time of the change. A new record is then inserted with the GUID, the attributes from production, a new ID, and the next version number; its start date is set to the end date of the previous record and its end date is usually set to 'the end of time'.
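A hand-rolled sketch of that rule for a single changed swimmer might look like the following (Kettle's dimension lookup/update step handles this pattern for you; the GUID and photo URL here are made up):

-- run inside one transaction so every call to now() returns the same timestamp
begin;

-- end date the version that is currently valid for this (hypothetical) swimmer
update swimmer_dim
   set date_to = now()
 where guid = '6d2b0a11-3d0f-4a3e-9b8e-2f5c1a7d9e01'
   and date_to = timestamp 'infinity';

-- insert the new version: next version number, start date equal to the previous end date
insert into swimmer_dim (guid, version, date_from, date_to, name, birth_date, photo_url, team_id)
select guid, version + 1, now(), timestamp 'infinity',
       name, birth_date, 'http://example.com/photos/vicky_2009_03.jpg', team_id
  from swimmer_dim
 where guid = '6d2b0a11-3d0f-4a3e-9b8e-2f5c1a7d9e01'
   and date_to = now();   -- the row we just end dated

commit;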

In order to identify which version is current, the data warehouse architects at ETL-Tools recommend adding a current field, a Boolean set to true for the current version and false for all others. This field could be determined by a simple rule, such as now() between swimmer_dim.date_from and swimmer_dim.date_to, or swimmer_dim.version = (select max(version) from swimmer_dim sd where sd.GUID = swimmer_dim.GUID). (I use the default field names from Pentaho's Kettle and the date functions from PostgreSQL, but this is easy to rewrite for other environments.) As a practical matter, it is more efficient to materialize the current field during ETL, so you apply the rule once per load rather than each time you run a query that needs the current values.
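For example, the flag could be recomputed at the end of each ETL run with something like this (I call the column is_current here simply to avoid any clash with SQL keywords; the current_id field introduced below can be maintained with the same kind of statement):

-- one-time change: add the flag to the dimension table
alter table swimmer_dim add column is_current boolean not null default false;

-- after each load, mark only the highest version of each GUID as current
update swimmer_dim
   set is_current = (version = (select max(version)
                                  from swimmer_dim sd
                                 where sd.guid = swimmer_dim.guid));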

I argued that a 'proper' design should be reversible and that no data should be lost. Doesn't adding a current field that I am willing to 'throw away' each time I run ETL violate my theoretical argument? No. The reason is that the information in the current field is redundant. Since I can figure out which record would have been current at any time in the past, changing this field does not irreversibly lose data.

Rather than just recording a Boolean flag for the current record, I add a field called current_id to my dimension tables: in every row, current_id holds the ID of the version that is currently valid for that GUID. If you want a Type 2 time dependence, you simply ignore current_id, since the rest of the table was set up to support Type 2:
create view swimmer_dim_type2 as
select GUID, ID, version, date_from, date_to
, name, birth_date, photo_url, team_id
from swimmer_dim
If you want a Type 1 time dependence that returns the current values for any historical ID, you need to make a single self join:
create view swimmer_dim_type1 as
select hist.GUID, hist.ID, curr.name
, curr.birth_date, curr.photo_url, curr.team_id
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.id
This ability to act like a dimension of Type 1 or Type 2 is why I call this Type 2+1. But we know that 2+1 = 3, so is there a way to get the Type 3 time dependence as well? Yes, but we need to pick the point in time for the additional columns. In the swimmer example, we want the current team and the team at the beginning of the season. Let's assume that the season began on June 1, 2008. We can create our Type 3 time dependence with:
create view swimmer_dim_type3 as
select hist.GUID, hist.ID
, curr.name, curr.birth_date, curr.photo_url
, hist.team_id, june.team_id as starting_team_id
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.id
join swimmer_dim june on hist.current_id = june.current_id
where cast('2008-06-01' as date) between june.date_from and june.date_to
So, Type 2+1 can emulate Type 1, Type 2 or Type 3 time dependence. As I noted earlier, the time dependence should be determined at the field level, not the table level. If we always want to use the current name, the photo that was current at the time of each race, and the team as of June 1, 2008, we can write a view to do exactly that without altering the ETL or the table structures.
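For instance, a view along these lines (the name is mine; it follows the same pattern as the views above) would implement that exact policy:

create view swimmer_dim_mixed as
select hist.GUID, hist.ID
, curr.name                          -- always the current name (Type 1)
, hist.photo_url                     -- the photo that was current at the time of the race (Type 2)
, june.team_id as starting_team_id   -- the team as of June 1, 2008 (Type 3)
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.id
join swimmer_dim june on hist.current_id = june.current_id
where cast('2008-06-01' as date) between june.date_from and june.date_to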

I have tested this idea on larger data sets, and each view is almost as fast as a query of the underlying dimension table. In most data warehouse queries, performance is dominated by the reads and aggregations of data in the fact tables. This technique should be valuable to nearly any data warehouse architect, so I would be pleased to hear from anyone who finds it useful.

Saturday, March 21, 2009

Bayes' Theorem and Taleb's Silent Evidence

In Chapter Eight of The Black Swan, Taleb recounts Cicero's story of the drowned worshippers:
One Diagoras, a non believer in the gods, was shown painted tablets bearing the portraits of some worshippers who prayed, then survived, a subsequent shipwreck. The implication was that praying protected you from drowning. Diagoras asked, "Where were those who prayed, then drowned?"
The drowned worshippers, being dead, would have a lot of trouble advertising their experiences from the bottom of the sea. This can fool the casual observer into believing in miracles.
Taleb calls this the problem of silent evidence. We are drawn to success and we shun failure, so, as Taleb notes, nobody ever writes 'How I Failed to Make a Million Dollars on Wall Street'.

Suppose that one of our Wall Street success stories claims that in order to succeed on Wall Street, you need to be a Harvard graduate; this author cites himself and his college roommate, who have both made a million dollars on Wall Street. If you are in high school (or have a child in school), you need to choose a college. You want to know whether a Harvard education is worth its rather considerable cost. If the author is right, then the decision is easy: you will simply repay your student loans with the millions you make on Wall Street.

What information do you need to assess this claim? We could begin by looking for particular cases. Finding one black swan proves that not all swans are white; likewise, finding one millionaire from a state university disproves the claim that a Harvard education is a requirement for success on Wall Street. Similarly, finding a Harvard graduate who failed on Wall Street proves that a Harvard education is not sufficient for success there: being a Harvard alumnus is no guarantee of success on Wall Street.

Life is not certain, so we need to make a gamble. How can we play the odds intelligently? The sort of data that would be helpful would include the resumes of both successful and failed Wall Street investors. As a practical matter, getting the list of unsuccessful investors would be something of a trick. Taleb is right about silent evidence: we remember the winners and forget the losers. The winners advertise and the losers go on to something else.

In the age of data mining and databases, it should be possible to build such a list.
If we can gather the data, what framework should we use to decide if going to Harvard would be a rational risk? The correct framework for assessing the validity of a claim is Bayes' Theorem from statistics. If you haven't heard about Bayesian statistics, a good starting point is An Intuitive Explanation of Bayes' Theorem by Eliezer S. Yudkowsky. In fact, Yudkowsky's explanation is so good that there really isn't any point in me writing more on this subject. There are also on-line course materials; currently, I'm working through the materials that Jeff Grynaviski (at the University of Chicago) has provided. To go deeper, there is David MacKay's Information Theory, Inference and Learning Algorithms. I especially like MacKay's book because it unites Bayesian methods with Claude Shannon's Information Theory and even includes elements of AI.
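To make the connection to silent evidence explicit, here is the Harvard claim written out with Bayes' Theorem (the probabilities are left symbolic; I am not estimating any of them):

P(success | Harvard) = P(Harvard | success) × P(success) / P(Harvard)

where the denominator has to count everyone who went to Harvard, winners and losers alike:

P(Harvard) = P(Harvard | success) × P(success) + P(Harvard | failure) × P(failure)

The self-citing author only supplies anecdotes about P(Harvard | success). Without the resumes of the failed investors - the silent evidence - we have no estimate of P(Harvard | failure) or of the base rate P(success), so we cannot compute the probability we actually care about.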

If you still have some time left after reading about Bayesian methods, read Yudkowsky's article Cognitive biases potentially affecting judgement of global risk. His assessment seems to be in general agreement with Taleb. Since I live next to Cedar Rapids, which has just had a major flood, Yudkowsky's observation about flood damage was particularly revealing:

Burton et. al. (1978) report that when dams and levees are built, they reduce the frequency of floods, and thus apparently create a false sense of security, leading to reduced precautions. While building dams decreases the frequency of floods, damage per flood is so much greater afterward that the average yearly damage increases.

Wow, this is an extraordinary claim. If true, then much of the work done to protect people in the Mississippi/Missouri River basins is not just useless but actually counterproductive. I will remain skeptical of this claim, but it does seem worthy of investigation before Cedar Rapids, Linn County, the State of Iowa and Federal agencies sink more money into flood control on the Cedar River. We need to find a more rational way to understand and manage risk. If we can improve that, we will have learned an important lesson from the catastrophic failures of 9/11, the Indian Ocean Tsunami of 2004, the flooding of New Orleans following Hurricane Katrina, and the current financial collapse. We need Taleb's empiricism. When we have enough data to analyze, we should be using Bayesian methods for that analysis.

Sunday, March 8, 2009

Why Swamp Fox Analyst?

First, I live in Marion, Iowa, home of the Swamp Fox Festival. I was raised largely in Marion County, Iowa. Both are named after Francis Marion. Of course, there are hundreds of places in the US named after Francis Marion. I am no historian, but even a quick read of his Wikipedia article or the Smithsonian's recent article reveals a fascinating character. He was a small child, something of the family runt. He was a good student and fluent in French. At age 15, he survived a shipwreck when a whale rammed the schooner he was sailing on. At 25, he became a military officer in the French and Indian War. The following quote
The next morning we proceeded by order of Colonel James Grant, to burn down the Indians' cabins. Some of our men seemed to enjoy this cruel work, laughing very heartily at the curling flames, as they mounted loud crackling over the tops of the huts. But to me it appeared a shocking sight. Poor creatures! thought I, we surely need not grudge you such miserable habitations. But, when we came, according to orders, to cut down the fields of corn, I could scarcely refrain from tears. For who could see the stalks that stood so stately with broad green leaves and gaily tasseled shocks, filled with sweet milky fluid and flour, the staff of life; who, I say, without grief, could see these sacred plants sinking under our swords with all their precious load, to wither and rot untasted in their mourning fields.
suggests to me that he was a sensitive and compassionate man, but still a man who did his duty. He is most famous for his leadership in the American Revolutionary War. Following the Waxhaw Massacre, his band of about fifty irregulars was the only force opposing the British in South Carolina. He earned his nickname, The Swamp Fox, by outfoxing Col. Banastre Tarleton and his British forces. His tactics were precursors of modern guerrilla warfare. In short, he was able to accomplish a great deal with very little; he used intelligence to succeed. While I cannot claim to be following in his footsteps (I failed my military physical for acne scars on my chest and back), I can certainly take inspiration from his success.

There is also a cautionary note that I associate with Francis Marion. While we Americans, and even Mel Gibson, regard Marion as a hero, he was clearly viewed as a terrorist by the British. The historian Christopher Hibbert described him as a racist and rapist for his treatment of the Cherokee. There is evidence that both views are correct. One man's terrorist is another man's freedom fighter.

Francis Marion was also alive during the Age of Enlightenment. His lifetime (1732-1795) overlaps the lives of the skeptic David Hume (1711-1776), the Reverend Thomas Bayes (1702-1761), and the French aristocrat Pierre-Simon, marquis de Laplace (1749-1827). As Nassim Nicholas Taleb notes in The Black Swan, Hume was an influential empiricist for the English-speaking world. For all of the reasons Taleb outlines, we need empiricism and some of the intellectual humility of a real skeptic, to counter the hubris of modern economists.

Taleb speaks a great deal about The Problem of Silent Evidence. In a nutshell, when we try to understand the unusual, we look for characteristics that we feel were a cause of the event. For example, we have all been taught that World War I was caused by 'entangling alliances' between the major powers and the 'powderkeg' in the Balkans. But identifying characteristics that were true before WWI is not the same as identifying causes. To have useful knowledge, we need to be able to predict, not create an after-the-fact narrative. There were many times in history when there were alliances and many places where angry young men plotted to be revolutionaries. Does the presence of these conditions have any predictive value? That is what we need to know if we want to plan.

It seems to me that some of Taleb's criticism of classical statistical analysis can be addressed by using Bayesian methods. This is the tie to the Reverend Bayes and Laplace: they founded what we now call Bayesian statistics. Over the last twenty years, Bayesian methods have been evolving rapidly. In particular, Bayes' theorem provides a mathematical framework for discussing Taleb's Problem of Silent Evidence. Of course, a framework is not a 'solution', and this only addresses one of Taleb's issues. More to come ...