Saturday, September 26, 2009

Involve them in the same conspiracy

As previously noted, I am listening to the lectures from Berkeley's CS 61A. In the fifth lecture, the class is listening to a video of Alan Kay. He ends the talk, "If you want others to go along with you, you have to involve them in the same conspiracy." Oddly, this made me think about spreadmarts. Wayne Eckerson at TDWI create this term. He described a spreadmart as:

renegade spreadsheets and desktop databases that are wreaking havoc on organizations. Since then, many people have adopted the term because it highlights a painful, yet largely ignored, problem that plagues organizations today.

Spreadmarts contain vital pieces of corporate data that are needed to run the business. But since spreadmarts are created by individuals at different times using different data sources and rules for defining metrics, they create a fractured view of the enterprise. Without a single version of corporate data and centrally defined metrics, employees can’t share a common understanding the business. With spreadmarts, each worker marches to the “beat of their own drummer” instead of marching together toward a common goal. In short, spreadmarts undermine corporate productivity and sabotage business alignment.


I think Mr. Eckerson needs to consider Alan Kay's words. Mr. Eckerson has just cast the creators of spreadmarts as corporate saboteurs that are destroying the One True Way™ that he and and upper management wish to bestow upon their ungrateful underlings. This is a very top down approach, that probably isn't going to work well. He does present a 'co-opt' solution where you agree to using CVS output so the spreadsheet jockeys can still use their beloved Excel. Bu with title like Reeling in Spreadmarts, In Search of a Single Version of Truth: Strategies for Consolidating Analytic Silos and Taming Spreadsheet Jockeys, I hear contempt for the existing business processes and the folks on the front line that do the business' work.

Consider how differently Kirk Wylie treats the traders in financial firms in RESTful Approaches to Financial Services. They are exactly the 'spreadsheet jockeys' derided by Eckerson. Wylie describes centralized systems as über-systems, clearly a derogatory term. Mr. Wylie also recognizes the proper desire for traders to keep control of their data processing. He notes that developers are only acquiring the needed business understanding to become productive after three to four months - he is looking for bottom up solutions that support the 'front line', not centralized solutions to give upper management a dashboard of global KPIs, key performance indicators. He accepts that "Any system that doesn't consider the traders pathological dependency on Excel is doomed to failure." In the rest of the talk, he describes how RESTful solutions allow him to roll out shared data across a variety of format, including Excel, to end users on demand.

Like Alan Kay, Kirk Wylie is involved in the same conspiracy as his users. I'm betting he has more success than Mr. Eckerson is rolling out working systems.

Saturday, September 19, 2009

Comp Sci 61A - Online lectures

This week, I am starting to follow one series of the online UC Berkeley lectures. The lectures are by Brian Harvey's CS 61A from the spring of 2008. As is always the case, following a lecture without trying to do it yourself is not going to get you far.
In the first two months of the class, Prof. Harvey is examining functional programming with LISP. To follow along, I am using Clojure. Installing on OSX amounts to 'sudo port install clojure'. The clojure command line is accessed by the clj command. As noted in the lecture, it is easy to write interpreters. The class uses scheme, so I need a scheme interpreter for clojure.
In order to describe the foundations of computer programming, Prof. Harvey wrote the following during his second lecture:

Physical foundation of computer science
Topicstudied in...
Application ProgramCS 61A
High level language (Scheme)CS 61B
Low Level Langauge (C)CS 61C
Machine Langauge/architecture...
circuit elementsEE
transistorsapplied physics
solid state physicsPhysics
Quantum MechanicsPhysics

This ties in rather naturally with the OSI Seven Layer Model.

This class is presenting a 'real world' view of computer science, not computer science as applied mathematics. As with the physical sciences and engineering, mathematics is the language, not the subject in CS. This is not meant to be a put down of math.

In the first lecture, he gave a concise explanation of 'function' vs. 'procedure'. I am going to enjoy these lectures...

Sunday, September 6, 2009

Slowly Changing DImensions, Type 6 = Type 1 + Type 2 + Type 3

Recently, I have designed a multi-tenant data warehouse for AdTrack clients. One key decision in building a data warehouse is choosing how to deal with slowly changing dimensions (SCD). In a standard data warehouse, there a two basic types of data tables: fact tables hold information about a particular event (e.g. a record of one sale) and the dimension tables describe how the facts can be organized (e.g. records about stores, customers, products, date of sale). Basically, you access the data by specifying a 'slice' of each dimension and you then have the database find all the facts that fall within these slices. The database then lists or summarizes (aggregates) these facts which are displayed in tables and charts which are reported to the users.
A major fly in the ointment is that the information in the dimension tables are slowly evolving. Customers move, change phone numbers, addresses, marital status, last names, jobs and income levels. Sometimes, you just want the current values. It is not obvious a priori if clients will care about current values or historical values. If I were selling hot tubs, does 'current' marital status or 'historic' (e.g. at the time of the sales) marital status matter? I can imagine an analyst wanting to track sales by marital status, sex and income, so I can see them caring about tracking the history of a client's marital status. However, if I were to make a report of top 10 married customers for a promotion, I would probably want to add in any sales to that customer when they were still single. So, unless you know the context, there is no general answer to 'historic' vs. 'current' value when building a data warehouse. This is doubly true for a mult-tenant warehouse where one client could be selling construction equipment b3b, another could be selling medical equipment to university hospitals and a third could be selling windows b2c. Any universal assumption as to what each of these clients need now and forever is going to be badly wrong for some users.
Data warehouse designers will recognize this as the Type 1 SCD (current value ) vs. Type 2 SCD (historic value) decision that needs to be addressed in design. There is also a Type 3 SCD that has current and 'some history'. The technorati, or anyone that reads the slowly changing dimensions article in Wikipeida, will know that there is type which can act like Type 1 SCD, Type 2 SCD or Type 3 SCD. Since 1+2+3=6, Ralph Kimball called this a Type 6 SCD. I have come up with a new implementation of a Type 6 SCD. My solution is basically a Type 2 SCD, but with the addition of a single column. I can then build views of this dimension table that function as Type 1, Type 2 or Type 3 SCD. This solution has now been tested, and the query times are roughly equal for all types of behavior, this is not true of older solutions which could really slow down if you asked for historic values. At least in the case of using Pentaho PDI, the changes to ETL are simple to implement and can be retrofitted to an existing Type 2 SCD. If you are interested in how I do this, I have a paper on Google Docs that you may read. If anyone is interested in implementing this, I would be eager to help or to hear if it works for you.