Thursday, October 1, 2009

Erlang, Mochiweb and Webmachine

Webmachine is amazing. Why doesn't .Net or Java have something this elegant?

Have a look. I was impressed by the clean design, but I was blown away by the debugger described in the last ten minutes. This is an open source project in Erlang, the high-concurrency functional language created by Joe Armstrong at Ericsson. As noted in the talk, Webmachine is a high quality component to include in a web framework, like Mochiweb. Mochiweb itself is interesting: it is an Erlang framework for high-concurrency web applications. When serving static content, it has been compared with yaws and nginx in a deathmatch on Joe's blog! Richard Jones is contemplating a Million User Comet App. Erlang is turning into a pretty cool language for high-concurrency web apps.

I wonder if anyone has attempted an equivalent in Scala, a high-concurrency functional language for the JVM. I need to get up to speed with Scala and Lift. There is a lot going on here...

Saturday, September 26, 2009

Involve them in the same conspiracy

As previously noted, I am listening to the lectures from Berkeley's CS 61A. In the fifth lecture, the class watches a video of Alan Kay. He ends the talk with, "If you want others to go along with you, you have to involve them in the same conspiracy." Oddly, this made me think about spreadmarts. Wayne Eckerson at TDWI coined this term. He described a spreadmart as:

renegade spreadsheets and desktop databases that are wreaking havoc on organizations. Since then, many people have adopted the term because it highlights a painful, yet largely ignored, problem that plagues organizations today.

Spreadmarts contain vital pieces of corporate data that are needed to run the business. But since spreadmarts are created by individuals at different times using different data sources and rules for defining metrics, they create a fractured view of the enterprise. Without a single version of corporate data and centrally defined metrics, employees can’t share a common understanding of the business. With spreadmarts, each worker marches to the “beat of their own drummer” instead of marching together toward a common goal. In short, spreadmarts undermine corporate productivity and sabotage business alignment.


I think Mr. Eckerson needs to consider Alan Kay's words. Mr. Eckerson has just cast the creators of spreadmarts as corporate saboteurs destroying the One True Way™ that he and upper management wish to bestow upon their ungrateful underlings. This is a very top-down approach, and it probably isn't going to work well. He does present a 'co-opt' solution where you agree to use CSV output so the spreadsheet jockeys can still use their beloved Excel. But with titles like Reeling in Spreadmarts, In Search of a Single Version of Truth: Strategies for Consolidating Analytic Silos and Taming Spreadsheet Jockeys, I hear contempt for the existing business processes and the folks on the front line who do the business's work.

Consider how differently Kirk Wylie treats the traders in financial firms in RESTful Approaches to Financial Services. They are exactly the 'spreadsheet jockeys' derided by Eckerson. Wylie describes centralized systems as über-systems, clearly a derogatory term. Mr. Wylie also recognizes the traders' legitimate desire to keep control of their data processing. He notes that developers need three to four months to acquire the business understanding required to become productive - he is looking for bottom-up solutions that support the 'front line', not centralized solutions to give upper management a dashboard of global KPIs (key performance indicators). He accepts that "Any system that doesn't consider the traders' pathological dependency on Excel is doomed to failure." In the rest of the talk, he describes how RESTful solutions allow him to roll out shared data in a variety of formats, including Excel, to end users on demand.

Like Alan Kay, Kirk Wylie is involved in the same conspiracy as his users. I'm betting he has more success than Mr. Eckerson in rolling out working systems.

Saturday, September 19, 2009

Comp Sci 61A - Online lectures

This week, I am starting to follow one of the online UC Berkeley lecture series: Brian Harvey's CS 61A from the spring of 2008. As is always the case, just following a lecture without trying things yourself is not going to get you far.
For the first two months of the class, Prof. Harvey examines functional programming with Lisp. To follow along, I am using Clojure. Installing on OS X amounts to 'sudo port install clojure', and the Clojure command line is accessed with the clj command. As noted in the lecture, it is easy to write interpreters. The class uses Scheme, so I need a Scheme interpreter written in Clojure.
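
To get a feel for just how easy, here is a toy sketch of my own (nothing from the class): a minimal evaluator for Scheme-flavored expressions in Clojure, handling only numbers, a handful of primitives and 'if'.

    ;; Toy evaluator sketch: numbers, a few primitive ops, and 'if'.
    (def primitives {'+ + '- - '* * '/ / '= =})

    (defn seval [expr]
      (cond
        (number? expr) expr                      ; numbers are self-evaluating
        (symbol? expr) (primitives expr)         ; symbols name primitive ops
        (= 'if (first expr))                     ; (if test then else)
        (let [[_ test then else] expr]
          (if (seval test) (seval then) (seval else)))
        :else (apply (seval (first expr))        ; evaluate operator and operands,
                     (map seval (rest expr))))) ; then apply

    ;; user=> (seval '(if (= 1 1) (* 6 7) 0))
    ;; 42

A real Scheme interpreter needs lambda, define and proper environments, but the shape of the dispatch is already all here.
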
In order to describe the foundations of computer programming, Prof. Harvey wrote the following during his second lecture:

Physical foundation of computer science

    Topic                            Studied in...
    Application program              CS 61A
    High-level language (Scheme)     CS 61B
    Low-level language (C)           CS 61C
    Machine language/architecture    ...
    Circuit elements                 EE
    Transistors                      Applied physics
    Solid state physics              Physics
    Quantum mechanics                Physics

This ties in rather naturally with the OSI Seven Layer Model.

This class presents a 'real world' view of computer science, not computer science as applied mathematics. As in the physical sciences and engineering, in CS mathematics is the language, not the subject. This is not meant as a put-down of math.

In the first lecture, he gave a concise explanation of 'function' vs. 'procedure'. I am going to enjoy these lectures...

Sunday, September 6, 2009

Slowly Changing Dimensions, Type 6 = Type 1 + Type 2 + Type 3

Recently, I designed a multi-tenant data warehouse for AdTrack clients. One key decision in building a data warehouse is choosing how to deal with slowly changing dimensions (SCD). In a standard data warehouse, there are two basic types of tables: fact tables hold information about particular events (e.g. a record of one sale), and dimension tables describe how the facts can be organized (e.g. records about stores, customers, products, dates of sale). Basically, you access the data by specifying a 'slice' of each dimension, and the database finds all the facts that fall within those slices. It then lists or summarizes (aggregates) the facts, which are displayed in tables and charts for the users.
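
As a toy illustration (made-up data and names, nothing like AdTrack's actual schema), slicing a dimension and aggregating the matching facts looks like this in Clojure:

    ;; Made-up data, purely to illustrate facts vs. dimensions.
    (def sales  [{:store 1 :amount 100.0}    ; fact rows: one per sale
                 {:store 2 :amount  80.0}
                 {:store 1 :amount  20.0}])
    (def stores {1 {:region "East"}          ; dimension: describes the stores
                 2 {:region "West"}})

    ;; Slice the store dimension by region, then aggregate the matching facts.
    (defn region-total [region]
      (reduce + (map :amount
                     (filter #(= region (get-in stores [(:store %) :region]))
                             sales))))

    ;; user=> (region-total "East")
    ;; 120.0
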
A major fly in the ointment is that the information in the dimension tables is slowly evolving. Customers move and change phone numbers, addresses, marital status, last names, jobs and income levels. Sometimes you just want the current values. It is not obvious a priori whether clients will care about current values or historical values. If I were selling hot tubs, does 'current' marital status or 'historic' (i.e. at the time of the sale) marital status matter? I can imagine an analyst wanting to track sales by marital status, sex and income, so I can see them caring about the history of a client's marital status. However, if I were building a report of the top 10 married customers for a promotion, I would probably want to include any sales made to a customer while they were still single. So, unless you know the context, there is no general answer to 'historic' vs. 'current' values when building a data warehouse. This is doubly true for a multi-tenant warehouse, where one client could be selling construction equipment b2b, another selling medical equipment to university hospitals, and a third selling windows b2c. Any universal assumption about what each of these clients needs now and forever is going to be badly wrong for some users.
Data warehouse designers will recognize this as the Type 1 SCD (current value) vs. Type 2 SCD (historic value) decision that needs to be addressed in design. There is also a Type 3 SCD that keeps the current value plus 'some history'. The technorati, or anyone who reads the slowly changing dimensions article on Wikipedia, will know that there is a type which can act like a Type 1, Type 2 or Type 3 SCD. Since 1+2+3=6, Ralph Kimball called this a Type 6 SCD. I have come up with a new implementation of a Type 6 SCD. My solution is basically a Type 2 SCD with the addition of a single column. I can then build views of the dimension table that function as Type 1, Type 2 or Type 3 SCDs. This solution has now been tested, and the query times are roughly equal for all three behaviors; this is not true of older solutions, which could really slow down when asked for historic values. At least when using Pentaho PDI, the ETL changes are simple to implement and can be retrofitted onto an existing Type 2 SCD. If you are interested in how I do this, I have a paper on Google Docs that you may read. If anyone is interested in implementing this, I would be eager to help, or to hear whether it works for you.
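
The paper has the real schema; purely to give the flavor of the idea, here is a toy sketch in Clojure where the guessed extra column is a pointer from each historical row to the row holding the current values (the names are mine, for illustration only):

    ;; Toy Type 2 dimension: every historical row also carries :current-key,
    ;; pointing at the row that holds the customer's current attribute values.
    (def customer-dim
      [{:key 1 :customer "C42" :marital-status "single"  :current-key 2}
       {:key 2 :customer "C42" :marital-status "married" :current-key 2}])

    (defn row [k] (first (filter #(= k (:key %)) customer-dim)))

    ;; Type 2 "view": a fact keeps the key it was loaded with -> historic values.
    (defn type2-view [k] (row k))

    ;; Type 1 "view": follow :current-key -> every fact sees today's values.
    (defn type1-view [k] (row (:current-key (row k))))

    ;; user=> (:marital-status (type2-view 1))  ; as of the sale
    ;; "single"
    ;; user=> (:marital-status (type1-view 1))  ; current
    ;; "married"
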

Sunday, August 30, 2009

PostgreSQL: how to run on EC2 with EBS

Ron Evans has recently blogged in great detail about getting PostgreSQL up and running on Ubuntu images on Amazon EC2, with EBS storage for the database files. Basically, you use EC2 to provide the server running PostgreSQL, and you use EBS (Elastic Block Store) to create a volume that you mount to hold the database files. Presumably, you could create a second volume to hold WAL segment files, or perhaps just use S3 to store those.

The next step would be to learn how to create PostgreSQL clusters on EC2. This isn't much of a post, but it is useful to me.

Saturday, August 29, 2009

A virtual IT department: the database

Software as a Service (SaaS) seems about to revolutionize how small and medium companies run their IT. Today, many small and medium businesses have really kludgy IT: a bunch of Windows desktops and a few Windows servers for Exchange, SQL Server, a file server and a domain controller, plus perhaps a company web site, though that is likely to be hosted for a small business. I think we will see Microsoft's dominance fracture as a result of SaaS. Today, consider how SaaS may affect SQL Server.

Small companies now have the option of running database services in the cloud. Using Amazon EC2, they can have a hosted data center with a 99.95% uptime SLA. Of course, this only works when the company has a working internet connection. Since Amazon uses Xen, you can find tools that convert between Xen images and Amazon AMI files. For Debian Linux, that maverick James Gardner has blogged about this. So you could set up a local server to host Xen, and you would then have a piece of the cloud in your own data center.

Running SQL Server is a costly proposition. Enterprise Edition is $25K per CPU. Microsoft plans to upgrade on a 2-3 year cycle, and they don't intend to support servers more than two generations old, so your $25K/CPU is a recurring fee averaging around $5K/year per CPU. If you want, you can run SQL Server on Amazon EC2, at a premium. Let's assume you will need a large instance. Windows alone costs $0.50 per hour, but adding SQL Server Standard increases this to $2.20 per hour. For one year, a hosted SQL Server 2008 Standard instance works out to about $19K. This isn't crazy for a managed server, but it isn't a cost savings for most companies over running a Windows server in-house. So why is there so much buzz about the cloud?

With a few changes, we can save a great deal of money. For the database engine, consider PostgreSQL. It is fast and stable, and it has a carefully reviewed code base. There are drivers for ODBC, JDBC and ADO.Net, so your applications can still access the data. Because PostgreSQL is open source, you can license it for $0. On Amazon EC2, you can run a medium instance with Linux for $0.40 per hour. You can also take advantage of Reserved Instances: for $1,400 you can reserve a server instance for three years, which drops the fee to $0.12 per hour. Three years now costs $4,555.76, so one year of a hosted PostgreSQL instance works out to just under $1,520. The cloud loves open source.

To be fair, SQL Server also includes SSIS, which allows a variety of ETL jobs to be built and run via a GUI. I have been using Pentaho's Kettle for this; Pentaho has renamed it Pentaho Data Integration (PDI), and it can be downloaded for free. So, with PDI and PostgreSQL, you have the same basic functionality as SQL Server, and you save $17,766.61 per year on a database server. You still get 99.95% data center uptime.
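
As a quick sanity check of the arithmetic above (using 8766 hours per year, i.e. 365.25 days, and the 2009 prices quoted in this post):

    ;; Back-of-envelope check of the hosted database costs quoted above.
    (def hours-per-year 8766)
    (def sqlserver-per-year (* 2.20 hours-per-year))      ;=> 19285.20
    (def postgres-3yr (+ 1400 (* 0.12 hours-per-year 3))) ;=> 4555.76
    (def postgres-per-year (/ postgres-3yr 3))            ;=> ~1518.59
    (- sqlserver-per-year postgres-per-year)              ;=> ~17766.61
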

Note: I have not added S3 storage for archiving databases or transaction logs. This will increase costs, but it does so for both servers.

Sunday, August 16, 2009

PostgreSQL: the EnterpriseDB webcasts

PostgreSQL is becoming one of my favorite databases. Postgres is the direct descendant of Ingres; both were created at Berkeley by Michael Stonebraker and Eugene Wong. Other descendants of Ingres include Sybase Enterprise Server and, through Sybase, Microsoft SQL Server. PostgreSQL has the advantage of a very clean code base, as demonstrated by Coverity's open source scans. A limit for many open source projects is getting access to paid support; in the case of PostgreSQL, there is commercial support from EnterpriseDB.

One of the gems of EnterpriseDB's support is the wide range of webcasts about PostgreSQL. Because PostgreSQL is open source, the developers present detailed information about its internals and best practices for database administrators and developers.