Thursday, October 1, 2009

Erlang, Mochiweb and Webmachine

Webmachine is amazing. Why doesn't .Net or Java have something this elegant?

Have a look. I was impressed by the clean design, but I was blown away by the debugger described in the last ten minutes. This is an open source project in Erlang, the high-concurrency functional language created by Joe Armstrong at Ericsson. As noted in the talk, Webmachine is a high quality component to include in a web framework, like Mochiweb. Mochiweb itself is interesting: it is an Erlang framework for high-concurrency web applications. When serving static content, it has been compared with yaws and nginx in a deathmatch on Joe's blog! Richard Jones is contemplating a Million User Comet App. Erlang is turning into a pretty cool language for web apps with high concurrency.

I wonder if there is any attempt to write an equivalent in Scala, a high concurrency functional language for the JVM? I have got to get up to speed with Scala and Lift. There is a lot going on here...

Saturday, September 26, 2009

Involve them in the same conspiracy

As previously noted, I am listening to the lectures from Berkeley's CS 61A. In the fifth lecture, the class is listening to a video of Alan Kay. He ends the talk with, "If you want others to go along with you, you have to involve them in the same conspiracy." Oddly, this made me think about spreadmarts. Wayne Eckerson at TDWI coined this term. He described a spreadmart as:

renegade spreadsheets and desktop databases that are wreaking havoc on organizations. Since then, many people have adopted the term because it highlights a painful, yet largely ignored, problem that plagues organizations today.

Spreadmarts contain vital pieces of corporate data that are needed to run the business. But since spreadmarts are created by individuals at different times using different data sources and rules for defining metrics, they create a fractured view of the enterprise. Without a single version of corporate data and centrally defined metrics, employees can’t share a common understanding of the business. With spreadmarts, each worker marches to the “beat of their own drummer” instead of marching together toward a common goal. In short, spreadmarts undermine corporate productivity and sabotage business alignment.


I think Mr. Eckerson needs to consider Alan Kay's words. Mr. Eckerson has just cast the creators of spreadmarts as corporate saboteurs who are destroying the One True Way™ that he and upper management wish to bestow upon their ungrateful underlings. This is a very top-down approach that probably isn't going to work well. He does present a 'co-opt' solution where you agree to use CSV output so the spreadsheet jockeys can still use their beloved Excel. But with titles like Reeling in Spreadmarts, In Search of a Single Version of Truth: Strategies for Consolidating Analytic Silos and Taming Spreadsheet Jockeys, I hear contempt for the existing business processes and the folks on the front line who do the business's work.

Consider how differently Kirk Wylie treats the traders in financial firms in RESTful Approaches to Financial Services. They are exactly the 'spreadsheet jockeys' derided by Eckerson. Wylie describes centralized systems as über-systems, clearly a derogatory term. Mr. Wylie also recognizes the proper desire for traders to keep control of their data processing. He notes that developers only acquire the business understanding needed to become productive after three to four months - he is looking for bottom-up solutions that support the 'front line', not centralized solutions that give upper management a dashboard of global KPIs, key performance indicators. He accepts that "Any system that doesn't consider the traders' pathological dependency on Excel is doomed to failure." In the rest of the talk, he describes how RESTful solutions allow him to roll out shared data in a variety of formats, including Excel, to end users on demand.

Like Alan Kay, Kirk Wylie is involved in the same conspiracy as his users. I'm betting he has more success than Mr. Eckerson in rolling out working systems.

Saturday, September 19, 2009

Comp Sci 61A - Online lectures

This week, I am starting to follow one series of the online UC Berkeley lectures: Brian Harvey's CS 61A from the spring of 2008. As is always the case, following a lecture without trying to do it yourself is not going to get you far.
In the first two months of the class, Prof. Harvey examines functional programming with Lisp. To follow along, I am using Clojure. Installing on OSX amounts to 'sudo port install clojure'. The Clojure command line is accessed by the clj command. As noted in the lecture, it is easy to write interpreters. The class uses Scheme, so I need a Scheme interpreter for Clojure.
In order to describe the foundations of computer programming, Prof. Harvey wrote the following during his second lecture:

Physical foundation of computer science
Topic: studied in...
Application program: CS 61A
High-level language (Scheme): CS 61B
Low-level language (C): CS 61C
Machine language/architecture: ...
Circuit elements: EE
Transistors: applied physics
Solid state physics: physics
Quantum mechanics: physics

This ties in rather naturally with the OSI Seven Layer Model.

This class presents a 'real world' view of computer science, not computer science as applied mathematics. As with the physical sciences and engineering, mathematics is the language, not the subject, of CS. This is not meant as a put-down of math.

In the first lecture, he gave a concise explanation of 'function' vs. 'procedure'. I am going to enjoy these lectures...

Sunday, September 6, 2009

Slowly Changing Dimensions, Type 6 = Type 1 + Type 2 + Type 3

Recently, I have designed a multi-tenant data warehouse for AdTrack clients. One key decision in building a data warehouse is choosing how to deal with slowly changing dimensions (SCD). In a standard data warehouse, there are two basic types of data tables: fact tables hold information about a particular event (e.g. a record of one sale) and dimension tables describe how the facts can be organized (e.g. records about stores, customers, products, date of sale). Basically, you access the data by specifying a 'slice' of each dimension and you then have the database find all the facts that fall within these slices. The database then lists or summarizes (aggregates) these facts, which are displayed in the tables and charts reported to the users.
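To make that concrete, here is a hedged sketch of the kind of star-schema query I mean (the table and column names are hypothetical, not the actual AdTrack schema): each where clause picks a slice of a dimension, and the facts inside those slices are aggregated.
select s.store_name, d.calendar_year, sum(f.sale_amount) as total_sales
from fact_sales f
join dim_store s on f.store_id = s.store_id
join dim_date d on f.date_id = d.date_id
join dim_customer c on f.customer_id = c.customer_id
where d.calendar_year = 2009
  and c.marital_status = 'married'
group by s.store_name, d.calendar_year;
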
A major fly in the ointment is that the information in the dimension tables is slowly evolving. Customers move, change phone numbers, addresses, marital status, last names, jobs and income levels. Sometimes, you just want the current values. It is not obvious a priori whether clients will care about current values or historical values. If I were selling hot tubs, does 'current' marital status or 'historic' (i.e. at the time of the sale) marital status matter? I can imagine an analyst wanting to track sales by marital status, sex and income, so I can see them caring about tracking the history of a client's marital status. However, if I were to make a report of the top 10 married customers for a promotion, I would probably want to add in any sales to those customers from when they were still single. So, unless you know the context, there is no general answer to 'historic' vs. 'current' values when building a data warehouse. This is doubly true for a multi-tenant warehouse where one client could be selling construction equipment b2b, another could be selling medical equipment to university hospitals and a third could be selling windows b2c. Any universal assumption as to what each of these clients needs now and forever is going to be badly wrong for some users.
Data warehouse designers will recognize this as the Type 1 SCD (current value) vs. Type 2 SCD (historic value) decision that needs to be addressed in design. There is also a Type 3 SCD that has the current value and 'some history'. The technorati, or anyone who reads the slowly changing dimensions article in Wikipedia, will know that there is a type which can act like a Type 1, Type 2 or Type 3 SCD. Since 1+2+3=6, Ralph Kimball called this a Type 6 SCD. I have come up with a new implementation of a Type 6 SCD. My solution is basically a Type 2 SCD, but with the addition of a single column. I can then build views of this dimension table that function as a Type 1, Type 2 or Type 3 SCD. This solution has now been tested, and the query times are roughly equal for all types of behavior; this is not true of older solutions, which could really slow down if you asked for historic values. At least in the case of using Pentaho PDI, the changes to ETL are simple to implement and can be retrofitted to an existing Type 2 SCD. If you are interested in how I do this, I have a paper on Google Docs that you may read. If anyone is interested in implementing this, I would be eager to help or to hear if it works for you.
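To give a flavor of the single added column, here is a minimal, hedged sketch using a hypothetical customer_dim Type 2 table (the paper has the full details): every version of a customer gets a pointer to that customer's current version, and the ordinary Type 2 columns are untouched.
alter table customer_dim add column current_id integer;

-- repeatable ETL step: point every version of a customer at the
-- single version that is valid right now
update customer_dim cd
set current_id = (
    select curr.id
    from customer_dim curr
    where curr.guid = cd.guid
      and now() between curr.date_from and curr.date_to
);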

Sunday, August 30, 2009

PostgreSQL: how to run on EC2 with EBS

Ron Evans has recently blogged in great detail about getting PostgreSQL up and running on Ubuntu images on Amazon EC2 with EBS storage for the database files. Basically, you use EC2 to provide the server running PostgreSQL and you use EBS, Elastic Block Store, to create a volume that you mount to hold the database files. Presumably, you can create a second file system to hold WAL segment files, but perhaps you just use S3 to store these.

The next step would be to learn how to create PostgreSQL clusters on EC2. This isn't much of a post, but it is useful to me.

Saturday, August 29, 2009

A virtual IT department: the database

Software as a Service, SaaS, seems like it is about to revolutionize how small and medium companies run their IT. Today, many small and medium businesses have really kludgy IT. There are a bunch of Windows desktops, and a few Windows servers for Exchange, SQL Server, a file server, a domain controller and perhaps a company web site, though that is likely to be a hosted site for a small business. I think we will see Microsoft's dominance fracture as a result of SaaS. Today, consider how SaaS may affect SQL Server.

Small companies now have the option of running database services in the cloud. Using Amazon EC2, they can have a hosted data center with a 99.95% uptime in the SLA. Of course, this only works when the company has a working internet connection. Since Amazon is using Xen, you can find tools that will convert between Xen images and Amazon AMI files. For Debian Linux, that maverick James Gardner has blogged about this. So, you could set up a local server to host Xen and you could then have a piece of the cloud in your data center.

Running SQL Server is a costly proposition. Enterprise Edition is $25K per CPU. Microsoft is planning to upgrade on a 2-3 year cycle, and they don't intend to support servers more than two generations old. So your $25K/CPU is a recurring fee that will average around $5K/year per CPU. If you want, you can run SQL Server on Amazon EC2. There is a premium for this. Let's assume that you will need a large server. For Windows alone, this is a $0.50 per hour fee, but adding SQL Server Standard increases this to $2.20 per hour. For one year, a hosted SQL Server 2008 Standard instance works out to about $19K. This isn't crazy for a managed server, but it isn't a cost savings for most companies over running a Windows server in-house. So why is there so much buzz about the cloud?

With a few changes, we can save a great deal of money. For the database engine, consider PostgreSQL. It is fast and stable. It has a carefully reviewed code base. There are drivers for ODBC, JDBC and ADO.Net, so your applications can still access the data. Because PostgreSQL is open source, you can license it for $0. On Amazon EC2, you can run a medium server instance with Linux for $0.40 per hour. You can also take advantage of Reserved Instances. For $1,400 you can reserve a server instance for three years, which drops the per hour fee to $0.12. Three years now costs $4,555.76 ($1,400 up front plus 26,298 hours at $0.12). So for one year, a hosted PostgreSQL instance works out to just under $1,520. The cloud loves open source.

To be fair, SQL Server also includes SSIS, which allows a variety of ETL services to be built and run via a GUI. I have been using Pentaho's Kettle for this. Pentaho has renamed it Pentaho Data Integration (PDI), and it can be downloaded for free. So, with PDI and PostgreSQL, you have the same basic functionality as SQL Server; it just saves $17,766.61 per year for a database server. You still get 99.95% data center uptime.

Note: I have not added S3 storage for archiving databases or transaction logs. This will increase costs, but it does so for both servers.

Sunday, August 16, 2009

PostgreSQL: the EnterpriseDB webcasts

PostgreSQL is becoming one of my favorite databases. Postgres is the direct descendant of Ingres; both were created at Berkeley by Michael Stonebraker and Eugene Wong. Other commercial databases, including Sybase Enterprise Server and Microsoft SQL Server, also trace roots back to the Ingres work. PostgreSQL has the advantage of a very clean code base, as demonstrated by Coverity's open source scans. A limit for many open source projects is getting access to paid support. In the case of PostgreSQL, there is commercial support from EnterpriseDB.

One of the gems of the EnterpriseDB support is the wide range of webcasts about PostgreSQL. Because PostgreSQL is open source, the developers present detailed information about the internals of PostgreSQL and best practices for database administrators and developers.

SQL Server and clustered indexes

When a database reads a range of values, a very common case, you may need to read many rows, and IO usually becomes the performance limit. These values reside on your data storage, which is almost always a hard drive or a RAID array of hard drives. To make the IO faster, there are some basic strategies.

First, you can make the data more compact. As an extreme example, consider the case of Decisionmark's Census Counts. This product provided a database of US Census Bureau data distributed on a CD. Watcom SQL Anywhere, now Sybase SQL Anywhere, was used in part because we could compress each row. Compared with a hard drive, a CD has very slow data reads, so it was faster to read compressed rows and unzip the data than it would have been to read the same data uncompressed. Since the census data was read-only, we didn't need to worry about write performance.

If you read about modern database research, you will see that compressing data is one significant strategy for improving performance. For example, in MonetDB/X100 - A DBMS In The CPU Cache, Zukowski and co-workers report:

Figure 5 show the speedup of the decompressed queries to be close to the compression ratio, which in case of TPC-H allows for a bandwidth (and performance) increase of a factor 3.

So this is still an area of ongoing work at the frontier of research. The goal is maximizing the amount of data in each block of storage, so that IO is minimized and performance is maximized.

A second strategy is to use indexes. An index is a small bit of data that helps you quickly find the larger data recorded in your rows. An index is built from the values of some subset of each row, e.g., one or a few columns. For each value of the indexed columns, you have a pointer to the row that holds the data. If you have a hash index, the pointer leads directly to the row; if you have a b-tree index, you may have to traverse a few nodes in the tree before you actually get a reference to the data row. If there is an index that is relevant for a query, it can speed that query significantly, either by reducing the IO needed to find the data (the database engine might otherwise have to read the entire table) or, in many cases, by allowing the database to avoid reading the table at all and simply use the index. Careful choices of indexes can vastly improve performance.
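As a hedged sketch of both cases (a hypothetical customer table, standard SQL):
create index idx_customer_last_first on customer (last_name, first_name);

-- the index narrows the search to matching rows instead of a full table scan,
-- and since every selected column is in the index, the engine can often
-- answer from the index alone without touching the table
select last_name, first_name
from customer
where last_name = 'Smith';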

A third strategy to improve read performance is to physically order the data so that the data for your queries is adjacent on the hard drive. Unless you are using a solid state drive, you need to move the head and rotate the platter to read data. If the database can read adjacent blocks, it will read faster. In a typical query, the database engine will use the index to build a list of blocks that need to be read. If the blocks are sequential, they can be read more quickly.

After laying this groundwork, I will now discuss what seems to be a fundamental problem with SQL Server's implementation of table ordering. In many databases, there is a command that allows the records in a table to be ordered by an index. For a concrete example, consider a department store chain. The chain has a computer system that records each sale in a purchase table. The chain has several stores, several departments, many sales representatives and many clients. To maximize the performance of our purchase table, we may want to order the records by (store, department, sales rep). For example, in PostgreSQL, you can create an index with create index index_purchase_store_dept on purchase (store_id, department_id, order_id). You can then order the table records with the command cluster index_purchase_store_dept on purchase. The database engine will then order the data in the purchase table using the index. This could significantly speed up reports by store or department. DB2 has a similar concept: you can define an index to be the clustering index for a table, and you can then reorganize the table using the REORG utility. Oracle takes a slightly different tack: a cluster is defined in a schema. You can include up to 32 tables in the cluster and you can define an index to order the tables in the cluster. This means that not only are the rows in a given table close together, but rows from all of these tables will be close together. This takes the idea of clustering to a new level. Donald Burleson, author of Oracle Tuning, describes how these clusters are used. When records are added to any table in the cluster, they are placed into overflow blocks if there is no more room in the block specified by the index. When the overflow blocks reach the percentage specified by PCTFREE, the records added to the tables get reorganized by the index.
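For reference, here are the PostgreSQL commands from the example above collected into one snippet (the purchase table and its columns are hypothetical; the CLUSTER syntax shown is the older form used above, which PostgreSQL still accepts):
create index index_purchase_store_dept
    on purchase (store_id, department_id, order_id);

-- physically rewrite the purchase table in index order
cluster index_purchase_store_dept on purchase;

-- refresh planner statistics after the physical reorder
analyze purchase;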

SQL Server, as well as MySQL, makes clustering more convenient by letting you define an index as clustered. You may declare one index per table to be the clustered index. To maximize the performance of our purchase table, you might want to order the records by (store, department, sales rep), or perhaps by (customer). When records are inserted, they are inserted at the position specified by the index. As carefully described in a blog by Sunil Agarwal, a Program Manager in the SQL Server Storage Engine Group at Microsoft, this will result in a severely fragmented table.

Instead, Microsoft urges the user to define the primary key as an auto-generated integer and to cluster the table on the primary key's index. In other words, order the table chronologically. This seems to fundamentally contradict the concept of clustering: the records are 'clustered' in groups of one, and the groups must be in chronological order. I checked some tables that followed this rule, and they all had 0% (or at least under 1%) fragmentation on the clustered index, even after a year of operation without maintenance.
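A hedged T-SQL sketch of that recommended pattern (hypothetical table and index names):
-- cluster on an ever-increasing surrogate key, so new rows always append
-- to the end of the table and the clustered index does not fragment
create table purchase (
    purchase_id   int identity(1,1) not null,
    store_id      int not null,
    department_id int not null,
    sales_rep_id  int not null,
    customer_id   int not null,
    sale_date     datetime not null,
    amount        money not null,
    constraint pk_purchase primary key clustered (purchase_id)
);

-- the (store, department, sales rep) access path becomes an ordinary
-- nonclustered index rather than the clustered one
create nonclustered index ix_purchase_store_dept_rep
    on purchase (store_id, department_id, sales_rep_id);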

Recently, I was asked to improve the performance of a set of queries run against a SQL Server 2000 instance. A review of the index statistics showed severe fragmentation, for the reasons outlined by Sunil Agarwal. To see if defragmenting would matter, I followed the advice in Microsoft SQL Server 2000 Index Defragmentation Best Practices. I defragmented the tables used in one particularly slow query and was able to reduce the query time from over 4 minutes to just under 10 seconds. With a rewrite of the query, the query time dropped to under a second. However, after only two days of transaction processing, the clustered index had a fragmentation of over 40%. The query times were not back to the 4 minute level, but the formerly sub-second query was taking more than 10 seconds. In this case, the indexes are fragmenting faster than we can reasonably defragment them, since we cannot take this server offline daily to reindex the tables and using DBCC INDEXDEFRAG is simply too slow.
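For reference, a hedged sketch of the SQL Server 2000 commands involved (the object names are the hypothetical ones from the sketch above; the Microsoft paper covers the trade-offs):
-- report fragmentation for all indexes on the table
dbcc showcontig (purchase) with all_indexes;

-- online defragmentation: gentle, but very slow on large, busy tables
dbcc indexdefrag (0, purchase, ix_purchase_store_dept_rep);

-- full rebuild of all indexes on the table: effective, but needs a window
dbcc dbreindex ('purchase', '', 90);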

My conclusion is that in order to make clustered indexes useful, you need to be able to append a significant number of rows to a table and then order those records using the clustered index. Microsoft's solution of inserting each record at the location specified by the clustered index can rapidly cause severe fragmentation. The solutions of PostgreSQL, Oracle and DB2 avoid this issue, but at the cost of additional maintenance.

Friday, July 10, 2009

Microformats: Reuse, don't reinvent

Orbitz engineer Mark Meeker has an introduction to microformats with examples.


This is one of the best introductions I have found.

I also found this: Even Bill Gates wants you to use Microformats.



After some reflection, it seems that building an open source object model based upon hCard for people and organizations, hCalendar for events, and rel-tags could be useful for many projects. How often do you end up reinventing classes for people, addresses and so on? With rel tags, you could define all sorts of relationships between people and institutions: you can add 'subsidiary', 'employee', 'retailer' and build models for a wide range of business needs. If there were a simple object model, with some basic implementations in a few key frameworks, you could start coding from that rather than from scratch. This would also be useful for learning and comparing frameworks. Presumably, you could get a REST implementation, an XHTML view, and WS-* web services almost for free. You would also get the services in a format that would help others build mash-ups with your data.

Monday, July 6, 2009

The Story of REST: Representational State Transfer

Joe Gregorio has provided a lucid description of REST on YouTube.
He has also provided a companion video on the Atom Publishing Protocol, which the Wikipedia article on REST describes as a canonical RESTful protocol. So Atom provides a great example of a real-world system that uses REST, and Joe describes it clearly.


If you want a clear understanding of how to use REST, this may be your best use of half an hour. If all you want to do is grok REST, Ryan Tomayko's How I Explained REST To My Wife is a classic, and it only takes five to ten minutes.

If you have more time and really want to understand in greater detail, there is Roy Thomas Fielding's dissertation, Architectural Styles and the Design of Network-based Software Architectures.

For an actual implementation of a REST service, I am keen to use Grails. Grails has built-in support for REST. Grails also offers content negotiation. In HTTP, part of the request is a specification of the media type, and through content negotiation the user agent can specify which format it prefers. This means that a URL for a person could return a portrait as image/jpeg, an hCard page as text/html, or the vCard data as text/json, text/plain or text/xml. You could also use content negotiation to specify the language for the response. Suddenly a URL seems to be the locator for a universal resource, as well as being the universal locator of a resource.

While it may be feasible to support multiple mime types at a single URL, it is not trivial to provide multiple representations of the same item. Even converting between two apparently similar data formats has surprising complexity, as discussed in Convert Atom documents to JSON at IBM developerWorks. developerWorks also has a series of papers on Grails, with specific examples of using Grails with Atom syndication.

Thursday, July 2, 2009

What HTTP reveals about your browser.

When surfing the web, it is easy to feel that you are a relatively anonymous consumer of content. However, the HTTP traffic between your browser and the web server is a two way street. Henrik Gemal has provided browserspy.dk, which runs a series of queries that find out more about your browser than you probably know. Much of this information is potentially quite useful to the web site. The classic example is the HTTP_ACCEPT_ENCODING header that tells the server whether your browser can accept compressed data, which can significantly reduce the size of a page. The other classic use is to identify Internet Explorer, the bane of JavaScript and CSS authors. But this is just the start. By knowing which version of Flash is installed, YouTube can warn you if you need to upgrade to view their video content. By sensing color depth and window size, a web site could determine an optimum image for me. This would be especially useful on a mobile device, where throughput and CPU limit battery life.

I would really like it if the geolocation information could be used to set the default country, state and city in web forms. In my case, the geolocation would have gotten me to Iowa, but it would have placed me in Hiawatha rather than Marion. There is another geolocation demo that gets closer, but it is still off by about 3 miles. I would like to be able to set a location, address and hCard info and have the option of using that on web forms. I would encourage more browser providers to support the navigator.geolocation object in the W3C Geolocation API.

But in my opinion, the scary information is from a CSS exploit page. This exploit was covered today on Slashdot. Web 2.0 Collage will produce a collection of favicons of sites you have visited. What is most surprising to me is that this exploit by Brendon Boshell doesn't even require JavaScript. He has a JavaScript version as well, which he describes in detail. So, unless you use one of the 'stealth modes' that don't record history, anyone can be checking to see if you have visited a particular site. Think about how that could facilitate a phishing attack.

Tuesday, June 30, 2009

V8 - the extremes

When it comes to the V8 Benchmarks, Chrome is still king, even on a Mac. Despite the fact that this is running under Parallels, Chrome 2 is almost twice as fast on the iMac as it is on the Dell at work.

Here is the real surprise. While every other browser ran faster on the iMac than on the Dell, IE8 actually slowed down. But the slowdown is almost entirely due to the final test, Splay. Google describes this test:
Data manipulation benchmark that deals with splay trees and exercises the automatic memory management subsystem (378 lines).
So it seems that the VM on the Mac can have performance issues with memory management.
If there is a problem with memory management, perhaps I should shut down all of the other applications and see if the numbers improve. Yup. Look at Splay: it went from 0.397 to 143.


So, running Parallels usually works fine, but in this case, it seems important to help the memory management along by closing down other applications. Still, it is IE8 which is far behind the other browsers in terms of JavaScript performance. Hey Microsoft, any chance you can keep up with the pack?

More V8 - on an iMac

After running the V8 benchmark suite at work, I am repeating it at home. Home means an iMac with a 2.93 GHz Intel Core 2 Duo and 4 GB of RAM. I'm running Leopard (10.5.7). This time, I was more interested in looking at the improvements in Firefox 3.5 relative to Firefox 3.

But first, here are the results for Safari 4, this time running on OS/X.
For some reason, Blogger reverses the order of the images, so here we have the results for Firefox 3.5. There are two runs, one under OS/X and the other on XP. The XP version is running under Parallels. It sure looks like the VM running XP is very efficient: the Firefox numbers are within a few percent of each other. They are also twice as fast as on the Dell Optiplex at work, even though the memory and clock speeds are quite similar.

Finally, we see the results for the old Firefox 3. The upgrade almost doubled the speed, so kudos to the Firefox team for the improvement in JavaScript performance. But we still have to recognize that the WebKit-based browsers are really dominating the JavaScript performance numbers.


V8 Benchmarks





Google's V8 Benchmark Suite is easy to run. Here are the results for several new browsers on my workstation, which is a Dell Optiplex GX620 with a 2.79 GHz Pentium D and 3.49 GB of RAM. Wow. I didn't expect to see WebKit being this much faster. If my brand new Firefox 3.5 is given a relative score of 1, Safari 4 on Windows has a score of 7.5 and Google Chrome has a relative score of 8.6. Internet Explorer 8 has a score of 0.21. The WebKit browsers have more than an order of magnitude better performance than Microsoft's flagship browser, IE8.

So, is Microsoft that bad at writing a JavaScript interpreter, or are they trying to move us away from web standards like JavaScript and toward Silverlight?

Friday, June 19, 2009

Jeff Jarvis and what to do next

Jeff Jarvis has more gray hair than me, but he really seems to understand the Internet. He may be best known as the author of What Would Google Do? You can read it online. Fora.tv has some lectures of his:
After listening to him, I am getting jazzed to try some Web 2.0 projects. He advocates being small, but being part of something big. So, what are some big things that we can expect to see? Here is something that comes to mind:

We all like images, and it's probably the case that we have pictures all over the place that we would like to access. I have pictures on Picasa, a friend has images on MobileMe, my wife has images on SmugMug and Flickr. My kids have images on FaceBook. This is probably pretty common. I don't want more ways to store images, I want a way to reference and search these images.

For images, I would like to have a linker site that lets me reference feeds, online albums or even individual images on the Internet. Many of these sources provide titles, descriptions and other image metadata. Many images also have EXIF or IPTC metadata. What would be great would be a simple web interface that lets me subscribe to several different sources and then publish a unified feed. Give the feed a URL and share it with friends and family.

Even better, allow images to be linked to URLs that identify people - hCard descriptions, home pages, FaceBook pages and so on. Then, you tag images (or better yet, some section of each image) with links to a person. The tags should have some 'types', like 'photographer/owner' and 'model/subject', that show the relationship between the person and the image. Some keyword links to images would also be great. Then I could build feeds for all images linked to family in the last six months. Each of these dynamic albums could be given a URL on PubSubHubbub and expose the results as Media RSS. Then, anyone with a Media RSS viewer could see the feed in a browser. Open Iris and Slideshow Pro have innovative Media RSS viewers.

Wednesday, June 17, 2009

Software Tutorials on Google Wave: Eclipsy

Google Wave absolutely floors me, but it isn't available yet. I think that Wave is going to revolutionize teaching software. Consider software tutorials. First, we need something like an Eclipsy, a hypothetical Wave agent/Eclipse plug-in that allows Eclipse to be integrated into a Wave. As one developer goes through the steps in a tutorial, a second developer (or even the same developer with a microphone), could be giving a verbal description of the actions.

Once the expert has successfully performed the tutorial, a student could play it back and see each step in the process. I see major advantages to using Wave. Perhaps you could play back the development in your own Eclipse while listening to the narration, with Eclipsy in a playback mode. At the end of this, you would know that the process worked with your configuration. Assuming that it did work, you could then go through the steps yourself while listening to the audio and perhaps watching the steps in Eclipse running on a virtual machine as you copy the steps at home. (Two monitors would be nice for this.) The ability to pause during the playback is a big advantage of a Wave solution.

With many Web 2.0 tools like Grails, you are iteratively developing your application. For example, you might start by defining all of your domain objects and use scaffolding to build the controllers. You would then define the relationships between all of the objects. Next, you might add more attributes. To complete the domain objects, you could go in and add the constraints. This sort of iterative approach seems perfect for the as-yet-hypothetical Eclipsy.

Typically, you would also follow an iterative approach to evolving the views and the controllers. Much of the skill in this approach is having a feel for what to do in each iteration. Eclipsy would be a good way to develop this.

In a complementary role, Eclipsy seems like it could be used for version control. At the very least, Eclipsy would be tracking the state of each source code file and your project configuration. This means that it could also function as a version control system that you could replay to get to any state in your development.

Tuesday, June 16, 2009

IE8: on Acid3




Just for fun, I tested Internet Explorer 8 on the Acid3 test. For completeness, the test was run on the morning of June 16, 2009. Today, Opera 10 was released. Opera 10 and the recently released Safari 4 have attained a score of 100% on the Acid3 test. I also tested Chrome, release 2.0.172.31, and it also scored 100%. Is browser conformance breaking out? Alas, no. Just to prove what I say, I'm including bitmaps of 'About Internet Explorer' at the time of the test. As you can read, I am using version 8.0.6001.18702. This is running on XP, as you can probably guess from the title bar.
During the test, I was asked if I wanted to let an ActiveX component run; I believe it was for XML processing, but I foolishly clicked OK before recording the component name. The test looked like it had completed at 12%. I was surprised to see what appears to be an HTML text area suddenly pop up.

After several seconds, the score started to creep up, finally reaching 20%. This is consistent with the scores reported at Anomalous Anomaly, which has much more complete Acid3 test results. Finally, I tested Firefox 3.0.10, which produced a score of 71%. As expected, this is also consistent with Steve Noonan's results at Anomalous Anomaly.

I also checked Acid2; all the browsers scored 100%. Perhaps there is still hope that both IE and Firefox will be made more standards compliant so that their Acid3 scores can match their Acid2 scores.
[followup on July 20, 2009] Firefox 3.5 is making rapid progress. With Firefox 3.5.1, I have an Acid3 score of 93. My IE scores have remained the same.

Saturday, June 13, 2009

Richer web interfaces

Web interfaces are improving all around. With tools like GWT and OpenLaszlo, it is possible to build rich web applications with GUIs that at least match stand-alone applications. Some of the applications that I am finding innovative include:
  • Pentaho BI Server. Their new Mantle UI is built with GWT and is a huge jump beyond the old interface. It just behaves like a stand-alone application. Some of the components, especially jPivot, could use an update, but the Server platform is in a great place to organize all of these new features. With the modular design of BI Server 3.0, Pentaho users will be seeing a great deal of new reporting and analytic tools they can plug in.
  • G.ho.st, the globally hosted operating system, is a virtual computer that you can access from anywhere. You have a full GUI, built in OpenLaszlo, apparently by their team of only 30-40 staff. Ghost seeks to provide a free, web-based virtual computer for anyone in the world. The UI is fast and simply doesn't feel like a browser. Have a look.
  • LZPIX is another OpenLaszlo application for viewing photos on Flickr. The link is to the Flash version; there is also a DHTML version, and you can see the source code as well. On a side note, open the application and search for 'toureiffel': the photo Paris s'éveille is magnificent!
  • Maple is another interesting tool to view photos, but it is designed for multimedia slide shows. It can run as either a Java application or a Java applet. It uses Java 6, and my shiny new Mac only has Java 5, so the only way for me to view this is to run a VM with either Windows or Linux so I can install Java 6, which was released in December 2006. Come on Apple, is it not possible to get the bugs out in 2 1/2 years!? Oh well, one more reason to hope for Snow Leopard. I just hope that they have a Java 6 upgrade for Leopard, since I can't run Snow Leopard on the older G4/G5 Macs. I was going to ask why Java applets didn't catch on like Flash applets, but I guess when you have trouble accessing a consistent Java platform across OS/X and Windows, the answer is clear. This is unfortunate, as Maple is a great slide show viewer. Java really does work; it is just a shame that it has been held captive by Apple and Microsoft.
  • Ben Fry's zipdecode. Try this and tell me why this shouldn't be part of any application that needs a zip code. It should be easy to gather the data needed to make this international. If I am filling out a web form, why do I have to type in 52302 and then choose Marion, Iowa? There are some zip codes that serve multiple communities, but in that case, I would just need one extra click to pick the city. The only thing I would add to this applet would be a semi-transparent zip code on top of the map.
What we need to do is develop simple, REST-based tools that we can easily drop into web applications. If your application resides on the web, you should be able to simply reference well-designed tools. The current practice of installing web apps on your application's server seems to undercut the promise of the web.

For a coherent theory of how software should be designed for people, I recommend Bret Victor's Magic Ink. He has thought deeply about this, while I am just providing examples.


In some ways, it is possible to outdo stand-alone applications.

Friday, June 5, 2009

Google I/O viewed from IOwa.

With some software presentations, you are overwhelmed by the show, but afterwards the gee-whiz wears off as you start to piece together what is going on. But the more I think about Google Wave, the more the ideas are growing on me. At this time, hardly anyone has heard about Google Wave, but that is going to change. Within a few years, Google Wave could be recognized as one of the few paradigm-changing applications, much like hypertext going mainstream was for the growth of the web in the 1990s.

In the developer preview, there was no lack of eye-popping features:
  • Drag and drop photos from iPhoto into the browser. Really cool use of Gears and GWT. Watching the photo upload automatically and appear on the other browsers in seconds was a direct reminder to me that we are only beginning to understand how the web can connect us.
  • Watching real-time translation between French and English shows how we are going to be able to interact more freely in wider communities.
  • The multiple, concurrent editing of a single document shows how much power there is in concurrent versioning systems.
  • The use of associative memory, which resides not only on the server but is shared with each participant, is part of the secret sauce that makes it seem that everyone on the wave is 'together'.
This just seems like it is the next phase in the evolution of the world wide web. Before this, the web was a bunch of places to go. With Wave, it will become a bunch of events to join, review and create.

Friday, May 15, 2009

Innovation in the computer industry

Most innovations are combinations of existing ideas; this is how innovation works. In a recent ZDNet blog, Larry Dignan examined how effective Microsoft, IBM and others are at profiting from their R&D spending. The article seems quite reasonable and the claims seem well supported.
But in the comments, there was a recurring meme claiming that there is no innovation in FLOSS software. For example, mikefarinha claims that "all of the big name OSS projects" exist to steal market share from Microsoft. His list of FLOSS projects is Firefox, Open Office, Samba, WINE and Lindows. For starters, Lindows is hardly a major FLOSS project; I would list it well behind Apache, any of the BSDs, Linux and OpenJDK, none of which made his list of 'big name projects'.

Firefox comes from Mozilla. Mozilla comes from Netscape. Apparently, Firefox exists to take market share away from Internet Explorer. If you look at the User-Agent string from Internet Explorer, you will read
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Interesting: it looks like Internet Explorer is emulating Mozilla. If you really want to look for who invented the web browser, you will discover ViolaWWW. Rather than running on Windows, ViolaWWW ran on Unix and the X Window System. You may recognize X Window, since X.org is a 'big name' FOSS project.

OpenOffice.org is simply the open source version of StarOffice. StarOffice began with StarWriter, just like Microsoft Office began with Microsoft Word. But StarWriter was originally a German word processor for the Zilog Z80 and CP/M. CP/M is an operating system that was originally developed by Gary Kildall at Digital Research. CP/M is the ancestor of several DOS systems, including MS-DOS. In Microprocessor Report (Vol 8, No. 13, October 3, 1994), John Wharton concluded "The Origins of DOS" with:
The strong impression I drew 13 years ago was that Microsoft programmers were untrained, undisciplined, and content merely to replicate other people’s ideas, and that they did not seem to appreciate the importance of defining operating systems and user interfaces with an eye to the future. In the end it was this latter vision, I feel, that set Gary Kildall so far apart from his peers.
Not exactly a rousing defense of software innovation at Microsoft. So, we find that StarWriter was developed for an OS that predates MS-DOS. If I recall correctly, the GUI version of MS Word was actually developed for the Macintosh. So it is not clear to me how OpenOffice.org is an example of a Microsoft innovation that others copied; it can trace its code base back further than MS Word can.

Next, we have the case of Samba. Samba implements the SMB protocol, which was created by Barry Feigenbaum at IBM. Microsoft implemented a heavily modified version of SMB. Andrew Tridgell, and the Samba team, worked to create a version of SMB that worked with DEC Pathworks. It was only later that they tried to understand Microsoft's undocumented modifications so their SMB server could work with more of the computers on their employer's network. So it seems that trying to achieve file sharing between Windows and Unix is just an attempt to steal market share. To my way of thinking, when Microsoft added undocumented changes to an otherwise open protocol, it was an attempt to achieve vendor lock-in and steal market share from the rest of the industry.

That leaves only Wine, a compatibility layer that allows *nix computers to run Windows applications. Personally, I haven't had much luck with Wine, so I don't see how Microsoft has much to worry about from it.

In addition to mikefarinha, we have Rick S._z who states
But here's a TECHNICAL creation which changed the computing world, and was almost totally invented by Microsoft: truetype fonts. Before MS built Windows 3.1 around them, no one had thought to use the SAME fonts for your printers and your screens. Fantastic idea, and implemented beautifully.
The time is about right, but the source of the innovation is wrong. TrueType fonts were developed by Apple, as noted by Microsoft.

I am not actually just an 'anything but Microsoft' zealot, but it is just wrong to state that FOSS does nothing other than 'steal market share from Microsoft.' For example, Microsoft did release NT, which does represent several significant innovations over DOS. According to the New York Times, the original name for NT was OS/2 3.0; OS/2 was an operating system developed jointly by IBM and Microsoft. Microsoft did a great deal to improve the network stack. Namely, they used the BSD Unix network stack; see the Defcon archives for evidence to support this claim. There is nothing wrong with this, but you have to admit that the innovations for the network stack came from those open source Unix developers at Berkeley. Just for giggles, open ftp.exe (it's in your C:\Windows\System32 directory) with Notepad. You will see the BSD license. Microsoft is not amazing because they invented everything; they are amazing because they integrated innovations from all over the world.

Remember Pablo Picasso's observation, "Bad artists copy. Good artists steal." If you want to be creative, you have to steal the good ideas.

Tuesday, May 5, 2009

Choosing SCD Type at Run Time

When a data warehouse is designed, the architect must choose a dimension type for each dimension in the warehouse.  The two most common types identified by Ralph Kimball are the Type 1 SCD, in which changes to dimensional attributes are overwritten, and the Type 2 SCD, in which changes are not overwritten but are recorded as multiple records for each object in the dimension table.

After using Pentaho's PDI, it became clear that you can specify the dimension type at the attribute level as long as the underlying table is a Type 2 SCD.  Still, these decisions need to be made when the dimension table's ETL is being designed.

What happens if you would rather delay the decision until run time?  This flexibility can be provided with an auxiliary table for each dimension table.  The auxiliary table links each record in the dimension table with the most recent record for that object.  With that link defined, you can easily set up a view over a Type 2 dimension table that behaves like a Type 1 dimension.
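A hedged sketch of what I mean, assuming a hypothetical customer_dim Type 2 table whose primary key is id, with columns guid, name, date_from and date_to:
-- auxiliary table: one row per dimension record, pointing at the most
-- recent record for the same business key (guid)
create table customer_dim_current (
    id         integer primary key references customer_dim (id),
    current_id integer not null references customer_dim (id)
);

-- a Type 2 table viewed through the link behaves like a Type 1 dimension
create view customer_dim_type1 as
select hist.id, curr.guid, curr.name
from customer_dim hist
join customer_dim_current link on link.id = hist.id
join customer_dim curr on curr.id = link.current_id;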

I am surprised that I don't find any references to this in standard data warehouse references or in online articles.  If anyone can point out a reference to this, I would be grateful.

Friday, April 17, 2009

Learning to program on OS/X

Ages ago, I used HyperCard on a Mac SE.  Now, I'm trying to get back into programming on an iMac running Leopard.  My first observation: Holy cow! This is a fast workstation.  It absolutely blows away my workstation at work.  At work, I'm (still) running XP with Office 2003.  Outlook is a complete hog: between Outlook, Windows checking for updates, and virus scanning, just logging in takes a few minutes.  All day long, I listen to my hard drive chatter as I work on a Pentaho BI Server site, once I get Tomcat, Eclipse and a browser or two running.  I boot up, log in, and go get coffee.

But on my Mac, NetBeans 6.5 comes up faster (under 15 seconds).  Learning Grails is a joy with a workstation that responds quickly.  I am quite happy to pay my 'Apple Tax': I get OS/X, iLife and iWork.  I also get a faster JVM.  The biggest downer is that I can't run IE, so it's hard to tell if my valid web pages will break on the world's dominant browser.

To get started, I followed these instructions at ShiftEleven to get up and running with Apple's developer tools, the MacPorts tools and PostgreSQL.  However, I didn't want to start PostgreSQL by default.  So rather than using launchctl, I'm just going to type
sudo -u postgres /opt/local/lib/postgresql83/bin/postgres -D /opt/local/var/db/postgresql83/defaultdb
in iTerm.  

Saturday, March 28, 2009

Dimension tables in data warehouses: Type 2+1

Introduction
In condensed matter physics, a system is often characterized by the dimensionality of the crystal lattice. A bulk crystal has a 3D lattice, while a surface has a 2D lattice. Crystal growth occurs on the surface. There are times where surface growth is described as being 2+1D. I used to research this.

In data warehouses, data is stored using a data cube. When a data cube is stored in a relational database, a star schema is built from fact tables and dimension tables. In a fact table, the facts are stable and do not have any time dependence: facts are facts. All of the time dependence is recorded in the dimension tables.

The problem: with example

To make this more concrete, consider a simple data cube for a swim conference. The facts will be the race results. The dimensions will be a swimmer dimension, an event dimension and a meet dimension. Race results include the time recorded. The results are linked to the dimension tables. So, we can build reports on a given swimmer, even if they change teams. Swimmers have a name, photo and birth date. Swimmers also have a link to their current team. Teams have a name, coach, city and league. (This example gives us a simple 'snowflake dimension'). Meets have a location, date, league and season. Events include a stroke (backstroke, butterfly, breaststroke and freestyle) and distance. For now, we will ignore team events, so the schema is quite simple.

This is for a Web 2.0 swim league, so we also have to create an iPhone web app to record all of the results during a meet. We can even be cool and have a URL to a 'photo finish' image for each race, pictures for each swimmer, and a Google map showing the location of each meet. After each meet, we can then run a Kettle job to move all of the data into our data warehouse running on Amazon EC2 at 10 cents an hour.
In The Data Warehouse Toolkit, Ralph Kimball has described three types of time dependence for dimension data:
  1. Type 1 - the past doesn't matter. When data in the dimension tables changes, we just throw away the old data and use the new data. This is appropriate for correcting errors. If we find that we should have spelled Vicky rather than Vickie, we just change the name in the production database for our application. The ETL process will then change the swimmer's name in the data warehouse.
  2. Type 2 - slowly changing dimension. In this case, a single entity in the production system gets mapped to multiple records in the dimension table. At any moment in time, there is only one record that is valid for the entity. When we change the description in the production system, the ETL job will create a new record in the dimension table. If our iPhone-toting parents take a new photo of Vicky every month, Vicky will get a new record in the swimmer dimension table. When we look at the results of a race where Vicky was swimming, we can see what she looked like at the time of the race.
  3. Type 3 - keep a few versions. In this design, we end up with multiple columns for a given attribute in the dimension table. Usually, this is used so we can have a 'current' value and a 'last' value. This is used for something like a company reorganization. Vicky is a superstar, and in the middle of the season she is forced to move to a really weak team to make the league more equal. The parents are upset, so they track Vicky with the current team assignments, but they also track her beginning-of-season team. This example may seem silly, but similar reorganizations happen all the time within companies.
As the database architect, what are you to do? The first thing to realize is that you have to decide on the time dependence for each field in the database. In our case, we want a swimmer dimension with 'name' as a Type 1 attribute, 'photo URL' as a Type 2 attribute and 'team' as a Type 3 attribute. To further complicate matters, we didn't even know that the team should be Type 3 before the Reorganization Controversy.

In physics and chemistry, we talk about reversible and irreversible transformations. In a computer system, erasing data is irreversible. From this perspective, we see that only the type 2 time dependence does not erase data: type 2 is reversible in the sense that even after we change the data in production, we can still reproduce the old data exactly - if a report was run last year, we can get exactly the same results running the report this year. A type 1 dimension throws data away immediately and a type 3 dimension will throw data away eventually (after two team changes in the example). Since a data warehouse is supposed to warehouse data, I have an immediate preference for the type 2 time dependence.

But as a practical matter, people often prefer the 'current' results, not the results 'at that point in history'. For example, let's say that we have a swim team dashboard that lets you click on a photo of each swimmer to see their stats. Almost certainly, you want to have only one picture of Vicky that links to her entire history, not simply her history since that picture was taken. When designing warehouses, I am constantly having to choose between 'current' and 'historical' dimension data.

The New Solution: Type 2+1 dimension

In order to explain this, I need to define the structure of the production database and the data warehouse for the swimmer. In the production database, the swimmer table will have the following columns:
  • GUID, an id for each swimmer that is unique for every swimmer in every league that is using my Swim 2.0 web application. This is the primary key of the production table.
  • Name, a string with the swimmer's name. There could be separate fields for first and last names, but that doesn't matter for this discussion.
  • Birthdate, a date field.
  • Photo URL, preferably a URL class
  • Team GUID, a foreign key tying each swimmer to their current team
In our data warehouse, our swimmer dimension is also represented by a database table. In a table designed for Type 2 data, we would have the following columns (a DDL sketch follows the list):
  • GUID (same as above, the G is for global, and the data warehouse is part of our world of data)
  • ID, a unique integer determined by the database engine. This is the primary key of the swimmer dimension table.
  • version, an integer that starts at 1 and increments by 1 each time a new record is added for a given GUID. There is a 1-1 mapping between (GUID, version) and ID
  • start date and end date. These two columns define when a given version is valid.
  • name, birthdate and photo URL fields, just like in production
  • Team ID. Foreign keys reference primary keys, so rather than recording the Team GUID, we will record the Team ID. This may be a bad idea; if you have arguments about when to use the GUID and when to use the ID when dimensions reference each other, I would be happy to hear from you.
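Putting the columns above together, a minimal PostgreSQL-flavored DDL sketch (the data types are my assumptions, not a prescription):
create table swimmer_dim (
    ID          serial primary key,   -- surrogate key assigned by the warehouse
    GUID        char(36) not null,    -- business key shared with production
    version     integer not null,     -- 1, 2, 3, ... for each GUID
    date_from   timestamp not null,
    date_to     timestamp not null,
    name        varchar(100) not null,
    birth_date  date not null,
    photo_url   varchar(500),
    team_id     integer not null,
    current_id  integer,              -- the extra column introduced below
    unique (GUID, version)
);
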
The rule for a data change is fairly simple. In ETL, any changes in the production data are detected. When a change is found, the current version of the entity (i.e., the record with the same GUID) is end-dated: the current record is read and an end time is recorded. A new record is then inserted with the GUID, the attributes from production, a new ID and the next version number; its start date is set to the end date of the previous record and its end date is usually set to 'the end of time'.

In order to identify which version is current, the data warehouse architects at ETL-Tools recommend adding a current field, a Boolean set to true for the current value and false for all others. This field could be determined by a simple rule, such as now() between swimmer_dim.date_from and swimmer_dim.date_to or swimmer_dim.version = ( select max(version) from swimmer_dim sd where sd.GUID = swimmer_dim.GUID). (I use the default field names from Pentaho's Kettle and the date functions from PostgreSQL, but this is easy to rewrite for other environments.) As a practical matter, it is more efficient to add a current field so you can apply the rule once during ETL rather than each time you run a query that needs the current values.
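A hedged sketch of that Boolean variant, applied once at the end of each ETL load (the next paragraphs replace this flag with current_id, so treat it as the alternative being described, not my design):
-- 'current' is a non-reserved word in PostgreSQL; rename to is_current if your database objects
alter table swimmer_dim add column current boolean;

update swimmer_dim
set current = (now() between date_from and date_to);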

I argued that a 'proper' design should be reversible and that no data should be lost. Doesn't adding a current field that I am willing to 'throw away' each time I run ETL violate my theoretical argument? No. The reason is that the information in the current field is redundant. Since I can figure out which record would have been current at any time in the past, changing this field does not irreversibly lose data.

Rather than just recording a Boolean to mark which record is current, I add a field called current_id to my dimension tables: for every version of a swimmer, it holds the ID of that swimmer's current record. Since the rest of the table was already set up to support Type 2, a Type 2 view can simply ignore current_id. So, if you want a dimension with Type 2 time dependence, you can write
create view swimmer_dim_type2 as
select GUID, ID, version, date_from, date_to
, name, birth_date, photo_url, team_id
from swimmer_dim
If you want a Type 1 time dependence that returns the current values for any historical ID, you need only a single self join:
create view swimmer_dim_type1 as
select hist.GUID, hist.ID, curr.name
, curr.birth_date, curr.photo_url, curr.team_id
from swimmer_dim hist -- hist is each historical version
join swimmer_dim curr on hist.current_id = curr.id -- curr is the current version for the same GUID
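As a usage sketch, suppose a hypothetical race_fact table stores, for each race, the swimmer_id that was current when the race was swum. Joining it to swimmer_dim_type1 reports every race under each swimmer's current name and team, no matter how many teams the swimmer has been through:
select t1.guid, t1.name, t1.team_id, count(*) as races
from race_fact f
join swimmer_dim_type1 t1 on t1.id = f.swimmer_id
group by t1.guid, t1.name, t1.team_id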
This ability to act like a dimension of Type 1 or Type 2 is why I call this Type 2+1. But, we know that 2+1=3, so is there a way to get the Type 3 time dependence as well? Yes, but we will need to choose a point in time for the additional columns. In the swimmer example, we want the current team and the team at the beginning of the season. Let's assume that the season began on June 1, 2008. We can create our Type 3 time dependence with:
create view swimmer_dim_type3 as
select hist.GUID, hist.ID
, curr.name, curr.birth_date, curr.photo_url
, curr.team_id, june.team_id as starting_team_id
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.id
join swimmer_dim june on hist.current_id = june.current_id -- every version of a GUID shares the same current_id
where cast('2008-06-01' as date) between june.date_from and june.date_to
So, Type 2+1 can emulate Type 1, Type 2 or Type 3 time dependence. As I noted earlier, the time dependence should be determined at the field level, not the table level. If we want to always use the current name, the photo that was current at the time of each race, and the team as of June 1, 2008, we can write a view that does exactly that, without altering the ETL or the table structures (sketched below).
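As a sketch of that last mixed view (the view name and the aliases are my own):
create view swimmer_dim_per_field as
select hist.GUID, hist.ID
, curr.name -- always the current name
, hist.photo_url -- the photo that was current for this version
, june.team_id as team_id_june_2008 -- the team as of June 1, 2008
from swimmer_dim hist
join swimmer_dim curr on hist.current_id = curr.id
join swimmer_dim june on hist.current_id = june.current_id
where cast('2008-06-01' as date) between june.date_from and june.date_to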

I have tested this idea on larger data sets, and each view is almost as fast as a query of the underlying dimension table. In most data warehouse queries, performance is dominated by the reads and aggregates on the fact tables. This technique should be valuable to nearly any data warehouse architect, so I would be pleased to hear from anyone who finds it useful.

Saturday, March 21, 2009

Bayes' Theorem and Taleb's Silent Evidence

In Chapter Eight of The Black Swan, Taleb recounts Cicero's story of the drowned worshippers:
One Diagoras, a non believer in the gods, was shown painted tablets bearing the portraits of some worshippers who prayed, then survived, a subsequent shipwreck. The implication was that praying protected you from drowning. Diagoras asked, "Where were those who prayed, then drowned?"
The drowned worshippers, being dead, would have a lot of trouble advertising their experiences from the bottom of the sea. This can fool the casual observer into believing in miracles.
Taleb calls this the problem of silent evidence. We are drawn to success and we shun failure, so, as Taleb notes, nobody ever writes 'How I Failed to Make a Million Dollars on Wall Street'.

Suppose that one of our Wall Street success stories claims that in order to succeed on Wall Street, you need to be a Harvard graduate; this author cites himself and his college roommate, who have both made a million dollars on Wall Street. If you are in high school (or have a child in school), you need to choose a college, and you want to know whether a Harvard education is worth its rather considerable cost. If the author is right, the decision is easy: you will simply repay your student loans with the millions you make on Wall Street.

What information do you need to assess this claim? We could begin by looking for particular cases. Just as finding one black swan proves that not all swans are white, finding one millionaire from a state university disproves the claim that a Harvard education is required for success on Wall Street. Likewise, finding a Harvard graduate who failed on Wall Street proves that a Harvard education is not sufficient: being a Harvard alumnus is no guarantee of success on Wall Street.

Life is not certain, so we have to gamble. How can we play the odds intelligently? The sort of data that would help includes the resumes of both successful and failed Wall Street investors. As a practical matter, getting a list of the unsuccessful investors would be something of a trick. Taleb is right about silent evidence: we remember the winners and forget the losers. The winners advertise and the losers move on to something else.

In the age of data mining and databases, it should be possible to build such a list.
If we can gather the data, what framework should we use to decide whether going to Harvard would be a rational risk? The correct framework for assessing the validity of a claim is Bayes' Theorem from statistics. If you haven't heard about Bayesian statistics, a good starting point is An Intuitive Explanation of Bayes' Theorem by Eliezer S. Yudkowsky. In fact, Yudkowsky's explanation is so good that there really isn't any point in me writing more on this subject. There are also on-line course materials; currently, I'm working through the course materials that Jeff Grynaviski (at the University of Chicago) has provided. To learn more, there is David MacKay's Information Theory, Inference and Learning Algorithms. I especially like MacKay's book because it unites Bayesian methods with Claude Shannon's information theory and even includes elements of AI.
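To make the connection to silent evidence concrete, write M for 'made a million dollars on Wall Street' and H for 'went to Harvard' (my notation, not Yudkowsky's). The success stories let us glimpse P(H|M), the fraction of winners who went to Harvard, but the question facing the high school student is P(M|H). Bayes' Theorem relates the two:

P(M \mid H) = \frac{P(H \mid M) \, P(M)}{P(H)}

and the silent evidence, the traders (Harvard educated or not) who never made their million, is exactly what we would need to estimate P(M) and P(H).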

If you still have some time left after reading about Bayesian methods, read Yudkowsky's article Cognitive Biases Potentially Affecting Judgment of Global Risks. His assessment seems to be in general agreement with Taleb's. Since I live next to Cedar Rapids, which has just had a major flood, Yudkowsky's observation about flood damage was particularly revealing:

Burton et. al. (1978) report that when dams and levees are built, they reduce the frequency of floods, and thus apparently create a false sense of security, leading to reduced precautions. While building dams decreases the frequency of floods, damage per flood is so much greater afterward that the average yearly damage increases.

Wow, this is an extraordinary claim. If true, much of the work done to protect people in the Mississippi/Missouri River basins is not just useless but actually counterproductive. I will remain skeptical of this claim, but it does seem worthy of investigation before Cedar Rapids, Linn County, the State of Iowa and Federal agencies sink more money into flood control on the Cedar River. We need to find a more rational way to understand and manage risk. If we can improve that, we will have learned an important lesson from the catastrophic failures of 9/11, the Indian Ocean tsunami of 2004, the flooding of New Orleans following Hurricane Katrina, and the current financial collapse. We need Taleb's empiricism. When we have enough data to analyze, we should be using Bayesian methods for that analysis.

Sunday, March 8, 2009

Why Swamp Fox Analyst?

First, I live in Marion, Iowa, home of the Swamp Fox Festival, and I was raised largely in Marion County, Iowa. Both are named after Francis Marion. Of course, there are hundreds of places in the US named after Francis Marion. I am no historian, but even a quick read of his Wikipedia article or the Smithsonian's recent article reveals a fascinating character. He was a small child, something of the family runt, but a good student and fluent in French. At age 15, he survived a shipwreck after a whale rammed the schooner he was sailing on. At 25, he became a military officer in the French and Indian War. The following quote
The next morning we proceeded by order of Colonel James Grant, to burn down the Indians' cabins. Some of our men seemed to enjoy this cruel work, laughing very heartily at the curling flames, as they mounted loud crackling over the tops of the huts. But to me it appeared a shocking sight. Poor creatures! thought I, we surely need not grudge you such miserable habitations. But, when we came, according to orders, to cut down the fields of corn, I could scarcely refrain from tears. For who could see the stalks that stood so stately with broad green leaves and gaily tasseled shocks, filled with sweet milky fluid and flour, the staff of life; who, I say, without grief, could see these sacred plants sinking under our swords with all their precious load, to wither and rot untasted in their mourning fields.
suggests to me that he was a sensitive and compassionate man, but still a man who did his duty. He is most famous for his leadership in the American Revolutionary War. Following the Waxhaw Massacre, his band of about fifty irregulars was the only force opposing the British in South Carolina. He earned his nickname, The Swamp Fox, by outfoxing Col. Banastre Tarleton and his British forces. His tactics were forerunners of modern guerrilla warfare. In short, he accomplished a great deal with very little; he used intelligence to succeed. While I cannot claim to be following in his footsteps (I failed my military physical for acne scars on my chest and back), I can certainly take inspiration from his success.

There is also a cautionary note that I associate with Francis Marion. While we Americans, and even Mel Gibson, regard Marion as a hero, he was clearly viewed as a terrorist by the British. The historian Christopher Hibbert described him as a racist and a rapist for his treatment of the Cherokee. There is evidence that both views are correct. One man's terrorist is another man's freedom fighter.

Francis Marion was also alive during the Age of Enlightenment. His lifetime (1732-1795) overlaps the lives of the skeptic David Hume (1711-1776), the Reverend Thomas Bayes (1702-1761), and the French aristocrat Pierre-Simon, marquis de Laplace (1749-1827). As Nassim Nicholas Taleb notes in The Black Swan, Hume was an influential empiricist for the English-speaking world. For all of the reasons Taleb outlines, we need empiricism, and some of the intellectual humility of a real skeptic, to counter the hubris of modern economists.

Taleb speaks a great deal about The Problem of Silent Evidence. In a nutshell, when we try to understand the unusual, we look for characteristics that we feel were a cause of the event. For example, we have all been taught that World War I was caused by 'entangling alliances' between the major powers and the 'powderkeg' in the Balkans. But identifying characteristics that were true before WWI is not the same as identifying causes. To have useful knowledge, we need to be able to predict, not create an after-the-fact narrative. There were many times in history when there were alliances and many places where angry young men plotted to be revolutionaries. Does the presence of these conditions have any predictive value? That is what we need to know if we want to plan.

It seems to me that some of Taleb's criticism of classical statistical analysis can be addressed by using Bayesian methods. This is the tie to the Reverend Bayes and Laplace: they founded what we now call Bayesian statistics. Over the last twenty years, Bayesian methods have been evolving rapidly. In particular, Bayes' theorem provides a mathematical framework for discussing Taleb's Problem of Silent Evidence. Of course, a framework is not a 'solution', and this only addresses one of Taleb's issues. More to come ...