Musings on business intelligence, particularly using Pentaho. Also interested in software design, particularly in the open source community.
Thursday, October 1, 2009
Erlang, Mochiweb and Webmachine
Have a look. I was impressed by the clean design, but I was blown away by the debugger described in the last ten minutes. This is an open source project in Erlang, the high-concurrency functional language created by Joe Armstrong at Ericsson. As noted in the talk, Webmachine is a high-quality component to include in a web framework like Mochiweb. Mochiweb itself is interesting: it is an Erlang framework for high-concurrency web applications. When serving static content, it has been compared with yaws and nginx in a deathmatch on Joe's blog! Richard Jones is contemplating a Million User Comet App. Erlang is turning into a pretty cool language for web apps with high concurrency.
I wonder if there is any attempt to write an equivalent in Scala, a high concurrency functional language for the JVM? I have got to get up to speed with Scala and Lift. There is a lot going on here...
Saturday, September 26, 2009
Involve them in the same conspiracy
renegade spreadsheets and desktop databases that are wreaking havoc on organizations. Since then, many people have adopted the term because it highlights a painful, yet largely ignored, problem that plagues organizations today.
Spreadmarts contain vital pieces of corporate data that are needed to run the business. But since spreadmarts are created by individuals at different times using different data sources and rules for defining metrics, they create a fractured view of the enterprise. Without a single version of corporate data and centrally defined metrics, employees can’t share a common understanding of the business. With spreadmarts, each worker marches to the “beat of their own drummer” instead of marching together toward a common goal. In short, spreadmarts undermine corporate productivity and sabotage business alignment.
Saturday, September 19, 2009
Comp Sci 61A - Online lectures
Topic | studied in... |
Application Program | CS 61A |
High level language (Scheme) | CS 61B |
Low Level Language (C) | CS 61C |
Machine Language/architecture | ... |
circuit elements | EE |
transistors | applied physics |
solid state physics | Physics |
Quantum Mechanics | Physics |
Sunday, September 6, 2009
Slowly Changing Dimensions, Type 6 = Type 1 + Type 2 + Type 3
Sunday, August 30, 2009
PostgreSQL: how to run on EC2 with EBS
Saturday, August 29, 2009
A virtual IT department: the database
Sunday, August 16, 2009
PostgreSQL: the EnterpriseDB webcasts
SQL Server and clustered indexes
When a database reads a range of values, a very common case, you may need to read many values, and IO usually becomes the performance limit. These values reside on your data storage, which is almost always a hard drive or a RAID array of hard drives. To make the IO faster, there are a few basic strategies.
First, you can make the data more compact. As an extreme example, consider the case of Decisionmark's Census Counts. This product provided a database of US Census Bureau data distributed on a CD. Watcom SQL Anywhere, now Sybase SQL Anywhere, was used in part because we could compress each row. Compared with a hard drive, a CD has very slow data reads, so it was faster to read the compressed rows and decompress them than it would have been to read the same data uncompressed. Since the census data was read-only, we didn't need to worry about write performance.
If you read some modern database research, you will see that compressing data is one significant strategy for improving performance. For example, in MonetDB/X100 - A DBMS In The CPU Cache, Zukowski and co-workers report:
Figure 5 show the speedup of the decompressed queries to be close to the compression ratio, which in case of TPC-H allows for a bandwidth (and performance) increase of a factor 3.
So this is still an area of ongoing work at the frontier of research. The goal is maximizing the amount of data in each block of storage, so that IO is minimized and performance is maximized.
A second strategy is to use indexes. An index is a small bit of data that helps you quickly find the larger data recorded in your rows. An index is built from the values of some subset of your row, e.g., one or a few columns. For each indexed value, you have a pointer to the row that holds the data. With a hash index, the pointer leads directly to the row; with a b-tree index, you may have to traverse a few nodes in the tree before you actually get a reference to the data row. If there is an index that is relevant for a query, it can speed that query up significantly, either by reducing the IO needed to find the data (the database engine might otherwise have to read the entire table) or, in many cases, by allowing the database to skip reading the table and simply use the index. Careful choices of indexes can vastly improve performance.
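To make this second strategy concrete, here is a minimal sketch, assuming PostgreSQL; the sale table and its columns are invented for the illustration, not taken from a real schema:

-- an illustrative table and a b-tree index on the column used in the filter
create table sale (sale_id integer, customer_id integer, amount numeric);
create index idx_sale_customer on sale (customer_id, amount);

-- the engine can find the matching rows through the index instead of scanning the whole table;
-- some engines can even answer this from the index alone (a 'covering' index), since it also carries amount
select customer_id, sum(amount)
from sale
where customer_id between 1000 and 1100
group by customer_id;

Running explain on the query shows whether the planner actually chooses the index; that depends on the table statistics and on how selective the range is.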
A third strategy to improve read performance is to physically order the data so that the data for your queries is adjacent on the hard drive. Unless you are using a solid state drive, you need to move the head and rotate the platter to read data. If the database can read adjacent blocks, it will read faster. In a typical query, the database engine will use the index to build a list of blocks that need to be read. If those blocks are sequential, they can be read more quickly.
After laying this groundwork, I will now discuss what seems to be a fundamental problem with SQL Server's implementation of table ordering. In many databases, there is a command that orders the records in a table by an index. For a concrete example, consider a department store chain. The chain has a computer system that records each sale in a purchase table. The chain has several stores, several departments, many sales representatives and many clients. To maximize the performance of the purchase table, you may want to order the records by (store, department, sales rep). In PostgreSQL, for example, you can create an index with create index index_purchase_store_dept on purchase (store_id, department_id, sales_rep_id) and then order the table records with the command cluster index_purchase_store_dept on purchase. The database engine will then rewrite the data in the purchase table in index order, which could significantly speed up reports by store or department. DB2 has a similar concept: you can define a clustering index for a table and then reorganize the table using the REORG utility.

Oracle takes a slightly different tack: a cluster is defined in the schema. You can include up to 32 tables in the cluster and define an index that orders the tables in the cluster. This means that not only are the rows of a given table close together, but rows from all of these tables are stored close together, which takes the idea of clustering to a new level. Donald Burleson, author of Oracle Tuning, describes how these clusters are used: when records are added to any table in the cluster, they are placed into overflow blocks if there is no more room in the block specified by the index, and when the overflow blocks reach the percentage specified by PCTFREE, the newly added records are reorganized by the index.
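For readability, here are the PostgreSQL commands from the example above laid out as a script. The purchase table and its columns are the illustrative ones from this post, and the syntax is the PostgreSQL 8.x form used above:

create index index_purchase_store_dept
    on purchase (store_id, department_id, sales_rep_id);

-- physically rewrite the table in index order
cluster index_purchase_store_dept on purchase;

-- the ordering is not maintained as new rows arrive,
-- so the table has to be re-clustered periodically
cluster purchase;

Note that CLUSTER is a one-time rewrite and holds an exclusive lock on the table while it runs, which is part of the maintenance cost I come back to at the end of this post.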
SQL Server, as well as MySQL, makes clustering more convenient by letting you define an index to be clustered. You may declare one index per table to be the clustered index. To maximize the performance of our purchase table, you may want to order the records by either (store, department, sales rep) or perhaps by (customer). When records are inserted, they are inserted in the order specified by the index. As carefully described in a blog post by Sunil Agarwal, a Program Manager in the SQL Server Storage Engine Group at Microsoft, this will result in a severely fragmented table.
Instead, Microsoft urges the user to define the primary key as an auto-generated integer and to cluster the table on the primary key's index. In other words, order the table chronologically. This seems to fundamentally contradict the concept of clustering: the records are 'clustered' in groups of one, and the groups must be in chronological order. I checked some tables that followed this rule, and they all had 0% (or at least under 1%) fragmentation on the clustered index, even after a year of operation without maintenance.
Recently, I was asked to improve the performance of a set of queries run against a SQL Server 2000 database. A review of the index statistics showed severe fragmentation, for the reasons outlined by Sunil Agarwal. To see if defragmenting would matter, I followed the advice in Microsoft SQL Server 2000 Index Defragmentation Best Practices and defragmented the tables used in one particularly slow query. I was able to reduce the query time from over 4 minutes to just under 10 seconds. With a rewrite of the query, the query time dropped to under a second. However, after only two days of transaction processing, the clustered index had a fragmentation of over 40%. The query times were not back to the 4 minute level, but the formerly sub-second query was taking more than 10 seconds. In this case, the indexes are fragmenting faster than we can reasonably defragment them, since we cannot take this server offline daily to reindex the tables, and using DBCC INDEXDEFRAG is simply too slow.
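For reference, here is a rough sketch of the SQL Server 2000 commands involved in this kind of check. The table and index names are placeholders, not the real schema from this engagement:

-- report fragmentation, including 'Logical Scan Fragmentation', for all of a table's indexes
DBCC SHOWCONTIG ('purchase') WITH ALL_INDEXES;

-- defragment one index online (slow, but the table stays available)
DBCC INDEXDEFRAG (0, purchase, ix_purchase_store_dept);

-- rebuild the table's indexes completely (more thorough, but blocks access while it runs)
DBCC DBREINDEX ('purchase');

The 'Logical Scan Fragmentation' figure reported by SHOWCONTIG is the usual number to watch when deciding whether an index needs attention.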
My conclusion is that in order to make clustered indexes useful, you need to be able to append a significant number of rows to a table and then order those records using the clustered index. Microsoft's solution of inserting each record at the location specified by the clustered index can rapidly cause severe fragmentation. The solutions of PostgreSQL, Oracle and DB2 avoid this issue, but at the cost of additional maintenance.
Friday, July 10, 2009
Microformats: Reuse, don't reinvent
This is one of the best introductions I have found.
After some reflection, it seems that building an open source object model based upon hCard for people and organizations, hCalendar for events, and rel-tags could be useful for many projects. How often do you end up reinventing classes for people, addresses and so on? With rel tags, you could define all sorts of relationships between people and institutions: you could add 'subsidiary', 'employee', 'retailer' and build all sorts of models for a wide range of business needs. If there were a simple object model, with some basic implementations in a few key frameworks, you could start coding from that rather than from scratch. This would also be useful for learning and comparing frameworks. Presumably, you could get a REST implementation, an XHTML view, and WS-* web services almost for free. You would also get the services in a format that would help others build mash-ups with your data.
Monday, July 6, 2009
The Story of REST: Representational State Transfer
the Design of Network-based Software Architectures.
While it may be feasible to support multiple MIME types at a single URL, it is not trivial to provide multiple representations of the same item. Even converting between two apparently similar data formats has surprising complexity, as discussed in
Convert Atom documents to JSON at IBM Developer Works. The Developer Works also has a series of papers on Grails, which has specific examples of using Grails with Atom syndication.
Thursday, July 2, 2009
What HTTP reveals about your browser.
I would really like it if the geolocation information could be used to set the default country, state and city in web forms. In my case, the geolocation would have gotten me to Iowa, but would have placed me in Hiawatha rather than Marion. There is another geolocation demo that gets closer, but is still off by about 3 miles. I would like to be able to set a location, address and hcard info and have the option of using that on web forms. I would encourage more browser providers to support the navigator.geolocation object in the W3C Geolocation API.
But in my opinion, the scary information is from a CSS Exploit page. This exploit has been covered today on Slashdot. Web 2.0 Collage will produce a collection of favicons of sites you have visited. What is most surprising to me is that this exploit by Brendon Boshell doesn't even require JavaScript. He has a JavaScript version as well, which he describes in detail. So, unless you use one of the 'stealth modes' that don't record history, anyone can check whether you have visited a particular site. Think about how that could facilitate a phishing attack.
Tuesday, June 30, 2009
V8 - the extremes
Here is the real surprise. While every other browser ran faster on the iMac than on the Dell, IE8 actually slowed down. But the slowdown is almost entirely due to the final test, Splay. Google describes this test:
Data manipulation benchmark that deals with splay trees and exercises the automatic memory management subsystem (378 lines).
So it seems that the VM on the Mac can have performance issues with memory management.
More V8 - on an iMac
V8 Benchmarks
Google's V8 Benchmark Suite is easy to run. Here are the results for several new browsers on my workstation, a Dell Optiplex GX620 with a 2.79 GHz Pentium D and 3.49 GB of RAM. Wow. I didn't expect to see WebKit being this much faster. If my brand new Firefox 3.5 is given a relative score of 1, Safari 4 on Windows has a score of 7.5 and Google Chrome has a score of 8.6. Internet Explorer 8 has a score of 0.21. WebKit browsers have an order of magnitude better performance than Microsoft's flagship browser, IE8.
So, is Microsoft that bad at writing a JavaScript interpreter, or are they trying to move us away from web standards like JavaScript and toward Silverlight?
Friday, June 19, 2009
Jeff Jarvis and what to do next
Wednesday, June 17, 2009
Software Tutorials on Google Wave: Eclipsy
Tuesday, June 16, 2009
IE8: on Acid3
Just for fun, I tested Internet Explorer 8 on the Acid3 test. For completeness, the test was run on the morning of June 16, 2009. Today, Opera 10 was released. Opera 10, along with the recently released Safari 4, has attained a score of 100% on the Acid3 test. I also tested Chrome, release 2.0.172.31, and it also scored 100%. Is browser conformance breaking out? Alas, no. Just to prove what I say, I'm including bitmaps of 'About Internet Explorer' at the time of the test. As you can read, I am using version 8.0.6001.18702. This is running on XP, as you can probably guess from the title bar.
During the test, I was asked if I wanted to let an ActiveX component run. I believe it was for XML processing, but I foolishly clicked OK before recording the component name. At first the test appeared to have finished at 12%. I was surprised to see what appears to be an HTML text area suddenly pop up.
After several seconds, the score started to creep up, finally reaching 20%. This is consistent with the scores reported at Anomalous Anomaly, which has much more complete Acid3 test results. Finally, I tested Firefox 3.0.10, which produced a score of 71%. As expected, this is also consistent with Steve Noonan's results at Anomalous Anomaly.
I also checked Acid2; all the browsers scored 100%. Perhaps there is still hope that both IE and Firefox will be made more standards compliant so that their Acid3 scores can match their Acid2 scores.
[followup on July 20, 2009] Firefox 3.5 is making rapid progress. With Firefox 3.5.1, I have an Acid3 score of 93. My IE scores have remained the same.
Saturday, June 13, 2009
Richer web interfaces
- Pentaho BI Server. Their new Mantle UI is built with GWT and is a huge jump beyond the old interface. It simply behaves like a stand-alone application. Some of the components, especially jPivot, could use an update, but the server platform is in a great place to organize all of these new features. With the modular design of BI Server 3.0, Pentaho users will be seeing a great deal of new reporting and analytic tools they can plug in.
- G.ho.st, the globally hosted operating system, is a virtual computer that you can access from anywhere. You have a full GUI, built in OpenLaszlo, apparently by their team of only 30-40 staff. Ghost seeks to provide a free, web-based virtual computer for anyone in the world. The UI is fast and simply doesn't feel like a browser. Have a look.
- LZPIX is another OpenLaszlo application for viewing photos on Flickr. The link is to the Flash version; there is also a DHTML version, and you can see the source code as well. On a side note, open the application and search for 'toureiffel'; the photo Paris s'éveille is magnificent!
- Maple is another interesting tool to view photos, but it is designed for multimedia slide shows. It can run as either a Java application or a Java applet. It uses Java 6, and my shiny new Mac only has Java 5, so the only way for me to view this is to run a VM with either Windows or Linux so I can install Java 6, which was released in December 2006. Come on Apple, is it not possible to get the bugs out in 2 1/2 years!? Oh well, one more reason to hope for Snow Leopard. I just hope that they have a Java 6 upgrade for Leopard, since I can't run Snow Leopard on the older G4/G5 Macs. I was going to ask why Java applets didn't catch on like Flash applets, but I guess when you have trouble accessing a consistent Java platform across OS/X and Windows, the answer is clear. This is unfortunate, as Maple is a great slide show viewer. Java really does work; it is just a shame that it has been held captive by Apple or Microsoft.
- Ben Fry's zipdecode. Try this and tell me why this shouldn't be part of any application that needs a zip code. It should be easy to gather the data needed to make this international. If I am filling out a web form, why do I have to type in 52302 and then choose Marion, Iowa? There are some zip codes that serve multiple communities, but in that case, I would just need one extra click to pick the city. The only thing I would add to this applet would be a semi-transparent zip code on top of the map.
Friday, June 5, 2009
Google I/O viewed from IOwa.
- Drag and drop photos from iPhoto into the browser. Really cool use of Gears and GWT. Watching the photo upload automatically and appear on the other browsers in seconds was a direct reminder to me that we are only beginning to understand how the web can connect us.
- Watching real-time translation between French and English shows how we are going to be able to interact more freely in wider communities.
- The multiple, concurrent editing of a single document shows how much power there is in concurrent versioning systems.
- The use of associative memory, which resides not only on the server but is shared with each participant, is part of the secret sauce that makes it seem that everyone on the wave is 'together'.
Friday, May 15, 2009
Innovation in the computer industry
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Interesting, it looks like Internet Explorer is emulating Mozilla. If you really want to look for who invented the web browser, you will discover ViolaWWW. Rather than running on Windows, ViolaWWW ran on Unix and X Window. You may recognize X Window, since X.org is one of the 'big name' FOSS projects.
The strong impression I drew 13 years ago was that Microsoft programmers were untrained, undisciplined, and content merely to replicate other people’s ideas, and that they did not seem to appreciate the importance of defining operating systems and user interfaces with an eye to the future. In the end it was this latter vision, I feel, that set Gary Kildall so far apart from his peers.
Not exactly a rousing defense of software innovation at Microsoft. So, we find that StarWriter was developed for an OS that predates MS-DOS. If I recall correctly, the GUI version of MS Word was actually developed for the Macintosh. So it is not clear to me how OpenOffice.org is an example of a Microsoft innovation that others copied; it can trace its code base back further than MS Word can.
But here's a TECHNICAL creation which changed the computing world, and was almost totally invented by Microsoft: truetype fonts. Before MS built Windows 3.1 around them, no one had thought to use the SAME fonts for your printers and your screens. Fantastic idea, and implemented beautifully.
The time is about right, but the source of the innovation is wrong. TrueType fonts were developed by Apple, as noted by Microsoft.
Tuesday, May 5, 2009
Choosing SCD Type at Run Time
Friday, April 17, 2009
Learning to program on OS/X
sudo -u postgres /opt/local/lib/postgresql83/bin/postgres -D /opt/local/var/db/postgresql83/defaultdb in iTerm.
Saturday, March 28, 2009
Dimension tables in data warehouses: Type 2+1
- Type 1 - the past doesn't matter. When data in the dimension tables changes, we just throw away the old data and use the new data. This is appropriate for correcting errors. If we find that we should have spelled Vicky rather than Vickie, we just change the name in the production database for our application. The ETL process will then change the swimmer's name in the data warehouse.
- Type 2 - slowly changing dimension. In this case, a single entity in the production system gets mapped to multiple records in the dimension table. At any moment in time, there is only one record that is valid for the entity. When we change the description in the production system, the ETL job will create a new record in the dimension table. If our iPhone-toting parents take a new photo of Vicky every month, Vicky will get a new record in the swimmer dimension table. When we look at the results of a race where Vicky was swimming, we can see what she looked like at the time of the race.
- Type 3 - keep a few versions. In this design, we end up with multiple columns for a given attribute in the dimension table. Usually, this is used so we can have a 'current' value and a 'last' value. This is used for something like a company reorganization. Vicky is a superstar, and in the middle of the season she is forced to move to a really weak team to make the league more equal. The parents are upset, so they track Vicky with the current team assignments, but they also track her beginning-of-season team. This example may seem silly, but similar reorganizations happen all the time within companies.
- GUID, an id for each swimmer that is unique for every swimmer in every league that is using my Swim 2.0 web application. This is the primary key of the production table.
- Name, a string with the swimmer's name. There could be separate fields for first and last names, but this doesn't matter for this discussion.
- Birthdate, a date field.
- Photo URL, preferably a URL class
- Team GUID, a foreign key tying each swimmer to their current team
- GUID (same as above, the G is for global, and the data warehouse is part of our world of data)
- ID, a unique integer determined by the database engine. This is the primary key of the swimmer dimension table.
- version, an integer that starts at 1 and increments by 1 each time a new record is added for a given GUID. There is a 1-1 mapping between (GUID, version) and ID
- start date and end date. These two columns define when a given version is valid.
- name, birthdate and photo URL fields, just like in production
- Team ID. Foreign keys reference primary keys, so rather than recording the Team GUID, we will record the Team ID. This may be a bad idea; if you have arguments about when to use the GUID and when to use the ID when dimensions reference each other, I would be happy to hear from you. A DDL sketch of the resulting table follows below.
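Putting those columns together, here is a rough PostgreSQL-flavored DDL sketch of the swimmer dimension. This is my own reading of the design; the current_id column is implied by the views below, which join on it to find each swimmer's current version:

create table swimmer_dim (
    id         serial primary key,   -- surrogate key, assigned by the database
    guid       uuid not null,        -- the production system's key
    version    integer not null,     -- 1, 2, 3, ... for each guid
    date_from  date not null,        -- this version is valid from...
    date_to    date not null,        -- ...through this date
    name       text,
    birth_date date,
    photo_url  text,
    team_id    integer,              -- the ID (not the GUID) of the team dimension row
    current_id integer,              -- ID of the row holding this swimmer's current version
    unique (guid, version)           -- 1-1 mapping between (GUID, version) and ID
);

With a table like this, the Type 2, Type 1 and Type 3 presentations of the same data can be built as views, as shown next.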
-- Type 2: the full history, one row per version
create view swimmer_dim_type2 as
select GUID, ID, version, date_from, date_to, name, birth_date, photo_url, team_id
from swimmer_dim;

-- Type 1: every version resolved to the current values
create view swimmer_dim_type1 as
select sw.GUID, sw.ID, curr.name, curr.birth_date, curr.photo_url, curr.team_id
from swimmer_dim sw
join swimmer_dim curr on sw.current_id = curr.id;

-- Type 3: current values, plus the team as of the start of the season (June 1)
create view swimmer_dim_type3 as
select sw.GUID, sw.ID, curr.name, curr.birth_date, curr.photo_url,
       sw.team_id, june.team_id as starting_team_id
from swimmer_dim sw
join swimmer_dim curr on sw.current_id = curr.id
join swimmer_dim june on sw.current_id = june.current_id
where cast('2009-06-01' as date) between june.date_from and june.date_to;
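For completeness, here is a sketch of my own (not code from this post) of the ETL step when a swimmer's attributes change, for example a new monthly photo. It assumes a '9999-12-31' sentinel for the open-ended end date and that current_id is repointed after every change; the GUID, dates and attribute values are made up:

-- 1. close out the version that is current today
update swimmer_dim
   set date_to = date '2009-03-27'
 where guid = 'b1e0c7c2-0000-0000-0000-000000000001'
   and date_to = date '9999-12-31';

-- 2. insert the new version with the latest attribute values
insert into swimmer_dim (guid, version, date_from, date_to, name, birth_date, photo_url, team_id)
select guid, max(version) + 1, date '2009-03-28', date '9999-12-31',
       'Vicky', date '1998-05-14', 'http://example.com/vicky-march.jpg', 7
  from swimmer_dim
 where guid = 'b1e0c7c2-0000-0000-0000-000000000001'
 group by guid;

-- 3. repoint every version of this swimmer at the new current row
update swimmer_dim
   set current_id = (select max(id) from swimmer_dim
                     where guid = 'b1e0c7c2-0000-0000-0000-000000000001')
 where guid = 'b1e0c7c2-0000-0000-0000-000000000001';

The Type 1 and Type 3 views then pick up the new values automatically, while the Type 2 view keeps the full history.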
Saturday, March 21, 2009
Bayes' Theorem and Taleb's Silent Evidence
One Diagoras, a non believer in the gods, was shown painted tablets bearing the portraits of some worshippers who prayed, then survived, a subsequent shipwreck. The implication was that praying protected you from drowning. Diagoras asked, "Where were those who prayed, then drowned?"
The drowned worshippers, being dead, would have a lot of trouble advertising their experiences from the bottom of the sea. This can fool the casual observer into believing in miracles.
Burton et. al. (1978) report that when dams and levees are built, they reduce the frequency of floods, and thus apparently create a false sense of security, leading to reduced precautions. While building dams decreases the frequency of floods, damage per flood is so much greater afterward that the average yearly damage increases.
Wow, this is an extraordinary claim. If true, then much of the work done to protect people in the Mississippi/Missouri River basins is not just useless but actually counterproductive. I will remain skeptical of this claim, but it does seem worthy of investigation before Cedar Rapids, Linn County, the State of Iowa and Federal agencies sink more money into flood control on the Cedar River. We need to find a more rational way to understand and manage risk. If we can improve that, we will have learned an important lesson from the catastrophic failures of 9/11, the Indian Ocean Tsunami of 2004, the flooding of New Orleans following Hurricane Katrina, and the current financial collapse. We need Taleb's empiricism. When we have enough data to analyze, we should be using Bayesian methods for that analysis.
Sunday, March 8, 2009
Why Swamp Fox Analyst?
The next morning we proceeded by order of Colonel James Grant, to burn down the Indians' cabins. Some of our men seemed to enjoy this cruel work, laughing very heartily at the curling flames, as they mounted loud crackling over the tops of the huts. But to me it appeared a shocking sight. Poor creatures! thought I, we surely need not grudge you such miserable habitations. But, when we came, according to orders, to cut down the fields of corn, I could scarcely refrain from tears. For who could see the stalks that stood so stately with broad green leaves and gaily tasseled shocks, filled with sweet milky fluid and flour, the staff of life; who, I say, without grief, could see these sacred plants sinking under our swords with all their precious load, to wither and rot untasted in their mourning fields.
This passage suggests to me that he was a sensitive and compassionate man, but still a man who did his duty. He is most famous for his leadership in the American Revolutionary War. Following the Waxhaw Massacre, his band of about fifty irregulars was the only force opposing the British in South Carolina. He earned his nickname, The Swamp Fox, by outfoxing Col. Banastre Tarleton and his British forces. His tactics are predecessors of modern guerrilla warfare. In short, he was able to accomplish a great deal with very little; he used intelligence to succeed. I cannot claim that I am following in his footsteps (I failed my military physical for acne scars on my chest and back), but I can certainly take inspiration from his success.
There is also a cautionary note that I associate with Francis Marion. While we Americans, and even Mel Gibson, regard Marion as a hero, he was clearly viewed as a terrorist by the British. The historian Christopher Hibbert described him as a racist and rapist for his treatment of the Cherokee. There is evidence that both views are correct. One man's terrorist is another man's freedom fighter.
Francis Marion was also alive during the Age of Enlightenment. His lifetime (1732-1795) overlaps the lives of the skeptic David Hume (1711-1776), the Reverend Thomas Bayes (1702-1761), and the French aristocrat Pierre-Simon, marquis de Laplace (1749-1827). As Nassim Nicholas Taleb notes in The Black Swan, Hume was an influential empiricist for the English-speaking world. For all of the reasons Taleb outlines, we need empiricism and some of the intellectual humility of a real skeptic, to counter the hubris of modern economists.
Taleb speaks a great deal about The Problem of Silent Evidence. In a nutshell, when we try to understand the unusual, we look for characteristics that we feel were a cause of the event. For example, we have all been taught that World War I was caused by 'entangling alliances' between the major powers and the 'powderkeg' in the Balkans. But identifying characteristics that were true before WWI is not the same as identifying causes. To have useful knowledge, we need to be able to predict, not create an after-the-fact narrative. There were many times in history when there were alliances and many places where angry young men plotted to be revolutionaries. Does the presence of these conditions have any predictive value? That is what we need to know if we want to plan.
It seems to me that some of Taleb's criticism of classical statistical analysis can be addressed by using Bayesian methods. This is the tie to the Reverend Bayes and Laplace: they founded what we now call Bayesian statistics. Over the last twenty years, Bayesian methods have been rapidly evolving. In particular, Bayes' theorem provides a mathematical framework for discussing Taleb's Problem of Silent Evidence. Of course, a framework is not a 'solution' and this only addresses one of Taleb's issues. More to come ...
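As a small worked illustration (the notation is mine), Bayes' theorem applied to the Diagoras story above looks like this:

\[
P(\text{survived} \mid \text{prayed}) \;=\; \frac{P(\text{prayed} \mid \text{survived})\, P(\text{survived})}{P(\text{prayed})}
\]

The painted tablets only let us estimate P(prayed | survived): among the survivors we can see, the fraction who prayed. Whether praying actually helped is a question about P(survived | prayed), and computing it also requires the base rates P(survived) and P(prayed), which have to include the worshippers who drowned. The silent evidence is exactly the data that belongs in those base rates.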