Musing on business intelligence, particularly using Pentaho. Also interests in software design, particularly in the open source community.
Sunday, August 16, 2009
SQL Server and clustered indexes
When a database reads a range of values, which is a very common case, it may need to read many values, and IO usually becomes the limiting factor for performance. These values reside on your data storage, which is almost always a hard drive or a RAID array of hard drives. There are a few basic strategies for making the IO faster.
First you can make the data more compact. As an extreme example, consider the case of Decisionmark's Census Counts. This product provided a database with US Census Bureau Data distributed on a CD. Watcom SQL Anywhere, now Sybase SQL Anywhere, was used in part because we could compress each row. Compared with a hard drive, a CD has very slow data reads. So it was faster to read data from compressed rows and to unzip the data than it would have been to just read the same data uncompressed. Since SQL Anywhere's census data was read-only, we didn't need to worry about the write performance.
If you read some modern database research, you will see that compressing data is a significant strategy for improving performance. For example, in "MonetDB/X100 - A DBMS In The CPU Cache", Zukowski and co-workers report:
Figure 5 show the speedup of the decompressed queries to be close to the compression ratio, which in case of TPC-H allows for a bandwidth (and performance) increase of a factor 3.
So this is still an area of ongoing work at the frontier of research. The goal is maximizing the amount of data in each block of storage, so that IO is minimized and performance is maximized.
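As a rough illustration of why compression reduces IO, here is a minimal Python sketch that counts how many storage blocks a table occupies before and after compression. The row layout, the 8 KB block size, and the use of zlib over the whole table are assumptions made for the example; a real engine compresses per row or per block and its ratios will differ.

```python
import struct
import zlib

BLOCK_SIZE = 8192  # assumed block size in bytes, for illustration only


def make_rows(n):
    # Hypothetical fixed-width rows: (state, county, tract, population).
    return [
        struct.pack("<HHIQ", i % 50, i % 300, i, 1000 + i % 5000)
        for i in range(n)
    ]


def blocks_needed(payload):
    # Number of whole blocks required to hold the payload (ceiling division).
    return -(-len(payload) // BLOCK_SIZE)


raw = b"".join(make_rows(100_000))
packed = zlib.compress(raw)

# Reading the compressed form and unzipping it yields the same data.
assert zlib.decompress(packed) == raw

raw_blocks = blocks_needed(raw)
packed_blocks = blocks_needed(packed)
print("blocks to read uncompressed:", raw_blocks)
print("blocks to read compressed:  ", packed_blocks)
```

The fewer blocks a range scan has to fetch, the less time is spent waiting on the drive, as long as decompression costs less CPU time than the IO it saves.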
A second strategy is to use indexes. An index is a small bit of data that helps you quickly find the larger data recorded in your row. An index is built by looking at the values of some subset of your row, e.g., one or a few columns. For each value of the indexed columns, you have a pointer to the row that holds the data. If you have a hash index, the pointer leads directly to the row, and if you have a b-tree index, you may have to traverse a few nodes in the tree before you actually get a reference to the data row. If there is an index that is relevant for a query, it can speed up that query significantly. It can do this either by reducing the IO needed to find the data (the database engine might otherwise have to read the entire table) or, in many cases, by allowing the database to avoid reading the table at all and simply use the index. A careful choice of indexes can vastly improve performance.
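The difference between the two kinds of index can be sketched in a few lines of Python. This is a toy model, not how any particular engine stores its indexes: a dict stands in for a hash index, a sorted list searched with bisect stands in for a b-tree, and the table and column names are made up for the example.

```python
import bisect
from collections import defaultdict

# Hypothetical table; columns are (store_id, department, amount).
table = [
    (3, "shoes", 40.0),
    (1, "toys", 15.0),
    (2, "shoes", 60.0),
    (1, "garden", 25.0),
    (3, "toys", 10.0),
]

# Hash-style index: key -> row positions, one step from key to row.
hash_index = defaultdict(list)
for pos, row in enumerate(table):
    hash_index[row[0]].append(pos)

# Ordered index (a stand-in for a b-tree): sorted (key, row position) pairs.
sorted_index = sorted((row[0], pos) for pos, row in enumerate(table))


def lookup_equal(store_id):
    # Equality lookup through the hash index.
    return [table[pos] for pos in hash_index.get(store_id, [])]


def index_slice(low, high):
    # Binary search for the part of the ordered index covering [low, high].
    start = bisect.bisect_left(sorted_index, (low, -1))
    end = bisect.bisect_right(sorted_index, (high, len(table)))
    return sorted_index[start:end]


def lookup_range(low, high):
    # Range lookup: walk the index entries, then fetch each row.
    return [table[pos] for _, pos in index_slice(low, high)]


def count_range(low, high):
    # Index-only query: answered without touching the table at all.
    return len(index_slice(low, high))


print(lookup_equal(1))
print(lookup_range(2, 3))
print(count_range(2, 3))
```

Note that `count_range` never reads `table`, which is the case described above where the database can answer a query from the index alone.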
A third strategy to improve read performance is to physically order the data so that the data for your queries is adjacent on the hard drive. Unless you are using a solid state drive, the drive has to move the head and rotate the platter to read data. If the database can read adjacent blocks, it will read faster. In a typical query, the database engine will use the index to build a list of blocks that need to be read. If the blocks are sequential, they can be read more quickly.
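To make the effect of physical ordering concrete, the following Python sketch counts the blocks a query for one store has to read, first with rows stored in arrival order and then with rows sorted by store. The figures of 100 rows per block, 20 stores, and 10,000 rows are assumptions chosen for the example.

```python
import random

ROWS_PER_BLOCK = 100  # assumed number of rows that fit in one block


def blocks_for_store(rows, store_id):
    # Block numbers that hold at least one row for the given store.
    return sorted(
        {pos // ROWS_PER_BLOCK for pos, row in enumerate(rows) if row[0] == store_id}
    )


def count_runs(blocks):
    # Number of contiguous runs of blocks; each new run implies a seek.
    runs = 0
    prev = None
    for block in blocks:
        if prev is None or block != prev + 1:
            runs += 1
        prev = block
    return runs


rng = random.Random(42)
# Rows in arrival order: (store_id, sale_number), with the store chosen at random.
arrival_rows = [(rng.randrange(20), i) for i in range(10_000)]
# The same rows physically ordered by store.
ordered_rows = sorted(arrival_rows)

arrival_blocks = blocks_for_store(arrival_rows, 7)
ordered_blocks = blocks_for_store(ordered_rows, 7)

print("arrival order:", len(arrival_blocks), "blocks,", count_runs(arrival_blocks), "runs")
print("ordered:      ", len(ordered_blocks), "blocks,", count_runs(ordered_blocks), "runs")
```

In arrival order, rows for a single store are spread over nearly every block in the table. Once the rows are ordered, they sit in a handful of adjacent blocks that can be read in one sequential pass.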
Having laid this groundwork, I will now discuss what seems to be a fundamental problem with SQL Server's implementation of table ordering. In many databases, there is a command that allows the records in a table to be ordered by an index. To have a concrete example, consider the case of a department store chain. The chain has a computer system that records each sale in a purchase table. The chain has several stores, several departments, many sales representatives and many clients. To maximize the performance of the purchase table, you may want to order the records by (store, department, sales rep). For example, in PostgreSQL, you can create an index with create index index_purchase_store_dept on purchase (store_id, department_id, order_id). You can then order the table records with the command cluster index_purchase_store_dept on purchase. The database engine will then order the data in the purchase table using the index. This could significantly speed up reports by store or department.
DB2 has a similar concept: you can define an index to be the cluster index for a table, and you can then reorganize the table using the REORG utility.
Oracle has a slightly different tactic: a cluster is defined in a schema. You can include up to 32 tables in the cluster and you can define an index to order the tables in the cluster. This means that not only are the rows in a given table close together, but related rows from the different tables will be close together as well. This takes the idea of clustering to a new level. Donald Burleson, author of Oracle Tuning, describes how these clusters are used. When records are added to any table in the cluster, they are placed into overflow blocks if there is no more room in the block specified by the index. When the overflow blocks reach the percentage specified in PCTFREE, the records added to the tables get reorganized by the index.
SQL Server, as well as MySQL, makes clustering more convenient by letting you declare an index to be clustered. You may declare one index per table to be the clustered index. To maximize the performance of the purchase table, you may want to order the records by either (store, department, sales rep) or perhaps by (customer). When records are inserted, each one is placed at the position specified by the index. As carefully described in a blog by Sunil Agarwal, a Program Manager in the SQL Server Storage Engine Group at Microsoft, this will result in a severely fragmented table.
Instead, Microsoft urges the user to define the primary key to be an auto-generated integer and to cluster the table on the primary key's index. In other words, order the table chronologically. This seems to fundamentally contradict the concept of clustering. The records are 'clustered' in groups of one, and the groups must be in chronological order. I checked some tables that followed this rule, and they all had 0% (or at least under 1%) fragmentation on the clustered index, even after a year of operation without maintenance.
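The contrast between the two insert patterns can be reproduced with a toy model in Python. This is a sketch under stated assumptions, not SQL Server's actual storage algorithm: pages hold at most 50 keys, a full page in the middle of the table splits in half with the new page allocated at the end of the file, and an insert past the last key simply starts a fresh page. Fragmentation here means the fraction of logically adjacent pages that are not physically adjacent.

```python
import bisect
import random

PAGE_CAPACITY = 50  # assumed number of keys per page


def simulate_inserts(keys):
    """Insert keys one at a time; return the fraction of out-of-order page links."""
    physical = [0]            # physical page number of each page, in key order
    contents = [[]]           # sorted keys held by each page, in key order
    lowest = [float("-inf")]  # lowest key each page is responsible for
    next_physical = 1
    for key in keys:
        i = bisect.bisect_right(lowest, key) - 1
        page = contents[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
            continue
        if i == len(contents) - 1 and key > page[-1]:
            # Appending past the end of the table: start a fresh page.
            new_page = [key]
        else:
            # Page split: move the upper half of the keys to a new page.
            half = PAGE_CAPACITY // 2
            new_page = page[half:]
            del page[half:]
            target = new_page if key >= new_page[0] else page
            bisect.insort(target, key)
        # The new page follows page i logically but lives at the end of the file.
        contents.insert(i + 1, new_page)
        lowest.insert(i + 1, new_page[0])
        physical.insert(i + 1, next_physical)
        next_physical += 1
    links = len(physical) - 1
    if links == 0:
        return 0.0
    broken = sum(1 for a, b in zip(physical, physical[1:]) if b != a + 1)
    return broken / links


sequential_keys = list(range(5000))  # auto-generated integer key
scattered_keys = list(range(5000))
random.Random(1).shuffle(scattered_keys)  # e.g. keyed by store or customer

sequential_fragmentation = simulate_inserts(sequential_keys)
scattered_fragmentation = simulate_inserts(scattered_keys)
print("sequential key fragmentation:", sequential_fragmentation)
print("scattered key fragmentation: ", scattered_fragmentation)
```

In this model the ever-increasing key never splits a page, so the table stays perfectly ordered on disk, while keys that arrive in scattered order leave most pages physically out of sequence.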
Recently, I was asked to improve the performance of a set of queries that were run against a SQL Server 2000 database. A review of the index statistics showed severe fragmentation, for the reasons outlined by Sunil Agarwal. To see if defragmenting would matter, I followed the advice in Microsoft SQL Server 2000 Index Defragmentation Best Practices. I defragmented the tables used in one particularly slow query. I was able to reduce the query time from over 4 minutes to just under 10 seconds. With a rewrite of the query, the query time dropped to under a second. However, after only two days of transaction processing, the clustered index had a fragmentation of over 40%. The query times were not back to the 4 minute level, but the formerly sub-second query was taking more than 10 seconds. In this case, the indexes are fragmenting faster than we can reasonably defragment them, since we cannot take this server offline daily to reindex the tables, and using DBCC INDEXDEFRAG is simply too slow.
My conclusion is that in order to make clustered indexes useful, you need to be able to append a significant number of rows to a table and then order those records using the clustered index. Microsoft's solution of inserting each record at the location specified by the clustered index can rapidly cause severe fragmentation. The solutions of PostgreSQL, Oracle and DB2 avoid this issue, but at the cost of additional maintenance.