Sunday, August 30, 2009

PostgreSQL: how to run on EC2 with EBS

Ron Evans has recently blogged in great detail about getting PostgreSQL up and running on Ubuntu images on Amazon EC2 with EBS storage for the database files. Basically, you use EC2 to provide the server that runs PostgreSQL, and you use EBS (Elastic Block Store) to create a volume that you mount to hold the database files. Presumably, you could create a second file system to hold WAL segment files, or perhaps just archive those to S3.
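For example, once the EBS volume is formatted and mounted, you could point PostgreSQL at it with a tablespace. A minimal sketch, assuming the volume is mounted at /vol/pgdata and owned by the postgres user (both just assumptions for illustration):

    -- Put a tablespace on the EBS-backed mount point.
    CREATE TABLESPACE ebs_data LOCATION '/vol/pgdata';

    -- Create a database whose default storage lives on the EBS volume.
    CREATE DATABASE appdb TABLESPACE = ebs_data;

Alternatively, you could simply run initdb directly against the EBS mount point and skip the tablespace.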

The next step would be to learn how to create PostgreSQL clusters on EC2. This isn't much of a post, but it is useful to me.

Saturday, August 29, 2009

A virtual IT department: the database

Software as a Service (SaaS) seems like it is about to revolutionize how small and medium companies run their IT. Today, many small and medium businesses have really kludgy IT: a bunch of Windows desktops and a few Windows servers for Exchange, SQL Server, a file server and a domain controller, plus perhaps a company web site, though for a small business that is likely to be hosted elsewhere. I think we will see Microsoft's dominance fracture as a result of SaaS. Today, consider how SaaS may affect SQL Server.

Small companies now have the option of running database services in the cloud. Using Amazon EC2, they can have a hosted data center with 99.95% uptime in the SLA. Of course, this only works when the company has a working internet connection. Since Amazon uses Xen, you can find tools that will convert between Xen images and Amazon AMI files. For Debian Linux, that maverick James Gardner has blogged about this. So you could set up a local server to host Xen and then have a piece of the cloud in your own data center.

Running SQL Server is a costly proposition. Enterprise Edition is $25K per CPU. Microsoft plans to upgrade on a 2-3 year cycle, and they don't intend to support servers more than two generations old, so your $25K per CPU is a recurring fee that will average around $5K/year per CPU. If you want, you can run SQL Server on Amazon EC2, at a premium. Let's assume that you will need a large server. For Windows alone, this is a $0.50 per hour fee, but adding SQL Server Standard increases it to $2.20 per hour. For one year, a hosted SQL Server 2008 Standard instance works out to about $19K. This isn't crazy for a managed server, but it isn't a cost savings for most companies over running a Windows server in-house. So why is there so much buzz about the cloud?

With a few changes, we can save a great deal of money. For the database engine, consider PostgreSQL. It is fast and stable, and it has a carefully reviewed code base. There are drivers for ODBC, JDBC and ADO.Net, so your applications can still get at their data. Because PostgreSQL is open source, you can license it for $0. On Amazon EC2, you can run a large server instance with Linux for $0.40 per hour. You can also take advantage of Reserved Instances: for $1,400 you can reserve a server instance for three years, which drops the hourly fee to $0.12. Three years now costs $4,555.76 ($1,400 plus 26,298 hours at $0.12), so one year of a hosted PostgreSQL instance works out to just under $1,520. The cloud loves open source.
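Just to show my arithmetic (a 365.25-day year, and the 2009 EC2 prices quoted above), the numbers can be checked right from psql:

    -- Rough cost arithmetic for the comparison above.
    SELECT 1400 + 0.12 * 24 * 365.25 * 3       AS postgres_3yr,   -- 4555.76
           (1400 + 0.12 * 24 * 365.25 * 3) / 3 AS postgres_1yr,   -- ~1518.59
           2.20 * 24 * 365.25                  AS sql_server_1yr; -- ~19285.20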

To be fair, SQL Server also includes SSIS, which allows a variety of ETL services to be built and run via a GUI. I have been using Pentaho's Kettle for this; Pentaho has renamed it Pentaho Data Integration (PDI), and it can be downloaded for free. So, with PDI and PostgreSQL, you have the same basic functionality as SQL Server while saving $17,766.61 per year for a database server. You still get 99.95% data center uptime.

Note: I have not added S3 storage for archiving databases or transaction logs. This will increase costs, but it does so for both servers.

Sunday, August 16, 2009

PostgreSQL: the EnterpriseDB webcasts

PostgreSQL is becoming one of my favorite databases. Postgres is the direct descendant of Ingres; both were created at Berkeley by Michael Stonebraker and Eugene Wong. Other descendants of Ingres include Oracle, Sybase Adaptive Server Enterprise and Microsoft SQL Server. PostgreSQL has the advantage of a very clean code base, as demonstrated by Coverity's open source scans. A limitation for many open source projects is getting access to paid support; in the case of PostgreSQL, there is commercial support from EnterpriseDB.

One of the gems of EnterpriseDB's support is the wide range of webcasts about PostgreSQL. Because PostgreSQL is open source, the developers present detailed information about the internals of PostgreSQL and best practices for database administrators and developers.

SQL Server and clustered indexes

When a database reads a range of values, a very common case, it may need to read many rows, and IO usually becomes the performance limit. These values reside on your data storage, which is almost always a hard drive or a RAID array of hard drives. To make the IO faster, there are some basic strategies.

First, you can make the data more compact. As an extreme example, consider the case of Decisionmark's Census Counts. This product provided a database of US Census Bureau data distributed on a CD. Watcom SQL Anywhere, now Sybase SQL Anywhere, was used in part because we could compress each row. Compared with a hard drive, a CD has very slow data reads, so it was faster to read compressed rows and unzip the data than it would have been to read the same data uncompressed. Since the census data was read-only, we didn't need to worry about write performance.

If you read some modern database research, you will see that compressing data is one significant strategy for improving performance. For example, in "MonetDB/X100 - A DBMS In The CPU Cache", Zukowski and co-workers report:

Figure 5 show the speedup of the decompressed queries to be close to the compression ratio, which in case of TPC-H allows for a bandwidth (and performance) increase of a factor 3.

So this is still an area of ongoing work at the frontier of research. The goal is maximizing the amount of data in each block of storage, so that IO is minimized and performance is maximized.

A second strategy is to use indexes. An index is a small bit of data that helps you quickly find the larger data recorded in your rows. An index is built from the values of some subset of a row, e.g., one or a few columns. For each indexed value, there is a pointer to the row that holds the data. With a hash index, the pointer leads directly to the row; with a b-tree index, you may have to traverse a few nodes in the tree before you actually get a reference to the data row. If there is an index relevant to a query, it can speed that query up significantly, either by reducing the IO needed to find the data (the database engine might otherwise have to read the entire table) or, in many cases, by allowing the database to skip the table and answer the query from the index alone. Careful choices of indexes can vastly improve performance.
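To make this concrete, here is a hypothetical example on the purchase table from the department store example below (the index name and columns are made up for illustration):

    -- A b-tree index on the customer column: the engine can find matching
    -- rows without scanning the whole purchase table.
    CREATE INDEX idx_purchase_customer ON purchase (customer_id);

    -- If a query touches only indexed columns, many engines can answer it
    -- largely from the index instead of the table.
    SELECT customer_id, count(*) FROM purchase GROUP BY customer_id;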

A third strategy to improve read performance is to physically order the data so that the data for your queries is adjacent on the hard drive. Unless you are using a solid state drive, reading data means moving the head and rotating the platter. If the database can read adjacent blocks, it will read faster. In a typical query, the database engine will use the index to build a list of blocks that need to be read; if those blocks are sequential, they can be read more quickly.

After laying this groundwork, I will now discuss what seems to be a fundamental problem with SQL Server's implementation of table ordering. In many databases, there is a command that orders the records in a table by an index. For a concrete example, consider a department store chain. The chain has a computer system that records each sale in a purchase table. The chain has several stores, several departments, many sales representatives and many clients. To maximize the performance of the purchase table, you may want to order the records by (store, department, sales rep). In PostgreSQL, for example, you can create an index with create index index_purchase_store_dept on purchase (store_id, department_id, sales_rep_id) and then order the table records with the command cluster index_purchase_store_dept on purchase. The database engine will then order the data in the purchase table using the index, which could significantly speed up reports by store or department.

DB2 has a similar concept: you can define an index as the clustering index for a table and then reorganize the table using the REORG utility.

Oracle takes a slightly different tack: a cluster is defined in a schema. You can include up to 32 tables in the cluster and you can define an index to order the tables in the cluster. This means that not only are the rows within a given table close together, but related rows from these tables will be close together, which takes the idea of clustering to a new level. Donald Burleson, author of Oracle Tuning, describes how these clusters are used. When records are added to any table in the cluster, they are placed into overflow blocks if there is no more room in the block specified by the index. When the overflow blocks reach the percentage specified by PCTFREE, the records added to the tables get reorganized by the index.
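Spelled out as commands (using the example index above), the PostgreSQL version looks roughly like this; note that the re-clustering step is exactly the maintenance cost I come back to at the end:

    -- PostgreSQL: build the ordering index, then physically rewrite the table
    -- in that order.
    CREATE INDEX index_purchase_store_dept
        ON purchase (store_id, department_id, sales_rep_id);
    CLUSTER index_purchase_store_dept ON purchase;

    -- CLUSTER is a one-time rewrite; once a table has been clustered,
    -- "CLUSTER purchase;" will reuse the same index. Refresh statistics after.
    ANALYZE purchase;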

SQL Server, as well as MySQL, makes clustering more convenient by letting you define an index as clustered; you may declare one index per table to be the clustered index. To maximize the performance of our purchase table, you may want to order the records by either (store, department, sales rep) or perhaps by (customer). When records are inserted, they are inserted at the position specified by the index. As carefully described in a blog post by Sunil Agarwal, a Program Manager in the SQL Server Storage Engine Group at Microsoft, this will result in a severely fragmented table.
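In SQL Server, the declaration would look something like this (T-SQL; the index name is hypothetical):

    -- Only one clustered index is allowed per table; the rows themselves are
    -- stored in the order of this key.
    CREATE CLUSTERED INDEX IX_purchase_store_dept
        ON purchase (store_id, department_id, sales_rep_id);

Every insert that lands in the middle of that order has to squeeze into an existing page, which is what drives the page splits and fragmentation Agarwal describes.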

Instead, Microsoft urges the user to define the primary key as an auto-generated integer and to cluster the table on the primary key's index. In other words, order the table chronologically. This seems to fundamentally contradict the concept of clustering: the records are 'clustered' in groups of one, and the groups must be in chronological order. I checked some tables that followed this rule, and they all had 0% (or at least under 1%) fragmentation on the clustered index, even after a year of operation without maintenance.
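The recommended pattern looks roughly like this (a sketch with made-up columns):

    -- Cluster on an ever-increasing surrogate key, so new rows are always
    -- appended at the end of the clustered index.
    CREATE TABLE purchase (
        purchase_id   int IDENTITY(1,1) NOT NULL,
        store_id      int NOT NULL,
        department_id int NOT NULL,
        sales_rep_id  int NOT NULL,
        customer_id   int NOT NULL,
        amount        money NOT NULL,
        CONSTRAINT PK_purchase PRIMARY KEY CLUSTERED (purchase_id)
    );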

Recently, I was asked to improve the performance of a set of queries that were run against a SQL Server 2000 database. A review of the index statistics showed severe fragmentation, for the reasons outlined by Sunil Agarwal. To see if defragmenting would matter, I followed the advice in Microsoft SQL Server 2000 Index Defragmentation Best Practices and defragmented the tables used in one particularly slow query. I was able to reduce the query time from over 4 minutes to just under 10 seconds, and with a rewrite of the query, the query time dropped to under a second. However, after only two days of transaction processing, the clustered index had a fragmentation of over 40%. The query times were not back to the 4 minute level, but the formerly sub-second query was taking more than 10 seconds. In this case, the indexes are fragmenting faster than we can reasonably defragment them, since we cannot take this server offline daily to reindex the tables and DBCC INDEXDEFRAG is simply too slow.
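For reference, the SQL Server 2000 commands involved are DBCC SHOWCONTIG to measure fragmentation, DBCC INDEXDEFRAG for an online defragmentation, and DBCC DBREINDEX for an offline rebuild; the index name below is hypothetical:

    -- Measure fragmentation (watch the "Logical Scan Fragmentation" figure).
    DBCC SHOWCONTIG ('purchase') WITH ALL_INDEXES;

    -- Online defragmentation: the table stays available, but it is slow.
    DBCC INDEXDEFRAG (0, 'purchase', 'IX_purchase_store_dept');

    -- Offline rebuild: much faster, but it locks the table while it runs.
    DBCC DBREINDEX ('purchase', 'IX_purchase_store_dept', 90);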

My conclusion is that in order to make clustered indexes useful, you need to be able to append a significant number of rows to a table and then order those records using the clustered index. Microsoft's solution of inserting each record at the location specified by the clustered index can rapidly cause severe fragmentation. The solutions of PostgreSQL, Oracle and DB2 avoid this issue, but at the cost of additional maintenance.