Sunday, June 13, 2010

Children of Prolog

Recently, I have become interested in functional programming, largely because of the remarkable success of Erlang in producing robust web applications. Erlang seems to have 'solved' concurrent programming, which is one of the growing challenges in programming - especially as we move into a post Web 2.0 world. Erlang was initially implemented in Prolog.
Today, I listened to Joseph M. Hellerstein's keynote address at ACM PODS 2010, The Declarative Imperative: Experiences and Conjectures in Distributed Logic. The slides are also available. He uses Dedalus, a derivative of Datalog. Datalog is, syntactically, a subset of Prolog. Hellerstein also addresses the concurrency conundrum, a.k.a. the Programming Crisis, by retelling a children's classic:

“The sky is falling! The sky is falling! Computers won’t get any faster unless programmers learn to write parallel code!” squawked Chicken Licken.

Henny Penny clucked in agreement: “Worse, there is Cloud Computing on the horizon, and it requires programmers to write parallel AND distributed code!”

This is more fun than your average academic talk. Of course, he then goes on to provide his solution to the programming crisis. He shows how Datalog has been used to solve a series of practical problems in computer science.
The solutions Hellerstein expounds rely upon declarative programming in Datalog variants. He describes practical code for distributed crawlers, networking protocols, query optimizers, distributed file systems and a map-reduce scheduler. What is as impressive as the breadth of topics is his claim that his declarative solutions have code bases orders of magnitude smaller than existing, imperative solutions.
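To get a feel for why Datalog programs are so short, here is a sketch of my own (not Hellerstein's code) of the classic two-rule Datalog program for graph reachability, evaluated by naive fixpoint iteration in Python:

```python
# The Datalog program is just two rules:
#   path(X, Y) :- link(X, Y).
#   path(X, Z) :- link(X, Y), path(Y, Z).
# Below, we evaluate it as a least fixpoint over a set of link facts.

def reachability(links):
    """Compute path/2 as the least fixpoint of the two rules above."""
    path = set(links)                  # rule 1: every link is a path
    while True:
        # rule 2: join link(X, Y) with path(Y, Z) to derive path(X, Z)
        new = {(x, z) for (x, y) in links for (y2, z) in path if y == y2}
        if new <= path:                # nothing new derived: fixpoint reached
            return path
        path |= new

links = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(reachability(links)))
# → [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

The imperative evaluator is longer than the logic program it runs, which is rather the point.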
His code samples describe solving problems with a time dependence that evolves in steps, like a turn-based game. This was a little disconcerting to a physicist, since we like to think of time as continuous. Still, it was interesting to watch a computer scientist shift from 'practical' solutions to a rather abstract discussion of the meaning of time and causality. At this point, I started thinking about the InfoQ talk by Rich Hickey on Persistent Data Structures and Managed References.
From a Prolog beginning, Ericsson developed Erlang. Erlang is having a renaissance that is leading to high-concurrency solutions for web programming. Next, we have Hellerstein using Datalog to address the programming crisis. Where are the imperative, OO solutions? Declarative logic & functional programming - are these the tools that will lead us out of the programming crisis?

Saturday, June 5, 2010

Trends in programming and Amdahl's Law

In my previous blog, I explained trends I saw that will affect programmers. The main takeaway is that we are moving into a world of concurrent programming. If we do concurrent programming with memory shared by several threads of execution, life is going to be miserable for programmers. Fortunately, message passing via actors is a viable path to concurrency that is much easier to program.
In any discussion of concurrency, the iron law is Amdahl's Law. If you have a process in which some fraction, s, of the steps are sequential and the rest, 1-s, can be done in parallel, then the most you can gain from concurrency over N processors is to reduce the time to execute from 1 to s + (1-s)/N. So basically, the time to execute a program is limited by s as much as by N. The hardware determines N and the software determines s. So what determines s? Basically, s is the fraction of the algorithm that you spend queueing stuff. If I go to the grocery store, I can shop in parallel with everyone else. It takes a little longer to shop in a busy store, but only a little longer. But when it is time to check out, I have to queue up to pay.
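The formula is simple enough to sketch directly; here is a tiny Python illustration of how quickly s comes to dominate N:

```python
def amdahl_speedup(s, n):
    """Maximum speedup with serial fraction s on n processors:
    execution time drops from 1 to s + (1 - s) / n."""
    return 1.0 / (s + (1.0 - s) / n)

# With just 5% serial work (the checkout queue), 1024 cores give
# less than a 20x speedup, because the speedup can never exceed 1/s = 20.
print(round(amdahl_speedup(0.05, 1024), 1))  # → 19.6
```

So shaving a few points off s is worth more than a warehouse of extra cores.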
What is exciting is that techniques like 'no shared state' and Actors for communication can greatly reduce s, so we can write software that efficiently takes advantage of concurrent hardware (i.e., big N). If we look at a standard sort of web application, the bottlenecks are things like writing to the database. The first thing to do is to replace a single server with a cluster of identical servers, so that a given client can access any one of the servers and get the job done.
But there turns out to be a better plan - sharding. In sharding, you still have a cluster of N servers, but each server no longer holds identical data. If I have a large database, I would still create N servers, but I would shard the data so that any given server has only a fraction of it. This is much like partitioning data on a RAID array, and the more sophisticated designs take into account 'k-safety'. Reed-Solomon erasure codes exhibit this behavior, as do Tornado codes. The data is divided into n+k partitions so that any set of n partitions can reconstruct the entire data set. This means you can lose k partitions before you lose any data. This idea is central to VoltDB, but they implement it by having k+1 copies of each block. It seems that a more efficient design could be built using erasure codes (see exercise 11.9 in David MacKay's Information Theory, Inference and Learning Algorithms).
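To make the n+k idea concrete, here is a toy Python sketch of the simplest case, k = 1, using XOR parity. (Real systems would use Reed-Solomon or Tornado codes to get k > 1; this is not VoltDB's actual implementation.)

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def add_parity(blocks):
    """n data blocks -> n+1 partitions (k = 1 safety): any n reconstruct all."""
    return list(blocks) + [xor_blocks(blocks)]

def recover(partitions, lost_index):
    """Rebuild the single lost partition by XORing the survivors."""
    survivors = [p for i, p in enumerate(partitions) if i != lost_index]
    return xor_blocks(survivors)

data = [b"abcd", b"efgh", b"ijkl"]
parts = add_parity(data)          # 4 partitions; lose any 1 and still recover
assert recover(parts, 1) == b"efgh"
```

The storage overhead here is 1/n extra, versus a full k+1 = 2x for plain copies - that is the efficiency gain MacKay's exercise points at.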


Wednesday, June 2, 2010

Trends in programming

There is a dizzying wealth of new options for web programmers. I am trying to understand why there are so many changes and which of those changes I should be choosing to learn. I have found that the answer to this sort of question often comes from looking backward and trying to judge current trends in an historical context.
Hardware: Multicore is the future
This is an inevitable trend, as carefully explained in The Future of Computing by Anant Agarwal to MIT freshmen. The power consumption of a CPU scales as the cube of the voltage, so trying to improve performance by brute force is a losing game. Instead, it is much better to build smaller, lower-power cores. Prof. Agarwal is among those who are laying the foundation for a future of RISC cores arranged in a tiled multicore architecture. As a programmer, you can get much better performance if you can leverage concurrency. We can leave the hardware issues to the likes of MIT, but the programming challenges will affect everyone who programs.
To better understand hardware and how it affects software, listen to A Crash Course in Modern Hardware by Cliff Click of Azul. Azul has systems with 864 processor cores, so they are on the front lines of this revolution. He explains how modern computers, with caches and pipelining, are not particularly well described by the von Neumann architecture. The entire talk seems like an explanation of how hardware engineers have been addressing the von Neumann bottleneck between CPU and memory. Even here, we find that each physical core is running concurrent processes.
So, we are seeing horizontal scaling of computers in clusters, with each computer having more cores and with each core handling more concurrent tasks. I can spot a trend - concurrent processing is the future. But I HATE threading and locks.
Functional Programming and Concurrency
Erlang was developed for high reliability in Ericsson telephone switches. Joe Armstrong describes how they tried to develop Systems that Never Stop and why that led to Message Passing Concurrency in Erlang. Armstrong came up with a list of six laws for reliable systems. The laws/requirements are:
  1. isolation (the probabilities of isolated systems failing are independent, so processes need to be isolated)
  2. concurrency (you need at least two systems to make a non-stop system, and even a two-computer system is concurrent - you can't avoid concurrency)
  3. failure detection (if you can't detect a failure, you can't fix it; the thing that crashed can't be the thing that detects the failure)
  4. fault identification (you need to figure out why there was a failure if you want to fix it)
  5. live code upgrade (systems that should never stop need live code upgrade)
  6. stable storage (storage reliable enough to need no backups implies multiple copies; crash logs must also be kept)
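The shape of laws 1 and 3 can be sketched even in Python (where isolation is only by convention, since threads share memory - unlike Erlang processes): an actor owns a mailbox, all interaction is by message, and a crash becomes a message to a supervisor rather than taking the whole system down.

```python
import queue
import threading

def actor(mailbox, supervisor):
    """An isolated worker: it touches no shared state, and a crash is
    reported to the supervisor as a message (law 3: failure detection)."""
    try:
        while True:
            msg = mailbox.get()
            if msg == "stop":
                supervisor.put(("done", None))
                return
            if msg == "boom":
                raise RuntimeError("simulated crash")
    except Exception as exc:
        supervisor.put(("failed", str(exc)))

supervisor_q = queue.Queue()
mailbox = queue.Queue()
worker = threading.Thread(target=actor, args=(mailbox, supervisor_q))
worker.start()
mailbox.put("boom")                  # trigger a failure in the worker
status, reason = supervisor_q.get()  # the supervisor learns of the crash
worker.join()
print(status, reason)  # → failed simulated crash
```

In Erlang the runtime enforces the isolation and the supervisor trees are a library; here it is just a pattern.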
Erlang is a functional language, so there are no side effects to calculations. Joe Armstrong discusses the importance of functional programming in Functions + Messages + Concurrency = Erlang. There really is no point in arguing about whether he is right - Erlang is the proof. It is reliable: Ericsson has the AXD 301 switch with two million lines of code, and BT operates these switches at 'nine-nines' reliability - it can fail for 32 ms each year. BT has run switches for six years without stopping. Erlang also allows for high concurrency, as clearly illustrated in Apache vs. Yaws: while Apache and Yaws have similar throughput for scores of concurrent clients, Apache bogs down at around 4 thousand parallel sessions. By contrast, the Erlang-powered Yaws can serve over 80 thousand parallel sessions on Linux with similar hardware. Apache is a well-written server; Netcraft shows that Apache is easily the #1 server on the World Wide Web in May 2010. But Yaws beat it handily in this measure. In RESTful Services with Erlang and Yaws, Steve Vinoski examines Yaws in more detail.
But perhaps Erlang is hard to use and expensive to code? According to Ulf Wiger, Ericsson recorded a Four-fold Increase in Productivity and Quality on the AXD 301 project. Joe Armstrong is very optimistic about the future of Erlang, and I think he is right.
When I looked into this some more, I realized that Erlang was not the 'bolt from the blue' that it first seemed to me. Some of the greatest minds in programming have noticed that imperative programming is at something of a standstill. Read John Backus's 1977 ACM Turing Award Lecture, Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs. This broadside attack on imperative programming is all the more remarkable because it comes from the author of FORTRAN. He goes on to explain what he sees as the advantages of functional programming. This must have been a controversial paper; Edsger Dijkstra's critique of Backus's lecture ended with:
In short, the article is a progress report on a valid effort but suffers badly from aggressive overselling of its significance, long before convincing results have been reached.
I tend to believe both Backus and Dijkstra. I resolve any apparent conflict by admitting that Backus was ahead of his time.

Monday, May 10, 2010

Gaelyk again

Gaelyk is gaining ground as a Google AppEngine framework. developerWorks has Java development 2.0: Gaelyk for Google App Engine by Andrew Glover. This title emphasizes that Java is more of a platform than a language. I tried this a few months ago, but was disappointed by the initial load times. The application runs fine once it has loaded, but my application doesn't get many hits, so most users ended up hitting a cold site.

One solution is to regularly ping the site with a service like Pingdom at something like a 1-minute interval. This seems like overkill to me, but 43,200 pings a month is still well below the roughly 500,000 hits per month that triggers payments. A Basic Pingdom account is $9.95 per month, and there is also a free account that would be fine for a single site. In addition, Groovy 1.7 has improvements for faster compilation. So the performance issues are getting resolved.

The joy of Gaelyk is getting to use the AppEngine features with a minimum of fuss. Pratik Patel's screencast shows how to build a basic CRUD application. This is also shown nicely in the Gaelyk tutorial. Recently, Gaelyk introduced plugins in version 0.4. There is also a great tutorial about adding REST via JSON and Jackson. It also seems like it should be reasonably easy to integrate with RESTClient.

Saturday, March 20, 2010

Jeffrey Zeldman vs. Ed Bott

In his ZD Net Blog, Ed Bott stated:
In one breath he said, “I’m not challenging the quality of the hardware and software improvements,” and then, in the very next breath, he criticized Microsoft’s “brilliant browser engineers” for “torturing the IE rendering engine every couple of years instead of putting it out of its misery.”
This certainly seems damning. But it isn't quite true. Each of the quotes can be pulled from Zeldman's blog, IE9 Preview, but Bott's juxtaposition of Zeldman's words is downright deceptive. The two statements were separated by four paragraphs, so perhaps Bott doesn't quite understand what 'very next breath' means. Or, more likely, Bott is intentionally being misleading.

It is clear that Jeffrey Zeldman does respect Microsoft Engineering, and that he considers at least some of the Microsoft engineers to be brilliant.

Jeffrey Zeldman clearly states that he is writing in response to Dean Hachamovitch's report, An Early Look at IE9 for Developers. Zeldman's blog is dated 9 AM Eastern on 16 Mar 2010. Bott then launched into a description of Hachamovitch's presentation at MIX in Las Vegas. The talk was live-blogged by Sharon Chan at 9:51 AM Pacific on 16 March 2010 - or about 4 hours after Zeldman posted his blog.

For some reason, Bott is surprised that Zeldman didn't report on the MIX talk, which was still 4 hours in the future. This is especially odd since Bott was at the talk, so I would have expected him to note who was present in the room.

What were the significant changes that occurred between Hachamovitch's blog and his presentation at MIX? From what I can tell, the major features introduced at the talk were all known at PDC 2009 (18 November 2009):
  • HTML5
  • SVG 1.1
  • GPU rendering using Direct2D
  • CSS3
  • new JavaScript engine
This is good news. What has changed since November:
  • an improved Acid3 score (from 32 to 55)
  • support for CSS3 selectors (which improved from passing 574/578 tests to a perfect 578/578)
So, we are seeing IE9 making significant improvements in standards compliance. But this is just a start. To see how IE9 is faring on SVG 1.1, we can look at a recent blog at codedread. On the official SVG Test Suite, the IE9 preview scores 28%, Firefox 3.7 scores 73%, Opera 10.5 scores 94%, Chrome 5 scores 87% and Safari 4 scores 82%. These numbers are quite different from the test results presented by Microsoft. Haavard noticed that before I did.

The Acid3 score of 55 is well below the scores of the production versions of Firefox 3.5 (94%), Opera (100%), Chrome (99%) and Safari (100%).

That still leaves JavaScript. In addition to speed, a browser's JavaScript should conform to the ECMA standard. Google released the Sputnik JavaScript conformance test suite. The Chromium blog recently posted test results. The winner is Opera (78 failures) and the loser is IE8, with 463 failures. In terms of JavaScript performance, IE9 is now respectably fast. It is on par with the other modern JavaScript engines.



Thursday, March 4, 2010

Making the case for Scala

At first glance, Scala seems like an odd choice for developing web applications. Scala, a JVM language created by Martin Odersky of EPFL, seems like an academic's testbed for a 'grand unified language' combining functional programming and strict object-oriented design. Scala is clearly gaining in popularity, but the current TIOBE Software Index places Scala at #24 with a rating of only 0.459%, far behind the #1 language, Java, which has a rating of 17.348%. So, if you want to be on a JVM, why not stay with the undisputed champion of the TIOBE ranking since June 2001?

The key reason is money. Developer time is money. Developer time is, to a remarkably good approximation, proportional to the number of lines of code that need to be written. This holds over a wide range of languages. Rather than regurgitate an existing piece, I urge you to read Research: Programming Style and Productivity and Scala vs Java: Conciseness equals less bugs. Scala was found to reduce lines of code by 30-35%. This should lead to a commensurate increase in productivity.

The second reason is that functional programming looks like it is the future. Erlang proves that functional programming can lead to high reliability systems with massive concurrency. Computers are becoming highly concurrent, not faster.

The third reason is more prosaic. Liftweb gives many of the advantages of Ruby on Rails and it also provides type checking. Call me old fashioned, but I get warm fuzzy feelings knowing that a compiler is preventing wide classes of errors.

The last reason is still in the future. In 2002, Phil Bagwell wrote Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays, introducing new data structures for functional programming. Bagwell is Swiss and is at EPFL, where Martin Odersky works, so EPFL is the center of Scala development. I learned about these structures by watching Rich Hickey's QCon London 2010 talk, Persistent Data Structures and Managed References. The slides are available on Michael Galpin's blog entry Persistent Data Structures. This excites me, because many of the ideas about state took me back to physics, especially statistical mechanics. The talk, and Clojure's approach to state, just seems right. So why is this a reason to use Scala!? Why not cut to the chase and go right to Clojure? At this point, I don't know; that may be the right answer. But Scala is a bit of a chameleon, and it seems to me that the concepts of State and Identity, as well as the persistent data structures that make them efficient, will be implemented soon in Scala. Scala 2.8 is getting Bagwell's VLists.
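Bagwell's structures (and Clojure's) are far more sophisticated, but the core trick - persistence through structural sharing - fits in a few lines. An illustrative Python sketch (not Scala 2.8's actual implementation):

```python
class PList:
    """An immutable cons cell: 'updates' build a new list that shares
    its tail with the old one instead of copying it."""
    __slots__ = ("head", "tail")
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

def cons(x, xs):
    """Prepend x, leaving xs untouched (the persistent 'update')."""
    return PList(x, xs)

def to_list(xs):
    """Walk a PList into an ordinary Python list for inspection."""
    out = []
    while xs is not None:
        out.append(xs.head)
        xs = xs.tail
    return out

old = cons(2, cons(3, None))
new = cons(1, old)            # old is unchanged; its cells are shared
assert to_list(old) == [2, 3]
assert to_list(new) == [1, 2, 3]
assert new.tail is old        # sharing, not copying
```

Because nothing is ever mutated, any number of threads can read old and new concurrently without locks - which is the connection back to the concurrency story.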

Suddenly, this is becoming an interesting time to be a programmer, and the JVM seems to be the epicenter of these changes. The JVM is very robust - there has been a decade of work with theorem provers that demonstrate the robustness of the design. There has also been a decade to flush out the bugs. Now, Scala, Groovy and Clojure are leveraging the JVM, and the plethora of libraries on the JVM. Full speed ahead!

Monday, March 1, 2010

David Pollak's Web Framework Manifesto

The best part of blogs is that they make it possible for people with clear ideas, not just fame or access to money, to be published. David Pollak certainly falls within the category of clear-thinking people, and his Web Framework Manifesto, from November 2006, is a great technical summary of what we all should expect of a web framework. It is fascinating to see the breadth of frameworks that he has examined.


Security figures prominently in this manifesto. He states:

There should exist an orthogonal security layer such that objects that are not accessible to a user should never be returned in a query for the user and fields on an object that are not accessible should not be visible. The security and access control rules should be algebraic and demonstrable during a security audit. This means that neither the view nor the controller should have to test for access control. Objects and requests should be sanitized before they get to the “controller.”
At first, this seems like Spring Security, but the word 'algebraic' caught my eye. I don't know what this means. A Google search found Beyond Separation of Duty: An Algebra for Specifying High-level Security Policies, but this seems quite theoretical at this point. In the IEEE Security & Privacy journal, the article A Metrics Framework to Drive Application Security Improvement describes a more pragmatic, but still quantitative, approach to security.
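My best guess at a reading (emphatically my guess, not Pollak's definition): if every access rule is a filter on result sets, then composing rules is just set intersection, which is easy to reason about in an audit, and neither view nor controller ever tests for access. A toy Python sketch with made-up rule names:

```python
def visible(objects, user, rules):
    """Apply every rule as a filter; composing rules is intersection,
    so the order of rules can never matter."""
    result = objects
    for rule in rules:
        result = [o for o in result if rule(user, o)]
    return result

# Hypothetical rule: you may see your own objects, or public ones.
owner_or_public = lambda user, obj: obj["owner"] == user or obj["public"]

docs = [
    {"id": 1, "owner": "ann", "public": False},
    {"id": 2, "owner": "bob", "public": True},
]
assert [o["id"] for o in visible(docs, "ann", [owner_or_public])] == [1, 2]
assert [o["id"] for o in visible(docs, "eve", [owner_or_public])] == [2]
```

The controller downstream of visible() can then be written as if access control did not exist, which is what the quote demands.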

David then went on to create and act as benevolent dictator for life of the Liftweb framework, or just Lift. As a benevolent dictator, it is no surprise that the key ideas of the Web Framework Manifesto are all present in Lift.

Friday, February 12, 2010

Liftweb 1.1 and Google AppEngine

With only minor modifications, I have repeated Joe Kutner's installation of Liftweb on AppEngine. I'm running OS X, so I started by using MacPorts to install maven2. I saw that Java AppEngine was just updated to 1.3.1, so I installed that and adjusted my APPENGINE_HOME and M2_HOME in .profile. The latest Liftweb version is 1.1-M8, so I'll try that one.

Now that I know I have updated libraries, I generated a liftweb project with:

mvn archetype:generate -U -DarchetypeGroupId=net.liftweb \
  -DarchetypeArtifactId=lift-archetype-blank -DarchetypeVersion=1.1-M8 \
  -DremoteRepositories=http://scala-tools.org/repo-releases \
  -DgroupId=com.folkertsfotografie.rlftest -DartifactId=rlftest

Rather than using -Dversion=1.0-SNAPSHOT, I entered a version of 1.1-M8 when prompted by maven. As usual, the next commands were cd rlftest and a quick mvn jetty:run to see that http://localhost:8080/ was working. As an eclipse user, I added mvn eclipse:eclipse. So far, so good. The simple hello world application came up normally under the jetty installed with Liftweb.

To integrate with AppEngine, I just added the boilerplate appengine-web.xml in my WEB-INF directory. The contents are:

<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>rlftest</application>
  <version>3</version>
  <system-properties>
    <property name="in.gae.j" value="true" />
  </system-properties>
  <sessions-enabled>true</sessions-enabled>
  <static-files>
    <exclude path="/**" />
  </static-files>
</appengine-web-app>

With that done, I built the war file with mvn package. To run that war file under AppEngine, just type

$APPENGINE_HOME/bin/dev_appserver.sh target/rlftest-1.1-M8.

That also worked; I could see http://localhost:8080/. Finally, I published this to my appspot site with $APPENGINE_HOME/bin/appcfg.sh update target/rlftest-1.1-M8.

With that done, I browsed my AppEngine control panel and tested the hello world application. It worked without a hitch. Now, I can get back to learning Liftweb and Scala. It isn't at all clear to me how Liftweb and the datastore will work together, but I'll just have to burn that bridge when I get to it. So thanks, Joe, you have been a great help getting me started with Lift.

Sunday, February 7, 2010

Google Analytics: integration with web apps.

Google Analytics is a powerful, free tool for gathering information on events in the click streams of your website's visitors. I'm trying to develop a web application for photographers, starting with my wife's site, folkertsfotografie.com. I'm tracking progress on fotositeproject and the new webapp is running on GAE, Google App Engine. This is a rather Google-centric project. I'm using Picasa to hold image galleries. The RSS from Blogger, Picasa and flickr is being integrated using Google's Feed API via the jQuery library jGFeed. So when we wanted to add analytics, Google Analytics was at the top of our list.

I had hoped to develop with Grails, but it has proven to be too slow. This is a known issue, and both Google and SpringSource are working on speed-ups. I switched to Gaelyk, a lighter Groovy framework for GAE, which has resulted in reasonable performance. I looked at GAE sites using other Java frameworks, and was surprised how much faster they are. However, GAE has similar issues with Rails on JRuby, presumably for the same architectural reasons. GAE performance with Groovy is still an issue, but one that I trust will have a happy ending in the next few months. GAE does a lot of validating of included libraries on startup, which is probably a really good thing for security. I wonder if they could have some standard pre-verified jar files for Groovy, JRuby, Grails, SiteMesh and so on. When I want to use one of these jars, I agree to use the 'Google compatible version'. When I load my app, they check the MD5 hash to make sure that I am still using an unadulterated copy, and they can just copy their jar image into my application's memory without the expensive validation. Can't we pay this price once when we load the app, not every time we start it up? There is still activity in the Grails community to develop GAE integration, so I can't believe that this issue will remain for long.
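The pre-verified-jar idea might look something like this sketch (pure speculation on my part - this is not a real GAE feature, and all the names here are made up):

```python
import hashlib

def startup_check(jar_name, jar_bytes, approved):
    """If the jar's hash matches a pre-approved digest, skip the
    expensive per-startup validation; otherwise validate fully."""
    digest = hashlib.md5(jar_bytes).hexdigest()
    if approved.get(jar_name) == digest:
        return "fast path: reuse pre-verified jar image"
    return "slow path: full validation"

# Pretend registry of Google-blessed library digests (hypothetical).
jar = b"pretend these are the bytes of groovy-all.jar"
approved = {"groovy-all.jar": hashlib.md5(jar).hexdigest()}

print(startup_check("groovy-all.jar", jar, approved))          # fast path
print(startup_check("groovy-all.jar", b"tampered", approved))  # slow path
```

The check is cheap, so the validation cost would be paid once at upload time rather than on every cold start.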


Returning to my original topic: Google Analytics requires adding some code snippets to the bottom of each tracked 'page' or in the event handlers for tracked events. In an MVC design, the code snippets must reside in the views. However, the URLs are tied to controllers, so the keys for each URL need to be managed by the controller. I had assumed that a quick search would find others who had discussed this topic. Surprisingly, this is not the case. I have written a page that will read a Picasa RSS feed and then show the album in Galleriffic, a terrific jQuery image gallery. I want to have URLs like /gallery/show/sarah2009, so I don't want to place 'the key' in a file like /album/show.gsp, since the controller doesn't even expose that URL to the user. I certainly want to track patrick2009 with a different code than campingTripSpring2007, even if they both use the same view. So for each controller, I need a map of the view, the model, and the Google Analytics key to use. This isn't quite model data, since the model should only have 'domain' data, which in this case is information about the albums, media feeds, images, models and clients. It seems that only the controller is in a position to address this.
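The controller-owned mapping I have in mind would look roughly like this (a language-neutral sketch in Python rather than actual Gaelyk code; all names and keys are hypothetical):

```python
# The controller, not the model, owns the URL-to-tracking-key map,
# because only the controller knows which URL produced the request.
ANALYTICS_KEYS = {
    "sarah2009": "UA-XXXXXX-1",
    "patrick2009": "UA-XXXXXX-2",
    "campingTripSpring2007": "UA-XXXXXX-3",
}

def show_album(album_id):
    """Controller action: resolves domain data AND the tracking key,
    then hands both to the same shared view."""
    model = {"album": album_id}                          # domain data only
    tracking = ANALYTICS_KEYS.get(album_id, "UA-XXXXXX-0")  # site default
    return {"view": "album/show", "model": model, "analytics_key": tracking}

assert show_album("sarah2009")["analytics_key"] == "UA-XXXXXX-1"
```

The view just drops analytics_key into the Analytics snippet, so two albums can share /album/show.gsp while being tracked separately.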

Wednesday, February 3, 2010

REST and application evolution

REST turns out to be the key to allowing software to be upgraded without breaking applications. The idea is pretty simple: you agree on the small number of REST verbs. You then have interfaces that are specified by URL and a version number (perhaps part of the URL). You then allow clients to specify their version number in a request.
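A minimal sketch of the idea (my illustration, with invented resource names - not code from the talks below): the client names the interface version, and the server dispatches to the matching handler, so old clients keep working after the new version ships.

```python
# Each (verb, URL, version) triple maps to an independent handler.
HANDLERS = {
    ("GET", "/orders", "v1"): lambda: {"orders": [], "format": "v1"},
    ("GET", "/orders", "v2"): lambda: {"orders": [], "page": 1, "format": "v2"},
}

def dispatch(method, path, version="v1"):
    """Route by verb + URL + client-supplied version; v1 stays
    routable after v2 is deployed, so upgrades break nothing."""
    handler = HANDLERS.get((method, path, version))
    if handler is None:
        return {"error": "unsupported version"}
    return handler()

assert dispatch("GET", "/orders", "v2")["format"] == "v2"
assert dispatch("GET", "/orders")["format"] == "v1"   # old clients unaffected
```

In practice the version usually rides in the URL (/v2/orders) or an Accept header, but the dispatch logic is the same.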

This is discussed in a couple of InfoQ talks. Alex Antonov gave Case Study: RESTful Web Services at Orbitz. He shows how Protocol Buffers allow for efficient (he claims 7 times faster than SOAP) and flexible message passing over HTTP. Protocol Buffers is an open source data serialization format developed by Google; the exchanged data is encoded with protobuf.

Steve Vinoski gives a great talk on RPC and its fundamental flaws. He was an expert in CORBA in the 90's and he has recently moved to REST and Erlang. He is quite convinced that functional programming is the future and that imperative languages, such as Java and C#, are based upon the wrong model of distributed programming.

These talks dovetail nicely with Kirk Wylie's earlier InfoQ talk, Restful approaches to financial systems.


Saturday, January 23, 2010

Google Sites - easy project web sites

I'm working with my wife to develop a web site for her business, FolkertsFotografie. Currently, it is a SmugMug site. She isn't a fan of selling images via shopping carts, so SmugMug offers no special advantage. I want to learn about Google AppEngine, so we set up a 'project' to make this happen.

This is where Google Sites fits in. I wanted to organize all of the brainstorming, designing and so forth. Google Sites makes it easy. See for yourself at Fotositeproject. There is easy integration with a host of Google tools, such as Google Docs, Calendar, Groups, Analytics and Webmaster Tools. Plus, I can rest knowing that I have offsite backup of everything and I do not have to worry about keeping the servers running 24/7.

Google Analytics & Webmaster Tools give me as much monitoring data as my employer gets from IT. But unlike my employer, I get these services from Google for free.

Friday, January 8, 2010

Grails: By the numbers

Everybody likes to talk about highly productive environments, but numbers are notoriously hard to come by. This makes alterlab's ALTERthoughts blog entry, Grails vs. Rails, especially interesting. Using Java J2EE as a base, both Ruby on Rails and Grails were found to be twice as productive. Grails beat Rails, but by a margin so slim it seems a dead heat.