Swamp Fox Analyst

Sunday, June 13, 2010

Children of Prolog

Recently, I have become interested in functional programming, largely because of the remarkable success of Erlang in producing robust web applications. Erlang seems to have 'solved' concurrent programming, which is one of the growing challenges in programming - especially as we move into a post Web 2.0 world. Erlang was initially implemented in Prolog.

Today, I listened to Joseph M Hellerstein's Keynote address at ACM PODS 2010, The Declarative Imperative: Experiences and Conjectures in Distributed Logic. The slides are also available. He uses Daedalus, a derivative of Datalog. Datalog is, syntactically, a subset of Prolog. Hellerstein also addresses the concurrency conundrum, a.k.a., the Programming Crisis by retelling children's classics

“The sky is falling! The sky is falling! Computers won’t get any faster unless programmers learn to write parallel code!” squawked Chicken Licken.

Henny Penny clucked in agreement: “Worse, there is Cloud Computing on the horizon, and it requires programmers to write parallel AND distributed code!”

This is more fun than your average academic talk. Of course, he then goes on to provide his solution to the programming crisis. He shows how Datalog has been used to solve a series of practical problems in computer science.

The solutions Hellerstein expounds rely upon declarative programming in Datalog variants. He describes a series of practical code for distributed crawlers, networking protocols, query optimizers, distributed file systems and a map-reduce scheduler. What is as impressive as the breadth of topics is his claim that his declarative solutions have code bases that are orders of magnitude smaller that existing, imperative solutions.

His code samples describe solving problems with a time dependence that evolves in steps, like a turn-based game. This was a little disconcerting to a physicist, since we like to think about time as continuous, still it was interesting to find a computer scientist shift from 'practical' solutions to a rather abstract discussion on the meaning of time and causality. At this point, I am starting to think about the InfoQ talk by Rich Hicky on Persistent Data Structures and Managed References.

From a Prolog beginning, Ericsson developed Erlang. Erlang is having a renaissance that is leading to high concurrency solutions for web programming. Next, we have Hellerstein using Datalog to address the programming crisis. Where are the imperative, OO solutions? Declarative logic & functional programming - are these the tools that will lead out out the the programming crisis?

Saturday, June 5, 2010

Trends in programming and Amdal's Law

in my previous blog, I explained trends I saw that will affect programmers. The main takeaway is that we are moving into a world of concurrent programming. If we do concurrent programming with memory shared by several threads of execution, life is going to be miserable for programmers. Fortunately, message passing via actors is a viable path to concurrency that is much easier to program.

In any discussion of concurrency, the iron law is Amdahl's Law. If you have a process in which some fraction, s, of the steps are sequential and the rest, 1-s, can be done in parallel, then the most you can gain from concurrency over N processors is to reduce the time to execute from 1 to s + (1-s)/N. So basically, the time to execute a program is limited by s as much as by N. The hardware determines N and the software determines s. So what determines s? Basically, s is the fraction of the algorithm that you spend queueing stuff. If I go to the grocery store, I can shop in parallel with everyone else. It takes a little longer to shop in a busy store, but only a little longer. But when it is time to check out, I have to queue up to pay.

What is exciting is that by using techniques like 'no shared state' and Actors for communication can greatly reduce s, so we can write software that efficiently tanks advantage of concurrent hardware (i.e., big N). If we look at a standard sort of web applications, the bottlenecks are things like writing to the database. The first thing to do is to replace a single server with a cluster of identical servers, so that a given client can access any one of the servers and get the job done.

But there turns out to be a better plan - sharding. In sharding, you still have a cluster of N servers, but you don't have each server having identical data. If I have a large database, I would shard it by creating N servers, but I would shard the data so that any given server has only a fraction of the data. This is much like partitioning data on a RAID array, and the more sophisticated designs take into account 'k safety". Reed-Solomon erasure codes exhibit this behavior, as do Tornado codes. This means that the data is divided in to n+k partitions so that any set of n partitions can reconstruct the entire data set. This means you can loose k partitions before you loose any data. This idea is central to VoltDB, but they implement it by having k+1 copies of each block. It seems that a more efficient design could be built using erasure codes (See exercise 11.9 in David MacKay's Information Theory, Inference and Learning Algorithms.)

Wednesday, June 2, 2010

Trends in programming

There is a dizzying wealth of new options for web programmers. I am trying to understand why there are so many changes and which of those changes I should be choosing to learn. I have found that the answer to this sort of question often comes from looking backward and trying to judge current trends in an historical context.

Hardware: Multicore is the future

This is an inevitable trend, as carefully explained in The Future of Computing by Anant Agarwal to MIT freshman. The power consumption of a CPU scales as the cube of the voltage. So trying to improve performance by brute force is a loosing game. Instead, it is much better to build smaller, lower power cores. Prof. Agarwal is among those who are laying the foundation for a future of RISC cores arranged in a tiled multi core architecture. As a programmer, you can get much better performance if you can leverage concurrency. We can leave the hardware issues to the likes of MIT, but the programming challenges will affect everyone in who programs.

To better understand hardware and how it affects software, listen to A Crash Course in Modern Hardware by Cliff Click of Azul. Azul has systems with 864 processor cores, so they are on the front lines of this revolution. He explains how modern computers, with caches and pipelining, are not particularly well described by the von Neumann architecture. The entire talk seems like an explanation of how hardware engineers have been addressing the von Neumann bottleneck between CPU and memory. Even here, we find that each physical core is running concurrent processes.

So, we are seeing horizontal scaling of computers in clusters, with each computer having more cores and with each core handing more concurrent tasks. I can spot a trend - concurrent processing is the future. But I HATE threading and locks.

Functional Programming and Concurrency

Erlang was developed for high reliability in Ericsson telephone switches. Joe Armstrong describes how they tried to develop Systems that Never Stop and why that lead to Message Passing Concurrency in Erlang. Armstrong came up with a list of six laws for reliable systems. The laws/requirements are:

isolation (the probability of isolated systems failing is independent. So processes need to be isolated.)
concurrency (do you need to worry about how many concurrent processes can you support? You need at least two systems to make a non-stop system. Even a two computer system is concurrent. You can't avoid concurrency)
failure detection (if you can't detect a failure, you can't fix it. the thing that crashed can be the thing that detects failures.)
fault identification (You need to figure out why there was a failure if you want to fix it.)
live code upgrade (Systems that should never stop, need live code upgrade)
stable storage (no backups, implies multiple copies. must keep crash logs)

Erlang is a functional language, so there are no side effects to calculations, Joe Armstrong discusses the importance of Functional Programming in Functions + Messages + Concurrency = Erlang. There really is no point in arguing about if he is right. Erlang is the proof. It is reliable, Ericsson has the AXD 301 switch with two million lines of code: BT operates these switches at a 'nine-nines' reliability - it can fail for 32 ms each year. BT has run switches for six years without stopping. Erlang also allows for high concurrency, this is clearly illustrated in Apache vs. Yaws: while Apache and Yaws have similar throughputs for score of concurrent clients, Apache bogs down at around 4 thousand parallel sessions. By contrast, the Erlang powered Yaws can serve over 80 thousand parallel sessions on Linux with similar hardware. Apache is a well written server, Netcraft shows that Apache is easily the #1 server on the World Wide Web in May 2010. But Yaws beat it handily in this measure. In RESTful Services with Erlang and Yaws, Steve Vinoski examines Yaws in more detail.

But perhaps Erlang is hard to use and expensive to code? According to Ulf Wigner, Ericsson recorded a Four-fold increase in Productivity and Quality on the ADX 301 project. Joe Armstrong is very optimistic about the future of Erlang, and I think he is right.

When I looked into this some more, I realized that Erlang was not the 'bolt of the the blue' that is first seemed to me. Some of the greatest minds in programming have been noticing that imperative programming has been at something of a standstill. Read John Backur's 1977 ACM Turing Award Lecture, Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs. This is a broadside attack on imperative programming is all the more remarkable because it is coming from the author of FORTRAN. He then goes on to explain what he sees are the advantages of functional programming. This must have been a controversial paper, Edsger Dijkstra critique of Backur's lecture ended with:

In short, the article is a progress report on a valid effort but suffers badly from agressive overselling of its significance, long before convincing results have been reached.

I tend to believe both Backur and Dijkstra. I resolve any apparent conflict by admitting that Backur was ahead of his time.

Monday, May 10, 2010

Gaelyk again

Gaelyk is gaining ground as a Google AppEngine framework. Developer Works has Java Development 2.0: Gaelyk for Google App Engine by Andrew Glover. This title emphasizes that Java is more of a platform than a language. I tried this a few months ago, but was disappointed by the initial load times. The application runs fine once it loaded, but my application doesn't get many hits, so most users ended up hitting a cold site.

One solution is to regularly ping the site with a service like Pingdom with something like a 1 minute interval. This seems like overkill to me, but 43,200 pings a month is still well below the roughly 500,000 hits per month that triggers payments. A Basic Pingdom account is $9.95 per moth and there is also a free account that would be fine for a single site. In addition, Groovy 1.7 as improvements for faster compilation. So the performance issues are getting resolved.

The joy of Gaelyk is getting to use the AppEngine features with a minimum of Fuss. Pratik Patel's screencast shows how to build a basic CRUD application. This is also shown nicely in the Gaelyk tutorial. Recently, Gaelyk introduced plugins in version 0.4. There is also a great tutorial about adding REST via JSON and Jackson. It also seems like it should be reasonably easy to integrate with RESTClient.

Saturday, March 20, 2010

Jeffrey Zeldman vs. Ed Bott

In his ZD Net Blog, Ed Bott stated:

In one breath he said, “I’m not challenging the quality of the hardware and software improvements,” and then, in the very next breath, he criticized Microsoft’s “brilliant browser engineers” for “torturing the IE rendering engine every couple of years instead of putting it out of its misery.”

This certainly seems damning. But it isn't quite true. Each of the quotes can be pulled from Zeldman's blog, IE9 Preview, but Bott's juxtaposition of Zeldman's words are downright deceptive. The two statements were separated by four paragraphs, so perhaps Bott quite understands what 'very next breath' means. Or more likely, Bott is being intentionally being misleading.

It is clear that Jeffrey Zeldman does respect Microsoft Engineering, and that he considers at least some of the Microsoft engineers to be brilliant.

Jeffrey Zeldman clearly states that he is writing in response to Dean Hachamovitch's report, An Early Look at IE9 for Developers. Zeldman's blog is dated 9 AM Eastern on 16 Mar 2010. Bott then launched into a description of Hachamovitch's presentation at MIX at Las Vagas. The talk was live blogged by Sharon Chan at 9:51 AM Pacific on 16 March 2010. Or about 4 hours after Zeldman posted his blog.

For some reason, Bott is surprised that Zelman didn't report on the still 4 hour in the future MIX talk. This is even more surprising since Bott was at the talk, so I would have expected him to note who was present in the room.

What there the significant changes that occurred between Hachamovitch's blog and his presentation at MIX? From what I can tell, the major changes introduced at the talk were all known at PDC 2009 (18 November 2009):

html5
SVG 1.1
GPU rendering using Direct2D
CSS3
new JavaScript engine

These are good news. What has changed between November:

improving Acid3 (which did improve from 32 to 55)
support for CSS3 selectors (which improved from passing 574/578 tests to a perfect 578/578)

So, we are seeing IE9 making significant improvements in standards compliance. But this is just a start. To see how IE9 is faring on SVG 1.1, we can look at this recent blog at codedread. On the official SVG Test Suite, IE9 preview is scoring 28%, Firefox 3.7 scores 73%, Opera 10.5 scores 94%, Chrome 5 scores 87% and Safari 4 scores 82%. These numbers are quite different than the test results presented by Microsoft. Haavard noticed that before I did.

The Acid3 score of 55 is well below the scores of the production versions of Firefox 3.5 (94%), Opera (100%), Chrome (99%) and Safari (100%).

That still leaves JavaScript. In addition to speed, a a browser's JavaScript should conform to the ECMA standard. Google released the Sputnik JavaScript conformance test suite. The Chromium blog recently posted test results. The winner is Opera (78 failures) and the looser is IE8, with 463 failures. In terms of JavaScript performance, IE9 is now respectably fast. It is on par with the the other modern JavaScript engines.

Thursday, March 4, 2010

Making the case for Scala

At first glance, Scala seems like an odd choice for developing web applications. Scala is a JVM language created by Martin Odersky of EPFL seems like an academic's testbed for a 'grand unified language' combining functional programming and strict object-oriented design. Scala is clearly gaining in popularity, but the current TIOBE Software Index places Scala at #24 with a rank of only 0.459%, far behind the #1 language, Java, which has a a rank of 17.348%. So, if you want to be on a JVM, why not stay with the undisputed champion of the TIOBE ranking since June 2001?

The key reason is money. Developer time is money. Developer time is, to a remarkably good approximation, proportional to the number of lines of codes that need to be written. This holds over a wide range of languages. Rather than regurgitate an existing piece, I urge you to read Research: Programming Style and Productivity and Scala vs Java: Conciseness equals less bugs. Scala was found to reduce lines of code by 30-35%. This should lead to a commensurate increase in productivity.

The second reason is that functional programming looks like it is the future. Erlang proves that functional programming can lead to high reliability systems with massive concurrency. Computers are becoming highly concurrent, not faster.

The third reason is more prosaic. Liftweb gives many of the advantages of Ruby on Rails and it also provides type checking. Call me old fashioned, but I get warm fuzzy feelings knowing that a compiler is preventing wide classes of errors.

The last reason is still in the future. In 2002, Phil Bagwell wrote Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays. He introduced new data structures for functional programming. Bagwell is Swiss and is at EPFL, where Martin Odersky works. So EPFL is the center of Scala. I learned about these by watching Rich Hickey's QCon London 2010 talk, Persistent Data Structures and Managed References. The slides are available on Michael Galpin's blog entry Persistent Data Strictures. This excites me, because many of the ideas of state took me back to physics, especially statistical mechanics. The talk, and Clojure's approach to state just seems right. So why is this a reason to use Scala!? Why not cut to the chase and go right to Clojure? At this point, I don't know, that may be the right answer. But Scala is a bit of a chameleon, and it seems to me that the concepts of State and Identity, as well as the persistent data structures that make this efficient, will be implemented soon in Scala. Scala 2.8 is getting Bagwell's VLists.

Suddenly, this is becoming an interesting time to be a programmer, and the JVM seems to be the epicenter of these changes. The JVM is very robust - there has been a decade of work with theorem provers that demonstrate the robustness of the design. There has also been a decade to flush out the bugs. Now, Scala, Groovy and Clojure are leveraging the JVM, and the plethora of libraries on the JVM. Full speed ahead!

Monday, March 1, 2010

David Pollak's Web Framework Manifesto

The best part of blogs is that is makes it possible for people with clear ideas, not just fame or access to money, to be published. David Pollak certainly falls within the category of clear thinking people, and his Web Framework Manifesto, from November 2006, is a great technical summary of what we all should expect of a web framework. It is fascinating to see the breadth of frameworks that he has examined.

Security figures predominantly in this manifesto. He states:

There should exist an orthogonal security layer such that objects that are not accessible to a user should never be returned in a query for the user and fields on an object that are not accessible should not be visible. The security and access control rules should be algebraic and demonstrable during a security audit. This means that neither the view nor the controller should have to test for access control. Objects and requests should be sanitized before they get to the “controller.”

At first, this seems like Spring Security, but the algebraic caught my eye. I don't know what this means. A Google search found Beyond Separation of Duty: An Algebra for Specifying High-level Security Policies, but this seems quite theoretical at this point. In the IEEE Security & Privacy Journal, the article, A Metrics Framework to Drive Application Security Improvement, describes a more pragmatic, but still quantitative approach to security.

David then went on to create and act as benevolent dictator for life of the Liftweb framework, or just Lift. As a benevolent dictator, it is no surprise that the key ideas of the Web Framework Manifesto are all present in Lift.