October 26, 2010

Why Membase Uses Erlang

Less and less often (because Erlang is becoming more popular), I’m asked why Membase chose to use Erlang for our cluster management and process supervision component. Common alternatives people suggest are Java, C++, Python, Ruby, and, more recently, node.js and Clojure (which would be my top choice if Erlang were off limits to me).

There are certainly a lot of disadvantages to using Erlang. First and foremost, it’s another runtime environment to build, bundle, and support - we depend on the system Python on Linux and OS X and use py2exe to build executables for Windows, but we actually build and bundle Erlang. With Java, we could potentially depend on the system’s already having Java installed - a little harder on Linux but easier on Windows. In addition, we’ll eventually need a Java runtime to be able to run Java NodeCode modules.

Second, far fewer programmers know Erlang than know Java, C++, or indeed any of the alternatives I mentioned. Its syntax and semantics are unusual, a product of its Prolog roots, its concurrent, functional nature, and the fact that it’s a pragmatic language that’s been in use and evolving since 1986. This hasn’t turned out to be a problem for Membase’s engineering team: four of us can work on the Erlang components, and the Erlang code is small enough that any one of us could easily maintain it. However, it definitely raises the bar for anyone, whether one of our own developers or a community contributor, who wants to work on the cluster management subsystem and doesn’t already know Erlang.

It’s totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say “Erlang has X”) does out of the box a lot of things that we would otherwise have had to build from scratch or piece together from existing libraries. Its dynamic type system and pattern matching (à la Haskell and ML) tend to make Erlang code even more concise than Python or Ruby, two languages known for doing a lot in few lines of code.

The single largest advantage to us of using Erlang has got to be its built-in support for concurrency. Erlang models concurrent tasks as “processes” that can communicate with one another only via message-passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. Because processes share no mutable memory, this alone makes an entire class of concurrency-related bugs, the data races that come from threads modifying shared state, impossible. While it doesn’t completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. It’s certainly possible to implement the actor model in most if not all of the alternative environments I mentioned, and in fact such implementations exist, but they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.
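For readers who haven’t seen the actor model in code, here is a rough analogy written in Python rather than Erlang. It is only a sketch under stated assumptions, not how Membase or the Erlang VM does anything: the names `CounterActor` and `demo_counter` are made up for illustration, a thread plus a `queue.Queue` stands in for an Erlang process and its mailbox, and the `if`/`elif` on the message tag is a crude stand-in for pattern matching.

```python
import queue
import threading


class CounterActor:
    """A tiny actor: private state, one mailbox, one thread handling messages in order."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # only the actor's own thread ever touches this
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Deliver a message asynchronously; this is the only way to reach the actor."""
        self._mailbox.put(message)

    def join(self):
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            tag = message[0]
            if tag == "add":
                self._count += message[1]
            elif tag == "get":
                # Reply by sending a message back on the caller's own queue.
                message[1].put(self._count)
            elif tag == "stop":
                return


def demo_counter():
    actor = CounterActor()

    def send_many():
        for _ in range(1000):
            actor.send(("add", 1))

    # Four threads send concurrently; none of them needs a lock.
    senders = [threading.Thread(target=send_many) for _ in range(4)]
    for sender in senders:
        sender.start()
    for sender in senders:
        sender.join()

    reply_box = queue.Queue()
    actor.send(("get", reply_box))
    total = reply_box.get(timeout=5)
    actor.send(("stop",))
    actor.join()
    return total
```

Calling `demo_counter()` has four threads sending to the actor at once, yet the count comes out right without a single lock, because only the actor’s own thread ever modifies it. In Erlang the isolation is enforced by the runtime; in this sketch it holds only by convention, which is a small taste of the impedance mismatch mentioned above.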

Erlang processes, in addition to being isolated from one another, are very lightweight. One can easily run a quarter million of them in a single Erlang VM. The cost of spawning them is also quite low, making silly hacks to reuse them unnecessary. Since they run in the same memory space and are isolated only by software, message passing between them is extremely fast. It does involve copying the data in the message most of the time, but messages tend to be small, and the copying is what allows Erlang processes to be garbage collected independently.

OTP arranges processes into a “supervision tree” where supervisor processes monitor child processes, which can also be supervisors, and restart them and potentially any dependent processes should they crash. Along with process isolation, this makes Erlang applications extremely fault-tolerant: any Erlang process within our cluster management subsystem (other than the root supervisor) can crash without taking Membase down. In fact, we make very good use of this: when unexpected things happen in any of our processes, we generally let them just crash and restart, bringing them back up in a known good state. For example, the process that manages the administrative connection to the local memcached process will simply crash if the connection times out or returns any error it doesn’t expect. This makes error handling exactly the same as startup, which leaves fewer places for bugs to hide and eases testing.
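The “let it crash” idea can be sketched in a few lines. This is again a Python analogy and not OTP’s supervisor: `supervise`, `FlakyWorker`, and the `max_restarts` limit are hypothetical names invented for this example, and a real supervisor watches long-running processes rather than calling a function in a loop.

```python
def supervise(start_worker, max_restarts=3):
    """Run start_worker(); if it raises, start it again from scratch.

    Returns the worker's result and the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            return start_worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                # Give up and let the failure propagate to whoever supervises us.
                raise
            restarts += 1


class FlakyWorker:
    """Hypothetical worker whose first `failures` starts hit an unexpected error."""

    def __init__(self, failures):
        self.failures_left = failures
        self.starts = 0

    def __call__(self):
        self.starts += 1
        # Every start builds fresh state; nothing survives from a crashed run.
        state = {"connected": True, "pending": []}
        if self.failures_left > 0:
            self.failures_left -= 1
            # No recovery code here: the worker simply crashes.
            raise ConnectionError("unexpected reply")
        return state
```

With `worker = FlakyWorker(failures=2)`, `supervise(worker)` starts the worker three times and returns its fresh state along with a restart count of 2. The worker contains no recovery logic at all; the only path back to a working state is the same startup code that runs on first boot.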

In order to make it easy to structure a complex application as a set of interacting processes, Erlang/OTP provides a standard set of “behaviors” that modules can implement. The most common of these is gen_server, the generic server behavior, which implements a basic server with callbacks for requests. Two others are gen_fsm, which implements a finite state machine, and gen_event, which is for implementing event handlers and event managers.
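The split between generic machinery and callbacks is easier to see in code. The following is a loose Python imitation of the shape of gen_server, not its actual API: `GenericServer`, `KeyValueCallbacks`, and `demo_key_value` are names I made up, the `init`/`handle_call`/`handle_cast` callbacks only borrow their names from OTP, and unlike the real thing this sketch runs `init` in the caller’s thread.

```python
import queue
import threading


class GenericServer:
    """Generic part: owns the mailbox and the loop, delegates all logic to callbacks."""

    def __init__(self, callbacks, *args):
        self._callbacks = callbacks
        self._mailbox = queue.Queue()
        self._state = callbacks.init(*args)
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def call(self, request, timeout=5):
        """Synchronous request: send it, then wait for the reply."""
        reply_box = queue.Queue(maxsize=1)
        self._mailbox.put(("call", request, reply_box))
        return reply_box.get(timeout=timeout)

    def cast(self, request):
        """Asynchronous request: send it and return immediately."""
        self._mailbox.put(("cast", request, None))

    def stop(self):
        self._mailbox.put(("stop", None, None))
        self._thread.join()

    def _loop(self):
        while True:
            kind, request, reply_box = self._mailbox.get()
            if kind == "call":
                reply, self._state = self._callbacks.handle_call(request, self._state)
                reply_box.put(reply)
            elif kind == "cast":
                self._state = self._callbacks.handle_cast(request, self._state)
            else:
                return


class KeyValueCallbacks:
    """Specific part: a toy key-value store written purely as callbacks."""

    @staticmethod
    def init():
        return {}

    @staticmethod
    def handle_call(request, state):
        op, key = request
        if op == "get":
            return state.get(key), state
        return ("error", "unknown_call"), state

    @staticmethod
    def handle_cast(request, state):
        op, key, value = request
        if op == "put":
            new_state = dict(state)
            new_state[key] = value
            return new_state
        return state


def demo_key_value():
    server = GenericServer(KeyValueCallbacks)
    server.cast(("put", "replicas", 2))
    found = server.call(("get", "replicas"))
    missing = server.call(("get", "no_such_key"))
    server.stop()
    return found, missing
```

The callback class never sees a thread, a queue, or a reply address; it just maps a request and the current state to a reply and the next state. That is the appeal of the behaviors: the concurrency plumbing is written once, and each module supplies only the part that is specific to it.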

Messages in Erlang can be sent as easily to a process residing on another node as they can to processes on the same node; this is completely transparent to the programmer. Erlang/OTP has a bunch of modules that help with distributed processing. For example, the “global” module provides a common registry of global names and locks, similar to Google’s Chubby lock manager. OTP’s “distributed applications” provide a way to ensure a given set of processes only starts on one node. Mnesia is a distributed DBMS with strong consistency guarantees that we use for storing statistics.

Erlang provides excellent support for debugging and patching of live systems. After all, it was invented for building systems with five nines of uptime. If you know the cryptographic cookie nodes use to authenticate themselves to one another, it’s trivial to attach another Erlang node to the cluster and execute arbitrary commands on any other node, including “reload this module”. This sort of capability is invaluable when you need to fix or work around a bug and can’t take down a cluster or even a single node to replace code. It’s also invaluable for testing or any situation where we (or a customer) want to make changes to the cluster state that haven’t yet been implemented in the REST API.

At the end of the day, the real question isn’t whether it would have been possible for us to implement our cluster management in another language; it’s really a question of effort and maintainability of the result. With any other environment, we would have had to reimplement at least part of what Erlang/OTP provides, while we haven’t really found ourselves reimplementing features provided by any other environment.
