March 15, 2010

Avoiding Death Spirals in Distributed Systems

What's a death spiral?

Single-node death spirals
A former employer of mine was contracted to build an email routing system for one of the world's largest companies. They had me do the design and spec out the hardware, and then another group set out to build it. The design had two dedicated DNS resolvers, five servers to handle inbound email and queue it for the spam and virus scanners, separate hosts for spam and virus scanning, and separate hosts for routing outgoing mail.

The week the system was put into production, I got a call: it was too slow, and mail was backing up and connections were being refused. I started asking questions and found out what had actually been implemented had been changed "slightly" from my original design to save some money: they scrapped the dedicated DNS resolvers, instead running BIND on each host, and they were running the virus scanners on the hosts that were handling the incoming email. Of course, they kept the same amount of RAM, and BIND uses lots of RAM for itself, to say nothing of the virus scanner.

The first thing the systems administrator did when the problem started was increase the number of inbound SMTP connections the servers would accept simultaneously. This made the problem significantly worse. When I told him he needed to decrease it to well below what it had been, something in the neighborhood of ten, he argued with me: of course increasing the number of simultaneous deliveries should make it faster, and reducing it would make it slower! My response was simple: you can't be in swap. The system was using so much RAM it was swapping to disk, bringing the whole thing to a crawl.

Reducing the number of simultaneous SMTP connections significantly sped up email delivery and nearly eliminated the problems. They ended up adding more servers: pretty much the same number they thought they would save by sharing the hosts among multiple services. 

The concurrency at which a service operates, or the number of requests it is servicing at the same time, is related to the rate of incoming requests and the amount of time it takes to service an individual request as follows:

R = C / T

where C is concurrency, R is the request rate in requests per second, and T is the amount of time it takes the system to service a request, from start to finish, in seconds. It might seem obvious from looking at this equation that the way to increase the number of requests a system can service is to increase the maximum concurrency, and indeed this is what a lot of system administrators try to do: they see Apache complaining about hitting MaxClients (or whatever), so they increase MaxClients. And their request rate plummets.

Problem is, T is far from being a fixed quantity. It's dependent on all sorts of factors: I/O if the request hits the disk, available memory, available network bandwidth, the speed of the client, latency and service time of requests this request spawns, and many more. As C increases and one of the available resources starts to become fully utilized or overloaded, T will go up faster than C and the request rate will drop. 
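To put toy numbers on that (the figures here are invented, purely to illustrate):

```python
def request_rate(concurrency, service_time):
    """R = C / T: requests per second at a given concurrency."""
    return concurrency / service_time

# 10 requests in flight at 0.1 s each: 100 req/s.
print(request_rate(10, 0.1))    # 100.0
# Raise concurrency tenfold, but suppose the host starts swapping and
# T balloons from 0.1 s to 5 s: throughput *drops* to 20 req/s.
print(request_rate(100, 5.0))   # 20.0
```

Increasing C only helps while T stays flat; once a resource saturates, T grows faster than C and the rate falls.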

If there is no limit on concurrency at all, many services exhibit what I call a "cliff effect": the rate at which requests are serviced falls off dramatically, and the time to service a single request goes up by one or two orders of magnitude. This can be caused by hosts starting to swap, scheduling algorithms that don't scale well, or any number of subtle effects that aren't really that important to understand. 

The most important thing to remember here is that less concurrency is frequently better, and that it always needs to be limited. Really, the best way to figure out what your maximum concurrency should be is to measure it. Benchmark the system at various concurrency levels and pick the level that gives you the maximum rate while still serving individual requests in a reasonable amount of time when the concurrency limit is reached.
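One way to do that measurement, sketched here in Python against a fake workload (swap fake_request for a call to your real service; the numbers are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    # Stand-in for a real request; replace with an actual client call.
    time.sleep(0.01)

def measure(concurrency, total_requests=100):
    """Requests per second completed at a given concurrency level."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(total_requests):
            pool.submit(fake_request)
        # leaving the 'with' block waits for all requests to finish
    return total_requests / (time.time() - start)

if __name__ == "__main__":
    for c in (1, 5, 20):
        print(f"concurrency {c:2d}: {measure(c):7.1f} req/s")
```

Run it at increasing levels and watch for the point where the rate stops climbing, or falls; that, not the highest number your config file accepts, is your limit.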

Distributed death spirals
Another former employer of mine embarked on an ambitious project to replace database accesses with web services. They hired a bunch of PHP developers to go through each piece of SQL in the application code, write a web service that did the same thing, and replace the application code with a call to the web service. The project was a disaster.

There were several places where a single request resulted in requests to several databases that made sense to write as a single web service. Because both the individual database queries and the compound queries needed to be called, both were exposed as web services. Several of the web services called other web services hosted on the same cluster. The cluster ultimately consisted of about a hundred hosts with several different URL namespaces on potentially overlapping groups of hosts, with several load balancers in front of them running Pound.

Several of the web services ended up being much slower than the existing SQL and C++ code. With about 5,000 application servers and only about 100 web services hosts, this was a problem. However, more interesting and educational was what happened even when just a small number of application servers were pointed at the web services. Load on the web services hosts would vary wildly, with individual hosts hitting their maximum number of connections, resulting in the load balancer's taking them out of the rotation. Load would drop on the host again and it would get added back in. This would progress until most of the cluster was overloaded and the application servers were having problems. Restarting individual nodes didn't help.

Ultimately it was discovered that the easiest and fastest way to bring load on the cluster under control was to stop Apache on all the nodes and then start it back up. The reason this worked, while restarting individual nodes (and even, ultimately, all the nodes one at a time) did not, was that individual web service requests were spawning other requests to other nodes in the same cluster. If any individual node became slow, every other node would eventually have all its clients waiting on requests to that node, until the entire cluster was waiting on answers from other nodes in the cluster.

I call this and related conditions in distributed systems a distributed death spiral: distributed because it's more complex and chaotic than the single-node case of simply becoming overloaded and not being able to keep up with incoming requests.

Avoiding death spirals

Limit concurrency close to the client
One of the problems that came up repeatedly at this employer was the following situation. The concurrency of a given service would be limited within the server itself, for example using MaxClients in Apache or max_connections in MySQL. Bursty load would cause the server to hit its concurrency limit, and some service closer to the client (the application server, the web frontends, the load balancers, or Squid) would return errors that the user would see. Squid, for example, will mark a "parent" server as down if it doesn't respond for two minutes, returning HTTP 500 errors. The load balancers would return 503s. This frequently resulted in developers blaming the service returning the error for the problem, to which my response was "The thing returning the error is the thing that's working."

I eventually convinced said employer to replace Pound with Haproxy in most places. Haproxy allows you to configure the maximum number of connections to send to any given backend, after which it will queue incoming connections; it even lets you prioritize incoming connections based on which URL they're trying to reach. Along with its statistics reporting, this made Haproxy invaluable when deploying new web services: you could simply watch how many connections were being shoved into the queue, and how often, to see whether the service was overloading. Most of the application servers did exactly the right thing when their connections got queued: they just made the user wait a little bit, slowing the user down when the system was overloaded rather than giving them an error that would anger them and cause them to hit reload or retry their operation over and over, making the problem worse.
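For illustration, a minimal Haproxy backend along these lines might look like the following (hypothetical names and addresses; the maxconn value is something you'd pick by benchmarking, as described above):

```
backend websvc
    balance roundrobin
    # Cap each server at 20 in-flight connections; excess connections
    # wait in Haproxy's queue instead of piling onto the server.
    server web1 10.0.0.11:80 maxconn 20
    server web2 10.0.0.12:80 maxconn 20
```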

And speaking of queuing connections:

Use job queues
Dustin evangelizes this a lot. Web developers like to use stateless servers, so it's traditional to try to finish everything related to a given client request before returning a result to the client. This means that all the resources related to the request, for example memory allocated by the Apache process during processing, or locks held in the database, will be held for the duration of the request.

Instead, if you don't really need to do everything up front, it's better to simply accept the information from the client, stick it in a queue, and return a success message indicating you've taken responsibility for it. This is how email works, for example: SMTP traditionally returns a success code as soon as the message is spooled to disk, not after it's been delivered to the user's mailbox. One of the advantages of doing things this way is that queuing takes advantage of the fact that streaming writes are fast, whereas updating a database generally requires random seeks and is thus much slower.
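A minimal sketch of the pattern, using Python's in-process queue (a real system would use a durable on-disk or networked queue, and process() here is a hypothetical stand-in for the slow work):

```python
import queue
import threading

jobs = queue.Queue()
results = []

def process(payload):
    # Stand-in for the slow work (DB updates, delivery, and so on).
    results.append(payload.upper())

def handle_request(payload):
    # Accept the work, enqueue it, and acknowledge at once --
    # like SMTP returning 250 as soon as the message hits the spool.
    jobs.put(payload)
    return "accepted"

def worker():
    while True:
        process(jobs.get())
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
print(handle_request("hello"))  # prints "accepted" immediately
jobs.join()                     # only so the demo waits; a server wouldn't
```

The client is released as soon as the payload is enqueued; the expensive processing happens off the request path, so bursts fill the queue instead of exhausting Apache processes and database locks.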

Avoid loops in your call graph
The web services I described above would have had much more stable load had requests not spawned requests to other nodes in the same cluster. In systems that have loops like that, a single slow (or even slightly slower) node can bring the whole system to its knees, turning your fault-tolerant distributed service into one whose reliability goes down as nodes are added.

Incidentally, the company also had a very expensive clustered storage solution that suffered from the exact same sorts of symptoms: a single slow node would slow down the entire cluster and would frequently cause the whole thing to lock up, requiring all the nodes to be rebooted. So this isn't just a problem of home-grown solutions.

Mark servers dead, or limit outbound concurrency per destination server
This is related to the previous point, because it can mitigate or even eliminate propagation of failures even if you end up having loops in your call graph anyway. 

Another problem this same company had was that each web service generally depended on multiple MySQL servers. Individual requests would generally only require a single MySQL server, but if any of the MySQL servers got slow or disappeared off the network, all of the web services hosts would eventually stop responding because, as in the case of calls to hosts in the same cluster, all outbound connections on any web services host would eventually be used up talking to the slow or dead host. They had a manual blacklisting mechanism, but a stateful way for a host to record that a database had timed out several times, or a limit on the number of simultaneous requests to any given database host, would have prevented this problem from happening.


It's easy to build a distributed service that works perfectly under ideal conditions but will fail when subjected to real-world conditions. Death spirals are one of the most common of these failures, and their most common causes can be avoided by following some simple design guidelines.

I hope you find this information helpful in building your own distributed services! Please feel free to send me your own death-spiral stories, and definitely let me know if I've saved you from one.