Paying Attention to what Time it is
So what's new in 2.6? There are a number of performance, transcoder and timeout handling enhancements, but the big change is in timeout handling.
There had been quite a few comments and postings on spymemcached issue number 136. From the bug reports in spy 136, various people are running apps and seeing a RuntimeException from the MemcachedClient get(). This is a very straightforward method call for the consumer and thus is very common, but there is some complex work going on underneath to optimize how spymemcached is used. Allow me to explain a bit further.
Technically speaking, nothing in spymemcached is done in a synchronous manner. When you make a get() method call that appears to be synchronous, we use the Java concurrency APIs to make your seemingly synchronous get by running the operation through the same asynchronous write queues for a given connection by using a countdown latch with a default timeout to let the operation flow through the system, be serviced, and responded to very quickly. Realistically, most of these operations are done in a matter of a few milliseconds to a microsecond.
Having said that, there are still a few situations that can come up when running
a) as a userspace application
b) on a busy system
c) on a busy or wibbily network
d) on a JVM, owing to garbage collection or JIT activities
Many of these can be controlled to a very large degree with some effort, but they cannot be entirely eliminated. That means code should always be ready to receive occasional timeouts. Equivalently, spymemcached should aim to recover as quickly as possible from the timeout condition.
Having said that, when I was investigating spy issue 136, a few things came up.
Mike Wiederhold had noted that he would sometimes see a TCP retransmit. Looking further into things, he found the default time before a retransmit is 1 second. Well, with our default timeout on the latch at 1 second, this meant any TCP retransmit would cause a timeout. With cloud and virtualized deployments these days, some of which are known to be on oversubscribed networks, it's likely to come up occasionally. The simple fix there is to increase the default timeout. This is not something I did lightly, but I think it's the right thing to do.
Some of our load testing with Membase hit this. From memcachetest (a toy stress tester, not a real client) I know Membase easily handles tens of thousands to hundreds of thousands of operations per second, depending on the systems in use.
I began to look about for things that would impede recovery, given that many of the bug reports were pointing to a failure to recover. It turns out that the mechanism which implements the synchronous timeout doesn't do anything to prevent that item from being written to the MemcachedNode object it's designated to.
In other words, our thread dealing with the node wasn't paying enough attention to the time for a given operation.
Thinking about it in terms of just one operation in one circumstance, if the buffer for sending data to the network is full because it's really busy or there's a momentary disruption, there is nothing stopping other threads from heaping more operations on that MemcachedNode. Those operations will timeout on the calling thread, but they'll still be sent over the network. This will extend the effect of that momentary disruption until that backlog of stuff that won't be used is worked through. With a really saturated server, it may not ever recover.
With the code changes in the 2.6 release candidate, we now guard against even sending an Operation which has timed out. I've had good feedback already and it's helped to clean things up for Membase testing, but with projects as widely distributed as spymemcached, memcached and Membase, it's good to get some real world feedback. If you're using existing spymemcached, please give it a shot and let me know.