March 31, 2010

Understanding and characterizing the performance of Spymemcached: The importance of picking meaningful data points

Guest post: Eric Lambert

The Spymemcached Java client is the leading client for Java applications that need to make use of Memcached servers. Spy has become popular in the Java community because it is highly optimized and delivers outstanding performance, yet does so without burdening the user with undue complexity.

For the most typical cases, users simply need to use the following 'template':

```java
List<InetSocketAddress> servers = new ArrayList<InetSocketAddress>();
servers.add(new InetSocketAddress("192.168.0.9", 11211));
MemcachedClient client = new MemcachedClient(new BinaryConnectionFactory(), servers);
```

At which point they are ready to start blasting operations at their cluster of Memcached servers. While Spy provides a very simple and easy-to-use interface, it does not do so at the expense of the flexibility needed to tune the system. Underneath the covers, Spy provides you with many ways to tweak the system. In this blog, I delve into some of the knobs and switches that Spy provides. My hope is that by doing so I will help you better understand how Spy behaves and how you can tune it to best suit your needs when the out-of-the-box behavior is not sufficient.

But before I get into the nitty-gritty, I want to start off with a very simple, straightforward, and somewhat naive approach and use it to illustrate a rather significant point. Let's say that I am evaluating Spymemcached for the first time and I want to understand how my system performs when doing a series of sets, gets, and deletes. To do so, I can create a simple program that first sets a series of entries into the cache, then retrieves each of these entries, and finally deletes each of them. I can instrument this code so that it tracks the total time it takes to do all the sets, gets, and deletes.
The code for this could look something like the following (nanosPerSec is the number of nanoseconds in a second, declared as a double so the totals print in fractional seconds):

```java
// perform sets
long start = System.nanoTime();
for (int i = 0; i < numberOfOps; i++) {
    client.set(KEY_PREFIX + i, 3600, PAY_LOAD).get(); // block until the set completes
}
long end = System.nanoTime();
long setTime = end - start;

// perform gets
start = System.nanoTime();
for (int i = 0; i < numberOfOps; i++) {
    client.get(KEY_PREFIX + i);
}
end = System.nanoTime();
long getTime = end - start;

// perform deletes
start = System.nanoTime();
for (int i = 0; i < numberOfOps; i++) {
    client.delete(KEY_PREFIX + i).get(); // block until the delete completes
}
end = System.nanoTime();
long delTime = end - start;

System.out.println(numberOfOps + " Set: " + setTime / nanosPerSec + " Average: " + setTime / numberOfOps);
System.out.println(numberOfOps + " Get: " + getTime / nanosPerSec + " Average: " + getTime / numberOfOps);
System.out.println(numberOfOps + " Delete: " + delTime / nanosPerSec + " Average: " + delTime / numberOfOps);
```

Now, as I mentioned earlier, this is a rather naive and unrealistic use case, but nonetheless this crude example has an important lesson to teach us. Let's see what happens when we run this test using different values for numberOfOps. If we set numberOfOps to 10,000, our performance might look something like the following:

10000 Set: 3.008377 Average: 300837
10000 Get: 1.730886 Average: 173088
10000 Delete: 1.172679 Average: 117267

Now let's up the ante a bit and see what happens when we increase numberOfOps by a factor of 10 and set it to 100,000:

100000 Set: 10.710224 Average: 107102
100000 Get: 9.992544 Average: 99925
100000 Delete: 9.876984 Average: 98769

What is interesting about this is that even though we increased the workload that the test needed to do, the average time it took for each operation to complete actually got significantly smaller: roughly a 64% decrease in latency for sets, about a 42% decrease for gets, and about a 16% decrease for deletes.
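Those percentage figures follow directly from the per-operation averages printed above. As a quick sanity check on the arithmetic (the class and method names here are mine, for illustration only):

```java
public class LatencyDrop {
    // Percentage decrease from an old per-op latency to a new one.
    static double decreasePercent(long oldNanos, long newNanos) {
        return 100.0 * (oldNanos - newNanos) / oldNanos;
    }

    public static void main(String[] args) {
        // Per-op averages (ns) printed by the 10,000-op and 100,000-op runs above.
        System.out.printf("set:    %.1f%%%n", decreasePercent(300837, 107102)); // ~64.4%
        System.out.printf("get:    %.1f%%%n", decreasePercent(173088, 99925));  // ~42.3%
        System.out.printf("delete: %.1f%%%n", decreasePercent(117267, 98769));  // ~15.8%
    }
}
```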
Now let's see what happens when we throw 1,000,000 operations at the test:

1000000 Set: 106.15172 Average: 106151
1000000 Get: 101.086393 Average: 101086
1000000 Delete: 99.747647 Average: 99747

As we can see, the performance at 1,000,000 operations looks much more like the performance at 100,000 operations than the 100,000-operation results look like those at 10,000 operations. Let's set aside for the moment the cause of this discrepancy and focus on the bigger lesson: when we are trying to understand the behavior and performance of a system, the attributes and data points we use when testing it are paramount. We should either derive these values empirically whenever possible or base them on an explicitly stated constraint or assumption. If I had really been using the test above to understand the behavior of my Spymemcached-based system, running only the 10,000-operation scenario would not have revealed the system's peak performance. Had I run only the 100,000- or 1,000,000-operation scenarios, I would have missed the fact that the system appears to require some warm-up time. But by doing some exploration to find the interesting values for this simple and naive test, we have learned some significant lessons about our system.

I bring this up because too often I see discussions of a system's performance and behavior that use what appear to be arbitrary values and assumptions. When I see these, it makes me wonder why the author has chosen a particular value. If it is a significant value, then its significance should be explained. If it is an arbitrary value, then we should be wary of the results and conclusions put forth. (Note, for the record: the values reported above were achieved using Spymemcached 2.5 running against memcached 1.4.4, with the client and server running on the same host.)
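One practical takeaway from the warm-up effect: if you want steady-state numbers rather than an average skewed by the first, slower iterations, discard the initial samples before averaging. A minimal sketch of that idea (the sample values below are illustrative, not measurements from the runs above):

```java
import java.util.Arrays;

public class SteadyState {
    // Average of the samples after skipping the first `warmup` entries.
    static long steadyStateAverage(long[] samplesNanos, int warmup) {
        return (long) Arrays.stream(samplesNanos).skip(warmup).average().orElse(0);
    }

    public static void main(String[] args) {
        // Hypothetical per-op latencies (ns); the first sample reflects warm-up cost.
        long[] samples = {300_837, 110_000, 105_000, 100_000};
        System.out.println("naive average:        "
                + (long) Arrays.stream(samples).average().orElse(0)); // skewed upward
        System.out.println("steady-state average: "
                + steadyStateAverage(samples, 1));                    // warm-up discarded
    }
}
```

The same idea applies to the test above: run a few thousand throwaway operations first, then start the clock.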
