Stabilizing Couchbase Server 2.0

[This blog was syndicated from http://damienkatz.net/]

I’m glad to report we are now pretty much going into full-on stabilization and resource optimization mode for Couchbase Server 2.0. It’s taken us a lot longer than we planned. Creating a high performance, efficient, reliable, full-featured distributed document database is a non-trivial matter ;)

In addition to the same “simple, fast, elastic” memcached and clustering technology we have in previous versions of Couchbase, we’ve added 3 big new features to dramatically extend it’s capabilities and use cases, as well as its performance and reliability.

Couchstore: High Throughput, Recovery Oriented Storage

One of the biggest obstacles for 2.0 was the Erlang-based storage engine was too resource heavy compared to our 1.8.x release, which uses SQLite. We did a ton of optimization work and modifications, stripping out everything we could to make it as a fast and efficient as possible, and in the process making our Erlang-based storage code several times faster than when we started, but the CPU and resource usage was still too high, and without lots of CPU cores, we couldn’t get total system performance where our existing customers needed it.

In the end, the answer was to rewrite the core storage engine and compactor in C, using a format bit for bit compatible with our Erlang storage engine, so that updates written in one process could be read, indexed, replicated, and even compacted from Erlang. It’s the same basic tail-append, recovery oriented MVCC design, so it’s simple to write to it from one OS process and read it from another process. The storage format is immune to corruption caused by server crashes, OOM killers or even power loss.

Rewriting it in C let us break through many optimization barriers. We are easily getting 2x the write throughput over the optimized Erlang engine and SQLite engines, with less CPU and a fraction of the memory overhead.

Not all of this is due to C being faster than Erlang. A good chunk of the performance boost is just being able to embed the persistence engine in-process. That alone cut out a lot of CPU and overhead by avoiding transmitting data across processes and converting to Erlang in-memory structures. But also it’s C, which provides good low level control and we can optimize much more easily. The cost is more engineering effort and low-level code, but the performance gains have proven very much worth it.

And so now we’ve got the same optimistically updating, MVCC capable, recovery oriented, fragmentation resistant storage engines both in Erlang and C. Reads don’t block writes and writes don’t block reads. Writes also happen concurrently with compaction. Getting all or incremental changes via MVCC snapshotting and the by_sequence index makes our disk io mostly linear for fast warmup, indexing, and cluster rebalances. It allows asynchronous indexing, and it also powers XDCR.

B-Superstar: Cluster Aware Incremental Map/Reduce

Another big item was bringing all the important features of CouchDB incremental map/reduce views to Couchbase, and combining it with clustering while maintaining consistency during rebalance and failover.

We started using an index per virtual partition (vbucket), merging across all indexes results at query time, but quickly scrapped that design as it simply wouldn’t bring us the performance or scalability we needed. We needed a system to support MVCC range scans, with fast multi-level key based reductions (_sum, _count, _stats, and user defined reductions), and require the fewest index reads possible.

What we came up with uses the proven CouchDB-based view model, same javascript incremental map/reduce, same pre-indexed, memoized reductions stored in inner btree nodes for low cost range queries, yet can instantly exclude invalid partitions results when partitions are rebalanced off a node, or are partially indexed on a new node.

We embed a bitmap partition index in each btree node that is the recursive OR of all child reductions. Due to the tail append index updates, it’s a linear write to update modified leaf nodes through to root while updating all the bitmaps. Now we can tell instantly which subtrees have values emitted from a particular vbucket.