February 11, 2014

libcouchbase 2.3-DP2: Enhanced Configuration Updates

The upcoming libcouchbase version will feature enhanced cluster configuration updates along with a number of other stability and performance improvements. Most of the new work revolves around Cluster Configuration Carrier Publication, or CCCP for short.

Couchbase is a scalable, elastic cluster. Part of this featureset means that cluster nodes may be freely added or removed without causing downtime at the application level. Swapping out all nodes from a cluster and replacing them is fully supported and something normally transparent to an application server using the SDK.

The SDK itself must ensure that it is talking to a healthy cluster and is aware of the nodes which are cluster members. Specifically, this means knowing two things:

  1. Which nodes are part of the cluster
  2. Which node is responsible for a given key

This information is transferred via a JSON object called the "Configuration Map" or "Cluster Map". You can see it yourself by requesting a URL such as:

curl localhost:8091/pools/default/buckets/default

This JSON object contains the list of nodes as well as a large "vBucket Map", which tells the client which node is responsible for which vBucket. The client hashes each key to a vBucket and then looks that vBucket up in this map.
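To make the lookup concrete, here is a minimal Python sketch of how a client might go from a key to a node. It assumes the CRC32-based hash commonly described for Couchbase clients and the "vBucketServerMap", "serverList" and "vBucketMap" field names from the bucket configuration JSON; the toy map and its addresses are made up for illustration, and this is not libcouchbase's actual code.

```python
import zlib

def vbucket_for_key(key: bytes, num_vbuckets: int) -> int:
    """Hash a key to a vBucket ID (CRC32-based formula, assumed here)."""
    return ((zlib.crc32(key) >> 16) & 0x7FFF) % num_vbuckets

def master_for_key(key: bytes, config: dict) -> str:
    """Return the 'host:port' of the node that is master for the key."""
    vbmap = config["vBucketServerMap"]
    vbid = vbucket_for_key(key, len(vbmap["vBucketMap"]))
    master_index = vbmap["vBucketMap"][vbid][0]  # first entry is the master
    return vbmap["serverList"][master_index]

# Toy cluster map: 2 nodes and 4 vBuckets (hypothetical values).
toy_config = {
    "vBucketServerMap": {
        "serverList": ["10.0.0.1:11210", "10.0.0.2:11210"],
        "vBucketMap": [[0, 1], [1, 0], [0, 1], [1, 0]],
    }
}

print(vbucket_for_key(b"foo", 1024))      # vBucket for "foo" with 1024 vBuckets
print(master_for_key(b"foo", toy_config)) # node owning "foo" in the toy map
```

Because the map is just an array indexed by vBucket ID, a topology change only requires swapping in a new map; the hash of a key never changes.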

Bootstrapping

To see how the SDK (specifically, libcouchbase) pieces this together, we can examine the strace output from cbc:

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
connect(6, {sa_family=AF_INET, sin_port=htons(8091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
connect(6, {sa_family=AF_INET, sin_port=htons(8091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"GET /pools/default/bucketsStream"..., 145}], msg_controllen=0, msg_flags=0}, 0) = 145
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"HTTP/1.1 200 OK\r\nTransfer-Encodi"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 200
recvmsg(6, 0x7fff7b7e9bb0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"2223\r\n{\"name\":\"protected\",\"bucke"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 8756
recvmsg(6, 0x7fff7b7e9bb0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
connect(7, {sa_family=AF_INET, sin_port=htons(11210), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
connect(7, {sa_family=AF_INET, sin_port=htons(11210), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(7, {sa_family=AF_INET, sin_port=htons(34432), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
getpeername(7, {sa_family=AF_INET, sin_port=htons(11210), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\200 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=0, msg_flags=0}, 0) = 24
recvmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\201 \0\0\0\0\0\0\0\0\0\16\0\0\0\0\0\0\0\0\0\0\0\0CRAM-MD5"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 38
recvmsg(7, 0x7fff7b7e9b70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\200!\0\10\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0CRAM-MD5", 32}], msg_controllen=0, msg_flags=0}, 0) = 32
recvmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\201!\0\0\0\0\0!\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0<9003993"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 54
recvmsg(7, 0x7fff7b7e9b70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\200\"\0\10\0\0\0\0\0\0\0002\0\0\0\0\0\0\0\0\0\0\0\0CRAM-MD5"..., 74}], msg_controllen=0, msg_flags=0}, 0) = 74
recvmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\201\"\0\0\0\0\0\0\0\0\0\r\0\0\0\0\0\0\0\0\0\0\0\0Authenti"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 37
recvmsg(7, 0x7fff7b7e9b70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\3\0\0\0s\0\0\0\3\1\0\0\0\0\0\0\0\0\0\0\0foo", 27}], msg_controllen=0, msg_flags=0}, 0) = 27
recvmsg(7, {msg_name(0)=NULL, msg_iov(1)=[{"\201\0\0\0\4\0\0\0\0\0\0\21\1\0\0\0\0\1~$@\327-\17\0\0\0\0Hell"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 41
recvmsg(7, 0x7fff7b7e9b70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
"foo" Size:13 Flags:0 CAS:f2dd740247e0100
Hello World!

Above you can see that the library first makes an HTTP request to port 8091 to retrieve the cluster map. Once the cluster map has been retrieved, it connects on port 11210 to the node which is the vBucket master for the key "foo", performs SASL authentication, and finally issues a memcached request for the key.
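The final sendmsg in the trace can be read field by field. The following Python sketch decodes those 27 bytes using the 24-byte header layout of the memcached binary protocol; the bytes are copied from the trace, while the helper name and the dictionary keys are only illustrative.

```python
import struct

# Memcached binary protocol request header: 24 bytes, big-endian.
# magic, opcode, key length, extras length, data type, vBucket,
# total body length, opaque, CAS
HEADER = struct.Struct(">BBHBBHIIQ")

# The 27-byte GET request for "foo", as shown in the trace.
packet = (
    b"\x80\x00\x00\x03\x00\x00\x00\x73"
    b"\x00\x00\x00\x03\x01\x00\x00\x00"
    + b"\x00" * 8
    + b"foo"
)

def decode_request(data: bytes) -> dict:
    """Split a request into its header fields and key."""
    (magic, opcode, keylen, extlen, datatype,
     vbucket, bodylen, opaque, cas) = HEADER.unpack_from(data)
    key_start = HEADER.size + extlen
    return {
        "magic": magic, "opcode": opcode, "vbucket": vbucket,
        "bodylen": bodylen, "opaque": opaque, "cas": cas,
        "key": data[key_start:key_start + keylen],
    }

request = decode_request(packet)
print(request)
```

The magic byte 0x80 marks a request, opcode 0x00 is GET, and the vBucket field (0x0073, i.e. 115) is the vBucket the client computed for "foo" before choosing which node to contact.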

With CCCP, the memcached port at 11210 can itself also serve the configuration information. If we look at a similar trace from the upcoming DP version of the cbc tool, we'll see this:

connect(6, {sa_family=AF_INET, sin_port=htons(11210), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
connect(6, {sa_family=AF_INET, sin_port=htons(11210), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(6, {sa_family=AF_INET, sin_port=htons(34412), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
getpeername(6, {sa_family=AF_INET, sin_port=htons(11210), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\200 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=0, msg_flags=0}, 0) = 24
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\201 \0\0\0\0\0\0\0\0\0\16\0\0\0\0\0\0\0\0\0\0\0\0CRAM-MD5"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 38
recvmsg(6, 0x7fff8206e770, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\200!\0\10\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0CRAM-MD5", 32}], msg_controllen=0, msg_flags=0}, 0) = 32
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\201!\0\0\0\0\0!\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0<9014490"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 54
recvmsg(6, 0x7fff8206e770, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\200\"\0\10\0\0\0\0\0\0\0002\0\0\0\0\0\0\0\0\0\0\0\0CRAM-MD5"..., 74}], msg_controllen=0, msg_flags=0}, 0) = 74
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\201\"\0\0\0\0\0\0\0\0\0\r\0\0\0\0\0\0\0\0\0\0\0\0Authenti"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 37
recvmsg(6, 0x7fff8206e770, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\200\265\0\0\0\0\0\0\0\0\0\0\r\360\0\0\0\0\0\0\0\0\0\0", 24}], msg_controllen=0, msg_flags=0}, 0) = 24
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\201\265\0\0\0\0\0\0\0\0\36{\r\360\0\0\0\0\0\0\0\0\0\0{\"rev\":1"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 7827
recvmsg(6, 0x7fff8206e770, 0)           = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\3\0\0\0s\0\0\0\3\1\0\0\0\0\0\0\0\0\0\0\0foo", 27}], msg_controllen=0, msg_flags=0}, 0) = 27
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\201\0\0\0\4\0\0\0\0\0\0\21\1\0\0\0\0\1~$@\327-\17\0\0\0\0Hell"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 41
recvmsg(6, 0x7fff8206e730, 0)           = -1 EAGAIN (Resource temporarily unavailable)
"foo" Size:13 Flags:0 CAS:f2dd740247e0100
Hello World!

Here we eliminate an extra TCP connection by fetching the configuration directly from the data port.

Even against a localhost cluster, the second example runs about twice as fast as the first, owing to the elimination of the initial HTTP connection.
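The configuration request in the second trace is just a bare 24-byte header with opcode 0xb5 (\265 in strace's octal notation). The Python sketch below rebuilds that request and reads the body length out of the response header from the trace. The constant name GET_CLUSTER_CONFIG and the function names are labels chosen for this illustration, not libcouchbase API.

```python
import struct

HEADER = struct.Struct(">BBHBBHIIQ")  # 24-byte binary protocol header
GET_CLUSTER_CONFIG = 0xB5             # opcode seen in the trace (label assumed)

def build_config_request(opaque: int) -> bytes:
    """No key, no extras, no body: the header is the whole request."""
    return HEADER.pack(0x80, GET_CLUSTER_CONFIG, 0, 0, 0, 0, 0, opaque, 0)

def response_size(header: bytes) -> int:
    """Total response size: header plus the advertised body length."""
    magic, opcode, _, _, _, status, bodylen, _, _ = HEADER.unpack(header)
    if magic != 0x81 or status != 0:
        raise ValueError("not a successful response")
    return HEADER.size + bodylen

# Response header bytes as shown in the trace.
trace_header = (
    b"\x81\xb5\x00\x00\x00\x00\x00\x00"
    b"\x00\x00\x1e\x7b\x0d\xf0\x00\x00"
    + b"\x00" * 8
)

print(build_config_request(0x0DF00000).hex())
print(response_size(trace_header))
```

The body length in the header (0x1e7b) plus the 24 header bytes accounts for the 7827 bytes returned by recvmsg in the trace, so the entire cluster map arrived in a single read on the already authenticated data connection.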

Configuration Updates

Configuration updates are new cluster maps received when the cluster topology changes. They let a client know that nodes have been added or removed and, consequently, that the vBucket map has changed.

The previous behavior for libcouchbase was to connect to the streaming REST API endpoint for the given bucket in order to receive configuration updates from it. This required the library to maintain a mostly idle TCP connection over which the cluster would push configuration information. This approach suffered from two primary disadvantages:

  1. It placed the library into a wait/long-poll state, in which the socket serving the configuration assumed that it was connected to a functioning node which would push configuration information to it. This assumption would of course fail if the node serving the streaming endpoint was itself the one that failed (and worse, it might never deliver a TCP RST). Since the semantics were push-based, the client could not set a timeout to ensure that the connection was still functioning properly.
  2. An extra TCP connection was required at all times in order to maintain up-to-date configuration information. As the REST API endpoints were designed for administration rather than routine usage, they are not optimized for memory use. Specifically, each TCP connection to them incurs a significant resource penalty on the server side.

Confmon - Configuration Manager

The new 2.3 release changes the model and approach to how configuration is retrieved. An internal set of APIs collectively known as confmon/clconfig was introduced into the codebase. Rather than an intrusive push-based model where the configuration is imposed upon an open socket, confmon is pull-based and is triggered only on an as-needed basis. Thus, by default the client will not maintain open sockets for configuration only; rather, it will assume that its configuration is valid until it reaches a certain error threshold or receives an explicit NOT_MY_VBUCKET error from one of the cluster nodes (indicating that the client's map is out of date).

With the new CCCP enhancements specifically, each NOT_MY_VBUCKET response itself already contains the updated cluster map, thereby eliminating the need to re-fetch the configuration in the first place.
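The pull-based behavior described above can be sketched roughly as follows. This is a toy model in Python, not the confmon code: the class name, the callback names, the default threshold and the rule of comparing the "rev" field are all assumptions made for illustration. The only protocol fact relied on is that NOT_MY_VBUCKET is status 0x07 in the memcached binary protocol.

```python
import json

NOT_MY_VBUCKET = 0x07  # memcached binary protocol status code

class ConfigMonitor:
    """Toy pull-based monitor; names and threshold are illustrative."""

    def __init__(self, config: dict, fetch_config, error_threshold: int = 10):
        self.config = config
        self.fetch_config = fetch_config  # callable returning a new map
        self.error_threshold = error_threshold
        self.errors = 0

    def on_error(self) -> None:
        # Generic failures (timeouts, resets) only count toward a threshold.
        self.errors += 1
        if self.errors >= self.error_threshold:
            self._apply(self.fetch_config())

    def on_response(self, status: int, body: bytes) -> None:
        if status != NOT_MY_VBUCKET:
            return
        if body:
            # CCCP: the error response carries the new map itself.
            self._apply(json.loads(body))
        else:
            # Older cluster: the map has to be fetched explicitly.
            self._apply(self.fetch_config())

    def _apply(self, new_config: dict) -> None:
        # Ignore maps that are not newer than the one already held.
        if new_config.get("rev", 0) > self.config.get("rev", 0):
            self.config = new_config
        self.errors = 0
```

With a CCCP-capable node, the NOT_MY_VBUCKET branch never touches the network, which is the saving this section describes; the fetch callable is only exercised by the error threshold or by older clusters.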

Benefits are also reaped with older clusters, as the new model opens REST API connections only on demand rather than keeping them open indefinitely. (In fact, we've made a bit of an optimization where we long-poll for a short amount of time, as configuration updates tend to happen in succession during topology changes like rebalances.)

Logging

Logging hooks have been added to the library. This model allows you to either enable the default console logger by setting the LCB_LOGLEVEL environment variable, or install your own logging hooks by implementing the lcb_logprocs interface and telling the instance about your logging hooks.

Logging has been added for notable but non-CPU-intensive events such as timeouts, socket connections, socket destruction, configuration updates, and more.

Note that there are more things we'd like to log and this is not the end of all the instrumentation and diagnostic aids we plan to add!

Connection Management

We've also beefed up the way we handle new connections to memcached "data" nodes. Previously connections would be scoped to the objects which created them. This meant that the lcb_server_t object itself would open and close a connection.

In 2.3, we've added a connmgr module which functions much like a socket pool (and will be used in the future to support socket pooling for things like view queries). Rather than having subsystems open and close connections from the I/O system directly, they now request and release (or discard) connections from and to the connmgr instance.

Now server structures will not unconditionally close their TCP connections, but will check whether there is any pending data on them. If there is pending data, the socket is discarded (i.e. the socket is freed and any pooled resources associated with it are freed as well) because we deem the socket to be in an invalid state (since further replies from the server will likely be additional NOT_MY_VBUCKET responses). If there is no pending data, the socket is released back into the pool, becoming available for a subsequent request for a new connection. In our tests this has resulted in up to a 6x decrease in the creation of new TCP connections during cluster topology changes.
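The request/release/discard cycle can be illustrated with a small Python sketch. It is not the connmgr implementation: the class, the method names and the connect factory are hypothetical, and a real pool would also need timeouts and limits on idle sockets.

```python
from collections import defaultdict, deque

class ConnectionPool:
    """Toy pool keyed by 'host:port'; `connect` is a caller-supplied factory."""

    def __init__(self, connect):
        self.connect = connect
        self.idle = defaultdict(deque)
        self.created = 0

    def request(self, endpoint: str):
        # Reuse an idle connection when one exists, otherwise open a new one.
        if self.idle[endpoint]:
            return self.idle[endpoint].popleft()
        self.created += 1
        return self.connect(endpoint)

    def release(self, endpoint: str, conn) -> None:
        # Clean connection: make it available to the next requester.
        self.idle[endpoint].append(conn)

    def discard(self, conn) -> None:
        # Connection with unread data: close it and never reuse it.
        conn.close()

def give_back(pool, endpoint, conn, has_pending_data: bool) -> None:
    """Decide between discard and release, as a server structure would."""
    if has_pending_data:
        pool.discard(conn)
    else:
        pool.release(endpoint, conn)
```

During a rebalance many server structures are torn down and recreated for the same endpoints, so every release that is later matched by a request is one TCP handshake (and one SASL exchange) avoided.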

Get The Code

You may download a source tarball of the release at:

http://packages.couchbase.com/clients/c/snapshots/libcouchbase-2.3.0_dp2.tar.gz
