This is the first of two blog posts getting into some of the nitty gritty details of Couchbase's rebalancing technology. This first one details the rebalance functionality itself, and the second discusses how to monitor and work with it.
A Practitioner's Guide to Rebalancing with Couchbase
Back in mid-2010, a little company called NorthScale introduced a product called Membase. Among many revolutionary features was the ability to dynamically grow (or shrink) a cluster of 'nodes' (instances of the software). By design, this is an online operation meaning it doesn't affect the availability of data stored within that cluster and doesn't materially impact the performance of an application accessing that data.
Fast forward to now (late 2011...okay, it was a quick fast forward). The company has since changed names (twice) to Couchbase, and has learned an immense amount about how the product works 'in the real world'. It is being deployed in drastically different environments (think your own data center versus Amazon's cloud) under drastically different applications (think logging versus social gaming versus ad targeting). The same 'rebalancing' technology is being employed not only within Membase, but in the upcoming Couchbase Server 2.0. I thought now might be a good time to write this rather lengthy article to answer many of the questions that surround this process as well as address some of the known issues. The current release is 220.127.116.11, and it has come a LONG way since the first beta release of 1.6.0 way back in that memorable summer of 2010.
Enough with the introduction, let's see how far down the rabbit hole we can go...
What is a rebalance?
Let's start at a high level. Rebalance is the process of redistributing data to added, or from removed, nodes within a Membase/Couchbase Server cluster.
Sounds pretty simple right? To the casual observer, it is. And we meant it to be. Just like most other technologies, it takes quite a bit of complexity to achieve this simplicity (total distraction: https://plus.google.com/112218872649456413744/posts/dfydM2Cnepe).
One level down now.
To begin understanding how Membase/Couchbase Server works, the main underlying concept is vbuckets (http://dustin.github.com/2010/06/29/memcached-vbuckets.html). A vbucket is a logical 'slice' of an overall dataset (in our case, a bucket...sorry for the naming, we didn't have a product marketing team yet when that one came up). There are a static number of these vbuckets and keys are hashed (using CRC32) into that static list. In this way, you are guaranteed that a given key will always hash to the same vbucket. Each vbucket is then assigned to given server within the cluster. With 256 vbuckets and one server they would all be on the same server. With 2 servers, they would be shared equally across the two with 128 buckets each, four servers would be assigned 64 each and so on.
A 'vbucket map' is simply the enumeration of this list of vbuckets and the servers that own them.
I'm purposefully ignoring replication at the moment...I'll get back to it, but it's not important at this point.
A rebalance is simply (there's that word again...) the process of moving a certain number of vbuckets from some server(s) to other server(s). The end goal is that each server within the cluster ends up with the same number of vbuckets. This ensures that the data is evenly distributed throughout the cluster, and therefore the application access to that data is also load-balanced evenly across all the nodes within the cluster.
You with me so far? (if not, email firstname.lastname@example.org).
And another level.
When a rebalance is initiated, a process called the 'orchestrator' looks at the current vbucket map, and combines it with any node addition/removals to calculate what the new vbucket map should look like. The orchestrator then begins initiating the movement of vbuckets from one node to another. It's important to note that the orchestrator simply 'kicks off the process' between two nodes but does not actually transfer the data itself...that's purposefully left up to the source and destination nodes to coordinate to avoid any bottlenecks or points of failure. From the perspective of the orchestrator, it doesn't really matter whether a node is being added or removed from the cluster. In fact, multiple nodes can be added and/or removed from the cluster in the same rebalance. All that's happening is a new vbucket map is being calculated and vbucket movements initiated to make that map a reality.
Each vbucket is moved individually and independently of the others (multiple can and are moved in parallel, but the point is that there is no relationship between unique vbuckets). The destination node starts up a process called 'ebucketmigrator' which opens a TAP (http://www.couchbase.org/wiki/display/membase/TAP+Protocol) connection to a vbucket on the source node. This connection has with specific flags signaling that it a) wants all the data contained within and b) it plans on 'taking over' that vbucket when all is done.
This last part tells the source node to initiate the switchover process as soon as all the data is sent. (history lesson: ebucketmigrator used to be called vbucketmigrator but we moved it into the Erlang VM...hence the 'e')
While each vbucket is being moved, client access (reads and writes) are still going to the original location. Once the data is all copied over, an atomic switchover happens where the original location says 'I'm no longer the master for this vbucket' and sends a 'token' over to the newly created vbucket saying 'you are'. The original vbucket transitions from active to dead, and the new one transitions from pending to active. Smart clients and Moxi are updated with a new vbucket map to know that this took place and subsequent data requests are directed at the new location. (see a few sections down for even deeper discussion of Moxi and smart client behavior)
And now the bottom level
At least as far as I'm going to go. Disclaimer: This is going to get pretty technical. Skip down a section if you're short on time.
As mentioned above, the orchestrator will direct the ebucketmigrator process on the destination node to “pull” a vbucket from the source node via TAP.
TAP connections at the source node
When that TAP connection is initiated to the source node, a 'cursor' starts walking the hash table within the particular vbucket. In parallel, a 'backfiller' process starts up and decides whether to bulk load data from disk. As you're probably aware, Membase/Couchbase seamlessly supports having more data in the system than RAM available. In this case, there can be a significant amount of data that needs to be read from disk in order to transfer to another node. The backfiller process looks at the 'resident item ratio' (amount of data cached in RAM versus not). If that ratio is lower than 90%, it bulk loads the whole vbucket from disk into a temporary RAM buffer (there are caps and backoffs built-in to ensure that it doesn't overrun the RAM capacity of the node). If the resident item ratio is higher than 90%, this process doesn't happen.
As the cursor is walking through the hash table, it begins transmitting the keys and documents over the TAP connection. Anything that is cached in RAM already is sent very quickly, anything else is either taken from that temporary buffer space (if available) or read in a one-off fashion from disk.
During this process, mutations to data that has already been sent over are transmitted over the TAP stream as they happen (technically they're queued up, but it happens blazingly fast anyway). Mutations to data that has not yet been sent are applied to the source vbucket only and will be picked up when that particular document is transmitted.
Once all the data has been copied and synchronized, the switchover happens. Technically it would be possible to transmit changes so quickly into a vbucket that it could never catch up and switch over. In practice, this never happens. Any brief delay on the client-side between requests is enough to get the last bits over and it would be extremely rare for inter-node communication to be drastically slower than client-server communication.
TAP connections at the destination node
In general, the receiving end of a TAP connection is not treated very much differently than a regular client putting data into the particular vbucket on that node. There are a few exceptions:
- The vbucket on the destination side is in the “pending” state. This means that the data contained within is not accessible to anything other than the TAP stream sending traffic to the vbucket.
- The data is not replicated out. Traditionally, putting data into a vbucket would cause it to be replicated to that vbucket’s replica. This doesn’t happen for pending vbuckets.
- TAP backoffs (this one is important): To prevent a fast source node from overwhelming a slower destination node, there is a special part of the TAP protocol called “backoffs”. This allows the destination to tell the sender “STOP! WAIT! I need more time...”. When the sender receives this message, it will back off and retry the request after some brief period of time. These backoffs currently kick in when the disk write queue on the destination reaches 1m items. This is measured across all vbuckets on that node, and can be a combination of application traffic as well as the rebalance traffic. There’s more discussion below on how to monitor for these.
Congratulations, you're now a level-4 rebalancer! I know it was a dizzying ride, thanks for sticking with me.
Replication and rebalancing
Not a whole lot to say here, but didn’t want to leave it out completely. Each vbucket is replicated 1, 2 or 3 times to it’s “replica” vbuckets. During a rebalance, these replica vbuckets are moved around as well to ensure a balanced cluster, and are created if they didn’t already exist. For instance, a single node doesn’t have any replicas. When you add the second, those replicas get created. If your replica count is higher than 1, adding more nodes will cause even more replicas to be created. It’s important to be aware of this since there is a multiplication of data being moved when multiple replicas are involved.
There are some special tricks employed during a rebalance to not have to move both an active vbucket and its replica. If the algorithm dictates, the software is capable of switching an active and replica vbucket “in place” and then only moving the replica.
Lastly (at least for this topic), the whole rebalance process is not complete until replicas have been sufficiently “caught up” with their active vbuckets which can also increase the amount of time a rebalance takes. This is an evolution from the original implementation which only concerned itself with the active vbuckets. This caused two issues. One is that the rebalance was “done” before the cluster was really safe. Secondly, there was an immense load put on the whole cluster when rebalance finished in order to reinstantiate the replicas. The current version (1.7.1) has changed this behavior to match what I described earlier in this paragraph.
Moxi and Smart clients during a rebalance
While this is not the place for a complete description of the inner workings, you'll want to understand the basic ideas for all smart clients and Moxi.
When a client or Moxi starts up, it connects to the URL of any Membase node (via port 8091 using HTTP). It authenticates if necessary (only not necessary for default bucket) and receives a vbucket map. This is what's called a 'streaming connection' or 'comet stream' and the HTTP connection remains open (as does the TCP connection I believe). That's the end of communication when everything is stable.
Should the client exit or restart, it will go through that process again. Should the connection be closed, it will attempt to reconnect as well. If the node that it was first talking to doesn't respond, it will go to the next one in its list (assuming you provided a list...best practice here). Round-robin around until it gets one to respond, and then stays connected. Note that it doesn't make connections to all nodes in its list, only one at a time.
Now, during a rebalance, there are a few nuances. By design, each vbucket movement is communicated back to the client over this connection. In reality, that's not quite the case. Earlier versions waited until the end of rebalance to push out a new list. Now, the updated list is pushed before we do the rebalance (called a “fast-forward map”) and it’s up to the client to 'figure it out' as the process goes on (more on that later).
When a client (or Moxi) is sending requests to the cluster, it takes the key and hashes it against the list of vbuckets. It then looks at the map that it has to determine which server is active for that vbucket. If its map is correct, the server will accept the request (whatever it is, all ops have a key and vbucket id). If the vbucket id that the client/Moxi is sending to a server is not active on that server, it will respond with an error saying 'not my vbucket'. During a rebalance, if a client/Moxi doesn't get updated in time, has some requests 'in flight' or somehow just misses the memo, the old node will respond with a 'not my vbucket' error to any requests after this point in time.
While technically an error, this message is really just a signal to the client/Moxi that it needs to go find the right place for that request and retransmit. This is to ensure that there are never clients accessing the same documents in more than one location. At the lowest level, this is just an error saying that the request wasn't right for this server. Higher up the stack, it's up to the client/Moxi to understand that this means it needs to find the right place.
This is where things start to diverge a bit in each implementation. For instance, Moxi originally just brute force tried every server in the cluster. If none responded, it sent that error up to the client (legacy memcached) who had no idea what to do with it. Now, we handle that much better and the current implementation is to have Moxi consult the new 'fast-forward' map it got at the beginning of the rebalance. The basic idea is to have smart clients follow the same logic with each implementation being slightly different.
In general, there's very little traffic over this 'management channel' for clients/Moxi. Really only when a client connects or during a rebalance and even then it's a very low amount of traffic.
The following question(s) came up recently so I thought I’d address it here:
“Hi Perry, I have a user that has what they describe as a flaky network either because of firewall rulesets or connectivity and they wanted to know what happens during a rebalance when the vbucket map gets updated but the client machine is unavailable? I see that the cluster pushes things like this to the client through 8091 and HTTP but does it maintain a persistent connection or does the cluster only connect to the client when there is a update it needs to know about?
They're concerned that after a rebalance the client may have a stale map and want to know more details about how that’s handled.”
And my response detailed these two points:
- The connection is kept always open. The client(s) always connect to the cluster, not the other way around...the cluster will push a new map over existing connections.
- A client losing its connection to the cluster will attempt to reestablish (configurable). Anytime it reconnects (first time or not) it gets the latest map that the cluster has. Ironically, a flaky network in theory might just help here to keep the map constantly updated during a rebalance, but that's for a different discussion.
If you've made it through this far, not only do you know what rebalancing is, but you also now know how it works. That should help immensely when you grow your cluster so that you better understand what is going on behind the scenes, both on your servers and your clients.
The second part looks at how to monitor your cluster and the rebalancing progress, how to handle failures, and how the performance of your cluster can be affected (and mitigated) during the rebalancing process.