This is the second part of a two-part piece that gets into the nitty-gritty details of Couchbase's rebalancing technology. The first part deals with the basic mechanics and background of the rebalancing process. In this second part, we'll take a closer look at how to monitor the rebalance process, how to stop a rebalance, and how to deal with rebalancing issues.
Monitoring a rebalance
Apart from general monitoring of a Membase/Couchbase cluster, there are some specific areas to focus on during a rebalance. Most of these can be monitored through the Web UI, and more specifically captured by looking at the underlying statistics available on each node.
- You can monitor how many (and which) vbuckets a given node is responsible for and use that to get a feel for the progress of a rebalance
- Disk reads will increase on the source node, either due to a bulk 'backfill' or one-off reads to materialize the data into RAM before sending it to the destination node
- Combined with the normal traffic load, a rebalance can result in a greatly increased disk write queue as data is transferred into RAM faster than it can be drained to disk. (*see below for a bug/design flaw that is currently being worked out)
- You should see quite a bit of activity here, and it may be 'bursty' as a vbucket is moved from one place to another.
- TAP backoffs are critical to understanding why a rebalance might be taking longer than expected. You will see these backoffs recorded on the destination node, because it counts how many it has sent; the source node is the one being told to back off.
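To make the first point in the list above concrete, here is a minimal Python sketch of turning per-node vbucket counts into a rough progress figure. It is not Couchbase tooling: the function name, the node names, and the 1024-vbucket total are assumptions for illustration, and you would fill the dictionaries from whatever statistics you collect on each node.

```python
def rebalance_progress(start, current, target):
    """Estimate per-node progress from active vbucket counts.

    start, current, target: dicts mapping node name -> active vbucket count
    (before the rebalance, right now, and expected once it completes).
    Returns a dict mapping node name -> fraction between 0.0 and 1.0.
    """
    progress = {}
    for node, goal in target.items():
        begin = start.get(node, 0)
        now = current.get(node, begin)
        total = goal - begin
        if total == 0:
            # This node neither gains nor loses vbuckets.
            progress[node] = 1.0
            continue
        fraction = (now - begin) / total
        # Clamp so that transient over/undershoot never leaves [0, 1].
        progress[node] = max(0.0, min(1.0, fraction))
    return progress


# Hypothetical example: adding a third node to a two-node cluster
# (a total of 1024 vbuckets is assumed here).
start = {"node-a": 512, "node-b": 512, "node-c": 0}
target = {"node-a": 342, "node-b": 341, "node-c": 341}
current = {"node-a": 427, "node-b": 512, "node-c": 85}

for node, fraction in rebalance_progress(start, current, target).items():
    print(f"{node}: {fraction:.0%}")
```

Because vbuckets move one at a time, the per-node numbers will not advance in lockstep; treat the result as a feel for progress, not as a time estimate.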
Why is my rebalance taking so long?
Customers routinely ask how long a rebalance should take, and I constantly have to say that I really don't know. It depends so much on the amount of data (number and size of individual objects), the speed of disk (both reading and writing), and the network load...not to mention the possibility of node failures and other unexpected slowdowns throughout the process.
Rather than trying to reliably quantify how long a rebalance would take under increasingly variable conditions, I find it much more important to be able to characterize and monitor its status. Ideally, a rebalance shouldn't have a material impact on your application, and so it doesn't really matter how long it takes. If there is a perceived impact, it is more important to understand why (and possibly resolve the cause) than to try to speed up the rebalance process itself.
The most common cause of general slowness is a slow disk on the destination node which leads to TAP backoffs.
*Current bug/design flaw: In releases prior to 1.7.2, a node may start deleting old data from disk while it is still receiving new data. Our initial testing showed that this deletion does not take very long. In practice, we've discovered that under certain conditions (especially on Amazon's EC2) it can take much longer than expected. Unfortunately, a node cannot perform these deletions (en masse) and write at the same time, which can end up 'starving' the writer process. This leads to a very large disk write queue and, eventually, an extended period of TAP backoffs. While not catastrophic, it can be disconcerting to the user in terms of a greatly extended rebalance time and possible fear for data safety. The behavior is being (or has been) fixed in 1.7.2 to prevent this situation from happening.
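The slow-disk pattern described above (a large disk write queue followed by TAP backoffs) can be spotted by comparing two stat samples taken from the destination node. The following is a minimal Python sketch, not Couchbase tooling: the keys `disk_write_queue` and `tap_backoffs` are hypothetical names for values you would pull from the node statistics, and the default threshold is an illustrative assumption rather than a documented limit.

```python
def diagnose_destination(previous, latest, queue_threshold=1_000_000):
    """Classify a destination node from two stat samples.

    Each sample is a dict with two hypothetical keys:
      "disk_write_queue": items currently waiting to be written to disk
      "tap_backoffs":     cumulative count of backoffs the node has sent
    Returns "disk_bound", "queue_growing", or "ok".
    """
    new_backoffs = latest["tap_backoffs"] - previous["tap_backoffs"]
    queue_growth = latest["disk_write_queue"] - previous["disk_write_queue"]

    # Backoffs are being sent while the write queue is very large:
    # the disk cannot keep up with the incoming data.
    if new_backoffs > 0 and latest["disk_write_queue"] >= queue_threshold:
        return "disk_bound"
    # No sign of throttling at this threshold, but the queue is filling
    # faster than it drains.
    if queue_growth > 0:
        return "queue_growing"
    return "ok"


earlier = {"disk_write_queue": 400_000, "tap_backoffs": 0}
later = {"disk_write_queue": 1_200_000, "tap_backoffs": 35}
print(diagnose_destination(earlier, later))
```

Sampling at a fixed interval and watching for a run of "disk_bound" results is usually more telling than any single reading, since the queue can spike briefly as a vbucket arrives.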
Performance during a rebalance
By design, a rebalance should not have a material impact on the performance of the application. In reality, however, a rebalance operation does increase the overall load and resource utilization of a system so certain environments may notice a degradation. Our best practice is to perform a rebalance during an application's lowest traffic levels if possible.
The two main causes of a performance hit during a rebalance are network saturation and disk contention. While CPU utilization is another possible cause, it is very uncommon as the sole cause at any but the highest traffic rates (several hundred thousand operations per second) and is usually related to some underlying cause (like disk IO wait).
Network saturation should be fairly obvious. If your application is already pushing the cluster close to its network bandwidth, a rebalance is likely to cause more saturation, which can lead to timeouts and errors. We don't currently (but may in the future) allow throttling the network activity of a rebalance.
Disk contention is a bit harder to see and characterize. This goes back to one of Membase/Couchbase Server's primary benefits, the separation of RAM from disk IO. By serving as much as possible out of RAM, the software masks any slowdowns at the disk IO level. Disk is normally much slower than RAM, but when increased load is asked of a disk, it gets even slower. Rebalance is one of those increased loads. If your application is reading all its data from RAM, you should not see any impact here. However, if many requests (and 'many' is subjective) are being serviced from disk because they are not cached in RAM, performance can and will suffer. You can mitigate this by having more RAM, but that's not always practical, and so sometimes it's just a matter of being aware of what's going on.
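One way to put a number on "many requests serviced from disk" is to compare the count of reads that had to go to disk against the total number of reads over the same interval. The sketch below is only an illustration in Python: the counter names `gets` and `disk_fetches` are hypothetical stand-ins for whichever cumulative statistics your version exposes, and what counts as too high a ratio is for you to judge.

```python
def disk_fetch_ratio(previous, latest):
    """Fraction of reads served from disk between two samples.

    Each sample is a dict of cumulative counters with hypothetical keys:
      "gets":         total read operations handled by the node
      "disk_fetches": reads that had to be loaded from disk first
    Returns 0.0 when there were no reads in the interval.
    """
    gets = latest["gets"] - previous["gets"]
    fetches = latest["disk_fetches"] - previous["disk_fetches"]
    if gets <= 0:
        return 0.0
    return fetches / gets


before = {"gets": 10_000, "disk_fetches": 100}
after = {"gets": 20_000, "disk_fetches": 600}
print(f"{disk_fetch_ratio(before, after):.1%} of reads went to disk")
```

Measuring this ratio before starting a rebalance gives a baseline; a cluster that already reads heavily from disk is the one most likely to notice the extra disk load.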
Stopping a rebalance
Because each vbucket is moved individually, a rebalance can be stopped (either manually or because of some failure) without having to re-do the entire process. You can simply restart the rebalance and it will pick up from where it left off.
There are two caveats to this given the current state of the software:
- If you are removing nodes with a rebalance, and that rebalance stops, you need to re-remove the nodes before initiating the rebalance again. The cluster doesn't "remember" what you were doing last time, so it's important to remind it.
- It is a best practice to wait at least 5 minutes before restarting a rebalance (regardless of why it was stopped). This is to allow the various connections between nodes to get cleaned up properly. Later versions of the software will actually impose this limit through the Web UI. It could be overridden by using the CLI/REST API if desired...but don't do that unless we tell you to ;-)
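Both caveats above can be folded into a small helper that prepares the restart. This is a hedged Python sketch, not an official client: it only builds the form body and sends nothing. The `knownNodes` and `ejectedNodes` parameter names follow the REST rebalance call as I understand it, so verify them against the documentation for your version; the node names and the helper itself are made up for illustration.

```python
import time
from urllib.parse import parse_qs, urlencode

MIN_WAIT_SECONDS = 5 * 60  # the five-minute pause recommended above


def build_rebalance_request(known_nodes, nodes_to_remove, stopped_at, now=None):
    """Return a form-encoded body for restarting a rebalance.

    known_nodes:     every node currently in the cluster
    nodes_to_remove: nodes to eject, listed again on every restart because
                     the cluster does not remember the previous request
    stopped_at:      Unix timestamp at which the last rebalance stopped
    now:             current Unix timestamp (defaults to time.time())
    """
    now = time.time() if now is None else now
    remaining = MIN_WAIT_SECONDS - (now - stopped_at)
    if remaining > 0:
        raise RuntimeError(f"wait another {remaining:.0f}s before restarting")

    unknown = set(nodes_to_remove) - set(known_nodes)
    if unknown:
        raise ValueError(f"cannot remove nodes not in the cluster: {sorted(unknown)}")

    return urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(nodes_to_remove),
    })


# Hypothetical three-node cluster where the third node is being removed.
nodes = ["ns_1@10.0.0.1", "ns_1@10.0.0.2", "ns_1@10.0.0.3"]
body = build_rebalance_request(nodes, ["ns_1@10.0.0.3"], stopped_at=0, now=600)
print(parse_qs(body))
```

Keeping the removal list as an explicit argument is the point of the design: a restart cannot be built without restating which nodes are leaving.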
If a rebalance stops in the middle of moving a particular vbucket (as it likely would), there is nothing to worry about. That particular move is "backed out of" by simply leaving the original vbucket as active. Nothing more is needed.
What should you do when a rebalance fails? Ideally, you want to resume the rebalance in short order if it makes sense (again, wait at least 5 minutes). This depends a lot on the nature of the failure. When a rebalance failure occurs, you should endeavor to find out the cause and also the current state of the cluster.
While we of course strive to fix all bugs within the software, there will inevitably be situations where a rebalance fails. The two most common causes are timeouts and crashes.
A timeout will be generated when a particular node is slow to respond (or doesn't respond at all after a period of time) and usually comes either from network or disk slowness. You'll also see a timeout if a node fails, but only if it does so in a way that the other side can't figure out what happened. In this latter case, a timeout can actually mask some other, more dire issue.
A crash (either of an internal process such as memcached or some external system component like the kernel) will also result in a rebalance failing.
The logs will have some information (hopefully a substantial amount) regarding the failure itself. They will say something like "timeout" (same thing as "wait for memcached failed") or "some process exited unexpectedly". While this is useful information, especially for support to investigate, the cause is much less important than the current state of the cluster.
You want to immediately assess whether or not further action is required before resuming the rebalance. The primary diagnostic is to determine whether all of the nodes in the cluster are still healthy and able to serve data. If they are, great. If not, you need to remedy that situation first. I'm not going to address this more general troubleshooting activity here, but you get the idea. You can rest assured that the data is safe (and a failover possible if needed).
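The "are all nodes still healthy?" check lends itself to a simple gate in front of any restart. The Python sketch below assumes you already have a list of node records; the field names `hostname`, `status`, and `clusterMembership` are modeled on the node list I recall from the REST API and should be treated as assumptions to verify against your version.

```python
def nodes_blocking_rebalance(nodes):
    """Return hostnames of active cluster members that are not healthy.

    nodes: list of dicts, each with the (assumed) keys "hostname",
    "status", and "clusterMembership". An empty result means nothing
    here stands in the way of resuming the rebalance.
    """
    return [
        node["hostname"]
        for node in nodes
        if node.get("clusterMembership") == "active"
        and node.get("status") != "healthy"
    ]


# Hypothetical cluster state after a failed rebalance.
cluster = [
    {"hostname": "10.0.0.1:8091", "status": "healthy", "clusterMembership": "active"},
    {"hostname": "10.0.0.2:8091", "status": "unhealthy", "clusterMembership": "active"},
]

blocking = nodes_blocking_rebalance(cluster)
if blocking:
    print("fix these nodes before resuming:", ", ".join(blocking))
else:
    print("all active nodes are healthy")
```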
If you've followed this piece and the first post on rebalancing, then you know pretty much everything there is to know about the rebalancing process. Most importantly, you should now be in a position to make an educated guess about the effects and needs of a rebalance and how it will affect the deployment and operation of your Couchbase and Membase clusters.