March 11, 2014

Rolling Upgrades

One of the capabilities of Couchbase Server is the ability to perform online upgrades without downtime, thanks in part to Couchbase’s auto-sharding and rebalancing features. In this blog post we will walk through the recommended approaches to doing a rolling upgrade. While all of these approaches are valid ways of upgrading, they are listed in order of best practice: preference is given to ease of operation, to keeping the cluster and application online and operational, and to consistent performance while the upgrade is happening. Obviously some of these will be easier in some situations than others, but all of them will work.

Note: While these instructions have been tested and work at a functional level, you should, as always, try your selected approach in your own test environment before upgrading your production cluster. Once upgraded, run through your existing application and database smoke test procedures to confirm that the Couchbase features you use continue to perform as you need and expect. This goes for the upgrade process itself as well as for any new functionality Couchbase provides. If you have an Enterprise support contract with Couchbase and need overall assistance with the upgrade, or run into problems before or during it, please do not hesitate to contact Couchbase Support and/or your Account Manager. If you are using the Community Edition of Couchbase, the community forums are there for help, or there is no time like the present to get world class support and the features that the Enterprise Edition brings to your mission critical systems. In addition to not having access to the Couchbase support team, the Community Edition trails code revisions for a period of time (usually 3+ months), does not contain certain Enterprise features, and does not receive hot fixes as they come out.

Note: In Couchbase Server 2.5 Enterprise Edition, a feature was introduced called Rack/Zone Awareness (RZA). Its functionality is beyond the scope of this post, but you should be aware of it and go read up on it if you are upgrading to 2.5 or above, from a version prior to 2.5. For more details, see Rack/Zone Awareness in the documentation.

Note: If you are upgrading from Couchbase 1.8.1 to 2.x, there are some unique technical considerations that you should review. These include, but are not limited to, the introduction of the append-only data file structure for disk-based storage, which increases the required disk space to 2-3x the amount of data you wish to store but greatly speeds up disk writes. Full details are available in the Resource requirements section of the Couchbase manual; make sure your servers meet at least the recommended minimum configuration. Also, check out Perry’s blog post about sizing a 2.0 cluster for more help.

Recommended Approach #1, a Full Swap Rebalance:

This approach uses a process called Swap Rebalance to add in the exact same number of new servers/instances as your existing cluster and remove the old ones. This will be particularly appealing for organizations using a public or private cloud that can acquire server instances with relative ease, or for users doing a complete hardware refresh. Each new server/instance should have equal or better hardware specifications than the existing ones, have its operating system security hardened, and have the target upgrade version of Couchbase Server installed. (If you are increasing RAM and therefore need to increase the overall RAM quota of Couchbase, see the section below detailing what you need to look at to help with this process.) This upgrade approach offers the best performance and the easiest transition from an existing version to a new one. When the rebalance is complete, the whole cluster will be on the new version and the old nodes will be removed and ready to be recycled. For more details, see Swap Rebalance in the Couchbase manual.

Steps:

1) Acquire an identical number of new servers/instances as are in your current cluster.
2) For each new server/instance, harden the operating system and install the target version of Couchbase Server, plus any hot fixes.
3) Create a backup of your cluster's data using the cbbackup tool, per your normal procedure.
4) On one of the DB nodes, construct a command line call using the couchbase-cli command. The Web UI does not support swapping out all of the existing nodes in the cluster at one time, as of this writing. Here is an example of the command to add and remove four nodes:

$> /opt/couchbase/bin/couchbase-cli rebalance \
    -c <ip or hostname of an existing cluster node>:8091 \
    --server-add=<new node hostname>:8091 \
    --server-add=<new node hostname>:8091 \
    --server-add=<new node hostname>:8091 \
    --server-add=<new node hostname>:8091 \
    --server-remove=<existing node hostname>:8091 \
    --server-remove=<existing node hostname>:8091 \
    --server-remove=<existing node hostname>:8091 \
    --server-remove=<existing node hostname>:8091 \
    -u Administrator -p <your password>

This command will connect to the specified existing cluster node and commence the online swap rebalance process, adding the four new nodes and removing the four existing ones. Make sure you run this process under something like screen so you can disconnect and reconnect to see progress. A rebalance might take a while depending on the sizing and utilization of your cluster resources (e.g. disk I/O, network, etc.). Also, a rebalance is meant to be a background process: it prioritizes application-level traffic at the cost of increased rebalance time. That said, depending on how you sized your cluster, you could see a minor drop in performance while the rebalance is running. It might take an hour or more depending on quite a few factors; usually network and disk will be your biggest bottlenecks. See the section below about monitoring a rebalance to see what is going on and confirm the process is still progressing. And let me take this moment to mention again that you really want to try this in a test environment first to see what will happen and how long it takes.
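As a sketch, running the rebalance inside a detachable screen session might look like this (the session name is just an example; tmux or nohup work equally well):

```shell
# Start a named session and run the rebalance inside it; an SSH
# disconnect will then not kill the rebalance command.
screen -S cb-upgrade
# ...run the couchbase-cli rebalance command shown above...
# Detach with Ctrl-A then D. Later, from any SSH session, reattach
# to check on progress:
screen -r cb-upgrade
```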

Note: Most new features of the upgraded Couchbase version become available for use once the rebalance is complete, though some features are available immediately.

5) Now let’s clean up the Couchbase client connection strings. While the running application servers already have the updated cluster map, if you were to restart those app servers they would need to fetch a new map from the cluster. So, on each application server that accesses the Couchbase cluster, edit the connection string configuration to remove the old hostnames/IPs of the cluster nodes entirely and make sure it lists at least three of the new, active nodes.
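As a purely hypothetical illustration (the file name, property name, and hostnames below are all invented; your SDK's actual configuration format will differ), the cleaned-up bootstrap list might end up looking like this:

```shell
# Hypothetical example: rewrite the client bootstrap list so it names only
# new, active cluster nodes (file, property, and host names are made up).
cat > couchbase.properties <<'EOF'
# List at least three active nodes so a restarted client can bootstrap
# even if one of them happens to be down.
couchbase.nodes=http://newnode1:8091/pools,http://newnode2:8091/pools,http://newnode3:8091/pools
EOF
```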

Recommended Approach #2, a Rolling Swap Rebalance:

This process is a variation on #1. Instead of replacing all servers/instances in the cluster, you will add one or more (one is good, more is better) new servers/instances of equal or better hardware specifications to the cluster and remove just as many from the cluster at the same time using the Swap Rebalance process. Then repeat this process until each node has been upgraded to the target version.

Technical background info: just remember, if you add in two servers, you remove two. This minimizes the nodes involved in the rebalance: in a swap rebalance, a departing node simply copies its vBuckets to the new node, involving only the two servers being swapped. If instead you swap in more nodes than you swap out, the cluster has to shuffle vBuckets across all nodes, and you take the increased load hit on every node of the cluster.
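A back-of-envelope sketch of why this matters: Couchbase splits each bucket into 1024 active vBuckets, divided evenly across the nodes, so a one-for-one swap only has to move the departing node's share:

```shell
# Each bucket is split into 1024 active vBuckets, spread evenly.
NODES=4
PER_NODE=$(( 1024 / NODES ))
# In a one-for-one swap only this share (plus replica vBuckets) is
# copied, straight from the departing node to its replacement.
echo "each node owns ~$PER_NODE active vBuckets"
```

With four nodes that is ~256 vBuckets per node; a swap touches only those, whereas an unbalanced add/remove forces every node in the cluster to shuffle part of its share.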

Steps:

  1. Acquire as many of the new servers/instances as you desire. If this is your second pass of these steps, you can just reuse the other servers/instances that you swapped out in previous steps.
  2. For each instance, harden the operating system and install the target full version of Couchbase, plus any hot fixes.
  3. Create a backup of the cluster’s data using the cbbackup tool per your normal procedure.
  4. Open the Couchbase Web Console, Server Nodes tab, then click on the Add Server button and follow the process to add each new server/instance to the cluster.
  5. For each server you just added, click on the Remove button for an existing server.
  6. Click on the Rebalance button to initiate the rebalance process for the cluster. The rebalance process will automatically move all of the data from the nodes flagged for removal to the new nodes.

Repeat these steps until all nodes have been upgraded to the new version.

Once complete, let’s clean up the Couchbase client connection strings. While the running application servers already have the updated cluster map, if you were to restart those app servers they would need to fetch a new map from the cluster. So, on each application server that accesses the Couchbase cluster, edit the connection string configuration to remove the old hostnames/IPs of the cluster nodes entirely and make sure it lists at least three of the active nodes.

All new features should be available for use once all nodes are on the same version. For general information on swap rebalance, see Swap Rebalance.

Note: While the examples here use the Couchbase Web Console, you could use the REST API or CLI tools to do this.
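For example, a single one-for-one swap from the steps above could be done with couchbase-cli instead of the Web Console, following the same pattern as the command shown earlier (hostnames are placeholders):

```shell
$> /opt/couchbase/bin/couchbase-cli rebalance \
    -c <ip or hostname of an existing cluster node>:8091 \
    --server-add=<new node hostname>:8091 \
    --server-remove=<existing node hostname>:8091 \
    -u Administrator -p <your password>
```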


Recommended Approach #3, A True Rolling Upgrade:

This process is a little more involved, and I say that only because it should really only be performed if your cluster is properly sized and can withstand temporarily losing a node entirely with little or no performance degradation while you upgrade. If you are unsure whether your cluster is properly sized, do some research to find out ahead of time, not in the middle of your maintenance window. The reason you would use this process is if you do not have access to other servers/instances and want to do the upgrade in place. If you have a Couchbase Enterprise Support contract and need help with this process, you can reach out to Support with any questions during your planning of this upgrade.

Also remember, this might just be a good time to add cluster capacity (either vertically or horizontally) in general and then do the #1 or #2 recommended solutions above instead.

Steps:
  1. Create a backup of the cluster’s data using the cbbackup tool per your normal procedure.
  2. In the Admin Web UI, click on the Remove button for just one node of the cluster.
  3. Rebalance the cluster across the remaining nodes of the cluster.
  4. Upgrade Couchbase Server on that server/instance to the target version, plus any hot fixes.
  5. Click on the Add server button to add the upgraded node back into the cluster.
  6. Click on the Rebalance button to initiate the rebalance process for the cluster. The rebalance process will distribute the vBuckets evenly across the cluster.

Repeat these steps until all nodes have been upgraded to the new version.
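One pass of the steps above could also be driven with couchbase-cli rather than the Web UI (hostnames are placeholders): first rebalance the node out, then, after upgrading it, rebalance it back in.

```shell
$> /opt/couchbase/bin/couchbase-cli rebalance \
    -c <ip or hostname of another cluster node>:8091 \
    --server-remove=<node being upgraded>:8091 \
    -u Administrator -p <your password>

# ...upgrade Couchbase Server on the removed node, then add it back:

$> /opt/couchbase/bin/couchbase-cli rebalance \
    -c <ip or hostname of another cluster node>:8091 \
    --server-add=<node being upgraded>:8091 \
    -u Administrator -p <your password>
```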

Three important items during rebalance:

  1. If you are upgrading from 1.8.1 to 2.x and the rebalance stops/fails or pauses for whatever reason, you should wait a minimum of 5 minutes and then restart it. The reason for this is that the network connections Couchbase uses for rebalance between nodes do not get shut down gracefully on a pause or an unplanned stop, so they can end up living longer if a rebalance is kicked off immediately after a failed one. However, the network connection has a timeout of 5 minutes if it sees no activity, which is why we recommend waiting at least 5 minutes before attempting a rebalance again. Hammering the button over and over will do more harm than good. If you are upgrading from 2.x to a higher level of 2.x (e.g. 2.0.1 to 2.5), this is not applicable and the rebalance can be restarted immediately.
  2. The browser cache is partly used to remember which nodes were removed, so before you click Rebalance again it is crucial that you re-remove the nodes that are leaving, so the cluster doesn't think the next rebalance should actually include them again.
  3. Client configuration - both Moxi and the Couchbase clients need to be configured to point to at least one Couchbase node that is part of the actual cluster. The node they are configured with allows them to receive the cluster's topology. As you rebalance nodes out of the cluster, make sure the clients are configured with a node that is part of the original cluster, with a backup configuration pointing to one of the newly upgraded nodes being added in. That way the clients can continue accessing the cluster as nodes are removed and added, even if new client objects are created during the rebalance.

Monitoring Rebalance Progress

You can of course monitor through the rebalance GUI itself, but you may need a more detailed view to determine if it stalls. At a minimum, watch the two metrics below on the nodes: the disk write queue and the active vBucket count. In Couchbase 2.x there are more detailed rebalance metrics under the server tab. Also note that during a rebalance Couchbase moves one bucket at a time, so if you have multiple buckets and see no activity in the disk write queue or the vBucket numbers, you may be looking at the wrong bucket.

Disk Write Queue

The disk write queue should be under 1 million items for each node. If it is over this number, the node will send a back-off message to other nodes when they try to send it vBuckets during the rebalance process. This is normal, and the only thing to do here is to wait for the disk write queue to drain so that the node starts accepting rebalance operations again. The quota for this can be configured, but only do so if you know what you are doing.
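One way to watch this from the command line is the cbstats tool; for example (hostname and bucket name are placeholders, and exact stat names can vary by version), the ep_queue_size stat reports the disk write queue:

```shell
$> /opt/couchbase/bin/cbstats <node hostname>:11210 all -b <bucket name> | grep ep_queue_size
```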

Active vBucket Count

This number should be changing on the nodes that are being added and removed. If you do not see this number change for 10-20 minutes, the rebalance may have stalled and will need to be stopped and restarted.
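cbstats can show this one as well; the vb_active_num stat is the count of active vBuckets currently on a node (again, hostname and bucket name are placeholders):

```shell
$> /opt/couchbase/bin/cbstats <node hostname>:11210 all -b <bucket name> | grep vb_active_num
```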

For more detailed information, see Monitoring a Rebalance in the Couchbase documentation.

Rebalancing from the Command Line

Doing this gives us a little more flexibility than using the Web Console, as we can add new nodes and remove old nodes in one rebalance. We will use the couchbase-cli command to specify which nodes to add as well as which to remove. Full details for the CLI command are available here.

    [couchbase@serverXYZ ~]$ /opt/couchbase/bin/couchbase-cli rebalance -c 192.168.0.1:8091 \
      --server-remove=192.168.0.2 \
      --server-add=192.168.0.4 \
      --server-add-username=Administrator \
      --server-add-password=<your strong password> \
      -u Administrator -p <your password>


Make sure you use something like screen or nohup, as the rebalance can take a while depending on the size of your cluster, replicas, XDCR, etc.

Client SDK Upgrades

As a side note while we are on the subject of upgrades, the Couchbase client SDKs are updated on a monthly basis, with bug fixes and new features added all the time. This would be a great time to see which of the client libraries you use have been updated and get them into the queue for updating in your applications. Go here for information on SDK updates.

While we are on the topic of client SDKs, this is also a good time to make sure that you have at least 2-3 of your cluster nodes in the Couchbase client configuration connection string on each of your application servers.
