February 10, 2014

Failure is not an option

Databases are complex and have many moving parts that can fail. At the same time, failures in large-scale systems are inevitable, and they can happen at any time and anywhere in the stack. So what does all this mean for your mission-critical app? To keep your app running 24x365, your database must be highly available and capable of recovering from disasters. Even if a single node, a rack, or an entire datacenter fails, your database must continue to operate without downtime. For a mission-critical app, that is a high hurdle for your database to clear, but failure is not an option.

Several organizations have successfully used Couchbase Server in production for their mission-critical apps. This blog post discusses the high availability and disaster recovery features that make Couchbase Server so reliable, including an exciting new feature announced in Couchbase Server 2.5 called “Rack Awareness”. To learn more about Couchbase Server 2.5, register for our webinar here.

Why isn’t replication enough?

Replication of data is at the core of high availability, but it is not enough by itself. In Couchbase, each document is replicated up to 3 times (depending on the user-configured replication factor). But for a database system to be highly available, it's not just the data: all of the system components, including hardware, software, and data, must be highly available.

Simply put, having two or more of something does not by itself mean high availability. If a piece of hardware fails, the system must be able to continue operating, and the same holds if a software component fails. For example, in versions prior to Couchbase Server 2.5, replica data was distributed randomly across all of the nodes in the cluster, with each node (of a cluster containing N nodes) holding roughly 1/Nth of the data partitions. The replicas for these partitions were evenly divided among the remaining N-1 nodes. With this distribution, if a physical rack failed, the nodes in that rack might hold both the primary partition and the replica data for some partitions, making that data unavailable. Clearly, something else was needed...
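To make the problem concrete, here is a minimal sketch of a placement that ignores racks. It is a toy model, not Couchbase's actual placement logic: the node names, rack names, partition ids, and the function `unavailable_partitions` are all made up for illustration.

```python
# Illustrative toy model only; not Couchbase's real partition map.

def unavailable_partitions(placement, rack_of, failed_rack):
    """Return the partitions whose every copy lives on a node in failed_rack.

    placement: dict of partition id -> list of node names (active first, then replicas)
    rack_of:   dict of node name -> rack name
    """
    return sorted(
        partition
        for partition, nodes in placement.items()
        if all(rack_of[node] == failed_rack for node in nodes)
    )

# Hypothetical 4-node cluster spread over 2 racks, with 1 replica per partition.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

# A placement chosen without looking at racks: partition 0 keeps both
# of its copies in rackA, and partition 2 keeps both in rackB.
placement = {
    0: ["n1", "n2"],
    1: ["n2", "n3"],
    2: ["n3", "n4"],
    3: ["n4", "n1"],
}

print(unavailable_partitions(placement, rack_of, "rackA"))  # [0]
print(unavailable_partitions(placement, rack_of, "rackB"))  # [2]
```

Even though every partition has a replica, losing either rack takes one partition completely offline in this example.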

Rack Awareness in Couchbase Server 2.5

As your Couchbase cluster gets bigger, you will need to spread your Couchbase Server nodes across more than one rack. To maintain high availability, you will want the cluster to survive the failure of an entire rack, so that the loss of any one rack does not make all the copies of your data unavailable.

With rack awareness in Couchbase Server 2.5, you can configure nodes into server groups where all of the servers in a group are in a single rack. Grouping servers into server groups ensures that replica data partitions are not on the same rack as the primary partitions. For example, as shown in Figure 1, there are 3 server groups each containing 3 server nodes. The cluster has 2 replicas (3 copies of the data) and the replica copies are on different racks. The configuration is balanced because every server group has the same number of server nodes.

 

Figure 1: A balanced rack aware configuration in Couchbase Server
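The idea behind the balanced configuration in Figure 1 can be sketched in a few lines. This is a simplified round-robin scheme under stated assumptions, not the algorithm Couchbase Server uses; the group names, node names, the partition count of 9, and the function `rack_aware_placement` are hypothetical.

```python
# Simplified sketch of rack-aware placement; Couchbase's real algorithm differs.

def rack_aware_placement(groups, num_partitions, num_replicas):
    """Place each partition's copies (active first) in distinct server groups.

    groups: dict of server group name -> list of node names in that group
    """
    group_names = sorted(groups)
    if num_replicas >= len(group_names):
        raise ValueError("need more server groups than replicas")
    placement = {}
    for p in range(num_partitions):
        copies = []
        for k in range(num_replicas + 1):
            # Copy k goes to a different group than copies 0..k-1.
            group = group_names[(p + k) % len(group_names)]
            nodes = groups[group]
            # Rotate through the nodes of the group to spread the load.
            copies.append(nodes[(p // len(group_names)) % len(nodes)])
        placement[p] = copies
    return placement

# 3 server groups of 3 nodes each, 2 replicas (3 copies), as in Figure 1.
groups = {
    "group1": ["n1", "n2", "n3"],
    "group2": ["n4", "n5", "n6"],
    "group3": ["n7", "n8", "n9"],
}
group_of = {node: g for g, nodes in groups.items() for node in nodes}

placement = rack_aware_placement(groups, num_partitions=9, num_replicas=2)
print(placement[0])  # ['n1', 'n4', 'n7']
print(placement[4])  # ['n5', 'n8', 'n2']
```

Because every copy of a partition sits in a different server group, no single rack holds all the copies of any partition.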

 

As shown in Figure 2 below, when a rack fails and the server nodes are failed over, replica copies in other racks are promoted to active and the app can access the data.

 
 

Figure 2: Rack awareness in Couchbase Server
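The promotion step can be sketched as follows. This is a toy illustration, not Couchbase's failover code; the placement, the node names, and the function `fail_over` are made up for the example, and the "first surviving copy becomes active" rule is an assumption made for simplicity.

```python
# Toy illustration of failing over a rack; not Couchbase's failover code.

def fail_over(placement, failed_nodes):
    """Remove failed nodes from each partition's copy list.

    The first surviving copy in each list is treated as the new active copy
    (an assumption made for this sketch).
    """
    return {
        partition: [node for node in nodes if node not in failed_nodes]
        for partition, nodes in placement.items()
    }

# Hypothetical rack-aware placement: active copy first, then replicas.
placement = {
    0: ["n1", "n4", "n7"],
    1: ["n4", "n7", "n2"],
    2: ["n7", "n1", "n5"],
}

# The rack holding n1, n2 and n3 fails and its nodes are failed over.
after = fail_over(placement, failed_nodes={"n1", "n2", "n3"})

print(after[0])     # ['n4', 'n7']
print(after[0][0])  # n4
```

Partition 0 lost its active copy on n1, so the replica on n4 takes over, and every partition still has at least one copy the app can reach.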

 

Configuring rack awareness in Couchbase Server is simple: you just create the server groups and assign server nodes to them. Rack awareness can be configured through the management console or using the REST interface.
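As a rough sketch of the REST route, the snippet below builds (but does not send) a request to create a server group. The endpoint path `/pools/default/serverGroups`, the `name` form field, the host, and the credentials are assumptions for illustration and should be checked against the Couchbase Server REST documentation. The helper `build_create_group_request` is hypothetical.

```python
import base64
import urllib.parse
import urllib.request

def build_create_group_request(host, group_name, user, password):
    """Build an HTTP request that would create a server group.

    The endpoint path and the 'name' field are assumptions; verify them
    against the Couchbase Server REST documentation before use.
    """
    url = f"http://{host}:8091/pools/default/serverGroups"
    body = urllib.parse.urlencode({"name": group_name}).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    # HTTP Basic authentication with the cluster administrator credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request

# Placeholder host and credentials.
request = build_create_group_request("localhost", "rack1", "Administrator", "password")
print(request.get_method(), request.full_url)

# To actually send it against a running cluster:
# urllib.request.urlopen(request)
```

Nodes would then be assigned to the new group in a separate step, either through the management console or through the REST interface.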

 

 

Rack Awareness or Cross Datacenter Replication?

Hurricanes, earthquakes, DNS failures, lightning storms, fat fingers, and the like have been behind many of the biggest outages in traditional data center infrastructures. These incidents teach us that we need something more: a Plan B for reliability.

Plan B requires both high availability and disaster recovery. High availability (HA) ensures that the data is available with as little downtime as possible, whereas disaster recovery (DR) is about preparing for and recovering from a disaster.


With more and more enterprises using Couchbase to run their mission-critical apps, we felt that it was important to go beyond just a node failure and a rack failure to keep your app running. Even if an entire datacenter fails, your database and your app should continue to work. Using cross datacenter replication in Couchbase Server, you can replicate data in an active-active fashion between two datacenters in different geographies, in whatever way best fits your recovery requirements. In Couchbase Server 2.5, we added support for securing the replication channel between the datacenters.

Rack awareness is a high availability feature in Couchbase Server, whereas cross datacenter replication is a disaster recovery feature. So, do you need rack awareness or cross datacenter replication? You need both!

 

Want to learn more about 2.5?

Couchbase Server 2.5 is an exciting release. In addition to availability and reliability, we also added some cool features in the areas of security and connection management. To learn more about these features, register now for the upcoming launch webinar.

 

 

In addition to these improvements there are several other performance and stability fixes that improve the user experience across all supported platforms. To learn more about this release you can take a look at our documentation here.

Give 2.5 a try! Download it here: http://www.couchbase.com/download

 
