How Many Nodes? Part 2: Sizing a Couchbase Server 2.0 cluster
In the first part of this series, I gave an overview of the 5 factors that determine the sizing of your Couchbase cluster: RAM, disk (IO and size), CPU, network and data distribution. In this second part, I want to go into more detail about specific use cases and scenarios to see how various application designs and workloads affect these various factors. The third entry in this series goes into different hardware and infrastructure choices.
A Couchbase Server 1.8 cluster was fairly straightforward to size, and you can find a discussion on this here, along with a calculator. With the recent feature additions to 2.0, this gets a little bit more involved...
Let’s look at two application design and requirement considerations that will have an impact on your cluster sizing:
- Is your application primarily (or even exclusively) comprised of individual document access? Or do you expect to make use of our Indexing and querying feature? A good description on how to combine key/document access and views can be found here.
- Do you plan on using the new Cross-datacenter replication (XDCR) feature?
Individual Document Access and upgrading from 1.8
The simplest use case to discuss is when an application really only requires individual document access (typically referred to as a “key/value” use case). These include session and user profile stores, ad targeting applications, many social games, etc.
Sizing these use cases maps very similarly to the 1.8 sizing guidelines with a few modifications. As with all sizing discussions, the same 5 determining factors apply, with RAM sizing typically being the one that wins out here.
- RAM: See part 1 for a discussion on the considerations of sizing RAM properly. For those of your familiar with 1.8, not much has changed here. If you are not familiar, take a look at our sizing guidelines and calculator.
- Disk: Keep in mind that upgrading from Couchbase Server 1.8 to 2.0 may require considerably more disk space. From an IO perspective, the append-only disk format will actually read and write data faster and more consistently than 1.8. However, there is also the need for online compaction of the disk files to take place and this will require some amount of added disk IO. This blog may help you understand compaction better.
- CPU: As mentioned in part 1, the application reading and writing into and out of RAM can be handled very efficiently with very little CPU. The addition of automatic compaction does add a bit more CPU activity due to the disk IO, but won’t normally impact the application’s performance. While typical 1.x installations could “make do” with 2 cores, we are increasing this recommendation to at least 4 even for the basic “key-document” use case.
When upgrading from 1.x, it would be highly recommended to engage with Couchbase Support so we can help review the environment, provide recommendations and aid in making the upgrade go as smoothly as possible.
Cross Datacenter Replication (XDCR)
Many situations/environments create the need for data to be replicated across multiple Couchbase clusters. You need need XDCR for disaster recovery, geo-locality of data or just simply synchronization for testing. It will be important to ensure your Couchbase cluster is sized effectively. These capacity requirements are in addition to everything else the clusters are being asked to perform.
For the purpose of this discussion, I’ll take separate looks at the “source” and “destination” cluster requirements for XDCR. In production, you can have a one-many, many-one or many-many topology by combining multiple uni-directional replication streams.
1. Source Cluster
- RAM: This is a bit variable, but we have seen that there is a bit of extra RAM required just to handle the process of replicating data to another cluster. It will vary a bit based on how many buckets (and replication streams) but a good estimate is to leave about 2GB of RAM free per node above and beyond what is needed to hold the data. This is needed by the Erlang process (beam.smp on Linux, erl.exe on Windows).
- Disk: The current implementation of XDCR involves sending data from the disk subsystem of a source cluster to the destination. This allows for no queues to be maintained in RAM, and the restart of a replication process to still take place without resending all of the data. Additionally, multiple writes from the local application are automatically de-duplicated before being replicated across, saving precious network bandwidth. What this means, however, is that the source cluster of any XDCR will have increased disk IO requirements in order to read the documents once they have been persisted to disk
- CPU: The source cluster of an XDCR will also have increased CPU requirements in order to process and send the data to another cluster. By default, there are 32 streams per node per replication. This can be increased, but keep in mind that a higher number of streams will require more CPU. A general best practice should be to have one extra core per XDCR bucket being replicated.
2. Destination Cluster:
- RAM and disk sizing: It should generally go without saying, but if you’re sending more data to a cluster (as in aggregating it from other clusters), it will need more RAM and disk space to contain that data. If you’re simply replicating the same dataset around, there shouldn’t be any increased requirements here.
- Disk IO: Even though you may not need more space for the same dataset (that is you do not need the working set of the source in memory), you will likely be incurring more writes from the source cluster(s) via XDCR. You will need enough disk IO to persist these to disk and also enough IO bandwidth to compact this data along with the local application writes.
- In general, it is important to size both the source and destination clusters and test how it performs for your workload. However, it is known that the XDCR performance at the source cluster will benefit greatly by having more nodes as the data needing to be sent is done so in parallel.
While the inter-cluster network requirements for Couchbase Server may not typically require much consideration, transferring data over a WAN link will likely require a bit more thought. Couchbase Server will automatically checkpoint and restart XDCR as the network allows, but the replication is ultimately only effective if the data can get to the other side.
4. Bi-directional replication and other topologies
This discussion focused on single source and destination clusters replicating a single bucket. Bi-directional replication turns each cluster into both a source and a destination with the capacity requirements of both.
When combining multiple clusters and/or multiple buckets, it is important to ensure that each has enough capacity. For example, a one-many will create more replication streams on the one source, and a many-one will generate significant overhead (disk space / IO / CPU) on the destination cluster.
Views and Queries
One of the most sought-after new features of Couchbase Server 2.0 is the addition of views for indexing/querying your data. As with any new capability, it comes with certain requirements and considerations relating to sizing of the cluster.
The view system is currently disk based which means that both disk space and IO requirements will be increased as well as needing more RAM outside of the object-based cache to improve the query throughput and latency via the OS’s built-in IO caching.
The actual impact will depend quite significantly on both your application’s workload as well as the amount and complexity of your design documents and view definitions. It’s very important to follow these view writing best practices when designing your views to mitigate and lessen the sizing requirements as much as possible. Write heavy workloads will put more pressure on the actual updating of indexes and more read-oriented workloads will put more load on the querying of those indexes.
1. Index Updating
An insert or update of a document will eventually trigger an index update to take place. This trigger is either part of our normal background process and/or from an application request. The effective processing of these updates will have the following impacts:
- Disk Size: There are extra files created both for each design document, as well as each view within. These files are then filled with the keys and values emitted by your view descriptions (which is why it’s important to keep them as small as possible). These are append-only files, so they will grow with each new insert, update or delete and are compacted in a similar fashion to the data files.
- Disk IO: Possibly the biggest impact that using views will have on a system is on the disk IO. The index updating itself will potentially require a significant amount of IO as each document write may need to be written into multiple index files. These updates are handled by separate threads and processes so as not to impact the normal functioning of the database, but having enough IO will be crucial to keeping the indexes up-to-date. Additionally, the normal and automatic compaction of indexes will consume some amount of disk IO and is required to keep the size of the disk files in check.
- CPU: Along with the disk IO, view updates will incur an added amount of CPU. By default there are 4 indexers per bucket per node. Systems with many cores will be benefited by increasing this number. The general practice should be to have 1 additional core per design document (each view is processed serially within a design document, but multiple design documents can be processed in parallel)
- Replica Indexes: Off by default, these can be turned on to greatly aid the query performance after the failure of a node. However, they will at least double (depending on how many replicas are configured) the amount of disk space, disk IO and CPU load required by the system. Along with the primary indexers, there are 2 replica indexers per bucket per node by default.
The index updating process scales linearly by adding more nodes because each node is only responsible for processing it’s own data.
It is our best practice to configure the disk path for indexes to be on a separate drive / drive from the data files. Since each set of files is append-only, writes are always sequential on a file. However, combining both the data and index files creates much more random IO on the shared disk. Additionally, write heavy workloads should seriously consider using SSD’s.
2. View Querying
In order for an application to receive the best performance when querying views, the following sizing impacts should be taken into consideration:
- RAM: Even though the index files are stored on and accessed from disk, the format of these disk files lends itself very well to the OS’s normal disk IO caching. Because of this, it is important to leave a certain amount of RAM available to the system outside of the Couchbase quota. The actual number will depend greatly on the size of your indexes, but even a relatively small amount can have a large impact on query latency and performance.
- Disk IO: While no extra space will be required simply by querying the views, there will be some amount of additional disk IO required. This will depend greatly on how frequently the indexes are updated and then queried, as well as how much RAM is available to cache them.
- CPU: There will also be increased CPU requirements when querying the views due to the handling of the individual querying and the merging of results from the multiple nodes.
Given the scather/gather implementation of indexing and querying, every node needs to participate in every query request. Query throughput can be improved by increasing availble file system cache or physical hardware.
3. Effect of Views on Rebalancing
When taking advantage of the new Views feature of Couchbase Server 2.0, it will be important to know and prepare for the impact this will have on rebalancing of the cluster.
At some point throughout the rebalancing process, the data that is moving from one node to another needs to be reflected in the index(es) on that node in order to keep queries consistently returning the same results. We have designed the system to perform this update throughout the entire rebalancing process rather than try to catch everything up all at the end. As each data partition is moved , the index is updated before transferring ownership from one to the other. This behavior is configurable. You can disable index-aware rebalance if your application can handle inconsistent results and significantly speed up rebalance time.
Leaving this setting on leads to the following impacts:
- RAM: Because there is extra processing to be done, more RAM will be used for holding the pending writes in the disk write queue of each destination node
- Disk Size: There will be a significantly increased amount of disk space used during rebalance to ensure both the safety and consistency of the views. This is cleaned up once the data is not needed anymore, but due to the heavy amount of writes, it’s possible that the compaction process is not able to keep up until rebalance has completed.
- Disk IO: Since there will be a large amount of data being read, written and indexed all at the same time, disk IO will be sharply increased during rebalance.
Along with nearly every other part of Couchbase, the rebalancing process is benefited by having more nodes. Imagine adding just one node to a cluster. At one extreme, going from 1 to 2 nodes is always the worst case scenario. This is because half of the dataset needs to be moved. In most cases, a set of replica data is being created during rebalance as well. In this special case, there is only one node available to handle the entire reading and receipt of data. In contrast, going from 21 to 22 nodes is significantly less intensive. Only 1/21 of the data needs to be moved, likely no replicas are being created and the read/receipt load is shared by all 21 nodes.
For any complex system (especially a database), sizing is never an easy task. The variables are numerous and the considerations and decisions must always be based on the individual requirements of the application and availability of resources. We will constantly strive to do our best to provide guidance, but it is very important to test and stage a system with your application workload / requirements both before and after deployment.