Couchbase Collections introduces the separation of bucket data into logical scopes and collections, on top of the Couchbase JSON database. This lets you organize data into the equivalent of schemas and tables, a concept that most RDBMS users are familiar with, and enables finer-grained access control on individual scopes and collections. Note that the introduction of scopes and collections does not imply that data of a specific type has to be kept apart from other data, or must reside in its own collection. On the contrary, a collection is still first and foremost a collection of JSON documents, and as such the flexibility of the schemaless model remains the same.
The question is whether you should consider using Couchbase collections if you are comfortable with the bucket model, or already have a well-configured Couchbase cluster. In this blog, I will outline a few areas of the indexing service that we have optimized, which could help you decide whether to migrate from the Couchbase bucket model to the new collection model.
The indexing pipeline for the bucket model
Here is a diagram showing the index build pipeline.
- The projector process in the data service is solely responsible for streaming the bucket data to the indexing service.
- The projector uses a single DCP stream and evaluates every mutation against the index metadata to determine whether the document should be streamed to the indexing service.
- The projector streams only the specific columns that the indexing service maintains for its indexes.
One point that is clear from the above diagram is that the projector has to consider every bucket mutation for every index in the cluster.
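The cost of this model can be illustrated with a minimal sketch. It is not the projector's actual code: `IndexDef`, `project_bucket_mutation`, and the sample index names are hypothetical, and the index WHERE clause is stood in for by a plain Python predicate. It only shows that, with one stream per bucket, every mutation is checked against every index definition.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Document = Dict[str, Any]


@dataclass
class IndexDef:
    """Hypothetical, simplified index definition (not a Couchbase type)."""
    name: str
    fields: List[str]                   # fields the index maintains
    where: Callable[[Document], bool]   # stand-in for the index WHERE clause


def project_bucket_mutation(
    doc: Document, indexes: List[IndexDef]
) -> Tuple[Dict[str, Document], int]:
    """Check one mutation against every index defined on the bucket.

    Returns the projected entries keyed by index name, plus the number of
    WHERE-clause evaluations that were needed.
    """
    entries: Dict[str, Document] = {}
    checks = 0
    for idx in indexes:
        checks += 1  # every index is evaluated for every mutation
        if idx.where(doc):
            # Only the indexed fields are sent on to the indexing service.
            entries[idx.name] = {f: doc.get(f) for f in idx.fields}
    return entries, checks


# Assumed example data: three indexes that distinguish documents by a "type" field.
bucket_indexes = [
    IndexDef("idx_airline_name", ["name"], lambda d: d.get("type") == "airline"),
    IndexDef("idx_hotel_city", ["city"], lambda d: d.get("type") == "hotel"),
    IndexDef("idx_route_src", ["sourceairport"], lambda d: d.get("type") == "route"),
]

hotel_doc = {"type": "hotel", "name": "Sea View", "city": "Brighton"}
bucket_entries, bucket_checks = project_bucket_mutation(hotel_doc, bucket_indexes)
print(bucket_entries, bucket_checks)
```

In this sketch a single hotel mutation triggers three predicate evaluations even though only one index is interested in it; the number of checks grows with the number of indexes in the bucket.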
The indexing pipeline for the collection model
In the collection model, DCP streaming between the data service and the indexing service happens at the collection level. While this implies more DCP streams, it benefits downstream processing, where the projector decides which indexing service to send each mutation to.
There is a small difference in how this works for the initial index build versus index updates.
- A DCP stream is created for each collection during the initial index build, resulting in a smaller workload for the projector.
- The projector no longer needs to evaluate the index WHERE clause to determine if a mutation qualifies for the index.
- For index updates, the DCP stream data is now prefixed with the collection ID, which tells the projector which indexes to send the change to.
- As with the initial build, the projector no longer needs to evaluate the index WHERE clause.
- The index ingestion check is limited to the indexes defined on the updated document's collection, instead of all indexes in the bucket. This results in significant savings in CPU and disk I/O.
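The routing described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the projector's implementation: the registry `INDEXES_BY_COLLECTION`, the function `project_collection_mutation`, and the collection IDs 8, 9, and 10 are all hypothetical. The point is that the collection ID carried with the mutation selects the candidate indexes directly, so indexes on other collections are never examined.

```python
from typing import Any, Dict, List, Tuple

Document = Dict[str, Any]

# Hypothetical registry: collection ID -> [(index name, indexed fields)].
# The IDs and index names are made up for this example.
INDEXES_BY_COLLECTION: Dict[int, List[Tuple[str, List[str]]]] = {
    8: [("idx_airline_name", ["name"])],
    9: [("idx_hotel_city", ["city"]), ("idx_hotel_name", ["name"])],
    10: [("idx_route_src", ["sourceairport"])],
}


def project_collection_mutation(collection_id: int, doc: Document) -> Dict[str, Document]:
    """Route one mutation using the collection ID that prefixes it.

    Only indexes defined on that collection are considered, and no
    per-index predicate is evaluated in this simplified model.
    """
    candidates = INDEXES_BY_COLLECTION.get(collection_id, [])
    return {name: {f: doc.get(f) for f in fields} for name, fields in candidates}


# A mutation arriving on the stream for the (assumed) hotel collection, ID 9.
hotel_entries = project_collection_mutation(9, {"name": "Sea View", "city": "Brighton"})
print(hotel_entries)
```

A dictionary lookup replaces the scan over every index in the bucket, which is where the CPU saving in this toy model comes from.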
From a configuration standpoint, the Couchbase collections feature doesn't require any changes to the indexing service. Of course, users do need to specify the collection name, instead of just the bucket name, when creating an index on a specific collection. All the major changes were made to take advantage of working with smaller datasets, instead of processing all mutations at the bucket level.
This benefit carries through every stage of the indexing service, from the projector process to the indexer and the downstream storage layer.
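As an illustration of the naming difference, the helper below builds a CREATE INDEX statement for either model. The function `create_index_statement` is a hypothetical convenience written for this example, not part of any Couchbase SDK, and the bucket, scope, collection, and index names are sample values. The only thing it is meant to show is the keyspace: a bare bucket name in the bucket model versus a bucket.scope.collection path in the collection model.

```python
from typing import List, Optional


def create_index_statement(
    index_name: str,
    fields: List[str],
    bucket: str,
    scope: Optional[str] = None,
    collection: Optional[str] = None,
    where: Optional[str] = None,
) -> str:
    """Build a CREATE INDEX statement string (hypothetical helper)."""
    if (scope is None) != (collection is None):
        raise ValueError("scope and collection must be given together")
    # Bucket model: `bucket`; collection model: `bucket`.`scope`.`collection`
    parts = [bucket] if scope is None else [bucket, scope, collection]
    keyspace = ".".join(f"`{p}`" for p in parts)
    statement = f"CREATE INDEX `{index_name}` ON {keyspace}({', '.join(fields)})"
    if where:
        statement += f" WHERE {where}"
    return statement


# Bucket model: documents of all kinds share the bucket, so the index
# filters on an assumed "type" field.
bucket_stmt = create_index_statement(
    "idx_hotel_city", ["city"], "travel-sample", where='type = "hotel"'
)

# Collection model: the index is created directly on one collection.
collection_stmt = create_index_statement(
    "idx_hotel_city", ["city"], "travel-sample", "inventory", "hotel"
)

print(bucket_stmt)
print(collection_stmt)
```

The resulting strings could then be run through whichever query tool or SDK you already use.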