In this post I will describe how the Couchbase QA team goes to great lengths to test Couchbase products, using DCP rollbacks as an example. I will first give some background on what DCP is and what a DCP rollback is.
What is DCP?
DCP stands for Database Change Protocol. It streams data changes between Couchbase nodes, such as to replicas or views, and to external clients such as XDCR or FTS. It follows a producer/consumer model and provides high throughput and low latency. Data is streamed per vbucket. It is described here and there is a deep dive video here.
Of particular interest is the sequence number. The sequence number is a monotonically increasing number associated with mutations on a vbucket and is used to keep the producer and consumer in sync.
What is a rollback?
No, not these rollbacks
A Couchbase rollback occurs when a client connects to a producer with a sequence number greater than what the producer has. To put it another way, the client has newer mutations, but because these are not the “truth” it must roll back, or undo, some mutations to align with the new truth.
This is unusual, but it does happen, and for the sake of data integrity it is important that it is handled correctly.
In a live environment this could happen in the following scenarios:
Failover with mutations in flight (this can occur with a hard failover)
- Client is connected to active node A
- The mutation with sequence number 100 arrives at the client and is in flight to replica node B
- Before the mutation with sequence number 100 arrives at node B there is a failover; node B has only sequence number 90
- The client, recognizing the failover, connects to the newly active node B with sequence number 100
- Node B, having only sequence number 90, responds to the client asking for a rollback to sequence number 90
- The client undoes the effects of the mutations for sequence numbers 91-100 and requests a connection to the producer with sequence number 90
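The exchange in the steps above can be sketched as a small simulation. This is only an illustrative model: `Producer`, `Client`, `stream_request` and `rollback_to` are hypothetical names, not part of any Couchbase SDK, and the sketch ignores everything other than the sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class Producer:
    """Toy stand-in for one node's view of a vbucket (hypothetical, not a Couchbase API)."""
    high_seqno: int

    def stream_request(self, start_seqno: int):
        # A client that is ahead of the producer is asked to roll back.
        if start_seqno > self.high_seqno:
            return ("rollback", self.high_seqno)
        return ("ok", start_seqno)


@dataclass
class Client:
    """Toy consumer that remembers each applied mutation by sequence number."""
    applied: dict = field(default_factory=dict)

    @property
    def last_seqno(self) -> int:
        return max(self.applied, default=0)

    def apply(self, seqno, key, value):
        self.applied[seqno] = (key, value)

    def rollback_to(self, seqno):
        # Undo every mutation newer than the producer's rollback point.
        for newer in [s for s in self.applied if s > seqno]:
            del self.applied[newer]

    def connect(self, producer):
        status, seqno = producer.stream_request(self.last_seqno)
        if status == "rollback":
            self.rollback_to(seqno)
            # Retry from the sequence number the producer asked for.
            status, seqno = producer.stream_request(self.last_seqno)
        return status, seqno


client = Client()
for seqno in range(1, 101):
    client.apply(seqno, f"key{seqno}", "v")  # client has seen seqnos 1-100

node_b = Producer(high_seqno=90)             # newly active node only has 1-90
print(client.connect(node_b))                # ('ok', 90)
print(client.last_seqno)                     # 90
```

After the rollback the client holds only mutations 1-90, which matches what node B knows.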
Crash with non-persisted data
- Client is connected to active node A
- Mutations up to sequence number 90 have been persisted
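- Mutations up to sequence number 100 arrive at the client but are not yet persisted
- Memcached crashes and restarts; there is no failover
- Client reconnects to node A with sequence number 100
- As in steps 5 and 6 above, node A asks for a rollback to sequence number 90, and the client undoes mutations 91-100 and reconnects with sequence number 90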
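The crash scenario can be modelled in the same spirit. The sketch below assumes a toy `VBucket` class (a hypothetical name) in which mutations sit in memory until an explicit `persist()` call, and a restart keeps only what reached disk. It is a simplification for illustration, not how memcached is implemented.

```python
class VBucket:
    """Toy vbucket: mutations live in memory until flushed to disk (hypothetical model)."""

    def __init__(self):
        self.memory = {}      # seqno -> (key, value)
        self.disk = {}        # persisted copy of the same mapping
        self.high_seqno = 0

    def mutate(self, key, value):
        self.high_seqno += 1  # sequence numbers only ever increase
        self.memory[self.high_seqno] = (key, value)
        return self.high_seqno

    def persist(self):
        self.disk.update(self.memory)

    def crash_and_restart(self):
        # Only persisted mutations survive, so the high seqno falls back.
        self.memory = dict(self.disk)
        self.high_seqno = max(self.disk, default=0)

    def stream_request(self, start_seqno):
        if start_seqno > self.high_seqno:
            return ("rollback", self.high_seqno)
        return ("ok", start_seqno)


node_a = VBucket()
for i in range(90):
    node_a.mutate(f"key{i}", "v")
node_a.persist()                      # seqnos 1-90 are on disk
for i in range(90, 100):
    node_a.mutate(f"key{i}", "v")     # seqnos 91-100 are in memory only

client_seqno = node_a.high_seqno      # the client has received up to 100
node_a.crash_and_restart()
print(node_a.stream_request(client_seqno))  # ('rollback', 90)
```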
Testing for and with DCP Rollbacks (and how they break)
The Couchbase QA organization tests rollbacks in three ways:
The producer requests a rollback when the client requests a sequence number greater than what is known by the producer. We verify that the producer really does ask the client to roll back and that it returns consistent data from that point.
We have developed a custom DCP client. It performs mutations and, in parallel, creates DCP connections and exercises specific scenarios. It can manipulate the requested sequence number, trigger interactions with other features such as compaction, and monitor the persistence state.
When a client receives a rollback request, we verify that it properly undoes the later mutations and correctly applies the new mutations.
The following scenario will deterministically cause a client to get a rollback request:
- Stop persistence
- Kill memcached and memcached restarts
- DCP connections will be rolled back to the sequence number prior to persistence being stopped
Test writers use the above scenario to trigger the rollback and verify that the client properly undoes the rolled-back mutations.
Typical issues found on the consumer side involve the consumer retaining information associated with the rolled-back mutations.
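To make the "retained information" class of bug concrete, here is a minimal sketch of the kind of check a test might perform. It assumes a hypothetical consumer, `IndexingConsumer`, that keeps a derived index alongside its documents. The undo-log approach shown is just one way a consumer could implement rollback and is not taken from any Couchbase component.

```python
class IndexingConsumer:
    """Toy consumer that keeps documents plus a derived index of value -> keys."""

    def __init__(self):
        self.docs = {}        # key -> value
        self.index = {}       # value -> set of keys
        self.undo_log = []    # (seqno, key, previous value or None)

    def apply(self, seqno, key, value):
        self.undo_log.append((seqno, key, self.docs.get(key)))
        self._set(key, value)

    def _set(self, key, value):
        # Remove the old index entry before writing the new one.
        old = self.docs.pop(key, None)
        if old is not None:
            self.index[old].discard(key)
            if not self.index[old]:
                del self.index[old]
        if value is not None:
            self.docs[key] = value
            self.index.setdefault(value, set()).add(key)

    def rollback_to(self, seqno):
        # Undo newest first so each key returns to its state at `seqno`.
        while self.undo_log and self.undo_log[-1][0] > seqno:
            _, key, previous = self.undo_log.pop()
            self._set(key, previous)


def check_no_residue(consumer, expected_docs):
    """Verify that documents and the derived index both match the expected state."""
    assert consumer.docs == expected_docs
    rebuilt = {}
    for key, value in expected_docs.items():
        rebuilt.setdefault(value, set()).add(key)
    assert consumer.index == rebuilt


consumer = IndexingConsumer()
consumer.apply(1, "a", "red")
consumer.apply(2, "b", "blue")
consumer.apply(3, "a", "green")   # will be rolled back
consumer.apply(4, "c", "red")     # will be rolled back
consumer.rollback_to(2)
check_no_residue(consumer, {"a": "red", "b": "blue"})
```

A consumer that restored `docs` but forgot to clean up `index` would pass a naive document comparison and still fail the second assertion, which is the kind of leftover state described above.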
We perform hard failovers at high mutation rates (which cause active vbuckets to get ahead of the replicas) while views and 2i are in use. As a result, there will be data loss while promoting replica vbuckets to active and rolling back sequence numbers, but the cluster should remain stable. Typical bugs found include the newly promoted replica having inconsistent data and clients not handling rollbacks correctly.