Couchbase Autonomous Operator version 1.1.0 was released on November 15th, 2018. For the Kubernetes team it is a small release, but an important one for improving the end-user experience. In this post we look in detail at what has changed and why.

Stateful Services and Persistent Volumes

The Autonomous Operator 1.0.0 release allowed servers from a specific server scaling group to be backed by persistent storage. Support for persistent storage is essential for many reasons.

Data is the lifeblood of a database, and we take it seriously. It records who your customers are, their preferences for targeted promotion, their transactions and a multitude of other business-critical information. If you lose that data, the business may suffer in ways that range from financial penalties to loss of consumer confidence.

The Kubernetes platform that the Operator runs on is, by design, based on ephemeral resources that are only available as long as the process requiring them is running. Kubernetes does, however, provide persistent storage volumes that allow stateful applications such as the Couchbase data platform to restore data after a server process crash or an accidental deletion.

Best Practices

We highly recommend that all scaling groups containing the data or index services use persistent storage. With persistent storage, data is not lost in a crash and is available to a replacement server instance. The entire cluster can be restored after a total power loss, which is not possible without persistent storage to record cluster state. Recovery from a crash also becomes far quicker, because the replacement server can reuse existing document data and indices and only needs to recover from replicas the small subset of data that was modified in the interim. As a side effect, logs are persisted on the same volume that enables recovery, so the root cause of the crash can be diagnosed and resolved.

Some server scaling groups, such as those running only the query service, do not rely on persistent storage for reliable operation. These services use state provided by the data and index services running on other nodes. As a result, there is no requirement for these stateless servers to use persistent storage. Without persistence, however, server logs cannot be retrieved if a stateless server crashes.

The Autonomous Operator 1.1.0 release solves this issue by allowing a lightweight log volume to be attached to servers.

Log Collection Improvements

When a server crashes, the Operator detects its log volume as orphaned and retains it for log collection. The cbopinfo support tool has been improved to detect these log volumes and present them at log collection time. The support tool selectively collects Couchbase server logs from persistent log volumes and automatically downloads them to the local machine running the command.

The support tool now redacts potentially sensitive information from server logs. Both redacted and plain logs are available to the user, who can then choose which version to submit to our support team.

Log Retention Improvements

The Operator features a new log retention policy that the cluster administrator can configure to cap resource usage. The Operator supports retaining logs based on the maximum number of log volumes allowed (the oldest volumes are deleted first if the number in existence exceeds this value) or based on how long a volume has been orphaned. Log retention policies prevent excessive resource usage and also help the administrator adhere to data retention legislation.
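The two retention rules can be illustrated with a short sketch. This is not the Operator's implementation: the LogVolume type, the volumes_to_delete function and its max_count and max_age parameters are hypothetical names chosen for this example, and applying both rules together when both are set is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class LogVolume(NamedTuple):
    # Hypothetical record of an orphaned log volume.
    name: str
    orphaned_at: datetime


def volumes_to_delete(volumes, now, max_count=None, max_age=None):
    """Return the names of orphaned log volumes a retention policy would remove."""
    # Sort oldest first so that count-based pruning removes the oldest volumes.
    ordered = sorted(volumes, key=lambda v: v.orphaned_at)
    doomed, kept = [], []
    for vol in ordered:
        # Age rule: drop volumes that have been orphaned for longer than max_age.
        if max_age is not None and now - vol.orphaned_at > max_age:
            doomed.append(vol.name)
        else:
            kept.append(vol)
    # Count rule: if too many volumes remain, drop the oldest of the remainder.
    if max_count is not None and len(kept) > max_count:
        excess = len(kept) - max_count
        doomed.extend(vol.name for vol in kept[:excess])
    return doomed
```

With no limits set, the function returns an empty list, which mirrors a cluster where orphaned log volumes are kept indefinitely.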

Cluster Management Improvements

The cluster management tool, cbopctl, has also been updated to help end users deploy supportable clusters. The presence of the default or logs volume mounts in any server scaling group signals that the end user wishes the cluster to be supportable. These volume mounts cannot be specified at the same time. Furthermore, when that intent is signaled, cbopctl enforces that every server scaling group is supportable, so logs will always be available for collection by cbopinfo. Finally, the logs volume mount cannot be used on scaling groups containing the data, index or analytics services. This helps prevent data loss scenarios by forcing the use of default volume mounts that the Operator can recover.
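The validation rules described above can be summarized in a minimal sketch. It is not cbopctl's code: the validate_groups function and the dictionary layout (name, services, mounts) are hypothetical, and it assumes that "at the same time" means within a single scaling group.

```python
# Services whose groups must not use the lightweight logs mount.
PERSISTENT_SERVICES = {"data", "index", "analytics"}
SUPPORT_MOUNTS = {"default", "logs"}


def validate_groups(groups):
    """Return a list of rule violations; an empty list means the spec passes."""
    errors = []
    # The cluster is intended to be supportable if any group declares either mount.
    supportable = any(SUPPORT_MOUNTS & set(g["mounts"]) for g in groups)
    for group in groups:
        name = group["name"]
        mounts = set(group["mounts"])
        services = set(group["services"])
        # Rule 1: default and logs cannot be specified together (assumed per group).
        if SUPPORT_MOUNTS <= mounts:
            errors.append(f"{name}: default and logs mounts cannot be combined")
        # Rule 2: logs mount is not allowed with data, index or analytics services.
        if "logs" in mounts and services & PERSISTENT_SERVICES:
            errors.append(f"{name}: logs mount not allowed with data, index or analytics")
        # Rule 3: once supportability is intended, every group needs one of the mounts.
        if supportable and not mounts & SUPPORT_MOUNTS:
            errors.append(f"{name}: needs a default or logs mount")
    return errors
```

A cluster in which no group declares either mount passes all three checks, which corresponds to the evaluation-only configuration without persistent volumes.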

The cluster management tool still allows the user to create clusters without any persistent volume backing at all for evaluation. As described, in this configuration the cluster cannot survive a power outage, data may be lost, and issues may be unsupportable because Couchbase server logs are absent.

Posted by Simon Murray, Senior Software Engineer, Couchbase

Simon has almost 20 years' experience in diverse fields such as systems programming, application performance and scale-out storage. The cloud is his current focus, with a specialization in enterprise network architecture, information security and platform orchestration across a wide range of technologies.
