The Couchbase Autonomous Operator 1.0.0 was released just 8 months ago, swiftly followed by 1.1.0. During this period we have had tremendous feedback from our community of users. First and foremost, thank you to all who have helped shape this product and continue to do so.

The Operator version 1.2.0 is the first major update and addresses a large number of change requests.

This article looks at all the major new functionality and usability improvements in this release.  New networking and deployment options (with Helm) warrant their own posts.  Links are provided at the bottom of the page.

Cloud Certification

Kubernetes is more than just an application deployment framework; it is an abstraction layer. Moving between cloud providers as the business dictates should be simple and cost effective. As the old adage goes, “don’t keep all your eggs in one basket”, so it is prudent to embrace a multi-cloud deployment strategy. Kubernetes is the perfect medium to facilitate this.

Subtle differences between cloud providers do exist, however. Storage and networking are the main offenders. Where there are differences, there is uncertainty and unpredictability.

The Operator 1.2.0 release is the first to be fully certified on Amazon EKS, Google GKE and Microsoft Azure AKS. Our internal regression test suite is now cloud first: all tests are run on these major platforms. This gives us and our customers confidence that the Operator will run predictably and reliably in any environment, regardless of the differences that exist between cloud providers.

For further information on running in cloud environments please consult the relevant quick start guides.

Storage Improvements

The platforms supported by the Operator have also been expanded to encompass Kubernetes versions 1.11 to 1.13 and Red Hat OpenShift 3.11.

Prior to Kubernetes 1.12 you had to be very careful about persistent volume scheduling across availability zones. The scheduler had no awareness of volume placement, so it was possible for a persistent volume to be created in one availability zone while the pod trying to access it was created in another. This does not work, as a volume has to exist in the same availability zone as its consumer.

While it is possible to create a Couchbase cluster that takes into consideration these constraints, that same configuration is very verbose and hard to understand and maintain.

A new volume binding mode – WaitForFirstConsumer – is available from Kubernetes 1.12 and can be applied to a dynamically provisioned storage class. With this mode, creating a persistent volume claim does not create the underlying persistent volume straight away. Then, and only then, when the claim is attached to a pod, is the persistent volume created, in the same availability zone as the pod is scheduled in.
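As a rough illustration of the lazy binding mode described above, the following sketch builds a StorageClass manifest with `volumeBindingMode` set to `WaitForFirstConsumer` and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The class name `couchbase-lazy`, the AWS EBS provisioner and the `gp2` volume type are assumptions chosen for the example; substitute whatever your cloud provider requires.

```python
import json


def lazy_storage_class(name, provisioner, parameters=None):
    """Build a StorageClass manifest that delays volume creation
    until a pod using the claim has been scheduled."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "parameters": parameters or {},
        # The lazy binding mode discussed in the text.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


# Assumed example values: an AWS EBS backed class named "couchbase-lazy".
manifest = lazy_storage_class(
    "couchbase-lazy", "kubernetes.io/aws-ebs", {"type": "gp2"}
)
print(json.dumps(manifest, indent=2))
```

The printed document can be saved to a file and applied to the cluster, after which persistent volume claims can reference the class by name.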

Our strong recommendation is to use Kubernetes 1.12 or later and provision your Couchbase clusters with this new lazy binding mode. Your cluster configuration files will become greatly simplified and easier to manage and maintain. This method of persistent volume provisioning is used by all our internal testing, so you can be confident it works for your use case.

Couchbase Upgrade

Automated upgrade was one of the most asked for features since the Operator was released.  It is now fully supported in Operator 1.2.0.

The upgrade process follows our standard best practices for manually performing this procedure. A pod running the old version of the Couchbase data platform is selected for upgrade and a new Couchbase instance is created. A high performance swap-rebalance moves existing data onto the new node and the old one is deleted. This continues until all pods have reached the target version.
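The loop described above can be pictured with a small simulation. This is only a sketch of the ordering of steps, not the Operator's implementation: the function name `rolling_upgrade`, the pod naming scheme and the choice to upgrade pods in name order are all assumptions, and the swap-rebalance itself is reduced to a comment.

```python
def rolling_upgrade(pods, target_version, name_prefix="cb-example"):
    """Simulate a one-pod-at-a-time upgrade.

    `pods` maps pod name to Couchbase version. Returns the final pod
    map and the list of (old_pod, new_pod) replacements performed.
    Assumes generated names do not collide with existing pod names.
    """
    cluster = dict(pods)
    steps = []
    next_index = len(cluster)
    while True:
        outdated = sorted(
            name for name, version in cluster.items()
            if version != target_version
        )
        if not outdated:
            return cluster, steps
        old_pod = outdated[0]
        new_pod = f"{name_prefix}-{next_index:04d}"
        next_index += 1
        # Create the replacement first so capacity never drops.
        cluster[new_pod] = target_version
        # A swap-rebalance would move data from old_pod to new_pod here.
        del cluster[old_pod]
        steps.append((old_pod, new_pod))


pods = {f"cb-example-{i:04d}": "5.5.3" for i in range(3)}
final, steps = rolling_upgrade(pods, "6.0.1")
for old_pod, new_pod in steps:
    print(f"replaced {old_pod} with {new_pod}")
```

Because the new pod is added before the old one is removed, the simulated cluster never has fewer members than it started with, which mirrors why the real procedure can run online.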

Operating in this fashion allows safe, online upgrades with no performance degradation or client disruption.

Standard upgrade paths are enforced so a point release upgrade (5.5.3 to 5.5.4) and a major release upgrade (5.5.3 to 6.0.1) are allowed.  Upgrades skipping major releases (5.5.3 to 7.0.0) and downgrades are not allowed and are rejected.

Rollbacks are allowed midway through an upgrade operation, but only to the original version. For example, if you are upgrading from 5.5.3 to 6.0.1 and 3 of 8 pods have been upgraded, you can revert to 5.5.3.
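The upgrade path and rollback rules above can be expressed as a short check. This is a minimal sketch under stated assumptions: the function names are hypothetical, versions are assumed to be plain dotted numbers (real image tags may carry a prefix), and upgrades within the same major release are assumed to be allowed in the same way as point releases.

```python
def parse_version(version):
    """Turn '5.5.3' into (5, 5, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check_version_change(current, requested, original=None):
    """Return (allowed, reason) for a requested version change.

    `original` is the version the cluster ran before an in-progress
    upgrade started, or None when no upgrade is under way.
    """
    if original is not None and requested == original:
        return True, "rollback to the original version"
    cur, req = parse_version(current), parse_version(requested)
    if req == cur:
        return True, "no change"
    if req < cur:
        return False, "downgrades are not allowed"
    if req[0] - cur[0] > 1:
        return False, "upgrades may not skip a major release"
    return True, "upgrade allowed"


print(check_version_change("5.5.3", "5.5.4"))
print(check_version_change("5.5.3", "6.0.1"))
print(check_version_change("5.5.3", "7.0.0"))
print(check_version_change("6.0.1", "5.5.3"))
print(check_version_change("6.0.1", "5.5.3", original="5.5.3"))
```

The last two calls differ only in whether an upgrade from 5.5.3 is in progress, which is what separates a permitted rollback from a rejected downgrade.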

Triggering an upgrade is performed by editing the spec.version field in your Couchbase cluster custom resource.
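One way to make that edit without opening an editor is a JSON merge patch. The sketch below only builds the patch and the corresponding `kubectl patch` command line; it does not run anything. The cluster name `cb-example`, the resource name `couchbasecluster` and the version string are assumptions for illustration, and the exact version format expected by your cluster definition may differ.

```python
import json
import shlex


def version_patch(target_version):
    """JSON merge patch that changes only spec.version."""
    return json.dumps({"spec": {"version": target_version}})


def kubectl_patch_command(cluster_name, target_version, namespace="default"):
    """Build a shell-safe kubectl command that applies the patch."""
    args = [
        "kubectl", "patch", "couchbasecluster", cluster_name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", version_patch(target_version),
    ]
    return " ".join(shlex.quote(arg) for arg in args)


# Assumed example values; adjust to your own cluster.
command = kubectl_patch_command("cb-example", "6.0.1")
print(command)
```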

Kubernetes Upgrade

Many cloud platforms provide one-click upgrades of the entire Kubernetes cluster.  This is dangerous for a stateful application such as the Couchbase data platform and may result in data loss.  To avoid this scenario the Operator 1.2.0 release creates some additional resources to manage when pods can be killed in a safe manner.  The Couchbase cluster needs to be upgraded first in order to take advantage of this feature.

For further details please read our previous article on the topic.

TLS Certificate Rotation and Verification

TLS support has been provided since Operator version 1.0.0.  This feature allows the administrator to supply a wildcard certificate chain and private key to be installed into the Couchbase data platform by the Operator along with a root CA certificate.

While this hasn’t changed, we now support rotation of server certificates or even the entire PKI. This provides a mechanism to handle certificate expiry or private key compromise. To trigger a rotation, update the TLS secrets and the Operator will handle the rest. Please consult our TLS documentation for more details.

TLS isn’t easy. You need a good knowledge of networking and the X.509 standard to make it work, and we do see a number of cases where clusters fail to provision due to TLS misconfiguration. The error messages were cryptic at best, so we have worked to improve the user experience in this area.

Now when a cluster is created, if a TLS secret exists, its contents are validated. This checks that the certificates and keys are in the correct format, that the certificates are valid for that specific Couchbase cluster and Kubernetes namespace, and that the server certificate validates against the root CA, among other checks. All of this is reported back to the user in a simple, easy to understand message. How it does this is explained in the next section…
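To give a feel for one of these checks, the sketch below tests whether a certificate's DNS subject alternative names cover the pods of a given cluster and namespace. It is not the Operator's validation code. It assumes pods are addressed as `<pod>.<cluster>.<namespace>.svc` and that the wildcard certificate therefore needs a name such as `*.<cluster>.<namespace>.svc`; the function names and the probe pod name are hypothetical, and the names are passed in as plain strings rather than parsed from a real certificate.

```python
def san_matches(pattern, hostname):
    """Match a DNS name against a SAN, where '*' may only stand in
    for the whole left-most label."""
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def validate_sans(sans, cluster, namespace):
    """Return a list of human readable problems (empty when valid)."""
    # Assumed pod DNS name used as a probe.
    probe = f"{cluster}-0000.{cluster}.{namespace}.svc"
    if any(san_matches(san, probe) for san in sans):
        return []
    return [
        f"certificate is not valid for {probe}; "
        f"expected a DNS name such as *.{cluster}.{namespace}.svc"
    ]


print(validate_sans(["*.cb-example.default.svc"], "cb-example", "default"))
print(validate_sans(["*.cb-example.other.svc"], "cb-example", "default"))
```

A certificate issued for the wrong namespace fails with a message that names the expected entry, which is the kind of feedback the text describes.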

Dynamic Admission Control

The Kubernetes API will report back errors in your YAML manifests for core types. Prior to Operator 1.2.0 we employed a JSON schema associated with the CouchbaseCluster custom resource definition to catch simple formatting errors. For other, more complex validations specific to a Couchbase deployment, we distributed a separate binary to validate your YAML.

While this method worked, it may not have been employed by end users.  It certainly didn’t mesh well with existing workflows using kubectl or oc clients.  With dynamic admission controllers, we can plug this deep validation directly into the Kubernetes API.

Now when you create a Couchbase cluster with kubectl, for example, the API passes the request on to a dynamic admission controller distributed as part of the Operator 1.2.0.  The dynamic admission controller can then validate and modify the custom resource before responding whether to admit the request.  If the request is rejected, the reason why is relayed directly back to the client. Having to look through log files for reasons as to why your deployment is not working is a thing of the past.
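The request and response exchange described above can be sketched as a function that takes an AdmissionReview document and returns the response the API server expects. The envelope fields (`request.uid`, `response.allowed`, `response.status.message`) are those of the Kubernetes admission API; the validation rule itself, that every entry in `spec.servers` needs a size of at least 1, is a hypothetical example and not necessarily one the Operator enforces.

```python
def review(admission_review):
    """Validate a CouchbaseCluster-like object and build the
    AdmissionReview response for the API server."""
    request = admission_review["request"]
    spec = request["object"].get("spec", {})
    errors = []

    # Hypothetical rule used purely for illustration.
    servers = spec.get("servers", [])
    if not servers:
        errors.append("spec.servers must contain at least one entry")
    for index, server in enumerate(servers):
        if server.get("size", 0) < 1:
            errors.append(f"spec.servers[{index}].size must be at least 1")

    response = {"uid": request["uid"], "allowed": not errors}
    if errors:
        # This message is what the client (kubectl, oc) shows the user.
        response["status"] = {"message": "; ".join(errors)}
    return {
        "apiVersion": admission_review.get(
            "apiVersion", "admission.k8s.io/v1beta1"
        ),
        "kind": "AdmissionReview",
        "response": response,
    }


example = {
    "apiVersion": "admission.k8s.io/v1beta1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "abc-123",
        "object": {"spec": {"servers": [{"name": "data", "size": 0}]}},
    },
}
result = review(example)
print(result["response"])
```

In a real deployment this function would sit behind an HTTPS endpoint registered with the API server; only the decision logic is shown here.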

Modifying the custom resource gives us a mechanism whereby we can also automatically populate new fields required by the custom resource type.  This helps to maintain backwards compatibility with old Couchbase cluster YAML files.

For further information on how the dynamic admission controller works and is installed please consult the documentation.

Logging Improvements

Some enhancements were made to our support tool when we released the Operator 1.1.0.  These enhancements were specifically to handle the use of log volumes when used with stateless Couchbase pods.  The collection of log volumes presented the user with an interactive menu to allow selection and local download.  While this was a welcome addition it was orthogonal to how logs were collected from running pods. Running pod logs were collected unconditionally and left on the pod itself, the end user having responsibility for the download and cleanup.

With the Operator 1.2.0 release, all running pods and log volumes are displayed in a unified interactive menu. The user is able to select exactly which logs to collect. The support tool now also automatically downloads all requested logs locally and removes any intermediate files from the running pods.

We also provide the same functionality via CLI flags so that available logs can be polled and collection automated.  For further information please consult the cbopinfo documentation.

When the Operator tries to create a pod and fails, we delete that pod and retry creating it in case the error causing the failure was transient. In the common case where the container image has been incorrectly specified, or the scheduler is unable to find a node to run the pod on, we had nothing to indicate that this was the case.

In the Operator 1.2.0 release we have extended the Operator logs to cater to these cases and allow simple problem determination. A failed pod creation now triggers collection of the pod state and associated events, which are written to the log stream.

Kubernetes RBAC

The Operator functions by creating and manipulating Kubernetes resources. The Operator must be granted permissions in order to do this. In versions prior to 1.2.0 we’d simply say “grant all permission on pods” for example. While terse and easy to understand, this granted privileges to the Operator that were not strictly necessary for operation.

As of Operator version 1.2.0 any example Kubernetes roles distributed by Couchbase will be explicit about exactly what operations are required on what resource types. All stated permissions are required; the Operator cannot function without them. For further details about what permissions are required and why, please consult the RBAC documentation.
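The difference between the two styles is easy to check mechanically. The sketch below flags any rule in a Role that uses a wildcard for its verbs or resources. Both example roles are invented for illustration and do not reflect the Operator's actual permission set; consult the RBAC documentation for that.

```python
def wildcard_rules(role):
    """Return the rules of a Role that grant '*' verbs or resources."""
    return [
        rule for rule in role.get("rules", [])
        if "*" in rule.get("verbs", []) or "*" in rule.get("resources", [])
    ]


# Old style: everything on pods (illustrative only).
broad_role = {
    "kind": "Role",
    "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["*"]}],
}

# Explicit style: each operation is named (illustrative only).
explicit_role = {
    "kind": "Role",
    "rules": [{
        "apiGroups": [""],
        "resources": ["pods"],
        "verbs": ["get", "list", "watch", "create", "delete"],
    }],
}

print(len(wildcard_rules(broad_role)))
print(len(wildcard_rules(explicit_role)))
```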

The ability for the Operator to function with a role, as opposed to a cluster role, is now fully supported.  Previous verification and sanity checks that required access to cluster resources are now handled solely by the dynamic admission controller.

Next Steps

The Couchbase Autonomous Operator 1.2.0 is a big release with many new features.  The main focuses are upgradability and ease of use. We hope you enjoy doing cool new things with it as much as we have enjoyed creating it.  As always your feedback is key!

Read More

Posted by Simon Murray, Senior Software Engineer, Couchbase

Simon has almost 20 years of experience on diverse topics such as systems programming, application performance and scale out storage. The cloud is his current focus, specializing in enterprise network architecture, information security and platform orchestration across a wide range of technologies.
