OVERVIEW

Several years ago I wrote a blog post that reflects how many customers monitor their Couchbase clusters (http://blog.couchbase.com/monitoring-couchbase-cluster). Although dated, the information is still relevant, and we’ll be updating the original. The statistics and metrics outlined in that general overview are often consumed by monitoring frameworks to measure the health and performance of Couchbase for specific use cases. Since then, people have often asked us for example implementations that consume these statistics. Typically what devops teams really need is not a broader monitoring framework but simple scripts they can put to use. Cross-functional teams are often diagnosing issues in “war room” settings and want simple ways to digest recent events within a cluster.

The goal here is not to provide a comprehensive model but to outline a simple implementation that consumes Couchbase metrics via our REST API and command-line tools. Once we have something in place that can digest these metrics on a regular basis, we need somewhere to store them. In this walkthrough we'll store the metrics in a Couchbase bucket and use N1QL to understand what's happening in a given cluster. There are other ways to visualize this type of information (http://blog.couchbase.com/2016/march/http—packages.couchbase.com-releases-4.5.0-dp1-couchbase-server-enterprise_4.5.0-dp1-windows_amd64.exe.md5), but for this example we'll focus on Couchbase monitoring and the associated N1QL queries.

DIGESTING INFORMATION

I wanted to provide an example that teams would find relevant and easy to put to use. Python is familiar to many people and runs easily on a variety of systems. The same methods could be implemented in shell or just about any other language, but Python seemed a natural fit for a generalized discussion.

The complete code can be found here … pythonlab/MonitorStats.py … but we'll walk through some of the details and different ways we're gathering information from Couchbase.

The first thing we need to establish is where to find the local Couchbase binaries, which is all binPath does for us.
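As a rough sketch of what setting binPath could look like, the snippet below picks a default install directory per operating system. The find_bin_path helper and the COUCHBASE_BIN_PATH environment variable are hypothetical names used only for illustration, and the directories are the usual default install locations, which may differ on your systems.

```python
import os
import platform

# Usual default install locations (assumption: a standard install was used).
DEFAULT_BIN_PATHS = {
    "Linux": "/opt/couchbase/bin",
    "Darwin": "/Applications/Couchbase Server.app/Contents/Resources/couchbase-core/bin",
    "Windows": r"C:\Program Files\Couchbase\Server\bin",
}


def find_bin_path(system=None, override=None):
    """Return the directory expected to hold cbstats and the other tools."""
    if override:
        return override
    system = system or platform.system()
    # Fall back to the Linux location for any platform we don't recognize.
    return DEFAULT_BIN_PATHS.get(system, DEFAULT_BIN_PATHS["Linux"])


# COUCHBASE_BIN_PATH is a hypothetical override, not a Couchbase setting.
binPath = find_bin_path(override=os.environ.get("COUCHBASE_BIN_PATH"))
```

An explicit override is handy when Couchbase was installed somewhere other than the default location.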

The administrative UI is a great feature of Couchbase and is installed by default on every node in a Couchbase cluster; however, this interface is designed to provide a view into “what's happening now” in the cluster. The cluster aggregates statistics over time, so the UI doesn't provide the granular information needed to review “what happened” in the cluster. Because we're consuming statistics that the Couchbase cluster creates in real time, we need somewhere to store the results. There are other options, but we'll use Couchbase to store our historical statistics. The cluster we use to store statistics is controlled by seedNode and seedBucket (a Couchbase SDK only needs one node and the bucket name to create a connection to all nodes in a cluster).
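To make seedNode and seedBucket concrete, here is a minimal sketch of how the two values might be combined into a connection string. The host and bucket names are placeholders, and the commented-out connection call assumes the 2.x Python SDK; other SDK versions connect differently.

```python
# Placeholder values: point these at the cluster that stores the statistics.
seedNode = "stats-node.example.com"
seedBucket = "stats"


def connection_string(node, bucket):
    """Build a couchbase:// connection string from one seed node and a bucket."""
    return "couchbase://{0}/{1}".format(node, bucket)


# Assuming the 2.x Python SDK is installed, the connection could be opened as:
#   from couchbase.bucket import Bucket
#   statsBucket = Bucket(connection_string(seedNode, seedBucket))
```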

The script runs locally on each node in the cluster and captures a statistical profile of the health of that node and its view of the cluster. As a result, the cluster we're monitoring is defined as “localhost” and we'll use the Couchbase REST API to determine the name of the local host.

Everything is driven by understanding which node the script is running on and how many nodes make up the cluster. Here we capture that information, which drives the remaining portion of the monitoring script.
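One way to sketch this step is to ask the local node for /pools/default and pick out the entry flagged as thisNode. This version uses urllib from the standard library rather than requests so it has no dependencies; the credentials are placeholders, and the function names are ours, not necessarily those in the actual script.

```python
import base64
import json
import urllib.request


def fetch_pool_details(host="localhost", port=8091,
                       user="Administrator", password="password"):
    """GET /pools/default from the local node (credentials are placeholders)."""
    url = "http://{0}:{1}/pools/default".format(host, port)
    request = urllib.request.Request(url)
    credentials = "{0}:{1}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def describe_cluster(pool_details):
    """Return the local node's hostname and the number of nodes in the cluster."""
    nodes = pool_details.get("nodes", [])
    # The node that answered the request marks itself with "thisNode": true.
    local = [node["hostname"] for node in nodes if node.get("thisNode")]
    return {
        "localNode": local[0] if local else None,
        "nodeCount": len(nodes),
    }
```

Calling describe_cluster(fetch_pool_details()) on a node then yields the two values the rest of the script depends on.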

For the rest of the script we can consume specific statistics in a few different ways: either from the Couchbase utility cbstats or directly from our REST API. Here we use cbstats to get the memory utilization for a node, and a Python REST client (requests) to get the drain queue, which measures how well the cluster is keeping up with persisting data to disk. These are some of the most important statistics to monitor for a Couchbase cluster.
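The two collection paths might look roughly like the sketch below. It assumes cbstats prints one "name: value" pair per line, that authentication flags (which vary by server version) are added as needed, and that the REST payload comes from /pools/default/buckets/<bucket>/stats with samples under op.samples. The stat names mem_used and disk_write_queue are the ones we assume here; the helper names are illustrative.

```python
import os
import subprocess


def run_cbstats(bin_path, host="localhost", port=11210, bucket="default"):
    """Run 'cbstats host:port all -b bucket' and return its text output.

    Authentication flags differ between server versions and are omitted here.
    """
    command = [os.path.join(bin_path, "cbstats"),
               "{0}:{1}".format(host, port), "all", "-b", bucket]
    return subprocess.check_output(command).decode("utf-8")


def parse_cbstats(output):
    """Turn 'name: value' lines into a dict of strings."""
    stats = {}
    for line in output.splitlines():
        name, separator, value = line.partition(":")
        if separator:
            stats[name.strip()] = value.strip()
    return stats


def latest_sample(bucket_stats, name):
    """Return the most recent sample for a stat from a bucket stats payload."""
    samples = bucket_stats.get("op", {}).get("samples", {}).get(name) or []
    return samples[-1] if samples else None
```

With these in place, int(parse_cbstats(run_cbstats(binPath))["mem_used"]) would give the node's memory use, and latest_sample(payload, "disk_write_queue") the current disk write queue.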

The actual script demonstrates additional information but we ultimately capture everything in a JSON document and set a TTL for 30 days.
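As an illustration of the document we end up storing, the sketch below builds a key from the node name and a timestamp and pairs it with a 30-day TTL expressed in seconds. The key format and field names are assumptions made for this example, and the commented-out upsert call assumes the 2.x Python SDK.

```python
import datetime

# 30 days in seconds; Couchbase treats larger values as absolute timestamps.
THIRTY_DAYS = 30 * 24 * 60 * 60


def build_stats_document(node, node_count, mem_used, disk_write_queue, now=None):
    """Return (key, document) for one sample taken on one node."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S")
    # Hypothetical key layout: node name plus timestamp, so keys sort by time.
    key = "{0}::{1}".format(node, timestamp)
    document = {
        "type": "node_stats",
        "node": node,
        "nodeCount": node_count,
        "timestamp": timestamp,
        "mem_used": mem_used,
        "disk_write_queue": disk_write_queue,
    }
    return key, document


# Assuming a 2.x Python SDK bucket object named statsBucket:
#   key, document = build_stats_document("10.0.0.1:8091", 3, 1024, 0)
#   statsBucket.upsert(key, document, ttl=THIRTY_DAYS)
```

Putting the timestamp in the key means the key alone tells us when a sample was taken.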

With the monitoring data in Couchbase, we can use our N1QL query language to look for anomalies. The data uses a time series data model based on a timestamp; as a result, we can query for things happening in the cluster, and the KEY (meta.id) shows us the period of time of concern. We might need to be smart about the indexes needed to support the analytics, but for a firefighting platform this is very doable with N1QL.
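A query along these lines could surface the periods where a stat exceeded a threshold. This is only a sketch: the bucket name, the field name, and the anomaly_query helper are assumptions for illustration, the statement needs at least a primary index on the bucket, and the commented-out execution assumes the 2.x Python SDK.

```python
def anomaly_query(bucket="stats", field="disk_write_queue"):
    """Build a N1QL statement listing documents where a stat exceeds $threshold."""
    return (
        "SELECT META(s).id AS period, s.`{field}` "
        "FROM `{bucket}` AS s "
        "WHERE s.`{field}` > $threshold "
        "ORDER BY META(s).id"
    ).format(bucket=bucket, field=field)


# The bucket needs an index first, for example:
#   CREATE PRIMARY INDEX ON `stats`;
#
# Assuming the 2.x Python SDK and a bucket object named statsBucket:
#   from couchbase.n1ql import N1QLQuery
#   query = N1QLQuery(anomaly_query(), threshold=1000)
#   for row in statsBucket.n1ql_query(query):
#       print(row["period"], row["disk_write_queue"])
```

Because the document key carries the timestamp, ordering by META(s).id returns the matching samples in time order for each node.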

CONCLUSION

This isn't something that should be thrown into production as is, but it should provide a guideline on how to monitor a Couchbase environment. There is a lot available via the REST interface and cbstats, all of which can be consumed and monitored in a similar fashion.

Posted by Justin Michaels, Solutions Engineer, Couchbase
