You’ve certainly heard it before: “What gets measured gets done.”

It’s true: what you observe and measure is what you can improve.

The key to any improvement is to first identify what to measure and then collect the related metrics. Using those metrics, you can tune the underlying work and analyze the effectiveness of any changes. Then repeat the cycle until you’ve sufficiently improved.

At Couchbase, we needed to improve some of our day-to-day operations, so we created observability dashboards to help us identify issues and track improvement. We used a combination of Prometheus, which simplifies storing and querying time-series data, and Grafana, which can be used to make stunning data visualizations. In addition, we used Couchbase to store historical data for later use with its Full-Text Search and Analytics tools.

In this article, we’ll walk you through how to build your own observability dashboard using Prometheus, Grafana and Couchbase.

Your in-house data source pipelines may vary – as might your data visualization software. However, the steps we’ll show you today should be applicable across a number of tools and deployments.

Generic Observability Dashboard: Design & Architecture

In order to build a reusable and scalable tool, it’s better to work from common designs and templates as a first step. From there, you can customize as needed. With this approach, it’s quick and easy to develop future dashboards.

The diagram below shows the generic architecture of the observability dashboards we’ll build together:

An observability dashboard architecture using Prometheus, Couchbase and Grafana

In this architecture, two different data inputs form an interface to the dashboard service. Let’s take a closer look at each of these below.

  1. JSON metadata about the dashboard
    • Data source definitions, including information about the data sources (like DB URL, SQL, credentials), the file paths, and Jenkins artifact URLs.
    • The Grafana layout template (or visual dashboard view), which we’ll design first and then use as templates for panels in our later dashboards.
  2. The actual data, read from .json and .csv files and from Couchbase.
    • The design of these observability dashboards supports various data sources like Couchbase Server, and direct files like JSON documents and CSV (Comma Separated Values) files. You can extend the dashboard proxy service code (in dashboard.py) to parse other data formats as needed.

The expected output is a Grafana dashboard UI, plus metrics collected as Prometheus time series from the two inputs listed above. The central part of the above diagram shows the different services in the collection that support the creation of the dashboards.

Let’s take a closer look at the different facets and services included in the architecture diagram:

  • Dashboard proxy service:
    • This is a generic Python Flask web app service (dashboard.py) that interacts with the Grafana service to serve tabular data, and exposes API endpoints such as /query, /add, /import and /export. You can develop a similar service that keeps a generic JSON template for the Grafana panels and attaches the graph and tabular data points as target JSON to display on your Grafana dashboard.
  • Prometheus export service:
    • This is a custom Prometheus exporter (for example, prometheus.py): a Flask web app service that connects to the data sources and serves requests from Prometheus itself. At a high level, it acts as a bridge between Prometheus and the data sources. Note that this service is needed only when the data must be kept as a time series (which many trends require).
  • Grafana service:
    • This is the regular Grafana tool itself that you use to create panels and display as dashboards.
  • Prometheus service:
    • This is the regular Prometheus tool itself that holds your metrics as time-series data.
  • Alert Manager:
    • The Alert Manager receives the alerts that Prometheus raises when custom alert rules detect that certain thresholds are met, and sends out the corresponding notifications.
  • Other services:
    • Couchbase: You might already be using this NoSQL document database, but if not, you can install it through a container or directly on a different host. Couchbase stores your data as JSON documents, or you can have it store required fields as separate documents for historic trends while preparing your health or trend data.
    • Docker: You’ll need to install Docker on the host in order to use this containerized service deployment.

Sample Dashboard JSON Structure

In the table below, you’ll see a sample of the structure of both the input metadata and the input data source.

Input metadata JSON structure:

{
  "dashboard_title": "",
  "data": [
    {
      "source": "couchbase|json|csv",
      "type": "timeseries|table",
      "name": "<unique name for this data source>",
      "refresh": "<how often the data should be refreshed (seconds)>"
    }
  ],
  "grafana": [
    { "template": "link to template json" },
    {
      "title": "Links",
      "grid_position": {},
      "type": "text",
      "links": []
    }
  ]
}

Input data sources structure:

// Couchbase source
{
  "host": "<couchbase host>",
  "username": "<couchbase username>",
  "password": "<couchbase password>",
  "query": "<couchbase query>"
}
// JSON or CSV source
{
  "file": "<link to file served via http>"
}
// CSV source
{
  "delimiter": "<comma, space, tab or custom character>"
}
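
Because a single metadata file drives both the data sources and the layout, it can help to check it before handing it to the proxy service. The following is a minimal sketch, not part of dashboard.py: the function name validate_metadata and the specific checks are assumptions for illustration, based only on the fields shown in the structure above.

```python
# Allowed values are taken from the "source" and "type" fields in the sample structure.
ALLOWED_SOURCES = {"couchbase", "json", "csv"}
ALLOWED_TYPES = {"timeseries", "table"}


def validate_metadata(meta):
    """Return a list of human-readable problems found in a dashboard metadata dict."""
    problems = []
    if not meta.get("dashboard_title"):
        problems.append("dashboard_title is empty")
    seen_names = set()
    for i, entry in enumerate(meta.get("data", [])):
        where = f"data[{i}]"
        if entry.get("source") not in ALLOWED_SOURCES:
            problems.append(f"{where}: unknown source {entry.get('source')!r}")
        if entry.get("type") not in ALLOWED_TYPES:
            problems.append(f"{where}: unknown type {entry.get('type')!r}")
        name = entry.get("name")
        if not name:
            problems.append(f"{where}: missing name")
        elif name in seen_names:
            # The sample structure asks for a unique name per data source.
            problems.append(f"{where}: duplicate name {name!r}")
        else:
            seen_names.add(name)
    return problems
```

An empty list means the file passed these basic checks; anything else can be logged before the dashboard is imported.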

Deploying the Observability Dashboard Services

Use the docker-compose file below to bring up all of the required services that appear in the architecture diagram above – the Dashboard proxy, Grafana, Prometheus, the Exporter and the Alert Manager. You can install Couchbase on a different host to store your growing high-volume data.

To bring up: docker-compose up

Next, visit http://host:3000 for the Grafana page.

To bring down: docker-compose down

The contents of the service files referenced above (or snippets of them, for brevity) can be found in the Implementation section below.

Using these tools, you can create a wide variety of dashboards to suit your requirements. We’ll walk through three types of example dashboards to give you an idea of what’s possible.

Example Dashboards: Overview

Dashboards, with their measurements and metrics:

  1. Functional Regression Testing Cycles dashboards
    • Measurements: Trends among functional regression testing cycles at both the build level & component level
    • Metrics: total tests, passed, failed, aborts, total time, fresh run time, etc.
  2. Infra VMs usage dashboards, including Static VMs & Dynamic VMs
    • Measurements: Resources utilization & history
    • Metrics: active count, available count, compute hours/max/created per day, week, month
  3. Infra VMs Health dashboards, Static Servers, Jenkins Slaves VMs
    • Measurements: VM health monitoring, alerts & history tracking of VMs
    • Metrics: ssh_fail, pool_os vs real_os, cpu-memory-disk-swap usages, file descriptors, firewall rules, pool_mac_address vs real_mac_address, booted days, total and product processes, installed app versions and services etc.

Dashboard #1: Functional Regression Testing Cycles Dashboard

Problem: Before we created this dashboard, we had no trend graphs for the regression test cycles covering metrics like total time taken, pass rate, fresh runs vs. reruns (e.g., due to infrastructure issues), or the inconsistent number of aborts and failures. We also had no separate component- or module-level trends.

Solution: The plan was to create a run analyzer script that analyzes the test data already stored in the Couchbase bucket. From that, we get time-series data for the last n builds, with the targeted metrics for each build.
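
To make the idea concrete, here is a minimal sketch of that kind of aggregation. It is not the run analyzer itself: the document fields (build, total, passed, failed) and the build-number format are hypothetical, and the job documents are passed in as plain dicts rather than read from a bucket.

```python
import re
from collections import defaultdict


def build_key(build):
    """Sort key that orders builds like '7.0.0-999' before '7.0.0-1000' (assumed format)."""
    return tuple(int(part) for part in re.findall(r"\d+", build))


def summarize_builds(job_docs, last_n):
    """Aggregate per-job results into one row per build, for the newest last_n builds.

    Assumes last_n >= 1.
    """
    totals = defaultdict(lambda: {"total": 0, "passed": 0, "failed": 0})
    for doc in job_docs:
        row = totals[doc["build"]]
        for key in ("total", "passed", "failed"):
            row[key] += doc[key]
    newest = sorted(totals, key=build_key)[-last_n:]
    summary = []
    for build in newest:
        row = totals[build]
        pass_rate = 100.0 * row["passed"] / row["total"] if row["total"] else 0.0
        summary.append({"build": build, **row, "pass_rate": round(pass_rate, 1)})
    return summary
```

Each returned row corresponds to one point on a per-build trend graph; grouping by a component field in the same way would give the component-level trends.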

Dashboard snapshots:

Weekly functional regression testing cycles Grafana dashboard, part 1

Weekly functional regression testing cycles Grafana dashboard, part 2

Weekly functional regression testing cycles Grafana dashboard, part 3

Weekly functional regression testing cycles Grafana dashboard, part 4

Weekly functional regression testing cycles Grafana dashboard, part 5

Dashboard #2: Infrastructure Resources / VMs Usage Dashboard

Problem: Prior to building this dashboard, we had a large number of static and dynamic virtual machines but there was no tracking of how the hardware resources were utilized. We had no insight into metrics such as active VMs used at the time, available count, machine time used, or compute hours on a daily, weekly or monthly basis.

Solution: Our plan was to first collect the data for all the VMs, such as dynamically allocated and released IPs, exact creation and release times, and any groupings such as pools. Most of this data already existed in Couchbase Server (managed by the respective service managers). Using the flexibility of the N1QL query language, we were able to extract that data into a format suitable for the graphs we wanted to show in this observability dashboard.
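
As an illustration of how creation and release times turn into a usage metric, the sketch below sums compute hours per day. The record fields (created, released, as ISO 8601 strings) are hypothetical, and for simplicity each allocation is attributed entirely to the day it was created, which is an assumption rather than how our dashboards necessarily split usage across midnight.

```python
from collections import defaultdict
from datetime import datetime


def compute_hours_per_day(allocations):
    """Sum VM usage hours per calendar day, keyed by the day each VM was created."""
    hours = defaultdict(float)
    for alloc in allocations:
        created = datetime.fromisoformat(alloc["created"])
        released = datetime.fromisoformat(alloc["released"])
        day = created.date().isoformat()
        hours[day] += (released - created).total_seconds() / 3600
    return dict(hours)
```

Weekly or monthly totals can be derived the same way by changing the grouping key.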

Dashboard snapshots:

Static pool VMs Grafana dashboard, part 1

Static pool VMs Grafana dashboard, part 2

Server dynamic VMs Grafana dashboard

Dashboard #3: Infrastructure VMs Health Dashboard

Problem: Before we had this dashboard, regression test runs were failing inconsistently, and there were low-hanging issues with the VMs. Common issues included SSH failures, OS mismatches, VM IP switches, too many open files, swap issues, VMs needing a reboot, duplicate IPs among multiple runs, high memory usage, full disks (/ or /data), firewall rules blocking endpoint connections, and Jenkins slave issues due to high memory and disk usage. There was no observability dashboard for these metrics, and there were no checks or alerts for the health of the test infrastructure.

Solution: We decided to create an automatic periodic health check that captures metrics data for the targeted VMs such as ssh_fail, pool_os vs real_os, cpu-memory-disk-swap usages, file descriptors, firewall rules, pool_mac_address vs real_mac_address, booted days, total and Couchbase processes, installed Couchbase version and services. (In total, we captured ~50 metrics.) These metrics are exposed as a Prometheus endpoint that is displayed in Grafana, and the information is also saved in Couchbase for future data analysis. We also created alerts on the key health metrics so that we could intervene quickly, which ultimately increased the stability of the test runs.

Dashboard snapshots:

VM health Grafana dashboard, part 1

VM health Grafana dashboard, part 2

Implementation

So far, you’ve seen the high-level architecture of the observability dashboards, what services are required, what kind of dashboards you might need, and also how to deploy these services. Now, it’s time to look at some implementation details.

Our first stop is the collection and storage of metrics and the data visualization of the dashboards. Most of the data storage and display steps are similar for all use cases, but the metrics data collection depends on which metrics you choose to target.

How to Get Health Data for Your Dashboards

For the infrastructure monitoring use case, you have to collect various health metrics from hundreds of VMs to create a stable infrastructure.

For this step, we created a Python script that connects to the VMs over SSH in parallel (using a multiprocessing pool) and collects the required data. In our case, we also have a Jenkins job that periodically runs this script, collects the health data (CSV), and then saves it to the Couchbase database.

The reason we created this custom script – rather than using the readily available node exporter provided by Prometheus – is that some of the required metrics were not supported. In addition, this solution was simpler than deploying and maintaining new software on 1000+ servers. The code snippet below shows some of the checks being done at the VM level.
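
As a minimal, hypothetical sketch of such checks (not the script described above), the code below collects two of the metrics named earlier, ssh_fail and root disk usage, for many hosts in parallel. Two assumptions are worth calling out: the SSH layer is abstracted behind a run_command(host, command) callable that you would supply yourself, and the sketch uses the thread-backed ThreadPool from multiprocessing.pool instead of a process pool, since waiting on SSH is I/O-bound.

```python
from multiprocessing.pool import ThreadPool


def parse_disk_usage(df_output):
    """Return the used-space percentage of / from `df -P /` output."""
    last_line = df_output.strip().splitlines()[-1]
    # POSIX df columns: filesystem, blocks, used, available, capacity, mount point
    return int(last_line.split()[4].rstrip("%"))


def check_vm(host, run_command):
    """Collect a few health metrics for one VM; an unreachable host is flagged, not raised."""
    try:
        disk = parse_disk_usage(run_command(host, "df -P /"))
        return {"host": host, "ssh_fail": 0, "disk_used_pct": disk}
    except Exception:
        # Any failure to run the command is treated as an SSH failure in this sketch.
        return {"host": host, "ssh_fail": 1, "disk_used_pct": None}


def collect_health(hosts, run_command, workers=8):
    """Check all hosts in parallel and return one result dict per host, in input order."""
    with ThreadPool(workers) as pool:
        return pool.map(lambda host: check_vm(host, run_command), hosts)
```

Keeping the command runner injectable means the parsing and parallel collection can be tried out with a fake runner before pointing it at real VMs.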

The below code snippet shows you how to connect to Couchbase using Python SDK 3.x with key-value operations, getting a document, or saving a document in the database.
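
A minimal sketch of those key-value operations, assuming the Couchbase Python SDK 3.x is installed, might look like this. The helper names, the health:: key prefix and the document shape are hypothetical; the SDK is imported inside connect so that the two document helpers can be tried against any object exposing upsert and get.

```python
def connect(host, username, password, bucket_name):
    """Open a cluster connection and return the bucket's default collection."""
    # Imported lazily so the helpers below can be exercised without the SDK installed.
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster, ClusterOptions

    cluster = Cluster(
        f"couchbase://{host}",
        ClusterOptions(PasswordAuthenticator(username, password)),
    )
    return cluster.bucket(bucket_name).default_collection()


def save_health(collection, host, metrics):
    """Upsert one VM's metrics under a key derived from the host name (hypothetical scheme)."""
    key = f"health::{host}"
    collection.upsert(key, metrics)
    return key


def load_health(collection, host):
    """Fetch a VM's metrics document and return it as a dict."""
    return collection.get(f"health::{host}").content_as[dict]
```

With a real cluster, the collection returned by connect(...) is what you would pass to save_health and load_health.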

Implementing the Dashboard Proxy Service

For the test observability use cases, the data lives in Jenkins artifact URLs and also in Couchbase Server. To bridge these data sources (CSV, DB), we created a proxy API service that accepts requests from Grafana and returns data in a format that Grafana understands.
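
The core of such a proxy is small. The sketch below is not dashboard.py; it is a framework-free illustration of answering a /query request in the shape used by Grafana's JSON datasource plugins, where the request lists targets and the response is a list with one entry per target. The loaders registry is a hypothetical way of mapping each target name to whatever fetches its data.

```python
import json


def handle_query(body, loaders):
    """Answer a JSON-datasource /query request body.

    `loaders` maps a target name to a zero-argument callable that returns one
    response entry, e.g. {"target": ..., "datapoints": [[value, timestamp_ms], ...]}.
    """
    request = json.loads(body)
    results = []
    for target in request.get("targets", []):
        loader = loaders.get(target.get("target"))
        if loader is None:
            continue  # unknown targets are skipped rather than failing the whole panel
        results.append(loader())
    return json.dumps(results)
```

In a Flask app, a route for /query would pass the raw request body to a function like this and return the resulting JSON string.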

The below code snippets give the implementation and service preparation details.

dashboard.py

Dockerfile

entrypoint.sh

requirements.txt

How to Get the Tabular Data in Grafana

Grafana is a great tool for viewing time-series data. However, sometimes you want to show some non-time-series data in the same interface.

We achieved this goal using the Plotly plugin (Plotly is a JavaScript graphing library). Our main use case was to illustrate trends across a variety of important metrics for our weekly regression testing runs. Some of the most important metrics we wanted to track were pass rate, the number of tests, aborted jobs, and total time taken. Since the release of Grafana 8, there is limited support for bar graphs. At the time of writing, the bar graph functionality is still in beta and doesn’t offer all of the features we require, such as stacking.

Our goal was to support generic CSV/JSON files or a Couchbase N1QL query and view the data as a table in Grafana. For maximum portability, we wanted to have a single file that would define both the data sources and Grafana template layout together.

To display the tabular data, we had two viable options:

  1. Write a UI plugin for Grafana
  2. Provide a JSON proxy using the JSON datasource plugin

We chose option 2 for our implementation, since it seemed simpler than trying to learn the Grafana plugin tools and creating a separate UI plugin for the configuration.
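
To show what the JSON proxy has to produce for a table panel, here is a minimal sketch that converts CSV text into the table response shape used by Grafana's JSON datasource plugins (a dict with type, columns and rows). The function name and the column-type detection are assumptions for illustration, and the CSV is assumed to be rectangular with a header row.

```python
import csv
import io


def _is_number(cell):
    """Return True if the cell parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def csv_to_table(csv_text, delimiter=","):
    """Convert CSV text (first row is the header) into a Grafana table response entry."""
    rows = [r for r in csv.reader(io.StringIO(csv_text), delimiter=delimiter) if r]
    if not rows:
        return {"type": "table", "columns": [], "rows": []}
    header, body = rows[0], rows[1:]
    # A column is numeric only if every data cell in it parses as a number.
    numeric = [bool(body) and all(_is_number(r[i]) for r in body) for i in range(len(header))]
    columns = [
        {"text": name, "type": "number" if is_num else "string"}
        for name, is_num in zip(header, numeric)
    ]
    table_rows = [
        [float(cell) if is_num else cell for cell, is_num in zip(r, numeric)]
        for r in body
    ]
    return {"type": "table", "columns": columns, "rows": table_rows}
```

The delimiter parameter mirrors the delimiter field of the CSV source definition, so tab- or space-separated files go through the same path.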

Note that since finishing this project, a new plugin has been released that allows you to add CSV data to Grafana directly. If viewing tabular data from a CSV is your only requirement, then this plugin is a good solution.

Implementing the Prometheus Service

prometheus.yml

alert.rules.yml

How to Get Custom Metrics through the Prometheus Exporter

Many cloud-native services integrate directly with Prometheus to allow centralized metrics collection for all of your services.

We wanted to see how we could utilize this technology to monitor our existing infrastructure. If you have services that don’t directly expose a Prometheus metrics endpoint, the way to solve it is to use an exporter. In fact, there is even a Couchbase exporter to expose all of the important metrics from your cluster. (Note: In Couchbase Server 7.0, a Prometheus endpoint is directly available, and internally, Couchbase 7 uses Prometheus for server stats collection and management to service the web UI).

While creating our observability dashboards, we had various data stored in JSON files, in CSV files, and in Couchbase buckets. We wanted a way to expose all of this data and show it in Grafana both in tabular format and as time-series data using Prometheus.

Prometheus expects a simple line-based text output. Here’s an example from our server pool monitoring:
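
As a rough illustration of that format (a sketch, not our exporter's output), the helper below renders samples in the Prometheus text exposition format: an optional TYPE comment, then one line per sample made of the metric name, optional labels in braces, and the value. The available_vms metric and pool label match the alert example later in this article; the function names are hypothetical.

```python
def _escape(value):
    """Escape backslash, double quote and newline, as label values require."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, value, labels=None):
    """Render one sample line, e.g. available_vms{pool="regression"} 0."""
    if labels:
        pairs = ",".join(f'{key}="{_escape(val)}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_gauge(name, samples):
    """Render a gauge: a TYPE line followed by one line per (labels, value) pair."""
    lines = [f"# TYPE {name} gauge"]
    lines += [format_sample(name, value, labels) for labels, value in samples]
    return "\n".join(lines) + "\n"
```

For example, render_gauge("available_vms", [({"pool": "regression"}, 0)]) yields a TYPE line followed by available_vms{pool="regression"} 0.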

Let’s take a closer look at how to implement data sources from both CSV files and from Couchbase directly.

CSV Files as Your Data Source

Each time Prometheus polls the endpoint, we fetch the CSV, and for each column, we expose a metric, appending labels for multiple rows if a label is supplied in the config.
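
A minimal sketch of that column-to-metric mapping is shown below. It works on CSV text that has already been fetched; the function name, the busy_vms column used in the usage note, and the choice to skip label escaping are assumptions made to keep the example short.

```python
import csv
import io


def csv_to_metrics(csv_text, columns, label_column=None):
    """Expose each selected CSV column as a metric, with one sample per row.

    If label_column is given, its value in each row becomes a label on the sample.
    Label values are assumed not to need escaping.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = []
    # Iterate column-first so that all samples of one metric stay grouped together.
    for column in columns:
        for row in rows:
            value = float(row[column])
            if label_column:
                labels = f'{label_column}="{row[label_column]}"'
                lines.append(f"{column}{{{labels}}} {value}")
            else:
                lines.append(f"{column} {value}")
    return lines
```

Given a CSV with the columns pool, available_vms and busy_vms, calling csv_to_metrics(text, ["available_vms", "busy_vms"], "pool") produces one available_vms and one busy_vms sample per pool.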

For the above example, the CSV looks like:

Couchbase as Your Data Source

Each time Prometheus polls the endpoint, we execute the N1QL queries defined in the config, and for each query, we expose a metric, appending labels for multiple rows if a label is supplied in the config.
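
The same idea can be sketched for query results. In the example below, the config entry shape (name, query, value_key, label) is hypothetical and is not the actual queries.json schema, and run_query stands in for whatever executes the N1QL statement and returns its rows as dicts.

```python
def query_metrics(config, run_query):
    """Run each configured query and render its rows as Prometheus text lines.

    Each config entry needs "name", "query" and "value_key"; "label" is optional
    and names the row field whose value becomes a label on the sample.
    """
    lines = []
    for entry in config:
        name = entry["name"]
        value_key = entry["value_key"]
        label = entry.get("label")
        for row in run_query(entry["query"]):
            if label:
                lines.append(f'{name}{{{label}="{row[label]}"}} {row[value_key]}')
            else:
                lines.append(f"{name} {row[value_key]}")
    return lines
```

Because the query runner is passed in, the same function can be fed by the Couchbase SDK in the exporter or by canned rows while you are designing the metrics.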

Below is an example N1QL response that produces the above metrics:

This exporter Python service exposes a /metrics endpoint to be used in Prometheus. These metrics are defined in queries.json and define which queries and CSV columns should be exposed as metrics. See the below JSON snippet (reduced for brevity) as an example.

queries.json

exporter.py

Implementing the Alert Manager Service

Prometheus also supports alerting, where it evaluates a query over specific metrics for you over time. If that query starts returning results, it will trigger an alert.

For the example above you could add an alert for when the regression pool has no servers available. If you specify the query as available_vms{pool="regression"} == 0 that will return a series when there are 0 available. Once added, Prometheus tracks this for you (default is every minute). If that is all you do, you can visit the Prometheus UI and the alerts tab will show you which alerts are firing.
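
The “returns a series” behavior is what makes the comparison work as an alert condition: the expression acts as a filter, and the alert fires whenever the filtered result is non-empty. The following sketch mimics that filtering in plain Python purely for illustration; it is not how Prometheus evaluates rules internally, and the sample dict shape is hypothetical.

```python
def firing_series(samples, metric, labels, threshold=0):
    """Mimic `metric{labels} == threshold`: keep only series for which the comparison holds."""
    return [
        sample
        for sample in samples
        if sample["metric"] == metric
        and all(sample["labels"].get(key) == val for key, val in labels.items())
        and sample["value"] == threshold
    ]


def alert_is_firing(samples, metric, labels, threshold=0):
    """An alert fires when its expression returns at least one series."""
    return bool(firing_series(samples, metric, labels, threshold))
```

With a regression pool at 0 available VMs, the filter returns that one series and the alert fires; as soon as a VM becomes available again, the result is empty and the alert resolves.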

With the Alert Manager, you can take this a step further and connect communications services so that Prometheus can alert you via email or a Slack channel, for example, when an alert fires. This means you can be informed immediately via your preferred method when something goes wrong. At Couchbase, we set up alerts to be notified of high disk usage on servers as well as when servers could not be reached via SSH. See the example below:

alertmanager.yml

Conclusion

In conclusion, we hope our experience of creating observability dashboards helps you home in on the metrics that matter most in your implementation or use case, using the power of data visualization.

In our case, this effort allowed us to find server infrastructure and test stability issues. Building dashboards also reduced the number of failed tests as well as the total regression time for multiple product releases.

We hope this walkthrough helps you build better observability dashboards in the future.

Also, we’d like to extend special thanks to Raju and the QE team for their feedback on improving the targeted metrics.

Author

Posted by Jake Rawsthorne & Jagadesh Munta
