Database users of every kind have come to expect cost-based optimization for their queries.

It’s a standard feature of all traditional relational databases (RDBMS), but there hasn’t yet been a cost-based optimizer (CBO) for NoSQL document databases. That is, until now.

Cost-based optimization for Couchbase queries was released as a Developer Preview in Couchbase 6.5, and it’s now ready for general availability with Couchbase 7.0.

Let’s take a closer look at the new CBO features of Couchbase Server 7.0.

The Old Way: Rule-Based Optimization

It’s impossible to discuss cost-based optimization without referring to the previous approach to database query optimization: rule-based optimization (RBO).

With rule-based optimization, the query optimizer considers using any index that provides a path to retrieve the query result. However, when the optimizer is presented with more than one path to process the query – that is, when the query could use multiple indexes – it could end up using all indexes.

With cost-based optimization, the query optimizer calculates the cost of using each index, and then decides on the execution plan with the lowest cost.

What Is Cost-Based Optimization?

Cost-based optimization enables the database query optimizer to choose the best (lowest-cost) plan to process a query.

This matters most whenever a query has more than one potential execution path, i.e., more than one qualified index. Each query plan path is assigned a total cost for all of the operations that the query needs to perform. These costs particularly depend on the specified predicates’ selectivity. To arrive at the selectivity value of a predicate, the statistics of the predicate fields need to have been pre-computed.

Please note that the cost value is an arbitrary unit of statistics calculation: it does not equate to the resource cost or CPU usage.

Cost-Based Optimization vs Rule-Based Optimization

Cost-based query optimization always provides a better execution path when the optimizer is presented with more than one option. The computed statistics ensure that the optimizer is equipped with the knowledge about the data distribution, and subsequently the efficiency of one query plan versus another.

For instance, imagine a Couchbase Collection that contains customer information and their addresses, with an index on ix1(zipcode) and ix2(city). Statistically, there are more ZIP codes than there are cities in the U.S. But that doesn’t necessarily mean that using the ix1(zipcode) index has a lower cost (and therefore better performance) than the ix2(city) index. Why? Because we don’t yet know if the customers in the system are evenly distributed across all U.S. cities and ZIP codes. Chances are that they’re not.

In this scenario, it’s the statistics collected by the cost-based optimizer beforehand that help it decide the lowest-cost query path at runtime.

With rule-based optimization, if a query has both city and ZIP code predicates, then the query optimizer uses both indexes with intersect scan. However, the CBO only uses one index – whichever one has the lowest cost based on the data within the Collection.

How to Enable Cost-Based Optimization in Couchbase

There are two important steps you need to take in order to use cost-based optimization in Couchbase Server 7.0:

1. Enable the CBO Feature

Currently the cost-based optimizer is enabled by default in Couchbase 7.0.

However, you can disable it with the N1QL Feature Control value in Settings->Query Settings. If you plan to use cost-based optimization, then make sure it’s enabled.

2. Collect Statistics

In addition to having the cost-based optimizer enabled in your cluster, you must also have collected statistics on the underlying fields for the indexes used by your queries. Here’s a sample query on how you’d do that:

For example, using the travel-sample dataset, this above query would look like this:

You can also collect the statistics according to the index name:

Note that you can also disable the cost-based optimizer by either using the cbo_flag in your API, or by deleting the statistics with UPDATE STATISTICS <keyspace> DELETE (<index_expr1>,..).

The CBO System Bucket

When you’ve enabled cost-based optimization and you’ve collected statistics, you’ll notice an additional Bucket in your cluster called N1QL_SYSTEM_BUCKET.N1QL_SYSTEM_SCOPE.N1QL_CBO_STAT. Direct access to this Bucket is restricted to system administrators.

Using the UPDATE STATISTICS command, you can either add more stats to the Bucket or delete the stats with UPDATE STATISTICS FOR <keyspace> DELETE.

Cost-Based Optimizer in Action

Let’s take a closer look at how cost-based optimization works in Couchbase 7.0 using the travel-sample dataset in the following scenario.

Improved Index Selection with Cost-Based Optimization

Let’s say you have defined two indexes for the travel-sample._default.airport Collection.

And you have a query below:

If the above cost-based optimization weren’t used, the rules-based optimizer execution plan for the above query would be:

rules-based optimization execution plan for Couchbase N1QL query

Note that since the query predicates are on both the city and faa fields, both indexes are used in the intersect scan. Learn more about intersect scans in this article: Performance Ingredients for NoSQL: Intersect Scans in N1QL.

Next, you enable cost-based optimization for the database query. Since CBO is already enabled by default, all you need to do is to collect the statistics.

After the statistics are collected, run the SELECT query again.

You’ll see that the execution plan changes to using only the airport.faa index. This change is because the statistics are telling the optimizer that the faa field has a better selectivity compared to the city field. Therefore the airport.faa index is the more efficient index for the query to use.

Couchbase cost-based optimization execution plan for N1QL query

A Better JOIN with Cost-Based Optimization

Let’s look at another cost-based query optimization example that illustrates how the cost-based optimizer chooses a better JOIN type between airport and route.

First, we create an index to track the source airport:

And then we use the query below:

With the old rules-based optimizer, the execution plan for the above query would look like:

rules-based optimization execution plan for database JOIN

Note that the RBO execution plan above uses a nested loop to JOIN the airport Collection to the route Collection.

Next, you need to enable cost-based optimization for the query. Since CBO is already enabled by default, all you have to do is to collect the statistics for the field in the Collection.

With cost-based optimization stats available on the two underlying fields for the indexes, the optimizer switches over to using a HASH for the Collection JOIN.

Couchbase CBO optimizer query execution plan using a HASH for the Collection JOIN

Cost-Based Optimizer Troubleshooting in Couchbase

  1. Cost-based optimization is enabled by default in Couchbase 7.0. However, statistics need to be present on the specific fields in your Bucket/Collections before the cost-based optimizer functions effectively.
  2. You can disable cost-based optimization at the cluster level using the N1QL Feature Control field by adding 16 to the value. It is a set with x10 bit.
  3. CBO statistics are not updated automatically by the Query Service. You should collect statistics on a regular basis, or after a significant change to your Collection content. I recommend collecting statistics after a large upload of data; otherwise, a weekly run should suffice.
  4. Cost-based optimization is only considered for a query if the statistics are available for all of the indexes that were being considered by the optimizer. You can collect the statistics for all the indexes on a given keyspace using the following:

More Resources

To learn more about the cost-based optimizer for JSON – and the Couchbase 7.0 release in general – check out the following resources:

Want to kick the tires on cost-based optimization with Couchbase?<br/ >Get Started with Couchbase 7.0

 

Author

Posted by Binh Le

Binh Le is a Principal Product Manager for Couchbase Query service. Prior to Couchbase, he worked at Oracle and led the product management team for Sales Clould Analytics and CRM OnDemand. Binh holds a Bachelor's Degree in Computer Science from the University of Brighton, UK.

One Comment

  1. Great article, thanks. Minor typo in last paragraph of section “A better JOIN with CBO”. The word NOT should be NOW. As in

    …, the optimizer will noW switch over to using a HASH for the collection JOIN.

Leave a reply