Couchbase Query Cost Based Optimization (CBO) was released as a Developer Preview in Couchbase 6.5, and it is now ready for GA with Couchbase v7.0.
While CBO is a ubiquitous feature with traditional RDBMS, and that users have come to expect with all databases, it is a feature that has yet to make its appearance in the NoSQL document-based databases. That is, until Couchbase came up with an approach to deliver CBO for a schema-less document model.
What is Cost-Based Optimization
Cost Based Optimization is a feature that enables the query optimization process to choose the best, i.e. lowest-cost plan, to process a query. This comes to effect when a query has more than one potential execution path, i.e. more than one qualified indexes. Each query plan path is assigned a cost, which is the total cost for all of the operations that the query needs to perform, which can depend on the specified predicates’ selectivity. And to arrive at the selectivity value of a predicate, the statistics of the predicate fields need to have been pre-computed. Please note that the cost value is an arbitrary unit of statistics calculation, it does not equate to the resource cost or cpu usage.
What is Rule Based Optmization
We can’t discuss CBO without referring to what the optimizer was using before CBO, and that is Rule Based Optimization. With RBO, the query optimizer will consider using any indexes that can provide a path to retrieve the query result. However, when the optimizer is presented with more than one path to process the query, i.e. when there are multiple indexes that the query can use, it could end up using all indexes. With CBO, the optimizer can calculate the cost for using each index, and decide on the execution plan with the lowest cost.
Cost-Based Optimization Vs Rule-Based Optimization
CBO will always provide a better execution path, when the optimizer is presented with more than one option. The computed statistics ensure that the optimizer is equipped with the knowledge about the data distribution, and subsequently the efficiency of one query plan vs another.
For instance, assume we have a Couchbase collection that contains customer information and their addresses, with an index on ix1(zipcode) and ix2(city). Statistically, we know that there are more zip codes than there are cities in the US. It does not necessarily mean that using the ix1(zipcode) will have a lower cost, and therefore better performance, than ix2(city). This is because we do not know if the customers in the system are evenly distributed across all of the US cities or zip codes. Chances are that they are not.
In this scenario, the collected statistics from the CBO feature will ensure that the correct knowledge is provided to the optimizer for it to decide at runtime. Incidentally, with Rule Based Optimization, if a query has both city and zip code predicates, then RBO will use both indexes with intersect scan. However with CBO, it will only use one index, the one with the lowest cost based on the content of the collections.
How do you enable Couchbase CBO
To use CBO, you need to make sure of two things:
Enable the CBO feature
Currently this is enabled by default with Couchbase release 7.0. However the feature could be disabled with the N1QL Feature Control value in Settings->Query Settings, so please make sure it is enabled.
In addition to having CBO enabled in your cluster, you must also have collected statistics on the underlying fields for the indexes used by your queries.
UPDATE STATISTICS FOR<keyspace>(<index_expr1>,..)
i.e..UPDATE STATISTICS FOR`travel-sample`._default.hotel(city);
You could also collect the statistics by the index name.
UPDATE STATISTICS FOR`travel-sample`._default.route.def_route_schedule_utc;
Note that you could also disable CBO, either with the cbo_flag in your API, or delete the statistics with UPDATE STATISTICS <keyspace> DELETE (<index_expr1>,..)
CBO system bucket
When CBO is enabled and statistics are collected, you will notice an additional bucket in your cluster called N1QL_SYSTEM_BUCKET.N1QL_SYSTEM_SCOPE.N1QL_CBO_STAT. Direct access to this bucket is restricted to system administrators. Using the UPDATE STATISTICS command, users can either add more stats to the bucket or delete the stats with UPDATE STATISTICS FOR <keyspace> DELETE.
Cost-Based Optmizer in action
We can use the following scenario to illustrate how CBO works, using the
in Couchbase 7.
A better index selection with CBO
Let’s say I have defined two indexes for the
CREATE INDEX def_airport_city_country ON airport(city,country);
CREATE INDEX def_airport_faa ON airport(faa);
And you have a query below:
SELECT *FROM airport WHERE city="New York"ANDfaa="JFK";
If CBO is not used, the RBO execution plan for the above query will be:
Note that since the query predicates are on both city and faa fields, both indexes are used in the intersect scan. For more information in intersect scan: https://dzone.com/articles/performance-ingredients-for-nosql-intersect-scans
Next, I enable CBO for the query. Since CBO is already enabled by default, all I need to do is to collect the statistics.
UPDATE STATISTICS FORairport(city,faa);
After the statistics are collected, run the SELECT query again. We can now see that the execution plan changes to using only the airport.faa index. This is because the statistics is telling the optimizer that the
field has a better selectivity compared to the
field, and it is therefore more efficient for the query to use the index on
A better JOIN with CBO
Another example that shows how CBO can result in a choosing a better JOIN type between airport and route.
And you have the query below:
SELECT DISTINCT route.destinationairport FROM airport JOIN route ON airport.faa=route.sourceairport;
With RBO, the execution plan for the above query will be:
Note that the execution plan uses the Nested Loop to join the airport to route collection. Next, I enable CBO for the query. Since CBO is already enabled by default, all I need to do is to collect the statistics for the field in the collection.
UPDATE STATISTICS FORroute(sourceairport);
UPDATE STATISTICS FORairport(faa);
With CBO stats available on the two underlying fields for the indexes, the optimizer will now switch over to using a HASH for the collection JOIN.
- CBO is enabled by default in Couchbase v7. However, statistics need to be present on the specific fields in your bucket/collections before the optimizer can consider CBO.
- CBO can be disabled at the cluster level using the NQL Feature field by adding 16 to the value. It is a set with x10 bit.
- CBO statistics are not updated automatically by the query service. We recommend that you collect the statistics on a regular basis, or after a significant change to the collection content. The recommendation is to collect statistics after a large upload of data, otherwise a weekly run would suffice.
- CBO will only be considered for a query if the statistics are available for all of the indexes that were being considered by the optimizer. You can collect the statistics for all the indexes on a given keyspace using.12UPDATE STATISTICS FOR<collection>INDEX((SELECT RAW name from system:indexesWHERE state='online'ANDkeyspace_id='<collection_name>'ANDbucket_id='<bucket_name>'ANDscope_id='<scope_name>'))
To find out more about our CBO for JSON, take a look at some of our other blogs:
- Less is More: Why Cost Based Optimizer?
- Cost Based Optimizer for Couchbase N1QL (SQL for JSON)
- Access Path Selection in a Relational Database Management System