Introduction

There are three things important in database systems: performance, performance, performance. For NoSQL database systems, there are three important things: performance at scale, performance at scale, performance at scale.

Understanding the index options, creating the right index, with the right keys, right order, and right expression is critical to query performance and performance at scale on Couchbase. We’ve discussed data modeling for JSON and querying on JSON earlier. In this article, we’ll discuss indexing options for JSON in Couchbase.

Couchbase 5.0 has three categories of global secondary index, distinguished by their storage engine. Each Couchbase cluster can use only one category of index storage: either the standard global secondary index or the memory-optimized global secondary index.

Standard Secondary:  Release 4.0 and above
  • Based on ForestDB
  • Released with Couchbase 4.0
Memory-Optimized Index: 4.5 and above
  • 100% of the index is in memory
  • The index is written to disk for recovery only
  • Predictable Performance
  • Better mutation rate
Standard Secondary: Release 5.0
  • Uses the lockless skiplist based Plasma storage engine for our enterprise edition
  • Uses ForestDB storage engine for community edition
  • Released with Couchbase 5.0.
  • Designed to handle very large data sets

The standard secondary index (from 4.0 to 4.6.x) uses the ForestDB storage engine to store the B-tree index and keeps the optimal working set of data in the buffer. That means the total size of the index can be much bigger than the amount of memory available in each index node.

A memory-optimized index uses a novel lock-free skiplist to maintain the index and keeps 100% of the index data in memory. A memory-optimized index (MOI) has better latency for index scans and can also process the mutations of the data much faster.

The standard secondary index in 5.0 uses the Plasma storage engine in our enterprise edition, which uses the lock-free skiplist like MOI, but supports large indexes that don’t fit in memory.

All three types of indexes implement multi-version concurrency control (MVCC) to provide consistent index scan results and high throughput. During cluster installation, choose the type of index.

The goal here is to give you an overview of the various indices you can create so that your queries execute efficiently; it is not to describe or compare and contrast these index storage options. This article does not cover the Full-Text Search (FTS) index released in Couchbase 5.0.

Let’s take the travel-sample dataset and try out these indices.

In the web console, go to Settings->Sample Buckets to install travel-sample.

Here are the various indices you can create.

  • Primary Index
  • Named primary index
  • Secondary index
  • Composite Secondary Index
  • Functional index
  • Array Index
  • ALL array
  • ALL DISTINCT array
  • Partial Index
  • Adaptive Index
  • Duplicate Indices
  • Covering Index

Background

Couchbase is a distributed database. It supports a flexible data model using JSON. Each document in a bucket will have a user-generated unique document key. This uniqueness is enforced during the insertion of the data.

Here’s an example document: the airline document from travel-sample whose document key is "airline_10".
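
{
    "id": 10,
    "type": "airline",
    "name": "40-Mile Air",
    "iata": "Q5",
    "icao": "MLA",
    "callsign": "MILE-AIR",
    "country": "United States"
}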

Type of Indexes

1. Primary Index

The following statement creates the primary index on ‘travel-sample’:
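
CREATE PRIMARY INDEX ON `travel-sample`;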

The primary index is simply the index on the document key on the whole bucket. The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index in Couchbase, is maintained asynchronously. You set the recency of the data by setting the consistency level for your query.

Here is the metadata for this index:
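
You can see it by querying the system:indexes keyspace (the output below is representative; values will differ on your cluster):

SELECT * FROM system:indexes WHERE name = '#primary';

{
    "datastore_id": "http://127.0.0.1:8091",
    "is_primary": true,
    "keyspace_id": "travel-sample",
    "name": "#primary",
    "namespace_id": "default",
    "state": "online",
    "using": "gsi"
}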

The metadata gives you additional information on the index: Where the index resides (datastore_id), its state (state), and the indexing method (using).

The primary index is used for full bucket scans (primary scans) when the query has no filters (predicates) or when no other index or access path can be used. In Couchbase, you store multiple keyspaces (documents of different types: customer, orders, inventory, etc.) in a single bucket. So, when you do a primary scan, the query uses the index to get the document keys, fetches all the documents in the bucket, and then applies the filter. This is VERY EXPENSIVE.

The document key design is somewhat like a primary key design with multiple parts.

Lastname:firstname:customerid

Example: smith:john:X1A1849

In Couchbase, it’s a best practice to prefix the key with the type of the document. Since this is a customer document, let’s prefix it with CX. Now, the key becomes CX:smith:john:X1A1849.

So, in the same bucket, there will be other types of documents.

These are simply best practices. There are no restrictions on the format or structure of the document key in Couchbase, except that they need to be unique within a bucket.

Now, if you have documents with various keys and have a primary index, you can use the following queries efficiently.

Example 1: Looking for a specific document key.

If you do know the full document key, you can use the following statement and avoid the index access altogether.

SELECT * FROM sales USE KEYS ["CX:smith:john:X1A1849"];

You can get more than one document in a statement.
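
For example (the second key here is made up for illustration):

SELECT * FROM sales USE KEYS ["CX:smith:john:X1A1849", "CX:doe:jane:X2B2850"];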

Example 2: Look for a pattern. Get ALL the customer documents.
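
With the primary index in place, a leading-pattern predicate on the document key does this (continuing the hypothetical sales example):

SELECT * FROM sales WHERE META(sales).id LIKE 'CX:%';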

Example 3: Get all the customers with “smith” as their last name.

The following query uses the primary index efficiently, fetching only the customers within a particular range. Note: this scan is case sensitive. To do a case-insensitive scan, create a secondary index on UPPER() or LOWER() of the document key.
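
SELECT * FROM sales WHERE META(sales).id LIKE 'CX:smith:%';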

Example 4: It’s common for some applications to use an email address as part of the document key since email addresses are unique. In that case, you may need to find all of the customers with “@gmail.com”. If this is a typical requirement, then store the REVERSE of the email address as the key and simply scan on the leading string pattern.

Email:johnsmith@gmail.com;   key: reverse("johnsmith@gmail.com") => moc.liamg@htimsnhoj 

Email: janesnow@yahoo.com  key: reverse("janesnow@yahoo.com") => moc.oohay@wonsenaj
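
A leading-pattern scan on the reversed key then finds all the Gmail customers (continuing the hypothetical sales example):

SELECT * FROM sales WHERE META(sales).id LIKE 'moc.liamg@%';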

2. Named Primary Index

In Couchbase 5.0, you can create multiple replicas of any index with a simple parameter to CREATE INDEX. The following will create 3 copies of the index and there has to be a minimum of 3 index nodes in the cluster.
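
A sketch (the index name is arbitrary; num_replica of 2 means two replicas plus the original, i.e., three copies):

CREATE PRIMARY INDEX idx_primary ON `travel-sample` WITH {"num_replica": 2};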

You can also name the primary index. The rest of the features of the primary index are the same, except that the index is named. A good side effect of this is that you can have multiple primary indices in Couchbase versions before 5.0 by using different names. Duplicate indices help with high availability as well as distributing the query load across them. This is true for both primary indices and secondary indices.
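
For example:

CREATE PRIMARY INDEX travel_primary ON `travel-sample`;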

3. Secondary Index

The secondary index is an index on any key-value or document-key. This index can be on any key within the document. The key can be of any type: scalar, object, or array. The query has to use the same type of object for the query engine to exploit the index.
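
For example, you can index a scalar field such as name, or index the schedule array as a single value (index names here are arbitrary):

CREATE INDEX travel_name ON `travel-sample`(name);
CREATE INDEX travel_sched_full ON `travel-sample`(schedule);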

The schedule field is an array of objects with flight details. The second index above indexes the complete array as a single value, which is not very useful unless you’re looking for the whole array.

4. Composite Secondary Index

It’s common to have queries with multiple filters (predicates). So, you want the indices with multiple keys so the indices can return only the qualified document keys. Additionally, if a query is referencing only the keys in the index, the query engine will simply answer the query from the index scan result without going to the data nodes. This is a commonly exploited performance optimization.

Each of the keys can be a simple scalar field, object, or an array. For the index filtering to be exploited, the filters have to use the respective object type in the query filter. The keys to the secondary indices can include document keys (meta().id) explicitly if you need to filter on it in the index.
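
For example, a composite index on the airline name and country (the index name is arbitrary):

CREATE INDEX travel_name_country ON `travel-sample`(name, country);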

Let’s look at queries that can and cannot exploit the index.
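
The first query below has a predicate on the leading key name, so it can use travel_name_country; since it references only name, country, and the document key, it can also be covered. The second query filters only on country, which is not the leading key, so this index alone won’t help (both are sketches):

SELECT name, country FROM `travel-sample` WHERE name LIKE 'Air%';

SELECT name, country FROM `travel-sample` WHERE country = 'United States';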

5. Functional (Expression) Index

It’s common to have names in the database with a mix of upper and lower cases. When you search for “John,” you want it to match any combination of “John,” “john,” etc. Here’s how you do it.

CREATE INDEX travel_cxname ON `travel-sample`(LOWER(name));

Provide the search string in lowercase and the index will efficiently search for already lowercased values in the index.
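
For example (illustrative; 'john' stands in for whatever name you search):

SELECT META().id, name FROM `travel-sample` WHERE LOWER(name) = 'john';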

You can use complex expressions in this functional index.

CREATE INDEX travel_cx1 ON `travel-sample`(LOWER(name), length*width, round(salary));

You’ll also see that array indexes can be created on an expression that returns an array in the next section.

6. Array Index

JSON is hierarchical. At the top level, it can have scalar fields, objects, or arrays. Each object can nest other objects and arrays. Each array can have other objects and arrays. And so on. The nesting continues.

When you have this rich structure, here’s how you index a particular array or a field within the sub-object.

Consider the array, schedule:
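
A route document’s schedule looks roughly like this (abridged):

"schedule": [
    { "day": 0, "utc": "10:13:00", "flight": "AF198" },
    { "day": 0, "utc": "19:14:00", "flight": "AF547" },
    { "day": 1, "utc": "01:31:00", "flight": "AF943" }
]

To index the day field of each element, create an array index (the index name is arbitrary):

CREATE INDEX travel_sched_day ON `travel-sample`(DISTINCT ARRAY v.day FOR v IN schedule END);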

This index key is an expression on the array that references only the elements that need to be indexed: schedule is the array we’re indexing into; v is the variable we’ve declared to reference each element/object within the array schedule; and v.day refers to the day field within each object of the array.

The query below will exploit the array index.
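
A sketch (note that the ANY variable must match the one used in the index definition):

SELECT META().id FROM `travel-sample` WHERE ANY v IN schedule SATISFIES v.day = 2 END;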

Because the key is a generalized expression, you get the flexibility to apply additional logic and processing on the data before indexing. For example, you can create functional indexing on elements of each array. Because you’re referencing individual fields of the object or element within the array, the index creation, size, and search are efficient. The index above stores only the distinct values within an array. To store all elements of an array in an index, use the ALL modifier on the expression.

CREATE INDEX travel_sched ON `travel-sample` (ALL DISTINCT ARRAY v.day FOR v IN schedule END)

An array index can be created on static field references (like above) or on an expression that returns an array. TOKENS() is one such expression, returning an array of tokens from an object. You can create an index on this array and search using the index.
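
A sketch using TOKENS() on the description field (the index name and search term are arbitrary):

CREATE INDEX travel_desc_tokens ON `travel-sample`(DISTINCT ARRAY t FOR t IN TOKENS(description) END);

SELECT META().id FROM `travel-sample` WHERE ANY t IN TOKENS(description) SATISFIES t = 'beach' END;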

Couchbase 5.0 makes it simpler to create and match array indexes. Providing the ALL (or ALL DISTINCT) prefix to the key makes it an array key.

Array indexes can be created on elements within arrays of arrays as well. There is no limit to the nested level of the array expression. The query expression does have to match the index expression.

7. Partial Index

So far, the indices we’ve created are on the whole bucket. Because the Couchbase data model is JSON and JSON schemas are flexible, an index may not contain entries for documents in which the index keys are absent. That’s expected. Unlike relational systems, where each type of row is in a distinct table, Couchbase buckets can have documents of various types. Typically, customers include a type field to differentiate distinct types.

When you want to create an index of airline documents, you can simply add the type filter to the WHERE clause of the index.

CREATE INDEX travel_info ON `travel-sample`(name, id, icao, iata) WHERE type = 'airline';

This will create an index only on the documents that have (type = ‘airline’). In your queries, you’d need to include the filter (type = ‘airline’) in addition to other filters so this index qualifies.

You can use complex predicates in the WHERE clause of the index. Various use cases to exploit partial indexes are:

  1. Partitioning a large index into multiple indices using the mod function (see the sketch after this list).
  2. Partitioning a large index into multiple indices and placing each index into distinct indexer nodes.
  3. Partitioning the index based on a list of values. For example, you can have an index for each state.
  4. Simulating index range partitioning via a range filter in the WHERE clause. One thing to remember is Couchbase N1QL queries will use one partitioned index per query block. Use UNION ALL to have a query exploit multiple partitioned indices in a single query.
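
For example, item 1 above can be done by splitting on a numeric field with the modulo operator. Each query must then repeat the matching filter, or use UNION ALL across the partitions. A sketch (index names are arbitrary; airline documents carry a numeric id field):

CREATE INDEX airline_p0 ON `travel-sample`(name, id) WHERE type = 'airline' AND id % 2 = 0;
CREATE INDEX airline_p1 ON `travel-sample`(name, id) WHERE type = 'airline' AND id % 2 = 1;

SELECT name FROM `travel-sample` WHERE type = 'airline' AND name LIKE 'A%' AND id % 2 = 0
UNION ALL
SELECT name FROM `travel-sample` WHERE type = 'airline' AND name LIKE 'A%' AND id % 2 = 1;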

8. Adaptive Index

An adaptive index creates a single index on the whole document or on a set of fields in a document. This is a form of array index that uses the {"name": value} pairs of the document as the index keys. The purpose is to avoid the bane of traditional indexes: a query having to match the leading keys of the index.

There are two advantages with an adaptive index:

  • Multiple predicates on the keyspace can be evaluated using different sections of the same index.
  • You avoid creating multiple indexes just to reorder the index keys, since queries no longer have to match the index key order.

Example:
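
A minimal sketch: an adaptive index over all fields of the airline documents, and a query that filters on fields in any combination:

CREATE INDEX ai_airline ON `travel-sample`(DISTINCT PAIRS(SELF)) WHERE type = 'airline';

SELECT name FROM `travel-sample` WHERE type = 'airline' AND country = 'United States' AND callsign = 'MILE-AIR';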

The same index can be used for queries with other predicates as well. This reduces the number of indexes you’d need to create as the document grows.

Considerations for usage:

  1. Since each attribute field has an index entry, the size of the indexes can be huge.
  2. The adaptive index is an array index. It’s bound by the restriction of the array indexes.

Please see the detailed documentation on adaptive index at Couchbase documentation.

9. Duplicate Index

This isn’t really a special type of index, but a feature of Couchbase indexing. You can create duplicate indexes with distinct names.
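
For example (index names are arbitrary):

CREATE INDEX i1 ON `travel-sample`(name, country) WHERE type = 'airline';
CREATE INDEX i2 ON `travel-sample`(name, country) WHERE type = 'airline';
CREATE INDEX i3 ON `travel-sample`(name, country) WHERE type = 'airline';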

All three indices have identical keys and an identical WHERE clause; the only difference is the name of the indices. You can choose their physical location using the WITH clause of CREATE INDEX. During query optimization, the optimizer will choose one of the names; you’ll see that in your plan. During query runtime, these indices are used in a round-robin fashion to distribute the load. This gives you scale-out, multi-dimensional scaling, performance, and high availability. Not bad!

Couchbase 5.0 makes the duplicate index SIMPLER. Instead of creating multiple indexes with distinct names, you can simply specify the number of replica indexes you require.
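
CREATE INDEX i1 ON `travel-sample`(name, country) WHERE type = 'airline' WITH {"num_replica": 2};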

This will create two additional copies of the index in addition to the index i1. Load balancing and HA behave the same as with manually created equivalent indexes.

10. Covering Index

Index selection for a query depends solely on the filters in the WHERE clause of your query. After the index selection is made, the engine analyzes the query to see if it can be answered using only the data in the index. If it can, the query engine skips retrieving the whole document. This is a performance optimization to consider while designing the indices.
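
For example, with the composite index travel_name_country from earlier, the following query is covered because it references only name, country, and the document key (a sketch):

SELECT META().id, name, country FROM `travel-sample` WHERE name LIKE 'Air%';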

All Together Now!

Let’s put together a partitioned composite functional array index now!
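
A sketch on the route documents, combining the ideas above: partial (partitioned via the WHERE clause), composite, functional, and array keys (the index name is arbitrary):

CREATE INDEX travel_route_all ON `travel-sample`(
    sourceairport,
    DISTINCT ARRAY v.flight FOR v IN schedule END,
    LOWER(airline))
WHERE type = 'route' AND stops = 0;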

Rules for Creating the Indexes

So far, we looked at the types of indexes. Let’s now look at how we go about designing the indexes for your workload.

Rule #1: USE KEYs

In Couchbase, each document in a bucket has a user-generated unique key. The documents are distributed among the nodes by hashing this key (we use consistent hashing). When you have the document key, your application can fetch the document directly via the SDKs. Even when you have the document keys, you may want to fetch the documents and do some post-processing via N1QL. That’s when you use the USE KEYS method.

Example:
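
The document keys below are from travel-sample (representative):

SELECT name, country FROM `travel-sample` USE KEYS ["airline_10", "airline_10123"];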

The USE KEYS access method can be used even when you do joins. Here’s an example:

SELECT * FROM ORDERS o USE KEYS ["ord::382"] INNER JOIN CUSTOMER c ON KEYS o.id;

In Couchbase 5.0, indexes are used only to process the first keyspace (bucket) of each FROM clause. Subsequent keyspaces are processed via a direct fetch of the documents.

SELECT * FROM ORDERS o INNER JOIN CUSTOMER c ON KEYS o.id WHERE o.state = "CA";

In this statement, we process the ORDERS keyspace via an index on (state) if it’s available. Otherwise, we use the primary index to scan ORDERS. We then fetch the CUSTOMER documents matching the id in the ORDERS document.

Rule #2: USE COVERING INDEX 

We discussed the types of indexes earlier in the article. The right index serves two purposes:

  1. Reduce the working set for the query to speed up query performance
  2. Provide the data requested by the query directly from the index, avoiding the fetch of the full document.

When a query can be answered completely by the data stored in the index, the query is said to be covered by the index. You should try to have most, if not all, of your queries covered. This reduces the processing burden on the query service and avoids additional fetches from the data service.

The index selection is still done based on the predicates in the query. Once the index selection is done, the optimizer evaluates whether the index contains all the attributes required by the query and, if so, creates a covering index access path.

Examples:
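
A sketch on a hypothetical customer keyspace:

CREATE INDEX cx_zip_name ON customer(zipcode, name) WHERE status = 'premium';

SELECT name, zipcode, status FROM customer WHERE status = 'premium' AND zipcode = '94040';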

Note that the status field in the WHERE clause of the index (status = ‘premium’) is also covered. We know every document in the index has a field called status with the value ‘premium’, so we can simply project this value. The filter_covers field in the EXPLAIN output shows this information.

As long as the index has the fields, a query can do additional filtering, joins, aggregation, and pagination after fetching the data from the indexer, without fetching the full documents.

Rule #3: USE THE INDEX REPLICATION

In a Couchbase cluster, you can have multiple index service nodes. Prior to Couchbase 5.0, you could manually create replica (equivalent) indexes to improve throughput, load balancing, and high availability.

Prior to 5.0:
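
CREATE INDEX ix_country_1 ON `travel-sample`(country) WHERE type = 'airport';
CREATE INDEX ix_country_2 ON `travel-sample`(country) WHERE type = 'airport';
CREATE INDEX ix_country_3 ON `travel-sample`(country) WHERE type = 'airport';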

We recognize the equivalency of these three indexes because the key expressions and the WHERE clause are exactly the same.

During the query optimization phase, the N1QL engine picks one of the three indexes for the index scan (assuming other requirements are met) to create the query plan. During query execution, the query prepares the scan package and sends an index scan request; based on the load statistics, the request is sent to one of the equivalent indexes. The idea is that, over time, each of them will carry a similar load.

This process of creating replica indexes (equivalent indexes) is made easier with a simple parameter.
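
CREATE INDEX ix_country ON `travel-sample`(country) WHERE type = 'airport' WITH {"num_replica": 2};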

This is the same as creating three distinct, but equivalent indexes.

Rule #4: INDEX BY WORKLOAD, NOT BY BUCKET/KEYSPACE

Consider the whole application workload and service level agreements (SLAs) for each of the queries. The queries that have millisecond latency requirements with high throughput will require customized and replica indexes, whereas others could share indexes.

There may be keyspaces on which you simply do set & get operations or can issue queries with USE KEYS. These keyspaces won’t need any indexes.

Analyze the queries to find the common predicates and projections for each keyspace. You can optimize the number of indexes based on common predicates. If one of your queries does not have a predicate on the leading key or keys, see if adding (field IS NOT MISSING) makes sense so that the index can be shared.
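
For instance, if an index on a hypothetical customer keyspace leads with state but one of your queries filters only on salary, that query can still share the index by asserting that the leading key exists:

CREATE INDEX cx_state_salary ON customer(state, salary);

SELECT META().id FROM customer WHERE state IS NOT MISSING AND salary > 100000;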

It’s fine to have a primary index while developing your application or queries. But, before you test, create the right indexes and drop the primary index from your system, unless your application uses the cases described in the “Primary Index” section. If you do have a primary index in production and queries end up doing a full primary scan with a full span on the index, you’re asking for trouble. In Couchbase, the primary index indexes all the documents in the bucket.

Every secondary index in Couchbase should have a WHERE clause, with at least a condition on the document type. This isn’t enforced by the system, but it’s good design.

Creating the right indexes is one of the best practices for performance optimization. This isn’t the only thing you’d need to do to get the best performance. Cluster configuration, tuning, SDK configuration, use of prepared statements all play a significant role.

Rule #5: INDEX BY PREDICATE, NOT BY PROJECTION

This seems like an obvious rule. But, I come across folks making this mistake every now and then.

Consider the query:
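
For instance (hypothetical customer keyspace):

SELECT city FROM customer WHERE zipcode = '94040' AND salary > 100000;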

Any of the following indexes can be used by the query:
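
For illustration, each of these hypothetical indexes qualifies because it leads with zipcode, the key carrying the equality predicate:

  1. CREATE INDEX cx1 ON customer(zipcode, salary, city);
  2. CREATE INDEX cx2 ON customer(zipcode, city, salary);
  3. CREATE INDEX cx3 ON customer(zipcode, salary);
  4. CREATE INDEX cx4 ON customer(zipcode, salary, age);
  5. CREATE INDEX cx5 ON customer(zipcode, salary, name.lastname);
  6. CREATE INDEX cx6 ON customer(zipcode, salary, status);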

To make the index completely cover the query, simply add the city field to indexes 3-6.

However, if you have an index with city as the leading key, the optimizer won’t pick that index, because the query has no predicate on city.
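
For example, this index won’t be selected for the query above, even though it contains every referenced field:

CREATE INDEX cx_city ON customer(city, zipcode, salary);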

See the detailed article on how the index scan works in various scenarios to optimize the index: https://dzone.com/articles/understanding-index-scans-in-couchbase-50-n1ql-que

Rule #6: ADD INDEXES TO MEET THE SLAs

For relational databases, three things were most important: performance, performance, performance.

For NoSQL databases, three things matter most: performance at scale, performance at scale, performance at scale.

Running a basic performance test of your queries on your laptop is one thing; running high-throughput, low-latency queries on a cluster is another. Fortunately, in Couchbase, it’s easy to identify and scale the bottlenecked resources independently, thanks to multi-dimensional scaling. Each of the services in Couchbase is distinct: data, index, and query. The Couchbase console shows statistics for each of the services independently.

After you’ve created indexes for your queries, optimized the indexes for the workload, you can add additional replica (equivalent) indexes to improve the latency because we load balance the scans among the replica indexes.

Rule #7: INDEX TO AVOID SORTING

The index already has the data in the sorted order of the index keys. After the scan, the index returns the results in the index key order.
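
Consider this index on the hypothetical customer keyspace:

CREATE INDEX cx_state_city_lastname ON customer(state, city, name.lastname);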

The data is stored and returned in the order of the keys: state, city, name.lastname. So, if you have a query that expects the data in the order state, city, name.lastname, this index helps you avoid the sort.

In the example below, the results are ordered by name.lastname, the third key of the index. Therefore, it’s necessary to sort the result set on name.lastname. EXPLAIN will tell you if the plan requires this sort.
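
SELECT state, city, name.lastname FROM customer WHERE state = 'CA' ORDER BY name.lastname;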

The query below has the perfect match for the index keys.  So, the sort is unnecessary.  In the explain output, the order operator is missing.
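
SELECT state, city, name.lastname FROM customer WHERE state = 'CA' ORDER BY state, city, name.lastname;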

Exploiting the index sort order may not seem important until you see the pagination use case. When the query specifies OFFSET and LIMIT, an index can be used to efficiently eliminate the documents the application does not care about or need. See the article on pagination for details.

N1QL optimizer first selects the index based on predicates in the query (filters) and then verifies if the index can cover all the query references in projection and order by.  After that, the optimizer tries to eliminate the sorting and decide on the OFFSET and LIMIT pushdown.  The explain shows if the OFFSET and LIMIT were pushed to the index scan.
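
For example, in this pagination query the optimizer can usually push both the ordering and the OFFSET and LIMIT down to the index scan (continuing the hypothetical customer example):

SELECT state, city, name.lastname FROM customer WHERE state = 'CA' ORDER BY state, city, name.lastname LIMIT 10 OFFSET 100;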

Rule #8: Number of indexes

There’s no artificial limit on the number of indexes you can have in the system. If you’re creating a large number of indexes on a bucket that has the data,  use the deferred build option so the data transfer between the data service and index service is efficient.
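
A sketch of the deferred build pattern (index names are arbitrary):

CREATE INDEX ixd_name ON `travel-sample`(name) WITH {"defer_build": true};
CREATE INDEX ixd_country ON `travel-sample`(country) WITH {"defer_build": true};
BUILD INDEX ON `travel-sample`(ixd_name, ixd_country);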

Rule #9: Index during INSERT, DELETE, UPDATE

The index is maintained asynchronously. Your data updates via the key-value API or any N1QL statement only update the documents in the bucket. The index service receives notification of the changes via a stream and applies them to the index. Here is the sequence of operations for an UPDATE statement: the statement uses the index to qualify the documents to update, fetches and updates those documents, then writes the documents back and returns any data requested from the UPDATE statement.
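
For example (hypothetical customer keyspace; the RETURNING clause supplies the requested data):

UPDATE customer SET status = 'gold' WHERE state = 'CA' AND salary > 150000 RETURNING META(customer).id;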

Rule #11: INDEX KEY ORDER AND PREDICATE TYPES

The index scan request created by the query uses the first N consecutive keys of the index. So, the order of the index keys is important.

Consider a query with various predicates:
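
A representative query (hypothetical customer keyspace; the contacts array and phone field are illustrative) containing the predicate types discussed below:

SELECT name, zipcode FROM customer
WHERE state = 'CA' AND type = 'premium'
  AND zipcode IN ['94040', '94301']
  AND salary < 150000
  AND age > 30
  AND ANY c IN contacts SATISFIES c.phone LIKE '650%' END;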

These are general rules for the order of keys in the index. Keys can be simple scalar attributes or expressions that return scalar values, e.g. UPPER(name.lastname).

  1. First priority is keys with equality predicates. In this query, it’s on state and type. When there are multiple predicates of the same type, choose any combination.
  2. Second priority is keys with IN predicates. In this query, it’s on zipcode.
  3. Third priority is the less than (<) predicates. In this case, it’s on salary.
  4. Fourth priority is the between predicates. This query does not have a between predicate.
  5. Fifth priority is the greater-than (>) predicates.  In this query, it’s on age.
  6. Sixth priority is the array predicates: ANY, ANY AND EVERY, and predicates after UNNEST.
  7. Look to add additional fields for the index to cover the query.
  8. After doing this analysis, look for any predicates that can be moved to the WHERE clause of the index. For example, in this case, type = “premium” can be moved because the type field is designated by the users to identify the type of customer documents.

With this, we come up with the following index.
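
Based on those priorities, a sketch of the resulting index (type moves to the WHERE clause; name is added at the end so the index also covers the projection):

CREATE INDEX cx_workload ON customer(state, zipcode, salary, age,
    DISTINCT ARRAY c.phone FOR c IN contacts END, name)
WHERE type = 'premium';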

Rule #12: Understand how to read EXPLAIN and PROFILING

No matter how many rules you follow, you’ll have to understand query plans and profiles, monitor the system under load, and tune it. The ability to understand and analyze the query plan and profiling information is the key to tuning a query and a workload. There are two fine articles on those topics. Go through and try out the examples.

Author

Posted by Keshav Murthy

Keshav Murthy is a Vice President at Couchbase R&D. Previously, he was at MapR, IBM, Informix, and Sybase, with more than 20 years of experience in database design & development. He led the SQL and NoSQL R&D team at IBM Informix. He has received two President's Club awards at Couchbase and two Outstanding Technical Achievement Awards at IBM. Keshav has a bachelor's degree in Computer Science and Engineering from the University of Mysore, India, holds ten US patents, and has three US patents pending.

5 Comments

  1. Hi Keshav,
    I am new to CB, so pls pardon my understanding.

    The foll. is mentioned in this blog
    “So, when you do the primary scan, the query will use the index to get the document-keys and fetch all the documents in the bucket and then apply the filter. So, this is VERY EXPENSIVE.”

    I have read this at many places and same is what I was told i.e. a primary index should be avoided. I am unable to understand why? In Oracle or any other RDBMS, Primary/Unique key based lookup is the fastest/best. I understand that if the N1QL query has no predicate then it will scan the entire bucket i.e. it will scan all keys using the Primary index and that would be expensive. But if the key is specified in the predicate then wouldn’t it be the fastest?

    In example 1, the specific key is mentioned in the predicate. So that should be almost as good getid(‘key’), isnt it?

    Thanks

  2. Hello pccb,

    If you have the document key (this is the unique key within the bucket), you have a built-in access method, and it is the most efficient one from N1QL.
    SELECT * FROM mybucket USE KEYS “cx:482:gn:284”;

    You can also use primary index for this by issuing:
    SELECT * FROM mybucket WHERE meta().id = “cx:482:gn:284”;

    You can do smart range scans on the meta().id using primary index:
    SELECT * FROM mybucket WHERE meta().id LIKE “cx:482:gn:%”;

    Here are the things you do need to be aware of:
    1. A Couchbase bucket can contain documents of all different types: customer, order, item, etc. A primary index will be across all these document types.
    2. If developer/user carefully constructs the query to perform equality or limited range scan, using primary index is fine.
    3. But, if someone issues a query without following these guidelines, or issues a query without any other qualifying index in production, we end up using a primary scan and scan & fetch the whole index and all the documents in the bucket. Typically, that’s a bad thing in production.

  3. Hello,

    Is there a way to avoid an index i.e. make CB NOT use the index?

    What we are trying to achieve is as follows:
    There will be the Primary index, so that developers can try out their queries. However, once they have finalized the queries including the index that is needed, we would like to them to execute it while making sure that it is not using the Primary Index. Reason being, even after creating the appropriate secondary index, it might be that the query is still using the Primary index and the developer has not realized it. So if there is a way to avoid the usage of Primary Index then they would run the query using that option.

    Thanks

  4. Thanks for the feedback. I’ve opened an improvement to add this: https://issues.couchbase.com/browse/MB-32109

  5. Is there a typo or am I missing something:
    CREATE INDEX travel_sched ON travel-sample
    (ALL DISTINCT ARRAY v.day FOR v IN schedule END)

    Is “ALL DISTINCT” valid syntax and what does it mean? I could not find it in the documentation!

    Thanks
