Topics this article will cover
  1. What’s good with N1QL?
  2. What about FTS?
  3. But why FTS within N1QL?
  4. Basic N1QL+FTS queries
  5. Deploying N1QL+FTS
    1. Syntax(es)
    2. Abilities & limitations
  6. N1QL+FTS Internals
  7. Covered-index vs Non-covered-index queries
  8. More N1QL+FTS query examples
  9. What’s next?

1. Couchbase’s N1QL

  • N1QL is Couchbase’s SQL offering for manipulating JSON data stored within Couchbase Server.
  • N1QL statements can create, modify and drop indexes, and can select, insert, update, delete and upsert data in JSON documents.
  • N1QL expressions allow the user to perform aggregate, arithmetic, collection, comparison, conditional and several other operations.
  • Secondary indexes back all of this so that these operations run efficiently.

2. Couchbase’s FTS

  • Couchbase’s Full Text Search offering provides extensive capabilities for natural language querying and is meant to enable users to search text across multiple fields in JSON documents stored within Couchbase Server.
  • It supports language-aware searching and scores results by relevance, which the user can configure.
  • FTS sets up fast, purpose-built indexes that handle a wide range of text search workloads efficiently.

3. Why support FTS via N1QL?

  • N1QL queries can search strings, numbers, arrays and so on, and with secondary indexing (B-tree index) they also support point lookups and range scans. FTS, however, delivers better performance for text search (simple and complex compound queries) thanks to its underlying inverted index.
  • Applications need to leverage both capabilities using a single API and language.
  • Supporting compound/complex operations, such as applying aggregations, arithmetic and other SQL operations over FTS results, eases development.
  • Extending FTS’s visibility beyond the SDKs, curl and Couchbase’s UI.

4. N1QL + FTS

Couchbase 6.5 will support the proposed interface between N1QL and FTS. The user will still be required to set up FTS indexes to support their use case just like with GSI (Global Secondary Indexing). With this new interface, users will be allowed to merge and execute FTS queries from within N1QL queries seamlessly.

A new SEARCH(..) predicate will now be supported as part of the N1QL query syntax. Before getting into the internals of what goes on when a N1QL query with the SEARCH(..) predicate is executed, here is some documentation on how to create and manage FTS indexes and here are a few sample queries ..

For example, say I have an FTS index set up over some travel documents, and I want to fetch the FTS results (just ids) for all documents carrying “San Francisco” in their city field ..
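A minimal sketch of what such a query could look like, held in a Python string so that it can be handed to any SDK or REST client. The bucket name `travel-sample` and the alias `t` are assumptions made for illustration; only the SEARCH(..) predicate and the query string come from the text.

```python
# Assumed names (not from the article): bucket `travel-sample`, alias `t`.
fts_query = 'city:"San Francisco"'  # FTS query string syntax: field:"phrase"

statement = (
    "SELECT META(t).id "
    "FROM `travel-sample` AS t "
    f"WHERE SEARCH(t, '{fts_query}')"
)
print(statement)
```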

As you can see above, the FTS query string 'city:"San Francisco"' is embedded within the N1QL query. Alternatively, the N1QL query will also support an FTS query object such as ..
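A sketch of the query-object form, under the same assumed bucket name and alias. The `match_phrase` query is one plausible way to express the phrase search as an object, and the `LIMIT 10` clause sits on the N1QL side.

```python
import json

# An FTS query object for the same phrase search on the `city` field.
query_object = {"match_phrase": "San Francisco", "field": "city"}

# Assumed names (not from the article): bucket `travel-sample`, alias `t`.
statement = (
    "SELECT META(t).id "
    "FROM `travel-sample` AS t "
    f"WHERE SEARCH(t, {json.dumps(query_object)}) "
    "LIMIT 10"
)
print(statement)
```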

The above example will limit the FTS result set to 10.

Or even an FTS search request object ..
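A sketch of the search-request form, again with an assumed bucket name and alias. Here the query object is wrapped in a full search request, and the request's `size` setting carries the limit instead of a N1QL clause.

```python
import json

# A full FTS search request: the query plus request-level settings.
search_request = {
    "query": {"match_phrase": "San Francisco", "field": "city"},
    "size": 10,  # limit carried inside the FTS request
}

# Assumed names (not from the article): bucket `travel-sample`, alias `t`.
statement = (
    "SELECT META(t).id "
    "FROM `travel-sample` AS t "
    f"WHERE SEARCH(t, {json.dumps(search_request)})"
)
print(statement)
```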

The above example also limits the result set to 10, but FTS will enforce it.

OFFSET/LIMIT filters can be set either within the N1QL query syntax or within the FTS search request object. If these parameters are set within the FTS search request, FTS will stream only the requested number of results, and any N1QL parameters are then applied to whatever results FTS has sent. If they are not set within the FTS object but are set in the N1QL query, FTS streams results to N1QL until N1QL has received all the results that it needs.
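The interplay of the two limits can be pictured with a toy model. This is not how either service is implemented: a Python generator stands in for the FTS stream, and `fts_stream` and `n1ql_consume` are hypothetical names.

```python
from itertools import islice


def fts_stream(matching, size=None):
    """Toy stand-in for FTS streaming the ids of `matching` hits.

    `size` models a limit carried inside the FTS search request.
    """
    count = matching if size is None else min(matching, size)
    for i in range(count):
        yield f"doc-{i}"


def n1ql_consume(stream, limit=None):
    """Toy stand-in for N1QL reading hits and honouring its own LIMIT."""
    return list(islice(stream, limit))


# Limit inside the FTS request: FTS sends only 10 hits.
fts_limited = n1ql_consume(fts_stream(1000, size=10))
# Limit only in N1QL: the stream is open-ended and N1QL stops reading at 10.
n1ql_limited = n1ql_consume(fts_stream(1000), limit=10)
# Both set: N1QL's LIMIT applies to whatever FTS has sent.
both = n1ql_consume(fts_stream(1000, size=5), limit=10)

print(len(fts_limited), len(n1ql_limited), len(both))
```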

Also, like in the last 3 examples, one doesn’t have to explicitly specify the name of the FTS index to pick – N1QL will determine which is the best index among available ones to run the FTS query against. Should one want N1QL to use a specific index (for example – an index named “travel”), here’s how ..
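A sketch of naming the index through the third (options) argument of SEARCH(..). The `index` option key reflects my understanding of the options object; the bucket name and alias are again assumptions.

```python
import json

options = {"index": "travel"}  # ask N1QL to use the FTS index named "travel"

# Assumed names (not from the article): bucket `travel-sample`, alias `t`.
statement = (
    "SELECT META(t).id "
    "FROM `travel-sample` AS t "
    "WHERE SEARCH(t, 'city:\"San Francisco\"', "
    f"{json.dumps(options)})"
)
print(statement)
```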

Note that for all the example queries above, results are streamed and are not sorted by relevance (score, the default FTS behavior). Sorting can be achieved by explicitly stating it within the search request or by using N1QL’s ORDER BY clause. Pagination of results can likewise be achieved either by stating it explicitly within the search request or by using N1QL’s OFFSET and LIMIT clauses.

5. Deploying N1QL + FTS

  • To allow N1QL queries with the SEARCH(..) capability, the Couchbase cluster needs to have at least 1 node that runs the Search service and 1 node that runs the Query service (both these services can be configured on the same node as well).
  • FTS indexes are to be set up by the user to index the content that one wants to search over.
  • If N1QL finds no FTS index that can execute the query, it looks for GSI indexes that can potentially handle it. In this case, the SEARCH(..) predicate is applied to the intermediate results obtained. While this works, it isn’t the recommended approach since SEARCH(..) evaluation can be expensive.
  • N1QL queries with SEARCH(..) can be run from the query workbench, curl, SDK or the command line interface that Couchbase offers.
5.1 Search syntax

Here is what’s supported within the SEARCH(<field>, <FTS query>, [options]) function..

5.2 Abilities & limitations
  • All the features that the FTS search requests offer will be supported by the N1QL queries.
  • Here are some highlights on what to throw in the “query” section within the SEARCH(..) function.
  • FTS indexes that support multiple type mappings will be disallowed in the first release of the N1QL+FTS interface so that false positives don’t sneak into the result set.

6. Internal Implementation

Internally the N1QL+FTS interface supports 4 APIs (Sargable, Pageable, Search and Verify) that the N1QL service will invoke during its prepare and execution phases for queries with the SEARCH(..) predicate.

Before describing the above APIs in a little more detail, here’s a flow chart of operations supported within the interface ..

6.1 Sargable

This FTS index API will be used to determine whether the index is capable of handling the query request without returning false negatives. In the first release, an index is chosen only if all the query fields are indexed within it, or if its definition has a dynamic mapping that would cater to all the requested fields. If multiple indexes are sargable for a given query, the sargable index with the fewest indexed fields is chosen (for performance reasons).

6.2 Pageable

This API will be used to determine whether the results obtained from the underlying FTS index will be pageable. If the index is pageable, N1QL passes the filters (offset, limit, sort information) to FTS when issuing the Search(..); otherwise N1QL applies the filters to the result set after FTS has shipped it.

Note that this API is not invoked if there are no filters (offset, limit, order by) in the N1QL query.

6.3 Search

This API is invoked for the most sargable index, and is essentially responsible for getting the search request through to the FTS index and streaming results back from it via a channel. If the amount of data to stream exceeds the available buffer size at N1QL’s end of the channel, FTS will backfill the data to a file and a separate routine is responsible for streaming this content to N1QL. This is done so that FTS’s resources aren’t held up by a slow connection from N1QL. Internally, the gRPC protocol is used for streaming data from FTS to N1QL.
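The backfill idea can be pictured with a toy, single-threaded model. It is only a sketch: the real interface streams over gRPC and uses a separate routine to drain the file, and `BackfillChannel` is a hypothetical name.

```python
import tempfile
from collections import deque


class BackfillChannel:
    """Toy channel: a bounded in-memory buffer that spills to a file."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.spill = None  # temp file, created on first overflow

    def send(self, hit_id):
        # Once spilling has started, keep spilling so order is preserved.
        if self.spill is None and len(self.buffer) < self.capacity:
            self.buffer.append(hit_id)
            return
        if self.spill is None:
            self.spill = tempfile.TemporaryFile(mode="w+")
        self.spill.write(hit_id + "\n")

    def receive_all(self):
        # Drain the in-memory buffer first, then the backfilled file.
        while self.buffer:
            yield self.buffer.popleft()
        if self.spill is not None:
            self.spill.seek(0)
            for line in self.spill:
                yield line.rstrip("\n")
            self.spill.close()
            self.spill = None


channel = BackfillChannel(capacity=2)
for i in range(5):
    channel.send(f"doc-{i}")  # the producer never blocks on a full buffer
spilled = channel.spill is not None
received = list(channel.receive_all())
print(spilled, received)
```

The point of the design is visible in `send`: the producer side returns immediately whether or not the consumer has kept up.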

6.4 Verify:Evaluate

This API is used by N1QL to ensure that the results/hits returned by an FTS index are indeed valid for the query. FTS returns only the document keys and, if requested, some FTS-related metadata such as score. N1QL fetches documents from KV and invokes Verify on them if and only if the SELECT statement requests some document fields.

7. Covered-index vs Non-covered-index queries

If the SELECT statement’s predicate requests only keys or the metadata, the Verify API isn’t invoked at all. This kind of a query is referred to as a covered-index query. If the request is for some other document content, N1QL will use the keys returned by FTS to fetch the document data from KV, after which it invokes the Verify method for each of the fetched documents to re-check whether the documents retrieved are indeed valid matches. This kind of a query is referred to as a non-covered-index query. Non-covered-index queries tend to have higher latencies than covered-index queries as they involve the KV fetch and verify for each hit.
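The distinction can be summarized in a few lines. This is a toy classification rather than the planner's logic: `execution_steps` is a hypothetical helper, and the set of projections treated as keys or metadata is an illustrative assumption.

```python
# Projections assumed to be answerable from the FTS hit alone (illustrative).
KEY_OR_METADATA = {"META().id", "SEARCH_SCORE()", "SEARCH_META()"}


def execution_steps(projection):
    """Return the steps a query needs, given what its SELECT asks for."""
    steps = ["Search"]
    if any(item not in KEY_OR_METADATA for item in projection):
        # Some document content is requested: fetch from KV, then re-check.
        steps += ["KV fetch", "Verify"]
    return steps


covered = execution_steps(["META().id"])
non_covered = execution_steps(["META().id", "content"])
print(covered, non_covered)
```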

If the user requires other fields from the documents in the result set, a faster approach is to tune the FTS index definition to store the desired fields, and then include "fields": ["*"] in the search request within the SEARCH(..) function to fetch all stored fields as part of the result set. N1QL then does not need a separate document fetch and can skip the verify as well, which essentially converts a non-covered-index query into a covered-index query at the cost of a larger FTS index.

Consider the following FTS index definition with fields “country” and “content” indexed..

Here’s an example of a slower non-covered index query that fetches the document field “content” for documents that have “united states” in their “country” field..

Let’s update the index definition to also store the field “content” which is of interest..
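A sketch of the mapping portion of such an index definition, before and after the change. The key names follow the FTS index definition format as I understand it, the fragment omits the rest of the definition (name, source, analyzers and so on), and `field_mapping` and `index_mapping` are hypothetical helpers.

```python
import json


def field_mapping(name, store):
    """One text field entry; `store` keeps the original value in the index."""
    return {
        "enabled": True,
        "dynamic": False,
        "fields": [
            {"name": name, "type": "text", "index": True, "store": store}
        ],
    }


def index_mapping(stored_fields=()):
    """Mapping fragment that indexes only `country` and `content`."""
    return {
        "default_mapping": {
            "enabled": True,
            "dynamic": False,
            "properties": {
                name: field_mapping(name, store=name in stored_fields)
                for name in ("country", "content")
            },
        }
    }


before = index_mapping()                          # nothing stored
after = index_mapping(stored_fields=("content",))  # `content` is now stored
print(json.dumps(after, indent=2))
```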

Now here’s the same query that qualifies as a covered-index query..

8. More N1QL+FTS examples

8.1 Complex-er queries

Running a compound conjunction/disjunction FTS search within a N1QL query: fetch the top 100 document ids whose category is landmark and country is United States, ordered by score from highest to lowest (tf-idf, which is FTS’s default scoring algorithm)..

Here’s an equivalent query with embedded FTS settings within the SEARCH(..) function …
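Both variants can be sketched side by side. The bucket name `travel-sample` and alias `t` are assumptions, as is the choice of `match` and `match_phrase` for the two conditions. The first statement leaves ordering and limit to N1QL; the second carries them inside the search request.

```python
import json

# Conjunction: both sub-queries must match.
conjunction = {
    "conjuncts": [
        {"match": "landmark", "field": "category"},
        {"match_phrase": "United States", "field": "country"},
    ]
}

# Variant 1: N1QL orders by score and applies the limit.
n1ql_side = (
    "SELECT META(t).id FROM `travel-sample` AS t "
    f"WHERE SEARCH(t, {json.dumps(conjunction)}) "
    "ORDER BY SEARCH_SCORE(t) DESC LIMIT 100"
)

# Variant 2: the FTS search request carries size and sort order.
request = {"query": conjunction, "size": 100, "sort": ["-_score"]}
fts_side = (
    "SELECT META(t).id FROM `travel-sample` AS t "
    f"WHERE SEARCH(t, {json.dumps(request)})"
)

print(n1ql_side)
print(fts_side)
```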

Running another query with FTS settings embedded within the SEARCH(..) function: fetch all document ids whose description field contains the term gothic, without considering score. Here we can optimize the FTS search request to not compute score at all..

Various FTS query types that are supported are described in more detail here.

8.2 Query sargability vs Index definitions

Before we jump into some examples on how queries are deemed sargable for FTS index definitions, learn more about FTS index definitions here.

Consider the following query, which looks for term “gothic” in the field “description” ..

In the couchbase system at hand, let’s assume there are several FTS indexes defined.

The first FTS index we encounter has the following definition ..

This index is what is referred to as a default dynamic index, which covers all fields available across all documents and also includes content within the default field “_all”. This default field is what is looked into when an FTS query does not carry “field” information for a search criterion. This index is deemed SARGABLE for the above query.

A second FTS index is found to have the following definition ..

This index only has the field “description” indexed, which would cater to the query’s request, and hence the index is SARGABLE for the query.

A third FTS index is found to have the following definition ..

This index has a few fields indexed but none of them match the requested field “description”. This index is deemed NOT-SARGABLE for the query.

N1QL can now choose between the first 2 indexes, both of which will deliver accurate results for the query. Since the second index covers exactly the requested field and indexes fewer fields overall (so a search across it would be faster), N1QL chooses the second index to execute the query.
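The selection walked through above can be sketched as a small model. This is a simplification, not the actual implementation: `FtsIndex`, `is_sargable` and `pick_index` are hypothetical names, and the field names of the third index are made up since the text only says they do not include “description”.

```python
from dataclasses import dataclass


@dataclass
class FtsIndex:
    name: str
    fields: frozenset = frozenset()  # explicitly indexed field names
    dynamic: bool = False            # default dynamic mapping: all fields


def is_sargable(index, query_fields):
    """An index qualifies if it is dynamic or indexes every queried field."""
    return index.dynamic or set(query_fields) <= index.fields


def pick_index(indexes, query_fields):
    """Among sargable indexes, prefer the one with the fewest fields."""
    candidates = [ix for ix in indexes if is_sargable(ix, query_fields)]
    if not candidates:
        return None
    # A dynamic index is treated as indexing an unbounded number of fields.
    return min(
        candidates,
        key=lambda ix: float("inf") if ix.dynamic else len(ix.fields),
    )


indexes = [
    FtsIndex("default_dynamic", dynamic=True),
    FtsIndex("description_only", fields=frozenset({"description"})),
    FtsIndex("other_fields", fields=frozenset({"name", "city"})),  # made up
]

chosen = pick_index(indexes, ["description"])
print(chosen.name)
```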

9. Future

  • Establishing sargability more flexibly, for example by allowing FTS indexes that don’t support all the requested fields.
  • Support for FTS indexes with multiple type mappings.
  • Extending N1QL query interface to edit FTS indexes.

Posted by Abhinav Dangeti, Software Engineering, Couchbase inc.

Work on Couchbase's Distributed Full Text Search Offering
