Ever since the nimbler, refined scorch indexing format released in Couchbase 6.0.0, the FTS team has been pretty excited about the future optimisation potential unlocked with the newer indexing scheme. With the 6.5.0 release, we have essentially embarked on a never ending journey of optimising and tuning the performance of the Full Text Search engine.

 

Let me share a few bits about the interesting optimisations we are introducing in the upcoming release 6.5.0. 

 

Geo Queries

FTS supports two types of geo queries – Point Distance and Bounded Rectangle. Since the feature inception, we haven’t got a chance to revise and improve this feature, though it was almost certain that the query implementation needs improvements both in terms of memory and CPU utilisation. The good news is that – out of all the upcoming performance improvements in 6.5.0, we have seen the biggest gains with the geo changes.

 

Opportunity

As a part of analysing the high memory usage by geo queries – it became clear that there is a huge performance tuning potential left untapped. In case of both point distance and bounded rectangle query types, the indexing engine (bleve) has to figure out the range of geo points to be searched (candidate terms) which qualifies the search criteria from the given query parameters. Bleve derives those candidate geo terms through certain mathematical steps.

But it wasn’t doing any filtering of the mathematically generated candidate geo terms. In other words, an unintentional amplification of the search space was happening in the background. And this was triggering the creation of too many unnecessary or irrelevant term search iterators(eg: terms which don’t even exist in the index), which then leads to a lot of internal temporary objects to manage and further adding a large overhead to the Garbage Collector. 

Response

We optimised this case by applying a geo term filtering which basically validates the mathematically generated geo points for its existence in the Terms Dictionary and qualify them only if its present in the Term Dictionary. This approach significantly reduced the number of candidate geo points/terms getting searched against the inverted index and thus improved both the latency and throughput numbers for Geo queries.

Reward

This has brought in an improvement of up to 6X latency reduction for geo queries in our internal benchmarks.

 

Fuzzy Queries

In pre scorch days, when a fuzzy/edit-distance query is received, the indexing engine would have gone through each and every term indexed against the given field in query to compute the edit distance to figure out whether the current term lies within the asked edit distance with the query term. And if it qualifies that edit distance criteria, it will look up the term in the inverted index to fetch the list of documents in which the term appears and necessary details as mandated by the query. (term vectors, doc values etc

 

The recent scorch indexing format uses FiniteStateTransducers(FST) for implementing the Terms Dictionary which has this nice DFA property. This helped us to improve on the novice approach for finding the candidate terms with in a given edit distance from query.

 

The approach is to build a Levenshtein Automaton(LA) for the given term in the edit distance query. This LA is a DFA which has the property of matching all the terms which are at most at an edit distance of given “d” in the query. And this LA DFA would then be made to step together or make an intersection with the original indexed term corpus/Terms Dictionary (FST DFA) to filter the candidate terms that needs to be eventually searched as a part of the query.

 

Opportunity

Until 6.0 release, we were making the LA DFA afresh on every incoming query and this LA DFA building part turned out to be taxing in terms of memory use and speed of the DFA construction.

Response

Inspired by the original paper and the blog post – levenshtein, the newer levenshtein automaton construction takes a two step approach. In the first step, it precomputes a parametric DFA for a set of pre-defined* edit distances and this part is completely independent of the query string and hence this can be precomputed. In the second step, this parametric DFA is used to compute the per query LA DFA. And this makes the DFA construction pretty fast and memory friendly.

Reward

Go micro benchmarks showed that the newer implementation is up to 5X faster and 12X better in memory usage. And in our internal performance tests, this has brought up to ~50% throughput improvements for query.

 

Glad to see that you scrolled all the way down here. 

Let me take an abrupt stop here to ease your attention span and save the remaining tales for  Part 2. The next part will cover the performance gains with the gRPC adoption in FTS as well as that with queries like numeric range, prefix, wildcard etc.

 

                                                                                                                                                                                             … Part 2

 

{pre-defined* edit distances => as of now FTS supports edit distance of upto 2. ie only 1 and 2. 

The primary reason for this is, in any reasonably sized document corpus there will be too many terms matching an edit distance > 2 from the query term. Hence any further index lookup becomes very resource intensive along with the poorer search relevance.}

Posted by Sreekanth Sivasankaran

Sreekanth Sivasankaran is a Software Engineer, Couchbase. He is into the design and development of distributed and highly performant full text search functionality.

Leave a reply