What is autocomplete?

Wikipedia defines autocomplete as follows:

“Autocomplete, or word completion, is a feature in which an application predicts the rest of a word a user is typing”

It is also known as suggestions, type-ahead, or search-as-you-type. It guides users by prompting them with likely completions and alternatives to the text as they type. It also reduces the number of characters a user needs to type to reach the final search results, thereby improving the search experience.

Let’s explore a couple of functional approaches to implementing the autocomplete feature using Couchbase FTS.

 

Using the Edge NGram approach

This approach uses different analysers at index time and at query time.

Step1 – Create an FTS index definition suited to the fields that need to be autocompleted. During indexing, these fields have to be analysed with a custom analyser that uses the edge-ngram token filter. If case sensitivity is a concern, chain it with the to_lower token filter. The store option also has to be enabled for the autocomplete field in the index definition, so that the original field contents are preserved intact within the FTS index and can be fetched explicitly during the query phase to serve as the autocomplete text or suggestions for the user.
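As a rough sketch of what such an index definition could contain, the Python snippet below builds the analysis and field-mapping fragments and prints them as JSON. The analyser name autocomplete_analyser, the filter name edge_ngram_2_6, the field name title and the choice of the unicode tokeniser are all assumptions made for illustration; compare the layout with the index definition your own cluster generates before relying on it.

```python
import json

# Hypothetical names chosen for this example; they do not come from the article.
FILTER_NAME = "edge_ngram_2_6"
ANALYSER_NAME = "autocomplete_analyser"
FIELD_NAME = "title"

# Sketch of the "analysis" part of an FTS index mapping.
mapping_fragment = {
    "analysis": {
        "token_filters": {
            # Front-anchored n-grams of length 2 to 6.
            FILTER_NAME: {"type": "edge_ngram", "min": 2, "max": 6, "back": False},
        },
        "analyzers": {
            ANALYSER_NAME: {
                "type": "custom",
                "tokenizer": "unicode",
                # Lowercase first, then generate the edge n-grams.
                "token_filters": ["to_lower", FILTER_NAME],
            },
        },
    },
}

# Sketch of the mapping for the field to be autocompleted.
field_mapping = {
    "name": FIELD_NAME,
    "type": "text",
    "analyzer": ANALYSER_NAME,
    "index": True,
    "store": True,  # keep the original value so it can be returned as a suggestion
}

print(json.dumps({"mapping": mapping_fragment, "field": field_mapping}, indent=2))
```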

How does an edge ngram token filter work?

An edge ngram token filter splits each token of a given text value into sub-tokens that start at the beginning of the token, with lengths ranging from a given min length to a given max length. For example, an edge-ngram token filter with a min length of 2 and a max length of 6 would tokenise the text “jurassic park” as below.

ju, jur, jura, juras, jurass, pa, par, park. 

The idea here is that these sub-tokens of the indexed field correspond to the partial text a user might later type in the user interface.
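The tokenisation described above can be reproduced with a few lines of Python. This is only a simulation for illustration: the function name edge_ngrams is made up, and splitting on whitespace is a simplifying assumption that stands in for the real tokeniser.

```python
def edge_ngrams(text, min_len=2, max_len=6):
    """Return front-anchored n-grams for every whitespace-separated word."""
    tokens = []
    for word in text.lower().split():
        # A word shorter than max_len only yields n-grams up to its own length.
        longest = min(max_len, len(word))
        for n in range(min_len, longest + 1):
            tokens.append(word[:n])
    return tokens


print(edge_ngrams("jurassic park"))
# ['ju', 'jur', 'jura', 'juras', 'jurass', 'pa', 'par', 'park']
```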

 

 

Step2 – Later, when the end user starts typing in the text box in the user interface, the client application can trigger search queries in the background with the partial text available in the text box. Clients should specifically use the match query, as it lets them explicitly specify:

  • the analyser to be used for the search text.

    Use a simple analyser for the match query, as that avoids any unnecessary splitting of the typed text during the search phase.

  • the fuzziness to be applied.

    Specify a fuzziness factor if the client wants autocomplete suggestions with fuzzy matching applied.

These options help to control the autocomplete behaviour.

Along with the match query, the client should request the actual contents of the autocompleted field using the Fields option in the search request. This value is then used as the autocompleted or type-ahead text shown to the user.
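A search request along these lines might be assembled as in the sketch below. The helper build_autocomplete_request, the field name title and the result limit are hypothetical; the snippet only builds the JSON body and leaves sending it to the search endpoint to the client or SDK in use.

```python
import json


def build_autocomplete_request(partial_text, field="title", fuzziness=0, limit=10):
    """Build a match-query search request body for the typed text (sketch)."""
    query = {
        "match": partial_text,
        "field": field,
        "analyzer": "simple",  # query-time analyser, different from the index-time one
    }
    if fuzziness > 0:
        query["fuzziness"] = fuzziness  # optional fuzzy matching
    return {
        "query": query,
        "fields": [field],  # ask for the stored field value back
        "size": limit,
    }


print(json.dumps(build_autocomplete_request("jur"), indent=2))
print(json.dumps(build_autocomplete_request("jurr", fuzziness=1), indent=2))
```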

 

For example, a match query for a partial text like “jur” or “pa” will match all of the titles below.

Jurassic Park 

Jurassic Park III

The Lost World: Jurassic Park
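To see why all three titles match, the following sketch simulates the lookup with a small in-memory inverted index. It is not how FTS is implemented internally; the helper names and the regular expression used to split words are assumptions made for the example.

```python
import re
from collections import defaultdict


def edge_ngrams(word, min_len=2, max_len=6):
    """Front-anchored n-grams of a single lowercase word."""
    return [word[:n] for n in range(min_len, min(max_len, len(word)) + 1)]


def build_index(titles):
    """Map every edge n-gram to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in re.findall(r"[a-z0-9]+", title.lower()):
            for gram in edge_ngrams(word):
                index[gram].add(title)
    return index


def suggest(index, typed):
    """Look up the typed text as a single lowercase term."""
    return sorted(index.get(typed.lower(), set()))


titles = ["Jurassic Park", "Jurassic Park III", "The Lost World: Jurassic Park"]
index = build_index(titles)
print(suggest(index, "jur"))
print(suggest(index, "pa"))
```

In this simulation a typed term longer than the max length, such as “jurassic”, finds nothing, because no n-gram of that length was indexed.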

 

Order/Rank of the Results – The default tf-idf ranking applies here to the n-grams that are indexed. A specific order of suggestions is possible by applying client- or application-specific custom sorting to the retrieved field values on the client side.
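As one possible example of such client-side sorting, the sketch below puts titles that start with the typed text first and then prefers shorter titles. The function rank_suggestions and this particular ordering rule are assumptions made for illustration, not something FTS prescribes.

```python
def rank_suggestions(typed, titles):
    """Order suggestions: titles starting with the typed text first, then by length."""
    typed_lower = typed.lower()

    def sort_key(title):
        starts_with = title.lower().startswith(typed_lower)
        # False sorts before True, so "not starts_with" puts prefix matches first.
        return (not starts_with, len(title), title)

    return sorted(titles, key=sort_key)


retrieved = ["The Lost World: Jurassic Park", "Jurassic Park III", "Jurassic Park"]
print(rank_suggestions("jur", retrieved))
# ['Jurassic Park', 'Jurassic Park III', 'The Lost World: Jurassic Park']
```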

You may check out the sample auto complete bootstrap application here – 

 

 

 

 

 

 

 

 

 

 

 

Prefix based Approach

This approach uses the same analyser at index time and at query time.

Step1 – During indexing, the fields intended for autocomplete have to be analysed with a keyword analyser. With the keyword analyser, the entire value of the field is treated as a single token, with all of its terms and spaces retained.

Step2 – Later, at query time, the client runs prefix queries against the desired field in the documents. As in the first approach, the client needs to explicitly request the actual contents of the autocompleted field using the Fields option in the search request. This value is then used as the autocompleted text for the user.
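The two steps could look roughly like the sketch below, which builds a field mapping that uses the keyword analyser and a prefix-query request body. The field name title, the helper build_prefix_request and the result limit are hypothetical, and the snippet only constructs the JSON rather than sending it anywhere.

```python
import json

# Hypothetical field name for this example.
FIELD_NAME = "title"

# Sketch of the field mapping: the whole value becomes one token.
keyword_field_mapping = {
    "name": FIELD_NAME,
    "type": "text",
    "analyzer": "keyword",
    "index": True,
    "store": True,  # needed so the original value can be returned
}


def build_prefix_request(partial_text, field=FIELD_NAME, limit=10):
    """Build a prefix-query search request body for the typed text (sketch)."""
    return {
        "query": {"prefix": partial_text, "field": field},
        "fields": [field],
        "size": limit,
    }


print(json.dumps(keyword_field_mapping, indent=2))
print(json.dumps(build_prefix_request("Jur"), indent=2))
```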

This approach has the restriction that matching is strictly limited to the start (prefix) of the field value.

For example, in the use case mentioned above (with a partial text like “Jur”), with the prefix based approach

“Jurassic Park”

“Jurassic Park III”

will show up in the results, but “The Lost World: Jurassic Park” will not, as its value starts with “The Lost World:” rather than “Jur”.
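The restriction can be illustrated with a small simulation in which each title is kept as one unmodified token and compared against the typed prefix. The function prefix_matches is made up for this example. The comparison is case-sensitive here on the assumption that the keyword analyser does not lowercase the value.

```python
def prefix_matches(typed, titles):
    """Return the titles whose whole, unmodified value starts with the typed text."""
    return [title for title in titles if title.startswith(typed)]


titles = ["Jurassic Park", "Jurassic Park III", "The Lost World: Jurassic Park"]
print(prefix_matches("Jur", titles))
# ['Jurassic Park', 'Jurassic Park III']
```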

 

 

Posted by Sreekanth Sivasankaran

Sreekanth Sivasankaran is a Software Engineer at Couchbase. He works on the design and development of distributed and highly performant full text search functionality.
