In this blog post, we’ll have a look at the preview API for full text search in Couchbase 4.5. Please note that this API, released in the latest Java SDK (2.2.4), is still @Experimental.

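We’ll cover:

  • Full Text Search in Couchbase?
  • The Java API
  • Various Types of Queries
  • Getting Hit Explanations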

This experimental API can be used with Couchbase Server 4.5 Developer Preview, provided you use the 2.2.4 Java SDK client, which you can get through Maven. Add the following dependency to your pom.xml:
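For reference, the Couchbase Java SDK is published on Maven Central under the com.couchbase.client group with the java-client artifact, so the entry for version 2.2.4 should look like this:

```xml
<dependency>
    <groupId>com.couchbase.client</groupId>
    <artifactId>java-client</artifactId>
    <version>2.2.4</version>
</dependency>
```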

Full Text Search in Couchbase?

Yes! The upcoming 4.5 server release (codename Watson) will include a full text indexer (FTS, also known as CBFT) based on the open-source Bleve project. Bleve is all about full-text search and indexing in Go (shoutout to our very own Marty Schoch for initiating this project).

The idea is to leverage Bleve to provide off-the-shelf full text search in Couchbase Server, without having to use connectors to external software (which runs on its own cluster). If that off-the-shelf solution doesn’t meet all of your needs, you can of course still use these connectors, but for simpler needs you are good to go with a single solution.

FTS offers a host of capabilities that are provided by Bleve: Text Analyzers, Tokenizers and post-processing Token Filters that are beyond the scope of this post, as well as the numerous types of queries that you can run on the resulting indexes. Let’s see what those types are and how you can expect to use them in the context of the Java SDK.

In the rest of this blog post, we’ll use three indexes that you will be able to build through the web administrative console in the upcoming 4.5 Developer Preview. As listed in the console UI, we have:

  • a beerIndex that indexes the whole content of each document in the beer-sample bucket.
  • a travelIndex that indexes the whole content of each document in the travel-sample bucket.
  • an alias index, commonIndex, that is a union of the two indexes above.

The Java API

The entry point of the full text search feature in the Java SDK is on the Bucket, using the query(SearchQuery ftq) method. This is consistent with the existing querying methods already present in the API to run a ViewQuery or a N1qlQuery.

The API for full text search follows the builder pattern: identify the type of query you want, use the corresponding builder to construct it, get the SearchQuery out of it using build(), and execute it using bucket.query(searchQuery).

Let’s take a (very simple) example and see how it can be consumed:

If we look at each section individually, here’s what happened:

  1. We create a simple MatchQuery on a single term.
  2. It runs on the beer sample index (.on(beerIndex)) and looks for textual occurrences of the word “national” (.query("national")) or close terms.
  3. Additional configuration is done to limit the number of results to 3 (.limit(3)) and the actual query is created at this point (.build()).
  4. The query is executed (bucket.query(ftq)) and returns a SearchQueryResult.
  5. We output the result’s totalHits() and individual rows (also accessible as a list through hits()).

Running that code outputs:

We see that total hits gives us the actual number of hits before the limit was applied. The hits() method returns 3 SearchQueryRow objects, as requested.
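To make the difference between the total hit count and the limit concrete, here is a minimal plain-Java sketch that needs no cluster. LimitSketch, Builder, Query, Result and run are hypothetical stand-ins invented for this illustration (they are not SDK classes), and the substring matching is a deliberate simplification of what a real index does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java stand-in for illustration only: none of these types come from the Couchbase SDK.
final class LimitSketch {

    /** Immutable query description produced by the builder. */
    static final class Query {
        final String index;
        final String term;
        final int limit;

        private Query(String index, String term, int limit) {
            this.index = index;
            this.term = term;
            this.limit = limit;
        }
    }

    /** Fluent builder mirroring an on(...).query(...).limit(...).build() chain. */
    static final class Builder {
        private final String index;
        private String term = "";
        private int limit = 10; // arbitrary default chosen for this sketch

        private Builder(String index) {
            this.index = index;
        }

        static Builder on(String index) {
            return new Builder(index);
        }

        Builder query(String term) {
            this.term = term;
            return this;
        }

        Builder limit(int limit) {
            this.limit = limit;
            return this;
        }

        Query build() {
            return new Query(index, term, limit);
        }
    }

    /** Result holding the full match count and the (possibly truncated) hits. */
    static final class Result {
        final long totalHits;
        final List<String> hits;

        Result(long totalHits, List<String> hits) {
            this.totalHits = totalHits;
            this.hits = hits;
        }
    }

    /** Naive case-insensitive substring search; the limit only trims the returned hits. */
    static Result run(Query q, List<String> docs) {
        String needle = q.term.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String doc : docs) {
            if (doc.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(doc);
            }
        }
        int end = Math.min(q.limit, matches.size());
        return new Result(matches.size(), new ArrayList<>(matches.subList(0, end)));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "National Brewing", "Nation's Capital Ale", "Pale Ale",
            "International Stout", "The National Lager");
        Query q = Builder.on("beerIndex").query("nation").limit(3).build();
        Result r = run(q, docs);
        System.out.println("totalHits = " + r.totalHits + ", returned = " + r.hits.size());
    }
}
```

Four of the five sample strings contain “nation”, but only three are returned, so the program prints totalHits = 4, returned = 3.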

Each hit contains the key to the associated document in Couchbase (id()), as well as more information on the match, e.g. a score (score()). If you want, you can retrieve the associated document using bucket.get(row.id()):

This gives us, for the first hit:

If we look closely at the document’s JSON, we notice where the document probably matched. In the “description” field of the document, there is this sentence:

The first brewery to open in the nation’s capital since Prohibition.

Also notice that the text query looked for the word requested and derived words that have the same root. It actually applied a fuzziness of 2 (see the next section).

This pattern can be applied to the other types of queries as well, so let’s look at a few more and see what kinds of searches can be performed.

Various Types of Queries

Fuzzy Querying

Fuzzy querying can be performed with the MatchQuery, specifying a Levenshtein distance as the maximum fuzziness() to allow on the term:

At a fuzziness of 2, this matches words like “hammer”, “mamma” or “summer”:

At a fuzziness of 1, no match is found:

The FuzzyQuery is a type of query dedicated to fuzziness that does not apply any analyzer.
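For readers unfamiliar with Levenshtein distance, the sketch below computes it in plain Java and uses it as a fuzziness threshold. FuzzinessSketch and its methods are hypothetical helpers written for this post, not SDK code, and the index's own implementation may well differ.

```java
// Plain-Java illustration of edit distance as a fuzziness threshold; not SDK code.
final class FuzzinessSketch {

    /** Classic dynamic-programming Levenshtein distance, keeping two rows in memory. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True when candidate is within maxFuzziness edits of term. */
    static boolean matches(String term, String candidate, int maxFuzziness) {
        return levenshtein(term, candidate) <= maxFuzziness;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("national", "nation"));
        System.out.println(matches("national", "nation", 2));
        System.out.println(matches("national", "nation", 1));
    }
}
```

Here “national” and “nation” are two edits apart (two deletions), so the pair matches at a fuzziness of 2 but not at a fuzziness of 1.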

Multiple Terms: MatchPhrase

As we saw, MatchQuery is a term-based query that lets you optionally specify fuzziness. It also applies to the searched term the same filters that may have been applied to the field (e.g. stemming).

You can search for multiple terms in a single query by using a Match Phrase query. Terms are analyzed and fuzziness can be optionally activated:
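The core idea of phrase matching, that the terms must appear next to each other and in order, can be sketched in plain Java. PhraseSketch is a hypothetical helper: its tokenizer is a crude stand-in for a real analyzer, and it ignores fuzziness entirely.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of phrase matching; not SDK code.
final class PhraseSketch {

    /** Lowercases and splits on anything that is not a letter or digit. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("[^\\p{L}\\p{N}]+"));
    }

    /** True when the phrase's tokens occur consecutively, in order, within the text. */
    static boolean containsPhrase(String text, String phrase) {
        return Collections.indexOfSubList(tokens(text), tokens(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String description = "The first brewery to open in the nation's capital since Prohibition.";
        System.out.println(containsPhrase(description, "first brewery"));
        System.out.println(containsPhrase(description, "brewery first"));
    }
}
```

The first call prints true and the second prints false, because both words are present in each case but only “first brewery” appears in that order.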

Regexp Query

A RegexpQuery goes beyond literal matching and lets you match using a regular expression. Take this example:

Notice this query targets a particular field in the JSON (field("name")). We want all names that contain either “tale” or “pale”. Here are a few names that match this query:
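The matching rule itself (a name containing “tale” or “pale”) can be tried locally with java.util.regex. This is only a sketch: the pattern [tp]ale and the sample names are assumptions made for illustration, and RegexpSketch is a hypothetical helper rather than the SDK’s RegexpQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Plain-Java illustration of regular-expression matching on a name field; not SDK code.
final class RegexpSketch {

    /** Returns the names in which the regular expression matches somewhere (case-insensitive). */
    static List<String> namesMatching(String regex, List<String> names) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matching = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name.toLowerCase(Locale.ROOT)).find()) {
                matching.add(name);
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        // Made-up sample names, not taken from the beer-sample bucket.
        List<String> names = Arrays.asList("Pale Ale", "Fairy Tale Porter", "Oatmeal Stout");
        System.out.println(namesMatching("[tp]ale", names));
    }
}
```

With these sample names, the first two match and “Oatmeal Stout” does not.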

Prefix Query

A PrefixQuery looks for word occurrences that start with the given string:

Once again we only look inside the name field, this time for words that start with “weiss”:
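Prefix matching works on individual words, which is what distinguishes it from a plain substring search. The following plain-Java sketch shows the difference; PrefixSketch and the sample names are hypothetical, and the word splitting is a simplification of real tokenization.

```java
import java.util.Locale;

// Plain-Java illustration of word-prefix matching; not SDK code.
final class PrefixSketch {

    /** True when at least one word of the text starts with the prefix (case-insensitive). */
    static boolean hasWordWithPrefix(String text, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.startsWith(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample names, not taken from the beer-sample bucket.
        System.out.println(hasWordWithPrefix("Hefe Weissbier", "weiss"));
        System.out.println(hasWordWithPrefix("Edelweiss Lager", "weiss"));
    }
}
```

The first call prints true, while the second prints false: “Edelweiss” contains “weiss” but does not start with it.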

Range and Date Queries

FTS is also good with non-textual data. For instance, the NumericRangeQuery allows you to look for numerical values within a provided range:

Which outputs:

Dates are covered as well with the DateRangeQuery:

Which outputs:
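As a rough mental model, both kinds of range query boil down to a bounds check on a typed value. The sketch below expresses that in plain Java; RangeSketch is a hypothetical helper, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption of this sketch, not a statement about the SDK’s defaults.

```java
import java.time.LocalDate;

// Plain-Java illustration of numeric and date range checks; not SDK code.
final class RangeSketch {

    /** Numeric check: min inclusive, max exclusive (an assumption of this sketch). */
    static boolean inRange(double value, double min, double max) {
        return value >= min && value < max;
    }

    /** Date check with the same convention: start inclusive, end exclusive. */
    static boolean inRange(LocalDate value, LocalDate start, LocalDate end) {
        return !value.isBefore(start) && value.isBefore(end);
    }

    public static void main(String[] args) {
        System.out.println(inRange(5.0, 5.0, 7.0));
        System.out.println(inRange(7.0, 5.0, 7.0));
        LocalDate start = LocalDate.of(2010, 1, 1);
        LocalDate end = LocalDate.of(2011, 1, 1);
        System.out.println(inRange(LocalDate.of(2010, 6, 15), start, end));
        System.out.println(inRange(LocalDate.of(2011, 1, 1), start, end));
    }
}
```

Under that convention the four calls print true, false, true and false.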

Generic Querying

FTS also offers a more generic form of querying that combines phrases, terms and more using the String Query syntax. This is accessible in the API through the StringQuery.

Combining

Additionally, you can combine simple criteria like MatchQuery using combination queries. Taking these two simple term queries:

You can combine them in different ways:

  • a conjunction looks for all the terms
  • a disjunction looks for at least one term
  • a boolean query allows you to combine the two approaches
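The three combination styles listed above map closely onto boolean logic over predicates. The plain-Java sketch below shows that logic only: CombineSketch and its methods are hypothetical helpers, the term matching is a simple substring test, and scoring is ignored.

```java
import java.util.Locale;
import java.util.function.Predicate;

// Plain-Java illustration of combining criteria; these helpers are not part of the SDK.
final class CombineSketch {

    /** A term-query stand-in: true when the text contains the word (case-insensitive). */
    static Predicate<String> term(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        return text -> text.toLowerCase(Locale.ROOT).contains(needle);
    }

    /** Conjunction: every part must match. */
    @SafeVarargs
    static <T> Predicate<T> conjunction(Predicate<T>... parts) {
        return value -> {
            for (Predicate<T> part : parts) {
                if (!part.test(value)) {
                    return false;
                }
            }
            return true;
        };
    }

    /** Disjunction: at least one part must match. */
    @SafeVarargs
    static <T> Predicate<T> disjunction(Predicate<T>... parts) {
        return value -> {
            for (Predicate<T> part : parts) {
                if (part.test(value)) {
                    return true;
                }
            }
            return false;
        };
    }

    public static void main(String[] args) {
        Predicate<String> hops = term("hops");
        Predicate<String> malt = term("malt");
        Predicate<String> yeast = term("yeast");
        String text = "Brewed with hops and yeast";

        System.out.println(conjunction(hops, malt).test(text));
        System.out.println(disjunction(hops, malt).test(text));
        // Mixing both approaches: yeast is required, plus at least one of hops or malt.
        System.out.println(conjunction(yeast, disjunction(hops, malt)).test(text));
    }
}
```

For the sample text, the conjunction fails (there is no “malt”), the disjunction succeeds, and the mixed form succeeds, so the program prints false, true, true.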

Getting Hit Explanations

If you want to get insights into the scoring and matching of a particular SearchQueryRow, you can build your query with the .explain(true) parameter and get details from the index in the result’s explanation() field:

Conclusion

We hope that this preview of the API has piqued your interest!

Go ahead and download the first Developer Preview of Couchbase 4.5 with embedded Full Text Search service. We hope that you’ll be able to quickly start searching using the associated Java SDK API.

And until then… Happy coding!
The Java SDK Team

Author

Posted by Simon Basle, Software Engineer, Pivotal

Simon Baslé is a Paris-based Software Engineer working in the Spring team at Pivotal. Previously, he worked in the Couchbase Java SDK team. His interests span software design aspects (OOP, design patterns, software architecture), rich clients, what lies beyond code (continuous integration, (D)VCS, best practices), and reactive programming. He is also an editor for the French version of InfoQ.com.
