People don’t want a four key index.  They need a four-ms response.

Ted Levitt

Application development is demanding. Each application is trying to progress on behalf of the customer — searching for the right product or the right form, ordering, canceling, shipping, checking the status, etc.

The typical query and B-Tree based databases and infrastructure are suitable for developing many of the modules. Still, there are cases where SQL & B-Tree based searching is ineffective in meeting your SLA.

Here are the seven problems you’d encounter in SQL development and solutions for them using Search technology. I’ve used Couchbase N1QL as the SQL implementation example, Couchbase GSI as the sample B-Tree based indexes (logically speaking) and FTS (Couchbase Full-Text Search) as the example search technology.

I’ll also use the travel-sample model and data in the Couchbase sample dataset in the examples.

Developer Challenges

1. The NAME search problem (aka the LIKE problem)

Social media may like to like often, but SQL doesn’t. Like predicates are one of the first things to look at when they’re used in a query. Without careful usage, LIKE predicate will increase your latency, bring down your throughput and shut down your operation.  Here is a sample LIKE predicates I’ve seen in real customer applications.

  • WHERE name LIKE “%joe%”;
  • WHERE UPPER(name) LIKE “%JOE%”
  • WHERE REGEX_CONTAINS(name, “.*joe.*”)
  • WHERE name LIKE “% joe %”

Problems:

  • Even when you have an index on the name, you’ll have to scan the entire index to determine all the documents with joe.
  • You’d have to create a functional index (e.g. UPPER(name)) and have the additional key with the name to save the exact case of the name.  E.g. McDonald
  • With the first three predicates, you’ll not only get joe, but also strings joel, sojoel, bejoel, etc.

2. The multi-match problem

When names are split into first name, middle name, last name, you’d need to search everywhere. Sometimes, you’d want to search for a string in any one of the names.   Combine that with matching their call number. Your query will look something like this.


SELECT c.customerid, c.fname, c.mname, c.lname, c.address
FROM customer c
WHERE (c.fname in [‘John’, ‘Doe’]
     OR c.mname in [‘John’, ‘Doe’]
     OR c.lname in [‘John’, ‘Doe’])
     OR (ANY p in c.contacts satisfies p = ‘6509264813’)
     OR (ANY q in c.services satisfies q = ‘6509264813’)

Now, try to create a B-TREE index that gives you not only the low latency you want but also high throughput.  

3. Beyond the exact match

Real world matches are inexact.  William can be abbreviated to Will, Willy, Bill, and Billy.  The Irish and the Scottish have additional abbreviations.  How do you search for Billy and end up with a William?


4. Search based on business requirements

Each business has rules on resolving customer information (name, address, and date of birth) to a particular customer.  Some give higher priority to address others to name or address. Rules can say, the two out of four will have to be an exact match and others have to be a partial match.     N1QL predicates become even more complex than the second case. So, the GSI index suitable for the query gets complicated as well.

5. Relevance ordering

N1QL’s ordering is based on the ORDER BY clause and the expression within it. When you’re doing an inexact match, you may want the order based on the text score and the distance from the given search pattern.

6. The array problem.

JSON arrays give you the ability to store the 1:N, N:M relationships as an array of objects or arrays with links to other objects. This article gives you details on manipulating arrays. The second article describes additional set-oriented operations on arrays. To get the best performance while searching inside arrays, you need to create indexes with array keys. The array index comes with a limitation: each array index can only have one array key per index. So, when you have a customer object with multiple array fields, you can’t search all of them using a single index. When you have customers with an array of contact numbers and an array of email contacts, you can’t search by both using a single index, causing expensive queries. This is a common limitation of b-tree based array indexes. Since the B-tree index stores individual array elements as distinct keys, having multiple array keys will require the index to store the product of the number of keys (in the example above, (number of contact numbers) * (number of emails). The index size will increase exponentially. To avoid this, the products disable creating multiple array keys for a single index.

But, the developer’s job to efficiently look up in multiple JSON arrays remain.

7. Multiple queries, single index.

JSON model has been successfully used to aggregate information from multiple sources. The point of aggregating all this information is not to just simply store them, but to query them and get business value. When you have an ad-hoc query on a dataset, it’s difficult to pre-plan and re-create the indexes needed for it. N1QL itself provides the adaptive index method that helps in certain use cases. If you have multiple predicates, each of which can return a large resultset, the adaptive index will run into performance issues due to the amount of data transfer from index to the query node and large intermediate result set.

In this case, developers need to continuously monitor and tune the indexes. The users with performance issues will continue to complain.

 

Solutions:

1.The NAME search problem (aka the LIKE problem)

Whether you’re searching for joe, Joel or jolly, they all tend to be individual words. Typically you’re not searching for lajoella. The search index takes each word in your field (name, title or comment), analyzes it based on the language characteristics, categorizes the words into stop words (uninteresting and not usually searched for – e.g.: and, or, not, the in English) and terms. Indexing the term instead of the actual word has two benefits.

  1. It groups the related words together: francis, francisco, frank, francois, etc. When you search for francisco, you’ll also find closely related word francis.
  2. This grouping will also reduce the size of the index because we only need to index the root term instead of every modifier for that word.

Back to the name search: Let’s now search for the hotel name.

  1. Create an FTS index on documents of type ‘hotel’ and the field name in the document.
  2. Start searching for name:
    • Here’s the query to run:

curl -XPOST -H “Content-Type: application/json” \ http://172.23.120.38:8094/api/index/trhotelname/query \ -d ‘{ “explain”: true, “fields”: [ “*” ], “highlight”: {}, “query”: { “query”: “francisco” } }’

  1. Searching for francisco gives you the hotel names:
      • “name”: “The Opal San Francisco”,
      • “name”: “Francisco Bay Inn”,
      • and more (totally 11 results).
      •  

2.The multi-match problem

We saw the complex query earlier.

SELECT c.customerid, c.fname, c.mname, c.lname, c.address
FROM customer c
WHERE (c.fname in [‘John’, ‘Doe’]
     OR c.mname in [‘John’, ‘Doe’]
     OR c.lname in [‘John’, ‘Doe’])
     OR (ANY p in c.contacts satisfies p = ‘6509264813’)
     OR (ANY q in c.services satisfies q = ‘6509264813’)

On FTS, we can define a special _all field that combines multiple fields. Searching on this index automatically searches on ALL of the fields. (Also see this article) This is an index on five fields in the hotel document:

  1. city
  2. country
  3. name
  4. public_likes (an array field)
  5. description

Example values for the five fields:

[
{
“city”: “Medway”,
“country”: “United Kingdom”,
“description”: “40 bed summer hostel about 3 miles from Gillingham, housed in a distinctive converted Oast House in a semi-rural setting.”,
“name”: “Medway Youth Hostel”,
“public_likes”: [
“Julius Tromp I”,
“Corrine Hilll”,
“Jaeden McKenzie”,
“Vallie Ryan”,
“Brian Kilback”,
“Lilian McLaughlin”,
“Ms. Moses Feeney”,
“Elnora Trantow”
]
}
]

Once you define the search index on all these fields, simply start querying.

 

 

 

 

Here’s the index definition for the _all field index used here.

 

 

 

3. Beyond the exact match

Search (FTS) does more than an exact search. It does the term search, range search, fuzzy search, conjuncts, disjuncts, geo search (nearest neighbor), regex search, boosted search, phrase search and more. See examples and details in Couchbase FTS documentation.

4.Search based on business (your) requirements

When you’re searching for hotels in New York, you may prefer Manhattan hotels, but want to see other hotels as well. Now, simply boost the Manhattan search term by appending ^5. This boosting improves the score of the documents that contain Manhattan. The results are ordered in the descending order of score by default.

5. Relevance Ordering

In SQL ordering is based on the value of the expression or the field itself. In search, the relevance of a document is calculated by the distance between what you’re searching for and what the document contains. This is the score we manipulated by boosting the importance of a term in the previous section. You can sort the results by this score and any other field by using the sort clause of the search request.

6.The array problem

Now, let’s create a single index on the following four arrays.

  1. public_likes
  2. reviews.author
  3. reviews.ratings.Location
  4. reviews.ratings.Service

Here’s the index definition. This is a single index on four array keys. This is something you could never do in a B-tree based index.

Now, you can query using predicates on one or more of the fields above.

EXAMPLE 1: hotels liked by “Vallie Ryan” or Service rating greater than 4

‘{
“explain”: true,
“fields”: [ “*” ],
“highlight”: {},
“query”: { “query”: “+public_likes:\”Vallie Ryan\” reviews.ratings.Service:>4″ }
}’

Full Index Definition.

 

7.Multiple Queries, Single Index.

“Sometimes, queries are like a box of chocolates. You don’t know what query you’re going to get.” When you want to expose the data to business users who can issue ad-hoc queries, you can’t create every kind of index to speed up every kind of query. Performance tuning requires a different approach.

Consider the index created in section (2) above on the 5 fields:

  1. city
  2. country
  3. name
  4. public_likes (an array field)
  5. description

If you created a B-Tree (GSI index in case of Couchbsae), it would look like the following:

CREATE INDEX itravel ON travel-sample(city, country, name, DISTINCT public_likes, description) WHERE type = ‘hotel’;

The queries that would benefit from this follow the rules explained here. The major issue being each query block (specifically index scan) will have to use predicates on the leading key(s) of the index. This limits the efficacy of the index.

With the search(FTS) index created above, you can query based on any conjucts, disjuncts, must have, etc with any combination. With complex queries, the search may take more than few milliseconds, but you’re doing all this in a single index. The flexibility with reasonable use of resources makes the search index very valuable.

Note: In the upcoming Couchbase release 6.5, we’ve made it easier to query using the search index. Binh Le has explained this in this article:https://www.couchbase.com/blog/n1ql-and-search-how-to-leverage-fts-index-in-n1ql-query/

References

 

Author

Posted by Keshav Murthy

Keshav Murthy is a Vice President at Couchbase R&D. Previously, he was at MapR, IBM, Informix, Sybase, with more than 20 years of experience in database design & development. He lead the SQL and NoSQL R&D team at IBM Informix. He has received two President's Club awards at Couchbase, two Outstanding Technical Achievement Awards at IBM. Keshav has a bachelor's degree in Computer Science and Engineering from the University of Mysore, India, holds ten US patents and has three US patents pending.

Leave a reply