Understanding Letter Ordering in View Queries

Map/Reduce Views are an important part of Couchbase 2.0, and so is understanding how to query them. The documentation can be found here: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views.html

One of the subtleties is that view keys are ordered by Unicode Collation, which is different from Byte Order (like ASCII). Byte Order is what most of us are used to, because most programming languages use it when comparing strings.

Byte Order (like ASCII)

0123456789 < A–Z < a–z

Unicode Collation (default option)

0123456789 < aAbBcCdDeEfFgGhH…

Notice that letters are grouped: the lowercase and then the uppercase form of each letter sit together, rather than all uppercase letters forming one range followed by all lowercase letters.

In addition, accented characters, which are common in non-English text, follow a similar principle:

a < á < A < Á < b

Notice that all “a” characters and their accented variants occur before capital A and its variants, which likewise occur before any “b” characters.

Ordering Example

In Byte Order (like ASCII), the following indexed keys would sort like this:

“ABC123” < “ABC223” < “abc123” < “abc223” < “abcd23” < “bbc123” < “bbcd23”

However, in the Unicode Collation used by Couchbase Views, they would actually occur in this order:

“abc123” < “ABC123” < “abc223” < “ABC223” < “abcd23” < “bbc123” < “bbcd23” 

So when determining your startkey and endkey ranges for strings, it's important to know this ordering!
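The two orderings above can be reproduced with a small Python sketch. Python's standard library has no real Unicode Collation implementation, so `approx_collation_key` below is a hypothetical stand-in: it compares case-insensitively first and only then puts lowercase before uppercase. That is an assumption that holds for plain ASCII letters and digits like the keys in this example; it does not handle accented characters and is not what Couchbase itself runs.

```python
def byte_order_key(s):
    # Byte Order: compare the raw encoded bytes, as ASCII-style sorting does.
    return s.encode("utf-8")


def approx_collation_key(s):
    # Rough approximation of Unicode Collation for ASCII letters and digits:
    # 1) compare ignoring case, 2) break ties with lowercase before uppercase.
    return (s.lower(), tuple(c.isupper() for c in s))


keys = ["bbcd23", "abc223", "ABC123", "abcd23", "ABC223", "bbc123", "abc123"]

byte_sorted = sorted(keys, key=byte_order_key)
collated = sorted(keys, key=approx_collation_key)

print(" < ".join(byte_sorted))
print(" < ".join(collated))
```

The first line printed matches the Byte Order listing and the second matches the Unicode Collation listing. Note how case only decides between keys that are otherwise identical, which is why “ABC123” lands before “abc223”.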

Examples

For instance, suppose you are using the beer-sample database that is packaged with Couchbase 2.0 and want to query Breweries by_name:

Breweries starting with Uppercase Y:

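startkey="Y"&endkey="z"

With the supplied data, this returns only those starting with Uppercase Y! Keep in mind that case only breaks ties between otherwise identical strings (“ABC123” sorts before “abc223” above), so a longer Lowercase key such as “yak” also sorts after “Y” and would fall inside this range.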
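Since the keys are JSON strings, they need straight double quotes and URL encoding when they go into a real request. Here is a minimal sketch of building such a query string in Python. The host, the port 8092, the design document name `breweries`, and the helper `view_query_url` are all assumptions made for illustration; adjust them to wherever your by_name view actually lives.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and design document name are assumptions.
BASE = "http://localhost:8092/beer-sample/_design/breweries/_view/by_name"


def view_query_url(startkey, endkey, base=BASE):
    # View keys are JSON values, so encode them as JSON before URL-encoding.
    params = {"startkey": json.dumps(startkey), "endkey": json.dumps(endkey)}
    return base + "?" + urlencode(params)


url = view_query_url("Y", "z")
print(url)
```

The quotes around each key end up percent-encoded as %22 in the resulting URL.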

Breweries starting with Lowercase y or Uppercase Y:

startkey="y"&endkey="z"

With the supplied data this happens to return only breweries with Uppercase Y, but it would also include any with Lowercase y!

Breweries starting with Lowercase y only:

startkey="y"&endkey="Y"

Should return no results with the supplied data, because all of its breweries start with Uppercase Y!
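To get a feel for these three ranges without a running server, here is a rough simulation in Python. Everything in it is an assumption for illustration: the brewery names are made up rather than taken from beer-sample, `view_range` is a hypothetical helper that treats both keys as inclusive, and `approx_collation_key` only approximates Unicode Collation for ASCII letters and digits (case-insensitive comparison first, then lowercase before uppercase).

```python
def approx_collation_key(s):
    # Approximation only: ignore case first, then lowercase before uppercase.
    return (s.lower(), tuple(c.isupper() for c in s))


def view_range(names, startkey, endkey):
    # Keep names whose key falls between startkey and endkey (both inclusive),
    # returned in the approximate collation order.
    lo, hi = approx_collation_key(startkey), approx_collation_key(endkey)
    hits = [n for n in names if lo <= approx_collation_key(n) <= hi]
    return sorted(hits, key=approx_collation_key)


# Hypothetical brewery names, all starting with an uppercase letter.
names = ["Xylo Brewery", "Yellow Door Brewing", "Yonder Ales", "Zed Brewing"]

upper_y = view_range(names, "Y", "z")
any_y = view_range(names, "y", "z")
lower_only = view_range(names, "y", "Y")

# Add one hypothetical lowercase name and repeat the first query.
with_lower = view_range(names + ["yeti brewpub"], "Y", "z")

print(upper_y)
print(any_y)
print(lower_only)
print(with_lower)
```

In this sketch the first two queries return the same two Y breweries and the third returns nothing, in line with the results described above. The last query also picks up the lowercase name, because under this approximation case only breaks ties between otherwise identical strings, so it is worth testing range boundaries against your own data.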

This last one is probably the least intuitive if you are coming from a Byte Order (ASCII) mentality! Hope this is helpful for those doing range queries!

@scalabl3

PS. If you are interested in the Unicode Collation topic, you can learn more about it at these URLs:

http://www.unicode.org/reports/tr10/

http://userguide.icu-project.org/collation/customization#TOC-Default-Options

Posted by The Couchbase Team

One Comment

Matt Ingenthron January 20, 2013 at 3:25 pm

Note that the end of the Unicode Latin collation table is \u02ad. Good to know when searching to the end. You may sometimes see \uefff, which works but isn't technically correct.
