January 18, 2013

Understanding Letter Ordering in View Queries

Map/Reduce Views are an important part of Couchbase 2.0, and so is understanding how to query them. Our documentation is great and can be found here: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views.html

One of the subtleties is the Unicode Collation ordering of letters, which differs from Byte Order (like ASCII). Byte Order is what most of us are used to, because most programming languages use it to order strings via string comparisons.

Byte Order (like ASCII)

0123456789 < A-Z < a-z

Unicode Collation (default option)

0123456789 < aAbBcCdDeEfFgGhH...

Notice that letters are grouped: Lowercase then Uppercase of the same letter, rather than a range of all Uppercase followed by a range of all Lowercase.

In addition, non-English accented characters follow a similar principle, like so:

a < á < A < Á < b

Notice that all "a" characters and accented variants occur before capital A and its variants, which likewise occur before any "b" characters.

Ordering Example

In Byte Order (like ASCII), the following indexed keys would be sorted like this:

"ABC123" < "ABC223" < "abc123" < "abc223" < "abcd23" < "bbc123" < "bbcd23"

However, in the Unicode Collation used by Couchbase Views, this is the order they would actually occur in:

"abc123" < "ABC123" < "abc223" < "ABC223" < "abcd23" < "bbc123" < "bbcd23"
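If you want to play with the two orderings above, here is a minimal Python sketch. Plain sorted() compares strings by code point, which gives the Byte Order result. The collation_key function is a hypothetical, simplified stand-in that I made up for illustration: it compares keys case-insensitively first and only then puts Lowercase before Uppercase. It is not the collation library Couchbase actually uses, and it does not attempt to handle accented characters.

```python
def byte_order(keys):
    """Sort by code point, which is what plain string comparison does."""
    return sorted(keys)


def collation_key(s):
    # Simplified stand-in for Unicode Collation (an assumption, not the
    # real algorithm): compare the keys ignoring case first, then break
    # ties so that lowercase (False) sorts before uppercase (True).
    return (s.lower(), [c.isupper() for c in s])


def collated_order(keys):
    """Sort with the simplified collation key defined above."""
    return sorted(keys, key=collation_key)


# The example keys from this post, deliberately shuffled.
keys = ["bbcd23", "abc223", "ABC123", "abcd23", "abc123", "bbc123", "ABC223"]

print(byte_order(keys))
print(collated_order(keys))
```

The first line printed matches the Byte Order list and the second matches the Unicode Collation list shown above.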

So when determining your startkey and endkey ranges for strings, it's important to know this ordering! 


For instance, suppose you are using the beer-sample database that is packaged with Couchbase 2.0 and want to query Breweries by_name:

Breweries starting with Uppercase Y:

Will return only those starting with Uppercase Y!

Breweries starting with Lowercase y or Uppercase Y:

With the supplied data this happens to return only those with Uppercase Y, but it would also include any starting with Lowercase y!

Breweries starting with Lowercase y only:

Should return no results with supplied data, because they are all Uppercase Y!

This last one is probably the least intuitive if you are coming from a Byte Order (ASCII) mentality! Hope this is helpful for those doing range queries!


PS. If you are interested in the Unicode Collation topic, you can learn more about it at these URLs: