Overview

ANSI JOIN support is added in N1QL to Couchbase version 5.5. Previous versions of Couchbase only support lookup join and index join. Lookup join and index join works great when the document key from one side of the join can be produced by the other side of the join, i.e., joining on a parent-child or child-parent relationship through document key. Where they fall short is when the join is on arbitrary fields or expressions of fields, or when multiple join conditions are required. ANSI JOIN is a standardized join syntax widely used in relational databases. ANSI JOIN is much more flexible than lookup join and index join, allowing join to be done on arbitrary expressions on any fields in a document, this makes join operations much simpler and more powerful.

ANSI JOIN syntax:

lhs-expression [ join-type ] JOIN rhs-keyspace ON [ join-condition ]

The left-hand side of the join, lhs-expression can be a keyspace, a N1QL expression, a subquery, or a previous join. The right-hand side of the join, rhs-keyspace, must be a keyspace. The ON-clause specifies the join condition, which can be any arbitrary expression, although it should contain predicates that allows an index scan on the right-hand side keyspace. Join-type can be INNER, LEFT OUTER, RIGHT OUTER. The INNER and OUTER keywords are optional, thus JOIN is the same as INNER JOIN, and LEFT JOIN is the same as LEFT OUTER JOIN. In relational databases join-type can also be FULL OUTER or CROSS, although FULL OUTER JOIN and CROSS JOIN are not supported currently in N1QL.

Details of ANSI JOIN support

We’ll use examples to show you new ways you can run queries using ANSI JOIN syntax, and how to transform your existing join queries in N1QL from lookup join or index join syntax into new ANSI JOIN syntax. It should be noted that lookup join and index join will continue to be supported in N1QL for backward compatibility, however you cannot mix lookup join or index join with the new ANSI JOIN syntax in the same query block, thus customers are encouraged to migrate to the new ANSI JOIN syntax.

To follow along, install travel-sample sample bucket.

Example 1: ANSI JOIN with arbitrary join condition

The join condition (ON-clause) for ANSI JOIN can be any expression, involving any fields of the documents being joined. For example:

Required index:

Optional index:

Query:

In this query we are joining a field (“faa”) from the airport document with a field (“sourceairport”) from the route document (see the ON clause of the join). Such join is not possible with lookup join or index join in N1QL, since both requires joining on document key only.

ANSI JOIN requires an appropriate index on the right-hand side keyspace (“Required index” above). You can also create other indexes (e.g. “Optional index” above) to speed up your query. Without the optional index a primary scan will be used and query still works, however without the required index the query will not work and will return an error.

Looking at the explain:

You will see a NestedLoopJoin operator is used to perform the join, and underneath that an IndexScan3 operator is used to access the right-hand side keyspace, “route”. The spans for the index scan looks like:

The index scan for the right-hand side keyspace (“route”) is using a field (“faa”) from the left-hand side keyspace (“airport”) as search key. For each document from outer side keyspace “airport” the NestedLoopJoin operator performs an index scan on the inner side keyspace “route” to find matching documents, and produces join results. The join is performed in a nested-loop fashion, where the outer loop produces document from outer side keyspace, and a nested inner loop searches for matching inner side document for the current outer side document.

The explain information can also be view graphically on Query Workbench, by clicking the Explain button followed by the Plan button:

Visual Explain for ANSI JOIN

In this example the index scan on the right-hand side keyspace is a covered index scan. In case the index scan is not covered, a fetch operator will be following the index scan operator to fetch the document.

It should be noted that nested-loop join requires an appropriate secondary index on the right-hand side keyspace of ANSI JOIN. Primary index is not considered for this purpose. If an appropriate secondary cannot be found, an error will be returned for the query.

In addition, you might have noticed that the filter route.type = “route” appears in the ON-clause as well. The ON-clause is different than the WHERE clause in that the ON-clause is evaluated as part of the join, while the WHERE clause is evaluated after all joins are done. This distinction is important, especially for outer joins. Therefore it is recommended that you include filters on the right-hand side keyspace for a join in the ON-clause as well, in addition to any join filters.

Example 2: ANSI JOIN with multiple join conditions

While lookup join and index join only joins on a single join condition (equality of document key), the ON-clause of ANSI JOIN can contain multiple join conditions.

Required index:

Optional index:

Query:

Looking at the explain, the index spans for the index (“landmark_city_country”) of the right-hand side keyspace (“landmark”) is:

Thus multiple join predicates can potentially generate multiple index search keys for the index scan of the inner side of a nested-loop join.

Example 3: ANSI JOIN with complex join expressions

The join condition in the ON-clause can be complex join expression. For example, the “airlineid” field in “route” document corresponds to the document key for “airline” document, but it can also be constructed by concatenating “airline_” with the “id” field of the “airline” document.

Required index:

Optional index:

Query:

The explain contains the following index spans for the right-hand side keyspace(“route”):

The expression will be evaluated at runtime to generate the search keys for the index scan on the inner side of nested-loop join.

Example 4: ANSI JOIN with IN clause

The join condition does not need to be an equality predicate. An IN-clause can be used as join condition.

Required index:

Optional index:

Query:

The explain contains the following index spans for the right-hand side keyspace(“airport”):

 

Example 5: ANSI JOIN with OR clause

Similar to IN-clause, the join condition for an ANSI JOIN can also contain an OR-clause. Different arms of the OR-clause can potentially reference different fields of the right-hand side keyspace, as long as appropriate indexes exists.

Required index (route_airports index same as example 1):

Optional index (same as example 1):

Query:

The explain shows an UnionScan being used under NestedLoopJoin, to handle the OR-clause:

 

Example 6: ANSI JOIN with hints

For lookup join and index join, hints can only be specified on the keyspace on the left-hand side of the join. For ANSI JOIN, hints can be specified on the right-hand side keyspace as well. Using the same query as example 1 (with addition of USE INDEX hint):

The USE INDEX hint limits the number of indexes the planner needs to consider for performing the join.

Hints can also be specified on the left-hand side keyspace of ANSI JOIN.

 

Example 7: ANSI LEFT OUTER JOIN

So far we’ve been looking at inner joins. You can also perform LEFT OUTER JOIN by just including LEFT or LEFT OUTER keywords in front of JOIN keyword in join specification.

Required index (same as example 1):

Optional index (same as example 1):

Query:

The result set for this query contains all the joined results, as well as any left-hand side (“airport”) document that does not join with the right-hand side (“route”) document, according to semantics of LEFT OUTER JOIN. Thus you’ll find results that just contain airport.airportname but not route.airlineid (which is missing). You can also select just the left-hand side (“airport”) document that does not join with right-hand side (“route”) document by adding a IS MISSING predicate on the right-hand side keyspace (“route”):

 

Example 8: ANSI RIGHT OUTER JOIN

ANSI RIGHT OUTER JOIN is similar to ANSI LEFT OUTER JOIN except we preserve the right-hand side document if no join occurs. We can modify the query in example 7 by switching the left-hand side and right-hand side keyspaces, and replacing LEFT keyword with RIGHT keyword:

Note that although we switched airport and route in join specification, the filter on route (now the left-hand side keyspace) still appears in the ON-clause of the join, since route is still on the subservient side in this outer join.

RIGHT OUTER JOIN is internally converted to LEFT OUTER JOIN.

If a query contains multiple joins, a RIGHT OUTER JOIN can only be the first join specified. Since N1QL only support linear joins, i.e., the right-hand side of each join must be a single keyspace, if a RIGHT OUTER JOIN is not the first join specified, then after converting to LEFT OUTER JOIN, the right-hand side of the join now contains multiple keyspaces, which is not supported. If you specify RIGHT OUTER JOIN in any position other than the first join, a syntax error will be returned.

Example 9: ANSI JOIN using Hash Join

N1QL supports two join methods for ANSI JOIN. The default join method for an ANSI JOIN is nested-loop join. The alternative is hash join. Hash join uses a hash table to match documents from both sides of the join. Hash join has a build side and a probe side, where each document from the build side is inserted into a hash table based on values of equi-join expression from the build side; subsequently each document from the probe side looks up from the hash table based on values of equi-join expression from the probe side. If a match is found then the join operation is performed.

Compared with nested-loop join, hash join can be more efficient when the join is large, e.g., when there are tens of thousand or more documents from the left-hand side of the join (after applying filters). If using nested-loop join, then for each document from the left-hand side an index scan needs to be performed on the right-hand side index. As the number of documents from the left-hand side increases, nested-loop join becomes less efficient.

For hash join, the smaller side of the join should be used for building the hash table, and the larger side of the join should be used for probing the hash table. It should be noted that hash join does require more memory than nested-loop join, since an in-memory hash table is required. The amount of memory required is proportional to the number of documents from the build side, as well as average size of each document.

Hash join is supported in enterprise edition only. To use hash join, a USE HASH hint must be specified on the right-hand side keyspace of ANSI JOIN. Using the same query as example 1:

The USE HASH(build) hint directs the N1QL planner to perform hash join for the ANSI JOIN specified, and the right-hand side keyspace (“route”) is used on the build side of the hash join. Looking at the explain, there is a HashJoin operator:

The child operator (“~child”) for a HashJoin operator is always the build side of the hash join. For this query, it’s an index scan on the right-hand side keyspace “route”.

Note that for accessing the “route” document we can no longer use information from the left-hand side keyspace (“airport”) for index search key (look at the “spans” information in the explain section above). Unlike nested-loop join, the index scan on “route” is no longer tied to an individual document from the left-hand side, and thus no value from the “airport” document can be used as search key for the index scan on “route”.

The USE HASH(build) hint used in the query above directs the planner to use the right-hand side keyspace as the build side of the hash join. You can also specify USE HASH(probe) hint to direct the planner to use the right-hand side keyspace as the probe side of the hash join.

Looking at the explain, you’ll find the HashJoin operator:

The child operator (“~child”) for HashJoin is an intersect index scan on the left-hand side keyspace of the ANSI JOIN, “airport”, followed by a fetch operator.

The USE HASH hint can only be specified on the right-hand side keyspace in an ANSI JOIN. Therefore depending on whether you want the right-hand side keyspace to be the build side or the probe side of a hash join, a USE HASH(build) or USE HASH(probe) hint should be specified on the right-hand side keyspace.

Hash join is only considered when USE HASH(build) or USE HASH(probe) hint is specified. Hash join requires equality join predicates to work. Nested-loop join requires an appropriate secondary index on the right-hand side keyspace, hash join does not (a primary index scan is an option for hash join). However, hash join does require more memory than nested-loop join since an in-memory hash table is required for hash join to work. In addition, hash join is considered a “blocking” operation, meaning the query engine must finish building the hash table before it can produce the first join result, thus for queries needing only the first few results quickly (e.g. with a LIMIT clause) hash join may not be the best fit.

If a USE HASH hint is specified, but a hash join cannot be generated successfully (e.g., lack of equality join predicates), then a nested-loop join will be considered.

Example 10: ANSI JOIN with multiple hints

You can now specify multiple hints for a keyspace on the right-hand side of an ANSI JOIN. For example, USE HASH hint can be used together with USE INDEX hint.

Note when multiple hints are used together, you only need to specify the “USE” keyword once, as in the example above.

USE HASH hint can also be combined with USE KEYS hint.

Example 11: ANSI JOIN with multiple joins

ANSI JOIN can be chained together. For example:

Required indexes (route_airports index same as example 1):

Optional index (same as example 1):

Query:

Since there is no USE HASH hint specified in the query the explain should show two NestedLoopJoin operators.

You can mix hash join with nested-loop join by adding USE HASH hint to any of the joins in a chain of ANSI JOINs.

or

The visual explain for the last query is follows:

As mentioned before, N1QL only supports linear joins, i.e., the right-hand side of each join must be a keyspace.

Example 12: ANSI JOIN involving right-hand side arrays

Although ANSI JOIN comes from SQL standard, since Couchbase N1QL handles JSON documents and array is an important aspect of JSON, we extended ANSI JOIN support to arrays as well.

For examples in array handling please create a bucket “default” and insert the following documents:

Then create the following indexes:

When the join predicate involves an array on the right-hand side of ANSI JOIN, you need to create an array index on the right-hand side keyspace.

Query:

Note that part of the join condition is an ANY clause which specifies that the left-hand side field b1.c12 can match any element of the right-hand side array b2.a21. For this join to work properly, we need an array index on b2.a21, e.g., default_ix_right index created above.

The explain plan shows a NestedLoopJoin, with child operator being a distinct scan on the array index default_ix_right.

 

Example 13: ANSI JOIN involving left-hand side arrays

If ANSI JOIN involves an array on the left-hand side, then there are two options for performing the join.

Option 1: use UNNEST

Use UNNEST clause to flatten the left-hand side array first before performing the join.

After the UNNEST the array becomes individual fields, and the subsequent join is just like a “regular” ANSI JOIN with fields from both sides.

Option 2: use IN-clause

Alternatively, use IN-clause as join condition.

The IN-clause is satisfied when any element of the array on the left-hand side keyspace (“b1.a11”) matches the right-hand side field (“b2.c21”).

Note that there is a semantics difference between the two options. When there are duplicates in the array, the UNNEST option does not care about duplicates and will flatten the left-hand side documents to as many documents as number of elements in the array, thus may produce duplicated results; the IN-clause option will not produce duplicated results if there are duplicated elements in the array. In addition, when LEFT OUTER JOIN is performed, there may be different number of preserved documents due to the flattening of the array with the UNNEST option. Thus the user is advised to pick the option that reflect the semantics needed for the query.

Example 14: ANSI JOIN involving arrays on both sides

Although uncommon, it is possible to perform an ANSI JOIN when both sides of the join are arrays. In such cases, you can use a combination of the techniques described above. Use array index to handle array on the right-hand side, and use either UNNEST option or IN-clause option to handle array on the left-hand side.

Option 1: use UNNEST clause

Option 2: use IN-clause

Again the two options are not semantically identical, and may give different results. Pick the option that reflects the semantics desired.

Example 15: lookup join migration

N1QL will continue to support lookup join and index join for backward compatibility, however, you cannot mix ANSI JOIN with lookup join or index join in the same query. You can convert your existing queries from using lookup join and index join to the ANSI JOIN syntax. This example shows you how to convert a lookup join into ANSI JOIN syntax.

Create the following index to speed up the query (same as example 1):

This is a query using lookup join syntax (note the ON KEYS clause):

In lookup join the left-hand side of the join (“route”) needs to produce document keys for the right-hand side of the join (“airline”), this is achieved by the ON KEYS clause. The join condition (which is implied from the syntax) is route.airlineid = meta(airline).id, thus the same query can be specified using ANSI JOIN syntax:

In this example the ON KEYS clause contain a single document key. It’s possible for the ON KEYS clause to contain an array of document keys, in which case the converted ON clause will be in the form of an IN clause instead of an equality clause. Let’s assume each route document has an array of document keys for airline, then the original ON KEYS clause:  

can be converted to:

 

Example 16: index join migration

This example shows you how to convert an index join into ANSI JOIN syntax.

Required index (same as example 3):

Optional index (same as example 3):

Query using index join syntax (note the ON KEY … FOR … clause):

In index join the document key for left-hand side (“airline”) is used to probe an index on an expression (“route.airlineid” which appears in the ON KEY clause) from the right-hand side (“route”) that corresponds to the document key for the left-hand side (“airline” which appears in the FOR clause). The join condition (implied from syntax) is route.airlineid = meta(airline).id, thus the same query can be specified using ANSI JOIN syntax:

 

Example 17: ANSI NEST

Couchbase N1QL supports NEST operation. Previously NEST can be done using lookup nest or index nest, similar to lookup join and index join, respectively. With ANSI JOIN support, NEST operation can also be done using similar syntax, i.e., using ON clause instead of ON KEYS (lookup nest) or ON KEY … FOR … (index nest) clauses. This new variant is referred to as ANSI NEST.

Required index (route_airports index same as example 1, route_airline_distance index same as example 4):

Optional index:

Query:

As you can see the syntax for ANSI NEST is very similar to that of ANSI JOIN. There is one peculiar property for nest though. By definition the nest operation creates an array of all matching right-hand side document for each left-hand side document, which means the reference to the right-hand side keyspace, “route” in this query, has different meaning depending on where the reference is. The ON-clause is evaluated as part of the NEST operation, and thus references to “route” is referencing a single document. In contrast, references in the projection clause, or the WHERE clause, are evaluated after the NEST operation, and thus references to “route” means the nested array, thus it should be treated as an array. Notice the projection clause of the query above has an ARRAY construct with a FOR clause to access each individual document within the array (i.e., the reference to “route” is now in an array context).

Summary

ANSI JOIN provides much more flexibility in join operations in Couchbase N1QL, compared to previously supported lookup join and index join, both of which requires joining on document key only. The examples above show various ways you can use ANSI JOIN in queries. Since ANSI JOIN is widely used in relational world, the support for ANSI JOIN in Couchbase N1QL should make it much easier to migrate applications from a relational database to Couchbase N1QL.

 

Posted by Bingjie Miao, Senior Software Engineer, Couchbase

Bingjie Miao is a senior software engineer in Couchbase. Bingjie has 20 years of experience in relational and NoSQL databases. His main area of expertise is query optimization and query execution.

2 Comments

  1. finally 🙂

  2. guy.klages@couchbase.com March 6, 2018 at 5:32 pm

    Awesome examples!

Leave a reply