Matthew’s blog introduces the Sub-Document (subdoc) API feature with a short overview. In summary, subdoc allows efficient access to parts of documents (sub-documents) without requiring the transfer of the entire document over the network.

Throughout this blog, we’ll use a reference document, which will be accessed in various ways using the subdoc API. Note that each subdocument operation saves doc_size - op_size bytes of bandwidth, where doc_size is the length of the document and op_size is the combined length of the path and the sub-document value.

The document below is 500 bytes. Performing a simple get() would consume 500 bytes (plus protocol overhead) on the server response. If you only care about the delivery address, you could issue a lookup_in('customer123', SD.get('addresses.delivery')) call. You would receive only about 120 bytes over the network, a savings of roughly 380 bytes, using about a quarter of the bandwidth of the equivalent full-document (fulldoc) operation.
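The doc_size - op_size arithmetic can be checked with a short, self-contained sketch. The document here is a hypothetical stand-in (its field names and values are my assumptions, not the reference document itself), and sizes are measured on compact JSON encodings rather than on real network traffic.

```python
import json

# Hypothetical customer document; field names and values are illustrative only.
doc = {
    "name": "Douglas Reynholm",
    "email": "douglas@example.com",
    "addresses": {
        "billing": {"line1": "123 Any Street", "country": "United Kingdom"},
        "delivery": {"line1": "123 Any Street", "country": "United Kingdom"},
    },
    "purchases": {"complete": [339, 976, 442, 666], "abandoned": [157, 42, 999]},
}


def encoded_size(value):
    """Length in bytes of the compact JSON encoding of a value."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def bytes_saved(document, path, sub_value):
    """doc_size - op_size, where op_size covers the path and the sub-document value."""
    doc_size = encoded_size(document)
    op_size = len(path.encode("utf-8")) + encoded_size(sub_value)
    return doc_size - op_size


saved = bytes_saved(doc, "addresses.delivery", doc["addresses"]["delivery"])
print("bytes saved:", saved)
```

Protocol overhead is ignored on both sides, so the figure is only an approximation of what a real request would save.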

I’ll be demonstrating examples using a development branch of the Python SDK and Couchbase Server 4.5 Developer Preview.
[EDIT: An experimental version of the sub-document API is now available in the latest Python SDK, version 2.0.8, and the examples below have been updated to reflect the latest API]

You can read about the other new features in Couchbase 4.5 in Don Pinto’s blog post.

Subdoc operations

A subdoc operation is a single action for a single path in a document. This may be expressed as GET('addresses.billing') or ARRAY_APPEND('purchases.abandoned', 42). Some operations are lookups (they simply return data without modifying the document) while some are mutations (they modify the contents of the document).

Many of the subdoc operations are smaller scale equivalents of fulldoc operations. It helps to think of a single document as being itself a miniature key-value store. In the Python SDK, operations can be specified via special functions in the couchbase.subdocument module, which I will abbreviate in the rest of this blog as SD. This is done by importing the module under that alias: import couchbase.subdocument as SD.

While looking at these operations, note that what is being transmitted over the network is only the arguments passed to the subdoc API itself, rather than the contents of the entire document (as would be with fulldoc). While the document itself may seem small, even a simple

Lookup Operations

Lookup operations query the document for a certain path and return the value at that path. You have a choice of actually retrieving the value using the GET operation, or simply checking whether the path exists using the EXISTS operation. The latter saves even more bandwidth by not retrieving the contents of the path when they are not needed.

In the second snippet, I also show how to access the last element of an array, using the special [-1] path component.

We can also combine these two operations:
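To make the lookup semantics concrete without a server, here is a purely local model of GET, EXISTS and the [-1] path component, combined in one call. It is not the SDK: parse_path, resolve and this lookup_in are hypothetical helpers of mine, and the (found, value) result shape is an assumption chosen for readability.

```python
import re

# Matches either a dictionary key or an [index] component of a path.
_COMPONENT = re.compile(r"([^.\[\]]+)|\[(-?\d+)\]")
_MISSING = object()


def parse_path(path):
    """Split 'a.b[0].c' into ['a', 'b', 0, 'c']."""
    return [name if name else int(index) for name, index in _COMPONENT.findall(path)]


def resolve(doc, path):
    """Return the value at path, or _MISSING if any component is absent."""
    node = doc
    for part in parse_path(path):
        if isinstance(part, str) and isinstance(node, dict) and part in node:
            node = node[part]
        elif (isinstance(part, int) and isinstance(node, list) and node
              and -1 <= part < len(node)):
            # Only -1 is accepted as a negative index in this model.
            node = node[part]
        else:
            return _MISSING
    return node


def lookup_in(doc, *specs):
    """Run several ('get' | 'exists', path) specs against one document."""
    results = []
    for op, path in specs:
        value = resolve(doc, path)
        found = value is not _MISSING
        if op == "exists":
            results.append((found, None))  # existence only, no value returned
        elif op == "get":
            results.append((found, value if found else None))
        else:
            raise ValueError("unknown lookup operation: %r" % op)
    return results


customer = {
    "addresses": {"delivery": {"country": "United Kingdom"}},
    "purchases": {"complete": [339, 976, 442, 666]},
}
results = lookup_in(
    customer,
    ("get", "addresses.delivery.country"),
    ("exists", "purchases.pending"),
    ("get", "purchases.complete[-1]"),
)
```

Each spec gets its own result, so a missing path in one spec does not hide the values found by the others.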

Mutation Operations

Mutation operations modify one or more paths in the document. These operations can be divided into several groups:

  • Dictionary/Object operations: These operations write the value of a JSON dictionary key.
  • Array/List operations: These operations add elements to a JSON array/list.
  • Generic operations: These operations modify the existing value itself and are container-agnostic.

Mutation operations are all or nothing, meaning that either all the operations within mutate_in are successful, or none of them are.
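The all-or-nothing behaviour can be modeled locally by applying every operation to a working copy and committing only if none of them fails. This is a sketch of the semantics only; the store dictionary, PathError and these upsert/insert/mutate_in functions are hypothetical stand-ins and say nothing about how the server implements it.

```python
import copy


class PathError(Exception):
    """Raised when a single sub-operation cannot be applied."""


def upsert(doc, key, value):
    """Set the key whether or not it already exists."""
    doc[key] = value


def insert(doc, key, value):
    """Set the key only if it does not exist yet."""
    if key in doc:
        raise PathError("path %r already exists" % key)
    doc[key] = value


def mutate_in(store, doc_id, *ops):
    """Apply every (function, key, value) operation, or none of them."""
    working = copy.deepcopy(store[doc_id])
    for func, key, value in ops:
        func(working, key, value)  # any exception aborts before the commit below
    store[doc_id] = working


store = {"customer123": {"name": "Douglas"}}

# Both operations succeed, so both are committed.
mutate_in(store, "customer123",
          (upsert, "email", "d@example.com"),
          (insert, "age", 42))

# The insert fails because 'name' exists, so the upsert is discarded as well.
try:
    mutate_in(store, "customer123",
              (upsert, "email", "changed@example.com"),
              (insert, "name", "Someone Else"))
    second_call_failed = False
except PathError:
    second_call_failed = True
```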

Dictionary operations

The simplest of these operations is UPSERT. Just like the fulldoc-level upsert, this will either modify the value of an existing path or create it if it does not exist:

In addition to UPSERT, the INSERT operation will only add the new value to the path if it does not exist.

While the above operation will fail, note that anything valid as a full-doc value is also valid as a subdoc value, as long as it can be serialized as JSON. The Python SDK serializes the above value to [42, true, null].
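The serialized form can be reproduced with the standard json module. I am assuming here that the value in question is the Python tuple (42, True, None), and that the SDK's default JSON encoding matches the standard library for such a value.

```python
import json

# Assumed input value: a tuple mixing an integer, a boolean and None.
value = (42, True, None)

# Tuples become JSON arrays, True becomes true, and None becomes null.
encoded = json.dumps(value)
print(encoded)
```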

Dictionary values can also be replaced or removed:

Array Operations

True array append (ARRAY_APPEND) and prepend (ARRAY_PREPEND) operations can also be performed using subdoc. Unlike fulldoc append/prepend operations (which simply concatenate bytes to the existing value), subdoc append and prepend are JSON-aware:
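The difference between byte concatenation and a JSON-aware append can be seen in a small local sketch. The array_append and array_prepend functions below are hypothetical models that decode and re-encode the whole value; they only illustrate the result and are not a description of how the server does it.

```python
import json

stored = b"[1,2,3]"

# Byte-level append, as a fulldoc append would do: the result is not valid JSON.
concatenated = stored + b",4"
try:
    json.loads(concatenated.decode("utf-8"))
    concatenated_is_json = True
except json.JSONDecodeError:
    concatenated_is_json = False


def _load_array(raw):
    """Decode raw bytes and check that they hold a JSON array."""
    array = json.loads(raw.decode("utf-8"))
    if not isinstance(array, list):
        raise TypeError("stored value is not a JSON array")
    return array


def _dump(array):
    """Encode the array back to compact JSON bytes."""
    return json.dumps(array, separators=(",", ":")).encode("utf-8")


def array_append(raw, value):
    """JSON-aware append: the new element goes inside the array, at the end."""
    array = _load_array(raw)
    array.append(value)
    return _dump(array)


def array_prepend(raw, value):
    """JSON-aware prepend: the new element goes inside the array, at the front."""
    array = _load_array(raw)
    array.insert(0, value)
    return _dump(array)


appended = array_append(stored, 4)
prepended = array_prepend(stored, 0)
```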

You can make an array-only document as well, and then perform array_ operations using an empty path:

Limited support also exists for treating arrays like unique sets, using the ARRAY_ADDUNIQUE command. This will do a check to determine if the given value exists or not before actually adding the item to the array:
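The check-then-add behaviour can be modeled in a few lines. This array_addunique and the PathExists error are hypothetical local stand-ins. Comparing JSON encodings (rather than Python equality) is my own choice, made so that true and 1 are treated as different values, as they are in JSON.

```python
import json


class PathExists(Exception):
    """Raised when the value is already present in the array."""


def array_addunique(array, value):
    """Append value only if no element with the same JSON encoding exists."""
    encoded = json.dumps(value)
    if any(json.dumps(item) == encoded for item in array):
        raise PathExists("value %s already in array" % encoded)
    array.append(value)


tags = ["books"]
array_addunique(tags, "music")  # new value: added

try:
    array_addunique(tags, "books")  # duplicate: rejected, array unchanged
    duplicate_rejected = False
except PathExists:
    duplicate_rejected = True

flags = [1]
array_addunique(flags, True)  # true and 1 are distinct JSON values
```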

Array operations can also be used as the basis for efficient FIFO or LIFO queues. First, create the queue:

Adding items to the end:

Consuming an item from the beginning:

The example above performs a GET followed by a REMOVE. The REMOVE is only performed once the application already has the job, and it will only succeed if the document has not since changed (to ensure that the first item in the queue is the one we’ve just removed).
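The GET-then-REMOVE pattern with a CAS check can be sketched against an in-memory stand-in. QueueDoc, CasMismatch and consume are hypothetical names, and the integer CAS that increments on every change is a simplification of mine; the point is only that the REMOVE is refused if anything changed after the GET.

```python
class CasMismatch(Exception):
    """Raised when the document changed after the CAS value was read."""


class QueueDoc:
    """Toy array-only document whose CAS changes on every mutation."""

    def __init__(self):
        self.items = []
        self.cas = 1

    def array_append(self, value):
        self.items.append(value)
        self.cas += 1

    def get_first(self):
        """Return the first item together with the current CAS."""
        return self.items[0], self.cas

    def remove_first(self, cas):
        """Remove the first item only if the document is unchanged."""
        if cas != self.cas:
            raise CasMismatch("document changed since it was read")
        del self.items[0]
        self.cas += 1


def consume(queue, max_retries=10):
    """Read the first job, then remove it; retry if the CAS check fails."""
    for _ in range(max_retries):
        job, cas = queue.get_first()
        try:
            queue.remove_first(cas)
        except CasMismatch:
            continue  # someone else changed the queue: read again
        return job
    raise RuntimeError("could not consume a job after %d attempts" % max_retries)


queue = QueueDoc()
queue.array_append("job:953")
queue.array_append("job:954")
first = consume(queue)

# A CAS read before a later change is refused.
_, stale_cas = queue.get_first()
queue.array_append("job:955")
try:
    queue.remove_first(stale_cas)
    stale_remove_succeeded = True
except CasMismatch:
    stale_remove_succeeded = False
```

Because the job is read before it is removed, a failure between the two steps leaves the job in the queue rather than losing it.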

Counter Operations

Counter operations allow the manipulation of a numeric value inside a document. These operations are logically similar to the counter operation on an entire document.

The COUNTER operation performs simple arithmetic against a numeric value (the value is created if it does not yet exist).

COUNTER can also decrement:

Note that the existing value for counter operations must be within the range of a 64-bit signed integer.
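The counter rules can be modeled locally as follows. This counter function and CounterRangeError are hypothetical. Starting a missing value at 0 and rejecting results that leave the 64-bit range are assumptions of this sketch rather than documented server behaviour.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


class CounterRangeError(Exception):
    """Raised when a value falls outside the signed 64-bit range."""


def _check_range(number):
    if not INT64_MIN <= number <= INT64_MAX:
        raise CounterRangeError("%d is outside the signed 64-bit range" % number)


def counter(doc, key, delta):
    """Add delta to doc[key], creating the value (from 0) if it is missing."""
    current = doc.get(key, 0)
    if isinstance(current, bool) or not isinstance(current, int):
        raise TypeError("existing value is not an integer")
    _check_range(current)
    result = current + delta
    _check_range(result)
    doc[key] = result
    return result


profile = {}
after_increment = counter(profile, "logins", 1)     # value is created
after_decrement = counter(profile, "logins", -150)  # negative delta decrements

big = {"hits": INT64_MAX}
try:
    counter(big, "hits", 1)
    overflow_rejected = False
except CounterRangeError:
    overflow_rejected = True
```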

Creation of Intermediates

All of the examples above refer to creating a single new field within an existing dictionary. Creating a new hierarchy, however, will result in an error:

Despite the operation being an UPSERT, subdoc will refuse to create missing hierarchies by default. The create_parents option, however, allows it to succeed. At the protocol level the option is called F_MKDIRP, like the -p option of the mkdir command on Unix-like platforms.
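The effect of create_parents can be modeled on a plain dictionary. This upsert function and the PathNotFound error are hypothetical local stand-ins; only the option name create_parents comes from the text above.

```python
class PathNotFound(Exception):
    """Raised when an intermediate path component is missing or not a dictionary."""


def upsert(doc, path, value, create_parents=False):
    """Set the value at a dotted path, optionally creating missing parents."""
    *parents, leaf = path.split(".")
    node = doc
    for name in parents:
        if not isinstance(node.get(name), dict):
            # An existing non-dictionary parent is always an error; a missing
            # parent is an error unless create_parents is set.
            if name in node or not create_parents:
                raise PathNotFound("cannot descend into %r" % name)
            node[name] = {}
        node = node[name]
    node[leaf] = value


customer = {"name": "Douglas"}

try:
    upsert(customer, "phones.home.number", "555-0100")
    default_call_failed = False
except PathNotFound:
    default_call_failed = True

snapshot_after_failure = dict(customer)

# With create_parents, the missing 'phones' and 'home' levels are created.
upsert(customer, "phones.home.number", "555-0100", create_parents=True)
```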

Subdocument and CAS

Subdoc mostly eliminates the need for tracking CAS. Subdoc operations are atomic and therefore if two different threads access two different sub-documents then no conflict will arise. For example the following two blocks can execute concurrently without any risk of conflict:

Even when modifying the same part of the document, operations will not necessarily conflict. For example, two concurrent ARRAY_PREPEND operations on the same array will both succeed, and neither overwrites the other.

This does not mean that CAS is no longer required. Sometimes it’s important to ensure the entire document didn’t change state since the last operation. This is especially important in the case of REMOVE operations, to ensure that the element being removed was not already replaced by something else.
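A toy model shows why two concurrent prepends do not overwrite each other: if each operation is applied atomically on the side that owns the document, the order may vary but both values end up in the array. The Document class and its lock are hypothetical stand-ins for that atomicity, not a description of the server internals.

```python
import copy
import threading


class Document:
    """Toy stand-in for a stored document: each operation runs under a lock."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def array_prepend(self, key, item):
        """Atomically insert item at the front of the array stored under key."""
        with self._lock:
            self._value[key].insert(0, item)

    def snapshot(self):
        """Return a copy of the current document contents."""
        with self._lock:
            return copy.deepcopy(self._value)


doc = Document({"abandoned": [42]})

# Two clients prepend to the same array at the same time.
threads = [
    threading.Thread(target=doc.array_prepend, args=("abandoned", item))
    for item in (1, 2)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

result = doc.snapshot()["abandoned"]
```

The relative order of the two new items depends on scheduling, but neither of them is lost and the original element stays at the end.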

FAQ about Sub-Document Operations in Couchbase

Over the course of developing subdoc, I’ve been asked several questions about what it does, and I’ll respond in turn:

What’s the difference between Subdoc and N1QL?

N1QL is a rich, expressive query language which allows you to search for and possibly mutate multiple documents at once. Subdoc is a high performance API/implementation designed for searching within a single document.

Subdoc is a high performance set of simple, discrete APIs for accessing data within a single document, with a goal of reducing network bandwidth and increasing overall throughput. It is implemented as part of the KV service and is therefore strongly consistent with it.

N1QL is a rich query language capable of searching multiple documents within Couchbase which adhere to certain criteria. It operates outside the KV service, making optimized KV and index requests to satisfy incoming queries. Consistency with the KV service is configurable per query (for example, the USE KEYS clause and the scan_consistency option).

When should I use N1QL and when should I use subdoc?

N1QL answers questions such as Find me all documents where X=42 and Y=77 whereas subdoc answers questions such as Fetch X and Y from document Z. More specifically, subdoc should be used when all the Document IDs are known (in other words, if a N1QL query contains USE KEYS it may be a candidate for subdoc).

The two are not mutually exclusive however, and it is possible to use both N1QL and subdoc in an application.

Are mutate_in and lookup_in atomic?

Yes, they are atomic. Both of these operations are guaranteed to have all their sub-commands (e.g. COUNTER, GET, EXISTS, ARRAY_ADDUNIQUE) operate on the same version of the document.

How do I access multiple documents with subdoc?

There is no bona fide multi operation for subdoc, as subdoc operates within the scope of a single document. Because documents are sharded across the cluster (this is common to Couchbase and all other NoSQL stores), multi operations would not be able to guarantee the same level of transactions and atomicity between documents.

I don’t like the naming convention for arrays. Why didn’t you use append, add, etc.?

There are many languages out there, and it seems each of them has a different idea of what to call array access functions:

  • Generic: add to end, add to front
  • C++: push_back(), push_front()
  • Python: append(), insert(0), extend
  • Perl, Ruby, JavaScript, PHP: push(), unshift()
  • Java, C#: add()

The term append is already used in Couchbase to refer to the full-document byte concatenation, so I considered it inconsistent to use this term in yet a different manner in subdoc.

Why does COUNTER require 64 bit signed integers?

This is a result of the subdoc code being implemented in C++. Future implementations may allow a broader range of existing numeric values (for example, large values, non-integral values, etc.).

How do I perform a pop? Why is there no POP operation?

POP refers to the act of removing an item (e.g. from an array) and returning it, in a single operation.

POP may indeed be implemented in the future, but using it is inherently dangerous:

Because the operation is being done over the network, it is possible for the server to have executed the removal of the item but have the network connection terminated before the client receives the previous value. Because the value is no longer in the document, it is permanently lost.

Can I use CAS with subdoc operations?

Yes. With respect to CAS usage, subdoc operations are normal KV API operations, similar to upsert, get, etc.

Can I use durability requirements with subdoc operations?

Yes. With respect to durability requirements, mutate_in is treated like upsert, insert and replace.

Author

Posted by Mark Nunberg, Software Engineer, Couchbase

Mark Nunberg is a software engineer working at Couchbase. He maintains the C client library (libcouchbase) as well as the Python client. He also developed the Perl client (for use at his previous company) - which initially led him to working at Couchbase. Prior to joining Couchbase, he worked on distributed and high performance routing systems at an eCommerce analytics firm. Mark studied Linguistics at the Hebrew University of Jerusalem.
