Sub-Document API: Using Subdoc to Get (Only) What You Want

Matthew’s blog introduces the Sub-Document (subdoc) API feature with a short overview. In summary, subdoc allows efficient access to parts of documents (sub-documents) without requiring the transfer of the entire document over the network.

Throughout this blog, we’ll use a reference document and access it in various ways using the subdoc API. Note that each subdocument operation saves doc_size - op_size bytes of bandwidth, where doc_size is the length of the document and op_size is the length of the path plus the sub-document value.

The document below is 500 bytes. Performing a simple get() would consume 500 bytes (plus protocol overhead) on the server response. If you only care about the delivery address, you could issue a lookup_in('customer123', SD.get('addresses.delivery')) call. You would receive only about 120 bytes over the network, a savings of roughly 380 bytes, using about a quarter of the bandwidth of the equivalent full-document (fulldoc) operation.

{
  "name": "Douglas Reynholm",
  "email": "douglas@reynholmindustries.com",
  "addresses": {
    "billing": {
      "line1": "123 Any Street",
      "line2": "Anytown ",
      "country": "United Kingdom"
    },
    "delivery": {
      "line1": "123 Any Street",
      "line2": "Anytown ",
      "country": "United Kingdom"
    }
  },
  "purchases": {
    "complete": [
      339, 976, 442, 666
    ],
    "abandoned": [
      157, 42, 999
    ]
  }
}
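
The doc_size - op_size estimate described above can be checked locally with a rough sketch. It rests on assumptions: sizes are measured on the compact JSON encoding produced by Python's json module, protocol overhead is ignored, and the helper names encoded_size and bandwidth_saved are hypothetical rather than part of any SDK. The byte counts depend on whitespace, so they will not necessarily match the figures quoted above.

```python
import json

# The reference document from this post.
customer = {
    "name": "Douglas Reynholm",
    "email": "douglas@reynholmindustries.com",
    "addresses": {
        "billing": {"line1": "123 Any Street", "line2": "Anytown ", "country": "United Kingdom"},
        "delivery": {"line1": "123 Any Street", "line2": "Anytown ", "country": "United Kingdom"},
    },
    "purchases": {"complete": [339, 976, 442, 666], "abandoned": [157, 42, 999]},
}


def encoded_size(value):
    """Length in bytes of the JSON encoding of value."""
    return len(json.dumps(value).encode("utf-8"))


def bandwidth_saved(doc, path, sub_value):
    """Estimate doc_size - op_size for one subdoc operation (hypothetical helper)."""
    doc_size = encoded_size(doc)
    op_size = len(path.encode("utf-8")) + encoded_size(sub_value)
    return doc_size - op_size


delivery = customer["addresses"]["delivery"]
saved = bandwidth_saved(customer, "addresses.delivery", delivery)
print("estimated bytes saved:", saved)
```

The larger the document grows relative to the piece you need, the larger this difference becomes.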


I’ll be demonstrating examples using a development branch of the Python SDK and Couchbase Server 4.5 Developer Preview.
[EDIT: An experimental version of the sub-document API is now available in the latest Python SDK, version 2.0.8, and the examples below have been updated to reflect the latest API]

You can read about the other new features in Couchbase 4.5 in Don Pinto’s blog post.

Subdoc operations

A subdoc operation is a single action for a single path in a document. This may be expressed as GET('addresses.billing') or ARRAY_APPEND('purchases.abandoned', 42). Some operations are lookups (they simply return data without modifying the document) while some are mutations (they modify the contents of the document).

Many of the subdoc operations are smaller-scale equivalents of fulldoc operations. It helps to think of a single document as being itself a miniature key-value store. In the Python SDK, operations are specified via special functions in the couchbase.subdocument module, which I will abbreviate in the rest of this blog as SD. This is done by importing the module:

import couchbase.subdocument as SD


While looking at these operations, note that what is transmitted over the network is only the arguments passed to the subdoc API itself, rather than the contents of the entire document (as would be the case with fulldoc). While the document itself may seem small, even a simple

Lookup Operations

Lookup operations query the document for a certain path and return the value at that path. You have a choice of actually retrieving the value using the GET operation, or simply checking whether the path exists using the EXISTS operation. The latter saves even more bandwidth by not retrieving the contents of the path when they are not needed.

rv = bucket.lookup_in('customer123', SD.get('addresses.delivery.country'))
country = rv[0] # => 'United Kingdom'


rv = bucket.lookup_in('customer123', SD.exists('purchases.pending[-1]'))
rv.exists(0) # check whether the path for the first command exists => False


In the second snippet, I also show how to access the last element of an array, using the special [-1] path component.

We can also combine these two operations:

rv = bucket.lookup_in('customer123',
                  SD.get('addresses.delivery.country'),
                  SD.exists('purchases.pending[-1]'))
rv[0] # => 'United Kingdom'
rv.exists(1) # => False
rv[1] # => SubdocPathNotFoundError
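
To make the path semantics of GET and EXISTS concrete, here is a minimal local sketch that resolves the same kind of paths against a plain Python dictionary. It is not the SDK or the server implementation: parse_path and lookup are hypothetical helpers, and the sketch only models dotted keys and [n] index components as they appear in this post.

```python
import re

# One token is either a key (no dots or brackets) or an [index] component.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(-?\d+)\]")


def parse_path(path):
    """Split 'a.b[-1]' into ['a', 'b', -1]."""
    parts = []
    for key, index in _TOKEN.findall(path):
        parts.append(key if key else int(index))
    return parts


def lookup(doc, path):
    """Return (exists, value) for the path, walking dicts and lists only."""
    current = doc
    for part in parse_path(path):
        if isinstance(part, str) and isinstance(current, dict) and part in current:
            current = current[part]
        elif (isinstance(part, int) and isinstance(current, list)
              and -len(current) <= part < len(current)):
            current = current[part]
        else:
            return False, None
    return True, current


customer = {
    "addresses": {"delivery": {"country": "United Kingdom"}},
    "purchases": {"complete": [339, 976, 442, 666]},
}

print(lookup(customer, "addresses.delivery.country"))  # (True, 'United Kingdom')
print(lookup(customer, "purchases.pending[-1]"))       # (False, None)
print(lookup(customer, "purchases.complete[-1]"))      # (True, 666)
```

Returning an (exists, value) pair mirrors the split between EXISTS and GET: a caller that only needs the first element never has to look at the second.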


Mutation Operations

Mutation operations modify one or more paths in the document. These operations can be divided into several groups:

Dictionary/Object operations: These operations write the value of a JSON dictionary key.
Array/List operations: These operations add elements to a JSON array/list.
Generic operations: These operations modify the existing value itself and are container-agnostic.

Mutation operations are all or nothing, meaning that either all the operations within mutate_in are successful, or none of them are.
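
The all-or-nothing behavior can be pictured with a small local model: apply every operation to a copy and commit only if none of them fails. This is a sketch under assumptions, not how the server is implemented; mutate_all_or_nothing and the two operation functions are hypothetical names.

```python
import copy


def mutate_all_or_nothing(doc, operations):
    """Apply every operation to a copy; commit only if none of them raise."""
    working = copy.deepcopy(doc)
    for operation in operations:
        operation(working)  # any exception aborts before the commit below
    doc.clear()
    doc.update(working)


def upsert_fax(doc):
    doc["fax"] = "775-867-5309"


def insert_name(doc):
    # INSERT-like behavior: fail if the key already exists.
    if "name" in doc:
        raise KeyError("path exists: name")
    doc["name"] = "someone else"


customer = {"name": "Douglas Reynholm"}
try:
    mutate_all_or_nothing(customer, [upsert_fax, insert_name])
except KeyError:
    pass

# The fax field was not kept because the second operation failed.
print(customer)  # {'name': 'Douglas Reynholm'}
```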

Dictionary operations

The simplest of these operations is UPSERT. Just like the fulldoc-level upsert, this will either modify the value of an existing path or create it if it does not exist:

bucket.mutate_in('customer123', SD.upsert('fax', '775-867-5309'))


In addition to UPSERT, the INSERT operation will only add the new value to the path if it does not exist.

bucket.mutate_in('customer123', SD.insert('purchases.complete', [42, True, None]))
# SubdocPathExistsError


While the above operation will fail, note that anything valid as a fulldoc value is also valid as a subdoc value, as long as it can be serialized as JSON. The Python SDK serializes the above value to [42, true, null].
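
The serialization mentioned above can be reproduced with Python's standard json module. Using it here is an assumption about how the value is encoded, made only to illustrate which Python values have a JSON form and which do not.

```python
import json

# Python's True and None map to JSON's true and null.
encoded = json.dumps([42, True, None])
print(encoded)  # [42, true, null]

# A value with no JSON representation cannot be used as a (sub)document value.
try:
    json.dumps({1, 2, 3})  # a set is not JSON-serializable
    set_rejected = False
except TypeError:
    set_rejected = True
print(set_rejected)  # True
```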

Dictionary values can also be replaced or removed:

bucket.mutate_in('customer123',
                 SD.remove('addresses.billing'),
                 SD.replace('email', 'doug96@hotmail.com'))


Array Operations

True array append (ARRAY_APPEND) and prepend (ARRAY_PREPEND) operations can also be performed using subdoc. Unlike fulldoc append/prepend operations (which simply concatenate bytes to the existing value), subdoc append and prepend are JSON-aware:

bucket.mutate_in('customer123', SD.array_append('purchases.complete', 777))
# purchases.complete is now [339, 976, 442, 666, 777]


bucket.mutate_in('customer123', SD.array_prepend('purchases.abandoned', 18))
# purchases.abandoned is now [18, 157, 42, 999]
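
The difference between byte concatenation and a JSON-aware append can be seen locally with the standard json module. This is only an illustration: json_array_append and json_array_prepend are hypothetical stand-ins for ARRAY_APPEND and ARRAY_PREPEND, and they say nothing about how the server does the work.

```python
import json

stored = b"[339, 976, 442, 666]"

# Fulldoc-style append: bytes are concatenated with no knowledge of JSON.
concatenated = stored + b", 777"
try:
    json.loads(concatenated)
    concatenation_is_valid_json = True
except ValueError:
    concatenation_is_valid_json = False


def json_array_append(raw, value):
    """Parse the array, add value at the end, and re-encode it."""
    array = json.loads(raw)
    array.append(value)
    return json.dumps(array).encode("utf-8")


def json_array_prepend(raw, value):
    """Parse the array, add value at the front, and re-encode it."""
    array = json.loads(raw)
    array.insert(0, value)
    return json.dumps(array).encode("utf-8")


appended = json_array_append(stored, 777)
print(concatenation_is_valid_json)  # False
print(appended.decode("utf-8"))     # [339, 976, 442, 666, 777]
```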


You can make an array-only document as well, and then perform array_ operations using an empty path:

bucket.upsert('my_array', [])
bucket.mutate_in('my_array', SD.array_append('', 'some element'))
# the document my_array is now ["some element"]


Limited support also exists for treating arrays like unique sets, using the ARRAY_ADDUNIQUE command. This will do a check to determine if the given value exists or not before actually adding the item to the array:

bucket.mutate_in('customer123', SD.array_addunique('purchases.complete', 95))
# => Success
bucket.mutate_in('customer123', SD.array_addunique('purchases.abandoned', 42))
# => SubdocPathExistsError exception!
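
The check-then-add behavior can be sketched locally on a Python list. This is an illustration only: array_addunique and PathExists below are hypothetical local names, not the SDK's function or exception, and the sketch says nothing about which value types the server accepts.

```python
class PathExists(Exception):
    """Raised by this sketch when the value is already in the array."""


def array_addunique(array, value):
    """Append value only if it is not already present (local model)."""
    if value in array:
        raise PathExists(value)
    array.append(value)


complete = [339, 976, 442, 666]
abandoned = [157, 42, 999]

array_addunique(complete, 95)  # succeeds: 95 was not in the array

try:
    array_addunique(abandoned, 42)  # 42 is already present
    duplicate_rejected = False
except PathExists:
    duplicate_rejected = True

print(complete)            # [339, 976, 442, 666, 95]
print(duplicate_rejected)  # True
```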


Array operations can also be used as the basis for efficient FIFO or LIFO queues. First, create the queue:

bucket.upsert('my_queue', [])


Adding items to the end:

bucket.mutate_in('my_queue', SD.array_append('', 'job:953'))


Consuming an item from the beginning:

rv = bucket.lookup_in('my_queue', SD.get('[0]'))
job_id = rv[0]
bucket.mutate_in('my_queue', SD.remove('[0]'), cas=rv.cas)
run_job(job_id)


The example above performs a GET followed by a REMOVE. The REMOVE is only performed once the application already has the job, and it will only succeed if the document has not since changed (to ensure that the first item in the queue is the one we’ve just removed).
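
The GET-then-REMOVE pattern, including what happens when the CAS check fails, can be modeled with a tiny in-memory stand-in. TinyStore, CasMismatch and consume are hypothetical names invented for this sketch; the retry loop is one possible way to react to a CAS mismatch, not something prescribed by the SDK.

```python
class CasMismatch(Exception):
    """The document changed since the CAS value was read."""


class TinyStore:
    """In-memory stand-in for one array document with a CAS counter (not the SDK)."""

    def __init__(self):
        self.items = []
        self.cas = 1

    def append(self, item):
        self.items.append(item)
        self.cas += 1  # every mutation changes the CAS

    def get_first(self):
        # Return the head of the queue together with the current CAS.
        return self.items[0], self.cas

    def remove_first(self, cas):
        if cas != self.cas:
            raise CasMismatch()
        del self.items[0]
        self.cas += 1


def consume(store, max_attempts=5):
    """GET the head, then REMOVE it only if the document is unchanged."""
    for _ in range(max_attempts):
        job_id, cas = store.get_first()
        try:
            store.remove_first(cas)
        except CasMismatch:
            continue  # the queue changed in between; re-read and retry
        return job_id
    raise RuntimeError("queue too contended")


queue = TinyStore()
queue.append("job:953")
queue.append("job:954")
print(consume(queue))  # job:953
```

Because the removal is conditional on the CAS, the application never removes an item other than the one it has just read.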

Counter Operations

Counter operations allow the manipulation of a numeric value inside a document. These operations are logically similar to the counter operation on an entire document.

rv = bucket.mutate_in('customer123', SD.counter('logins', 1))
cur_count = rv[0] # => 1


The COUNTER operation performs simple arithmetic against a numeric value (the value is created if it does not yet exist).

COUNTER can also decrement:

bucket.upsert('player432', {'gold': 1000})
rv = bucket.mutate_in('player432', SD.counter('gold', -150))
print('player432 now has {0} gold remaining'.format(rv[0]))
# => player432 now has 850 gold remaining


Note that the existing value for counter operations must be within the range of a 64-bit signed integer.
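
The range restriction can be illustrated with a local model of a counter. This is a sketch under assumptions: counter below is a hypothetical helper operating on a plain dictionary, a missing key is treated as starting from 0, and rejecting a result that would leave the 64-bit range is this sketch's choice rather than documented server behavior.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def counter(doc, key, delta):
    """Add delta to doc[key], treating a missing key as 0."""
    current = doc.get(key, 0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(current, bool) or not isinstance(current, int):
        raise TypeError("existing value is not an integer")
    if not INT64_MIN <= current <= INT64_MAX:
        raise OverflowError("existing value is outside the signed 64-bit range")
    result = current + delta
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("result is outside the signed 64-bit range")
    doc[key] = result
    return result


player = {"gold": 1000}
remaining = counter(player, "gold", -150)
print(remaining)  # 850

visits = {}
print(counter(visits, "logins", 1))  # 1
```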

Creation of Intermediates

All of the examples above refer to creating a single new field within an existing dictionary. Creating a new hierarchy however will result in an error:

bucket.mutate_in('customer123',
                 SD.upsert('phone.home', {'num': '775-867-5309', 'ext': 16}))
# => SubdocPathNotFoundError


Despite the operation being an UPSERT, subdoc will refuse to create missing hierarchies by default. The create_parents option, however, allows it to succeed. At the protocol level the option is called F_MKDIRP, like the -p option of the mkdir command on Unix-like platforms.

bucket.mutate_in('customer123',
                 SD.upsert('phone.home',
                           {'num': '775-867-5309', 'ext': 16},
                           create_parents=True))
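
The effect of create_parents can be modeled locally on nested dictionaries. This is a minimal sketch, not the SDK: upsert and PathNotFound here are hypothetical local names, only dotted dictionary paths are handled, and every existing intermediate is assumed to be a dictionary.

```python
class PathNotFound(Exception):
    """Raised by this sketch when an intermediate key is missing."""


def upsert(doc, path, value, create_parents=False):
    """Set a dotted path inside nested dicts (local model, dictionary paths only)."""
    *parents, leaf = path.split(".")
    current = doc
    for key in parents:
        if key not in current:
            if not create_parents:
                raise PathNotFound(key)
            current[key] = {}  # like mkdir -p: create the missing level
        current = current[key]
    current[leaf] = value


customer = {"name": "Douglas Reynholm"}
phone = {"num": "775-867-5309", "ext": 16}

try:
    upsert(customer, "phone.home", phone)
    failed_without_parents = False
except PathNotFound:
    failed_without_parents = True

upsert(customer, "phone.home", phone, create_parents=True)
print(failed_without_parents)       # True
print(customer["phone"]["home"])    # {'num': '775-867-5309', 'ext': 16}
```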


Subdocument and CAS

Subdoc mostly eliminates the need for tracking CAS. Subdoc operations are atomic and therefore if two different threads access two different sub-documents then no conflict will arise. For example the following two blocks can execute concurrently without any risk of conflict:

bucket.mutate_in('customer123', SD.array_append('purchases.complete', 999))


bucket.mutate_in('customer123', SD.array_append('purchases.abandoned', 998))


Even when modifying the same part of the document, operations will not necessarily conflict. For example, two concurrent ARRAY_PREPEND operations on the same array will both succeed, and neither will overwrite the other.
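
A deterministic local simulation shows why this matters. In the first half, two clients each fetch the whole document, edit their copy, and write it back without CAS, so the last writer wins. In the second half, each prepend is applied where the document lives, one after the other. The variable and function names are hypothetical, and the sketch models only the ordering of reads and writes, not real concurrency.

```python
import copy

server_doc = {"purchases": {"abandoned": [157, 42, 999]}}

# Fulldoc style: both clients read the same version, then write back whole copies.
client_a = copy.deepcopy(server_doc)
client_b = copy.deepcopy(server_doc)
client_a["purchases"]["abandoned"].insert(0, 18)
client_b["purchases"]["abandoned"].insert(0, 19)
fulldoc_result = client_a   # A writes first...
fulldoc_result = client_b   # ...then B overwrites it, and 18 is lost


def array_prepend(doc, value):
    """Apply the prepend directly to the stored document (subdoc style)."""
    doc["purchases"]["abandoned"].insert(0, value)


# Subdoc style: only the operations travel, and each is applied in turn.
subdoc_result = copy.deepcopy(server_doc)
array_prepend(subdoc_result, 18)
array_prepend(subdoc_result, 19)

print(fulldoc_result["purchases"]["abandoned"])  # [19, 157, 42, 999]
print(subdoc_result["purchases"]["abandoned"])   # [19, 18, 157, 42, 999]
```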

This does not mean that CAS is no longer required – sometimes it’s important to ensure the entire document didn’t change state since the last operation: this is especially important with the case of REMOVE operations to ensure that the element being removed was not already replaced by something else.

FAQ about Sub-Document Operations in Couchbase

Over the course of developing subdoc, I’ve been asked several questions about what it does, and I’ll respond in turn:

What’s the difference between Subdoc and N1QL?

N1QL is a rich, expressive query language which allows you to search for and possibly mutate multiple documents at once. Subdoc is a high performance API/implementation designed for searching within a single document.

Subdoc is a high performance set of simple, discrete APIs for accessing data within a single document, with a goal of reducing network bandwidth and increasing overall throughput. It is implemented as part of the KV service and is therefore strongly consistent with it.

N1QL is a rich query language capable of searching multiple documents within Couchbase which adhere to certain criteria. It operates outside the KV service, making optimized KV and index requests to satisfy incoming queries. Consistency with the KV service is configurable per query (for example, the USE KEYS clause and the scan_consistency option).

When should I use N1QL and when should I use subdoc?

N1QL answers questions such as “Find me all documents where X=42 and Y=77”, whereas subdoc answers questions such as “Fetch X and Y from document Z”. More specifically, subdoc should be used when all the Document IDs are known (in other words, if a N1QL query contains USE KEYS it may be a candidate for subdoc).

The two are not mutually exclusive however, and it is possible to use both N1QL and subdoc in an application.

Are `mutate_in` and `lookup_in` atomic?

Yes, they are atomic. Both these operations are guaranteed to have all their sub-commands (e.g. COUNTER, GET, EXISTS, ARRAY_ADDUNIQUE) operate on the same version of the document.

How do I access multiple documents with subdoc?

There is no bona fide multi operation for subdoc, as subdoc operates within the scope of a single document. Because documents are sharded across the cluster (this is common to Couchbase and all other NoSQL stores), multi operations would not be able to guarantee the same level of transactions and atomicity between documents.

I don’t like the naming convention for arrays. Why didn’t you use `append`, `add`, etc.?

There are many languages out there, and it seems each of them has a different idea of what to call array access functions:

Generic: add to end, add to front
C++: push_back(), push_front()
Python: append(), insert(0, x), extend()
Perl, Ruby, Javascript, PHP: push(), unshift()
Java, C#: add()

The term append is already used in Couchbase to refer to the full-document byte concatenation, so I considered it inconsistent to use this term in yet a different manner in subdoc.

Why does `COUNTER` require 64 bit signed integers?

This is a result of the subdoc code being implemented in C++. Future implementations may allow a broader range of existing numeric values (for example, large values, non-integral values, etc.).

How do I perform a pop? Why is there no `POP` operation?

POP refers to the act of removing an item (e.g. from an array) and returning it, in a single operation.

POP may indeed be implemented in the future, but using it is inherently dangerous:

Because the operation is being done over the network, it is possible for the server to have executed the removal of the item but have the network connection terminated before the client receives the previous value. Because the value is no longer in the document, it is permanently lost.

Can I use CAS with subdoc operations?

Yes. With respect to CAS usage, subdoc operations are normal KV API operations, similar to upsert, get, etc.

Can I use durability requirements with subdoc operations?

Yes. With respect to durability requirements, mutate_in is treated like upsert, insert and replace.

Mark Nunberg, Software Engineer, Couchbase
