One of the big document database modelling questions is: how far do I go with denormalisation?

When working with relational databases we're used to strictly normalising our data: we hold a canonical, non-duplicative, instance of each item of data. That gives us almost unlimited scope for queryability and makes consistency a breeze.

In my previous post on data modelling, we saw how document databases take a different tack. Document databases optimise for storing together the data that we access together and, from that, come a whole bunch of other benefits, such as speed, distributability, preservation of document structure and so on.

In that post I gave the example of a swag management system where an order would be stored as one JSON document, rather than split-out into various tables and then reconstructed each time using a SQL query.

That model — of storing data aggregates — is naturally denormalised. However, that doesn't mean we should create monolithic documents for each context.

Instead, we should consider two options: we can embed data from multiple places into one document or we can store references to other documents.

Key-value document access

With Couchbase we have a couple of ways to access our data: key-value look-ups and views (automatically generated secondary indexes built from map-reduce queries).

When N1QL becomes generally available, it'll be three.

Everything we look at in this post is based around key-value look-ups.

A quick recap of our example system

In my previous post we looked at the example of a stock management system to track Couchbase-branded swag.

Let's imagine the standard path is:

  1. A customer makes an order.
  2. A stock picker receives the order and packages the items.
  3. A despatcher sends out the package through a delivery service.

At the moment the customer makes an order, we have a choice of how we store the order data in Couchbase:

  • either embed all the order information in one document
  • or maintain one main copy of each record involved and refer to it from the order document.

Embedding

If we chose to embed all the data in one document, we might end up with something like this:

  {
      “orderID”: 200,
      “customer”:
      {
        “name”: “Steve Rothery”,
        “address”: “11-21 Paul Street”,
        “city”: “London”
      },
      “products”:
        [
          {  
             “itemCode”: “RedTShirt”,
             “itemName”: “Red Couchbase t-shirt”,
             “supplier”: “Lovely t-shirt company”,
             “location”: “warehouse 1, aisle 3, location 4”,
             “quantityOrdered”: 3
          },
          {
            “itemCode”: “USB”,
            “supplier”: “Memorysticks Foreva”,
            “itemName”: “Black 8GB USB stick with red Couchbase logo”,
            “location”: “warehouse 1, aisle 42, location 12”,
            “quantityOrder”: 51
          }
        ],
      “status”: “paid”
   }

Here, everything we need to fulfil the order is stored in one document.

Despite having separate customer profile and item details documents, we replicate parts of their data in the order document.

This might seem wasteful or even dangerous, if you're coming from the relational world.

However, it's quite normal for a document database. As we saw earlier, document databases operate around the idea that one document could store everything you need for a particular situation.

There are, though, some trade-offs to embedding data like this.

First, let's look at what's potentially bad:

  • Inconsistency: if Steve wants to update his address after the order is made, we're relying on:
    • our application code to be robust enough to find every instance of his address in the database and update it
    • nothing going wrong on the network, database side or elsewhere that would prevent the update completing fully.
  • Queryability: by making multiple copies of the same data, it could be harder to query on the data we replicate as we'll have to filter out all of the embedded copies.
  • Size: you could end up with large documents consistening of lots of duplicated data.
  • More documents: this isn't a major concern but it could have some impacts, such as the size of your cached working set.

So, what benefits does embedding give us? Mostly, it gives us:

  • Speed of access: embedding everything in one document means we need just one database look-up.
  • Potentially greater fault tolerance at read time: in a distributed database our referred documents would live on multiple machines, so by embedding we're introducing fewer opportunities for something to go wrong and we're simplifying the application side.

When to embed

You might want to embed data when:

  1. Reads greatly outnumber writes.
  2. You're comfortable with the slim risk of inconsistent data across the multiple copies.
  3. You're optimising for speed of access.

Why are we asking whether reads outnumber writes?

In our example above, each time someone reads our order they're also likely to update the state of the order:

  • someone in the warehouse reads the order document and updates the status to “Picked”, once they're done
  • one of our despatch team reads the document and updates the status to “Despatched” when the parcel is with the courier
  • when we receive an automated delivery notice from the courier, our application updates the document status to “Delivered”.

So, here the reads and writes are likely to be fairly balanced.

Imagine, though, that we add a blog to our swag management system and then write a post about our new Couchbase branded USB charger. We'd make two, maybe three, writes to the document while finessing our post. Then, for the rest of that document's lifetime, it'd be all reads. If the post is popular, we could see a hundred or thousand times the number of reads compared to writes.

As the benefits of embedding come at read-time, and the risks mostly at write-time, it seems reasonable to embed all the contents of the blog post page in one document rather than, for example, pull in the author details from a separate profile document.

There's another compelling reason to embed data:

  • You actually want to maintain separate, and divergent, copies of data.

In our swag order above, we're using the customer's address as the despatch address. By embedding the despatch address, as we are, we can easily offer the option to choose a different despatch address for each order. We also get a historic record of where each order went even if the customer later changes the address stored in their profile.

Need for speed

Couchbase is fast.

All writes are made to the managed cache and, if we size our RAM appropriately, our working set is served from that cache.

That changes the weighting of the trade-offs we make. Embedding primarily for speed is less compelling with Couchbase.

With Couchbase your average read/write operation time is half a millisecond, soit doesn't hurt when a single query requires three or four operations.

Referring

Another way to represent our order would be to refer to the user profile document and stock item details document but not to pull their contents into the order document.

Let's imagine our customer profiles are keyed by the customer's email address and our stock items are keyed by a stock code. We can use those to refer to the original documents:

   {
      “orderID”: 200,
      “customer”: “steve@gmail.com”,
      “products”:
        [
          {  
             “itemCode”: “RedTShirt”,
             “quantityOrdered”: 3
          },
          {
            “itemCode”: “USB”,
            “quantityOrder”: 51
          }
        ],
      “status”: “paid”
   }

When we view Steve's order, we can fill in the details with three more reads: his user profile (keyed by the email address) and the stock item details (keyed by their item codes).

It requires three additional reads but it gives us some benefits:

  • Consistency: we're maintaining one canonical copy of Steve's profile information and the stock item details.
  • Queryability: this time, when we query the data set we can be more sure that the results are the canonical versions of the data rather than embedded copies.
  • Better cache usage: as we're accessing the canonical documents frequently, they'll stay in our cache, rather than having multiple embedding copies that are accessed less frequently and so drop out of the cache.
  • More efficient hardware usage: embedding data gives us larger documents with multiple copies of the same data; referring helps reduce the disk and RAM our database needs.

There are also disadvantages:

  • Multiple look-ups: this is mostly a consideration for cache misses, as the read time increases whenever we need to read from disk.
  • Consistency is enforced: referring to a canonical version of a document means updates to that document will be reflected in every context where it is used.

When to refer

Referring to canonical instances of documents is a good default when modelling with Couchbase. You should be especially keen to use referrals when:

  • Consistency of the data is a priority.
  • You want to ensure your cache is used efficiently.
  • The embedded version would be unwieldy.

That last point is particularly important where your documents have an unbound potential for growth.

Imagine we were storing activity logs related to each user of our system. Embedding those logs in the user profile could lead to a rather large document.

It's unlikely we'd breach Couchbase's 20 MB upper limit for an individual document but processing the document on the application side would be less effecient as the log element of the profile grew. It'd be much more effecient to refer to a separate document, or perhaps paginated documents, holding the logs.

Denormalise all the things! (Except when you don't)

Denormalisation is a key part of working with document databases but that doesn't mean we can't maintain canonical records and then refer to them from within other documents.

In fact, referring to documents is often the most efficient way of working with Couchbase and we should embed in particular situations, such as read-heavy workloads.

Next time I'll look at key naming patterns and how we maintain schema in a soft-schema database.

Posted by Matthew Revell, Lead Developer Advocate, EMEA, Couchbase

One Comment

  1. i didnt understand why the following is a disadvantage or a problem in general?

    Consistency is enforced: referring to a canonical version of a document means updates to that document will be reflected in every context where it is used.

Leave a reply