April 9, 2012

Organizing Document Structure in Document Databases

Couchbase is schema-less. That runs against the grain of traditional RDBMS history and experience, but it has proven to be one of the most lauded features of NoSQL databases.

Not being required to develop a schema before you build your application is a huge time saver. It enables quick prototyping and lets you mold the structure of your document as you delve into its different uses within your application.

However, the lack of a required, formal schema does not mean your documents gain nothing from a consistent structure or "inherent" schema.

One of the key considerations when crafting your documents for Couchbase is how granularly to split your data: one document or multiple documents for a single concept?

Here are the key questions to ask when deciding how to structure your data in Couchbase:

What does this document look like in real life?
In our case, we have Breweries and their Beer. We have a noun for each, and in the earlier post, we had a document for each. It feels natural, and makes sense.

How often will I update this?
Breweries and Beers are fairly static topics. It's more likely that we'll be adding Beer documents to a Brewery's portfolio than updating a Beer's description or changing a Brewery's address. Those changes happen, but rarely compared with updates to stock tickers, sensor data, or social game actions.

Do you want all this data updated together?
In Couchbase, the document is the smallest level of atomicity. If we combine Brewery and Beer data into a single document, all changes to Beer data require sending the entire Brewery document that contains it as well. All creation and update operations happen on that entire collection as if it were a single thing. This has the advantage of consolidating disparate updates into single, streamlined requests, but has the disadvantage of requiring a larger update each time a change is made to any of the contained concepts.
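To make that trade-off concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for a document store. It is not the Couchbase client API; `store`, `update_beer_embedded`, and `update_beer_separate` are hypothetical names chosen for illustration. It shows that changing one beer inside a combined document means writing back the whole Brewery document, while a separate Beer document is rewritten on its own.

```python
import copy
import json

# In-memory stand-in for a document store (assumption): key -> JSON-like dict.
store = {
    "brewery_New_Belgium_Brewing": {
        "name": "New Belgium Brewing",
        "beers": {
            "1554_Enlightened_Black_Ale": {
                "name": "1554 Enlightened Black Ale",
                "abv": "5.5",
            },
        },
    },
    "beer_1554_Enlightened_Black_Ale": {
        "name": "1554 Enlightened Black Ale",
        "abv": "5.5",
    },
}


def update_beer_embedded(store, brewery_id, beer_key, changes):
    """Read-modify-write the whole Brewery document.

    Returns the length of the serialized JSON that had to be written.
    """
    doc = copy.deepcopy(store[brewery_id])
    doc["beers"][beer_key].update(changes)
    store[brewery_id] = doc  # the entire document is replaced as one unit
    return len(json.dumps(doc))


def update_beer_separate(store, beer_id, changes):
    """Read-modify-write only the Beer document."""
    doc = copy.deepcopy(store[beer_id])
    doc.update(changes)
    store[beer_id] = doc
    return len(json.dumps(doc))


embedded_size = update_beer_embedded(
    store, "brewery_New_Belgium_Brewing", "1554_Enlightened_Black_Ale", {"abv": "5.6"}
)
separate_size = update_beer_separate(
    store, "beer_1554_Enlightened_Black_Ale", {"abv": "5.6"}
)
```

With only one beer the difference is small, but the embedded write grows with every beer added to the Brewery, while the separate write stays the size of one Beer.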

So far, it looks like we have two options when designing documents:

  • one document for every distinct concept
  • one document for the largest "container" concept

However, there's a third option: determine the canonical data later with MapReduce. Content can be structured in such a way that it will exist more like a "general ledger" from which you can build an index that states what the canonical data is from among the documents. Plus, you can leave the past (now "stale") data in the database without having to worry about the effects of that stale data on how you view the active material. This can prove advantageous if you'd prefer to have that historical data available. In this case, we could put the address of the brewery on both the Brewery and Beer documents and be able to find that a Beer was originally crafted at the Brewery's original location. Interesting options abound!
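As a rough illustration of the "general ledger" idea, the sketch below imitates a map step and a reduce step in plain Python. It is not Couchbase view code. The `ledger` list, the "123 Example Road" address, and its timestamp are made-up sample data, and `map_doc`, `reduce_latest`, and `canonical_addresses` are hypothetical names. The index it builds picks the most recently updated address per brewery and leaves the older entries untouched.

```python
from collections import defaultdict

# Ledger-style documents: older entries are kept, never overwritten.
# The first entry is invented sample data for illustration only.
ledger = [
    {
        "brewery": "New Belgium Brewing",
        "address": ["123 Example Road"],
        "updated": "2009-01-15 09:30:00",
    },
    {
        "brewery": "New Belgium Brewing",
        "address": ["500 Linden Street"],
        "updated": "2010-07-22 20:00:20",
    },
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["brewery"], (doc["updated"], doc["address"])


def reduce_latest(values):
    """Reduce step: keep the value with the newest timestamp.

    "YYYY-MM-DD HH:MM:SS" strings sort chronologically as plain text.
    """
    return max(values, key=lambda value: value[0])


def canonical_addresses(docs):
    """Group mapped values by key, then reduce each group to one address."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_latest(values)[1] for key, values in grouped.items()}


canonical = canonical_addresses(ledger)
```

The stale entry stays in `ledger` for historical queries; only the derived index decides which address counts as current.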

Since Couchbase is de-normalization friendly, we could put the Brewery data in each Beer document:

Beer with Brewery Object

{
   "_id": "beer_1554_Enlightened_Black_Ale",
   "_rev": "1-191ae52a6c773fd7749b65ffd9ae8044",
   "brewery": {
      "name": "New Belgium Brewing",
      "address": [
         "500 Linden Street"
      ],
      "city": "Fort Collins",
      "state": "Colorado"
      ...
   },
   "name": "1554 Enlightened Black Ale",
   "abv": "5.5",
   ...
   "category": "Belgian and French Ale",
   "style": "Other Belgian-Style Ales",
   "updated": "2010-07-22 20:00:20"
}

This could save us a great deal of request time. It could also serve as a historical record of where the beer was originally brewed (if the address of the brewery changed years later). If we need the canonical brewery info in our app (vs. possibly stale data), then we can construct the brewery ID from its name (at least in this app), and look up the address from the authoritative Brewery document. The Brewery document would be the canonical source of information about the Brewery, and the Brewery address information on the Beer document could serve as a historical reference.
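The lookup described above might be sketched as follows. The ID convention ("brewery_" plus the name with spaces replaced by underscores) is inferred from the sample IDs in this post and is an assumption; real names with punctuation would need more care. The dict-based `store` and the function names `brewery_id_from_name` and `canonical_brewery` are hypothetical stand-ins, not a Couchbase API.

```python
def brewery_id_from_name(name):
    """Build a Brewery document ID from its name (assumed convention)."""
    return "brewery_" + name.replace(" ", "_")


def canonical_brewery(store, beer_doc):
    """Fetch the authoritative Brewery document for a Beer document.

    Returns None if no such Brewery document exists in the store.
    """
    return store.get(brewery_id_from_name(beer_doc["brewery"]["name"]))


# In-memory stand-in for the database (assumption): key -> document.
store = {
    "brewery_New_Belgium_Brewing": {
        "name": "New Belgium Brewing",
        "address": ["500 Linden Street"],
    },
}

beer_doc = {
    "name": "1554 Enlightened Black Ale",
    "brewery": {"name": "New Belgium Brewing", "address": ["500 Linden Street"]},
}

current = canonical_brewery(store, beer_doc)
```

The address embedded in `beer_doc` remains the historical snapshot, while `current` reflects whatever the Brewery document says now.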

Pros:

  • Get a Beer document and you have the Brewery info in one request
  • Brewery info on Beer documents could serve some historical purposes
  • No need to use MapReduce View Collation to construct relationships

Cons:

  • Getting the latest Brewery information takes a second request, a multiget (if you know both IDs), or a MapReduce View

Brewery with Beer Objects

Let's flip that last example on its head. Breweries brew beers. If they've not given away the recipe (or even if they have), you can't get that particular beer, made that particular way, with that special mountain water (or whatever) from anyone but that brewery. So why not store all the beers within a Brewery document?

Here is what this new structure might look like. The first several lines are from the original New Belgium Brewing document. The new addition is the "beers" key and the object of beer information set as its value.

{
   "_id": "brewery_New_Belgium_Brewing",
   "_rev": "1-e405d6f86ec028a4fe0d18be0a6d4fa1",
   "name": "New Belgium Brewing",
   "address": [
       "500 Linden Street"
   ],
   .....
   "geo": {
       "loc": [
           "-105.07",
           "40.5929"
       ]
   },
   "updated": "2010-07-22 20:00:20",
   "beers": {
      "1554_Enlightened_Black_Ale": {
        "name": "1554 Enlightened Black Ale",
        "abv": "5.5",
        "category": "Belgian and French Ale",
        "style": "Other Belgian-Style Ales"
      },
      "Fat_Tire": {...}
   }
}

Our new "beers" child object is keyed by beer, using nearly the same IDs as the Beer documents we saw earlier (just without the "beer_" prefix).

Pros:

  • Brewery and its Beer in a single (possibly quite large) request
  • No need to use View Collation to construct relationships

Cons:

  • Beer documents aren't directly retrievable on their own (requires MapReduce)
  • Size of the response could be quite large for big Breweries (+1 for microbreweries!)
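To illustrate the first con, the sketch below imitates in plain Python the kind of map step that would be needed to make embedded beers addressable on their own. It is not Couchbase view code. The function name `beers_from_brewery` and the added `brewery_id` field are hypothetical; the "beer_" prefix mirrors the Beer document IDs shown earlier.

```python
brewery_doc = {
    "_id": "brewery_New_Belgium_Brewing",
    "name": "New Belgium Brewing",
    "beers": {
        "1554_Enlightened_Black_Ale": {
            "name": "1554 Enlightened Black Ale",
            "abv": "5.5",
        },
    },
}


def beers_from_brewery(doc):
    """Emit one (beer ID, beer data) pair per embedded beer.

    Each emitted value is a copy of the beer data with a back-reference
    to the containing Brewery document (the brewery_id field is an
    illustrative assumption).
    """
    for beer_key, beer in doc.get("beers", {}).items():
        yield "beer_" + beer_key, dict(beer, brewery_id=doc["_id"])


# Build a lookup of beers by ID from the Brewery document.
beer_index = dict(beers_from_brewery(brewery_doc))
```

Without an index like this, finding a single beer means fetching its whole Brewery document and digging into the "beers" object.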

Conclusion

Either of these approaches could be valid for different use cases. In cases where you want quick retrieval of a relationship "package", going the single, combined document route can be a great optimization.

Enjoy the freedom, consider your options, and get that prototype built!

Next up, we'll look at how to build indexes from the original Beer and Brewery documents. Stay tuned!
