“How should I access my data?” is often asked by developers as they contemplate a storage solution. To answer that question one first needs to understand the application under consideration. Who are the most important users, and which use cases need to be fast—that is, what actions does the user take a lot? What is the hot path?

Once you understand the hot path, you are ready to look at storage.

Certain actions, like uploading a large file, are not as latency-sensitive as others. Clicking “buy now” on an ecommerce site, or viewing your timeline on social media, benefits a lot more from high-performance storage and retrieval.

So for each application, if you look at the architecture from a performance perspective, you'll see that there are places the user hangs out a lot, where responsiveness is key. It's not just about being fast, it's about being reliably fast under load. Nines matter, especially for latency: the slowest one percent of requests shapes the user's experience as much as the average does.

You want your hottest, most crucial use cases to be fast. Because memory is so much faster than disk, these should definitely be served from memory. It's no coincidence that memory-based databases are ascendant. Especially in cloud environments, serving from memory insulates the application against inconsistent disk performance.

Prioritize your schema around the hot path

Even with a schema-less database, your application will impose a schema. It's up to you to make that schema favor the high-performance use cases. One step is identifying the hot object types and keeping them in a fully-resident bucket. If you have 4GB of live shopping cart data, set aside 4+GB of RAM just for that. If you have 20GB of purchase history data, you might be fine dedicating only 5GB of RAM to those document types. This lets disk-greater-than-memory effects kick in for your less latency-sensitive data, while preserving the performance of the hottest data under all circumstances.
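The sizing arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope illustration: the `residentRatio` helper, the bucket names, and the 4.5GB quota are assumptions made up for the example, and the calculation ignores per-document metadata overhead (which is one reason to budget "4+GB" rather than exactly 4GB).

```javascript
// Hypothetical helper: the fraction of a bucket's data that fits in its RAM quota.
// Ignores metadata and per-item overhead, so treat the result as an upper bound.
function residentRatio(dataGB, ramQuotaGB) {
  if (dataGB <= 0) return 1;
  return Math.min(1, ramQuotaGB / dataGB);
}

// Bucket names and the 4.5GB quota are illustrative assumptions.
const buckets = [
  { name: "shopping_carts", dataGB: 4, ramQuotaGB: 4.5 },
  { name: "purchase_history", dataGB: 20, ramQuotaGB: 5 },
];

for (const b of buckets) {
  const pct = Math.round(residentRatio(b.dataGB, b.ramQuotaGB) * 100);
  console.log(`${b.name}: ${pct}% resident`);
}
```

The hot bucket stays fully in memory, while roughly a quarter of the purchase history is resident and the rest is served from disk when needed.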

Now that you're thinking in terms of the hot path, let's talk about views. They are very flexible: the key is their ability to project the realtime data into forms more advantageous for querying. Since you'll want your hot path to run on a key-value model, you end up doing a lot of pointers and multi-gets. That is, a high-performance application will tightly bind its data model to the hot path. This can make supporting the less hot use cases a little more challenging, and this is where views step in. Views are powerful enough to give those runtime-driven data structures new life as queryable indexes. The upshot is that you get a flexible query model on top of key-value performance.

To take a particular example, look at the beers sample data set. Perhaps in your application the main screen that users want to see is a list of beers with their ratings and, for each beer, the ability to quickly load reviews of the beer written by end users. In this case, you'll want to do bulk key-value lookups (rather than view queries) to populate those screens with the highest possible performance. If a secondary use case is browsing breweries by city, this is a good time to use views, as they can provide a flexible window into the underlying data.

Let's start by looking at the documents themselves. In this case we are storing ratings directly on the beer documents. By embedding the ratings, we are able to trivially show them in the UI, without any queries or additional lookups. We are also linking directly to reviews (“comments”) from the beer document, so that a beer and all of its comments can be fetched without going to a disk-based view query. This technique ensures that even under tremendous traffic load, database response times will be fast.

Here is an illustration:

The blue doc is a user profile document. It is referenced in a few other places in the schema, so look for its id in the other JSON documents. Anytime we already have the user_id, we can look up the user profile quickly.

The yellow document is a comment / review on a beer. It has the beer id on it, but lookups from a comment to the beer it is on will be rare. More typically you'll have the beer in hand and want to look for the reviews. We mentioned earlier that in our hypothetical example we consider this to be a performance-critical section, so while it would be possible to use views to look up reviews by beer_id, in this case we want to do our lookups via the key-value interface. By working in this interface we get the benefit of in-memory speeds, as well as more scalability, since these requests use fewer server resources.

The green document is an actual beer document. We have the ratings inline, stored under the user_id to enforce the constraint that each user may only rate each beer once. For the reviews / comments, we link to them from the comments array. So the code to fetch all the data needed to display a page for a beer looks something like this (the example is in no particular programming language):

beer = couchbase.get("my-beer-id");
reviews = couchbase.multiget(beer.comments);
profiles = couchbase.multiget(reviews.map { |review| review.user_id });

Then the page will have enough data to display info about all reviews of the beer, as well as info about the users who left them. The entire thing took only 3 requests to the database, all served by key, so the total elapsed time should be only a handful of milliseconds. That will be faster than the alternate style of sending a complex query to the database so that it can construct a result set and return it.
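To make the three-request pattern concrete, here is a minimal sketch in JavaScript that runs against an in-memory stand-in for the bucket. `FakeBucket`, `loadBeerPage`, and the sample documents are hypothetical and exist only for illustration; this is not the Couchbase SDK API. The document shapes follow the description above: ratings embedded under the user id, comments linked by key, and profiles stored under the user id. De-duplicating the user ids before the last multiget is an extra step added in this sketch.

```javascript
// Hypothetical in-memory stand-in for a key-value bucket (not a real SDK).
class FakeBucket {
  constructor(docs) {
    this.docs = new Map(Object.entries(docs));
    this.requests = 0; // counts simulated round trips to the database
  }
  get(key) {
    this.requests++;
    return this.docs.get(key);
  }
  multiget(keys) {
    this.requests++; // one round trip, however many keys are requested
    return keys.map((k) => this.docs.get(k));
  }
}

// Sample documents; names and values are made up for this example.
const bucket = new FakeBucket({
  "my-beer-id": {
    type: "beer",
    name: "Example Ale",
    ratings: { "525": 4, "526": 5 }, // one rating per user id
    comments: ["comment-1", "comment-2"], // links to review documents
  },
  "comment-1": { type: "comment", beer_id: "my-beer-id", user_id: "525", text: "Crisp." },
  "comment-2": { type: "comment", beer_id: "my-beer-id", user_id: "526", text: "Hoppy." },
  "525": { type: "user", name: "Alice" },
  "526": { type: "user", name: "Bob" },
});

function loadBeerPage(bucket, beerId) {
  const beer = bucket.get(beerId); // request 1
  const reviews = bucket.multiget(beer.comments); // request 2
  const userIds = [...new Set(reviews.map((r) => r.user_id))];
  const profiles = bucket.multiget(userIds); // request 3
  return { beer, reviews, profiles };
}

const page = loadBeerPage(bucket, "my-beer-id");
console.log(page.reviews.length, "reviews,", bucket.requests, "requests");
```

Note that the number of requests stays at three no matter how many reviews the beer has, which is what keeps the page load time predictable.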

Layering views on top of your high-performance schema

There's a lot of richness in the document structure we designed to facilitate key-value interactions. Even though we aren't maintaining a key-value lookup path for discovering the top-rated beers, because the ratings are embedded directly in the document, it's easy to write a view that ranks beers by rating. Querying this view won't be as fast as a direct key-value lookup, but something like top-rated beers can be easily cached, so that users have a high-performance experience even while the underlying index is disk-based rather than in-memory.

Here is a map view which would sort the beers by average rating.

function(doc, meta) {
  if (doc.ratings) {
    var total = 0, count = 0;
    for (var user_id in doc.ratings) {
      total += doc.ratings[user_id];
      count++;
    }
    // Skip beers with an empty ratings object, to avoid emitting NaN (0/0).
    if (count > 0) {
      emit(total / count, null);
    }
  }
}

You'd query this with ?descending=true to get the top-rated beers first.
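If you want to see what such a view produces without a running cluster, you can emulate it locally. The sketch below is only an approximation: `queryView` and the sample documents are hypothetical, `emit` is passed in as a parameter instead of being a global, and rows are sorted with plain JavaScript comparison rather than the server's collation rules.

```javascript
// Map function adapted from the view above; emit is passed in so it can run locally.
function mapByAverageRating(doc, meta, emit) {
  if (doc.ratings) {
    let total = 0, count = 0;
    for (const userId in doc.ratings) {
      total += doc.ratings[userId];
      count++;
    }
    if (count > 0) emit(total / count, null);
  }
}

// Hypothetical local emulation of a view query: run the map function over
// every document, collect the emitted rows, and sort them by key.
function queryView(docs, mapFn, { descending = false } = {}) {
  const rows = [];
  for (const [id, doc] of Object.entries(docs)) {
    mapFn(doc, { id }, (key, value) => rows.push({ id, key, value }));
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return descending ? rows.reverse() : rows;
}

// Sample data made up for this example.
const sampleBeers = {
  "beer-a": { type: "beer", ratings: { "525": 4, "526": 5 } },
  "beer-b": { type: "beer", ratings: { "525": 2 } },
  "beer-c": { type: "beer", ratings: { "526": 5, "527": 5 } },
  "comment-1": { type: "comment", user_id: "525" }, // no ratings, so no row
};

const topRated = queryView(sampleBeers, mapByAverageRating, { descending: true });
console.log(topRated.map((row) => `${row.id}: ${row.key}`));
```

With this sample data, beer-c (average 5) comes first, followed by beer-a (4.5) and beer-b (2).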

From that same underlying data set, we can also provide a way for a given user to find all the reviews they have left. This is not a common operation (in our made-up example application), so it's not important that it be blazingly fast, and it would be a big pain to maintain a key-value lookup path for it. That is, if your users will rarely try to find all the reviews they have written, there is little reason to maintain a list of review ids attached to each user profile. Instead, just tag the reviews with the user id, and use an index to support the query. The view to find all reviews for user X is simple:

function(doc, meta) {
  if (doc.type == "comment" && doc.user_id) {
    emit(doc.user_id, null);
  }
}

You query this with ?key=525 to find all the reviews written by user number 525.
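Conceptually, this view gives you an index from user id to review ids without the application having to maintain one by hand. The sketch below builds such an index locally to show the idea; `buildIndex` and the sample documents are hypothetical, and a real view index lives on the server. The sample user ids are stored as numbers here so that they match a numeric key like 525. If your user ids are strings, the key you query with would need to be a string as well.

```javascript
// Map function adapted from the view above; emit is passed in so it can run locally.
function mapReviewsByUser(doc, meta, emit) {
  if (doc.type === "comment" && doc.user_id) {
    emit(doc.user_id, null);
  }
}

// Hypothetical local index: emitted key -> ids of the documents that emitted it.
function buildIndex(docs, mapFn) {
  const index = new Map();
  for (const [id, doc] of Object.entries(docs)) {
    mapFn(doc, { id }, (key) => {
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(id);
    });
  }
  return index;
}

// Sample data made up for this example.
const reviewDocs = {
  "comment-1": { type: "comment", beer_id: "beer-a", user_id: 525 },
  "comment-2": { type: "comment", beer_id: "beer-b", user_id: 526 },
  "comment-3": { type: "comment", beer_id: "beer-c", user_id: 525 },
  "beer-a": { type: "beer", ratings: { "525": 4 } }, // not a comment, so not indexed
};

const reviewsByUser = buildIndex(reviewDocs, mapReviewsByUser);
console.log(reviewsByUser.get(525)); // ids of the reviews written by user 525
```

Once you have the review ids, a single multiget by key fetches the review documents themselves.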

Conclusion: Design for the hot-path, let NoSQL flexibility help with the rest

Hopefully this article has painted a clear enough picture of how to architect your application around its performance-critical sections. The biggest thing you can do is to customize your schema so that the pages which need to be the fastest are the easiest to load from the database. Of course, this means that it won't be quite as fast to load non-critical pages, as the schema isn't designed specifically to make them fast.

However, Couchbase views make it easy to repurpose the data for your other access patterns. Hopefully the above examples show how you can do essentially the same kind of query in two very different ways, depending on your performance constraints.

Bonus: On caching views

If your hot path does include a view query (for instance in a social network home timeline), you should cache it. This means that while the first request may take a few milliseconds, follow-up requests will have consistently high performance. This resembles the typical usage of memcached with MySQL. In that pattern, memcached is used both to speed up perceived performance and to keep load off the database. We are essentially talking about the same thing here, except that instead of putting the results of a slow MySQL query in memcached, we put the results of a Couchbase view query in a memory bucket.

Couchbase TTL can handle expiring the cache for you, or you can use more advanced cache-invalidation strategies (a story for another day…)
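As a rough illustration of the TTL approach, here is a minimal cache-aside sketch. `TtlCache`, `slowViewQuery`, and the 60-second TTL are hypothetical; in the setup described above, the cache would be a memory bucket with a TTL set on each entry, not a JavaScript Map. The clock is injectable so that expiry can be demonstrated without waiting.

```javascript
// Hypothetical cache-aside wrapper with time-based expiry.
class TtlCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.entries = new Map();
  }
  getOrCompute(key, compute) {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // fresh hit
    const value = compute(); // miss or expired: run the slow query
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

let clock = 0; // fake time in milliseconds
let viewQueries = 0; // how many times the "view" was actually queried

// Stand-in for a view query; the result is made up for this example.
const slowViewQuery = () => {
  viewQueries++;
  return ["beer-c", "beer-a", "beer-b"];
};

const viewCache = new TtlCache(60000, () => clock);
viewCache.getOrCompute("top-rated", slowViewQuery); // miss: queries the view
viewCache.getOrCompute("top-rated", slowViewQuery); // hit: served from cache
clock += 60001; // let the entry expire
viewCache.getOrCompute("top-rated", slowViewQuery); // expired: queries again
console.log("view queried", viewQueries, "times");
```

The trade-off is staleness: with a 60-second TTL, a newly top-rated beer may take up to a minute to show up, which is usually acceptable for this kind of list.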

Author

Posted by J. Chris Anderson, Co-Founder and Mobile Architect, Couchbase
