May 23, 2013

Couchbase @ Ziniki : The odyssey to find “Find”

Ziniki Infrastructure Systems built their integration tier on top of Couchbase, because the combination of document storage with incremental mapreduce gave them a powerful way to query data. In this blog, Gareth Powell, founder and architect at Ziniki, describes his experience of using mapreduce views in Couchbase. 


I come from a heavy SQL background and some exposure to MongoDB. Because so much of SQL is based around joins and the “SELECT” statement, I naively presumed that there would be similar “find” functionality in Couchbase as in MongoDB, which I had used briefly in another project. I was initially surprised to find that this was not the case.

The problem

As it happens, the reason I had decided to use Couchbase for this project was its ability to create complex indices in the background using its incremental mapreduce technology. I’ll talk about that some more in the next section, but before I go into that, let me first describe what I was trying to build.

I am building a middleware tier on top of Couchbase. This tier has knowledge about users, their credentials, identities, personal data, security requirements and the like. It is also application aware and has logic and rules to control the application’s access to any individual user’s data.

In order to make all this work, there is a need for a “data definition layer” which describes the application’s data domain. As it happens, I’m using XML for this purpose, but anything from UML diagrams or Visual Design tools down to JSON would be acceptable. The important thing is that there is one unambiguous meaning to all the definitions and there is no question of tricky issues such as “the halting problem” coming into the mix. It’s also important that the representation is essentially language-neutral and can be used to generate definitions for any programming language or asset class that might get involved. For now, you can think of this being equivalent to a set of class definitions in your favorite object-oriented programming language.

Because of the abstract nature of my data model, I chose to use globally unique keys (UUIDs) as the keys for the documents stored in Couchbase instead of hand-coding the keys based on the data. This enabled me to generate a key exactly once. The key will uniquely identify the object and will be its main identity no matter how much the object may change.

The other aspect to my data model is that it assumes that data can be “clumpy”, that is, that it will be very common to define compound objects as groups of highly interconnected objects, which are all connected into the main object graph through one representative which is also responsible for handling the security aspects of the clump.

Couchbase Storage Mechanisms

Couchbase defines the notion of “views” which are incredibly powerful and can be used in many different ways. In large part, however, this blog post is a cautionary tale against using them for things that they should not be used for.

Everything in Couchbase is designed in such a way that it works effectively “at scale”. This differs from most other systems you will use which are defined around a particular set of semantics (relational theory, for example) and then shoehorned until they can operate effectively at scale (as with Star Schema in analytical databases). The consequence of this is that the guarantees that Couchbase provides you are the minimal set it can offer at massive scale. As with every scalable system, everything is decoupled. If you plan – as I do – to offer a massively scalable system, it is pointless to rail at these restrictions. You will sooner or later encounter these limits and it’s really just a question of where you deal with the consequences. My advice is to leave as much of it to Couchbase as you can and select an access mechanism that is best suitable for your application. Essentially, there are two access mechanisms in Couchbase -

The key/value store is essentially synchronous: in a single set of operations, you can make sure that an assignment of a single value to a specific key was successful, unique or otherwise atomic. This allows you to make sure that any operation you carry out only succeeds if the uniqueness constraints of the key apply.

Querying documents using views is asynchronous and eventually consistent: that is, views are updated at some point (possibly far) removed from when you asked for a change to be made, but ultimately, if you stop doing anything in the system, they will “catch up” and when (eventually) they do, the system will be 100% consistent.

These two mechanisms offer other different semantics. For example, while the key/value store requires the keys to have distinct values, views do not. The mapping function of a view can create as many keys as it likes with the same values; views also support multi-part keys with a richness that the simple key-value store does not.

Finally, the two interact, in that the input to a view is exactly the set of documents that are in the key/value store.

Defining Views

Views in Couchbase are defined using a pair of map and reduce functions. The reduce function is optional and is simply present to allow multiple rows of a view to be “collapsed” into a single row. Since for the purposes of this article, I am mainly interested in using views to create an “index” of objects in Couchbase, I will not discuss the “reduce” methods further, but if you want to do exciting things (such as data analysis), it’s well worth taking a look at them in the docs.

The map and reduce functions in Couchbase are defined in JavaScript and logically are “called” when needed. For debugging it is possible to define a simple environment either in Chrome or Rhino that wraps these functions and enables you to see how they would “semantically” operate outside of Couchbase. However, this is not directly the mechanism that is used to implement the view creation inside Couchbase: rather, the updates are delayed, grouped and processed in batches to achieve highest performance at scale. Moreover, the updates are broken down by server node and may happen in any order.

This is obviously an incredibly powerful mechanism for defining an index, particularly compared to standard SQL mechanisms. For example, if you have an attribute in a document that is an array object it is possible to calculate and put the length (or indeed the sum) of the object in one of the attributes.

In my perspective, one issue with this approach is that the functions operate in a “sandbox”; that is, each document that is presented to the map function has been sundered from its context. If you have data that is clumpy, this can be a problem, but in a scalable, strictly shared-nothing view of the world, this approach makes perfect sense. Since the documents are hashed to different servers, retrieving them during batch JavaScript processing would be relatively expensive and inefficient.

Accessing Views

Having defined a view using these JavaScript functions, it’s easy enough to access the views by key range either directly using HTTP or through one of the many client bindings. In my project, I’m using the Java client library, which opens up the views that I’m using and then issues queries based on the abstract definitions provided in my definitions XML file. The Problem with “Eventual” Consistency Having defined the indices on my data, I naturally turned around and tried to access them to retrieve the data that I stored. Given that I had chosen to use semantically irrelevant UUIDs as keys in the key/value store, I had given up the immediate opportunity to find anything there using a “natural” key, and chose to use the index mechanism to generate these “natural” keys. But I quickly found that I had use cases that did not match well with the view semantics.

Credentials

The first issue I ran into was with the credentials. Couchbase updates its views after a certain amount of activity or a certain amount of time, whichever comes first. While that would probably be good enough for real life use cases, our testing was based on repeatable, automated scripts. The first script I’d written simulated a user registering for the system and then immediately turning around and logging in. I was using the view mechanism to extract the “unique” user credentials (login mechanism and login id) and map this back to the credential UUID. However, as I went to fetch this from the view after creating the credential, it had not yet hit the index. I tried using the “stale” option on the view, but for login operations, this can be expensive (typically around 2.5s to do the query).

Generated Artifacts

Another related problem was with artifacts that I was generating during processing of user requests. These artifacts “remembered” previous user interactions and enabled the system to respond appropriately. In each case, the artifact had a unique “natural” key that reflected the user, the operation they were performing, and the object they were performing it on. I used views to track these and then to re-load them when the same user performed the same operation on the same object at a later date.

I ran across the same issue with eventual consistency: at the speed at which my automated test scripts were running, I would issue the second request before the object created in the first one had hit the index.

Nested Documents

The third case I ran into was with documents in the same “clump”. Having found the main one – the one that was being “secured” on a per-user basis – I wanted to navigate to some of the other objects in the same “vicinity”. I defined a view describing the type of the objects I wanted to find, and included the original object ID. By searching this index for the object ID I had and the characteristics I was looking for, I believed I should be able to recover all the documents in the clump.

Again, the speed at which my tests generated the documents and then attempted to access them was killing me. I would create the objects, then turn around and attempt to access the view – only to find it was still empty. A few seconds later, as I attempted to diagnose the problem through the UI, the objects would be there but the index was not refreshed yet.

The First Solution: Duplicating the Indices

As I experimented with Couchbase, I recognized the difference between the Key/Value store and the View mechanism. My first attempt to resolve this problem was simply to duplicate all the work that Couchbase was doing in the view but in “real time” in the key/value store. This was not actually as big a burden as it might sound: since all my data definitions were in abstract form, I was generating the view definitions and it was relatively easy to extend this to generate the same code in Java to store the items in the key/value store.

This addressed my problems, but I never liked it for a number of reasons. Most obviously, the ineffective duplication made me question the choice of Couchbase. But more importantly, the number of different cases that arose in the code suggested I was conflating issues. The two most important such bifurcations were the difference between “unique” and “non-unique” indices; and the difference between indices that needed to consider security contexts and those that did not.

The Second Solution: Recognize the individual nature of things

The helpful folks at Couchbase pointed me to the “lookup” pattern in the Couchbase documentation, which described in detail a problem very similar to the one I was having with credentials.

The lookup pattern describes how to use an indirection within the key/value store to essentially have a single object with multiple keys. There is one, unique key (the UUID in my case) and then all the other, secondary, keys point to that. I was able to rework my index definitions to distinguish between the case when I wanted a view that could support multiple rows with the same key, and the case where I wanted just one row with any given key. I did this by specifying how a unique key could be constructed from the fields of the data object and this was then used as a lookup key pointing to the UUID of the object.

This solved the first two challenges used above. For credentials, I was able to use the “credential mechanism” (basic, OpenId, OAuth, etc.) and the user’s unique login id with that mechanism as the unique key; for the artefacts, I was able to use the combination of user id, operation and object UUID. In each case, I automatically added in the fact that this was a secondary key and the type of object being indexed.

The third challenge was different in nature and needed a different solution, but again using a view had been the wrong choice. In this case, the set of unique object ids to consider was contained within the actual object definition I already had in memory. Rather than look up the objects in a view, the correct approach was to read each of these objects from the key/value store using their UUIDs and see which ones had the appropriate characteristics.

While it’s not clear that this approach will scale to thousands or millions of contained objects, at the moment it’s also not clear that it needs to. But other (hybrid) approaches are possible in this case. For instance, it would be possible to analyze each contained object as it is written to the key/value store and to add it an appropriate set of characterized sub items if it matches the criteria. Balancing the size, number and relations between objects is a completely separate challenge I’m facing and one that may one day form the content of another blog post.

Key takeaways

The main takeaway is that it’s important to understand what it is you actually want to achieve and make sure that you match the right Couchbase access mechanism to address your application needs.

In my case, I was trying to overuse views because it is a cool and powerful feature that is hard to resist. But the fact is that, views are simply not a good fit for all data access requirements in your app.

The other thing to bear in mind is to be sympathetic to the needs of infrastructure software and recognize that for scalability it is important to have shared-nothing data. Railing against this, or trying to avoid it will just cause you pain.

So, how do you pick which Couchbase access patterns to use for your data? Following on from this experience, I have written down some guidelines that developers can use when deciding which data access pattern to pick. Although, this is not an exhaustive list, here are 4 typical patterns you should consider:

1. If you have the key of an object, then use that to get the object directly from the Couchbase key/value store.

2. If you don’t have that, but you are looking for an object that would have a unique secondary key, then attempt to find the key by reading it from the secondary index and then get the object from the Couchbase key/value store using its key.

3. If you already have an object and it contains references to other objects, then use those references directly; don’t go looking for them based on a relation that has been written into a view.

4. Finally, if you are looking to retrieve all the objects that match certain criteria across the entire database, and the semantics of your operation are such that there is no inherent ordering dependency, then access a view that you have defined. Remember that the views in Couchbase are eventually consistent.

Comments