I am asked routinely about storing images, documents, PDFs, and other binary objects in Couchbase. As an architect for services and databases, I usually give the same answer: it is a mediocre idea overall to store objects like this permanently in any database. I have personal experience working with many relational and NoSQL databases over my career to back this up.
I know from a developer perspective it is easy and convenient to store these types of objects in the database. It even makes sense logically: I have data, I need to store it, and I want information about that data so I can look it up and serve it to my users. This all makes sense to me too.
I have two arguments for why it is a mediocre idea to store these types of objects in a database (Couchbase or otherwise). I will also propose a solution for how best to use Couchbase in this scenario to get the best of both worlds: be cost efficient and serve the end user well. That last one is why we are all truly here in the end, right?
Operational expense and performance
By permanently storing large objects in a database, Couchbase or otherwise, you will be using your most expensive, and what should be one of your most performant, layers for objects that are usually static and change infrequently in most use cases. Just think of the per GB cost of storage on an EC2 instance in AWS as compared to storing that object in S3. When I use the AWS calculator as of this writing, S3 storage is at least 1/5th the cost of the cheapest EC2 storage. I say at least because the settings do not allow a 1:1 comparison of possibilities, since the two services are really meant for different things. S3 is specifically meant to hold lots of static objects at a very high rate of durability for a very reasonable cost, while EC2 storage is meant to be operational storage.
Now think about the physical cost of storing that object in a database, then replicating it, backing it up, etc. for months or years. The hard costs and operational time to haul that volume of data around become clear. Over time these objects become ship anchors around the operations team's necks. In addition, if you do not manage things correctly, you could have a database two years from now storing, backing up, and replicating images from a user who quit using your service 18 months ago, and you are paying for every KB multiple times over.
All of that overhead simply for development convenience. It just is not worth it or efficient in the long run most times. Again, this is not exclusive to Couchbase by any means.
Use each tool for what it is best at
Couchbase can serve up objects right from the managed cache in RAM with great performance if you access them via object ID or have identified which object you want from a query. Can Couchbase store and serve up that image or larger binary very fast from RAM as well? Of course it can, but it will consume expensive server resources, not just storage, to get that performance. Couchbase also has an object size limit of 20MB. Even if your objects do not approach that size, it still may be a bad idea to store these types of objects in a database permanently. As I mentioned earlier, Amazon S3 and HDFS are excellent at storing and serving up static content like this. This is what they were designed for. They offer great performance at a great value for that type of workload. It is best to use the tools at your disposal for what they are best at. While databases can do this, it is definitely not what they are best at.
Then how do I solve this? I have stuff to store!
For this example, we will talk about an image store, since that is the most common use case I hear, but it could be any kind of large static object. At a high level for this use case, you should look to store in Couchbase just the data that is required by the application when a user is looking for an image. When planning out your data access patterns, and thus your database object model, ask yourself a few questions:
What data does the app need to present to the user about each image and when in the flow?
What searches will be done about the image and how will the results be presented? (Keywords, title, creator, create date, etc.) If it fits your use case, you might even store a tiny thumbnail image in the database for fast delivery.
When in the application flow will each piece of data be needed?
Now that you have the user-facing data in Couchbase, you can do fast key lookups, map reduce views, or full N1QL queries against secondary indexes to get at the data. The large images should be stored in something like AWS S3, HDFS, a Content Delivery Network (CDN), a web server, a file server, or whatever else is great at serving up large static objects and fits your use case and budget.
Now let’s dive a little deeper and talk more about how we architect this.
Example object model
I propose two objects in Couchbase for each actual image in your application:
A JSON object containing the metadata about the image. It will be in JSON so we can index it, query it with N1QL or Views, whatever the application needs. In this object will also be the pointer to the main image in the other system.
A Key/Value object containing the thumbnail of the main image, stored in a separate Couchbase bucket. We are keeping the thumbnail in Couchbase so we have fast presentation of it to a user. We could in theory have this as a value in the JSON document with the metadata, but the advantages of having them separated, when it comes to indexing and Couchbase resource utilization, outweigh that convenience, especially if you plan on a high data mutation rate.
Since every object in Couchbase must have a unique object ID (within a bucket), and that ID can be up to 250 bytes, let's use that to our advantage and have a standardized object ID pattern for easy and fast object retrieval. A standardized object ID will help us easily retrieve an image and its related content quickly from the Couchbase Data Service or when querying with N1QL.
The object ID pattern for each document will be as follows:
Metadata object: metadata::<UUID>

where <UUID> is the unique identifier assigned to that image by the application. Since we are going to be finding images by querying with N1QL, I am not going with a more descriptive object ID.

Thumbnail object: thumbnail::<UUID>

where <UUID> is the same identifier we used for the metadata object. This way, we establish an informal relationship between the objects: we know that each metadata object has a corresponding thumbnail object. So if we need them both, once we know the UUID we can get the thumbnail very quickly, or vice versa.
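As a quick sketch, building these paired IDs in application code is trivial. The helper below assumes the metadata::<UUID> / thumbnail::<UUID> pattern described above; the function name is my own invention.

```python
import uuid

def make_image_keys(image_uuid: str) -> tuple[str, str]:
    """Build the paired Couchbase object IDs for one image.

    Follows the metadata::<UUID> / thumbnail::<UUID> pattern; both
    keys stay well under Couchbase's 250-byte object ID limit.
    """
    return f"metadata::{image_uuid}", f"thumbnail::{image_uuid}"

# The application assigns each new image a UUID up front.
image_id = str(uuid.uuid4())
meta_key, thumb_key = make_image_keys(image_id)
```

Because both keys are derived from the same UUID, the application can construct either one without a lookup, which is what makes the informal relationship between the two objects useful.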
For the objects themselves:
The metadata object will be a JSON document and might look like the following:

{
  "title" : "Cute Kitty and Doggy",
  "file-location" : "https://s3.amazonaws.com/kittypics/cutekittyanddoggy.jpg",
  "thumbnail1" : "thumbnail::<UUID>",
  "dimensions-px" : "50x50"
}
The <UUID> part of the thumbnail reference would, of course, be replaced by the ID of that object. This way, when we get the metadata object, we have the thumbnail object's ID and can grab it quickly. This is one of those times where it is most likely better to make multiple calls to Couchbase, unlike other databases where it would be better to do it all in one.
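To make that two-call flow concrete, here is a minimal sketch using plain dictionaries as stand-ins for the two buckets; in a real application these would be key/value gets through the Couchbase SDK, and the document contents are just the example values from above.

```python
# Hypothetical in-memory stand-ins for the "metadata" and "thumbnails"
# buckets, purely to illustrate the two-call lookup flow.
metadata_bucket = {
    "metadata::42": {
        "title": "Cute Kitty and Doggy",
        "file-location": "https://s3.amazonaws.com/kittypics/cutekittyanddoggy.jpg",
        "thumbnail1": "thumbnail::42",
        "dimensions-px": "50x50",
    }
}
thumbnail_bucket = {"thumbnail::42": b"tiny-thumbnail-bytes"}

def fetch_image(image_uuid: str):
    # Call 1: key/value get on the metadata bucket.
    meta = metadata_bucket[f"metadata::{image_uuid}"]
    # Call 2: key/value get on the thumbnail bucket, using the
    # pointer stored in the metadata document.
    thumb = thumbnail_bucket[meta["thumbnail1"]]
    # The full-size image itself lives in S3 at meta["file-location"].
    return meta, thumb
```

Both calls are key/value gets straight from the managed cache, which is exactly the access pattern Couchbase is fastest at.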
The thumbnail object will simply be a key/value with the value being binary.
Couchbase specific settings
I propose two Couchbase buckets. One for the JSON documents containing the metadata about each image and one for the thumbnails. The two specific reasons for this are:
Separate buckets allow for tuning the managed cache by object need. For example, perhaps I want the metadata for each image to always be available and always as fast as possible. I size the RAM quota for the "metadata" bucket to have all of those objects in the managed cache for best performance. The thumbnails are larger objects, and perhaps we want to save a little money on the size of our instances and not keep as many of them in the cache, because if they show up a few seconds later, it is no big deal. We could size the RAM quota for the metadata bucket to be 300GB across the cluster, but the thumbnails to be 50GB across the cluster, even though the thumbnails might be the larger data set on disk.
We will never need to index or query the thumbnail objects. We can always grab them by the object ID that we got from the metadata JSON document, or by having the application construct it. To go a level deeper on why we want these objects in two separate buckets: when you do indexing in Couchbase, every object in a bucket is interrogated at some point to see if it should be included in an index. This is done by the View Indexer if you are using Views, or the Projector if you are using GSI (Global Secondary Indexing). If we keep these two data types in separate buckets, the indexer or projector needed for querying the JSON documents never has to bother with the thumbnail objects and waste cycles or resources, since indexes are bucket specific. Another bonus: if you are using Couchbase Views, which are stored along with the data, it should keep cluster rebalance times down, as again the View Indexer does not have to interrogate the thumbnails as the data moves. Overall this means you need fewer server resources, so it's more cost effective.
For purposes of this example, let’s call the two buckets something cryptic like “metadata” and “thumbnails”.
Value Evict (the default) or Fully Evict
More than likely, you want to avoid using Couchbase's full eviction feature for this particular use case. It is a great feature, but part of the reason to store these image metadata objects in Couchbase is the functionality, and also the performance you get from the managed cache. More than likely your use case will require checking for the existence of an object at some point in the application flow. If that is the case, full eviction will hurt, as you will have to go to disk for that check. If you use the default value eviction, you can tell very quickly whether an object exists, since the Couchbase metadata about every object stays in the managed cache at all times. So use this feature wisely, and only enable full eviction if you know exactly what it will do to your application and why.
An exception to the rule
As always, there are exceptions that fly in the face of the rules. There is one Couchbase customer I know of that does put binary objects (audio files, to be specific) into Couchbase Server with amazing success. They do it for a very specific reason that uses Couchbase to their advantage, though. They insert audio recordings into Couchbase, but the key is that their application breaks the audio files into smaller chunks and streams each into Couchbase as it comes in, along with a metadata document for that recording. The interesting thing is they do not permanently store the audio file in the database, for the reasons I already stated in this article. After a few minutes, if the audio file has not been accessed, a background process reconstructs each file and moves it to Amazon S3 for longer term storage. Then they update the audio file's metadata JSON document with a pointer to the file on S3. Very fast, high velocity ingestion with Couchbase, and longer term static object storage with S3. It is a great example of using the best tools for what they are best at.
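A rough sketch of that chunk-and-reassemble pattern is below. The audio::<id>::chunk::<n> key scheme and the chunk size are my own assumptions for illustration; the customer's actual design is not public, and the dictionary stands in for Couchbase key/value inserts and gets.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk, well under Couchbase's 20MB value limit

def split_into_chunks(recording_id: str, audio: bytes,
                      chunk_size: int = CHUNK_SIZE) -> dict[str, bytes]:
    """Break an incoming audio stream into chunk documents keyed as
    audio::<id>::chunk::<n>, ready to insert into Couchbase as they arrive."""
    return {
        f"audio::{recording_id}::chunk::{i // chunk_size}": audio[i:i + chunk_size]
        for i in range(0, len(audio), chunk_size)
    }

def reassemble(recording_id: str, store: dict[str, bytes]) -> bytes:
    """Background job: stitch the chunks back together in order before
    moving the whole file to S3 and updating the metadata pointer."""
    parts, n = [], 0
    while (key := f"audio::{recording_id}::chunk::{n}") in store:
        parts.append(store[key])
        n += 1
    return b"".join(parts)
```

The write path stays all fast key/value inserts into Couchbase, while the slow, durable storage happens asynchronously in the background job.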
Try to avoid permanently putting larger objects in a database, regardless of what database platform you use. Even if there is a special mock filesystem in the database that will break up your large binary files into smaller ones, store them, and reassemble them automagically for you, the same concepts apply. You are trading ease of development for an expensive and operationally more difficult life down the road. It will haunt you later.
For the best solution, use each tool for what it is best at. Store in Couchbase a metadata JSON document for each object, and maybe a small thumbnail image at most. That document holds the data your application needs quickly about each object, plus a pointer to a purpose-built object store like S3, a file system, or HDFS. You will get the best of all worlds: performance, ease of operations, and cost effectiveness, for not much extra work.
Disagree? Have another exception to the rule? Add it to the comments and let's talk.