Microsoft has generated a lot of buzz since the launch of CosmosDB. It is essentially a rebranding of Azure DocumentDB with some cool new features. Let's go a little deeper and explore its strategy, its documentation, what developers have been saying about it, and how it compares with Couchbase Server.

One Database to rule them all?

In simple words, Microsoft claims that CosmosDB is a NoSQL database able to do literally everything: it is a document database, a columnar store, a key-value store and a graph database, all achieved thanks to an abstraction of the data format called atom-record-sequence (ARS).

A good sign of Microsoft's work is how data is organized differently for each model. First, you have to choose which API you would like to use (SQL, MongoDB API, Microsoft Azure Table, Cassandra or Gremlin) and stick with it, as it can't be changed later. Currently, you can still access some models through the DocumentDB API, which is what gave me some hints that CosmosDB internally uses a decorated JSON format to store its data.

It looks like Microsoft wants to compete with most of the NoSQL databases out there, which is a risky strategy, as we may have passed the golden era of a single database solution for everything. There are huge benefits to choosing specialized storage, and that is the path most applications are following today with the rise of polyglot persistence. An all-in-one solution like CosmosDB might be good for low-demand applications, but all those abstractions come at a cost and will ultimately hurt simplicity, limit features and impact performance.

Couchbase vs CosmosDB – Comparing Apples with "Apples"

I will try to limit my comparison to scenarios where it makes sense to compare both technologies. The table below shows some of the differences side by side:

| Feature | CosmosDB | Couchbase Server |
|---|---|---|
| Licensing | Proprietary | Open source Community Edition; commercial Enterprise Edition |
| Type | Key-value store, document database, graph database, columnar storage | Key-value store, document database |
| Search | Azure Search | Built-in Full-Text Search |
| Indexing | All fields indexed by default | User-defined indexes, optionally memory-optimized |
| Data Integrity | Strongly Consistent, Bounded-staleness, Session (default), Consistent Prefix, Eventually Consistent | Strongly consistent within a cluster |
| Scalability | Highly scalable | Highly scalable |
| Mobile | Not available | Couchbase Mobile (Couchbase Lite + Sync Gateway) |
| Deployment | Azure only, fully managed; development version available only on Windows | On-premises or any cloud |
| Locking | Optimistic locking | Optimistic and pessimistic locking |
| Backup & Restore | Automatic backups | cbbackup/cbrestore tools |
| Querying | Made according to the chosen API; each API has different limitations | N1QL (SQL for JSON) and key-value access |
| Data Center Replication | Push-button global bidirectional replication | XDCR (cross data center replication) |
| Throttling | Only by increasing RUs | Vertical/horizontal scaling, per-node service placement |
| Administration Interface | A few endpoints to automate some tasks; some features aren't well documented | Feature-rich administration interface |
| Sharding | Partitioning is automatic, but you are required to pick a partition key | Automatic and transparent (vBuckets) |
| Security | Standard Azure security measures | Role-based access control, TLS |
| Architecture | Not publicly documented | Publicly documented |
| Integrations | Spark, Hadoop, other Azure services, MongoDB connector, Cassandra, Gremlin | Spark, Hadoop and Kafka connectors |

Conclusion

I think this is the very first article comparing CosmosDB with another database. It took me a good amount of time to go through a lot of documentation, developer feedback and some webinars.

My general feeling is that CosmosDB has a great vision, but it is still immature in some respects. Documentation and backups, for instance, are not among its strengths, which is a natural consequence of building something that focuses on multiple fields at once. Microsoft's database also brings a lot of innovations; one of the most prominent is its set of relaxed consistency levels: Bounded-staleness, Session, Consistent Prefix and Eventually Consistent.

The fact that Session is the default consistency level says a lot about the recommended way to use CosmosDB. It also hints that it might not be the best solution if you need strong data consistency.
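To make the difference between the relaxed levels concrete, here is a minimal, self-contained sketch (plain Python, not the CosmosDB SDK) of eventual vs session consistency: a session token lets a client read its own writes even while a replica lags behind.

```python
class Replica:
    """A read replica that applies writes from a log up to some position."""
    def __init__(self):
        self.applied = 0   # how many log entries have been applied so far
        self.data = {}

    def catch_up(self, log, upto):
        while self.applied < upto:
            key, value = log[self.applied]
            self.data[key] = value
            self.applied += 1

class Store:
    def __init__(self):
        self.log = []              # ordered write log (the "primary")
        self.replica = Replica()   # a lagging read replica

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log)       # session token: log position of this write

    def read_eventual(self, key):
        # Eventual consistency: read whatever the replica happens to have.
        return self.replica.data.get(key)

    def read_session(self, key, token):
        # Session consistency: first make sure the replica has caught up
        # to this client's own writes (the session token), then read.
        self.replica.catch_up(self.log, token)
        return self.replica.data.get(key)

store = Store()
token = store.write("user:1", "Alice")
print(store.read_eventual("user:1"))        # None  (replica is lagging)
print(store.read_session("user:1", token))  # Alice (read-your-writes)
```

The real implementation is of course far more involved, but this is the contract each level makes with the application.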

I could not find any mention of caching mechanisms in CosmosDB, so I am assuming that caching is not a major part of the database. That is a problem, because caching is crucial for good performance in strongly consistent databases; being memory-first is one of the reasons why Couchbase Server is so fast.
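As a toy illustration of why a memory-first read path matters (this is not Couchbase's actual managed cache, just the idea), a read-through cache serves repeated reads from memory and only touches the slow storage layer on a miss:

```python
class ReadThroughCache:
    """Minimal read-through cache: serve from memory, fall back to storage."""
    def __init__(self, backend):
        self.backend = backend   # a dict standing in for slow disk storage
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1            # fast path: memory-resident document
            return self.cache[key]
        self.misses += 1
        value = self.backend.get(key)  # slow path: go to "disk"
        self.cache[key] = value        # keep it in memory for next time
        return value

disk = {"doc:1": {"name": "Alice"}}
cache = ReadThroughCache(disk)
cache.get("doc:1")                # miss: fetched from the backend
cache.get("doc:1")                # hit: served from memory
print(cache.hits, cache.misses)   # 1 1
```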

CosmosDB does not provide memory-optimized indexes, and by default all fields are indexed in its global secondary indexes (GSI). That sounds like overkill to me: I still think it is easier to specify which fields I want indexed than which fields I don't. Of course, you don't necessarily need to remove those fields from the index, but don't forget you are being charged for them.
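For illustration, trimming the index is done through the collection's indexing policy, where you exclude everything and include only the paths you query on. A sketch of roughly what that looks like at the time of writing (treat the exact syntax as an approximation and check the current documentation), indexing only `name` and `city`:

```json
{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/name/?" },
    { "path": "/city/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}
```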

Sharding currently seems to be one of the trickiest things in CosmosDB. Partitions are moved automatically among nodes, but you still have to specify a partition key. The drawback of this approach is that each partition is indivisible, with a maximum size of 10 GB. If you pick a bad partition key, many frequently accessed documents can end up in the same partition, which caps the throughput of your reads/writes at the capacity of the node where that partition is stored.
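A toy simulation of why the key choice matters (plain Python, hash-based placement in the spirit of CosmosDB's, not its actual hash): a low-cardinality key like a date funnels every document into one partition, while a high-cardinality key like a user id spreads the load.

```python
import hashlib
from collections import Counter

def partition_for(key, num_partitions=8):
    """Deterministically map a partition-key value to a physical partition."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_partitions

# 1,000 orders, all placed on the same day.
orders = [{"order_id": i, "user_id": i % 100, "date": "2017-08-01"}
          for i in range(1000)]

# Bad key: every document shares one date value, so one partition
# takes the entire read/write load.
by_date = Counter(partition_for(o["date"]) for o in orders)

# Better key: 100 distinct user ids spread documents across partitions.
by_user = Counter(partition_for(o["user_id"]) for o in orders)

print("partitions used with date as key:", len(by_date))     # 1
print("partitions used with user_id as key:", len(by_user))  # spread out
```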

The partition key is also immutable, so changing it requires copying your whole dataset to another collection. In Couchbase, documents are transparently distributed evenly across vBuckets, which avoids this problem and also improves read/write performance.
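The vBucket idea can be sketched in a few lines (this is a simplified illustration, not Couchbase's exact production hash): the document ID itself determines the vBucket, and only the vBucket-to-node map changes when the cluster is resized, so documents never need a new key.

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase Server uses 1024 vBuckets

def vbucket_for(doc_id):
    """Hash the document ID to one of 1024 vBuckets (simplified sketch)."""
    return zlib.crc32(doc_id.encode()) % NUM_VBUCKETS

# vBuckets are assigned to nodes (round-robin here for simplicity).
# Rebalancing moves vBuckets between nodes; document placement logic
# on the client never changes.
nodes = ["node1", "node2", "node3"]
vbucket_map = {vb: nodes[vb % len(nodes)] for vb in range(NUM_VBUCKETS)}

doc_id = "user::1234"
vb = vbucket_for(doc_id)
print(f"{doc_id} -> vBucket {vb} -> {vbucket_map[vb]}")
```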

Currently, throttling is done only by increasing Request Units (RUs), which is a common approach for fully managed databases (on DynamoDB, for instance, throttling is done by adjusting read/write capacity units). The challenge with this approach is that RUs are not a very good predictor of query performance, and they make it even harder to boost just one specific behavior, such as increasing only write capacity.

Microsoft has put a lot of effort into making RU provisioning easy to understand, but I have found many comments from developers who underestimated their RUs (like here or here) and ended up with a bill much higher than expected. In general, the provisioning pattern I have seen with CosmosDB is mostly trial and error. On Couchbase, throttling is very flexible: it can be done by vertical/horizontal scaling, by running specific services according to each node's hardware, by keeping indexes in memory, etc.

Microsoft is also clearly trying to convince MongoDB users to migrate to CosmosDB, and even provides a fairly compatible connector to make the migration easier. The problem is that the root cause of why some users want to migrate away is MongoDB's scalability and performance issues. We know this very well, because many of those users end up migrating to Couchbase Server, and CosmosDB's performance does not seem to be a big plus, at least not at a reasonable cost.

Microsoft does provide a limited local version for development, but so far it runs only on Windows machines.

CosmosDB also provides a cool push-button global data distribution feature that makes it really simple to replicate data across multiple regions of the world. However, it is not a feature used often enough to require that much simplicity, and the same result can be achieved in a matter of minutes in Couchbase Server, without the limitation of running in a single cloud.

In summary, I agree with CosmosDB's point of view that "eventual consistency" is too broad a definition. Its new consistency models let developers choose the level of consistency their application tolerates.

The reasons to use it are nearly the same as the ones mentioned in my article about DynamoDB. The main difference, of course, is that CosmosDB is much more flexible than DynamoDB. Right now, it is an average multi-purpose database for applications demanding average performance with strong consistency. It also integrates easily with some features of Azure Functions.

CosmosDB still lacks famous use cases and clients, but it has the potential to stand out in applications that can live with eventual consistency, as that seems to be its main focus. When it comes to strongly consistent medium/high-demand applications, though, Couchbase Server is by far the better choice, from both the price and the performance point of view.

It is hard to come up with a fair benchmark between the two databases, as it is unclear, for instance, how many servers are running when you provision 30,000 RUs in CosmosDB, so the easiest way to predict their expected performance is through their architecture and features.

Pretty much like DynamoDB, CosmosDB's pricing is attractive if you have a small database with few reads/writes per second, but anything above that will cost you a good amount of money: 200,000 documents of 45 KB, with 4 writes/sec and 40 reads/sec, will cost at least US$2,500.
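For a sense of scale, here is a rough back-of-envelope RU estimate for that workload. The per-operation costs are assumptions on my part (roughly 1 RU per 1 KB read, and about 5x that for a write of the same size); real charges also depend on indexing, consistency level and operation type, so use Microsoft's capacity planner for anything serious.

```python
# Workload from the paragraph above.
DOC_SIZE_KB = 45
READS_PER_SEC = 40
WRITES_PER_SEC = 4

# Assumed per-operation costs (rough approximations, not official numbers):
read_ru = DOC_SIZE_KB * 1    # ~45 RU to read one 45 KB document
write_ru = DOC_SIZE_KB * 5   # ~225 RU to write one 45 KB document

required_rus = READS_PER_SEC * read_ru + WRITES_PER_SEC * write_ru
print(f"~{required_rus} RU/s to provision")  # ~2700 RU/s
```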

Their calculator does not take into account the consistency model you are going to use, so you have to add a few extra dollars to this number for strong consistency. In this setup, CosmosDB costs at least double what you would spend to run Couchbase EE on Amazon Web Services with our recommended architecture (which is capable of handling more than that).

As I mentioned at the beginning of the article, there are a lot of advantages to choosing specialized storage for each specific purpose, and Couchbase Server really excels at delivering high performance with strong consistency.

If you have any questions, feel free to tweet me at @deniswsrosa

Author

Posted by Denis Rosa, Developer Advocate, Couchbase

Denis Rosa is a Developer Advocate for Couchbase and lives in Munich, Germany. He has solid experience as a software engineer and is fluent in Java, Python, Scala and JavaScript. Denis likes to write about search, big data, AI, microservices and everything else that helps developers build beautiful, fast, stable and scalable apps.

One Comment

  1. Really inaccurate on the pricing front for Cosmos DB. 40 reads/sec and 40 writes/sec at 45kb docs requires nearly 2500 RUs (request units) – they are not dollars! They are a throughput measure. Cost for that workload is around $130 / month.

    The key advantage like every PAAS service is you don’t have to manage a bunch of VMs yourself and set every aspect of backups/replication/tuning/patching/etc etc. Also you’re not hit with a license cost as well as VM cost for your cluster. You only pay for the throughput you use. Depends if you want a service that requires minimal management, or if want to tweak everything yourself.
