This morning, Matt Aslett of The 451 blogged on "The beginning of the end of NoSQL," in which he highlighted the uselessness of the NoSQL category name. Good post, as usual. But this isn't news. People have been griping about the term since the day it was coined.
I wholeheartedly agree with Matt that a focus on use cases ("what problem are you trying to solve?") will be far more productive than focusing on labels. But I disagree with Matt when he seems to imply that it may be any more useful to categorize in terms of underlying data model (key-value, document, column-oriented, graph, etc.). These categories offer only marginal improvement if one is focused on assessing fitness of purpose for a particular task.
Mongo is a document database. So is Couch. But these two solutions are heading in drastically different directions from a use case perspective.
Mongo is a document database. Cassandra is a column-oriented database. Membase and Riak are key-value stores. But these solutions are on a direct collision course from a use case perspective. They are being evaluated interchangeably by customers today and each is winning or losing, frequently based on capabilities that have nothing to do with the underlying data model.
Yes, NoSQL is a useless category name.
There are at least two good reasons why the NoSQL moniker sucks:
- Some "NoSQL" systems actually provide dialects of SQL as the query language for accessing data within the database. Google App Engine has the GQL language. Quest has Toad for Cloud Databases, which seeks to provide a SQL interface for HBase, Azure Tables, Amazon's SimpleDB and other "NoSQL" databases. Hive brings a dialect of SQL to Hadoop, facilitating ETL.
- As Matt points out, there are too many different classes of technology lumped under the NoSQL banner today. No doubt about it. But I propose that simply talking in terms of one aspect of a complex technology stack brings us only marginally closer to addressing the real problem. Humans are wired to desire nice neat categories in which to lump concepts and things. But I suggest that NoSQL as a label for Mongo is only slightly less useful than saying Mongo is a document database, if what you are trying to do is figure out what it is good for.
Categorizing by data model is pretty useless too.
So what would be a nice, clean, representative set of categories for all these emerging solutions? One that would actually allow an observer to determine whether or not the solution category is appropriate for a given use case? I don't have the answer and I'm not sure there is a good answer. Perhaps the use cases themselves provide the right categorization. In any case, I'm looking forward to reading the 451 report that Matt indicates will help structure the thinking around these "database alternatives."
Ultimately, categorizing these solutions will require looking beyond the underlying data model alone (key-value, document, column-oriented, graph, etc.). Rather, these systems should be compared using a larger, hopefully manageable, set of attributes: Must you declare a schema before inserting data? Can you change the schema on the fly if one is required? How hard is it to do that? Can the database transparently (to an application) spill across machines or is it a single-server focused solution? Must you take down the database to add or remove capacity? Can you query the database using a query language or must you write code? Does the system maintain indices to speed queries? How does the database perform on random and sequential operations? How does it perform on reads versus writes? Is data written to durable media immediately, or eventually, and what is my data loss exposure on node failure? How about on datacenter failure? Can I change that exposure through synchronous operations? What will that do to performance? Can the database work across datacenter boundaries? Will I always read my writes, or are there periods of data inconsistency across readers?
These are certainly far more relevant questions to ask when trying to ascertain fitness of purpose for any given use case. Again, the answers to these questions correlate only weakly with the narrow categories into which these systems are being slotted, and it matters little whether you slot them all into one large "NoSQL" category or break them up into subcategories based on a single-axis, data-model-driven approach (key-value, document, column-oriented, graph, etc.).
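As a rough illustration of comparing systems along these attributes rather than by label, here is a minimal Python sketch. Everything in it is hypothetical: the `Profile` fields are just a few of the questions above turned into flags, and "System A" and "System B" are made-up systems that describe no real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Answers to a few of the questions above for one hypothetical system."""
    name: str
    data_model: str
    schema_required: bool
    spans_machines: bool
    resize_without_downtime: bool
    has_query_language: bool
    read_your_writes: bool


def fits(profile: Profile, requirements: dict) -> bool:
    """True when the system matches every attribute the use case cares about."""
    return all(getattr(profile, attr) == wanted
               for attr, wanted in requirements.items())


# Two made-up systems that share a data model but answer the questions differently.
system_a = Profile(name="System A", data_model="document", schema_required=False,
                   spans_machines=False, resize_without_downtime=False,
                   has_query_language=True, read_your_writes=True)
system_b = Profile(name="System B", data_model="document", schema_required=False,
                   spans_machines=True, resize_without_downtime=True,
                   has_query_language=True, read_your_writes=False)

# A made-up use case: the database must scale out and resize while running.
scale_out_web_app = {"spans_machines": True, "resize_without_downtime": True}

candidates = [p.name for p in (system_a, system_b) if fits(p, scale_out_web_app)]
print(candidates)
```

Both made-up systems carry the same "document" label, yet only one of them passes the filter, which is the point: the data model alone says little about fit.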
Is NoSQL just a repeat of the OODBMS fiasco of the late 90s?
Why is there so much noise around NoSQL anyway? Is this just a redo of the object-oriented database hype we all lived through in the late 1990s? In that fiasco, OODBMS vendors, pundits and investors threw relational database technology under the bus as a poor "impedance match" for increasingly dominant object-oriented application development models. There was a claim that it was more "natural" for developers to store data in the database in its native object form, with arguments that it would be more efficient too.
But in reality, there was little real pain being addressed. In fact, the shift from a known, working technology to an unproven solution that promised a more "elegant" and theoretically correct approach was actually only guaranteed to provide one thing: disruption. Object-relational mapping (ORM) layers designed to bridge the "mismatch" between the object and relational data models are not perfect; but they are better than turning your world upside down if you don’t have to. The tens of billions of dollars in revenue forecasted by analysts for the OODBMS market never even came close to materializing.
So, is there a real problem now? A real use case, or use cases, for which existing database technologies are truly insufficient? Where the disruption of a technology shift will have substantive economic impact? The answer is a resounding yes.
Use cases are driving the divergence, and the convergence, of NoSQL solutions.
Couch has staked a claim on the mobile data synchronization use case. In a computing world that is increasingly dominated by mobile devices, synchronization of data between the cloud and mobile devices (for data availability even when disconnected from the network) is a problem that many applications must solve. There are many things to consider: intermittent connectivity, widely divergent platforms on which these synchronized databases must run and, perhaps, the expectation that the data set being moved around and synchronized will normally need to fit on a single box or device. Couch is focusing on these requirements, and making appropriate simplifying assumptions, allowing Couch to address the use case better than anyone else. Couch is a document database. So is Mongo. Mongo is a poor solution for this use case. It is not designed to keep transiently connected systems in sync and, although Mongo was initially designed as a single-node database, the sharding and replication work done in the last year clearly indicates that Mongo is moving in a different direction. In this case, there is clearly divergence of solutions, even within the narrower “document database” slice of the "NoSQL" market.
On the other hand, there is cross-category convergence occurring among other NoSQL solutions which seek to address another extremely large, generalized use case: storing data behind interactive web applications. At Membase we talk to a lot of potential users grappling with this problem every week; and we are consistently being evaluated against Cassandra (column), Mongo (document) and Riak (also key-value).
Web applications that allow organizations to directly interact with consumers are increasingly the most common form of new interactive software system being built. These systems are characterized by random concurrent usage patterns by large user populations (big audience) and by their propensity to accumulate large data sets (big data). There is also a push toward the cloud computing model, particularly for this class of application, in which "scaling out" (adding more cloud machine instances, virtual machines or commodity servers) is preferred over running workloads on large, dedicated, "big iron" machines. These realities have led to the widespread need for a new class of database management system that is designed, from the ground up, to allow scaling horizontally and to cost-effectively support high measures of concurrency against rapidly growing data sets. Perhaps we can call this the cloud database, the elastic database, the scale-out database or the auto-sharding (:P) database use case.
So what does a database need to provide in order to solve this problem? I would argue that it needs to be simple, fast, elastic and safe. If you consider Membase, Mongo, Cassandra and Riak, each of which explicitly aims to solve this generalized "cloud database" problem, the scores on each of these points vary.
Let’s pick on the first characteristic. In order to succeed, a general-purpose cloud database must be simple to get, to develop against and to operate in production.
- Membase is extremely easy to get, install and begin using. So is Mongo – easier than Membase in some cases, harder in others. Cassandra is challenging in just about every situation.
- Mongo is easy to develop against - providing rich queries, index management and the ability to operate on a variety of interesting data types inside the database itself. Membase provides a Memcached-compatible key-value API to developers which, while extremely easy to use, puts more burden on the application developer for many common database operations than does Mongo. Cassandra presents a more complicated development model, but does allow rich queries. Riak queries rely exclusively on map-reduce.
- Membase is easy to manage in production, providing a rich set of monitoring and management tools that give deep insight into the operation of a small or large cluster of servers, which helps increase system uptime. Mongo provides far less insight into the operation of a cluster, and this was the deficiency that Mongo itself primarily pointed to following the Foursquare outage.
A similar comparison can be made for each of the remaining characteristics: fast, elastic and safe. Each solution is strong in certain of these areas and relatively weak in others.
But the most important point is that each of these solutions is moving to shore up its relative deficiencies to better meet customer needs. There is convergence driven by use case. No project is focused on simply being a great key-value, or a document, or a column-oriented database management system. Each is focused on solving a real-world problem: providing a cost-effective place to store data behind interactive web applications, as previously characterized. To that end, Membase is adding query and indexing capabilities. Mongo recently retrofitted sharding and replica support to its initial single-server solution and is redesigning its storage engine to increase data safety. Redis, another “NoSQL” solution, is adding cluster management capabilities in order to become truly “elastic”. There is clear convergence among the solutions targeting this use case. Each of these projects identifies ad and offer targeting; social gaming; web application session state management; and real-time event processing as use cases for which its system is intended. These are all interactive web application use cases.
Non-relational is the real, consistent technology thread in cloud databases.
There is one thing that can be consistently said about these cloud database solutions, from a technical perspective. They aim to scale horizontally, and that is very difficult (arguably impossible) to do cost-effectively, with acceptable performance, while employing the relational data model. So each of these “NoSQL” database systems is a “non-relational” system. That is ultimately what was intended by “NoSQL.”
In fact, each of these systems is, at its core, a key-value store. They vary in the techniques used to either look inside the values or to group the values in order to operate on them (a key-value store sees the value as an opaque blob, a document store sees the value as a formatted collection of attributes and data types, and a column-oriented store groups individual key-value pairs into “columns” using one or more separate data structures). But in each case, rather than disaggregating a data record (tuple) into entries stored across a normalized set of cross-referenced tables, as in the relational model, these cloud databases store the data fields for a particular “record” in a single location. This makes it very easy to automatically distribute records (“to shard the database”) across many nodes in a database cluster. There are pros and cons to this approach.
On the con side, de-normalization leads to increases in the total size of the dataset (as some data is inevitably stored multiple times in the database, where there would instead just be references in a normalized relational database), and increases the complexity of doing joins. On the pro side, it makes it easy to smear data across many cheap servers and to change that distribution on demand without application disruption. It also removes the requirement that you pre-define a schema (or change the database schema) before inserting data. When in doubt, store it. You can infer a schema later. This makes it much easier to collect information that previously may have gone uncollected. This is ultimately perhaps the place where NoSQL database technology will create the most value.
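To make the trade-off concrete, here is a minimal Python sketch of whole records being spread across nodes by key. It is a toy under stated assumptions, not how Membase, Mongo, Cassandra or Riak actually place data: `TinyCluster`, `node_for` and the sample orders are hypothetical names and data, and hash-modulo placement is used only because it is the simplest scheme to show.

```python
import hashlib
import json


def node_for(key: str, num_nodes: int) -> int:
    """Pick a node from a stable hash of the record key (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


class TinyCluster:
    """A toy cluster: each node is a dict that holds whole records."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes = [{} for _ in range(num_nodes)]

    def put(self, key: str, record: dict) -> None:
        # The entire record lands on one node; no schema is declared first.
        self.nodes[node_for(key, len(self.nodes))][key] = record

    def get(self, key: str):
        # A read touches exactly one node and needs no join.
        return self.nodes[node_for(key, len(self.nodes))].get(key)

    def resized(self, num_nodes: int) -> "TinyCluster":
        """Return a new cluster with the same records spread over num_nodes."""
        other = TinyCluster(num_nodes)
        for node in self.nodes:
            for key, record in node.items():
                other.put(key, record)
        return other


cluster = TinyCluster(num_nodes=3)

# Denormalized records: the customer details are repeated in every order.
cluster.put("order:100", {"item": "book",
                          "customer": {"name": "Ada", "city": "London"}})
# A new field ("gift_wrap") is simply stored; nothing has to be altered first.
cluster.put("order:101", {"item": "pen", "gift_wrap": True,
                          "customer": {"name": "Ada", "city": "London"}})

# The con: the same customer data is stored once per order.
copies_of_city = json.dumps(cluster.nodes).count("London")

# The pro: the same records can be redistributed over more nodes.
bigger = cluster.resized(num_nodes=5)
```

In a normalized relational design the customer's city would be stored once and referenced from each order; here it is stored twice, but every order can be read from, or moved to, a single node on its own. Modulo placement reshuffles most keys whenever the node count changes, which is one reason a production system would choose a different placement scheme.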
So, let’s find a better name.
We’re quite ready to dump the NoSQL label. If people can rally around a taxonomy that is more use case-centric, I believe that would be a win for users grappling to figure out what these systems are good for.
Cloud database is a name that gets to the heart of the problem that Membase is trying to solve, for all the reasons articulated above. But it may be a bit too trendy or amorphous. I would love to hear what people think.