The Need

Ah, database migrations. After you’ve migrated to Couchbase where the JSON document data representation is much more flexible, you won’t as often need to twist your development process through the change request pipeline followed by the downtime, possible errors and just general unpleasantness when bound to rows and columns.

That said, just because you are on Couchbase does not mean that you are always free of needing to do data transformation with data in Couchbase. It will just be a lot less common.

This is actually an interesting place for Apache Spark!

An Aside

With the Spark connector and Couchbase 4.0 you have not just one, but four interfaces that are relevant to those using Spark. They’re the K-V interface, the Database Change Protocol (a.k.a. DCP) streaming interface, the N1QL query interface via Spark SQL and the View Query interface.

These can be combined with a host of different datasources from the Spark ecosystem to pull together and manipulate data in a number of ways. For instance, you may want to stream the data out from Couchbase via DCP mix it up with an HDFS data source and put the target results back in a different Couchbase bucket.

The Solution

Taking a simple case, how can we use Spark to write some code to efficiently transform a set of data inside Couchbase?

Imagine the scenario where you have acquired a new set of data about game players in JSON format. They’re all going to be playing your new FizzBuzz game soon, and the profiles have come over from a partner. The incoming profiles all look a bit like this:

The problem is that the FizzBuzz profiles all look like this:

Normally, if you had another shape for the data, you’d add a little bit of logic for mapping at the time of read and write. However, this particular transition is a one-time process and comes with an additional wrinkle. That “entitlementtoken” needs to be looked up from a MySQL database backup you also have. You don’t want to have to provision or maintain a big MySQL deployment to handle launch day traffic, so a one time pre-launch transformation is better.

Streaming the data out, finding those with the ‘shape’ we want and then transforming them with Spark based on a SQL query would be ideal.

First, we need to set up our connection and stream the data out, looking for the shape of the already imported JSON. This will use Couchbase’s DCP interface to stream out the data.

A current limitation is that the DStream never stops, but we can simply monitor for when we don’t see any more data being transformed as a workaround for this simple case.

Then, on a per-item basis, we need to apply a transformation based on this MySQL lookup. To do so, we'll need to load the data from MySQL. Assuming the MySQL table looks like so:

We will want to load the MySQL data as a DataFrame. Since the StreamingContext gives us RDDs to join to, we'll convert it to a set of RDDs for a later join inside the stream. Spark 1.6 may make this easier. That conversion looks like this (extracted to a function for redablility):

We also need to add the new entitlement token (also extracted):

In the end we have a nice fluent description of our transformation modifying the RDDs in flight that need changes. This ultimately writes the transformed data back into Couchbase, overwriting the items using the K-V interface.

The full example is in the couchbase-spark-samples repository.

The beauty of this example is that it’s easy to understand what’s happening and fairly trivial to scale. Chances are your own transformation is more complex, but this should give you a sense of what is possible and something to build upon.

There is always room for improvement.

One problem is that the MySQL could be larger than what I want to load into memory. Spark does account for that by giving you a way to split the DataFrames. I didn’t need that here and wanted the sample to be readable. The other thing that may help this is the ability to reference a SparkContext from within an existing StreamingContext. Spark doesn’t allow that at the moment for good reasons, but I’d argue this simple use case of doing a single record lookup from inside the stream there sort of makes sense.

On the Couchbase Connector, at the moment, the DCP interface is classified as volatile and should be considered experimental. Also, the example above is very fast but needs some help to scale. A forthcoming update from my colleague Sergey Avseyev will allow for splitting the DCP streams across Spark workers to parallelize this transformation.

To Conclude

Spark is a great new tool for this kind of transformation. The same techniques can certainly be applied to migrating to Couchbase from a different data source, like a relational database. The technique can even be expanded with Spark's machine learning to build a model around data streaming from Couchbase for anticipating results.

Posted by Matt Ingenthron, Senior Director, SDK Engineering, Couchbase

Matt Ingenthron is the Senior Director of SDK Engineering at Couchbase and an experienced web architect with a software development background. He has deep expertise in building, scaling, and operating global-scale Java, Ruby on Rails, and AMP web applications. He has been a contributor to the memcached project, one of the maintainers of the Java spymemcached client, and a core developer on Couchbase. He is currently heading up Couchbase's work in getting the right bits needed for PHP, Java, .NET and Ruby developers.

Leave a reply