Transferring Data From Cassandra to Couchbase Using Spark

Shivansh Srivastava is a polyglot developer and a Scala and Spark aficionado. He likes to contribute to open source projects and has contributed to Apache Iota, Apache Spark, Apache Carbondata, Couchbase-Spark-Connector, Akka, Alpakka, and many others. He has a keen interest in upcoming technologies such as IoT and deep learning. He currently holds the position of Sr. Software Engineer at Chirpanywhere Inc, an IoT-based startup, where he manages everything from hardware programming to designing and deploying the whole solution.

There are many NoSQL databases on the market, such as Cassandra, MongoDB, and Couchbase, and each has its pros and cons.

Types of NoSQL databases

There are mainly four types of NoSQL databases, namely:

Column-oriented
Key-value store
Document-oriented
Graph

Databases that support more than one of these models are called “multi-model.” Couchbase, for example, supports both the key-value and the document-oriented model.

Sometimes we choose the wrong database for our application and realize this harsh truth at a later stage.

Then what? What should we do?

That was our experience: we were using Cassandra as our database and later discovered that it was not fulfilling all of our needs. We needed to find a new database, and Couchbase turned out to be the right fit.

The main difficulty was figuring out how we should transfer our data from Cassandra to Couchbase, because no such plugin was available.

In this blog post I’ll describe the code I wrote to transfer data from Cassandra to Couchbase using Spark.

All of the code is available here: cassandra-couchbase-transfer-plugin

Explanation of the code

Here, I read data from Cassandra and write it to Couchbase. This simple code solves our problem.

The steps involved are:

Reading the configuration:
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load()
// Couchbase configuration
val bucketName = config.getString("couchbase.bucketName")
val couchbaseHost = config.getString("couchbase.host")
// Cassandra configuration
val keyspaceName = config.getString("cassandra.keyspaceName")
val tableName = config.getString("cassandra.tableName")
val idFeild = config.getString("cassandra.idFeild")
val cassandraHost = config.getString("cassandra.host")
val cassandraPort = config.getInt("cassandra.port")
Setting up the Spark configuration and creating the Spark session:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("CouchbaseCassandraTransferPlugin")
  .setMaster("local[*]")
  // Open the target bucket; the value is the bucket password (empty here)
  .set(s"com.couchbase.bucket.$bucketName", "")
  .set("com.couchbase.nodes", couchbaseHost)
  .set("spark.cassandra.connection.host", cassandraHost)
  .set("spark.cassandra.connection.port", cassandraPort.toString)
val spark = SparkSession.builder().config(conf).getOrCreate()
val sc = spark.sparkContext
Reading data from Cassandra:
// Despite its name, cassandraRDD is a DataFrame, not an RDD
val cassandraRDD = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> tableName, "keyspace" -> keyspaceName))
  .load()
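Checking the id field:
The code checks whether the configured id field exists among the table's columns. If it does, its value is also used as the document id in Couchbase; otherwise a random id is generated and assigned to each document.
import org.apache.spark.sql.functions._

// Wrap the UUID generator so it can be applied as a column expression
val uuidUDF = udf(CouchbaseHelper.getUUID _)
val rddToBeWritten = if (cassandraRDD.columns.contains(idFeild)) {
  // Reuse the existing column as the document id
  cassandraRDD.withColumn("META_ID", cassandraRDD(idFeild))
} else {
  // No such column: give every row a random id
  cassandraRDD.withColumn("META_ID", uuidUDF())
}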

In a different file:
import java.util.UUID

object CouchbaseHelper {
  // Generates a random id for documents that have no id column
  def getUUID: String = UUID.randomUUID().toString
}
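Writing to Couchbase:
// Brings the couchbase() writer into scope; by default the connector
// uses the META_ID column as the document id
import com.couchbase.spark.sql._

rddToBeWritten.write.couchbase()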
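
The id-selection step can also be illustrated without Spark. The following is a minimal sketch in plain Scala, not the plugin's code: the names `IdSelectionSketch`, `Row`, and `withMetaId` are hypothetical, and a row is modelled as a simple map instead of a DataFrame row. Note that the plugin makes the decision once for the whole table (by looking at its columns), whereas this sketch decides per row.

```scala
import java.util.UUID

object IdSelectionSketch {
  // A row is modelled as column name -> value; this stands in for a Spark Row.
  type Row = Map[String, Any]

  // Returns the row with a META_ID entry: the value of the configured column
  // when that column exists, otherwise a freshly generated random UUID string.
  def withMetaId(row: Row, idField: String): Row =
    row.get(idField) match {
      case Some(value) => row + ("META_ID" -> value)
      case None        => row + ("META_ID" -> UUID.randomUUID().toString)
    }

  def main(args: Array[String]): Unit = {
    val row: Row = Map("id" -> 42, "name" -> "sensor-a")
    // Column exists: META_ID copies the existing id
    println(withMetaId(row, "id")("META_ID"))
    // Column missing: META_ID is a random UUID
    println(withMetaId(row, "missing")("META_ID"))
  }
}
```

In both branches the original columns are left untouched; only the extra META_ID entry is added, which mirrors what `withColumn` does in the Spark version.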

You can run this code directly to transfer data from Cassandra to Couchbase – all you need to do is some configuration.

Configurations

All configuration is done through environment variables.

Couchbase configuration:

Configuration Name	Default Value	Description
COUCHBASE_URL	“localhost”	The hostname of the Couchbase server.
COUCHBASE_BUCKETNAME	“foobar”	The name of the bucket to which the data is transferred.

Cassandra configuration:

Configuration Name	Default Value	Description
CASSANDRA_URL	“localhost”	The hostname of the Cassandra server.
CASSANDRA_PORT	9042	The port of the Cassandra server.
CASSANDRA_KEYSPACENAME	“foobar”	The name of the Cassandra keyspace.
CASSANDRA_TABLENAME	“testcouchbase”	The name of the table to be transferred.
CASSANDRA_ID_FEILD_NAME	“id”	The name of the column to use as the Couchbase document id. If it does not match any column, each document is given a random id.
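
To make the lookup-with-default behaviour concrete, here is a minimal sketch in plain Scala. It is an illustration only: the plugin itself reads its settings through `ConfigFactory`, and the names `TransferSettingsSketch`, `Settings`, and `fromEnv` are hypothetical. The variable names and defaults are taken from the tables above.

```scala
object TransferSettingsSketch {
  // Hypothetical container for the seven settings listed in the tables.
  final case class Settings(
      couchbaseHost: String,
      bucketName: String,
      cassandraHost: String,
      cassandraPort: Int,
      keyspaceName: String,
      tableName: String,
      idField: String
  )

  // Resolves each setting from the given environment, falling back to the
  // documented default when the variable is not set.
  def fromEnv(env: Map[String, String]): Settings =
    Settings(
      couchbaseHost = env.getOrElse("COUCHBASE_URL", "localhost"),
      bucketName = env.getOrElse("COUCHBASE_BUCKETNAME", "foobar"),
      cassandraHost = env.getOrElse("CASSANDRA_URL", "localhost"),
      cassandraPort = env.get("CASSANDRA_PORT").map(_.toInt).getOrElse(9042),
      keyspaceName = env.getOrElse("CASSANDRA_KEYSPACENAME", "foobar"),
      tableName = env.getOrElse("CASSANDRA_TABLENAME", "testcouchbase"),
      idField = env.getOrElse("CASSANDRA_ID_FEILD_NAME", "id")
    )

  def main(args: Array[String]): Unit =
    // sys.env is the process environment as an immutable Map
    println(fromEnv(sys.env))
}
```

Passing the environment in as a map, rather than reading `sys.env` inside the function, keeps the resolution logic easy to check with different inputs.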

Code in action

Cassandra side:

This is how the data looks on the Cassandra side.

Couchbase side:

Case 1: The id field exists and is used as the Couchbase document id.

Case 2: The id field does not exist, so random ids are assigned to the documents.

How to run the Cassandra-Couchbase transfer plugin

Steps to run the code:

Download the code from the repository.
Configure the environment variables according to the configuration.
Run the project using sbt run.

Want to write about Couchbase and take part in our community writing program? Why not find out more!

Laura Czajkowski, Developer Community Manager, Couchbase

2 Comments

Ranjit Nagi July 13, 2018 at 9:39 am

Hi There,

Can you please let me know where and how to specify the options for Cassandra and Couchbase authentication?

Both Cassandra and Couchbase have authentication enabled.

Regards

Nic Raboy, Developer Advocate, Couchbase July 13, 2018 at 10:36 am

Hi,

I’m personally not familiar with Cassandra, but if you want to authenticate with Couchbase you can add something like:

.set("spark.couchbase.username", "")
.set("spark.couchbase.password", "")

This is defined in the documentation found here:

https://developer.couchbase.com/documentation/server/5.1/connectors/spark-2.2/getting-started.html

Best,
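
For the Cassandra half of the question, the Spark Cassandra Connector reads credentials from the `spark.cassandra.auth.username` and `spark.cassandra.auth.password` properties. The following is a minimal sketch that collects both sets of credentials in one place; the object name `AuthSettingsSketch`, the function `authSettings`, and the four environment variable names are hypothetical and not part of the plugin. The Couchbase keys are the ones given in the reply above.

```scala
object AuthSettingsSketch {
  // Builds the Spark properties that carry the credentials for both stores.
  def authSettings(
      cassandraUser: String,
      cassandraPassword: String,
      couchbaseUser: String,
      couchbasePassword: String
  ): Map[String, String] =
    Map(
      // Spark Cassandra Connector authentication properties
      "spark.cassandra.auth.username" -> cassandraUser,
      "spark.cassandra.auth.password" -> cassandraPassword,
      // Couchbase connector properties, as given in the reply above
      "spark.couchbase.username" -> couchbaseUser,
      "spark.couchbase.password" -> couchbasePassword
    )

  def main(args: Array[String]): Unit = {
    // Hypothetical environment variable names; empty string when unset
    val settings = authSettings(
      sys.env.getOrElse("CASSANDRA_USERNAME", ""),
      sys.env.getOrElse("CASSANDRA_PASSWORD", ""),
      sys.env.getOrElse("COUCHBASE_USERNAME", ""),
      sys.env.getOrElse("COUCHBASE_PASSWORD", "")
    )
    // Print only the keys so that no secret ends up in the logs
    settings.keys.toSeq.sorted.foreach(println)
  }
}
```

The resulting map can be applied to the configuration built earlier in the post with `conf.setAll(settings)` before the Spark session is created.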
