I’m glad to announce the first developer preview of the next major iteration of our Kafka integration. This version is based on a new library for DCP and supports the Kafka Connect framework. In this post I will show how it can be used to relay data from Couchbase to HDFS.

Here I'll show the steps for CentOS/Fedora Linux distributions; the steps on other OSs will be similar. First, install the Confluent Platform (http://docs.confluent.io/3.0.0/installation.html#rpm-packages-via-yum) and download the Couchbase connector zip archive from http://packages.couchbase.com/clients/kafka/3.0.0-DP1/kafka-connect-couchbase-3.0.0-DP1.zip.
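As a rough sketch, the download can be done with curl; the yum package name below is an assumption for the Scala 2.11 build, so follow the linked Confluent guide for the authoritative repository setup and package name:

```
# Install the Confluent Platform from their yum repository
# (repository setup and exact package name per the linked guide)
sudo yum install confluent-platform-2.11

# Download the Couchbase connector developer preview archive
curl -O http://packages.couchbase.com/clients/kafka/3.0.0-DP1/kafka-connect-couchbase-3.0.0-DP1.zip
```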

To register the connector, just extract the contents onto the default class path; on CentOS (Fedora) that is /usr/share/java.
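For example, assuming the archive is in the current directory:

```
# Extract the connector and its dependencies onto the default class path
sudo unzip kafka-connect-couchbase-3.0.0-DP1.zip -d /usr/share/java/
```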

Now run the Confluent Control Center and all dependent services. You can read more about what these commands do in Confluent's quickstart guide.
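The commands below are a sketch of that sequence using the stock property files shipped with the Confluent packages; the exact file locations may differ on your system, so treat the quickstart guide as authoritative:

```
# Start ZooKeeper and Kafka
sudo zookeeper-server-start /etc/kafka/zookeeper.properties &
sudo kafka-server-start /etc/kafka/server.properties &

# Start the Schema Registry
sudo schema-registry-start /etc/schema-registry/schema-registry.properties &

# Start Kafka Connect in distributed mode
sudo connect-distributed /etc/schema-registry/connect-avro-distributed.properties &

# Start Confluent Control Center
sudo control-center-start /etc/confluent-control-center/control-center.properties &
```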

At this point everything is ready to set up the link that transfers documents from Couchbase to HDFS using Kafka Connect. We assume you are running Couchbase Server at http://127.0.0.1:8091/ and Confluent Control Center at http://127.0.0.1:9021/. For this example, make sure you have the travel-sample bucket loaded in Couchbase. If you didn't load it when setting up the cluster, you can add it through the Settings section of the Web UI.
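If you prefer the command line, the sample bucket can also be loaded through the Couchbase REST API; the Administrator/password credentials below are placeholders for your own:

```
# Load the travel-sample bucket via the REST API
curl -u Administrator:password -X POST \
     http://127.0.0.1:8091/sampleBuckets/install \
     -d '["travel-sample"]'
```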

Once you have all of these prerequisites out of the way, navigate to the “Kafka Connect” section of your Confluent Control Center. Select “New source”, choose “CouchbaseSourceConnector” as the connector class, and fill in the settings so that the final JSON looks similar to the sketch below.
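The property and class names here follow the connector's sources, but since this is a developer preview they may change, so cross-check against the README in the source repository; the connector name, topic, and task count are arbitrary examples:

```
{
  "name": "couchbase-travel-source",
  "connector.class": "com.couchbase.connect.kafka.CouchbaseSourceConnector",
  "tasks.max": "2",
  "connection.cluster_address": "127.0.0.1",
  "connection.bucket": "travel-sample",
  "topic.name": "travel-topic"
}
```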

Once you save the Source connection, the Connect daemon will start receiving mutations and storing them in the specified Kafka topic. To demonstrate a full pipeline, let's set up a Sink connection to get the data out of Kafka. Go to the “Sinks” tab and click the “New sink” button. It will ask for the topics where the interesting data is stored; enter “travel-topic”. Then select “HdfsSinkConnector” and fill in the settings so that the JSON config looks roughly like the example below (assuming the HDFS name node is listening on hdfs://127.0.0.1:8020/).
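Again as a sketch, using the standard HDFS sink connector settings; the connector name is arbitrary and flush.size is set to a small value here only so that files show up quickly:

```
{
  "name": "hdfs-travel-sink",
  "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
  "tasks.max": "1",
  "topics": "travel-topic",
  "hdfs.url": "hdfs://127.0.0.1:8020",
  "flush.size": "100"
}
```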

Once the Sink connection is configured, data will start appearing on HDFS under /topics/travel-topic/ (the default topics directory). Let's inspect one of the files.
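For example, assuming the avro-tools utility is available on the path; the partition subdirectory and Avro file names are generated by the connector, so yours will differ:

```
# Recursively list the files written by the HDFS sink connector
hdfs dfs -ls -R /topics/travel-topic

# Copy one of the generated Avro files locally and dump its records as JSON
hdfs dfs -copyToLocal /topics/travel-topic/partition=0/travel-topic+0+0000000000+0000000099.avro .
avro-tools tojson travel-topic+0+0000000000+0000000099.avro | head
```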

That’s my quick runthrough example! The DCP client is still under active development, with additional features being added to handle various topology-change and failure scenarios. The next couple of updates to our Kafka connector will pick up those improvements. I should also briefly note that Couchbase's DCP client interface should be considered volatile for the moment. We use it in various projects, but you should only use it directly at your own risk.

The source code for the connector is at https://github.com/couchbaselabs/kafka-connect-couchbase. The issue tracker is at https://issues.couchbase.com/projects/KAFKAC, and feel free to ask any questions on https://www.couchbase.com/forums/.


Author

Posted by Sergey Avseyev, SDK Engineer, Couchbase

Sergey Avseyev is an SDK Engineer at Couchbase. He is responsible for the development of the Kafka connector and the underlying library that implements DCP, the Couchbase replication protocol. He also maintains the PHP SDK for Couchbase.
