Couchbase offers an unique NoSQL database combining integrated cache and storage technology. NoSQL is gaining strong popularity and a number of customers are looking to export or map some of the data they have in RDBMS to Couchbase NoSQL; this is where Talend’s offering can be leveraged.

Talend offers a number of capabilities around integration from business process management to enterprise service bus to master data management. Talend’s new Open Studio for Big Data version 5.3 release contains a Couchbase connector, which enables a large number of scenarios where Couchbase Server can be used alongside with traditional operational and analytical data sources.

To provide you with a basic idea on how to configure and use Talend with Couchbase Server, I will walk you through a simple scenario to create a ETL workflow using Talend Open Studio software to export data from MySQL database server into a Couchbase database server using JSON document format. The workflow will read from a MySQL database table and create corresponding JSON documents in Couchbase Server. You can build upon to example and augment it to support more complex scenarios involving data transformations, complex mappings, etc. The following steps have been tested on CentOS but can be adapted for other systems.

So let’s get started …

Prerequisites

First, let’s look at the pre-requisites you need to get installed:

Start

Now, let’s make sure that everything is started

  • Start MySQL, if not started already:
    • sudo service mysqld start
  • Start Couchbase, if not started already:
    • sudo /etc/init.d/couchbase-server start
  • Launch Talend Open Studio for Big Data:
    • Execute following command or similar (based on installation directory and version): /opt/TOS_BD-r101800-V5.3.0/TOS_BD-linux-gtk-x86_64

Create

Now, let’s use Talend Open Studio to create the ETL workflow

  • After launching Talend, click on Create button and enter a project name such as “MySQLToCouchbaseIntegration” and click on Finish.
  • Select the created project and click on Open button

talend-open-studio-for-big-data

 

  • Create account in Talend or click on Skip button, this will open the IDE
  • Wait for IDE initialization to complete and close or minimize Welcome window
  • In Repository panel, left click on Job Designs and then right click on Create job
  • Enter job name, such as “MySQLToCouchbaseIntegrationJob” and click on Finish.

 talend-mysql-to-couchbase-integration

 

  • This will create a panel where job artifacts can be dragged and dropped and should display the palette to the right.
  • From the palette, Drag and Drop tMyslInput widget (can be found under Databases/MySQL) to the left

 talend-job-myql-to-couchbase

 

  • From the palette, Drag and Drop tCouchbaseOutput widget (can be found under Databases/Couchbase) to the right

  • Both components require jar files which will are not shipped with Talend software. Talend has a wiki explaining how to add jar files. At times, jar files can be directly downloaded from IDE or might need be downloaded manually and loaded into Talend by clicking on jar symbol (the one with “Add External JARs tooltip”) on Modules panel.

 talend-modules

  • Jars must be loaded / resolved before processing to next steps. See detailed info
  • Double click on tMysqlInput on panel; this will display Component panel at bottom
  • Enter MySQL database connection info in panel. Note that all fields must be surrounded by double quotes (ie “)
  • Click on … button which is located to the right of Table Name, expand world and select City entry and Click on OK button
  • Remove default Query and replace by “SELECT * FROM City”
  • Below is an example screenshot of what it should look like:

talend-basic-settings

  • Click on Guess schema button, this will bring up the Schema window, move and resize window as appropriate
  • Edit schema with the following changes:
    • Check Nullable for ID
    • Check Nullable for Population
    • Change ID type to String
    • Input DB types with following and click on OK button
      • INT for ID
      • VARCHAR for Name
      • VARCHAR for CountryCode
      • VARCHAR for District
      • INT for Population
  • Below is an example screenshot of what it should look like:

 talend-schema

  • Left click on tMysqlInput and then right click to select Row / Main and drag arrow from tMysqlInput component to tCouchbaseOutput component. An arrow should be created between the two.

 tmysql-input_1

  • Left click on tCouchbaseOutput on panel; this will display Component panel at bottom
  • Enter Couchbase database connection info in panel. Note that all fields must be surrounded by double quotes (ie “)
  • In this example, the bucket name is default and it does not have authentication set up so username and password are left with “”
  • URI should be input by clicking on + button and changing field to pool URI such as http://localhost:8091/pools
  • Key should be left to ID (default) and Action on data to Insert
  • Below is an example screenshot of what it should look like:

 talend-component

  • Click on Configure JSON Tree … button; this will bring up mapper allowing mapping from table schema to JSON.
  • Select All schema attributes (using multi selection) on the left schema list and drop onto rootTag on the Link target window to the right and select Create as sub-element of target node, then click on OK button.
  • Output should look like this:

talend-tcouchbaseoutput_1

  • Select ID attribute in the Link target window on the right and right click on Set As Loop Element, then click on OK button

talend-tcouchbaseoutput_1-loop-element

 

  • If everything is fine, both icons on the Designer panel should show no sign of errors, in which case we can now run the job
  • Click on Run panel, this will build job and execute it
  • Access your Couchbase server bucket and verify that the documents have been loaded: http://localhost:8091/index.html#sec=buckets
  • This should have added 4079 documents to the default bucket:

 talend-data-buckets

  • You can drill down and look at individual documents:

 talend-default-documents

  • Voila!

As indicated, much more complex scenarios can be built and we will follow up soon with more elaborate examples.

Please send comments and feedback at alexis@couchbase.com

Author

Posted by Alexis Roos

Alexis Roos is senior engineering manager at Salesforce. Alexis has over 20 years of software engineering experience with the last five years focused on large-scale data science and engineering, working for SIs in Europe, Sun Microsystems/Oracle, and several startups, including Radius Intelligence, Concurrent, and Couchbase.

10 Comments

  1. great post ! thanks ! I was wondering how to feed couchbase with some external data. that\’s a first part of a good answer. Maybe usefull for a one shot migration. But what about a continuous integration process ?

    1. Hi,

      It will really depend of the type of the data source and the features available in the source itself (or the tool at the top of it).
      For example in the Oracle \”world\” you can use stuff like DCN (Database Change Notification), Golden Gate to listen to the database change and push data back into Couchbase. (I need some time to write some code around this)

      I do not know if ETL tools can use this but from your code you can.

      Regards
      Tug

      1. Wilbert Klinton July 22, 2015 at 1:30 pm

        It’s really a great pleasure to provide an opinion about ETL
        tools. These are very important and useful to all the people from all over the
        world.

        ETL

    2. For continuous export, you could keep track of latest inserted ID and filter the import based on that.
      Indeed Talend offers a component for MySQL to read latest inserted ID: tMysqlLastInsertId
      You can find more info in Talend Open Studio Components Reference Guide.

      Regards,

      Alexis

    3. Talend provides a change data capture (CDC) component with its enterprise subscription. For most databases it is trigger rather than log driven. The CDC component itself can be scheduled to run on whatever polling interval is desired. It would then post the delta-records to the Couchbase data source. This works for many use cases, but it is IMO more \”business-time\” rather than real-time. While not a nightly batch run, it should not be considered a \”real-time\” solution for low latency designs.

      Ed Ost

  2. Cool Tool – Can\’t wait to use it for my next Mysql to couchdbase Migration.

  3. Thank you Alexis for the instructions.
    My problem is, that Couchbase ignores the schema definition of \’INT for Population\’.
    The document looks like this, after running the ETL-Job:
    {
    \”ID\”: \”city:1000\”,
    \”Type\”: \”city\”,
    \”Name\”: \”Taman\”,
    \”CountryCode\”: \”IDN\”,
    \”District\”: \”East Java\”,
    \”Population\”: \”107000\”
    }

    I guess it should look like this:
    Population without quotation marks. Reason: to use the _sum reduce function.

    {
    \”ID\”: \”city:1000\”,
    \”Type\”: \”city\”,
    \”Name\”: \”Taman\”,
    \”CountryCode\”: \”IDN\”,
    \”District\”: \”East Java\”,
    \”Population\”: 107000
    }

    1. Wilbert Klinton August 10, 2015 at 7:38 pm

      It’s really a great pleasure to provide an opinion about ETL
      tools. These are very important and useful to all the people from all over the
      world. ETL tools are useful to everyone which help to transform any data into
      any database fast and easy and comfortably.
      ETL

  4. Q1: How to set static value when add linker to target node?
    Q2: How to mapping Integer value to Couchbase Integer value(Not String)?

Leave a reply