Bulk Loading Documents into Couchbase

This blog post is an example of how one might load data as JSON into Couchbase. For the purpose of this post, data extracted from RDBMS as a CSV will be converted to JSON. RDBMS DBAs and Admins familiar with Oracle, SQL server, MySQL, etc are probably looking for a way to experiment and test with NoSQL. Often the first step in using NoSQL is to convert whatever you have into JSON.

Couchbase supports JSON and Binary data, but for the purpose of this post we are looking at the most rich data type, JSON.This is important to understand because by loading documents into Couchbase in a format other than JSON, data will be loaded as binary and may impact view flexibility when trying to build views or indexes. With that, let’s get on with loading some documents. There are two ways to achieve this but for the purpose of this post I’m assuming that any document you want loaded is already in JSON format and either compressed or uncompressed. The next section will describe one possible path to ensure you are loading JSON data.

Some Couchbase ETL partners, such as Talend, which offers a connector for Couchbase if you want a GUI, maybe don’t want to deal with CSV files or want to reorder your data prior to commiting your data to CSV or just have a need to ETL data from several sources before storing within Couchbase Server. Talend can map and store documents directly as JSON files prior to loading into Couchbase, if desired .

This guide assumes you have some familiarity with Linux or Mac, package managers, and Ruby.

For additional SDK setup information please visit: www.couchbase.com/developers/

The steps used to prep and load the data are as follows:

Prepare the data: Look at a couple of example tools to convert the CSV to JSON.
Load the data: Examine a few methods to load the data into couchbase via Ruby scripts.

Prerequisites for Linux and Mac: Requires a functional build environment!

Libcouchbase: Since this script uses Ruby you must have libcouchbase installed prior to installing the couchbase gem
- Installing C Client Libraries
rubygem: be sure that you have rubygems available to install the ruby couchbase wrapper
- Gems that I used are
  - ‘ruby-progressbar’
  - ‘couchbase’
  - ‘yaji’
  - ‘optparse’
Yajl parser: this must also be installed as a prerequisite to YAJI.
- YAJI Parser Page with libyajl instructions
Install Couchbase Gem: gem install couchbase

If the setup has been successful the ruby scripts I’ve provided should run. You may want to pass -h to the streamloader and ensure you get the syntax message. Lastly, don’t forget to install the GEMs listed as well, Yaji, optparse, couchbase, and ruby-progressbar. Links are provided at the bottom of this post.

Prepare the Data

Data Prep Method 1: Simple, speedy and consistent: csvtojson NodeJS script

Can be found via Google for and there are others too.
Installs via NPM ala npm install -g csvtojson

Here is an example conversion for reference:

Austins-MBP:complaintstemp austin$ npm info csvtojson
{ name: 'csvtojson',
  description: 'A tool concentrating on converting csv data to JSON with customised parser supporting',
  'dist-tags': { latest: '0.3.21' },
Austins-MBP:complaintstemp austin$ which csvtojson
/usr/local/bin/csvtojson
Austins-MBP:complaintstemp austin$ csvtojson data.gov_Consumer_Complaints.csv

Austins-MBP:complaintstemp austin$ npm info csvtojson

{ name: 'csvtojson',

description: 'A tool concentrating on converting csv data to JSON with customised parser supporting',

'dist-tags': { latest: '0.3.21' },

Austins-MBP:complaintstemp austin$ which csvtojson

/usr/local/bin/csvtojson

Austins-MBP:complaintstemp austin$ csvtojson data.gov_Consumer_Complaints.csv

Data Prep Method 2: Write a Ruby Script: csv2json.rb

The time to complete the process will vary because Ruby is single threaded.

An additional note for this script is that I am using the YAJL parser instead of the default JSON module which doesn’t handle streaming data into Couchbase.

The script below shows the only change required. This will improve memory use during conversion. If you haven’t installed YAJL before you can simply do this: ‘gem install yajl-ruby’

#!/usr/bin/env ruby
require 'rubygems'
require 'csv'
require 'pathname'
#require 'JSON'
require 'yajl/json_gem'

Dir["./**/*.csv"].each do |csv_file_path|

  puts csv_file_path
  file_name = Pathname.new(csv_file_path).basename(".csv").to_s

  File.open("#{file_name}.json",'w') do |json_file|

  jsonData = CSV.read(csv_file_path,
     :headers =&gt; true, :header_converters =&gt; :symbol).map{|csv_row| csv_row.to_hash}

    json_file.write(JSON.pretty_generate(jsonData));
  end
end

#!/usr/bin/env ruby

require 'rubygems'

require 'csv'

require 'pathname'

#require 'JSON'

require 'yajl/json_gem'

Dir["./**/*.csv"].each do |csv_file_path|

puts csv_file_path

file_name = Pathname.new(csv_file_path).basename(".csv").to_s

File.open("#{file_name}.json",'w') do |json_file|

jsonData = CSV.read(csv_file_path,

:headers => true, :header_converters => :symbol).map{|csv_row| csv_row.to_hash}

json_file.write(JSON.pretty_generate(jsonData));

end

Post Conversion Steps

Once conversion is completed it is time to compress the file with ZIP:

 Austins-MBP:complaintstemp austin$ ls
Consumer_Complaints.csv        Consumer_Complaints.json    csv2json.rb
 
Austins-MBP:complaintstemp austin$ zip Consumer_Complaints.zip Consumer_Complaints.json
  adding: Consumer_Complaints.json (deflated 95%)

Austins-MBP:complaintstemp austin$ ls

Consumer_Complaints.csv Consumer_Complaints.json csv2json.rb

Austins-MBP:complaintstemp austin$ zip Consumer_Complaints.zip Consumer_Complaints.json

adding: Consumer_Complaints.json (deflated 95%)

Place the zip file(s) into a directory. I used ~/Downloads/json_files/ in my home directory.

Austins-MBP:complaintstemp austin$ cp Consumer_Complaints.zip ~/Downloads/json_files/

1	Austins-MBP:complaintstemp austin$ cp Consumer_Complaints.zip ~/Downloads/json_files/

Once data has been prepared you are ready to start loading. The following examples will touch on a couple of common ways to get your data into Couchbase in bulk.

The Couchbase install comes with a built-in tool called cbcdocloader. It takes individual document files, up to 20MB in size, either zipped up or within a directory and loads them. At the time of this writing cbcdocloader requires multiple JSON formated files contained in a directory. Second, I will discuss a tool I wrote in Ruby which employs Couchbase’s own Sergey Asavayev’s YAJI Ruby Gem. The code referenced is free to use and can be rewritten in any language you are comfortable using.

Doc Loading Method 1: Using cbcdocloader

Using a set of individual files within a directory, this is a common use case but depends on the structure of the files and directories to be imported to reflect desire documents as they will appear once loaded. To assist in preserving that structure we recommend packaging the files and directories to be loaded within a .zip file.

The document ID key names will be based off the document files provided.

Note: this method is not ideal for large consolidated document files. For large monolithic files I will exemplify how they are loaded in Method 2, below.

Then load the file or files by the following command:

cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 
~/json_files/beer-sample.zip

cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000

~/json_files/beer-sample.zip

Note:
The ‘-s 1000’ is the memory size for the bucket. You’ll need to adjust this value for your bucket.
Also the bucket does not need to exist as cbcdocloader will create it but be aware of your resource utilization prior to setting the ‘-s’ flag to make sure you have available RAM.

If everything was successful you’ll see output stating if documents were loaded, how many bytes, etc.


bucket: 2014-10-02-ny-json.zip, msgs transferred...
       :                total |       last |    per sec
 byte  :            242446488 |  242446488 |  3115728.2

bucket: 2014-10-02-ny-json.zip, msgs transferred...

: total | last | per sec

byte : 242446488 | 242446488 | 3115728.2

Here is a brief script to load up a lot of .zip files in a given directory:

#!/bin/bash
JSON_Dir=~/json_files/
for ZipFile in $JSON_Dir/*.zip ; 
do /Applications/Couchbase Server.app/Contents/Resources/couchbase-core/bin/cbdocloader  
    -u Administrator -p s3kre7Pa55 -b MyBucketToLoad                                  
    -n 127.0.0.1:8091 -s 1000 $ZipFile
 done

#!/bin/bash

JSON_Dir=~/json_files/

for ZipFile in $JSON_Dir/*.zip ;

do /Applications/Couchbase Server.app/Contents/Resources/couchbase-core/bin/cbdocloader

-u Administrator -p s3kre7Pa55 -b MyBucketToLoad

-n 127.0.0.1:8091 -s 1000 $ZipFile

done

Doc Loading Method 2: streamloadjson

The other method is to load all documents, comma separated, from a single monolithic file.

In order to accomplish this method I have prepared a small but effective script that uses the YAJI JSON stream parser and I called it streamjsonload.

The options for this program are:

~$ ruby streamloadjson.rb -h
ruby json loader
    -f, --jsonfile jsonfile          JSON formatted filename
    -b, --bucketname BUCKETNAME      bucket name
    -n, --hostname hostname          hostname/ip address
    -r, --root JSONroot              JSON root to parse
    -d, --docid SearchKey            JSON Key ID
    -h, --help                       This menu

~$ ruby streamloadjson.rb -h

ruby json loader

-f, --jsonfile jsonfile JSON formatted filename

-b, --bucketname BUCKETNAME bucket name

-n, --hostname hostname hostname/ip address

-r, --root JSONroot JSON root to parse

-d, --docid SearchKey JSON Key ID

-h, --help This menu

To load documents with a test JSON file such as fathers.json.txt from below, it can be called like so:

ruby streamloadjson.rb -b TestBucket -f fathers.json -n localhost -d id -r /fathers/

1	ruby streamloadjson.rb -b TestBucket -f fathers.json -n localhost -d id -r /fathers/

The script should provide output like below:

Reading from fathers.json
JSON path processed is /fathers/
Couchbase  node is localhost
29999 Time: 00:00:07 |&gt;&gt;---=---=---=---=---| Doc Loading

Reading from fathers.json

JSON path processed is /fathers/

Couchbase node is localhost

29999 Time: 00:00:07 |>>---=---=---=---=---| Doc Loading

One major advantage to using the YAJI parser is that it requires very low memory consumption. This means you could potentially paginate the input data and break it up into multiple streams to load into couchbase. It will spawn discreet processes since Ruby is single threaded but another language could also be used for multi-threading. An example of these are on Couchbase Labs Github repository.

A couple of things to note, This tool only loads monolithic document files, will try to create an ID automatically if one isn’t provided with ‘-d’ and will require some fine-tuning of the “root” with -r if no documents load.

Newer code for the loader is available on my github repository, but I have also provided it in-linebelow:

#!/usr/bin/env ruby

#An example JSON file for testing is at:
# https://github.com/ysharplanguage/FastJsonParser/raw/master/JsonTest/TestData/fathers.json.txt

require 'ruby-progressbar'
require 'rubygems'
require 'couchbase'
require 'yaji'
require 'optparse'

options = {}

optparse = OptionParser.new do |opts|
opts.banner = 'ruby json loader'

opts.on('-f', '--jsonfile jsonfile', 'JSON formatted filename') do |jsonfile|
options[:jsonfile] = jsonfile
end

opts.on('-b', '--bucketname BUCKETNAME', 'bucket name') do |bucketname|
options[:bucketname] = bucketname
end

opts.on('-n', '--hostname hostname', 'hostname/ip address') do |hostname|
options[:hostname] = hostname
 end

opts.on('-r', '--root JSONroot', 'JSON root to parse') do |root|
options[:root] = root
 end

opts.on('-d', '--docid SearchKey', 'JSON Key ID') do |docid|
options[:docid] = docid
 end

opts.on('-h', '--help', 'This menu') { puts opts; exit}
end

optparse.parse!

json_file = options[:jsonfile]
bucket = options[:bucketname]
host = options[:hostname]
root = options[:root]
docid = options[:docid]

#set some defaults if not passed
unless host
      host=localhost
end

unless root
    root="/"
end

puts "Reading from #{json_file}"
puts "JSON path processed is #{root}"
puts "Couchbase  node is #{host}"

# Connect to couchbase host and bucket provided on the command line
client = Couchbase.connect(:bucket =&gt; bucket, :host =&gt; host)

#Create the progressbar
progressbar=ProgressBar.create(:title =&gt; "Doc Loading",
                               :starting_at =&gt; 0,
                               :total =&gt; nil,
                               :throttle_rate =&gt; 0.01,
                               :format =&gt; '%a |%b&gt;&gt;%i| %p%% %t',
                               :length =&gt; 50)

parser = YAJI::Parser.new(File.open(json_file))

  parser.each(root.to_s).with_index do |doc,i|
  unless docid
     hashinfo=doc.hash.abs
     prikey="Doc_#{i}:#{hashinfo}"
  end
prikey="#{docid}:#{doc["#{docid}"]}"

  client.set("#{prikey}", doc)
  print("rProcessed #{i} Documents")
  progressbar.increment
  end
progressbar.finished?
puts

#!/usr/bin/env ruby

#An example JSON file for testing is at:

# https://github.com/ysharplanguage/FastJsonParser/raw/master/JsonTest/TestData/fathers.json.txt

require 'ruby-progressbar'

require 'rubygems'

require 'couchbase'

require 'yaji'

require 'optparse'

options = {}

optparse = OptionParser.new do |opts|

opts.banner = 'ruby json loader'

opts.on('-f', '--jsonfile jsonfile', 'JSON formatted filename') do |jsonfile|

options[:jsonfile] = jsonfile

end

opts.on('-b', '--bucketname BUCKETNAME', 'bucket name') do |bucketname|

options[:bucketname] = bucketname

end

opts.on('-n', '--hostname hostname', 'hostname/ip address') do |hostname|

options[:hostname] = hostname

end

opts.on('-r', '--root JSONroot', 'JSON root to parse') do |root|

options[:root] = root

end

opts.on('-d', '--docid SearchKey', 'JSON Key ID') do |docid|

options[:docid] = docid

end

opts.on('-h', '--help', 'This menu') { puts opts; exit}

end

optparse.parse!

json_file = options[:jsonfile]

bucket = options[:bucketname]

host = options[:hostname]

root = options[:root]

docid = options[:docid]

#set some defaults if not passed

unless host

host=localhost

end

unless root

root="/"

end

puts "Reading from #{json_file}"

puts "JSON path processed is #{root}"

puts "Couchbase node is #{host}"

# Connect to couchbase host and bucket provided on the command line

client = Couchbase.connect(:bucket => bucket, :host => host)

#Create the progressbar

progressbar=ProgressBar.create(:title => "Doc Loading",

:starting_at => 0,

:total => nil,

:throttle_rate => 0.01,

:format => '%a |%b>>%i| %p%% %t',

:length => 50)

parser = YAJI::Parser.new(File.open(json_file))

parser.each(root.to_s).with_index do |doc,i|

unless docid

hashinfo=doc.hash.abs

prikey="Doc_#{i}:#{hashinfo}"

end

prikey="#{docid}:#{doc["#{docid}"]}"

client.set("#{prikey}", doc)

print("rProcessed #{i} Documents")

progressbar.increment

end

progressbar.finished?

puts

Finishing the Job

Once data has been loaded login to the Couchbase console and begin working with development views for queries and indexing.

If you are using Couchbase Server 4.0 with N1QL you will want to create a primary index so you can explore the Couchbase SQL-like interface immediately and start taking advantage of the power of N1QL query through our SDKs!

Many thanks to the great folks in the Open Source community for providing the YAJL gem and to Sergey Avseyev for the YAJI parser. Sergey is a very knowledgeable Couchbase resource responsible for Ruby SDK work and I would also like to encourage any of you to try our JRuby SDK and provide feedback.

Links:

CB Examples Github – https://github.com/agonyou/cb-examples/
YAJI Stream Parser – https://github.com/avsej/yaji
YAJL JSON Gem – https://github.com/brianmario/yajl-ruby
csv2json Gem – https://rubygems.org/gems/csv2json/
Couchbase Server 4 with N1QL – http://www.couchbase.com/nosql-databases/downloads

Austin Gonyou, Solutions Engineer, Couchbase

3 Comments

Tito Ciuro March 4, 2016 at 4:57 am

If I execute \’./cbdocloader -u tito -p foobar -b test -n 192.168.1.4:8091 -s 1000 /Users/tito/Desktop/sample.zip \’ and the sample file contains a JSON array with just two documents, the entire file content is imported as one document. In other words, cbdocloader does not seem to realize that the document is an array of JSON objects. Also, the editor shows \’Warning: JSON should not be an array\’. How is cbdocloader supposed to work? Thanks.

Log in to Reply
1. Tito Ciuro March 4, 2016 at 5:09 am
  
  By the way, if I use the sample JSON found in \’http://www.rubydoc.info/gems/c…, I get the very same issue. In both cases, the file contents are valid JSON. It\’s just seems that cbdocloader is not cooperating. :-/
  
  Log in to Reply
Mohamed Boudra August 6, 2017 at 1:31 pm

Well put together, you could also use https://sqlify.io/convert/csv/to/json to convert to JSON and then load the documents as normal instead of rolling your own script.

Log in to Reply

Products

See How Capella Stacks Up

See How Capella Stacks Up

By Industry

By Need

Why NoSQL

What is NoSQL and why choose it?

Popular Docs

By Developer Role

Developer Playground

Start a Capella session

Resource Center

Education

Certification Exams 2023

Get Couchbase certified

About

Partnerships

Our Services

Partners: Register a Deal

Ready to register a deal with Couchbase?

Marriott

Converting CSV data to JSON and Bulk Loading Documents into Couchbase