Using YCSB to Benchmark JSON Databases

Bruce Lindsay once said, “There are three things important in the database world: Performance, Performance, and Performance.” As database features and architectures evolve, enterprise architects need to measure performance in an open way so they can compare total cost of ownership reliably.

YCSB did a great job of benchmarking data stores serving “Cloud OLTP” applications. These data stores were simple, offering get, put, and delete operations. The original YCSB benchmark consists of simple insert, update, delete, and scan operations on a flat document of 10 key-value pairs; each workload is defined as a mix of these operations in stated percentages.

JSON databases like Couchbase and MongoDB have a more advanced data model with scalars, nested objects, arrays, arrays of arrays, and arrays of objects. JSON databases also have more sophisticated query languages, indexes, and capabilities. In addition to CRUD operations, applications routinely use the declarative query languages in these databases to search, paginate, and run reports. So, to help architects evaluate platforms effectively, we need an additional benchmark that measures these capabilities alongside the basic CRUD operations. This tutorial explains how YCSB-JSON, an extension of YCSB, fills that gap.

The YCSB paper states: “We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.”

This benchmark extends YCSB to JSON databases by extending existing operations to JSON and then defining new operations and new workloads.

Here’s the outline.

Introduction
Data Model
Benchmark Operations
Benchmark Workloads
YCSB-JSON implementation
How to run YCSB-JSON?
References

1. Introduction

YCSB was developed to measure the performance of scalable NoSQL key-value data stores, and the YCSB infrastructure does that job well. YCSB uses a simple, flat key-value model. Couchbase uses a JSON model, which customers use to build massively interactive applications. We have built, and continue to build, features into the product that let customers build these applications effectively. We need performance measurements for these use cases.

Other databases also support the JSON model: MongoDB, DocumentDB, DynamoDB, RethinkDB, and Oracle NoSQL. When running YCSB on JSON databases (Couchbase, MongoDB, etc.), the driver simply stores and retrieves strings in a flat JSON key-value structure. All of these databases need a new benchmark that measures processing of the rich structure of JSON (nested objects, arrays) and operations like paging, grouping, and aggregation.

The purpose of YCSB-JSON is to extend the YCSB benchmark to measure JSON database capabilities in two ways:

Add operations representative of massively interactive applications.
- Operations on the JSON data model, including nested objects and arrays.
Create workloads that represent operations from these applications.

See these customer use cases:

Marriott built its reservation system on IBM mainframe and DB2. It ran into cost and performance challenges as more and more customers tried to browse the available inventory. The DB2 systems were originally built to take reservations from a phone-in system or from agents, when the look-to-book ratio was low. Today this ratio is high because the number of lookup requests has grown sharply, and the database cost has grown with it. Marriott moved all of its inventory data to Couchbase with continuous synchronization from its mainframe systems; web applications use Couchbase for the lookup/search operations.
Cars.com is a portal to list and sell cars. They keep the listing data in Oracle. When they serve it on the web, they not only have to present the basic car information but also provide additional insights, like how many users are looking at a car or have saved it in their wish list. This increases engagement and the sense of urgency. All the data required for these interactive operations is stored in Couchbase.

More generally, the massively interactive applications include the following:

Browse rooms availability, pricing details, amenities (lookups by end customers)
Browse information on car make/model or repair shops (enable web-scale consumers & partners)
Provide information to the customer in context (location-based services)
Serve both Master Data and Transactional Data (at scale)

To support these requirements, the applications & databases do the following:

Query offload from high-cost Systems of Record (mainframe, Oracle) databases
- (reservations & revenue apps)
Opening up back-office functions to web / mobile access
- (enable web users to check room details)
Scale database/queries with better TCO
- (scale mainframes with commodity servers)
Modernize legacy systems with capabilities demanded by new collaboration/engagement applications
- (browse inventory, flight, room availability, departmental analysis)

The new benchmark needs to measure the performance of queries implementing these operations.

2. Data Model

We’ve taken customer and orders as two distinct collections of JSON documents. Each order has a reference to its customer.

Below is a sample customer document. It was generated with the fakeit data generator, available at: https://github.com/bentonam/fakeit

See the appendix for the YAML file used to define the data model and domain.



Sample customer document
Document Key: 100_advjson
{
  "_id": "100_advjson",
  "doc_id": 100,
  "gid": "48a8e177-15e5-5116-95d0-41478601bbdd",
  "first_name": "Stella",
  "middle_name": "Jackson",
  "last_name": "Toy",
  "ballance_current": "$1084.94",
  "dob": "2016-05-11",
  "email": "Alysson83@yahoo.com",
  "isActive": true,
  "linear_score": 31,
  "weighted_score": 40,
  "phone_country": "fr",
  "phone_by_country": "01 80 03 25 39",
  "age_group": "child",
  "age_by_group": 12,
  "url_protocol": "http",
  "url_site": "twitter",
  "url_domain": "gov",
  "url": "http://www.twitter.gov/Stella",
  "devices": [
    "EE-245",
    "FF-012",
    "GG-789",
    "HH-246"
  ],
  "linked_devices": [
    [
      "AA-038",
      "BB-577"
    ],
    [
      "OO-565",
      "KK-448",
      "FF-281"
    ],
    [
      "BB-495",
      "AA-374"
    ],
    [
      "BB-609",
      "VV-899",
      "LL-675",
      "BB-291"
    ],
    [
      "CC-048"
    ]
  ],
  "address": {
    "street": "6392 Crona Rue Curve",
    "city": "Simeonland",
    "zip": "98316",
    "country": "Bahrain",
    "prev_address": {
      "street": "9063 Johns Islands Divide",
      "city": "South Jayme",
      "zip": "34950-8194",
      "country": "Bulgaria",
      "property_current_owner": {
        "first_name": "Weston",
        "middle_name": "Clyde",
        "last_name": "Considine",
        "phone": "(665) 343-9468"
      }
    }
  },
  "children": [
    {
      "first_name": "Darrel",
      "gender": null,
      "age": 10
    },
    {
      "first_name": "Shea",
      "gender": null,
      "age": 6
    }
  ],
  "visited_places": [
    {
      "country": "Iran",
      "cities": [
        "Heidenreichshire",
        "West Luciano",
        "Haroldmouth",
        "West Jakeburgh"
      ]
    },
    {
      "country": "Comoros",
      "cities": [
        "New Valliemouth",
        "East Kaleighland"
      ]
    },
    {
      "country": "Israel",
      "cities": [
        "East Kali",
        "Pabloport"
      ]
    },
    {
      "country": "French Guiana",
      "cities": [
        "North Zachary",
        "Kielmouth"
      ]
    }
  ]
}

3. Benchmark Operations:

The first four operations are the same as in standard YCSB, except that they operate on JSON documents. The rest of the operations are new.

Insert: Insert a new JSON document.
Update: Update a JSON document by replacing the value of one scalar field.
Read: Read a JSON document, either one randomly chosen field or all fields.
Delete: Delete a JSON document with a given key.
Scan: Scan JSON documents in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen (LIMIT).
Search: Search JSON documents based on range predicates on 3 fields (customizable to n fields).
Page: Paginate result set of a query with predicate on a field in the document.
- All customers in zip with randomly chosen OFFSET and LIMIT in SQL, N1QL.
NestScan: Query JSON documents based on a predicate on a 1-level nested field.
ArrayScan: Query JSON documents based on a predicate within the single-level array field.
ArrayDeepScan: Query JSON documents based on a predicate within a two-level array field (array of arrays).
Report: Query customer order details for customers in specific zipcode.
- Each customer has multiple orders.
- Order document has order details.
Report2: Generate sales order summary for a given day, group by zip.
Load: Data loading.
Sync: Data streaming and synchronization from another system.
Aggregate: Do some grouping and aggregation.

For Couchbase: Benchmark Operations implementation examples

Couchbase implements YCSB in two modes, selected by the KV setting.

KV stands for key-value. The simple YCSB operations (INSERT, UPDATE, READ, and DELETE) can be implemented via KV APIs instead of queries. KV=true means use the KV API; KV=false means use a N1QL (SQL for JSON) query. See the N1QL tutorial at https://query-tutorial.couchbase.com
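The two modes can be pictured as a small dispatch step in the benchmark driver. The sketch below is only an illustration: `plan_operation`, `N1QL_TEMPLATES`, and the returned "KV ..." strings are hypothetical names, not part of the YCSB-JSON code base, and the statements mirror the examples in this section.

```python
# Hypothetical sketch: choose between a KV call and a N1QL statement.
# The function name and the "KV ..." strings are illustrative only.
N1QL_TEMPLATES = {
    "read": 'SELECT * FROM customer USE KEYS ["{key}"]',
    "delete": 'DELETE FROM customer USE KEYS ["{key}"]',
}


def plan_operation(op: str, key: str, kv: bool) -> str:
    """Return a description of how the driver would run `op` on `key`."""
    if op not in N1QL_TEMPLATES:
        raise ValueError(f"unsupported operation: {op}")
    if kv:
        # KV=true: go straight to the key-value API, no query involved.
        return f"KV {op}({key})"
    # KV=false: issue the equivalent N1QL statement.
    return N1QL_TEMPLATES[op].format(key=key)
```

For example, `plan_operation("read", "100_advjson", kv=False)` yields the `SELECT ... USE KEYS` form, while `kv=True` yields the KV placeholder.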

1. Insert: Insert a new JSON document.

KV=true: KV call to insert
KV=false: INSERT INTO customer (KEY, VALUE) VALUES (...)

2. Update: Update a JSON document by replacing the value of one scalar field.

KV=true: KV call to update a single document.
KV=false: UPDATE customer USE KEYS [documentkey] SET field1 = value

3. Read: Fetch a JSON document with a given key.


KV=true: KV call to fetch a single document.
KV=false: SELECT * FROM customer USE KEYS [documentkey]

4. Delete: Delete a JSON document with a given key.


KV=true: KV call to delete a single document.
KV=false: DELETE FROM customer USE KEYS [documentkey]

5. Scan: Scan JSON documents in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen (LIMIT).


KV=true:
SELECT META().id FROM customer WHERE META().id > "val" ORDER BY META().id LIMIT <num>
Fetch the actual documents directly using KV calls from the benchmark driver.

KV=false: SELECT * FROM customer WHERE META().id > "val" ORDER BY META().id LIMIT <num>
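With KV=true the scan becomes a two-step pattern: a query returns only the document keys, then the driver fetches each document by key. A minimal in-memory sketch of that pattern follows, with a Python dict standing in for the customer keyspace; the function names are hypothetical and no Couchbase SDK is involved.

```python
from typing import Dict, List


def scan_ids(store: Dict[str, dict], start_key: str, limit: int) -> List[str]:
    """Step 1: keys greater than start_key, in key order, capped at limit."""
    return sorted(k for k in store if k > start_key)[:limit]


def scan_documents(store: Dict[str, dict], start_key: str, limit: int) -> List[dict]:
    """Step 2: fetch each document by key (a KV call in a real driver)."""
    return [store[k] for k in scan_ids(store, start_key, limit)]


# Tiny stand-in data set keyed like the sample document ("100_advjson").
store = {f"{i}_advjson": {"doc_id": i} for i in (100, 101, 102, 103)}
docs = scan_documents(store, "100_advjson", 2)
```

Note that keys compare as strings, as they do in the `META().id > "val"` predicate.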

6. Page: Paginate result set of a query with predicate on a field in the document.



All customers in address.zip with randomly chosen OFFSET and LIMIT in SQL, N1QL
KV=true:
SELECT META().id FROM customer WHERE address.zip = "value" OFFSET <num> LIMIT <num>
Fetch the actual documents directly using KV calls from the benchmark driver.

KV=false: SELECT * FROM customer WHERE address.zip = "value" OFFSET <num> LIMIT <num>
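The OFFSET/LIMIT semantics of the Page operation can be illustrated on an in-memory list. This is a sketch under the assumption that the zip code lives at `address.zip`, as in the sample document; `page` is a hypothetical helper, not driver code.

```python
from typing import List


def page(customers: List[dict], zip_code: str, offset: int, limit: int) -> List[dict]:
    """Filter on address.zip, then skip `offset` rows and return up to `limit`."""
    matches = [c for c in customers if c.get("address", {}).get("zip") == zip_code]
    return matches[offset:offset + limit]


# Five customers in the same zip code; ask for the second page of two.
customers = [{"_id": f"{i}_advjson", "address": {"zip": "98316"}} for i in range(5)]
second_page = page(customers, "98316", offset=2, limit=2)
```

The list keeps a fixed order here; a query without ORDER BY gives no such guarantee, so successive pages may overlap unless an ordering is added.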

7. Search: Search JSON documents based on range predicates on 3 fields (customizable to n fields).



All customers WHERE (address.country = "value1" AND age_group = "value2" AND DATE_PART_STR(dob, "year") = <year>)
All customers retrieved with randomly chosen OFFSET and LIMIT in SQL, N1QL

KV=true:
SELECT META().id FROM customer WHERE address.country = "value1" AND age_group = "value2" AND DATE_PART_STR(dob, "year") = <year> ORDER BY address.country, age_group, DATE_PART_STR(dob, "year") OFFSET <num> LIMIT <num>
Fetch the actual documents directly using KV calls from the benchmark driver.

KV=false: SELECT * FROM customer WHERE address.country = "value1" AND age_group = "value2" AND DATE_PART_STR(dob, "year") = <year> ORDER BY address.country, age_group, DATE_PART_STR(dob, "year") OFFSET <num> LIMIT <num>
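A rough in-memory equivalent of the Search predicate is sketched below. It assumes the country is stored at `address.country` and that `dob` is an ISO date string or null, as in the sample document and the appendix. The `search` helper is hypothetical, and sorting by `_id` is an added assumption to make the paging deterministic.

```python
from typing import List


def search(customers: List[dict], country: str, age_group: str,
           year: int, offset: int, limit: int) -> List[dict]:
    """Apply the three-field predicate, then page through the matches."""
    def matches(c: dict) -> bool:
        dob = c.get("dob")  # may be None in the generated data
        return (c.get("address", {}).get("country") == country
                and c.get("age_group") == age_group
                and dob is not None
                and int(dob[:4]) == year)

    hits = sorted((c for c in customers if matches(c)), key=lambda c: c["_id"])
    return hits[offset:offset + limit]


customers = [
    {"_id": "1", "age_group": "child", "dob": "2016-05-11", "address": {"country": "Bahrain"}},
    {"_id": "2", "age_group": "child", "dob": None, "address": {"country": "Bahrain"}},
    {"_id": "3", "age_group": "adult", "dob": "2016-01-01", "address": {"country": "Bahrain"}},
]
result = search(customers, "Bahrain", "child", 2016, offset=0, limit=10)
```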

8. NestScan: Query JSON documents based on a predicate on a 1-level nested field.



KV=true:
SELECT META().id FROM customer WHERE address.prev_address.zip = "value" LIMIT <num>
Fetch the actual documents directly using KV calls from the benchmark driver.

KV=false: SELECT * FROM customer WHERE address.prev_address.zip = "value" LIMIT <num>
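The dotted path in the NestScan predicate walks nested objects one level at a time. The sketch below shows that traversal on a Python dict; `get_path` is a hypothetical helper and returns None when any step is missing, which is only an approximation of how a query engine treats missing fields.

```python
from typing import Any, Optional


def get_path(doc: dict, path: str) -> Optional[Any]:
    """Follow a dotted path such as 'address.prev_address.zip' through nested dicts."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # path does not exist in this document
        current = current[part]
    return current


# Trimmed copy of the nested address from the sample customer document.
customer = {
    "address": {
        "zip": "98316",
        "prev_address": {"zip": "34950-8194", "country": "Bulgaria"},
    }
}
prev_zip = get_path(customer, "address.prev_address.zip")
```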

9. ArrayScan: Query JSON documents based on a predicate within the single-level array field.



Find all customers who have a device with a given value, e.g. FF-012.
Sample devices field:
 "devices": [
   "EE-245",
   "FF-012",
   "GG-789",
   "HH-246"
 ],
KV=true:
SELECT META().id FROM customer WHERE ANY v IN devices SATISFIES v = "FF-012" END ORDER BY META().id LIMIT <num>
Fetch the actual documents directly using KV calls from the benchmark driver.
KV=false: SELECT * FROM customer WHERE ANY v IN devices SATISFIES v = "FF-012" END ORDER BY META().id LIMIT <num>
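The `ANY ... SATISFIES ... END` predicate has a close counterpart in Python's `any()`. A minimal sketch using the sample `devices` array (the `has_device` helper is hypothetical):

```python
def has_device(customer: dict, device: str) -> bool:
    """Python counterpart of: ANY v IN devices SATISFIES v = device END."""
    return any(v == device for v in customer.get("devices", []))


# The devices array from the sample customer document.
customer = {"devices": ["EE-245", "FF-012", "GG-789", "HH-246"]}
found = has_device(customer, "FF-012")
```

Like the query predicate, this is false for documents where the array is empty or absent.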

10. ArrayDeepScan: Query JSON documents based on a predicate within a two-level array field (array of arrays).

Get a list of all customers who have visited Paris, France.

KV=true:

SELECT META().id FROM customer
WHERE ANY v IN visited_places SATISFIES
      v.country = "France" AND
      ANY c IN v.cities SATISFIES c = "Paris" END
      END
ORDER BY META().id
LIMIT <num>

Fetch the actual documents directly using KV calls from the benchmark driver.

KV=false:

SELECT * FROM customer
WHERE ANY v IN visited_places SATISFIES
      v.country = "France" AND
      ANY c IN v.cities SATISFIES c = "Paris" END
      END
ORDER BY META().id
LIMIT <num>
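The nested ANY predicates can be read as two nested `any()` calls: one over the places, one over the cities inside each place. The sketch below uses values from the sample document; the `visited` helper is hypothetical.

```python
def visited(customer: dict, country: str, city: str) -> bool:
    """True if some visited place matches the country and lists the city."""
    return any(
        place.get("country") == country
        and any(c == city for c in place.get("cities", []))
        for place in customer.get("visited_places", [])
    )


# Two entries taken from the sample customer document.
customer = {
    "visited_places": [
        {"country": "Iran", "cities": ["Heidenreichshire", "West Luciano"]},
        {"country": "Israel", "cities": ["East Kali", "Pabloport"]},
    ]
}
match = visited(customer, "Israel", "Pabloport")
```

Both conditions must hold for the same array element: a customer who visited Pabloport in Israel does not match a search for Pabloport in Iran.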

11. Report: Query customer order details for customers in specific zipcode.


Each customer has multiple orders.
Order document has order details.
KV=true:
Not easily possible without a significant performance impact.
KV=false:

ANSI JOIN:
SELECT *
FROM customer c INNER JOIN orders o
ON (META(o).id IN c.order_list)
WHERE c.address.zip = "val"

ANSI JOIN with HASH join:
SELECT *
FROM customer c INNER JOIN orders o USE HASH (probe)
ON (META(o).id IN c.order_list)
WHERE c.address.zip = "val"
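The Report join pairs each customer in the zip code with the orders listed in that customer's `order_list`. A minimal in-memory sketch of that inner join follows. The `report` helper is hypothetical, `order_list` is taken from the queries above (it is not in the appendix YAML), and the order fields are made up for the example.

```python
from typing import Dict, List


def report(customers: List[dict], orders: Dict[str, dict], zip_code: str) -> List[dict]:
    """Inner join: one row per (customer, order) pair for customers in zip_code."""
    rows = []
    for c in customers:
        if c.get("address", {}).get("zip") != zip_code:
            continue
        for order_id in c.get("order_list", []):
            o = orders.get(order_id)  # lookup by order document key
            if o is not None:         # inner join drops dangling references
                rows.append({"c": c, "o": o})
    return rows


customers = [
    {"_id": "c1", "address": {"zip": "98316"}, "order_list": ["o1", "o2", "missing"]},
    {"_id": "c2", "address": {"zip": "11111"}, "order_list": ["o3"]},
]
orders = {"o1": {"total": 10}, "o2": {"total": 5}, "o3": {"total": 7}}
rows = report(customers, orders, "98316")
```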

12. Report2: Generate sales order summary for a given day, group by zip.

KV=true:
Requires a custom program in the application.
KV=false:

ANSI join:
SELECT o.day, c.address.zip, SUM(o.salesamt)
FROM customer c INNER JOIN orders o
ON (META(o).id IN c.order_list)
WHERE c.address.zip = "value"
AND o.day = "value"
GROUP BY o.day, c.address.zip
ORDER BY SUM(o.salesamt)

ANSI join with HASH join:
SELECT o.day, c.address.zip, SUM(o.salesamt)
FROM customer c INNER JOIN orders o USE HASH (probe)
ON (META(o).id IN c.order_list)
WHERE c.address.zip = "value"
AND o.day = "value"
GROUP BY o.day, c.address.zip
ORDER BY SUM(o.salesamt)
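Report2 adds grouping and a sum on top of the same join. The sketch below shows roughly what a hand-written program for the KV=true case would have to do. The `sales_summary` helper is hypothetical, and the `day` and `salesamt` order fields are taken from the queries above; the order document itself is not shown in this article.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sales_summary(customers: List[dict], orders: Dict[str, dict],
                  zip_code: str, day: str) -> List[Tuple[str, str, float]]:
    """Join customers to orders, filter by zip and day, sum sales per (day, zip)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for c in customers:
        if c.get("address", {}).get("zip") != zip_code:
            continue
        for order_id in c.get("order_list", []):
            o = orders.get(order_id)
            if o is not None and o.get("day") == day:
                totals[(o["day"], zip_code)] += o["salesamt"]
    # ORDER BY SUM(salesamt): ascending by the aggregated amount.
    return sorted(((d, z, t) for (d, z), t in totals.items()), key=lambda r: r[2])


customers = [
    {"address": {"zip": "98316"}, "order_list": ["o1", "o2"]},
    {"address": {"zip": "98316"}, "order_list": ["o3"]},
]
orders = {
    "o1": {"day": "2018-01-01", "salesamt": 10.0},
    "o2": {"day": "2018-01-02", "salesamt": 5.0},
    "o3": {"day": "2018-01-01", "salesamt": 2.5},
}
summary = sales_summary(customers, orders, "98316", "2018-01-01")
```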

13. Load: Data loading.

LOAD 1 million documents.
LOAD 10 million documents.

14. Sync: Data streaming and synchronization from another system

Need to measure the data sync performance.
1. Sync 1 million documents. 50% update, 50% insert.
2. Sync 10 million documents. 80% update, 20% insert.
Ideally, this sync would be done from Kafka or some other connector pulling data from a different source.
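One way to reason about the update/insert split during a sync run is to treat each incoming document as an upsert and count which path it takes. The sketch below uses a dict as the target store; `sync_batch` is a hypothetical helper, not part of any connector.

```python
from typing import Dict, Iterable


def sync_batch(store: Dict[str, dict], incoming: Iterable[dict]) -> Dict[str, int]:
    """Upsert each incoming document and count inserts versus updates."""
    counts = {"insert": 0, "update": 0}
    for doc in incoming:
        key = doc["_id"]
        counts["update" if key in store else "insert"] += 1
        store[key] = doc  # upsert: replace if present, add otherwise
    return counts


store = {"1_advjson": {"_id": "1_advjson", "isActive": False}}
incoming = [
    {"_id": "1_advjson", "isActive": True},   # existing key, counted as update
    {"_id": "2_advjson", "isActive": True},   # new key, counted as insert
]
counts = sync_batch(store, incoming)
```

Comparing the counted ratio against the intended mix (for example 50% update, 50% insert) is a quick check that the source data matches the workload definition.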

15. Aggregate: Do some grouping and aggregation.

-- Group query 1

SELECT c.address.zip, COUNT(1)
FROM customer c
WHERE c.address.zip BETWEEN "value1" AND "value2"
GROUP BY c.address.zip

-- Group query 2

SELECT o.day, SUM(o.salesamt)
FROM orders o
WHERE o.day BETWEEN "value1" AND "value2"
GROUP BY o.day;
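The first grouping query counts customers per zip code within a range. An in-memory sketch of the same idea, assuming the zip is stored at `address.zip` as in the sample document (the `count_by_zip` helper is hypothetical):

```python
from collections import Counter
from typing import List


def count_by_zip(customers: List[dict], low: str, high: str) -> Counter:
    """Count customers per zip for zips in the inclusive range [low, high]."""
    zips = (c.get("address", {}).get("zip") for c in customers)
    # Zip codes are strings, so the range check is a string comparison.
    return Counter(z for z in zips if z is not None and low <= z <= high)


customers = [
    {"address": {"zip": "98316"}},
    {"address": {"zip": "98316"}},
    {"address": {"zip": "98400"}},
    {"address": {"zip": "10001"}},
]
counts = count_by_zip(customers, "98000", "98999")
```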

4. Benchmark Workloads

Workloads are a combination of these operations.

To begin with, the workload definitions can reuse the YCSB core workloads A through F. Details are available at https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads. We also need to define additional workloads that combine the new operations defined above.

Workload SA is the same as workload A, applied to the new JSON data model. The same holds for workloads B through F; we call them SB through SF to distinguish them from the original workloads B through F.

Workload	Operations	Record selection	Application Example
SA — Update heavy	Read: 50% Update 50%	Zipfian	Session store recording recent actions in a user session
SB — Read heavy	Read: 95% Update: 5%	Zipfian	Photo tagging; adding a tag is an update, but most operations are to read tags
SC — Read only	Read: 100%	Zipfian	User profile cache, where profiles are constructed elsewhere (e.g., Hadoop)
SD — Read latest	Read: 95% Insert 5%	Latest	User status updates; people want to read the latest statuses
SE — Short ranges	Scan: 95% Insert: 5%	Zipfian/Uniform	Threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)
SF — Read, modify, write	Read: 50% Write: 50%	Zipfian	user database, where user records are read and modified by the user or to record user activity.
SG — Page heavy	Page: 90% Insert: 5% Update:5%	Zipfian	User database, where new users are added, existing records are updated, pagination queries on the system.
SH — Search heavy	Search: 90% Insert: 5% Update: 5%	Zipfian	User database, where new users are added, existing records are updated, search queries on the system.
SI — NestScan heavy	Nestscan: 90% Insert: 5% Update: 5%	Zipfian	User database, where new users are added, existing records are updated, nestscan queries on the system.
SJ — Arrayscan heavy	Arrayscan: 90% Insert: 5% Update: 5%	Zipfian
SK — ArrayDeepscan heavy	ArrayDeepScan: 90% Insert: 5% Update: 5%	Zipfian
SL — Report	Report: 100%
SM — Report2	Report2: 100%
SLoad — Load	Load: 100%	Everything	Data load to setup SoE
SN — Aggregate (SN1, SN2)	Aggregation: 90% Insert: 5% Update: 5%
SMIX — Mixed workload	Page:20% Search:20% Nestscan:15% Arrayscan:15% ArrayDeepscan:10% Aggregate: 10% Report: 10%		See below.
SSync — Sync	Sync: 100% Merge/Update: 70% New/Insert: 30%		Continuous sync of data from other systems to systems of engagement. See below.
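A workload row is essentially a set of weights used to pick the next operation. The sketch below shows one way a driver could draw operations for the SMIX row with Python's standard library. The `SMIX` dict copies the percentages from the table, while `choose_operation` is a hypothetical helper rather than the YCSB implementation.

```python
import random
from typing import Dict

# Proportions copied from the SMIX row of the table above.
SMIX: Dict[str, float] = {
    "page": 0.20,
    "search": 0.20,
    "nestscan": 0.15,
    "arrayscan": 0.15,
    "arraydeepscan": 0.10,
    "aggregate": 0.10,
    "report": 0.10,
}


def choose_operation(mix: Dict[str, float], rng: random.Random) -> str:
    """Pick one operation name with probability proportional to its weight."""
    ops = list(mix)
    weights = [mix[op] for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so a run is repeatable
sample_ops = [choose_operation(SMIX, rng) for _ in range(1000)]
```

Over many draws, the observed share of each operation should approach its configured proportion.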

Example Configuration for YCSB/JSON Workload



recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
Filternumlow = 2
Filternumhigh = 14
Sortnumlow = 3
Sortnumhigh = 6
page1proportion=0.95
insertproportion=0.05
requestdistribution=zipfian
maxscanlength=100
scanlengthdistribution=uniform
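Before a run, it can help to check that the operation proportions in a workload file add up to 1.0. The sketch below parses `key=value` lines and sums every property whose name ends in `proportion`. The helper names are hypothetical, and the sample text uses standard YCSB property names rather than the exact file above.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def proportion_total(props: Dict[str, str]) -> float:
    """Sum all properties whose name ends with 'proportion'."""
    return sum(float(v) for k, v in props.items() if k.endswith("proportion"))


sample = """
recordcount=1000
readproportion=0.95
insertproportion=0.05
requestdistribution=zipfian
"""
props = parse_properties(sample)
total = proportion_total(props)
```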

Acknowledgments

Thanks to Raju Suravarjjala, Couchbase Senior Director for QE and Performance, for pushing us to do this, and to the entire performance team for supporting this effort. The YCSB-JSON benchmark was developed in collaboration with Alex Gyryk, Couchbase Principal Performance Engineer. He developed the data models for customer and orders used in this paper and implemented the operations and workloads in YCSB-JSON for Couchbase and MongoDB. The YCSB-JSON implementation is available at: https://github.com/couchbaselabs/YCSB

Thanks to Aaron Benton, Couchbase Solution Architect, for developing an easy-to-use and efficient JSON data generator, fakeit. He developed it prior to joining Couchbase. It is available at: https://github.com/bentonam/fakeit

Next part

In the next article on YCSB-JSON, Alex will explain the implementations of this benchmark for Couchbase and MongoDB. The source code for the implementation is available at: https://github.com/couchbaselabs/YCSB

References

Benchmarking Cloud Serving Systems with YCSB: https://www.cs.duke.edu/courses/fall13/cps296.4/838-CloudPapers/ycsb.pdf
JSON: http://json.org
JSON Generator: http://www.json-generator.com/
YCSB-JSON Implementation: https://github.com/couchbaselabs/YCSB

Appendix

YAML to generate the customer dataset.


name: AdvJSON
type: object
key: _id
data:
  fixed: 10000
properties:
  _id:
    type: string
    data:
      post_build: "return '' + this.doc_id + '_advjson';"
  doc_id:
    type: integer
    description: The document id
    data:
      build: "return document_index + 1"
  gid:
    type: string
    description: "guid"
    data:
      build: "return chance.guid();"
  first_name:
    type: string
    description: "First name - string, linked to url as the personal page"
    data:
      fake: "{{name.firstName}}"
  middle_name:
    type: string
    description: "Middle name - string"
    data:
      build: "return chance.bool() ? chance.name({middle: true}).split(' ')[1] : null;"
  last_name:
    type: string
    description: "Last name - string"
    data:
      fake: "{{name.lastName}}"
  ballance_current:
    type: string
    description: "currency"
    data:
      build: "return chance.dollar();"
  dob:
    type: string
    description: "Date"
    data:
      build: "return chance.bool() ? new Date(faker.date.past()).toISOString().split('T')[0] : null;"
  email:
    type: string
    description: "email"
    data:
      fake: "{{internet.email}}"
  isActive:
    type: boolean
    description: "active boolean"
    data:
      build: "return chance.bool();"
  linear_score:
    type: integer
    description: "integer 0 - 100"
    data:
      build: "return chance.integer({min: 0, max: 100});"
  weighted_score:
    type: integer
    description: "integer 0 - 100 with zipf distribution"
    data:
      build: "return chance.weighted([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 0.4, 0.3, 0.25, 0.2, 0.17, 0.13, 0.11, 0.1, 0.09]) * 10 + chance.integer({min: 0, max: 10});"
  phone_country:
    type: string
    description: "field linked to phone, choices: us, uk, fr"
    data:
      build: "return  chance.pickone(['us', 'uk', 'fr']);"
  phone_by_country:
    type: string
    description: "phone number by country code, linked to phone_country field"
    data:
      post_build: "return chance.phone({country: this.phone_country});"
  age_group:
    type: string
    description: "field linked to age, choices: child, teen, adult, senior"
    data:
      build: "return  chance.pickone(['child', 'teen', 'adult', 'senior']);"
  age_by_group:
    type: integer
    description: "age by group, linked to age_group field"
    data:
      post_build: "return chance.age({type: this.age_group});"
  url_protocol:
    type: string
    description: "linked to url"
    data:
      build: "return  chance.pickone(['http', 'https']);"
  url_site:
    type: string
    description: "linked to url"
    data:
      build: "return  chance.pickone(['twitter', 'facebook', 'flixter', 'instagram', 'last', 'linkedin', 'xing', 'google', 'snapchat', 'tumblr', 'pinterest', 'youtube', 'vine', 'whatsapp']);"
  url_domain:
    type: string
    description: "linked to url"
    data:
      build: "return  chance.pickone(['com', 'org', 'net', 'int', 'edu', 'gov', 'mil', 'us', 'uk', 'ft', 'it', 'de']);"
  url:
    type: string
    description: "user profile url, linked to other document fields"
    data:
      post_build: "return '' + this.url_protocol + '://www.' + this.url_site + '.' + this.url_domain + '/' + this.first_name;"
  devices:
    type: array
    description: "Array of strings - device"
    items:
      $ref: '#/definitions/Device'
      data:
        min: 2
        max: 6
  linked_devices:
    type: array
    description: "Array of array of string"
    items:
      $ref: '#/definitions/Device'
      data:
        min: 3
        max: 6
        submin: 1
        submax: 4
  address:
    type: object
    description: An object of the Address
    schema:
      $ref: '#/definitions/Address'
  children:
    type: array
    description: "An array of Children objects"
    items:
      $ref: '#/definitions/Children'
      data:
        min: 0
        max: 5
  visited_places:
    type: array
    description: "Array of objects with arrays"
    items:
      $ref: '#/definitions/Visited_places'
      data:
        min: 1
        max: 4

definitions:
  Device:
    type: string
    description: "string AA-001 with zipf step distribution"
    data:
      build: "return chance.weighted(['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL', 'MM', 'NN', 'OO', 'PP', 'QQ', 'RR', 'SS', 'TT', 'UU', 'VV', 'WW', 'XX', 'YY', 'ZZ'], [1, 0.5, 0.333, 0.25, 0.2, 0.167, 0.143, 0.125, 0.111, 0.1, 0.091, 0.083, 0.077, 0.071, 0.067, 0.063, 0.059, 0.056, 0.053, 0.050, 0.048, 0.045, 0.043, 0.042, 0.04, 0.038]).concat('-').concat(chance.string({length: 3, pool: '0123456789'}));"
  Address:
    type: object
    properties:
      street:
        type: string
        description: The address 1
        data:
          build: "return faker.address.streetAddress() + ' ' + faker.address.streetSuffix();"
      city:
        type: string
        description: The locality
        data:
          build: "return faker.address.city();"
      zip:
        type: string
        description: The zip code / postal code
        data:
          build: "return faker.address.zipCode();"
      country:
        type: string
        description: The country
        data:
          build: "return faker.address.country();"
      prev_address:
        type: object
        description: An object of the Address
        schema:
          $ref: '#/definitions/Previous_address'
  Previous_address:
    type: object
    properties:
      street:
        type: string
        description: The address 1
        data:
          build: "return faker.address.streetAddress() + ' ' + faker.address.streetSuffix();"
      city:
        type: string
        description: The locality
        data:
          build: "return faker.address.city();"
      zip:
        type: string
        description: The zip code / postal code
        data:
          build: "return faker.address.zipCode();"
      country:
        type: string
        description: The country
        data:
          build: "return faker.address.country();"
      property_current_owner:
        type: object
        description: "owner object"
        schema:
          $ref: '#/definitions/Property_owner'
  Children:
    type: object
    properties:
      first_name:
        type: string
        description: "first name - string"
        data:
          fake: "{{name.firstName}}"
      gender:
        type: string
        description: "gender M or F"
        data:
          build: "return chance.bool({likelihood: 50})? faker.random.arrayElement(['M', 'F']) : null;"
      age:
        type: integer
        description: "age - 1 to 17"
        data:
          build: "return chance.integer({min: 1, max: 17})"
  Visited_cities:
    type: string
    description: "city"
    data:
      build: "return faker.address.city();"
  Visited_places:
    type: object
    properties:
      country:
        type: string
        data:
          build: "return faker.address.country();"
      cities:
        type: array
        description: "Array of strings - city names"
        items:
          $ref: '#/definitions/Visited_cities'
          data:
            min: 1
            max: 5
  Property_owner:
    type: object
    properties:
      first_name:
        type: string
        description: "First name - string, linked to url as the personal page"
        data:
          fake: "{{name.firstName}}"
      middle_name:
        type: string
        description: "Middle name - string"
        data:
          build: "return chance.bool() ? chance.name({middle: true}).split(' ')[1] : null;"
      last_name:
        type: string
        description: "Last name - string"
        data:
          fake: "{{name.lastName}}"
      phone:
        type: string
        description: "phone"
        data:
          build: "return chance.phone();"
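The `Device` definition above draws a two-letter prefix with weights that fall off as 1/rank (1, 0.5, 0.333, ...) and appends three random digits. A Python sketch of the same idea follows, using the standard library instead of the chance.js calls in the YAML. `make_device` is a hypothetical name, and the weights are computed as exact 1/rank rather than the rounded values listed above.

```python
import random

# "AA", "BB", ..., "ZZ" in rank order, as in the Device definition.
PREFIXES = [chr(ord("A") + i) * 2 for i in range(26)]
# Zipf-like step weights: 1, 1/2, 1/3, ... (the YAML lists rounded values).
WEIGHTS = [1 / rank for rank in range(1, 27)]


def make_device(rng: random.Random) -> str:
    """Return an id such as 'AA-038': weighted prefix plus three digits."""
    prefix = rng.choices(PREFIXES, weights=WEIGHTS, k=1)[0]
    digits = "".join(rng.choice("0123456789") for _ in range(3))
    return f"{prefix}-{digits}"


rng = random.Random(7)  # fixed seed so the sample is repeatable
devices = [make_device(rng) for _ in range(500)]
```

Because low-rank prefixes are drawn far more often, predicates on values such as "AA-..." match many more documents than predicates on "ZZ-...", which gives the array queries a skewed selectivity.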

Keshav Murthy

6 Comments

heyfaraday, February 5, 2019 at 5:56 am:
Hey. Does there YAML for orders exist to generate orders dataset?

- 3bst0r, August 26, 2021 at 8:32 am:
  I am also looking for this. The YAML in the appendix is missing the “order_list” key.

3bst0r, July 14, 2021 at 9:03 am:
Hi, great work! Could you please provide more instructions on how to get to the implementation mentioned here? I just checked out the master branch from https://github.com/couchbaselabs/YCSB and I can’t seem to find neither the workloads mentioned here nor the implementation of the new operations.

Keshav Murthy, July 14, 2021 at 9:12 am:
Please see the details in the follow-up article: https://www.couchbase.com/ycsb-json-implementation-for-couchbase-and-mongodb/

- 3bst0r, July 21, 2021 at 12:49 am:
  Awesome, thanks!

alflahi, August 9, 2021 at 2:34 pm:
Thanks a lot,
please, I have a question, How we can generate a new workload based on new requirements? please, we need an example.
