YCSB is a great benchmarking tool built to be easily extended by any driver which supports and implements basic operations like: insert, read, update, delete and scan. Plain synthetic data introduced by YCSB fits this paradigm perfectly.

But when it comes to JSON databases such as Couchbase and MongoDB, plain YCSB is less helpful because queries become far more sophisticated: querying arrays and nested objects, running joins, and computing aggregations.

On one hand, the YCSB-JSON extension should be able to utilize all the JSON operations a database supports.
On the other hand, the implementation should be generic enough to be easily extended to MongoDB, Couchbase, or any other database driver, no matter what level of JSON querying it supports.

The YCSB-JSON extension is designed to better emulate realistic end-user scenarios. It is designed to work on any JSON data: real datasets, pseudo-realistic data, or fully synthetic data. One requirement for the tool is that there shouldn’t be any hardcoded values in query predicates. Users can only control the data cardinality during the dataset generation process.

 Fig 1. YCSB-JSON implementation at a glance

Data Model

The data model we chose for this benchmark is described in this article: https://dzone.com/articles/ycsb-json-benchmarking-json-databases-by-extending
The dataset is generated with the fakeit tool and loaded into a database (Couchbase, MongoDB) by external scripts. The model is defined and fixed, while the values are randomly generated, so the data is random but not purely synthetic.

 

Data Management

For each operation in the workload, the queries are fixed, but the bound values for each parameterized predicate are non-deterministic. So, the following data management flow was chosen:

  1. Generate documents with fakeit.
  2. Load the generated data into a database with any external script.
  3. Run the load phase. During this phase YCSB reads a random subset of the generated documents and stores their values in its internal cache.
  4. During the run phase YCSB uses the values from its cache when binding and executing queries against the database.
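
Steps 3 and 4 can be pictured as a small value cache. The sketch below is only an illustration under assumptions: `ValueCache`, `load`, and `pick` are hypothetical names rather than classes from the YCSB-JSON source, documents are simplified to flat field-to-value maps, and sampling is done with replacement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of the load and run phases; not the YCSB-JSON code.
final class ValueCache {
    private final Map<String, List<String>> valuesByField = new HashMap<>();
    private final Random random;

    ValueCache(long seed) {
        this.random = new Random(seed);
    }

    // Load phase: sample documents at random and remember their field values.
    void load(List<Map<String, String>> documents, int sampleSize) {
        if (documents.isEmpty()) {
            return;
        }
        for (int i = 0; i < sampleSize; i++) {
            Map<String, String> doc = documents.get(random.nextInt(documents.size()));
            for (Map.Entry<String, String> field : doc.entrySet()) {
                valuesByField
                    .computeIfAbsent(field.getKey(), k -> new ArrayList<>())
                    .add(field.getValue());
            }
        }
    }

    // Run phase: pick a cached value to bind to a query predicate.
    String pick(String field) {
        List<String> values = valuesByField.get(field);
        if (values == null || values.isEmpty()) {
            throw new IllegalStateException("no cached values for field: " + field);
        }
        return values.get(random.nextInt(values.size()));
    }
}
```

Because every bound value comes from a document that was actually generated, queries built this way can match data in the database without any value being hardcoded in the workload.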

 

Predicates Generator

YCSB uses generators when operating with data. The connector introduces its own generator mapped to a particular data model. The mapping and the model exist only within the generator namespace. The generator output is a set of generic predicates (field-value pairs) for particular queries. This allows users to modify models and extend the tool with other queries without modifying the rest of the core code.

Predicates generator: Generator.java 

 

Example #1: Pagination query

One of the YCSB-JSON operations, the pagination query, can be represented by the following statement:

SELECT * FROM <bucket> WHERE address.zip = <value> OFFSET <num> LIMIT <num>

The query predicate is a field within an object. When using Couchbase N1QL the field can be simply accessed as “address.zip”. But other databases might not be as flexible so the YCSB-JSON generator creates 2 predicates: the parent predicate (address) and child/nested predicate (zip).

And the child predicate has a value randomly picked from the list of sample values for this particular field. 

The function below generates a SoeQueryPredicate object whose name is “address” and whose nested predicate is another SoeQueryPredicate object with name “zip” and value <value>:
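
A minimal sketch of that parent/child shape, under stated assumptions: `Predicate` is a hypothetical stand-in for SoeQueryPredicate, and its fields and constructor are invented for illustration; the real class in the generator source may differ.

```java
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for SoeQueryPredicate; the real class may differ.
final class Predicate {
    final String name;
    final String value;     // null for a parent that only wraps a nested predicate
    final Predicate nested; // null for a leaf predicate

    Predicate(String name, String value, Predicate nested) {
        this.name = name;
        this.value = value;
        this.nested = nested;
    }
}

final class PagePredicateBuilder {
    // Builds address -> zip, with the zip value picked at random from sample values.
    static Predicate build(List<String> zipSamples, Random random) {
        String zip = zipSamples.get(random.nextInt(zipSamples.size()));
        Predicate child = new Predicate("zip", zip, null);
        return new Predicate("address", null, child);
    }
}
```

Keeping the parent and child as separate objects leaves it to each driver to decide how to address the nested field.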

 

Example #2: Report query

Predicates for more complex queries are generated the same way. The only difference is that when a query introduces multiple predicates, a predicate sequence (an array of predicates) is generated instead of a single predicate. Here is the Report query:

SELECT o2.month, c2.address.zip, SUM(o2.sale_price) FROM <bucket> c2
INNER JOIN orders o2 ON KEYS c2.order_list
WHERE c2.address.zip = "value" AND o2.month = "value"
GROUP BY o2.month, c2.address.zip ORDER BY SUM(o2.sale_price)

The function below generates a sequence of predicates: a “month” predicate, an “address” predicate with a nested “zip” predicate, a “sale_price” predicate, and so on:
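
As a rough sketch of such a sequence, assuming a hypothetical `QueryPredicate` class in place of SoeQueryPredicate and covering only the three predicates named above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SoeQueryPredicate; the real class may differ.
final class QueryPredicate {
    final String name;
    final String value;          // null when only the field name is needed
    final QueryPredicate nested; // null for a leaf predicate

    QueryPredicate(String name, String value, QueryPredicate nested) {
        this.name = name;
        this.value = value;
        this.nested = nested;
    }
}

final class ReportPredicates {
    // Order is part of the contract: month, address (with nested zip), sale_price.
    static List<QueryPredicate> build(String month, String zip) {
        List<QueryPredicate> sequence = new ArrayList<>();
        sequence.add(new QueryPredicate("month", month, null));
        sequence.add(new QueryPredicate("address", null,
                new QueryPredicate("zip", zip, null)));
        // Name only: this field is aggregated, not compared against a value.
        sequence.add(new QueryPredicate("sale_price", null, null));
        return sequence;
    }
}
```

A driver that knows the order of the sequence can read names and values by position, which is what keeps the query-building code free of model-specific constants.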

Other query generators can be found here:
https://github.com/couchbaselabs/YCSB/blob/soe/core/src/main/java/com/yahoo/ycsb/generator/soe/Generator.java

 

New Operations

The YCSB core code needs to be extended with the new operations.

Signatures in DB class: 
https://github.com/couchbaselabs/YCSB/blob/soe/core/src/main/java/com/yahoo/ycsb/DB.java#L140

Implementations in DBWrapper: 
https://github.com/couchbaselabs/YCSB/blob/soe/core/src/main/java/com/yahoo/ycsb/DBWrapper.java#L346

Extending CoreWorkload with new operations: SoeWorkload.java
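
The division of labor between these classes can be sketched roughly as follows. All names here (`JsonDb`, `page`, `TimedJsonDb`, `PredicateSource`) and the `int` status code are assumptions made for illustration; the actual signatures are the ones in the linked DB.java and DBWrapper.java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the predicate generator handed to each operation.
interface PredicateSource {
    String nextZip();
}

// Hypothetical driver contract: a JSON operation receives the generator as an extra argument.
abstract class JsonDb {
    // Returns 0 on success; matching documents are appended to result.
    abstract int page(String table, List<String> result, PredicateSource gen);
}

// Wrapper in the spirit of DBWrapper: delegates the call and records its latency.
final class TimedJsonDb extends JsonDb {
    private final JsonDb delegate;
    final List<Long> latenciesNanos = new ArrayList<>();

    TimedJsonDb(JsonDb delegate) {
        this.delegate = delegate;
    }

    @Override
    int page(String table, List<String> result, PredicateSource gen) {
        long start = System.nanoTime();
        int status = delegate.page(table, result, gen);
        latenciesNanos.add(System.nanoTime() - start);
        return status;
    }
}
```

With this split, a workload class only decides which operation to call and which generator to pass, while each database driver supplies its own implementation of the operation.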

 

Implementation of YCSB-JSON Operations for Couchbase and MongoDB

The DB driver function for a YCSB-JSON operation takes an additional parameter: a generator object. The Workload class passes it in with the relevant predicate sequence already built.

Because the predicate structure and sequences are well defined by the generator, a DB driver can access names and values directly and construct the query using its native query language or other access methods. Below are examples of implementing the Page and Report queries.

Page query, generating query statement for Couchbase:
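
A minimal sketch of the idea for N1QL, assuming a hypothetical `PagePredicate` class in place of SoeQueryPredicate and positional parameters (`$1`, `$2`, `$3`) for the zip value, offset, and limit. The actual code is in Couchbase2Client.java and may be structured differently.

```java
// Hypothetical stand-in for SoeQueryPredicate; the real class may differ.
final class PagePredicate {
    final String name;
    final String value;
    final PagePredicate nested;

    PagePredicate(String name, String value, PagePredicate nested) {
        this.name = name;
        this.value = value;
        this.nested = nested;
    }
}

final class N1qlPageQuery {
    // Turns the parent "address" and nested "zip" into the N1QL path address.zip.
    // The zip value, offset, and limit are bound separately as $1, $2, and $3.
    static String statement(String bucket, PagePredicate parent) {
        String path = parent.name + "." + parent.nested.name;
        return "SELECT * FROM `" + bucket + "` WHERE " + path
                + " = $1 OFFSET $2 LIMIT $3";
    }
}
```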

Page query, MongoDB:
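
For MongoDB the same parent and child names can be joined into a dot-notation field path and used as the key of a filter document. The sketch below builds that filter as a plain `Map` so that it runs without the MongoDB driver; `MongoPageFilter` is a hypothetical helper, and the actual code is in MongoDbClient.java.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper; builds the filter {"address.zip": <value>} as a plain map.
final class MongoPageFilter {
    static Map<String, Object> filter(String parentName, String childName, Object value) {
        // MongoDB addresses fields of embedded documents with dot notation.
        return Collections.singletonMap(parentName + "." + childName, value);
    }
}
```

In a real driver call, a filter like this would be passed to a find operation, with the offset and limit of the page applied through the cursor's skip and limit settings.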

Report query, Couchbase:
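
As a hedged sketch, the Report statement shown earlier could be assembled from a predicate sequence like this. `ReportPredicate` and `N1qlReportQuery` are hypothetical names, the sequence order (month, address with nested zip, sale_price) is assumed, and the values are bound as positional parameters `$1` and `$2`; the actual code is in Couchbase2Client.java.

```java
import java.util.List;

// Hypothetical stand-in for SoeQueryPredicate; the real class may differ.
final class ReportPredicate {
    final String name;
    final String value;
    final ReportPredicate nested;

    ReportPredicate(String name, String value, ReportPredicate nested) {
        this.name = name;
        this.value = value;
        this.nested = nested;
    }
}

final class N1qlReportQuery {
    // Expects the sequence: month, address (with nested zip), sale_price.
    static String statement(String bucket, List<ReportPredicate> seq) {
        String month = "o2." + seq.get(0).name;
        String zip = "c2." + seq.get(1).name + "." + seq.get(1).nested.name;
        String total = "SUM(o2." + seq.get(2).name + ")";
        return "SELECT " + month + ", " + zip + ", " + total
                + " FROM `" + bucket + "` c2"
                + " INNER JOIN orders o2 ON KEYS c2.order_list"
                + " WHERE " + zip + " = $1 AND " + month + " = $2"
                + " GROUP BY " + month + ", " + zip
                + " ORDER BY " + total;
    }
}
```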

MongoDB:

All Couchbase implementations: Couchbase2Client.java

All MongoDB implementations: MongoDbClient.java

References

Article part 1:
 https://www.couchbase.com/blog/ycsb-json-benchmarking-json-databases-by-extending-ycsb/

YCSB-JSON Implementation:
https://github.com/couchbaselabs/YCSB/tree/soe

FakeIt:
 https://github.com/bentonam/fakeit

Next steps

Implement a fakeit-like generator to simplify data and query predicate generation.

Author

Posted by Alex Gyryk

Alex Gyryk is a Principal Software Engineer, Performance at Couchbase. Prior to joining Couchbase, he spent a few years at Forte Group as a Senior Performance Analyst.
