YCSB is a great benchmarking tool built to be easily extended by any driver that implements the basic operations: insert, read, update, delete, and scan. The plain synthetic data that YCSB generates fits this paradigm perfectly.

But when it comes to JSON databases, queries become far more sophisticated: querying arrays and nested objects, running joins and aggregations.
On the one hand, the YCSB-JSON extension should be able to exercise all JSON operations supported by a database.
On the other hand, the implementation in YCSB should be generic enough to be easily extended by any other DB driver, no matter what level of JSON querying it supports.

YCSB-JSON is designed to better emulate realistic end-user scenarios. It is designed to work on any JSON data: real datasets, pseudo-realistic data, or fully synthetic data. One of the requirements for the tool is that there shouldn’t be any hardcoded values in query predicates. The user can only control the data cardinality during the dataset generation process.

 Fig 1. YCSB-JSON implementation at a glance

Data model

The data model we chose for this benchmark is described in this article: https://dzone.com/articles/ycsb-json-benchmarking-json-databases-by-extending
The dataset is generated with the fakeit tool and loaded into a database (Couchbase, MongoDB) by external scripts.
Although the model is defined and fixed, the values are randomly generated. The data is therefore random but realistic rather than purely synthetic.

 

Data management

For each operation in the workload, the queries are fixed, but the bound values for each parameterized predicate are non-deterministic. So the following data management flow was chosen:

  1. Generate documents with fakeit.
  2. Load the generated data into a database with any external script.
  3. Run the YCSB load phase. During this phase YCSB reads a random subset of the generated documents and stores all their values in its internal cache.
  4. During the run phase YCSB uses the values from its cache when binding and executing queries against the database.
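
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration and not the YCSB-JSON code: the ValueCache class, its method names, and the seeded java.util.Random are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical stand-in for the internal cache filled during the load phase.
class ValueCache {
  private final Map<String, List<String>> valuesByField = new HashMap<>();
  private final Random random;

  ValueCache(long seed) {
    this.random = new Random(seed);
  }

  // Load phase: remember a value observed in a sampled document.
  void remember(String field, String value) {
    valuesByField.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
  }

  // Run phase: pick one cached value to bind into a query predicate.
  String pick(String field) {
    List<String> values = valuesByField.get(field);
    if (values == null || values.isEmpty()) {
      throw new IllegalStateException("no cached values for field: " + field);
    }
    return values.get(random.nextInt(values.size()));
  }
}
```

Because every bound value was read from the loaded dataset, no value needs to be hardcoded in a query predicate.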

 

Predicates generator

YCSB uses generators when operating on data. YCSB-JSON introduces its own generator, mapped to a particular data model. The mapping and the model exist only within the generator namespace. The generator output is a set of generic predicates (field-value pairs) for a particular query. This makes it possible to modify the model and extend the tool with other queries without modifying the rest of the YCSB core code.

Predicates generator: Generator.java 

 

Example #1: Pagination query

One of the YCSB-JSON operations, the pagination query, can be represented by the following statement:

SELECT * FROM <bucket> WHERE address.zip = <value> OFFSET <num> LIMIT <num>

The query predicate is a field within an object. With Couchbase N1QL the field can simply be accessed as “address.zip”. But other databases might not be as flexible, so the YCSB-JSON generator creates two predicates: the parent predicate (address) and the child/nested predicate (zip).

The child predicate has a value randomly picked from the list of sample values for this particular field.

The generator function builds a SoeQueryPredicate object whose name is “address” and whose nested predicate is another SoeQueryPredicate object with name “zip” and value <value>.
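
To make the parent/child structure concrete, here is a minimal sketch. QueryPredicate is a simplified, hypothetical stand-in for SoeQueryPredicate (the real class lives in the repository and may look different), and PagePredicateBuilder is an illustrative name only.

```java
// Simplified, hypothetical stand-in for SoeQueryPredicate.
class QueryPredicate {
  final String name;
  final String value;           // null for a parent that only wraps a child
  final QueryPredicate nested;  // null for a leaf predicate

  QueryPredicate(String name, String value, QueryPredicate nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

class PagePredicateBuilder {
  // Builds address -> zip = zipValue; zipValue would be one of the sampled values.
  static QueryPredicate build(String zipValue) {
    QueryPredicate zip = new QueryPredicate("zip", zipValue, null);
    return new QueryPredicate("address", null, zip);
  }
}
```

A driver that supports dotted paths can join the two names into “address.zip”, while a driver without that feature can walk the parent and child separately.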

 

Example #2: Report query

Predicates for more complex queries are generated the same way. The only difference is that when a query involves multiple predicates, a predicate sequence (an array of predicates) is generated instead of a single predicate. Here is the Report query:

SELECT o2.month, c2.address.zip, SUM(o2.sale_price) FROM <bucket> c2
INNER JOIN orders o2 ON KEYS c2.order_list
WHERE c2.address.zip = <value> AND o2.month = <value>
GROUP BY o2.month, c2.address.zip ORDER BY SUM(o2.sale_price)

The generator function builds a sequence of predicates: a “month” predicate, an “address” predicate with a nested “zip” predicate, a “sale_price” predicate, etc.
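
A predicate sequence of this kind could look like the sketch below. QueryPredicate is again a simplified, hypothetical stand-in for SoeQueryPredicate, and the order of the elements and the value-less “sale_price” entry are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical stand-in for SoeQueryPredicate.
class QueryPredicate {
  final String name;
  final String value;           // null when the predicate only names a field
  final QueryPredicate nested;  // null for a leaf predicate

  QueryPredicate(String name, String value, QueryPredicate nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

class ReportPredicateBuilder {
  // Returns [month, address -> zip, sale_price] in a fixed, documented order.
  static List<QueryPredicate> build(String month, String zip) {
    QueryPredicate monthPredicate = new QueryPredicate("month", month, null);
    QueryPredicate addressPredicate =
        new QueryPredicate("address", null, new QueryPredicate("zip", zip, null));
    // sale_price carries no bound value here; it only names the aggregated field.
    QueryPredicate pricePredicate = new QueryPredicate("sale_price", null, null);
    return Arrays.asList(monthPredicate, addressPredicate, pricePredicate);
  }
}
```

The fixed order is what lets a driver pick each predicate by position without knowing anything about the data model.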

Other query generators can be found here:
https://github.com/couchbaselabs/YCSB/blob/soe/core/src/main/java/com/yahoo/ycsb/generator/soe/Generator.java

 

New operations

The YCSB code needs to be updated with new operations.

Signatures in the DB class:
https://github.com/couchbaselabs/YCSB/blob/soe/core/src/main/java/com/yahoo/ycsb/DB.java#L140

Implementations in DBWrapper:
https://github.com/couchbaselabs/YCSB/blob/soe/core/src/main/java/com/yahoo/ycsb/DBWrapper.java#L346

Extending YCSB CoreWorkload with new operations: SoeWorkload.java
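
The relationship between a new operation and its wrapper can be sketched as below. This is not the code behind the links above: JsonDb, OpStatus, TimingJsonDb, and the Object-typed generator parameter are hypothetical names chosen for the example. It only illustrates the general idea that the wrapper delegates to the driver and records latency around the call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type; YCSB has its own status class.
enum OpStatus { OK, ERROR }

// Hypothetical driver-facing contract for one new operation.
interface JsonDb {
  OpStatus page(String table, Object generator);
}

// Wrapper that delegates to the real driver and records the call latency.
class TimingJsonDb implements JsonDb {
  private final JsonDb delegate;
  final List<Long> pageLatenciesNanos = new ArrayList<>();

  TimingJsonDb(JsonDb delegate) {
    this.delegate = delegate;
  }

  @Override
  public OpStatus page(String table, Object generator) {
    long start = System.nanoTime();
    OpStatus status = delegate.page(table, generator);
    pageLatenciesNanos.add(System.nanoTime() - start);
    return status;
  }
}
```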

 

Implementation of YCSB-JSON operations for Couchbase and MongoDB

The DB driver function for a YCSB-JSON operation takes an additional parameter: the generator object. It is passed by the Workload class and has a particular predicate sequence prebuilt.

Because the predicate structure and sequences are well defined by the generator, a DB driver can access names and values directly and construct the query using its native query language or other access methods. Below are examples of implementing the Page and Report queries.

Page query, generating query statement for Couchbase:
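
A minimal sketch of assembling such a statement from the nested predicate, assuming the hypothetical QueryPredicate stand-in for SoeQueryPredicate and an illustrative CouchbasePageQuery helper. The $1, $2, $3 placeholders are N1QL positional parameters that would be bound at execution time; the actual client code may build the statement differently.

```java
// Simplified, hypothetical stand-in for SoeQueryPredicate.
class QueryPredicate {
  final String name;
  final String value;
  final QueryPredicate nested;

  QueryPredicate(String name, String value, QueryPredicate nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

class CouchbasePageQuery {
  // Joins parent and child names into a dotted path, e.g. "address.zip".
  static String statement(String bucket, QueryPredicate parent) {
    String field = parent.name + "." + parent.nested.name;
    return "SELECT * FROM `" + bucket + "` WHERE " + field
        + " = $1 OFFSET $2 LIMIT $3";
  }
}
```

Only the field path is concatenated into the text; the zip value, offset, and limit stay out of the statement so the same text can be reused for every execution.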

Page query, MongoDB:
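
For MongoDB the same predicate maps to a filter document that uses dot notation for the nested field. The sketch below uses a plain Map instead of the driver's document type to stay dependency-free; QueryPredicate and MongoPageQuery are hypothetical names, and the real client may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical stand-in for SoeQueryPredicate.
class QueryPredicate {
  final String name;
  final String value;
  final QueryPredicate nested;

  QueryPredicate(String name, String value, QueryPredicate nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

class MongoPageQuery {
  // Builds the equivalent of { "address.zip": <value> }.
  static Map<String, Object> filter(QueryPredicate parent) {
    Map<String, Object> filter = new LinkedHashMap<>();
    filter.put(parent.name + "." + parent.nested.name, parent.nested.value);
    return filter;
  }
}
```

With the MongoDB Java driver, a filter like this would be passed to find(...), with the offset and limit applied through skip(...) and limit(...) on the result.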

Report query, Couchbase:
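
A sketch of building the Report statement from a predicate sequence. It assumes the sequence order [month, address with nested zip, sale_price] and uses the hypothetical QueryPredicate stand-in; CouchbaseReportQuery is an illustrative name, and $1 and $2 are N1QL positional parameters for the zip and month values.

```java
import java.util.List;

// Simplified, hypothetical stand-in for SoeQueryPredicate.
class QueryPredicate {
  final String name;
  final String value;
  final QueryPredicate nested;

  QueryPredicate(String name, String value, QueryPredicate nested) {
    this.name = name;
    this.value = value;
    this.nested = nested;
  }
}

class CouchbaseReportQuery {
  static String statement(String bucket, List<QueryPredicate> predicates) {
    String month = predicates.get(0).name;                                     // "month"
    String zip = predicates.get(1).name + "." + predicates.get(1).nested.name; // "address.zip"
    String price = predicates.get(2).name;                                     // "sale_price"
    return "SELECT o2." + month + ", c2." + zip + ", SUM(o2." + price + ")"
        + " FROM `" + bucket + "` c2"
        + " INNER JOIN orders o2 ON KEYS c2.order_list"
        + " WHERE c2." + zip + " = $1 AND o2." + month + " = $2"
        + " GROUP BY o2." + month + ", c2." + zip
        + " ORDER BY SUM(o2." + price + ")";
  }
}
```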

Report query, MongoDB:
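
One way to express a comparable report in MongoDB is an aggregation pipeline. The sketch below builds the stages as plain Maps to stay dependency-free. It assumes that orders live in a collection named “orders”, that order_list holds the _id values of those orders, and that the pipeline runs on the customer collection; these are assumptions for illustration, and the actual client may use a different approach.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MongoReportPipeline {
  // Helper: single-entry document such as { key: value }.
  private static Map<String, Object> doc(String key, Object value) {
    Map<String, Object> m = new LinkedHashMap<>();
    m.put(key, value);
    return m;
  }

  static List<Map<String, Object>> build(String zip, String month) {
    // Join each customer with its orders through order_list.
    Map<String, Object> lookup = new LinkedHashMap<>();
    lookup.put("from", "orders");
    lookup.put("localField", "order_list");
    lookup.put("foreignField", "_id");
    lookup.put("as", "o2");

    // Group by month and zip, summing the sale price.
    Map<String, Object> groupId = new LinkedHashMap<>();
    groupId.put("month", "$o2.month");
    groupId.put("zip", "$address.zip");
    Map<String, Object> group = new LinkedHashMap<>();
    group.put("_id", groupId);
    group.put("total", doc("$sum", "$o2.sale_price"));

    List<Map<String, Object>> pipeline = new ArrayList<>();
    pipeline.add(doc("$match", doc("address.zip", zip)));
    pipeline.add(doc("$lookup", lookup));
    pipeline.add(doc("$unwind", "$o2"));
    pipeline.add(doc("$match", doc("o2.month", month)));
    pipeline.add(doc("$group", group));
    pipeline.add(doc("$sort", doc("total", 1)));
    return pipeline;
  }
}
```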

All Couchbase implementations: Couchbase2Client.java

All MongoDB implementations: MongoDbClient.java

References

Article part 1:
 https://blog.couchbase.com/ycsb-json-benchmarking-json-databases-by-extending-ycsb/

YCSB-JSON Implementation:
https://github.com/couchbaselabs/YCSB/tree/soe

FakeIt:
 https://github.com/bentonam/fakeit

Next steps

Implement a fakeit-like generator in YCSB to simplify data and query predicate generation.

Posted by Alex Gyryk

Principal Software Engineer, Performance
