The purpose of computing is insight, not numbers.  — Richard Hamming

The spiral of running the business, analyzing what to change & what to change to, and then changing the business is an eternal one. Do the right analysis, your spiral will get larger.  Else, you’ll spiral down.

Couchbase, like the other pioneers of NoSQL systems, was created to address extreme scale, performance, and availability requirements of the web 2.0 world. From the simple key-value, Couchbase has evolved to handle query, search and analytics — at scale. Each of them is purpose-built engines integrated via Couchbase’s multi-dimensional architecture.  The query and analytics service both talk N1QL. Why build two distinct engines that talk the same language?  Because…

One Size Fits All: An Idea Whose Time Has Come and Gone.  — Michael Stonebraker

Query engine was built for operational workload and the Analytics engine for the analysis workload.  We’ve compared the two engines and given the guidance.  MongoDB has followed a similar path from being a clustered database handling simple workload to complex workload for analytics and queries on data lakes. 

Last year, MongoDB announced analytic nodes in their clusters for analytic processing.  In this blog, we compare and contrast the two engines for the analytics use case.

Couchbase: High-Level Architecture

Inside Couchbase Analytics: High-level Architecture
MongoDB Analytics Nodes:

Let’s compare and contrast the analytics support in MongoDB Analytic nodes and Couchbase Analytics.

MongoDB Analytic nodesCouchbase Analytics
Docshttps://docs.atlas.mongodb.com/reference/replica-set-tags/https://docs.couchbase.com/server/6.5/analytics/introduction.html
ArchitectureUse a set of Secondary replica nodes with a complete copy of the operational data. The query language is the same (MQL); query processing is the same as the operational workload.Distinct Analytics nodes which have a user-defined subset of the operational data. The query language is the same (N1QL); query processing is designed for larger datasets (see below).
Architecture DetailsAtlas Mapped Analytics NodesCouchbase Analytics: NoETL for Scalable NoSQL Data Analysis
Data ModelBSONJSON
Query LanguageMQL – MongoDB Query LanguageN1QL – Non 1st Normal-form Query Language; SQL for JSON
Query pageMongoDB QueryAnalytics Query
Query processingSame as operational query processing, using mongos and mongod for distributed query processing. Analytics engine designed for massive parallel processing (MPP) of the data. Each N1QL 
Query optimizerShape-Based Optimizer; Requires plan management.Rule-Based Optimizer. No plan management required. 
ExplainText and graphical.Text and graphical.
IndexingNeed to create the index in the operational and have it copied over.Analytics only Indexing
Parallel processingEach Mongod node runs the basic operations and mongos combines it (e.g. final group and aggregation). To handle complex analytics queries efficiently, and to deliver

the desired scale-up and speed-up properties, the Analytics Service

employs the same kinds of state-of-the-art, shared-nothing MPP

(massively parallel processing) based query processing strategies [From the VLDB paper]

IndexingLocal indexing Local indexing
Joins – Language$lookup operator supports simple equality joins between two collections; Only simple scalar fields are allowed. Arrays need to be unwound before joins.

  1. One of the two collections CANNOT be sharded.  That means, can’t join on large collections.
  2. Need a separate pipeline stage for simple non-equality joins. That means, the queries are inefficient and takes lot of resources; 
  3. This is roughly equivalent to LEFT OUTER JOINs in SQL.  Users will have to do additional pipeline processing to get the INNER JOIN and other joins.
INNER JOIN, LEFT OUTER JOIN, NEST and UNNEST operations.

  1. Standard SQL syntax
  2. Supports sharded dataset by default.
  3. Supports equality and arbitrarily complex join expressions.
Query processing:data sizeIntermediate stages of aggregate() pipeline cannot be more than 100 MiB in size. Query writers/users should use a special flag to allow this.No Limitations; When the intermediate data (e.g. hash table, sort data) gets bigger, it’s spilled over to the disk.
Query processing: Join type(roughly) LEFT OUTER JOININNER JOIN

LEFT OUTER JOIN

SearchSupports search within the query. Uses Atlas search on the cloud and basic B-tree based search on-prem. Analytics service doesn’t have a built-in search.  We need to use query service with FTS for combining search within a query.
Queries supportedfind() and aggregate()SELECT statement (from SQL and SQL++)
JOIN types (Language)$lookup — this is roughly LEFT OUTER JOIN via INNER JOIN

LEFT OUTER JOIN 

JOIN types (Implementation)
  1. Join only between one sharded and another non-sharded collection.
  2. Nested loop (NL) only. (NL is bad for performance while handling large data).
  3. Intermediate results are limited to 100MB memory.  User will have to know the size and use options to allow spill over.
  1. Joins sharded datasets; All of the datasets are sharded by default.
  2. Supports nested loop, broadcast and parallel hash join
  3. Supports both Nested Loop join and Hash join. 
  4. Use Hash join by default — well suited for large scale data processing. 
  5. No size limitations on the intermediate data.
AggregationSupports the common grouping and aggregation via aggregate() method.Supports the common grouping and aggregation via GROUP BY and respective aggregations. See below for Windowed aggregates. 
Windowed aggregate Functions: Probably, the Coolest SQL Feature.Unavailable.Fully supported.

RANK()

PERCENT_RANK()

DENSERANK()

ROW_NUMBER()

CUME_DIST()

FIRST_VALUE()

LAST_VALUE()

NTH_VALUE()
LAG()

LEAD()

NTILE()

RATIO_TO_REPORT()

Analyzing data from Multi-clustersAll the data analyzed is from a single MongoDB cluster.6.5: All the data analyzed is from a single Couchbase cluster.

6.6: Can ingest and analyze the data from multiple Couchbase clusters.

External dataSupports query processing on S3 data. Supports BSON, CSV, TSV, Avro, and Parquet formats.6.6: Supports external JSON, CSV and TSV data in S3
External data sourcesSupports additional datasources via JDBC driver. Integrated with the aggregation pipeline via, you’ve to wait for it, $sql operator.None except the ones mentioned above. 
SubqueriesSubqueries via the aggregation pipeline.Standard SQL subqueries.
Query plan$explainEXPLAIN
DataVizBuilt-in MongoDB chartsNo built-in DataViz
Business intelligence Knowi

Tableau and other ODBC, JDBC compliant BI engines.

Knowi

Tableau and other ODBC, JDBC compliant BI engines.

 

References:

  1. Comparing Two SQL-Based Approaches for Querying JSON: SQL++ and SQL:2016
  2. SQL to NoSQL – 7 Metrics to Compare Query Language
  3. Couchbase Analytics: NoETL for Scalable NoSQL Data Analysis

 

Posted by Keshav Murthy

Keshav Murthy is a Vice President at Couchbase R&D. Previously, he was at MapR, IBM, Informix, Sybase, with more than 20 years of experience in database design & development. He lead the SQL and NoSQL R&D team at IBM Informix. He has received two President's Club awards at Couchbase, two Outstanding Technical Achievement Awards at IBM. Keshav has a bachelor's degree in Computer Science and Engineering from the University of Mysore, India, holds ten US patents and two US patents pending.

Leave a reply