Introduction to external datasets

Couchbase is very excited to announce its new “External Datasets” Analytics Services feature in the latest Couchbase Server 6.6 release. External datasets empower customers to access externally stored data in real-time from Amazon Web Services (AWS) Simple Storage Service (S3) and to combine S3-resident data with existing Couchbase data for analysis.

Customer use case

Some customers use AWS S3 to reduce storage costs and store data (e.g., multiple years of historical data, offline business data for machine learning, product reviews, etc.). They have expressed a desire to combine, query, and utilize S3 data in real-time to make this data available to business users for analytics. You can read more about other Analytics use cases here.

How do external datasets work?

External datasets provide the ability to dynamically query and analyze data residing in AWS S3, allowing users to easily combine data in real-time from both inside and outside their Couchbase analytics nodes. This is achieved in three simple steps:

  1. Set up an S3 link by using a REST API call or the command-line interface (CLI)
  2. Create an external dataset on the S3 link
  3. Query the dataset using SQL++ (or your favorite BI tool)

Let’s walk through a simple example. iMaz, an e-commerce company, sells consumer products online. Their order, product, and user data are stored on a Couchbase cluster with both data and analytics services (on separate sets of nodes in the cluster). They use the Analytics Service to run ad hoc and complex queries to analyze their business. iMaz also stores their product reviews on AWS S3, and they would like to combine and analyze the top 3 most highly rated products using the Couchbase Analytics Service.

Sample product data:

Sample review data:

Let’s follow the three steps from above with sample setup code along with a SQL++ query.

Step 1: Set up S3 link

We’ll create an S3 link using a REST API call. (Alternatively, you can use the CLI to create S3 links.). We’ll need to provide:

  • Analytics Service hostname
  • Analytics user credentials
  • S3 link name (in this case myS3Link)
  • Dataverse name (if different from default)
  • Link type (S3)
  • AWS S3 required access key ID
  • AWS S3 required secret access key
  • AWS S3 required region (e.g., us-west-2)

Step 2: Create an External Dataset

Using the Analytics workbench, we’ll now create an external dataset named “S3productreviews”. We’ll need to specify:

  • S3 bucket name
  • Dataverse name (if different from default) and S3 bucket name (in this case cb-analytics-6.6-demo)
  • Directory location (optionally) inside the bucket where files will be read from and recursively collected (in this case product reviews are stored in a “reviews” folder)
  • File format (in this case we’ll use JSON) with the ability to specify a search pattern (in this case *.json indicates that all JSON files will be included when querying data)

Currently, the external datasets feature supports the json, csv (comma-separated values), and tsv (tab-separated values)  file formats, including compressed gzip files (filenames ending with .gz or .gzip). Both the csv and tsv formats require you to specify an inlined type definition (more about this shortly). Additional file formats will be supported in future releases. You can read more about that here.

Step 3: Query using SQL++

As the last step, we can now run the SQL++ query listed below (which looks exactly like SQL :)). It joins the existing products dataset from the Couchbase Analytics Service and the product reviews data from AWS S3 to get the top 3 highly rated products.

Here are the json query results:

This is great – we’re now able to combine and analyze external data located in AWS S3 from the Couchbase Analytics Service. Notice how few steps it took to enable us to analyze our data; no ETL was involved, and the data was immediately available!

You might now be wondering: How would this have worked if the S3 reviews file format had been of type csv instead of JSON? The answer is simple; you simply would have constructed your external dataset accordingly. Below, we show what the create external dataset statement from above would look like to support csv:

Notice how the create statement now includes inlined type information. This is needed to tell Analytics how to interpret the csv data (e.g., not just as strings).

The SQL++ query remains exactly the same. That’s right, no change at all! External datasets are easy to set up, flexible, and simple to use thanks to the power of the SQL++ language. Users can develop complex ad hoc queries for further data exploration, answer new business questions, and combine external data with data from Remote Links to bring in other Couchbase data sources as well.

Benefits

Here are key benefits that come from using external datasets:

  1. Data enrichment. Couchbase data can now be enriched with additional information obtained from files that reside in an enterprise’s existing S3-based data lake.
  2. Dynamic data access. The latest data can be dynamically retrieved, streamed, combined, and analyzed from any S3 bucket in any AWS region during Analytics query execution.
  3. Parallel query processing. Users can configure and arrange access to S3 data using Analytics’ massively parallel processing (MPP) query processing architecture for fast response to queries involving external data.

Summary

External Datasets unlock the value of external live and archived data residing in S3-based data lakes. Users can combine and analyze data in real-time, sourced from both AWS S3 and Couchbase Analytics Service. This enables faster and more comprehensive data analysis and agile decision making.

Resources

You can learn more about External Datasets statements here. Register here for our upcoming “What’s new in Couchbase Server release 6.6” webinar.

Explore Couchbase Server 6.6 resources

Blogs

Docs and Tutorials

Webpages and Webinars

What’s New in Couchbase Server 6.6

What’s New in Couchbase Server 6.6?

New Features in Couchbase Server 6.6: Analytics, Backup, Query, and More

Eventing Improvements (Timers, Handlers, and Statistics)

Couchbase Server 6.6 Release Notes

Couchbase Analytics Service

Remote Links – Analyze Your Enterprise With Couchbase Analytics

Try the Couchbase Index Advisor Service

What’s New in Couchbase Server (Product Page)

External Datasets – Extend Your Reach With Couchbase Analytics

Set Up Analytics Remote and S3 Links Using REST API

Compare Editions

Announcing Flex Index With Couchbase

Create External Datasets Using Data Definition Language (DDL)

 

Introducing Backing Up to Object Store (S3)

Set Up Analytics Remote and S3 Links Using CLI

 

Import Documents With the Web Admin Console

   

Thanks Till Westmann for co-authoring and Michael Carey for valuable contributions and review of this post.

till westman engineering director analytics

 

Co-author

Till Westmann, Engineering Director at Couchbase

Till Westmann is an Engineering Director at Couchbase working on the Analytics Service. Before joining Couchbase Till built data management software at Oracle, 28msec, SAP, BEA Systems, XQRL, and Xyleme. He is a member of the Apache Software Foundation and the Vice President of the Apache AsterixDB project. Till holds a PhD from the University of Mannheim in Germany.

Author

Posted by Idris Motiwala

Idris is a Principal Product Manager, Analytics at Couchbase with 20+ years experience in design, development and execution of software products at both Fortune 500s and startups leading teams in digital transformation, cloud and analytics. Idris holds an MS in Technology Management and certifications in product management .

One Comment

  1. This is a revolutionary feature ! I simply love it!!!

Leave a reply