You can now analyze even more of your enterprise data using Couchbase Analytics – without having to move or migrate a byte.

Today, I’m excited to announce upcoming support for Microsoft Azure Blob Storage in external Analytics Collections (currently in Developer Preview). This follows the support for Amazon Web Services Simple Storage Service (AWS S3) in external Analytics Collections, which arrived in the Couchbase Server 6.6 release.

In short, this announcement means more Couchbase customers can combine data from external sources (e.g., Azure Blob Storage and AWS S3) with local data (Couchbase Analytics) as well as remote Couchbase data (i.e., Remote Analytics Collections).

(Note: A similar topic is covered in External Datasets: Accessing AWS S3 in Couchbase Analytics. If you’ve used external Analytics Collections before, then the concepts covered in this article will be familiar to you.)

Why Use External Analytics Collections?

Some Couchbase customers use Azure Blob Storage to reduce storage costs for large volumes of data (e.g., multiple years of historical data, offline business data for machine learning, product reviews, etc.).

These customers want to combine, query and utilize data from Azure Blob Storage in real time for business users and data analytics. (Discover how other customers use Couchbase Analytics in this article.)

In order to keep this data in low-cost storage but still use it for data analysis, Couchbase customers can use external Analytics Collections.

How External Analytics Collections Work

External Analytics Collections enable you to dynamically query and analyze data residing in external sources (such as AWS S3 and Azure Blob Storage), allowing you to easily combine data in real time from both inside and outside your Couchbase Analytics nodes.

All it takes to combine your internal and external datasets is three simple steps:

  1. Set up an external link using a REST API call or the command-line interface (CLI).
  2. Create an external Analytics Collection on the external link.
  3. Query the Analytics Collection using N1QL for Analytics (or your favorite BI tool).

Let’s walk through a simple example.

iMaz, an ecommerce company, sells consumer products online. Their order, product, and user data are stored on a Couchbase cluster using both the Data and Analytics Services (on separate sets of nodes in the cluster). iMaz uses the Analytics Service to run ad hoc and complex queries to analyze their business. They store their product reviews on Azure Blob Storage, and they would like to combine those reviews with their product data to identify the three most highly rated products using the Couchbase Analytics Service.

Here’s some sample product data:

And here’s some sample review data:

Now we’ll walk through each of the three steps from above using sample setup code along with a N1QL for Analytics query.

Step 1: Set Up the Links

First, you’ll need to create an Azure Blob Storage link using a REST API call. (Alternatively, you can use the CLI to create Azure Blob Storage links.)

You’ll need to provide:

    • Your Couchbase Analytics Service hostname
    • Your Analytics user credentials
    • The Azure Blob Storage link name. In the example below, we’ll use: myAzureLink
    • The Scope name (previously known as dataverse). In the example below, we’ll use: Default
    • Link type (AzureBlob)
    • Credentials (only one of the following is allowed):
      • Connection string, or
      • Account name and account key, or
      • Account name and shared access signature

All together, creating the Azure Blob Storage link will look something like the following:
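As a rough illustration, here is a minimal Python sketch that assembles such a request without sending it. It rests on several assumptions that are not confirmed by this article: that the link is created with a form-encoded POST to /analytics/link/<scope>/<link name> on the Analytics Service port 8095, that the link type is passed as type=azureblob, and that the connection string and shared access signature fields are named connectionString and sharedAccessSignature. Only accountName and accountKey are named in this article. The hostname, user, and credential values are placeholders.

```python
import base64
from urllib.parse import quote, urlencode
from urllib.request import Request

# Credential combinations from the list above; the empty set stands for a
# link without credentials (public data). Only accountName and accountKey
# are named in this article; the other field names are assumptions.
ALLOWED_CREDENTIAL_SETS = [
    frozenset(),
    frozenset({"connectionString"}),
    frozenset({"accountName", "accountKey"}),
    frozenset({"accountName", "sharedAccessSignature"}),
]


def build_link_request(host, user, password, scope, link_name, credentials):
    """Build (but do not send) a POST request that would create the link."""
    if frozenset(credentials) not in ALLOWED_CREDENTIAL_SETS:
        raise ValueError("provide exactly one supported credential option")

    # Assumed endpoint: Analytics Service on port 8095, with the link
    # addressed by scope name and link name.
    url = "http://{}:8095/analytics/link/{}/{}".format(
        host, quote(scope, safe=""), quote(link_name, safe=""))

    # Assumed form field: type=azureblob selects the Azure Blob Storage link.
    form = {"type": "azureblob"}
    form.update(credentials)

    request = Request(url, data=urlencode(form).encode("utf-8"), method="POST")
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


request = build_link_request(
    host="localhost",          # placeholder Analytics Service hostname
    user="Administrator",      # placeholder Analytics user
    password="password",       # placeholder password
    scope="Default",
    link_name="myAzureLink",
    credentials={"accountName": "myaccount", "accountKey": "mykey"},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it. The credential check mirrors the rule that only one credential option is allowed per link.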

Note: It is also possible to create Azure Blob Storage links without credentials, which will then use anonymous authentication. In this case accountName and accountKey are not needed. This approach can be used to access public data.

Step 2: Create an External Analytics Collection

After you create the external links, you can create an external Analytics Collection using Data Definition Language (DDL) statements that refer to the previously created link names.

The code sample below is the DDL statement to create the Analytics Collection on the Azure Blob Storage link created previously. (cb-analytics-7-0-0-demo is the name of the container in Azure Blob Storage.)
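A DDL statement of this kind might be assembled as in the Python sketch below. It assumes that the Azure Blob Storage form mirrors the CREATE EXTERNAL DATASET ... ON ... AT ... USING ... WITH syntax used for S3 links. The collection name reviews and the path prefix reviews/ are hypothetical; the container and link names come from this article.

```python
import json


def external_collection_ddl(name, container, link, path, file_format="json"):
    """Return a DDL statement for an external Analytics Collection."""
    with_clause = json.dumps({"format": file_format})
    # Backticks quote identifiers that contain hyphens, such as the container.
    return (
        "CREATE EXTERNAL DATASET `{}` ON `{}` AT `{}` USING \"{}\" WITH {};"
        .format(name, container, link, path, with_clause)
    )


statement = external_collection_ddl(
    name="reviews",                       # hypothetical collection name
    container="cb-analytics-7-0-0-demo",  # container named in this article
    link="myAzureLink",                   # link created in Step 1
    path="reviews/",                      # hypothetical path prefix
)
print(statement)
```

Because the link name is just a parameter, the same helper would work unchanged for a link of another type.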

As shown above, once the links exist, creating an external Analytics Collection works the same way regardless of the link type. Multiple Analytics Collections can be created on the same external link to point to different external data containers.

Currently, the external Analytics Collection feature supports the JSON, CSV (comma-separated values), and TSV (tab-separated values) file formats, including compressed gzip files (filenames ending with .gz or .gzip). Both the CSV and TSV formats require you to specify an inline type definition (more on this below). Additional file formats will be supported in future releases of Couchbase Analytics.

Learn more about inline type definitions in the Couchbase docs.

Step 3: Query Using N1QL for Analytics

Your last step is to run the N1QL query shown below (which looks suspiciously like SQL, don’t you think? :)).
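For illustration only, the sketch below shows what such a query could look like and how it might be submitted to the Analytics Service from Python. The field names (id, name, product_id, rating) and the collection names products and reviews are hypothetical, since the sample data is not reproduced here. The request targets the Analytics Service query endpoint /analytics/service on port 8095; the hostname and credentials are placeholders.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical schema: products(id, name) and reviews(product_id, rating).
TOP_RATED_QUERY = """
SELECT product, AVG(r.rating) AS avg_rating
FROM products AS p
JOIN reviews AS r ON r.product_id = p.id
GROUP BY p.name AS product
ORDER BY AVG(r.rating) DESC
LIMIT 3;
""".strip()


def build_query_request(host, user, password, statement):
    """Build (but do not send) a POST request carrying one statement."""
    url = "http://{}:8095/analytics/service".format(host)
    body = urlencode({"statement": statement}).encode("utf-8")
    request = Request(url, data=body, method="POST")
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


# Placeholder host and credentials.
query_request = build_query_request(
    "localhost", "Administrator", "password", TOP_RATED_QUERY)
print(query_request.full_url)
```

The join treats the external reviews collection like any local collection, which is the point of the feature: the query does not need to know where the data lives.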

The N1QL query above joins the existing products Analytics Collection from the Couchbase Analytics Service with the product reviews data from Azure Blob Storage to retrieve the three most highly rated products.

Here are the JSON query results from the above:

Now you can combine and analyze external data located in Azure Blob Storage using the Couchbase Analytics Service. Notice how few steps it took for you to analyze your data: there was no ETL, and the data was immediately available(!). Moreover, if your data changes, you’ll see those changes updated whenever you rerun the query. That’s because external data is accessed on demand at query execution time.

But what if the file format had been CSV rather than JSON?

The answer is simple: You just have to define your external Analytics Collection accordingly. The sample N1QL statement below illustrates how you would create an external Analytics Collection that supports CSV:
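One way to picture such a statement is the Python sketch below, which renders an inline type definition and places it after the collection name. The type names (BOOLEAN, BIGINT, DOUBLE, STRING), the header option, and the field names are assumptions made for illustration, as are the collection name and path prefix; check the Couchbase docs for the exact syntax.

```python
import json

# Type names assumed to be accepted in inline type definitions.
SUPPORTED_TYPES = {"BOOLEAN", "BIGINT", "DOUBLE", "STRING"}


def inline_type_definition(fields):
    """Render "(name TYPE, ...)" from an ordered list of (name, type) pairs."""
    parts = []
    for name, type_name in fields:
        if type_name.upper() not in SUPPORTED_TYPES:
            raise ValueError("unsupported type: " + type_name)
        parts.append("`{}` {}".format(name, type_name.upper()))
    return "(" + ", ".join(parts) + ")"


def csv_collection_ddl(name, fields, container, link, path, header=True):
    """Return a DDL statement for a CSV-backed external collection."""
    # "header" (assumed option) says whether the first CSV row holds names.
    with_clause = json.dumps({"format": "csv", "header": header})
    return (
        "CREATE EXTERNAL DATASET `{}`{} ON `{}` AT `{}` USING \"{}\" WITH {};"
        .format(name, inline_type_definition(fields), container, link, path,
                with_clause)
    )


csv_statement = csv_collection_ddl(
    name="reviews",                       # hypothetical collection name
    fields=[("product_id", "bigint"),     # hypothetical CSV columns
            ("rating", "double"),
            ("comment", "string")],
    container="cb-analytics-7-0-0-demo",
    link="myAzureLink",
    path="reviews/",                      # hypothetical path prefix
)
print(csv_statement)
```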

Notice how the CREATE statement now includes inline type information. This type information tells Couchbase Analytics how to interpret the CSV data in order to transform it into JSON data (e.g., not just as strings).

But whether you’re using JSON or CSV data, the N1QL query remains exactly the same.

Benefits of External Analytics Collections

Here are some key benefits that come from using external Analytics Collections:

  1. Data enrichment: Couchbase data can now be enriched with additional information obtained from data that resides in external data stores like Microsoft Azure Blob Storage.
  2. Dynamic data access: Your most current data can be dynamically retrieved, streamed, combined and analyzed from S3 buckets or Azure Blob Storage containers (Developer Preview) in any region. Data is retrieved during Analytics query execution.
  3. Parallel query processing: You can configure and arrange access to external data using Analytics’ massively parallel processing (MPP) architecture for fast response to queries involving external data.

Conclusion

External Analytics Collections unlock the value of your live and archived data residing in external data stores. They’re also easy to set up, flexible, and simple to use thanks to the power of the N1QL query language.

Your users and data analysts can now combine and analyze data sourced in real time from AWS S3, Azure Blob Storage, and the Couchbase Analytics Service. With external Analytics Collections, you can develop complex ad hoc queries for interactive data exploration, answer new business questions, and combine external data with data from Remote Links to bring in other Couchbase data sources as well.

Bottom line: Your teams can conduct faster and more comprehensive data analyses for more agile decision-making.

To learn more about external Analytics Collections with Couchbase 7.0, check out the following resources:

Don’t just take our word for it: Test out Couchbase Analytics and see for yourself!
Get Started with Couchbase 7.0

Author

Posted by Hussain Towaileb, Software Engineer

Hussain Towaileb is a Software Engineer working on Couchbase Analytics. He focuses on external links and external datasets.
