Data scientists love Jupyter Notebook, and it pairs naturally with the Couchbase document database.


Why? The Jupyter Notebook web application lets you create and share documents that combine live code, equations, visualizations and narrative text, for use cases such as data visualization and machine learning. Couchbase lets you store and process vast amounts of semi-structured and unstructured data at scale, and it supports the kinds of data the world is full of: narrative text (social media posts, for example), equations and more.

In this post, you’ll learn how to connect a Jupyter Notebook to a Couchbase cluster, then pull data from Couchbase and use it to train a linear regression model for machine learning. We’ll walk through an example that predicts a target variable’s value from independent variables using a linear regression equation.

Loading Your Data

To kick things off, proceed through these steps to load the sample dataset:

  1. In the Couchbase cluster’s Admin Console, go to Buckets > Add Bucket to create a new bucket, as shown here:

    Add a bucket in the Couchbase cluster admin console
  2. Add documents to your bucket by either navigating to Documents > Add Document, like this:

    Add documents to a Couchbase bucket

    or uploading a list of JSON documents or a CSV file. For this example we’ll upload a CSV file using cbimport. Here’s what my document looks like:

    CSV file upload using the cbimport tool

  3. The file can be any data you want to work on. This example uses the Advertising Dataset from Kaggle.
  4. Go to Documents > Import, as shown here:

    Import the Kaggle advertising dataset documents into Couchbase

  5. Select the file you wish to import and the destination bucket for the imported documents:

    Select the file in the Couchbase data bucket

    Your Documents menu should now look something like this:

    The documents menu in the Couchbase admin console

  6. In the Admin Console, create a primary index for the data bucket to make the data queryable, as you see here:

    Add a primary index to a data bucket in the Couchbase Web Console
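
To make the import step more concrete, here is a minimal sketch of how each CSV row maps to one JSON document whose fields are the column headers. It uses only the Python standard library and does not talk to Couchbase. The column names come from the Advertising Dataset used in this post; the two data rows are made-up illustrative values, and `csv_to_documents` is a hypothetical helper name, not part of any Couchbase tool.

```python
import csv
import io
import json

# Made-up illustrative rows (NOT real values from the Kaggle dataset).
SAMPLE_CSV = """TV,Radio,Newspaper,Sales
100.0,20.0,30.0,15.0
50.0,10.0,5.0,8.5
"""


def csv_to_documents(text):
    """Turn CSV text into a list of dicts, one per row, with numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


documents = csv_to_documents(SAMPLE_CSV)

# Each dict corresponds to one JSON document in the bucket.
for doc in documents:
    print(json.dumps(doc))
```

Depending on how the CSV was imported, the values stored in Couchbase may be strings rather than numbers, so it is worth checking a document in the Documents view before querying.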

Installing the Jupyter Notebook

First off, download the couchbase-jupyter-example repository from Couchbase Labs on GitHub. Then follow these steps:

  1. Install Jupyter Notebook via either the Python package management system (pip) or Anaconda.
  2. Install the dependencies for this project by running pip against the repo’s requirements file in your shell (typically pip install -r requirements.txt).
  3. Open Jupyter Notebook from the shell by running jupyter notebook.
  4. Create a new notebook with Python 3, as shown here:

    Making a new Jupyter Notebook using Python 3

What Is a Linear Regression Model?

The linear regression model is a powerful tool for predictive analysis. It lets us measure how strongly each independent (predictor) variable influences the target, forecast the effect of changes in those variables and identify trends in the data.

As you might deduce from the name linear regression, the “curve” we use to fit the data is a line. The simplest form of the regression equation is y = mx + c, where y represents the target variable, x represents a single independent variable, and m (the slope) and c (the intercept) are constants. We will use a simple linear regression equation in our example.

The independent variables in our example are TV, Radio and Newspaper (the advertising spend on each channel). The target variable is Sales.
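
To show how m and c in y = mx + c can be estimated, here is a minimal sketch of the closed-form least squares solution for a single predictor: m is the covariance of x and y divided by the variance of x, and c is the mean of y minus m times the mean of x. The function names are hypothetical and the sample points are synthetic, chosen to lie exactly on y = 2x + 1. This is not the code from the original notebook.

```python
def fit_simple_linear_regression(x, y):
    """Return (m, c) minimizing the squared error of y = m*x + c."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of x, and of cross deviations of x and y.
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("x must contain at least two distinct values")

    m = s_xy / s_xx
    c = mean_y - m * mean_x
    return m, c


def predict(x, m, c):
    """Apply the fitted line to new x values."""
    return [m * xi + c for xi in x]


# Synthetic points on the line y = 2x + 1.
m, c = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(m, c)
```

With real data the points do not fall exactly on a line, and the same formulas give the line with the smallest total squared error.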

Training Our Linear Regression Model

  1. In the new Jupyter Notebook, use the code shown below to connect to the Couchbase server. Use your username and password, of course, instead of Administrator and 123456.

    Connecting a Jupyter Notebook to Couchbase Server

  2. Import the required libraries, shown in the screenshot here. If these libraries aren’t present in your environment, install the latest versions into the active environment with pip, the Python package manager.

    Import required libraries using Python package manager

  3. Using a SELECT query, fetch the data from your data bucket into a pandas data frame:

    Fetch data into a pandas data frame

  4. You can view summary statistics for your pandas data frame (count, mean, standard deviation, minimum, maximum and quartiles) using the describe() method, as shown here:

    Using the describe command to view data frame contents

  5. Create boxplots of the values of each independent variable to detect outliers:

    Detect outliers using boxplot data visualization

  6. Create scatter plots of each independent variable against the target variable to determine the degree of correlation.
    Notice that TV appears to have the highest degree of correlation with Sales.

    Categorical variable correlation using scatter plots for data visualization

  7. Split the dataset, using 60% of it for training and the remaining 40% for testing. We can now determine the values of the coefficients in the regression equation, with TV as the independent variable and Sales as the target variable, using the Ordinary Least Squares method.

    Ordinary Least Squares method for regression testing

  8. Now train the model using the code you see here:

    Training data in a Jupyter notebook

  9. Next, substitute test for train to use the model to predict the test set’s values, like this:

    Testing data for a TV variable in an advertising dataset
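
Steps 1 through 4 above (connecting, querying and loading the rows into pandas) can be sketched roughly as follows. This is not the code from the original screenshots. It assumes the Couchbase Python SDK 4.x import paths, a cluster on localhost, and a bucket named `advertising`; the bucket name and all function names are hypothetical. The SDK is imported lazily inside `connect` so the helper functions can be tried without a running cluster.

```python
import pandas as pd

BUCKET = "advertising"  # hypothetical bucket name; use your own
COLUMNS = ["TV", "Radio", "Newspaper", "Sales"]


def connect(host="couchbase://localhost", user="Administrator", password="123456"):
    """Open a cluster connection (assumes Couchbase Python SDK 4.x)."""
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    return Cluster(host, ClusterOptions(PasswordAuthenticator(user, password)))


def build_query(bucket=BUCKET, columns=COLUMNS):
    """Build a SELECT statement for the given fields and bucket."""
    fields = ", ".join(f"`{name}`" for name in columns)
    return f"SELECT {fields} FROM `{bucket}`"


def rows_to_frame(rows, columns=COLUMNS):
    """Load query rows (dicts) into a data frame with numeric columns."""
    frame = pd.DataFrame(list(rows), columns=columns)
    # Values may arrive as strings depending on how the CSV was imported.
    return frame.apply(pd.to_numeric, errors="coerce")


def load_advertising_frame():
    """Connect, run the query and return the data frame (needs a live cluster)."""
    cluster = connect()
    result = cluster.query(build_query())
    return rows_to_frame(result.rows())
```

In a notebook cell, `df = load_advertising_frame()` followed by `df.describe()` would then give the summary statistics from step 4, provided the primary index from the loading section exists.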

Using the same approach, we can train and test models with Radio and Newspaper as the independent variable, as well:

Linear regression training data using a radio variable
Linear regression testing data using the radio variable
Training data in a Jupyter Notebook from Couchbase (newspaper variable)
Testing data in a Jupyter Notebook from Couchbase (newspaper variable)
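
The split, fit and predict steps, repeated for each independent variable, can be sketched as below. This is an illustration under stated assumptions rather than the notebook’s actual code: it swaps in `numpy.polyfit` with degree 1 as the least squares fit, uses a fixed random seed for the 60/40 split, and reports R² on the test set as a simple quality measure. All function names are hypothetical, and `df` is assumed to be a pandas data frame with TV, Radio, Newspaper and Sales columns.

```python
import numpy as np
import pandas as pd


def split_frame(df, train_frac=0.6, seed=0):
    """Randomly split a data frame into training and test sets."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


def fit_line(x, y):
    """Least squares fit of y = m*x + c; returns (m, c)."""
    m, c = np.polyfit(x, y, deg=1)
    return float(m), float(c)


def r_squared(x, y, m, c):
    """Coefficient of determination of the line on the given data."""
    residuals = y - (m * x + c)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 - ss_res / ss_tot


def evaluate_predictors(df, predictors=("TV", "Radio", "Newspaper"), target="Sales"):
    """Fit one simple regression per predictor on the training set, score on the test set."""
    train, test = split_frame(df)
    results = {}
    for name in predictors:
        m, c = fit_line(train[name].to_numpy(), train[target].to_numpy())
        score = r_squared(test[name].to_numpy(), test[target].to_numpy(), m, c)
        results[name] = {"m": m, "c": c, "test_r2": score}
    return results
```

Calling `evaluate_predictors(df)` returns the slope, intercept and test-set R² for each variable, which makes it easy to compare how well TV, Radio and Newspaper each explain Sales.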

Going Further with Machine Learning & Couchbase

Now that you’ve gotten your feet wet connecting Couchbase Server to Jupyter Notebook and explored the machine learning concept of linear regression, build on that knowledge with these posts on using Couchbase as a machine learning model store and enabling AI-driven insights with Couchbase.

Try it out for yourself:

Get Couchbase 7 here




Posted by Agrima Khanna
