Over the past few years, we have seen machine learning prove itself a technology worth massive investment: you can easily find dozens of articles describing how company X saved tons of money by adding some level of AI to its processes.
Surprisingly, I still notice many industries that are skeptical about it, and others that find it “cool” but have nothing concrete in mind yet.

I believe the reason for such dissonance comes down to two main factors: many companies have no idea how AI fits into their business, and for most developers it still sounds like black magic.

That is why I would like to show you today how you can get started with machine learning with almost zero effort.

Linear Regression

At the most basic level of machine learning we have something called Linear Regression, which is, roughly, an algorithm that tries to “explain” a number by assigning weights to a set of features. Let’s see some examples:

  •  The price of a house could be explained by things like size, location, number of bedrooms and bathrooms.
  •  The price of a car could be explained by its model, year, mileage, condition, etc.
  •  The time spent on a given task could be predicted by the number of subtasks, its level of difficulty, worker experience, etc.
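The idea of “explaining” a number with weighted features can be sketched in a few lines of Python. The weights below are made up purely for illustration; a real model learns them from historical data:

```python
# Toy linear model: price = w1*sqft + w2*bedrooms + w3*bathrooms + bias.
# All weights here are invented for illustration only.
def predict_price(sqft_living, bedrooms, bathrooms):
    w_sqft, w_bed, w_bath, bias = 250.0, 10_000.0, 15_000.0, 50_000.0
    return w_sqft * sqft_living + w_bed * bedrooms + w_bath * bathrooms + bias

print(predict_price(1800, 3, 2))  # -> 560000.0
```

Training a Linear Regression is the process of finding the weights that best fit the historical data.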

There are plenty of use cases where Linear Regression (or other regression types) can be used, but let’s focus on the first one, related to house prices.

Imagine we are running a real estate company in a particular region of the country. As it is an old company, there are historical records of which houses were sold in the past and for how much.

In this case, each row in our historical data will look like this:
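Based on the features discussed in this article, one such historical record has roughly the following shape (the field names and values below are illustrative, not a verbatim copy of the dataset):

```python
# Illustrative shape of one historical record; field names are based on the
# features discussed in this article, not copied from the actual dataset.
house_row = {
    "price": 450000.0,      # what the house sold for (null for unsold houses)
    "bedrooms": 3,
    "bathrooms": 2.0,
    "sqft_living": 1800,
    "yr_renovated": 0,      # 0 when the house was never renovated
    "zipcode": "98002",
}
print(sorted(house_row))
```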


The problem – How to price a house

Now, imagine you just joined the company and you have to sell the following house:

For how much would you sell it?

The question above would be very challenging if you had never sold a similar house before. Luckily, you have the right tool for the job: Linear Regression.

The Answer – Predicting house prices with Linear Regression

Before you go further, you will need to install the following items:

 Loading the Dataset

With your Couchbase Server running, go to the administrative portal (usually at http://127.0.0.1:8091) and create a new bucket called houses_prices.

Now, let’s clone our tutorial code:

In the root folder there is a file called house_prices_train_data.zip; it is our dataset, which I borrowed from an old machine learning course on Coursera. Please unzip it and then run the following command:

TIP: If you are not familiar with cbimport please check this tutorial

If your command ran successfully, you should notice that your houses_prices bucket has been populated:

Let’s also quickly add a primary index for it:
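Assuming standard N1QL syntax, the primary index can be created with a statement along these lines:

```sql
CREATE PRIMARY INDEX ON `houses_prices`;
```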

Time to Code!

Our environment is ready, it is time to code!

In the LinearRegressionExample class we start by creating the Spark context with our bucket credentials:

and then we load all the data from the database:

As Spark uses a lazy approach, the data is not loaded until it is really needed. You can clearly see the beauty of the Couchbase connector above: we just converted JSON documents into a Spark Dataframe with zero effort.

With other databases, for example, you would be required to export the data to a CSV file in some specific format, copy it to your machine, load it, and run some extra procedures to convert it to a dataframe (not to mention the cases where the generated file is too big).

In the real world you would need to do some filtering instead of just grabbing all the data; again, our connector is there for you, as you can even run N1QL queries with it:

TIP: There are a lot of examples on how to use Couchbase connector here.

Our dataframe still looks exactly like what we had in our database:

There are two different types of data here: “scalar numbers” such as bathrooms and sqft_living, and “categorical variables” such as zipcode and yr_renovated. Those categorical variables are not just simple numbers; they have a much deeper meaning, as they describe a property of the house. The zipcode, for example, represents its location.

Linear Regression does not cope well with that kind of categorical variable, so if we really want to use zipcode in our Linear Regression (and it does seem to be a relevant field for predicting the price of a house), we have to convert it into dummy variables, which is a fairly simple process:

  1. Get the distinct values of the target column. Ex: SELECT DISTINCT(ZIPCODE) FROM HOUSES_PRICES
  2. Convert each distinct value into a column. Ex: zipcode_98002, zipcode_98188, zipcode_98059
  3. Fill those new columns with 1s and 0s according to each row’s zipcode:

Ex:

The table above will be transformed to:
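The three steps above can be sketched in plain Python, using the sample zipcodes from step 2:

```python
# Plain-Python sketch of the dummy-variable steps above.
rows = [{"zipcode": "98002"}, {"zipcode": "98188"}, {"zipcode": "98059"}]

# Step 1: distinct values of the target column.
distinct_zips = sorted({r["zipcode"] for r in rows})

# Steps 2 and 3: one new column per distinct value, filled with 1s and 0s.
encoded = [
    {f"zipcode_{z}": int(r["zipcode"] == z) for z in distinct_zips}
    for r in rows
]
print(encoded[0])  # -> {'zipcode_98002': 1, 'zipcode_98059': 0, 'zipcode_98188': 0}
```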

That is what we are doing on the line below:

Converting categorical variables is a very standard procedure and Spark already has some utilities to do this work for you:

NOTE: The final dataframe will not look exactly like the example shown above, as it is already optimized to avoid the sparse matrix problem.

Now we can select the fields we would like to use and group them into a vector called features. As this Linear Regression implementation expects a field called label, we also have to rename the price column:

You can play around with those features, removing or adding them as you wish. Later, you can try removing the sqft_living feature, for example, to see how much worse the algorithm performs.

Finally, we will only use houses whose price is not null to train our machine learning algorithm, as our whole goal is to make our Linear Regression “learn” how to predict the price from a given set of features.

Here is where the magic happens: first we split our data into training (80%) and test (20%) sets (for the purposes of this article, let’s ignore the test data), then we create our LinearRegression instance and fit our data to it.
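To make the mechanics less magical, here is a minimal pure-Python sketch of the same idea – an 80/20 split followed by an ordinary least squares fit – on synthetic data with a single feature (the actual tutorial code relies on Spark’s LinearRegression instead):

```python
import random

# Synthetic data: price is roughly 300 * sqft + 50,000, plus some noise.
random.seed(42)
data = [(sqft, 300.0 * sqft + 50_000 + random.gauss(0, 5_000))
        for sqft in range(800, 3000, 25)]

# Split into training (80%) and test (20%) sets.
random.shuffle(data)
cut = int(len(data) * 0.8)
train, test = data[:cut], data[cut:]

# Ordinary least squares for one feature: slope and intercept.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def predict(sqft):
    return slope * sqft + intercept

# The fitted weights land close to the ones used to generate the data.
print(round(slope), round(intercept))
```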

The lrModel variable is already a trained model capable of predicting house prices!

Before we start predicting things, let’s just check some metrics of our trained model:

The one you should care about here is the RMSE – Root Mean Squared Error – which is, roughly, the average deviation between what our model predicts and the actual price the house sold for.

On average we miss the actual price by $147,556.08, which is not bad at all considering we barely did any feature engineering or removed any outliers (some houses might have inexplicably high or low prices, which can throw off your Linear Regression).
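RMSE itself is easy to compute by hand. Here is a sketch with a handful of made-up prices and predictions:

```python
import math

# RMSE: square the errors, average them, take the square root.
actual    = [450_000, 320_000, 610_000]
predicted = [430_000, 350_000, 600_000]

errors = [p - a for p, a in zip(predicted, actual)]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(round(rmse, 2))  # -> 21602.47
```

Squaring the errors means a few large misses hurt the score more than many small ones, which is why outliers matter so much.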

There is only one house with a missing price in this dataset – exactly the one we pointed out at the beginning:

And now we can finally predict the expected house price:


Awesome, isn’t it?

For production purposes, you would still need to do model selection first, check other metrics of your regression, and save the model instead of training it on the fly – but it’s amazing how much can be done with less than 100 lines of code!

If you have any questions, feel free to ask me on Twitter at @deniswsrosa or on our forums.

Posted by Rosa, Developer Advocate, Couchbase

Denis Rosa is a Developer Advocate for Couchbase and lives in Munich, Germany. He has solid experience as a software engineer and speaks Java, Python, Scala, and JavaScript fluently. Denis likes to write about search, big data, AI, microservices, and everything else that helps developers build beautiful, fast, stable, and scalable apps.
