This is a guest post by Jay Gopalakrishnan. Jay is the founder of Cloud9 Charts, an analytics platform built for modern data architectures, featuring native support for Couchbase and N1QL. Follow them on Twitter: @cloud9charts.

This post uses the NYC green cab taxi dataset on Couchbase to demonstrate native analytics using N1QL & Cloud9 Charts.

Native NoSQL Analytics

Traditionally, analytics on NoSQL databases has meant one of the following:

  1. Shoehorn the data into a relational form using an ODBC driver that traditional BI architectures can understand. This usually requires a third-party ODBC driver, a traditional (typically desktop-based) BI tool, and schemas defined upfront.

  2. Use ETL processes to load relevant data into a relational database for analysis. This requires schemas to be defined across the NoSQL and SQL-based datastores and negates the schema flexibility of NoSQL databases like Couchbase. Any change to the data structure also takes a long lead time to propagate to the underlying store.

Native NoSQL analytics removes the dependency on ODBC drivers and ETL processes, so business and technical users can use the underlying database to its fullest extent and derive actionable insights immediately.

Briefly, the Couchbase-Cloud9 Charts integration features the following:

  • Fully Native N1QL integration without drivers/translators

  • Point and click N1QL query generator

  • Support for nested objects and Arrays

  • Join between Couchbase and other SQL/NoSQL or REST API based sources

  • Instant visualizations & embeddable dashboards

  • Advanced Analytics & Predictions

Dataset

The dataset consists of 45 million green cab taxi rides, made available by the NYC Taxi & Limousine Commission.

Green cabs were launched in 2013 in New York City to serve the outer boroughs of NYC, which were traditionally underserved by yellow cabs.

The analysis is focused on the following:

  • Geo Spatial analysis of pickup areas and dropoffs

  • Trip durations, by hour by day across neighborhoods

  • Fare analysis

  • Ride Predictions

The raw dataset in CSV form can be found here (2 GB zipped, 15 GB unzipped).

Couchbase Cluster

A 3-node Couchbase cluster was provisioned by our friends at Couchbase. The cbtransfer tool was used to load the data into Couchbase from CSV files. The raw data looks like this (truncated for the sake of brevity):

Indexes & Performance Considerations

Aggregations of the raw data, tracking rides by hour, day, pickup and dropoff, were created in another bucket to enable fast query execution:
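To make the idea of a pre-aggregated bucket concrete, here is a minimal Python sketch of the same kind of roll-up. It is not the aggregation used for this post: the field names (`pickup_datetime`, `dropoff_datetime`, `pickup_area`, `dropoff_area`, `fare_amount`) and the summary layout are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def aggregate_rides(rides):
    """Roll raw rides up into one summary per (day, hour, pickup, dropoff).

    Each ride is a dict with hypothetical field names; a real document
    in the bucket may be shaped differently.
    """
    buckets = defaultdict(
        lambda: {"rides": 0, "total_fare": 0.0, "total_minutes": 0.0}
    )
    for ride in rides:
        start = datetime.fromisoformat(ride["pickup_datetime"])
        end = datetime.fromisoformat(ride["dropoff_datetime"])
        key = (
            DAY_NAMES[start.weekday()],  # day of week of the pickup
            start.hour,                  # hour of day of the pickup
            ride["pickup_area"],
            ride["dropoff_area"],
        )
        bucket = buckets[key]
        bucket["rides"] += 1
        bucket["total_fare"] += ride["fare_amount"]
        bucket["total_minutes"] += (end - start).total_seconds() / 60
    return dict(buckets)
```

Storing totals rather than averages keeps each summary cheap to update as new rides arrive; the average fare or duration is a single division at query time.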

Analysis

The full dashboard of the following analysis can be accessed here. This is how a generated live dashboard/report can be shared with others.

Pickup & Dropoff Analysis:    

The following N1QL query clusters the pickup geo locations, from which a geo-spatial view of the pickup locations can be instantly derived using Cloud9 Charts:

See the dead zone within the Manhattan area on the map above? Here’s why: Green cabs are only allowed to pick up passengers north of East 96th St and West 110th St.

Contrast this with Yellow cab pickups below, where the majority of the pickups are concentrated around the Manhattan area.

While pickup zones are restricted, there are no limitations on passenger drop off areas for the green cab service. The drop-off heatmap looks like this:

Drop-off Locations N1QL query:

Ride Trends & Predictions

Let’s look at the overall monthly ride trends since the Green Cab service was launched.

The trend shows that the service ramped up in late 2013 and that monthly ride counts have stabilized from 2014 onwards.

Let’s apply predictive models to this trend to estimate the total rides over the next few months.

Cloud9 Charts backtests the data against its available predictive models to determine the best fit. The best-fit model implies that a slight uptick is expected over the next few months.
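Backtesting here means holding back the most recent months, forecasting them from the earlier data with each candidate model, and keeping the model with the smallest error. The Python sketch below illustrates that idea only. The three toy models and the error metric are assumptions and are not the models Cloud9 Charts ships.

```python
def naive_last(history, steps):
    """Repeat the last observed value."""
    return [history[-1]] * steps


def mean_forecast(history, steps):
    """Repeat the historical average."""
    avg = sum(history) / len(history)
    return [avg] * steps


def linear_trend(history, steps):
    """Extend a least-squares straight line (needs at least 2 points)."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in enumerate(history)
    ) / denom
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + i) for i in range(steps)]


def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def backtest(series, models, holdout=3):
    """Score each model on the last `holdout` points; return the best one."""
    train, test = series[:-holdout], series[-holdout:]
    scores = {
        name: mean_abs_error(test, model(train, holdout))
        for name, model in models.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `backtest(monthly_rides, {"naive_last": naive_last, "mean": mean_forecast, "linear_trend": linear_trend})` returns the winning model's name along with every model's error, so the choice can be inspected rather than taken on faith.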

Neighborhood Comparisons

Pickup restrictions pose interesting conundrums from a cab owner’s standpoint. A dropoff in a restricted area means that the driver must get back into a pickup region for the next pickup.

What are the most productive areas for a driver/owner operator? To answer this question, let’s look at some neighborhood analysis.

The following chord diagram shows the relationships between start regions and end regions. For example, there are far more rides (43% more) from Harlem → Hamilton Heights than from Hamilton Heights → Harlem.
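The "43% more" figure is a comparison of the two directions of the same route. A minimal Python sketch of that calculation follows; the function names are hypothetical, and any counts passed in would come from the aggregated data rather than from this post.

```python
from collections import Counter


def flow_counts(trips):
    """Count rides per (start_region, end_region) pair."""
    return Counter(trips)


def imbalance_pct(flows, a, b):
    """Percentage by which rides a -> b exceed rides b -> a."""
    forward, reverse = flows[(a, b)], flows[(b, a)]
    if reverse == 0:
        raise ValueError(f"no rides recorded from {b} to {a}")
    return (forward - reverse) / reverse * 100
```

A positive result means more rides flow from `a` to `b` than come back, which is exactly the asymmetry that ribbon widths in a chord diagram make visible.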

Fare Analysis

Suppose the pickup is in Harlem and the drop-off is in Chelsea, a prohibited pickup area. Between 5 and 6 PM on a Wednesday, it’s a 35-minute ride, as indicated by the grid heatmap below, for an average fare of $28. But it means that the driver must get back to a pickup area for the next ride. From a driver’s standpoint, that is not ideal.

Contrast that with another dropoff location, say the Fort Greene neighborhood in Brooklyn (again from Harlem). This is a $47 ride taking an average of 57 minutes during rush hour, and it also provides ample pickup opportunities within the Fort Greene area, as indicated by the geo heatmap.
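One way to compare the two trips is fare earned per minute of driving, using only the averages quoted above. This back-of-the-envelope Python sketch is an added illustration, not part of the dashboard, and it ignores tips, tolls and the unpaid drive back.

```python
def fare_per_minute(avg_fare, avg_minutes):
    """Average fare earned per minute of ride time."""
    return avg_fare / avg_minutes


# Averages quoted in the text for Wednesday rush hour, both from Harlem.
chelsea = fare_per_minute(28, 35)      # Harlem -> Chelsea
fort_greene = fare_per_minute(47, 57)  # Harlem -> Fort Greene

print(f"Chelsea: ${chelsea:.2f}/min, Fort Greene: ${fort_greene:.2f}/min")
```

The two rates come out within a few cents of each other, so on these numbers the real difference lies in what happens after the drop-off: whether the driver can pick up again right away or has to return empty.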

To take this one step further, in which areas should a cab operator deploy their asset?

The following chart helps to answer that question, providing the top average fares across locations for a given day and hour of the day.

Top Fares By Locations by Date/Time

So it looks like on Tuesdays from 8-9 AM, the Kew Gardens neighborhood yields the highest average pickup fare, whereas on Saturday nights at 11 PM, pole position goes to the Jamaica neighborhood.
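The lookup behind a chart like this boils down to averaging fares per pickup neighborhood for one day-and-hour slot and taking the maximum. Here is a minimal Python sketch under assumed field names (`day`, `hour`, `pickup_area`, `fare_amount`); it is not the query that powers the chart.

```python
from collections import defaultdict


def top_fare_area(rides, day, hour):
    """Return (area, average_fare) for the best pickup area in a slot.

    Returns None when no ride matches the given day and hour.
    """
    totals = defaultdict(lambda: [0.0, 0])  # area -> [fare sum, ride count]
    for ride in rides:
        if ride["day"] == day and ride["hour"] == hour:
            entry = totals[ride["pickup_area"]]
            entry[0] += ride["fare_amount"]
            entry[1] += 1
    if not totals:
        return None
    averages = {area: total / count for area, (total, count) in totals.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]
```

In practice it is worth also requiring a minimum ride count per neighborhood, since a single expensive trip can otherwise push a rarely used area to the top.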

Summary

Gone are the days of long, drawn-out ETL processes or shoehorning semi-structured data into relational formats for analytics purposes. With Cloud9 Charts, you can leverage N1QL natively to derive immediate, actionable insights that can be shared and embedded in a jiffy.

Special thanks to Prasad Varakur, Chin Hong and the rest of the Couchbase team for their hands-on support with Couchbase deployment and query optimization.

Resources:

Instant Analytics on Couchbase

Couchbase Connect, live talk on the taxi dataset

Couchbase-Cloud9 Charts documentation

Multi-Datasource Joins

Predictive Analytics

Posted by Jay Gopalakrishnan
