Co-author:  Sitaram Vemulapalli, Principal Engineer, Couchbase R&D.

“The answer my friend is hiding in JSON” – Bob Dylan

There are a lot of public JSON datasets and then is awesome JSON datasets. Every company, including yours, has stored a lot of data in JSON — the result of surveys, campaigns, and forums.

There are many ways to get skin the JSON. You can write Python program for every report, visualization you want to do.  Or, you can use N1QL (SQL for JSON) to generate the right algorithm for you to analyze JSON data. In this article, we show you how to use N1QL to extract insights quickly.  We also use two features coming up in the next release: Common Table Expression (CTE) and Window Functions.

Goal: Use public JSON dataset for US Open golf scores to create a simple leaderboard, ranking, etc.

Three things you’ll do as part of this:

  1. Ingest the data into Couchbase easily.
  2. Start getting the value of this JSON data immediately.
  3. Shape the JSON to generate useful reports using new features quickly.

Source Data: https://github.com/jackschultz/usopen

Queries in this post are also available at: https://github.com/keshavmr/usopen-golf-queries

Data repo structure: This GitHub repo https://github.com/jackschultz/usopen contains US Open golf -2018 data.  For each hole, it as a separate document for each day.

Each document has this structure.  This is the document for hole 1 on day 1.  The filed Ps has the list of players, each with a unique ID.  

Each player’s playing statistics is following that, stroke by stroke.   The players are matched to scores using the field unique ID for the player.  

Start getting insights:

Before you start querying, create a primary index on the bucket.

CREATE PRIMARY INDEX ON usopen;

Task 1:   Create a report of player scores by round and the final total.

After playing with JSON from bottom-up, we came up with this query. The explanation is after the query.

Tabular Results (In Tabular form, from the Couchbase query workbench)

Let’s look at the query block by block.

Look at the WITH d clause.  The statement untangles the JSON from PER-day-PER-hole-shot-by-shot data to simple scalar values.

Holedn is the document key – hole-day-number

Country is the player’s nationality

ID is the player’s unique ID.

Hole and day are obvious and score is the player’s score for that hole.

In the FROM clause of the SELECT statement, pl is the full list of players taken from the document for the first day, first hole (holes:1:1).

Rs is the players’ result, shot by shot, hole by hole.  First, we unnest that array couple of times to project details on each hole and score for that hole, determined by array_length(hps.Sks).

Once we have the hole-by-hole score,  it’s easy to write the final query to aggregate by the player and by day.

 

**The WITH clause is the common table expression (CTE) feature in the upcoming Mad-Hatter release. The old way to do this in Couchbase 5.5 or below is using the LET clause.  Post the question in Couchbase forum if you need help here).

Task 2: Now, create the full leaderboard and add the CUT information.  The golfers who got cut won’t play third or the fourth round. We use this information to determine the players who got cut.

Query 2.  Take the previous query and name it as a common table dx and then add the following expression to determine that cut.

Here’s the full query:

Task 3: Determine the winners.

We need to rank the players based on the total score to determine who won the tournament.  The rankings are skipped over if there are ties in the scores. Doing this in SQL without window functions is expensive. Here, we write the query using the RANK() window function.  Window functions are a feature in N1QL in the upcoming release (Mad-Hatter)

Query 3:  

Notice the ranks 4, 8, 9, 10, 11 missing because of the tie scores!

Task 4: Now, let’s find out how each player fared after round1, round2, round3 compared to the final round.  Using the window functions, it becomes as easy making the marshmallows covered with chocolate disappear.

Query 4: Use the same RANK() function, by ORDER BY the score of each day (day1, day1+day2, day1+day2+day3) instead of just the final score. 

Now you can see how the players moved up or down each day.

Task 5:  Create the full scorecard for the leader using the basic shot-by-shot statistics.

Query 5:  Brooks Koepka is the final winner of the US open.   Let’s get his scores, hole by hole and get the cumulative scores for him by round.  Notice how the simple SUM() and the COUNT() aggregate works as a window function with the OVER() clause.

This first partitions the score by day and then by hole – specified by PARTITION BY clause, in the order of the holes 1-18. The SUM then adds up the scores so far.

This SUM() function simply adds up the score score from day 1, hole 1 to day 4, hole 18 — this is specified by the ORDER BY d3.day, d3.hole within the OVER() clause..  The field ToTScore shows the total shorts for the tournament by Koepka at each hole.

Author

Posted by Keshav Murthy

Keshav Murthy is a Vice President at Couchbase R&D. Previously, he was at MapR, IBM, Informix, Sybase, with more than 20 years of experience in database design & development. He lead the SQL and NoSQL R&D team at IBM Informix. He has received two President's Club awards at Couchbase, two Outstanding Technical Achievement Awards at IBM. Keshav has a bachelor's degree in Computer Science and Engineering from the University of Mysore, India, holds ten US patents and has three US patents pending.

Leave a reply