Is No Processing Better Than Some Processing?

Let’s do a little thought experiment.

Yeah, I know, thinking. Who wants to do that?

Wait!

Before you tune me out and browse on to the next post on sexy indexes…

At least give me a couple of minutes.

Let’s say you have a website.

Well, not just any website…

That’s a little too generic.

OK, let’s say you’ve got a travel website.

Someplace where people come to make airline reservations.

I think we’ve all used one of those at some time.

So, your users come to your site, and want to see what flights are available.

What’s the first thing they do?

Do they write a question out in long-hand?

Not these days.

They probably start with selecting where they want to leave from.

Their local airport.

And then they probably want to select where they want to go.

So, two choices, both airports.

You could make them guess what airports are around.

I mean, I usually just enter the town that I need to go to and let the website figure out what airport I need to fly to.

But let’s assume that your users already know the airports at both ends of their trip.

Makes it easy for our little thought experiment.

So, you need to present the user with a list of airports to choose from.

Yeah, it’ll be a long list.

Seems there’s a lot of airports scattered around this big-ole world.

Just loading up our travel-sample bucket gives us almost 2 thousand.

Man, that’s a lot of airports!

Someone did a lot of data-entry…

But the good thing about airports is that they don’t change that often.

I mean, yeah there’s new ones being built…

And old ones being left for ruin…

But that all happens over time.

If an airport is abandoned, it’s often because a newer, shinier one got built.

And it takes a long time to build a new airport.

It’s not like they’re throwing them up every day.

So, back to our list of airports…

Long or not, you’ll have to provide some form of list of airports for the user to choose from.

And if your website is very busy, there could be a lot of users.

And we all want our websites to be busy.

So let’s just go ahead and assume that our website is not just busy…

It’s very busy.

Millions of users every day.

Thousands of users every minute.

That’s a lot of times that you’re having to serve up that list of airports!

So, let’s start by assuming that your airport documents in your Couchbase bucket are structured like the ones in our travel-sample bucket.

Hey, it comes with our Couchbase Server product, may as well use it!

Makes things easy…

So, just listing the airports using a simple N1QL query:

SELECT `travel-sample`.*
FROM `travel-sample`
WHERE type = "airport"
;

Gives us this:

[
 {
  "airportname": "Calais Dunkerque",
  "city": "Calais",
  "country": "France",
  "faa": "CQF",
  "geo": {
    "alt": 12,
    "lat": 50.962097,
    "lon": 1.954764
  },
  "icao": "LFAC",
  "id": 1254,
  "type": "airport",
  "tz": "Europe/Paris"
 },
 {
  "airportname": "Peronne St Quentin",
  "city": "Peronne",
  "country": "France",
  "faa": null,
  "geo": {
    "alt": 295,
    "lat": 49.868547,
    "lon": 3.029578
  },
  "icao": "LFAG",
  "id": 1255,
  "type": "airport",
  "tz": "Europe/Paris"
 },
...
]

Hmm, not going to be easy finding what our users need in this. Maybe if we sort it on the FAA airport code, and then eliminate those where the code is null…

[
 {
  "airportname": "Lansdowne Airport",
  "city": "Youngstown",
  "country": "United States",
  "faa": "04G",
  "geo": {
    "alt": 1044,
    "lat": 41.1304722,
    "lon": -80.6195833
  },
  "icao": null,
  "id": 8534,
  "type": "airport",
  "tz": "America/New_York"
 },
 {
  "airportname": "Moton Field Municipal Airport",
  "city": "Tuskegee",
  "country": "United States",
  "faa": "06A",
  "geo": {
    "alt": 264,
    "lat": 32.4605722,
    "lon": -85.6800278
  },
  "icao": null,
  "id": 8317,
  "type": "airport",
  "tz": "America/Chicago"
 },
...
]
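The query behind that output isn’t shown here, but something along these lines should produce it. It’s just the first query with a null filter and a sort bolted on, so treat it as a sketch rather than the exact statement that was run:

```sql
/* Same airport listing as before, minus airports without an FAA code,
   sorted by that code. */
SELECT `travel-sample`.*
FROM `travel-sample`
WHERE type = "airport" AND faa IS NOT NULL
ORDER BY faa;
```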

That’s better, but it’s more data than we need to be providing to the website.

So, let’s reduce what we’re returning to the FAA code, airport name, city, and country:

[
 {
  "airportname": "Lansdowne Airport",
  "city": "Youngstown",
  "country": "United States",
  "faa": "04G"
 },
 {
  "airportname": "Moton Field Municipal Airport",
  "city": "Tuskegee",
  "country": "United States",
  "faa": "06A"
 },
...
]
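Again, the statement itself isn’t shown, but a query like this one should give that trimmed-down result. It assumes the same filter and sort as before and simply names the four fields instead of selecting the whole document:

```sql
/* Return only the fields the airport picker needs. */
SELECT faa, airportname, city, country
FROM `travel-sample`
WHERE type = "airport" AND faa IS NOT NULL
ORDER BY faa;
```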

Ok, now we’re getting down to what we’re looking for.

So, if we query this we’re getting, oh, let’s say about a 50-60ms response time.

Not bad.

But with thousands of requests for this list every minute…

Hmm, maybe we can speed things up a bit.

Let’s make it a covered query by adding our own index that includes everything we need.

CREATE INDEX myFaaIndex on `travel-sample`(faa asc,airportname,city,country)
WHERE type = "airport" AND faa IS NOT NULL;

And now we re-run the query and get a response time of around 17.5 ms.

Much better.

But is it possible to do even better than this?

I mean, this list will be requested thousands of times every minute.

Those milliseconds will add up.

So, what if we took the results of this query, and saved it as a single document?

Let’s call it “airport_list”.
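One way to build that document without leaving N1QL is an UPSERT that selects its own value. This is only a sketch: the "airports" field name and the "airport_list" type value are my own assumptions, picked so the new document never matches the type = "airport" filter.

```sql
/* Roll the airport list up into a single document keyed "airport_list".
   The "type" and "airports" field names are assumptions for illustration. */
UPSERT INTO `travel-sample` (KEY "airport_list", VALUE doc)
SELECT {
         "type": "airport_list",
         "airports": ARRAY_AGG({
           "faa": a.faa,
           "airportname": a.airportname,
           "city": a.city,
           "country": a.country
         })
       } AS doc
FROM `travel-sample` AS a
WHERE a.type = "airport" AND a.faa IS NOT NULL;
```

I wouldn’t rely on the aggregated array coming back in FAA order, so whatever refreshes this document (or the page that displays it) may still want to sort the list.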

So now, if we run a query selecting the whole document with the “USE KEYS” clause:

SELECT `travel-sample`.*
FROM `travel-sample`
USE KEYS "airport_list";

This is giving us a response time around 14.5 ms.

Hmm, saved another 3 whole milliseconds!

And we might save another half-millisecond or two if we use the key-value access and get the document by its ID directly from the data service.

For a document that needs to be served thousands of times a minute.

Millions of times a day.

Those milliseconds will add up.

Yeah, I know. Airports change from time to time.

Yeah, but they don’t change very often.

Yes, this one document will need to be replaced every so often.

But that’s an operation that isn’t serving a high-activity website.

So who cares how slow (comparatively) that process may be?

Plus, I no longer need my covering index!

I can save a little bit of space on my index server!

Woo-hoo! Bonus!

Yeah, I know. I get excited about some odd things…

OK, so that was an exercise in shaving milliseconds off our response time. What about a query that takes a bit longer and does more?

Let’s say you run a call-center, and it’s important to keep track of how quickly your team is picking up incoming calls…

OK, let’s get a little more specific.

Let’s say you want to have a dashboard showing how many calls have been answered within five seconds, ten seconds, and the total number of calls that have come in today.

Something like…

SELECT SUM(five) as fiveCount, SUM(ten) as tenCount, SUM(incoming) as callCount
FROM
 (SELECT
    CASE 
      WHEN connectTime = 0 AND (endTime - startTime) <= 5000 THEN 1 
      WHEN connectTime > 0 AND (acceptTime - startTime) <= 5000 THEN 1
      ELSE 0 END as five,
    CASE 
      WHEN connectTime = 0 AND (endTime - startTime) <= 10000 THEN 1 
      WHEN connectTime > 0 AND (acceptTime - startTime) <= 10000 THEN 1 
      ELSE 0 END as ten, 
    1 as incoming 
  FROM sigc 
  WHERE type='cdr' AND startTime > $today
    AND callType BETWEEN 10 AND 2000) as calls
;

So, you start with an index on the startTime and callType properties, limiting it to documents of type “cdr”, only to find it takes about a second to run this query.
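That first index might look something like this. The index name is made up for the example; the keys and the filter come straight from the description above:

```sql
/* Partial index over call detail records only.
   idx_cdr_start_calltype is a hypothetical name. */
CREATE INDEX idx_cdr_start_calltype ON sigc(startTime, callType)
WHERE type = "cdr";
```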

And this isn’t the only query you want to use to populate your dashboard…

Ugh, this is going to be as slow as molasses!

OK, so let’s build a new index with all of the properties in it, making this a covered query, only to find that, while it’s improved, it still takes around 100 milliseconds.
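A covering version would need every property the dashboard query touches. Going by the query above, that’s startTime, callType, connectTime, acceptTime and endTime, so a sketch (again with a made-up index name) could be:

```sql
/* Covering index: every field the dashboard query reads is an index key,
   so the query never has to fetch the documents themselves.
   idx_cdr_call_stats is a hypothetical name. */
CREATE INDEX idx_cdr_call_stats
ON sigc(startTime, callType, connectTime, acceptTime, endTime)
WHERE type = "cdr";
```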

Hey, that’s a 10X improvement! That’s great, isn’t it?

Only your dashboard still refreshes like it’s running in molasses.

Thin, watery molasses, but still…

Hmm, what can we do to improve this?

What if, instead of using this query to feed the dashboard, we take the output and use it to create a new document with just the results?

Something with a known name, like call_stats_<some date>…

And we can run this query on a timer, using a cron job, or trigger it using the Couchbase Eventing service.

Only if we trigger it from the Eventing service, we probably want to run it with a scan consistency of at least at_plus to include the document update you are using to trigger the query.
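Whichever way it gets triggered, the background job could be as simple as wrapping the dashboard query in an UPSERT. Treat this as a sketch: $statsDate is a hypothetical named parameter holding the date part of the key, and the "call_stats" type value is my own assumption, chosen so the result document is never counted as a 'cdr'.

```sql
/* Store today's counts in one small document, call_stats_<date>.
   $statsDate and the "call_stats" type value are assumptions;
   $today is the same parameter the dashboard query already uses. */
UPSERT INTO sigc (KEY "call_stats_" || $statsDate, VALUE stats)
SELECT {
         "type": "call_stats",
         "fiveCount": SUM(calls.five),
         "tenCount": SUM(calls.ten),
         "callCount": SUM(calls.incoming)
       } AS stats
FROM
 (SELECT
    CASE
      WHEN connectTime = 0 AND (endTime - startTime) <= 5000 THEN 1
      WHEN connectTime > 0 AND (acceptTime - startTime) <= 5000 THEN 1
      ELSE 0 END AS five,
    CASE
      WHEN connectTime = 0 AND (endTime - startTime) <= 10000 THEN 1
      WHEN connectTime > 0 AND (acceptTime - startTime) <= 10000 THEN 1
      ELSE 0 END AS ten,
    1 AS incoming
  FROM sigc
  WHERE type = 'cdr' AND startTime > $today
    AND callType BETWEEN 10 AND 2000) AS calls;
```

Because it’s an UPSERT, every run just overwrites the same day’s document, so the dashboard always reads the latest counts from one known key.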

But now, when we retrieve the result document, we’re achieving response times in the low single-digit milliseconds, so close to a 1000X improvement over the original one-second query!
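The dashboard side then shrinks to a lookup by key, the same trick as with the airport list. Here $statsDate is a hypothetical named parameter carrying the date part of the document name:

```sql
/* Fetch the precomputed counts by document key; no index scan involved.
   $statsDate is an assumed parameter, not something from the original query. */
SELECT sigc.*
FROM sigc
USE KEYS "call_stats_" || $statsDate;
```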

And now we’ve got a responsive dashboard!

WOO-HOO!!!

Now we’re talking turbo-booster speed!

So, what is the lesson from these two scenarios?

Well, by taking any processing we needed to do on the data and making it a background task, so that our interactive data requests involve no processing, we’ve made things very speedy…

We’re talking faster than a speeding bullet fast!

Excuse us Superman, we’re coming through…

So, was that thought experiment really that painful?

Now on to those sexy indexes…

Couchbase, empowering data nerds everywhere…

(Hey Peter, I think I’ve got our new slogan here!)

Davis Chapman
