Creating a content store with Couchbase - The Learning Portal

Marty Schoch of Couchbase

Two weeks ago McGraw Hill presented at CouchConf SF, and our users expressed so much interest that I thought I'd share more details in a blog post. Earlier this year, McGraw Hill and Couchbase teamed up to build a proof-of-concept application showing off the power of using Couchbase and ElasticSearch together.

The goal of the project was to build a self-adapting learning portal that delivers personalized results. Specifically, that meant:

  • Allow users to browse and search a variety of content (articles, images, and video)
  • Provide fast access to both content and metadata
  • Enhance user profiles based on user behaviors and actions
  • Incorporate the user's profile into search queries and deliver personalized results

Architecture

Couchbase Server was used to store all of the content metadata, as well as the full-text source of the text articles. This gives the application access to the primary data set with sub-millisecond latency.

ElasticSearch was chosen to handle the full-text search requirements for the application. ElasticSearch combines rich querying capabilities with excellent clustering capabilities, making it a great match with Couchbase. Integration between Couchbase Server and ElasticSearch was provided by the Couchbase Transport plug-in. This transport uses the Cross Data Center Replication feature of Couchbase Server 2.0 to reliably transfer all document mutations to the ElasticSearch index (Learn more about this here).

On the front end, the decision was made to build the application using Ruby on Rails. Our primary objective in the code was to clearly document best practices for using Couchbase and ElasticSearch together.

Learning Portal

Here is what a user sees when they first log in to the application.

 
Fast Access to Documents using Couchbase Client SDK

When a user selects a particular piece of content, the data is loaded directly from Couchbase Server by its key. Here's a sample document in Couchbase:

{
  "title": "Vince Shields",
  "url": "http://en.wikipedia.org/wiki/Vince_Shields",
  "type": "text",
  "is_text": 1,
  "is_video": 0,
  "is_image": 0,
  "popularity": 0,
  "views": 0,
  "categories": [
    "1900 births",
    "1952 deaths",
    "Baseball people from New Brunswick",
    "Canadian baseball pitcher stubs",
    "Fort Smith Twins players",
    "Independence Producers players",
    "Major League Baseball pitchers",
    "Major League Baseball players from Canada",
    "People from Fredericton",
    "St. Louis Cardinals players"
  ],
  "timestamp": "2012-01-06T02:27:11Z",
  "content": "{{Infobox MLB player\n|name=Vince Shields...",
  "authors": [
    {
      "name": "Chris the speller"
    }
  ],
  "contributors": [
    {
      "name": "Chris the speller",
      "timestamp": "2012-01-06T02:27:11Z"
    }
  ]
}

And here is the same document when viewed in the application:

 
Top Contributors and Top Tags using Couchbase Map Reduce Views

Users of the system can browse content by exploring the system's top contributors and top tags.

Let's take a closer look at how the top tags are determined.

First, here is the map function we're using:

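function (doc) {
  // Only index typed documents that actually carry a categories array;
  // calling forEach on a missing field would throw.
  if (doc.type && Array.isArray(doc.categories)) {
    doc.categories.forEach(function (category) {
      // The value is unused: the _count reduce only counts rows per key.
      emit(category, null);
    });
  }
}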

And we're using the built-in _count reduce function.

When we access this view with a group_level of 1, we see each tag and the number of times it has been used to describe a document.
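To make the grouping concrete, here's a rough sketch that runs the same map logic over a few in-memory documents and counts the emitted keys, which approximates what the grouped _count view returns. The sample documents and the runView helper are made up for illustration, and this is not how Couchbase executes views internally.

```javascript
// Hypothetical sample documents (not from the real data set).
const docs = [
  { type: "text", categories: ["Living people", "Disambiguation pages"] },
  { type: "image", categories: ["Living people"] },
  { title: "No type field, so the map function skips it", categories: ["Living people"] },
];

// The map logic from above (with a guard against a missing categories array),
// taking emit as a parameter so it can run outside Couchbase.
function map(doc, emit) {
  if (doc.type && Array.isArray(doc.categories)) {
    doc.categories.forEach(function (category) {
      emit(category, null);
    });
  }
}

// Local stand-in for the view engine: count emitted rows per key and
// return the rows ordered by key. This uses a plain string sort, while
// Couchbase applies its own collation rules.
function runView(documents) {
  const counts = new Map();
  documents.forEach(function (doc) {
    map(doc, function (key) {
      counts.set(key, (counts.get(key) || 0) + 1);
    });
  });
  return Array.from(counts.keys())
    .sort()
    .map(function (key) {
      return { key: key, value: counts.get(key) };
    });
}

const rows = runView(docs);
console.log(rows);
```

Note that the rows come back ordered by tag name rather than by count, which matters for the next step.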

Couchbase views are sorted by key, not by the reduced value, so we cannot directly query for the top 8 tags. Instead, we have a job that runs every 10 minutes, queries this view, sorts the results by count, and stores the top 8 in another document in Couchbase. Here is what that document looks like:

{
  "tags": [
    {
      "name": "Living people",
      "count": 27554
    },
    {
      "name": "Persondata templates without short description parameter",
      "count": 20971
    },
    {
      "name": "All articles with unsourced statements",
      "count": 13509
    },
    {
      "name": "Article Feedback Blacklist",
      "count": 9205
    },
    {
      "name": "Articles with hCards",
      "count": 9028
    },
    {
      "name": "Disambiguation pages",
      "count": 5912
    },
    {
      "name": "Articles lacking sources from December 2009",
      "count": 4158
    },
    {
      "name": "Commons category template with no category set",
      "count": 2904
    }
  ]
}

Now we have very fast access to the top tags, refreshed every 10 minutes.
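The core of that periodic job is a sort followed by a truncate. Here's a minimal sketch, assuming the view rows arrive as { key, value } objects; the buildTopTags name and the sample rows are hypothetical, and the scheduling and the write back to Couchbase are left out.

```javascript
// Build the top-tags document from grouped view rows.
// `rows` is assumed to look like [{ key: "tag name", value: count }, ...].
function buildTopTags(rows, limit) {
  const tags = rows
    .slice() // copy so the caller's array is not reordered
    .sort(function (a, b) {
      return b.value - a.value; // highest count first
    })
    .slice(0, limit)
    .map(function (row) {
      return { name: row.key, count: row.value };
    });
  return { tags: tags };
}

// Sample rows in key order, as a view would return them.
const viewRows = [
  { key: "Articles with hCards", value: 9028 },
  { key: "Disambiguation pages", value: 5912 },
  { key: "Living people", value: 27554 },
];

const topTagsDoc = buildTopTags(viewRows, 2);
console.log(JSON.stringify(topTagsDoc, null, 2));
```

In the real job the limit would be 8, and the resulting document would then be stored under a well-known key so the application can load it with a single get.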

 
Full-Text Search

Users of the system can also perform complex search queries.

Using this interface a user could search for the term "water". This results in the following search query being sent to ElasticSearch:

{
  "query": {
    "query_string": {
      "query": "water"
    }
  }
}
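As a small sketch of how the application side might assemble that request from the user's input: the index name is taken from the response shown in this post, while the buildSearchRequest helper and its return shape are assumptions made for illustration.

```javascript
// Build the ElasticSearch search request for a user's search term.
// Hypothetical helper: the real application may structure this differently.
function buildSearchRequest(term) {
  return {
    method: "POST",
    path: "/learning_portal/_search",
    body: { query: { query_string: { query: term } } },
  };
}

const request = buildSearchRequest("water");
console.log(request.path);
console.log(JSON.stringify(request.body));
```

One thing to keep in mind is that query_string parses its input using the Lucene query syntax, so raw user input containing special characters may need escaping or validation first.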

The query for "water" matches 42 documents. Below is a subset of the response showing one document:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 42,
    "max_score": 1.0178552,
    "hits": [
      {
        "_index": "learning_portal",
        "_type": "couchbaseDocument",
        "_id": "18087337",
        "_score": 1.0178552,
        "_source": {
          "meta": {
            "id": "18087337",
            "rev": "1-0017a16b2b29dc9c0000000000000000",
            "flags": 0,
            "expiration": 0
          }
        }
      },
      ...

The important thing to note here is that the full document body is not included in the response from ElasticSearch. This is by design: we configured the index to not store the full source documents. The reason is simple: we already have fast access to the documents in Couchbase. Using the Couchbase Client SDK, we can perform a multi-get operation and efficiently pull down the document bodies. This allows us to render the search results screen:
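The search-then-fetch flow can be sketched roughly as follows. The Map-based multiGet is a stand-in for the SDK's multi-get, not its actual API, and the sample response and documents are hypothetical.

```javascript
// Pull the document IDs out of an ElasticSearch response, in ranking order.
function extractIds(searchResponse) {
  return searchResponse.hits.hits.map(function (hit) {
    return hit._id;
  });
}

// Stand-in for the SDK multi-get: look up many keys in one call.
// `store` is a plain Map here; a real client would talk to Couchbase.
function multiGet(store, ids) {
  return ids
    .filter(function (id) {
      return store.has(id); // skip IDs with no document, e.g. removed after indexing
    })
    .map(function (id) {
      return store.get(id);
    });
}

// Hypothetical data shaped like the examples in this post.
const searchResponse = {
  hits: {
    total: 2,
    hits: [
      { _id: "18087337", _score: 1.02 },
      { _id: "42", _score: 0.9 },
    ],
  },
};
const store = new Map([
  ["42", { title: "Sample document B" }],
  ["18087337", { title: "Sample document A" }],
]);

const results = multiGet(store, extractIds(searchResponse));
console.log(results.map(function (doc) {
  return doc.title;
}));
```

Because the IDs are fetched in the order ElasticSearch returned them, the document bodies keep the relevance ranking of the search results.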

More Information

  • Check out the presentation Chris Tse gave at CouchConf
  • All of the source code for the Learning Portal is available on GitHub
  • Want to learn more about the Couchbase ElasticSearch integration and have your questions answered? Sign up to attend the webinar on October 24th