November 15, 2012

Fun with Couchbase and Markov Chains

I’ve been hearing about Markov chains for long enough that it was time to learn more about them and build a simple, fun Markov chain application. I’m sure you don’t want to get bogged down in the mathematical details of Markov chains; learning by building an application is where all the fun is!

In this blog post, we will show how to build “Marky”, an application that uses Markov chains to generate nonsensical tweets based on your Twitter history. It uses Couchbase Server to store and process the data behind these tweets.

Marky uses Couchbase Server views to process data
Marky’s map function is:

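function (doc, meta) {
   // Only index documents that carry text in a "body" field.
   if(doc.body) {
       var words = doc.body.split(/\s+/);
       // Pair the first word with null to mark the start of a text.
       if (words.length >= 1) {
           emit([null, words[0]], 1);
       }
       // Slide a two-word window over the text and emit each consecutive pair.
       for(var i = 0; i < (words.length - 1); i++) {
           var pair = [words[i], words[i+1]];
           emit(pair, 1);
       }
   }
}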

At a high level, the view splits text into smaller chunks using a sliding window over two consecutive words. Marky then tries to chain these chunks back together into sentences, choosing each next word by statistical weight. In the end, you get some nonsensical text that is fun to read.

For example, given the input text “In this blog, we will show you how to build an application”, it will emit these key/value pairs:

Key                   Value

[null,"In"]           1
["In","this"]         1
["this","blog,"]      1
["blog,","we"]        1
["we","will"]         1
["will","show"]       1
["show","you"]        1
["you","how"]         1
["how","to"]          1
["to","build"]        1
["build","an"]        1
["an","application"]  1
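To check these pairs without a Couchbase cluster, the map function can be exercised locally by stubbing out emit. The following is a minimal sketch in plain JavaScript (Node.js); the rows array and the local emit stub are assumptions made for illustration and are not part of Couchbase or Marky.

```javascript
// Rows collected from the map function; a local stand-in for the view index.
const rows = [];

// Hypothetical stub for the emit function that Couchbase provides to map functions.
function emit(key, value) {
  rows.push({ key, value });
}

// Marky's map function from this post, given a name so it can be called directly.
function map(doc, meta) {
  if (doc.body) {
    var words = doc.body.split(/\s+/);
    if (words.length >= 1) {
      emit([null, words[0]], 1);
    }
    for (var i = 0; i < words.length - 1; i++) {
      emit([words[i], words[i + 1]], 1);
    }
  }
}

// Run the map function over one example document.
map(
  { body: "In this blog, we will show you how to build an application" },
  { id: "example" }
);

// Print each emitted key as JSON next to its value.
for (const row of rows) {
  console.log(JSON.stringify(row.key), row.value);
}
```

The sentence has 12 words, so the function emits one start-of-text row plus 11 consecutive pairs.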

To generate a word, we query the view using the last word we output. For example, to get candidates for a word to follow “the”, we use the query parameters startkey=["the"]&endkey=["the",{}]&group_level=2&reduce=true

This will return all the word pairs we emitted that start with “the”, group identical pairs together, and run the view’s reduce function on each group. Marky uses the built-in reduce _sum, which adds together the values it is given. Running this on the database backing dkatz_ebooks yields:

Key                         Value
["the","#1"]                1
["the","100"]               1
["the","2"]                 1
["the","ability"]           3
["the","absolute"]          1
["the","answer"]            1
["the","app"]               1
["the","application"]       1
["the","area,"]             1
["the","background."]       1
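The effect of that query can be approximated in memory. The sketch below is not Couchbase code: viewQuery only shows how the keys are JSON-encoded into the query string, and candidatesFor imitates the key range, group_level=2 grouping, and _sum reduce over a handful of made-up rows. Both function names and the sample rows are assumptions for illustration.

```javascript
// Build the query string for a given previous word (use null for "start of text").
// Keys are JSON-encoded; URLSearchParams takes care of percent-encoding.
function viewQuery(word) {
  const params = new URLSearchParams({
    startkey: JSON.stringify([word]),
    endkey: JSON.stringify([word, {}]),
    group_level: "2",
    reduce: "true",
  });
  return params.toString();
}

// In-memory imitation of the reduced view output for one previous word.
function candidatesFor(rows, word) {
  const counts = new Map();
  for (const { key, value } of rows) {
    const [first, second] = key;
    if (first !== word) continue; // stands in for the startkey/endkey range
    counts.set(second, (counts.get(second) || 0) + value); // grouping + _sum
  }
  // Plain string ordering here; Couchbase applies its own collation rules.
  return [...counts.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([second, total]) => ({ key: [word, second], value: total }));
}

// Made-up map output: "the ability" was emitted three times.
const sampleRows = [
  { key: ["the", "ability"], value: 1 },
  { key: ["the", "app"], value: 1 },
  { key: ["the", "ability"], value: 1 },
  { key: ["to", "build"], value: 1 },
  { key: ["the", "ability"], value: 1 },
];

console.log(viewQuery("the"));
console.log(candidatesFor(sampleRows, "the"));
```

Passing null instead of a word produces the keys [null] and [null,{}], which is the same query shape used to find words that start a text.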

To pick the word to output after “the”, we choose a word that follows it at random, but weight our choice based on the frequency of the word pair appearing in the input. The counts above sum to 12, so “ability” has a 3/12, or 25%, chance of being chosen here, while each of the other words has a 1/12 chance, or about 8.3%.
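One common way to make such a weighted choice is to draw a random number scaled to the total count and walk the candidates until the running total passes it. The sketch below assumes the candidates arrive as key/value rows like the ones listed above; pickWeighted and its injectable random parameter are illustrative names, not Marky’s actual code.

```javascript
// Pick the second word of one candidate row, weighted by the row's count.
// `random` defaults to Math.random and can be replaced for repeatable runs.
function pickWeighted(candidates, random = Math.random) {
  if (candidates.length === 0) return null; // nothing to choose from
  const total = candidates.reduce((sum, c) => sum + c.value, 0);
  let threshold = random() * total;
  for (const c of candidates) {
    threshold -= c.value;
    if (threshold < 0) return c.key[1];
  }
  // Guard against floating-point rounding: fall back to the last candidate.
  return candidates[candidates.length - 1].key[1];
}

// The rows listed above for words following "the".
const candidates = [
  { key: ["the", "#1"], value: 1 },
  { key: ["the", "100"], value: 1 },
  { key: ["the", "2"], value: 1 },
  { key: ["the", "ability"], value: 3 },
  { key: ["the", "absolute"], value: 1 },
  { key: ["the", "answer"], value: 1 },
  { key: ["the", "app"], value: 1 },
  { key: ["the", "application"], value: 1 },
  { key: ["the", "area,"], value: 1 },
  { key: ["the", "background."], value: 1 },
];

console.log(pickWeighted(candidates));
```

With these rows, “ability” is returned whenever the random draw falls between 0.25 and 0.5, which is exactly a quarter of the range.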

Because the map function pairs the first word of each text with null (for example, [null,"In"] in the earlier example), we can run the same query with null to begin a new output and get words that are likely to start a thought, or tweet, or whatever our input was. We also need to do this if we get unlucky and the view query returns no candidate words. This can happen if the word in the query has only ever shown up at the end of the input texts we processed.
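Putting the pieces together, the whole generation loop can be sketched without a database. In the sketch below an in-memory Map plays the role of the view, with null as the start-of-text key; buildIndex, pickNext, generate, and the maxWords limit are all assumptions made for illustration rather than Marky’s real implementation.

```javascript
// Count followers for each previous word; null marks the start of a text.
// Empty strings from stray whitespace are skipped, a small deviation from the map function.
function buildIndex(texts) {
  const index = new Map();
  for (const text of texts) {
    let prev = null;
    for (const word of text.split(/\s+/).filter(Boolean)) {
      if (!index.has(prev)) index.set(prev, new Map());
      const followers = index.get(prev);
      followers.set(word, (followers.get(word) || 0) + 1);
      prev = word;
    }
  }
  return index;
}

// Weighted random pick among the followers of `prev`; null if there are none.
function pickNext(index, prev, random) {
  const followers = index.get(prev);
  if (!followers) return null;
  let total = 0;
  for (const count of followers.values()) total += count;
  let threshold = random() * total;
  let last = null;
  for (const [word, count] of followers) {
    threshold -= count;
    last = word;
    if (threshold < 0) return word;
  }
  return last; // floating-point guard
}

// Generate up to maxWords words, restarting from null at dead ends.
function generate(index, maxWords, random = Math.random) {
  const out = [];
  let prev = null;
  while (out.length < maxWords) {
    let next = pickNext(index, prev, random);
    if (next === null) {
      next = pickNext(index, null, random); // no candidates: begin a new thought
      if (next === null) break; // the index is empty
    }
    out.push(next);
    prev = next;
  }
  return out.join(" ");
}

const index = buildIndex([
  "In this blog, we will show you how to build an application",
  "we will build an application that uses Markov chains",
]);
console.log(generate(index, 12));
```

A real tweet generator would more likely stop on a character limit than on a word count; the word limit simply keeps the sketch short.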


Marky Application

Marky uses a simple Clojure wrapper built by the community. To set up Marky, create a marky-config.clj file and point it to your Couchbase Server cluster and Twitter account. Add some seed data (Twitter user accounts or Atom feeds) and you're ready to launch the app.


{:bucket "default"
:pass ""
:cburl "http://localhost:8091/"
:twitter {:app-key "XXXXXXXXX"
          :app-secret "XXXXXXXXXX"
          :user-token "XXXXXXXX"
          :user-secret "XXXXXXXX"}
:jobs
[; :period, :after are in seconds, :ttl is in days.
 {:type :twitter :user "user-handle1" :period 3600 :ttl 60}
 {:type :twitter :user "user-handle2" :period 3600 :ttl 60}
 {:type :send-tweet :period 3600 :after 600}
 {:type :atom :url "http://some-domain/rssfeed.php" :period 86400 :ttl 60}]}

Here are some fun Marky tweets -

Want To Get Marky?

You can download the Marky source code here
You can also contribute to the Clojure wrapper project here

Have Fun!

----

Thanks to Aaron for putting together the code in Clojure.
