October 29, 2013

Couchbase 103: Q & A

In our ongoing training series, a number of questions come up each time, I list them out with their respective answers below!

Couchbase 103 - Document Modeling

Q: Assuming you can store a document as XML rather than JSON, what are your thoughts on one vs. the other for Couchbase?

A: You can store XML, as we can store strings. It's difficult to do Map-Reduce Views for indexing and querying them however since we don't parse XML server side to be able to use dot-notation like with JSON. XML is not native to javascript. The E4X extension for parsing XML is not a part of the V8 engine that we use in Couchbase to process the javascript map functions.

Q: Can we do bulk inserts/updates on Couchbase?

A: Typically you do this in the same way that you do any CRUD operations with Couchbase, you use the SDK's. We do have a command line tool that can read JSON files off disk, cbtransfer. I recommend using a fast VM (for instance Java/C/C++/Go instead of Ruby) and parallelize the effort if it's a large amount of documents. 

Q: Aren't you going back to the "relational" pattern when you use the multiple documents pattern?

A: Not necessarily, it's more of a separation of concerns, but there are still no joins. It is merely isolating user behaviors into separate documents, not normalization. There is a big difference between the two. Instead of having one document for many user behaviors, you are isolating the documents by key to particular user behaviors and still storing denormalized data. What we are getting around using JSON instead of tables and denormalized data instead of normalized is the ability to distribute data. When you have normalized tables, distribution of data and computing joins is very challenging. You cannot easily split up large tables on multiple servers, and when you do, computing joins requires aggregation of large tables over multiple servers which is going to be heavy in terms of computation and data transfer, and unable to sustain lots of querying. Of course the logic for CRUD operations in a distributed RDBMS situation is just as complex.

Q: From your experience, what is the best way to ensure as much as possible that an item is referred with the same name through oiut the database?

A: First of all, pick immutable characteristics for the keys, that way referential id's within JSON document don't need updating. If you do need to change a key that has been referred in other documents, you will likely have to create a View to find and update those documents after key change.

Q: You noted one advantage of CB being the straight line between app and data as it is stored in the DB, i.e. no Impedance Mismatch. What if the app changes and different data is used in different pages, or (even worse) when several apps are looking at the same data. Isn't the RDBMS approach better then? or would you still defend it. If so, why?

A: If different parts of a JSON document are used on different pages that is not an issue. If different user behaviors modify the same document and there is an opportunity for race conditions, you can use CAS operations or you can separate the documents into multiple. If several apps are looking at the same data it's not an issue either. If these apps are 3rd party apps that can do writes on the data, it's best if they go through an API so that formatting of JSON keys and other validations can be done to retain formatting and data integrity. In RDBMS systems you are relying on the database to do type validation, this responsibility moves from the DB to the App Tier with Couchbase. It's fairly easily solved, and it's a better practice anyways to have 3rd party apps go through an API rather than direct to database. Even in the case of multiple applications interacting with the same database, using SOA might be a better architecture as well.

Q: How do you "find" the size (in terms of bytes on disk, for instance) of a particular document?

A: We don't have a way to do this for individual documents. Documents are compressed on disk, using Snappy, so you can take an average compression ratio off the size of the document from client side (raw uncompressed json) and then take an arbitrary compression of 40% to see it's disk footprint. Or you can take an average by looking at the number of items (documents) and taking the disk usage and divide it to see average doc size on disk.

Q: How to "update" json item(s), for instance, timestamp of document update, using whatever way but not manually?

A: There is no automatic mechanism for updating individual json keys within a json document. Entire documents are passed back and forth with CRUD operations. If you are updating any json key with new value, it's done client side, including and not limited to updating a timestamp.

Q: How do you handle updates of data, won't it be very time consuming?

A: If you are referring simply to updating the JSON document, of course the sub-millisecond replace operation isn't what you are likely talking about. So, I had to make an assumption to answer this question, that you are referring to migrating one JSON document structure to another one, such as having a user document that has some json keys and you want to add/remove/transform them. As I mentioned in the webinar, you can do it in multiple ways. Of course it takes two different forms, one is a batch/cron job style, where you iterate through the documents and do your transform/mutation. How long this takes depends on the size of your dataset, how many machines are in your Couchbase cluster, and how many machines are doing the transformations. So "time-consuming" is a relative assessment. The other way of dealing with it would be to transform documents "on-demand", i.e. when they are retrieved per the normal usage of your application. For instance, when a User does a login, you can retrieve the user document, transform it, and replace it in Couchbase. 

Q: What Jasdeep has not mentioned is that data needs to reside in memory (at least for the speed justification). So what is the advantage over a relational in-memory-database which supports rich SQL? They do also good clustering ...

A: If you are doing a single vertically scaled node of an in-memory-relational database then you are going to be capped at the vertical scale of that one machine. Once you have more than one server and joins (and particularly distributed joins) you will find performance bottlenecks similar to the disk-bound ancestors purely because of physics. If you do a join across a network connection on many large tables, no matter which way you cut it, it's going to be slow, and if you are doing them often, the whole performance will degrade at scale.

Q: In a Counter-ID pattern for Users, where will the user count value reside?

A: It's a simple positive integer value with a key, like "user::count". Make sure you only increment your counter! And follow it with an add operation instead of a set, that way if you have a logic error then you will detect it.

Q: How would you do a reverse(inverted) index MapReduce?

A: This is more on topic for the Couchbase 104: Views webinar. But reverse indexes are tricky with Map-Reduce, you can create it by emitting the tag for when it occurs and then range query for that tag.

Q: When do you split up documents into multiple-documents?

A: There are generally three reasons to think about splitting up a document into multiple documents. First is that a particular User action can modify a document that also can be modified by another User action (and both are frequent). If multiple users can modify the same document at any given point (or nearly likely simultaneously), this is a "race condition" and you should isolate that document and use CAS operations. Second is when a single document might be "bloated" or very large, in which case it might be more efficient to have it split up in terms of code and code maintenance. Third is if you are using an atomic counter to represent a particular component of an object/class. This will have to be it's own key-value pair independent of the primary JSON document of course.

Q: How would you do a search if you had a ship ID and Lat/Long and you want to search by either parameter?

A: In that case because geo coordinates are preceise to three decimal places, no key pattern will help you with that one. However, we have two options for doing searchs for the kind of information you want. Our general Views Map/Reduce is more than satisfactory, but we also have specialized Views with Geo. Those allow for bounded box searching and other more advaned Geo techniques. I don't talk about that specifically in the Couchbase 104: Views webinar, but contact me if you are interested in learning more or need assistance regarding Geo and Couchbase.

Q: Is counting the users a better way (with .incr() function) to create the userID than for example a big random number or a GUID? I'm talking about the case where there can be many many new users per second, won't the counting lead to more conflicts than a big random key that has a very low probability of conflict?

A: Since atomic counters are, well, atomic, then you can count on them (in a single cluster), to be executed in order. If you're app servers are all pointing to that cluster then they are all executing incr operations simultaneously. But since they are atomic, each operation from each app server executing operations will generate a new integer, therefore they will all be unique. If you never use a decr (decrease) operation then you are good to go. It does get a little more complex if you have multiple data centers and are using XDCR (Cross Data Center Replication). In those cases it's best to prefix your counters by data center id (that you create).

Q: I'm using the couchbase-model gem in Rails, and I'm having a tough time getting a grip on counter-ID and namespacing the keys in the Rails paradigm, e.g., using the Object.find() method or accessing/incrementing non-JSON refdocs.  Could you post an example that mixes and matches Counter-ID with Rails, including incrementing from the Model's bucket?

A: This is going to need something better than a blog to show. I'll figure out a way to put together a gist and post a link here. Give me a little time to do so though. :)

Q: Is there a way to grab multiple docs for different id's. The example I’m thinking of is having a list of Facebook user id’s we want to map to their user docs. Is there a way we can pass multiple Facebook id's and get back an array of docs?

A: We have a multi-get operation in every sdk just for this purpose. In the resulting array if the document didn't exist, it will have a null/nill element to indicate it (maybe Java is a different return, will have to check, but I know it has it too).

Q: Are views the only way you can use compound keys? Is there a way to use it with get or setup keys to use this kind of functionality without a view http://blog.couchbase.com/understanding-grouplevel-view-queries-compound-keys 

A: This is more a topic for Couchbase 104: Views and Indexing, but basically a key can be anything you want, so technically, yes, you can have a compound key (just depends on how you define compound). However, the link you are referring to is focused on querying a View-Index, and those have key's as well, index-keys, what we build our B+trees with. In that case, you are taking parts of the key and breaking it up based on "," (comma) delimiter. It's a different beast there we are talking about.

Q: Currently we are using a view to lookup Facebook user id to get our own app user ID. It sounds like if we actually use the lookup pattern you suggested this may be faster than using a view?

A: Yes, and of course it depends. Scale is a factor here as well. The more nodes in the cluster, the more wide the View query scatter-gather process will be, i.e. the more nodes it has to scatter to and gather from in terms of View query results. So, just by physics alone, Lookup will be faster. The tradeoff is that you are adding more documents, and therefore more RAM/metadata requirements, but you are gaining tremendous scalability as a result, because binary operations go over a persistent connection to one server for the facebook_id->app_user_id, and then again a second binary operation to the same or another node through a persistent connection will always be faster than scattering a View query across all nodes and gathering results. Views are very powerful because they are more flexible than key patterns, however in this scenario, go for speed.

Q: View’s have a stale option which seems like it can mitigate the constancy issue vs. two separate get calls. Is it actually faster to have two calls to the db versus one well managed view? I could be concerned at scale, two calls vs. one on a view.

A: It doesn't mitigate the consistency issue. Indexes are always eventually consistent in Couchbase (until we evolve it ourselves which is on the radar), but please see the above question/answer as well as watch Couchbase 104: Views and Indexes! It might change your opinion.

Q: In the example you are demoing, we are discussing about singe document and its attributes/properties via JSON. How do we handle, if you want to get stat/count across multiple objects/documents For example: a) get me all the users with fav color blue? b) get me all the user who uses visa credit card?

A: This requires Views and the Map/Reduce technique to gather all those document data elements and put them into an Index structure so that you can make queries such as those. This is a good example of things you cannot really do with Key patterns, and require VIews and/or Elastic Search integration. In these scenarios you are talking about, you could use either solution.

Q: *Require* views/map-reduce? What about storing second document u::uuid { email:user@domain.com } to then access u:user@domain.com? When should this be done over view/map-reduce?

A: This question was referring to what happens without a Lookup pattern. If you have randomly generated document keys, then you typically need to create an Index to be able to lookup documents via different JSON properties. However, if you create a Lookup pattern then you can do a double-get to get the primary document, the first get will be based on known information (email address), the value will be the randomly generated id that is they to the primary document;  the second get will use that value and get the primary document.

Q: Best way to query those notification documents?

A: In the example I was mentioning, the keys themselves are timestamp's for when notifications should be delivered. Here is a gist that can help explain the paradigm: https://gist.github.com/scalabl3/7235173

Q: The main purpose of normalization is to avoid update anomalies; since all programs have bugs, is there a way in couchbase to prevent these update anomalies and not have to depend on the developer "getting it right"?

A: The main purpose of normalization was to save disk space. If you look back at the time when SQL was created to get about 1GB of disk storage (which wasn't possible in a single disk) was about $700,000. Normalization reduced data redundancy by pointing to the same data from multiple places, then re-assembling that data by aggregating or JOIN'ing it together. It became very clear very quickly in practice that if you are inserting, updating and deleting data in a normalized scheme, if any one part fails your data integrity goes down the tube. It is precisely because of normalization that transactions are necessary! In a denormalized form, or aggregate form, transactions become far less necessary, but we do have optimistic concurrency and pessimistic concurrency to have single document transactions and you can use a two-phase commit.

There really isn't fullproof protection in any system for developers "getting it wrong". Let me know if you figure that one out :).

Q: Would it not be advantageous to combine the Counter-ID patter with the Lookup pattern?

A: It's most definitely advantageous because then you have a total count of documents for that type as well. If you use random id and lookup you don't have the total count. With the total count, you can iterate through the collection by just generating keys in batches up to the total count. Otherwise you need to create a View (an Index) to iterate through the collection.

Q: Wound not an index for the RDBMS give you linear performance as well?

A: It gives you great performance as long as you can live on one server and vertically scale it. All the mess comes when you have to split traffic up onto more than one server. This is where the birth of all the new databases exploded on the scene could be traced to in the early 2000's when audiences "blew up" relational databases ability to handle larger amounts of traffic. Things like memcached came first, and then the NoSQL's. 

Q: Why not extending CB with abstract functionalities based on those patterns, on server-side? right now, patterns need to be individually implemented on client-side.

A: It's an idea, and I actually think there is some validity to doing it, but there is also a good argument on the other side that it should be under programmatic control. As soon as you add a new feature, people want to tinker with it. :)

 

Thanks for attending Couchbase 103!

Jasdeep
@scalable

Comments