December 6, 2013

Couchbase 102: Q & A

A number of questions come up each time we run our ongoing training series. I list them below with their answers.

Couchbase 102 - SDK Operations

Operations Concepts By Language: https://github.com/couchbaselabs/DeveloperDay

Q: Are these operations the same for mobile applications?

A: No, they aren't the same. Couchbase Lite has its own API. You can see the Couchbase Lite Beta 2 API documentation for iOS here: http://couchbase.github.io/couchbase-lite-ios/docs/html/annotated.html

Q: Can Couchbase guarantee all operations are persisted?  i.e., once a put to Couchbase returns, the value is considered persisted even if Couchbase server has a hard crash while the data is still in the disk queue and replication queue.  What mechanism is used to guarantee this?

A: No, we cannot make that guarantee. It IS possible for the client to see a successful operation and for the server to hard crash before the data has been replicated or persisted. In a hard crash no system is safe. What we try to do is minimize the amount of data loss while maintaining high performance. With replication and failover we can achieve that: replicas can be promoted quickly, which keeps the cluster and application going during node failures. We also offer durability observations for storage operations on a per-document, per-operation basis. With these, you only get a success callback once the document has been replicated, persisted, or both. They increase the latency of operations, so they are best used when needed rather than for every operation.
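
To make the durability idea concrete, here is a minimal Python sketch of a "wait until persisted/replicated" check built on polling. It is a model only: `observe` is a hypothetical callable that returns how many nodes have persisted and replicated a key, and `wait_for_durability`, `persist_to` and `replicate_to` are names chosen for illustration, not SDK calls.

```python
import time

def wait_for_durability(observe, key, persist_to=0, replicate_to=0,
                        timeout=1.0, interval=0.01, sleep=time.sleep):
    """Poll observe(key) until enough nodes report the write, or give up.

    `observe` is a hypothetical callable returning a tuple
    (persisted_count, replicated_count) for the key.
    """
    waited = 0.0
    while True:
        persisted, replicated = observe(key)
        if persisted >= persist_to and replicated >= replicate_to:
            return True            # durability requirement met
        if waited >= timeout:
            return False           # caller decides how to handle the miss
        sleep(interval)            # every extra poll is added write latency
        waited += interval
```

Each poll adds to the latency of the write, which is why this is better reserved for the operations that really need it.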

Q: I am sure that using Storage with Observe will introduce a large overhead... Can Jasdeep provide some number (ratio with and without)?

A: There is no way to give an exact figure here because it varies based on hardware and horizontal scale. What I can say is that it's significant enough to think twice about doing it for every single storage operation, especially if you have high write volume.

Q: What happens when you have a new K-V pair (same key, but diff values) that are being simultaneously inserted at exactly the same time on two Data Centers? In this case, since this is being written to cache (i.e. the XDCR queue is still behind), then you can't really be "consistent", right? How is this conflict handled?

A: Consistency only applies within a cluster. XDCR is by definition eventually consistent, because data has to move between clusters across the wire. I cover the XDCR conflict resolution rules in the Couchbase 105 training webinar. Our conflict resolution logic is rather simple at this point, since conflict resolution logic is notoriously difficult to write universally. In your scenario, both documents are new, each has been written only once, and they were written to two clusters at the same time with bi-directional XDCR, so a "winner" will be picked at "random". Our conflict resolution rules are based on the number of revisions a document has, along with other criteria.

Q: Can you lock documents across clusters with XDCR?

A: No, document locking is local to a Couchbase cluster only, not across clusters.

Q: If I want to modify/update a few items in a document, how can I write Java (or any other language) code to achieve that, instead of manually finding and updating a particular item/field?

A: This is actually pretty standard usage. If you are using Java, you transcode/unmarshal the JSON to your object, modify whatever values you need, and transcode/marshal it back to JSON for a replace operation. We don't have an API for partial updates of documents on the Couchbase Server side. Between the clients and Couchbase Server, we always transmit the entire document back and forth. That means we cannot send partial JSON modifications to Couchbase Server and have the server fetch the document and apply the changes.
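
Here is a minimal Python sketch of that read-modify-replace cycle. The `bucket` dict is a hypothetical in-memory stand-in for a Couchbase bucket and `update_fields` is a helper name chosen for illustration; neither is an SDK call.

```python
import json

# Hypothetical stand-in for a bucket: key -> JSON string.
bucket = {"user::1001": json.dumps({"name": "Ann", "city": "Austin", "visits": 3})}

def update_fields(store, key, changes):
    """Fetch the whole document, change a few fields, write the whole document back."""
    doc = json.loads(store[key])     # unmarshal the full JSON document
    doc.update(changes)              # modify only the fields we care about
    store[key] = json.dumps(doc)     # marshal and replace the full document
    return doc

updated = update_fields(bucket, "user::1001", {"city": "Denver", "visits": 4})
```

Only two fields change, but the whole document still travels in both directions.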

Q: Can 2 different versions (say v2.0.1 and v2.2) of Couchbase co-exist on same box? 

A: You can install them both, and running them both in a cluster is possible but tricky, and definitely non-standard usage. You would have to change all the port mappings and expectations of where the second server is to be able to cluster them. Generally I'd say it's not worth the effort. If you are familiar with Docker, it's possible to create a cluster in Docker with CoreOS.

Q: Do insert/update ops create a lock on the entire collection or the document? Will Couchbase allow read ops on the coll/doc?

A: We don't lock by default. Individual documents can be locked for writes via the GetWithLock operation, with a lock timeout of at most 30 seconds, after which the lock is released automatically. There is no concept of a "collection", so there are no collection locks, nor are there bucket (database) locks. Using CAS operations for optimistic concurrency is generally a much better practice than locking. Locking is only really necessary in specific use cases where CAS may not be sufficient, or to actually prevent any modification for a given amount of time. We don't need to lock to read or write.

Q: Do you have something like GridFS, which MongoDB has for image/video files?

A: We do not break up documents into separate pieces, no. However, you can do this within your application if your use case demands it.
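
If you do want to split large values in your application, a minimal Python sketch could look like this. The key layout (`<key>::chunk::<n>` plus a small JSON manifest under the main key) and the function names are assumptions made for illustration, not a Couchbase convention.

```python
import json

def split_into_chunks(key, data, chunk_size):
    """Return the documents to store: one manifest plus one entry per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    docs = {f"{key}::chunk::{n}": chunk for n, chunk in enumerate(chunks)}
    # The manifest records how many chunks to fetch when reading the value back.
    docs[key] = json.dumps({"chunks": len(chunks), "size": len(data)})
    return docs

def reassemble(key, store):
    """Read the manifest, then fetch and join the chunks in order."""
    manifest = json.loads(store[key])
    return b"".join(store[f"{key}::chunk::{n}"] for n in range(manifest["chunks"]))
```

Because the chunk keys are predictable from the manifest, reading the value back is a single multi-get rather than a query.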

Q: Is there a bulk upload capability for JSON documents in Couchbase Server? We want to replicate changes from our enterprise systems (data stores) into Couchbase.

A: Some of our SDKs have multi-set, but generally people script this. Doing ops is doing ops regardless, so I'd generally recommend finding a way to parallelize the load and using the fastest VMs or languages (Java/C) over scripting languages. The command-line tool cbtransfer can do some of what you are asking, depending on the source of the JSON (if they are files, for instance). The tool basically uses the C library.
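
As a rough illustration of parallelizing a load, here is a minimal Python sketch using a thread pool. `store_one` is a hypothetical callable standing in for whatever single-document store operation your client provides, and `bulk_load` is a name chosen for this example.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def bulk_load(docs, store_one, workers=8):
    """Store every (key, doc) pair in parallel and return the keys that failed."""
    def attempt(item):
        key, doc = item
        try:
            store_one(key, json.dumps(doc))
            return None               # success
        except Exception:
            return key                # remember the key so the caller can retry it
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, docs.items()))
    return [key for key in results if key is not None]
```

Returning the failed keys rather than stopping at the first error lets the caller retry just those documents.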

Q: What are pools in "http://127.0.0.1:8091/pools" when you specify this in Java code?

A: It's just part of the connection URI. In Java, we have had that syntax for a long time rather than just the IP:Port; it was one of our very first SDK clients. We allowed for the "idea" of multiple pools, but there really is only one. The other SDKs only use the IP:Port syntax and append the pools/default/{bucket} part of the URI themselves, which the Java client is effectively doing as well.

Q: Can you elaborate on how CAS works? What if the CAS value mismatches?

A: Every time you store a document or modify a document (for example, change its expiration), a new long integer value is generated and associated with the document (in the metadata). That value represents its current state, similar to a CRC check or MD5 hash. If you do a replace operation and provide the CAS value that you last retrieved for a document, the operation proceeds if it matches. If it mismatches, you get a CAS Mismatch error, meaning that the document has been modified and has a different CAS now. Then you can handle that race condition. This is called optimistic concurrency because no server resources are required to handle this, i.e. no locking is needed.
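
The CAS flow described above can be modeled in a few lines of Python. This is a sketch of the semantics only: `FakeBucket`, `CasMismatch` and `increment_with_retry` are hypothetical names, and a simple counter stands in for the server-generated CAS value.

```python
import itertools

class CasMismatch(Exception):
    """Raised when the supplied CAS no longer matches the stored one."""

class FakeBucket:
    """Hypothetical in-memory model of CAS semantics; not an SDK class."""
    def __init__(self):
        self._data = {}                       # key -> (value, cas)
        self._next_cas = itertools.count(1)   # stand-in for server-generated CAS values

    def set(self, key, value):
        cas = next(self._next_cas)            # every store produces a new CAS
        self._data[key] = (value, cas)
        return cas

    def get(self, key):
        return self._data[key]                # (value, cas)

    def replace(self, key, value, cas):
        _, current = self._data[key]
        if cas != current:                    # someone else modified the document
            raise CasMismatch(key)
        return self.set(key, value)

def increment_with_retry(bucket, key, max_attempts=5):
    """Optimistic update: re-read and retry whenever another writer got there first."""
    for _ in range(max_attempts):
        value, cas = bucket.get(key)
        try:
            bucket.replace(key, value + 1, cas)
            return value + 1
        except CasMismatch:
            continue                          # fetch the fresh value and CAS, try again
    raise RuntimeError("gave up after repeated CAS mismatches")
```

The retry loop is the "handle that race condition" step: on a mismatch the caller re-reads the document and reapplies its change to the fresh value.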

Q: What happens if I make a set with a durability parameter greater than the number of replicas present in the cluster?

A: You get an error :).

Q: What is the maximum limit of Document size as a whole? What is the maximum limit of storing data as Value when we save binary data?

A: In both cases it's 20MB. You don't want to store anything bigger than that in Couchbase anyway. There are much smarter CDN-based solutions for storing and distributing large files (which are most likely videos). In those cases you would store the file metadata in Couchbase as JSON with a link to the file, and the actual file on the CDN. You can also check out CBFS, an open-source, S3-like redundant distributed file store that uses Couchbase, written by one of our founders: https://github.com/couchbaselabs/cbfs

Q: You keep saying that it's good to configure the server so that all the documents can fit in RAM. How much performance degradation would I get if I use SSD disks and low RAM?

A: That depends. If you have high write volume and have filled your bucket RAM quota, you are going to have competing processes trying to get documents to disk: ejection of active documents from RAM to make space, and writing of new documents to disk. If you have enough disk I/O via SSD RAID that you can eject active documents faster than you are filling the RAM back up, then it can keep up. If you scale horizontally it has more of a chance, because many nodes will be ejecting at the same time and might, theoretically, be able to outpace your write volume! If not, you will get Temporary OOM errors in the client, which tell you to "back off" on operations and retry.
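
A "back off and retry" handler can be sketched in Python like this. `TemporaryFailure` is a hypothetical stand-in for whatever temporary out-of-memory error your client raises, and `store_one` is a hypothetical single-document store callable; the doubling delay is one common choice, not something the server prescribes.

```python
import time

class TemporaryFailure(Exception):
    """Hypothetical stand-in for the client's temporary out-of-memory error."""

def store_with_backoff(store_one, key, value, max_attempts=6,
                       base_delay=0.01, sleep=time.sleep):
    """Retry a store with a doubling delay; return the number of attempts used."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            store_one(key, value)
            return attempt
        except TemporaryFailure:
            if attempt == max_attempts:
                raise                 # still failing: let the caller decide
            sleep(delay)              # give the server time to eject and persist
            delay *= 2                # wait longer before the next attempt
```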

Q: How do we save binary data (files like images, PDFs, etc.)?

A: See the next question below... :)

Q: What happens when you are storing pictures in Couchbase?

A: You can store pictures in two different ways with Couchbase. One is to store the raw binary data as a document value. The second is to store it pre-encoded as Base64 within a JSON document, so it's a standard JSON document with one or more images encoded as JSON values (with JSON keys). The advantage of storing images in Couchbase is that they will be served from RAM instead of disk, so it will be very performant. Couple this with XDCR (Cross Data Center Replication) and you can create your own CDN for images!
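
The second option can be sketched with the Python standard library alone. The field names (`type`, `filename`, `content_type`, `data`) are an assumed document layout for illustration, not a required schema.

```python
import base64
import json

def image_to_document(filename, content_type, raw_bytes):
    """Wrap raw image bytes in a JSON document, Base64-encoding the payload."""
    return json.dumps({
        "type": "image",
        "filename": filename,
        "content_type": content_type,
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    })

def document_to_image(document):
    """Recover the original bytes from the JSON document."""
    return base64.b64decode(json.loads(document)["data"])
```

Keep in mind that Base64 text is roughly a third larger than the raw bytes, which counts against the document size limit.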

Q: How do you modify/update multiple docs and rollback if an error occurs on one of them?

A: In Couchbase you can very easily use Optimistic (CAS) or Pessimistic (Lock) Concurrency for transactions on single documents, but for multiple documents in a single "transaction", you will need to use what is called a Two-Phase Commit. You can read more about it here: http://www.couchbase.com/docs/couchbase-devguide-2.0/two-phase-commits.html
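
To give a feel for the pattern, here is a heavily simplified Python sketch of a two-phase commit moving a balance between two documents. Everything here is an assumption made for illustration: the `store` dict stands in for a bucket, the `tx::<id>` key, the `state` values and the `pending` list are an invented layout, and in a real cluster each step would be a separate CAS-guarded write with a recovery process for transactions left in the pending state.

```python
def transfer(store, tx_id, source, target, amount):
    """Move `amount` between two documents, using a transaction document as the record of intent."""
    tx_key = f"tx::{tx_id}"
    # Phase 1: record the intent before touching either document.
    store[tx_key] = {"source": source, "target": target,
                     "amount": amount, "state": "pending"}
    applied = []
    try:
        for key, delta in ((source, -amount), (target, amount)):
            doc = store[key]
            if doc["balance"] + delta < 0:
                raise ValueError(f"insufficient funds in {key}")
            doc["balance"] += delta
            doc["pending"].append(tx_id)      # mark the document as part of this transaction
            applied.append((key, delta))
        store[tx_key]["state"] = "committed"
    except Exception:
        # Roll back whatever was already applied, then mark the transaction cancelled.
        for key, delta in applied:
            store[key]["balance"] -= delta
            store[key]["pending"].remove(tx_id)
        store[tx_key]["state"] = "cancelled"
        raise
    # Phase 2: clear the markers and close the transaction.
    for key, _ in applied:
        store[key]["pending"].remove(tx_id)
    store[tx_key]["state"] = "done"
```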

Q: Is there a possibility for transactions in Couchbase?

A: As with the previous question, you can easily do single-document transactions using Optimistic Concurrency (CAS, Compare and Swap) or Get and Lock.

Q: Is there a batch insert/update operation that would callback with a list of failed inserts/updates after the write to primary disk?

A: Some SDKs (Python, for instance) have multi-set type operations, and all of them have multi-get operations, but I don't believe we support combining multi-set with observe type operations.

Q: How can I make a query with multiple parameters, like name, date, and a status; something like a "where" in SQL?

A: You do this with our Views (Indexes). Querying multiple parameters might require either a) querying separate views and doing an intersection within your application, or b) being creative with your index keys so that you can range query. There are a number of different strategies for this, and the right one depends on the particular use case and document design, so it is hard to answer succinctly.
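
The "creative index keys" idea can be modeled in plain Python. This sketch assumes a compound key of (status, date) and uses Python tuple ordering as a stand-in for view key ordering; `build_index` and `range_query` are hypothetical helpers, not view API calls.

```python
from bisect import bisect_left, bisect_right

docs = {
    "order::1": {"status": "open", "date": "2013-11-02"},
    "order::2": {"status": "closed", "date": "2013-11-15"},
    "order::3": {"status": "open", "date": "2013-11-20"},
    "order::4": {"status": "open", "date": "2013-12-03"},
}

def build_index(documents):
    """Emit one (status, date) compound key per document and keep the index sorted."""
    return sorted(((d["status"], d["date"]), doc_id) for doc_id, d in documents.items())

def range_query(index, startkey, endkey):
    """Return the ids of documents whose compound key falls within [startkey, endkey]."""
    keys = [key for key, _ in index]
    lo = bisect_left(keys, startkey)
    hi = bisect_right(keys, endkey)
    return [doc_id for _, doc_id in index[lo:hi]]

index = build_index(docs)
open_in_november = range_query(index, ("open", "2013-11-01"), ("open", "2013-11-30"))
```

Putting the equality field (status) first and the range field (date) last is what lets one range scan answer a two-parameter "where".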

Q: Do you support other languages like Go or Clojure?

A: Yes! There is a community client for Go (https://github.com/dustin/go-couchbase). You can see all the community clients on our All Clients page on couchbase.com: http://www.couchbase.com/communities/all-client-libraries. Community client libraries aren't officially supported by our support contracts, but you can easily find help from our engineers via IRC or Twitter.

Q: Are Key Patterns faster than Views?

A: In most cases where you can actually use Key Patterns, yes, because they are binary socket operations with data coming out of the RAM cache. Data is distributed, and indexing happens on each node of the cluster for its share of the data, so Views require scattering the query across all nodes in the cluster and gathering results from all of them. This will always be slower than going directly to a single node of Couchbase and doing a binary CRUD operation over a persistent binary socket connection for a single key. So, yes, Key Patterns are faster than Views. However, not all problems can be solved by Key Patterns, and that is why Views are there. The typical use of Views is to query the view and then follow that (when needed) with a multi-get operation for the documents in the View result set. For use cases that require more complex querying, Views are the next answer, and for use cases that require more flexible search, our Elasticsearch integration is the answer. We are also working on a Couchbase query language (N1QL) for ad hoc querying of Couchbase. It's currently in Developer Preview and is another interesting and powerful choice for queries.
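
As an illustration of a Key Pattern, here is a minimal Python sketch of a lookup document: an `email::<address>` key whose value is the user id, so a user can be found by email with two direct gets and no view. The `store` dict stands in for a bucket, and the key layout and function names are assumptions for this example.

```python
store = {}

def create_user(store, user_id, name, email):
    """Write the user document plus a lookup document keyed by email."""
    store[f"user::{user_id}"] = {"name": name, "email": email}
    store[f"email::{email.lower()}"] = user_id     # lookup document

def find_user_by_email(store, email):
    """Two direct key lookups: email -> user id, then user id -> document."""
    user_id = store.get(f"email::{email.lower()}")
    if user_id is None:
        return None
    return store[f"user::{user_id}"]
```

This only works because the key can be computed from what the application already knows (the email address); when it can't, that is where Views come in.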
