Previous blogs explained how our approach to Response Time Observability (RTO) helps with troubleshooting distributed systems in general and why we chose OpenTracing as its foundation and public API. If you haven’t already, you can check out those blogs here and here.

In this blog, you’ll learn how RTO works with the Java SDK and how you can use it today for your benefit.

Getting Started

Once the Java SDK 2.6.0 (or later) is released, you can grab it through the usual channels as described here. Since this version is currently in a preview release state, you need to pull it from our own Maven repository. In a Maven pom.xml, the dependency looks like this:
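Note that the snippet below is a sketch: the coordinates are the usual java-client ones, and the preview repository still has to be added to your <repositories> section (its URL is listed in the release announcement, so it is omitted here).

    <dependency>
        <groupId>com.couchbase.client</groupId>
        <artifactId>java-client</artifactId>
        <version>2.6.0</version>
    </dependency>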

The only difference you will notice is a new dependency called opentracing-api, which we’ll talk about in a bit. Since this is not a major release, all previous code will work as before and you’ll benefit from the enhancements right away.

If you need to customize the tracer itself or some of its configuration, you can do so through the CouchbaseEnvironment.Builder, just like all the other settings. Here is an example of how to customize our default tracer to reduce the interval at which the information gets logged:
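The following sketch is based on the 2.6 preview API, where the ThresholdLogTracer and ThresholdLogReporter classes live in the core tracing package; double-check the builder method names against the current API docs.

    import com.couchbase.client.core.tracing.ThresholdLogReporter;
    import com.couchbase.client.core.tracing.ThresholdLogTracer;
    import com.couchbase.client.java.env.CouchbaseEnvironment;
    import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
    import java.util.concurrent.TimeUnit;

    // Report every 10 seconds instead of the default one minute and hand the
    // customized tracer to the environment.
    CouchbaseEnvironment environment = DefaultCouchbaseEnvironment.builder()
        .tracer(ThresholdLogTracer.create(ThresholdLogReporter.builder()
            .logInterval(10, TimeUnit.SECONDS)
            .build()))
        .build();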

Threshold Logging Explained

To get our feet wet with the threshold logging tracer (which is on by default), we are going to customize its settings a bit and lower the thresholds so far that practically every request and response is covered. Obviously this is not a good idea for production, but it will help us get to the desired log output quickly and verify its functionality.

Apply the following configuration as outlined above:
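The sketch below uses the same builder approach as before; treat the method names (kvThreshold, logInterval, sampleSize) as the 2.6 preview ones and verify them against the API docs.

    // Treat any key/value operation over 1 microsecond as slow, log every second,
    // and use a huge sample size so effectively everything ends up in the report.
    // Do NOT use values like these in production.
    CouchbaseEnvironment environment = DefaultCouchbaseEnvironment.builder()
        .tracer(ThresholdLogTracer.create(ThresholdLogReporter.builder()
            .kvThreshold(1, TimeUnit.MICROSECONDS)
            .logInterval(1, TimeUnit.SECONDS)
            .sampleSize(Integer.MAX_VALUE)
            .build()))
        .build();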

This sets our threshold for key/value operations to one microsecond, logs the found operations every second, and sets the sample size to a very large value so everything will be logged. By default the reporter logs every minute (if something is found) and only samples the top 10 slowest operations. The default threshold for key/value operations is 500 milliseconds.

With these settings in place we are ready to run some operations. Feel free to adjust the following code so that it points to your server and uses the appropriate credentials:
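The following is a sketch of such a workload; the hostname and credentials are placeholders, and every document ID other than airline_10 is made up so that some fetches come back empty:

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.document.JsonDocument;
    import java.util.Arrays;

    // Placeholder connection details -- point these at your own cluster.
    Cluster cluster = CouchbaseCluster.create(environment, "127.0.0.1");
    cluster.authenticate("Administrator", "password");
    Bucket bucket = cluster.openBucket("travel-sample");

    // Fetch five documents; only "airline_10" exists (the other IDs are
    // hypothetical), so we end up with five reads but a single upsert.
    for (String id : Arrays.asList("airline_10", "airline_10x1", "airline_10x2",
            "airline_10x3", "airline_10x4")) {
        JsonDocument doc = bucket.get(id);
        if (doc != null) {
            bucket.upsert(doc);
        }
    }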

In this simple code we read a handful of documents from the travel-sample bucket and, if a document is found, write it back with an upsert. This gives us both read and write operations in a straightforward way. Once the code has run, you will see output similar to this in the logs:

This is our threshold log reporter in action! For each service (only kv, given this workload) it shows you the total number of recorded operations (through count) and lists the slowest operations sorted by their latency. Since only airline_10 exists in the bucket, you see five document fetches but only one mutation.

Let’s look at one specific operation and discuss each field in a little more detail:

It tells us the following:

  • total_us: The total time it took to perform the full operation: here around 1.5 milliseconds.
  • server_us: The server reported that its part of the work took 23 microseconds (this does not include network time or the time the request sat in a buffer before being picked up on the cluster).
  • decode_us: Decoding the response took the client 1.2 milliseconds.
  • dispatch_us: The time between the client sending the request and receiving the response: here around 1 millisecond.
  • local_address: The local socket used for this operation.
  • remote_address: The remote socket on the server used for this operation. Useful to figure out which node is affected.
  • operation_id: A combination of the operation type and its id (in this case the opaque value), useful for diagnosing and troubleshooting in combination with the local_id.
  • local_id: With Server 5.5 and later, this id is negotiated with the server and can be used to correlate logging information on both sides in a simpler fashion.

Note that the exact format of this log is still a bit in flux and will change between dp1 and beta / GA releases.

You can see that, with the thresholds tuned to your production requirements, slow operations can be logged and pinpointed with far less effort than before.

There is one missing piece of the puzzle, though, which we are going to cover next: how all of this relates to the dreaded TimeoutException.

Timeout Visibility

Previously, when an operation took longer than the specified timeout allowed, a TimeoutException was thrown. It usually looked like this:

Looking at this stack trace you can infer a couple of things, for example that the timed-out operation was a GetRequest. If you needed more information at runtime, you had to wrap the call with your own context. For example:
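A hand-rolled wrapper of that kind looked something like this (just an illustration of the pattern, not code shipped with the SDK):

    import java.util.concurrent.TimeoutException;

    try {
        bucket.get("airline_10");
    } catch (RuntimeException ex) {
        if (ex.getCause() instanceof TimeoutException) {
            // Attach our own context so the log at least tells us which
            // document and bucket were involved.
            throw new RuntimeException(
                "Timed out fetching 'airline_10' from 'travel-sample'", ex);
        }
        throw ex;
    }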

This is tedious and still doesn’t give you the insight into the internal behavior that you need.

To remedy this, the new client transparently does (roughly) the above for you, and adds even more information without you having to do a thing. The same timeout in the new SDK looks like the following:

You might recall some of the fields from the log description above, and that’s intentional. The timeout itself now provides you with valuable information such as the local and remote sockets, the operation id, the configured timeout, and the local ID used for troubleshooting. Without extra effort you can correlate this information with the slowest operations in the threshold log.

The TimeoutException now tells you more about “what” went wrong, and you can then look at the log to figure out “why” it was slow.

Swapping out the Tracer

Finally, there is one more feature baked into this. As mentioned earlier, the new opentracing-api dependency comes bundled for a reason. It only provides an interface description, but that is enough to open all of these metrics and stats up to a whole universe of OpenTracing-compatible tracer implementations. You can pick one of the available open source ones like jaeger, use one provided by an APM vendor, or even write your own. The key point is that the SDK ships with a useful, basic tracer while exposing an open, standardized interface that you can plug other implementations into.

As an example, here is how you can swap out our default tracer for one from jaeger:
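A sketch of what that swap could look like, assuming the jaeger Java client is on the classpath and configured through its environment variables (the Configuration API shown here belongs to the jaeger client, so adjust it to the client version you are using):

    import com.couchbase.client.java.env.CouchbaseEnvironment;
    import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
    import io.jaegertracing.Configuration;
    import io.opentracing.Tracer;

    // Build a jaeger tracer (the service name is just an example) and hand it
    // to the SDK instead of the default threshold logging tracer.
    Tracer tracer = Configuration.fromEnv("couchbase-demo").getTracer();

    CouchbaseEnvironment environment = DefaultCouchbaseEnvironment.builder()
        .tracer(tracer)
        .build();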

Without any changes to the actual app, all the traces and spans will now be fed into a distributed tracing engine where you have nice UIs to visualize the behaviour of your application landscape.

A get operation as in the sample above can look like this in the jaeger overview page:

Drilling into the specific trace, the individual span and sub-span timings become visible:

Each span also contains tags that the SDK attaches for advanced filtering and analysis:

Summary

By enhancing our SDKs with Response Time Observability we can now provide deeper insight into what is not working or performing as expected, and why. By relying on OpenTracing as the interface, we also keep the feature vendor-agnostic and open up the possibility of plugging in a compatible APM/tracing tool of your choice if you need one.

Author

Posted by Michael Nitschinger

Michael Nitschinger works as a Principal Software Engineer at Couchbase. He is the architect and maintainer of the Couchbase Java SDK, one of the first completely reactive database drivers on the JVM. He also authored and maintains the Couchbase Spark Connector. Michael is active in the open source community and a contributor to various other projects like RxJava and Netty.
