March 5, 2014

Topology: The Architecture of Distributed Systems

You can’t judge a book by its cover, but you can judge the architecture of a distributed system by its topology.

If two distributed systems are equally effective, is the one with the simpler topology the one with the better architecture? This article compares the architecture of two document databases and two wide column stores by looking at their topologies.

Wide Column Stores

Topology #1

Wow. There is a lot going on here. There are four node types and multiple components per node.

Topology #2

Nice. Simple. There is one node type.

Which wide column store would you choose?

  • Which one is going to be easier to deploy?
  • Which one is going to be easier to maintain?
  • Which one is going to be easier to scale?
  • Which one is going to be more resilient?

I believe the fewer moving parts, the better.

Apache HBase

Apache HBase sits on top of Apache Hadoop, so there are a lot of node types and components. Apache Hadoop requires name nodes and data nodes for HDFS. It requires job trackers and task trackers for MapReduce. Apache HBase requires master servers, region servers, and a ZooKeeper cluster. The Apache HBase, HDFS, and MapReduce components can be co-located. However, they don’t have to be.

The master server and the name node may be single points of failure. However, multiple name nodes can be deployed, as can multiple master servers. That being said, there will be problems if the name nodes are unavailable, the master servers are unavailable, and/or the ZooKeeper cluster is unavailable.
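To make the single-point-of-failure argument concrete, here is a minimal sketch that treats a deployment as a count of instances per role. The role names come from the text above. The instance counts and both helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical deployments: role names are from the article, counts are made up.
HBASE_DEPLOYMENT = {
    "name node": 1,
    "data node": 3,
    "job tracker": 1,
    "task tracker": 3,
    "master server": 2,
    "region server": 3,
    "ZooKeeper": 3,
}
SINGLE_TYPE_DEPLOYMENT = {"node": 3}


def single_points_of_failure(deployment):
    """Return the roles that have exactly one instance deployed."""
    return sorted(role for role, count in deployment.items() if count == 1)


def unavailable_roles(deployment, failed):
    """Return the roles left with no live instance after `failed` instances go down."""
    return sorted(
        role
        for role, count in deployment.items()
        if count - failed.get(role, 0) <= 0
    )


print(single_points_of_failure(HBASE_DEPLOYMENT))
print(single_points_of_failure(SINGLE_TYPE_DEPLOYMENT))
print(unavailable_roles(HBASE_DEPLOYMENT, {"ZooKeeper": 3}))
```

The more roles a topology has, the more entries there are to check. Each one needs its own redundancy before the list of single points of failure is empty.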

Apache Cassandra

There is one node type. That’s it. Clients communicate directly with the nodes. There are no single points of failure. There are no dependencies on independent nodes or separate clusters.
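As a rough illustration of what "clients communicate directly with the nodes" can look like, the sketch below has the client itself decide which node owns a key. The modulo-hash scheme, the node names, and `pick_node` are assumptions made for this example. This is not Apache Cassandra's actual partitioner, which assigns token ranges to nodes.

```python
import zlib


def pick_node(key, nodes):
    """Map a key to a node with a stable hash (toy scheme, for illustration only)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


# Hypothetical cluster: every node plays the same role.
nodes = ["node-a", "node-b", "node-c"]

# The client computes the owner itself and would connect to it directly,
# with no router or lookup service in between.
owner = pick_node("user:42", nodes)
print(owner)
```

A plain modulo hash reshuffles most keys whenever a node is added or removed, which is why real systems tend to use a more stable mapping. The point here is only that no intermediate node type is needed to find the data.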

Document Databases

Topology #1

Wow. There is a lot going on here. There are four node types and two layers of logical groupings.

Topology #2

Nice. Simple. There is one node type.

Which document database would you choose?

  • Which one is going to be easier to deploy?
  • Which one is going to be easier to maintain?
  • Which one is going to be easier to scale?
  • Which one is going to be more resilient?

I believe the fewer moving parts, the better.

MongoDB

The MongoDB topology is similar to the Apache HBase topology. The difference is that clients do not directly connect to the nodes. The client requests are proxied by the router nodes. The router nodes retrieve shard information from the config nodes. A shard consists of a replica set. A replica set consists of multiple nodes and an arbiter.
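The request path described above (client, router, config node, shard) can be sketched as a toy model. All class names, the range-based key split, and the in-memory storage are hypothetical. None of this is MongoDB's API. It only mirrors the hops named in the text.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ReplicaSet:
    """A shard: several data-bearing members plus an arbiter."""
    members: list
    arbiter: str
    docs: dict = field(default_factory=dict)


@dataclass
class ConfigNode:
    """Holds shard information as sorted split points over the key space."""
    bounds: list   # keys below bounds[0] map to shards[0]; keys at or above it map to shards[1], etc.
    shards: list

    def shard_for(self, key):
        return self.shards[bisect.bisect_right(self.bounds, key)]


@dataclass
class Router:
    """Proxies client requests; clients never talk to a shard directly."""
    config: ConfigNode
    replica_sets: dict

    def put(self, key, doc):
        self.replica_sets[self.config.shard_for(key)].docs[key] = doc

    def get(self, key):
        return self.replica_sets[self.config.shard_for(key)].docs.get(key)


config = ConfigNode(bounds=["m"], shards=["shard-1", "shard-2"])
replica_sets = {
    "shard-1": ReplicaSet(members=["n1", "n2"], arbiter="a1"),
    "shard-2": ReplicaSet(members=["n3", "n4"], arbiter="a2"),
}
router = Router(config=config, replica_sets=replica_sets)

router.put("alice", {"plan": "free"})
router.put("zoe", {"plan": "pro"})
print(router.get("alice"), config.shard_for("zoe"))
```

Even in this toy form, a single read touches three kinds of objects before it reaches the data. That is the extra surface area the topology diagram makes visible.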

Like Apache HBase, the router node and the config node may be single points of failure. However, like Apache HBase, multiple router nodes and multiple config nodes can be deployed. That being said, there will be problems if the router nodes and/or the config nodes are unavailable.

Couchbase Server

There is one node type. That’s it. Clients communicate directly with the nodes. There are no single points of failure. There are no dependencies on independent nodes or separate clusters.

Summary

A great architecture balances flexibility and simplicity. There is value in a modular architecture. There is value in a simple architecture. However, modularity does not have to be reflected in the topology of a distributed system. Couchbase Server is a modular, distributed system. A single instance is composed of multiple components and multiple services. However, the modularity is not forced on administrators. It is an aspect of the distributed system itself, not its deployment.
