October 12, 2009

scaling data at Los Angeles CloudCamp

It was just over a week ago now, but I had a good time and learned a few things from both the sessions and the discussions at the Los Angeles CloudCamp. I proposed, and ultimately led, a session on scaling data. Actually, I proposed it as "Scaling Your Data: Both Before and After you Need To."

Part of the reason I proposed it is that there was discussion of the CAP theorem during the proposal of sessions, and what was stated was a bit off. CAP is well discussed in a lot of places (one good blog is from Jonathan Gray, CTO and co-founder of Streamy). CAP stands for Consistency, Availability and tolerance to network Partition. Interestingly, many folks kept trying to make the last P performance or introduce performance into the mix. Performance (or at least the tradeoffs involved) is better discussed with another set of letters, W+R>N, where N is the number of replicas of a piece of data, W is how many must acknowledge a write, and R is how many are consulted on a read.

We had one room at Microsoft's downtown LA digs pretty much dedicated to scaling data. The first session was led by .... and covered the CAP theorem. It was a bit contentious at first, as there were some assertions that you could get all three with some judicious use of Microsoft technology, but through the discussion in the room we ended up in the right place and in agreement with each other. That session was led by Abhijit Gadkari from ZimbaTech.

After that, Microsoft's own SoCalDevGal (Lynn Langit) did a presentation on SQL Azure. I certainly learned a bit about what Microsoft is delivering and what their developers are asking for. Heck, I even learned a few new three-letter acronyms. :)

SQL Azure is Microsoft's hosted relational database as a service. From the presentation, it became clear that Microsoft was initially not planning a fully SQL-compliant hosted data store (which would have made them more like Amazon's SimpleDB or Google's BigTable as accessed through Google's App Engine). In the end, customer feedback led them back to providing as much SQL as they could, including integration with all of the Microsoft developer tools.

Interestingly, Microsoft is suggesting data sharding for scale. This is in part because the initial product is limited to 10 GB databases. SoCalDevGal was really clear that this is just an initial limitation, and a few things had to give to meet their schedule.
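To make the sharding suggestion a bit more concrete, here is a minimal sketch of hash-based shard routing in Python. It is an illustration only, not something shown in the presentation and not a SQL Azure API: the database names and the `shard_for` and `database_for` helpers are hypothetical, and it assumes each shard is an ordinary database that the application connects to separately.

```python
import hashlib

# Hypothetical shard names; each one stands for a separate database
# that is kept under the per-database size limit.
SHARDS = ["customers_0", "customers_1", "customers_2", "customers_3"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def database_for(customer_id: str) -> str:
    """Pick the database that holds all rows for this customer."""
    return SHARDS[shard_for(customer_id, len(SHARDS))]


if __name__ == "__main__":
    for customer in ["alice", "bob", "carol"]:
        print(customer, "->", database_for(customer))
```

One caveat with this simple modulo scheme: changing the number of shards remaps most keys, which is why schemes such as consistent hashing are often used when shards are expected to be added over time.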
The other really interesting bit about SQL Azure, which I thought was pertinent given the rest of the discussion, was the apparent decision to make SQL Azure a CA system with regard to CAP. I even asked to verify, and that was in fact the intent. The reason this is so significant to me is that this design decision, coupled with the replication across datacenters, means the only way to do this correctly is to take the system (or a portion of it, for safety) offline in the event a datacenter is cut off from the 'net. Since all data is in at least two datacenters, and you don't know which two, it's possible the datacenters are more than "pairs", meaning a complete partition of a single datacenter could mean an outage for at least two, and maybe more. This is good for their users in that it's completely compatible with their applications, but it also means there could be a "CNN moment" someday, where a couple of telco failures take a large number of SQL Azure customers offline.

Just so this isn't misinterpreted, this isn't a criticism of Microsoft's technology or some deficiency they should have overcome. It was just a design choice made by the SQL Azure team. This is where the "engineering" is in our discipline... there are tradeoffs in any design.

The final session in the room in the evening ended up being mine. I did a quick introduction of myself and a quick assessment of who had ever even looked into stuff like memcached, Gearman, Hadoop HBase, Cassandra and the like. It turned out almost no one had thought about this. I think this is, in part, because Microsoft did a great job putting butts in seats for this event. Because of the audience's experience with such things, I ran the session as a bit of a cerebral exercise in understanding why someone may want to look into different ways of storing data. I'm sure I didn't cover everything, but I did assert a few reasons:

  • Scale - Moving some of the intelligence to the clients in a distributed computing architecture makes it simpler to split components, and thus scale them.  It also becomes much easier to move caching closer to where the data is used, helping scale.
  • Developer Productivity - This cuts both ways in that having to learn a new way of storing and retrieving data is a productivity sapper, but being able to evolve data structures and persistence mechanisms without ever having to declare changes to a data model (get DBAs involved, etc.) is liberating.
  • Availability - For some classes of applications, it may entirely make sense for most of the data to be available, even if a small portion is currently unavailable.
  • Geographic Distribution - If you begin to think about the data a little differently, it becomes easy to geographically distribute the data.
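
The availability point connects back to the W+R>N shorthand. As a rough sketch (an illustration, not the API of any particular data store): with N replicas of each item, a write waits for W acknowledgements and a read consults R replicas. When W+R>N, every possible read set shares at least one replica with every possible write set, so a read always reaches a replica that saw the latest acknowledged write. Choosing smaller W or R lets the system keep working with more replicas unreachable, at the cost of possibly stale reads. The function names below are hypothetical.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Brute-force check that every W-replica write set shares at least
    one replica with every R-replica read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def replicas_that_can_fail(n: int, w: int, r: int) -> dict:
    """How many replicas may be unreachable while writes / reads still succeed."""
    return {"writes": n - w, "reads": n - r}


if __name__ == "__main__":
    n = 3
    for w, r in [(2, 2), (1, 1), (3, 1)]:
        print(
            f"N={n} W={w} R={r}:",
            "overlap guaranteed" if quorums_always_overlap(n, w, r) else "reads may be stale",
            replicas_that_can_fail(n, w, r),
        )
```

The brute-force check agrees with the simple arithmetic test `w + r > n`; it is spelled out here only to show why the inequality matters.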

I'll have to write more about this in the not too distant future. I've been looking into a lot of previous work (even some academic papers) and there are some interesting ideas out there.

p.s.: the TechZulu guys recorded the session. It may be on their site at some point... I checked so I could link to it, but it's not there yet.
