An emerging data management architectural pattern behind interactive web applications

The user-data connection is driving NoSQL database-Hadoop pairing

AOL Advertising runs one of the largest online ad serving operations, serving billions of impressions each month to hundreds of millions of people. AOL faced three data management challenges in building their ad serving platform:

How to analyze billions of user-related events, presented as a mix of structured and unstructured data, to infer demographic, psychographic and behavioral characteristics that are encapsulated into hundreds of millions of “cookie profiles”
How to make hundreds of millions of cookie profiles available to their ad targeting platform with sub-millisecond, random read latency
How to keep the user profiles fresh and current

The solution was to integrate two data management systems: one optimized for high-throughput data analysis (the “analytics” system), the other for low-latency random access (the “transactional” system). After analyzing alternatives, the final architecture selected paired Cloudera Distribution for Hadoop (CDH) with Membase:

In this architecture, (1) click-stream data and other events are fed into CDH from a wide variety of sources (2) the data is analyzed using MapReduce to generate hundreds of millions of user profiles; then, based on which ad campaigns are running, selected user profiles are loaded into Membase where (3) ad targeting logic can query Membase with sub-millisecond latency to get the data needed to make optimized decisions about real-time ad placement.

While AOL offers a specific example of the power of pairing a “NoSQL” transactional database system with the Hadoop analytics platform, they represent what is emerging as a common pattern of deployment across a wide variety of web application environments.

Increasingly organizations deploy these two technologies in concert. I would hazard to say that every NoSQL database deployment would be more valuable when paired with Hadoop. Why?

A NoSQL database is invariably used to allow interactive web applications to serve large and growing application user populations. A large and growing set of users naturally generates a lot of data – directly and indirectly, structured and unstructured. Hadoop can store, process and analyze lots of data. The resulting analysis can offer insight into user behavior, preferences and patterns that can be used to make the application experience even better for users. A better application leads to more users. And the cycle accelerates.

Lots of users

In simplest terms, a NoSQL database (such as Membase, MongoDB or Riak) is designed to provide interactive Web applications with cost-effective, low-latency, random access to data. Web applications have three defining characteristics that matter in this context:

Hyper-growth. They can grow from one to hundreds of thousands of users overnight – literally. And they can continue growing to serve hundreds of millions of users. The world is a very big place. A public web interface today makes a software system accessible to nearly two billion people.
Serving the impatient. Humans use these systems. Decades of research has shown that speed matters. People don’t like to be kept waiting. Any part of the technology stack that contributes to a user waiting contributes the demise of the application.
Transiency. The user population of a web application comes and goes – both permanently and temporarily. There usually exists a population of users “online” or active at any given time. They come, they use, they leave. Hopefully they return.

NoSQL database technology was invented to address these issues. They grow elastically – just add more cheap servers into a cluster and the data, and most importantly the I/O, is automatically rebalanced across the new servers to support increased load. And the same goes in reverse when the user population of an application recedes. And they are built to guarantee very low-latency random read-write access to data when an application needs it. They do this, in part, by taking advantage of the transient use of these software systems. When users are active, the data required to serve them is cached in main memory. In this case, reading or writing a 5k data object can be done with sub-millisecond latency. When a user has been away and begins using the application again, if her data is no longer in memory it is automatically fetched and available through the users session, with a no-longer-active user’s data being ejected from memory and stored on low-cost disk storage awaiting next use.

Lots of data

If data must be ready to “go active” at any point in time, then a NoSQL database is the right solution. But Hadoop is a much better choice for storing data when that is not a requirement. And interactive web applications can generate mountains of this sort of data – data that historically may have gone uncollected. Login information, click streams, page views, gaze data, “old” application data no longer needed for real-time access, entry and exit drivers, flows to purchase, historical offer and purchase flows, timing information. The list goes on and on. With Hadoop, you can just collect stuff. There is no need to set up a schema or define data formats in advance. If you have an idea for collecting some information that might yield useful insight, store it in Hadoop. “When in doubt, write it out.”

Hadoop was built to slurp up information and to store it cost-effectively. It employs the same “scale-out” approach as a NoSQL database – spreading data across inexpensive servers. But it stores data in a way that is optimized for high-throughput batch analysis, versus low-latency, random access. With Hadoop MapReduce, trends can be discovered, data can be aggregated and conclusions can be drawn. Those conclusions can then shape application behavior.

Check out the webinar

Next Thursday, Matt Aslett, Senior Analyst with The 451 Group, will be hosting a webinar entitled How AOL Accelerates Ad Targeting Decisions with Hadoop and Membase Server. Joining Matt will be Pero Subasic, Chief Architect, AOL.

For anyone interested in building scalable web applications, I would encourage you to check out the webinar. It’s fun to see how these technologies are used “in the real world” and it may spark an idea for your own environment.

ⁱZynga publicly indicated that 290,000 people played CityVille in its first 24 hours (http://www.insidesocialgames.com/2010/12/06/cityville-claims-cityville-is-its-fastest-growing-game/)

ⁱⁱ http://www.internetworldstats.com/stats.htm

ⁱⁱⁱ http://www.useit.com/alertbox/response-times.html