Friday, October 28, 2011

A Good Primer on NoSQL

NoSQL databases are all the rage, but the array of choices before us is bewildering. I must confess I'm still confused about the features and differences between BigTable, GAE DataStore, GemFire, SimpleDB, SQLFire, CouchDB, MongoDB, RavenDB, Redis, Cassandra, Riak, HBase, Neo4j and so many other names that I have only recently begun to hear about. I'm sure many others would be in the same situation.

I was therefore happy to see that my colleague at WSO2, Dr Srinath Perera, has analysed the NoSQL landscape in depth, zeroed in on the characteristics of NoSQL databases that matter, and summarised them in an InfoQ article. It provides a simple overview of the choices that designers and developers have today, choices that go beyond the traditional relational databases that we're familiar with.

I've often wondered about why NoSQL should be so popular in the first place. Srinath explains:

A few years ago, most systems were small and relational databases could handle [their requirements] without any trouble. Therefore, the storage choices for architects and programmers were simple. However, the size and scale of these systems have grown significantly over the last few years. High tech companies like Amazon and Google faced the challenge of scale before others. They soon observed that relational databases could not scale to handle those use cases.

In other words, this wave of demanding new requirements has probably not hit most of us yet. But with the jump in the number of connected devices (smartphones, tablets and the coming "Internet of Things"), applications dealing with huge volumes of data are unlikely to remain as rare as they were in the past. And when we say "huge", we're not even talking gigabytes anymore. It's terabytes and larger. As we learnt from Godzilla, size does matter. And drastic situations call for drastic measures, hence the NoSQL revolution.

Srinath refers to Eric Brewer's CAP theorem, which states that a distributed system can only have two of the three properties - Consistency, Availability, and Partition Tolerance. The NoSQL databases aim to break through the limitations imposed on traditional relational databases by loosening the fundamental principles on which these have been based, dropping one or more constraints as appropriate, to obtain a desired behaviour.
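
To make the trade-off a little more concrete, here is a toy sketch in Python of what "dropping a constraint" can look like when a network partition occurs. It is my own illustration and not taken from Srinath's article; the names `Replica`, `TinyCluster` and `demo` are hypothetical, and a real database is of course far more involved. In "CP" mode the cluster refuses a write it cannot replicate (giving up availability), while in "AP" mode it accepts the write and lets the two copies disagree (giving up consistency).

```python
class Replica:
    """One node holding its own copy of a small key-value map."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyCluster:
    """Two replicas with a switchable network partition (toy model only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        """Write through `replica`; return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: update both copies together.
            replica.data[key] = value
            other.data[key] = value
            return True
        if self.mode == "CP":
            # The other copy is unreachable: refuse rather than diverge.
            return False
        # AP: accept the write locally; the copies now disagree.
        replica.data[key] = value
        return True

    def read(self, replica, key):
        return replica.data.get(key)


def demo(mode):
    """Write once normally, once during a partition, then read both copies."""
    cluster = TinyCluster(mode)
    cluster.write(cluster.a, "colour", "red")
    cluster.partitioned = True
    accepted = cluster.write(cluster.a, "colour", "blue")
    return (accepted,
            cluster.read(cluster.a, "colour"),
            cluster.read(cluster.b, "colour"))


if __name__ == "__main__":
    print("CP:", demo("CP"))
    print("AP:", demo("AP"))
```

Under these assumptions, the "CP" run rejects the second write and both replicas still agree, whereas the "AP" run accepts it and the two replicas return different values until they are reconciled.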

Depending on the constraints dropped, the resulting solution falls into one of several new categories:

  • Local memory
  • Distributed cache
  • Column Family Storage
  • Document storage
  • Name-value pairs
  • Graph DB
  • Service Registry
  • Tuple Space

These are in addition to the traditional filesystems, relational databases and message queues that are familiar to IT practitioners today.
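
As a rough illustration of how some of these categories differ, the sketch below models the same user record in four of the styles using plain Python structures. This is my own simplification rather than anything from Srinath's article, no real product's API is used, and all the names and sample values (`key_value`, `documents`, `column_family`, `nodes`, `edges` and the helper functions) are made up for the example.

```python
import json

# Name-value pairs: the store sees an opaque value, reachable only by its key.
key_value = {"user:42": json.dumps({"name": "Asha", "city": "Colombo"})}

# Document storage: the store understands the nested structure,
# so it can search on fields inside the document.
documents = [
    {"_id": 42, "name": "Asha", "address": {"city": "Colombo"}},
    {"_id": 7, "name": "Ravi", "address": {"city": "Sydney"}},
]

# Column family storage: row key -> column family -> column -> value.
# Rows need not share the same columns.
column_family = {
    "42": {
        "profile": {"name": "Asha", "city": "Colombo"},
        "logins": {"2011-10-27": "web", "2011-10-28": "mobile"},
    },
}

# Graph DB: nodes plus labelled edges between them.
nodes = {42: {"name": "Asha"}, 7: {"name": "Ravi"}}
edges = [(42, "FOLLOWS", 7)]


def get_by_key(key):
    """Name-value lookup: the caller must decode the value itself."""
    return json.loads(key_value[key])


def find_documents_by_city(city):
    """Document-style query on a nested field."""
    return [d for d in documents if d["address"]["city"] == city]


def columns_of(row_key, family):
    """Column-family read: fetch one family of columns for one row."""
    return column_family[row_key][family]


def followed_by(node_id):
    """Graph traversal: names of the nodes that `node_id` follows."""
    return [nodes[dst]["name"]
            for src, label, dst in edges
            if src == node_id and label == "FOLLOWS"]
```

The point of the sketch is only that the shape of the data and the kind of retrieval you need go together: lookup by key, search inside a document, reading a slice of columns, or walking relationships.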

Perhaps the most important contribution of Srinath's article is his distilling of the four primary characteristics that are important from a usage point of view - data structure, the level of scalability required, the nature of data retrieval and the level of consistency required. He then puts these characteristics together in various combinations to show which of the above-listed categories of data store would be the most appropriate solution to use.

He's certainly succeeded in demystifying NoSQL for me, although I suspect I'll need to go back and read the article a few times till I've fully internalised the concepts in it. This is an overview article that I'd recommend to anyone trying to make sense of NoSQL and wanting to decide on the appropriate product category that would be right for their needs.

I can see the demand for a follow-up article from Srinath drilling down into each of these data storage categories and providing recommendations about actual products (e.g., Cassandra, Redis or CouchDB). While the sands shift more rapidly in the product space, the choice of product is also a more practically urgent decision for a developer or architect to make. So while such an article might need to be updated quite frequently, the advice in it would be more practical than this one, which provides the necessary initial understanding of the NoSQL landscape.


aaron said...

Don't forget about RavenDB. In terms of usability it takes the cake, and its feature set is surprisingly good.

Ganesh Prasad said...

Added, thanks!