Tech Tutorials: Introduction to Apache Cassandra

Tuesday, August 17, 2010

Introduction to Apache Cassandra

What is Cassandra?
Apache Cassandra is a non-relational database maintained by the Apache Software Foundation. It was originally developed at Facebook, which open sourced it in 2008, and it is now developed as an Apache project.

In a typical relational database, data is stored as rows. In Cassandra, data is stored in a column-oriented format as key-value pairs. This column-based storage is what gives Cassandra its high performance compared with relational databases.

But there is no SQL, so how do you query?

Data is inserted and retrieved through client APIs. One of these is built on the Thrift framework, which is essentially a communication protocol used not just by Cassandra but by many other projects.

Who is using Apache Cassandra?

Cassandra is in use at Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX, and more companies that have large, active data sets. The largest production cluster has over 100 TB of data in over 150 machines.

RDBMS users should expect that adopting Cassandra will take some time.

Terminology of Cassandra:

Column – A column is a tuple of a name and a value, both binary and of no fixed length, along with a timestamp. To keep things simple, ignore the timestamp for the moment.

Super Column - Essentially a container for one or more columns. It is again a tuple, with a binary name and a map of columns, where each key is the same as the name of the column it points to.

Column Family - A structure that holds an unbounded number of rows, much like a traditional table. Each row has a binary key and a map of columns, where again each key is the same as the name of the column.

Super Column Family - The same as a column family, except that each row has a map of super columns instead of columns. The map is keyed with the name of each super column, and the value is the super column itself.

Keyspace - Like a schema, it contains the column families.
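
To make the terms above a bit more concrete, here is a rough sketch of the data model as nested Python dictionaries. This is only an illustration of how the pieces nest inside each other, not Cassandra's client API; the names "Users", "UserPosts", "alice" and the helper functions are made up for the example.

```python
import time
from collections import namedtuple

# A column is a (name, value, timestamp) tuple.
Column = namedtuple("Column", ["name", "value", "timestamp"])


def column(name, value):
    """Build a column stamped with the current time in microseconds."""
    return Column(name, value, int(time.time() * 1_000_000))


# keyspace -> column family -> row key -> column name -> Column
keyspace = {
    # A standard column family: each row maps column names to columns.
    "Users": {
        "alice": {
            "email": column("email", "alice@example.com"),
            "city": column("city", "Paris"),
        },
    },
    # A super column family: each row maps super column names
    # to a further map of columns.
    "UserPosts": {
        "alice": {
            "post1": {
                "title": column("title", "Hello"),
                "body": column("body", "First post"),
            },
        },
    },
}


def get(keyspace, column_family, row_key, column_name, super_column=None):
    """Look up one column value, optionally inside a super column."""
    row = keyspace[column_family][row_key]
    columns = row[super_column] if super_column is not None else row
    return columns[column_name].value


print(get(keyspace, "Users", "alice", "email"))
print(get(keyspace, "UserPosts", "alice", "title", super_column="post1"))
```

Notice that a read always starts from the row key and walks down the maps, which is why there is nothing like a join here.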

Sorting - Data is sorted as soon as it is written to the cluster, and it stays that way. There is no way to sort it while fetching, which makes it all the more necessary to plan the layout around the access path. I am not going to go into all the detail here, but I guess you get the idea.
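
The idea of sorting on write can be sketched in a few lines of Python. This is a toy model under my own assumptions, not how Cassandra is implemented: the SortedRow class and its insert/slice methods are hypothetical names, and columns are simply kept ordered by name so that a read can only take a contiguous range.

```python
import bisect


class SortedRow:
    """Toy row that keeps its columns ordered by name at write time."""

    def __init__(self):
        self._names = []    # column names, always kept sorted
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # The sort position is decided when the column is written.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return the columns whose names fall in [start, finish]."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = SortedRow()
row.insert("2010-08-17", "intro post")
row.insert("2010-08-01", "older post")
row.insert("2010-09-02", "newer post")

# Reads take a range in the stored order; there is no "ORDER BY" at read time.
print(row.slice("2010-08-01", "2010-08-31"))
```

Here the column names are dates, so the data can be fetched by date range. If the access path were by title instead, the names would have to be chosen differently up front.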

What might be the reason for developing Cassandra in Java?

Security: it's easier to write secure software in Java than in C++ (remember the buffer overflows?)
Performance: it's not THAT much worse. It's definitely worse at startup, but once the code is up and running, it's not a big deal. Actually, you have to remember an important point here: Java code is continually optimized by the VM, so in some circumstances it can get faster than C++.

Features of Cassandra:

Fault Tolerant : Data is automatically replicated to multiple nodes. Losing a node doesn’t bring down the cluster.

Flexible Schema : We talk in terms of columns, super columns, and column families instead of rows and tables. This follows the BigTable data model.

Symmetric : No single point of failure. Every node within the cluster is identical, and there are no network bottlenecks.

Scalable : New machines can be added with no downtime or interruption to applications, and read and write throughput increase linearly as they are added.

Support for Large Data : The ability to scale to many hundreds of gigabytes of data

Written in Java : Originally built for Facebook and then made open source.
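
To illustrate the fault tolerance and symmetry points above, here is a simplified sketch of replicating a key to several nodes arranged in a ring. It is a toy model under my own assumptions: the Ring class, the node names, and the choice of MD5 for tokens are for illustration only, and Cassandra's real partitioners and replica placement strategies are more involved.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (MD5 used as an example hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy ring of identical nodes; each key is stored on several of them."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key):
        """Return the nodes holding a copy of key, walking clockwise."""
        size = len(self._ring)
        start = bisect.bisect_left(self._tokens, token(key)) % size
        count = min(self.replication_factor, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]

    def live_replicas(self, key, down):
        """Replicas that can still serve key when the nodes in down are lost."""
        return [n for n in self.replicas(key) if n not in down]


ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
owners = ring.replicas("alice")
print(owners)

# Losing one replica still leaves two copies of the row reachable.
print(ring.live_replicas("alice", down={owners[0]}))
```

No node is special in this sketch: any node can work out where a key lives from the key itself, so there is no master to lose.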