I’m writing this up because there’s always quite a bit of discussion on both the Cassandra and Hector mailing lists about indexes and the best ways to use them. I’d written a previous post about Secondary indexes in Cassandra last July, but there are a few more options and considerations today. I’m going to do a quick run through of the different approaches for doing indexes in Cassandra so that you can more easily navigate these and determine what’s the best approach for your application.
The Primary Index
Most conversations about indexing in Cassandra are about secondary indexes. This begs the question, what is the primary index? Your primary index is the index of your row keys. There isn’t a central master index of all the keys in the database, each node in the cluster maintains an index of the rows it contains. This is what the Partitioner in Cassandra manages, as it decides where in a cluster of nodes to store your row. Because of this, the index typically only enables basic looking up of rows by key, much like a hashtable. The discussions that break out about the OrderPreservingPartitioner versus the RandomPartioner are really about how literally the primary index behaves like a hashtable versus an ordered map (i.e. something you could do a “select * from foo order by id” against). This is because, in the case of the RandomPartioner, you can’t easily traverse your set of row keys in meaningful ways since the sorted order of those keys is assigned by Cassandra based on a hashing algorithm. The OrderPreservingPartitioner, as the name implies, orders the keys in string-sort order, so you can not only look up a row via a specific key, but can also traverse your set of keys in ways that are directly related to the values you are using as your keys. In other words, if your row key was a “lastname,firstname,ss#” string, you could iterate through your keys in alphabetical order by lastname. Generally, though, people try to use the RandomPartioner because, in exchange for the convenience of the OrderPreservingPartitioner, you lose the even distribution of your data across the set of nodes in your overall system, which impedes the scalability of Cassandra. For more understanding of this, I’d recommend reading Cassandra: RandomPartitioner vs OrderPreservingPartitioner.
By definition, any other way of finding your row other than using the row key, makes use of a secondary index. Cassandra uses the term “secondary index” to refer to the specific built-in functionality that was added to version 0.7 for specifying columns for Cassandra to index upon, so we’re going to use the broader term “alternate index” to refer to both Cassandra’s native secondary indexes as well as other techniques for creating indexes in Cassandra.