I’m writing this up because there’s always quite a bit of discussion on both the Cassandra and Hector mailing lists about indexes and the best ways to use them. I’d written a previous post about Secondary indexes in Cassandra last July, but there are a few more options and considerations today. I’m going to do a quick run through of the different approaches for doing indexes in Cassandra so that you can more easily navigate these and determine what’s the best approach for your application.
The Primary Index
Most conversations about indexing in Cassandra are about secondary indexes. This begs the question, what is the primary index? Your primary index is the index of your row keys. There isn’t a central master index of all the keys in the database, each node in the cluster maintains an index of the rows it contains. This is what the Partitioner in Cassandra manages, as it decides where in a cluster of nodes to store your row. Because of this, the index typically only enables basic looking up of rows by key, much like a hashtable. The discussions that break out about the OrderPreservingPartitioner versus the RandomPartioner are really about how literally the primary index behaves like a hashtable versus an ordered map (i.e. something you could do a “select * from foo order by id” against). This is because, in the case of the RandomPartioner, you can’t easily traverse your set of row keys in meaningful ways since the sorted order of those keys is assigned by Cassandra based on a hashing algorithm. The OrderPreservingPartitioner, as the name implies, orders the keys in string-sort order, so you can not only look up a row via a specific key, but can also traverse your set of keys in ways that are directly related to the values you are using as your keys. In other words, if your row key was a “lastname,firstname,ss#” string, you could iterate through your keys in alphabetical order by lastname. Generally, though, people try to use the RandomPartioner because, in exchange for the convenience of the OrderPreservingPartitioner, you lose the even distribution of your data across the set of nodes in your overall system, which impedes the scalability of Cassandra. For more understanding of this, I’d recommend reading Cassandra: RandomPartitioner vs OrderPreservingPartitioner.
By definition, any other way of finding your row other than using the row key, makes use of a secondary index. Cassandra uses the term “secondary index” to refer to the specific built-in functionality that was added to version 0.7 for specifying columns for Cassandra to index upon, so we’re going to use the broader term “alternate index” to refer to both Cassandra’s native secondary indexes as well as other techniques for creating indexes in Cassandra.
Cassandra’s Native Secondary Indexes
For people new to Cassandra, this should be your starting point. Often, in discussions on the mailing list, people get deep into other forms of alternate indexes and the newcomer to Cassandra quickly becomes bewildered on what they should be doing. So, if you’re 30 days into using Cassandra, and you need to quickly find a way to index your data, get started with the native secondary indexes. You might want to read the next sections, especially to to understand what “wide rows” are how how they relate to indexing, but don’t start your implementation with these techniques. You can learn more about native secondary indexes on the blog page where they were initially announced and on the Datastax documentation page for them.
“Wide rows” and CF-based Indexes
The basic property of Cassandra that a lot of new users just blow past is the fact that it’s a column-oriented database. They quickly see that they can create a column family (CF) as an analogue to a traditional database table and so create, for example, a CF called “Users” and the columns in each row in that CF are effectively used the same way that they’d use any database column, storing fields like “name”, “location”, etc. The fact that you don’t have to specify the columns ahead of time is regarded as a convenience feature, no more “alter table” statements. These are certainly a big part of how you’ll use Cassandra. However, the point of Cassandra, at least in the context of discussing indexing techniques, isn’t whether you can fit 10 columns or 20 in a row. Cassandra can fit 2 billion columns in a single row. Why do you need 2 billion columns? One of the primary reasons is to have columns that point to other rows, and in that context, you might not need 2 billion rows, but you could end up using quite a few.
The most basic form of CF Index isn’t truly an index at all, it’s list of keys. It’s very common to store a set of row keys as column names in a way that’s called a “wide row”, because it consists of a large number of columns that each contain a small piece of data (i.e. a row key). Here’s a simple example. Let’s suppose you’ve created the aforementioned Users CF. Now you want to organize those users into groups. You could go and create a classic join table, but the more efficient Cassandra way is to have a Groups CF where every row contained the set of keys for the users in that group. So, if you wanted to quickly retrieve all the users in a group, you could just load that the contents of that row. Unlike the primary index, which is only ordered if the partitioner supports it, the column names in a row are always ordered, according to the criteria you used when you created the CF. If the user’s key was their “lastname,firstname,ss#” as mentioned previously, and your Groups CF was sorting columns as strings, then when you retrieved your columns from the row, they would be in that sort order. If you wanted to get a range, say “last names starting with the letter ‘a'”, that would easily be accomplished as well. Another example, suppose for each user, you have a set of Tweets and the key for each tweet was a time-based UUID, then you’d create a CF called UserTweets where the key to that CF was the user’s key, and each row in the UserTweets CF would contain the time-ordered set of keys to the tweets in your Tweet CF. When you look around the various sample projects, like Twissandra, you see variants of this used quite a bit. It’s the basic way relationships are typically modelled in the system.
Now, the challenge becomes what if you want to use a CF row to contain a sorted set of something other than the target row’s keys, such as the “lastname” column of the rows in the User CF. In that case, several users might have the same last name. What if, out of necessity, your User row keys are random UUID’s? Obviously, the easiest answer is to go use native secondary indexing. However, there are a couple of good reasons why you might find that approach wanting. First of all, there are still limitations on the types of searching you can do with native indexes, particularly in regards to range queries, although these will be worked through over time. Indexes aren’t just used for searching, though, they’re used for sorting as well, and native secondary indexes don’t yet help with that either. The final issue is the idea of an indexed collection, which I’m going to talk about a bit later, but is important when you want to do indexing across many-to-many relationships.
Inverted-indexes Using Composite Column Names
Building an inverted index inside a row is not that hard. Essentially you want to use the column sort ordering of the CF and make each column name in a row start with the value you want to search against. For example, let’s suppose each column name was the user’s last name and the column value was their user id. It would be very easy to search by last name by using Cassandra’s get_slice method. Where it gets more complicated is that typically you have multiple target entities that have the same column value. For example, multiple users with the same last name. In this case, you have two options, either composite column names or supercolumns. A composite column name is where you essentially combine the indexed value with the target entity id. For example, the lastname and the user id as a single value that’s used as the column name. Since the combination of the last name and the user id is unique, you can have multiple index entries for the same last name. To construct that composite column name, there are several techniques. The simplest, although not the most efficient or elegant approach, is string concatenation. I’m going to use this approach for my examples because it’s the easiest to illustrate. If you take the string concatenation approach then, ideally, you’d want all of the search term component of the column name to be the same length, so perhaps you’d decide that you were only indexing on the first 10 characters of the user’s last name, and you truncate lastnames longer than that or pad with spaces lastname shorter than that before concatenating the user id to form the index entry. So, you’d end up with, for example, a column name like “smith_____:0001″ for user “Bob Smith” and “smith_____:0005″ for “John Smith”. Doing a get_slice against these column names would let you find all the users whose last name was “smith” pretty quickly. Your row would end up looking like this:
"smith_____:0001" : null
"smith_____:0005" : null
"wilson____:0003" : null
In this case, I’m storing null as the value for each column, but you could put whatever you wanted in there, for example, the user’s full name, so that you didn’t later have to grab that from the user’s row. Note that “indexkey” is the row key, since this entire index sits inside a single row in the CF. You’ll undoubtedly have at least a few different indexes in your app, each one in a separate row.
If you choose to take the composite index approach, you’ll want to read my previous post and look at the CassandraCompositeType project that I’ve put on GitHub as a more elegant and flexible alternative to string concatenation, allowing any number of different value types, such as string, longs, or UUIDs, to be combined together into a single comparable value. Even if you decide not to use it, you’ll probably want to consider some similar method of encoding your column names.
Inverted-indexes Using Supercolumns
The other approach to inverted indexes is to use super columns. This is the approach that Solandra takes, for example. In this case, you’d use a super column where the super column name was the search term and the sub columns were the target entity ids. Something like this:
"0001" : null
"0005" : null
"0003" : null
Again, note that you can put whatever you want in the subcolumn values.
Using super columns is a fairly clean approach and is pretty much what they were designed for. There are a couple of reasons why people avoid supercolumns, the subcolumns in a supercolumn aren’t sorted, and there’s some question about how well they’re supported in general, although that’s a subjective concern. Also, if you wanted to create an index of something like “lastname,city”, you’d still end up falling back to using composite column names (i.e. “smith_____new york__” as the supercolumn name).
One interesting thing to note is that when building CF indexes, the index itself has a key. This makes it possible to create indexed collections. This is where you’re indexing an entity in the context of another entity that owns a collection of which the first entity is a member of. For example, searching for all the users in group A by their last name. If it’s a one-to-many relationship and a user can only be in one group, then that’s as easy as adding an indexed “group id” column to your Users CF if you’re using native secondary indexes. However, if that’s not the case, if users can be in multiple groups, for example, then you might want to be able to maintain multiple “micro-indexes” rather than the global index that native secondary indexes manage. Yes, you’ll need to maintain those multiple indexes whenever the value in the target entity changes, for example, updating the seperate index of user last names that each group maintains, but that’s something that’s actually easier than it sounds. You can see one technique for managing that within my CassandraIndexedCollections project and is something I discussed in my original blog post on Cassandra indexes.
That Annoying Transactional Gotcha
With all of these indexing approaches, you start to raise questions about how transactions fit into this. I’d recommend reading Life Beyond Distributed Transactions for a good discussion of some of the specific issues related to alternate indexes and transactions.
Hopefully this round-up provides some guidance on thinking about indexing within Cassandra-based applications.