Daniel Harrison's Personal Blog

Personal blog for Daniel Harrison

Playing with Cassandra Again

September 30, 2010

Filed under: cassandra,development,internet,Uncategorized — danielharrison @ 12:59 am

I’ve recently been playing with the latest version of Cassandra again. One new direction I like is that it seems to be growing into a more general enterprise keystore rather than something that only solves the requirements of specific high-volume websites. It felt to me like a lot of work had gone into beefing up the management side and making it solve a more generic problem. The programmatic ad hoc schema editing is a good improvement and, based on the direction, 1.0 is shaping up to be really good.

My previous access code used the Thrift API directly. For this prototype I tried out two client libraries, Pelops and Hector. Both still seem to be Thrift focused and I’m not sure how this will work with the change to Avro. Thrift has always felt clumsy to me. Technologies like Thrift and Avro, where you’re expressing a language-independent communication protocol that various languages need to talk in, can’t, in my view, help bleeding those idioms and that generality up to the client. It means client access code often feels, well, slightly awkward. It feels a bit like the good old days of IIOP/CORBA and EJB communication. My personal preference is targeted, hand-coded adapters that feel like a good fit for the language, but the downside of course is that the clients can lag and aren’t always available for your choice of language. So it’s a tradeoff, as always. Hector seems to be actively trying to avoid this but still has wrappers where it feels a bit thrifty, e.g. HKSDef instead of KSDef used to create a keyspace.

If you are trying out and evaluating these libraries, I would highly recommend you bite the bullet, get the Cassandra source that your chosen library targets and build it yourself. Because everything is moving so fast, the current releases look out of date, and to get things working you really need the latest trunk versions of everything. For example, I don’t think beta2 of Cassandra 0.7 is available as a package, but it seems to be required by the current versions of Pelops and Hector. Pelops is source-only on GitHub, so you’ll likely be building things yourself anyway. I was impressed by both; there’s a lot of room for future improvement, and both seem to be shaping up as strong client access libraries.

Another good thing is that some valuable resources seem to be coming through. At the moment it takes a lot of googling and reading the forums to nut out problems. I bought the rough cuts edition of ‘Cassandra: The Definitive Guide’ from O’Reilly, and it seems to have taken a lot of that information, focused it, and turned it into a good source for explanations of idioms and general wisdom. So my recommendation would be to buy it, as it looks like it’s going to be an invaluable reference.

My biggest problem with using Cassandra at the moment is support for multi-tenancy. The problem I have in mind requires text indexing and content that is private per account. With a model like Cassandra’s you need to know up front what you will be searching for, and you basically build column families representing those indexes. In my case I have users, accounts (each with many users), objects (storing text) and various indices around that text that drive my application. Think of something a little like an RDF store with accounts and users. In a traditional database model I would probably store this as a separate database for each account, which may mean each running datastore instance has tens to thousands of databases. With Cassandra, and the way it is structured, this would not be advisable. Each keyspace has its own memory overhead, etc., and to take advantage of its replication model it’s more advisable to have fewer keyspaces.

One of the easy wins of having separate databases per account in the database server world is that you’re guaranteed not to see other accounts’ data: you’re connecting to the datastore for that client, which makes it very easy to guarantee and maintain security. Under Cassandra this is an application concern at the moment. For my prototype I wasn’t happy with the extent to which this invaded my code and required extra indices to make it all work, all of which increased the cognitive load of developing the application. There’s work afoot around multi-tenancy requirements, but until that’s addressed it rules Cassandra out, for me at least. The Cassandra team are working on it and there are some interesting proposals (the namespace one seems interesting), and I’m sure that once it’s complete it will really make Cassandra the first choice for an enterprise keystore.
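To make the "application concern" point a bit more concrete, here is a minimal sketch of one common workaround: keeping every account in a single shared keyspace and prefixing each row key with the account id. This is not the prototype code from this post and it does not call Cassandra, Hector or Pelops at all. Plain Python dictionaries stand in for two column families, and the names `TenantScopedStore`, `put_object`, `get_object` and `search` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class TenantScopedStore:
    """In-memory stand-in for one shared keyspace with two column families.

    Assumption: all names here are illustrative, not a real client API.
    """

    def __init__(self):
        # "Objects" column family: row key -> stored text.
        self._objects = {}
        # "TermIndex" column family: row key -> set of object ids.
        # This is the extra index you have to build and maintain yourself.
        self._term_index = defaultdict(set)

    @staticmethod
    def _key(account_id, local_key):
        # Every row key carries the account prefix. Rejecting ':' in the
        # account id keeps "a" + "b:c" from colliding with "a:b" + "c".
        if ":" in account_id:
            raise ValueError("account_id must not contain ':'")
        return f"{account_id}:{local_key}"

    def put_object(self, account_id, object_id, text):
        # Write the object, then update the per-account term index.
        self._objects[self._key(account_id, object_id)] = text
        for term in set(text.lower().split()):
            self._term_index[self._key(account_id, term)].add(object_id)

    def get_object(self, account_id, object_id):
        # Returns None when this account has no such object.
        return self._objects.get(self._key(account_id, object_id))

    def search(self, account_id, term):
        # Only rows under this account's prefix can ever be reached.
        ids = self._term_index.get(self._key(account_id, term.lower()), set())
        return sorted(ids)


store = TenantScopedStore()
store.put_object("acme", "doc1", "Cassandra stores wide rows")
store.put_object("globex", "doc1", "Globex private notes on Cassandra")

print(store.search("acme", "cassandra"))  # ['doc1']
print(store.search("acme", "private"))    # []
```

Note that isolation here depends entirely on every read and write going through `_key`. Any code path that forgets the prefix can see another account's rows, which is exactly the kind of leakage a database-per-account setup rules out at the connection level.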