An alternative technology quickly gaining popularity these days is CouchDB, a document-based database system for semi-structured data. I wasn’t sure what that meant at first, so I read as much as I could about it. The result? I couldn’t wait to use it.
I decided CouchDB would be a good fit for my next project (which I should be releasing sometime this week BTW) and rolled up my sleeves. Because of the amount of data I’m working with, I hit a few snags along the way with regard to CouchDB view performance. Some of the things I learned, although they make sense, were not what I was expecting initially (even after reading all the docs). So for the benefit of others, I thought it’d be a good idea to share my current understanding of the way views work in CouchDB, and share some of the tips & tricks Jan, Chris, and others have given me along the way.
Importing Data for Speed and Glory
Most of the work I’ve done with CouchDB this past week has been related to importing a fair amount of data (600k+ documents). Initially I tried creating one document at a time. This worked, but every request carries a certain amount of overhead and latency. For example, creating 33,847 documents, one at a time, took 726 seconds (~12 min). Thankfully CouchDB has a bulk create mode. Creating the same documents 1,000 at a time took 58 seconds. That’s roughly 12.5 times faster! An added benefit of using bulk create is that it consumes less hard disk space (28.2MB vs 213.2MB).
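To make the bulk approach concrete, here is a minimal sketch in Python of batching documents and posting each batch to CouchDB’s `_bulk_docs` endpoint. This is not the import script mentioned in this post; the helper names (`batches`, `bulk_request`, `bulk_import`) and the database URL are assumptions made up for illustration.

```python
import json
from itertools import islice
from urllib import request


def batches(docs, size=1000):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_request(db_url, batch):
    """Build (but do not send) a POST to the database's _bulk_docs endpoint."""
    body = json.dumps({"docs": batch}).encode("utf-8")
    return request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bulk_import(db_url, docs, size=1000):
    """Send docs in batches; returns the number of HTTP requests made."""
    sent = 0
    for batch in batches(docs, size):
        # One round trip per batch instead of one per document.
        with request.urlopen(bulk_request(db_url, batch)) as resp:
            resp.read()
        sent += 1
    return sent
```

Calling something like `bulk_import("http://localhost:5984/mydb", docs)` (a hypothetical local database) would then pay the per-request overhead once per 1,000 documents rather than once per document.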
CouchDB Views vs RDBMS Tables
Now that I could get my data in the database in a decent amount of time, I wanted to aggregate some of it together in a view. Before I get into too many details, let me explain how I think about CouchDB views. I’m a very spatial thinker, and so visualizing the similarities and differences between CouchDB views and traditional RDBMS tables helps me to understand how they work. It may be stupid, it may be naive, it may even be wrong, but here goes: Imagine a RDBMS database. Imagine a handful of tables in that database, each with different columns. Now imagine that every row in every one of those tables is just a document in CouchDB, all lumped into the same bucket (database) and with no hierarchy. Views are what filter and aggregate documents together to create (in a very limited sense) the equivalent of a table.
You don’t join views with each other because you’re already essentially “joining” documents (rows) to create the view. This might get you thinking that CouchDB views relate better to RDBMS views. In some ways that is true, but a standard RDBMS view is a stored query that is evaluated when you read it rather than a persisted index, and a materialized view is typically a snapshot that has to be refreshed, so for the sake of this discussion I’m leaving RDBMS views out.
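The bucket-and-view picture can be sketched in plain Python. This is only an analogy, not CouchDB’s API: real view functions are written in JavaScript and run inside CouchDB, and the sample documents, field names, and helpers (`posts_by_author`, `build_view`) are invented for illustration.

```python
# One "database": rows from every would-be table live in the same flat bucket.
bucket = [
    {"_id": "u1", "type": "user", "name": "Ann"},
    {"_id": "p1", "type": "post", "author": "u1", "title": "Hello"},
    {"_id": "p2", "type": "post", "author": "u1", "title": "Again"},
    {"_id": "c1", "type": "comment", "post": "p1", "text": "Nice"},
]


def posts_by_author(doc):
    """Map step: emit (key, value) pairs only for documents that qualify."""
    if doc.get("type") == "post":
        yield doc["author"], doc["title"]


def build_view(docs, map_fn):
    """Run map_fn over every document; return rows ordered by key, then id."""
    rows = [
        (key, value, doc["_id"])
        for doc in docs
        for key, value in map_fn(doc)
    ]
    return sorted(rows, key=lambda row: (row[0], row[2]))


rows = build_view(bucket, posts_by_author)
# rows holds only the "post" documents, keyed by author: the closest
# thing to a "posts" table that this bucket has.
```

Note that `build_view` has to look at every document in the bucket, including the users and comments it ends up ignoring. That detail matters for the indexing behaviour described in the next section.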
Indexing, the Slowdown
CouchDB view indexes are generated when the view is first called. More than one view can be stored inside a design document, and all views in the same design document get generated (and updated) at the same time. After that initial creation, updating the view indexes is incremental based on what documents in the database are added, edited, or deleted. Notice I said indexes are updated based on what documents are modified in the database. This is an important point, and something that wasn’t obvious to me initially. There are no tables. There is no hierarchy. No isolation. Modifying any documents in the database means that all view indexes in all design documents have to be updated.
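As a rough illustration of several views sharing one design document, here is a sketch written as a Python dictionary, plus a helper that builds the URL used to query a view. The design document name, view names, and document fields are hypothetical. The `/_design/<name>/_view/<view>` layout is the one used by current CouchDB releases, so the version this post was written against may have differed.

```python
import json
from urllib.parse import quote

# Two views in one design document: they are indexed together.
design_doc = {
    "_id": "_design/blog",  # hypothetical design document name
    "views": {
        "posts_by_author": {
            "map": "function (doc) { if (doc.type === 'post') "
                   "emit(doc.author, doc.title); }"
        },
        "comments_by_post": {
            "map": "function (doc) { if (doc.type === 'comment') "
                   "emit(doc.post, doc.text); }"
        },
    },
}


def view_url(db_url, design, view):
    """Build the URL for querying one view of a design document."""
    return "{}/_design/{}/_view/{}".format(
        db_url.rstrip("/"), quote(design, safe=""), quote(view, safe="")
    )


# The JSON body you would store in the database as the design document.
design_json = json.dumps(design_doc, indent=2)
```

Requesting either view’s URL for the first time is what triggers index generation for both views, since they live in the same design document.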
The only time view generation is isolated is when a new design document is created or updated. In this case, though, the process is not incremental. For this reason, if you plan to store a large number of documents I strongly suggest that you work out and create your design documents before populating your database. Although CouchDB is a schema-less database, creating views for a large data set is currently much like designing a schema: you mostly do it before filling your database with data, and you generally don’t change it often. Why? Because generating views is currently a slow process, and gets slower the more documents you have. As a point of reference, for my example of 33,847 documents it took me 6,705 seconds (~1.86 hours) to generate a view for the first time. Retrieving the view after that took 0.006 seconds.
Improving Speed Now, and in the Future
In some cases it is possible to speed up view generation by priming the view as you create documents. This method has great results. For example, if I create 33,847 documents in batches of 1,000, calling my view after every bulk create, the whole process takes 219 seconds (~3.65 min). If we compare the time it takes to insert the documents and then generate a view separately vs doing them at the same time, the latter is about 31 times faster ((58 + 6,705) / 219 ≈ 30.88).
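The priming pattern itself is simple: insert a batch, query the view, repeat. Below is a minimal sketch that takes the two operations as callables so it stays independent of any particular HTTP client. The names `import_with_priming`, `insert_batch`, and `query_view` are assumptions for illustration, not part of CouchDB or of my import script.

```python
from itertools import islice


def import_with_priming(docs, insert_batch, query_view, size=1000):
    """Insert docs in batches, touching the view after each batch.

    insert_batch(batch) should perform one bulk create.
    query_view() should request the view so its index is updated
    incrementally while the set of new documents is still small.
    Returns the number of batches processed.
    """
    it = iter(docs)
    done = 0
    while True:
        batch = list(islice(it, size))
        if not batch:
            return done
        insert_batch(batch)
        query_view()  # prime the index before the next batch arrives
        done += 1


# The timings quoted above: separate insert + view build vs. primed import.
separate_seconds = 58 + 6705
primed_seconds = 219
speedup = round(separate_seconds / primed_seconds, 2)  # about 30.88
```

In practice `query_view` can be as cheap as a request for the view that asks for no rows back (for example with `limit=0`), since the point is only to make CouchDB bring the index up to date.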
CouchDB uses an implementation of MapReduce for generating views. Currently, though, view generation cannot be distributed across several nodes. I’ve been told this feature is on the development roadmap, and so chances are view generation will get much, much faster in the (hopefully) not too distant future. Also worth noting is that CouchDB has not yet been optimized, and Damien is quite optimistic about its potential, as am I.
Now, I have only been working with CouchDB for a week, so it’s quite possible my understanding of something might be off. If so, please correct me. Working with CouchDB has been a load of fun (and education). I’m really looking forward to where it goes in the future, and I hope to do what I can to help get it there.
UPDATE: For anyone that might be interested, you can get the import script I’m using from my Launchpad.net repository. Depending on what you’re doing it might be a decent start. The script has a few nice features like resuming interrupted imports, bulk inserts with priming, and graceful handling of failed bulk inserts.