The Map of Life

integrating species distribution knowledge

Git for development data

In an earlier post I promised more discussion of the backend-frontend architecture we are using at MOL. One reason that discussion has been slower to arrive than I had hoped is that our architecture is evolving so rapidly that I haven’t been certain anything would be around long enough to call it core MOL. Today, though, I’d like to show how Aaron Steele and I are using Git to manage rapidly evolving data.

For this project, we are trying to quickly pull together many different data types, structure them, and combine them in ways that will serve APIs, user interfaces, and analytics. No matter how much planning we do, we just aren’t going to get the structure and combinations right the first time, or even the Nth time. On top of that, we needed to be able to move data and data structures between our two development laptops, a development server, and a production server, testing new features and restructuring that data quickly without breaking any component. Enter Git. Below, I’ll walk through the steps of setting up a privately hosted Git repository and managing changes to the data effectively. At the end, I will also cover how we are using Chef to keep application servers up to date with code and data simultaneously.

The first thing you are going to need is a remote repository: a server that you have ssh access to, with a static IP.

Remote repository

First, ssh into the server. Next, you will want to set up a user for git.

adduser git
passwd git

Next, inside your /home/git directory, create a new repo

mkdir bigdata
cd bigdata
git init --bare

For our uses, I wanted to be able to clone this repo on any machine without ssh access. For that, you need to expose it via HTTP. We use Nginx on all of our servers, so to enable this we just need to add the git repo directory we just created as a location in the Nginx sites-enabled config. The important part:

location /bigdata {
    alias   /home/git/bigdata;
    allow   all;
}
At this point, be sure to restart Nginx. You can test that it is working by loading the repo’s description file in your browser.
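One gotcha with plain-HTTP cloning is worth flagging: Nginx is only serving static files here, so git falls back to its “dumb” HTTP protocol, which needs the repo’s ref metadata (info/refs) kept current. Enabling the stock post-update hook handles that after every push. A sketch, using a throwaway /tmp path in place of /home/git:

```shell
# create a bare repo and enable the sample post-update hook, which runs
# `git update-server-info` after each push so HTTP clones see new refs
rm -rf /tmp/bigdata
mkdir -p /tmp/bigdata && cd /tmp/bigdata
git init --bare
mv hooks/post-update.sample hooks/post-update
chmod +x hooks/post-update
git update-server-info   # generate info/refs once, right now
```

With that in place, a read-only `git clone http://{your-server}/bigdata` works from any machine that can reach the Nginx vhost.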

Local machine

Now you’ll want to set up a repository on your local machine for the data. In our case, the backend architecture relies on the data being in a particular place, so on my local machine I put it in the same spot; then when I run my dev server, it is just there.

mkdir bigdata
cd bigdata
git init

Next, standard Git procedure to get it started:

touch README
git add README
git commit -m 'First Commit'

Finally, we want to tell the local git repo that there is a remote git repo waiting for its knowledge:

git remote add origin git@{your-server-ip}:bigdata

This is where you will need SSH access to your remote server. If you are not running SSH on port 22, it helps to add the actual port to your ssh config so it is used by default:

nano ~/.ssh/config
# now add the lines
Host {your-server-ip}
    Port {your-port-#-here}

Finally, we can push our changes to the remote server.

git push origin master


Now we might want to add a bunch of data to our repo. In MOL we are playing with ~1.5-2 GB of data right now; that number will grow, but probably not much faster than the rate at which we finalize structures and databases. For now, I have a directory structure for all the types of data we are playing with. I just drop that into my bigdata folder, commit the changes, and git push. That data is now on our shared remote repo. If I push changes to the data structure and change features in the application, Aaron need only pull each of the repos (our code has its own repo) and everything should be bug free.
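The whole loop, sketched end to end with throwaway /tmp paths standing in for the shared server repo and my laptop checkout (directory and file names here are made up for illustration):

```shell
# stand-ins: 'remote' plays the shared server repo, 'local' my laptop clone
rm -rf /tmp/mol-demo && mkdir -p /tmp/mol-demo
git init --bare /tmp/mol-demo/remote
git init /tmp/mol-demo/local
cd /tmp/mol-demo/local
git config user.name demo && git config user.email demo@example.com
git remote add origin /tmp/mol-demo/remote

# drop a new dataset into the directory structure, commit, and push
mkdir rangemaps && echo 'placeholder shapefile' > rangemaps/mammals.shp
git add .
git commit -m 'add mammal range maps'
git push origin HEAD:master   # teammates can now simply git pull
```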

Next, we will want to make this available to both development and production servers. This is where the magic of Git+Chef comes in. By using Git tags, I can basically freeze the data as it is right now while still committing changes to the branch. So while I may want to push code changes to a production server, I probably don’t want it pulling in or restructuring data unless I tell it to explicitly.

So once I have my data in my local Git repo and my application works, I can commit and add a tag:

git tag -a v0.1
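One detail worth remembering: annotated tags take a message (-m avoids git opening an editor), and a plain git push does not send tags, so the tag has to be published explicitly. In a throwaway repo:

```shell
# sketch in a throwaway repo; the commit is a stand-in for real data commits
rm -rf /tmp/tag-demo && git init /tmp/tag-demo && cd /tmp/tag-demo
git config user.name demo && git config user.email demo@example.com
git commit --allow-empty -m 'data as of today'
git tag -a v0.1 -m 'freeze data for first production deploy'
git tag        # lists v0.1
# on the real repo you would then publish it: git push origin v0.1
```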

Using Chef (I’ll come back to Git in a sec)

A few days ago, we decided it was time to move the MOL backend off of a local development server and into a more easily scalable infrastructure. Beyond the fact that I had to migrate all the services, application code, and data to a new server, I realized that while we had designed the backend to be modular so that different components could eventually scale independently (database versus long-running tasks versus application interfaces), we didn’t need a ton of computing yet. Chef is beautiful in that it allowed me to look at the currently running development server, encode each part into actual instructions (roles, recipes, and cookbooks), and run those instructions to build a new server. While we put all of the backend on a single remote node for now, it will be fairly straightforward to break up our Chef recipe down the road and move parts of the architecture to independent nodes.

So, I have encoded the backend in Chef, and I have a virtual server running on Linode. To deploy my Chef instructions (we use a hosted Chef server), all I do is run:

knife bootstrap root@{node-ip-address} -r 'role[{name-of-node-role}]'

Voila! I have a running backend server on Linode. I won’t go into much detail on the magic that is Chef here, because I want to get back to Git for data. It was at this point that I wanted a way to populate data using Chef. Sure, I could run Dropbox on the server, have it pull data statically hosted elsewhere, or use any number of more specialized tools. Git is a warm cozy blanket, though. It offers version management, branching, and tagging, and we already have it, know the methods, dream about it. So, using the remote git repo we set up above, I created a resource in our mol-backend cookbook. What this resource does is check out the data from the git repo whenever I bootstrap a new or existing backend node. Here is what it looks like in Chef:

# download mol data and check out a specific version/branch
# record whether the data directory already existed before this run
node.set['mol']['node_existed'] = FileTest.exists?(node[:mol][:base_data_dir])

execute "fetch data from git data repo" do
  command "git clone #{node[:mol][:remote_data_repo]} #{node[:mol][:base_data_dir]}"
  not_if { node['mol']['node_existed'] }
end

execute "switch to specified branch/tag of data repo" do
  command "cd #{node[:mol][:base_data_dir]} && git checkout #{node[:mol][:remote_data_branch]}"
end

execute "pull updates of data repo" do
  command "cd #{node[:mol][:base_data_dir]} && git pull #{node[:mol][:remote_data_repo]} #{node[:mol][:remote_data_checkout_point]}"
  only_if { node['mol']['node_existed'] }
end

A few things just happened. First, if this is the first time the node is being built, it performs a clone from the remote server we set up above over HTTP (no ssh creds needed), with the repo URL passed to the code via the node[:mol][:remote_data_repo] attribute.
Next, from inside the repo that was just cloned (and here it gets beautiful again), we tell it to check out a specific branch or tag, passed via the node[:mol][:remote_data_branch] attribute.
What we have just done is say: if we are executing this Chef build on a production server, we can have it check out a specific snapshot of our data, say the v0.1 we created above, while at the same time Aaron and I may have committed hundreds of changes to that data in development. Our development node can then check out those changes from, say, the master branch whenever we execute the Chef instructions on it. The next really great part is that we have already executed the Chef build command on our backend many times.
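The node attributes that drive all this aren’t shown above, so here is a hypothetical sketch of what a cookbook attributes file could look like, with a production node pinned to the tag and development tracking master. Only the attribute names come from the recipe; every value is illustrative:

```ruby
# hypothetical mol-backend attributes; values are placeholders, not ours
default['mol']['remote_data_repo']           = 'http://{your-server}/bigdata'
default['mol']['base_data_dir']              = '/srv/mol/bigdata'
default['mol']['remote_data_branch']         = 'v0.1'   # dev nodes: 'master'
default['mol']['remote_data_checkout_point'] = 'v0.1'
```

An environment or role override is then all it takes to keep production frozen while development follows the moving branch.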

In the normal case, we would have a problem if the node pulled in a big data set every time it rebuilt. But because we only run a git pull if the repo already existed (determined by the guards above), the backend node only spends time pulling in changes to the declared version or branch. Since every cloud service charges for bandwidth in and out, it is important to minimize waste. Beyond that more obvious example, if we decide we need to move the data around or rename files, we can track those changes using git; when we execute our Chef code again on the servers, instead of rebuilding those data resources from scratch, it will just replicate the moves and renames.
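One further bandwidth trim git offers (we aren’t using it yet, but it fits this setup): a shallow clone transfers only the latest snapshot rather than the full history, which matters once a data repo accumulates many revisions of large files. Sketched against a throwaway local repo; the file:// URL forces a real transfer so --depth applies:

```shell
# build a small repo with two commits, then clone only its tip
rm -rf /tmp/shallow-src /tmp/shallow-dst
git init /tmp/shallow-src && cd /tmp/shallow-src
git config user.name demo && git config user.email demo@example.com
git commit --allow-empty -m 'first import'
git commit --allow-empty -m 'second import'
# --depth 1 fetches only the tip commit; history stays on the server
git clone --depth 1 file:///tmp/shallow-src /tmp/shallow-dst
```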

Doubts of scalability

At this point, some more hard-core Git users will likely be shaking their heads at the use of Git for large directories full of binary data. In our case, though, and I think for many projects in this domain, it works. We don’t want to waste time hard-coding data structures before we have tested the data use and functional requirements. On top of that, we don’t have time to be developing version control systems for data structures and sources that are ultimately going to be rolled into our databases. But we need methods to quickly change, share, and track the structure of these datasets. For us, getting this solution in place frees up a lot of our time for developing more useful components and eventually getting away from I/O-heavy first passes.

Chef. Wow

I first picked up on this project through Anthony Goddard’s blog. In just the past couple of days, I feel like Chef has changed my view of deployment and architecture, and I would like to spend time in another post talking about how much I love it. This project is so wonderful. Anyone who does development on remote servers or instances should really check it out, especially if you have a toolkit of favorite technology layers that you find yourself deploying all the time. Partially for that reason, I think the technology offers a lot to our community, particularly through Cookbooks. Cookbooks are nuggets of coded functionality for your system architecture: reusable, easily linked (via git!) to a maintainer’s repository, and powerful. Stealing from a conversation earlier tonight with Aaron: we could relatively easily assemble Cookbooks that facilitate the sharing and publishing of, say, Darwin Core records, taxonomic databases, or annotation services, which anyone could then modify and use in their own systems. Love it.


Written by andrewxhill

June 9, 2011 at 10:35 pm

4 Responses


  1. Great post Andrew, really like how you’re using things like git and chef, but also that you’re sharing/documenting it here. Anthony has done some great work with chef, and I’m sprinting to catch up, but clearly when we’re looking at a more ‘cloud-based’ architecture, it really glosses over much of the complexity in setting up servers. This way things are usable now, instead of after a bunch of time spent in configuration. With chef you figure it out once and then automatically reuse. Also appreciate that you anticipated my comments about storing large binary data with git, hey, if you can justify it and it works for your stuff, have at it. Great stuff man!


    June 12, 2011 at 9:16 am

    • Thanks Phil. There is another thing I didn’t go into about our use of Git for binary data. When moving data like this around, most of the changes are in the form of name changes and directory structure changes, not changes to the data itself. My intuition is that Git works great for versioning, sharing, and tracking these type of changes regardless of the binary nature of the data.


      June 13, 2011 at 10:13 am

  2. Hey man, thanks for the mention. Chef has definitely changed everything about how I work.
    We should definitely think about getting something started to encourage some community collaboration for #biodiv cookbooks, here’s a start: would love to collaborate on some more cookbooks.

    Anthony Goddard

    June 12, 2011 at 12:52 pm

    • No problem. I actually heard about it before the post I linked, but still from you, so… I’ll ping you if we come up with any cookbooks for your repo soon.


      June 13, 2011 at 10:15 am

