The Map of Life

integrating species distribution knowledge

Archive for the ‘overview’ Category

Git for development data

In an earlier post I promised more discussion of the backend-frontend architecture we are using at MOL. That has been slower to arrive than I had hoped because our architecture is evolving so rapidly that I haven’t been certain what will be around long enough to call core MOL. Today, though, I’d like to show how Aaron Steele and I are using Git to manage rapidly evolving data.

For this project, we are trying to quickly pull together many different data types, structure them, and combine them in ways that will serve APIs, user interfaces, and analytics. No matter how much planning we do, we just aren’t going to get the structure and combinations right the first or even the Nth time. On top of that, we need to be able to move data and data structures between our two development laptops, the development server, and the production server, testing new features and restructuring that data quickly without breaking any component. Enter Git. Below, I’ll walk through the steps of setting up a privately hosted Git repository and managing changes to the data effectively. At the end, I will also cover how we are using Chef to keep application servers up to date with code and data simultaneously.

The first thing you are going to need is a remote repository, a server that you have ssh access to with static IP.

Remote repository

First, ssh into the server. Next, you will want to set up a user for git.

adduser git
passwd git

Next, inside your /home/git directory, create a new repo
mkdir bigdata
cd bigdata
git init --bare

For our uses, I wanted to be able to clone this repo on any machine without ssh access. For that, you need to expose it via HTTP. We use Nginx for all of our servers, so to enable this we just need to add the git repo directory we just created as a location in the Nginx sites-enabled config. The important part:
location /bigdata {
    alias   /home/git/bigdata;
    allow  all;
}

At this point, be sure to restart Nginx. You can test that it is working by loading the description in your browser,

http://yoursite.org/bigdata/description
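
One detail worth checking at this stage: when a bare repo is served as plain static files, git clients start a clone by requesting info/refs, and git only writes that file when git update-server-info runs. The sketch below uses a scratch directory as a stand-in for /home/git/bigdata (the paths and the demo identity are assumptions for illustration) and installs a post-update hook so the file is refreshed after every push. The hook body is the same one-liner that git ships as post-update.sample.

```shell
#!/bin/sh
set -eu

# Throwaway identity so the demo commit works on a fresh machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# Scratch stand-in for /home/git/bigdata on the server.
work=$(mktemp -d)
bare="$work/bigdata"
git init --bare --quiet "$bare"

# Regenerate info/refs and objects/info/packs after every push.
mkdir -p "$bare/hooks"
printf '#!/bin/sh\nexec git update-server-info\n' > "$bare/hooks/post-update"
chmod +x "$bare/hooks/post-update"

# Push one commit from a working copy to trigger the hook.
git init --quiet "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
touch README
git add README
git commit --quiet -m 'First Commit'
git push --quiet "$bare" master

# This is the file a static HTTP clone asks for first.
cat "$bare/info/refs"
```

On the real server the equivalent would be enabling the same hook inside /home/git/bigdata/hooks; if a clone over HTTP fails while the description file loads fine, a missing info/refs is a likely cause.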

Local machine

Now you’ll want to set up a repository on your local machine for the data. In our case, our backend architecture relies on the data in a particular place, so on my local machine, I place it in the same spot so that by running my dev server, it is just there.

mkdir bigdata
cd bigdata
git init

Next, standard git procedure, get it started by,
touch README
git add README
git commit -m 'First Commit'

Next, we want to tell the local git repo that there is a remote git repo waiting for its knowledge, so,
git remote add origin git@yoursite.org:bigdata

This is where you will need SSH access to your remote server. If you are not running SSH on port 22, it might help to add the actual port to your ssh config so that it is used by default,
nano ~/.ssh/config
#now add the lines
Host yoursite.org
   Port {your-port-#-here}

Finally, we can push our changes to the remote server.
git push origin master

Function

Now, we might want to add a bunch of data to our repo. In MOL, we are playing with ~1.5-2 gigs of data right now. That number will grow, but probably not much faster than the rate at which we finalize structures and databases. For now, I have a directory structure for all the types of data we are playing with. I just drop that into my bigdata folder, commit the changes, and git push. That data is now on our shared remote repo. If I push changes to the data structure and change features in the application, Aaron need only pull each of the repos (our code has its own repo) and everything should be bug free.
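
The round trip between two laptops can be sketched as follows. Everything here is hypothetical scaffolding: a local bare repo stands in for git@yoursite.org:bigdata, the andrew and aaron directories stand in for the two machines, and the ranges/mammals layout is invented for the example. The second half shows a restructure arriving on the other machine as a rename rather than a fresh copy.

```shell
#!/bin/sh
set -eu

# Throwaway identity so the demo commits work on a fresh machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

work=$(mktemp -d)
# Local bare repo standing in for the shared remote.
git init --bare --quiet "$work/remote.git"

# First laptop: drop a data directory in, commit, push.
git init --quiet "$work/andrew"
cd "$work/andrew"
git symbolic-ref HEAD refs/heads/master
git remote add origin "$work/remote.git"
mkdir -p ranges/mammals                       # hypothetical layout
echo 'placeholder' > ranges/mammals/example.txt
git add ranges
git commit --quiet -m 'Add mammal range data'
git push --quiet origin master

# Second laptop: clone once.
git clone --quiet --branch master "$work/remote.git" "$work/aaron"

# First laptop restructures the data and pushes again.
git mv ranges/mammals ranges/mammalia
git commit --quiet -m 'Rename mammals directory'
git push --quiet origin master

# Second laptop: a pull replays the rename locally.
git -C "$work/aaron" pull --quiet --ff-only origin master
ls "$work/aaron/ranges"
```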

Next, we will want to make this available to both development servers and production servers. This is where the magic of Git+Chef comes in. By using Git tags, I can basically freeze the data as it is right now, while still committing changes to the branch. So, while I may want to push changes in code to a production server, I probably don’t want it pulling in or restructuring data unless I tell it to explicitly.

So once I have my data in my local git repo and my application works, I can commit and add a tag,

git tag -a v0.1
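
To see the freeze in action, here is a small sketch in a scratch repo (the file name, messages, and demo identity are made up for illustration). It tags a snapshot, keeps committing on master, and confirms that the tag still points at the earlier commit. Passing -m supplies the annotation inline; without it, git tag -a opens an editor.

```shell
#!/bin/sh
set -eu

# Throwaway identity so the demo commits and tag work on a fresh machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

work=$(mktemp -d)
git init --quiet "$work/bigdata"
cd "$work/bigdata"
git symbolic-ref HEAD refs/heads/master

echo 'first layout' > data.txt
git add data.txt
git commit --quiet -m 'Initial data layout'

# Annotated tag marking the snapshot production should use.
git tag -a v0.1 -m 'Data snapshot for production'

# Development continues on master; the tag does not move.
echo 'second layout' > data.txt
git commit --quiet -am 'Restructure data'

frozen=$(git rev-parse 'v0.1^{commit}')
latest=$(git rev-parse master)
echo "v0.1   -> $frozen"
echo "master -> $latest"

# Tags are not sent by a plain branch push; publish one explicitly with:
#   git push origin v0.1
```

Remember that last step on the real repo: a server cloning over HTTP can only check out v0.1 once the tag has been pushed to the remote.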

Using Chef (I’ll come back to Git in a sec)

A few days ago, we decided it was time to move the MOL backend off of a local development server and into a more easily scalable infrastructure. On top of the fact that I had to migrate all the services, application code, and data to a new server, I realized that while we were developing the backend to be modular so that different components could eventually scale independently (database versus long-running tasks versus application interfaces), we didn’t need a ton of computing yet. Chef is beautiful in that it allowed me to look at the currently running development server, encode each part into actual instructions (roles, recipes, and cookbooks), and run those instructions to build a new server. While we put all of the backend on a single remote node for now, it will be fairly straightforward to break up our Chef recipe down the road and move parts of the architecture to independent nodes in the backend.

So, I have encoded the backend in Chef, and I have a virtual server running on Linode. To deploy my Chef instructions (we use http://opscode.com/ to host our Chef) all I do is run,

knife bootstrap root@{node-ip-address}  -r 'role[{name-of-node-role}]'

Voila! I have a running backend server on Linode. I won’t go into much detail on the magic that is Chef here because I want to get back to Git for data. It was at this point that I wanted a method to populate data using Chef. Sure, I could run Dropbox on the server, have it pull data statically hosted elsewhere, or, I’m sure, use more specialized tools. Git is a warm cozy blanket though. It offers version management, branching, tagging. We already have it, know the methods, dream about it. So, now that we have the remote git repo we set up above, I created a ‘resource’ in our mol-backend cookbook. When I bootstrap a new or existing backend node, this resource checks out the data from the git repo. Here is what it looks like in Chef,

# clone the MOL data repo on the first run only
execute "fetch data from git data repo" do
  command "git clone #{node[:mol][:remote_data_repo]} #{node[:mol][:base_data_dir]}"
  # note: this assignment runs whenever the recipe is compiled, not only when the clone executes
  node.set['mol']['node_existed'] = false
  # skip the clone when the data directory is already present
  not_if { ::File.exist?(node[:mol][:base_data_dir]) }
end
# check out the branch or tag this node should track
execute "switch to specified branch/tag of data repo" do
  command "cd #{node[:mol][:base_data_dir]} && git checkout #{node[:mol][:remote_data_checkout_point]}"
end
# fetch only the changes made since the last run
execute "pull updates of data repo" do
  command "cd #{node[:mol][:base_data_dir]} && git pull #{node[:mol][:remote_data_repo]} #{node[:mol][:remote_data_checkout_point]}"
end

A few things just happened. First, if this is the first time the node is being built, it performs a clone of the remote repo we set up above over HTTP (no ssh creds needed), with the repo URL passed to the code via a variable,

#{node[:mol][:remote_data_repo]}

Next, whether or not the repo was just cloned, the second block performs a checkout of the data. Here it gets beautiful again: from inside the repo, we tell it to check out a specific branch or tag, passed via the variable,

#{node[:mol][:remote_data_checkout_point]}

What we have just done is say that if we are executing this Chef build on a production server, we can have it check out a specific snapshot of our data, say the v0.1 we created above. At the same time, Aaron and I may have committed hundreds of changes to that data that we are still working on in development. Our development node can check out those changes from, say, the master branch whenever we execute the Chef instructions on it. The next really great part is that we have executed the Chef build command on our backend many times already.

Normally, we would have a problem if the node pulled in a big data set every time it rebuilds. But because the clone is skipped when the repo already exists (the ‘not_if’ guard above) and a ‘git pull’ transfers only what has changed, the backend node spends time only on changes to the declared version or branch. Since cloud providers typically charge for bandwidth in and out, it is important to minimize wastage. A less obvious benefit: if we decide we need to move the data around or rename files, git tracks those changes. When we execute our Chef code again on the servers, instead of rebuilding those data resources from scratch, it will just replicate the moves and renames.
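
For readers who don’t use Chef, the same clone-once, pull-thereafter logic can be sketched as a plain shell function. sync_data is a hypothetical helper, not part of the MOL cookbook, and the demo repos are scratch directories. It adds --ff-only to the pull as a precaution so a server never creates merge commits; the Chef resource above does not do that.

```shell
#!/bin/sh
set -eu

# Throwaway identity so the demo commits and tag work on a fresh machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# sync_data REPO DIR REF (hypothetical helper mirroring the three Chef steps)
sync_data() {
    src=$1; dest=$2; ref=$3
    # 1. Clone only when the directory is missing, like the not_if guard.
    [ -d "$dest" ] || git clone --quiet "$src" "$dest"
    # 2. Switch to the requested branch or tag.
    git -C "$dest" checkout --quiet "$ref"
    # 3. Transfer only what changed since the last run.
    git -C "$dest" pull --quiet --ff-only "$src" "$ref"
}

# Demo origin: a tagged snapshot plus one later commit on master.
work=$(mktemp -d)
git init --quiet "$work/origin"
cd "$work/origin"
git symbolic-ref HEAD refs/heads/master
echo a > a.txt; git add a.txt; git commit --quiet -m 'snapshot'
git tag -a v0.1 -m 'production data'
echo b > b.txt; git add b.txt; git commit --quiet -m 'work in progress'

sync_data "$work/origin" "$work/prod" v0.1     # production: frozen tag
sync_data "$work/origin" "$work/dev" master    # development: moving branch

# A later run on the dev node picks up only the new commit.
echo c > c.txt; git add c.txt; git commit --quiet -m 'more work'
sync_data "$work/origin" "$work/dev" master

ls "$work/prod" "$work/dev"
```

After the run, prod holds only the tagged snapshot while dev has all three files, which is the production/development split described above.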

Doubts of scalability

At this point, some more hard-core Git users will likely be shaking their heads at the use of Git for large directories with lots of binary data. In our case though, and I think for many projects in this domain, it works. We don’t want to waste time hard coding data structures before we have tested the data use and functional requirements. On top of that, we don’t have time to be developing version control systems for data structures and sources that are ultimately going to be rolled into our databases. But we need methods to quickly change, share, and track the structure of these datasets. For us, getting this solution in place frees up a lot of our time for developing more useful components and eventually getting away from I/O heavy first passes.

Chef. Wow

I first picked up on this project through Anthony Goddard’s blog. In just the past couple of days I feel like Chef has changed my view of deployment and architecture. I would like to spend time in another post talking about how much I love Chef. This project is so wonderful. Anyone who does development on remote servers or instances should really check this out, especially if you have a toolkit of your favorite technology layers that you find yourself deploying all the time. Partially for that reason, I think the technology offers a lot to our community, particularly in the use of Cookbooks. Cookbooks are nuggets of coded functionality for your system architecture. They are reusable, easily linked (via git!) to a maintainer’s repository, and powerful. Stealing from a conversation earlier tonight with Aaron, we can relatively easily assemble Cookbooks that would facilitate the sharing and publishing of, say, Darwin Core records, taxonomic databases, or annotation services that anyone could then modify and use in their systems. Love it.

Written by andrewxhill

June 9, 2011 at 10:35 pm

Cloud where appropriate

Here at the Map of Life, we’ve been cranking away at development for some months now. We are finalizing some of our beta release user interfaces (UIs) and application programming interfaces (APIs), but in the meantime we wanted to start opening up some of our development ideas for a wider discussion. During discussions we have reiterated the importance of explicitly opening our methods and solutions to the community. Not only because we think our methods and solutions are interesting, but because we think that only by starting these conversations can we ever receive feedback and knowledge that could be key to making MOL a success.

Over the next months and years, this blog will be a primary way that MOL will begin these discussions. One technical solution that we have been excited to write about is our approach to scalable architectures across MOL. The project has some diverse scaling challenges. At the most basic level, MOL will provide APIs and UIs that allow users and other projects to rapidly access high quality species distribution maps. But this is a major simplification of the scope, ambition and challenges of MOL. MOL is not just providing distribution maps. We are provisioning many diverse analyses across data types and scales, methods to deploy long-running analysis jobs, toolkits for expert users to harness those analyses to improve distribution maps and, finally, storage and versioning for many parts of the system.

From the beginning, the project has been guided by milestones that evolve over time. The goal is to finish one milestone before spending too much time on later milestones, while still allowing some fuzziness along the borders. So, with a small and overworked team of developers, we adopted an agile development strategy that supports fast iterations, rapid prototyping, and quick refactoring as the project moves forward. A part of that plan was a decision to minimize the time spent developing complex technology solutions that would likely need to be painfully refactored after reaching later milestones.

Early milestones deal with making data accessible and discoverable in standardized formats, with a focus on high quality metadata combined with very fast data access for visualization and search. This situation led us to make a bold decision: many of our databasing and back-end scaling solutions could be developed later, only after preliminary testing of data loads was available (and a more robust map of data relationships had been developed). On the other hand, our front end would need to be streamlined, fast, and handle a lot of caching from the outset so that we can start rolling out releases without knowledge of initial request volume. To handle this situation while also maximizing the benefits of cloud computing during these early stages, we decoupled the front end and back end of the system architecture.

For now, we built the back end using existing local hardware. Since the back end is never directly accessed by users, we can more easily predict the load based on how many analyses it is responsible for. When that load gets unmanageable on its current hardware, we will begin developing new storage strategies and redeploy the back end using a more scalable system. At its core though, the applications we have deployed on the back end will be directly reusable in conjunction with novel storage solutions and new hardware environments.

The same early milestones led us to feel that we would need to solve scalability challenges in the front end of MOL more immediately. The front end needed to expose data (primarily data about what data sets and data types we have available) and provide access to processed forms of that data for our web-based UIs (e.g. map tiles). It also needed to handle widely varying loads while managing the number of requests it sends to our back end. Dealing with unpredictable and diverse client requests, we could not just deploy an app locally and ensure that it would run quickly night and day. For this reason we have developed a front-end application using the Python SDK on Google App Engine (GAE).

GAE provides apps with generous free quotas for data storage and CPU cycles, so during both development and slow days at MOL we can keep costs under a dollar. We manage to do this by never storing large data on the front end, but instead taking advantage of the GAE Datastore to provide fast access to specific parts of the data that we want to query, such as metadata, taxonomy, and data relationships. We also take advantage of a large free quota for the Memcache API to handle caching small pieces of data, such as map tiles, that the front end pulls from the back end using REST APIs. Now, when MOL starts rolling out features, we can be confident that our app will scale with increased loads!

The MOL architecture has one foot in the cloud while keeping another firmly in hardware sitting at the University of Colorado (for now). This hybrid system has reduced our early development costs while still ensuring that we can easily scale up as we announce early releases. By reducing the responsibilities of the front end to storing and searching metadata and caching data requests, we have been able to develop our system cheaply while already working on cloud solutions that will scale far into the future. I’ll devote a later blog post to a concrete example, likely focusing on how we serve species-ecoregion occurrence polygon data for the mapping user interface.

Written by andrewxhill

April 21, 2011 at 7:22 pm

Map of Life Dream Team (and, hey, we are ready to blog!)

Map of Life has been chugging along for about 6 months now in its current configuration and now seems like as good a time as any to step back and consider how far we have come and what might be next. The team working on Map of Life is such an interesting one. Geographically we are spread out across the United States, from the East Coast (Yale University) to the Midwest (University of Kansas), Mountain West (University of Colorado) to the Pacific Coast (University of California, Berkeley). We are also diverse by country of origin (Australia, Germany, U.S.A.), academic training (computer science, ecology, evolution, informatics) and skill set (programming, systems engineering, informatics, macroecology, systematics, etc). A gratifying part of the first six months, for me, is that these differences and diversity have translated into a strong working relationship and collaborative spirit, where the strengths of the group, not the weaknesses, have multiplied. I think this likely reflects a strong impetus to meet regularly, as often as three times a week, via cell or Skype, to synchronize efforts. Plus, good peeps and, it turns out, we like working with each other!

So what have we accomplished with all this good will and great vibes? A lot, as it turns out! Much of that is “behind the scenes”. Andrew Hill and Aaron Steele have been bouncing great ideas back and forth about how to create an information architecture that is robust, scalable and efficient. We’ve put together a broad technological overview here: http://www.mappinglife.org/tech. As it stands now, and looking back, we have done a lot. First, we deployed a cloud-based copy of the Catalogue of Life, accumulated a large set of range maps for amphibians and mammals, checked taxonomy of those range maps against the Catalogue of Life database, and provided an initial mechanism to search those maps. Next we have developed the means to display range maps via Google Maps, using a map tiling tool named Mapnik, and developed some initial user interface frameworks and designs. We are currently polishing off access and display of species occurrence data points. All of this is great, but we are still treading in known waters. The excellent AmphibiaWeb project has also developed the means for displaying range maps and occurrence data points, for example.

Soon we will be pulling together some new types of distribution data such as “occurrence polygons” (places where species have been described via species lists) and habitat preferences such as “wet broadleaf forests” or “shortgrass prairie”. These are new challenges for storage, query and visualization. An even greater challenge will be trying to provide all these sources of knowledge in a single search and user interface. Exciting times for Map of Life! We are looking forward to having some demonstrations soon, so you can try out some Map of Life features. Stay tuned (and thanks for reading).

Written by Rob

April 6, 2011 at 11:38 am

Posted in overview
