Git for development data
In an earlier post I promised more discussion of the backend-frontend architecture we are using at MOL. One reason that has been slower to develop than I had hoped is that our architecture is evolving so rapidly that I haven’t been certain when something will be around long enough to call it core MOL. Today, though, I’d like to show how Aaron Steele and I are using Git to manage rapidly evolving data.
For this project, we are trying to quickly pull together many different data types, structure them, and combine them in ways that will serve APIs, user interfaces, and analytics. No matter how much planning we do, we just aren’t going to get the structure and combinations right the first or even the Nth time. On top of that, we need to be able to move data and data structures between our two development laptops, development server, and production server, testing new features and restructuring that data quickly without breaking any component. Enter Git. Below, I’ll walk through the steps of setting up a privately hosted Git repository and managing changes to the data effectively. At the end, I will also cover how we are using Chef to keep application servers up-to-date with code and data simultaneously.
The first thing you are going to need is a server to host the remote repository: one that you have ssh access to and that has a static IP.
Remote repository
First, ssh into the server. Next, you will want to set up a user for git.
adduser git
passwd git
Next, inside your /home/git directory, create a new repo
mkdir bigdata
cd bigdata
git init --bare
For our uses, I wanted to be able to clone this repo on any machine without ssh access. For that, you need to expose it via HTTP. We use Nginx for all of our servers, so to enable this we just need to add the git repo directory we just created as a location in the Nginx sites-enabled config. The important part,
location /bigdata {
    alias /home/git/bigdata;
    allow all;
}
At this point, be sure to restart Nginx. You can test that it is working by loading the description in your browser,
http://yoursite.org/bigdata/description
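One thing worth checking if the HTTP clone fails: serving a bare repo as plain static files, as the Nginx alias above does, relies on git's "dumb" HTTP transport, which needs the index files written by `git update-server-info`. The sketch below enables the stock post-update hook so those files are refreshed after each push. It runs against a throwaway repo in a temp directory rather than the real /home/git/bigdata; the temp paths and the `REPO` variable are illustrative assumptions.

```shell
#!/bin/sh
set -e

# Throwaway stand-in for /home/git/bigdata (assumption: use the real path on your server).
WORK=$(mktemp -d)
REPO="$WORK/bigdata"
mkdir "$REPO"
cd "$REPO"
git init --bare --quiet

# Git ships a sample hook that simply runs `git update-server-info`.
# Renaming it activates it, so the index files are refreshed after every push.
if [ -f hooks/post-update.sample ]; then
    mv hooks/post-update.sample hooks/post-update
else
    # Fallback in case the sample hooks were not installed with this git.
    printf '#!/bin/sh\nexec git update-server-info\n' > hooks/post-update
fi
chmod +x hooks/post-update

# Generate the files once by hand for the current state of the repo.
git update-server-info
```

On the real server the hook should be owned by the git user so that it runs on pushes made over ssh.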
Local machine
Now you’ll want to set up a repository on your local machine for the data. In our case, our backend architecture relies on the data being in a particular place, so on my local machine I put the repo in that same spot. That way, when I run my dev server, the data is just there.
mkdir bigdata
cd bigdata
git init
Next, standard git procedure, get it started by,
touch README
git add README
git commit -m 'First Commit'
Finally, we want to tell the local git repo that there is a remote git repo waiting for its knowledge, so,
git remote add origin git@yoursite.org:bigdata
This is where you will need SSH access to your remote server. If you are not running SSH on port 22, it might help to add the actual port to your ssh config so it is used by default,
nano ~/.ssh/config
# now add the lines
Host yoursite.org
    Port {your-port-#-here}
Finally, we can push our changes to the remote server.
git push origin master
Function
Now, we might want to add a bunch of data to our repo. In MOL, we are playing with ~1.5-2 gigs of data right now. That number will grow, but probably not much faster than the rate at which we finalize structures and databases. For now, I have a directory structure for all the types of data we are playing with. I just drop that into my bigdata folder, commit the changes, and git push. That data is now on our shared remote repo. If I push changes to the data structure and change features in the application, Aaron only needs to pull each of the repos (our code has its own repo) and everything should be bug free.
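The round trip described above can be sketched end to end with throwaway repos. Everything here lives in a temp directory: a local bare repo stands in for the remote server, and the names `alice` and `bob`, the `species/ranges.csv` file, and the commit identity are made up for illustration.

```shell
#!/bin/sh
set -e

WORK=$(mktemp -d)
# Identity via environment so the sketch does not depend on global git config.
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.org
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.org

# A local bare repo stands in for git@yoursite.org:bigdata.
git init --bare --quiet "$WORK/remote.git"

# First laptop: drop data into the repo, commit, push.
git clone --quiet "$WORK/remote.git" "$WORK/alice"
cd "$WORK/alice"
mkdir species
echo "id,name" > species/ranges.csv
git add -A
git commit --quiet -m "Add species range data"
git push --quiet origin HEAD

# Second laptop: clone once.
git clone --quiet "$WORK/remote.git" "$WORK/bob"

# First laptop restructures the data; git records the move as a rename.
git mv species/ranges.csv species/range_maps.csv
git commit --quiet -m "Rename range data file"
git push --quiet origin HEAD

# Second laptop: a pull replays the rename instead of re-downloading the file.
cd "$WORK/bob"
git pull --quiet --ff-only
```

After the final pull, the second working copy contains `species/range_maps.csv` and no longer contains `species/ranges.csv`.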
Next, we will want to make this available to both development servers and production servers. This is where the magic of Git+Chef comes in. By using Git tags, I can basically freeze the data as it is right now, while still committing changes to the branch. So, while I may want to push changes in code to a production server, I probably don’t want it pulling in or restructuring data unless I tell it to explicitly.
So once I have my data in my local git repo and my application works, I can commit and add a tag,
git tag -a v0.1
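Two details are easy to miss here: `git tag -a` without `-m` opens an editor for the tag message, and tags are not sent by a plain `git push`, so they have to be pushed explicitly. A minimal sketch against a throwaway local remote follows; the paths, the commit identity, and the tag message are placeholders.

```shell
#!/bin/sh
set -e

WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.org
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.org

# Local bare repo standing in for the shared remote.
git init --bare --quiet "$WORK/remote.git"
git clone --quiet "$WORK/remote.git" "$WORK/local"
cd "$WORK/local"

echo "v1 layout" > README
git add README
git commit --quiet -m "First data layout"
git push --quiet origin HEAD

# Freeze the current state as an annotated tag and publish it.
git tag -a v0.1 -m "Data snapshot for production"
git push --quiet origin v0.1

# Development continues on the branch; the tag keeps pointing at the old commit.
echo "v2 layout" > README
git commit --quiet -am "Restructure data"
git push --quiet origin HEAD

TAGGED=$(git rev-parse 'v0.1^{commit}')
TIP=$(git rev-parse HEAD)
REMOTE_TAGS=$(git ls-remote --tags origin)
```

`TAGGED` and `TIP` end up as different commits, and `git show v0.1:README` still prints the frozen content.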
Using Chef (I’ll come back to Git in a sec)
A few days ago, we decided it was time to move the MOL backend off of a local development server and into a more easily scalable infrastructure. On top of the fact that I had to migrate all the services, application code, and data to a new server, I realized that while we were developing the backend to be modular so that different components could eventually scale independently (database versus long running tasks versus application interfaces), we didn’t need a ton of computing yet. Chef is beautiful in that it allowed me to look at the currently running development server, encode each part into actual instructions (roles, recipes, and cookbooks), and run those instructions to build a new server. While we put all of the backend on one remote node for now, it will be fairly straightforward to break up our Chef recipe down the road and move parts of the architecture to independent nodes in the backend.
So, I have encoded the backend in Chef, and I have a virtual server running on Linode. To deploy my Chef instructions (we use http://opscode.com/ to host our Chef) all I do is run,
knife bootstrap root@{node-ip-address} -r 'role[{name-of-node-role}]'
Voila! I have a running backend server on Linode. I won’t go into much detail on the magic that is Chef here because I want to get back to Git for data. It was at this point that I wanted a way to populate data using Chef. Sure, I could run Dropbox on the server, have it pull data statically hosted elsewhere, or, I’m sure, use more specialized tools. Git is a warm cozy blanket though. It offers version management, branching, tagging. We already have it, know the methods, dream about it. So, using the remote git repo we set up above, I created a ‘resource’ in our mol-backend cookbook. When I bootstrap a new or existing backend node, this resource checks out the data from the git repo. Here is what it looks like in Chef,
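# clone the mol data repo and check out a specific version/branch
execute "fetch data from git data repo" do
  command "git clone #{node[:mol][:remote_data_repo]} #{node[:mol][:base_data_dir]}"
  # note: this assignment is evaluated when the recipe is compiled, independent of the guard below
  node.set['mol']['node_existed'] = false
  # skip the clone when the data directory already exists
  not_if { FileTest.exists?(node[:mol][:base_data_dir]) }
end
# runs on every Chef run: switch the working copy to the configured branch or tag
execute "switch to specified branch/tag of data repo" do
  command "cd #{node[:mol][:base_data_dir]} && git checkout #{node[:mol][:remote_data_branch]}"
end
# runs on every Chef run: pull only what has changed since the last run
execute "pull updates of data repo" do
  command "cd #{node[:mol][:base_data_dir]} && git pull #{node[:mol][:remote_data_repo]} #{node[:mol][:remote_data_checkout_point]}"
end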
A few things just happened. First, if this is the first time the node is being built, it clones the remote repo we set up above over HTTP (no ssh creds needed), using the URL passed to the code via a variable,
#{node[:mol][:remote_data_repo]}
Next, the second block performs a checkout of the data, whether or not the repo was just cloned. Here it gets beautiful again: from inside the repo, we tell it to check out a specific branch or tag, passed via the variable,
#{node[:mol][:remote_data_branch]}
What we have just said is this: if we are executing this Chef build on a production server, we can have it check out a specific snapshot of our data, say the v0.1 we created above. At the same time, Aaron and I may have committed hundreds of changes to that data that we are still working on in development. Our development node can check out those changes from, say, the master branch whenever we execute the Chef instructions on it. The next really great part is that we have executed the Chef build command on our backend many times already.
Normally, we might have a problem if the node pulled in a big data set every time it rebuilds. But because the clone is skipped when the git repo already exists (the ‘not_if’ guard above) and later runs only use ‘git pull’, the backend node spends time only on changes to the declared version or branch. Since most cloud services charge for bandwidth in and out, it is important to minimize wastage. There is a less obvious benefit too: if we decide we need to move the data around, or rename files, we can track those changes using git. When we execute our Chef code again on the servers, instead of rebuilding those data resources from scratch, it will just replicate the moves and renames.
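The clone-once, update-afterwards behavior can also be tried outside Chef. The sketch below is a plain shell approximation, not our recipe: `sync_data` is a hypothetical helper, it uses `git fetch` plus a fast-forward merge in place of the recipe's `git pull`, and the repos, file names, and commit identity are throwaway placeholders in a temp directory.

```shell
#!/bin/sh
set -e

# sync_data REPO_URL DATA_DIR REF (hypothetical helper, not part of Chef or git)
sync_data() {
    repo_url=$1
    data_dir=$2
    ref=$3

    # Same idea as the recipe's not_if guard: clone only on the first run.
    if [ ! -d "$data_dir/.git" ]; then
        git clone --quiet "$repo_url" "$data_dir"
    fi

    (
        cd "$data_dir"
        # Transfers only objects that are not already present locally.
        git fetch --quiet --tags origin
        git checkout --quiet "$ref"
        # A branch is fast-forwarded to the remote tip; a tag stays where it is.
        if git show-ref --verify --quiet "refs/remotes/origin/$ref"; then
            git merge --quiet --ff-only "origin/$ref"
        fi
    )
}

# Throwaway source repo standing in for the shared data repo.
WORK=$(mktemp -d)
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.org
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.org
git init --quiet "$WORK/source"
cd "$WORK/source"
echo one > data.txt
git add data.txt
git commit --quiet -m "Snapshot"
git tag -a v0.1 -m "Production snapshot"
echo two > data.txt
git commit --quiet -am "Work in progress"
BRANCH=$(git symbolic-ref --short HEAD)

# "Production" follows the tag, "development" follows the branch.
sync_data "$WORK/source" "$WORK/prod" v0.1
sync_data "$WORK/source" "$WORK/dev" "$BRANCH"

# A later change reaches dev on the next run, while prod stays frozen.
echo three > data.txt
git commit --quiet -am "More work"
sync_data "$WORK/source" "$WORK/prod" v0.1
sync_data "$WORK/source" "$WORK/dev" "$BRANCH"
```

After the second round, the production copy still holds the tagged content while the development copy has the latest commit, and neither directory was cloned a second time.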
Doubts of scalability
At this point, some more hard-core Git users will likely be shaking their heads at the use of Git for large directories with lots of binary data. In our case though, and I think for many projects in this domain, it works. We don’t want to waste time hard coding data structures before we have tested the data use and functional requirements. On top of that, we don’t have time to be developing version control systems for data structures and sources that are ultimately going to be rolled into our databases. But we need methods to quickly change, share, and track the structure of these datasets. For us, getting this solution in place frees up a lot of our time for developing more useful components and eventually getting away from I/O heavy first passes.
Chef. Wow
I first picked up on this project through Anthony Goddard’s blog. In just the past couple of days I feel like Chef has changed my view of deployment and architecture. I would like to spend time in another post talking about how much I love Chef. This project is so wonderful. Anyone who does development on remote servers or instances should really check it out, especially if you have a toolkit of favorite technology layers that you find yourself deploying all the time. Partly for that reason, I think the technology offers a lot to our community, particularly in the use of Cookbooks. Cookbooks are nuggets of coded functionality for your system architecture. They are reusable, easily linked (via git!) to a maintainer’s repository, and powerful. Stealing from a conversation earlier tonight with Aaron, we could relatively easily assemble Cookbooks that facilitate the sharing and publishing of, say, Darwin Core records, taxonomic databases, or annotation services that anyone could then modify and use in their systems. Love it.
Cloud where appropriate
Over the next months and years, this blog will be a primary way that MOL will begin these discussions. One technical solution that we have been excited to write about is our approach to scalable architectures across MOL. The project has some diverse scaling challenges. At the most basic level, MOL will provide APIs and UIs that allow users and other projects to rapidly access high quality species distribution maps. But this is a major simplification of the scope, ambition and challenges of MOL. MOL is not just providing distribution maps; we are provisioning many diverse analyses across data types and scales, methods to deploy long-running analysis jobs, toolkits for expert users to harness those analyses to improve distribution maps, and finally, storage and versioning for many parts of the system.
From the beginning, the project has been guided by milestones that evolve over time. The goal is to finish one milestone before spending too much time on later milestones, while still allowing some fuzziness along the borders. So, with a small and overworked team of developers, we adopted an agile development strategy that supports fast iterations, rapid prototyping, and quick refactoring as the project moves forward. A part of that plan was a decision to minimize the time spent developing complex technology solutions that would likely need to be painfully refactored after reaching later milestones.
Early milestones deal with making data accessible and discoverable in standardized formats, with a focus on high quality metadata combined with very fast data access for visualization and search. This situation led us to make a bold decision: many of our databasing and back-end scaling solutions could be developed later, only after preliminary testing of data loads was available (and a more robust map of data relationships had been developed). On the other hand, our front end would need to be streamlined, fast, and able to handle a lot of caching from the outset so that we could start rolling out releases without knowledge of initial request volume. To handle this situation while also maximizing the benefits of cloud computing during these early stages, we decoupled the front end and back end of the system architecture.
For now, we built the back end using existing local hardware. Since the back end is never directly accessed by users, we can more easily predict the load based on how many analyses it is responsible for. When that load gets unmanageable on its current hardware, we will begin developing new storage strategies and redeploy the back end using a more scalable system. At its core though, the applications we have deployed on the back end will be directly reusable in conjunction with novel storage solutions and new hardware environments.
The same early milestones led us to feel that we would need to solve scalability challenges in the front end of MOL more immediately. The front end needed to expose data (primarily data about what data sets and data types we have available) and provide access to processed forms of that data for our web-based UIs (e.g. map tiles). It also needed to handle widely varying loads while managing the number of requests it sends to our back end. Dealing with unpredictable and diverse client requests, we could not just deploy an app locally and ensure that it would run quickly night and day. For this reason we have developed a front-end application using the Python SDK on Google App Engine (GAE).
GAE provides apps with generous free quotas for data storage and CPU cycles, so during both development and slow days at MOL we can keep costs under a dollar. We manage to do this by never storing large data on the front end, instead taking advantage of the GAE Datastore to provide fast access to specific parts of the data that we want to query, such as metadata, taxonomy, and data relationships. We also take advantage of a large free quota for the Memcache API to handle caching small pieces of data, such as map tiles, that the front end pulls from the back end using REST APIs. Now, when MOL starts rolling out features, we can be confident that our app will scale with increased loads!
The MOL architecture has one foot in the cloud while keeping another firmly in hardware sitting at the University of Colorado (for now). This hybrid system has reduced our early development costs while still ensuring that we can easily scale up as we announce early releases. By reducing the responsibilities of the front end to storing and searching metadata and caching data requests, we have been able to develop our system cheaply while already working on cloud solutions that will scale far into the future. I’ll devote a later blog post to a concrete example, likely focusing on how we serve species-ecoregion occurrence polygon data for the mapping user interface.
