The Map of Life

integrating species distribution knowledge

Cloud where appropriate

Here at the Map of Life, we’ve been cranking away at development for some months now. We are finalizing some of our beta-release user interfaces (UIs) and application programming interfaces (APIs), but in the meantime we wanted to start opening up some of our development ideas for wider discussion. Throughout these discussions we have reiterated the importance of explicitly opening our methods and solutions to the community: not only because we think they are interesting, but because only by starting these conversations can we receive the feedback and knowledge that could be key to making MOL a success.

Over the next months and years, this blog will be a primary way for MOL to begin these discussions. One technical solution that we have been excited to write about is our approach to scalable architectures across MOL. The project has some diverse scaling challenges. At the most basic level, MOL will provide APIs and UIs that allow users and other projects to rapidly access high-quality species distribution maps. But this is a major simplification of the scope, ambition, and challenges of MOL. MOL is not just serving distribution maps: we are also provisioning many diverse analyses across data types and scales, methods for deploying long-running analysis jobs, toolkits for expert users to harness those analyses to improve distribution maps, and finally storage and versioning for many parts of the system.

From the beginning, the project has been guided by milestones that evolve over time. The goal is to finish one milestone before spending too much time on later ones, while still allowing some fuzziness along the borders. So, with a small and overworked team of developers, we adopted an agile development strategy that supports fast iterations, rapid prototyping, and quick refactoring as the project moves forward. Part of that plan was a decision to minimize the time spent developing complex technology solutions that would likely need to be painfully refactored after reaching later milestones.

Early milestones deal with making data accessible and discoverable in standardized formats, with a focus on high-quality metadata combined with very fast data access for visualization and search. This led us to make a bold decision: many of our databasing and back-end scaling solutions could be developed later, only after preliminary data loads had been tested (and a more robust map of data relationships had been developed). On the other hand, our front end would need to be streamlined, fast, and able to handle a lot of caching from the outset, so that we could start rolling out releases without knowing the initial request volume. To handle this situation while also maximizing the benefits of cloud computing during these early stages, we decoupled the front end and back end of the system architecture.

For now, we have built the back end using existing local hardware. Since the back end is never directly accessed by users, we can more easily predict its load based on how many analyses it is responsible for. When that load becomes unmanageable on its current hardware, we will begin developing new storage strategies and redeploy the back end on a more scalable system. Crucially, the applications we have deployed on the back end will be directly reusable in conjunction with novel storage solutions and new hardware environments.

The same early milestones led us to feel that we would need to solve scalability challenges in the front end of MOL more immediately. The front end needs to expose data (primarily metadata about which data sets and data types are available) and provide access to processed forms of that data for our web-based UIs (e.g., map tiles). It must handle widely varying loads while limiting the number of requests it sends to our back end. Dealing with unpredictable and diverse client requests, we could not simply deploy an app locally and ensure that it would run quickly night and day. For this reason we have developed a front-end application using the Python SDK on Google App Engine (GAE).
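To illustrate the decoupling, a front-end request for a map tile can be translated into a REST call against the back end. A minimal sketch in plain Python follows; the URL pattern, host name, and data-set name here are hypothetical, not MOL’s actual API:

```python
def backend_tile_url(base, dataset, z, x, y):
    """Map a client tile request onto a (hypothetical) back-end REST endpoint.

    The front end never stores large data itself; it forwards well-formed
    requests like this one to the back end and caches the responses.
    """
    return "{0}/api/tiles/{1}/{2}/{3}/{4}.png".format(base, dataset, z, x, y)

# Example: a zoom-level-3 tile for an illustrative data set.
url = backend_tile_url("http://backend.example.org", "bird-ranges", 3, 2, 5)
print(url)  # http://backend.example.org/api/tiles/bird-ranges/3/2/5.png
```

Keeping the back-end URL scheme behind one small function like this means the front end does not care where the back end lives, which is what lets us swap hardware or storage underneath it later.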

GAE provides apps with generous free quotas for data storage and CPU cycles, so during both development and slow days at MOL we can keep costs under a dollar. We manage this by never storing large data on the front end; instead we take advantage of the GAE Datastore to provide fast access to the specific parts of the data we want to query, such as metadata, taxonomy, and data relationships. We also take advantage of a large free quota for the Memcache API to cache small pieces of data, such as map tiles, that the front end pulls from the back end over REST APIs. Now, when MOL starts rolling out features, we can be confident that our app will scale with increased loads!

The MOL architecture has one foot in the cloud while keeping the other firmly on hardware sitting at the University of Colorado (for now). This hybrid system has reduced our early development costs while still ensuring that we can easily scale up as we announce early releases. By reducing the responsibilities of the front end to storing and searching metadata and caching data requests, we have been able to develop our system cheaply while already working on cloud solutions that will scale far into the future. I’ll devote a later blog post to a concrete example, likely focusing on how we serve species-ecoregion occurrence polygon data for the mapping user interface.


Written by andrewxhill

April 21, 2011 at 7:22 pm

2 Responses


  1. Some further thinking and number crunching on colocation can be found in an excellent write-up here.
    Projects based at academic institutions (and similar environments) stand to take particular advantage of such an approach, in part because we have already paid for space and bandwidth through the overheads taken off the top of grants. In our case, we will have even greater options, allowing for redundancy at different locations across the country.


    April 26, 2011 at 1:00 am

  2. […] an earlier post I had promised more discussion of the backend-frontend architecture we are using at MOL. One of the reasons that has been slower […]

