The Map of Life

integrating species distribution knowledge

Archive for April 2011

Cloud where appropriate

Here at the Map of Life, we’ve been cranking away at development for some months now. We are finalizing some of our beta release user interfaces (UIs) and application programming interfaces (APIs), but in the meantime we wanted to start opening up some of our development ideas for wider discussion. Throughout those discussions we have reiterated the importance of explicitly opening our methods and solutions to the community, not only because we think they are interesting, but because only by starting these conversations can we receive the feedback and knowledge that could be key to making MOL a success.

Over the next months and years, this blog will be a primary way that MOL begins these discussions. One technical solution we have been excited to write about is our approach to scalable architectures across MOL. The project has some diverse scaling challenges. At the most basic level, MOL will provide APIs and UIs that allow users and other projects to rapidly access high quality species distribution maps. But this is a major simplification of the scope, ambition, and challenges of MOL. MOL is not just providing distribution maps: we are provisioning many diverse analyses across data types and scales, methods to deploy long-running analysis jobs, toolkits for expert users to harness those analyses to improve distribution maps, and finally, storage and versioning for many parts of the system.

From the beginning, the project has been guided by milestones that evolve over time. The goal is to finish one milestone before spending too much time on later ones, while still allowing some fuzziness along the borders. So, with a small and overworked team of developers, we adopted an agile development strategy that supports fast iterations, rapid prototyping, and quick refactoring as the project moves forward. Part of that plan was a decision to minimize the time spent developing complex technology solutions that would likely need to be painfully refactored after reaching later milestones.

Early milestones deal with making data accessible and discoverable in standardized formats, with a focus on high quality metadata combined with very fast data access for visualization and search. This situation led us to make a bold decision: many of our databasing and back-end scaling solutions could be developed later, only after preliminary testing of data loads was available (and a more robust map of data relationships had been developed). On the other hand, our front end would need to be streamlined, fast, and able to handle a lot of caching from the outset, so that we could start rolling out releases without knowledge of initial request volume. To handle this situation while also maximizing the benefits of cloud computing during these early stages, we decoupled the front end and back end of the system architecture.

For now, we built the back end using existing local hardware. Since the back end is never directly accessed by users, we can more easily predict its load based on the number of analyses it is responsible for. When that load becomes unmanageable on its current hardware, we will begin developing new storage strategies and redeploy the back end on a more scalable system. At its core, though, the applications we have deployed on the back end will be directly reusable in conjunction with novel storage solutions and new hardware environments.

The same early milestones led us to feel that we would need to solve scalability challenges in the front end of MOL more immediately. The front end needed to expose data – primarily data about what data sets and data types we have available – and provide access to processed forms of that data for our web-based UIs (e.g., map tiles). The front end must handle widely varying loads while managing the number of requests it needs to send to our back end. Dealing with unpredictable and diverse client requests, we could not just deploy an app locally and ensure that it would run quickly night and day. For this reason we have developed a front-end application using the Python SDK on Google App Engine (GAE).
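To make the front end's role concrete, here is a minimal sketch of the kind of lightweight metadata endpoint described above. The handler names, URL scheme, and the `METADATA` dictionary are purely illustrative (they are not MOL's real schema), and plain WSGI stands in for App Engine's webapp framework so the example is self-contained:

```python
# Sketch of a metadata-serving front-end endpoint. Plain WSGI stands
# in for GAE's webapp framework; the METADATA dict is a stand-in for
# data that would really live in the GAE Datastore.
import json

METADATA = {
    # illustrative entry, not a real MOL record
    "puma_concolor": {"type": "range_map", "source": "expert"},
}

def metadata_app(environ, start_response):
    """Return JSON metadata for /metadata/<species>, 404 otherwise."""
    path = environ.get("PATH_INFO", "")
    key = path.rsplit("/", 1)[-1]
    if path.startswith("/metadata/") and key in METADATA:
        body = json.dumps(METADATA[key]).encode("utf-8")
        status = "200 OK"
    else:
        body = b'{"error": "not found"}'
        status = "404 Not Found"
    start_response(status, [("Content-Type", "application/json")])
    return [body]
```

Keeping the front end down to small, stateless handlers like this is what lets App Engine spin up additional instances automatically as request volume grows.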

GAE provides apps with generous free quotas for data storage and CPU cycles, so during both development and slow days at MOL we can keep costs under a dollar. We manage to do this by never storing large data on the front end, instead taking advantage of the GAE Datastore to provide fast access to the specific parts of the data that we want to query, such as metadata, taxonomy, and data relationships. We also take advantage of a large free quota for the Memcache API to cache small pieces of data, such as map tiles, that the front end pulls from the back end using REST APIs. Now, when MOL starts rolling out features, we can be confident that our app will scale with increased loads!

The MOL architecture has one foot in the cloud while keeping the other firmly on hardware sitting at the University of Colorado (for now). This hybrid system has reduced our early development costs while still ensuring that we can easily scale up as we announce early releases. By reducing the responsibilities of the front end to storing and searching metadata and caching data requests, we have been able to develop our system cheaply while already working on cloud solutions that will scale far into the future. I’ll devote a later blog post to a concrete example, likely focusing on how we serve species-ecoregion occurrence polygon data for the mapping user interface.

Written by andrewxhill

April 21, 2011 at 7:22 pm

Map of Life Dream Team (and, hey, we are ready to blog!)

Map of Life has been chugging along for about 6 months now in its current configuration, and now seems like as good a time as any to step back and consider how far we have come and what might be next. The team working on Map of Life is such an interesting one. Geographically we are spread out across the United States, from the East Coast (Yale University) to the Midwest (University of Kansas), Mountain West (University of Colorado), and Pacific Coast (University of California, Berkeley). We are also diverse by country of origin (Australia, Germany, U.S.A.), academic training (computer science, ecology, evolution, informatics), and skill set (programming, systems engineering, informatics, macroecology, systematics, etc.). A gratifying part of the first six months, for me, is that this diversity has translated into a strong working relationship and collaborative spirit, where the strengths of the group, not the weaknesses, have multiplied. I think this likely reflects a strong impetus to meet regularly, as often as three times a week, via cell or Skype, to synchronize efforts. Plus, good peeps and – turns out – we like working with each other!

So what have we accomplished with all this good will and great vibes? A lot, as it turns out! Much of that is “behind the scenes”. Andrew Hill and Aaron Steele have been bouncing great ideas back and forth about how to create an information architecture that is robust, scalable, and efficient. Here’s a broad technological overview of where we stand: First, we deployed a cloud-based copy of the Catalogue of Life, accumulated a large set of range maps for amphibians and mammals, checked the taxonomy of those range maps against the Catalogue of Life database, and provided an initial mechanism to search those maps. Next, we developed the means to display range maps via Google Maps, using a map tiling tool named Mapnik, and developed some initial user interface frameworks and designs. We are currently polishing off access and display of species occurrence data points. All of this is great, but we are still treading in known waters. The excellent AmphibiaWeb project has also developed the means for displaying range maps and occurrence data points, for example.
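Serving Mapnik-rendered tiles through Google Maps depends on both sides agreeing on the standard spherical-Mercator tiling scheme. As a small illustration (this is the well-known Web Mercator formula, not MOL-specific code), here is how a latitude/longitude pair maps to the tile coordinates Google Maps requests:

```python
# Standard Web Mercator (spherical Mercator) tile addressing: the
# scheme a tile renderer such as Mapnik must follow so Google Maps
# can overlay the rendered tiles in the right place.
import math

def latlon_to_tile(lat, lon, zoom):
    """Convert a WGS84 lat/lon to Google Maps tile (x, y) at a zoom level."""
    n = 2 ** zoom                               # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)          # longitude is linear in x
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) +
             1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y
```

Each rendered tile is then just a 256x256 image keyed by (zoom, x, y), which is exactly the small, cacheable unit the front end stores in memcache.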

Soon we will be pulling together some new types of distribution data, such as “occurrence polygons” — places where species have been described via species lists — and habitat preferences such as “wet broadleaf forests” or “shortgrass prairie”. These are new challenges for storage, query, and visualization. An even greater challenge will be trying to provide all these sources of knowledge in a single search and user interface. Exciting times for Map of Life! We are looking forward to having some demonstrations soon, so you can try out some Map of Life features. Stay tuned (and thanks for reading).

Written by Rob

April 6, 2011 at 11:38 am

Posted in overview
