Getting Started with Cloud Services for Biomedical Computation
How to tap into this cost-effective and flexible solution
Biomedical researchers who work with large data sets may run out of both disk space and patience while waiting for a computation to finish. Though buying more hard drives and faster computers may seem tempting, the cloud is now a realistic option.
In 2008, when cloud computing was relatively new, this magazine published a column by Alain Laederach predicting that scientists would be won over to cloud computing, despite some people’s concerns about a loss in performance with the added layer of virtualization.
In the last six years, the world of cloud computing has expanded dramatically. Moreover, the performance losses skeptics feared have not materialized. I have used both cloud computing and local clusters and have found that, after some initial setup, computing in the cloud can be just as efficient as working on a local cluster. And the ability to quickly change the number of machines in a cluster allows scaling up or down to fit the problem at hand.
For biomedical researchers, several realistic cloud options now exist, including Amazon Web Services, Joyent, Google’s Compute Engine, HP Cloud, IBM SmartCloud, and Rackspace. All have very low barriers to adoption, with step-by-step tutorials and guides to setting up the public-key cryptography that permits secure access to a virtual server over Secure Shell (SSH).
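The key-based login those tutorials walk through boils down to a single command. A minimal sketch, with illustrative file names (registering the public key with the provider happens in its web console, which varies by vendor):

```shell
# Generate an RSA key pair; "cloud_key" is an example file name.
ssh-keygen -t rsa -b 2048 -f cloud_key -N "" -q

# The public half (cloud_key.pub) is registered with the cloud provider;
# the private half never leaves your machine and is passed to ssh, e.g.:
#   ssh -i cloud_key ubuntu@<instance-public-ip>
ls cloud_key cloud_key.pub
```

The same key pair can be reused across instances, which is why providers ask for it once, up front.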
Amazon Web Services (AWS), the best known of the cloud computing services, launched in 2006 and now includes dozens of products. The most relevant for scientific cloud computing is the Elastic Compute Cloud (EC2) service, which provides access to a variety of virtual machines ranging from small (1 CPU, 1.7GB RAM for $0.60/hour) to very large (32 CPUs, 244GB RAM for $2.40/hour), as well as high-performance GPU machines.
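Because billing is per instance-hour, a parallelizable job can finish both faster and cheaper on a bigger machine. A back-of-the-envelope comparison using the prices quoted above, for a hypothetical job needing 64 CPU-hours and assuming it parallelizes perfectly:

```shell
# Cost of a hypothetical 64 CPU-hour job run to completion on each
# instance type (prices as quoted above; perfect parallel scaling assumed).
awk 'BEGIN {
    cpu_hours = 64
    # small: 1 CPU at $0.60/hour
    printf "small: %d h wall clock, $%.2f\n", cpu_hours / 1,  (cpu_hours / 1)  * 0.60
    # very large: 32 CPUs at $2.40/hour
    printf "large: %d h wall clock, $%.2f\n", cpu_hours / 32, (cpu_hours / 32) * 2.40
}'
# -> small: 64 h wall clock, $38.40
# -> large: 2 h wall clock, $4.80
```

Real jobs rarely scale perfectly, so the arithmetic is a starting point, not a guarantee — but it shows why matching the machine to the problem matters.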
AWS users select an Amazon Machine Image (AMI), which contains the operating system and software for the virtual computer. Several Linux-based AMIs come pre-installed with bioinformatics-related software, including the Bioconductor AMI and the CloudBioLinux AMI. Once an instance is running, users receive a public IP address, which they can use to upload data and log in to run programs. Users who want more computing power can start up more servers, or even create a virtual compute cluster with as many nodes as desired using the StarCluster program from MIT.
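As a concrete sketch of that cluster workflow, here is what a StarCluster session might look like. The cluster name, node count, and file names are placeholders, and the commands assume AWS credentials have already been set up in StarCluster’s configuration file — so this is illustrative, not something to paste in verbatim:

```shell
# Launch a 4-node virtual cluster named "mycluster" (name and size are
# placeholders; requires AWS credentials in ~/.starcluster/config).
starcluster start -s 4 mycluster

# Upload data to the cluster and log in to the master node.
starcluster put mycluster data.tar.gz /home/
starcluster sshmaster mycluster

# ... run jobs; StarCluster preconfigures a Sun Grid Engine queue
# and a shared NFS home directory across the nodes ...

# Shut everything down when finished -- billing stops only at termination.
starcluster terminate mycluster
```

Because starting and terminating a cluster takes minutes, it is practical to size the cluster to each analysis rather than to the lab’s largest anticipated job.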
There is one major drawback to using virtual servers: once a server is terminated, it ceases to exist, creating a risk of losing valuable data or results if they are not first copied to a local computer. One solution is Amazon’s Elastic Block Storage (EBS): volumes that hold up to 1TB of data and can be mounted just like a disk drive. An EBS volume persists after its instance is terminated, so the user can attach it to a new virtual machine at a later time and continue working. Unfortunately, EBS volumes can reduce performance because their disk reads and writes travel over the network.
The Manta service from Joyent has some potential advantages for high-performance computing on large datasets because it puts data storage first and brings the computation to the storage. By integrating computation with the data, Manta avoids slow network drives and operates with virtually zero data latency.
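Using an EBS volume from an instance is a short sequence of steps. A sketch of the instance-side commands, assuming the volume has been created and attached in the AWS console and appears as /dev/xvdf (the device name and mount point are examples and vary by setup):

```shell
# Format the attached EBS volume -- first use only, as this erases
# any existing data on it. /dev/xvdf is an assumed device name.
sudo mkfs -t ext4 /dev/xvdf

# Mount it; files written under /data now live on the EBS volume
# and survive instance termination.
sudo mkdir -p /data
sudo mount /dev/xvdf /data
```

Before terminating the instance, `sudo umount /data` detaches the volume cleanly; it can later be attached to a fresh instance and mounted the same way, skipping the format step.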
Cloud computing can be a cost-effective and flexible solution for a researcher’s computing needs. When tackling a large data analysis project, I recommend considering a move to the cloud.
Amazon Web Services: http://aws.amazon.com/
BioConductor AMI: http://www.bioconductor.org/help/bioconductor-cloud-ami/
CloudBioLinux AMI: http://cloudbiolinux.org/
Joyent’s Manta service: http://www.joyent.com/products/manta
MIT’s StarCluster program: http://star.mit.edu/cluster/
Guy Haskin Fernald is a PhD candidate in Russ Altman’s lab at Stanford University. He is working on identifying and using molecular features of drugs to predict chemical activities and biological phenotypes. Cloud computing is proving to be one of his best tools for working with large chemical databases and implementing machine learning algorithms.