Baylor Case Study
2014
Baylor College of Medicine in Houston, Texas, is home to the Human Genome Sequencing Center (HGSC), one of three federally funded sequencing centers in the US. Among HGSC’s projects is the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) project, a consortium of more than 200 scientists across five institutions worldwide who are working to identify genes that contribute to aging and heart disease. The ongoing CHARGE project analyzes genetic samples and phenotype data from the National Heart, Lung, and Blood Institute’s (NHLBI) large cohort studies and similar studies in Europe. CHARGE and Baylor College of Medicine are collaborating to sequence samples from many of the study participants and process the data through Baylor’s Mercury analysis pipeline, to help scientists better understand how genetic variation could play a role in preventing and treating stroke and heart disease. Baylor has 20 sequencing machines that deliver some 24 terabases of content each month, about 1 PB of raw data. There are currently more than 14,000 participants in the study. Data at this scale requires innovative data solutions.
“Upgrading your infrastructure for every big influx you see coming requires substantial investment, not to mention space. These kinds of compute are not a one-time thing, either: they keep growing exponentially. There are all kinds of limitations in our ability to find the horizons of science. But now, thanks to AWS and DNAnexus, we can focus on the science instead of the infrastructure.”
Narayanan Veeraraghavan
Lead Programmer Scientist, Baylor
The Challenge
Over the last century, a number of studies have followed patients throughout their lives to determine how people develop certain conditions or diseases. With the development of DNA sequencing tools, as well as the ability to manage vast sets of data, the results from these studies are now being re-analyzed as part of the CHARGE project. CHARGE scientists all over the world are making use of data to research the causes and prevention of disease.
But as DNA sequencers become more efficient and genomic testing becomes more prevalent, the amount of data to be analyzed has become truly massive. With more than 430 TB of data in play on the CHARGE project, simply distributing the data to interested scientists would have proved challenging. In the old days, hard drives with the data would be encrypted and then shipped out by mail to the more than 200 scientists involved in the CHARGE project—creating delays in sharing information and issues with data security. “Having to ship out hard drives to so many people would be a logistical nightmare,” Narayanan Veeraraghavan, Lead Programmer Scientist at Baylor, says. “Data would have to be encrypted at all points. With so many scientists handling so many hard drives, there would be a lot of failures, because not everyone would be able to follow the security guidelines.”
The infrastructural challenges alone were steep. “It takes a couple months to set up your infrastructure to cater to a particular need in terms of data storage and compute,” Veeraraghavan says. “In those months, technology can change, protocols can change, and updates to the sequencing platform can mean that sequencers can double their output. So demand has doubled in the time you’ve taken to plan and estimate your hardware needs.” Baylor also wanted scientists to be able to share tools across operating systems.
The local computational burden “can bring projects to their knees,” Veeraraghavan says. “We have to be able to operate at scale and store immense amounts of data. We needed another solution, or the CHARGE study would have been prohibitively expensive. It would be difficult or impossible for us to get the computing resources we need on our own.”
Why Amazon Web Services
Baylor needed a cost-efficient, easily maintainable solution that would enable it to provide safe, effective worldwide collaboration without delays caused by setting up a physical infrastructure. “We didn’t have months to spend on setting up an infrastructure, and we needed to be able to share the data efficiently, interactively, and securely,” Veeraraghavan says.
The solution needed to be flexible enough to meet clinical standards and HIPAA requirements, as well. “Once we put all our cards on the table, we naturally gravitated toward DNAnexus and the AWS Cloud.”
Baylor decided to partner with DNAnexus, which provides an API-based PaaS that enables clinical and research enterprises to efficiently and securely move their analysis pipelines and data into the AWS Cloud. DNAnexus enables its clients to port their proprietary algorithms to the cloud alongside industry-recognized tools and reference resources to create customized workflows. The DNAnexus PaaS is built entirely on AWS, which has allowed DNAnexus to scale its system to more than 20,000 simultaneous compute cores, 1 PB of storage, millions of core hours of analysis, and hundreds of thousands of compute jobs orchestrated in the AWS Cloud. AWS has also provided DNAnexus with a Business Associates Agreement (BAA), allowing DNAnexus to offer best-in-class security and compliance with healthcare laws both in the US and internationally. Using AWS, customers can build and run HIPAA-compliant workloads.
The CHARGE project uses Baylor’s analysis pipeline, Mercury, to process its data. The Mercury pipeline consumes raw files from the sequencer and transforms that data into the end deliverable: an annotated variant call file, identifying mutations that could be of clinical significance. Scientists downstream perform tertiary analysis to tackle additional research questions. A small group of researchers is developing tools that look more closely at the biology of each genetic marker, so that they can reprocess the data with new findings about both predictive and protective genes. Researchers can compare different tools and share them across geographical boundaries using the DNAnexus platform.
DNAnexus uses Amazon Simple Storage Service (Amazon S3) and Amazon Glacier to store more than 1 PB of genomic data. DNAnexus created a command-line tool that gives scientists the option to upload DNA data directly from the sequencing instrument to the cloud, thus eliminating the need for costly on-premises storage infrastructure. Amazon Elastic Compute Cloud (Amazon EC2) hosts the DNA analysis itself. DNAnexus developed a custom queuing system that operates on Amazon EC2 instances, which is designed to handle interruptions in data processing.
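The case study does not describe how the DNAnexus queuing system works internally. As a rough illustration of the general idea of a queue that tolerates interruptions, the Python sketch below re-queues any job whose worker disappears mid-run and gives up only after a fixed number of attempts. Every name in it (`Job`, `WorkerInterrupted`, `run_queue`, `flaky_process`, the sample IDs) is hypothetical and is not taken from DNAnexus or AWS.

```python
from collections import deque
from dataclasses import dataclass


class WorkerInterrupted(Exception):
    """Raised when the (hypothetical) worker running a job is interrupted."""


@dataclass
class Job:
    sample_id: str
    attempts: int = 0


def run_queue(jobs, process, max_attempts=3):
    """Run every job, re-queueing any job whose worker is interrupted.

    Returns two lists of sample IDs: (completed, failed).
    """
    pending = deque(jobs)
    completed, failed = [], []
    while pending:
        job = pending.popleft()
        job.attempts += 1
        try:
            process(job)
        except WorkerInterrupted:
            if job.attempts < max_attempts:
                pending.append(job)  # retry later, behind the remaining jobs
            else:
                failed.append(job.sample_id)
        else:
            completed.append(job.sample_id)
    return completed, failed


def flaky_process(job):
    # Simulate an interruption on the first attempt of one sample only.
    if job.sample_id == "sample-2" and job.attempts == 1:
        raise WorkerInterrupted(job.sample_id)


completed, failed = run_queue(
    [Job("sample-1"), Job("sample-2"), Job("sample-3")], flaky_process
)
print("completed:", completed)
print("failed:", failed)
```

In this toy run the interrupted sample is simply processed again after the others, so no work is lost. A production system would also need to persist the queue and checkpoint partial results, which the sketch leaves out.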
To optimize costs, DNAnexus uses Amazon EC2 Reserved Instances for its interactive services, such as its website, customer front-end portal, and DNA visualization tools, as well as for its back-end cloud and job management services.
Baylor and DNAnexus protect CHARGE data by controlling access to the Mercury pipeline, using the best practices outlined by AWS. “We handle sensitive medical information about people,” Veeraraghavan says. “By using one pipeline and controlling access to that pipeline, you can structure your environment in such a way as to minimize the risk.” The rigorous security protocols in AWS allow DNAnexus to offer its clients best-in-class security, compliance, and audit standards in accordance with HIPAA, CLIA, and other complex regulatory measures. Omar Serang, DNAnexus Chief Cloud Officer, says, “We are able to power ultra large-scale clinical studies that require computational infrastructure in a secure and compliant environment at a scale not previously possible.”
Baylor's HGSC Architecture on the AWS Cloud
The Benefits
After moving to AWS and DNAnexus, Baylor completed its first analysis in 10 days—five times faster than with the local infrastructure—and was able to share the findings quickly. The analysis took 21,000 cores; one Amazon EC2 XL instance has 16 virtual cores. “The AWS Cloud enables swift collaboration even with hundreds of terabytes of data,” Veeraraghavan says. “The ability to have a central area for people to process that data cuts down on bandwidth and the need to buy and maintain vast computational resources.”
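To put those figures in perspective, the short Python calculation below works out how many 16-core instances 21,000 cores correspond to, and an upper bound on the core hours consumed. The upper bound assumes every core was busy for the full 10 days, which the case study does not state, so the real figure may well be lower.

```python
import math

CORES_USED = 21_000      # from the case study
CORES_PER_INSTANCE = 16  # from the case study
RUN_DAYS = 10            # from the case study

# Number of 16-core instances needed to supply 21,000 cores.
instances = math.ceil(CORES_USED / CORES_PER_INSTANCE)

# Upper bound only: assumes all cores ran for the entire 10-day analysis.
max_core_hours = CORES_USED * RUN_DAYS * 24

print(f"instances needed: {instances:,}")              # 1,313
print(f"core hours (upper bound): {max_core_hours:,}")  # 5,040,000
```

Roughly 1,300 instances and up to about 5 million core hours are in line with the "millions of core hours" the DNAnexus platform is described as running on AWS.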
It’s a far cry from the days when Baylor had to ship out hard drives to help scientists collaborate. By using AWS and DNAnexus, Baylor and CHARGE were able to provide scientists using different systems with a common environment for sharing analysis tools. “Any scientist, whether he’s running on a Mac, Linux, or Windows, can run any tool on all the CHARGE data in DNAnexus,” Veeraraghavan says. Andrew Carroll, lead DNAnexus Scientist for CHARGE, adds, “Using the AWS Cloud makes it possible to compare tools, so that you can understand what works for your project and what doesn’t. DNAnexus on the AWS Cloud lets researchers share what they learn with the scientific community.”
The scalability of the AWS Cloud helps CHARGE scientists gain more predictive power over the conditions they are studying. They can also identify “protective” genes that may help shield a person from developing a condition—and they can do so quickly and securely. “This is the definition of why you want to go to the AWS Cloud,” Carroll says. “CHARGE needs to run at very high peak loads for as short a period of time as possible to get the job done. Using the AWS Cloud allows DNAnexus the flexibility to build its own PaaS on top of AWS technology. We can scale the DNAnexus system to practically unlimited compute and data storage resources.”
Above all, using DNAnexus and AWS has enabled CHARGE scientists to focus on science—not infrastructure. “Upgrading your infrastructure for every big influx you see coming requires substantial investment, not to mention space,” Veeraraghavan says. “These kinds of compute are not a one-time thing, either: they keep growing exponentially. There are all kinds of limitations in our ability to find the horizons of science. But now, thanks to AWS and DNAnexus, we can focus on the science instead of the infrastructure.”
About Baylor
Baylor College of Medicine in Houston, Texas, is home to the Human Genome Sequencing Center (HGSC), one of three federally funded sequencing centers in the US.
AWS Services Used
Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
Amazon Glacier
Amazon S3 Glacier and S3 Glacier Deep Archive are secure, durable, and extremely low-cost Amazon S3 cloud storage classes for data archiving and long-term backup.