Chinese Crunch Human Genome With Videogame Chips
Jan. 11, 2012 (TSR) – The world’s largest genome sequencing center once needed four days to analyze data describing a human genome. Now it needs just six hours.
The trick is servers built with graphics chips — the sort of processors that were originally designed to draw images on your personal computer. They’re called graphics processing units, or GPUs — a term coined by chip giant Nvidia. This fall, BGI — a mega lab headquartered in Shenzhen, China — switched to servers that use GPUs built by Nvidia, and this slashed its genome analysis time by more than an order of magnitude.
Can GPUs keep up with the torrent of DNA data these machines crank out? Photo: Lawrence Berkeley National Laboratory
In recent years, the cost of sequencing genomes (mapping out an organism’s entire genetic code) has dropped about five-fold each year. But according to Gregg TeHennepe, a senior manager and research liaison in the IT department at The Jackson Laboratory in Bar Harbor, Maine, the cost of analyzing that sequencing data has dropped much more slowly. With its GPU breakthrough, BGI is shrinking the gap.
In the world of medicine, this is nothing but good news. It promises to dramatically boost biological exploration, the study of diseases, and efforts to realize the long-touted vision of personalized medicine — the idea of being able to tailor drugs and other treatments based on an individual’s genetic makeup.
GPUs Get Super
GPUs began life in desktop PCs. But nowadays, they’re widely used for “high-performance computing,” driving supercomputers that crunch through huge amounts of data generated by scientists, financial institutions, and government agencies. Much of this data can be broken into small pieces and spread across hundreds or thousands of processors.
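To make "broken into small pieces" concrete, here is a minimal Python sketch of the split-and-combine pattern. The reads, the `count_gc` task, and the worker count are illustrative assumptions, not anything BGI runs, and Python threads stand in for the many cores of a real GPU or cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def count_gc(reads):
    """Count G and C bases in one chunk of reads."""
    return sum(read.count("G") + read.count("C") for read in reads)

def split(items, n_chunks):
    """Split a list into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_gc(reads, n_workers=4):
    """Count G/C bases by handing each chunk to a separate worker."""
    chunks = split(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each chunk is processed independently; the partial counts are summed.
        return sum(pool.map(count_gc, chunks))

# Made-up reads, for illustration only.
reads = ["GATTACA", "CCGGTA", "ATATAT", "GGGCCC", "TTGACA"]
total = parallel_gc(reads)
```

Because no chunk depends on another, the same answer comes back however many workers share the load, which is the property that lets this kind of work spread across hundreds or thousands of processors.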
Graphics processors are designed to crunch floating-point data. Floating point processing — in which the decimal point can move — makes it easier for computers to handle the large numbers typical of scientific data. As a bonus, graphics processors are generally less expensive and less energy-intensive than standard CPUs.
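A short Python sketch shows what a "moving" point means in practice. It relies on the standard library's `math.frexp`, which splits a number into a fraction and a power-of-two exponent; note that hardware works in binary, so strictly speaking it is a binary point that moves, not a decimal one. The sample values are arbitrary.

```python
import math

def decompose(x):
    """Return (fraction, exponent) such that x == fraction * 2**exponent."""
    return math.frexp(x)

# Two numbers of very different size share the same fraction;
# only the exponent (the position of the point) differs.
large = decompose(6.0)      # (0.75, 3)  because 0.75 * 2**3  == 6.0
small = decompose(0.375)    # (0.75, -1) because 0.75 * 2**-1 == 0.375

# math.ldexp reverses the split and rebuilds the original value.
rebuilt = math.ldexp(*large)
```

Storing a fixed-width fraction plus an exponent is what lets one format cover both very large and very small values.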
According to Jackson Lab’s TeHennepe, the feat BGI and Nvidia pulled off was porting key genome analysis tools to Nvidia’s GPU architecture, a nontrivial accomplishment that the open source community and others have been working toward. The development is timely. TeHennepe’s Jackson Laboratory is best known as one of the main sources of mice for the world’s biomedical research community, but it’s also a research center that focuses on the genetics of cancer and other diseases. The lab has been conducting high-throughput sequencing for more than a year, and it has been looking into GPU computing to bolster the lab’s ability to analyze the data.
TeHennepe calls BGI’s accomplishment “an important step forward in the effort to apply the promise of GPU computing to the challenge of scaling the mountain of high-throughput sequencing data” — assuming that BGI’s accomplishment can be verified and applied elsewhere.
GPU computing holds the promise of delivering orders of magnitude increases in performance and reducing power and space requirements for problems that can be structured to take advantage of the highly parallelized architecture. The open question in the high-throughput sequencing community has been the extent to which their analysis challenges can be restructured to fit the GPU model.
Beyond the CPU
To achieve the same genome analysis speeds with traditional CPUs, BGI would have to use 15 times more computer nodes, with an equivalent increase in power and air conditioning, according to bioinformatics consultant Martin Gollery. With GPUs, Gollery says, BGI can get faster results from its existing algorithms or use more sensitive algorithms to get better results. It can also put its existing computing resources to work on other tasks.
According to Chris Dwan — principal investigator and director of professional services at BioTeam, a consulting firm that specializes in technology for biomedical research — organizations that use GPU-enabled genome analysis can also pare back their computing infrastructure. Sequencing machines generate hundreds of gigabytes of data at a time. That data has to remain “hot” on disk drives for as long as the analysis software runs.
“If you can churn through data in a few hours rather than a week you might be able to save quite a bit on high-performance disk space,” Dwan says.
Another consequence of BGI’s GPU initiative is the likelihood that other institutions will be able to use BGI’s GPU-enabled applications. “Most of the genomics folks that I know have been waiting for GPU-enabled applications to appear in the wild, rather than dedicating local developers and building the apps themselves,” says Dwan.
From Bench to Cloud
BGI uses GPUs across a large server farm. But its GPU software port has consequences for other platforms as well. Big, high-throughput sequencing machines have dominated the sequencing market, but smaller bench-top systems are likely to drive growth in the market over the next four years, according to DeciBio, a biomedical technology market research firm. Bench-top sequencers are likely to capture close to half of the market by 2015, according to the firm.
As the sequencing manufacturers develop ever smaller bench-top instruments such as Illumina’s MiSeq and Ion Torrent’s PGM, they will also need to scale down the built-in analysis capabilities of the systems. “GPU-based systems might allow them to fit a traditional CPU-based cluster’s worth of compute capacity into the instrument itself,” says Jackson Lab’s TeHennepe.
And then there’s the cloud. Running genome sequence analysis pipelines in the cloud is a hot topic. A pipeline is the end-to-end process of running DNA sequence data through a series of analysis tools to produce genomes whose structures and variations are identified and labeled. The resulting analyzed genomes are tools for researchers studying biology, pharmaceutical companies developing drugs, and physicians treating patients.
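In code, a pipeline is simply a chain of stages, each taking the previous stage's output. The Python sketch below is a toy under loud assumptions: the reference string, the reads, and the three stage functions (`drop_ambiguous`, `align`, `label`) are invented for illustration, and the exact-match "alignment" is far cruder than what real sequence aligners do.

```python
REFERENCE = "ACGTACGTTAGC"  # toy reference sequence, not real genome data

def drop_ambiguous(reads):
    """Stage 1: discard reads containing an unknown base ('N')."""
    return [r for r in reads if "N" not in r]

def align(reads):
    """Stage 2: place each read at its first exact match, or -1 if none."""
    return [(r, REFERENCE.find(r)) for r in reads]

def label(alignments):
    """Stage 3: annotate each read with its position and mapped status."""
    return [{"read": r, "position": p, "mapped": p >= 0} for r, p in alignments]

def run_pipeline(data, stages):
    """Feed the data through each stage in order."""
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline(
    ["ACGT", "TTAG", "GGNA", "CCCC"],
    [drop_ambiguous, align, label],
)
```

Because each stage only sees the output of the one before it, a slow stage can be swapped for a faster version, GPU-based or otherwise, without touching the rest of the chain.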
Harvard Medical School’s Laboratory for Personalized Medicine has been running analysis pipelines on Amazon’s EC2. All of the major sequencing instrument manufacturers have or will soon have cloud-based analysis services, which are primarily aimed at smaller organizations, says TeHennepe.
The combination of sequencing services — like those offered by BGI and Edge Bio — and cloud-based genome analysis promises to make genomics more affordable for smaller research outfits. A researcher can send a biological sample to a sequencing service, which can upload the sequencing data directly to a cloud service. “The researcher now no longer has to own a sequencer or a cluster, and does not have to have employees to manage both of these technologies,” Gollery says.
But loading huge amounts of data into the cloud is problematic. A single instrument run can produce hundreds of gigabytes of data. “I know several groups who are shipping disk drives around in FedEx pouches rather than saturating their internet links,” says Dwan. “That introduces a lot of human hands — and time on trucks — into the process.” Sequencing centers and instrument manufacturers are working on “direct to cloud” support, but it’s not clear what that’s going to mean.
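A back-of-the-envelope calculation shows why the FedEx pouch can win. The Python sketch below assumes a hypothetical 450-gigabyte run and a hypothetical 100-megabit-per-second link running at full speed with no overhead; neither figure comes from the sources quoted here.

```python
def transfer_hours(gigabytes, megabits_per_second):
    """Hours needed to move `gigabytes` over a link of the given speed.

    Uses decimal units (1 gigabyte = 8,000 megabits) and ignores
    protocol overhead, so real transfers would take longer.
    """
    megabits = gigabytes * 8000
    seconds = megabits / megabits_per_second
    return seconds / 3600

# Hypothetical numbers, for illustration only.
hours = transfer_hours(450, 100)   # 10.0 hours on a fully dedicated link
```

On those assumptions, a single run ties up the link for most of a working day, and a lab producing several runs a week would leave little bandwidth for anything else.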
GPU-enabled cloud services will help once the data is in the cloud. Cloud service providers are increasingly adding GPU capabilities. Amazon Web Services is a prime example. According to Dwan, any organization that has figured out how to run its analysis in a cloud service like Amazon’s EC2 will not have to rent as many instance-hours to complete the same task if it can use GPU-based analysis tools. This means cheaper and faster results for commonly used pipelines.
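The arithmetic behind Dwan's point can be sketched in a few lines of Python. The run times reuse BGI's reported four days and six hours purely as an illustration; those figures describe BGI's own servers, not a cloud service, and the hourly rates below are made-up placeholders, not Amazon prices.

```python
def speedup(hours_before, hours_after):
    """How many times faster the new run is."""
    return hours_before / hours_after

def job_cost(hours, hourly_rate):
    """Cost of renting one instance for the length of the job."""
    return hours * hourly_rate

cpu_hours = 4 * 24   # four days, as reported for BGI's earlier analysis
gpu_hours = 6        # six hours, as reported after the switch to GPUs
factor = speedup(cpu_hours, gpu_hours)   # 16.0

# Hypothetical rates: assume the GPU instance costs three times as much per hour.
cpu_cost = job_cost(cpu_hours, 1.0)
gpu_cost = job_cost(gpu_hours, 3.0)
```

As long as the GPU instance's price premium stays below the speedup, the faster run is also the cheaper one.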
Another advantage of GPU-enabled cloud services, says Gollery, is that research organizations can test GPU versions of algorithms without having to have a GPU system in-house. If the algorithm doesn’t port well to GPU architecture, then the organization hasn’t lost much.
Not everyone is sold on cloud-based sequence analysis. Jackson Laboratory took a close look at the issue when the lab applied for funding in support of storage for sequencing data. “We argued that while cloud is making steady progress, it’s still not ready for large-scale sequencing pipelines,” says TeHennepe.
The Need for Speed
What’s more, not everyone is focused on speeding up computation, either locally or in the cloud, via GPUs or otherwise. For some of the largest genomics centers, data handling and data representation are bigger challenges than pure compute speed. The Broad Institute, a joint Harvard-MIT biomedical research center, spends most of its compute cycles moving bytes around. “The time spent doing CPU-intensive work has been relatively modest compared to the time spent doing input-output work,” says Matthew Trunnell, Acting Director of Advanced IT.
According to Trunnell, the speed of a single analysis pipeline is less important than improving data representation and figuring out the big data problem of processing large swaths of sequencing data simultaneously.
Even for compute-intensive aspects of analysis pipelines, GPUs aren’t necessarily the answer. “Not everything will accelerate well on a GPU, but enough will that this is a technology that cannot be ignored,” says Gollery. “The system of the future will not be some one-size-fits-all type of box, but rather a heterogeneous mix of CPUs, GPUs and FPGAs depending on the applications and the needs of the researcher.”
Analysis Versus Interpretation
Being able to keep up with the torrent of raw sequencing data is a critical challenge. But once researchers have analyzed genomes in hand, the question becomes: Now what? The main bottleneck in genomics is making sense of the information, says Kevin Davies, editor-in-chief of Bio-IT World, founding editor of the journal Nature Genetics, and author of The $1,000 Genome. “Shaving a few hours or a couple of days off a step is great but not necessarily a quantum leap into a new realm of biological understanding,” he says.
Our understanding of genome biology is still relatively limited. Once a researcher or clinician has that list of thousands or tens of thousands of genomic variants, they have to try to figure out which ones are medically important. “There’s still a huge gap in our ability to do that,” says Davies. “Partly it’s because the existing medical databases, the gene variant databases, aren’t nearly as accurate and as actionable as we would like them to be.”
As far as medical genomics and the promise of personalized medicine, the goal is to be able to look in a database to see that a variant in, for example, the 833rd gene on chromosome 17 has a particular meaning. “You want to be able to look that up in a reliable and robust database,” says Davies. “We don’t really have that at the moment.”
Still, genomics is creeping into medicine. A growing number of medical centers are taking the first steps into using genome analysis. “We’ll see where that goes,” says Davies. “The interpretation of those data is a challenge, and it’s going to be several years before we really assemble the right tools to be able to do that.”
GPUs have cranked up the speed of genome sequencing analysis, but in the complicated and fast-moving field of genomics that doesn’t necessarily count as a breakthrough. “The game changing stuff,” says Trunnell, “is still on the horizon for this field.”
AUTHOR: Eric Smalley