The Microbiome: Dealing with the Data Deluge
Bioinformatics and computational biology enable microbiome research
This past June, 200 members of the NIH-funded Human Microbiome Project (HMP) Consortium published a slew of papers offering fresh insights into the role microbial communities play in the human body—including how changes in the vaginal bacteria of pregnant women affect the health of their babies, and how gut microbes influence inflammatory bowel disease.
But the research was only possible thanks to a team of experts in computational biology and bioinformatics.
HMP participants sequenced thousands of metagenome samples—which contain genetic material from hundreds of diverse microbes—from up to 18 body sites in 242 healthy individuals. State-of-the-art tools for analyzing individual genomes aren’t well suited to analyzing metagenomes, as the data are much more massive and messy. So a team of researchers from the Department of Energy’s Joint Genome Institute (JGI) joined forces with software engineers and computer scientists from the Biological Data Management and Technology Center at Lawrence Berkeley National Laboratory to develop and maintain a suite of novel tools, including a quality control filter, a curation and annotation pipeline, and methods for analyzing and integrating the data.
These efforts culminated in the Integrated Microbial Genomes and Metagenomes (IMG/M) system, a one-stop data management and analysis platform for microbial metagenomic studies, and an HMP-specific web interface known as IMG/M-HMP that supports comparative analysis of HMP genomes and metagenomes against the vast pool of microbial data in IMG/M.
HMP scientists can come to the IMG/M-HMP—which is neck-deep in genomics tools and annotated microbiome data—knowing that they will find much of what they need. The IMG/M tools can do a range of analyses, including identifying microbes and genes within a metagenome; predicting gene function; and comparing populations of microbes across metagenomes. According to JGI functional annotation group leader Natalia Ivanova, PhD, HMP researchers assemble their sequences and perform structural annotation, or gene prediction; while JGI scientists perform functional annotation—assigning the predicted genes to conserved protein families—and provide data integration and the user interface.
Finding Known Genes and Microbes
HMP researchers must find protein-coding genes and determine what bugs they come from. The process typically starts with comparing a set of metagenomic data to millions of genes in the IMG reference database in hopes of finding a match. Using this process, the HMP has harvested approximately 200 million genes from its metagenomic samples. “And that requires a lot of computations,” says JGI computational genomics group leader Konstantinos Mavrommatis, PhD.
For decades, researchers have used an algorithm called BLAST (Basic Local Alignment Search Tool) to search for similarities between nucleotide sequences. But BLAST alone is too slow and computationally expensive to handle the metagenomic sifting required by the HMP. So Mavrommatis and his colleagues incorporated novel computational approaches into IMG/M.
At first, the team at JGI investigated alternatives to similarity-based pattern searches, but those produced too many false results. So they turned instead to new similarity-based algorithms, such as USEARCH, capable of producing results comparable to BLAST’s, only faster and more efficiently. USEARCH looks for a small number of good matches rather than trying to identify all homologous sequences, cutting search time without greatly sacrificing sensitivity.
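The core trick behind this family of tools can be illustrated in a few lines of Python. The sketch below is not USEARCH itself (which uses optimized alignment and indexing); it is a minimal stand-in that captures the idea of testing database sequences in order of shared-word counts and stopping early after a fixed number of accepted hits or failed alignments. The function names, the crude identity score, and the parameter defaults are all illustrative assumptions.

```python
from collections import Counter

def kmers(seq, k=4):
    # Count the k-letter "words" in a sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def identity(a, b):
    # Crude stand-in for a real alignment: fraction of matching positions.
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def usearch_like(query, database, k=4, min_identity=0.8,
                 max_accepts=1, max_rejects=8):
    """Early-terminating similarity search in the spirit of USEARCH:
    rank database sequences by shared k-mer counts, align the most
    promising candidates first, and stop after max_accepts hits or
    max_rejects failed alignments instead of scanning everything."""
    qk = kmers(query, k)
    ranked = sorted(database,
                    key=lambda s: -sum((qk & kmers(s, k)).values()))
    hits, rejects = [], 0
    for target in ranked:
        if identity(query, target) >= min_identity:
            hits.append(target)
            if len(hits) >= max_accepts:
                break  # good-enough match found; don't keep searching
        else:
            rejects += 1
            if rejects >= max_rejects:
                break  # promising candidates exhausted; give up early
    return hits
```

Because candidates are tried in order of word-count similarity, the first acceptable hit usually appears near the front of the ranking, which is why terminating early costs little sensitivity in practice.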
Finding Novel Genes and Predicting Their Function
Matching a microbe buried inside a metagenome to a genome in a reference database is akin to finding a needle in a haystack. But identifying the genes from a microbe that hasn’t previously been sequenced and figuring out what those genes actually do is even more challenging.
To enable researchers to find novel genes, the IMG/M includes gene-predicting algorithms that rely on generic features of nucleotide sequences rather than on comparison to known sequences (as similarity-based algorithms do). The mathematical methods used in gene prediction, such as hidden Markov models, “work quite well,” says Ivanova. As a result, even when they are fed radically new content, their error rate remains below 10 percent. “It’s still not perfect,” says Mavrommatis, “but considering all the other sources of error, it’s not the worst.”
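A toy example makes the "generic features" idea concrete. The sketch below finds open reading frames (ATG-to-stop stretches) in each reading frame of a DNA string. Production metagenomic gene callers are far more sophisticated, scoring codon usage and other statistics with models such as HMMs, but the principle is the same: the prediction uses only intrinsic properties of the sequence, not a database match. The function name and minimum-length cutoff are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=30):
    """Toy ab initio gene finder: scan each of the three reading frames
    for ATG...stop open reading frames at least min_len bases long.
    Returns (start, end) coordinates of candidate genes."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # open a candidate gene
            elif codon in STOPS and start is not None:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))  # close it at the stop codon
                start = None
    return orfs
```

A real tool would also score each candidate (for example, by how typical its codon usage looks) rather than report every ORF above a length cutoff.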
These algorithms in the IMG/M were the basis for characterizing the diversity of the microbiome in many of the papers published by the HMP.
Teasing out the function of a novel gene is similarly demanding. In general, Ivanova says, gene function is determined by comparing unknown genes to ones whose function has been verified experimentally. Function can be confirmed by analyzing the distribution of similar genes in known genomes, or by looking at a gene’s chromosomal neighborhood, since “genes that are next to each other are more likely to be functionally related.”
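The first of those ideas, comparing the distribution of similar genes across known genomes, is often formalized as phylogenetic profiling: represent each gene family by the set of genomes it occurs in, and treat families with near-identical profiles as candidates for shared function. The sketch below is a minimal illustration of that reasoning; the function names and the Jaccard threshold are assumptions, not part of IMG/M.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two phylogenetic profiles, each given as a
    set of genome IDs in which the gene family occurs."""
    if not profile_a and not profile_b:
        return 0.0
    return len(profile_a & profile_b) / len(profile_a | profile_b)

def likely_partners(query_profile, family_profiles, threshold=0.7):
    """Rank characterized gene families by how closely their genome
    distribution matches that of an uncharacterized gene. Families that
    consistently co-occur with it are functional-association candidates."""
    scored = [(family, profile_similarity(query_profile, profile))
              for family, profile in family_profiles.items()]
    return sorted([(f, s) for f, s in scored if s >= threshold],
                  key=lambda pair: -pair[1])
```

The chromosomal-neighborhood signal Ivanova mentions works the same way in spirit: instead of asking "which genomes contain this gene?", one asks "which genes sit next to it, again and again, across genomes?"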
Few genes in the HMP database have been characterized to the point where scientists can say precisely what they do. And while perhaps 75 percent of the genes in IMG have been at least broadly characterized, that figure falls by half for genes within the HMP pool.
Yet considering the number of genes involved, that’s still an awful lot of information. And the methods that Ivanova describes can be used to create clusters of microbial sequences that might be worth examining in the lab, where researchers can learn more about them through experimentation. The gene prediction and annotation pipeline developed by JGI has already led to the creation of the HMP Gene Index, a collection of 690 annotated sequences from 15 different body sites. And this past September, a group of users attending a Microbial Genomics & Metagenomics (MGM) workshop run by the JGI used the data in an attempt to identify potential antibiotic-resistance genes in different metagenome samples.
One of the grand challenges of the microbiome project is to discover new information that could help to diagnose or treat disease. This is a huge challenge computationally for several reasons. First, differences in the diversity and complexity of the microbial communities found in different body sites (e.g., the skin, the mouth, the gut) make it difficult to do comparisons between them, Ivanova says. As a result, researchers tend to focus on comparisons of populations at the same body sites but in different individuals.
Perhaps even more significantly, due to privacy restrictions, the HMP metagenome datasets themselves come with very little metadata attached. Such metadata, which might describe the sex or dietary preferences of the human donor, is crucial to determining which metagenomic datasets might be of interest. Scientists at JGI have manually applied their own five-tiered classification scheme to the data, moving from the general (e.g., “host-associated” versus “engineered”) to the specific (“respiratory system,” “digestive system,” “skin and appendages”), but the approach has its limitations.
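A tiered scheme like the one described above can be pictured as a fixed-depth label path attached to each sample. The sketch below is purely illustrative: the sample IDs and tier values are hypothetical, and only the first and third tiers echo terms from the article ("Host-associated," "Digestive system," "Skin and appendages").

```python
# Hypothetical five-tier classification table, most general tier first.
CLASSIFICATION = {
    "HMP_stool_001": ("Host-associated", "Human", "Digestive system",
                      "Large intestine", "Fecal"),
    "HMP_skin_042": ("Host-associated", "Human", "Skin and appendages",
                     "Arm", "Forearm swab"),
}

def samples_at_tier(classification, tier_index, value):
    """Return all sample IDs whose classification matches `value` at the
    given tier (0 = most general, 4 = most specific)."""
    return sorted(sample_id for sample_id, tiers in classification.items()
                  if tiers[tier_index] == value)
```

Ivanova's complaint maps directly onto this structure: a query at tier 2 ("Digestive system") is answerable, but a question about, say, the donor's diet has no tier to live in, so it simply cannot be asked of the data.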
“It has to be much more granular,” says Ivanova. “There are some scientific questions that you won’t be able to answer because of the lack of metadata.”
The IMG/M helps researchers access and manipulate microbiome data in useful ways, but the sheer volume of data continues to present challenges.
For example, standard methods of storing and retrieving data from relational databases are no longer sufficient. The JGI-BDMTC team is exploring options such as nonrelational or NoSQL databases, and while they have yet to find a one-size-fits-all solution, they continue to explore alternatives.
And then there’s the question of how to provide access to the data and distribute the information. “We are struggling with the challenge of devising tools that don’t overwhelm our scientific users,” says Victor M. Markowitz, DSc, head of the Biological Data Management and Technology Center.
Giving scientists tools that are easy and efficient to use is critical because these tools drive research as much as they support it.
“In our experience most researchers don’t have a clear idea of what they really want and how to achieve it until they start getting the data,” Mavrommatis says. “For good or for ill, there is frequently no prior design of the analysis; we generate the data, and then the researcher starts trying to address questions based on what tools are available.”
Which is all the more reason to ensure that those tools are the best possible ones for the job.