Virtual Genomic Scans with Real Data
HAP-SAMPLE takes real data as the template for simulations
Trying to find the genetic causes of a human disease requires lots of data. These days, researchers scan the genomes of people who do and don’t have a particular disease and look for genome-wide associations between that disease and a gene or genes. But they’d like to know if their findings are statistically valid. Moreover, the variety of disease models currently in use has led to debates over which works best. Now, researchers have developed a new tool that they hope will help resolve these issues and will also work with any genotyping platform in use. Their software generates large simulated populations using present-day genetic information from specific populations.
“The main challenge is working out how you draw from real data to mimic what you expect to happen in a disease model situation,” says Fred Wright, PhD, a biostatistician at the University of North Carolina and senior author of a study published online in September 2007 in the journal Bioinformatics. “Because of that, we developed a method that’s simple, almost dumb, in the way it approaches it.”
Current statistical simulations either work backward to generate genetic “histories” that might give rise to present-day forms, or else they go forward, simulating genetic data from the distant past until the present day.
To present a more accurate simulation grounded in real data, Wright’s method, called HAP-SAMPLE, now offers a third option: using data from a real population to generate a large sample set against which genes of interest can be checked. Because the data already contain realistic historic mutations, there’s no need to let the population evolve (developing new mutations) over time. Instead, HAP-SAMPLE generates simulated populations solely by meiosis and its associated crossovers. It’s that simple.
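The resampling idea can be illustrated with a minimal Python sketch. It is not the published HAP-SAMPLE code: the function names, the toy haplotype pool, and the uniform per-site `crossover_rate` are all assumptions made for illustration. Each simulated chromosome is built as a mosaic of real haplotypes, switching to a randomly chosen template whenever a crossover occurs, and no new mutations are introduced.

```python
import random

def simulate_chromosome(haplotypes, crossover_rate, rng):
    """Build one simulated chromosome as a mosaic of real haplotypes."""
    n_sites = len(haplotypes[0])
    current = rng.randrange(len(haplotypes))  # starting template haplotype
    simulated = []
    for site in range(n_sites):
        # Assumed model: a crossover may occur between any two adjacent
        # sites with the same probability; a real tool would likely use
        # a recombination map instead.
        if site > 0 and rng.random() < crossover_rate:
            current = rng.randrange(len(haplotypes))
        simulated.append(haplotypes[current][site])
    return simulated

def simulate_population(haplotypes, n_individuals, crossover_rate=0.05, seed=0):
    """Return a list of diploid individuals, each a pair of chromosomes."""
    rng = random.Random(seed)
    return [
        (simulate_chromosome(haplotypes, crossover_rate, rng),
         simulate_chromosome(haplotypes, crossover_rate, rng))
        for _ in range(n_individuals)
    ]

# Hypothetical pool of four haplotypes (0/1 alleles at eight SNPs).
# The last SNP is monomorphic in the pool, so it stays monomorphic
# in every simulated chromosome.
pool = [
    [0, 0, 1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
]
population = simulate_population(pool, n_individuals=1000)
```

Because every allele is copied from a real haplotype at the same position, the simulated population can be made much larger than the source pool while containing only variants that were actually observed.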
The real genetic data is supplied by HapMap, an international database that catalogs 10 million common genetic variations (single nucleotide polymorphisms or “SNPs”) within three populations: Caucasian European, Chinese/Japanese, and Nigerian.
HAP-SAMPLE is potentially valuable to researchers who have identified a possible gene-disease association and want to see how it would play out in a larger population. For example, would the same SNP still be a significant contributor to the disease of interest in a larger population? By comparing the resulting simulated data against known SNPs, they can figure out how good their statistical methods are.
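One way to picture “how good their statistical methods are” is a power estimate: plant a known disease SNP in simulated data and count how often a test finds it. The Python sketch below is an illustration under stated assumptions, not part of HAP-SAMPLE. It assumes a multiplicative risk model and a simple allelic chi-square test, and for brevity it draws genotypes independently from an allele frequency rather than from resampled haplotypes. All function and parameter names are hypothetical.

```python
import random

def assign_status(risk_allele_count, baseline_risk, relative_risk, rng):
    """Return True (case) under an assumed multiplicative risk model."""
    risk = min(1.0, baseline_risk * relative_risk ** risk_allele_count)
    return rng.random() < risk

def allelic_chi_square(case_alleles, control_alleles):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts.

    Each argument is (risk_allele_count, other_allele_count).
    """
    a, b = case_alleles
    c, d = control_alleles
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def estimate_power(n_people, allele_freq, relative_risk,
                   n_replicates=200, baseline_risk=0.1, seed=0):
    """Fraction of simulated studies in which the planted SNP is detected."""
    rng = random.Random(seed)
    critical = 3.841  # chi-square critical value, 1 df, alpha = 0.05
    hits = 0
    for _ in range(n_replicates):
        case, control = [0, 0], [0, 0]
        for _ in range(n_people):
            # Number of risk alleles carried (0, 1, or 2).
            count = (rng.random() < allele_freq) + (rng.random() < allele_freq)
            is_case = assign_status(count, baseline_risk, relative_risk, rng)
            table = case if is_case else control
            table[0] += count
            table[1] += 2 - count
        if allelic_chi_square(case, control) > critical:
            hits += 1
    return hits / n_replicates

power_null = estimate_power(500, allele_freq=0.3, relative_risk=1.0)
power_strong = estimate_power(500, allele_freq=0.3, relative_risk=3.0)
```

With no planted effect (`relative_risk=1.0`), the detection rate should sit near the test’s false-positive rate; with a strong effect it should be far higher. The gap between the two is what a researcher would examine when comparing statistical methods.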
“HAP-SAMPLE is great because it takes real data as the template for the simulation,” says Marylyn Ritchie, PhD, a computational geneticist at Vanderbilt University, whose lab developed a complementary simulation tool. Still, she adds, HAP-SAMPLE’s usefulness is limited by HapMap’s small chromosome pool: Fewer than 400 people represent the three populations. For some researchers, having a real data template might not overcome the problem of limited population size, Ritchie cautions.
“What they’re asking for is just a broader population base,” Wright responds. His team does plan to augment HAP-SAMPLE soon with updates from other genetic databases.