Genetic Variants and Ill Health: Scanning 500,000 SNPs Yields Gene-Disease Connections
It's an exhilarating time for genome-wide association studies
For the past few months it seemed you couldn’t open a journal without reading results of a new genome-wide association study. The results kept pouring in: four studies in April showing seven genetic links to Type 2 diabetes; in May, a paper in Nature showing six new links to breast cancer; and in June, a much anticipated study from the UK-based Wellcome Trust announcing genetic links to seven common diseases, from arthritis to schizophrenia. The list goes on and on.
This was unquestionably the year of the genome-wide association study—research that seeks connections between traits and common mutations in the human genome. The work relies on information gleaned from the Human Genome Project and its successors. Cheaper and more powerful sequencing technologies available in the last two years let researchers scan 500,000 genetic markers on a single chip. This technological muscle has ended a long wait in the use of genetics to study common, perplexing diseases.
“There's been a sea change in this type of work,” said Eric Topol, MD, director of the Scripps Translational Science Institute in La Jolla and lead author of a commentary in JAMA on genome-wide association studies.1 “[The research] is going at a breakneck pace. This week was HIV. Last week was Type 1 diabetes. It's an extraordinary chain of discoveries.”
Is the Human Genome Project finally changing medicine? For years, scientists and policymakers have promised the dawn of personalized care. And while doctors do not yet routinely prescribe medications based on a read-out of an individual’s genetic frailties, the summer of 2007 saw a great leap forward.
“After many years and a fair amount of frustration … we have, in the last few months, about 50 discoveries of genetic risk factors for common diseases,” said Francis Collins, MD, PhD, director of the National Human Genome Research Institute at the National Institutes of Health, in late July. “Most of those point us toward targets that nobody would have guessed,” he added. “From the perspective of people working in the field of common disease genetics, this is an exhilarating time.”
Genes and Candidate Genes: A Slow Start
Medical genetics has spent a couple of decades in the doldrums. Early geneticists assembled family histories to study how specific traits were passed on. They then homed in on the genetic target. This method uncovered the genetic bases of such diseases as Huntington’s and cystic fibrosis, as well as of about 2,000 other inherited traits. Though these were important discoveries at the time, some geneticists today refer to them as the “low-hanging fruit”: easy targets involving a single gene producing a single, generally rare and deleterious, trait.
But many traits are more complex, involving multiple genes as well as the environment. And some aren’t classically genetic—a person can have the risk allele and not develop the condition, while people lacking the risk allele do get sick.
The Human Genome Project seemed to offer a promising way to get at these more complex conditions through an approach known as candidate-gene studies. Researchers used their knowledge of a disease to study likely suspects in a particular pathway—insulin-production in diabetes patients, or cholesterol production in patients with heart disease. Despite a few successes—for example, a single-nucleotide variant that explains a large part of why people react differently to the blood-thinning drug, warfarin—progress was middling.
“The disadvantage of candidate-gene studies,” said Mary Relling, PharmD, the chair of pharmaceutical sciences at St. Jude Children’s Research Hospital in Memphis, Tenn., “is you will only find what you’re looking for.”
And for many common diseases, we don’t yet know where in the genome to look.
The Genome-Wide Strategy
Researchers hoping for better results began taking a different approach. They look genome-wide in hopes of finding associations between genetic variations and a particular disease or drug response. Such studies are sometimes described as “agnostic” or “hypothesis-generating.” One could describe them as “brute force” or even “shot in the dark” methods—applying purely computational and fast-sequencing technology to the problem without any preconception about where in the genome relevant alleles will be found.
Biologists begin by assuming no knowledge of how the disease works. They assemble a group of people, some who have the disease and some who don’t. And they scan about 500,000 of each person’s single-nucleotide polymorphisms, or SNPs (pronounced “snips”), which are single-letter blips in the genetic code. In this sense, the term “genome-wide” is misleading. Scientists don’t yet sequence individuals’ whole genomes. Instead, they read a few hundred thousand of the most changeable of the 3 billion letters in our genetic code. Those most changeable locations were identified by the HapMap project, which mapped locations where the less common allele crops up in at least 1 percent of people.
The straightforward SNP approach won’t find duplications or changes in the structure of DNA, such as doubled chromosomes. And it won’t detect addition or deletion of DNA units or flipping of large segments of DNA within a chromosome. It also won’t find truly rare mutations that vary in less than 1 percent of the population, or the somatic errors that occur in cancer cells. What it will find are fairly common blips in the genetic code—blips that constitute about 90 percent of human genetic variation.
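The case-control comparison described above can be sketched for a single SNP. The following is a minimal illustration, not the method of any particular study cited here: it assumes a simple chi-square test on a 2x2 table of allele counts, and the function name and the counts are hypothetical. It also assumes every row and column of the table has at least one allele, since an empty one would cause a division by zero.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Counts are alleles, not people: each person contributes two.
    Returns (chi2, p_value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were the same in both groups.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical SNP: 1,000 cases and 1,000 controls, so 2,000 alleles per group.
chi2, p = allelic_chi_square(case_alt=700, case_ref=1300,
                             control_alt=600, control_ref=1400)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

A genome-wide scan repeats a test of this kind at each of the several hundred thousand SNPs on the chip, which is why the significance threshold matters so much.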
Early Successes: Macular Degeneration & Crohn’s Disease
Before this summer, genome-wide studies had a trickle of early triumphs. Three independent groups reported in 2005 that a single point mutation increases a person’s risk of developing age-related macular degeneration, the most common form of age-related vision impairment, by as much as seven times. The Yale University group used some 100,000 SNPs, a fairly small number by today’s standards, in a tiny cohort of 96 cases and 50 controls.2 That discovery has inspired ongoing development of drugs based on complement factors H and B, the proteins that the genetic findings pinpointed.
Last fall, another group used a genome-wide scan to discover a novel genetic link to Crohn’s disease, a common inflammatory bowel condition.3 The study of 300,000 SNPs in about 550 cases and controls homed in on three statistically significant links. Two were known genetic markers for Crohn’s disease. The third, IL-23R, was brand new. It has since been widely reproduced and prompted new therapeutic research.
These are the good-news tales in taking a statistical association to the realm of medicine.
Early Disappointments: The Replication Problem
Unfortunately, the last couple of years also saw many genetic associations that turned out to be embarrassing dead ends. Epidemiologists had faced a similar problem in the 1990s, when some worried that too many false discoveries threatened the field’s credibility.4 In 2005, a genomic study that was large for its time found 13 associations with Parkinson’s, the degenerative neurological disease.5 Follow-up studies the subsequent year didn’t support a single one of the links.6 A much-touted link to longevity for the gene MTP was, ironically, short-lived: the original study made headlines in 2003 but had been largely discredited within two years. Associations with obesity (GAD, ENPP1 and, most recently, INSIG2) failed to produce convincing follow-ups.
“What we see in the media, or we have seen in the past ten years, is every now and then something will come up: ’Oh, there’s a new gene for obesity,’ or ’There’s something new for cancer,’” said Lon Cardon, PhD, professor of bioinformatics at the University of Oxford and at the Fred Hutchinson Cancer Research Center in Seattle. “By and large, those things turned out not to be reliable.”
Part of the problem was statistics. With hundreds of thousands of genetic suspects, the chance that at least one looks guilty by coincidence alone is high. Complex diseases are thought to involve tens or hundreds of associations, each exerting a small effect. Distinguishing a real effect from background noise, or from a genotyping error, was difficult. The p-values reported in these publications were a far cry from the stringent values needed to rule out chance across so many tests.
“People were optimistically taking exciting nominal p-values, 10⁻⁴, 10⁻⁵, and trying to construct interesting functional hypotheses,” said Mark Daly, PhD, a population geneticist at the Broad Institute in Cambridge, MA. “What has changed is now people appreciate the statistics and that you will get some of those results by chance.” To establish a biological link, he said, geneticists now replicate the association in another population.
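A rough calculation shows why nominal p-values of that size are unconvincing when 500,000 markers are tested at once. The sketch below is an illustration under stated assumptions rather than a procedure described in this article: it treats the tests as independent, and it uses the Bonferroni correction with a conventional 0.05 family-wise error rate as a stand-in for whatever thresholds individual studies adopt. The function names are hypothetical.

```python
def expected_false_positives(num_tests, p_threshold):
    """Expected number of tests passing the threshold when no SNP is truly associated."""
    return num_tests * p_threshold


def bonferroni_threshold(num_tests, family_alpha=0.05):
    """Per-test threshold that holds the chance of any false positive near family_alpha."""
    return family_alpha / num_tests


num_snps = 500_000  # markers on a single chip, as described in the article

for nominal_p in (1e-4, 1e-5):
    hits = expected_false_positives(num_snps, nominal_p)
    print(f"p < {nominal_p:g}: about {hits:.0f} chance hits expected")

print(f"Bonferroni per-SNP threshold: {bonferroni_threshold(num_snps):.1e}")
```

Under these assumptions, a scan with no real signal would still be expected to turn up dozens of SNPs at the looser nominal threshold, which is consistent with Daly’s point that some such results arise by chance.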
A recent document by the National Institutes of Health, “Replicating genotype-phenotype associations,” emphasizes replication and sets some standards: that the replication study should look at the same location on the chromosome, not a nearby position; that the replication study should look at the same or very similar phenotype; and that a similar population should be studied in the follow-up.7 The report also includes guidelines for reviewing genome-wide association studies.
This being said, a negative result isn’t always the last word. If the association was specific to one population, it might be that the second population didn’t have that genetic risk, or that they were in an environment that didn’t trigger the genetic expression.
This year’s genetic results meet the higher standards for replication, Cardon and others insist. They have already been verified in another, usually bigger, population. For instance, a link between the FTO gene and obesity was first discovered by members of the Wellcome Trust in a study among 2,000 subjects and 3,000 controls. Then the researchers looked at that one position in relation to the physique of some 39,000 other people before announcing they’d found a reliable genetic link.
The Successes of 2007—And Their Limits
The phenotypes of most interest today—obesity, diabetes, heart disease, cancer—are complex traits that likely have many different causes. During the last few months, genome-wide association studies have produced results that offer hope of finding out why we inherit a risk for such traits.
For example, in April, four separate groups reported seven new associations with Type 2 diabetes, bringing the total number of associations to 10. But the new associations are hardly slam-dunks. Taken together, they explain between 2 and 20 percent of a person’s risk of developing diabetes. The number of associations is expected to grow with subsequent studies. If the risks are additive, then as we discover more genes we will explain a larger percentage of susceptibility.
Drug response seems likely to be a similar story. Researchers studying warfarin had a big hit early on using the candidate-gene method. A single change in the vitamin K epoxide reductase gene, VKORC1, predicts almost a quarter of the patient’s response to the common blood-thinning drug.8 But follow-up studies, now using genome-wide associations, have found additional genes that explain fractions of the observed response: 10 percent, or 5 percent, or even less.
“If genetic factors can explain 5 percent of a phenotype, that’s considered a big deal,” said Mark Rieder, PhD, a geneticist at the University of Washington in Seattle and lead author of the original warfarin study.
But from a patient’s perspective, what does it mean to have a 5-percent increase in risk of getting a disease? And for a physician, what does it mean to have a 5-percent higher chance that a medication will cause side effects? Even a 50 percent increase in risk might not mean as much as it appears. If the baseline risk is 3 in 1,000, then a person who carries the gene for a 50 percent increased risk has a 0.45 percent chance of developing the symptom, instead of 0.3 percent. Will a person stop eating burgers if he or she has the gene that increases risk of heart disease by 0.15 percentage points?
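The arithmetic behind that example separates relative risk from absolute risk. The short sketch below reproduces the figures in the paragraph above; the function name is a hypothetical helper, and the calculation assumes the relative increase simply scales the baseline risk.

```python
def absolute_risk(baseline_risk, relative_increase):
    """Risk for a carrier, given baseline risk and a relative increase (0.50 = 50 percent)."""
    return baseline_risk * (1 + relative_increase)


baseline = 3 / 1000                       # 0.3 percent
carrier = absolute_risk(baseline, 0.50)   # 50 percent relative increase

# Difference expressed in percentage points, not percent.
difference_points = (carrier - baseline) * 100

print(f"baseline risk:  {baseline:.2%}")
print(f"carrier risk:   {carrier:.2%}")
print(f"absolute change: {difference_points:.2f} percentage points")
```

The same relative increase applied to a larger baseline would produce a much larger absolute change, which is why the baseline matters when interpreting a reported risk.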
Because it’s tough to answer these questions, screening tests for common diseases remain rare. But there are a few exceptions: Roche Diagnostics and deCODE Genetics, an Icelandic company, announced in July that they would offer screening for single mutations associated with increased risk of schizophrenia, Type 2 diabetes and stroke. And in August, deCODE’s chief executive said the company will also use recent results for glaucoma to design a screen to improve diagnosis of the degenerative eye condition.
Still, many scientists caution against using the newly identified genes for screening whether a person is slightly more or less likely to get sick.
“It doesn’t matter how much [the associations] increase risk,” said Cardon. “It’s the fact that they’re a brand-new pathway about the biology of the disease.”
A better understanding of the disease may provide clues to new treatments.
A Mechanistic View: It’s Not About Screening
Most researchers pursue the so-called “agnostic” studies to discover new genetic associations that could help to understand the disease mechanism and, ultimately, help to design drugs and genetic therapies.
A few of the discoveries, such as the August link to a major form of glaucoma, implicate a known protein and suggest a biological basis. But the biological basis of many new associations remains a mystery. Two groups reported in May that a single genetic variant on chromosome 9 increases risk of heart disease by up to 60 percent. The variant is common, with a fifth of the European population having two copies. But it lies outside coding regions in an area of the genome with no known function. Intriguingly, the mutation lies close to one of the Type 2 diabetes variants reported a week earlier.
In fact, most of the associations with Type 2 diabetes are located in “gene deserts” that have no known regulatory or coding function. None were in the locations for insulin resistance suspected of being linked to diabetes. Many of the new breast-cancer associations are similarly thousands of base pairs away from coding genes.
Collins says many geneticists are not surprised that the common diseases are influenced by variants outside coding regions. It makes sense that the contributions would be subtle. For example, rather than producing a different protein, the risk variants are being found in regulatory regions that might change the magnitude of a gene’s activity.
“I think we all expected some of [the associations] to be non-coding,” Collins said. “It’s surprising just how many that applies to.”
A recent study took a first step toward providing a mechanism. An international team of researchers led by Bill Cookson, PhD, at Imperial College London reported in July that genetic markers on chromosome 17 increase a child’s risk of asthma by about two thirds. The researchers then measured gene expression. The children with the variant had higher expression of a different gene, called ORMDL3, in their blood. Their results imply that the genetic variant somehow causes more transcription of the ORMDL3 gene, and that may provide the pathway for the disease.
If genome-wide associations lead to advances in our understanding of biology, that’s when they will really matter. As one researcher said: “If you have an association, you publish in PLoS Medicine. But if you have a mechanism, you publish in Science or Nature.”
Environmental and Racial Issues
The mechanism for complex diseases may be especially difficult to tease out because many of these traits interact with the environment.
“We’re in a stage where we have a lot of progress, but it’s not yet clear how well we’ve come to grips with the genetic basis of traits that are complex,” said Ken Weiss, PhD, a geneticist at Penn State University. He points to complex phenotypes such as heart disease, obesity and breast cancer, where in a single generation the incidence of disease has skyrocketed without any change in the population’s genes. “Clearly, the population’s gene pool hasn’t changed in the last 30, 40, 50 years. But the disease risk has changed dramatically. Well, that has to be attributed to what we would call environmental factors, even if we don’t know what the factors are.”
Mysterious environmental triggers are yet another reason why scientists advise waiting before predicting risk. It may be that some populations live in an environment that sets off the genetic trigger. Other people, with identical genes, might react differently.
Just as touchy an issue as environment is race. Researchers in the U.K. have been careful to choose ethnically homogeneous samples. Otherwise, they risk finding differences between racial groups. And although the HapMap was created by comparing different ethnic groups, the resulting map describes European variation more thoroughly than the vast variety of genes that exist among peoples from sub-Saharan Africa.
“Our tools are currently underpowered with respect to performing genome-wide association studies in populations of African ancestry,” said Malia Fullerton, PhD, a bioethicist at the University of Washington School of Medicine in Seattle. “I think these differences are widely appreciated,” she added. “But in the United States context, if we want to be true to our national demographics, and include a representative mixture of people in our studies, then we have to be paying closer attention to these issues.”
The Future: Bigger and Better
The onslaught of data already raises new statistical questions, and those are only likely to increase. The newest chip from Affymetrix measures more than 1.8 million markers. And studies will also increase the number of people scanned for SNPs. Some complex diseases will likely involve rare mutations, generally classified as those that crop up in less than 5 percent of the population. Studying these rare mutations demands bigger and bigger sample sizes. If a variant exists in just 1 percent of a population, then a study of 100,000 people (far bigger than anything yet attempted) would count just 1,000 people with that variant.
Biologists already have more data than they know what to do with, asserts Weiss. And he foresees the day when genetic studies will expand to include hundreds of thousands of subjects and sequence more than a million markers for each person—or, likely soon, all 3 billion base pairs in the human genome.
This has already begun. The UK Biobank study this year began recruiting half a million volunteers of European ancestry aged between 40 and 69 for a long-range study that will look at links between genes, health and environment.9 It will take blood samples and keep DNA for further testing, while carefully tracking subjects’ health and environment.9 Similar projects have been discussed in other countries, including the United States. The current NIH budget does not permit such an undertaking here, Collins says. “I’m deeply concerned about that, because I think we’re going to kick ourselves five or six years from now.”
The recent successes will prompt more investigation into treatment, searches for new associations and a better understanding of what the existing associations mean. As nearly every aspect of our selves comes under study (dyslexia, autism, schizophrenia, obesity), the amount of data will grow.
“It’s not yet clear, I think, how much more information versus more confusion this huge amount of new data is going to cause,” Weiss said. “We’ll have to wait and see what people will attempt.”
1. Topol EJ, Murray SS, Frazer KA (2007) The genomics gold rush. JAMA 298: 218–221.
2. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–389.
3. Duerr RH, et al. (2006) A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314: 1461–1463.
4. Taubes G, Mann CC (1995) Epidemiology faces its limits. Science 269: 164–169.
5. Maraganore DM, de Andrade M, Lesnick TG, Strain KJ, Farrer MJ, Rocca WA, Pant KPV, Frazer KA, Cox DR, Dennis G, Ballinger DG (2005) High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet 77: 685–693.
6. Myers DR (2006) Considerations for genomewide association studies in Parkinson disease. Am J Hum Genet 78: 1081–1082.
7. NCI-NHGRI Working Group on Replication in Association Studies, et al. (2007) Replicating genotype-phenotype associations. Nature 447(7145): 655–660.
8. Rieder MJ, Reiner AP, Gage BF, Nickerson DA, Eby CS, McLeod HL, Blough DK, Thummel KE, Veenstra DL, Rettie AE (2005) Effect of VKORC1 haplotypes on transcriptional regulation and warfarin dose. New Engl J Med 352: 2285–2293.
9. Palmer LJ (2007) UK Biobank: bank on it. Lancet 369(9578): 1980–1982.
Strength in Numbers: Databases
Prompted by the need for ever larger studies, medical geneticists are learning to play together.
The Wellcome Trust Case Control Consortium (WTCCC) brought together over 50 research groups from the United Kingdom to carry out its recent genome-wide study, the largest yet. The group makes its data available on its Web site. The study authors write that larger sample sizes (in their case, 2,000 for each disease and 3,000 shared controls, all of European ancestry) greatly increased the number of statistically significant associations they were able to find.
The database of Genotype and Phenotype (dbGaP) was launched in 2006 by the National Center for Biotechnology Information and is housed at the NIH. It will receive data from NSF-sponsored studies, which strongly encourage researchers to deposit data to a public source. dbGaP also encompasses the Genetic Association Information Network (GAIN), a public-private partnership between the Foundation for the National Institutes of Health and Pfizer’s research branch. The national institute offers researchers a carrot: if they deposit phenotype data for clinical studies underway or already conducted, it will sequence the study participants’ DNA.
For the genetics of drug response, PharmGKB was established in 2000 as part of an NIH-sponsored pharmacogenomics research network. The network offers data to investigators outside the network, and it actively recruits data from any relevant studies. Persuading researchers to contribute data has become much easier in the past seven years, said Teri Klein, PhD, PharmGKB’s director at Stanford University. PharmGKB curators also scan publications and contact authors to ask them to submit data.
Ideally, all these databases will support one another. PharmGKB will link to microarray data housed on the NCBI’s Gene Expression Omnibus database, merely noting what data is available. The organizers have developed a similar relationship with dbGaP, Klein said, while PharmGKB will focus specifically on drug response.
Even more specialized databases are cropping up. The Bipolar Disorder Phenome Database, launched in July as part of Johns Hopkins Psychiatry’s “BioinforMOODics” site, offers detailed symptom descriptions and complete SNP profiles for more than 5,000 people with bipolar disorder.
Hurdles remain. Scientists don’t want yet another hoop to jump through when publishing results. And many are reluctant to share their data before results are published. From the study participants’ side, issues of informed consent and access to the data are under discussion. But more data sharing may be inevitable: new evidence suggests that the bigger the study, the more possible associations will be found.
Fuzzy Phenotypes: Taking The Measure of Disease
SNP discovery offers the computational biologist an obligingly quantitative trait. Each base-pair offers four possibilities that fit neatly into a ones-and-zeroes database. But researchers often struggle to fit phenotypes into an equally tidy box.
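The “ones-and-zeroes” fit on the genotype side can be illustrated with a two-bit code per base. This is a minimal sketch of one possible encoding, not a description of how any database named in this article stores its data; the mapping and the function names are assumptions made for the example.

```python
# Hypothetical two-bit code: four bases, four bit patterns.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(bases):
    """Pack a string of A/C/G/T into one integer, two bits per base."""
    packed = 0
    for base in bases:
        packed = (packed << 2) | BASE_TO_BITS[base]
    return packed


def unpack_bases(packed, length):
    """Recover the base string; length is required because leading A's pack to zero bits."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[packed & 0b11])
        packed >>= 2
    return "".join(reversed(bases))


packed = pack_bases("GATTACA")
print(bin(packed), unpack_bases(packed, 7))
```

No comparably compact code exists for most phenotypes, which is the contrast at issue here.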
Sometimes it’s easy. For example, pharmacogenomics associations are measured in a clinical setting. Computerized photographs of Petri dishes help assess lab cultures, and regular readouts of patient information measure drug response. One prime example is patients’ reactions to warfarin. Because dosing is problematic, and mistakes can be fatal, delivery and patient monitoring happen in “a very tightly controlled environment with a narrow outcome, standardized across clinics. So that’s a good phenotype,” said Mark Rieder, PhD, a geneticist at the University of Washington.
But as genome-wide association studies set themselves more and more targets (autism, schizophrenia, obesity, asthma), the challenge becomes quantifying those traits in a meaningful way. Is anybody who’s wheezed and self-identified as asthmatic an asthma patient? Does someone whose weight changes from one year to the next qualify as obese? Many of these traits may have a genetic component, but they also reflect decades of life experience.
Studies of irritable bowel syndrome carefully parsed out subjects and chose only those whose symptoms were most similar. Genetic screens for bipolar disorder selected those patients for whom the illness began at an unusually young age, or those who also experienced panic attacks. Some researchers studying schizophrenia measure subjects’ startle response as a quantitative proxy for the condition. Clever study design also helped with Type 2 diabetes, which would seem to be rife with such problems: many doctors would consider the condition to be a number of different diseases that manifest a similar set of symptoms.
"I think the most challenging part is still the fuzziness of the phenotypes," said Michelle Carrillo, PhD, a curator at the PharmGKB database, “because ambiguity makes it hard to aggregate [data from different studies] and that's what most investigators want to do." A group in Florida and a group in San Diego might both study hypertension and have genotyped their subjects. The genotyping occurs at particular positions and can be compared. But the phenotypes are harder because the tests might have been run differently in Florida and in San Diego.
The National Institutes of Health encourages researchers to describe methods and measurements in as much detail as possible so that subsequent studies can compare results. And PharmGKB is working on a phenotype ontology, so the vocabulary is standardized.
Pharmacogenomics researchers face an additional challenge: comparable studies must match up not only genotype and phenotype, but also drug dosage. To further research on warfarin, the blood-thinning drug, a new 21-institution consortium is working to develop standard dosing guidelines among members in the United States, United Kingdom, Israel, Japan, Korea and Brazil, said Klein. The Pharmacogenetics Research Network is now identifying other areas where a consortium would be beneficial, such as tamoxifen for breast cancer and statin drugs, Klein said.