Getting It Right: Better Validation Key to Progress in Biomedical Computing
Bringing models closer to reality
When the ill-fated space shuttle Columbia launched on January 16, 2003, a large piece of foam broke off the external fuel tank and struck the left wing. Alerted to the impact, NASA engineers used a computer model to predict the possible consequences. Their conclusion: It will likely be okay. But, in fact, the foam had catastrophically damaged the shuttle’s thermal protection system, causing Columbia to disintegrate during reentry and killing all seven crew members.
Investigators later concluded that the disaster might have been averted. One of the key failures: The computer model got it wrong. The model had been validated for small pieces of foam, not “huge hunking pieces,” says Jerry Myers, PhD, chief of the Bio-Science and Technology branch at NASA’s Glenn Research Center. “Because it had been well-validated down in the low end in that operational scheme, everybody took it at face value that it would work in the upper scheme.” (In fact, one simulation did predict catastrophic failure—but engineers distrusted that particular simulation.) The assumptions and uncertainties of the model were never fully presented to higher-ups, who consequently made the wrong decisions. This chain of failures led NASA to implement a comprehensive standard (NASA 7009) for vetting models and simulations.
Models can be extremely valuable: They complement experimental studies by providing additional insights in a cost-effective way. But the value of a model depends on rigorous validation, as the Columbia accident tragically shows. Modelers in high-stakes fields—aeronautics, nuclear physics, bomb making, and weather prediction, for example—understand this well. But “in the biomedical sciences, there hasn’t been such a culture of holding people to the fire of validation,” says Peter Lyster, PhD, program director in the Division of Biomedical Technology, Bioinformatics and Computational Biology at the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH).
This laissez-faire ethos is going to have to change. Biomedical modeling has now entered a high-stakes era: Models are increasingly being used to make direct clinical decisions, with life-and-death consequences, such as choosing between cancer drugs. At the same time, there is a brewing crisis of confidence in bioinformatics and biomedical computing (see: Meet the Skeptics, in the Summer 2012 issue of this magazine). Scores of papers have been published claiming “success”—for everything from disease signatures to drug targets—but practical applications have been few, and some models have been debunked (see: Errors in Biomedical Computing, Fall 2011 issue of this magazine). These factors are fueling an intense discussion on validation in biomedical modeling circles.
The point of validation is to help modelers and model consumers decide: Does the model get close enough to reality so that they can use it with confidence in a particular scenario? Perfect validation isn’t always the goal; sometimes a less costly validation might suffice if the costs of making a mistake are low. The problem is that current validation schemes for biomedical models are often inadequate given the stakes. This article describes several common pitfalls of current practices, as well as several efforts to remedy these issues by innovating or standardizing validation for biomedical models.
The Status Quo
When biomedical modelers talk about “validation” currently, they may mean many different things. Some researchers may confuse verification—checking that the code does what it’s supposed to—with validation; but verification is only a prerequisite to validating a model. Some researchers also confuse peer review with validation. “We had a long discussion with a couple of researchers a while back as to what constituted validation. And their response was, ‘publication in the general literature,’” Myers says. “But that is just not right.” At most, peer review provides a very low-level, “do-my-concepts-look-good” validation, he says. Peer review is simply not equipped to vet high-throughput data and complex models in a meaningful way.
Researchers who go beyond verification and peer review will typically validate their models against existing data. Using a kind of statistical validation, they fit the model on one set of data while holding out some of the data for subsequent “independent” testing. For example, in the old days of predicting protein structure from sequence, people used to fit an algorithm to one set of known structures and then test it on a separate set of known structures, says John Moult, PhD, professor of cell biology and molecular genetics at the University of Maryland. In theory, this could provide reasonable validation—but in practice, there’s good evidence that it simply doesn’t work. “In practice, we’re all rather fallible. It’s very hard if you know the answer not to be unconsciously biased by it,” Moult says.
“I don’t mean that people deliberately cheat,” Moult says. “I think it’s a lot subtler than that. In the field that I’m familiar with, there are a lot of very, very smart people and they’re very honest people by and large. But somehow we fool ourselves.” Information inevitably “leaks” from the training set to the test set; for example, if the model doesn’t fit quite right on the test set, researchers go back and tweak the algorithm a little, Moult says. Or the training and test set may contain such similar samples that the algorithm works well on both, but does not generalize to other problems.
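The kind of unconscious leakage Moult describes is easy to reproduce. The sketch below (purely illustrative; the toy “model” and all numbers are invented) works on pure-noise data with no real signal, yet because a tuning parameter is adjusted by repeatedly peeking at the test set, the test-set estimate looks better than chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat = 50

def make_noise_data(n):
    # Pure noise: the features carry no real information about the labels.
    return rng.normal(size=(n, n_feat)), rng.integers(0, 2, size=n)

X_train, y_train = make_noise_data(100)
X_test,  y_test  = make_noise_data(100)

# Toy "model": rank features by correlation with the training labels,
# then classify by a median split on the mean of the top-k features.
corr = np.abs([np.corrcoef(X_train[:, j], y_train)[0, 1] for j in range(n_feat)])

def score(k, X, y):
    top = np.argsort(corr)[-k:]
    s = X[:, top].mean(axis=1)
    pred = (s > np.median(s)).astype(int)
    return (pred == y).mean()

# Leaky workflow: tune k by repeatedly checking accuracy on the test set.
best_k = max(range(1, 21), key=lambda k: score(k, X_test, y_test))
leaky_estimate = score(best_k, X_test, y_test)

# Honest check: the same tuned model on genuinely unseen noise data.
X_new, y_new = make_noise_data(100)
honest_estimate = score(best_k, X_new, y_new)

print(f"test-set estimate after tuning: {leaky_estimate:.2f}")
print(f"estimate on truly new data:     {honest_estimate:.2f}")
```

On fresh data the tuned model typically falls back toward chance, which is exactly why blinded or truly new data matters.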
Researchers do better when they get beyond statistical validation and benchmark their algorithms against truly new experimental data (or data that they were blinded to during algorithm development). However, even in this situation biases slip in. Researchers may selectively report the most optimistic validation results, for example. “We call it the self-assessment trap,” says Gustavo Stolovitzky, PhD, manager of functional genomics and systems biology at the IBM Computational Biology Center. “You want to publish your paper and, therefore, at the end of the day, some of the objectivity of the scientific enterprise is lost.”
In a 2011 paper in Molecular Systems Biology, Stolovitzky and colleagues surveyed 57 modeling papers—within a few specific areas—in which authors assessed their own methods. Sixty-eight percent of authors reported that their method was best for all metrics and all datasets; and 100 percent reported that their method was among the best. But, of course, this is impossible—these methods cannot all be the best.
Another problem with the status quo is that most researchers view validation as a one-time, one-size-fits-all endeavor. “The word ‘validated’ can get slipped in very, very easily,” says David M. Eddy, MD, PhD, founder and medical director of Archimedes, a healthcare modeling company in San Francisco. “A team can validate the model in one population for one outcome for one treatment, and then they’ll attach the word ‘validated’ to the model as though it’s a property of the model, that goes with the model wherever the model goes—to any treatment, to any outcome, to any population, to any time period,” he says. This, of course, leads to the kind of dangerous extrapolation that happened with the Columbia disaster. Plus, if you only validate a model once, that model is going to be out of date in a few years, Eddy says.
Finally, most prevailing validation efforts omit a critical element: error bars. Since a model can never match reality perfectly, “validation is mostly about knowing what the errors are and accounting for them,” Lyster says. The uncertainties in the model and data need to be quantified by putting error bars around model predictions. “It’s not just a matter of having a forecast; it’s a matter of knowing accurately how fat the error bars are,” Lyster says. “You’ve got to know that you’ve got good error bars so that people can go to the bank with them.”
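One generic way to get the error bars Lyster describes is the bootstrap: refit the model on resampled data and see how much the prediction moves. The sketch below is a minimal illustration, not any particular group’s method; the linear model, noise level, and prediction point are all invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a straight-line model fit to noisy measurements,
# used to predict the response at a new input x_new = 2.5.
x = np.linspace(0, 2, 30)
y = 1.8 * x + 0.5 + rng.normal(scale=0.4, size=x.size)
x_new = 2.5

def predict(xs, ys):
    slope, intercept = np.polyfit(xs, ys, 1)  # least-squares line fit
    return slope * x_new + intercept

# Bootstrap: resample the data with replacement, refit, and record
# how the prediction varies across refits.
boot = []
for _ in range(2000):
    idx = rng.integers(0, x.size, size=x.size)
    boot.append(predict(x[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"prediction: {predict(x, y):.2f}  (95% interval: {lo:.2f} to {hi:.2f})")
```

The width of that interval—not just the point forecast—is what a decision-maker needs to “go to the bank with.”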
Many modelers resort to statistical validation on existing data because they don’t have the expertise, time, or resources to generate new experimental data. But it’s becoming increasingly easy to outsource validation experiments, says Atul Butte, MD, PhD, associate professor of pediatrics at Stanford University. Outsourcing validation doesn’t mean assays performed on-the-cheap in China or India. Rather, modelers can hire companies or university core facilities—experts in a particular research technique—to run the specific experiments needed to test their model predictions. “I’m a big fan of this approach,” Butte says.
Using online marketplaces—such as ScienceExchange.com, AssayDepot.com, and Biomax.us—modelers can find exactly the services or samples they need. It’s a lot like shopping on Amazon.com. Need high-quality serum from breast cancer patients treated with Tamoxifen? Just drop it in your shopping cart. Need to test a drug in a rat model of inflammatory bowel disease? Here are 15 companies that can do it for you. “This is the most amazing thing for us in informatics and computational biology. If we want to do this kind of translational work, all this is here waiting for us,” Butte says.
In 2011, Butte’s team published back-to-back papers in Science Translational Medicine that highlight the value of outsourcing validation. Butte’s team devised an algorithm that mines publicly available gene expression data to find new uses for old drugs. The model predicted that cimetidine, an antiulcer drug, would be effective against lung cancer. Butte hired the Transgenic Mouse Research Center core facility at Stanford to test this prediction; the result: the drug indeed slowed the growth of lung cancer in mouse models. Butte’s algorithm also predicted that topiramate, an anti-seizure drug, could be used to treat inflammatory bowel disease. Butte collaborated with scientists at Stanford to test the prediction in rats and additionally hired two companies that he found through AssayDepot.com to perform independent replications. All three experiments gave strong evidence of the drug’s efficacy. One of the companies even provided colonoscopies of the rats, something that his Stanford collaborators couldn’t do. Statistical validation doesn’t resonate much with physicians and biologists, but “when you show them the colonoscopy from the rat, that’s huge value-added for your model,” Butte says.
Using outsourced validation, Butte has repositioned one drug—moving it from computational prediction to cell and mouse models and then to clinical trials (which are about to begin)—in the span of eight months. “It is getting to be too trivial to get just a simple bioinformatics paper published. Those kinds of papers are slowly losing their impact, especially with non-computational scientists,” Butte says. “I think if you want to find and show something big, this is how you’re going to do it.”
Are the data trustworthy? Sure, it’s a worry, Butte admits. “But that worry lasts maybe about 48 hours—the time it takes you to get the samples that you would have otherwise waited months or years to get,” he says. And because companies send Butte the raw samples as well as the data (he has the formalin tissue slides on his shelf), he can look at them himself and even seek a second opinion. (For example, a pathologist might read slides for a dollar apiece, he says.) And, if the price is low enough, one can always send the same experiment to two or more vendors, for maximum independent validation, Butte says.
Outsourced validation experiments may actually be more robust and of higher quality than experiments done by the computational modeler, says Elizabeth Iorns, PhD, the cofounder and CEO of Science Exchange in Palo Alto. Science Exchange is a marketplace for university core facilities. “If you have one person who is doing all the experiments, no matter how hard they try, subconsciously they’re looking at the data in a way that matches what they want it to say. So, distributing the experiments across multiple investigators is a way to eliminate the individual investigator bias.” Plus, the core facilities tend to be extremely specialized in a particular experimental technique. So the quality tends to be higher than if an inexperienced postdoc or graduate student is running the experiment, she says.
To promote the cause of validation, Science Exchange recently launched the Reproducibility Initiative. Scientists may apply to have previously published research (including models) independently tested through the Science Exchange network; then they can publish the results in a special issue of PLoS ONE. Even if a validation study refutes a computational model, Iorns points out, it’s better to publish this failing yourself rather than for someone else to discover and publicize it.
One of the most successful innovations in validation is the use of collaborative competitions. These competitions engage the community in an ongoing, cyclic model of validation that helps the field progress, Lyster says. The first of these competitions, CASP (Critical Assessment of Techniques for Protein Structure Prediction), began in 1994 and is now in its tenth round. Others quickly followed, including CAGI (Critical Assessment of Genome Interpretation), CAPRI (Critical Assessment of PRedicted Interactions), the American Society of Mechanical Engineers (ASME) Grand Challenges, and DREAM (Dialogue on Reverse Engineering Assessment and Methods)—which is now in its seventh round.
Teams work on the same challenges, so it is possible to directly compare their performance; and independent judges evaluate the methods using several well-defined metrics. These objective assessments help to “break that vicious circle of self-assessment traps and lack of sufficient rigor,” Stolovitzky says. Competitors are also blinded to solutions, which further reduces bias. For example, in CASP, organizers gather unpublished data from X-ray crystallographers and NMR spectroscopists who are on the cusp of solving a structure. “The key thing about CASP is that one doesn’t know the answers; one is doing genuine blinded prediction,” Moult says.
After each competition, results and data are made freely available to the community so that everyone can learn from the successes and failures. Competitions systematically reveal where people are “fooling themselves”; they also give a field insight as to which problems have been effectively solved. “As participants in a field, we’ve got much better feedback on what the real issues are and where we should focus our efforts,” Moult says. DREAM organizers also aggregate the best solutions—yielding a collaborative algorithm that often outperforms the best single method. “This is the wisdom of crowds,” Stolovitzky says.
Competitions also increase confidence. “In the area of protein structure, before CASP got established, we were sort of a laughing stock with the experimentalists. They all knew that we were exaggerating,” Moult says. That’s completely changed, he says. “In terms of people in the broader protein structure community having more confidence in the methods, it’s had a huge impact.”
Industry can benefit from competitions as well—as evidenced by the Netflix Prize for successfully predicting a person’s taste in movies—but the rules need to be slightly different, Stolovitzky says. He and colleagues have pioneered a collaborative competition model for industry, called IMPROVER (Industrial Methodology for Process Verification of Research). IBM and Philip Morris codeveloped the first set of IMPROVER challenges in systems biology. They aimed to verify that computational approaches can use transcriptomic data to classify clinical samples into diseased and non-diseased (for specific illnesses, including multiple sclerosis, lung cancer, and chronic obstructive pulmonary disease). Entries were assessed using gene expression data from unpublished cohorts of cases and controls. Data, gold standards, and scores are available at sbvimprover.com.
Predictive and “One-Click” Validation
In crowdsourced validation, the participants are blinded but the answers are known to the organizers. “Predictive validation” takes blinding one step further: predictions are made while an experiment is ongoing—in other words, when the answers are truly unknown. This type of prophetic validation has a certain “wow factor” that is particularly useful for convincing skeptics.
For example, in 2004, the American Diabetes Association asked David Eddy whether the Archimedes healthcare model could predict the results of an ongoing clinical trial called the Collaborative Atorvastatin Diabetes Study (CARDS). The trial was testing whether atorvastatin could reduce the chance of heart attack or stroke in people at risk, especially diabetics. Months before completion of the study, Eddy’s team simulated the trial, sealed the resulting predictions in an envelope, and FedExed them to the American Diabetes Association, the principal investigators of CARDS, and Pfizer (the drug’s sponsor).
Their predictions were “right on the money” for three of four outcomes, Eddy says: they closely predicted the actual rates of heart attack and stroke in the control group and heart attack in the atorvastatin-treated group. They underestimated the drug’s ability to prevent strokes, but even that “error” turned out to have value, Eddy says. In the absence of data, the modelers had assumed that atorvastatin’s effects on stroke would be similar to that of other statins; but it turns out that atorvastatin may, in fact, be more effective. “The mismatch between our model and the real results was what alerted Pfizer to that fact. So that’s opened up other research avenues,” Eddy says.
The success of the predictive validation won over modeling skeptics at the American Diabetes Association, which went on to commission considerable work from Archimedes, Eddy says.
Modelers at the company have also pioneered a “one-click” validation tool that addresses the need to continually update and revalidate models. With validation, “there’s no end point. It’s not as though some hand comes down from the sky and says ‘you’ve got it; you can rest, relax.’ It’s a constant process,” Eddy says. One-click validation provides an automated way to revalidate models every time new medical evidence comes out. “With this one-click validation, we’re getting much, much more efficient. We don’t have to set up each new trial every single time,” Eddy says.
Companies are increasingly using biomedical modeling in regulatory submissions for medical products, making validation a hot topic at the Food and Drug Administration (FDA).
Currently, decisions about validation are made on a case-by-case basis—and, correspondingly, what companies report to the FDA is highly variable. “What we get from manufacturers is just such a range in terms of defining their models, defining the limitations, defining what they’re using the model for—the things that you think would be in any test report. We’re not necessarily even getting those basics,” says Donna Lochner, associate director for scientific outreach at the FDA’s Center for Devices and Radiological Health.
So, the FDA is working to develop standards. “We want to promote greater use of computational models. One of the ways we can promote their use is to come out with clear expectations with respect to validation. That’s where we are now,” Lochner says.
Standardizing validation for biomedical models is a challenge. “The models in this space are very complex, particularly when talking about long-term interactions between a device and a patient,” says Tina M. Morrison, PhD, a mechanical and biomedical engineer at the FDA’s Center for Devices and Radiological Health. Though validation standards exist for hard-core engineering and physics-based fields, these don’t necessarily transfer well to biomedical models—because data from living systems are harder to come by and highly variable, Morrison says.
Morrison and colleagues at the Center for Devices and Radiological Health have begun drafting a guidance document specific to medical devices. Though in its early stages, some of the essentials are clear: “first and foremost, good documentation,” Morrison says. We need companies to document “what they did, why they did it, what their results are, how confident they are in those inputs, and the use history of those models,” she says. Secondly, validation will have to be more quantitative. It’s not sufficient to say that the prediction and the data match by 20 percent and that’s “close enough,” Morrison says. Companies might need to perform formal uncertainty analyses (adding error bars) and sensitivity analyses, where they tweak the parameters and the assumptions in the model and see how much that affects their predictions. Finally, the FDA is creating an innovative scheme to risk-stratify validation requirements, so that the level of validation depends on how the model is being used in the regulatory submission.
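The sensitivity analysis Morrison describes—tweak the parameters and assumptions, then see how much the prediction moves—can be as simple as perturbing each input in turn. The sketch below is a generic illustration, not an FDA procedure; the one-compartment drug model and every parameter name and value are invented.

```python
import numpy as np

# Hypothetical one-compartment drug model: how sensitive is the predicted
# concentration at t = 6 h to each parameter? (All values illustrative.)
params = {"dose_mg": 100.0, "volume_L": 40.0, "clearance_L_per_h": 5.0}

def concentration(p, t=6.0):
    k = p["clearance_L_per_h"] / p["volume_L"]       # elimination rate (1/h)
    return p["dose_mg"] / p["volume_L"] * np.exp(-k * t)

base = concentration(params)
for name in params:
    bumped = dict(params, **{name: params[name] * 1.10})  # +10% perturbation
    change = (concentration(bumped) - base) / base * 100
    print(f"+10% {name:18s} -> {change:+.1f}% change in prediction")
```

A table like this tells a reviewer which assumptions the prediction leans on hardest—and therefore which inputs need the tightest error bars.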
For example, imagine a computational model that predicts which commercial hip implant would be best for a given patient, based specifically on his or her anatomy, bone density and activity level. If that model gets it wrong, the stakes are high: the patient could experience a bone fracture and require repeat surgery. So, validation will need to be rigorous. But if a company is just using a model to justify which sizes of its device it needs to evaluate with bench testing, the risks are lower and, thus, a less rigorous validation strategy might suffice.
“Right now, we’re not making big decisions based solely on the computational models; therefore, the level of validation isn’t high,” Morrison says. “But if we start shifting where we make more important regulatory decisions based on the computational outputs, the amount of information that’s going to be needed to support that model’s credibility is going to change.”
The FDA is also developing reference problems to attempt to benchmark model performance. For example, they challenged 28 labs to simulate flow out of a simple nozzle. The variability they see in the computations helps the FDA and the researchers understand how underlying assumptions affect the model’s precision, Lochner says. “By validating the result for a reference problem, we can then gain confidence in the outcomes as more complexity is added to the model.”
In the wake of the Columbia disaster, NASA developed a standard (7009) for assessing the credibility of models and simulations. The goal: to help decision-makers know if they can trust a model’s prediction when it counts.
“When we talk verification and validation [V&V] at NASA these days, we’re really pointing to something that’s a little more globally inclusive, which is what we call the credibility score,” says DeVon Griffin, PhD, project manager of the Digital Astronaut Project (DAP), which uses simulations to evaluate various risks to human health that arise during long-term space travel. “We always did V&V prior to 7009, but the standard provided a systematic way to do it. More importantly, the standard provided a vehicle to communicate with managers so they understand the requirement to do V&V.”
When Myers (who is the DAP technical consultant) first joined NASA’s human research program, the question of validation came up immediately, he says, “because we were getting very grandiose statements about what people’s models could do.” After Griffin identified and provided the standard to him, Myers says his reaction was: “This is like the greatest document I’ve ever read. It is a culmination of 50-plus years across NASA of understanding computational modeling.” But the standard was designed for models in general and did not address the unique challenges of biomedical models. So Myers and his colleagues set about adapting the standard for their human research models. They are now writing up a formal guidebook on how to apply 7009 to biomedical models.
7009 is a synthesis of eight factors: verification, validation (comparison with experimental, simulation, or real world data not used to develop the model), input pedigree (how good are the input data), uncertainty quantification (error bars), model robustness (sensitivity analyses), use history, modeling and simulation management, and people qualifications. The eight categories are scored on a scale of 0 (lowest) to 4 (highest). The scores encompass both internal and external assessments. Though NASA currently uses the lowest of the eight scores as the overall score, NASA’s Human Research Program is working to establish a process for calculating the overall score as a weighted average of the eight individual scores. The goal isn’t necessarily to achieve a perfect 4.0, but rather to get as high as is reasonable for a particular modeling application. For example, a score of 2 or 3 might be the highest score that can be reasonably attained for certain biomedical modeling problems but may well be sufficiently high to meet customer requirements, Griffin says.
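The arithmetic behind the two scoring rules described above is straightforward. In this sketch, the eight factor names follow the text, but the individual scores and the weights are made up for illustration—they are not from any actual NASA assessment.

```python
# 7009-style credibility scoring as described in the text: eight factors
# scored 0-4, overall = the minimum (current NASA practice) or a weighted
# average (the approach the Human Research Program is working toward).
# Factor scores and weights below are invented for illustration.
factors = {
    "verification": 3, "validation": 2, "input_pedigree": 3,
    "uncertainty_quantification": 2, "robustness": 3,
    "use_history": 4, "ms_management": 3, "people_qualifications": 4,
}
weights = {name: 1.0 for name in factors}   # equal weights to start
weights["validation"] = 2.0                 # e.g., emphasize validation

overall_min = min(factors.values())
overall_weighted = (sum(weights[n] * s for n, s in factors.items())
                    / sum(weights.values()))

print(f"lowest-score rule: {overall_min}")
print(f"weighted average:  {overall_weighted:.2f}")
```

Note how the lowest-score rule lets a single weak factor—here, validation or uncertainty quantification at 2—cap the whole model’s credibility, while a weighted average lets strengths elsewhere partially offset it.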
“I make it my business to look out into the field to see what’s happening in V&V and credibility, and I still haven’t found anything that’s as comprehensive as 7009,” says Lealem Mulugeta, DAP project scientist.
Most validation efforts only ask: how well does the model match the validation data? But 7009 additionally asks: how good are the data? “I think a lot of people just make the assumption that just because you have data, it’s good data,” Mulugeta says. “We go through the process of actually vetting our data to make sure that the data are credible and appropriate to use.”
For example, the Digital Astronaut team created a model of an astronaut exercising on the Advanced Resistive Exercise Device (ARED)—the exercise device that astronauts use on the International Space Station to prevent muscle and bone loss. To verify and validate their model, they have to use data on joint torques and forces that were collected on Earth or from other exercise models. So, when the models were extended to exercise simulations in microgravity, the overall credibility score of the models dropped by about 25 percent. This lower score tells you that you will need to supplement the simulation results with other evidence to inform research or decision-making, Mulugeta says.
7009 also weighs people’s qualifications and a model’s use history—two features that can help increase confidence. “People I work with in the biomedical community will say, ‘oh, this model doesn’t take into account this parameter, so it’s no good to me,’” Myers says. But then they look at the use history—what others have used it for—and that “tends to win people over pretty quickly,” he says.
7009 has helped boost confidence in the Digital Astronaut Project. The models are now being used for completely unanticipated problems, sometimes without the modeling team’s knowledge, Mulugeta says.
Another critical feature of 7009 is that it gives explicit weight to uncertainty and sensitivity analyses. “These two things are the keystone,” Myers says. “Everyone worries about validation. But even if you get your validation close to perfect matching, you’ll always have a cone of uncertainty that surrounds the data in your model.” For decision-making, you have to understand uncertainty and sensitivity, because these are what indicate how far off the model’s answer could be from the truth. “Uncertainty and sensitivity also imply how the model can be interpreted when used ‘near’ where it is validated, but not directly at the state in which it was validated,” Myers says. This is particularly helpful when decision-makers have to make a decision involving multiple scenarios, factors, and mission goals, he says.
Uncertainty and sensitivity analyses can also clarify a model’s weaknesses. For example, an allied team working on the Integrated Medical Model (IMM) modeled the risk of astronauts getting a hip fracture in space (a concern because astronauts experience accelerated bone loss) while wearing a cushioned spacesuit. Without data on how much a spacesuit actually reduces impact from a fall, Myers built his model using data on the cushioning effect of medical hip protectors (worn by the elderly to prevent fracture). But because the commercial systems they tested varied widely in terms of their abilities to dissipate and disperse an impact, the model produced large error bars around the risk estimates. So, NASA agreed to pay for a more appropriate dataset gathered using actual spacesuit material. This greatly reduced the uncertainty and improved the credibility score, Griffin says.
This is a good example of how, when done correctly, validation not only builds confidence in a model, but actually drives scientific research. “Validation tells you where to put your money and helps you make intelligent decisions about where to drive the science,” Lyster says.
He adds: “I have been saying that validation is the organizing principle for scientific computing. A bold assertion, but I think it is that important. It’s not about building a perfect model (there isn’t one) but rather about seeking to quantify how imperfect your model is. In doing that you also understand more about the underlying science.”