Drilling for Insight: NIH Funding for Biocomputing
A support vector machine approach to cataloguing NIH expenditures
Philip Bourne’s recent appointment as Associate Director for Data Science at the National Institutes of Health (NIH) signals the growing importance of bioinformatics and biomedical computing in achieving the NIH mission. Yet the NIH Institutes and Centers don’t have reliable information about how much they spend on computational science. For fiscal year 2011, for example, NITRD (the Networking and Information Technology Research and Development program) reported that the NIH invested $551 million in computational science. But that report focused heavily on information technology and “high-end computing,” which does not completely or accurately cover the world of scientific computing, says Peter Lyster, PhD, program director in the Division of Biomedical Technology, Bioinformatics and Computational Biology at the NIH’s National Institute of General Medical Sciences (NIGMS).
“We need a more nuanced classification,” Lyster says. So a few years ago, he decided to create just that. “The main goal is to get a quantitative handle on what NIH invests in bioinformatics and biomedical computing so that we can convey this information to the public and do a good job of planning future expenditures,” he says.
It is impossible to manually review thousands of annual grants to determine which ones involve computational work. “It has to be done automatically, using an algorithm that’s clever enough to get around the fact that words like ‘model’ have different meanings in different areas of biomedical research,” Lyster says.
In collaboration with Calvin Johnson and William Lau at the NIH Center for Information Technology, Lyster developed and fine-tuned a support vector machine (SVM) approach to cataloguing NIH expenditures across subfields of bioinformatics and biomedical computing. They started by dividing computational science into six sub-areas aligned with NIH priorities: applications and modeling; informatics; high-throughput, data-intensive scientific methods (such as next-generation sequencing and proteomics); imaging and signal analysis; high-end computing; and software and productivity. Lyster then used his expert knowledge of the field to identify a training set of about 1,500 NIH projects across these areas. After training on biomedical concepts and key phrases extracted from that set, the SVM algorithm retrieved additional projects relevant to the six categories from the entire NIH research portfolio. Lyster reviewed a sampling of the results to confirm that the algorithm returned good hits.
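The general workflow described above, training a linear SVM on text features from curated example projects and then classifying the rest of the portfolio, can be sketched in a few lines. This is a simplified, single-label illustration, not the team’s actual system: the category names come from the article, but the toy “abstracts,” the TF-IDF features, and the scikit-learn pipeline are assumptions standing in for the real NIH data and concept-extraction step.

```python
# Minimal sketch of SVM-based grant classification (assumes scikit-learn).
# The category labels are from the article; the "abstracts" are invented toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set: short stand-ins for expert-curated project abstracts.
train_texts = [
    "relational database for gene ontology annotation and curation",
    "data integration pipeline with ontology-driven annotation",
    "MRI image segmentation and signal analysis of brain scans",
    "automated image registration for microscopy signal analysis",
]
train_labels = [
    "informatics", "informatics",
    "imaging and signal analysis", "imaging and signal analysis",
]

# TF-IDF text features feeding a linear SVM, a standard text-classification setup.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

# Classify an unseen "project" from the wider portfolio.
print(clf.predict(["segmentation of MRI images"])[0])
```

In a realistic version, each project could belong to several of the six overlapping categories, so one binary SVM per category (a one-vs-rest scheme) would fit the problem better than the single multi-class classifier shown here.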
The outcome of the team’s effort is summarized in the figure shown below. Because the categories overlap, the total investment in bioinformatics and biomedical computing cannot be calculated simply by adding up the columns. Furthermore, numerous NIH projects involve both computational and experimental work. But Lyster estimates that the total investment exceeds $900 million.
After further testing, validating, and hardening of the algorithm, Lyster hopes to make it publicly available. “It should prove useful to both the NIH and grant applicants who will be able to see at a glance which institutes support their area of research,” he says.