Curriculum And Big Data: Revamping To Open The Bottleneck
What kinds of training opportunities are needed?
Things change quickly in the fields of computational biology and bioinformatics. “New technology comes along and whoa! You need to design a new course!” says Claudia Neuhauser, PhD, director of graduate studies, biomedical informatics and computational biology at the University of Minnesota, Rochester.
Today, she says, biomedical “big data” are putting pressure on bioinformatics curricula. Be it genomic or molecular data, imaging data, or electronic medical records, these data add a new level of complexity that requires a shift in training.
“We need new research and we need re-vamped training programs,” says Rob Kass, PhD, professor of statistics and machine learning at Carnegie Mellon University.
In July 2013, the National Institutes of Health (NIH) hosted a workshop to discuss possible training initiatives to help people take full advantage of big data. The workshop, which was part of the Big Data to Knowledge (BD2K) initiative, generated plenty of ideas (some of which are described below) that may soon find their way into a grant program.
But even without new grants, the bottleneck in big data training requires academic institutions as well as society at large to ponder a difficult question: What kinds of training opportunities are needed to ensure that researchers can extract knowledge from biomedical big data?
As one might suspect, an individual’s background is a huge factor in determining the kinds of training needed as well as its duration (short- vs. long-term). The specific research question being addressed also affects curriculum. Because genomics research using big data will be quite different from imaging or electronic medical records (EMR) research using big data, it’s hard to imagine a generic “Big Data for Biomedicine” curriculum.
The NIH workshop participants therefore discussed many different types of training opportunities that could help open up the bottleneck, says Karen Bandeen-Roche, PhD, professor of biostatistics at Johns Hopkins’ Bloomberg School of Public Health, who led the workshop discussion together with Zak Kohane, MD, PhD, director of the informatics program at Children’s Hospital, Boston, and co-director of the Center for Biomedical Informatics at Harvard Medical School. All agree that there is no one-size-fits-all solution. So here’s a sampling of potential programs, some of which are already being piloted at various institutions around the country while others may require funding from BD2K or other sources.
Build on a Data Science Foundation
For researchers who already consider themselves data scientists in biomedicine, the leap to big data isn’t a huge stretch, says Kass. “People already good at data analysis have an easy transition to bigger datasets because the principles haven’t changed,” he notes. Still, the growing size of biomedical data sets means students need an appreciation for computer systems and software engineering as well as algorithms and statistical methods, not to mention all the issues associated with data warehousing, standardization, access, security, and confidentiality, Kass says.
Carnegie Mellon is already building more and more references to big data into its regular courses in data analysis. And one of Kass’s colleagues, William Cohen, has developed a course in machine learning with large datasets. “It’s a class that specifically talks about how to scale things up,” he says. Courses of this type should be more widespread, Kass says.
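A minimal sketch of what “scaling things up” can mean in practice is out-of-core (streaming) learning: instead of loading an entire dataset into memory, the model sees one small batch at a time, updates its parameters, and discards the batch. The example below is illustrative only, not course material from Cohen’s class; it trains a tiny logistic-regression classifier by mini-batch gradient descent on simulated data.

```python
import math
import random

def stream_batches(n_batches, batch_size, seed=0):
    # Simulate a dataset too large to hold in memory by yielding
    # one small batch at a time: label is 1 when x1 + x2 > 0.
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
            y = 1 if x1 + x2 > 0 else 0
            batch.append(((x1, x2), y))
        yield batch

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_streaming(batches, lr=0.5):
    # Mini-batch stochastic gradient descent for logistic regression:
    # each batch is seen once, used to update the weights, then dropped,
    # so memory use is constant no matter how large the stream is.
    w = [0.0, 0.0]
    b = 0.0
    for batch in batches:
        gw, gb = [0.0, 0.0], 0.0
        for (x1, x2), y in batch:
            p = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = p - y
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        m = len(batch)
        w[0] -= lr * gw[0] / m
        w[1] -= lr * gw[1] / m
        b -= lr * gb / m
    return w, b

w, b = fit_streaming(stream_batches(n_batches=200, batch_size=50))

# Evaluate on a fresh batch the model has never seen.
test = next(stream_batches(1, 1000, seed=99))
correct = sum(
    1 for (x1, x2), y in test
    if (sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5) == (y == 1)
)
print(correct / len(test))
```

The same idea underlies production-scale tools; the principles of the fit don’t change with dataset size, only the access pattern does.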
Use Case Histories at the Cutting Edge
Kohane would like to see training provided “right at the cutting edge of where the experts are.” He supports the use of case studies that involve problems created by the size of the data set and the limits on computational resources and bandwidth. As he sees it, learning happens best when there is a problem that biomedical domain experts believe is important, and they have the methodological people working on it with them. “Then it’s not make-work, and it’s not tangential,” he said during the workshop.
Train Team Members
People working at the interface of big data and biomedicine will inescapably work in teams of individuals with diverse sets of skills, Bandeen-Roche notes. The question then becomes, she says, “How do you create a community well-trained to be team members?”
In Bandeen-Roche’s own training program on the epidemiology and biostatistics of aging, people from different fields gain expertise in their own areas but are also trained in shared activities—common curriculum and shared research projects—where they learn to work together and communicate across disciplines. A similar approach might work for big data, she says.
Train for Real World Data
Colin Hill, PhD, chairman and CEO of GNS Healthcare, a big data analytics company, says there are really two buckets of healthcare big data: the bioinformatics/genomics side for drug discovery and development; and then the real-world data side dealing with mash-ups of EMRs, claims, and genomics data. “Our biggest growth (and the biggest growth in the field),” Hill says, “is in the real-world data side.” For jobs in this arena, he says, current training programs in computational biology or bioinformatics lack the necessary epidemiological training while current epidemiological training “typically doesn’t cover the new math of causal inference/Bayesian network inference and gives little exposure to claims data and EMR data.” Curriculum developers, he says, should take this job market into account.
Take It to the Users
Bioinformatics and computational biology programs are a diverse lot, Bandeen-Roche says. “Some might approximate what is needed to handle big data while others don’t,” she says. And only a few are dedicated to training the users of big data rather than PhDs. “It’s important to think about training the huge number of people needed to do the standard day-to-day stuff—interpreting and explaining the data and translating it to clinical practice,” she says. “At the end of the day, it somehow has to help patients.”
Neuhauser’s program at the University of Minnesota focuses heavily on educating the local workforce at the Mayo Clinic, where big data is already in use in the labs. Mayo employees have come to her program seeking new skills. Because many have biology backgrounds, she established a sequence of three online quantitative courses (an introductory computer science course, as well as separate algorithms and programming courses) that enables them to enter the graduate program in biomedical informatics and computational biology. “We’re preparing people for something that would have been closed off to them a few years ago,” Neuhauser says.
Brush Up Professionals’ Skill-Sets
Several workshop participants noted their own need for big data training. “We all need to be retooled, PhD students as well as the rest of us,” said Elaine Larson, PhD, associate dean for research at Columbia University School of Nursing. Current professionals could benefit from short-term training delivered online or through workshops, boot camps, and summer programs. Fellowships could help medical doctors learn big data informatics. And team challenges and competitions can provide training with the extra dose of reality that only a true problem-oriented experience can provide.
Get Creative—MOOCs and Modules
We live in a new era of education where MOOCs (massive open online courses) allow the possibility for scaffolding courses and making them broadly available. Andrew Laine, PhD, professor of biomedical engineering at Columbia University, noted during the workshop that the NIH could require grantees to contribute modules to a broadly shared resource. “If it can be done once and done really well, it can be a commodity to the community,” he said.
As Bandeen-Roche notes, “Extracting knowledge from big data is more difficult than one might have hoped.” Hopefully, efforts to design new training programs will allow researchers to meet the challenge.
Big Data Programs Outside Biomedicine
If programs focused on biomedicine don’t offer enough big data training, researchers can look for opportunities elsewhere. For example, several organizations already offer novel training programs in big data.
Berkeley’s Simons Institute is currently running a four-month program called “Theoretical Foundations of Big Data Analysis.” Although not specifically designed for biomedical researchers, the program covers big data territory that a biologist would find useful, such as succinct data representations; parallel and distributed algorithms; and big data privacy. And several of the instructors have experience using big data in biomedicine.
Insight Data Science also offers a 6-week post-doc training fellowship to help “bridge the gap between academia and a career in data science.” Jenelle Bray, PhD, who recently completed her post-doc at Stanford using machine learning to study protein structures, entered the Insight program in September. She saw it as the best way to get the training she needed to lead a data science team—in biomedicine if possible. “Sometimes the newest technologies take a while to permeate academia,” she says. “You can learn them more quickly going into industry.”