Open Source Tools for Parsing Clinical Records
Researchers at the Mayo Clinic and IBM have each built software pipelines for extracting useful information from the unstructured notes in patient charts, such as physicians’ notes and pathology reports. And they’ve now partnered to make these best-of-breed natural language annotators freely available through the Open Health Natural Language Processing (OHNLP) Consortium (http://ohnlp.org).
“While each of us [IBM and Mayo] contributed a whole pipeline, the more important contribution was that we were starting to feed the shelves with annotator widgets that other people could take and assemble in different and interesting ways,” says Christopher Chute, MD, DrPH, Mayo Clinic bioinformatics expert and senior consultant on the OHNLP Consortium project. If someone wants to create a complex natural language processing (NLP) pipeline to address a particular research question, “maybe they write the tough little piece that goes in the middle, but 90 percent of the work is already written,” he says.
Until recently, researchers could access the valuable data within medical records only by hiring medical professionals to read charts and abstract the information onto a case report form. General architectures for building NLP pipelines and automating this extraction process aren’t new, and some pieces of the OHNLP pipelines, such as standardized vocabularies, have existed for some time. But the Mayo and IBM pipelines weave many pieces together to accomplish real-world tasks. “And that’s what we so desperately need,” says Rebecca Crowley, MD, associate professor of biomedical informatics at the University of Pittsburgh, who has developed a separate open source pipeline (caTIES) for extracting information from pathology reports.
In an NLP pipeline, unstructured text goes through a series of annotators that work step-by-step toward identifying meaningful entities or phrases and the relationships between them. For instance, the first annotator distinguishes letters from punctuation and other marks, the next identifies words, and the next identifies parts of speech. Ultimately this might lead to an annotator that could assign meaning to phrases or entities.
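The cascade described above can be sketched in a few lines. This is a toy illustration, not actual cTAKES or medKAT code: each stage reads the text plus the annotations produced by earlier stages and adds its own layer, ending with a simple lexicon-based entity recognizer.

```python
import re

def tokenize(text, annotations):
    # Early stage: separate words from punctuation and other marks.
    annotations["tokens"] = re.findall(r"\w+|[^\w\s]", text)
    return annotations

def tag_parts_of_speech(text, annotations):
    # Next stage: label each token. A real pipeline would use a trained
    # tagger; here a hand-picked list stands in as an illustration.
    noun_like = {"patient", "tumor", "aspirin"}
    annotations["pos"] = [
        ("NOUN" if tok.lower() in noun_like else "OTHER")
        for tok in annotations["tokens"]
    ]
    return annotations

def recognize_entities(text, annotations):
    # Final stage: map tokens to domain entities (symptoms, diseases, drugs)
    # via a tiny stand-in lexicon.
    drug_lexicon = {"aspirin"}
    annotations["entities"] = [
        (tok, "DRUG") for tok in annotations["tokens"]
        if tok.lower() in drug_lexicon
    ]
    return annotations

def run_pipeline(text, stages):
    # Run the annotator cascade: each stage enriches the shared annotations.
    annotations = {}
    for stage in stages:
        annotations = stage(text, annotations)
    return annotations

result = run_pipeline(
    "Patient was given aspirin.",
    [tokenize, tag_parts_of_speech, recognize_entities],
)
# result["entities"] → [("aspirin", "DRUG")]
```

The key design point mirrored here is that annotators are interchangeable widgets: swapping one stage for a better implementation, or inserting a new stage mid-cascade, doesn’t require rewriting the rest of the pipeline.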
For the most part, Chute says, Mayo’s pipeline (cTAKES) stops at the stage of entity recognition—identifying specific symptoms, diseases, and drugs. Once you have the entities or phrases, he says, “then you can start doing all kinds of fun things either with subsequent annotators or as a post-NLP process.” IBM’s medKAT pipeline also includes annotators that identify relations between named entities. For example, a pathology record might mention multiple sites and sizes of tumors, but medKAT identifies relationships among those pieces of information in order to determine, for example, the size of the primary tumor.
In the long run, Chute says, the OHNLP will be most valuable for building a community of people who use shared tools. IBM’s manager of medical text and image analysis, Anni Coden, PhD, who leads work on the IBM pipeline (medKAT), agrees. “We [IBM and Mayo] decided to put this out there in open source because it takes a whole community to make progress in this field,” says Coden. “If we put our efforts together we may be able to solve it.”
Crowley says the long-term value of NLP pipelines is clear: “So much of the data we want to work with is available only in text,” she says. “Data mining, identifying new hypotheses, translational research and clinical trials can all benefit greatly from being able to access data in text.”