A Software Tool Developed at Columbia Enhances Genetic Diagnoses
A new tool developed by scientists at Columbia efficiently extracts data about the physical characteristics of a patient’s condition from electronic health records and automatically identifies genes associated with diseases.
Even in the age of next-generation sequencing, the search for disease-causing genes is expensive and time-consuming. A new software tool developed by scientists at Columbia University, in collaboration with colleagues at the Mayo Clinic and the Children’s Hospital of Philadelphia, automates the generation of a list of candidate genes for a given condition, promising to improve accuracy and save time and money for genomic diagnoses.
The tool, called the EHR-Phenolyzer, is a high-throughput system for extracting data from electronic health records (EHR) and correlating that information with genomic data to infer causal genes that underlie diseases.
The researchers describe their work in “Deep Phenotyping on Electronic Health Records Facilitates Genetic Diagnosis by Clinical Exomes,” published online today in the American Journal of Human Genetics (AJHG). The paper will also appear in the July 5, 2018, print edition of the journal.
Whereas traditional genomic analyses generate vast amounts of sequence data from all 20,000 or so human genes, the EHR-Phenolyzer uses additional patient information to home in on 1,000 or fewer candidate genes potentially associated with a patient’s condition. What’s more, the EHR-Phenolyzer automates the process by reusing data that are already in a patient’s EHR.
“This study looks at how we can leverage existing data resources other than genomic sequences to improve the yield of genetic diagnoses,” says study co-leader, Dr. Chunhua Weng, an associate professor of biomedical informatics at Columbia University Vagelos College of Physicians and Surgeons and a member of the Data Science Institute, Columbia University. “This is the first study to use information about the disease phenotype—that is, the physical manifestations of the disease—captured in the medical records by healthcare providers to facilitate genomic diagnoses,” adds Weng.
Specifically, the EHR-Phenolyzer extracts information about the disease phenotype and integrates that with knowledge of associated genotypes to generate a list of candidate genes known to be linked to that particular medical condition. From there, a patient’s genomic sequence can be screened to determine which of those gene sequences contain anomalies known to cause the condition.
The EHR-Phenolyzer is not the first tool to use phenotypic data to prioritize genes associated with patient phenotypes. But it is the first to automatically reuse the rich and nuanced phenotypes extracted from EHRs, which it does by leveraging text-analysis techniques known to data scientists as natural language processing (NLP).
The current study builds on previous work by Kai Wang, who is also a lead author on the paper. Wang earlier developed the Phenolyzer, a computational tool that, like the EHR-Phenolyzer, uses phenotypic information to identify genes implicated in a patient’s condition. The main difference is that the Phenolyzer requires manual specification of phenotypic information, a labor-intensive process that does not scale. Related tools, including the Phenomizer, Phen-Gen, and Phevor, similarly require manual input of phenotypic data. The EHR-Phenolyzer overcomes that bottleneck.
“Unlike the Phenolyzer, which requires human experts to manually input phenotypic terms, the EHR-Phenolyzer created an automated workflow that allows direct extraction of standardized phenotype terms that are further used to expedite the clinical interpretation of genomic sequencing data,” says Wang, who recently moved from Columbia to the Children’s Hospital of Philadelphia and the University of Pennsylvania, where he is an Associate Professor of Pathology and Laboratory Medicine.
“The EHR-Phenolyzer allows us to automate the association of phenotypes stored in patients’ medical records with genomic information,” adds Wang, “where those associations are known, to improve the efficiency and effectiveness of phenotype-driven genetic diagnosis in the era of genomic medicine.”
High Throughput via Natural Language Processing
The tool’s high throughput is attributable to collaboration between informatics and data science.
“This project serves as a perfect example on how the disparate fields of bioinformatics and medical informatics can be bridged to facilitate the integration of clinical data and genomic knowledge for precision patient care,” says Weng. Indeed, extracting the relevant phenotypic information from text in the EHR is no easy feat. Nor is it easy to integrate this phenotypic information with the genomic knowledge, which is continually being updated.
Consider the problem: patient information is entered into the EHR as a narrative based on a patient’s history when they are seen at a hospital or in a doctor’s office. The narrative might say, for example, “The patient was profoundly deaf and lost the ability to speak. He also had loss of bladder control.” (See Figure 1.) To efficiently search for causal genes for this condition, a computer algorithm must first distinguish the clinically relevant phenotypic information, “deaf” and “loss of bladder control,” from irrelevant text in the rest of the narrative.
Moreover, the same phenotype might be described in different ways by different clinicians. Another clinician might use the expression “incontinent,” instead of “loss of bladder control.” To solve that problem, a standard terminology of phenotypic descriptors, the Human Phenotype Ontology (HPO), was created. The algorithm in the EHR-Phenolyzer was built to recognize both expressions—“incontinent” and “loss of bladder control”—as equivalent and then normalize them using the standard concept in the HPO, which in this case is “urinary incontinence.”
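The normalization step described above can be illustrated with a minimal sketch. It is not the EHR-Phenolyzer's actual NLP method, which the article does not detail; it is a plain dictionary lookup that shows only the idea of mapping different clinical phrasings to one standard concept. The `SYNONYMS` table and the `extract_phenotypes` function are hypothetical, and the "hearing impairment" label is an assumed placeholder for the concept behind "deaf."

```python
import re

# Hypothetical, hand-written synonym table for illustration only.
# The "urinary incontinence" mapping follows the article's example;
# "hearing impairment" is an assumed placeholder concept label.
SYNONYMS = {
    "loss of bladder control": "urinary incontinence",
    "incontinent": "urinary incontinence",
    "deaf": "hearing impairment",
}


def extract_phenotypes(note):
    """Return normalized concept labels found in a note, in order of first mention."""
    text = note.lower()
    hits = []
    for phrase, concept in SYNONYMS.items():
        # Whole-word match so that "deaf" does not match inside other words.
        match = re.search(r"\b" + re.escape(phrase) + r"\b", text)
        if match:
            hits.append((match.start(), concept))

    # Order by position in the text and drop repeated concepts.
    ordered = []
    for _, concept in sorted(hits):
        if concept not in ordered:
            ordered.append(concept)
    return ordered


note = (
    "The patient was profoundly deaf and lost the ability to speak. "
    "He also had loss of bladder control."
)
print(extract_phenotypes(note))
```

A note that says "incontinent" and a note that says "loss of bladder control" both yield the same concept, which is the property the standard terminology is meant to provide. A real system would also have to handle negation, misspellings, and context, none of which this sketch attempts.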
Once the entire phenotypic profile has been extracted from the EHR, the EHR-Phenolyzer generates a list of candidate genes by first correlating the phenotypes with the genes known to be associated with them and then ranking those genes according to the strength of the correlations.
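The ranking idea can be sketched in a few lines. The article does not describe how the tool actually scores genes, so the approach here (summing per-phenotype association weights) is an assumption chosen for simplicity. The gene names, weights, `ASSOCIATIONS` table, and `rank_genes` function are all hypothetical placeholders, not real gene-phenotype data.

```python
from collections import defaultdict

# Hypothetical association strengths: phenotype -> {gene: weight}.
# Gene names and weights are placeholders, not real data.
ASSOCIATIONS = {
    "urinary incontinence": {"GENE_A": 0.75, "GENE_B": 0.25},
    "hearing impairment": {"GENE_A": 0.5, "GENE_C": 0.5},
}


def rank_genes(phenotypes, associations=ASSOCIATIONS, top_n=1000):
    """Score each gene by summing its weights across the patient's phenotypes."""
    scores = defaultdict(float)
    for phenotype in phenotypes:
        for gene, weight in associations.get(phenotype, {}).items():
            scores[gene] += weight
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


for gene, score in rank_genes(["hearing impairment", "urinary incontinence"]):
    print(gene, score)
```

In this toy case the gene linked to both phenotypes rises to the top of the list, which reflects the intuition that a gene consistent with more of the patient's profile is a stronger candidate.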
To assess the accuracy of the method, Wang, Weng, and their colleagues at the Children’s Hospital of Philadelphia and the Mayo Clinic tested the EHR-Phenolyzer on data from patients for whom a definitive diagnosis had already been made. Taking care to ensure patient privacy, the team probed the records of four different groups of patients. For the purposes of the study, the team included only patients with diseases attributable to mutations in a single gene. The team found that, in general, there was a 75 percent chance that the causative gene could be found within a list of 1,000 candidates generated by the EHR-Phenolyzer.
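An evaluation of this kind can be expressed as a top-k hit rate: the fraction of already-diagnosed patients whose known causal gene appears among the first k ranked candidates. The sketch below assumes that framing. The `top_k_hit_rate` function is hypothetical, and the four toy cases use placeholder gene names rather than anything from the study.

```python
def top_k_hit_rate(cases, k=1000):
    """Fraction of cases whose known causal gene is in the top k candidates.

    cases: iterable of (ranked_gene_list, known_causal_gene) pairs.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("at least one case is required")
    hits = sum(1 for ranked, causal in cases if causal in ranked[:k])
    return hits / len(cases)


# Toy cases with placeholder genes; k is set to 2 only to keep the lists short.
toy_cases = [
    (["GENE_A", "GENE_B", "GENE_C"], "GENE_A"),  # hit: rank 1
    (["GENE_B", "GENE_C", "GENE_A"], "GENE_A"),  # miss: rank 3
    (["GENE_C", "GENE_A", "GENE_B"], "GENE_A"),  # hit: rank 2
    (["GENE_B", "GENE_C"], "GENE_D"),            # miss: not listed
]
print(top_k_hit_rate(toy_cases, k=2))
```

With k set to 1,000 and real patient cohorts in place of the toy cases, the same calculation would yield a figure comparable in kind to the 75 percent reported by the team.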
In future work, the team will further improve the NLP method’s accuracy and explore whether additional data, including self-reported phenotypic data from the patients themselves, might help refine the list of candidate genes and narrow the search. And at some point, the team might train their collective sights on diseases caused by multiple genes.
And Weng anticipates the day when the EHR-Phenolyzer might be used by labs to more effectively diagnose millions of patients.
“I can see how this software tool will be valuable to diagnostic labs,” she says, “since it will allow them to arrive at a more precise diagnosis more efficiently and cost-effectively.”