Building Causal Graphs from Medical Literature and Electronic Medical Records
Large repositories of medical data, such as Electronic Medical Record (EMR) data, are recognized as promising sources for knowledge discovery. Effective analysis of such repositories often necessitate a thorough understanding of dependencies in the data. For example, if the patient age is ignored, then one might wrongly conclude a causal relationship between cataract and hypertension. Such confounding variables are often identified by causal graphs, where variables are connected by causal relationships. Current approaches to automatically building such graphs are based on text analysis over medical literature; yet, the result is typically a large graph of low precision. There are statistical methods for constructing causal graphs from observational data, but they are less suitable for dealing with a large number of covariates, which is the case in EMR data. Consequently, confounding variables are often identified by medical domain experts via a manual, expensive, and time-consuming process.
We present a novel approach for automatically constructing causal graphs between medical conditions. The first part is a novel graph-based method to better capture causal relationships implied by medical literature, especially in the presence of multiple causal factors. Yet even after using these advanced text-analysis methods, the text data still contains many weak or uncertain causal connections. Therefore, we construct a second graph for these terms based on an EMR repository of over 1.5M patients. We combine the two graphs, leaving only edges that have both medical-text-based and observational evidence. We examine several strategies to carry out our approach, and compare the precision of the resulting graphs using medical experts. Our results show a significant improvement in the precision of any of our methods compared to the state of the art.