Publication: Panacea: Making the World’s Biomedical Information Computable to Develop Data Platforms for Machine Learning
Abstract
The marriage of healthcare and artificial intelligence has long been coveted, by computer scientists and medical researchers alike, as a grand challenge for advancing the frontier of health. Given the explosion of biomedical information produced over the past decade as medicine became digitized, there is a growing need to develop clinically accurate machine learning models across a range of applications. Yet mining this enormous ocean of data to build training sets has proven taxing for engineers, because the underlying data presents a number of computational problems. Namely, it is deeply unstructured, captured by fragmented sources, held in closed silos, subject to important privacy regulations, lacking consistent protocols for interoperability, and often dependent on trained medical experts for normalization and labeling. Despite the urgent need for model development in healthcare, the foundational data infrastructure severely limits the advancement of artificial intelligence technologies, at significant cost to patient care.
In this thesis, we contribute to the broader body of literature along three dimensions. First, we conduct an exhaustive literature review of the field: mapping out the key challenges restraining the adoption of machine learning in healthcare, identifying the technical root causes of those problems, and surveying seminal papers that propose technological solutions for enabling model development at scale. In charting this domain, we aim to provide a map for researchers new to healthcare machine learning and to highlight the open problems.
Building on these findings, we then present a novel contribution that accelerates the data preprocessing step: a “data preprocessing engine (DPE)” which automates the data normalization and labeling functions needed to curate training sets. We explore partitioning text-based healthcare data into its semantic and syntactic properties in order to build a system that transforms unstructured medical records into structured form at scale, and we review the limitations of our design. Notably, we still depend on humans in the loop to provide error correction and explainability for our classifier model. Next, we investigate the rapid data labeling function of our DPE, leveraging active learning and weak supervision via programmatic labeling to automate tagging. We find significant opportunities for combining the two systems in our DPE for image and text health data, but encounter limitations in measuring accuracy, mitigating bias propagation, and reducing dependence on experts, despite improvements in speed, cost, and privacy under our system.
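To make the labeling idea concrete, the following is a minimal sketch of programmatic labeling (weak supervision) paired with a simple active-learning-style review queue. It is not the DPE described in this thesis. The labeling functions, the keywords they match, the label names, and the agreement threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical keyword-based labeling functions (illustrative only).
def lf_mentions_diabetes(note):
    return "diabetes" if "diabetes" in note.lower() else ABSTAIN

def lf_mentions_insulin(note):
    return "diabetes" if "insulin" in note.lower() else ABSTAIN

def lf_negated_diabetes(note):
    text = note.lower()
    negated = "denies diabetes" in text or "no history of diabetes" in text
    return "no_diabetes" if negated else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_diabetes, lf_mentions_insulin, lf_negated_diabetes]

def weak_label(note):
    """Majority vote over non-abstaining labeling functions.

    Returns (label, agreement), where agreement is the share of votes that
    went to the winning label. Returns (None, 0.0) if every function abstains.
    """
    votes = []
    for lf in LABELING_FUNCTIONS:
        vote = lf(note)
        if vote is not ABSTAIN:
            votes.append(vote)
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def select_for_review(notes, threshold=0.75):
    """Route notes with low labeling-function agreement to a human expert.

    This stands in for an uncertainty-based active learning query step;
    the threshold is an assumed value, not one taken from the thesis.
    """
    return [note for note in notes if weak_label(note)[1] < threshold]

notes = [
    "Patient on insulin for type 2 diabetes.",
    "Patient denies diabetes.",
    "Routine check-up, no complaints.",
]
review_queue = select_for_review(notes)
```

In this sketch, the first note is labeled automatically because the labeling functions agree, while the second (conflicting votes) and third (no votes) are sent to the review queue. That mirrors the abstract's point that human experts remain in the loop for the cases the automated rules cannot settle.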
Finally, we suggest an alternate direction for the field: developing more tooling for a health data infrastructure platform that enables machine learning. We propose the notion of a computable health data lake and outline seven principles for its governance and design, as a guide to creating it in a manner that balances functionality with ethical responsibility for patient privacy. Researchers, patients, clinicians, policymakers, and developers alike can securely access the data lake as a tool to build models, conduct scientific experiments, and improve the overall quality of healthcare. To deliver this vision, we emphasize five technological breakthroughs that are uniquely enabled by our computable health data lake infrastructure, along with the open scientific problems that may be unlocked: improved genome-wide association studies, faster clinical trial matching, better clinical decision support tools, interconnected public health dashboards for policymakers, and personalized patient search engines.