Publication:

Panacea: Making the World’s Biomedical Information Computable to Develop Data Platforms for Machine Learning

Date

2022-05-23

Citation

Patel, Zeel. 2022. Panacea: Making the World’s Biomedical Information Computable to Develop Data Platforms for Machine Learning. Bachelor's thesis, Harvard College.

Abstract

The marriage between healthcare and artificial intelligence systems has long been coveted, by computer scientists and medical researchers alike, as a grand challenge for advancing the frontier of health. Given the explosion of biomedical information produced over the past decade as medicine became digitized, there is a growing need in the field to develop clinically accurate machine learning models across a range of applications. Yet mining this enormous ocean of data to build model training sets has proven taxing for engineers, as the underlying data presents a number of computational problems. Namely, it is deeply unstructured, captured by fragmented sources, held in closed silos, subject to important privacy regulations, lacking consistent protocols for interoperability, and often dependent on trained medical experts for normalization and labeling. Despite the urgent need for model development in healthcare, the foundational data infrastructure severely limits advances in artificial intelligence technologies, at significant cost to patient care.

In this thesis, we contribute to the broader body of literature in three important dimensions. First, we conduct an exhaustive literature review of the field: mapping out the key challenges restraining the adoption of machine learning in healthcare, identifying the technical root causes of those problems, and surveying seminal papers that propose technological solutions for enabling model development at scale. In charting this domain, we aim to develop a map for researchers new to the field of healthcare machine learning and to highlight the open problems.

Building on these findings, we then present a novel contribution to the field that accelerates the data preprocessing step: a “data preprocessing engine (DPE)” which automates the data normalization and labeling functions needed to curate training sets. We explore partitioning text-based healthcare data into its semantic and syntactic properties in order to create a system for transforming unstructured medical records into structured form at scale, and we review the limitations of our design. Notably, we still depend on humans in the loop to provide error correction and explainability for our classifier model. Next, we investigate the rapid data labeling function of our DPE, leveraging active learning and weak supervision via programmatic labeling to automate tagging. We find significant opportunities for combining the two approaches in our DPE for image and text health data, but encounter limitations in measuring accuracy, mitigating bias propagation, and reducing expert dependence, despite improvements in speed, cost, and privacy under our system.

Finally, we suggest an alternate direction for the field: developing more tooling for a health data infrastructure platform that enables machine learning. We propose the notion of a computable health data lake and outline seven principles for its governance and design, as a guide to creating it in a manner that balances functionality with ethical responsibility for patient privacy. Researchers, patients, clinicians, policymakers, and developers alike can securely access the data lake as a tool to build models, conduct scientific experiments, and improve the overall quality of healthcare. To deliver this vision, we emphasize five technological breakthroughs that are uniquely enabled by our computable health data lake infrastructure, along with the open scientific problems they may unlock: improved genome-wide association studies, faster clinical trial matching, better clinical decision support tools, interconnected public health dashboards for policymakers, and personalized patient search engines.

Keywords

Computable, Data Labeling, Data Normalization, Health Data Lake Infrastructure, Healthcare Data, Machine Learning, Artificial intelligence, Health sciences, Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
