Learning Inductive Representations of Biomedical Data

Date

2020-09-15

Citation

Finlayson, Samuel Gregory. 2020. Learning Inductive Representations of Biomedical Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Representation learning with neural networks has catalyzed rapid progress in biomedical pattern recognition. This progress, however, has generally been limited to domains where data are abundant, richly structured, and stable. In contrast, much of biomedicine is marked by limited and poorly structured data and by highly dynamic deployment environments. In particular, many of the most compelling problem areas in biomedicine involve the "long tails" of rare diseases and rare events. In this thesis, I confront the challenge of learning data representations whose utility can extend into dynamic and data-poor biomedical domains. I do so through three primary projects:

First, I present a novel method for representation learning with subgraphs. This method, called Subgraph Neural Networks (Sub-GNN), learns disentangled representations of subgraph structure, neighborhood, and position through property-aware routing channels. The work is motivated by the need for methods that can better situate patient phenotypes (encoded as subgraphs) within the broader body of biomedical knowledge, which could allow for better diagnostic generalization to novel disorders involving previously unseen phenotypes. Subgraph neural networks provide a principled framework for doing so: they leverage the relational inductive biases of the underlying knowledge graph while still respecting subgraphs as independent entities.

Next, I present an approach to learning coordinated representations of small molecules and their associated transcriptional signatures. This approach extends a popular paradigm for drug development (known as connectivity mapping) to operate inductively, making predictions involving drugs that have not previously been experimentally assayed. I benchmark the performance of this approach, studying the circumstances under which it can and cannot achieve strong performance.

Finally, I present an analysis of the clinical challenges posed by dataset shift, the phenomenon in which the input data to a deployed machine learning algorithm become mismatched with its training data. After introducing the general problem of dataset shift, I turn to a special case, adversarial examples, which reflect the worst-case generalization conditions for a machine learning system. I then build three high-accuracy machine learning systems and test their representational robustness, constructing adversarial examples that cause their accuracy to drop to 0% on data that are imperceptibly different from the training data. I discuss the implications of these findings for clinical machine learning and offer specific regulatory recommendations.

I conclude my thesis with lessons learned from these projects, and provide an extensive appendix with three additional smaller-scale projects that branched off of my research.

Keywords

Machine Learning, Medicine, Neural Networks, Quantitative Biology, Representation Learning, Artificial Intelligence, Computer Science, Bioinformatics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
