Publication:

Towards robust hypothesis generation in the human microbiome using data-driven discovery

Date

2021-11-16

The Harvard community has made this article openly available.

Citation

Tierney, Braden Thomas. 2021. Towards robust hypothesis generation in the human microbiome using data-driven discovery. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

We have known for centuries that the microbial communities inhabiting our bodies are linked to human phenotypes, including disease. Recently, the massive scale of DNA sequencing output has made the microbiome field reliant on data-driven “discovery”: seeking biological hypotheses from associations in observational data. However, identifying reproducible and interpretable relationships between host phenotype (e.g., disease) and microbial features (e.g., species, pathways, genes) has proven difficult. Here, we propose an efficient path through the complexities of data-driven discovery towards identifying biologically meaningful and clinically impactful host-microbe associations.

First, to understand the “data” of the human microbiome, we aimed to quantify the genetic diversity of different body sites. We compared their gene content to each other as well as to non-human ecologies. We show that at least 50% of the genes in a given sample are ecology-specific, and we identify environmentally “conserved” genes across different human body sites, organisms, and ecologies.
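As a rough illustration of what “ecology-specific” and “conserved” mean here, the comparison can be sketched with set operations. This is a minimal sketch, not the dissertation's method: the function names and gene identifiers are hypothetical, and it assumes each gene has already been assigned a shared identifier across samples.

```python
def ecology_specific_fraction(sample_genes, other_ecologies):
    """Fraction of a sample's genes that appear in no other ecology.

    sample_genes: iterable of gene identifiers (hypothetical IDs).
    other_ecologies: dict mapping ecology name -> set of gene identifiers.
    """
    sample = set(sample_genes)
    if not sample:
        return 0.0
    # Union of every gene observed in any other ecology.
    seen_elsewhere = set().union(*other_ecologies.values())
    return len(sample - seen_elsewhere) / len(sample)


def conserved_genes(ecologies):
    """Genes present in every ecology (requires at least one ecology)."""
    return set.intersection(*(set(genes) for genes in ecologies.values()))


# Toy data, invented purely for illustration.
gut = {"g1", "g2", "g3", "g4"}
others = {"oral": {"g1", "g5"}, "soil": {"g1", "g6"}}

fraction = ecology_specific_fraction(gut, others)
shared = conserved_genes({"gut": gut, **others})
print(fraction, shared)
```

In this toy example three of the four gut genes are found nowhere else, and only `g1` is shared by all three ecologies.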

We next aimed to construct strategies for accurately and reproducibly modeling host-microbe interactions. We first built a software tool for scalable, low-cost cloud computing. Then, using a range of machine learning approaches, we compared the ability of microbial taxonomies, pathways, and genes to predict host phenotype, finding that the gene “data type” far outperformed other representations of the microbiome. We next built software for massive-scale sensitivity analyses, which we used to probe the robustness of reported associations, showing that two-thirds of published human gut microbiome-phenotype associations are not immediately reproducible.
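The sensitivity analyses mentioned above relate to the “vibration of effects” idea listed in the keywords: refit the same association under many plausible model specifications and see how much the estimate moves. The following is a minimal sketch of that idea, not the software built in the dissertation. The function name, the ordinary least squares model, and the simulated variables (`age`, `bmi`, `abundance`, `phenotype`) are all assumptions made for illustration.

```python
import itertools

import numpy as np


def vibration_of_effects(y, x, adjusters):
    """Estimate the coefficient of x on y under every subset of adjusters.

    y, x: 1-D arrays of equal length.
    adjusters: dict mapping a covariate name -> 1-D array.
    Returns a dict mapping each subset (a tuple of names) -> estimate.
    """
    n = len(y)
    names = list(adjusters)
    estimates = {}
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            # Design matrix: intercept, feature of interest, chosen adjusters.
            cols = [np.ones(n), x] + [adjusters[name] for name in subset]
            design = np.column_stack(cols)
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            estimates[subset] = beta[1]
    return estimates


# Simulated data: the phenotype depends on age only, and the microbial
# feature is also driven by age, so the unadjusted association is spurious.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
bmi = rng.normal(size=n)
abundance = 0.8 * age + rng.normal(size=n)
phenotype = 1.0 * age + rng.normal(size=n)

estimates = vibration_of_effects(phenotype, abundance, {"age": age, "bmi": bmi})
values = np.array(list(estimates.values()))
print(f"models fitted: {len(estimates)}")
print(f"estimate range: {values.min():.3f} to {values.max():.3f}")
```

With two candidate adjusters the sketch fits four models. The estimate is clearly positive when age is left out and close to zero once age is included, which is the kind of specification dependence a sensitivity analysis is meant to expose.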

Finally, we demonstrated the potential impact of data science in the microbiome through two examples. First, we compared the ability of the human microbiome versus human genetic common variants to predict disease, finding that the former outperforms the latter. Second, we identified robust, gene-level indicators of seven different phenotypes, indicating the potential for these “gene-level microbiome architectures” to serve as cross-disease diagnostics and tools for testable hypothesis generation.

Overall, by integrating across statistical and algorithmic methods, we provide a roadmap for data-driven and efficient discovery in the human microbiome. We hope that our efforts will enable faster translation of microbiome data science into both clinical and experimental settings.

Keywords

gene level analysis, human microbiome, metagenomics, vibration of effects, Bioinformatics, Microbiology

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
