Publication: Towards robust hypothesis generation in the human microbiome using data-driven discovery
Abstract
We have known for centuries that the microbial communities inhabiting our bodies are linked to human phenotypes, including disease. Recently, the massive scale of DNA sequencing output has made the microbiome field reliant on data-driven “discovery”: seeking biological hypotheses from associations in observational data. However, identifying reproducible and interpretable relationships between host phenotype (e.g., disease) and microbial features (e.g., species, pathways, genes) has proven difficult. Here, we propose an efficient path through the complexities of data-driven discovery towards identifying biologically meaningful and clinically impactful host-microbe associations.
First, to understand the “data” of the human microbiome, we aimed to quantify the genetic diversity of different body sites. We compared the gene content of these sites to one another and to non-human ecologies. We show that at least 50% of the genes in a given sample are ecology-specific, and we identify environmentally “conserved” genes across different human body sites, organisms, and ecologies.
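As a rough illustration of what an "ecology-specific" fraction could mean, the sketch below treats each sample as a set of gene identifiers and counts the genes that appear in no other ecology. The function name, the gene labels, and the set-based definition are assumptions made for illustration; the abstract does not specify how the comparison was actually performed.

```python
def ecology_specific_fraction(sample_genes, other_ecology_genes):
    """Fraction of a sample's genes that appear in none of the other ecologies.

    Hypothetical helper: assumes genes are compared by exact identifier.
    """
    sample = set(sample_genes)
    if not sample:
        return 0.0
    # Union of every gene observed in any other ecology.
    seen_elsewhere = set().union(*other_ecology_genes)
    return len(sample - seen_elsewhere) / len(sample)


# Toy data with made-up gene labels.
gut = {"geneA", "geneB", "geneC", "geneD"}
others = [{"geneA", "geneX"}, {"geneB", "geneY"}]
print(ecology_specific_fraction(gut, others))  # 0.5
```

Here two of the four gut genes are found elsewhere, so half of the sample counts as ecology-specific under this definition.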
We next aimed to construct strategies for accurately and reproducibly modeling host-microbe interactions. We first built a software tool for scalable, low-cost cloud computing. Then, using a range of machine learning approaches, we compared the ability of microbial taxonomies, pathways, and genes to predict host phenotype, finding that the gene “data type” far outperformed other representations of the microbiome. We next built software for massive-scale sensitivity analyses, which we used to probe the robustness of reported associations, showing that two-thirds of published human gut microbiome-phenotype associations are not immediately reproducible.
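One common form of sensitivity analysis, assumed here purely for illustration, refits the same association under every possible set of adjustment covariates and checks whether the estimated effect keeps its sign. The sketch below does this with ordinary least squares on simulated data; all function and variable names are hypothetical, and it is not a description of the software mentioned above.

```python
from itertools import combinations
import random


def ols_coefficients(columns, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(u * v for u, v in zip(columns[i], y))] for i in range(p)]
    for i in range(p):
        # Partial pivoting, then Gauss-Jordan elimination of column i.
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [x - factor * z for x, z in zip(a[r], a[i])]
    return [a[i][p] / a[i][i] for i in range(p)]


def association_estimates(y, feature, covariates):
    """Coefficient of `feature` under every subset of adjustment covariates."""
    intercept = [1.0] * len(y)
    names = sorted(covariates)
    estimates = {}
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            columns = [intercept, feature] + [covariates[n] for n in subset]
            estimates[subset] = ols_coefficients(columns, y)[1]
    return estimates


def is_sign_consistent(estimates):
    """True if the estimate has the same nonzero sign in every specification."""
    values = list(estimates.values())
    return all(v > 0 for v in values) or all(v < 0 for v in values)


# Simulated data: the phenotype depends on a microbial feature and on age.
rng = random.Random(0)
n = 200
age = [rng.gauss(0, 1) for _ in range(n)]
bmi = [rng.gauss(0, 1) for _ in range(n)]
abundance = [rng.gauss(0, 1) for _ in range(n)]
phenotype = [2.0 * g + 0.5 * a + rng.gauss(0, 1) for g, a in zip(abundance, age)]

estimates = association_estimates(phenotype, abundance, {"age": age, "bmi": bmi})
print(len(estimates), is_sign_consistent(estimates))
```

With two candidate covariates there are four model specifications. An association whose sign flips across specifications would be flagged as not robust under this criterion.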
Finally, we demonstrated the potential impact of data science in the microbiome through two examples. First, we compared the ability of the human microbiome versus human genetic common variants to predict disease, finding that the former outperforms the latter. Second, we identified robust, gene-level indicators of seven different phenotypes, suggesting that these “gene-level microbiome architectures” could serve as cross-disease diagnostics and tools for testable hypothesis generation.
Overall, by integrating across statistical and algorithmic methods, we provide a roadmap for data-driven and efficient discovery in the human microbiome. We hope that our efforts will enable faster translation of microbiome data science into both clinical and experimental settings.