Publication: Foundations for Genome-Scale Artificial Intelligence
Abstract
Modeling whole genomes (the complete sequence of base pairs, including both coding and non-coding regions organized within a three-dimensional architecture) represents a grand challenge in bioinformatics. Achieving this goal could transform our ability to understand complex polygenic diseases and develop treatments for both rare and common conditions. However, whole-genome modeling presents two fundamental challenges: the vast size of genomic data and the intricate complexity of genetic interactions. Artificial intelligence (AI) offers a powerful approach to processing large datasets and extracting meaningful patterns. To model whole genomes effectively, though, AI must be (1) generalizable, capable of predicting the effects of novel, unseen mutations; (2) capable of reasoning across sequences and biological scales, as genetic function emerges from the multi-level interactions of sequences; and (3) multimodal, integrating information from diverse sources, including scientific literature. While this dissertation does not yet achieve full-scale genome modeling, it introduces four novel AI methodologies (SPECTRA, Phyla, RLDIF, and Fleming), each addressing a component essential to this vision. Together, these contributions establish a conceptual and methodological framework that lays the groundwork for the future development of genome-scale artificial intelligence.
SPECTRA is a framework for evaluating model generalizability beyond conventional dataset splits. By systematically varying train-test similarity, SPECTRA reveals that existing biological foundation models fail to generalize to sequences dissimilar from their training data. In response, Phyla is designed to explicitly learn how to compare sequences by leveraging evolutionary relationships. Trained on protein phylogenies, Phyla not only excels at sequence comparison but also reconstructs phylogenetic trees with high accuracy, revealing both known and novel evolutionary insights.
Additionally, this dissertation presents RLDIF, a categorical conditional diffusion model for protein inverse folding, which leverages reinforcement learning to improve multi-scale modeling. By optimizing sequence design with respect to structural recovery, RLDIF achieves state-of-the-art performance in generating diverse sequences that accurately fold into a specified target structure. Beyond sequence information, AI models must incorporate knowledge from external sources to avoid rediscovering established principles. To address this, Fleming is developed as an AI agent for antibiotic design in tuberculosis, integrating scientific literature with machine learning tools to generate novel antibiotic candidates. The same multimodal approach can be extended to genome analysis, enabling AI to reason over diverse data modalities. Together, these innovations establish a foundation for genome-scale AI, providing key tools for understanding and reasoning over whole genomes.
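The abstract's idea of systematically varying train-test similarity can be illustrated with a small sketch. This is not the SPECTRA implementation, and the abstract does not specify how similarity is measured. The k-mer Jaccard similarity, the function names, and the toy sequences below are all assumptions chosen only to make the idea concrete: training sequences that are too similar to any test sequence are dropped, and the similarity ceiling is then lowered step by step so a model could be re-evaluated at each level.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """k-mer Jaccard similarity (an assumed stand-in for a real similarity measure)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def similarity_controlled_train_set(train, test, max_similarity, k=3):
    """Keep only train sequences whose similarity to every test sequence
    is at most max_similarity (hypothetical helper, not a SPECTRA API)."""
    return [
        s for s in train
        if all(jaccard(s, t, k) <= max_similarity for t in test)
    ]


def cross_split_similarity(train, test, k=3):
    """Highest similarity between any train sequence and any test sequence."""
    return max((jaccard(s, t, k) for s in train for t in test), default=0.0)


# Toy data, invented purely for illustration.
train_pool = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "CCCCAAAA"]
test_set = ["ACGTACGG"]

# Sweep the similarity ceiling from permissive to strict. In a real study,
# a model would be retrained and evaluated on test_set at each setting.
for ceiling in (1.0, 0.7, 0.5):
    kept = similarity_controlled_train_set(train_pool, test_set, ceiling)
    print(ceiling, len(kept), cross_split_similarity(kept, test_set))
```

Plotting test performance against the similarity ceiling, rather than reporting a single number from one split, is the kind of analysis the abstract describes. A drop in performance as the ceiling falls would suggest that a model relies on near-duplicates of its training data.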