Publication: Advancing Foundation Models for Medical Diagnosis and Biological Discovery
Abstract
The open-ended and complex nature of medicine and biology poses fundamental challenges for building robust artificial intelligence (AI) and machine learning (ML) systems. This dissertation addresses key obstacles to the development of effective foundation models for medical diagnosis and biological discovery through three complementary contributions. First, I introduce frameworks that systematically decompose intricate, open-ended problems into tractable sub-components, facilitating the creation of meaningful benchmarks that accurately reflect scientific and clinical advancements. Chapter 1 introduces CRAFT-MD, an evaluation framework that simulates multi-turn doctor–patient conversations to assess the diagnostic accuracy of large language models (LLMs) in conversational settings. Chapter 4 presents a novel approach for evaluating the large-scale experimental viability of biological hypotheses generated by LLMs. Second, I develop scalable methods for identifying, curating, and harmonizing the high-quality medical datasets that are crucial for training and evaluating foundation models. This is demonstrated in Chapter 3 through the construction of PanEndoAtlas, the largest endoscopic image dataset to date, organized in a clinically meaningful hierarchy, and the accompanying PanEndoX benchmark. Third, I describe the development of specialized models tailored to medical and biological applications. Chapter 2 introduces BEANIE, a non-parametric statistical method for precise nomination of biological hypotheses from single-cell RNA-seq data in patient oncology cohorts. Chapter 3 presents PanEndoFM, a foundation model for endoscopic diagnosis trained on over 10 million images covering the entire gastrointestinal (GI) tract. Together, these contributions advance the methodological, data-centric, and model-building foundations necessary for applying AI to complex medical and biological domains.