Publication:

Advancing Foundation Models for Medical Diagnosis and Biological Discovery

Date

2025-11-20

The Harvard community has made this article openly available.

Citation

Johri, Shreya. 2025. Advancing Foundation Models for Medical Diagnosis and Biological Discovery. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

The open-ended and complex nature of medicine and biology poses fundamental challenges for building robust artificial intelligence (AI) and machine learning (ML) systems. This dissertation addresses key obstacles to the development of effective foundation models for medical diagnosis and biological discovery through three complementary contributions. First, I introduce frameworks designed to systematically decompose intricate and open-ended problems into tractable sub-components, facilitating the creation of meaningful benchmarks that accurately reflect scientific and clinical advancements. Chapter 1 introduces CRAFT-MD, an evaluation framework that simulates multi-turn doctor–patient conversations to assess the diagnostic accuracy of large language models (LLMs) in conversational settings. Chapter 4 presents a novel approach for evaluating the large-scale experimental viability of biological hypotheses generated by LLMs. Second, I develop scalable methods for identifying, curating, and harmonizing high-quality medical datasets crucial for training and evaluating foundation models. This is demonstrated in Chapter 3 through the construction of PanEndoAtlas, the largest endoscopic image dataset to date, organized in a clinically meaningful hierarchy, and the accompanying PanEndoX benchmark. Third, I describe the development of specialized models tailored to medical and biological applications. Chapter 2 introduces BEANIE, a non-parametric statistical method for precise nomination of biological hypotheses from single-cell RNA-seq data in patient oncology cohorts. Chapter 3 presents PanEndoFM, a foundation model trained for endoscopic diagnosis on over 10 million images covering the entire GI tract. Together, these contributions advance the methodological, data-centric, and model-building foundations necessary for complex medical and biological domains.

Keywords

Artificial intelligence, Bioinformatics, Biology

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
