Publication: Combining Foundation Models in Computational Pathology: Unlocking Multi-Representational Insights
Abstract
Foundation models have revolutionized computational pathology, enabling impressive results on tasks involving the classification of gigapixel whole-slide images (WSIs). However, no single foundation model consistently excels across all clinical scenarios. Given that these models differ substantially in their self-supervised training strategies, architectures, and data distributions, each model captures distinct morphological and structural features from histopathological slides. Leveraging multiple foundation models through patch-level feature fusion offers a promising approach to integrate their complementary strengths, potentially improving model robustness and generalization.
In this work, we present what is, to the best of our knowledge, one of the first and most comprehensive investigations of patch-level feature fusion using multiple foundation models. We systematically evaluate the fusion of three state-of-the-art pathology foundation models (UNI, Virchow, and GigaPath) across 11 established pathology tasks, 8 distinct fusion strategies, all possible encoder combinations, and various latent-space dimensionalities to thoroughly assess robustness. Since clinicians and researchers typically cannot know in advance which foundation model will perform best on unseen data, we adopt the average single-model performance as a practically relevant baseline for evaluating fusion methods. Our analysis demonstrates that a novel MLP-based fusion operator surpasses this baseline in 132 of 176 experiments (75%) across four multiple-instance learning (MIL) frameworks.
We further investigate factors influencing fusion effectiveness. Learned, parametric fusion operators typically outperform the simpler, non-parametric methods predominantly studied in prior work, and careful tuning of latent dimensionality can yield further performance gains, particularly for challenging multi-class subtyping tasks. Compared with conventional ensembles, which aggregate final predictions, deep patch-level fusion is especially beneficial for multi-class diagnostic scenarios, whereas simpler ensembles may suffice for binary molecular biomarker tasks. Overall, this thesis provides valuable methodological insights and demonstrates the potential of multi-encoder patch-level fusion as a practical strategy for improving computational pathology systems.
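As a rough illustration of what patch-level fusion with a learned operator could look like, the sketch below concatenates the per-patch embeddings from several encoders and projects them to a shared latent space with a small MLP. The abstract does not specify the architecture of the MLP-based operator, so the two-layer ReLU design, the function names (`mlp_fuse`, `average_single_model_baseline`), the toy embedding sizes, and the random untrained weights are all assumptions made for illustration, not the method evaluated in this work.

```python
import numpy as np

def mlp_fuse(features, w1, b1, w2, b2):
    """Fuse per-encoder patch embeddings (hypothetical two-layer MLP operator)."""
    # Concatenate along the feature axis: (n_patches, sum of encoder dims).
    x = np.concatenate(features, axis=-1)
    # Hidden layer with ReLU activation.
    h = np.maximum(x @ w1 + b1, 0.0)
    # Linear projection to the shared latent dimensionality.
    return h @ w2 + b2

def average_single_model_baseline(scores):
    """Mean of the single-encoder scores, used as the comparison baseline."""
    return float(np.mean(scores))

rng = np.random.default_rng(0)
n_patches, hidden_dim, latent_dim = 5, 32, 16

# Toy embedding sizes; real encoder dimensions are much larger.
encoder_dims = {"UNI": 8, "Virchow": 12, "GigaPath": 10}
features = [rng.normal(size=(n_patches, d)) for d in encoder_dims.values()]

# Untrained, randomly initialised weights (in practice these would be
# learned jointly with the downstream MIL aggregator).
in_dim = sum(encoder_dims.values())
w1 = rng.normal(scale=0.1, size=(in_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, latent_dim))
b2 = np.zeros(latent_dim)

fused = mlp_fuse(features, w1, b1, w2, b2)
print(fused.shape)

# Made-up single-model scores, only to show how the baseline is formed.
baseline = average_single_model_baseline([0.80, 0.90, 0.85])
print(baseline)
```

The fused array has one row per patch and can be passed to any MIL aggregator in place of a single encoder's embeddings. A fused model would then count as an improvement when its score exceeds `baseline`.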