Publication:

Statistical Learning Methods for Multi-Dataset Prediction

Date

2022-05-12

Citation

Loewinger, Gabriel Conan. 2022. Statistical Learning Methods for Multi-Dataset Prediction. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

It has become increasingly common in the biomedical sciences to encounter settings where multiple datasets are available to train statistical learning models. These opportunities arise when fitting prediction models using datasets from, for example, repositories that aggregate studies from different labs or study populations. By training models on datasets that combine multiple sources or studies, it is tempting to assume the resulting prediction algorithm will be more robust to the problem of dataset shift, in which discrepancies in the distribution of training and test data can reduce out-of-sample prediction performance. However, common approaches such as pooling datasets before model fitting can perform poorly when datasets are highly heterogeneous. As such, the development of statistical methods that can explicitly account for heterogeneity across data sources is critical to training models that are replicable across populations. Here we propose statistical learning methods that leverage multiple datasets during model training to improve prediction performance.

In chapter 1, we introduce methods for domain generalization, in which we train a model on each of several datasets and create an aggregate ensemble prediction rule that is constructed to predict well on an unseen dataset, or “domain.” Specifically, we propose the “study strap ensemble,” which generalizes bagging for multi-dataset settings, using a hierarchical resampling procedure. By pairing the method with covariate similarity-based ensemble weighting schemes, we extend the method to multi-source domain adaptation problems, in which a sample of observations of the covariates from the target population is available at the time of model training. We prove existing domain generalization ensembling strategies, as well as standard bagging procedures, are special cases of the study strap ensemble. We demonstrate the effectiveness of our method in a human neuroscience application and in simulations.
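To make the idea of hierarchical resampling across datasets concrete, here is a minimal Python sketch. It is not the dissertation's study strap algorithm: it assumes a simple two-level bootstrap (resample datasets, then resample observations within each selected dataset), ordinary least squares as the base learner, and equal ensemble weights. The function names (`fit_ols`, `hierarchical_resample_ensemble`, `ensemble_predict`) are hypothetical and introduced only for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def hierarchical_resample_ensemble(studies, n_replicates=50, seed=0):
    """studies: list of (X, y) pairs, one per training dataset.
    Returns one fitted coefficient vector per resampled pseudo-dataset."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_replicates):
        # Level 1: resample whole datasets with replacement.
        picked = rng.integers(0, len(studies), size=len(studies))
        Xs, ys = [], []
        for k in picked:
            X, y = studies[k]
            # Level 2: resample observations within the selected dataset.
            rows = rng.integers(0, len(y), size=len(y))
            Xs.append(X[rows])
            ys.append(y[rows])
        # Fit the base learner on the pooled pseudo-dataset.
        models.append(fit_ols(np.vstack(Xs), np.concatenate(ys)))
    return models

def ensemble_predict(models, X_new):
    """Equal-weight average of the ensemble members' predictions."""
    return np.mean([predict_ols(c, X_new) for c in models], axis=0)
```

In this sketch every replicate receives the same weight. The covariate similarity-based weighting mentioned in the abstract would replace the plain average in `ensemble_predict` with weights that depend on how closely each pseudo-dataset resembles the target covariates.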

In chapter 2, we propose methods for multi-source transfer learning, a setting that arises when an analyst has limited data collected from a distribution of interest, and they wish to leverage multiple auxiliary training datasets to improve prediction performance on new observations from the target distribution. We build on “multi-study ensembling,” a multi-dataset procedure that uses a two-stage “stacking” strategy that first fits dataset-specific models and then aggregates the ensemble's models through a weighted average. Stacking estimates ensemble weights and model parameters separately, however, and therefore ignores the ensemble properties at the model-fitting stage, potentially resulting in a loss of efficiency. We therefore propose “optimal ensemble construction,” an “all-in-one” approach to multi-study stacking whereby we jointly estimate ensemble weights as well as parameters associated with each dataset-specific model via a unified optimization formulation. We establish that limiting cases of our approach yield existing methods such as multi-study stacking and pooling datasets before model fitting. We compare our approach to standard methods by applying it to a multi-country COVID-19 dataset for baseline mortality prediction.
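The two-stage stacking baseline described above can be sketched as follows. This is an illustrative simplification, not the dissertation's implementation: it assumes linear base models, estimates the ensemble weights by unconstrained least squares on the pooled training data (stacking weights are often constrained to be nonnegative, which is omitted here), and uses hypothetical function names. The jointly optimized "optimal ensemble construction" approach is not shown.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predictions from coefficients returned by fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def two_stage_stacking(studies):
    """studies: list of (X, y) pairs, one per training dataset."""
    # Stage 1: fit one model per dataset, ignoring the ensemble.
    models = [fit_linear(X, y) for X, y in studies]
    # Stage 2: with the models held fixed, regress the pooled outcomes
    # on the matrix of every model's predictions to get ensemble weights.
    X_all = np.vstack([X for X, _ in studies])
    y_all = np.concatenate([y for _, y in studies])
    P = np.column_stack([predict_linear(c, X_all) for c in models])
    weights, *_ = np.linalg.lstsq(P, y_all, rcond=None)
    return models, weights

def stacked_predict(models, weights, X_new):
    """Weighted combination of the dataset-specific predictions."""
    P = np.column_stack([predict_linear(c, X_new) for c in models])
    return P @ weights
```

Because stage 2 treats the stage 1 models as fixed, the model parameters never see the ensemble weights. That separation is the inefficiency the joint formulation in chapter 2 is meant to address.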

In chapter 3, we propose a multi-task learning method, in which we jointly train a collection of sparse linear models, each fit on a separate dataset, to improve performance of each model on its respective domain or “task.” Specifically, we propose methods to extend the best subset selection problem by placing a separate sparsity constraint on the regression parameters from each task, allowing the supports of the regression coefficients to differ across tasks. We propose a “support heterogeneity regularization” penalty that shrinks together the supports of the model coefficients across tasks, thereby encouraging models to share information during variable selection. We propose approaches based on first-order optimization and local combinatorial search in order to scale the method to high-dimensional settings. We showcase the effectiveness of our method on neuroscience and cancer genomics applications.
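As a rough illustration of the kind of problem described above, one way to write a multi-task best subset selection problem with a penalty on differences between supports is given below. The notation is assumed here and is not taken from the abstract: $K$ tasks with data $(X_k, y_k)$ and sample sizes $n_k$, coefficients $\beta_k \in \mathbb{R}^p$, binary support indicators $z_k$, a per-task sparsity level $s$, and a tuning parameter $\lambda \ge 0$. The dissertation's exact formulation may differ.

```latex
\begin{aligned}
\min_{\{\beta_k,\, z_k\}_{k=1}^{K}} \quad
  & \sum_{k=1}^{K} \frac{1}{n_k} \lVert y_k - X_k \beta_k \rVert_2^2
    + \lambda \sum_{k=1}^{K} \lVert z_k - \bar{z} \rVert_2^2 \\
\text{subject to} \quad
  & z_k \in \{0,1\}^p, \qquad
    \beta_{k,j}\,(1 - z_{k,j}) = 0 \ \text{ for all } j, \qquad
    \sum_{j=1}^{p} z_{k,j} \le s, \qquad k = 1, \dots, K, \\
  & \bar{z} = \frac{1}{K} \sum_{k=1}^{K} z_k .
\end{aligned}
```

With $\lambda = 0$ the problem separates into $K$ independent best subset selection problems, one per task. As $\lambda$ grows, the indicators $z_k$ are pulled toward their average $\bar{z}$, so the tasks are increasingly encouraged to select the same variables while their coefficient values remain free to differ.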

Keywords

Domain Adaptation, Domain Generalization, Machine Learning, Multi-Task Learning, Statistical Learning, Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service