Publication: Statistical Learning Methods for Multi-Dataset Prediction
Abstract
It has become increasingly common in the biomedical sciences to encounter settings where multiple datasets are available to train statistical learning models. These opportunities arise when fitting prediction models using datasets from, for example, repositories that aggregate studies from different labs or study populations. It is tempting to assume that a prediction algorithm trained on datasets combining multiple sources or studies will be more robust to dataset shift, in which discrepancies between the distributions of training and test data reduce out-of-sample prediction performance. However, common approaches such as pooling datasets before model fitting can perform poorly when datasets are highly heterogeneous. The development of statistical methods that explicitly account for heterogeneity across data sources is therefore critical to training models that are replicable across populations. Here we propose statistical learning methods that leverage multiple datasets during model training to improve prediction performance.
In chapter 1, we introduce methods for domain generalization, in which we train a model on each of several datasets and create an aggregate ensemble prediction rule that is constructed to predict well on an unseen dataset, or “domain.” Specifically, we propose the “study strap ensemble,” which generalizes bagging to multi-dataset settings using a hierarchical resampling procedure. By pairing the method with covariate similarity-based ensemble weighting schemes, we extend it to multi-source domain adaptation problems, in which a sample of covariate observations from the target population is available at the time of model training. We prove that existing domain generalization ensembling strategies, as well as standard bagging procedures, are special cases of the study strap ensemble. We demonstrate the effectiveness of our method in a human neuroscience application and in simulations.
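As a rough illustration of hierarchical resampling for a multi-dataset ensemble, the following minimal sketch draws studies with replacement and then bootstraps observations within each drawn study. It is not the dissertation's implementation: the names `fit_slope`, `draw_pseudo_study`, `ensemble_predict`, and the `bag_size` parameter are hypothetical, the base learner is a toy slope-through-the-origin regression, and ensemble members are averaged with equal weights rather than with the similarity-based weights mentioned above.

```python
import random
from statistics import mean


def fit_slope(rows):
    """Least-squares slope through the origin for a list of (x, y) pairs."""
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return sxy / sxx if sxx else 0.0


def draw_pseudo_study(studies, bag_size, rng):
    """Two-level resampling: studies first, then observations within each."""
    # Level 1: draw `bag_size` studies with replacement (assumed parameter).
    bag = [rng.choice(studies) for _ in range(bag_size)]
    pseudo = []
    for study in bag:
        # Level 2: bootstrap the observations of each drawn study.
        pseudo.extend(rng.choice(study) for _ in range(len(study)))
    return pseudo


def ensemble_predict(studies, x_new, n_models=50, bag_size=2, seed=0):
    """Average the predictions of models fit to resampled pseudo-studies."""
    rng = random.Random(seed)
    slopes = [
        fit_slope(draw_pseudo_study(studies, bag_size, rng))
        for _ in range(n_models)
    ]
    return mean(slopes) * x_new


# Two toy studies with different underlying slopes (1 and 3).
study_a = [(x, 1.0 * x) for x in (1.0, 2.0, 3.0)]
study_b = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
prediction = ensemble_predict([study_a, study_b], x_new=2.0)
```

In this sketch, `bag_size=1` makes every ensemble member a bootstrap of a single study, while larger values produce pseudo-studies that mix observations from several studies.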
In chapter 2, we propose methods for multi-source transfer learning, a setting that arises when an analyst has limited data collected from a distribution of interest and wishes to leverage multiple auxiliary training datasets to improve prediction performance on new observations from the target distribution. We build on “multi-study ensembling,” a multi-dataset procedure that uses a two-stage “stacking” strategy: it first fits dataset-specific models and then aggregates them through a weighted average. However, stacking estimates the ensemble weights and the model parameters separately and therefore ignores the ensemble properties at the model-fitting stage, potentially resulting in a loss of efficiency. We therefore propose “optimal ensemble construction,” an “all-in-one” approach to multi-study stacking whereby we jointly estimate the ensemble weights and the parameters of each dataset-specific model via a unified optimization formulation. We establish that limiting cases of our approach yield existing methods such as multi-study stacking and pooling datasets before model fitting. We compare our approach to standard methods by applying it to a multi-country COVID-19 dataset for baseline mortality prediction.
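The two-stage stacking baseline described above can be sketched as follows. This is an illustrative simplification, not the proposed joint estimator: it assumes linear models fit by ordinary least squares, uses unconstrained least-squares weights (stacking weights are often constrained to be nonnegative), and the function names and simulated data are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Stage 1: fit one linear model per study by least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def stacking_weights(studies, betas):
    """Stage 2: regress the stacked outcomes on every model's predictions."""
    # Each row is an observation; column j holds model j's prediction for it.
    P = np.vstack(
        [np.column_stack([X @ b for b in betas]) for X, _ in studies]
    )
    y_all = np.concatenate([y for _, y in studies])
    return np.linalg.lstsq(P, y_all, rcond=None)[0]


def stacked_predict(X_new, betas, weights):
    """Weighted combination of the study-specific predictions."""
    return np.column_stack([X_new @ b for b in betas]) @ weights


# Simulated studies with different coefficients (assumed toy data).
rng = np.random.default_rng(0)
true_betas = [np.array([1.0, 0.5]), np.array([-0.5, 2.0])]
studies = []
for beta in true_betas:
    X = rng.normal(size=(40, 2))
    studies.append((X, X @ beta + 0.1 * rng.normal(size=40)))

betas = [fit_ols(X, y) for X, y in studies]
weights = stacking_weights(studies, betas)
```

Note that the coefficients in `betas` are fixed before `weights` is estimated, which is the separation between stages that a joint formulation would remove.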
In chapter 3, we propose a multi-task learning method in which we jointly train a collection of sparse linear models, each fit on a separate dataset, to improve the performance of each model on its respective domain or “task.” Specifically, we extend the best subset selection problem by placing a separate sparsity constraint on the regression parameters of each task, allowing the supports of the regression coefficients to differ across tasks. We propose a “support heterogeneity regularization” penalty that shrinks together the supports of the model coefficients across tasks, thereby encouraging models to share information during variable selection. We propose approaches based on first-order optimization and local combinatorial search to scale the method to high-dimensional settings. We showcase the effectiveness of our method on neuroscience and cancer genomics applications.
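To make the idea of penalizing differences in supports concrete, the sketch below computes one possible measure of support disagreement across tasks: the squared deviation of each task's binary support indicator from the across-task average. This specific form is an assumption chosen for illustration and may differ from the penalty defined in the dissertation; the function names and the tolerance `tol` are hypothetical.

```python
def supports(coefs, tol=1e-8):
    """Binary indicators of which coefficients are nonzero in each task."""
    return [[1 if abs(b) > tol else 0 for b in task] for task in coefs]


def support_heterogeneity(coefs, tol=1e-8):
    """Sum of squared deviations of each task's support from the mean support.

    The value is zero when all tasks select exactly the same variables and
    grows as the selected sets diverge (illustrative form, assumed here).
    """
    z = supports(coefs, tol)
    n_tasks, n_features = len(z), len(z[0])
    z_bar = [
        sum(z[k][j] for k in range(n_tasks)) / n_tasks
        for j in range(n_features)
    ]
    return sum(
        (z[k][j] - z_bar[j]) ** 2
        for k in range(n_tasks)
        for j in range(n_features)
    )


# Same support, different magnitudes: no penalty.
shared = support_heterogeneity([[1.5, 0.0, -2.0], [0.3, 0.0, 4.0]])
# Disjoint supports across two tasks: positive penalty.
disjoint = support_heterogeneity([[1.0, 0.0], [0.0, 2.0]])
```

Because the measure depends only on which coefficients are nonzero, tasks remain free to have different coefficient values on the variables they share.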