Publication: Tree-based ensembling strategies for handling heterogeneous data
Abstract
Adapting machine learning algorithms to better handle clustering or other partition structure within training data sets is important across a wide variety of biological applications. We first consider multi-study learning, a paradigm that uses multiple training studies, trains a classifier separately on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. We present novel weighting approaches for constructing tree-based ensemble learners in this setting, showing that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor and achieves performance superior to that of Random Forest. Next, we broaden the scope of the problem to consider the effect of ensembling forest-based learners trained on clusters within a single data set whose feature distribution is heterogeneous. We show that constructing ensembles of forests trained on estimated clusters, determined by algorithms such as k-means, results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We call this novel approach the Cross-Cluster Weighted Forest, and demonstrate its robustness and accuracy in simulations and on cancer molecular profiling and gene expression data sets that are naturally divisible into clusters. Finally, we provide theoretical support for these empirical observations by asymptotically analyzing linear least squares and random forest regressions under a linear model. In particular, for random forest regression under fixed-dimensional linear models, our bounds imply a strict benefit of our ensembling strategy over classic Random Forest. Code and supplementary material for all chapters are available at https://github.com/m-ramchandran.
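As a rough illustration of the idea of "weights rewarding members with better cross-study prediction ability", the sketch below scores each ensemble member on the studies it was not trained on and weights it by inverse error. This is a simplified stand-in, not the weighting schemes developed in the thesis: the inverse-error rule, the function names (cross_study_mse, inverse_error_weights, ensemble_predict), and the toy data are all assumptions made for illustration, and the members are fixed hypothetical predictors rather than fitted forests.

```python
from statistics import mean


def cross_study_mse(predict, studies, own_index):
    """Mean squared error of one member on every study except its own."""
    errors = []
    for i, (xs, ys) in enumerate(studies):
        if i == own_index:
            continue  # skip the study this member was (hypothetically) trained on
        errors.extend((predict(x) - y) ** 2 for x, y in zip(xs, ys))
    return mean(errors)


def inverse_error_weights(errors, eps=1e-12):
    """Assumed weighting rule: weight is proportional to 1 / error, normalised to sum to 1."""
    raw = [1.0 / (e + eps) for e in errors]
    total = sum(raw)
    return [r / total for r in raw]


def ensemble_predict(members, weights, x):
    """Weighted average of the members' predictions at x."""
    return sum(w * m(x) for w, m in zip(weights, members))


# Toy data: three "studies", each a pair (features, outcomes) with y = 2x.
studies = [
    ([0.0, 1.0], [0.0, 2.0]),
    ([2.0, 3.0], [4.0, 6.0]),
    ([4.0, 5.0], [8.0, 10.0]),
]

# Hypothetical members standing in for forests trained on studies 0, 1 and 2.
# Each has a different constant bias, so their cross-study errors differ.
members = [
    lambda x: 2 * x + 0.5,
    lambda x: 2 * x + 1.0,
    lambda x: 2 * x + 2.0,
]

errors = [cross_study_mse(m, studies, i) for i, m in enumerate(members)]
weights = inverse_error_weights(errors)
prediction = ensemble_predict(members, weights, 1.0)

print("cross-study MSE:", errors)
print("weights:", weights)
print("ensemble prediction at x = 1:", prediction)
```

The member with the smallest cross-study error receives the largest weight, so the ensemble prediction is pulled toward the predictors that transfer best to unseen studies. In a realistic setting each member would be a model fitted to one study or one estimated cluster, and the weights could instead be learned by a stacking regression.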