Publication:

Tree-based ensembling strategies for handling heterogeneous data

Date

2022-04-04

Citation

Ramchandran, Maya. 2022. Tree-based ensembling strategies for handling heterogeneous data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Adapting machine learning algorithms to better handle clustering or other partition structure within training data sets is important across a wide variety of biological applications. We first consider multi-study learning, a paradigm that uses multiple training studies, trains a separate classifier on each, and forms an ensemble whose weights reward members with better cross-study prediction ability. We present novel weighting approaches for constructing tree-based ensemble learners in this setting, showing that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor and outperforms Random Forest. Next, we broaden the scope of the problem to consider the effect of ensembling forest-based learners trained on clusters within a single data set whose feature distribution is heterogeneous. We show that constructing ensembles of forests trained on estimated clusters, determined by algorithms such as k-means, results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We call this approach the Cross-Cluster Weighted Forest, and demonstrate its robustness and accuracy across simulations and on cancer molecular profiling and gene expression data sets that are naturally divisible into clusters. Finally, we provide theoretical support for these empirical observations by asymptotically analyzing linear least squares and random forest regressions under a linear model. In particular, for random forest regression under fixed-dimensional linear models, our bounds imply a strict benefit of our ensembling strategy over classic Random Forest. Code and supplementary material for all chapters are available at https://github.com/m-ramchandran.
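
The cluster-then-ensemble idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names are hypothetical, scikit-learn is assumed as the library, and the inverse-error weights are a simple stand-in for whatever weighting scheme the dissertation actually uses. Each forest is trained on one k-means cluster and scored on the remaining clusters, so forests that generalize better across clusters receive more weight.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def fit_cluster_forests(X, y, n_clusters=3, seed=0):
    """Hypothetical helper: fit one random forest per estimated k-means cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    forests = []
    for k in range(n_clusters):
        mask = labels == k
        forest = RandomForestRegressor(n_estimators=50, random_state=seed)
        forest.fit(X[mask], y[mask])
        forests.append(forest)
    return forests, labels


def cross_cluster_weights(forests, X, y, labels):
    """Assumed weighting: inverse mean squared error on the clusters a forest
    was NOT trained on, normalized so the weights sum to one."""
    inverse_errors = []
    for k, forest in enumerate(forests):
        held_out = labels != k
        mse = np.mean((forest.predict(X[held_out]) - y[held_out]) ** 2)
        inverse_errors.append(1.0 / (mse + 1e-12))
    weights = np.array(inverse_errors)
    return weights / weights.sum()


def predict_ensemble(forests, weights, X_new):
    """Weighted average of the per-cluster forests' predictions."""
    predictions = np.column_stack([forest.predict(X_new) for forest in forests])
    return predictions @ weights


if __name__ == "__main__":
    # Synthetic data with three shifted feature clusters (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(center, 1.0, size=(40, 4)) for center in (-5, 0, 5)])
    y = X[:, 0] + rng.normal(0, 0.1, size=X.shape[0])
    forests, labels = fit_cluster_forests(X, y, n_clusters=3)
    weights = cross_cluster_weights(forests, X, y, labels)
    print(weights, predict_ensemble(forests, weights, X[:5]))
```

The same structure would apply to the multi-study setting by replacing the estimated cluster labels with known study labels; the choice of weighting rule is the part most likely to differ from the dissertation's own code.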

Keywords

Cancer genomic data, Clustering, Ensemble learning, Machine Learning, Multiple studies, Random Forest, Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
