Federated and Transfer Learning with Multi-site Electronic Health Record Data

Liu, Molei

Citation

Liu, Molei. 2022. Federated and Transfer Learning with Multi-site Electronic Health Record Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Electronic health record (EHR) data have become a crucial resource for a growing number of data-driven biomedical studies, such as automated disease diagnosis and genotype-phenotype translation studies. Nevertheless, the power of EHR analysis is usually impeded by the limited size of local data and by the essential challenges in aggregating EHR data from multiple sources. The statistical challenges of multi-site EHR analysis are mainly due to covariate shift and model heterogeneity across sites; overlooking or improperly handling either can result in bias and poor transportability and generalizability. Meanwhile, high dimensionality and privacy concerns both arise in recent EHR studies and make these challenges harder to handle. In this dissertation, we develop novel methods to overcome the statistical and privacy challenges of multi-site EHR data aggregation. Our proposed methods facilitate efficient, transportable and generalizable analysis of large and noisy biomedical data from multiple sites.

In Chapter 1, we propose a novel approach for data-shielding high-dimensional integrative regression (SHIR). Our method protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. We show that SHIR is statistically more efficient than existing integrative regression approaches. Furthermore, the estimation error incurred by aggregating summary data is negligible compared to the statistically optimal rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches in extensive simulation settings. We further illustrate the utility of SHIR by deriving phenotyping algorithms for coronary artery disease using EHR data from multiple chronic disease cohorts.
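The idea of integrating sites through summary statistics alone can be illustrated with a much simpler setting than the one the chapter treats. The sketch below is not SHIR: it assumes a low-dimensional linear model with a single coefficient vector shared by all sites, and the function names `site_summary` and `integrate` are hypothetical. It only shows that, in this simplified case, combining per-site aggregates reproduces the estimate one would get by pooling individual-level data, even when covariate distributions differ across sites.

```python
import numpy as np

def site_summary(X, y):
    """Computed locally at a site; only these aggregates are shared."""
    return X.T @ X, X.T @ y

def integrate(summaries):
    """Combine per-site summaries into one least-squares estimate."""
    gram = sum(s[0] for s in summaries)   # sum of X^T X over sites
    xty = sum(s[1] for s in summaries)    # sum of X^T y over sites
    return np.linalg.solve(gram, xty)

# Simulated data (an assumption for illustration): three sites whose
# covariate means differ, mimicking covariate shift.
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.0, 0.5])
sites = []
for shift in (0.0, 0.5, -0.3):
    X = rng.normal(loc=shift, size=(200, 4))
    y = X @ beta + rng.normal(scale=0.1, size=200)
    sites.append((X, y))

# Estimate from summary statistics only.
beta_summary = integrate([site_summary(X, y) for X, y in sites])

# Reference estimate that requires sharing all individual-level data.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled = np.linalg.lstsq(X_all, y_all, rcond=None)[0]
```

In this toy case the two estimates coincide up to floating-point error. The setting described in the abstract is harder, since it involves high-dimensional covariates, site-specific model parameters and variable selection, none of which this sketch attempts.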

In Chapter 2, we propose a data-shielding integrative large-scale testing (DSILT) method for signal detection that allows between-study heterogeneity and does not require sharing individual-level data. Assuming that the underlying high-dimensional regression models differ across studies yet share similar support, the proposed method incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). We investigate theoretical comparisons of the new testing procedure with the ideal individual-level meta-analysis (ILMA) approach and with other distributed inference methods. Simulation studies demonstrate that the proposed testing procedure performs well in both controlling false discovery and attaining power. The new method is applied to a real example: detecting interaction effects of the genetic variants for statins and obesity on the risk of type II diabetes.
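For readers unfamiliar with FDR control, the following sketch shows the classical Benjamini-Hochberg step-up procedure. This is a generic, well-known method used here as a stand-in; the abstract does not say that DSILT uses it, and the chapter's own procedure is built on debiased integrative test statistics that are not reproduced here. The function name and the example p-values are assumptions for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (BH step-up rule)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Step-up: reject everything up to the largest index that passes.
        k = np.max(np.nonzero(passed)[0])
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values, one per covariate being tested.
example_pvals = [0.03, 0.035, 0.038, 0.039, 0.9]
rejected = benjamini_hochberg(example_pvals, alpha=0.05)
```

In this example only the fourth-smallest p-value falls below its own threshold, yet the step-up rule rejects the three smaller ones as well.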

Importance weighting, a natural and principled strategy to adjust for covariate shift, has been commonly used in the field of transfer learning. However, it is not robust to model misspecification or excessive estimation error. In Chapter 3, we propose an augmented transfer regression learning (ATReL) approach that introduces an imputation model for the targeted response and uses it to augment the importance weighting equation. With novel semi-non-parametric constructions and calibrated moment estimating equations for the two nuisance models, our ATReL method is (i) less prone to the curse of dimensionality than nonparametric approaches, and (ii) less prone to model misspecification than parametric approaches. We show that our ATReL estimator is root-n-consistent when at least one nuisance model is correctly specified, that estimation of the parametric part of the nuisance models achieves the parametric rate, and that the nonparametric components are rate doubly robust. We also propose ways to enhance the intrinsic efficiency of our estimator and to incorporate modern machine learning methods into our proposed framework.
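The general idea of augmenting importance weighting with an imputation model can be sketched on a simpler problem than the one ATReL addresses. The code below is not the ATReL estimator: it estimates only the mean response in a target population, uses plain parametric nuisance models, and all function names and simulated data are assumptions for illustration. The density ratio is estimated by a logistic regression of a source-versus-target indicator on the covariate, and the imputation model is a deliberately misspecified linear fit.

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, z, iters=25):
    """Newton-Raphson logistic regression of z on X (with intercept)."""
    A = add_intercept(X)
    theta = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-A @ theta))
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        theta = theta + np.linalg.solve(hess, A.T @ (z - p))
    return theta

def augmented_target_mean(Xs, ys, Xt):
    """Return (imputation-only, augmented) estimates of the target mean."""
    ns, nt = len(Xs), len(Xt)
    # Importance weights: odds of "target" vs "source", rescaled by ns/nt.
    z = np.concatenate([np.zeros(ns), np.ones(nt)])
    theta = fit_logistic(np.vstack([Xs, Xt]), z)
    w = np.exp(add_intercept(Xs) @ theta) * ns / nt
    # Imputation model: linear regression fitted on labelled source data.
    coef = np.linalg.lstsq(add_intercept(Xs), ys, rcond=None)[0]
    imputed = (add_intercept(Xt) @ coef).mean()
    # Augmentation: weighted average of source residuals corrects the bias.
    resid = ys - add_intercept(Xs) @ coef
    return imputed, imputed + np.sum(w * resid) / np.sum(w)

# Simulated covariate shift: source X ~ N(0, 1), target X ~ N(0.5, 1).
rng = np.random.default_rng(1)
n = 20000
Xs = rng.normal(0.0, 1.0, size=(n, 1))
Xt = rng.normal(0.5, 1.0, size=(n, 1))
ys = 1 + 2 * Xs[:, 0] + Xs[:, 0] ** 2 + rng.normal(scale=0.5, size=n)
true_target_mean = 1 + 2 * 0.5 + (1 + 0.5 ** 2)   # equals 3.25

imp_only, augmented = augmented_target_mean(Xs, ys, Xt)
```

Here the linear imputation model omits the quadratic term, so the imputation-only estimate is biased, while the weighting model is correctly specified and the augmented estimate lands close to the true target mean. If the imputation model were correct instead, the residual term would average to roughly zero whatever the weights, which is the double robustness the abstract refers to.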

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37371994
