Publication:

Statistical Methods for Outcome Measurement Error Correction, And Multi-study Prediction And Causal Inference under Study Heterogeneity

Date

2024-09-03

Citation

Wu, Yujie. 2024. Statistical Methods for Outcome Measurement Error Correction, And Multi-study Prediction And Causal Inference under Study Heterogeneity. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

The availability of electronic health records (EHR) enables researchers to improve their understanding of disease etiology and prevention by identifying the health effects of risk factors and the causal effects of treatments, and outcome predictions facilitated by machine learning tools can tailor health decision making to the individual level. However, EHR data are often subject to measurement error and misclassification in exposures and outcomes, which lead to biased estimates of health effects; the bias is typically toward the null, causing false negative discoveries. In addition, accurate outcome prediction is often hindered by study-to-study variation that stems from differences in the populations studied and heterogeneous covariate-outcome relationships, causing a prediction model to have poor out-of-study prediction performance. Existing methods fail to address these issues, limiting researchers’ ability to make use of EHR data for disease prevention and treatment. This dissertation develops statistical methods to address outcome measurement error in epidemiological studies and study heterogeneity in multi-study prediction and causal inference.

In Chapter 1, we propose statistical methods based on weighted estimating equations to correct for bias in effect estimates caused by outcome measurement error in the setting of time-to-event data with multiple failure types. We discuss the consistency and asymptotic normality of the proposed estimators and derive their asymptotic variances. This work is motivated by the Conservation of Hearing Study, which aims to evaluate risk factors for hearing loss in an ongoing cohort study, the Nurses’ Health Study II. As an illustrative example, we apply the proposed method to adjust for measurement error in self-reported hearing outcomes when estimating the associations of tinnitus with hearing loss subtypes.

In Chapter 2, we develop methods to analyze clustered competing risk data when event types are only available in a training dataset and are missing in the main study. We propose to estimate the exposure effects through the cause-specific proportional hazards frailty model, where random effects are introduced into the model to account for the within-cluster correlation. We propose a weighted penalized partial likelihood method where the weights represent the probabilities of the occurrence of events, and the weights can be obtained by fitting a classification model for the event types on the training dataset. Alternatively, we propose an imputation approach in which the missing event types are imputed based on the predictions from the classification model. We derive the analytical variances and evaluate the finite sample properties of our methods in an extensive simulation study. As an illustrative example, we apply our methods to estimate the associations between tinnitus and metabolic, sensory and metabolic+sensory hearing loss in the Conservation of Hearing Study Audiology Assessment Arm.
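
The contrast between the weighted and the imputation approaches can be illustrated with a small Python sketch. It is not the dissertation's estimator: it does not fit a frailty model or a penalized partial likelihood, and it only shows how predicted event-type probabilities could act as soft weights versus hard labels. The probabilities, the function names, and the choice to impute the most probable type are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_event_counts(type_probs):
    """Soft approach: each event contributes its predicted probability
    to every event type, instead of a single hard label."""
    counts = defaultdict(float)
    for probs in type_probs:
        for event_type, p in probs.items():
            counts[event_type] += p
    return dict(counts)


def impute_event_types(type_probs):
    """Hard approach (assumption): assign each event its most probable type.
    Ties are resolved in favor of the first type listed."""
    return [max(probs, key=probs.get) for probs in type_probs]


# Hypothetical classifier output for three events whose type is missing
# in the main study. The numbers are made up for illustration.
predicted = [
    {"metabolic": 0.7, "sensory": 0.2, "metabolic+sensory": 0.1},
    {"metabolic": 0.1, "sensory": 0.6, "metabolic+sensory": 0.3},
    {"metabolic": 0.4, "sensory": 0.4, "metabolic+sensory": 0.2},
]

soft_counts = weighted_event_counts(predicted)
hard_labels = impute_event_types(predicted)
```

In this toy setting the soft counts keep the classifier's uncertainty about each event, whereas the hard labels discard it; that difference is one reason the two approaches may call for different variance calculations.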

In Chapter 3, we propose methods for multi-source domain adaptation (MSDA) for regression problems that leverage information from more than one source domain to make predictions in a target domain, where different domains may have different data distributions. First, we extend a flexible single-source DA algorithm for classification through outcome coarsening to enable its application to regression problems. We then augment our single-source DA algorithm for regression with ensemble learning to achieve multi-source DA. We consider three learning paradigms in the ensemble algorithm, which linearly combines the target-adapted learners trained on each source domain: (i) a multi-source stacking algorithm to obtain the ensemble weights; (ii) a similarity-based weighting in which the weights reflect the quality of DA of each target-adapted learner; and (iii) a combination of the stacking and similarity weights. We illustrate the performance of our algorithms with simulations and a data application where the goal is to predict high-density lipoprotein (HDL) cholesterol levels using the gut microbiome. We observe a consistent improvement in prediction performance of our multi-source DA algorithm over routinely used methods in all these scenarios.
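
The linear combination of target-adapted learners can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dissertation's algorithm: the weights are taken as given rather than estimated, and combining stacking and similarity weights by a renormalized elementwise product is a hypothetical choice, since the abstract does not say how the two are combined. All function names are illustrative.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def combine_weights(stacking, similarity):
    """Hypothetical combination rule: elementwise product, renormalized."""
    return normalize([a * b for a, b in zip(stacking, similarity)])


def ensemble_predict(predictions, weights):
    """Linearly combine per-learner predictions.

    predictions: list of K lists, one per target-adapted learner,
                 each holding n predictions for the same target samples.
    weights:     list of K ensemble weights.
    """
    n = len(predictions[0])
    return [
        sum(w * learner[i] for w, learner in zip(weights, predictions))
        for i in range(n)
    ]


# Two hypothetical target-adapted learners predicting for two target samples.
learner_predictions = [[1.0, 2.0], [3.0, 4.0]]
weights = combine_weights([0.5, 0.5], [0.2, 0.8])
combined = ensemble_predict(learner_predictions, weights)
```

Here the learner with the higher similarity weight dominates the combined prediction, which mirrors the idea that a source domain adapted more successfully to the target should contribute more.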

In Chapter 4, we propose methods to leverage information from multiple clinical trials to facilitate the estimation of study-specific causal effects. Specifically, we propose a Clustered Cross-study Treatment Effect Estimator (CCTEE) based on a two-stage approach. In the first stage, we propose a matching algorithm to quantify the transportability of treatment effect estimates across studies. In the second stage, we use a Dirichlet Process Mixture Model to cluster the studies based on their pairwise transportability and estimate the study-specific causal effects by borrowing information from studies of the clusters. We evaluate the performance of our method through extensive simulations. As an illustrative example, we apply our method to estimate the study-specific average treatment effects of Paliperidone ER for treating schizophrenia in five randomized clinical trials.
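
The idea of borrowing information within clusters of studies can be illustrated with a deliberately simplified Python sketch. It is a stand-in, not CCTEE: the cluster labels are assumed to be given (no matching algorithm or Dirichlet Process Mixture Model is fitted), and the pooling rule is ordinary inverse-variance weighting, which the abstract does not state is used. The estimates, variances, and labels below are made up.

```python
def pooled_estimate(estimates, variances):
    """Inverse-variance weighted average and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, estimates)) / total
    return estimate, 1.0 / total


def cluster_pooled(estimates, variances, labels):
    """Pool study-level estimates separately within each cluster label."""
    pooled = {}
    for label in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        pooled[label] = pooled_estimate(
            [estimates[i] for i in idx],
            [variances[i] for i in idx],
        )
    return pooled


# Hypothetical treatment effect estimates from five studies, with assumed
# variances and assumed cluster labels.
study_estimates = [1.0, 3.0, 2.0, 10.0, 12.0]
study_variances = [1.0, 1.0, 1.0, 2.0, 2.0]
study_clusters = ["a", "a", "a", "b", "b"]

by_cluster = cluster_pooled(study_estimates, study_variances, study_clusters)
```

Pooling only within a cluster reduces variance for each study's estimate while avoiding contamination from studies whose effects appear not to be transportable.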

Keywords

Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
