Publication:
Methods for Complex Observational Study Data

Date

2023-05-02

Citation

Li, Daniel. 2023. Methods for Complex Observational Study Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

This dissertation proposes and applies new methods for complex observational study data, with applications to (i) health disparity associations in longitudinal and ecological coronavirus disease 2019 (COVID-19) public surveillance data and (ii) missing biomarkers in non-small cell lung cancer (NSCLC) data. Chapters 1 and 2 were motivated by county-level COVID-19, demographic, and socioeconomic data synthesized from various government and nonprofit institutions for all United States (US) counties through March 2021. Chapter 3 was motivated by NSCLC data collected at Dana-Farber Cancer Institute through September 2020.

Chapter 1 focuses on identifying US county-level characteristics associated with high COVID-19 burden through December 2020. We used generalized linear mixed models to model cumulative and weekly county COVID-19 cases and deaths. Cumulative and weekly models both included state fixed effects and county-specific random effects. Weekly models additionally allowed county-level covariate effects to vary by season and included US Census region-specific B-splines to adjust for temporal trends. We found that rural counties, counties with more minorities and white/non-white segregation, and counties with more people with no high school diploma and with medical comorbidities were associated with higher cumulative COVID-19 case and death rates through December 2020. In spring 2020, urban counties and counties with more minorities and white/non-white segregation were associated with increased weekly case and death rates. In fall 2020, rural counties were associated with larger weekly case and death rates. In spring, summer, and fall 2020, counties with more residents with socioeconomic disadvantage and medical comorbidities were associated with greater weekly case and death rates. These county-level associations were based on complete data from the entire country and come from a single modeling framework that longitudinally analyzes the US COVID-19 pandemic at the county level.

Chapter 2 focuses on integrative ecological regression analyses of US county- and state-level cumulative COVID-19 death data through March 2021. We proposed a log-linear random effects model to jointly model county-level total death counts and state-level race- and sex-specific death counts. A penalized composite log-likelihood method was developed for parameter estimation. Simulation studies showed that, compared to naive county-level ecological regression, additionally incorporating state-level race- and sex-specific counts could overcome ecological bias and the ecological fallacy in estimating individual-level health disparity effects. When applied to US COVID-19 death data, naive ecological regression produced individual-level estimates that were unstable and inconsistent with other well-established results. After incorporating state-level counts, our proposed method produced more stable estimates that were consistent with other well-established results. Combining state-level counts with county-level data is a valuable approach for improving ecological analyses.

Chapter 3 focuses on studying missing programmed death-ligand 1 (PD-L1) tumor proportion score (TPS) biomarker values. We modeled PD-L1 TPS using a zero-inflated beta model and performed multiple imputations to study missing biomarker associations with immune checkpoint inhibitor (ICI) treatment efficacy. We found that prior treatment and mutations in STK11 were associated with zero PD-L1 expression and, among patients with non-zero PD-L1 expression, with lower PD-L1 expression. Prior treatment and worse performance status were associated with decreased ICI efficacy and survival, while greater tumor mutation burden (TMB) and PD-L1 TPS were associated with increased ICI efficacy and survival. After adjusting for single nucleotide variants (SNV) and copy number variants (CNV) known to predict PD-L1, PD-L1 TPS was no longer associated with ICI efficacy. However, after multiple imputation analyses, although PD-L1 TPS was still not associated with ICI efficacy, mutations in STK11 were now associated with decreased ICI efficacy and survival. A zero-inflated beta model is a biologically and practically meaningful model for predicting PD-L1 TPS, and multiple imputation can improve power for detecting significant biomarkers when there are missing NSCLC data. Together, these methods provide additional insight for developing new biomarkers in predicting ICI efficacy.
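
As an illustration of the zero-inflated beta idea mentioned for PD-L1 TPS, the following is a minimal sketch and not the dissertation's implementation. It assumes TPS has been rescaled from a percentage to a proportion in [0, 1), that the beta component uses a mean/precision parameterization, and that values of exactly 1 are not handled. The function names `zib_logpdf` and `zib_mean` and the parameters `pi0`, `mu`, and `phi` are hypothetical and chosen only for this example.

```python
import math


def zib_logpdf(y, pi0, mu, phi):
    """Log-density of a zero-inflated beta variable (illustrative only).

    y   : observed proportion, 0 <= y < 1 (e.g. TPS / 100, an assumption)
    pi0 : probability of an exact zero, 0 < pi0 < 1
    mu  : mean of the beta component, 0 < mu < 1
    phi : precision of the beta component, phi > 0
    """
    if not 0.0 <= y < 1.0:
        raise ValueError("y must lie in [0, 1)")
    if y == 0.0:
        # Point mass at zero: the "zero expression" part of the mixture.
        return math.log(pi0)
    # Convert mean/precision to the usual beta shape parameters.
    a = mu * phi
    b = (1.0 - mu) * phi
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_beta_pdf = (
        (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_beta_fn
    )
    # Continuous part: non-zero expression, weighted by 1 - pi0.
    return math.log(1.0 - pi0) + log_beta_pdf


def zib_mean(pi0, mu):
    """Overall mean of the mixture: zeros contribute nothing."""
    return (1.0 - pi0) * mu
```

In a regression setting, `pi0` and `mu` would each typically be linked to covariates (for example through logit links), which is how a single covariate can be associated both with zero expression and with lower expression among non-zero values.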

Keywords

Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service