Publication: Topics in Cluster-Correlated Data: Design, Informativeness, and Misclassification
Abstract
Cluster-correlated data are ubiquitous in biomedical research and introduce a number of methodological challenges. Motivated by applications in healthcare policy and epidemiology, this dissertation addresses three such problems. The first chapter considers the hospital-profiling setting, where quality of care is assessed on the basis of patient-level outcomes clustered within hospitals. The latter two chapters are motivated by multigenerational studies, wherein interest lies in the effect of exposures on subsequent generations, with children clustered within families.
In Chapter 1, we propose an outcome-dependent sampling solution to a health policy problem. Hospital readmission is a key marker of quality of care used by the Centers for Medicare and Medicaid Services to determine hospital reimbursement rates. Analyses of readmission are based on a generalized linear mixed model (GLMM) that permits estimation of hospital-specific measures while adjusting for case-mix differences. Recent moves to address health disparities call for expanding case-mix adjustment to include measures of socioeconomic status while minimizing the data-collection burden on hospitals. We propose that detailed socioeconomic data be collected on a sub-sample of patients via a cluster-stratified case-control design paired with pseudo-maximum likelihood estimation. In simulations, the proposed approach proves highly efficient when interest lies in either the fixed or the random components of a GLMM and covariates are unobserved or expensive to collect. In the motivating study of Medicare beneficiaries, the proposed framework provides a means of mitigating disparities in which hospitals are deemed underperformers, relative to a naive analysis that fails to adjust for missing case-mix variables.
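For concreteness, a GLMM of the kind described for readmission can be written as a random-intercept logistic model. The logit link, the normal random-effects distribution, and every symbol below are illustrative assumptions; the abstract does not state the exact specification used.

```latex
\operatorname{logit}\,\Pr\left(Y_{ij} = 1 \mid \mathbf{X}_{ij}, b_i\right)
  = \mathbf{X}_{ij}^{\top}\boldsymbol{\beta} + b_i,
\qquad b_i \sim \mathcal{N}\left(0, \sigma^2\right)
```

Under these assumptions, Y_ij is the readmission indicator for patient j in hospital i, X_ij collects the case-mix covariates (including any socioeconomic measures), beta holds the fixed effects, and b_i is the hospital-specific random intercept on which profiling would be based.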
We then shift our attention to multigenerational studies, which are susceptible to informative cluster size: the number of children born to a mother (the cluster size) may be related to the children's outcomes, given covariates. A natural question then emerges: what if some women bear no children at all? The impact of these potentially informative empty clusters is currently unknown, and Chapter 2 first evaluates the performance of standard methods for informative cluster size when cluster size is permitted to be zero. We find that if the informative cluster size mechanism induces empty clusters, standard methods lead to biased estimates of target parameters. Joint models of outcome and size are capable of valid conditional inference as long as empty clusters are explicitly included in the analysis, but in practice empty clusters regularly go unacknowledged. By contrast, estimating equation approaches necessarily omit empty clusters and therefore yield biased estimates of marginal effects. We thus propose a joint marginalized approach that readily incorporates empty clusters and, even in their absence, permits more intuitive interpretations of population-averaged effects than do current methods.
Multigenerational studies require many years of follow-up, so exposures are often assessed retrospectively to maximize the number of observable generations, which introduces recall bias and mismeasurement. Chapter 3 investigates exposure misclassification when cluster size is potentially informative, and in particular when misclassification is differential by cluster size. First, we show that misclassification in an exposure related to cluster size can induce informativeness even when cluster size would otherwise be non-informative. Second, we show that misclassification that is differential by informative cluster size can not only attenuate estimates of exposure effects but even inflate them or reverse their sign. To correct for bias in estimating marginal parameters, we propose: (i) an approximate expected estimating equations framework, and (ii) an observed likelihood framework for joint marginalized models of cluster size and outcomes. Although the focus is on estimating marginal parameters, a corollary is that the observed likelihood approach permits valid inference for conditional parameters as well.
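The idea of informative cluster size with empty clusters can be illustrated with a small simulation. The sketch below is a toy example and not the dissertation's model: the latent family effect, the child-bearing probabilities, the maximum of four children, and the function names are all hypothetical choices made for illustration.

```python
import random
from statistics import fmean


def simulate_families(n_families=20000, seed=0):
    """Simulate families whose size depends on a latent family effect."""
    rng = random.Random(seed)
    families = []
    for _ in range(n_families):
        b = rng.gauss(0.0, 1.0)  # hypothetical latent family-level effect
        # Assumed mechanism: families with a higher latent effect tend to
        # have fewer children, so cluster size is informative.
        p_child = 0.7 if b < 0 else 0.3
        size = sum(rng.random() < p_child for _ in range(4))  # 0 to 4 children
        outcomes = [b + rng.gauss(0.0, 1.0) for _ in range(size)]
        families.append((b, outcomes))
    return families


def summarize(families):
    """Compare summaries that include or exclude empty clusters."""
    nonempty = [ys for _, ys in families if ys]
    children = [y for ys in nonempty for y in ys]
    return {
        # Mean latent effect over all families, empty ones included.
        "latent_mean_all_families": fmean(b for b, _ in families),
        # Each child counts once, so large families dominate.
        "mean_over_children": fmean(children),
        # Each non-empty family counts once; empty families are dropped.
        "mean_over_nonempty_families": fmean(fmean(ys) for ys in nonempty),
        "share_empty": 1 - len(nonempty) / len(families),
    }


if __name__ == "__main__":
    for name, value in summarize(simulate_families()).items():
        print(f"{name}: {value:.3f}")
```

In this toy setup the latent effect averages roughly zero across all families, yet both summaries computed from observed children fall below zero: weighting by child over-represents large families, and weighting by family still discards the empty clusters, which here are disproportionately families with a high latent effect.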