Statistical Methods for the Design and Analysis of Infectious Disease Studies
Access Status: Full text of the requested work is not available in DASH at this time ("dark deposit"). For more information on dark deposits, see our FAQ.
Citation: Kennedy-Shaffer, Lee. 2020. Statistical Methods for the Design and Analysis of Infectious Disease Studies. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: Studies of treatment and prevention for infectious diseases are complicated by many factors: high spatiotemporal variance of outcomes, ethical challenges due to spillover effects on individuals not participating in the study, and logistical difficulties, to name just a few. Statistically, one of the most difficult challenges is accounting for the correlation that occurs between the outcomes of individuals. A vaccine given to one individual, for example, has an effect on the likelihood of infection for everyone around them. This correlation of data violates the independence of outcomes, a core assumption for many statistical methods. As a result, we turn to both design- and analysis-based approaches to solve this problem. Robust variance estimators, generalized estimating equations, mixed effects models, and cluster-based analyses can all be used to maintain validity of results in the presence of clustered data.
As is common in statistics, however, these approaches come with tradeoffs. In most cases, the tradeoff is efficiency: when we properly account for the correlation, our studies lose power to detect effects and our estimates become less precise. Because of this tradeoff, there is an urgent need to better understand the efficiency of these analyses, to accurately predict the power of these analyses, and to improve the precision of estimation. This dissertation seeks to address that need by contributing methods for the design and analysis of studies of correlated data. While the methods proposed are not exclusively valuable for infectious disease studies and cluster randomized trials, we illustrate them with examples from these fields to demonstrate one aspect of their utility. The results, however, apply to a much wider array of settings and deepen our understanding of the statistical properties of correlated data.
In Chapter 1, we develop methods for the estimation of sample size for stratified individual and cluster randomized trials, a key step in the design of such studies. Using asymptotic variance formulae for logistic regressions fitted using generalized estimating equations, we find sample size formulae that accommodate multiple strata and arbitrary design effects. In addition, we provide formulae to find the ratio of sample sizes for a stratified and comparably powered unstratified trial, highlighting situations where stratification is most beneficial to reducing the required size. We illustrate this by applying the methods to a multi-site cluster randomized trial of a prophylactic for household contacts of individuals with multidrug-resistant tuberculosis.
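As a rough illustration of how a design effect enters a sample size calculation, the Python sketch below inflates the textbook two-proportion sample size by the standard design effect 1 + (m - 1) * ICC. It is not the stratified, GEE-based formula developed in the dissertation; the function names, the equal-cluster-size and equal-allocation assumptions, and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def design_effect(cluster_size: float, icc: float) -> float:
    """Standard variance inflation factor for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc


def individuals_per_arm(p_control: float, p_treat: float,
                        cluster_size: float, icc: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Individuals needed per arm for a two-sided test of two proportions,
    inflated by the design effect (assumes 1:1 allocation)."""
    z_total = (NormalDist().inv_cdf(1 - alpha / 2)
               + NormalDist().inv_cdf(power))
    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    # Sample size per arm if all outcomes were independent.
    n_independent = z_total ** 2 * variance_sum / (p_control - p_treat) ** 2
    return math.ceil(n_independent * design_effect(cluster_size, icc))


if __name__ == "__main__":
    # Hypothetical inputs: 30% vs 20% risk, clusters of 10, ICC of 0.05.
    n = individuals_per_arm(0.30, 0.20, cluster_size=10, icc=0.05)
    print("individuals per arm:", n)
    print("clusters per arm:", math.ceil(n / 10))
```

Setting `icc=0` recovers the individually randomized calculation, so the ratio of the two results is the design effect up to rounding.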
In Chapter 2, we continue to consider the analysis of clustered binary data using generalized estimating equations. Now, we turn to the role of the working correlation structure in the efficiency of the analysis. We derive asymptotic variance formulae for settings where the intracluster correlation coefficient (ICC) is not constant across clusters. When the ICC depends on the values of cluster-level covariates, we find that accounting for this in the working correlation structure generally has minimal impact on the efficiency of the analysis compared to using an exchangeable working correlation structure assuming a common ICC across clusters. In the design stage, however, using an incorrect ICC (e.g., using an estimate of the ICC in the control arm when the true ICC varies by treatment arm) can cause substantial under- or over-estimation of the required sample size of the trial. We demonstrate these effects on an example trial conducted in Bangladesh assessing the impact of water, sanitation, and handwashing interventions delivered to pregnant women living in neighborhood clusters on Giardia infection among children. Overall, while analysis can proceed using the common assumption that the ICC does not vary across clusters with little loss of efficiency, accounting for a varying ICC in the design stage is important.
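To make the design-stage point concrete, the Python sketch below compares the required sample size when the control-arm ICC is assumed for both arms against the size required when each arm has its own ICC. It uses a simple arm-specific design effect applied to the textbook two-proportion formula, which is an assumption for illustration rather than the asymptotic variance formulae derived in the dissertation; all names and numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def per_arm_size_arm_specific_icc(p_control: float, p_treat: float,
                                  cluster_size: float,
                                  icc_control: float, icc_treat: float,
                                  alpha: float = 0.05,
                                  power: float = 0.80) -> int:
    """Individuals per arm when each arm has its own ICC
    (assumes equal cluster sizes and 1:1 allocation)."""
    z_total = (NormalDist().inv_cdf(1 - alpha / 2)
               + NormalDist().inv_cdf(power))
    # Each arm's binomial variance is inflated by that arm's design effect.
    de_control = 1.0 + (cluster_size - 1.0) * icc_control
    de_treat = 1.0 + (cluster_size - 1.0) * icc_treat
    variance_sum = (p_control * (1 - p_control) * de_control
                    + p_treat * (1 - p_treat) * de_treat)
    return math.ceil(z_total ** 2 * variance_sum
                     / (p_control - p_treat) ** 2)


if __name__ == "__main__":
    # Hypothetical scenario: the planner only has a control-arm ICC of 0.02,
    # but the true ICC in the treated arm is 0.10.
    planned = per_arm_size_arm_specific_icc(0.30, 0.20, 20, 0.02, 0.02)
    needed = per_arm_size_arm_specific_icc(0.30, 0.20, 20, 0.02, 0.10)
    print("planned per arm:", planned)
    print("needed per arm: ", needed)
```

When the treated-arm ICC is larger than the planning value, the planned size falls short of the needed size; when it is smaller, the trial is larger than necessary.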
Finally, in Chapter 3, we turn specifically to the stepped wedge cluster randomized trial (SW-CRT) design. In order to address issues of bias, inflated Type I Error, and reduced power when mixed effects models or fully non-parametric analysis methods are used, we propose several new methods of analysis. The synthetic control method uses a causal inference method from econometrics to match clusters with similar time trends of the outcome, improving the efficiency of vertical methods of analyzing SW-CRTs and thus improving power. The crossover method uses the information-rich horizontal contrasts inherent to SW-CRTs to improve power while avoiding the need to explicitly model time effects. And the crossover-synthetic control and ensemble methods combine these approaches in ways that may improve power in some circumstances. These methods also allow the investigator to specify weights, allowing the explicit targeting of specific causal estimands of interest. Through theoretical results and application to simulated data, we show how these approaches can lead to unbiased estimation and improved power compared to existing approaches. We also demonstrate the range of results that can occur using various analysis methods in an example SW-CRT conducted in Brazil assessing the impact of new tuberculosis diagnostic tools on the outcomes of individuals with tuberculosis.
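For readers unfamiliar with the design, the Python sketch below builds the treatment schedule of a basic stepped wedge trial and extracts the two kinds of comparisons mentioned above: vertical (treated versus untreated clusters within one period) and horizontal (one cluster before versus after crossover). It only illustrates the layout of the design and does not implement the synthetic control, crossover, or ensemble methods; the function names are hypothetical, and a real trial would randomize which clusters cross over at each step.

```python
def stepped_wedge_schedule(n_clusters: int, n_steps: int) -> list[list[int]]:
    """Rows are clusters, columns are periods; 1 means on intervention.
    Period 0 is all-control, and an equal number of clusters crosses over
    at each later period until every cluster is treated."""
    if n_steps < 1 or n_clusters % n_steps != 0:
        raise ValueError("n_clusters must be a positive multiple of n_steps")
    per_step = n_clusters // n_steps
    n_periods = n_steps + 1
    schedule = []
    for cluster in range(n_clusters):
        crossover_period = cluster // per_step + 1
        schedule.append([1 if period >= crossover_period else 0
                         for period in range(n_periods)])
    return schedule


def vertical_groups(schedule: list[list[int]], period: int):
    """Split cluster indices into (treated, control) within one period."""
    treated = [c for c, row in enumerate(schedule) if row[period] == 1]
    control = [c for c, row in enumerate(schedule) if row[period] == 0]
    return treated, control


def horizontal_periods(schedule: list[list[int]], cluster: int):
    """Split one cluster's periods into (control periods, treated periods)."""
    row = schedule[cluster]
    before = [p for p, x in enumerate(row) if x == 0]
    after = [p for p, x in enumerate(row) if x == 1]
    return before, after


if __name__ == "__main__":
    design = stepped_wedge_schedule(n_clusters=4, n_steps=4)
    for row in design:
        print(row)
    print(vertical_groups(design, period=2))
    print(horizontal_periods(design, cluster=1))
```

Only the middle periods contain both treated and untreated clusters, so vertical comparisons are unavailable in the first and last periods of this layout, while every cluster contributes a horizontal comparison.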
These results provide new tools for investigators to use in designing and analyzing studies of infectious disease prevention and treatment. They allow investigators to better understand the power of their studies before conducting them and to choose design and analysis features that may improve that power. This of course has taken on an increasing urgency as the world faces the COVID-19 pandemic and a wider audience is exposed, in real time, to the challenges of conducting clinical trials in an outbreak setting. The methods and design implications discussed here may or may not have relevance to the particular studies that will be conducted surrounding this crisis, but they demonstrate once again the importance of careful design and analysis choices by investigators, the relevance of the disease and societal setting in conducting trials, and the necessity of considering statistical properties alongside ethical and logistical properties of proposed designs. They also point the way to future research questions that might be addressed to further improve infectious disease studies.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365906