Publication: Statistical Methods for Improving Real-Time Outbreak Detection
Abstract
Real-time outbreak detection in resource-constrained settings requires robust statistical methods that can account for aberrations in historical data, reporting delays in the most recent observations, and dynamic changes in transmission trends. This dissertation develops and evaluates outbreak detection frameworks that address these practical challenges using simulation-based evaluation, real-world data applications, and flexible statistical modeling.
Chapter 1 investigates the impact of historical anomalies, termed "aberrations," on the performance of rolling outbreak detection methods applied to health management information system (HMIS) data. Motivated by five years of acute respiratory infection (ARI) surveillance data from Liberia, we simulate outbreaks under seven distinct data-generating mechanisms varying in trend and seasonality. We assess five detection algorithms: EARS, Farrington, Holt-Winters, and two Weinberger-Fulcher (WF) models (negative binomial and quasi-Poisson, QP). Detection accuracy is measured through sensitivity, specificity, and pseudo-ROC curves, under varied aberration timing and outbreak size. We find that the presence of recent aberrations in the baseline degrades performance across models, with context-specific tradeoffs: EARS and WF models perform well in the absence of recent anomalies; WF QP and Holt-Winters maintain better balance between sensitivity and specificity when recent aberrations are present; and Farrington achieves high sensitivity but lower specificity in these settings. These results offer practical guidance for selecting rolling detection models under imperfect baseline conditions common in low- and middle-income countries (LMICs).
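To illustrate how a recent aberration in the baseline can mask an outbreak, the following minimal Python sketch applies an EARS C1-style rule (flag a count that exceeds the mean plus three standard deviations of the preceding seven observations) to two toy series. The function names, the toy counts, and the choice of the C1 variant are illustrative assumptions, not the dissertation's simulation design or implementation.

```python
from statistics import mean, stdev

def c1_flags(counts, baseline=7, threshold=3.0):
    """EARS C1-style rule (illustrative): flag index t when counts[t] exceeds
    mean + threshold * sd of the `baseline` observations just before t."""
    flags = [False] * len(counts)
    for t in range(baseline, len(counts)):
        window = counts[t - baseline:t]
        # Guard against a zero standard deviation in a perfectly flat window.
        sd = max(stdev(window), 1e-9)
        flags[t] = counts[t] > mean(window) + threshold * sd
    return flags

def sensitivity_specificity(flags, truth, start):
    """Sensitivity and specificity over indices >= start (the evaluable points).
    Returns None for a quantity whose denominator is zero."""
    pairs = list(zip(flags[start:], truth[start:]))
    tp = sum(1 for f, y in pairs if f and y)
    fn = sum(1 for f, y in pairs if not f and y)
    tn = sum(1 for f, y in pairs if not f and not y)
    fp = sum(1 for f, y in pairs if f and not y)
    sens = tp / (tp + fn) if (tp + fn) else None
    spec = tn / (tn + fp) if (tn + fp) else None
    return sens, spec

# Hypothetical toy data: a true outbreak at the final time point in both series.
truth = [False] * 9 + [True]
clean = [10, 12, 11, 9, 10, 11, 12, 10, 11, 25]
aberrant = [10, 12, 11, 9, 40, 11, 12, 10, 11, 25]  # spike of 40 inside the baseline

clean_result = sensitivity_specificity(c1_flags(clean), truth, start=7)
aberrant_result = sensitivity_specificity(c1_flags(aberrant), truth, start=7)
print(clean_result, aberrant_result)
```

In the second series the spike inflates both the baseline mean and its standard deviation, so the same outbreak count of 25 falls below the alarm threshold. This is the masking effect the chapter examines.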
Chapter 2 develops a novel frequentist framework for real-time nowcasting of all-cause mortality under reporting delays, using Massachusetts death registration data from 2017 to 2022. Reporting delays are modeled via a discrete-time survival model that incorporates covariates such as day of the week, lag, and snapshot date to flexibly capture evolving delay patterns. Using method-of-moments estimation, we correct underreported death counts, propagate delay uncertainty into variance estimates, and apply LOESS smoothing to stabilize predictions for the most recent days. Variance from both the delay model and smoothing step is incorporated into predictive intervals. Compared to leading Bayesian and spline-based nowcasting methods, including hierarchical Bayesian models, NobBS, EpiNowcast, and GAM approaches, our method achieves superior empirical coverage, lower bias, and narrower interval widths, particularly during the early pandemic phase when reporting delays exhibited sharp day-of-week effects. Explicit modeling of day-of-week reporting behavior substantially improved accuracy relative to approaches that omitted temporal covariates, and the method remained robust to shifts in the reporting distribution across time.
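The core of the delay correction can be sketched in Python as follows. This is a deliberately simplified version under stated assumptions: hazards are estimated empirically with no covariates (the chapter's model includes day of the week, lag, and snapshot date), and variance propagation and LOESS smoothing are omitted. All function names and counts are hypothetical.

```python
def estimate_hazards(complete_rows):
    """complete_rows[i][d] = deaths occurring on day i first reported at delay d,
    taken from days old enough to be fully reported.
    Returns h[d] = P(delay == d | delay >= d), estimated empirically."""
    max_delay = len(complete_rows[0])
    hazards = []
    for d in range(max_delay):
        events = sum(row[d] for row in complete_rows)
        at_risk = sum(sum(row[d:]) for row in complete_rows)
        hazards.append(events / at_risk if at_risk else 0.0)
    return hazards

def cumulative_reported(hazards):
    """F[d] = P(delay <= d) = 1 - prod_{k <= d} (1 - h[k])."""
    unreported = 1.0
    cumulative = []
    for h in hazards:
        unreported *= 1.0 - h
        cumulative.append(1.0 - unreported)
    return cumulative

def nowcast(observed_so_far, max_observed_delay, hazards):
    """Moment-style correction: inflate the partial count by the estimated
    probability of having been reported within the delays observed so far."""
    return observed_so_far / cumulative_reported(hazards)[max_observed_delay]

# Hypothetical reporting history: three fully reported days, delays of 0, 1, 2 days.
history = [[50, 30, 20], [40, 40, 20], [60, 20, 20]]
hazards = estimate_hazards(history)   # hazards at delays 0, 1, 2
today = nowcast(45, 0, hazards)       # only delay-0 reports are in so far
yesterday = nowcast(80, 1, hazards)   # delays 0 and 1 are in
print(hazards, today, yesterday)
```

With these toy numbers, half of all deaths are reported at delay 0 and 80% by delay 1, so a partial count of 45 is inflated to about 90 and a partial count of 80 to about 100.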
Chapter 3 extends this delay correction framework by integrating nowcasting with slope-based outbreak detection in a unified two-stage approach. Using molecular-confirmed COVID-19 case data from Puerto Rico, we estimate unreported cases via a discrete-time hazard model and then fit a slope-based detection model using generalized estimating equations (GEE), incorporating nowcast-derived variances as observation-level weights. We conduct a simulation study varying epidemic wave intensity, reporting delay speed, and baseline structure (including both stable and declining post-wave baselines) to evaluate time to detection, false positive rate, and calibration across models. We benchmark against the Farrington algorithm, $R_t$-based detection, and the Weinberger-Fulcher model. Our slope-based GEE approach consistently achieves faster and more reliable detection, particularly under low and medium wave scenarios with reporting delays, while maintaining strong calibration across a range of nominal alpha levels. The method also performs well when applied to real-world Puerto Rico data, issuing timely signals across three distinct epidemic waves.
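The idea of weighting a slope test by nowcast uncertainty can be illustrated with a much simpler stand-in than the GEE model used in the chapter. The Python sketch below fits a weighted least-squares line to recent nowcasted counts and applies a one-sided z-test to the slope. The inverse-variance weights, the treatment of variances as known, and all names and numbers are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def weighted_slope_test(y, variances, alpha=0.05):
    """Weighted least-squares slope of y on time index 0..n-1 with weights
    1 / variance, followed by a one-sided z-test of H0: slope <= 0.
    Assumes the supplied variances are the true observation variances."""
    w = [1.0 / v for v in variances]
    t = list(range(len(y)))
    sw = sum(w)
    t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
    sxy = sum(wi * (ti - t_bar) * (yi - y_bar) for wi, ti, yi in zip(w, t, y))
    slope = sxy / sxx
    se = sqrt(1.0 / sxx)  # standard error under known variances
    z = slope / se
    signal = z > NormalDist().inv_cdf(1.0 - alpha)
    return slope, z, signal

# Hypothetical nowcasts: the most recent days carry larger nowcast variance.
variances = [4, 4, 4, 4, 9, 16, 36]
rising = [20, 22, 25, 29, 34, 40, 47]
flat = [20, 21, 19, 20, 21, 20, 19]

rising_result = weighted_slope_test(rising, variances)
flat_result = weighted_slope_test(flat, variances)
print(rising_result, flat_result)
```

Down-weighting the most recent, least certain nowcasts keeps a noisy final estimate from triggering a signal on its own, while a sustained upward trend still produces a large z-statistic.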
Together, these chapters provide a comprehensive statistical toolkit for outbreak detection under the operational constraints of incomplete, delayed, and aberration-prone surveillance data. The approaches developed are computationally efficient, modular, and applicable across diverse epidemiological contexts, with particular relevance for LMICs and subnational surveillance systems.