Publication: New approaches to factual and counterfactual prediction modeling
Abstract
Over the past half century, new methods for quantitative risk prediction and validation have been formalized, and the number of models, both statistical and algorithmic, has increased exponentially. However, this literature has largely focused on descriptive predictions of the world as it is, which I term factual prediction, rather than the world as it would be if we intervened, which I term counterfactual prediction. In this dissertation, I argue that in many instances counterfactual predictions are desired, but targeting them requires new methods based on causal inference.
In Chapter 1, I take a method traditionally associated with causal inference, the g-formula, and repurpose it as a model for factual and counterfactual prediction. In doing so, I highlight the potential of the g-formula as a unifying framework for prediction, as well as the assumptions it requires. Through simulation and an applied data example in the Framingham Offspring Study, I show how the g-formula can estimate factual and counterfactual quantities and leverage multiple repeated measurements over time to produce predictions that update dynamically.
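To make the distinction concrete, the following is a minimal sketch of the nonparametric g-formula in the simplest setting: one binary covariate L, one binary treatment A, and a binary outcome Y, all at a single time point. It is not the dissertation's implementation, which handles repeated measurements over time. The cell counts and the function names (`make_data`, `cond_risk`, `g_formula_risk`, `factual_risk`) are hypothetical, and the counterfactual reading assumes exchangeability given L, consistency, and positivity.

```python
from collections import Counter

def make_data():
    # Hypothetical cell counts: (L, A, number of people, number of events).
    # High-risk people (L=1) are treated more often than low-risk people.
    cells = [(0, 0, 40, 8), (0, 1, 10, 1), (1, 0, 10, 5), (1, 1, 40, 12)]
    rows = []
    for l, a, n, events in cells:
        rows += [(l, a, 1)] * events + [(l, a, 0)] * (n - events)
    return rows

def cond_risk(rows, a, l):
    # Empirical E[Y | A=a, L=l]
    ys = [y for (li, ai, y) in rows if li == l and ai == a]
    return sum(ys) / len(ys)

def g_formula_risk(rows, a):
    # Standardization: sum over l of E[Y | A=a, L=l] * P(L=l).
    # Under the stated assumptions this estimates the risk if everyone
    # received treatment level a (a counterfactual quantity).
    n = len(rows)
    l_counts = Counter(l for (l, _, _) in rows)
    return sum(cond_risk(rows, a, l) * cnt / n for l, cnt in l_counts.items())

def factual_risk(rows):
    # Risk under the treatment pattern actually observed (a factual quantity).
    return sum(y for (_, _, y) in rows) / len(rows)

rows = make_data()
risk_factual = factual_risk(rows)
risk_untreated = g_formula_risk(rows, a=0)
risk_treated = g_formula_risk(rows, a=1)
print(risk_factual, risk_untreated, risk_treated)
```

With these made-up counts the factual risk is 0.26, while the standardized risk is 0.35 if no one were treated and 0.20 if everyone were. The factual prediction understates untreated risk here because the people at highest risk are the ones most often treated.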
In Chapter 2, I consider an example of a common clinical prediction task, namely developing a model for risk-based treatment decisions, where the ideal target is counterfactual. Building on prior work, I clarify the single-arm target trial of interest and propose two estimation methods that allow for separation between the causal and prediction tasks. I apply these methods to predict the statin-naive risk of cardiovascular disease using an emulated trial based on the Multi-Ethnic Study of Atherosclerosis. I find that, at common treatment thresholds, traditional methods lead to underallocation of treatment by between 5 and 9 percentage points.
Finally, in Chapter 3, I tackle the theoretical question of how to train and validate models for counterfactual prediction when the relevant potential outcomes are not observed for all units. I discuss how to tailor a model for use in the same population under a counterfactual shift in treatment policy, how to assess its performance, and how to perform model and tuning parameter selection. I also provide identifiability results for measures of counterfactual performance for a potentially misspecified prediction model. I illustrate the methods using simulation and apply them to validate the performance of the statin-naive risk prediction model from Chapter 2.
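As an illustration of what assessing counterfactual performance can involve, the sketch below estimates the mean squared error a fixed prediction model would have if no one were treated, using inverse probability weighting among the untreated. The abstract does not say which estimators the dissertation uses, so weighting is an assumed stand-in here. The data, the deliberately misspecified `predict` function, and all other names are hypothetical, and the estimate is only valid under exchangeability given L, consistency, and positivity.

```python
def make_data():
    # Hypothetical cell counts: (L, A, number of people, number of events).
    cells = [(0, 0, 40, 8), (0, 1, 10, 1), (1, 0, 10, 5), (1, 1, 40, 12)]
    rows = []
    for l, a, n, events in cells:
        rows += [(l, a, 1)] * events + [(l, a, 0)] * (n - events)
    return rows

def propensity_untreated(rows, l):
    # Empirical P(A=0 | L=l)
    treatments = [a for (li, a, _) in rows if li == l]
    return treatments.count(0) / len(treatments)

def predict(l):
    # Hypothetical, possibly misspecified, model of untreated risk.
    return {0: 0.2, 1: 0.3}[l]

def naive_mse_untreated(rows, predict):
    # Unweighted MSE in the untreated subset; ignores who gets treated.
    errs = [(y - predict(l)) ** 2 for (l, a, y) in rows if a == 0]
    return sum(errs) / len(errs)

def counterfactual_mse_untreated(rows, predict):
    # Weighted MSE: each untreated person is weighted by 1 / P(A=0 | L),
    # so the untreated subset stands in for the whole population.
    num = 0.0
    den = 0.0
    for l, a, y in rows:
        if a != 0:
            continue
        w = 1.0 / propensity_untreated(rows, l)
        num += w * (y - predict(l)) ** 2
        den += w
    return num / den

rows = make_data()
mse_naive = naive_mse_untreated(rows, predict)
mse_counterfactual = counterfactual_mse_untreated(rows, predict)
print(mse_naive, mse_counterfactual)
```

With these made-up counts the naive estimate is 0.186 and the weighted estimate is 0.225. The naive figure is optimistic because the stratum where the model fits worst (L=1) is underrepresented among the untreated.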