Publication:

Optimal Transport Methods for Causal Inference, Multisample Testing, and Model Interpretation

Date

2021-05-12

Citation

Dunipace, Eric Arthur. 2021. Optimal Transport Methods for Causal Inference, Multisample Testing, and Model Interpretation. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

The manuscript discusses three topics that utilize optimal transport and related methodologies to solve problems in statistics. Chapter 2 uses the Wasserstein distance to construct interpretable approximations to complicated models, Chapter 3 uses optimal transport distances to construct weighting estimators for causal inference, and Chapter 4 uses Hamiltonian paths and nearest neighbor graphs for multivariate testing. Each chapter is self-contained and the corresponding abstracts are given below.

Chapter 2: Statistical models often include thousands of parameters. However, large models decrease the investigator’s ability to interpret and communicate the estimated parameters. Reducing the dimensionality of the parameter space in the estimation phase is a commonly used approach, but less work has focused on selecting subsets of the parameters for interpreting the estimated model, especially in settings such as Bayesian inference and model averaging. Importantly, many models do not have straightforward interpretations, which creates another layer of obfuscation. To fill this gap, we introduce a new method that uses the Wasserstein distance to identify a low-dimensional interpretable model projection. After the estimation of complex models, users can budget how many parameters they wish to interpret, and the proposed method generates a simplified model of the desired dimension that minimizes the distance to the full model. We provide simulation results to illustrate the method and apply it to cancer datasets.
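
The idea of budgeting parameters and picking the reduced model closest to the full one in Wasserstein distance can be illustrated with a toy sketch. This is not the dissertation's algorithm: it assumes a single linear coefficient vector, forms reduced models by zeroing out coefficients, compares the empirical distributions of predictions with the 1-D 2-Wasserstein distance, and searches subsets by brute force. The names `w2_empirical_1d` and `best_subset_by_w2` are hypothetical.

```python
import itertools

import numpy as np


def w2_empirical_1d(x, y):
    """2-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling pairs sorted values.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    return float(np.sqrt(np.mean((x - y) ** 2)))


def best_subset_by_w2(X, beta, k):
    """Brute-force search for the k coefficients whose reduced model
    has predictions closest (in W2) to the full model's predictions.

    Assumption: the reduced model keeps the selected coefficients
    unchanged and sets all others to zero.
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    full_pred = X @ beta
    best = None
    for subset in itertools.combinations(range(beta.size), k):
        reduced = np.zeros_like(beta)
        reduced[list(subset)] = beta[list(subset)]
        dist = w2_empirical_1d(full_pred, X @ reduced)
        if best is None or dist < best[1]:
            best = (subset, dist)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    beta = np.array([3.0, 0.01, -2.0, 0.02])
    print(best_subset_by_w2(X, beta, k=2))
```

Brute-force search is only feasible for a handful of parameters; with thousands of parameters, as the abstract describes, some form of approximate or penalized selection would be needed.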

Chapter 3: Weighting methods are a common tool to de-bias estimates of causal effects. Though there is an increasing number of seemingly disparate methods, many of them can be folded into one unifying regime: causal optimal transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between a source and target population. Our approach is model-free but can also incorporate moments or any other important functions of covariates that the researcher desires to balance. We find that causal optimal transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control study examining the effect of misoprostol versus oxytocin for treatment of post-partum hemorrhage.
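
To make "weights that minimize an optimal transport distance" concrete, here is a minimal sketch under strong simplifying assumptions. The abstract does not state the optimization problem, so this assumes the simplest unregularized version: squared Euclidean cost, uniform mass on the target sample, and source weights left free. In that special case each target unit sends all of its mass to its nearest source unit, so the solution has a closed form. The function name `ot_weights_free_marginal` is hypothetical, and moment-balancing constraints are not shown.

```python
import numpy as np


def ot_weights_free_marginal(source, target):
    """Source weights minimizing the transport cost to a uniform target.

    Assumptions (not taken from the dissertation): squared Euclidean
    cost, no regularization, and no constraint on the source weights
    other than summing to one. Under these assumptions the optimal plan
    moves each target unit's mass to its nearest source unit.

    source: array of shape (n, d); target: array of shape (m, d).
    Returns (weights of shape (n,), average transport cost).
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    if source.ndim != 2 or target.ndim != 2:
        raise ValueError("source and target must be 2-D arrays")
    # cost[i, j] = squared distance between source unit i and target unit j
    cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    nearest = cost.argmin(axis=0)  # nearest source unit for each target unit
    m = target.shape[0]
    weights = np.bincount(nearest, minlength=source.shape[0]) / m
    avg_cost = float(cost[nearest, np.arange(m)].mean())
    return weights, avg_cost


if __name__ == "__main__":
    controls = np.array([[0.0], [1.0], [10.0]])
    treated = np.array([[0.1], [0.9], [1.2]])
    w, c = ot_weights_free_marginal(controls, treated)
    print(w, c)  # the control at 10.0 receives zero weight
```

The weighted control covariate mean, `w @ controls`, can then be compared with the treated mean as a basic balance check.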

Chapter 4: We propose non-parametric, graph-based tests to assess the distributional balance of covariates in observational studies with multi-valued treatments. Our tests utilize graph structures ranging from Hamiltonian paths that connect all of the data to nearest neighbor graphs that maximally separate data into pairs. We consider algorithms that form minimal distance graphs, such as optimal Hamiltonian paths or non-bipartite matching, or approximate alternatives, such as greedy Hamiltonian paths or greedy nearest neighbor graphs. Extensive simulation studies demonstrate that the proposed tests are able to detect the misspecification of matching models that other methods miss. Contrary to intuition, we also find that tests run on well-formed approximate graphs do better in most cases than tests run on optimally formed graphs, and that a properly formed test on an approximate nearest neighbor graph performs best, on average. In a multi-valued treatment setting with breast cancer data, these graph-based tests can also detect imbalances otherwise missed by common matching diagnostics. We provide a new R package multivariateTesting to implement these methods and reproduce our results.
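
A greedy Hamiltonian path test of the kind described above can be sketched as follows. This is an illustration only, not the API of the multivariateTesting package or the dissertation's exact test statistic: it assumes Euclidean distance, a path that always steps to the nearest unvisited point, a statistic equal to the number of path edges joining different treatment groups, and a permutation reference distribution. All function names are hypothetical.

```python
import numpy as np


def greedy_path(X, start=0):
    """Greedy Hamiltonian path: repeatedly move to the nearest unvisited point."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    path = [start]
    for _ in range(n - 1):
        # Mask visited points so they can never be chosen again
        d = np.where(visited, np.inf, dist[path[-1]])
        nxt = int(np.argmin(d))
        path.append(nxt)
        visited[nxt] = True
    return path


def cross_group_edges(path, labels):
    """Number of consecutive path nodes that belong to different groups."""
    ordered = np.asarray(labels)[path]
    return int(np.sum(ordered[1:] != ordered[:-1]))


def permutation_pvalue(X, labels, n_perm=999, seed=0):
    """Permutation test: few cross-group edges suggests covariate imbalance."""
    rng = np.random.default_rng(seed)
    path = greedy_path(X)
    observed = cross_group_edges(path, labels)
    perm_stats = np.array(
        [cross_group_edges(path, rng.permutation(labels)) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(perm_stats <= observed)) / (n_perm + 1)
    return observed, float(p_value)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Three treatment groups with clearly different covariate distributions
    X = np.vstack([rng.normal(loc=100.0 * g, size=(15, 2)) for g in range(3)])
    labels = np.repeat([0, 1, 2], 15)
    print(permutation_pvalue(X, labels))
```

Because the path is built from the covariates alone, it is computed once and only the labels are permuted, which keeps the permutation loop cheap.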

Keywords

causal inference, model interpretation, multisample testing, optimal transport, Wasserstein distance, Biostatistics, Statistics, Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
