Publication: Off Policy Reinforcement Learning for Real-World Settings
Date
2021-07-12
The Harvard community has made this article openly available.
Citation
Sonabend Worthalter, Aaron Michael. 2021. Off Policy Reinforcement Learning for Real-World Settings. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
In this dissertation, we aim to adapt reinforcement learning (RL) to real-world, high-risk settings. We study how to optimize sequential decision-making in complex settings with large observational data repositories where exploration is infeasible. In particular, we are motivated by estimating optimal dynamic treatment regimes (DTR) from electronic health records (EHR). We address some of the challenges that differentiate off-policy RL in high-risk settings from other contexts. For example, we account for sampling bias and uncertainty to yield causally valid inference. Additionally, our resulting policy functions are interpretable by domain experts. We also provide measures of statistical efficiency, which are crucial in our settings of large but finite and noisy data.
In Chapter 1, we propose an offline policy and value function learning method based on Bayesian RL. Our estimated policy is optimal and safe: it handles uncertainty through hypothesis testing, allows for different levels of risk aversion, and is interpretable. We provide consistency results and a regret bound, which establishes sample efficiency. These theoretical results hold regardless of the risk-aversion threshold and the quality of the expert policy.
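As a stylized illustration of the safety mechanism (the notation below is assumed for exposition and is not taken verbatim from the dissertation), a hypothesis-testing policy defers to an expert or behavior policy $\pi_0$ unless the evidence that an alternative action is better clears a risk-aversion level $\alpha$:
$$
\pi(s) =
\begin{cases}
a^\ast(s) = \arg\max_a \widehat{Q}(s,a), & \text{if } H_0\colon Q\big(s, a^\ast(s)\big) \le Q\big(s, \pi_0(s)\big) \text{ is rejected at level } \alpha,\\
\pi_0(s), & \text{otherwise,}
\end{cases}
$$
where the test can be carried out against the posterior distribution of $Q$; a smaller $\alpha$ encodes greater risk aversion and more frequent deferral to the expert.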
Chapter 2 develops a semi-supervised RL (SSRL) method for $Q$-learning and doubly robust off-policy evaluation. SSRL is particularly relevant to EHR data, where outcome information is often not well coded but rather embedded in clinical notes. Our approach leverages a small dataset in which true outcomes are observed and a large dataset containing only outcome surrogates. We provide theoretical results for our estimators to quantify how much efficiency can be gained from SSRL. Our method is at least as efficient as the supervised approach and is robust to misspecification of the imputation models.
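For context, a standard single-stage doubly robust value estimate for a target policy $\pi$ (shown only as a reference point; the dissertation's estimators extend this idea to the multi-stage, semi-supervised setting) combines an outcome model $\widehat{Q}$ with inverse-propensity weighting:
$$
\widehat{V}_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n}\left[\widehat{Q}\big(S_i,\pi(S_i)\big) + \frac{\mathbb{1}\{A_i=\pi(S_i)\}}{\widehat{\mu}(A_i\mid S_i)}\Big(Y_i-\widehat{Q}(S_i,A_i)\Big)\right],
$$
which remains consistent if either the outcome model $\widehat{Q}$ or the propensity model $\widehat{\mu}$ is correctly specified. In the semi-supervised setting, $Y_i$ is observed only in the small labeled subset, and an imputation model fit on that subset, using the outcome surrogates, supplies predictions for the large unlabeled set.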
Chapter 3 seeks the optimal DTR that maximizes the value function, estimated non-parametrically. We frame this as a multi-stage classification problem. To address the discontinuity of the objective function, we use a smooth surrogate for the value function. In particular, we characterize a family of smooth surrogate functions that are Fisher consistent, and we provide a regret bound tailored to the non-parametric estimation method. In addition, the smoothness of the surrogate value function makes the method scalable to large sample sizes.
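To make the smoothing idea concrete (a single generic surrogate is shown; the function $\phi$ and the bandwidth $h$ are illustrative, and the dissertation characterizes a whole family of Fisher-consistent surrogates rather than this particular choice), with binary treatments coded as $A_t \in \{-1,+1\}$, the inverse-probability-weighted value of decision rules $d_t(H_t) = \mathrm{sign}\{f_t(H_t)\}$ involves indicator functions that can be replaced by a smooth approximation:
$$
\widehat{V}(f_1,\dots,f_T) = \mathbb{P}_n\!\left[\frac{Y\prod_{t=1}^{T}\mathbb{1}\{A_t f_t(H_t)>0\}}{\prod_{t=1}^{T}\pi_t(A_t\mid H_t)}\right]
\;\approx\;
\mathbb{P}_n\!\left[\frac{Y\prod_{t=1}^{T}\phi\{A_t f_t(H_t)/h\}}{\prod_{t=1}^{T}\pi_t(A_t\mid H_t)}\right],
$$
where $\phi$ is a smooth, monotone approximation to the indicator (e.g., a distribution function) and $h>0$ is a bandwidth. The right-hand side is differentiable in the decision functions $f_t$, which is what enables non-parametric estimation and keeps the method scalable to large samples.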
Keywords
doubly robust, Dynamic treatment regimes, electronic health records, off-policy learning, Reinforcement Learning, semi-supervised learning, Statistics, Artificial intelligence
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.