Off Policy Reinforcement Learning for Real-World Settings
Access Status: Full text of the requested work is not available in DASH at this time ("dark deposit").
Sonabend Worthalter, Aaron Michael
Citation: Sonabend Worthalter, Aaron Michael. 2021. Off Policy Reinforcement Learning for Real-World Settings. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: In this dissertation, we aim to adapt reinforcement learning (RL) to real-world, high-risk settings. We study how to optimize sequential decision-making in complex settings with large observational data repositories where exploration is infeasible. In particular, we are motivated by estimating optimal dynamic treatment regimes (DTR) with electronic health records (EHR). We address some of the challenges that differentiate off-policy RL in high-risk settings from other contexts. For example, we account for sampling bias and uncertainty to yield causally valid inference. Additionally, our resulting policy functions are interpretable by domain experts. We also provide measures of statistical efficiency, which are crucial in our settings of large but finite and noisy data.
In Chapter 1, we propose an offline policy and value function learning method based on Bayesian RL. The estimated policy is both optimal and safe: it handles uncertainty through hypothesis testing, allows for different levels of risk aversion, and is interpretable. We provide consistency results and a regret bound, which establishes sample efficiency. The theoretical results are independent of the risk-aversion threshold and of the quality of the expert policy.
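The safety mechanism described above can be pictured as a posterior test at each state: deviate from the expert's action only when the posterior evidence that the candidate action is better clears a risk-aversion threshold. The sketch below is a minimal illustration of that idea; the function name, the two-action setup, and the Monte Carlo test are assumptions for illustration, not the dissertation's actual estimator.

```python
import numpy as np

def safe_policy_action(q_samples, expert_action, alpha=0.05):
    """Return the estimated-optimal action, falling back to the expert's
    action when the posterior evidence for deviating is weak.

    q_samples : array of shape (n_posterior_draws, n_actions) holding
        posterior draws of the Q-values at one state (hypothetical input).
    alpha : risk-aversion threshold; a smaller alpha demands stronger
        evidence before deviating from the expert.
    """
    mean_q = q_samples.mean(axis=0)
    candidate = int(np.argmax(mean_q))
    if candidate == expert_action:
        return candidate
    # Posterior probability that the candidate action is NOT better than
    # the expert's action; this plays the role of a one-sided test.
    p_not_better = np.mean(q_samples[:, candidate] <= q_samples[:, expert_action])
    return candidate if p_not_better < alpha else expert_action
```

Tightening `alpha` makes the policy more conservative, mimicking the different levels of risk aversion the chapter allows for.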
Chapter 2 develops a semi-supervised RL (SSRL) method for $Q$-learning and doubly robust off-policy evaluation. SSRL is especially relevant to EHR data, where outcome information is often not well coded but rather embedded in clinical notes. Our approach leverages a small dataset in which true outcomes are observed and a large dataset with outcome surrogates. We provide theoretical results for our estimators that quantify the efficiency gains achievable with SSRL. Our method is at least as efficient as the fully supervised approach and is robust to misspecification of the imputation models.
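The semi-supervised idea can be illustrated with a toy mean-outcome estimator: fit an imputation model on the small labeled set, impute outcomes for the large unlabeled pool, and add a residual correction of the kind used in semi-supervised and doubly robust estimators so that misspecifying the imputation model does not bias the result. Everything here (the linear model, the function name, the simple target) is an illustrative assumption, not the chapter's estimator.

```python
import numpy as np

def ssrl_value_estimate(X_lab, y_lab, X_unlab):
    """Toy semi-supervised estimate of the mean outcome.

    X_lab, y_lab : small labeled set with true outcomes observed.
    X_unlab      : large unlabeled set with covariates (surrogates) only.
    """
    # Least-squares imputation model fitted on labeled data only.
    Z_lab = np.column_stack([np.ones(len(X_lab)), X_lab])
    beta, *_ = np.linalg.lstsq(Z_lab, y_lab, rcond=None)
    Z_unlab = np.column_stack([np.ones(len(X_unlab)), X_unlab])
    y_hat_lab = Z_lab @ beta        # imputed outcomes, labeled set
    y_hat_unlab = Z_unlab @ beta    # imputed outcomes, unlabeled pool
    # Residual correction on the labeled set guards against a
    # misspecified imputation model.
    correction = np.mean(y_lab - y_hat_lab)
    return np.mean(np.concatenate([y_hat_lab, y_hat_unlab])) + correction
```

The efficiency gain comes from averaging imputations over the much larger unlabeled pool, while the correction term preserves validity when the imputation model is wrong.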
Chapter 3 seeks the optimal DTR, namely the regime that maximizes the value function, via non-parametric estimation. We frame this as a multi-stage classification problem. Because the objective function is discontinuous, we replace the value function with a smooth surrogate. In particular, we characterize a family of smooth surrogate functions that are Fisher consistent and provide a regret bound tailored to the non-parametric estimation method. The smoothness of the surrogate also makes the method scalable to large sample sizes.
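The discontinuity mentioned above arises because the value function contains an indicator of whether the observed action agrees with the candidate rule. A common smoothing device, used here purely as a sketch, replaces that indicator with a sigmoid so the objective becomes differentiable; the linear rule, the known 0.5 propensity, and the bandwidth `h` below are all illustrative assumptions, not the chapter's characterization of surrogates.

```python
import numpy as np

def smoothed_value(theta, X, A, Y, h=0.1):
    """Smoothed inverse-probability-weighted value of the linear rule
    d(x) = sign(x @ theta).

    The exact indicator 1{A == d(X)} is replaced by a sigmoid of the
    decision margin, making the objective differentiable in theta.
    Propensities are assumed known and equal to 0.5 (randomized data).
    """
    margin = (X @ theta) * A                 # positive when rule agrees with A
    smooth_agree = 1.0 / (1.0 + np.exp(-margin / h))  # sigmoid surrogate
    return np.mean(Y * smooth_agree / 0.5)
```

Because the surrogate is smooth, `theta` can be fitted with gradient-based optimizers, which is what makes this style of estimator scale to large samples.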
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368374
Collection: FAS Theses and Dissertations