Publication: Identifying Decision Points for Safe and Interpretable Batch Reinforcement Learning
Abstract
In batch reinforcement learning (RL), the agent cannot explore the environment but instead learns to act from a fixed set of sample trajectories. Intuitively, one can only expect to make policy improvements for states where multiple actions have been tested in the batch data. This is important from a safety perspective because standard policy learning methods tend to be overly optimistic about unobserved, potentially risky actions. Furthermore, action variation presents natural opportunities for human experts to understand the consequences of different options. We propose a principled framework for the identification of decision points, or states with high action variation under the behavior policy, and their applications in safety and interpretability. Towards safe policy learning, we present a new action-constrained variant of fitted Q iteration to prevent large deviations from the behavior policy. Empirical results from simulated environments show that learned policies robustly improve on behavior performance while avoiding extrapolation error. Towards interpretability, we present an algorithm for simplifying complex MDP environments in terms of decision regions. We test our methodology on the MIMIC medical dataset, obtaining summaries of action effects with potential for future use in human-in-the-loop policy learning. Overall, our framework shows promise for simpler, more robust reinforcement learning through the lens of decision-making.
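The two ideas in the abstract, decision points and action-constrained fitted Q iteration, can be illustrated with a minimal tabular sketch. This is not the authors' algorithm: the abstract does not give the exact definition of "high action variation" or the form of the constraint, so the count threshold `min_count`, the function names, and the tabular averaging update below are all assumptions made for illustration. The sketch treats a state as a decision point when at least two actions were each observed `min_count` or more times in the batch, and it restricts the max in the Q backup to those sufficiently observed actions.

```python
from collections import Counter, defaultdict


def allowed_actions(transitions, min_count=2):
    """Map each state to the actions observed at least `min_count` times.

    `transitions` is a list of (state, action, reward, next_state) tuples;
    next_state is None for terminal transitions. The threshold is a
    hypothetical stand-in for the paper's notion of action variation.
    """
    counts = defaultdict(Counter)
    for state, action, _reward, _next_state in transitions:
        counts[state][action] += 1
    return {
        state: {a for a, n in c.items() if n >= min_count}
        for state, c in counts.items()
    }


def decision_points(allowed):
    """States where at least two actions are sufficiently supported."""
    return {state for state, acts in allowed.items() if len(acts) >= 2}


def constrained_q_iteration(transitions, allowed, gamma=0.9, n_iters=50):
    """Tabular Q iteration whose backup only maximizes over allowed actions.

    Assumed simplification: the 'fit' step is a per-(state, action) average
    of bootstrapped targets rather than a function approximator.
    """
    q = defaultdict(float)
    for _ in range(n_iters):
        targets = defaultdict(list)
        for s, a, r, s2 in transitions:
            next_acts = allowed.get(s2, set())
            # Rarely observed actions are excluded from the max, so their
            # (possibly over-optimistic) values never propagate backwards.
            best_next = max((q[(s2, a2)] for a2 in next_acts), default=0.0)
            targets[(s, a)].append(r + gamma * best_next)
        q = defaultdict(float, {k: sum(v) / len(v) for k, v in targets.items()})
    return q
```

In this sketch, an action that appears only once in a state still receives a Q estimate, but that estimate is never used when bootstrapping values for earlier states, which is one simple way to limit deviation from the behavior policy.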