Deployable Online Reinforcement Learning Algorithms

Date

2025-05-15

Citation

Trella, Anna. 2025. Deployable Online Reinforcement Learning Algorithms. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Online reinforcement learning (RL) algorithms are increasingly used in real-world settings where dynamic environments may render offline algorithms ineffective. Such algorithms are desirable in this setting because they learn and improve future decision-making using continually collected data. Applications include robotics, recommender systems, fine-tuning large language models, and digital health. However, there are many constraints on deploying online RL algorithms in the real world. Common challenges include limited data (sparse, partially observable, etc.), accounting for the complexity of the environment, ensuring stability and autonomy of the algorithm, and facilitating interpretability and explainability of the algorithm. In this thesis, we have the use-inspired goal of making online RL deployable and stable in real-world settings. To do so, we provide a full end-to-end pipeline for online RL deployment. We start with guidelines for making various design decisions for the algorithm before deployment. We highlight the reward design as one of the most important design decisions. Next, we provide a framework for creating a monitoring system to ensure the algorithm runs stably and autonomously during deployment. Then, we cover post-deployment analyses that can be conducted to (1) explain what the algorithm learned and (2) re-evaluate algorithm design for the next deployment. To make the ideas in these three stages concrete, we use real examples from the online RL algorithm deployed in the Oralytics clinical trial. Finally, we study a theoretical non-stationary bandit problem inspired by the non-stationarity in many real-world problems. We conclude by discussing various open research problems for online reinforcement learning deployment.

Keywords

Bandit Algorithms, Model Deployment, Online Reinforcement Learning, Sequential Decision-Making, Artificial Intelligence

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
