Regret-Based Algorithms for Multi-Armed Bandits
Citation
Zhao, Kevin Hanbo. 2020. Regret-Based Algorithms for Multi-Armed Bandits. Bachelor's thesis, Harvard College.
Abstract
Multi-armed bandits (MABs) are the single-step analogs of the full reinforcement learning problem. They are simple but powerful models for studying the exploration/exploitation dilemma in the context of making sequential decisions over time and under uncertainty. There are several notions of optimality in the literature: asymptotic optimality, regret optimality, and PAC optimality. In this work, we examine the multi-armed bandit problem (sometimes called the sequential allocation problem) within the context of regret optimality. We cover several groundbreaking algorithms that approach MABs from both frequentist and Bayesian perspectives. UCB1 was one of the first algorithms to achieve a logarithmic regret bound, and it sparked a new line of literature on upper-confidence-bound algorithms. UCB-V was one of the first works to improve on the regret bound of UCB1, but it is still not "optimal". We later introduce KL-UCB, Thompson Sampling, and Bayes UCB, all of which achieve asymptotic regret optimality (in the Bernoulli reward setting). We then perform experiments to illustrate the robustness of these algorithms and how well they perform under different reward distributions and slight perturbations to their respective assumptions.
Terms of Use
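To make the abstract's description of UCB1 concrete, the following is a minimal sketch of the algorithm in the Bernoulli reward setting, assuming the standard exploration bonus sqrt(2 ln t / n_a); the arm means and horizon here are illustrative, not taken from the thesis.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Minimal UCB1 sketch for Bernoulli-reward bandits.

    arm_means: true success probabilities (used only to simulate rewards;
    the algorithm itself observes only the sampled rewards).
    Returns the realized regret and the pull counts per arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # number of pulls per arm
    sums = [0.0] * k        # cumulative reward per arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # play each arm once to initialize estimates
        else:
            # pick the arm maximizing empirical mean + exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward

    regret = horizon * max(arm_means) - total_reward
    return regret, counts
```

The bonus term shrinks as an arm is pulled more often, so under-explored arms keep getting tried while the empirically best arm dominates in the long run, which is what yields the logarithmic regret bound mentioned above.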
This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Citable link to this page
https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364663
Collections
- FAS Theses and Dissertations [6136]