Regret-Based Algorithms for Multi-Armed Bandits
Citation
Zhao, Kevin Hanbo. 2020. Regret-Based Algorithms for Multi-Armed Bandits. Bachelor's thesis, Harvard College.
Abstract
Multi-armed bandits (MABs) are the single-step analogs of the full reinforcement learning problem. They are simple but powerful models for studying the exploration/exploitation dilemma in the context of making sequential decisions over time and under uncertainty. There are several notions of optimality in the literature: asymptotic optimality, regret optimality, and PAC optimality. In this work, we examine the multi-armed bandit problem (sometimes called the sequential allocation problem) within the context of regret optimality. We cover several groundbreaking algorithms that approach MABs from both frequentist and Bayesian perspectives. UCB1 was one of the first algorithms to achieve a logarithmic regret bound, and it sparked a new line of literature on upper-confidence-bound algorithms. UCB-V was one of the first works to improve on the regret bound of UCB1, but it is still not "optimal". We later introduce KL-UCB, Thompson Sampling, and Bayes UCB, all of which achieve asymptotic regret optimality (in the Bernoulli reward setting). We then perform experiments to illustrate the robustness of these algorithms and how well they perform under different reward distributions and slight perturbations to their respective assumptions.
Terms of Use
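To make the abstract's description of UCB1 concrete, the following is a minimal sketch of the algorithm in the Bernoulli reward setting, assuming the standard exploration bonus sqrt(2 ln t / n_a); the arm means and horizon here are illustrative, not taken from the thesis.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Minimal UCB1 sketch for Bernoulli-reward bandits.

    arm_means: true success probabilities (used only to simulate rewards;
    the algorithm itself observes only the sampled rewards).
    Returns the realized regret and the pull counts per arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # number of pulls per arm
    sums = [0.0] * k        # cumulative reward per arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # play each arm once to initialize estimates
        else:
            # pick the arm maximizing empirical mean + exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward

    regret = horizon * max(arm_means) - total_reward
    return regret, counts
```

The bonus term shrinks as an arm is pulled more often, so under-explored arms keep getting tried while the empirically best arm dominates in the long run, which is what yields the logarithmic regret bound mentioned above.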
This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Citable link to this page
https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364663
Collections
- FAS Theses and Dissertations [6136]