Publication: Cooperative Multi-Agent Graph Bandits
Open/View Files
Date
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
This thesis studies multi-agent, sequential decision making under restrictions on consecutive actions by extending the multi-armed bandit (MAB) paradigm, a fundamental online learning framework with widespread applications in healthcare, recommendation systems, dynamic pricing, and generative artificial intelligence. Specifically, we present the multi-agent graph bandit problem in which N cooperative agents navigate a connected graph G with K nodes. At each time step, agents play an action associated with their current location and observe a random reward. The edges of G thus represent restrictions on which actions can be played sequentially. Unlike in the existing multi-agent MAB literature, we define a coupled reward function, with the total reward of the system formulated as a weighted sum of the rewards sampled by individual agents.
To address this novel learning scenario, we present the Multi-G-UCB algorithm, which episodically estimates, transitions to, and samples from the multiset of nodes that give optimal system-wide reward. We bound our algorithm's regret (a traditional performance measure in online learning settings) by O(γ√(NKT log T) + γDNK log T), where T is the time horizon, D the diameter of G, and γ a boundedness parameter associated with the weight functions. Our agent-average regret bound is tighter than that of any single-agent algorithm and, when T is large, our collective regret matches state-of-the-art bounds from related but simpler bandit formulations.
Motivated by challenges introduced in real-world deployments, we generalize our framework to model agent and communication failures and develop a robust implementation of our algorithm, Robust-Multi-G-UCB, that achieves the same asymptotic performance in the robust setting. We empirically demonstrate that our algorithm surpasses several important benchmarks and performs well in a diverse array of challenging problem instances before finally recording a hardware demonstration deploying Robust-Multi-G-UCB in real-time on a swarm of light-sensing TurtleBots.