Publication:

Cooperative Multi-Agent Graph Bandits

Loading...
Thumbnail Image

Date

2025-05-16

Published Version

Published Version

Journal Title

Journal ISSN

Volume Title

Publisher

The Harvard community has made this article openly available. Please share how this access benefits you.

Research Projects

Organizational Units

Journal Issue

Citation

Paschalidis, Phevos. 2025. Cooperative Multi-Agent Graph Bandits. Bachelors Thesis, Harvard University Engineering and Applied Sciences.

Abstract

This thesis studies multi-agent, sequential decision making under restrictions on consecutive actions by extending the multi-armed bandit (MAB) paradigm, a fundamental online learning framework with widespread applications in healthcare, recommendation systems, dynamic pricing, and generative artificial intelligence. Specifically, we present the multi-agent graph bandit problem in which N cooperative agents navigate a connected graph G with K nodes. At each time step, agents play an action associated with their current location and observe a random reward. The edges of G thus represent restrictions on which actions can be played sequentially. Unlike in the existing multi-agent MAB literature, we define a coupled reward function, with the total reward of the system formulated as a weighted sum of the rewards sampled by individual agents.

To address this novel learning scenario, we present the Multi-G-UCB algorithm, which episodically estimates, transitions to, and samples from the multiset of nodes that give optimal system-wide reward. We bound our algorithm's regret (a traditional performance measure in online learning settings) by O(γ√(NKT log T) + γDNK log T), where T is the time horizon, D the diameter of G, and γ a boundedness parameter associated with the weight functions. Our agent-average regret bound is tighter than that of any single-agent algorithm and, when T is large, our collective regret matches state-of-the-art bounds from related but simpler bandit formulations.

Motivated by challenges introduced in real-world deployments, we generalize our framework to model agent and communication failures and develop a robust implementation of our algorithm, Robust-Multi-G-UCB, that achieves the same asymptotic performance in the robust setting. We empirically demonstrate that our algorithm surpasses several important benchmarks and performs well in a diverse array of challenging problem instances before finally recording a hardware demonstration deploying Robust-Multi-G-UCB in real-time on a swarm of light-sensing TurtleBots.

Description

Other Available Sources

Research Data

Keywords

Hardware Demonstration, Multi-Agent Learning, Multi-Armed Bandit, Regret Analysis, UCB Algorithm, Computer science, Statistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service

Endorsement

Review

Supplemented By

Related Stories