Publication: Learning to See Agents with Deep Variational Inference
Abstract
Unsupervised agent discovery is the ability to identify and model intentional agents from raw perceptual data without explicit supervision. While neurocognitive theories propose different neural mechanisms for agent perception, including mirror neurons and the superior temporal sulcus (STS), we lack computational algorithms that can fully describe agent perception. Existing computational models of agent perception operate on simplified symbolic inputs rather than the raw perceptual data that biological systems process. We introduce a variational objective, L_VAD, that formulates vision-based agent discovery as structured inference over latent actions. Based on L_VAD, we implement a deep conditional slot-based variational autoencoder, the Variational Agent Discovery (VAD) model. Our model learns internal agent representations directly from raw pixel-based observations and outperforms baselines on predictive tasks, including agent action and goal inference, in three video-game settings. VAD's internal representations generalize robustly to novel agents and environmental configurations, showing an advantage of up to 33% in transfer scenarios. The VAD model exhibits predictive capabilities analogous to those observed in infant cognition studies: it correctly predicts that agents will take efficient paths to goals when environmental constraints change. Analysis of the learned representations reveals a functional decomposition of visual scenes along agent-centric lines, with certain neural features exhibiting human mirror-neuron-like activation patterns across different agents performing the same actions. When incorporated as an auxiliary loss in multi-agent reinforcement learning, our VAD objective improves sample efficiency by 21.8% and final performance by 7.6%.
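
The abstract does not spell out the objective, so the following is only a minimal sketch of the general idea of "inference over latent actions", not the authors' formulation. It assumes a single transition, a single agent, and a small discrete action set, and it computes a standard evidence lower bound (ELBO) on log p(o_next | o). The function names and the toy numbers are hypothetical; the slot-based, pixel-level model described above is not represented here.

```python
import numpy as np

def categorical_elbo(log_lik, log_prior, log_q):
    """Generic ELBO for one transition with a discrete latent action.

    log_lik[k]   = log p(o_next | o, a=k)   (assumed decoder output)
    log_prior[k] = log p(a=k | o)           (assumed action prior)
    log_q[k]     = log q(a=k | o, o_next)   (assumed inference network output)
    Assumes q is strictly positive so that q * log q is finite.
    """
    q = np.exp(log_q)
    expected_log_lik = np.sum(q * log_lik)   # E_q[log p(o_next | o, a)]
    kl = np.sum(q * (log_q - log_prior))     # KL(q || prior)
    return expected_log_lik - kl

def log_marginal(log_lik, log_prior):
    """log p(o_next | o), summing the latent action out exactly."""
    joint = log_lik + log_prior
    m = joint.max()                          # shift for numerical stability
    return m + np.log(np.sum(np.exp(joint - m)))

def exact_log_posterior(log_lik, log_prior):
    """log p(a | o, o_next) via Bayes' rule."""
    return log_lik + log_prior - log_marginal(log_lik, log_prior)

# Toy example with four hypothetical actions (e.g. up, down, left, right).
log_lik = np.log(np.array([0.7, 0.2, 0.05, 0.05]))
log_prior = np.log(np.full(4, 0.25))
log_q_uniform = np.log(np.full(4, 0.25))

elbo_uniform = categorical_elbo(log_lik, log_prior, log_q_uniform)
elbo_exact = categorical_elbo(
    log_lik, log_prior, exact_log_posterior(log_lik, log_prior)
)
evidence = log_marginal(log_lik, log_prior)
```

In this toy setting the bound is loose for a uniform posterior guess and becomes tight when q equals the exact posterior, which is the property that makes maximizing such a bound a way to infer which action best explains an observed change.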