Publication: Uninstructed Structure: Self-Supervised Deep Learning as a Steppingstone to Object Invariance, Intuitive Physics, and the Experience of Beauty
Abstract
The fundamental first step of any externally grounded psychological experience is the rapid organization of richly shaped but largely indecipherable sensory inputs into ecologically relevant units of meaning: in other words, perception. Situated at the bottleneck between mind and matter, perception is often heavily heuristic, unconscious, and ineffable: a 'black box' of algorithmic shortcuts and representational manifolds that evades easy characterization in the finite symbol sets of natural language. Given their cryptic and inaccessible nature, perceptual computations are often treated as a necessary but insufficient condition of intelligent thought and feeling, a mere mapping between the physical world and the psychological abstractions that matter: objects, actions, agents, goals, beliefs, desires, and theories. In this thesis, I challenge this notion by isolating perceptual computations in machine vision systems devoid of all such abstractions and by demonstrating that, far from being a mere mapping, these computations are sufficient for predicting multiple aspects of human psychology across three diverse, increasingly abstract phenomena: the viewpoint-invariant representation of complex objects, the judgment of physical stability in a paradigmatic intuitive physics task, and the affective (or aesthetic) valuation of images. To do this, I adapt classic tools from cognitive (neuro)science, including behavioral psychophysics, physiological probes, and neuroimaging assays, for use in silico, applying them to a diverse set of deep neural network models trained only on canonical computer vision tasks. I focus in particular on a type of deep neural network model trained through contrastive learning, a form of self-supervised learning that operates on distinctions between individual inputs and requires only two data-implicit labels: 'same' versus 'different'.
While it never quite explains the full range and complexity of human behavior in each of the test cases I subject it to, this model's general success in capturing key aspects of the human response to visual stimuli provides strong evidence that perception is more than meets the eye. Indeed, the predictive power of these models suggests that a nontrivial portion of what we consider intelligent behavior may be more directly a function of general-purpose perceptual computations than is often acknowledged, bearing even on those elements of our experience (like the experience of beauty) that we intuitively ascribe to the more ethereal, transcendent aspects of the human psyche. The core conclusion of this thesis is that the right combination of inputs, computational architecture, and training target can serve as a direct route from sensation to psychological sense-making, and that modern self-supervised machine learning algorithms provide a reasonable first approximation of the kinds of structure intelligent biological systems could learn without explicit instruction.
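
The contrastive learning objective described in the abstract, which needs only 'same' versus 'different' labels, can be illustrated with a minimal sketch. The abstract does not specify the loss the thesis uses. The code below assumes a generic InfoNCE-style objective, and the function name `info_nce_loss`, the `temperature` value, and the random embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired embeddings.

    Assumption for illustration: row i of view_a and row i of view_b are
    embeddings of two views of the same input ('same'); every other
    pairing in the batch is treated as 'different'.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)

    # (n, n) similarity matrix; the diagonal holds the 'same' pairs.
    logits = (a @ b.T) / temperature

    # Subtract the row maximum for numerical stability, then apply
    # a log-softmax across each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-probability assigned to the 'same' pair.
    return float(-np.mean(np.diag(log_probs)))


# Hypothetical usage with random vectors standing in for network outputs.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
jittered = embeddings + 0.01 * rng.normal(size=(8, 16))
unrelated = rng.normal(size=(8, 16))

loss_matched = info_nce_loss(embeddings, jittered)
loss_unmatched = info_nce_loss(embeddings, unrelated)
```

In this sketch the loss is low when each embedding is most similar to its own partner and high when the pairing carries no information, so minimizing it pushes a network to tell individual inputs apart without any category labels.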