Publication: Emergence of rich visual features from general architectural and learning constraints
Abstract
The human visual system supports a wide variety of behavioral capacities, including categorizing objects, reading written symbols, comparing numerosities, and detecting social interactions. What constraints and forms of optimization are necessary to create the representations that support these different tasks? One possibility is that specialized visual representations are optimized separately for each of these visual capacities. Another possibility is that generic architectural and learning constraints produce features that solve a variety of tasks. Here, I investigate whether generically optimized features in convolutional neural networks can account for representational signatures measured in human perceptual experiments. In Chapter 1, I investigate whether features optimized for object categorization or features optimized for letter categorization better account for human visual perception of letters. In two large-scale online behavioral experiments, object-trained feature spaces corresponded better to the visual similarity of letters than letter-trained feature spaces did. In addition, altering object-trained networks with experience-dependent letter specialization did not improve the match to the behavioral data. These results support the idea that general object-based features are reused when recognizing letters. In Chapter 2, I build on findings that feature tuning for numerosity emerges in convolutional neural networks trained on object categorization (Nasr et al., 2019). I find that a self-supervised AlexNet model trained on ImageNet exhibits four signatures of human numerical perception: decreasing number representations for grouped, bounded, and connected items, and increasing number representations for coherently oriented sets. Thus, humanlike numerical features can emerge when training neural networks to discriminate broadly between different views of the world, without any specialized number-processing constraints.
The visual system is sensitive to socially relevant spatial arrangements (Papeo, 2020), for example, two people positioned face-to-face (a facing dyad). In Chapter 3, I find that untrained AlexNet and self-supervised AlexNet have features reflecting human visual perception of dyads. These features prefer facing dyads over non-facing dyads, exhibit an inversion effect, and prefer dyads over person-object pairs. These results suggest that generic architectural constraints are sufficient to produce features with socially relevant tuning. Collectively, this dissertation proposes that generically optimized visual features can account for human visual perception in domains as disparate as written symbol recognition, approximate number comparisons, and social interaction detection. Broadly, these findings support the idea that generic constraints produce a rich set of visual features that can support a variety of behavioral tasks.