Publication: How to Get Transformers to Process in Steps
Abstract
As pointed out by Daniel Kahneman, there are at least two qualitatively distinct modes of cognition that take place in the human brain: fast and automatic thinking, labeled “System 1”, and slow and methodical thinking, labeled “System 2”. Progress in Artificial Intelligence requires approaches for modeling both kinds of cognition with computers. Classical programming is very effective for solving many tasks from the domain of System 2, but is not practical for solving tasks from the domain of System 1. Machine learning is very effective for solving many tasks from the domain of System 1, but has not yet been shown to work robustly on tasks from the domain of System 2. In this work, we investigate if and how machine learning could be successfully applied to algorithmic tasks which are solved via System 2 by humans.
We argue that the standard way of framing machine learning problems, which is suitable for System 1 tasks, is inadequate for System 2 tasks; we propose an alternative and demonstrate its effectiveness. Specifically, while learning a direct mapping from inputs to outputs is feasible for System 1 tasks, we argue that algorithmic System 2 tasks can only be solved by learning a mapping from inputs to outputs through a series of intermediate steps. We first show that, by using enough intermediate steps, a 1-layer Transformer can in principle compute any finite function. We then show empirically that a 1-layer Transformer cannot learn to compute the sum of binary numbers directly from the inputs, but is able to compute the sum when trained to first generate a series of intermediate results. This demonstrates, at a small scale, how a fixed-size neural network can lack the expressivity to encode the direct input-output mapping for an algorithmic task and yet be fully capable of computing the outputs through intermediate steps. Finally, we show that a Frozen Pretrained Transformer is able to learn binary addition when trained to compute the carry bits before the sum, while it fails to learn the task without using the intermediates. These results support our hypothesis that the use of intermediate computations is necessary for tackling algorithmic tasks from the domain of System 2 via machine learning.
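To make the contrast between direct and stepwise targets concrete, the following Python sketch builds both kinds of training target for binary addition. It is an illustration only: the function names, the `a+b=` prompt layout, and the `|` separator between carry bits and sum are assumptions made here, not the data format used in the work itself.

```python
def add_with_carries(a, b):
    """Add two equal-length binary strings (most significant bit first).

    Returns (carry_bits, sum_bits). carry_bits[i] is the carry produced
    at bit position i; sum_bits is one bit longer than the operands.
    """
    if len(a) != len(b):
        raise ValueError("operands must have equal length")
    carry = 0
    carries, sums = [], []
    # Walk from the least significant bit to the most significant bit.
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        sums.append(str(total % 2))
        carry = total // 2
        carries.append(str(carry))
    sums.append(str(carry))  # the final carry becomes the top bit of the sum
    return "".join(reversed(carries)), "".join(reversed(sums))


def make_example(a, b, with_intermediates):
    """Build a (prompt, target) pair in a hypothetical text format."""
    carries, total = add_with_carries(a, b)
    prompt = f"{a}+{b}="
    # Stepwise target: emit the carry bits first, then the sum.
    target = f"{carries}|{total}" if with_intermediates else total
    return prompt, target


# 6 + 7 = 13
print(make_example("0110", "0111", with_intermediates=False))  # ('0110+0111=', '01101')
print(make_example("0110", "0111", with_intermediates=True))   # ('0110+0111=', '0110|01101')
```

In the stepwise target, each sum bit depends only on the two operand bits at that position and the adjacent carry bit already written out, which is the sense in which intermediate results shorten the computation needed per output token.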