Publication: Theory of Learning in Wide Deep Neural Networks
Abstract
In recent years, significant breakthroughs in artificial intelligence have been largely driven by advances in Deep Neural Networks (DNNs). Inspired by the layered, modular organization of the human brain, these computational models have achieved unprecedented success in fields such as image recognition, structural biology, medicine, and language. Their impressive performance is largely due to their ability to extract patterns from training data and generalize them to previously unseen scenarios. The study of how these networks acquire knowledge, commonly known as Deep Learning Theory (DLT), provides valuable insights into how complex networks extract meaningful features from data. Furthermore, examining the mechanisms of generalization in DNNs may yield crucial insights into how biological neural systems perform similar tasks. The remarkable generalization abilities of DNNs raise several fundamental questions: (1) How do DNNs avoid overfitting despite being heavily overparameterized? (2) What is the relation between the strength of feature learning and the DNNs' generalization capabilities? (3) Why do DNNs commonly struggle to learn new tasks flexibly and continuously in changing environments, something the human brain handles routinely and effortlessly, and what factors enable or hinder DNN performance in these scenarios? To explore these questions, this dissertation introduces a framework based on a Bayesian formulation of learning, which allows us to abstract away the complexities of detailed training procedures and focus instead on the structure of the solution space. We develop a theoretical framework for wide DNNs that makes Bayesian learning analytically tractable, and derive its main properties when learning a single task as well as a sequence of tasks.
We first analyze learning in a family of simplified, analytically tractable architectures, Deep Linear Neural Networks (DLNNs), in which each unit has a linear activation function. In the thermodynamic limit, where both the number of training examples and the width of the network become very large while their ratio stays fixed, the statistics of the input-output mapping of the network, averaged over the solution space, can be computed exactly. Our analysis enables the evaluation of critical network properties, including the generalization error, the effects of network width, depth, and training set size, and the roles of regularization and stochasticity during learning. Our theory allows for computation of both system performance and layer-wise data representations.
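As a minimal illustration of the DLNN setting described above, the following NumPy sketch builds a deep linear network and checks that its input-output mapping collapses to a single effective weight vector. The function names, the Gaussian initialization, the 1/sqrt(fan-in) scaling, and the scalar readout are assumptions made for this example, not details taken from the dissertation.

```python
import numpy as np

def init_dlnn(n_in, width, depth, rng):
    """Draw Gaussian weights for `depth` hidden layers and a scalar readout (assumed setup)."""
    sizes = [n_in] + [width] * depth
    hidden = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(depth)]
    readout = rng.standard_normal(width)
    return hidden, readout

def forward(hidden, readout, x):
    """Linear activations: every layer is a matrix product scaled by 1/sqrt(fan_in)."""
    h = x
    for W in hidden:
        h = W @ h / np.sqrt(W.shape[1])
    return readout @ h / np.sqrt(readout.shape[0])

def effective_weights(hidden, readout):
    """Collapse the network into one input-space vector by multiplying the layers."""
    v = readout / np.sqrt(readout.shape[0])
    for W in reversed(hidden):
        v = v @ W / np.sqrt(W.shape[1])
    return v

rng = np.random.default_rng(0)
n_in, width, depth, n_train = 20, 200, 3, 100
hidden, readout = init_dlnn(n_in, width, depth, rng)
x = rng.standard_normal(n_in)

# Ratio of training-set size to width; the thermodynamic limit holds it fixed
# while both quantities grow.
alpha = n_train / width
```

Although the mapping is linear in the input for any fixed weights, the output depends on the weights through a product of matrices, so learning in such a network is still a nonlinear problem in weight space.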
We then heuristically extend our theory to fully connected nonlinear DNNs and validate it numerically. For a more rigorous extension to nonlinear DNNs, we propose a tractable nonlinear architecture, Globally Gated Deep Linear Networks (GGDLNs), which preserves key qualitative features of nonlinear networks while remaining analytically tractable. Compared to DLNNs, GGDLNs exhibit richer and more complex dependencies on network depth, width, and regularization. The gating operation enhances network capabilities by allowing for flexible ways to encode context. In particular, we show that GGDLNs are able to learn multiple tasks with contradicting labels simultaneously, by explicitly incorporating task-relevant information into their gating units.
Finally, we extend our theoretical framework to investigate continual learning (CL) in wide DNNs, where networks sequentially learn new tasks without losing previously acquired knowledge. We first consider the single-head scenario, where a single neural network is used to perform both training and inference on all tasks. For tasks with contradicting labels, with which the single-head architecture struggles, we consider a multi-head architecture with task-specific readouts. In the multi-head scenario, learning a new task involves modifying the shared hidden-layer weights while adding a new task-specific readout, leaving previous readouts untouched. This architecture can be interpreted as a gated network similar to the GGDLN, where the task identity information is incorporated into non-overlapping sets of gating units. These units then activate the corresponding output pathways for each task. Building upon the previously developed Bayesian framework, we introduce order parameters (OPs) that quantify task similarity and accurately predict the degree of forgetting and anterograde interference. Our findings emphasize that task similarity and network depth significantly impact interference in both single-head and multi-head CL setups, highlighting conditions leading to catastrophic interference and suggesting effective strategies for reducing forgetting.
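The multi-head structure described above, and its reading as a gated network, can be sketched as follows. This is an illustrative toy under stated assumptions: the class `MultiHeadNet`, its method names, the single linear hidden layer, and the random weights are all hypothetical, and no training is performed (in the actual setting the shared weights would also change when a new task is learned).

```python
import numpy as np

class MultiHeadNet:
    """Shared hidden layer with one readout vector per task (hypothetical names)."""

    def __init__(self, n_in, width, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.standard_normal((width, n_in))  # shared across tasks
        self.readouts = []                                # one vector per task

    def add_task(self):
        """Append a new task-specific readout; earlier readouts are not touched."""
        self.readouts.append(self.rng.standard_normal(self.W.shape[0]))
        return len(self.readouts) - 1

    def hidden(self, x):
        return self.W @ x / np.sqrt(self.W.shape[1])

    def predict(self, x, task):
        """Multi-head view: pick the readout that belongs to `task`."""
        a = self.readouts[task]
        return a @ self.hidden(x) / np.sqrt(a.shape[0])

    def predict_gated(self, x, gate):
        """Gated view: a one-hot task gate switches on a single output pathway."""
        A = np.stack(self.readouts)                        # (n_tasks, width)
        per_task = A @ self.hidden(x) / np.sqrt(A.shape[1])
        return gate @ per_task

net = MultiHeadNet(n_in=10, width=50)
first = net.add_task()
saved = net.readouts[first].copy()
second = net.add_task()

x = np.ones(10)
gate = np.eye(2)[second]  # non-overlapping gating units: one per task
```

With a one-hot gate, `predict_gated` returns the same output as `predict` for the selected task, which is the sense in which task-specific readouts act like gating units that activate separate output pathways.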
In summary, this dissertation presents a comprehensive theoretical analysis of learning in wide DNNs for both single and sequential tasks, demonstrating how generalization and internal representations critically depend on network architecture, hyperparameters, and task structures. These insights lay the groundwork for future exploration into generalization within more complex and practically relevant DNN architectures, and offer potential pathways toward understanding the neural mechanisms underlying representation learning and generalization in biological neural systems.