Publication: Scaling and Renormalization in Statistical Learning
Date
2024-09-03
Authors
Atanasov, Alexander Blagoev
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
The Harvard community has made this article openly available.
Citation
Atanasov, Alexander Blagoev. 2024. Scaling and Renormalization in Statistical Learning. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Research Data
Abstract
This thesis develops a theoretical framework for understanding the scaling properties of information processing systems in the regime of large data, large model size, and large computational resources. The goal is to explain the impressive performance that deep neural networks have exhibited.
The first part of this thesis examines models linear in their parameters but nonlinear in their inputs. This includes linear regression, kernel regression, and random feature models. Utilizing random matrix theory and free probability, I provide precise characterizations of their training dynamics, generalization capabilities, and out-of-distribution performance, alongside a detailed analysis of sources of variance. A variety of scaling laws observed in state-of-the-art large language and vision models are already present in this simple setting.
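As a concrete illustration of the kind of setting analyzed in this part (this sketch is my own and is not code from the thesis; the power-law covariance spectrum, its exponent, the ridge strength, and all sizes are assumed purely for demonstration), the snippet below fits ridge regression on synthetic Gaussian features and shows the test error falling as the number of training samples grows, the simplest form of a data scaling law.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: Gaussian features whose covariance has a power-law
# spectrum, a fixed "teacher" vector, small label noise, and a small ridge.
# All of these choices are illustrative, not taken from the thesis.
D = 1000                                  # feature dimension
eigs = np.arange(1, D + 1) ** -1.5        # power-law covariance spectrum
w_star = rng.normal(size=D)               # teacher weights
noise, ridge = 1e-2, 1e-3

def test_mse(n_train, n_test=2000):
    """Fit ridge regression on n_train samples; return mean-squared test error."""
    X = rng.normal(size=(n_train, D)) * np.sqrt(eigs)
    y = X @ w_star + noise * rng.normal(size=n_train)
    # Closed-form ridge estimator: (X^T X + ridge * I)^{-1} X^T y
    w_hat = np.linalg.solve(X.T @ X + ridge * np.eye(D), X.T @ y)
    X_test = rng.normal(size=(n_test, D)) * np.sqrt(eigs)
    return np.mean((X_test @ (w_hat - w_star)) ** 2)

# Test error should decay roughly as a power law in the sample size.
for n in [50, 100, 200, 400, 800, 1600]:
    print(f"n = {n:4d}   test MSE ≈ {test_mse(n):.4f}")
```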
The second part of this thesis focuses on representation learning. Leveraging insights from models linear in inputs but nonlinear in parameters, I present a theory of early-stage representation learning in which a network with small weight initialization can learn features without altering the loss. This phenomenon, termed silent alignment, is empirically validated across various architectures and datasets. The idea of starting at small initialization leads naturally to the "maximal update parameterization", μP, which allows for feature learning at infinite width. I present empirical studies showing that practical networks can approach their theoretical infinite-width feature learning limits. Finally, I consider down-scaling the output of a neural network by a fixed constant. When this constant is small, the network behaves as a linear model in parameters; when large, it induces silent alignment. I present theoretical and empirical results on the influence of this hyperparameter on feature learning, performance, and dynamics.
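To make the role of this output-scale hyperparameter concrete, here is a minimal sketch, again my own illustration under assumed conventions rather than code from the thesis: the raw two-layer network output is divided by a constant gamma, the learning rate is scaled by gamma**2 so that training speed stays comparable across scales, and the readout is initialized at zero so the initial function vanishes. The toy single-index task, the network sizes, and these scaling conventions are all assumptions of the sketch. The printed quantities suggest how a small gamma keeps the hidden weights nearly frozen (lazy, effectively linear-in-parameters training), while a large gamma forces them to move and align with the teacher direction.

```python
import numpy as np

# Toy single-index regression task; all sizes, scales, and the lr ~ gamma**2
# convention below are assumptions of this sketch, not the thesis's setup.
rng = np.random.default_rng(0)
n, d, width = 256, 10, 512
X = rng.normal(size=(n, d))
w_teacher = rng.normal(size=d)
w_teacher /= np.linalg.norm(w_teacher)
y = np.sin(X @ w_teacher)

def train(gamma, steps=3000, lr0=0.02):
    """Full-batch GD on 1/2 * MSE with the network output divided by gamma."""
    rng_w = np.random.default_rng(1)                  # same init for every gamma
    W1 = rng_w.normal(size=(width, d)) / np.sqrt(d)   # hidden weights
    W2 = np.zeros(width)                              # zero readout: f = 0 at init
    W1_init = W1.copy()
    lr = lr0 * gamma**2                               # keeps function-space step size comparable
    for _ in range(steps):
        h = np.tanh(X @ W1.T)                         # (n, width) hidden features
        err = (h @ W2) / gamma - y                    # residuals of the rescaled output
        gW2 = h.T @ err / (n * gamma)                 # grad of 1/2*MSE w.r.t. W2
        gW1 = ((np.outer(err, W2) * (1 - h**2)).T @ X) / (n * gamma)  # grad w.r.t. W1
        W1 -= lr * gW1
        W2 -= lr * gW2
    train_mse = np.mean(((np.tanh(X @ W1.T) @ W2) / gamma - y) ** 2)
    movement = np.linalg.norm(W1 - W1_init) / np.linalg.norm(W1_init)
    # Rough proxy for feature learning: alignment of the top right-singular
    # direction of W1 with the teacher direction.
    top_dir = np.linalg.svd(W1, full_matrices=False)[2][0]
    return train_mse, movement, abs(top_dir @ w_teacher)

for gamma in [0.01, 1.0, 10.0]:
    mse, move, align = train(gamma)
    print(f"gamma = {gamma:6.2f}  train MSE = {mse:.4f}  "
          f"relative W1 movement = {move:.3f}  teacher alignment = {align:.3f}")
```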
Description
Other Available Sources
Keywords
Deep Learning, Empirical Deep Learning, High Dimensional Statistics, Random Matrix Theory, Representation Learning, Statistical Physics, Theoretical Physics, Statistics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service