Publication: Computationally Speaking: The Mathematical Foundation of Large Language Models and An Exploration Into How They Tell Stories
Abstract
Over the past year, large language models such as ChatGPT have gained immense popularity, with hundreds of millions of active users. The adoption of these models in everyday tasks marks a significant shift in how we perceive and interact with technology, making it all the more crucial to understand how these new tools work. This thesis aims to elucidate the inner workings of large language models, starting from first principles. We begin with an introduction to foundational machine learning concepts. Next we analyze the underlying architecture of neural networks, focusing on the evolution from basic feed-forward networks to Recurrent Feed-Forward networks, Long Short-Term Memory networks, and most importantly Transformer networks. In this analysis, we highlight key components such as the residual stream vector space and attention block. We then explore the optimization algorithms used to train autoregressive Transformer networks, including deterministic gradient descent, stochastic gradient descent, and Adam, with an emphasis on their convergence properties. Finally, we present current research on Transformer network interpretability, including an ongoing research project about differentiating storytelling modes in the popular large language model Llama2. This thesis underscores that the first step to using machine learning responsibly is to understand it mathematically.