Publication: Everything Is a Matrix: Minimizing Data Movement and Parameter Count Across the Machine Learning Stack
Abstract
Machine learning has revolutionized natural language processing, computer vision, and beyond. Yet as machine learning models scale in size and capability, the demand for computational resources likewise grows, exposing new challenges in efficient and scalable deployment. Extracting maximal performance from existing hardware is therefore vital to unlocking the next wave of progress in artificial intelligence.
In many modern workloads, matrix operations dominate resource consumption, sometimes accounting for more than 99% of the workload [1]. This thesis therefore treats matrices, the core computational structure underpinning both scientific computing and neural networks, as the central unit of optimization. It presents an array of novel techniques to reduce memory footprint, accelerate computation, and improve overall hardware utilization, and it demonstrates that substantial efficiency gains are achievable by rethinking how data is computed, stored, and compressed.
First, we address dense matrix multiplication by introducing CAKE, a method that partitions computation into optimally shaped blocks to minimize memory bandwidth bottlenecks (Chapter 2). With mCAKE, we extend this method to tensor contractions with any number of loops (Chapter 3). Then, for neural networks exhibiting moderate sparsity, the Rosko framework (Chapter 4) exploits outer-product structure to efficiently skip zero-valued computations and enables the creation of hardware-compatible sparsity patterns through structured pruning.
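To make the two ideas in this paragraph concrete, the following is a minimal sketch of generic loop blocking and of an outer-product formulation that skips zeros. It is not the CAKE, mCAKE, or Rosko algorithm: the function names and the fixed `block` parameter are assumptions chosen for illustration, and the abstract does not say how block shapes are actually chosen.

```python
def naive_matmul(a, b):
    """Reference triple-loop product of two matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            for j in range(n):
                c[i][j] += a[i][p] * b[p][j]
    return c


def blocked_matmul(a, b, block=2):
    """Same product, computed one block-by-block tile at a time.

    Working on small tiles lets each loaded value be reused several times
    before it is evicted from fast memory. `block` is a hypothetical
    tuning knob, not a value taken from the thesis.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


def outer_product_matmul_skip_zeros(a, b):
    """C as a sum of outer products (column p of A) x (row p of B).

    Each zero entry of A contributes nothing, so its whole inner loop
    over a row of B is skipped.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for p in range(k):
        row_b = b[p]
        for i in range(m):
            a_ip = a[i][p]
            if a_ip == 0:
                continue  # skip the zero-valued multiply-adds
            for j in range(n):
                c[i][j] += a_ip * row_b[j]
    return c


a = [[1, 0, 2], [0, 3, 0]]
b = [[1, 2], [3, 4], [5, 6]]
print(blocked_matmul(a, b))
print(outer_product_matmul_skip_zeros(a, b))
```

All three functions compute the same product; they differ only in the order in which operands are visited and in which multiply-adds are performed at all.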
Next, we investigate efficient representations of weight matrices of neural networks using Singular Value Decomposition (SVD) (Chapter 5), enabling both memory savings and accelerated inference. Building on this, we explore low-rank model compression, where the compact forms of decomposed weight matrices facilitate efficient training and adaptive fine-tuning (Chapter 6). We then introduce blockwise knowledge distillation techniques (Chapter 7) that allow highly compressed, SVD-based student models to learn directly from their full-rank teacher counterparts, preserving both efficiency and model accuracy. Lastly, we demonstrate a privacy-preserving framework for distributed inference that splits computation between local devices and cloud servers, ensuring user data labels remain on-device while leveraging powerful cloud-based feature extractors (Chapter 8).
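The memory and compute savings from a low-rank representation follow from simple counting, sketched below. The sketch assumes a weight matrix of shape m x n is replaced by two factors of shapes m x r and r x n, as a truncated SVD with the singular values folded into one factor would give. The function names and the 4096 x 4096, rank-256 sizes are illustrative assumptions, not figures from the thesis.

```python
def dense_params(m, n):
    """Number of stored values in a dense m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Stored values in an (m x r) factor plus an (r x n) factor."""
    return r * (m + n)


def break_even_rank(m, n):
    """Largest rank r for which the two factors are strictly smaller than the dense matrix."""
    return (m * n - 1) // (m + n)


def low_rank_apply(a, b, x):
    """Compute (A B) x as A (B x) without ever forming the m x n product.

    a is m x r, b is r x n, x has length n. The cost is about r * (m + n)
    multiply-adds instead of m * n for the dense matrix.
    """
    bx = [sum(b_row[j] * x[j] for j in range(len(x))) for b_row in b]
    return [sum(a_row[p] * bx[p] for p in range(len(bx))) for a_row in a]


m, n, r = 4096, 4096, 256  # illustrative sizes only
print(dense_params(m, n))                              # 16777216
print(low_rank_params(m, n, r))                        # 2097152
print(dense_params(m, n) / low_rank_params(m, n, r))   # 8.0
print(break_even_rank(m, n))                           # 2047
```

Under these assumptions the factored form only pays off when r is below mn / (m + n), so how much accuracy is lost at such ranks is what determines whether the compression is worthwhile.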
Together, these contributions meaningfully advance the efficiency and scalability of both conventional scientific workloads and the latest state-of-the-art AI models.
Reference: [1] A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” 2020.