Publication: INFERNO: Accelerating Inference with Model Compression via Layer-Wise and Task-Specific Pruning
Date
2024-05-09
Authors
Noel, Asher J.
The Harvard community has made this article openly available.
Citation
Noel, Asher J. 2024. INFERNO: Accelerating Inference with Model Compression via Layer-Wise and Task-Specific Pruning. Bachelor's thesis, Harvard University Engineering and Applied Sciences.
Abstract
Machine learning inference latency, or the time required to compute an output of a machine learning model, is increasingly becoming a bottleneck in production systems. Compressing models so that they require fewer, or smaller, computations is the dominant paradigm for increasing inference throughput; however, existing methods are limited by expensive pre-training or a lack of task-specific fine-tuning. In this thesis, we present INFERNO, a library designed to accelerate inference for Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) by introducing two novel model compression algorithms. First, for CNNs, INFERNO introduces greedy layer-wise structured pruning: given a task, INFERNO achieves a >10x reduction in training cost at comparable accuracy, or a 30% gain in accuracy at comparable cost, relative to random sampling, and a larger reduction relative to state-of-the-art (SOTA) data-center pre-training. Second, for LLMs, INFERNO introduces task-specific pruning on a small sample of unlabeled data, yielding 1-5% accuracy gains with a 50%-pruned LLAMA-2 7B compared to existing SOTA LLM pruning methods. We compare our work to existing acceleration libraries and argue that there is likely substantial room for further compression.
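
The abstract describes the two compression algorithms only at a high level. As a rough illustration of what a greedy layer-wise structured pruning loop for CNNs can look like, the sketch below visits convolutional layers one at a time and keeps a layer's pruning only if accuracy on a small held-out sample stays within a budget. This is a minimal sketch, not the INFERNO implementation: the PyTorch L1-structured criterion, the evaluate helper, and the max_acc_drop budget are assumptions introduced purely for illustration, and the thesis's task-specific LLM pruning on unlabeled data is not shown.

# Minimal sketch of a greedy, layer-wise structured pruning loop.
# NOTE: this is NOT the INFERNO implementation. The L1 importance criterion,
# the per-layer accuracy budget (max_acc_drop), and the evaluate helper are
# assumptions introduced purely for illustration.
import copy

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def evaluate(model: nn.Module, data_loader, device: str = "cpu") -> float:
    """Accuracy of `model` on a small held-out sample."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in data_loader:
            preds = model(inputs.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.numel()
    return correct / max(total, 1)


def greedy_layerwise_prune(model: nn.Module, data_loader,
                           amount: float = 0.3,
                           max_acc_drop: float = 0.01) -> nn.Module:
    """Visit conv layers one at a time, prune the lowest-L1-norm output
    channels in each, and keep a layer's pruning only if held-out accuracy
    stays within `max_acc_drop` of the running baseline."""
    baseline = evaluate(model, data_loader)
    conv_names = [n for n, m in model.named_modules() if isinstance(m, nn.Conv2d)]
    for name in conv_names:
        candidate = copy.deepcopy(model)
        layer = dict(candidate.named_modules())[name]
        # Zero out `amount` of the output channels (dim=0) with the smallest L1 norm.
        prune.ln_structured(layer, name="weight", amount=amount, n=1, dim=0)
        prune.remove(layer, "weight")  # bake the pruning mask into the weights
        acc = evaluate(candidate, data_loader)
        if baseline - acc <= max_acc_drop:
            model, baseline = candidate, acc  # accept this layer's pruning
    return model

Under these assumptions, one would call greedy_layerwise_prune(cnn, small_val_loader) with a trained CNN and a small labeled validation loader; the greedy, per-layer acceptance test is what makes the procedure task-specific without any data-center-scale pre-training.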
Keywords
Inference, Large Language Models, Machine Learning, Model Compression, Neural Architecture Search, Structured Pruning, Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the repository's Terms of Service.