Publication: INFERNO: Accelerating Inference with Model Compression via Layer-Wise and Task-Specific Pruning
Abstract
Machine learning inference latency, the time required to compute a model's output, is increasingly a bottleneck in production systems. Compressing models so that they perform fewer or smaller computations is the dominant paradigm for increasing inference throughput; however, existing methods are limited by expensive pre-training or a lack of task-specific fine-tuning. In this thesis, we present INFERNO, a library designed to accelerate inference for Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) by introducing two novel model compression algorithms. First, for CNNs, INFERNO introduces greedy layer-wise structured pruning, which, given a task, yields a >10x reduction in training cost at comparable accuracy, or a 30% gain in accuracy at comparable cost, relative to random sampling, and a larger reduction relative to state-of-the-art (SOTA) data-center pre-training. Second, for LLMs, INFERNO introduces task-specific pruning on a small sample of unlabeled data, resulting in 1-5% accuracy gains for a 50%-pruned LLAMA-2 7B compared to existing SOTA LLM pruning methods. We compare our work to existing acceleration libraries and argue that there is likely substantial room for further compression.