Publication:
INFERNO: Accelerating Inference with Model Compression via Layer-Wise and Task-Specific Pruning

Date

2024-05-09

Citation

Noel, Asher J. 2024. INFERNO: Accelerating Inference with Model Compression via Layer-Wise and Task-Specific Pruning. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Research Data

Abstract

Machine learning inference latency, the time required to compute a model's output, is increasingly a bottleneck in production systems. Compressing models so that they require fewer, or smaller, computations is the dominant paradigm for increasing inference throughput; however, existing methods are limited by expensive pre-training or a lack of task-specific fine-tuning. In this thesis, we present INFERNO, a library designed to accelerate inference for Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) by introducing two novel model compression algorithms. First, for CNNs, INFERNO introduces greedy layer-wise structured pruning: given a task, INFERNO achieves a >10x reduction in training cost at comparable accuracy, or a 30% gain in accuracy at comparable cost, relative to random sampling, and a larger reduction relative to state-of-the-art (SOTA) data-center pre-training. Second, for LLMs, INFERNO introduces task-specific pruning on a small sample of unlabeled data, yielding 1-5% accuracy gains with a 50%-pruned LLaMA-2 7B compared to existing SOTA LLM pruning methods. We compare INFERNO to existing acceleration libraries and argue that there is likely substantial room for further compression.
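The abstract names the two algorithms without implementation detail. Below is a minimal sketch of what greedy layer-wise structured pruning could look like for a PyTorch CNN; the L1-norm channel-importance score, the zero-masking stand-in for actual structured removal, and all function names are illustrative assumptions, not INFERNO's published API.

```python
# Sketch: greedy layer-wise structured pruning for a PyTorch CNN.
# Assumptions (not from the thesis): L1-norm filter importance, and
# zero-masking as a proxy for physically removing channels.
import torch
import torch.nn as nn

@torch.no_grad()
def filter_scores(conv: nn.Conv2d) -> torch.Tensor:
    # Score each output channel by the L1 norm of its filter weights.
    return conv.weight.abs().sum(dim=(1, 2, 3))

@torch.no_grad()
def mask_lowest_channels(conv: nn.Conv2d, n: int) -> None:
    # Zero out the n lowest-scoring still-active filters; zero-masking
    # approximates structured removal without rebuilding the layer.
    scores = filter_scores(conv)
    active = (scores > 0).nonzero().flatten()
    idx = active[scores[active].argsort()[:n]]
    conv.weight[idx] = 0.0
    if conv.bias is not None:
        conv.bias[idx] = 0.0

def greedy_layerwise_prune(model, evaluate, channels_per_step=8, steps=4):
    # Each step tentatively prunes `channels_per_step` filters in every
    # conv layer and commits the prune that hurts accuracy least.
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    for _ in range(steps):
        best = None
        for conv in convs:
            saved_w = conv.weight.detach().clone()
            saved_b = conv.bias.detach().clone() if conv.bias is not None else None
            mask_lowest_channels(conv, channels_per_step)
            acc = evaluate(model)      # accuracy on a small held-out set
            with torch.no_grad():      # undo the tentative prune
                conv.weight.copy_(saved_w)
                if saved_b is not None:
                    conv.bias.copy_(saved_b)
            if best is None or acc > best[0]:
                best = (acc, conv)
        mask_lowest_channels(best[1], channels_per_step)  # commit best layer
    return model
```

For the LLM side, the abstract's key idea is calibrating the pruning criterion on a small sample of unlabeled task data rather than on generic calibration text. The sketch below applies that idea to one linear layer at 50% unstructured sparsity, using a Wanda-style score (|weight| x input-activation norm); the criterion is a stand-in assumption, not necessarily INFERNO's.

```python
# Sketch: task-specific 50% pruning of one linear layer. Assumption:
# a Wanda-style importance score computed on activations collected by
# running a few unlabeled task examples through the preceding layers.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_linear_task_specific(linear: nn.Linear,
                               task_inputs: torch.Tensor,
                               sparsity: float = 0.5) -> None:
    # task_inputs: (num_samples, in_features) activations from task data.
    act_norm = task_inputs.norm(p=2, dim=0)   # per-input-feature norm
    score = linear.weight.abs() * act_norm    # (out, in) importance
    k = max(1, int(sparsity * score.numel()))
    threshold = score.flatten().kthvalue(k).values
    linear.weight[score <= threshold] = 0.0   # drop lowest-importance weights
```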

Keywords

Inference, Large Language Models, Machine Learning, Model Compression, Neural Architecture Search, Structured Pruning, Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
