Publication:

Scheduling Algorithms for Low-Precision Accumulation on Energy-Efficient Deep Neural Network Hardware

Date

2025-09-05

The Harvard community has made this article openly available.

Citation

Natesh, Vikas. 2025. Scheduling Algorithms for Low-Precision Accumulation on Energy-Efficient Deep Neural Network Hardware. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

In just a few years, deep neural networks (DNNs) have advanced from image classification and machine translation to large language models (LLMs) capable of processing and generating hundreds of thousands of tokens. Like their predecessors based on convolutional and transformer architectures, modern LLMs rely on extensive matrix multiplications interleaved with non-linear operations. Unlike earlier models, however, LLM dot products routinely exceed 30,000 elements, imposing severe costs in energy and latency—particularly in edge deployments where resources are limited. Enabling efficient inference in such settings demands both novel hardware architectures and new algorithmic strategies for computation scheduling.

At the core of these workloads lies accumulation: the repeated addition of products within large dot products. Narrowing the bitwidth of accumulators can dramatically reduce hardware complexity, energy consumption, and latency. Yet overly narrow accumulators increase the risk of numerical overflow, degrading model accuracy. Existing solutions mitigate this by retraining models with modified weights, a process that is costly for large models and often infeasible due to data access constraints.
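
To make the overflow risk concrete, here is a minimal Python sketch, not taken from the thesis, that sums integer partial products in a signed accumulator of a chosen bitwidth. The function name `accumulate`, the saturating (clamping) overflow behavior, and the sample values are illustrative assumptions; real hardware may wrap around instead of clamping.

```python
def accumulate(products, bits):
    """Sum integer products in a signed accumulator that is `bits` wide.

    Assumption: overflow saturates (clamps) to the representable range.
    Returns (final_value, number_of_overflow_events).
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    acc = 0
    overflows = 0
    for p in products:
        acc += p
        if acc < lo or acc > hi:
            overflows += 1
            acc = max(lo, min(hi, acc))  # clamp to the accumulator range
    return acc, overflows


# Hypothetical partial products; the exact sum is 5.
products = [100, 90, -80, -95, 60, -70]

print(accumulate(products, bits=8))   # (-58, 1): one overflow corrupts the result
print(accumulate(products, bits=16))  # (5, 0): the wide accumulator is exact
```

The final sum fits comfortably in 8 bits, yet a transient partial sum does not, which is the situation that scheduling the order of accumulation can address.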

This thesis introduces algorithmic and architectural techniques that enable low-bitwidth accumulation without retraining or altering the model weights. We first present Alternating Greedy Schedules (AGS), which partition the summation into positive and negative subarrays and order their accumulation to avoid overflow. We then propose Markov Greedy Sums (MGS), which leverage the statistical structure of DNN partial products to maximize use of a narrow accumulator and resort to a wide accumulator only when necessary. Finally, we develop Alternating Balancing Sums (ABS), a buffer-based reordering scheme that provably maintains narrow accumulation for most operations, deferring wide accumulation to a short final phase. Together, these techniques establish a family of scheduling strategies that systematically balance precision, energy, and latency.
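
The idea of ordering positive and negative terms to keep the running sum small can be illustrated with a short Python sketch. This is a generic sign-alternating greedy ordering written for illustration, assuming integer partial products; it is not the AGS algorithm as specified in the thesis, and the names `alternating_order` and `peak_magnitude` are hypothetical.

```python
def alternating_order(products):
    """Reorder terms so the running sum is pulled back toward zero.

    Illustrative rule (an assumption, not the thesis's AGS): when the running
    sum is non-negative, add a negative term if one remains; otherwise add a
    positive term. Once one sign is exhausted, the rest are added as they come.
    """
    pos = [p for p in products if p >= 0]
    neg = [p for p in products if p < 0]
    order, acc = [], 0
    while pos or neg:
        if (acc >= 0 and neg) or not pos:
            p = neg.pop()
        else:
            p = pos.pop()
        acc += p
        order.append(p)
    return order


def peak_magnitude(seq):
    """Largest absolute value reached by the running sum of `seq`."""
    acc = peak = 0
    for p in seq:
        acc += p
        peak = max(peak, abs(acc))
    return peak


# Hypothetical partial products; the exact sum is 5.
products = [100, 90, -80, -95, 60, -70]
reordered = alternating_order(products)

print(peak_magnitude(products))   # 190: exceeds the signed 8-bit maximum of 127
print(peak_magnitude(reordered))  # 85: fits in a signed 8-bit accumulator
```

Under this rule the running sum stays within the largest term magnitude while both signs remain, and afterwards moves monotonically toward the final sum. As a result, no transient overflow occurs provided every term and the final sum fit in the accumulator.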

We evaluate our designs across diverse DNN inference tasks, showing that they substantially reduce accumulator bitwidth requirements and energy consumption while preserving accuracy—all without retraining. These results demonstrate a path toward scalable, energy-efficient hardware for next-generation DNNs and LLMs, enabling their deployment across a broader range of platforms from cloud servers to edge devices.

Keywords

Computer science, Electrical engineering, Artificial intelligence

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
