Publication:
Efficient Implementations of Sparse and Quantized Deep Neural Networks Using Systolic Arrays

Date

2019-05-16

Citation

McDanel, Bradley. 2019. Efficient Implementations of Sparse and Quantized Deep Neural Networks Using Systolic Arrays. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Deep Neural Networks (DNNs) have achieved state-of-the-art performance across a variety of domains, including many natural language processing and computer vision tasks. Although DNNs span such a wide assortment of applications, the fundamental computations performed by many of these networks are the same: a series of matrix multiplications between learned weight matrices and intermediate representations of input samples. This commonality has led to the development of special-purpose hardware for efficient implementation of the matrix multiplications performed in DNNs. Generally, such special-purpose hardware assumes that the weights of a DNN are fixed by a previous training phase and cannot be changed. This work focuses on modifying the DNN training process to learn weight matrices that can be efficiently implemented on target special-purpose hardware, such as a systolic array.

Specifically, we introduce column combining, which trains sparse DNNs by jointly optimizing the objective function while ensuring that the learned sparse weight matrices can be packed into a denser representation when deployed in the systolic array. This leads to significant improvements (8x) in the utilization efficiency of the systolic array over a standard training approach that does not consider the target hardware.

We then extend column combining to support the training of a specific type of quantized DNN in which each weight is a power of two. This allows for a multiplication-free systolic array design, as multiplication by power-of-two weights can be performed using bit-shift operations. We introduce a Selector-Accumulator systolic cell, which uses register chains shared across each column of the systolic array to implement these bit-shift operations. Using shared register chains decreases the power consumption of the systolic array by approximately 2.5x compared to 8-bit fixed-point weights.

Finally, we discuss how to scale up the number of systolic arrays that can be used in parallel to support large DNNs with billions of weights. We propose using 3D integrated circuits (3D-IC) to mitigate the communication costs between systolic arrays incurred by a conventional 2D approach. We demonstrate that systolic arrays implemented on 3D-IC can reduce the latency and lower the power consumption of CNN inference.
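
The packing step behind column combining can be illustrated with a short sketch. The code below is a minimal, illustrative reconstruction rather than the dissertation's exact algorithm: the group size and the rule of keeping the largest-magnitude weight per row within each group are assumptions made for clarity.

```python
import numpy as np

def combine_columns(W, group_size=8):
    """Pack groups of sparse weight columns into single dense systolic columns.

    Within each group of `group_size` columns, at most one nonzero weight per
    row is kept (here, the largest in magnitude); the rest are pruned. The
    returned `source` array records which original column supplied each kept
    weight, so a systolic cell can select the matching input at inference time.
    """
    rows, cols = W.shape
    n_groups = (cols + group_size - 1) // group_size
    packed = np.zeros((rows, n_groups))
    source = np.full((rows, n_groups), -1, dtype=int)  # -1: no weight kept
    for g in range(n_groups):
        block = W[:, g * group_size:(g + 1) * group_size]
        for r in range(rows):
            nz = np.flatnonzero(block[r])
            if nz.size:
                best = nz[np.argmax(np.abs(block[r, nz]))]
                packed[r, g] = block[r, best]
                source[r, g] = g * group_size + best
    return packed, source
```

The multiplication-free design described above replaces each multiply-accumulate with a shift and an add. The following sketch shows only the arithmetic idea, assuming non-negative integer exponents and explicit sign bits; it does not model the shared register chains of the Selector-Accumulator cell itself.

```python
def pow2_dot(activations, exponents, signs):
    """Dot product with power-of-two weights using only shifts and adds.

    Each weight is sign * 2**exponent, so every multiplication in the
    accumulation reduces to a left shift of the integer activation.
    """
    acc = 0
    for x, e, s in zip(activations, exponents, signs):
        term = x << e  # x * 2**e via bit shift
        acc += term if s >= 0 else -term
    return acc

# Example: weights [+2, -4, +1] encoded as exponents [1, 2, 0] and signs [+1, -1, +1]
assert pow2_dot([3, 1, 5], [1, 2, 0], [1, -1, 1]) == 3 * 2 - 1 * 4 + 5 * 1
```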

Keywords

systolic arrays, neural networks, sparsity, joint optimization, co-design

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
