Efficient Implementations of Sparse and Quantized Deep Neural Networks Using Systolic Arrays
Citation: McDanel, Bradley. 2019. Efficient Implementations of Sparse and Quantized Deep Neural Networks Using Systolic Arrays. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: Deep Neural Networks (DNNs) have achieved state-of-the-art performance across a variety of domains, including many natural language processing and computer vision tasks. Though DNNs span such a wide assortment of applications, the fundamental computations performed by many of these networks are the same: a series of matrix multiplications between learned weight matrices and intermediate representations of input samples. This commonality has led to special-purpose hardware being developed for efficient implementations of the matrix multiplications performed in DNNs. Generally, these types of special-purpose hardware assume that the weights of a DNN are fixed by a previous training phase and cannot be changed.
This work focuses on modifying the DNN training process to learn weight matrices which can be efficiently implemented in some target special-purpose hardware, such as a systolic array. Specifically, we introduce column combining, which trains sparse DNNs by jointly optimizing the objective function while ensuring that the learned sparse weight matrices can be packed into a denser representation when deployed in the systolic array. This leads to significant improvements (8x) in the utilization efficiency of the systolic array over a standard training approach that does not consider the target hardware.
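Conceptually, column combining packs several sparse weight columns into one systolic-array column when their nonzero entries do not collide in any row. The following is a minimal Python sketch of that packing idea, using a simple greedy grouping rule; the function name and the zero-conflict rule are our illustrative assumptions, not the dissertation's exact algorithm (which also interacts with training and can tolerate a bounded number of conflicts).

```python
import numpy as np

def column_combine(W):
    """Greedily group sparse columns of W so that, within a group,
    each row holds at most one nonzero weight. Each group can then
    occupy a single (denser) systolic-array column.
    Illustrative sketch only -- a zero-conflict greedy rule."""
    groups = []  # list of (occupied_rows: set, column_indices: list)
    for j in range(W.shape[1]):
        rows = set(np.flatnonzero(W[:, j]))
        for occupied, cols in groups:
            if not rows & occupied:      # no row conflict: combine
                occupied.update(rows)
                cols.append(j)
                break
        else:                            # conflicts everywhere: new group
            groups.append((rows, [j]))
    return [cols for _, cols in groups]

# Four sparse columns pack into two systolic-array columns.
W = np.array([[1, 0, 2, 0],
              [0, 3, 0, 0],
              [0, 0, 0, 4]])
print(column_combine(W))  # [[0, 1, 3], [2]]
```

Under this rule, columns 0, 1, and 3 share no nonzero rows and are combined, halving the number of array columns needed for this toy matrix.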
We then extend column combining to support the training of a specific type of quantized DNN, where each weight is a power of two. This allows for a multiplication-free systolic array design, as multiplication by power-of-two weights can be performed using bit shift operations. We introduce a Selector-Accumulator systolic cell, which uses register chains shared across each column in the systolic array to implement these bit shift operations. Using shared register chains decreases power consumption of the systolic array by approximately 2.5x compared to 8-bit fixed-point weights.
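The arithmetic behind the multiplication-free design can be shown in a few lines: with each weight stored as a sign and an exponent (weight = sign * 2^exp), every multiply in a dot product reduces to a shift followed by an add or subtract. This is only a software illustration of the idea; the function name and signature are ours, and the hardware Selector-Accumulator cell realizes the shifts via shared register chains rather than shift instructions.

```python
def po2_dot(xs, exps, signs):
    """Dot product with power-of-two weights (sign * 2**exp).
    Each product is a bit shift plus an add/subtract -- no
    multiplier is needed. Illustrative sketch only."""
    acc = 0
    for x, e, s in zip(xs, exps, signs):
        p = x << e if e >= 0 else x >> -e   # shift replaces multiply
        acc = acc + p if s >= 0 else acc - p
    return acc

# weights 4 (= 2**2) and -2 (= -(2**1)): 3*4 - 4*2 = 4
print(po2_dot([3, 4], [2, 1], [1, -1]))  # 4
```

Negative exponents correspond to fractional power-of-two weights and map to right shifts, e.g. multiplying by 1/4 is a shift right by two.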
Finally, we discuss how to scale up the number of systolic arrays which can be used in parallel to support large DNNs with billions of weights. We propose using 3D integrated circuits (3D-IC) to mitigate the communication costs between systolic arrays incurred by a conventional 2D approach. We demonstrate that systolic arrays implemented on 3D-IC can reduce the latency and lower the power consumption of CNN inference.
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029543
Collections: FAS Theses and Dissertations