Publication:
Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning

Date

2023-05-12

Citation

Tambe, Thierry. 2023. Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

The unabated pursuit of omniscient and omnipotent AI is levying hefty latency, memory, and energy taxes at all computing scales. At the same time, the twilight of Dennard scaling means that performance gains are no longer attained proportionally with reductions in transistor feature size -- compelling a global trend toward application-based hardware specialization. Over the course of my PhD, I have built a heterogeneous set of solutions, co-optimized across the algorithm, architecture, and silicon stack, to generate breakthrough advances in arithmetic performance, compute density and flexibility, and energy efficiency for on-chip machine learning (ML), and natural language processing (NLP) in particular. My work aims to significantly widen the application space of embedded ML computing, in both the inference and training regimes, by coalescing innovations spanning the algorithm, memory subsystem, hardware architecture, and circuit layers, while tuning their designs and inter-dependencies to promote greater performance, energy efficiency, and reliability within a silicon chip system.

On the algorithm front, this thesis discusses best-paper-award-winning work on a novel floating-point-based data type, AdaptivFloat, which enables resilient quantized AI computations and is particularly suitable for NLP networks with wide parameter distributions. To evaluate AdaptivFloat's impact on a real system, this thesis describes a 16nm chip prototype that integrates FlexASR, a programmable hardware accelerator with AdaptivFloat-based processing elements, specialized for the attention-based recurrent neural networks used in speech and machine translation workloads. We further verify FlexASR's fidelity to the front-end AI application via a formal hardware/software compiler interface.

Toward the goal of lowering the prohibitive energy cost of inferencing large language models on TinyML devices, this dissertation describes a principled algorithm-hardware co-design solution, validated in a 12nm chip tapeout, that accelerates Transformer workloads by tailoring the accelerator's latency and energy expenditures to the complexity of the input query it processes. Finally, recognizing that the overwhelming majority of the data generated during deep learning training is very short-lived, this thesis proposes non-conventional embedded dynamic RAMs (eDRAMs) as the main on-chip storage medium for ML training data -- which, together with tightly coupled algorithmic alterations and custom hardware specialization, yields significant energy efficiency advantages over conventional SRAMs.
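As a rough sketch of the AdaptivFloat idea described above (illustrative only, not the dissertation's implementation), the NumPy routine below quantizes a tensor using a per-tensor exponent bias derived from the tensor's maximum magnitude, so that the format's dynamic range tracks the layer's value distribution. The bias rule, bit-width split, and the name adaptivfloat_quantize are assumptions made for illustration.

    import numpy as np

    def adaptivfloat_quantize(x, n_bits=8, n_exp=3):
        # Illustrative AdaptivFloat-style quantizer: 1 sign bit, n_exp
        # exponent bits, and (n_bits - 1 - n_exp) mantissa bits, with a
        # per-tensor exponent bias chosen from the tensor's max magnitude.
        n_man = n_bits - 1 - n_exp
        # Align the largest representable exponent with the data's range.
        max_exp = int(np.floor(np.log2(np.max(np.abs(x)) + 1e-12)))
        bias = max_exp - (2 ** n_exp - 1)          # assumed bias rule
        exp_min, exp_max = bias, bias + 2 ** n_exp - 1
        sign, mag = np.sign(x), np.abs(x)
        # Per-element exponent, clamped to the representable range.
        exp = np.clip(np.floor(np.log2(mag + 1e-38)), exp_min, exp_max)
        scale = 2.0 ** exp
        # Round the normalized mantissa to n_man fractional bits.
        man = np.round(mag / scale * 2 ** n_man) / 2 ** n_man
        q = sign * man * scale
        # Saturate at the largest representable magnitude.
        q_max = (2.0 - 2.0 ** -n_man) * 2.0 ** exp_max
        return np.clip(q, -q_max, q_max)

For example, adaptivfloat_quantize(w, n_bits=8, n_exp=3) maps a weight tensor w to 8-bit AdaptivFloat-style values whose exponent range is anchored to w's largest magnitude, which is what makes the format resilient for tensors with wide value distributions.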

Keywords

Deep Learning, Embedded DRAMs, Hardware Accelerators, Natural Language Processing, Number Systems, System-on-Chips, Electrical engineering, Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
