Publication: Blocked Algorithms for Neural Networks: Design and Implementation on GPUs
Date
2020-10-06
Authors
Tillet, Philippe G.
The Harvard community has made this article openly available.
Citation
Tillet, Philippe G. 2020. Blocked Algorithms for Neural Networks: Design and Implementation on GPUs. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
The recent emergence of Deep Neural Networks (DNNs) for machine learning has been largely enabled by the widespread availability of massively parallel computing devices. In particular, Graphics Processing Units (GPUs) have played a critical role, allowing the evaluation of increasingly large DNNs on increasingly large datasets. Unfortunately, the development of efficient programs for GPUs remains laborious, requiring advanced knowledge of specialized compute resources (e.g., tensor cores) and complex memory hierarchies (e.g., caches). This has made it challenging to write efficient and reusable libraries for novel research ideas (e.g., sparsity) in the field of Deep Learning.
In this thesis, we argue that programming paradigms based on blocked algorithms can facilitate the construction of high-performance compute kernels for neural networks. We specifically revisit traditional "single program, multiple data" execution models for GPUs, and propose a variant in which programs -- rather than threads -- are blocked. We show that algorithms expressed using this paradigm define iteration spaces composed of a collection of blocks whose shape and schedule can be automatically optimized using context-aware auto-tuning and block-level data-flow analysis, respectively. We present the design and implementation of these novel techniques in the Triton language and compiler for blocked algorithms, and achieve significant speed-ups over state-of-the-art libraries (cuBLAS/cuDNN) for a wide range of matrix multiplication and convolution tasks commonly encountered in practice.
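As a minimal illustration of the blocked programming paradigm described above, the following sketch is written against the present-day, publicly released Triton Python API; the syntax may differ from the compiler version evaluated in the thesis, and the block size and tensor shapes are arbitrary choices made only for the example. Each program instance owns one block of the iteration space, and the compiler, rather than the programmer, decides how that block is parallelized and scheduled on the GPU.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        # Each program (not each thread) is assigned one block of the
        # iteration space; intra-block parallelization is left to the compiler.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements            # guard the final partial block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.randn(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)     # one program per block of 1024
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)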
We finally show how this approach can facilitate the development of efficient compute kernels for some important emerging neural network architectures. We specifically focus on block-sparse self-attention mechanisms in transformers, and demonstrate significant performance gains for training tasks involving long sequence lengths.
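To make the block-sparse setting concrete, the sketch below is a plain PyTorch reference, not the optimized Triton kernels evaluated in the thesis. It shows how a block-level sparsity layout restricts which blocks of the attention score matrix are ever computed, so that work scales with the number of nonzero blocks rather than with the full sequence length squared; the block size, sequence length, head dimension, and causal layout are illustrative assumptions.

    import torch

    B, seq_len, d = 64, 1024, 64           # illustrative block size, length, head dim
    n_blk = seq_len // B
    # Block-level layout: layout[i, j] == True means block (i, j) of the
    # attention matrix is materialized; False blocks are skipped entirely.
    layout = torch.tril(torch.ones(n_blk, n_blk, dtype=torch.bool))  # causal example

    q = torch.randn(seq_len, d)
    k = torch.randn(seq_len, d)
    scores = torch.full((seq_len, seq_len), float("-inf"))
    for i in range(n_blk):
        for j in range(n_blk):
            if layout[i, j]:                # only nonzero blocks are computed
                qi = q[i * B:(i + 1) * B]
                kj = k[j * B:(j + 1) * B]
                scores[i * B:(i + 1) * B, j * B:(j + 1) * B] = qi @ kj.T / d ** 0.5
    probs = torch.softmax(scores, dim=-1)   # skipped blocks receive zero probability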
Keywords
Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.