Blocked Algorithms for Neural Networks: Design and Implementation on GPUs
Tillet, Philippe G.
Citation: Tillet, Philippe G. 2020. Blocked Algorithms for Neural Networks: Design and Implementation on GPUs. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: The recent emergence of Deep Neural Networks (DNNs) for machine learning has been largely enabled by the widespread availability of massively parallel computing devices. In particular, Graphics Processing Units (GPUs) have played a critical role, allowing the evaluation of increasingly large DNNs on increasingly large datasets. Unfortunately, the development of efficient programs for GPUs remains laborious, requiring advanced knowledge of specialized compute resources (e.g., tensor cores) and complex memory hierarchies (e.g., caches). This has made it challenging to write efficient and reusable libraries for novel research ideas (e.g., sparsity) in the field of Deep Learning.
In this thesis, we argue that programming paradigms based on blocked algorithms can facilitate the construction of high-performance compute kernels for neural networks. We specifically revisit traditional "single program, multiple data" execution models for GPUs, and propose a variant in which programs -- rather than threads -- are blocked. We show that algorithms expressed using this paradigm define iteration spaces composed of a collection of blocks whose shape and schedule can be automatically optimized using context-aware auto-tuning and block-level data-flow analysis, respectively. We present the design and implementation of these novel techniques in the Triton language and compiler for blocked algorithms, and achieve significant speed-ups over state-of-the-art libraries (cuBLAS/cuDNN) for a wide range of matrix multiplication and convolution tasks commonly encountered in practice.
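To make the notion of a blocked algorithm concrete, the following is a minimal NumPy sketch (not code from the thesis, and not Triton itself) of a blocked matrix multiplication: each tile of the output is computed independently, mirroring how a blocked program in the paradigm described above would be assigned one tile per program instance. The function name and block size are illustrative assumptions.

```python
import numpy as np

def blocked_matmul(A, B, block=32):
    """Blocked (tiled) matrix multiply: each output tile C[i:i+b, j:j+b]
    is computed by reducing over blocks of the K dimension. In a blocked
    GPU program, each (i, j) tile would be one program instance; the tile
    shape is the kind of parameter a compiler could auto-tune."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block):            # tile rows of C
        for j in range(0, N, block):        # tile columns of C
            acc = np.zeros((min(block, M - i), min(block, N - j)), dtype=A.dtype)
            for k in range(0, K, block):    # blocked reduction over K
                acc += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
            C[i:i+block, j:j+block] = acc
    return C
```

The key point is that the iteration space is expressed over blocks rather than individual scalars, so the block shape (here `block`) and the schedule of the loops become tunable parameters rather than fixed implementation details.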
Finally, we show how this approach can facilitate the development of efficient compute kernels for important emerging neural network architectures. We focus specifically on block-sparse self-attention mechanisms in transformers, and demonstrate significant performance gains for training tasks involving long sequence lengths.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368966
Collections: FAS Theses and Dissertations