dc.contributor.advisor: Kung, H. T.
dc.contributor.advisor: Cox, David
dc.contributor.author: Tillet, Philippe G.
dc.date.accessioned: 2021-08-04T05:02:05Z
dc.date.created: 2020
dc.date.issued: 2020-10-06
dc.date.submitted: 2020-11
dc.identifier.citation: Tillet, Philippe G. 2020. Blocked Algorithms for Neural Networks: Design and Implementation on GPUs. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
dc.identifier.other: 28150983
dc.identifier.uri: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368966
dc.description.abstract: The recent emergence of Deep Neural Networks (DNNs) for machine learning has been largely enabled by the widespread availability of massively parallel computing devices. In particular, Graphics Processing Units (GPUs) have played a critical role, allowing the evaluation of increasingly large DNNs on increasingly large datasets. Unfortunately, the development of efficient programs for GPUs remains laborious, requiring advanced knowledge of specialized compute resources (e.g., tensor cores) and complex memory hierarchies (e.g., caches). This has made it challenging to write efficient and reusable libraries for novel research ideas (e.g., sparsity) in the field of Deep Learning. In this thesis, we argue that programming paradigms based on blocked algorithms can facilitate the construction of high-performance compute kernels for neural networks. We specifically revisit traditional "single program, multiple data" execution models for GPUs, and propose a variant in which programs -- rather than threads -- are blocked. We show that algorithms expressed using this paradigm define iteration spaces composed of a collection of blocks whose shape and schedule can be automatically optimized using context-aware auto-tuning and block-level data-flow analysis, respectively. We present the design and implementation of these novel techniques in the Triton language and compiler for blocked algorithms, and achieve significant speed-ups over state-of-the-art libraries (cuBLAS/cuDNN) for a wide range of matrix multiplication and convolution tasks commonly encountered in practice. We finally show how this approach can facilitate the development of efficient compute kernels for some important emerging neural network architectures. We specifically focus on block-sparse self-attention mechanisms in transformers, and demonstrate significant performance gains for training tasks involving long sequence lengths. [An illustrative code sketch of this blocked programming model follows the record below.]
dc.format.mimetype: application/pdf
dc.language.iso: en
dash.license: LAA
dc.subject: Computer science
dc.title: Blocked Algorithms for Neural Networks: Design and Implementation on GPUs
dc.type: Thesis or Dissertation
dash.depositing.author: Tillet, Philippe G.
dc.date.available: 2021-08-04T05:02:05Z
thesis.degree.date: 2020
thesis.degree.grantor: Harvard University Graduate School of Arts and Sciences
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D.
dc.contributor.committeeMember: Kung, H. T.
dc.contributor.committeeMember: Cox, David
dc.contributor.committeeMember: Brooks, David
dc.type.material: text
thesis.degree.department: Engineering and Applied Sciences - Computer Science
dc.identifier.orcid: 0000-0003-1881-4072
dash.author.email: phil.tillet@gmail.com
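
Note: the abstract above describes Triton's block-level "single program, multiple data" model, in which each program instance operates on a block (tile) of values rather than each thread operating on a single scalar. The sketch below is purely illustrative and is not taken from the dissertation: it uses the present-day open-source Triton Python API (the thesis presents the original Triton-C language and compiler), and the kernel name, block size, and launch grid are assumptions chosen for clarity.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each *program* (not each thread) handles one block of BLOCK_SIZE elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the final, partially filled block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x and y are assumed to be CUDA tensors of the same shape.
        out = torch.empty_like(x)
        n = out.numel()
        # 1-D launch grid: one program instance per block of elements.
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

In this style, the compiler reasons about whole blocks such as "offsets" rather than individual threads; the shape of those blocks (fixed here at 1024 for simplicity) and their schedule onto the hardware are the quantities that the thesis proposes to optimize automatically through context-aware auto-tuning and block-level data-flow analysis.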

