dc.contributor.advisor: Kung, H. T.
dc.contributor.advisor: Cox, David
dc.contributor.author: Tillet, Philippe G.
dc.date.accessioned: 2021-08-04T05:02:05Z
dc.date.created: 2020
dc.date.issued: 2020-10-06
dc.date.submitted: 2020-11
dc.identifier.citation: Tillet, Philippe G. 2020. Blocked Algorithms for Neural Networks: Design and Implementation on GPUs. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
dc.identifier.other: 28150983
dc.identifier.uri: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368966
dc.description.abstract: The recent emergence of Deep Neural Networks (DNNs) for machine learning has been largely enabled by the widespread availability of massively parallel computing devices. In particular, Graphics Processing Units (GPUs) have played a critical role, allowing the evaluation of increasingly large DNNs on increasingly large datasets. Unfortunately, the development of efficient programs for GPUs remains laborious, requiring advanced knowledge of specialized compute resources (e.g., tensor cores) and complex memory hierarchies (e.g., caches). This has made it challenging to write efficient and reusable libraries for novel research ideas (e.g., sparsity) in the field of Deep Learning. In this thesis, we argue that programming paradigms based on blocked algorithms can facilitate the construction of high-performance compute kernels for neural networks. We specifically revisit traditional "single program, multiple data" execution models for GPUs, and propose a variant in which programs -- rather than threads -- are blocked. We show that algorithms expressed using this paradigm define iteration spaces composed of a collection of blocks whose shape and schedule can be automatically optimized using context-aware auto-tuning and block-level data-flow analysis, respectively. We present the design and implementation of these novel techniques in the Triton language and compiler for blocked algorithms, and achieve significant speed-ups over state-of-the-art libraries (cuBLAS/cuDNN) for a wide range of matrix multiplication and convolution tasks commonly encountered in practice. We finally show how this approach can facilitate the development of efficient compute kernels for some important emerging neural network architectures. We specifically focus on block-sparse self-attention mechanisms in transformers, and demonstrate significant performance gains for training tasks involving long sequence lengths. [An illustrative code sketch of this blocked programming model follows the record below.]
dc.format.mimetype: application/pdf
dc.language.iso: en
dash.license: LAA
dc.subject: Computer science
dc.title: Blocked Algorithms for Neural Networks: Design and Implementation on GPUs
dc.type: Thesis or Dissertation
dash.depositing.author: Tillet, Philippe G.
dc.date.available: 2021-08-04T05:02:05Z
thesis.degree.date: 2020
thesis.degree.grantor: Harvard University Graduate School of Arts and Sciences
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D.
dc.contributor.committeeMember: Kung, H. T.
dc.contributor.committeeMember: Cox, David
dc.contributor.committeeMember: Brooks, David
dc.type.material: text
thesis.degree.department: Engineering and Applied Sciences - Computer Science
dc.identifier.orcid: 0000-0003-1881-4072
dash.author.email: phil.tillet@gmail.com
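
Note: the abstract above describes Triton's block-level "single program, multiple data" model, in which each program instance operates on a block (tile) of values rather than each thread operating on a single scalar. The sketch below is purely illustrative and is not taken from the dissertation: it uses the present-day open-source Triton Python API (the thesis presents the original Triton-C language and compiler), and the kernel name, block size, and launch grid are assumptions chosen for clarity.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each *program* (not each thread) handles one block of BLOCK_SIZE elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the final, partially filled block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x and y are assumed to be CUDA tensors of the same shape.
        out = torch.empty_like(x)
        n = out.numel()
        # 1-D launch grid: one program instance per block of elements.
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

In this style, the compiler reasons about whole blocks such as "offsets" rather than individual threads; the shape of those blocks (fixed here at 1024 for simplicity) and their schedule onto the hardware are the quantities that the thesis proposes to optimize automatically through context-aware auto-tuning and block-level data-flow analysis.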

