Performance Analysis for Machine Learning Applications
Citation: Wang, Yu. 2020. Performance Analysis for Machine Learning Applications. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: Performance analysis has been driving the advancements of software and hardware systems for decades. Proper analysis can reveal system and architectural bottlenecks, provide essential information for choosing frameworks and platforms, and further lead to performance optimizations. Systematic performance analysis is difficult since it involves various problem domains, algorithms, software stacks, systems, and hardware. Recently, the surge of machine learning applications and domain-specific architectures has stimulated rapid evolution along all of those dimensions, making performance analysis increasingly challenging.
To tackle this problem, this dissertation conducts deep and systematic performance analysis for a variety of workloads, software frameworks, and hardware systems, and demonstrates proper ways to apply several performance analysis methodologies. First, we study the performance analysis methodology for general-purpose processors, a traditional and mature field, based on CPU benchmarks, and demonstrate the importance of using representative benchmarks to draw insights about hardware systems. Next, with the lessons learned from traditional methods, we rigorously study deep learning frameworks by applying proper analysis methods in the corresponding scenarios. We extract the performance implications of key design features from those frameworks, and the insights are distilled into a set of simple guidelines for tuning framework features. The proposed guidelines nearly close the performance gap between the state of the art and the global optimum. Further, we propose a systematic methodology to facilitate performance analysis for rapidly evolving deep learning models and platforms. The proposed methodology can reveal deeper insights that are difficult to discover with traditional approaches. We demonstrate its utility for deep learning by comparing two generations of specialized hardware (Google's Tensor Processing Unit v2 and v3), three heterogeneous platforms (TPU, GPU, and CPU), and different versions of specialized software (TensorFlow and CUDA). Finally, since machine learning techniques advance rapidly and architects need to be aware of emerging applications, we take the first step towards analyzing Bayesian inference, an important branch of machine learning, and propose optimization mechanisms based on the analysis.
With the methodologies and analysis presented in this dissertation, we hope to encourage researchers and engineers to apply our methodologies to new platforms, software, and applications for systematic performance analysis. We envision this helping to resolve existing performance bottlenecks and to guide the design of better software and hardware for current and future applications.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365132
Collection: FAS Theses and Dissertations