Publication: Advancing System-Level Analysis and Design of Specialized Architectures
Abstract
Over the course of the past decade, computation has increasingly spread to the cloud and mobile devices. With the growing computational demands of contemporary cloud and mobile workloads, architects have increasingly turned to hardware specialization. Once the niche of standardized computation like video decoding and audio processing, hardware accelerators have now expanded into many more fields. Fueled by the recent explosion of demand for deep neural networks, a huge amount of effort has been poured into advancing state-of-the-art accelerator architectures and designs. However, current research into specialized hardware often overlooks evaluation of the complete system, which can lead to suboptimal designs. First, a large fraction of the total power consumed by a system-on-chip (SoC) is actually due to CPUs, but the accuracy of widely used CPU power models has not been thoroughly validated. Second, while we know how to design efficient accelerators in isolation, we have less understanding of how SoC integration impacts their performance and power. In addition, we have not explored how we can leverage SoC-accelerator interfaces to improve efficiency. Finally, architects have mostly explored “deep” acceleration, which focuses on compute-heavy workloads with hot functions, but we have largely ignored “broad” acceleration, which aims to accelerate common low-level routines present across a diverse set of workloads. This dissertation presents the case for a holistic approach to accelerator design that accounts for the surrounding system’s constraints, both for “deep” acceleration and “broad” acceleration. First, it presents a comprehensive validation of McPAT, a widely used CPU power model, with a quantitative analysis of its sources of error. Second, it presents gem5-Aladdin, a complete SoC simulator that can model complex specialized SoCs and can run end-to-end accelerated workloads without the need to write any RTL. 
Third, this dissertation shows how considerations of system-level effects and SoC interfacing during accelerator design can dramatically improve its overall efficiency, with a deep dive into accelerating deep neural networks and vision pipelines. Finally, it leverages recent work in datacenter system-wide profiling to make a case for broad acceleration. It presents the design of an accelerator for dynamic memory allocation, a widely used programming paradigm that accounts for a significant fraction of total CPU cycles in a major cloud provider’s datacenters. The work presented in this dissertation identifies both challenges and opportunities for extracting maximum performance from acceleration at the system level, both for traditional deep acceleration and broad acceleration in the cloud. We hope it will stimulate more interest and spur further research and development for holistic accelerator design.