Publication:
Efficiency in warehouse-scale computers: a datacenter tax study

Date

2017-01-25

Published Version

The Harvard community has made this article openly available.

Abstract

Computation has been steadily migrating from isolated on-premise deployments to the datacenters of a small number of large-scale cloud providers. The datacenters powering the cloud, also known as warehouse-scale computers (WSCs), have a unique set of design constraints, balancing efficiency at scale with ever-growing application needs for performance. Designing next-generation server platforms for WSCs after the end of Dennard scaling is one of the most important challenges for computer architects. In order to guide such future designs, we performed the first (to the best of our knowledge) longitudinal profiling study of a live production WSC. Our performance measurements span tens of thousands of machines over several years, while these machines serve the requests of billions of users. Even though we observe significant diversity, both in applications and architectural behaviors, patterns begin to emerge. We identify the "datacenter tax" -- a set of shared low-level software components that comprises almost 30% of all processor cycles in production datacenters. The constituents of this "tax" -- the components necessary for distributed computation (data serialization, compression, etc.) -- are also prime candidates for optimization, both in software and through specialized hardware. The latter case has especially high potential upside, but requires hardware accelerators that are markedly different from traditional designs. These new "broad" accelerators face a unique set of challenges: because calls to tax routines tend to be frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput, and because each accelerator brings a limited amount of overall application speedup, overheads must be kept to a bare minimum. We demonstrate by construction that, while non-trivial, meeting such constraints is possible. Our memory allocation accelerator, Mallacc, reduces the latency of already fast malloc calls by up to 50% while occupying only 0.006% of the silicon area of a typical high-performance core. This thesis identifies the opportunity for broad acceleration and presents first steps towards designing datacenter tax accelerators. We expect that it will spur additional interest, from industry and academia, and will help bridge the gap between research in datacenters and in specialized hardware.

Keywords

Computer Science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
