Publication: Hardware-Software Codesign for High-Performance Cloud Networks
Date
2020-10-09
Citation
Li, Yuliang. 2020. Hardware-Software Codesign for High-Performance Cloud Networks. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
The cloud is part of the daily life of billions of people and carries most of the computation happening on the planet. To deliver this computing power, the network plays a core role in connecting hundreds of thousands of machines inside datacenters. However, as new cloud applications (e.g., large-scale deep learning, high-performance computing) and new architectures (e.g., resource disaggregation, more heterogeneous hardware accelerators) demand ever-higher performance, the network is becoming the bottleneck, and performance problems are very difficult to troubleshoot. Both issues stem from shortcomings in two essential networking tasks: control and telemetry.
* Congestion control and clock synchronization are the two control tasks critical for application performance. However, because they are not robust to the high dynamics of traffic and failures, they have to sacrifice normal-case performance to cope with the worst cases in production.
* We also need precise and fine-grained telemetry for performance troubleshooting. However, we often miss important information and cannot pinpoint the exact culprits, because existing telemetry systems supported by switches and hosts are either imprecise or coarse-grained.
To tackle these challenges, we set qualitatively better objectives than existing approaches: introducing robustness to the performance-critical control tasks, and designing telemetry systems that are both precise and fine-grained. Achieving the new objectives is challenging due to the resource and observation limitations of network devices. Fortunately, new programmable switches and NICs make it possible to codesign different devices, leveraging the strengths of each to collaboratively achieve breakthroughs and realize the new objectives.
Based on the new opportunities for codesign, we have three key design principles: (1) closing the gap between observation and control to make control precise and timely, (2) designing new algorithms and data structures to make effective use of different devices' capabilities, and (3) rethinking the division of labor among switches, hosts, and the controller with a paradigm shift away from the self-contained design model. Guided by the principles, we design novel network control and telemetry schemes that achieve the new objectives.
To robustly provide high performance under dynamics, we design novel control schemes that close the gap between observation and control. We design HPCC, a congestion control scheme that uses a novel metric for both observation and control, and we use new programmability to deliver switch states to hosts to help calculate the new observation metric. We also design Sundial, a clock synchronization scheme that uses a backup plan precomputed by the centralized controller to enable fast failure recovery based on device-local observation.
We also design precise and fine-grained telemetry systems. For switches, we design FlowRadar and LossRadar to expose precise flow and loss information, by dividing the maintenance of hash tables into simple per-packet updates in switches and small amounts of complex computation in the controller. For the host TCP stack, we design DETER, in which hosts only record 0.03% of traffic and the controller can replay per-packet, per-line-of-code information.
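The split between per-packet switch updates and controller-side computation can be illustrated with a small sketch. It is not taken from the dissertation: it assumes an XOR-encoded counting table in the style of an invertible Bloom lookup table, and every name and parameter in it (`SwitchTable`, `decode`, `K`, `CELLS_PER_ROW`) is hypothetical. A Python `set` stands in for the membership filter a real switch would need.

```python
import hashlib

K = 3               # hash functions per flow (assumed value)
CELLS_PER_ROW = 64  # cells in each of the K disjoint sub-tables (assumed value)


def cell_indexes(flow_id):
    """Map a flow ID to one cell in each of the K sub-tables."""
    indexes = []
    for row in range(K):
        digest = hashlib.sha256(f"{row}:{flow_id}".encode()).digest()
        offset = int.from_bytes(digest[:4], "big") % CELLS_PER_ROW
        indexes.append(row * CELLS_PER_ROW + offset)
    return indexes


class SwitchTable:
    """Switch side: constant-time, per-packet updates only."""

    def __init__(self):
        n = K * CELLS_PER_ROW
        self.flow_xor = [0] * n      # XOR of all flow IDs hashed to the cell
        self.flow_count = [0] * n    # number of distinct flows in the cell
        self.packet_count = [0] * n  # packets from all flows in the cell
        self.seen = set()            # stand-in for a Bloom filter (assumption)

    def on_packet(self, flow_id):
        cells = cell_indexes(flow_id)
        if flow_id not in self.seen:  # first packet of a new flow
            self.seen.add(flow_id)
            for c in cells:
                self.flow_xor[c] ^= flow_id
                self.flow_count[c] += 1
        for c in cells:
            self.packet_count[c] += 1


def decode(table):
    """Controller side: peel cells holding exactly one flow until none remain."""
    flow_xor = list(table.flow_xor)
    flow_count = list(table.flow_count)
    packet_count = list(table.packet_count)
    decoded = {}
    progress = True
    while progress:
        progress = False
        for c in range(len(flow_xor)):
            if flow_count[c] == 1:
                flow_id, packets = flow_xor[c], packet_count[c]
                decoded[flow_id] = packets
                # Remove the recovered flow from every cell it touched.
                for d in cell_indexes(flow_id):
                    flow_xor[d] ^= flow_id
                    flow_count[d] -= 1
                    packet_count[d] -= packets
                progress = True
    return decoded


if __name__ == "__main__":
    table = SwitchTable()
    for flow_id, packets in [(101, 3), (202, 1), (303, 5)]:
        for _ in range(packets):
            table.on_packet(flow_id)
    print(decode(table))
```

In this sketch the switch performs only a few XORs and increments per packet, while the iterative peeling runs entirely in the controller. Peeling can stall when too many flows share cells, so the table size must be provisioned for the expected number of flows.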
Our systems have had broad impact. Since we designed HPCC in 2019, it has not only been deployed in Alibaba Cloud but is also supported by many switch and NIC vendors (Intel, Mellanox, Broadcom, Cisco, Innovium, Marvell, etc.); in addition, Alibaba, Intel, and Mellanox are actively writing an IETF draft of HPCC, with the latest version in September 2020. We built a prototype of Sundial at Google and deployed it in a test cluster with >500 servers in mid-2020, and we show the performance improvement that Sundial brings to Spanner and Swift. FlowRadar and LossRadar have also resulted in a joint patent with Barefoot/Intel, and Alibaba is very interested in using the technique for loss detection. Finally, DETER’s Linux kernel-based implementation is open-sourced, and in the thesis we reveal several TCP problems that arise when running large Spark and RPC systems.
Keywords
Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service