Publication: Improving Developer’s Productivity for Heterogeneous Cloud Networks
Date
2022-06-06
Authors
Gao, Jiaqi
The Harvard community has made this article openly available.
Citation
GAO, JIAQI. 2022. Improving Developer’s Productivity for Heterogeneous Cloud Networks. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
The cloud is an integral part of our society, the basis and actuation of our world. Cloud networks evolve rapidly to provide better performance and more diverse functionality for applications running in the cloud. Traditional software-implemented network functions and team assignments can no longer drive this evolution; heterogeneity has become the new enabler. Cloud networks introduce new domain-specific accelerators to improve network functions’ performance and assign finer-grained teams to design more, better-optimized network functions. However, heterogeneity overwhelms developers with information and reduces their productivity, especially in the development and troubleshooting phases: developers need expertise in every kind of hardware, plus careful optimization, to develop network functions with decent performance, and they also need to understand other teams’ domain knowledge to send incidents to the correct team and resolve them in time.
We design and develop four high-level programs that replace this heterogeneity with uniform, simple, and flexible high-level abstractions. Developers interact directly with the abstractions and no longer need to deal with the heterogeneity of hardware and teams. Each program encodes the heterogeneity and uses efficient algorithms to bridge the gap between the high-level abstraction and the low-level details.
For distributed programmable switch programming, we designed Lyra, which provides a one-big-pipeline programming interface and compiles high-level programs into low-level, runnable programs for each programmable switch in the network. Lyra defines one language synthesizer and one resource model per target and uses an SMT solver to find a solution that fits within the resource constraints and guarantees the program’s correctness.
We designed Vela for programming network functions on a host accelerated by a programmable NIC. Vela introduces a target-specific performance model that estimates the processing throughput of a code snippet installed on the programmable NIC, and an efficient heuristic algorithm that searches for the program allocation plan with the best overall performance.
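As a rough illustration of searching for an allocation plan with a performance model, the following is a minimal sketch and not Vela's actual model or algorithm. It assumes a linear chain of snippets whose overall throughput is limited by its slowest stage, and a single NIC resource budget. The names `Snippet`, `chain_throughput`, `greedy_plan`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    name: str
    cpu_mpps: float  # hypothetical estimated throughput on the host CPU
    nic_mpps: float  # hypothetical estimated throughput on the NIC
    nic_cost: int    # hypothetical NIC resource units consumed if offloaded


def chain_throughput(snippets, on_nic):
    """Assumed model: the chain runs as fast as its slowest snippet."""
    return min(s.nic_mpps if s.name in on_nic else s.cpu_mpps for s in snippets)


def greedy_plan(snippets, nic_capacity):
    """Greedy heuristic: keep offloading the snippet that most improves
    the estimated chain throughput, as long as it fits on the NIC."""
    on_nic, used = set(), 0
    while True:
        base = chain_throughput(snippets, on_nic)
        best = None  # (gain, snippet)
        for s in snippets:
            if s.name in on_nic or used + s.nic_cost > nic_capacity:
                continue
            gain = chain_throughput(snippets, on_nic | {s.name}) - base
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, s)
        if best is None:
            return on_nic  # no remaining move improves the estimate
        on_nic.add(best[1].name)
        used += best[1].nic_cost
```

A greedy search like this is fast but can stop at a local optimum, for example when two snippets tie as the bottleneck and offloading either one alone gives no gain.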
We further extended the compiler to edge gateway settings, where a programmable switch and a CPU co-exist in a single box to support edge applications, and built Sirius. Sirius enriches the Lyra language and allows developers to define the different function chains that different classes of business traffic traverse. Sirius merges the chains into a monolithic program and synthesizes suitable traffic classifiers to guarantee traffic isolation. Sirius then divides the synthesized program between the switch and the CPU and inserts recirculations where needed to maximize the overall throughput.
For troubleshooting, we found that routing incidents to the right team is crucial: misrouting can increase the time-to-diagnose by 10x. We therefore designed Scouts, per-team gate-keepers that route relevant incidents to their own team and route unrelated ones away. Each Scout is maintained by its own team and uses machine learning algorithms to make a binary judgment based on the team’s domain knowledge and the incident content.
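The per-team gate-keeper idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the Scouts implementation: a weighted keyword score stands in for the trained machine learning model, and the names `Scout`, `Verdict`, `route`, the weights, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    team: str
    responsible: bool
    reason: str


class Scout:
    """One gate-keeper per team. The keyword weights are a stand-in for
    a model trained on the team's domain knowledge (an assumption)."""

    def __init__(self, team, weights, threshold):
        self.team = team
        self.weights = weights
        self.threshold = threshold

    def judge(self, incident_text):
        # Binary judgment: is this incident likely this team's responsibility?
        words = incident_text.lower().split()
        hits = {w: self.weights[w] for w in words if w in self.weights}
        score = sum(hits.values())
        responsible = score >= self.threshold
        reason = f"score {score:.1f} from terms {sorted(hits)}"
        return Verdict(self.team, responsible, reason)


def route(incident_text, scouts):
    """Ask every team's Scout and return the teams that claim the incident."""
    verdicts = [s.judge(incident_text) for s in scouts]
    return [v.team for v in verdicts if v.responsible]
```

Because each Scout answers only for its own team, a team can retrain or tune its Scout without coordinating with other teams.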
Lyra, Sirius, and Scouts are deployed in production cloud networks to improve developers’ productivity. More specifically, Lyra and Sirius are deployed in Alibaba Cloud to compile gateway programs, optimize resource usage, and enrich the error information provided by vendor compilers. Scouts is deployed in Microsoft Azure’s incident management system and serves as the gate-keeper for the Physical Networking team. For incidents that might be related to the team, the Scout provides its prediction and the reason for it.
Keywords
Compiler, Datacenter network, Machine learning, Programmable devices, Computer science, Near Eastern studies
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service