Publication: Managing Virtualized Network Functions in the Cloud
Date
2024-05-13
Authors
Citation
Gong, Junzhi. 2024. Managing Virtualized Network Functions in the Cloud. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
In modern networks, network functions (NFs) such as firewalls, intrusion detection, and cellular signal processing are extensively deployed to perform specific traffic processing tasks. Network operators are increasingly implementing network functions in software rather than on specialized hardware appliances. This shift brings advantages, including improved flexibility, increased feature velocity, and reduced vendor lock-in. Network function virtualization (NFV) is an emerging trend across almost all networks, including ISPs, clouds, and cellular networks.
However, these advantages come with trade-offs. First, it is not trivial to deliver high and stable performance for network functions on commodity servers. Single-CPU performance can no longer keep up with network traffic rates, which pushes network functions to distribute their packet processing pipelines across multiple servers. Such distribution requires significant inter-server (and even inter-CPU-core) communication, whose overhead becomes a scalability bottleneck for many network functions, such as virtualized radio access networks (vRANs). In terms of packet latency, both software performance jitter and in-network queuing contribute to unpredictable latencies, significantly affecting service level objectives (SLOs) for many NFs.
Maintaining good performance for network functions is equally critical; this includes support for performance diagnosis and resilience to small service disruptions. Diagnosing network function performance problems is not trivial, as these problems can stem from many system-level events (e.g., cache misses), and the impact of such events can propagate across network functions and over time. Enabling resilience for network functions involves live upgrades and failovers, which are not trivial because of the high traffic rates and the black-box nature of network functions.
In this dissertation, we propose three key ideas to address these NFV challenges. The first key idea is to use hardware offloading for better performance and predictable latencies, as many emerging hardware devices (e.g., SmartNICs, radios) offer additional capabilities that help with packet processing. The second key idea is to identify the critical minimum states inside network functions to support different network function systems. This helps NFV diagnosis accurately pinpoint root causes and reduces the time spent on state migration during NFV resilience events. The third key idea is to apply domain-specific knowledge for certain network functions (such as vRANs), allowing developers and operators to design optimized pipelines that address scalability bottlenecks and to apply domain-specific solutions for efficient state migration.
To this end, we design four novel NFV systems, each targeting a different NFV challenge. We first propose Hydra, a scalable distributed massive MIMO system for vRANs, which uses modern hardware radio capabilities to reduce inter-server communication overhead and a domain-specific pipeline design to reduce inter-CPU-core communication overhead, thereby improving scalability. Hydra is the first system to support 150 antennas and 32 users within three servers. We then propose Octopus, a network function that supports predictable latencies using SmartNIC offloading. Octopus repurposes the hardware traffic shaping feature on SmartNICs to achieve accurate packet arrival times on the receiving side. Octopus is the first system to support predictable packet latencies with variations of about 50 ns. Next, we propose Microscope, an accurate NFV performance diagnosis system. It identifies the critical in-network queuing information for diagnosis, which allows us to accurately pinpoint why a network function suffers from long tail latency or packet drops. Microscope achieves 2.5 times higher accuracy than the state-of-the-art solution. Finally, we propose Atlas, a vRAN resilience solution with minimal service disruption. Specifically, it first applies vRAN domain-specific knowledge to identify the critical minimum states for migration, and then repurposes vRAN-specific protocols to help migrate those states without modifying the source code. Atlas is the first system to enable resilience for vRANs, and it can mitigate service disruptions within a second.
Keywords
5G, Diagnosis, Network function, Performance, Resilience, Virtualization, Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service