Maximizing Vcmts Data Plane Performance with 3Rd Gen Intel® Xeon® Scalable Processor Architecture

White Paper Processor Architecture Study Maximizing vCMTS Data Plane Performance with 3rd Gen Intel® Xeon® Scalable Processor Architecture Intel platform technologies boost virtualized cable modem termination system (vCMTS) data plane performance by 70% Authors Introduction The standardization of distributed access architecture (DAA) for DOCSIS and Brendan Ryan further advancements in the flexible MAC architecture (FMA) standard have System Architect (Intel) enabled the transition to a software-centric cable network infrastructure. This Michael O‘Hanlon paper will explore how Intel technologies can be utilized to increase performance Senior Principal Engineer (Intel) for vCMTS for the various deployment scenarios shown below (see Figure 1). In each scenario, the same DOCSIS MAC software may be deployed, whether as: David Coyle Senior Software Engineer (Intel) • a virtual MAC core (vCore), also known as a virtualized cable modem termination system (vCMTS) on a server in a multiple-system operator (MSO) headend, Rory Sexton • part of a remote-MAC-core (RMC) and remote-PHY deployment on an edge Network Software Engineer (Intel) compute node, Subhiksha Ravisundar • or part of a remote-MAC-PHY device (RMD) deployment on an edge compute node Network Software Engineer (Intel) R-PHY and Virtual-MAC-Core R-PHY and Remote-MAC-Core Head End / CO Head End / CO Table of Contents DOCSIS MAC SW RPD RMC Fiber Fiber Intel vCMTS Reference RPD Coax Fiber CPU Coax Data Plane . 2 DOCSIS vCores MAC SW RPD RMC vCMTS Data Plane Performance RPD Coax CPU Coax DOCSIS Analysis – Single Service Group . 3 MAC SW Crypto and CRC Processing in the R-MAC-PHY vCMTS Data Plane Pipeline . 4 Head End / CO Using Hyper-threading in the vCMTS RMD Fiber Data Plane Pipeline . 6 Coax CPU DOCSIS System Scalability and MAC SW RMD Server Sizing . 8 Coax CPU DOCSIS Compelling vCMTS Performance MAC SW on 3rd Gen Intel® Xeon® Scalable Figure 1 . Flexible MAC Architecture (FMA) Deployment Scenarios Processors . 8 Performance for such a software-centric approach is greatly boosted by Appendix A . 10 technologies such as Data Plane Development Kit (DPDK) and the Intel Multi-Buffer Appendix B . 12 Crypto library (intel-ipsec-mb), which provide highly optimized packet processing in software that is tightly coupled to the ever-developing Intel® architecture. Intel Appendix C . 13 architecture provides native instructions and features that specifically accelerate Appendix D . 13 data plane packet processing for access network functions such as vCMTS. Appendix E . 15 White Paper | Processor Architecture Study 2 +50% 1 Crypto L1 3 Cache Crypto 2x + <=42% 4 L2 2 cores Cache +25% +45%5 B/W Memory Level 3 Cache Level 3 Cache +11%/Core 8 3 x 16 4 x 16 +33% 6 Lanes PCle Gen 3 PCle Gen 4 2x B/W 7 2nd Generation CPU 3rd Generation CPU Figure 2 . Intel® Xeon® CPU Generational Enhancements Independent software vendors (ISVs) can significantly mix (IMIX) traffic blend. When Intel® QuickAssist Technology improve vCMTS performance on 3rd Gen Intel® Xeon® (Intel® QAT) acceleration is employed in the system, a Scalable processor architecture (codename “Ice Lake”) single core can satisfy close to the maximum downstream by taking advantage of the gen-on-gen CPU architecture bandwidth of a pure DOCSIS 3.1 configuration. Since vCMTS enhancements shown in Figure 2 which include bigger workloads exhibit good scalability, a typical server blade cache-sizes at each level, higher core-count, more memory based on dual 3rd Gen Intel Xeon Scalable processors channels at higher speed and greater I/O bandwidth. (e.g. with 64 processor cores total) can achieve compelling performance density. This paper focusses specifically on how to take advantage of advanced features such as: It is worth noting that these performance benefits can also be delivered by deploying a general-purpose compute • Enhanced Intel® Advanced Vector Extensions 512 (Intel® component in the RMC and RMD scenarios shown in Figure AVX-512) 1. In other words, the same performance benefits of Intel • Dual AES encryption engines architecture and complementary technologies, such as Intel QAT and the Intel Multi-Buffer Crypto library, can be used to • Intel® Vector AES New Instructions (AES-NI) achieve similar performance on an edge compute node. • Intel® Vector PCLMULQDQ carry-less multiplication DOCSIS MAC functionality can be broken down into four instruction categories: downstream data plane, upstream data plane, • Intel® QuickAssist Technology control plane, and system management. From a network performance perspective, the most compute-intensive The paper also provides insights into implementation options workload is data plane processing, and consequently, is the and establishes an empirical performance data baseline that focus of this paper. can be used to estimate the capability of a vCMTS platform running on industry-standard, high-volume servers based on Intel vCMTS Reference Data Plane 3rd Gen Intel Xeon Scalable processor architecture. Actual measurements were taken on a 3rd Gen Intel Xeon Scalable Intel has developed a vCMTS reference data plane that processor-based system (see Appendix D for configuration is compliant with DOCSIS 3.1 specifications and based details). on the DPDK packet processing framework. It is publicly available on the Intel 01.org open-source website at The data demonstrates how a single 3rd Gen Intel Xeon 01.org/access-network-dataplanes. The main objective of Scalable processor core running at 2.2 GHz clock speed can this development is to provide a tool to demonstrate the support the downstream channel bandwidth for close to vCMTS data plane packet processing performance of Intel® five orthogonal, frequency-division multiplexing (OFDM) Xeon® processor-based platforms and to assist ISVs and channels in a pure DOCSIS 3.1 configuration for an Internet MSOs in deploying a vCMTS. White Paper | Processor Architecture Study 3 Upper MAC Lower MAC Combined Operation Downstream Receive IP Cable Modem DOCSIS DOCSIS DOCSIS QoS DOCSIS IP Frame CRC DOCSIS DEPI Transmit IP Frames Lookup & Sub Filtering Qos Scheduling (SF Framing Generation BPI+ Encapsulation DEPI/L2TP (DEPI/L2TP) Mgmt Classification & Channel) (incl HCS) Encryption Frames librte_ethdev librte_hash librte_acl librte_acl librte_sched librte_net_crc librte_cryptodev librte_cryptodev librte_mbuf librte_ethdev Combined Operation Upstream Transmit IP Frame CRC DOCSIS BPI+ DOCSIS DOCSIS DOCSIS Service ID UEPI Validate Receive IP IP Frames Verification Decryption Frame HCS Frame Segment Lookup Decapsulation Frame & Strip UEPI/L2TP (UEPI/L2TP) Verification Extraction Reassembly L2TP/IP Hdrs Frames librte_ethdev librte_cryptodev librte_cryptodev librte_net_crc librte_mbuf librte_mbuf librte_hash librte_mbuf librte_mbuf librte_ethdev Figure 3 . Intel vCMTS Reference Data Plane Figure 3 shows the upstream and downstream packet vCMTS Data Plane Performance Analysis – Single processing pipelines implemented for the Intel vCMTS Service Group Reference Data Plane. The downstream data plane is Performance tests were run using the Intel vCMTS Reference implemented as a two-stage pipeline, which perform upper Data plane with a channel configuration to demonstrate MAC and lower MAC processing respectively. The DPDK API the maximum throughput capability for a pure DOCSIS used for each significant DOCSIS MAC data plane packet- 3.1 deployment. Performance with crypto acceleration processing stage is also shown. using both the Intel Multi-Buffer Crypto library and Intel A detailed description of the upstream and downstream QAT is shown. The performance of the 3rd Gen Intel Xeon packet processing stages shown in Figure 3 is provided Scalable processor running at 2.2 GHz core frequency is in Appendix A. Many key innovations and performance also compared to a 2nd Generation Intel® Xeon® Scalable optimizations are supported by the Intel vCMTS Reference processor (codename “Cascade Lake”) running at 2.3 GHz.9 Data Plane, including the following: The channel configuration is a pure DOCSIS 3.1 deployment • Optimized multi-buffer implementation of combined AES with six OFDM channels that produce a theoretical cryptographic (crypto) and CRC processing based on Intel cumulative bandwidth of 11.3 Gbps; however, the effective AES-NI and AVX-512 instructions downstream bandwidth is limited to 10 Gbps for the DOCSIS 3.1 bandwidth limit per service group (SG). Note that this is • Optimized multi-buffer implementation of DES crypto enforced by the vCMTS data-plane pipeline with Downstream processing based on Intel AVX-512 instructions QoS rate-limiting of 10 Gbps per service group. • Acceleration of AES and DES DOCSIS crypto processing using Intel QuickAssist Technology Single Service Group Downstream Throughput • Optimized CRC32 and DOCSIS header check sequence (SINGLE CPU CORE) (HCS) calculation based on Intel AVX-512 instructions 12.00 • DOCSIS service flow and channel scheduling DOCSIS 3.1 DOWNSTREAM B/W LIMIT 10.00 implementation based on DPDK HQoS LINERATE : 5xOFDM : 9.5 GBPS • Packet Streaming Protocol (PSP) fragmentation/ 8.00 reassembly implementation based on the DPDK Mbuf API LINERATE : 4xOFDM : 7.6 GBPS • DOCSIS MAC data plane pre-processing using the 100G 6.00 Intel® Ethernet 800 Series network interface card (NIC) LINERATE : 32xSC-QAM, 2xOFDM : 5.7 GBPS Throughput (Gbps) Throughput • Configurable DOCSIS MAC data plane threading options 4.00 – dual-thread, single-thread, separate or combined upstream and downstream threads 2.00 I M The Intel vCMTS Reference Data Plane runs within

Maximizing Vcmts Data Plane Performance with 3Rd Gen Intel® Xeon® Scalable Processor Architecture

Designing an Ultra Low-Overhead Multithreading Runtime for Nim

Fiber Context As a ﬁrst-Class Object

Fiber-Based Architecture for NFV Cloud Databases

A Thread Performance Comparison: Windows NT and Solaris on a Symmetric Multiprocessor

Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++

The Impact of Process Placement and Oversubscription on Application Performance: a Case Study for Exascale Computing

And Computer Systems I: Introduction Systems

Enabling Technologies for Distributed and Cloud Computing Dr

Eindhoven University of Technology MASTER Acceleration of A

HPX-Agseries

Fibers Are Not (P)Threads the Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

Use of High Perfo Nce Networks and Supercomputers Ime Flight Simulation