High Level Synthesis for Packet Processing Pipelines

Cristian Soviani

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2007

© 2007 Cristian Soviani
All Rights Reserved

ABSTRACT

High Level Synthesis for Packet Processing Pipelines

Cristian Soviani

Packet processing is an essential function of state-of-the-art network routers and switches. Implementing packet processors in pipelined architectures is a well-known, established technique, although different approaches have been proposed.

The design of packet processing pipelines is a delicate trade-off between the desire for abstract specifications, short development time, and design maintainability on one hand and very aggressive performance requirements on the other.

This thesis proposes a coherent design flow for packet processing pipelines. Like the design process itself, I start by introducing a novel domain-specific language that provides a high-level specification of the pipeline. Next, I address synthesizing this model and calculating its worst-case throughput. Finally, I address some specific circuit optimization issues.

I claim, based on experimental results, that my proposed technique can dramatically improve the design process of these pipelines, while the resulting performance matches the expectations of hand-crafted design.

The considered pipelines exhibit a pseudo-linear topology, which can be too restrictive in the general case. However, especially due to its high performance, such an architecture may be suitable for applications outside packet processing, in which case some of my proposed techniques could be easily adapted. Since I ran my experiments on FPGAs, this work has an inherent bias towards that technology; however, most results are technology-independent.

Contents

1 Introduction
  1.1 Packet processing inside a Switch
    1.1.1 FPGA particularities
  1.2 Packet processing pipelines
  1.3 Packet editing modules
  1.4 Main challenges
  1.5 Challenge: Avoiding RTL
    1.5.1 Contribution: high level packet editor description
    1.5.2 Contribution: packet editor synthesis
  1.6 Challenge: Pipeline worst case performance
    1.6.1 Contribution: pipeline performance analysis
    1.6.2 Computing minimum FIFO sizes
  1.7 Challenge: Sequential RTL issues
    1.7.1 Contribution: combining Retiming and Shannon
  1.8 Contribution summary. Thesis outline

2 Synthesis of Packet Processing Pipelines
  2.1 Overview
  2.2 Packet Processors and Editing

  2.3 Related work
  2.4 The Packet Editing Graph
    2.4.1 What to expect from the PEG
    2.4.2 A PEG Example
    2.4.3 PEG description
      2.4.3.1 The input subgraph
      2.4.3.2 The computation subgraph
      2.4.3.3 The output subgraph
  2.5 The Synthesis Procedure
    2.5.1 The module I/O protocol
    2.5.2 Core and Wrapper
    2.5.3 Splitting Data into Words
    2.5.4 Assigning Read Cycle Indices
    2.5.5 Scheduling
    2.5.6 Synthesizing the controller
    2.5.7 Handling the end of a packet
    2.5.8 Synthesizing the datapath
    2.5.9 Comments on scheduling and binding
    2.5.10 Comments on FSM encoding
  2.6 Implementation
    2.6.1 Textual frontend
    2.6.2 Synthesis steps
    2.6.3 Simulation
    2.6.4 RTL and Low-Level Synthesis
  2.7 Experimental Results
  2.8 Conclusions

3 Packet Processing Pipeline Analysis
  3.1 Overview
    3.1.1 Motivational example
  3.2 Related Work
  3.3 Single module analysis
    3.3.1 Defining module performance
    3.3.2 Computing R, W, and T from the STG
    3.3.3 Using Bellman-Ford
    3.3.4 Abstracting the datapath
  3.4 Connecting modules
    3.4.1 The module I/O protocol
    3.4.2 The FIFO implementation
  3.5 The Big FIFO scenario
    3.5.1 Computing the worst-case throughput
    3.5.2 Performance issues
    3.5.3 Module independence
  3.6 Verifying throughput using model checking
    3.6.1 Comparison with explicit analysis
    3.6.2 Comparison with simulation
  3.7 Computing the minimum FIFO sizes
    3.7.1 Computing FIFO Sizes
    3.7.2 FIFO size monotonicity
    3.7.3 An exact depth-first search algorithm
    3.7.4 A heuristic search algorithm
  3.8 Experimental Results
  3.9 Conclusions

4 Combining Shannon and Retiming
  4.1 Introduction
    4.1.1 Brief description
    4.1.2 An example
    4.1.3 Overview of the algorithm
    4.1.4 Related work
  4.2 The First Phase: Computing Arrival Times
    4.2.1 Basics
    4.2.2 Shannon decomposition
    4.2.3 Shannon decomposition as a kind of technology mapping
    4.2.4 Redundant encodings and arrival times
    4.2.5 My family of Shannon encodings
    4.2.6 Sets of feasible arrival times
    4.2.7 Pruning feasible arrival times
    4.2.8 Retiming
    4.2.9 Using Bellman-Ford to compute arrival times
    4.2.10 Termination of Bellman-Ford and max-iterations
  4.3 The Second Phase: Resynthesizing the Circuit
    4.3.1 Cell families
    4.3.2 Cell selection
    4.3.3 Propagating required times
  4.4 Experiments
    4.4.1 Combinational circuits
    4.4.2 Sequential benchmarks
    4.4.3 Packet processing pipelines
  4.5 Conclusions

5 Final Conclusions
  5.1 Claim
  5.2 PEG — the human factor
  5.3 Synthesis — taking advantage of others
  5.4 Analysis — choosing what to measure
  5.5 Logic synthesis — beyond BLIF
  5.6 Open issues
    5.6.1 Synthesis - Design space exploration
    5.6.2 Analysis — Performance under-estimation
    5.6.3 Low level optimization — Experiments

Bibliography

List of Figures

1.1 Generic switch architecture
1.2 A switch with detail of a packet processing pipeline

2.1 Graphic peg specification for the mpls module
2.2 Textual peg specification for the mpls module
2.3 A module's I/O signals and its two components
2.4 Split into 64-bit words
2.5 Structuring a packet map into words
2.6 Read cycle indices and delay bubbles added
2.7 ASM chart for the controller synthesized from Figure 2.6

3.1 A packet pipeline: how big should the fifos be?
3.2 stgs for the DwVLANproc module
3.3 Sample stg
3.4 Original and simplified module fsm
3.5 Handshaking between module and adjacent fifos
3.6 stg of a 3-place fifo
3.7 Computing R assuming optimally-sized fifos
3.8 Slightly modified pipeline for the model checker
3.9 Data source fsms

3.10 An exact algorithm for computing minimum fifos
3.11 Greedy search for minimum-size fifos

4.1 An example
4.2 My algorithm for restructuring a circuit S to achieve a period c
4.3 My Bellman-Ford algorithm for computing feasible arrival times

4.4 Shannon decomposition of f with respect to xk
4.5 The basic Shannon "cell library"
4.6 Shannon decomposition through node replacement
4.7 The Shannon transform as redundant encoding

4.8 Encoding x, evaluating f(x, y, z), and decoding the output for the s0, s1, s2, and s3 codes
4.9 Computing feasible arrival times for a node f
4.10 Pruning the feasible arrival times from Figure 4.9
4.11 Iterations near the lowest c
4.12 Computing feasible arrival times for a circuit
4.13 Restructuring a circuit
4.14 Selecting cells for node h in Figure 4.12
4.15 Propagating required times from outputs to inputs
4.16 The performance/area tradeoff obtained by my algorithm on a 128-bit ripple carry adder

List of Tables

2.1 Synthesis results for selected modules

3.1 Module statistics
3.2 Experimental results

4.1 Experiments on ISCAS89 sequential benchmarks
4.2 Experiments on packet processing pipelines

Acknowledgements

I thank my advisor, Prof. Stephen A. Edwards, for his constant guidance and help throughout my graduate studies. Not only was his liberal advice essential for my academic progress, but he actively contributed to the research that led to this thesis. Doubtlessly, most of the material presented below as original work was published before at several academic venues; without exception, Stephen has co-authored these publications. But scarcely was Stephen's help strictly academic: he is my friend.

I thank Dr. Ilija Hadzic, my mentor during my two Bell Labs internships. The research we started together during the summer of 2005 represents the keystone of this thesis. Without his suggestions and comments, based on a vast engineering experience, my research could have easily strayed away from this physical world.

I thank Dr. Olivier Tardieu for his active contribution to my research. His touch can be easily recognized in the last technical chapter, where a very rigorous mathematical formulation was required.

I thank Prof. Luca Carloni for his insightful suggestions and comments on some of the most delicate and convoluted topics of my research.

I thank Prof. Jonathan Gross for teaching the two most interesting courses I have ever taken. His teaching is divinely inspired.

I thank the members of my defense committee for their interest in my modest work and for the unforgettable moments they have offered me during my defense. I thank all the faculty of our department for their inestimable advice and help.

Last but not least, I wish to thank my family for their irreplaceable and unconditional support during the last thirty-three years of my life.

To Lizi and Valona


Chapter 1

Introduction

1.1 Packet processing inside a Switch

A packet switch is a basic building block of networks. Its primary role is to forward packets based on their content, specifically data at the beginning of the packets. As part of this forwarding operation, packets are classified, queued, modified, transmitted, or dropped. Figure 1.1 shows the simplified architecture of a generic packet switch.

Figure 1.1: Generic switch architecture

Its central component is a switching fabric, which performs the actual switching operation among its N ports. In addition, the switch contains N linecards, one for each fabric port: they are the interface between the switching fabric and the physical network. These linecards are responsible for a large class of operations, such as packet classification, next destination computation, metering, and traffic management.

Packet processing is a very important task for a linecard. According to the network protocol(s) supported by the switch, some header fields of the packets have to be updated (such as decrementing the ttl (Time to Live) count in the ip (Internet Protocol) header) or inserted or removed (such as the mpls (Multiprotocol Label Switching) labels used by the Martini encapsulation protocol). In addition, some packets have to be dropped (e.g., an ip packet with a zero ttl value) or forwarded internally to a dedicated component inside the switch, e.g., discovery pppoe (Point-to-Point Protocol over Ethernet) packets. These operations are very common, but the list is far from exhaustive: the important point is that each network protocol requires its own set of such packet transformations.

The actual operations that can be performed by the packet processor define the overall switch functionality. The other switch components, such as the switching fabric, perform generic tasks. It is mainly the packet processor that makes the difference between a simple Ethernet switch and a more sophisticated one that also supports vlan (Virtual Local Area Network) and pppoe traffic.

Figure 1.2 shows a more detailed view of a typical linecard. It consists of an ingress path and an egress one, which handle the incoming and the outgoing traffic. Each path contains a packet processor, which performs the abovementioned operations.

As mentioned earlier, the packet processors encapsulate the switch's knowledge of network protocols. Consequently, the complexity of these processors is high and tends to increase with the deployment of new protocols and multi-protocol network devices. Moreover, these components are the first to require changes due to protocol updates, additional required functionality, or simply because of bugs discovered in the current implementation.


Figure 1.2: A switch with detail of a packet processing pipeline.

It is critical that these changes be done in the field, to avoid the physical replacement of the existing product with a new one, and preferably without interrupting the network operation for a long time. The above issues are known as maintainability.

Traditionally, packet processing was done in software, either by general purpose cpus or, for increased performance, by specialized npus (Network Processing Units). As expected, this approach successfully addresses the maintainability issue. Although the solution is still feasible for low-end network devices, high-end switches, such as those found on network backbones or in data centers, operate today at speeds hard to imagine two decades ago. In the spring of 2007, it was not uncommon to find linecards working at 40 Gbps. Since the minimum Ethernet packet size is 64 bytes, these cards must work at the incredible rate of 80,000,000 packets per second: each packet has a header that must be handled by the packet processor. Implementing the packet processors in hardware becomes a very attractive choice, if not the only reasonable one, for any modern high-end networking device.
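To make the arithmetic explicit (a back-of-the-envelope check that ignores Ethernet framing overhead such as the preamble and the inter-frame gap):

    \frac{40 \times 10^9 \ \text{bits/s}}{64 \ \text{bytes} \times 8 \ \text{bits/byte}} = 78.125 \times 10^6 \approx 80 \ \text{million packets/s}

At clock rates of a few hundred MHz, this leaves only a handful of clock cycles per packet, which is why wide datapaths and deep pipelining are unavoidable.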

The immediate drawback is that for classical technologies such as asics, the maintainability advantages offered by software are lost. Some vendors (Cisco, Juniper, etc.) build hard-wired asics with generic sequences for state machines to achieve field upgrade and maintenance capability, but this is limited. Fortunately, with the recent growing popularity of fpga (Field Programmable Gate Array) devices, a compromise solution has been found. Although slower than asics, fpgas are much faster than software and allow the same maintainability as software. Asics indeed require fewer physical resources (in terms of transistors), but due to commercial issues (mainly the quantities in which they are manufactured), fpgas are often more economical.

1.1.1 FPGA particularities

Although an overview of fpga technology is outside the scope of this thesis, I will emphasize some of its particularities, as they trigger important decisions in the rest of this work.

Due to its programmability, an fpga uses far more silicon resources than an equivalent asic design. For example, a 4-input and gate is usually mapped in a Xilinx Virtex 4 device to a 4-input lut (Look-Up Table), which consists mainly of 16 sram (Static ram) cells (used for configuration) and a 16-to-1 multiplexer. Another particularity, impossible to ignore, is that the "wires" connecting the various logic elements are in fact complex circuits that include buffers and multiplexers.

Consequently, fpgas are less efficient than asics in both area and delay. The difference is not minor: an fpga is roughly ten times larger and ten times slower than the equivalent asic. It follows that fpga designs work at lower clock frequencies (hundreds of MHz), and most operations have to be pipelined and performed on wider data words to accomplish the same tasks.

All fpgas provide a disproportionately large number of registers relative to the number of combinational logic primitives.

For example, all fpgas from the Xilinx Spartan and Virtex families provide a register for each four-input lut. Consequently, adding several pipeline stages to a combinational path is inexpensive.

Applying pipelining and using wide data words leads to very complex sequential networks. Again, an example: if we are given four modules instead of one simple fec (forward error correction) module, each with four pipeline stages, it is anything but trivial to speed up an existing application by a factor of 16. Assuming that it is possible, one may expect a complex sequential network around the modules, e.g., performing load balancing. To conclude, fpgas usually require more complex designs than asics for solving the same problem.

However, the real advantage of fpgas is their programmability. Much like software, an fpga may be updated in a matter of seconds, even when the device is deployed in the field. Thus, not only is the development and testing process dramatically sped up (e.g., the physical boards may be manufactured before the design is complete), but the systems may be maintained dynamically with minimal overhead. This proves to be a crucial advantage for applications such as network devices, where new features and fixes have to be added on a regular basis.

In fact, the bottleneck in the update process is the design time itself. If adding a new feature to the design takes several weeks, the main advantage of this technology cannot be exploited to its maximum potential.

There are thus two different, contradictory issues: fpgas require complex designs, yet design time is critical. The whole point of my thesis is to balance these issues against the requirement of high performance in modern networking devices.

1.2 Packet processing pipelines

The algorithms executed by an ingress or egress packet processor in most switches are very suitable to a pipelined hardware implementation. Figure 1.2 shows an example: a (simplified) pipelined realization of an egress packet processor in a commercial switch.

The core of the packet processor consists of a linear pipeline. The logical flow may have forks and joins, but they can be emulated with a linear physical topology. This packet processing pipeline is traversed by all the packets to be processed; note that there are no physical forks or joins. The main property of this linear architecture is that the order in which packets enter and exit the pipeline is exactly the same. For example, if a packet is to be dropped, it is simply marked as such: the actual dropping will be done at the exit of the pipeline. I will argue for the advantages of this simplified architecture in the next chapter.

As a general rule, the overall complex functionality of the pipeline is divided among several modules, each of them performing tasks of small or moderate complexity. These modules are known as packet header editors because they only modify packet headers, not their payloads. In the sequel, I will often refer to them using shorter names such as packet editors, editing modules, or simply modules.

The packet processing pipeline also contains several auxiliary elements, such as the two lookup modules in Figure 1.2. The main traffic does not traverse them; instead they function as clients of the main pipeline. Such an element takes requests from a pipeline module, performs a specific operation, and returns the results to another downstream module. Memory lookup elements are probably the most common type. For example, the one on the right of Figure 1.2 performs an arp (Address Resolution Protocol) lookup: it takes the packet ip destination from the mplspush module and returns the Ethernet address of the destination interface to arp resolve. The latter uses this result to update the relevant field of the packet header.

The communication between the pipeline elements (packet editors and auxiliary elements) is realized by fifos. For simplicity, I represent them as arrows in Figure 1.2. Since the packet editors can insert or remove header fields, the data flow rate through the pipeline is not constant. Such non-constant rates also arise because some modules or auxiliary elements take extra time to perform some computation. Thus, a communication link between two components may be active and carry valid information in certain clock cycles but not in others. Therefore, these fifos perform the essential task of compensating for the non-continuous operation of the various components.

1.3 Packet editing modules

As seen above, the packet editing modules are critical components of a packet processing pipeline. They are neither more nor less important than the other components (lookup tables, etc.), but they form the backbone of the pipeline. While the other elements either are simple connecting elements (e.g., fifos) or perform important but standard operations (e.g., memory look-up modules), the packet editors are the "active" elements that encapsulate the functionality of the pipeline, i.e., its ability to process various types of packets according to different network protocols.

From a high-level view, these modules perform simple transformations on each packet header: simple arithmetic or logical operations, such as decrementing the ip ttl field; inserting or removing data, such as the mpls labels; setting or resetting some flags; or any combination of those. I will describe the possible operations in Chapter 2. The overall functionality is split among various packet editors in the pipeline, both to keep each module simple and to enable the interaction with auxiliary elements.

To keep these critical modules simple, they are not allowed to perform arbitrary operations such as dropping packets or creating new ones.

Even when these operations are required, they are usually done outside the packet processing pipeline for performance reasons. As a result, the behavior of such a module looks simple: read packet, make some simple modifications, write packet, repeat.

However, these modules are not trivial when expressed at the register transfer level in a language such as Verilog or vhdl, and they do not look similar to each other. Even a simple operation such as adding two fields may have radically different implementations depending on where the fields are in the header, the speed of the add operation, etc.

The rtl design of such a module is not trivial; it takes a lot of time even for an experienced engineer and is prone to errors. The simple and uniform high-level functionality of these modules disappears in the rtl domain, where even trivial modules require complex implementations.
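To make the contrast concrete, the abstract behavior of a ttl-update editor can be sketched in a few lines of C. This is an illustration only: the struct layout and the drop-flag convention are invented here, and, as argued above, the hardware implementation looks nothing like this sequential code.

    #include <stdint.h>

    struct header {
        uint8_t ttl;     /* time-to-live count (hypothetical layout)   */
        uint8_t flags;   /* internal control flags (hypothetical)      */
        /* ... other header fields ... */
    };

    #define FLAG_DROP 0x01   /* invented convention: mark for dropping */

    /* Read packet, make a simple modification, write packet, repeat. */
    static void ttl_update(struct header *h)
    {
        if (h->ttl > 0)
            h->ttl--;                  /* decrement the ttl field      */
        if (h->ttl == 0)
            h->flags |= FLAG_DROP;     /* the actual drop happens at   */
                                       /* the exit of the pipeline     */
    }

The few lines of logic above are essentially the entire specification; the rtl effort goes into the machinery around them.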

1.4 Main challenges

Below, I summarize the main challenges in designing a high-performance packet processing pipeline. After this short summary, I expand each topic, adding a brief description of my contribution to surmounting the respective issue. These sections also serve as the outline of my thesis.

First, I argue that the design flow for packet editing modules has to start from a high-level specification instead of rtl, because of the difficulty of rtl design and the big advantage of speeding up the design process. I introduce such a specification model, the peg (Packet Editing Graph), and I show how to translate a peg specification into an rtl model such as vhdl.

Next, I argue for the importance of analyzing the worst case performance of a packet processing pipeline. I propose a fast method for this analysis. I also argue for the importance of computing the minimum sizes of the interconnecting fifos needed to guarantee the desired pipeline performance.

I also present a model-checking-based algorithm to answer this issue.

Finally, I argue that the real performance bottlenecks are the critical loops, which usually spread across the design in a complex sequential system such as the one resulting from my synthesis procedure. To address this issue, I propose an algorithm based on a combination of Shannon decomposition and retiming.

1.5 Challenge: Avoiding RTL

The specific challenge in designing a packet processing pipeline is the design of its packet editing modules. This is because fifos are usually instantiated from an existing library or synthesized by a specialized tool, while auxiliary elements are usually designed using standard, albeit not always simple, techniques. Once the pipeline components are available, putting them together according to the pipeline topology is straightforward. Therefore, I focus mainly on the design of the packet editing modules throughout the rest of this thesis.

First, I identify several particularities of these modules. They are deeply pipelined circuits that operate on many bits in parallel (e.g., 64, 128, or 256 bits). This tendency becomes very visible in fpga implementations because of their relative lack of performance compared to asics. Second, they interact with the adjacent components through fifo interfaces. To achieve the desired performance, the handshaking signals require very careful handling; the arising problems have a solid reputation as "brain-twisters" for the design engineer. Third, the specifications of these modules are usually long documents written in plain English by network protocol architects with moderate hardware design experience; traditionally, they have a software-centric view of the issue.

Consequently, these modules are tedious to code at the rtl (register transfer level) using a hardware description language such as Verilog or vhdl: designing them is usually reserved for experienced engineers and takes a lot of time. Since high performance is needed, very aggressive design techniques are often employed. Bugs inherently appear in corner cases; they are hard to fix and even harder to detect.

The critical drawback of rtl design is not a purely technical one. I have mentioned the importance of quick development and maintainability. The difficulty of rtl design tends to undermine the advantages brought by the use of fpga technology.

1.5.1 Contribution: high level packet editor description

The first contribution of this thesis is peg, a new high-level dsl (domain specific language) for specifying the functionality of packet editing modules. Using this language, the designer can specify any practical module in a simple and intuitive manner, staying close to the top-level view of the problem that is familiar to the network protocol architect. Since peg abstracts all the low-level implementation details, rtl design experience is unnecessary; the language encourages the designer to express the problem in terms of high-level functionality.

The proposed language does not have a hardware flavor, and I consider it a desirable goal to remove any software flavor as well. It could easily be argued that an abstract view of packet header editing is completely orthogonal to low-level rtl specifications, the latter being a means of physically accomplishing the former on a hardware target. I argue with the same confidence that a software program is also merely a means of implementing the same task on a software target: the correct abstraction for packet editing has little in common with either the rtl model or the sequential software model. The proposed abstract model can be efficiently translated into either software or hardware, depending on the chosen implementation of the packet processor. The hardware translation forms a significant part of this thesis.

The peg language has a graphical flavor: the operands, computations, results, etc., are represented as nodes in a graph, and the dependencies between them are modeled by the graph's edges. Peg is chosen to be general enough to describe all practical header editing operations and particular enough to allow the synthesis procedure to generate efficient circuits by taking advantage of some domain-specific properties.

1.5.2 Contribution: packet editor synthesis

Starting from a packet editing module described in peg, the next task consists of synthesizing an rtl model of the module. In particular, my existing implementation generates synthesizable vhdl code. This rtl code has to be further synthesized for a specific target using a traditional low-level synthesis flow. My primary target was fpgas, so the generated rtl code is biased towards this technology to some degree. However, this bias is minor, and the flow could easily be re-targeted towards different technologies such as asics.

To a first approximation, my synthesis steps resemble traditional high-level synthesis. However, the underlying architectural and computational models are totally different from those found in the traditional approach: the resemblance to classical high-level synthesis is helpful for understanding the general issues but proves shallow at the level of technical details.

In traditional high-level synthesis, the rtl design looks like a simple cpu: various computational and memory elements are interconnected by buses, multiplexers, etc. Computational units have "random access" to all operands and destinations. Therefore, computation can be divided between different elements and different clock cycles in a very flexible manner, given that the causality constraints are respected (i.e., an operand is never used before it is produced).

In this thesis, the architecture is a deep pipeline: data flows through all stages and is locally modified by each stage. Since stages are usually light-weight elements, correlated data (e.g., a packet header) may appear simultaneously in several stages or even in several modules. Since data enters (and exits) each stage word by word, ordered as it appears in the packet header, not all operands are available exactly when needed: buffers are sometimes needed to store useful data within a stage for further use and to store already-computed results that have to be output later.

Ideally, data passes constantly through the pipeline and is modified on the fly. In practice, the modules use only some fraction of the clock cycles to read or write data: the pipeline works at lower-than-unit efficiency. Obtaining the best possible efficiency becomes the primary synthesis goal.

My experimental results show that packet editing modules synthesized by my proposed technique meet the performance and area requirements expected from a hand-written rtl design. I produce circuits targeting Xilinx Virtex4 fpgas that can sustain 40 Gbps throughput on industrial examples, equal to state-of-the-art switches, and are vastly easier to write and maintain than classic rtl designs.

1.6 Challenge: Pipeline worst case performance

Performance-critical pipelines are generally built from a sequence of simple processing modules connected by fifos. I focus on packet processing pipelines such as those found inside a networking device, where the processing modules are the packet header editors, which can be synthesized using my synthesis technique above. However, the issues described in this section are more general and, like my proposed solution, can be applied to a broader class of pipelines.

Given a complex pipeline, an essential issue is to find the overall pipeline throughput. Ideally, all pipeline modules work at full efficiency. Since a valid data word is then transferred between any two stages every cycle, computing the throughput amounts to multiplying the pipeline clock frequency by its data width.

This ideal case does not occur in practice: real modules insert and remove data, or require several clock cycles for an internal computation. For real pipelines such as those I encountered, actual throughputs of 30-70% of ideal are not uncommon. Finding the real pipeline throughput becomes a critical issue, since the ideal throughput cannot be used as a realistic estimate, not even in the coarsest sense. A widely-used rule of thumb estimates the actual throughput to be 50%. This number can be either too optimistic or too conservative: the latter case leads to over-design, the former to low service quality (i.e., dropped packets).

A widely used performance analysis technique is simulation. The main drawback is that the modules' behavior, and consequently the whole pipeline's behavior, depends on the processed data itself. The natural response to this issue is to run extensive simulations with different data patterns, chosen to cover, as much as possible, all realistic data patterns. This approach is not very different from the well-known technique of estimating the performance of general purpose cpus by running various benchmarks.

However, in contrast to a cpu, a packet processor is a hard real-time system. For this class of systems the worst case performance is generally more important than the average one. Most benchmarks are designed to cover all possible corner cases, since it is expected that the worst case behavior corresponds to a corner case. For example, a simple Ethernet linecard would be simulated using sequences of minimum size (64 bytes) and maximum size (1500 bytes) packets. For a modern packet processor, which is more complex, the number of these corner cases is very large and their simulation requires extensive running time. In practice, the simulation is restricted to a limited number of patterns. Consequently, "the field tests always find the bugs in the simulation setup," i.e., data patterns not considered in simulation often occur in practice. The development task becomes an iterative process between the designer, the field tester, and, worse, the customer.

Naturally, one would be interested in covering all possible cases by using an analytical, surprise-free approach that provides a system performance guarantee.
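For a sense of scale (the numbers are assumed for illustration, not taken from a specific design): a 64-bit-wide pipeline clocked at 200 MHz has an ideal throughput of

    T_{ideal} = f_{clk} \times W = 200 \times 10^6 \ \text{cycles/s} \times 64 \ \text{bits} = 12.8 \ \text{Gbps},

so at the 30-70% efficiencies mentioned above, the same hardware actually sustains only about 4-9 Gbps.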

A related issue is the required sizes of the fifos interconnecting the pipeline modules. Their size influences the overall pipeline performance because a full or an empty fifo stalls the adjacent modules for one or several clock cycles. Large capacity fifos successfully decouple the pipeline modules but waste a lot of resources. One would like to know the minimum fifo sizes for which the pipeline achieves its highest performance. One approach is to size the fifos by using some rough heuristics and to refine the analysis by extensive simulation. This method has the same drawbacks mentioned above.

1.6.1 Contribution: pipeline performance analysis

Computing the worst case throughput of packet processing pipelines is a difficult task because of their complex sequential behavior.

I propose a technique to compute the maximum possible pipeline throughput assuming that the fifos are unbounded. First, I analyze each module in isolation: I define some performance metrics for these modules and show how they can be computed. Second, I focus on the entire pipeline: I combine the numbers already computed for each individual module to determine the pipeline's worst case throughput. The proposed algorithm is very fast. Since unbounded fifos cannot be realized in practice, the considered pipelines are not realistic; however, the computed throughput is known to be achievable using fifos of finite size, although the algorithm returns no actual numbers for these sizes.

From the system designer's point of view, the same pipeline functionality can be achieved by using more or fewer modules, by dividing the task between modules in different ways, or by choosing different implementations for each module. Different functionally equivalent designs can be manually or automatically generated, and a performance analysis step is required in every case. Because of its small running time, my proposed algorithm can be run inside this design exploration loop; the algorithm can therefore indirectly help the design and synthesis steps.

1.6.2 Computing minimum FIFO sizes

Fifos consume a lot of resources: their area sometimes exceeds that consumed by the processing modules themselves. Additionally, large fifos require specific implementations, such as instantiating block-memory primitives on an fpga, which decreases the maximum clock frequency. Since reducing the fifo sizes helps the design in terms of both area and performance, I treat fifo sizing as an important topic of my thesis.

I compute the minimum fifo sizes required by the pipeline to achieve the ideal throughput computed for the case when the fifos are unbounded. I propose two algorithms based on model checking: one exact and one heuristic. The heuristic algorithm is faster, which is an advantage for big pipelines, where the running time grows rapidly with the design complexity. Experimental results suggest that my algorithms are applicable to pipelines of at least five modules, for which an optimal or very good solution is found in minutes or less. I consider them practical since they are not invoked repeatedly during a normal design process but only once.
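The algorithms themselves are presented in Chapter 3. Purely to illustrate the overall shape of such a search, here is a hypothetical greedy loop in C: the throughput_ok oracle stands for the model-checking query (its placeholder body is invented), and the naive round-robin growth policy is mine, not the thesis's heuristic. What makes greedy growth plausible at all is the monotonicity property discussed in Chapter 3: enlarging a fifo never hurts throughput.

    #include <stdbool.h>
    #include <stdio.h>

    #define NMODULES 5  /* hypothetical pipeline length */

    /* Oracle: would these fifo sizes let the pipeline reach the target
       throughput?  In the thesis this question is answered by a model
       checker; the body below is only a stand-in so the sketch runs. */
    static bool throughput_ok(const int size[NMODULES])
    {
        for (int i = 0; i < NMODULES; i++)
            if (size[i] < 4)           /* arbitrary placeholder criterion */
                return false;
        return true;
    }

    int main(void)
    {
        int size[NMODULES];
        for (int i = 0; i < NMODULES; i++)
            size[i] = 1;               /* start with minimal fifos */

        /* Grow fifos until the target throughput is met; monotonicity
           means the loop never needs to shrink a fifo again. */
        for (int next = 0; !throughput_ok(size); next = (next + 1) % NMODULES)
            size[next]++;              /* naive round-robin growth */

        for (int i = 0; i < NMODULES; i++)
            printf("fifo %d: %d entries\n", i, size[i]);
        return 0;
    }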

1.7 Challenge: Sequential RTL issues

Retiming is a well-known sequential optimization technique. It changes the positions of the registers inside a sequential network by moving them forward or backward. The system functionality is preserved, but the resulting performance may be better than the original one.

Although the optimization potential of retiming exists for any sequential network, it becomes immediately visible if a linear pipeline is considered. If each stage takes a certain amount of time for computation, the minimum clock period is dictated by the slowest stage. Equalizing the time taken by the pipeline stages by changing the frontiers between them (i.e., the positions of the registers) leads to the best outcome that can be achieved by retiming alone: the best achievable clock period is the ratio between the total computation time and the number of pipeline stages.

For a high-performance pipeline such as the packet processors considered in this work, retiming is a natural transformation in the optimization step. The rtl designer usually does not spend much time hand-placing the registers carefully but leaves the task to the optimization tool, since at the moment of writing the rtl code he does not know exactly the time taken by each pipeline stage. It cannot be overemphasized that the clock period achievable before retiming is irrelevant; the relevant figure of merit is the clock period after retiming. For example, to correctly compare two equivalent circuits, one has to retime them first: only after retiming is comparing the two clock periods meaningful.

The bottleneck that limits the performance of a high-speed pipeline is the feedback loops. Generally, data flows forward through the pipeline, but control and status signals often travel in the reverse direction. Consequently, the pipeline is no longer a simple loop-free sequential network but a complex one.

The performance of a simple linear pipeline may be increased by adding several extra stages and retiming the resulting pipeline. This is not possible if feedback loops are present. Retiming can only move registers, so the number of registers on any loop remains constant: it is therefore not possible to add extra registers inside a loop without altering the network functionality. Retiming remains a powerful technique, but it has to be combined with other techniques that also modify the combinational logic.

To summarize: if for a combinational network the optimization priority is to reduce the length of the critical paths, for a high performance sequential network the goal is to reduce the length of the critical feedback loops. This task is more difficult in the sequential case.
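In symbols (a standard formulation of the two facts just described): for a loop-free pipeline whose N stages have combinational delays d_1, ..., d_N, retiming can at best balance the stages, giving a clock period of

    c_{min} = \frac{1}{N} \sum_{i=1}^{N} d_i

(ignoring register overhead). For a network with feedback, retiming preserves the number of registers r(L) on every loop L, so no retiming can push the period below

    c \geq \max_{L} \frac{d(L)}{r(L)}

where d(L) is the total combinational delay around loop L. Breaking this bound requires changing the combinational logic itself, which is exactly what the next section proposes.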

1.7.1 Contribution: combining Retiming and Shannon

Optimizing sequential cycles is essential for many types of high-performance circuits, such as pipelines for packet processing. Retiming is a powerful technique for speeding up pipelines, but it is stymied by tight sequential cycles.

The class of techniques that addresses this issue is known as retiming and resynthesis (r&r). Theoretically, r&r performs arbitrary logic optimizations plus retiming; it can be shown that this technique can transform a circuit into any equivalent circuit. Exploring all possible transformations is not practical, so real algorithms usually choose some generic transformations and run them iteratively, in a heuristic manner.

Shannon decomposition is a generic transformation that may reduce the length of a critical cycle. It can be seen as a form of speculation. Designers seldom attack critical cycles by manually combining Shannon decomposition with retiming, since such a manual approach is slow and error-prone. Combining retiming with Shannon decomposition is a special case of the more general r&r technique.

I propose a sequential optimization algorithm that restricts the class of allowable transformations to Shannon decomposition and retiming. The algorithm performs a systematic design space exploration: it finds exactly the best-performing circuit that can be obtained by applying these two transformations. While the algorithm is only able to improve certain circuits (roughly half of the benchmarks I tried), the performance increase can be dramatic (7%-61%) with only a modest increase in area (1%-12%). The algorithm is also fast, making it a practical addition to a synthesis flow.
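For reference, the transformation at the heart of this contribution is the classical Shannon expansion: any Boolean function f can be split around one of its inputs x_k as

    f = x_k \cdot f|_{x_k = 1} + \overline{x_k} \cdot f|_{x_k = 0}

The speculative reading is what matters for timing: both cofactors f|_{x_k=1} and f|_{x_k=0} are computed without waiting for x_k, and a two-input multiplexer selects between them once x_k arrives. A late-arriving x_k therefore traverses only one multiplexer instead of the full logic for f; applied inside a feedback loop, this can shorten the critical cycle enough to give retiming the slack it needs.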

1.8 Contribution summary. Thesis outline

In this thesis I propose a complete design flow for specifying, synthesizing, and analyzing packet processing pipelines, which are performance-critical components of a modern network device such as a packet switch.

I propose peg, a high level domain specific language for specifying the active pipeline elements, the packet header editors. Next, I propose a synthesis technique for transforming peg into actual rtl code, which can later be mapped onto a specific target such as an fpga. I claim, based on experiments, that the proposed synthesis flow is practical: using the presented methodology, entire packet processing pipelines belonging to industrial grade networking devices were successfully modeled and synthesized. The proposed language, as well as the synthesis technique, are described in Chapter 2 of this thesis. Part of the material was published at the Design Automation Conference in 2006 [SHE06].

I introduce an analytic approach for the analysis of packet processing pipelines that offers a worst-case performance guarantee. As a related topic, I address the issue of computing the minimum fifo sizes in a packet processing pipeline. The performance analysis and fifo sizing techniques are detailed in Chapter 3 of this work. Results were published at the International Workshop on Logic and Synthesis in 2007 [SE07].

Finally, I present a low-level sequential optimization algorithm, based on the combination of Shannon decomposition and retiming, which optimizes a certain class of sequential networks such as the high performance packet processing pipelines. The algorithm is presented in detail in Chapter 4. I published a condensed description at Design, Automation, and Test in Europe in 2006 [STE06] as well as a more complete version in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2007 [STE07].

In the last chapter I summarize the results of the present thesis.

Chapter 2

Synthesis of Packet Processing Pipelines

2.1 Overview

Packet processing pipelines are fundamental blocks of many data communication systems such as switches and routers. For example, the linecard in Figure 1.2 contains two such pipelines: an ingress and an egress one, which handle the incoming and outgoing traffic, respectively.

The core of such a structure is a linear pipeline containing several chained modules: the packets flow from left to right and are transformed on the fly. The function of each module and the order in which the modules are connected give the overall pipeline functionality. Each module performs some specific transformation of the packet headers. For example, the ttlupdate module decrements the ttl field of each header and marks the packet to be dropped if the value reaches zero. The operation performed by a module consists of modifying, adding, and/or removing various fields of the packet header and is hence called packet header editing.

The sequential network implementing such a packet header editor is complex. I have encountered over a dozen real modules during my experiments (Section 2.7) and concluded that an efficient rtl implementation is very convoluted even for modules performing simple operations. A high-level approach for synthesizing these modules brings a lot of advantages compared to manual rtl design (Section 1.5). However, generic high-level approaches such as synthesizing C-like behavioral models give poor results: it is sufficient to notice that real projects still use traditional rtl design even though behavioral synthesis has been proposed for several decades. The main drawback of a generic method is the low performance obtained after synthesis. In contrast, a custom method can take advantage of the specific particularities of a small class of problems and give good results.

In this chapter, I propose a high-level model, peg, that can specify the functionality of a given header editing module. I also propose an automated high-level synthesis algorithm that starts from this model and translates it into synthesizable rtl code. This algorithm takes advantage of the specific properties of a header editing module and generates a result comparable to a manual rtl implementation.

The drawback is that my approach has a limited scope. However, a real network device contains several packet editing modules, so my technique covers an important share of the entire design. Moreover, these modules are the components most likely to be altered when specifications are changed. Therefore, it is the synthesis of these modules that benefits most from a high-level approach.

It may be argued that it is desirable to have a uniform high-level description of the entire system. This is not realistic today. It is unlikely that critical components such as memories, transceivers, ddr interfaces, etc., could be synthesized by a generic technique. Practically, each of these has its particularities and is designed manually or using a specialized tool. This observation shows the place of my synthesis technique in the overall picture.

2.2 Packet Processors and Editing

Figure 1.2 shows the block diagram of a packet switch consisting of line cards (only one is shown) connected to a switching fabric that transfers packets. I focus on designing the line cards, which provide network interfaces, make forwarding and scheduling decisions, and, most critically, modify packets according to their contents.

My synthesis technique builds components in the ingress and egress packet processors. A packet processor is a functional block that transforms a stream of input packets into output packets. These transformations consist of adding, removing, and modifying fields in the packet header. In addition to headers defined by network protocols, the switch may add its own control headers for internal use.

Packet processors perform complex tasks through a linear composition of simpler functions. This model has been used for software implementations on hosts [OP92] and on switches [KMC+00]. Another proposed architecture for hardware is a pool of task-specific threads that process the same packet in parallel without moving the packet [BJRKK04].

I use a unidirectional, linear pipeline model that simplifies the implementation without introducing major limitations. For example, Kohler's ip router [KMC+00] uses loops only to handle exceptions; I would do this with a separate control processor. While the logical flow can fork and join, I implement only linear pipelines, which can use flags to emulate such behavior. Non-linear pipelines are more complicated and would not improve throughput. If a packet needs to be dropped or forwarded to the control processor, I set flags in the control header and perform the action at the end of the pipeline. This guarantees every processing element sees all the packets in the same order; packet reordering is usually done in a traffic manager, a topic beyond the scope of this thesis.

Figure 1.2 shows a packet processing pipeline that edits the Virtual Local Area Network (vlan) tag of an Ethernet packet [IEE03] and adds a Multi Protocol Label Switching (mpls) label [RVC01], based on unique flow identification (FlowID). I assume a previous stage has performed flow classification and prepended a control header with the FlowID.

Both the vlan push and mplspush modules insert additional headers after the Ethernet header, while the Time-To-Live (ttl) update and Address Resolution Protocol (arp) resolution modules only modify existing packet fields. The vlan pop module removes a header field. While this pipeline is simple, real switches just perform more such operations, not more complicated ones.

Thus, packet processing amounts to adding, removing, and modifying fields. Even the flow classification stage, which often involves a complex search operation, ultimately produces a modified header. I refer to these operations as packet editing; it is the fundamental building block of a packet processor.

In addition to the main pipeline, Figure 1.2 shows two memory lookup blocks. These blocks store descriptors that define how to edit the headers (e.g., how many mpls labels to add). Here, the FlowID is an index into descriptor memory. A memory lookup module is any component that takes selected fields and produces data for a downstream processing element (e.g., ip address search, present in all ip routers, is a form of generalized memory lookup). Flow classification is thus packet editing with memory lookups.
Modules that use memory lookup assume a previous pipeline stage issued the request, which is processed in parallel to hide memory latency. I do not synthesize memory lookup blocks, but can generate requests and consume results. Because my pipelines preserve packet order, simple fifos suffice for memory interfaces. Hence, I model packet processors as linear pipelines whose elements have four types of ports: input from the previous stage, output to the next, requests to memory, and memory results. Each processing element must be capable of editing packets based on their content and data from memory.

2.3 Related work

Kohler et al. [KMC+00] propose the click domain-specific language for network applications. It organizes processing elements in a directed dataflow graph and specifies the interface between elements to facilitate their assembly. Although click was originally designed for software, Kulkarni et al. [KBS04] propose a hardware variant called cliff, which represents its elements in Verilog. Schelle et al.'s cusp [SG05] is similar. Brebner et al. [BJRKK04] add speculative multi-threaded execution. Unlike cliff and cusp, I fully synthesize my modules instead of assembling library components. Click, furthermore, defines fine-grained elements whose connection has substantial handshaking overhead. My method synthesizes bigger modules; as fewer handshaking signals are necessary, the above-mentioned overhead decreases.

My approach differs from classical high-level synthesis (cf. De Micheli [De 94]) in important ways. For example, I always use a flavor of as-soon-as-possible scheduling for speed; classical high-level synthesis considers others. Furthermore, most operations are smaller than the muxes needed to share them, so I do not consider sharing. The main difference is my choice of computational model. Rather than assume data are stored in and retrieved from memories, I assume data arrive and depart a word at a time: my scheduling considers the clock cycle in which data arrive and can leave.

2.4 The Packet Editing Graph

2.4.1 What to expect from the PEG

The Packet Editing Graph (peg) is a high-level model designed to specify the operations performed by a packet header editing module. The peg is the format taken as input by my synthesis procedure, which is described in the next sections.

I designed peg with two goals in mind. First, it has to offer the right level of abstraction, that is, a familiar view of the problem to the designer most likely to use it. Such a designer usually has good expertise in system architecture and network protocols but may lack rtl design skills. He sees packet header editing as a group of arithmetic and logical operations performed on the packet fields. More abstractly, he sees the output data as a function of the input, not as a sequence of C or rtl statements including loops, jumps, pauses between clock cycles, etc.

The model has to be general enough to express any header editing operation. The reverse is more difficult to argue for, but it is no less important: any feature added without a good reason tends to cancel most of the advantages brought by a high-level language, such as clarity and ease of debugging. Rich languages such as C++ or vhdl have proven this to be true.

Second, performance is critical for many applications such as packet processing pipelines. That is, a good model has to allow an efficient automatic synthesis technique. An elegant approach such as rtl synthesis from a software (C) model may indeed hide all implementation details, but existing automatic synthesis algorithms are not very efficient. Even if there is no theoretical barrier against developing efficient algorithms in the future, such an approach is uninteresting today. Another reason for hiding the implementation details is the desire to synthesize the module for different targets with minimum effort.

I believe that peg answers both issues well, as concluded at the end of this chapter, where experimental data are presented.

2.4.2 A PEG Example

Figure 2.1 shows a peg for a simplified mplspush module. The mpls protocol adds a label to the beginning of a packet that acts as a shorthand for the ip header. When another switch receives the packet, it uses separate rules to forward it. As the technique allows stacking, a packet may contain several mpls labels.

As seen by the mplspush module, a packet starts with a control header, which contains various fields such as flowid, flags, and lct. This is followed by zero or more mpls labels and the packet payload, which can be arbitrary data.

Figure 2.1: Graphic peg specification for the mpls module

The module in Figure 2.1 inserts up to three mpls labels immediately after the control header, according to a transaction descriptor (td). The packet processor needs one td for each processed packet: in this example, it obtains them by issuing requests to a memory lookup engine in a previous pipeline stage. The mplspush module accordingly has an auxiliary input to receive this td from the memory lookup. The module also updates the label count (lct) of the control header by adding to its current value the label count field (lc) of the td, so that the new header reflects the correct number of labels.

2.4.3 PEG description

In this section I give a detailed description of the peg. A peg is an acyclic, directed graph. It consists of three subgraphs: input, computation, and output. Each subgraph contains nodes of several types. In the above example, the various types were represented by different shapes. A systematic list of these types and their exact meaning is given below. Although the peg is conveniently depicted as a graphical model, I introduce in the sequel a text format for rigorously defining the type of the nodes, their attributes, and their functionality.

#the input subgraph
pktin PIN 96
auxin TD 128

#the computation subgraph
alias H1 32 PIN 0 31
alias FLOW_ID 16 PIN 32 47
alias H2 8 PIN 48 55
alias LCT 8 PIN 56 63
alias H3 16 PIN 64 79
alias FLAGS 8 PIN 80 87
alias H4 8 PIN 88 95
alias SHIM3 32 TD 0 31
alias SHIM2 32 TD 32 63
alias SHIM1 32 TD 64 95
alias TD_FLOWID 16 TD 96 111
alias TD_LC 8 TD 112 119
alias TD_FLAGS 8 TD 120 127
arith NEW_FLAGS 8 and FLAGS TD_FLAGS
arith NEW_LCT 8 + LCT TD_LC
const FILL_FF 8 X"FF"
alias NEW_HDR 96 \
    H1 0 31 \
    TD_FLOWID 0 15 \
    NEW_FLAGS 0 7 NEW_LCT 0 7 \
    H3 0 15 FILL_FF 0 7 H4 0 7
const C0 8 0
const C1 8 1
const C2 8 2
arith GT0 1 > TD_LC C0
arith GT1 1 > TD_LC C1
arith GT2 1 > TD_LC C2

#the output subgraph
pktout POUT OHDR
odata OHDR NEW_HDR BR0
cond BR0 GT0 OSH1 ! OPD
odata OSH1 SHIM1 BR1
cond BR1 GT1 OSH2 ! OPD
odata OSH2 SHIM2 BR2
cond BR2 GT2 OSH3 ! OPD
odata OSH3 SHIM3 OPD
payld OPD 96

Figure 2.2: Textual peg specification for the mpls module

This textual format is not meant to be human writable or readable. It is easy to imagine a gui for designing a peg graphically, but I consider this beyond the scope of my thesis. The syntax is trivial. Each line describes a node: it begins with the node type and its name, followed by some fields, according to the node type. The quickest way to understand the peg syntax is to compare the graphical representation of the mpls module in Figure 2.1 with the equivalent textual specification in Figure 2.2.

2.4.3.1 The input subgraph

This subgraph is trivial: it consists of a pktin node and an optional auxin node, which specify the main packet input and the possible auxiliary input, respectively. They are described below:

Syntax pktin name size Example pktin N 128

Outcome: N represents the input packet, which has a minimum length of 128 bits. Every peg has exactly one pktin node, which represents the input packet. The size parameter is the minimum packet length.

Syntax auxin name size Example auxin N 112

Outcome: N represents the auxiliary input, which is 112 bits wide. This node is optional, i.e., it is present only for modules that require an auxiliary input, such as a reply coming from a memory lookup engine. This input remains constant for the entire time a packet is processed and changes only between packets. At most one auxin node may be present in a peg. If several auxiliary inputs are required, they can be merged into a single auxin node: the rtl designer must combine the appropriate handshaking signals.

2.4.3.2 The computation subgraph

This subgraph is itself a dag. Its nodes perform various computations such as arithmetic or logic operations, or extracting and concatenating bit fields. Custom operations are also supported, assuming that corresponding rtl code (e.g., vhdl) is provided. Each node in the computational subgraph takes a number of inputs according to its functionality; these inputs may belong to the input subgraph or may be computational nodes themselves. To avoid a series of ambiguities, the peg requires that each computational node be assigned a size, which is the size of the computation result. Even if, for some operations, this size could be inferred from the sizes of the input operands, this cannot be safely done in the general case. Below, I describe the types of nodes supported by the computational subgraph:

Syntax const name size const Example const N 12 7

Outcome: N is the constant 7 represented with 12 bits. The const node is used to represent a constant.

Syntax arith name size op I0 [ ... ] Example arith N 8 and A B

Outcome: N is the bitwise and of the two inputs A and B, extended or truncated to 8 bits. The arith node takes as many inputs as needed by that particular operation. The operation op can be any arithmetic or logic operation supported in vhdl. Note that the size of the result is unambiguously specified; it is not inferred from the size of the inputs. This avoids a lot of confusion, such as deciding the size that results from adding two 8-bit numbers.
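As an illustration of this size rule, the declared size can be thought of as a mask applied to the raw result of every operation. The following is a minimal C++ sketch, not code from my tool, and the helper name is mine:

#include <cstdint>

// Hypothetical helper: truncate (or zero-extend) a raw result to exactly
// `size` bits, so a node's width is always the declared one, never inferred.
uint64_t resize(uint64_t raw, unsigned size) {
    if (size >= 64) return raw;                  // nothing to mask off
    return raw & ((uint64_t(1) << size) - 1);
}

// e.g., for "arith NEW_LCT 8 + LCT TD_LC": resize(lct + td_lc, 8)
// keeps 8 bits and silently drops any carry out of the declared width.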

Syntax alias name size I0 msb0 lsb0 [ ... ] Example alias N 10 A 0 3 B 10 15

Outcome: N is the concatenation of 4 bits (0:3) of A and 6 bits (10:15) of B. The alias node concatenates subfields from one or more inputs. The size has to match the sum of the subfield sizes. In particular, this node can have a single input, in which case it may be used to extract a subfield of a node.

Syntax extern name size expr Example extern M 14 A

Outcome: the computation for M is done by an external module (M.vhdl). An external module, taken from the file name.vhdl, performs the computation. The input of the module is expr. The module has to provide an input req and an output rpy, whose sizes have to match the size of expr and of the current extern node, respectively. The module may have additional ports, which have to be handled manually. The module may contain sequential logic, but from our point of view it has to behave like a purely combinational node, i.e., its rpy output has to reflect the changes of its req input in the same clock cycle. If expr is the special symbol "!," the module has no req input. This is useful, for example, for storing parameters, which from our point of view behave like constants but can be modified by the configuration circuitry that assists the pipeline.

2.4.3.3 The output subgraph

The output subgraph specifies the data that the module writes to the main packet output and, optionally, to the auxiliary output. It contains a pktout node and an optional auxout one. The pktout node is in fact the root of a control-flow subgraph. For each packet, this subgraph is traversed, starting from the root down to one of its leaves. Once a leaf is reached, the process is implicitly repeated to handle the next packet.

When the control flow reaches an odata node, the module writes some data to the packet output; this node makes reference to a node in the computation subgraph, which specifies the data to be written. After that, the control is passed to the unique successor node. The output subgraph also contains cond nodes, which write nothing to the output but pass control flow to one of their successors, depending on some conditions. These conditions are also nodes from the computation subgraph. The leaves are always payld nodes, which copy the rest of the input packet (usually the payload) to the output. Below I describe the nodes supported by the output subgraph.

Syntax pktout name dest Example pktout N N1

Outcome: N is the root of the output packet subgraph. A peg contains one pktout node, which represents the output packet. The dest field points to the first active node in the output packet subgraph.

Syntax auxout name size expr Example auxout N 48 A

Outcome: the auxiliary output N, of size 48, will output the expression A. At most one auxout node may be present in a peg. The node is optional, i.e., it is present only for modules which have an auxiliary output, such as a request going to a memory lookup engine. If more auxiliary outputs are needed, see the description of the auxin node.

Syntax odata name size expr dest Example odata N 16 A N1

Outcome: if N is reached, the module outputs 16 bits corresponding to expression A, then passes the control to node N1. The odata node instructs the module to write some data to the output packet.

The data to be written is taken from expr, which is a node from the computational subgraph, such as an arithmetic one. After that, the control is passed to node dest.

Syntax cond name expr dest [ ... ] ! dest Example cond N A N1 ! N2

Outcome: when reached, node N transfers the control to N1 if A ≠ 0, and to N2 otherwise. The cond node behaves like an if ... elsif ... construct. Note that the condition for the last destination is marked by "!," which I use to denote "always." Indeed, N is required to pass control to one of its successors for any values of the input conditions. The main purpose of the cond node is to model conditional field insertions and removals, but its use is not restricted to these operations.

Syntax payld name offset Example payld N 128

Outcome: when reached, N copies the input packet to the output, starting from offset 128 and up to the end. The payld node is used to handle the payload of the packet, which may have a variable size. Usually, the offset points immediately after the packet header. It has to be less than or equal to the minimum input packet size, as specified by the pktin node.
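To summarize the node types above, the sketch below shows one plausible in-memory representation. The thesis does not list my tool's actual data structures, so all C++ names here are illustrative; later sketches in this chapter reuse this PegNode type.

#include <string>
#include <vector>

enum class NodeType {
    PktIn, AuxIn,                       // input subgraph
    Const, Arith, Alias, Extern,        // computation subgraph
    PktOut, AuxOut, OData, Cond, Payld  // output subgraph
};

struct PegNode {
    NodeType type;
    std::string name;
    unsigned size = 0;                  // result width in bits (or payld offset)
    std::string op;                     // arith operator or extern expression
    std::vector<PegNode*> inputs;       // operands / subfield sources / conditions
    std::vector<PegNode*> dests;        // control successors (output subgraph)
};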

2.5 The Synthesis Procedure

The peg is an abstract model and can be synthesized to various hardware or software targets. I address the hardware synthesis of modules that belong to a pipeline such as the one in Figure 1.2 and interact with the environment using the protocol described in Section 2.5.1. This target is an attractive choice for high-performance network devices. The proposed technique translates the peg specification to an rtl model, i.e., synthesizable vhdl code. Since all modules and the interconnecting fifos respect the mentioned I/O protocol, assembling the pipeline from its components becomes a mechanical operation.

The synthesis starting point, the peg, offers a flat, implementation-independent view of the module. Inputs and outputs can have different sizes and they are considered to be accessible as operands or destinations at any time. The challenge is to translate this model into rtl, where packets enter and exit the module on the fly, as a sequence of words of fixed size. Consequently, different fields of the packet are present inside the real module on various clock cycles and the module has to perform the right operations at the right moments. Fields do not generally fall on word boundaries: some fields of the input packet have to be assembled over several clock cycles and some fields of the output packet have to be output over several clock cycles. Moreover, since the peg allows conditional insertions and removals, there is not always a simple way of deciding when and where a field in the output graph has to be sent to the output.

Inserting and removing data changes the size of the packet. For some clock cycles, the input has to be stalled, the output has to idle, or both. Moreover, some fields in the output packet may depend on fields that arrive later in the input packet: the corresponding output word cannot be generated before the needed input arrives. The data does not flow through the pipeline module uniformly; valid data tokens are interleaved with idle ones.

The main goal of the synthesis technique below is to keep this data flow as close as possible to the ideal case, i.e., to insert as few input/output idle cycles as possible. However, the overall performance depends not only on the number of cycles but also on the cycle period. Therefore, concentrating complex operations inside a single clock cycle may lead to a different outcome than intended.

Figure 2.3: A module's I/O signals and its two components


2.5.1 The module I/O protocol

Within the pipeline, the packet editor module interacts with the adjacent fifos through four interfaces: the packet input and output, and the optional auxiliary input and output (Figure 2.3, where the module is represented by the gray area).

The module sees the input packet as a sequence of w-byte words arriving on the idata port, where values of 4, 8, and 16 (i.e., 32, 64, and 128 bits) are typical for w. Similarly, the output is generated as a sequence of w-byte words on the odata port. Additionally, data words are accompanied by two framing flags that identify the beginning and the end of the packets. The signal sop is asserted for the first word of the packet and eop for the last. Both signals are asserted in the same cycle if the packet is small enough to fit within one word. All modules assume these signals arrive in a consistent order and generate them as such. Since the packet may contain a number of bytes that is not a multiple of w, the last data word may contain from 1 to w significant bytes. The mod signal, of size log2(w), gives the number of valid bytes in the last data word minus one. This signal is meaningful only when the eop signal is asserted; any pipeline element is free to output undefined data on mod otherwise.

The input interface consists of the idata, isop, ieop, and imod signals described above, plus two handshaking signals: val and rd. The val signal, generated by the input fifo, tells the module if data is available at the input: when val is asserted, the i* signals carry valid values; otherwise these signals have to be ignored. The rd signal, generated by the module, instructs the input fifo to show the next values on the i* port in the next cycle; if it is de-asserted and val is active, the fifo has to keep the values unchanged. The value of rd is ignored by the fifo if val is de-asserted; in this case, the fifo will present valid data on the i* port as soon as it becomes available, even if there is no explicit request by the module.

The output interface consists of the odata, osop, oeop, and omod signals, plus two handshaking signals: wr and full. The wr signal, generated by the module, is asserted when the values on the o* port are written to the output fifo in the current cycle. If the full signal, generated by the output fifo, is asserted, wr is ignored and no data is written.

The optional input and output auxiliary ports use the same handshaking scheme as above, i.e., they use the val, rd and wr, full signals. The only difference is that these ports may have arbitrary bit-widths and that the framing flags sop, eop, and mod are missing. The module makes one and only one access to each of these auxiliary ports for each packet. This differs from the main packet input and output ports, which transfer each packet as a sequence of words, in multiple cycles. For the auxiliary input, the read request auxird is asserted at the start of each packet; therefore, the data value on this port remains stable while the entire current packet is processed. For the auxiliary output, the write request auxowr is asserted as soon as the data to be written becomes available.
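For readers more comfortable with code than with signal lists, the sketch below restates the packet interfaces as C++ structs. This is purely illustrative; the real interface is a set of vhdl ports, and the grouping chosen here is mine:

#include <cstdint>

template <unsigned W>                 // word size in bytes (4, 8, or 16)
struct PacketPort {
    uint8_t data[W];                  // idata / odata word
    bool    sop, eop;                 // framing: first / last word of a packet
    uint8_t mod;                      // valid bytes in the last word, minus one
};

struct ModuleIO {
    // input side: val is driven by the input fifo, rd by the module
    PacketPort<8> in;   bool ival, ird;
    // output side: wr is driven by the module, full by the output fifo
    PacketPort<8> out;  bool owr, ofull;
};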

2.5.2 Core and Wrapper

The module cannot continue its normal operation in two cases: when input data is not available, i.e., the input fifo is empty, and/or when output data cannot be written, i.e., the output fifo is full. To address this issue systematically, I divide the module into two parts, as depicted in Figure 2.3. The main part is the core module, which implements the actual module functionality assuming that the two above exceptions never occur, i.e., that input data is always available and writing to the output is always possible. The second part is a wrapper, consisting of simple logic, which stalls the core module in the above situations. To facilitate this process, the core module provides a special input, suspend, that freezes the state of all its sequential elements. This signal behaves as a general clock gate for the core module. Consequently, the synthesis of the core module can ignore the above input/output exceptions, provided that trivial stalling logic is added later. The wrapper also masks the rd and wr signals generated by the module in a stall cycle. This is necessary for correct sequential behavior.

The module may be stalled unnecessarily. This is the case when the full signal is asserted but the module does not try to write in that cycle. The module would be able to continue its operation in this situation but my wrapper does not allow it. I chose this behavior for performance reasons. I mentioned that not only the number of cycles has to be minimized but also the clock period. The core module may generate the ird and owr signals very late within the clock cycle. Often they belong to the critical loops of the overall sequential system. As a result, one should avoid connecting these signals as inputs to complicated logic functions; it is better to keep the surrounding logic as simple as possible, as my wrapper does.

On the other hand, the advantage of a more aggressive scheme is minor. In the above scenario, if the output fifo is full, this very likely happens because the downstream module cannot sustain the same data rate as the current module provides. Since the current module is not the bottleneck, saving one clock cycle at this point would not improve system performance.

The wrapper is generated by a simple script because the differences between the wrappers to be generated for different modules are minor. These differences, e.g., the presence or absence of the auxiliary input and output ports, and their sizes, can be represented by several parameters that are passed to the script as arguments. The rest of this chapter focuses on the synthesis of the core module itself.
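A minimal sketch of this stalling policy, assuming the simple scheme just described (this is my paraphrase in C++, not the generated wrapper itself):

// One cycle of the wrapper: the core is frozen whenever the input fifo is
// empty or the output fifo is full, and its read/write requests are masked
// in that cycle. Deliberately pessimistic, as discussed above.
struct Wrapper {
    bool step(bool val, bool full, bool core_rd, bool core_wr,
              bool& fifo_rd, bool& fifo_wr) {
        bool suspend = !val || full;    // stall condition
        fifo_rd = core_rd && !suspend;  // masked requests seen by the fifos
        fifo_wr = core_wr && !suspend;
        return suspend;                 // drives the core's clock-gate input
    }
};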

2.5.3 Splitting Data into Words

The synthesis procedure begins by dividing the input and output packets on word boundaries using the procedure listed in Figure 2.5. Dividing the input packet is straightforward; dividing the output packet map is more complicated because of the conditional nodes. Figure 2.4 shows the result of applying this step to the mplspush example from Figure 2.1.

I restructure the packet map such that conditions are only checked at word boundaries. This ensures that only complete words are written to the output every cycle. For example, the > 0 condition in Figure 2.1 has been moved four bytes earlier in Figure 2.4 and the intervening four bytes have been copied to the two branches under the conditional node.

The algorithm in Figure 2.5 recursively walks the packet map to build a new one whose nodes are all w bytes long (the word size). Each node is visited with a vector v that contains the bits that are "pending" in the current word. Output nodes are added

Figure 2.4: Split into 64-bit words

to this vector until w ∗ 8 bits are accumulated: at this point a new output node n′ is created by build-node by putting together the bits pending in v. When a conditional node is reached, the algorithm copies it to a new node n′ and visits the two successors under the conditional. The same v is passed to each recursive call because the bits that appeared before the conditional and are not written at this point have to be written later in both cases.

This can generate an exponentially large tree, but in practice this never happens. For example, there are four paths in Figure 2.4 but they lead to only two different final states: the one- and three-label cases converge since they require the same alignment;

function Restruct(node n, pending bits v, word size w)
    clean-visit ← true if v is empty
    if clean-visit and cache contains n then
        return cache[n]
    case type of node n of
        output data:                        {≥ 1 bytes, one successor}
            append n to v                   {put n in current word}
            if v is w ∗ 8 bits then         {finished word}
                n′ ← build-node(v)          {next word node}
                n′′ ← Restruct(successor of n, (), w)
                make n′′ the successor of n′
            else
                n′ ← Restruct(successor of n, v, w)
        conditional:
            n′ ← copy of the conditional n
            for each successor s of n do
                n′′ ← Restruct(s, v, w)
                add n′′ as a successor of n′
    if clean-visit then cache[n] ← n′
    return n′                               {the restructured node for n}

Figure 2.5: Structuring a packet map into words CHAPTER 2. SYNTHESIS OF PACKET PROCESSING PIPELINES 40 the zero- and two-label cases are similar. I handle reconvergence by maintaining a cache of nodes that have been already visited. The cache mechanism is activated only if a node visit is “clean,” i.e., the pending vector v is empty. In these cases, if the node has already been visited (so it’s in the cache) the graph traversal is stopped and the restructured subgraph already computed for the previous visit is used. If the node has not been visited (so it’s not in the cache) then the traversal continues and the restructured subgraph is added to the cache so further visits of this node might use it. Consequently, the paths that use the same cache entry reconverge to the same subgraph.

2.5.4 Assigning Read Cycle Indices

After splitting the original graph into words, the resulting graph still respects the peg semantics, with two important differences, as shown in Figure 2.6. First, the pktin node in the input subgraph is split into several nodes of w bytes each. Second, all the odata nodes in the output subgraph have a fixed size, i.e., w bytes each.

I proceed now to assigning read cycle indices to each node in the graph. Figure 2.6 depicts these indices inside small black squares. To a first approximation, these indices represent the clock cycle in which each node is handled, but they have a slightly different meaning. A node's read cycle index tells how many words have to be read from the input fifo to be able to compute the node's value, minus one.

The nodes representing the input packet are trivially labeled with consecutive indices, starting from 0. The optional auxin node is labeled with 0, because its value is known simultaneously with the first packet word. For the computational subgraph nodes, I assign these indices in topological order: a node's index is the maximum value among its ancestors' indices. For example, the & node in the figure has two operands: the flags field from the auxiliary input, and the flags from the input packet. They have indices 0 and 1, respectively, so the &

Figure 2.6: Read cycle indices and delay bubbles added

node is assigned index 1. Indeed, the module has to read at least two words from the input packet before computing this operation because the second operand belongs to the second word of the packet. These indices do not have a direct correspondence to physical clock cycles: this correspondence is computed in a scheduling step described in the next section. Intuitively, the indices reflect the causality relations between the nodes inside the module.

By definition, the pktout node, the root of the output subgraph, is assigned index −1. The rest of the nodes in the output subgraph are assigned indices as described above, taking the maximum value from their ancestors. Nodes generally have two kinds of ancestors: nodes from the computational subgraph and nodes from the output subgraph. The current procedure makes no distinction between these ancestors. I exemplify this for an odata node. Two preconditions are necessary for this node to write a word to the output. First, the value to be written has to be known: this value comes from the ancestor in the computational subgraph. Second, all cond nodes above it have to be evaluated to ensure that the current node is on the active path: this information comes from the ancestors in the output subgraph.
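The assignment itself is a single topological pass. The C++ sketch below shows the whole rule, again over the illustrative PegNode type; all_ancestors is a hypothetical helper returning a node's operands together with its control predecessors:

#include <algorithm>
#include <map>
#include <vector>

std::vector<PegNode*> all_ancestors(PegNode* n);  // operands + control predecessors

std::map<PegNode*, int> assign_read_indices(const std::vector<PegNode*>& topo) {
    std::map<PegNode*, int> idx;                  // read cycle index per node
    int next_word = 0;
    for (PegNode* n : topo) {                     // nodes in topological order
        if      (n->type == NodeType::PktIn)  idx[n] = next_word++; // words 0,1,2,...
        else if (n->type == NodeType::AuxIn)  idx[n] = 0;   // known with word 0
        else if (n->type == NodeType::PktOut) idx[n] = -1;  // by definition
        else {
            int m = 0;                            // maximum over all ancestors
            for (PegNode* a : all_ancestors(n)) m = std::max(m, idx[a]);
            idx[n] = m;
        }
    }
    return idx;
}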

2.5.5 Scheduling

Once read cycle indices are assigned, the next step is to schedule the graph, i.e., to assign a physical clock cycle to each operation. Each read cycle index corresponds to one or more physical clock cycles. The number of physical cycles corresponding to an index can vary depending on the active path in the output subgraph, i.e., depending on the direction in which the cond nodes are evaluated. Therefore, a given node is logically processed at a fixed time, according to its index, but physically this can happen in a different clock cycle. A node labeled with index k will have a valid value in all clock cycles corresponding to its index k. In particular, as the input words are labeled with consecutive indices 0, 1, 2, etc., the idata input will display the k-th packet word during all clock cycles corresponding to index k.

In the scheduling step, the algorithm inserts "bubbles" in the graph; each bubble corresponds to a transition between two clock cycles. Figure 2.6 depicts them as small black rectangles. This operation can be done in several ways, but the result has to respect the following properties:

• at least k bubbles are present between any two adjacent nodes whose indices differ by k: this rule states that a physical clock cycle cannot be shared by two read cycle indices and that each index corresponds to at least one clock cycle.

• at least one bubble is present on each possible path between any two odata nodes in the output subgraph: this rule states that two data words cannot be written to the output within the same clock cycle.

Following the first rule, two bubbles were inserted between the top-most node and the first output node because the difference between their indices is two. To comply with the second rule, bubbles were added on the arcs below the first conditional because they connect odata nodes with the same index. For the same reason, bubbles were added below the group of the remaining two conditionals. Since bubbles mark clock edges, it becomes clear that the same index may correspond to several clock cycles.
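Since the insertion can be done in several ways, the C++ sketch below shows one simple, slightly conservative policy of my own, not necessarily the one my tool uses. Placing a bubble on every arc that enters an odata node is stronger than the second rule requires, but it guarantees the rule on every path, including paths through conditionals:

#include <algorithm>
#include <map>

// Number of bubbles to place on the arc from -> to.
int bubbles_on_arc(PegNode* from, PegNode* to,
                   const std::map<PegNode*, int>& idx) {
    int need = idx.at(to) - idx.at(from);  // rule 1: k bubbles for an index gap of k
    if (to->type == NodeType::OData)
        need = std::max(need, 1);          // conservative form of rule 2
    return std::max(need, 0);
}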

2.5.6 Synthesizing the controller

After read cycle indices and bubbles have been added, the module's control fsm (Finite State Machine) and its data path can easily be synthesized: the input and computational subgraphs generate the data path, and the output subgraph generates the controller.

The state diagram of the control fsm closely follows the structure of the output subgraph of the scheduled peg, as can be seen in the asm (Algorithmic State Machine) chart depicted in Figure 2.7. The bubbles that were inserted in the output subgraph during the scheduling step become states of the resulting control fsm. The bubble that was inserted after the pktout node is the initial state of the fsm; this is labeled init in Figure 2.7. The bubbles adjacent to the leaves of the subgraph are treated in a special way: all are merged into a common state, labeled rep, that handles the variable-length payload. When the module reaches this state, it copies the packet input to its output, with no transformation, until the end of the packet. After the packet ends, the machine goes to the init state and starts to process the next packet.

Figure 2.7: ASM chart for the controller synthesized from Figure 2.6

Each remaining bubble is simply translated into a regular state. The cond nodes are trivially translated into decision nodes in the resulting asm chart. The signal carrying the condition(s) to be evaluated comes from the data-path.

The odata nodes become control signals that go to the data-path and select the source of the data to be output on the odata port. In a practical implementation, they act as select signals driving a multiplexer. An odata node also activates the owr control signal, instructing the output fifo that a valid data word has to be written. The second scheduling rule ensures that at most one output node can be encountered on any path between two states. Therefore, it is guaranteed that the module never tries to write two words to the odata port in the same cycle, which is impossible. The owr signal remains de-asserted for state transitions where no odata node is encountered, and consequently the module output will idle for that clock cycle. The value on the odata port is then specified as don't care to create opportunities for further logic optimization.

The ird signal is asserted for each arc that joins two nodes with different read cycle indices. The first scheduling rule ensures that at most one such arc is present between two states: in this case the index increases by exactly one. Therefore, it is guaranteed that two words are not read from the input fifo in the same cycle, which is also impossible. The ird signal remains de-asserted for state transitions where the index remains constant, and consequently the input fifo is stalled for that clock cycle.

2.5.7 Handling the end of a packet

The leaves in the peg output subgraph correspond to a payload processing state that copies the module input to its output until the packet ends. Two such leaves are present in Figure 2.6. The one on the right corresponds to the cases where no label or two labels, i.e., 0 or 8 bytes, are inserted in the header. Since in this example w = 8, the payload in the output packet has in both cases the same alignment as in the input packet. The copy operation is trivial in this case since each payload output word corresponds to a word from the input payload. The left leaf is slightly different: it corresponds to the cases where one or three labels are inserted in the header, i.e., 4 or 12 bytes. Therefore, the payload data in the output packet has a misalignment of 4 bytes relative to the word boundaries; the module has to assemble each payload output word from two consecutive input words.

After restructuring the peg into words of w bytes, all paths in the output subgraph finally reconverge to no more than w different leaves, corresponding to the different misalignments, from 0 to w − 1 bytes. Instead of using a different state for each leaf, the algorithm merges all of them into a common rep state. This state is accompanied by an auxiliary align register of size log2(w) that encodes the misalignment between the input and output payload when the controller operates in the rep state. The align register is loaded with the correct value on each path that terminates with a leaf by asserting the align ld signal.

First, the align register is used to generate the correct payload output words. I have mentioned that, in the general case, an output word is a combination of two consecutive words, i.e., the current input word and the previous one. Using align, the correct bits are selected from each. In the particular case where the input and output are aligned, no bits from the previous word are used. In Figure 2.7, the decision node below the rep state selects the correct bits to be used for each alignment.

Second, align is used to handle packet termination. The value of align can be seen as the number of pending bytes from the previous word. Indeed, in our example, if the data is perfectly aligned then align = 0, meaning that all the bits from the previous word have already been written. If align = 4, it follows that four bytes from the previous word are pending. When ieop is detected, the module knows that the current input word is the last in the packet. At this point, the module has to output the pending bytes plus the valid bytes from the current word before going to the init state. Two cases are possible: if the remaining data can be written in a single output word, i.e., imod + align ≤ w − 1, then the module goes directly to the init state; otherwise, a second cycle is required to finish the transfer. The one-bit register eop reg is used to handle the second case. It is loaded with one in the first cycle, when ieop is detected, and it forces the controller to move to the init state in the next cycle. This prevents the machine from looping forever. The operations described in Figure 2.7 are performed by the two decision nodes in the bottom right. The oeop signal is asserted when the last word is written. The ird signal is asserted together with oeop: this fetches the first word of the next packet, so the controller can continue its operation without pause between consecutive packets.
The osop signal, not covered by the figure, is asserted together with the first data word written to the output after the start-of-packet signal isop is detected.
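A small C++ sketch of the rep-state datapath and of the termination test, under the byte conventions above (bytes stored first-byte-first; this is my illustration, not generated code):

#include <array>
#include <cstdint>

constexpr unsigned W = 8;  // word size in bytes, as in the running example

// Each payload output word is the pending tail of the previous input word
// followed by the head of the current one; align = 0 degenerates to a copy.
std::array<uint8_t, W> payload_word(const std::array<uint8_t, W>& prev,
                                    const std::array<uint8_t, W>& cur,
                                    unsigned align) {          // 0 .. W-1
    std::array<uint8_t, W> out{};
    for (unsigned i = 0; i < align; ++i)
        out[i] = prev[W - align + i];          // pending bytes first
    for (unsigned i = align; i < W; ++i)
        out[i] = cur[i - align];               // then the current word's bytes
    return out;
}

// End-of-packet test: the align pending bytes plus the imod + 1 valid bytes
// of the last input word fit in one output word iff align + imod + 1 <= W.
bool last_word_fits(unsigned align, unsigned imod) {
    return align + imod + 1 <= W;
}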

2.5.8 Synthesizing the data path

The nodes in the computational subgraph of the peg are translated directly into combinational logic to form the data-path. Additionally, the data-path also includes the output multiplexer that generates the output word odata. The algorithm replaces the bubbles in the computational subgraph with registers; their width corresponds to the data width of the corresponding arc. The resulting data-path has the property that a node has a valid value during the clock cycles corresponding to its read cycle index, as described in Section 2.5.4. A read cycle index may contain several clock cycles. Therefore, a register has to be loaded on the last clock cycle contained in the read cycle index that precedes its own and to hold its value during its own cycle index. Otherwise, the value of the register is undefined and I treat its clock enable signal as don't-care. Generally, the combinational logic that produces these signals becomes trivial after minimization.

Many of the data-path registers could be removed during this step, but I have chosen not to do so. The rationale is that the resulting sequential network is not the final product but the input for a low-level synthesis and optimization tool that has to be invoked with the retiming option enabled. Having more registers on a combinational path helps the retiming procedure. Retiming is critical since the bubble positions within the peg computational subgraph are computed in the scheduling step without considering the time taken by the various operations: the delays of the different components cannot be known before logic minimization.

Consequently, minimizing the number of registers can deteriorate the module performance. Since I have targeted fpga devices, where the area cost of a register is very small, I have not explored any more sophisticated technique. However, it is theoretically possible to obtain a better result in terms of both area and performance by removing the extra registers and augmenting the corresponding paths with multicycle timing constraints.

2.5.9 Comments on scheduling and binding

The above scheduling technique may be seen as a special case of asap scheduling combined with a trivial binding step that shares no resources [De 94]. However, the present approach differs from the standard techniques used in high-level synthesis because of the different computational model involved and the different architecture of the module. Nevertheless, I believe that looking at the current issues from the classical perspective, pointing out the similarities and the differences, could be interesting. Strictly looking from that perspective, one may doubtlessly argue that a more sophisticated binding/scheduling technique might lead to better results in terms of performance and area.

Classical scheduling and binding assume that the data-path consists of computational units and storage elements connected by buses, resembling a small custom cpu [De 94]. In this framework, the scheduling and binding steps try to reduce the total number of clock cycles for a complex operation, the amount of resources used, or both, with the only constraint that operations have to be done in causal order. An important assumption is that data can be read/written from/to the storage elements on any clock cycle. The only drawback is that multiple accesses in the same clock cycle increase the complexity of the data-path components and the number of buses.

The framework considered in this work is different. Data enters and exits the module as a sequence of words, in a fixed order that is given by the format of the packet header and is not necessarily the best for scheduling the computation. For performance reasons, the module has to operate on the data stream on the fly, letting the data flow through as uniformly as possible: the data-path becomes a pipeline. Consequently, the order of performing the computations has not only to respect the causality constraints but heavily depends on the position of the operands and results in the input packet and the output packet, respectively. If a result has to be written before its operands are available, the module has to stall for one or several cycles. This reduces the actual throughput of the pipeline, as shown in Chapter 3. Avoiding unnecessary stalls becomes the first priority for synthesis.

The individual operations of a typical header editing module consist of simple logic, as illustrated in Figure 2.6; many of them are shifts and concatenations, which require zero logic cost. Intuitively, these modules are not number-crunching engines but data-shuffling components: this is probably what one expects from a module inside a network switch or router. Most individual operations take much less than a clock cycle and their area is smaller than the one occupied by a multiplexer of the same width. Moreover, two or more adjacent simple operations can be successfully simplified by logic synthesis.

Therefore, there is no advantage in sharing computational units, since this eventually leads to both lower performance and increased area. I simply instantiate a computational unit for each operation found in the peg. This corresponds to a trivial binding with no resource sharing.

Because of this choice, there are no conflicts between different operations. All computations can be performed as soon as their operands are available. Practically, computations are performed on every clock cycle, regardless of the validity of their operands.
It is the responsibility of the control fsm to use their results only when they are consistent. From the classical perspective, this corresponds to asap scheduling; indeed, if there are no resource constraints, asap scheduling brings no penalty compared to any other possible technique.

To conclude, classic high-level synthesis considers various binding and scheduling techniques, each with its advantages and disadvantages. By contrast, the present problem is formulated in a different framework, targets a different architecture, and pursues different goals. I have found a simple, good (even if not optimal) solution for this problem.

2.5.10 Comments on FSM encoding

Starting from the asm chart, the fsm can be encoded using any technique. I leave this encoding step to be performed by the low-level synthesis tool. However, I noticed experimentally that the tool evaluates several alternatives but usually chooses a one-hot encoding, except for very small machines.

It can be argued that a more efficient encoding could be achieved by exploiting the high-level structure of the fsm, which is hidden from a low-level synthesis tool. This is correct. Grouping the leaf states in a common rep state and introducing the eop reg and align reg registers performs a de facto partial encoding of the fsm. In a prior implementation, I treated the leaf states almost like regular states, i.e., I created a different state for each bubble: the performance increased when I switched to the present scheme. This grouping trick was possible because these final states have very similar functionality; therefore, a better encoding could be found by exploiting high-level information. However, the regular states have few properties that can be immediately exploited.

A better encoding depends on the Boolean properties of the whole module, including the control fsm and the data-path. This information is available only after the first steps of logic synthesis and optimization. Consequently, the best choice is to rely on an existing encoding and sequential synthesis tool.

2.6 Implementation

I implemented the proposed synthesis technique on a Linux Fedora Core 3 platform, coding the algorithms in C++ and using perl scripts to put them together.

2.6.1 Textual frontend

The peg model is the starting point for the whole synthesis procedure; moreover, all the test cases I used in the process of writing and debugging the software were written in peg. However, peg was not designed with the intention of being human writable or readable. The complex modules used for the final testing were not written in peg but in a human-friendly textual language developed independently by another team. I implemented a simple translator that converts the text specification into the peg model. This translator is not a mandatory part of the synthesis flow but it proved very helpful: I could test complex modules that belong to real-world applications and were written by different engineers. This last aspect was a very useful sanity check.

2.6.2 Synthesis steps

The tool consists of several programs, written in C++, which are invoked in succession from a perl script. These sub-programs perform the tasks mentioned above: translate the textual specification into peg (optional), split the peg into fixed-size words, schedule the resulting graph, generate the main vhdl code, and generate the vhdl code for the module wrapper.

The programs make extensive use of the C++ stl (Standard Template Library) to handle the complex data structures required, principally graphs. I have avoided any obscure or platform-dependent features and, as a result, the code is highly portable. Each sub-program generates an intermediate file that is in turn read by the next. These files are human readable, which proved to be a major advantage for debugging. Moreover, I am able to display graphically, at each intermediate step, the internal data structures by using the dot graphical package and a PostScript viewer.

I also considered it important to generate easily readable vhdl files. For example, I preserved (as much as possible) the names of the identifiers, indented the vhdl code according to common standards, described the control fsm in a textbook-like manner, etc.

The whole project consists of 8500 lines of C++ code, 2500 lines of vhdl, and almost 1000 lines of bash and perl scripts. The running time of all synthesis steps, taken together, is negligible. This is not a surprise since all the presented algorithms run either in linear time or have at most quadratic complexity.

2.6.3 Simulation

I have written a peg simulator to check the correctness of the synthesis flow. This simulator takes a module written in peg and an input packet and generates an output packet according to the peg semantics. It also supports auxiliary inputs and outputs. The simulator reads the entire input packet into an internal array and then performs all the operations in topological order. Finally, the output packet is assembled by processing the output subgraph: following the control flow, each output node appends its current value to an initially empty packet. I took this most straightforward approach since my goal was to avoid any error, both in the way I interpret the peg semantics and in the simulator code itself.
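The output-assembly walk can be sketched as follows, once more over the illustrative PegNode type; eval and append_bits stand in for the simulator's internals, conditionals are shown two-way for brevity, and the payld offset is assumed to be kept in the node's size field:

#include <cstdint>
#include <vector>

uint64_t eval(PegNode* n);                            // value of a computation node
void append_bits(std::vector<uint8_t>& out, PegNode* value_node);

std::vector<uint8_t> assemble_output(PegNode* pktout,
                                     const std::vector<uint8_t>& pkt_in) {
    std::vector<uint8_t> out;
    PegNode* n = pktout->dests[0];                    // first active node
    while (n->type != NodeType::Payld) {
        if (n->type == NodeType::OData) {
            append_bits(out, n->inputs[0]);           // value from the computation subgraph
            n = n->dests[0];                          // unique successor
        } else {                                      // cond: follow the active branch
            n = (eval(n->inputs[0]) != 0) ? n->dests[0] : n->dests.back();
        }
    }
    // payld leaf: copy the rest of the input packet, from the byte offset on
    out.insert(out.end(), pkt_in.begin() + n->size / 8, pkt_in.end());
    return out;
}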

For simulation, I first generate a large number (millions) of random input packets and feed them to the simulator, one by one: the resulting output packets are collected as a reference. Then, I synthesize the peg model for three different pipeline widths (32, 64, and 128 bits). Because of the different widths, the synthesis generates very different circuits in the three cases: the three control fsms have quite different numbers of states and the data-paths are likewise very different. I simulate the resulting vhdl code with ModelTech's vsim after including it in a small custom wrapper that reads the input stimuli from an input file and dumps the results to an output file. This wrapper can randomly insert idle input tokens in the data stream and also backpressure the module from the output, modeling a realistic environment. Finally, I compare the three results with the initial reference. I have simulated in this way over two dozen modules, for a total simulation time of over 100 hours. Although the perfect concordance of the results does not prove the correctness of my implementation, at this point I am confident in its quality.

2.6.4 RTL and Low-Level Synthesis

Although not a part of the current research, I include these paragraphs since they bring a natural conclusion to my proposed synthesis flow and also make the experimental results presented in the next section easier to follow.

I performed the rtl synthesis using the Synplify-Pro tool, which I ran with the optimization flags tuned towards maximum performance, i.e., clock frequency. In particular, I enabled the sequential optimization steps such as fsm encoding exploration and retiming. Since I targeted the fpgas from the Xilinx Virtex 4 family, I used the Xilinx ISE tools for technology mapping, placing, and routing.

For a realistic performance evaluation, as well as for a correct count of the used resources, I have placed my synthesized module inside a wrapper (slightly different from the one used for simulation). Consequently, I synthesize a small fragment of a packet processing pipeline that includes the analyzed module.

Table 2.1: Synthesis results for selected modules

Module         Core size        Delay    Throughput
               LUTs     FFs     ns       Gbps
MPLSpush       556      107     3.8      33
TTLEXPupdate   43       20      2.9      44
VLANfilter     11       12      2.9      44
VLANedit       505      125     4.0      32
PPPoEfilter    410      151     3.7      34
PPPoEterm      819      322     4.0      32

I found that my initial timing and area estimates are correct.

2.7 Experimental Results

I synthesized several modules from an industrial design. The chosen pipeline width was 128 bits. I chose as target a Xilinx Virtex 4 xc4vlx40–ff668–10 fpga. Both choices are representative: they were not made by the author but are the choices made by system architects for a real system that includes two packet processing pipelines. The synthesis flow used for testing follows the description from the previous section, with the exception that I used the Xilinx Xst rtl synthesis tool instead of Synplify-Pro; this choice was made for non-technical reasons. I do not expect significant differences in the results.

I have described the mplspush module in the previous sections, where it was used as an example (Figure 2.1, Figure 2.2). The rest of the modules, although they perform very different tasks from the system point of view, can be seen for the purposes of this work as more or less complex variants of this module. To realistically estimate the size and performance of the synthesized modules, I synthesized each module inside a wrapper that models the module's neighborhood in a real packet processing pipeline.

Table 2.1 shows the size and performance of each module. To measure the performance, I use the post-routing static timing analyzer from the Xilinx ise package. The reported delay is that of the longest register-to-register path, plus a register's setup and hold times. For area, I report the number of flip-flop and lookup table primitives required on the Virtex 4 device, minus the resources used by my wrapper. Since the synthesis tool makes some optimizations across the hierarchical boundaries, the results can vary slightly if a different wrapper is used. The last column shows the estimated module throughput, i.e., the product of the module bit-width and the clock frequency implied by the previous column. This number is a rough approximation: computing the real worst-case throughput of a packet processing pipeline is more complicated. I treat this topic in detail in Chapter 3. The real pipeline that contains these modules is clocked at 166 MHz, i.e., the clock period is 6 ns. All six synthesized modules can work considerably faster.
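As a concrete reading of the last column, take the first row of Table 2.1: a delay of 3.8 ns corresponds to a clock frequency of 1/(3.8 ns) ≈ 263 MHz, so the estimated throughput is 128 bits × 263 MHz ≈ 33.7 Gbps, which the table rounds down to 33 Gbps.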

2.8 Conclusions

Establishing a strict formalism for describing packet editing operations (the packet editing graph) allowed me to construct a high-performance hardware synthesis procedure that can be used to create packet processors. The performance of circuits synthesized by my procedure is comparable to the performance of circuits in state-of-the-art switches, while the design entry is done at a much higher level of abstraction than the rtl usually used. The direct benefit is improved designer productivity and code maintainability. Experimental results on modules extracted from actual product-quality designs suggest that my approach is viable.

Chapter 3

Packet Processing Pipeline Analysis

3.1 Overview

High-performance pipelines are critical components in many modern digital systems. The packet processors found on the line cards of modern network switches, the central topic of the present thesis, are typical instances of such critical pipelines.

Analyzing the performance of a packet processing pipeline is an essential component of the design process. Without an accurate performance estimation, a network device can be over-designed or can function below its required parameters, with the consequence that packets are dropped even if the communication link is not saturated.

To help the reader understand more clearly the issues I address in the sequel, I summarize below the particularities of a packet processing pipeline, as established in the previous chapter. First, these pipelines have a linear structure. There are no data dependency loops like those encountered, for example, in a cpu core. This simplifies their architecture considerably and offers a framework for obtaining very good performance.

Additionally, the function of individual pipeline modules tends to be very simple because of their high performance requirements (perhaps tens of gigabits per second); the overall pipeline complexity arises from chaining modules together. This design style is also encouraged by the fact that pipeline latency is not very important: the interesting figure of merit is the throughput. Doubtlessly, the total switch latency is important, but the pipeline latency is dominated by that introduced by queueing and switching, so spending a few extra cycles in the pipeline is not a major drawback.

Another particularity of the pipelines I am considering is the non-constant data flow through each module: modules can insert and/or remove data. Specifically, each module has state and is free to perform different computations at different times, depending both on its state and on the data fed to it. As such, it is not always ready to produce or receive more data. The data flow through the pipeline thus contains empty data tokens in addition to the actual data words. Packet processing pipelines consist of modules connected by fifos to mitigate the effect of variable data rates. Such fifos decouple the behavior of the modules, maintaining a consistent data flow through the pipeline.

However, the most important issue, briefly mentioned above, is that a module may perform various classes of computation depending on the packet data itself. For example, a module may insert or remove data from a vlan packet but leave pppoe packets unchanged. Consequently, the input/output behavior of a module can vary substantially from one packet to the next. Moreover, the complexity of adjacent modules may vary dramatically: it is not uncommon for a complex module to be followed by a trivial one or vice-versa.

It is easier to start the discussion about analyzing the performance of packet processing pipelines with a simple motivational example.

Figure 3.1: A packet pipeline: how big should the fifos be?

3.1.1 Motivational example

Consider Figure 3.1, a pipeline of a network switch. It is intended to process packets that start with a number of 112-bit Ethernet headers followed by a payload. The first module swaps the first two headers, the second removes the first header, and the third duplicates the first header. Each module performs its specific operation on certain packets (those with a matching type field) and leaves the others intact. This pipeline can modify packets in eight different ways. Although this simple pipeline could be replaced by a single module, chaining these operations is more representative of a real packet processing pipeline.

The first module leaves the packet length unchanged but the second and third may shrink and expand it, respectively: the flow through the pipeline is not constant. Moreover, this flow varies according to the packets' content. The combination of data- and state-dependent computation in the modules plus the presence of the fifos make this pipeline a complex sequential system.

This chapter addresses two related questions. First, what is a pipeline's worst-case throughput? One rule of thumb is that typical pipelines work at 50% throughput, so doubling the clock frequency should be enough to process any data pattern. My experiments show that this approximation is often wasteful and may even be incorrect for many real pipelines.

Second, given an achievable throughput, what are the minimum fifo sizes for which we can guarantee that performance? Here, the designer usually over-estimates these sizes using his experience. However, many modules are simple and consist mostly of combinational logic, while fifos contain many sequential elements. Over-designed fifos waste chip area and power.

Even for such a simple pipeline, answering these questions about throughput and fifo size is challenging and not something one can easily do by hand. The methods presented in this chapter take less than thirty seconds to tell us that the worst-case throughput of this pipeline is 0.6 and that this can be guaranteed with a minimum total fifo capacity of fourteen (specifically, 4, 5, and 5 for the three fifos in Figure 3.1). Furthermore, if we reduce the throughput requirement to 0.5, a total size of twelve is sufficient: 4, 4, 4 and 4, 5, 3 are both valid solutions.

Below, I first define some metrics to characterize the sequential behavior of a single pipeline element (Section 3.3). Then I analyze the worst-case aggregate throughput of the pipeline assuming ideal fifo sizes (Section 3.5.1). Next, I propose two algorithms that can determine the minimum fifo sizes; one algorithm is exact, the other is heuristic (Section 3.7). I support these results by experiment (Section 3.8).

3.2 Related Work

Traditional queuing theory is not able to answer these questions. First, it assumes arrival rates can be modeled stochastically, yet network traffic rarely follows standard distributions such as the Poisson distribution. Second, even if we had appropriate distributions, queuing theory only provides stochastic (perhaps average-case) results. Here we are concerned with the worst case, not a distribution.

Le Boudec and Thiran's network calculus [LT01] provides a helpful methodology for analyzing pipeline throughput and fifo sizing, given modules with variable delays, but it requires that each module have a fixed read and write throughput. This is not the case here, where each module may have highly variable, data-dependent input/output behavior.

Many have addressed analyzing and reducing buffer memory consumption for synchronous dataflow graphs (sdf [LM87]). While the arbitrary topology of sdf graphs is richer than our linear pipelines, these approaches assume the modules produce and consume data at a fixed rate. As such, sdf admits fixed schedules and very aggressive analysis. For example, Murthy and Bhattacharyya [MB04] show how input and output buffer space can be shared when exact data lifetimes are known. Such precise analysis is generally impossible for our models since they allow data-dependent production and consumption rates. Like sdf, latency-insensitive design [CMSV01] allows richer topologies than our pipelines, but insists on simple module communication patterns. Lu and Koh [LK03, LK06] attack essentially the same problem as we do, but in a semantically different setting. They propose a mixed integer-linear programming solution. Casu and Macchiarulo [CM05] is representative of more floorplan-driven concerns: their buffer sizes are driven by expected wire delays, something I do not consider. The asynchronous community also considers buffer sizing (e.g., Burns and Martin [BM91]), but again considers richer topologies, simpler communication, and fancier delay models.

One question cannot be avoided: is worst-case analysis justified for real-world packet processing pipelines, or would a statistical analysis suffice? Are the extreme cases significant in real data patterns? The answer is yes. Even though most extreme cases never occur, I have noticed that real data patterns are generally far from the statistical "average case." For example, Ethernet linecards are often tested with sequences of all smallest-length and all longest-length packets. These traffic patterns are not realistic, but these tests became standard: vendors advertise these benchmarks and customers ask for these numbers. Of course, these tests are easy to perform, but their popularity has a second reason: they are acceptable approximations of the most extreme traffic patterns for a simple device, and they provide a worst-case guarantee.

The pipelines found today inside commercial linecards are far more complex. For example, they can simultaneously process native Ethernet, vlan, and pppoe packets. It becomes difficult, if not impossible, to enumerate all corner cases. A folklore rule says that the field traffic pattern never matches any of the patterns used for development and testing, no matter how carefully the test set was designed. Unfortunately, this is all too well confirmed by practice: many network devices operate well below their specified performance in critical real-world situations, even though most behave properly on the test bench. I conclude that the worst-case performance analysis of a network device in general, and of a packet processing pipeline in particular, is not a strictly academic challenge but has a very important impact in the real world.

3.3 Single module analysis

Packet processing pipelines consist of a linear array of fsms (Figure 3.1). I assume the whole pipeline runs synchronously with a single clock, a typical implementation technique. While I could consider an asynchronous implementation, the analysis of the individual modules would be much more complicated (see, e.g., Burns and Martin [BM91]).

The first step in addressing the pipeline performance is to analyze each module in isolation. My first goal is to define an exact metric for measuring the performance of such a complex module; additionally, these partial results can be rapidly combined into a worst-case estimate of the throughput of the entire pipeline (Section 3.5.1).

As shown in Figure 3.5, data for each module comes from the source and goes to the sink through the din and dout signals. For efficiency, these are wide buses, e.g., 128 bits. The module interacts with the source and sink through two control signals, rd and wr, which the module asserts when it wants to read and write 128-bit blocks of data. None, either, or both of these signals may be asserted in each clock cycle.

Each module also has one input signal, suspend, which stalls the module when the input data is unavailable to read or when there is no place to write output data. The details of the handshaking between neighboring pipeline elements will be described in the next section. At this point, I simply assume that the module to be analyzed in isolation is placed between an ideal source and sink, i.e., the mt and bp status signals are never asserted. In this case, the fsm is never stalled.

3.3.1 Defining module performance

A module does not always assert the rd and wr signals. Hence, idle cycles may occur at the module's input and/or output, and the overall throughput is less than unity in general. The sequence of values observed on the rd and wr signals depends on the module functionality and on the data being fed to the module: different sequences will be observed for different data patterns.

Let r_i^p, w_i^p, i = 0, 1, 2, ... be the sequences of values observed on the rd and wr signals for a given input data pattern p. It is desirable to quantify the worst-case throughput of the module for any data pattern. However, since modules insert and remove data, the amount of data entering the module is not necessarily equal to the amount that leaves it. Therefore, I define a worst-case read and a worst-case write throughput, which I denote with R and W, respectively. For the input, the worst case corresponds to sequences r_i^p with the smallest number of 1s in a given time. Formally,

    R = lim_{n→∞} min_p ( Σ_{i=0}^{n} r_i^p ) / n    (3.1)

R is interesting for pipelines such as ingress packet processors, where one wants to guarantee that a module can process any input flow at a given rate. For example, if one finds R = 0.4, then to guarantee a 200 MS/s input throughput for any data pattern, the pipeline must be clocked at 500 MHz (as 500 × 0.4 = 200). I similarly define W, the worst-case output throughput:

    W = lim_{n→∞} min_p ( Σ_{i=0}^{n} w_i^p ) / n    (3.2)

Finally, I define T, the worst-case ratio between the amount of data read and the amount of data written:

    T = lim_{n→∞} min_p ( Σ_{i=0}^{n} r_i^p ) / ( Σ_{i=0}^{n} w_i^p )    (3.3)

3.3.2 Computing R, W, and T from the STG

I start with the stg (State Transition Graph) of the module's fsm. I build a graph G = (V, E), where V is the set of states and E ⊆ V × V is the set of transitions. It is critical to know on which transitions the rd and wr signals are asserted, so I assign to each edge e ∈ E two labels: x_e^r, x_e^w ∈ {0, 1}.

Figure 3.2: stgs for the DwVLANproc module: (a) original, (b) simplified

Figure 3.2a shows a typical stg of the control fsm belonging to a packet editing module. The reset state in the top-left is the init state (see Section 2.5.6); the state with the small self-loop at the bottom is the rep state; the four paths on the right correspond to the different operations that can be performed by the module. Although this example fsm is generated by the synthesis technique from Chapter 2, in this chapter I address any possible implementation.

The fsm input signals are not relevant at this point; the important ones are the output control signals rd and wr. Therefore, instead of attaching to the arcs labels such as "xxx/01" I simply write "01," i.e., the values of x_e^r and x_e^w for the corresponding transition. This explains why multiple outgoing transitions with the same source are shown for states such as the bottom-most state rep. The system is completely deterministic, but since the conditions for each transition are ignored, I treat it as a non-deterministic one.

The above stg can be thought of as a nondeterministic finite automaton with four output symbols that correspond to the four possibilities for reading and writing data, i.e., 00, 01, 10, and 11, with all states accepting. It is possible to simplify it using a heuristic algorithm to produce the slightly simpler stg in Figure 3.2b, but I did not implement this optimization.

Any input data pattern p corresponds to an infinite path in the stg that starts from the reset state. All possible sequences of r and w values could be enumerated by considering all possible paths, but this is not practical. Instead, I show that the module metrics R, W, and T can be computed by inspecting the stg's cycles rather than enumerating its infinite paths, which would be impossible. A simple cycle c is a sequence of edges that forms a non-intersecting path whose tail connects to its head. Let C be the set of all simple cycles in the graph G. I can compute R, W, and T by considering every simple cycle c ∈ C:

    R = min_{c∈C} ( Σ_{e∈c} x_e^r ) / |c|    (3.4)

    W = min_{c∈C} ( Σ_{e∈c} x_e^w ) / |c|    (3.5)

    T = min_{c∈C} ( Σ_{e∈c} x_e^r ) / ( Σ_{e∈c} x_e^w )    (3.6)

where |c| is the number of edges in cycle c.

The rationale is as follows. R, as defined by equation (3.1), corresponds to the input data patterns which generate x^r sequences with the smallest average read/cycle ratio. Any such pattern corresponds to a path in the stg. First, I consider the paths that consist of simple cycles c ∈ C that repeat periodically. I immediately arrive at equation (3.4), where the hard limit R is reached when a stg cycle with minimum read/cycle ratio repeats ad infinitum. More complex paths can be decomposed into a sequence of simple cycles because

the number of states in the stg is finite. Let this sequence be denoted S = (c_1, c_2, ...), where c_i ∈ C are simple cycles; I emphasize that this sequence is not periodic in the general case. Equations (3.4)-(3.6) are somewhat misleading, as they suggest that only periodic patterns are considered for the analysis.

As defined by equation (3.1), R becomes:

    R = ( Σ_{c_i∈S} Σ_{e∈c_i} x_e^r ) / ( Σ_{c_i∈S} |c_i| ) ≥ min_{c∈C} ( Σ_{e∈c} x_e^r ) / |c|

which proves the equivalence between the two formulations, if we consider the well-known inequality:

    ( Σ_i a_i ) / ( Σ_i b_i ) ≥ min_i ( a_i / b_i )

A similar argument holds for W and T. The next step is to compute R, W, and T using the above equations.
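For the tiny abstracted stgs considered later (Section 3.3.4), equations (3.4)-(3.6) can even be evaluated by brute-force enumeration. The following Python sketch (my own illustration with hypothetical names, not the thesis tool chain) roots each simple cycle at its smallest-numbered state so that every cycle is found exactly once:

    def simple_cycles(num_states, edges):
        # Enumerate all simple cycles of the stg by DFS; exponential in
        # general, usable only because the abstracted control stgs are tiny.
        adj = [[] for _ in range(num_states)]
        for idx, (u, v) in enumerate(edges):
            adj[u].append((v, idx))
        cycles = []

        def dfs(root, v, path, on_path):
            for w, idx in adj[v]:
                if w == root:
                    cycles.append(path + [idx])       # cycle closed
                elif w > root and w not in on_path:   # stay above the root
                    dfs(root, w, path + [idx], on_path | {w})

        for root in range(num_states):
            dfs(root, root, [], {root})
        return cycles

    def metrics_by_enumeration(num_states, edges, xr, xw):
        # R, W, T per equations (3.4)-(3.6) by explicit enumeration;
        # xr/xw give the x_e^r / x_e^w label of each edge.
        C = simple_cycles(num_states, edges)
        R = min(sum(xr[e] for e in c) / len(c) for c in C)
        W = min(sum(xw[e] for e in c) / len(c) for c in C)
        ratios = [sum(xr[e] for e in c) / sum(xw[e] for e in c)
                  for c in C if sum(xw[e] for e in c) > 0]
        T = min(ratios) if ratios else float("inf")
        return R, W, T

Since the number of simple cycles can grow exponentially with the graph, this is only a cross-check; the next section gives the scalable formulation.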

3.3.3 Using Bellman-Ford

Graph metrics such as those defined by equations (3.4), (3.5), and (3.6) are known as minimum cycle means [Kar78]. Dasdan [Das04] uses slightly different notation, but otherwise our formulation is identical. Dasdan assigns each edge two weights, ω(e) and τ(e), which simply correspond to our x_e^r and x_e^w:

• for R: ω(e) = x_e^r, τ(e) = 1

• for W: ω(e) = x_e^w, τ(e) = 1

• for T: ω(e) = x_e^r, τ(e) = x_e^w

Here, I show how to compute the metrics R and W using a method proposed by Lawler (see Dasdan [Das04]). I chose it for its simplicity.

Let G = (V, E) be a directed graph with edge weights w_e for e ∈ E. The O(V·E) Bellman-Ford algorithm checks whether G contains a negative-weight cycle; if it does not, the algorithm returns the minimum-weight paths from a given source to each node. I ignore any computed path lengths and use Bellman-Ford only to check whether

    ∀c ∈ C, Σ_{e∈c} w_e ≥ 0.

Figure 3.3: Sample stg to illustrate Section 3.3.3: (a) x_e^r labels; (b), (c), (d) edge weights w_e for α = 0.4, 0.5, 0.6, respectively

Using some simple arithmetic tricks, Bellman-Ford can be used to inexpensively compute the R, W, and T metrics. To compute R, it follows from (3.4) that

    R = max α  s.t.  ∀c ∈ C, α ≤ ( Σ_{e∈c} x_e^r ) / |c|.

Assigning w_e = x_e^r − α, we have

    ( Σ_{e∈c} x_e^r ) / |c| ≥ α  ⟺  Σ_{e∈c} x_e^r − |c|·α ≥ 0  ⟺  Σ_{e∈c} w_e ≥ 0.

Therefore, we have to find the maximum α such that all cycles are nonnegative. For a given α, Bellman-Ford can be used to check this condition; α can be approximated as accurately as desired by binary search.
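The following Python sketch makes this procedure concrete. The stg is given as an edge list with x_e^r labels; has_negative_cycle is a textbook Bellman-Ford negative-cycle test. The names and the five-state example graph are mine, chosen so the example has the same two cycles and ratios as Figure 3.3:

    def has_negative_cycle(num_states, edges, weight):
        # Textbook Bellman-Ford negative-cycle detection; `weight(i)`
        # returns w_e for edge i. Starting every distance at 0 acts
        # like a virtual source reaching all states.
        dist = [0.0] * num_states
        for _ in range(num_states):
            changed = False
            for i, (u, v) in enumerate(edges):
                if dist[u] + weight(i) < dist[v] - 1e-12:
                    dist[v] = dist[u] + weight(i)
                    changed = True
            if not changed:
                return False      # converged: every cycle is nonnegative
        return True               # still relaxing after |V| passes

    def worst_case_R(num_states, edges, xr, eps=1e-4):
        # Binary search for the largest alpha such that all cycles are
        # nonnegative under the edge weights w_e = x_e^r - alpha.
        lo, hi = 0.0, 1.0
        while hi - lo > eps:
            alpha = (lo + hi) / 2
            if has_negative_cycle(num_states, edges,
                                  lambda i: xr[i] - alpha):
                hi = alpha        # rejected: R < alpha
            else:
                lo = alpha        # accepted: R >= alpha
        return lo

    # A 5-state stg with the same two simple cycles as Figure 3.3:
    # (S0,S1,S2) with ratio 2/3 and (S0,S1,S3,S4) with ratio 2/4.
    edges = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 4), (4, 0)]
    xr    = [1,      1,      0,      1,      0,      0]
    print(worst_case_R(5, edges, xr))   # approximately 0.5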

Similarly, to compute T (see equation (3.6)) I assign w_e = x_e^r − α · x_e^w, which gives

    ( Σ_{e∈c} x_e^r ) / ( Σ_{e∈c} x_e^w ) ≥ α  ⟺  Σ_{e∈c} x_e^r − α · Σ_{e∈c} x_e^w ≥ 0  ⟺  Σ_{e∈c} w_e ≥ 0.

Figure 3.3 illustrates this technique by computing R on a small stg. Figure 3.3a shows the original stg and its edge labels x_e^r. Two simple cycles can be found by

inspection: (S0, S1, S2) and (S0, S1, S3, S4). Their read/time ratios, i.e., Σ_{e∈c} x_e^r / |c|, are 2/3 ≈ 0.666 and 2/4 = 0.5. According to (3.4), R = 0.5. However, since the number of cycles can be exponential in the size of the graph, this straightforward enumeration approach is not feasible in general.

Instead, I approximate R as described above. In Figures 3.3b, c, and d, I assign α = 0.4, 0.5, and 0.6, respectively, and compute the edge weights w_e = x_e^r − α as shown above. In Figure 3.3b, both cycles are positive: the Bellman-Ford algorithm accepts this graph and I conclude that R ≥ 0.4. In Figure 3.3d, the small cycle is positive but the second is negative: Bellman-Ford rejects this graph and I conclude that R < 0.6. Bellman-Ford accepts α values less than or equal to the true R and rejects those greater than the true R (here, 0.5). Binary search can therefore approximate R with arbitrary precision. The case in Figure 3.3c is the limit, since α = R. As this is the threshold between all-cycles-nonnegative and at-least-one-cycle-negative, it is not surprising to find one cycle with exactly zero weight.

3.3.4 Abstracting the data path

The fsm of a packet processing module can have an enormous number of states. This happens because the number of states depends on all registers, both in the controller and in the data-path. The number of registers in the controller is usually small, but the number of data-path registers is much bigger: the latter is on the order of the pipeline width. Since widths of 128 or 256 bits are common, the total number of states becomes very large and renders the above method infeasible.

The problem can be surmounted by dividing the fsm into its data-path and control components. The controller includes the rd, wr, and suspend signals (Figure 3.4). I consider only the states of the controller for the analysis and treat the data-path as a black box. The stg to be analyzed then has a small number of states, and the above algorithm becomes practical.

There is no sharp theoretical boundary between the control and the data-path of a module. Since my experiments use modules synthesized by the technique in Section 2.5, I choose the natural boundary given by that synthesis procedure.

This control stg can be seen as an abstracted version of the original one. It is important to know how well this abstracted version represents the original input/output behavior. All signals from the data-path to the control part of the original module, which are truly internal signals, are treated as independent primary inputs of the abstracted stg. Therefore, the abstracted stg can have extra states and transitions, but it never lacks the real ones, i.e., those present in the original stg.

The input/output behavior of the abstracted stg cannot be better than the original's: any worst-case performance guarantee obtained for the abstracted stg also holds for the original. Ignoring the module data-path is thus an approximation, but one that always errs on the safe side. I lack a theoretical result on the quality of this approximation, but experiments suggest the difference is small for practical modules. In the rest of this chapter, I consider only the abstracted stgs of the modules.

3.4 Connecting modules

In the previous section, I have shown how the throughput can be analyzed for each isolated module. The next step is to put these partial results together to analyze the complete pipeline.

First, I compute the worst-case pipeline throughput in a theoretical scenario that assumes the interconnecting fifos are large enough to decouple the processing modules. Next, I propose a model-checking based technique to analyze the pipeline performance for practical systems with finite-sized fifos. Finally, I use the above results to compute the minimum fifo sizes for which the pipeline performance is not compromised; this follows the observation that the performance decreases if the fifo sizes drop below a certain threshold.

Figure 3.4: (a) Original module fsm (b) simplified fsm with abstracted data-path

3.4.1 The module I/O protocol

I now describe the handshaking scheme used by a processing module to communicate with its neighboring pipeline elements, i.e., the input and output fifos. Figure 3.5 shows the signals used for handshaking and their interconnection.

The data travels along the pipeline through a wide data-path. It enters and exits a processing module by the din and dout ports; the fifos have matching ports. Since the module does not necessarily read or write data on each cycle, it is responsible for generating two control signals, rd and wr. They enable reading from the upstream fifo and writing to the downstream fifo, respectively.

The source (e.g., the input fifo) asserts the mt ("empty") status signal when no data is available (e.g., the fifo is empty). Likewise, the sink asserts the bp ("back pressure") status signal when it cannot accept data.

Figure 3.5: Handshaking between module and adjacent fifos

These two status signals are ORed to form a suspend signal that stalls the module for a clock cycle when asserted. When stalled, the module holds its state for the current cycle and de-asserts the rd and wr controls. Therefore, the module is stalled regardless of its intent to read or write in a particular cycle: the stalling depends exclusively on the fifos' status.

It would be possible to carefully analyze the two control signals rd and wr and the two status signals mt and bp to stall the module only when strictly necessary: if the module does not read in the current cycle, it can proceed even if the input fifo is empty, and similarly for writing. However, I advocate this simple handshaking scheme for performance reasons. In the present framework, the suspend signal is quickly computed by ORing the mt and bp signals. In a more complex approach, the suspend signal would also depend on the module's intention to read or write, which in turn depends on the current state and on the input data. The extra complexity tends to increase the length of the critical paths and hurts the clock frequency: this ultimately leads to lower performance.
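A minimal Python model of one clock cycle under this scheme (my sketch; `step` stands for the module's combinational next-state logic, which is not specified here):

    def module_cycle(state, mt, bp, step):
        # One clock cycle of the handshake in Figure 3.5. `step` is the
        # module's next-state function, returning (next_state, rd, wr).
        suspend = mt or bp          # the two status signals, simply ORed
        if suspend:
            return state, 0, 0      # hold state, deassert rd and wr
        return step(state)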

Figure 3.6: stg of a 3-place fifo: I = (wr, rd), O = (bp, mt)

3.4.2 The FIFO implementation

I now define the stg of the fifos used for interconnecting the pipeline elements and the exact function of their input and output signals. There are many fifo variants, very close in functionality but different from the rtl point of view; an exact model is necessary for a cycle-accurate analysis.

Figure 3.6 shows the stg for a fifo with three storage locations. Only the control part of the fifo is represented; the data-path is abstracted for the reason described in Section 3.3.4. This stg contains four states, corresponding to an empty fifo and to 1, 2, and 3 occupied locations. It is easy to extrapolate this diagram for a fifo with a different number of storage locations.

This fifo variant is a Moore machine. Both the control outputs (bp and mt) and the data outputs (not shown in the abstracted stg) depend only on the current state (and the memory content). There is no combinational path from any input to any output. While this means it always takes at least one clock cycle for a fifo to react (e.g., by asserting the backpressure signal), such an approach greatly improves the performance of the system, exactly because long combinational paths usually create performance bottlenecks. Intuitively, my fifo choice improves the performance (i.e., clock period) of the system at the expense of input-output latency, measured in clock cycles: for current packet processing applications, latency is hardly a constraint, while throughput is critical.

Another advantage of choosing this fifo variant is that the logic synthesis problem is simplified, since optimizing a sequential network with relatively disjoint components is much easier than optimizing a tightly coupled one. As a disadvantage, in some cases an additional stage may be needed to achieve the same system performance.
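For concreteness, here is one plausible Python transcription of the control stg of Figure 3.6, parameterized by capacity. The exact corner-case behavior (e.g., a simultaneous read and write on a full fifo) must match the particular rtl variant being modeled, which is precisely why the section insists on an exact model:

    def fifo_stg(capacity):
        # States are the occupancy 0..capacity; outputs are a Moore
        # function of the state only (no input-to-output path).
        def outputs(occ):
            return {"mt": occ == 0, "bp": occ == capacity}

        def next_state(occ, wr, rd):
            # In this sketch, a write to a full fifo and a read from an
            # empty one are ignored; read+write leaves occupancy unchanged.
            occ += 1 if (wr and occ < capacity) else 0
            occ -= 1 if (rd and occ > 0) else 0
            return occ

        return list(range(capacity + 1)), outputs, next_state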

3.5 The Big FIFO scenario

In this section, I compute the pipeline's worst-case throughput under the assumption that the fifos are "large enough." Specifically, I want to find the minimum throughput that can be guaranteed when the fifos are large enough that they are never a bottleneck. This is a different scenario from assuming that the fifos are infinite.

In this case, the "large" fifos provide sufficient "elasticity" to compensate for short-term variations in the input and output rates of the two connected modules. However, if the upstream module consistently writes faster than the downstream module reads, the fifo will eventually fill and the bp signal will be asserted. Similarly, in the opposite situation, the fifo will become empty and the mt signal will be asserted.

Intuitively, I would like to analyze the pipeline operation in steady state, where the modules work at a fraction of their ideal throughput (i.e., the one achievable in isolation), such that the data rates entering and exiting each fifo match. Since the behavior of each module is data-dependent, one may imagine many such possible balance states. My proposed technique considers all these behaviors simultaneously, in an implicit manner; moreover, it is very fast.

This ideal worst-case throughput is an upper bound on the throughput achievable by a real pipeline with finite fifos. I consider the effects of limiting the fifo sizes later in this chapter.

3.5.1 Computing the worst-case throughput

I compute the worst-case read throughput of a linear pipeline in the “big FIFO scenario” presented above. I use the pipeline in Figure 3.1 to exemplify the algorithm.

I denote this throughput with R_123, since it corresponds to the modules M_1, M_2, and M_3 taken together: they can be seen as a bigger module, M_123. As a first step, I compute the individual R and T metrics for each module using the method in Section 3.3.3. I denote them R_1, R_2, R_3, T_1, T_2, and T_3. I compute the desired result R_123 iteratively.

Consider a module M_23 constructed by merging M_2, M_3, and the interconnecting fifo. I compute the worst-case read throughput of M_23, denoted by R_23. As mentioned above, I assume that each fifo is big enough to compensate for any spurious activity of the modules it connects, but I do not consider the fifos to be infinite. Therefore, the average data production and consumption rates of any two adjacent modules in the pipeline will be balanced in the long term. Intuitively, two cases may be encountered, depending on whether the upstream module M_2 or the downstream module M_3 forms the bottleneck.

For some data patterns, M_2 writes slower than M_3 can read. In this case, M_2 is the bottleneck and the fifo between them becomes empty in some clock cycles, causing M_3 to stall. M_2 is never backpressured in this case and behaves exactly as if M_3 were an ideal sink. Therefore, in this first case R′_23 = R_2.

For other patterns, M_2 may want to write faster than M_3 reads. In this case, M_3 is the bottleneck and the fifo becomes full in some clock cycles, causing M_2 to stall. The fraction of time M_2 is active (not stalled) is exactly the ratio between the read rate of M_3 and the write rate of M_2 for that particular pattern. Multiplying the read rate of M_2 by this ratio, we find that in this second case R″_23 = T_2 · R_3. Considering the worst of both worst cases, we find R_23 = min(R′_23, R″_23).

This step can be repeated, considering the two modules M_1 and M_23. The result is R_123, which is the answer to the original problem.

function Ideal-Throughput
    r ← 1
    for i ← n ... 1 do
        r ← min(R_i, r · RW_i)
    return r

Figure 3.7: Computing R assuming optimally-sized fifos

The complete algorithm, for an arbitrarily long pipeline, is listed in Figure 3.7 (RW_i denotes the read-to-write ratio of module i, called T_i in Section 3.3). It simply considers all the pipeline modules from right (output) to left (input) and performs at each step the computation described above.
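A direct Python transcription of Figure 3.7; plugging in the R and RW columns of Table 3.1 for the pipeline "ABC" reproduces the ideal throughput reported in Table 3.2:

    def ideal_throughput(R, RW):
        # Figure 3.7: fold the modules from output to input using
        # R_combined = min(R_i, RW_i * R_downstream).
        r = 1.0
        for Ri, RWi in zip(reversed(R), reversed(RW)):
            r = min(Ri, RWi * r)
        return r

    # R and RW columns of Table 3.1 for the pipeline "ABC":
    print(ideal_throughput([0.600, 0.530, 0.909],
                           [0.643, 0.563, 1.000]))   # 0.329, as in Table 3.2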

3.5.2 Performance issues

Computing R_i and T_i requires building an explicit stg for each module; in practice these stgs seldom exceed twenty states, and the observed running time is negligible. The loop in Figure 3.7 takes essentially no time.

This fast algorithm is well-suited to a high-level synthesis design-space exploration loop. Generally, each module admits several implementations that are not sequentially equivalent and have different costs. It is critical to have a quick method to compute the pipeline performance for each such design point. This is not a trivial task because of the complex interaction between the pipeline elements. As shown in the next section, finding the exact performance for a pipeline with finite-sized fifos is computationally expensive.

However, the "big fifo scenario" and the present algorithm allow a very quick computation of the pipeline performance. This performance can be achieved by choosing the right fifo sizes. Consequently, it is guaranteed that once a point in the design exploration space is selected, the reported performance can be achieved in practice.

3.5.3 Module independence

When I abstracted the data-path of each module (Section 3.3.4), I assumed that the inputs coming from the data-path to the controller behave like independent inputs. I have argued that this is a conservative approximation, i.e., it always under-estimates the performance.

A new challenge arises when considering several connected modules. A module's behavior depends heavily on the processed data. For a packet processing pipeline, the criterion for triggering one behavior or another is usually the packet type and some other header flags. For example, one module processes vlan-tagged packets but leaves the others unchanged; the next module processes pppoe-tagged packets but leaves other packets unchanged, including the vlan packets. My method assumes the worst-case scenario at each step, without taking into account that what is worst for one step is not necessarily the worst case for the next. Therefore, the computed worst-case throughput is again an under-estimation.

3.6 Verifying throughput using model checking

In this section I present a model-checking-based technique to check whether a given throughput is feasible for a pipeline. More precisely, the inputs of the algorithm are a given pipeline (i.e., the fsm of each module and the size of each fifo) and the desired throughput. The output is a Boolean value that is true if the pipeline can sustain the given throughput for any possible input data pattern.

Although the method can be used for any purpose, I use it in the sequel to compute the minimum fifo sizes. The fifo sizing algorithm is a search algorithm that iteratively calls this verification method with different parameters to find the minimum-cost solution. I present this algorithm in Section 3.7.

To use the model checker, I slightly modify the pipeline as shown in Figure 3.8. I replace the actual sink with an ideal one, which never asserts backpressure (the big "0" on the right of the figure).

Figure 3.8: Slightly modified pipeline for the model checker

Figure 3.9: Data source fsms: (a) T = 1/3, (b) T = 2/5

Next, I replace the source with a small fsm that generates data at a constant rate, the throughput I wish to test. It follows that I can model only throughputs that are ratios of integers. I typically use ratios of small integers: this improves the running time. Figure 3.9 depicts two sources, generating data streams with rates of 1/3 and 2/5. The data source fsm has no inputs: it is a free-running counter that cannot be backpressured. The backpressure signal coming from the first fifo, labeled with the big "?" in the figure, is always zero if the pipeline is able to accept the desired throughput. This signal becomes one if the data source is too fast. At this point, I run a model checker to verify the following property: the "?" signal is always zero. Proving or disproving this answers the original question.
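One simple way to realize such a constant-rate p/q source is the evenly spread pattern below (a sketch of the idea; the thesis uses explicit counter fsms as in Figure 3.9, not this formula):

    def rate_source(p, q):
        # Infinite 0/1 pattern with exactly p writes in every q cycles,
        # spread as evenly as possible.
        i = 0
        while True:
            yield (i + 1) * p // q - i * p // q
            i += 1

    # e.g., p, q = 1, 3 yields the repeating pattern 0, 0, 1, ...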

I now detail the whole procedure. Let us denote the n pipeline modules by M_i, i = 1, 2, ..., n. I treat the constant-rate source fsm as an additional module M_0. Between the n + 1 modules there are n fifos, denoted F_i, i = 1, ..., n; their sizes are denoted by f_i. The signal labeled "?" in the figure is in fact bp_1.

I start with the stgs of the pipeline elements M_i and F_i, described by kiss models [DBSV85]. I encode each of them (both modules and fifos) using the one-hot algorithm in sis [SSM+92]. I now have a blif file for each pipeline component. The chosen encoding method is irrelevant: this choice can influence the model checker's running time but not the algorithm's outcome.

I assemble all the blif files by connecting the handshaking signals according to Figure 3.8. The resulting blif network has a number of primary inputs and outputs, which are placed on the cut between every module's controller and data-path. Additionally, it has two more primary outputs: wr_n and bp_1. At this point I use the check invariant algorithm from the vis package [BHSV+96] to verify that, regardless of the current state of the system, and regardless of the input values (i.e., regardless of the data flowing through the data-path), the property bp_1 = 0 holds.

Doubtless, a more advanced model checker than vis could be used. Since this changes only the running time, not the final outcome, I did not explore this possibility in my experiments.

3.6.1 Comparison with explicit analysis

The simplest way to analyze the complete pipeline is to build its product machine, starting from all its components, both modules and interconnecting fifos. This is a simple operation. The pipeline worst-case throughput can then be computed from the explicit product machine stg using the algorithm from Section 3.3.3.

However, the size of this product machine grows exponentially with the complexity of the pipeline, rendering this method infeasible in practice. I have nonetheless tried this approach on smaller examples to verify the model-checking-based method: the results were consistent.

Actually, the model checker internally builds the product machine but represents it implicitly, using (in the case of vis) bdds to represent the state transition function and the space of reachable states. This implicit representation is more compact, allowing much more complex fsms to be analyzed.

3.6.2 Comparison with simulation

I compare the model checking approach with a widely used alternative, simulation. The main difference is that simulation covers only a limited number of possible input patterns. Its accuracy depends on the quality of the stimuli data sets and on their size, i.e., the simulation running time. Simulation gives only an experimental confirmation of the checked property. Model checking covers all possible input patterns implicitly: the method does not require any input stimuli. Therefore, the result is exact.

3.7 Computing the minimum FIFO sizes

In the previous sections I have shown how to compute the maximum (module-limited) throughput of a linear pipeline assuming ideally sized fifos. I now address the problem of finding the smallest fifos that can achieve that throughput.

The motivation for this study is the improvement in area and power that can be achieved by reducing the fifo sizes. Since the goal is to achieve the same worst-case performance as in the ideal case, this improvement brings no performance penalty. The only drawback is the running time taken by the fifo sizing algorithm at design time.

Although in practice these fifo sizes are found to be small integers, each extra stage costs many sequential elements because of the large width of the pipeline. Most modules in a typical packet processing pipeline consist of simple combinational logic: I have noticed that for practical designs the area taken by the fifos may easily exceed the area taken by the modules themselves. This confirms the importance of a careful selection of fifo sizes.

This problem is more complex than the previous one. For large fifos, the short-term variations in the data rates were compensated by the fifo elasticity: only the worst-case average rates were relevant, which is why I could complete the analysis using only the simple R, W, and T metrics for each module. If the fifos have finite sizes, the short-term behavior of the interacting modules can no longer be ignored. The analysis has to consider the exact sequential behavior of each pipeline element, both modules and fifos, and their interaction.

3.7.1 Computing FIFO Sizes

Since I could not find an algorithm that directly computes the minimum-cost assignment of fifo sizes, I take an indirect approach consisting of two elements. The first is the model-checking-based method described in Section 3.6, which can validate or invalidate a given assignment of fifo sizes for a required throughput. The second is a search algorithm that finds the minimum-cost assignment by iteratively calling the above verification method with various assignments.

I propose two search algorithms with the same functionality: one exact, the other heuristic. The running time of the latter is better, but the result may have lower quality. In the sequel, I use a very simple cost function for the quality of a fifo size assignment: the sum of all fifo sizes. A different cost function may be used with minor modifications to the search algorithms.

3.7.2 FIFO size monotonicity

Since I use costly model checking in the inner loop of the search algorithm to determine whether a pipeline configuration can achieve a given throughput, I make the following observation to reduce the search space.

For two pipelines with the same modules M_i but different fifo sizes f_i and f′_i:

    ∀i, f_i ≤ f′_i  implies  R ≤ R′.    (3.7)

This is because increasing the size of a fifo can never decrease throughput: decreasing throughput requires more backpressure, but a larger fifo never induces any.

This is not a total ordering, e.g., it does not discriminate when f_i < f′_i and f_j > f′_j for some i ≠ j. Nevertheless, it helps to reduce the search space when trying to find the overall minimum fifo sizes. When ∀i, f_i ≤ f′_i, I write F ≼ F′.

3.7.3 An exact depth-first search algorithm

Here I present an exact search algorithm for determining minimum fifo sizes. It is often too slow, so in the next section I accelerate it with a heuristic. Its runtime is sometimes practical, and I also use it to evaluate my heuristic.

Figure 3.10 shows the algorithm, which is a variant of a basic depth-first search. To prune the search space, it uses the fifo size monotonicity property (Section 3.7.2), which is checked by the GeThanAny and LeThanAny auxiliary functions. The core function is ThroughputAchieved, which calls the vis model checker (Section 3.6) and decides whether a given assignment of fifo sizes can achieve the desired throughput.

The algorithm first considers pipelines with all fifos of size one, then all of size two, etc., until a feasible one is found; this is the starting place for the search. The algorithm maintains three lists of fifo size assignments, GOOD, BAD, and TRY, which contain the assignments proved to be good, proved to be bad, and not yet checked. The Succ function returns the next points in the search space to be checked if the current point is good. The depth-first behavior arises from adding and removing elements at the beginning of the TRY list in a stack-like fashion. The MinSize function returns the best solution found; the cost metric is simply the sum of fifo sizes, reflecting their area. The algorithm can be modified to use a more complicated metric.

function SearchMinFIFO
    s ← 1
    F ← (1, 1, ..., 1)
    repeat
        BAD.push_front(F)
        s ← s + 1
        F ← (s, s, ..., s)
    until ThroughputAchieved(F)
    GOOD.push_front(F)
    for F′ ∈ Succ(F) do
        TRY.push_front(F′)
    while TRY.size > 0 do
        F ← TRY.pop_front
        if GeThanAny(F, GOOD) then
            ok ← true
        else if LeThanAny(F, BAD) then
            ok ← false
        else
            ok ← ThroughputAchieved(F)
        if ok then
            GOOD.push_front(F)
            for F′ ∈ Succ(F) do
                TRY.push_front(F′)
        else
            BAD.push_front(F)
    return MinSize(GOOD)

function GeThanAny(F, list)
    for F′ ∈ list do
        if F ≽ F′ then
            return true
    return false

function LeThanAny(F, list)
    for F′ ∈ list do
        if F ≼ F′ then
            return true
    return false

function MinSize(list)
    s ← ∞
    for F′ ∈ list do
        s′ ← Σ_i F′_i
        if s′ < s then
            s ← s′; best ← F′
    return best

function Succ(F)
    S ← ∅
    for i ← 1, ..., n do
        if F_i > 1 then
            add (F_0, ..., F_{i−1}, F_i − 1, F_{i+1}, ...) to S
    return S

Figure 3.10: An exact algorithm for computing minimum fifos
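A compact Python rendering of Figure 3.10 follows (names are mine; throughput_achieved stands for the model-checking run of Section 3.6, and the seen set is a small addition that avoids re-expanding already-visited assignments):

    from collections import deque

    def search_min_fifo(n, throughput_achieved):
        # Exact depth-first search after Figure 3.10, pruned by the
        # monotonicity property of Section 3.7.2.
        good, bad = [], []
        try_, seen = deque(), set()

        def dominates(F, G):          # F >= G componentwise (F "geq" G)
            return all(f >= g for f, g in zip(F, G))

        def succ(F):                  # shrink each fifo by one, where possible
            return [F[:i] + (F[i] - 1,) + F[i + 1:]
                    for i in range(n) if F[i] > 1]

        s, F = 1, (1,) * n            # uniform sizes 1, 2, ... until feasible
        while not throughput_achieved(F):
            bad.append(F)
            s += 1
            F = (s,) * n
        good.append(F)
        try_.extend(succ(F))
        while try_:
            F = try_.popleft()
            if F in seen:
                continue
            seen.add(F)
            if any(dominates(F, G) for G in good):
                ok = True             # above a known-good point: good
            elif any(dominates(B, F) for B in bad):
                ok = False            # below a known-bad point: bad
            else:
                ok = throughput_achieved(F)   # otherwise, model check
            if ok:
                good.append(F)
                try_.extendleft(succ(F))      # depth-first expansion
            else:
                bad.append(F)
        return min(good, key=sum)     # cheapest feasible assignment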

function GreedyMinFIFO
    s ← 1
    repeat
        s ← s + 1
        F ← (s, s, ..., s)
    until ThroughputAchieved(F)
    H ← (0, 0, ..., 0)
    while H ≠ (1, 1, ..., 1) do
        m ← −1
        for i ← 1, ..., n do
            if H_i = 0 and (m = −1 or F_i > F_m) then
                m ← i
        F_m ← F_m − 1
        if not ThroughputAchieved(F) then
            H_m ← 1
            F_m ← F_m + 1
    return F

Figure 3.11: Greedy search for minimum-size fifos

3.7.4 A heuristic search algorithm

I find the exact algorithm of Figure 3.10 too slow. Instead, I propose the heuristic search algorithm of Figure 3.11. It does not guarantee an optimal solution, but in practice it appears to produce solutions close to optimal and runs much faster.

Like the exact algorithm, this one starts by considering fifo sizes of all one, then all two, etc., until a solution is found. Then, it attempts to decrease the largest fifo. When decreasing the size of this fifo would violate the throughput constraint, I mark it as "held" and do not attempt to reduce its size further (the H array holds the "held" flags). The algorithm terminates when all fifos are marked as "held."

This algorithm can miss the optimal fifo size assignment because it assumes the fifo sizes are independent, which is not true in general. Relations between fifo sizes can be fairly complex; for example, increasing the size of one fifo may enable two or more other fifos to be reduced (note that this does not violate the monotonicity result of Section 3.7.2).

Id   Name         States   Transitions   R       RW
A    VLANedit     24       33            0.600   0.643
B    UpVLANproc   18       40            0.530   0.563
C    DwVLANproc   13       17            0.909   1.000
D    VLANfilter   1        2             1.000   1.000

Table 3.1: Module statistics

Nevertheless, I find this heuristic algorithm works well in practice.
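For comparison with Figure 3.11, a Python sketch of the greedy search (again with throughput_achieved standing in for the model checker):

    def greedy_min_fifo(n, throughput_achieved):
        # Figure 3.11: find a feasible uniform starting point
        # (the figure begins probing at s = 2), then greedily shrink.
        s = 2
        F = [s] * n
        while not throughput_achieved(F):
            s += 1
            F = [s] * n
        held = [False] * n
        while not all(held):
            # pick the largest fifo not yet held
            m = max((i for i in range(n) if not held[i]),
                    key=lambda i: F[i])
            F[m] -= 1
            if F[m] == 0 or not throughput_achieved(F):
                F[m] += 1            # restore the size and hold this fifo
                held[m] = True
        return F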

3.8 Experimental Results

In my experiments I use four modules (A, B, C, D) taken from a packet processing pipeline in a commercial adsl-like network linecard. I synthesized them for a 32-bit wide pipeline. In practice, 64- or 128-bit buses are more common, meaning the modules would have fewer states and require smaller fifos; I chose complex modules to illustrate my algorithms. Some statistics for these modules are presented in Table 3.1.

Module A is the most complex module I found in practice. It can swap, insert, or remove (or a combination of those) vlan tags from packet headers. The operations performed on each packet depend on the packet flow, i.e., they are data dependent. Modules B and C can be seen, for the purpose of this work, as simplified versions of A, performing a reduced set of vlan tag manipulations. Their complexity is representative of most modules encountered in practice. Module D is simpler, but not trivial. It generates a request to an auxiliary memory unit, but since it does not insert or remove data from the packet, its input/output behavior is trivial from the pipeline point of view. This is the case with many real modules, so I believe including it in the experiments is justified.

Modules   Ideal     Used      Greedy          DFS
          thruput   thruput   Size    Time    Size    Time
ABC       0.329     0.250     10      6s      9       18s
CBA       0.337     0.333     11      8s      11      38s
BCD       0.511     0.500     10      2s      10      6s
DCB       0.530     0.500     8       1s      8       2s
ABCB      0.191     0.166     16      1m      13      22m
ABAB      0.123     0.111     16      5m      15      68m
ACCA      0.385     0.333     15      32s     13      8m
CBAC      0.329     0.250     11      11s     11      1m
BBCB      0.167     0.166     17      7m      15      88m
AAAA      0.159     0.142     18      19m     15      326m
ABCDA     0.217     0.200     20      3m      17      150m
DCBAD     0.337     0.333     13      7s      13      1m
AABBC     0.119     0.111     24      409m
BBCCD     0.288     0.250     17      1m
CCDDA     0.600     0.500     10      1s      10      1s
DDAAB     0.219     0.200     13      31s     13      5m

Table 3.2: Experimental results

To produce different examples for my experiments, I randomly combined up to five modules (Table 3.2). These pipelines do not perform a useful function, but their complexity is representative. The "ideal thruput" column lists R as computed by the Ideal-Throughput algorithm (Figure 3.7) for each example. I run the greedy (Figure 3.11) and exact (Figure 3.10) algorithms with the slightly smaller throughput listed in the "used thruput" column; this is a ratio of small integers (see Section 3.6). The solution, i.e., the configuration with the minimum total fifo capacity, and the running times are shown in the last four columns.

All the results (throughput, total fifo size, and running time) vary substantially. In general, shorter pipelines and those containing simpler modules have higher throughput and smaller fifos; the algorithms also run faster on them. The results clearly show that the module sequence is important, i.e., the interaction of adjacent modules is a critical factor that cannot be ignored in an accurate analysis.

3.9 Conclusions

I addressed worst-case performance analysis and fifo sizing for pipelines of modules with data-dependent throughput, such as those found in network processing devices. I have shown that the performance of such a pipeline depends on both its elements and their interaction.

I presented a performance estimation technique for non-interacting modules. It assumes large fifos and runs fast enough to be used inside a high-level design exploration loop. Interaction makes the analysis of a real pipeline, with finite fifos, difficult. To address this problem, I proposed two algorithms, one exact and one heuristic, which use a model checking algorithm to evaluate the feasibility of candidate solutions.

As presented, my fifo sizing technique uses a simple cost function, the sum of all fifo sizes. This is a good metric for asics, where each fifo can be custom-sized. However, the algorithm can be trivially modified to use a different cost function; for example, fifo implementations on fpgas take advantage of built-in primitives, which have fixed sizes, so the resource usage does not increase linearly with fifo size.

The algorithms require substantial running time, but I consider them practical since they can be run in parallel with the detailed logic synthesis of the individual modules.

Chapter 4

Combining Shannon and Retiming

4.1 Introduction

I have presented in Chapter 2 a synthesis method for packet processing pipelines that transforms the peg high-level specification into an rtl model. The following step is the sequential optimization of the resulting network. Although any generic sequential optimization flow can be used, it is helpful to augment the flow with a technique that takes advantage of the particularities of the synthesized pipeline, which I summarize here.

Packet processing pipelines are mainly linear, deeply pipelined sequential networks. The linear topology is a common decision taken by system architects for performance reasons. An efficient pipeline consists of a large number of stages because latency is irrelevant compared to the main performance issue, the throughput; this also follows from the fpga-specific issues described in Section 1.1.1.

I have argued in Section 1.7 that feedback loops are the main performance bottlenecks for deeply pipelined circuits. Such loops include handshaking signals between the pipeline stages and are usually multicycle. Therefore, optimizing the combinational logic outside of the sequential context is of little help. I have also argued that a retiming step is mandatory for obtaining good performance for this class of circuits.

The class of algorithms that addresses this issue is known as r&r (retiming and resynthesis). R&r plays the same role for sequential networks that "shortening the critical paths" plays for combinational networks: it "shortens the critical loops" instead. My proposed method is an instance of this generic technique. I review some relevant r&r algorithms in Section 4.1.4.

4.1.1 Brief description

Logic synthesis procedures typically consist of a collection of algorithms applied in sequence that make fairly local modifications to a digital circuit. Usually, these algorithms take small steps through the solution space, i.e., they make many little perturbations of a circuit, and do not take into account what their successors can do to the circuit. Such an approach, while simple to code, often leads to sub-optimal circuits.

In this chapter, I propose a logic synthesis procedure that considers a post-retiming step while resynthesizing an entire logic circuit using Shannon decomposition to add speculation. The result is an efficient procedure that produces faster circuits than performing Shannon decomposition and retiming in isolation.

My procedure is most effective on sequential networks that have a large number of registers relative to the amount of combinational logic. This class includes high-performance pipelines. Retiming [LS91] is usually applied to such circuits, which renders the length of purely combinational paths nearly irrelevant, since retiming can divide or merge such paths among multiple clock cycles to increase the clock rate. However, because retiming cannot change the number of registers on a sequential cycle (a loop that passes through combinational logic and one or more registers), the depth of the combinational logic along sequential cycles becomes the bottleneck.

Shannon decomposition provides a way to restructure logic to hide the effects of late-arriving signals. This is done by duplicating a cone of logic, feeding constant 1's and 0's into the late-arriving signal, and placing a (fast) two-input multiplexer on the output of the two cones. In the case of packet processing pipelines, feedback loops usually contain simple (i.e., one-bit) late-arriving signals: the control and status signals between pipeline stages. This makes Shannon decomposition a very attractive choice for reducing the length of these cycles.

Following Shannon decomposition with retiming can greatly improve overall circuit performance. Since Shannon decomposition can move logic out of sequential loops, a subsequent retiming step can better balance the logic to reduce the minimum clock period, giving a more efficient circuit. Combining Shannon decomposition with retiming is a known manual design technique, albeit seldom applied because of its complexity. To my knowledge, mine is the first automated algorithm for it.

4.1.2 An example

Before giving a formal description of the proposed algorithm, I provide the main intuition behind the technique with the help of an example. In the sequential circuit of Figure 4.1a, I assume that the combinational block f has delay 8 from any input to the output, so the minimum period of this circuit is 8. The arrival times on all of f's inputs are 0; all inputs belong to a combinational critical path since they have zero slack; however, only one input belongs to a sequential critical loop.

The designer put three registers on each input hoping that retiming would distribute them uniformly throughout f to decrease the clock period; this situation often appears in a pipeline. Unfortunately, the feedback loop from the output of f to its input prevents retiming from improving the period below the combinational length of the loop, 8, since retiming cannot change the number of registers along it.

Applying Shannon decomposition to this circuit can enable retiming. Figure 4.1b illustrates how: I have made two duplicates of the combinational logic block and added a multiplexer to their outputs.


Figure 4.1: (a) The single-cycle feedback loop prevents retiming from improving this circuit, but applying Shannon decomposition (b) reduces the delay around the loop so that (c) retiming can distribute registers and reduce the clock period.

While this actually increased the longest combinational path to 8 + 1 = 9 (throughout this chapter, I assume multiplexers have unit delay), it greatly reduced the combinational delay around the cycle to the delay of only the mux: 1. This enables retiming to pipeline the slow combinational block to produce the circuit in Figure 4.1c, which has a much shorter clock period of (1/4)(8 + 1) = 2.25.

The main strength of my algorithm is its ability to consider a later retiming step while judiciously selecting where to perform Shannon decomposition. For example, a decomposition algorithm that did not consider the effect of retiming would reject the transformation in Figure 4.1b because it made the circuit both larger and slower.

procedure Restructure(S, c)
    feasible_arrival_times ← Bellman-Ford(S, c)
    if the feasible arrival time computation failed then
        return FAIL
    Resynthesize(S, c, feasible_arrival_times)
    Retime(S)
    return S

Figure 4.2: My algorithm for restructuring a circuit S to achieve a period c

4.1.3 Overview of the algorithm

My algorithm takes a network S and a target clock period c and uses resynthesis and retiming to produce a circuit with period c if one can be found, or returns failure. Figure 4.2 shows the three phases of my algorithm.

In the first phase, "Bellman-Ford" (shown in Figure 4.3 and described in detail in Section 4.2), I consider all possible Shannon decompositions by examining different ways of restructuring each node. This procedure vaguely resembles technology mapping in that it considers replacing each gate with one taken from a library, but does so in an iterative manner because it considers circuits with (sequential) loops. More precisely, the algorithm attempts to compute a set of feasible arrival times for each signal in the circuit that indicate that the target clock period c can be achieved after resynthesis and retiming. If the smallest such c is desired, my algorithm is fast enough to be used as a test in a binary search that can approximate the lowest possible c. However, using the smallest possible c is not the only option: using greater values of c may lead to smaller circuits, so my algorithm can be used to explore performance/area tradeoffs.

In the second phase ("Resynthesize," described in Section 4.3), I use the results of this analysis to resynthesize the combinational gates in the network. This is non-trivial because, to conserve area, I wish to avoid the most aggressive (read: area-consuming) circuitry everywhere but on the critical paths.

As can be seen in the example of Figure 4.1, the circuit generated after the second phase may have worse performance than the original circuit. I therefore apply classical retiming to the circuit in the third phase, which is guaranteed to produce a circuit with period c.

In Section 4.4, I present experimental results suggesting that my algorithm is efficient and can produce a substantial speed improvement with a minimal area increase on half of the iscas89 circuits I tested; my algorithm is unable to improve the other half. However, this percentage is better for circuits taken from packet processing pipelines.

4.1.4 Related work

The spirit of my approach is a fusion of Pan's technique for considering retiming while performing resynthesis [Pan99] with the technology-mapping technique of Lehman et al. [LWGH97], which implicitly represents many different circuit structures. However, the details of my approach differ greatly. Unlike Pan, I consider a much richer notion of arrival time because I consider many circuit structures simultaneously, and my resynthesis technique bears only a passing resemblance to classical technology mapping, since my "cell library" is implicit and I consider re-encoding signals beyond simple inversion.

Performance-driven combinational resynthesis is a mature field. Singh et al.'s tree height reduction [SWBSV88] is typical: it optimizes critical combinational paths at the expense of non-critical ones. Along similar lines, Berman et al. [BHLT90] propose the generalized select transform (gst). Like gst, my technique employs Shannon decomposition, but it also considers the effect of retiming. Other techniques include McGeer's generalized bypass transform (gbx) [MBSVS91], which exploits certain types of false paths, and Saldanha's exact sensitization of critical paths [SHM+94], which makes corrections for input patterns that generate a late output.

My algorithm employs Leiserson and Saxe's retiming [LS91], which can decrease the minimum period of a sequential network by repositioning registers. This commonly-used transformation cannot change the number of registers on a loop; my work employs Shannon decomposition to work around this. Sequential logic resynthesis has also attracted extensive attention, such as the work of Singh [Sin92]. Malik et al. [MSBSV91] combine retiming and resynthesis (r&r). Pan's approach to r&r [Pan99] is a superset of mine, but my restriction to Shannon decomposition allows me to explore the design space more systematically.

Hassoun et al.'s [HE96] architectural retiming mixes retiming with speculation and prediction to optimize pipelines; Marinescu et al.'s technique [MR01] proposes using stalling and forwarding. Like me, they identify critical cycles as a major performance issue, but they synthesize from high-level specifications and can make architectural decisions. My work trades this flexibility for more detailed optimizations.

4.2 The First Phase: Computing Arrival Times

While for a combinational network arrival times can be easily computed by classical static timing analysis, the problem becomes harder when retiming is considered. Intuitively, to compute arrival times I must also consider multi-cycle paths and sequential loops, as the current position of the registers can be changed by a later retiming. To account for this, I use a modified Bellman-Ford algorithm (Figure 4.3) instead of classical static timing analysis to determine whether the fastest circuit I can find can be retimed so as to achieve period c. If the first phase is successful, it produces as a side-effect arrival times that guide my resynthesis algorithm, which I describe in Section 4.3.

4.2.1 Basics

My algorithm operates on sequential circuits that consist of combinational nodes and registers. Formally, a sequential circuit is a directed graph S = (V, E) with vertices V = PI ∪ PO ∪ N ∪ R ∪ {spi, spo}.

1:  function BellmanFord(S, c)
2:      for each register n in S do
3:          d(n) ← −c
4:      for each vertex n in S do
5:          fat(n) ← {(−∞)}
6:      fat(spi) ← {(0)}
7:      for i = 0 to max-iterations do
8:          for each vertex n in S in topological order do
9:              Relax(n)
10:         if there exists (t) in fat(spo) such that t > c then
11:             return FAIL
12:         if there were no changes in this iteration then
13:             return fat
14:     return FAIL

15: procedure Relax(n)
16:     F ← ∅
17:     for each p ∈ fanin(n) do
18:         T ← the arrival times of s0-encoded fanins ≠ p
19:         for each t ∈ fat(p) do
20:             s_i ← the encoding of t
21:             for o = 0 to i do
22:                 add AT(⟨s_i, s_i, s_o⟩, t, T, d(n)) to F
23:                 add AT(⟨s_i, s_{i+1}, s_o⟩, t, T, d(n)) to F
24:             add AT(⟨s_i, s_{i+1}, s_{i+1}⟩, t, T, d(n)) to F
25:     while there are p, q ∈ F s.t. p ≼ q and p ≠ q do
26:         remove q from F
27:     if F ≠ fat(n) then
28:         fat(n) ← F

Figure 4.3: My Bellman-Ford algorithm for computing feasible arrival times

PI and PO are the primary inputs and outputs, respectively; N are single-output combinational nodes; R are the registers; spi and spo are two super-nodes connected to and from all PI and PO, respectively. The edges E ⊂ V × V model the interconnect: fanin(n) = {n′ | (n′, n) ∈ E}. I assume S has no combinational cycles and that each vertex is on a path from spi to spo. I define weights d : V → ℝ that represent the following:

    d(n) = { arrival time (from clock)   if n ∈ PI
           { delay of logic              if n ∈ N
           { required time (to clock)    if n ∈ PO
           { 0                           if n ∈ R ∪ {spi, spo}

Arrival times are computed in a topological order on the combinational nodes:

    at(n) = d(n) + max_{n′ ∈ fanin(n)} at(n′)    (4.1)
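For reference, equation (4.1) is ordinary static timing analysis, as in the following Python sketch (my illustration; the feasible-arrival-time computation of Figure 4.3 generalizes this single-value propagation to sets of encoded arrival times):

    def arrival_times(order, fanin, d):
        # Classical arrival-time propagation, equation (4.1).
        # order: vertices in topological order; fanin[n]: list of
        # fanins of n; d[n]: delay (or input arrival time) of n.
        at = {}
        for n in order:
            preds = fanin[n]
            at[n] = d[n] + (max(at[p] for p in preds) if preds else 0.0)
        return at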

4.2.2 Shannon decomposition

Let f : B^p → B be the Boolean function of a combinational node n with p inputs and let 1 ≤ k ≤ p. Then

    f(x_1, x_2, ..., x_p) = x_k · f_{x_k} + x̄_k · f_{x̄_k}

where

    f_{x_k} = f(x_1, ..., x_{k−1}, 1, x_{k+1}, ..., x_p)  and
    f_{x̄_k} = f(x_1, ..., x_{k−1}, 0, x_{k+1}, ..., x_p).

This Boolean property, due to Shannon, has an immediate consequence: modifying a node as shown in Figure 4.4 leaves its function unchanged even though its input-to-output delays change. This is known as the Shannon or generalized select transform [BHLT90].
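The identity is easy to check mechanically; the following Python sketch (purely illustrative) builds the two cofactors and recombines them with the output mux:

    def shannon_decompose(f, k):
        # Shannon decomposition of f on input k: the two cofactors
        # plus a "mux" on x_k, equivalent to the original f.
        fx1 = lambda xs: f(xs[:k] + (1,) + xs[k + 1:])   # f with x_k = 1
        fx0 = lambda xs: f(xs[:k] + (0,) + xs[k + 1:])   # f with x_k = 0
        return lambda xs: fx1(xs) if xs[k] else fx0(xs)  # the output mux

    # Check the identity exhaustively for a 3-input example:
    f = lambda xs: (xs[0] & xs[1]) ^ xs[2]
    g = shannon_decompose(f, 1)
    assert all(f((a, b, c)) == g((a, b, c))
               for a in (0, 1) for b in (0, 1) for c in (0, 1))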

My algorithm relies on the fact that the arrival time at(n) may decrease if x_k arrives later than all other x_i (i ≠ k):

    at(n) = max{ at(f_{x_k}), at(f_{x̄_k}), at(x_k) } + d_mux


Figure 4.4: Shannon decomposition of f with respect to x_k


Figure 4.5: The basic Shannon “cell library”

Since the circuit must compute both f_{x_k} and f_{x̄_k}, the area typically increases. Intuitively, this is speculation: I start computing f before knowing x_k, so both possible scenarios are considered; the correct one is selected once x_k becomes available.

Figure 4.6: Shannon decomposition through node replacement: (a) initial circuit; (b) after Shannon transformation

4.2.3 Shannon decomposition as a kind of technology mapping

A key to my technique is the ability to consider many different resynthesis options simultaneously. My approach resembles technology mapping in that I consider replacing each node in a network with one taken from a cell library. Figure 4.5 is the beginning of my library for characterizing multiple Shannon decompositions (my full algorithm considers a larger library that builds on this one; see Section 4.2.5). Unchanged leaves the node unchanged, and Shannon bypasses one of the inputs using Shannon decomposition. Both of these are local changes; the Start variant begins a Shannon decomposition that can either be extended by Extend or terminated by Stop.


Figure 4.7: The Shannon transform as redundant encoding.

While the Unchanged and Shannon cells can be used in arbitrary places, the three-wire output of a Start cell can only feed the three-wire input of an Extend or Stop cell. Furthermore, to minimize node duplication, I only allow a single Shannon-encoded input per node, so at most one input to an Extend node may be Start or Extend. Figure 4.6 illustrates how this works. Figure 4.6a shows the two Shannon transforms I wish to perform, one involving a single node, the other involving two. I replace node h with Shannon, node i with Start, and node j with Stop. Figure 4.6b shows the resulting circuit, which embodies the transformation I wanted.

4.2.4 Redundant encodings and arrival times

Using Shannon decomposition to improve circuit performance is a particular case of the more general idea of using redundant encodings to reduce circuit delay. The main challenge in producing a fast circuit is producing the information later gates need as early as possible. There is not much flexibility when a single bit travels down a single wire, but using multiple wires to transmit a single bit might allow the sender to transmit partial information earlier. This can enhance performance if the downstream computation can be restructured so it performs most of the work using only partial data and then quickly calculates the final result using the remaining late-arriving data.

Figure 4.7 illustrates this idea on the Shannon transformation of a single node. On the left, inputs x, y, and z are each conveyed on a single wire, a trivial encoding I label s0. Just after that, however, I choose to reencode x using the three-wire encoding s1, in which one wire selects which of the other two wires actually carries the value. This is the basic Shannon-inspired encoding. On the left side of Figure 4.7 this “encoding” step amounts to adding two constant-value wires and interpreting the original wire (x) as a select. However, once I feed this encoding into the two copies of f, the result is more complicated. Again, I am using the s1 encoding with x as the select wire, but instead of two constant values, the two other wires carry the cofactors f_x and f_x̄. On the right of Figure 4.7, I use a two-input multiplexer to decode the s1 encoding into the single-wire s0 encoding.

Using such a redundant encoding speeds up the circuit if the x signal arrives much later than y or z. When I re-encode a signal, the arrival times of its various components can differ, which is the source of the speed-up. For example, at the s1 encoding of x, the x wire arrives later while the two constant values arrive instantly.

A central trick in the first phase of my algorithm is the observation that only arrival times matter when considering which cell to use for a particular node. In particular, the detailed topology of the circuit, such as whether a Shannon decomposition had been used, is irrelevant when considering how best to resynthesize a node. Only arrival times and the encoding of each arriving signal matter.

4.2.5 My family of Shannon encodings

While a general theory of re-encoding signals for circuit resynthesis could be (and probably should be) developed, in this chapter I restrict my focus to a set of encodings derived from Shannon decompositions that aim to limit area overhead. In particular, evaluating a function with a single encoded input only ever requires making two copies of the original function. The basic cell library of Figure 4.5 works well for most circuits, but some transformations demand the richer library I describe in this section. For example, the larger family is required to convert a ripple-carry adder into a carry-select adder.

[Figure: x passes through the encoding codecs c0,1, c1,2, c2,3; two copies of f evaluate the s3-encoded input together with y and z; the codecs c3,2, c2,1, c1,0 decode the result back to s0.]

Figure 4.8: Encoding x, evaluating f(x, y, z), and decoding the output for the s0, s1, s2, and s3 codes.

Technically, I define an encoding as a function e : B^k → B that maps information encoded on k wires to a single bit. My family of Shannon-inspired encodings S = {s0, s1, ...} is defined recursively:

si = x0 ↦ x0                                                              if i = 0
si = x0, ..., x2i ↦ si−1(x̄2·x0 + x2·x1, x̄3·x0 + x3·x1, x4, ..., x2i)      otherwise

The first few such encodings are listed below.

s0 = x0 ↦ x0
s1 = x0, x1, x2 ↦ x̄2·x0 + x2·x1
s2 = x0, x1, x2, x3, x4 ↦ x̄4·(x̄2·x0 + x2·x1) + x4·(x̄3·x0 + x3·x1)

Figure 4.8 shows a circuit that takes three inputs, x, y, and z, encodes x to the s1, s2, and s3 codes before evaluating f(x, y, z) with the s3-encoded x, then decodes the result back to the single-wire s0 encoding. While this circuit as a whole is worse than the simple Shannon decomposition of Figure 4.7, subsets of this circuit provide a way to move between encodings.
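The recursion can be checked mechanically. A small Python sketch of mine (the decode helper is illustrative, not part of my tools) implements the si decoder as the mux tree the recursion describes and verifies the closed forms above:

    from itertools import product

    def decode(xs):
        """Decode an s_i-encoded tuple (2i+1 wires) down to the single s0 bit,
        following the recursive definition above."""
        if len(xs) == 1:                       # s0: the bit itself
            return xs[0]
        x0, x1, x2 = xs[0], xs[1], xs[2]
        if len(xs) == 3:                       # s1: x2 selects x0 or x1
            return x1 if x2 else x0
        y0 = x1 if x2 else x0                  # ~x2*x0 + x2*x1
        y1 = x1 if xs[3] else x0               # ~x3*x0 + x3*x1
        return decode((y0, y1) + tuple(xs[4:]))

    # check the closed forms for s1 and s2 listed above
    for x in product((0, 1), repeat=3):
        assert decode(x) == (1 - x[2]) * x[0] + x[2] * x[1]
    for x in product((0, 1), repeat=5):
        s2 = (1 - x[4]) * ((1 - x[2]) * x[0] + x[2] * x[1]) \
             + x[4] * ((1 - x[3]) * x[0] + x[3] * x[1])
        assert decode(x) == s2

    # the c0,1 encoding step of Figure 4.7: two constant wires plus x as select
    assert all(decode((0, 1, x)) == x for x in (0, 1))
    print("s1 and s2 decoders match the closed forms")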

I think of the layers of Figure 4.8 as “Shannon codecs”—small circuits that transform an encoding sa to sb. I write ca,b for such a codec circuit. For a < b, ca,b just adds pairs of wires connected to constant 0’s and 1’s. For a > b, ca,b consists of some multiplexers.

I chose my family of Shannon encodings so that evaluating a functional block with a single input with a higher-order encoding only ever requires two copies of the functional block. Figure 4.8 shows this for the s3 case; other cases follow the same pattern. In general, I only allow a single encoded input to a cell (all others are assumed to be single wires, i.e., s0-encoded). To evaluate a function f, I make two copies of it, feed the non-encoded signals to each copy, feed x0 and x1 from the encoded input signal to the two copies, which compute f0 and f1 of the encoded output signal, and pass the rest of the xi from the encoded signal directly through to form f2, etc. The Extend cell in Figure 4.5 is a case of this pattern for encoding s1.

I consider one important modification of this basic idea: the introduction of “codec” stages at the inputs and outputs of a cell. Following the rules suggested in Figure 4.8, I could place arbitrary sequences of encoders and decoders on the inputs and output of a functional block, but to limit the search space, I only consider cells that place one encoder on a single input and one or more decoders on the output. For instance, the Stop cell in Figure 4.5 is derived from the Extend cell by appending a decoder to its output. Similarly, the Start cell consists of an encoder plus the Extend cell. Finally, the Shannon cell adds both an encoder and a decoder to the Extend cell.

Neither the number of inputs of a cell nor the index of the encoded input matters to the computation of arrival times, since I assume each node has the same delay from each input to the output.

FAT set for x = {(14), (13, 13, 11)}    FAT set for y = {(6)}    FAT set for z = {(8), (7, 7, 7)}    f: delay = 2

       Unchanged     Shannon      Start         Stop         Extend
       ⟨s0,s0,s0⟩    ⟨s0,s1,s0⟩   ⟨s0,s1,s1⟩    ⟨s1,s1,s0⟩   ⟨s1,s1,s1⟩
  x    (16)          (15)         (10,10,14)    (16)         (15,15,11)
  y    (16)          (17)         (16,16,6)     —            —
  z    (16)          (17)         (16,16,8)     (17)         (16,16,7)

Figure 4.9: Computing feasible arrival times for a node f

Therefore, I identify a cell with a triplet of encodings ⟨si, sf, so⟩, where sf denotes the encoding of the encoded input fed to the combinational logic; si denotes the encoding of the input at the interface of the cell, which may differ from sf because of an encoder (i = f or i = f − 1); and so denotes the encoding of the output of the cell, which may differ from sf because of decoders (0 ≤ o ≤ f). The columns of the table in Figure 4.9 correspond to the triplets for each of the five cells in the basic Shannon cell library (Figure 4.5). My algorithm considers many additional cell types, which arise from higher-degree encodings. In theory, there is an infinite number of such cells, but in practice only a few more are interesting. The circuit in Figure 4.8 illustrates some higher-degree encodings, corresponding to ⟨s0, s3, s0⟩. My algorithm would never generate this particular cell in its entirety because if sf is s3, then si may be either s3 or s2, not s0. It could, however, generate the cell with an s2-encoded input, i.e., ⟨s2, s3, s0⟩.

4.2.6 Sets of feasible arrival times

The first phase of my algorithm (Figure 4.3) attempts to compute a set of feasible arrival times that will guide me in resynthesizing the initial circuit into a form that can be retimed to give the target clock period c. In classical static timing analysis, the arrival time at the output of a gate is a single number: the maximum of the arrival times at its inputs plus the intrinsic delay

of the gate (see equation (4.1)). In my setting, however, I represent the arrival time of a signal with a set of tuples, each representing the arrival of the signal under a particular encoding. Considering sets of arrival times allows me to simultaneously consider different circuit variants. My arrival-time tuples contain one real number for each wire in the encoding. Since no two encodings use the same number of wires, the arity of the tuple effectively defines the encoding. For example, an arrival-time 3-tuple such as (2, 2, 3) is always associated with encoding s1 since that is the only encoding comprising three wires; the 5-tuple (3, 3, 5, 5, 6) corresponds to s2, etc. The example in Figure 4.9 illustrates how I compute the set of feasible arrival times (fats) at a node f through brute-force enumeration. The code for this appears in lines 17–24 of Figure 4.3. The fat sets shown in Figure 4.9 indicate that input x can arrive as early as 14 as an unencoded (s0) signal, or at times 13, 13, and 11 for the three wires in an s1-encoded signal. Similarly, y can arrive as early as 6, and z may arrive as early as 8 in s0 form or (7, 7, 7) in s1 form. Considering only the cell library of Figure 4.5, the table in Figure 4.9 shows how I enumerate the possible ways f can be implemented. The rows of this table correspond to the iteration over the fanins of the node—line 17 in Figure 4.3. The columns are produced from the iterations over the input encoding from the fanin (line 19) and the output encoding (line 21). The Unchanged case is considered first for each input. Since every input has the s0 encoding for this case, I obtain the same arrival time for each input. Here, this is 14 + 2 = 16, which is the earliest arrival time of x, the latest-arriving s0 signal, plus 2, the delay of f.

The Shannon case is considered next for each input. This produces an s0-encoded output, so the resulting arrival times are singletons. For example, if y is the s1-encoded input, the longest path through the cell starts at x (time 14), passes through f (time 16), and goes through the output mux (time 17; I assume muxes have unit delay). This gives the (17) entry in the Shannon column in the y row. Next, the algorithm considers a Start cell: one that starts a Shannon decomposition at each of the three inputs. For example, if I start a Shannon decomposition with s0 input x, the two s0 inputs for y and z are passed to copies of f (this is the structure of a Start cell) and the s0 x input is passed to the three-wire output (Start cells produce an s1-encoded output). The outputs of the two copies of f become the x0 and x1 outputs of the cell and arrive at time 8 + 2 = 10 because z arrives at time 8 and f has a delay of 2. The s0 x input is passed directly to the output, which I assume is instantaneous, so it arrives at the output at time 14. Together, these produce the arrival time tuple (10, 10, 14), the entry in the x row in the Start column.

The Stop and Extend cases require one of their inputs to be s1-encoded. Since no such encoding for y is available (i.e., there is no triplet in its arrival time set), I do not consider using y as this input; hence the Stop and Extend columns are empty in the y row.

In Figure 4.3, the at function (called in lines 22–24) is used to compute the arrival time for each of these cases. In general, the arrival time AT(⟨si, sf, so⟩, t, T, d(n)) of a variant of node n with shape ⟨si, sf, so⟩ depends on the arrival time t of the encoded input, the set of arrival times T of the other inputs, and the delay d(n) of the combinational logic. It is computed using regular static timing analysis, i.e., equation (4.1) for each of the components of the cell. I do not present pseudocode for the at function since it is straightforward yet fuzzy.

Even this simple example produced over thirteen arrival time tuples from five (Figure 4.9 does not list the higher-order encodings my algorithm also considers); such an increase is typical. Fortunately, most are not interesting as they obviously lead to slower circuits. Since in this phase I am only interested in the fastest circuit, I can discard most of the results of this exhaustive analysis to greatly speed the analysis of later nodes—I discuss this below.
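Although I omit pseudocode for at, a minimal sketch restricted to the five basic cells, assuming unit-delay muxes and zero-delay pass-through wires (the at_cell helper is my illustration, not the implementation), reproduces the x row of the Figure 4.9 table:

    DMUX = 1  # unit mux delay, as assumed in the text

    def at_cell(shape, t, T, d):
        """Arrival-time tuple of one basic-library cell variant.
        t: arrival tuple of the chosen (possibly encoded) input;
        T: arrival times of the remaining s0 inputs; d: logic delay."""
        if shape == "Unchanged":              # <s0,s0,s0>
            return (max(max(T, default=0), t[0]) + d,)
        if shape == "Start":                  # <s0,s1,s1>: duplicate f, t[0] is the select
            f = max(T, default=0) + d
            return (f, f, t[0])
        if shape == "Shannon":                # <s0,s1,s0>: Start plus an output mux
            return (max(at_cell("Start", t, T, d)) + DMUX,)
        if shape == "Extend":                 # <s1,s1,s1>: t is an s1 triple
            rest = max(T, default=0)
            return (max(t[0], rest) + d, max(t[1], rest) + d, t[2])
        if shape == "Stop":                   # <s1,s1,s0>: Extend plus an output mux
            return (max(at_cell("Extend", t, T, d)) + DMUX,)

    # the x row of Figure 4.9: fat(y) = {(6)}, fat(z) = {(8), (7,7,7)}
    print(at_cell("Unchanged", (14,), [6, 8], 2))        # (16,)
    print(at_cell("Shannon",   (14,), [6, 8], 2))        # (15,)
    print(at_cell("Start",     (14,), [6, 8], 2))        # (10, 10, 14)
    print(at_cell("Stop",   (13, 13, 11), [6, 8], 2))    # (16,)
    print(at_cell("Extend", (13, 13, 11), [6, 8], 2))    # (15, 15, 11)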

4.2.7 Pruning feasible arrival times

As Figure 4.9 showed, a node can usually be resynthesized in many ways. Fortunately, most variants are not interesting because they are slower than others. In this section, I describe my policy for discarding implementations that are never faster than others, since in the first phase I am only interested in whether there is a circuit that will run with period c or faster. In the second phase, I will use slower cells off the critical path to save area—see Section 4.3. If p and q are two arrival times at the output of a node, then I write p ⪯ q if an implementation of the circuit where the arrival time of the node is q cannot be faster (i.e., admit a smaller clock period) than an implementation where the arrival time of the node is p. Consequently, if I find cell implementations that produce p and q, I can safely ignore the implementation that produces q without fear of erroneously concluding that a particular period is unattainable. My fat set pruning operation removes all such dominated arrival times from the set of arrival times produced as described above. In Figure 4.3, this pruning is performed on lines 25–26. Practically, I implement a conservative version of the ⪯ relation described above because the precise condition is actually a global property. If a node is off the critical path, for example, more arrival times could probably be pruned than my implementation admits, but I find my pruning works quite well in practice.

For s0-encoded signals, the ordering is simple: the faster-arriving signal is superior. For two arrival times for signals encoded in the same way, the ordering is componentwise: if every component is faster, then the arrival time is superior; otherwise the two arrival times are incomparable, because a later Shannon decomposition might be able to take advantage of the differential in ways I cannot predict locally. For arrival times corresponding to different encodings, the argument is a little more subtle. Consider Figure 4.8. In general, only the first two wires in an encoded signal are ever fed directly to functional blocks (e.g., x0 and x1 in the s3 encoding in Figure 4.8); the others are fed to a collection of multiplexers. The wires in higher-level encodings must eventually meet more multiplexers than those in lower-level encodings, so a lower-level encoding whose elements are strictly better than the first elements in a higher-level encoding is always going to be better.

(15) ⪯ (16), (17), (15, 15, 11), (16, 16, 6), (16, 16, 7), (16, 16, 8);  (10, 10, 14) is incomparable with (15)
Pruned FAT set: {(15), (10, 10, 14)}

Figure 4.10: Pruning the feasible arrival times from Figure 4.9.

Concisely, my choice of ⪯ (my pruning rule) is, for two arrival times p = (p0, p1, ..., pn) and q = (q0, q1, ..., qm) with potentially different encodings:

p ⪯ q  iff  n ≤ m and p0 ≤ q0, p1 ≤ q1, ..., pn ≤ qn.
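The rule is short enough to transcribe directly; this sketch of mine applies it to the arrival times enumerated in Figure 4.9 and recovers the pruned set of Figure 4.10:

    def dominated(p, q):
        """p 'dominates' q: an implementation with arrival time q can never
        beat one with p (p is no longer and componentwise no later)."""
        return len(p) <= len(q) and all(pi <= qi for pi, qi in zip(p, q))

    def prune(fat):
        return {q for q in fat
                if not any(dominated(p, q) and p != q for p in fat)}

    fat = {(16,), (15,), (10, 10, 14), (15, 15, 11), (17,),
           (16, 16, 6), (16, 16, 7), (16, 16, 8)}
    print(prune(fat))   # {(15,), (10, 10, 14)}, the pruned set of Figure 4.10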

Figure 4.10 illustrates how pruning reduces the size of the fat set computed in Figure 4.9. The singleton (15) dominates most of the other arrival times (some of which appear more than once in Figure 4.10—remember that I ultimately operate on fat sets, not the table of Figure 4.9), but (10, 10, 14) is not comparable with (15) (for a singleton to dominate a triplet, the first value in the triplet must be greater or equal). That the eleven arrival times computed in Figure 4.9 (only eight are distinct) boil down to only two interesting ones (Figure 4.10) is typical. Across all the circuits I have analyzed, I find pruned fat sets seldom contain more than four elements. This is a key reason my algorithm is efficient: although it is considering many circuit structures at once, it only focuses on the fastest ones.

4.2.8 Retiming

At this point, I have described my technique for restructuring circuits based on Shannon decomposition: I consider re-implementing isolated nodes with variants taken from a virtual library (Figure 4.5 shows a subset) and discussed how I can represent these variants as feasible arrival time (fat) sets. I now want to choose variants so as to improve the effects of a later retiming step. Retiming [LS91] follows from noting that moving registers across combinational nodes preserves the circuit functionality. Retiming tries to move registers to reduce long (critical) combinational paths at the expense of short (non-critical) ones. However, it cannot decrease the total delay along a cycle.

Let ret(S) be the minimum period achievable through retiming. If dC and rC are the combinational delay and the number of registers of a cycle C in S, then ret(S) ≥ dC/rC. Similarly, if P is a path from spi to spo having rP registers and combinational delay dP, then ret(S) ≥ dP/(rP + 1). Thus, ret(S) ≥ lb(S), where

lb(S) = max( max_{C ∈ cycles(S)} dC/rC ,  max_{P ∈ paths(S,spi,spo)} dP/(rP + 1) )        (4.2)

is known as the fundamental limit of retiming. Classical retiming may not achieve lb(S). To achieve it in general, I must allow registers to be inserted at precise points inside the nodes. I will assume this is possible (which it is, for example, in fpgas [TSSV92]), so ret(S) = lb(S) holds. I shall thus focus on transforming S to minimize lb(S).

4.2.9 Using Bellman-Ford to compute arrival times

Computing lb(S) by enumerating cycles and applying (4.2) is not practical because the number of cycles may be exponential; instead I use the Bellman-Ford single-source shortest-path algorithm¹, where the source node is spi.

¹ Bellman-Ford reports whether a graph has any negative cycles; only in their absence can it compute shortest paths. It runs in O(V E). Technically, I change signs, so I detect positive cycles instead of negative ones and compute longest instead of shortest paths (possible when no cycle is positive). Note that this is not the longest simple-path problem, which allows positive cycles and is known to be NP-complete.

To apply Bellman-Ford, which has no notion of registers, I treat registers as nodes with negative delay: ∀r ∈ R, d(r) = −c, where c is the desired period. Only registers are assigned these artificial delays; the other nodes, i.e., the ones containing normal combinational logic, keep their positive delays d(n) as defined in Section 4.2.1. Pan [Pan99] also uses this technique.

Now, the total length of a path P ∈ paths(S) becomes

Σ_{n∈P} d(n) = Σ_{n∈P\R} d(n) + Σ_{n∈P∩R} d(n) = dP − c·rP,

where the first term is the delay of the combinational logic and the second corresponds to the registers. For any path P ∈ paths(S, spi, spo), we have

c ≥ dP/(rP + 1)  ⇔  dP − c·rP ≤ c  ⇔  Σ_{n∈P} d(n) ≤ c.

Moreover, any cycle C ∈ cycles(S) is a closed path, so

c ≥ dC/rC  ⇔  dC − c·rC ≤ 0  ⇔  Σ_{n∈C} d(n) ≤ 0.

Equation (4.2) becomes

c ≥ lb(S)  ⇔  ∀C ∈ cycles(S): Σ_{n∈C} d(n) ≤ 0  and  ∀P ∈ paths(S, spi, spo): Σ_{n∈P} d(n) ≤ c.

That is, the period c ≥ lb(S) iff no cycle is positive and the longest path from spi to spo is at most c. The first condition is verified if the Bellman-Ford algorithm converges in a bounded number of iterations. If so, it also gives us at(n)—the longest path between spi and any node n. I verify the second condition by checking if at(spo) ≤ c. Therefore, lb(S) can be approximated by binary search on the period c.
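To make the sign-change trick concrete, here is a minimal sketch (mine, on an invented four-node example) of the plain, non-fat feasibility test: registers get delay −c, and period c is feasible for retiming iff the longest-path relaxation converges and at(spo) ≤ c. In the example, the cycle has delay 5 and one register, so the minimum feasible period is 5:

    def retiming_feasible(nodes, fanin, d, c, spi="spi", spo="spo"):
        """Longest-path Bellman-Ford with registers modeled as delay -c.
        Returns True iff no positive cycle exists and at(spo) <= c."""
        at = {n: float("-inf") for n in nodes}
        at[spi] = 0.0
        for _ in range(len(nodes)):
            changed = False
            for n in nodes:                        # ideally in topological order
                for p in fanin[n]:
                    if at[p] + d[n] > at[n]:
                        at[n] = at[p] + d[n]
                        changed = True
            if not changed:                        # converged: no positive cycle
                return at[spo] <= c
        return False                               # still changing: positive cycle

    # invented example: spi -> f -> r -> g -> spo, with feedback edge g -> f
    nodes = ["spi", "f", "r", "g", "spo"]
    fanin = {"spi": [], "f": ["spi", "g"], "r": ["f"], "g": ["r"], "spo": ["g"]}
    for c in (4.0, 5.0):
        d = {"spi": 0.0, "f": 2.0, "r": -c, "g": 3.0, "spo": 0.0}
        print(c, retiming_feasible(nodes, fanin, d, c))  # 4.0 False, 5.0 True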

To consider the combined effect of my restructuring and retiming, I use a variant of the Bellman-Ford algorithm that uses the fat computation plus pruning operation as its central relaxation step (Figure 4.3). The main loop (lines 7–13 in Figure 4.3) terminates when the relaxation has converged to a solution or it has become fairly obvious that no solution will be found. This latter case is actually a heuristic, which I discuss in the next section. If Bellman-Ford converges, i.e., reaches a fixed point such that at(spo) ≤ c, then there exists an equivalent circuit for which lb(S) ≤ c, so, after retiming, c is feasible. To prove this claim, I simply build a circuit using the cell variants I considered during the relaxation procedure. However, such a brute-force construction produces overly large circuits, so instead I use a more careful construction that limits Shannon-induced duplication to critical paths only, the subject of Section 4.3.

I illustrate my brute-force construction on the sample in Figure 4.12. The original circuit appears at the top of the figure. Using a desired period c = 3, after a number of steps, depicted below it, my augmented Bellman-Ford algorithm converges; this implies a fixed-point solution, depicted on the last row, i.e., a fat set for each node that is stable under the pruned fat set computation. Hence, I claim that period c = 3 is feasible. In the sample, for simplicity, each gate has delay 2 and the multiplexers to be inserted have delay 1. For each node, I build an implementation corresponding to each element of its fat set; I am free to choose any cell from Figure 4.5 and use any fat elements at each input, as described in Section 4.2.6. For example, for node h at the bottom of Figure 4.12, I consider two implementations. These are Start and Shannon (Figure 4.5), both with g's output as the select. These give arrival times of (4, 4, 8) and (9). The reconstruction procedure will succeed for each node as a consequence of how I computed the pruned fat sets during the Bellman-Ford relaxation. If Bellman-Ford converges, the resulting network will have lb(S) ≤ c, so I will have a solution after retiming.

[Plots: Bellman-Ford iteration counts (log scale) versus target clock period for three circuit samples: (a) s38417, (b) s9234, (c) s13207.]

Figure 4.11: Iterations near the lowest c

4.2.10 Termination of Bellman-Ford and max-iterations

My modified Bellman-Ford algorithm has three termination conditions: two that are exact and one that is a heuristic. The most straightforward is the check for a fixed point on line 12. This is the only case in which I conclude the relaxation has converged, and it is usually reached quickly in practice. The second termination condition is due to the topology of the circuit graphs. If no fixed point exists, the s0-encoded arrival time in fat(spo) will eventually become greater than c. This is because any positive cycle (the absence of a fixed point means such a cycle exists) must pass through spo, and along any positive cycle the arrival times must keep increasing during the relaxation procedure. So checking the s0-encoded signal (there is always exactly one) suffices. This is the check in line 10. The third termination condition—the hard iteration bound of max-iterations on line 7—is a heuristic. I employ such a bound because convergence usually happens quickly while detecting non-convergence can be very slow. It is always safe to terminate the loop early, i.e., to assume that the remaining iterations, if continued, would never have converged: there is no inconsistency risk, but I may get a sub-optimal solution.

I expect convergence to be fast because of the behavior of the usual Bellman-Ford algorithm. When it converges, it does so in at most n iterations, where n is the number of vertices in the graph², because the smallest positive cycle has a length of at most n. However, provided the vertices with positive weights are visited in a topological order (w.r.t. the dag obtained from the graph by removing the registers), the convergence is in practice much faster. In my adapted Bellman-Ford algorithm, I use such an ordering for the inner loop at lines 8–9. Therefore, I expect fast convergence when the period c is feasible. In particular, the speed of convergence, while depending on the circuit topology, should be essentially independent of the distance between the feasible clock period c tried and the lowest clock period achievable by my technique. However, when c is infeasible and gets closer to the lowest achievable period, the number of iterations before overflow may increase arbitrarily.

The graphs in Figure 4.11 provide empirical evidence for this. There, I plot iteration counts for three large circuit samples (with max-iterations arbitrarily set to 2000; I observe similar behavior across all circuits). The scale is logarithmic. For the first two examples, I do observe that divergence is indeed costly to detect when c gets close to the limit. In the third case, probably because of a tight cycle close to one circuit output, I see no such curve. However, what matters is that convergence is always very fast. Therefore, it makes sense to choose max-iterations to be a small quantity. I choose max-iterations = 200 and get good results for all the benchmarks; that is, the algorithm never fails to obtain stable fat sets because of an early timeout. As a result, if the minimum feasible c is computed by binary search, the result is very close to the true value.

² Unfortunately, I do not have a similar result for my variant because of the complex behavior of the fat set generation and pruning. However, even if I knew a closed-form bound, I would still prefer my heuristic early-termination condition because it produces very good results much faster.

4.3 The Second Phase: Resynthesizing the Circuit

The first phase of the algorithm focuses exclusively on performance: it considers only the fastest possible circuit it can find and ignores the others for efficiency. This makes sense since it is trying to establish the existence of a fast-enough circuit, but I would really like to find a fast-enough circuit that is as small as possible. The goal of the second phase of the algorithm, described in this section, is to select a minimal-area circuit that still meets the given timing constraint. I do this heuristically. Taken on its own, before retiming, the circuit produced by this part of my algorithm is often slower than the original. The circuit in Figure 4.13a happens to have the same minimum period (9) as the original circuit in Figure 4.12. However, lb(S) as defined by equation (4.2) is 3, and thus after retiming (Figure 4.13b) the minimum period drops to my target of 3.

[Figure: seven relaxation snapshots of a small circuit with inputs x, y, z and nodes f, g, h, t (gate delay 2, mux delay 1), showing each node's fat set grow from its (−∞) initial value to the fixed point, where, e.g., node h's fat set is {(9), (4, 4, 8)}.]

Figure 4.12: Computing feasible arrival times for a circuit with desired period c = 3. The topmost circuit is the initial condition; the bottom is the fixed point. From a total of seven relaxation steps, only the first are shown.


(a) After Shannon decomposition


(b) Final retimed circuit

Figure 4.13: (a) The circuit from Figure 4.12 restructured according to the fat sets. Its period (9) happens to be the same as the original; in general it is different (and not necessarily better). (b) After retiming, the circuit runs with period 3 as desired.

To minimize area under a timing constraint—the main goal of the second phase—I use a well-known scheme from technology mapping: nodes that are not along critical paths are implemented using smaller, slower cells in such a way that the overall minimum clock period is the same. My construction rebuilds the circuit starting at the primary outputs and register inputs and working backwards to the primary inputs and register outputs. I insist the circuit is free of combinational loops, so this amounts to a reverse topological order in the dag obtained from the graph after removing the registers.

[Figure content: the candidate cells for node h (delay = 2), the arrival times at its inputs ((8) and (2)) and the required times at its outputs ((9) and (4, 4, 8)), together with the feasibility table whose rows are cell families and whose columns are the required times.]

Figure 4.14: Selecting cells for node h in Figure 4.12.

For each gate, I consider the feasible arrival times of its fanins computed in phase one and the feasible required times of its fanouts, which I compute as part of the reconstruction procedure, as described in Section 4.3.3. Figure 4.14 illustrates this procedure for the h node from Figure 4.12. The feasible required times for primary outputs and registers are exactly the feasible arrival times for these nodes computed in the first phase. For each gate, I construct a cell (occasionally more than one) that is just fast enough to meet the deadline (i.e., compute the outputs to meet the required times given the arrival times) without being larger than necessary. A feasible required time (frt) set for a node is much like a fat set: it consists of tuples of numbers that describe when each wire in an encoded signal is needed. At each node, I consider different cells for the node that produce the desired encoding (i.e., the arity of the frt tuple). Since the fat sets were produced by considering all possible cells at each node, I know that some cell will achieve the desired arrival time. To save area, I select the smallest such cell.

If I am lucky, an Unchanged cell suffices, meaning the functional block does not need to be duplicated and there are no additional multiplexers. In Figure 4.13a, node g appears as Unchanged. For node f, the Stop cell was selected. Note that the signal from h to f uses an s1 encoding. For h, I actually selected two cells: Start and Shannon. Fortunately, they share logic—the duplicated h logic block. The s1-encoded output of Start goes to f; the s0 output of Shannon is connected to g. Note that when a register appears on an encoded signal, it is simply replicated on each of its wires. In Figure 4.13a, for example, the register on the output of h became four: one on the output of the mux, an s0-encoded signal, and three for the s1-encoded signal: the two outputs from the two h blocks and the selector signal (the output of g). While this may seem excessive, the subsequent retiming step may remove many of these registers.

4.3.1 Cell families

That I need multiple cells for a particular node to meet timing happens often enough that I developed a heuristic to reduce the area in such cases. It follows from a simple observation: many of the cells in Figure 4.5 are structurally similar. In fact, some are subsets of others and share the two copies of the same functional block, so if I happen to be able to use a cell and its subset to implement a particular node, it requires less area than if I use two different cells. I call cells that only differ by the addition or deletion of multiplexers on the output members of a family. By this definition, each cell is a member of exactly one family. In Figure 4.5, there are three such families: Unchanged is a family by itself, Shannon and Start are another family (they both encode one of their inputs), and Extend and Stop are a third (each takes a single s1-encoded input). Families whose cells have higher-order encodings are larger.

To save area, I try to use only cells taken from the same family since each cell in a family can also be used to implement others in the family without making additional duplicates of the node’s function.

4.3.2 Cell selection

Working backward from the outputs and register inputs, I repeat the enumeration step from the first phase (e.g., Figure 4.9) at each node to generate the set of all cells that are functionally correct for the node (i.e., have the appropriate input and output encodings). From this set, I try to select a set of cells that are both small and meet the timing constraint at the node. Figure 4.14 illustrates this process for the h node in Figure 4.12. I consider the cells generated by the enumeration step one family at a time in order of increasing area. Practically, I build a feasibility table whose rows represent cell families and whose columns represent required times. Such a table appears on the right side of Figure 4.14. My goal with this table is to select a minimum number of rows with small area to cover all the required times for the node. I consider rows instead of cells because implementing multiple cells from the same family is only as costly as implementing the largest cell in the family. An entry in the table is 1 if some cell in the row's family satisfies the required time of the column. To evaluate this, I construct each possible cell for the node (such as those on the left side of Figure 4.14) and calculate the arrival times of the outputs from the arrival times of the inputs. An arrival time satisfies a required time if it corresponds to the same encoding and the arrival time components are less than or equal to the required times. Note that these criteria are simpler than the ⪯ relationship used in the pruning. I select several rows of minimum area that cover all columns (i.e., a collection of cell families that contain cells that are fast enough to meet the timing constraint). As mentioned, there is usually a row that covers all columns, so I usually pick that.


Figure 4.15: Propagating required times from outputs to inputs. Only the critical paths (dotted lines) impose constraints.

Otherwise, I continue to add rows until all columns are covered. Figure 4.14 is typical: the second row covers all columns so I select it. Selecting a cell in this way is a greedy algorithm that does not consider any later ramifications of each choice. I have not investigated more clever heuristics for this problem, although I imagine some developed for performance-oriented technology mapping would be applicable.
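A sketch of this greedy selection (my transcription; the family areas and table entries below are invented for the example): sort the rows of the feasibility table by area and keep adding rows until every required-time column is covered.

    def select_families(rows, n_cols):
        """rows: (area, covered_columns) per cell family, as in the feasibility
        table of Figure 4.14; greedily add cheap families until all of the
        required-time columns 0..n_cols-1 are covered."""
        needed = set(range(n_cols))
        chosen, covered = [], set()
        for area, cols in sorted(rows):        # cheapest families first
            if covered >= needed:
                break
            if set(cols) - covered:            # family contributes a new column
                chosen.append((area, cols))
                covered |= set(cols)
        return chosen

    # columns: required times (9) and (4, 4, 8); areas are invented
    rows = [(1, []),       # Unchanged family: meets neither deadline
            (3, [0, 1]),   # Shannon/Start family: meets both
            (3, [0, 1])]   # Stop/Extend family (hypothetical row)
    print(select_families(rows, 2))            # [(3, [0, 1])]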

4.3.3 Propagating required times

Once I have selected a set of families that cover all of the required time constraints (i.e., solved the feasibility problem), I construct all cells in these families (sharing logic within each family) and connect them to the appropriate fanouts. In Figure 4.14, I choose the second row and build both cells in that family. Figure 4.15 shows this: a single pair of h nodes is built, producing both an s1-encoded signal (that with required time (4, 4, 8)) and an s0 signal (required time (9)). At this point, I now have a circuit that includes cells implementing the node I am considering. Because I built them earlier, I know the required times at the inputs to each of the fanouts, so I work backward from these times to calculate the required times at the inputs to my newly created cells. This is simple static timing analysis on a real circuit fragment (i.e., I now only have wires and gates, not cleverly encoded signals).

Figure 4.15 illustrates how I propagate required times at the outputs of a cell back to the inputs. In this case it is easy because I do not need to consider encoded signals specially. This is standard static timing analysis. For example, the (4, 4, 8) constraint on one of the outputs means that the inputs to each copy of h must be available at time 2 (h has a delay of 2). Note that in general required times may differ from arrival times because of slack in the circuit. I do not see that in this example since it is small and h is on the critical path, but it does happen quite frequently in larger examples. Again, I take advantage of this to reduce area. I may now have several required times for each input of a family of cells, but the encoding must be the same for a particular input because they are all in the same family. In this case, I simply compute the overall required time of each input by taking the pairwise minimum and place it on the corresponding fanin. If two or more cell families are built, several required times with different encodings will be placed on the fanins. In this case, the fanin nodes will see the current node as two or more distinct fanouts instead of one. Such complex situations rarely occur in practice.
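At a single gate, this backward pass is one line of ordinary static timing analysis; a tiny sketch of mine for the Figure 4.15 situation:

    def input_required_time(pin_required_times, d):
        """Required time at a gate's inputs: the earliest deadline among the
        wires it drives, minus the gate delay (standard backward STA)."""
        return min(pin_required_times) - d

    # node h of Figure 4.15: delay 2, duplicated-h outputs due at time 4
    print(input_required_time([4, 4], 2))   # -> 2, as computed in the text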

4.4 Experiments

I implemented my algorithm in C++ using the sis libraries [SSM+92] to handle blif files. My testing platform is a 2.5 GHz, 512 MB Pentium 4 running Fedora Core 3 Linux.

4.4.1 Combinational circuits

While my algorithm is most useful on sequential circuits, it can also be applied to purely combinational circuits. However, classical combinational performance optimization techniques such as the speedup function in sis outperform my technique because they consider more possible transforms than combinations of Shannon decompositions.

[Plot: area (2-input gates) versus delay (levels of logic) for the generated adder variants.]

Figure 4.16: The performance/area tradeoff obtained by my algorithm on a 128-bit ripple carry adder

In particular, the best ones perform Boolean manipulations on the nodes' internal functions. My algorithm treats nodes as black boxes, which both greatly reduces the space of optimizations I consider and greatly speeds my algorithm. My algorithm does perform well on certain combinational circuits. Figure 4.16 shows how it is able to trade area for reduced delay with a 128-bit ripple-carry adder. For this example, I varied the requested c and generated a wide spectrum of adders, ranging from the original ripple-carry at the lower right to a full carry-select adder at the upper left. My algorithm makes evaluating such a tradeoff practical: it only took 22 s to generate all 122 variants.

4.4.2 Sequential benchmarks

I ran my algorithm on mid- and large-sized iscas89 sequential benchmarks and targeted an fpga-like, three-input lookup-table architecture. Hence, I report delay as levels of logic and area as the number of lookup tables.

Table 4.1: Experiments on ISCAS89 sequential benchmarks

Circuit     Reference       Retimed         Ours            Time    Speed   Area
            period  area    period  area    period  area    (s)     up      penalty
s382        7       103     7       103     5       115             40%     11%
s386        7       85      7       85      7       85
s400        8       103     7       103     6       111             16%     7%
s420        8       70      8       70      7       71              14%     1%
s444        8       101     7       101     6       106             16%     5%
s510        8       184     8       184     8       184     0.5
s526        5       113     5       113     5       113
s641        11      115     11      115     9       122     1.1     22%     6%
s713        11      118     11      118     10      121     0.9     10%     3%
s820        7       206     7       206     7       206     0.5
s832        7       217     7       217     7       217
s838        10      154     10      154     8       162     2.6     25%     5%
s1196       9       365     9       365     9       365     0.6
s1238       9       338     9       338     9       338
s1423       24      408     21      408     13      460     3.8     61%     12%
s1488       6       453     6       453     6       453     0.7
s1494       6       456     6       456     6       456     0.8
s5378†      8       819     7       819     7       819     1.4
s9234       11      662     8       656     8       684     6.7
s13207      14      1382    11      1356    9       1416    18.0    22%     4%
s15850†     20      2419    14      2413    10      2530    4.7     40%     5%
s35932†     4       3890    4       3890    3       4978    3.9     33%     28%
s38417      14      7706    14      7652    13      7871    113     7%      3%
s38584†     14      7254    13      7220    11      7296    56      18%     1%

† These circuits could not be processed by sis's full_simplify algorithm—part of script.rugged—within 8 hours. I used simplify instead; therefore the starting points are poorer compared to the other samples.

Following Saldanha et al. [SHM+94], I run script.rugged and perform a speed-oriented decomposition (decomp -g; eliminate -1; sweep; speed_up -i) on each sample. I then reduce the network's depth while keeping its nodes three-feasible with reduce_depth -f 3 [Tou91]. I report the results of this classical fpga delay-oriented flow in the Reference column of Table 4.1. Starting from these optimized circuits, I compare running retiming alone (retime -n -i, modified to use the unit delay model) with running my algorithm followed by retiming. The Retimed and Ours columns list the period and area results. My running time, listed in the Time column, includes finding the minimum period by binary search, so it actually covers multiple runs of my algorithm. I verified the sequential equivalence of the input and output of my algorithm using vis [BHSV+96]; my reported times do not include this since the verification is not a part of the algorithm. Although my algorithm does nothing to half of the examples, it provides a significant speed-up for the other half at the expense of an average 5% area increase. The algorithm is very fast, especially when no improvement can be made. Its runtime appears linear in the circuit size. Its memory requirements are low, e.g., 70 MB for the largest example, s38417. My technique therefore appears to scale well. That my algorithm is not able to improve certain circuits is not surprising. This algorithm is fairly specialized (compared to, say, retiming) and only attacks time-critical feedback loops that have small cuts (the loops can be broken by cutting only a few wires). Circuits on which I show no improvement may have wide feedback loops (e.g., the pla-style next-state logic of a minimum-length encoded fsm, or a multi-bit “arithmetic” feedback), or may be completely dominated by feed-forward logic (e.g., simple pipelined data-paths).

4.4.3 Packet processing pipelines

I also ran my algorithm on small packet processing pipelines that I synthesized using the method described in Chapter 2. These pipelines contain several packet editors and the interconnecting fifos. I did not include any auxiliary elements because they usually contain specialized components, such as internal memory blocks or ddr controllers accessing off-chip memory, and optimizing them is outside the scope of the presented algorithm. I simply consider the signals to and from these elements as primary inputs and outputs. I did, however, instantiate the fifos between the packet editors and these auxiliary elements. I only use two-position fifos because it is realistic to implement them using only gates and flip-flops. In a real fpga design, any larger fifo would use a specialized (hand-optimized) memory element that my algorithm would not be able to optimize. I first synthesize the benchmarks using the Xilinx xst tool. I then map them and extract a structural Verilog model using the usual tools from the Xilinx ise flow (e.g., ngdbuild, map, netgen). I ran the synthesis and mapping steps with an aggressive optimization setting that included retiming; most optimizations attempt to minimize the clock period. Next, I transform the Verilog netlist into a blif network using a custom script and decompose it into two-input gates using the speed_up -i command from the sis package. This performs a decomposition optimized for time. I took the packet editors from real packet processing pipelines. The first pipelines (1–8) are trivial: they consist of only one module. The other examples (9–12) are pipelines with two modules. I have also run experiments on longer pipelines, but the results are not interesting because the performance is usually limited by the most complex module in the pipeline. This trend becomes visible for the last examples. These experiments are somewhat unrealistic because they do not consider higher-level primitives such as adders. In my flow, I reduce such primitives to small gates; a commercial flow would preserve such primitives and map them to dedicated resources in the fpga.

Table 4.2: Experiments on packet processing pipelines

                      Reference       Retimed   Ours            Speed   Area
                      period  area    period    period  area    up      penalty
 1  AddressLearn      5       3125    5         4       3160    25%     1%
 2  AddressLookup     7       2851    6         4       2864    50%     0%
 3  UpstrVLANProc     18      4998    10        7       5070    42%     1%
 4  DwstrVLANProc     20      4655    7         7       4877
 5  PortIDEdit        5       2738    5         5       2780
 6  Drop              7       4870    5         4       4874    25%     0%
 7  VLANedit          12      6896    12        10      6925    20%     0%
 8  VLANfilter        7       2618    5         4       2669    25%     2%
    modules 1 and 2   6       4703    5         4       4995    25%     6%
    modules 3 and 4   21      8599    10        7       8719    42%     1%
    modules 5 and 6   7       6262    5         4       6281    25%     0%
    modules 7 and 8   12      8447    12        10      8512    20%     1%

Adapting my algorithm to handle such “black-box” primitives would be fussy, but not technically difficult. Thus, the initial timing estimates may not accurately reflect the actual timing after decomposition. Nevertheless, my results in Table 4.2 show the same trend I observed for the iscas89 benchmarks (Table 4.1), showing that both my proposed algorithm and the benchmarking setup are consistent.

4.5 Conclusions

I presented an algorithm that systematically explores combinations of Shannon decompositions while taking into account a later retiming step. The result is a procedure for resynthesizing and retiming a circuit under a timing constraint that can produce faster circuits than Shannon decomposition and retiming run in isolation. My decompositions are a form of speculation that duplicates logic in general, but I deliberately restrict each node to be duplicated no more than once, bounding the area increase and also simplifying the optimization procedure. My algorithm runs in three phases: it first attempts to find a collection of feasible arrival times that suggests a circuit exists with the requested clock period. If successful, it then resynthesizes the circuit according to these arrival times and heuristically limits duplication of nodes off the critical path to reduce the area penalty. Finally, the resynthesized circuit is retimed to produce a circuit that meets the initial timing constraint (a minimum clock period). Experimental results show my algorithm can significantly improve the speed of certain circuits with only a slight increase in area. Its running times are small, suggesting it can scale well to large circuits.

Chapter 5

Final Conclusions

5.1 Claim

I present an integrated, coherent methodology for designing packet processing pipelines. It considers the entire design flow, starting from high-level specifications, followed by automatic synthesis, worst-case performance analysis, and some rtl-specific optimization. I had to address two principal issues, which naturally stand in contradiction. On one hand, one desires abstract specifications, short development time, and design maintainability. On the other hand, high-speed network devices require very efficient implementations, which are hard to generate by a generic high-level synthesis flow. I consider my proposed approach not a compromise between these two demands but a way to address both with minimal drawbacks. I claim that the work in this thesis can dramatically improve the design process of these pipelines, while the resulting performance matches the expectations of a hand-crafted design. My claim is supported by experiment. Two commercial packet processing pipelines, i.e., the ingress and egress pipelines of an adsl-like network switch, were completely synthesized using my proposed technique. Their performance is the same as one expects from a manually designed rtl circuit, while the productivity is greatly improved over the classic rtl design process.

For a detailed summary of the thesis, I refer the reader to Section 1.8. The specific conclusions of the three main parts, i.e., specification and synthesis, worst-case performance analysis and fifo sizing, and rtl sequential optimization, can be found in Section 2.8, Section 3.9, and Section 4.5, respectively. The rest of this chapter describes some general conclusions. I consider many of them relevant outside the scope of this thesis. Finally, I address some open issues and speculate about ways of surmounting them.

5.2 PEG — the human factor

I proposed a new Domain Specific Language, peg, to serve as the entry point of my design flow (Section 2.4). Peg provides a high-level, abstract way to specify a packet editing module. This shortens design time and makes maintenance much easier. I have argued in Section 2.4 that the challenge in finding a good specification language is twofold: it has to offer an abstract view, familiar to the typical designer, and it has to allow an efficient automatic synthesis procedure. I have also mentioned that generic techniques address the first issue but not the second. Experimental results suggest that peg addresses the second problem very well. However, specifying a packet editing module in peg proved to be hard, even if peg offers the right abstraction model for this class of designs: I was the only one able to write a complex peg specification. Therefore, I reconsidered peg's place within the design flow. Peg became an intermediate format: the designer specifies the modules in a textual language that I trivially translate into peg. Once I implemented this translation, everything went as smoothly as expected in the first place, and I was able to report the first solid experimental results of my synthesis flow.

This language is developed by a different team. Its syntax borrows many constructs from C, vhdl, functional languages, etc. It was not exactly my choice, especially because in many cases these constructs have different semantics. Moreover, there are a lot of counter-intuitive restrictions, such as limiting the number of conditional inserts to one (truly, more may be present but at most one executed). The reason for these peculiarities is the computational model used by its authors, which is different from peg. I intended to develop my own textual language, to expose the user to the full power of my peg computational model, but I had the good inspiration to abandon that before I began. The reason is twofold. First, several designs had already been written in the “old” language, and the developers in my team simply refused to write the same thing twice. Additionally, they were reluctant to invest time in an immature technology, since at that point my synthesis algorithm was in its early development stages. Second, describing the computation as a graph (i.e., an unordered set of rules defining the operations and the data dependency) looked awkward compared to the familiar C-like constructs. I conclude that the right abstraction and good performance are not sufficient. The human factor has a decisive word to say about the success or failure of a Domain Specific Language. Yes, I knew that from the beginning. Still, I had to learn it again.

5.3 Synthesis — taking advantage of others

I presented an efficient algorithm for synthesizing the above abstract specification into vhdl (Section 2.5). This vhdl code has to be further synthesized and optimized by traditional eda techniques. Practically, this is done using a state-of-the-art commercial tool such as Synplify-Pro. The high-level synthesis tool cannot, and is not supposed to, perform low-level optimizations such as Boolean logic minimization. This is the task of the above-mentioned tool. Moreover, the tool does it much better than an ad hoc technique coded inside the main synthesis algorithm. However, there are optimizations that take full advantage of the structure of the high-level specification. They cannot be done by the low-level tool. Truly, the key to obtaining an efficient result consists of applying these optimizations as aggressively as possible. This is the first, natural challenge. The second challenge is hidden: keeping the high-level synthesis algorithm as simple as possible, or equivalently, taking full advantage of the low-level synthesis tool. Identifying which optimization belongs to each step is not a simple task. I identify two reasons. First, some optimizations are wrongly included in the high-level step because they are considered necessary for triggering other high-level optimizations. I give a real example: constant propagation for arithmetic operations. Indeed, the result of adding two constants has arrival time zero, regardless of the adder delay: the scheduler can take advantage of this property. However, implementing constant propagation for all supported peg (i.e., vhdl) operators is tedious and error-prone. The correct solution is to flag the result as constant, without computing its value, and propagate these flags. Now the scheduler has the desired information, while the real constant propagation is deferred to the low-level synthesis tool. Second, one may not know what optimizations are supported by the low-level synthesis tool. More commonly, one has a vague idea but not a clear image of their limitations and performance. The solution is to invest time in studying the tool: I systematically synthesized my examples with Synplify-Pro and inspected not only the synthesis reports but also the schematics of the mapped design. Following the critical paths in a heavily optimized, retimed design requires a lot of perseverance. But it paid off. The drawback: it is difficult to answer questions like “your results are very good, couldn't you improve them by adding at least some trivial optimizations?”

5.4 Analysis — choosing what to measure

I dedicated the entirety of Chapter 3 to computing the worst-case performance of packet processing pipelines. The sequential behavior of such systems is complex, and their analysis is equally hard. The analysis algorithm is only the tip of the iceberg. The difficult part is to define exactly what to measure and to make sure there is a practical correspondence between the reported numbers and what one expects as a result. For example, my whole analysis focuses on computing the worst-case throughput. What is throughput, and what is worst-case? If some pipeline modules insert or remove data, the amount of data entering and exiting the pipeline differs. Hence I define two throughputs, corresponding to the input and output rates. The step from this intuition to the mathematical definitions in Section 3.3.1 exhibits a twofold challenge: it is not clear in the first place that these limits exist (I proved it later), and, next, do these limits define exactly what one expects? (I argue they do: they are relevant for ingress and egress pipelines, respectively.) Defining worst-case proves more difficult. Apparently, the term defines itself: the worst-case throughput considers any possible packet stream (this has a rigorous formulation, see Section 3.6). I was very happy with this definition since I gave a strong argument against analyzing streams of random packets. However, this definition is not exactly what the design engineer expects. I treat this open issue in Section 5.6.2.

5.5 Logic synthesis — beyond BLIF

I presented in Chapter 4 a low-level rtl sequential optimization technique, which improves the performance of cycle-dominated sequential networks such as those automatically generated from peg specifications. I started the work in the sis environment. This is a very practical academic approach since it provides all the basic eda algorithms and is open source. The basic file format is blif, which is very close to sis's internal representation: the only primitives are (simple) Boolean logic nodes and registers that have no explicit reset signal and are assumed by most algorithms to share the same clock signal, etc. Truly, most academic eda algorithms can be practically implemented and verified within this framework. The problem appears when the algorithm has to be tested on real designs; that was the case with my packet processing pipelines, which were synthesized by a commercial fpga-oriented flow. Converting such a design to blif is a task one would like to avoid in the first place. The process itself is a collection of hacks, such as decomposing the fpga primitives into registers and Boolean functions, verifying that all registers belong to the same clock domain, replacing the registers' reset nets with initial register values, etc. Consequently, I tested my algorithm on the iscas89 benchmarks during the development stage and only later used it on real packet processing pipelines. The only viable alternative to sis/blif I am aware of is OpenAccess. I have avoided it since it has the reputation of a very complicated programming environment, which requires a lot of time to become familiar with. If I started the same project now, I would definitely explore this alternative.

5.6 Open issues

5.6.1 Synthesis — Design space exploration

I have mentioned that design space exploration offers promising potential for improving the quality of the design in terms of area, performance, or both. I identify two directions to explore.

First, there is no unique way to split the overall computation of a packet processing pipeline into stages. I have assumed this split is the responsibility of the pipeline architect, but the process can be automated. This dimension is explored, to my knowledge, by researchers working on a commercial product.

Second, I have claimed in Section 2.5.9 that my synthesis method, based on a trivial schedule with no resource sharing, gives a very reasonable performance trade-off. However, this holds for a module taken in isolation, i.e., one that has an ideal source and sink. A different cost function has to be used if a module is not throughput-critical, i.e., when area becomes more important than performance. To address this problem, one may adapt existing scheduling algorithms so that they also consider area.
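As a toy illustration of this direction, entirely hypothetical and with made-up cost weights, a list scheduler could pick the number of shared functional units that minimizes a combined area/latency cost; with enough units it degenerates to the no-sharing schedule used in this thesis.

    # Hypothetical sketch: trade latency for area by limiting the number of
    # shared functional units. Not the thesis implementation; the weights
    # alpha (latency) and beta (unit area) are invented.

    def schedule_length(deps, num_units):
        """Greedy list scheduling with unit-delay operations.

        deps maps each operation to the set of operations it depends on;
        the dependence graph is assumed acyclic. Returns the cycle count.
        """
        remaining = set(deps)
        finished = set()
        cycles = 0
        while remaining:
            ready = [op for op in remaining if deps[op] <= finished]
            issued = set(ready[:num_units])   # one op per unit per cycle
            finished |= issued
            remaining -= issued
            cycles += 1
        return cycles

    def best_sharing(deps, max_units, alpha=1.0, beta=4.0):
        """Return (units, cost) minimizing alpha*latency + beta*units."""
        return min(((u, alpha * schedule_length(deps, u) + beta * u)
                    for u in range(1, max_units + 1)),
                   key=lambda pair: pair[1])

    # A diamond of four unit-delay additions: sharing one adder costs
    # 4 cycles but only one unit, which wins under these weights.
    deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
    print(best_sharing(deps, max_units=2))    # -> (1, 8.0)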

5.6.2 Analysis — Performance under-estimation

I have mentioned that the worst-case throughput computed by the proposed algorithm is an under-estimation. This has two causes.

First, at the module level, each fsm is analyzed abstracting away its data-path (Section 3.3.4). While this abstraction gives good results in most cases, one can easily imagine scenarios in which the approximation differs significantly from the real worst case. The problem becomes even more complicated when several connected modules are analyzed together.

Second, some extreme cases considered by the worst-case analysis never happen or are irrelevant in practice. For example, pppoe discovery packets are very rarely found in a real flow: if a sequence of pppoe discovery packets happens to excite the worst-case pipeline behavior, then the analysis result is irrelevant.

Therefore, the computed performance guarantee is pessimistic. The natural solution is to classify the packets into several classes and to use this information in the analysis process. This is a compromise between the present approach, i.e., no data-path at all, and the ideal approach, in which the full data-path is considered.
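One possible shape for this refinement, reduced to a deliberately crude model: label the transitions of the abstract fsm with packet classes and compute the worst-case words-per-cycle ratio only over cycles whose classes can actually occur. The fsm, the labels, and the exhaustive cycle enumeration below are invented for illustration; the real analysis of Chapter 3 is considerably more involved.

    # Hypothetical sketch: worst-case throughput as the minimum ratio
    # (words processed)/(cycles spent) over simple cycles of an abstract
    # fsm, restricted to transitions whose packet class is allowed.

    def worst_case_throughput(transitions, allowed_classes):
        """transitions: list of (src, dst, pkt_class, words, cycles)."""
        edges = [t for t in transitions if t[2] in allowed_classes]
        best = None

        def dfs(start, node, visited, words, cycles):
            nonlocal best
            for src, dst, _, w, c in edges:
                if src != node:
                    continue
                if dst == start:                    # closed a simple cycle
                    ratio = (words + w) / (cycles + c)
                    best = ratio if best is None else min(best, ratio)
                elif dst not in visited:
                    dfs(start, dst, visited | {dst}, words + w, cycles + c)

        for n in {t[0] for t in edges}:
            dfs(n, n, {n}, 0, 0)
        return best

    # Two-state module: ip traffic sustains 1 word/cycle, but a stream of
    # pppoe discovery packets drags the guarantee down to 0.25.
    fsm = [("idle", "hdr",  "ip",              2, 2),
           ("hdr",  "idle", "ip",              2, 2),
           ("idle", "disc", "pppoe_discovery", 1, 4),
           ("disc", "idle", "pppoe_discovery", 1, 4)]
    print(worst_case_throughput(fsm, {"ip", "pppoe_discovery"}))  # 0.25
    print(worst_case_throughput(fsm, {"ip"}))                     # 1.0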

5.6.3 Low-level optimization — Experiments

I tested the Shannon-retiming algorithm both on the standard iscas89 benchmarks and on sequential networks generated by my pipeline synthesis technique, and I found the results to be consistent (see the end of Chapter 4).

However, a careful performance evaluation would require the full synthesis of the entire design, following the observation that placement and routing may dramatically influence the design metrics, both in terms of area and performance. Moreover, for exact results, the remaining components (i.e., those outside the packet processing pipelines) must also be instantiated and synthesized, because they can interact with the pipeline modules either directly (by wires) or indirectly (by influencing the placement and routing of the entire design). This would entail integrating my algorithm into a commercial synthesis flow, a task I consider more suitable for an industrial enterprise than for academic research.
