Network Processors: Building Block for Programmable Networks
Raj Yavatkar
Chief Software Architect
Intel® Internet Exchange Architecture
[email protected]

Outline
- IXP 2xxx hardware architecture
- IXA software architecture
- Usage questions
- Research questions

IXP Network Processors
- Microengines
  - RISC processors optimized for packet processing
  - Hardware support for multi-threading
- Embedded StrongARM/XScale
  - Runs embedded OS and handles exception tasks
[Figure: block diagram with a control processor (StrongARM), a media/fabric interface, microengines ME 1, ME 2, ... ME n, SRAM, and DRAM]

IXP: A Building Block for Network Systems
- Example: IXP2800
  - 16 microengines + XScale core
  - Up to 1.4 GHz ME speed
  - 8 HW threads/ME
  - 4K control store per ME
  - Multi-level memory hierarchy
  - Multiple inter-processor communication channels
- NPU vs. GPU tradeoffs
  - Reduce core complexity
  - No hardware caching
  - Simpler instructions -> shallow pipelines
  - Multiple cores with HW multi-threading per chip
[Figure: IXP2800 block diagram with a multi-threaded (x8) microengine array (MEv2 1 through 16), Intel® XScale™ core, RDRAM controller, QDR SRAM controller, media switch fabric I/F, PCI, scratch memory, hash unit, per-engine memory, CAM, signals, and interconnect]

IXP 2400 Block Diagram
[Figure only]

XScale Core processor
- Compliant with the ARM V5TE architecture
  - Support for ARM's Thumb instructions
  - Support for Digital Signal Processing (DSP) enhancements to the instruction set
  - Intel's improvements to the internal pipeline to improve the memory-latency hiding abilities of the core
  - Does not implement the floating-point instructions of the ARM V5 instruction set

Microengines: RISC processors
- IXP2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
  - Arithmetic and logical operations that operate at bit, byte, and long-word levels
  - Can be combined with shift and rotate operations in a single instruction
  - Integer multiplication provided; no division or FP operations
- 40-bit x 4K control store
- Six-stage instruction pipeline
  - On average, an instruction takes one cycle to execute
- Each ME has eight hardware-assisted threads of execution
  - Can be configured to use either all eight threads or only four threads
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order

MicroEngine v2
[Figure: MEv2 block diagram. Labels: inputs from the next neighbor, D-push bus, and S-push bus; local memory (640 words); two banks of 128 GPRs; 128 next-neighbor registers; 128 D and 128 S transfer-in registers; control store (4K instructions); LM Addr 0 and LM Addr 1 (2 per CTX); A_Operand and B_Operand; 32-bit execution data path (multiply, find first bit, add, shift, logical); CAM with tags 0-15, status, lock, and LRU logic; CRC unit and CRC remainder; pseudo-random number; local CSRs; timers; timestamp; ALU_Out; output to next neighbor; 128 D and 128 S transfer-out registers to the D-pull and S-pull buses]

Why multi-threading?
[Figure only]

Packet processing using multi-threading within a MicroEngine
[Figure only]

Registers available to each ME
- Four different types of registers
  - General purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
  - Also, access to many CSRs
- 256 32-bit GPRs
  - Can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers
  - Used to read/write to all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers
  - Divided equally into read-only and write-only
  - Used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs
  - ME can continue processing with GPRs while other functional units read and write the transfer registers

Next-Neighbor Registers
- Each ME has 128 32-bit next-neighbor registers
  - Data written to these registers is available in the next (numerically) microengine
  - E.g., if ME 0 writes data into a next-neighbor register, ME 1 can read the data from its next-neighbor register, and so on
- In another mode, these registers are used as extra GPRs
  - Data written into a next-neighbor register is read back by the same microengine

Generalized thread signaling
- Each ME thread has 15 numbered signals
- Most accesses to functional units outside of the ME can raise a signal on any one signal number
- The signal number generated for any functional unit access is under the programmer's control
- An ME thread can test for the presence or absence of any of these signals
  - Used to control branching on the signal presence
  - Or, to specify to the thread arbiter that an ME thread is ready to run only after the signal is received
- Benefit of the approach
  - Software can have multiple outstanding references to the same unit and wait for all of them to complete using different signals

Different Types of Memory

Type of memory   Logical width (bytes)   Size (bytes)   Approx. unloaded latency (cycles)   Special notes
Local to ME      4                       2560           3                                   Indexed addressing, post incr/decr
On-chip scratch  4                       16K            60                                  Atomic ops; 16 rings w/ atomic get/put
SRAM             4                       256M           150                                 Atomic ops; 64-elem q-array
DRAM             8                       2G             300                                 Direct path to/from MSF

IXP2800 Features
- Half-duplex OC-192 / 10 Gb/sec Ethernet network processor
- XScale core
  - 700 MHz (half the ME speed)
  - 32 Kbytes instruction cache / 32 Kbytes data cache
- Media / switch fabric interface
  - 2 x 16-bit LVDS transmit and receive
  - Configured as CSIX-L2 or SPI-4
- PCI interface
  - 64-bit / 66 MHz interface for control
  - 3 DMA channels
- QDR interface (w/ parity)
  - (4) 36-bit SRAM channels (QDR or co-processor)
  - Network Processor Forum LookAside-1 standard interface
  - Using a "clamshell" topology, both memory and co-processor can be instantiated on the same channel
- RDR interface
  - (3) independent Direct Rambus DRAM interfaces
  - Supports 4i banks or 16 interleaved banks
  - Supports 16/32-byte bursts

Hardware Features to ease packet processing
- Ring buffers
  - For inter-block communication/synchronization
  - Producer-consumer paradigm
- Next-neighbor registers and signaling
  - Allow single-cycle transfer of context to the next logical microengine to dramatically improve performance
  - Simple, easy transfer of state
- Distributed data caching within each microengine
  - Allows all threads to keep processing even when multiple threads are accessing the same data

IXA Portability Framework - Goals
- Accelerate software development for the IXP family of network processors
- Provide a simple and consistent infrastructure to write networking applications
- Enable reuse of code across applications written to the framework
- Improve portability of code across the IXP family
- Provide an infrastructure for third parties to supply code
  - For example, to support TCAMs
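The producer-consumer ring buffers listed under the hardware features can be illustrated with a minimal host-side sketch in plain C. This is not microengine code and not the IXP scratch-ring interface: the names `struct ring`, `ring_put`, `ring_get`, and the slot count `RING_SLOTS` are hypothetical, and the sketch is single-threaded, so it does not model the atomic get/put that the hardware rings provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot count is an assumption for illustration; it must be a power of two
 * so that the modulo indexing stays correct when the counters wrap. */
#define RING_SLOTS 8u

/* Hypothetical ring of packet handles (not the IXP scratch-ring API). */
struct ring {
    uint32_t slot[RING_SLOTS];
    uint32_t head; /* running count of handles consumed */
    uint32_t tail; /* running count of handles produced */
};

/* Producer side: returns 1 on success, 0 if the ring is full. */
static int ring_put(struct ring *r, uint32_t handle)
{
    assert(r != NULL);
    if (r->tail - r->head == RING_SLOTS)
        return 0;
    r->slot[r->tail % RING_SLOTS] = handle;
    r->tail++;
    return 1;
}

/* Consumer side: returns 1 and stores the oldest handle in *handle,
 * or returns 0 if the ring is empty. */
static int ring_get(struct ring *r, uint32_t *handle)
{
    assert(r != NULL && handle != NULL);
    if (r->tail == r->head)
        return 0;
    *handle = r->slot[r->head % RING_SLOTS];
    r->head++;
    return 1;
}
```

In this sketch a producing block would call `ring_put` with a packet handle and a consuming block would call `ring_get` until it returns 0; handles come out in the order they were put in.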
IXA Software Framework
[Figure: software framework layers. External processors: control plane protocol stacks, Control Plane PDK. XScale™ core (C/C++ language): core components, core component library, resource manager library. Microengine pipeline (Microengine C language): microblocks, microblock library, protocol library, utility library, hardware abstraction library]

Software Framework on the MEv2
- Microengine C compiler (language)
- Optimized data plane libraries
  - Microcode and MicroC library for commonly used functions
- Microblock programming model
  - Enables development of modular code building blocks
  - Defines the data flow model, common data structures, state sharing between code blocks, etc.
  - Ensures consistency and improves reuse across different apps
- Core component library
  - Provides a common way of writing slow-path components that interact with their counterpart fast-path code
- Microblocks and example applications written to the microblock programming model
  - IPv4/IPv6 forwarding, MPLS, DiffServ, etc.

Micro-engine C Compiler
- C language constructs
  - Basic types, pointers, bit fields
- In-line assembly code support
- Aggregates
  - Structs, unions, arrays
- Intrinsics for specialized ME functions
- Different memory models and special constructs for data placement (e.g., __declspec(sdram) struct msg_hdr hd)

What is a Microblock?
- Data plane packet processing on the microengines is divided into logical functions called microblocks
- Coarse-grained and stateful
- Examples
  - 5-tuple classification
  - IPv4 forwarding
  - NAT
- Several microblocks running on a microengine thread can be combined into a microblock group
  - A microblock group has a dispatch loop that defines the dataflow for packets between microblocks
  - A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component

Core Components and Microblocks
[Figure: XScale™ core with three core components, the core component library, and the resource manager library; microengines with the microblock library and four microblocks. Legend: Intel/3rd party blocks, user-written code, core libraries]

Simplified Packet Flow (IPv6 example)
[Figure: DRAM holds packet buffers (Ethernet header, IPv6 header, payload); SRAM holds buffer descriptors (offset, size, flags), a route table (prefix, next-hop-id; example prefix 3FFF020304), and a next-hop table (interface#, DMAC)]
Rx
  a. Put packet in DRAM
  b. Put descriptor in SRAM
  c. Queue handle on ring
Source
  d. Pull meta-data into GPRs
  e. Set DL state in GPRs
  f. Set next_blk = Classify
Classify
  g. Get headers in HCache
  h. Set HeaderType to IPv6
  i.