Network Processors: Building Block for Programmable Networks

Raj Yavatkar, Chief Architect, Intel® Internet Exchange Architecture (IXA)

[email protected]


Outline

- IXP 2xxx hardware architecture
- IXA software architecture
- Usage questions
- Research questions

IXP Network Processors

- Microengines
  – RISC processors optimized for packet processing
  – Hardware support for multi-threading
- Embedded StrongARM/XScale
  – Runs embedded OS and handles exception tasks

[Block diagram: control processor (StrongARM), media/fabric interface, microengines ME 1 … ME n, SRAM and DRAM]

IXP: A Building Block for Network Systems

- Example: IXP2800
  – 16 micro-engines + XScale core
  – Up to 1.4 GHz ME speed
  – 8 HW threads/ME
  – 4K control store per ME
  – Multi-level memory hierarchy
  – Multiple inter-processor communication channels
- NPU vs. GPU tradeoffs
  – Reduce core complexity
  – No hardware caching
  – Simpler instructions → shallow pipelines
  – Multiple cores with HW multi-threading per chip

[Block diagram: array of 16 MEv2 microengines (multi-threaded x8), Intel XScale core, RDRAM controller, QDR SRAM controller, scratch memory, hash unit, media/switch fabric interface, PCI, and per-engine memory, CAM, and signals, tied together by the chip interconnect]

IXP 2400 Block Diagram

XScale Core processor

- Compliant with the ARM V5TE architecture
  – support for ARM's Thumb instructions
  – support for Digital Signal Processing (DSP) enhancements to the instruction set
  – Intel's improvements to the internal pipeline to improve the memory-latency hiding abilities of the core
  – does not implement the floating-point instructions of the ARM V5 instruction set

Microengines – RISC processors

- IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
  – arithmetic and logical operations that operate at bit, byte, and long-word levels
  – can be combined with shift and rotate operations in single instructions
  – integer multiplication provided; no division or FP operations
- 40-bit x 4K control store
- Six-stage pipeline; an instruction takes one cycle to execute on average
- Each ME has eight hardware-assisted threads of execution
  – can be configured to use either all eight threads or only four threads
- The non-preemptive hardware arbiter swaps between threads in round-robin order (modeled below)
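To make the arbitration behavior concrete, here is a minimal plain-C model of a non-preemptive, round-robin thread arbiter. It only illustrates the scheduling policy described above; the thread count, the ready/signal handling, and all names are stand-ins, not microengine hardware or Intel APIs.

```c
/* Simplified software model of the ME's non-preemptive, round-robin
 * thread arbiter (illustrative only; not microengine code).  A thread
 * runs until it voluntarily swaps out, e.g. after issuing a memory
 * reference, and becomes ready again when its signal arrives. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 8

static bool ready[NUM_THREADS];

/* Pick the next ready thread after 'current', wrapping around. */
static int next_ready_thread(int current)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (current + i) % NUM_THREADS;
        if (ready[t])
            return t;
    }
    return -1;                       /* every thread is waiting on a signal */
}

int main(void)
{
    for (int t = 0; t < NUM_THREADS; t++)
        ready[t] = true;

    int current = 0;
    for (int step = 0; step < 16; step++) {
        printf("running thread %d\n", current);
        /* ...thread body runs here until it issues a memory reference... */
        ready[current] = false;                       /* swap out, wait for signal */
        ready[(current + 3) % NUM_THREADS] = true;    /* pretend a signal arrives  */
        int next = next_ready_thread(current);
        if (next < 0)
            break;                                    /* ME would stall until a signal */
        current = next;
    }
    return 0;
}
```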

MicroEngine v2

[MEv2 datapath diagram: 4K control store; 640 words of local memory; two banks of 128 GPRs; 128 next-neighbor registers; 128 D and 128 S transfer-in registers fed by the D-push and S-push buses; 128 D and 128 S transfer-out registers drained by the D-pull and S-pull buses; a 32-bit execution datapath with add/shift/logical, multiply, CRC, find-first-bit, and pseudo-random-number logic; a 16-entry CAM with status tags and LRU; local CSRs, timers, timestamp, and per-context local-memory address registers.]

Why multi-threading?

Packet processing using multi-threading within a MicroEngine

Registers available to each ME

- Four different types of registers
  – general purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
  – also, access to many CSRs
- 256 32-bit GPRs
  – can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers
  – used to read/write to all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers
  – divided equally into read-only and write-only
  – used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs
  – ME can continue processing with GPRs while other functional units read and write the transfer registers (see the sketch below)
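The benefit in the last bullet is essentially overlapping memory access with computation. The plain-C sketch below models that overlap under stated assumptions: sram[], xfer_in[], gpr[], and issue_sram_read() are hypothetical stand-ins for external SRAM, the transfer registers, the GPRs, and the hardware's asynchronous read; none of them are real APIs.

```c
/* Illustrative model (not ME code) of why separate transfer registers help:
 * the thread issues an asynchronous SRAM read into a "transfer register",
 * keeps computing in "GPRs", and only later waits for the completion signal. */
#include <stdint.h>
#include <stdio.h>

static uint32_t sram[1024];          /* pretend external SRAM              */
static uint32_t xfer_in[8];          /* model of SRAM read transfer regs   */
static uint32_t gpr[16];             /* model of general purpose registers */

/* Hypothetical async read: real hardware would fill the transfer registers
 * in the background and raise a thread signal when done. */
static void issue_sram_read(uint32_t addr, int count)
{
    for (int i = 0; i < count; i++)
        xfer_in[i] = sram[addr + i];
}

int main(void)
{
    sram[100] = 0xdeadbeef;

    issue_sram_read(100, 4);         /* start the read (would be async)     */

    /* Meanwhile, keep working in GPRs: the read does not tie them up. */
    gpr[0] = 7;
    gpr[1] = gpr[0] * 3 + 1;

    /* On real hardware the thread would now wait for the read's signal;
     * in this model the data is already in the transfer registers. */
    gpr[2] = xfer_in[0];             /* consume the returned data           */
    printf("gpr[1]=%u gpr[2]=0x%x\n", gpr[1], gpr[2]);
    return 0;
}
```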

Next-Neighbor Registers

- Each ME has 128 32-bit next-neighbor registers
  – makes data written in these registers available in the next microengine (numerically)
  – e.g., if ME 0 writes data into a next-neighbor register, ME 1 can read the data from its next-neighbor register, and so on
- In another mode, these registers are used as extra GPRs
  – data written into a next-neighbor register is read back by the same microengine

Generalized thread signaling

- Each ME thread has 15 numbered signals
- Most accesses to functional units outside of the ME can generate a signal on any one signal number
- The signal number generated for any functional unit access is under the programmer's control
- An ME thread can test for the presence or absence of any of these signals
  – used to control branching on signal presence
  – or, to specify to the thread arbiter that an ME thread is ready to run only after the signal is received
- Benefit of the approach
  – software can have multiple outstanding references to the same unit and wait for all of them to complete using different signals (illustrated below)
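A hedged way to picture the "multiple outstanding references, wait for all" pattern is a bitmask with one bit per signal number. Everything below (the chosen signal numbers, deliver_signal(), signals_present()) is an illustrative model, not the microengine's actual signal CSRs or any Intel API.

```c
/* Model of generalized thread signaling: the 15 numbered signals are bits
 * in a mask.  Software issues two references tagged with different signal
 * numbers and proceeds only once both signals are present. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint16_t signal_mask;                     /* bit n set => signal n pending */

static void deliver_signal(int n)          { signal_mask |= (uint16_t)(1u << n); }
static bool signals_present(uint16_t want) { return (signal_mask & want) == want; }

int main(void)
{
    const int SIG_SRAM_READ = 3;                 /* signal number chosen by the programmer */
    const int SIG_DRAM_READ = 7;                 /* a different signal number              */
    uint16_t wait_for = (uint16_t)((1u << SIG_SRAM_READ) | (1u << SIG_DRAM_READ));

    /* Two outstanding references, each tagged with its own signal number. */
    deliver_signal(SIG_SRAM_READ);               /* first completion arrives  */
    printf("both done? %s\n", signals_present(wait_for) ? "yes" : "no");

    deliver_signal(SIG_DRAM_READ);               /* second completion arrives */
    printf("both done? %s\n", signals_present(wait_for) ? "yes" : "no");
    return 0;
}
```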

Different Types of Memory

Memory type       Logical width (bytes)   Size      Approx. unloaded latency (cycles)   Special notes
Local to ME       4                       2560 B    3                                   Indexed addressing, post-increment/decrement
On-chip scratch   4                       16 KB     60                                  Atomic ops; 16 rings with atomic get/put
SRAM              4                       256 MB    150                                 Atomic ops; 64-element queue array
DRAM              8                       2 GB      300                                 Direct path to/from the MSF

IXP2800 Features

- Half-duplex OC-192 / 10 Gb/s Ethernet network processor
- XScale core
  – 700 MHz (half the ME clock)
  – 32 KB instruction cache / 32 KB data cache
- Media / switch fabric interface
  – 2 x 16-bit LVDS transmit and receive
  – configured as CSIX-L2 or SPI-4
- PCI interface
  – 64-bit / 66 MHz interface for control
  – 3 DMA channels
- QDR interface (with parity)
  – four 36-bit SRAM channels (QDR or co-processor)
  – Network Processor Forum LookAside-1 standard interface
  – using a "clamshell" topology, both memory and a co-processor can be instantiated on the same channel
- RDR interface
  – three independent Direct Rambus DRAM interfaces
  – supports 4 banks or 16 interleaved banks
  – supports 16/32-byte bursts

Hardware Features to ease packet processing

- Ring buffers (see the sketch below)
  – for inter-block communication/synchronization
  – producer-consumer paradigm
- Next-neighbor registers and signaling
  – allow single-cycle transfer of context to the next logical micro-engine to dramatically improve performance
  – simple, easy transfer of state
- Distributed data caching within each micro-engine
  – allows all threads to keep processing even when multiple threads are accessing the same data
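The ring-buffer idea maps directly onto a classic single-producer/single-consumer queue. The sketch below is a plain-C analogue of that producer-consumer paradigm only; the real scratch rings are hardware-managed with atomic get/put, and none of these names come from the IXA SDK.

```c
/* Software analogue of the rings used for inter-block communication:
 * one stage puts packet handles, the next stage gets them. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 16                        /* must be a power of two */

struct ring {
    uint32_t entries[RING_SIZE];
    unsigned head;                          /* next slot to write */
    unsigned tail;                          /* next slot to read  */
};

static int ring_put(struct ring *r, uint32_t handle)
{
    if (r->head - r->tail == RING_SIZE)
        return -1;                          /* ring full  */
    r->entries[r->head % RING_SIZE] = handle;
    r->head++;
    return 0;
}

static int ring_get(struct ring *r, uint32_t *handle)
{
    if (r->head == r->tail)
        return -1;                          /* ring empty */
    *handle = r->entries[r->tail % RING_SIZE];
    r->tail++;
    return 0;
}

int main(void)
{
    struct ring rx_to_classify = { .head = 0, .tail = 0 };
    uint32_t h;

    ring_put(&rx_to_classify, 0x1000);      /* producer: Rx queues a packet handle */
    if (ring_get(&rx_to_classify, &h) == 0) /* consumer: classifier dequeues it    */
        printf("got handle 0x%x\n", h);
    return 0;
}
```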

Outline

- IXP 2xxx hardware architecture
- IXA software architecture
- Usage questions
- Research questions

IXA Portability Framework – Goals

- Accelerate software development for the IXP family of network processors
- Provide a simple and consistent infrastructure to write networking applications
- Enable reuse of code across applications written to the framework
- Improve portability of code across the IXP family
- Provide an infrastructure for third parties to supply code
  – for example, to support TCAMs

IXA Software Framework

[Layered diagram of the IXA software framework: external processors running control-plane protocol stacks connect through the Control Plane PDK; on the XScale core (C/C++), core components sit on top of the Core Component Library and the Resource Manager Library; on the microengines (Microengine C), the microblock pipeline sits on top of the Microblock Library, which builds on the Protocol Library, Utility Library, and Hardware Abstraction Library.]

Software Framework on the MEv2

- Microengine C compiler (language)
- Optimized data plane libraries
  – microcode and MicroC libraries for commonly used functions
- Microblock programming model
  – enables development of modular code building blocks
  – defines the data flow model, common data structures, state sharing between code blocks, etc.
  – ensures consistency and improves reuse across different apps
- Core component library
  – provides a common way of writing slow-path components that interact with their counterpart fast-path code
- Microblocks and example applications written to the microblock programming model
  – IPv4/IPv6 forwarding, MPLS, DiffServ, etc.

Micro-engine C Compiler

- C language constructs
  – basic types, pointers, bit fields
- In-line assembly code support
- Aggregates
  – structs, unions, arrays
- Intrinsics for specialized ME functions
- Different memory models and special constructs for data placement (e.g., __declspec(sdram) struct msg_hdr hd; see the sketch below)
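Expanding the __declspec(sdram) example in the last bullet, here is a hedged Microengine C fragment. It compiles only with Intel's Microengine C compiler, not a standard C toolchain, and the struct fields are hypothetical illustrations rather than any actual IXA data structure.

```c
/* Sketch of data placement in Microengine C, based on the __declspec(sdram)
 * example above.  Requires Intel's Microengine C compiler; field names are
 * hypothetical illustrations, not IXA SDK definitions. */
__declspec(sdram) struct msg_hdr {
    unsigned int   buf_handle;   /* hypothetical: handle of the packet buffer   */
    unsigned int   offset;       /* hypothetical: offset of the payload         */
    unsigned short size;
    unsigned short flags;
} hd;                            /* whole structure is placed in DRAM           */

void touch_header(void)
{
    /* Accesses to 'hd' go through the DRAM transfer registers; ordinary
     * automatic variables stay in GPRs. */
    hd.flags |= 0x1;
}
```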

What is a Microblock?

- Data plane packet processing on the microengines is divided into logical functions called microblocks
- Coarse-grained and stateful
- Examples
  – 5-tuple classification
  – IPv4 forwarding
  – NAT
- Several microblocks running on a microengine thread can be combined into a microblock group
  – a microblock group has a dispatch loop that defines the dataflow for packets between microblocks (sketched below)
  – a microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component
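The dispatch loop can be pictured as a per-packet loop that switches on a "next block" variable that each microblock sets for its successor. The C sketch below only illustrates that dataflow idea; the block names, the dl_next_block variable, and the functions are hypothetical, not the IXA SDK's actual dispatch-loop macros.

```c
/* Hedged sketch of a microblock-group dispatch loop (illustrative names). */
#include <stdio.h>

enum block_id { BLK_CLASSIFY, BLK_IPV6, BLK_ENCAP, BLK_SINK, BLK_DROP };

static enum block_id dl_next_block;    /* each microblock sets this for the next one */

static void classify(void) { dl_next_block = BLK_IPV6;  }   /* e.g. saw an IPv6 header */
static void ipv6_fwd(void) { dl_next_block = BLK_ENCAP; }   /* route lookup succeeded  */
static void encap(void)    { dl_next_block = BLK_SINK;  }   /* added the L2 header     */

int main(void)
{
    /* One iteration per packet: a source block hands the loop a packet, then
     * control flows through the microblocks until a sink (or drop) is reached. */
    dl_next_block = BLK_CLASSIFY;
    while (dl_next_block != BLK_SINK && dl_next_block != BLK_DROP) {
        switch (dl_next_block) {
        case BLK_CLASSIFY: classify(); break;
        case BLK_IPV6:     ipv6_fwd(); break;
        case BLK_ENCAP:    encap();    break;
        default:           dl_next_block = BLK_DROP; break;
        }
    }
    printf("packet left the group via %s\n",
           dl_next_block == BLK_SINK ? "sink" : "drop");
    return 0;
}
```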

Core Components and Microblocks

[Diagram: on the XScale core, user-written core components sit on top of the Core Component Library and the Resource Manager Library; on the microengines, user-written microblocks sit on top of the Microblock Library. A legend distinguishes Intel/3rd-party blocks, user-written code, and libraries.]

Simplified Packet Flow (IPv6 example)

The slide animates a packet through an Rx block, a microblock group (Classify, IPv6, Encap), and a Sink block, connected by scratch rings (or next-neighbor registers). Packet buffers live in DRAM, buffer descriptors (offset, size, source port, …) in SRAM, and the header cache, meta-data, and DL state in ME local memory and GPRs; route-table and next-hop lookups supply the next-hop-id and DMAC. The steps:

a. Put packet in DRAM
b. Put descriptor in SRAM
c. Queue handle on ring
d. Pull meta-data into GPRs
e. Set DL state in GPRs
f. Set next_blk = Classify
g. Get headers into the header cache
h. Set HeaderType to IPv6
i. Set next_blk = IPv6
j. Get DAddr from the header cache
k. Search the route table
l. Set next-hop-id = N
m. Set next_blk = Encap
n. Get DMAC from next-hop N
o. Set Ethernet header in the header cache
p. Flush the header cache to DRAM
q. Flush meta-data to SRAM
r. Queue handle to ring

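The buffer/descriptor/handle split in steps a–c can be sketched as three C types. This is a hypothetical layout for illustration only; the field names and widths are assumptions and do not reflect the IXA SDK's actual meta-data format.

```c
/* Hypothetical layout of the buffer/descriptor split: the packet payload
 * sits in DRAM, a small descriptor sits in SRAM, and only a compact handle
 * travels between pipeline stages on a ring. */
#include <stdint.h>

struct packet_buffer {            /* lives in DRAM                          */
    uint8_t  data[2048];          /* Ethernet + IPv6 headers + payload      */
};

struct packet_descriptor {        /* lives in SRAM, one per buffer          */
    uint32_t buffer_handle;       /* identifies the DRAM buffer             */
    uint16_t offset;              /* start of the packet within the buffer  */
    uint16_t size;                /* packet length in bytes                 */
    uint8_t  source_port;
    uint8_t  header_type;         /* e.g. set to "IPv6" by Classify         */
    uint16_t next_hop_id;         /* filled in by the IPv6 route lookup     */
};

/* What actually gets queued on a scratch ring (or handed through the
 * next-neighbor registers) between stages: just a 32-bit handle. */
typedef uint32_t packet_handle;
```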

Outline

- IXP 2xxx hardware architecture
- IXA software architecture
- Usage questions
- Research questions

What can I do with an IXP?

- Fully programmable architecture
  – implement any packet processing application
- Examples from customers
  – routing/switching, VPN, DSLAM, multi-service switch, storage, content processing
  – intrusion detection (IDS) and RMON – needs processing of many state elements in parallel
- Use as a research platform
  – experiment with new algorithms, protocols
- Use as a teaching tool
  – understand architectural issues
  – gain hands-on experience with networking systems

Technical and Business Challenges

- Technical challenges
  – shift from the ASIC-based paradigm to software-based apps
  – challenges in programming an NPU (covered next)
  – trade-off between power, board cost, and number of NPUs
  – how to add co-processors for additional functions?
- Business challenges
  – reliance on an outside supplier for the key component
  – preserving intellectual property advantages
  – add value and differentiation through software algorithms in data plane, control plane, and services plane functionality
  – must decrease TTM to be competitive (To NPU or not to NPU?)

Outline

- IXP 2xxx hardware architecture
- IXA software architecture
- Usage questions
- Research questions

Architectural Issues

- How to scale up to OC-768 and beyond?
- What is the "right" architecture?
  – a set of reconfigurable processing engines vs. carefully architected pipelined stages vs. a set of fixed-function blocks
- Questionable hypotheses
  – no locality in packet processing?
    – temporal vs. spatial locality
    – working set size vs. available cache capacity
  – little or no dependency among packets from different flows?

Challenges in Programming an NP

- Distributed, parallel programming model
  – multiple microengines, multiple threads
- Wide variety of resources
  – multiple memory types (latencies, sizes)
  – special-purpose engines
  – global and local synchronization
- Significantly different from the problem seen in scientific computing

NPU Programming Challenges

- Programming environments for NPUs and network systems are different from those for conventional multi-processors
- Automatic allocation of network system resources: memory

Conventional MP systems:
- Rely on locality of memory accesses to utilize the memory hierarchy effectively
- Minimizing memory access latencies is crucial
- Compilers are unaware of the memory levels or their performance characteristics

Network systems:
- Packet processing applications demonstrate little temporal locality
- Programmers program to a single-level memory hierarchy
- Compilers should manage the memory hierarchy explicitly (see the toy example below)
  – allocate data structures to appropriate memory levels
  – allocation depends on data structure sizes, access patterns, sharing requirements, memory system characteristics, …

Memory management is more complex in network systems.
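To make the allocation decision tangible, the toy C function below picks a memory level from a data structure's size and sharing pattern. The thresholds loosely follow the memory table earlier in the deck, but the whole policy is an assumption for illustration; it is not what any real IXA compiler or tool implements.

```c
/* Toy illustration of the allocation decision described above: given a data
 * structure's size and how it is shared/accessed, pick a memory level. */
#include <stdio.h>

enum mem_level { MEM_LOCAL, MEM_SCRATCH, MEM_SRAM, MEM_DRAM };

static enum mem_level place(unsigned size_bytes, int shared_across_mes, int bulk_payload)
{
    if (bulk_payload)                return MEM_DRAM;    /* big, streamed data (packets)    */
    if (!shared_across_mes && size_bytes <= 2560)
                                     return MEM_LOCAL;   /* per-ME state, lowest latency    */
    if (size_bytes <= 16 * 1024)     return MEM_SCRATCH; /* small shared state, atomics     */
    return MEM_SRAM;                                     /* large tables, queue descriptors */
}

int main(void)
{
    static const char *name[] = { "local", "scratch", "SRAM", "DRAM" };
    printf("flow counters (1 KB, per-ME): %s\n", name[place(1024, 0, 0)]);
    printf("ring state (4 KB, shared):    %s\n", name[place(4096, 1, 0)]);
    printf("route table (8 MB, shared):   %s\n", name[place(8u << 20, 1, 0)]);
    printf("packet payloads:              %s\n", name[place(2048, 1, 1)]);
    return 0;
}
```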

NPU Challenges – 2

- Automatic allocation of network system resources: processors

Conventional MP systems:
- Parallel compilers exploit loop- or function-level parallelism to utilize multiple processors to speed up execution of programs
- Operating systems utilize idle processors to execute multiple programs in parallel

Network systems:
- Individual packet processing is inherently sequential; little loop- or function-level parallelism
- Process packets belonging to different flows in parallel
- High-throughput and robustness requirements
  – compilers should create efficient packet processing pipelines
  – granularity of a pipeline stage depends on instruction cache size, amount of communication between stages, computational complexity of stages, sharing and synchronization requirements, …

Network applications are explicitly parallel → concurrency extraction is simpler; but throughput and robustness requirements introduce a new problem of pipeline construction!

Challenges (contd.)

- How to enable a wide range of network applications
  – TCP offload/termination
  – how to distribute functionality between the SA/XScale, …, and the microengines?
  – hierarchy of compute vs. I/O capabilities
  – how to allow use of multiple IXPs to solve more compute-intensive problems
- Networking research
  – how to take advantage of a programmable, open architecture?
  – designing the "right" algorithms for LPM, range matching, string search, etc.
  – QoS-related algorithms – TM4.1, WRED, etc.

Questions?
