QorIQ T4240 Communications Processor Deep Dive FTF-NET-F0031

Sam Siu & Feras Hamdan

APR. 2014


Agenda

• QorIQ T4240 Communications Processor Overview
• e6500 Core Enhancement
• Memory Subsystem and MMU Enhancement
• QorIQ Power Management Features
• HiGig Interface
• Interlaken Interface
• PCI Express® Gen 3 Interfaces (SR-IOV)
• Serial RapidIO® Manager (RMAN)
• Data Path Acceleration Architecture Enhancements
  − mEMAC
  − Offline Ports and Use Case
  − Storage Profiling
  − Data Center Bridging (FMAN and QMAN)
  − Accelerators: SEC, DCE, PME
• Debug

QorIQ T4240 Communications Processor

[Block diagram: three clusters of four dual-threaded e6500 cores, each cluster with a 2 MB banked L2; CoreNet coherency fabric with three 512 KB platform cache slices and three 64-bit DDR3/3L memory controllers; PAMUs; two Frame Managers; SEC, PME, and DCE accelerators; real-time debug; peripherals behind two 16-lane 10 GHz SerDes banks.]

Cores
• 12x e6500, 64-bit, up to 1.8 GHz
• Dual threaded, with 128-bit AltiVec engine
• Arranged as 3 clusters of 4 CPUs, with 2 MB L2 per cluster; 256 KB per thread

Memory subsystem
• 1.5 MB CoreNet platform cache w/ECC
• 3x DDR3 controllers up to 1.87 GHz
• Each with up to 1 TB addressability (40-bit physical addressing)

High-speed serial I/O
• 4 PCIe controllers, with Gen3
  − SR-IOV support
• 2 sRIO controllers
  − Type 9 and 11 messaging
  − Interworking to DPAA via RMan
• 1 Interlaken Look-Aside at up to 10 GHz
• 2 SATA 2.0, 3 Gb/s
• 2 USB 2.0 with PHY

Network I/O (40 Gbps)
• 2 Frame Managers, each with:
  − Up to 25 Gbps parse/classify/distribute
  − 2x 10GE, 6x 1GE
  − HiGig, Data Center Bridging support
  − SGMII, QSGMII, XAUI, XFI

Data path acceleration
• SEC: crypto acceleration, 40 Gbps
• PME: regex pattern matcher, 10 Gbps
• DCE: data compression engine, 20 Gbps

Device
• TSMC 28 HPM process
• 1932-pin BGA package, 42.5 x 42.5 mm, 1.0 mm pitch
• Power targets: ~54 W thermal max at 1.8 GHz; ~42 W thermal max at 1.5 GHz

e6500 Core Enhancement

e6500 Core Complex

High performance
• 64-bit Power Architecture® technology
• Up to 1.8 GHz operation
• Two threads per core
• Dual load/store units, one per thread
• 40-bit real address
  − 1 terabyte physical address space
• Hardware table walk
• 2 MB, 16-way shared L2 cache, 4 banks, per cluster of 4 cores
  − Supports sharing across the cluster
  − Supports L2 memory allocation to core or thread
• CoreNet interface: 40-bit address bus, 256-bit read and write data busses, double data processor port
• AltiVec SIMD unit (128b); a minimal usage sketch follows the table below
  − 8-, 16-, 32-bit signed/unsigned integer
  − 32-bit floating-point: 173 GFLOPS (1.8 GHz)
  − 8-, 16-, 32-bit Boolean

Energy-efficient power management
• Drowsy: core, cluster, AltiVec engine
• Wait-on-reservation instruction
• Traditional modes

Improved productivity with core virtualization
• Logical to Real Address (LRAT) translation mechanism for improved hypervisor performance

CoreMark             P4080 (1.5 GHz)   T4240 (1.8 GHz)   Improvement from P4080
Single thread        4708              7828              1.7x
Core (dual thread)   4708              15,656            3.3x
SoC                  37,654            187,873           5.0x
DMIPS/Watt (typ)     2.4               5.1               2.1x
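For readers new to the AltiVec unit referenced above, here is a minimal sketch (not from the slides) of a vectorized loop using the standard altivec.h intrinsics; it processes four packed 32-bit floats per iteration, the data type the e6500's 128-bit engine targets.

```c
/* Minimal AltiVec sketch: y[i] += a * x[i], four floats per iteration.
 * n must be a multiple of 4 and the arrays 16-byte aligned for
 * vec_ld/vec_st. Build with e.g. gcc -maltivec. */
#include <altivec.h>

static void saxpy_altivec(float a, const float *x, float *y, int n)
{
    vector float va = vec_splats(a);          /* broadcast a to all lanes */
    for (int i = 0; i < n; i += 4) {
        vector float vx = vec_ld(0, &x[i]);   /* load 4 floats */
        vector float vy = vec_ld(0, &y[i]);
        vy = vec_madd(va, vx, vy);            /* fused multiply-add */
        vec_st(vy, 0, &y[i]);                 /* store 4 results */
    }
}
```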

General Core Enhancements

• Improved branch prediction and additional link stack entries
• Pipeline improvements:
  − LR, CTR, mfocrf optimization (LR and CTR are renamed)
  − 16-entry rename/completion buffer
• New debug features:
  − Ability to allocate individual debug events between the internal and external debuggers
  − More IAC events
• Performance monitor
  − Many more events, six counters per thread
  − Guest performance monitor interrupt
• Private vs. shared state registers and other architected state
  − Shared between threads:
    . There is only one copy of the register or architected state
    . A change in one thread affects the other thread if the other thread reads it
  − Private to the thread and replicated per thread:
    . There is one copy per thread of the register or architected state
    . A change in one thread does not affect the other thread if that thread reads its private copy

CoreNet Enhancements in QorIQ T4240

• CoreNet Coherency Fabric
  − 40-bit real address
  − Higher address bandwidth and more active transactions
    . 1.2 Tbps read, 0.6 Tbps write
  − 2x bandwidth increase for core, MMU, and peripherals
  − Improved configuration architecture
• Platform cache
  − Increased write bandwidth (>600 Gbps)
  − Increased buffering for improving throughput
  − Improved data ownership tracking for performance enhancement
• Data prefetch
  − Tracks CPC misses
  − Prefetches from multiple memory regions with configurable sizes
  − Selective tracking based on requesting device, transaction type, data/instruction access
  − Conservative prefetch requests to avoid overloading the system with prefetches
  − "Confidence"-based algorithm with feedback mechanism
  − Performance monitor events to evaluate the performance of prefetch in the system

[Chart: IP Mark / TCP Mark benchmark utilization, 0-24 on the x-axis, 0-100% on the y-axis.]

Cache and Memory Subsystem Enhancements

Shared L2 Cache

• Clusters of cores share a 2 MB, 4-bank, 16-way set-associative shared L2 cache.
• In addition, there is also support for a 1.5 MB CoreNet platform cache.
• Advantages
  − L2 cache is shared among 4 cores, allowing lines to be allocated among the 4 cores as required
    . Some cores will need more lines and some will need fewer, depending on workloads
  − Faster sharing among cores in the cluster (sharing a line between cores in the cluster does not require the data to travel on CoreNet)
  − Flexible partitioning of the L2 cache based on application cluster group
• Trade-offs
  − Longer latency to DRAM and other parts of the system outside the cluster
  − Longer latency to the L2 cache due to increased cache size and eLink overhead

[Block diagram repeated: three clusters of four dual-threaded e6500 cores with 2 MB banked L2 caches, CoreNet fabric, 512 KB platform cache slices, 64-bit DDR3/3L memory controllers, security fuse processor, security monitor, and PAMUs.]

Memory Subsystem Enhancements

• The e6500 core has a larger store queue than the e5500 core
• Additional registers are provided for L2 cache partitioning controls, similar to how partitioning is done in the CPC
• Cache locking is supported; however, if a line cannot be locked, that status is not posted. Cache lock query instructions are provided for determining whether a line is locked
• The load/store unit contains store gather buffers to collect stores to cache lines before sending them over eLink to the L2 cache
• There are no longer Line Fill Buffers (LFB) associated with the L1 data cache
  − These are replaced with Load Miss Queue (LMQ) entries for each thread
  − They function in a manner very similar to LFBs
  − Note there are still LFBs for the L1 instruction cache

MMU Enhancements

MMU – TLB Enhancements

• e6500 core implements MMU architecture version 2 (V2)
  − MMU architecture V2 is denoted by bits in the MMUCFG register
• Translation look-aside buffer TLB1
  − Variable-size pages, supporting power-of-two page sizes (previous cores used power-of-four page sizes)
  − 4 KB to 1 TB page sizes
• Translation look-aside buffer TLB0 increased to 1024 entries
  − 8-way associativity (from 512 entries, 4-way)
  − Supports HES (hardware entry select) when written with tlbwe
• PID register increased to 14 bits (from 8 bits)
  − The operating system can now have 16K simultaneous contexts
• Real address increased to 40 bits (from 36 bits)
• In general, backward compatible with MMU operations from the e5500 core, except:
  − Some of the configuration registers have a different organization (TLBnCFG, for example)
  − There are new configuration registers for TLB page size (TLBnPS) and LRAT page size (LRATPS)
  − tlbwe can be executed by the guest supervisor (but this can be turned off with an EPCR bit)

[Diagram: effective-to-real address translation. LPID, GS (0 = hypervisor access, 1 = guest), AS, and the 14-bit PID qualify the effective page number (0-52 bits); the TLB supplies the real page number (0-28 bits), which is concatenated with the byte address (12-40 bits) to form the 40-bit real address.]
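A small helper can illustrate the power-of-two page-size encoding mentioned above. It rests on one assumption that should be checked against the e6500 reference manual: that the MMU v2 TSIZE field holds log2 of the page size in KB, so 4 KB encodes as 2 and 1 TB as 30.

```c
/* Sketch: validate an MMU v2 power-of-two page size (4 KB .. 1 TB) and
 * compute its assumed TSIZE encoding (log2 of the size in KB). */
#include <stdbool.h>
#include <stdint.h>

static bool tsize_for_page(uint64_t bytes, unsigned *tsize)
{
    /* Must be a power of two between 4 KB and 1 TB. */
    if (bytes < (1ULL << 12) || bytes > (1ULL << 40) ||
        (bytes & (bytes - 1)))
        return false;
    unsigned log2_bytes = 63 - __builtin_clzll(bytes);
    *tsize = log2_bytes - 10;   /* convert log2(bytes) to log2(KB) */
    return true;
}
```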

MMU – Virtualization Enhancements (LRAT)

• e6500 core contains an LRAT (logical to real address translation)
  − The LRAT takes logical addresses (addresses the guest operating system thinks are real) and converts them to true real addresses
  − Translation occurs when the guest executes tlbwe and tries to write TLB0, or during hardware tablewalk for a guest translation
  − Does not require the hypervisor to intervene unless the LRAT incurs a miss (the hypervisor writes entries into the LRAT)
  − 8-entry, fully associative, supporting variable-size pages from 4 KB to 1 TB (in powers of two)
• Prior to the LRAT, the hypervisor had to intervene each time the guest tried to write a TLB entry

[Diagram: on an application page fault, the guest OS translates VA to guest RA and writes the TLB; the trap to the hypervisor (guest RA to RA, hypervisor writes the TLB) is implemented in hardware with the LRAT.]

QorIQ Power Management Features


T4 Family Dynamic Energy/Power: Total Cost of Ownership

[Figure: a cyclical, always-on workload swings between light, mid, and full load and back to standby. Against today's "always on" strategy, the T4 adds escalating energy savings from dynamic clock gating, core and cluster drowsy, cascaded drowsy, and SoC sleep (+ Tj), under T4 advanced power management.]

Cascaded Power Management

• Today: all CPUs in the pool channel dequeue until all FQs are empty, with a broadcast notification when work arrives.
• With cascaded power management, DPAA uses task queue thresholds to inform CPUs they are not needed; CPUs are selectively awakened as needed.

• CPUs run software that drops into a polling loop when DPAA is not sending them work.
• The polling loop should include a wait-with-drowsy instruction that puts the core into drowsy; a sketch of this loop follows below.

[Figure: QMan task queue with burst thresholds; as queue depth crosses threshold 1 and threshold 2, additional cores C0-C3 in each cluster are awakened from drowsy, so power tracks the day/night performance demand.]
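A hedged sketch of that polling loop follows. qman_dequeue() and wake_doorbell are hypothetical stand-ins for the real QMan portal API, and the exact mnemonic and encoding of the e6500 wait/wait-on-reservation instruction must be taken from the core manual; only the overall dequeue-then-sleep shape is the point.

```c
/* Sketch of the "poll, then wait-on-reservation" worker loop. */
#include <stdint.h>

extern void *qman_dequeue(void);            /* hypothetical portal read */
extern volatile uint32_t wake_doorbell;     /* snooped wake location */

static void worker_loop(void)
{
    for (;;) {
        void *frame = qman_dequeue();
        if (frame) {
            /* ... process the frame ... */
            continue;
        }
        /* Nothing to do: take a reservation on the doorbell, then drop
         * into the low-power wait state. A store to wake_doorbell by
         * QMan or another core clears the reservation and wakes us. */
        uint32_t tmp;
        __asm__ volatile("lwarx %0,0,%1" : "=&r"(tmp)
                         : "r"(&wake_doorbell));
        (void)tmp;
        __asm__ volatile("wait" ::: "memory");  /* assumed encoding */
    }
}
```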

e6500 Core Intelligent Power Management

• Cores and L2: cores support Run, Doze, Nap; the cluster supports Run, Nap; the AltiVec engine supports Wait and Drowsy
• Dynamic frequency scaling (DFS) of the cluster: automatic and SW controlled, maintains state
• Drowsy cluster (cores) and core drowsy: automatic and SW controlled, maintains state
• Dynamic clock gating

State (new states marked *):  Run      Doze       Nap        Nap (Pwr Gated)*  Core Glb Clk Stop*  Cluster Glb Clk Stop*
Cluster state                 PCL00    PCL00      PCL00      PCL00             PCL00               PCL10
Core state                    PH00     PH10/PW10  PH15       PW20              PH20                PH20
Cluster clock                 On       On         On         On                On                  Off
Core clock                    On       On         Off        Off               Off                 Off
L2 cache                      -        -          -          -                 -                   SW flushed
L1 cache                      -        -          SW inval.  HW inval.         SW inval.           SW inval.
Wakeup time                   Active   Immediate  < 30 ns    < 200 ns          < 600 ns            < 1 us

• SoC sleep with state retention
• SoC sleep with reset
• Cascaded power management
• Energy Efficient Ethernet (EEE)

HiGig Interface Support

HiGig™/HiGig+/HiGig2 Interface Support

• The 10 Gigabit HiGig™/HiGig+™/HiGig2™ MAC interface interconnects standard Ethernet devices to switch HiGig ports.
• Networking customers can add features like quality of service (QoS), port trunking, mirroring across devices, and link aggregation at the MAC layer.
• The physical signaling across the interface is XAUI: four differential pairs for receive and transmit (SerDes), each operating at 3.125 Gbit/s. HiGig+ is a higher-rate version of HiGig.

Frame formats:
• Regular Ethernet frame:       Preamble | MAC_DA | MAC_SA | Type | Packet Data | FCS
• Ethernet frame with HiGig+:   Preamble | HiGig+ Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*
• Ethernet frame with HiGig2:   Preamble | HiGig2 Module Hdr | MAC_DA | MAC_SA | Type | Packet Data | FCS*

QorIQ T4240 Processor HiGig Interface

• The T4240 FMan supports the HiGig/HiGig+/HiGig2 protocols.
• In the T4240 processor, the 10G mEMACs can be configured as HiGig interfaces. In this configuration, two of the 1G mEMACs are used as the HiGig message interface.

SERDES Configuration for HiGig Interface

• Networking protocols (SerDes 1 and SerDes 2)
• HiGig notation: HiGig[2]m.n means HiGig[2] (4 lanes @ 3.125 or 3.75 Gbps)
  − "m" indicates which Frame Manager (FM1 or FM2)
  − "n" indicates which MAC on the Frame Manager
  − E.g., "HiGig[2]1.10" indicates HiGig[2] using FM1's MAC 10
• When a SerDes protocol is selected with dual HiGigs in one SerDes, both HiGigs must be configured with the same protocol (for example, both with 12-byte headers or both with 16-byte headers)

HiGig/HiGig2 Control and Configuration

HiGig/HiGig2 Control and Configuration Register (HG_CONFIG)

Name         Description
LLM_MODE     Toggle between HiGig2 link level messages physical link OR HiGig2 link level messages logical link (SAFC)
LLM_IGNORE   Ignore HiGig2 link level message quanta
LLM_FWD      Terminate/forward received HiGig2 link level messages
IMG[0:7]     Inter Message Gap: spacing between HiGig2 messages
NOPRMP       Toggle preemptive transmission of HiGig2 messages
MCRC_FWD     Strip/forward the HiGig2 message CRC of received messages
FER          Discard/forward HiGig2 receive messages with CRC errors
FIMT         Forward or discard messages with an illegal MSG_TYP
IGNIMG       Ignore IMG on the receive path
TCM          TC (traffic classes) mapping

Interlaken Interface

Interlaken Look-Aside Interface

• Use case: T4240 processor as a data path processor requiring millions of look-ups per second, an expected requirement in edge routers.
• Interlaken Look-Aside is a new high-speed serial standard for connecting TCAMs ("network search engines", "knowledge-based processors") to host CPUs and NPUs. It replaces the Quad Data Rate (QDR) SRAM interface.
• Like Interlaken streaming interfaces (channelized SerDes links, replacing SPI-4.2), Interlaken Look-Aside supports a configurable number of SerDes lanes (1-32, with single-lane granularity) with linearly increasing bandwidth. Freescale supports x4 and x8, up to 10 GHz.
• For lowest latency, each vCPU (thread) in the T4240 processor has a portal into the Interlaken controller, allowing multiple search requests and results to be returned concurrently.
• Interlaken Look-Aside is expected to gain traction as the interface to other low-latency, minimal-data-exchange co-processors, such as traffic managers. PCIe and sRIO are better for higher-latency/high-bandwidth applications.
• Lane striping

[Diagram: T4240 with four 10G ports connected over a lane-striped Interlaken Look-Aside link to an external TCAM.]

T4240 Interlaken Look-Aside Controller (LAC) Features

• Supports the Interlaken Look-Aside protocol definition, rev. 1.1
• Supports 24 partitioned software portals
• Supports in-band per-channel flow control options, with simple xon/xoff semantics
• Supports a wide range of SerDes speeds (6.25 and 10.3125 Gbps)
• Ability to disable the connection to individual SerDes lanes
• A continuous meta frame of programmable frequency to guarantee lane alignment, synchronize the scrambler, perform clock compensation, and indicate lane health
• 64B/67B data encoding and scrambling
• Programmable BURSTSHORT parameter of 8 or 16 bytes
• Error detection for illegal burst sizes, bad 64/67 word types, and CRC-24 errors
• Error detection of transmit command programming errors
• Built-in statistics counters and error counters
• Dynamic power-down of each software portal

Look-Aside Controller Block Diagram

Modes of Operation

• The T4240 LA controller can be in either stashing or non-stashing mode.
• The LAC programming model is big-endian, meaning byte 0 is the most significant byte.
• In non-stashing mode, software has to issue dcbf each time it reads SWPnRSR and the RDY bit is not set, as sketched below.
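A sketch of that non-stashing poll follows. The SWPnRSR pointer and the RDY bit position are illustrative assumptions (take both from the T4240 reference manual); only the flush-then-repoll idiom is the point. Since the e6500 runs big-endian, a plain 32-bit load matches the LAC's big-endian programming model.

```c
/* Non-stashing LAC poll: flush the stale cache line with dcbf whenever
 * RDY is not yet set, then re-read the status register. */
#include <stdint.h>

#define LAC_RSR_RDY 0x80000000u          /* assumed RDY bit position */

static uint32_t lac_poll_result(volatile uint32_t *swp_rsr)
{
    uint32_t rsr;
    while (!((rsr = *swp_rsr) & LAC_RSR_RDY)) {
        /* The LAC wrote the result to memory behind the cache, so
         * discard our stale copy before polling again. */
        __asm__ volatile("dcbf 0,%0" : : "r"(swp_rsr) : "memory");
    }
    return rsr;
}
```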

Interlaken LA Controller Configuration Registers

• 4 KB hypervisor space: 0x0000-0x0FFF
• 4 KB managing core space: 0x1000-0x1FFF
• In compliance with the trust architecture, LSRER, LBARE, LBAR, and LLIODNRn are accessed exclusively in hypervisor mode and are reserved in managing core mode
• Statistics, lane mapping, interrupt, rate, metaframe, burst, FIFO, calendar, debug, pattern, error, and capture registers
• LAC software portal memory, n = 0, 1, 2, ..., 23:
  − SWPnTCR/SWPnRCR: software portal n transmit/receive command register
  − SWPnTER/SWPnRER: software portal n transmit/receive error register
  − SWPnTDR0-3/SWPnRDR0-3: software portal n transmit/receive data registers 0-3
  − SWPnRSR: software portal n receive status register

TCAM Usage in Routing Example

Interlaken Look-Aside TCAM Board

[Board diagram: a Renesas 5 Mb Interlaken LA TCAM attached over the IL-LA link. Clocks: 125 MHz SYSCLK, 4x 156.25 MHz IL-LA reference clocks. Config via I2C EEPROM; SMBus; reset and JTAG. Power rails from 3.3 V/12 V input: VDDC 0.85 V @ 6 A, VDDA 0.85 V @ 2 A (filtered), VDDHA 1.80 V @ 0.5 A, VCC_1.8V 1.8 V @ 2 A, VDDO 1.80 V @ 1.0 A, VPLL 1.80 V @ 0.25 A.]

PCI Express® Gen 3 Interfaces


• Two PCIe Gen 3 controllers can run at the same time with the same SerDes reference clock source
• PCIe Gen 3 bit rates are supported
  − When running more than one PCIe controller at Gen 3 rates, the associated SerDes reference clocks must be driven by the same source on the board

PCIe configuration (16 SerDes lanes total):
• Controller widths include x8 Gen2 or x4 Gen2/3 RC/EP (PCIe1), x4 Gen2/3 RC/EP (PCIe2), x8 Gen2 or x4 Gen3 RC/EP (PCIe3), and x4 Gen2 RC/EP (PCIe4)
• One controller (up to x4 Gen3) supports EP SR-IOV with 2 PFs / 64 VFs and 8x MSI-X per VF/PF
• Example lane allocations: x4gen3 + x4gen2 + x8gen2; x8gen2 + x8gen2; x4gen2 + x4gen2 + x4gen3 + x4gen2

Single Root I/O Virtualization (SR-IOV) End Point

• With SR-IOV supported in the EP, different devices or different software tasks can share I/O resources, such as Gigabit Ethernet controllers.
  − T4240 supports the SR-IOV 1.1 spec version with 2 PFs and 64 VFs per PF
  − SR-IOV supports native IOV in existing single root complex PCI Express topologies
  − Address translation services (ATS) support native IOV across PCI Express via address translation
  − A single management physical or virtual machine on the host handles end-point configuration
• E.g., the T4240 processor as a converged network adapter: each virtual machine running on the host thinks it has a private version of the services card

[Diagram: VMs 1..N on a host with a translation agent sharing a T4240 endpoint; the figure notes a single controller (up to x4 Gen 3) with 1 PF and 64 VFs.]

PCI Express Configuration Address Register

• The PCI Express configuration address register contains address information for accesses to PCI Express internal and external configuration registers for End Point (EP) with SR-IOV

Fields: EN | Type | EXTREGN | VFN | PFN | REGN

Name      Description
EN        Enable: allows a PCI Express configuration access when PEX_CONFIG_DATA is accessed
TYPE      01: configuration register accesses to PF registers for EP with SR-IOV
          11: configuration register accesses to VF registers for EP with SR-IOV
EXTREGN   Extended register number; allows access to the extended PCI Express configuration space
VFN       Virtual function number minus 1 (64-255 is reserved)
PFN       Physical function number minus 1 (2-15 is reserved)
REGN      Register number; 32-bit register to access within the specified device

Message Signaled Interrupts (MSI-X) Support

• MSI-X allows an EP device to send message interrupts to the RC device independently for different physical or virtual functions, as supported by EP SR-IOV.
• Each PF or VF has eight MSI-X vectors allocated, with a total of 256 MSI-X vectors supported
  − Supports MSI-X for PF/VF with 8 MSI-X vectors per PF or VF
  − Supports MSI-X trap operation
  − To access an MSI-X PBA structure, the PF, VF, IDX, and EIDX are concatenated to form the 4-byte-aligned address of the register within the MSI-X PBA structure (see the packing sketch after the field table below). That is, the register address is:
    . PF || VF || IDX || EIDX || 0b00

Fields: Type | PF | VF | IDX | EIDX | M

Name   Description
TYPE   Access to PF or VF MSI-X vector table for EP with SR-IOV
PF     Physical function
VF     Virtual function
IDX    MSI-X entry index in each VF
EIDX   Extended index; selects which 4-byte entity within the MSI-X PBA structure to access
M      Mode = 11
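Given the PF || VF || IDX || EIDX || 0b00 concatenation quoted above, the address formation can be sketched as below. The field widths chosen here (1-bit PF, 6-bit VF, 3-bit IDX, 1-bit EIDX) are assumptions for illustration, sized loosely to the 2 PF / 64 VF / 8 vector figures on this slide; the manual defines the real widths.

```c
/* Sketch: form the 4-byte-aligned MSI-X PBA register offset by
 * concatenating the function and index fields, then appending 0b00. */
#include <stdint.h>

static uint32_t msix_pba_offset(unsigned pf, unsigned vf,
                                unsigned idx, unsigned eidx)
{
    uint32_t addr = pf & 0x1u;
    addr = (addr << 6) | (vf   & 0x3Fu);  /* 64 VFs per PF (assumed width) */
    addr = (addr << 3) | (idx  & 0x7u);   /* 8 MSI-X vectors per PF/VF */
    addr = (addr << 1) | (eidx & 0x1u);   /* assumed 1-bit extended index */
    return addr << 2;                     /* "|| 0b00": 4-byte aligned */
}
```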

Serial RapidIO® Manager (RMAN)

RapidIO Message Manager (RMan)

• RMan supports both inline switching and look-aside forwarding operation.
• RapidIO PDU format: Ftype | Target ID | Src ID | Address | Packet Data Unit | CRC

[Block diagram: inbound RapidIO traffic passes through rule matching, classification units, and reassembly contexts/units onto work queues WQ0-WQ7 of hardware and pool channels toward QMan; outbound traffic is arbitrated through segmentation units and disassembly contexts back to the RapidIO ports. The same DPAA infrastructure connects the FMan, SEC, PME, 1GE/10GE interfaces, and e6500 core caches.]

RMan: Greater Performance and Functionality

• Many queues allow multiple inbound/outbound queues per core
  − Hardware queue management via the QorIQ Data Path Acceleration Architecture (DPAA)
• Supports all messaging-style transaction types
  − Type 11 messaging
  − Type 10 doorbells
  − Type 9 data streaming
• Enables low-overhead direct core-to-core communication

[Diagram: two QorIQ or DSP devices, each with multiple cores, connected by 10G sRIO: channelized CPU-to-CPU transport and device-to-device transport carrying Type 9 and message (Type 11) user PDUs.]

Data Path Acceleration Architecture (DPAA)

DPAA Philosophy

• DPAA is designed to balance the performance of multiple CPUs and accelerators with seamless integration
  − ANY packet to ANY core to ANY accelerator or network interface, efficiently, WITHOUT locks or semaphores
• "Infrastructure" components
  − Queue Manager (QMan)
  − Buffer Manager (BMan)
• "Accelerator" components
  − Cores
  − Frame Manager (FMan), with parse/classify/distribute (PCD)
  − RapidIO Message Manager (RMan)
  − Cryptographic accelerator (SEC 4.x)
  − Pattern matching engine (PME 2)
  − Decompression/Compression Engine (DCE)
  − DCB (Data Center Bridging)
  − RAID Engine (RE)
• CoreNet
  − Provides the interconnect between the cores and the DPAA infrastructure, as well as access to memory

[Diagram: P series (e500mc cores) vs. T series (e6500 cores) DPAA block diagrams.]

DPAA Building Block: Frame Descriptor (FD)

• Simple frame: the FD carries the buffer pool ID (BPID), a 40-bit buffer address, an offset to the data, the packet length, and a status/command word.
• Multi-buffer frame (scatter/gather): the FD points to an S/G list; each list entry carries an address, length, offset, and BPID, with a flag marking the final entry.

FD layout: DD | LIODN offset | BPID | ELIODN offset | addr | Format | Offset | Length | Status/Cmd

Frame Descriptor Status/Command Word (FMan Status)

Name    Description
DCL4C   L4 (IP/TCP/UDP) checksum validation enable/disable
DME     DMA error
MS      MACSEC frame (this bit is valid on P1023)
FPE     Frame physical error
FSE     Frame size error
DIS     Discard; set only for frames that are supposed to be discarded but are enqueued to an error queue for debug purposes
EOF     Extract out of frame error
NSS     No scheme selection for KeyGen
KSO     Key size overflow error
FCL     Frame color as determined by the policer: 00 = green, 01 = yellow, 10 = red, 11 = no reject
IPP     Illegal policer profile error
FLM     Frame length mismatch
PTE     Parser time-out
ISP     Invalid soft parser instruction error
PHE     Header error
FRDR    Frame drop
BLE     Block limit exceeded
L4CV    L4 checksum validation

DPAA: mEMAC Controller

Multirate Ethernet MAC (mEMAC) Controller

• The multirate Ethernet MAC (mEMAC) controller supports 100 Mbps/1G/2.5G/10G (the QorIQ P series used separate dTSEC and 10G MACs)
  − Supports HiGig/HiGig+/HiGig2 protocols
  − Dynamic configuration for NIC (network interface card) applications or switching/bridging applications at 10 Gbps or below
  − Designed to comply with IEEE Std 802.3®, IEEE 802.3u, IEEE 802.3x, IEEE 802.3z, IEEE 802.3ac, IEEE 802.3ab, IEEE 1588 v2 (clock synchronization over Ethernet), IEEE 802.3az, and IEEE 802.1Qbb
  − RMON statistics
  − CRC-32 generation and append on transmit, or forwarding of a user-application-provided FCS, selectable on a per-frame basis
  − 8 MAC address comparisons on receive and one MAC address overwrite on transmit for NIC applications
  − Selectable promiscuous frame receive mode and transparent MAC address forwarding on transmit
  − Multicast address filtering with a 64-bin hash code lookup table on receive, reducing the processing load on higher layers
  − Support for VLAN-tagged frames and double VLAN tags (stacked VLANs)
  − Dynamic inter-packet gap (IPG) calculation for WAN applications

[Diagram: T4240 mEMAC internals: Tx/Rx FIFOs on the Frame Manager interface, 1588 time stamping, Tx/Rx flow control, configuration/status/statistics, reconciliation layer, and an MDIO master for PHY management.]

DPAA: FMan

FMan Enhancements

• Storage profile selection (up to 32 profiles per port) based on classification
  − Up to four buffer pools per storage profile
• Customer edge egress traffic management (egress shaping)
• Data Center Bridging
  − PFC and ETS
• IEEE 802.3az (Energy Efficient Ethernet)
• IEEE 802.3bf (time sync)
• IP fragmentation and reassembly offload
• HiGig, HiGig2
• Tx confirmation/error queue enhancements
  − Ability to configure separate FQIDs for normal confirmations vs. errors
  − Separate FD status for overflow and physical errors
• Option to disable S/G on ingress

Offline Ports

FMan Port Types

• Ethernet receive (Rx) and transmit (Tx)
  − 1 Gbps / 2.5 Gbps / 10 Gbps
  − On FMan_v3, some ports can be configured as HiGig
  − Jumbo frames of up to 9.6 KB (add the u-boot bootarg "fsl_fm_max_frm=9600")
• Offline (O/H)
  − FMan_v3: 3.75 Mpps (vs. 1.5 Mpps in the P series)
  − Supports the parse/classify/distribute (PCD) function on frames whose frame descriptors (FDs) are extracted from QMan
  − Supports frame copy or move from one storage profile to another
  − Able to dequeue from and enqueue to a QMan queue: the FMan applies a PCD flow and (if configured to do so) enqueues the frame back to a QMan queue. In FMan_v3 the FMan is able to copy the frame into new buffers and enqueue it back to QMan.
  − Use case: IP fragmentation and reassembly
• Host command
  − Able to dequeue host commands from a QMan queue. The FMan executes the host command (such as a table update) and enqueues a response to QMan. Host commands require a dedicated port ID (one of the O/H ports).
  − The registers for offline and host command ports are named O/H port registers

IP Reassembly T4240 Processor Flow

[Flow diagram, summarized:]
• BMI: allocates a buffer and writes the frame and internal context (IC); for fragments, buffer allocation is done according to the fragment header only.
• Parser: parses the frame and identifies fragments.
• KeyGen: calculates a hash; the FMan controller links each fragment to the right reassembly context and starts reassembly.
• On a completed reassembly, KeyGen recalculates the hash and the FMan controller performs coarse classification; non-completed reassemblies terminate processing for that fragment.
• Storage profile selection: for regular frames, the storage profile is chosen according to frame header classification; for reassembled frames, it is chosen according to IP header classification only.
• BMI: enqueues the frame (regular/non-fragment or reassembled).

IP Reassembly FMan Memory Usage

• FMan memory: 386 KB
• Assumption: MTU = 1500 bytes
• Port FMan memory consumption:
  − Each 10G port: 40 KB
  − Each 1G port: 25 KB
  − Each offline port: 10 KB
• Coarse classification table memory consumption:
  − 100 KB for all ports
• IP reassembly:
  − IP reassembly overhead: 8 KB
  − Each flow: 10 bytes
• Example (reproduced in the sketch below):
  − Use case: 2x 10G ports + 2x 1G ports + 1x offline port
  − Port configuration: 2x40 + 2x25 + 10 = 140 KB
  − Coarse classification: 100 KB
  − IP reassembly, 10K flows: 10K x 10 B + 8 KB = 108 KB
  − Total = 140 KB + 108 KB + 100 KB = 348 KB
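The budgeting arithmetic above can be checked with a trivial program; the per-port constants are taken directly from this slide, and the structure makes it easy to re-run for other port mixes.

```c
/* FMan memory budget check for the example above. */
#include <stdio.h>

int main(void)
{
    unsigned n10g = 2, n1g = 2, noff = 1, flows = 10 * 1024;
    unsigned ports_kb = n10g * 40 + n1g * 25 + noff * 10;    /* 140 KB */
    unsigned cc_kb    = 100;                                 /* 100 KB */
    unsigned reasm_kb = 8 + (flows * 10 + 1023) / 1024;      /* 108 KB */
    unsigned total    = ports_kb + cc_kb + reasm_kb;
    printf("total = %u KB of 386 KB FMan memory\n", total);  /* 348 KB */
    return 0;
}
```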

Storage Profile

Virtual Storage Profiling for Rx and Offline Ports

• Storage profiles enable each partition and virtual interface to have dedicated buffer pools.
• Storage profile selection occurs after distribution function evaluation or after the custom classifier.
• The same storage profile ID (SPID) value from classification on different physical ports may yield a different storage profile selection.
• Up to 64 storage profiles per port are supported.
  − 32 storage profiles for FMan_v3L
• A storage profile contains:
  − LIODN offset
  − Up to four buffer pools per storage profile
  − Buffer start margin/end margin configuration
  − S/G disable
  − Flow control configuration

Data Center Bridging

Policing and Shaping

• Policing puts a cap on network usage and guarantees bandwidth.
• Shaping smooths out the egress traffic.
  − May require extra memory to store the shaped traffic.
• DCB can be used for:
  − Traffic between data center network nodes
  − LAN/network traffic
  − Storage area network (SAN) traffic
  − IPC traffic (e.g., InfiniBand (low latency))


Support for Priority-based Flow Control (802.1Qbb)

• Enables lossless behavior for each class of service.
• PAUSE is sent per virtual lane when buffer limits are exceeded. Triggers:
  − FQ congestion group state (on/off) from QMan
    . A priority vector (8 bits) is assigned to each FQ congestion group
    . FQ congestion group(s) are assigned to each port
    . Upon receipt of a congestion group state "on" message, for each Rx port associated with this congestion group, a PFC PAUSE frame is transmitted with the priority level(s) configured for that group
  − Buffer pool depletion
    . Priority level configured per port (shared by all buffer pools used on that port)
  − Near FMan Rx FIFO full
    . There is a single Rx FIFO per port for all priorities, so the PFC PAUSE frame is sent on all priorities
• PFC PAUSE frame reception
  − QMan provides the ability to flow control 8 different traffic classes; in CEETM, each of the 16 class queues within a class queue channel can be mapped to one of the 8 traffic classes, and this mapping applies to all channels assigned to the link

[Diagram: eight transmit queues paired with eight receive buffers over an Ethernet link; a STOP/PAUSE on one of the eight virtual lanes illustrates per-priority flow control.]

Support for Bandwidth Management (802.1Qaz)

• Hierarchical port scheduling defines the class-of-service (CoS) properties of output queues, mapped to IEEE 802.1p priorities.
• QMan CEETM enables Enhanced Transmission Selection (ETS, 802.1Qaz), with intelligent sharing of bandwidth between traffic classes:
  − Strict priority scheduling of the 8 independent classes; weighted bandwidth fairness within the 8 grouped classes
  − The priority of the class group can be independently configured to be immediately below any of the independent classes
• Supports 32 channels available for allocation across a single FMan
  − E.g., for two 10G links, 16 channels (virtual links) could be allocated per link
  − Supports weighted bandwidth fairness amongst channels
  − Shaping is supported on a per-channel basis
• Meets the performance requirements for ETS: bandwidth granularity of 1% and +/-10% accuracy

[Figure: offered vs. realized traffic on a 10 GE link over times t1-t3: 3 Gb/s HPC traffic, 3 Gb/s storage traffic, and bursty LAN traffic, with ETS reallocating unused bandwidth to the LAN class.]

QMan CEETM

CEETM Scheduling Hierarchy (QMan 1.2)

[Figure: per-LNI scheduling hierarchy. Class queues CQ0-CQ15 feed per-channel class schedulers (strict priority plus WBFS), which feed a shape-aware, weighted-fair channel scheduler for each network interface, with token bucket shapers for committed rate and excess rate.]

• Logic (figure color coding)
  − Green denotes logic units and signal paths that relate to the request and fulfillment of committed rate (CR) packet transmission opportunities
  − Yellow denotes the same for excess rate (ER)
  − Black denotes logic units and signal paths that are used for unshaped opportunities or that operate consistently whether used for CR or ER opportunities
• Schedulers
  − Channel scheduler: channels are selected to send frames from class queues
  − Class scheduler: frames are selected from class queues; class 0 has the highest priority
  − Per-channel class configurations include unshaped with 8 independent + 8 grouped classes, or shaped with 3 independent + 7 grouped or 2 independent + 8 grouped classes
• Algorithms
  − Strict Priority (SP)
  − Weighted scheduling
  − Shape-Aware Fair Scheduling (SAFS)
  − Weighted Bandwidth Fair Scheduling (WBFS)

Weighted Bandwidth Fair Scheduling (WBFS)

• Weighted Bandwidth Fair Scheduling (WBFS) is used to schedule packets from queues within a priority group such that each gets a "fair" amount of the bandwidth made available to that priority group.
• The premises for fairness of the algorithm are:
  − Available bandwidth is divided and offered equally to all classes
  − Offered bandwidth in excess of a class's demand is re-offered equally to classes with unmet demand
• Worked example, 10G total (simulated in the sketch below):

                          Initial distribution   First redistribution   Second redistribution
BW available              10G                    1.5G                   0.2G
Classes w/ unmet demand   5                      3                      2
BW offered per class      2G                     0.5G                   0.1G

Class     Demand   Retained  Unmet   Retained  Unmet   Retained  Attained BW
Class 0   0.5G     0.5G      0       -         -       -         0.5G
Class 1   2G       2G        0       -         -       -         2G
Class 2   2.3G     2G        0.3G    0.3G      0       -         2.3G
Class 3   3G       2G        1G      0.5G      0.5G    0.1G      2.6G
Class 4   4G       2G        2G      0.5G      1.5G    0.1G      2.6G
Total     11.8G    8.5G              1.3G              0.2G      10G
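The redistribution above is plain max-min (waterfilling) sharing, so it is easy to simulate. The sketch below, using the demands from the table, reproduces the attained-bandwidth column; it models only the fairness premise, not the real QMan hardware or its weights.

```c
/* Max-min fair redistribution: each round, offer the remaining
 * bandwidth equally to every class with unmet demand. */
#include <stdio.h>

int main(void)
{
    double demand[] = {0.5, 2.0, 2.3, 3.0, 4.0};   /* Gbps, classes 0..4 */
    double attained[5] = {0};
    const int n = 5;
    double avail = 10.0;                            /* total pool, Gbps */

    while (avail > 1e-9) {
        int unmet = 0;
        for (int i = 0; i < n; i++)
            if (attained[i] < demand[i]) unmet++;
        if (!unmet) break;                          /* all demand met */
        double share = avail / unmet;               /* equal offer */
        for (int i = 0; i < n; i++) {
            if (attained[i] >= demand[i]) continue;
            double take = demand[i] - attained[i];
            if (take > share) take = share;         /* excess re-offered */
            attained[i] += take;
            avail -= take;
        }
    }
    for (int i = 0; i < n; i++)                     /* 0.5 2.0 2.3 2.6 2.6 */
        printf("class %d: %.1fG of %.1fG\n", i, attained[i], demand[i]);
    return 0;
}
```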

DPAA: SEC Engine

Security Engine

• Black keys
  − In addition to protecting against external bus snooping, black keys now cryptographically protect against key snooping between security domains

• Blobs
  − Blobs protect data confidentiality and integrity across power cycles, but previously did not protect against unauthorized decapsulation or substitution of another user's blobs
  − In addition to protecting data confidentiality and integrity across power cycles, blobs now cryptographically protect against blob snooping/substitution between security domains

• Trusted descriptors
  − Trusted descriptors protect descriptor integrity, but previously did not distinguish between trusted descriptors created by different users
  − In addition to protecting trusted descriptor integrity, trusted descriptors now cryptographically distinguish between trusted descriptors created in different security domains

• DECO Request Source Register − Register added

QorIQ T4240 Processor SEC 5.0 Features

• Header and trailer off-load for the following security protocols:
  − IPSec, SSL/TLS, 3G RLC, PDCP, SRTP, 802.11i, 802.16e, 802.1AE
• (3) Public key hardware accelerators (PKHA)
  − RSA and Diffie-Hellman (to 4096b)
  − Elliptic curve cryptography (1024b)
  − Supports run-time equalization
• (1) Random number generator (RNG4)
  − NIST certified
• (4) Snow 3G hardware accelerators (STHA)
  − Implements Snow 3.0
  − Two for encryption (F8), two for integrity (F9)
• (4) ZUC hardware accelerators (ZHA)
  − Two for encryption, two for integrity
• (2) ARC Four hardware accelerators (AFHA)
  − Compatible with the RC4 algorithm
• (8) Kasumi F8/F9 hardware accelerators (KFHA)
  − F8, F9 as required for 3GPP
  − A5/3 for GSM and EDGE
  − GEA-3 for GPRS
• (8) Message digest hardware accelerators (MDHA)
  − SHA-1, SHA-2 with 256-, 384-, 512-bit digests
  − MD5 128-bit digest
  − HMAC with all algorithms
• (8) Advanced Encryption Standard accelerators (AESA)
  − Key lengths of 128, 192, and 256 bits
  − ECB, CBC, CTR, CCM, GCM, CMAC, OFB, CFB, and XTS
• (8) Data Encryption Standard accelerators (DESA)
  − DES, 3DES (2K, 3K)
  − ECB, CBC, OFB modes
• (8) CRC units
  − CRC32, CRC32C, 802.16e OFDMA CRC

[Diagram: queue interface and job ring interface feeding a job queue controller, descriptor controllers (DECOs), RTIC, DMA, and the pool of CHAs listed above.]

Life of a Job Descriptor

• The queue interface (QI) has room for more work and issues a dequeue request for 1 or 3 FDs
• QMan selects an FQ and provides 1 FD (shared descriptor, frame) along with FQ information
• QI creates an (internal) job descriptor and, if necessary, obtains output buffers
• QI transfers the completed job descriptor into one of the holding tanks
• The job queue controller finds an available DECO and transfers the JD to it
• The DECO initiates DMA of the shared descriptor from system memory and places it in the descriptor buffer with the JD from the holding tank
• The DECO executes descriptor commands, loading registers and FIFOs in its CCB
• The CCB obtains and controls CHA(s) to process the data per DECO commands
• The DECO commands the DMA to store results and any updated context to system memory
• As input buffers are emptied, the DECO tells QI, which may release them back to BMan
• Upon completion of all processing through the CCB, the DECO resets the CCB
• The DECO informs QI that the JD has completed with status code X, and that data of length Y has been written to address Z
• QI creates the outbound FD and enqueues it to QMan using the FQID from the Ctx B field

[Diagram: queue interface with storage profiles and FQ ID list; job queue controller with four job rings and eight holding tanks; a pool of eight DECOs with descriptor buffers and CCBs; arbitrated CHAs (AESA, MDHA, CRCA, KFHA, DESA, RNG4, AFHA, STHA F8/F9, PKHA, ZUC).]

DPAA: DCE

DPAA Interaction: Frame Descriptor Status/CMD

• The status/command word in the dequeued FD allows software to modify the processing of individual frames while retaining the performance advantages of enqueuing to an FQ for flow-based processing.
• The three most significant bits of the command/status field of the frame descriptor select the command.
• Token: pass-through data that is echoed with the returned frame.

3 MSB   Description
000     Process command
001     Reserved
010     Reserved
011     Reserved
100     Context invalidate command token
101     Reserved
110     Reserved
111     NOP command token

The process command encoding carries further control fields (USDC, USPC, SCUS, SCRF, Flush, CMD, UHC, B64, OO, RB, CE, CF, Z, I), and the output frame carries a status field.

DCE Inputs

• SW enqueues work to the DCE via frame queues (FQs); FQs define the flow for stateful processing
• FQ initialization creates a location (Context_A) for the DCE to use when storing flow stream context
• Each work item within the flow is defined by a frame descriptor, which includes length, pointer, offsets, and commands
• The DCE has separate channels for compress and decompress

[Diagram: FDs (PID, BPID, address, offset, length, status/cmd) flow from software FQs through work queues WQ0-WQ7 on the compress and decompress channels into the DCE's DCP portal, with flow stream context held at Context_A.]

DCE Outputs

• The DCE enqueues results to SW via frame queues, as defined by the FQ Context_B field
  − When buffers are obtained from BMan, the buffer pool ID is defined by the input FQ
• Each result is defined by a frame descriptor, which includes a status field
• The DCE updates the flow stream context located at Context_A as needed

[Diagram: result FDs flow from the DCE's DCP portal through the compress and decompress FQs back to software, with status written into each FD and the flow stream context updated at Context_A.]

PME

Frame Descriptor: STATUS/CMD Treatment

• PME frame descriptor commands
  − b111 NOP: NOP command
  − b101 FCR: flow context read command
  − b100 FCW: flow context write command
  − b001 PMTCC: table configuration command
  − b000 SCAN: scan command

FD layout: DD | LIODN offset | BPID | ELIODN offset | addr | Format | Offset | Length | Status/CMD
Scan command (b000) fields: SRV | F | S/M | E/R | SET | Subset

Life of a Packet inside the Pattern Matching Engine

[Example: FDs of TCP flow A (192.168.1.1:80 -> 10.10.10.100:16734) carrying "I want to search free " and "scale FTF 2014 event schedule" are scanned against the on-chip patterns, with pattern descriptors and state held in DDR memory, and user-definable reports returned.]

• Combined hash/NFA technology
• 9.6 Gbps raw performance
• Max 32K patterns of up to 128B length
• Patterns on-chip, e.g.:
  − Patt1 /free/ tag=0x0001
  − Patt2 /freescale/ tag=0x0002
• Key Element Scanning engine (KES)
  − Compares the hash values of incoming data (frames) against all patterns
• Data Examination Engine (DXE)
  − Retrieves the pattern with the matched hash value for a final comparison
• Stateful Rule Engine (SRE)
  − Optionally post-processes match results before sending the report to the CPU
• The pattern matcher frame agent (PMFA) moves frames between QMan/BMan and the scanning engines

Debug

Core Debug in Multi-Thread Environment

• Almost all resources are private; internal debug works as if the threads were separate cores
• External debug is private per thread. An option exists to halt both threads when one thread halts
  − While threads can be debug-halted individually, this is generally not very useful if the debug session cares about the contents of the MMU and caches
  − Halting both threads prevents the other thread from continuing to compute and thereby flushing the state of the thread that initiated the debug halt out of the L1 caches and the MMU

DPAA Debug Trace

• During packet processing, FMan can trace packet processing flow through each of the FMan modules and trap a packet.

FD layout: DD | LIODN offset | BPID | ELIODN offset | addr | Fmt | Offset | Length | STATUS/CMD

Summary

QorIQ T4 Series Advanced Features Summary

Feature                   Benefit
High perf/watt            188k CoreMark in 55 W = 3.4 CM/W. Compare to Intel E5-2650: 146k CM in 95 W = 1.5 CM/W; or Intel E5-2687W: 200k CM in 150 W = 1.3 CM/W. T4 is more than 2x better than E5, and 2x perf/watt compared to P4080, FSL's previous flagship.
Highly integrated SoC     Integration of 4x 10GE interfaces, local bus, Interlaken, and SRIO means fewer chips (it takes at least four chips with Intel) and higher performance density.
Sophisticated PCIe        SR-IOV for showing VMs a virtual NIC, 128 VFs (virtual functions); four ports with the ability to be root complex or endpoint for flexible configurations.
Advanced Ethernet         Data Center Bridging for lossless Ethernet and QoS; 10GBase-KR for backplane connections.
Secure boot               Prevents code theft, system hacking, and reverse engineering.
AltiVec                   On-board SIMD engine for sonar/radar and imaging.
Power management          Thread, core, and cluster deep sleep modes; automatic deep sleep of unused resources.
Advanced virtualization   Hypervisor privilege level enables safe guest OSes at high performance; IOMMU ensures memory accesses are restricted to the correct areas; virtualization of I/O blocks.
Hardware offload          Packet handling to 50 Gb/s; security engine to 40 Gb/s; data compression and decompression to 20 Gb/s; pattern matching to 10 Gb/s.
3x scalability            1-, 2-, and 3-cluster solutions span a 3x performance range over T4080-T4240, enabling customers to develop multiple SKUs from one PCB.

Other Sessions and Useful Information

• FTF 2014 sessions for QorIQ T4 devices
  − FTF-NET-F0070: QorIQ Platforms Trust Arch Overview
  − FTF-NET-F0139: AltiVec Programming
  − FTF-NET-F0146: Introduction to DPAA
  − FTF-NET-F0147: DPAA Usage
  − FTF-NET-F0148: DPAA Debug
  − FTF-NET-F0157: QorIQ Platforms Trust Arch Demo & Deep Dive

• T4240 product website
  − http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240
• Online training
  − http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=T4240&tab=Design_Support_Tab

Introducing the QorIQ LS2 Family

A breakthrough, software-defined approach to advance the world's new virtualized networks:
• New, high-performance architecture built with ease-of-use in mind
• Groundbreaking, flexible architecture that abstracts hardware complexity and enables customers to focus their resources on innovation at the application level
• Optimized for software-defined networking applications
• Balanced integration of CPU performance with network I/O and C-programmable acceleration that is right-sized (power/performance/cost) to deliver advanced SoC technology for the SDN era

Extending the industry's broadest portfolio of 64-bit multicore SoCs: built on the ARM® Cortex®-A57 architecture with an integrated L2 switch, interconnect, and peripherals to provide a complete system-on-chip solution.

QorIQ LS2 Family Key Features

• High performance cores with leading interconnect and memory bandwidth
  − 8x ARM Cortex-A57 cores, 2.0 GHz, 4 MB L2 cache, with Neon SIMD
  − 1 MB L3 platform cache w/ECC
  − 2x 64b DDR4 up to 2.4 GT/s
• A high performance datapath designed with software developers in mind
  − New datapath hardware and abstracted acceleration that is called via standard Linux objects
  − 40 Gbps packet processing performance with 20 Gbps acceleration (crypto, pattern match/RegEx, data compression)
  − Management complex provides all init/setup/teardown tasks
• Leading network I/O integration
  − 8x 1/10GbE + 8x 1G, MACSec on up to 4x 1/10GbE
  − Integrated L2 switching capability for cost savings
  − 4 PCIe Gen3 controllers, 1 with SR-IOV support
  − 2x SATA 3.0, 2x USB 3.0 with PHY
• Target applications: SDN/NFV, switching, data center, wireless access
• Unprecedented performance and ease of use for smarter, more capable networks

See the LS2 Family First in the Tech Lab!

4 new demos built on QorIQ LS2 processors:

Performance Analysis Made Easy

Leave the Packet Processing To Us

Combining Ease of Use with Performance

Tools for Every Step of Your Design

QorIQ T4240 SerDes Options

Total of four x8 banks

Ethernet options:
• 10 Gbps Ethernet MACs with XAUI or XFI
• 1 Gbps Ethernet MACs with SGMII (1 lane at 1.25 GHz, with a 3.125 GHz option for 2.5 Gbps Ethernet)
• 2 MACs can be used with RGMII
• 4x 1 Gbps Ethernet MACs can be supported using a single lane at 5 GHz (QSGMII)
• HiGig is supported with 4 lanes at 3.125 GHz or 3.75 GHz (HiGig+)

High-speed serial:
• 2.5, 5, 8 GHz for PCIe
• 2.5, 3.125, and 5 GHz for sRIO
• 3.125, 6.25, and 10.3125 GHz for Interlaken
• 1.5, 3.0 GHz for SATA
• 1.25, 2.5, 3.125, and 5 GHz for debug

Decompression/Compression Engine (DCE)

• Zlib: as specified in RFC 1950
• Deflate: as specified in RFC 1951
• GZIP: as specified in RFC 1952
• Encoding
  − Supports Base64 encoding and decoding (RFC 4648)
• ZLIB, GZIP, and DEFLATE header insertion
• ZLIB and GZIP CRC computation and insertion
• 4 modes of compression
  − No compression (just add the DEFLATE header)
  − Encode only, using static/dynamic Huffman codes
  − Compress and encode using static OR dynamic Huffman codes
  − At least a 2.5:1 compression ratio on the Calgary Corpus
• All standard modes of decompression
  − No compression
  − Static Huffman codes
  − Dynamic Huffman codes
• Provides the option to return the original compressed frame along with the uncompressed frame, or to release the buffers to BMan

[Block diagram: compressor (4 KB history) and decompressor (32 KB history) with QMan and BMan portal interfaces and a bus interface to CoreNet.]
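The three framings the DCE implements are the same ones the standard zlib library produces, where the windowBits argument of deflateInit2 selects zlib, gzip, or raw-deflate headers. The host-side reference sketch below is not DCE driver code; it only shows how the three RFC formats map onto one parameter.

```c
/* Reference: produce RFC 1950/1951/1952 framings with standard zlib.
 * Compile with -lz. */
#include <zlib.h>
#include <string.h>
#include <stdio.h>

static int compress_with(int window_bits, const char *label)
{
    static unsigned char in[] =
        "example payload for the three DEFLATE framings";
    unsigned char out[256];
    z_stream s;
    memset(&s, 0, sizeof(s));
    /* windowBits: 15 -> zlib (RFC 1950), 15+16 -> gzip (RFC 1952),
     * -15 -> raw deflate (RFC 1951) */
    if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     window_bits, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;
    s.next_in  = in;   s.avail_in  = sizeof(in) - 1;
    s.next_out = out;  s.avail_out = sizeof(out);
    deflate(&s, Z_FINISH);
    printf("%s framing: %lu bytes\n", label, s.total_out);
    return deflateEnd(&s);
}

int main(void)
{
    compress_with(15,      "zlib   ");
    compress_with(15 + 16, "gzip   ");
    compress_with(-15,     "deflate");
    return 0;
}
```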


www.Freescale.com

© 2014 Freescale Semiconductor, Inc.