IBM PowerEN Processor

The IBM Power Edge of Network™ Processor: A novel System-on-a-Chip for Wire Speed computing

Massimiliano Meneghin and Karol Lynch, High Performance Computing Group, IBM, Dublin

. What do we mean when we refer to wire speed computing and why do we need to design a new family of computer chips to meet its needs?

. What kind of features does a chip designed for wire speed computing have (i.e. the PowerEN™ architecture)?

. How does one program the PowerEN™?

. How will IBM deploy PowerEN™?

[Figure: word cloud of disruptive innovations: Cloud Computing, GPS, Wireless, SOA, RFID, Multicore CPUs, Mobile Devices, Petaflop Supercomputers, Software as a Service, Web 2.0, Scale Out Infrastructure, Smarter Planet, Wireless Convergence, Location Aware Apps, Secure Computing, Low-Latency Applications, Trading Applications, Virtual Worlds, Intelligent Devices, On Demand Services, Universal Connectivity, Services Sciences, Globally Integrated Enterprise]

. Data volumes are rising exponentially: network traffic capture, distributed sensor networks, GPS-enabled smart phones, etc.
. Increased analysis complexity requires more computing power: threat analysis, tracking of people and/or vehicles, problem determination on utility grids, etc.
. Applications require a fast response time to enhance their business value: trading markets, threat response, market insights, etc.

We face the challenge of rethinking how we build systems!

A new philosophy: the Wire-Speed Processor project defines a generic processor architecture in which general purpose cores, HW accelerators, and I/O functions are closely coupled.

[Figure: application space chart, built up over three slides. Horizontal axis: data (few to many). Vertical axis: computation per data item (just a few to a lot). Regions: High Performance Computing (HPC systems), general purpose computing (Mainframe, Intel, Power), streaming and networking (Network Processors), and the emerging application space between them.]

A blurring of the Network and Server worlds:
. Highly multi-threaded low power cores with full PPC ISA.
. Standard programming models with OS's and hypervisors.
. Virtualisation support for application consolidation.
. Accelerators for Networking and Application tiers.
. Integrated network system & memory I/O.
. Server RAS & infrastructure.
. Low total power solution based on throughput optimisation.

. Building block for network devices.
. Integrated switches: firewalling, intrusion detection and avoidance.
. Load balancing, traffic distribution, data filtering.
. Functional offloading, workload optimization.
. Cloud computing optimization, building block for appliances.
. Virtualization assistance.
. Storage Area Network optimisation.

[Figure: network diagram of a large site: WAN (ATM, FR), multi-service switch, Internet, WAP, L4-L7 switch, load-balanced core servers, SAN, "Cloud Computing"]

Objective: increase performance from a single chip
. Increase: single-thread performance
. Exploitation of: frequency scaling, ILP + out-of-order execution

This approach has reached its limit:
. Power Wall
. ILP Wall

End for the combination of CMOS technology & uniprocessor

That is unless you like toxicated chips!

Objective: increase performance from a single chip
. Increase: total throughput
. Exploitation of: Thread-Level Parallelism (TLP), Moore's law

More transistors per chip unit area: CMP architecture, with more CPUs on the same chip

Few complex CPUs (up to 4): Intel** dual/quad core, IBM Power 6/7
Many simple CPUs (up to 9): IBM Cell B.E., Sun** Niagara

Observation: where would you start optimising?

    while (!done) {
        ...
        for (int i = 0; i < 10; i++) {
            Alpha(...);
        }
        for (int i = 0; i < 100000000; i++) {   /* 10^8 iterations */
            Beta(...);
        }
        ...
    }

Some computations are performed more times than others. The effort of optimising one is rewarded N times.
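To put rough numbers on this observation, here is a minimal sketch in C. It assumes, purely for illustration, that the cost of one call to each routine is known and constant; the function name `overall_speedup` and the cost model are hypothetical and not part of the original slides.

```c
#include <assert.h>

/* Overall speedup of one outer-loop iteration when a single routine is made
 * `factor` times faster. calls_* are call counts per iteration; cost_* is the
 * assumed cost of one call, in arbitrary time units. */
double overall_speedup(double calls_opt, double cost_opt,
                       double calls_other, double cost_other,
                       double factor)
{
    assert(factor > 0.0);
    double before = calls_opt * cost_opt + calls_other * cost_other;
    double after  = calls_opt * cost_opt / factor + calls_other * cost_other;
    return before / after;
}
```

With 10 calls to Alpha and 10^8 calls to Beta of equal cost, doubling the speed of Beta nearly doubles overall speed, while doubling the speed of Alpha changes almost nothing.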

And back to the silicon?
. Specific HW: +perf, -power, -flexibility

Final solution: System on Chip (SoC)
. GP cores
. Specific HW accelerators
. I/O functionality
. All closely coupled
. Off-loading of core cycles to domain-specific hardware accelerators

. Maximisation of the power/performance ratio: smartly determine the maximum performance for a given thermal limit.
. Maintenance of system programmability despite the heterogeneous architecture: an easier life for programmers saves time and money.
. Bypassing architectural levels to avoid unnecessary overhead: system call overhead can be critical (Linux packet filtering).
. Efficient management of the I/O: fast (controller as close as possible to the CPUs) and smart (it can be tailored to specific requirements).

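[Figure: the application space chart again (data items, few to many, against computation per data item, low to high), with very large HPC systems, HPC systems, Mainframe, Intel, Power, and Network Processors positioned, and Smarter Planet applications shown alongside general purpose computing and streaming and networking.]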

[Figure: PowerEN chip block diagram: four AT chiplets (At0-At3), each with four A2 cores and a 2 MB L2 cache; two memory controllers (MC) with memory PHYs; pervasive logic and PIC; Crypto, XML, Comp/Decomp, and Pattern Matching Engine accelerators; PBus with internal and external I/F controllers; PCI Express gen. 2 (x16, bifurcatable); Ethernet Packet Offload Engine with 4x 10GE MAC and 2x 1GE MAC; flash ROM and misc I/O logic; x8 EI3 PHYs (4B+4B) and x1 PHYs]

Full System on a Chip whose components are highly integrated.
. 16 x 2.3GHz PowerPC cores known as A2 cores.
. A2 cores are 64-bit and come with full PowerPC ISA support.
. Each A2 core comes with support for 4 hardware threads.
. A2 cores are packaged in groups of 4 known as AT chiplets.
. Each AT chiplet has its own 2MB L2 cache.
. 2 x integrated memory controllers.

. 4 x 10Gb Ethernet interfaces, which are part of an integrated, optimized Ethernet Offload Engine (HEA) that handles advanced packet processing functions.

. XML, Cryptography, Regular Expressions and (De)Compression accelerators.

. Highly optimized for OS bypass operations; this integrated architecture is designed to reduce the overhead of inter-component communication.

© 2010 IBM Corporation PowerEN™ – IBM novel Wire Speed Processor

A2 compute node

64-bit CPUs + SMT + CMP + SMP

. General purpose Power ISA (Power Instruction Set Architecture) core
. Optimised for efficiency in size & power
. 4-way fine-grained simultaneous multi-threading
. 2-way concurrent issue
  • 1 Branch/Integer/Load/Store unit
  • 1 FP unit
  • Instructions from different threads each cycle
. In-order dispatch and execution
. 16KB 4-way set associative L1 I/D caches
. 64B cache line
. Binary compatibility for application level code
. Performance monitoring interface
. Multiple low-power states supported

The HEA is an integrated intelligent Ethernet controller with 4x10G ports which provides high bandwidth network connectivity for the PowerEN. The HEA could also be considered a hardware accelerator because it accelerates packet processing tasks. When we say that it is intelligent we mean that it is programmable and can do lots of useful things, such as off-loading protocol processing to hardware. In regular computing systems simply copying packets that arrive off the network from kernel to user space is a massive overhead. The HEA avoids this by giving user space applications direct access to packet buffers. It also supports virtualization.

[Figure: HEA block diagram: PBIC, host interface, four Ethernet ports (10GE or 1GE each), 4x 10GE MAC + 2x 1GE MAC]

Software interface
. Pull model: SW polls or takes an interrupt, then reads from the CQ (SW manages the CQ head pointer)
. Flexible queue interface
  • Queue Pairs, Completion Queues, and Event Queues
. Scatter/Gather descriptors
. Immediate data in queue elements

Receive path
. Flexible packet parsing
  • Checksum, queue selection, metadata
  • IPv4 and IPv6
. Checksum offload
  • TCP/UDP checksum validation
. Multicast/VLAN filtering
. Queue selection
. Cache injection
. Header/payload separation

[Figure: receive side flow. SW provides buffers by posting rWQEs (buffer address) to the RQ. 1: queue selection. 2: HEA gets an rWQE. 3: HEA DMAs the packet to the buffers in memory. 4: HW enqueues an rCQE (buffer address + metadata) on the rCQ, which the SW thread pulls.]
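The pull model on the receive side can be sketched in C as a ring that hardware fills and software drains. This is a minimal sketch under stated assumptions: the `rcqe` and `rcq` structures, the `valid` flag, and the ring size are hypothetical, since the slides do not describe the real HEA queue layout. The function `rcq_hw_enqueue` is only a software stand-in for the hardware enqueue step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RCQ_ENTRIES 8u  /* hypothetical ring size (power of two) */

/* Hypothetical receive completion element: buffer address plus metadata. */
struct rcqe {
    void    *buffer;   /* where the packet was placed */
    uint32_t length;   /* packet length in bytes (metadata) */
    bool     valid;    /* set by the producer, cleared by software */
};

struct rcq {
    struct rcqe ring[RCQ_ENTRIES];
    uint32_t head;     /* consumer index, managed by software */
    uint32_t tail;     /* producer index, managed by the "hardware" */
};

/* Stand-in for the hardware enqueue step; returns false if the ring is full. */
bool rcq_hw_enqueue(struct rcq *q, void *buffer, uint32_t length)
{
    assert(q != NULL);
    struct rcqe *e = &q->ring[q->tail % RCQ_ENTRIES];
    if (e->valid)
        return false;
    e->buffer = buffer;
    e->length = length;
    e->valid  = true;
    q->tail++;
    return true;
}

/* Software poll: copies out the next completion if one is ready. */
bool rcq_poll(struct rcq *q, struct rcqe *out)
{
    assert(q != NULL && out != NULL);
    struct rcqe *e = &q->ring[q->head % RCQ_ENTRIES];
    if (!e->valid)
        return false;
    *out = *e;
    e->valid = false;  /* hand the slot back to the producer */
    q->head++;         /* software advances the head pointer */
    return true;
}
```

A real driver would also need memory barriers between the producer's writes and the consumer's reads; they are omitted here because both sides run in the same thread.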

Network node targets appliance boxes situated in the middle of the network. The network node is not the destination point for packets, rather just an intermediate node in the network.

Example applications are routers, firewalls, and intrusion detection systems.

Features:
. Hardware assistance for maintaining packet ordering.
. Packet queues managed by hardware.
. Small WQE (PBus bandwidth saving).
. Hardware scheduler.
. Packet classification/parsing functionality.

[Figure: network node flow between receive side and send side: SW provides buffers (rWQE, buffer address); HEA performs queue selection and HW enqueue on the rCQ; the SW processing thread handles the packet; on the send side (SQs, scheduler, sCQE with buffer address) SW is notified that the packet is transmitted, or the buffer is re-enqueued to the RQ]

PowerEN Architecture - Hardware Accelerators

A hardware accelerator is a specialised computer chip that is capable of performing specific tasks much faster than is possible using a general purpose CPU, but at the cost of decreased flexibility. One of the best-known examples of a hardware accelerator is a video or graphics card.

Most existing accelerators are devices attached to an I/O bus (e.g. PCI Express).

For their targeted tasks accelerators can provide significantly more computation performance and chip density at a lower power budget.

Since accelerators differ from general purpose CPUs, the way in which they are programmed also differs (e.g. OpenGL/Direct3D** for graphics cards).

This can make the task of developing software more difficult (e.g. specialised graphics programmers are needed to develop video games).

The PowerEN™ contains the following accelerators:
. XML Accelerator.
. Regular Expression and Pattern-matching Accelerator.
. Compression and Decompression Accelerator.
. Cryptographic Accelerator.

These accelerators have been specifically chosen based on the requirements of network-facing applications.

In later chips based on the WSP architecture these specific accelerators could be replaced by other accelerators with different functionality.

The PowerEN has many features that simplify the task of programming accelerators.

Application programs will immediately be able to take advantage of the accelerators in the PowerEN because the software environment will provide accelerated versions of standard libraries.

The PowerEN™ provides the performance advantages of accelerators while reducing the software development costs associated with the heterogeneity of accelerators.

Cache injection into the L2 cache is supported so that data from special-purpose accelerators and packets from the network do not need to be stored on off-chip memory before being accessed by an A2 thread. Having data readily available in the cache reduces the latency associated with memory load operations.

In contrast to existing systems, in which most accelerators are devices attached to an I/O bus and are programmed through system-level interfaces, the individual complexes in the WSP (compute, accelerator, and packet processing) all operate using the same application address space (i.e., virtual address space):

Principle of uniform addressability in a heterogeneous environment

PowerEN™ is heavily optimised for user space operation on accelerators:
. Operations on the accelerators can be initiated directly from user space, thus avoiding the overhead of a system call, which involves a switch to the kernel.
. This is achieved by using a new instruction that has been added to the PowerPC ISA for the WSP architecture.
. The accelerators also "understand" virtual addresses, and thus user space processes and accelerators "work at the same desk".
. Accelerator input data can be DMA'd directly from application virtual address space, and the output from the accelerators, after computation has completed, can be DMA'd directly back to application virtual address space.
. Thus all accelerator/application interaction can bypass the kernel.

[Figure: software stack: application in user space, Linux, hypervisor, devices, PowerEN hardware]

PowerEN Architecture – Cryptography Accelerator

Cryptography is a key component of many networking applications, as many require secure communication (online banking, secure remote access). As cryptography is highly computationally expensive and difficult to parallelise, a specialised hardware accelerator is required to support these networking applications at wire speed.
. Variety of algorithms supported:
  • Cryptography: DES/3DES, AES, GMAC, RC4
  • Signature/Hash: SHA-1, SHA-2, MD5, HMAC, AES-XCBC-MAC-96
. Supports full coherency protocol. Data can come/go from cache and/or memory, with no alignment restrictions.
. Random Number Generator: supplies a 64b random number, supports FIPS 140 compliance.
. Can be leveraged either using an optimized implementation of OpenSSL** or a special library developed for programming PowerEN™ accelerators.

PowerEN Architecture – (De)Compression Accelerator

Compression is a useful method of overcoming limited bandwidth (for example on wireless networks). Compression can also be used to reduce storage requirements (e.g. SAN).
. The (De)Compression Accelerator provides hardware support for implementing the DEFLATE algorithm:
  • LZ77 coding
  • Huffman coding
. Can be used to develop optimised implementations of gzip and zlib, for example.
. Can be utilised by applications either through an optimized version of zlib or through other user space libraries/interfaces that have been developed especially for programming the accelerators.
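To give a feel for the LZ77 stage of DEFLATE, here is a toy software encoder and decoder in C. It is only a sketch of the general technique: the token format (`struct lz77_token`), the window size, and the function names are hypothetical choices for illustration, it does not produce a DEFLATE bitstream, and the Huffman stage is left out entirely. It says nothing about how the accelerator is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW  255  /* how far back a match may start (toy value) */
#define MAX_LEN 255  /* longest match carried by one token (toy value) */

/* One token: copy `length` bytes starting `offset` bytes back, then append
 * the literal byte `next`. length == 0 means "literal only". */
struct lz77_token {
    unsigned char offset;
    unsigned char length;
    unsigned char next;
};

/* Encodes n bytes of src. Returns the number of tokens written, or 0 if
 * `out` (capacity `cap` tokens) is too small or the input is empty. */
size_t lz77_encode(const unsigned char *src, size_t n,
                   struct lz77_token *out, size_t cap)
{
    size_t pos = 0, count = 0;
    while (pos < n) {
        size_t best_len = 0, best_off = 0;
        size_t start = pos > WINDOW ? pos - WINDOW : 0;
        for (size_t cand = start; cand < pos; cand++) {
            size_t len = 0;
            /* keep one byte in reserve for the trailing literal */
            while (len < MAX_LEN && pos + len + 1 < n &&
                   src[cand + len] == src[pos + len])
                len++;
            if (len > best_len) {
                best_len = len;
                best_off = pos - cand;
            }
        }
        if (count == cap)
            return 0;
        out[count].offset = (unsigned char)best_off;
        out[count].length = (unsigned char)best_len;
        out[count].next   = src[pos + best_len];
        count++;
        pos += best_len + 1;
    }
    return count;
}

/* Decodes tokens into dst. Returns bytes written, or 0 on malformed input
 * or if dst (capacity `cap` bytes) is too small. */
size_t lz77_decode(const struct lz77_token *tok, size_t count,
                   unsigned char *dst, size_t cap)
{
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        if (tok[i].length > 0 && (tok[i].offset == 0 || tok[i].offset > pos))
            return 0;
        if (pos + tok[i].length + 1 > cap)
            return 0;
        for (size_t k = 0; k < tok[i].length; k++) {
            dst[pos] = dst[pos - tok[i].offset];  /* may overlap: byte by byte */
            pos++;
        }
        dst[pos++] = tok[i].next;
    }
    return pos;
}
```

The exhaustive search over the window is quadratic and is used here only for clarity; practical compressors use hash chains or similar structures to find matches.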

PowerEN Architecture – Pattern Matching/RegX Accelerators

. Regular expressions are a formal method for specifying sets of strings.
. Used by intrusion detection/prevention systems.
. Can process 8 requests in parallel.
. Implementation based on programmable state machines (known as Bart Finite State Machines).
. Again, application developers who wish to leverage this accelerator will have a choice between an optimised port of a standard pre-existing library (e.g. Perl** Compatible Regular Expressions) and a tailored interface developed especially for this accelerator.
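The idea of matching with a programmable state machine can be illustrated in software with a table-driven automaton: the pattern lives in a transition table that is loaded as data, and the matching loop itself never changes. The sketch below is a generic illustration, not a description of the accelerator's internals, which the slides do not give. The table is hand-built for the example pattern `ab*c`, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { S_START, S_BODY, S_ACCEPT, S_DEAD, N_STATES };

/* transition[state][input byte] gives the next state. */
static unsigned char transition[N_STATES][256];

/* "Programs" the state machine for the pattern ab*c. */
void dfa_load(void)
{
    /* every unlisted transition leads to the dead state */
    memset(transition, S_DEAD, sizeof transition);
    transition[S_START]['a'] = S_BODY;
    transition[S_BODY]['b']  = S_BODY;
    transition[S_BODY]['c']  = S_ACCEPT;
}

/* Returns true if the whole string s matches the loaded pattern. */
bool dfa_match(const char *s)
{
    assert(s != NULL);
    unsigned state = S_START;
    for (; *s != '\0'; s++)
        state = transition[state][(unsigned char)*s];  /* one lookup per byte */
    return state == S_ACCEPT;
}
```

Because each input byte costs exactly one table lookup, the time per byte does not depend on the complexity of the pattern, which is one reason state machines suit processing at line rate.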

Extensible Markup Language (XML): a simple, standard, text-based format for representing structured information: data + semantics

Some features: standard, human readable, machine independent, extensible, ...

Windscreen Wiper The Windscreen wiper automatically removes rain from your windscreen, if it should happen to splash there. It has a rubber blade which can be ordered separately if you need to replace it.

Where is it used?
. Communication: XML-RPC, Web Services (SOAP), Globus (grid infrastructures)
. Word processing: ODF and OOXML formats
. Graphics: formats such as SVG
. Querying: XQuery (XPath + SQL-like expressions)

PowerEN Architecture – XML Accelerators

The clearest example of the power of the principle of uniform addressability in a heterogeneous environment.
. Offloads compute-intensive XML processing.
. Parses the XML document and post-processes the results: SOAP validation, schema validation, XPath validation, XSLT assist.
. Supports a fully streaming and concurrent model:
  • input in chunks up to 64KB in size, directly from user space documents (zero-copy)
  • incremental output results
  • thousands of concurrent documents
. Two types of output results, including TLA (incrementally consumed).

[Figure: XML accelerator block diagram: XML engines #1 to #N, Qname assigner, QNAME cache, programmable post-processing engine (PPE), input and output message rings, PB arbiter/mux, PowerBus interface with DMA engines. Inside an XML engine: character FIFO, well-formedness check, XML parse, output generator.]


. A software thread executes an ICSWX instruction (ICSWX = Initiate Co-Processor Store Word Indexed).

. Hardware adds the process and guest OS ID information and turns the ICSWX instruction into a CRB (Co-processor Request Block).

. The CRB arrives at the destination PBIC, where it is validated and queued.

. The accelerator services the request. Virtual addresses are translated to physical addresses within the PBIC, using a translation table. This is similar to the general-purpose core translation, except that TLB misses take longer to resolve.

. The accelerator sends back an update into the CSB.

[Figure: request path from an A2 core in chiplet At0 (ICSWX, virtual addresses, 16B) over the interconnect as a CRB (64B) to the PBICs serving the Comp/Decomp, Crypto, RegX, XML, and ADM accelerators]

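. One synchronous model

. Two asynchronous programming models

[Figure: the programming models compared: Sync Commit, Async Commit, and Async Completion (by interrupt or by store), showing the thread before sending the CRB and the virtual addresses (16B) passed with the request]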
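The difference between a synchronous and an asynchronous request, as seen by the application, can be mimicked in plain C. Everything in this sketch is hypothetical: the `crb` and `csb` structures and their fields do not reflect the real request layout, and `fake_accelerator` is a software stand-in that just copies bytes. Only completion by store is imitated; the interrupt variant is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status record, written by the accelerator on completion. */
struct csb {
    volatile bool done;
    int status;            /* 0 = success (assumed convention) */
};

/* Hypothetical request: virtual addresses of input, output, and status. */
struct crb {
    const void *src;
    void       *dst;
    size_t      len;
    struct csb *csb;
};

/* Software stand-in for an accelerator servicing one queued request. */
void fake_accelerator(struct crb *req)
{
    memcpy(req->dst, req->src, req->len);
    req->csb->status = 0;
    req->csb->done = true;   /* the store that signals completion */
}

/* Asynchronous style: submit and return at once; the caller checks
 * csb->done later. The stand-in accelerator must be run separately. */
void submit_async(struct crb *req)
{
    assert(req != NULL && req->csb != NULL);
    req->csb->done = false;
}

/* Synchronous style: submit, then wait until the status is updated. */
int submit_sync(struct crb *req)
{
    submit_async(req);
    fake_accelerator(req);   /* imitates the hardware completing the work */
    while (!req->csb->done)
        ;                    /* wait for the completion store */
    return req->csb->status;
}
```

In the asynchronous style the thread is free to do other work between submission and the completion check, which is where the benefit of off-loading comes from.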

Synchronisation is a central concept in parallelism. The cost of synchronisation bounds the granularity of the possible parallelism. Forcing a thread to wait for a certain event is not a simple operation when targeting performance. In normal systems there are a few possibilities, each with some drawbacks:

. pthread_cond_wait: advises the OS scheduler that the thread is waiting for an event. The cost of waiting on a condition variable should be little more than the cost of a context switch plus the time to unlock and lock the associated mutex. This overhead is not compatible with the synchronisation required by fine-grained parallelisation.

. Polling: repeatedly read a memory location until a predefined value is found. Suitable for fine-grained parallelisation, but this is active waiting: it costs CPU time, heat dissipation, and memory bandwidth.

The main issue with the previous approaches is that the "wait" operation is not a native mechanism of the hardware/firmware level. PowerEN extends its instruction set with the waitrsv instruction.

Signature: waitrsv(addr, expected_val)

Semantics:
. The A2 stops dispatching instructions belonging to the thread (no context switch).
. The A2 hardware thread is put into a low energy consumption state.
. The A2 restarts dispatching the thread's instructions when the target value is matched.

The polling loop

    do {} while ((*addr_x) != expected_val);

is replaced by

    waitrsv(addr_x, expected_val)

Implementation aspects: based on the Power memory reservation mechanisms.
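For comparison, the polling alternative can be written portably with C11 atomics. This is a sketch of the active-waiting approach that waitrsv is meant to replace, not an emulation of the instruction; the function name `poll_until_equal` is hypothetical. It returns the number of failed polls as a rough measure of the work wasted while waiting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Busy-waits until *addr equals `expected`. Returns how many polls failed
 * before the value matched. */
unsigned long poll_until_equal(_Atomic uint32_t *addr, uint32_t expected)
{
    assert(addr != NULL);
    unsigned long wasted = 0;
    /* acquire ordering: data written before the flag was set is visible
     * once the expected value has been observed */
    while (atomic_load_explicit(addr, memory_order_acquire) != expected)
        wasted++;
    return wasted;
}
```

Every failed poll is a load that occupies an issue slot the other hardware threads on the core could have used, which is the cost a native wait instruction avoids.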

The Co-processor library
. Part of the PowerEN SDK (i.e. WSP SDK)
. C library for programming one of the PowerEN co-processors (accelerators)
. The current beta SDK version (Sep. 2010) supports the following accelerators:
  • Compression
  • Crypto
  • ADM (Asynchronous Data Mover)
. The beta SDK is still being enhanced both for functionality and performance

Libcop supports the basics for sending requests (CRBs) to the co-processors. It is intended to be used by standard libraries, such as zlib and OpenSSL, as well as performance-sensitive applications.

[Figure: software stack: Application, zlib / OpenSSL, libcop, Accelerator]

PowerEN is halfway between a general purpose and a special purpose computing platform; as such it is well suited for specific classes of applications:
. Network I/O bound applications.
. Highly multi-threaded applications with "lightweight threads".
. Applications that require lots of XML processing.
. Solutions that require high volume crypto.
. Applications that must search for patterns and regular expressions in large volumes of data.
. Applications that require high speed compression and decompression.

PowerEN is not suited for:
. Heavyweight or floating point intensive computing.
. Vector processing.

Major Chip Interfaces and 4-chip Scaling

[Figure: four PowerEN system-on-a-chip devices, each with DDR3 DRAM memory channels, 10/1Gb Ethernet ports, PCI Express, and base I/O (including flash), joined by chip-to-chip connections for SMP scaling]

Single-Chip Module: 1 chip, 2.3-3.0 GHz, ~65W

. Chroma Card: PCIe, 1 socket, 16 cores
. Blades: 1-2 sockets, 16-32 cores per blade
. 1U-2U Rackmount: 1-4 sockets, 16-64 cores

Summary of key features of PowerEN

PowerEN delivers a significant leap in chip- & system-level throughput optimization
. Massive multi-threading processor: 4 to 256 threads in a single memory image.
. Throughput-optimized design: multi-10GbE I/O, cores, accelerators, high memory BW.
. Integrated on-chip accelerators: crypto, RegEx, XML, compress/decompress, HEA.

System-on-a-chip design flexibility with great performance efficiency
. Configuration options (4-64 cores, SCM/DCM, 2.0-3.0 GHz, integrated PCIe & 1G/10G Enet, on-chip accelerators, ...) allow easy system design optimizations.
. System-optimized throughput: eDRAM for L2 caches, latency-hiding with threads, high BW on-chip buses.
. Family of design points: 2U rackmount down to PCIe adapter cards.
. Priority focus on edge computing, infrastructure & integrated stacks.

** Trademark, service mark, or registered trademark of OpenSSL Project, The Perl Foundation, Linus Torvalds, Sun Microsystems, Microsoft Corporation.