I/O Virtualization and System Acceleration in POWER8

Michael Gschwind, IBM Power Systems

Industry Trends Generate New Opportunities

• System stack innovations are required to continue cost/performance improvements
  – System stack: Applications and Services; Systems Management & Cloud Deployment; Systems Acceleration & HW/SW Optimization; Operating System and Hypervisor; Processors; Semiconductor Technology

[Chart: price/performance (log scale), 2004–2014, with gains attributed to Workload Acceleration, Services Delivery Model, Advanced Memory Technology, and Network & I/O Acceleration]

© 2015 International Business Machines Corporation

System Design ca. 2015

• Consolidation in the Cloud
  – Virtual Machine Aggregation
  – Processor Virtualization

• I/O Aggregation
  – I/O Virtualization
  – I/O Isolation

• Maintain performance growth for critical workloads
  – Under power constraints
  – Under cost constraints
  – Under area constraints

Innovation Through Open Standards

• Consolidation in the Cloud → Power Instruction Set Architecture
  – Virtual Machine Aggregation
  – Processor Virtualization

• I/O Aggregation → I/O Design Architecture
  – I/O Virtualization
  – I/O Isolation

• Maintain performance growth for critical workloads → Coherent Accelerator Interface Architecture
  – Under power constraints
  – Under cost constraints
  – Under area constraints

I/O Design Architecture

• I/O Virtualization
  – I/O address translation
  – Single-level translation  supports contiguous PowerVM partition memory
  – Hierarchical translation  supports KVM partition memory

• Partition data isolation
  – Separate I/O address spaces for partitions

• Partition fault isolation
  – Fault management domains based on Partitionable Endpoints (PE)

Partitioning and Managing the I/O Space: Partitionable Endpoints

[Diagram: two PCIe Host Bridges, each holding per-Partitionable-Endpoint state; PCIe switches and ports fan out to plain adapters and to an SR-IOV adapter with one Physical Function (PF) and multiple Virtual Functions (VFs). Each adapter, each VF, and each switch subtree below a port forms its own Partitionable Endpoint.]

Managing Partitionable Endpoints

• Many I/O functions tracked using Partitionable Endpoints
  – MMIO Load/Store address domains
  – DMA I/O bus address domains and TCEs
  – Interrupts
  – Ordering of transactions per PE
  – PE Error and Reset domains

• Enhanced RAS capabilities
  – Enhanced I/O Error Handling (EEH)
  – Partition isolation

[Diagram: POWER8 chip whose PHBs validate and translate PCIe DMA to memory and track interrupts]
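The per-PE error containment behind EEH can be sketched in a few lines of C. All type and field names below are invented for illustration, not the PHB's actual structures; the one behavior taken from the design is that a frozen PE answers MMIO loads with all-ones so the fault stays confined to that endpoint:

```c
#include <stdint.h>
#include <assert.h>

/* Invented-name sketch of per-PE state as a PHB might track it. */
struct pe_state {
    int      frozen;        /* set by hardware when an I/O error hits this PE */
    uint64_t mmio_base;     /* PE's MMIO load/store address domain            */
    uint64_t dma_window;    /* base of the PE's DMA I/O bus address range     */
};

/* An MMIO load against a frozen PE returns all-ones instead of
 * propagating the error machine-wide; other PEs are untouched. */
uint64_t pe_mmio_load(const struct pe_state *pe, uint64_t value_at_addr)
{
    if (pe->frozen)
        return ~0ULL;
    return value_at_addr;
}
```

An EEH-aware driver watches for this all-ones signature and then requests a reset of just that PE rather than the whole bridge.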

I/O DMA memory translation

• Translate I/O DMA memory addresses to system memory addresses
  – Isolation of partitions
  – Create a contiguous view of non-contiguous data
  – Enable 32-bit devices on 64-bit systems
• Processor chip contains a cache of recently accessed translation tables
  – Full tables stored in system memory

[Flow: I/O bus DMA address → validate → validated I/O bus DMA address → translate → validated and translated I/O bus DMA address]
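The translate step can be illustrated with a minimal single-level TCE walk in C. The constants and names are invented for this sketch; a real PHB also applies per-PE DMA windows and freezes the owning PE on a fault rather than returning 0:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* A TCE (Translation Control Entry) maps one 4 KiB I/O bus page to a
 * real page in system memory; low bits carry access permissions. */
#define IO_PAGE_SHIFT 12
#define IO_PAGE_MASK  ((1ULL << IO_PAGE_SHIFT) - 1)
#define TCE_READ      0x1ULL
#define TCE_WRITE     0x2ULL

/* Translate an I/O bus DMA address using a flat TCE table.
 * Returns the system memory address, or 0 on a range/permission fault. */
uint64_t tce_translate(const uint64_t *tce_table, size_t n_entries,
                       uint64_t bus_addr, int is_write)
{
    uint64_t idx = bus_addr >> IO_PAGE_SHIFT;
    if (idx >= n_entries)
        return 0;                       /* outside the DMA window */
    uint64_t tce = tce_table[idx];
    if (!(tce & (is_write ? TCE_WRITE : TCE_READ)))
        return 0;                       /* access not permitted */
    /* Swap the bus page number for the real page number, keep the offset. */
    return (tce & ~IO_PAGE_MASK) | (bus_addr & IO_PAGE_MASK);
}
```

A hierarchical table repeats the index/lookup step once per level; as the slide notes, recently used entries are cached on the processor chip while the full table lives in system memory.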

Partitionable Endpoint Management

• DMA validation
  – “Trusted” Requester ID (RID): 16-bit bus/device/function number
  – “Untrusted” 64-bit address, received from the device driver
• Interrupts tracked in system memory
• PEs impacted by an I/O error identified via RID
  – SR-IOV Physical Function fault  PEs of the Physical and all Virtual Adapter Functions
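The trust split above can be sketched as follows (names invented for illustration): the hardware-stamped RID selects the owning PE, and only then is the driver-supplied address checked against that PE's resources.

```c
#include <stdint.h>
#include <assert.h>

/* The 16-bit RID (bus/device/function) is stamped on each PCIe
 * transaction by hardware and is therefore trusted; the 64-bit DMA
 * address was programmed by the partition's device driver and is not. */

/* rid_to_pe[rid] holds the owning PE number, or -1 if unassigned. */
int dma_owning_pe(const int16_t *rid_to_pe, uint16_t rid)
{
    return rid_to_pe[rid];   /* -1 => reject before any translation */
}

/* Accept a DMA only if its RID maps to a PE; the untrusted address is
 * then validated against that PE's own DMA window and TCE table. */
int dma_accept(const int16_t *rid_to_pe, uint16_t rid)
{
    return dma_owning_pe(rid_to_pe, rid) >= 0;
}
```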

[Diagram: on-chip structures: DMA validation (TCE cache indexed by DMA address and RID, yielding a PE# per operation), an interrupt-source-controller state machine, and an error-state-tracking machine; the interrupt vector array and full state live off-chip in system memory]

Coherent Accelerator Processor Interface (CAPI)

• Integrated in every POWER8 system
• Builds on a long history of IBM workload acceleration
• Integrates with big-endian and little-endian accelerators

Microprocessor Trends

[Chart: log(performance) over time across successive eras of single-thread, multi-core, hybrid, and special-purpose designs; within each era, gains come from more active transistors and higher frequency]

Heterogeneous System Challenges

• The 4 ‘P’s of System Design

• Programmer Productivity
• Realize accelerator Performance benefits
• Portability: investment protection for applications
• Partitioning for multi-user systems: processes, partitions

Workload-optimized acceleration with coherent accelerators

• Attached accelerators
  – Accelerate functions that do not fit the traditional CPU model
  – Heterogeneous System Architecture

• Coherent integration in system architecture
  – Data sharing
  – Programming
  – Performance

Application Acceleration

• Fine-grained data sharing  coherent, shared memory

• Accelerator-initiated data accesses/transfers  coherent, shared memory

• Pointer identity  shared addressing

• Flexible synchronization  symmetric, programmable interfaces
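Pointer identity is what lets the host hand an accelerator a raw pointer into a live data structure. The sketch below uses plain C, with a host function standing in for the AFU (the real accelerator runs behind the PSL); it shows why no marshalling or copying is needed when both sides resolve the same virtual addresses:

```c
#include <stddef.h>
#include <assert.h>

struct node { int value; struct node *next; };

/* "Accelerator" stand-in: autonomously walks the host's list using the
 * host's own pointers; with shared addressing these virtual addresses
 * are equally valid on the AFU side. */
int afu_sum_list(const struct node *head)
{
    int sum = 0;
    for (; head; head = head->next)
        sum += head->value;
    return sum;
}
```

With a conventional I/O device, the same list would first have to be serialized into a pinned, physically addressed buffer before the device could touch it.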

CAPI Acceleration overcomes Device Driver Deceleration

Typical I/O Model Flow: Total ~13µs for data prep

Flow with Coherent Model: Total ~0.36µs

Workload-optimized acceleration

• On-chip integrated accelerators (SoC design)
  – Compute accelerator (Cell BE)
  – Compression (P7+)
  – Encryption (P7+)
  – Random number generation (P7+)
  – …

• SoC design offers the highest integration, but…
  – Requires a new chip design for each accelerator
  – Long time to market
  – Requires very high volumes

[Die photos: Cell BE and POWER7+]

CAPI: Coherent Accelerator Processor Interface

• Integrate accelerators into the system architecture
  – Modular interface
  – Third-party high-value-add components

• Standardized, layered protocol
  – Architectural interface
  – Functional protocol
  – PCIe signaling protocol
• Create workload-optimized, innovative solutions
  – Faster time to market
  – Lower bar to entry
  – Variety of implementation options: FPGAs, ASICs

[Diagram: POWER8 chip with a coherence-bus proxy attached to the accelerator's PSL (* Power Service Layer)]

CAPI accelerator programming

Virtual Addressing
• Pointer identity: pointers reference the same object as in the host application
• CAPI accelerators work with the same virtual memory addresses as the CPU
• CAPI shares page tables and provides address translation of the host application
• Peer-to-peer programming between CPU and accelerator with in-memory data sharing

Virtualization and Partitioning
• Address translation supports process isolation  accelerator has access to application context (only)
• Address translation supports partition isolation  accelerator has access to partition data (only)

CAPI accelerator programming

Hardware-Managed Cache Coherence
• No need for memory pinning
• Data fetched by accelerator based on accelerator application flow
• Accelerator participates in locks
• Low-latency communication


CAPI Accelerator Virtualization

• Dedicated Model
  – Accelerator assigned to a single process in a partition
  – Binding via Operating System (process) and Hypervisor (partition)
• Time-quantum shared programming model
  – Protocol-controlled model
• Accelerator-directed shared programming model
  – Networking model (select context based on incoming data)
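For the accelerator-directed shared model, context selection might look like the following sketch (all names invented): a tag carried by the incoming data picks the process context, and with it the work queue and address-translation context that will service the request.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Invented-name sketch: each attached process has its own context with
 * its own work queue and its own address translation. */
struct afu_context { int id; int queued; };

/* The AFU selects a context for incoming data, e.g. by a flow tag,
 * and enqueues the work on that context's queue. */
struct afu_context *afu_dispatch(struct afu_context *ctxs, size_t n,
                                 uint32_t flow_tag)
{
    struct afu_context *c = &ctxs[flow_tag % n];  /* context selection */
    c->queued++;                                  /* place on its queue */
    return c;
}
```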

[Diagram: accelerator PSL with per-context work queues, attached via the coherence-bus proxy]

Coherent acceleration of data sharing

• Offload data synchronization and data transfers
  – No need to invalidate data before initiating I/O transfers (cost avoided by cache coherence)
  – Accelerator feeds itself  avoid use of high-function CPU as data mover
• Available translation, transfer, and synchronization bandwidth scales with the system
  – As more accelerators are used, available resources scale up (cost avoided by system-wide page tables)
  – Avoid CPU becoming a serial-execution bottleneck

[Charts: normalized time for compute vs. cache management, and compute vs. address management, over 1–16 SPEs; Gschwind, ICCD 2008]

• Garbage collection accelerator
  – Explore boundaries of acceleration
  – Study accelerator programmability
• Pointer identity: advanced data structures
  – Autonomous traversal of data
• Self-paced memory access
  – Simplifies data management
  – No callbacks to request data
  – Zero-copy data access
  – Handle complex access patterns

[Chart: normalized area×delay product; Cher & Gschwind, VEE 2008]

CAPI Attached Flash Optimization

• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs to eliminate 97% of path length
• Saves 20–30 cores per 1M IOPS

[Diagram: conventional path: application → read/write syscall → file system → LVM (strategy()/iodone()) → disk and adapter device driver (pin buffers, translate, map DMA, start I/O; interrupt, unmap, unpin, iodone scheduling): ~20K instructions. CAPI path: application → POSIX async I/O style API (aio_read(), aio_write()) → shared-memory work queue: ~500 instructions.]
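The short path is deliberately shaped like POSIX asynchronous I/O. As a rough illustration of that submit-and-poll style, here is standard POSIX aio against an ordinary file descriptor; the CAPI flash library's actual entry points are not shown and may differ:

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

/* Submit a read and poll for completion, the way an application on the
 * CAPI flash path queues commands and checks the shared-memory work
 * queue instead of descending through file system, LVM, and driver. */
ssize_t aio_read_block(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* poll, don't signal */
    if (aio_read(&cb) != 0)
        return -1;
    while (aio_error(&cb) == EINPROGRESS)
        ;                        /* real code would do useful work here */
    return aio_return(&cb);      /* bytes read, or -1 */
}
```

On the CAPI path the same submit-and-poll pattern runs almost entirely in user space, which is where the ~20K-to-~500 instruction reduction comes from.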

CAPI Unlocks the Next Level of Performance for Flash

Identical hardware (POWER S822L with FlashSystem), two different paths to data:
• >5x better IOPS per hardware thread
• >2x lower latency

[Charts: IOPS per hardware thread (scale to 120,000) and latency in µs (scale to 500), conventional I/O vs. CAPI]

What CAPI Means for NoSQL Solutions

Today’s NoSQL in memory (x86):
• WWW traffic through a 10Gb uplink to a load balancer, fanned out to many nodes with 500GB cache each, plus backup nodes
• Infrastructure requirements: large distributed scale-out, large memory per node, networking bandwidth, load balancing

Differentiated NoSQL (POWER8 + CAPI + FlashSystem):
• WWW traffic through a 10Gb uplink directly to a POWER8 server with a CAPI-attached flash array of up to 56TB
• Infrastructure attributes: 192 threads in a 4U server drawer, 56TB of flash per 2U drawer, shared memory and cache for dynamic tuning, elimination of I/O and network overhead, cluster solution in a box

Power CAPI-attached FlashSystem for NoSQL regains infrastructure control and reins in the cost to deliver services.

Summary

• POWER8 delivers advanced virtualization for CPU and I/O
  – High-performance I/O virtualization
  – Data isolation
  – Fault isolation

• POWER8 takes the next step in exploiting system accelerators
  – Eliminate overheads inherent in the I/O model
  – Reduced latency
  – Increased throughput

• Collaborative Innovation based on Open Standards
  – Both specifications donated to the OpenPOWER Foundation by IBM
  – Available royalty-free to all OpenPOWER members
