I/O Virtualization and System Acceleration in POWER8

Michael Gschwind, IBM Power Systems

Industry Trends Generate New Opportunities

• System stack innovations are required to continue cost/performance improvements
  – System stack: Applications and Services; Systems Management & Cloud Deployment; Systems Acceleration & HW/SW Optimization; Operating System and Hypervisor; Processors; Semiconductor Technology

[Chart: price/performance (log scale), 2004–2014, with gains attributed to Workload Acceleration, Services Delivery Model, Advanced Memory Technology, and Network & I/O Acceleration]

© 2015 International Business Machines Corporation

System Design ca. 2015

• Consolidation in the Cloud
  – Virtual Machine Aggregation
  – Processor Virtualization

• I/O Aggregation
  – I/O Virtualization
  – I/O Isolation

• Maintain performance growth for critical workloads
  – Under power constraints
  – Under cost constraints
  – Under area constraints

Innovation Through Open Standards

• Consolidation in the Cloud → Power Instruction Set Architecture
  – Virtual Machine Aggregation
  – Processor Virtualization

• I/O Aggregation → I/O Design Architecture
  – I/O Virtualization
  – I/O Isolation

• Maintain performance growth for critical workloads → Coherent Accelerator Interface Architecture
  – Under power constraints
  – Under cost constraints
  – Under area constraints

I/O Design Architecture

• I/O Virtualization
  – I/O address translation
  – Single-level translation  supports contiguous PowerVM partition memory
  – Hierarchical translation  supports KVM partition memory

• Partition data isolation
  – Separate I/O address spaces for partitions

• Partition fault isolation
  – Fault management domains based on Partitionable Endpoints (PE)

Partitioning and Managing the I/O Space: Partitionable Endpoints

[Diagram: two PCIe Host Bridges, each holding per-Partitionable-Endpoint state; PCIe switches and ports fan out to plain adapters and to an SR-IOV adapter with one Physical Function (PF) and multiple Virtual Functions (VFs). Each adapter, each VF, and each switch subtree below a port forms its own Partitionable Endpoint.]

Managing Partitionable Endpoints

• Many I/O functions tracked using Partitionable Endpoints
  – MMIO Load/Store address domains
  – DMA I/O bus address domains and TCEs
  – Interrupts
  – Ordering of transactions per PE
  – PE Error and Reset domains

• Enhanced RAS capabilities
  – Enhanced I/O Error Handling (EEH)
  – Partition isolation

[Diagram: POWER8 chip whose PHBs validate and translate PCIe DMA to memory and track interrupts]
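The per-PE error containment behind EEH can be sketched in a few lines of C. All type and field names below are invented for illustration, not the PHB's actual structures; the one behavior taken from the design is that a frozen PE answers MMIO loads with all-ones so the fault stays confined to that endpoint:

```c
#include <stdint.h>
#include <assert.h>

/* Invented-name sketch of per-PE state as a PHB might track it. */
struct pe_state {
    int      frozen;        /* set by hardware when an I/O error hits this PE */
    uint64_t mmio_base;     /* PE's MMIO load/store address domain            */
    uint64_t dma_window;    /* base of the PE's DMA I/O bus address range     */
};

/* An MMIO load against a frozen PE returns all-ones instead of
 * propagating the error machine-wide; other PEs are untouched. */
uint64_t pe_mmio_load(const struct pe_state *pe, uint64_t value_at_addr)
{
    if (pe->frozen)
        return ~0ULL;
    return value_at_addr;
}
```

An EEH-aware driver watches for this all-ones signature and then requests a reset of just that PE rather than the whole bridge.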

I/O DMA memory translation

• Translate I/O DMA memory addresses to system memory addresses
  – Isolation of partitions
  – Create a contiguous view of non-contiguous data
  – Enable 32-bit devices on 64-bit systems
• Processor chip contains a cache of recently accessed translation tables
  – Full tables stored in system memory

[Flow: I/O bus DMA address → validate → validated I/O bus DMA address → translate → validated and translated I/O bus DMA address]
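The translate step can be illustrated with a minimal single-level TCE walk in C. The constants and names are invented for this sketch; a real PHB also applies per-PE DMA windows and freezes the owning PE on a fault rather than returning 0:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* A TCE (Translation Control Entry) maps one 4 KiB I/O bus page to a
 * real page in system memory; low bits carry access permissions. */
#define IO_PAGE_SHIFT 12
#define IO_PAGE_MASK  ((1ULL << IO_PAGE_SHIFT) - 1)
#define TCE_READ      0x1ULL
#define TCE_WRITE     0x2ULL

/* Translate an I/O bus DMA address using a flat TCE table.
 * Returns the system memory address, or 0 on a range/permission fault. */
uint64_t tce_translate(const uint64_t *tce_table, size_t n_entries,
                       uint64_t bus_addr, int is_write)
{
    uint64_t idx = bus_addr >> IO_PAGE_SHIFT;
    if (idx >= n_entries)
        return 0;                       /* outside the DMA window */
    uint64_t tce = tce_table[idx];
    if (!(tce & (is_write ? TCE_WRITE : TCE_READ)))
        return 0;                       /* access not permitted */
    /* Swap the bus page number for the real page number, keep the offset. */
    return (tce & ~IO_PAGE_MASK) | (bus_addr & IO_PAGE_MASK);
}
```

A hierarchical table repeats the index/lookup step once per level; as the slide notes, recently used entries are cached on the processor chip while the full table lives in system memory.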

Partitionable Endpoint Management

• DMA validation
  – “Trusted” Requester ID (RID): 16-bit bus/device/function number
  – “Untrusted” 64-bit address, received from the device driver
• Interrupts tracked in system memory
• PEs impacted by an I/O error identified via RID
  – SR-IOV Physical Function fault  PEs of the Physical and all Virtual Adapter Functions
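The trust split above can be sketched as follows (names invented for illustration): the hardware-stamped RID selects the owning PE, and only then is the driver-supplied address checked against that PE's resources.

```c
#include <stdint.h>
#include <assert.h>

/* The 16-bit RID (bus/device/function) is stamped on each PCIe
 * transaction by hardware and is therefore trusted; the 64-bit DMA
 * address was programmed by the partition's device driver and is not. */

/* rid_to_pe[rid] holds the owning PE number, or -1 if unassigned. */
int dma_owning_pe(const int16_t *rid_to_pe, uint16_t rid)
{
    return rid_to_pe[rid];   /* -1 => reject before any translation */
}

/* Accept a DMA only if its RID maps to a PE; the untrusted address is
 * then validated against that PE's own DMA window and TCE table. */
int dma_accept(const int16_t *rid_to_pe, uint16_t rid)
{
    return dma_owning_pe(rid_to_pe, rid) >= 0;
}
```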

[Diagram: on-chip structures: DMA validation (TCE cache indexed by DMA address and RID, yielding a PE# per operation), an interrupt-source-controller state machine, and an error-state-tracking machine; the interrupt vector array and full state live off-chip in system memory]

Coherent Accelerator Processor Interface (CAPI)

• Integrated in every POWER8 system
• Builds on a long history of IBM workload acceleration
• Integrates with big-endian and little-endian accelerators

Microprocessor Trends

[Chart: log(performance) over time across successive eras of single-thread, multi-core, hybrid, and special-purpose designs; within each era, gains come from more active transistors and higher frequency]

Heterogeneous System Challenges

• The 4 ‘P’s of System Design

• Programmer Productivity
• Realize accelerator Performance benefits
• Portability: investment protection for applications
• Partitioning for multi-user systems: processes, partitions

Workload-optimized acceleration with coherent accelerators

• Attached accelerators
  – Accelerate functions that do not fit the traditional CPU model
  – Heterogeneous System Architecture

• Coherent integration in system architecture
  – Data sharing
  – Programming
  – Performance

Application Acceleration

• Fine-grained data sharing  coherent, shared memory

• Accelerator-initiated data accesses/transfers  coherent, shared memory

• Pointer identity  shared addressing

• Flexible synchronization  symmetric, programmable interfaces
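Pointer identity is what lets the host hand an accelerator a raw pointer into a live data structure. The sketch below uses plain C, with a host function standing in for the AFU (the real accelerator runs behind the PSL); it shows why no marshalling or copying is needed when both sides resolve the same virtual addresses:

```c
#include <stddef.h>
#include <assert.h>

struct node { int value; struct node *next; };

/* "Accelerator" stand-in: autonomously walks the host's list using the
 * host's own pointers; with shared addressing these virtual addresses
 * are equally valid on the AFU side. */
int afu_sum_list(const struct node *head)
{
    int sum = 0;
    for (; head; head = head->next)
        sum += head->value;
    return sum;
}
```

With a conventional I/O device, the same list would first have to be serialized into a pinned, physically addressed buffer before the device could touch it.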

CAPI Acceleration overcomes Device Driver Deceleration

Typical I/O Model Flow: Total ~13µs for data prep

Flow with Coherent Model: Total ~0.36µs

Workload-optimized acceleration

• On-chip integrated accelerators (SoC design)
  – Compute accelerator (Cell BE)
  – Compression (P7+)
  – Encryption (P7+)
  – Random number generation (P7+)
  – …

• SoC design offers the highest integration, but…
  – Requires a new chip design for each accelerator
  – Long time to market
  – Requires very high volumes

[Die photos: Cell BE and POWER7+]

CAPI: Coherent Accelerator Processor Interface

• Integrate accelerators into the system architecture
  – Modular interface
  – Third-party high-value-add components

• Standardized, layered protocol
  – Architectural interface
  – Functional protocol
  – PCIe signaling protocol
• Create workload-optimized, innovative solutions
  – Faster time to market
  – Lower bar to entry
  – Variety of implementation options: FPGAs, ASICs

[Diagram: POWER8 chip with a coherence-bus proxy attached to the accelerator's PSL (* Power Service Layer)]

CAPI accelerator programming

Virtual Addressing
• Pointer identity: pointers reference the same object as in the host application
• CAPI accelerators work with the same virtual memory addresses as the CPU
• CAPI shares page tables and provides address translation of the host application
• Peer-to-peer programming between CPU and accelerator with in-memory data sharing

Virtualization and Partitioning
• Address translation supports process isolation  accelerator has access to application context (only)
• Address translation supports partition isolation  accelerator has access to partition data (only)

CAPI accelerator programming

Hardware-Managed Cache Coherence
• No need for memory pinning
• Data fetched by accelerator based on accelerator application flow
• Accelerator participates in locks
• Low-latency communication


CAPI Accelerator Virtualization

• Dedicated Model
  – Accelerator assigned to a single process in a partition
  – Binding via Operating System (process) and Hypervisor (partition)
• Time-quantum shared programming model
  – Protocol-controlled model
• Accelerator-directed shared programming model
  – Networking model (select context based on incoming data)
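For the accelerator-directed shared model, context selection might look like the following sketch (all names invented): a tag carried by the incoming data picks the process context, and with it the work queue and address-translation context that will service the request.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Invented-name sketch: each attached process has its own context with
 * its own work queue and its own address translation. */
struct afu_context { int id; int queued; };

/* The AFU selects a context for incoming data, e.g. by a flow tag,
 * and enqueues the work on that context's queue. */
struct afu_context *afu_dispatch(struct afu_context *ctxs, size_t n,
                                 uint32_t flow_tag)
{
    struct afu_context *c = &ctxs[flow_tag % n];  /* context selection */
    c->queued++;                                  /* place on its queue */
    return c;
}
```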

[Diagram: accelerator PSL with per-context work queues, attached via the coherence-bus proxy]

Coherent acceleration of data sharing

• Offload data synchronization and data transfers
  – No need to invalidate data before initiating I/O transfers (cost avoided by cache coherence)
  – Accelerator feeds itself  avoid use of high-function CPU as data mover
• Available translation, transfer, and synchronization bandwidth scales with the system
  – As more accelerators are used, available resources scale up (cost avoided by system-wide page tables)
  – Avoid CPU becoming a serial-execution bottleneck

[Charts: normalized time for compute vs. cache management, and compute vs. address management, over 1–16 SPEs; Gschwind, ICCD 2008]

• Garbage collection accelerator
  – Explore boundaries of acceleration
  – Study accelerator programmability
• Pointer identity: advanced data structures
  – Autonomous traversal of data
• Self-paced memory access
  – Simplifies data management
  – No callbacks to request data
  – Zero-copy data access
  – Handle complex access patterns

[Chart: normalized area×delay product; Cher & Gschwind, VEE 2008]

CAPI Attached Flash Optimization

• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs to eliminate 97% of path length
• Saves 20–30 cores per 1M IOPS

[Diagram: conventional path: application → read/write syscall → file system → LVM (strategy()/iodone()) → disk and adapter device driver (pin buffers, translate, map DMA, start I/O; interrupt, unmap, unpin, iodone scheduling): ~20K instructions. CAPI path: application → POSIX async I/O style API (aio_read(), aio_write()) → shared-memory work queue: ~500 instructions.]
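The short path is deliberately shaped like POSIX asynchronous I/O. As a rough illustration of that submit-and-poll style, here is standard POSIX aio against an ordinary file descriptor; the CAPI flash library's actual entry points are not shown and may differ:

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

/* Submit a read and poll for completion, the way an application on the
 * CAPI flash path queues commands and checks the shared-memory work
 * queue instead of descending through file system, LVM, and driver. */
ssize_t aio_read_block(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* poll, don't signal */
    if (aio_read(&cb) != 0)
        return -1;
    while (aio_error(&cb) == EINPROGRESS)
        ;                        /* real code would do useful work here */
    return aio_return(&cb);      /* bytes read, or -1 */
}
```

On the CAPI path the same submit-and-poll pattern runs almost entirely in user space, which is where the ~20K-to-~500 instruction reduction comes from.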

CAPI Unlocks the Next Level of Performance for Flash

Identical hardware (POWER S822L with FlashSystem), two different paths to data:
• >5x better IOPS per hardware thread
• >2x lower latency

[Charts: IOPS per hardware thread (scale to 120,000) and latency in µs (scale to 500), conventional I/O vs. CAPI]

What CAPI Means for NoSQL Solutions

Today’s NoSQL in memory (x86):
• WWW traffic through a 10Gb uplink to a load balancer, fanned out to many nodes with 500GB cache each, plus backup nodes
• Infrastructure requirements: large distributed scale-out, large memory per node, networking bandwidth, load balancing

Differentiated NoSQL (POWER8 + CAPI + FlashSystem):
• WWW traffic through a 10Gb uplink directly to a POWER8 server with a CAPI-attached flash array of up to 56TB
• Infrastructure attributes: 192 threads in a 4U server drawer, 56TB of flash per 2U drawer, shared memory and cache for dynamic tuning, elimination of I/O and network overhead, cluster solution in a box

Power CAPI-attached FlashSystem for NoSQL regains infrastructure control and reins in the cost to deliver services.

Summary

• POWER8 delivers advanced virtualization for CPU and I/O
  – High-performance I/O virtualization
  – Data isolation
  – Fault isolation

• POWER8 takes the next step in exploiting system accelerators
  – Eliminate overheads inherent in the I/O model
  – Reduced latency
  – Increased throughput

• Collaborative Innovation based on Open Standards
  – Both specifications donated to the OpenPOWER Foundation by IBM
  – Available royalty-free to all OpenPOWER members
