A Balanced Programming Model for Emerging Heterogeneous Multicore Systems

Wei Liu, Brian Lewis, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Sai Luo, Bratin Saha

Intel Labs, Intel Corporation

Emerging trends in systems

 Moving towards heterogeneous architectures
– Combinations of CPUs & accelerators (e.g., GPUs)
– Integrated & discrete accelerators, more and more cores
– GPUs becoming more capable
– Large virtual address spaces, unified memory, atomics
– Manycore CPUs with wide vector instruction sets
– Substantial data- and task-parallel computational abilities

[Figure: example system configurations combining CPUs, integrated GPUs (iGPU), and discrete GPUs (dGPU) attached over PCIe]

How best to program these new systems?

Current programming models

 Heterogeneous programming languages today
– OpenCL, CUDA, DirectCompute, …
– Low-level, but effective for many applications
– Emphasize data parallelism
– Support coarse-grain offloading
– Underutilize CPU, don't support fine-grain computation well
– Share flat data structures like arrays
– No pointer sharing, require data conversion & marshalling (see the sketch below)
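To make the marshalling cost concrete, here is a minimal C++ sketch (types and helper are illustrative, not taken from any of the systems above) of flattening a pointer-based tree into an index-linked array, which is roughly what an application must do today before handing such data to an accelerator:

```cpp
// Illustrative only: flatten a pointer-based tree into an index-linked array
// so it can be copied to an accelerator that cannot follow host pointers.
#include <cstdint>
#include <vector>

struct Node {             // CPU-side structure, uses raw pointers
    float value;
    Node* left;
    Node* right;
};

struct FlatNode {         // accelerator-side structure, pointers become indices
    float        value;
    std::int32_t left;    // -1 means "no child"
    std::int32_t right;
};

// Recursively copy the tree, replacing each pointer with an array index.
std::int32_t flatten(const Node* n, std::vector<FlatNode>& out) {
    if (!n) return -1;
    std::int32_t idx = static_cast<std::int32_t>(out.size());
    out.push_back({n->value, -1, -1});
    std::int32_t l = flatten(n->left, out);   // children may reallocate 'out',
    std::int32_t r = flatten(n->right, out);  // so write the indices afterwards
    out[idx].left  = l;
    out[idx].right = r;
    return idx;
}
// out.data() is what would be copied to the device (clEnqueueWriteBuffer,
// cudaMemcpy, ...); the original pointers never cross the boundary.
```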

 Want to
– Improve programmer productivity
– Harness computational power of both CPUs & accelerators
– Extend range of applications easily programmed

Goal: a balanced programming model

 Balanced computation between CPU & accelerators
– Enable fine-grained computation using all cores
– Better support for task and data parallelism
– Load balancing, dynamic reconfiguration
– Leverage substantial computational ability of CPUs

 Virtual memory (VM) sharing
– Directly share pointer-containing data structures like trees (e.g., a scene graph)

 Lightweight atomics, locks, …
– Low-cost task dispatch & coordination

[Figure: a shared virtual region mapped into both the CPU and accelerator virtual address spaces]

Want improved programmer productivity
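A minimal sketch of what the balanced model aims to let programmers write; shared_malloc and launch_on_gpu are hypothetical stand-ins (stubbed here to run on the CPU), not the actual runtime API:

```cpp
// Hypothetical sketch of the intended programming model: allocate a
// pointer-containing structure once in the shared virtual region and hand the
// accelerator the same pointer, with no flattening or copying.
#include <cstddef>
#include <cstdlib>

struct SceneNode {
    float      transform[16];
    SceneNode* first_child;    // raw pointers stay valid on both sides
    SceneNode* next_sibling;
};

// Stubs for the shared-VM runtime: the real versions would allocate from the
// shared region and dispatch the task to the GPU.
void* shared_malloc(std::size_t bytes) { return std::malloc(bytes); }
void  launch_on_gpu(void (*task)(SceneNode*), SceneNode* arg) { task(arg); }

void update_subtree(SceneNode* root) {
    for (SceneNode* c = root->first_child; c; c = c->next_sibling) {
        c->transform[0] += 1.0f;   // placeholder work
    }
}

int main() {
    auto* root = static_cast<SceneNode*>(shared_malloc(sizeof(SceneNode)));
    root->first_child  = nullptr;
    root->next_sibling = nullptr;
    // The pointer is meaningful to the accelerator because both sides map the
    // shared region at the same virtual addresses.
    launch_on_gpu(update_subtree, root);
    return 0;
}
```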

Our implementations

 Existing shared VM system for discrete Larrabee
– Targets throughput computing on Larrabee platforms
– Described in 2009 PLDI paper by B. Saha et al.
– Includes memory sharing, synchronization, task placement
– Example uses: Bullet physics and Offset game engines

 Prototype for CPU-integrated GPU
– Processors with on-die Intel Integrated Graphics GPU
– Uses OpenCL with shared VM extensions
– Buffer-level coherency & fine-grained concurrent access (approximated below)
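The prototype's actual extensions are not shown here; as a rough approximation of buffer-level coherency on shared physical memory, standard OpenCL 1.x calls can wrap an existing host allocation and use map/unmap as the coherence points. The context and queue are assumed to already target the integrated GPU:

```cpp
// Rough approximation with standard OpenCL 1.x calls (not the prototype's
// shared-VM extensions): wrap an existing host allocation so a GPU that shares
// physical memory can access it without copies, and use map/unmap as the
// buffer-level coherence points. Alignment requirements for true zero-copy
// on integrated GPUs are omitted.
#include <CL/cl.h>
#include <cstddef>

void share_host_array(cl_context ctx, cl_command_queue queue,
                      float* host_data, std::size_t n) {
    cl_int err = CL_SUCCESS;
    // Wrap the host memory instead of copying it into a separate device buffer.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), host_data, &err);

    // ... enqueue kernels that read and write 'buf' here ...

    // Map/unmap bracket CPU access and define when updates become visible.
    void* p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                 CL_MAP_READ | CL_MAP_WRITE,
                                 0, n * sizeof(float), 0, nullptr, nullptr, &err);
    // The CPU may now read the kernel's results and update the data in place.
    clEnqueueUnmapMemObject(queue, buf, p, 0, nullptr, nullptr);
    clReleaseMemObject(buf);
}
```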

Shared memory with discrete Larrabee

 Supports release consistency
– Natural model for many programs, reduces coherence cost
– Mine/Yours ownership rights enable coherency optimizations (sketched below)
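A conceptual sketch of the ownership optimization, with hypothetical names and stubbed coherence actions (not the runtime's API): at an acquire, data this side already owns needs no invalidation or transfer, and only dirty data is published at a release.

```cpp
// Conceptual sketch: under release consistency, Mine/Yours ownership lets the
// runtime skip invalidation and transfer for data the acquiring side already
// owns. Names and stubs are illustrative.
#include <cstdio>

enum class Owner { Mine, Yours };

struct SharedRegion {
    Owner owner = Owner::Yours;
    bool  dirty = false;
};

// Stand-ins for the runtime's real coherence actions.
void invalidate_local_copy(SharedRegion&) { std::puts("drop stale local copy"); }
void publish_updates(SharedRegion&)       { std::puts("make updates visible"); }

void on_acquire(SharedRegion& r) {        // entering a critical section
    if (r.owner == Owner::Mine) return;   // already owned: no coherence traffic
    invalidate_local_copy(r);
    r.owner = Owner::Mine;
}

void on_release(SharedRegion& r) {        // leaving a critical section
    if (r.dirty) publish_updates(r);
    r.dirty = false;
}
```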

 OS on both sides
 Shared virtual address range allocated at startup
 VM page protection used for coherency
 E.g., updates detected using page faults (see the sketch below)
 Implementation resembles SW distributed shared memory
 Complicated by different page tables, virtual-to-physical mappings
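A simplified user-level illustration (not the runtime's code) of the page-fault technique used by software DSM systems: shared pages start write-protected, the first write faults, and the handler records the page as dirty before re-enabling writes. Dirty pages are what the runtime would propagate at the next release point.

```cpp
// Simplified: detect writes to a shared region via page protection + SIGSEGV.
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

static long              page_size;
static std::uint8_t*     region;
static const std::size_t REGION_BYTES = 1 << 20;
static bool              dirty[256];             // one flag per 4 KiB page of 1 MiB

static void fault_handler(int, siginfo_t* info, void*) {
    std::uint8_t* addr = static_cast<std::uint8_t*>(info->si_addr);
    std::size_t   page = (addr - region) / page_size;   // assumes addr is in region
    dirty[page] = true;                                  // remember the update
    mprotect(region + page * page_size, page_size,
             PROT_READ | PROT_WRITE);                    // let the write proceed
}

int main() {
    page_size = sysconf(_SC_PAGESIZE);
    region = static_cast<std::uint8_t*>(
        mmap(nullptr, REGION_BYTES, PROT_READ,           // start write-protected
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, nullptr);

    region[5 * page_size] = 42;   // first write to page 5 faults and is recorded
    std::printf("page 5 dirty: %d\n", dirty[5]);
    return 0;
}
```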

 Daemons on each side communicate over PCI aperture
 Pinned pages hold message queues & copy buffers (illustrated below)
 Enables user-level communication between runtimes
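The structure below is an illustrative single-producer/single-consumer queue of the kind the two daemons could place in pinned memory mapped through the PCIe aperture; layout and names are not the actual runtime's, and a real cross-device queue also needs device-visible fences and an uncached or write-combined mapping, which this sketch omits.

```cpp
// Illustrative SPSC message queue placed in memory visible to both sides.
#include <atomic>
#include <cstdint>

struct Message { std::uint32_t opcode; std::uint64_t arg; };

struct MsgQueue {
    static const std::uint32_t CAP = 1024;       // power of two
    std::atomic<std::uint32_t> head{0};          // advanced by the consumer
    std::atomic<std::uint32_t> tail{0};          // advanced by the producer
    Message slots[CAP];

    bool push(const Message& m) {                // producer side
        std::uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == CAP)
            return false;                        // queue full
        slots[t % CAP] = m;
        tail.store(t + 1, std::memory_order_release);   // publish the slot
        return true;
    }

    bool pop(Message& m) {                       // consumer side
        std::uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                        // queue empty
        m = slots[h % CAP];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```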

Shared memory on CPU-integrated graphics

 Device driver only, no OS services like page fault handling
 E.g., can't detect updates using page faults

 Exploits shared physical memory
 No data copying, supports fine-grain concurrent data sharing

 To share the virtual address space
– Reserve a common virtual memory region
– Allocate pinned physical pages & map to same address (see the sketch below)
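A user-level analogue of the reservation step (the real mapping is set up by the driver, and the address below is just an example): both runtimes agree on a virtual range, reserve it early, and later map the pinned shared heap into it so a pointer means the same thing on CPU and GPU.

```cpp
// Illustrative only: reserve an agreed-upon virtual range for the shared heap.
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

int main() {
    void* const WANTED = reinterpret_cast<void*>(0x400000000000ULL); // example range
    const std::size_t BYTES = 64 * 1024 * 1024;

    // Reserve without backing it yet (PROT_NONE): nothing else in this process
    // can land in the range, so the shared heap can later be mapped in place.
    void* got = mmap(WANTED, BYTES, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (got != WANTED) {                      // hint not honored (or mmap failed)
        std::fprintf(stderr, "could not reserve the agreed range\n");
        return 1;
    }
    // Later: pinned pages backing the shared heap are mapped here (MAP_FIXED or
    // by the driver), and the GPU's page tables are set to the same mapping.
    std::printf("reserved shared region at %p\n", got);
    return 0;
}
```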

 To manage memory coherence
– Flush caches as needed (see the flush sketch below)
– Discard stale data, make updates visible
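On the CPU side, the flush at a handoff point can be sketched with x86 cache-line flush intrinsics; the 64-byte line size and the matching invalidation on the GPU side (done through the driver) are assumptions of this illustration, not details from the prototype.

```cpp
// CPU-side sketch: write back every cache line of a shared buffer so a GPU
// that is not coherent with the CPU caches sees the updates.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void flush_for_gpu(const void* buf, std::size_t bytes) {
    const std::size_t LINE = 64;                      // assumed cache-line size
    const std::uint8_t* p   = static_cast<const std::uint8_t*>(buf);
    const std::uint8_t* end = p + bytes;
    for (; p < end; p += LINE)
        _mm_clflush(p);         // write the line back to memory and evict it
    _mm_mfence();               // order the flushes before signaling the GPU
}
```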

Cost of sharing virtual memory?

[Figure: speedup of (a) BlackScholes, (b) Equake, and (c) Canny on 1 CPU + N LRBs (N = 2, 4, 8, 16), comparing No Share, Share w/ Ownership, and Share w/o Ownership]

Larrabee memory sharing performance comparable to traditional marshal-copy

Summary

 Current situation
– Languages like OpenCL, CUDA, and DirectCompute
– Good match for today's hardware, effective for many applications
– Emphasize data parallelism
– Focus on coarse-grain offloading, underutilize CPU
– Limited ability for work sharing & fine-grain computation
– No pointer sharing, require data conversion & marshalling

 Goals
– Better programmer productivity & performance
– Pointer sharing, memory consistency, low-cost task dispatch
– Harness computational power of all cores
– Extend range of applications easily programmed
