A Balanced Programming Model for Emerging Heterogeneous Multicore Systems

Wei Liu, Brian Lewis, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Sai Luo, Bratin Saha

Intel Labs, Intel Corporation

Emerging trends in systems

 Moving towards heterogeneous architectures
– Combinations of CPUs & accelerators (e.g., GPUs)
– Integrated & discrete accelerators, more and more cores
– GPUs becoming more capable
– Large virtual address spaces, unified memory, atomics
– Manycore CPUs with wide vector instruction sets
– Substantial data- and task-parallel computational abilities

[Figure: example system configurations combining CPUs, integrated GPUs (iGPU), and discrete GPUs (dGPU) attached over PCIe]

How best to program these new systems?

Current programming models

 Heterogeneous programming languages today
– OpenCL, CUDA, DirectCompute, …
– Low-level, but effective for many applications
– Emphasize data parallelism
– Support coarse-grain offloading
– Underutilize CPU, don't support fine-grain computation well
– Share flat data structures like arrays
– No pointer sharing, require data conversion & marshalling (see the sketch below)
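To make the marshalling cost concrete, here is a minimal C++ sketch (types and helper are illustrative, not taken from any of the systems above) of flattening a pointer-based tree into an index-linked array, which is roughly what an application must do today before handing such data to an accelerator:

```cpp
// Illustrative only: flatten a pointer-based tree into an index-linked array
// so it can be copied to an accelerator that cannot follow host pointers.
#include <cstdint>
#include <vector>

struct Node {             // CPU-side structure, uses raw pointers
    float value;
    Node* left;
    Node* right;
};

struct FlatNode {         // accelerator-side structure, pointers become indices
    float        value;
    std::int32_t left;    // -1 means "no child"
    std::int32_t right;
};

// Recursively copy the tree, replacing each pointer with an array index.
std::int32_t flatten(const Node* n, std::vector<FlatNode>& out) {
    if (!n) return -1;
    std::int32_t idx = static_cast<std::int32_t>(out.size());
    out.push_back({n->value, -1, -1});
    std::int32_t l = flatten(n->left, out);   // children may reallocate 'out',
    std::int32_t r = flatten(n->right, out);  // so write the indices afterwards
    out[idx].left  = l;
    out[idx].right = r;
    return idx;
}
// out.data() is what would be copied to the device (clEnqueueWriteBuffer,
// cudaMemcpy, ...); the original pointers never cross the boundary.
```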

 Want to
– Improve programmer productivity
– Harness computational power of both CPUs & accelerators
– Extend range of applications easily programmed

Goal: a balanced programming model

 Balanced computation between CPU & accelerators
– Enable fine-grained computation using all cores
– Better support for task and data parallelism
– Load balancing, dynamic reconfiguration
– Leverage substantial computational ability of CPUs

 Virtual memory (VM) sharing
– Directly share pointer-containing data structures like trees (e.g., a scene graph)

 Lightweight atomics, locks, …
– Low-cost task dispatch & coordination

[Figure: a shared virtual region mapped into both the CPU and accelerator virtual address spaces]

Want improved programmer productivity
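A minimal sketch of what the balanced model aims to let programmers write; shared_malloc and launch_on_gpu are hypothetical stand-ins (stubbed here to run on the CPU), not the actual runtime API:

```cpp
// Hypothetical sketch of the intended programming model: allocate a
// pointer-containing structure once in the shared virtual region and hand the
// accelerator the same pointer, with no flattening or copying.
#include <cstddef>
#include <cstdlib>

struct SceneNode {
    float      transform[16];
    SceneNode* first_child;    // raw pointers stay valid on both sides
    SceneNode* next_sibling;
};

// Stubs for the shared-VM runtime: the real versions would allocate from the
// shared region and dispatch the task to the GPU.
void* shared_malloc(std::size_t bytes) { return std::malloc(bytes); }
void  launch_on_gpu(void (*task)(SceneNode*), SceneNode* arg) { task(arg); }

void update_subtree(SceneNode* root) {
    for (SceneNode* c = root->first_child; c; c = c->next_sibling) {
        c->transform[0] += 1.0f;   // placeholder work
    }
}

int main() {
    auto* root = static_cast<SceneNode*>(shared_malloc(sizeof(SceneNode)));
    root->first_child  = nullptr;
    root->next_sibling = nullptr;
    // The pointer is meaningful to the accelerator because both sides map the
    // shared region at the same virtual addresses.
    launch_on_gpu(update_subtree, root);
    return 0;
}
```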

Our implementations

 Existing shared VM system for discrete Larrabee
– Targets throughput computing on Larrabee platforms
– Described in 2009 PLDI paper by B. Saha et al.
– Includes memory sharing, synchronization, task placement
– Example uses: Bullet physics and Offset game engines

 Prototype for CPU-integrated GPU
– Processors with on-die Intel Integrated Graphics GPU
– Uses OpenCL with shared VM extensions
– Buffer-level coherency & fine-grained concurrent access (approximated below)
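The prototype's actual extensions are not shown here; as a rough approximation of buffer-level coherency on shared physical memory, standard OpenCL 1.x calls can wrap an existing host allocation and use map/unmap as the coherence points. The context and queue are assumed to already target the integrated GPU:

```cpp
// Rough approximation with standard OpenCL 1.x calls (not the prototype's
// shared-VM extensions): wrap an existing host allocation so a GPU that shares
// physical memory can access it without copies, and use map/unmap as the
// buffer-level coherence points. Alignment requirements for true zero-copy
// on integrated GPUs are omitted.
#include <CL/cl.h>
#include <cstddef>

void share_host_array(cl_context ctx, cl_command_queue queue,
                      float* host_data, std::size_t n) {
    cl_int err = CL_SUCCESS;
    // Wrap the host memory instead of copying it into a separate device buffer.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), host_data, &err);

    // ... enqueue kernels that read and write 'buf' here ...

    // Map/unmap bracket CPU access and define when updates become visible.
    void* p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                 CL_MAP_READ | CL_MAP_WRITE,
                                 0, n * sizeof(float), 0, nullptr, nullptr, &err);
    // The CPU may now read the kernel's results and update the data in place.
    clEnqueueUnmapMemObject(queue, buf, p, 0, nullptr, nullptr);
    clReleaseMemObject(buf);
}
```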

Shared memory with discrete Larrabee

 Supports release consistency
– Natural model for many programs, reduces coherence cost
– Mine/Yours ownership rights enable coherency optimizations (sketched below)
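A conceptual sketch of the ownership optimization, with hypothetical names and stubbed coherence actions (not the runtime's API): at an acquire, data this side already owns needs no invalidation or transfer, and only dirty data is published at a release.

```cpp
// Conceptual sketch: under release consistency, Mine/Yours ownership lets the
// runtime skip invalidation and transfer for data the acquiring side already
// owns. Names and stubs are illustrative.
#include <cstdio>

enum class Owner { Mine, Yours };

struct SharedRegion {
    Owner owner = Owner::Yours;
    bool  dirty = false;
};

// Stand-ins for the runtime's real coherence actions.
void invalidate_local_copy(SharedRegion&) { std::puts("drop stale local copy"); }
void publish_updates(SharedRegion&)       { std::puts("make updates visible"); }

void on_acquire(SharedRegion& r) {        // entering a critical section
    if (r.owner == Owner::Mine) return;   // already owned: no coherence traffic
    invalidate_local_copy(r);
    r.owner = Owner::Mine;
}

void on_release(SharedRegion& r) {        // leaving a critical section
    if (r.dirty) publish_updates(r);
    r.dirty = false;
}
```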

 OS on both sides
 Shared virtual address range allocated at startup
 VM page protection used for coherency
 E.g., updates detected using page faults (see the sketch below)
 Implementation resembles SW distributed shared memory
 Complicated by different page tables, virtual-to-physical mappings
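A simplified user-level illustration (not the runtime's code) of the page-fault technique used by software DSM systems: shared pages start write-protected, the first write faults, and the handler records the page as dirty before re-enabling writes. Dirty pages are what the runtime would propagate at the next release point.

```cpp
// Simplified: detect writes to a shared region via page protection + SIGSEGV.
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

static long              page_size;
static std::uint8_t*     region;
static const std::size_t REGION_BYTES = 1 << 20;
static bool              dirty[256];             // one flag per 4 KiB page of 1 MiB

static void fault_handler(int, siginfo_t* info, void*) {
    std::uint8_t* addr = static_cast<std::uint8_t*>(info->si_addr);
    std::size_t   page = (addr - region) / page_size;   // assumes addr is in region
    dirty[page] = true;                                  // remember the update
    mprotect(region + page * page_size, page_size,
             PROT_READ | PROT_WRITE);                    // let the write proceed
}

int main() {
    page_size = sysconf(_SC_PAGESIZE);
    region = static_cast<std::uint8_t*>(
        mmap(nullptr, REGION_BYTES, PROT_READ,           // start write-protected
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, nullptr);

    region[5 * page_size] = 42;   // first write to page 5 faults and is recorded
    std::printf("page 5 dirty: %d\n", dirty[5]);
    return 0;
}
```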

 Daemons on each side communicate over PCI aperture
 Pinned pages hold message queues & copy buffers (illustrated below)
 Enables user-level communication between runtimes
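The structure below is an illustrative single-producer/single-consumer queue of the kind the two daemons could place in pinned memory mapped through the PCIe aperture; layout and names are not the actual runtime's, and a real cross-device queue also needs device-visible fences and an uncached or write-combined mapping, which this sketch omits.

```cpp
// Illustrative SPSC message queue placed in memory visible to both sides.
#include <atomic>
#include <cstdint>

struct Message { std::uint32_t opcode; std::uint64_t arg; };

struct MsgQueue {
    static const std::uint32_t CAP = 1024;       // power of two
    std::atomic<std::uint32_t> head{0};          // advanced by the consumer
    std::atomic<std::uint32_t> tail{0};          // advanced by the producer
    Message slots[CAP];

    bool push(const Message& m) {                // producer side
        std::uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == CAP)
            return false;                        // queue full
        slots[t % CAP] = m;
        tail.store(t + 1, std::memory_order_release);   // publish the slot
        return true;
    }

    bool pop(Message& m) {                       // consumer side
        std::uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                        // queue empty
        m = slots[h % CAP];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```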

Shared memory on CPU-integrated graphics

 Device driver only, no OS services like page fault handling
 E.g., can't detect updates using page faults

 Exploits shared physical memory
 No data copying, supports fine-grain concurrent data sharing

 To share the virtual address space
– Reserve a common virtual memory region
– Allocate pinned physical pages & map to same address (see the sketch below)
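A user-level analogue of the reservation step (the real mapping is set up by the driver, and the address below is just an example): both runtimes agree on a virtual range, reserve it early, and later map the pinned shared heap into it so a pointer means the same thing on CPU and GPU.

```cpp
// Illustrative only: reserve an agreed-upon virtual range for the shared heap.
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

int main() {
    void* const WANTED = reinterpret_cast<void*>(0x400000000000ULL); // example range
    const std::size_t BYTES = 64 * 1024 * 1024;

    // Reserve without backing it yet (PROT_NONE): nothing else in this process
    // can land in the range, so the shared heap can later be mapped in place.
    void* got = mmap(WANTED, BYTES, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (got != WANTED) {                      // hint not honored (or mmap failed)
        std::fprintf(stderr, "could not reserve the agreed range\n");
        return 1;
    }
    // Later: pinned pages backing the shared heap are mapped here (MAP_FIXED or
    // by the driver), and the GPU's page tables are set to the same mapping.
    std::printf("reserved shared region at %p\n", got);
    return 0;
}
```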

 To manage memory coherence
– Flush caches as needed (see the flush sketch below)
– Discard stale data, make updates visible
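On the CPU side, the flush at a handoff point can be sketched with x86 cache-line flush intrinsics; the 64-byte line size and the matching invalidation on the GPU side (done through the driver) are assumptions of this illustration, not details from the prototype.

```cpp
// CPU-side sketch: write back every cache line of a shared buffer so a GPU
// that is not coherent with the CPU caches sees the updates.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void flush_for_gpu(const void* buf, std::size_t bytes) {
    const std::size_t LINE = 64;                      // assumed cache-line size
    const std::uint8_t* p   = static_cast<const std::uint8_t*>(buf);
    const std::uint8_t* end = p + bytes;
    for (; p < end; p += LINE)
        _mm_clflush(p);         // write the line back to memory and evict it
    _mm_mfence();               // order the flushes before signaling the GPU
}
```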

Cost of sharing virtual memory?

[Figure: speedup of (a) BlackScholes, (b) Equake, and (c) Canny on 1 CPU + N LRBs (N = 2, 4, 8, 16), comparing No Share, Share w/ Ownership, and Share w/o Ownership]

Larrabee memory sharing performance comparable to traditional marshal-copy

Summary

 Current situation
– Languages like OpenCL, CUDA, and DirectCompute
– Good match for today's hardware, effective for many applications
– Emphasize data parallelism
– Focus on coarse-grain offloading, underutilize CPU
– Limited ability for work sharing & fine-grain computation
– No pointer sharing, require data conversion & marshalling

 Goals
– Better programmer productivity & performance
– Pointer sharing, memory consistency, low-cost task dispatch
– Harness computational power of all cores
– Extend range of applications easily programmed
