A Balanced Programming Model for Emerging Heterogeneous Multicore Systems
Wei Liu, Brian Lewis, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Sai Luo, Bratin Saha
Intel Labs, Intel Corporation
Emerging trends in computer systems
Moving towards heterogeneous architectures
– Combinations of CPUs & accelerators (e.g., GPUs)
– Integrated & discrete accelerators, more and more cores
– GPUs becoming more capable
– Large virtual address spaces, unified memory, atomics
– Manycore CPUs with wide vector instruction sets
– Substantial data- and task-parallel computational abilities
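[Figure: example system configurations: a CPU with a discrete GPU (dGPU) attached over PCIe, a CPU with an integrated GPU (iGPU), and a CPU + iGPU with a dGPU attached over PCIe]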
How best to program these new systems?
Current programming models
Heterogeneous programming languages today
– OpenCL, CUDA, DirectCompute, …
– Low-level, but effective for many applications
– Emphasize data parallelism
– Support coarse-grain offloading
– Underutilize CPU, don’t support fine-grain computation well
– Share flat data structures like arrays
– No pointer sharing, require data conversion & marshalling
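To make the marshalling cost concrete, here is a minimal sketch of the conversion a pointer-based tree needs before it can be offloaded as flat arrays. It is an illustration only: `Node`, `flatten`, and `unflatten` are hypothetical names, and Python stands in for whatever host language an application would actually use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Pointer-based binary tree node, as the host program would hold it."""
    value: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def flatten(root):
    """Marshal a tree into three parallel flat lists.

    Child links become indices into the lists; -1 marks a missing child.
    """
    values, left, right = [], [], []

    def visit(node):
        if node is None:
            return -1
        idx = len(values)
        values.append(node.value)
        left.append(-1)
        right.append(-1)
        left[idx] = visit(node.left)
        right[idx] = visit(node.right)
        return idx

    visit(root)
    return values, left, right


def unflatten(values, left, right, idx=0):
    """Rebuild the pointer-based tree from the flat lists (the return trip)."""
    if idx == -1 or not values:
        return None
    return Node(values[idx],
                unflatten(values, left, right, left[idx]),
                unflatten(values, left, right, right[idx]))


tree = Node(1.0, Node(2.0), Node(3.0, Node(4.0)))
values, left, right = flatten(tree)
```

Every offload pays for `flatten`, and any result that is itself a tree pays for `unflatten` on the way back. This is the conversion step that pointer sharing is meant to remove.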
Want to
– Improve programmer productivity
– Harness computational power of both CPUs & accelerators
– Extend range of applications easily programmed
Goal: a balanced programming model
Balanced computation between CPU & accelerators
– Enable fine-grained computation using all cores
– Better support for task and data parallelism
– Load balancing, dynamic reconfiguration
– Leverage substantial computational ability of CPUs
Virtual memory (VM) sharing
– Directly share pointer-containing data structures like trees
Lightweight atomics, locks, …
– Low-cost task dispatch & coordination
[Figure: a scene graph held in a shared virtual region that spans the CPU and accelerator virtual address spaces]
Want improved programmer productivity
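One way to picture fine-grained load balancing is a shared work counter from which every processor claims small chunks. The sketch below is an assumption-laden illustration, not the runtime described in these slides: two Python threads stand in for a CPU and an accelerator, a lock stands in for a lightweight atomic, and `run_balanced` and its parameters are hypothetical.

```python
import threading


def run_balanced(items, work, worker_names=("cpu", "accelerator"), chunk=4):
    """Process items with several workers that claim small chunks on demand."""
    results = [None] * len(items)
    claimed = {name: 0 for name in worker_names}
    next_index = 0
    lock = threading.Lock()  # stand-in for an atomic fetch-and-add

    def worker(name):
        nonlocal next_index
        while True:
            with lock:
                start = next_index
                next_index = min(start + chunk, len(items))
                end = next_index
            if start >= len(items):
                return  # no work left
            for i in range(start, end):
                results[i] = work(items[i])
            claimed[name] += end - start  # each worker updates only its own entry

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, claimed


results, claimed = run_balanced(list(range(10)), lambda x: x * x)
```

How the items split between the two workers varies from run to run, because whichever worker finishes first claims the next chunk. Only the total is fixed. Small chunks only pay off when claiming one is cheap, which is why low-cost dispatch matters.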
Our implementations
Existing shared VM system for discrete Larrabee
– Targets throughput computing on Larrabee platforms
– Described in 2009 PLDI paper by B. Saha et al.
– Includes memory sharing, synchronization, task placement
– Example uses: Bullet physics and Offset game engines
Shared memory prototype for CPU-integrated GPU
– Processors with on-die Intel Integrated Graphics GPU
– Uses OpenCL with shared VM extensions
– Buffer-level coherency & fine-grained concurrent access
Shared memory with discrete Larrabee
Supports release consistency
– Natural model for many programs, reduces coherence cost
– Mine/Yours ownership rights enable coherency optimizations
OS on both sides
Shared virtual address range allocated at startup
VM page protection used for coherency
– E.g., updates detected using page faults
Implementation resembles SW distributed shared memory
– Complicated by different page tables, virtual-to-physical mappings
Daemons on each side communicate over PCI aperture
– Pinned pages hold message queues & copy buffers
– Enables user-level communication between runtimes
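The page-protection idea can be sketched as a small simulation. It is not the Larrabee implementation: real systems rely on hardware page protection and a fault handler, whereas here a `writable` flag per page models the protection bit and the first write to a protected page models the fault. `SharedRegion` and its methods are hypothetical names, and pushing dirty pages at release time is an assumption made for this sketch.

```python
PAGE_SIZE = 4096


class SharedRegion:
    """One side's copy of the shared region, with simulated write protection."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.writable = [False] * num_pages  # models page protection bits
        self.dirty = set()                   # pages written since last release
        self.faults = 0

    def write(self, addr, data):
        page, off = divmod(addr, PAGE_SIZE)
        assert off + len(data) <= PAGE_SIZE, "sketch: writes stay within a page"
        if not self.writable[page]:
            # Simulated page fault: record the page as dirty, then unprotect it
            # so later writes to the same page proceed without faulting.
            self.faults += 1
            self.dirty.add(page)
            self.writable[page] = True
        self.pages[page][off:off + len(data)] = data

    def read(self, addr, length):
        page, off = divmod(addr, PAGE_SIZE)
        return bytes(self.pages[page][off:off + length])

    def release(self, other):
        """Release point: copy only dirty pages to the other side, re-protect."""
        sent = sorted(self.dirty)
        for page in sent:
            other.pages[page][:] = self.pages[page]
            self.writable[page] = False
        self.dirty.clear()
        return sent


cpu, lrb = SharedRegion(4), SharedRegion(4)
cpu.write(0, b"abc")
cpu.write(10, b"x")                      # same page: no second fault
cpu.write(2 * PAGE_SIZE + 5, b"hello")
sent = cpu.release(lrb)
```

In this run only two of the four pages are copied at the release point, and the second write to page 0 costs nothing extra. Under release consistency the other side is only guaranteed to see the updates after the release, which is what allows the copying to be batched.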
Shared memory on CPU-integrated graphics
Device driver, no OS services like page fault handling
– E.g., can’t detect updates using page faults
Exploits shared physical memory
– No data copying, supports fine-grain concurrent data sharing
To share the virtual address space
– Reserve a common virtual memory region
– Allocate pinned physical pages & map to same address
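A toy model shows why mapping the same physical pages at the same virtual address matters. Each address space below is a dictionary from virtual page number to physical page. `AddressSpace`, `SHARED_BASE`, and the 16-byte node layout are all assumptions made for this sketch.

```python
import struct

PAGE_SIZE = 4096
SHARED_BASE = 0x7000_0000  # hypothetical common virtual address


class AddressSpace:
    """Toy page table: virtual page number -> physical page (a bytearray)."""

    def __init__(self):
        self.page_table = {}

    def map_pages(self, base, pages):
        assert base % PAGE_SIZE == 0
        for i, page in enumerate(pages):
            self.page_table[base // PAGE_SIZE + i] = page

    def _locate(self, vaddr):
        page = self.page_table[vaddr // PAGE_SIZE]  # KeyError models a fault
        return page, vaddr % PAGE_SIZE

    def write_u64(self, vaddr, value):
        page, off = self._locate(vaddr)
        struct.pack_into("<Q", page, off, value)

    def read_u64(self, vaddr):
        page, off = self._locate(vaddr)
        return struct.unpack_from("<Q", page, off)[0]


def build_list(space, base, items):
    """Write 16-byte nodes (value, next); 'next' is a raw virtual address."""
    for i, item in enumerate(items):
        node = base + 16 * i
        nxt = node + 16 if i + 1 < len(items) else 0  # 0 is the null pointer
        space.write_u64(node, item)
        space.write_u64(node + 8, nxt)
    return base if items else 0


def sum_list(space, head):
    """Follow raw 'next' pointers through the given address space."""
    total = 0
    while head:
        total += space.read_u64(head)
        head = space.read_u64(head + 8)
    return total


pinned = [bytearray(PAGE_SIZE) for _ in range(2)]  # shared physical pages
cpu, gpu = AddressSpace(), AddressSpace()
cpu.map_pages(SHARED_BASE, pinned)
gpu.map_pages(SHARED_BASE, pinned)

head = build_list(cpu, SHARED_BASE, [10, 20, 30])
gpu_total = sum_list(gpu, head)
```

Because both sides map the pinned pages at `SHARED_BASE`, the list built by `cpu` can be traversed through `gpu` without translating any pointer. If the second side mapped the same pages at a different base, the stored addresses would not resolve there, and pointer fix-up or marshalling would be needed again.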
To manage memory coherence
– Flush caches as needed
– Discard stale data, make updates visible
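The two coherence actions named above, flushing to make updates visible and discarding stale data, can be pictured with a toy write-back cache in front of shared memory. This is a conceptual sketch only. `CachedView` is a hypothetical name, and real hardware works on cache lines with driver-issued flush and invalidate operations rather than on Python dictionaries.

```python
class CachedView:
    """Toy write-back cache in front of shared physical memory (a dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses written but not yet flushed

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)  # miss: fetch
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self):
        """Make local updates visible by writing dirty lines back to memory."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def invalidate(self):
        """Discard clean cached lines so later reads refetch from memory."""
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}


memory = {}
cpu, gpu = CachedView(memory), CachedView(memory)

initial = gpu.read(0x100)       # GPU caches the old value
cpu.write(0x100, 42)            # update sits in the CPU cache only
cpu.flush()                     # producer side: make the update visible
still_stale = gpu.read(0x100)   # GPU still hits its stale cached copy
gpu.invalidate()                # consumer side: discard stale data
fresh = gpu.read(0x100)         # now refetched from shared memory
```

Both halves are needed. A flush on the producer side without an invalidate on the consumer side still leaves the consumer reading its stale copy.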
Cost of sharing virtual memory?
[Figure: speedup (y-axis) against the x-axis labeled "1 CPU + N LRBs" (ticks up to 16) for three configurations: No Share, Share w/ Ownership, and Share w/o Ownership. Panels: (a) BlackScholes, (b) Equake, (c) Canny]
Larrabee memory sharing performance comparable to traditional marshal-copy
Summary
Current situation
– Languages like OpenCL, CUDA, and DirectCompute
– Good match for today’s hardware, effective for many applications
– Emphasize data parallelism
– Focus on coarse-grain offloading, underutilize CPU
– Limited ability for work sharing & fine-grain computation
– No pointer sharing, require data conversion & marshalling
Goals
– Better programmer productivity & performance
– Pointer sharing, memory consistency, low-cost task dispatch
– Harness computational power of all cores
– Extend range of applications easily programmed