OpenMP for Embedded Systems
Sunita Chandrasekaran, Asst. Professor, Dept. of Computer & Information Sciences, University of Delaware ([email protected])
ACK: Peng Sun, Suyang Zhu, Cheng Wang, Barbara Chapman, Tobias Schuele, Marcus Winter

Talk @ the OpenMP Booth #611, November 16, 2016

Heterogeneous Embedded Systems

• Incorporates specialized processing capabilities to handle specific tasks
• Examples:
  – CPU + GPU
  – ARM + GPU
  – ARM + DSP
  – CPU + FPGA

Figure: Qualcomm Snapdragon™ 810 block diagram. Figure: NVIDIA Tegra K1 block diagram.

Programming Multicore Embedded Systems – A Real Challenge

• Heterogeneous systems present complexity at both the silicon and the system level
• Standards and tool chains in the embedded industry are proprietary
• Portability and scalability issues
• High time-to-market (TTM) for solutions
• We need industry standards
  – To offer portable, scalable software solutions and target more than one platform
• How suitable are the state-of-the-art models for heterogeneous embedded systems?

• Not portable across more than one type of platform, with the exception of OpenCL
• Most models are too heavy-weight for embedded processors with limited resources
• Most models require support from an OS
  – Some embedded systems are bare-metal
• Some solutions are restricted to homogeneous environments

So what do we really need?
• Something that's not too low-level
• Something light-weight
• Something that can target heterogeneous embedded platforms (beyond CPUs and GPUs)
• Something that can help speed time-to-market for products
• Last but not least – industry standards

Using Industry Standards
• Two of them are used for this work:
  – OpenMP (high-level, directive-based)
  – Multicore Association (MCA) APIs (low-level, light-weight, catered to embedded systems)

Briefly, on OpenMP Implementations
• Directives are implemented via code modification and insertion of runtime library calls
  – The typical approach is outlining of the code in the parallel region
  – Or generation of microtasks
• The runtime library is responsible for managing threads
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• Implementation effort is reasonable

OpenMP code translation:

    /* Original program */
    int main(void)
    {
        int a, b, c;
        #pragma omp parallel private(c)
        do_sth(a, b, c);
        return 0;
    }

    /* Translated program */
    _INT32 main()
    {
        int a, b, c;
        ...
        /* OpenMP runtime calls */
        __ompc_fork(&__ompregion_main1);
        ...
    }

    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the
           private variable are substituted */
        do_sth(a, b, __mplocal_c);
    }

Each implementation has custom runtime support. The quality of the runtime system has a major impact on performance.

History of OpenMP*

• 1.0 (Fortran 1997; C/C++ 1998): In spring 1997, 7 vendors and the DOE agree on the spelling of parallel loops and form the OpenMP ARB. By October, version 1.0 of the OpenMP specification for Fortran is released. Around this time, the first hybrid applications with MPI* and OpenMP appear.
• 1.1: Minor modifications.
• 2.0: Unified Fortran and C/C++: bigger than both individual specifications combined. cOMPunity, the group of OpenMP users, is formed and organizes workshops on OpenMP in North America, Europe, and Asia. The first International Workshop on OpenMP is held; it becomes a major forum for users to interact with vendors.
• 2.5: The merge of the Fortran and C/C++ specifications begins.
• 3.0: Incorporates task parallelism – a hard problem, as OpenMP struggles to maintain its thread-based nature while accommodating the dynamic nature of tasking.
• 3.1: Supports min/max reductions in C/C++.
• 4.0: Supports offloading execution to accelerator and coprocessor devices, SIMD parallelism, and more. Expands OpenMP beyond its traditional boundaries.
• 4.5: Offloading now supports asynchronous execution and dependencies to host execution. Adds taskloops, task priorities, doacross loops, and hints for locks.
• 5.0: In development at the time of this talk.

(Chart: number of ARB members per year, 1997–2018, split into permanent and auxiliary ARB members.)

Multicore Association (MCA) APIs
• Develops standards to reduce the complexity involved in writing software for multicore chips
• Captures the basic elements and abstracts hardware and system resources
• A cohesive set of foundation APIs:
  – Standardized communication (MCAPI)
  – Resource sharing (MRAPI)
  – Task management (MTAPI)

Outline
• Introduction
• Related Work
• Work Completed
• Proposed Work
• Plan of Work
• Acknowledgment

Peng Sun, Nov. 24, 2016

Programming Model: Multicore Task Management API (MTAPI)

The Multicore Association develops and promotes open specifications for multicore product development.

MTAPI
• Standardized API for task-parallel programming on a wide range of hardware architectures
• Developed and driven by practitioners of market-leading companies
• Part of the Multicore Association's ecosystem (MRAPI, MCAPI, SHIM, OpenAMP, …)
• Core concepts: tasks and queues
• Supports heterogeneous systems:
  – Shared memory
  – Distributed memory
  – Different instruction set architectures

Ack: Siemens (Tobias Schuele, Urs Gleim). Unrestricted © Siemens AG 2016. All rights reserved.

OpenMP and MCA Software Stack

MTAPI for Heterogeneous Systems (cont.): Jobs, Tasks & Actions in a Nutshell
Example of the usage of MTAPI in heterogeneous systems:

Figure: MTAPI jobs, tasks, and actions. The application starts tasks (Task 1–3) that reference jobs (Job A: matrix multiplication; Job B: FFT). Each job is implemented by one or more actions (Action I–III) residing on particular nodes: Node 1 (CPU), Node 2 (GPU), and Node 3 (DSP).

Ack: Siemens (Tobias Schuele, Urs Gleim)

Performance Evaluation: MTAPI Implementations

1. Embedded Multicore Building Blocks (EMB²)¹
   • Open source library and runtime platform for embedded multicore systems
   • Easy parallelization of existing code using high-level patterns
   • Components: dataflow, algorithms, containers, and task management (MTAPI), built on a base library (abstraction layer) over the operating system / hypervisor and hardware
   • Real-time capability, resource awareness
   • Fine-grained control over core usage (task priorities, affinities)
   • Lock-/wait-free implementation

2. UH-MTAPI²
   • MTAPI implementation developed at the Universities of Houston / Delaware
   • Utilizes MCAPI for inter-node communication and MRAPI for resource management
   • Has been used as a runtime system for OpenMP programs

¹ https://github.com/siemens/embb
² https://github.com/MCAPro2015/OpenMP_MCA_Project

MTAPI Scheduling

Example for scheduling MTAPI tasks in heterogeneous systems:

Figure: Scheduling MTAPI tasks. On Node 0 (CPU), worker thread 0 (Core 0) owns local queues Q00–Q02 and worker thread 1 (Core 1) owns Q10–Q12; an idle worker obtains tasks from another worker's queues by work stealing. Node 1 (DSP, bare metal) runs its own scheduler with a single queue Q0 (Unit 0), to which work is dealt.

Ack: Siemens (Tobias Schuele, Urs Gleim)

Testbed, Compiler and Benchmarks

• Test beds:
  – NVIDIA Jetson TK1 embedded development board with a Tegra K1 SoC (NVIDIA 4-Plus-1 quad-core ARM Cortex-A15 processor and a Kepler GPU with 192 CUDA cores)
  – Power Architecture from Freescale, containing a Pattern Matching Engine as a specialized accelerator
• Compiler: Jetson (GCC 4.8.4, NVCC V6.5.30)
• Benchmarks: Rodinia¹ and BOTS²
• Reference group: Siemens MTAPI³, GCC OpenMP

¹ Rodinia: https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Accelerating_Compute-Intensive_Applications_with_Accelerators
² BOTS: https://pm.bsc.es/projects/bots
³ Siemens MTAPI: https://github.com/siemens/embb

Performance Evaluation: Matrix Multiplication

Normalized execution times for UH-MTAPI and Siemens MTAPI (EMB²):

• MTAPI-ARM is faster than MTAPI-GPU for small matrices due to the overhead of data copying
• MTAPI-GPU is faster than MTAPI-ARM-GPU for larger matrices due to load imbalance
• MTAPI-ARM-GPU-Opt is always fastest due to asynchronous transfers and variable block sizes

OpenMP RTL Translation to MTAPI

OpenMP–MTAPI RTL Compilation Flow
• The compiler front end translates OpenMP constructs into MTAPI-RTL function calls
• The OpenMP-MTAPI RTL includes the runtime calls for the translated OpenMP task construct; OpenMP tasks are converted into MTAPI objects
• The OpenMP-MTAPI RTL instantiates and dispatches MTAPI tasks
• Embedded resources rely on MTAPI for resource management
• Flow: OpenMP application (containing task constructs) → compiler → OpenMP-MTAPI IR → code generator → CPU binary → linker (against the MTAPI RTL) → executable

Outline
• Introduction
• Related Work
• Work Completed
• Proposed Work
• Plan of Work
• Acknowledgment

OpenMP → MTAPI Implementation: SparseLU

Takeaways and Summary
• Industry standards are the way to go!
  – Targeting heterogeneous platforms
• OpenMP-MCA incurred little to no overhead
• Lower learning curve
• Ability to maintain a single code base