Implementing Openmp on a High Performance Embedded Multicore Mpsoc

Implementing OpenMP on a high performance embedded multicore MPSoC Barbara Chapman, Lei Huang Eric Biscondi, Eric Stotzer, Ashish Shrivastava and Alan Gatherer University of Houston Texas Instruments Houston, TX, USA Houston, TX, USA Email: {chapman,leihuang}@cs.uh.edu Email: {eric-biscondi,estotzer,ashu,gatherer}@ti.com Abstract components of an application. We are also considering how code expressed in OpenMP may be translated to run on dif- In this paper we discuss our initial experiences adapting ferent kinds of embedded devices. Since we also require the OpenMP to enable it to serve as a programming model for explicit specification of interfaces and data transfers between high performance embedded systems. A high-level program- components, the result will be a hybrid distributed/shared ming model such as OpenMP has the potential to increase programming model that can be used to model both the programmer productivity, reducing the design/development individual tasks within an embedded application as well costs and time to market for such systems. However, as their interactions. Our implementation is based upon OpenMP needs to be extended if it is to meet the needs of OpenUH, our robust Open64-based compiler infrastructure, embedded application developers, who require the ability to which will be connected with a native code generator to express multiple levels of parallelism, real-time and resource permit us to evaluate them on a real-world system. constraints, and to provide additional information in support This paper is organized as follows. In Section 2, we give of optimization. It must also be capable of supporting the a brief overview of the state of the art in programming mapping of different software tasks, or components, to the embedded systems. Then we describe OpenMP, its imple- devices configured in a given architecture. mentation, and our compiler. Section 4 gives our current ideas on the enhancements needed to this API. We next 1. Introduction describe the current state of our implementation, followed by early results. Finally, our conclusions are discussed in The rapid penetration of multicore and simultaneous Section 7. multithreading hardware technology in the marketplace is changing the nature of application development in all areas 2. Embedded Systems Design of computing. The use of multicore technology, which offers more compute power with lower energy consumption, will increase the design space and significantly increase the 2.1. Hardware performance of Multiprocessor Systems-on-Chip (MPSoCs) that are targeted at embedded applications. Software for Embedded system designers already rely extensively on embedded systems will become more complex, as multi- architectural parallelism to accomplish their goals. MP- core hardware will enable more functions to be implemented SoCs are composed of multiple heterogeneous processing on the same device. This complexity will make the pro- elements or cores, such as RISC processors, digital sig- duction of software the critical path in embedded systems nal processors (DSP), application specific instruction pro- development and make hardware/software designing, parti- cessors (ASIP), field programmable gate arrays (FPGA), tioning, scheduling, correctness verification, quality control stream processors, SIMD processors, accelerators, etc. In and quality assurance more difficult. addition to the heterogeneity of their cores, an MPSoC’s In this paper, we discuss our initial work to adapt an memory system can be heterogeneous where blocks of existing shared memory programming interface, OpenMP, memory are accessible to a subset of the processors. Texas to enable the specification of high performance embedded Instruments DaVinci [1] and OMAP 3 [2] are RISC + applications. We are considering new language features that DSP configurations. Texas Instruments TNETV3020 [3] allow non-functional characteristics or constraints, such as and TMS320TCI6488 [4], Along with an increase in the deadlines, priorities, power constraints, etc., to be specified number and heterogeneity of cores, MPSoCs with multiple as well as notation to enable a high-level description of the subsystems that are able to exploit both asymmetric chip This work is supported by the National Science Foundation, under contracts multiprocessing (AMP) and symmetric chip multiprocessing CCF-0833201 and by Texas Instruments. (SMP) will begin to appear in the near future [5]. 2.2. Embedded Applications extended the C language to address the parallelism on their custom devices (e.g. Clearspeed’s Cn and Intel’s Ct). Embedded systems applications can be broadly classified Data and compute intensive embedded applications can as event-driven or compute and data intensive. In event- often be described using the stream programming model. driven applications, such as automotive embedded systems, Characteristics include large streams of data with high the behavior is characterized by a defined sequence of data rates to and from data sources and among processing responses to a given incoming event from the environment. elements. Examples include signal and image processing of While the actions performed may not be compute or data voice and video data, data compression and encryption, cell intensive, they tend to have strict real time constraints. In phones, cell phone base stations, financial modeling and contrast, compute and data intensive applications, such as HDTV editing consoles. A streaming program typically has wireless infrastructure, telecommunication systems, biomed- two levels. The stream level controls the order of kernel ical devices and multimedia processing, must be designed execution and transfer of data streams for each kernel. to handle and perform computations on large streams of The kernel level represents the actual computation for each data. As the computational complexity increase for such stream element. The appeal of this model is apparent from systems, better design methodologies are needed to find a plethora of related research [11], [12]. an optimal software/hardware configuration while meeting cost, performance, and power constraints. Because growth in 3. OpenMP and OpenUH complexity is perhaps the most significant trend in wireless and multimedia applications, more and more functionality OpenMP is a widely-adopted shared memory parallel is being implemented in software to help shorten the design programming interface providing high level programming cycle. Furthermore, the use of streamlined Application Spe- constructs that enable the user to easily expose an appli- cific Instruction Set Processors (ASIPs) are often employed cation’s task and loop level parallelism in an incremental instead of general purpose microprocessors to help limit fashion. Its range of applicability was significantly extended costs and improve power efficiency [6]. recently by the addition of explicit tasking features [13]. OpenMP has already been proposed as the starting point for 2.3. Programming Embedded Systems a programming model for heterogeneous systems by Intel, ClearSpeed, PGI and CAPS SA. The approach taken today to create embedded systems The idea behind OpenMP is that the user specifies the is not conducive to productivity. Embedded applications are parallelization strategy for a program at a high level by typically developed in a programming language that is close annotating the program code; the implementation works out to the hardware, such as C or assembly, in order to meet the detailed mapping of the computation to the machine. It is strict performance requirements. It is considered impractical the user’s responsibility to perform any code modifications to program high performance embedded applications using needed prior to the insertion of OpenMP constructs. In Java or C# due to resource constraints of the device. The particular, it requires that dependences that might inhibit dominant trend in considering enhancements for MPSoCs parallelization are detected and where possible, removed is to adapt existing programming languages to fit the strict from the code. The major features are directives that specify requirements of applications in the embedded domain, rather that a well-structured region of code should be executed by than designing new parallel programming languages. There a team of threads, who share in the work. Such regions may are no industry standards for programming MPSoCs, al- be nested. Worksharing directives are provided to effect a though a few have been proposed. In particular, the Mul- distribution of work among the participating threads. ticore Association has proposed a message passing model which they refer to as Communications API (CAPI)[7], but 3.1. Implementation of OpenMP this model is not yet completed. In the main, loosely-coupled approaches for designing Most OpenMP compilers translate OpenMP into mul- and developing software are taken, where the codes for tithreaded code with calls to a custom runtime library different devices are programmed separately, using differ- in a straightforward manner, either via outlining [14] or ent programming languages and runtime environments and inlining [15]. Since many execution details are often not system-specific mechanisms are used to enable the necessary known in advance, much of the actual work of assigning interactions. For example, GPU-based programming models computations must be performed dynamically. Part of the [8], [9] are dedicated to GPU devices and are not appropriate implementation complexity lies in ensuring that the presence for general

Implementing Openmp on a High Performance Embedded Multicore Mpsoc

Microprocessor

Heterogeneous Task Scheduling for Accelerated Openmp

Programmers' Tool Chain

Parallel Programming

Openmp API 5.1 Specification

Openmp Made Easy with INTEL® ADVISOR

Introduction to Openmp Paul Edmon ITC Research Computing Associate

Parallel Programming with Openmp

Intel Threading Building Blocks

Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation

Lecture 12: Introduction to Openmp (Part 1)

Programming Your GPU with Openmp*