OpenMP for Embedded Systems


Sunita Chandrasekaran
Asst. Professor, Dept. of Computer & Information Sciences, University of Delaware
[email protected]
ACK: Peng Sun, Suyang Zhu, Cheng Wang, Barbara Chapman, Tobias Schuele, Marcus Winter
Talk @ the OpenMP Booth #611, November 16, 2016

Heterogeneous Embedded Systems
• Incorporate specialized processing capabilities to handle specific tasks
• Examples
  – CPU + GPU
  – ARM + GPU
  – ARM + DSP
  – CPU + FPGA
Figure. Qualcomm Snapdragon™ 810 Block Diagram
Figure. NVIDIA Tegra K1 Block Diagram

Programming Multicore Embedded Systems – A Real Challenge
• Heterogeneous systems present complexity at both silicon and system level
• Standards and tool-chains in the embedded industry are proprietary
• Portability and scalability issues
• High time-to-market (TTM) solutions
• We need industry standards
  – To offer portable and scalable software solutions and target more than one platform

How suitable are the state-of-the-art models for heterogeneous embedded systems?
• Not portable across more than one type of platform, except for OpenCL
• Most models are heavy-weight for embedded processors with limited resources
• Most models require support from OS and compilers
  – Sometimes embedded systems are bare-metal
• Some of the solutions are restricted to just the homogeneous environment

So what do we really need?
• Something that’s not too low-level
• Something light-weight
• Something that can target heterogeneous embedded platforms (beyond CPUs-GPUs)
• Something that can help speed time-to-market for products
• Last but not least – we need industry standards

Using industry standards
• Two of them used for this work
  – OpenMP
    • (high-level, directive-based)
  – Multicore Association (MCA) APIs
    • (low-level, light-weight, catered to embedded systems)

Briefly, on OpenMP Implementations
• Directives implemented via code modification and insertion of runtime library calls
  – Typical approach is outlining of code in parallel region
  – Or generation of micro tasks
• Runtime library responsible for managing threads
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• Implementation effort is reasonable

OpenMP Code:

    int main(void)
    {
      int a,b,c;
      #pragma omp parallel private(c)
      do_sth(a,b,c);
      return 0;
    }

Translation:

    _INT32 main()
    {
      int a,b,c;
      /* microtask */
      void __ompregion_main1()
      {
        _INT32 __mplocal_c;
        /* shared variables are kept intact, substitute accesses to private variable */
        do_sth(a, b, __mplocal_c);
      }
      …
      /* OpenMP runtime calls */
      __ompc_fork(&__ompregion_main1);
      …
    }

Each compiler has custom run-time support. Quality of the runtime system has major impact on performance.

History of OpenMP*
Figure. Timeline of OpenMP specification releases, 1997–2018 (versions 1.0, 1.1, 2.0, 2.5, 3.0, 3.1, 4.0, 4.5, 5.0), with the number of permanent and auxiliary ARB members per year. Annotations on the timeline:
• In spring, 7 vendors and the DOE agree on the spelling of parallel loops and form the OpenMP ARB. By October, version 1.0 of the OpenMP specification for Fortran is released.
• 1.1: Minor modifications.
• cOMPunity, the group of OpenMP users, is formed and organizes workshops on OpenMP in North America, Europe, and Asia.
• First hybrid applications with MPI* and OpenMP appear.
• The merge of Fortran and C/C++ specifications begins.
• 2.5: Unified Fortran and C/C++: bigger than both individual specifications combined.
• The first International Workshop on OpenMP is held. It becomes a major forum for users to interact with vendors.
• 3.0: Incorporates task parallelism. A hard problem as OpenMP struggles to maintain its thread-based nature, while accommodating the dynamic nature of tasking.
• 3.1: Support min/max reductions in C/C++.
• 4.0: Supports offloading execution to accelerator and coprocessor devices, SIMD parallelism, and more. Expands OpenMP beyond traditional boundaries.
• 4.5: OpenMP supports taskloops, task priorities, doacross loops, and hints for locks. Offloading now supports asynchronous execution and dependencies to host execution.
• 5.0: ?

Multicore Association (MCA) APIs
• Develops standards to reduce complexity involved in writing software for multicore chips
• Capturing the basic elements and abstract hardware and system resources
• Cohesive set of foundation APIs
  – Standardize communication (MCAPI)
  – Resource Sharing (MRAPI)
  – Task Management (MTAPI)

Multicore Task Management API (MTAPI)
The Multicore Association develops and promotes open specifications for multicore product development.
MTAPI
• Standardized API for task-parallel programming on a wide range of hardware architectures
• Developed and driven by practitioners of market-leading companies
• Part of Multicore-Association’s ecosystem (MRAPI, MCAPI, SHIM, OpenAMP, …)
Heterogeneous Systems
• Shared memory
• Distributed memory
• Different instruction set architectures
Figure labels: Tasks, Queues; Contributing members; Working group lead.
Ack: Siemens (Tobias Schuele, Urs Gleim)
Unrestricted © Siemens AG 2016. All rights reserved

OpenMP and MCA software stack

MTAPI for Heterogeneous Systems (cont.)
MTAPI Jobs, Tasks & Actions in a Nutshell
Example for the usage of MTAPI in heterogeneous systems:
Figure labels: Node 1 (CPU), Node 2 (GPU), Node 3 (DSP); Application; Task 1, Task 2, Task 3; Job A, Job B (matrix multiplication, FFT); Action I, Action II, Action III.
Ack: Siemens (Tobias Schuele, Urs Gleim)

MTAPI Implementations
Embedded Multicore Building Blocks (EMB²)¹
• Open source library and runtime platform for embedded multicore systems
• Easy parallelization of existing code using high-level patterns
• Real-time capability, resource awareness
• Fine-grained control over core usage (task priorities, affinities)
• Lock-/wait-free implementation
Figure. EMB² software stack, top to bottom: Application; Dataflow, Algorithms, Containers; Task management (MTAPI); Base library (abstraction layer); Operating system / hypervisor; Hardware.

UH-MTAPI²
• MTAPI implementation developed at the Universities of Houston / Delaware
• Utilizes MCAPI for inter-node communication and MRAPI for resource management
• Has been used as runtime system for OpenMP programs

¹ https://github.com/siemens/embb
² https://github.com/MCAPro2015/OpenMP_MCA_Project

MTAPI Scheduling
Example for scheduling MTAPI tasks in heterogeneous systems:
Figure labels: Node 0 (CPU) with Core 0 and Core 1, worker thread 0 and worker thread 1, queues Q00–Q02 and Q10–Q12, work stealing; Node 1 (DSP) with Unit 0, queue Q0, bare metal; Scheduler; work dealing.
Ack: Siemens (Tobias Schuele, Urs Gleim)

Testbed, Compiler and Benchmark
• Test beds: NVIDIA Jetson TK1 embedded development board with a Tegra K1 SoC (NVIDIA 4-Plus-1 Quad-Core ARM Cortex-A15 processor and a Kepler GPU with 192 CUDA cores).
• Compiler: Jetson (GCC 4.8.4, NVCC V6.5.30)
• Power Architecture from Freescale
  – Consisting of Pattern Matching Engine as specialized accelerator
• Benchmarks: ¹Rodinia and ²BOTS.
• Reference Group: ³Siemens MTAPI, GCC OpenMP

¹ Rodinia: https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators