Optimization of Multimedia Applications on Embedded Multicore Processors Elias Michel Baaklini

Total Page:16

File Type:pdf, Size:1020Kb

Optimization of Multimedia Applications on Embedded Multicore Processors Elias Michel Baaklini Optimization of multimedia applications on embedded multicore processors Elias Michel Baaklini To cite this version: Elias Michel Baaklini. Optimization of multimedia applications on embedded multicore processors. Other. Université de Valenciennes et du Hainaut-Cambresis; Arab open university. Lebanon branch (Antelias, Liban), 2014. English. NNT : 2014VALE0004. tel-01127384 HAL Id: tel-01127384 https://tel.archives-ouvertes.fr/tel-01127384 Submitted on 7 Mar 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Thèse de doctorat Pour obtenir le grade de Docteur de l’Université de VALENCIENNES ET DU HAINAUT-CAMBRESIS et de l’ARAB OPEN UNIVERSITY, LIBAN Discipline : Informatique Présentée et soutenue par Elias Michel, Baaklini. Le 12/02/2014, à Valenciennes, France. Ecole doctorale : Sciences Pour l’Ingénieur (SPI), Collège Doctoral Lille Nord de France Equipe de recherche, laboratoire : Laboratoire d’Automatique, de Mécanique et d’Informatique Industrielles et Humaines (LAMIH) Optimisation des Applications Multimédia sur des Processeurs Multicoeurs Embarqués JURY Président du jury - Artiba, Abdelhakim, Professeur, Université de Valenciennes, France. Rapporteurs - Goossens, Bernard, Professeur, Université de Perpignan, France. - Diguet, Jean-Philippe, Docteur, Directeur de recherche CNRS, LAB-STICC, Lorient, France. Examinateurs - Yurdakul, Arda, Professeur, Bogazici University, Istanbul, Turquie. - Artiba, Abdelhakim, Professeur, Université de Valenciennes, France. Directeur de thèse - Niar, Smail, Professeur, LAMIH, Université de Valenciennes, Valenciennes, France. Co-encadrant - Sbeity, Hassan, Assistant Professor, Arab Open University, Beyrouth, Liban. Optimization of Multimedia Applications on Embedded Multicore Processors Elias Michel Baaklini University of Valenciennes France A thesis submitted for the degree of Doctor of Philosophy February 2014 Acknowledgments The work presented in this thesis was carried out in the computer science departments at the University of Valenciennes in France and the Arab Open University in Lebanon. I would like to thank all the people who made this research successful. In the first place, I would like to show my greatest gratitude to my thesis director Smail Niar, professor at University of Valenciennes, France, and to my supervisor Hassan Sbeity, assistant professor at Arab Open University in Beirut, Lebanon, for their continuous gui- dance, inspiration, and motivation. Moreover, I would like to deeply thank my reviewers, Bernard Goos- sens, professor at University of Perpignan, and Jean-Philippe Diguet, research director at LAB-STICC in Lorient, for reviewing, correcting, and enhancing my thesis dissertation with their detailed comments and constructive feedback. In addition, I would like to thank and to show my appreciation to my examiners, Arda Yurdakul, professor at Bogazici University in Istanbul, Turkey, and Abdelhakim Artiba, professor at University of Valenciennes, France. Furthermore, I cannot forget all the support of my friends and col- leagues who helped me during my long journey. Thank you all for your company and assistance. Finally, I am very thankful for the valuable support of my family. Special gratitude goes to my mother, Fairouz, for her permanent love and encouragement. I would like to dedicate this PhD thesis to my loving mother Fairouz and to my late father Michel. Elias Baaklini R´esum´e L’utilisation de plusieurs cœurs pour l’ex´ecution des applications mo- biles sera l’approche dominante dans les syst`emesembarqu´espour les prochaines ann´ees.Cette approche permet en g´en´eraled’augmenter les performances du syst`emesans augmenter la vitesse de l’horloge. Grace `acela, la consommation d’´energie reste mod´er´ee.Toutefois, la concurrence entre les tˆaches doit ˆetreexploit´eeafin d’am´eliorerles per- formances du syst`emedans les diff´erentes situations ou l’application peut s’ex´ecuter. Les applications multim´edias comme la vid´eoconf´erence ou la vid´eo haute d´efinition, ont de nombreuses nouvelles fonctionnalit´esqui n´ece- ssitent des calculs complexes par rapport aux normes pr´ec´edentes de codage vid´eo.Ces applications cr´eent une charge de travail tr`esimpor- tante sur les syst`emesmultiprocesseurs. L’exploitation du parall´elisme pour les applications multim´edia, comme le codec vid´eoH.264/AVC, peut se faire `adiff´erents niveaux : au niveau de donn´eesou bien au niveau tˆaches. Dans le cadre de cette th`esede doctorat, nous proposons de nouvelles solutions pour une meilleure exploitation du parall´elismedans les ap- plications multim´edia sur des syst`emesembarqu´esayant une architec- ture parall`elesym´etrique (ou SMP pour Symmetric Multi-Processor). Des approches innovantes pour le d´ecodeur H.264/AVC qui traitent des composantes de couleur et des blocs de l’image en parall`elesont propos´eeset exp´eriment´ees. Mots Cl´es: Multim´edia, Standard H.264/AVC, Compression Vid´eo, Optimisation, Calcul Parall`ele, Syst`emesEmbarqu´es,Processeurs Mul- ticoeurs. Abstract Parallel computing is currently the dominating architecture in em- bedded systems. Concurrency improves the performance of the sys- tem rather without increasing the clock speed which affects the power consumption of the system. However, concurrency needs to be exploi- ted in order to improve the system performance in different applica- tions environments. Multimedia applications (real-time conversational services such as vi- deo conferencing, video phone, etc.) have many new features that re- quire complex computations compared to previous video coding stan- dards. These applications have a challenging workload for future mul- tiprocessors. Exploiting parallelism in multimedia applications can be done at data and functional levels or using different instruction sets and architectures. In this research, we design new parallel algorithms and mapping me- thodologies in order to exploit the natural existence of parallelism in multimedia applications, specifically the H.264/AVC video decoder. We mainly target symmetric shared-memory multiprocessors (SMPs) for embedded devices such as ARM Cortex-A9 multicore chips. We evaluate our novel parallel algorithms of the H.264/AVC video decoder on different levels : memory load, energy consumption, and execution time. Keywords : Multimedia, H.264/AVC Standard, Video Compression, Optimization, Parallel Computing, Embedded Systems, Multicore Pro- cessors. Contents Contents ix List of Figures xiv Nomenclature xvii 1 Introduction 2 1.1 Background and Motivation . 2 1.2 Problem Statement . 3 1.3 Existing Solutions . 4 1.4 Contributions . 4 1.5 Outline . 5 2 Parallel Computing 6 2.1 Introduction . 6 2.2 Types of Parallelism . 7 2.2.1 Classification of Processors . 7 2.2.2 Instruction-Level Parallelism . 9 2.2.3 Data-Level Parallelism . 9 2.2.4 Thread-Level Parallelism . 11 2.3 Memory Architecture for Parallel Systems . 17 2.3.1 Main Memory . 17 2.3.2 Processor Communication . 18 2.3.3 Memory Access . 18 2.3.4 Caches . 18 ix CONTENTS CONTENTS 2.3.5 Cache Coherency . 19 2.4 Parallel Applications . 21 2.4.1 Amdahl’s Law . 21 2.4.2 Challenges of Parallel Processing . 22 2.4.3 Programming Languages . 25 2.5 Conclusion . 27 3 H.264/AVC Standard Overview 28 3.1 Introduction . 28 3.2 Video Coding Review . 29 3.2.1 Digital Video . 29 3.2.2 Block Based Video Coding . 32 3.2.3 Video Coding Standards . 37 3.3 Standard Development . 39 3.4 Features and Tools . 40 3.4.1 Layer Structure . 41 3.4.2 Profiles and Levels . 41 3.4.3 Picture Format . 42 3.5 Video Coding . 43 3.5.1 Encoder . 43 3.5.2 Decoder . 45 3.6 Coding Tools and Functions . 45 3.6.1 Intra Prediction . 46 3.6.2 Inter Prediction . 47 3.6.3 Transform and Quantization . 50 3.6.4 Skipped Macroblocks . 50 3.6.5 Deblocking Filter . 51 3.7 Parallel Implementations . 51 3.7.1 Slice-Level . 51 3.7.2 Macroblock-Level . 52 3.7.3 Deblocking Filter . 53 3.7.4 Discussion . 53 3.8 Summary and Conclusion . 54 x CONTENTS CONTENTS 4 H.264 Color Components Parallel Decoding 56 4.1 Introduction . 56 4.2 Parallel Decoding . 57 4.2.1 Stages Decomposition . 57 4.2.2 Color Components Processing . 58 4.2.3 Parallel Execution and Synchronization . 59 4.2.4 Pipeline Execution . 62 4.3 Experiments with JM H.264 Reference Software . 63 4.3.1 MPARM simulator and H.264 porting . 64 4.3.2 Profiling H.264 Stages . 64 4.3.3 Discussion . 64 4.3.4 Speedup using Parallelism . 65 4.4 Experiments with FFmpeg H.264 Decoder . 66 4.4.1 Multi2Sim Simulator . 66 4.4.2 FFmpeg H.264 Implementation . 67 4.4.3 Speedup using Parallelism . 67 4.4.4 Power Efficiency . 69 4.4.5 FFmpeg Multi-Threaded Version . 70 4.5 Conclusion . 71 5 H.264 Macroblocks Rows Parallel Decoding 74 5.1 Introduction . 74 5.2 Decoder Decomposition . 76 5.2.1 Decoding Stages . 76 5.2.2 Macroblocks . 77 5.3 Parallel Implementation . 78 5.3.1 Parallel Motion Compensation . 79 5.3.2 Macroblocks Dependencies . 80 5.3.3 IDR Frame Frequency . 81 5.3.4 Macroblock Dependency Check Algorithm . 82 5.3.5 Macroblocks Partitioning . 84 5.3.6 Scalability of Parallel Motion Compensation . 85 5.3.7 Parallel Deblocking Filter . 86 xi CONTENTS CONTENTS 5.4 Experimental Results on Multicore Systems . 87 5.4.1 Parallel Execution . 88 5.4.2 Environment and Configurations . 88 5.4.3 Results for Parallel Motion Compensation . 89 5.4.4 Comparison with Related Work . 91 5.4.5 Results for Parallel Deblocking Filter . 93 5.4.6 Results for Overall Execution .
Recommended publications
  • High-Performance Computing and Compiler Optimization
    High-Performance Computing and Compiler Optimization P. (Saday) Sadayappan August 2019 What is a compiler? What •is Traditionally:a Compiler? Program that analyzes and translates from a high level language (e.g., C++) to low-level assembly languageCompilers that can be executed are translators by hardware • Fortran var a • C Machine code var b int a, b; Virtual machine code • C++ mov 3 a a = • 3;Java Transformed source code mov 4 r1 if (a• Text < processing4) { translate Augmented sourcecmpi a r1 b language= 2; • HTML/XML code jge l_e } else { • Command & Low-level commands mov 2 b b Scripting= 3; Semantic jmp l_d } Languages components • Natural language Anotherl_e: language mov 3 b • Domain specific l_d: ;done languages Wednesday,Wednesday, August 22, August 12 22, 12 Source: Milind Kulkarni 2 Compilers are optimizers • Can perform optimizations to make a program more What isefficient a Compiler? Compilers are optimizers var a • Can perform optimizations to make a program more efficient var b var a var c var b int a, b, c; mov a r1 var c b = av + 3; addi 3 r1 mov a r1 var a c = av + 3; mov varr1 bb addivar a3 r1 mov vara r2c movvar r1b b int a, b, c; addimov 3 ar2 r1 movvar r1c c b = a + 3; mov addir2 c3 r1 mov a Source:r1 Milind Kulkarni c = a + 3; mov r1 b addi 3 r1 ♦ Early days of computing: Minimizingmov a numberr2 of executedmov r1 b instructions Wednesday,minimized August 22, 12 program execution timeaddi 3 r2 mov r1 c § Sequential processors: had a singlemov functional r2 c unit to execute instructions § Compiler technology is very advanced in minimizing number of instructions ♦ Today:Wednesday, August 22, 12 § All computers are highly parallel: must make use of all parallel resources § Cost of data movement dominates the cost of performing the operations on data § Many challenging problems for compilers 3 The Good Old Days for Software Source: J.
    [Show full text]
  • The Intel X86 Microarchitectures Map Version 2.0
    The Intel x86 Microarchitectures Map Version 2.0 P6 (1995, 0.50 to 0.35 μm) 8086 (1978, 3 µm) 80386 (1985, 1.5 to 1 µm) P5 (1993, 0.80 to 0.35 μm) NetBurst (2000 , 180 to 130 nm) Skylake (2015, 14 nm) Alternative Names: i686 Series: Alternative Names: iAPX 386, 386, i386 Alternative Names: Pentium, 80586, 586, i586 Alternative Names: Pentium 4, Pentium IV, P4 Alternative Names: SKL (Desktop and Mobile), SKX (Server) Series: Pentium Pro (used in desktops and servers) • 16-bit data bus: 8086 (iAPX Series: Series: Series: Series: • Variant: Klamath (1997, 0.35 μm) 86) • Desktop/Server: i386DX Desktop/Server: P5, P54C • Desktop: Willamette (180 nm) • Desktop: Desktop 6th Generation Core i5 (Skylake-S and Skylake-H) • Alternative Names: Pentium II, PII • 8-bit data bus: 8088 (iAPX • Desktop lower-performance: i386SX Desktop/Server higher-performance: P54CQS, P54CS • Desktop higher-performance: Northwood Pentium 4 (130 nm), Northwood B Pentium 4 HT (130 nm), • Desktop higher-performance: Desktop 6th Generation Core i7 (Skylake-S and Skylake-H), Desktop 7th Generation Core i7 X (Skylake-X), • Series: Klamath (used in desktops) 88) • Mobile: i386SL, 80376, i386EX, Mobile: P54C, P54LM Northwood C Pentium 4 HT (130 nm), Gallatin (Pentium 4 Extreme Edition 130 nm) Desktop 7th Generation Core i9 X (Skylake-X), Desktop 9th Generation Core i7 X (Skylake-X), Desktop 9th Generation Core i9 X (Skylake-X) • Variant: Deschutes (1998, 0.25 to 0.18 μm) i386CXSA, i386SXSA, i386CXSB Compatibility: Pentium OverDrive • Desktop lower-performance: Willamette-128
    [Show full text]
  • The Paramountcy of Reconfigurable Computing
    Energy Efficient Distributed Computing Systems, Edited by Albert Y. Zomaya, Young Choon Lee. ISBN 978-0-471--90875-4 Copyright © 2012 Wiley, Inc. Chapter 18 The Paramountcy of Reconfigurable Computing Reiner Hartenstein Abstract. Computers are very important for all of us. But brute force disruptive architectural develop- ments in industry and threatening unaffordable operation cost by excessive power consumption are a mas- sive future survival problem for our existing cyber infrastructures, which we must not surrender. The pro- gress of performance in high performance computing (HPC) has stalled because of the „programming wall“ caused by lacking scalability of parallelism. This chapter shows that Reconfigurable Computing is the sil- ver bullet to obtain massively better energy efficiency as well as much better performance, also by the up- coming methodology of HPRC (high performance reconfigurable computing). We need a massive cam- paign for migration of software over to configware. Also because of the multicore parallelism dilemma, we anyway need to redefine programmer education. The impact is a fascinating challenge to reach new hori- zons of research in computer science. We need a new generation of talented innovative scientists and engi- neers to start the beginning second history of computing. This paper introduces a new world model. 18.1 Introduction In Reconfigurable Computing, e. g. by FPGA (Table 15), practically everything can be implemented which is running on traditional computing platforms. For instance, recently the historical Cray 1 supercomputer has been reproduced cycle-accurate binary-compatible using a single Xilinx Spartan-3E 1600 development board running at 33 MHz (the original Cray ran at 80 MHz) 0.
    [Show full text]
  • The Intel X86 Microarchitectures Map Version 2.2
    The Intel x86 Microarchitectures Map Version 2.2 P6 (1995, 0.50 to 0.35 μm) 8086 (1978, 3 µm) 80386 (1985, 1.5 to 1 µm) P5 (1993, 0.80 to 0.35 μm) NetBurst (2000 , 180 to 130 nm) Skylake (2015, 14 nm) Alternative Names: i686 Series: Alternative Names: iAPX 386, 386, i386 Alternative Names: Pentium, 80586, 586, i586 Alternative Names: Pentium 4, Pentium IV, P4 Alternative Names: SKL (Desktop and Mobile), SKX (Server) Series: Pentium Pro (used in desktops and servers) • 16-bit data bus: 8086 (iAPX Series: Series: Series: Series: • Variant: Klamath (1997, 0.35 μm) 86) • Desktop/Server: i386DX Desktop/Server: P5, P54C • Desktop: Willamette (180 nm) • Desktop: Desktop 6th Generation Core i5 (Skylake-S and Skylake-H) • Alternative Names: Pentium II, PII • 8-bit data bus: 8088 (iAPX • Desktop lower-performance: i386SX Desktop/Server higher-performance: P54CQS, P54CS • Desktop higher-performance: Northwood Pentium 4 (130 nm), Northwood B Pentium 4 HT (130 nm), • Desktop higher-performance: Desktop 6th Generation Core i7 (Skylake-S and Skylake-H), Desktop 7th Generation Core i7 X (Skylake-X), • Series: Klamath (used in desktops) 88) • Mobile: i386SL, 80376, i386EX, Mobile: P54C, P54LM Northwood C Pentium 4 HT (130 nm), Gallatin (Pentium 4 Extreme Edition 130 nm) Desktop 7th Generation Core i9 X (Skylake-X), Desktop 9th Generation Core i7 X (Skylake-X), Desktop 9th Generation Core i9 X (Skylake-X) • New instructions: Deschutes (1998, 0.25 to 0.18 μm) i386CXSA, i386SXSA, i386CXSB Compatibility: Pentium OverDrive • Desktop lower-performance: Willamette-128
    [Show full text]
  • Redalyc.Optimization of Operating Systems Towards Green Computing
    International Journal of Combinatorial Optimization Problems and Informatics E-ISSN: 2007-1558 [email protected] International Journal of Combinatorial Optimization Problems and Informatics México Appasami, G; Suresh Joseph, K Optimization of Operating Systems towards Green Computing International Journal of Combinatorial Optimization Problems and Informatics, vol. 2, núm. 3, septiembre-diciembre, 2011, pp. 39-51 International Journal of Combinatorial Optimization Problems and Informatics Morelos, México Available in: http://www.redalyc.org/articulo.oa?id=265219635005 How to cite Complete issue Scientific Information System More information about this article Network of Scientific Journals from Latin America, the Caribbean, Spain and Portugal Journal's homepage in redalyc.org Non-profit academic project, developed under the open access initiative © International Journal of Combinatorial Optimization Problems and Informatics, Vol. 2, No. 3, Sep-Dec 2011, pp. 39-51, ISSN: 2007-1558. Optimization of Operating Systems towards Green Computing Appasami.G Assistant Professor, Department of CSE, Dr. Pauls Engineering College, Affiliated to Anna University – Chennai, Villupuram, Tamilnadu, India E-mail: [email protected] Suresh Joseph.K Assistant Professor, Department of computer science, Pondicherry University, Pondicherry, India E-mail: [email protected] Abstract. Green Computing is one of the emerging computing technology in the field of computer science engineering and technology to provide Green Information Technology (Green IT / GC). It is mainly used to protect environment, optimize energy consumption and keeps green environment. Green computing also refers to environmentally sustainable computing. In recent years, companies in the computer industry have come to realize that going green is in their best interest, both in terms of public relations and reduced costs.
    [Show full text]
  • Extracting Parallelism from Legacy Sequential Code Using Software Transactional Memory
    Extracting Parallelism from Legacy Sequential Code Using Software Transactional Memory Mohamed M. Saad Preliminary Examination Proposal submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Binoy Ravindran, Chair Anil Kumar S. Vullikanti Paul E. Plassmann Robert P. Broadwater Roberto Palmieri Sedki Mohamed Riad May 5, 2015 Blacksburg, Virginia Keywords: Transaction Memory, Software Transaction Memory (STM), Automatic Parallelization, Low-Level Virtual Machine, Optimistic Concurrency, Speculative Execution, Legacy Systems Copyright 2015, Mohamed M. Saad Extracting Parallelism from Legacy Sequential Code Using Software Transactional Memory Mohamed M. Saad (ABSTRACT) Increasing the number of processors has become the mainstream for the modern chip design approaches. On the other hand, most applications are designed or written for single core processors; so they do not benefit from the underlying computation resources. Moreover, there exists a large base of legacy software which requires an immense effort and cost of rewriting and re-engineering. Transactional memory (TM) has emerged as a powerful concurrency control abstraction. TM simplifies parallel programming to the level of coarse-grained locking while achieving fine- grained locking performance. In this dissertation, we exploit TM as an optimistic execution approach for transforming a sequential application into parallel. We design and implement two frameworks that support automatic parallelization: Lerna and HydraVM. HydraVM is a virtual machine that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data depen- dency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler.
    [Show full text]
  • High-Performance Parallel Computing
    High-Performance Parallel Computing P. (Saday) Sadayappan Rupesh Nasre – 1 – Course Overview • Emphasis on algorithm development and programming issues for high performance • No assumed background in computer architecture; assume knowledge of C • Grading: • 60% Programming Assignments (4 x 15%) • 40% Final Exam (July 4) • Accounts will be provided on IIT-M system – 2 – Course Topics • Architectures n Single processor core n Multi-core and SMP (Symmetric Multi-Processor) systems n GPUs (Graphic Processing Units) n Short-vector SIMD instruction set architectures • Programming Models/Techniques n Single processor performance issues l Caches and Data locality l Data dependence and fine-grained parallelism l Vectorization (SSE, AVX) n Parallel programming models l Shared-Memory Parallel Programming (OpenMP) l GPU Programming (CUDA) – 3 – Class Meeting Schedule • Lecture from 9am-12pm each day, with mid-class break • Optional Lab session from 12-1pm each day • 4 Programming Assignments Topic Due Date Weightage Data Locality June 24 15% OpenMP June 27 15% CUDA June 30 15% Vectorization July 2 15% – 4 – The Good Old Days for Software Source: J. Birnbaum • Single-processor performance experienced dramatic improvements from clock, and architectural improvement (Pipelining, Instruction-Level- Parallelism) • Applications experienced automatic performance – 5 – improvement Hitting the Power Wall 1000 Power doubles every 4 years Sun's 5-year projection: 200W total, 125 W/cm2 ! Rocket NozzleSurface Nuclear Reactor 100 2 Pentium® 4 m c Pentium® III / Hot plate s t Pentium® II t 10 a Pentium® Pro W i386 Pentium® i486 P=VI: 75W @ 1.5V = 50 A! 1 1.5m 1m 0.7m 0.5m 0.35m 0.25m 0.18m 0.13m 0.1m 0.07m * “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”6 – Fred Pollack, Intel Corp.
    [Show full text]
  • The Intel X86 Microarchitectures Map Version 1.1
    The Intel x86 Microarchitectures Map Version 1.1 P6 (1995, 0.50 to 0.35 μm) 8086 (1978, 3 µm) 80386 (1985, 1.5 to 1 µm) P5 (1993, 0.80 to 0.35 μm) NetBurst (2000 , 180 to 130 nm) Skylake (2015, 14 nm) Alternative Names: i686 Alternative Names: iAPX 86 Alternative Names: 386, i386 Alternative Names: Pentium, 80586, 586, i586 Alternative Names: Pentium 4, Pentium IV, P4 Alternative Names: SKL (Desktop and Mobile), SKX (Server) Series: Pentium Pro (used in desktops and servers) Series: 8086, 8088 (cheaper) Series: Series: Series: Series: • Variant: Klamath (1997, 0.35 μm) • Launch: i386DX Desktop/Server: P5, P54C • Desktop: Willamette (180 nm) • Desktop: Desktop 6th Generation Core i5 (Skylake-S and Skylake-H) • Alternative Names: Pentium II, PII • Cheaper: i386SX Desktop/Server higher-performance: P54CQS, P54CS • Desktop higher-performance: Northwood Pentium 4 (130 nm), Northwood B Pentium 4 HT (130 nm), • Desktop higher-performance: Desktop 6th Generation Core i7 (Skylake-S and Skylake-H), Desktop 7th Generation Core i7 X (Skylake-X), • Series: Klamath (used in desktops) • Lower-power: i386SL, 80376, i386EX, Mobile: P54C, P54LM Northwood C Pentium 4 HT (130 nm), Gallatin (Pentium 4 Extreme Edition 130 nm) Desktop 7th Generation Core i9 X (Skylake-X), Desktop 9th Generation Core i7 X (Skylake-X), Desktop 9th Generation Core i9 X (Skylake-X) • Variant: Deschutes (1998, 0.25 to 0.18 μm) i386CXSA, i386SXSA, i386CXSB Compatibility: Pentium OverDrive • Desktop lower-performance: Willamette-128 (Celeron), Northwood-128 (Celeron) • Desktop lower-performance:
    [Show full text]
  • Analysis of Task Scheduling for Multi-Core Embedded Systems
    Analysis of task scheduling for multi-core embedded systems Analys av schemaläggning för multikärniga inbyggda system JOSÉ LUIS GONZÁLEZ-CONDE PÉREZ, MASTER THESIS Supervisor: Examiner: De-Jiu Chen, KTH Martin Törngren, KTH Detlef Scholle, XDIN AB Barbro Claesson, XDIN AB MMK 2013:49 MDA 462 Acknowledgements I would like to thank my supervisors Detlef Scholle and Barbro Claesson for giving me the opportunity of doing the Master thesis at XDIN. I appreciate the kindness of Barbro chatting with me in Spanish and the support of Detlef no matter how much time it was required. I want to thank Sebastian, David and the other people at XDIN for the nice environment I lived in during these 20 weeks. I would like to thank the support and guidance of my supervisor at KTH DJ Chen and the help of my examiner Martin Törngren in the last stage of the thesis. I want to thank very much the other thesis colleagues at XDIN Joanna, Cheuk, Amir, Robin and Tobias. You have done this experience a lot more enriching. I would like to say merci! to my friends from Tyresö Benoit, Perrine, Simon, Audrey, Pierre, Marie-Line, Roberto, Alberto, Iván, Vincent, Olivier, Achour, Maxime, Si- mon, Emilie, Adelie, Siim and all the others. I have had great memories with you during the first year at KTH. I thank Osman and Tarek for this year in Midsom- markransen. I thank all the professors and staff from the Mechatronics department Mike, Bengt, Chen, Kalle, Jad and the others for making this programme possible, es- pecially Martin Edin Grimheden for his commitment with the students.
    [Show full text]
  • Thesis Title
    PANEPISTHMIO PATRWN TMHMA HLEKTROLOGWN MHQANIKWN KAI TEQNOLOGIAS UPOLOGISTWN Διπλωματική ErgasÐa tou foiτητή tou Τμήματoc Hlektroλόgwn Mhqanik¸n kai TeqnologÐac Upologist¸n thc Poλυτεχνικής Sqoλής tou PanepisthmÐou Patr¸n AsterÐou KwnstantÐnou tou Nikoλάου Ariθμός Mhtr¸ou: 228281 Jèma Ανάπτυξη kai beltistopoÐhsh tou OpenCL driver gia tic NEMA GPUs Implementation and Optimization of the OpenCL driver for the NEMA GPUs Epiblèpwn EpÐkouroc Καθηγητής MpÐrmpac Miχάλης, Panepiστήμιo Patr¸n Ariθμόc Diplwματικής ErgasÐac: 228281/2019 Πάτρα, 12/2019 PISTOPOIHSH PistopoieÐtai όti h Διπλωματική ErgasÐa me jèma Ανάπτυξη kai beltistopoÐhsh tou OpenCL driver gia tic NEMA GPUs Topic: Implementation and Optimization of the OpenCL driver for the NEMA GPUs Tou foiτητή tou Τμήματoc Hlektroλόgwn Mhqanik¸n kai TeqnologÐac Upologist¸n AsterÐou KwnstantÐnou tou Nikoλάου Ariθμός Mhtr¸ou: 228281 Παρουσιάστηκε δημόσια kai exetάστηκε sto Τμήμα Hlektroλόgwn Mhqanik¸n kai TeqnologÐac Upologist¸n stic / /2019 O epiblèpwn O διευθυντής Tomèa MpÐrmpac Miχάλης Paliουράς BasÐleioc EpÐkouroc Καθηγητής Καθηγητής Ariθμόc Διπλωματικής ErgasÐac: 228281/2019 Jèma: Ανάπτυξη kai beltistopoÐhsh tou OpenCL driver gia tic NEMA GPUs Topic: Implementation and Optimization of the OpenCL driver for the NEMA GPUs Foiτητής Epiblèpwn AsterÐou KwnstantÐnoc MpÐrmpac Miχάλης PerÐlhyh I) EISAGWGH Sto παρελθόn όla ta proγράμματa logiσμικού ήτan grammèna gia σειριακή epexergasÐa. Gia na lujeÐ èna πρόβλημα, katασκευάζοntan ènac αλγόριj- moc o opoÐoc ulopoiούντan wc mia σειριακή akoloujÐa entol¸n. H ektèlesh aut¸n twn entol¸n sunèbaine se ènan upologisτή me ènan μόno epexerga- sτή. Μόno mia entoλή εκτελούντan th forά kai αφού teleÐwne h ektèlesh thc miac entoλής, h επόμενη εκτελούntan en suneqeÐa. O χρόnoc ektèleshc opoiουδήποte proγράμματoc ήτan ανάλοgoc tou ariθμού twn entol¸n, thc περιόδου tou rologiού tou upologisτή kai twn kukl¸n pou apaitoύντan gia thn κάθε entoλή.
    [Show full text]
  • Extracting Parallelism from Legacy Sequential Code Using Transactional Memory
    Extracting Parallelism from Legacy Sequential Code Using Transactional Memory Mohamed M. Saad Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Binoy Ravindran, Chair Anil Kumar S. Vullikanti Paul E. Plassmann Robert P. Broadwater Roberto Palmieri Sedki Mohamed Riad May 25, 2016 Blacksburg, Virginia Keywords: Transaction Memory, Automatic Parallelization, Low-Level Virtual Machine, Optimistic Concurrency, Speculative Execution, Legacy Systems, Age Commitment Order, Low-Level TM Semantics, TM Friendly Semantics Copyright 2016, Mohamed M. Saad Extracting Parallelism from Legacy Sequential Code Using Transactional Memory Mohamed M. Saad (ABSTRACT) Increasing the number of processors has become the mainstream for the modern chip design approaches. However, most applications are designed or written for single core processors; so they do not benefit from the numerous underlying computation resources. Moreover, there exists a large base of legacy software which requires an immense effort and cost of rewriting and re-engineering to be made parallel. In the past decades, there has been a growing interest in automatic parallelization. This is to relieve programmers from the painful and error-prone manual parallelization process, and to cope with new architecture trend of multi-core and many-core CPUs. Automatic parallelization techniques vary in properties such as: the level of paraellism (e.g., instructions, loops, traces, tasks); the need for custom hardware support; using optimistic execution or relying on conservative decisions; online, offline or both; and the level of source code exposure. Transactional Memory (TM) has emerged as a powerful concurrency control abstraction.
    [Show full text]
  • Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Lecture 1
    Lecture 1: Why Parallelism? Why Efficiency? Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Leela James “Long Time Coming” (A Change is Gonna Come) “I’d heard a bit about parallelism in 213. Then I mastered the idea of span in 210. And so I was just itching to start tuning code for some Skylake cores.” - Leela James, on the inspiration for “Long Time Coming” CMU 15-418/618, Spring 2017 Hi! Alex Ravi Teguh Junhong Prof. Kayvon Yicheng Tao Anant Riya Prof. Bryant CMU 15-418/618, Spring 2017 One common defnition A parallel computer is a collection of processing elements that cooperate to solve problems quickly We care about performance * We’re going to use multiple We care about efficiency processors to get it * Note: different motivation from “concurrent programming” using pthreads in 15-213 CMU 15-418/618, Spring 2017 DEMO 1 (15-418/618 Spring 2017‘s frst parallel program) CMU 15-418/618, Spring 2017 Speedup One major motivation of using parallel processing: achieve a speedup For a given problem: execution time (using 1 processor) speedup( using P processors ) = execution time (using P processors) CMU 15-418/618, Spring 2017 Class observations from demo 1 ▪ Communication limited the maximum speedup achieved - In the demo, the communication was telling each other the partial sums ▪ Minimizing the cost of communication improved speedup - Moved students (“processors”) closer together (or let them shout) CMU 15-418/618, Spring 2017 DEMO 2 (scaling up to four “processors”) CMU 15-418/618, Spring 2017 Class observations
    [Show full text]