High-Performance Computing and Compiler Optimization

P. (Saday) Sadayappan
August 2019

What is a compiler?

• Traditionally: a program that analyzes and translates from a high-level language (e.g., C++) to low-level assembly language that can be executed by hardware
• Compilers are translators:
  From: Fortran; C; C++; Java; text processing languages; HTML/XML; command and scripting languages; natural language; domain-specific languages
  To: machine code; virtual machine code; transformed source code; augmented source code; low-level commands; semantic components; another language

  Source (C):
      int a, b;
      a = 3;
      if (a < 4) {
          b = 2;
      } else {
          b = 3;
      }

  Translated (assembly):
      var a
      var b
      mov 3 a
      mov 4 r1
      cmpi a r1
      jge l_e
      mov 2 b
      jmp l_d
      l_e: mov 3 b
      l_d: ;done

Source: Milind Kulkarni

Compilers are optimizers

• Can perform optimizations to make a program more efficient

  Source (C):
      int a, b, c;
      b = a + 3;
      c = a + 3;

  Unoptimized (assembly):
      var a
      var b
      var c
      mov a r1
      addi 3 r1
      mov r1 b
      mov a r2
      addi 3 r2
      mov r2 c

  Optimized (assembly):
      var a
      var b
      var c
      mov a r1
      addi 3 r1
      mov r1 b
      mov r1 c

Source: Milind Kulkarni

♦ Early days of computing: minimizing the number of executed instructions minimized program execution time
  § Sequential processors had a single functional unit to execute instructions
  § Compiler technology is very advanced in minimizing the number of instructions
♦ Today:
  § All computers are highly parallel: programs must make use of all parallel resources
  § The cost of data movement dominates the cost of performing the operations on data
  § Many challenging problems for compilers

The Good Old Days for Software

Source: J. Birnbaum

• Single-processor performance experienced dramatic improvements from clock speed and architectural improvements (pipelining, instruction-level parallelism)
• Applications experienced automatic performance improvement

Power Density Trends

[Figure: power density (Watts/cm^2, 1 to 1000) versus process generation (1.5µ down to 0.07µ) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with hot plate, nuclear reactor, and rocket nozzle as reference levels.]

P = VI: 75 W @ 1.5 V = 50 A!

Power density is the power expended per unit area on the chip: if it is too high, transistors malfunction.

Hitting the Power Wall

toward a brighter tomorrow
http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif

2004: Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core ..."

The Only Option: Many Cores

♦ Chip density is still increasing (but the end is in sight)
  § Clock speed is not
♦ There is little or no more hidden parallelism (ILP) to be found
♦ Parallelism must be exposed to and managed by software
♦ Computers are getting heterogeneous: CPUs, GPUs, TPUs, FPGAs

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

End of Moore's Law Scaling of VLSI

[From Hennessy & Patterson]

♦ Moore's Law (doubling of the number of transistors on a chip every 18 months to 2 years) over 4 decades has fueled a rise of over 5 orders of magnitude in computing power per chip
♦ Unfortunately, we are now at the end of that ride

Article page reproduced on the slide (Turing lecture, Communications of the ACM, February 2019, vol. 62, no. 2, p. 54):

[...] predicting the outcome of 15 branches. If a processor architect wants to limit wasted work to only 10% of the time, the processor must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately.

To appreciate how this wasted work adds up, consider the data in Figure 4, showing the fraction of instructions that are effectively executed but turn out to be wasted because the processor speculated incorrectly. On average, 19% of the instructions are wasted for these benchmarks on an Intel Core i7. The amount of wasted energy is greater, however, since the processor must use additional energy to restore the state when it speculates incorrectly. Measurements like these led many to conclude architects needed a different approach to achieve performance improvements. The multicore era was thus born.

Multicore shifted responsibility for identifying parallelism and deciding how to exploit it to the programmer and to the language system. Multicore does not resolve the challenge of energy-efficient computation that was exacerbated by the end of Dennard scaling. Each active core burns power whether or not it contributes effectively to the computation. A primary hurdle is an old observation, called Amdahl's Law, stating that the speedup from a parallel computer is limited by the portion of a computation that is sequential. To appreciate the importance of this observation, consider Figure 5, showing how much faster an application runs with up to 64 cores compared to a single core, assuming different portions of serial execution, where only one processor is active. For example, when only 1% of the time is serial, the speedup for a 64-processor configuration is about 35. Unfortunately, the power needed is proportional to 64 processors, so approximately 45% of the energy is wasted.

Real programs have more complex structures of course, with portions that allow varying numbers of processors to be used at any given moment in time. Nonetheless, the need to communicate and synchronize periodically means most applications have some portions that can effectively use only a fraction of the processors. Although Amdahl's Law is more than 50 years old, it remains a difficult hurdle.

With the end of Dennard scaling, increasing the number of cores on a chip meant power is also increasing at nearly the same rate. Unfortunately, the power that goes into a processor must also be removed as heat. Multicore processors are thus limited by the thermal dissipation power (TDP), or average amount of power the package and cooling system can remove. Although some high-end data centers may use more advanced packages and cooling technology, no computer users would want to put a small heat exchanger on their desks or wear a radiator on their backs to cool their cellphones. The limit of TDP led directly to the era of "dark silicon," whereby processors would slow on the clock rate and turn off idle cores to prevent overheating. Another way to view this approach is that some chips can reallocate their precious power from the idle cores to the active ones.

An era without Dennard scaling, along with reduced Moore's Law and Amdahl's Law in full effect means inefficiency limits improvement in performance to only a few percent per year (see Figure 6). Achieving higher rates of performance improvement, as was seen in the 1980s and 1990s, will require new architectural approaches that use the integrated-circuit capability much more efficiently. We will return to what approaches might work after discussing another major shortcoming of modern computers: their support, or lack thereof, for computer security.

[Figure 6. Growth of computer performance using integer programs (SPECintCPU). Y-axis: performance vs. VAX11-780 (1 to 100,000, log scale); x-axis: 1980 to 2015. Annotated eras: CISC, 2X/2.5 years (22%/year); RISC, 2X/1.5 years (52%/year); End of Dennard Scaling ⇒ Multicore, 2X/3.5 years (23%/year); Amdahl's Law ⇒ 2X/6 years (12%/year); End of the Line ⇒ 2X/20 years (3%/year).]

[Figure 7. Potential speedup of matrix multiply in Python for four optimizations. Matrix multiply speedup over native Python: Python, 1; C, 47; + parallel loops, 366; + memory optimization, 6,727; + SIMD instructions, 62,806.]

Can You Predict Performance?

  Example loop E1:
      double W,X,Y,Z;
      for(j=0;j<100000000;j++){
        W = 0.999999*X;
        X = 0.999999*W;}

  Example loop E2:
      for(j=0;j<100000000;j++){
        W = 0.999999*W + 0.000001;
        X = 0.999999*X + 0.000001;
        Y = 0.999999*Y + 0.000001;
        Z = 0.999999*Z + 0.000001;
      }

• E1 executes 2*10^8 FLOPs and takes about 0.4s to execute on this laptop
• Performance in GFLOPS: billions (Giga) of FLoating-point Operations Per Second = 0.5
• About how long does E2 take?
  1. [0-0.4s]
  2. [0.4s-0.6s]
  3. [0.6s-0.8s]
  4. More than 0.8s
  Answer: 0.35s => 2.27 GFLOPS

ILP Affects Performance

(Example loops E1 and E2 as on the previous slide.)

• ILP (Instruction-Level Parallelism): many operations in a sequential code could be executed concurrently if they do not have dependences
• Pipelined stages in functional units can be exploited by ILP
• Multiple functional units in a CPU can be exploited by ILP
• ILP is automatically exploited by the system when possible
• E2's statements are independent and provide ILP, but E1's statements are not independent and do not provide ILP

Performance Prediction

      #define N 32
      #define T 1024*1024
      double A[N][N];

• About how long will code run for the 4Kx4K matrix?
  %FLOPS=32*1024*1024=2^30
  1.