High-Performance Parallel Computing


P. (Saday) Sadayappan and Rupesh Nasre

Course Overview
• Emphasis on algorithm development and programming issues for high performance
• No assumed background in computer architecture; knowledge of C is assumed
• Grading:
  - 60% Programming Assignments (4 x 15%)
  - 40% Final Exam (July 4)
• Accounts will be provided on the IIT-M system

Course Topics
• Architectures
  - Single processor core
  - Multi-core and SMP (Symmetric Multi-Processor) systems
  - GPUs (Graphics Processing Units)
  - Short-vector SIMD instruction set architectures
• Programming Models/Techniques
  - Single processor performance issues
    - Caches and data locality
    - Data dependence and fine-grained parallelism
    - Vectorization (SSE, AVX)
  - Parallel programming models
    - Shared-memory parallel programming (OpenMP)
    - GPU programming (CUDA)

Class Meeting Schedule
• Lecture from 9am-12pm each day, with a mid-class break
• Optional lab session from 12-1pm each day
• 4 programming assignments:

  Topic           Due Date    Weightage
  Data Locality   June 24     15%
  OpenMP          June 27     15%
  CUDA            June 30     15%
  Vectorization   July 2      15%

The Good Old Days for Software (Source: J. Birnbaum)
• Single-processor performance experienced dramatic improvements from clock speed and architectural improvements (pipelining, instruction-level parallelism)
• Applications experienced automatic performance improvement

Hitting the Power Wall
[Figure: power density (W/cm^2) versus process generation, from i386 and i486 through Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4. Power doubles every 4 years; Sun's 5-year projection was 200 W total, 125 W/cm^2 — past "hot plate" levels and heading toward nuclear-reactor and rocket-nozzle surface power densities. P = VI: 75 W @ 1.5 V = 50 A!]
"New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies" — Fred Pollack, Intel Corp., Micro32 conference keynote, 1999. Courtesy Avi Mendelson, Intel.

Hitting the Power Wall ("toward a brighter tomorrow")
[Figure: CPU clock frequency over time — http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif]
• 2004 — Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core ..."

The Only Option: Use Many Cores
• Chip density is continuing to increase ~2x every 2 years
  - Clock speed is not
  - The number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Can You Predict Performance?

  double W, X, Y, Z;

  // Example loop E1
  for (j = 0; j < 100000000; j++) {
      W = 0.999999*X;
      X = 0.999999*W;
  }

  // Example loop E2
  for (j = 0; j < 100000000; j++) {
      W = 0.999999*W + 0.000001;
      X = 0.999999*X + 0.000001;
      Y = 0.999999*Y + 0.000001;
      Z = 0.999999*Z + 0.000001;
  }

• E1 takes about 0.4s to execute on this laptop
• Performance is measured in GFLOPs: billions (Giga) of FLoating-point Operations per Second
• About how long does E2 take?
  1. [0-0.4s]
  2. [0.4s-0.6s]
  3. [0.6s-0.8s]
  4. More than 0.8s
• Answer: E2 takes 0.35s => 2.27 GFLOPs (8 floating-point operations per iteration x 10^8 iterations)
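The E1/E2 comparison can be reproduced with a small self-contained program. The sketch below is illustrative rather than the course's own benchmark: the file name ilp_demo.c, the starting values of W-Z, and the use of the POSIX clock_gettime timer are assumptions; only the loop bodies, constants, and iteration count come from the slides.

  /* ilp_demo.c — time loops E1 and E2 (compile e.g.: cc -O2 ilp_demo.c -o ilp_demo) */
  #include <stdio.h>
  #include <time.h>

  #define ITERS 100000000

  static double seconds(void) {              /* wall-clock time in seconds */
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void) {
      double W = 1.0, X = 1.0, Y = 1.0, Z = 1.0;   /* starting values are arbitrary */

      /* E1: each multiply depends on the result of the previous one,
         so the floating-point pipeline is never kept full. */
      double t0 = seconds();
      for (long j = 0; j < ITERS; j++) {
          W = 0.999999 * X;
          X = 0.999999 * W;
      }
      double t1 = seconds();

      /* E2: the four update chains are independent, so the hardware can
         overlap them (ILP) even though E2 does 4x the floating-point work. */
      for (long j = 0; j < ITERS; j++) {
          W = 0.999999 * W + 0.000001;
          X = 0.999999 * X + 0.000001;
          Y = 0.999999 * Y + 0.000001;
          Z = 0.999999 * Z + 0.000001;
      }
      double t2 = seconds();

      /* Printing the results keeps the compiler from discarding the loops.
         E1 does 2 FLOPs per iteration, E2 does 8. */
      printf("E1: %.2fs (%.2f GFLOPs)   E2: %.2fs (%.2f GFLOPs)   [%g %g %g %g]\n",
             t1 - t0, 2.0 * ITERS / (t1 - t0) * 1e-9,
             t2 - t1, 8.0 * ITERS / (t2 - t1) * 1e-9, W, X, Y, Z);
      return 0;
  }

Exact times vary by machine and compiler, but E2 typically takes about the same time as (or less than) E1 despite doing four times the floating-point work.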
ILP Affects Performance
• ILP (Instruction-Level Parallelism): many operations in a sequential code could be executed concurrently if they do not have dependences
• Pipelined stages in functional units can be exploited by ILP
• Multiple functional units in a CPU can be exploited by ILP
• ILP is automatically exploited by the system when possible
• E2's statements are independent and provide ILP, but E1's statements are not, and do not provide ILP

Example: Memory Access Cost

  #define N 32
  #define T 1024*1024
  // #define N 4096
  // #define T 64

  double A[N][N];

  for (it = 0; it < T; it++)
      for (j = 0; j < N; j++)
          for (i = 0; i < N; i++)
              A[i][j] += 1;

• With N = 32 and T = 1024*1024 this runs in 0.33s. About how long will the code run for the 4Kx4K matrix (N = 4096, T = 64), which performs the same total number of operations?
  1. [0-0.3s]
  2. [0.3s-0.6s]
  3. [0.6s-1.2s]
  4. More than 1.2s

            32x32    4Kx4K
  FLOPs     2^30     2^30
  Time      0.33s    15.5s
  GFLOPs    3.24     0.07

• Data movement overheads from main memory to cache can greatly exceed computational costs
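The cost of that data movement can be measured directly. The sketch below is illustrative (the file name locality_demo.c is arbitrary, the matrix is heap-allocated rather than a global, and T = 4 sweeps is chosen so the run finishes in a few seconds instead of the slide's T = 64). For a 4Kx4K matrix that does not fit in cache, it contrasts the slide's loop order with the row-order version, which touches the same data but uses every byte of each cache line brought in from memory.

  /* locality_demo.c — compare column-order and row-order matrix traversal
     (compile e.g.: cc -O2 locality_demo.c -o locality_demo) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 4096
  #define T 4

  static double seconds(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void) {
      double (*A)[N] = calloc(N, sizeof *A);   /* 4096 x 4096 doubles = 128 MB */
      if (!A) { perror("calloc"); return 1; }

      /* Column order (i innermost): consecutive accesses are N*8 bytes apart,
         so for a large matrix almost every access misses in the cache. */
      double t0 = seconds();
      for (int it = 0; it < T; it++)
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  A[i][j] += 1;
      double t1 = seconds();

      /* Row order (j innermost): consecutive accesses are adjacent in memory,
         so each cache line fetched from memory is fully reused. */
      for (int it = 0; it < T; it++)
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  A[i][j] += 1;
      double t2 = seconds();

      printf("column order: %.2fs   row order: %.2fs   (A[0][0] = %g)\n",
             t1 - t0, t2 - t1, A[0][0]);
      free(A);
      return 0;
  }

Both loops perform exactly the same arithmetic; only the memory access pattern differs. The 32x32 case is fast in either order because the whole 8 KB array fits comfortably in cache.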
Cache Memories
Adapted from slides by Prof. Randy Bryant (CMU CS 15-213) and Dr. Jeffrey Jones (OSU CSE 5441)
Topics:
• Generic cache memory organization
• Impact of caches on performance

Cache Memories
• Cache memories are small, fast SRAM-based memories managed automatically in hardware
  - They hold frequently accessed blocks of main memory
• The CPU looks first for data in L1, then in L2, then in main memory
• Typical bus structure: the register file, ALU, and L1 cache sit on the CPU chip; a cache bus connects to the off-chip L2 cache; the system bus, I/O bridge, and memory bus connect the CPU chip to main memory

Inserting an L1 Cache Between the CPU and Main Memory
• The tiny, very fast CPU register file has room for four 4-byte words; the transfer unit between the register file and the cache is a 4-byte word
• The small, fast L1 cache has room for two 4-word blocks; the transfer unit between the cache and main memory is a 4-word block (16 bytes)
• The big, slow main memory has room for many 4-word blocks (e.g., block 10 holds a b c d, block 21 holds p q r s, block 30 holds w x y z)

Direct-Mapped Cache
• Example: a 64-byte, byte-addressable memory viewed as 8 blocks of 8 bytes each, and a cache with 4 lines (Line 0-3), each holding a valid bit, a tag, and a data block
• Memory blocks are mapped round-robin to cache lines: each memory block maps to one particular line in the cache

Direct-Mapped Cache Lines and Pre-fetching (J. S. Jones)
• RAM is byte addressable, but a whole cache line is transferred from memory into the cache at a time; a system with 64-bit data paths transfers 8 bytes per transfer
• Only the bytes appropriate for the requested data type are then transferred from the cache line to the register

Accessing Direct-Mapped Caches
• The m-bit address is split (from high to low bits) into t tag bits, s set-index bits, and b block-offset bits
• Set selection: use the set (= line) index bits to determine the set of interest
• Line matching: find a valid line in the selected set with a matching tag
  (1) the valid bit must be set
  (2) the tag bits in the cache line must match the tag bits in the address
  (3) if (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte
• Word selection: extract the word from the block (in the example, the line's tag 0110 matches the address tag, and block offset 100 selects byte 4, the start of word w2)

Direct-Mapped Cache Simulation
• M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set, so t = 1 tag bit, s = 2 set-index bits, b = 1 block-offset bit
• Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
  - 0 [0000]: miss — block M[0-1] is loaded into set 0 with tag 0
  - 1 [0001]: hit — same block, same set
  - 13 [1101]: miss — block M[12-13] is loaded into set 2 with tag 1
  - 8 [1000]: miss — maps to set 0 with tag 1, so M[8-9] replaces M[0-1]
  - 0 [0000]: miss — M[0-1] must be reloaded into set 0

General Structure of Cache Memory
• A cache is an array of S = 2^s sets; each set contains E lines (one or more); each line holds a block of B = 2^b bytes of data, plus 1 valid bit and t tag bits
• Cache size: C = B x E x S data bytes

Addressing Caches
• An address A is divided into <tag> (t bits), <set index> (s bits), and <block offset> (b bits)
• The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>
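The set selection, tag match, and miss handling described above can be replayed in a few lines of C. The sketch below is illustrative (the file name and struct/variable names are not from the slides); it uses the simulation's parameters (b = 1, s = 2, t = 1, E = 1 line per set) and the same read trace, and prints the same miss, hit, miss, miss, miss sequence.

  /* dmcache_sim.c — replay the direct-mapped cache simulation
     (compile e.g.: cc dmcache_sim.c -o dmcache_sim) */
  #include <stdio.h>
  #include <stdbool.h>

  #define B_BITS 1                      /* b: block-offset bits (B = 2 bytes/block) */
  #define S_BITS 2                      /* s: set-index bits    (S = 4 sets)        */
  #define NUM_SETS (1u << S_BITS)

  struct line { bool valid; unsigned tag; };   /* E = 1 line per set */

  int main(void) {
      struct line cache[NUM_SETS] = {0};       /* all valid bits start at 0 */
      unsigned trace[] = {0, 1, 13, 8, 0};     /* read addresses from the slide */
      int n = sizeof trace / sizeof trace[0];

      for (int k = 0; k < n; k++) {
          unsigned addr = trace[k];
          unsigned set  = (addr >> B_BITS) & (NUM_SETS - 1);   /* set-index bits */
          unsigned tag  =  addr >> (B_BITS + S_BITS);          /* tag bits */
          bool hit = cache[set].valid && cache[set].tag == tag;

          printf("addr %2u -> set %u, tag %u: %s\n",
                 addr, set, tag, hit ? "hit" : "miss");

          if (!hit) {        /* on a miss, the block is loaded and the tag updated */
              cache[set].valid = true;
              cache[set].tag   = tag;
          }
      }
      return 0;
  }

Address 8 maps to the same set as address 0 but with a different tag, so it evicts block M[0-1]; the final access to address 0 therefore misses again, exactly as in the trace above.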