Core Processors
Total Pages: 16
File Type: pdf, Size: 1020 KB
Recommended publications
2.5 Classification of Parallel Computers
2.5.1 Granularity. In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.
• fine level (same operations with different data)
◦ vector processors
◦ instruction-level parallelism
◦ fine-grain parallelism:
– relatively small amounts of computational work are done between communication events
– low computation-to-communication ratio
– facilitates load balancing
– implies high communication overhead and less opportunity for performance enhancement
– if granularity is too fine, it is possible that the overhead required for communication and synchronization between tasks takes longer than the computation itself
• operation level (different operations simultaneously)
• problem level (independent subtasks)
◦ coarse-grain parallelism:
– relatively large amounts of computational work are done between communication/synchronization events
– high computation-to-communication ratio
– implies more opportunity for performance increase
– harder to load balance efficiently
2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. the Cray-1). With N elements in the pipeline and L clock cycles per element, the calculation takes about L + N cycles; without pipelining it takes L * N cycles. Example of good code for pipelining:
do i = 1, k
   z(i) = x(i) + y(i)
end do
Vector processors offer fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but recursion, for example, is hard to optimise for vector processors (a C++ sketch of both cases follows below). Example: Intel MMX, a simple vector processor.
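A rough C++ analogue of the loop above (the function and array names are illustrative, not taken from the lecture notes): the element-wise addition has independent iterations, so pipelines and vector units can overlap them, while the second loop carries a dependence from one iteration to the next, which is the kind of recursion that is hard to optimise for vector processors.

#include <vector>

// Element-wise addition: iterations are independent, so the hardware can
// pipeline/vectorize the loop (same situation as the Fortran loop above).
void vector_add(const std::vector<double>& x, const std::vector<double>& y,
                std::vector<double>& z) {
    for (std::size_t i = 0; i < z.size(); ++i)
        z[i] = x[i] + y[i];
}

// Prefix-sum recurrence: each iteration depends on the previous one, the
// kind of recursion that is hard to optimise for a vector processor.
void prefix_sum(std::vector<double>& a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] += a[i - 1];
}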
A Massively-Parallel Mixed-Mode Computer Designed to Support High Level Languages
This paper appeared in the International Parallel Processing Symposium, Proceedings of the Workshop on Heterogeneous Processing, Newport Beach, CA, April.
Triton: A Massively-Parallel Mixed-Mode Computer Designed to Support High Level Languages. Christian G. Herter, Thomas M. Warschko, Walter F. Tichy and Michael Philippsen, University of Karlsruhe, Dept. of Informatics, Karlsruhe, Germany.
Abstract: We present the architecture of Triton, a scalable mixed-mode SIMD/MIMD parallel computer. The novel features of Triton are:
• support for high-level, machine-independent programming languages;
• fast SIMD/MIMD mode switching;
• special hardware for barrier synchronization of multiple process groups;
• a self-routing, deadlock-free perfect shuffle interconnect with latency hiding.
The architecture is the outcome of an integrated design ...
Modula-2*: Modula-2* (pronounced Modula-star) is a small extension of Modula-2 for massively parallel programming. The programming model of Modula-2* incorporates both data and control parallelism and allows mixed synchronous and asynchronous execution. Modula-2* is problem-oriented in the sense that the programmer can choose the degree of parallelism and mix the control mode (SIMD- or MIMD-like) as needed by the intended algorithm. Parallelism may be nested to arbitrary depth. Procedures may be called from sequential or parallel contexts and can themselves generate parallel activity without any restrictions. Most Modula-2* programs can be translated into efficient code for both SIMD and MIMD architectures.
Overview of language extensions: Modula-2* extends Modula-2 ...
Massively Parallel Computing with CUDA
Massively Parallel Computing with CUDA. Antonino Tumeo, Politecnico di Milano.
"GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs." (Jack Dongarra, Professor, University of Tennessee; author of "Linpack")
Why use the GPU? The GPU has evolved into a very flexible and powerful processor:
• it is programmable using high-level languages;
• it supports 32-bit and 64-bit floating point IEEE-754 precision;
• it offers lots of GFLOPS;
• there is a GPU in every PC and workstation.
What is behind such an evolution? The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. [Figure: CPU versus GPU die area; the CPU spends its transistors on control logic and cache, the GPU on ALUs, with DRAM attached to each.] In addition, the fast-growing video game industry exerts strong economic pressure that forces constant innovation.
GPUs: each NVIDIA GPU has 240 parallel cores (1.4 billion transistors, roughly 1 teraflop of processing power). Within each core there is a floating point unit, a logic unit (add, sub, mul, madd), a move/compare unit and a branch unit. The cores are managed by a thread manager that can spawn and manage 12,000+ threads with zero-overhead thread switching.
Heterogeneous computing domains: graphics and massive data parallelism (GPU, parallel computing) versus instruction-level parallelism (CPU, sequential computing) ...
CS 677: Parallel Programming for Many-Core Processors Lecture 1
CS 677: Parallel Programming for Many-core Processors, Lecture 1. Instructor: Philippos Mordohai. Webpage: mordohai.github.io.
Objectives: learn how to program massively parallel processors and achieve high performance, functionality and maintainability, and scalability across future generations; acquire the technical knowledge required to achieve these goals, namely principles and patterns of parallel programming, processor architecture features and constraints, and programming APIs, tools and techniques.
Important points: this is an elective course; you chose to be here. Expect to work and to be challenged. If your programming background is weak, you will probably suffer. The course will evolve to follow the rapid pace of progress in GPU programming; it is bound to always be a little behind. At any point ask me WHY? You can ask me anything about the course in class, during a break, in my office, or by email, for instance if you think a homework is taking too long or is wrong, or if you can't decide on a project.
Logistics: class webpage: http://mordohai.github.io/classes/cs677_s20.html. Office hours: Tuesdays 5-6pm and by email. Evaluation: homework assignments (40%), quizzes (10%), midterm (15%), final project (35%).
Project: pick a topic BEFORE the middle of the semester; I will suggest ideas and datasets if you can't decide. Deliverables: project proposal, presentation in class, poster in the CS department event, and a final report (around 8 pages). Project examples: k-means, perceptron, boosting (general, or a face detector for a group of 2), mean shift, normal estimation for 3D point clouds. More ideas: look for parallelizable problems in image processing, cryptanalysis, graphics (GPU Gems), nearest neighbor search, particle simulations, financial analysis, MCMC, games/puzzles.
Resources: textbook: Kirk & Hwu. ...
Concurrent Objects and Linearizability
Concurrent Objects and Linearizability. Nir Shavit (subbing for N. Lynch), Fall 2003. © 2003 Herlihy and Shavit.
Concurrent computation: [figure: several threads operating concurrently on shared objects in memory].
Objectivism: what is a concurrent object? How do we describe one? How do we implement one? How do we tell if we're right?
FIFO queue, enqueue method: q.enq(x). FIFO queue, dequeue method: q.deq() returns x.
Sequential objects: each object has a state, usually given by a set of fields (queue example: a sequence of items), and a set of methods, which are the only way to manipulate the state (queue example: the enq and deq methods).
Sequential specifications: if (precondition) the object is in such-and-such a state before you call the method, then (postcondition) the method will return a particular value or throw a particular exception, and (postcondition, continued) the object will be in some other state when the method returns.
Pre- and postconditions for dequeue: precondition: the queue is non-empty; postcondition: returns the first item in the queue; postcondition: removes the first item from the queue. For the empty case: precondition: the queue is empty; postcondition: throws an Empty exception. You got a problem with that?
Why sequential specifications totally rock: documentation size is linear in the number of methods; each method is described in isolation; interactions among methods are captured by side-effects ... (a C++ sketch of such a queue follows below).
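As a concrete illustration of these sequential specifications, here is a minimal C++ sketch (not from the lecture; the class and exception names are placeholders) of a FIFO queue whose methods are protected by a single lock, so each enq/deq appears to take effect atomically and can be checked against the pre- and postconditions above.

#include <deque>
#include <mutex>
#include <stdexcept>

// A coarse-grained concurrent FIFO queue: one lock per object makes each
// method take effect atomically, so the concurrent object behaves like its
// sequential specification (enq appends, deq removes the first item).
template <typename T>
class LockedQueue {
public:
    // Postcondition: x is appended at the tail of the queue.
    void enq(T x) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(std::move(x));
    }

    // Sequential precondition: the queue is non-empty.
    // Postcondition: returns and removes the first item; if the queue is
    // empty, an "Empty" exception is thrown instead (the empty-case spec).
    T deq() {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) throw std::runtime_error("Empty");
        T x = std::move(items_.front());
        items_.pop_front();
        return x;
    }

private:
    std::mutex lock_;
    std::deque<T> items_;
};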
Goals of This Lecture: How to Increase the Compute Power?
Goals of this lecture: motivate you! Trends; high performance computing; programming models; course overview. Design of Parallel and High-Performance Computing, Fall 2017, Lecture: Introduction. Instructor: Torsten Hoefler & Markus Püschel. TA: Salvatore Di Girolamo.
Trends: what doubles ...? (Source: Wikipedia)
How to increase the compute power? Clock speed! [Figure: power density in W/cm² of Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 to the Pentium over 1970-2010, climbing past a hot plate toward a nuclear reactor, a rocket nozzle and the Sun's surface.] Not an option anymore!
Evolutions of processors (Intel): clock frequency levelled off at roughly 3 GHz from the Pentium 4 through Core, Nehalem, Sandy Bridge and Haswell; the "free speedup" of the Pentium through Pentium III era is over. Instead, parallelism grows: more cores (8x) and wider vector units (8x) deliver on the order of 360 Gflop/s per chip, but exploiting that parallelism is work required of the programmer, and memory bandwidth (normalized) lags behind.
A more complete view (source: www.singularity.com). Can we do this today?
High-Performance Computing (HPC), a.k.a. "Supercomputing". Question: define "Supercomputer"!
Concurrent and Parallel Programming
Concurrent and parallel programming. Seminars in Advanced Topics in Computer Science Engineering, 2019/2020. Romolo Marotta.
Trend in processor technology ...
Blocking synchronization: [slides: threads sleep (...zZz...) while waiting for access to a SHARED RESOURCE]. Correctness is guaranteed by mutual exclusion, but performance might be hampered because of wasted clock cycles, and liveness might be impaired due to the arbitration of accesses (a minimal C++ example follows below).
Parallel programming:
• ad-hoc concurrent programming languages;
• development tools: compilers; MPI, OpenMP, libraries; tools to debug parallel code (gdb, valgrind);
• writing parallel code is an art: there are approaches, not prepackaged solutions; every machine has its own singularities; every problem to face has different requisites; the most efficient parallel algorithm might not be the most intuitive one.
What do we want from parallel programs?
• Safety: nothing wrong happens (correctness); parallel versions of our programs should be as correct as their sequential implementations.
• Liveness: something good happens eventually (progress); if a sequential program terminates with a given input, we want its parallel alternative to complete with the same input as well.
• Performance: we want to exploit our parallel hardware.
Correctness conditions; progress conditions ...
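A minimal C++ sketch of blocking synchronization as described above (the variable names and iteration counts are illustrative): a mutex enforces mutual exclusion on a shared counter, which gives correctness, while contending threads block until the lock is released.

#include <mutex>
#include <thread>
#include <vector>

// Blocking synchronization in its simplest form: a mutex serializes access to
// a shared resource. Correctness comes from mutual exclusion, but threads that
// lose the race sleep ("...zZz...") until the lock is released, wasting cycles.
int shared_counter = 0;          // the shared resource
std::mutex shared_counter_lock;  // guards the shared resource

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(shared_counter_lock);  // may block
        ++shared_counter;  // critical section
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker, 100000);
    for (auto& th : threads) th.join();
    return shared_counter == 4 * 100000 ? 0 : 1;  // safety: no lost updates
}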
Massively Parallel Computers: Why Not Parallel Computers for the Masses?
Massively Parallel Computers: Why Not Parallel Computers for the Masses? Gordon Bell.
Abstract: During the 1980s the computer engineering and science research community generally ignored parallel processing. With a focus on high performance computing embodied in the massive 1990s High Performance Computing and Communications (HPCC) program, which has the short-term goal of teraflop peak performance using a network of thousands of computers, everyone with even a passing interest in parallelism is involved with the massive parallelism "gold rush". Funding-wise the situation is bright; applications-wise massive parallelism is microscopic. While there are several programming models, the mainline is data parallel Fortran. However, new algorithms are required, negating the decades of progress in algorithms. Thus, utility will no doubt be the Achilles heel of massive parallelism.
The Teraflop: 1992. In 1989 I described the situation in high performance computers, including several parallel architectures that could deliver teraflop power by 1995, but with no price constraint. I felt SIMDs and multicomputers could achieve this goal. A shared memory multiprocessor looked infeasible then. Traditional multiple vector processor supercomputers such as Crays would simply not evolve to a teraflop until 2000. Here's what happened.
1. During the first half of 1992, NEC's four-processor SX3 is the fastest computer, delivering 90% of its peak 22 Gflops for the Linpeak benchmark, and Cray's 16-processor YMP C90 has the greatest throughput.
2. The SIMD hardware approach of Thinking Machines was abandoned because it was only suitable for a few very large scale problems, barely multiprogrammed, and uneconomical for workloads. It's unclear whether large SIMDs are "generation" scalable, and they are clearly not "size" scalable.
arXiv:2012.03692v1 [cs.DC] 7 Dec 2020
Separation and Equivalence Results for the Crash-stop and Crash-recovery Shared Memory Models. Ohad Ben-Baruch (Ben-Gurion University, Israel) and Srivatsan Ravi (University of Southern California, USA). December 8, 2020.
Abstract: Linearizability, the traditional correctness condition for concurrent data structures, is considered insufficient for the non-volatile shared memory model where processes recover following a crash. For this crash-recovery shared memory model, strict-linearizability is considered appropriate since, unlike linearizability, it ensures that operations that crash take effect prior to the crash or not at all. This work formalizes and answers the question of whether an implementation of a data type derived for the crash-stop shared memory model is also strict-linearizable in the crash-recovery model. It presents a rigorous study of how helping mechanisms, typically employed by non-blocking implementations, are the algorithmic abstraction that delineates linearizability from strict-linearizability. Our first contribution formalizes the crash-recovery model and how explicit process crashes and recovery introduce further dimensionalities over the standard crash-stop shared memory model. We make the following technical contributions that answer the question of whether a help-free linearizable implementation is strict-linearizable in the crash-recovery model: (i) we prove, surprisingly, that there exist linearizable implementations of object types that are help-free yet not strict-linearizable; (ii) we then present a natural definition of help-freedom to prove that any obstruction-free, linearizable and help-free implementation of a total object type is also strict-linearizable. The next technical contribution addresses the question of whether a strict-linearizable implementation in the crash-recovery model is also help-free linearizable in the crash-stop model.
Introduction to Parallel Computing
Introduction to Parallel Computing. Plamen Krastev, FAS Research Computing, Harvard University. Office: 38 Oxford, Room 117.
Objectives: to introduce you to the basic concepts and ideas in parallel computing; to familiarize you with the major programming models in parallel computing; to provide you with guidance for designing efficient parallel programs.
Outline: introduction to parallel computing / high performance computing (HPC); concepts and terminology; parallel programming models; parallelizing your programs; parallel examples.
What is High Performance Computing? [Image: Pravetz 82 and 8M, Bulgarian Apple clones; image credit: flickr.] The Odyssey supercomputer is the major computational resource of FAS RC: 2,140 nodes / 60,000 cores and 14 petabytes of storage. High performance computing means using the world's fastest and largest computers to solve large and complex problems.
Serial computation: traditionally, software has been written for serial computation, to be run on a single computer having a single Central Processing Unit (CPU). A problem is broken into a discrete set of instructions, the instructions are executed one after another, and only one instruction can be executed at any moment in time.
Parallel computing: in the simplest sense, parallel ... (a serial-versus-parallel C++ sketch follows below).
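A small C++ sketch of the serial-versus-parallel contrast described above (the function names and worker count are illustrative): the serial version executes its instructions one after another on a single CPU, while the parallel version breaks the problem into discrete chunks that several threads execute simultaneously.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Serial computation: one CPU executes the additions one after another.
long long sum_serial(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

// Parallel computation: the problem is broken into discrete parts that
// several cores execute simultaneously, each on its own slice of the data.
long long sum_parallel(const std::vector<int>& data, unsigned workers = 4) {
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::size_t lo = w * chunk;
            const std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) partial[w] += data[i];
        });
    }
    for (auto& t : threads) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}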
Parallelizing the Elf: A Task Parallel Approach (Bachelor Thesis, Paul Blockhaus)
Otto-von-Guericke-University Magdeburg, Faculty of Computer Science, Department of Databases and Software Engineering. Parallelizing the Elf: A Task Parallel Approach. Bachelor Thesis. Author: Paul Blockhaus. Advisors: Prof. Dr. rer. nat. habil. Gunter Saake and Dr.-Ing. David Broneske (Otto-von-Guericke-University Magdeburg, Department of Databases and Software Engineering), Dr.-Ing. Martin Schäler (Karlsruhe Institute of Technology, Department of Databases and Information Systems). Magdeburg, November 29, 2019.
Acknowledgements: First and foremost I wish to thank my advisor, Dr.-Ing. David Broneske, for sparking my interest in database systems, introducing me to the Elf and supporting me throughout the work on this thesis. I would also like to thank Prof. Dr. rer. nat. habil. Gunter Saake for the opportunity to write this thesis in the DBSE research group. Also a big thank you to Dr.-Ing. Martin Schäler for his valuable feedback. Finally I want to thank my family and my cat for their support and encouragement throughout my study.
Abstract: Analytical database queries become more and more compute intensive while at the same time the computing power of CPUs stagnates. This leads to new approaches to accelerate the complex selection predicates that are common in analytical queries. One state-of-the-art approach is the Elf, a highly optimized tree structure that utilizes prefix sharing to optimize multi-predicate selection queries, aiming for bare-metal speed. However, this approach leaves aside many capabilities that modern hardware offers. One of the most popular capabilities, which results in huge speed-ups, is the utilization of the multiple cores offered by modern CPUs. (A generic task-parallel sketch follows below.)
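As a generic illustration of the task-parallel approach the title refers to (this is not the thesis's actual Elf implementation; the node structure, predicate and spawning cutoff are invented for the sketch), the following C++ code spawns a task per subtree near the root so that independent subtrees of a search structure can be evaluated on different cores.

#include <functional>
#include <future>
#include <memory>
#include <vector>

// A simple tree node; in the thesis the structure would be the Elf, here it is
// just a placeholder to show the task-parallel traversal pattern.
struct Node {
    int value = 0;
    std::vector<std::unique_ptr<Node>> children;
};

// Counts nodes whose value satisfies a threshold check (a stand-in for a
// selection predicate). Subtrees near the root are handed to separate tasks.
long count_matches(const Node& n, int threshold, int depth = 0) {
    long local = (n.value >= threshold) ? 1 : 0;
    std::vector<std::future<long>> tasks;
    for (const auto& child : n.children) {
        if (depth < 2)  // only spawn tasks near the root to limit overhead
            tasks.push_back(std::async(std::launch::async, count_matches,
                                       std::cref(*child), threshold, depth + 1));
        else
            local += count_matches(*child, threshold, depth + 1);
    }
    for (auto& t : tasks) local += t.get();
    return local;
}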
Round-Up: Runtime Checking Quasi Linearizability of Concurrent Data Structures
Round-Up: Runtime Checking Quasi Linearizability of Concurrent Data Structures. Lu Zhang, Arijit Chattopadhyay, and Chao Wang, Department of ECE, Virginia Tech, Blacksburg, VA 24061, USA.
Abstract: We propose a new method for runtime checking of a relaxed consistency property called quasi linearizability for concurrent data structures. Quasi linearizability generalizes the standard notion of linearizability by intentionally introducing nondeterminism into the parallel computations and exploiting such nondeterminism to improve the performance. However, ensuring the quantitative aspects of this correctness condition in the low level code is a difficult task. Our method is the first fully automated method for checking quasi linearizability in the unmodified C/C++ code of concurrent data structures. It guarantees that all the reported quasi linearizability violations are real violations. We have implemented our method in a software tool based on LLVM and a concurrency testing tool called Inspect. Our experimental evaluation shows that the new method is effective in detecting quasi linearizability violations in the source code of concurrent data structures.
From the introduction: ... leading to severe performance bottlenecks. Quasi linearizability preserves the intuition of standard linearizability while providing some additional flexibility in the implementation. For example, the task queue used in the scheduler of a thread pool does not need to follow the strict FIFO order. One can use a relaxed queue that allows some tasks to be overtaken occasionally if such relaxation leads to superior performance. Similarly, concurrent data structures used for web cache need not follow the strict semantics of the standard versions, since occasionally getting the stale data is acceptable. In distributed systems, the unique id generator does not need to be a perfect counter; to avoid becoming a performance bottleneck, it is often acceptable for the ids to be out of order occasionally, ...
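To make the relaxed semantics concrete, here is a toy C++ sketch of a quasi-FIFO queue in the spirit of the task-queue example above (it is not Round-Up's code; the relaxation bound K and the class name are invented): deq() may return any of the first K items, so elements can overtake each other by a bounded amount while the overall order stays close to FIFO.

#include <algorithm>
#include <cstdlib>
#include <deque>
#include <mutex>
#include <optional>

// A toy "quasi-FIFO" queue: deq() may return any of the first K items, so an
// element can be overtaken by at most K-1 others. This is the kind of bounded
// reordering that a quasi linearizability specification would permit.
template <typename T, std::size_t K = 4>
class RelaxedQueue {
public:
    void enq(T x) {
        std::lock_guard<std::mutex> g(lock_);
        items_.push_back(std::move(x));
    }

    std::optional<T> deq() {
        std::lock_guard<std::mutex> g(lock_);
        if (items_.empty()) return std::nullopt;
        std::size_t window = std::min<std::size_t>(K, items_.size());
        std::size_t pick = std::rand() % window;  // any of the first K items
        T x = std::move(items_[pick]);
        items_.erase(items_.begin() + pick);
        return x;
    }

private:
    std::mutex lock_;
    std::deque<T> items_;
};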