Algorithm Libraries for Multi-Core Processors


Algorithm Libraries for Multi-Core Processors. Dissertation accepted by the Fakultät für Informatik of the Karlsruhe Institute of Technology for the academic degree of Doktor der Ingenieurwissenschaften, submitted by Dipl.-Inform. Johannes Singler from Offenburg. Karlsruhe, July 2010. Date of the oral examination: 8 July 2010. Referee: Prof. Dr. Peter Sanders. Co-referee: Prof. Dr. Ulrich Meyer.

Beginning in 2005, the advent of microprocessors with multiple CPU cores on a single chip has led to a paradigm shift in computer science. Since clock rates have stagnated, switching to a newer processor no longer brings an automatic performance increase. Instead, to achieve an acceleration, a program has to be executed in parallel on the multiple cores. This poses an immense challenge to the development of high-performance applications: not only are multi-core processors ubiquitous nowadays, application developers are also forced to utilize them to avoid a stagnation of performance.

In this thesis, we examine an approach that facilitates the development of applications that exploit parallelism. We implement parallelized variants of established algorithm libraries, which provide the same functionality as their sequential versions but execute faster by utilizing the multiple cores. This allows developers to use parallelism implicitly, encapsulated by the unaltered interface of the algorithms.

For a start, we consider a library of basic algorithms for internal (main) memory, the C++ Standard Template Library (STL). We data-parallelize its algorithms in the shared-memory model, including algorithms for specific data structures, building the Multi-Core Standard Template Library (MCSTL). Geometric algorithms are addressed next, by parallelizing selected routines of the Computational Geometry Algorithms Library (CGAL), facilitated by the use of MCSTL routines. We also apply the MCSTL to speed up the internal computation of basic external memory algorithms, which efficiently handle large data sets stored on disks. In combination with some task parallelism for algorithmic stages on a higher level, this enables the Standard Template Library for XXL Data Sets (STXXL) to exploit multi-core parallelism.

The libraries had already been widely used in their sequential versions, so applications that employ them can immediately benefit from the improved speed. We carry out experiments on the implementations and evaluate the performance in detail, also relating it to hardware limitations such as the memory bandwidth and the disk bandwidth. Case studies using the MCSTL and the STXXL demonstrate the benefit of the libraries to algorithmic applications.

Finally, as a starting point for a generalization to distributed memory, and as an advanced application of the MCSTL and the STXXL, we design distributed external memory sorting algorithms for very large inputs. The implementation of the more practical algorithm set two performance world records in 2009.
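The MCSTL was later integrated into GCC's libstdc++ as its parallel mode, which keeps the "unaltered interface" idea described above. The following is a minimal, hedged sketch (not code from the thesis) of what that looks like for a drop-in parallel sort; it assumes GCC with OpenMP enabled (compile with -fopenmp), and exact behavior may differ between library versions.

```cpp
// Hedged sketch: sorting with the libstdc++ parallel mode that grew out of the MCSTL.
// Compile with: g++ -O2 -fopenmp example.cpp
#include <parallel/algorithm>  // __gnu_parallel::sort
#include <random>
#include <vector>

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 gen(42);
    for (int& x : data) x = static_cast<int>(gen());

    // Same interface and semantics as std::sort, but the work is split
    // across all cores managed by OpenMP.
    __gnu_parallel::sort(data.begin(), data.end());
}
```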
Deutsche Zusammenfassung (German Summary, translated)

Beginning in 2005, the advent of processors with multiple compute cores on a single chip has led to a paradigm shift in computer science. Since clock frequencies no longer rise at the same time, a performance gain is no longer obtained automatically by switching to a new processor. Instead, a speedup must be achieved by executing a program in parallel on multiple cores.

This fact poses a great challenge for the development of high-performance applications. Multi-core processors are by now not only ubiquitous; software developers must also exploit their capabilities in order to prevent a stagnation of computing performance.

In this thesis we examine an approach that makes it easier to develop applications that exploit parallelism: we implement parallelized variants of algorithm libraries. These provide the same functionality as their sequential versions, but their routines run faster because they use multiple cores. Using such libraries gives a software developer implicit parallelism, encapsulated behind the unchanged interface of the algorithm. We concentrate on established C++ libraries and choose OpenMP as the framework for managing threads.

First, we consider a library of basic main-memory algorithms, the Standard Template Library (STL), which is part of the C++ standard library. Its algorithms are data-parallelized, yielding the Multi-Core Standard Template Library (MCSTL). It is suited to shared-memory parallel computers, in particular multi-core machines, and contains routines for searching, partitioning, merging, sorting, and random shuffling, among others. For each routine we select parallel algorithms appropriate to the circumstances.

Furthermore, we design algorithms for manipulating specific data structures. For inserting large amounts of data into an associative container, or for constructing one from a large input set, we consider parallel algorithms for red-black trees. To enable data parallelization also for sequences that do not allow random access, we design an algorithm that splits them into roughly equal-sized parts in a single sequential pass, without knowing their length in advance.

The performance of all algorithms and their implementations is evaluated in detail with experiments on several machines, also taking hardware limitations such as the shared memory bandwidth into account. Mostly good to very good speedups are achieved, often already for relatively small inputs, starting from a sequential running time of 100 microseconds. The software-engineering aspects of integrating the parallelized routines into an existing library are discussed as well; preserving the semantics is a central issue here.

Two case studies demonstrate the usability of the MCSTL routines as subroutines in algorithmic applications. First, we consider the construction of a suffix array, an index data structure for full-text search; by using the parallelized routines as subroutines, an existing implementation is accelerated by a factor of 5. In addition, a variant of Kruskal's algorithm for computing the minimum spanning tree becomes so much faster through parallelization with the MCSTL that it leaves other, non-parallelizable algorithms behind.

From the field of computational geometry, selected routines of the Computational Geometry Algorithms Library (CGAL) are implemented in parallel, again with the help of MCSTL routines.
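The summary above names OpenMP as the thread-management framework. As a hedged, minimal illustration (not code from the thesis) of the fork-join, element-wise data parallelism such parallelized routines build on:

```cpp
// Hedged illustration of OpenMP data parallelism: the same operation is
// applied to every element, and OpenMP splits the iterations across cores.
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> a(1u << 22, 0.5);

    // No dependencies between iterations, so they can run in any order in parallel.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        a[i] = std::sqrt(a[i]);

    std::printf("up to %d threads available\n", omp_get_max_threads());
}
```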
Sorting along the space-filling Hilbert curve serves mainly as a preprocessing step for other algorithms, and it needs parallelization so as not to endanger the scalability of those algorithms. The intersection of axis-parallel (hyper)boxes is often used as a heuristic in collision detection: only if the enclosing axis-parallel boxes of two objects intersect does one have to examine in detail whether the actual objects really intersect. The construction of the Delaunay triangulation in three dimensions is part of many methods for manipulating 3D objects; in contrast to the previous algorithms, we use fine-grained locks here to synchronize concurrent access to the data structure. While good speedups are achieved for the first two problems, the speedups for the Delaunay triangulation are even very close to the optimum.

Next, external memory algorithms are considered, which process data sets that are too large to fit into main memory and are therefore stored on hard disk(s). The MCSTL routines are used to accelerate the internal computation. In combination with a framework for the parallel execution of tasks at a higher algorithmic level, this also enables multi-core parallelism for the Standard Template Library for XXL Data Sets (STXXL). This lets us narrow the gap between the ever-growing bandwidth of hard disks and the stagnating clock frequency of processors. The application example is again the construction of a suffix array, this time for even larger inputs; we can significantly accelerate several external memory algorithms for this problem.

All the libraries mentioned had already found widespread use in their sequential versions, and they are freely available, including the parallelized variants. Applications that use them can therefore benefit directly from the higher speed.

Finally, we design algorithms for sorting very large data sets on clusters with distributed memory and several hard disks per compute node. This serves both as a starting point for generalizing the previous work to distributed-memory systems and as a further, complex application example of the MCSTL and the STXXL. The implementation of the more practical algorithm variant is again evaluated and analyzed in detail experimentally. It compares excellently with competing implementations, setting two performance world records in 2009.
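For the external-memory side, here is a hedged sketch of how an application might call STXXL's external sorter (illustrative, not code from the thesis). The comparator's min_value()/max_value() requirement and the memory-budget parameter follow the STXXL documentation, but header names and signatures may differ between STXXL versions.

```cpp
// Hedged sketch: sorting an STXXL external vector with a bound on internal memory.
#include <stxxl/vector>
#include <stxxl/sort>
#include <cstddef>
#include <limits>

struct CmpLess {
    bool operator()(int a, int b) const { return a < b; }
    // STXXL requires sentinel values for its external merging phases.
    int min_value() const { return std::numeric_limits<int>::min(); }
    int max_value() const { return std::numeric_limits<int>::max(); }
};

int main() {
    stxxl::vector<int> v;
    // ... fill v with far more data than fits into RAM ...
    const std::size_t internal_memory = 512 * 1024 * 1024;  // 512 MiB of RAM for sorting
    stxxl::sort(v.begin(), v.end(), CmpLess(), internal_memory);
}
```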
Recommended publications
  • Data-Level Parallelism
    CSE 610 – Parallel Computer Architectures (Fall 2015), Data-Level Parallelism, Nima Honarmand.

    Overview: data parallelism vs. control parallelism. Data parallelism arises from executing essentially the same code on a large number of objects; control parallelism arises from executing different threads of control concurrently. Hypothesis: applications that use massively parallel machines will mostly exploit data parallelism, which is common in the scientific computing domain. DLP was originally linked with SIMD (Single Instruction Multiple Data) machines; now SIMT (Single Instruction Multiple Threads) is more common. Many incarnations of DLP architectures have appeared over the decades: old vector processors (Cray-1, Cray-2, …, Cray X1), SIMD extensions (Intel SSE and AVX units, the never-shipped Alpha Tarantula), old massively parallel computers (Connection Machines, MasPar machines), and modern GPUs (NVIDIA, AMD, Qualcomm, …), which focus on throughput rather than latency.

    Vector processors: a scalar add such as "add r3, r1, r2" performs one operation on single numbers (scalars), whereas a vector add such as "vadd.vv v3, v1, v2" performs N operations on linear sequences of numbers (vectors), up to the vector length. A vector processor contains a scalar processor (e.g. a MIPS core with a 32-entry scalar register file and scalar functional units for arithmetic, load/store, etc.), a vector register file (a 2D register array, e.g. 32 registers with 32 64-bit elements per register; MVL, the maximum vector length, is the maximum number of elements per register), and a set of vector functional units (integer, FP, load/store, etc.); sometimes the vector and scalar units are combined and share ALUs. The excerpt continues with an example of a simple vector processor and the basic vector ISA.
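As a hedged C++ analogue (not from the slides) of the scalar-vs-vector contrast above: the loop body below is the scalar "add r3, r1, r2", and applying it to every element is what a vector instruction like "vadd.vv v3, v1, v2" does in one go, and what an auto-vectorizing compiler (e.g. at -O3) can emit SIMD instructions for.

```cpp
// Element-wise add: the canonical data-level-parallel kernel.
#include <cstddef>

void vadd(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];  // same operation on many independent elements
}
```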
  • Parallel Prefix Sum (Scan) with CUDA
    Parallel Prefix Sum (Scan) with CUDA. Mark Harris, [email protected], April 2007. Document change history: initial release, February 14, 2007 (Mark Harris).

    Abstract: Parallel prefix sum, also known as parallel scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.

    Contents: Introduction; Inclusive and Exclusive Scan; Sequential Scan; A Naïve Parallel Scan; A Work-Efficient …
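To make the inclusive/exclusive distinction concrete, here is a small C++17 sketch (not from the CUDA document) using the standard library's scan algorithms on a small example array; the CUDA kernels described in the document parallelize the same primitive.

```cpp
// Inclusive vs. exclusive prefix sum on an 8-element array.
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> in{3, 1, 7, 0, 4, 1, 6, 3};
    std::vector<int> ex(in.size()), inc(in.size());

    std::exclusive_scan(in.begin(), in.end(), ex.begin(), 0);  // 0 3 4 11 11 15 16 22
    std::inclusive_scan(in.begin(), in.end(), inc.begin());    // 3 4 11 11 15 16 22 25

    for (int x : ex)  std::printf("%d ", x);
    std::printf("\n");
    for (int x : inc) std::printf("%d ", x);
    std::printf("\n");
}
```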
  • High Performance Computing Through Parallel and Distributed Processing
    Yadav S. et al., J. Harmoniz. Res. Eng., 2013, 1(2), 54-64. Journal of Harmonized Research in Engineering, ISSN 2347-7393, original research article: High Performance Computing through Parallel and Distributed Processing. Shikha Yadav, Preeti Dhanda, Nisha Yadav, Department of Computer Science and Engineering, Dronacharya College of Engineering, Khentawas, Farukhnagar, Gurgaon, India. (Correspondence: preeti.dhanda01ATgmail.com; received October 2013.)

    Abstract: There is a very high need of High Performance Computing (HPC) in many applications, from space science to artificial intelligence. HPC shall be attained through parallel and distributed computing. In this paper, parallel and distributed algorithms are discussed, based on parallel and distributed processors, to achieve HPC. Programming concepts like threads, fork and sockets are discussed with some simple examples for HPC. Keywords: High Performance Computing, Parallel and Distributed Processing, Computer Architecture.

    Introduction: Computer architecture and programming play a significant role for High Performance Computing (HPC) in large applications, from space science to artificial intelligence. Algorithms are problem-solving procedures, and these algorithms are later transformed into a particular programming language for HPC, so there is a need to study algorithms for high performance computing. These algorithms are to be designed to compute in reasonable time to solve large problems like weather forecasting, tsunami, remote sensing, national calamities, defence, mineral exploration, finite-element analysis, cloud computing, expert systems, etc. The algorithms are non-recursive algorithms, recursive algorithms, parallel algorithms and distributed algorithms, and they must be supported by the computer architecture. The computer architecture is characterized with Flynn's classification: SISD, SIMD, MIMD, and MISD. Most computer architectures support SIMD (Single Instruction, Multiple Data streams).
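The abstract above promises simple examples of threads, fork and sockets; as a hedged, minimal illustration of the fork concept on POSIX systems (not taken from the paper): the parent process creates a child, both run concurrently, and the parent waits for the child to finish.

```cpp
// Minimal POSIX fork/wait sketch.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    pid_t pid = fork();          // after this call, two processes run this program
    if (pid == 0) {
        std::printf("child:  doing part of the work\n");
        return 0;
    }
    std::printf("parent: doing the other part\n");
    waitpid(pid, nullptr, 0);    // join the child before exiting
}
```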
  • Paralellizing the Data Cube
    PARALELLIZING THE DATA CUBE, by Todd Eavis. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Dalhousie University, Halifax, Nova Scotia, June 27, 2003. © Copyright by Todd Eavis, 2003. Convocation: October 2003, Department of Computer Science.

    The undersigned hereby certify that they have read and recommend to the Faculty of Graduate Studies for acceptance a thesis entitled "Paralellizing the Data Cube" by Todd Eavis in partial fulfillment of the requirements for the degree of Doctor of Philosophy. External examiner: Virendra Bhavsar. Research supervisor: Andrew Rau-Chaplin. Examining committee: Qigang Gao, Evangelos Milios.

    Permission is herewith granted to Dalhousie University to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions. The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission. The author attests that permission has been obtained for the use of any copyrighted material appearing in this thesis (other than brief excerpts requiring only proper acknowledgement in scholarly writing) and that all such use is clearly acknowledged.

    Dedication: "To the two women in my life: Amber and Bailey." The excerpt continues with the table of contents and the start of Chapter 1 (Introduction, 1.1 Overview of Primary Research).
  • Introduction to Multi-Threading and Vectorization (Matti Kortelainen, LArSoft Workshop, 25 June 2019)
    Introduction to multi-threading and vectorization. Matti Kortelainen, LArSoft Workshop 2019, 25 June 2019.

    Outline, a broad introductory overview: Why multithread? What is a thread? Some threading models: std::thread, OpenMP (fork-join), and Intel Threading Building Blocks (TBB, tasks). Race condition, critical region, mutual exclusion, deadlock. Vectorization (SIMD).

    Motivations for multithreading (image courtesy of K. Rupp): with one process on a node, speedups come from parallelizing parts of the program; any problem can get a speedup if the threads can cooperate on the same core (sharing the L1 cache) or within an L2 cache (which may be shared among a small number of cores). On a fully loaded node, threads save memory and other resources: threads can share objects, so N threads can use significantly less memory than N processes. And if the smallest chunk of data is so big that only one fits in memory at a time, is there any other option?

    What is a (software) thread? (in POSIX/Linux): the "smallest sequence of programmed instructions that can be managed independently by a scheduler" [Wikipedia]. A thread has its own program counter, registers, stack, and thread-local memory (better to avoid in general). Threads of a process share everything else, e.g. program code, constants, heap memory, network connections, and file handles.
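A hedged C++ sketch (not from the slides) tying together several terms in the outline: two std::thread workers, a critical region protected by a std::mutex, and the race condition that would occur without it.

```cpp
// Two threads increment a shared counter; the mutex makes the update a critical region.
#include <mutex>
#include <thread>
#include <cstdio>

int main() {
    long counter = 0;
    std::mutex m;

    auto work = [&] {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // without this, the updates race
            ++counter;
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::printf("counter = %ld\n", counter);  // always 200000 with the lock held
}
```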
  • CSE373: Data Structures & Algorithms Lecture 26
    CSE373: Data Structures & Algorithms, Lecture 26: Parallel Reductions, Maps, and Algorithm Analysis. Aaron Bauer, Winter 2014.

    Outline. Done: how to write a parallel algorithm with fork and join, and why using divide-and-conquer with lots of small tasks is best: it combines results in parallel (assuming the library can handle "lots of small threads"). Now: more examples of simple parallel programs that fit the "map" or "reduce" patterns; a teaser on going beyond maps and reductions; asymptotic analysis for fork-join parallelism; Amdahl's Law; final exam and victory lap.

    What else looks like this? We saw that summing an array went from O(n) sequential to O(log n) parallel (assuming a lot of processors and very large n), an exponential speed-up in theory (n / log n grows exponentially). Anything that can use results from two halves and merge them in O(1) time has the same property. Examples: the maximum or minimum element; whether there is an element satisfying some property (e.g., is there a 17?); the left-most element satisfying some property (e.g., the first 17), where one must decide what the recursive tasks return and how to merge the results; the corners of a rectangle containing all points (a "bounding box"); counts, for example the number of strings that start with a vowel, which is just summing with a different base case, as many problems are. Computations of this form are called reductions.
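A hedged C++ sketch of the fork-join reduction pattern the lecture describes, using std::async for the forked half (illustrative only; the lecture's own fork-join framework may differ). The cutoff keeps the tasks from becoming too small.

```cpp
// Divide-and-conquer parallel sum: fork the left half, recurse on the right, join, combine in O(1).
#include <future>
#include <numeric>
#include <vector>

long long parallel_sum(const int* a, std::size_t n) {
    if (n <= 100000)                                    // sequential base case
        return std::accumulate(a, a + n, 0LL);
    auto left = std::async(std::launch::async,          // fork: left half in a new task
                           parallel_sum, a, n / 2);
    long long right = parallel_sum(a + n / 2, n - n / 2);
    return left.get() + right;                          // join and combine
}

int main() {
    std::vector<int> v(1 << 22, 1);
    return parallel_sum(v.data(), v.size()) == (1 << 22) ? 0 : 1;
}
```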
  • Demystifying Parallel and Distributed Deep Learning: an In-Depth Concurrency Analysis
    Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. Tal Ben-Nun and Torsten Hoefler, ETH Zurich, Switzerland.

    Abstract: Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

    CCS Concepts: General and reference → Surveys and overviews; Computing methodologies → Neural networks, Parallel computing methodologies, Distributed computing methodologies. Additional key words and phrases: Deep Learning, Distributed Computing, Parallel Algorithms. ACM Reference Format: Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. 47 pages.

    Introduction: Machine learning, and in particular deep learning [143], is rapidly taking over a variety of aspects in our daily lives.
  • Parallel Programming: Performance
    Parallel Programming: Performance. Todd C. Mowry, CS 418, January 20, 25 & 26, 2011.

    Introduction: there is a rich space of techniques and issues, which trade off and interact with one another. Issues can be addressed or helped by software or hardware, via algorithmic or programming techniques and via architectural techniques. The focus here is on performance issues and software techniques, pointing out some architectural implications; architectural techniques are covered in the rest of the class.

    Programming as successive refinement: not all issues are dealt with up front. Partitioning is often independent of the architecture, and done first: view the machine as a collection of communicating processors, balancing the workload, reducing the amount of inherent communication, and reducing extra work; there is a tug-of-war even among these three issues. Then come interactions with the architecture: view the machine as an extended memory hierarchy, with extra communication due to architectural interactions and a communication cost that depends on how the communication is structured; this may inspire changes in the partitioning. The issues are discussed one at a time, but the discussion identifies tradeoffs, using examples and measurements on the SGI Origin2000.

    Outline: 1. Partitioning for performance. 2. Relationship of communication, data locality and architecture. 3. Orchestration for performance. For each issue: techniques to address it and tradeoffs with previous issues, illustration using case studies, application to a grid solver, and some architectural implications. 4. Components of execution time as seen by the processor: what the workload looks like to the architecture, related to software issues.

    Partitioning for performance: (1) balancing the workload and reducing wait time at synchronization points, (2) … Load balance and synchronization wait time limit the speedup: Speedup_problem(p) < Sequential Work / Max Work on any Processor, where work includes data access and other costs.
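As a worked illustration of the load-balance bound at the end of the excerpt (numbers invented purely for illustration):

```latex
% Load-balance bound from the slides, with a hypothetical numeric example.
\[
  \text{Speedup}_{\text{problem}}(p) \;\le\;
  \frac{\text{Sequential Work}}{\text{Max Work on any Processor}}
\]
% Example: 100 units of total work split over p = 4 processors as (40, 20, 20, 20)
% gives a bound of 100/40 = 2.5, well below the ideal speedup of 4.
```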
  • CSE 613: Parallel Programming Lecture 2
    CSE 613: Parallel Programming, Lecture 2 (Analytical Modeling of Parallel Algorithms). Rezaul A. Chowdhury, Department of Computer Science, SUNY Stony Brook, Spring 2019. Figures after "Introduction to Parallel Computing", 2nd edition, Grama et al.

    Parallel execution time and overhead: the parallel running time on p processing elements is T_P = t_end - t_start, where t_start is the starting time of the processing element that starts first and t_end is the termination time of the processing element that finishes last. Sources of overhead (with respect to serial execution): interprocess interaction (interacting and communicating data, e.g. intermediate results), idling (due to load imbalance, synchronization, presence of serial computation, etc.), and excess computation (the fastest serial algorithm may be difficult or impossible to parallelize). The overhead function, or total parallel overhead, is T_O = p T_P - T, where p is the number of processing elements and T is the time spent doing useful work (often the execution time of the fastest serial algorithm).

    Speedup: let T_p be the running time using p identical processing elements; then the speedup is S_p = T_1 / T_p. Theoretically, S_p <= p (why?). The speedup is perfect (linear, ideal) if S_p = p. Consider adding n numbers using n identical processing elements; serial runtime, …
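A worked example with invented numbers (not from the slides) to make the definitions above concrete:

```latex
% Suppose the fastest serial time is T = T_1 = 100 s, p = 8 processing elements,
% and the measured parallel time is T_P = 16 s.
\[
  S_p = \frac{T_1}{T_P} = \frac{100}{16} = 6.25 \;\le\; p = 8,
  \qquad
  T_O = p\,T_P - T = 8 \cdot 16 - 100 = 28 \text{ s of total overhead.}
\]
```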
  • Exploiting Parallelism in Many-Core Architectures: Lattice Boltzmann Models as a Test Case (F Mantovani, M Pivanti, S F Schifano, R Tripiccione)
    Exploiting parallelism in many-core architectures: Lattice Boltzmann models as a test case. F Mantovani (Department of Physics, Universität Regensburg, Germany), M Pivanti (Department of Physics, Università di Roma La Sapienza, Italy), S F Schifano (Department of Mathematics and Informatics, Università di Ferrara and INFN, Italy), and R Tripiccione (Department of Physics and CMCS, Università di Ferrara and INFN, Italy). E-mail: [email protected], [email protected], [email protected], [email protected]. 24th IUPAP Conference on Computational Physics (IUPAP-CCP 2012), Journal of Physics: Conference Series 454 (2013) 012015, doi:10.1088/1742-6596/454/1/012015, IOP Publishing, open access. To cite this article: F Mantovani et al 2013 J. Phys.: Conf. Ser. 454 012015.

    Abstract: In this paper we address the problem of identifying and exploiting techniques that optimize the performance of large scale scientific codes on many-core processors.
  • An Introduction to GPUs, CUDA and OpenCL
    An Introduction to GPUs, CUDA and OpenCL. Bryan Catanzaro, NVIDIA Research.

    Overview: heterogeneous parallel computing; the CUDA and OpenCL programming models; writing efficient CUDA code; Thrust: making CUDA C++ productive.

    Heterogeneous parallel computing pairs a latency-optimized CPU (fast serial processing) with a throughput-optimized GPU (scalable parallel processing). Why do we need heterogeneity, rather than just latency-optimized processors? Once you decide to go parallel, why not go all the way and reap more benefits: for many applications, throughput-optimized processors are more efficient, faster and using less power, and the advantages can be fairly significant. Different goals produce different designs: throughput-optimized designs assume the workload is highly parallel, latency-optimized designs assume the workload is mostly sequential. To minimize the latency experienced by one thread, use lots of big on-chip caches and sophisticated control. To maximize the throughput of all threads, multithreading can hide latency, so skip the big caches; use simpler control, with the cost amortized over ALUs via SIMD.

    Latency vs. throughput, specifications:

    Specification                    | Westmere-EP (32 nm)                         | Fermi (Tesla C2050)
    Processing elements              | 6 cores, 2-issue, 4-way SIMD @ 3.46 GHz     | 14 SMs, 2-issue, 16-way SIMD @ 1.15 GHz
    Resident strands/threads (max)   | 6 cores, 2 threads, 4-way SIMD: 48 strands  | 14 SMs, 48 SIMD vectors, 32-way SIMD: 21504 threads
    SP GFLOP/s                       | 166                                         | 1030
    Memory bandwidth                 | 32 GB/s                                     | 144 GB/s
    Register file                    | ~6 kB                                       | 1.75 MB
    Local store / L1 cache           | 192 kB                                      | 896 kB
    L2 cache                         | 1.5 MB                                      | 0.75 MB
  • A Review of Multicore Processors with Parallel Programming
    International Journal of Engineering Technology, Management and Applied Sciences, www.ijetmas.com, September 2015, Volume 3, Issue 9, ISSN 2349-4476. A Review of Multicore Processors with Parallel Programming. Anchal Thakur (Research Scholar, CSE Department, L.R Institute of Engineering and Technology, Solan, India) and Ravinder Thakur (Assistant Professor, CSE Department, L.R Institute of Engineering and Technology, Solan, India).

    Abstract: When computers were first introduced on the market, they came with single processors, which limited their performance and efficiency. The classic way of overcoming the performance issue was to use bigger processors to execute the data at higher speed. Big processors did improve performance to a certain extent, but they consumed a lot of power, which started overheating the internal circuits. To achieve efficiency and speed simultaneously, CPU architectures developed into multi-core processor units in which two or more processors are used to execute a task. The multi-core technology offered better response time while running big applications, better power management and faster execution time. Multi-core processors also gave developers the opportunity to use parallel programming to execute tasks in parallel. These days parallel programming is used to execute a task by distributing it into smaller instructions and executing them on different cores. By using parallel programming, the complex tasks that are carried out in a multi-core environment can be executed with higher efficiency and performance. Keywords: Multicore Processing, Multicore Utilization, Parallel Processing.

    Introduction: From the day computers were invented, great importance has been given to their efficiency in executing tasks.