GPU Basics


Types of Parallelism
S. Sundar and M. Panchatcharam
August 9, 2014

Outline
1. Task-based parallelism
2. Data-based parallelism
3. Flynn's taxonomy
4. Other parallel patterns
5. Classes of parallel computers
6. Simple coding

Pipeline parallelism
- A typical OS exploits a type of parallelism called task-based parallelism.
- Example: a user reads an article on a website while playing music from a media player.
- In parallel-programming terms, Linux expresses this with pipelined commands (pipes).

Coarse/fine-grained parallelism
- Applications are often classified by how often their subtasks need to synchronize or communicate with each other.
- If subtasks communicate many times per second, the parallelism is called fine-grained.
- If they do not communicate many times per second, it is called coarse-grained.
- If they rarely or never have to communicate, the problem is called embarrassingly parallel.
- Embarrassingly parallel applications are the easiest to parallelize.

Task parallelism
- In a multiprocessor system, task parallelism is achieved when each processor executes a different thread on the same or different data.
- Threads may execute the same or different code.
- Communication between threads takes place by passing data from one thread to the next.

Data parallelism
- In a multiprocessor system, each processor performs the same task on different pieces of distributed data.
- Consider adding two matrices: addition is the task, applied to different pieces of data (each element), as in the sketch below.
- This is fine-grained parallelism.
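To make the matrix example concrete, here is a minimal C++/OpenMP sketch of this kind of data parallelism (not taken from the slides; the matrix sizes and names are arbitrary). Each element-wise addition is an independent piece of work that the runtime distributes across threads:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int rows = 512, cols = 512;        // illustrative sizes
    std::vector<double> A(rows * cols, 1.0);
    std::vector<double> B(rows * cols, 2.0);
    std::vector<double> C(rows * cols, 0.0);

    // Same task (addition) applied to different pieces of data: every
    // element-wise sum is independent, so the loop splits cleanly
    // across the available threads.
    #pragma omp parallel for
    for (int i = 0; i < rows * cols; ++i)
        C[i] = A[i] + B[i];

    std::printf("C[0] = %.1f\n", C[0]);      // expect 3.0
    return 0;
}
```

Because each iteration touches a distinct element, no synchronization is needed inside the loop, which is exactly what makes element-wise operations fine-grained data parallelism.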
Instruction-level parallelism
- Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously; the potential overlap among instructions is called instruction-level parallelism.
- The hardware level works on dynamic parallelism (discovered at run time).
- The software level works on static parallelism (discovered at compile time).

Flynn's Taxonomy
- Flynn's taxonomy is a classification of different computer architectures. The classes are:
  - SIMD - Single Instruction, Multiple Data
  - MIMD - Multiple Instruction, Multiple Data
  - SISD - Single Instruction, Single Data
  - MISD - Multiple Instruction, Single Data

SISD
- Standard serial programming: a single instruction stream operating on a single data stream.
- A single-core CPU is enough.

MIMD
- Today's dual- or quad-core desktop machines.
- Work is allocated to one of N CPU cores.
- Each thread has an independent stream of instructions.
- The hardware contains the control logic for decoding the separate instruction streams.

SIMD
- A type of data parallelism.
- A single instruction stream at any one point in time.
- A single set of logic to decode and execute the instruction stream.

MISD
- Many functional units perform different operations on the same data.
- Pipeline architectures belong to this type.
- Example: Horner's rule,
  y = (···(((a_n·x + a_{n−1})·x + a_{n−2})·x + a_{n−3})·x + ··· + a_1)·x + a_0

Loop-based pattern
- Are you familiar with loop structures? Types of loops: entry-controlled and exit-controlled.
- This is an easy pattern to parallelize: with inter-loop dependencies removed, decide how to split or partition the work between the available processors.
- Optimize communication between processors and the use of on-chip resources; communication overhead is the bottleneck.
- Decompose based on the number of logical hardware threads available.
- Oversubscribing the number of threads, however, leads to poor performance, because context switching is then performed in software by the OS.
- Be aware of hidden dependencies when parallelizing an existing serial implementation.
- Concentrate on the inner loops and one or more outer loops; the best approach is to parallelize only the outer loops.
- Note: most loops can be flattened, i.e., the inner and outer loops reduced to a single loop where possible.
- Example: an image-processing algorithm with the X (pixel) axis in the inner loop and the Y axis in the outer loop can be flattened by treating all pixels as a single one-dimensional array and iterating over image coordinates, as in the sketch below.
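The flattening in the last example can be sketched as follows (an illustrative fragment, not code from the slides; the image size, layout, and brightening operation are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Flattened image-processing loop: the nested (y, x) pixel loops are
// collapsed into a single loop over a one-dimensional pixel array,
// giving the scheduler one large iteration space to partition.
void brighten(std::vector<std::uint8_t>& image, int width, int height) {
    #pragma omp parallel for
    for (int i = 0; i < width * height; ++i) {
        int value = image[i] + 10;   // the per-pixel work
        image[i] = static_cast<std::uint8_t>(value > 255 ? 255 : value);
    }
}

int main() {
    const int width = 640, height = 480;   // illustrative dimensions
    std::vector<std::uint8_t> image(width * height, 100);
    brighten(image, width, height);
    return 0;
}
```

If the per-pixel work still needs coordinates, they can be recovered as y = i / width and x = i % width.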
Fork/join pattern
- A common pattern in serial programming where there are synchronization points and only certain aspects of the program are parallel.
- The serial code reaches work that can be distributed to P processors in some manner.
- It then forks, or spawns, N threads/processes that perform the calculation in parallel; these execute independently and finally converge, or join, once all the calculations are complete.
- This is a typical approach found in OpenMP.
- The code splits into N threads and later converges to a single thread again.
- In the accompanying figure, a queue of data items is split across three processing cores; each data item is processed independently and later written to the appropriate destination.
- The pattern is typically implemented by partitioning the data: the serial code launches N threads and divides the dataset equally between them.
- This works well if each packet of data takes the same time to process; if one thread takes much longer than the others, it becomes the single factor determining the total time.
- Why choose three threads instead of six? In reality, with millions of data items, attempting to fork a million threads will cause almost any OS to fail, and the OS applies a "fair scheduling" policy.
- Programmers and many multithreaded libraries therefore use the number of logical processor threads available as the number of threads to fork, because CPU threads are expensive to create, destroy, and manage.
- The fork/join pattern is also useful when there is an unknown amount of concurrency available in a problem: traversing a tree structure may fork additional threads when another node is encountered.

Divide-and-conquer pattern
- A pattern for breaking down (dividing) large problems into smaller sections, each of which can be conquered independently.
- Useful with recursion.
- Example: quicksort, which recursively partitions the data into two sets, above the pivot point and below the pivot point (see the sketch below).
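As an illustration of how divide and conquer maps onto fork/join-style tasking, here is a minimal quicksort sketch using OpenMP tasks (the 1000-element cutoff and the sample data are arbitrary choices, not taken from the slides):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Divide: partition around a pivot; conquer: sort each half in its own task.
void quicksort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);

    // Fork a task per half; stop forking once subproblems are small.
    #pragma omp task shared(a) if (hi - lo > 1000)
    quicksort(a, lo, i - 1);
    #pragma omp task shared(a) if (hi - lo > 1000)
    quicksort(a, i + 1, hi);
    #pragma omp taskwait   // join: wait for both halves
}

int main() {
    std::vector<int> data = {5, 3, 8, 1, 9, 2, 7};
    #pragma omp parallel
    {
        #pragma omp single   // one thread seeds the recursion; tasks spread the work
        quicksort(data, 0, static_cast<int>(data.size()) - 1);
    }
    for (int v : data) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
```

The two task directives are the fork and the taskwait is the join; the if clause keeps small subproblems sequential so task-creation overhead does not dominate.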