Tools for Monitoring CPU Usage and Affinity in Multicore Supercomputers

Total Page:16

File Type:pdf, Size:1020Kb

Tools for Monitoring CPU Usage and Affinity in Multicore Supercomputers Tools for Monitoring CPU Usage and Affinity in Multicore Supercomputers Lei Huang Kent Milfeld Si Liu Texas Advanced Computing Center Texas Advanced Computing Center Texas Advanced Computing Center The University of Texas at Austin The University of Texas at Austin The University of Texas at Austin 10100 Burnet Rd. Austin, TX, 78758 10100 Burnet Rd. Austin, TX, 78758 10100 Burnet Rd. Austin, TX, 78758 Email: [email protected] Email: [email protected] Email: [email protected] Abstract—Performance boosts in HPC nodes have come from Skylake (SKX) and Cascade Lake (CLX) processors [31], making SIMD units wider and aggressively packing more and respectively. These, and other HPC processors, also support more cores in each processor. With multiple processors and so Simultaneous Multi-Threading (SMT) to a level of 2 to 4 many cores it has become necessary to understand and manage process and thread affinity and pinning. However, affinity tools per core. Consequently, there could be 2x to 4x more logical have not been designed specifically for HPC users to quickly processors than physical processors on a node. evaluate process affinity and execution location. To fill in the When working with nodes of such large core counts, the gap, three HPC user-friendly tools, core usage, show affinity, and performance of HPC applications is not only dependent upon amask, have been designed to eliminate barriers that frustrate the number and speed of the cores, but also upon proper users and impede users from evaluating and analyzing affinity for applications. These tools focus on providing convenient scheduling of processes and threads. HPC application runs methods, easy-to-understand affinity representations for large with proper affinity settings will take full advantage of re- process counts, process locality, and run-time core load with sources like local memory, reusable caches, etc., and will socket aggregation. These tools will significantly help HPC users, obtain a distinct benefit in performance. developers and site administrators easily monitor processor utilization from an affinity perspective. II. BACKGROUND Index Terms—Supercomputers, User support tool, Multicore A. Process and Thread Affinity system, Affinity, Resource utilization, Core binding, Real-time monitoring, Debugging A modern computer often has more than one socket per node and therefore HPC applications may have non-uniform I. INTRODUCTION access to memory. Ideally, an application process should be Up to the millennium, the processor frequency of com- placed on a processor that is close to the data in memory modity CPUs increased exponentially year after year. High it accesses, to get the best performance. Process and thread CPU frequency had been one of the major driving forces to affinity/pinning allows a process or a thread to bind to a boost CPU performance, other than the introduction of vector single processor or a set of (logical) processors. The processes processor units. However, it ceased to grow significantly in or threads with specific affinity settings will then only run recent years due to both technical reasons and market forces. on the designated processor(s). For Single-Program Multiple- To accommodate the high demand of computing power in Data (SPMD) applications, managing this affinity can be HPC, significantly more cores are being packed into a single difficult. Moreover, the present-day workflows on modern compute node [15]. supercomputers have moved beyond the SPMD approach and The needs of HPC and the use of core-rich processors are now include hierarchical levels of Multiple-Program Multiple- exemplified in the extraordinary large-scale supercomputers Data (MPMD), demanding even more attention to affinity. found throughout the world. The Sierra supercomputer [17] at MPI affinity for Intel MPI (IMPI), MVAPICH2 the Lawrence Livermore National Laboratory and the Summit (MV2), Open MPI (OMPI), and IBM Spectrum MPI supercomputer [19] at the Oak Ridge National Lab have 44 (SMPI) have a variety of mechanisms for setting processing cores per compute node with two IBM Power9 affinity. IMPI relies solely on “I MPI x” environment CPUs [13]. The Sunway TaihuLight supercomputer [18] at variables, as does MV2 (MV2 CPU/HYBRID BINDING x, the National Supercomputing Center in Wuxi deploys Sunway MV2 CPU MAPPING x, etc.). SMPI uses both environment SW26010 manycore processors, containing 256 processing variables (MP TASK/CPU x) and mpirun command-line cores and additional 4 auxiliary cores for system manage- options (-map-by, -bind-to, -aff shortcuts, etc.). Similarly, ment [32] per node. The Stampede2 supercomputer [28] at OMPI uses mpirun options (-bind-to-core, –cpus-per-proc, the Texas Advanced Computing Center (TACC) provides Intel etc.) and also accepts a rankfile file with a map (slot-list) for Knights Landing (KNL) nodes with 68 cores per node. The each rank. Stampede2 [28] and Frontera [27] supercomputers at TACC When no affinity is specified these MPIs evaluate a node’s provide 48 and 56 processing cores per node with Intel’s hardware configuration (for example with hwloc for MV2 and OMPI) and make appropriate default affinity settings. OpenMP B. Related Work affinity for hybrid runs can be specified by various “vendor” There are many ways to view CPU loads and the affinity methods. However, since all of these MPIs accept OpenMP’s mask of a process. Moreover, some methods are not well- OMP PLACES/OMP PROC BIND specifications, it is best known or only available through unexpected means. to use the standard’s mechanism. Hence, for portable hybrid computing a user must deal with many ways of setting each The Linux command-line tool top [8] and its more recent rank’s affinity. (When a master thread encounters a parallel counterparts (htop or atop) can be used to monitor CPU loads, region it inherits the MPI rank’s mask, and OpenMP Affinity and manage process and thread affinity in real time. The specifications take over). Linux command ps [3] can report which core a process is running on. However, it does not report the affinity mask Fig. 1 shows a schematic of the affinity process. A mask explicitly. It only reports the core the process is presently is maintained for each process by the kernel that describes running on. The taskset [7] command-line utility is normally which processor(s) the process can run on. The mask consists more helpful since it can query and modify the binding affinity of a bit for each processor, and the process can execute on any of a given thread or process. Linux also provides API functions processor where a mask bit is set. There are a myriad of ways sched getaffinity [5] and pthread setaffinity np [4] for a to set and alter the affinity mask for processes of a parallel process or thread to query and set the affinity (kernel mask) application. For instance, vendors have their own way to set of itself. While these tools are pervasive and do provide the affinities for MPI and OpenMP, usually through environment information needed, they are sometimes cumbersome to use, variables. Only recently has OpenMP 4.5 [20], [21] provided a particularly for supercomputer users working with large core standard way to set affinity for threads, and MPI has yet to do counts. this for MPI tasks. As shown in Fig. 1 the affinity can not only be affected before an application is launched but also while For HPC users, these tools may provide too much admin- it is running. There are utilities such as numactl [2] and util- istrative information, and it may not be apparent how to get linux taskset [7] to do this. Furthermore, the affinity can even HPC-relevant information for their applications. Users need be changed within a program with the sched setaffinity [6] to remember extra options or give extra instructions to obtain function. relevant CPU information. For instance, top does not show the loads of individual processors by default. For example, pressing “1” within a top session is required to display the performance data of individual CPUs; and pressing “z” is needed to display running process in color. The htop utility does show usage information for all logical processors and the load on each individual core is represented by a progress bar in text mode. It works up to about one hundred sixty cores. However, the progress bar is distracting on a computer with many cores. Furthermore, such tools were originally designed for admin- istrators to display multi-user information, not for a single- user screen display of HPC information. Therefore, there is a real need for convenient HPC tools to readily display each CPU utilization and affinity information for multicore compute nodes. The MPI libraries and OpenMP 5.0 [20] implementions themselves can present affinity mask information on the pro- Fig. 1. The left box indicates mechanisms for setting the affinity mask. The cesses or threads they instantiate. For instance, by setting the right box illustrates how a BIOS setting has designated the processor ids for the hardware (cores). The center section shows a mask with bits set for I MPI DEBUG environment to 4 or above, the Intel MPI execution on cores 1, 3, 5, and 7. runtime will print a list of processor ids (mask bits set) for each process (rank) at launch time. Likewise for OpenMP 5.0 im- Understanding the “vernaculars” of all these methods can be plementations, setting OMP AFFINITY DISPLAY to TRUE challenging. Even the default settings are sometimes unknown will have the runtime print a line for each thread (number) to users. In addition, users are commonly uncertain of their reporting the processor ids associated with its binding, at the attempts to set the affinity for processes of their parallel beginning of the first parallel region. However, it is difficult to applications.
Recommended publications
  • Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline
    Introduction to multi-threading and vectorization Matti Kortelainen LArSoft Workshop 2019 25 June 2019 Outline Broad introductory overview: • Why multithread? • What is a thread? • Some threading models – std::thread – OpenMP (fork-join) – Intel Threading Building Blocks (TBB) (tasks) • Race condition, critical region, mutual exclusion, deadlock • Vectorization (SIMD) 2 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading Image courtesy of K. Rupp 3 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading • One process on a node: speedups from parallelizing parts of the programs – Any problem can get speedup if the threads can cooperate on • same core (sharing L1 cache) • L2 cache (may be shared among small number of cores) • Fully loaded node: save memory and other resources – Threads can share objects -> N threads can use significantly less memory than N processes • If smallest chunk of data is so big that only one fits in memory at a time, is there any other option? 4 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization What is a (software) thread? (in POSIX/Linux) • “Smallest sequence of programmed instructions that can be managed independently by a scheduler” [Wikipedia] • A thread has its own – Program counter – Registers – Stack – Thread-local memory (better to avoid in general) • Threads of a process share everything else, e.g. – Program code, constants – Heap memory – Network connections – File handles
    [Show full text]
  • Unit: 4 Processes and Threads in Distributed Systems
    Unit: 4 Processes and Threads in Distributed Systems Thread A program has one or more locus of execution. Each execution is called a thread of execution. In traditional operating systems, each process has an address space and a single thread of execution. It is the smallest unit of processing that can be scheduled by an operating system. A thread is a single sequence stream within in a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. In a process, threads allow multiple executions of streams. Thread Structure Process is used to group resources together and threads are the entities scheduled for execution on the CPU. The thread has a program counter that keeps track of which instruction to execute next. It has registers, which holds its current working variables. It has a stack, which contains the execution history, with one frame for each procedure called but not yet returned from. Although a thread must execute in some process, the thread and its process are different concepts and can be treated separately. What threads add to the process model is to allow multiple executions to take place in the same process environment, to a large degree independent of one another. Having multiple threads running in parallel in one process is similar to having multiple processes running in parallel in one computer. Figure: (a) Three processes each with one thread. (b) One process with three threads. In former case, the threads share an address space, open files, and other resources. In the latter case, process share physical memory, disks, printers and other resources.
    [Show full text]
  • Parallel Computing
    Parallel Computing Announcements ● Midterm has been graded; will be distributed after class along with solutions. ● SCPD students: Midterms have been sent to the SCPD office and should be sent back to you soon. Announcements ● Assignment 6 due right now. ● Assignment 7 (Pathfinder) out, due next Tuesday at 11:30AM. ● Play around with graphs and graph algorithms! ● Learn how to interface with library code. ● No late submissions will be considered. This is as late as we're allowed to have the assignment due. Why Algorithms and Data Structures Matter Making Things Faster ● Choose better algorithms and data structures. ● Dropping from O(n2) to O(n log n) for large data sets will make your programs faster. ● Optimize your code. ● Try to reduce the constant factor in the big-O notation. ● Not recommended unless all else fails. ● Get a better computer. ● Having more memory and processing power can improve performance. ● New option: Use parallelism. How Your Programs Run Threads of Execution ● When running a program, that program gets a thread of execution (or thread). ● Each thread runs through code as normal. ● A program can have multiple threads running at the same time, each of which performs different tasks. ● A program that uses multiple threads is called multithreaded; writing a multithreaded program or algorithm is called multithreading. Threads in C++ ● The newest version of C++ (C++11) has libraries that support threading. ● To create a thread: ● Write the function that you want to execute. ● Construct an object of type thread to run that function. – Need header <thread> for this. ● That function will run in parallel alongside the original program.
    [Show full text]
  • Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2
    Overview SIMD and MIMD in the Multicore Context Single Instruction Multiple Instruction ● (note: Tute 02 this Weds - handouts) ● Flynn’s Taxonomy Single Data SISD MISD ● multicore architecture concepts Multiple Data SIMD MIMD ● for SIMD, the control unit and processor state (registers) can be shared ■ hardware threading ■ SIMD vs MIMD in the multicore context ● however, SIMD is limited to data parallelism (through multiple ALUs) ■ ● T2: design features for multicore algorithms need a regular structure, e.g. dense linear algebra, graphics ■ SSE2, Altivec, Cell SPE (128-bit registers); e.g. 4×32-bit add ■ system on a chip Rx: x x x x ■ 3 2 1 0 execution: (in-order) pipeline, instruction latency + ■ thread scheduling Ry: y3 y2 y1 y0 ■ caches: associativity, coherence, prefetch = ■ memory system: crossbar, memory controller Rz: z3 z2 z1 z0 (zi = xi + yi) ■ intermission ■ design requires massive effort; requires support from a commodity environment ■ speculation; power savings ■ massive parallelism (e.g. nVidia GPGPU) but memory is still a bottleneck ■ OpenSPARC ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD ● T2 performance (why the T2 is designed as it is) ■ higher hardware costs also includes larger shared resources (caches, TLBs) ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009 ) needed ⇒ less parallelism than for SIMD COMP8320 Lecture 2: Multicore Architecture and the T2 2011 ◭◭◭ • ◮◮◮ × 1 COMP8320 Lecture 2: Multicore Architecture and the T2 2011 ◭◭◭ • ◮◮◮ × 3 Hardware (Multi)threading The UltraSPARC T2: System on a Chip ● recall concurrent execution on a single CPU: switch between threads (or ● OpenSparc Slide Cast Ch 5: p79–81,89 processes) requires the saving (in memory) of thread state (register values) ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual ■ motivation: utilize CPU better when thread stalled for I/O (6300 Lect O1, p9–10) CPUs) ■ what are the costs? do the same for smaller stalls? (e.g.
    [Show full text]
  • Parallel Programming with Openmp
    Parallel Programming with OpenMP OpenMP Parallel Programming Introduction: OpenMP Programming Model Thread-based parallelism utilized on shared-memory platforms Parallelization is either explicit, where programmer has full control over parallelization or through using compiler directives, existing in the source code. Thread is a process of a code is being executed. A thread of execution is the smallest unit of processing. Multiple threads can exist within the same process and share resources such as memory OpenMP Parallel Programming Introduction: OpenMP Programming Model Master thread is a single thread that runs sequentially; parallel execution occurs inside parallel regions and between two parallel regions, only the master thread executes the code. This is called the fork-join model: OpenMP Parallel Programming OpenMP Parallel Computing Hardware Shared memory allows immediate access to all data from all processors without explicit communication. Shared memory: multiple cpus are attached to the BUS all processors share the same primary memory the same memory address on different CPU's refer to the same memory location CPU-to-memory connection becomes a bottleneck: shared memory computers cannot scale very well OpenMP Parallel Programming OpenMP versus MPI OpenMP (Open Multi-Processing): easy to use; loop-level parallelism non-loop-level parallelism is more difficult limited to shared memory computers cannot handle very large problems An alternative is MPI (Message Passing Interface): require low-level programming; more difficult programming
    [Show full text]
  • Intel's Hyper-Threading
    Intel’s Hyper-Threading Cody Tinker & Christopher Valerino Introduction ● How to handle a thread, or the smallest portion of code that can be run independently, plays a core component in enhancing a program's parallelism. ● Due to the fact that modern computers process many tasks and programs simultaneously, techniques that allow for threads to be handled more efficiently to lower processor downtime are valuable. ● Motive is to reduce the number of idle resources of a processor. ● Examples of processors that use hyperthreading ○ Intel Xeon D-1529 ○ Intel i7-6785R ○ Intel Pentium D1517 Introduction ● Two main trends have been followed to increase parallelism: ○ Increase number of cores on a chip ○ Increase core throughput What is Multithreading? ● Executing more than one thread at a time ● Normally, an operating system handles a program by scheduling individual threads and then passing them to the processor. ● Two different types of hardware multithreading ○ Temporal ■ One thread per pipeline stage ○ Simultaneous (SMT) ■ Multiple threads per pipeline stage ● Intel’s hyper-threading is a SMT design Hyper-threading ● Hyper-threading is the hardware solution to increasing processor throughput by decreasing resource idle time. ● Allows multiple concurrent threads to be executed ○ Threads are interleaved so that resources not being used by one thread are used by others ○ Most processors are limited to 2 concurrent threads per physical core ○ Some do support 8 concurrent threads per physical core ● Needs the ability to fetch instructions from
    [Show full text]
  • Lecture 3: a Simple Example, Tools, and CUDA Threads
    ECE498AL Lecture 3: A Simple Example, Tools, and CUDA Threads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 1 ECE498AL, University of Illinois, Urbana-Champaign A Simple Running Example Matrix Multiplication • A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs – Leave shared memory usage until later – Local, register usage – Thread ID usage – Memory data transfer API between host and device – Assume square matrix for simplicity © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 2 ECE498AL, University of Illinois, Urbana-Champaign Programming Model: Square Matrix Multiplication Example • P = M * N of size WIDTH x WIDTH N • Without tiling: H T D – One thread calculates one element I W of P – M and N are loaded WIDTH times from global memory M P H T D I W WIDTH WIDTH © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 3 ECE498AL, University of Illinois, Urbana-Champaign Memory Layout of a Matrix in C M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3 M M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 4 ECE498AL, University of Illinois, Urbana-Champaign Step 1: Matrix Multiplication A Simple Host Version in C // Matrix multiplication on the (CPU) host in double precision void MatrixMulOnHost(float* M, float* N, float* P, int Width)N { k for (int i = 0; i < Width; ++i) H j T for (int j = 0; j < Width; ++j) { D I double sum = 0; W for (int k = 0; k < Width; ++k) { double a = M[i * width + k]; double b = N[k * width + j]; sum += a * b; M P } P[i * Width + j] = sum; i H } T D I } W k WIDTH WIDTH © David Kirk/NVIDIA and Wen-mei W.
    [Show full text]
  • Thread Scheduling in Multi-Core Operating Systems Redha Gouicem
    Thread Scheduling in Multi-core Operating Systems Redha Gouicem To cite this version: Redha Gouicem. Thread Scheduling in Multi-core Operating Systems. Computer Science [cs]. Sor- bonne Université, 2020. English. tel-02977242 HAL Id: tel-02977242 https://hal.archives-ouvertes.fr/tel-02977242 Submitted on 24 Oct 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Ph.D thesis in Computer Science Thread Scheduling in Multi-core Operating Systems How to Understand, Improve and Fix your Scheduler Redha GOUICEM Sorbonne Université Laboratoire d’Informatique de Paris 6 Inria Whisper Team PH.D.DEFENSE: 23 October 2020, Paris, France JURYMEMBERS: Mr. Pascal Felber, Full Professor, Université de Neuchâtel Reviewer Mr. Vivien Quéma, Full Professor, Grenoble INP (ENSIMAG) Reviewer Mr. Rachid Guerraoui, Full Professor, École Polytechnique Fédérale de Lausanne Examiner Ms. Karine Heydemann, Associate Professor, Sorbonne Université Examiner Mr. Etienne Rivière, Full Professor, University of Louvain Examiner Mr. Gilles Muller, Senior Research Scientist, Inria Advisor Mr. Julien Sopena, Associate Professor, Sorbonne Université Advisor ABSTRACT In this thesis, we address the problem of schedulers for multi-core architectures from several perspectives: design (simplicity and correct- ness), performance improvement and the development of application- specific schedulers.
    [Show full text]
  • Concurrent Computing
    Concurrent Computing CSCI 201 Principles of Software Development Jeffrey Miller, Ph.D. [email protected] Outline • Threads • Multi-Threaded Code • CPU Scheduling • Program USC CSCI 201L Thread Overview ▪ Looking at the Windows taskbar, you will notice more than one process running concurrently ▪ Looking at the Windows task manager, you’ll see a number of additional processes running ▪ But how many processes are actually running at any one time? • Multiple processors, multiple cores, and hyper-threading will determine the actual number of processes running • Threads USC CSCI 201L 3/22 Redundant Hardware Multiple Processors Single-Core CPU ALU Cache Registers Multiple Cores ALU 1 Registers 1 Cache Hyper-threading has firmware logic that appears to the OS ALU 2 to contain more than one core when in fact only one physical core exists Registers 2 4/22 Programs vs Processes vs Threads ▪ Programs › An executable file residing in memory, typically in secondary storage ▪ Processes › An executing instance of a program › Sometimes this is referred to as a Task ▪ Threads › Any section of code within a process that can execute at what appears to be simultaneously › Shares the same resources allotted to the process by the OS kernel USC CSCI 201L 5/22 Concurrent Programming ▪ Multi-Threaded Programming › Multiple sections of a process or multiple processes executing at what appears to be simultaneously in a single core › The purpose of multi-threaded programming is to add functionality to your program ▪ Parallel Programming › Multiple sections of
    [Show full text]
  • Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
    Computer Architecture A Quantitative Approach, Fifth Edition Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Copyright © 2012, Elsevier Inc. All rights reserved. 1 Contents 1. SIMD architecture 2. Vector architectures optimizations: Multiple Lanes, Vector Length Registers, Vector Mask Registers, Memory Banks, Stride, Scatter-Gather, 3. Programming Vector Architectures 4. SIMD extensions for media apps 5. GPUs – Graphical Processing Units 6. Fermi architecture innovations 7. Examples of loop-level parallelism 8. Fallacies Copyright © 2012, Elsevier Inc. All rights reserved. 2 Classes of Computers Classes Flynn’s Taxonomy SISD - Single instruction stream, single data stream SIMD - Single instruction stream, multiple data streams New: SIMT – Single Instruction Multiple Threads (for GPUs) MISD - Multiple instruction streams, single data stream No commercial implementation MIMD - Multiple instruction streams, multiple data streams Tightly-coupled MIMD Loosely-coupled MIMD Copyright © 2012, Elsevier Inc. All rights reserved. 3 Introduction Advantages of SIMD architectures 1. Can exploit significant data-level parallelism for: 1. matrix-oriented scientific computing 2. media-oriented image and sound processors 2. More energy efficient than MIMD 1. Only needs to fetch one instruction per multiple data operations, rather than one instr. per data op. 2. Makes SIMD attractive for personal mobile devices 3. Allows programmers to continue thinking sequentially SIMD/MIMD comparison. Potential speedup for SIMD twice that from MIMID! x86 processors expect two additional cores per chip per year SIMD width to double every four years Copyright © 2012, Elsevier Inc. All rights reserved. 4 Introduction SIMD parallelism SIMD architectures A. Vector architectures B. SIMD extensions for mobile systems and multimedia applications C.
    [Show full text]
  • Opencl Introduction
    OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined in four parts Platform Model Execution Model Memory Model Programming Model (We’re going to diverge from this structure a bit) Host-Device Model (Platform Model) The model consists of a host connected to one or more OpenCL devices A device is divided into one or more compute units Compute units are divided into one or more processing elements Host-Device Model The host is whatever the OpenCL library runs on Usually x86 CPUs Devices are processors that the library can talk to CPUs, GPUs, and other accelerators For AMD All CPUs are 1 device (each core is a compute unit and processing element) Each GPU is a separate device Platforms Platform == OpenCL implementation (AMD, NVIDIA, Intel) Uses an ―Installable Client Driver‖ model Generic OpenCL library runs and detects platforms The goal is to allow multiple implementations that co-exist However, current GPU driver model does not allow that Discovering Platforms This function is usually called twice The first call is used to get the number of platforms available to the implementation Space is then allocated for the platform objects The second call is used to retrieve the platform objects * Step 1 Discovering Platforms Discovering Devices Once a platform is selected, we can then query for the devices that it knows how to interact with We can specify which types
    [Show full text]
  • An Introduction to CUDA/Opencl and Graphics Processors
    An Introduction to CUDA/OpenCL and Graphics Processors Bryan Catanzaro, NVIDIA Research Overview ¡ Terminology ¡ The CUDA and OpenCL programming models ¡ Understanding how CUDA maps to NVIDIA GPUs ¡ Thrust 2/74 Heterogeneous Parallel Computing Latency Throughput Optimized CPU Optimized GPU Fast Serial Scalable Parallel Processing Processing 3/74 Latency vs. Throughput Latency Throughput ¡ Latency: yoke of oxen § Each core optimized for executing a single thread ¡ Throughput: flock of chickens § Cores optimized for aggregate throughput, deemphasizing individual performance ¡ (apologies to Seymour Cray) 4/74 Latency vs. Throughput, cont. Specificaons Sandy Bridge- Kepler EP (Tesla K20) 8 cores, 2 issue, 14 SMs, 6 issue, 32 Processing Elements 8 way SIMD way SIMD @3.1 GHz @730 MHz 8 cores, 2 threads, 8 14 SMs, 64 SIMD Resident Strands/ way SIMD: vectors, 32 way Threads (max) SIMD: Sandy Bridge-EP (32nm) 96 strands 28672 threads SP GFLOP/s 396 3924 Memory Bandwidth 51 GB/s 250 GB/s Register File 128 kB (?) 3.5 MB Local Store/L1 Cache 256 kB 896 kB L2 Cache 2 MB 1.5 MB L3 Cache 20 MB - Kepler (28nm) 5/74 Why Heterogeneity? ¡ Different goals produce different designs § Manycore assumes work load is highly parallel § Multicore must be good at everything, parallel or not ¡ Multicore: minimize latency experienced by 1 thread § lots of big on-chip caches § extremely sophisticated control ¡ Manycore: maximize throughput of all threads § lots of big ALUs § multithreading can hide latency … so skip the big caches § simpler control, cost amortized over
    [Show full text]