Parallel, Distributed and GPU Computing Technologies in Single
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
A Case for High Performance Computing with Virtual Machines
A Case for High Performance Computing with Virtual Machines Wei Huangy Jiuxing Liuz Bulent Abaliz Dhabaleswar K. Panday y Computer Science and Engineering z IBM T. J. Watson Research Center The Ohio State University 19 Skyline Drive Columbus, OH 43210 Hawthorne, NY 10532 fhuanwei, [email protected] fjl, [email protected] ABSTRACT in the 1960s [9], but are experiencing a resurgence in both Virtual machine (VM) technologies are experiencing a resur- industry and research communities. A VM environment pro- gence in both industry and research communities. VMs of- vides virtualized hardware interfaces to VMs through a Vir- fer many desirable features such as security, ease of man- tual Machine Monitor (VMM) (also called hypervisor). VM agement, OS customization, performance isolation, check- technologies allow running different guest VMs in a phys- pointing, and migration, which can be very beneficial to ical box, with each guest VM possibly running a different the performance and the manageability of high performance guest operating system. They can also provide secure and computing (HPC) applications. However, very few HPC ap- portable environments to meet the demanding requirements plications are currently running in a virtualized environment of computing resources in modern computing systems. due to the performance overhead of virtualization. Further, Recently, network interconnects such as InfiniBand [16], using VMs for HPC also introduces additional challenges Myrinet [24] and Quadrics [31] are emerging, which provide such as management and distribution of OS images. very low latency (less than 5 µs) and very high bandwidth In this paper we present a case for HPC with virtual ma- (multiple Gbps). -
Computer Architectures
Computer Architectures Central Processing Unit (CPU) Pavel Píša, Michal Štepanovský, Miroslav Šnorek The lecture is based on A0B36APO lecture. Some parts are inspired by the book Paterson, D., Henessy, V.: Computer Organization and Design, The HW/SW Interface. Elsevier, ISBN: 978-0-12-370606-5 and it is used with authors' permission. Czech Technical University in Prague, Faculty of Electrical Engineering English version partially supported by: European Social Fund Prague & EU: We invests in your future. AE0B36APO Computer Architectures Ver.1.10 1 Computer based on von Neumann's concept ● Control unit Processor/microprocessor ● ALU von Neumann architecture uses common ● Memory memory, whereas Harvard architecture uses separate program and data memories ● Input ● Output Input/output subsystem The control unit is responsible for control of the operation processing and sequencing. It consists of: ● registers – they hold intermediate and programmer visible state ● control logic circuits which represents core of the control unit (CU) AE0B36APO Computer Architectures 2 The most important registers of the control unit ● PC (Program Counter) holds address of a recent or next instruction to be processed ● IR (Instruction Register) holds the machine instruction read from memory ● Another usually present registers ● General purpose registers (GPRs) may be divided to address and data or (partially) specialized registers ● SP (Stack Pointer) – points to the top of the stack; (The stack is usually used to store local variables and subroutine return addresses) ● PSW (Program Status Word) ● IM (Interrupt Mask) ● Optional Floating point (FPRs) and vector/multimedia regs. AE0B36APO Computer Architectures 3 The main instruction cycle of the CPU 1. Initial setup/reset – set initial PC value, PSW, etc. -
Microprocessor Architecture
EECE416 Microcomputer Fundamentals Microprocessor Architecture Dr. Charles Kim Howard University 1 Computer Architecture Computer System CPU (with PC, Register, SR) + Memory 2 Computer Architecture •ALU (Arithmetic Logic Unit) •Binary Full Adder 3 Microprocessor Bus 4 Architecture by CPU+MEM organization Princeton (or von Neumann) Architecture MEM contains both Instruction and Data Harvard Architecture Data MEM and Instruction MEM Higher Performance Better for DSP Higher MEM Bandwidth 5 Princeton Architecture 1.Step (A): The address for the instruction to be next executed is applied (Step (B): The controller "decodes" the instruction 3.Step (C): Following completion of the instruction, the controller provides the address, to the memory unit, at which the data result generated by the operation will be stored. 6 Harvard Architecture 7 Internal Memory (“register”) External memory access is Very slow For quicker retrieval and storage Internal registers 8 Architecture by Instructions and their Executions CISC (Complex Instruction Set Computer) Variety of instructions for complex tasks Instructions of varying length RISC (Reduced Instruction Set Computer) Fewer and simpler instructions High performance microprocessors Pipelined instruction execution (several instructions are executed in parallel) 9 CISC Architecture of prior to mid-1980’s IBM390, Motorola 680x0, Intel80x86 Basic Fetch-Execute sequence to support a large number of complex instructions Complex decoding procedures Complex control unit One instruction achieves a complex task 10 -
What Do We Mean by Architecture?
Embedded programming: Comparing the performance and development workflows for architectures Embedded programming week FABLAB BRIGHTON 2018 What do we mean by architecture? The architecture of microprocessors and microcontrollers are classified based on the way memory is allocated (memory architecture). There are two main ways of doing this: Von Neumann architecture (also known as Princeton) Von Neumann uses a single unified cache (i.e. the same memory) for both the code (instructions) and the data itself, Under pure von Neumann architecture the CPU can be either reading an instruction or reading/writing data from/to the memory. Both cannot occur at the same time since the instructions and data use the same bus system. Harvard architecture Harvard architecture uses different memory allocations for the code (instructions) and the data, allowing it to be able to read instructions and perform data memory access simultaneously. The best performance is achieved when both instructions and data are supplied by their own caches, with no need to access external memory at all. How does this relate to microcontrollers/microprocessors? We found this page to be a good introduction to the topic of microcontrollers and microprocessors, the architectures they use and the difference between some of the common types. First though, it’s worth looking at the difference between a microprocessor and a microcontroller. Microprocessors (e.g. ARM) generally consist of just the Central Processing Unit (CPU), which performs all the instructions in a computer program, including arithmetic, logic, control and input/output operations. Microcontrollers (e.g. AVR, PIC or 8051) contain one or more CPUs with RAM, ROM and programmable input/output peripherals. -
Parallel Rendering on Hybrid Multi-GPU Clusters
Eurographics Symposium on Parallel Graphics and Visualization (2012) H. Childs and T. Kuhlen (Editors) Parallel Rendering on Hybrid Multi-GPU Clusters S. Eilemann†1, A. Bilgili1, M. Abdellah1, J. Hernando2, M. Makhinya3, R. Pajarola3, F. Schürmann1 1Blue Brain Project, EPFL; 2CeSViMa, UPM; 3Visualization and MultiMedia Lab, University of Zürich Abstract Achieving efficient scalable parallel rendering for interactive visualization applications on medium-sized graphics clusters remains a challenging problem. Framerates of up to 60hz require a carefully designed and fine-tuned parallel rendering implementation that fits all required operations into the 16ms time budget available for each rendered frame. Furthermore, modern commodity hardware embraces more and more a NUMA architecture, where multiple processor sockets each have their locally attached memory and where auxiliary devices such as GPUs and network interfaces are directly attached to one of the processors. Such so called fat NUMA processing and graphics nodes are increasingly used to build cost-effective hybrid shared/distributed memory visualization clusters. In this paper we present a thorough analysis of the asynchronous parallelization of the rendering stages and we derive and implement important optimizations to achieve highly interactive framerates on such hybrid multi-GPU clusters. We use both a benchmark program and a real-world scientific application used to visualize, navigate and interact with simulations of cortical neuron circuit models. Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems— Distributed Graphics; I.3.m [Computer Graphics]: Miscellaneous—Parallel Rendering 1. Introduction Hybrid GPU clusters are often build from so called fat render nodes using a NUMA architecture with multiple The decomposition of parallel rendering systems across mul- CPUs and GPUs per machine. -
The Von Neumann Computer Model 5/30/17, 10:03 PM
The von Neumann Computer Model 5/30/17, 10:03 PM CIS-77 Home http://www.c-jump.com/CIS77/CIS77syllabus.htm The von Neumann Computer Model 1. The von Neumann Computer Model 2. Components of the Von Neumann Model 3. Communication Between Memory and Processing Unit 4. CPU data-path 5. Memory Operations 6. Understanding the MAR and the MDR 7. Understanding the MAR and the MDR, Cont. 8. ALU, the Processing Unit 9. ALU and the Word Length 10. Control Unit 11. Control Unit, Cont. 12. Input/Output 13. Input/Output Ports 14. Input/Output Address Space 15. Console Input/Output in Protected Memory Mode 16. Instruction Processing 17. Instruction Components 18. Why Learn Intel x86 ISA ? 19. Design of the x86 CPU Instruction Set 20. CPU Instruction Set 21. History of IBM PC 22. Early x86 Processor Family 23. 8086 and 8088 CPU 24. 80186 CPU 25. 80286 CPU 26. 80386 CPU 27. 80386 CPU, Cont. 28. 80486 CPU 29. Pentium (Intel 80586) 30. Pentium Pro 31. Pentium II 32. Itanium processor 1. The von Neumann Computer Model Von Neumann computer systems contain three main building blocks: The following block diagram shows major relationship between CPU components: the central processing unit (CPU), memory, and input/output devices (I/O). These three components are connected together using the system bus. The most prominent items within the CPU are the registers: they can be manipulated directly by a computer program. http://www.c-jump.com/CIS77/CPU/VonNeumann/lecture.html Page 1 of 15 IPR2017-01532 FanDuel, et al. -
Parallel Texture-Based Vector Field Visualization on Curved Surfaces Using GPU Cluster Computers
Eurographics Symposium on Parallel Graphics and Visualization (2006) Alan Heirich, Bruno Raffin, and Luis Paulo dos Santos (Editors) Parallel Texture-Based Vector Field Visualization on Curved Surfaces Using GPU Cluster Computers S. Bachthaler1, M. Strengert1, D. Weiskopf2, and T. Ertl1 1Institute of Visualization and Interactive Systems, University of Stuttgart, Germany 2School of Computing Science, Simon Fraser University, Canada Abstract We adopt a technique for texture-based visualization of flow fields on curved surfaces for parallel computation on a GPU cluster. The underlying LIC method relies on image-space calculations and allows the user to visualize a full 3D vector field on arbitrary and changing hypersurfaces. By using parallelization, both the visualization speed and the maximum data set size are scaled with the number of cluster nodes. A sort-first strategy with image-space decomposition is employed to distribute the workload for the LIC computation, while a sort-last approach with an object-space partitioning of the vector field is used to increase the total amount of available GPU memory. We specifically address issues for parallel GPU-based vector field visualization, such as reduced locality of memory accesses caused by particle tracing, dynamic load balancing for changing camera parameters, and the combination of image-space and object-space decomposition in a hybrid approach. Performance measurements document the behavior of our implementation on a GPU cluster with AMD Opteron CPUs, NVIDIA GeForce 6800 Ultra GPUs, and Infiniband network connection. Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Viewing algorithms I.3.3 [Three-Dimensional Graphics and Realism]: Color, shading, shadowing, and texture 1. -
Hardware Architecture
Hardware Architecture Components Computing Infrastructure Components Servers Clients LAN & WLAN Internet Connectivity Computation Software Storage Backup Integration is the Key ! Security Data Network Management Computer Today’s Computer Computer Model: Von Neumann Architecture Computer Model Input: keyboard, mouse, scanner, punch cards Processing: CPU executes the computer program Output: monitor, printer, fax machine Storage: hard drive, optical media, diskettes, magnetic tape Von Neumann architecture - Wiki Article (15 min YouTube Video) Components Computer Components Components Computer Components CPU Memory Hard Disk Mother Board CD/DVD Drives Adaptors Power Supply Display Keyboard Mouse Network Interface I/O ports CPU CPU CPU – Central Processing Unit (Microprocessor) consists of three parts: Control Unit • Execute programs/instructions: the machine language • Move data from one memory location to another • Communicate between other parts of a PC Arithmetic Logic Unit • Arithmetic operations: add, subtract, multiply, divide • Logic operations: and, or, xor • Floating point operations: real number manipulation Registers CPU Processor Architecture See How the CPU Works In One Lesson (20 min YouTube Video) CPU CPU CPU speed is influenced by several factors: Chip Manufacturing Technology: nm (2002: 130 nm, 2004: 90nm, 2006: 65 nm, 2008: 45nm, 2010:32nm, Latest is 22nm) Clock speed: Gigahertz (Typical : 2 – 3 GHz, Maximum 5.5 GHz) Front Side Bus: MHz (Typical: 1333MHz , 1666MHz) Word size : 32-bit or 64-bit word sizes Cache: Level 1 (64 KB per core), Level 2 (256 KB per core) caches on die. Now Level 3 (2 MB to 8 MB shared) cache also on die Instruction set size: X86 (CISC), RISC Microarchitecture: CPU Internal Architecture (Ivy Bridge, Haswell) Single Core/Multi Core Multi Threading Hyper Threading vs. -
Evolution of the Graphical Processing Unit
University of Nevada Reno Evolution of the Graphical Processing Unit A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science with a major in Computer Science by Thomas Scott Crow Dr. Frederick C. Harris, Jr., Advisor December 2004 Dedication To my wife Windee, thank you for all of your patience, intelligence and love. i Acknowledgements I would like to thank my advisor Dr. Harris for his patience and the help he has provided me. The field of Computer Science needs more individuals like Dr. Harris. I would like to thank Dr. Mensing for unknowingly giving me an excellent model of what a Man can be and for his confidence in my work. I am very grateful to Dr. Egbert and Dr. Mensing for agreeing to be committee members and for their valuable time. Thank you jeffs. ii Abstract In this paper we discuss some major contributions to the field of computer graphics that have led to the implementation of the modern graphical processing unit. We also compare the performance of matrix‐matrix multiplication on the GPU to the same computation on the CPU. Although the CPU performs better in this comparison, modern GPUs have a lot of potential since their rate of growth far exceeds that of the CPU. The history of the rate of growth of the GPU shows that the transistor count doubles every 6 months where that of the CPU is only every 18 months. There is currently much research going on regarding general purpose computing on GPUs and although there has been moderate success, there are several issues that keep the commodity GPU from expanding out from pure graphics computing with limited cache bandwidth being one. -
Exploring Massive Parallel Computation with GPU
Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Exploring Massive Parallel Computation with GPU Ian Bond Massey University, Auckland, New Zealand 2011 Sagan Exoplanet Workshop Pasadena, July 25-29 2011 Ian Bond | Microlensing parallelism 1/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Assumptions/Purpose You are all involved in microlensing modelling and you have (or are working on) your own code this lecture shows how to get started on getting code to run on a GPU then its over to you . Ian Bond | Microlensing parallelism 2/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Outline 1 Need for parallelism 2 Graphical Processor Units 3 Gravitational Microlensing Modelling Ian Bond | Microlensing parallelism 3/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Paralel Computing Parallel Computing is use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer (Wilkinson 2002). How does one achieve parallelism? Ian Bond | Microlensing parallelism 4/40 Need for parallelism Graphical Processor Units Gravitational Microlensing Modelling Grand Challenge Problems A grand challenge problem is one that cannot be solved in a reasonable amount of time with todays computers’ Examples: – Modelling large DNA structures – Global weather forecasting – N body problem (N very large) – brain simulation Has microlensing -
Dcuda: Hardware Supported Overlap of Computation and Communication
dCUDA: Hardware Supported Overlap of Computation and Communication Tobias Gysi Jeremia Bar¨ Torsten Hoefler Department of Computer Science Department of Computer Science Department of Computer Science ETH Zurich ETH Zurich ETH Zurich [email protected] [email protected] [email protected] Abstract—Over the last decade, CUDA and the underlying utilization of the costly compute and network hardware. To GPU hardware architecture have continuously gained popularity mitigate this problem, application developers can implement in various high-performance computing application domains manual overlap of computation and communication [23], [27]. such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent In particular, there exist various approaches [13], [22] to programming model for GPU clusters. We therefore introduce overlap the communication with the computation on an inner the dCUDA programming model, which implements device- domain that has no inter-node data dependencies. However, side remote memory access with target notification. To hide these code transformations significantly increase code com- instruction pipeline latencies, CUDA programs over-decompose plexity which results in reduced real-world applicability. the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a High-performance system design often involves trading thread stalls, the hardware scheduler immediately proceeds with off sequential performance against parallel throughput. The the execution of another thread ready for execution. This latency architectural difference between host and device processors hiding technique is key to make best use of the available hardware perfectly showcases the two extremes of this design space. -
On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the Gvirtus Framework
On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework Montella, R., Giunta, G., Laccetti, G., Lapegna, M., Palmieri, C., Ferraro, C., Pelliccia, V., Hong, C-H., Spence, I., & Nikolopoulos, D. (2017). On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework. International Journal of Parallel Programming, 45(5), 1142-1163. https://doi.org/10.1007/s10766-016-0462-1 Published in: International Journal of Parallel Programming Document Version: Peer reviewed version Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal Publisher rights © 2016 Springer Verlag. The final publication is available at Springer via http://dx.doi.org/ 10.1007/s10766-016-0462-1 General rights Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law, please contact [email protected]. Download date:08. Oct. 2021 Noname manuscript No. (will be inserted by the editor) On the virtualization of CUDA based GPU remoting on ARM and X86 machines in the GVirtuS framework Raffaele Montella · Giulio Giunta · Giuliano Laccetti · Marco Lapegna · Carlo Palmieri · Carmine Ferraro · Valentina Pelliccia · Cheol-Ho Hong · Ivor Spence · Dimitrios S.