Parallel Computing Concepts


But … why are we interested in Parallel Computing? Because we do Supercomputing? But … what is Supercomputing?
John M Levesque
CTO Office – Applications Director
Cray's Supercomputing Center of Excellence

Seymour Cray, September 28, 1925 – October 5, 1996

First – a supercomputer is:
● A computer whose peak performance is capable of solving the largest, most complex science/engineering problems
● Peak performance is measured by
  ● the total number of processors in the system multiplied by the peak of each processor
● The peak of each processor is measured in floating-point operations per second (FLOPS)
  ● A scalar processor can usually generate 2–4 64-bit results (multiply/add) each clock cycle
  ● The Intel Skylake and Intel Knights Landing have vector units that can generate 16–32 64-bit results each clock cycle
  ● Skylake's clock rate varies; the fastest is 3–4 GHz, for 64–128 GFLOPS per core
  ● Big benefit from vectorizing applications
● Summit has 4,356 nodes, each equipped with two 22-core Power9 CPUs and six NVIDIA Tesla V100 GPUs, for a whopping 187.6 petaflops – the fastest computer in the world

However, the supercomputer is just one piece of Supercomputing
● We need three things to be successful at Supercomputing:
  ● The supercomputer – Summit is a great example
  ● A large scientific problem to solve – we have many of those
  ● Programming expertise to port/optimize existing applications and/or develop new applications that effectively utilize the target supercomputing architecture
    ● This is an area where we are lacking – too many application developers are not interested in how to write optimized applications for the target system
    ● When you achieve less than one percent of peak performance on a 200-petaflop computer, you are achieving less than 2 petaflops
    ● There are a few sites that are exceptions to this.

Ways to make a computer faster
● Lower the clock cycle – increase the clock rate
  ● As the clock cycle becomes shorter, the computer can perform more computations each second. Remember, we are determining peak performance by counting the number of floating-point operations per second (a worked example of this arithmetic follows below).
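To make the peak-performance arithmetic above concrete, here is a minimal sketch in C that multiplies clock rate, results per cycle, cores per node, and node count. The numbers are illustrative assumptions in the spirit of the Skylake and Summit figures quoted above, not exact vendor specifications.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not exact vendor specifications. */
    double clock_ghz       = 3.0;    /* clock rate in GHz                     */
    double results_per_clk = 32.0;   /* 64-bit results per cycle (vector FMA) */
    double cores_per_node  = 44.0;   /* e.g. two 22-core CPUs per node        */
    double nodes           = 4356.0;

    double core_gflops = clock_ghz * results_per_clk;   /* per core   */
    double node_gflops = core_gflops * cores_per_node;  /* per node   */
    double peak_pflops = node_gflops * nodes / 1.0e6;   /* whole system */

    printf("Per-core peak: %.1f GFLOPS\n", core_gflops);
    printf("Per-node peak: %.1f GFLOPS\n", node_gflops);
    printf("System peak:   %.2f PFLOPS (CPUs only in this sketch)\n", peak_pflops);
    return 0;
}
```

In practice Summit's quoted 187.6 petaflops comes overwhelmingly from the six V100 GPUs per node, which this CPU-only sketch deliberately leaves out.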
For the last ten years the clock cycle has stayed the same or increased
[Chart: Supercomputer CPU Clock Cycle Times – clock cycle time in seconds (log scale, 1e-10 to 1e-7) versus year, 1960–2020, annotated "Developers became lazy during this period" and "Developers became worried during this period".]

Ways to make a computer faster
● Lower the clock cycle – increase the clock rate
● Generate more than one result each clock
  ● Single Instruction, Multiple Data – SIMD
    ● Vector instructions utilizing multiple segmented functional units
    ● Parallel instructions
  ● Multiple Instructions, Multiple Data – MIMD
  ● Data flow
● Utilize more than one processor
  ● Shared memory parallelism
    ● All processors share memory
  ● Distributed memory parallelism
    ● All nodes have separate memories, but a node can have several processors sharing memory within the node

Vector Unit is an Assembly Line

Cray 1 Vector Unit
Time to compute = Startup + Number of Operands
[Slide sequence, clock cycles 1 through 7: each cycle a new operand pair A(i) * B(i) enters the pipeline, so by cycle 7 operands A(1)–A(7) and B(1)–B(7) occupy successive stages and the first result is about to emerge.]

SIMD Parallel Unit
Result Rate = Number of Operands / Number of Processing Elements

What are today's Vector processors?
● They are not like the traditional Cray vector processors
● They are actually like the Illiac.
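The two cost models quoted on the slides above can be made concrete with a small sketch: the pipelined (assembly-line) unit costs roughly startup + number-of-operands clock cycles, while the SIMD array retires one result per processing element per lockstep instruction. The startup depth and PE count below are assumed values for illustration.

```c
#include <stdio.h>

int main(void)
{
    long n = 1024;        /* number of operand pairs (vector length) */

    /* Pipelined vector unit (Cray-1 style): after a fixed startup,
       one result completes every clock cycle.                       */
    long startup = 8;                          /* assumed pipeline depth */
    long pipe_cycles = startup + n;

    /* SIMD parallel unit (Illiac-style): every processing element
       executes the same instruction, so the array is swept in
       ceil(n / PEs) lockstep steps.                                  */
    long pes = 64;                             /* assumed processing elements */
    long simd_steps = (n + pes - 1) / pes;

    printf("Pipelined vector unit: %ld cycles for %ld results (%.2f results/cycle)\n",
           pipe_cycles, n, (double)n / (double)pipe_cycles);
    printf("SIMD array unit:       %ld lockstep steps of %ld results each\n",
           simd_steps, pes);
    return 0;
}
```

With a long enough vector the pipeline approaches one result per clock, which is why the startup term mattered so much for short vectors on the Cray 1.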
Outline
● What was a Supercomputer in the 1960s, 70s, 80s, 90s and the 2000s?
  ● Single Instruction, Multiple Data (SIMD)
    ● Vector
    ● Parallel
  ● Multiple Instructions, Multiple Data (MIMD)
    ● Distributed memory systems
      ● Beowulf, Paragon, nCUBE, Cray T3D/T3E, IBM SP
    ● Shared memory systems
      ● SGI Origin
● What is a Supercomputer today?
  ● Déjà vu – a MIMD collection of distributed nodes, each a MIMD collection of shared memory processors with SIMD instructions
● So it is important to understand the history of Supercomputing

Who is the historian – John Levesque?
● 1964 – 1965: Delivered Meads Fine Bread (student at the University of New Mexico, 1962–72)
● 3/1966 – 1969: Sandia Laboratories – Underground Physics Department (CDC 3600)
● 1969 – 1972: Air Force Weapons Laboratory (CDC 6600)
● 1972 – 1978: R & D Associates (CDC 7600, Illiac IV, Cray 1, Cyber 203/205)
● 1978 – 1979: Massachusetts Computer Associates (Illiac IV, Cray 1) – Parallizer
● 1979 – 1991: Pacific Sierra Research (Intel Paragon, nCUBE, Univac APS) – VAST
● 1991 – 1998: Applied Parallel Research (MPPs of all sorts, CM5, NEC SX) – FORGE
● 1998 – 2001: IBM Research – Director, Advanced Computing Technology Center
● 1/2001 – 9/2001: Times N Systems – Director of Software Development
● 9/2001 – Present (9/2021): Cray Inc – Director, Cray's Supercomputing Center of Excellence

1960s

First System I drove

Then got a job at Sandia Laboratories in Albuquerque

1970s

CDC 6600 – Multiple Functional Units
Could generate an ADD and a MULTIPLY at the same time

Computer input devices

Why was the Fortran line length 72?

File organization

Computer Listing

Computational Steering in the 70s
Set a sense switch, dump the velocity vector field to tape, take the tape to the Calcomp plotter.

Then in the early 70s I worked on the Illiac IV
Anyone know why the door is open?
Programming the 64 SIMD units is exactly the same as programming the 32 SIMT threads in an NVIDIA warp (a sketch of this appears below).

Actually accessed the Illiac IV remotely
Contrary to popular belief, Al Gore did not invent the Internet – Larry Roberts of ARPA did.
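As a rough sketch of the claim two slides back – that programming the Illiac IV's 64 lockstep PEs is much like programming the 32 SIMT threads of an NVIDIA warp – the loop below shows how a branch turns into a per-lane mask when every lane must execute the same instruction. The arrays and lane count are hypothetical and chosen only for illustration.

```c
#include <stdio.h>

#define LANES 64   /* Illiac IV had 64 PEs; a modern warp has 32 threads */

int main(void)
{
    float a[LANES], b[LANES], c[LANES];

    for (int i = 0; i < LANES; i++) { a[i] = (float)i; b[i] = 1.0f; }

    /* Every lane evaluates both sides of the branch; a per-lane mask
       selects which result is kept. This is how both SIMD arrays and
       SIMT warps handle divergence.                                   */
    for (int i = 0; i < LANES; i++) {
        int mask        = (a[i] > 32.0f);       /* the "if" condition */
        float then_val  = a[i] * b[i];          /* taken path         */
        float else_val  = a[i] + b[i];          /* not-taken path     */
        c[i] = mask ? then_val : else_val;      /* per-lane select    */
    }

    printf("c[0]=%g c[%d]=%g\n", c[0], LANES - 1, c[LANES - 1]);
    return 0;
}
```

When lanes disagree on the branch, both paths are evaluated and masked away, which is exactly the divergence cost GPU programmers still pay today.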
Then started working with Seymour Cray's first masterpiece – the CDC 7600
This system had a large slow memory (256K words) and a small fast memory (65K words).
A similar memory architecture is being used today in Intel's Knights Landing and Summit's nodes.

Seymour's innovations in the CDC 7600
● Segmented functional units
  ● The first hint of vector hardware
● Multiple functional units capable of parallel execution
● Small main memory with larger extended core memory

Any idea what this is? My first laptop

Cyber 203/205
From a paper by Neil Lincoln, CDC:
● Memory to Memory Vector Processor – Star 100
● Scalar Processor
● SIMD Parallel Processor – the Illiac of the time

First Real Vector Machines
● Memory to Memory Vector Processors
  ● Texas Instruments ASC
  ● Star 100
  ● Cyber 203/205
● Register to Register Vector