Current Trends in High Performance Computing
Chokchai Box Leangsuksun, PhD
SWEPCO Endowed Professor*, Computer Science
Director, High Performance Computing Initiative
Louisiana Tech University
[email protected]
*SWEPCO endowed professorship is made possible by LA Board of Regents
Outline
• What is HPC?
• Current Trends
• More on PS3 and GPU computing
• Conclusion
12 December 2011 2
Mainstream CPUs
• CPU speed plateaus at 3-4 GHz
• More cores in a single chip
  – Dual/quad core is here now
  – Manycore (GPGPU)
• Traditional applications won’t get a free ride
• Conversion to parallel computing (HPC, multithreading)
This diagram is from the “no free lunch” article in DDJ
New trends in computing
• Old & current
  – SMP, cluster
• Multicore computers
  – Intel Core 2 Duo
  – AMD 2x 64
• Many-core accelerators
  – GPGPU, FPGA, Cell
• More brains in one computer
• The goal is not to increase CPU frequency
• Harness many computers
  – Cluster computing
What is HPC?
• High Performance Computing
  – Parallel computing, supercomputing
  – Achieve the fastest possible computing outcome
  – Subdivide a very large job into many pieces
  – Enabled by multiple high-speed CPUs, networking, software & programming paradigms, for the fastest possible solution
  – Technologies that help solve non-trivial tasks in science, engineering, medicine, business, entertainment, etc.
• Time to insight, time to discovery, time to market
Parallel Programming Concepts
Conventional serial execution: the problem is represented as a series of instructions that are executed by the CPU.
Parallel execution of a problem involves partitioning the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.
[Diagram: serial, Problem → instructions → CPU; parallel, Problem → Tasks → instructions → CPUs]
Parallel computing takes advantage of concurrency to:
• Solve larger problems in less time
• Save on wall-clock time
• Overcome memory constraints
• Utilize non-local resources
Source: Thomas Sterling’s intro to HPC
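The partitioning idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `partition` and `parallel_sum` are hypothetical (not from the original slides), the "problem" is simply summing a list, and a thread pool stands in for the multiple CPUs in the diagram.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_tasks):
    """Split data into n_tasks contiguous chunks that do not overlap
    (mutually exclusive) and together cover every element
    (collectively exhaustive). Hypothetical helper for illustration."""
    base, extra = divmod(len(data), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        # The first `extra` chunks take one additional element.
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def parallel_sum(data, n_tasks=4):
    """Sum each chunk as an independent task, then combine the partial sums."""
    chunks = partition(data, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    numbers = list(range(1, 101))
    print(parallel_sum(numbers))  # same result as the serial sum(numbers)
```

Note that in standard CPython, threads do not speed up CPU-bound work like this; the sketch shows the structure (partition, run tasks, combine), not a performance gain.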
HPC Applications and Major Industries
• Finite Element Modeling
  – Auto/Aero
• Fluid Dynamics
  – Auto/Aero, Consumer Packaged Goods Mfgs, Process Mfg, Disaster Preparedness (tsunami)
• Imaging
  – Seismic & Medical
• Finance & Business
  – Banks, Brokerage Houses (Regression Analysis, Risk, Options Pricing, What if, …)
  – Wal-mart’s HPC in their operations
• Molecular Modeling
  – Biotech and Pharmaceuticals
Complex Problems, Large Datasets, Long Runs
This slide is from the Intel presentation “Technologies for Delivering Peak Performance on HPC and Grid Applications”
HPC Drives Knowledge Economy
Life Science Problem – an example of Protein Folding
• A molecular dynamics simulation for a protein folding problem takes a computing year (in serial mode)
• Petaflop = a thousand trillion floating point operations per second
• Excerpted from David Klepacki (IBM), “The Future of HPC”
Disaster Preparedness - example
• Project LEAD
  – Severe weather prediction (tornado)
  – OU leads
• HPC & dynamic adaptation to weather forecasts
• Professor Seidel’s LSU CCT
  – Hurricane route prediction
  – Emergency preparedness
  – Accuracy of prediction: 1 mile² = $1 M
HPC accelerates a product
• FE analysis on 1 CPU
  – 1,000,000 elements
  – Numerical processing for 1 element = 0.1 secs
  – One computer will take 100,000 secs ≈ 27.8 hrs
• With, say, 100 CPUs (assuming ideal speedup)
  – ≈ 0.28 hr ≈ 17 mins
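The arithmetic on this slide can be checked with a short Python sketch. It assumes ideal (linear) speedup, that is, the work divides evenly across CPUs with no communication overhead, which real FE codes rarely achieve. The function names are hypothetical.

```python
def serial_seconds(n_elements, secs_per_element):
    """Total time on one CPU: every element is processed one after another."""
    return n_elements * secs_per_element


def ideal_parallel_seconds(total_seconds, n_cpus):
    """Time on n_cpus under the assumption of perfect linear speedup."""
    return total_seconds / n_cpus


if __name__ == "__main__":
    one_cpu = serial_seconds(1_000_000, 0.1)          # seconds on 1 CPU
    hundred = ideal_parallel_seconds(one_cpu, 100)    # seconds on 100 CPUs
    print(f"1 CPU:    {one_cpu:.0f} s = {one_cpu / 3600:.2f} hrs")
    print(f"100 CPUs: {hundred:.0f} s = {hundred / 60:.1f} mins")
```

Under these assumptions the run drops from about 27.8 hours to about 16.7 minutes, in line with the rounded figures on the slide.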
Avian Flu Pandemic Modeled on a Supercomputer
• MIDAS (Models of Infectious Disease Agent Study) program
• The large-scale, stochastic simulation model examines the nationwide spread of a pandemic influenza virus strain
• A simulation starts with 2 passengers infected with avian flu (AF) arriving at LAX
• The simulation rolls out a city-to-city and census-tract-level picture of the spread of infection
• It covers a synthetic population of 281 million people over the course of 180 days
• It is a very large-scale and complex multi-variant simulation
Avian Flu Pandemic (90 days)
Timothy C. Germann, Kai Kadau, Catherine A. Macken (Los Alamos National Laboratory); Ira M. Longini Jr. (Emory University)
Source: www.lanl.gov
Avian Flu Pandemic (II)
• The results show that advance preparation of a modestly effective vaccine in large quantities appears preferable to waiting for a well-matched vaccine that may arrive too late.
• The simulation models a synthetic population that matches U.S. census demographics and worker mobility data by randomly assigning the simulated individuals to households, workplaces, schools, and the like.
• The models serve as virtual laboratories to study how infectious diseases spread and which intervention strategies are more effective.
• Run on the Los Alamos supercomputer known as Pink, a 1,024-node (2,048-processor) LinuxBIOS/BProc cluster with 2 GB/node.
Source: www.lanl.gov
Significant indicators – why HPC now?
• Mainstream computers with multi-cores (Intel or AMD)
  – Over the past 1-2 years, CPU speed has flattened at 3+ GHz
  – More CPUs in one chip: dual-core, multi-core chips
  – Traditional software won’t take advantage of these new processors
  – Personal/desktop supercomputing
• Many real problems are highly computationally intensive
  – NSA uses supercomputing to do data mining
  – DOE: fusion, plasma, energy related (including weaponry)
  – Helps solve problems in many other important areas (nanotech, life science, etc.)
  – Product design, ERM/inventory management
• Giants have recently spoken up for HPC
  – Bush’s State of the Union speech: supercomputing is one of 3 main S&T focus areas
  – Bill Gates’ keynote speech at SC05: MS goes after HPC
• Google search engine: 100,000 nodes
• PlayStation 3 is a personal supercomputing platform
• Hollywood (entertainment) is HPC-bound (Pixar: more than 3000 CPUs to render animation)
HPC preparedness
• Build a workforce that understands the HPC paradigm & its applications
  – HPC/Grid curriculum in IT/CS/CE/ICT
  – Offer HPC-enabling tracks to other disciplines (engineering, life science, physics, computational chemistry, business, etc.)
  – Train the business community
  – Raise public awareness
• National and strategic policies
• Improve infrastructure
Pause here
• Switch to a tour of machine rooms
  – Clusters and our lab, to show what they will be using
• Get students’ info on signup sheet for accounts on our clusters (azul, quadcore, GPU and PS3)
• Intro to Linux
• Then continue on HPC 101
HPC 101
How to Run Applications Faster?
• There are 3 ways to improve performance:
  – Work harder
  – Work smarter
  – Get more help
• Computer analogy
  – Work harder: use faster hardware
  – Work smarter: use optimized algorithms and techniques to solve computational tasks
  – Get more help: use multiple computers to solve a particular task
Parallel Programming Concepts
[Diagram: Problem → Tasks → instructions → CPUs]
Source: Thomas Sterling’s intro to HPC
HPC objective
• High Performance Computing
  – Parallel computing, supercomputing
  – Achieve the fastest possible computing outcome
  – Subdivide a very large job into many pieces
  – Enabled by multiple high-speed CPUs, networking, software & programming paradigms, for the fastest possible solution
  – Technologies that help solve non-trivial tasks in science, engineering, medicine, business, entertainment, etc.
Flynn’s Taxonomy of Computer Architectures
• SISD - Single Instruction/Single Data
• SIMD - Single Instruction/Multiple Data
• MISD - Multiple Instruction/Single Data
• MIMD - Multiple Instruction/Multiple Data
Single Instruction/Single Data
PU – Processing Unit
Your desktop, before the spread of dual core CPUs
Slide Source: Wikipedia, Flynn’s Taxonomy
Flavors of SISD
Instructions:
More on pipelining…
Single Instruction/Multiple Data
Processors that execute the same instruction on multiple pieces of data, e.g. NVIDIA GPUs
Slide Source: Wikipedia, Flynn’s Taxonomy
Single Instruction/Multiple Data
• Each core runs the same set of instructions on different data
• Example:
  – GPGPU: processes pixels of an image in parallel
Slide Source: Klimovitski & Macri, Intel
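The pixel example above can be mimicked in plain Python to show the SIMD programming pattern: one operation is applied to a whole group of data elements ("lanes") at a time. This is a conceptual sketch only. Python runs the lanes one after another, whereas real SIMD hardware or a GPU would process them simultaneously. The names `LANES`, `vector_brighten`, and `brighten_image` are hypothetical, and the lane width of 4 is an arbitrary assumption.

```python
LANES = 4  # assumed vector width for this illustration


def vector_brighten(lane_values, amount):
    """One 'SIMD instruction': the same add-and-clamp is applied to every
    lane. On real hardware all lanes would be processed at once."""
    return [min(v + amount, 255) for v in lane_values]


def brighten_image(pixels, amount=40):
    """Brighten 8-bit pixels by issuing one vector operation per group
    of LANES pixels, instead of one scalar operation per pixel."""
    out = []
    for i in range(0, len(pixels), LANES):
        out.extend(vector_brighten(pixels[i:i + LANES], amount))
    return out


if __name__ == "__main__":
    print(brighten_image([0, 100, 220, 255, 10]))
```

With 5 pixels and 4 lanes, the loop issues two vector operations (one full, one partial) rather than five scalar ones.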
SISD versus SIMD
Writing a compiler for SIMD architectures is VERY difficult (inter-thread communication complicates the picture…)
Slide Source: ars technica, Peakstream article
Multiple Instruction/Single Data
Pipeline: CMU Warp machine.
Slide Source: Wikipedia, Flynn’s Taxonomy
Multiple Instruction/Multiple Data
E.g. multicore systems are based on a MIMD architecture, with programming paradigms such as OpenMP and multithreading
Slide Source: Wikipedia, Flynn’s Taxonomy
Multiple Instruction/Multiple Data
• The sky is the limit: each PU is free to do as it pleases
• Can be of either shared memory or distributed memory categories
Instructions:
Current HPC Hardware
• Traditionally HPC has adopted expensive parallel hardware:
  – Massively Parallel Processors (MPP)
  – Symmetric Multi-Processors (SMP)
• Cluster Computers
• Recent trends in HPC …
• Multicore systems
• Heterogeneous Computing with Accelerator Boards (GPGPU, FPGA)
HPC cluster
• Login
• Compile
• Submit job
• At least 2 connections
• Run tasks
Parallel Programming Env
• Parallel Programming Environments and Tools
  – Threads (PCs, SMPs, NOW..)
    • POSIX Threads
    • Java Threads
  – MPI
    • Linux, NT, on many supercomputers
  – OpenMP (predominantly on SMP)
  – PVM (old)
  – UPC, Co-array Fortran
  – CUDA, Brook+, OpenCL
  – Software DSMs (Shmem)
  – Compilers
  – RAD (rapid application development tools)
  – Debuggers
  – Performance Analysis Tools
  – Visualization Tools
Recent Trends in HPC Hardware
• Multicore & manycore are here now
• Multiple CPUs in a single die
• Better power consumption
• Tightly coupled and better for multithreading
• GPGPU
• As building blocks for a much larger system
• New Top 500 HPC systems: clusters of multi-core & GPGPU
What are HPC systems
Current top 5 systems
Shared vs Distributed Memory
Shared memory
• Global memory space, accessible by all processors
• Processors may have local memory to hold copies of some global memory
• Consistency of copies is usually maintained by hardware (cache coherency)
Two typical classes of SM
• Uniform Memory Access (UMA):
  – Equal access times
  – Identical processors, typically represented by Symmetric Multiprocessor (SMP) machines or multicores
• Non-Uniform Memory Access (NUMA):
  – Memory access times are not uniform; memory access across a link is slower
  – Often made by physically linking two or more SMPs, or by heterogeneous computing
Advantage & Disadvantage
• Global address space is user-friendly
• Data sharing between tasks is fast
• System may suffer from lack of scalability. Adding CPUs increases traffic on the shared memory-to-CPU path. This is especially true for cache-coherent systems
• Programmer is responsible for correct synchronization
• Systems larger than an SMP need some special-purpose components
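To illustrate the point that the programmer is responsible for correct synchronization, here is a minimal Python sketch. Several threads update one shared counter (the "global address space"), and a lock protects each update. The function name `run_counter` and the thread and iteration counts are assumptions made for this example.

```python
import threading


def run_counter(n_threads=4, increments=10_000):
    """Have n_threads threads each add `increments` to a shared counter.
    The lock makes every read-modify-write step atomic, so no update
    is lost when threads interleave."""
    counter = {"value": 0}   # shared data visible to every thread
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:       # only one thread may update at a time
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()             # wait for all workers to finish
    return counter["value"]


if __name__ == "__main__":
    print(run_counter())     # always n_threads * increments with the lock
```

Without the lock, the increment is not guaranteed to be atomic and updates can be lost; the same concern applies to OpenMP or POSIX threads on shared-memory machines.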
Distributed Memory
Multicores
• Three multicore classifications
  – Homogeneous
  – Heterogeneous
  – Hybrid
Multicores(I)
• Homogeneous Cores (a main CPU)
  – All cores are identical
  – A traditional multicore (MC) with few cores
• Good for a few jumbo tasks
  – Not as many tasks/threads as accelerators or GPUs
  – E.g. Intel Core2Duo, i3, i5, i7, AMD
  – Programming: multithreading/OpenMP
Multicores(II)
• Homogeneous Cores as accelerator or compute device
  – Need a main CPU system
  – As attached processing units
  – All cores are identical and many
  – Good for many SIMD tasks/threads
  – E.g. NVIDIA GPGPU, Clearspeed FPGA
  – Programming: library calls from a main program or a new language extension, e.g. CUDA
Multicores(III)
• Heterogeneous Cores
  – All cores are NOT identical
  – All in one die
  – Programming is more difficult
  – See more in PS3 presentation
Multicores(IV)
• Hybrid System
  – Mix between host cores & accelerator cores
  – A typical host can be anything from a desktop to a server system, e.g. Intel or AMD
  – Accelerator: NVIDIA, ATI Stream or FPGA
  – Programming model is more complex
  – Issues: memory bandwidth between host and devices
Introduction to Cell BE (PS3) Programming
HPCI: High Performance Computing Initiative
PS3 - awesome HPC system
• IBM Cell processor
• Affordable
• But currently not many tools
Cell BE Architecture
Synergistic Processor Element (SPE)
• 128-bit RISC, SIMD processor
• 256 KB local storage memory
• Uses DMA to transfer data between local storage and main memory
PowerPC Processor Element (PPE)
• Main processor, 64 bit
• Also supports Vector/SIMD
• Runs the OS, manages the SPEs
Picture ref: http://gamasutra.com/features/20060721/chow_01.shtml
Cell Programming
• IBM Cell SDK
• Main process runs on the PPE
• Threads run on the SPEs
• PPE-centric programming paradigm
[Diagram: one PPE process with several SPE threads]
GPGPU: General-Purpose Graphics Processing Unit
Two major players
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 500 GFLOPS on compiled parallel C applications
• Tesla T10 1070 from 1-4 TFLOPS
• GPU parallelism is better than Moore’s law, more than doubling every year
• GPGPU is a GPU that allows the user to run both graphics and non-graphics applications
[Pictured: Tesla D870, GeForce 8800]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
NVIDIA GeForce 8800 (G80)
• The eighth generation of NVIDIA’s GeForce graphics cards
• High performance CUDA-enabled GPGPU
• 128 cores
• Memory 256-768 MB or 1.5 GB in Tesla
• High-speed memory bandwidth
• Supports Scalable Link Interface (SLI)
NVIDIA Tesla™
• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multithreaded computing
  – Supercomputing performance
NVIDIA Tesla Card >>
• C-Series (card) = 1 GPU with 1.5 GB
• D-Series (deskside unit) = 2 GPUs
• S-Series (1U server) = 4 GPUs
• Note: 1 G80 GPU = 128 cores = ~500 GFLOPS
• 1 T10 = 240 cores = 1 TFLOPS
<< NVIDIA G80
This slide is from the NVIDIA CUDA tutorial
GPGPU Programming with CUDA
• CUDA (Compute Unified Device Architecture) is an SDK and API that allows a programmer to write C and Fortran programs that execute on a GPGPU.
• Works with NVIDIA G80 or later and Tesla
• The GPGPU is viewed as a compute device
ATI Stream (1)
ATI 4870
ATI 4870 X2
Architecture of ATI Radeon 4000 series
This slide is from ATI presentation
Introduction to OpenCL
Toward a new approach in computing
Moayad Almohaishi
Introduction to OpenCL
• OpenCL stands for Open Computing Language.
• It comes from a consortium effort by companies such as Apple, NVIDIA, AMD, etc.
• It is managed by the Khronos Group, which is also responsible for OpenGL.
• It took 6 months to come up with the specification.
OpenCL
• 1. Royalty-free.
• 2. Supports both task-parallel and data-parallel programming modes.
• 3. Works on GPGPUs in a vendor-agnostic way.
• 4. Works on multicore CPUs.
• 5. Works on Cell processors.
• 6. Supports handhelds and mobile devices.
• 7. Based on the C language (C99).
OpenCL
• Can query the available devices and build a context from them.
• Programmers would be able to program more freely for any kind of device.
• Applications are more reusable even if the hardware changes in the future.
OpenCL Platform Model
CPUs+GPU platforms
Performance of GPGPU
Note: a 30-node cluster of dual Xeon 2.8 GHz machines has a peak performance of ~336 GFLOPS
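The peak figure in the note can be reproduced with a simple back-of-the-envelope calculation. The slide does not state how the number was derived; the sketch below assumes single-core CPUs that each complete 2 floating-point operations per clock cycle, which is the assumption that yields ~336 GFLOPS. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(nodes, cpus_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak = nodes x CPUs per node x clock (GHz) x FLOPs per cycle.
    flops_per_cycle is an assumed hardware parameter, not given on the slide."""
    return nodes * cpus_per_node * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    # 30 nodes, dual Xeon (2 CPUs per node), 2.8 GHz, assumed 2 FLOPs/cycle
    cluster = peak_gflops(nodes=30, cpus_per_node=2, clock_ghz=2.8, flops_per_cycle=2)
    print(f"Cluster peak: about {cluster:.0f} GFLOPS")
```

Peak numbers like this are upper bounds; sustained application performance is typically well below them.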
Last words!
• An HPC or supercomputing system is not necessarily a gigantic machine in a big machine room; it is accessible to Thais and may now be sitting next to your desk
• Computing is a necessity, and fast computing provides a competitive edge, especially in the knowledge economy
• New trends in HPC include GPGPU and various multicore architectures
• Prepare ourselves and strengthen our S&T, industry, and business community for this phenomenon (HPC goes mainstream) before it is too late
Back up slides
Cancer Gene-mining
• Unsuccessful on a uni-processor
• Our approach
  – Novel parallel gene-mining algorithms
  – Input from microarray
  – Retain accuracy
  – Significant speedup (superlinear)
• IBM P5 supercomputer (128 node PPC)
[Figures: “Time to run the algorithm, keeping number of nodes fixed” (time taken in secs vs. number of processors: 13, 39, 65, 91); comparison of OvaMarker based Selection and GeneSetMine based Selection across cancer types (Bladder, Mesothelioma, Breast, Renal, Leukemia, Prostate, Lung, Pancreas, Colorectal, Ovary, Lymphoma, Melanoma)]
Drug Delivery
• By Wu & Palmer, Louisiana Tech U
• Assisted by HPCI
• A study of microcapsules for drug delivery
• Computational Fluid Dynamics methodology to model the generation of droplets or cores (using alginate and oil)
• Goal: a better understanding of the process parameters needed to generate cores of homogeneous size for manufacturing microcapsules
Droplet Generation: Experimental Procedure
Droplet Generation: Example Results
Case 1:
• Olive oil: density 930 kg/m³, viscosity 0.03 kg/(m·s)
• Alginate: density 1012 kg/m³, viscosity 0.2137 kg/(m·s)
Case 2:
• Phase 1: density 918 kg/m³, viscosity 0.084 kg/(m·s)
• Phase 2: density 998.2 kg/m³, viscosity 0.001003 kg/(m·s)
Source: Wu’s thesis