Current Trends in High Performance Computing

Chokchai Box Leangsuksun, PhD

SWEPCO Endowed Professor*, Computer Science
Director, High Performance Computing Initiative
Louisiana Tech University
[email protected]


*SWEPCO endowed professorship is made possible by LA Board of Regents

Outline

• What is HPC?
• Current Trends
• More on PS3 and GPU computing
• Conclusion


Mainstream CPUs

• CPU speed plateaus at 3-4 GHz
• More cores in a single chip (under the 3-4 GHz cap)
  – Dual/quad core is now
  – Manycore (GPGPU) is next
• Traditional applications won't get a free ride
• Conversion to parallel paradigms (HPC, multithreading) is needed

This diagram is from "The Free Lunch Is Over" article in Dr. Dobb's Journal (DDJ)


New trends in computing

• Old & current – SMP, clusters
• Multicore computers
  – Intel Core 2 Duo
  – AMD Athlon 64 X2
• Many-core accelerators – GPGPU, FPGA, Cell
• More "brains" in one computer, rather than increasing CPU frequency
• Harness many computers – cluster computing


What is HPC?

• High Performance Computing – parallel computing, supercomputing
• Achieve the fastest possible computing outcome
• Subdivide a very large job into many pieces
• Enabled by multiple high-speed CPUs, networking, software & programming paradigms – the fastest possible solution
• Technologies that help solve non-trivial tasks, including scientific, engineering, medical, business, and entertainment problems

• Time to insight, time to discovery, time to market


Parallel Programming Concepts

Conventional serial execution: the problem is represented as a series of instructions that are executed by the CPU. Parallel execution involves partitioning of the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.

[Diagram: a serial problem as one instruction stream on a single CPU vs. a problem partitioned into tasks, each with its own instruction stream, on multiple CPUs]

Parallel computing takes advantage of concurrency to:
• Solve larger problems in less time
• Save on wall-clock time
• Overcome memory constraints
• Utilize non-local resources

Source: Thomas Sterling's introduction to HPC

HPC Applications and Major Industries

• Finite element modeling – auto/aero
• Fluid dynamics – auto/aero, consumer packaged goods manufacturers, manufacturing, disaster preparedness (tsunami)
• Imaging – seismic & medical
• Finance & business
  – Banks, brokerage houses (regression analysis, risk, options pricing, what-if, …)
  – Wal-Mart's HPC in their operations
• Molecular modeling – biotech and pharmaceuticals

Complex problems, large datasets, long runs

This slide is from Intel presentation “Technologies for Delivering Peak Performance on HPC and Grid Applications”


HPC Drives Knowledge Economy


Life Science Problem – an example of Protein Folding

• A molecular dynamics simulation of a protein folding problem takes a computing year in serial mode

• Excerpted from IBM David Klepacki's "The Future of HPC"
• Petaflop = a thousand trillion floating-point operations per second

Disaster Preparedness - example

• Project LEAD – severe weather (tornado) prediction, led by OU
  – HPC & dynamic adaptation to the weather forecast
• Professor Seidel's LSU CCT – hurricane route prediction
  – Emergency preparedness
  – Accuracy of prediction – 1 mile² = $1 M


HPC accelerates a product

• Finite element analysis on 1 CPU
  – 1,000,000 elements
  – Numerical processing for 1 element = 0.1 sec
  – One computer will take 100,000 secs ≈ 27.7 hrs
• With, say, 100 CPUs: ≈ 0.28 hr ≈ 17 mins (worked out below)
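As a quick check of the arithmetic (a worked example assuming perfectly parallel work with no communication overhead, i.e. ideal linear speedup):

\[
T_1 = 10^6 \times 0.1\,\mathrm{s} = 10^5\,\mathrm{s} \approx 27.7\,\mathrm{hr},
\qquad
T_{100} = \frac{T_1}{100} = 10^3\,\mathrm{s} \approx 16.7\,\mathrm{min} \approx 0.28\,\mathrm{hr}
\]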


Avian Flu Pandemic Modeled on a Supercomputer

• MIDAS (Models of Infectious Disease Agent Study) program
• The large-scale, stochastic simulation model examines the nationwide spread of a pandemic influenza virus strain
• A simulation starts with 2 passengers infected with avian flu (AF) arriving at LAX
• The simulation rolls out a city-by-city, census-tract-level picture of the spread of infection
• A synthetic population of 281 million people over the course of 180 days
• It is a very large-scale and complex multi-variate simulation


Avian Flu Pandemic (90 days)

Timothy C. Germann, Kai Kadau, Catherine A. Macken (Los Alamos National Laboratory); Ira M. Longini Jr. (Emory University). Source: www.lanl.gov

Avian Flu Pandemic (II)

• The results show that advance preparation of a modestly effective vaccine in large quantities appears to be preferable to waiting for the development of a well-matched vaccine that may come too late.
• The simulation models a synthetic population that matches U.S. census demographics and worker mobility data by randomly assigning the simulated individuals to households, workplaces, schools, and the like.
• The models serve as virtual laboratories to study how infectious diseases spread and which intervention strategies are more effective.
• Run on the Los Alamos supercomputer known as Pink, a 1,024-node (2,048-processor) LinuxBIOS/BProc cluster with 2 GB of memory per node.

Source: www.lanl.gov


Significant indicators – why HPC now?

• Mainstream computers with multi-cores (Intel or AMD)
  – In the past 1-2 years, CPU speed has flattened at 3+ GHz
  – More CPUs in one chip – dual-core, multi-core chips
  – Traditional software won't take advantage of these new processors
  – Personal/desktop supercomputing
• Many real problems are highly computationally intensive
  – NSA uses supercomputing to do data mining
  – DOE – fusion, plasma, energy related (including weaponry)
  – Helps solve many other important areas (nanotech, life science, etc.)
  – Product design, ERM/inventory management
• Industry giants have recently taken up HPC
  – Bush's State of the Union speech – 3 main S&T focuses, of which supercomputing is one
  – Bill Gates' keynote speech at SC05 – MS goes after HPC
• Google search engine – 100,000 nodes
• PlayStation 3 is a personal supercomputing platform
• Hollywood (entertainment) is HPC-bound (Pixar – more than 3,000 CPUs to render animation)


HPC preparedness

• Build a workforce that understands the HPC paradigm & its applications
  – HPC/grid curriculum in IT/CS/CE/ICT
  – Offer HPC-enabling tracks to other disciplines (engineering, life science, physics, computational chemistry, business, etc.)
  – Train the business community
  – Bring awareness to the public
• National and strategic policies
• Improve infrastructure


Pause here

• Switch to a tour of the machine rooms – clusters and our lab, to show what students will be using
• Get students' info on a signup sheet for accounts on our clusters (azul, quadcore, GPU, and PS3)
• Intro to Linux
• Then continue on with HPC 101


HPC 101


How to Run Applications Faster?

• There are 3 ways to improve performance:
  – Work harder
  – Work smarter
  – Get more help
• Computer analogy:
  – Use faster hardware
  – Use optimized algorithms and techniques to solve computational tasks
  – Use multiple computers to solve a particular task


Parallel Programming Concepts

[Diagram: a problem decomposed into tasks, each with its own instruction stream, executed on multiple CPUs]

Source: Thomas Sterling's introduction to HPC

HPC objective

• High Performance Computing – parallel computing, supercomputing
• Achieve the fastest possible computing outcome
• Subdivide a very large job into many pieces
• Enabled by multiple high-speed CPUs, networking, software & programming paradigms – the fastest possible solution
• Technologies that help solve non-trivial tasks, including scientific, engineering, medical, business, and entertainment problems


Flynn’s Taxonomy of Computer Architectures

• SISD – Single Instruction/Single Data
• SIMD – Single Instruction/Multiple Data
• MISD – Multiple Instruction/Single Data
• MIMD – Multiple Instruction/Multiple Data


Single Instruction/Single Data

PU – Processing Unit

Your desktop, before the spread of dual core CPUs

Slide Source: Wikipedia, Flynn's Taxonomy

Flavors of SISD

[Diagram: instruction stream executed by a single processing unit]


More on pipelining…


Single Instruction/Multiple Data

Processors that execute the same instruction on multiple pieces of data, e.g. NVIDIA GPUs

Slide Source: Wikipedia, Flynn's Taxonomy

Single Instruction/Multiple Data

• Each core runs the same set of instructions on different data
• Example:
  – GPGPU: processes pixels of an image in parallel (a CPU analogue is sketched below)

Slide Source: Klimovitski & Macri, Intel
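The same SIMD idea also appears in mainstream CPUs as vector instructions. Below is a minimal sketch (not from the original slides) using x86 SSE intrinsics, where a single instruction operates on four floats at once:

```c
/* A minimal SIMD sketch using x86 SSE intrinsics: one instruction
   (_mm_add_ps) adds four floats at a time. Illustrative only. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);        /* load 4 floats into a vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);     /* 4 additions in one instruction */
    _mm_storeu_ps(c, vc);
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```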

SISD versus SIMD

Writing a compiler for SIMD architectures is VERY difficult (intercommunication complicates the picture…)

Slide Source: ars technica, Peakstream article

Multiple Instruction/Single Data

Pipeline: e.g., the CMU Warp machine.

Slide Source: Wikipedia, Flynn's Taxonomy

Multiple Instruction/Multiple Data

E.g., multicore systems are based on a MIMD architecture plus a programming paradigm such as OpenMP or multithreading.

Slide Source: Wikipedia, Flynn's Taxonomy

Multiple Instruction/Multiple Data

• The sky is the limit: each PU is free to do as it pleases
• Can be of either shared-memory or distributed-memory categories

[Diagram: independent instruction streams, one per PU]


Current HPC Hardware

• Traditionally, HPC has adopted expensive parallel hardware:
  – Massively Parallel Processors (MPP)
  – Symmetric Multi-Processors (SMP)
  – Cluster computers
• Recent trends in HPC:
  – Multicore systems
  – Heterogeneous computing with accelerator boards (GPGPU, FPGA)


HPC cluster

• Login
• Compile
• Submit job
• At least 2 connections
• Run tasks


Parallel Programming Env

• Parallel programming environments and tools:
  – Threads (PCs, SMPs, NOW, …)
    • POSIX threads
    • Java threads
  – MPI (Linux, NT, on many platforms) – a minimal example follows below
  – OpenMP (predominantly on SMPs)
  – PVM (old)
  – UPC, Co-array Fortran
  – CUDA, Brook+, OpenCL
  – Software DSMs (Shmem)
  – Compilers
  – RAD (rapid application development) tools
  – Debuggers
  – Performance analysis tools
  – Visualization tools
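To make the message-passing entry in the list above concrete, here is a minimal MPI sketch (an added illustration, not from the original slides); every process reports its rank:

```c
/* Minimal MPI sketch: every process (rank) reports itself.
   Compile with mpicc; run with e.g. mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```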

Recent Trends in HPC Hardware

• Multicore & manycore are now
  – Multiple CPUs in a single die
  – Better power consumption
  – Tightly coupled and better for multithreading
• GPGPU
  – A building block for much larger systems
• New Top 500 HPC systems – clusters of multicores & GPGPUs


What are HPC systems?


Current top 5 systems


Shared vs Distributed Memory


Shared memory

• Global memory space, accessible by all processors
• Processors may have local memory to hold copies of some global memory
• Consistency of copies is usually maintained by hardware (cache coherency)


Two typical classes of shared memory (SM)

• Uniform Memory Access (UMA):
  – Equal access times, identical processors
  – Typically represented by Symmetric Multi-processor (SMP) machines or multicores
• Non-Uniform Memory Access (NUMA):
  – Memory access times are not uniform; memory access across a link is slower
  – Often made by physically linking two or more SMPs, or heterogeneous computing


Advantages & Disadvantages

• Global address space is user-friendly
• Data sharing between tasks is fast
• The system may suffer from a lack of scalability: adding CPUs increases traffic on the shared memory-to-CPU path, especially for cache-coherent systems
• The programmer is responsible for correct synchronization (see the sketch below)
• Systems larger than an SMP need some special-purpose components
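To illustrate the synchronization burden noted above, here is a small POSIX-threads sketch (an added illustration, not from the original slides): four threads increment a shared counter, and the mutex is what keeps the result correct.

```c
/* Without the mutex, concurrent increments of `counter` would race.
   POSIX threads sketch; compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* programmer-enforced ordering */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* always 400000 with the lock */
    return 0;
}
```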


Distributed Memory


Multicores

• Three multicore classifications:
  – Homogeneous
  – Heterogeneous
  – Hybrid


Multicores (I)

• Homogeneous cores (a main CPU)
  – All cores are identical
  – A traditional multicore with a few cores
    • Good for jumbo & few tasks
    • Not as many tasks/threads as accelerators or GPUs
  – E.g., Intel Core 2 Duo, i3, i5, i7; AMD
  – Programming – multithreads/OpenMP (see the sketch below)
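A minimal OpenMP sketch of that multithreaded style (an added illustration, not from the original slides; compile with an OpenMP-capable compiler, e.g. gcc -fopenmp):

```c
/* OpenMP sketch: the loop iterations are divided among the cores. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for            /* fork worker threads here */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %.1f (ran with up to %d threads)\n",
           c[42], omp_get_max_threads());
    return 0;
}
```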


Multicores (II)

• Homogeneous cores as an accelerator or compute device
  – Needs a main CPU system
  – Attached processing units
  – All cores are identical, and there are many
  – Good for many SIMD tasks/threads
  – E.g., NVIDIA GPGPU, ClearSpeed, FPGA
  – Programming – library calls from a main program, or a new language extension, e.g. CUDA


Multicores (III)

• Heterogeneous cores
  – All cores are NOT identical
  – All in one die
  – Programming is more difficult
  – See more in the PS3 presentation


Multicores (IV)

• Hybrid system
  – Mix of host cores & accelerator cores
  – A typical host can be a desktop to server system, e.g. Intel or AMD
  – Accelerator – NVIDIA, ATI Stream, or FPGA
  – Programming model is more complex
  – Issues – memory bandwidth between host and devices

Introduction to Cell BE (PS3) Programming

HPCI: High Performance Computing Initiative

PS3 – an awesome HPC system

• IBM Cell processor
• Affordable
• But currently not many tools


Cell BE Architecture

• Synergistic Processor Element (SPE)
  – 128-bit RISC, SIMD processor
  – 256 KB local storage memory
  – Uses DMA to transfer data between local storage and main memory
• PowerPC Processor Element (PPE)
  – Main processor, 64-bit
  – Also supports Vector/SIMD
  – Runs the OS, manages the SPEs

Picture ref: http://gamasutra.com/features/20060721/chow_01.shtml

Cell Programming

• IBM Cell SDK
• The main process runs on the PPE
• Threads run on the SPEs
• PPE-centric programming paradigm (a minimal sketch follows)

[Diagram: one PPE process spawning SPE threads: SPE thread, SPE thread, SPE thread, …]
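A minimal PPE-side sketch of this paradigm, assuming IBM's libspe2 from the Cell SDK (an added illustration; the embedded SPU program handle spu_kernel is hypothetical, and error handling is omitted):

```c
/* PPE-centric sketch with libspe2 (Cell SDK). `spu_kernel` stands in for
   an embedded SPU binary (hypothetical name). spe_context_run() blocks,
   so real code typically runs one pthread per SPE. */
#include <libspe2.h>

extern spe_program_handle_t spu_kernel;   /* hypothetical SPU program */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t spe = spe_context_create(0, NULL); /* one SPE context */
    spe_program_load(spe, &spu_kernel);                  /* load SPU code   */
    spe_context_run(spe, &entry, 0, NULL, NULL, NULL);   /* run until SPU stops */
    spe_context_destroy(spe);
    return 0;
}
```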


GPGPU – General-Purpose Graphics Processing Unit


Two major players

Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 500 GFLOPS on compiled parallel C applications
• Tesla T10 1070: 1-4 TFLOPS
• GPU parallelism outpaces Moore's law – more than doubling every year
• A GPGPU is a GPU that allows the user to process both graphics and non-graphics applications

[Pictured: Tesla D870 and GeForce 8800]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

NVIDIA GeForce 8800 (G80)

• The eighth generation of NVIDIA's GeForce graphics cards
• High-performance CUDA-enabled GPGPU
• 128 cores
• Memory: 256-768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth
• Supports Scalable Link Interface (SLI)

NVIDIA Tesla™

• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multithreaded computing
  – Supercomputing performance

NVIDIA Tesla Card >>

• C-Series (card) = 1 GPU with 1.5 GB
• D-Series (deskside unit) = 2 GPUs
• S-Series (1U server) = 4 GPUs

• Note: 1 G80 GPU = 128 cores ≈ 500 GFLOPS
• 1 T10 = 240 cores = 1 TFLOPS

<< NVIDIA G80

This slide is from the NVIDIA CUDA tutorial

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

GPGPU Programming with CUDA

• CUDA (Compute Unified Device Architecture) is an SDK and API that allows a programmer to write C and Fortran programs that execute on a GPGPU.

• Works with NVIDIA G80 or later and Tesla

• The GPGPU is viewed as a compute device (a minimal sketch follows)
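A minimal CUDA C sketch of that host/device view (an added illustration, not from the slides; error checking omitted): the host allocates device memory, launches a kernel over the data, and copies the result back.

```c
/* Minimal CUDA C sketch: the GPGPU as a compute device.
   Compile with nvcc. Error checking omitted for brevity. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes),
          *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);   /* launch on the device */

    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %.1f\n", c[42]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```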

ATI Stream (1)


ATI 4870


ATI 4870 X2


Architecture of the ATI Radeon 4000 series

This slide is from an ATI presentation

This slide is from an ATI presentation

Introduction to OpenCL

Toward a new approach in computing

Moayad Almohaishi

Introduction to OpenCL

• OpenCL stands for Open Computing Language
• It comes from a consortium effort including Apple, NVIDIA, AMD, and others
• It is managed by the Khronos Group, which was responsible for OpenGL
• It took 6 months to come up with the specifications

OpenCL

1. Royalty-free
2. Supports both task- and data-parallel programming modes
3. Works with vendor-agnostic GPGPUs
4. Includes multicore CPUs
5. Works on Cell processors
6. Supports handhelds and mobile devices
7. Based on the C language under C99

OpenCL

• Can query the available devices and build a context of the available devices.

• Programmers can program more freely for any kind of device.

• Applications are more reusable even if the hardware changes in the future. (A minimal device-query sketch follows.)
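A minimal host-side sketch of that device query and context creation (an added illustration, not from the slides; error handling omitted):

```c
/* OpenCL sketch: discover a platform, list its devices, build a context.
   Link with -lOpenCL. Error checking omitted for brevity. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);           /* first platform */

    cl_uint ndev;
    cl_device_id devices[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

    for (cl_uint i = 0; i < ndev; i++) {            /* print device names */
        char name[256];
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("device %u: %s\n", i, name);
    }

    /* a context over all discovered devices */
    cl_context ctx = clCreateContext(NULL, ndev, devices, NULL, NULL, NULL);
    clReleaseContext(ctx);
    return 0;
}
```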

OpenCL Platform Model

CPUs+GPU platforms


Performance of GPGPU

Note: a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Last words!

• An HPC or supercomputing system is not necessarily gigantic in a big machine room, but is accessible for Thais and may now be sitting next to your desk
• Computing is a necessity, and fast computing provides a competitive edge, especially in the knowledge economy
• New trends in HPC include GPGPU and various multicore architectures
• Prepare ourselves and strengthen our S&T, industry, and business community for this phenomenon (HPC goes mainstream) before it is too late


Backup slides


Cancer Gene-mining

• Unsuccessful on a uniprocessor
• Our approach
  – Novel parallel gene-mining algorithms
  – Input from microarrays
  – Retains accuracy
  – Significant (superlinear) speedup
• IBM P5 supercomputer (128-node PPC)

[Charts: time taken (in secs) to run the algorithm vs. number of processors (13, 39, 65, 91) for OvaMarker-based and GeneSetMine-based selection; results across cancer types (bladder, mesothelioma, breast, renal, leukemia, prostate, lung, pancreas, colorectal, ovary, lymphoma, melanoma)]


Drug Delivery

• By Wu & Palmer, Louisiana Tech University
• Assisted by HPCI
• A study of microcapsules for drug delivery
• Computational fluid dynamics methodology to model the generation of droplets or cores (using alginate and oil)
• Goal: a better understanding of the process parameters needed to generate cores of homogeneous size for the manufacturing of microcapsules


Droplet Generation: Experimental Procedure


Droplet Generation: Example Results

Case 1:
• Olive oil: density 930 kg/m³, viscosity 0.03 kg/m-s
• Alginate: density 1012 kg/m³, viscosity 0.2137 kg/m-s

Case 2:
• Phase 1: density 918 kg/m³, viscosity 0.084 kg/m-s
• Phase 2: density 998.2 kg/m³, viscosity 0.001003 kg/m-s

Source from wu’s thesis 12 December 2011 80
