Lecture 1 Parallel Computing Architectures

Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico

Outline
• Goal: Understand parallel computing fundamental concepts
  – HPC challenges
  – Flynn’s Taxonomy
  – Memory Access Models
  – Multi-core Processors
  – Graphics Processor Units
  – Cluster Infrastructures
  – Cloud Infrastructures

HPC Challenges
• Physics of high-temperature superconducting cuprates
• Protein structure and function for cellulose-to-ethanol conversion
• Global simulation of CO2 dynamics
• Optimization of plasma heating systems for fusion experiments
• Fundamental instability of supernova shocks
• Next-generation combustion devices burning alternative fuels
Slide source: Thomas Zaharia

HPC Challenges
[Figure: available computational capacity, from 1 Giga (10^9) to 1 Zetta (10^21) Flop/s, and number of overnight loads cases run, plotted from 1980 to 2030 for RANS Low Speed, RANS High Speed, Unsteady RANS, and LES. "Smart" use of HPC power: algorithms, data mining, knowledge. Capability achieved during one night batch. Courtesy AIRBUS France]

HPC Challenges
High Resolution Climate Modeling on NERSC-3 – P.
Duffy, et al., LLNL

HPC Challenges
https://computation.llnl.gov/casc/projects/.../climate_2007F.pdf

Flynn's Taxonomy
[Figure: 2x2 grid classifying architectures by instruction streams and data streams: SISD, SIMD, MISD, MIMD]

Flynn's Taxonomy
• Single Instruction, Multiple Data (SIMD)
  – All processing units execute the same instruction at any given clock cycle
  – Best suited for problems with a high degree of regularity
    • Image processing
  – Good examples
    • SSE = Streaming SIMD Extensions
    • SSE, SSE2, Intel MIC (Xeon Phi)
    • Graphics Processing Units (GPU)

Flynn's Taxonomy
• Multiple Instruction, Multiple Data (MIMD)
  – Every processing unit may be executing a different instruction stream and working with a different data stream
    • Clusters and multicore computers
    • In practice, MIMD architectures may also include SIMD execution sub-components

Memory Access Models
• Shared memory
• Distributed memory
• Hybrid Distributed-Shared Memory

Shared Memory
[Figure: CPUs, each with an L2 cache, connected through a bus interconnect to shared memory and I/O]

Shared Memory
• Multiple processors can operate independently but share the same memory resources
  – Changes to a memory location made by one processor are visible to all other processors
• Two main classes based upon memory access times
  – Uniform Memory Access (UMA)
    • Symmetric Multi Processors (SMPs)
  – Non Uniform Memory Access (NUMA)
• Main disadvantage is the lack of scalability between memory and CPUs.
  – Adding more CPUs geometrically increases traffic on the shared memory-CPU path

Shared Memory
• Memory hierarchy tries to exploit locality
  – Cache hit: in-cache memory access (cheap)
  – Cache miss: non-cache memory access (expensive)

Distributed Memory
[Figure: four nodes, each with a CPU, an L2 cache, and local memory (M), connected through a network with I/O]

Distributed Memory
• Processors have their own local memory.
• When a processor needs access to data in another processor
  – it is usually the task of the programmer to explicitly define how and when data is communicated
• Examples: Cray XT4, Clusters, Cloud

Hybrid (Distributed-Shared) Memory
• In practice we have hybrid memory access
[Figure: several shared memory nodes connected through a network]

Parallel computing trends
• Multi-core processors
  – Instead of building processors with faster clock speeds, modern computer systems are being built using chips with an increasing number of processor cores
• Graphics Processor Unit (GPU)
  – General purpose computing and in particular data parallel high performance computing
• Dynamic approach to cluster computing provisioning
  – Instead of offering a fixed software environment, the application provides information to the scheduler about what type of resources it needs, and the nodes are automatically provisioned for the user at run-time.
    • Platform ISF Adaptive Cluster
    • Moab Adaptive Operating Environment
• Large scale commodity computer data centers (cloud)
  – Amazon EC2, Eucalyptus, Google App Engine

Multi-cores and Moore’s Law
• Circuit complexity doubles every 18 months
• Power wall (2004)
Source: Intel
Source: The National Academies Press, Washington, DC, 2011

Power Wall
• The transition to multi-core processors is not a breakthrough in architecture; it is a result of the need to build power-efficient chips

Power Density Limits Serial Performance

Many-cores (Graphics Processor Units)
• Graphics Processor Units (GPUs)
  – Throughput-oriented devices designed to provide high aggregate performance for independent computations
    • They prioritize high-throughput processing of many parallel operations over the low-latency execution of a single task
  – GPUs do not use independent instruction decoders
    • Instead, groups of processing units share an instruction decoder; this maximizes the number of arithmetic units per die area

Multi-Core vs. Many-Core
• Multi-core processors (minimize latency)
  – MIMD
  – Each core optimized for executing a single thread
  – Lots of big on-chip caches
  – Extremely sophisticated control
• Many-core processors (maximize throughput)
  – SIMD
  – Cores optimized for aggregate throughput
  – Lots of ALUs
  – Simpler control

CPUs: Latency Oriented Design
• Large caches
  – Convert long latency memory accesses to short latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALU
  – Reduced operation latency
[Figure: CPU with a large control unit, a few ALUs, a large cache, and DRAM]
© David Kirk/NVIDIA and Wen-mei W.
Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign

GPUs: Throughput Oriented Design
• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy efficient ALUs
  – Many, long latency but heavily pipelined for high throughput
• Require massive number of threads to tolerate latencies
[Figure: GPU with many small ALUs and DRAM]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign

Multi-Core vs. Many-Core
[Figure: peak GFLOPs (0 to 1400) from 9/22/2002 to 7/27/2009 for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) and Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]

Intel® Xeon® Processor E7-8894 v4
• 24 cores
• 48 threads
• 2.40 GHz
• 14 nm
• 60MB cache
• $8k (July 2017)

NVIDIA TITAN Xp
• 3840 cores
• 1.6 GHz
• Pascal Architecture
• Peak = 12TF/s
• $1.5K

Cluster Hardware Configuration
[Figure: head node with external storage and local storage, connected through a switch to Node 1, Node 2, ..., Node n. © Wilson Rivera]

Cluster Head Node
• Head Node
  – Network interface cards (NIC): one connecting to the public network and the other one connecting to the internal cluster network.
  – Local storage is attached to the head node for administrative purposes such as accounting management and maintenance services

Cluster Interconnection Network
• The interconnection of the cluster depends upon both application and budget constraints.
  – Small clusters typically have PC based nodes connected through a Gigabit Ethernet network
  – Large scale production clusters may be made of 1U or 2U servers or blade servers connected through either
    • A Gigabit Ethernet network (Server Farm), or
    • A high performance computing network (High Performance Computing Cluster)
      – Infiniband
      – Quadrics
      – Myrinet
      – Omni-Path (Intel)

Cluster Storage
• Storage Area Network (SAN)
  – Storage devices appear as locally attached to the operating system.
• Network Attached Storage (NAS)
  – Distributed File-based protocols
    • Parallel Virtual File System (PVFS)
    • General Parallel File System (GPFS)
    • Hadoop Distributed File System (HDFS)
    • Lustre
    • CERN-VM-FS

Cluster Software
[Figure: cluster software stack with the labels Cluster Resource Manager, Scheduler, Monitor, Analyzer; Cluster Tools and Libraries, Communication, Compiler, Optimization; Cluster Infrastructure, Operating System, Services. © Wilson Rivera]

Top500.org

History of Performance

Projected Performance
[Figure: Top500 projected performance, from 100 Mflop/s to 100 Pflop/s, with curves for SUM, N=1, and N=500]

#1 TAIHULIGHT @ CHINA
• June 2017
• National Supercomputing Center in Wuxi
• SW26010 processors developed by NRCPC
• 40,960 nodes
• 10,649,600 cores
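
The Distributed Memory slides note that each processor has its own local memory and that the programmer must explicitly define how and when data is communicated. The sketch below imitates that model on a single machine, under loud assumptions: threads stand in for nodes, a `queue.Queue` stands in for the network, and the names `worker` and `distributed_sum` are hypothetical. A real cluster would typically use a message-passing library such as MPI instead.

```python
import queue
import threading


def worker(rank, local_data, outbox):
    """One simulated node: it only touches its own local_data."""
    partial = sum(local_data)
    # Explicit communication: the programmer decides what is sent and when.
    outbox.put((rank, partial))


def distributed_sum(data, num_nodes):
    """Split data across simulated nodes and combine their partial sums."""
    chunk = (len(data) + num_nodes - 1) // num_nodes  # ceiling division
    inbox = queue.Queue()  # stands in for the interconnection network
    threads = []
    for rank in range(num_nodes):
        # Each node receives a private copy of its slice (its "local memory").
        local = list(data[rank * chunk:(rank + 1) * chunk])
        t = threading.Thread(target=worker, args=(rank, local, inbox))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # The collecting side receives exactly one message per node.
    partials = dict(inbox.get() for _ in range(num_nodes))
    return sum(partials.values()), partials


total, partials = distributed_sum(list(range(1, 101)), num_nodes=4)
print(total)     # 5050
print(partials)  # {0: 325, 1: 950, 2: 1575, 3: 2200} (key order may vary)
```

No node reads another node's slice directly; the only way a partial result reaches the collector is through an explicit `put`/`get` pair, which is the property that separates this model from shared memory.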

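The "Peak = 12TF/s" figure on the NVIDIA TITAN Xp slide can be checked with a back-of-the-envelope calculation from the core count and clock rate listed there. This is a minimal sketch, assuming each core completes one fused multiply-add per clock cycle, counted as 2 floating-point operations; that factor is an assumption, not a number from the slides, and `peak_flops` and `format_flops` are hypothetical helper names.

```python
def peak_flops(cores, clock_hz, flops_per_cycle=2):
    """Theoretical peak = cores x clock rate x FLOPs per core per cycle.

    flops_per_cycle=2 assumes one fused multiply-add per cycle
    (an assumption, not a figure taken from the slides).
    """
    return cores * clock_hz * flops_per_cycle


def format_flops(value):
    """Scale a Flop/s value to the largest SI prefix that fits."""
    prefixes = (
        ("Zetta", 1e21),
        ("Exa", 1e18),
        ("Peta", 1e15),
        ("Tera", 1e12),
        ("Giga", 1e9),
    )
    for prefix, factor in prefixes:
        if value >= factor:
            return f"{value / factor:.2f} {prefix}flop/s"
    return f"{value:.0f} flop/s"


# Core count and clock rate as listed on the TITAN Xp slide.
titan_xp = peak_flops(cores=3840, clock_hz=1.6e9)
print(format_flops(titan_xp))  # 12.29 Teraflop/s
```

Under that assumption the result, about 12.3 Tflop/s, agrees with the 12 TF/s quoted on the slide. A peak computed this way is an upper bound; sustained performance on real applications is usually lower.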