Lecture 1: Parallel Computing Architectures

Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline
• Goal: Understand fundamental concepts of parallel computing
  – HPC challenges
  – Flynn's taxonomy
  – Memory access models
  – Multi-core processors
  – Graphics processor units
  – Cluster infrastructures
  – Cloud infrastructures

HPC Challenges
• Representative applications that demand HPC:
  – Physics of high-temperature superconducting cuprates
  – Protein structure and function for cellulose-to-ethanol conversion
  – Global simulation of CO2 dynamics
  – Optimization of plasma heating systems for fusion experiments
  – Fundamental instability of supernova shocks
  – Next-generation combustion devices burning alternative fuels
• Slide source: Thomas Zaharia

HPC Challenges
• [Figure: Airbus CFD roadmap, 1980–2030 — available overnight computational capacity (from 1 Gigaflop/s toward 1 Zetaflop/s) plotted against simulation capability (RANS, unsteady RANS, LES; CFD-based loads and handling qualities, full MDO, CFD-based noise simulation, real-time CFD in flight), where capability is what can be achieved in one overnight batch run.]
• "Smart" use of HPC power: algorithms, data mining, knowledge
• Courtesy AIRBUS France

HPC Challenges
• High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL
• https://computation.llnl.gov/casc/projects/.../climate_2007F.pdf

Flynn's Taxonomy
• Classifies architectures by the number of concurrent instruction streams and data streams:
  – SISD: Single Instruction, Single Data
  – SIMD: Single Instruction, Multiple Data
  – MISD: Multiple Instruction, Single Data
  – MIMD: Multiple Instruction, Multiple Data

Flynn's Taxonomy
• Single Instruction, Multiple Data (SIMD)
  – All processing units execute the same instruction at any given clock cycle
  – Best suited for problems with a high degree of regularity
    • Image processing
  – Good examples (see the data-parallel loop sketch after the memory access model slides below)
    • SSE (Streaming SIMD Extensions), SSE2, Intel MIC (Xeon Phi)
    • Graphics Processing Units (GPUs)

Flynn's Taxonomy
• Multiple Instruction, Multiple Data (MIMD)
  – Every processing unit may be executing a different instruction stream and working with a different data stream
    • Clusters and multicore computers
  – In practice, MIMD architectures may also include SIMD execution sub-components

Memory Access Models
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory

Shared Memory
• [Figure: several CPUs, each with an L2 cache, connected through a bus/interconnect to a single shared memory and I/O.]
• Multiple processors can operate independently but share the same memory resources
  – Changes made to a memory location by one processor are visible to all other processors
• Two main classes based upon memory access times
  – Uniform Memory Access (UMA)
    • Symmetric Multi-Processors (SMPs)
  – Non-Uniform Memory Access (NUMA)
• Main disadvantage is the lack of scalability between memory and CPUs
  – Adding more CPUs geometrically increases traffic on the shared memory–CPU path
• A shared-memory programming sketch (OpenMP) appears after the memory access model slides below.

Shared Memory
• The memory hierarchy tries to exploit locality
  – Cache hit: access served from cache (cheap)
  – Cache miss: access goes to main memory (expensive)

Distributed Memory
• [Figure: several nodes, each with a CPU, L2 cache, and local memory (M), connected to one another and to I/O through a network.]
• Processors have their own local memory
• When a processor needs access to data in another processor's memory
  – It is usually the task of the programmer to explicitly define how and when data is communicated (see the MPI sketch after the memory access model slides below)
• Examples: Cray XT4, clusters, cloud

Hybrid (Distributed-Shared) Memory
• In practice we have hybrid memory access
• [Figure: several shared-memory nodes connected to one another through a network.]
Parallel Computing Trends
• Multi-core processors
  – Instead of building processors with faster clock speeds, modern computer systems are built using chips with an increasing number of processor cores
• Graphics processor units (GPUs)
  – General-purpose computing, and in particular data-parallel high performance computing
• Dynamic approach to cluster computing provisioning
  – Instead of offering a fixed software environment, the application provides information to the scheduler about what type of resources it needs, and the nodes are automatically provisioned for the user at run time
    • Platform ISF Adaptive Cluster
    • Moab Adaptive Operating Environment
• Large-scale commodity computer data centers (cloud)
  – Amazon EC2, Eucalyptus, Google App Engine

Multi-cores and Moore's Law
• Circuit complexity doubles every 18 months
• Power wall (2004)
• [Figure sources: Intel; The National Academies Press, Washington, DC, 2011]

Power Wall
• The transition to multi-core processors is not a breakthrough in architecture; it is rather a consequence of the need to build power-efficient chips

Power Density Limits Serial Performance
• [Figure-only slide.]

Many-Cores (Graphics Processor Units)
• Graphics processor units (GPUs) are throughput-oriented devices designed to provide high aggregate performance for independent computations
  – They prioritize high-throughput processing of many parallel operations over the low-latency execution of a single task
• GPUs do not use independent instruction decoders
  – Instead, groups of processing units share an instruction decoder; this maximizes the number of arithmetic units per die area

Multi-Core vs. Many-Core
• Multi-core processors (minimize latency)
  – MIMD
  – Each core optimized for executing a single thread
  – Lots of big on-chip caches
  – Extremely sophisticated control
• Many-core processors (maximize throughput)
  – SIMD
  – Cores optimized for aggregate throughput
  – Lots of ALUs
  – Simpler control

CPUs: Latency Oriented Design
• [Figure: CPU die sketch — a few powerful ALUs, large control logic, large cache, attached DRAM.]
• Large caches
  – Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALUs
  – Reduced operation latency
• © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign

GPUs: Throughput Oriented Design
• [Figure: GPU die sketch — many small ALUs, small caches, simple control, attached DRAM.]
• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy-efficient ALUs
  – Many; long latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies
  – A minimal CUDA kernel sketch appears after the GPU hardware slides below
• © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign

Multi-Core vs. Many-Core
• [Figure: peak GFLOP/s from 2002 to 2009 for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Pentium 4, 3 GHz dual-core Core2 Duo, 3 GHz quad-core, 3 GHz Xeon Westmere); the GPU curve grows much faster than the CPU curve on an axis reaching 1400 GFLOP/s.]

Intel® Xeon® Processor E7-8894 v4
• 24 cores
• 48 threads
• 2.40 GHz
• 14 nm
• 60 MB cache
• $8k (July 2017)

NVIDIA TITAN Xp
• 3840 cores
• 1.6 GHz
• Pascal architecture
• Peak = 12 TFLOP/s
• $1.5K
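A minimal CUDA sketch of the throughput-oriented style described above (referenced in the GPU design slide): thousands of lightweight threads execute the same kernel, and groups of threads share control logic, so latency is hidden by sheer parallelism rather than by large caches. The array size and names are illustrative.

```cuda
// vec_add.cu — many-core (GPU) style: one thread per element, same code for all threads.
// Build (e.g.): nvcc -O3 vec_add.cu -o vec_add
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];                   // same instruction stream, different data
}

int main() {
    const int n = 1 << 20;                           // illustrative problem size
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);                    // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                               // threads per block
    int blocks = (n + threads - 1) / threads;        // enough blocks to cover all elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]);                // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```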
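As a quick sanity check on the quoted TITAN Xp peak (an illustrative back-of-the-envelope calculation, assuming each core completes one fused multiply-add, i.e., 2 floating-point operations, per cycle at the listed clock):

  3840 cores × 1.6 × 10^9 cycles/s × 2 FLOP/cycle ≈ 12.3 × 10^12 FLOP/s ≈ 12 TFLOP/s,

which agrees with the peak on the slide. Each Xeon core does more FLOPs per cycle through its wide vector units, but with only 24 cores the chip's aggregate peak is still far lower; that gap is the multi-core versus many-core trade-off expressed in numbers.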
Cluster Hardware Configuration
• [Figure: head node with local storage and external storage, connected through a switch to compute nodes 1, 2, …, n. © Wilson Rivera]

Cluster Head Node
• Head node
  – Network interface cards (NICs): one connecting to the public network and the other connecting to the internal cluster network
  – Local storage is attached to the head node for administrative purposes such as accounting management and maintenance services

Cluster Interconnection Network
• The interconnection of the cluster depends upon both application and budget constraints
  – Small clusters typically have PC-based nodes connected through a Gigabit Ethernet network
  – Large-scale production clusters may be made of 1U or 2U servers or blade servers connected through either
    • a Gigabit Ethernet network (server farm), or
    • a high performance computing network (high performance computing cluster)
      – InfiniBand
      – Quadrics
      – Myrinet
      – Omni-Path (Intel)
• A small MPI ping-pong sketch for comparing interconnect latencies appears at the end of these notes.

Cluster Storage
• Storage Area Network (SAN)
  – Storage devices appear as locally attached to the operating system
• Network Attached Storage (NAS)
  – Distributed file-based protocols
    • Parallel Virtual File System (PVFS)
    • General Parallel File System (GPFS)
    • Hadoop Distributed File System (HDFS)
    • Lustre
    • CERN-VM-FS

Cluster Software
• [Figure: layered cluster software stack — Cluster Resource Manager (scheduler, monitor, analyzer) on top of Cluster Tools and Libraries (communication, compiler, optimization) on top of Cluster Infrastructure (operating system, services). © Wilson Rivera]

Top500.org

History of Performance
• [Figure-only slide; source: Exascale Computing and Big Data]

Projected Performance
• [Figure: Top500 projected performance over time — the sum of all 500 systems (SUM), the #1 system (N=1), and the #500 system (N=500); vertical axis from 100 Mflop/s to 100 Pflop/s.]

#1 TAIHULIGHT @ CHINA
• June 2017
• National Supercomputing Center in Wuxi
• SW26010 processors developed by NRCPC
• 40,960 nodes
• 10,649,600 cores
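To make the interconnect discussion concrete, here is the small MPI ping-pong sketch referenced in the interconnect slide: two processes bounce a short message back and forth and time the round trips, which is the usual way to compare, say, Gigabit Ethernet against InfiniBand or Omni-Path. The message size and repetition count are illustrative.

```cpp
// pingpong.cpp — round-trip timing between two ranks to estimate interconnect latency.
// Build/run (e.g.): mpicxx -O2 pingpong.cpp -o pingpong && mpirun -np 2 ./pingpong
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;                 // illustrative repetition count
    std::vector<char> buf(8);              // small message -> latency-dominated
    const int count = static_cast<int>(buf.size());

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), count, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), count, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::printf("average one-way latency: %.2f us\n",
                    (t1 - t0) / (2.0 * reps) * 1e6);

    MPI_Finalize();
    return 0;
}
```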
