Lecture 1: Parallel Computing Architectures
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline
• Goal: Understand the fundamental concepts of parallel computing
– HPC challenges
– Flynn's Taxonomy
– Memory Access Models
– Multi-core Processors
– Graphics Processor Units
– Cluster Infrastructures
– Cloud Infrastructures
HPC Challenges
[Image montage of application drivers:]
– Physics of high-temperature superconducting cuprates
– Protein structure and function for cellulose-to-ethanol conversion
– Global simulation of CO2 dynamics
– Optimization of plasma heating systems for fusion experiments
– Fundamental instability of supernova shocks
– Next-generation combustion devices burning alternative fuels
Slide source: Thomas Zaharia

HPC Challenges
[Chart: available overnight computational capacity for aircraft CFD, 1980–2030. The left axis shows the number of overnight loads cases run (10^2 to 10^6); the right axis shows available capacity, from 1 Giga (10^9) Flop/s through 1 Tera (10^12), 1 Peta (10^15), and 1 Exa (10^18) up to 1 Zeta (10^21) Flop/s. Simulation methods progress from low-speed RANS through high-speed RANS and unsteady RANS to LES, with capability milestones from CFD-based loads and handling qualities through full MDO to real-time CFD in flight; capability is defined as what can be achieved during one overnight batch. "Smart" use of HPC power relies on algorithms, data mining, and knowledge. Courtesy AIRBUS France.]

HPC Challenges
High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

HPC Challenges
https://computation.llnl.gov/casc/projects/.../climate_2007F.pdf
Flynn's Taxonomy
[Diagram: Flynn's taxonomy as a 2×2 grid. Crossing the instruction-stream axis with the data-stream axis yields four classes: SISD, SIMD, MISD, and MIMD.]

Flynn's Taxonomy
• Single Instruction, Multiple Data (SIMD)
– All processing units execute the same instruction at any given clock cycle
– Best suited for problems with a high degree of regularity, such as image processing
– Good examples (see the sketch below):
• SSE (Streaming SIMD Extensions), SSE2
• Intel MIC (Xeon Phi)
• Graphics Processing Units (GPUs)
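To make the SIMD idea concrete, here is a minimal C sketch (an illustration added here, not from the original slides) using SSE intrinsics: a single addps instruction adds four pairs of floats at once. Compile with gcc -msse (SSE is the default on x86-64).

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction adds all 4 lanes */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}
```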
Flynn's Taxonomy
• Multiple Instruction, Multiple Data (MIMD)
– Every processing unit may execute a different instruction stream and work with a different data stream.
– Examples: clusters and multicore computers.
– In practice, MIMD architectures may also include SIMD execution sub-components.
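A minimal MIMD-style sketch in C (an illustration added here, not from the slides): two POSIX threads execute different instruction streams on different data at the same time. Compile with gcc -pthread.

```c
#include <pthread.h>
#include <stdio.h>

/* Two different instruction streams operating on two different data sets. */
static void *sum_task(void *arg) {
    int *d = arg, s = 0;
    for (int i = 0; i < 4; i++) s += d[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *max_task(void *arg) {
    int *d = arg, m = d[0];
    for (int i = 1; i < 4; i++) if (d[i] > m) m = d[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {7, 5, 9, 6};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, a);  /* one instruction stream... */
    pthread_create(&t2, NULL, max_task, b);  /* ...and a different one    */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```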
Memory Access Models
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory
Shared Memory
[Diagram: several CPUs, each with a private L2 cache, connected through a bus interconnect to a single shared memory and I/O.]

Shared Memory
• Multiple processors can operate independently while sharing the same memory resources, so that changes in a memory location effected by one processor are visible to all other processors (see the OpenMP sketch below).
• Two main classes, based upon memory access times:
– Uniform Memory Access (UMA), e.g., Symmetric Multiprocessors (SMPs)
– Non-Uniform Memory Access (NUMA)
• The main disadvantage is the lack of scalability between memory and CPUs: adding more CPUs geometrically increases traffic on the shared memory-CPU path.
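The following OpenMP sketch illustrates the shared-memory model (a minimal example, assuming an OpenMP-capable compiler; build with gcc -fopenmp): all threads see the same array, and the reduction clause safely combines their partial sums.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000
static double x[N];   /* one array, visible to every thread */

int main(void) {
    double sum = 0.0;

    /* The runtime splits the iterations among threads; all of them
       read and write the same shared array x. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 2.0 * i;   /* each thread fills its own slice      */
        sum += x[i];      /* per-thread partials combined safely  */
    }

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```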
Shared Memory
• The memory hierarchy tries to exploit locality
– Cache hit: memory access served from cache (cheap)
– Cache miss: memory access that must go beyond the cache (expensive)
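A quick way to feel the cost of cache misses (an illustrative sketch; absolute timings will vary by machine): traversing the same C array in row-major order reuses cache lines, while column-major order misses on nearly every access.

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static double a[N][N];   /* ~128 MB, far larger than any cache */

int main(void) {
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* row-major: consecutive  */
        for (int j = 0; j < N; j++)      /* elements share a cache  */
            a[i][j] += 1.0;              /* line -> mostly hits     */
    clock_t t1 = clock();

    for (int j = 0; j < N; j++)          /* column-major: each access */
        for (int i = 0; i < N; i++)      /* jumps N*8 bytes ahead ->  */
            a[i][j] += 1.0;              /* mostly misses             */
    clock_t t2 = clock();

    printf("row-major: %.2f s, column-major: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```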
Distributed Memory
[Diagram: four nodes, each a CPU with a private L2 cache and its own local memory (M), connected through a network with I/O.]

Distributed Memory
• Processors have their own local memory.
• When a processor needs data that resides in another processor's memory, it is usually the task of the programmer to explicitly define how and when the data is communicated (see the MPI sketch below).
• Examples: Cray XT4, clusters, clouds
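The MPI sketch below shows explicit communication in the distributed-memory model (illustrative, not from the slides): rank 0's data is invisible to rank 1 until it is sent over the network. Run with mpirun -np 2.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double value = 3.14;   /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value;          /* must be received explicitly */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.2f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```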
Hybrid (Distributed-Shared) Memory
In practice we have hybrid memory access.

[Diagram: several shared-memory nodes connected through a NETWORK, combining shared memory within each node with distributed memory across nodes.]
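A minimal hybrid sketch (illustrative; assumes an MPI installation with OpenMP support, built with mpicc -fopenmp): MPI spans the distributed-memory nodes while OpenMP threads share memory within each node.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);           /* e.g., one MPI process per node   */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel              /* threads share the node's memory  */
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```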
Parallel computing trends
• Multi-core processors
– Instead of building processors with faster clock speeds, modern computer systems are being built using chips with an increasing number of processor cores.
• Graphics Processor Units (GPUs)
– General-purpose computing, and in particular data-parallel high performance computing.
• Dynamic approach to cluster provisioning
– Instead of offering a fixed software environment, the application provides information to the scheduler about what type of resources it needs, and the nodes are automatically provisioned for the user at run-time.
– Examples: Platform ISF Adaptive Cluster; Moab Adaptive Operating Environment
• Large-scale commodity computer data centers (cloud)
– Amazon EC2, Eucalyptus, Google App Engine
Multi-cores and Moore's Law
[Figure: Moore's law scaling. Circuit complexity doubles every 18 months, but clock speeds flatten at the power wall (2004). Sources: Intel; The National Academies Press, Washington, DC, 2011.]
Power Wall
• The transition to multi-core processors is not a breakthrough in architecture; it is actually the result of the need to build power-efficient chips.
Power Density Limits Serial Performance
Many-cores (Graphics Processor Units)
• Graphics Processor Units (GPUs)
– Throughput-oriented devices designed to provide high aggregate performance for independent computations
• They prioritize high-throughput processing of many parallel operations over low-latency execution of a single task.
– GPUs do not use independent instruction decoders
• Instead, groups of processing units share an instruction decoder; this maximizes the number of arithmetic units per die area.
Multi-Core vs. Many-Core
• Multi-core processors (minimize latency)
– MIMD
– Each core optimized for executing a single thread
– Lots of big on-chip caches
– Extremely sophisticated control
• Many-core processors (maximize throughput)
– SIMD
– Cores optimized for aggregate throughput
– Lots of ALUs
– Simpler control
CPUs: Latency Oriented Design
• Large caches
– Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALUs
– Reduced operation latency

[Diagram: a CPU die dominated by control logic and cache, with a few powerful ALUs, attached to DRAM.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign
GPUs: Throughput Oriented Design
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many ALUs; long latency, but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies

[Diagram: a GPU die dominated by many small ALUs, with small caches, attached to DRAM.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign
Multi-Core vs. Many-Core
[Chart: peak GFLOP/s, 2002–2009. NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) climb from near zero to above 1,000 GFLOP/s, while Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon Westmere) remain far below.]
Intel® Xeon® Processor E7-8894 v4
• 24 cores / 48 threads
• 2.40 GHz
• 14 nm process
• 60 MB cache
• $8k (July 2017)
NVIDIA TITAN Xp
• 3840 cores
• 1.6 GHz
• Pascal architecture
• Peak = 12 TF/s
• $1.5k
Cluster Hardware Configuration
[Diagram: a head node with local storage, connected to a switch that links compute nodes 1 through n and external storage. © Wilson Rivera]
Cluster Head Node
• Head Node
– Two network interface cards (NICs): one connecting to the public network and the other connecting to the internal cluster network.
– Local storage is attached to the head node for administrative purposes such as accounting management and maintenance services.
Cluster Interconnection Network
• The interconnection network of a cluster depends upon both application and budget constraints.
– Small clusters typically have PC-based nodes connected through a Gigabit Ethernet network
– Large-scale production clusters may be made of 1U or 2U servers or blade servers connected through either
• a Gigabit Ethernet network (server farm), or
• a high performance computing network (high performance computing cluster):
– InfiniBand
– Quadrics
– Myrinet
– Omni-Path (Intel)
Cluster Storage
• Storage Area Network (SAN)
– Storage devices appear as locally attached to the operating system.
• Network Attached Storage (NAS)
– Distributed, file-based protocols (see the MPI-IO sketch below):
• Parallel Virtual File System (PVFS)
• General Parallel File System (GPFS)
• Hadoop Distributed File System (HDFS)
• Lustre
• CernVM-FS
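As a hedged illustration of how applications typically reach such parallel file systems, the MPI-IO sketch below has every rank write its own disjoint region of one shared file (the file name out.dat is arbitrary).

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = rank;
    MPI_File fh;
    /* Every rank opens the same file on the shared file system... */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* ...and writes at its own disjoint offset, so no locking is needed. */
    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(int), &value, 1,
                      MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```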
Cluster Software
[Diagram: layered cluster software stack.]
• Cluster Resource Manager: scheduler, monitor, analyzer
• Cluster Tools and Libraries: communication, compiler, optimization
• Cluster Infrastructure: operating system, services
© Wilson Rivera

Top500.org
History of Performance
Source: Exascale Computing and Big Data

Projected Performance
[Chart: Top500 projected performance on a log scale from 100 Mflop/s to 100 Pflop/s, with trend lines for the list total (SUM), the #1 system (N=1), and the #500 system (N=500).]

#1 TAIHULIGHT @ CHINA
• June 2017
• National Supercomputing Center in Wuxi
• SW26010 processors developed by NRCPC
• 40,960 nodes
• 10,649,600 cores
• Rpeak = 125 PF/s
• Rmax = 93 PF/s
• 15,371 kW
Cloud Computing
• Cloud computing allows scaling on demand without building or provisioning a data center
– Computing resources available on demand (self-service)
– Charging only for resources utilized (pay-as-you-go)
• Worldwide revenue from public IT cloud services exceeded $21.5 billion in 2010
– IDC projected it would reach $72.9 billion in 2015, a compound annual growth rate (CAGR) of 27.6%.
http://www.idc.com/prodserv/idc_cloud.jsp

Cloud versus Grid
• Grids
– Sharing and coordination of distributed resources
– Grid middleware: Globus, UNICORE, gLite
• Clouds
– Leverage virtualization to maximize resource utilization
– Cloud middleware: IaaS, PaaS, SaaS
Layered cloud model
From: K. Chen, Wright University

Cloud Layers
• Infrastructure as a Service (IaaS)
– Flexible in terms of the applications to be hosted
– Amazon EC2, RackSpace, Nimbus, Eucalyptus
• Platform as a Service (PaaS)
– Application domain-specific platforms
– Google App Engine, MS Azure, Heroku
• Software as a Service (SaaS)
– Service domain-specific
– Salesforce, NetSuite
Cloud Economics
• Pay by use instead of provisioning for peak

[Diagram: two resources-versus-time plots. In a static data center, capacity is fixed while demand fluctuates below it, leaving unused resources; in the cloud, capacity tracks demand. From: K. Chen, Wright University.]
• Setup:
– A peak period needs 10 servers to process requests
– Assume your service is going to run for 1 year
• Private cluster:
– Servers (one-time investment): $1,500 × 10 = $15,000
– Power/AC: about $200/year/server => $2,000
– Administrator: $50,000
• Public cloud:
– Rush hours: 10 hours/day, needing 10 nodes/hour
– Other hours: the remaining 14 hours/day need 2 nodes/hour
– Total: 10×10 + 14×2 = 128 node-hours/day; 128 × $0.10/node-hour = $12.80/day
– One-year cost: $12.80 × 365 ≈ $4,672

Cloud Economics
• Amazon EC2 pricing
• Google App Engine pricing
• Hadoop sizing
• How much does it cost to rent a supercomputer?
– 8-core VM with 30 GB of RAM (3.75 GB per core) at $1.16/hour
– 600,000 cores => 75,000 VMs
– 75,000 × $1.16 ≈ $87,000/hour
– ≈ $2 million per day
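The slide's arithmetic, restated as a tiny C program (the prices are the slide's illustrative numbers, not current EC2 rates):

```c
#include <stdio.h>

int main(void) {
    double vm_price   = 1.16;     /* $/hour for one 8-core, 30 GB VM */
    int    vm_cores   = 8;
    long   cores_need = 600000;

    long   vms      = cores_need / vm_cores;   /* 75,000 VMs      */
    double per_hour = vms * vm_price;          /* ~$87,000 / hour */
    double per_day  = per_hour * 24.0;         /* ~$2.1 M / day   */

    printf("VMs: %ld  per hour: $%.0f  per day: $%.0f\n",
           vms, per_hour, per_day);
    return 0;
}
```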
Data Analytics Ecosystem
Source: Exascale Computing and Big Data

Summary
• Parallel computing infrastructure trends
– Multi-core processors: a result of the need to build power-efficient chips
– Graphics Processor Units: throughput-oriented devices designed to provide high aggregate performance for independent computations
– Cluster infrastructures: head node; interconnection; storage; software
– Cloud infrastructures: physical resources; virtual resources; infrastructure services; application services
Scientific Computing Terminology
Term – Definition
• "High Performance Computing" (HPC) System/Computer – Computers connected through a high-speed interconnect and configured for scientific computing.
• Interconnect – The wiring, chips, and software that connect computing components.
• Node (blade, sled, etc.) – An independent computing unit of an HPC system; it has its own operating system (OS) and memory. The physical cases of a node are often called blades and sleds.
• Chassis – Nodes are often aggregated into a chassis (with a backplane) to share electrical power, cooling, and a local interconnect.

Terminology (continued)
Term – Definition
• Chip or Die – Self-contained circuits on a single medium of size ~20 mm × 20 mm, containing up to ~1 billion transistors.
• Socket – Provides the connection between a chip and a motherboard.
• CPU (or processor) – A Central Processing Unit, consisting of a chip or die (often called a processor). Modern CPUs contain multiple cores.
• Core – An execution unit within a CPU that can execute one code's instructions independently while other cores execute a different code's instructions.
• Hyper-Threading – A single core can have additional circuitry that allows two or more instruction streams (threads) to proceed through a single core "simultaneously". Hyper-Thread is an Intel trademark for 2 threads; the Xeon Phi coprocessor supports 4 threads.