Lecture 1: Architectures

Dr. Wilson Rivera

ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline

• Goal: Understand parallel computing fundamental concepts
  – HPC challenges
  – Flynn's Taxonomy
  – Memory Access Models
  – Multi-core Processors
  – Graphics Processing Units
  – Cluster Infrastructures
  – Cloud Infrastructures

HPC Challenges

• Physics of high-temperature superconducting cuprates
• Protein structure and function for cellulose-to-ethanol conversion
• Global simulation of CO2 dynamics
• Optimization of plasma heating systems for fusion experiments
• Fundamental instability of supernova shocks
• Next-generation combustion devices burning alternative fuels

Slide source: Thomas Zaharia

HPC Challenges

[Figure: Overnight CFD computational loads versus available computational capacity [Flop/s], 1980-2030, from 1 Giga (10^9) through 1 Tera (10^12), 1 Peta (10^15), and 1 Exa (10^18) to 1 Zeta (10^21). The number of overnight cases grows as modeling fidelity moves from unsteady RANS toward LES, and the milestones progress from CFD-based loads and HQ, HS design, and aero optimisation with CFD-CSM, to full MDO and real-time CFD-based in-flight simulation. "Smart" use of HPC power: algorithms, data mining, knowledge. Capability achieved during one night batch. Courtesy AIRBUS France.]

HPC Challenges

High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

HPC Challenges

https://computation.llnl.gov/casc/projects/.../climate_2007F.pdf

Flynn's Taxonomy

[Figure: Flynn's taxonomy as a 2x2 grid over instruction and data streams: SISD, SIMD, MISD, MIMD.]

Flynn's Taxonomy

• Single Instruction, Multiple Data (SIMD)

– All processing units execute the same instruction at any given clock cycle
– Best suited for problems with a high degree of regularity
  • Image processing
– Good examples (see the intrinsics sketch below)
  • SSE = Streaming SIMD Extensions (SSE, SSE2)
  • Intel MIC (Xeon Phi)
  • Graphics Processing Units (GPU)
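A minimal sketch of the SIMD idea using SSE intrinsics in C (an illustration, not from the slides; it assumes an x86 compiler with SSE support):

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, ... */

/* Add two float arrays four elements at a time (n assumed to be a multiple of 4). */
static void add_simd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 data elements */
        _mm_storeu_ps(&c[i], vc);
    }
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, c[8];
    add_simd(a, b, c, 8);
    printf("c[0] = %.1f, c[7] = %.1f\n", c[0], c[7]);   /* both 9.0 */
    return 0;
}
```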

Flynn's Taxonomy

• Multiple Instruction, Multiple Data (MIMD)

– Every processing unit may be executing a different instruction stream and working with a different data stream (a minimal MPI sketch follows below)
  • Examples: clusters and multicore computers
  • In practice, MIMD architectures may also include SIMD execution sub-components
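A minimal MPI sketch of MIMD execution (an illustration assuming an MPI installation; compile with mpicc and launch with mpirun): each rank can follow its own instruction stream over its own data.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Different ranks execute different code paths (MIMD). */
    if (rank == 0)
        printf("rank 0 of %d: coordinating work\n", size);
    else
        printf("rank %d of %d: computing a partition of the data\n", rank, size);

    MPI_Finalize();
    return 0;
}
```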

Memory Access Models

• Shared Memory
• Distributed Memory
• Hybrid Distributed-Shared Memory

Shared Memory

[Figure: Shared-memory architecture: several CPUs, each with its own L2 cache, connected through an interconnect and I/O bus to a single shared memory.]

Shared Memory

• Multiple processors can operate independently but share the same memory resources
  – Changes in a memory location effected by one processor are visible to all other processors

• Two main classes based upon memory access times
  – Uniform Memory Access (UMA)
    • Symmetric Multi-Processors (SMPs)
  – Non-Uniform Memory Access (NUMA)

• Main disadvantage is the lack of scalability between memory and CPUs
  – Adding more CPUs geometrically increases traffic on the shared memory-CPU path
(A minimal OpenMP sketch of the shared-memory model follows.)
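A minimal OpenMP sketch of the shared-memory model (an illustration assuming a compiler with OpenMP support, e.g. gcc -fopenmp): all threads operate on the same array, so an update by one thread is visible to the others.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];          /* one array, shared by all threads */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 2.0 * i;          /* each thread updates part of the shared array */
        sum += x[i];
    }

    printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```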

Shared Memory

• Memory hierarchy tries to exploit locality
  – Cache hit: in-cache memory access (cheap)
  – Cache miss: non-cache memory access (expensive)
(A loop-ordering sketch of this effect follows.)
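A small C sketch of the locality point above (illustrative only): traversing a C matrix row by row touches contiguous memory and mostly hits the cache, while traversing it column by column tends to miss.

```c
#include <stdio.h>

#define N 2048
static double a[N][N];   /* row-major in C: a[i][0..N-1] are contiguous */

int main(void)
{
    double s_rows = 0.0, s_cols = 0.0;

    /* Row-major traversal: consecutive accesses are adjacent in memory (mostly cache hits). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s_rows += a[i][j];

    /* Column-major traversal: each access jumps N*sizeof(double) bytes (many cache misses). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s_cols += a[i][j];

    printf("%f %f\n", s_rows, s_cols);   /* same result, very different memory behavior */
    return 0;
}
```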

Distributed Memory

[Figure: Distributed-memory architecture: each CPU has its own local memory (M) and L2 cache, and the nodes communicate over a network with attached I/O.]

Distributed Memory

• Processors have their own local memory
• When a processor needs access to data in another processor
  – It is usually the task of the programmer to explicitly define how and when data is communicated (see the MPI sketch below)
• Examples: Cray XT4, Clusters, Cloud
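A minimal MPI sketch of explicit communication in the distributed-memory model (an illustration assuming at least two ranks): data moves only when the programmer sends and receives it.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        local = 3.14;                      /* data lives in rank 0's local memory */
        MPI_Send(&local, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f from rank 0\n", local);
    }

    MPI_Finalize();
    return 0;
}
```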

Hybrid (Distributed-Shared) Memory

In practice we have hybrid memory access: shared-memory nodes connected through a network (a hybrid MPI+OpenMP sketch follows).

[Figure: Hybrid distributed-shared memory: multiple shared-memory nodes joined by a network.]
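A minimal hybrid sketch (an illustration assuming both MPI and OpenMP are available): MPI handles communication between distributed-memory nodes, while OpenMP threads share memory within each rank.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for an MPI thread level compatible with calling MPI from the master thread only. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Threads within a rank share that node's memory. */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```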

Parallel computing trends

• Multi-core processors – Instead of building processors with faster clock speeds, modern computer systems are being built using chips with an increasing number of processor cores

• Graphics Processor Unit (GPU)
  – General-purpose computing and, in particular, data-parallel high performance computing

• Dynamic approach to cluster computing provisioning
  – Instead of offering a fixed software environment, the application provides information to the scheduler about what type of resources it needs, and the nodes are automatically provisioned for the user at run-time
    • Platform ISF Adaptive Cluster
    • Moab Adaptive Operating Environment

• Large scale commodity computer data centers (cloud) – Amazon EC2, Eucalyptus, Google App Engine

Multi-cores and Moore's Law

[Figure: Moore's Law trend: circuit complexity doubles every 18 months; the power wall (2004). Sources: Intel; The National Academies Press, Washington, DC, 2011.]

Power Wall

• The transition to multi-core processors is not a breakthrough in architecture; it is a result of the need to build power-efficient chips

Power Density Limits Serial Performance

Many-cores (Graphics Processor Units)

• Graphics Processor Units (GPUs)

– Throughput-oriented devices designed to provide high aggregate performance for independent computations
  • They prioritize high-throughput processing of many parallel operations over the low-latency execution of a single task

– GPUs do not use independent instruction decoders
  • Instead, groups of processing units share an instruction decoder; this maximizes the number of arithmetic units per die area (a data-parallel offload sketch follows below)
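As an illustration of the throughput-oriented, data-parallel style, here is a sketch in C using OpenMP target offload; this is an assumption on my part (it needs a compiler built with device-offload support) and is only one of several ways to program a GPU, alongside CUDA and OpenCL.

```c
#include <stdio.h>

#define N (1 << 20)

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Offload the loop to the device (e.g., a GPU): many simple threads,
       each doing a small independent piece of work. */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:N], b[0:N]) map(from: c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```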

Multi-Core vs. Many-Core

• Multi-core processors (minimize latency)
  – MIMD
  – Each core optimized for executing a single thread
  – Lots of big on-chip caches
  – Extremely sophisticated control

• Many-core processors (maximize throughput)
  – SIMD
  – Cores optimized for aggregate throughput
  – Lots of ALUs
  – Simpler control

CPUs: Latency Oriented Design

• Large caches
  – Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALUs
  – Reduced operation latency

[Figure: CPU layout: a few powerful ALUs, a large control unit, a large cache, and DRAM.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign

GPUs: Throughput Oriented Design

• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy-efficient ALUs
  – Many, long-latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies

[Figure: GPU layout: many simple ALUs, small caches, and DRAM.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign

Multi-Core vs. Many-Core

[Figure: Peak GFLOPs from 2002 to 2009: NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) climb toward roughly 1,200 GFLOPs, while Intel CPUs (3 GHz Pentium 4, 3 GHz dual-core Core2 Duo, 3 GHz quad-core Xeon, Westmere) remain far lower.]

Intel® Xeon® Processor E7-8894 v4

• 24 cores / 48 threads
• 2.40 GHz
• 14 nm
• 60 MB cache
• $8k (July 2017)

NVIDIA TITAN Xp

• 3840 cores
• 1.6 GHz
• Pascal architecture
• Peak = 12 TF/s
• $1.5K

Cluster Hardware Configuration

[Figure: Cluster hardware configuration: a head node with external and local storage, connected through a switch to compute nodes 1 through n. © Wilson Rivera]

Cluster Head Node

• Head Node
  – Network interface cards (NICs): one connecting to the public network and the other connecting to the internal cluster network
  – Local storage is attached to the head node for administrative purposes such as accounting management and maintenance services

Cluster Interconnection Network

• The interconnection of the cluster depends upon both application and budget constraints
  – Small clusters typically have PC-based nodes connected through a Gigabit Ethernet network
  – Large-scale production clusters may be made of 1U or 2U servers or blade servers connected through either
    • A Gigabit Ethernet network (server farm), or
    • A high performance computing network (high performance computing cluster)
      – InfiniBand
      – Quadrics
      – Myrinet
      – Omni-Path (Intel)

Cluster Storage

• Storage Area Network (SAN)
  – Storage devices appear as locally attached to the operating system

• Network Attached Storage (NAS)
  – Distributed file-based protocols
    • Parallel Virtual File System (PVFS)
    • General Parallel File System (GPFS)
    • Hadoop Distributed File System (HDFS)
    • Lustre
    • CERN-VM-FS

Cluster Software

[Figure: Cluster software stack: a cluster resource manager (scheduler, monitor, analyzer) on top of cluster tools and libraries (communication, compiler, optimization), on top of the cluster infrastructure (operating system, services).]

© Wilson Rivera

Top500.org

History of Performance

From: Exascale Computing and Big Data

Projected Performance

[Figure: Top500 projected performance development on a log scale from 100 Mflop/s to 100 Pflop/s, showing the total list performance (SUM), the #1 system (N=1), and the #500 system (N=500).]

#1 TAIHULIGHT @ CHINA

• June 2017
• National Supercomputing Center in Wuxi
• SW26010 processors developed by NRCPC
• 40,960 nodes
• 10,649,600 cores
• Rpeak = 125 PF/s
• Rmax = 93 PF/s
• 15,371 kW

Cloud Computing

• Cloud computing allows scaling on demand without building or provisioning a data center
  – Computing resources available on demand (self-service)
  – Charging only for resources utilized (pay-as-you-go)

• Worldwide revenue from public IT cloud services exceeded $21.5 billion in 2010
  – Projected to reach $72.9 billion in 2015
  – A compound annual growth rate (CAGR) of 27.6%

http://www.idc.com/prodserv/idc_cloud.jsp

Cloud versus Grid

• Grids
  – Sharing and coordination of distributed resources
  – Grid middleware
    • Globus, gLite
• Clouds
  – Leverage virtualization to maximize resource utilization
  – Cloud middleware
    • IaaS, PaaS, SaaS

Layered cloud model

From: K Chen, Wright University

Cloud Layers

– Infrastructure as a Service (IaaS)
  • Flexible in terms of the applications to be hosted
  • Amazon EC2, RackSpace, Nimbus, Eucalyptus

– Platform as a Service (PaaS)
  • Application domain-specific platforms
  • Google App Engine, MS Azure, Heroku

– Software as a Service (SaaS)
  • Service domain-specific
  • Salesforce, NetSuite

Cloud Economics

• Pay by use instead of provisioning for peak

[Figure: Resources versus time for a static data center, where fixed capacity leaves unused resources when demand is low and falls short at peaks, compared with a data center in the cloud, where capacity tracks demand. From: K Chen, Wright University]

Cloud Economics

• Setup
  – A peak period needs 10 servers to serve requests
  – Assume your service is going to run for 1 year
• Private cluster: one-time investment
  – Servers: $1,500 x 10 = $15,000
  – Power/AC: about $200/year/server => $2,000
  – Administrator: $50,000
• Public cloud
  – Rush hours: 10 hours/day, needing 10 nodes/hour
  – Other hours: 14 hours/day needing 2 nodes/hour
  – Total: 128 node-hours x $0.10/node-hour = $12.80/day
  – One-year cost = $4,672
(The arithmetic is worked through in the sketch below.)
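A small C sketch of the arithmetic behind this comparison (the prices and node counts are the figures assumed on the slide, not general data):

```c
#include <stdio.h>

int main(void)
{
    /* Public cloud: 10 rush hours/day at 10 nodes, 14 off-peak hours/day at 2 nodes. */
    double node_hours_per_day = 10 * 10 + 14 * 2;          /* 128 node-hours */
    double cloud_per_day  = node_hours_per_day * 0.10;     /* $0.10 per node-hour => $12.80 */
    double cloud_per_year = cloud_per_day * 365;           /* about $4,672 */

    /* Private cluster (one year): servers + power/AC + administrator. */
    double cluster = 1500.0 * 10 + 200.0 * 10 + 50000.0;   /* $67,000 */

    printf("cloud: $%.2f/day, $%.0f/year; private cluster: $%.0f/year\n",
           cloud_per_day, cloud_per_year, cluster);
    return 0;
}
```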

Cloud Economics

• Amazon EC2 pricing, Google App Engine pricing, Hadoop sizing
• How much to rent?
  – An 8-core VM with 30 GB of RAM (3.75 GB per core): $1.16/hour
  – 600,000 cores => 75,000 VMs
  – $87,000/hour, about $2 million per day
(See the sketch below for the arithmetic.)
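The rental arithmetic spelled out as a small C sketch, using the slide's assumed price of $1.16/hour for an 8-core VM:

```c
#include <stdio.h>

int main(void)
{
    int    cores_needed   = 600000;
    int    cores_per_vm   = 8;                      /* 8-core VM, 30 GB RAM */
    double price_per_hour = 1.16;                   /* per VM */

    int    vms      = cores_needed / cores_per_vm;  /* 75,000 VMs */
    double per_hour = vms * price_per_hour;         /* $87,000/hour */
    double per_day  = per_hour * 24;                /* roughly $2 million/day */

    printf("%d VMs, $%.0f/hour, $%.0f/day\n", vms, per_hour, per_day);
    return 0;
}
```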

Data analytics Ecosystem

From: Exascale Computing and Big Data

Summary

• Parallel computing infrastructure trends
  – Multi-core Processors
    • A result of the need to build power-efficient chips
  – Graphics Processor Units
    • Throughput-oriented devices designed to provide high aggregate performance for independent computations
  – Cluster Infrastructures
    • Head node; interconnection; storage; software
  – Cloud Infrastructures
    • Physical resources; virtual resources; infrastructure services; application services

Scientific Computing Terminology

• HPC System: A "High Performance Computing" (HPC) computer – computers connected through a high-speed interconnect and configured for scientific computing.
• Interconnect: The wiring, chips, and software that connects computing components.
• Node (blade, sled, etc.): An independent computing unit of an HPC system. The unit has its own operating system (OS) and memory. The physical cases of a node are often called blades and sleds.
• Chassis: Nodes are often aggregated into a chassis (with a backplane) for sharing electrical power, cooling, and a local interconnect.

Terminology (continued)

• Chip or Die: Self-contained circuits on a single piece of media of size ~20 mm x 20 mm, containing up to ~1 billion transistors.
• Socket: Provides a connection between a chip and a motherboard.
• CPU (or processor?): A Central Processor Unit, consisting of a chip or die (often called a processor). Modern CPUs contain multiple cores.
• Core: A unit within a CPU that can execute a code's instructions independently while other cores execute a different code's instructions.
• Hyper-Threading: A single core can have additional circuitry that allows two or more instruction streams (threads) to proceed through a single core "simultaneously". Hyper-Thread is an Intel trademark for 2 threads; Xeon Phi supports 4 threads.