Introduction to Technology

November 2015

1. OVERVIEW

Supercomputer Architecture

• A supercomputer consists of many computers (nodes) connected together by a fast network (interconnect)
• Performance is determined by
  • the capabilities of individual nodes
  • the ability of the nodes to communicate
• The architectural features of the nodes and interconnect set upper limits on the performance of a program running on the system
  • the characteristics of the program will determine how well it makes use of the available hardware resources

2. CENTRAL PROCESSING UNIT (CPU)

Central Processing Unit (CPU)

• The processor that carries out the instructions of a computer program

• Evolved from the arithmetic logic unit and control unit of the Von Neumann architecture

• Modern CPUs have much more complexity
  • additional functionality to improve performance
  • integrated additional roles that were traditionally external to the CPU

• It is important to understand the architectural features of the CPU in order to make the best use of them when writing and executing programs

• On a Linux system, relevant information is typically found in /proc/cpuinfo

Von Neumann Machine (serial processor)

• Main Memory: stores data and instructions
• Arithmetic Logic Unit (ALU): performs operations on data
• Control Unit (CU): interprets and causes execution of instructions
• Input/Output: allows transfer of data into and out of the machine

Exercise: Acquiring CPU Information

• Write a SLURM script to output the contents of the /proc/cpuinfo file on a compute node
  • what is the model name of the processor?
  • which vendor produced the processor?
• Find the processor specification on the vendor website
  • hint: perform a search using the model name

#!/bin/bash -l
#SBATCH --account=courses01
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

aprun -n 1 -N 1 cat /proc/cpuinfo > cpuinfo.txt
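A minimal sketch of using the script, assuming it has been saved as cpuinfo.slurm (a hypothetical filename); the grep patterns below match the standard model name and vendor_id fields of /proc/cpuinfo:

sbatch cpuinfo.slurm

# once the job has completed, answer the exercise questions from the output
grep 'model name' cpuinfo.txt | head -n 1
grep 'vendor_id' cpuinfo.txt | head -n 1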

Vector Instructions

• Vector instructions allow a processor to perform multiple arithmetic operations in a single clock cycle

• They utilise Single Instruction Multiple Data (SIMD) parallelism
  • the same instruction is applied to different data elements
  • the data values must be consecutive in memory

• The vector length supported by these instructions is increasing with newer architectures

Vector Instructions

• Check the flags in /proc/cpuinfo to determine which vector instructions are supported, by referring to the cpuinfo.txt created previously

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid

• Can also check official specification on the vendor website:

• What vector instructions are supported on the processors we are using? (see the example below)
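As a rough sketch, the SIMD-related flags can be filtered out of the output collected earlier; the flag names listed below are common SSE/AVX variants and the exact set present will depend on the processor:

# extract the SIMD flags from the first processor entry in cpuinfo.txt
grep -m 1 '^flags' cpuinfo.txt | tr ' ' '\n' | grep -E '^(sse|sse2|ssse3|sse4_1|sse4_2|avx|avx2|avx512f)$' | sort -u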

Simultaneous Multithreading (SMT)

also known as Hyper-threading on Intel Processors

• Appears as two logical processors for each physical processor
• The processor has additional circuitry to hold the state of a second thread, which shares the resources of the physical processor
• In theory, this allows higher utilisation of resources, resulting in faster runtimes
• In practice, it is program dependent
  • works well for some programs
  • sharing of resources causes worse performance for others

Simultaneous Multithreading (SMT)

also known as Hyper-threading on Intel Processors

• SMT can complicate job allocations

• Since double the number of processors is presented to the scheduler, it may pack jobs such that two tasks are run per physical processor

• It is important to specify the number of nodes and tasks per node in the SLURM script to ensure your job is being allocated the resources you are expecting

• On systems with SMT enabled, it may be worth running your workflow with both cases (one and two tasks per physical processor) and comparing performance, as sketched below
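A minimal sketch of the two cases, assuming a hypothetical node with 24 physical cores (48 logical processors with SMT); substitute the counts found in the earlier exercises:

# one task per physical core
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24

# two tasks per physical core, using the SMT logical processors
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48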

Simultaneous Multithreading (SMT)

also known as Hyper-threading on Intel Processors

• To determine if SMT is enabled, compare the number of processor entries to the number of siblings in /proc/cpuinfo
• Processor entries can be counted with the grep and wc utilities

grep processor cpuinfo.txt | wc -l

• Siblings are listed in each processor entry

grep siblings cpuinfo.txt | head -n 1

• Is hyper-threading enabled on the processors we are using? (see the sketch below)
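A rough sketch combining the two checks, assuming cpuinfo.txt from the earlier exercise; it compares the cpu cores and siblings fields of the first processor entry:

CORES=$(grep -m 1 'cpu cores' cpuinfo.txt | awk '{print $NF}')
SIBLINGS=$(grep -m 1 'siblings' cpuinfo.txt | awk '{print $NF}')

# siblings greater than cores suggests two logical processors per core
if [ "$SIBLINGS" -gt "$CORES" ]; then
    echo "SMT appears to be enabled ($SIBLINGS siblings, $CORES cores)"
else
    echo "SMT appears to be disabled"
fi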

Multicore Architecture

• The circuitry for more than one processor can be placed on a single CPU chip

• The processors of a CPU chip are referred to as cores

• The number of cores on CPUs has been growing steadily over the last decade
  • Dual-core, Quad-core, Hex-core, Multicore

Multicore Architecture

• Each core is able to execute processes independently of the other cores, using its own compute resources
• However, the cores do share other resources
  • eg. the memory system, and some levels of caching
• If the cores are working in parallel on a shared program, the efficiency may also be reduced by dependencies

Multicore Architecture

• The number of cores per CPU is listed in /proc/cpuinfo

grep cores cpuinfo.txt | head -n 1

• They are also typically provided on the vendor website

• How many cores are there on the CPUs we are using?

Multi-socket Architecture

• The physical CPU identifier for each processor is listed in /proc/cpuinfo, which can be used to determine the number of physical CPUs on the node

grep 'physical id' cpuinfo.txt | sort -u | wc -l

• How many physical CPUs are there on the nodes we are using?

• The total number of processors should be equal to the result obtained by multiplying the physical CPUs by the cores per CPU, and then doubling if hyper-threading is enabled (see the sketch after this list)

• Most codes work well with one process per core on a node, but there are exceptions
• It is worth testing your code with different configurations before commencing production jobs

• What are the configurations you would try on the nodes we are using?
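As a rough sketch of the arithmetic described above, again using the cpuinfo.txt collected earlier (field names can vary slightly between kernel versions):

SOCKETS=$(grep 'physical id' cpuinfo.txt | sort -u | wc -l)
CORES=$(grep -m 1 'cpu cores' cpuinfo.txt | awk '{print $NF}')
LOGICAL=$(grep -c '^processor' cpuinfo.txt)

echo "sockets: $SOCKETS, cores per socket: $CORES, logical processors: $LOGICAL"
# with hyper-threading enabled, LOGICAL should equal SOCKETS * CORES * 2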

Cache

• A component that temporarily stores data in order to reduce access time
• Widely used for a variety of devices for memory or storage:
  • eg. CPU, Hard Disk, Tape Storage
• Typically, data is transferred on access in blocks, known as cache lines
• Cache Hit: the data being accessed is in the cache
• Cache Miss: the data being accessed is not in the cache
• For read access, cache lines are typically loaded on first access
• Variety of methods for determining which cache line to overwrite
  • least frequently used, least recently used, random
• Similar variety of methods for determining when to forward writes to memory
  • write-through, write-back
• An algorithm's pattern of data access affects the caching efficiency

Cache: Access Patterns

• Sequential access results in a higher rate of cache hits

• Striding access has a low rate of hits; however, modern processors can detect striding access patterns and pre-fetch cache lines

• Random access is typically the worst, with a low rate of hits and no ability to predict subsequent access locations

Cache

• Caches may be shared or exclusive to processor cores
• Cache levels are typically numbered in order of location relative to the core

• There can be multiple levels of cache in modern processors
  • smaller, faster caches close to a core
  • larger, slower caches shared by cores

Exercise: Acquiring Cache Information

• Write a SLURM script to output the result of the lscpu command on a compute node
  • what are the various cache sizes?

#!/bin/bash -l
#SBATCH --account=courses01
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

aprun -n 1 -N 1 lscpu > lscpu.txt
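A minimal sketch of extracting the cache sizes once the job has completed and lscpu.txt exists:

# lscpu reports each cache level on its own line, eg L1d, L1i, L2, L3
grep -i 'cache' lscpu.txt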

3. NODE ARCHITECTURE

Node Components

• Components on a node are managed by the chipset, and are connected via a PCI-Express bus

• A network adapter connects the node to the rest of the supercomputer via the interconnect

• Some systems have local storage on the node, but more commonly remote storage is used

• Some systems utilise accelerators to increase the computational performance of the node

Node Components

• The components are physically attached to a motherboard

Memory

• Memory is the component on a node that stores the data that the processor is actively using, but does not fit in cache.

• It is much larger than the cache, but slower to access by an order of magnitude.

• However it is much faster than larger file storage, such as SSD/HDD/Tape.

Component     Example Latency    Example Capacity
L1 Cache      1 ns               64 kB per core
L2 Cache      3 ns               256 kB per core
L3 Cache      10 ns              20,480 kB
Memory        100 ns             64 GB
SSD Storage   30,000 ns          100 GB
HDD Storage   15,000,000 ns      1 TB

Exercise: Acquiring Memory Information

• Write a SLURM script to output the contents of the /proc/meminfo file on a compute node
  • what is the total memory available on a node?
  • how much is currently unused?
  • how much swap space is there, if any?

• Find the processor specification on the vendor website
  • What information about the memory specification is available?

#!/bin/bash -l
#SBATCH --account=courses01
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

aprun -n 1 -N 1 cat /proc/meminfo > meminfo.txt
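A minimal sketch of answering the exercise questions from the collected output; MemTotal, MemFree and SwapTotal are standard /proc/meminfo fields:

grep -E '^(MemTotal|MemFree|SwapTotal)' meminfo.txt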

Multichannel Memory

• Memory controllers are integrated into modern CPUs

• Multiple channels are used to increase memory bandwidth

• Multiple DIMMs are used to increase capacity per channel
  • also allows the same capacity with two DIMMs

Multiprocessor Memory

• When a node has multiple CPUs, control of the memory is shared between them

• Memory accesses from cores on one processor to the memory controlled by the other processor are carried out via a high-speed bus
  • QuickPath Interconnect (QPI) from Intel
  • HyperTransport (HT) from AMD

Shared Memory Model

• Each processor uses a shared memory with a single address space

• Symmetric Multiprocessor (SMP) or Uniform Memory Access (UMA)
  • Identical memory access times for all memory regions from all processors
• Non-Uniform Memory Access (NUMA)
  • Memory latency varies depending on processor and memory region (see the sketch below)

• Data in multiple caches must remain identical; this cache coherency is built into modern systems
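A minimal sketch of checking the NUMA layout of a node, using the lscpu.txt output collected in the earlier cache exercise; the optional numactl command gives more detail if it is installed:

# lscpu reports the number of NUMA nodes and which CPUs belong to each
grep -i 'numa' lscpu.txt

# optional: per-node memory sizes and relative distances (requires numactl)
aprun -n 1 -N 1 numactl --hardware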

Accelerators

• Accelerators are additional processors on the node, which provide a performance increase over or in addition to the CPUs

• One class of accelerator is the Graphics Processing Unit (GPU)
  • modern GPUs are well suited to highly parallel scientific workloads
  • eg NVIDIA Tesla

• There are also many-core CPU accelerators
  • eg Intel Xeon Phi

Common Types of Processors

• Application Specific Integrated Circuit (ASIC)
• Field-Programmable Gate Array (FPGA)
• Graphics Processing Unit (GPU)
• Central Processing Unit (CPU)

[Figure: comparison of ASIC, FPGA, GPU and CPU processor types in terms of performance, programmability, power efficiency and cost efficiency]

4. SUPERCOMPUTER ARCHITECTURE

Distributed Memory Model

• Each processor has its own local memory, with its own address space

• Data is shared via a communications network

• May require additional memory for multiple copies of data

Distributed Memory Model

• You can use a distributed memory model on shared memory architectures

• Processors only access their own allocations, and communications are effectively memory copies

Hybrid Memory Model

• Each node has multiple processors, with a shared memory space, while the memory is still distributed between the nodes globally.

• Multiple programming options:
  • Single node shared memory (eg pthreads, OpenMP)
  • Multi-node distributed memory (eg MPI)
  • Hybrid memory (eg MPI+OpenMP), as sketched in the launch example below
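A rough sketch of launching a hybrid MPI+OpenMP job with aprun, assuming a hypothetical executable ./hybrid_app and a node with two 12-core processors; adjust -n, -N and -d to match the node layout found in the earlier exercises:

# one MPI process per processor, 12 OpenMP threads per process
export OMP_NUM_THREADS=12
aprun -n 2 -N 2 -d 12 ./hybrid_app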

Interconnect

• The interconnect is a fast network that connects the nodes of a supercomputer

• There are many ways to connect nodes together; a given configuration is referred to as a topology

• There are many categories of topologies:
  • Fat tree
  • Torus
  • Dragonfly

• The topology of an interconnect affects the performance of different communication patterns occurring between the nodes of the supercomputer

Fat Tree Interconnect

• A hierarchy of switches is used to route traffic between the nodes

• The number of nodes per switch at the top level, and the bandwidth of the connections at each level may vary

• In principle, it is possible to adjust these parameters to fit a given workflow

• In practice, they are typically constrained by the available commodity equipment

Torus Interconnect

• A torus interconnect has direct communication with neighbouring nodes, typically in up to three dimensions

• A large portion of scientific codes decompose the data into domains such that nodes only need to communicate with the neighbouring domains

• Works well if there is one code that takes up the whole torus, but can be problematic if there are multiple programs running
  • node placement may be sub-optimal
  • links may need to carry non-local information

Dragonfly Interconnect

• A dragonfly interconnect has groups of nodes such that at the top level, each group is connected with every other group

• In addition to the direct connection, underutilised links may be used with only a single extra hop
  • allows for adaptive bandwidth between groups of nodes

• Becomes prohibitive in both cost and physical space for large node counts

Partial Dragonfly Interconnect

• A partial dragonfly interconnect uses a dragonfly pattern between subsets of groups of nodes with increasing dimensionality.

• The subsets are arranged such that it takes at most as many hops as dimensions

• Additional bandwidth can be found by taking additional hops to random nodes, then following the normal routing

• Maintains the advantages of the dragonfly interconnect, while capable of being implemented at scale

• Possible to reduce the connectivity of the highest dragonfly dimension, with impact only to global bandwidth

Exercise: Cray XC40 Node Layout

• Run the xtnodestat command:

xtnodestat

• The output shows the jobs executing on the nodes of the system, with the nodes placed in their actual chassis and cabinet locations

     C0-0             C1-0             C2-0             C3-0
  n3 fhhhhgghhicc-kg  ddddgppppppqqqq  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n2 Sfhhhhggghiccckg Sddddgppppppqqqq kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n1 Sfhhhhhgghhcccjg Sddddgppppppqqqq kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
c2n0 ghhhhhgghhccc-g  ddddgpppppppqqq  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n3 ccccdffffffgggg  -nnnnnnoooooodd  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n2 Sccccc-fffffgggg S-nnnnnnoooooodd kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n1 Scccccefffffgggg S-nnnnnnoooooodd kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
c1n0 cccccdffffffggg  m-nnnnnoooooood  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n3 --aaaaabbbbbbc   mmmmmmmmmmmmmmmm qqrr-AZkkkkkkkkk kkkkkkkkkkkkkkkk
  n2 SS--aaaaabbbbbbc gmmmmmmmmmmmmmmm qqrr-jZkkkkkkkkk kkkkkkkkkkkkkkkk
  n1 SS--aaaaabbbbbbb gmmmmmmmmmmmmmmm qqqrrjZkkkkkkkkk kkkkkkkkkkkkkkkk
c0n0 --aaaaaabbbbbb   gmmmmmmmmmmmmmmm qqqrr-Zkkkkkkkkk kkkkkkkkkkkkkkkk
    s0123456789abcdef 0123456789abcdef 0123456789abcdef 0123456789abcdef

• The network has three dragonfly dimensions:
  • each other group in the chassis
  • between chassis in pairs of cabinets
  • between pairs of cabinets (56% of links populated on Magnus)