Introduction to Technology

November 2015

1. OVERVIEW

Supercomputer Architecture

• A supercomputer consists of many computers (nodes) connected together by a fast network (interconnect)
• Performance is determined by
  • the capabilities of individual nodes
  • the ability of the nodes to communicate
• The architectural features of the nodes and interconnect set upper limits on the performance of a program running on the system
  • the characteristics of the program will determine how well it makes use of the available hardware resources

2. CENTRAL PROCESSING UNIT (CPU)

Central Processing Unit (CPU)

• The processor that carries out the instructions of a computer program

• Evolved from the arithmetic logic unit and control unit of the Von Neumann architecture

• Modern CPUs have much more complexity
  • additional functionality to improve performance
  • integrated additional roles that were traditionally external to the CPU

• It is important to understand the architectural features of the CPU in order to make the best use of them when writing and executing programs

• On a Linux system, relevant information is typically found in /proc/cpuinfo

Von Neumann Machine (serial processor)

• Main Memory: stores data and instructions
• Arithmetic Logic Unit (ALU): performs operations on data
• Control Unit (CU): interprets and causes execution of instructions
• Input/Output: allows transfer of data into and out of the machine

Exercise: Acquiring CPU Information

• Write a SLURM script to output the contents of the /proc/cpuinfo file on a compute node
  • what is the model name of the processor?
  • which vendor produced the processor?
• Find the processor specification on the vendor website
  • hint: perform a search using the model name

#!/bin/bash -l
#SBATCH --account=courses01
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

aprun -n 1 -N 1 cat /proc/cpuinfo > cpuinfo.txt
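A minimal sketch of using the script, assuming it has been saved as cpuinfo.slurm (a hypothetical filename); the grep patterns below match the standard model name and vendor_id fields of /proc/cpuinfo:

sbatch cpuinfo.slurm

# once the job has completed, answer the exercise questions from the output
grep 'model name' cpuinfo.txt | head -n 1
grep 'vendor_id' cpuinfo.txt | head -n 1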

Vector Instructions

• Vector instructions allow a processor to perform multiple arithmetic operations in a single clock cycle

• They utilise Single Instruction Multiple Data (SIMD) parallelism
  • the same instruction is applied to different data elements
  • the data values must be consecutive in memory

• The vector length supported by these instructions is increasing with newer architectures

Vector Instructions

• Check the flags in /proc/cpuinfo to determine which vector instructions are supported, by referring to the cpuinfo.txt created previously

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid

• Can also check official specification on the vendor website:

• What vector instructions are supported on the processors we are using? (see the example below)
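As a rough sketch, the SIMD-related flags can be filtered out of the output collected earlier; the flag names listed below are common SSE/AVX variants and the exact set present will depend on the processor:

# extract the SIMD flags from the first processor entry in cpuinfo.txt
grep -m 1 '^flags' cpuinfo.txt | tr ' ' '\n' | grep -E '^(sse|sse2|ssse3|sse4_1|sse4_2|avx|avx2|avx512f)$' | sort -u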

Simultaneous Multithreading (SMT)

also known as Hyper-threading on Intel Processors

• Appears as two logical processors for each physical processor
• The processor has additional circuitry to hold the state of a second thread, which shares the resources of the physical processor
• In theory, this allows higher utilisation of resources, resulting in faster runtimes
• In practice, it is program dependent
  • works well for some programs
  • sharing of resources causes worse performance for others

Simultaneous Multithreading (SMT)

also known as Hyper-threading on Intel Processors

• SMT can complicate job allocations

• Since double the number of processors is presented to the scheduler, it may pack jobs such that two tasks are run per physical processor

• It is important to specify the number of nodes and tasks per node in the SLURM script to ensure your job is being allocated the resources you are expecting

• On systems with SMT enabled, it may be worth running your workflow with both cases (one and two tasks per physical processor) and comparing performance, as sketched below
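A minimal sketch of the two cases, assuming a hypothetical node with 24 physical cores (48 logical processors with SMT); substitute the counts found in the earlier exercises:

# one task per physical core
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24

# two tasks per physical core, using the SMT logical processors
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48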

Simultaneous Multithreading (SMT)

also known as Hyper-threading on Intel Processors

• To determine if SMT is enabled, compare the number of processor entries to the number of siblings in /proc/cpuinfo
• Processor entries can be counted with the grep and wc utilities

grep processor cpuinfo.txt | wc -l

• Siblings are listed in each processor entry

grep siblings cpuinfo.txt | head -n 1

• Is hyper-threading enabled on the processors we are using? (see the sketch below)
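A rough sketch combining the two checks, assuming cpuinfo.txt from the earlier exercise; it compares the cpu cores and siblings fields of the first processor entry:

CORES=$(grep -m 1 'cpu cores' cpuinfo.txt | awk '{print $NF}')
SIBLINGS=$(grep -m 1 'siblings' cpuinfo.txt | awk '{print $NF}')

# siblings greater than cores suggests two logical processors per core
if [ "$SIBLINGS" -gt "$CORES" ]; then
    echo "SMT appears to be enabled ($SIBLINGS siblings, $CORES cores)"
else
    echo "SMT appears to be disabled"
fi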

Multicore Architecture

• The circuitry for more than one processor can be placed on a single CPU chip

• The processors of a CPU chip are referred to as cores

• The number of cores on CPUs has been growing steadily over the last decade
  • Dual-core, Quad-core, Hex-core, Multicore

Multicore Architecture

• Each core is able to execute processes independently of the other cores, using its own compute resources
• However, the cores do share other resources
  • eg. the memory system, and some levels of caching
• If the cores are working in parallel on a shared program, the efficiency may also be reduced by dependencies

Multicore Architecture

• The number of cores per CPU is listed in /proc/cpuinfo

grep cores cpuinfo.txt | head -n 1

• They are also typically provided on the vendor website

• How many cores are there on the CPUs we are using?

Multi-socket Architecture

• The physical CPU identifier for each processor is listed in /proc/cpuinfo, which can be used to determine the number of physical CPUs on the node

grep 'physical id' cpuinfo.txt | sort -u | wc -l

• How many physical CPUs are there on the nodes we are using?

• The total number of processors should be equal to the result obtained by multiplying the physical CPUs by the cores per CPU, and then doubling if hyper-threading is enabled (see the sketch after this list)

• Most codes work well with one process per core on a node, but there are exceptions
• It is worth testing your code with different configurations before commencing production jobs

• What are the configurations you would try on the nodes we are using?
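As a rough sketch of the arithmetic described above, again using the cpuinfo.txt collected earlier (field names can vary slightly between kernel versions):

SOCKETS=$(grep 'physical id' cpuinfo.txt | sort -u | wc -l)
CORES=$(grep -m 1 'cpu cores' cpuinfo.txt | awk '{print $NF}')
LOGICAL=$(grep -c '^processor' cpuinfo.txt)

echo "sockets: $SOCKETS, cores per socket: $CORES, logical processors: $LOGICAL"
# with hyper-threading enabled, LOGICAL should equal SOCKETS * CORES * 2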

Cache

• A component that temporarily stores data in order to reduce access time
• Widely used for a variety of devices for memory or storage:
  • eg. CPU, Hard Disk, Tape Storage
• Typically, data is transferred on access in blocks, known as cache lines
• Cache Hit: the data being accessed is in the cache
• Cache Miss: the data being accessed is not in the cache
• For read access, cache lines are typically loaded on first access
• Variety of methods for determining which cache line to overwrite
  • least frequently used, least recently used, random
• Similar variety of methods for determining when to forward writes to memory
  • write-through, write-back
• An algorithm's pattern of data access affects the caching efficiency

Cache: Access Patterns

• Sequential access results in a higher rate of cache hits

• Striding access has a low rate of hits; however, modern processors can detect striding access patterns and pre-fetch cache lines

• Random access is typically the worst, with a low rate of hits and no ability to predict subsequent access locations

Cache

• Caches may be shared or exclusive to processor cores
• Cache levels are typically numbered in order of location relative to the core

• There can be multiple levels of cache in modern processors
  • smaller, faster caches close to a core
  • larger, slower caches shared by cores

Exercise: Acquiring Cache Information

• Write a SLURM script to output the result of the lscpu command on a compute node
  • what are the various cache sizes?

#!/bin/bash -l
#SBATCH --account=courses01
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

aprun -n 1 -N 1 lscpu > lscpu.txt
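A minimal sketch of extracting the cache sizes once the job has completed and lscpu.txt exists:

# lscpu reports each cache level on its own line, eg L1d, L1i, L2, L3
grep -i 'cache' lscpu.txt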

3. NODE ARCHITECTURE

Node Components

• Components on a node are managed by the chipset, and are connected via a PCI-Express bus

• A network adapter connects the node to the rest of the supercomputer via the interconnect

• Some systems have local storage on the node, but more commonly remote storage is used

• Some systems utilise accelerators to increase the computational performance of the node

Node Components

• The components are physically attached to a motherboard

Memory

• Memory is the component on a node that stores the data that the processor is actively using, but does not fit in cache.

• It is much larger than the cache, but slower to access by an order of magnitude.

• However it is much faster than larger file storage, such as SSD/HDD/Tape.

Component     Example Latency    Example Capacity
L1 Cache      1 ns               64 kB per core
L2 Cache      3 ns               256 kB per core
L3 Cache      10 ns              20,480 kB
Memory        100 ns             64 GB
SSD Storage   30,000 ns          100 GB
HDD Storage   15,000,000 ns      1 TB

Exercise: Acquiring Memory Information

• Write a SLURM script to output the contents of the /proc/meminfo file on a compute node
  • what is the total memory available on a node?
  • how much is currently unused?
  • how much swap space is there, if any?

• Find the processor specification on the vendor website
  • What information about the memory specification is available?

#!/bin/bash -l
#SBATCH --account=courses01
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

aprun -n 1 -N 1 cat /proc/meminfo > meminfo.txt
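A minimal sketch of answering the exercise questions from the collected output; MemTotal, MemFree and SwapTotal are standard /proc/meminfo fields:

grep -E '^(MemTotal|MemFree|SwapTotal)' meminfo.txt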

Multichannel Memory

• Memory controllers are integrated into modern CPUs

• Multiple channels are used to increase memory bandwidth

• Multiple DIMMs are used to increase capacity per channel
  • also allows the same capacity with two DIMMs

Multiprocessor Memory

• When a node has multiple CPUs, control of the memory is shared between them

• Memory accesses from cores on one processor to the memory controlled by the other processor are carried out via a high-speed bus
  • QuickPath Interconnect (QPI) from Intel
  • HyperTransport (HT) from AMD

Shared Memory Model

• Each processor uses a shared memory with a single address space

• Symmetric Multiprocessor (SMP) or Uniform Memory Access (UMA)
  • Identical memory access times for all memory regions from all processors
• Non-Uniform Memory Access (NUMA)
  • Memory latency varies depending on processor and memory region (see the sketch below)

• Data in multiple caches must remain identical; this cache coherency is built into modern systems
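A minimal sketch of checking the NUMA layout of a node, using the lscpu.txt output collected in the earlier cache exercise; the optional numactl command gives more detail if it is installed:

# lscpu reports the number of NUMA nodes and which CPUs belong to each
grep -i 'numa' lscpu.txt

# optional: per-node memory sizes and relative distances (requires numactl)
aprun -n 1 -N 1 numactl --hardware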

Accelerators

• Accelerators are additional processors on the node, which provide a performance increase over or in addition to the CPUs

• One class of accelerator is the Graphics Processing Unit (GPU)
  • modern GPUs are well suited to highly parallel scientific workloads
  • eg NVIDIA Tesla

• There are also many-core CPU accelerators
  • eg Intel Xeon Phi

Common Types of Processors

• Application Specific Integrated Circuit (ASIC)
• Field-Programmable Gate Array (FPGA)
• Graphics Processing Unit (GPU)
• Central Processing Unit (CPU)

[Figure: comparison of ASIC, FPGA, GPU and CPU processor types in terms of performance, programmability, power efficiency and cost efficiency]

4. SUPERCOMPUTER ARCHITECTURE

Distributed Memory Model

• Each processor has its own local memory, with its own address space

• Data is shared via a communications network

• May require additional memory for multiple copies of data

Distributed Memory Model

• You can use a distributed memory model on shared memory architectures

• Processors only access their own allocations, and communications are effectively memory copies

Hybrid Memory Model

• Each node has multiple processors, with a shared memory space, while the memory is still distributed between the nodes globally.

• Multiple programming options:
  • Single node shared memory (eg pthreads, OpenMP)
  • Multi-node distributed memory (eg MPI)
  • Hybrid memory (eg MPI+OpenMP), as sketched in the launch example below
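A rough sketch of launching a hybrid MPI+OpenMP job with aprun, assuming a hypothetical executable ./hybrid_app and a node with two 12-core processors; adjust -n, -N and -d to match the node layout found in the earlier exercises:

# one MPI process per processor, 12 OpenMP threads per process
export OMP_NUM_THREADS=12
aprun -n 2 -N 2 -d 12 ./hybrid_app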

Interconnect

• The interconnect is a fast network that connects the nodes of a supercomputer

• There are many ways to connect nodes together; a given configuration is referred to as a topology

• There are many categories of topologies:
  • Fat tree
  • Torus
  • Dragonfly

• The topology of an interconnect affects the performance of different communication patterns occurring between the nodes of the supercomputer

Fat Tree Interconnect

• A hierarchy of switches is used to route traffic between the nodes

• The number of nodes per switch at the top level, and the bandwidth of the connections at each level may vary

• In principle, it is possible to adjust these parameters to fit a given workflow

• In practice, they are typically constrained by the available commodity equipment

Torus Interconnect

• A torus interconnect has direct communication with neighbouring nodes, typically in up to three dimensions

• A large portion of scientific codes decompose the data into domains such that nodes only need to communicate with the neighbouring domains

• Works well if there is one code that takes up the whole torus, but can be problematic if there are multiple programs running
  • node placement may be sub-optimal
  • links may need to carry non-local information

Dragonfly Interconnect

• A dragonfly interconnect has groups of nodes such that at the top level, each group is connected with every other group

• In addition to the direct connection, underutilised links may be used with only a single extra hop
  • allows for adaptive bandwidth between groups of nodes

• Becomes prohibitive in both cost and physical space for large node counts

Partial Dragonfly Interconnect

• A partial dragonfly interconnect uses a dragonfly pattern between subsets of groups of nodes with increasing dimensionality.

• The subsets are arranged such that it takes at most as many hops as dimensions

• Additional bandwidth can be found by taking additional hops to random nodes, then following the normal routing

• Maintains the advantages of the dragonfly interconnect, while capable of being implemented at scale

• Possible to reduce the connectivity of the highest dragonfly dimension, with impact only to global bandwidth

Exercise: Cray XC40 Node Layout

• Run the xtnodestat command:

xtnodestat

• The output shows the jobs executing on the nodes of the system, with the nodes placed in their actual chassis and cabinet locations

     C0-0             C1-0             C2-0             C3-0
  n3 fhhhhgghhicc-kg  ddddgppppppqqqq  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n2 Sfhhhhggghiccckg Sddddgppppppqqqq kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n1 Sfhhhhhgghhcccjg Sddddgppppppqqqq kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
c2n0 ghhhhhgghhccc-g  ddddgpppppppqqq  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n3 ccccdffffffgggg  -nnnnnnoooooodd  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n2 Sccccc-fffffgggg S-nnnnnnoooooodd kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n1 Scccccefffffgggg S-nnnnnnoooooodd kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
c1n0 cccccdffffffggg  m-nnnnnoooooood  kkkkkkkkkkkkkkkk kkkkkkkkkkkkkkkk
  n3 --aaaaabbbbbbc   mmmmmmmmmmmmmmmm qqrr-AZkkkkkkkkk kkkkkkkkkkkkkkkk
  n2 SS--aaaaabbbbbbc gmmmmmmmmmmmmmmm qqrr-jZkkkkkkkkk kkkkkkkkkkkkkkkk
  n1 SS--aaaaabbbbbbb gmmmmmmmmmmmmmmm qqqrrjZkkkkkkkkk kkkkkkkkkkkkkkkk
c0n0 --aaaaaabbbbbb   gmmmmmmmmmmmmmmm qqqrr-Zkkkkkkkkk kkkkkkkkkkkkkkkk
    s0123456789abcdef 0123456789abcdef 0123456789abcdef 0123456789abcdef

• The network has three dragonfly dimensions:
  • each other group in the chassis
  • between chassis in pairs of cabinets
  • between pairs of cabinets (56% of links populated on Magnus)