
Multi-Core Processors and Multithreading

Evolution of processor architectures: growing complexity of CPUs and its impact on the landscape
Lecture 2: Multi-core processors and multithreading
Paweł Szostek, CERN

Inverted CERN School of Computing, 23-24 February 2015

1 iCSC2015, Pawel Szostek, CERN Multi-core processors and multithreading

Multi-core processors and multithreading: part 1 ADVANCED TOPICS IN THE ARCHITECTURES

CPU evolution

. In the past, manufacturers increased performance by raising the clock frequency
. Transistors were invested into larger caches and more powerful cores
. From 2005 on, transistors are spent on new cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
. Thermal Design Power (TDP) is stalled at ~150 W

Why does a higher clock speed increase the power consumption?

Interlude: power dissipation

. In the past, there were no power-dissipation issues
. Heat density (W/cm3) in a modern CPU approaches the level found in a nuclear reactor [1]
. "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
. This can lead to caveats – see the AVX frequencies above

[1] David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"

Interlude: manufacturing technology

Figure: size comparison – a flu virus (120 nm) next to a transistor from a 14 nm process

Simultaneous Multi-Threading

. Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy

. This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.

Simultaneous Multi-Threading (II)

. Solution: we can utilize the idle execution units with a different thread
. SMT is a hardware feature that can be turned on/off in the BIOS

. Most of the hardware resources (including caches) are shared
. Needs a separate fetching unit

. Can both speed up and slow down execution (see next slide)
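Whether SMT is active can be observed from software: the OS sees more logical CPUs than physical cores. A minimal sketch, assuming a Linux-style /proc/cpuinfo (the parsing helpers are illustrative, with a fallback to the logical count on other systems):

```python
import multiprocessing

def logical_cpus():
    # Logical CPUs as seen by the OS; includes SMT siblings.
    return multiprocessing.cpu_count()

def physical_cpus():
    # Count unique (physical id, core id) pairs from /proc/cpuinfo.
    # Linux-specific; falls back to the logical count elsewhere.
    try:
        cores = set()
        phys_id = None
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.startswith('physical id'):
                    phys_id = line.split(':', 1)[1].strip()
                elif line.startswith('core id'):
                    cores.add((phys_id, line.split(':', 1)[1].strip()))
        return len(cores) or logical_cpus()
    except (IOError, OSError):
        return logical_cpus()

print(logical_cpus(), physical_cpus())
# With SMT enabled, logical_cpus() is typically 2x physical_cpus()
```

With SMT disabled in the BIOS, the two counts coincide.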

Simultaneous Multi-Threading (III)

. Workloads from the HEP-SPEC06 benchmark
. Many instances of single-threaded processes run in parallel
. Different scalability and reactions to SMT
. Cache utilization is the most important factor in the SMT impact

Simultaneous Multi-Threading (IV)

. Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
. Example: list or tree traversal
. the role of the helper thread is to prefetch data
. the helper thread runs ahead of the main thread, touching data before the main thread needs it
. think of it as an interesting example of exploiting the hardware

source: J. Zhou et al., "Improving Performance on Simultaneous Multithreading Processors"

Non-Uniform Memory Access

. Multi-processor architecture where the memory-access time depends on the location of the memory with respect to the processor

. Makes accesses fast when the memory is "close" to the processor

. There is a performance hit when accessing "foreign" memory

. Lowers the pressure on the memory bus

Cluster-on-die

. Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM)

. Solution: split the memory on one socket into two nodes

Intel architectural extensions

Extension | Generation/year   | Value added
MMX       | Pentium MMX/1997  | 64b registers with packed data types, only integer operations
SSE       | Pentium III/1999  | 128b registers (XMM), 32b float only
SSE2      | Pentium 4/2001    | SIMD math on any data type
SSE3      | Prescott/2004     | DSP-oriented math instructions
AVX       | Sandy Bridge/2011 | 256b registers (YMM), 3-operand instructions
AVX2      | Haswell/2013      | Integer instructions in YMM registers, FMA
AVX512    | Skylake/2016      | 512b registers

Hardware evolves → programmers need to adapt

Intel extensions example – AVX2

. AVX2 is the latest extension from Intel

. Among others, it introduces FMA3 – a multiply-accumulate operation with 3 operands ($0 = $0 × $2 + $1) – useful for evaluating a polynomial (remember Horner's method?)

. Creative application – Padé approximant

R(x) = (a_0 + a_1 x + a_2 x^2 + … + a_n x^n) / (1 + b_1 x + b_2 x^2 + … + b_m x^m)
     = (a_0 + x(a_1 + x(a_2 + … + x a_n))) / (1 + x(b_1 + x(b_2 + … + x b_m)))

. VDT is a vectorized math library using Padé approximants – a plug&play libm replacement with speed-ups reaching 10x
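The link between FMA and polynomials: Horner's method turns each step into one multiply-accumulate, exactly the shape a single FMA instruction executes. A minimal Python sketch of the idea (the hardware would fuse each acc = acc*x + a into one instruction):

```python
def horner(coeffs, x):
    """Evaluate a_0 + a_1*x + ... + a_n*x**n with coeffs = [a_0, ..., a_n]."""
    acc = 0.0
    for a in reversed(coeffs):
        acc = acc * x + a   # one multiply-accumulate per coefficient
    return acc

# p(x) = 1 + 2x + 3x^2 at x = 2: 1 + 4 + 12 = 17
print(horner([1.0, 2.0, 3.0], 2.0))  # 17.0
```

Both the numerator and the denominator of a Padé approximant can be evaluated this way, which is how VDT keeps its hot loops FMA-friendly.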


CPU improvements summary

. Common ways to improve CPU performance (advantages; disadvantages):
. Frequency scaling – immediate scaling; does not work any more (see: dark silicon)
. Hyper-threading – medium overhead, up to 30% performance improvement; can double a workload's memory footprint, possible cache pollution
. Architectural changes – increase versatility and performance, work well with existing software; huge design overhead, happen ~every 3 years
. Microarchitectural changes – transparent for the users; huge design overhead
. More cores – low design overhead, easy to implement, great scalability; requires heavily-parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"


Multi-core processors and multithreading: part 2 PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE

Concurrency vs. parallelism

Do concurrent (not parallel) programs need synchronization to access shared resources? Why?

Race conditions

What will be the value of n after both threads finish their work?
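The code on the slide is not reproduced in this text version; the point is that n += 1 is a read-modify-write sequence that two threads can interleave, so updates can be lost. A minimal sketch (hypothetical counter) that runs the lock-protected variant, whose result is deterministic:

```python
import threading

N_ITER = 100000
n = 0
lock = threading.Lock()

def unsafe_increment():
    global n
    for _ in range(N_ITER):
        n += 1              # read, add, write: not atomic -> lost updates possible

def safe_increment():
    global n
    for _ in range(N_ITER):
        with lock:          # serializes the read-modify-write
            n += 1

threads = [threading.Thread(target=safe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)  # 200000 with the lock; without it, anything from N_ITER to 2*N_ITER
```

Swapping safe_increment for unsafe_increment makes the final value nondeterministic, which is exactly the race the question is probing.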

Race conditions (II)

Thread-level parallelism in Python

. C++ parallelism skipped on purpose – already covered at CSC

. Python is not a performance-oriented language, but can be made less slow

. We can still use the threading module to benefit from parallel IO operations via threads, relying on the OS

. Example is deferred to the synchronization slides.

But wait! Is there real parallelism in Python? What about the GIL?
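For CPython the answer is the Global Interpreter Lock (GIL): only one thread executes Python bytecode at a time, so threads give correct results but no speed-up for CPU-bound work. A minimal sketch (hypothetical workload) illustrating this:

```python
import threading

def sum_squares(lo, hi):
    # CPU-bound work: no I/O, so the GIL is the bottleneck.
    total = 0
    for i in range(lo, hi):
        total += i * i
    return total

results = [0, 0]

def worker(idx, lo, hi):
    results[idx] = sum_squares(lo, hi)

# Split the range across two threads...
t1 = threading.Thread(target=worker, args=(0, 0, 50000))
t2 = threading.Thread(target=worker, args=(1, 50000, 100000))
t1.start(); t2.start()
t1.join(); t2.join()

# ...the result is correct, but the wall-clock time is roughly the same
# as a single thread, because the GIL serializes bytecode execution.
print(results[0] + results[1] == sum_squares(0, 100000))  # True
```

This is why the next slide reaches for processes rather than threads when the work is CPU-bound.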

Thread-level parallelism in Python (II)

. We can easily run many processes with the multiprocessing package to leverage parallelism – easily, though not very efficiently:
. high memory footprint
. no resource sharing
. every worker is a separate process

from multiprocessing import Pool  # or: from multiprocessing.dummy import Pool (threads)

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, xrange(10))

CSC Refresher: vector operations

. Problem: all the arithmetic operations are executed one element at a time

. Solution: introduce vector operations and vector registers

What is the maximal speed-up from vectorization? Why is it hard to achieve in practice?

Auto-vectorization in gcc

. Vectorization candidate: (inner) loops.

. Will only work with more recent gcc versions (>4.6)

. By default, auto-vectorization in gcc is disabled
. There are tens of optimization flags, but it's good to remember at least a couple:
. -mtune=ARCH, -march=ARCH
. -O2, -O3, -Ofast
. -ftree-vectorize

Vectorization reports

. The compiler can tell us which loop was not vectorized and why
. gcc: -ftree-vectorizer-verbose=[0-9]
. icc: -vec-report=[0-7]

. A list of vectorizable loops is available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Analyzing loop at vect.cc:14

vect.cc:14: note: not vectorized: in loop.
vect.cc:14: note: bad loop .
vect.cc:6: note: vectorized 0 loops in function.

Intel architectural extensions (II)

. The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch)
. A run-time check is added to the output code

. in ICC, -axARCH can be used instead

GCC:

__attribute__((target("default")))
int foo() {
    return 0;
}

__attribute__((target("sse4.2")))
int foo() {
    return 1;
}

ICC:

__declspec(cpu_specific(generic))
int foo() {
    return 0;
}

__declspec(cpu_specific(core_i7_sse4_2))
int foo() {
    return 1;
}
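The same dispatch idea sketched in plain Python: pick an implementation once at start-up, based on detected capabilities. The feature names and the make_foo helper are illustrative, not a real detection API:

```python
def make_foo(features):
    # Returns the best implementation for the reported CPU features.
    def foo_generic():
        return 0
    def foo_sse42():
        return 1
    return foo_sse42 if 'sse4.2' in features else foo_generic

foo = make_foo({'sse4.2'})   # dispatch happens once, here
print(foo())                 # the sse4.2 version was selected
```

Libraries such as numpy do essentially this at load time, binding the fastest available kernels for the machine they find themselves on.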

Vectorization in C++

. Possible to use intrinsics, but very cumbersome and “write-only”

. Many libraries approach vectorization; the choice is not easy

. Example: Agner Fog’s Vector Class

#include "vectorclass.h"
float a[8], b[8], c[8];
…

// scalar version
for (int i = 0; i < 8; ++i) {
    c[i] = a[i] + b[i]*1.5f;
}

// vectorized version
Vec8f avec, bvec, cvec;
avec.load(a);
bvec.load(b);
cvec = avec + bvec * 1.5f;
cvec.store(c);

Vectorization in Python

. Vectorization in Python is possible, but requires extra modules and extra care

. numpy has a complete set of vectorized mathematical operations; it requires using special types instead of the built-in ones. Array-notation expressions are vectorized

. Any step outside of the numpy world will dramatically slow down execution
. Gains come not only from vectorization, but also from using C types under the hood
. Example: roots of quadratic equations (see next slide)

Vectorization in Python – example

import numpy as np
from cmath import sqrt
from itertools import izip

# generate 1M coefficients
a = np.random.randn(1000000)
b = np.random.randn(1000000)
c = np.random.randn(1000000)

def solve_python(a, b, c):
    for ai, bi, ci in izip(a, b, c):
        delta = bi*bi - 4*ai*ci
        delta_s = sqrt(delta)
        x1 = (-bi + delta_s) / (2*ai)
        x2 = (-bi - delta_s) / (2*ai)
        yield (x1, x2)

def solve_numpy(a, b, c):
    delta = b*b - 4*a*c
    delta_s = np.sqrt(delta + 0.j)
    x1 = (-b + delta_s) / (2*a)
    x2 = (-b - delta_s) / (2*a)
    return (x1, x2)

%timeit list(solve_python(a, b, c))
1 loops, best of 3: 15 s per loop

%timeit list(solve_numpy(a, b, c))
10 loops, best of 3: 105 ms per loop

Wow! Where does this speed-up come from?

Accessing shared resources in Python

. C++ locking skipped on purpose – covered by Danilo

. threading.Lock – the lowest-level synchronization primitive; possible states: released and acquired
. Provides two operations: Lock.acquire(blocking=True) and Lock.release()

. threading.RLock – reentrant lock, can be acquired multiple times by the thread that holds it

. Queue.Queue – synchronized queue for message/object passing
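A minimal sketch of the first two primitives' behaviour: a non-blocking acquire on a Lock, and the reentrancy that distinguishes RLock from Lock:

```python
import threading

lock = threading.Lock()

# Non-blocking acquire returns immediately with True/False.
got_first = lock.acquire(False)    # True: the lock was free
got_second = lock.acquire(False)   # False: already held (Lock is not reentrant)
lock.release()

# RLock may be re-acquired by the thread that already holds it.
rlock = threading.RLock()
with rlock:
    with rlock:                    # fine: recursion count goes to 2
        pass                       # a plain Lock would deadlock here

print(got_first, got_second)  # True False
```

Queue.Queue wraps this kind of locking internally, which is why the example on the next slides never touches a Lock directly.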

Shared resources – example

. Multithreaded application for fetching and processing webpages
. Communication through synchronized queues

import Queue
from threading import Thread
import urllib2
from BeautifulSoup import BeautifulSoup

# hosts = [something, something]

url_queue = Queue.Queue()
html_queue = Queue.Queue()

class FetchThread(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        while True:
            host = self.url_queue.get()
            url = urllib2.urlopen(host)
            chunk = url.read()
            self.html_queue.put(chunk)
            self.url_queue.task_done()

Shared resources – example cont'd

class MineThread(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        while True:
            c = self.html_queue.get()
            soup = BeautifulSoup(c)
            titles = soup.findAll(['title'])
            print(titles)
            self.html_queue.task_done()

def main():
    for i in range(5):
        t = FetchThread(url_queue, html_queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        url_queue.put(host)

    for i in range(5):
        dt = MineThread(html_queue)
        dt.setDaemon(True)
        dt.start()

    url_queue.join()
    html_queue.join()

main()


Multi-core processors and multithreading: part 3 EVOLUTION OF COMPUTING LANDSCAPE IN THE FUTURE

Intel tick-tock model

Intel Xeon Phi

. openlab collaborating since 2008
. PCIe co-processor with 61 cores × 4-way SMT
. 1 TFLOPS peak performance
. 512-bit vectors
. Next generation: 3 times more performance, even more cores, x86-64 compatible, standalone CPU… maybe in desktops?

But… are my applications ready for such massive parallelism?

ARM 64 (AArch64)

. It's all about low power
. 64-bit memory addressing provides support for large memory (>4GB)

. RISC architecture

. Common software ecosystem with x86-64, uses same management standards

. CISC is also expanding in this direction

Figure: energy-efficiency scalability (source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi")

Take-home messages

. Moore’s law is doing fine. Transistors will be invested into more cores, bigger caches and wider vectors (512b)

. NUMA and COD are another “complex stuff” that a programmer has to keep in mind

. Parallelization is possible not only with C++
. Not everything that looks like an improvement gives you better performance (e.g. AVX)

. Multi-threaded applications always require synchronization to protect shared resources

. Auto-vectorization is a speed-up for free
