
Multi-core processors and multithreading
Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape

Lecture 2: Multi-core processors and multithreading
Paweł Szostek, CERN
Inverted CERN School of Computing, 23-24 February 2015
iCSC2015, Pawel Szostek, CERN

Part 1: ADVANCED TOPICS IN COMPUTER ARCHITECTURES

CPU evolution
- In the past, manufacturers kept increasing the clock frequency
- Transistors were invested in larger caches and more powerful cores
- Since 2005, transistors have been spent on additional cores -> 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
- Thermal Design Power (TDP) has stalled at ~150 W
Question: why does a higher clock speed increase power consumption?

Interlude: power dissipation
- In the past, there were no power dissipation issues
- Heat density (W/cm3) in a modern CPU approaches the same level as in a nuclear reactor [1]
- "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
- This can lead to caveats (see AVX)
[1] David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"

Interlude: manufacturing technology
(Figure: a flu virus, about 120 nm across, next to a transistor from the 14 nm process.)

Simultaneous Multi-Threading
- Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy
- This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.

Simultaneous Multi-Threading (II)
- Solution: utilize the idle execution units with a different thread
- SMT is a hardware feature that can be turned on/off in the BIOS
- Most of the hardware resources (including caches) are shared
- A separate fetch unit is needed
- Can both speed up and slow down execution (see next slide)

Simultaneous Multi-Threading (III)
- Workloads from the HEP-SPEC06 benchmark
- Many instances of single-threaded processes run in parallel
- Different workloads scale differently and react differently to SMT
- Cache utilization is the most important factor in SMT impact

Simultaneous Multi-Threading (IV)
- Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
- Example: list or tree traversal
  - the role of the helper thread is to prefetch the data
  - the helper thread runs in front of the main thread, accessing data ahead of it
  - think of it as an interesting example of exploiting the hardware
source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"

Non-Uniform Memory Access
- A multi-processor architecture where memory access time depends on the location of the memory with respect to the processor
- Makes accesses fast when the memory is "close" to the processor
- There is a performance hit when accessing "foreign" memory
- Lowers the pressure on the memory bus

Cluster-on-die
- Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM)
- Solution: split the memory on one socket into two nodes

Intel architectural extensions

Extension | Generation/year    | Value added
MMX       | Pentium MMX/1997   | 64b registers with packed data types, integer operations only
SSE       | Pentium III/1999   | 128b registers (XMM), 32b float only
SSE2      | Pentium 4/2001     | SIMD math on any data type
SSE3      | Prescott/2004      | DSP-oriented math instructions
AVX       | Sandy Bridge/2011  | 256b registers (YMM), 3-operand instructions
AVX2      | Haswell/2013       | integer instructions in YMM registers, FMA
AVX512    | Skylake/2016       | 512b registers

Hardware evolves -> programmers and compilers need to adapt

Intel extensions example: AVX2
- AVX2 is the latest extension from Intel (at the time of this lecture)
- Among others, it introduces FMA3, a multiply-accumulate operation with 3 operands ($0 = $0 * $2 + $1), useful for evaluating a polynomial (remember Horner's method?)
- Creative application: the Padé approximant

    R(x) = (a0 + a1*x + a2*x^2 + ... + am*x^m) / (1 + b1*x + b2*x^2 + ... + bn*x^n)

- VDT is a vectorized math library using Padé approximants: a plug&play libm replacement with speed-ups reaching 10x

CPU improvements summary
Common ways to improve CPU performance:

Technique                  | Advantages                                                         | Disadvantages
Frequency scaling          | Immediate scaling                                                  | Does not work any more (see: dark silicon)
Hyper-threading            | Medium overhead, up to 30% performance improvement                 | Can double the workload's memory footprint, possible cache pollution
Architectural changes      | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years
Microarchitectural changes | Transparent for the users                                          | Huge design overhead
More cores                 | Low design overhead, easy to implement, great scalability          | Requires heavily-parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"

Part 2: PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE

Concurrency vs. parallelism
Question: do concurrent (not parallel) programs need synchronization to access shared resources? Why?

Race conditions
Question: what will be the value of n after both threads finish their work?

Race conditions (II)

Thread-level parallelism in Python
- C++ parallelism skipped on purpose: already covered at CSC
- Python is not a performance-oriented language, but it can be made less slow
- We can still use the threading module to benefit from parallel I/O operations via threads, by relying on the OS
- The example is deferred to the synchronization slides
But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?

Thread-level parallelism in Python (II)
- With the multiprocessing package we can easily run many processes to leverage parallelism, though not very efficiently:
  - high memory footprint
  - no resource sharing
  - every worker is a separate process

    from multiprocessing import Pool  # multiprocessing.dummy offers the same API with threads

    def f(x):
        return x*x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        result = pool.map(f, range(10))

CSC Refresher: vector operations
- Problem: all the arithmetic operations are executed one element at a time
- Solution: introduce vector operations and vector registers
Questions: what is the maximal speed-up from vectorization? Why is it hard to obtain in practice?
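Returning to the race-condition question a few slides back ("what will be the value of n?"), here is a minimal sketch in Python (my own illustration, not from the slides): two threads increment a shared counter, optionally protected by a lock. Note that CPython's GIL does not make `n += 1` atomic; it is a read-modify-write sequence, so updates can still be lost without a lock.

```python
import threading

def count(n_iters, lock=None):
    """Have two threads each increment a shared counter n_iters times."""
    counter = {"n": 0}

    def worker():
        for _ in range(n_iters):
            if lock is not None:
                with lock:
                    counter["n"] += 1
            else:
                counter["n"] += 1  # read-modify-write: not atomic, racy

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["n"]

# With a lock the result is deterministic:
safe = count(100000, lock=threading.Lock())
assert safe == 200000
# Without a lock, lost updates can make the result anything up to 200000.
```

Without the lock the race may manifest only occasionally (thread switches in CPython happen every few milliseconds), which is exactly what makes such bugs hard to reproduce.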
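As a worked sketch for the speed-up question above (my own illustration, not part of the slides): the upper bound on vectorization speed-up is the number of elements that fit in one vector register, i.e. register width divided by element width. In practice the bound is hard to reach because of memory bandwidth limits, non-contiguous access patterns, loop remainders and data dependencies.

```python
def ideal_simd_speedup(register_bits, element_bits):
    # Upper bound on speed-up: elements processed per vector instruction.
    return register_bits // element_bits

# SSE: 128-bit XMM registers
assert ideal_simd_speedup(128, 32) == 4   # 4 floats per instruction
# AVX: 256-bit YMM registers
assert ideal_simd_speedup(256, 32) == 8   # 8 floats per instruction
assert ideal_simd_speedup(256, 64) == 4   # 4 doubles per instruction
```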
Auto-vectorization in gcc
- Vectorization candidates: (inner) loops; works only with more recent gcc versions (>4.6)
- By default, auto-vectorization in gcc is disabled
- There are tens of optimization flags, but it is good to remember at least a couple:
  - -mtune=ARCH, -march=ARCH
  - -O2, -O3, -Ofast
  - -ftree-vectorize

Vectorization reports
- The compiler can tell us which loops were not vectorized and why:
  - gcc: -ftree-vectorizer-verbose=[0-9]
  - icc: -vec-report=[0-7]
- A list of vectorizable loops is available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

    Analyzing loop at vect.cc:14
    vect.cc:14: note: not vectorized: control flow in loop.
    vect.cc:14: note: bad loop form.
    vect.cc:6: note: vectorized 0 loops in function.

Intel architectural extensions (II)
- The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch)
- A run-time check is added to the output code
- in ICC, -axARCH can be used instead

GCC:

    __attribute__((target("default")))
    int foo() {
        return 0;
    }

    __attribute__((target("sse4.2")))
    int foo() {
        return 1;
    }

ICC:

    __declspec(cpu_specific(generic))
    int foo() {
        return 0;
    }

    __declspec(cpu_specific(core_i7_sse4_2))
    int foo() {
        return 1;
    }

Vectorization in C++
- Possible to use intrinsics, but very cumbersome and "write-only"
- There are many libraries that approach vectorization; the choice is not easy
- Example: Agner Fog's Vector Class

Scalar version:

    float a[8], b[8], c[8];
    ...
    for (int i = 0; i < 8; ++i) {
        c[i] = a[i] + b[i]*1.5f;
    }

Vector Class version:

    #include "vectorclass.h"
    float a[8], b[8], c[8];
    ...
    Vec8f avec, bvec, cvec;
    avec.load(a);
    bvec.load(b);
    cvec = avec + bvec * 1.5f;
    cvec.store(c);
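The automatic CPU dispatch described above can be mimicked at a high level: detect CPU features at run time and pick an implementation accordingly. The sketch below is a hypothetical pure-Python analogue mirroring the slide's foo() example (the /proc/cpuinfo parsing is Linux-specific and my own illustration, not how the compilers implement dispatch):

```python
def cpu_flags():
    """Return the set of CPU feature flags on Linux; empty set elsewhere."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def make_foo():
    """Pick a function variant at run time, like the compiler's dispatch stub."""
    if "sse4_2" in cpu_flags():
        return lambda: 1  # "sse4.2" variant, as in the slide's example
    return lambda: 0      # generic fallback

foo = make_foo()
assert foo() in (0, 1)
```

The compilers do the same check once, at program start-up or first call, and then jump directly to the selected variant with no per-call overhead.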