Multi-Core Processors and Multithreading
Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape
Lecture 2: Multi-core processors and multithreading
Paweł Szostek, CERN
Inverted CERN School of Computing, 23-24 February 2015
1 iCSC2015, Pawel Szostek, CERN Multi-core processors and multithreading
Multi-core processors and multithreading: part 1
ADVANCED TOPICS IN COMPUTER ARCHITECTURES
CPU evolution

- In the past, manufacturers kept increasing the clock frequency
- Transistors were invested into larger caches and more powerful cores
- Since 2005, transistors are spent on new cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
- Thermal Design Power (TDP) has stalled at ~150W
Why does a higher clock speed increase power consumption?
Interlude: power dissipation
- In the past, there were no power dissipation issues
- Heat density (W/cm³) in a modern CPU approaches the level found in a nuclear reactor [1]
- "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, transistors reserved for infrequent use)
- This can lead to caveats – see AVX

[1] David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"
Interlude: manufacturing technology

[Figure: a flu virus (~120 nm) next to a 14 nm-process transistor]
Simultaneous Multi-Threading
- Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy
- This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.
Simultaneous Multi-Threading (II)

- Solution: we can utilize idle execution units with a different thread
- SMT is a hardware feature that can be turned on/off in the BIOS
- Most of the hardware resources (including caches) are shared
- A separate fetch unit is needed
- Can both speed up and slow down execution (see next slide)
Simultaneous Multi-Threading (III)
- Workloads from the HEP-SPEC06 benchmark
- Many instances of single-threaded processes run in parallel
- Different scalability and reactions to SMT
- Cache utilization is the most important factor in SMT impact
Simultaneous Multi-Threading (IV)
- Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
- Example: list or tree traversal
  - the role of the helper thread is to prefetch the data
  - the helper thread runs ahead of the main thread, accessing data before the main thread needs it
  - think of it as an interesting example of exploiting the hardware
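The cache-prefetching effect itself is only visible from a low-level language, but the shape of the idea can be sketched in Python (Python 3 names; the slow load is simulated and all identifiers are illustrative): a helper thread walks the data ahead of the main thread and stages each slow-to-fetch element in a bounded queue, so the main thread rarely has to wait.

```python
import threading
import queue
import time

def slow_load(item):
    time.sleep(0.001)        # stand-in for a slow memory/disk access
    return item * 2

def helper(items, staged):
    # The helper runs ahead of the main thread, "prefetching" elements
    for it in items:
        staged.put(slow_load(it))
    staged.put(None)         # sentinel: end of data

def main_loop(staged):
    total = 0
    while True:
        value = staged.get() # usually already staged by the helper
        if value is None:
            return total
        total += value

staged = queue.Queue(maxsize=16)  # bounded: the helper stays only a bit ahead
t = threading.Thread(target=helper, args=(range(100), staged))
t.start()
result = main_loop(staged)
t.join()
print(result)  # sum of 2*i for i in 0..99
```

The bounded queue mirrors the hardware situation: the helper must stay ahead, but not so far ahead that the prefetched data is evicted before use.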
source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"

Non-Uniform Memory Access

- Multi-processor architecture where memory access time depends on the location of the memory with respect to the processor
- Makes accesses fast when the memory is "close" to the processor
- There is a performance hit when accessing "foreign" memory
- Reduces pressure on the memory bus
Cluster-on-die
- Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM)
- Solution: split the memory on one socket into two nodes
Intel architectural extensions

Extension | Generation/year   | Value added
MMX       | Pentium MMX/1997  | 64b registers with packed data types, only integer operations
SSE       | Pentium III/1999  | 128b registers (XMM), 32b float only
SSE2      | Pentium 4/2001    | SIMD math on any data type
SSE3      | Prescott/2004     | DSP-oriented math instructions
AVX       | Sandy Bridge/2011 | 256b registers (YMM), 3-operand instructions
AVX2      | Haswell/2013      | Integer instructions in YMM registers, FMA
AVX512    | Skylake/2016      | 512b registers
Hardware evolves → programmers and compilers need to adapt
Intel extensions example – AVX2

- AVX2 is the latest extension from Intel
- Among others, it introduces FMA3 – a fused multiply-add operation with 3 operands ($0 = $0 × $2 + $1) – useful for evaluating a polynomial (remember Horner's method?)
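As a reminder, Horner's method turns polynomial evaluation into a chain of multiply-add steps – exactly the shape that FMA executes as a single instruction. A minimal sketch:

```python
def horner(coeffs, x):
    """Evaluate a_n*x^n + ... + a_1*x + a_0; coeffs ordered [a_n, ..., a_0]."""
    result = 0.0
    for a in coeffs:
        result = result * x + a  # one multiply-add per coefficient (an FMA in hardware)
    return result

# 2x^2 + 3x + 1 at x = 2: 2*4 + 3*2 + 1 = 15
print(horner([2.0, 3.0, 1.0], 2.0))  # 15.0
```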
- Creative application – the Padé approximant:

  R(x) = (a₀ + a₁x + a₂x² + … + aₙxⁿ) / (1 + b₁x + b₂x² + … + bₘxᵐ)

- VDT is a vector math library built on Padé approximants – a plug&play replacement for libm, with speed-ups reaching 10x
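To make the idea concrete, here is a hand-written [2/2] Padé approximant of exp(x) around 0 (the coefficients below are the standard ones for this approximant, not taken from VDT); numerator and denominator are both short polynomials, so each can be evaluated with Horner-style multiply-adds:

```python
import math

def exp_pade22(x):
    # [2/2] Pade approximant of exp(x): (12 + 6x + x^2) / (12 - 6x + x^2)
    num = 12.0 + 6.0*x + x*x
    den = 12.0 - 6.0*x + x*x
    return num / den

# Very accurate near 0; the error grows with |x|
print(abs(exp_pade22(0.1) - math.exp(0.1)) < 1e-7)  # True
```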
CPU improvements summary

- Common ways to improve CPU performance:

Technique                  | Advantages                                                             | Disadvantages
Frequency scaling          | Immediate scaling                                                      | Does not work any more (see: dark silicon)
Hyper-threading            | Medium overhead, up to 30% performance improvement                     | Can double a workload's memory footprint, possible cache pollution
Architectural changes      | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years
Microarchitectural changes | Transparent for the users                                              | Huge design overhead
More cores                 | Low design overhead, easy to implement, great scalability              | Requires heavily-parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"
Multi-core processors and multithreading: part 2 PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE
Concurrency vs. parallelism
Do concurrent (but not parallel) programs need synchronization to access shared resources? Why?
Race conditions
What will the value of n be after both threads finish their work?
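A minimal demonstration of the race: two threads increment a shared counter without synchronization. The increment is a read-modify-write sequence, so updates can be lost; depending on how the interpreter switches threads, the final value can be anything up to 200000.

```python
import threading

n = 0

def worker():
    global n
    for _ in range(100000):
        n += 1  # read-modify-write: not atomic, updates can be lost

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(n)  # at most 200000; may be less when increments interleave
```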
Race conditions (II)
Thread-level parallelism in Python

- C++ parallelism skipped on purpose – already covered at CSC
- Python is not a performance-oriented language, but it can be made less slow
- We can still use the threading module to benefit from parallel I/O operations, relying on the OS to schedule the threads
- An example is deferred to the synchronization slides
But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?
Thread-level parallelism in Python (II)
- We can easily run many processes with the multiprocessing package to leverage parallelism – though not very efficiently:
  - high memory footprint
  - no resource sharing
  - every worker is a separate process
from multiprocessing import Pool  # or multiprocessing.dummy for a thread-based Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, xrange(10))
CSC Refresher: vector operations
- Problem: all arithmetic operations are executed one element at a time
- Solution: introduce vector operations and vector registers
What is the maximal speed-up from vectorization? Why is it hard to obtain in practice?
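A quick refresher in numpy terms (illustrative; it mirrors the C++ Vector Class example later in the lecture): the element-at-a-time loop and the single vector expression compute the same result, but the latter lets the underlying C code use SIMD instructions.

```python
import numpy as np

a = np.arange(8, dtype=np.float32)
b = np.arange(8, dtype=np.float32)

# element-at-a-time
c_loop = np.empty(8, dtype=np.float32)
for i in range(8):
    c_loop[i] = a[i] + b[i] * 1.5

# one vector expression over whole arrays
c_vec = a + b * 1.5

print(np.allclose(c_loop, c_vec))  # True
```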
Auto-vectorization in gcc
- Vectorization candidates: (inner) loops
- Works well only with more recent gcc versions (> 4.6)
- By default, auto-vectorization in gcc is disabled
- There are tens of optimization flags, but it is good to remember at least a couple:
  - -mtune=ARCH, -march=ARCH
  - -O2, -O3, -Ofast
  - -ftree-vectorize
Vectorization reports
- The compiler can tell us which loops were not vectorized and why
  - gcc: -ftree-vectorizer-verbose=[0-9]
  - icc: -vec-report=[0-7]
- A list of vectorizable loops is available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html
Analyzing loop at vect.cc:14
vect.cc:14: note: not vectorized: control flow in loop.
vect.cc:14: note: bad loop form.
vect.cc:6: note: vectorized 0 loops in function.
Intel architectural extensions (II)
- The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch)
- A run-time check is added to the output code
- In ICC, -axARCH can be used instead
GCC:

__attribute__((target("default")))
int foo() {
    return 0;
}

__attribute__((target("sse4.2")))
int foo() {
    return 1;
}

ICC:

__declspec(cpu_specific(generic))
int foo() {
    return 0;
}

__declspec(cpu_specific(core_i7_sse4_2))
int foo() {
    return 1;
}
Vectorization in C++
- It is possible to use intrinsics, but they are very cumbersome and "write-only"
- Many libraries approach vectorization; the choice is not easy
- Example: Agner Fog's Vector Class
// scalar version
float a[8], b[8], c[8];
…
for (int i=0; i<8; ++i) {
    c[i] = a[i] + b[i]*1.5f;
}

// Vector Class version
#include "vectorclass.h"
float a[8], b[8], c[8];
…
Vec8f avec, bvec, cvec;
avec.load(a);
bvec.load(b);
cvec = avec + bvec * 1.5f;
cvec.store(c);
Vectorization in Python
- Vectorization in Python is possible, but requires extra modules and extra care
- numpy has a complete set of vectorized mathematical operations; it requires special types instead of the built-in ones, and array-notation expressions are vectorized
- Any step outside the numpy world will dramatically slow down execution
- The gains come not only from vectorization, but also from using C types under the hood
- Example: roots of quadratic equations (see next slide)
Vectorization in Python – example

import numpy as np
from cmath import sqrt
from itertools import izip

# generate 1M coefficients
a = np.random.randn(1000000)
b = np.random.randn(1000000)
c = np.random.randn(1000000)

def solve_python(a, b, c):
    for ai, bi, ci in izip(a, b, c):
        delta = bi*bi - 4*ai*ci
        delta_s = sqrt(delta)
        x1 = (-bi + delta_s) / (2*ai)
        x2 = (-bi - delta_s) / (2*ai)
        yield (x1, x2)

def solve_numpy(a, b, c):
    delta = b*b - 4*a*c
    delta_s = np.sqrt(delta + 0.j)
    x1 = (-b + delta_s) / (2*a)
    x2 = (-b - delta_s) / (2*a)
    return (x1, x2)

timeit list(solve_python(a, b, c))
# 1 loops, best of 3: 15 s
timeit list(solve_numpy(a, b, c))
# 10 loops, best of 3: 105 ms
Wow! Where does this speed-up come from?
Accessing shared resources in Python
- C++ locking skipped on purpose – covered by Danilo
- threading.Lock – the lowest-level synchronization primitive; possible states: released and acquired
  - provides two operations: Lock.acquire(blocking=True) and Lock.release()
- threading.RLock – a reentrant lock, can be acquired multiple times by the same thread
- Queue.Queue – a synchronized queue for message/object passing
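A minimal usage sketch (Python 3 names): guarding a shared counter with threading.Lock removes the lost updates seen in the race-condition example. The with statement calls acquire() on entry and release() on exit, even if an exception is raised.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:  # acquire() / release() around the critical section
            counter += 1

threads = [threading.Thread(target=add, args=(50000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # exactly 200000: no updates are lost
```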
Shared resources – example

- Multithreaded application for fetching and processing webpages
- Communication through synchronized queues
import Queue
from threading import Thread
import urllib2
from BeautifulSoup import BeautifulSoup

url_queue = Queue.Queue()
html_queue = Queue.Queue()

# hosts = [something, something]

class FetchThread(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        while True:
            host = self.url_queue.get()
            url = urllib2.urlopen(host)
            chunk = url.read()
            self.html_queue.put(chunk)
            self.url_queue.task_done()
Shared resources – example cont'd
class MineThread(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        while True:
            c = self.html_queue.get()
            soup = BeautifulSoup(c)
            titles = soup.findAll(['title'])
            print(titles)
            self.html_queue.task_done()

def main():
    for i in range(5):
        t = FetchThread(url_queue, html_queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        url_queue.put(host)

    for i in range(5):
        dt = MineThread(html_queue)
        dt.setDaemon(True)
        dt.start()

    url_queue.join()
    html_queue.join()

main()
Multi-core processors and multithreading: part 3 EVOLUTION OF COMPUTING LANDSCAPE IN THE FUTURE
Intel tick-tock model
Intel Xeon Phi
- openlab collaborating since 2008
- PCIe co-processor with 61 cores × 4-way SMT
- 1 TFLOPS peak performance
- 512-bit vectors
- Next generation: even more cores, 3 times more performance, x86-64 compatible, standalone CPU… maybe in desktops?

But… are my applications ready for such massive parallelism?
ARM 64 (AArch64)
- It's all about low power
- 64-bit memory addressing provides support for large memory (>4GB)
- RISC architecture
- Common software ecosystem with x86-64, uses the same management standards
- CISC is also expanding in this direction

[Figure: energy efficiency and scalability; source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi"]
Take-home messages
- Moore's law is doing fine. Transistors will be invested into more cores, bigger caches and wider vectors (512b)
- NUMA and cluster-on-die are more "complex stuff" that a programmer has to keep in mind
- Parallelization is possible not only in C++
- Not everything that looks like an improvement gives you better performance (e.g. AVX)
- Multi-threaded applications always require synchronization to protect shared resources
- Auto-vectorization is a speed-up for free