
Multi-Core Processors and Multithreading

Evolution of processor architectures: growing complexity of CPUs and its impact on the landscape
Lecture 2: Multi-core processors and multithreading
Paweł Szostek, CERN

Inverted CERN School of Computing, 23-24 February 2015

1 iCSC2015, Pawel Szostek, CERN Multi-core processors and multithreading

Multi-core processors and multithreading: part 1 ADVANCED TOPICS IN THE ARCHITECTURES

CPU evolution

. In the past, manufacturers increased performance by raising the clock frequency
. Transistors were invested into larger caches and more powerful cores
. From 2005 on, transistors are spent on new cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
. Thermal Design Power (TDP) is stalled at ~150 W

Why does a higher clock speed increase the power consumption?

Interlude: power dissipation

. In the past, there were no power-dissipation issues
. Heat density (W/cm3) in a modern CPU approaches the level found in a nuclear reactor [1]
. "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
. This can lead to caveats – see the AVX frequencies above

[1] David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"

Interlude: manufacturing technology

Figure: size comparison – a flu virus (120 nm) next to a transistor from a 14 nm process

Simultaneous Multi-Threading

. Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy

. This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.

Simultaneous Multi-Threading (II)

. Solution: we can utilize the idle execution units with a different thread
. SMT is a hardware feature that can be turned on/off in the BIOS

. Most of the hardware resources (including caches) are shared
. Needs a separate fetching unit

. Can both speed up and slow down execution (see next slide)
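Whether SMT is active can be observed from software: the OS sees more logical CPUs than physical cores. A minimal sketch, assuming a Linux-style /proc/cpuinfo (the parsing helpers are illustrative, with a fallback to the logical count on other systems):

```python
import multiprocessing

def logical_cpus():
    # Logical CPUs as seen by the OS; includes SMT siblings.
    return multiprocessing.cpu_count()

def physical_cpus():
    # Count unique (physical id, core id) pairs from /proc/cpuinfo.
    # Linux-specific; falls back to the logical count elsewhere.
    try:
        cores = set()
        phys_id = None
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.startswith('physical id'):
                    phys_id = line.split(':', 1)[1].strip()
                elif line.startswith('core id'):
                    cores.add((phys_id, line.split(':', 1)[1].strip()))
        return len(cores) or logical_cpus()
    except (IOError, OSError):
        return logical_cpus()

print(logical_cpus(), physical_cpus())
# With SMT enabled, logical_cpus() is typically 2x physical_cpus()
```

With SMT disabled in the BIOS, the two counts coincide.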

Simultaneous Multi-Threading (III)

. Workloads from the HEP-SPEC06 benchmark
. Many instances of single-threaded processes run in parallel
. Different scalability and reactions to SMT
. Cache utilization is the most important factor in the SMT impact

Simultaneous Multi-Threading (IV)

. Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
. Example: list or tree traversal
. the role of the helper thread is to prefetch data
. the helper thread runs ahead of the main thread, touching data before the main thread needs it
. think of it as an interesting example of exploiting the hardware

source: J. Zhou et al., "Improving Performance on Simultaneous Multithreading Processors"

Non-Uniform Memory Access

. Multi-processor architecture where the memory-access time depends on the location of the memory with respect to the processor

. Makes accesses fast when the memory is "close" to the processor

. There is a performance hit when accessing "foreign" memory

. Lowers the pressure on the memory bus

Cluster-on-die

. Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM)

. Solution: split the memory on one socket into two nodes

Intel architectural extensions

Extension | Generation/year   | Value added
MMX       | Pentium MMX/1997  | 64b registers with packed data types, only integer operations
SSE       | Pentium III/1999  | 128b registers (XMM), 32b float only
SSE2      | Pentium 4/2001    | SIMD math on any data type
SSE3      | Prescott/2004     | DSP-oriented math instructions
AVX       | Sandy Bridge/2011 | 256b registers (YMM), 3-operand instructions
AVX2      | Haswell/2013      | Integer instructions in YMM registers, FMA
AVX512    | Skylake/2016      | 512b registers

Hardware evolves → programmers need to adapt

Intel extensions example – AVX2

. AVX2 is the latest extension from Intel

. Among others, it introduces FMA3 – a multiply-accumulate operation with 3 operands ($0 = $0 × $2 + $1) – useful for evaluating a polynomial (remember Horner's method?)

. Creative application – Padé approximant

R(x) = (a_0 + a_1 x + a_2 x^2 + … + a_n x^n) / (1 + b_1 x + b_2 x^2 + … + b_m x^m)
     = (a_0 + x(a_1 + x(a_2 + … + x a_n))) / (1 + x(b_1 + x(b_2 + … + x b_m)))

. VDT is a vectorized math library using Padé approximants – a plug&play libm replacement with speed-ups reaching 10x
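The link between FMA and polynomials: Horner's method turns each step into one multiply-accumulate, exactly the shape a single FMA instruction executes. A minimal Python sketch of the idea (the hardware would fuse each acc = acc*x + a into one instruction):

```python
def horner(coeffs, x):
    """Evaluate a_0 + a_1*x + ... + a_n*x**n with coeffs = [a_0, ..., a_n]."""
    acc = 0.0
    for a in reversed(coeffs):
        acc = acc * x + a   # one multiply-accumulate per coefficient
    return acc

# p(x) = 1 + 2x + 3x^2 at x = 2: 1 + 4 + 12 = 17
print(horner([1.0, 2.0, 3.0], 2.0))  # 17.0
```

Both the numerator and the denominator of a Padé approximant can be evaluated this way, which is how VDT keeps its hot loops FMA-friendly.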


CPU improvements summary

. Common ways to improve CPU performance (advantages; disadvantages):
. Frequency scaling – immediate scaling; does not work any more (see: dark silicon)
. Hyper-threading – medium overhead, up to 30% performance improvement; can double a workload's memory footprint, possible cache pollution
. Architectural changes – increase versatility and performance, work well with existing software; huge design overhead, happen ~every 3 years
. Microarchitectural changes – transparent for the users; huge design overhead
. More cores – low design overhead, easy to implement, great scalability; requires heavily-parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"


Multi-core processors and multithreading: part 2 PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE

Concurrency vs. parallelism

Do concurrent (not parallel) programs need synchronization to access shared resources? Why?

Race conditions

What will be the value of n after both threads finish their work?
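The code on the slide is not reproduced in this text version; the point is that n += 1 is a read-modify-write sequence that two threads can interleave, so updates can be lost. A minimal sketch (hypothetical counter) that runs the lock-protected variant, whose result is deterministic:

```python
import threading

N_ITER = 100000
n = 0
lock = threading.Lock()

def unsafe_increment():
    global n
    for _ in range(N_ITER):
        n += 1              # read, add, write: not atomic -> lost updates possible

def safe_increment():
    global n
    for _ in range(N_ITER):
        with lock:          # serializes the read-modify-write
            n += 1

threads = [threading.Thread(target=safe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)  # 200000 with the lock; without it, anything from N_ITER to 2*N_ITER
```

Swapping safe_increment for unsafe_increment makes the final value nondeterministic, which is exactly the race the question is probing.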

Race conditions (II)

Thread-level parallelism in Python

. C++ parallelism skipped on purpose – already covered at CSC

. Python is not a performance-oriented language, but can be made less slow

. We can still use the threading module to benefit from parallel IO operations via threads, relying on the OS

. Example is deferred to the synchronization slides.

But wait! Is there real parallelism in Python? What about the GIL?
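For CPython the answer is the Global Interpreter Lock (GIL): only one thread executes Python bytecode at a time, so threads give correct results but no speed-up for CPU-bound work. A minimal sketch (hypothetical workload) illustrating this:

```python
import threading

def sum_squares(lo, hi):
    # CPU-bound work: no I/O, so the GIL is the bottleneck.
    total = 0
    for i in range(lo, hi):
        total += i * i
    return total

results = [0, 0]

def worker(idx, lo, hi):
    results[idx] = sum_squares(lo, hi)

# Split the range across two threads...
t1 = threading.Thread(target=worker, args=(0, 0, 50000))
t2 = threading.Thread(target=worker, args=(1, 50000, 100000))
t1.start(); t2.start()
t1.join(); t2.join()

# ...the result is correct, but the wall-clock time is roughly the same
# as a single thread, because the GIL serializes bytecode execution.
print(results[0] + results[1] == sum_squares(0, 100000))  # True
```

This is why the next slide reaches for processes rather than threads when the work is CPU-bound.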

Thread-level parallelism in Python (II)

. We can easily run many processes with the multiprocessing package to leverage parallelism – easily, though not very efficiently:
. high memory footprint
. no resource sharing
. every worker is a separate process

from multiprocessing import Pool  # or: from multiprocessing.dummy import Pool (threads)

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, xrange(10))

CSC Refresher: vector operations

. Problem: all the arithmetic operations are executed one element at a time

. Solution: introduce vector operations and vector registers

What is the maximal speed-up from vectorization? Why is it hard to achieve in practice?

Auto-vectorization in gcc

. Vectorization candidate: (inner) loops.

. Will only work with more recent gcc versions (>4.6)

. By default, auto-vectorization in gcc is disabled
. There are tens of optimization flags, but it's good to remember at least a couple:
. -mtune=ARCH, -march=ARCH
. -O2, -O3, -Ofast
. -ftree-vectorize

Vectorization reports

. The compiler can tell us which loop was not vectorized and why
. gcc: -ftree-vectorizer-verbose=[0-9]
. icc: -vec-report=[0-7]

. A list of vectorizable loops is available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Analyzing loop at vect.cc:14

vect.cc:14: note: not vectorized: in loop.
vect.cc:14: note: bad loop .
vect.cc:6: note: vectorized 0 loops in function.

Intel architectural extensions (II)

. The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch)
. A run-time check is added to the output code

. in ICC, -axARCH can be used instead

GCC:

__attribute__((target("default")))
int foo() {
    return 0;
}

__attribute__((target("sse4.2")))
int foo() {
    return 1;
}

ICC:

__declspec(cpu_specific(generic))
int foo() {
    return 0;
}

__declspec(cpu_specific(core_i7_sse4_2))
int foo() {
    return 1;
}
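The same dispatch idea sketched in plain Python: pick an implementation once at start-up, based on detected capabilities. The feature names and the make_foo helper are illustrative, not a real detection API:

```python
def make_foo(features):
    # Returns the best implementation for the reported CPU features.
    def foo_generic():
        return 0
    def foo_sse42():
        return 1
    return foo_sse42 if 'sse4.2' in features else foo_generic

foo = make_foo({'sse4.2'})   # dispatch happens once, here
print(foo())                 # the sse4.2 version was selected
```

Libraries such as numpy do essentially this at load time, binding the fastest available kernels for the machine they find themselves on.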

Vectorization in C++

. Possible to use intrinsics, but very cumbersome and “write-only”

. Many libraries approach vectorization; the choice is not easy

. Example: Agner Fog’s Vector Class

#include "vectorclass.h"
float a[8], b[8], c[8];
…

// scalar version
for (int i = 0; i < 8; ++i) {
    c[i] = a[i] + b[i]*1.5f;
}

// vectorized version
Vec8f avec, bvec, cvec;
avec.load(a);
bvec.load(b);
cvec = avec + bvec * 1.5f;
cvec.store(c);

Vectorization in Python

. Vectorization in Python is possible, but requires extra modules and extra care

. numpy has a complete set of vectorized mathematical operations; it requires using special types instead of the built-in ones. Array-notation expressions are vectorized

. Any step outside of the numpy world will dramatically slow down execution
. Gains come not only from vectorization, but also from using C types under the hood
. Example: roots of quadratic equations (see next slide)

Vectorization in Python – example

import numpy as np
from cmath import sqrt
from itertools import izip

# generate 1M coefficients
a = np.random.randn(1000000)
b = np.random.randn(1000000)
c = np.random.randn(1000000)

def solve_python(a, b, c):
    for ai, bi, ci in izip(a, b, c):
        delta = bi*bi - 4*ai*ci
        delta_s = sqrt(delta)
        x1 = (-bi + delta_s) / (2*ai)
        x2 = (-bi - delta_s) / (2*ai)
        yield (x1, x2)

def solve_numpy(a, b, c):
    delta = b*b - 4*a*c
    delta_s = np.sqrt(delta + 0.j)
    x1 = (-b + delta_s) / (2*a)
    x2 = (-b - delta_s) / (2*a)
    return (x1, x2)

%timeit list(solve_python(a, b, c))
1 loops, best of 3: 15 s per loop

%timeit list(solve_numpy(a, b, c))
10 loops, best of 3: 105 ms per loop

Wow! Where does this speed-up come from?

Accessing shared resources in Python

. C++ locking skipped on purpose – covered by Danilo

. threading.Lock – the lowest-level synchronization primitive; possible states: released and acquired
. Provides two operations: Lock.acquire(blocking=True) and Lock.release()

. threading.RLock – reentrant lock, can be acquired multiple times by the thread that holds it

. Queue.Queue – synchronized queue for message/object passing
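A minimal sketch of the first two primitives' behaviour: a non-blocking acquire on a Lock, and the reentrancy that distinguishes RLock from Lock:

```python
import threading

lock = threading.Lock()

# Non-blocking acquire returns immediately with True/False.
got_first = lock.acquire(False)    # True: the lock was free
got_second = lock.acquire(False)   # False: already held (Lock is not reentrant)
lock.release()

# RLock may be re-acquired by the thread that already holds it.
rlock = threading.RLock()
with rlock:
    with rlock:                    # fine: recursion count goes to 2
        pass                       # a plain Lock would deadlock here

print(got_first, got_second)  # True False
```

Queue.Queue wraps this kind of locking internally, which is why the example on the next slides never touches a Lock directly.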

Shared resources – example

. Multithreaded application for fetching and processing webpages
. Communication through synchronized queues

import Queue
from threading import Thread
import urllib2
from BeautifulSoup import BeautifulSoup

# hosts = [something, something]

url_queue = Queue.Queue()
html_queue = Queue.Queue()

class FetchThread(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        while True:
            host = self.url_queue.get()
            url = urllib2.urlopen(host)
            chunk = url.read()
            self.html_queue.put(chunk)
            self.url_queue.task_done()

Shared resources – example cont'd

class MineThread(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        while True:
            c = self.html_queue.get()
            soup = BeautifulSoup(c)
            titles = soup.findAll(['title'])
            print(titles)
            self.html_queue.task_done()

def main():
    for i in range(5):
        t = FetchThread(url_queue, html_queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        url_queue.put(host)

    for i in range(5):
        dt = MineThread(html_queue)
        dt.setDaemon(True)
        dt.start()

    url_queue.join()
    html_queue.join()

main()


Multi-core processors and multithreading: part 3 EVOLUTION OF COMPUTING LANDSCAPE IN THE FUTURE

Intel tick-tock model

Intel Xeon Phi

. openlab collaborating since 2008
. PCIe co-processor with 61 cores × 4-way SMT
. 1 TFLOPS peak performance
. 512-bit vectors
. Next generation: 3 times more performance, even more cores, x86-64 compatible, standalone CPU… maybe in desktops?

But… are my applications ready for such massive parallelism?

ARM 64 (AArch64)

. It's all about low power
. 64-bit memory addressing provides support for large memory (>4GB)

. RISC architecture

. Common software ecosystem with x86-64, uses same management standards

. CISC is also expanding in this direction

Figure: energy-efficiency scalability (source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi")

Take-home messages

. Moore’s law is doing fine. Transistors will be invested into more cores, bigger caches and wider vectors (512b)

. NUMA and COD are another “complex stuff” that a programmer has to keep in mind

. Parallelization is possible not only with C++
. Not everything that looks like an improvement gives you better performance (e.g. AVX)

. Multi-threaded applications always require synchronization to protect shared resources

. Auto-vectorization is a speed-up for free
