CTA01++ : multicore and beyond

Source: Xilinx, https://cloud.google.com/tpu/
Johan Peltenburg – [email protected]
Accelerated Big Data Systems group, Quantum & Computer Engineering, Delft University of Technology

26 October 2020 @ Hogeschool Rotterdam

TU Delft (2017 figures):
– 23,461 students
– 2,799 PhD students
– 253 FTE professors
– 3,448 scientists
– 2,385 support staff
– 23 startups / year

Accelerated Big Data Systems

Faculty of Electrical Engineering, Mathematics and Computer Science

Department of Quantum & Computer Engineering:
– Accelerated Big Data Systems
– Computer Engineering
– Network Architecture & Services
– Applied Quantum Architectures, Quantum Communications Lab, Quantum Computer Architecture Lab, Quantum Computing, Quantum Information & Software, Quantum Integration Technology

Table of Contents

● Multicore processors

● Amdahl’s Law in the Multicore Era

● Dark Silicon and the End of Multicore Scaling

● Heterogeneous Computing

Moore’s “Law”[1]

What to do with all these transistors?

Original paper: [1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.

What is a multicore processor?

A multicore processor is a single computing component with two or more independent processing units (called "cores"), which are the units that read and execute program instructions.

Characteristics:
– 2 or more general-purpose processors
– A single component (chip or integrated circuit)
– Shared infrastructure (memory and communication resources)

What is a multicore processor?

Famous examples of multicore:
– Intel Core, Core 2, i3, i5, i7, etc.
– AMD Athlon, Phenom, Ryzen
– Sony CELL microprocessor

Many other multicore examples: Adapteva, Aeroflex, Ageia, Ambric, AMD, Analog Devices, ARM, ASOCS, Azul Systems, ...

(Images: Intel i7, Intel Core 2 Duo, Sony/IBM CELL)

What is a multicore processor?

Is a processor with embedded acceleration circuitry considered a multicore? (e.g. an Intel P4 with SSE)

Are two fully independent processors on a single chip considered a multicore? (e.g. an Intel i7 6950X)

Is a GPU considered a multicore processor? (e.g. an NVIDIA Tesla V100)

The birth of the multicore

What were the reasons to go multicore?

The Power Wall

Total power is the sum of dynamic (switching) power and the power due to leakage current:

$$P = A C V^2 f + V I_{leak}$$

Reducing the supply voltage $V$ cuts dynamic power quadratically, but the threshold voltage must be scaled down with it, and leakage current grows exponentially as the threshold voltage drops:

$$I_{leak} \propto \exp\left(-\frac{q V_{th}}{kT}\right)$$
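A minimal numerical sketch of this trade-off; every constant below is a made-up illustrative value, not a real process parameter:

```python
import math

V_T = 0.026  # thermal voltage kT/q at room temperature, in volts

def dynamic_power(A, C, V, f):
    """Dynamic (switching) power: P_dyn = A * C * V^2 * f."""
    return A * C * V**2 * f

def leakage_current(I0, Vth):
    """Subthreshold leakage: I_leak ~ I0 * exp(-q * Vth / (k * T))."""
    return I0 * math.exp(-Vth / V_T)

# Halving the supply voltage cuts dynamic power by 4x:
print(dynamic_power(0.1, 1e-9, 1.0, 2e9) / dynamic_power(0.1, 1e-9, 0.5, 2e9))  # 4.0

# ...but if the threshold voltage has to scale down with it (0.30 V -> 0.15 V),
# leakage current grows by orders of magnitude:
print(leakage_current(1.0, 0.15) / leakage_current(1.0, 0.30))  # ~320x
```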

Overcoming the Power Wall

● Solution:
– Reduce the frequency
– Duplicate the CPU core

Source: Intel 2006

ILP Wall

● The ’80s: superscalar expansion
– 50% per year improvement in performance
– Pipelined processors
– 10 CPI → 1 CPI
● The ’90s: the era of diminishing returns
– Squeezing out the last bits of instruction-level parallelism
– 2-way to 6-way issue, out-of-order issue, branch prediction
– 1 CPI → 0.5 CPI
● The ’00s: the multicore era
– The need for explicit parallelism

● The ’10s: my guess: the heterogeneous multicore era

CTA01++

● Very deep pipelines – Intel once went up to 31 stages!

● Complex branch predictors – Speculative execution

● Advanced memory hierarchy – E.g. prefetching – E.g. 3 levels of cache

● SIMD extensions
– E.g. SSE, AVX, AVX2, AVX512

Source: https://commons.wikimedia.org

CTA01++

● Out-of-Order execution

● Superscalar execution – Can launch execution of more than one instruction per cycle.

● Simultaneous Multithreading (SMT) – Can run multiple threads on a single core – Intel calls this Hyper-Threading

What do all these improvements cost?

Source: Patterson & Hennessy

Pollack’s Rule[3]

● Pollack’s Rule: performance increase is roughly proportional to the square root of the increase in complexity. REMEMBER THIS FOR LATER!

● Exercise: We double the chip area to make a circuit twice as complex. How much performance do we get?

● We only get a 1.4× (√2) improvement in performance.
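A tiny sketch that makes the rule concrete (illustrative only):

```python
import math

def pollack_perf(complexity):
    """Pollack's Rule: performance grows ~ sqrt(increase in complexity/area)."""
    return math.sqrt(complexity)

print(pollack_perf(2))  # double the area -> ~1.41x performance
print(pollack_perf(4))  # 4x the area     -> only 2x performance
```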

[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.

More reasons for multicore

● Memory Wall (see book)

● Industry push

Table of Contents

● Multicore processors

● Amdahl’s Law in the Multicore Era

● Dark Silicon and the End of Multicore Scaling

● Heterogeneous Computing

Amdahl’s Law

$$S(n) = \frac{1}{(1 - f) + \dfrac{f}{n}}$$

S = speedup w.r.t. a single core, n = number of cores, f = parallel portion of the workload.
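A minimal sketch of the formula, with assumed illustrative values for f and n:

```python
def amdahl_speedup(f, n):
    """S(n) = 1 / ((1 - f) + f / n): speedup over a single core."""
    return 1.0 / ((1.0 - f) + f / n)

# Even a 95%-parallel workload gets nowhere near 16x on 16 cores:
print(amdahl_speedup(0.95, 16))     # ~9.1x
print(amdahl_speedup(0.95, 10**6))  # ~20x: the limit for n -> inf is 1 / (1 - f)
```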

Amdahl’s Law in the Multicore Era[4]

● Resources to build and operate a computer circuit: – Area, capacitive load, frequency, power, money, etc…

● Let’s forget about the specific resource and call it:

● A “Base Core Equivalent” or BCE. – A “Base Core” is the set of resources required to implement the simplest core imaginable that can run our instruction set.

(Baseline: n = 1 BCE, normalized performance = 1)

[4] Hill, M. D., & Marty, M. R. (2008). Amdahl's law in the multicore era. Computer, 41(7).

Exercise

● n: the total number of BCE resources available to our design
● r: the number of BCE resources we use for our single core
● We create an architecture of n = r = 4 BCE.
● How much performance do we get compared to n = r = 1 BCE (performance = 1)?

Performance of an r-BCE core = √r (remember Pollack’s Rule), so we get a relative performance of √4 = 2×.

Symmetric Multicore[4] (1/2)

● Let’s build a multicore system with an n = 16 BCE budget.
● We give every core r = 4 BCE.

● What is the speedup over 1 BCE given we have a parallel portion of f?

$$S_{\text{sym}}(f, n, r) = \frac{1}{\dfrac{1-f}{\sqrt{r}} + \dfrac{f \cdot r}{\sqrt{r} \cdot n}}$$

(√r is the performance per core; n/r is the number of cores.)
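A minimal Python sketch of this model; n = 16 and r = 4 match the slide, while f = 0.9 is an assumed illustrative value:

```python
import math

def perf(r):
    """Pollack's Rule: a core built from r BCEs performs at sqrt(r)."""
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """Hill & Marty [4], symmetric: n/r identical cores of r BCE each."""
    sequential = (1.0 - f) / perf(r)    # sequential part on one r-BCE core
    parallel = f / (perf(r) * (n / r))  # parallel part on all n/r cores
    return 1.0 / (sequential + parallel)

print(symmetric_speedup(0.9, 16, 4))  # ~6.2x: four 4-BCE cores
print(symmetric_speedup(0.9, 16, 1))  # ~6.4x: sixteen 1-BCE cores
```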

Symmetric Multicore[4] (2/2)

Asymmetric Multicore[4] (1/2)

● A multicore system with an n = 16 BCE budget.

● We create one big core of r = 4 BCE.

● We create a small, simple core out of each remaining BCE.
● What is the speedup over 1 BCE given we have a parallel portion of f?

$$S_{\text{asym}}(f, n, r) = \frac{1}{\dfrac{1-f}{\sqrt{r}} + \dfrac{f}{\sqrt{r} + n - r}}$$

Asymmetric Multicore[4] (2/2)
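The same sketch for the asymmetric case (again with an assumed f = 0.9):

```python
import math

def asymmetric_speedup(f, n, r):
    """Hill & Marty [4], asymmetric: one sqrt(r)-performance big core plus
    (n - r) base cores; all of them contribute to the parallel part."""
    sequential = (1.0 - f) / math.sqrt(r)
    parallel = f / (math.sqrt(r) + (n - r))
    return 1.0 / (sequential + parallel)

# Same n = 16 budget: one big r = 4 core plus 12 small ones:
print(asymmetric_speedup(0.9, 16, 4))  # ~8.7x, better than the symmetric ~6.2x
```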

Dynamic Multicore[4] (1/2)

● When performing the non-parallelizable part 1 − f...
– Use all BCEs to form one huge core

● When performing the parallelizable part f...
– Use all BCEs to form many tiny cores (a sketch of the resulting model follows below)
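Hill & Marty [4] model this with the formula below (it is not printed on the slide itself); a minimal sketch with the same assumed f = 0.9 as before:

```python
import math

def dynamic_speedup(f, n):
    """Hill & Marty [4], dynamic: the sequential part runs as one huge
    n-BCE core (performance sqrt(n)); the parallel part as n base cores."""
    return 1.0 / ((1.0 - f) / math.sqrt(n) + f / n)

# n = 16 BCE budget, f = 0.9 (assumed, as in the earlier sketches):
print(dynamic_speedup(0.9, 16))  # ~12.3x, beating both previous designs
```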

Dynamic Multicore[4] (2/2)

What sort of magical device would have such dynamic properties?

Time for a break

Table of Contents

● Multicore processors

● Amdahl’s Law in the Multicore Era

● Dark Silicon and the End of Multicore Scaling

● Heterogeneous Computing

Dennard Scaling[2]

● Once upon a time ...

● As transistors became smaller …

● Their power density stayed constant.

● Moore’s law & Dennard Scaling lived happily ever after… The end?

● Leakage current and threshold voltage were not taken into consideration for Dennard Scaling.

● Broke down around 2006: The Power Wall!

[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.

Dark Silicon[5]

● No matter what chip topology we use (CPU-like / GPU-like)

● We must power off parts of the chip to stay within a power budget.

● At 8 nm, we must power off 50% of the chip continuously to stay within the power budget!

● Limits to speedup in 2024: – Only 7.9× speedup predicted when the paper appeared in 2011! – Shouldn’t we get ~388× according to Moore’s Law?

● Don’t confuse Moore’s Law with performance!

[5] Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on (pp. 365–376). IEEE.

(Image slides – Source: https://www.ibmbigdatahub.com)

Where do we go now?

Table of Contents

● Multicore processors

● Amdahl’s Law in the Multicore Era

● Dark Silicon and the End of Multicore Scaling

● Heterogeneous Computing

Heterogeneous Computing

General Purpose Graphics Processing Unit

● Most mainstream accelerator nowadays: the GPU
– Originally used to render 3D images
– Cores became less specialized over time and can now perform any computation

● Now used in general purpose computing: GPGPU

● Programmable using CUDA/OpenCL – C/C++ like languages

● Widely used in scientific computing, AI, and machine learning

● Top supercomputers make use of GPGPUs. – How many GPGPUs?
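The slides name CUDA and OpenCL (C/C++-like languages) as the programming models; purely as an illustrative Python-side sketch, a NumPy-like GPU library such as CuPy (an assumption of this example, not something the slides mention) can run the same array code on a GPU:

```python
# Hypothetical illustration: assumes an NVIDIA GPU and the CuPy library.
import cupy as cp

x = cp.arange(1_000_000, dtype=cp.float32)  # array allocated in GPU memory
y = cp.sqrt(x) * 2.0                        # executed on the device as CUDA kernels
print(float(y.sum()))                       # copies the scalar result back to the host
```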

ORNL Summit supercomputer
● Processor: IBM POWER9™ (2/node)
● GPUs: 27,648 NVIDIA Volta V100s (6/node)
● Nodes: 4,608
● Node performance: 42 TFlop/s
● Memory/node: 512 GB DDR4 + 96 GB HBM2
● NV memory/node: 1,600 GB
● Total system memory: >10 PB DDR4 + HBM + non-volatile
● Peak power consumption: 13 MW
● ≈ the power consumption of a reasonably sized town
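The question above ("how many GPGPUs?") can be checked against the node figures on this slide:

```python
nodes = 4608
gpus_per_node = 6
print(nodes * gpus_per_node)  # 27648 V100 GPUs, matching the total above
```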

FPGA accelerator trends (1/2)

https://newsroom.intel.com/editorials/intel-fpgas-accelerating-future/

https://www.xilinx.com/applications/high-performance-computing.html

FPGA accelerator trends (2/2)

FPGA advantages

● Great flexibility in solution trade-offs
– Can have many parallel `cores’ (until we run out of resources)
– Can completely tailor the circuit to the application
● Can work with numeric formats that are not supported by GPGPU / CPU
– Arbitrary integer, fixed- & floating-point widths
– E.g. 5-bit integers, float16, posits, etc.
● Dataflow computing
– Many algorithms map naturally
– Don’t require load-store of intermediate values
– Minimal control overhead; no instruction set
– No operating system
● Can interface with I/O directly
– Network controllers – can process data as it travels on the link (filtering, etc.)
– Non-volatile storage – can process data as it travels to disk (compression, error checking, etc.)
– Etc.

Source: [6]

FPGA disadvantages

● Low clock speeds: ~10× lower

● High area overhead compared to a fixed-function IC: ~15× larger

● Combined (10× × 15×), the FPGA circuit itself should be ~150× more efficient than a CPU to come out ahead!
– Requires computer architecture & digital design knowledge
● Hard to program
– Need digital design knowledge
● The ratio of digital design engineers to Python programmers might underflow in float32.
– High-Level Synthesis (typically OpenCL, C, or C++ → VHDL/Verilog or RTL)
● Still requires hardware knowledge to be competitive performance-wise.
● Wrong abstraction for circuits

P&H Turing Lecture

Source: John L. Hennessy and David A. Patterson, ISCA2018 Turing Lecture. Online: https://iscaconf.org/isca2018/turing_lecture.html

Our research

● How to efficiently integrate FPGA accelerators in big data environments?

● ABS open-source projects: – https://github.com/abs-tudelft

● Numerous applications alongside which we develop the project. – We work directly with industry.

● Example: how can we accelerate the processing of billions of DNA samples as fast as possible? – REAL IMPACT!

Summary

● Single-core performance scaling was halted by three walls (power, ILP, and memory); industry pushed for multicore.

● Multicore processors are responsible for the computational performance increases of the last 15 years.

● The end of multicore scaling is near: we will not get much out of improving multicore CPUs due to Dark Silicon.

● Heterogeneous systems are (going to be) dominating the computing industry.

● FPGA accelerators are promising contenders – but still a lot of work to do.

Possible bachelor thesis topics

● If you are highly interested / highly skilled in – Digital logic design / VHDL – C programming – Maths – English

● Work with the latest datacenter-grade FPGA cards – The fastest FPGA accelerator cards on the planet – 460 GB/s to High-Bandwidth Memory

● FPGA design assignment – Design Machine Learning accelerators with FPGA – Design Genomics accelerators with FPGA – Design new components for our open-source hardware ecosystem – Can be with TU Delft or Teratide (spin-off/startup company)

● Contact me: [email protected]

References

[1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.
[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.
[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.
[4] Hill, M. D., & Marty, M. R. (2008). Amdahl's law in the multicore era. Computer, 41(7).
[5] Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on (pp. 365–376). IEEE.

[6] Pell, O., & Averbukh, V. (2012). Maximum performance computing with dataflow engines. Computing in Science & Engineering, 14(4), 98–103.
