CTA01++ : multicore and beyond
Johan Peltenburg ([email protected])
Accelerated Big Data Systems group, Quantum & Computer Engineering, Delft University of Technology
Image sources: Xilinx; https://cloud.google.com/tpu/
26 October 2020 @ Hogeschool Rotterdam
TU Delft (2017 figures): 23,461 students · 2,799 PhD students · 253 FTE professors · 3,448 scientists · 2,385 support staff · 23 startups/year
Accelerated Big Data Systems
Faculty of Electrical Engineering, Mathematics and Computer Science
Department of Quantum & Computer Engineering: – Accelerated Big Data Systems – Computer Engineering – Network Architecture & Services – Applied Quantum Architectures, Quantum Communications Lab, Quantum Computer Architecture Lab, Quantum Computing, Quantum Information & Software, Quantum Integration Technology
Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing
Moore’s “Law”[1]
What to do with all these transistors?
Original paper: [1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.
What is a multicore processor?
A multicore processor is a single computing component with two or more independent processors (called “cores”), which are the units that read and execute program instructions.
Characteristics:
– 2 or more general-purpose processors
– Single component (chip or integrated circuit)
– Shared infrastructure (memory and communication resources)
What is a multicore processor?
Famous examples of multicore:
– Intel Core, Core 2, i3, i5, i7, etc.
– AMD Athlon, Phenom, Ryzen
– Sony CELL microprocessor
Many other multicore examples: Adapteva, Aeroflex, Ageia, Ambric, AMD, Analog Devices, ARM, ASOCS, Azul Systems, ...
Pictured: Intel i7, Intel Core 2 Duo, Sony/IBM CELL
What is a multicore processor?
Is a processor with embedded acceleration circuitry considered a multicore processor?
Are two fully independent processors on a single chip considered multicore? (Pictured: Intel P4 with SSE, Intel i7 6950X)
Is a GPU considered a multicore processor? (Pictured: Nvidia Tesla V100)
The birth of the multicore
What are the reasons to move to multicore?
The Power Wall
Dynamic power consumption plus power due to leakage current:

P = A·C·V²·f + V·I_leak
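As a numeric sketch of the dynamic term P_dyn = A·C·V²·f (all parameter values below are illustrative assumptions, not real chip figures), lowering V and f together shrinks power cubically:

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Dynamic (switching) power: P_dyn = A * C * V^2 * f."""
    return activity * capacitance * voltage**2 * frequency

# Illustrative baseline: A = 0.5, C = 1 nF, V = 1.2 V, f = 3 GHz.
base = dynamic_power(0.5, 1e-9, 1.2, 3e9)
# Scale voltage and frequency down by 20% each:
scaled = dynamic_power(0.5, 1e-9, 1.2 * 0.8, 3e9 * 0.8)
print(scaled / base)  # 0.8**3 = 0.512, i.e. nearly half the dynamic power
```

This cubic V²·f sensitivity is why vendors preferred to lower voltage and frequency and duplicate cores rather than keep pushing clock speed.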
Reduce the supply voltage V, but beware: leakage grows exponentially as the threshold voltage drops, I_leak ∝ exp(−q·V_th / (k·T)).
Overcoming the Power Wall
● Solution:
– Reduce frequency
– Duplicate the CPU core
Source: Intel 2006
ILP Wall
● The ’80s: superscalar expansion
– 50% per year improvement in performance
– Pipelined processors
– 10 CPI → 1 CPI
● The ’90s: the era of diminishing returns
– Squeezing out the last bit of implicit parallelism
– 2-way to 6-way issue, out-of-order issue, branch prediction
– 1 CPI → 0.5 CPI
● The ’00s: the multicore era
– The need for explicit parallelism
● The ’10s: my guess: the heterogeneous multicore era
CTA01++
● Very deep pipelines
– Intel once went up to 31 stages!
● Complex branch predictors
– Speculative execution
● Advanced memory hierarchy
– E.g. prefetching
– E.g. 3 levels of cache
● SIMD extensions
– E.g. SSE, AVX, AVX2, AVX512
Source: https://commons.wikimedia.org
CTA01++
● Out-of-order execution
● Superscalar execution
– Can launch execution of more than one instruction per cycle
● Simultaneous Multithreading (SMT)
– Can run multiple threads on a single core
– Intel calls this HyperThreading
What do all these improvements cost?
Source: Patterson & Hennessy
Pollack’s Rule[3]
● Pollack’s Rule: performance increase is roughly proportional to the square root of the increase in complexity. REMEMBER THIS FOR LATER!
● Exercise: we double the chip area to make a circuit twice as complex. How much performance do we get?
● We only get a 1.4× (√2) improvement in performance.
[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.
More reasons for multicore
● Memory Wall (see book)
● Industry push
Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing
Amdahl’s Law

S(n) = 1 / ((1 − f) + f/n)

S = speedup w.r.t. a single core
n = number of cores
f = parallel portion of the workload
Amdahl’s Law in the Multicore Era[4]
● Resources to build and operate a computer circuit:
– Area, capacitive load, frequency, power, money, etc.
● Let’s forget about the specific resource and call it a “Base Core Equivalent” or BCE.
– A “base core” is the set of resources required to implement the simplest core imaginable that can run our instruction set.
– n = 1 BCE, normalized performance = 1
[4] Hill, M. D., & Marty, M. R. (2008). Amdahl’s law in the multicore era. Computer, 41(7).
Exercise
● n: the total number of BCE resources available to our design
● r: the number of BCE resources we use for our single core
● For n = r = 1 BCE, performance = 1.
● We create an architecture of n = r = 4 BCE. How much performance do we get compared to r = 1 BCE?
● Relative performance = 2×: the performance of an r-BCE core is √r (remember Pollack’s Rule).
Symmetric Multicore[4] (1/2)
● Let’s build a multicore system with an n = 16 BCE budget.
● We give every core r = 4 BCE.
● What is the speedup over 1 BCE given a parallel portion f?

S(n) = 1 / ((1 − f)/√r + f·r/(√r·n))

where √r is the performance per core and n/r is the number of cores.
Symmetric Multicore[4] (2/2)
Asymmetric Multicore[4] (1/2)
● A multicore system with an n = 16 BCE budget.
● We create one big core of r = 4 BCE.
● We create a small, simple core out of each remaining BCE.
● What is the speedup over 1 BCE given a parallel portion f?

S(n) = 1 / ((1 − f)/√r + f/(√r + n − r))
Asymmetric Multicore[4] (2/2)
Dynamic Multicore[4] (1/2)
● When performing the non-parallelizable part (1 − f)...
– Use all BCEs to form one huge core.
● When performing the parallelizable part f...
– Use all BCEs to form many tiny cores.
Dynamic Multicore[4] (2/2)
What sort of magical device would have such dynamic properties?
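The speedup models in this section can be sketched in a few lines of Python: classic Amdahl, plus Hill & Marty’s symmetric and asymmetric variants, with Pollack’s Rule (performance of an r-BCE core = √r) plugged in. The parameter values in the usage lines are the slides’ running example (n = 16, r = 4) with an assumed f = 0.9:

```python
from math import sqrt

def perf(r):
    """Pollack's Rule: performance of a core built from r BCEs."""
    return sqrt(r)

def amdahl(f, n):
    """Classic Amdahl speedup with n equal unit cores."""
    return 1.0 / ((1 - f) + f / n)

def symmetric(f, n, r):
    """Hill & Marty symmetric: n/r cores of r BCEs each."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    """Hill & Marty asymmetric: one r-BCE core plus (n - r) 1-BCE cores."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

print(amdahl(0.9, 16))         # 6.4
print(symmetric(0.9, 16, 4))   # ~6.15: 4 cores, each of performance 2
print(asymmetric(0.9, 16, 4))  # 8.75: 1 big core + 12 small cores
```

Note how, for this f and budget, the asymmetric design beats the symmetric one: the big core speeds up the serial fraction while the small cores still serve the parallel fraction.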
Time for a break
Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing
Dennard Scaling[2]
● Once upon a time...
● As transistors became smaller...
● ...their power density stayed constant.
● Moore’s Law & Dennard Scaling lived happily ever after... The end?
● Leakage current and threshold voltage were not taken into consideration for Dennard Scaling.
● It broke down around 2006: the Power Wall!
[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.
Dark Silicon[5]
● No matter what chip topology we use (CPU-like / GPU-like)...
● ...we must power off parts of the chip to stay within a power budget.
● At 8 nm, we must power off 50% of the chip continuously to stay within the power budget!
● Limits to speedup in 2024:
– Only 7.9× predicted when the paper appeared in 2011!
– Shouldn’t we get ~388× according to Moore’s Law?
● Don’t confuse Moore’s Law with performance!
[5] Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA) (pp. 365–376). IEEE.
Source: https://www.ibmbigdatahub.com
Where do we go now?
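The gap between the predicted 7.9× speedup and Moore’s Law can be made concrete with a back-of-the-envelope extrapolation. The 18-month doubling period assumed below is an approximation; the paper’s exact growth model differs slightly (hence ~406× here versus the slide’s ~388×):

```python
# Transistor-count growth from 2011 to 2024, assuming a doubling
# every 18 months (an approximation of Moore's Law).
years = 2024 - 2011
doubling_period_years = 1.5
growth = 2 ** (years / doubling_period_years)
print(round(growth))  # ~406x transistors, yet only ~7.9x speedup predicted
```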
Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing
Heterogeneous Computing
General Purpose Graphics Processing Unit
● Most mainstream accelerator nowadays: the GPU
– Originally used to render 3D images
– Cores became less specialized and can now do any computation
● Now used in general-purpose computing: GPGPU
● Programmable using CUDA/OpenCL
– C/C++-like languages
● Widely used in scientific computing, AI, and machine learning
● Top supercomputers make use of GPGPUs.
– How many GPGPUs?
ORNL Summit supercomputer
● Processor: IBM POWER9™ (2/node)
● GPUs: 27,648 NVIDIA Volta V100s (6/node)
● Nodes: 4,608
● Node performance: 42 TFlop/s
● Memory/node: 512 GB DDR4 + 96 GB HBM2
● NV memory/node: 1600 GB
● Total system memory: >10 PB DDR4 + HBM + non-volatile
● Peak power consumption: 13 MW
– ~ the power consumption of a reasonably sized town
FPGA accelerator trends (1/2)
https://newsroom.intel.com/editorials/intel-fpgas-accelerating-future/
https://www.xilinx.com/applications/high-performance-computing.html
FPGA accelerator trends (2/2)
FPGA advantages
● Great flexibility in solution trade-offs
– Can have many parallel ‘cores’ (until we run out of resources)
– Can completely tailor the circuit to the application
● Can work with numeric formats that are not supported by GPGPU / CPU
– Arbitrary integer, fixed-point & floating-point widths
– E.g. 5-bit integers, float16, posits, etc.
● Can interface with I/O directly
– Network controllers: can process data as it travels on the link (filtering, etc.)
– Non-volatile storage: can process data as it travels to disk (compression, error checking, etc.)
– Etc.
● Dataflow computing
– Many algorithms map naturally
– Don’t require load-store of intermediate values
– Minimal control overhead; no instruction set
– No operating system
Source: [6]
FPGA disadvantages
● Low clock speeds: ~10× lower
● High area overhead compared to a fixed-function IC: density ~15× lower
● The FPGA circuit itself should therefore be ~150× more efficient than the CPU!
– Requires computer architecture & digital design knowledge
● Hard to program
– Needs digital design knowledge
– The ratio of digital design engineers to Python programmers might underflow in float32.
– High-Level Synthesis (typically OpenCL, C or C++ → VHDL/Verilog or RTL)
● Still requires hardware knowledge to be competitive performance-wise
● Wrong abstraction for circuits
P&H Turing Lecture
Source: John L. Hennessy and David A. Patterson, ISCA 2018 Turing Lecture. Online: https://iscaconf.org/isca2018/turing_lecture.html
Our research
● How do we efficiently integrate FPGA accelerators in big data environments?
● ABS open-source projects:
– https://github.com/abs-tudelft
● Numerous applications alongside which we develop the projects.
– We work directly with industry.
● Example: how can we accelerate the processing of billions of DNA samples as fast as possible? REAL IMPACT!
Summary
● Single-core performance was halted by three walls (power, ILP and memory); industry pushed for multicore.
● Multicore processors are responsible for the computational performance increase over the last 15 years.
● The end of multicore scaling is near: we will not get much out of improving multicore CPUs due to Dark Silicon.
● Heterogeneous systems are (going to be) dominating the computing industry.
● FPGA accelerators are promising contenders, but there is still a lot of work to do.
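As a quick arithmetic cross-check of the Summit figures quoted above (using only the slide’s own numbers, no external data):

```python
# Slide figures: 27,648 V100 GPUs at 6 per node, 42 TFlop/s per node.
gpus_total = 27648
gpus_per_node = 6
tflops_per_node = 42

nodes = gpus_total // gpus_per_node
peak_pflops = nodes * tflops_per_node / 1000
print(nodes, peak_pflops)  # 4608 nodes, ~193.5 PFlop/s aggregate peak
```

The derived node count matches the slide’s 4,608, and the aggregate peak lands near Summit’s widely quoted ~200 PFlop/s.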
Possible bachelor thesis topics
● If you are highly interested / highly skilled in:
– Digital logic design / VHDL
– C programming
– Maths
– English
● Work with the latest datacenter-grade FPGA cards
– The fastest FPGA accelerator cards on the planet
– 460 GB/s to High-Bandwidth Memory
● FPGA design assignments:
– Design machine learning accelerators with FPGAs
– Design genomics accelerators with FPGAs
– Design new components for our open-source hardware ecosystem
– Can be with TU Delft or Teratide (spin-off/startup company)
● Contact me: [email protected]
References
[1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.
[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.
[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.
[4] Hill, M. D., & Marty, M. R. (2008). Amdahl’s law in the multicore era. Computer, 41(7).
[5] Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA) (pp. 365–376). IEEE.
[6] Pell, O., & Averbukh, V. (2012). Maximum performance computing with dataflow engines. Computing in Science & Engineering, 14(4), 98–103.