CTA01++ : multicore and beyond
Johan Peltenburg [email protected]
Accelerated Big Data Systems group, Quantum & Computer Engineering
Delft University of Technology
26 October 2020 @ Hogeschool Rotterdam
Sources: Xilinx, https://cloud.google.com/tpu/

TU Delft (2017 figures)
● 23 461 students
● 2 799 PhD students
● 253 FTE professors
● 3 448 scientists
● 2 385 supporting staff
● 23 startups / year

Accelerated Big Data Systems
Faculty of Electrical Engineering, Mathematics and Computer Science
Department of Quantum & Computer Engineering:
– Accelerated Big Data Systems
– Computer Engineering
– Network Architecture & Services
– Applied Quantum Architectures, Quantum Communications Lab, Quantum Computer Architecture Lab, Quantum Computing, Quantum Information & Software, Quantum Integration Technology

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Moore’s “Law” [1]
What to do with all these transistors?
Original paper: [1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.

What is a multicore processor?
A multicore processor is a single computing component with two or more independent processors (called “cores”), which are the units that read and execute program instructions.
Characteristics:
– 2 or more general-purpose processors
– A single component (chip or integrated circuit)
– Shared infrastructure (memory and communication resources)

What is a multicore processor?
Famous examples of multicore:
– Intel Core, Core 2, i3, i5, i7, etc.
– AMD Athlon, Phenom, Ryzen
– Sony CELL microprocessor
Many other multicore vendors: Adapteva, Aeroflex, Ageia, Ambric, AMD, Analog Devices, ARM, ASOCS, Azul Systems, …
Pictured: Intel i7, Intel Core 2 Duo, Sony/IBM CELL

What is a multicore processor?
● Is a processor with embedded acceleration circuitry (e.g. Intel P4 with SSE) considered a multicore?
● Are two fully independent processors on a single chip (e.g. Intel i7 6950X) considered a multicore?
● Is a GPU (e.g. Nvidia Tesla V100) considered a multicore processor?

The birth of the multicore
What reasons are there to go to multicore?

The Power Wall
P = A·C·V²·f + V·I_leak
(dynamic power consumption + power due to leakage current)
● Reducing the supply voltage V lowers dynamic power, but limits the maximum frequency:
f_max ∝ (V − V_th)^α / V, with 1 < α < 2
● Reducing the threshold voltage V_th to keep the frequency up makes leakage grow exponentially:
I_leak ∝ exp(−q·V_th / (k·T))

Overcoming the Power Wall
● Solution:
– Reduce the frequency
– Duplicate the CPU core
Source: Intel 2006

ILP Wall
● The ’80s: superscalar expansion
– 50% per year improvement in performance
– Pipelined processors
– 10 CPI → 1 CPI
● The ’90s: the era of diminishing returns
– Squeezing out the last bit of implicit parallelism
– 2-way to 6-way issue, out-of-order issue, branch prediction
– 1 CPI → 0.5 CPI
● The ’00s: the multicore era
– The need for explicit parallelism
● The ’10s: my guess: the heterogeneous multicore era

CTA01++
● Very deep pipelines
– Intel once went up to 31 stages!
● Complex branch predictors
– Speculative execution
● Advanced memory hierarchy
– E.g. prefetching
– E.g. 3 levels of cache
● SIMD extensions
– E.g. SSE, AVX, AVX2, AVX512
Source: https://commons.wikimedia.org

CTA01++
● Out-of-order execution
● Superscalar execution
– Can launch execution of more than one instruction per cycle
● Simultaneous Multithreading (SMT)
– Can run multiple threads on a single core
– Intel calls this HyperThreading
What do all these improvements cost?
Source: Patterson & Hennessy
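A minimal Python sketch of the dynamic-power term P = A·C·V²·f from the Power Wall slide makes the “reduce the frequency, duplicate the core” trade concrete. The values chosen for A, C, V and f below are purely illustrative, the assumption that the supply voltage scales down linearly with frequency is an idealization, and leakage power is ignored.

```python
# Illustrative sketch (not from the slides): why "reduce frequency, duplicate the
# core" can beat one fast core under the dynamic-power model P = A * C * V^2 * f.
# A, C, V and f are hypothetical values; linear voltage/frequency scaling is an
# idealization and leakage power is ignored.

def dynamic_power(a, c, v, f):
    """Dynamic power of one core: activity factor * capacitance * V^2 * f."""
    return a * c * v ** 2 * f

A, C = 0.1, 1e-9   # hypothetical activity factor and switched capacitance (F)
V, F = 1.2, 3e9    # hypothetical supply voltage (V) and clock frequency (Hz)

p_single = dynamic_power(A, C, V, F)

# Two cores, each at half the frequency and (idealized) proportionally lower
# voltage, deliver roughly the same aggregate throughput.
p_dual = 2 * dynamic_power(A, C, V / 2, F / 2)

print(f"one core  @ {F / 1e9:.1f} GHz: {p_single:.3f} W")
print(f"two cores @ {F / 2e9:.1f} GHz: {p_dual:.3f} W "
      f"({p_dual / p_single:.0%} of the single-core power)")
```

In this idealized model the dual-core configuration delivers roughly the same aggregate throughput at about a quarter of the dynamic power, which is exactly the trade the multicore era makes.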
Pollack’s Rule [3]
● Pollack’s Rule: performance increase is roughly proportional to the square root of the increase in complexity. REMEMBER THIS FOR LATER!
● Exercise: we double the chip area to make a circuit twice as complex. How much performance do we get?
● We only get a 1.4× (√2) improvement in performance.
[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.

More reasons for multicore
● Memory Wall (see book)
● Industry push

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Amdahl’s Law
S(n) = 1 / ((1 − f) + f/n)
S = speedup w.r.t. a single core
n = number of cores
f = parallel portion of the workload

Amdahl’s Law in the Multicore Era [4]
● Resources to build and operate a computer circuit: area, capacitive load, frequency, power, money, etc.
● Let’s forget about the specific resource and call it a “Base Core Equivalent” or BCE.
– A “Base Core” is the set of resources required to implement the simplest core imaginable that can run our instruction set.
– n = 1 BCE: normalized performance = 1
[4] Hill, M. D., & Marty, M. R. (2008). Amdahl’s law in the multicore era. Computer, 41(7).

Exercise
● n: the total number of BCE resources available to our design
● r: the number of BCE resources we use for our single core
● We create an architecture of r = 4 BCE.
● How much performance do we get compared to r = 1 BCE?
● Performance of an r-BCE core = √r (remember Pollack’s Rule), so:
– n = r = 1 BCE: performance = 1
– n = r = 4 BCE: relative performance = 2×

Symmetric Multicore [4] (1/2)
● Starting from Amdahl’s law, S(n) = 1 / ((1 − f) + f/n), let’s build a multicore system with an n = 16 BCE budget.
● We give every core r = 4 BCE.
● What is the speedup over 1 BCE, given a parallel portion f?
S(n) = 1 / ((1 − f)/√r + f·r/(√r·n))
where √r is the performance per core and n/r is the number of cores.

Symmetric Multicore [4] (2/2)

Asymmetric Multicore [4] (1/2)
● A multicore system with an n = 16 BCE budget.
● We create one big core of r = 4 BCE.
● We create a small, simple core out of each remaining BCE.
● What is the speedup over 1 BCE, given a parallel portion f?
S(n) = 1 / ((1 − f)/√r + f/(√r + n − r))

Asymmetric Multicore [4] (2/2)

Dynamic Multicore [4] (1/2)
● When performing the non-parallelizable part 1 − f:
– Use all BCEs to form one huge core.
● When performing the parallelizable part f:
– Use all BCEs to form many tiny cores.

Dynamic Multicore [4] (2/2)
What sort of magical device would have such dynamic properties?

Time for a break

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Dennard Scaling [2]
● Once upon a time...
● As transistors became smaller...
● ...their power density stayed constant.
● Moore’s law & Dennard scaling lived happily ever after... The end?
● Leakage current and threshold voltage were not taken into account in Dennard scaling.
● It broke down around 2006: the Power Wall!
[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.
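Before turning to dark silicon, the Amdahl and Hill & Marty formulas from the slides above can be put side by side in a short Python sketch. It models the performance of an r-BCE core as √r per Pollack’s Rule and uses the n = 16, r = 4 budget from the exercise; the parallel fractions f below are illustrative choices, and the dynamic formula simply applies the same construction described on the slides (serial part on one n-BCE core, parallel part on n base cores).

```python
# Sketch of the speedup models from the Amdahl / Hill & Marty slides.
# perf(r) = sqrt(r) per Pollack's Rule; n and r are in Base Core Equivalents (BCE).
import math

def perf(r):
    """Performance of a core built from r BCEs (Pollack's Rule)."""
    return math.sqrt(r)

def amdahl(f, n):
    """Classic Amdahl's Law: n identical 1-BCE cores."""
    return 1.0 / ((1 - f) + f / n)

def symmetric(f, n, r):
    """n-BCE budget split into n/r cores of r BCEs each."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    """One r-BCE core plus (n - r) single-BCE cores."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n):
    """Serial part on one n-BCE core, parallel part on n single-BCE cores."""
    return 1.0 / ((1 - f) / perf(n) + f / n)

# The n = 16 BCE budget with r = 4 BCE cores from the exercise:
for f in (0.5, 0.9, 0.99):   # illustrative parallel fractions
    print(f"f={f:.2f}  amdahl={amdahl(f, 16):5.2f}  sym={symmetric(f, 16, 4):5.2f}  "
          f"asym={asymmetric(f, 16, 4):5.2f}  dyn={dynamic(f, 16):5.2f}")
```

For these inputs the asymmetric and dynamic organizations outperform the symmetric one, which matches the qualitative message of the Hill & Marty paper.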
Dark Silicon [5]
● No matter what chip topology we use (CPU-like / GPU-like),
● we must power off parts of the chip to stay within a power budget.
● At 8 nm, we must power off 50% of the chip continuously to stay within the power budget!
● Limits to speedup in 2024:
– Only 7.9× predicted when the paper appeared in 2011!
– Shouldn’t we get ~388× according to Moore’s Law?
● Don’t confuse Moore’s Law with performance!
[5] Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA) (pp. 365–376). IEEE.

[Figures, Source: https://www.ibmbigdatahub.com]

Where do we go now?

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Heterogeneous Computing

General Purpose Graphics Processing Unit
● Most mainstream accelerator nowadays: the GPU
– Originally used to render 3D images
– Cores got less specialized and can now do any computation
● Now used in general-purpose computing: GPGPU
● Programmable using CUDA/OpenCL
– C/C++-like languages
● Widely used in scientific computing, AI and machine learning
● Top supercomputers make use of GPGPUs.
– How many GPGPUs?

ORNL Summit supercomputer
● Processor: IBM POWER9™ (2/node)
● GPUs: 27,648 NVIDIA Volta V100s (6/node)
● Nodes: 4,608
● Node performance: 42 TFlop/s
● Memory/node: 512 GB DDR4 + 96 GB HBM2
● NV memory/node: 1,600 GB
● Total system memory: >10 PB DDR4 + HBM + non-volatile
● Peak power consumption: 13 MW
– ~ the power consumption of a reasonably sized town

FPGA accelerator trends (1/2)
https://newsroom.intel.com/editorials/intel-fpgas-accelerating-future/
https://www.xilinx.com/applications/high-performance-computing.html

FPGA accelerator trends (2/2)

FPGA advantages
● Great flexibility in solution trade-offs
– Can have many parallel `cores’ (until we run out of resources)
– Can completely tailor the circuit to the application
● Can work with numeric formats that are not supported by GPGPU / CPU
– Arbitrary integer, fixed- and floating-point widths
– E.g. 5-bit integers, float16, posits, etc.
● Can interface with I/O directly
– Network controllers: can process data as it travels on the link (filtering, etc.)
– Non-volatile storage: can process data as it travels to disk (compression, error checking, etc.)
– Etc.
● Dataflow computing
– Many algorithms map naturally
– Don’t require load-store of intermediate values
– Minimal control overhead; no instruction set
– No operating system
Source: [6]

FPGA disadvantages
● Low clock speeds: ~10× lower
● High area overhead compared to a fixed-function IC: ~15× lower density
● The FPGA circuit itself should be ~150× more efficient than the CPU!
– Requires computer architecture & digital design knowledge
● Hard to program
– Need digital design knowledge
● The ratio of digital design engineers to Python programmers might underflow in float32.
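The ~150× figure on the last slide reads as the product of the two handicaps above it; a tiny Python sketch makes that arithmetic explicit. The factors are the rough ones quoted on the slide, and real penalties vary widely per device and design.

```python
# Back-of-the-envelope check of the FPGA disadvantages slide. The factors are the
# rough ones quoted on the slide; actual penalties vary per device and design.
clock_penalty = 10   # FPGA clock is roughly 10x lower than a CPU's
area_penalty = 15    # roughly 15x more area than a fixed-function IC for the same logic

# To break even, the FPGA circuit must recover both factors through parallelism
# and specialization of the datapath.
required_gain = clock_penalty * area_penalty
print(f"The FPGA circuit should be ~{required_gain}x more efficient to come out ahead")
```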