Dark Silicon and the End of Multicore Scaling


CTA01++ : multicore and beyond
Johan Peltenburg — [email protected]
Accelerated Big Data Systems group, Quantum & Computer Engineering
Delft University of Technology
26 October 2020 @ Hogeschool Rotterdam
Image sources: Xilinx; https://cloud.google.com/tpu/

TU Delft (2017 figures)
● 23,461 students
● 2,799 PhD students
● 253 FTE professors
● 3,448 scientists
● 2,385 supporting staff
● 23 startups / year

Accelerated Big Data Systems
Faculty of Electrical Engineering, Mathematics and Computer Science
Department of Quantum & Computer Engineering:
– Accelerated Big Data Systems
– Computer Engineering
– Network Architecture & Services
– Applied Quantum Architectures, Quantum Communications Lab, Quantum Computer Architecture Lab, Quantum Computing, Quantum Information & Software, Quantum Integration Technology

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Moore’s “Law” [1]
What to do with all these transistors?
[1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.

What is a multicore processor?
A multicore processor is a single computing component with two or more independent actual processors (called “cores”), which are the units that read and execute program instructions.
Characteristics:
– 2 or more general-purpose processors
– Single component (chip or integrated circuit)
– Shared infrastructure (memory and communication resources)

Famous examples of multicore:
– Intel Core, Core 2, i3, i5, i7, etc.
– AMD Athlon, Phenom, Ryzen
– Sony/IBM CELL microprocessor
Many other multicore vendors: Adapteva, Aeroflex, Ageia, Ambric, AMD, Analog Devices, ARM, ASOCS, Azul Systems, …
(Pictured: Intel i7, Intel Core 2 Duo, Sony/IBM CELL)

Is a processor with embedded acceleration circuitry considered a multicore?
Are two fully independent processors on a single chip considered a multicore? (Intel P4 with SSE; Intel i7 6950X)
Is a GPU considered a multicore processor? (Nvidia Tesla V100)

The birth of the multicore
What reasons are there to go to multicore?

The Power Wall
Power consumption = dynamic power + power due to leakage current:
  P = A·C·V²·f + V·I_leak
Leakage current grows as the threshold voltage shrinks:
  I_leak ∝ exp(−q·V_th / kT)
The maximum frequency depends on the supply and threshold voltages:
  f_max ∝ (V − V_th)^α / V,  with 1 < α < 2
We can reduce the supply voltage V, or reduce the threshold voltage V_th — but lowering V limits f_max, and lowering V_th increases leakage.

Overcoming the Power Wall
● Solution:
– Reduce frequency
– Duplicate the CPU core
Source: Intel 2006

The ILP Wall
● The ’80s: superscalar expansion
– 50% per year improvement in performance
– Pipelined processors
– 10 CPI → 1 CPI
● The ’90s: the era of diminishing returns
– Squeezing out the last bits of implicit parallelism
– 2-way to 6-way issue, out-of-order issue, branch prediction
– 1 CPI → 0.5 CPI
● The ’00s: the multicore era
– The need for explicit parallelism
● The ’10s: my guess: the heterogeneous multicore era

CTA01++
● Very deep pipelines
– Intel once went up to 31 stages!
● Complex branch predictors
– Speculative execution
● Advanced memory hierarchy
– E.g. prefetching
– E.g. 3 levels of cache
● SIMD extensions
– E.g. SSE, AVX, AVX2, AVX-512
Source: https://commons.wikimedia.org
● Out-of-order execution
● Superscalar execution
– Can launch execution of more than one instruction per cycle
● Simultaneous Multithreading (SMT)
– Can run multiple threads on a single core
– Intel calls this HyperThreading
What do all these improvements cost?
Source: Patterson & Hennessy

Pollack’s Rule [3]
● Pollack’s Rule: performance increase is roughly proportional to the square root of the increase in complexity. REMEMBER THIS FOR LATER!
● Exercise: we double the chip area to make a circuit twice as complex. How much performance do we get?
● We only get a 1.4× (√2) improvement in performance.
[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.
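Pollack’s Rule and the exercise above can be checked in a few lines. A minimal sketch, assuming the square-root-of-complexity model from the slide (the function name is mine):

```python
import math

def pollack_performance(complexity):
    """Pollack's Rule: performance grows as the square root of complexity."""
    return math.sqrt(complexity)

# Doubling the chip area (2x complexity) yields only ~1.41x performance.
speedup = pollack_performance(2.0) / pollack_performance(1.0)
print(round(speedup, 2))  # -> 1.41
```

This diminishing return is why simply building ever-bigger single cores stopped paying off.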
More reasons for multicore
● Memory Wall (see book)
● Industry push

Amdahl’s Law
  S(n) = 1 / ((1 − f) + f/n)
where:
  S = speedup w.r.t. a single core
  n = number of cores
  f = parallel portion of the workload

Amdahl’s Law in the Multicore Era [4]
● Resources to build and operate a computer circuit: area, capacitive load, frequency, power, money, etc.
● Let’s forget about the specific resource and call it a “Base Core Equivalent”, or BCE.
– A “base core” is the set of resources required to implement the simplest core imaginable that can run our instruction set: n = 1 BCE, normalized performance = 1.
[4] Hill, M. D., & Marty, M. R. (2008). Amdahl’s law in the multicore era. Computer, 41(7).

Exercise
● n: the total number of BCE resources available to our design
● r: the number of BCE resources we use for our single core
● We create an architecture of r = 4 BCE. How much performance do we get compared to r = 1 BCE?
● Performance of an r-BCE core = √r (remember Pollack’s Rule), so relative performance = √4 = 2.

Symmetric Multicore [4] (1/2)
● Let’s build a multicore system with an n = 16 BCE budget.
● We give every core r = 4 BCE, giving n/r cores.
● What is the speedup over 1 BCE, given a parallel portion f?
  S(n) = 1 / ((1 − f)/√r + f·r/(√r·n))
The first term is the sequential part running on one core of performance √r; the second is the parallel part spread over n/r cores of performance √r each.

Symmetric Multicore [4] (2/2)
(Figure: speedup curves for symmetric multicore configurations.)

Asymmetric Multicore [4] (1/2)
● A multicore system with an n = 16 BCE budget.
● We create one big core of r = 4 BCE.
● We create a small, simple core out of each remaining BCE.
● What is the speedup over 1 BCE, given a parallel portion f?
  S(n) = 1 / ((1 − f)/√r + f/(√r + n − r))

Asymmetric Multicore [4] (2/2)
(Figure: speedup curves for asymmetric multicore configurations.)

Dynamic Multicore [4] (1/2)
● When performing the non-parallelizable part (1 − f)…
– Use all BCE to form one huge core
● When performing the parallelizable part f…
– Use all BCE to form many tiny cores

Dynamic Multicore [4] (2/2)
What sort of magical device would have such dynamic properties?

Time for a break

Dennard Scaling [2]
● Once upon a time… as transistors became smaller… their power density stayed constant.
● Moore’s Law & Dennard Scaling lived happily ever after… The end?
● Leakage current and threshold voltage were not taken into consideration for Dennard Scaling.
● Broke down around 2006: the Power Wall!
[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.

Dark Silicon [5]
● No matter what chip topology we use (CPU-like / GPU-like), we must power off parts of the chip to stay within a power budget.
● At 8 nm, we must power off 50% of the chip continuously to stay within the power budget!
● Limits to speedup in 2024:
– Only 7.9× predicted when the paper appeared in 2011!
– Shouldn’t we get ~388× according to Moore’s Law?
● Don’t confuse Moore’s Law with performance!
[5] Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA) (pp. 365–376). IEEE.

(Figure slides. Source: https://www.ibmbigdatahub.com)

Where do we go now?
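The symmetric, asymmetric, and dynamic speedups from the Hill & Marty slides can be evaluated numerically. A minimal sketch, assuming perf(r) = √r per Pollack’s Rule; the function names are mine, and the dynamic formula assumes the serial part runs on a single core fused from all n BCE:

```python
import math

def perf(r):
    # Pollack's Rule: an r-BCE core has performance sqrt(r)
    return math.sqrt(r)

def speedup_symmetric(f, n, r):
    # n/r cores of r BCE each
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f, n, r):
    # one big r-BCE core plus (n - r) single-BCE cores
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def speedup_dynamic(f, n):
    # serial part: one core fused from all n BCE (perf sqrt(n));
    # parallel part: n single-BCE cores
    return 1.0 / ((1 - f) / perf(n) + f / n)

# n = 16 BCE budget, r = 4 BCE big cores, f = 0.9 parallel portion
print(round(speedup_symmetric(0.9, 16, 4), 2))   # -> 6.15
print(round(speedup_asymmetric(0.9, 16, 4), 2))  # -> 8.75
print(round(speedup_dynamic(0.9, 16), 2))        # -> 12.31
```

For the same 16-BCE budget, the more flexible the configuration, the higher the speedup — exactly the ordering that motivates the “magical device” question above.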
Heterogeneous Computing

General-Purpose Graphics Processing Unit
● Most mainstream accelerator nowadays: the GPU
– Originally used to render 3D images
– Cores got less specialized and could eventually do any computation
● Now used in general-purpose computing: GPGPU
● Programmable using CUDA/OpenCL
– C/C++-like languages
● Widely used in scientific computing, AI, and machine learning
● Top supercomputers make use of GPGPUs. How many GPGPUs?

ORNL Summit supercomputer
● Processor: IBM POWER9™ (2/node)
● GPUs: 27,648 NVIDIA Volta V100s (6/node)
● Nodes: 4,608
● Node performance: 42 TFlop/s
● Memory/node: 512 GB DDR4 + 96 GB HBM2
● NV memory/node: 1600 GB
● Total system memory: >10 PB DDR4 + HBM + non-volatile
● Peak power consumption: 13 MW
– ~ the power consumption of a reasonably sized town

FPGA accelerator trends (1/2)
https://newsroom.intel.com/editorials/intel-fpgas-accelerating-future/
https://www.xilinx.com/applications/high-performance-computing.html

FPGA accelerator trends (2/2)

FPGA advantages
● Great flexibility in solution trade-offs
– Can have many parallel ‘cores’ (until we run out of resources)
– Can completely tailor the circuit to the application
● Can work with numeric formats that are not supported by GPGPU / CPU
– Arbitrary integer, fixed- & floating-point widths
– E.g. 5-bit integers, float16, posits, etc.
● Can interface with I/O directly
– Network controllers: can process data as it travels on the link (filtering, etc.)
– Non-volatile storage: can process data as it travels to disk (compression, error checking, etc.)
● Dataflow computing
– Many algorithms map naturally
– Doesn’t require load-store of intermediate values
– Minimal control overhead; no instruction set
– No operating system
Source: [6]

FPGA disadvantages
● Low clock speeds: ~10× lower
● High area overhead compared to a fixed-function IC: ~15×
● So the FPGA circuit itself should be ~150× more efficient than the CPU!
● Hard to program
– Requires computer architecture & digital design knowledge
● Ratio of digital design engineers to Python programmers might underflow in float32.
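The ~150× figure above is simply the product of the two overheads: an FPGA implementation must make up for both its clock and its area penalty before it beats a CPU. A back-of-the-envelope sketch, using the rough factors from the slide:

```python
clock_penalty = 10  # FPGA clock roughly 10x lower than a CPU's
area_penalty = 15   # roughly 15x area overhead vs. a fixed-function IC

# To break even, the FPGA circuit must do ~150x more useful work
# per clock cycle per unit of silicon than the CPU implementation.
required_efficiency = clock_penalty * area_penalty
print(required_efficiency)  # -> 150
```

This is why FPGAs only win on workloads where massive parallelism, tailored datapaths, or direct I/O processing deliver efficiency gains of that order.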
