INF5063: Programming Heterogeneous Multi-Core Processors

INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 25th, 2015 INF5063 University of Oslo INF5063 Overview § Course topic and scope § Background For the use and parallel processing using heterogeneous multi-core processors § Examples oF heterogeneous architectures § Vector Processing University of Oslo INF5063 INF5063: The Course People § Håkon Kvale Stensland email: haakonks @ ifi § Carsten Griwodz email: griFF @ ifi § ProFessor Pål Halvorsen email: paalh @ ifi § Guest lectures From Dolphin Interconnect Solutions Hugo Kohmann & Roy Nordstrøm University of Oslo INF5063 Time and place § Lectures: Tuesday 09:00 - 16:00 Storstua (Simula Research Laboratory) − August 25th − September 22nd − October 20th − November 24st § Group exercises: No group exercises in this course! University of Oslo INF5063 Plan For Today (Session 1) 10:15 – 11:00: Course Introduction 11:00 – 11:15: Break 11:15 – 12:00: Introduction to SSE/AVX 12:00 – 13:00: Lunch (Will be provided by Simula) 13:00 – 13:45: Introduction to Video Processing 13:45 – 14:00: Break 14:00 – 14:45: Using SIMD For Video Processing 14:45 – 15:00: Break 15:00 – 15:45: Codec 63 (c63) Home Exam Precode 15:45 – 16:00: Home Exam 1 is presented University of Oslo INF5063 About INF5063: Topic & Scope § Content: The course gives … − … an overview oF heterogeneous multi-core processors in general and three variants in particular and a modern general-purpose core (architectures and use) − … an introduction to working with heterogeneous multi-core processors • SSEx / AVXx for x86 • Nvidia’s Family oF GPUs and the CUDA programming Framework • Multiple machines connected with Dolphin PCIe links − … some ideas oF how to use/program heterogeneous multi-core processors University of Oslo INF5063 About INF5063: Topic & Scope § Tasks: The important part of the course is lab-assignments where you program each oF the three examples oF heterogeneous multi-core processors § 3 graded home exams (counting 33% each): − Deliver code and make a demonstration explaining your design and code to the class 1. On the x86 • Video encoding – Improve the perFormance of video compression by using SSE instructions. 2. On the Nvidia graphics cards • Video encoding – Improve the perFormance oF video compression using the GPU architecture. 3. On the distributed systems • Video encoding – The same as above, but exploit the parallelism on multiple GPUs and computers connected with Dolphin PCIe links. § Students will be working together in groups oF two. Try to Find a partner during the session today! § Competition at the end oF the course! Have the Fastest implementation oF the code! University of Oslo INF5063 Background and Motivation: Moore’s Law Motivation: Intel View 2015: § >billion transistors integrated •8,0 billion - nVIDIA GM200 (Maxwell) •8,1 billion - AMD Fury (Pirate Islands) 1971: • 2,300 - Intel 4004 University of Oslo INF5063 Motivation: Intel View § >billion transistors integrated § Clock Frequency has not increased since 2006 2015 (Still): • 5 GHz: IBM Power 6 & 8 University of Oslo INF5063 Motivation: Intel View § >billion transistors integrated § Clock Frequency has not increased since 2006 § Power? Heat? University of Oslo INF5063 Motivation: Intel View § Soon >billion transistors integrated § Clock Frequency can still increase § Future applications will demand TIPS § Power? Heat? University of Oslo INF5063 Motivation “Future applications will demand more instructions per second” “Think platform beyond a single processor” “Exploit concurrency at multiple levels” “Power will be the limiter due to complexity and leakage” Distribute workload on multiple cores University of Oslo INF5063 Multicores! Symmetric Multi-Core Processors Intel Haswell University of Oslo INF5063 Symmetric Multi-Core Processors AMD “Piledriver” University of Oslo INF5063 Symmetric Multi-Core Processors § Good − Growing computational power § Problematic − Growing die sizes − Unused resources • Some cores used much more than others • Many core parts Frequently unused § Why not spread the load better? ⇒ Heterogeneous Architectures! University of Oslo INF5063 x86 – heterogeneous? Intel Haswell Core Architecture Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end 9x9d Instruction Queue µcode Complex Simple Simple Simple Haswell Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU Fast LEA AGU AGU AGU Execution Shift VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store FMA FMA VALU FShuffle Branch Data FBlend FADD VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB 9XgKB L9 Cache GQ wayU University of Oslo INF5063 Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end Intel Haswell Architecture9x9d Instruction Queue µcode Complex Simple Simple Simple Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU • THUS, an x86 is deFinitelyFast parallel LEA and hasAGU heterogeneousAGU AGU Execution(internal)Shift cores.VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store • The coresFMA have manyFMA (hidden)VALU FShuffle complicatingBranch Factors. One oF these Data are out-ofFBlend-orderFADD execution.VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB University of Oslo INF5063 9XgKB L9 Cache GQ wayU Out-of-order execution § Beginning with Pentium Pro, out-of-order-execution has improved the micro-architecture design − Execution oF an instruction is delayed iF the input data is not yet available. − Haswell architecture has 192 instructions in the out-of-order window. § Instructions are split into micro-operations (μ-ops) − ADD EAX, EBX % Add content in EBX to EAX • simple operation which generates only 1 μ-ops − ADD EAX, [mem1] % Add content in mem1 to EAX • operation which generates 2 μ-ops: 1) load mem1 into (unamed) register, 2) add − ADD [mem1], EAX % Add content in EAX to mem1 • operation which generates 3 μ-ops: 1) load mem1 into (unamed) register, 2) add, 3) write result back to memory University of Oslo INF5063 Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end 9x9d Instruction Queue µcode Complex Simple Simple Simple Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU Intel Haswell EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU Fast LEA AGU AGU AGU Execution Shift VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store FMA FMA VALU FShuffle Branch Data FBlend FADD VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB 9XgKB L9 Cache GQ wayU § The x86 architecture “should” be a nice, easy introduction to more complex heterogeneous architectures later in the course… University of Oslo INF5063 Co-Processors § The original IBM PC included a socket For an Intel 8087 Floating point co-processor (FPU) − 50-fold speed up of floating point operations § Intel kept the co-processor up to i486 − 486DX contained an optimized i487 block − Still separate pipeline (pipeline Flush when starting and ending use) − Communication over an internal bus § Commodore Amiga was one oF the earlier machines that used multiple processors − Motorola 680x0 main processor − Blitter (block image transFerrer - moving data, Fill operations, line drawing, perForming boolean operations) − Copper (Co-Processor - change address For video RAM on the Fly) University of Oslo INF5063 nVIDIA Tegra X1 ARM SoC § One oF many multi-core processors for handheld devices § 4 ARM Cortex-A57 processors − 4 ARM Cortex-A53 cores − Out-of-order design − 64-bit ARMv8 instruction set − Cache-coherent cores § Several “dedicated” co-processors: − 4K Video Decoder − 4K Video Encoder − Audio Processor − 2x Image Processor § Fully programmable Maxwell-family GPU with 256 simple cores. University of Oslo INF5063 Intel SCC (Single Chip Cloud Computer) § Prototype processor From Intel’s TerraScale project § 24 «tiles» connected in a mesh § 1.3 billion transistors § Intel SCC Tile: − Two Intel P54C cores Pentium − In-order-execution design − IA32 architecture − NO SIMD units(!!) § No cache coherency § Only made a limited number oF chips For research. University of Oslo INF5063 Intel MIC («Larabee» / «Knights Ferry» / «Knights Corner») § Introduced in 2009 as a consumer GPU From Intel § Canceled in May 2010 due to

INF5063: Programming Heterogeneous Multi-Core Processors

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support