INF5063: Programming Heterogeneous Multi-Core Processors

INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 25th, 2015 INF5063 University of Oslo INF5063 Overview § Course topic and scope § Background For the use and parallel processing using heterogeneous multi-core processors § Examples oF heterogeneous architectures § Vector Processing University of Oslo INF5063 INF5063: The Course People § Håkon Kvale Stensland email: haakonks @ ifi § Carsten Griwodz email: griFF @ ifi § ProFessor Pål Halvorsen email: paalh @ ifi § Guest lectures From Dolphin Interconnect Solutions Hugo Kohmann & Roy Nordstrøm University of Oslo INF5063 Time and place § Lectures: Tuesday 09:00 - 16:00 Storstua (Simula Research Laboratory) − August 25th − September 22nd − October 20th − November 24st § Group exercises: No group exercises in this course! University of Oslo INF5063 Plan For Today (Session 1) 10:15 – 11:00: Course Introduction 11:00 – 11:15: Break 11:15 – 12:00: Introduction to SSE/AVX 12:00 – 13:00: Lunch (Will be provided by Simula) 13:00 – 13:45: Introduction to Video Processing 13:45 – 14:00: Break 14:00 – 14:45: Using SIMD For Video Processing 14:45 – 15:00: Break 15:00 – 15:45: Codec 63 (c63) Home Exam Precode 15:45 – 16:00: Home Exam 1 is presented University of Oslo INF5063 About INF5063: Topic & Scope § Content: The course gives … − … an overview oF heterogeneous multi-core processors in general and three variants in particular and a modern general-purpose core (architectures and use) − … an introduction to working with heterogeneous multi-core processors • SSEx / AVXx for x86 • Nvidia’s Family oF GPUs and the CUDA programming Framework • Multiple machines connected with Dolphin PCIe links − … some ideas oF how to use/program heterogeneous multi-core processors University of Oslo INF5063 About INF5063: Topic & Scope § Tasks: The important part of the course is lab-assignments where you program each oF the three examples oF heterogeneous multi-core processors § 3 graded home exams (counting 33% each): − Deliver code and make a demonstration explaining your design and code to the class 1. On the x86 • Video encoding – Improve the perFormance of video compression by using SSE instructions. 2. On the Nvidia graphics cards • Video encoding – Improve the perFormance oF video compression using the GPU architecture. 3. On the distributed systems • Video encoding – The same as above, but exploit the parallelism on multiple GPUs and computers connected with Dolphin PCIe links. § Students will be working together in groups oF two. Try to Find a partner during the session today! § Competition at the end oF the course! Have the Fastest implementation oF the code! University of Oslo INF5063 Background and Motivation: Moore’s Law Motivation: Intel View 2015: § >billion transistors integrated •8,0 billion - nVIDIA GM200 (Maxwell) •8,1 billion - AMD Fury (Pirate Islands) 1971: • 2,300 - Intel 4004 University of Oslo INF5063 Motivation: Intel View § >billion transistors integrated § Clock Frequency has not increased since 2006 2015 (Still): • 5 GHz: IBM Power 6 & 8 University of Oslo INF5063 Motivation: Intel View § >billion transistors integrated § Clock Frequency has not increased since 2006 § Power? Heat? University of Oslo INF5063 Motivation: Intel View § Soon >billion transistors integrated § Clock Frequency can still increase § Future applications will demand TIPS § Power? Heat? University of Oslo INF5063 Motivation “Future applications will demand more instructions per second” “Think platform beyond a single processor” “Exploit concurrency at multiple levels” “Power will be the limiter due to complexity and leakage” Distribute workload on multiple cores University of Oslo INF5063 Multicores! Symmetric Multi-Core Processors Intel Haswell University of Oslo INF5063 Symmetric Multi-Core Processors AMD “Piledriver” University of Oslo INF5063 Symmetric Multi-Core Processors § Good − Growing computational power § Problematic − Growing die sizes − Unused resources • Some cores used much more than others • Many core parts Frequently unused § Why not spread the load better? ⇒ Heterogeneous Architectures! University of Oslo INF5063 x86 – heterogeneous? Intel Haswell Core Architecture Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end 9x9d Instruction Queue µcode Complex Simple Simple Simple Haswell Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU Fast LEA AGU AGU AGU Execution Shift VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store FMA FMA VALU FShuffle Branch Data FBlend FADD VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB 9XgKB L9 Cache GQ wayU University of Oslo INF5063 Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end Intel Haswell Architecture9x9d Instruction Queue µcode Complex Simple Simple Simple Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU • THUS, an x86 is deFinitelyFast parallel LEA and hasAGU heterogeneousAGU AGU Execution(internal)Shift cores.VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store • The coresFMA have manyFMA (hidden)VALU FShuffle complicatingBranch Factors. One oF these Data are out-ofFBlend-orderFADD execution.VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB University of Oslo INF5063 9XgKB L9 Cache GQ wayU Out-of-order execution § Beginning with Pentium Pro, out-of-order-execution has improved the micro-architecture design − Execution oF an instruction is delayed iF the input data is not yet available. − Haswell architecture has 192 instructions in the out-of-order window. § Instructions are split into micro-operations (μ-ops) − ADD EAX, EBX % Add content in EBX to EAX • simple operation which generates only 1 μ-ops − ADD EAX, [mem1] % Add content in mem1 to EAX • operation which generates 2 μ-ops: 1) load mem1 into (unamed) register, 2) add − ADD [mem1], EAX % Add content in EAX to mem1 • operation which generates 3 μ-ops: 1) load mem1 into (unamed) register, 2) add, 3) write result back to memory University of Oslo INF5063 Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end 9x9d Instruction Queue µcode Complex Simple Simple Simple Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU Intel Haswell EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU Fast LEA AGU AGU AGU Execution Shift VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store FMA FMA VALU FShuffle Branch Data FBlend FADD VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB 9XgKB L9 Cache GQ wayU § The x86 architecture “should” be a nice, easy introduction to more complex heterogeneous architectures later in the course… University of Oslo INF5063 Co-Processors § The original IBM PC included a socket For an Intel 8087 Floating point co-processor (FPU) − 50-fold speed up of floating point operations § Intel kept the co-processor up to i486 − 486DX contained an optimized i487 block − Still separate pipeline (pipeline Flush when starting and ending use) − Communication over an internal bus § Commodore Amiga was one oF the earlier machines that used multiple processors − Motorola 680x0 main processor − Blitter (block image transFerrer - moving data, Fill operations, line drawing, perForming boolean operations) − Copper (Co-Processor - change address For video RAM on the Fly) University of Oslo INF5063 nVIDIA Tegra X1 ARM SoC § One oF many multi-core processors for handheld devices § 4 ARM Cortex-A57 processors − 4 ARM Cortex-A53 cores − Out-of-order design − 64-bit ARMv8 instruction set − Cache-coherent cores § Several “dedicated” co-processors: − 4K Video Decoder − 4K Video Encoder − Audio Processor − 2x Image Processor § Fully programmable Maxwell-family GPU with 256 simple cores. University of Oslo INF5063 Intel SCC (Single Chip Cloud Computer) § Prototype processor From Intel’s TerraScale project § 24 «tiles» connected in a mesh § 1.3 billion transistors § Intel SCC Tile: − Two Intel P54C cores Pentium − In-order-execution design − IA32 architecture − NO SIMD units(!!) § No cache coherency § Only made a limited number oF chips For research. University of Oslo INF5063 Intel MIC («Larabee» / «Knights Ferry» / «Knights Corner») § Introduced in 2009 as a consumer GPU From Intel § Canceled in May 2010 due to

INF5063: Programming Heterogeneous Multi-Core Processors

Intel® IA-64 Architecture Software Developer's Manual

USE CASE Requirements

Copyrighted Material

Appendix D an Alternative to RISC: the Intel 80X86

SIMD Extensions

Programmable Digital Microcircuits - a Survey with Examples of Use

SIMD: Data Parallel Execution J

Programming the Cell Broadband Engine Examples and Best Practices

APPLICATION NOTE Ap·113

Cell Broadband Engine Architecture

Introduction to Cpu

Sony/Toshiba/IBM (STI) CELL Processor Three Performance-Limiting Walls