
Jakob Engblom, PhD, Product Management Engineer, Simics team, Intel, Stockholm, Sweden, 2020-05-18

My Background

Jakob Engblom
▪ Computer Science (Datavetenskap), Uppsala: D92
▪ PhD, Computer Systems, Uppsala
▪ Product Management Engineer, Intel System Simulation team, Sweden
– Previously at IAR, Virtutech, Wind River
▪ Intel Evangelist – Simulation
▪ https://software.intel.com/en-us/meet-the-developers/evangelists/team/jakob-engblom
▪ http://engbloms.se/jakob.html

What Does Intel Do?

Laptop and desktop: Intel® Core®, Intel® Atom™, chipsets, GPUs
Server: Intel® Xeon® processors, chipsets, Smart NICs, GPUs
Storage: SSDs, 3D XPoint™, Intel® Optane™
Connectivity: Ethernet, WiFi, Bluetooth, GNSS
FPGA: SoC-FPGA, FPGA, FPGA-CPU combos
AI and ML: Movidius, Habana, Mobileye, Intel® Xeon®, GPUs
IoT: processors, gateways, security, management, UEFI & BIOS
Software: development tools, compilers, simulation solutions, Linux* & Windows* drivers, oneAPI

The General-Purpose Processor

The simplest computer to use and program
▪ General: can run any workload (*)
▪ Flexible: buy one machine, use it for all applications
▪ “Good enough for most jobs”
Can run arbitrarily complex workloads
▪ Operating systems
▪ Threads, decisions, control, …
▪ Databases, just-in-time compilers, …

Drawbacks of the General-Purpose Processor

Absolute performance
▪ A general-purpose processor can only do operations at a certain rate
▪ Specialized hardware can do it faster
▪ “Operations per clock” (per chip)
Power
▪ Power consumption per useful work
▪ “Operations per watt”
Chip area
▪ High-performance processor cores are very large; you cannot fit very many

Efficiency vs Generality (with some Examples)

[Figure: generality plotted against operations per area/watt/clock. From most general to most specialized: general-purpose processor, graphics processing unit (GPU), programmable specialized accelerators, programmable logic (FPGA), and fixed-logic algorithm accelerator blocks. The top-right corner (general and efficient) “would be nice, but nothing much up here”; the bottom-left corner (specialized yet inefficient) “is a rather bad spot”.]

Accelerators: An Economic Bet (on Value per Chip)

Efficiency and maximum performance come from specialization
▪ Specific operations or problems or applications
Economics:
▪ Bet chip area on a certain application/computation/domain
▪ Used accelerator = advantage
▪ Unused accelerator = dead weight
The accelerator has to provide:
▪ A performance/efficiency advantage
▪ For a large set of applications
▪ If not efficient enough, or if not enough applications benefit, accelerators tend to go extinct
[Figure: the total chip area is split between general-purpose processor cores and accelerators.]

Examples of Accelerator Application Areas

▪ Graphics rendering ▪ Audio processing ▪ Image processing ▪ Cryptography
▪ Network packet processing ▪ Compression ▪ Neural network training ▪ Digital signal processing
▪ Neural network inference ▪ Data movement ▪ Table lookups ▪ Password cracking

Early Success: IBM* “HARVEST”

Built for the National Security Agency (NSA), operational in 1962
▪ Accelerated “certain operations” by 50x to 200x compared to the base machine
– Code-breaking
– Keyword scanning in text
▪ 2x the size of the base machine (!)
▪ The base was the IBM 7030 “Stretch”, the fastest computer in the world at the time
– HARVEST was not independent

Image: https://en.wikipedia.org/wiki/IBM_7950_Harvest *Other names and brands may be claimed as the property of others

Failed Example: Ageia* PhysX*

Mid-2000s
▪ Physics Processing Unit (“PPU”)
▪ The PPU was to work alongside the GPU and main processor to run physics in games
▪ Separate PCIe add-in card
Market verdict:
▪ Not sufficient value to make gamers invest in an additional piece of hardware
Final result:
▪ Ageia* acquired by Nvidia*
▪ The software development kit (SDK) turned out to be useful and lived on
▪ Algorithms run on the processor or GPU

Image: https://www.anandtech.com/show/2001/3 *Other names and brands may be claimed as the property of others

Example: From Specialized to General-ish: GPGPU

Graphics Processing Units (GPUs) started out as a way to draw graphics
▪ In gaming consoles and in PCs with add-in cards
– Term coined by Sony* for the GPU in the PlayStation* in 1994
Programmable shaders showed up in 2001 (with the Nvidia* GeForce* 3)
▪ It was realized that GPUs could do general-purpose (GP) math
▪ In particular, matrix multiplications for floating-point numbers
Now a major application area for GPUs
▪ Affecting architecture and feature selection of the designs – data types, etc.
▪ GPUs splitting into “graphics” GPUs and “compute” GPUs
– Ex: Nvidia* Tesla* vs GeForce* lines
Still – GPGPU is not “General Purpose” in general
▪ Certain types of compute problems
▪ Entirely unsuitable for decision-making and complex data structures

*Other names and brands may be claimed as the property of others

Example: Power (Energy) Optimization

“Always-On” Systems
▪ Offload sensor input processing (accelerometers, microphones, cameras, …)
▪ Continuous watch for “wake up” events
– Keywords (“Hi XYZ”) or “wave to wake”, …
– Movement or fire, …
Low power for long periods of “nothing happens” (energy = power * time…)
▪ AO systems typically implemented using a slow low-power processor + custom input processing hardware
▪ Not about highest possible performance

Image from https://www.amazon.com/All-new-Echo-Show-2nd-Gen/dp/B077SXWSRP

Different Ways to Implement an Accelerator

Fixed logic
• A fixed piece of logic on a chip
• Does exactly what it is designed to do (and nothing else)
• Vendor provides the functionality
Hidden processor
• Accelerator looks like a fixed block from the outside
• Runs firmware on a processor internally
• Only the vendor programs the processor
FPGA
• “Field-Programmable Gate Array”
• Logic that is reconfigured before use
• “Soft hardware”
• User builds the accelerator
Visible specialized processor
• Accelerator exposes a programmable processor to the user
• Vendor and users can write programs

NOTE: FPGA (Field-Programmable Gate Array)

Fixed chip, manufactured once
▪ Array of logic blocks
– Function configurable by reprogramming the memory in each block
▪ Interconnect
– Routing configured by reprogramming routing tables
Design created in the same way as fixed logic, but compilation (synthesis) creates configuration files
[Figure: a grid of logic blocks joined by configurable interconnect.]

Tip: learn more at https://plan.seek.intel.com/PSG_WW_NC_LPCD_FR_2018_FPGAforDummiesbook

Plug-In Card

Accelerator connected to the host computer using a plug-in connector
▪ PCIe – Peripheral Component Interconnect Express – most common
▪ M.2 – a different connector for PCIe
▪ CXL – Compute Express Link – a new standard for cache-coherent accelerators

*Other names and brands may be claimed as the property of others

Via External Connectors

Comparatively low bandwidth, can be added to small machines
▪ USB – Universal Serial Bus
▪ USB-C & Thunderbolt*

Intel® Neural Compute Stick 2.

Based on the Intel® Movidius™ Myriad™ X Vision Processing Unit (VPU), repurposed to run CNN and DL workloads.

https://software.intel.com/content/www/us/en/develop/hardware/neural-compute-stick.html

*Other names and brands may be claimed as the property of others

On the Same Chip (as the Processor Cores)

Very common for “medium-size” accelerators
▪ On-chip GPU, image processor, AI accelerators, …
▪ Example: Intel® Ice Lake processor SoC:

[Annotated die photo: graphics and GPGPU; accelerators for video and audio encoding; accelerators for image processing.]

Source: https://www.anandtech.com/show/14514/examining-intels-ice-lake--and-sunny-cove/2

In the Same Package (as the Processor Chip)

Single package, goes into a single socket on the motherboard
▪ Combine multiple discrete chips into the package
▪ Connected using chip-to-chip interconnects like PCIe
▪ (Requires some planning to be socket-compatible with other processors)
The FPGA can be used for all kinds of acceleration
[Figure: a Xeon processor and an Intel® Arria® FPGA in one package. The Xeon connects to system memory (RAM), to other Xeon processors over UPI, and to the platform controller hub (PCH) over PCIe; the FPGA attaches over PCIe and serial links, reusing the hardware previously used to connect to a stand-alone FPGA.]

More reading: https://www.anandtech.com/show/12773/intel-shows-xeon-scalable-gold-6138p-with-integrated-fpga-shipping-to-vendors

On the Same “Chip”, using Chiplets

Current trend: replace monolithic chips with collections of chiplets
▪ Chiplets are tightly coupled pieces of silicon
▪ Improves yield by making each manufactured piece smaller
▪ Combine pieces from different processes (like a 10nm processor with 22nm analog)
▪ Example tech: “Embedded Multi-die Interconnect Bridge” (EMIB)
The collection works like a chip from the outside
▪ Integration is tighter than multiple chips in a package – use higher-bandwidth “on-chip” interconnects
Example: Intel® Agilex™ combining an FPGA core with external components using EMIB

Advanced Computer Architecture - Uppsala - 2020 22 The software/hardware interface

Accelerator Software Stacks

The stack from application code down to hardware:
▪ Application program
▪ Accelerator-driving code
▪ High-level APIs & frameworks (Python*, Rust*, Java*, C, C++, Matlab*, C#*, …)
– Higher-level APIs and language bindings make it easier to write software for the accelerator
▪ Accelerator API (C, typically) – such as oneAPI®, OpenCL*, OpenGL*, DirectX*, …
– The documented (standardized) API level
▪ User-level drivers (for Accelerator A, for Accelerator B, for a software reference implementation)
– Convert the API calls into something suitable for the OS driver to consume; for complex accelerators this includes JIT code generation, optimizers, etc.
▪ Hardware drivers in the operating system (for Accelerator A, for Accelerator B)
– The actual hardware interface, very specific to each particular hardware implementation
▪ Hardware: Accelerator A, Accelerator B, the processor

*Other names and brands may be claimed as the property of others

Accelerator Basics: Talking to Hardware

Code running on the processor: application program → accelerator API → user-level driver → operating system → hardware driver.

[Figure: system memory (“main RAM”) and the accelerator’s programming registers and local memory are all mapped into the processor memory address space, so the main processor core talks to the accelerator with ordinary memory operations: loads and stores. Interrupt signals from the accelerator reach the processor’s interrupt controller.]
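In software, this memory mapping means register access is just pointer dereferencing. A minimal user-space sketch, assuming a hypothetical accelerator whose 4 KB register block sits at physical address 0xFEB00000 with invented register offsets (real code lives in a kernel driver and gets the mapping from the OS):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ACCEL_PHYS_BASE 0xFEB00000UL  /* hypothetical register block address */
#define REG_CTRL        0x00          /* hypothetical control register */
#define REG_STATUS      0x04          /* hypothetical status register */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the device's register block into this process's address space. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, ACCEL_PHYS_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    regs[REG_CTRL / 4] = 1;                 /* a store becomes a register write */
    uint32_t status = regs[REG_STATUS / 4]; /* a load becomes a register read */
    printf("status = 0x%08x\n", (unsigned)status);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}

The volatile qualifier keeps the compiler from caching or reordering the register accesses.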

Programming Register Specification

The programming registers are described in long manuals...

Source for this example:

▪ Intel® Ethernet Controller i210 Datasheet, Revision 3.4, February 2019.

https://www.intel.com/content/www/us/en/products/network-io/ethernet/controllers/i210-at.html

...Programming Register Specification...

Accelerators often have hundreds or thousands of registers
▪ Typically, there are buffers, channels, queues, ports, … that repeat a set of registers many times over
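In driver headers, such register arrays typically show up as a base address plus a stride; a sketch with invented offsets (the real igb header two slides later uses exactly this idiom):

/* Hypothetical: per-queue registers repeat every 0x40 bytes. */
#define TXQ_BASE     0x6000
#define TXQ_STRIDE   0x40
#define TXQ_HEAD(n)  (TXQ_BASE + (n) * TXQ_STRIDE + 0x00)
#define TXQ_TAIL(n)  (TXQ_BASE + (n) * TXQ_STRIDE + 0x04)
/* TXQ_TAIL(3) = 0x60C4: the tail register of transmit queue 3. */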

...Programming Register Specification

Each register breaks down into bit fields

The device driver needs to set the bits and registers to make hardware operations happen
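In C driver code, fields are usually manipulated with masks and shifts rather than language-level bit-fields; a sketch with an invented register layout:

#include <stdint.h>

/* Hypothetical control register layout:
 *   bit  0      ENABLE
 *   bits 4:1    MODE
 *   bits 15:8   BURST_LEN
 */
#define CTRL_ENABLE      (1u << 0)
#define CTRL_MODE_SHIFT  1
#define CTRL_MODE_MASK   (0xFu << CTRL_MODE_SHIFT)
#define CTRL_BURST_SHIFT 8
#define CTRL_BURST_MASK  (0xFFu << CTRL_BURST_SHIFT)

/* Build the value to write into the control register. */
static inline uint32_t ctrl_value(unsigned mode, unsigned burst)
{
    return CTRL_ENABLE
         | ((mode  << CTRL_MODE_SHIFT)  & CTRL_MODE_MASK)
         | ((burst << CTRL_BURST_SHIFT) & CTRL_BURST_MASK);
}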

Advanced Computer Architecture - Uppsala - 2020 28 Device Driver Code https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/e1000_mac.c

Linux intel/igb driver example
▪ Poke the hardware via memory operations

/**
 * igb_rar_set - Set receive address register
 * @hw: pointer to the HW structure
 * @addr: pointer to the receive address
 * @index: receive address array register
 *
 * Sets the receive address array register at index to the address passed
 * in by addr.
 **/
void igb_rar_set(struct e1000_hw *hw, u8 *addr, u32 index)
{
	u32 rar_low, rar_high;

	/* HW expects these in little endian so we reverse the byte order
	 * from network order (big endian) to little endian
	 */
	rar_low = ((u32) addr[0] |
		   ((u32) addr[1] << 8) |
		   ((u32) addr[2] << 16) |
		   ((u32) addr[3] << 24));

	rar_high = ((u32) addr[4] | ((u32) addr[5] << 8));

	/* If MAC address zero, no need to set the AV bit */
	if (rar_low || rar_high)
		rar_high |= E1000_RAH_AV;

	/* Some bridges will combine consecutive 32-bit writes into
	 * a single burst write, which will malfunction on some parts.
	 * The flushes avoid this.
	 */
	wr32(E1000_RAL(index), rar_low);
	wrfl();
	wr32(E1000_RAH(index), rar_high);
	wrfl();
}

#ifndef _E1000_REGS_H_
#define _E1000_REGS_H_

#define E1000_CTRL     0x00000 /* Device Control - RW */
...
#define E1000_VET      0x00038 /* VLAN Ether Type - RW */
#define E1000_TSSDP    0x0003C /* Time Sync SDP Configuration Register - RW */
#define E1000_ICR      0x000C0 /* Interrupt Cause Read - R/clr */
#define E1000_ITR      0x000C4 /* Interrupt Throttling Rate - RW */
#define E1000_ICS      0x000C8 /* Interrupt Cause Set - WO */
#define E1000_IMS      0x000D0 /* Interrupt Mask Set - RW */
#define E1000_IMC      0x000D8 /* Interrupt Mask Clear - WO */
#define E1000_IAM      0x000E0 /* Interrupt Acknowledge Auto Mask */
#define E1000_RCTL     0x00100 /* RX Control - RW */
#define E1000_FCTTV    0x00170 /* Flow Control Transmit Timer Value - RW */
#define E1000_TXCW     0x00178 /* TX Configuration Word - RW */
#define E1000_EICR     0x01580 /* Ext. Interrupt Cause Read - R/clr */
#define E1000_EITR(_n) (0x01680 + (0x4 * (_n)))
#define E1000_EICS     0x01520 /* Ext. Interrupt Cause Set - W0 */
#define E1000_EIMS     0x01524 /* Ext. Interrupt Mask Set/Read - RW */
#define E1000_EIMC     0x01528 /* Ext. Interrupt Mask Clear - WO */
#define E1000_EIAC     0x0152C /* Ext. Interrupt Auto Clear - RW */
#define E1000_EIAM     0x01530 /* Ext. Interrupt Ack Auto Clear Mask - RW */
...

https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/e1000_regs.h

Device Driver Code...

The example on the previous slide wrote the receive address registers
▪ Providing a 48-bit Ethernet address... How hard can it be?

Typical Operation Flow: Software Driven

1. Stage the data to be processed
2. Configure the operation using programming registers
3. Write the programming register that starts the operation
4. Wait for the operation to complete
5. Use the results of the computation

Accelerator: do the work as instructed
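Expressed as code, the flow might look like this sketch, reusing the invented register offsets from the earlier memory-mapping example (all names are hypothetical):

#include <stdint.h>

#define REG_CTRL      0x00   /* hypothetical "go" register */
#define REG_STATUS    0x04   /* hypothetical status register */
#define REG_DATA_ADDR 0x08   /* hypothetical: where the input data is */
#define REG_DATA_LEN  0x0C   /* hypothetical: how much input data */
#define CTRL_START    (1u << 0)
#define STATUS_DONE   (1u << 0)

void run_operation(volatile uint32_t *regs, uint32_t data_addr, uint32_t len)
{
    regs[REG_DATA_ADDR / 4] = data_addr;  /* 1-2. stage data and configure */
    regs[REG_DATA_LEN / 4]  = len;
    regs[REG_CTRL / 4]      = CTRL_START; /* 3. start the operation */
    while (!(regs[REG_STATUS / 4] & STATUS_DONE))
        ;                                 /* 4. wait (here: by polling) */
    /* 5. results can now be read back and used */
}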

Detecting Completion: Polling or Interrupt?

[Figure: same memory-mapped setup as before. Two ways to detect completion: the driver can poll a programming register for status, or the accelerator can fire off an interrupt once the operation completes.]

Polling or Interrupt

Polling advantages:
▪ Simple software (the driver sits in a loop)
▪ Shortest possible latency
Polling disadvantages:
▪ Burns power
▪ Blocks a processor core
▪ High load on the interconnect to the device

Interrupt advantages:
▪ Leaves the processor free to do other things
▪ Low load on the interconnect
Interrupt disadvantages:
▪ Higher software complexity
▪ Longer latency
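A common compromise is to poll for a bounded time and then back off; a sketch using the invented STATUS register from before (a real driver would enable the device interrupt and block, rather than just sleeping):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define REG_STATUS  0x04      /* hypothetical status register */
#define STATUS_DONE (1u << 0)

bool wait_done(volatile uint32_t *regs)
{
    /* Fast path: spin briefly for the lowest possible latency. */
    for (int spins = 0; spins < 1000; spins++)
        if (regs[REG_STATUS / 4] & STATUS_DONE)
            return true;

    /* Slow path: stop burning a core; check at a coarser interval. */
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 };
    for (int tries = 0; tries < 10000; tries++) {
        if (regs[REG_STATUS / 4] & STATUS_DONE)
            return true;
        nanosleep(&ts, NULL);
    }
    return false;             /* timed out */
}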

Direct Memory Access (DMA)

DMA is a key mechanism:
▪ The hardware reads or writes system memory directly by itself
▪ Without the processor being involved
The normal mode of operation is to use DMA to bring in a block of data
▪ If results are being computed, also DMA the results back out

To control the memory that a certain device can access, modern systems provide “IOMMU” functions – Memory Management Units for Input/Output
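In a Linux driver, a DMA-visible buffer is typically obtained through the kernel DMA API, which also cooperates with the IOMMU; a sketch (dma_alloc_coherent and writel are real kernel APIs, while the register offset is invented):

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define REG_DMA_ADDR 0x10  /* hypothetical: where the device expects the buffer address */

static void *setup_dma_buffer(struct device *dev, void __iomem *regs, size_t size)
{
    dma_addr_t bus_addr;  /* the address the *device* uses on the interconnect */

    /* Allocate memory that both the CPU and the device can access coherently. */
    void *cpu_addr = dma_alloc_coherent(dev, size, &bus_addr, GFP_KERNEL);
    if (!cpu_addr)
        return NULL;

    /* Tell the device where the buffer is; an IOMMU may translate this address. */
    writel(lower_32_bits(bus_addr), regs + REG_DMA_ADDR);
    return cpu_addr;
}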

Direct Memory Access (DMA)

[Figure: the accelerator performs memory operations on entire blocks of data, moving the data to operate on between system memory (RAM) and its own local memory without involving the main processor core. Programming registers and interrupt signals are still used for control.]

Descriptor Tables

Configuring operations using programming registers is not very efficient
▪ Invoking hardware through registers is slow, and it is complex to go through the driver
Solution: descriptor tables in memory (sketched below)
▪ Simplest case: an array of fixed-size entries
▪ In general: a linked list of operations to perform
Descriptors are brought in over DMA and parsed by the accelerator logic
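A minimal sketch of the simplest case, a ring of fixed-size descriptors; the layout and names are invented, as every device defines its own:

#include <stdint.h>

/* Hypothetical fixed-size operation descriptor, laid out the way the
 * device expects to read it over DMA. */
struct op_descriptor {
    uint64_t src_addr;   /* bus address of the input buffer */
    uint64_t dst_addr;   /* bus address of the output buffer */
    uint32_t length;     /* bytes to process */
    uint32_t flags;      /* operation type, interrupt-on-completion, ... */
};

#define RING_SIZE 256
static struct op_descriptor ring[RING_SIZE]; /* would live in DMA-visible memory */
static unsigned ring_tail;                   /* next free slot */

/* Fill in one descriptor; the device is later told the new tail index
 * through a doorbell register and fetches the descriptor itself over DMA. */
void enqueue_op(uint64_t src, uint64_t dst, uint32_t len, uint32_t flags)
{
    struct op_descriptor *d = &ring[ring_tail % RING_SIZE];
    d->src_addr = src;
    d->dst_addr = dst;
    d->length   = len;
    d->flags    = flags;
    ring_tail++;
}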

Descriptor Tables

In general, operation descriptors are separate from the data buffers
[Figure: the operation descriptors (Op1, Op2, Op3) sit in system memory next to, but separate from, the data buffers (Data 1, Data 2, Data 3) they point at. The device is programmed with the address of the first operation descriptor; DMA then fetches the descriptors and the buffers they point at.]

Asynchronous Operations

Synchronous:
▪ Software issues each operation after the previous one has completed
▪ = inefficient use of hardware
Asynchronous:
▪ Issue a series of operations before the previous ones have completed
▪ Goal: keep the accelerator as busy as possible
▪ (assumes that operations are not dependent)
Cost: more software complexity
[Figure: synchronously, the application and API layer hand operations 1, 2, 3 to the accelerator hardware one at a time, leaving it idle in between; asynchronously, the operations are queued so the accelerator hardware stays busy.]
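In code, asynchronous use means separating submission from completion. A sketch on top of the descriptor-ring example from earlier (enqueue_op, ring_doorbell, and wait_completion are the same invented helpers):

#include <stdint.h>

struct buffer { uint64_t src, dst; uint32_t len; };

/* Hypothetical helpers from the descriptor-ring sketch. */
void enqueue_op(uint64_t src, uint64_t dst, uint32_t len, uint32_t flags);
void ring_doorbell(void);
void wait_completion(int id);

void process_batch(const struct buffer *bufs, int n)
{
    /* Submit the whole batch before waiting for anything, so the
     * accelerator always has queued work (operations are independent). */
    for (int i = 0; i < n; i++)
        enqueue_op(bufs[i].src, bufs[i].dst, bufs[i].len, 0);
    ring_doorbell();

    /* Reap completions afterwards; compare with the synchronous
     * enqueue-wait-enqueue-wait pattern, which leaves the device idle. */
    for (int i = 0; i < n; i++)
        wait_completion(i);
}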

Performance Example, Sync vs Async

40k decryptions per second is the hardware limit on this late-2013 hardware configuration (Intel® Communications Chipset 8955)

This is reached with 20 active web server threads feeding the accelerator.

It does take a bit of parallelism and independent work items in the software to keep the accelerator fully fed. In synchronous mode it is not feasible.

Source: Brian Will et al, Intel® QuickAssist Technology & OpenSSL-1.1.0: Performance, Intel White Paper, 2017

Performance Example (2), Capacity vs Software

Another view on the efficiency of crypto acceleration. Different hardware from the previous slide.

At maximum load: how many transactions per second can be sustained as the size of each transaction increases?

The accelerator is optimized for a certain range of transaction sizes.

Two points: • Acceleration raises maximum performance by 5x (approx) • Acceleration is designed for a certain set of parameters

Source: F5 Accelerates Cryptographic Processing with Intel® QAT, Intel White Paper, 2019

Performance Example (2), Processor Load

Same setup as the previous slide.

Additional point: Processor utilization is lower when using the accelerator.

I.e., 5x higher throughput at approx. 60% processor load.

Accelerators can free up processor cycles to focus on the complex work that is not amenable to acceleration.

Source: F5 Accelerates Cryptographic Processing with Intel® QAT, Intel White Paper, 2019

Latency vs Throughput

[Figure: over time, inputs accumulate into a batch; the accelerator then processes and produces results for the whole batch of inputs at once.]

Accelerators usually require a certain amount of data to be efficient
▪ Just like a factory needs a certain level of activity to make sense
Latency from the appearance of the first input to results can be long
▪ Trade-off: latency of processing a single unit of work vs overall throughput

Streaming (Independent) Operation

Some types of accelerators can be set up to operate on their own
▪ Receive inputs, process them, and send outputs
▪ Configuration is similar to providing a program
▪ The main processor might be interrupted in case something unusual happens
▪ The accelerator can handle the common easy cases (“fast path”)
Examples:
Network routers:
▪ Ethernet packets stream in and out
▪ Complex protocols and unusual cases are handed over to the main processor
Always-on subsystems:
▪ Independent watch for activation events
▪ Wake up the main processor when detected

Streaming (Independent) Operation

[Figure: the accelerator independently receives an input stream, performs computations, and generates an output stream, using configuration and code in its local memory plus data buffers in system memory. It interrupts the main processor only when it needs assistance with complex cases.]

Cache-Coherent Accelerators

Make the accelerator part of the cache-coherent memory system
▪ Access main memory just like a processor
Changes the programming model
▪ The accelerator looks like a software thread, not something completely alien

To make sense, this requires a standard interface across vendors
▪ Cache-coherency protocols have traditionally been closed
▪ New standards are appearing; it looks like CXL is emerging as the winner

…Cache-Coherent Accelerator…

[Figure: the accelerator participates in cache coherency. The main processor core and the accelerator each have a cache backed by the same system memory, and the accelerator pulls in data in small units (Data 1, Data 5, …) as needed for processing rather than DMA-ing whole blocks.]

…Cache-Coherent Accelerator

Advantages:
▪ No need for chunk transfer (unlike DMA)
▪ The accelerator can operate on smaller pieces of data (lower overhead)
▪ The accelerator and software can poll memory changed by the other side
– Same mechanisms as used between parallel processor cores
– (using cache-coherency-optimized poll mechanisms)

Drawbacks:
▪ More noise in the memory system
– Less predictable performance
– DMA and local memory on the accelerator are isolated and thus more predictable
▪ More complex to implement and verify
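That polling can use exactly the primitives threads use to communicate; a sketch in C11, assuming the accelerator writes a completion flag in coherent shared memory:

#include <stdatomic.h>
#include <stdint.h>

/* Flag in cache-coherent memory: written by the accelerator,
 * read by software, with no DMA or completion register involved. */
_Atomic uint32_t completion_flag;

void wait_for_accelerator(void)
{
    /* The loads stay cheap: the cache line sits in this core's cache
     * until the accelerator's write invalidates it via the coherency
     * protocol. */
    while (atomic_load_explicit(&completion_flag, memory_order_acquire) == 0)
        ; /* a pause/yield hint could go here */
}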

Coding for Programmable Accelerators: Kernels

APIs like OpenGL*, DirectX*, OpenCL*, Vulkan*, CUDA*

▪ Use JIT compilation of special kernel languages
▪ The kernel code can also be pre-compiled to a device-independent format, for example using the SPIR-V intermediate language from the OpenCL world

[Figure: at programmer time, the application source code is compiled into a binary, while the kernel (kernel.kl) stays as source. At runtime, the application program calls run(“kernel.kl”); a JIT compiler in the user-level driver compiles the kernel, and the driver hands the compiled kernel to the accelerator while the application keeps running on the processor. This sketch seriously simplifies all the steps in the real world.]

*Other names and brands may be claimed as the property of others

OpenCL* Example Code

Application code:

int main (int argc, const char** argv)
{
    ...
    // Create the necessary OpenCL objects up to device queue.
    OpenCLBasic oclobjects(
        cmdparser.platform.getValue(),
        cmdparser.device_type.getValue(),
        cmdparser.device.getValue()
    );

    // Form build options string from given parameters:
    string build_options =
        "-DT=" + cmdparser.arithmetic.getValue() +
        (cmdparser.arithmetic_double.isSet() ? " -DSAMPLE_NEEDS_DOUBLE" : "") +
        " -DTILE_SIZE_M=" + to_str(cmdparser.tile_size_M.getValue()) +
        " -DTILE_GROUP_M=" + to_str(cmdparser.tile_group_M.getValue()) +
        " -DTILE_SIZE_N=" + to_str(cmdparser.tile_size_N.getValue()) +
        " -DTILE_GROUP_N=" + to_str(cmdparser.tile_group_N.getValue()) +
        " -DTILE_SIZE_K=" + to_str(cmdparser.tile_size_K.getValue());
    ...
    // Build kernel
    OpenCLProgramOneKernel executable(
        oclobjects,
        L"gemm.cl",
        "",
        "gemm_" + cmdparser.kernel.getValue(),
        build_options
    );
    ...
    // invoking the kernel is too complex to include here...

Kernel code:

// C := alpha*A*B + beta*C
// A is in column-major form
// B is in row-major form (transposed; this is different from gemm_nn)
// C is in column-major form
__attribute__((reqd_work_group_size(TILE_GROUP_M, TILE_GROUP_N, 1)))
kernel void gemm_nt (
    global const T * restrict A,
    int lda,    // column stride in elements for matrix A
    global const T * restrict B,
    int ldb,    // row stride in elements for matrix B
    global T * restrict C,
    int ldc,    // column stride in elements for matrix C
    int k,      // number of columns/rows in a matrix
    T alpha,
    T beta
)
{
    int Aind = get_group_id(0)*TILE_GROUP_M*TILE_SIZE_M + get_local_id(0);
    int Bind = get_group_id(1)*TILE_GROUP_N*TILE_SIZE_N + get_local_id(1);
    int Cind = Aind + Bind*ldc;

    T c[TILE_SIZE_M*TILE_SIZE_N] = {(T)0};

    // main accumulation loop
    for(int l = 0; l < k; ++l)
    {
        for(int i = 0; i < TILE_SIZE_M; ++i)
            for(int j = 0; j < TILE_SIZE_N; ++j)
                c[i*TILE_SIZE_N + j] +=
                    A[Aind + i*TILE_GROUP_M] *
                    B[Bind + j*TILE_GROUP_N];

        Aind += lda;
        Bind += ldb;
    }
    ...

General Matrix Multiply example from https://software.intel.com/content/www/us/en/develop/tools/opencl-sdk/training.html

Coding (for) FPGA Accelerators

FPGA accelerators are a bit different
▪ Historically handled as hardware design, in languages like SystemVerilog or VHDL
▪ Creating the accelerator is part of the programming flow
▪ Synthesis is time consuming = not something to do on the fly
Trend: making FPGAs more like SW
▪ High-level synthesis from C or C++
▪ Generating FPGA configurations from high-level APIs like OpenCL

[Figure: the application program source code goes through a regular compiler, while the algorithm description for acceleration goes through FPGA synthesis tools that produce a configuration file. At runtime, the application program drives the accelerator functionality configured into the FPGA chip through programming registers, via the hardware driver.]

Higher-Level Software Frameworks and APIs

“Raw” accelerator programming can be rather painful

Most users use higher-level frameworks of various types
▪ Provide common kernels; build from these “big blocks”
▪ Hide the complexities of using the hardware
▪ For novel applications, more manual work would be needed

Improve the Processor?

Using an accelerator is more complex than writing “normal code”
▪ (even though it can be hidden behind high-level frameworks)
You can offer accelerator-like efficiency by adding features to the processor
▪ Instructions tailored to certain operations

Vector Processors

Computers built for scientific computing (1970s)
▪ Quickly process vectors of numbers
▪ First really successful design: the classic Cray* 1
– Sustained 2 floating-point operations per cycle!
– Same processor running the OS and scalar code
– Indeed, Cray made a point of the machine being decent at running scalar code too (see Amdahl’s law)
For a long time, there were “compute” computers and “business” computers in the market

Cray 1, photographed by me at Deutsches Museum, München, in 2004

*Other names and brands may be claimed as the property of others

Digital Signal Processors

Digital Signal Processors (DSP) – very prominent in the 1990s and 2000s

▪ Small, efficient, low-latency processing

▪ Requirements on line-rate processing on a per-packet basis

▪ Flexibility – code any algorithm into the same phone or infrastructure hardware

A pain to program

▪ Bizarre and complex instruction sets (exposed pipelines, explicitly parallel very-long-instruction-word (VLIW) execution, funny loop addressing modes, …)

▪ Very bad at handling interrupts

▪ Ran specialized operating systems

No more discrete DSP chips are being designed, but you can buy DSP processors for integration into custom SoCs

▪ Ceva*, Cadence* Tensilica*, Synopsys* ARC*, …

*Other names and brands may be claimed as the property of others Screenshot from http://www.ti.com/processors/digital-signal-processors/c6000-floating-point-dsp/overview.html

Adding Specialized Instructions (Vector Example)

[Figure: the same loop compiled without vector instructions and with AVX2/AVX-512.]

8x performance increase using specialized vector instructions (AVX-512)
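As an illustration of the kind of code behind such numbers, here is a scalar loop next to an AVX2 version written with compiler intrinsics; this is a generic example, not the benchmark on the slide:

#include <immintrin.h>
#include <stddef.h>

/* Scalar: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* AVX2: eight single-precision additions per instruction. */
void add_avx2(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)  /* scalar tail for the leftover elements */
        c[i] = a[i] + b[i];
}

In practice, compilers often auto-vectorize loops like this; the intrinsics just make the vector width explicit.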

*Other names and brands may be claimed as the property of others Source: https://www.anandtech.com/show/15039/the-intel-core-i9-10980xe-review/5

Improve the Processor!

New instruction types are being added to all architectures, all the time
Advantages:
▪ Operation latency is shorter
– It takes time to set up an accelerator
▪ For small operations, doing the job on the processor is faster and more efficient
▪ Easier for operations that interleave with input and output operations
▪ Retains the generality of the compute infrastructure

Current Explosion: “AI” (Neural Networks)

“Everyone” is building AI acceleration into their chips, targeting all kinds of products:
▪ Mobile phones
▪ Laptops
▪ Datacenters
▪ Edge compute
▪ Vision systems
▪ Autonomous driving & driver support
▪ …

AI acceleration design has been a key driver for hardware design tool revenue in the past few years!

Hundreds of startups are active in the area.

Accelerating AI Workloads: Just What Is AI?

AI = Artificial Intelligence
▪ Wide range of algorithms and applications – regression, decision trees, random forests, clustering, machine learning, expert systems, …
Machine Learning (ML) = subset of AI
▪ The big explosion in recent years, especially in the form of Deep Learning
Neural Networks (NN) = subset of ML
▪ Convolutional NN (CNN) = a particular type of NN
▪ Deep Learning (DL) = common use of CNN

Neural Network Example: Training vs Inference

Training:
▪ Huge data volume – lots of labeled data!
▪ Time consuming
▪ Millions of weights
▪ A forward “wave” (is this a “t-shirt”?) plus backward error propagation
▪ Runs on general-purpose processors, GPGPU, or special accelerators (like Google* TPU*, Intel Gaudi)
▪ Produces the “final” model weights

Inference:
▪ Fast, operates on a single small input
▪ A forward pass through the ”compiled model” classifies a single input (“handbag”)
▪ Runs on general-purpose processors, GPGPU, CNN accelerators, FPGA, ...

*Other names and brands may be claimed as the property of others

Accelerating Training

Mostly in the data center
▪ Requires large memory, quick access to massive amounts of training data
▪ Throughput is the primary goal

General-purpose processor + special instructions
• Intel® VNNI instructions
GPGPU
• Nvidia* “Tensor Cores” added to GPUs specifically for this
Google* “Tensor Processing Unit” (TPU)
• Built to run Tensorflow code
Intel® Gaudi Training Accelerator
• Has a companion inference chip (called Goya)
Cerebras* Wafer Scale Engine
• Biggest “chip” ever – 56x the size of anything else

*Other names and brands may be claimed as the property of others

Accelerating Inference

Most are built on the idea of 100s or 1000s of small processors working together. Some are sold as discrete chips, others are IP blocks for use inside a chip design.

Out in the world and in the data center
▪ Inference requires less memory, can do with less precision, runs in real time, part of always-on systems…
▪ Latency and low power per operation are often more important than maximum throughput

General-purpose processor + special instructions
• Intel® VNNI
GPGPU
• Not just Intel: AMD*, Nvidia*
• Also mobile phone GPUs
FPGA
• Very good for inference
• (Discrete chip)
Digital Signal Processors
• Add features to support “AI”
• “Cadence* Tensilica* HiFi ML”
• (IP block)
Intel Movidius™ Myriad™ X VPU
• Image processor core
• Neural compute engines
• (Discrete chip)
Google* TPU
• Supports inference as well as training
• Datacenter
• (Discrete chip)
ARM* Ethos* NPU
• “Neural Processing Unit”
• Dedicated for NN processing
• Devices, SoCs
• (IP block)
Baidu* Kunlun* AI
• Datacenter
• (Discrete chip)

*Other names and brands may be claimed as the property of others

Enabling: Software Frameworks for AI

Many frameworks and kits are available for “AI” and “ML” programming
▪ A new accelerator can (has to) use an existing framework to get into the market
▪ Just like how high-level programming languages let code port between architectures
▪ Multiple levels – Keras uses Tensorflow, Tensorflow builds on OpenCL, …

Examples: Tensorflow*, PyTorch*, Keras*, Caffe*, oneAPI™, SciKit-Learn, Amazon* Machine Learning, Apache* Mahout*, Intel® OpenVINO™, OpenCL*, …

*Other names and brands may be claimed as the property of others

Intel® oneAPI™ – Example API for AI & More

One API to port code across execution engines
▪ Hardware independence
▪ API & low-level C++ coding
▪ Open specification and open source

[Architecture diagram: the application program sits on top of middleware and frameworks; direct programming goes through Data Parallel C++ (DPC++) and API-based programming through the oneAPI libraries, complemented by analysis and debug tools; oneAPI underneath targets the processor, GPU (graphics processing unit), FPGA (field-programmable gate array), and specialized accelerators.]

Support tools are also important!
▪ Intel® VTune™ to find bottlenecks
▪ Intel® Advisor to determine how to code
▪ Standard debuggers for code debug

Intel® DevCloud
▪ Access to specialized hardware
▪ Allows testing on a range of architectures without having hardware locally

https://www.oneapi.com/ https://software.intel.com/content/www/us/en/develop/tools/oneapi.html

Thank You!