Bill Jenkins, Intel Programmable Solutions Group

Agenda

9:00 am  Welcome
9:15 am  Introduction to FPGAs
9:45 am  FPGA Programming Models: RTL
10:15 am FPGA Programming Models: HLS
11:00 am Lab 1: HLS Flow
11:45 am Lunch
12:30 pm FPGA Programming Models: OpenCL
1:00 pm  High Performance Data Flow Concepts
1:30 pm  Lab 2: OpenCL Flow
2:15 pm  Introduction to DSP Builder
3:00 pm  Introduction to the Acceleration Stack
4:00 pm  Lab 3: Acceleration Stack
4:30 pm  Curriculum & University Program Coordination

The Problem: Flood of Data

By 2020, the average internet user will generate ~1.5 GB of traffic per day

Smart hospitals will be generating over 3 TB per day
Self-driving cars will be generating over 4,000 GB per day… each
A connected plane will be generating over 40 TB per day
A connected factory will be generating over 1 PB per day

Sensor data rates: radar ~10-100 KB per second, sonar ~10-100 KB per second, GPS ~50 KB per second, lidar ~10-70 MB per second, cameras ~20-40 MB per second

All numbers are approximate.
Sources:
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172

Typical HPC Workloads

Astrophysics, Genomics / Bio-Informatics, Artificial Intelligence, Molecular Dynamics*, Big Data Analytics, Financial, Weather & Climate, Cyber Security

* Source: https://comp-physics-lincoln.org/2013/01/17/molecular-dynamics-simulations-of-amphiphilic-macromolecules-at-interfaces/

Fast Evolution of Technology

We now have the compute to solve these problems in near real-time

Bigger Data
– Image: 50 MB / picture
– Audio: 5 MB / song
– Video: 47 GB / movie

Better Hardware
– Transistor density doubles every 18 months
– Cost / GB in 1995: $1000.00
– Cost / GB in 2015: $0.03

Smarter Algorithms
– Advances in neural networks leading to better accuracy in training models

50+ Years of Moore's Law

Computing has Changed…

The Urgency of Parallel Computing

If engineers keep building processors the way we do now, CPUs will get even faster but they’ll require so much power that they won’t be usable.

—Patrick Gelsinger, former Intel Chief Technology Officer, February 7, 2001
Source: http://www.cnn.com/2001/tech/ptech/02/07/hot.chips.idg/

Implications to High Performance Computing

(Chart: 50 GFLOPS/W, ~100 MW, 2022)

Challenges Scaling Systems to Higher Performance

CPU Intensive → Result: slow performance, excessive power requirements
Memory Intensive → Result: memory bottleneck
IO Intensive → Result: system I/O bottleneck (high latency)

Need to think about compute offload as well as ingress/egress processing

Diverse Application Demands

The Intel Vision

Heterogeneous Systems: ▪ Span from CPU to GPU to FPGA to dedicated devices with consistent programming models, languages, and tools

CPUs, GPUs, FPGAs, ASSPs

Heterogeneous Computing Systems

Modern systems contain more than one kind of processor
▪ Applications exhibit different behaviors:
– Control intensive (searching, parsing, etc.)
– Data intensive (image processing, data mining, etc.)
– Compute intensive (iterative methods, financial modeling, etc.)
▪ Gain performance by using the specialized capabilities of different types of processors

Separation of Concerns

Two groups of developers:
▪ Domain experts concerned with getting a result
– Host application developers leverage optimized libraries
▪ Tuning experts concerned with performance
– Typical FPGA developers that create optimized libraries

The Intel® Math Kernel Library is a simple example of raising the level of abstraction to the math operations:
▪ Domain experts focus on formulating their problems
▪ Tuning experts focus on vectorization and parallelization

FPGA Enabled Performance and Agility

Workload Optimization: ensure cores serve their highest value processing

Efficient Performance: improve performance/watt

Real-Time: high bandwidth connectivity and low-latency parallel processing

Developer Advantage: code re-use across Intel FPGA data center products

FPGAs enhance CPU-based processing by accelerating algorithms and minimizing bottlenecks

FPGAs Provide Flexibility to Control the Data Path

Compute Acceleration/Offload ▪ Workload agnostic compute ▪ FPGAaaS ▪ Virtualization

Intel® Xeon® Processor

Inline Data Flow Processing
▪ Machine learning
▪ Object detection and recognition
▪ Advanced driver assistance systems (ADAS)
▪ Gesture recognition
▪ Face detection

Storage Acceleration
▪ Machine learning
▪ Cryptography
▪ Compression
▪ Indexing

FPGA Architecture

Field Programmable Gate Array (FPGA)
▪ Millions of logic elements
▪ Thousands of embedded memory blocks
▪ Thousands of DSP blocks
▪ Programmable interconnect
▪ High speed transceivers
▪ Various built-in hardened IP

Used to create custom hardware out of logic modules and programmable routing switches!

FPGA Architecture: Basic Elements

Basic Element: a 1-bit configurable operation plus a 1-bit register (to store the result)

Configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB

FPGA Architecture: Flexible Interconnect

Basic Elements are surrounded with a flexible interconnect

FPGA Architecture: Flexible Interconnect

Wider custom operations are implemented by configuring and interconnecting Basic Elements

FPGA Architecture: Custom Operations Using Basic Elements

16-bit add

32-bit sqrt
Your custom 64-bit bit-shuffle and encode

Wider custom operations are implemented by configuring and interconnecting Basic Elements

FPGA Architecture: Memory Blocks

Memory Block: 20 Kb, with addr, data_in, and data_out ports

Can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches

FPGA Architecture: Floating-Point Multiplier/Adder Blocks

Dedicated floating-point multiply and add blocks

DSP Blocks

Thousands of DSP blocks in modern FPGAs
▪ Configurable to support multiple features
– Variable-precision fixed-point multipliers
– Adders with accumulation register
– Internal coefficient register bank
– Rounding
– Pre-adder to form tap-delay line for filters
– Single-precision floating-point multiplication, addition, accumulation

FPGA Architecture: Configurable Routing

Blocks are connected into a custom data-path that matches your application.

FPGA Architecture: Configurable IO

The custom data-path can be connected directly to custom or standard IO interfaces for inline data processing

FPGA I/Os and Interfaces

FPGAs have flexible IO features to support many IO and interface standards
▪ Hardened memory controllers
– Available interfaces to off-chip memory such as HBM, HMC, DDR SDRAM, QDR SRAM, etc.
▪ High-speed transceivers
▪ PCIe* Hard IP
▪ Phase-locked loops

*Other names and brands may be claimed as the property of others

Intel® FPGA Product Portfolio

Wide range of FPGA products for a wide range of applications

Product families range from non-volatile, low-cost, single-chip, small-form-factor devices, through low-power cost-sensitive and midrange (cost/power/performance balance) families, up to high-performance, state-of-the-art FPGAs

▪ Product features differ across families
– Logic density, embedded memory, DSP blocks, transceiver speeds, IP features, process technology, etc.

Mapping a Simple Program to an FPGA

High-level code:
Mem[100] += 42 * Mem[101]

CPU instructions:
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]

First let's take a look at execution on a simple CPU

(Diagram: a general-purpose CPU data-path with PC/fetch logic, register file, ALU, and load/store units)

Fixed and general architecture:
– General "cover-all-cases" data-paths
– Fixed data-widths
– Fixed operations

Looking at a Single Instruction

(Same data-path; a single instruction exercises only a small part of it)

Very inefficient use of hardware!

Sequential Architecture vs. Dataflow Architecture

Sequential CPU Architecture vs. FPGA Dataflow Architecture:

(Diagram: the CPU reuses shared resources over time to execute load, load, 42, multiply, add, store; the FPGA dataflow architecture lays the same operations out spatially as a pipeline)

Custom Data-Path on the FPGA Matches Your Algorithm!

High-level code

Mem[100] += 42 * Mem[101]

Custom data-path: build exactly what you need
– Operations
– Data widths
– Memory size & configuration

Efficiency: throughput / latency / power

Advantages of Custom Hardware with FPGAs

▪ Custom hardware!
▪ Efficient processing
▪ Fine-grained parallelism
▪ Low power
▪ Flexible silicon
▪ Ability to reconfigure
▪ Fast time-to-market
▪ Many available I/O standards

(Die diagram: DSP blocks, M20K blocks, transceiver channels and PCS, I/O PLLs, PLLs, memory controllers, PCIe* IP, IOs, core logic)

*Other names and brands may be claimed as the property of others

RTL

FPGA Development and Programming Tools

Personas span from software developer to hardware developer:
▪ Software Developer: Intel® SoC FPGA Embedded Design Suite (EDS), Intel® FPGA SDK for OpenCL
▪ Algorithm Designer: DSP Builder, Intel® HLS Compiler
▪ IP Library Developer / HDL Designer: Verilog, VHDL

All flows build on the Intel® Quartus® Prime Design Software

Verilog, VHDL, and the Intel® FPGA SDK for OpenCL are currently supported by the Acceleration Stack. High-Level Synthesis can be used manually by following an application note.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

Traditional FPGA Design Entry

Circuits are described using Hardware Description Languages (HDL) such as VHDL or Verilog

A designer must describe the behavior of the algorithm to create a low-level digital circuit
▪ Logic, registers, memories, state machines, etc.

Design times range from several months to even years!

Traditional FPGA Design Flow

Time-Consuming Effort

Behavioral Simulation

HDL Synthesis

Place & Route / Timing Analysis / Timing Closure

Board Simulation & Test

Intel® Quartus® Prime Design Software

Default Operating Environment

Project Navigator

Tool View window

Tasks window IP Catalog

Messages window

Intel® Quartus® Prime Design Software Projects

Description
▪ Collection of related design files & libraries
▪ Must have a designated top-level entity
▪ Targets a single device
▪ Settings stored in the software settings file (.qsf)
▪ Compiled netlist information stored in the qdb folder in the project directory

Create new projects with the New Project Wizard
▪ Can also be created using Tcl scripts

Intel® FPGA Design Store

Download complete example design templates for specific development kits: https://cloud.altera.com/devstore/platform/

Design examples include design files, device programming files, and software code as required

Install .par files and select as template in the New Project Wizard

Device Selection

Choose device family & family category
Filter the device list (transceiver options, SoC options, etc.)
Choose the specific part from the list

Tcl: set_global_assignment -name FAMILY "<device family name>"
Tcl: set_global_assignment -name DEVICE <part number>

Chip Planner

Graphical view of:
▪ Layout of device resources
▪ Routing channels between device resources
▪ Global clock regions

Uses:
▪ View placement of design logic
▪ View connectivity between resources used in design
▪ Make placement assignments
▪ Debug placement-related issues

Access: Tools menu or toolbar

(Window panes: device floorplan (aka Chip View), Report window, Selected Node Properties, Tasks window, Layers Settings; the floorplan shows, for example, a memory block in use and an unused LAB)

Floorplan Views

Overall device resource usage

Lower level block usage

Lowest level routing detail

Zoom in for detailed logic implementation & routing usage

Pin Planner

Interactive graphical tool for assigning pins
▪ Drag & drop pin assignments
▪ Set pin I/O standards
▪ Reserve future I/O locations

Access: Assignments menu → Pin Planner, toolbar, or Tasks window

Default window panes
▪ Package View
▪ All Pins list
▪ Groups list
▪ Tasks window
▪ Report window

Pin Planner Window

(Window panes: Groups list, Toolbar, Package View, Tasks pane, All Pins list)

The Programmer

Tools menu → Programmer

State Machine Editor

File menu → New, or Tasks window; select State Machine File (.smf)

Create state machines in the GUI
▪ Manually, by adding individual states, transitions, and output actions
▪ Automatically, with the State Machine Wizard (Tools menu & toolbar)

Double-click states & transitions to edit properties: name, equations, actions

Generate state machine HDL code (required)
▪ VHDL
▪ Verilog
▪ SystemVerilog

Platform Designer

Components in a system use different interfaces to communicate, some standard, some non-standard (e.g. a 32-bit processor master and a 64-bit PCI Express* master addressing 8-, 16-, 32-, and 64-bit slaves)

A typical system requires significant engineering work to design custom interface logic: bus interfaces, interrupt controller, address decoder, arbiter, width adapters

Integrating design blocks and intellectual property (IP) is tedious and error-prone

Automatic Interconnect Generation

Avoids error-prone integration

Saves development time with automatic logic & HDL generation

Enables you to focus on value-add blocks

Platform Designer improves productivity by automatically generating the system interconnect logic (address decoder, arbiter, width adapters, bus interfaces)

The Platform Designer GUI

Access in Tools menu, toolbar, or Tasks window

Draggable, detachable tabs

(Panes: IP Catalog, System Contents, Hierarchy, Messages)

Lab 1

High Level Synthesis

Can Also Be Wrapped With Higher Level Flows

C and C++ functions feed the Intel® HLS Compiler, which produces RTL IP for Platform Designer

The Software Programmer's View

Programmers develop in mature software environments
– Ideas can easily be expressed in languages such as 'C'
– Typically start with a simple sequential program
– Use parallel APIs / language extensions to exploit multiple cores for additional performance
– Compilation times are almost instantaneous
– Immediate feedback
– Rich debugging tools

High Level Design is the Bridge Between HW & SW

100x more software engineers than hardware engineers: key to wide-spread adoption of FPGAs in the datacenter

Debugging software is much faster than hardware
Many functions are easier to specify in software than RTL
Simulation of RTL takes thousands of times longer than software

Design exploration is much easier and faster in software

We need to raise the level of abstraction (Transistors → RTL → Software, with increasing productivity and abstraction)
▪ Similar to what assembly programmers did with C over 30 years ago
– (Today) Abstract away FPGA design with higher-level languages
– (Today) Abstract away FPGA hardware behind platforms
– (Tomorrow) Leverage pre-compiled libraries as software services

HLS Use Model

(Diagram: one C/C++ code base (src.c, lib.h): a standard gcc/g++ compile of main() and its call tree produces an EXE, while functions marked with directives go through the HLS compiler to produce HDL IP for the Intel® Quartus® ecosystem)

100% Makefile compatible

Intel® HLS Compiler

Targets Intel® FPGAs
Command-line executable: i++
Builds an IP block
▪ To be integrated into a traditional FPGA design using FPGA tools

Flow: C/C++ source → HLS Compiler → IP → Platform Designer

Leverages a standard C/C++ development environment

Goal: the same performance as hand-coded RTL with 10-15% more resources

HLS Procedure

1. Create component and testbench in C/C++
2. Functional verification with g++ or i++ (functional iterations on the C/C++ source)
   • Use -march=x86-64
   • Both compilers are compatible with GDB
3. Compile with i++ -march=<FPGA family> for HLS (architectural iterations)
   • Generates IP
   • Examine the compiler-generated reports
   • Verify the design in simulation
4. Run Intel® Quartus® Prime compilation on the generated IP
   • Generate QoR metrics
5. Integrate the IP with the rest of your FPGA system

Intel® HLS Compiler Usage and Output

Develop with C/C++ (a GDB-compatible executable):
src.c + lib.h → i++ -march=x86-64 src.c → a.exe|out

Run the compiler for HLS:
src.c + lib.h → i++ -march=<FPGA family> --component func src.c →
▪ a.exe|out: an executable whose calls to func run in a simulation of the synthesized IP
▪ a.prj/components/func/: all the files necessary to include the IP in a Quartus project (.qsys, .ip, .v, etc.)
▪ a.prj/reports/: component hardware implementation reports
▪ a.prj/verification/: simulation testbench
▪ a.prj/quartus/: Quartus project to compile all IP

a is the default output name; the -o option can be used to specify a non-default output name

HLS Procedure: x86 Emulation

(The HLS procedure flow again, highlighting step 2: functional verification with g++ or i++ using -march=x86-64; both compilers are compatible with GDB)

Simple Example Program: i++ and g++ Flow

Example program (test.cpp):

// test.cpp
#include <stdio.h>
int main() {
  printf("Hello world\n");
  return 0;
}

Terminal commands and outputs (using the default -march=x86-64):

$ g++ test.cpp
$ ./a.out
Hello world
$
$ i++ test.cpp
$ ./a.out
Hello world
$

g++ Compatibility

The Intel HLS Compiler is command-line compatible with g++
▪ Similar command-line flags, x86 behavior, and compilation flow
▪ Changing "g++" to "i++" should just work
▪ x86 behavior should match g++
– Except for integer promotion (discussed later)
▪ No source modifications required (for x86 mode)
▪ Support for GNU Makefiles

i++ Options: g++ Compatible Options

Option | Description
-h | Display help information
-o <name> | Specify a non-default output name
-c | Instructs the compiler to generate the object files and not the executable
-march=<arch> | Compile for architecture: x86-64 (default) or an FPGA family
-v | Verbose mode
-g | Generate debug information (default)
-g0 | Do not generate debug information
-I<dir> | Add <dir> to the include path
-D<macro>[=<val>] | Define <macro> with <val> or 1
-L<dir> -l<library> | Library search directory and library name when linking

Example: i++ -march=x86-64 myfile.cpp -o myexe

i++ Options: FPGA Related Options

Option | Description
--component <components> | Specify a comma-separated list of function names to be synthesized to RTL
--clock <target> | Optimizes the RTL for the specified clock frequency or period
-ghdl | Enable full debug visibility and logging of all signals when the verification executable is run
--quartus-compile | Compiles the resulting HDL files using the Intel® Quartus® Prime software
--simulator <simulator> | Specify the simulator used for verification; "none" skips testbench generation
--x86-only | Only create the executable for the testbench; no RTL or cosim support
--fpga-only | Create the FPGA component project, RTL, and cosim support, but no testbench binary

Example: i++ -march=<FPGA family> --component mycomp --clock 400MHz myfile.cpp

There are many other optimization options available; see the Intel HLS Compiler Reference Manual

The Default Interfaces

component int add(int a, int b) {
  return a+b;
}

C++ construct → HDL interface:
▪ Scalar arguments → conduits associated with the default start/busy interface
▪ Pointer arguments → Avalon memory-mapped master interface
▪ Global scalars and arrays → Avalon memory-mapped master interface

(Generated module ports: start, busy, done, stall, a[31:0], b[31:0], returndata[31:0], clock)

Note: more on interfaces later
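The defaults can be overridden with explicit interface classes. As a minimal sketch, assuming the ihc::stream_in/ihc::stream_out types from "HLS/hls.h" (the component itself is hypothetical), a component can expose Avalon streaming ports instead of argument conduits:

#include "HLS/hls.h"

// Hypothetical component: Avalon-ST style streaming ports instead of
// scalar conduits; read() and write() are blocking.
component void scale_stream(ihc::stream_in<int> &in,
                            ihc::stream_out<int> &out) {
  int x = in.read();   // pop one value from the input stream
  out.write(x * 2);    // push the result to the output stream
}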

Example Makefile

FILE := myapp
DEVICE := Arria10

all: gpp emu fpga

gpp: $(FILE).cpp
	g++ $(GCFLAGS) $(FILE).cpp -o $(FILE).out

emu: $(FILE).cpp
	i++ $(GCFLAGS) $(FILE).cpp -o $(FILE)_emu.out

fpga: $(FILE).cpp
	i++ $(GCFLAGS) $(FILE).cpp -o $(FILE)_fpga.out -march=$(DEVICE)

x86 Debugging Tools: printf/cout, gdb, Valgrind

Develop with C/C++ (a GDB-compatible executable):
src.c + lib.h → i++ -march=x86-64 src.c → a.exe|out

Using printf()

Requires "HLS/stdio.h"
▪ Maps to <stdio.h> when appropriate

Can be included in the testbench or the component
▪ Used with no limitations in the x86 emulation flow

printf statements inside the component are ignored for HDL generation
▪ Ignored in the cosimulation flow with an HDL simulator

Using printf(): Example

Example program (test.cpp):

// test.cpp
#include "HLS/stdio.h"

void say_hello() {
  printf("Hello from the component\n");
}

int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}

Terminal commands and output:

$ i++ test.cpp
$ ./a.out
Hello from the testbench
Hello from the component
$
$ i++ test.cpp -march=Arria10 --component say_hello
$ ./a.out
Hello from the testbench
$

Debugging Using gdb

i++ integrates well with GNU gdb
▪ Debug data is generated by default
– Unlike g++, -g is enabled by default; use -g0 to turn off debug data

The -march=x86-64 flow:
▪ Can step through any part of the code (including the component)

The -march=<FPGA family> flow:
▪ Can step through testbench code
▪ gdb does not see the component-side execution (that runs in an HDL simulator)

gdb Example

Example program (test.cpp):

// test.cpp
#include "HLS/hls.h"
#include "HLS/stdio.h"

component void say_hello() {
  printf("Hello from the component\n");
}

int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}

Terminal commands and output:

$ i++ test.cpp -march=x86-64 -o test-x86
$ gdb ./test-x86
………
(gdb)

$ i++ test.cpp -march=Arria10 -o test-fpga
$ gdb ./test-fpga
………
(gdb)

Debugging with Valgrind

"Valgrind is an instrumentation framework for building dynamic analysis tools."
▪ Valgrind tools can detect:
– Memory leaks
– Invalid pointer uses
– Use of uninitialized values
– Mismatched use of malloc/new vs free/delete
– Doubly freed memory
▪ Use it to debug the component and testbench in the x86 emulation flow

Simple Valgrind Example

Example program (test.cpp):

    // test.cpp
 1  #include "HLS/stdio.h"
 2  #include <stdlib.h>
 3
 4  int bin_count (int *bins, int a) {
 5    return ++bins[a];
 6  }
 7
 8  int main() {
 9    int *bins = (int *) malloc(16 * sizeof(int));
10    srand(0);
11    for (int i = 0; i < 256; i++) {
12      int x = rand();
13      int res = bin_count(bins, x);
14      printf("Count val: %d\n", res);
15    }
16    return 0;
17  }

Terminal commands and output:

$ i++ test.cpp
$ ./a.out
Segmentation Fault
$ valgrind --leak-check=full --show-reachable=yes ./a.out
……
==9744== Invalid read of size 4
==9744==    at 0x4006B3: bin_count(int*, int) (test.cpp:5)
==9744==    by 0x400723: main (test.cpp:13)
==9744==  Address 0x1b31075dc is not stack'd, malloc'd or (recently) free'd
==9744== Process terminating with default action of signal 11 (SIGSEGV)
==9744==  Access not within mapped region at address 0x1B31075DC
==9744==    at 0x4006B3: bin_count(int*, int) (test.cpp:5)
==9744==    by 0x400723: main (test.cpp:13)
……
==9744== 64 bytes in 1 blocks are still reachable in loss record 1 of 1
==9744==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==9744==    by 0x4006ED: main (test.cpp:9)
……
Segmentation fault

Valgrind: Segmentation Fault Fixed

int bin_count (int *bins, int a) {
  return ++bins[a % 16];
}

int main() {
  int *bins = (int *) malloc(16 * sizeof(int));
  srand(0);
  for (int i = 0; i < 256; i++) {
    int x = rand();
    int res = bin_count(bins, x);
    printf("Count val: %d\n", res);
  }
  free(bins);
  return 0;
}

HLS Procedure: Cosimulation

(The HLS procedure flow again, highlighting simulation: after compiling with i++ -march=<FPGA family>, verify the generated IP in simulation)

Example Component/Testbench Source

#include "HLS/hls.h"
#include "assert.h"
#include "HLS/stdio.h"
#include "stdlib.h"

component int accelerate(int a, int b) {
  return a+b;
}

int main() {
  srand(0);
  for (int i=0; i<10; ++i) {
    int x=rand() % 10;
    int y=rand() % 10;
    int z=accelerate(x, y);
    printf("%d + %d = %d\n", x, y, z);
    assert(z == x + y);
  }
  return 0;
}

Compile: i++ -march=<FPGA family> --component accelerate mysource.cpp

accelerate() becomes an FPGA component
– Use the --component i++ argument or the component attribute in source

main() becomes the testbench for component accelerate()

Translation from C Function API to HDL Module

All component functions are synthesized to HDL
▪ Each synthesized component is an independent HDL module

Component functions can be declared:
▪ Using the component keyword in source
▪ Specifying "--component <name>" on the command line

Cosimulation

Combines the x86 testbench with RTL simulation

HDL code for the component runs in an RTL simulator
▪ An RTL testbench is automatically created from the software main(); everything else called from main runs on x86 as the testbench

Communication uses the SystemVerilog Direct Programming Interface (DPI)
▪ Allows C/C++ to interface with SystemVerilog
▪ An inter-process communication (IPC) library passes testbench input data to the RTL simulator and returns the data back to the x86 testbench

Cosimulation: Verifying HLS IP

The Intel® HLS compiler automatically compiles and links the C++ testbench with an instance of the component running in an RTL simulator
▪ To verify the RTL behavior of the IP, just run the executable generated by the HLS compiler targeting the FPGA architecture
– Any call to the component function becomes a call into the simulator through DPI

src.c + lib.h → i++ -march=<FPGA family> src.c → a.exe|out, which exchanges each IP function call and its data with the simulation in a.prj/verification/

Default Simulation Behavior

Function calls to the simulator are sequential by default:

#include "HLS/hls.h"
#include "stdio.h"

component int acc(int a, int b) {
  return a+b;
}

int main() {
  int x1, x2, x3;
  x1 = acc(1, 2);
  x2 = acc(3, 4);
  x3 = acc(5, 6);
  …
}

Streaming Simulation Behavior

Use enqueue function calls to stream data into the component:

#include "HLS/hls.h"
#include "stdio.h"

component int acc(int a, int b) {
  return a+b;
}

int main() {
  int x1, x2, x3;
  altera_hls_enqueue(&x1, &acc, 1, 2);
  altera_hls_enqueue(&x2, &acc, 3, 4);
  altera_hls_enqueue(&x3, &acc, 5, 6);
  altera_hls_component_run_all("acc");
  …
}

Viewing Component Waveforms

▪ Compile the design with the i++ -ghdl flag
– Enables full visibility and logging of all HDL signals in simulation
▪ After the cosimulation executes, the waveform is available at a.prj/verification/vsim.wlf
▪ Examine it with the ModelSim GUI:
– vsim a.prj/verification/vsim.wlf

Viewing Waveforms in ModelSim

Locate the component, then add its signals to the waveform

Cosimulation Design Process

Compile and verify on x86 → Compile for FPGA → Simulate using ModelSim

Compile and verify on x86:
– Iterate on the algorithm
– Functional verification of the design
– Debug using gdb/valgrind

Compile for FPGA:
– Examine the FPGA reports
– Iterate on the architecture
– Use the reports as feedback on what the bottlenecks are

Simulate using ModelSim:
– Test functionality
– Test latency and performance (through verification stats)

Main HTML Report

The Intel® HLS Compiler automatically generates an HTML report that analyzes various aspects of your function, including area, loop structure, memory usage, and system data flow
▪ Located at a.prj/reports/report.html

Many types of reports

HTML Report: Summary

Overall compile statistics
▪ FPGA resource utilization
▪ Compile warnings
▪ Quartus® fitter results
– Available after Quartus compilation
▪ etc.

HTML Report: Loops

Serial loop execution hinders the performance of the function's dataflow circuit
▪ Use the Loop Analysis report to see if and how each loop is optimized
– Helps identify component pipeline bottlenecks

For each loop:
– Unrolled? If yes: fully unrolled, partially unrolled, automatically unrolled, or via #pragma unroll?
– Pipelined? If yes: what is the Initiation Interval (launch frequency of a new iteration)? Are there dependencies preventing optimal II?
– Not pipelined? What is the reason for serial execution?

Loop Unrolling

Loop unrolling: replicate hardware to execute multiple loop iterations at once (see the sketch after the diagram)
▪ Simple loops are unrolled by the compiler automatically
▪ Use #pragma unroll to control loop unrolling
▪ The loop must not have a dependency from iteration to iteration

(Diagram: a serial loop running Op 1 then Op 2 each iteration is unrolled so iterations 1…5 execute Op 1 and Op 2 side by side)
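A minimal sketch of the pragma in use (the component and its fixed trip count are hypothetical): with independent iterations, full unrolling replicates the operation hardware four times:

// Hypothetical component: iterations are independent, so the
// multiply hardware is replicated 4x by the full unroll.
component void scale4(int data[4]) {
  #pragma unroll
  for (int i = 0; i < 4; i++) {
    data[i] = data[i] * 3;  // all four multiplies happen in parallel
  }
}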

Loop Pipelining

Loop pipelining: launch a loop iteration as soon as its dependencies are resolved
▪ Initiation interval (II): the launch frequency (in cycles) of a new loop iteration
– II=1 is optimally pipelined
– Requires no dependencies, or dependencies that can be resolved in 1 cycle

(Diagram: serial execution runs iterations i0, i1, i2 through Op 1/Op 2/Op 3 back-to-back; pipelined execution overlaps them)
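A hedged illustration of where II grows (hypothetical component; exact cycle counts depend on the target device): an accumulation carried between iterations cannot always be resolved in one cycle:

// Hypothetical component: the loop-carried dependency on 'total'
// must resolve before the next iteration launches; a multi-cycle
// floating-point add can therefore push II above 1.
component float sum_all(float *data, int n) {
  float total = 0.0f;
  for (int i = 0; i < n; i++) {
    total += data[i];
  }
  return total;
}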

HTML Report: Loop Analysis

The Loop Analysis report shows how loops are implemented
– Ability to correlate with the source code

Typical entries:
– A compiler-added loop (not in the code): an implicit infinite loop that allows the component to run continuously in pipelined fashion
– Pipelined loop, II=1
– Pipelined loop, II=2 due to a memory dependency
– Fully unrolled loop, due to a user #pragma unroll

HTML Report: Area Analysis

View detailed estimated resource consumption by system or source line
▪ Analyze data control overhead
▪ View the memory implementation
▪ Shows resource usage: ALUTs, FFs, RAMs, DSPs
▪ Identifies inefficient uses

HTML Report: Component Viewer

Displays an abstracted netlist of the HW implementation
▪ View the data flow pipeline
– See loads and stores
– Interfaces, including stream reads and writes
– Memory structure
– Loop structure
– Possible performance bottlenecks
– Unpipelined loops are colored light red
– Stallable points are red

Mouse over a node to see a tooltip and details. Correlates with the source code.

HTML Report: Memory Viewer

Displays the local memory implementation and accesses
▪ Visualize the memory architecture
– Banks, widths, replication, etc.
▪ Visualize load-store units (LSUs)
– Stall-free?
– Arbitration
– Red indicates stallable

Mouse over a node to see a tooltip and details. Correlates with the source code.

HTML Report: Verification Statistics

Reports execution statistics from testbench execution; available after the component is simulated (i.e. the testbench executable ran)
▪ Number and type of component invocations
▪ Latency of the component
▪ Dynamic initiation interval of the component
▪ Data rates of streams

Measurements are based on the latest execution of the testbench

HLS Procedure: Integration

(The HLS procedure flow again, highlighting the final steps: run Intel® Quartus® Prime compilation on the generated IP to obtain QoR metrics, then integrate the IP with the rest of your FPGA system)

Quartus® Generated QoR Metrics for IP

Use the Intel® Quartus® Prime software to generate quality-of-results reports
▪ i++ creates the Quartus project in a.prj/quartus
▪ To generate QoR data (final resource utilization, fmax):
– Run quartus_sh --flow compile quartus_compile
– Or use the i++ --quartus-compile option
▪ The report is part of the HTML report
– a.prj/reports/report.html
– Summary page

Intel® Quartus® Software Integration

The a.prj/components directory contains all the files to integrate

▪ One subdirectory for each component
– Portable; can be moved to a different location if desired

Two use scenarios:
1. Instantiate in HDL
2. Add the IP to a Platform Designer system

HDL Instantiation

Add components to the Intel® Quartus project
▪ .qsys for Standard Edition
▪ .ip for Pro Edition

Instantiate the component module in your design
▪ Use the template a.prj/components/<component>/<component>_inst.v

Platform Designer System Integration Tool

Catalog of available IP and systems:
▪ Interface protocols
▪ Memory IP
▪ DSP
▪ Embedded
▪ Bridges
▪ PLL
▪ Custom components
▪ Custom systems

Connect custom IP, simplify integration, accelerate development, and automate integration tasks (HDL is generated for you)

Platform Designer Integration

A Platform Designer component is generated for each HLS component:
▪ For Platform Designer Standard: a.prj/components/<component>/<component>.qsys
▪ For Platform Designer: a.prj/components/<component>/<component>.ip

In Platform Designer, instantiate the component from the IP Catalog in the HLS project directory
▪ Add the IP directory to the IP Catalog search locations
– May use a.prj/components/**/*
▪ Can be stitched together with other user IP or Intel® Quartus® IP with compatible interfaces

See the tutorials under tutorials/usability

Platform Designer HLS Component Example

Example
▪ Cascaded low-pass filter and high-pass filter, both instantiated as HLS components

HLS-Backed Components

▪ A generic component can be used in place of the actual IP core
▪ Choose Implementation Type: HLS
– Specify the HLS source files
– Compile the component
– Run cosim
– Display the HTML report

Lab 2

OpenCL

Intel® FPGA SDK for OpenCL™ Flow

A system-level view: the OpenCL host program plus kernels (foo.cl), compiled against a Board Support Package, with the System Integrator stitching in the HDL IP core.

Kernel compiler:
▪ Optimized pipelines from C/C++

Board support package (created by a hardware developer):
▪ Timing closure, pinouts, periphery planning: we've got it covered

System integrator (Quartus runs behind the scenes):
▪ Optimized I/O interconnects

OpenCL

A hardware-agnostic compute language
▪ Invented by Apple; the specification was donated to the Khronos Group in 2008, which manages it today

OpenCL C kernels plus a C/C++ host API

What does OpenCL™ give us?
▪ An industry-standard programming model
▪ Functional portability across platforms
▪ A well-thought-out specification

Heterogeneous Platform Model

OpenCL platform model: a Host plus one or more Devices, each device with its own Global Memory, and Host Memory on the host side

Example x86 platform: a CPU host connected to accelerator devices over PCIe

OpenCL Use Model: Abstracting the FPGA Away

Host code:

main() {
  read_data( … );
  manipulate( … );
  clEnqueueWriteBuffer( … );
  clEnqueueNDRange(…, sum, …);
  clEnqueueReadBuffer( … );
  display_result( … );
}

OpenCL accelerator code:

__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *y) {
  int gid = get_global_id(0);
  y[gid] = a[gid] + b[gid];
}

The host code goes through a standard gcc compiler to an EXE; the accelerator code goes through the offline compiler (with Quartus Prime behind it) to an AOCX bitstream

OpenCL Host Program

Pure software written in standard C/C++
Communicates with the accelerator devices via an API which abstracts the communication between the host processor and the kernels

main() {
  read_data_from_file( … );
  manipulate_data( … );

  clEnqueueWriteBuffer( … );     // copy data from host to FPGA
  clEnqueueNDRange(…, sum, …);   // tell the FPGA to run a particular kernel
  clEnqueueReadBuffer( … );      // copy data from FPGA to host

  display_result( … );
}
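For orientation, a hedged sketch of the buffer and argument boilerplate those elided calls imply (standard OpenCL host API; the variable names and size N are hypothetical):

// Hypothetical host fragment: context/queue/kernel are assumed to
// exist already (see the clCreateProgramWithBinary slide).
cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, &err);
cl_mem d_y = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);

clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, N * sizeof(float), h_a, 0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_y);

size_t global = N;  // one work-item per data element
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(queue, d_y, CL_TRUE, 0, N * sizeof(float), h_y, 0, NULL, NULL);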

OpenCL Kernels

__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *answer) {
  int xid = get_global_id(0);
  answer[xid] = a[xid] + b[xid];
}

Kernel: a data-parallel function
▪ Defines many parallel threads
▪ Each thread has an identifier specified by get_global_id
▪ Contains keyword extensions to specify parallelism and memory hierarchy

Executed by an OpenCL device
▪ CPU, GPU, FPGA

Code is portable, NOT performance portable
▪ Between FPGAs it is!

Example: a = {0 1 2 3 4 5 6 7}, b = {7 6 5 4 3 2 1 0} → result = {7 7 7 7 7 7 7 7}

Software Engineer's View of an OpenCL System

A dataflow processor with:
▪ Local memory (shallow, fast, random access)
▪ Compute engines
▪ Global memory (deep, fast, bursting)
▪ A host connection

The device contains compute engines that run the kernel
The host talks to global memory through OpenCL routines
Global memory is large, fast, and likes to burst
Local memory is small, fast, and supports random access
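A hedged OpenCL C sketch of that division of labor (the kernel and the fixed work-group size of 64 are hypothetical):

// Hypothetical kernel: burst-friendly global reads are staged into
// local memory, which then serves random accesses.
__kernel void reverse_tiles(__global const float *in, __global float *out) {
  __local float tile[64];                // small, fast, random access
  int lid = get_local_id(0);
  int gid = get_global_id(0);

  tile[lid] = in[gid];                   // contiguous global read (bursts well)
  barrier(CLK_LOCAL_MEM_FENCE);          // whole work-group has loaded

  out[gid] = tile[63 - lid];             // random-access local read
}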

FPGA OpenCL Architecture

(Diagram: an Intel® Xeon® host connects over PCIe; external memory controllers & PHYs connect the FPGA to DDR*; a global memory interconnect feeds several kernel pipelines, each surrounded by block RAMs on a local memory interconnect)

Modest external memory bandwidth
Extremely high internal memory bandwidth
Highly customizable compute cores

Start with a Reference Platform (1/2)

Network Enabled: requires low latency | High Performance Computing (HPC): requires compute power / memory bandwidth

(Board diagrams: Stratix V FPGAs with the OpenCL API / HAL / UMD / KMD host stack, DMA over PCIe, DDR3 banks, 10G UDP, CPLD, flash, and bridge)

Global memory: DDR and QDRII+ (network) vs. a large amount of DDR (HPC)
IO channels: 2x10GbE (MAC/UOE) (network) vs. none, to minimize IP overhead (HPC)

Start with a Reference Platform (2/2)

Host and accelerator in the same package: SoC

(Diagram: an SoC FPGA where the processor runs the host code and OpenCL kernels run in the fabric, with scratch DDR, FPGA global memory DDR, a DVI camera in, and a DVO monitor out)

Boards span price points from $99 and $175 through $250 to >$1000

Development Flow Using the SDK

Modify kernel.cl, then iterate:
▪ x86 Emulator (seconds): functional bugs?
▪ Optimization Report (seconds): memory dependencies?
▪ Profiler (hours): hardware performance not met?
→ DONE!

Compiling a Kernel

Run the Altera Offline Compiler at a command prompt:
▪ aoc <kernel file> --board <board name>
▪ Run aoc --list-boards to see all available boards

AOC performs system integration to generate the kernel hardware system and runs the Quartus Prime software to compile the design

/mydesigns/matrixMult$ aoc matrixMul.cl
aoc: Selected target board bittware_s5pciehq

; Estimated Resource Usage Summary     ;
; Resource                  ; Usage    ;
; Logic utilization         ; 52%      ;
; Dedicated logic registers ; 23%      ;
; Memory blocks             ; 31%      ;
; DSP blocks                ; 54%      ;

Executing the Kernel: clCreateProgramWithBinary

Reading the .aocx file into memory:

fp = fopen("file.aocx", "rb");
fseek(fp, 0, SEEK_END);
lengths[0] = ftell(fp);
binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]);
rewind(fp);
fread(binaries[0], lengths[0], 1, fp);
fclose(fp);

Host API sequence (the .aocx, produced offline by the compiler from the .cl file, is the OpenCL "program", i.e. the bitstream):
clGetPlatforms → cl_platform
clGetDevices → cl_device
clCreateContext → cl_context
clCreateCommandQueue → cl_command_queue
clCreateProgramWithBinary → cl_program
clBuildProgram
clCreateKernel → cl_kernel
clEnqueueNDRangeKernel
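As a hedged sketch of the create/build/kernel calls themselves (standard OpenCL API, reusing the names from the snippet above; error handling omitted, and "sum" is a hypothetical kernel name):

cl_int status, bin_status;
cl_program program = clCreateProgramWithBinary(
    context, 1, &device,                        // one target device
    lengths, (const unsigned char **)binaries,  // the .aocx image and its size
    &bin_status, &status);
clBuildProgram(program, 1, &device, "", NULL, NULL);  // finalizes the pre-built binary
cl_kernel kernel = clCreateKernel(program, "sum", &status);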

Development Flow Using the SDK

(The same flow: modify kernel.cl, then check functional bugs in the x86 emulator, memory dependencies in the optimization report, and hardware performance with the profiler)

Emulator: The Flow

1. Generate the emulation aocx:

// conv.cl
kernel void convolution(global int *filter_coef,
                        global int *input_image,
                        global int *output_image) {
  int grid = get_group_id(0);
  …
}

aoc -march=emulator conv.cl → conv.aocx

2. Run the host program with the emulator aocx
▪ The host compile does not change
▪ set CL_CONTEXT_EMULATOR_DEVICE_ALTERA=<number of devices>

c:\opencl>aoc -march=emulator conv.cl
c:\opencl>dir
host.exe conv.cl conv.aocx
c:\opencl>host.exe
running… done!

Printf

Can use printf within a kernel on the FPGA
▪ Adds some memory traffic overhead

In the emulator, printf runs on IA
▪ Useful for fast debug iterations
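A hedged example of the kind of kernel-side tracing this enables (the kernel is hypothetical):

__kernel void debug_scale(__global const int *in, __global int *out) {
  int gid = get_global_id(0);
  printf("work-item %d saw %d\n", gid, in[gid]);  // emulator: runs on IA; FPGA: adds memory traffic
  out[gid] = 2 * in[gid];
}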

Development Flow Using the SDK

(The same flow, now focusing on the optimization report for memory dependencies)

Optimization Report

The Intel FPGA SDK for OpenCL provides a static report to identify performance bottlenecks when writing single-threaded kernels

Use -c to stop after generating the reports
▪ aoc -c <kernel file>
▪ The report is in: <kernel file folder>/reports/report.html

Optimization Report Example

Development Flow Using the SDK

(The same flow, now focusing on the profiler when hardware performance targets are not met)

Profiler: The Flow

1. Generate the program bitstream with profiling enabled:

// conv.cl
kernel void convolution(global int *filter_coef,
                        global int *input_image,
                        global int *output_image) {
  int grid = get_group_id(0);
}

aoc --profile conv.cl → conv.aocx

2. Run the host program with the instrumented aocx:

c:\opencl>dir
host.exe conv.aocx
c:\opencl>host.exe
running… done!
c:\opencl>dir
host.exe conv.aocx profile.mon

3. Run the profiler GUI: aocl report

Dynamic Profiler

The Intel FPGA SDK for OpenCL enables users to get runtime information about their kernel performance:
bottlenecks, bandwidth, saturation, pipeline occupancy

(Profiler GUI: performance stats and execution times)

Execution of Threads on FPGA – Naïve Approach

Threads can be executed on replicated pipelines in the FPGA
– Each thread (t0, t1, t2, …) runs on its own copy of the pipeline
– Throughput = 1 thread per cycle
– Area inefficient

kernel void add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

Execution of Threads on FPGA

A better method takes advantage of pipeline parallelism
– Attempt to create a deeply pipelined implementation of the kernel
– On each clock cycle, we attempt to send in a new thread
– Threads t0, t1, t2, … advance one stage per cycle, keeping the pipeline full
– Throughput = 1 thread per cycle

kernel void add( global int* Mem ) {
  ...
  Mem[100] += 42*Mem[101];
}

OpenCL on Intel FPGAs

The main assumption made in the preceding OpenCL programming model:
– Data-level parallelism exists in the kernel program

Not all applications are well suited to this assumption
– Some applications do not map well to data-parallel paradigms, yet those are the only workloads GPUs support

Data-Parallel Execution

On the FPGA, we use the idea of pipeline parallelism to achieve acceleration

kernel void sum(global const float *a,
                global const float *b,
                global float *c) {
  int xid = get_global_id(0);
  c[xid] = a[xid] + b[xid];
}

(Pipeline: Load, Load → + → Store, with threads t0, t1, t2 in flight)

Threads can execute in an embarrassingly parallel manner

Data-Parallel Execution: Drawbacks

Difficult to express programs which have partial dependencies during execution

kernel void sum(global const float *a,
                global const float *b,
                global float *c) {
  int xid = get_global_id(0);
  c[xid] = c[xid-1] + b[xid];
}

This would require complicated hardware and new language semantics to describe the desired behavior

Solution: Tasks and Loop Pipelining

Allow users to express programs as a single thread

for (int i=1; i < n; i++) { c[i] = c[i-1] + b[i]; }

Pipeline parallelism is still leveraged to efficiently execute loops in Intel's FPGA OpenCL
▪ Parallel execution is inferred by the compiler
▪ Loop pipelining: iterations i=0, i=1, i=2 overlap in the Load → + → Store pipeline

Loop-Carried Dependencies

Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop

kernel void state_machine(ulong n) {
  t_state_vector state = initial_state();
  for (ulong i = 0; i < n; i++) {
    state = next_state(state);
    …
  }
}

The variable state in iteration 1 depends on the value from iteration 0. Similarly, iteration 2 depends on the value from iteration 1, etc.

Loop-Carried Dependencies

To achieve acceleration, we can pipeline each iteration of a loop containing loop-carried dependencies
– Analyze any dependencies between iterations
– Schedule these operations
– Launch the next iteration as soon as possible

kernel void state_machine(ulong n) {
  t_state_vector state = initial_state();
  for (ulong i = 0; i < n; i++) {
    state = next_state(state);
    …
  }
}

Loop Pipelining Example

No loop pipelining: iterations i=0, i=1, i=2 run back-to-back, with no overlap of iterations
With loop pipelining: iterations i=0 … i=5 overlap in time; it looks almost like multi-threaded execution!

Finishes faster because the iterations are overlapped

Parallel Threads vs. Loop Pipelining

So what's the difference?

Parallel threads launch 1 thread per clock cycle in pipelined fashion, but loop dependencies may not be resolved in 1 clock cycle

Loop pipelining enables pipeline parallelism AND the communication of state information between iterations
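A hedged single work-item sketch of that state communication (the kernel is hypothetical; the shift-register idiom is a common Intel FPGA OpenCL pattern): a running 4-sample windowed sum is carried from one iteration to the next while the loop still pipelines:

// Hypothetical task kernel: per-iteration state (window + running sum)
// is carried from one loop iteration to the next.
kernel void window_sum(global const int *restrict in,
                       global int *restrict out, int n) {
  int window[4] = {0, 0, 0, 0};  // maps to registers (a shift register)
  int sum = 0;                   // state carried between iterations
  for (int i = 0; i < n; i++) {
    sum -= window[3];            // retire the oldest sample
    #pragma unroll
    for (int j = 3; j > 0; j--)
      window[j] = window[j - 1]; // shift
    window[0] = in[i];
    sum += window[0];
    out[i] = sum;
  }
}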

Image Filter

(Diagram: pixels stream from memory through a 3x3 filter F[3][3] and back to memory)
Harnessing Dataflow to Reduce Memory Bandwidth

Data Movement in GPUs

Data is moved from the host over PCI Express; instructions and data are constantly sent back and forth between host cache/memory and GPU memory
▪ Requires buffering larger data sets before passing them to the GPU to be processed
▪ Significant latency penalty
▪ Requires high memory and host bandwidth
▪ Requires sequential execution of kernels

(Pipeline through global memory: JPG → Uncompress → RGB → Image Filter → RGB* → Compress → JPG*)
Altera Channels Extension

An FPGA has programmable routing. Can't we just send data across wires between kernels?

Advantages:
– Reduce memory bandwidth
– Lower latency through fine-grained synchronization between kernels
– Reduce complexity (wires are trivial compared to memory access): lower cost, lower area, higher performance
– Enable modular dataflow design through small kernels exchanging data
– Different workgroup sizes and degrees of parallelism in connected modules

(With channels, Uncompress → Image Filter → Compress pass data directly instead of bouncing through global memory)

Data Movement in FPGAs

The FPGA allows result reuse between instructions
Ingress/egress to custom functions is 100% flexible
Multiple memory banks of various types hang directly off the FPGA
– Algorithms can be architected to minimize buffering to external memory or host memory
– Multiple optional memory banks can be used to allow simultaneous access

(Diagram: flexible I/O such as 100G, PCIe, SRIO, or USB feeds Kernel 1 → Kernel 2 → Kernel 3 inside the FPGA, with optional memories on the side)

Example: Multi-Stage Pipeline

An algorithm may be divided into multiple kernels:
– Modular design patterns
– Partition the algorithm into kernels with different sizes and dimensions
– The algorithm may naturally split into both single-threaded and NDRange kernels

Generating random data for a Monte Carlo simulation:

// Single-threaded
kernel void rng(int seed) {
  int r = seed;
  while (true) {
    r = rand(r);
    write_channel_altera(RAND, r);
  }
}

// NDRange
kernel void sim(...) {
  int gid = get_global_id(0);
  int rnd = read_channel_altera(RAND);
  out[gid] = do_sim(data, rnd);
}

Traditional Data Movement Without Channels

(Diagram: host system memory → PCIe → DMA → memory controller/DDR → kernel → kernel → back out through DDR and DMA)

Data Movement Using Channels

(Diagram: kernels exchange intermediate data directly through an on-chip FIFO, and Data In / Data Out I/O channels stream through FIFOs into the kernels; DDR and the PCIe/DMA path remain available)

Data Movement Using Host Channels

(Diagram: data streams between host memory and the kernel directly over PCIe/DMA, without staging in DDR)

An Even Closer Look: FPGA Custom Architectures

Kernel replication with num_compute_units using OpenCL
▪ Step #1: design an efficient kernel
▪ Step #2: how can we scale it up?

kernel void PE() {
  …
}

(A processing element (PE): a task-based kernel)

Kernel Replication With Intel® FPGA SDK for OpenCL

An attribute specifies a 1-dim or 2-dim array of kernels, and an added API identifies each kernel's place in the array:

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  row = get_compute_id(0);
  col = get_compute_id(1);
  …
}

(A 4x4 grid of processing elements, i.e. task-based kernels)

Compile-time constants allow the compiler to specialize each PE

Kernel Replication With Intel® FPGA SDK for OpenCL

Topology can be expressed with software constructs
▪ Channel connections are specified through compute IDs

channel float4 ch_PE_row[4][4];
channel float4 ch_PE_col[4][4];
channel float4 ch_PE_row_side[4];
channel float4 ch_PE_col_side[4];

__attribute__((num_compute_units(4,4)))
kernel void PE() {
  row = get_compute_id(0);
  col = get_compute_id(1);
  float4 a, b;

  if (row == 0)
    a = read_channel(ch_PE_col_side[col]);
  else
    a = read_channel(ch_PE_col[row-1][col]);
  if (col == 0)
    …
}

Matrix Multiply in OpenCL

Every PE / feeder is a kernel
Communication via OpenCL channels
A data-flow network model

Software control:
– Compute unit granularity
– Spatial locality
– Interconnect topology
– Data movement
– Caching
– Banking

(Diagram: feeder kernels load A and B from DDR4 into the PE array; a drain interconnect collects C)

Performance: ~1 TFLOPs

Traditional CNN

$$I_{\text{new}}(x, y) = \sum_{x'=-1}^{1}\ \sum_{y'=-1}^{1} I_{\text{old}}(x + x',\, y + y') \times F(x', y')$$

(A filter slides over the input feature map, a set of 2D images forming a 3D space, to produce the output feature map)

Repeat for multiple filters to create multiple "layers" of the output feature map
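Read directly, the formula is just a pair of nested sums. A hedged C sketch for one output pixel (the function, the HEIGHT/WIDTH bounds, and the border handling are hypothetical):

#define HEIGHT 512   // hypothetical image bounds
#define WIDTH  512

// One output pixel of the 3x3 convolution above; assumes
// 1 <= x < HEIGHT-1 and 1 <= y < WIDTH-1 so the window stays in bounds.
float conv3x3(const float I_old[HEIGHT][WIDTH],
              const float F[3][3], int x, int y) {
  float acc = 0.0f;
  for (int dx = -1; dx <= 1; dx++)
    for (int dy = -1; dy <= 1; dy++)
      acc += I_old[x + dx][y + dy] * F[dx + 1][dy + 1];
  return acc;
}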

CNN on FPGA

Want to minimize accesses to external memory
Want to keep resulting data between layers on the device and between computations
Want to leverage reuse of the hardware between computations

Parallelism in the depth of the kernel window and across output features. Defer complex spatial math to random access memory.

Re-use hardware to compute multiple layers.

Efficient Parallel Execution of Convolutions

▪ Parallel convolutions
– Different filters of the same convolution layer are processed in parallel in different processing elements (PEs)
▪ Vectored operations
– Across the depth of the feature map
▪ The PE array geometry can be customized to the hyperparameters of a given topology

(Diagram: external DDR feeds double-buffered filters in on-chip RAM; the PE array exploits filter parallelism across the output depth)

Design Exploration with Reduced Precision

Tradeoff between performance and accuracy
▪ Reduced precision allows more processing to be done in parallel
▪ Using a smaller floating-point format does not require retraining the network
▪ FP11 benefit over INT8/9: no need to retrain, better performance, less accuracy loss

FP16: sign, 5-bit exponent, 10-bit mantissa
FP11: sign, 5-bit exponent, 5-bit mantissa
FP10: sign, 5-bit exponent, 4-bit mantissa
FP9: sign, 5-bit exponent, 3-bit mantissa
FP8: sign, 5-bit exponent, 2-bit mantissa

Lab 3

DSP Builder Advanced Blockset

The MathWorks* Design Environment

MATLAB*: algorithm development and analysis
– High-level technical computing language
– Simple C-like language
– Efficient with vectors and matrices
– Built-in mathematical functions
– Interactive environment for algorithm development
– 2D/3D graphing tools for data visualization

Simulink*: model-based design
– Hierarchical block diagram design & simulation tool
– Digital, analog/mixed-signal & event-driven modeling
– Visualize signals
– Integrated with MATLAB*

(A validated design flows from Simulink* to third-party EDA tools and DSP/embedded software tools: hardware on one side, DSP and control software on the other)

DSP Builder for Intel® FPGAs

Enables MathWorks* Simulink* for Intel FPGA design – a device-optimized Simulink* DSP blockset
▪ Key features:
– High-level design exploration
– HW-in-the-loop verification
– IP generation for the Intel® Quartus® Prime software / Platform Designer

FPGA Design Flow – Traditional

Development → Implementation → Verification: system-level design and DSP IP (system engineer), HDL coding (hardware engineer), RTL simulation, system-level simulation, and hardware verification (verification engineer).

Tools: MATLAB*/Simulink* and development kits for system design; Precision*, Synplify*, and the Intel® Quartus® Prime software for implementation; ModelSim* for verification.

FPGA Design Flow – DSP Builder for Intel® FPGAs

The same Development → Implementation → Verification flow, but DSP Builder for Intel® FPGAs spans algorithm-level modeling, synthesis, RTL simulation, and system-level verification from a single Simulink* representation, alongside the Precision*, Synplify*, Intel® Quartus® Prime, and ModelSim* tools.

Core Technologies

▪ IP (ready-made) library
– Multi-rate, multi-channel filters
– Waveform synthesis (NCO/DDS/mixers)
▪ Custom IP creation using the primitive library
– Vectorization
– Zero latency
– Scheduled, aligned RTL generation
▪ Automatic pipelining
▪ Automatic folding and resource sharing
▪ Multichannel designs with automatic vectorization
▪ Avalon® Memory-Mapped and Streaming interfaces
▪ Design exploration across device families
▪ High-performance floating-point designs
▪ System integration
– Platform Designer
– Processor integration
▪ System-in-the-loop accelerated simulation

Advanced Blockset – High-Performance DSP IP

Over 150 device-optimized DSP building blocks for Intel® FPGAs
▪ DSP building blocks
▪ Interfaces
▪ IP library blocks
▪ Primitives library blocks – math and basic blocks
▪ Vector and complex data types

Build Custom FFTs from the FFT Element Library

▪ Quickly build DSP designs using complete FFT IP functions from the FFT library
▪ Build custom radix-2² FFTs using blocks from the FFT Element Library

FFT IP Library: FFT, FFT_float, VFFT, VFFT_float, BitReverseCoreC, VariableBitReverse, Floating-Point Twiddle Gen
FFT Element Library: Pruning and Twiddle, Bit Vector Combine, Butterfly Unit, Choose Bits, Dual Twiddle Memory, Edge Detect, Crossover Switch

Filter and Waveform Synthesis Library

DSP Builder includes a comprehensive waveform IP library
▪ Automatic resource sharing based on sample rate
▪ Support for super-sample-rate architectures

IP implementations:
– FIR: half-band, L-band, symmetric, decimating, fractional-rate, interpolating, single-rate, super sample rate
– CIC: decimating, interpolating, super sample rate
– Mixer: complex, real, super sample rate
– NCO: super sample rate, multi-bank

Library Is Technology Independent

▪ Target device using a Device block ▪ Same model generates optimized RTL for each FPGA and speed grade


Datapath Optimization for Performance

Automatic timing-driven synthesis of the model, based on the specified device and clock frequency:

Optimization | Description
Pipelining | Inserts registers to improve Fmax
Algorithmic retiming | Moves registers to balance pipelining
Bit-growth management | Manages bit growth for fixed-point designs
Multi-rate optimizations | Optimizes hardware based on sample rate

Custom IP Generation

Model primitive features:
▪ Textbook-based design entry (e.g. a FIR tap-delay line drawn with coefficient multipliers b0…b7, z⁻¹ delays, and adders)
▪ Vector support
▪ Parameterizable
▪ Zero-latency blocks
▪ ALU folding
You describe what to do, not when to do it.

ALU Design Folding Improves Area Efficiency

Optimizes hardware usage for low-throughput designs
▪ Arranges one of each resource in a central arithmetic-logic-unit (ALU) fashion, time-division multiplexing (TDM) the operands (e.g. multiplies A, B, and C share one multiplier under FSM control)
▪ Folding factor = clock rate / data rate – e.g. a 300 MHz clock with a 0.5 MSPS data rate gives a folding factor of 600
▪ Performed when the folding factor is > 500

TDM Resource Sharing

At clock rate = sample rate, each function unit F(.) exists in hardware and feeds the adder directly. At clock rate = 2× sample rate, a single F(.) is time-shared, with serialize/deserialize stages and read/write buffers running on TDM_CLK.

TDM Design: Trade-Off Example

Resources for a 49-tap symmetric single-rate FIR filter on Stratix 10:

Clock Rate | Sample Rate | LUT4s | Mults | Memory bits | TDM Factor
72 MHz | 72 MSPS | 898 | 26 | 0 | 1
144 MHz | 72 MSPS | 1082 | 14 | 0 | 2
288 MHz | 72 MSPS | 741 | 8 | 0 | 4
72 MHz | 36 MSPS | 1082 | 14 | 0 | 2

2-Antenna DUC Reference Design

[Block diagram: ChannelFIR → Interpolate2FIR → Interpolate4FIR → ComplexMixer (driven by an NCO) → Deinterleaver, all at a 179.2 MHz clock.]
– ChannelFIR: ChanCount = 4, output sample rate 11.2 MSPS (output period 16); output sequence I1, I2, Q1, Q2, zeros(1, 16-4)
– Interpolate2FIR: ChanCount = 4, output sample rate 22.4 MSPS (output period 8); output sequence I1, I2, Q1, Q2, zeros(1, 8-4)
– Interpolate4FIR: ChanCount = 4, sample rate 89.6 MSPS (period 2); ChanWireCount = ceil(4/2) = 2, ChanCycleCount = ceil(4/2) = 2; output sequence I1, I2, Q1, Q2; demux output i sequence = I1, I2, output q sequence = Q1, Q2
– ComplexMixer: ChanCount = 2 (complex channels), sample rate 89.6 MSPS (period 2); I' = I*cos – Q*sin, Q' = I*sin + Q*cos
– NCO: ChanCount = 2 (complex channel), sample rate 89.6 MSPS (period 2); sine sequence sinA1, sinA2; cosine sequence cosA1, cosA2
– Deinterleaver: antenna 1 sequence I1, –; antenna 2 sequence I2, –

Reference design included with DSP Builder.

Changing the Design without DSP Builder

▪ Tedious and time-consuming
▪ Channel count = 8, 16, 32
▪ Clock rate = 2×, 4×
[Block diagram as before: FIR → FIR 2 → FIR 4 → complex mixer (NCO) → deinterleaver. Specification: SampleRate = 11.2, ChanCount = 8.]

Changing the Design with DSP Builder

▪ Modifications done in minutes
▪ The design still looks the same
[Same block diagram, with a splitter added. Specification: SampleRate = 11.2, ChanCount = 8.]

Five Design Iterations in Under 1 Hour

 | Arria® 10, 6 ch | Arria 10, 6 ch | Arria 10, 12 ch | Stratix® 10, 6 ch | Stratix 10, 12 ch
Requested clock (MHz) | 250 | 450 | 450 | 450 | 450
Actual Fmax (slow model, 85 °C) | 351 | 458 | 458 | 524 | 484.5
Multiplier count (18×18) | 10 | 6 | 10 | 6 | 10
Logic resources (registers) | 686 | 465 | 818 | 1267 | 1863
Block memory resources (kbits) | 0 | 0 | 0 | 0 | 25.8

Generates Reusable IP for Platform Designer

▪ Platform Designer is the system integration environment for Intel® FPGAs
▪ DSP Builder designs are fully compatible with the Platform Designer IP Catalog
▪ Integrate with other FPGA IP: processors, state machines, streaming interfaces
▪ Design reuse is fully supported across projects

Typical Design Flow

1. Identify the system architecture, design the filters, and choose the desired Fmax and device
2. Set the top-level system parameters in the MATLAB® software using the ‘params’ file – number of channels, performance, etc.
3. Build the system using the Advanced Blockset tool
4. Simulate the design using the Simulink® and ModelSim® tools
5. Target the right FPGA family and compile
6. As the system design specs change, edit the ‘params’ file and repeat

Design Flow – Create Model

Create a new blank model: select New Model Wizard from the DSP Builder menu.


Top-Level Testbench

The top level of a DSPB-AB design is a testbench; it must include Control and Signals blocks.

Design Flow – Synthesizable Model

Enter the design in the subsystem; the Device block marks the top level of the FPGA.

Design Flow – ModelIP Blocks

Filters library:
– Single-rate, multi-rate, and fractional-rate FIR filters
– Decimating and interpolating cascaded integrator-comb (CIC) filters
Waveform synthesis library:
– Real and complex mixers
– Numerically controlled oscillator (NCO)
Note: supports super-sample-rate (data rate > system clock frequency) interpolate-by-2 filters.
Note: the NCO block supports frequency hopping (each channel can hop to a different frequency from a pool of frequencies).

Design Flow – ModelPrim Blocks

ChannelIn and ChannelOut blocks delineate the boundary of a synthesizable primitive subsystem.

Add a SynthesisInfo block to control pipelining and latency and to view resource usage of the subsystem.

Design Flow – Parameterize the Design

A C-structure-like template; it runs when the model is opened or a simulation is started.

Design Flow – Processor Interface

Drop memory and registers into the design.

ModelIP blocks have a built-in memory-mapped interface to control registers and coefficient registers.

Design Flow – Running a Simulink Simulation

Creates files in the location specified by the Control block:
▪ VHDL code
▪ Timing constraints file (.sdc)
▪ DSPB-AB subsystem Quartus® IP file


Design Flow – Documentation Generation

Get accurate resource utilization for all modules right after simulation, without place & route: DSP Builder > Resource Usage, DSP Builder > View Address Map.


Design Verification

The Run ModelSim block loads the design into the ModelSim simulator.

[Screenshot: RTL simulation.]

Design Flow – System Integration

Add the _hw.tcl directory to the Qsys IP search path: Qsys > Tools > Options > IP Search Path.


Add the subsystem from the component pick list.

Using FPGAs Just Got Easier

Gap: creating full-stack accelerated applications on an FPGA is difficult and time-consuming. The acceleration stack raises the abstraction level and ease of use:
– Application layer: the SW application plus the FPGA accelerator workload (loadable), under orchestration / rack management
– Software frameworks and libraries
– Open Programmable Acceleration Engine (OPAE): provides a standard C API to accelerator functions; includes the OS driver
– FPGA Interface Manager: a prebuilt, standardized low-level FPGA interface (standard I/O interfaces) provided for a specific board
– Intel® FPGA Programmable Accelerator Card (PAC)
– Pre-built accelerator solutions from the ecosystem

Acceleration Stack for Intel® Xeon® CPU with FPGAs – a Comprehensive Architecture for Data Center Deployments

Faster time to revenue
▪ Fully validated Intel® board
▪ Standardized frameworks and high-level compilers
▪ Partner-developed workload accelerators
Simplified management
▪ Supported in VMware vSphere* 6.7 Update 1 (demonstrated at VMworld Las Vegas, August 28–30, 2018)
▪ Rack management and orchestration framework integration
Broad ecosystem support
▪ Upstreaming FPGA drivers to the Linux* kernel
▪ Qualified by industry-leading server OEMs
▪ Partnering with IP partners, OSVs, ISVs, SIs, and VARs

Stack layers (top to bottom): user applications and rack-level solutions; industry-standard software frameworks; acceleration libraries; Intel developer tools (Intel Parallel Studio XE, Intel FPGA SDK for OpenCL™, Intel Quartus® Prime); acceleration environment (Intel Acceleration Engine with OPAE technology, FPGA Interface Manager (FIM)); OS & virtualization environment; Intel® hardware.

Acceleration Stack Provides FPGA Orchestration in the Cloud/Data Center

Workload accelerators come from an IP store (Intel-developed, 3rd-party-developed, and end-user-developed IP). Public and private cloud/datacenter users launch workloads; software-defined infrastructure and virtualized orchestration software (FPGA-enabled) place each workload on a resource pool of storage, network, and compute, handling static/dynamic FPGA programming. A secure pairing of a Xeon VM and an FPGA executes the workload.

Server Virtualization for the Acceleration Stack with VMware

Out-of-the-box support from VMware for the Intel Arria 10 PAC and the Acceleration Stack in the upcoming vSphere 6.7 U1. Server virtualization enables customers to deploy FPGA workload acceleration with a lower total cost of ownership: the application runs on an Intel Xeon server with an Arria 10 PAC carrying the accelerator IP, on top of the Acceleration Stack.

Migrating an FPGA-Accelerated Workload with vMotion*

An image-inference workload moves between CPU + FPGA servers:
1. Run the application on bare metal
2. Implement it on the VMware ESXi* hypervisor
3. PVRDMA# connects the application to a remote Intel® FPGA PAC / FPGA device
4. Use vMotion* to move the application from one server to another

Server composition with Lenovo xClarity Pod Manager* and VMware vSphere*; continuous application acceleration during vMotion – an industry first.
# Unoptimized proof-of-concept code; not part of a shipping product. See the supplementary slide for system configuration details.

Components of the Acceleration Stack: Overview

▪ Application – developed by the user (domain expert)
▪ Libraries – user, Intel, and 3rd-party (tuning expert)
▪ Open Programmable Acceleration Engine (OPAE) and drivers – provided by Intel
▪ FPGA Interface Manager (FIM) – signaling and management, provided by Intel
▪ Accelerator Functional Unit (AFU) – user, Intel, or 3rd-party IP that plugs into the AFU slot (tuning expert)
▪ Intel FPGA Programmable Acceleration Card – qualified and validated for volume deployment, provided by OEMs

The Intel® Xeon® CPU connects to the FPGA over PCIe*.

PAC with Intel® Arria® 10 FPGA

• Low-profile (half-length, half-height) PCIe* slot card; 168 mm × 56 mm; maximum component height 14.47 mm; PCIe ×16 mechanical
• 128 MB flash for storage of the FPGA configuration
• Board Management Controller (BMC): server-class monitoring system, accessed via USB or PCIe
• USB 2.0 port for board firmware updates and FIM image recovery
• QSFP+ slot accepts pluggable optical modules
• 2 banks of DDR4-2133 SDRAM, 4 GB each (64-bit data, 8-bit ECC), 8 GB total
• Powered from the PCIe +12 V rail; 70 W total board power, 45 W FPGA power
• PCIe ×8 Gen3 connectivity to the Intel® Xeon® host

PAC with Intel® Stratix® 10 FPGA

• ¾-length, full-height, dual-slot PCIe* slot card
• 128 MB flash for storage of the FPGA configuration and BMC firmware
• Board Management Controller (BMC): server-class monitoring system, accessed via USB or PCIe
• USB 2.0 port for board firmware updates and FIM image recovery
• 2× QSFP+ slots accept pluggable optical modules, up to 100GbE each
• 4 banks of DDR4-2400 SDRAM, 8 GB each (64-bit data, 8-bit ECC), 32 GB total
• Powered from the PCIe +12 V rail; 225 W total board power
• PCIe Gen3 ×16 connectivity to the Intel® Xeon® host

Nearly Transparent Software Application Use Model

Object model: properties, token, and handle objects.

Flow: discover/search for a resource → acquire ownership of the resource → map AFU registers to user space → allocate/define shared memory → start/stop computation on the AFU and wait for the result. To tear down: deallocate shared memory, unmap MMIO, relinquish ownership, and (optionally) reconfigure the AFU.
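A hedged sketch (not from the slides) mapping these steps onto OPAE C API calls that the following slides introduce one by one; error handling is omitted and the buffer length is an assumed example value:

    /* requires <opae/fpga.h> and a pre-built fpga_properties filter `prop` */
    fpga_properties prop;  fpga_token token;  fpga_handle handle;
    uint64_t *mmio_ptr;    void *buf;         uint64_t wsid;
    uint32_t n;            uint64_t len = 4096;   /* assumed buffer size */

    fpgaGetProperties(NULL, &prop);
    fpgaEnumerate(&prop, 1, &token, 1, &n);          /* discover / search   */
    fpgaOpen(token, &handle, 0);                     /* acquire ownership   */
    fpgaMapMMIO(handle, 0, &mmio_ptr);               /* map AFU registers   */
    fpgaPrepareBuffer(handle, len, &buf, &wsid, 0);  /* shared memory       */
    /* ... start computation via MMIO and wait for the result ... */
    fpgaReleaseBuffer(handle, wsid);                 /* deallocate memory   */
    fpgaUnmapMMIO(handle, 0);                        /* unmap MMIO          */
    fpgaClose(handle);                               /* relinquish          */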

Enumeration and Discovery

fpgaEnumerate() matches resources (an FPGA_DEVICE and its FPGA_ACCELERATOR with a given AFU GUID, e.g. 0xabcdef) against an fpga_properties filter and returns fpga_token handles:

    fpga_properties prop;
    fpga_token token;
    fpga_guid myguid; /* 0xabcdef */
    uint32_t n;

    fpgaGetProperties(NULL, &prop);
    fpgaPropertiesSetObjectType(prop, FPGA_ACCELERATOR);
    fpgaPropertiesSetGUID(prop, myguid);
    fpgaEnumerate(&prop, 1, &token, 1, &n);

    fpgaDestroyProperties(&prop);
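One hedged usage note (not on the slide): every OPAE call returns an fpga_result, so production code would check both the result and the match count, e.g.:

    fpga_result res = fpgaEnumerate(&prop, 1, &token, 1, &n);
    if (res != FPGA_OK || n == 0) {
        /* no matching accelerator found – handle the error */
    }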

Acquire and Release an Accelerator Resource

fpgaOpen() turns an fpga_token into an fpga_handle, acquiring ownership of the accelerator; fpgaClose() releases it:

    fpga_token token;
    fpga_handle handle;

    // ... enumeration ...
    fpgaOpen(token, &handle, 0);
    /* ... use the accelerator ... */
    fpgaClose(handle);

Memory-Mapped I/O

fpgaMapMMIO(…, &mmio_ptr) maps the accelerator's control registers into the SW application's (virtual) process address space. The application then accesses the AFU control registers directly through the returned mmio_ptr, or via the libopae-c accessors fpgaReadMMIO() / fpgaWriteMMIO().
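A hedged sketch of the calls named above, using the 64-bit accessors from the OPAE C API; the 0x40 and 0x48 register offsets are made-up examples, not a real AFU's register map:

    /* assumes an open fpga_handle `handle` from the previous slide */
    uint64_t *mmio_ptr = NULL;
    uint64_t status;

    fpgaMapMMIO(handle, 0, &mmio_ptr);        /* map MMIO space 0          */
    fpgaWriteMMIO64(handle, 0, 0x40, 0x1);    /* hypothetical control reg  */
    fpgaReadMMIO64(handle, 0, 0x48, &status); /* hypothetical status reg   */
    fpgaUnmapMMIO(handle, 0);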

Management and Reconfiguration

A SW application (with admin privilege) loads a green bitstream (GBS) file from storage – e.g. xyz.gbs, whose metadata carries an interface_id and an afu_id – and partially reconfigures the AFU slot via fpgaReconfigureSlot(…, buf, len, 0) through libopae-c. After partial configuration the slot's AFU_ID changes (e.g. from 0xabcdef to 0xbe11e5).

Management and Reconfiguration

    fpga_handle handle;   /* handle to device */
    FILE *gbs_file;
    void *gbs_ptr;
    size_t gbs_size;

    /* Read the bitstream file (size determined from the file itself) */
    fseek(gbs_file, 0, SEEK_END);
    gbs_size = ftell(gbs_file);
    rewind(gbs_file);
    gbs_ptr = malloc(gbs_size);
    fread(gbs_ptr, 1, gbs_size, gbs_file);

    /* Program the GBS to the FPGA */
    fpgaReconfigureSlot(handle, 0, gbs_ptr, gbs_size, 0);
    /* ... */

Where to Get AFUs for the FPGA

Self-developed
▪ Higher productivity: C/C++ programming languages via the Intel® HLS Compiler or the Intel® FPGA SDK for OpenCL™
▪ Performance-optimized: VHDL or Verilog

Externally sourced
▪ Intel® reference designs
▪ Contracted engagement
▪ Ecosystem partner

All paths produce an Accelerator Functional Unit (AFU).

Growing the Xeon+FPGA Ecosystem

▪ IP and solutions – a portfolio of accelerator solutions developed by Intel and third-party technologists to expedite application development and deployment
▪ Developer community – enabling software developers access via Intel Builder programs, the AI Academy, the Intel Developer Zone (IDZ), and Rocketboards.org; committed to an open-source vision
▪ Universities – reaching over 200,000 students per year with FPGA publications, workshops, and hands-on research labs
▪ ISV partners – expanding the reach for system vendors with platforms and ready-to-use application workloads

Growing List of Accelerator Solution Partners

Easing development and data center deployment of Intel FPGAs for workload optimization

Partner domains: data analytics, finance, genomics, media transcoding, cyber security, AI.

Intel PAC Top Solutions for Data Center Acceleration

Solution | Benefit
Cassandra | 96% latency reduction
PostgreSQL | ½ TCO
Genomics (GATK) | 2.5× performance
JPEG2Lepton / JPEG2Webp | 3–4× performance
Big data streaming analytics | 5× performance
Financial (Black-Scholes) | 8× performance
Network security/monitoring | 3× performance

Customer application: big data applications running on Spark/Kafka platforms. Current solution: run Spark/SQL on a cluster of CPUs. Challenge: for many applications (FinServ, genomics, intelligence agencies, etc.), Spark performance does not meet customers' SLA requirements, especially for delay-sensitive streaming workloads.

Customer application: risk-management acceleration framework (financial back-testing). Current solution: deploy a cluster of CPUs or GPUs with complex data access. Challenge: traditional risk-management methods are compute-intensive, time-consuming applications – 10+ hours for financial back-testing.

Leverage FPGA Developers and Build Your Own

Two routes, both using ASE and OPAE from Intel:

HDL programming: the AFU RTL is synthesized and placed & routed by the Intel® Quartus Prime Pro software into an AFU image, while the host C code goes through a SW compiler into an executable; the AFU application is first verified in the AFU Simulation Environment (ASE) with the OPAE software, then deployed on CPU + FPGA through OPAE and the FIM.

OpenCL programming: the OpenCL kernels go through the OpenCL compiler (built on the Intel® HLS Compiler) into an AFU image, the host code through a SW compiler into an executable; the application can run against the OpenCL emulator or on CPU + FPGA through the OpenCL runtime, OPAE software, and FIM.

AFU Development Flow Overview

The AFU RTL feeds both the AFU Simulation Environment (simulation, test & validate the AFU) and the Quartus® compilation that generates the AF; the OPAE SW application compilation targets the hardware system (Xeon® + FPGA).

The AF Simulation Environment (ASE) enables seamless portability to real HW:
▪ Allows fast verification of OPAE software together with the AF RTL, without hardware – the SW application loads the ASE library and connects to an RTL simulation
▪ For execution on HW, the application loads the runtime library, and the RTL is compiled by Intel® Quartus into an FPGA bitstream

FPGA Components of the Acceleration Stack

The FPGA Interface Unit (FIU) bridges the PCIe* link (via the Core Cache Interface, CCI) to the Accelerator Functional Unit (AFU) in the partial reconfiguration (PR) region, and provides EMIF interfaces to local DDR4 memory plus a high-speed serial interface (HSSI: 10Gb/40Gb QSFP+, 100Gb**).
* Could be other interfaces in the future (e.g. UPI)  ** Stratix 10 PAC card

AFU Development Flow Using the OPAE SDK

1. Specify the platform configuration – the AFU requests the ccip_std_afu top-level interface class:
   $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/hello_afu.json
2. Design the AFU – RTL files implementing the accelerated function:
   $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/afu.sv
3. Specify the build configuration – list all source files and the platform configuration file:
   $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/hw/rtl/filelist.txt
4. Generate the ASE build environment – in a terminal window, enter these commands:
   cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu
   afu_sim_setup --source hw/rtl/filelist.txt build_sim

5. Verify the AFU with ASE – compile the AFU and platform simulation models, then start the simulation server process:
   cd build_sim
   make
   make sim
   In a second terminal window, compile the host application and start the client process:
   export ASE_WORKDIR=$OPAE_PLATFORM_ROOT/hw/samples/hello_afu/build_sim/work
   cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu/sw
   make clean
   make USE_ASE=1
   ./hello_afu

AFU Simulation Environment (ASE)

A hardware/software co-simulation environment for Intel Xeon + FPGA development.

Uses the simulator Direct Programming Interface (DPI) for HW/SW connectivity:
▪ Not cycle-accurate (used for functional correctness)
▪ Converts SW API calls to CCI transactions

Provides a transactional model for the Core Cache Interface (CCI-P) protocol and a memory model for the FPGA-attached local memory.

Validates compliance with:
▪ The CCI-P protocol specification
▪ The Avalon® Memory-Mapped (Avalon-MM) Interface Specification
▪ The Open Programmable Acceleration Engine

Simulation Complete

[Screenshots: AFU simulator window (server); application SW window (client).]

6. Generate the AF build environment:
   cd $OPAE_PLATFORM_ROOT/hw/samples/hello_afu
   afu_synth_setup --source hw/rtl/filelist.txt build_synth
7. Generate the AF:
   cd build_synth
   $OPAE_PLATFORM_ROOT/bin/run.sh

Using the Quartus GUI

Compiling the AFU uses a command-line-driven PR compilation flow
▪ Builds the PR-region AF as a .gbs file to be loaded onto the OPAE hardware platform
The Quartus GUI can still be used for the following types of work:
▪ Viewing compilation reports
▪ Interactive timing analysis
▪ Adding SignalTap instances and nodes

Lab 3

Getting Started with Acceleration

Deployment flow:
1. Buy a server with a PAC – server OEM (e.g. Dell)
2. Install the server OS – OS vendor website (e.g. CentOS, RHEL)
3. Download & install the deployment package of the Acceleration Stack – Intel website
4. Download & install the workload – vendor website
5. Write the host application

Development flow:
1. Download & install the developer package of the Acceleration Stack – Intel website
2. Download & install HLS or OpenCL
3. Download & install a simulator (optional)
4. Create & simulate the workload

Getting Qualified Hardware Is Step 1

Now: Dell PowerEdge* R640, R740, R740xd, R840, R940xa. Now: PRIMERGY* RX2540 M4. Available soon: HPE ProLiant* DL360, DL380. And more coming.

Programmable Acceleration Cards (PAC)

Intel® Arria® 10 Accelerator Card – broadest deployment at lowest power
▪ 40G, PCIe* Gen3 ×8
▪ ½-length, ½-height, single-slot PCIe card
▪ Lowest power: 66 W TDP

Intel Stratix® 10 Accelerator Card – highest performance and throughput
▪ 2× 100G, PCIe Gen3 ×16
▪ ¾-length, full-height, dual-slot PCIe card
▪ Up to 225 W maximum

Programmable Solutions Group 234 • Acceleration Stack for Intel® Xeon® with FPGAs • FPGA Acceleration Platforms Intel® portal for all things related • Acceleration Solutions & Ecosystem to FPGA acceleration • Knowledge Center for• FPGALRZ as a Service Intel Proprietary• 01.org *

* 01.org is an open source community site

Follow-On Courses
https://www.intel.com/content/www/us/en/programmable/support/training/overview.html

Introduction to Cloud Computing

Introduction to High Performance Computing (HPC)

Introduction to Apache™ Hadoop

Introduction to Apache Spark™

Introduction to Kafka™

Introduction to Intel® FPGAs for Software Developers

Introduction to the Acceleration Stack for Intel® Xeon® CPU with FPGA

Application Development on the Acceleration Stack for Intel® Xeon® CPU with FPGAs

Building RTL Workloads for the Acceleration Stack for Intel® Xeon® CPU with FPGAs

OpenCL™ Development with the Acceleration Stack for Intel® Xeon® CPU with FPGAs

Intel FPGA OpenCL and HLS Trainings

Teaching Resources

University-focused content & curriculum
▪ Semester-long laboratory exercises for hands-on learning, with solutions
▪ Tutorials and online workshops for self-study on key use cases
▪ Free library of IP common for student projects
▪ Example designs and sample projects

Easy-to-use, powerful software tools
▪ Quartus Prime CAD environment
▪ ModelSim
▪ Intel FPGA Monitor Program for assembly & C development
▪ Intel® SDK for OpenCL™ Applications
▪ Intel OpenVINO™ toolkit (visual inference & neural network optimization)

Teaching Resources (cont.)

Hardware designed for education
▪ 4 different FPGA kits with a variety of peripherals to match project needs
▪ Compact designs with robust shielding for longevity
▪ Reduced academic prices (range: $55–$275)
▪ Donations available in some circumstances

Support
▪ Full access to all developer resources
– Documentation
– Design examples
– Support forum
– Virtual or on-demand trainings

DE-Series Development Boards

DE10-Standard – Cyclone V FPGA + SoC – $259
DE1-SoC – Cyclone V FPGA + SoC – $175
DE10-Nano – Cyclone V FPGA + SoC – $99
DE10-Lite – Max 10 FPGA – $55

Visit our website for full specs on these boards; see the full catalog of Intel FPGA boards & kits at www.terasic.com

Programmable Solutions Group 240 Full-Featured Beginner FPGA Dev Kit FPGA+SoC Academic Dev Kit Academic Dev Kit Dev Kit Intel DE10-Lite Intel DE10-Nano Intel DE1-SoC Intel DE10-Standard Academic Price $55 $99 $175 $259 FPGA Max® 10 Cyclone® V Cyclone® V Cyclone® V Logic Elements 50,000 110,000 85,000 110,000 ARM Cortex-A9 Dual-Core 800 MHz 925 MHz 925 MHz System-on-Chip (SoC)  1 GB DDR3 SDRAM (HPS), 64 MB 1 GB DDR3 SDRAM (HPS), Memory 64 MB SDRAM 1 GB DDR3 SDRAM (HPS) SDRAM (FPGA) 64 MB SDRAM (FPGA) PLLs 4 9 9 9 GPIO Count 500 469 469 469 7 Segment Displays 6  6 6 Switches 10 4 10 10 Buttons 2 2 4 4 LEDs 10 8 10 10 Clocks (2x) 50 MHz (3x) 50 MHz (4x) 50 MHz (4x) 50 MHz GPIO Count 40-pin header (2x) 40-pin header (2x) 40-pin header 40-pin header Video Out VGA 12-bit DAC HDMI VGA 24-bit DAC VGA 24-bit DAC ADC Channels  8 8 + programmable voltage range 8 + programmable voltage range Video In   NTSC, PAL, Multi-format NTSC, PAL, Multi-format Line In/Out, Microphone In (24 bit Line In/Out, Microphone In Audio In/Out   Audio CODEC) (24 bit Audio CODEC) Ethernet  Gigabit 10/100/1000 Ethernet (x1) 10/100/1000 Ethernet (x1) USB OTG  1x USB OTG 2x USB 2.0 (Type A) 2x USB 2.0 (Type A) LCD    128x64 backlit Micro SD Card Support  ✓ ✓ ✓ Accelerometer ✓ for✓ LRZ✓ ✓ PS/2 Mouse/Keyboard Port   ✓ ✓ Infrared Intel Proprietary ✓ ✓ HSMC Header    ✓ Arduino Header ✓ ✓  

Undergrad Lab Exercise Suites: Digital Logic

The first digital hardware course in an EE, CompEng, or CS curriculum; traditionally introduced in sophomore year. Offered in VHDL or Verilog.

Lab 1 - Switches, Lights, and Multiplexers
Lab 2 - Numbers and Displays
Lab 3 - Latches, Flip-flops, and Registers
Lab 4 - Counters
Lab 5 - Timers and Real-Time Clock
Lab 6 - Adders, Subtractors, and Multipliers
Lab 7 - Finite State Machines
Lab 8 - Memory Blocks
Lab 9 - A Simple Processor
Lab 10 - An Enhanced Processor
Lab 11 - Implementing Algorithms in Hardware
Lab 12 - Basic Digital Signal Processing

Undergrad Lab Exercise Suites: Comp Organization

Typically the second hardware course in an EE, CompEng, or CS curriculum: an introduction to microprocessors & assembly-language programming. Uses an ARM processor (on the SoC kits) or the NIOS II soft processor, with the Intel FPGA Monitor Program for compiling & debugging assembly & C code.

Lab 1 - Using an ARM Cortex-A9 System or NIOS II System
Lab 2 - Using Logic Instructions with the ARM Processor
Lab 3 - Subroutines and Stacks
Lab 4 - Input/Output in an Embedded System
Lab 5 - Using Interrupts with Assembly Code
Lab 6 - Using C Code with the ARM Processor
Lab 7 - Using Interrupts with C Code
Lab 8 - Introduction to Graphics and Animation

Intel FPGA Monitor Program

A design environment used to compile, assemble, download & debug programs for the ARM* Cortex*-A9 processor in Intel’s Cyclone® V SoC FPGA devices
▪ Compile programs, specified in assembly language or C, and download the resulting machine code into the hardware system
▪ Display the machine code stored in memory
▪ Run the ARM processor, either continuously or by single-stepping instructions
▪ Modify the contents of processor registers
▪ Modify the contents of memory, as well as memory-mapped registers in I/O devices
▪ Set breakpoints that stop the execution of a program at a specified address, or when certain conditions are met

Clean and simple UX. Tutorials at fpgauniversity.intel.com. Download independently or as part of the University Program installer (always free!).

Undergrad Lab Exercise Suites: Embedded Systems

Typically the third hardware course in an EE, CompEng, or CS curriculum; combines hardware and software, with an introduction to embedded Linux.

Lab 1 - Getting Started with Linux
Lab 2 - Developing Linux Programs that Communicate with the FPGA
Lab 3 - Character Device Drivers
Lab 4 - Using Character Device Drivers
Lab 5 - Using ASCII Graphics for Animation
Lab 6 - Introduction to Graphics and Animation
Lab 7 - Using the ADXL345 Accelerometer
Lab 8 - Audio and an Introduction to Multithreaded Applications

Lab Exercise Suites: Machine Learning Basics

Machine Learning on FPGAs – a senior or grad-level course in an EE, CompEng, CS, or data science curriculum. Teaches how to use the Intel® SDK for OpenCL™ Applications with FPGAs. A basic understanding of AI fundamentals is recommended*.

Lab 1 – Introduction to OpenCL
Lab 2 – Image Processing
Lab 3 – Lane Detection for Autonomous Driving
Lab 4 – Linear Classifier for Handwritten Digits
Lab 5 – Neural Networks
Lab 6 – Using the Deep Learning Accelerator Library
Lab 7 – Integrating OpenCL Accelerators into Existing Software

*For foundational AI & machine learning curricula, visit our partner program, the Intel AI Academy

AI Academy Course Outline

Runs in the cloud on the Arria 10 PAC card. Contains slides, lab exercises, and recordings for each class: https://software.intel.com/en-us/ai-academy/students/kits/dl-inference-fpga

Class 1 - Introduction to FPGAs for deep learning inferencing
Class 2 - Building a deep learning computer vision application w/ acceleration using a DL framework
Class 3 - Introduction to the OpenVINO™ toolkit
Class 4 - Introduction to the Deep Learning Accelerator Suite for Intel FPGAs
Class 5 - Introduction to the Acceleration Stack for Intel Xeon CPU with FPGAs

Lab 1 - Deploy an application on an Intel CPU
Lab 2 - Deploy an application on an Intel CPU using the OpenVINO toolkit
Lab 3 - Accelerate the application on an Intel FPGA

In-Person Workshops

Throughout the year our technical outreach team visits universities and industry conferences around the world to conduct hands-on workshops that train professors and students on how to use Intel FPGAs for education and research.

Topics:

Intro to FPGAs and Quartus (4 hrs.)
Simulation & Debug (4 hrs.)
Static Timing Analysis of Digital Circuits (4 hrs.)
High-Speed IO (4 hrs.)
Embedded Linux (4 hrs.)
Embedded Design using Nios II (4 hrs.)
High-Level Synthesis (4 hrs.)
Machine Learning Acceleration (4 hrs.)
Modern Applications of FPGAs (1 hr.)
How to Get Hired in the Tech Industry (1 hr.)

Contact us at [email protected] to inquire about scheduling a workshop.

Find Materials: FPGAUniversity.INTEL.com


Membership: FPGAUniversity.INTEL.com


Contact the University Team

Rebecca Nevin – Outreach Manager, Intel FPGA University Program – [email protected]
Larry Landis – Senior Manager, New User Experience Group – [email protected]

GPU Comparison

How Do GPUs Deal with Fine-Grained Data Sharing?

Some GPU techniques involve implicit SIMT synchronization

FPGA threads aren’t warp-locked, so implicit synchronization doesn’t make sense
▪ FPGAs do exactly what you ask them to do, the way you code it

An Even Closer Look: CUDA Execution Model

[Diagram: the NDRange data is divided into a grid of thread blocks; each block's threads are grouped into warps and dispatched by the CUDA scheduler.]

 | FERMI GF100 (SM) | FERMI GF104 (SM) | KEPLER GK104 (SMX) | KEPLER GK110 (SMX) | MAXWELL GM107 (SMM)
Compute capability | 2.0 | 2.1 | 3.0 | 3.5 | 5.0
Shared memory/SM | 48 KB | 48 KB | 48 KB | 48 KB | 64 KB
32-bit registers/SM | 32768 | 32768 | 64K | 64K | 64K
Max threads/thread block | 1024 | 1024 | 1024 | 1024 | 1024
Max thread blocks/SM | 8 | 8 | 16 | 16 | 32
Max threads/SM | 1536 | 1536 | 2048 | 2048 | 2048
Threads/warp | 32 | 32 | 32 | 32 | 32
Max warps/SM | 48 | 48 | 64 | 64 | 64
Max registers/thread | 63 | 63 | 63 | 255 | 255

FPGA Execution Model

[Diagram: a single block of data flows through a pipeline of custom instructions; multiple blocks of data flow through multiple pipelines of custom instructions, and all of them execute in parallel.]

Divergent Control Flow on GPU

The GPU uses a SIMT pipeline to save area on control logic: in a loop such as for (i = 0; i < …) with a data-dependent branch, both sides of the branch occupy the pipeline and a predicate mask (e.g. mask = (x[i] < …)) selects which lanes commit each path.

Divergent Control Flow: Just Fine for FPGA

The FPGA datapath already has all operations in silicon
▪ Speculatively execute both paths
▪ Overlap the branch-condition computation with Path A and Path B
▪ Absorb the branch and both paths into one block and compress the schedule – no longer any control flow
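A minimal illustration (not from the slides) of what this transformation amounts to in source terms; f, g, and t are placeholder names:

    /* Illustrative only: both branch paths are computed speculatively and
     * the branch collapses into a select (a multiplexer in hardware). */
    static float f(float x) { return x * 2.0f; }   /* path A (example) */
    static float g(float x) { return x + 1.0f; }   /* path B (example) */

    float branch_free(float x, float t) {
        float pathA = f(x);                /* computed every cycle */
        float pathB = g(x);                /* computed every cycle */
        return (x < t) ? pathA : pathB;    /* mux – no control flow remains */
    }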

Memory Hierarchy

1. Register data: registers in the FPGA fabric
2. Private data: registers in the FPGA fabric
3. Local memory: on-chip RAMs
4. Global memory: off-chip external memory

External Memory Dynamic Coalescing

For a CPU/GPU, the cache and memory controller handle coalescing. For an FPGA, we create dynamic coalescing hardware matched to the characteristics of the specific memory it is connected to:
– Re-order memory accesses at runtime to exploit data locality
– DDR is extremely inefficient at random access
– Access with row bursts whenever possible
[Diagram: a load pipeline reorders scattered addresses into bursts before issuing them to DDR.]

On-Chip FPGA Memory

“Local” memory uses on-chip block RAM resources
– Very high bandwidth (~8 TB/s)
– Random access in 2 cycles
– Limited capacity

The memory system is customized to your application – a huge value proposition over fixed-architecture accelerators.

The banking configuration (number of banks, width) and the interconnect are all customized for your kernel, automatically optimized to eliminate or minimize access contention.

Key idea: let the compiler minimize bank contention
– If your code is optimized for another architecture (e.g. array[tid + 1] to avoid bank collisions), undo the fixed-architecture workarounds
– They can prevent the optimal structure from being inferred

FPGA Local Memory

[Diagram: eight M20K blocks form Bank0–Bank7, connected to four load/store units through an arbitration network.]

Split memory into logical banks
▪ An N-bank configuration can handle N requests per clock cycle as long as each request addresses a different bank
▪ Manipulate memory addresses so that parallel threads are likely to access different banks – reducing collisions

Local Memory Attributes

Annotations added to local memory variables to improve throughput or reduce area.
Banking control:
– numbanks
– bankwidth
Port control:
– numreadports / numwriteports
– singlepump / doublepump

numbanks(N) and bankwidth(N) Memory Attributes

What do they do? They specify the banking geometry of your local memory system; a bank is a single independent memory system.

What are they for? They can be used to optimize LSU-to-memory connectivity to boost performance; banking should be set up to maximize “stall-free” accesses.

numbanks(N) and bankwidth(N) Memory Attributes

Not stall-free:

    local int lmem[8][4];

    #pragma unroll
    for (int i = 0; i < 4; i += 2) {
        lmem[i][x] = …;
    }

With the default banking geometry, both LSUs contend for lmem[8][4] through an arbitration network.

numbanks(N) and bankwidth(N) Memory Attributes

Stall-free:

    local int __attribute__((numbanks(8),
                             bankwidth(16)))
        lmem[8][4];

    #pragma unroll
    for (int i = 0; i < 4; i += 2) {
        lmem[i][x & 0x3] = …;
    }

Each of the 8 rows becomes its own 16-byte bank (Bank 0–7), so the two LSUs hit different banks; masking the index tells the compiler there are no out-of-bounds accesses.

numreadports/numwriteports and singlepump/doublepump Memory Attributes

What do they do? numreadports/numwriteports specify the number of read/write ports of the local memory system; singlepump/doublepump specify its pumping (the memory runs at 1× or 2× the kernel clock).

What are they for? They control the number of memory blocks used to implement the local memory system.

numreadports/numwriteports and singlepump/doublepump Memory Attributes

    local int __attribute__((singlepump,
                             numreadports(3),
                             numwriteports(1)))
        lmem[16];

Single-pumped: three M20K blocks are replicated so that reads read_0, read_1, and read_2 each get their own port, all sharing one write.

    local int __attribute__((doublepump,
                             numreadports(3),
                             numwriteports(1)))
        lmem[16];

Double-pumped: the same three read ports and one write port fit in a single M20K running at twice the clock.
