Heterogeneous Multi-Processing for SW-Defined Multi-Tiered Storage Architectures
Endric Schubert (MLE), Ulrich Langenbach (MLE), Michaela Blott (Xilinx Research)
SDC, 2017

Content
Heterogeneous Multi-Processing for Software-Defined Multi-Tiered Storage Architectures
• Who – Xilinx Research and Missing Link Electronics
• Why – Multi-tiered storage needs predictable performance scalability, deterministic low latency, and cost-efficient flexibility/programmability
• What – Tera-OPS processing performance in a single-chip heterogeneous compute solution running Linux
• How – Combine "unconventional" dataflow architectures for acceleration & offloading with Dynamic Partial Reconfiguration and High-Level Synthesis

Xilinx Research and Missing Link Electronics

Xilinx – The All Programmable Company
• Founded 1984
• Worldwide presence: headquarters, sales and support, research and development, manufacturing
• $2.38B FY15 revenue
• 20,000 customers worldwide
• >55% market segment share
• 3,500+ patents
• 3,500+ employees worldwide
• 60 industry firsts

Xilinx Research – Ireland
Applications & architectures, through application-driven technology development with customers, partners, and engineering & marketing

Missing Link Electronics – Xilinx Ecosystem Partner
• Vision: The convergence of software and off-the-shelf programmable logic opens up more economical system realizations with predictable scalability.
• Mission: To de-risk the adoption of heterogeneous compute technology by providing pre-validated IP and expert design services.
• Certified Xilinx Alliance Partner since 2011; Preferred Xilinx PetaLinux Design Service Partner since 2013.

Missing Link Electronics – Products & Services
• TCP/IP & UDP/IP network protocol accelerators at 10/25/50 GigE line rate
• Low-latency Ethernet MAC from German Fraunhofer HHI
• Patented mixed-signal systems solutions with integrated Delta-Sigma converters in FPGA logic
• Key-value-store accelerator for hybrid SSD/HDD memcached and object storage
• SATA storage extension for Xilinx Zynq All Programmable Systems-on-Chip
• A team of FPGA and Linux engineers to support our customers' technology projects in the USA and Europe

Motivation

Technology Forces in Storage
• Software significantly impacts latency and energy efficiency in systems with non-volatile memory
• However, software-defined flexibility is necessary to fully utilize novel storage technologies
• Hyper-capacity, hyper-converged storage systems need more performance, but within cost and energy envelopes
Source: Steven Swanson and Adrian M. Caulfield, UCSD; IEEE Computer, August 2013

The Von Neumann Bottleneck [J. Backus, 1977]
• CPU system performance scalability is limited
• New compute architectures are needed

Spatial vs. Temporal Computing
Sequential processing with a CPU vs. parallel processing with logic gates
Source: Dr. Andre DeHon, UPenn: "Spatial vs. Temporal Computing"
• CPU system performance scalability is limited
• Spatial computing offers a further scaling opportunity
• New compute architectures are needed to take advantage of this

Architectural Choices for Storage Devices
[Figure: efficiency vs. flexibility spectrum ranging from general-purpose processors (StrongARM 110, ~0.4 MIPS/mW) and digital signal processors (TMS320C54x, ~3 MIPS/mW) through application-specific signal processors (ICORE, 20-35 MOPS/mW), field-programmable devices, and application-specific ICs to physically optimized ICs; axes: log flexibility, log performance, log power dissipation. Field-programmable devices reach Tera-OPS at low wattage.]
Source: T. Noll, RWTH Aachen

Use Case: Image/Video Storage

A Flexible All Programmable Storage Node
[Diagram: storage node built around an MPSoC FPGA with a network interface, a reconfigurable processing block, monitoring, and multiple NVMe drives]

Meta Data Extraction, e.g. Image Quality Metrics
[Diagram: a meta data extraction block sits between the network and the NVMe drives, alongside monitoring]
Example result: partially under- and over-exposed, medium contrast

Processing, e.g. Thumbnailing, Auto-Correction
[Diagram: a thumbnailing / auto-correction block sits between the network and the NVMe drives, alongside monitoring]

Semantic Feature Extraction, e.g. Classification
[Diagram: a semantic feature extraction block sits between the network and the NVMe drives, alongside monitoring]
Example result: group of people, Xilinx, outdoor

Semantic Search Support
[Diagram: a semantic feature search block sits between the network and the NVMe drives, alongside monitoring]
Example query result: group of people, Xilinx, outdoor scene

Performance Metrics, e.g. Bandwidth, Latency
[Diagram: the monitoring block observes the semantic feature search path]
• Performance counters
• Pattern matching
• ID generation for tracing

Runtime Programmability
[Diagram: the reconfigurable processing block can be swapped at runtime, e.g. between meta data extraction and semantic feature extraction]

Architectural Concepts

Key Concepts Presented at SDC-2016
• Heterogeneous compute device as a single-chip solution
• Direct network interface with a full accelerator for protocols
• Performance scaling with dataflow architectures
• Scaling capacity and cost with a hybrid storage subsystem
• Software-defined services

SDC-2016: Single-Chip Solution for Storage
[Block diagram of the data node: the processing system with quad-core 64b processors (A53) runs PetaLinux with network and memory management and its own DDRx channels; the FPGA fabric (PL) contains the key-value store (memcached), router, TCP/IP stack, hybrid memory system / memory abstraction with DDRx controller and channels, and the NVMe interface to M.2 NVMe drives; blocks are a mix of MLE IP, Xilinx IP, and software-defined (SD) services]

SDC-2016: Hardware Accelerated Network Stack
[Same block diagram; the TCP/IP full accelerator in the FPGA fabric supports 10/25/50 GigE line rates]

SDC-2016: Dataflow Architectures for Performance Scaling
Streaming architecture: a flow-controlled series of processing stages which manipulate and pass through packets and their associated state
• Now: 10 Gbps demonstrated with a 64b data path @ 156 MHz, using 20% of the FPGA
• Next: 100 Gbps can be achieved by using, for example, a 512b @ 200 MHz pipeline
Source: Blott et al.: Achieving 10Gbps line-rate key-value stores with FPGAs; HotCloud 2013
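The throughput figures on the dataflow slide follow directly from datapath width times clock frequency. A minimal sketch (plain C++, not part of the original deck; it assumes the quoted 156 MHz is the usual 156.25 MHz 10GbE core clock) reproduces the arithmetic:

    // Back-of-the-envelope check of the datapath throughput numbers:
    // throughput = datapath width (bits) * clock frequency (Hz).
    #include <cstdio>

    int main() {
        struct Config { const char* name; double width_bits; double clock_hz; };
        const Config configs[] = {
            {"64b @ 156.25 MHz (now)", 64.0, 156.25e6},
            {"512b @ 200 MHz (next)", 512.0, 200.0e6},
        };
        for (const Config& c : configs) {
            double gbps = c.width_bits * c.clock_hz / 1e9;  // raw datapath rate
            std::printf("%-24s -> %.1f Gbps\n", c.name, gbps);
        }
        return 0;
    }

Running it prints roughly 10.0 Gbps and 102.4 Gbps, matching the "now" and "next" targets quoted above.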
SDC-2016: Scaling Capacity via Hybrids
• SSDs combined with DDRx channels can be used to build high-capacity & high-performance object stores
• Concepts and early prototype scale to a 40 TB & 80 Gbps key-value store (a conceptual lookup sketch follows at the end of this section)
[Pipeline diagram: Parser -> Hash Lookup -> Value Store Access -> Formatter, backed by a hybrid memory system in which the hash table resides in DDRx channels and the value store spans DDRx channels and M.2 NVMe drives]
Source: HotStorage 2015, Scaling out to a Single-Node 80Gbps Memcached Server with 40 Terabytes of Memory

SDC-2016: Handling High Latency Accesses without Sacrificing Throughput
[Diagram: a request buffer keeps issuing SSD read commands while responses return only after the ~100 usec SSD access latency]
• Dataflow architectures: no limit to the number of outstanding requests
• Flash can be serviced at maximum speed

Software-Defined Services
Spatial computing of additional services at no performance cost until resource limitations are reached

Software-Defined Services

Software-Defined Services – Proof-of-Concepts
• Offload engines for the Linux Kernel Crypto-API
• Non-intrusive latency analysis via PCIe TLP "Tracers"
• Inline processing with deep convolutional neural networks
• Declarative Linux kernel support
• Partial Reconfiguration

Software-Defined Services – Example 1) Accelerating the Linux Kernel Crypto-API
• The Crypto-API is a cryptography framework in the Linux kernel used for encryption, decryption, compression, decompression, etc. It needs acceleration to support processing at higher line rates (100 GigE). (A hedged user-space example follows at the end of this section.)
• Open-source software implementation that follows a streaming dataflow processing architecture
  – Hardware interface: AXI Streaming
  – Software/hardware interface: SG-DMA in, SG-DMA out
• High-Level Synthesis-generated accelerator blocks from reference C code

System Architecture of Crypto-API Accelerator

Software-Defined Services – Example 2) Non-Intrusive Latency Analysis via PCIe TLP Tracers
• Performance analysis and ongoing monitoring of bandwidth and latency in distributed systems is difficult:
  – Round-trip times
  – Time-outs
  – Throttling
• When done in software, results get distorted by the additional compute burden.
• When done in programmable logic, it can be (clock-cycle) accurate and non-intrusive by adding so-called "Tracers" into the dataflow.

Tracer-Based Performance Analysis
• Tracers within PCIe Transaction Layer Packets (TLP)
  – Based on addresses/IDs, detected at PCIe switches and endpoints
  – Transparent
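Relating to the "Scaling Capacity via Hybrids" slide above: a conceptual, host-side C++ model of the lookup path, with the hash-table index held in DRAM and large values spilled to NVMe flash. All type and function names here are illustrative assumptions, not the actual MLE/Xilinx implementation, which realizes these stages as a dataflow pipeline in programmable logic.

    // Conceptual model of a hybrid key-value store: the index (hash table)
    // stays in DRAM, while values may live in DRAM or on NVMe flash.
    #include <cstdint>
    #include <cstdio>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    enum class Tier { Dram, NvmeFlash };

    struct ValueLocation {
        Tier tier;              // which tier currently holds the value
        std::uint64_t offset;   // byte offset within that tier
        std::uint32_t length;   // value size in bytes
    };

    class HybridKvStore {
    public:
        void put(const std::string& key, Tier tier,
                 std::uint64_t offset, std::uint32_t length) {
            index_[key] = ValueLocation{tier, offset, length};
        }

        // Pipeline stages on the slide: Parser -> Hash Lookup -> Value Store Access -> Formatter.
        std::optional<std::vector<std::uint8_t>> get(const std::string& key) {
            auto it = index_.find(key);       // Hash Lookup (DRAM-resident index)
            if (it == index_.end()) return std::nullopt;
            return readValue(it->second);     // Value Store Access (DRAM or flash)
        }

    private:
        std::vector<std::uint8_t> readValue(const ValueLocation& loc) {
            // In the FPGA design this is a flow-controlled dataflow stage, so many
            // ~100 us flash reads can be outstanding without stalling new requests.
            std::vector<std::uint8_t> value(loc.length);
            if (loc.tier == Tier::Dram) {
                // ... copy from the DRAM value region ...
            } else {
                // ... issue an NVMe read for [offset, offset + length) ...
            }
            return value;
        }

        std::unordered_map<std::string, ValueLocation> index_;  // hash table in DDRx
    };

    int main() {
        HybridKvStore store;
        store.put("image-42.jpg", Tier::NvmeFlash, /*offset=*/4096, /*length=*/2048);
        auto value = store.get("image-42.jpg");
        std::printf("hit=%d length=%zu\n", value.has_value() ? 1 : 0,
                    value ? value->size() : std::size_t{0});
        return 0;
    }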
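For the Crypto-API offload example above: applications reach the kernel's Crypto-API through the standard AF_ALG socket interface, so an offload engine registered behind that API is transparent to callers. A minimal user-space sketch (Linux-only; sha256 is chosen purely for illustration, and error handling is trimmed):

    // Exercise the Linux kernel Crypto-API from user space via AF_ALG.
    // Whether the kernel services this in software or through an FPGA offload
    // engine is invisible to this code, which is the point of accelerating
    // behind the Crypto-API.
    #include <cstdio>
    #include <cstring>
    #include <linux/if_alg.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
        if (tfm < 0) { std::perror("socket(AF_ALG)"); return 1; }

        sockaddr_alg sa{};
        sa.salg_family = AF_ALG;
        std::strcpy(reinterpret_cast<char*>(sa.salg_type), "hash");
        std::strcpy(reinterpret_cast<char*>(sa.salg_name), "sha256");
        if (bind(tfm, reinterpret_cast<sockaddr*>(&sa), sizeof(sa)) < 0) {
            std::perror("bind"); return 1;
        }

        int op = accept(tfm, nullptr, nullptr);   // per-operation file descriptor
        const char msg[] = "heterogeneous multi-processing";
        send(op, msg, sizeof(msg) - 1, 0);        // feed data to the transform

        unsigned char digest[32];
        ssize_t n = read(op, digest, sizeof(digest));  // fetch the resulting hash
        for (ssize_t i = 0; i < n; ++i) std::printf("%02x", digest[i]);
        std::printf("\n");
        close(op);
        close(tfm);
        return 0;
    }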
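To make the tracer idea concrete: a hedged, host-side model of the matching step, pairing timestamped request and completion TLPs by requester ID and tag to derive per-request latency. In the actual design this runs clock-cycle accurately in programmable logic; the record layout and field names below are assumptions for illustration only.

    // Host-side model of what a PCIe TLP "tracer" does in programmable logic:
    // timestamp non-posted requests, match completions by (requester ID, tag),
    // and report per-request latency.
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct TlpEvent {
        bool is_completion;       // false = non-posted request, true = completion
        std::uint16_t requester;  // PCIe requester ID
        std::uint8_t tag;         // transaction tag
        std::uint64_t timestamp;  // clock-cycle timestamp captured by the tracer
    };

    int main() {
        // A few synthetic events; in the real system these stream out of the tracer.
        std::vector<TlpEvent> events = {
            {false, 0x0100, 1, 1000}, {false, 0x0100, 2, 1010},
            {true,  0x0100, 1, 1450}, {true,  0x0100, 2, 1900},
        };

        std::unordered_map<std::uint32_t, std::uint64_t> inflight;  // key -> issue time
        for (const TlpEvent& e : events) {
            std::uint32_t key = (static_cast<std::uint32_t>(e.requester) << 8) | e.tag;
            if (!e.is_completion) {
                inflight[key] = e.timestamp;               // request observed
            } else if (auto it = inflight.find(key); it != inflight.end()) {
                std::printf("requester %04x tag %u latency: %llu cycles\n",
                            static_cast<unsigned>(e.requester),
                            static_cast<unsigned>(e.tag),
                            static_cast<unsigned long long>(e.timestamp - it->second));
                inflight.erase(it);                        // completion matched
            }
        }
        return 0;
    }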