Memory-Driven Computing
Kimberly Keeton, Distinguished Technologist
Memory-Driven Computing
Kimberly Keeton, Distinguished Technologist
Non-Volatile Memories Workshop (NVMW) 2019, March 2019
©Copyright 2019 Hewlett Packard Enterprise Company

Need answers quickly and on bigger data
– Data growth: the global datasphere is projected to grow from 0.3 zettabytes in 2005 to roughly 175 zettabytes by 2025; data nearly doubles every two years (2013-25)
– The value of analyzed data rises as time to result shrinks
[Charts: historical and projected global datasphere (zettabytes), 2005-2025; value of analyzed data ($) vs. time to result (seconds)]
– Source: IDC Data Age 2025 study, sponsored by Seagate, Nov 2018

What’s driving the data explosion?
– Record: electronic record of an event (e.g., banking); mediated by people; structured data
– Engage: interactive apps for humans (e.g., social media); interactive; unstructured data
– Act: machines making decisions (e.g., smart and self-driving cars); real time, low latency; structured and unstructured data

More data sources and more data
– Record: 40 petabytes – 200B rows of recent transactions in Walmart’s analytic database (2017)
– Engage: 4 petabytes a day posted by Facebook’s 2 billion users (2017); 2MB per active user
– Act: 40,000 petabytes a day (driver assistance systems only) – 4TB daily per self-driving car, with 10M connected cars by 2020
– Per-car sensor rates: front camera 20MB/sec; infrared camera 20MB/sec; front, rear and top-view cameras 40MB/sec; front radar sensors 100kB/sec; rear radar sensors 100kB/sec; front ultrasonic sensors 10kB/sec; side ultrasonic sensors 100kB/sec; rear ultrasonic sensors 100kB/sec; crash sensors 100kB/sec

The New Normal: system balance isn’t keeping up
– Balance ratio (FLOPS per memory access) vs. date of introduction shows growth trends of +24.5%/year (2x every 3.2 years) and +14.2%/year (2x every 5.2 years)
– Processors are becoming increasingly imbalanced with respect to data motion
– Source: J. McCalpin, “Memory Bandwidth and System Balance in HPC Systems,” invited talk at SC16, 2016. http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/

Traditional vs.
Memory-Driven Computing architecture
– Today’s architecture is constrained by the CPU: devices attach over PCI, SATA, DDR and Ethernet, and if you exceed what can be connected to one CPU, you need another CPU
– Memory-Driven Computing: mix and match components at the speed of memory

Outline
– Overview: Memory-Driven Computing
– Memory-Driven Computing enablers
– Initial experiences with Memory-Driven Computing – The Machine
– How Memory-Driven Computing benefits applications
– Fabric-aware data management and programming models
– Memory-Driven Computing challenges for the NVMW community
– Summary

Memory-Driven Computing enablers

Memory + storage hierarchy technologies
– Two new entries in the hierarchy: on-package DRAM and NVM
– SRAM (caches): 1-10ns latency, MBs of capacity
– On-package DRAM: ~50ns, massive bandwidth
– DDR DRAM: 50-100ns, 10-100GBs
– NVM: 200ns-1µs, ~1TB
– SSDs: 1-10µs, 1-10TBs
– Disks: milliseconds, 10-100TBs; tape: seconds

Non-volatile memory (NVM)
– Technologies (spanning ns to µs latencies): NVDIMM-N, Spin-Transfer Torque MRAM, Phase-Change Memory, Resistive RAM (Memristor), 3D Flash
– Persistently stores data
– Access latencies comparable to DRAM
– Byte addressable (load/store) rather than block addressable (read/write)
– Some NVM technologies are more energy efficient and denser than DRAM
– Source: Haris Volos, et al., “Aerie: Flexible File-System Interfaces to Storage-Class Memory,” Proc. EuroSys 2014.

Scalable optical interconnects
– Example: Vertical Cavity Surface Emitting Lasers (VCSELs) with 4-λ Coarse Wavelength Division Multiplexing (CWDM)
– 100Gbps per fiber; 1.2Tbps with 12 fibers
– Order-of-magnitude lower power and cost (target)
– High-radix switches enable low-diameter network topologies, e.g., HyperX
– Source: J. H.
Ahn, et al., “HyperX: topology, routing, and packaging of efficient large-scale networks,” Proc. SC, 2009.

Heterogeneous compute: accelerators
– CPU extensions (ISA-level acceleration): vector and matrix extensions; reduced precision; example: Arm SVE2
– GPUs (data-parallel calculations): optimized for throughput; high-bandwidth memory; examples: Nvidia, AMD
– Deep learning accelerators (ASIC-like flexible performance): data-flow inspired, systolic, spatial; cost optimized; examples: Google’s TPU, FPGAs

Gen-Z: open systems interconnect standard (http://www.genzconsortium.org)
– Open standard for a memory-semantic interconnect
– Memory semantics: all communication as memory operations (load/store, put/get, atomics)
– High performance: tens to hundreds of GB/s of bandwidth; sub-microsecond load-to-use memory latency
– Direct-attach, switched, or fabric topology; dedicated or shared fabric-attached memory
– Scalable from IoT to exascale
– Spec available for public download

Consortium with broad industry support (65 members)
– System OEMs: Cisco, Cray, Dell EMC, H3C, Hitachi, HP, HPE, Huawei, Lenovo, NetApp, Nokia, Yadro
– CPU/accelerator: AMD, Arm, IBM, Qualcomm, Xilinx
– Memory/storage: Everspin, Micron, Samsung, Seagate, SK Hynix, Smart Modular, Spintransfer, Toshiba, WD
– Silicon: Broadcom, IDT, Marvell, Mellanox, Microsemi, Sony Semi
– IP: Avery, Cadence, Intelliprop, Mentor, Mobiveil, PLDA, Synopsys
– Connectors: 3M, Aces, AMP, FIT, Genesis, Jess Link, Lotes, Luxshare, Molex, Samtec, Senko, TE
– Software: Redhat, VMware
– Tech/service providers: Google, Microsoft, Node Haven
– Govt/university: ETRI, Oak Ridge, Simula, UNH, Yonsei U, IIT Madras
– Eco/test: Allion Labs, Keysight, Teledyne LeCroy

Gen-Z enables composability and “right-sized” solutions
– Logical systems composed of physical components, or of subparts or subregions of components (e.g., memory/storage)
– Logical systems match exact workload requirements: no stranded, overprovisioned resources
– Facilitates data-centric computing via shared memory: eliminates data movement

Spectrum of sharing (from exclusive data to shared data)
– Composable systems (exclusive data): FAM allocated at boot time; per-node exclusive access; reallocation of memory permits efficient failover. Uses: scale-out composable infrastructure, SW-defined storage
– Coarse-grained data sharing: a single exclusive writer at a time; the “owner” may change over time. Uses: sharing data by reference, producer/consumer, memory-based communication
– Fine-grained data sharing: concurrent sharing by multiple nodes; requires a mechanism for concurrency control. Uses: fine-grained data sharing, multi-user data structures, memory-based coordination

Initial experiences with Memory-Driven Computing

Fabric-attached memory (FAM) architecture
– Byte-addressable non-volatile memory accessible via memory operations
– High-capacity disaggregated memory pool: fabric-attached memory is accessible by all compute resources; low-diameter networks provide near-uniform low latency
– Local volatile memory provides a lower-latency, high-performance tier
– Software: memory-speed persistence; direct, unmediated access to all fabric-attached memory across the memory fabric; concurrent accesses and data sharing by compute nodes
– Hardware cache coherence is limited to a single compute node
– Separate fault domains for compute nodes and fabric-attached memory

HPE introduces the world’s
largest single-memory computer
– The Machine prototype (May 2017) contains 160 terabytes of fabric-attached, shared memory (https://www.nextplatform.com/2017/01/09/hpe-powers-machine-architecture/)
– 40 SoC compute nodes: ARM-based SoC, 256 GB node-local memory, optimized Linux-based operating system
– High-performance fabric: photonic/optical communication links with electrical-to-optical transceiver modules; protocols are an early version of Gen-Z
– Software stack designed to take advantage of abundant fabric-attached memory

Applications

Memory-Driven Computing benefits applications
– Memory is large: unpartitioned datasets; in-situ analytics; no storage overheads; no explicit data loading
– Memory is shared noncoherently over fabric: easier load balancing and failover; fast checkpointing and verification
– In-memory communication
– In-memory indexes: pre-computed analyses; simultaneously explore multiple alternatives
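The byte-addressable load/store model behind these benefits (NVM accessed with loads and stores rather than block reads/writes) can be sketched with an ordinary memory-mapped file. This is only an analogy, not The Machine's actual stack: the file name is made up, and where this sketch calls mmap.flush(), real NVM would use cache-line flush instructions (e.g., CLWB plus a fence) to reach persistence.

```python
import mmap
import os

# Hypothetical backing region; real NVM would be mapped from device
# memory (e.g., a DAX device), not a regular file.
path = "nvm_region.bin"
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)          # pre-size a 4 KB region

fd = os.open(path, os.O_RDWR)
region = mmap.mmap(fd, 4096)

region[128:133] = b"hello"           # a "store": update 5 bytes in place,
                                     # with no block-sized read-modify-write
region.flush()                       # stand-in for a persistence flush
readback = bytes(region[128:133])    # a "load": read the bytes back

region.close()
os.close(fd)
os.remove(path)
```

The point of the sketch is the granularity: a five-byte update touches five bytes, whereas a block device would force a full-block read-modify-write through a storage stack.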
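The fine-grained-sharing end of the spectrum (concurrent access to one byte-addressable pool by multiple nodes) has a rough single-host analogue in POSIX shared memory, shown here via Python's multiprocessing.shared_memory. The segment name is illustrative, and two attachments in one process stand in for two compute nodes; a real FAM deployment spans machines and needs explicit concurrency control.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical segment name; creation fails if a stale segment exists.
seg = shared_memory.SharedMemory(create=True, size=64, name="fam_demo")
peer = shared_memory.SharedMemory(name="fam_demo")  # second "node" attaches

peer.buf[0:8] = struct.pack("<q", 42)   # writer node stores a 64-bit value
snapshot = struct.unpack("<q", bytes(seg.buf[0:8]))[0]  # reader node loads it

peer.close()
seg.close()
seg.unlink()
```

Both handles address the same bytes directly, which is the sharing-by-reference idea above: no copy of the data moves between the "nodes", only the reference does.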