UNITY: Unified Memory and File Space

Terry Jones, ORNL June 27, 2017

ORNL is managed by UT-Battelle for the US Department of Energy

Terry Jones, ORNL PI; Michael Lang, LANL PI; Ada Gavrilovska, GaTech PI

Talk Outline

• Motivation

• UNITY’s Architecture

• Early Results

• Conclusion

Timeline to a Predicament – APIs & Growing Memory Complexity

• Problem
  – The simple bifurcated memory/storage hierarchy of the 1950s has evolved into a much more complex hierarchy, while interfaces have remained relatively fixed.
  – At the same time, computer architectures have evolved from single nodes to large parallel systems.
• Solution
  – Update the interface to support a prescriptive (directive-based) approach.
  – Manage dynamic placement & movement with a smart distributed runtime system.
• Impact
  – Enable domain scientists to focus on their specialty rather than requiring them to become experts on memory architectures.
  – Enable target-independent programmability & target-independent performance.

Fig 1: An early memory hierarchy. (A timeline, 1950–2020, traces milestones: the IBM RAMAC 305 magnetic disk storage (5MB), MIT core memory, IBM OS/360, the Multics operating system, UNIX (originally UNICS), Intel DRAM, cheap 64K DRAM, Seagate's first magnetic disk for microcomputers (5MB), the start of the POSIX effort, SGI NUMALink, MPI-IO (a non-POSIX API for parallel I/O), CompactFlash for consumer electronics, HDDs reaching 1TB with 3.5” platters, nVidia HBM (stacked memory), and the funding of UNITY.)

Exascale and beyond promises to continue the trend towards complexity

• Processor (registers) – extremely fast, extremely expensive, tiny capacity
• SRAM (level 1/2/3 cache) – faster, expensive, small capacity
• DRAM (main memory; volatile) – reasonably fast, reasonably priced, reasonable capacity
• Byte-addressable NVM (STTRAM, PCRAM, ReRAM; nonvolatile) – reasonably fast, denser than NAND, byte addressable, limited lifetime; product not yet available
• NAND Flash (secondary storage level 1; nonvolatile) – non-byte addressable, page based, faster than magnetic disk, reasonably cheaper than DRAM, limited density, limited lifetime, large capacity
• Magnetic disk (secondary storage level 2) – slow, cheap, very large capacity
• Tape (tertiary storage) – very slow, very cheap, very large capacity

(Figure axes: access speed increases toward the top of the hierarchy; capacity increases toward the bottom.)

The Traditional Dimensions…
(Figure: the memory/storage hierarchy plotted against the two traditional axes, access speed and capacity.)

The Traditional Dimensions are Being Expanded…
(Figure: the same access speed vs. capacity view, being expanded.)

The Traditional Dimensions Are Being Expanded Into Future Directions
(Figure: access speed and capacity are now joined by compatibility with legacy apps, concurrency, resiliency, and energy.)

…But The Exposed Path to Our Memory Hierarchy Remains Bifurcated

User Space: applications and libraries reach data through two separate interfaces, file I/O and mmap/memory access.

• File path: VFS, traditional file system with buffer cache, block device, physical devices (disks, SSDs)
• Memory path: VFS, memory-mapping FS, virtual-to-physical translation, physical devices (NVM, DRAM)
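For concreteness, here is a minimal POSIX sketch of the two exposed paths on an ordinary Linux node (the file name is a placeholder): the same bytes can be reached either by pwrite() through the VFS or by a load/store through mmap(), and nothing above the kernel coordinates placement, movement, or persistence across the two.

    /* Minimal sketch of today's two independent paths to the same storage:
     * the file path (open/pwrite through the VFS) and the memory path
     * (mmap plus load/store).  The kernel coordinates them only loosely. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/demo.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, 4096) != 0) return 1;

        /* Path 1: file I/O -- goes through the VFS and buffer cache. */
        const char msg[] = "written via the file path";
        pwrite(fd, msg, sizeof(msg), 0);

        /* Path 2: memory access -- map the same bytes and store directly. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;
        memcpy(p + 2048, "written via the memory path", 28);
        msync(p, 4096, MS_SYNC);    /* persistence must be requested separately */

        munmap(p, 4096);
        close(fd);
        return 0;
    }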

Implications

• Complexities in managing power and resilience when actions can be taken independently down the two paths (file-based I/O and memory-based I/O)
• Results in applications factoring in multiple layouts for different architectural reasons
• Computers are good at handling the details dynamically
• Burst buffers, new memory layers, concurrency, power, and resilience make data placement difficult for domain scientists.

UNITY Architecture

Global-Distributed Runtime

The runtime combines a fragment name server, a global data placement service with aggregated system statistics (SNOFlake), and a local runtime on each node performing optimized local data placement. Applications interact with the runtime through versioned data fragments held in each node's memory tiers (HBM, DRAM, NVM).

Local node runtime
• Persistent daemon to handle subsequent accesses
• Also performs post-job security cleanup

“Dynamic” components
• Active throughout the life of the application
• Able to adjust strategies
• Incorporates COW optimizations

Local & global optimizers
• Direct data placement
• The global optimizer considers collective & machine-status optimizations

Nameserver for management
• Efficiently describes data mappings
• Keeps track of published objects
• Persistent daemon at a well-known address

Fig 1: The UNITY architecture is designed for an environment that is (a) prescriptive; (b) distributed; (c) dynamic; (d) cooperative.
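As a rough illustration of how these pieces relate, here is a hypothetical C sketch of the bookkeeping a fragment name server could keep; the structures and field names are assumptions for exposition, not UNITY's actual data structures.

    /* Hypothetical sketch (not UNITY's actual structures): a published object
     * maps to versioned fragments, each recorded with its current node and
     * memory-hierarchy tier so the placement engines can relocate it. */
    enum tier { TIER_HBM, TIER_DRAM, TIER_NVM, TIER_SSD, TIER_DISK };

    struct fragment_loc {
        int       node_id;   /* node currently holding this fragment */
        enum tier where;     /* tier the local runtime placed it in  */
        int       version;   /* versioned fragments enable COW       */
    };

    struct published_object {
        char                 name[64];   /* key used by unity_create_object() */
        int                  nfragments;
        struct fragment_loc *fragments;  /* one entry per fragment            */
    };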

UNITY Design Objectives

A unified data environment based on a smart runtime system:

1. frees applications from the complexity of directly placing and moving data within multi-tier storage hierarchies,
2. while still meeting application-prescribed requirements for data access performance, efficient data sharing, and data durability.

(Fig 1: the UNITY architecture diagram, repeated from the previous slide.)

Automated movement with UNITY

Data Placement Domains With UNITY

The figure assigns a UNITY Memory Hierarchy Layer (MHL) number to each memory/storage tier in each domain:

              Compute    IO         Burst     Storage                  Tertiary Storage
              Node(s)    Node(s)    Buffer    (e.g., Lustre or GPFS)   (e.g., HPSS)
  RAM         1          6          11        16                       21
  HBM         2          7          12        17                       22
  NVM         3          8          13        18                       23
  SSD         4          9          14        19                       24
  Disk        5          10         15        20                       25
  Tape        –          –          –         –                        26

LEGEND: UNITY user, UNITY service, existing system SW, not likely in future architectures. The compute domain (e.g., Titan) spans the compute nodes, IO nodes, and burst buffer; the storage domain comprises separate systems.

Note: The MHL number marks each placement option; lower numbers present faster access to the application, while higher numbers present more aggregated capacity to the application.

Providing a Prescriptive API

Scientific Achievement
A new API enables domain scientists to describe how their data is to be used. This permits a smart runtime system to do the tedious work of managing data placement and movement.

Significance & Impact
Vendors are providing multiple APIs to deal with their novel abilities. Through our co-design oriented project, we provide a unifying way to achieve what the domain scientists want in a machine-independent way.

    unity_create_object("a", objDescr);
    workingset = unity_attach_fragment(workFrag, flags);
    xghost = unity_attach_fragment(ghosttop, flags);
    yghost = unity_attach_fragment(ghostleft, flags);
    for (x = 0; x < 1000; x++) {
        if (x > 0) {
            reattach(xghost, x);
            reattach(yghost, x);
        }
        // do work
        unity_publish(workingset);
    }

Fig 1: The UNITY API is designed to be extensible and flexible – here is an example with meshes & ghost cells.

Research Details
Currently we have a functional prototype that provides most of the functionality that will be visible to application developers using the runtime. We have created a global name service that can be used to query datasets and where their data is located. We have created a runtime service that runs on each node and keeps track of the data available locally on that node. The runtime, in conjunction with the global name server, creates a distributed data store that can utilize both the volatile and nonvolatile memory available on the supercomputer. We have designed and implemented hooks in the system so that intelligent data placement services can identify and update the data location and format in order to improve overall performance. We have modified the SNAP proxy application to use UNITY's API. We can checkpoint and restart the application, and can demonstrate the advantages of UNITY by checkpointing an N-rank SNAP job and restarting it as an M-rank job. Datasets/checkpoints can be pulled from multiple hosts over TCP or InfiniBand.
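To make the N-rank-to-M-rank restart described above concrete, here is a hedged C sketch built on the same calls shown in the example (unity_create_object, unity_attach_fragment, unity_publish). The prototypes, argument types, the object name "snap.ckpt", and the helper functions are assumptions for illustration; they are not the exact UNITY API.

    /* Assumed prototypes, for illustration only; the real interface may differ. */
    extern void  unity_create_object(const char *name, void *obj_descr);
    extern void *unity_attach_fragment(void *frag_descr, int flags);
    extern void  unity_publish(void *fragment);

    /* Each rank publishes its slice of the solver state as a named, versioned
     * fragment; a restart with a different rank count attaches whichever
     * published fragments the new decomposition needs, and the runtime pulls
     * them from wherever they live (local tier, remote node, TCP/InfiniBand). */
    void checkpoint_slice(void *obj_descr, void *slice_descr)
    {
        unity_create_object("snap.ckpt", obj_descr);         /* once per job      */
        void *ckpt = unity_attach_fragment(slice_descr, 0);  /* this rank's slice */
        /* ... copy the solver state for this slice into ckpt ... */
        unity_publish(ckpt);                                 /* durable version   */
    }

    void restart_slice(void *slice_descr)
    {
        /* The global name server resolves "snap.ckpt"; attach returns the data
         * regardless of which node or memory tier currently holds it. */
        void *ckpt = unity_attach_fragment(slice_descr, 0);
        /* ... scatter ckpt into the solver's new (M-rank) decomposition ... */
        (void)ckpt;
    }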

First – Do No Harm

Scientific Achievement
Demonstration of the interface and architecture with SNAP.

Significance & Impact
Distributed systems permit needed functionality like persistent name spaces across run invocations and across an entire machine. However, they also require careful design to avoid costly overheads.

Research Details
A “worst-case” scenario for our design is an important test to determine whether the idea is simply not feasible. We were able to validate that the overheads associated with UNITY are not prohibitive even without our “local placement” engine or “global placement” engine optimizations.

Figure 1: Overhead of UNITY SNAP checkpoints (standard checkpoint vs. UNITY; x-axis 1–2048). Reported times in seconds; results are averages of three runs.

Smarter about Energy Consumption

Scientific Achievement
UNITY improves energy consumption by intelligently incorporating an overview of data and thereby removing unnecessary overheads.

Significance & Impact
• In next-generation machines, standard DRAM and stacked memory are expected to consume ~85% of system energy.
• Different policies result in dramatic differences in performance and energy usage, but existing interfaces provide little support for policy choice. We show that simply pre-copying data results in increased energy with mixed results.

Research Details
• Minimized software stack overheads: a shorter I/O path via API design provides improved performance and eliminates NVRAM device overhead due to the file system and a lengthy I/O path.
• UNITY PHX evaluations on real-world HPC applications and emulated NVRAM hardware show up to around 12x speedups in checkpoint times over the naïve NVRAM data-copy method. We believe UNITY PHX's performance gains will be even more significant at larger scales, where C/R costs are even greater.

Figure 1: Energy costs of checkpoint strategies.
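The pre-copy observation above can be illustrated with a hedged C sketch (assumptions: nvm_region is the start of an mmap'ed DAX/NVRAM mapping, and the helper names are hypothetical): staging a checkpoint through a DRAM bounce buffer moves every byte twice, while a direct copy into byte-addressable NVM moves it once, in the spirit of the shorter I/O path described above.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Naive pre-copy: state -> DRAM staging buffer -> NVM (two passes). */
    void ckpt_precopy(char *nvm_region, const char *state, size_t len)
    {
        char *stage = malloc(len);
        if (!stage) return;
        memcpy(stage, state, len);       /* extra DRAM traffic and energy */
        memcpy(nvm_region, stage, len);
        msync(nvm_region, len, MS_SYNC);
        free(stage);
    }

    /* Direct path: state -> NVM (one pass), avoiding the redundant copy. */
    void ckpt_direct(char *nvm_region, const char *state, size_t len)
    {
        memcpy(nvm_region, state, len);
        msync(nvm_region, len, MS_SYNC);
    }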

Smarter About Placement Decisions

Scientific Achievement
UNITY provides a new advanced multi-level caching capability.

“There are only two hard things in Computer Science: cache invalidation and naming things.” ~ Phil Karlton (Xerox PARC & Netscape)

Significance & Impact
Caching automatically triggers costly data movement, but there are many applications that operate just as fast out of slower memory, such as those with ’stream-based’ access patterns able to fully take advantage of built-in last-level caches.

Research Details
Much memory is touched only once or rarely, but with caching, any such access results in data movement from slower to faster memory. This will quickly consume scarce ’fast’ memory resources. Allocation and movement informed by application-level hints is preferable. Our system uses a combination of user directives and recent history to improve data placement.

Figure 1: UNITY automatically migrates and manages complex distributed data hierarchies.
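As a hedged illustration of combining user directives with recent history (a hypothetical policy, not UNITY's actual placement algorithm), a placement decision might look like the C sketch below; the hint names and the reuse threshold are assumptions.

    /* Hypothetical placement policy: streamed-once data stays in slow memory;
     * data the application marked hot, or that history shows is reused often,
     * is promoted to fast memory. */
    enum placement { PLACE_FAST, PLACE_SLOW };

    enum hint { HINT_NONE, HINT_STREAM_ONCE, HINT_HOT };

    struct frag_history {
        unsigned  recent_accesses;  /* touches in the last sampling window */
        enum hint user_hint;        /* directive supplied through the API  */
    };

    enum placement place_fragment(const struct frag_history *h)
    {
        if (h->user_hint == HINT_STREAM_ONCE)
            return PLACE_SLOW;      /* caching it would only burn bandwidth */
        if (h->user_hint == HINT_HOT || h->recent_accesses > 8)
            return PLACE_FAST;      /* reuse justifies the migration cost   */
        return PLACE_SLOW;          /* touched rarely: leave it where it is */
    }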

Smarter About Resiliency and C/R

Scientific Achievement
UNITY automatically supports aggregate-bandwidth checkpoints by concurrently accessing DRAM and NVRAM.

Significance & Impact
Checkpoint/restart has become an important component of HPC resiliency. In the future, more I/O is expected (more cores per node implies more data).

Research Details
Improved bandwidth => improved performance; improved performance => improved power usage.
• NVRAM provides denser persistent memory, but has limited bandwidth. RAID-like structures could possibly address bandwidth, but at the expense of energy.

• DRAM has superior bandwidth compared to NVRAMs (4x-8x).

• UNITY accelerates critical path data movement with bandwidth aggregation.
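A minimal C sketch of the bandwidth-aggregation idea, under assumptions: dram_dst and nvram_dst are pre-mapped destination regions, and the 80/20 split is only a placeholder for a ratio tuned to the two memories' relative bandwidths. The checkpoint is split and the two copies proceed concurrently so the bandwidths add on the critical path.

    #include <pthread.h>
    #include <string.h>

    struct copy_job { void *dst; const void *src; size_t len; };

    static void *do_copy(void *arg)
    {
        struct copy_job *j = arg;
        memcpy(j->dst, j->src, j->len);
        return NULL;
    }

    /* dram_dst and nvram_dst are assumed pre-mapped destination regions. */
    void aggregated_checkpoint(void *dram_dst, void *nvram_dst,
                               const char *state, size_t len)
    {
        size_t cut = (len * 4) / 5;   /* e.g. 80/20 split if DRAM is ~4x faster */
        struct copy_job a = { dram_dst,  state,       cut       };
        struct copy_job b = { nvram_dst, state + cut, len - cut };

        pthread_t t;
        pthread_create(&t, NULL, do_copy, &b);  /* NVRAM copy in parallel   */
        do_copy(&a);                            /* DRAM copy on this thread */
        pthread_join(t, NULL);
    }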

UNITY Summary

• OBJECTIVE: design and evaluate a new distributed storage paradigm that unifies the traditionally distinct application views of memory- and file-based data into a single scalable and resilient environment.
• DRIVERS:
  – Reducing complexity for applications
  – Unprecedented concurrency (across nodes & within nodes)
  – Increasing complexity and tiers for memory & storage
  – Limited power budgets
  – Number of components raises concern over hardware faults
• WHAT DIFFERENCE WILL IT MAKE: effectively balance data consistency, data resilience, and power efficiency for a diverse set of science workloads.
  – First, provide explicit application-level semantics for data sharing and persistence among large collections of concurrent data users.
  – Second, intelligently manage data placement, movement, and durability within multi-tier memory and storage systems.

Acknowledgements

The Unity Team is:

Terry Jones1, Mike Lang2, Ada Gavrilovska3, Latchesar Ionkov2, Douglass Otstott2, Mike Brim1, Geoffroy Vallee1, Benjamin Mayer1, Aaron Welch1, Mason Watson1, Greg Eisenhauer3, Thaleia Doudali3, Pradeep Fernando3

Funding for UNITY provided by DOE/ASCR under the SSIO program, program manager Lucy Nowell.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

1 Oak Ridge National Lab, Mailstop 5164, Oak Ridge, TN 37831
2 Los Alamos National Lab, PO Box 1663, Los Alamos, NM
3 Georgia Tech University, 266 Ferst Drive, Atlanta, GA 30332