PASSI: A Parallel, Reliable and Scalable Storage Software Infrastructure for Active Storage System and I/O Environments (position paper)

Hsing-bung Chen, High Performance Computing Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA, [email protected]
Song Fu, Department of Computer Science and Engineering, University of North Texas, Denton, Texas 76203, USA, song.fu@unt.edu

Abstract---Storage systems are a foundational component of computational, experimental, and observational science today. The ever-growing size of computation and simulation results demands huge storage capacity, which challenges storage scalability and causes data corruption and disk failures to be commonplace in exascale storage environments. Moreover, the increasing complexity of the storage hierarchy and passive storage devices make today's storage systems inefficient, which forces the adoption of new storage technologies. In this position paper, we propose a Parallel, Reliable and Scalable Storage Software Infrastructure (PASSI) to support the design and prototyping of next-generation active storage environments. The goal is to meet the scaling and resilience needs of extreme scale science by ensuring that storage systems are pervasively intelligent, always available, never lose or damage data, and energy-efficient.

Keywords---Ethernet drives; parallel erasure coding; disk failure prediction; management; energy efficiency.

I. INTRODUCTION

Computation and simulation advance knowledge in science, energy, and national security. As these simulations have become more accurate and thus realistic, their demands for computational power have also increased. This, in turn, has caused the simulation results to grow in size, increasing the demand for huge storage systems. High performance I/O becomes increasingly critical, as storing and retrieving a large amount of data can greatly affect the overall performance of the system and applications.

The existing storage systems are mostly passive; that is, disk drives stage and drain memory/burst buffer checkpoints with little intelligence. With the increasing performance and decreasing cost of processors, storage manufacturers are adding more system intelligence at I/O peripherals. Active storage aims to expose computation elements within the storage infrastructure for general-purpose computation on data. Although this concept has been appealing for several years and was incorporated in the T10 Object-based Storage Device (OSD) standard, a limited number of storage products are available for HPC systems. Moreover, their processing capability is provided at the storage enclosure level, while the majority of disk drives are still passive.

Moreover, with exabytes of data to be stored on hundreds of thousands of disk drives, storage scalability becomes a big challenge. More importantly, at such a scale, data corruption and disk failures will be the norm in Exascale storage environments. As data generated by HPC applications, such as checkpoint/restart data sets, get bigger and so do disk drives, the time to rebuild a failed drive is extended, causing tens of hours of data disruption. For the new helium-filled hard drives, due to the larger capacity and the slower increase in performance, the rebuild time of such a disk would be extensive (almost 3 days). In large arrays, a second disk drive can fail before the first is rebuilt, resulting in data loss and significant performance degradation. Dealing with data corruption and disk failures will become the standard mode of operation for storage systems, and it should be performed by machines with little human intervention due to the overwhelming storage scale. Moreover, it should cause little data disruption or performance degradation to systems and applications.
In this position paper, we propose a Parallel, Reliable and Scalable Storage Software Infrastructure (PASSI) to support the design and prototyping of next-generation active storage system and input/output environments. The goal is to meet the scaling and resilience needs of extreme scale science by ensuring that storage systems are intelligent (at the disk drive level), always available, and never lose or damage data. PASSI (in Fig. 1) explores the new-generation Ethernet drives to provide a smart, low-power and scaling storage structure; integrates parallel erasure coding for high performance, disk failure prediction for proactive data rescue, and object storage with balanced data placement; and supports metadata technologies for high performance file systems. These storage services are offloaded to the smart Ethernet drives, which makes PASSI unique, novel and attractive.

The rest of this paper is organized as follows. We present Ethernet connected drives in Section II. In Section III, we present the design of the Parallel Erasure Coding and Parallel Data Placement software systems. In Section IV, we propose an S3 (Self-Monitoring, Self-Managing and Self-Healing) software system in PASSI to address the disk failure challenge and DRAM errors of Ethernet drives. In Section V, we describe how we plan to leverage LANL's MDHIM (a parallel key/value store) and MarFS (a near-POSIX metadata technology for Exascale storage systems) in PASSI. In Section VI, we present a policy based power management approach for archive storage systems. We conclude this paper and outline our current and future development activities in Section VII.

978-1-4673-8590-9/15/$31.00 ©2015 IEEE

[Fig. 1 appears here. Recoverable panel titles: Parallel Erasure Coding (MPI based parallel I/O, SIMD encoding, bandwidth and reliability on demand, erasure code workload offloaded to Ethernet drives); S3 (Self-Monitoring, Self-Adapting and Self-Healing) Storage Resilience Management (scalable monitoring, predictive capabilities, fault and fault domain management); Policy Based Power Management; MDHIM parallel key/value store with multi-dimensional indexing; a near-POSIX file server providing a global namespace over POSIX file systems and near-POSIX object storage; Parallel Data Placement (distributed hashing based placement, HMDA, proactive data rescue); Ethernet connected drives (Seagate Kinetic K/V drive, HGST Open Ethernet drive, Toshiba Key/Value drives) with ARM CPU, DDR3 memory, multi-TB disk space and Ethernet connectivity, running embedded Linux and supporting file system, object store, and K/V store interfaces.]


Fig. 1: PASSI software infrastructure.

TABLE 1: Technical features of three different Ethernet connected products.

Seagate: ARM CPU: 32-bit, 700 MHz, 1 core; Memory: 512 MB; OS/Software: K/V store API or using LLVM; Disk/HDD: 4 TB; SSD: planning; Ethernet: one active and one standby link; Platform: closed.
HGST: ARM CPU: 32-bit, 1 GHz, 1 core; Memory: 2 GB DDR3; OS/Software: embedded Linux OS, runs applications; Disk/HDD: 4 TB / 8 TB; SSD: planning; Ethernet: one active link; Platform: open.
Toshiba: ARM CPU: 64-bit, quad-core; Memory: N/A; OS/Software: Linux OS, K/V store (same as Kinetic drive); Disk/HDD: 4 TB (two 2.5" HDDs) for the performance drive, one 3.5" SMR drive for the capacity drive; SSD: two SSDs (performance drive), N/A (capacity drive); Ethernet: two links; Platform: N/A.

II. BUILD ACTIVE AND SCALING STORAGE SYSTEMS USING SMART ETHERNET-CONNECTED DRIVES

Ethernet Drive technology provides a new paradigm shift in storage architecture. Ethernet Drives are storage devices that can support software-defined storage (SDS). There are three different Ethernet connected disk drive technologies (Table 1). These products possess attractive features: 1) An Ethernet drive has an embedded low-power ARM CPU and memory. 2) It runs a Linux OS on the drive. 3) It can run data processing services closer to the storage media. 4) It has an Ethernet access link and IP based connectivity. 5) It has high capacity disk space and/or SSD storage. Ethernet drives enable us to eliminate high-end storage server requirements, cut down the power consumption of systems, and move application services closer to the storage media.

The new Ethernet drives do not merely use Ethernet as a new connection interface; they move the communication protocols from commands on read-and-write data blocks to a higher level of abstraction. Ethernet drives can offload traditional logical block address (LBA) management and provide new data placement schemes for dispersing object data. This new technology is intended to simplify the storage system software stack, allow drive-to-drive data movement, and reduce significant drive I/O activities from servers. In this paper, we use Ethernet drives to build a scale-out storage system and design PASSI to achieve self-management, self-monitoring, and self-healing on this active storage system.

A. Seagate's Kinetic Drives

Seagate's Kinetic drive [1, 11] employs a new access interface and APIs. It uses a key-value approach to store data at the object level on the drives. The Kinetic drive's access interface is an attempt to bypass many layers of the current OS file stack and provide an efficient software stack to accommodate current and future application demands. Seagate also provides an LLVM access interface [27], allowing us to run non-key/value storage services on Kinetic drives.
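To make the object-level key/value access model concrete, the sketch below stores and retrieves checkpoint chunks by key rather than by logical block address. The KeyValueDrive class, its put/get methods, and the key naming scheme are hypothetical stand-ins for illustration only; they are not Seagate's Kinetic API or LLVM interface.

# Illustrative sketch only: the class below is a hypothetical, in-memory stand-in
# for an object-level key/value drive interface; it is NOT Seagate's Kinetic API.

class KeyValueDrive:
    """Minimal stand-in for an Ethernet drive exposing put/get by key."""
    def __init__(self):
        self._store = {}            # key (bytes) -> value (bytes)

    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: bytes) -> bytes:
        return self._store[key]

# Store a checkpoint object as keyed chunks instead of logical block addresses.
drive = KeyValueDrive()
data = b"checkpoint payload ..." * 100
chunk_size = 1024
for i in range(0, len(data), chunk_size):
    key = f"ckpt/run42/chunk/{i // chunk_size:08d}".encode()
    drive.put(key, data[i:i + chunk_size])

# Retrieval addresses chunks by key; the drive, not the host, manages block layout.
first_chunk = drive.get(b"ckpt/run42/chunk/00000000")
print(len(first_chunk))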

B. HGST's Open Ethernet Drives

HGST's Open Ethernet drive [2] (aka OED) has been designed as a micro-storage server. An OED has one single-core 32-bit 1 GHz ARM processor, 2 GB of DDR3 DRAM memory, one Gigabit Ethernet link, and 4 TB of disk space. It runs an embedded Linux OS and has extra DRAM memory space and a fast processor to run applications. Moreover, HGST's OED architecture enables us to take advantage of Ethernet drives without having to implement the proposed PASSI software following a vendor-specific API and architecture.

Fig. 2: Architecture of smart Ethernet-Drive storage system.

C. Toshiba's Key/Value Drives

Toshiba has released two Ethernet connected drive products [3, 4, 5]. The new performance based Ethernet drive technology combines two 2.5-inch high-capacity SATA disk drives, two high-performance M.2 SSDs, two Gigabit Ethernet ports, a four-core 64-bit ARM processor, and the Linux OS in a 3.5-inch hard drive enclosure. The other, capacity based Ethernet drive uses an SMR disk for capacity-centric purposes. Toshiba's Ethernet Drives adopt the Kinetic Drive's access APIs, which helps us to port our PASSI software between the two products.

D. Ethernet Drives Meet Scale-Out Requirements

Big data applications rely on being able to easily and inexpensively create, manage, administrate, and migrate unstructured data. Legacy file and storage architectures cannot meet these requirements, and thus cannot provide a high performance and cost-effective solution for Exascale computing [10]. The new Ethernet drive technologies provide a new, promising direction to address the extreme scale-out challenge for bulk storage applications. These include applications of primary storage, unstructured data, information governance, analytical data, active archival, and cold storage.

III. DESIGN PARALLEL ERASURE CODING AND BALANCED DATA PLACEMENT FOR PASSI

Erasure-coded storage systems have become increasingly popular for HPC due to their cost-effectiveness in storage space saving and their high fault resilience capability [28]. However, the existing systems use a single machine or a multi-threaded approach to perform erasure coding, which is not scalable enough to process the gigantic checkpoint/restart data sets of Exascale science applications. To address this scaling challenge, we propose a Parallel Erasure Coding (ParEC) technique which can encode and decode small chunks of a big data object in parallel on Ethernet drives, and a Parallel Data Placement (ParDP) technique that distributes erasure segments evenly to all available Ethernet drives in a system.

A. ParEC - Parallel Erasure Coding

Multi-core processor design has been implemented in today's CPU technology. Each multi-core processor has its own embedded SIMD subsystems [16]. It is a challenging task to achieve high parallelism for erasure coding and maximize CPU utilization. We notice that the System-IDLE state dominates the overall processing time and energy cost compared with the I/O and erasure coding processes. When encoding and decoding a very big file or dataset, a single-process approach is not capable of achieving sustained bandwidth for erasure code computation and data movement in an HPC campaign storage system. We need to fully utilize the processing capacity of a storage cluster and provide parallel I/O capability.
To this end, we propose a parallel erasure code technology and apply it to very large files and datasets, such as checkpoint/restart data and simulation results, for high performance scientific simulation. We extend the existing single-process or single-thread erasure coding method by adding task/function parallelism and data parallelism. We propose a hybrid parallel approach, named Parallel Erasure Coding (ParEC). It uses MPI parallel I/O for task parallelism and the processor's SIMD extensions for data parallelism [28, 29]. We apply this hybrid parallelism to perform compute-intensive erasure coding and to provide a high performance, energy-saving, and cost-effective solution for future all-disk based HPC campaign storage systems. We demonstrate that ParEC can significantly reduce power consumption, improve coding bandwidth, and speed up data placement. We evaluate ParEC on Ethernet drives equipped with 32-bit ARM CPUs and conduct performance and feasibility studies. We compare the vectorization technologies of x86/SIMD and ARM/NEON in terms of performance and energy use for erasure code based HPC campaign storage systems.

With the combination of data parallelism based on SIMD computation and task parallelism based on MPI parallel I/O, parallel erasure coding software can improve the performance and scalability of erasure coding techniques, reduce the archival overhead of extreme-scale scientific data, and reduce power and energy consumption. With ParEC, we can efficiently run more scientific simulations on extreme-scale computers and thus enhance system utilization. Moreover, the saved power and energy can be exploited by other components of the system, which further improves the system throughput in a cost-effective manner.
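The following minimal sketch illustrates the stripe-level parallelism that ParEC builds on. It is not the ParEC implementation: a single XOR parity segment stands in for a general K+M erasure code, Python's multiprocessing pool stands in for MPI ranks and per-drive workers, and numpy vectorization stands in for the SSE/AVX/NEON SIMD kernels.

# Minimal ParEC-style sketch. Assumptions (not the paper's implementation):
# - a single XOR parity segment stands in for a general K+M Reed-Solomon code,
# - Python's multiprocessing pool stands in for MPI ranks / per-drive workers,
# - numpy vectorization stands in for the SIMD (SSE/AVX/NEON) kernels.
import numpy as np
from multiprocessing import Pool

K = 4                      # data segments per stripe
SEG_BYTES = 1 << 20        # 1 MiB segments

def encode_stripe(stripe: np.ndarray) -> np.ndarray:
    """Return the parity segment of one stripe (shape: K x SEG_BYTES)."""
    parity = np.zeros(SEG_BYTES, dtype=np.uint8)
    for seg in stripe:                      # XOR-fold the K data segments
        np.bitwise_xor(parity, seg, out=parity)
    return parity

def encode_file(data: bytes, workers: int = 4) -> list:
    """Split a large object into stripes and encode the stripes in parallel."""
    stripe_bytes = K * SEG_BYTES
    padded = data + b"\0" * (-len(data) % stripe_bytes)
    arr = np.frombuffer(padded, dtype=np.uint8).reshape(-1, K, SEG_BYTES)
    with Pool(workers) as pool:             # task parallelism across stripes
        return pool.map(encode_stripe, list(arr))

if __name__ == "__main__":
    parities = encode_file(b"x" * (10 * 1024 * 1024))
    print(f"encoded {len(parities)} stripes")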
B. ParDP - Parallel Data Placement

For a file with K data segments, ParEC produces M redundant code segments. The K+M encoding scheme can tolerate the loss of up to M data segments and/or code segments. These segments are placed on K+M Ethernet drives according to a placement policy. The reconstruction strategy in case of disk failures is also addressed in the placement policy.

We leverage the Ethernet drive technology and peer-to-peer data placement protocols to build a distributed data storage environment for PASSI. We propose a parallel data placement mechanism called ParDP. ParDP is responsible for placing the data and code segments generated by ParEC onto Ethernet drives. More specifically, we explore distributed hashing methods, in which each Ethernet drive is responsible for a data region on a circular space, called a ring, as shown in Fig. 2. To address the limitation of traditional consistent hashing methods for Exascale storage systems, we enhance consistent hashing by assigning multiple points in the ring to an Ethernet drive and using the concept of virtual drives. A virtual drive is like a physical drive, but each physical drive can be responsible for multiple virtual drives, depending on its capacity. This addresses the heterogeneity issue. When an Ethernet drive is predicted to fail (see Section IV), the data stored on that drive are dispersed to other available drives proactively and smoothly, without data and computation disruption, possibly at different geo-locations. Data blocks can be recovered through parallel rebuilds on Ethernet drives using ParEC to remedy data corruptions. ParDP also records metadata, index, and data placement information for MarFS and MDHIM to manage these metadata (see Section V).
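The sketch below illustrates the placement idea behind ParDP: consistent hashing over a ring with multiple virtual drives per physical drive, so a larger drive owns a proportionally larger share of the key space. The drive identifiers, virtual-drive counts, and SHA-1 hash are illustrative assumptions, not the ParDP implementation.

# Illustrative sketch of ParDP-style placement: consistent hashing with virtual
# drives on a ring. Drive names, weights and the hash choice are assumptions
# for illustration, not the ParDP implementation.
import bisect
import hashlib

class HashRing:
    def __init__(self, drives: dict):
        """drives maps a physical drive id to its number of virtual drives
        (more virtual drives = larger capacity share)."""
        self._points = []                      # sorted (hash, drive) pairs
        for drive, vnodes in drives.items():
            for v in range(vnodes):
                self._points.append((self._hash(f"{drive}#{v}"), drive))
        self._points.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def locate(self, segment_key: str) -> str:
        """Return the drive owning the ring region that contains segment_key."""
        h = self._hash(segment_key)
        i = bisect.bisect(self._points, (h,)) % len(self._points)
        return self._points[i][1]

# An 8 TB drive gets twice the virtual drives of a 4 TB drive (heterogeneity).
ring = HashRing({"edrive-01": 64, "edrive-02": 64, "edrive-03": 128})
for k in range(6):
    print(f"ckpt/run42/seg/{k}", "->", ring.locate(f"ckpt/run42/seg/{k}"))

When a drive is predicted to fail, removing its virtual points from the ring redistributes only the segments it owned, which matches the proactive data rescue behavior described above.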
IV. S3 (SELF-MONITORING, SELF-MANAGING AND SELF-HEALING) STORAGE RESILIENCE MANAGEMENT

The design of the S3 storage resilience management software must satisfy a number of requirements: 1) It should support scalable monitoring and collection of performance, fault, and usage data, so that online analytics can provide key data to storage managers with low overhead. 2) It should provide predictive capabilities for critical events that affect storage reliability; the collected parameters should be correlated with Ethernet drive characteristics to gain a correct assessment of the drive state. 3) It should efficiently manage and report faults and fault domains, so that proactive data rescue mechanisms can be employed promptly to heal the faults. 4) It should improve the resilience of storage software systems; the vulnerability of storage software to soft errors should be characterized and mitigated.

We propose a set of novel techniques for storage resilience assurance that integrate disk health monitoring, disk degradation detection, disk failure prediction, proactive data rescue, and memory error detection and correction with distributed controls over Ethernet drives. Machine learning technologies are explored extensively to automate these management tasks. The novelty of S3 lies in that it is the first of its kind that can automatically discover different types of disk failures and accurately forecast the future occurrences of failures for each type. This is based on a novel concept that we propose, called disk degradation signatures [8], which characterize and quantify the deterioration process of drives from having disk errors to becoming failed. We explore the embedded processors and memory on Ethernet drives to perform in situ analysis of the real-time disk health data, and the on-drive Ethernet ports to exchange analysis results for higher-level aggregation and analysis. We also investigate the correlation between disk workloads and disk degradation, which guides us to develop reliability-aware workload scheduling strategies [17, 32].

The embedded DRAM on Ethernet drives may experience errors and faults at runtime [24]. We propose software based memory integrity checking and correction. We check data integrity on each memory access by using error-checking code, and we use internal data and metadata checksums to detect silent data corruptions. In addition, the PASSI software itself is vulnerable to soft errors. To ensure the resilience of the PASSI software system, we leverage fault injection tools, such as our F-SEFI [30, 31], to evaluate the vulnerability of PASSI to silent data corruptions and design effective mechanisms to mitigate these vulnerabilities.

We propose two sub-systems of S3: health monitoring and degradation analysis (HMDA) and failure prediction and proactive data rescue (FPDR), as shown in Fig. 3.

Fig. 3: Major components of S3.

A. HMDA - Health Monitoring and Degradation Analysis Sub-system

An HMDA daemon on each drive periodically retrieves disk SMART readings, memory and system logs from the attached disk drive. These runtime data are machine-learned to build disk and memory health models and to detect anomalous behaviors [8][18][19][20][21][22][23]. Aggregated disk health data are collected through Ethernet connections from these active drives to HMDA backend processes for higher-level analysis, where disk replacement and memory fault events are automatically cross-referenced with the detected anomalies and monitored health data to build and update the disk degradation signature and memory error models [8][24]. Fig. 4 shows the performance degradation of a disk drive prior to a failure.

[Fig. 4 appears here; it plots the disk health record against the index of health records.]
Fig. 4: Disk performance degradation prior to a disk failure.

B. FPDR - Failure Prediction and Proactive Data Rescue Sub-system

FPDR leverages regression learning and ensemble learning to track disks' degradation and memory health status, and applies the degradation signatures to forecast when disk failures will occur and to detect memory errors. FPDR finds a spare drive and performs proactive data rescue for a failing drive, from which the data are migrated to the spare drive in the background prior to the disk failure. FPDR also corrects the detected memory errors. Ultimately, we apply FPDR to reduce and even avoid the disk-rebuild overhead commonly seen in today's storage systems and to ensure storage systems are always available.

C. Research Issues of HMDA and FPDR Sub-systems

There are critical issues that we investigate for efficient, scalable storage resilience management. 1) How to store the collected runtime disk and memory health data? The runtime disk SMART records and system logs are saved on the attached disk of each Ethernet drive. To limit the footprint of these data, we employ a monitoring window, whose size is configurable. Only the health data inside the current monitoring window are stored locally. Those prior to the current window have already been learned and incorporated in the disk degradation and memory error models. Thus the raw data can be discarded except for those that are correlated with critical disk and memory errors, which are transferred to the HMDA backend processes for bookkeeping and aggregation. We evaluate different sizes of the monitoring window and exploit machine learning to select the best one at runtime. 2) How to handle false negative disk failure predictions? Our disk failure predictors keep tracking the degradation status of each drive. Since disks degrade gradually in a production environment, our preliminary results show that the false negative rate is very low. Some of the false negatives can be logical failures, e.g., caused by corrupted files or file system faults. In these cases, the drives have no physical damage. Diagnostic tools are used to find the software problems and correct them if possible. For missed disk hardware failures, a disk rebuild or parallel disk rebuild is performed. An alternative strategy is to explore parallel erasure coding (ParEC) to rebuild data blocks. We expect the performance degradation to be small because those failures are very rare. 3) How to detect memory errors and silent data corruption in Ethernet drives? Non-chipkill memory technology is currently used in Ethernet drives. Without chipkill and an end-to-end data protection technology, data corruption can go unnoticed, and recovery is difficult and costly, or even impossible to perform. Without end-to-end integrity checking, these silent data corruptions can lead to unexpected and unexplained problems [25]. We apply software based memory integrity checking and correction [26] in addition to DRAM ECC. We check data integrity during memory accesses by using error-checking code, and we use internal data and metadata checksums to detect silent data corruptions.
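As a minimal illustration of the monitoring window and failure forecasting described above, the sketch below keeps a bounded window of one SMART-style health attribute per drive and fits a simple linear trend to estimate the time until a failure threshold is crossed. The attribute, threshold, and linear model are illustrative assumptions; HMDA and FPDR use learned degradation signatures and ensemble models rather than this single-attribute trend.

# Minimal sketch of the per-drive monitoring idea in HMDA/FPDR: keep only a
# bounded window of SMART-style health samples and fit a simple trend to warn
# before a threshold is crossed. The attribute name, threshold and linear trend
# are illustrative assumptions, not the learned degradation-signature models.
from collections import deque

class DriveHealthMonitor:
    def __init__(self, window: int = 96, fail_threshold: float = 50.0):
        self.samples = deque(maxlen=window)   # older samples fall out of the window
        self.fail_threshold = fail_threshold  # e.g., a reallocated-sector budget

    def record(self, value: float) -> None:
        self.samples.append(value)

    def hours_until_failure(self, interval_hours: float = 1.0):
        """Least-squares slope over the window; None if no degradation trend."""
        n = len(self.samples)
        if n < 2:
            return None
        xs = range(n)
        mean_x, mean_y = (n - 1) / 2, sum(self.samples) / n
        num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, self.samples))
        den = sum((x - mean_x) ** 2 for x in xs)
        slope = num / den
        if slope <= 0:
            return None                       # not degrading
        latest = self.samples[-1]
        return max(0.0, (self.fail_threshold - latest) / slope) * interval_hours

mon = DriveHealthMonitor(window=24)
for v in [2, 2, 3, 5, 8, 12, 17, 23]:         # reallocated sectors creeping up
    mon.record(v)
eta = mon.hours_until_failure()
if eta is not None and eta < 48:
    print(f"predicted failure in ~{eta:.0f} h: trigger proactive data rescue")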

[Fig. 5 appears here. Recoverable elements: GPFS metadata servers, object stores, and POSIX file systems accessed through PDM, parallel data movers (built on PFTOOL) spanning GPFS, PFS, NFS, MDS, and object storage, with archive back-ends such as TSM, NFS, and HPSS.]

Fig. 5: MarFS system architecture with metadata management.

V. METADATA SUPPORT IN PASSI

MarFS [6] and MDHIM [7] are two key technologies invented and designed by LANL's researchers to manage metadata for HPC storage systems. We leverage both MarFS and MDHIM in PASSI to support high performance metadata services for the erasure code based Ethernet-drive storage system.

A. Integration with MarFS - A Hybrid Metadata Approach for Non-POSIX and POSIX Storage Systems

Metadata are usually divided into two broad categories: structural metadata, which describe the design and specification of data structures, and descriptive metadata, which comprise all other associated information such as the data's intended use, provenance, associations, and context [10]. MarFS is designed to integrate both structural metadata and descriptive metadata [6]; they are used in MarFS to record the embedded information of the data store, such as data location. MarFS enables the PASSI middleware to provide metadata services for both non-POSIX and POSIX HPC storage. We integrate the MarFS framework into the PASSI software. MarFS is a near-POSIX file system using scale-

B. Integration with MDHIM - A Parallel Key/Value Framework for HPC Storage

We have developed the Multi-dimensional Hashing Indexing Middleware (MDHIM) [7], a parallel key-value store framework designed for HPC systems. MDHIM differs from other key-value stores in that it supports HPC networks natively, uses dynamic servers that start and end with applications, works with pluggable database backends, and orders