An Application-Attuned Framework for Optimizing HPC Storage Systems

Total Pages: 16

File Type: PDF, Size: 1020 KB

An Application-Attuned Framework for Optimizing HPC Storage Systems

Arnab Kumar Paul

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Applications.

Committee: Ali R. Butt (Chair), Ian Foster, Dongyoon Lee, Eli Tilevich, Gang Wang

July 28, 2020, Blacksburg, Virginia

Keywords: Parallel File Systems, Object-Based Storage, Data Management, Load Balancing, File System Indexing, Metadata Management, High Performance Computing

Copyright 2020, Arnab Kumar Paul

(ABSTRACT)

High performance computing (HPC) is routinely employed in diverse domains such as life sciences and geology to simulate and understand the behavior of complex phenomena. Big-data-driven scientific simulations are resource intensive and require both computing and I/O capabilities at scale. There is a crucial need to revisit the HPC I/O subsystem to better optimize for and manage the increased pressure on the underlying storage systems from big data processing. Extant HPC storage systems are designed and tuned for a specific set of applications targeting a range of workload characteristics, but they lack the flexibility to adapt to ever-changing application behaviors. The complex nature of modern HPC storage systems, along with these ever-changing application behaviors, presents unique opportunities and engineering challenges.

In this dissertation, we design and develop a framework for optimizing HPC storage systems by making them application-attuned. We select three different kinds of HPC storage systems: in-memory data analytics frameworks, parallel file systems, and object storage. We first analyze HPC application I/O behavior by studying real-world I/O traces. Next, we optimize parallelism for applications running in memory, then we design data management techniques for HPC storage systems, and finally we focus on low-level I/O load balance for improving the efficiency of modern HPC storage systems.

In the first part of this dissertation (Chapter 3), we focus on collecting and studying file system statistics from two Livermore Computing systems with 15 PiB Lustre file systems at Lawrence Livermore National Laboratory. We add to the state of the art in I/O understanding by providing insight into how general HPC workloads affect the performance of large-scale HPC storage systems. In the second part of this dissertation (Chapter 4), we enable a dynamic partitioning approach to improve task parallelism for in-memory data analytics frameworks. The third part of this dissertation (Chapter 5 and Chapter 6) proposes scalable and novel techniques for data management in HPC storage systems. We design and develop a generalized and scalable storage system monitor, FSMonitor (Chapter 5), for capturing and reporting events on heterogeneous large-scale storage systems. Furthermore, we propose a metadata indexing and search tool, Brindexer (Chapter 6), specifically designed for large-scale HPC storage systems. Brindexer is mainly designed to help system administrators manage the file system effectively.
In the fourth part of this dissertation (Chapter 7), we use all the findings and tools from the previous parts to develop an "end-to-end control plane" that leverages real-time load information for global data placement across distributed storage servers, while our design model leverages trace-based optimization techniques to minimize I/O request latency between clients and servers.

(GENERAL AUDIENCE ABSTRACT)

Clusters of multiple computers connected through the Internet are often deployed in industry and laboratories for large-scale data processing or computation that cannot be handled by standalone computers. In such a cluster, resources such as CPUs, memory, and disks are integrated to work together. With the increase in popularity of applications that read and write tremendous amounts of data, we need a large number of disks that can interact effectively in such clusters. These disks form part of high performance computing (HPC) storage systems. Such HPC storage systems are used by a diverse set of applications from organizations in a vast range of domains, from earth sciences, financial services, and telecommunications to life sciences. Therefore, an HPC storage system should perform well for the different read and write (I/O) requirements of all these different applications. But current HPC storage systems do not cater to such varied I/O requirements. To this end, this dissertation designs and develops a framework for HPC storage systems that is application-attuned and thus provides much improved performance over state-of-the-art HPC storage systems without such optimizations.

Dedicated to my family - Ma, Baba, Mann, Didi, Tun da, Jhilik, and Aarush; and to my friends who have become a part of my life, for their support, encouragement, motivation, and love.

Acknowledgments

This dissertation would not have been possible without the support of many people who have become an integral part of my life. I would like to take this time to express my gratitude to these awesome individuals.

My journey in the path of Computer Science started when I was in the eighth standard, when my first computer came to our house. Didi was the one who first inspired me to write computer code to solve complex problems. I also owe my interest in Computer Science hugely to Geeta ma'am and Rama Rao Sir, who instilled in me the confidence in coding and digital logic throughout my high school days. During my B.Tech. days at Netaji Subhash Engineering College, I am thankful to the entire CSE faculty, especially Anupam Sir and Pritwish Sir, who taught me the fundamentals of Operating Systems and Computer Hardware, which form the base of my Ph.D. dissertation. My life's journey would not have turned towards higher studies without the support of my roommates and friends during B.Tech. - Soumya, Sabya Da, Debu Da, Koontal Da, Golu, Subroto Da, Pritam, Krishna, Bikram, Arijit, Prithwish, Priyankar, Arka, Arjun, Pratip, Banu, Asif, Samik, Nandu, Manashi, Lopa, Such, Nimi, Animesh, Arunava, Kathi, Raj, Arindam, Aritra, Rahul Roy, Chandrani, Moulik, and Madhu.

Next, during my M.Tech. at National Institute of Technology Rourkela, I met so many awesome individuals who shaped my life and prepared me further for my Ph.D. Firstly, special thanks to Bibhu Sir, Rath Sir, and Mohapatra Sir for letting me work on an M.Tech. topic that was outside my specialization area but would lead me to pursue a Ph.D.
on that topic. They were the ones who wrote recommendation letters and motivated me to pursue my Ph.D. in the U.S. My roommates and classmates at NIT Rourkela have been the best. Ravikant - I miss you a lot, mere bhai! Koshy, Akshay, Sandeep, Soham, Sumanta, Rajgopal, Aditi, Sai, and Hareesh, you all have given me the best moments to cherish throughout my life.

My Ph.D. journey at Virginia Tech would not have gone as smoothly as it did without the support and encouragement of my advisor, Dr. Ali R. Butt. He has been a mentor and an elder brother to me during the course of my Ph.D. The freedom that he gave me during my Ph.D. to pursue my research ideas, mentor students, teach classes, and brainstorm ideas for proposals has given me the much-needed boost for my entire career. He has also shown me how to balance family and work while being successful at both. I am also thankful to Dr. Eli Tilevich, who gave me life and Ph.D. lessons during our ping-pong sessions, sometimes even late at night in CRC. I have also been inspired by the rigorous hard work that my committee members, Dr. Dongyoon Lee and Dr. Gang Wang, put in day and night to excel in their respective research areas. I would also like to take this opportunity to thank Dr. Wu Feng, who was my first point of contact at Virginia Tech.

Thanks are in order for my labmates: Kwang, Umar - my savior, Jay, Safdar and Anwar - my mentors, Bharti, Kevin, Sangeetha, Arpit, Salman, Luna, Hadeel, Nannan, Redwan, Debasmita, Jingoo, Hyogi, Arjun, Jamaal, and Michael. The interactions, collaborations, and lengthy discussions about research projects and life in general with my labmates really enriched my Ph.D. experience. DSSL rocks! Subil - in addition to being an awesome labmate, you have been a research collaborator; I hope we continue our relationship even in ORNL. Breno - for the past two years we have served as punching bags for each other's stress in the ping-pong room. I hope we can have many more collaborations in the near future.

I am extremely thankful to the wonderful staff in the Computer Science department at Virginia Tech. Special mention goes to Sharon, who has helped me numerous times and with whom I have formed a special bond over the years. Teresa - without you, none of my conference travels would have been possible. Melanie and Robert - thank you for taking care of us whenever it was needed; and finally the tech staff, for always being there whenever I needed technical assistance managing our lab's cluster. Thanks are also in order for the graduate school, for being there whenever I needed help with immigration documents and procedures.

My Ph.D. career has been marked in large part by the three summer internships that I did. Firstly, thank you Dr. Foster, Ryan, and Kyle for giving me the opportunity to work at Argonne National Laboratory in the summer of 2017. It was my first ever internship, and the experience I gained from it was tremendous.
Recommended publications
  • Base Performance for BeeGFS with Data Protection
    BeeGFS Data Integrity Improvements without Impacting Performance. Presenters: Dr. M. K. Jibbe (Technical Director, NetApp ESG) and Joey Parnell (Storage Architect, NetApp ESG). 2019 Storage Developer Conference, NetApp.
    Current problem:
    • Lustre has been a common choice in the HPC market for years, but the acquisition of the Lustre intellectual property by a storage provider in 2018 has created a challenge for users who want to avoid vendor lock-in.
    • Parallel file system needs are expanding beyond HPC, but most lack enterprise reliability.
    • Any BeeGFS configuration lacks data protection at different locations: storage clients and servers do not have cache protection in the event of kernel panics, software crashes, or power loss. Potential issues are: data cached in the BeeGFS client itself; data cached at the underlying filesystem layer in the storage server (XFS, ZFS, ext4, etc.); and data cached by the OS kernel itself underneath the filesystem in the storage server.
    Goal of the team (remove competitor and expand market reach of BeeGFS): improve solution resiliency to ensure data integrity by identifying potential component faults that could lead to data loss, and change the system design to handle the fault without any data integrity challenges; address the known market opportunity with a simple solution and support. Data integrity and availability expectations in HPC (provide enterprise reliability with E-series storage and BeeGFS): the market we serve goes beyond the HPC scratch file system use cases.
  • Spring 2014 • Vol
    Editorial. inside, the journal on Innovative Supercomputing in Germany, is published two times a year by the GAUSS Centre for Supercomputing (HLRS, LRZ, JSC). Publishers: Prof. Dr. A. Bode | Prof. Dr. Dr. Th. Lippert | Prof. Dr.-Ing. Dr. h.c. Dr. h.c. M. M. Resch. Editor: F. Rainer Klank, HLRS. Design: Linda Weinmann, Sebastian Zeeden, and Pia Heusel, HLRS.
    Welcome to this new issue of inside, the journal on Innovative Supercomputing in Germany published by the Gauss Centre for Supercomputing (GCS). In this issue, there is a clear focus on applications. While the race for ever faster systems is accelerated by the challenge to build an Exascale system within a reasonable power budget, the increased sustained performance allows for the solution of ever more complex problems. The range of applications presented is impressive, covering a variety of different fields. ... of the German Research Society (DFG) start to create interesting results, as is shown in our project section. With respect to hardware, GCS is continuously following its roadmap. New systems will become available at HLRS in 2014 and at LRZ in 2015. System shipment for HLRS is planned to start on August 8, with a chance to start general operation as early as October 2014. We will report on further progress. GCS has been a strong player in PRACE over the last years, and in 2015 will continue to provide access to leading edge systems for all European researchers. As usual, this issue includes information about events in supercomputing in Germany over the last months and gives an outlook of workshops in the field.
  • Data Integration and Analysis 10 Distributed Data Storage
    Data Integration and Analysis, Lecture 10: Distributed Data Storage. Matthias Boehm, Graz University of Technology, Austria. Computer Science and Biomedical Engineering, Institute of Interactive Systems and Data Science, BMK endowed chair for Data Management. Last update: Dec 11, 2020. Course 706.520 Data Integration and Large-Scale Analysis, WS 2020/21.
    Announcements/Org:
    • #1 Video recording: link in TeachCenter & TUbe (lectures will be public); optional attendance (independent of COVID).
    • #2 COVID-19 restrictions (HS i5): Corona traffic light RED to ORANGE (Dec 07); Webex lectures re-evaluated over the holidays.
    • #3 Projects and exercises: 34x SystemDS projects, 11x exercise projects; final kickoff discussions last week.
    • #4 Course evaluation: evaluation period Dec 15 – Jan 31; exam date TBD, but likely end of January.
    Course outline, Part B (Large-Scale Data Management and Analysis): 08 Cloud Computing Fundamentals [Nov 27]; 09 Cloud Resource Management and Scheduling [Dec 04]; 10 Distributed Data Storage [Dec 11]; 11 Distributed Data-Parallel Computation [Jan 08]; 12 Distributed Stream Processing [Jan 15]; 13 Distributed Machine Learning Systems [Jan 22].
    Agenda: Motivation and Terminology; Object Stores and Distributed File Systems; Key-Value Stores and Cloud DBMS.
  • GlobalFS: A Strongly Consistent Multi-Site File System
    GlobalFS: A Strongly Consistent Multi-Site File System. Leandro Pacheco (University of Lugano), Raluca Halalai (University of Neuchâtel), Valerio Schiavoni (University of Neuchâtel), Fernando Pedone (University of Lugano), Etienne Rivière (University of Neuchâtel), Pascal Felber (University of Neuchâtel).
    Abstract: This paper introduces GlobalFS, a POSIX-compliant geographically distributed file system. GlobalFS builds on two fundamental building blocks, an atomic multicast group communication abstraction and multiple instances of a single-site data store. We define four execution modes and show how all file system operations can be implemented with these modes while ensuring strong consistency and tolerating failures. We describe the GlobalFS prototype in detail and report on an extensive performance assessment. We have deployed GlobalFS across all EC2 regions and show that the system scales geographically, providing performance comparable to other state-of-the-art distributed file systems for local commands and allowing for strongly consistent operations over the whole system.
    ... consistency, availability, and tolerance to partitions. Our goal is to ensure strongly consistent file system operations despite node failures, at the price of possibly reduced availability in the event of a network partition. Weak consistency is suitable for domain-specific applications where programmers can anticipate and provide resolution methods for conflicts, or work with last-writer-wins resolution methods. Our rationale is that for general-purpose services such as a file system, strong consistency is more appropriate, as it is both more intuitive for the users and does not require human intervention in case of conflicts. Strong consistency requires ordering commands across replicas, which needs coordination among nodes at geographically distributed sites (i.e., regions).
  • Towards Transparent Data Access with Context Awareness
    Computer Science 19(2) 2018, https://doi.org/10.7494/csci.2018.19.2.2844. Michał Wrzeszcz, Renata G. Słota, Jacek Kitowski. Towards Transparent Data Access with Context Awareness.
    Abstract: Open-data research is an important factor accelerating the production and analysis of scientific results as well as worldwide collaboration; still, very little data is being shared at scale. The aim of this article is to analyze existing data-access solutions along with their usage limitations. After analyzing the existing solutions and data-access stakeholder needs, the authors propose their own vision of a data-access model. Keywords: distributed, transparent data access, distributed data storage, context awareness. Citation: Computer Science 19(2) 2018: 201-221.
    1. Introduction. Data access and management can be analyzed on several levels, from personal data to globally distributed shared data, with different problems to be overcome by data access and management tools. The simplest case is access to local data, i.e., data stored on direct attached storage (DAS), where the focus is on providing the device drivers and solutions that use the hardware optimally. The next level is to provide data access for a group of users working for a single organization, e.g., provision of network attached storage (NAS). The problems encountered in this case pertain to possible network failures, higher latencies [36], or the simultaneous work of many users influencing the quality of service [48, 59]. A balanced use of available storage systems to maintain QoS and cost effectiveness in an organization, along with request scheduling and resource allocation, is also expected [28, 56].
  • The Leading Parallel Cluster File System, Developed with a Strong Focus on Performance and Designed for Very Easy Installation and Management
    The Leading Parallel Cluster File System. www.thinkparq.com, www.beegfs.io
    About BeeGFS. What is BeeGFS: BeeGFS (formerly FhGFS) is the leading parallel cluster file system, developed with a strong focus on performance and designed for very easy installation and management. If I/O-intensive workloads are your problem, BeeGFS is the solution.
    Why use BeeGFS: BeeGFS transparently spreads user data across multiple servers. By increasing the number of servers and disks in the system, you can simply scale performance and capacity of the file system to the level that you need, seamlessly from small clusters up to enterprise-class systems with thousands of nodes.
    Get the most out of your data: The flexibility, robustness, and outstanding performance of BeeGFS help our customers around the globe to increase productivity by delivering results faster and by enabling new data analysis methods that were not possible without the advantages of BeeGFS.
    Key aspects. Maximum scalability: BeeGFS offers maximum performance and scalability on various levels. It supports distributed file contents with flexible striping across storage servers on a per-file or per-directory basis, as well as distributed metadata. Maximum flexibility: BeeGFS supports a wide range of Linux distributions such as RHEL/Fedora, SLES/OpenSuse or Debian/Ubuntu, as well as a wide range of Linux kernels from ancient 2.6.18 up to the ...
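    The striping idea described above can be illustrated with a small sketch. This is not BeeGFS code; the chunk size and target names are hypothetical, and real stripe patterns are configured per file or per directory rather than computed like this:

    # Illustrative sketch of parallel-file-system striping (not BeeGFS's implementation).
    # A file is split into fixed-size chunks assigned round-robin to storage targets,
    # so reads and writes of a large file hit many servers in parallel.

    CHUNK_SIZE = 1 * 1024 * 1024                                      # hypothetical 1 MiB chunks
    TARGETS = ["storage01", "storage02", "storage03", "storage04"]    # hypothetical targets

    def chunk_layout(file_size: int, chunk_size: int = CHUNK_SIZE, targets=TARGETS):
        """Return (chunk_index, target, offset, length) for every chunk of a file."""
        layout = []
        offset = 0
        index = 0
        while offset < file_size:
            length = min(chunk_size, file_size - offset)
            target = targets[index % len(targets)]   # round-robin placement across servers
            layout.append((index, target, offset, length))
            offset += length
            index += 1
        return layout

    if __name__ == "__main__":
        for chunk in chunk_layout(5 * 1024 * 1024 + 123):
            print(chunk)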
  • Maximizing the Performance of Scientific Data Transfer by Optimizing the Interface Between Parallel File Systems and Advanced Research Networks
    Maximizing the Performance of Scientific Data Transfer by Optimizing the Interface Between Parallel File Systems and Advanced Research Networks. Nicholas Mills (Holcombe Department of Electrical and Computer Engineering, Clemson University), F. Alex Feltus (Department of Genetics and Biochemistry, Clemson University), Walter B. Ligon III (Holcombe Department of Electrical and Computer Engineering, Clemson University).
    Abstract: The large amount of time spent transferring experimental data in fields such as genomics is hampering the ability of scientists to generate new knowledge. Often, computer hardware is capable of faster transfers, but sub-optimal transfer software and configurations are limiting performance. This work seeks to serve as a guide to identifying the optimal configuration for performing genomics data transfers. A wide variety of tests narrow in on the optimal data transfer parameters for parallel data streaming across Internet2 and between two CloudLab clusters loading real genomics data onto a parallel file system. The best throughput was found to occur with a configuration using GridFTP with at least 5 parallel TCP streams with a 16 MiB TCP socket buffer size to transfer to/from 4-8 BeeGFS parallel file system nodes connected by InfiniBand. Keywords: Software Defined Networking, High Throughput Computing, DNA sequence, Parallel Data Transfer, Parallel File System, Data Intensive Science.
    1. Introduction. Solving scientific problems on a high-performance computing (HPC) cluster will happen faster by taking full advantage of specialized infrastructure such as parallel file systems and advanced software-defined networks. ... of over 2,500 human genomes from 26 populations have been determined (The 1000 Genomes Project [3]); genome sequences have been produced for 3,000 rice varieties from 89 countries (The 3000 Rice Genomes Project [4]). These raw datasets are but a few examples that aggregate into ...
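    For a sense of what the recommended configuration tunes, here is a minimal sketch of opening several parallel TCP streams, each with an enlarged socket buffer. It is not GridFTP; the destination host, port, and file path are hypothetical placeholders, and a real transfer tool exposes the same knobs as configuration options rather than hand-written sockets:

    # Minimal sketch of the two tuning knobs from the excerpt above: parallel TCP streams
    # and a large per-socket buffer. NOT GridFTP; host, port, and paths are placeholders.
    import socket
    import threading

    DEST = ("dtn.example.org", 2811)      # hypothetical data transfer node
    TCP_BUFFER = 16 * 1024 * 1024         # 16 MiB socket buffer, as in the study
    NUM_STREAMS = 5                       # at least 5 parallel streams

    def send_slice(path: str, offset: int, length: int) -> None:
        """Send one byte range of the file over its own TCP connection."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, TCP_BUFFER)  # set before connect
        sock.connect(DEST)
        with sock, open(path, "rb") as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = f.read(min(1 << 20, remaining))
                if not chunk:
                    break
                sock.sendall(chunk)
                remaining -= len(chunk)

    def parallel_send(path: str, total_size: int) -> None:
        """Split the file into NUM_STREAMS contiguous slices and send them concurrently."""
        slice_size = total_size // NUM_STREAMS
        threads = []
        for i in range(NUM_STREAMS):
            offset = i * slice_size
            length = total_size - offset if i == NUM_STREAMS - 1 else slice_size
            t = threading.Thread(target=send_slice, args=(path, offset, length))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()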
  • A Little Queuing Theory: Some Results; Recall: FAT Properties; FAT Assessment – Issues
    CS162 Operating Systems and Systems Programming, Lecture 19: File Systems (Con't), MMAP, Buffer Cache. April 11th, 2019, Prof. Ion Stoica, http://cs162.eecs.Berkeley.edu
    Recall: A Little Queuing Theory: Some Results
    • Assumptions: system in equilibrium; no limit to the queue; time between successive arrivals is random and memoryless. A queue with arrival rate λ feeds a server with service rate μ = 1/Tser.
    • Parameters that describe our system:
      – λ: mean number of arriving customers/second
      – Tser: mean time to service a customer ("m1")
      – C: squared coefficient of variance = σ²/m1²
      – μ: service rate = 1/Tser
      – u: server utilization (0 ≤ u ≤ 1): u = λ/μ = λ × Tser
    • Parameters we wish to compute:
      – Tq: time spent in queue
      – Lq: length of queue = λ × Tq (by Little's law)
    • Results:
      – Memoryless service distribution (C = 1), called an M/M/1 queue: Tq = Tser × u/(1 – u)
      – General service distribution (no restrictions), 1 server, called an M/G/1 queue: Tq = Tser × ½(1 + C) × u/(1 – u)
    Recall: FAT Properties / FAT Assessment – Issues
    • A file is a collection of disk blocks (Microsoft calls them "clusters").
    • The FAT is an array of integers mapped 1-1 with disk blocks. Each integer is either a pointer to the next block in the file, an end-of-file flag, or a free-block flag.
    • Assessment issues: time to find a block (large files)? Block layout for a file? Sequential access? Random access?
    (The slides' figures show the FAT array alongside the disk blocks it indexes, e.g., entries for File 31, Block 0, File 31, Block 1, and File 63, Block 1.)
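    A small worked example of the two queueing results above; the arrival rate, service time, and C value are made-up numbers for illustration:

    # Worked example of the M/M/1 and M/G/1 formulas from the excerpt above.
    lam = 40.0            # arrival rate: 40 requests/second (lambda), made up
    t_ser = 0.02          # mean service time: 20 ms per request, made up
    u = lam * t_ser       # server utilization u = lambda * Tser = 0.8

    # M/M/1 (memoryless service, C = 1): Tq = Tser * u / (1 - u)
    tq_mm1 = t_ser * u / (1.0 - u)

    # M/G/1 (general service distribution): Tq = Tser * (1 + C)/2 * u / (1 - u)
    C = 1.5               # squared coefficient of variation of service time, made up
    tq_mg1 = t_ser * 0.5 * (1.0 + C) * u / (1.0 - u)

    # Little's law: Lq = lambda * Tq
    print(f"M/M/1: Tq = {tq_mm1 * 1000:.1f} ms, Lq = {lam * tq_mm1:.1f} requests")
    print(f"M/G/1: Tq = {tq_mg1 * 1000:.1f} ms, Lq = {lam * tq_mg1:.1f} requests")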
  • 855400Cd46b6c9e7683b5ace58a55234.Pdf
    Ars Technica. Encryption options are great, but Apple's attitude on checksums is still funky. By Adam H. Leventhal, Jun 26, 2016 1:00pm UTC. This article was originally published on Adam Leventhal's blog in multiple parts.
    Apple announced a new file system that will make its way into all of its OS variants (macOS, tvOS, iOS, watchOS) in the coming years. Media coverage to this point has been mostly breathless elongations of Apple's developer documentation. With a dearth of detail, I decided to attend the presentation and Q&A with the APFS team at WWDC. Dominic Giampaolo and Eric Tamura, two members of the APFS team, gave an overview to a packed room; along with other members of the team, they patiently answered questions later in the day. With those data points and some first-hand usage, I wanted to provide an overview and analysis both as a user of Apple-ecosystem products and as a long-time operating system and file system developer. The overview is divided into several sections. I'd encourage you to jump around to topics of interest or skip right to the conclusion (or to the tweet summary). Highest praise goes to encryption; ire to data integrity.
  • Recent Filesystem Optimisations in FreeBSD
    Recent Filesystem Optimisations in FreeBSD. Ian Dowse (Corvil Networks), David Malone (CNRI, Dublin Institute of Technology).
    Abstract: In this paper we summarise four recent optimisations to the FFS implementation in FreeBSD: soft updates, dirpref, vmiodir and dirhash. We then give a detailed exposition of dirhash's implementation. Finally we study these optimisations under a variety of benchmarks and look at their interactions. Under micro-benchmarks, combinations of these optimisations can offer improvements of over two orders of magnitude. Even real-world workloads see improvements by a factor of 2–10.
    1 Introduction. Over the last few years a number of interesting filesystem optimisations have become available under FreeBSD. In this paper we have three goals. ...
    2.1 Soft Updates. Soft updates is one solution to the problem of keeping on-disk filesystem metadata recoverably consistent. Traditionally, this has been achieved by using synchronous writes to order metadata updates. However, the performance penalty of synchronous writes is high. Various schemes, such as journaling or the use of NVRAM, have been devised to avoid them [14]. Soft updates, proposed by Ganger and Patt [4], allows the reordering and coalescing of writes while maintaining consistency. Consequently, some operations which have traditionally been durable on system call return are no longer so. However, any applications requiring synchronous updates can still use fsync(2) to force specific changes to be fully committed to disk. The implementation of soft updates is relatively complicated, involving tracking of dependencies and the roll forward/back of ...
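    To make the fsync(2) point concrete, a minimal sketch (the file path is a hypothetical placeholder) of an application forcing a specific update to stable storage before continuing:

    # Minimal sketch of forcing a specific update to be durable with fsync(2),
    # as described in the excerpt above.
    import os

    def durable_append(path: str, record: bytes) -> None:
        """Append a record and only return once it has reached stable storage."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, record)
            os.fsync(fd)   # block until the data is committed to disk
        finally:
            os.close(fd)

    durable_append("/tmp/journal.log", b"transaction committed\n")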
  • Toward Automatic Context-Based Attribute Assignment for Semantic File Systems
    Toward Automatic Context-Based Attribute Assignment for Semantic File Systems. Craig A. N. Soules, Gregory R. Ganger. CMU-PDL-04-105, June 2004. Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA 15213-3890.
    Abstract: Semantic file systems enable users to search for files based on attributes rather than just pre-assigned names. This paper develops and evaluates several new approaches to automatically generating file attributes based on context, complementing existing approaches based on content analysis. Context captures broader system state that can be used to provide new attributes for files, and to propagate attributes among related files; context is also how humans often remember previous items [2], and so should fit the primary role of semantic file systems well. Based on our study of ten systems over four months, the addition of context-based mechanisms, on average, reduces the number of files with zero attributes by 73%. This increases the total number of classifiable files by over 25% in most cases, as is shown in Figure 1. Also, on average, 71% of the content-analyzable files also gain additional valuable attributes.
    Acknowledgements: We thank the members and companies of the PDL Consortium (including EMC, Hewlett-Packard, HGST, Hitachi, IBM, Intel, LSI Logic, Microsoft, Network Appliance, Panasas, Seagate, Sun, and Veritas) for their interest, insights, feedback, and support. We thank IBM and Intel for hardware grants supporting our research efforts. Keywords: semantic filesystems, context, attributes, classification.
    1 Introduction. As storage capacity continues to increase, the number of files belonging to an individual user has increased accordingly. Already, storage capacity has reached the point where there is little reason for a user to delete old content; in fact, the time required to do so would be wasted.
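    As a rough illustration of the context idea (a hedged sketch, not the paper's system), attributes observed for one task or session can be propagated to every file touched during it, so even files that content analysis cannot classify still gain searchable attributes. Paths and attribute names are hypothetical:

    # Sketch of context-based attribute propagation among related files.
    from collections import defaultdict

    class ContextTagger:
        def __init__(self):
            self.attributes = defaultdict(set)   # file path -> set of attribute strings
            self.current_context = set()         # files touched in the active context

        def open_in_context(self, path: str) -> None:
            """Record that a file was accessed in the current context."""
            self.current_context.add(path)

        def end_context(self, context_attrs: set) -> None:
            """Propagate the context's attributes to every file touched in it."""
            for path in self.current_context:
                self.attributes[path] |= context_attrs
            self.current_context = set()

    tagger = ContextTagger()
    tagger.open_in_context("/home/user/report.tex")     # hypothetical paths
    tagger.open_in_context("/home/user/results.csv")
    tagger.end_context({"project:thesis", "task:writing"})
    print(tagger.attributes["/home/user/results.csv"])   # {'project:thesis', 'task:writing'}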
  • Dominic Giampaolo Interview
    frontwheeldrive.com: Anno Dominic - Dominic Giampaolo, by Tom Georgoulias. Dominic Giampaolo isn't exactly a household name, but chances are high you've actually seen the benefits of his work. Dominic was the guy who tracked down a bug in the graphics driver during his days at SGI that was inhibiting a digital effects studio from getting their work completed on the movie Speed. He was also responsible for tracking down a gnarly bug in an eight-processor system that only occurred within a window of 12 nanoseconds. Dominic worked on various portions of the RealityEngine before moving to Be Inc. as a kernel engineer and file system architect for the BeOS. I met Dominic several years ago through a mutual friend and we've been corresponding ever since, discussing topics like music, traveling, and software on a weekly basis. Dominic recently left Be Inc. for a senior project management position at Google, so I felt it was time to put him under the frontwheeldrive spotlight and get his take on various events such as the rise of Linux and the Microsoft anti-trust trial.
    frontwheeldrive: As I understand it, after your graduation you basically drove from Worcester, MA to San Jose, CA and landed a job at SGI. How did you like it there?
    Dominic Giampaolo: I loved working at SGI.