DOE HPC Best Practices Workshop File Systems and Archives

SEPTEMBER 26-27, 2011 • SAN FRANCISCO, CA

The Fifth Workshop on HPC Best Practices: File Systems and Archives

Held September 26–27, 2011, San Francisco

Workshop Report December 2011

Compiled by Jason Hick, John Hules, and Andrew Uselton Lawrence Berkeley National Laboratory

Workshop Steering Committee John Bent (LANL); Jeff Broughton (LBNL/NERSC); Shane Canon (LBNL/NERSC); Susan Coghlan (ANL); David Cowley (PNNL); Mark Gary (LLNL); Gary Grider (LANL); Kevin Harms (ANL/ALCF); Paul Henning, DOE Co-Host (NNSA); Jason Hick, Workshop Chair (LBNL/NERSC); Ruth Klundt (SNL); Steve Monk (SNL); John Noe (SNL); Lucy Nowell (ASCR); Sarp Oral (ORNL); James Rogers (ORNL); Yukiko Sekine, DOE Co-Host (ASCR); Jerry Shoopman (LLNL); Aaron Torres (LANL); Andrew Uselton, Assistant Chair (LBNL/NERSC); Lee Ward (SNL); Dick Watson (LLNL).

Workshop Group Chairs Shane Canon (LBNL); Susan Coghlan (ANL); David Cowley (PNNL); Mark Gary (LLNL); John Noe (SNL); Sarp Oral (ORNL); Jim Rogers (ORNL); Jerry Shoopman (LLNL).

Ernest Orlando Lawrence Berkeley National Laboratory 1 Cyclotron Road, Berkeley, CA 94720-8148

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

DISCLAIMER

This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor The Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or The Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or The Regents of the University of California.

Contents

Executive Summary
Introduction
  Workshop Goals
  Workshop Format
  Workshop Breakout Topics
    The Business of Storage Systems
    The Administration of Storage Systems
    The Reliability and Availability of Storage Systems
    The Usability of Storage Systems
Workshop Findings
  Best Practices
  Gaps and Recommendations
Appendix A: Position Papers
Appendix B: Workshop Agenda
Appendix C: Workshop Attendees


Executive Summary

In 2006, the United States Department of Energy (DOE) National Nuclear Security Administration (NNSA) and Office of Science (SC) identified the need for large supercomputer centers to learn from each other on what worked well and what didn’t in the areas of acquisition, installation, integration, testing and operation. Previous High Performance Computing (HPC) Best Practice Workshops focused on System Integration in 2007 (http://outreach.scidac.gov/pibp/), Risk Management in 2008 (http://rmtap.llnl.gov/), Software Lifecycles in 2009 (http://outreach.scidac.gov/swbp/), and Power Management in 2010 (http://outreach.scidac.gov/pmbp/).

The workshop on HPC Best Practices on File Systems and Archives was the fifth in the series. The workshop gathered technical and management experts for operations of HPC file systems and archives. Attendees identified and discussed best practices in use at their facilities, and documented findings for the DOE and HPC community in this report. The home page for this workshop is http://outreach.scidac.gov/fsbp/.

Prior to the workshop, board members identified four critical functional components of storage system operations: administration, business, reliability, and usability of storage. These components of storage became the breakout session topics. During registration, attendees designated their primary interests for breakout sessions and submitted position papers so that discussion could center on the proposals and ideas those papers raised. Participants submitted 27 position papers, included in Appendix A.

This year the workshop had 64 attendees from 29 different sites from around the world. Participants identified 16 best practices in use:

1. For critical data, multiple copies should be made on multiple species of hardware in geographically separated locations.
2. Build a multidisciplinary data center team.
3. Having dedicated maintenance personnel, vendor or internal staff, is important to increasing system availability.
4. Establish and maintain hot-spare, testbed, and pre-production environments.
5. Establish systematic and ongoing procedures to measure and monitor system behavior and performance.
6. Establish authoritative sources of information.
7. Manage users’ consumption of space.
8. Storage systems should function independently from compute systems.
9. Include middleware benchmarks and performance criteria in system procurement contracts.
10. Deploy and support tools to enable users to more easily achieve the full capabilities of file systems and archives.
11. Training, onsite or online, and meeting with the users directly improves the utilization and performance of the storage system.
12. Actively manage relationships with the storage industry and storage community.
13. Monitor and exploit emerging storage technologies.
14. Continue to re-evaluate the real characteristics of your I/O workload.
15. Use quantitative data to demonstrate the importance of storage to achieving scientific results.
16. Retain original logs generated by systems and software.

Many participants do not implement all the best practices, which means they are bringing several new ideas for improvement back to their sites. In addition, attendees identified 15 gaps that require further attention. Gaps likely addressed by evolutionary solutions include:

1. Incomplete attention to end-to-end data integrity.
2. Storage system software is not resilient enough.
3. Redundant Array of Independent Tape (RAIT) does not exist today.
4. Provide monitoring and diagnostic tools that allow users to understand file system configuration and performance characteristics and troubleshoot poor I/O performance.
5. Communication between storage systems users and storage system experts needs improvement.
6. Need tools and mechanisms to capture provenance and provide life-cycle management for data.
7. Ensure storage systems are prepared to support data-intensive computing and new workflows.
8. Measure and track trends in storage system usage to the same degree we measure and track flops and cycles on computational systems.
9. Tools to provide scalable performance in interacting with files in the storage system.

Gaps requiring revolutionary approaches include:

1. Need quality of service for users in a shared storage environment.
2. Need standard metadata for users to specify data importance and retention.
3. The diagnostic information currently available in today’s storage systems is woefully inadequate.
4. Metadata performance in storage systems already limits usability and won’t meet the needs of exascale.
5. POSIX isn’t expressive enough for the breadth of application I/O patterns.
6. There exists a storage technology gap for exascale.

Introduction

The U.S. Department of Energy (DOE) has identified the design, implementation, and usability of file systems and archives as key issues for current and future high performance computing (HPC) systems. This workshop was organized to address current best practices for the procurement, operation, and usability of file systems and archives. Furthermore, the workshop addressed whether system challenges can be met by evolving current practices.

Workshop Goals

The workshop organizers presented the following goals:
• Foster a shared understanding of file system issues in the context of HPC centers.
• Identify top challenges and open issues.
• Share best practices and lessons learned.
• Establish communication paths for managerial and technical staff at multiple sites to continue discussion on these topics.
• Discuss roles and benefits of HPC stakeholders.
• Present findings to DOE and other stakeholders.

Workshop Format

The organizers requested a short position paper from each attendee to identify best practices in the area of file systems and archives. Each position paper addressed a topic or topics related to best practices for the session they were attending. The session organizers identified the questions below as key topics for each session and suggested that each paper deal with one or more of them. Participants were asked to frame their discussion to identify what they think are best practices in use at their site related to the session topics, and to frame questions within the discussion to help elicit best practices from the other participants. Position papers are collected in Appendix A.

The workshop began with presentations by representatives of the DOE sponsors: Thuc Hoang of the National Nuclear Security Administration (NNSA) and Yukiko Sekine of the Office of Science (SC). The workshop focused primarily on breakout sessions in which participants presented and evaluated selected topics from their position papers. Each day ended with a member of each breakout session presenting the day’s results to the entire group. The agenda is presented in Appendix B.

Workshop Breakout Topics

In their position papers and breakout discussions, participants were asked to consider the topics listed below.

The Business of Storage Systems

For each layer of the storage hierarchy — file systems, archives, others — address these topics (and add to them):

1. HPC facilities across DOE deploy and operate systems at unprecedented scale requiring advanced file system technologies. Achieving the requisite level of performance and scalability from file systems remains a significant challenge.

A. What practices do you use to plan for future system deployments and system evolution over time?
B. How do you establish requirements such as bandwidth, capacity, metadata operations/sec, mean time to interrupt (MTTI), etc. for these systems?

2. It is not uncommon for archival storage deployments to have life spans approaching multiple decades. Growth of archival data at a number of HPC sites is exponential. A. How do you effectively plan for exponential growth rates and archives that will need to serve multiple generations of machines throughout their life within a fixed budget profile? B. Are exponential growth rates sustainable? How do you mitigate if not?

3. There are relatively few alternatives in parallel file system and archival storage system software that meet the requirements of major HPC facilities. Development of this software varies from proprietary closed source to collaborative open source solutions. Each model has benefits and drawbacks in terms of total cost of ownership, ability to evolve the software to meet specific requirements, and long-term viability (risk) of the system. A. What model do you leverage for your file system or archival system software needs? B. What benefits/drawbacks do you see with these models?

4. Storage system and tape archive technologies vary from high-end custom hardware developed specifically for the HPC environment to commodity storage platforms with extremely broad market saturation.
A. Where do you leverage custom versus commodity storage hardware within your operational environment?
B. Do you see opportunities to incorporate more commodity storage technologies within your environment in the future?
C. What are the barriers to adopting commodity storage technologies and how can they be overcome in the future?

The Administration of Storage Systems

For each layer of the storage hierarchy — file systems, archives, others — address these topics (and add to them):

1. Change control and configuration management. A. What specific configuration management tools, methods, or practices does your center use to validate hardware/software changes and releases to minimize production performance degradation or system downtime? B. Can you provide an example of how testing a change/release on a pre-production system provided unique insight into a configuration problem before users detected it?

2. Ongoing system administration. A. What specific file system and/or archive metrics does your center measure and monitor on a regular basis, and how have those findings directed which types of operational tasks your center has automated to minimize frequency and impact of production incidents and optimize system performance and end-user experience? B. Can you provide an example where self-monitoring and detecting production incidents led to an investigation of root cause to reduce mean time to resolution (MTTR) of future file system or archive outages?

3. Technology refresh.
A. What unique approach has your center taken to balance the end-user requirement to increase system availability while providing system architects the opportunity and access to the environment to satisfy the ongoing need to refresh underlying file system and archive components?
B. How does your center expand capacity of either a file system or archive resource while minimizing user impact?

4. Security management.
A. What strategy is your center taking with, for example, operating system (OS) patching or vulnerability scanning, to satisfy the ongoing and rigorous demands of computer security professionals?
B. How does your strategy balance the growing need to provide a high degree of collaboration for distributed user communities with access to multiple levels of data sensitivity?
C. Assuming your center provides a multi-zoned security architecture with various access control levels and technologies, how would you demonstrate to computer security that it is providing adequate protection and controls?

The Reliability and Availability of Storage Systems

For each layer of the storage hierarchy — file systems, archives, others — address these topics (and add to them):

1. Resilient architectures and fault tolerance. A. What specific system architectural or configuration practices, decisions or changes (hardware and software) has your center made that have demonstrably improved availability, reliability or performance of your file system or archival system? B. Where in the environment should redundant hardware be bought/deployed?

2. Hardware and software maintenance. A. What is your philosophy or practice for executing system maintenance down times? B. How do those practices contribute to improved system availability and reliability outside of planned outages?

3. Data integrity. A. What strategies do you use to ensure data integrity and what parts of the end-to-end compute/store/visualize/archive cycles does it cover? B. Do you employ end-to-end checksums within the end-to-end cycle and if so where?

4. Off-hours support and availability. A. What mechanisms do you have in place to ensure reliable file system and archive operation during off-hours and, in the event of a facility event, such as power loss, chilled water loss, fire alarm, etc.? B. What mechanisms are in place to quiesce storage and to protect it?

The Usability of Storage Systems

For each layer of the storage hierarchy — file systems, archives, others — address these topics (and add to them):

1. What are your major usability issues?

2. What applications/tools have you developed or obtained elsewhere and deployed that have made your storage system more effective and useful for end users?
A. What tools or methods are available to users for I/O related problem diagnosis? Describe experiences where use of diagnostics resulted in improved outcomes. Are the available diagnostic capabilities sufficiently robust? Scalable?
B. Please help us categorize the applications/tools in use.
C. Which tools are used by end users, and which primarily by system administrators?

3. Discuss recent challenges in providing I/O service to your user community, and what practices/strategies were used to meet them.
A. What trends in user requirements resulted in the need to address the challenges?
B. How well are current solutions meeting the demands, or where are they falling short?
C. Where might experience at other sites be helpful to your challenges?
D. Which of your practices outlined above would you suggest for a best practices list?

4. Large data movement: With respect to internal file and storage systems, do sites dedicate specific resources to data movement internally? A. What (if any) direction is given to users for moving data around internally? B. What tools are available for helping users improve data transfer performance?

5. At what organizational level are users managing data organization, per user, per code, per project, or some larger unit? Are there common approaches or best practices that have been identified which are being leveraged to aid these efforts?

6. How does your site manage health monitoring of I/O services, and how is pertinent information transmitted to users? What feedback do users have on the content and timeliness of the information?

7. What are your user training and documentation practices?

Workshop Findings

Workshop participants identified the best practices provided in this section of the report as applicable to file systems and archival storage systems operations at high performance computing facilities. The practices summarize the experiences of 29 different sites across many different funding and parent agencies, and represent decades of experience in operations.

In this section, each best practice is accompanied by details that help explain its meaning, and by implementation examples where practical. Most of the practices identified touch on the usability, reliability, administration, and business aspects of operating storage; that is to say, many of the practices were intertwined among the breakout sessions at the workshop. For example, reliability investment decisions aren’t just about compliance or policies for a particular storage system; they affect the ability to do science and, therefore, the usability of the storage system.

Best Practices

1. For critical data, multiple copies should be made on multiple species of hardware in geographically separated locations. There is no substitute for multiple copies on dissimilar hardware, software, and media formats. Systems that compromise on hardware, software, or media distinction are more vulnerable to data loss than systems that maintain hardware, software, and media diversity (e.g., a single firmware bug can corrupt data stored geographically apart but on identical hardware/software). Workshop attendees also recommended that sites have multiple file systems, to provide available storage during downtimes and to distribute particular workloads to suitable system configurations (e.g., scratch and home, backup and archive). Participants pointed out that backups and disaster recovery plans are classic methods, still relevant today, of providing the best data protection for critical data and the systems retaining critical data.
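To make the intent concrete, the following is a minimal sketch (not from the workshop report) of a copy-and-verify step that replicates a critical file to two independent storage targets and confirms each copy by checksum; the mount points and file names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical mount points for two independent storage systems
# (e.g., a local disk pool and a remotely mounted archive cache).
TARGETS = [Path("/archive_a/critical"), Path("/archive_b/critical")]

def sha256(path: Path, chunk: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def replicate(src: Path) -> None:
    """Copy src to every target and verify each copy against the source digest."""
    want = sha256(src)
    for target in TARGETS:
        dst = target / src.name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if sha256(dst) != want:
            raise IOError(f"checksum mismatch after copy to {dst}")

if __name__ == "__main__":
    replicate(Path("/scratch/run42/restart.h5"))  # hypothetical critical file
```

In practice the two targets would sit on dissimilar hardware and software stacks, which is the point of the practice: a single firmware or software fault should not be able to damage both copies.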

2. Build a multidisciplinary data center team. Robust facilities operations are a key component of high-performing HPC data systems. Sites agreed that, regardless of the specific method, teaming facility personnel with HPC personnel to design, plan, and maintain equipment in support of data system requirements increased the reliability of the data systems. Regularly scheduled meetings involving facility representatives and personnel from every aspect of HPC center operations are critical to maximizing communication and overall HPC center efficiency. Specific facility recommendations for data systems arose at the workshop: having multiple power feeds, giving priority to the use of uninterruptible power supplies (UPS), ensuring that power supplies meet high standards (e.g., CBEMA and SEMI F47), and automating facility operations (e.g., emergency power-off, startup, and shutdown).

3. Having dedicated maintenance personnel, vendor or internal staff, is important to increasing system availability. By dedicating personnel to specialize on storage hardware and software, the facility will have increased involvement in the issues, needs, and solutions around operating its storage systems. All workshop participants had dedicated storage staff or vendors for their storage systems. By having dedicated personnel, they are able to have the detailed knowledge and experience on hand to prevent many system failures and respond quickly to return the system to operation. Further discussion also pointed out that having development knowledge (e.g., source code, or developers working on storage software) enhanced the reliability and availability in HPC storage systems. Workshop participants pointed to the success of both Lustre and HPSS at the labs involved in their development as examples of this. Non-development sites often rely on development sites for achieving high availability storage systems.

4. Establish and maintain hot-spare, testbed, and pre-production environments. Data systems in HPC environments are held to high standards of integrity, since data are important to the validity of scientific findings. Workshop participants discussed that increased reliability and availability of file systems and archives can be achieved by using hot-spare/burn-in and pre-production systems where software and hardware are tested prior to being placed into production. Attendees identified that the benefits are twofold: to validate procedures (executing both the deployment steps and the failback steps) and to validate the readiness of hardware or software in the specific site’s environment. Testbed data systems are key to gaining knowledge and expertise of new technologies, and to evaluating future storage hardware and software in isolation from the production environment. Participants identified the importance of not using the pre-production environment for evaluation of hardware or software that is not intended to move into production rapidly, so as not to undermine the stability of the pre-production environment and its ability to closely represent the production environment.

5. Establish systematic and ongoing procedures to measure and monitor system behavior and performance. The complexity of HPC file systems and archival resources presents a challenge to design, deploy, and maintain storage systems that meet specific performance targets. It is important to understand the anticipated workload and use that understanding to design performance benchmarks that reflect that workload. Performance benchmarks and workload simulators should be used systematically from the time new hardware is first under evaluation, through the deployment of a new resource, and regularly throughout the life of the system. Benchmark tests establish a baseline for expectations of the system, and regular testing will reveal most performance degradations due to software and hardware issues, resource contention, or a change in the workload. Share benchmark results with other sites and compare performance with other comparable systems. In addition to performance benchmarks, testing should also include data integrity checks. Monitor usage statistics over time in order to anticipate the growth of demand for resources.
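As an illustration of this practice, the sketch below shows one way a site might wrap a recurring benchmark run and compare each result against the historical baseline; the `site-io-bench` command is a hypothetical placeholder for the site's own IOR-style harness and result parser.

```python
import csv
import datetime
import statistics
import subprocess
from pathlib import Path

HISTORY = Path("/var/log/storage/bench_history.csv")  # hypothetical location
ALERT_FRACTION = 0.8  # flag runs below 80% of the historical median

def run_benchmark() -> float:
    """Run the site's I/O benchmark and return aggregate bandwidth in MiB/s.

    'site-io-bench' is a hypothetical wrapper that prints one number; replace
    it with the site's own IOR (or equivalent) invocation and parser.
    """
    out = subprocess.run(["site-io-bench"], capture_output=True, text=True,
                         check=True).stdout
    return float(out.strip())

def record_and_check(bandwidth: float) -> None:
    """Append today's result and warn if it falls well below the baseline."""
    history = []
    if HISTORY.exists():
        with HISTORY.open() as f:
            history = [float(row[1]) for row in csv.reader(f)]
    HISTORY.parent.mkdir(parents=True, exist_ok=True)
    with HISTORY.open("a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), bandwidth])
    if history and bandwidth < ALERT_FRACTION * statistics.median(history):
        print(f"WARNING: {bandwidth:.0f} MiB/s is below the historical "
              f"median of {statistics.median(history):.0f} MiB/s")

if __name__ == "__main__":
    record_and_check(run_benchmark())
```

Run from cron or a scheduler, this kind of wrapper establishes the baseline during acceptance testing and then surfaces gradual degradations due to hardware faults, contention, or workload drift.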

6. Establish authoritative sources of information. In order to answer questions accurately and consistently, there should be a shared and widely known location for information. User documentation, training, and how-to documents should be available and kept up to date. There needs to be a definitive source code control repository for open source infrastructure and benchmark software. Use configuration and change management tools to assist system staff with the administration of the storage resource. Support staff need a central location for procedures and scripts for service disruptions and problem resolution. Make benchmark results, system metrics, and reports on milestones available.

7. Manage users’ consumption of space. This was a hot topic in the workshop, where attendees discussed different aspects of managing usage of the storage system. A variety of solutions came forth, involving both active (allocations and quotas) and passive (accounting and auditing) techniques. Most participants used allocations or quotas to manage consumption of storage resources with good success. Participants unanimously agreed that having transparency of capacity to users was important to success in managing a limited resource in high demand. Those that use allocations and quotas noted that they are important to managing data effectively and keeping the system usable. A popular proposal put forth is to consider allocation renewals as a way to revalidate the need for data retention. Attendees agreed that automation of storage policies is desired (e.g., information lifecycle management, or ILM, capabilities).
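The sketch below illustrates the passive (accounting and auditing) side of this practice with a simple per-user usage report compared against allocations; a production site would instead query the file system's native quota interface (for example `lfs quota` on Lustre or repquota), and the allocation table and project path here are hypothetical.

```python
import os
import pwd
from collections import defaultdict
from pathlib import Path

# Hypothetical per-user allocations in bytes; real sites would pull these
# from an allocation database or the file system's quota interface.
ALLOCATIONS = {"alice": 5 * 2**40, "bob": 2 * 2**40}

def usage_by_owner(root: Path) -> dict:
    """Walk a directory tree and total apparent file size per owning user."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
                totals[pwd.getpwuid(st.st_uid).pw_name] += st.st_size
            except (OSError, KeyError):
                continue  # file vanished or unknown uid; skip it
    return totals

if __name__ == "__main__":
    for user, used in sorted(usage_by_owner(Path("/project")).items()):
        limit = ALLOCATIONS.get(user)
        flag = "  OVER ALLOCATION" if limit and used > limit else ""
        print(f"{user:12s} {used / 2**40:8.2f} TiB{flag}")
```

Publishing a report like this regularly is one way to give users the capacity transparency the attendees called for, alongside enforced quotas.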

8. Storage systems should function independently from compute systems. Sites that design their file systems, even local scratch, to be independent of their compute system hardware and software improve the availability and reliability of the storage system. Without independence, issues with either the storage or the compute system usually affect the other. There is an added benefit to users in having the systems decoupled: the compute system’s data may still be available to users during compute system downtime. For instance, data analysis or visualization could proceed if the file system were available to platforms capable of such work. It was further pointed out that the file and archive systems should be able to stand alone as well. Integrated file and archive systems that are so tightly coupled as to prevent usage of either the file or the archive system when the other is unavailable are not recommended. Workshop participants identified that this practice is achievable with either dedicated file systems (e.g., local scratch file systems) or center-wide (global) file systems. Though workshop participants agreed on this practice, they noted that facilities should also provide mechanisms to pre-stage or move data between the distinct systems for greater usability.

9. Include middleware benchmarks and performance criteria in system procurement contracts. Users have an expectation of predictable performance. The importance of storage to HPC systems needs to be stressed from the beginning of the procurement process. A fraction of the overall budget, in the range of 10% to 20%, should be dedicated to file system and storage resources in keeping with the goal of achieving target performance. Performance targets should include middleware libraries, and DOE sites should fund performance enhancements to those libraries. The desired result is that users see “portable performance” between sites when using middleware libraries that have been transparently optimized for the local storage architecture.

10. Deploy and support tools to enable users to more easily achieve the full capabilities of file systems and archives. Users often express frustration with achieving the expected or reported performance of file systems and archives. Workshop participants shared a number of different tools that aid a user in exploiting the performance of the file system or archive. Specifically, the workshop identified the following tools as particularly useful:
• Parallel Storage Interface (PSI), HTAR, and spdcp for parallelizing metadata and/or data operations
• GLEAN, ADIOS, and the Parallel Log-structured File System (PLFS) for restructuring I/O to better match the underlying file system configuration
• Darshan, Integrated Performance Monitoring (IPM), and the Lustre Monitoring Tool (LMT) for providing monitoring and diagnostic information for troubleshooting poor I/O performance
• Hopper, EMSL, NEWT, and GlobusOnline for providing web interfaces to improve access and operations on data.
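As one small example of the second category (restructuring I/O to match the underlying file system configuration), the sketch below sets a wider Lustre stripe count on a directory before a large parallel write; it assumes the standard `lfs setstripe -c` command is available, and the path and stripe count are illustrative, not site recommendations.

```python
import subprocess
import sys

def set_stripe(directory: str, stripe_count: int) -> None:
    """Set the Lustre stripe count on a directory so new files inherit it.

    Assumes the standard `lfs setstripe -c` command; other file systems
    (e.g., GPFS) use different mechanisms entirely.
    """
    subprocess.run(["lfs", "setstripe", "-c", str(stripe_count), directory],
                   check=True)

if __name__ == "__main__":
    # Stripe a checkpoint directory widely before a large shared-file write.
    target = sys.argv[1] if len(sys.argv) > 1 else "/scratch/run42/ckpt"
    set_stripe(target, 16)
```

Wrapping small steps like this into a documented helper is the kind of tool support the workshop had in mind: the user gets closer to the file system's capability without needing to know its internals.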

11. Training, onsite or online, and meeting with the users directly improves the utilization and performance of the storage system. Training and educating users on storage system operation (e.g., its capabilities and limitations, architecture, and configuration) and recommended usage (e.g., special options or flags to consider or to avoid) benefits the utilization, administration, and user satisfaction with the storage system. Most workshop participants had web pages and various other forms of usage and system documentation; however, some sites had online videos providing details about using their storage systems, and others conducted onsite training or worked one-on-one with their users. All agreed that the more engaged a site is in providing user training on storage systems, the more satisfied and positive the users of storage tended to be. Several of the breakout sessions proposed that transparency of storage system operations and capabilities is key to managing user expectations and important to operating a successful storage system. Training and documentation are the most successful ways of providing transparent storage system operations.

12. Actively manage relationships with the storage industry and storage community. Attendees agreed that sites should avoid objective metrics and penalties in acquiring and maintaining storage systems, instead working towards partnerships between customer and industry. Some examples of collaborations in storage are HPSS and Lustre (e.g. OpenSFS). In situations without strong collaboration, it was noted that user groups and working groups are an effective way for the storage community and stakeholders to work towards storage system improvement to mutual benefit between the participants. Participants also recommended encouraging diversity in HPC storage solutions to prevent risk from product extinctions. By developing a relationship with the storage industry, sites better understand upcoming technologies and products to improve file or archive system operation. Many sites advocated supporting open source technologies to mitigate risk. All participants recommended stronger communication and collaboration amongst fellow HPC storage administrators as important to improving file and archive system operations.

13. Monitor and exploit emerging storage technologies. To the extent possible, HPC systems benefit when they can exploit commodity products and services. As new technologies emerge, DOE centers need to examine them and determine if and how those technologies fit in the HPC environment. Cloud services, for example, may provide cost benefits in high volume data processing, but may also require scientists to rethink their application architectures. An early and careful evaluation of emerging technologies can mitigate the risk of making a choice with a poor long-term outcome. As the cost and performance of flash and spinning media evolve, the underlying architecture may need to change; for example, the storage hierarchy may replace tape with disk.

14. Continue to re-evaluate the real characteristics of your I/O workload. While data-intensive computing is creating I/O workloads that grow faster than Moore’s Law, there is a growing gap between computation and storage capabilities. Monitoring tools like Darshan and the Lustre Monitoring Tool (LMT) can help capture changes in user requirements and in the I/O workload on currently deployed systems. As the science being conducted evolves, it will lead to changes in future requirements, and those changes must inform the procurement and planning of systems. The underlying science is driving an evolving set of storage requirements for both bandwidth and volume. It is becoming important to track storage metrics the way the HPC community tracks flops and cycles.

15. Use quantitative data to demonstrate the importance of storage to achieving scientific results. Workshop attendees felt there was an element of advocacy in establishing clear and detailed storage system requirements. An “I/O blueprint” can quantify both the costs and the benefits of the required storage. Since an under-resourced storage hierarchy will reduce the amount of science that can be accomplished, that negative impact represents an opportunity cost. Comparing that opportunity cost with the storage system expense allows the HPC system to be designed with storage that balances it.

16. Retain original logs generated by systems and software. Workshop participants noted that pristine logs are critical to security forensics, problem diagnosis, and event correlation.
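A minimal sketch of this practice is shown below: logs are copied verbatim (compressed) to an archive area and a digest manifest is recorded, so the originals are never altered or filtered; the source and destination paths are hypothetical.

```python
import gzip
import hashlib
import shutil
from datetime import date
from pathlib import Path

LOG_DIR = Path("/var/log/hpss")   # hypothetical source of system logs
ARCHIVE = Path("/archive/logs")   # hypothetical write-once destination

def archive_logs() -> None:
    """Copy logs verbatim (compressed) and record digests of the originals.

    Because the originals are never modified, forensics and event
    correlation can always go back to the pristine record.
    """
    day_dir = ARCHIVE / date.today().isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)
    with (day_dir / "MANIFEST.sha256").open("a") as manifest:
        for log in LOG_DIR.glob("*.log"):
            digest = hashlib.sha256(log.read_bytes()).hexdigest()
            with log.open("rb") as src, gzip.open(day_dir / (log.name + ".gz"), "wb") as dst:
                shutil.copyfileobj(src, dst)
            manifest.write(f"{digest}  {log.name}\n")

if __name__ == "__main__":
    archive_logs()
```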

While identifying the current best practices for operation of HPC storage systems, we spent time also discussing changes in user needs based on scaling our systems to the exascale level, which implies exabyte file systems and archives. The current gaps and some future recommendations to prepare file and archive systems for the exascale era of storage are provided in the next section.

Gaps and Recommendations

The information that follows is divided into two sections. First are gaps that exist in current operations of HPC storage systems for which solutions do not exist, but which the community or industry is progressing towards. Second are gaps that are not being addressed operationally and that demand revolutionary approaches to solve.

Gaps likely addressed by evolutionary solutions:

1. Incomplete attention to end-to-end data integrity. While no HPC solution exists today providing end-to-end (client to system and back to client) data integrity, the community contributes to and supports the T10 Protection Information (T10 PI) standard, which is currently being adopted for components used in the HPC environment.
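Until block-level protection such as T10 PI is available end to end, some sites approximate the idea at the application level. The sketch below is an illustration rather than any site's production code: it writes a sidecar checksum at write time and verifies it on read; the file name is hypothetical.

```python
import hashlib
import json
from pathlib import Path

def write_with_digest(path: Path, payload: bytes) -> None:
    """Write data plus a sidecar digest so later reads can be verified."""
    path.write_bytes(payload)
    digest = hashlib.sha256(payload).hexdigest()
    path.with_suffix(path.suffix + ".sha256").write_text(
        json.dumps({"algorithm": "sha256", "digest": digest}))

def read_and_verify(path: Path) -> bytes:
    """Read data back and fail loudly if it no longer matches its digest."""
    payload = path.read_bytes()
    recorded = json.loads(path.with_suffix(path.suffix + ".sha256").read_text())
    if hashlib.sha256(payload).hexdigest() != recorded["digest"]:
        raise IOError(f"end-to-end integrity check failed for {path}")
    return payload

if __name__ == "__main__":
    f = Path("field_0001.dat")  # hypothetical output file
    write_with_digest(f, b"simulation output bytes")
    assert read_and_verify(f) == b"simulation output bytes"
```

This only covers the path from the application to the file system and back; it cannot detect corruption the way in-band protection information can, which is why the gap remains.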

2. Storage system software is not resilient enough. An otherwise normal hardware error can cause a storage system outage due to current software limitations. Today this impacts the choice of hardware and the amount of facilities infrastructure required to sustain reliable storage systems.

3. Redundant Array of Independent Tape (RAIT) does not exist today. The only data protection that exists today for tape is multiple copies. However, the HPSS collaboration is working on an implementation of RAIT for use within HPSS.

4. Provide monitoring and diagnostic tools that allow users to understand file system configuration and performance characteristics and troubleshoot poor I/O performance. Today there exist several tools for monitoring file system performance: Darshan, IPM, and LMT. However, these do not have broad deployment across HPC sites and need further support for code maintenance and feature improvement to be more widely accepted.

5. Communication between storage systems users and storage system experts needs improvement. There are few storage experts at HPC sites, and it is rare to identify individuals on user projects who are responsible for focusing on storage or I/O. Further, there are rarely consultants at HPC centers who focus on or have the skill set to manage storage or I/O issues. As storage and I/O increasingly become focal issues in HPC projects, this will improve. Sites are working on identifying efficient and effective user education mechanisms (e.g., online videos, onsite workshops).

6. Need tools and mechanisms to capture provenance and provide lifecycle management for data.

Workshop participants identified a growing focus on provenance and lifecycle management as new policies that mandate them are being developed under the America COMPETES Act, which invests in science and technology to improve the United States’ competitiveness.

7. Ensure storage systems are prepared to support data-intensive computing and new workflows. Storage systems are designed for computational system needs today. However, new instruments such as genomic sequencers and next-generation light sources have tremendous bandwidth and capacity requirements that on their own will strain existing systems. In addition, the push towards data-intensive computing will result in different system architectures than exist at most HPC facilities today.

8. Measure and track trends in storage system usage to the same degree we measure and track flops and cycles on computational systems. This will improve understanding of the capabilities and limitations of the system in use, and will help with improving quality of service for users in a shared storage environment.
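A minimal sketch of such tracking is shown below: a cron-driven sampler appends per-file-system capacity figures to a long-lived time series that can later be trended alongside flops and cycles; the mount points and output path are hypothetical.

```python
import csv
import shutil
import time
from pathlib import Path

FILE_SYSTEMS = ["/scratch", "/project", "/archive-cache"]  # hypothetical mounts
SERIES = Path("/var/log/storage/capacity_series.csv")      # hypothetical output

def sample() -> None:
    """Append one capacity sample per file system to a long-lived time series.

    Run from cron (for example hourly); the accumulated series supports the
    same kind of trend reporting routinely done for compute utilization.
    """
    SERIES.parent.mkdir(parents=True, exist_ok=True)
    now = int(time.time())
    with SERIES.open("a", newline="") as f:
        writer = csv.writer(f)
        for fs in FILE_SYSTEMS:
            total, used, _free = shutil.disk_usage(fs)
            writer.writerow([now, fs, used, total])

if __name__ == "__main__":
    sample()
```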

9. Tools to provide scalable performance in interacting with files in the storage system. This gap concerns the tools users need to interact with files (cp, grep, etc.). Existing tools help, but there is room for improvement beyond tools with existing systems.

Gaps requiring revolutionary approaches:

1. Need quality of service for users in a shared storage environment. Nearly all sites represented at the workshop have shared or centralized storage environments. Attendees recognized the need to provide a higher level of performance to users that need it. In current shared environments, all users experience the aggregate performance of the shared storage system. Unlike compute resource managers, storage managers have no method of scheduling workloads or even understanding how to prevent competing I/O workloads from affecting each other.

2. Need standard metadata for users to specify data importance and retention. One common example is that it is impossible today to specify the lifetime of a file as an attribute that stays with the file through whatever number of storage systems the file moves through. This is required for management of the data.

3. The diagnostic information currently available in today’s storage systems is woefully inadequate. With any storage system in HPC use today, it is still difficult and time consuming to figure out what user is affected by or causing a particular problem. This is primarily because HPC systems in use today are all distributed and have large and complex architecture. For example, the scale of systems today makes seemingly simple operations, such as finding a user (bad actor) who is willfully or unknowingly impacting center performance, impossible in real time.

4. Metadata performance in storage systems already limits usability and won’t meet the needs of exascale. There are two main problems with increasing metadata performance in storage systems today: software design for distributing metadata operations, and hardware limitations on reliably accelerating metadata operations. Storage system software must be designed for parallel and distributed metadata operations to achieve the best performance. Metadata is normally a relatively small amount of data, even in very large storage systems today. Finding a reliable but increasingly fast small storage device that is affordable is challenging. In addition, improving the usefulness of metadata (e.g., indexes, extended attributes) to users of storage systems means having more small, fast, reliable storage devices.

5. POSIX isn’t expressive enough for the breadth of application I/O patterns. POSIX compliance limits the ability to achieve maximum performance in storage systems. To achieve improved performance, most storage systems relax POSIX requirements (e.g. lazy updates to file timestamps). A new approach is needed to enable a broader range of application I/O without burdening the application with the complexity of keeping their data consistent.

6. There exists a storage technology gap for exascale. Storage technology is still improving at a slower rate than compute or networking technology. New forms of storage, namely solid state, will help but will not solve the problem. Probe memory is one of the most promising technologies in terms of performance characteristics that could significantly boost storage system capability, but it isn’t expected until at least 2015, and it is only being worked on by a select group of storage vendors.


Appendix A: Position Papers

The papers included in this section were the position papers requested of and submitted by workshop attendees. They represent a collection of ideas, architectures, and implementations surrounding the practice of running production storage systems. The papers facilitated discussion and identification of Best Practices for operation of file and archival storage systems.

U.S. Department of Energy Best Practices Workshop on File Systems & Archives
San Francisco, CA, September 26-27, 2011
Position Paper

Philippe Deniel
CEA/DAM
[email protected]

ABSTRACT / SUMMARY

CEA/DAM manages two compute centers: TERA100 (the first SC in Europe), which is dedicated to classified applications, and TGCC, which is an open compute center for institutional collaboration (see http://www-hpc.cea.fr/en/ for details). The management of the data produced has led CEA's teams to deal with several specific issues, and to develop their own solutions and tools. This paper focuses on files' lifetime and metadata management.

INTRODUCTION

CEA/DAM has been involved in HPC for many years. Because compute capability has increased widely, the amount of data produced has drastically increased as well, making it necessary to have dedicated systems and dedicated teams to handle the architecture in charge of storing the data. This situation leads to several challenges: keeping data available to end users is of course one of them, but not the only one. With a huge amount of data comes a huge amount of metadata records, and consideration has to be given to managing them. Last, the data kept are not all of the same value. While some files are critical, others are not, but managing this aspect may be painful for a user who has thousands of files to deal with and sort. Tools then have to be made available to users to help them deal with the information life cycle.

Quotas and retentions

Ian Fleming said "diamonds are forever," but files are not. The main issue here comes from the users. They produce lots of data (a daily production of 30 to 100 TB is a very common situation at CEA/DAM), but they often don't care about what the data become. This leads to a perpetually growing storage system where less than 1% of the content is accessed. A large share of the files will never be read and are even totally useless once the run of the code is over (checkpoint/restart files, for example). But the truth is this: if not forced, a user will never delete his files. There are two main reasons for this:
 lack of time
 fear of accidentally deleting useful data

I suggest two solutions to handle this. The first is an old-fashioned Unix paradigm: quotas. The second is more sophisticated and is based on extended attributes to implement file retentions.

Quotas usually work on a "space used" and a "used inodes" basis. File size is not that critical (modern file systems and storage systems are huge today), but the inode counts are more interesting because they reflect the number of metadata records owned by a user. This matters in today's situation, where the metadata footprint becomes the filesystem's limitation. Quotas are simple to set, manage, and query (the quotactl function in libc, the RQUOTAv2 protocol to be used jointly with NFS), but they have their limitations. One of them is the distributed nature of the filesystems used in the HPC world: in a massively distributed product where data are spread across multiple data servers in a parallel pattern, it becomes hard to efficiently keep users' quota information in a centralized place. Nevertheless, I suggest that quotas be used when available, because they are a simple way of setting limits for users and making them aware of the amount of files and data that they own.

File retention is another promising way. The idea is to associate a specific metadata record with every file and directory. This is done by using extended attributes (aka xattrs), which assumes that the underlying storage system's namespace supports such a feature. This xattr contains information on the object's lifetime. This can be something like "this file will stop being of interest after a given date" or "this file can be considered useless if not read/written during a defined period." The key is to have this metadata for every file (with user input). Specific tools will then audit the file system and produce a list of files to be deleted based on "retention policies." The user is warned (by mail, for example) when some of their files are candidates for deletion. Finally, the files are purged. This approach can lead to a virtuous circle: when producing data, users will get into the habit of setting the parameters that tell how long they will need the files, giving the administrator input on their files' lifetimes. This is good for the sysadmin, who will save space on his storage system, and this is good for the user, who can schedule the deletion of his files, avoiding the painful task of cleaning his directories when the quota limit is reached.
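A minimal sketch of the retention mechanism described above is shown below, assuming a Linux file system with extended-attribute support; the attribute name `user.retention.expires`, the paths, and the 30-day period are illustrative choices, not CEA's actual implementation.

```python
import os
import time
from pathlib import Path

# Hypothetical attribute name; any site-agreed key in the "user." namespace
# works, provided the underlying file system supports extended attributes.
RETENTION_XATTR = "user.retention.expires"

def set_expiry(path: Path, days: int) -> None:
    """Tag a file with an absolute expiry time (epoch seconds) via xattr."""
    expires = str(int(time.time()) + days * 86400).encode()
    os.setxattr(path, RETENTION_XATTR, expires)

def expired_files(root: Path):
    """Yield files whose retention tag has passed; untagged files are skipped."""
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            try:
                if float(os.getxattr(p, RETENTION_XATTR).decode()) < now:
                    yield p
            except OSError:
                continue  # no retention tag, or xattrs unsupported here

if __name__ == "__main__":
    set_expiry(Path("/work/user1/checkpoint_0007.h5"), days=30)  # hypothetical file
    for candidate in expired_files(Path("/work/user1")):
        print("candidate for deletion:", candidate)  # warn the owner before purging
```

An audit tool of this kind would only report candidates; the warning and actual purge steps belong to the site's retention policy.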

Metadata management

Past challenges for filesystems were about size: would the available resources be large enough to store everything I want to put in the system? Then came performance considerations, and the idea that users hate to wait to access their data. Right now, these aspects are addressed by modern filesystems (for example Lustre, which is widely used at CEA) that are based on a distributed design relying on multiple data servers. But many files mean many metadata records, and this can quickly become problematic, especially in an HPC environment. People who have once seen a single directory with hundreds of thousands of files in it know what I am speaking about. Beyond the technical considerations (big directories are an "edge" situation), the manageability of such exotic objects is a real problem: a single "ls -l" in such a directory may last for hours.

Frequent filesystem audits (like those from CEA's RobinHood product, http://robinhood.sf.net) help here: it becomes easy to identify "nasty" patterns in user directories and take corrective actions. For example, the admin could decide to pack a big directory into a single tar file. Providing users with tools that enforce best practices is also a way we follow: copying data to the storage system is delegated to a tool that can decide to pack the data automatically. Metadata volume is definitely an aspect to be seriously considered. Data volume issues have been solved by striping the data; it may not be so easy to stripe metadata, because metadata carry internal dependencies (a file belongs to a directory and can exist under several names if hard links are available) which may limit the algorithms. I actually believe that the main challenge for the filesystems of exascale compute centers will be metadata management. Starting to consider this issue today, by setting limits for users to prevent them from creating "file system monsters" and by teaching them good practices, is definitely something to be done.
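The directory-packing corrective action mentioned above could look roughly like the following sketch; the entry-count threshold and paths are illustrative, and in practice a RobinHood policy run would select the candidate directories.

```python
import tarfile
from pathlib import Path

def pack_directory(directory: Path, max_entries: int = 100_000) -> None:
    """Pack an oversized directory into a single tar file to cut metadata load.

    One archive object replaces hundreds of thousands of inodes. The
    threshold is illustrative; an audit tool would normally pick candidates.
    """
    entries = list(directory.iterdir())
    if len(entries) < max_entries:
        return  # directory is not a metadata problem yet
    archive = directory.with_suffix(".tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(directory, arcname=directory.name)
    print(f"packed {len(entries)} entries from {directory} into {archive}")
    # Removing the original tree is left to the administrator after verification.

if __name__ == "__main__":
    pack_directory(Path("/work/user1/particle_dumps"))  # hypothetical directory
```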

CONCLUSIONS

Exascale systems are coming tomorrow. Beyond the revolution in compute power, there is an incredible technical gap for the storage system. Data management will not be the greatest challenge; metadata management will. The systems we will have at that time will store data that are produced today or have been generated in past years. If we are not careful today, we will face an excruciating situation in the future. And for sure, tomorrow's issues can be smoothed today by setting metadata usage limits (quotas, retentions) and by providing users with tools to reduce metadata production.

I/O Performance Measuring – White Box Test and Black Box Test
U.S. Department of Energy Best Practices Workshop on File Systems & Archives
San Francisco, CA, September 26-27, 2011
Position Paper

Naoyuki Fujita (JAXA: Japan Aerospace Exploration Agency, [email protected])
Kazuhiro Someya (JAXA: Japan Aerospace Exploration Agency, [email protected])

ABSTRACT

The complexity of the components of a file system/storage system (hereafter called "the system") is one of the reasons that I/O performance measurement does not generalize. Consider the situations in which I/O performance is discussed; two main cases come to mind. One is grasping the characteristics of a system, and the other is comparison between systems. In this paper we insist on using the white box test and the black box test appropriately. A white box test is necessary to understand the detailed characteristics of the system, in order to guide appropriate I/O operation for system users and for I/O tuning. On the other hand, when one system is compared with others, you should start from a black box comparison for a constructive discussion, because each system has its own design and architecture.

INTRODUCTION

CPU benchmarking is widely discussed and some major benchmark suites exist [1, 2]. However, I/O benchmarking is less general than CPU benchmarking. Therefore, generally speaking, I/O performance measurement and discussion are difficult. In this paper, we insist on using the white box test and the black box test appropriately. A white box test is a test performed with an understanding of the design and architecture of the system; a black box test is a test performed without requiring that understanding. In this case, "design and architecture" means the components and connection relationships of the system, such as the file systems and the storage devices.

WHITE BOX TEST EXAMPLE

This chapter shows examples of the white box test strategy and results. One result is from a system that was operating in 2000 (hereafter, the 2000 System); another is from a system that has been operating since 2010 (hereafter, the 2010 System).

White box test strategy

As the number of nodes in an HPC system increases, the system design and architecture become more complex and change their characteristics. We propose a layered benchmark as the white box test, and show some results. "Layered" means device level (Measuring Point 1), local file system level (Measuring Point 2), network file system level (Measuring Point 3), and FORTRAN level (Measuring Point 4). Fig. 1 shows the measuring points on a recent HPC system.

[Fig. 1 HPC System I/O measuring points]

White box test results

Fig. 2 shows the 2000 and 2010 system configuration charts. Each system has compute nodes, an interconnect (an X-bar switch and an IB switch, respectively), and file server(s), so there is no change in the basic composition. As a partial change, the 2010 System has clustered file servers, and its storage devices are attached via FC-SAN switches. Fig. 3 shows the results of the layered benchmark on these two systems. There are some bottleneck points in a system; to analyze a bottleneck, we focus on the file system cache, the interconnect bandwidth, and the DAS/SAN bandwidth and its connection relationship design. The device level benchmark showed similar results for the two designs except for disk write performance, but the characteristics at the network file system level were very different. In this case, the difference comes from the file system cache effect.

[Fig. 2 File system and storage system configuration: (a) 2000 System, (b) 2010 System]

[Fig. 3 Layered benchmark results: (a) 2000 System, (b) 2010 System]

BLACK BOX TEST EXAMPLE

This chapter shows examples of the black box test strategy and results.

Black box test strategy

As we said, each system has its own design and architecture, so the comparison of simple I/O performance figures is not significant. But when we discuss I/O performance, especially when comparing several systems, we should first take a general view of rough performance. A common tool appropriate for measuring the performance of large-scale storage does not exist; instead, a measurement tool is built individually for each system and the performance is evaluated individually. In addition, as operations peculiar to each file system are needed, it is difficult to compare the result with a performance measurement on another file system. We therefore model the measurement tool and the measurement items, and propose a method of simply diagnosing the characteristics of a large-scale storage system based on measurements that use the tool [3].

Objective

The aim is that the following two measurements can be made generally and in a short time: (1) checkup of an installed system, and (2) routine examination under operation.

(1) Checkup of installed system: examine whether the performance targeted when the system administrator installed the system has been reached.
(2) Routine physical examination under operation: grasp performance from the user's point of view; the operational performance is measured.

Diagnostic model

(1) Checkup of installed system
- Maximum I/O bandwidth performance: read performance immediately after a write; it is assumed that the data are in the client cache.
- Minimum I/O bandwidth performance: an fsync is assumed after each write, and the cache is assumed to forward everything to the real storage device.
- Metadata access performance: the presence of a cache is not considered (because cache management cannot be controlled in a black box test).

(2) Routine physical examination under operation
This diagnosis tool is run regularly while the system is really operating, enabling a grasp of its state. In this case, it is assumed to gather the maximum performance (cache-hit performance) from the user's point of view. Sufficient prior confirmation by the system administrator is necessary before running the measurement tool regularly, and customizing the measurement tool (downsizing the measurement, etc.) might be needed. The measurement model is shown in Fig. 4.

[Fig. 4 Measurement model]

Measurement tool and item

Using IOR:
(a) Large-scale data transfer (throughput performance: constant amount of data for each process, large I/O length)
(b) Constant volume of data (throughput performance: small file size, small I/O length)

Using mdtest:
(3) Metadata access (response performance)
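A rough sketch of how the three black box measurement items might be driven with IOR and mdtest is shown below; the flags and sizes are illustrative of the measurement model rather than the exact parameters used by the authors, and in practice the commands would be launched under MPI (mpirun or srun) across many client nodes.

```python
import subprocess

def run_black_box_suite(test_dir: str = "/mnt/shared_fs/iotest") -> None:
    """Run the two IOR items and the mdtest item described above.

    Flag choices are illustrative: -F gives file-per-process, -b/-t set the
    block and transfer sizes, and -e forces an fsync after writes (the
    "minimum bandwidth" case). mdtest -n sets the number of items per process.
    """
    commands = [
        # (a) Large-scale data transfer: constant data per process, large I/O length.
        ["ior", "-w", "-r", "-F", "-b", "4g", "-t", "4m", "-o", f"{test_dir}/large"],
        # (b) Constant volume of data: small files, small I/O length, fsync on writes.
        ["ior", "-w", "-r", "-F", "-e", "-b", "4m", "-t", "64k", "-o", f"{test_dir}/small"],
        # (3) Metadata access response: create/stat/remove many small items.
        ["mdtest", "-n", "1000", "-d", f"{test_dir}/md"],
    ]
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_black_box_suite()
```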

Black box test results

Fig. 5 shows an example result of the large-scale data transfer measurement on System A and System B.

[Fig. 5 Large-scale data transfer results: (a) System A, (b) System B]

CONCLUSIONS

"What should be measured?"

A benchmark of each layer should be run when we want to know the characteristics of a system. The layers are the device level, local file system level, network file system level, and FORTRAN level. This kind of measurement is done as a white box test.

Modeled measurement items and tools should be used when we want to compare several systems. The result is a starting point for discussion. This kind of measurement is called a black box test.

Both white box tests and black box tests should be used when we manage file systems and storage systems.

REFERENCES
1. Standard Performance Evaluation Corporation, http://www.spec.org/benchmarks.html
2. TOP500 Supercomputer Sites, http://www.top500.org/
3. Scientific System Society, "Large-scale storage system working group report," 2011 (in Japanese)

LA-UR-11-11388 Approved for public release; distribution is unlimited.

Title: U.S. Department of Energy Best Practices Workshop on File Systems & Archives Position Paper

Author(s): Torres, Aaron Scott, Cody

Intended for: U.S. Department of Energy Best Practices Workshop on File Systems & Archives, 2011-09-26/2011-09-27 (San Francisco, California, United States)

Disclaimer: Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Aaron Torres, Los Alamos National Laboratory, HPC-3, [email protected]
Cody Scott, Los Alamos National Laboratory, HPC-3, [email protected]

ABSTRACT / SUMMARY
As HPC archival storage needs continue to grow, we have started to look at strategies to incorporate cheaper, denser, and faster disk as a larger part of the archival storage hierarchy. The archive in Los Alamos National Laboratory's Turquoise open collaboration network has always used a generous amount of both fast and slow disk in addition to tape. Lessons learned during Road Runner Open Science pointed to the need for large amounts of cheaper, slower disk for storage of small to medium sized files, and faster disk in order to store and quickly retrieve file metadata. New advances in the last year may signal another transition that altogether eliminates the need for migrating small and medium files to tape. Improvements in disk speed, particularly solid state devices (SSDs), also allow us to operate on billions of files in a reasonable span of time even as archives continue to grow.

INTRODUCTION
The Open Science simulations run on the Road Runner supercomputer at Los Alamos National Laboratory (LANL) in 2009 provided the opportunity to test an archive based on commercial off the shelf (COTS) components. For this archive, we chose the General Parallel File System (GPFS) and Tivoli Storage Manager (TSM) due to robust metadata features, fast data movement, a flexible storage pool hierarchy and migration, and support for a multitude of disk and tape options [1].
This archive joins a long history of archival storage at LANL, including the Central File System (CFS) and the High Performance Storage System (HPSS). Thanks to administrative diligence, we have or can recreate records about usage patterns of these archives. One similarity we keep seeing in large archives, with the COTS archive being no exception, is that we primarily store numerous small to medium sized files rather than large to huge files. As of May 2011, HPSS at LANL houses nearly 163 million files with a total size of 19.6 PB and an average file size of 131.5 MB [2]. NERSC publishes similar statistics, with an archive housing over 118 million files and 12 PB, for an average size of 109 MB [3].

ROAD RUNNER LESSONS LEARNED
When designing for archival storage, one often considers the extreme case for file size. In HPC this generally means designing for enormous files on the order of terabytes for current supercomputer sizes. In practice, however, we see a tremendous number of small to medium files, especially with users performing n-to-n writes or using the Parallel Log-structured File System (PLFS) to effectively convert n-to-1 writes to n-to-n [4]. In the case of Roadrunner, 20 million 8-16 MB files were archived in one weekend [5].


For the COTS archive, this proved to be the largest pain point, since the Hierarchical Storage Manager (HSM) feature of TSM does not currently support aggregating smaller files together when moving them to tape, resulting in poor performance. Users could aggregate their own files using the "tar" command, but they cannot be relied on to do this in all cases. Another option would be to put file aggregation into an archive copy tool, as the LANL-developed Parallel Storage Interface (PSI) does with the Gleicher-developed HTAR [6]. However, doing so breaks POSIX compliance, because no other standard file system tool can read or write files aggregated in this way. One of the design goals of the COTS archive was to leverage as much standard software as possible.
For the COTS archive, moving small files to tape without a transparent file aggregation technique did not make sense. So, small files are kept on RAID 6 disk arrays and backed up to tape. RAID provides recovery from minor amounts of single disk failure, and the tape backup provides disaster recovery. Moreover, TSM's backup function does support aggregating small files before sending them to tape.
The COTS archive has 122 TB of fast fiber channel disk to act as a landing area for new files, and 273 TB of SATA disk for files under 8 MB to be moved to. Finally, it has 3 PB of tape for files over 8 MB and for the backups of the SATA disk pools. Currently, the archive houses over 107 million files with a total size of 2.1 PB and an average size of 21.22 MB, according to our latest statistics as of August 2011. As shown in Figure 1, 97 million files are less than or equal to 8 MB. This indicates that the general case for our archive is large amounts of smaller files.

RECENT ADVANCEMENTS
The recent explosion of "cloud" backup providers like Mozy, Backblaze, and others leads to questions about how we store large amounts of data and whether we are doing it in the most cost effective way. For a cloud-based backup service, density and uptime are the two primary driving forces, because users continue to back up ever larger amounts of data as they put more of their life on the computer in terms of photos, videos, etc., and data may be backed up or restored at any time. These are also motivating factors for HPC archives. On September 1st, 2009, Backblaze posted an entry to their company blog describing their Backblaze Pod, capable of storing 67 TB of data in a 4U enclosure using 47 one terabyte drives for $7,867, or 11.4¢ per gigabyte [7]. On July 20th, 2011, they posted an updated entry indicating that they can now store 135 TB in 4U using 47 three terabyte drives for $7,384, or 5.3¢ per gigabyte [8]. Also, Backblaze notes they have deployed 16 PB of disk in the last 3 years [8]. In terms of raw storage, that is within striking distance of the size of LANL's largest HPSS archive at nearly 20 PB.
On the other end of the spectrum, eBay recently replaced 100 TB of SAS disk with SSD [9]. They did this to speed up virtualization and reduce the size of their disk farm. They saw a 50% reduction in standard storage rack space and a 78% drop in power consumption by moving to SSD. Although it is impossible for an HPC archive to take this approach, it is possible to replace portions of the total system for tremendous benefits. An example of using SSD in a storage hierarchy is IBM's recent effort at speeding up GPFS using SSD [10]. By storing GPFS metadata on SSD, IBM saw a 37 times speed improvement for metadata operations and was able to scan 10 billion files in 43 minutes.

[Figure 1. File Size Breakdown of COTS Archive]
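A file-size breakdown of the kind shown in Figure 1 can be produced with a straightforward metadata walk. The following is a minimal C sketch (not LANL's tooling), assuming a POSIX system with the nftw(3) interface; the starting path is an example argument.

/*
 * Sketch: walk a directory tree and count files in power-of-two size buckets,
 * producing a rough file-size breakdown similar in spirit to Figure 1.
 */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdint.h>

static uint64_t bucket[64];

static int tally(const char *path, const struct stat *sb,
                 int type, struct FTW *ftwbuf)
{
    (void)path; (void)ftwbuf;
    if (type == FTW_F) {
        int b = 0;
        long long sz = sb->st_size;
        while (sz > 1 && b < 63) { sz >>= 1; b++; }   /* b ~ floor(log2(size)) */
        bucket[b]++;
    }
    return 0;                       /* keep walking */
}

int main(int argc, char **argv)
{
    const char *root = (argc > 1) ? argv[1] : ".";
    if (nftw(root, tally, 64, FTW_PHYS) != 0) {
        perror("nftw");
        return 1;
    }
    for (int b = 0; b < 64; b++)
        if (bucket[b])
            printf("~2^%-2d bytes: %llu files\n", b,
                   (unsigned long long)bucket[b]);
    return 0;
}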
For comparison, it takes roughly 20 minutes to scan the 120 million files on the LANL COTS archive using eight 15,000 RPM fiber channel disks in four RAID 1 stripes.

CHEAP, DENSE DISK
The growth of density in hard disks shows little sign of slowing down. In the 2 years between Backblaze's blog entries about their Pod system, the cost of the Pod actually went down even though the raw capacity of the 4U box doubled. Hard disks in the 4 TB range are on the horizon for desktops and servers in the next year [11], and even laptops are moving to 1 TB disks [12]. The Hitachi 3 TB drives used by Backblaze can be purchased for $120-130 from a variety of retailers [13]. For comparison, an LTO5 tape that holds 1.5 TB of uncompressed data costs approximately $60 [14]. For the same capacity, the disk costs as much as the tape.
One interesting move by companies like Backblaze is that they use consumer level hard drives instead of "enterprise ready" drives. Such drives are substantially cheaper, with the enterprise version of the Hitachi 3 TB drives costing $320-350 per drive as of August 23, 2011 [15]. Backblaze also takes advantage of the manufacturer's 3 year warranty to get a replacement disk if one fails, rather than paying for an expensive maintenance contract. HPC archives might be able to leverage the same kind of disk drive by taking into account the disk failure protection afforded by RAID 6 and by having a tape backup of whatever is stored on such disks.
Unlike Backblaze and their Pod, we do not want to be in the custom hardware business. So, we looked for existing commercial hardware that could achieve the same density of disk. We found the SuperMicro SuperChassis 847E26-RJBOD1 [16]. It is a storage chassis that can support 45 disk drives in 4U. It does not have a built-in motherboard like the Backblaze Pod to manage the disk, but the COTS archive already has machines in its GPFS cluster that can easily take a RAID card with an external SAS connector to plug into this storage expansion chassis. Filling the chassis with 3 TB consumer level disk drives, and including the RAID card, costs approximately $12,000, or 8¢ per gigabyte.
The idea behind these enormous disk pools is not to completely replace tape, but to adjust the size of file that gets moved to tape. It is entirely feasible today to change the threshold used in the COTS archive from 8 MB to 1 GB with the current price of disk. With this change, we can move all files 1 GB and larger to tape. This file size is also much closer to the size of file necessary to get a tape drive up to peak streaming speed, based on internal testing done at LANL. In addition, backing up any file that will stay resident on disk greatly reduces the fear of failed tapes when storing enormous quantities of small files.
One argument against disk in an archive is that archives are usually "write once and read never," so it does not make sense to "waste" power and cooling on spinning disk for data that may never get read. For the COTS archive, the ability of GPFS to move data to different types of storage (i.e., fast disk, slow disk, and tape using TSM) based on arbitrary criteria like dates could be leveraged to move rarely or never read data to tape. The tape performance hit is acceptable because the data is essentially "cold". Similarly, a large disk pool can be used to stage data for reading if a user knows he or she will be pulling some set of data from the archive.

FAST DISK CACHE
As HPC archives ingest ever more data because of exascale supercomputers, the metadata will probably become more and more important. At some point in the not too distant future, users will want to search the archive on metadata instead of being forced to create complex directory hierarchies to find the files they are interested in. An example query could be "find all the checkpoint files that were copied to the archive within the last 3 days." Thus, it is also important to be able to quickly search an archive's metadata, whether it is in a file system like GPFS or a database like HPSS using DB2. Here is where faster disk systems like SSD can be used to great effect in HPC archives.

As mentioned previously, IBM's testing of storing GPFS metadata on SSD and being able to scan billions of files in less than an hour shows how such fast disk can be very useful. Another pilot program at LANL is testing metadata performance on SSD, using the GPFS COTS archive as a basis for the number of files and types of files stored. Having the ability to quickly scan the metadata of the entire archive provides many benefits, particularly for future research projects and for data management, including ongoing work to index and quickly search archive metadata.
In addition, as SSD storage becomes cheaper and denser, it may eventually be possible to replace our fast disk cache, currently consisting of fiber channel disk, with a large pool of SSD, similar to how eBay replaced their SAS disk environment. With our current data requirements this is still cost ineffective, but it is worth examining and testing now for the future.

CONCLUSIONS
There are many advantages to having a large, easy to manage pool of disk. When raw speed is not a requirement of this disk, there are solutions available to procure, maintain, and deploy a tremendous amount of disk cost effectively, at a cost that compares very favorably to the cost of tape. Taking advantage of faster disk like SSD for metadata and disk cache will also benefit future HPC archives. The LANL COTS archive is in a unique position to test and potentially deploy some of these newer solutions in place with limited negative effect on users.

REFERENCES
1. Hsing-bung Chen, Grider, Gary, Scott, Cody, Turley, Milton, Torres, Aaron, Sanchez, Kathy, Bremer, John. Integration Experiences and Performance Studies of A COTS Parallel Archive System. IEEE Cluster Conference, 2010.
2. Accounting Data. http://hpss-info.lanl.gov/AcctProcess.php
3. Storage Trends and Summaries. http://www.nersc.gov/users/data-and-networking/hpss/storage-statistics/storage-trends/
4. Bent, John, Gibson, Garth, Grider, Gary, McClelland, Ben, Nowoczynski, Paul, Nunez, James, Polte, Milo, Wingate, Meghan. PLFS: A Checkpoint Filesystem for Parallel Applications. Supercomputing, 2009.
5. Scott, Cody. COTS Archive Lessons Learned & Fast Data Pipe Projects. JOWOG-34.
6. Gleicher, Michael. HTAR – Introduction. http://www.mgleicher.us/GEL/htar/
7. Nufire, Tim. Petabytes on a Budget: How to build cheap cloud storage. http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
8. Nufire, Tim. Petabytes on a Budget v2.0: Revealing More Secrets. http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/
9. Mearian, Lucas. EBay attacks server virtualization with 100TB of SSD storage. http://www.computerworld.com/s/article/9218811/EBay_attacks_server_virtualization_with_100TB_of_SSD_storage
10. Feldman, Michael. IBM Demos Record-Breaking Parallel File System Performance. http://www.hpcwire.com/hpcwire/2011-07-22/ibm_demos_record-breaking_parallel_file_system_performance.html
11. Shilov, Anton. Samsung Shows Off Prototype of 4TB Hard Disk Drive. http://www.xbitlabs.com/news/storage/display/20110308081634_Samsung_Shows_Off_Prototype_of_4TB_Hard_Disk_Drive.html
12. Altavilla, Dave. A Terabyte For Notebooks: WD Scorpio Blue 1TB Drive. http://hothardware.com/Reviews/1TB-WD-Scorpio-Blue-25-HD-QuickTake/
13. HITACHI Deskstar 0S03230 3TB 5400 RPM 32MB Cache SATA 6.0Gb/s 3.5" Internal Hard Drive - Bare Drive. http://www.newegg.com/Product/Product.aspx?Item=N82E16822145493
14. IBM – LTO Ultrium 5 – 1.5 TB / 3 TB – storage media. http://www.amazon.com/IBM-LTO-Ultrium-storage-media/dp/B003HKLHZC
15. HITACHI Ultrastar 7K3000 HUA723030AKLA640 (0F12456) 3 TB 7200 RPM 64MB Cache SATA 6.0Gb/s 3.5" Internal Hard Drive - Bare Drive. http://www.newegg.com/Product/Product.aspx?Item=N82E16822145477
16. SuperChassis 417E16-RJBOD1. http://www.supermicro.com/products/chassis/4U/417/SC417E16-RJBOD1.cfm

U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

Position Paper

Nicholas P. Cardo National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory [email protected]

ABSTRACT
Disk quotas are a useful tool for controlling file system space consumption. However, each file system type provides its own mechanism for displaying quota usage. Furthermore, each file system could display the information differently. Unifying how quota information is reported would simplify the user's experience. Also, having quotas span multiple file systems would provide users some flexibility in storage usage.

INTRODUCTION
The use, management, and enforcement of disk quotas is often difficult to interpret at the user's level, as well as being too rigid an enforcement mechanism.

Identification of the issues
While disk quotas are extremely useful in managing disk space, they are often complicated, hard to understand, and counterproductive to the user community. Let's first examine quota-reporting utilities. For IBM's General Parallel File System (GPFS), the command mmlsquota is used to display quota limits and usage:

nid00011:~> mmlsquota home1
                       Block Limits                                 |  File Limits
Filesystem type  KB        quota     limit     in_doubt  grace      |  files  quota    limit    in_doubt  grace  Remarks
tlhome1    USR   29491548  41943040  41943040  34032     none       |  7680   1000000  1000000  429       none   fshost

For Lustre, quota information is obtained with the command lfs quota:

nid00011:~> lfs quota -u juser /scratch
Disk quotas for user juser (uid 500):
     Filesystem   kbytes   quota   limit   grace   files   quota   limit   grace
       /scratch  2666948       0       0              83       0       0
sc-MDT0000_UUID      120       0                      83       0
sc-OST0000_UUID        4       0
...

Standard Linux utilizes the quota command:

$ quota
Disk quotas for user juser (uid 500):
Filesystem  blocks  quota  limit  grace  files  quota  limit  grace
  /dev/fs0       2    100    200             2     10     20

So now there are three different commands, each with a different syntax, showing different information. Instructing users how to interpret the results can be quite involved. This is especially true when the data is closely examined against what the users really care about: at that level, all that matters is what is being consumed and what the limit is. Lustre is quite detailed in its output and provides space consumption information down to the Object Storage Target (OST). This presents a case of information overload, as file systems can have hundreds of OSTs and each one adds a line of output. But the real question is whether users really need to see this.


File system quota reporting also is highly dependent on the file system architecture, and provides details unique to that file system. GPFS provides the mmrepquota command, producing:

                    Block Limits                            |     File Limits
Name    type  KB     quota     limit    in_doubt  grace     | files  quota    limit    in_doubt  grace
fuser1  USR   17684  41943040  4194304  0         none      | 72     1000000  1000000  0         none
fuser2  USR   180    41943040  4194304  0         none      | 32     1000000  1000000  0         none

Lustre provides no such reporting functionality. Standard Linux provides the repquota command for reporting operations:

# repquota /quota
*** Report for user quotas on device /dev/fs1
Block grace time: 7days; Inode grace time: 7days
                Block limits                 File limits
User      used   soft   hard   grace   used  soft  hard  grace
fuser1 -- 1204      0      0              5     0     0
fuser2 --   10    100    200              9    10    20

The more file system types that are present on a system, the bigger the problem becomes. Along the same lines, the actual underlying quotas are per file system and cannot be aggregated across multiple file systems. Users must be granted quotas on each individual file system, managed by that file system's quota utilities.

Statement of Position
Quota utilities should be externalized from the products, with each vendor encouraged to contribute to them to support their file system. Furthermore, each vendor should supply a standardized API call to retrieve or manipulate disk quotas. It is recognized that each file system may need to present details not applicable to other file systems; in this case, the utilities should use extended flags to control the operation.
The application of disk quotas also needs to be externalized from the file system. While the accumulation of accounting data needs to be within each file system, the enforcement of quotas can be externalized. This would allow a single disk quota to span multiple file systems regardless of file system type. A kernel module could open the quota file, holding the file descriptor open for a system call to access directly from within the file systems.

SUPPORTING DOCUMENTATION
Many quota operations can be easily externalized. Each of the file systems mentioned already provides an API call that can be used to retrieve or manipulate disk quotas: GPFS provides gpfs_quotactl(), Lustre provides llapi_quotactl(), and standard Linux provides quotactl(). This shows that the underlying interface is already in place, but unique to each file system. Linux can already differentiate between file system types: the mount table contains the field mnt_type, which identifies the underlying file system. So why can't a single form of quotactl(), which utilizes mnt_type to differentiate the file system types, be put in place? The answer is: it can.
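A minimal C sketch of that idea (not the NERSC utility itself): read the mount table with getmntent(), branch on mnt_type, and call the native Linux quotactl() where it applies. The GPFS and Lustre branches are left as comment stubs, since gpfs_quotactl() and llapi_quotactl() require vendor headers.

/*
 * Sketch: report the calling user's quota on each mounted file system,
 * dispatching on the file system type found in the mount table.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mntent.h>
#include <sys/types.h>
#include <sys/quota.h>

static void report_linux_quota(const char *dev, uid_t uid)
{
    struct dqblk dq;

    if (quotactl(QCMD(Q_GETQUOTA, USRQUOTA), dev, uid, (caddr_t)&dq) == 0)
        printf("%-24s %10llu KB used, soft limit %llu KB\n", dev,
               (unsigned long long)dq.dqb_curspace / 1024,
               (unsigned long long)dq.dqb_bsoftlimit);
    else
        perror(dev);
}

int main(void)
{
    uid_t uid = getuid();
    FILE *mtab = setmntent("/etc/mtab", "r");
    struct mntent *m;

    if (!mtab)
        return 1;
    while ((m = getmntent(mtab)) != NULL) {
        if (strcmp(m->mnt_type, "gpfs") == 0) {
            /* would call gpfs_quotactl() here and normalize the result */
        } else if (strcmp(m->mnt_type, "lustre") == 0) {
            /* would call llapi_quotactl() here and normalize the result */
        } else if (strcmp(m->mnt_type, "ext4") == 0 ||
                   strcmp(m->mnt_type, "xfs") == 0) {
            report_linux_quota(m->mnt_fsname, uid);
        }
        /* other file system types are skipped in this sketch */
    }
    endmntent(mtab);
    return 0;
}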

User Quota Report
The first utility to make use of this capability essentially replaces mmlsquota, lfs quota, and quota with a single utility that can display quota information to the users, regardless of file system type:

Displaying quota usage for user fuser1:
            -------- Space (GB) --------   ---------- Inode ----------
FileSystem  Usage  Quota  InDoubt  Grace   Usage  Quota    InDoubt  Grace
scratch         3      -        -      -      83        -        -      -
scratch2       24      -        -      -     334        -        -      -
project         0      -        0      -    1944        -        0      -
common          0      -        0      -      11        -        0      -
u1             28     40        0      -    7680  1000000      429      -
u2              0     40        0      -       2  1000000        0      -

In this example, the scratch and scratch2 file systems are Lustre, while project, common, u1, and u2 are GPFS. This utility uses getmntent() to read the mount table in order to access mnt_type, which is used to determine the file system type. The appropriate quotactl() call is then used to access the quota information for that file system. The data is then normalized to a consistent format and presented to the users.

File System Quota Report
Quota reporting at the file system level is very useful for determining the top consumers of the resources. The issue of different file systems reporting different information can be easily overcome. However, the lack of a capability to simply loop through all quota entries was a major obstacle that had to be overcome. The solution used was to loop over all users to get their usage information. The downside is that if a user is removed from the system while still consuming resources, that usage will never be reported.
In a similar manner as in the user quota reporting utility, statfs() is used to get the f_type of the file system, which is then used to determine the correct quotactl() to use for that file system. The user list is obtained simply by looping on getpwent().

Filesystem:  /scratch2
Report Type: Space
Report Date: Wed Sep 7 07:14:37 2011
            ---- Space (GBs) ----   ------- Inode -------
Username       Usage      Quota        Usage      Quota
fuser1          8262          0       663560          0
fuser2          7824          0       225937          0
fuser3          5593          0       216674          0
fuser4          4548          0       111542          0
fuser5          2171          0       436872          0

The report can be sorted either by space or by inodes. The result is a simple and easy to read output that is the same regardless of file system type.

Quotas Spanning File Systems
Enforcing file system quotas external to the file system opens up the flexibility to customize the effects when quotas are reached, as well as the opportunity to span file systems. Normally quotas are set up with a soft limit that can be exceeded for some grace period while not exceeding a hard limit. The effect of reaching the hard limit is usually the I/O being aborted with the error EDQUOT (quota exceeded). Running a large-scale computation for several days that aborts because quota limits were reached is counterproductive, not to mention the loss of valuable computational time. Rather than terminating the run, a better solution is to allow it to run to completion while preventing further work from starting. A simple check at job submission and another at job startup can prevent new work from being submitted or started, without the loss of computational time.
In addition to the flexibility in how to enforce quota limits, the ability to combine usage information from multiple file systems allows a single quota to span file systems. The process is simply to retrieve the utilization from the desired file systems, accumulate it, and then evaluate it. For batch jobs, this can be performed in submit filters or prologues. This is in production at job submission time. If users are over their quota, they will receive a message:

ERROR: your current combined scratch space usage of 6 GBs exceeds your quota limit of 4 GBs.

You are currently exceeding your disk quota limits. You will not be able to submit batch jobs until you reduce your usage to comply with your quota limits.

Externally to the file system, an infrastructure was needed to support the ability to grant a quota that applies to all users, as well as exceptions. Some projects simply require more storage resources than it is desirable to grant to all users. Having a default quota is easy, as it is a value that applies to all users; the challenge was the ability to override this while tracking those with extended quotas.
Another utility was created to manage a data file used to track quota extensions:

> chquota -R
             ---------- Space Quota ----------    ---------- Inode Quota ----------
Username  Q  GigaBs  Expiration  Ticket           Inodes   Expiration  Ticket         Filesystem
fuser1    U  10240   01/10/2012  110112-000033    5000000  01/10/2012  110112-000033  /scratch
fuser1    U  10240   11/15/2011  110714-000039    -        --/--/----  -              /scratch

Not only are the new limits for space and inodes recorded, but also the expiration dates for the extensions and the problem tracking tickets. From a single report, a clear understanding of all existing quota extensions can be ascertained. A feature of this utility is the ability to automatically remove expired quotas: each time the command is run, not via cron, it evaluates all quota extensions and removes any that have expired.
Another feature targeted at improving the user experience is the ability to inform users when a quota extension is about to expire:

chquota: your 6 GB space quota on /scr expires on 09/09/11 (110901-000001)

The number in parentheses is the trouble ticket number tracking the request. This message can be placed in login scripts to inform users each time they log in to the system. This change has improved the user experience on the system while keeping resource consumption in check.

CONCLUSIONS
Simplifying disk quotas improves usability, reporting, and the user's experience on the system, all while controlling consumption of resources.
File system vendors should be encouraged to align their quota implementations into a single set of tools that provide a consistent interface, regardless of file system type. Until that happens, centers should adopt a plan to develop such tools, as they improve the user's experience. Taking this one step further, all centers should adopt the practice of putting these tools into service, creating consistency across centers. Many users perform their calculations at several centers, and having a consistent set of tools will enhance their ability to work effectively.
By externalizing disk quota enforcement to job submission, users are forced to keep their resource consumption in check without the risk of losing a run due to quota limits. As a result, the computational resources are used much more effectively, as no time is lost due to calculations being cut short when quota limits are hit.

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Wayne Hurlbert Lawrence Berkeley National Laboratory [email protected]

ABSTRACT / SUMMARY
This position paper aims to provide information about techniques used by the Mass Storage Group at the National Energy Research Scientific Computing Center (NERSC) to accomplish technology refresh, system configuration changes, and system maintenance while minimizing impact on users and maximizing system availability and reliability. In particular, it addresses the Center's position that shorter, scheduled outages for archival storage system changes, occurring at familiar times, minimize the likelihood of unscheduled or extended outages, and so minimize impact on users.

INTRODUCTION
For the purposes of this discussion, it is taken as given that much of the activity involved in technology refresh, along with configuration changes and other system maintenance, requires systems to be off-line. NERSC's approach to these activities in the archival storage systems is largely driven by the need to minimize the impact on our users. The notion of minimum impact encompasses the scheduled activity itself, potential fallout from the activity, and the concept of preventive maintenance. This has resulted in a conservative attitude toward system maintenance that favors incremental rather than radical change. In the following I will discuss the motivation and benefits of this approach, and mention some of the real world steps taken at NERSC to implement the approach.

Little by Little
While it can be tempting to "just do it", an incremental approach to technology refresh and other system maintenance activities is usually a viable alternative to the more significant outages often required to accomplish the changes in a single sitting.
Types of projects for which this approach might be helpful include:
• Server upgrades or replacement.
• Significant application, OS, or layered software upgrades.
• Replacement or reconfiguration of disk and tape resources.
• Replacement or reconfiguration of large infrastructure components such as SAN switches.
These projects will often take many hours, sometimes days, to accomplish, and run a relatively high risk of unanticipated problems or complications.
An incremental approach indicates that these larger projects be broken up into smaller pieces which can be accomplished in an independent and sequential manner. Naturally, there are projects where this is not possible, for various reasons; our

finding is that the reasons are typically not technical in nature.

The Benefits
There are several benefits provided by this approach:
• Less complexity of the tasks executed during an outage, which means a reduction in the likelihood of human mistakes in planning or execution of the tasks.
• Lower risk of aborted or extended outages due to unexpected or unanticipated complications. For example, because fewer tasks are being undertaken, there is a smaller window for hardware failure if devices or servers are being power cycled. Naturally, a device can fail during either an incremental activity or a major project, but the impact on workflow is likely to be smaller, and the impact on the user is likely to be less significant in terms of total time for the outage.
• Easier back out in the case of the need to abort the maintenance activity due to unexpected or unanticipated events.
• Lower likelihood of human error due to the fatigue and stress which usually occur during significant projects.
• When compared with forklift upgrades, lower risk of subsequent fallout due to as yet undiscovered bugs or defects. This is particularly true, obviously, for newer products.
• Where desirable, allows for completing system-down activities during business hours, because of the shorter outages. Business hours may be required in order to ensure access to outside expertise.

User Expectations
The incremental approach to performing system maintenance subscribes to the notion that shorter, more frequently scheduled outages will ensure a more stable system, which will better serve users. Outages should be scheduled for a standard day and time, even if not at standard intervals (e.g. weekly), with the intent that users will come to expect that time period and plan around it. For instance, on one end of the spectrum, users can simply plan to not run during the normal hours on the normal day for outages. However, NERSC does provide a programmatic, network based mechanism for automated jobs to check system availability.
Further, NERSC has developed an effective protocol for suspending user storage transfers during short outages. Referred to as "sleepers", user interface tools on the compute machines look for lock files which cause these clients to loop on the system sleep call until the lock file disappears. The result is that many user jobs simply pause until the outage is completed.
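A minimal C sketch of the sleeper idea (not NERSC's actual tooling): before issuing a storage transfer, a client-side wrapper polls for an administrator-created lock file and sleeps until it disappears. The lock file path is a hypothetical example.

/*
 * Sketch: pause rather than fail when the storage system is in a
 * scheduled outage, signaled by the presence of a lock file.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define SLEEPER_LOCK "/etc/archive_sleeper.lock"   /* hypothetical path */

static void wait_for_outage_to_end(void)
{
    struct stat sb;

    while (stat(SLEEPER_LOCK, &sb) == 0) {
        /* Outage in progress: loop on sleep until the lock file is removed. */
        sleep(60);
    }
}

int main(void)
{
    wait_for_outage_to_end();
    /* ... proceed with the actual archive transfer here ... */
    printf("storage system available, starting transfer\n");
    return 0;
}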
In annual user satisfaction surveys at NERSC, the archival storage resources typically receive high scores with regard to system availability and reliability. [1] [2]

Preventive Maintenance
Preventive maintenance, in the sense of avoiding unscheduled outages and the associated user interruptions, can be seen as primarily concerned with restarting, rebooting, and/or power cycling equipment. These activities usually take relatively little time, and fit nicely with shorter, more frequently scheduled outages. Examples include:
• Reboot to validate configuration changes made while the system is live, even if a reboot/power cycle is not strictly required.
• Reboot to flush out pending hardware failures, or to reset hardware that is in a confused state.
• Rebooting or power cycling also helps maintain familiarity with the way systems and devices behave during power-down and power-up.
• Restarting applications, and less importantly these days restarting operating systems, can help avoid outages due to software defects such as memory leaks.
• Build rather than copy: when locally built software must be installed on multiple servers, building it on each server validates the installation and configuration of layered software (in addition to allowing debug activities on the various servers).

Example
Project: an application upgrade on the current production server hardware, which requires OS and/or layered software upgrades.
The NERSC storage group will typically build a new system disk, from the ground up, on a second disk in the production server. This will usually involve an outage to install the new OS, followed by one to several 2-3 hour outages to install, build, configure, and test (as appropriate) layered software and application code. Each of these outages will involve a reboot to the second system disk for the work to be done, followed by a reboot back to the production disk. This activity is usually spread out over a number of weeks, and is typically interleaved with other activities that may, or may not, involve preparation for the upgrade. The upgrade is finalized by rebooting to the new system disk and performing any remaining activities required before going live.

CONCLUSIONS
A conservative approach to system outages for technology refresh, system reconfigurations, and other maintenance can be accomplished through a policy which uses multiple short outages to perform the work incrementally. This promotes greater system stability and minimizes the number of unscheduled outages, resulting in better service to users.

REFERENCES
1. NERSC 2010 High Performance Computing Facility Operational Assessment.
2. NERSC User Surveys. http://www.nersc.gov/news-publications/publications-reports/user-surveys.

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Todd M. Heer Lawrence Livermore National Laboratory [email protected]

ABSTRACT / SUMMARY
Arguably, the ultimate responsibility of an archive is to protect the data. Specific to our center, archival data integrity emanates from dual copy files, intensive preproduction environment analysis, and ongoing HSM verification testing. Availability falls from the judicious application of a redundancy model. Efficiency can be obtained by leveraging the large procurements that are part and parcel of HPC center operations, as well as the thinning out of unnecessary costly equipment. A newly implemented soft quota model constrains growth. Flexibility and communication with users ensure success.

INTRODUCTION
The deployment and administrative tasks of an HPC data archive tax credos of data integrity, service availability, and operational efficiency. Couple that with the specter of prodigious growth, and you have a witches' brew of daunting missions.

DEPLOYING TO OUR CREDOS

I. DATA INTEGRITY

• Dual Copy, Dual Technology
Data integrity is achieved by dual copy of files over a specific size range. It is often sufficient to simply have dual copies of a file, unless a specific underlying technology is the source of the problem (e.g. a firmware bug causing corruption on a data pattern). In this case, a differing technology must store the second copy. We dual copy over two tape drive technologies in order to avoid such scenarios. Currently these technologies are Oracle T10000C and IBM LTO-5.
The recent leap in capacity resulting from the barium-ferrite particle of the T10000C media realizes an average of 7.9 TB per cartridge with our customer data profile. We have recently been afforded the opportunity to dual copy all files up to 256 MB as a consequence. We offer a special class of service customers can specify to obtain dual copy files on tape regardless of size. Each technology is further separated into two distinct robotic library complexes (Oracle SL8500), separated by a distance of approximately 1 kilometer.

• Offline Testing
As tape drives are either purchased or replaced due to failure, they are tested for integrity and

performance before being placed into production. A suite of tools was created to facilitate this out-of-band testing. Files of known size and composition are written to and read from test media. Timing is conducted and the data is examined by means of a checksum. It is important to understand that a performance threshold exists below which drives should be considered faulty for the environment, even if integrity checks pass.

• End to end verification
The largest stride in the quest for complete data integrity can only be realized by testing the entire stack of software and hardware in use by the archive application. We employ a homegrown utility called DIVT – the Data Integrity Verification Tool.
DIVT runs as a client on various center platforms while using various source file systems. It transfers files into the archive; the files land on level 0 disk cache. They are then pulled out of the archive and compared against the original. The files are stored again, except this time the file is pushed down to level 1 tape and purged off of level 0 disk. Again it is retrieved from the archive and compared against the original. Should any anomaly exist, email notification is sent.
This push and pull against the disk and tape levels of the HSM is constant. Finding problems is a game of percentages. In the last two years, DIVT has found two major problems. The first was a file stat() bug in the Lustre parallel file system that reported inconsistent file sizes, the result of which was corrupted tar archive images. The second problem was a tape drive that was silently truncating files, thereby corrupting them on tape. Neither of these would have been found had it not been for the utility. The opportunity for silent corruption is rampant.
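The verification loop can be sketched as follows in C (a simplified stand-in, not the LLNL tool); the archive_put and archive_get commands are hypothetical placeholders for the site's transfer interface.

/*
 * Sketch of an end-to-end verification pass: create a file of known content,
 * store and retrieve it through the archive, and compare checksums.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint32_t checksum(const char *path)
{
    /* Adler-32-style rolling checksum; good enough for a sketch. */
    FILE *f = fopen(path, "rb");
    uint32_t a = 1, b = 0;
    int c;

    if (!f) { perror(path); exit(1); }
    while ((c = fgetc(f)) != EOF) {
        a = (a + (uint32_t)c) % 65521;
        b = (b + a) % 65521;
    }
    fclose(f);
    return (b << 16) | a;
}

int main(void)
{
    uint32_t before, after;

    system("dd if=/dev/urandom of=divt_test.dat bs=1M count=8 2>/dev/null");
    before = checksum("divt_test.dat");

    /* Store and retrieve through the archive (placeholder commands). */
    system("archive_put divt_test.dat");
    system("archive_get divt_test.dat divt_test.out");

    after = checksum("divt_test.out");
    if (before != after) {
        fprintf(stderr, "DATA INTEGRITY ANOMALY: %08x != %08x\n", before, after);
        /* a production tool would send an email notification here */
        return 1;
    }
    printf("verification passed (%08x)\n", before);
    return 0;
}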

II. AVAILABILITY

The focus is on "nines of availability". Simply stated, it means reducing the length of planned outages. Our goal is often said to be "two and a half nines," or 99.5% annual uptime, which translates into 3.65 hours of outage per month or 1.8 days per calendar year. For this reason, each second of outage is tracked.

• Pre-Production
A "Pre-Production" environment is an absolute necessity for an archive. All new device firmware, device drivers, operating system fixes and version upgrades, and application versions are tested rigorously. It is here where the methods and order of complex integrations take shape. Tuning parameters are also sorted out. A substantive subset of the exact hardware used in production should be represented in pre-production.
With such an environment comes the need for discipline. A pre-production system must be fed and cared for in the same way a production system would be; otherwise it quickly achieves a state of neglect, requiring significant resources to restore its usefulness.
We have traditionally run two production environments – unclassified and classified. Each of those has a dedicated pre-production environment. Deployments start in unclassified pre-production. Depending on the nature of the changes, testing can take from a couple of weeks to a couple of months, after which time the change is deemed suitable for production and a planned downtime date is set.
Then the process is started all over for the classified side on its pre-production system. These cycles tend to be much shorter, as most software has been battle-hardened in our unclassified environment by this time.
More typically, due to its larger scale, unclassified production will uproot a bug that wasn't caught in pre-production testing. All future deployments are put on hold while problems are researched and remedied.
The net result is a well-sorted production rollout that minimizes the chances of users finding issues before the deployment staff.

• Redundancy
The current number one reason for loss of service is planned and unplanned electrical outages. Our archive spans four different raised floor environments. Consequently, we often react to
To mitigate, certain hardware has been duplicated • Measured doses of code patching in redundant configurations. Keeping up the nines of availability requires resisting the urge to over-patch the production systems. Security concerns should be thought out and patches tested cohesively in pre-production environments. With few exceptions, the most egregious software security vulnerabilities can be handled by a workaround or an efix which keeps the main archive service available without interruption. Constant patching equals constant downtime.

III. EFFICIENCY

In many ways, data archives are a study in how to Metadata disk for the HPSS Core Server is do more with less. Budgets and personnel tend to replicated in two different rooms 1km apart. not grow in step with storage requirements. Either can go down and the application will continue operation. Using operating system disk • Trim the fat mirroring on top of RAID controllers across With enough inexpensive data mover hosts, highly available duplicated fiber channel expensive-to-purchase and even more expensive- switching technology ensures no one single point to-maintain fiber switch technology is not of failure between the main core server and its required. metadata. Our data movers are commodity hardware based Further, robotic software control servers (for x86_64 systems running Linux. All devices are Oracle ACSLS) have been made redundant in a direct attached to the HBA on the host in either cold spare configuration, also located 1km apart. FC4 or FC8 native speeds. Fiber trunks running Identifying single points of failure allows us to to patch panels handle the interconnects. No concentrate on the biggest bang for our electrical is required to these panels. redundancy dollar. The disk and tape mover Should one of these systems crash, there are nodes exist in smaller commodity hardware plenty of remaining nodes to shoulder the load. configurations, in sufficient quantities, so as to We mark their associated devices unavailable to allow for individual node failures. Failed nodes the archive application, thus no need exists for a are fenced out by our scalable application, HPSS, switching architecture to swing devices to online all while the remaining movers handle the load. hosts. Core server hosts, on the other hand, can be found • Piggyback procurements to have redundant internal drives, fiber HBAs, Given this commodity hardware data mover fans, ethernet cards, power supplies, and ECC design, we are able to leverage the sorts of memory. purchases HPC centers make all the time, namely large cluster and file system disk purchases.


With modest adjustments of node configurations, what was a compute node can be a quite capable and inexpensive I/O data mover machine if tied into the larger procurement process.

• Vendor manpower
Our center has dedicated operations staff well versed in the various hardware types and associated common failure scenarios. Specific vendor gear exists onsite in considerable quantities. Accordingly, we find it possible to negotiate daily onsite vendor CSE/CE support at modest rates. This allows us to have a specialist available for the inevitable unique problems falling outside the scope of an operations staff, as well as for providing a fast track to backend developer support at a moment's notice. This speeds time to resolution and frees our staff to concentrate on the administration of the archive and center at large.

• Authoritative sources of information
An essential component of archive management involves reliably answering questions whose result set changes anywhere from frequently to hardly ever. Sources for such questions range from automated scripts to reports written for management. Examples include:
- What milestones were achieved last year?
- What are the firmware versions on the tape drives?
- What fixes make up our previous production code release?
Establishing a single authoritative source abates confusion. The authoritative source often differs for each question, but needs to be identified and communicated to avoid future errors based on incorrect or drifting information gathered from substandard sources (e.g. a file in a team member's home directory).
For example, the archive team coalesced on a TeamForge (SourceForge) web utility, which provides a wiki and source control among other features. We track project progress here, create How-To's, load key diagram documentation, etc.
Using tape drives as an example, we write utilities that get information in real time by accessing drives over their built-in Ethernet connections. Items such as dump status, firmware version, currently mounted cartridge, feet of tape processed, etc. can be gleaned in this fashion.
Our application code and the various local modifications are kept in Subversion. We track preproduction and production series. The team members check out the code, interact with it, and check it back into the central repository. All changes are logged.

MANAGING GROWTH

Fiscal year 2011 marks the first production year of our new Archival Quota system (a.k.a. Aquota). Traditionally, users have been allowed to grow our data archives with few restrictions. Growth in the last few years suggested that we would need to construct vast new buildings to hold data if this growth curve was to be sustained.

• Unique to this quota system
Two key differences exist between Aquota and a traditional disk quota. First, only annual growth is measured. Data stored in the fiscal year prior and before is not considered, and quotas are reset each new fiscal year.
Secondly, it is "soft" enforcement only. Users are still allowed to store after their limits are reached. Users, as well as their responsible program managers, are contacted when the quota is met. It is reported that they have grown beyond their default allowance and need to seek additional resources.

• Aquota Model
This model of growth control allows the center to predetermine the amount of growth it is willing to sustain for the upcoming fiscal year, rather than attempting to budget based on the previous year's unabated growth. Once a budget is set, a reasonable set of growth constraints is arrived at based on the amount of media the budget will allow (including potential technology refreshes). Most users live within their yearly budget. The center allocates "pools" of storage to projects; individuals exceeding their default allowance need to be given space from project pools.

• Aquota Architecture
Aquota was built in-house. It is comprised of a server daemon written in C, any number of interactive clients written in C, and a variety of administrative tools written in Perl. Nightly exports of HPSS accounting data are imported into a MySQL database. The Aquota daemon handles all client Aquota requests; the clients can run on a variety of hosts in the center. Users, Pool Managers, and Administrators have increasing levels of authority and interface with the system via the command line client.

• Impact
Early evidence for FY11 suggests that overall annual growth will have dropped 14 points from the previous three-year average. The tangible impact is that a tool to facilitate a dialogue has been opened between users, responsible managers, and those of us tasked with offering the archive service. This did not exist in previous years. A common language is now being spoken.

CONCLUSIONS

Data archives outlive architectures, operating systems, and interconnects. They grow with wild abandon. Bytes churn in a maelstrom of activity as new data arrives and old data is repacked. Even with a cadre of the latest technological advances and efficient models of deployment, the primary elements of a successful data archive are the people and their willingness to strive to meet the credos of the archive. Key skills in computer science - particularly in languages interpreted and compiled - don't hurt either.

Challenges in Managing Multi-System Multi-Platform Parallel File Systems U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011

Ryan Braby, National Institute for Computational Sciences, [email protected]
Rick Mohr, National Institute for Computational Sciences, [email protected]

ABSTRACT / SUMMARY
The National Institute for Computational Sciences (NICS) is looking to deploy one or more center wide parallel file systems. Doing so should reduce time to solution for many NSF researchers. Researchers who run on multiple systems at NICS will no longer need to move data between parallel file systems, and hopefully this will reduce the amount of file system space used for extraneous data replication. However, there are a number of challenges in setting up a multi-system, multi-platform parallel file system. This paper discusses many of the identified challenges for deploying such a file system at NICS and supporting at least the following architectures: Cray XT5, SGI UV, and commodity Linux clusters.

INTRODUCTION
The National Institute for Computational Sciences (NICS), a partnership between the University of Tennessee and Oak Ridge National Laboratory, was granted a $65M award from the NSF in September 2007. A series of Cray HPC systems, named Kraken, were purchased and deployed. Currently, Kraken is a Cray XT5 system with a peak performance of 1.17 PFlops. Lustre is the primary file system for Kraken, and it is built on top of DDN storage directly attached to special I/O service blades in the Cray. These blades act as the MDS and OSS servers for the rest of the system.
In the last year, NICS has deployed a new file system to be shared between Nautilus (a large SGI UV) and Keeneland (a cluster used for GPGPU development). An evaluation of file system technologies was done, and Lustre was selected for use here. Ideally, the scratch file systems will be shared across all NICS HPC resources. To this end, we have been planning and preparing to upgrade our InfiniBand SAN, attach Kraken to this SAN, and migrate Kraken's current Lustre file system to be SAN attached.
While a number of sites have deployed multi-cluster Lustre file systems, unique site requirements prevent the creation of a one-size-fits-all solution. NICS supports a wide variety of platforms (Cray XT, SGI UV, and Linux clusters). Individually, these platforms can present challenges for a site-wide Lustre file system. Combining them further complicates matters.

CRAY XT
Cray ships Lustre as part of CLE (Cray Linux Environment), but they are currently shipping an older version (1.6.5) with custom patches. While it is nice to have a vendor supported version, this version is older and lacks features that have been introduced in newer versions. As we move to a center wide file system, there are also concerns about version compatibility between the servers and all the clients.
While it should be possible to put a newer version of Lustre into CLE boot images, there are a number of possible complications with doing so. At this time NICS does not have a file system developer and it is not in our short term plans to

hire one. We could build and install Lustre, but we have minimal resources to test it on. Lacking a file system developer, our ability to fix issues with Lustre in CLE would be limited. The Cray XT systems use a proprietary SeaStar network, which requires its own Lustre Network Driver (LND) and could complicate LNET routing. Further, Cray support might be hesitant to help on production issues when we are running our own version.

SGI UV
The SGI UV is a large NUMA architecture with a single system image. Running a single Linux kernel, this architecture tends to get poor I/O bandwidth when compared to clustered systems of similar core count. NICS has spent time testing multiple file systems on our 1024 core UV system, and determined that in present day performance Lustre (1.8.6) was the winner (just barely).
Comparing the known road maps for the major parallel file systems, Lustre was the only one that has plans for improving SMP scalability and NUMA performance. In particular, it looks like some improvements in this area have already been added in Lustre 2.1.
Another challenge for parallel file systems on the SGI UV is effectively utilizing multiple network interfaces to our SAN. As a large single system image system, it is important for performance that a file system can drive multiple network interfaces at near line rate. We have had some success scaling Lustre read performance with multi-rail InfiniBand on our UV. This is an area where we hope to see improvements to Lustre in the future.

LINUX CLUSTERS
Linux clusters with InfiniBand interconnects are probably the most common platform for Lustre file systems. As such, including Linux clusters in a multi-cluster Lustre configuration adds to the complexity. It is another platform to consider and keep track of, but it is also one for which you can rely on the community for testing and development.

MULTI-SYSTEM CHALLENGES
Deploying a Lustre file system that spans multiple systems and architectures introduces new challenges apart from the previously mentioned system-specific ones. For example, it may be desirable to run Lustre 2.1 on the SGI UV in order to address some of the SMP scalability issues. However, this would require running Lustre 2.1 on the MDS and OSS servers, which is not compatible with the Lustre 1.6 client on the Cray.
Maintaining compatibility between all of the clients and servers is the first major challenge to a multi-system Lustre deployment. Different platforms may require different patches, and in some cases different client versions. Knowing which versions are compatible and testing the compatibility is critical to ensuring file system usability.
Some system vendors include a supported version of the Lustre client and publish supported client/server combinations. Merging these requirements from multiple vendors could lead to a situation where the supported versions are not compatible with each other. To reconcile this may require running a version not supported by one or more vendors. One approach to deal with this would be to purchase third party Lustre support.
Managing a multi-system parallel file system makes the file system more of an infrastructure service. Since multiple systems rely on the availability of the file system, the effects of any disruptions (like maintenance) must be carefully considered. Further, you have to plan upgrades carefully, ensuring that at all points in your upgrade plan you are on compatible versions and not unintentionally running an unsupported combination of server, router, and client versions.
Also, like any infrastructure service, there are possible contention issues. Performance on one system can and will be impacted by access from another system.

CONCLUSIONS
NICS is planning to move to center wide Lustre file systems. There are a number of issues involved in doing this. While we have identified many of the issues and have ideas of how to deal with them, we do not have the experience and history of implementing these ideas to determine if they are indeed best practices.

REFERENCES
1. T. Baer, V. Hazlewood, J. Heo, R. Mohr, J. Walsh. Large Lustre File System Experiences at NICS. CUG 2009, http://www.cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12B-Walsh/walsh-paper.pdf
2. G. Shipman, D. Dillow, S. Oral, F. Wang. The Spider Center Wide File System: From Concept to Reality. CUG 2010, http://www.nccs.gov/wp-content/uploads/2010/01/shipman_paper.pdf
3. Whamcloud JIRA for Lustre SMP scalability: http://jira.whamcloud.com/browse/LU-56

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Kevin Harms LCF/ANL [email protected]

ABSTRACT / SUMMARY
This paper addresses the "Usability of Storage Systems" and the "Administration of Storage Systems" topics. Given the small staff assigned to the LCF Intrepid storage resources, we have searched for methods to optimize the use of our storage resources. We present these methods for proactively finding opportunities to tune application I/O and for finding degraded hardware that is reducing overall I/O throughput.

INTRODUCTION
The LCF is a relatively new facility and is in the process of developing its storage practices and procedures. One item that has become clear is the need to be proactive about storage usage and administration. We have a limited staff dedicated to the storage system, and waiting until an issue turns into a real problem leaves us in an awkward position. In order to address this, we are working on some methods to proactively find issues and start working to solve them before they become worse.
The first method was to install a tool, Darshan, to profile user I/O so that users and staff could have a basic tool to help tune I/O for Intrepid and best utilize the storage resources LCF provides. The second method is to begin looking at the overall performance of the storage hardware, to find and fix marginal hardware without the need to wait for it to degrade to the point of outright failure.

System Description
Here is an overview of the core Intrepid storage system. Intrepid has two main storage systems. The home file system is GPFS based and uses 4 DDN9550 SANs that are directly attached via DDR IB to 8 xSeries file servers. The scratch storage has two different file systems running on it, GPFS (intrepid-fs0) and PVFS (intrepid-fs1), which utilize the same hardware. The scratch area uses 16 DDN9900 SANs, which are directly attached via DDR IB to 128 xSeries file servers (8 servers per DDN). File system clients are connected over a 10 Gb Myrinet fabric.

THE USABILITY OF STORAGE SYSTEMS

Darshan
Darshan [1] is a tool developed by the MCS department at ANL and deployed on the LCF Intrepid Blue Gene machine. Darshan captures information about each file opened by an application. Rather than trace all operational parameters, however, Darshan captures key characteristics that can be processed and stored in a compact format. Darshan instruments POSIX, MPI-IO, Parallel netCDF, and HDF5 functions in order to collect a variety of information. Examples include access patterns, access sizes, time spent performing I/O operations, operation counters, alignment, and datatype usage. Note that Darshan performs explicit capture of all I/O functions rather than periodic sampling, in order to ensure that all data is accounted for.

1

The data that Darshan collects is recorded in a also addresses a common issue where users are bounded (approximately 2 MiB maximum) not familiar with how their application does I/O, amount of memory on each MPI process. If this perhaps because they are using some large memory is exhausted, then Darshan falls back to application where someone other person or group recording coarser-grained information, but we implemented the IO code. Figure 1 shows an have yet to observe this corner case in practice. example from the darshan-job-summary.pl Darshan performs no communication or I/O while output. the job is executing. This is an important design decision because it ensures that Darshan introduces no additional communication synchronization or I/O delays that would perturb application performance or limit scalability. Darshan delays all communication and I/O activity until the job is shutting down. At that time Darshan performs three steps. First it identifies files that were shared across processes and reduces the data for those files into an aggregate record using scalable MPI collective operations. Each process then the remaining data in parallel using Zlib. The compressed data is written in parallel to a single binary data file. Darshan was deployed on Intrepid by creating a modified set of mpiXXX compiler wrappers which link in the darshan library code. These modified compiler wrappers are part of the users default path which means many applications link in Darshan with no extra work by the user. These Figure 1 – Darshan Job Summary Example applications put logfiles into a common area and This summary information can provide a useful are setup so only the user who produced the logs starting point for I/O analysis. We are aware of a can read them. Later we change the group few applications that have used this output to permission to a special ‘darshan’ group and then successfully improve their application I/O for add group read permission. These logs then Intrepid. become accessible by the LCF staff and a few selected MCS research staff. Project Analysis The second capability is to proactively analyze User Analysis darshan logs to see how users are utilizing the The first capability this provides is for users to storage system and if they are being effective. We look at some information about their jobs I/O are developing a basic web interface around profile and compare it to common suggestions aggregated log files that can be examined on a available via our wiki documentation. If the user per-project basis to find who the major users of feels their I/O performance is not as good as it the storage system are and how are they using the should be, when contacting the LCF staff, we system. We explored this idea in reference [2]. already have some basic information about the Figure 2 shows the top 10 projects from 1/1/2011 I/O patterns they are using which might give to 6/30/2011. We can look at these projects some initial starting suggestions for the user to try individually to see how they are using the I/O for improving I/O performance on Intrepid. This system. System Planning The third capability we get is the ability to look at what I/O patterns users want to use and what they want to do. This information can be used to target how we allocate our resources for next generation systems. Examples from above show users are still obsessed with generating O(1000), O(100000), O(1000000) files. The file per process model tends to break down at the 8192 node level (or 32768 processes) on Intrepid. 
For our next generation system, we have planned to split data and metadata and use separate SSD based storage for the metadata in hopes of

boosting metadata performance, which would Figure 2 – Top 10 Projects by bytes moved serve as a band-aid for the file-per-process users. Once we identify the top I/O users we can Another point is that we see about 60% of the examine their I/O usage in more detail. Figure 3 jobs at large scale go to either shared or partially shows an example of aggregate information about shared files and fewer use the file-per-process a single project. We can look at the percent time model. Figure 4 shows this distribution. However, spent in I/O and see if we should consider talking in this same time period we saw remarkably few with a project about their I/O usage if it looks people using high-level libraries such as HDF5 or subpar and thereby improve their utilization of PnetCDF. This might indicate we need to spend the core-hours they have been granted. time educating the userbase about these libraries or find out why our users would rather create their own shared file format rather than leverage existing ones.

Figure 3 – Darshan Project Information We had identified a project in 2010 that was a top Figure 4 – File usage (1/2010 – 3/2010) I/O consumer but had a high percentage of time spent in I/O. We were successful in working with THE ADMINISTRATION OF STORAGE this project to change the method for writing of SYSTEMS files, which gave them a 30% improvement in Another aspect of storage system efficiency is write throughput. ensuring the current hardware is delivering performance up to its useable peak. Anecdotally we have observed a single failing SATA disk can produce a global slowdown of the scratch file

3 system. During our early test stages when we Visualization were tuning the /intrepid-fs1 file system, we Another method to monitor the storage would often find a marginal drive would cause a infrastructure for marginal components is via significant slow down in an IOR test case. As an visualization of I/O metrics. We setup a utility to example, we would see something on the order of pull the ‘tierdelay’ metric from all tiers of each of losing 50% of total throughput. After failing 1 (or the 32 DDN controllers associated with the more) drives, the system would return to its scratch file system. We then ran the IOR optimal performance level. benchmark with a write workload while we collected samples every 10 seconds. The data was The work we have done in this area is still very loaded into ParaView and we began looking for preliminary and we have not validated any of our patterns. suppositions. Figure 5 shows a combined visualization of total Log Analysis operation count for each channel/tier combination The DDN9900 will report many errors and for all DDNs at the last timestep. statistics but it also logs informational events in the system log. These are generally not reflected directly in any of the system statistics. We developed a trivial monitoring tool to check the event logs of each DDN approximately once per day and send and alert if there were a large number of new messages in the log. Here is a short snippet from the monitoring tool, which emails its results.

INFO INT_GH 8-29 12:50:31 Recovered: Unit Attention Disk 9G GTF000PAH51JNF INFO INT_GH 8-29 12:59:29 Recovered: Figure 5 – Cumulative Operations Unit Attention Disk 22G GTF002PAHHKXRF Since the IOR workload was evenly distributing INFO INT_GH 8-29 13:02:43 Recovered: Subordinate errors detected. data over all LUNs we should see similar operation counts, but instead we see one tier (dark INFO INT_GH 8-29 13:05:18 Recovered: Unit Attention Disk 2G GTF100PAHW59BF red) that has significantly more operations than INFO INT_GH 8-29 12:49:44 Recovered: the rest of the tiers. In talking with DDN, the Unit Attention Disk 13G GTF002PAHWD21F ‘tierdelay’ counter records all operations INFO INT_GH 8-29 13:00:31 Recovered: including internal retries. It would appear that Subordinate errors detected. there is some issue on this particular tier resulting New Log Messages: 2650 in retries being generated. Example 1 – DDN Log monitoring output Figure 6 shows the same metric again but now as a In Example 1, we see that this DDN had 2600 3D volume. new log messages and many of those messages are related to problems with disk access on channel ‘G’. In this case, we could have opened a support request with DDN to determine which component was really at fault. In this particular case, disk 7G failed 5 days later. We could have failed disk 7G earlier and presumably not lost any performance during that time period. this project to update their code with a slightly modified I/O model that used fewer files which resulted in a 30% improvement of their I/O write speed. We plan to continue to enhance our summarization web tools to provide easier access to the darshan data for the LCF staff. Our progress on identifying faulty hardware prior to failure on the DDN SANs is still very preliminary and we have not validated any of the results. We hope to progress this further by being able to validate performance improvement after a hardware replacement. We would also hope to identify these patterns so that we could create Figure 6 – Tier Delay as Volume statistical models that would work on the normal I/O load of Intrepid without the need for an The volume shows a count for the number of invasive diagnostic run. operations, which occurred within a defined bucket. For example, 100 operations at 0.2 ACKNOWLEDGEMENTS second delay. The bottom of the volume is the Phil Carns – for all that is Darshan shortest delay and top of the volume is the highest delay. The dark red coloring are higher counts Neal Conrad – for Darshan web development going to blue at the lowest counts. In general the Justin Binns – for IO visualizations image shows the lowest latency buckets have the highest counts, which is good and the highest REFERENCES latency buckets have the lowest counts, also 1. Philip Carns, Robert Latham, Robert Ross, good. However, you can see a spike on a couple Kamil Iskra, Samuel Lang, and Katherine of disks where the higher latency buckets have a Riley. 24/7 characterization of petascale I/O much higher total count than most other disks. workloads. In Proceedings of 2009 Workshop We don’t have conclusive findings that those disk on Interfaces and Architectures for Scientific are causing system wide problems, but that is an Data Storage, September 2009. example of what we hope to find. 2. Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, CONCLUSIONS and Robert Ross. Understanding and The Darshan deployment has been successful on improving computational science storage the LCF Intrepid system. 
A few projects have access through continuous characterization. used it to tune I/O characteristics to optimize for In Proceedings of 27th IEEE Conference on Intrepid and seen improvements in throughput. Mass Storage Systems and Technologies We also identified a project that was significant (MSST 2011), 2011. storage user but also suffered from slow I/O performance. We worked with the members of

5 U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, Ca September 26-27, 2011 Position Paper

Roger Combs Mike Farias Navy LSRS Program Sabre Systems roger.com [email protected] [email protected]

ABSTRACT The Navy Littoral Surveillance Radar System (LSRS) critical aspect ofHPC operations, frequently referred Program has demanding streaming aggregate 1/0 by LSRS as the "backbone." As fallout of requirements (double-digit GB/sec level). The LSRS this "backbone" ideology, when faced with an Program also has petabyte-level data management acquisition decision regarding SAM/QFS, only two issues and accompanying data management policies file systems came into play. Criterions for selection and procedures that are under constant review. were items such as company viability, development talent, and a deep R&D budget / bench. Ultimately, INTRODUCTION this list revolved around two solutions, LustrelHPSS This document will address current Navy LSRS best­ and GPFSIHPSS. Cost was not a criterion for parallel practices within our own High Performance file system selection for CYI2 migration. Computing (HPC), capacity environment. Areas of concern will be the following: Historically, cost was a criterion for selection of our a. Business of storage systems SAMlQFS file system and our current migration b. Administration of storage systems efforts are serving as a lesson-learned. Moving c. Reliability ofstorage systems forward, there has been concern about the viability of d. Usability of storage systems the SAMlQFS parallel file system beyond CYII in terms ofdevelopment and support. For our Business of storage systems: Currently the LSRS "backbone," there also have been concerns with Program uses Oracle Storage Archive Manager/Quick Lustre and Oracle IP strategy potentially being an File System (SAMlQFS) as the parallel file system issue. Concerns with Lustre stability were also negatively factored into the decision process from and respective Hierarchical Storage Management (HSM) solution to meet our data storage and reading publications such as the Livermore National Lab (LLNL) 110 "Blueprint" from management needs. Strategically, business viability 2ooi. of SAMlQFS under Oracle, post-Sun Microsystems From a business perspective, LSRS best practices acquisition, has and continues to be a major concern. dictate that the "backbone" be the most performant As a result ofseveral meetings with Oracle solution that can be afforded under the company with concerning SAMlQFS, ultimately the IBM General the deepest R&D bench. An additional requirement Parallel File System (GPFS) and the High is that the provider ofthe parallel file system Performance Storage System (HPSS) were chosen as middleware be relevant in the HPC marketplace. the future file systemlHSM solution. From both Storage acquisition (both cache and tape) are production experience and consensus among some DoE colleagues, a parallel file system is currently approached from a best-of-breed perspective and not regarded as the most challenging and a cost perspective. Administration of storage systems: Currently, storage system administration is handled and led entirely by private industry personnel. is also approached from a modular perspective in Strategically, LSRS recognizes that this is not best similar fashion to Mark Seager's Peloton and practices, and future administration and associated Hyperion based initiatives at LLNL. management of storage systems will have a government technical lead. Across all HPC By definition, we assert that file systems that are not functional areas, there will be government division horizontally scalable are intrinsically unstable. QFS leaders aka department heads (DHs). 
currently suffers from the preceding quality with one metadata server per namespace. The current, QFS Above and beyond organizational layout, file system is monolithic; LSRS has established a monitoring and benchmarking tools could always be requirement for no less than two production (primary better for storage infrastructure in general. and secondary) parallel file system namespaces for Interleaved or Random (lOR) benchmarking is used capacity high-availability. As disk caches for HPC to get theoretical maximums for 110 on capacity centers enter the petabyte and beyond level, we've storage. Above and beyond, lOR, solutions from found from production that file system scalability companies such as Virtual Instruments have been capabilities do not necessarily hold up to vendor explored to potentially better capture Fibre Channel claims. To provide continuity of daily operations, it 110 in near-real-time and identify bottlenecks. is critical that two namespaces are on the floor ready­ However, currently Virtual Instruments does not to-go at any given time. From experience, edge-cases support or project to support Quad Data Rate are frequently encountered, taking days or more (QDR) Infiniband, which is orthogonal to our HPC oftentimes weeks to solve. The preceding service­ I/O roadmap. losses or impacts are compounded when cache­ repopulation is considered with file systems at the In general, parallel 110 benchmarks seem limited petabyte level taking weeks to re-populate. With QFS and a bit immature given the projected requirements particularly, in terms of monolithic metadata for data-driven computing currently and in the architecture, and the associated production issues that 2 future • From a tape perspective, minimizing media resulted, LSRS realizes the importance ofchoosing that is more than a generation behind the current superior architectures and support organizations. industry products is policy. While tape certainly LSRS metadata storage is handled from an 10PS­ has value, from our production experience, lifecycle centric point-of-view and RAMSAN technology is management of tape has proven to be challenging. used for metadata storage. As a backup, physical Subsequently, we are facing the task ofascertaining solid-state disk is used to complement the RAMSAN. if obsolete media needs to be discarded or go While one monolithic namespace has performance through a relatively painful conversion process. advantages, we plan to leverage two namespaces in the very near future. Post QFS-migration, two OPFS Reliability of storage systems: Organizationally, namespaces will be established, prior to QFS­ file system reliability is believed to be directly migration a single QFS and OPFS (Data Direct related to file system scalability and stability. From Networks SF Al OKE "Oridscaler") namespaces will this, we borrow from the 2007 LLNL 110 be established. Blueprint!, in asserting that in general, file systems are sized to no less than three orders of magnitude Furthermore, to manage job quality of service, Navy below the compute platform(s) they support, i.e., a LSRS borrowed from the DoD High Performance 10TF system would need no less than 100B/sec of Computing Modernization Office (HPCMO), and aggregate 110 bandwidth behind it. Leveraging this established their Normalized Expansion Factor (NEF) 3 approach has significantly increased productivity Metric • The details of this metric can be found in an and nearly eliminated staging. 
In support of FY2002 whitepaper from HPCMO referenced below, consistent systems reliability and balance, file but essentially the metric is a normalized way to system and network interconnect acquisition measure job quality with no queue-wait time at-all precedes platform acquisition. Systems acquisition associated with jobs having a NEF of 1.0.

2 Heuristically, high priority work should not exceed issues ifallowed to veer too far from current an NEF of 1.7, whereas standard workload should generations and formats. not exceed 2.2. NEF metrics are collected for each individual job and performance data is kept Additional1y, from a business perspective, much has indefinitely. been learned in terms of interacting with vendors as well as integrators and reading between the lines. Usability of storage systems: To address usability, From an organizational perspective, Navy LSRS has LSRS strategically attempts to minimize the number shifted into an organization that is much more critical ofnamespaces deployed to two vice, having ofconsumed information than in the past. The multiple in the past. The preceding has obvious preceding applies across all functional areas. In other usability advantages, but also the performance words, asking "is what the vendor is saying useful," advantage ofhaving more drive spindles under one or "is what our integrator is saying practical?" All namespace. Block-level storage, in general, is too often, initially, answers were frequently no. abstracted away from analysts using in-house Oftentimes, further investigation led to invaluable developed mass-storage APIs. In our environment, insights into real vendor positions vice stated, or usability is dominated by performance and performance improvements that were never realized concessions are viewed as necessary. Generally, due to inadequate architectural and or operations performance, scalability, and stability are the three planning. Particularly with file system and archive dominant factors in strategic file system thought. materiel, betting on the wrong technology or vendor Usability is still a distant fourth-level consideration. can be costly, well into the seven-figures and beyond. Subsequently, staying in tune with the HPC CONCLUSIONS community has proven to be a very fruitful The most important aspects of file system and investment ofboth time and energy. archive best practices are an understanding that the system design-points need to be a function ofthe Finally, establishment offile system and 110 application sets, both current and projected, running roadmaps, Le., LLNL's 110 "Blueprints" has helped in production. Heuristics will get you close to Navy LSRS tremendously. Moving from ad hoc balanced, and generally keep architects out of approaches to file system and archive operations to trouble, but to really get outstanding performance planned and deliberate signed documentation has requires closer interaction with analysts. At Navy forced our organization into making much more LSRS, the file system is currently regarded as the informed decisions. Roadmaps, in general are key in "backbone" ofproduction operations and supporting a successful HPC program. subsequently a lot ofattention is paid to ensuring that it is sized properly and has an appropriate interconnect and bandwidth. REFERENCES

Tape is effectively sized using the write rate ofa I. Wiltzius, D. et al. (2007). LLNL FY07 I/O Blueprint, typical run ofthe most write-intensive application 2007. Retrieved from LLNL website: https:!!e-reports­ in production. While certainly valuable and viable ext.llnl.gov/pdfl342161.pdf for the long-term, tape has presented Navy LSRS 2. Cook, D. et al. (2009). HPSS in the Extreme Scale Era, with a number data lifecycle management issues 2009. Retrieved from NERSC website: regarding a myriad ofend of life tapes and http://www.nersc.gov/assetsIHPC-Requirements-for­ infrastructure (silos). Many ofthese tapes have SciencelHPSSExtremeScaleFINALpublic.pdf 3. DoD High Performance Computing Modernization questionable value, but due to this uncertainty, they Office (HPCMO) (2002). HPC System Performance create a lot ofwork in mining data from useful Metrics, 2002. Retrieved from HPCMO website: media and discarding useless media. While http://www.hpcmo.hpc.millHtdocsIHPCMETRIC!fy20 valuable, tape certainly presents maintainability 02_hpcmp _ metrics.pdf

3 U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

Oak Ridge Leadership Computing Facility Position Paper

Sarp Oral, Jason Hill, Kevin Thach, Norbert Podhorszki, Scott Klasky, James Rogers, Galen Shipman Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory {oralhs, hilljj, kthach, pnorbert, klasky, jrogers, gshipman}@ornl.gov

is evolving to take advantage of new equipment that is both ABSTRACT / SUMMARY This paper discusses the business, administration, more capable and more cost-effective. A center-wide file reliability, and usability aspects of storage systems at the system (Spider) [1] is providing the required high- Oak Ridge Leadership Computing Facility (OLCF). The performance scratch space for all OLCF computing OLCF has developed key competencies in architecting and platforms, including Jaguar. At its peak, Spider was serving administration of large-scale Lustre deployments as well as more than 26,000 clients and providing 240 GB/s aggregate I/O throughput and 10 PB formatted capacity. For archival HPSS archival systems. Additionally as these systems are storage OLCF uses the high-performance tape archive architected, deployed, and expanded over time reliability and availability factors are a primary driver. This paper (HPSS). Currently HPSS version 7.3.2 at OLCF is housing focuses on the implementation of the “Spider” parallel more than 20 PB of data with an ingest rate of between 20– Lustre file system as well as the implementation of the 40 TB every day. This paper presents the lessons learned HPSS archive at the OLCF. from design, deployment, and operations of Spider and HPSS and future plans for storage and archival system INTRODUCTION deployments at the OLCF. Oak Ridge National Laboratory‘s Leadership Computing Facility (OLCF) continues to deliver the most powerful THE BUSINESS OF STORAGE SYSTEMS resources in the U.S. for open science*. At 2.33 petaflops Storage requirements for both Spider and HPSS continue to peak performance, the Cray XT5 Jaguar delivered more grow at high rates. In September 2010, two new Lustre file than 1.5 billion core hours in calendar year (CY) 2010 to systems were added to the existing center-wide file system. researchers around the world for computational simulations These two file systems increased the amount of available relevant to national and energy security; advancing the disk space from 5 to 10 PB and will help improve overall availability as scheduled maintenance can be performed on frontiers of knowledge in physical sciences and areas of each file system individually. The addition of these file biological, medical, environmental, and computer sciences; and providing world-class research facilities for the nation‘s systems provided a 300% increase in aggregate metadata science enterprise. performance and a 200% increase in aggregate bandwidth. Additional monitoring improvements for the health and The OLCF is actively involved in several storage-related performance of the file systems have also been made. pursuits including media refresh, data retention policies, and file system/archive performance. As storage, network, In August 2010, a software upgrade to version 7.3.2 on the and computing technologies continue to change; the OLCF HPSS archive was completed, and staff members began evaluating the next generation of tape hardware. Implementation of this release has resulted in performance improvements in the following areas. * The submitted manuscript has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. • Handling small files. For most systems it is easier and Accordingly, the U.S. 
Government retains a non-exclusive, more efficient to transfer and store big files; these royalty-free license to publish or reproduce the published form of modifications made improvements in this area for this contribution, or allow others to do so, for U.S. Government owners of smaller files. This has been a big gain for the purposes. This research used resources of the Center for OLCF because of the great number of small files stored Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of by our users. Energy under Contract No. DE-AC05-00OR22725.

1

• Tape aggregation. The system is now able to aggregate areas within the archive. In addition, we recently made a hundreds of small files to save time when writing to request to the Top 10 users of the archive to purge any tape. This has been a tremendous gain for the OLCF. unnecessary data from the archive, and that request to • Multiple streams or queues (class-of-service changes). voluntarily remove data yielded well over 1 PB of data that This has enabled the system to process multiple files was purged from the archive. concurrently and, hence, much faster, another huge The OLCF has two Sun/STK 9310 automated cartridge time saver for the OLCF and its users. systems (ACS) and four Sun/Oracle SL8500 Modular • Dynamic drive configuration. Configurations for tape Library Systems. The 9310s have reached the manufacturer and disk devices may now be added and deleted end-of-life (EOL) and are being prepared for retirement. without taking a system down, giving the OLCF Each SL8500 holds up to 10,000 cartridges, and there are tremendously increased flexibility in fielding new plans to add a fifth SL8500 tape library in 2012, bringing equipment, retiring old equipment, and responding to the total storage capacity up to 50,000 cartridges. The drive failures without affecting user access. current SL8500 libraries house a total of 16 T10K-A tape Following this upgrade, in April 2011, twenty STK/Oracle drives (500 GB cartridges, uncompressed), 60 T10K-B tape T10KC tape drives were integrated into the HPSS drives (1 TB cartridges, uncompressed), and 20 T10K-C production environment. This additional hardware is tape drives (5 TB cartridges, uncompressed). The tape proving to be very valuable to the data archive in two drives can achieve throughput rates of 120–160 MB/s for distinct ways. The new drives provide both a 2× read/write the T10KA/Bs and up to 240 MB/s for the T10K-Cs. performance improvement over the previous model hardware and a 5x increase in the amount of data that can OLCF HPSS Archive Growth be stored on an individual tape cartridge. Along with 25 improved read/write times to/from these new drives, the OLCF now benefits from being able to store 5 TB on each 20 individual tape cartridge– effectively extending the useful life of the existing tape libraries. This has allowed the 15 OLCF to postpone its next library purchase until the first half of FY12. 10 PB The OLCF HPSS archive has experienced substantial growth over the past decade (Figure 1). The HPSS archive 5 currently houses more than 20 PB of data, up from 12 PB a year ago. The archive is currently growing at a rate of 0 approximately 1PB every 6 weeks, and that rate has doubled on average every year for the past several years. Jan-01Jan-02 Jan-03 Jan-04 Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Jan-11 Jan-12 Planning around such extreme growth rates, from both a physical resource perspective and an administrative Figure 1. OLCF HPSS Archive Growth perspective, while operating within a limited budget OLCF follows a collaborative open source development capacity, presents several challenges. The fact that tape model for its scratch space storage system. A multi-national technology and performance traditionally lags behind that and multi-institutional collaboration, OpenSFS [2] was of its disk/compute counterparts presents a fiscal challenge formed in 2010 by ORNL, LLNL, Cray, and Data Direct in supporting such a large delta in the amount of data taken Networks. 
The goals of the OpenSFS organization are to into the archive each year. We are forced to purchase provide a forum for collaboration among entities deploying additional hardware (tape libraries, tape drives, data file systems on leading edge high performance computing movers, switches, etc.) each year in order to meet (HPC) systems, to communicate future requirements to the operational and performance requirements. Add in the fact Lustre file system developers, and to support a release of that much, if not the majority of the archived data needs to the Lustre file system designed to meet these goals. remain archived for multiple generations of media (a very OpenSFS provides a collaborative environment in which significant amount of resources are spent in the process of requirements can be aggregated, distilled, and prioritized, repacking data from older tapes onto newer media), and a and development activities can be coordinated and focused tremendous amount of money is spent simply maintaining to meet the needs of the broader Lustre community. This “status quo” of the archive each year. collaborative open source development model allows the OLCF to have more control and input in high-performance The OLCF recognizes that such a model of exponential scalable file system development. OpenSFS recently archive growth is unsustainable over the long-term. We awarded a development contract for future feature have taken steps to mitigate this problem and slow the development required to meet the requirements of our next- growth rate down by introducing quotas on the amount of generation systems. OpenSFS has been extremely data users can store in their respective home and/or project successful in organizing the Lustre community, providing a forum for collaborative development, and embarking on Finally the OLCF has written a tool that can query the development of next-generation features to continue the Application Programming Interface (API) to the DDN progression of the Lustre roadmap. OpenSFS is working S2A9900 storage controllers. We use it to monitor the closely with its European counterpart (EOFS) and has performance of the backend controllers. Currently we signed a memorandum of understanding to align our capture Read and Write Bandwidth and IOPS. This quick respective communities. At LUG 2011 all communities glance of the overall system performance can give aligned with OpenSFS providing a unified platform from administrators a fast track to problem diagnosis if say the which to carry Lustre well into the future, meeting not only IOPS are orders of magnitude higher than the Bandwidth. our current petascale requirements but providing an In that case we know to search out an application that is evolutionary path to meeting our Exascale requirements. using one of the Lustre OST’s served by that DDN 9900. Being a center-wide file system, Spider is key to simulation, analysis, and even some visualization for the For archival storage systems OLCF is participating in a OLCF. Great care is taken to preserve system uptime, and collaborative proprietary closed source model led by IBM. maintenance activities are deferred to at a minimum once OLCF is currently providing more than 2 FTEs for this per quarter downtime. This outage affects all users of collaboration. 
While this model provides faster OLCF compute resources, but can help to address development cycles and better maintenance support performance issues and overall system maintenance tasks compared to the open source collaborative model, the cost that are harder to do real-time. Much of our administrative and business related risks associated with the private tasks are coordinated and done live, but with the Jaguar company leading the development project are the XT5 resource in maintenance period to limit the potential drawbacks of this model. issues for users if something were to go wrong. An example OLCF resources are classified as medium-confidentiality is the DDN controller firmware. We can upgrade one and low-availability according to the Federal Information controller out of every couplet, reboot it, and then do the Processing Standards (FIPS). While OLCF recognizes the partner controller without causing a file system outage. This cost benefit of using commodity storage hardware, the can help push the potential quarterly outage to twice per current state of technology does not allow us to deploy such year or even once per year depending on the software technology in our archival storage systems. However, as releases from DDN. these technologies continue to mature, it might be possible After a hardware failure caused partial file loss from the to take advantage of commodity storage hardware. Lustre file system, a full root cause analysis led to THE ADMINISTRATION OF STORAGE SYSTEMS procedural changes as well as changes in e2fsprogs The day-to-day administration of a large parallel file system packages, and spurred development of fast metadata and requires coordination between not only the members of the Lustre object storage targets for determining files that are team working on the file system (both hardware and affected by a large failure of the RAID devices on the software), but coordination throughout the computational backend. center as these activities have the potential to cause service Change Control outages and impact performance. The OLCF has The OLCF has used the configuration management successfully deployed packages for version control of key CFengine [5] for several years. In our implementation of administration scripts, as well as centralized configuration CFengine we have chosen to manage configuration files at management to handle individual node configuration a node level (host), a cluster level (groups of hosts related convergence. by system task), the operating system level (for each The OLCF uses Nagios [3] to monitor the health of the version of the OS), and a generic level that applies to all components of the system. Custom checks have been systems within the center. Additional work has gone in to implemented to additionally validate the correctness of the configure systems that are diskless requiring some file system – specifically are the devices mounted where workarounds within Cfengine and the rest of the OLCF they are supposed to be. Additional performance infrastructure. monitoring of the Lustre Network layer (LNET) are done In our case we use it to manage the configuration of the for the Lustre servers and routers in Nagios. Currently this Lustre OSS/MDS/MGS servers – we are unable to use it to information is not archived for future analysis it is only manage the storage controllers themselves. Additionally we used for failure detection. 
use version control to manage the configuration of the We use the Lustre Monitoring Toolkit [4] developed at Ethernet switches and routers for simple rollback. Lawrence Livermore National Laboratory (LLNL) to grab Managing the configuration of the Lustre file system is periodic Lustre level performance snapshots. Our current somewhat more difficult, but we wrote scripts and dataset is quite small so it is not useful for future configuration files that describe the file system and can be predictions, but we have seen interesting trends. used to start/stop the file system as well as monitor the health and status of the file systems.

3

We can additionally use the DDN API tool to query the Controls (the Unix group memberships) for controlling configuration of the storage controllers and note a deviation access to data. The OLCF project ID is a logical container from the baseline configuration specified. Work is ongoing for access control; where sets of users are members of to both correctly define the baseline as well as what projects and have access to the same information. These acceptable deviations and periods of deviation are before mappings also carry over to the HPSS archive. notifying administrators. The storage controllers are Additionally the OLCF sets secure defaults for permissions configured to send their log data to a centralized syslog on scratch, project, and HPSS directories. The default of server that is running the Simple Event Correlator [10] and project team only for project areas, and user only for user SEC can alert for matches to pre-configured rules. We also scratch areas, Global home areas, and HPSS “home areas”. have SEC configured to send all log data captured over a 1 These permissions are enforced through our configuration hour period to the administrators for help in solving issues management process (Cfengine), and users can change like failed disks or diagnosing performance problems like them by requesting the change via our [email protected] e- SCSI commands timing out. mail address. Our current configuration management solution does not Managing the Unix group memberships for users closely is perform validation of the configuration or syntax checking a requirement in our environment as these group for the configuration itself. The next generation memberships control access to data that can be considered configuration management solution (BCFG2) contains under export controls or confidential under industrial input validation and syntax checking on commit. partnership agreements. Ongoing audits of the memberships The OLCF has both development testbeds and a pre- of groups is part of the day-to-day accounts processing production testbed for verifying both changes as well as done by our User Assistance Group and the HPC system upgrades. We have however not found any bugs at Operations Infrastructure team. Additional logging this small scale that have saved problems when the infrastructure is being setup in conjunction with the Lustre change/upgrade is deployed. Some problems only reveal purging process developed through cooperation with themselves under sufficient load. Operations staff at NERSC. Cyber Security Technology refresh For the Spider parallel file systems at the OLCF we commit A key goal of the Spider parallel file system was to to quarterly OS patching (matching the above mentioned decouple the procurement and deployment of storage quarterly planned system outages), based on analysis of risk systems with that of large-scale compute resources at the and the location in the network. This is a delicate balance of OLCF. This goal has been realized and we are currently in keeping the system stable/available and satisfying the the process of procuring our next generation file system for desires of Cyber Security personnel in keeping systems at OLCF. To ease transition to these new file systems, for a the most recent patch levels. The HPSS side has weekly period of 1 – 2 years, the current generation and next maintenance windows (not always taken), and has the generation file systems will be operated in parallel. We ability to roll out security patches through those windows. 
have had success in migrating users between file systems, Outages of the archive do not affect the production compute but the process is not without pitfalls and prone to compute clusters where outages for the Spider file systems would users not paying attention to e-mail notifications and then take down the compute resources. having their jobs terminate abnormally as their application may expect to use a file system that is no longer in One example of how we can demonstrate certain file operations. Operating the file systems concurrently will systems only being available to certain nodes is via the allow users to make a gradual transition thereby minimizing /proc file system on the Lustre OSS and MDS servers. We the impact to our users. have a category 3 sensitivity file system and are working to monitor the mounts of that file system via the proc file Based on our enhanced understanding of I/O workloads of system on the Lustre servers. If a client that is not scientific applications, garnered from over 12 months of authorized to mount the file system is detected an alert is continuous monitoring of our file system environment sent to the security team and logs from the non-authorized coupled with a detailed understanding of our applications node are gathered to see who was logged in at the time of I/O kernels, we have developed an extensive set of the un-authorized access. benchmarks to evaluate storage system technologies offered by vendors. Our benchmarks are designed in a way that The OLCF has three categories of data protection that map they mimic the realistic I/O workloads and also allow to “publicly available information” (Category 1); data that integrated and traditional block-based storage solution is proprietary, sensitive, or has an export control (Category providers to bid on our RFP. 2); or data that has additional controls required based on the sensitivity, the content being proprietary, or export control One of the biggest challenges in tape archiving lies in the (Category 3). The OLCF has a very small amount of area of media refreshment. While replacing, updating, or Category 3 data and has a separate file system for Category increasing the amount of front-end disk cache or servers 3 processing. The OLCF uses Discretionary Access responsible for data movement to/from disk and/or tape is a relatively straightforward and non-intrusive process, the 2011 process of media refreshment presents many challenges. In HPSS 95% 99.6% 95% 99.9% >95% a large-scale tape archive such as that at the OLCF, where 10s of thousands of individual tape cartridges are managed, Spider 95% 99.8% 95% 98.5% >95% at any given time there may be thousands of tapes housing Spider2 N/A N/A 95% 99.9% >95% multiple PBs of data needing to be retired from service. Unfortunately, the data on those tapes no longer resides on Spider3 N/A N/A 95% 99.9% >95% disk cache in most cases, and must be read from the older tapes in order to be written to newer media. Under real life conditions, where resource constraints such as utilization on Table 2. OLCF Computational Resources Overall Availability data mover server(s) and the number of drives available to (OA) Summary 2010–2011 mount such media are a reality, the process of refreshing CY 2010 CY 2011 YTD older media can literally take years. 
For example, here at (Jan 1-Jun 30, 2011) the OLCF we are actively retiring 10,000+ 9840B tapes from service, and based on the performance to date, we Target Achieved Target Achieved Projected System OA OA OA OA OA, CY expect that process to continue for the next 2.5 to 4 years. through 2011 The OLCF has recently purchased a small quantity of June 30, 9840D drives so we can read the 9840B tapes at a 30% 2011 faster rate—in order to bring us closer to the 2.5 year figure. While that process is underway, we are HPSS 90% 98.6% 90% 98.9% >90% simultaneously retiring several thousand 9940 tapes from Spider 90% 99.0% 90% 96.5% >90% service, and that initiative is expected to take approximately one year to complete as well. Media refresh(es) will Spider2 N/A N/A 90% 99.1% ~99% continue to be a “day-to-day” operation going forward. For Spider3 N/A N/A 90% 99.2% ~99% purposes of planning and procurement, it is assumed that 5- 10% of total HPSS system resources will be utilized for media refresh operations.

THE RELIABILITY AND AVAILABILITY OF STORAGE " time in period ! time unavailable due to outages in the period % OA = $ '(100 SYSTEMS # time in period & The OLCF tracks a series of metrics that reflect the performance requirements of DOE and the user community. These metrics assist staff in monitoring system Overall Availability (OA) measures the effect of both performance, tracking trends, and identifying and correcting scheduled and unscheduled downtimes on system problems at scale, all to ensure that OLCF systems meet or availability. The OA metric is to meet or exceed an 80% exceed DOE and user expectations. overall availability in the first year after initial installation " time in period ! time unavailable due to outages in the period % or a major upgrade, and to meet or exceed a 90% overall SA=$ '(100 # time in period ! time unavailable due to scheduled outages in the period & availability for systems in operation more than 1 year after initial installation or a major upgrade. Reference Table 2. As indicated by these numbers, both HPSS and our Spider Scheduled Availability (SA) measures the effect of file systems provide extremely high availability. Overall unscheduled downtimes on system availability. For the SA availability of these systems continues to dramatically metric, scheduled maintenance, dedicated testing, and other exceed our operational requirements. The decrease in scheduled downtimes are not included in the calculation. overall availability in one of our Spider file systems in 2011 The SA metric is to meet or exceed an 85% scheduled compared to 2010 was due to an increase in the number of availability in the first year after initial installation or a dedicated system times taken to evaluate new features and major upgrade, and to meet or exceed a 95% scheduled stabilize the next Lustre release. Spider2 and Spider3 availability for systems in operation more than 1 year after remained available during these dedicated system times initial installation or a major upgrade. Reference Table 1. thereby minimizing impact to users. Table 1. OLCF Computational Resources Scheduled Within HPSS, DB2 is used as the storage mechanism for all Availability (SA) Summary 2010–2011 file/device metadata (ownership, status, location, etc.). CY 2010 CY 2011 YTD DB2 has been proven in the field over many years and is (Jan 1-Jun 30, 2011) well known for its reliability and availability features. System Target Achieved Target Achieved Projected The front-end disk cache for the HPSS tape archive is SA SA SA SA SA, CY comprised of several RAID6 arrays, with individual LUNS through 2011 “owned” by mover servers responsible for data flow June 30, to/from disk. Currently, in our configuration here at the

5

OLCF, each mover has a single FC or IB path to a target End-to-end checksums is a feature recently introduced to LUN, but we are actively working on modifying that HPSS. While not currently in use at OLCF, checksum configuration in order to provide multipathing for our disk utilities allow a user to perform a checksum of file content cache. and place the results in a User Defined Attribute for later comparison if/when the file is retrieved [6]. At this time at HPSS has the ability to store data on multiple levels of tape the OLCF, individual users/departments in some cases if so desired. Here at the OLCF, by default, data is written perform pre and post retrieval checksums in order to verify to one level of tape when migrated from the front-end disk data integrity. cache. Users have the option of specifying a different “Class of Service” in order to have their data written to two 24 x 7 Support Model levels of tape—providing an extra level of protection in In support of a 24x7 operation, we use Nagios to monitor case a media problem is encountered. Due to cost concerns, the correct configuration of the file system, and either that is only encouraged and/or recommended for critical through SMS messaging or direct phone calls from the data. 24x7 Computing Operations Center, notify the on-call administrator for the system of any critical event that causes While currently not in use at OLCF, HPSS does have High availability to be degraded or lost. In the event of facility Availability capabilities based on Red Hat Linux cluster events during off hours, the Operations Center will call the services. In this model, HPSS can provide failover HPC Operations Group Leader and then will notify affected redundancy for critical HPSS components—core server, teams. In the case of the Lustre team, scripts have been data movers, and gateway nodes. developed to quickly put the DDN controllers into power A feature that will soon be incorporated into HPSS is saving mode, power down the OSSes, MDSes, etc. to lower RAIT–Redundant Array of Independent Tape. RAIT will the heat load in the room. The DDN S2A9900 controllers provide an additional level of redundancy and fault have a disk sleep mode that parks the heads and spins down tolerance related to media failures without suffering the full the disk. cost penalty associated with the traditional method of THE USABILITY OF STORAGE SYSTEMS having data on more than one level of tape. Conventional methods for addressing I/O bottlenecks, such Maintenance Activities as increasing I/O backend capability by adding more disks Maintenance activities for the Spider file systems are with higher speeds, are unlikely to keep up with the planned for once per quarter and planning for the “next” performance issues due to the costs associated with storage. maintenance window begins shortly after the “previous” The problem is further exacerbated by the inefficiency of maintenance ends. It starts with a post-mortem analysis of I/O performance; some applications are unable to achieve a the previous maintenance, and then developing a list of significant fraction of the peak performance of the storage items to perform. At ~2 weeks pre-outage tasks are capped system. This can be due to a variety of factors the for the upcoming maintenance. A full outage plan is complexity of traditional I/O methods, where the developer developed including any dependencies that the Lustre team has to make a heroic effort to optimize the application I/O. has on other teams inside HPC Operations. 
This plan is This limit on usability directly impacts the possible documented on the internal wiki, and is shared through performance of the application. The OLCF has several normally scheduled weekly meetings as well as any implemented a multi-point approach to addressing these outage/maintenance prep meetings. Coordination with the challenges. Facilities group is also necessary if one of the reasons for The ADIOS I/O framework was designed with the the outage is work being done to the power or cooling aforementioned concerns in mind. The ADIOS I/O infrastructure. This planning process helps us to document framework [9] not only addresses the system I/O issues, but upcoming changes/modifications, record their completion also provides an easy-to-use mechanism for the scientific date, and also learn from issues that may come up during developers to work from I/O skeleton applications. Through the maintenance – making the next maintenance hopefully the use of an optional metadata file, which describes the smoother. They also help to enforce overall system output data and enables automatic generation of the output knowledge in the administrative team and enforce, through routines, the burden on the user is substantially reduced. the evaluation of the planned steps, a best practices ADIOS componentizes different methods of I/O, allowing approach to system administration. the user to easily select the optimal method. In concert with Data Integrity data staging, this work exemplifies a next generation For Spider the RAID protections are the only data framework for I/O. protections that are in place system wide. Applications can As common in many next-generation software projects, the choose to add data protections in their simulation and the biggest challenge is often one of technology adoption, modeling, but we’ve found that if we enforce anything it that is, getting users to change from current I/O hinders performance and may not be what the application implementations to ADIOS. As the ADIOS ecosystem needs. End-to-end checksums are currently under continues to grow, we believe that ADIOS will gain a wider evaluation for the next-generation Spider deployment. spread acceptance. ADIOS and our eSiMon dashboard are used by the administration of large-scale Lustre deployments as well as combustion, climate, nuclear, astrophysics, and relativity HPSS archival systems. Lessons learned from past Lustre communities. In particular we have created a I/O skeleton and HPSS deployments and upgrades help us to better generation system, using ADIOS, and have applied this in adopt to changing technology and user requirements. 10 applications, to make it easy for computing centers to analyze I/O performance from many of the leading LCF applications, with virtually no working knowledge of each REFERENCES application on their systems. 1. Shipman, G; Dillow, D.; Oral, S.; Wang, F. The Spider Center Wide Filesystem; From Concept to Reality. In ADIOS has worked well with all current users, and have Proceedings of Cray User’s Group 2009. often shown over a 10X improvement of using other I/O implementations; see Figure 2 for I/O performance of the 2. OpenSFS. http://www.opensfs.org S3D and PMCL3D simulations on the Jaguar system. 3. Nagios. http://www.nagios.org. 4. Lustre Monitoring Toolkit (LMT). https://github.com/chaos/lmt/wiki. 5. CFengine. http://www.cfengine.com 6. Extreme Scale Storage for a Smarter, Faster Planet http://www.hpss- collaboration.org/documents/HPSSBrochure.pdf 7. Schlep, F. 
The Guide to Better Storage System Operation. Indiana University Press, Bloomington, IN, USA, 1973. 8. [Polte2009] Polte, M., Lofstead, J., Bent, J., Gibson, G., Klasky, S.A., Liu, Q., Parashar, M., Podhorszki, N., Schwan, K., Wingate, M. and others. "...And eat it too: High read performance in write-optimized HPC I/O middleware file formats". Proceedings of the 4th Annual Workshop on Petascale Data Storage, pp 21-25, 2009. 9. Lofstead2009] J. Lofstead, F. Zheng, S. Klasky, K. Schwan. Adaptable, Metadata Rich IO Methods for

Portable High Performance IO. Parallel & Distributed Figure 2. ADIOS performance. Processing, IPDPS'09, Rome, Italy, May 2009, DOI=10.1109/IPDPS.2009.5161052. CONCLUSIONS 10. SEC - Simple Event Correlator: http://simple- Oak Ridge Leadership Computing Facility has developed evcorr.sourceforge.net/. extensive developed key competencies in architecting and

7 U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

Position Paper

Dominik Ulmer Stefano C. Gorini CSCS CSCS [email protected] [email protected]

ABSTRACT / SUMMARY growth rates of HPC systems and therefore storage systems become an increasingly more The Swiss National Supercomputing significant part of the investment and operational Center CSCS has introduced a business budget of the computing centers. While the Swiss model which turns data services from a National Supercomputing Centre CSCS reactively to a pro-actively managed recognizes the importance of data for service. A clearly defined center-wide file computational sciences, it does not have the system hierarchy in conjunction with a intention to turn from a high-performance set of specialized computers for data computing center to a data storage and analysis allows to optimize storage management center. It is therefore essential to systems characteristics like bandwidth or understand the role of data in the workflow of latency for different systems and computational scientists using supercomputers workloads, to plan and manage capacities and to accordingly architect the data services within the resource allocation process for offered by the center. computing time, and to leverage technical and financial synergies between the SYSTEM BOUNDARIES different service categories. Storage The main function of a HPC center is to enable services are based on Luster and GPFS computational scientists to use supercomputers software with TSM/HSM extensions. A for their research. This involves the preparation combination of different storage hardware and the processing of large data sets, either for technologies like SATA, SSD, and tape preparing input for computing runs or for are used for the services depending on analyzing data, which may be both, measured the individual requirements. data or the result of a computational job. In both cases, one deals with living data for ongoing INTRODUCTION research projects, i.e. a time span that is well For many years, data was a side-business to below 10 years. Long-term archiving for computing for HPC centers. Large-scale storage documentary purposes is not in the core business systems were architected and installed as of a supercomputing center. A data service at a peripherals of a supercomputing procurement. HPC center in this framework has to address the However, data growth rates exceed performance following topics:

1

- support of the computational workflow by request computational and storage resources in means of an integrated architecture of their proposals, which are evaluated by an computing, storage, and data analysis external committee with respect to their scientific systems quality and impact. The size of the storage - a storage hierarchy which is easy to request and of the compute cycle request must be understand by the user and provides a justified in the proposal and must be coherent to clear basis for the management of each other. The project receives a storage quota technical requirements which is shared between all members of the project team. The project file system is globally - a business model that allows the center to mounted on all CSCS user facilities and provides plan investments and operational costs in enough bandwidth for efficiently transferring advance and which is aligned with the large data sets to and from the scratch file business model for providing systems. It provides extended user functionalities computational resources. like snapshots. Data is kept on the project file THE CSCS STORAGE HIERARCHY system for the duration of the computational CSCS distinguishes three different levels of project (up to 3 years) plus 6 months in order to storage (see Error! Reference source not allow the user transferring the final data to a found.): longer-term storage system or to the storage resource of a successor project. Andrew Uselton 10/13/11 11:04 AM A) SCRATCH file system Deleted: Figure 1 C) STORE file system The purpose of the scratch file system is to provide a storage container for running an Large research projects are often carried out by individual computational job resp. an individual consortia, which combine many research groups suite of computational tasks. Data remains only and projects as identified by the CSCS call for temporarily on the file system and must be copied proposal process. Research consortia share data to a different storage level for permanent storage. between the individual projects and teams and The file system has no quotas for user or groups. they manage the data sets over a longer timespan. Old files are automatically deleted in order to CSCS offers the store file system for such maintain capacity. The scratch file system is local consortia. In contrast to the scratch and project to an individual computer and its technical file systems, resource on /store is not for free, but characteristics are specified according to the requires a financial contribution. Up to a certain architecture of the system and the expected limit, academic consortia can get storage space on workload. store on the basis of matching funds. Above the B) PROJECT file system limit and for non-academic consortia, direct investment and operational costs must be fully The project file system provides a data paid. A consortium must describe its overall management and storage space for an individual research plan and goals, in order to assess the computational project. CSCS issues a call for strategic importance of the consortium to science project proposals twice a year. Researchers can and the HPC center and to define the duration of the contract. /store is a global file system that can be accessed from all user-accessible computers at CSCS. As it is based on a hierarchical storage management system, which is to a large extent based on tape, bandwidth is lower than to the project file system.

Figure 1: Hierarchy of file systems at CSCS TECHNICAL IMPLEMENTATION CSCS would be interested to change from the All three storage levels are built with parallel file proprietary GPFS technology to open- system technology in order to ensure performance source/public domain software. Lustre in its scalability. current state does not seem to be a viable option. The scratch file systems are currently mainly If Lustre will be developed further in a coherent based on Lustre, which allows for optimal fashion, with stable funding and a clear roadmap, read/write performance. Stability is sufficient and it could be envisaged to use in the future Lustre enhanced functionalities are not required because as the fundamental file system technology in of the shared nature of the file system. Both, LSI combination with pNFS for mounting non-HPC and DDN storage controllers have been deployed clients. for different implementations, mainly as direct As described above, we consider data analysis attached scratch. Because of the meta-data systems as an integral part of a data service. performance bottlenecks in the current Lustre CSCS has decided to offer a portfolio of different architecture, SSDs have been successfully tested computer architectures for data analytics: a for improving meta-data performance, although standard, fat node cluster; a GPU cluster; a large- the fundamental problem of a non-distributed shared memory system based on the SGI Altix meta-data store can only be eased but not UV architecture; and a massively-multithreaded completely resolved with this approach. Cray XMT2 system. Access to these systems is The project file system is characterized by the granted within the allocation process for combination of parallel HPC-type file system computing time on the main HPC systems. features with some enterprise storage CONCLUSIONS requirements. It must be able to handle a large By defining a coherent business model for data number of files with very good meta-data and storage, the Swiss National Supercomputing performance and has to offer functionalities like was able to simultaneously optimize costs and quota, snapshots, and integration with backup scientific workflow at the center. For long-term software. CSCS uses very similar storage sustainability, however, users additionally have to hardware as on the scratch file systems, driven be educated to rethink their storage needs and from separate storage servers that are connected patterns by means of in-situ data analysis and to a high-speed Infiniband network backbone. rewriting the I/O in their codes. GPFS has been selected as software technology CSCS considers IBM’s GPFS technology to for this file system because of its RAS features currently be the most advanced solution for but also because superior meta-data performance highly available and powerful global parallel file compared to Lustre. systems. We use GPFS with its characteristics of For the store file system, raw I/O performance is an enterprise file system only for the global levels not as important as for the other two file systems. of the storage hierarchy. Thus, the number of Technical and financial analysis showed that it is required client licenses can be drastically easily implemented with the same GPFS reduced by using a small number of I/O technology as /project combined with the forwarding nodes per system. The bandwidth- TSM/HSM product of IBM. 
The TSM solution at hungry local scratch file systems, in which every CSCS also includes a backup and compute node is a client, are built with Lustre. disaster/recovery functionality which enables us Solid-state memory technologies have developed in the case of the total loss of the file system to into a viable alternative or addition to storage recover all GPFS metadata within a few hours hardware solutions, boosting latency and IOPS- and all critical files within two days. By sharing sensitive components of the storage system to licenses, infrastructure, and knowhow, new performance levels. operational costs can be kept low.

3

U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

HPC Enhanced User Environment (HEUE) Position Paper

Thomas M. Kendall John Gebhardt U. S. Army Research Laboratory Lockheed-Martin/U. S. Air Force Research DoD High Performance Computing Modernization Laboratory Program DoD High performance Computing Modernization [email protected] Program [email protected] Cray J. Henry DoD High performance Computing Modernization Program [email protected]

ABSTRACT / SUMMARY • Provide an intermediate level of storage The DoD High Performance Computing between the HPCMP’s traditional two tier Modernization Program (HPCMP) is now scratch and archive architecture. This implementing a major change to at all its DoD intermediate storage (i.e. CWFS) will enable Supercomputing Resource Centers (DSRC) customers sufficient time to analyze results, through the introduction of a center-wide file and archive analysis results rather than 3- system (CWFS) and an integrated life-cycle dimensional restart files. The center-wide file management hierarchical storage manager (ILM system is sized to allow 30 days of analysis HSM). before transfer to archive.

• The introduction of the center-wide file Following discussions with its top consumers of system also creates the opportunity to archival capacity, the HPCMP architected a enhance and upgrade interactive customer strategy to enable its customers to reduce archival support with high performance graphics and requirements. The key elements of the HPCMP’s large memory to support the analysis efforts. strategy are: INTRODUCTION • Provide tools to enable customers to associate The HPCMP built forecasts of future archival project specific metadata with files in the storage capacity needs based upon past history archive; enable automated scheduled actions and concluded that the current growth rate was keyed against specific metadata; enable users unsustainable. Left unchecked, storage would to control second copy behavior; and enable consume the majority of the HPCMP’s budget by user specified logical data constructs suitable the end of the decade. A key finding of the for building case management features. subsequent analysis was that the archive costs remained in an affordable range if the archive

1 growth rate was constrained to 1.4 times the emerged was to seek a partnership with industry growth of the previous year’s growth. This aimed at fostering the integration of a leading finding is tied to an assumption that industry information life cycle management solution with doubles tape capacity every 24 months. If the a leading hierarchical storage manager. Our growth rate of the archive exceeds the rate of tape software requirement allowed for an initial capacity increase, the HPCMP has to fund tape capability that could evolve into a fully integrated libraries, slots, and potentially licensing for the solution over 10 years. additional capacity. The combined ILM HSM requirement was called After gathering input from the principal “HPCMP Storage Lifecycle Management.” A investigators of the projects that consumed the Request for Proposals was released in March of vast majority of the program’s archival capacity, 2009 and a contract awarded in August of 2009. several recurring themes emerged: With the ILM+HSM addressing the software • Existing storage tools were insufficient to requirements for improved tools to manage data, manage large datasets and the use of the remaining primarily hardware requirements filenames to capture relevant metadata was no were for the Center Wide File System and the longer practical. Utility Server. This second component of the • Raw computational outputs were being acquisition strategy was focused on the required archived due to insufficient analysis time for hardware to deliver the new services. data stored in scratch space. Much of the requirements for the Center Wide

• Performing analysis using batch resources File System (CWSF) and utility server derived was adding to the problem of insufficient time from the winning ILM+HSM solution. for analysis. Subsequently, two Requests for Proposals were These observations were further vetted and issued. The RFPs required responses for the six ultimately formed into requirements for the DSRC locations and for a range of file system HPCMP’s next generation storage solution. A capacities and performance levels. They also working group, with representatives from the required the inclusion of a 10 gigabit network HPCMO, the DSRCs, and user advisory groups, fabric for connection of the Center Wide File was formed. The group was chartered to further System components, the utility server nodes, and develop and refine the requirements and to the HPC system login nodes at each DSRC. The develop the architecture for data flow within the storage capacities requirements ranged from 250 HPCMP. TB to 2 PB. The I/O performance requirements The architecture that the storage working group ranged from 8.0 to 40.2 GB/s and from 70,000 to arrived at included a combined information 320,000 file open/creates per second. lifecycle management and hierarchical storage Awards for the CWFS and utility server were management layer. made by Lockheed-Martin in September 2010 A subsequent effort surveyed the information and deliveries were completed in December, lifecycle management and high performance followed by acceptance and integration. The storage markets, leading to the creation of an systems were transitioned into production acquisition strategy. A key element of this sequentially center by center between June and strategy was the separation of hardware and August 2011. software requirements and provisioning. In 2011, the HPCMP also took steps to refresh its A market survey determined, to no great surprise, tape archive hardware. Based on the earlier that a mature information lifecycle management analysis showing that a doubling of tape capacity solution integrated with hierarchical storage every 24 months was a key component of cost management did not exist. The strategy that containment, the program compared commodity LTO drives with the proprietary Oracle T10000 line. Although the LTO family drives have a Enhanced User Environment. The Program’s lower initial purchase price than the T10000 advisory bodies have responded positively to the family drives, the lifecycle costs for LTO were goals and progress. The initial feedback for the found to be significantly higher. Drivers were the additive analysis capability provided by the utility need to replace the media with each generation of server and Center Wide File System has been LTO drive in order to realize the increase in positive. Once the Storage Lifecycle capacity and the slightly slower capacity growth Management solution is fully deployed for rate (15x over ten years for LTO and 10x over production use in October 2011, the full effects of five years for T10000). HEUE will be measured and reported. In terms of the adoption of commodity hardware, CONCLUSIONS the tape industry would need to seek ways to It is too early to gage the impact on the growth of reduce the frequency of complete media the HPCMP’s archive as a result of the replacements while meeting or exceeding the 1.4 acquisition and deployment of the HPCMP compound annual capacity increase target.

3 U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Stephan Graf Lothar Wollschläger Jülich Supercomputing Centre Jülich Supercomputing Centre [email protected] [email protected]

ABSTRACT / SUMMARY main memory. Each rack contains 1024 compute The storage configuration for the supercomputer JUGENE nodes (CN) and 8 IO nodes (576 IO nodes in in Jülich consists of a GPFS cluster (JUST) and two Oracle total), with each one connected via 10GbE to the STK SL8500 tape libraries. In this paper the actual configuration and the next upgrades are described. JUST storage. Measurements show that a single Furthermore a project for using flash storage as a kind of IO node gets an IO performance of 450 MB/s cache memory for the disk storage is introduced. reading and 350 MB/s writing. For a whole rack it is 3.6 GB/s reading and 2.8 GB/s writing. The INTRODUCTION maximal peak IO for the full system is 260 GB/s The Jülich Supercomputing Centre (JSC) operates reading and 200 GB writing. Assuming that 50% two supercomputer: The BlueGene/P System of the main memory of one rack (1024 CN) is to JUGENE and the x86 based JuRoPA system. be written on file system (e.g. for checkpointing), While the JUGENE uses the remote GPFS cluster the required time is 5 minutes for reading and 7 JUST, the JuRoPA users works on a local Lustre minutes for writing. To write 50% of the main based storage. There the users can access their memory of the full system in 15 Minutes requires files in the GPFS file system via dedicated nodes. 0.5 *144 TiB /1800s = 44 GB/s. The JUST cluster The consideration in this paper for the actual and based on DS5300 storage devices provides 66 the future storage configuration/implementation GB/s. But only half of the cluster is used for the are focused on the JUGENE and the JUST GPFS fast scratch file system $WORK. The other half cluster. of the clusters hosts the $HOME and $ system. This implicates that the $WORK can The users can access three types of file systems: be saturated by 33 GB/s / (8*0.35 GB/s) = 12 On $HOME they should store there code and racks writing to the file system. develop their program. On the JUST cluster 8 building blocks provides For the job run they are urged to use the scratch the $WORK file system, each containing a file system $WORK to get the maximum IO DS5300 with 36 LUNs per DS5300 having a size performance. of 8 TB (RAID6). This leads in a total capacity for $WORK of 2.3 PB. To archive their results the data should be moved to the $ARCHIVE file system. BLUEGENE/Q INSTALLATION IN 2012 In 2012 the JUGENE will be replaced by a JUGENE STORAGE PERFORMANCE BlueGene/Q system consisting of 6 racks. There TODAY are 8 IO nodes per rack, each having a dual The JUGENE is build up of 72 BlueGene/P 10GbE port with an aggregated bandwidth of 1.5 Racks with 1 PF peak performance and 144 TiB

1

GB/s. So the maximum throughput of a rack is 12 system $WORK are not backed up and will be GB/s and the full system 72 GB/s. deleted after 90 days. If 50% of the main memory of on rack (1024 CN with 16 GB RAM per node) are to be written on disk it will last (approximately) 12 minutes. So to write 50% of the full system main memory to the storage in 15 minutes, a bandwidth of 50%*384 TiB /1800s = 115 GB/s are required. Therefore we will get a storage upgrade for the JUST GPFS cluster. We are planning to install 8 DDN SFA12000 and getting an aggregated bandwidth between 100GB/s and 160 GB/s for the scratch file system $WORK. The performance for Figure 1: Data groth on the GPFS cluster JUST $HOME and $ARCHIVE will also increase, but The JSC operates two Oracle STK SL8500 this is not concerning us. libraries with an aggregated capacity of 16.6 TB FLASH MEMORY AS SCRATCH FILE (T10K-B tape drives). In figure 1 the exponential SYSTEM data growth on our storage cluster can be seen. In parallel the JSC will investigate a new storage We expected to run out of space in the third concept using flash memory cards as a kind of quarter 2011. Because of the ordering and cache between the IO nodes and the ordinary disk shipping delay of the new hardware it became storage. It is a European Union funded project, a critical the last month. But now the new hardware PRACE (Partnership for Advanced Computing in has arrived and is going in production. 16 T10K- Europe) prototype for next generation C tape drives have been added and the new tape supercomputers. generation (which is able to store 5 TB) will replace the old tapes step by step. 4 x86 systems each with 2 fusionIO ioDrive Duo 320GB SLC will be set up. The bandwidth of the This kind of upgrade is the typically way for us to flash card is 1.5GB/s. The cumulated manage the growth of data amount. For the next 6 performance of these 4 nodes should be 12 GB/s, years we plan to enlarge the capacity of the two a similar value as 8 BlueGene/Q IO nodes (one libraries to 80 PB just by upgrading to the next rack). Using the GPFS features to setup different tape drive generation T10K-D. kind of storage pools and to implement placement CONCLUSIONS and migration policy rules a concept will be On our supercomputer a specific maximal I/O modeled, that new created files will be created on performance is available and for the user it is flash, and GPFS will migrate the files to disk in reasonable to get the maximum performance from the background automatically. the file system. But this is often difficult to This concept will be implemented with real achieve. Therefore it is mandatory to train the BlueGene/Q hardware as soon as it is available. users and give them the knowledge to speed up their jobs I/O. For this purpose we have ARCHIVE STORAGE EXPANSION developed the SIONlib in Jülich. The users can The users should store their results on the use this library in there code to map very easily $ARCHIVE file system. There the data will be local task I/O to one file. By using the SIONlib it migrated by Tivoli HSM on tapes (weighted by is possible to get nearly 100% of the performance size and last access time). For safety two versions on the scratch file system $WORK from the (COPYPOOL) will be held on tape. Furthermore JUGENE. We also use this tool for benchmarking every file of the $HOME and the $ARCHIVE file parallel file systems.[1] system will be backed up. Files on the scratch file The other subject concerns the long time data what happens when a project ends. 
These storing. Till now and for the next years we are problems must be tackled in mid or long term. able to store all user data in our archive system. The new technologies keep up with the data REFERENCES growth in Jülich. But there are upcoming . [1] http://www.fz-juelich.de/jsc/sionlib/ questions like how long must the data be hold or

3 U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011

Andrew Uselton Jason Hick NERSC/LBL NERSC/LBL [email protected] [email protected]

ABSTRACT / SUMMARY contents of memory to persistent local storage in a combination of reasonable time and cost. This position paper addresses the business of storage systems and practices related to planning for future systems (I-1A) and establishing bandwidth requirements (I-1B), with some discussion also relating to the administration of storage systems and the monitoring of specific metrics (II-2A). The best practice is to balance I/O with compute capability. We present a quantitative characterization of “HPC and I/O system balance” by Figure 1. Conventional HPC I/O Planning Guidelines examining the relative costs of compute Additional I/O resources provide diminishing resources and I/O resources on the one returns, so there is a point of balance at which hand and the relative impact of compute bandwidth is “just enough”, and in this case the and I/O activities on the other. heuristic is to move all of memory in about 1000 seconds. INTRODUCTION An HPC system with too little I/O infrastructure to support its workload could leave much of the compute resource idle as it waits for I/O operations to complete. The idle compute resource represents an opportunity cost in that it may have no other useful work to do during the wait.

BACKGROUND One study [2] suggests that memory capacity is the key determinant of necessary I/O bandwidth and capacity. Figure 1 presents a traditional Figure 2. Peak bandwidth to system memory guideline for balancing I/O. Figure 2 presents this heuristic as applied to The relationship between performance and several HPC systems. By that metric, systems memory comes from the need to flush the with a value over 1.0 have over-provisioned I/O

1 subsystems relative to system memory capacity. of workload per dollar spent. As an example, if A subjective review of such systems reveals that you spend 10% of your HPC system budget on users are happy with the I/O bandwidth they I(r) I/O infrastructure ( ≅ 0.1), then the nodes deliver. C(n) The purpose of this paper is to propose an should be spending 10% of their time on I/O, and 1−U alternative characterization of balance using a the rest on computation ( ≅ 0.1). cost-based model in conjunction with the U € compute and I/O workloads of the HPC system. This is a relatively intuitive idea given the As a starting point, this discussion abstracts away simplifying assumptions, but it begs the question, much of the complexity to arrive at some core "What is B on my€ system, given its workload?" ideas. Those who design and purchase HPC systems are very familiar with the total cost and the fraction SYSTEM BALANCE spent on I/O infrastructure. On the other hand, it As a first simplifying assumption, suppose that is not at all clear what the value of U is. It will the cost of an HPC system is composed entirely certainly be different at different times and for of the budget for the compute capability and the different workloads. We propose that monitoring budget for the I/O capability. Next, suppose that the jobs and I/O on the system for any given day's the work produced by an HPC system is activity and for longer intervals will yield the measured as the number of jobs completed value of U. weighted by the size of each job in two dimensions: the number of node-seconds used in CHALLENGES the computation and the number of node-seconds Some system designs and I/O strategies attempt used in I/O. to improve I/O performance by departing from Further, assume that the aggregate compute this simple model. For example, a strategy that capability is near linear in the cost of the of overlaps computation and I/O will yield a higher compute nodes: utilization. In that case it becomes important to estimate both the expected impact and the extent C(n) M n = n × to which the strategy is implemented in practice. where Mn is the marginal cost of nodes. Similarly, If a strategy can entirely "hide" I/O activity but assume the aggregate I/O capability (measured as only affects 10% of the workload, then the its peak€ rate) is near linear in the cost of the I/O simplified model is still close to correct. infrastructure: The model has plenty of room for improvement. For example, the cost model does not need to as I(r)= Mr × r simple as presented. There may be fixed costs and where Mr is the marginal cost of adding a unit of nonlinearities, and the model could incorporate bandwidth r. them without difficulty. The model can also Now let € the utilization U of the HPC system be include other aspects of HPC system architecture, given by the fraction of node-seconds spent on for example, adding node-seconds spent in (node compute activity, given a particular workload. to node) communication. In some cases that Our characterization of "system balance", given communication will compete for bandwidth with n and r, is given by: the I/O requirements, leading to additional complexities. ⎛ U ⎞⎛ I(r) ⎞ B = ⎜ ⎟⎜ ⎟ ⎝ (1−U)⎠⎝ C(n)⎠ CASE STUDY The Carver IBM Dataplex cluster at NERSC was Our claim is that at B = 1, the system is in provisioned with approximately 15% of its balance in that it achieves the maximum amount budget dedicated to I/O infrastructure. 
The € system has 30 TB of memory suggesting a target applications provides the initial estimate for the bandwidth to storage of 30 GB/s using the behavior of the system as a whole, and the heuristic from the background discussion. estimates can be iteratively refined via a Carver’s measured bandwidth is about 25 GB/s, generalized linear regression. This is a so it is designed to be at about 83% of that target. computationally expensive task but straight Carver runs the Integrated Performance forward, in principle. Monitoring (IPM) library [3] with every As an example, we examine the IOR [5] file scheduled job. Each job produces a report at the system benchmark, which runs as a regularly end of its execution giving the time spent in scheduled test of the Franklin scratch file computation, the time spent in I/O, and the systems. 175 such tests were run in July 2011, amount of data moved (among other quantities). and the system job log records the start and stop IPM provides a comprehensive profile of time of each job along with the number of nodes compute and I/O activity for a given interval. used. The IOR application runs using the same From that profile it is possible to directly parameters in order to provide a repeatable calculate the utilization. For example, in June health-check of the file system. Each job writes 2011 U ≅ 0.94. The balance factor for the actual 4GB to the file system from each of 64 tasks on workload is around 2.5. By this measure, the 16 nodes. It then reads that data back in. Jobs system’s balance favors I/O and could handle a generally run for 150 to 200 seconds, but can run heavier load. € much longer when the file system is occupied with other I/O. The jobs are submitted to an I/O- CORRELATING I/O ACTIVITY WITH JOBS Most HPC systems do not have IPM or other oriented scheduling queue, which (voluntarily) direct measures of the utilization. Without that serializes I/O intensive applications. information we do not know what balance has LMT records the bytes written and bytes read for been achieved in practice after having applied the each server every five seconds. That data shows heuristics from Figure 1. An alternative strategy the I/O resulting from the IOR jobs and anything is under development at NERSC that infers the else running at the same time. In order to utilization U from server-side I/O monitoring calculate the average behavior we “warp” with the Lustre Monitoring Tool (LMT) [4]. (artificially lengthen or shorten) the sequence of Server-side data is anonymous with respect to the LMT observations for each job so that they fit the nodes that generate the I/O. Nevertheless, it is same length-scale – chosen as the median job run often possible to infer the job from the I/O length. A standard linear regression on that data pattern. When that can be done comprehensively set provides the calculated average behavior. it will yield the utilization as before, and therefore give a quantitative gauge of the balance. On NERSC’s Franklin Cray XT4 there are commonly more than one hundred jobs running at a given time, and the I/O workload resulting from that compute workload is potentially composed of I/O from many jobs simultaneously. Often, an application runs many times repeating the same I/O pattern each time. From that collection of jobs (call it a job class) we calculate the average I/O behavior for the application, which is an approximation of its expected behavior in isolation from other jobs. 
The individual calculated behavior of each of the whole suite of Figure 3.

3

Figure 3 displays the result of carrying out this establish a rigorous evaluation of the balance of analysis. The x-axis is the artificial time scale – the system. arbitrarily set to 0 to 1 – to which each series of It is always difficult to argue that past observations is warped. The y-axis gives the performance is guide to future behavior. When aggregate data rate (blues for writes and red for planning for a new HPC system the application reads) of the application over the course of the behavior produced in this analysis must be idealized run. The single dark line of each color is combined with theoretical considerations for how the calculated average behavior of the the new system might behave differently. application. Shown in a lighter shade is the Nevertheless, the application characterizations collection of 175 separate contributing runs as and the underlying model that produced them are they appear after being warped. Most of the a valuable starting point to be used during the contributing runs follow the average behavior procurement of new systems. closely, and demonstrate that the IOR test was running without much interference. A few traces Once deployed, a new system needs data depart wildly from the average and it is those runs acquisition systems like IPM, Darchan [6], and that were in contention for I/O resources. Once LMT in order to evaluate the system balance we calculate the idealized average behavior all of actually achieved. the applications with significant amount of I/O, those idealizations become initial estimates for ACKNOWLEDGEMENTS the coefficients in a big matrix implementing the We would like to thank Dr. Anthony Gamst of generalized linear regression. For a file system UC San Diego for discussion, explanations, and that does not have a lot of I/O contention the guidance of the generalized linear regression initial estimates will be close to their final values techniques alluded to in this paper. and the computation will converge quickly. In other cases the computation may take significant REFERENCES resources. The end result is a quantified, job-by- 1. NERSC Storage Policies, produced by the job measure for the impact of the application on Storage Policies Working Group, 2010. the I/O system from which we recover the 2. J. Shoopman, LLNL internal study on memory utilization U and therefore the balance B. capacity to archive data generated 2008-2009. 3. D. Skinner, Integrated Performance CONCLUSIONS Monitoring: A Portable Profiling The HPC community has developed a set of Infrastructure for Parallel Applications. Proc. heuristics to guide the design of HPC systems so ISC 2005, International Supercomputing that the I/O capability is matched to the users’ Conference, Heidelberg, Germany, 2005 needs. In one case where the heuristic was applied in an effort to make a system I/O-friendly, 4. A. Uselton, Deploying Server-side File System a comprehensive characterization of balance Monitoring at NERSC, Cray User Group using our metric showed that the system Conference, Atlanta, GA, 2009 deployment was successful. Our measure for 5. H. Shan, K. Antypas, J. Shalf, Characterizing balance, B = 2.5, says that the system is cost and Predicting the I/O Performance of HPC effective for an even heavier I/O load than was Applications Using a Parameterized Synthetic observed. Benchmark. Proc. SuperComputing 2008, Austin, TX, 2008 The proposed job-log-and-LMT analysis extends the applicability of our metric to cases where 6. P. Carns, R. Latham, R. Ross, K. Iskara, S. 
direct observation of the utilization U is Lang, K. Riley, 24/7 characterization of impossible or impractical. That characterization petascale I/O workloads. Proc. Of 2009 can be combined with system cost details to Workshop on Interfaces and Architectures for Scientific Data Storage, Sep. 2009.

5 U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

Position Paper - Business Breakout

Kim Cupps Lawrence Livermore National Laboratory [email protected]

ABSTRACT / SUMMARY monitor and measure use of resources and plan buys “just in time” (JIT). Strong partnerships This “Business of File Systems and Archives” with vendors as well as hedges against the trap of position paper will describe several LLNL best becoming beholden to a single vendor or practices that help formulate and optimize the cost/benefit analysis all center’s face in their technology for file systems or archive is another quest to provide an ongoing, exceptional best practice. Lastly, planning is crucial to any computing environment within a finite budget. coherent business strategy. LLNL formalizes the plan for file system and archive resources by INTRODUCTION producing the “I/O Blueprint”, a procurement and effort planning prioritization document. LLNL’s primary computing complex serves approximately 2800 users with access to 1.7 peak Gathering User Feedback PetaFLOPs of compute, 14PB and 300GB/s of Making sound business decisions requires a good parallel file system capacity and bandwidth and understanding of the current state of affairs. It’s 42 PB of archival storage. LLNL seeks to quite easy to live in a world isolated from those optimize the user experience, providing a long- who use the file systems and archives every day, lived, highly productive environment for our just as it is common for users to work around customers, while staying within our budget. issues and inconveniences rather than report While cost/benefit is easy to say, it’s a them. We have the typical trouble ticket system complicated balance of usability, availability, user surveys to help us understand issues; this is flexibility of administration, longevity, feedback we receive on a daily basis. In addition, productivity and many other factors that form the LLNL holds quarterly user meetings that include basis of our spending decisions for file systems a user talk as well as a set of talks on relevant and archives. There are several practices we use center activities. These meetings include a to help us make decisions on what and how much general feedback session. Most important, LLNL to buy, what improvements must be made and rotates through “Science Team Interviews” so which software to develop ourselves and which to that we meet with teams every two years or so to buy off the shelf. First, we place a high value on elicit actionable feedback. A team of center gathering direct user input via a variety of personnel representing management, platform, mechanisms including user meetings, surveys and file system, archive and user services personnel customer interviews. Another best practice is to goes out to the customer work area and asks

1 pointed questions about the compute be the most cost efficient way to purchase environment. In general, we find that problem consumables, and with tape densities doubling areas spring out of discussion and are not the (recently quintupling) every year and a half, the sorts of issues that people call and report, often value of this best practice is clear. they don’t even write down the issue in advance of the meeting. For example, the development of HTAR, a multi-threaded file packaging and Partnerships Coupled with In-House transfer mechanism from the local file system to Software Expertise the HPSS archive, was the direct result of complaints received from users regarding slow Strong vendor partnerships are crucial to transfers of small files to HPSS. The successful operation of a center. Changing development of Lorenz, our user dashboard that vendors incurs added costs such as retraining shows file system usage, NFS quotas and many staff, forming new relationships and learning new other customizable fields was also the result of support processes. However, becoming a strictly strong user collaboration and input. All of these one vendor operation significantly increases risk. user feedback mechanisms serve to inform the These risks include the company going out of center about where our customers feel we have business, dropping support for the product line or the most room for improvement – this is critical unreasonably raising prices. A best practice at to our planning. LLNL that reduces the inherent risk of the strong vendor partnership, is a staff of in-house software Measuring and Metrics, JIT developers forboth open source projects (Lustre, Another way we identify areas for improvement SLURM, RHEL) and joint development contracts is by collecting data on various aspects of the (HPSS). The software developers provide key production environment. We gather data on value by: everything from component failure rates, to sizes 1) Solving production problems of files stored in the file system, to file system immediately (increasing system uptime specific and center-wide uptime percentages for and thereby user productivity); both classified and unclassified file systems and 2) Providing a strong voice for DOE HPC archives. requirements Unclassified Lustre File System Availability Statistics 3) Providing a mechanism for the center to 8/1/11 - 9/1/11 remain technically competent and Unplanned Planned Uptime % engaged in leading edge technology File system Impaired Down Impaired Down Center Wide 0.83 3.42 0 0 99.86% Other important components of strong vendor partnerships include membership on vendor lscratch a 0 1.58 0 0 99.79% lscratch b 0 0 0 0 100% customer advisory boards, leadership of product lscratch c 0.08 1.75 0 0 99.75% user groups and regular attendance at vendor lscratch d 0.75 0.08 0 0 99.89% executive level roadmap briefings. Tracking isn’t limited to analysis of failures and Advanced Technology and Testbeds availability, it’s also crucial to our “just in time” purchase strategy for archive media and tape The Hyperion testbed at LLNL includes an 1152 drives. Cartridges are used up by both a steady node QDR IB interconnected commodity Linux stream of new data being written to tape as well cluster with two SANs (GE, IB) connected to as an ongoing repack from soon-to-be-retired multiple vendor storage subsystems. The testbed media of about 500 cartridges a month. Careful serves multiple purposes. 
It is a partnership that tracking of cartridges insures that tape buys are allows vendor partners to test their software and done on-time, but not so far in advance that the hardware at scale. It is a platform that allows tape is never used. Just-in-time has been shown to LLNL to investigate interesting technologies (NAND Flash, tiered storage, NFS Calculating bandwidth and capacity requirements accelerators…) as they become available. for archive, file system and networks is not an WhamCloud performs Lustre testing at scale. exact science, and we have used different “rules Mellanox tests new cards and drivers. DDN and of thumb” over time to plan purchases. For Netapp test new controller technologies and example, directly copied from the FY08 I/O LLNL investigates all of this new technology Blueprint: “The rule of thumb used in the past before it comes to market. for capability platforms was that the file system The Lustre testing at scale on Hyperion, and should provide between 100MB/s and 1GB/s of LLNL’s participation with others in the concept bandwidth for every TeraFLOP. Dawn file system of creating “Lustre Centers of Excellence” is an bandwidth requirements are projected to be excellent example of how investment in strong 200MB/s per TeraFLIN and is projected vendor partnerships and testbeds can significantly to be 100MB/s per TeraFLIN leading to 100GB/s impact the quality of a product. and 500GB/s estimates for delivered SWGFS bandwidths for these machines.”

From our FY11 Blueprint: “For many years now, Planning bandwidth has been the basis for our file system LLNL produces a yearly planning document procurements. We believe that our bandwidth called the I/O Blueprint. The goal of the Blueprint requirement is a function of platform memory. is to achieve a balanced infrastructure to support Typically our ratio of file system bandwidth to the Center’s compute platforms. The Blueprint platform memory (GB/s per TB of memory) has documents planned purchases in global parallel varied from 0.6 to 0.8. Currently that ratio is at file systems, NAS, visualization, network and 0.5GB/s/TB in the OCF and 0.6GB/s/TB in the archive areas. It also discusses area specific SCF. Note that the Sequoia file system is Center issues and plans for remediation. currently 0.4GB/s/TB.” As they should, requirements definition methodologies have The FY05 Blueprint is the document that called changed as architectures evolve and lessons are for converting from dedicated filesystems to a learned. site-wide global parallel file system. During the era of local file systems, each platform purchase required that dedicated file system hardware be bought for use solely by that platform. Without global file systems, a platform was only able to leverage the speed and capacity that it came with and data needed to be moved or copied to each platform when required. Today, global file systems allow new platforms to leverage existing disk resources and allow existing platforms to take advantage of global resources added over time. As a result we are able to enhance file system resource utilization, eliminate the copying of data, ensure that file system hardware is best- of-breed rather than that available from a The Blueprint is also the document where LLNL particular platform vendor, and focus on center- outlined an initial plan to address exponential wide I/O requirements rather than that of archive growth and associated unmanageable out- individual machines. In short, there is a clear year costs. The LLNL Archive Quota cost/benefit win. implementation was planned as a first mitigation. Archive advisory quotas were implemented in

3

December of 2010. Initial talks with users were forward and providing the best environment held beginning in August of 2009. Archive possible. growth rates have slowed. We conjecture that this slow down is due to a number of factors, not just the quota implementation. First, simply communicating the cost of storing a particular user’s data resulted in that user deleting over 1PB of data in the archive. Raising awareness of costs is a best practice that we have used with very good results. At LLNL, nothing is archived automatically; all transfers are initiated by users. Activities that increase storage to the archive include aggressive global parallel file system purge policies, file system retirements and planned file system down times. A reduction in one causes a reduction in the other. Finally, the biggest factor in archive growth is platform memory capacity. As new large platforms are added, archive growth increases. We expect substantial archive growth with Sequoia and we expect the archive advisory quota implementation to help contain the rate of growth over time. While we expect the Quota implementation to help, it’s too early to claim it as a best practice.

SCF Writes

Terabytes

Year CONCLUSIONS Providing a balanced infrastructure to optimize user productivity, while minimizing costs, requires attention and focus in a number of areas. The cycle of events includes formalized planning, which is informed by regular collection of data and metrics, user feedback, advanced technology investigations and testbed evaluations. Strong vendor partnerships and in-house software expertise are key enablers to quickly moving U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

David Cowley Pacific Northwest National Laboratory [email protected]

ABSTRACT / SUMMARY

The EMSL facility, located at Pacific GENERAL APPROACH TO STORAGE Northwest National Laboratory (PNNL), SYSTEMS operates terascale HPC and petascale EMSL HPC systems have had more types of storage systems to support experimental filesystems than at most HPC sites. Each is and computational researchers in intended to meet different levels of capacity, molecular sciences. This position paper performance, and accessibility. So far each of addresses the Workshop’s Business of EMSL’s HPC systems has been procured with its Storage Systems track and describes own filesystems of 3 types: EMSL’s approach to operating file systems and data archives.

Filesystem Capacity Nominal Bandwidth Type INTRODUCTION Global 20 TB 1 GByte/sec The Environmental Molecular Science Home Laboratory (EMSL) is a scientific user facility Global 277 TiB 30 GiByte/sec located at PNNL. EMSL houses PNNL's largest Scratch concentration of high performance computing Node 350 GiB/node 400 MiByte/sec/node systems and data storage systems. While other Scratch organizations within the Laboratory are working 808 TiB Aggregate 924 GiByte/sec Aggregate on obtaining their own significant HPC and data storage resources, the center of mass has not shifted yet. We will be careful in this document The global home filesystem is available to all to distinguish between PNNL and other sub- nodes in the HPC cluster. its capacity is organizations within PNNL, including EMSL. determined loosely by a "not too big to be backed up" rule of thumb, its performance is determined EMSL operates a suite of cutting-edge scientific loosely by a "'cd' and 'ls' commands have to not instruments, capable of generating terabytes of be slow" rule of thumb. data per week. EMSL has operated HPC systems ranked in the top 20 of the Top500 list since The global scratch filesystem is a parallel 2003, in addition to a multi-petabyte archive for filesystem both larger and higher performance scientific and computational data. EMSL has than the home filesystem. It is available to all been working since 2010 on a scientific data and nodes in the HPC cluster, and its performance and metadata management system known as capacity requirements have been derived from a MyEMSL. formula based on the theoretical peak Information Release #PNNL-20685 1

performance of the system. EMSL does not have storage budget is available year-to-year. a requirement to checkpoint whole-system jobs as Successive generations of storage have so far had some other sites do, so this eases some of the the sheer capacity to swallow up data from earlier requirements on this filesystem. generations of technology, provided there is a The node scratch filesystems provide high disk bridge between the technologies. Ensuring that bandwidth per Flop to each node. For these there is such a bridge between generations is filesystems, performance in terms of write feasible provided there is sufficient planning and bandwidth and Ops/second are again derived investment both in time and dollars to execute it. from theoretical peak performance on the Exponential growth rates are sustainable with compute node. Capacity has been a side effect of proper planning and funding, but this only the need to provision enough disk spindles to provides for storage space. By itself, this does meet the required performance. This may change not address the problems of managing, as magnetic disk and solid-state disk technologies understanding, or using the accumulation of data. evolve. While providing a scratch filesystem on To that end, EMSL is investing in creating a new each compute node does involve considerable scientific data and metadata system known cost and added maintenance, the aggregate internally as MyEMSL. MyEMSL is addressed performance has been more scalable and better in in the PNNL position paper for the Usability of absolute terms than shared parallel filesystems. Storage Systems track. EMSL has found this to be a differentiating and enabling capability, and will carefully consider it SOFTWARE FOR FILE SYSTEMS AND in its upcoming system procurements. ARCHIVES EMSL uses the software technologies that best fit EMSL is planning to move to a "two systems" its needs and budget, whether open source or approach where rather than procuring one large proprietary. As much as possible, we wrap system every 3 to 4 years, we will procure proprietary solutions so that they play well in an smaller systems every two years and overlap their open-source environment. We were an early lifecycles. We will switch to having the home adopter of the Lustre filesystem, having used it filesystem shared between compute clusters. We since the implementation of our MPP2 system in expect that each cluster will have its own high 2003. We have built low-cost filesystems out of performance parallel global scratch filesystem. commodity hardware up to 1.2 petabytes (the We will consider critically whether new systems "NWfs" storage system in 2008), and PNNL is require node scratch filesystems. building a similar institutional Lustre storage system that will have a 4-petabyte capacity by the MANAGING ARCHIVE GROWTH EMSL's growth in archive capacity is driven by end of fiscal year 2011. In 2008, EMSL two factors, the output of scientific instruments identified a need to implement a hierarchical and the output of its HPC systems. In effect, the storage system, and in 2009 retired NWfs in favor scientific instruments are computers themselves, of a new HPSS system. as is the HPC system, so Moore’s law drives data HPSS provides the right mix of capacity, growth rates in both cases. Fortunately magnetic expandability, and scalable performance for media growth rates (sometimes cited in "Kryder's EMSL's needs. 
The EMSL HPSS system Law") are on a similar trajectory so storage provides archive storage capacity, and we have systems likewise exhibit the behavior of offering implemented open source filesystem-like twice the capacity for roughly the same cost year interfaces to it, in addition to the traditional native over year. This behavior is expected to continue HPSS interfaces. [1,2] through 2020 . EMSL and the rest of PNNL continue to make This allows us to provide space for exponential use of Lustre and will continue to do so until it is data growth as long as a relatively consistent clearly dead or orphaned. At this point, PNNL has enough experience and expertise to not remain serviceable with good maintenance require Lustre support. Even if advanced and procedures. It has been said, "we don't need five long-promised features (e.g. multi-way clustered nines, we just need two or three!" metadata) are never delivered, Lustre's cost, Most of the barriers we see to adoption of performance, scalability, and "good enough for commodity storage have to do either with low us" reliability meet our needs very well. performance or lack of Reliability, Availability The difficult to control costs are in additional and Serviceability (RAS) features. In our work scope, i.e. supporting more systems or more experience, neither of these has presented users without attendant increases in budgets. insurmountable difficulties. The engineering Inflation alone causes increased labor costs over approaches and software tools we apply allow time, creating difficulty in operating with flat or performance to be scaled linearly (or nearly so) declining budgets. Additional work scope by adding more components. The essential RAS compounds this problem if not very carefully features we need are typically available in mid- managed. grade commodity or enterprise hardware. At the other end of the spectrum, many higher end RAS HARDWARE FOR FILE SYSTEMS AND features such as active-active failover/failback ARCHIVES prove to cause as much downtime as they are Being at the upper-mid range of HPC in terms of advertised to prevent! system sizes and performance, EMSL is not using and does not expect to use custom hardware in SYSTEM EVOLUTION the foreseeable future. We take the best EMSL plans a three to four year lifetime for its advantage we can of common off the shelf HPC systems, and has recently decided to switch hardware and the economies of scale that come from operating one large HPC system to two with it. smaller systems with overlapping lifecycles. EMSL does expect to continue to take advantage With this change, we will pull the persistent of commodity storage technologies for the "home" filesystem out of the cluster and place it foreseeable future, mostly in conjunction with the where it can be shared between systems and Lustre filesystem. Selected high-value storage provide continuity between HPC systems as they systems may be constructed of enterprise-grade age out and are replaced. storage for serviceability features. While we may EMSL procured a new HPSS storage system for apply creative engineering approaches to archive purposes in 2009, and plans to operate it commodity or enterprise-grade building blocks, through at least 2017, with planned lifecycle we do not foresee significant use of custom replacements and technology refreshes for the storage hardware. storage (disk and tape) components. 
I/O capacity and bandwidth requirements for filesystems on EMSL HPC systems are established as a function of peak performance ratings. We have not carefully specified metadata operation or operations/second requirements on our filesystems, though we have re-engineered metadata servers to improve performance when there is a need to do so. MTTI requirements have not been rigorously specified either, though we do specify that common failures (e.g. single disk failure on a node) must not interrupt computation or I/O. During technical review, we assess whether the I/O system is robust enough to Information Release #PNNL-20685 3

CONCLUSIONS REFERENCES

EMSL employs multiple tiers of data storage 1. Kryder, H, Kim, C. After Hard Drives – What systems with different capacity and performance Comes Next?. IEEE Transactions on characteristics to satisfy various needs. Storage Magnetics, Vol. 45, No. 10, October 2009 system capacities are planned based upon http://www.dssc.ece.cmu.edu/research/pdfs/Aft projected output from the facility’s scientific er_Hard_Drives.pdf . instruments and from HPC system performance. 2. Zyga, L. What Comes After hard Drives? All storage systems have a planned lifecycle with http://www.physorg.com/news175505861.html expansions, technology refreshes, and retirement . as appropriate. EMSL generally uses commodity or enterprise-class components as building blocks, in concert with a mixture of open source and proprietary software.

U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

Position Paper: Reliability and Availability

John Gebhardt Thomas Kendall Lockheed-Martin/U. S. Air Force Research U. S. Army Research Laboratory Laboratory DoD High performance Computing Modernization DoD High performance Computing Modernization Program Program [email protected] [email protected]

Cray J. Henry DoD High performance Computing Modernization Program [email protected]

ABSTRACT / SUMMARY • HPC Home and Applications file system The DoD High Performance Computing • Root services file system Modernization Program (HPCMP) has • Center Wide File System (CWFS) implemented a multilayered storage approach to • Lifecycle Management System cost effectively meet the storage needs of a o Archive system diverse customer base. Users’ can wait in the o Tape storage batch queue indefinitely (but typically start within seventy-two hours) and then can run for up to Data Center Facilities fourteen days (or longer with special arrangements). To maximize systems availability, The Department of Defense Supercomputing several layers of storage and storage use policy Resource Centers (DSRC) operates twenty-four are implemented. hours a day, seven days a week. Each of the DSRCs utilize different combinations of UPS, INTRODUCTION redundant commercial power feeds and diesel The HPCMP has layered several storage and file backup power generation capabilities to allow systems within the environment. Each system has operations to continue through minor power specific reliability and availability characteristics fluctuations and allow for graceful equipment and use policy driven by system availability shut for prolonged power outages. In the event of requirements. a prolonged power or cooling failure, procedures This paper discusses the reliability and are activated to shutdown the systems, which in availability of the following types of storage turn quiesces the storage. This approach nearly constructs within the HPCMP: eliminates unplanned, abrupt outages. • HPC scratch space file system . File System Overview

1

Each DSRC hosts a CWFS which supports the addition of check sums on read from storage to lifecycle management system, the utility server host. and each HPC system. Each HPC system HPC Scratch Space File Systems typically includes a combination of RAID storage devices that are logically decomposed into three The HPC systems are in high demand. The data file systems -- Scratch space, Home and sets used to set up runs and the data resulting Applications and Root services. The utility from runs is very large and very transitory. In server is similarly configured and the lifecycle order to assure there is sufficient scratch space to management system includes multiple data stores, stage the next job, HPCMP policy allows for user archival servers and agents described later. data to exist on HPC scratch storage for 10 days. Within ten days after data creation, users must move their data to the center wide file system or archive. After ten days the data it is subject to removal to make space available for the future jobs. Like the CWFS, the HPC scratch space is typically not backed up due primarily to the transient nature of the data and the amount of data. HPC scratch storage systems are normally procured with the HPC system. The HPC vendors propose the file systems and storage architecture as components within an overall HPC system. The HPCMP request for quotations (RFQ) for Center Wide File System HPC systems states that the storage system must The HPCMP has recently deployed Center Wide be architected to be resilient and robust; highly File Systems (CWFS) at each DSRC. The reliable components are to be utilized. purpose of the CWFS is to provide users with a The HPCMP’s RFQ defines the minimum fast central storage capability that can be easily aggregate data transfer rates between the compute accessed by all major HPC systems and servers nodes and the disk subsystem are based on within the center. It is intended to serve as the specified ratio values for total system memory “near-HPC” intermediate storage between scratch bandwidth (GB/s) to 1000 times the disk file system on each HPC system and the long- subsystem I/O bandwidth (GB/s). These ratios are term archive. It has been sized to providing a 2.07, 1.68, and 1.34 for read, write, and full- minimum of thirty days of quick intermediate duplex respectively. For example, a 200 node storage. Through CWFS users can move their system with 50 GB/s of memory bandwidth per entire data sets among the scratch file systems node would have total system memory bandwidth and the archive file system. They can perform of 10000 GB/s. To meet the minimum full-duplex pre and post processing on their data conveniently requirement, the system would require an I/O avoiding the slower access times associated with bandwidth of 7.46 GB/s (i.e. (200*50 archived data. This approach affords users the GB/s)/(1000*1.34)). The minimum formatted time necessary to make more thoughtful decisions usable disk storage size must be at least 40GB per on what data, for example after a large run, really processor core. needs to be archived and what can be deleted. HPC vendors must also commit to monthly The CWFS is not backed up. It does contain all interrupt counts and overall systems availability the redundancy features of HPC scratch with the (> 97%). 
This encourages the vendors to offer reliable storage systems due to the penalties Compute nodes within an HPC system that are imposed for not meeting the system availability architected with disk drives do not include commitments. redundancy. Since the disk drive images on The files systems end up on RAID protected compute nodes are easily reproducible, redundant storage that is either RAID 5 or RAID 6. RAID 6 drives in a RAID configuration are not employed. is becoming more common place which is due to Archive File Systems the increasing scratch space sizes which are Arguably, the tape archive systems are one of the architected with increasingly larger and lower HPCMP’s most important systems which require costs SATA disk drives. These large drives take the highest level of availability. Prior to much longer to rebuild leaving the RAID set integrating the CWFS, and the short time to live vulnerable to another drive failure. Vendors for data on HPC scratch the archive had to always architect the storage with redundant paths from be available to the user. the hosts and multiple controllers. Metadata redundancy and availability is expected. In addition, a user can have their HPC batch jobs in the work load management system queues in In order to maximize performance HPC scratch some cases up to fourteen days prior to the job storage does not have end-to-end protection running on the HPC system. Jobs that require mechanisms or check summing. input data from the archive system would not Downtime for preventative maintenance may be necessarily want to move their data immediately executed at the recommendation of the HPC with the short time to live of the data in HPC vendor. Typically, these downtimes are to update scratch. the HPC operating environment and not The DSRCs have architected redundant servers, necessarily for the storage systems. redundant server component, redundant SAN Support for the systems and storage is twenty switches, tape drives, very high speed RAID 5 or four hours a day, seven days a week, four hour RAID 6 disk caches and metadata devices for the onsite support for hardware problems. archive systems. HPC Home and Applications To maintain this high availability, the archive Home and application file system on the HPCMP systems have been maintained at twenty four HPC systems are considered more permanent hours a day, seven days a week, with four hour then the HPC scratch file system. These file response time. systems typically are hosted on the same storage Implementation of an active-active redundant as the HPC scratch space and utilize the same file archive server solution remains a priority. system software.

With only minor exception, home and application file systems protected in the same manner as the Tape Archives HPC file system. Home and application file The HPCMP currently provides the user by system are backed up on a daily basis. default, two copies of their data on tape. One Root File Systems copy is at the DSRC and the other copy is sent via Root drives on major infrastructure servers and network to an archive system in another facility key elements in an overall HPC system (login and then copied to tape. Archive file system nodes, admin nodes, etc) predominately are metadata is backed-up daily and stored locally as architected with multiple disk drives that are well as remotely. protected with RAID 1 or RAID 10. The costs for Storage Lifecycle Management additional disk drives vs. performance make this a very worthwhile architecture decision.

3

The ILM is architected with multiple Oracle servers via Oracle RAC for redundancy and scalability. Additional reliability features incorporated in the design include redundant server components, a performance disk subsystem for the Oracle databases, utilizing multiple fiber channel paths per server, RAID 5 and RAID 10 volumes, redundant controllers, and redundant network interfaces connected to redundant switches.

CONCLUSIONS In order to maximize HPC cycles for researchers, the HPCMP will continue to employ redundancy and other availability measures where practical to The HPCMP is currently implementing storage maintain availability of the systems, file systems life cycle management (SLM). SLM tightly and storage. The HPCMP would like to see the couples the current archive systems with an vendor community continue to develop the Integrated Lifecycle Management (ILM) capabilities for end-to-end data protection (e.g management tool. The ILM is an Oracle Real T10DIFF) to ensure bit error rates are extremely Application Cluster (RAC) environment that will low and that bit errors are identified and contain the file metadata from the archive as well corrected. As storage space on disk drives as user applied metadata such as whether or not to continues to grow, new RAID schemes to make a disaster recovery copy, when the data can decrease rebuild times and maintain file system be deleted, what project(s) that data belongs to, performance are desired and would be etc. implemented.

LA-UR-11-05005 Approved for Public release Distribution is unlimited.

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

M’hamed Jebbanema Los Alamos National Labs [email protected]

ABSTRACT / SUMMARY backups including logs to different media combined with This paper will discuss a strategic and automated scripted an-offsite hosting similar backups in which to restore from solution to ensure High Performance Storage System in the event of a disaster. (HPSS) metadata integrity, availability and recoverability in Goal: the event of a disaster. Develop a metadata backup/recovery solution that is simple, customizable, robust, and easy to maintain. Objective: INTRODUCTION An essential component of High Performance Storage Provides recoverable copy of databases. System (HPSS) is the metadata and tools to manage and Metadata can be recovered to any Point In Time retrieve the metadata. Metadata can be described as the (PIT). DNA of the storage system as metadata defines the Guarantees transactional consistency. elements of the transactional data and how they work Purpose: together. Any loss of metadata proves to be disastrous to Develop scripts that would leverage DB2 integrated data integrity. utilities and tools to automate all functions ensuring In addition to implementation of resilient and fault metadata recoverability. tolerance hardware and software best practices, we must guarantee the high availability and recoverability of 1.Design metadata. 1.1 Logging Proper DB2 log file configuration and management is an Since HPSS uses IBM DB2 as its metadata management important key to data stewardship and operational system, manual and conventional DB2 standard backups availability. and lack of logs archival monitoring tools dramatically increase administrator workload and probability of data All database changes (inserts, updates, or deletes) are loss. recorded in the DB2 transaction logs. Transaction logs are primarily used for crash recovery and to restore a system A Perl language based comprehensive DB2RS (DB2 after a failure. Recovery Solution) delivers a robust, configurable, and customizable recovery management with minimal efforts. DB2 does have the ability to perform dual logging on different volumes as well as different media thereby The Scheduler, Backup, Verifier and Checker(s) services increasing redundancy for both active logs and archive logs. work intrinsically in a holistic approach by using SQLite database as a central repository for all DB2RS activities. We Configure DB2 log sets by implementing MIRRORLOGPATH and dual log archives where each This presentation will cover the design, and integration of active & archive log set on independent LUN that uses DB2RS to ensure metadata integrity and recoverability separate physical disks and different type media (TSM) when disaster occurs. (Figure 1&2). Paper Content In order to achieve transaction integrity and zero or little data loss , DB2RS makes automated scheduled local

1 LA-UR-11-05005 Approved for Public release Distribution is unlimited.

Figure 1. Logging Scheme

Figure 3: Overview of DB2RS

Perl Language was an evident choice to automate these tasks in order to take advantage of our locally developed Perl modules and for backward compatibility reasons.

Scripts were structured based on type of service or function and SQLite database (Example 1), an open source, self- contained, embeddable, zero-configuration was chosen as a central repository for all services activities.

Four types of services (Figure 3) were automated and can be run either via cron (Example 2) or manually.

Scheduler : Schedules backup in SQLite database. Backup : Checks SQLite for scheduled backups and perform backup service. Verifier : Validate the integrity of backups. Checker & High Level Checker : perform diagnostics, sanity checks, monitoring and error reporting 1.2 Backup service Each service has multiple parameters entries in The scope of this design goes beyond the process of configuration (Example 3) file subject to customization backing up objects to disks or tape but rather encompasses based on disaster recovery requirements. several functions that ensure the validity and integrity of the backups. To prevent invalid data and allow synchronization among

all services, applications state and locks were included in the initial design.

2

LA-UR-11-05005 Approved for Public release Distribution is unlimited. ChkHrsBkpSched = 64800 All services activities are captured in a centralized log and a # checker: stats Failover paths for existence of logs (any files) man page was integrated into the code for quick reference. within the last ChkHrsFailover; 1 hr (Example 4) ChkHrsFailover = 3600 Other useful utilities were coded and added to the mix to # ChkHrsFailover = 0 # test only; make sure checks paths ensure completeness and automation of DB2RS. immediately..no delays # checker: row created in Sqlite for each db combo int he last Example 1: SQLite database service activities ChkWksSqlite; 1wk 20110816000601 CFG TSM FULL 1 #ChkWksSqlite = 604800 20110816041003 Verified ChkWksSqlite = 3600 20110817000306 SUBSYS1 TSM INCREMENTAL 1 [constantvars] 20110817051002 Verified DB2DIR = /opt/ibm/db2/path 20110817000609 CFG TSM FULL 1 [db2] 20110817045902 Verified databases = cfg, etc 20110818000301 SUBSYS1 DISK FULL 1 [db2:cfg] 20110818045903 Verified images = /usr/db2/image/path 20110818000606 CFG DISK FULL 1 archlogs = //path/path/etc.. 20110818041032 Verified failarchlogs = /usr/db2/path/path/etc…. 20110819000301 SUBSYS1 DISK INCREMENTAL 1 imagespct = 60 20110819045902 Verified archlogspct = 60 tsmstartdate = 20110202 Example 2: Services scheduled via cron 06 0 * * 1,2,3 schedulme -cron -db cfg -tsm -full -sessions 1 Example 4: Man page 10,59 4,5 * * * bkp -cron > /dev/null 2>&1 ENSURE_ARCHLOGS(1) User Contributed Perl 02 6,7 * * * bkpv -cron > /dev/null 2>&1 Documentation ENSURE_ARCHLOGS(1) 0,41 * * * * bkplogtrim -cron > /dev/null 2>&1 NAME 09 12 * * 4 checker -cron > /dev/null 2>&1 ensure_archlogs - Ensure that database has truncated 09 12 * * 1-3,5 checkerhl -cron -disk > /dev/null 2>&1 logs recently 23 * * * * ensure_archlogs -cron > /dev/null 2>&1 SYNOPSIS

ensure_archlogs Example 3: DB2RS configuration file Within root's or instance owner crontab... [timemachine] # bkp: should we make another bkp if one full bkp already exists 01 * * * * /lanl/hpss/path/db2rs/ensure_archlogs -cron within bkpDiffHrs (integer hrs); On the command line as root or instance owner... bkpDiffHrs = 20 # ensure_archlogs -force Force a log archive right #bkpv: looks back in SQLite for unverified bkp images earlier now than diffWkBkpv; 24hrs # ensure_archlogs -force=2h Archive if not performed diffWkBkpv = 86400 in last 2 hours. #checkerhl & checker: each service must have run successfully in # ensure_archlogs Same, but use default intervals from the last (n) days; 3 day the dbconfig ChkDysServSuc = 259200 # ensure_archlogs -disable=1h Disable scripts in cron #chekerhl: calculate timestamp before which combo bkps and logs should exists; 1wk for 1 hour ChkhlWks = 604800 # ensure_archlogs -enable Re-enable scripts in cron # checker: all services Service must have run within; 8 hrs # ensure_archlogs -help Show the synopsis for this ChkHrsSerbkp = 64800 script ChkHrsSerbkpv = 28800 # ensure_archlogs -man Show the man page for this ChkHrsSersched = 28800 script #checker: check bkp service progress run..(avoid runaway and DESCRIPTION hangs)..service must not be running This should generally be run via cron. It makes sure # for more than ChkHrsRunbkp; 2 hrs that DB2 has archived a log within a certain amount of time ChkHrsRunbkp = 7200 # checker: if no bkp within ChkHrsBkpSched after being for each HPSS database. If it hasn’t it tells it to do that with scheduled; 20 hrs the "db2 archive log" command. This limits the exposure of

3 LA-UR-11-05005 Approved for Public release Distribution is unlimited.

Un-archived logs to a certain period of time while also Example 6: High Level Checker (Checkerhl) output allowing minimum impact on DB2. This should probably not info: Found 56 logs for CFG DISK backup taken at run more frequently than one hour. 20110813041003 (2314 - 2369) notice: Everything looks good for CFG DISK backup taken

at 20110813041003 1.3 Checker(s) and error reporting service info: Found 99 logs for SUBSYS1 DISK backup taken at Applies to Logs and backups. 20110811045903 (3116 - 3214) 1.3.1 Checker (Example 5) notice: Everything looks good for SUBSYS1 DISK backup 1.3.1.1 Logs Checks: taken at 20110811045903 Performs the following tasks: info: Found 96 logs for CFG TSM backup taken at Exclusive analysis utilizing log data mining list of 20110810041003 (2274 - 2369) events. notice: Everything looks good for CFG TSM backup taken at Disk & TSM logging failures detection. 20110810041003 FailArchive path check. info: Found 146 logs for SUBSYS1 TSM backup taken at Immediate reporting and notifications to SSM. 20110808045903 (3069 - 3214) notice: Everything looks good for SUBSYS1 TSM backup

taken at 20110808045903 1.3.1.2 Backup Checks: info: Found all 4 backup combinations Performs the following tasks: Sanity Checks 1.4 Error Reporting to SSM (HPSS interface): Diagnostic Checks Immediate reporting and notifications to SSM. Exit codes are called to generate sub-class of errors based on custom binary scheme for multiple error reporting. Example 5 : Check output To simplify error reporting and monitoring, three (3) info: No scheduled backup found at this time! categories were considered to match HPSS error reporting info: checker: No DB2 Logs in Failover paths...good thing style. info: checker: Found all 4 scheduled combination backups in Sqlite database..schedulme is working fine Minor info: checker: Last successfull bkp run at 20110817055901 Backup, verifier, or scheduler service did not run within info: checker: Last successfull bkpv Run at defined time (Hrs). 20110817071029 Backup have exceeded estimated allowable time to info: checker: Last successfull schedulme run at successfully complete backup. 20110817000609 info: checker: bkp service has run within defined time (Hrs). Check ASAP (considered critical) info: checker: bkpv service have run within defined time Failure to schedule expected pair (db,device) (Hrs). within the last (days). info: checker: schedulme service have run within defined TSM backups and associated logs are not found. time(Hrs). One or more services failed in the last (n) days.

1.3.2 High-level Checker (Example 6) Performs the following tasks: MAJOR Scheduled Backups are behind schedule. Recent backup for each (database, device) pair completed Backup service did not run. successfully. Backup failed (n) times as specified in cron. Recent TSM backup for each database must exist and Backup could have completed successfully but took verified longer than expected/estimated. Ensure that TSM copies of all logs since the last verified Failover DB2 log paths contain logs. TSM backup exists. Filesystem(s) error. Immediate reporting and notifications to SSM. Log(s) script failure –internal error-.

Log archiving failure: Disk or/and TSM. 4

LA-UR-11-05005 Approved for Public release Distribution is unlimited. that our operational staff are monitoring the health and status of the metadata and thus reducing dramatically the risk of loss of data and respectively the time of 2. CONCLUSIONS recoverability. DB2RS not only adds value to HPSS native metadata integrity monitoring and reporting tools but also ensures

5 U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Mark Gary Lawrence Livermore National Laboratory [email protected]

ABSTRACT / SUMMARY Resilient Storage Architectures This position paper addresses the Allowing storage operations to continue in the reliability and availability of storage face of failure or outage is critical. Among the systems. Specifically it introduces LLNL hardware architecture best practices followed in storage best practices in the areas of the LC are: resilient architectures and daily  Scalable Unit Architecture operations which contribute to enhanced availability, reliability and computational Following the lead of our computing platforms integrity in LLNL’s 24x7 HPC centers. we deploy storage hardware using the concept of a Scalable Unit (SU). An SU is the smallest unit of hardware (storage and associated servers) by INTRODUCTION which you can grow a storage subsystem. Well Livermore Computing operates multiple 24x7 identified identical SUs allow for ease of repair, “lights on” HPC environments and has done so maintenance, administration, expansion, and for over 40 years. Availability, reliability and sparing. The purchasing power of buying computational integrity are of paramount concern identical hardware in volume is an added benefit. in the Center due to the tremendous investment in, and the importance of, our HPC machines and  Leveraging Compute Platform Hardware the data they generate. This position paper In our file system and archive environments we, outlines, at a very high level, some of the storage whenever possible, leverage and duplicate the system best practices followed by LLNL. The server hardware technologies used on our two focus areas covered in this paper are: compute platforms. As in the SU area, this helps  Resilient Storage Architectures: Best ease repair, maintenance, administration, practices surrounding the storage hardware expansion and sparing and takes full advantage of architectures employed in the LC and their the purchasing power and technology impact on availability and data integrity. investigation efforts made during platform procurements.  Daily Operations: Storage system best practices surrounding daily operations (from  Failover Partners outages and maintenance to training and The LC has nine very large Lustre file systems. communications). The Object Storage Server (OSS) nodes Within these areas I very briefly identify best controlling subsets of disk are architected into practices as fodder for Workshop discussion. failover pairs allowing a failed OSS to have its disk taken over by its healthy partner. This

1 architecture is leveraged constantly and aids not Daily Operations only the case of node failure, but also increases On a daily basis the LC implements a number of availability during software and firmware operational best practices surrounding our storage deployments and electrical work. systems including:  Targeted Use of Uninterruptible Power  “Lights On” Operation Supplies (UPS) It is our philosophy, in part driven by the The amount of electrical power required by LC tremendous dollar investment in our computing compute and infrastructure hardware makes full environment, that the LC provide 24x7x365 coverage with backup power/UPS power customer service and compute availability. Our economically infeasible. Instead the LC uses its cross-trained operations staff is always on site UPS budget in a targeted manner with a monitoring our systems and answering off-hours concentration on support of metadata services, user questions - around the clock. In the event of network infrastructure, and home directory file an environmental emergency (e.g., loss of systems. This strategy focuses on the protection cooling) they can immediately react following of the most critical data and aides in a rapid prescribed/ordered power down procedures. return to service of Center operations following a Operations staff are trained in basic file system power outage. and archive administration and do hardware repair as well. They have full access to on-call  Dual Power Sources storage system and archive administrators at any The majority of LC storage infrastructure racks hour. are wired in a redundant manner to be able to survive the planned or unplanned loss of any one  Self-maintenance electrical subpanel. This is particularly critical in Much of our hardware maintenance is performed our very dynamic machine room environment by LC employees. This allows us to perform during this era of enhanced electrical safety rules. maintenance immediately when a problem occurs, eliminates security escort requirements,  Archive Dual Copy and allows us to closely track and learn from While budgets do not allow us to keep multiple system failures. The fact that storage hardware copies of the PetaBytes of simulation data stored leverages platform hardware procurements means in the LC, our HPSS systems do allow a user to that our personnel need be trained on only a direct that dual copies be made of targeted files. limited number of equipment types. The two copies are stored in geographically separated locations cross-Laboratory.  Hardware/Spare Burn-in We have a full hardware spare/RMA center  Degraded Mode Archive Operation supporting our local maintenance operations. Because of the distributed, multi-level hierarchy Rather than pulling spare parts off of the shelf, architecture of our HPSS archive implementation we maintain a burn in environment where we we are commonly able to provide users with have spare storage hardware under continuous various levels of degraded mode service during test, exercise, and burn in. When equipment fails, outages. In degraded mode not every file is hardware is pulled directly from the burn in accessible, but typically the most recently written environment. Before tape drives are allowed to files and those with dual copies can be accessed. be placed into service they first undergo a suite of performance tests and integrity tests.  Testbeds Livermore Computing has a variety of testbeds in which we test pre-production hardware and software. 
These testbeds range from single racks, quality project to improve our software rollout to the well-known multi-vendor Hyperion test process and reduce the length of planned environment where software can be tested at scale downtimes. with thousands of clients.  Impacts and File System Meetings  Data Integrity Checking Every Monday representatives from every facet Continuously in the background, the LC runs a of the LC (including Facilities) have a formal tool called DIVT - the Data Integrity Verification meeting to manage any outage or operation that Tool. DIVT checks the data integrity of our has impact on Center customers or has cross- archive and our parallel file systems by writing cutting impact among Center discipline areas. known data patterns to files from different This meeting has tremendous value and allows us platforms, forcing the data to flow through file to combine outages and plan forward in order to systems and down to archival tape, and then maximize the availability of all center resources checking data integrity upon fetch back to the including storage. A separate meeting which platform. Over the years DIVT has caught data pulls together Operations staff, storage system corruption ranging from on-platform component administrators, and hardware repair personnel problems to corrupting drive firmware. occurs weekly. This meeting improves file system specific communication across all  Planned Downtimes involved disciplines and shifts. The LC has a philosophy that planned downtimes happen during the work week from Tuesday through Thursday unless particular circumstances CONCLUSIONS dictate otherwise. While this has an impact on Large HPC environments are extremely complex. interactive users, Center resources are fully They require that particular attention be paid to subscribed 24x7 including weekends. Our operational and architectural storage system best philosophy allows us to have experts from all practices in order to ensure availability, reliability disciplines on hand in case of problem in order to and computational data integrity. improve availability. Fridays are avoided to limit the introduction of problems impacting the weekend. Mondays are avoided to allow users to process the results of their weekend runs. * This work performed under the auspices of the Software rollouts are planned in such a manner as U.S. Department of Energy by Lawrence to minimize impact on the programs supported by Livermore National Laboratory under Contract the Center. We rollout software to our DE-AC52-07NA27344. LLNL-CONF-497278 unclassified systems first which allows us to bring outside experts to bear on problems encountered. Recently we leveraged a Six Sigma

3 U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Evan J. Felix Pacific Northwest National Laboratory [email protected]

ABSTRACT / SUMMARY data is processed and stored with EMSL’s file systems and data archive. The EMSL Molecular Science Computing team manages 3 main storage systems to EMSL has three main data storage systems, with provide scientific data storage to the specific purposes. A home file system for active users of EMSL scientific instruments and user codes and data on the cluster, a high speed computing resources. These storage global file system attached to the large 167 TF systems are managed in different ways to Chinook[3] cluster, and a 6+ Pebibyte archive protect the availability of data and protect system for long-term storage. These data storage from data loss. This position paper will areas are summarized in Table 1. address the reliability and availability of storage systems track of the Best Table 1 Practices Workshop Type Size Type Speed The management team has designed and Home 20 TiB Lustre 1GB/s monitors the systems to protect the Global temp 270 TiB Lustre 30GB/s scientific investment that is stored within Long-term archive 4+ PiB HPSS 200MB/s single the systems. Widely available open- stream source tools, and home-grown tools are used to monitor and track usage. Communication with users has also been All of these file systems are accessible to users a key element in keeping the systems directly on cluster login nodes. Home and temp stable. space is available to all cluster nodes.

INTRODUCTION The Environmental Molecular Sciences 1. CLUSTER FILE SYSTEMS Laboratory(EMSL) at PNNL is a center for The home space and temporary space used for the scientific research. The building houses many Chinook cluster are connected directly to the scientific instruments and tools, along with a QDR infiniband interconnect. Each system large computational center. This arrangement of utilizes an active-passive fail-over pair of meta- science and computing creates a large amount of data servers to manage the file system. The block scientific data, that we must preserve and store device for the file system is a RAID1 mirror of for many years to come. A mix of raw two fibre-channel based Virtual RAID5 LUNs. experimental data, processed data, and simulation Each system has multiple paths through a switch

Information Release #PNNL-20686 1 to each LUN. These storage systems are HP replacements and technology refreshes for the EVA6000 based arrays. storage components. Lustre OST’s are also built using a failover pair, The archive system is an HSM based system but utilize an active-active strategy to balance the using the IBM HPSS software stack. We load of 8 large LUNs served by the HP EVA currently have .5 PiB of disk as the first layer of technology. Each of these LUNs use a Virtual the stack. It consists of one DDN 9900 couplet, RAID5 protection scheme. We have been serving data to the HPSS mover nodes, data is successful in using active-active, as we put all the stored on DirectRAID™ 6 protected LUNS. As heartbeat traffic on a very quiet network, which data initially moves into the archive it is stored on seems to alleviate the dual node power off issue this disk cache. we have seen in many failover solutions. There The HPSS system has access to an IBM 3584 are 4 servers for the home file system and 38 tape library, which has a mix of LTO4 and LTO5 servers for the temporary file system. tapes. Each data block written to tape is stored The infiniband connections for the storage servers twice, to protect against tape loss. This are balanced across lower-ranked switches of the duplication policy was implemented as external federated infiniband network. We have also backups were no longer feasible. The tape library enabled a QOS strategy with OpenSM, using an is in another datacenter on the PNNL campus, EMSL created routing algorithm (Down-Up) that which is connected with a 2 10Gbits/s network has reduced congestion on the network. links for a redundant network system. Configuration data, including failure states is One key design point we have used for our gathered from all HP EVA systems and stored in archive over the last two iterations has been to the EMSL MASTER database[2]. This database never lose data. We allow for more downtime keeps a historical record of all hardware assets, and administrator control to accomplish this, and including serial numbers, firmware versions, and do not require a specific uptime requirement, but status information. A nagios monitoring script do treat it as a production system which should be can query this database to alert administrators online as much as possible. when components fail. Most replacements can be The EMSL MSC team wrote a FUSE[1] based file done online, and do not require a file system system for access to the archive, that presents to shutdown. Documentation for all replacement the users a POSIX-like interface. This system procedures is kept in the system wiki. This allows us to control more aspects of what a user database also allows us to look at failure history can do on the system, and increase transfer rates over the life of the system, and track when than other tools. It also allows us to ‘catch’ any components are changed. file removal actions and move them to a special The home portion of the file system is backed up ‘trash’ location so that accidental removals are on one dedicated node on a daily basis. IBM’s not catastrophic. Since the HPSS unlink Tivoli Storage Manager is used for this purpose. commands remove all references to a file in the The backup tape system is housed in another meta-data store, recovery is very difficult or building. To perform daily backups a multi- impossible. process script was created to keep many streams Another management tool we use periodically moving. The temporary space is not backed up. 
scans the file system collecting information on users use, and provides to users and administrators information on file counts, and ARCHIVE SYSTEM size used by each user. This database also EMSL procured a new HPSS storage system for contains historical size of the file system. You archive purposes in 2009, and plans to operate it can see the size of the file system change in through at least 2017, with planned lifecycle Figure 1, when we moved from the old archive to very interested in know what we had done, and the new one as multiple copies were made. The we were able to share our code with them sharp increase in data in the chart is caused by When Lustre problems have come up, our HPSS having amore than one copy of each file. extensive knowledge of the internal workings of By policy it should have two copies on tape, and the Lustre source code has been invaluable in may have one in the disk cache as saving, and in one catastrophic case saved us well. from weeks of backup restores. In our new archive backups are no longer used, and so restoring over a pebibyte of data is no longer an issue. When a vendor specific monitoring system does not integrate with our monitoring infrastructure, we have been able to write wrappers, or use their low-level API to collect data, and detect failures. We also maintain a MSC wiki that contains the administrative procedures for handling issues, and routine maintenance tasks. We keep an offline copy of this wiki on a USB drive for reference during complete power or network

outages. Figure 1 One member of our team is also on-call at any Communication with specific users has been a time. We have found that having every member key aspect of our ability to upgrade and change of the team being a least front-line response, each parts of the archive. Our biggest user by space member learns basic administration of our critical has assisted in helping us test new changes and systems, even when they must bring in other been helping in working out any problems before experts to solve unexpected issues. the archive is released back to other users. We also maintain a test system that is built using CONCLUSIONS similar hardware to the production system. This The MSC team has found that a deep knowledge test system allows us to develop and test upgrades of the storage systems that will important before users are on the system. scientific data we can design and protect against various failure scenarios and keep the data online MANAGEMENT and available for our users. Being able to write Much of our success in managing and protecting customized portions of our system allows better users data comes from having experienced integration and management of our resources. administrators and programmers directly This knowledge allows us to be pro-active in responsible for the management of storage finding problems in our system before any data- resources. loss is seen or users experience problems. When the archive tools provided by the vendor REFERENCES proved in-adequate for our needs, we wrote an 1. FUSE: Filesystem in user space appropriate interface to the archive in a few http://fuse.sourceforge.net/ weeks, and added on features, as we needed them. This change to our strategy did not delay our 2. Simmons, Chris, Evan Felix, and David schedule in deploying the archive. And we found Brown. Understanding a Supercomputer: when we attended a Users Group meeting for Utilizing Data Visualization. Tech. HPSS that others also had similar issues and were

Information Release #PNNL-20686 3

3. Turner AS, KM Regimbal, MA Showalter, Supercluster." SciDAC Review 13(Summer WA De Jong, CS Oehmen, ER Vorpagel, EJ 2009):60-69. Felix, RJ Rousseau, and TP Straatsma. 4. "DirectRAID - DataDirect Networks." DDN | 2009. "Chinook: EMSL's Powerful New DataDirect Networks |. Web. 31 Aug. 2011. http://www.ddn.com/products/directraid

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Venkatram Vishwanath, Mark Hereld and Michael E. Papka Argonne National Laboratory @mcs.anl.gov

ABSTRACT generation, the I/O systems have not kept pace, resulting in a significant performance bottleneck. The performance mismatch between the The ExaScale Software Study: Software computing and I/O components of current- Challenges in Extreme Scale Systems explains it generation HPC systems has made I/O a critical this way: ``Not all existing applications will scale bottleneck for scientific applications. It is to terascale, petascale, or on to exascale given therefore crucial that software take every current application/architecture characteristics" advantage available in moving data between citing ``I/O bandwidth" as one of the issues. On compute, analysis, and storage resources as top of this, one often finds that existing I/O efficiently as networks will allow. Currently system software solutions only achieve a fraction available I/O system software mechanisms often of quoted capabilities. fail to perform as well as the hardware We have developed an infrastructure called infrastructure would allow, suggesting that GLEAN [1,2] to accelerate the I/O of applications improved optimization and perhaps adaptive on leadership systems. We are motivated to help mechanisms deserve increased study. increase the scientific output of leadership We describe our experiences with GLEAN – a facilities. GLEAN provides a mechanism for simulation-time data analysis and I/O acceleration improved data movement and staging for infrastructure for leadership class systems. accelerating I/O, interfacing to running GLEAN improves the I/O performance, including simulations for co-analysis, and/or an interface checkpointing data, by exploiting network for in situ analysis via a zero to minimal topology for data movement, leveraging data modification to the existing application code semantics of applications, exploiting fine-grained base. GLEAN has scaled to the entire parallelism, incorporating asynchronous data infrastructure of the Argonne Leadership Class staging, and reducing the synchronization Facility (ALCF) comprising of 160K Intrepid requirements for collective I/O. IBM Blue Gene/P (BG/P) cores and demonstrated multi-fold improved with DOE INCITE and ESP applications. We discuss some of the lessons INTRODUCTION learned which could be considered for best practices on file systems and archives. While the computational power of supercomputers keeps increasing with every

1

OUR POSITION while the application proceeds ahead with its computation. Asynchronous data staging helps Based on our experiences with GLEAN, we satisfy the bursty nature of application I/O believe the useful components to improve the I/O common in computational science and blocks the performance on leadership class systems include simulation’s computation only for the duration of topology-aware data movement, leveraging data copying data from the compute nodes to the semantics, incorporating asynchronous data staging nodes. Data staging also significantly staging, leveraging fine-grained parallelism, and reduces the number of clients seen by the parallel non-intrusive integration with applications. We filesystem, and thus mitigates the contention briefly elucidate these. including locking overheads for the filesystem. Topology-aware Data Movement: As we move Staging mitigates the variability in I/O towards systems with heterogeneous and complex performance seen in shared filesystems on network topologies, effective ways to fully leadership systems when accessed concurrently exploit their heterogeneity is critical. The IBM by multiple applications. BG/P has five different networks with varying throughputs and topologies. The 3D torus Leveraging Application Data Models: I/O interconnects a compute node with its six system software typically use stream of bytes and neighbors at 425 MB/s over each link. In contrast, files to deal with an application’s data. A key the tree network is a shared network with a design goal in GLEAN is to make application maximum throughput of 850 MB/s to the I/O data models a first-class citizen. This enables us nodes. The tree network is the only way to get to to apply various analytics to the simulation data the I/O nodes in order to perform I/O. BG/Q is at runtime to reduce the data volume written to expected to have a more complex network storage, transform data on-the-fly to meet the topology. Similarly, several other Top-500 needs of analysis, and enable various I/O supercomputers have complex topologies. As optimizations leveraging the application's data seen in Figure 1, by leveraging the various models. Toward this effort, we have worked network topologies, in GLEAN, we achieve up to closely with FLASH, an astrophysics application, 300-fold improvement in moving data out from to capture its adaptive mesh refinement (AMR) the BG/P system. Another critical aspect is that data model. We have interfaced with PHASTA, our data movement mechanism uses reduced which uses an adaptive unstructured mesh, to synchronization mechanisms wherein only make unstructured grids supported in GLEAN, neighboring processes need to co-ordinate their and with S3D, a turbulence simulation, to capture I/O. This is critical as we move towards future it’s structured grid model. We have worked with systems with millions of cores. many of the most common HPC simulation data models ranging from AMR grids to unstructured Fine-grained Parallelism: GLEAN’s design adaptive meshes. employs a thread-pool wherein each thread handles multiple connections via a poll-based Non-Intrusive Integration with Applications: event multiplexing mechanism. 
This is critical in Application scientists are very interested in I/O future many-core systems with low clock- solutions wherein they can get the added frequency per core, where multiple threads are performance improvements without having to needed to drive the 40 Gbps and higher network change their simulation code (or with minimal throughputs per node to saturation. changes). To achieve this, we have mapped Parallel-netCDF and hdf5 APIs, commonly used high-level I/O libraries in simulations, to relevant Asynchronous data staging refers to moving the GLEAN APIs, thus enabling us to non-intrusively application's I/O data to dedicated nodes and next interface with simulations using pnetcdf and hdf5. writing this out to the filesystem asynchronously

Figure 1: Strong scaling performance of the I/O Figure 2: Weak scaling results for writing mechanisms to write 1 GiB data to the BG/P FLASH checkpoint data IONs (log-log scale) on ALCF infrastructure Networking, Storage and Analysis (SC 2011), Seattle, USA, November 2011. REFERENCES 1.V. Vishwanath, M. Hereld, V. Morozov, and M. E. 2.V. Vishwanath, M. Hereld, and M. E. Papka, "Simulation- Papka, "Topology-aware data movement and staging for time data analysis and I/O acceleration on leadership- I/O acceleration on Blue Gene/P supercomputing class systems using GLEAN", To appear in IEEE systems", To appear in IEEE/ACM International Symposium on Large Data Analysis and Visualization Conference for High Performance Computing, (LDAV), Providence, RI, USA, October 2011.

3 U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper HPC Data Storage Strategy at the UK Atomic Weapons Establishment (AWE)

Mark Roberts Paul Tomlinson AWE High Performance Computing AWE High Performance Computing [email protected] [email protected]

ABSTRACT / SUMMARY log-in nodes, visualisation clusters, and secure NFSv4 access direct to the desktop. AWE has adopted a resilient and globally accessible storage model to support COMMON ENVIRONMENT concurrent desktop and compute cluster A common name space, providing a consistent access. We now provide a global view of file systems on desktops, compute and persistent multi-tiered storage which has visualization clusters, helps users locate their data enhanced usability and reliability. and has encouraged well structured work flow Increasing pressure on budgets has with respect to makefiles, shared libraries and recently focused efforts to reduce and code areas. consolidate the directly-attached parallel cluster file system into several global This, combined with other common environment scratch file systems cross mounted on all features, has aided the trend towards more local compute clusters. prototyping, with easier scaling up onto larger platforms.

INTRODUCTION DESKTOP ENHANCEMENT Historically, AWE has focused on one or more The Hierarchical Storage Managed (HSM) file compute clusters, each with its own local parallel systems are now exposed to desktops, running scratch file system and global persistent storage file managers and search utilities, that scan, index provided by commodity Network Attached and try to determine type by content, all Storage (NAS) filers or in-house Network File potentially triggering unwanted retrievals, which System (NFS) servers. can block interactive access until the recall from Following a major facility issue a few years tape completes. earlier, which resulted in the sole persistent data To mitigate this, we modified the KDE3 file store, based on IBMs General Parallel File manager and GNU utilities (ls, find, stat, file) to System (GPFS), being corrupted and had to be be aware that a file is migrated based on the naive restored (very slowly) from backup, it was concept that a migrated file has a non-zero file decided to upgrade to a resilient GPFS cluster. size with zero data blocks. Users are given visual This was designed to provide multi-site cues with syntax highlighting or icon overlays, resilience, protecting from loss of either site. along with extra options for handling such files. This has allowed us to maintain native Preview operations that would generate multiple multicluster GPFS access to the numerous cluster recalls are skipped. Obtaining the migration state

1 via an API or extended attributes that could removed and disposed of. With a recent propagate through software and network capability cluster procurement the local parallel protocols would be preferable. However the file system hardware accounted for approximately simple approach has worked well in our 15% of the total expenditure. The demand for environment. storage is increasing, so as the total of memory and cores on clusters the associated storage costs FILESYSTEM USAGE TRENDS may have a larger impact. For many years, we recorded per-user GPFS file system usage with a simple POSIX find command We have decided to “decouple” the local parallel in conjunction with the HSM dsmls command. file system which allows the storage to be used by This was highly inefficient, taking over 24 hours multiple clusters. This gives the freedom to either to complete a scan, as well as adding unnecessary use the cost reduction by increasing the compute CPU and I/O load on the Tivoli Storage Manager size or directing the saving elsewhere. One (TSM) server. immediate advantage of no directly attached storage is more rapid initial cluster deployment We now utilize the GPFS Information Lifecycle and, possibly, regular scheduled maintenance Management (ILM) interface to scan the file without affecting access to the data via other systems and output records containing the clusters. extended information such as name, size, access, modify and create time. The resulting data file is The community gain improved useability by not processed and imported into a MySQL database. having to transfer the data between the local This allows for fine granular analysis at the user parallel file systems on the clusters and then to and file system level day by day or over a time persistent storage. period. By extending the concept of resilience to global Currently, with 35 million objects, the new scratch with a global scratch file system cluster in method takes under 20 minutes. We believe that each facility (three planned) we can now factor in providing such granular information influences scheduled downtime and upgrades to a chosen users' data storage behavior for the better, and lets global scratch file system cluster more us quickly identify unreasonable or unintended effectively. usage when problems occur. It also provides DATA AND FS AWARE SCHEDULING long-term trend information to aid future file It is our intention that each compute cluster system capacity planning and procurement. should use by default the global scratch file We have also engaged with Lustre developers to system in its local facility. see if future Lustre releases can incorporate a Users may, however, wish to use another global similar capability. scratch file system in a different building or use the “local” global scratch in conjunction with it. DECOUPLED PARALLEL FILESYSTEMS Traditionally, as part of a compute cluster In order to prevent jobs failing due to an procurement, we purchased storage for a unavailable global scratch, we are investigating dedicated parallel file system to provide the the concept of storage as a consumable resource localised fast bandwidth required by codes. in the same manner as a node or cores is used today. With the potential of cluster lifetimes becoming shorter due to the the ever increasing pace of By integrating awareness of global storage into technology releases, the cost of the disk the scheduler/resource manager, jobs may be infrastructure is now becoming a factor. Once our prevented from failing prematurely. 
Also when clusters were decommissioned the disk was also global scratch or persistent storage clusters are

2 scheduled for maintenance then scheduler system approach for efficient global scratch storage. reservations can be placed on the file system preventing jobs from being dispatched, if the job requested that file system as a resource.

DATA EXPLOSION With an increasing amount of scratch, persistent storage and upgraded network links, it becomes relatively easy for the user to copy everything everywhere. This leads to wasted bandwidth, disk storage and tape backups due to duplicated data. File system de-duplication on persistent storage may be possible but ultimately undesirable. With early MPP clusters it was often quicker to move data from disk or recall from off-line storage than regenerate the data. With the large fast clusters available today it may, in some cases, be quicker and cheaper to save network, disk, and tape resources by regenerating data. This is ultimately a decision that only the user is best placed to make but having to weight up the impact on QA reproducibility and provenance. Existing compute cluster parallel file systems were designed with the ability to hold four times the amount of system memory from a OS initiated checkpoint. The majority of the mainstream codes are now using restart dumps generated via an in-house I/O library. Code users can then choose whether to perform a full state restart or focus on only saving selected data within the run for analysis. Intelligent software- based restarts greatly reduce the amount and bandwidth required and could allow considerable cost savings, but non-restartable third party codes remain a barrier.

CONCLUSION Implementation of a fast, secure, exported and resilient global parallel file system for persistent and archive storage has proved invaluable for unifying compute resources at all scales. However the ease of accessibility has created some additional problems and raised user expectations, requiring adaptation of the user interface. We are now exploring a similar

3 U.S. Department of Energy Best Practices Workshop on File Systems & Archives:∗ Usability at Los Alamos National Lab†

John Bent Gary Grider Los Alamos National Lab Los Alamos National Lab [email protected] [email protected]

Abstract the completion of the job. The goal of system designers is therefore to maximize goodput in the face of random There yet exist no truly parallel file systems. Those that failures using an optimal frequency of checkpointing. make the claim fall short when it comes to providing ad- Determining checkpointing frequency should be equate concurrent write performance at large scale. This straight-forward: measure MTTF, measure amount limitation causes large usability headaches in HPC com- of data to be checkpointed, measure available storage puting. bandwidth, compute checkpoint time, and plug it into a Users need two major capabilities missing from current simple formula [3]. However, measuring available storage parallel file systems. One, they need low latency interac- bandwidth is not as straightforward as one would hope. tivity. Two, they need high bandwidth for large parallel Ideally, parallel file systems could achieve some consistent IO; this capability must be resistant to IO patterns and percentage of the hardware capabilities; for example, a should not require tuning. There are no existing paral- reasonable goal for a parallel file system using disk drives lel file systems which provide these features. Frighten- for storage would be to achieve 70% of the aggregate ingly, exascale renders these features even less attainable disk bandwidth. If this were the case, then a system from currently available parallel file systems. Fortunately, designer could simply purchase the necessary amount there is a path forward. of storage hardware to gain sufficient performance to minimize checkpoint time and maximize system goodput. 1Introduction However, there exist no currently available parallel file systems that can provide any such performance level High-performance computing (HPC) requires a tremen- consistently. dous amount of storage bandwidth. As computational scientists push for ever more computational capability, 2Challenges system designers accommadate them with increasingly powerful supercomputers. The challenge of the last few Unfortunately, although there are some that can, there decades has been that the performance of individual com- are many IO patterns that cannot achieve any consis- ponents such as processors and hard drives as remained tent percentage of the storage capability. Instead, these relatively flat. Thus, building more power supercomput- IO patterns achieve a consistently low performance such ers requires that they be built with increasing numbers that their percentage of hardware capability diminishes of components. Problematically, the mean time to fail- as more hardware is added! For example, refer to Fig- ure (MTTF of individual components has over remained ures 1a, 1b, and 1c, which show that writing to a shared relatively flat over time. Thus, the larger the system, the file, N-1,achievesconsistentlypoorperformanceacross more frequent the failures. the three major parallel file systems whereas the band- Traditionally, failures have been dealt with by period- width of writing to unique files, N-N,scalesasdesired ically saving computational state onto persistent storage with the size of the job. The flat lines for the N-1 work- and then recovering from this state following any failure loads actually show that there is no amount of storage (checkpoint-restart). The utilization of systems is then hardware that can be purchased: regardless of size, the measured using goodput which is the percentage of com- bandwidths remain flat. 
This is because the hardware puter time that is spent actually making progress towards is not at fault; the performance flaw is within the paral- ∗San Francisco, CA; September 26-27, 2011 lel file systems which cannot incur massively concurrent †LANL Release LA-UR 11-11416 writes and maintain performance. The challenge is due 30000 5000 2000 N-N N-N N-N N-1 N-1 N-1

15000 2500 1000

0 0 0

Write Bandwidth (MB/s) 0 2000 4000 6000 8000 Write Bandwidth (MB/s) 0 5 10 15 20 25 30 35 40 Write Bandwidth (MB/s) 0 5 10 15 20 25 30 35 40 Number of Processes Number of Processors Number of Processes (a) PanFS (b) GPFS (c) Lustre

200 600 1000 Quiescent Read Write 100 300 500 Time (s)

Create Time (s) 0 0 0

1K 2K 4K 8K 16K 32K Tar Cp Rm Tar Cp Rm Write Bandwidth (MB/s) 0 1 2 3 4 5 N-N Job Size (np) 256 Procs 8192 Procs Write Size (MB) (d) Metadata (e) QoS (f) Tuning

Figure 1: Usability Challenges. These graphs address the key usability challenges facing today’s users of HPC storage systems. The top three graphs demonstrate the large discrepancy between achievable bandwidth and scalability using N-Nand N-1 checkpoint patterns on three of the major HPC parallel filesystems.Thebottomleftgraphshowsthechallengeofmetadata creation at large job size, the bottom middle shows how the notion of interactivity is a cruel joke when small file operations are contending with large jobs performing parallel IO, and finally, the bottom right graph shows the reliance on magic numbers that plagues current parallel file systems. to maintaining data consistency which typically requires challenges for themselves by avoiding writes to shared ob- aserializationofwrites. jects by having each process in large jobs create unique An obvious solution to this problem is for all users to files. Some have become parallel file system experts and always perform N-N file IO in which every process writes preface parallel IO by doing complicated queries of the to a unique file. This approach does not come without parallel file system in order to rearrange their own IO trade-offs however. One is a performance limitation at patterns to better match the internal data organization scale and the other is a reduction in usability as will be of the parallel file system. discussed later in Section 3. The system problem is the massive workload caused 3.1 Tuning by by large numbers of concurrent file creates when each process opens a new file. Essentially this causes the same Many users have learned that parallel file systems have exact problem on parallel file systems as does writing in various magic numbers which correspond to IO sizes that an N-1 pattern: concurrent writes perform poorly. In this achieve higher performance than other IO sizes. Typically case, the concurrent writing is done to a shared directory these magic numbers correspond with various object sizes object. These directory writes are handled by a metadata in the parallel file system ranging from a disk block to a server; no current production HPC parallel file system full RAID stripe. The difference between poorly perform- supports distributed metadata servers. As such, large ing IO sizes and highly performing IO sizes is shown in numbers of directory writes are essentially serialized at a Figure 1f which was produced using LBNL’s PatternIO single metadata server thus causing very large slow-downs benchmark [8]. Also, this graphs seems to merely show during the create phase of an N-N workload as is shown that performance increases with IO size, a closer examina- in Figure 1d. tion shows that there are many small writes that perform better than large writes. In fact, a close examination re- veals three distinct curves in this graph: the bottom is 3ImplicationsforUsability IO sizes matching no magic numbers, the middle is for IO sizes in multiples of the object size per storage device, This causes large usability headaches for LANL users. All and the upper is for IO sizes in multiples of the full RAID of the large computing projects at LANL are well-aware stripe size across multiple storage devices. of, and dismayed by, these limitations. 
All have incurred The implication of this graphs is that those users large opportunity costs to perform their primary jobs by who can discover magic numbers and then use those designing around these limitations or paying large per- magic numbers can achieve much higher bandwidth than formance penalties. Many create archiving and analysis those users who cannot. Unfortunately, both discover- N1 N2 N3 P1 P2 P3 P4 P5 P6 200 Without PLFS With PLFS

A parallel application consisting of six processes on three compute nodes creates an N−1 strided file, checkpoint1. 100 PLFS preserves the application’s logical view of the file.

PLFS Virtual Interposition Layer Create Time (s) checkpoint1 0 1K 2K 4K 8K 16K 32K PLFS then creates a container structure on the underlying parallel file system to hold the data for file checkpoint1. Actual Underlying Parallel File System N-N Job Size (np)

checkpoint1/

openhosts/ subdir1/ metadata/ subdir2/ access subdir3/ Figure 3: Addressing Metadata Challenge. This graph index.N1 index.N3 index.N2 shows how distributed metadata keeps create rates manageable at large scale.

Figure 2: PLFS Architecture. This figure shows how address this problem such as collective buffering [10] in PLFS maintains the user’s logical view of a file which physi- MPI-IO. As we will show later in Section 4.1, collective cally distributing it into many smaller files on an underlying buffering is beneficial but is not a complete solution. parallel file system. 3.2 Quality of Service ing and exploiting magic numbers is difficult and often intractable. Magic numbers differ not only on different Finally, although checkpoint-restart is a dominant activ- parallel file systems (e.g. from PanFS to Lustre) but also ity on the storage systems, obviously it is not the only on different installations of the same file system. Trag- activity. Computational science produces data which ically, there is no simple, single mechanism by which to must then be explored and analyzed. As the output extract magic numbers from a file system. data is stored on the same storage system which services We have a user at LANL who executes initialization checkpoint-restart, data exploration and analysis work- code which first queries statfs to determine the file sys- loads can content with checkpoint-restart workloads. As tem f type and then, based on which file system is iden- is seen in Figure 1e, the checkpoint-restart workloads can tified, then executes different code for each of the three wreak havoc on interactive operations. In this experi- main parallel file systems to attempt to discover the magic ment, the latency of small file operations, such as untar- number for that particular installation. Once discovering ring a set of files, copying that same set of files, and then this value, the user then reconfigures their own, very com- removing the files, was measured during periods of qui- plicated, IO library to issue IO requests using the newly escence and then compared to the latency of those same discovered magic number. Of course, most users would operations when they were contending with large parallel not prefer to jump through such hoops, and frankly, many jobs doing a checkpoint write and a restart read. The users should not be trusted with low-level file system in- most painful latency penalties are seen when the opera- formation. Not because they lack intelligence but be- tions contend with a 8192 process job doing a checkpoint cause they lack education; they are computational scien- write. tists who should not be expected to become file system experts in order to extract reasonable performance from 4PathForward aparallelfilesystem. Of course, even if all users could easily discover magic There are many emerging technologies, ideas, and poten- numbers, they could not all easily apply them. For ex- tial designs that offer hope that these challenges will be ample, many applications do adaptive mesh refinement in addressed in time for the looming exascale era. which the pieces of the distributed data structures are not uniformly sized: neither in space nor in time. This means 4.1 PLFS that users looking for magic numbers will need some sort of complicated buffering or aggregation. An additional Our earlier work in SC09 [2] plays a prominent role in challenge is that magic numbers are not as easy as merely our envisioned exascale IO stack. That work showed how making the individual IO operations be of a particular PLFS makes all N-1 workloads achieve the performance size; they must also be correctly aligned with the under- of N-N workloads and also how PLFS removes the need lying object sizes. 
So not only must users attempt to for tuning applications to the underlying system (i.e. in size operations correctly, they must also attempt to align PLFS, every number is a magic number!). Those results them correctly as well. There are other approaches to will not be repeated here but suffice it to say that they 1250 5RedesigningtheIOStack

1000 PLFS has proven to be a very effective solution for current IO challenges: it allows all workloads to easily achieve a 750 consistently high percentage of the aggregate hardware PLFS capability. 500 Collective Buffering Unmodified PLFS is not sufficient however to solve the looming 250 exascale IO challenges before us. Recent work [9] shows

Write Bandwidth (MB/s) that the checkpointing challenge is becoming increasingly 0 0 250 500 difficult over time. The checkpoint size in the exascale Number of Processes era is expected to be around 32 PB. To checkpoint this in thirty minutes (a decent rule of thumb) requires 16 TBs of storage bandwidth. Economic modeling shows that Figure 4: Collective Buffering. This graph shows that current storage designs would require an infeasible 50% collective buffering may not be sufficient for many workloads. of the exascale budget to achieve this performance. eliminate the challenges show in Figures 1a, 1b, 1c, and 1f. 5.1 Burst buffer From a usability perspective, PLFS is an important con- tribution: in addition to removing the need for IO tuning, We must redesign our hardware stack and then develop unmodified PLFS is transparently accessible by applica- new software to use it. Spinning media (i.e. disk) by it- tions using either POSIX IO or the MPI IO libraries. self is not economically viable in the exascale era as it is Note that collective buffering [10] is another approach priced for capacity but we will need to purchase band- to dealing with the thorny problem of magic numbers. width. Additionally, the storage interconnect network Figure 4 shows that, for one particular workload, collec- would be a large expense. Thus far, we have required tive buffering is an improvement over an unmodified ap- an external storage system for two main reasons: one, proach to IO but underperforms the bandwidth obtain- sharing storage across multiple supercomputers improves able using PLFS. In fairness, however, we are not col- usability and helps with economies of scale; two, embed- lective buffering experts and perhaps collective buffering ding spinning media on compute nodes decreases their could be tuned to achieve much higher bandwidth. Ulti- MTTF. mately though, our usability goal is to remove file system Our proposal is to make use of emerging technologies and parallel IO tuning from the user’s purview. such as solid state devices, SSD.Thismediaispricedfor Figure 2 shows the architecture of PLFS and how it bandwidth and for low latency so the economic modeling preserves the user’s logical view of a file while physically shows it is viable for our bandwidth requirements. Ad- striping the data into many smaller files on an underlying ditionally, the lack of moving parts is amenable to our parallel file system. This effectively turns all large parallel failure models and allows us to place these devices within workloads into N-N workloads. Of course, as we saw in the compute system (i.e. not on the other side of the stor- Figure 1d, even N-N workloads suffer at very large scale. age network). Unfortunately, being priced for bandwidth Additionally, we know that this performance degradation means these devices cannot provide the storage capacity is due to an overloaded metadata server which will destroy that we require. We still require our existing disk-based interactive latency as we saw in Figure 1e. parallel file systems for short-term capacity needs (long- Borrowing ideas from GIGA+ [7], PLFS now addresses term capacity is served by archival tape systems not oth- these challenges as well. Recent versions of PLFS (since erwise discussed here). 2.0) can stripe the multiple physical files comprising a We propose adding these devices as a new layer in our logical file over multiple file systems. 
In the case where existing storage hierarchy between the main memory of each file system is served by a different metadata server, our compute nodes and the spinning disks of our paral- this distributes metadata load very effectively as can be lel file systems; we call this interposition of SSD a burst seen in Figure 3 which is the same as Figure 1d but with buffer as they will absorb very fast bursts of data and an added line showing how metadata distribution within serve as an intermediate buffer to existing HPC parallel PLFS can remove metadata bottlenecks. Note that the file systems. This is not a new idea and is commonly workload shown was run using an N-N workload. Al- suggested as a solution to the well-known latency gap be- though PLFS was originally designed for N-1 workloads, tween memory and disk. Our proposal however is how to this new functionality will allow PLFS to address meta- specifically incorporate these burst buffers into the exist- data concerns for all exascale checkpoint workloads. ing HPC storage software stack. 5.2 E Pluribus Unum vantage of the low-latency interconnect network between the compute nodes to transfer data between burst buffers Our envisioned software stack incorporates many existing as needed. technologies. The SCR [6] software is a perfect candidate for helping schedule the burst buffer traffic and to enable restart from neighboring burst bufers within the compute 6Conclusion nodes. However, we envision merging SCR and PLFS to allow users to benefit from PLFS’s capability to handle In this proposal, we have described how current usabil- both N-1 and N-N workloads and to allow use by unmod- ity of HPC storage systems is hampered by two main ified application. challenges: poor performance for many large jobs, and We have already add PLFS as a storage layer within occasional intolerably slow interactive latency. We have the MPI IO library. This library has many important offered PLFS as a solution for these challenges on today’s IO optimizations in addition to collective buffering de- systems. scribed earlier. One such optimization is available us- Finally, we point out the inability of PLFS to address ing MPI File set view.Thisisanextremelynicefea-exascale challenges by itself. We then offer a proposal for ture from a usability perspective. This is clear when we integrating PLFS with a burst buffer hardware architec- consider what computational scientists are doing: they ture PLFS and a set of other existing software packages stripe a multi-dimensional data structure representing as one path towards a usable and feasible exascale storage some physical space across a set of distributed proces- solution. sors. Dealing with these distributed multi-dimensional data structures is complicated enough without even con- References sidering how to serialize them into and out of persistent storage. MPI File set view lesses this serialization bur- [1] The HDF Group. http://www.hdfgroup.org/. den; by merely describing their distribution, the user then [2] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate. Plfs: a checkpoint filesystem transfers the specific serialization work to the MPI IO li- for parallel applications. In Proceedings of the Conference on High brary. Performance Computing Networking, Storage and Analysis,SC ’09, pages 21:1–21:12, New York, NY, USA, 2009. ACM. Note that other data formatting libraries such as HDF [1], Parallel netCDF [4], SCR , and ADIOS [5] pro- [3] J. T. Daly. 
A higher order estimate of the optimum checkpoint in- terval for restart dumps. Future Gener. Comput. Syst.,22(3):303– vide similar functionality and have proven very popular 312, 2006. as they remove computer science burdens from compu- [4] J. Li, W. keng Liao, A. Choudhary, R. Ross, R. Thakur, W. Gropp, tational scientists. These data formatting libraries are R. Latham, A. Siegel, B. Gallagher, and M. Zingale. Parallel the clear path forward to improve usability of HPC stor- netcdf: A high-performance scientific i/o interface. SC Confer- ence,0:39,2003. age. However, they will not work in their current form on burst buffer architectures. We envision adding our [5] J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C.Jin. Flexible io and integration for scientific codes through the adapt- integrated PLFS-SCR storage management system as a able io system (adios). In CLADE ’08: Proceedings of the 6th in- storage layer within these data formatting libraries just ternational workshop on Challenges of large applications indis- tributed environments,pages15–24,NewYork,NY,USA,2008. as we have done within the MPI IO library. A key ad- ACM. vantage of a tight integration between PLFS-SCR and [6] A. Moody, G. Bronevetsky, K. Mohror, and B. R. d. Supinski.De- these data formatting libraries is that semantic informa- sign, modeling, and evaluation of a scalable multi-level checkpoint- tion about the data can be passed to the storage system ing system. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Stor- thus enabling semantic retrieval. age and Analysis,SC’10,pages1–11,Washington,DC,USA, 2010. IEEE Computer Society.

[7] S. V. Patil, G. A. Gibson, S. Lang, and M. Polte. GIGA+: Scalable 5.3 In situ data analysis Directories for Shared File Systems. In Petascale Data Storage Workshop at SC07,Reno,Nevada,Nov.2007. There are two key features of our proposal that enable in situ [8] Rajeev Thakur. Parallel I/O Benchmarks. data analysis. The first is that the burst buffer http://www.mcs.anl.gov/ thakur/pio-benchmarks.html. architecture embeds storage much more closely to the [9] B. Schroeder and G. Gibson. A large scale study of failuresin compute nodes which drastically reduces access latencies high-performance-computing systems. IEEE Transactions on De- for both sequential and random accesses. The second is pendable and Secure Computing,99(1),5555. that because the data has been stored using data format- [10] R. Thakur and E. Lusk. Data sieving and collective i/o in romio. ting libraries, semantic retrieval of data is possible. This In In Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computation,pages182–189.IEEEComputer means that we can more easily attempt to co-locate pro- Society Press, 1999. cesses within the analysis jobs close to the burst buffers containing the desired data. Finally, even when the data is not available on a local burst buffer, we can take ad-

PSI – A High Performance File Transfer User Interface

Mark A. Roschke C. David Sherrill High Performance Computing Division, High Performance Computing Division, Los Alamos National Laboratory Los Alamos National Laboratory [email protected] [email protected]

Abstract (HPSS) at Los Alamos National Laboratory. This paper describes the efforts to provide a full-featured data transfers capability for archival transfer, local transfer, and wide area host-to-host transfer, providing Transferring and maintaining large datasets requires both high-speed data transfer as well as high-speed parallel processing of both data and metadata for metadata processing. timely execution. This paper describes the work in progress to use various processing techniques, PSI uses a parallel workflow model for processing including multi-threading of data and metadata both data and metadata. Work is parallelized and operations, distributed processing, aggregation, and scheduled on available server and client resources conditional processing to achieve increased transfer automatically, using a priority and resource-based performance for large datasets, as well as increased approach. Optimization is performed automatically, rates for metadata queries and updates. including areas such as parallelization, optimized tape transfer, load leveling, etc. 1. Introduction 3. Unix syntax and Semantics Ever-increasing computing capabilities result in ever- increasing data sets to be transferred. Such data sets can consist primarily of large files, many small files, or PSI utilizes UNIX-like syntax and semantics. For both. Transferring data sets with large files requires an example, the following commands are available for emphasis on parallel file transfer, utilizing as much data transfer and manipulation of file attributes: cd, bandwidth as possible. And it is in this area that the chmod, chgrp, cp, du, find, grep, ls, mkdir, majority of data parallel transfer development has mv, rm, rmdir, and scp. occurred. But, it is no longer rare for a user to generate data sets of 100,000 to one million files. And when 4. Multi-mode Operation data sets reach this size, it is imperative that support be provided for high performance metadata operations, PSI offers three modes of operation, providing the not only in support of file transfer, but also to support same syntax, semantics, and look and feel for the three browsing and maintaining the data set. most frequently used data transfer situations, which are 1) local transfer, 2) archival transfer, and 3) host-to- 2. Overview of PSI host transfer. The particular interface command determines the context of the specified commands. For The Parallel Storage Interface (PSI) is a data transfer example, user interface designed to provide high speed transfer for large data sets, with a special emphasis on utilizing sh cp –R a b copy files locally as many resources as possible for a single user request. psiloc cp –R a b parallel copy locally Developed by the authors, PSI is the main user psi2ccc cp –R a b parallel copy on cluster ccc interface to the High Performance Storage System psi cp –R a b parallel copy in the archive browse through the original tree, and execute such This approach provides a consistent look and feel, commands as ls, find, grep, rm, and scp with allowing the user to move between the 3 major transfer references to files within the tar files, and can also situations, eliminating the time necessary to learn the utilize globbing (i.e. wild cards) in such references. command set for each situation. Conditional transfers are also supported, so that newly arrived files can be placed within new tar files in a 5. Automatic Optimization directory. 
In addition, commands such as scp, chmod, grep, ls, and rm are specially aware of the The general design approach for PSI is that the user tar files, and can take advantage of operating on whole simply specifies the files to be operated on, PSI tar files when feasible. determines the resources available to the command, and then executes the command, with all optimization 8. Techniques to Increase Performance being performed automatically, including such features as adjusting all types of thread counts dynamically, The general approach chosen involves the use of optimizing the order of any data transfers to/from tape, parallel data and metadata processing, automatic assignment of operations across multiple hosts optimized file aggregation and de-aggregation, and (including load leveling), and splitting large transfers conditional operations when feasible. Combining these across hosts when appropriate. three features provides a variety of performance increases. For example, multi-threading to a degree of To support automatic optimization, all activity within 40 threads might increase performance by a factor of PSI is controlled using a priority-based resource 30, while operating on a file aggregate of 1000 files management scheme, limiting the amount of can provide a performance boost of up to 300. bandwidth and memory that each type of activity can Conditional operations can provide a factor of 20 or so. consume. Scheduling of activities such as file transfers By combining these three features, performance gains are performed via an internal job scheduler, which of over 1,000 have been observed, as outlined below. dispatches activities across available hosts in an optimal order, load leveling all activities as necessary. 9. Multiphase Parallel Work Flow

6. Conditional Transfers To facilitate efficient control of the various steps required to execute user requests in a parallel fashion, To address occurrences of interrupted transfers as well For example, tasks are organized into phases, e.g. 1) as that of newly arriving (or updated) data within a stat source files, 2) stat destination files, 3) transfer data set, PSI can scan both the input tree and output files. Work progresses though each phase. Each phase tree, examining file attributes to determine which files can consist of many threads, each requiring different need to be transferred. This feature alone can routinely resources. Achieving high performance in processing save hours of time that would be spent on re- metadata requires a reasonably high degree of transferring the entire tree. parallelism; typical thread counts for all three phases is 100 to 200, depending upon the mix of metadata and 7. Parallel Archival Tarring Option file transfer operations being performed.

When transferring to the archive, the user can select the PSI tar option, which automatically utilizes parallel 10. Areas of Performance Increase tar transfers to/from the archive. The parallel tar capability in PSI typically constructs one or more tar Work at increasing performance has fallen in two files per directory, preserving the original tree general areas – increasing parallelism, and decreasing structure. Large files are normally transferred un-tarred latency with the latter area receiving the most effort. to the archive. Multiple tar processes are load leveled Increasing parallelism generally falls in the predictable across available hosts, providing scalable multi-host categories of more threads, and more hosts, performance, even for small files. with some miscellaneous optimization applied to areas such as when to transfer large files across nodes, etc. The archive namespace is extended into the tar files present, utilizing the index file that is stored with each Effort to decrease latency has been largely in the area tar file. This namespace extension prevents the archive of various types of aggregation, namely 1) data from becoming a large black box of data. The user can aggregation, 2) control aggregates, 3) metadata query aggregates, and 4) metadata update aggregates. 11. Conclusion Metadata query work has involved experiments with striping directory queries across multiple hosts, with an Combining the techniques of multi-threaded eye toward support of massive directories (directories processing of data and metadata with the concept of with greater than 50,000 files). small file aggregation can result in significant performance increases. These increases can be further Since aggregation is largely connected with latency, improved by adding techniques such as conditional the benefits from aggregation tend to be shared across updates or conditional file transfers. Performance the areas of faster scheduling, more scalability, and increases above factors of 1000 have been observed. In faster WAN operations. addition, using user-generated aggregates can result in significant decreases in archival system metadata.

12. Performance Results

The following results were obtained on a cluster of 4 client nodes connected to to a Panasas file system, For various files sizes and commands.

Local Mode

cp cp cp conditional chmod find rm 1KB 10MB 1GB transfer Mode (files/s) (MB/s) (MB/s) (files/s) (Files/s) (Files/s) (Files/s)

sh 32 15 55 - 47 312 247 psi (local) 1,813 398 431 2364 2,090 1,743 2,999

Also, in a recent large scale test on 16 nodes connected to a Panasas file system, 295 TB of data (consisting of 967,000 files) were copied at an average rate of 2.85 GB/sec.

The following results were obtained on a cluster of 4 nodes, from a Panasas file system to Los Alamos HPSS.

Archive Mode (HPSS)

cp cp cp cond chmod find rm 1KB 10MB 1GB Mode (Files/s) (MB/s) (MB/s) (files/s) (Files/s) (Files/s) (Files/s)

psi (HPSS) 80 1,071 580 102 295 599 155 psi (HPSS, TAR) 1,139 256 269 2,365 31,188 2,999 15,210

The following results were obtained on a cluster of 4 nodes, from a local Panasas file system to a remote Lustre file system, with a round trip time of 38 ms (Los Alamos, NM to Livermore, CA)

Host-to-Host Mode

cp cp cp cond chmod find rm 1KB 10MB 1GB Mode (Files/s) (MB/s) (MB/s) (files/s) (Files/s) (Files/s) (Files/s)

psi (h2h) 1,933 423 480 2,396 2,433 3,012 229

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Richard Hedges Lawrence Livermore National Laboratory [email protected]

ABSTRACT / SUMMARY Having an earlier version of LMT accessible to This position paper discusses issues of users proved problematic. The particular issue usability of the large parallel file systems was that some users would “cry wolf” when they in the Livermore Computing Center. The saw periods of heavy usage of a file system. primary uses of these file systems are for These notifications were generally self serving storage and access of data that is created and counter productive, so presently LMT is during the course of a simulation running available for system administrators only. on an LC system. In daily operations, Lustre system administrators may have running instances of LMT, but would not necessarily be tracking the output, unless a INTRODUCTION problem (such as a file system being sluggish or The Livermore Computing Center has multiple, unresponsive) had been reported. Due to the globally mounted parallel file systems in each of architecture of Lustre, LMT leads one down an its computing environments. The single biggest indirect path in identifying the source of a file issue of file system usability that we have system load. Load is observed on storage or encountered through the years is to maintain metadata servers, next correlated with client continuous file system responsiveness. Given the activity, and finally (hopefully) identified with a back end storage hardware that our file systems single users job. This detective work can take are provisioned with, it is easily possible for a some time, so it can be a challenge to do all of the particularly I/O intensive application or one with tracing while the offending code instance is still particularly inefficiently coded I/O operations to active. bring the file system to an apparent halt. Darshan, strace: Since a single application The practice that we will be addressing is one of program with inefficiently coded I/O operations having an ability to indentify, diagnose, analyze can have center wide negative impact on parallel and optimize the I/O quickly and effectively. file system function and usability, It is critical to be able understand the sequence of I/O operations Tools applied that a code is generating and to understand the LMT: LMT[1] (Lustre monitoring tool) is run by effects of those on file system behavior. the Livermore Computing system administration staff to monitor operation of the Lustre file For some time we have been in the business of systems. It is generally used to probe and isolate profiling file system I/O for selected applications reported problems rather than to identify the to diagnose and resolve performance problems problem before or as it develops. causing center wide impact. Initially profile data was extracted exclusively from strace [2], and

1 application runs and analyzed essentially by Let me interject a personal comment here related manually reviewing the data. to user training, because I am eager to see if We generally trace with the options “strace -tt - others at the workshop have observed similar. etrace=file,read,write,close,lseek,ioctl” which Relative to other parts of parts of a complex HPC provides time stamped system call traces to system (e.g. processor architecture, code standard error for the I/O related system calls parallelization) I find that our user community identified. We can collect the system call traces seems generally more resistant to learning about on a per process basis. It is possible to trace a the I/O architecture and how to use and code to it running process, or to incorporate the tracing in a effectively. I suspect that this is a historically job run script. bias that the CPU processing is the valuable resource and the I/O bandwidth and storage More recently we also use Darshan[3] from capacity are free. We at the center may have Argonne National Laboratory. Darshan is a reinforced this, if subtly, by our accounting and petascale I/O characterization tool. Darshan is allocation policies. designed to capture an accurate picture of application I/O behavior, including properties CONCLUSIONS such as patterns of access within files, with For the key initial step of identifying a code or minimum overhead. Darshan includes scripting user who, by their I/O actions are impacting the to analyze and aggregate the data. user, we have a workable tool. LMT provides an A code can be instrumented with Darshan by path, albeit indirect to associate system load with utilizing wrapper scripts, or by interposing the a particular root cause. libraries using LD_PRELOAD. Being as The strace and Darshan offer approaches (some lightweight as it is makes it suitable for full time overlapping and some complementary) to analyze deployment, although we at LC do not apply it in the I/O execution of an HPC application. They that manner. allow one to identify and localize a problem in a These methods are available to users, but have code. primarily been applied by an LC staff member on We have an issue on the user training side. Some behalf of a user or application team. Note that code teams have taken on the challenge of these methods are also applied in the case where understanding good application I/O practices. the performance issues impact the user’s Others have been motivated only when I/O productivity, even if the center-wide impact is performance was an insurmountable hurdle. minimal. REFERENCES User training and documentation 1. LMT github site Training specific to application I/O performance https://github.com/chaos/lmt/wiki issues is summarized in two documents 2. strace: stardard Linux command to “trace maintained on the clusters in /usr/local/docs: (1) system calls and signals” see “man strace” Lustre.basics and (2) Lustre.striping. We have also included I/O specific discussions in user 3. Darshan: Petascale I/O Characterization Tool oriented system status meetings on a regular http://www.mcs.anl.gov/research/projects/dars basis. Consulting is available, and is offered on a han/ general and on an intervention basis.

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper NCARʼs Centralized Data Services Design

Pamela Gillman Erich Thanhardt NCAR Computational & Information Systems Lab NCAR Computational & Information Systems Lab [email protected] [email protected]

ABSTRACT / SUMMARY onto a different resource. This process may repeat multiple times as either issues are identified in the As NCAR designs and builds our next data, or discoveries spark a new direction of generation computational center, we are research. exploring ways to evolve scientific data workflows from a process centric model to a more information centric paradigm. By looking at cyberinfrastructure design, resource allocation policies and software methodologies, we can help accelerate scientific discoveries possible from computational resources of this scale. We will explore the challenges we have identified in our data architecture and present some of our current projects moving us towards an information centric solution. When surveyed, users identified the movement of

data between resources as a significant bottleneck INTRODUCTION in their workflow. Therefore, armed with this Computational centers have traditionally information, and driven by the ever-escalating provided systems architected for a single use such costs of archival systems and the increases in as computation, data analysis or visualization. ability to produce data, we started looking at Each resource has local storage configured for the architectural solutions to evolve workflow. The typical task performed and the center provides a traditional workflow model is very process common tape based archival system. This centric. Data moves between resources dedicated encourages a scientific workflow that typically to a single step in the overall process and the involves retrieving input data from the archive archive essentially becomes a file server. What if system, generating new data, including we started looking at the data as the center of the intermediate files necessary for the next run, and workflow? Not only would this decrease the finally storing all data back on the archive number of data movements in the workflow, but system. If you wish to post-process, analyze or it can potential decrease the amount of data visualize the data, you need to read the data back

1 ultimately targeted for actual archiving. We now including access to data collections managed by refer to this as an information centric model. NCAR. This architecture needed to be flexible enough to support current systems and science gateways, and also able to scale as HPC resources grow with the new data center. And to ensure the shift in workflows, the user needs to be able to interface with this environment in the same way no matter which resource they are working from or what task they are trying to accomplish. High bandwidth connectivity to any resource within this environment and a choice of interfaces support current projects and are flexible enough to support future requirements.

Evolving Scientific Data Workflow Based on an analysis of current workflows, we identified several challenges in evolving data workflow toward the new paradigm. Many bottlenecks exist in current workflows; it’s time consuming to move data between systems; bandwidth to the archive system is insufficient; and available disk storage space is insufficient. This presents a bit of a ‘chicken and egg’ problem. The current environment potentially shapes current user behavior. How do we anticipate the behavioral changes that will occur The GLADE data architecture becomes the with a significant change in the environment? centerpiece of the new information centric Storage cost curves are steeper than compute workflow. Data can stay in place through the costs so how do we find the right balance entire process as GLADE provides not just between storage and compute investments? ‘scratch’ space for computational runs, but Archive costs are on an unsustainable growth persistent longer-term storage for data processing, curve so how do you better balance usage of disk analysis and visualization. This persistent storage space versus archive space? allows completion of the entire workflow prior to final storage of results either at NCAR or offsite. Globally Accessible Data Environment The addition of high-bandwidth data transfer The first step we took was to centralize our data services with direct access to the GLADE analysis and visualization resources around a environment provides efficient methods for centralized storage system. This was a lower moving data between NCAR and peer bandwidth parallel file system solution that institutions. provided users of these resources a single namespace for their work. Space was provided Accounting Systems Enhancements not only for short-term ‘scratch’ usage but also To allow for better tracking of resource usage, the for longer-term project spaces necessary for data NCAR accounting system is being redesigned to analysis and visualization work. account for computational resources, data analysis and visualization resources, disk storage The next step was designing a scalable usage, data transfer services and archival architecture that encompassed all HPC resources services. These tools will allow NCAR to fine tune the balance between resources based on Case 2: Research Data Archive (RDA) evolving usage patterns due to changes in The Research Data Archive provides access to workflow. common data sets to the research community. Prior to the implementation of the GLADE Workflow Examples Case 1: Nested Regional Climate Model (NRCM) architecture this data was only available from the archive system. By allocating space within GLADE for the RDA data, access can now be The project group has common access to ‘scratch’ granted directly from NCAR’s HPC resources. space and a dedicated longer-term project space. Previous workflows needed to first copy this data The computational team submits a model run to from the archive to a ‘scratch’ area before the supercomputing queue. The model outputs running the computational job. There were costs approximately 100 variables per time step along in time required to access the data, space required with intermediate data files associated with to hold the data and the side effect of numerous startup of the next time step to the ‘scratch’ file copies of the same data being on disk at the same system. Once the model completes, a post- time. 
With direct access now available, jobs use processing job pulls approximately 20 variables the data from a central location that’s of interest into data analysis files writing these immediately available to all jobs and doesn’t rely into their project space. The analysis files are now on archival access.. available for analysis by the research team. This data will stay in place as long as necessary to CONCLUSIONS complete the analysis. A final job step writes the We feel that we have made progress towards a final full output files to the HPSS archive. Since better architecture to meet the diverse needs of the ‘scratch’ file system is purged regularly, our user community. We believe that this intermediate files that are no longer needed are architecture is sustainable into the future and will never stored on the archive, and smaller data sets help balance the costs associated with needed for analysis are available to the team right compute/storage/archival. Checks are in place so after the computational run completes. adjustments can be made as user behaviors change and hopefully data management becomes not just a tedious task, but also something that results in more productive scientific discovery.

3 U.S. Department of Energy Best Practices Workshop on

File Systems & Archives

San Francisco, CA

September 26-27, 2011

Position Paper

Kevin Glass PNNL [email protected]

ABSTRACT / SUMMARY by EMSL instruments comes in a wide variety of formats including images, spectrographs, single MyEMSL is a data management system value outputs. The data will appear as files for scientific data produced at EMSL. The ranging from a single file on the order of 100 GB data must be made accessible to users file to thousands of files on the order 1 MB each. either by a simple directory-style search, Some of the instruments will produce only a a metadata search or as part of a single file, which currently is entered manually workflow. To provide these features the into a lab notebook. system requires several points of interaction between users and the EMSL In addition to getting EMSL users their data, the archive. MyEMSL archive was designed to be a node in a workflow. Raw and processed data is stored in This position paper addresses the the archive, which is connected to computational workshop’s Usability of Storage Systems resources. For example, the user can transfer data track and summarizes what we have from the archive to a search node running accomplished in the development of Hadoop. The results of this search will be MyEMSL and some of the challenges that transferred back to the MyEMSL archive, then remain with regard to the interaction passed to Chinook for further processing and between users and our archive. ultimately presented to the user through a visualization tool. INTRODUCTION The MyEMSL data management system was As with most of MyEMSL, this technique is in its developed to collect and archive data produced by infancy. We are examining options for critical EMSL instruments. The data is made available to parts of the infrastructure though most much of it the scientists who are allowed access to the data is in place and working. via the web and to analyze that data using 1. DESIGN OF MyEMSL EMSL’s computational resources, including The design philosophy for MyEMSL was to let Chinook, EMSL’s supercomputer. someone else do the work. The system relies on EMSL boasts more then 150 scientific several open source software products such as instruments that are capable of producing Apache and SLURM. The goal was to select the terabytes of data each week. The data collected optimal product for each type of component by testing the performance and ease of use for each Information Release #PNNL-20687 1

product. However, budget, time and other factors files and the metadata is stored in a database. To limited the amount of testing and development facilitate searches, MyEMSL generates specific available to the team. databases based on the field of interest. These act Exacerbating this problem was the need to as index caches for common—field based— understand the use cases for hundreds of searches. For example, searches for system scientists using hundreds of different instruments. biologists may include genetic information and Prior to the development of MyEMSL, the ignore thermodynamic properties gathered from a development team reviewed information single experiment. The main point is the scientist collected from previous attempts to develop data must be the person defining what is important for management systems for EMSL. These records the scientist. included information regarding the amount of The specifics of metadata storage are still under data collected by each instrument, the number investigation. We are currently considering two and sizes of the files produced, the format or possibilities: a large relational database and a formats of the data and whether or not further quad store. We are experimenting with these processing of the data was required before it was options using three criteria: performance, ease of released to the end user. The instrument implementation and ease of use. Of these options, information survey was originally collected on only performance is an objective measure. The paper. It has since been moved to a web form and others are clearly subjective and will be tested repopulated. As new instruments are added to within the constraints of our budget. EMSL’s instrument suite, the person responsible The management of raw, processed and metadata for the instrument will complete this form. lead to two questions: how do we collect In addition to collecting fundamental information provenance data and how do we define data using regarding new equipment, the developers need to common or standardized mechanisms? The collect information regarding the metadata collection of provenance data begins with an associated with the system. Given the number and EMSL user submitting a sample for analysis. As specialization of each instrument, collecting the this is the only person who can legitimately appropriate information from a one-on-one define the nature of the sample, the user is interview is infeasible given our time constraints. required to complete a “sample submission” To address this problem, we are implementing form. Currently, this form collects only the basic two features in this system: a metadata form data regarding a sample, such as the name of the builder and metadata extractors to collect submitter, the date of submission and so on. As metadata that is automatically collected by the with the collection of experiment metadata, the instrument and stored in the raw or processed information required for a given sample type data files. must be defined by the scientists responsible for operating the instrument. 2. COLLECTING METADATA Given the evolving nature of science and Other sources of provenance data include the scientific inquiry, the required metadata configuration of each instrument used to process collection set will necessarily evolve. This a sample, the operating environment of the requires MyEMSL to allow for a flexible instrument and a description of the workflow metadata storage system. We define metadata to used to process a sample. 
Some of the data is be data used by the end user for searching; a automatically collected from the instrument or description of where the results of an experiment detectors near the instrument. Others must be reside. When data is transferred from an manually entered before data is loaded into the instrument to the archive, the data is divided into archive. This presents a user-interface problem. metadata and raw or processed data. The raw or The system must minimize requirements for processed data is stored in the archive as a set of manual data collection and it must ensure the data the directory. The archived data files is collected. We are still working on this problem. are sent to the archive via an Apache server to a storage queue managed by SLURM. When the 3. UNIFYING DATA STANDARDS The question regarding common or standardized archive is ready to store the input data, the data is data representations is a difficult one to answer. transferred from the queue to the archive. The scientific community agrees on few CONCLUSIONS standardized representation and any level. Thus, MyEMSL is a data management system currently there are several “standards” for each field and in operation at EMSL however we are continuing implementing all of them would be impossible. to investigate some features. The system can take Our approach to this problem is to adopt a single, data from scientific instruments, load the raw, recognized standard for storing metadata and to processed and metadata in a database and makes use field-specific representations for the index this data available to users. The main issues under caches. We are currently working with Nanotab investigation relate to the ease of use of the [1] to represent both nanoparticle and some system by end users. These include features that biological data. To generate the index caches, are directly accessible by the user, for example, MyEMSL requires translators from Nanotab to data transfer from an instrument to the archive other domain standards. We are currently and features that are indirectly accessible, such as investigating, with input from the scientific the metadata database. community, how to implement this feature. REFERENCES The central feature of MyEMSL is its ability to 1. caBIG Nanotechnology Working Group move files of varying sizes to the archive and to http://sites.google.com/site/cabignanowg/ho retrieve those files. Before a scientist performs me and experiment, he or she must configure a listener called ScopeSave. ScopeSave will monitor a specified directory and, when the experiment is complete, it will archive and

Information Release #PNNL-20687 3

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Ruth Klundt Stephen Monk Sandia National Labs Sandia National Labs [email protected] [email protected]

John Noe James Schutt Sandia National Labs Sandia National Labs [email protected] [email protected]

ABSTRACT work that some portions of the infrastructure remain available. We here present an overview of our current file system strategies, and a brief At present we maintain three basic categories of mention of planning for the future. The storage. focus of the discussion is the link • Site-Wide Parallel File System between usability issues and implementation decisions. Our parallel file system is implemented using Lustre [1] running on commodity servers, backed by DDN 9900/9550 raid INTRODUCTION cabinets. This file system serves ~2PB of Our primary drivers in the design of file system fast scratch space to 4 different clusters, solutions are reliability and performance. In via LNET routers. Testing is under way addition we attempt to provide solutions covering on an upgrade to DDN SFA10K hardware a spectrum of user needs, which also include providing ~3PB space for the new TLCC2 convenience of use, backup capability, and high installation. Software support for Lustre is availability. provided by Whamcloud [2]. Overview of Current File Systems • Intermediate NFS File System User requirements in the storage arena are often On all clusters, a large storage space is difficult or impossible to satisfy simultaneously delivered by means of Sun Unified with a single global solution. Simulation codes Storage (7410) using ZFS. This is not generate a large quantity of restart data that must purged, and not backed up. be stored quickly, as a defense against system outages. Most of this data is transitory, so does • Traditional NAS not need to be backed up. Other types of user data Less than 100TB, provided by NetApp such as application codes and input data must be hardware backed up to corporate archives. stored reliably. During periods of maintenance, it Stable, safe, slow location serving user is important to users for the continuity of their /home and /projects space commonly across the clusters.


Usability Impact
The parallel file system satisfies the need for fast storage of large data sets. Although no backups can be done at this size, all possible efforts are made to avoid data loss by means of hardware RAID configurations, and to provide continuity by means of Lustre failover and multi-path I/O. The local Red Sky Lustre implementation, which requires use of software RAID on the Sun equipment, has encountered some difficulties due to increased operational demands and is slated to be shut down in favor of site file systems.

The intermediate NFS file system provides an alternate location for users to continue work during maintenance periods on the parallel file system. The longevity of the Sun 7410 platform is not clear given the lack of a clear hardware roadmap from Oracle. Although it has proven to be a solid product within this role, we are moving to a solution that is less of a "black box" from the view of the hardware (see below).

The NetApp filers serving /home and /projects have a fairly long history of providing robust, reliable service here, although of limited size. New or different solutions have a high bar to meet in order to be considered as replacements for this functionality.

Future Plans
• GPFS NAS
Some DDN 9550 cabinets are currently being re-purposed for use with IBM's GPFS file system [3] as an alternate highly available storage space, implemented at minimum cost. Production deployment is imminent.

• Ceph
An effort is in progress to test the robustness, usability, and performance of the Ceph file system [4]. Early results show promise for this open source solution as a potential alternate in the NAS file system space in the near future. In addition, a variety of use cases other than HPC are being actively explored elsewhere, such as the ability to export as NFS, integration with pNFS [5], and access via user space clients. Interest in Ceph from disparate data storage venues can only improve the robustness of the implementation, and a broad user base provides some confidence that the file system has a productive future ahead. Some key design elements make Ceph a high performance file system of interest:
– Workload scalability (lots of servers/clients)
– On-line expansion (easy to add capacity and performance)
– Data replication (fault tolerance without RAID controllers)
– Adaptive meta-data server (scalable)
– Ability to reliably use commodity storage platforms

In conjunction with the Ceph testing effort, a heterogeneous test bed is being expanded and shared as a release test platform for production machines.

CONCLUSIONS
Challenges in maintaining multiple types of storage might be mitigated in the future, with improvements in current parallel file systems with respect to reliability and availability. Ideally, a single global file system solution with pools of storage configured for different use cases would streamline the delivery of the disparate services needed. A single solution capable of providing sufficient bandwidth to parallel platforms, differential backup capabilities, and 24/7 availability to users does not yet exist.
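As an illustration of the kind of scripted check that might be run against the Ceph test bed mentioned under Future Plans above, the sketch below shells out to the standard ceph command-line client for a cluster status report. It assumes the client is installed and configured on the node running it, and the exact JSON layout returned varies by Ceph release; this is an illustrative probe, not part of the site's actual tooling.

"""Illustrative health probe for a Ceph test bed; assumes the `ceph` CLI is
installed and configured. The JSON layout varies by Ceph release."""

import json
import subprocess


def ceph_status() -> dict:
    """Ask the cluster for its status in JSON form and return it as a dict."""
    result = subprocess.run(
        ["ceph", "status", "--format", "json"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    try:
        status = ceph_status()
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"could not query cluster: {exc}")
    else:
        # Print the health section verbatim rather than assuming a field name.
        print(json.dumps(status.get("health", status), indent=2))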

REFERENCES
1. Lustre, http://www.lustre.org
2. Whamcloud, http://whamcloud.com
3. GPFS, http://www-03.ibm.com/systems/software/gpfs/
4. Ceph File System, http://ceph.newdream.net
5. pNFS, http://www.pnfs.com

U.S. Department of Energy Best Practices Workshop on File Systems & Archives San Francisco, CA September 26-27, 2011 Position Paper

Yutaka Ishikawa, University of Tokyo, [email protected]

ABSTRACT
Two usability issues in storage systems connected with supercomputers are described. One is metadata access; the other comes from open source codes that handle file I/O poorly. In the latter case, because the users do not want to improve such codes, they ask us to install faster disks such as SSDs. In our experiment, improving the code made it run twice as fast, whereas a disk twice as fast yielded only a 10 to 20% gain. Another topic is a distributed shared file system being designed and deployed in Japan.

INTRODUCTION
The Information Technology Center at the University of Tokyo provides two supercomputer resources, SR11000 and the HA8000 cluster, for domestic academic users. SR11000 is a six-year-old machine and will be replaced this October. The HA8000 cluster consists of 952 nodes, each of which has two AMD Opteron 8356s (16 cores). Each supercomputer connects to a proprietary parallel file system called HSFS. The total storage size is 1.5 PB.

In addition to computational resource services, we are currently designing and deploying a distributed shared storage system, whose total size will be more than 100 PB, accessed by Japanese supercomputers including the K computer, as the nation-wide high performance computing infrastructure. This infrastructure, called HPCI (High Performance Computing Infrastructure), is supported by the Ministry of Education, Culture, Sports, Science, and Technology. As shown in Figure 1, there are two storage hubs, West and East. In the West hub, 10 PB of storage and a 60 PB tape archive will be deployed with the K computer; 12 PB of storage and 20 PB of storage will be deployed at our university. The system is currently under construction and will be operated from fall 2012.

[Figure 1: HPCI Storage]

Usability Issues in the HA8000 cluster
As many others have pointed out, interactive users complained about slow metadata access in the HSFS parallel file system, especially for the "ls -l" command, which involves many metadata accesses to obtain all file statuses. To provide faster response for interactive users, the vendor has modified its system to handle interactive users' requests first. This modification, along with several other changes, mitigates the slowness.

Another issue comes from open source codes that are not well programmed for file I/O. The users believe that the low performance of such a code results from slow file I/O, but this is sometimes not true. For example, a bioinformatics tool used by our bioinformatics users consists of two processing modules, genome alignment and data format conversion. The genome alignment processing contains many critical regions, and the low performance arises in those regions; after reducing the number of critical regions, the program runs twice as fast. The data format conversion processing opens and reads the same file about 1,000 times, eventually reading 1 TB of data in total; after eliminating this redundant I/O, the program runs twice as fast. The users had assumed that if the file system were twice as fast, the program would run twice as fast, but even with a file system twice as fast using SSDs, the improvement was only 10 to 20%.

Usability Issues in the HPCI storage system
This workshop may not consider distributed shared storage systems shared by different organizations, but we would like to address this kind of storage system because data sharing is important in data-intensive science. One good example is ILDG (International Lattice Data Grid), where lattice QCD (quantum chromodynamics) data generated by supercomputers are shared by international organizations. Another example is the climate simulation field. As far as we understand, a research group develops their simulation code, obtains data generated by the simulator, and, after the generated data are examined and new information is obtained, the data are opened to others who are interested in them for other purposes. Thus, the data is eventually shared by others.

The Japan HPCI intends to provide storage resources not only for traditional computational sciences but also for data-intensive sciences, including life science/drug manufacture, new material/energy creation, global change prediction for disaster prevention/mitigation, manufacturing technology, and the origin of matter and the universe. To provide better usability for those users, the issues are listed below. They all arise because many research fields use the storage system, and their peak and sustained demands cannot yet be predicted.
• Prediction of storage capacity
• Prediction of the amount of file transfers

CONCLUSIONS
Dusty, legacy code must be replaced with modern code to provide usability for such application users. We must also pay close attention to how distributed shared file systems interact with local file systems.
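To illustrate the kind of code fix described above, the sketch below contrasts a conversion step that re-reads the same reference file for every record with one that reads it once and reuses the in-memory copy. The file name and record count are hypothetical and not taken from the tool discussed in the paper; the point is only that the repeated reads, not the file system, dominate the run time.

"""Illustrative sketch only; the tool, file name, and sizes are hypothetical."""

from pathlib import Path

REFERENCE = Path("reference.dat")   # hypothetical reference file re-read by the original code
RECORDS = 1000                      # the paper describes roughly 1,000 re-reads


def convert_naive(records: int = RECORDS) -> int:
    """Original pattern: open and read the whole reference file once per record."""
    total = 0
    for _ in range(records):
        data = REFERENCE.read_bytes()   # repeated full read; ~1 TB of I/O in the real case
        total += len(data)
    return total


def convert_cached(records: int = RECORDS) -> int:
    """Fixed pattern: read the reference file once and reuse the in-memory copy."""
    data = REFERENCE.read_bytes()       # single read
    total = 0
    for _ in range(records):
        total += len(data)              # work against the cached bytes
    return total


if __name__ == "__main__":
    REFERENCE.write_bytes(b"x" * 1024 * 1024)   # small stand-in file for the demo
    assert convert_naive(10) == convert_cached(10)
    print("cached version performs one read instead of", RECORDS)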

Appendix B: Workshop Agenda

5th Best Practices Workshop for High-Performance Center Managers San Francisco, CA Marriott Marquis Hotel September 26-27, 2011

Monday, September 26

7:30-8:30 Breakfast and registration

8:30-8:45 Welcome (Club Room) Jason Hick, Lawrence Berkeley National Laboratory, and Yukiko Sekine, U.S. Department of Energy, SC

8:45-9:00 Thuc Hoang, U.S. Department of Energy, NNSA

9:00-9:15 Yukiko Sekine, U.S. Department of Energy, SC

9:15-9:30 Instructions for breakout sessions

9:30-10:00 Breakout Sessions

• Business of Storage Systems (Club Room)
• Administration of Storage Systems (Foothill B)
• Reliability and Availability of Storage Systems (Foothill D)
• Usability of Storage Systems (Foothill E)

10:00-10:30 Morning break

10:30-12:00 Breakout Sessions Continued

12:00-1:00 Lunch (Foothill G)

1:00-3:00 Breakout sessions continue

3:00-3:30 Afternoon break

3:30-4:00 Breakout sessions continue

4:00-4:15 Business Breakout: collection of thoughts/outbrief to entire group

4:15-4:30 Administration Breakout: collection of thoughts/outbrief to entire group

4:30-4:45 Reliability Breakout: collection of thoughts/outbrief to entire group

4:45-5:00 Usability Breakout: collection of thoughts/outbrief to entire group


Tuesday, September 27

7:30-8:30 Breakfast

8:30-9:00 Checkpoint and directions to breakout leaders

9:00-10:00 Breakouts continue

10:00-10:30 Morning break

10:30-12:00 Breakouts continue

12:00-1:00 Lunch

1:00-2:00 Breakouts continue

2:00-2:30 Business Breakout: collection of thoughts/outbrief to entire group

2:30-3:00 Administration Breakout: collection of thoughts/outbrief to entire group

3:00-3:30 Afternoon break

3:30-4:00 Reliability Breakout: collection of thoughts/outbrief to entire group

4:00-4:30 Usability Breakout: collection of thoughts/outbrief to entire group

4:30-5:30 Plenary workshop summary and next steps (report)


Breakout Sessions

The Business of Storage Systems (Club Room)

Sarp Oral, Oak Ridge National Laboratory and David Cowley, Pacific Northwest National Laboratory

After reviewing position papers, here are the topics of discussion towards identifying best practices:

• Combining in-house expertise with open source and proprietary solutions and managing the vendor relationship
• Using COTS in HPC storage
• Planning and implementing center-wide file systems
• Establishing I/O requirements
• Dealing with exponential data growth
• Evaluating and integrating new technologies
• Making effective use of storage hierarchies

The Administration of Storage Systems (Foothill B)

Susan Coghlan, Argonne National Laboratory and Jerry Shoopman, Lawrence Livermore National Laboratory

Plan

We have selected 5 position papers and a single topic from each paper. We are asking the authors to prepare 2-3 slides on that topic from the paper. During the breakout session, for each paper, an author will present their slides and the group will discuss. We chose to do it this way because most of the papers had a lot of topics and we didn't feel there would be time for each author to present all of them.

Administration Breakout Agenda

1. Review plan for the day, scope of discussion, get feedback for modifications, finalize plan (15 mins)
2. Position paper presentations and discussion (1.5 hrs - {10 mins presentation, 10 mins discussion} x 5 position papers)
3. Free discussion (30 mins)
4. Finish preparing report (30 mins)

Topics/Papers

1. Tiered solutions (performance improvements) - Torres/Scott paper [LANL]
2. Data integrity/availability - Heer paper [LLNL]
3. Disk quotas (managing growth) - Cardo paper [NERSC/LBNL]
4. Performance over time - Harms paper [ALCF/Argonne]
5. Configuration management and change control - Hill/Thach paper [OLCF/ORNL]


The Reliability and Availability of Storage Systems (Foothill D)

Mark Gary, Lawrence Livermore National Laboratory and Jim Rogers, Oak Ridge National Laboratory

• Resilience and fault tolerance - RAID, backup, multi-path
• Data integrity - checksums
• Daily operations - software maintenance
• Off-hour support and availability
• Monitoring and tools
• Other - leveraging procurements, contractual reliability commitments

The Usability of Storage Systems (Foothill E)

Shane Canon, Lawrence Berkeley National Laboratory and John Noe, Sandia National Laboratories

Scope of the session and boundaries
• Between usability and administration
• Between usability and data management and data formats
• Position paper authors present (slides or talk), also a few others with contributions (1st hour)
• Free form discussion after lunch to pull forth ideas
• Last half hour to firm


Appendix C: Workshop Attendees

John Bent, Los Alamos National Laboratory
Ryan Braby, National Institute for Computational Sciences
Jeff Broughton, National Energy Research Scientific Computing Center
Shane Canon, National Energy Research Scientific Computing Center
Nicholas Cardo, Lawrence Berkeley National Laboratory
Geoff Cleary, Lawrence Livermore National Laboratory
Susan Coghlan, Argonne National Laboratory
Roger Combs, HPC Navy Program
David Cowley, Pacific Northwest National Laboratory
Kim Cupps, Lawrence Livermore National Laboratory
Philippe Deniel, Center for Atomic Energy, Military Applications Division
Mike Farias, Sabre Systems
Evan Felix, Pacific Northwest National Laboratory
Naoyuki Fujita, Japan Aerospace Exploration Agency
Mark Gary, Lawrence Livermore National Laboratory
John Gebhardt, AFRL DSRC
Pam Gillman, National Center for Atmospheric Research
Kevin Glass, Pacific Northwest National Laboratory
Stefano Gorini, Swiss National Supercomputing Centre
Stephan Graf, Forschungszentrum Jülich GmbH/Jülich Supercomputing Centre
Gary Grider, Los Alamos National Laboratory POC
Kevin Harms, Argonne National Laboratory
Richard Hedges, Lawrence Livermore National Laboratory
Todd Heer, Lawrence Livermore National Laboratory
Cray Henry, High Performance Computing Modernization Program
Jason Hick, National Energy Research Scientific Computing Center
Jason Hill, Oak Ridge National Laboratory
Thuc Hoang, U.S. Department of Energy, National Nuclear Security Administration
John Hules, Lawrence Berkeley National Laboratory
Wayne Hurlbert, National Energy Research Scientific Computing Center
Yutaka Ishikawa, The University of Tokyo
M'hamed Jebbanema, Los Alamos National Laboratory
Thomas Kendall, U.S. Army Research Laboratory
Dries Kimpe, Argonne National Laboratory
Ruth Klundt, Sandia National Laboratory
Rei Lee, Lawrence Berkeley National Laboratory
John Noe, Sandia National Laboratory
Lucy Nowell, U.S. Department of Energy, Advanced Scientific Computing Research
Jack O'Connell, Argonne National Laboratory
Sarp Oral, Oak Ridge National Laboratory
Alex Parga, National Center for Supercomputing Applications
Norbert Podhorszki, Oak Ridge National Laboratory
Mark Roberts, Atomic Weapons Establishment
James Rogers, Oak Ridge National Laboratory
Mark Roschke, Los Alamos National Laboratory
Tim Scott, Northrop Grumman

Yukiko Sekine, U.S. Department of Energy, Office of Science
Jerry Shoopman, Lawrence Livermore National Laboratory
Mike Showerman, National Center for Supercomputing Applications, Innovative Systems Lab
Kazuhiro Someya, Japan Aerospace Exploration Agency
Bert Still, Lawrence Livermore National Laboratory
Osamu Tatebe, The University of Tokyo
Kevin Thach, Oak Ridge National Laboratory
Erich Thanhardt, National Center for Atmospheric Research
William Thigpen, National Aeronautics and Space Administration, Ames
Aaron Torres, Los Alamos National Laboratory
Dominik Ulmer, Swiss National Supercomputing Centre
Andrew Uselton, Lawrence Berkeley National Laboratory
Dick Watson, Lawrence Livermore National Laboratory
Charlie Whitehead, Massachusetts Institute of Technology, Lincoln Lab
Yushu Yao, Lawrence Berkeley National Laboratory
