
Monitoring, Analysis and Optimisation of I/O in Parallel Applications

by Steven Alexander Wright

A thesis submitted to The University of Warwick

in partial fulfilment of the requirements

for admission to the degree of

Doctor of Philosophy

Department of Computer Science

The University of Warwick

July 2014

Abstract

High performance computing (HPC) is changing the way science is performed in the 21st Century; experiments that once took enormous amounts of time, were dangerous and often produced inaccurate results can now be performed and refined in a fraction of the time in a simulation environment. Current generation supercomputers are running in excess of 10^16 floating point operations per second, and the push towards exascale will see this increase by two orders of magnitude. To achieve this level of performance it is thought that applications may have to scale to potentially billions of simultaneous threads, pushing hardware to its limits and severely impacting failure rates.

To reduce the cost of these failures, many applications use checkpointing to periodically save their state to persistent storage, such that, in the event of a failure, computation can be restarted without significant data loss. As computational power has grown by approximately 2× every 18–24 months, persistent storage has lagged behind; checkpointing is fast becoming a bottleneck to performance.

Several software and hardware solutions have been presented to solve the current I/O problem being experienced in the HPC community and this thesis examines some of these. Specifically, this thesis presents a tool designed for analysing and optimising the I/O behaviour of scientific applications, as well as a tool designed to allow the rapid analysis of one software solution to the problem of parallel I/O, namely the parallel log-structured file system (PLFS). This thesis ends with an analysis of a modern file system under contention from multiple applications and multiple compute nodes running the same problem through PLFS. The results and analysis presented outline a framework through which application settings and procurement decisions can be made.

This thesis is dedicated to the memory of my Grandad.

Ian Haig Henderson

(1928 – 2014)

Acknowledgements

Since walking into the Department of Computer Science for the first time in October 2006, I have been fortunate enough to meet, work with and enjoy the company of many special people. First and foremost, I would like to thank my supervisor, Professor Stephen Jarvis, for all his help and hard work over the past 4 years and for allowing me the opportunity to undertake a Ph.D. I would also like to thank him for providing me with funding and a post-doctoral position in the department.

Secondly, I would like to thank the two best office-mates I'm ever likely to have, Dr. Simon Hammond and Dr. John Pennycook. I have been in a lull since each of them moved on to pastures new. I consider my time sharing an office with each of them to be both the most productive (with Si) and most entertaining (with John) time in my Warwick career, and I miss their daily company dearly.

Thirdly, I acknowledge my current office-mates, Robert Bird and Richard Bunt. Both provide nice light entertainment and interesting discussions to break up the day and make my working environment a more pleasant place to spend my time. Additionally, I thank the other members of the High Performance and Scientific Computing group, past and present – David Beckingsale, Dr. Adam Chester, Peter Coetzee, James Davis, Timothy Law, Andy Mallinson, Dr. Gihan Mudalige, Dr. Oliver Perks and Stephen Roberts – for their lunchtime company and occasional pointless discussions.

Finally, within the University, I would like to thank the group of individuals that have helped me through both my undergraduate and postgraduate studies – Dr. Abhir Bhalerao, Jane Clarke, Dr. Matt Ismail, Dr. Arshad Jhumka, Dr. Matthew Leeke, Dr. Christine Leigh, Prof. Chang-Tsun Li, Rod Moore, Dr. Roger Packwood, Catherine Pillet, Jackie Pinks, Gill Reeves-Brown, Phillip Taylor, Stuart Valentine, Dr. Justin Ward, Paul Williamson, amongst many others.

Outside of the University, thanks go to the organisations that have contributed resources and expertise to much of the material in this thesis: the team at the Lawrence Livermore National Laboratory for allowing access to the Sierra and Cab supercomputers; the team at Daresbury Laboratory for granting time on their IBM BlueGene/P; and finally, to both Meghan Wingate-McLelland, at Xyratex, and John Bent, at EMC, for contributing their time and expertise to my investigations into the parallel log-structured file system.

Special thanks is reserved for my closest friend for four years of undergraduate studies and current snowboarding partner, Chris Straffon. I only wish our snowboarding excursions were both longer and more frequent; they're currently the week of the year I look forward to the most.

And last, but certainly not least, thanks go to my family – Mum, Dad, Gemma and Paula; my beloved Gran and Grandad; my aunties and uncles; and my nieces and nephews – Chloe, Sophie, Megan, Lauren, Charlie, Holly-Mae, Liam, Tyler, Aimee and Isabelle – who often make me laugh uncontrollably and cost me so much money each Christmas time. And finally, huge thanks go to my girlfriend, Jessie, for putting up with me throughout my Ph.D. and making my life more enjoyable.

Declarations

This thesis is submitted to the University of Warwick in support of the author's application for the degree of Doctor of Philosophy. It has been composed by the author and has not been submitted in any previous application for any degree. The work presented (including data generated and data analysis) was carried out by the author except in the cases outlined below:

• Performance data in Chapters 4 and 5 for the Sierra supercomputer were collected by Dr. Simon Hammond.

Parts of this thesis have been previously published by the author in the following publications:

[141] S. A. Wright, S. D. Hammond, S. J. Pennycook, R. F. Bird, J. A. Herdman, I. Miller, A. Vadgama, A. H. Bhalerao, and S. A. Jarvis. Parallel File System Analysis Through Application I/O Tracing. The Computer Journal, 56(2):141–155, February 2013

[142] S. A. Wright, S. D. Hammond, S. J. Pennycook, and S. A. Jarvis. Lightweight Parallel I/O Analysis at Scale. Lecture Notes in Computer Science (LNCS), 6977:235–249, October 2011

[143] S. A. Wright, S. D. Hammond, S. J. Pennycook, I. Miller, J. A. Herdman, and S. A. Jarvis. LDPLFS: Improving I/O Performance without Application Modification. In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium Workshops & PhD Forum (IPDPSW'12), pages 1352–1359, Shanghai, China, 2012. IEEE Computer Society, Washington, DC

[144] S. A. Wright, S. J. Pennycook, S. D. Hammond, and S. A. Jarvis. RIOT – A Parallel Input/Output Tracer. In Proceedings of the 27th Annual UK Performance Engineering Workshop (UKPEW'11), pages 25–39, Bradford, UK, July 2011. The University of Bradford, Bradford, UK

[145] S. A. Wright, S. J. Pennycook, and S. A. Jarvis. Towards the Automated Generation of Hard Disk Models Through Physical Geometry Discovery. In Proceedings of the 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS'12), pages 1–8, Salt Lake City, UT, November 2012. ACM, New York, NY

In addition, research conducted during the period of registration has also led to the following publications which, while not directly focused on parallel I/O, have nevertheless shaped the research presented in this thesis:

[12] R. F. Bird, S. J. Pennycook, S. A. Wright, and S. A. Jarvis. Towards a Portable and Future-proof Particle-in-Cell Plasma Physics Code. In Proceedings of the 1st International Workshop on OpenCL (IWOCL'13), Atlanta, GA, May 2013. Georgia Institute of Technology, GA

[13] R. F. Bird, S. A. Wright, D. A. Beckingsale, and S. A. Jarvis. Performance Modelling of Magnetohydrodynamics Codes. Lecture Notes in Computer Science (LNCS), 7587:197–209, July 2013

[98] S. J. Pennycook, S. D. Hammond, G. R. Mudalige, S. A. Wright, and S. A. Jarvis. On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures. The Computer Journal, 55(2):138–153, February 2012

[99] S. J. Pennycook, S. D. Hammond, S. A. Wright, J. A. Herdman, I. Miller, and S. A. Jarvis. An Investigation of the Performance Portability of OpenCL. Journal of Parallel and Distributed Computing (JPDC), 73(11):1439–1450, November 2013

Sponsorship and Grants

The research presented in this thesis was made possible by the support of the following benefactors and sources:

• The University of Warwick, United Kingdom:
  Engineering and Physical Sciences Research Council Doctoral Training Accounts Studentship (2010–2014)

• Royal Society:
  Industry Fellowship Scheme (IF090020/AM)

• UK Atomic Weapons Establishment:
  “The Production of Predictive Models for Future Computing Requirements” (CDK0660)
  “AWE Technical Outreach Programme” (CDK0724)
  “TSB Knowledge Transfer Partnership” (KTP006740)

Abbreviations

ADIO Abstract Device Interface for Parallel I/O

ANL Argonne National Laboratory

API Application Programming Interface

ATA Advanced Technology Attachment

B/W Bandwidth

BG/P IBM BlueGene/P

CPU Central Processing Unit

DFS Distributed File System

EMC2 EMC Corporation

FLOP/s Floating-Point Operations per Second

FUSE File system in User Space

GB 1024^3 Bytes

GFLOP/s 10^9 FLOP/s

GPFS IBM General Parallel File System

HDD Hard Disk Drive

HDF-5 Hierarchical Data Format, Version 5

HPC High-Performance Computing

I/O Input/Output

KB 1024 Bytes

LANL Los Alamos National Laboratory

LBNL Lawrence Berkeley National Laboratory

LDPLFS ld Loadable PLFS

LLNL Lawrence Livermore National Laboratory

MB 1024^2 Bytes

MDS Metadata Server

MDT Metadata Target

MFLOP/s 10^6 FLOP/s

MGS Management Server

MPI Message Passing Interface

NASA National Aeronautics and Space Administration

NFS Network File System

NPB NAS Parallel Benchmarks

OCF Open Compute Facility

OSS Object Storage Server

OST Object Storage Target

PFLOP/s 10^15 FLOP/s

PLFS Parallel Log-Structured File System

PMPI Profiling Message Passing Interface

POSIX Portable Operating System Interface

PVFS Parallel Virtual File System

QDR Quad Data Rate

RAID Redundant Array of Inexpensive/Independent Disks

RIOT RIOT I/O Toolkit

RPM Revolutions per Minute

SAS Serial Attached SCSI

SATA Serial-ATA

SSD Solid State Drive

STFC Science & Technology Facilities Council

TB 1024^4 Bytes

TFLOP/s 10^12 FLOP/s

UFS Unix File System

ZBR Zoned-Bit Recording

Contents

Abstract
Dedication
Acknowledgements
Declarations
Sponsorship and Grants
Abbreviations
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Thesis Contributions
1.3 Thesis Overview

2 Performance Analysis and Engineering
2.1 Parallel Computation
2.2 I/O in Parallel Computing
2.2.1 Issues in Parallel I/O
2.2.2 Parallel File Systems
2.2.3 Parallel I/O Middleware
2.3 Performance Engineering Methodologies
2.3.1 Benchmarking
2.3.2 System Monitoring and Profiling
2.3.3 Analytical Modelling
2.3.4 Simulation-based Modelling
2.4 Summary

3 Hardware and Software Overview
3.1 Hard Disk Drive
3.1.1 Disk Drive Mechanics
3.1.2 Data Layout
3.1.3 Disk Controller
3.1.4 Redundant Array of Independent Disks
3.2 File Systems
3.2.1 The Extended File System
3.2.2 The Sun Network File System
3.3 Distributed File Systems
3.3.1 The Lustre File System
3.3.2 IBM's General Parallel File System
3.3.3 The Parallel Log-structured File System
3.4 Computing Platforms
3.5 Input/Output Benchmarking Applications
3.6 Summary

4 I/O Tracing and Application Optimisation
4.1 The RIOT I/O Toolkit
4.1.1 Feasibility Study
4.2 File System Analysis
4.2.1 Distributed File Systems – Lustre and GPFS
4.3 Middleware Analysis and Optimisation
4.4 Summary

5 Analysis and Rapid Deployment of the Parallel Log-Structured File System
5.1 Analysis of PLFS
5.2 Rapid Deployment of PLFS
5.2.1 Performance Analysis
5.3 Summary

6 Parallel File System Performance Under Contention
6.1 Effective Use of Uncontended Parallel File Systems
6.2 Quantifying the Performance of Contended File Systems
6.3 Performance Comparison: Lustre vs. PLFS
6.4 Summary

7 Discussion and Conclusions
7.1 Limitations
7.2 Future Work
7.3 The Road to Exascale

Bibliography

Appendices
A RIOT Feasibility Study – Additional Results
B Numeric Data for Perceived and Effective Bandwidth
C FLASH-IO Analysis and Optimisation Data
D LDPLFS Source Code Examples
E LDPLFS Numeric Data
F Optimality Search Numeric Data
G PLFS Performance and Stripe Collision Data

List of Figures

2.1 An example of the parallelisation of a simple particle simulation between four processors.
2.2 The three basic approaches to I/O in parallel applications.
2.3 An example of two nodes (four ranks per node) writing to a file system with collective buffering off and on.
3.1 Basic internal structure of a hard disk drive.
3.2 Data layout on a disk with no zoning and three zones of increasing density.
3.3 Four examples of serpentine sector mapping.
3.4 An example of four requests fulfilled in-order without NCQ and out of order with NCQ.
3.5 Two common RAID data distribution schemes.
3.6 Structure of an ext2 inode block.
3.7 An example Lustre configuration with four OSSs and a fail-over MGS and MDS.
3.8 An example of a GPFS setup with four OSSs connected via a high performance switch to three targets and separate management and metadata targets.
3.9 An application's view of a file and the underlying PLFS container structure.
4.1 Tracing and analysis workflow using the RIOT toolkit.
4.2 Total runtime of RIOT overhead analysis benchmark for the functions MPI_File_write_all() and MPI_File_read_all(), on three platforms at varying core counts, with three different configurations: No RIOT tracing, POSIX RIOT tracing and complete RIOT tracing.
4.3 User-perceived bandwidth for applications on the three test systems.
4.4 Effective POSIX and MPI bandwidth for IOR through MPI-IO.
4.5 Effective POSIX and MPI bandwidth for IOR through HDF-5.
4.6 Effective POSIX and MPI bandwidth for FLASH-IO.
4.7 Effective POSIX and MPI bandwidth for BT Problem C, as measured by RIOT.
4.8 Percentage of time spent in POSIX functions for FLASH-IO on three platforms.
4.9 Composition of a single, collective MPI write operation on MPI ranks 0 and 1 of a two core run of FLASH-IO, called from the HDF-5 middleware library in its default configuration.
4.10 Composition of a single, collective MPI write operation on MPI ranks 0 and 1 of a two core run of FLASH-IO, called from the HDF-5 middleware library after data-sieving has been disabled.
4.11 Perceived bandwidth for the FLASH-IO benchmark in its original configuration (Original), with data-sieving disabled (No DS), and with collective buffering enabled and data-sieving disabled (CB and No DS) on Minerva and Sierra, as measured by RIOT.
5.1 Concurrent write() operations for BT class C on 256 cores on Minerva and Sierra.
5.2 The control flow of LDPLFS in an application's execution.
5.3 Benchmarked MPI-IO bandwidths on FUSE, the ad_plfs driver, LDPLFS and the standard ad_ufs driver (without PLFS).
5.4 BT benchmarked MPI-IO bandwidths using MPI-IO, as well as PLFS through ROMIO and LDPLFS.
5.5 FLASH-IO benchmarked MPI-IO bandwidths using MPI-IO, as well as PLFS through ROMIO and LDPLFS.
6.1 Write bandwidth achieved over 1,024 cores by varying just the stripe count and just the stripe size.
6.2 Write bandwidth achieved over 1,024 cores by varying both the stripe count and the stripe size.
6.3 The performance per-task of the lscratchc file system under contention, with the ideal upper and lower bounds.
6.4 Performance of each of 4 tasks over 5 repetitions where all tasks are contending the file system.
6.5 Graphical representation of the data in Table 6.3, showing optimal performance at 160 stripes per file, but very minor performance degradation at just 32 stripes per file.
6.6 Write bandwidth achieved for IOR through an optimised Lustre configuration and through the PLFS MPI-IO driver.
6.7 The number of OST collisions for IOR running through PLFS with 512 cores.
7.1 The memory hierarchy with potentially two additional layers for improved I/O performance on supercomputers.
A.1 Total runtime of RIOT overhead analysis software for the functions MPI_File_write() and MPI_File_read(), on three platforms, with three different configurations: No RIOT tracing, POSIX RIOT tracing and complete RIOT tracing.
A.2 Total runtime of RIOT overhead analysis software for the functions MPI_File_write_at_all() and MPI_File_read_at_all(), on three platforms, with three different configurations: No RIOT tracing, POSIX RIOT tracing and complete RIOT tracing.

List of Tables

3.1 Hardware specification of the Minerva, Sierra and Cab supercomputers.
3.2 Configuration for the GPFS installation connected to Minerva.
3.3 Configuration for the lscratchc Lustre File System installed at LLNL in 2011 (for the experiments in Chapter 5) and 2013 (for the experiments in Chapter 6).
3.4 Hardware configuration for the IBM BlueGene/P system at the Daresbury Laboratory.
3.5 Configuration for the GPFS installation connected to Daresbury Laboratory's BlueGene/P, where data is first written to Fibre Channel connected disks before being staged to slower SATA disks.
5.1 Perceived and Effective Bandwidth (MB/s) for BT class C through MPI-IO and PLFS, as well as the speed-up generated by PLFS.
5.2 Time in seconds for UNIX commands to complete using PLFS through LDPLFS, and without PLFS.
6.1 IOR configuration options for experiments.
6.2 The average number of OSTs in use and their average load based on the number of concurrent I/O intensive jobs.
6.3 Average and total bandwidth achieved across four tasks for a varying stripe size request, along with values for the average number of tasks competing for 1, 2, 3 and 4 OSTs respectively.
6.4 OST usage and average load for the Stampede I/O setup described by Behzad et al. [10].
6.5 Stripe collision statistics for PLFS backend directory running with 4,096 cores.
A.1 Incidence of MPI_File_* function calls in 9 application suites, benchmarks and I/O libraries.
A.2 Average time (s) to perform one hundred 4 MB operations: without RIOT, with only POSIX tracing and with complete MPI and POSIX RIOT tracing. The change in time is shown between full RIOT tracing and no RIOT tracing.
B.1 Perceived MPI, effective MPI and effective POSIX bandwidths for IOR through HDF-5 on Minerva, Sierra and BG/P.
B.2 Perceived MPI, effective MPI and effective POSIX bandwidths for IOR through MPI-IO on Minerva, Sierra and BG/P.
B.3 Perceived MPI, effective MPI and effective POSIX bandwidths for FLASH-IO through HDF-5 on Minerva, Sierra and BG/P.
B.4 Perceived MPI, effective MPI and effective POSIX bandwidths for BT class C on Minerva, Sierra and BG/P.
B.5 Perceived MPI, effective MPI and effective POSIX bandwidths for BT class D on Minerva and Sierra.
C.1 MPI and POSIX function statistics for FLASH-IO on Minerva.
C.2 MPI and POSIX function statistics for FLASH-IO on Sierra, 12 to 96 cores.
C.3 MPI and POSIX function statistics for FLASH-IO on Sierra, 192 to 1536 cores.
C.4 MPI and POSIX function statistics for FLASH-IO on BG/P.
C.5 FLASH-IO performance on Minerva and Sierra with collective buffering and data sieving optimisation options.
E.1 Read and write performance of PLFS through FUSE, the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Minerva, using 1 core per node.
E.2 Read and write performance of PLFS through FUSE, the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Minerva, using 2 cores per node.
E.3 Read and write performance of PLFS through FUSE, the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Minerva, using 4 cores per node.
E.4 Write performance in BT class C for PLFS through the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Sierra.
E.5 Write performance in BT class D for PLFS through the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Sierra.
E.6 Write performance in FLASH-IO for PLFS through the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Sierra.
F.1 Numerical data for Figure 6.2, displaying bandwidth achieved by IOR on 1,024 cores, while varying Lustre stripe size and stripe count.
F.2 Numerical data for Figure 6.3, displaying bandwidth per task under contention, along with the idealised values.
G.1 Stripe collision statistics for PLFS backend directory running with 16 cores.
G.2 Stripe collision statistics for PLFS backend directory running with 32 cores.
G.3 Stripe collision statistics for PLFS backend directory running with 64 cores.
G.4 Stripe collision statistics for PLFS backend directory running with 128 cores.
G.5 Stripe collision statistics for PLFS backend directory running with 256 cores.
G.6 Stripe collision statistics for PLFS backend directory running with 512 cores.
G.7 Stripe collision statistics for PLFS backend directory running with 1,024 cores.
G.8 Stripe collision statistics for PLFS backend directory running with 2,048 cores.
G.9 Numeric data for Figure 6.6, showing the performance of IOR through Lustre and PLFS.

CHAPTER 1 Introduction

Since the birth of the modern computer, in the early 20th Century, there has been a dramatic shift in how science is performed; where previously countless experiments were performed with varying results and levels of accuracy, now simulations are performed ahead of time, reducing – and in some cases eliminating – the number of experiments that need to be performed. To handle the burden of simulating and predicting the outcome of these experiments, computers have become evermore complex and powerful; the most powerful supercomputer at the time of writing can perform 33 quadrillion (33 × 10^15) floating point operations every second [87], and there is a hope that within the next decade this will be increased to 1 quintillion (10^18) operations per second [25, 40].

Achieving this level of performance relies on an enormous amount of parallelism – the world's fastest supercomputer in November 2013, Tianhe-2, consists of 3,120,000 distinct processing elements operating in parallel [87]. The sheer size of the problems being calculated on machines such as this means that loading data from disk often becomes a burden at scale. Furthermore, the number of components in use in these machines has a serious effect on their reliability, with most production supercomputers experiencing frequent node failures [58, 116, 149]. To combat this, resilience mechanisms are required that often involve writing large amounts of data to persistent storage, such that in the event of a failure, the application can be restarted from a checkpoint, avoiding the need to relaunch the computation from the very beginning.

Unfortunately, the persistent storage available on large parallel systems has not kept pace with the development of microprocessors; checkpointing is becoming a bottleneck in many science applications when executed at extreme scale.


Fiala et al. show that at 100,000 nodes, only 35% of runtime is spent performing computation, with the remaining time spent checkpointing and recovering from failures [46]. As the era of exascale computing approaches, this performance gap is widening further still.

1.1 Motivation

The increasing divergence between compute and I/O performance is making analysing and improving the state of current generation storage systems of utmost importance. Improvements to I/O systems will not only benefit current-day applications but will also help inform the direction that storage must take if exascale computing is to become practically useful. This thesis demonstrates methods for analysing the performance of I/O intensive applications and shows that by making small changes to how parallel libraries are currently used, performance can be improved; furthermore, with the correct combination of software libraries and configuration options, performance can be increased by an order of magnitude on present day systems.

This thesis also contains an investigation into one potential solution to poor parallel file system performance. The parallel log-structured file system (PLFS) is reported to be providing huge improvements in write performance [11, 103] and this thesis investigates these claims; specifically, it is shown that while many of the techniques used in PLFS may prove important on future systems, on many current day systems, PLFS induces a performance penalty at scale.

1.2 Thesis Contributions

The research presented in this thesis makes the following contributions:

• The development and deployment of an I/O tracing library (RIOT) is described in detail. RIOT is a dynamically loadable library that intercepts the function calls made by MPI-based applications and records them for later analysis. This is demonstrated using industry-standard benchmarks to show how their performance differs between three distinct supercomputers with a variety of I/O backplanes. RIOT allows application developers to visualise how data is written to the file system and identify potential opportunities for optimisation. In particular, through the analysis of an HDF-5 based code, it is shown that by changing some of the low-level configuration options in MPI-IO, a performance improvement of at least 2× can be achieved;

• Using RIOT, the performance of PLFS is analysed on two commodity clusters. The analysis presented in this thesis not only explains why PLFS produces large speed-ups for general users on large file systems but also suggests that there exists a tipping point where PLFS may harm parallel I/O performance beyond a certain number of cores. The burden of installing and using PLFS is also addressed in this thesis, where a simpler, more convenient method of using PLFS is developed. This pre-loadable library, known as LDPLFS, allows application developers and end users to assess the applicability of PLFS to their codes before investing further time and effort into using PLFS natively;

• Building upon previous work [10, 76, 148], the performance of the Lustre-optimised MPI-IO driver is analysed. On the systems operated by the Lawrence Livermore National Laboratory (LLNL), used throughout this thesis, a Lustre-optimised driver (ad_lustre) is not available by default, and this is also true of other studies, whereby a potential optimisation is compared against a Lustre file system using the unoptimised UNIX file system MPI-IO driver (ad_ufs) [11]. In this thesis, a customised MPI library is built in order to measure the impact of the specialised driver – demonstrating a potential 49× boost in performance. This thesis extends previous works, demonstrating that although the optimal performance is found by using the maximum amount of parallelism available, this may not be optimal for a system with many I/O intensive applications competing for a shared resource. A number of metrics are presented to aid procurement decisions and explain potential performance deficiencies that may occur;

• The metrics presented to explain the effect of job contention on parallel file systems are adapted and used to explain the performance defects in PLFS at scale, demonstrating that at 4,096 cores each storage target is being contended by 17 tasks in the average case, with some targets experiencing as many as 35 collisions. The equations presented in this thesis will allow scientists to make decisions about whether PLFS will benefit a given application if the scale at which it will be run and the number of file system targets available is known beforehand. At large scale, Lustre with an optimal set of configuration options outperforms PLFS by 5.5×, and induces much less contention on the whole file system, thus benefitting the shared file system as a whole.

1.3 Thesis Overview

The remainder of the thesis is structured as follows:

Chapter 2 contains an overview of current work in the field of high performance computing. Specifically, it describes work related to improving I/O and file system performance, with a focus on the methods that can be used to increase the performance of data intensive applications. This chapter also contains a literature review of current work in the fields of performance benchmarking, system profiling and performance modelling, both analytical and simulation-based.

Chapter 3 presents a brief explanation of the hardware and software environments used in this thesis. The chapter begins with a brief introduction to how spinning-disk-based file systems function, from the operation of the single disks themselves, up to the distributed file systems that bring all the components together. Chapter 3 concludes with an overview of the applications and systems used throughout this thesis.

Chapter 4 describes the development and use of RIOT, an I/O tracing toolkit designed to analyse the usage patterns in parallel MPI-based applications. The overheads associated with using RIOT are studied, showing that the performance impact is negligible, motivating its use in this thesis. RIOT is used to assess the performance of both IBM's General Parallel File System (GPFS) and the Lustre file system which are commonplace on leading contemporary supercomputers. GPFS on an IBM BlueGene/P is shown to significantly outperform GPFS and Lustre on commodity clusters due to the use of an optimised MPI-IO driver, specialised aggregator nodes and a tiered storage architecture. The performance of applications dependent on the HDF-5 data formatting library is shown to be suboptimal on two of the clusters used throughout this thesis and, through analysis with RIOT, its performance is improved using a more optimal set of MPI hints.

Chapter 5 contains an analysis of PLFS, a virtual file system developed at the Los Alamos National Laboratory (LANL), showing that at mid-scale PLFS achieves a significant performance improvement over the system's “stock” MPI library. The reasons for this performance improvement are analysed using RIOT, showing that the use of multiple file streams increases the parallelism available to applications. Due to the burden of installing PLFS on shared resources, a rapid deployment option is developed called LDPLFS – a preloadable library that can be used not only with MPI-based applications but also with the standard UNIX tools, where the PLFS FUSE mount is not available. LDPLFS is deployed on two supercomputers, showing that its performance matches that of PLFS through the MPI-IO driver.

Chapter 6 analyses previous works in improving performance on Lustre file systems [9, 10, 76, 148] and expands upon them, showing that although the optimal configuration produces a 49× performance increase in isolation, the performance increase is nearer 10–12× on a system shared with multiple I/O intensive applications. Further, it is shown that using fewer resources has a negligible impact on performance, while freeing up a significant amount of resources. In Chapter 5, performance degradation was observed in PLFS at scale; this chapter analyses why this slowdown occurs. Finally, this chapter presents a number of metrics for assessing the impact of job contention on parallel file systems, and the use of PLFS. These equations could be used to inform purchasing and configuration decisions.

Chapter 7 concludes the thesis, and discusses the implications of this research on future I/O systems. The limitations of the research contained therein are discussed and directions for ongoing and future work are presented.

CHAPTER 2 Performance Analysis and Engineering

Improving computational performance has been a long-standing goal of many scientists and mathematicians for thousands of years, even before the advent of the modern computer. Devising more efficient algorithms to solve computational problems can reduce the time taken to reach a solution by many orders of magnitude, meaning calculations relating to natural phenomena can be performed in seconds rather than weeks or months.

The earliest known examples of algorithm optimisation come from Babylonian mathematics [72]. Tablets dating back to around 3000 B.C.E. show that the Babylonians had algorithms that today read very much like early computer programs. These algorithms allowed the Babylonians to efficiently and accurately calculate the results of divisions and square roots, amongst other things.

A more modern example of algorithm optimisation was used during the Manhattan Project at the Los Alamos National Laboratory (LANL). Richard Feynman devised a method for distributing the calculations for the energy released by different designs of the implosion bomb [64]. Through Feynman's use of pipelining, his team of human computers were able to produce the results to 9 calculations in only 3 months, where 3 calculations had previously taken 9 months to produce – representing a 9× speed-up. Distributed computation in this manner is one form of what is now commonly called parallel computation.

This chapter summarises: (i) some of the basic concepts and terminology used in parallel computation and high performance computing literature; (ii) some of the principles used to analyse, reason about, and predict computing performance; and finally, (iii) recent advances in performance engineering, with a particular focus on I/O and parallel storage systems.


2.1 Parallel Computation

The first general-purpose computer was the Electronic Numerical Integrator and Computer (ENIAC), completed in 1945. The machine was capable of performing between 300 and 500 floating point operations per second (FLOP/s). Due to the prevalence and importance of floating point operations in modern day science applications, the FLOP rate is the standard way in which modern supercomputer performance is assessed.

The era of the modern supercomputer began in the 1960s with the release of the CDC 6600. Designed by Seymour Cray for the Control Data Corporation (CDC), the CDC 6600 was the first mainframe computer to separate many of the components, typically found in CPUs of the era, into separate processing units. This resulted in the CPU being able to use a reduced instruction set, simplifying its design, and also allowing operations usually performed by the CPU (such as memory accesses and I/O) to instead be handled by dedicated peripheral processors in parallel. Consequently, the CDC 6600 was approximately three times faster than the previous fastest machine, the IBM 7030, and it held the record for the world's fastest computer from 1964 to 1969, performing approximately 1 million floating point operations per second (1 MFLOP/s).

In the 50 years since the CDC 6600, supercomputers have become increasingly complex. The use of advanced features such as instruction pipelining, branch prediction and SIMD (single-instruction, multiple-data) instruction sets has led to modern CPUs achieving up to 10 GFLOP/s of computational power per core. A typical CPU now consists of multiple cores (as many as 16 cores on some AMD Opteron CPUs, and many more on some GPUs and specialised processors) and a single CPU can provide as much as four orders of magnitude more performance than the CDC 6600's processor.

As a result of the ever-increasing power of supercomputers, a broad range of applications are now executed on them. Some algorithms are inherently serial, and thus the increase in single core performance has benefitted them. The growing size of today's supercomputers also means that many more of these types of application can be executed simultaneously. The increasing core density in modern CPUs is benefitting applications that use shared memory, such as those written using OpenMP directives [31]. However, this thesis focuses largely on applications using the data-parallel paradigm, where an application divides its data across many processors, all working towards a common goal. These applications may use a partitioned global address space (PGAS) model, where the memory is logically partitioned and shared between cooperating processes, or use a message passing model, where messages are explicitly exchanged between cooperating processes.

Figure 2.1: An example of the parallelisation of a simple particle simulation between four processors (distribute, compute, reconstruct).

This thesis focuses on applications using message passing, as they (i) represent a large proportion of the work performed on modern day supercomputers; (ii) make the most use of parallel file systems; and (iii) will benefit most from any optimisations to parallel I/O.

Figure 2.1 represents the division of a particle simulation across four processors. Typically a problem space is divided evenly between cooperating processors, the local problems are solved, and then a communication phase takes place to exchange border information. The computation of the next time step can then commence. After a defined number of time steps, the problem space can be recombined and the result stored.
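To make this pattern concrete, the sketch below illustrates the distribute-compute-exchange cycle of Figure 2.1 using MPI. It is a minimal, hypothetical example rather than code from any application studied in this thesis; the domain size, neighbour layout and number of time steps are assumptions chosen for brevity.

#include <mpi.h>
#include <stdlib.h>

/* A minimal sketch of the distribute-compute-exchange cycle in Figure 2.1.
 * A 1D domain is split evenly between ranks; each time step, the boundary
 * (halo) cells are exchanged with the left and right neighbours. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local_n = 1024;                             /* cells owned by this rank (assumed) */
    double *cell = calloc(local_n + 2, sizeof(double));   /* +2 halo cells */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; step++) {
        /* Exchange border information with neighbouring processors. */
        MPI_Sendrecv(&cell[1], 1, MPI_DOUBLE, left, 0,
                     &cell[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cell[local_n], 1, MPI_DOUBLE, right, 1,
                     &cell[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local computation on the owned cells (placeholder update). */
        for (int i = 1; i <= local_n; i++)
            cell[i] = 0.5 * (cell[i - 1] + cell[i + 1]);
    }

    free(cell);
    MPI_Finalize();
    return 0;
}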


Because of the significant decrease in runtime when applications are parallelised in this way, supercomputers are now used to investigate a wide variety of problems in both academia and industry: high performance computing is used in domains such as cancer research, weapons design and automotive aerodynamics, as well as in investigating astrophysical phenomena such as star formation.

2.2 I/O in Parallel Computing

As supercomputers have grown in compute power, so too have they grown in complexity, size and component count. With the push towards exascale computing (estimated by 2022 at the time of writing [33]), the explosion in machine size will result in an increase in component failures. To calculate the speed of the Sequoia supercomputer, the computational benchmark (LINPACK [81]) required multiple execution attempts due to the difficulty of keeping every compute node running for the required 23-hour computation, and this problem is expected to get worse at exascale.

To combat reliability issues, long-running scientific simulations now use checkpointing to reduce the impact of a node failure. Periodically, during a time-consuming calculation, the system's state is written out to persistent storage so that in the event of a crash, the application can be restarted and computation can be resumed with a minimal loss of data. Furthermore, frequent checkpointing facilitates another important scientific endeavour – visualisation. With a stored state recorded at set points in computation, scientists can load these checkpoints into a visualisation tool and observe the state of a simulation at various time steps.

2.2.1 Issues in Parallel I/O

Writing checkpoints or visualisation data from a serial application may be relatively trivial, but for a parallel application, coordinating the writing or reading process can be difficult. This has resulted in a number of solutions with various advantages and disadvantages. Figure 2.2 shows three approaches to outputting data in parallel, where (a) all ranks write their own data file; (b) all ranks send their data to one “writer” process; and finally, (c) all ranks write their data to the same file in parallel¹.

Figure 2.2: The three basic approaches to I/O in parallel applications: (a) N-to-N, (b) N-to-root, (c) N-to-1.

¹Figure 2.2(c) represents a simplified case where each rank is only writing out values from a single shared array. More complicated write patterns (such as data striding) are commonplace.

While the fastest performance is usually achieved using the approach shown in Figure 2.2(a), this is also the most difficult to manage. If the application is always executed in the same fashion (using exactly the same number of processes) this is the most efficient approach. However, should the problem be run on a differing number of cores (e.g. initially executed on N cores, writing N files, before reloading data from N files, but on M cores), reloading the data becomes computationally expensive and complicated as each process must read sections from multiple different files.

Figure 2.2(b) shows the case where the root process becomes a dedicated writer, writing all of the data to a single file, and redistributing the data in the event of a problem reload. This is the easiest approach to manage but is also the slowest: while the computation takes advantage of the increased parallelism, the I/O becomes a serialisation point.

The approach taken by most simulation applications is demonstrated in Figure 2.2(c). This approach strikes a good balance between speed and manageability. This is also the approach most parallel file systems are designed for, and many communication libraries provide convenient APIs for handling data in this manner.
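As an illustration of the N-to-1 approach, the following sketch has every rank write its block of a shared array to a single file at a rank-dependent offset using MPI-IO. It is a hypothetical example (the file name and data sizes are assumptions), not code taken from the benchmarks analysed later.

#include <mpi.h>
#include <stdlib.h>

/* A minimal sketch of the N-to-1 approach in Figure 2.2(c): every rank
 * writes its portion of a shared array to one file at a rank-based offset. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int local_n = 1 << 20;                  /* doubles per rank (assumed) */
    double *data = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++) data[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes to its own region of the shared file; the collective
     * call allows the MPI library to aggregate and optimise the requests. */
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, local_n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}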

2.2.2 Parallel File Systems

A hard disk drive (HDD) is essentially a serial device; one piece of data can be sent over the connector at any given time. The inner workings of a single HDD and how this has been improved over time will be discussed in Chapter 3, but when multiple parallel threads or simultaneously running applications are using a single storage system, the total performance of the disk will decrease due to the overhead associated with resource contention. On large parallel supercomputers, not only is a single HDD not nearly large enough to handle the required data workloads, but the performance would also decrease to the point of the HDD being practically unusable. To produce a greater quality of service (QoS) across a shared platform, large I/O installations are necessary, using thousands of disks connected in parallel using technologies such as Redundant Array of Independent Disks (RAID) [97] and distributed file systems (DFS).

Distributed File Systems

The I/O backplane of high-performance clusters is generally provided by a DFS. The two most widely used file systems today are IBM's General Parallel File System (GPFS) [115] and the Lustre file system [117], both of which will be discussed in more detail in Chapter 3.

Most DFSs in use today provide parallelism by offering simultaneous access to a large number of file servers within a common namespace – files are divided into blocks and distributed across multiple storage backends. An application running in parallel may then access different parts of a given file without the interactions colliding with each other, as each block may be stored on a different server.

However, the use of a common namespace complicates DFSs – in the Lustre file system, a dedicated server is used to maintain the directory tree and properties of each file, while in GPFS the metadata is distributed across the file servers, complicating some operations but potentially providing higher performance metadata queries.

One precursor to both Lustre and GPFS was the Parallel Virtual File System (PVFS) developed primarily at the Argonne National Laboratory (ANL) [22]. PVFS used the same object-based design [85] that is now common in almost all DFSs and, like Lustre, used a single metadata server to manage the directory tree. However, over time PVFS (and its successor PVFS2) has adopted distributed metadata to decrease the burden on a single server. Likewise, the Ceph file system strikes a balance between Lustre and GPFS by distributing metadata across multiple servers. In Ceph, directory subtrees are mapped to particular servers using a hashing function, though larger directories are mapped across many servers to provide higher performance metadata operations [137].

Hedges et al. suggest that GPFS outperforms Lustre for almost all tasks, except some metadata tasks, where Lustre uses caching to improve performance while GPFS performs a disk flush and read [63]. Furthermore, Logan et al. suggest smaller stripe sizes on a Lustre system lead to better performance [79].

The findings in this thesis and other literature demonstrate that many of the differences in performance can be explained by differing hardware and software configurations [9, 10, 142]. This thesis also suggests that larger stripe sizes may be beneficial on some Lustre file systems at scale.

Although most DFSs provide a POSIX-compliant interface (allowing standard UNIX tools like cp, ls, etc. to be used), the best performance is often achieved using their own APIs.

Virtual File Systems

In addition to DFSs, a variety of virtual file systems have been developed to improve performance. One approach shown to produce large increases in write bandwidth is the use of so-called log-structured file systems [104]. When performing write operations, the data is written sequentially to persistent storage regardless of intended file offsets. Writing in this manner reduces the number of expensive seek operations required on I/O systems backed by spinning disks. In order to maintain file coherence, an index is built alongside the data so that it can be reordered when being read. In most cases this offers a large increase in write performance, which benefits checkpointing, but does so at the expense of poor read performance.
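The core idea can be sketched as follows. This is an illustrative fragment only – the structures and helper function are hypothetical and not taken from any particular log-structured file system – but it shows how every write is appended to the end of a log while an index records where each logical offset actually landed.

#include <stdio.h>

/* Illustrative sketch of a log-structured write: data is always appended to
 * the end of the log, and an index entry maps the intended (logical) file
 * offset to the physical location, so reads can reorder the data later. */
typedef struct {
    long logical_offset;   /* offset the application asked to write at */
    long physical_offset;  /* where the data actually sits in the log  */
    long length;
} index_entry_t;

static index_entry_t index_log[1024];
static int  index_count = 0;
static long log_end     = 0;   /* current end of the data log */

/* Hypothetical helper: append a write to the log and record it in the index. */
void log_write(FILE *data_log, long logical_offset,
               const void *buf, long length)
{
    fseek(data_log, log_end, SEEK_SET);        /* never seek to logical_offset...   */
    fwrite(buf, 1, (size_t)length, data_log);  /* ...data is written sequentially   */

    index_log[index_count++] = (index_entry_t){
        .logical_offset  = logical_offset,
        .physical_offset = log_end,
        .length          = length
    };
    log_end += length;
}

Reading the data back requires consulting the index to map each logical region to its physical location in the log, which is why read performance typically suffers.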

In the Zest implementation of a log-structured file system, the data is written in this manner (via the fastest available path) to a temporary staging area that has no read-back capability [94]. This serves as a transition layer, caching data that is later copied to a fully featured file system at a non-critical time.

As well as writing sequentially to the disk, file partitioning has also been shown to produce significant I/O improvements. Wang et al. use an I/O profiling tool to guide the transparent partitioning of files written and read by a set of benchmarks [135, 136]. Through segmenting the output into several files spread across multiple disks, the number of available file streams is increased, reducing file contention on the storage backplane. Furthermore, file locking incurs a much smaller overhead as each process has access to its own unique file.

The parallel log-structured file system (PLFS) from LANL combines file partitioning and a log-structure to improve I/O bandwidth [11]. In an approach that is transparent to an application, a file access from N processes to 1 file is transformed into an access of N processes to N files. The authors demonstrate speed-ups of between 10× and 100× for write performance. Due to the increased number of file streams, they also report an increased read bandwidth when the data is read back on the same number of nodes used to write the file [103].

With PLFS representing a single file as a directory of files, where each MPI rank creates 2 files (an index file and a data file), there can be an enormous load created on the underlying file system's metadata server. Jun He et al. demonstrate this, suggesting methods for reducing this burden and thus accelerating the performance of PLFS further [62].

While log-structured file systems usually produce a decrease in read performance, the use of file partitioning in PLFS improves read performance to a much greater extent on large I/O systems [103]. PLFS is described in more depth in Chapter 3 and its performance is analysed in Chapter 5.

2.2.3 Parallel I/O Middleware

Writing data in parallel can be a complicated process for programmers; ensuring the output doesn't suffer from race conditions may require explicit offset calculations or file locking semantics. To simplify this process, there are a range of parallel libraries that abstract this complex behaviour away from the application.

Just as the Message Passing Interface (MPI) has become the de facto standard library used to abstract data communication from parallel applications, so too has MPI-IO become the preferred method for abstracting parallel I/O [86]. The ROMIO implementation [127] – used by OpenMPI [49], MPICH2 [56] and various other vendor-based MPI solutions [2, 15] – offers a series of potential optimisations, closing the performance gap between N-to-N and N-to-1 file operations.

Within MPI-IO itself there are two features applicable to improving the performance of all parallel file systems. Firstly, collective buffering (demonstrated in Figure 2.3) has been shown to yield a significant speed-up, initially on applications writing relatively small amounts of data [92, 126] and more recently on densely packed nodes [142]. These improvements come in the first instance due to larger “buffered” writes that make better use of the available bandwidth and in the second instance due to the aggregation of data to fewer ranks per node, reducing on-node file system contention.

Secondly, data-sieving has been shown to be extremely beneficial when using file views to manage interleaved writes within MPI-IO [126]. In order to achieve better utilisation of the file system, a large block of data is read into memory before small changes are made at specific offsets. The data is then written back to the disk in a single block. This decreases the number of seek and write operations that need to be performed at the expense of locking a larger portion of the file and therefore may benefit sparse writes, where small portions of data may need to be updated [26].

Figure 2.3: An example of two nodes (four ranks per node) writing to a file system with collective buffering off and on.
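Both features are typically controlled through MPI-IO hints supplied when a file is opened. The fragment below is a minimal sketch of how the ROMIO hints for collective buffering and data sieving might be set; the values shown are illustrative assumptions rather than recommendations.

#include <mpi.h>

/* Sketch: open a shared file with ROMIO hints that enable collective
 * buffering and disable data sieving for writes (values are illustrative). */
MPI_File open_with_hints(const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* collective buffering       */
    MPI_Info_set(info, "cb_nodes", "4");              /* aggregator count (assumed) */
    MPI_Info_set(info, "romio_ds_write", "disable");  /* disable data sieving       */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);   /* the hints are copied into the file handle */
    return fh;
}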

The MPI-IO specification outlines ADIO, an abstract interface for providing custom file system drivers to improve the performance of parallel file systems [125]. On the IBM BlueGene/L (and subsequent generations), a custom driver is provided for GPFS (ad_bgl) [2]. As these drivers are aware of the file system's APIs, they do not rely on unoptimised POSIX-compliant alternatives. As is demonstrated later in this thesis, performance can be boosted significantly through using file system specific drivers.

For the Lustre file system, the ad_lustre driver is provided in the standard ROMIO distribution [35, 36]. Using the driver allows an application developer to specify additional options to customise the file layout at runtime, potentially increasing the parallelism available [9, 10, 100, 101].
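For example, when a file is first created through the ad_lustre driver, the stripe count and stripe size can be requested with the reserved striping hints; the sketch below is illustrative, and the values shown are assumptions rather than tuned settings.

#include <mpi.h>

/* Sketch: requesting a Lustre striping layout through MPI-IO hints when the
 * file is first created (called collectively by all ranks). */
void create_striped_file(const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");     /* stripe count (assumed)      */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripe size (assumed)  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
}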

In addition to drivers within the MPI-IO framework, there are middleware layers that exist between the applications and parallel communication libraries, designed to standardise the I/O in scientific applications. NetCDF [106] and Parallel NetCDF [75] exist for this purpose, with Parallel NetCDF making use of the MPI-IO library to provide parallel access and improved performance.

More commonly, the hierarchical data format (HDF-5) is used to write data to disk for checkpointing or analysis purposes [73]. The library can be compiled with MPI support to allow parallel access to a common data file; in this way the library can make use of optimisations in MPI to increase performance [66, 146]. Additionally, PLFS has been demonstrated to improve the performance of HDF-5 based applications by Mehta et al., dividing a single HDF-5 output file into a data layout that is more optimal for the underlying file system [84].

In this thesis, two applications that make use of HDF-5 are analysed, demonstrating the shortcomings that may exist in the library's default configuration, while presenting opportunities for optimising performance.
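When HDF-5 is built with parallel support, MPI-IO access is enabled through a file-access property list, and any MPI-IO hints can be passed through the same call. The fragment below is a hypothetical sketch of this set-up, not code from the applications analysed in later chapters.

#include <hdf5.h>
#include <mpi.h>

/* Sketch: creating an HDF-5 file for parallel access on top of MPI-IO.
 * Any MPI-IO hints (collective buffering, striping, etc.) can be supplied
 * through the MPI_Info object passed to H5Pset_fapl_mpio(). */
hid_t create_parallel_hdf5(const char *path, MPI_Info hints)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, hints);   /* route I/O through MPI-IO */

    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;                                     /* caller closes with H5Fclose() */
}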

2.3 Performance Engineering Methodologies

In high performance computing parlance, performance engineering is the collection of processes by which an application's or computing system's performance is measured, predicted and optimised. With supercomputers typically costing anywhere between £1.4 million (approximate cost of Minerva, the University of Warwick operated supercomputer used throughout this thesis) and £750 million (approximate cost of K-computer, the 10 PFLOP/s supercomputer installed at the RIKEN Advanced Institute for Computational Science in Kobe, Japan), understanding the potential performance and utility of these machines ahead of procurement is becoming significantly more important [60]. In addition to making sense of the performance of a parallel machine, it is also important to understand the performance of the applications that are expected to run on these systems.

Further to system procurement, performance engineers also require tools to assess the current performance of their applications in order to understand why they perform as they do. With this data, optimisations can be made, alongside predictions about how hardware or software changes may affect the performance of their applications [32, 98].

2.3.1 Benchmarking

The most common way to assess a new computing architecture or parallel file system is through the use of benchmarking. There exist multiple benchmarks specifically designed for the assessment of supercomputers and many of these benchmark suites form the basis of various performance rankings [6, 28, 39, 81].

For example, the LINPACK benchmark is a linear solver code that produces a performance number (in FLOP/s) that is used to rank the most commonly cited list of the fastest supercomputers, namely the TOP500 list [87].

For the purpose of procurement, running LINPACK on a small test machine and extrapolating the performance forward can produce an approximation of the parallel performance of a much larger, similarly architected machine (since LINPACK scales almost linearly [39]). Additionally, benchmarks such as STREAM [83] and SKaMPI [68] exist to assess the performance of memory and communication subsystems.

The aforementioned benchmarks all target particular facets of parallel machines that are important to performing computation. For data-driven workloads, there are a number of benchmarks specifically designed to assess the performance of the parallel file systems attached to these systems. Notable tools in this area include the IOR [121] and IOBench [139] parallel benchmarking applications. While these tools provide a good indication of potential performance, much like LINPACK, they are rarely indicative of the true behaviour of production codes. For this reason, a number of mini-application benchmarks have been created that extract file read/write behaviour from larger codes to ensure a more accurate representation of an application's I/O operations. Examples include the Block Tridiagonal (BT) solver application from the NAS Parallel Benchmark (NPB) Suite [7, 8] and the FLASH-IO [47, 109, 155] benchmark from the University of Chicago – both of which are employed later in this thesis.

2.3.2 System Monitoring and Profiling

While benchmarks may provide a measure of file system performance, their use in diagnosing problem areas or identifying optimisation opportunities within large codes is limited. For this activity, monitoring or profiling tools are required to either sample the system’s state or record the system calls of parallel codes in real-time.

The gprof tool is often used in code optimisation to identify particular functions that consume a large amount of an application’s runtime [55]. For parallel applications this task is complicated, as the program is spread across a wide number of processes; a parallel profiler is therefore required for these applications. For Intel architectures, the VTune application can inform an engineer how the CPU is being used, how the cache is being used and much more [69]. Oracle Solaris Studio (formerly Sun Studio) consists of high performance compilers in addition to a collection of performance analysis tools [27].

For parallel applications, there are a range of tools specifically designed to monitor and record data relating to inter-process communications. Notable tools in this area include the Integrated Performance Monitoring (IPM) suite [48] from the Lawrence Berkeley National Laboratory (LBNL), Vampir [90] from TU Dresden, Scalasca [50] from the Jülich Supercomputing Centre, Tau [122] from the University of Oregon and the MPI profiling interface (PMPI) [38].

Each of these profiling tools records interactions with the MPI library, and thus produces large amounts of data useful for identifying communication patterns and performance bottlenecks in parallel applications. Further, both Scalasca and Tau can generate additional data relating to performance using function call-stack traversal and hardware performance counters.

For monitoring I/O performance, the tools iotop and iostat both monitor a single workstation and record a wide range of statistics ranging from the I/O busy time to the CPU utilisation [74]. iotop is able to provide statistics relevant to a particular application, but this data is not specific to a particular file system mount point. iostat can provide more detail that can be targeted to a particular file system, but does not provide application-specific information. These two tools are targeted at single workstations, but there are many distributed alternatives, including Collectl [118] and Ganglia [82].

Collectl and Ganglia both operate using a daemon process running on each compute node and therefore require some administrative privileges to install and operate correctly. Data about the system’s state is sampled and stored in a database; the frequency of sampling therefore dictates the overhead incurred on each node. The I/O statistics generated by the tools focus only on low-level POSIX system calls and the load on the I/O backend, and therefore the data will include system calls made by other running services and applications. For more specific information regarding the I/O performance of parallel science codes, many large multi-science HPC laboratories (e.g. ANL, LBNL) have developed alternative tools.

Using function interpositioning, where a library is transparently inserted into the library stack to overload common functions, tools such as Darshan [21] and IPM [48] intercept the POSIX and MPI file operations. Darshan has been designed to record file accesses over a prolonged period of time, ensuring that each interaction with the file system is captured during the course of a mixed workload. The aim of the Darshan project is to monitor I/O activity for a substantial amount of time on a production BG/P machine in order to guide developers and administrators in tuning the I/O backplanes used by large machines [21].

Similarly, IPM uses an interposition layer to catch all calls between the application and the file system [48]. This trace data is then analysed in order to highlight any performance deficiencies that exist in the application or middleware. Based on this analysis, the authors are able to optimise two applications, achieving a 4× improvement in I/O performance.


ScalaTrace [93] and its I/O-based equivalent ScalaIOTrace [134] have similarly been used to record and analyse the communication and I/O behaviours of science codes. Using the MPI traces collected by ScalaTrace, the authors have demonstrated the ability to auto-generate skeleton applications in order to obfuscate potentially sensitive code for the purpose of benchmarking differing communication strategies and interconnects [34]. Their success in producing applications representative of the communication behaviours of science codes suggests that a similar methodology could be used for building I/O benchmarks.

2.3.3 Analytical Modelling

Performance modelling and simulation have been previously used to predict the compute performance of various science codes at varying scales on hypothetical supercomputers. Analytical models (predominantly based on the LogP [30] and LogGP [3] models) have been heavily used to analyse the scaling behaviour of hydrodynamic [32] and wavefront codes [60, 89], as well as many other classes of applications [13, 16, 51, 71].
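Under LogGP, for example, the time to transmit a k-byte message between two processes is commonly written as

\[ T(k) = L + 2o + (k - 1)G \]

where L is the network latency, o the per-message CPU overhead at the sender and receiver, and G the gap per byte (the reciprocal of the available bandwidth); LogP is recovered for small, fixed-size messages where the (k - 1)G term is negligible.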

Modelling the performance of a single-disk file system may be simple for certain configurations, where all writes are of a fixed size, large enough such that caching effects do not skew performance. More complex configurations or usage patterns complicate matters, with issues such as head switches and head seeks changing the performance characteristics.

Ruemmler and Wilkes present an analytical model for head seeks, in which small seeks are handled differently to larger seeks (where the head has the opportunity to reach its maximum speed and therefore coast for a period) [111].
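A common functional form for such a piecewise model (shown here only as a sketch of its general shape, not the exact expression used in [111]) is

\[ t_{\mathrm{seek}}(d) = \begin{cases} a + b\sqrt{d} & d < d_0 \\ c + e \cdot d & d \ge d_0 \end{cases} \]

where d is the seek distance in cylinders, d_0 is the distance beyond which the head reaches full speed, and a, b, c and e are disk-specific constants obtained by measurement.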

Further, they demonstrate a simulator using analytical models for various aspects of a physical hard disk drive, but use some simulation-based modelling to produce a complete disk model [111]. Shriver et al. produce a complete analytical behaviour model for a hard disk drive, taking into account a simple readahead cache as well as request reordering [123]. Probabilistic functions are used throughout to model cache hits and misses. The culmination of the work is a model that is within 5% of the observed data for some workload traces, but decreases in accuracy for large multi-user systems with many parallel applications reading from and writing to the file system.

Work has also been conducted into building an analytical model of a parallel file system. Zhao et al. present a performance model for the Lustre file system, demonstrating an average model error of between 17% and 28%, thus illustrating the difficulty of modelling complex parallel I/O systems with a large number of components [152].

While analytical models can produce near-instant answers to some performance modelling problems, when faced with heavy machine or file system contention, analytical models fail to produce accurate answers [98]; for these problems, simulation is often required.

2.3.4 Simulation-based Modelling

Two simulation platforms have been developed recently at Sandia National Laboratories (SNL) and the University of Warwick. The Structural Simulation Toolkit (SST), from SNL, provides a framework for both macro-level and micro-level simulation of parallel science applications, simulating codes at an abstract level (predicting MPI behaviours and approximate function timings), as well as at a micro-instruction level [107]. Similarly, the Warwick Performance Prediction (WARPP) toolkit simulates parallel science codes at macro-level, and includes simulation parameters to introduce network contention (through the use of a Gaussian distribution of background network load) [59, 61].

While WARPP only attempts to simulate computation and communication behaviour, SST can also predict I/O performance using an optional plugin to simulate a single hard disk (using DiskSim [14]). However, the module is not included by default and is currently not capable of simulating an entire parallel file system. Simulation of an HDD using DiskSim relies on the target disk being benchmarked using the DIXtrac application, which determines the values for “over 100 performance-critical parameters” [113, 114], including the disk’s data layout, its seek profile and various disk cache timing parameters; however, much of this feature extraction relies on features of the SCSI interface that are not applicable to modern HDDs. For newer disks, with more complex data layouts, geometry extraction relies on some benchmarking and guesswork.

Specifically, Gim et al. use an angular prediction algorithm, along with a host of other metrics, to determine many of these parameters [52, 54]. From their data, they can predict the data layout of the disk. Where DIXtrac currently takes up to 24 hours to fully characterise a disk, Gim et al. demonstrate similar accuracy (on newer disks) within an hour [54].

Additional work into disk simulation has been done by both IBM and Hewlett-Packard (HP) Laboratories. Hsu et al. use a trace-driven simulation to analyse the performance gains of various I/O optimisations and disk improvements [67].

They assess the benefits of read caching, prefetching and write buffering, demonstrating their value in improving I/O performance. Likewise, Ruemmler and Wilkes assess the impact of disk caching using a simulation, demonstrating a large error in predictions for small operations when the cache is not modelled, highlighting the importance of disk cache modelling [111].

Early disk caches (typically less than 2 MB in size) would partition their available storage into equally sized blocks to allow multiple simultaneous read operations to use the cache. Modern hard disks do not have this same restriction, instead partitioning the cache according to some heuristic. Suh et al. demonstrate, using a simulator, that the disk’s cache hit-ratio can be improved by using an online algorithm to dynamically partition the cache [124]. Similarly, Zhu et al. demonstrate the benefit of both read and write caching on sequential workloads, but conclude that there is very little benefit when there are more concurrent workloads than cache segments [154].

Thekkath et al. develop a “scaffold” interface in order to allow them to use a real file system module to simulate performance [129]. Their scaffolding simulator mimics many of the operations that would otherwise be performed by the kernel in order to bypass writing to physical media, instead directing data towards a disk model.

Simulating parallel file systems is much more difficult, requiring the simulation of both a shared metadata target and multiple data targets.

Molina-Estolano et al. have developed IMPIOUS [88], a trace-driven parallel file system simulator that attempts to mimic a storage system using PanFS [91]. Although their absolute results are out by an order of magnitude, the trend-line of their results is similar to the true performance.

The CODES storage system simulator has been developed by Liu et al. to predict the performance of a large PVFS2 installation at ANL [77]. They use their model to predict the benefit of burst-buffer solid state drives (SSD) within their installation, concluding that performance may be greatly improved if burst-buffer disks were deployed more widely [78].

Finally, Carns et al. use a simulator of PVFS2 in order to demonstrate the inefficiencies in server-to-server communication, used to maintain file consistency [23]. They modify the algorithms used by PVFS2 and demonstrate speed-ups in file creation, file removal and file stat operations.

2.4 Summary

Parallel computers are forever changing, and achieving optimal performance is becoming increasingly difficult as the technology evolves. From its humble beginnings in the laboratories at LANL – using human computers to distribute complex equations – to the current generation of billion-dollar parallel behemoths, supercomputing has changed how science is performed. In this chapter a survey of current research in HPC has been presented.

Of particular interest, the work performed by Carns et al. [21] and Fürlinger et al. [48] inspires much of the work in Chapter 4. Both Darshan and IPM perform similar tasks to the tool described in this thesis; however, much less focus is put on associating the MPI library function calls with the underlying POSIX operations that commit data to the file system.


The work of Bent et al. [11] in developing PLFS demonstrates the current divergence between how applications perform I/O and how file systems expect I/O to be performed. In Chapter 5, the performance gains reported in the PLFS literature [11, 62, 84, 103] are investigated, demonstrating that there is still progress to be made in achieving the best performance on current-generation parallel file systems.

Finally, this thesis analyses the work of Behzad et al. [10], Lind [76] and You et al. [148] to show that current generation file systems are often better than reported, though this thesis demonstrates that performance often suffers under contention. The work in this thesis also ties the issues associated with file system contention back to PLFS, demonstrating that PLFS has an effect similar to that of contended jobs when applications are run at large scale.

CHAPTER 3 Hardware and Software Overview

Throughout the work contained in this thesis, many different hardware and software systems have been used. This chapter provides a basic overview of how each device works and how various parallel file systems are structured. Finally, the systems and applications used for the experiments in this thesis are summarised.

3.1 Hard Disk Drive

While solid state drives (SSD) are decreasing in cost and improving in performance, mechanical disk drives still dominate on large HPC installations. The adoption of SSDs is beginning to pick up pace, with the drives already being used in tiered storage systems as burst-buffers (storing recently accessed data and writing it to mechanical drives at a later time) [78]. However, to understand the performance of current generation parallel storage systems, and of I/O in a multi-user system, the effects of mechanical disks must be considered.

3.1.1 Disk Drive Mechanics

Figure 3.1 shows the basic internal layout of a standard spinning disk. Data is stored on the platters by magnetising a thin film of ferromagnetic material; depending on the magnetic polarity, a particular space on the disk may represent either a 1 or a 0. The disk platters (which may be stacked) can hold data on both sides and the platter assembly rotates at a constant speed. Disks used in laptops and desktop computers typically spin at either 5,400 or 7,200 revolutions per minute (RPM); server systems usually make use of disks that run at 7,200 RPM, 10,000 RPM or even 15,000 RPM.

Figure 3.1: Basic internal structure of a hard disk drive (spindle, platters, read/write heads, actuator arm and assembly, drive connectors). Image includes resources from: http://openclipart.org/detail/28678

Data is arranged on the disk platters in concentric circles and the “first” track on a platter is always the outermost. In order to read/write data from/to the tracks, the read/write head is moved over a particular track by the actuator mechanism. The disk controller then enables one of the read/write heads at a time in order to read/write data from/to a specific location.

3.1.2 Data Layout

The data on hard disk drives (HDD) was originally addressed using a method known as cylinder-head-sector (CHS). First, the actuator would move the read/write head to the correct cylinder (where a cylinder is the set of tracks on each platter that are equidistant from the spindle), the correct read/write head would be activated and, when the particular sector (where a sector is a 512-byte block of data) required was under the read/write head, data would be accessed.

Figure 3.2: Data layout on a disk with no zoning and three zones of increasing density.

However, using a disk in this way wasted a lot of the potential area of the disk platters, as the data density decreases as data is stored further away from the spindle. In addition to this, because of the CHS addressing standard, disks were constrained to 65,536 cylinders, 16 heads and 63 sectors per track, giving a maximum storage capacity of approximately 34 GB (on older hardware, where only 1,024 cylinders are allowed, the limit may be as low as approximately 530 MB).
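These limits follow directly from the addressing geometry: with 512-byte sectors, the maximum addressable capacity is simply the product of the cylinder, head and sector counts with the sector size,

\[ 65{,}536 \times 16 \times 63 \times 512\ \mathrm{B} \approx 33.8\ \mathrm{GB}, \qquad 1{,}024 \times 16 \times 63 \times 512\ \mathrm{B} \approx 0.53\ \mathrm{GB}. \]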

To overcome these limits, modern disk systems now use logical block addressing (LBA), though CHS addressing systems are still present in most BIOS systems for legacy support. In LBA systems, the disk controller takes an address and converts this to a physical address transparently to the user. Because of this, disks can use much more complex data layout schemes.

As the outermost track of a platter is much longer than the innermost track, it can store much more data (though storing data at the same density complicates the read/write heads’ behaviour and the disk controller). To simplify things, modern disks strike a balance between the complexity of storing all data at the same density and the simplicity of storing all data at the same data rate; the disk is partitioned into multiple zones, where each zone has a different data rate. Modern disks typically use between 10 and 30 zones and the complexity of zoned-bit recording (ZBR) is handled by the disk drive’s internal disk controller [150].

Figure 3.2 illustrates how data may be laid out (a) with a constant data rate, and (b) with three zones. Since the spindle speed of modern disks is constant, towards the outer edge of a platter more physical space passes under the read head; thus, more data can be stored and read/written in a single rotation of the disk. In Figure 3.2(b), the white zone stores data at double the data rate of the blue zone, which itself has twice the data rate of the yellow zone. As the data rate increases, the available bandwidth to and from the disk increases; thus ZBR not only increases the available capacity, it additionally provides increased performance on the outer tracks. Due to manufacturing processes, translating an LBA address to a physical address is complicated further by the use of differing zone data rates on each platter; the data rate of a given zone on one surface may not be the same as that of the corresponding zone on any other surface.

Figure 3.3: Four examples of serpentine sector mapping: (a) Traditional, (b) Surface Serpentine, (c) Cylinder Serpentine, (d) Hybrid Serpentine.

In addition to the use of ZBR to increase the performance of a disk, modern disks use a serpentine pattern for data layout [53, 110, 150]. With modern advances in engineering, it may be more efficient in HDDs to switch to the next track on a disk than to switch head and start reading from another platter (this is due to a head switch requiring an adjustment to the head as well as some “settling” time). For this reason, a serpentine layout may be used to reduce either the frequency of head switches or track-to-track switches. Figure 3.3 illustrates four common sector mapping schemes. The choice of which layout is used is based on the mechanical aspects of a particular disk; a disk with a high overhead for head switching but a low overhead for track-to-track switches may use the layout in Figure 3.3(b), whereas a disk with a low overhead for head switches may use the scheme in Figure 3.3(c).

3.1.3 Disk Controller

In order to address the speed divergence between main memory and HDDs (many GB/s for memory, compared to a maximum of ≈ 100 MB/s for HDDs), high speed cache memory is used by the disk controller to buffer data. The disk cache is able to partially nullify the effect of the mechanical operations performed by the disk for largely sequential workloads. Variations of the following cache policies are found on most hard disk drives:

Read-ahead
Data that is sequential to the current request is read and stored in cache as it may be used shortly.

Read-behind
Data prior to the current request is stored in cache as the platters rotate to the correct position; this data may then be accessed later at no cost, as the head was passing over it anyway.

Write-through
Data is written to the cache and to the disk simultaneously. Written data remains in cache as it may be read or updated later.

Write-back
Data is stored only in cache until the cache segment is about to be modified or replaced, at which point the data is committed to disk.


Modern disk caches usually provide between 8 and 32 MB of cache that can be used for either read or write operations. Simple caches on much older disks would partition the small amount of cache memory available into a number of preset cache blocks of a fixed size; a least-recently used (LRU) algorithm would be used to keep the most important data in cache, and write out or overwrite old data. Today, disk caches implement much more complex algorithms using a combination of LRU, least-frequently used and other cache replacement strategies.

Additionally, cache blocks are dynamically sized and allocated by the disk controller [124, 130]. This added complexity allows large sequential reads to benefit more from a large cache. Small reads on a multi-user system can also benefit, as many more smaller cache blocks can be created to deal with each request.

The Serial-ATA (SATA) protocol introduced native command queuing (NCQ) in 2003, allowing disks to re-order operations in a command queue in order to reduce the number of rotations required, thus decreasing the time taken to fulfil queued requests. Figure 3.4 shows how four requests can be reordered, using the elevator algorithm, to decrease the time taken to service the requests [140]; servicing the third request while the disk is spinning to the second request eliminates the need for an extra rotation.
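The reordering idea can be illustrated with a short sketch (only a sketch: drive firmware also considers rotational position, and the request identifiers and addresses below are invented). Pending requests are sorted and then serviced in a single sweep from the current head position, wrapping around once:

#include <stdio.h>
#include <stdlib.h>

/* A queued request, identified here only by its logical block address. */
struct request {
    int  id;
    long lba;
};

static int by_lba(const void *a, const void *b)
{
    long la = ((const struct request *)a)->lba;
    long lb = ((const struct request *)b)->lba;
    return (la > lb) - (la < lb);
}

/* Service requests in one sweep: everything at or ahead of the current
 * head position first (in ascending order), then wrap to the start. */
static void elevator_schedule(struct request *q, int n, long head_pos)
{
    qsort(q, n, sizeof(q[0]), by_lba);
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < n; i++)
            if ((pass == 0) == (q[i].lba >= head_pos))
                printf("service request %d at LBA %ld\n", q[i].id, q[i].lba);
}

int main(void)
{
    struct request queue[] = {
        { 1, 9000 }, { 2, 1000 }, { 3, 5000 }, { 4, 7000 }
    };
    elevator_schedule(queue, 4, 4000); /* head currently near LBA 4000 */
    return 0;
}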

Disk manufacturers also add additional proprietary features to modern hard disks. Due to the commercially sensitive nature of many of these optimisations, it is becoming increasingly difficult to understand the performance of spinning disks, thus any simulations and models must make many simplifying assumptions to provide a generalised model of an HDD.

Figure 3.4: An example of four requests fulfilled in-order without NCQ and out of order with NCQ.

3.1.4 Redundant Array of Independent Disks

Modern enterprise-class HDDs have a mean time between failures (MTBF) of between 1,200,000 and 2,000,000 hours, representing an annualised failure rate (AFR) of between 0.73% and 0.44% (for the Seagate Constellation ES SAS HDD and the Seagate Enterprise Performance 15K HDD, respectively). In a system with thousands of disks, the probability that one of those disks will fail within a year becomes large. To overcome potential drive losses, redundancy is required, where some data is duplicated to another disk, such that in the event of a failure, no data is lost.
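The AFR figures quoted above follow directly from the MTBF values: to a good approximation for large MTBFs,

\[ \mathrm{AFR} \approx \frac{8{,}766}{\mathrm{MTBF}} \times 100\%, \qquad \frac{8{,}766}{1{,}200{,}000} \approx 0.73\%, \qquad \frac{8{,}766}{2{,}000{,}000} \approx 0.44\%, \]

where 8,766 is the average number of hours in a year.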

Patterson, Gibson and Katz outline the first five configurations for redundant arrays of independent disks (RAID)¹ [97]. The first level, RAID-1, allows numerous disks to be connected in parallel, with each pair of disks creating a duplicate of each block simultaneously. This potentially provides better read performance (as each simultaneous read can be processed in parallel), but at the expense of poor write performance, as data duplication requires each disk to synchronise. To reduce the cost of data duplication, RAID-2 uses a Hamming code to provide error correction, while all subsequent RAID levels use parity checking. RAID-3 uses a single dedicated parity disk, while data is striped at a byte level. RAID-4 is similar to RAID-3, except block-level striping is used with a dedicated parity disk. In the event of a failure, a new disk can be swapped into the RAID system and rebuilt either by recalculating the parity, or by using the present parity data to reconstruct the original data.

¹ Initially redundant array of inexpensive disks; this has been retroactively changed to independent disks.

Figure 3.5: Two common RAID data distribution schemes: (a) RAID-5 and (b) RAID-6.

RAID-5 and RAID-6 are the most commonly used “standard” levels of RAID at the time of writing, with block-level striping of data and distributed parity.

Figure 3.5 shows how data is structured on both RAID-5 and RAID-6, where the parity data is distributed. RAID-5 can survive a single disk failure, while RAID-6 can survive two, as it provides double distributed parity. Due to the ability to access disks in parallel, RAID-5 and RAID-6 also offer potential performance boosts. Specifically, RAID-5 allows read and write performance of (n - 1)x, while RAID-6 provides (n - 2)x, where n is the number of disks and x is the performance of a single disk.

Additionally, there is RAID-0, where there is no fault tolerance but data is striped to provide performance equivalent to nx. On HPC systems, metadata storage targets often employ RAID-1+0, where data is both striped and mirrored to provide additional resilience and performance.
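As a purely hypothetical illustration of these expressions, an array of ten disks, each sustaining 100 MB/s, would ideally provide

\[ \text{RAID-0: } 10x = 1{,}000\ \mathrm{MB/s}, \quad \text{RAID-5: } (10-1)x = 900\ \mathrm{MB/s}, \quad \text{RAID-6: } (10-2)x = 800\ \mathrm{MB/s}, \]

ignoring parity computation, striping overheads and contention.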

3.2 File Systems

When writing to an HDD, performance is not only dependent upon the disk drive in use; the performance of the disk is also dictated by how the operating system interacts with the hardware. This is largely controlled by the file system in use.

3.2.1 The Extended File System

The extended file system (ext) was the first file system created specifically for the Linux kernel and was inspired by the UNIX file system (UFS). The file system was superseded almost immediately by the second extended file system (ext2), which used similar metadata structures but remedied many of the limitations of ext. The third iteration of the extended file system (ext3) was developed 8 years later and made use of a journal to improve its reliability and performance. The successor to ext3 (ext4) was released in 2008 and provided further improvements to performance, mostly in file system checking.

Figure 3.6: Structure of an inode block, showing direct, indirect and double indirect block pointers.

The Second Extended File System

In each of the ext file systems, files are represented by inodes, where an inode is a structure that contains all of the information about a file except its filename and the data itself [20]. The POSIX inode description lists the following attributes (amongst others that are not listed below):

• Size of the file in bytes.
• User ID and group ID of the file owner.
• File access modes.
• Creation, last access and last modification times.
• Number of file blocks.
• Pointers to the disk blocks containing the file’s data.


Under ext2, the block size for the file system may be 1, 2 or 4 KB and is typically the highest of the three (4 KB). Each inode has a storage area that contains 15 pointers; the first 12 point directly to file data blocks, the next points to an indirect block (which contains pointers to file data blocks), the next points to a double indirect block (which contains pointers to indirect blocks) and the final pointer points to a trebly indirect block. Figure 3.6 outlines the structure of an ext2 inode.
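With 4 KB blocks and 4-byte block pointers, each indirect block holds 1,024 pointers, so the block-mapping scheme can address approximately

\[ (12 + 1{,}024 + 1{,}024^2 + 1{,}024^3) \times 4\ \mathrm{KB} \approx 4\ \mathrm{TB} \]

of file data, although other limits in ext2 cap the maximum file size below this in practice.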

When allocating space for the data blocks of a file, ext2 attempts to preallocate 8 blocks upon file creation; this helps to prevent file fragmentation (where a file’s data is not stored contiguously on the physical medium). Additionally, ext2 attempts to place new file blocks as close as possible to the data of other files in the same directory.

The Third Extended File System

The ext3 file system is an extension of ext2 with the addition of a journal (and some other additional features that will not be discussed in this thesis) [133].

The journal on ext3 is essentially a file, configurable in size and location, that stores transactions destined for the disk. The journal can be configured to work in one of three ways:

Journal
Both metadata and file contents are written into the journal before being committed to the main file system.

Ordered
Only metadata is written to the journal. Data is not, but it is guaranteed that the data will have been written to the file system before the entry is marked as committed in the journal.

Writeback
Only metadata is written to the journal; however, the journal entry may be marked as committed before the data has been written to the file system.


The journal may be stored on the file system it is journalling, or it may be stored on another disk or in memory. The advantage of the journal is that, in the event of a file system crash, it can be replayed in order to repair the file system and bring the storage back online much more quickly. However, as metadata is usually written to the disk twice (and in some cases the file data may also be written twice), it can have implications for file system performance.

In ext3, the default behaviour is to use ordered journalling, where only metadata entries are written to the journal, which behaves like a circular log of transactions. At a particular interval, or when the journal reaches its maximum size, the entries are committed to the file system.

The Fourth Extended File System

The fourth extended file system (ext4) is again largely feature-compatible with ext2 and ext3 but contains a series of extensions that were originally designed for use by the Lustre file system [117]. The file system allows preallocation of on-disk space of up to 128 MB, allowing applications to reserve large contiguous storage spaces for improved performance. Additionally, the allocation of disk space can be delayed until data is flushed to the file system; this allows a more efficient allocation that may reduce fragmentation and improve performance by using the actual file size for the allocation [133].

In contrast to the previous iterations of the extended file systems, ext4 makes use of file extents instead of traditional block mapping. Rather than using the inode block mapping scheme described above, ext4 inodes can instead store four extents (where an extent is a block of up to 128 MB of contiguous storage). If more than four extents are required for a file, a tree of extent index blocks (similar in spirit to the HTree structure used for directory indexing) is used to store the additional extents.

3.2.2 The Sun Network File System

The Sun Network File System (NFS) is perhaps the most well known and most widely used file system that provides a unified directory structure to a collection of clients over a network [112]. It uses Sun’s Remote Procedure Call (RPC) protocol in order to allow clients to issue file system commands across the network.

The first public release of NFS (NFSv2) operated using UDP only and therefore provided only a stateless protocol; operations such as a write had to be completed synchronously, whereby the server would have to perform the complete operation before it could return a result to the client. Stateless operation also restricted the use of file locking, and this therefore had to be implemented outside of the core protocol.

Although support for TCP was added in NFSv3, a stateful protocol was not added until NFSv4. Asynchronous write operations were added to NFSv3 in order to provide greater write performance.

3.3 Distributed File Systems

While NFS provides a unified storage solution for many connected nodes, the performance of the file system does not scale with the size of the network. For this reason, distributed file systems (DFS) exist to provide a unified storage space across a network of servers. As a DFS is spread across multiple servers (allowing parallelised access without interprocess interference) it generally provides a greater quality of service (QoS) than NFS. The IBIS file system [131], developed in 1985, was one of the first DFSs where the file system was spread across all nodes of the network, allowing all nodes to transparently access any file regardless of whether the file was stored locally or remotely. Modern DFSs now provide dedicated storage servers, each themselves containing high performance storage backends (using techniques such as RAID).

In most modern DFSs, there are four components. This thesis adopts the naming convention of the Lustre file system; different file systems may use alternate terms but the functions they provide are largely equivalent.


Figure 3.7: An example Lustre configuration with four OSSs and a fail-over MGS and MDS.

Object Storage Targets (OST)
HDDs are usually grouped using RAID (to improve performance and provide some redundancy); these are then referred to as Object Storage Targets. The OSTs are used to store the stripe data blocks that make up each file.

Object Storage Servers (OSS)
One or more OSTs are connected to one or more Object Storage Servers. The OSSs are directly responsible for reading and writing file data from and to the OSTs.

Metadata Server (MDS)
Metadata (such as the directory tree, file permissions and file block locations) is either stored on a dedicated Metadata Server or is stored on the OSSs (as in GPFS). The MDS is used by the clients to get file information and file structure, such that they can access the file stripes stored on the OSTs.

Management Server (MGS)
Finally, there are usually one or two Management Servers holding the server configurations.

The parallel file systems used throughout this thesis differ in the number of OSTs and OSSs in use, as well as in their metadata handling: the Lustre file system used has a dedicated MDS, while the GPFS installations used distribute their metadata. Other parallel file systems, such as Ceph [137], make use of multiple MDSs in order to improve the performance of metadata updates, which are often identified as a bottleneck in some large, heavily loaded Lustre installations [1].

3.3.1 The Lustre File System

The Lustre file system is used by many of the world’s fastest and largest computers, with up to 66 of the top 100 HPC systems using Lustre in 2010². The basic architecture of a Lustre file system is shown in Figure 3.7. Although Lustre (up to version 2.4) uses only a single MDS, a fail-over MDS and MGS can be present. Additionally, as shown in Figure 3.7, multiple OSSs can be connected to common OSTs and this will again provide some fail-over capability.

Much like previous DFSs, Lustre makes use of file striping to allow load to be distributed across a number of service nodes. The size and width of each stripe (where the width is the number of servers over which to stripe) can be configured on a per-file or per-directory basis. The Lustre installation used in this thesis stripes across 2 OSTs with each stripe being 1 MB in size in its default configuration. The lfs command can be used to view and modify these settings.
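Applications using MPI-IO can also request a particular layout when a file is created by passing hints to ROMIO; the hint names below are those understood by ROMIO’s Lustre driver, while the file name and the values chosen (a 4 MB stripe over 8 OSTs) are purely illustrative:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);

    /* Illustrative values: stripe each file over 8 OSTs in 4 MB stripes. */
    MPI_Info_set(info, "striping_factor", "8");        /* stripe width */
    MPI_Info_set(info, "striping_unit",   "4194304");  /* stripe size  */

    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... perform collective writes here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}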

When writing to a Lustre system, the server used for the first stripe is randomised in order to provide some load balancing between different clients. From this point onwards, the data is striped across a number of servers based on the configured stripe width.

² http://opensfs.org/wp-content/uploads/2011/11/Rock-Hard1.pdf


Figure 3.8: An example of a GPFS setup with four OSSs connected via a high performance switch to three targets and separate management and metadata targets.

To maintain consistency and allow correct concurrent access to the DFS, Lustre makes use of a distributed lock manager. Each OSS maintains its own file locks and so, if two processes attempt to access the same chunk of a file, the OSS will only grant a lock to one of the clients (unless both accesses are read requests).

3.3.2 IBM’s General Parallel File System

The General Parallel File System (GPFS) from IBM operates similarly to Lustre; large files are distributed across multiple storage targets using stripes. However, GPFS differs from Lustre in that all OSSs are connected to all OSTs and MDTs, usually through a fibre channel switch. This provides additional resilience in that many more OSSs can fail before the file system must go offline.

Figure 3.8 demonstrates an example GPFS configuration. Although it is possible to store metadata on the same disks as file data, many installations (including the configuration in use at the University of Warwick at the time of writing) make use of dedicated, higher performance metadata targets.


Figure 3.9: An application’s view of a file and the underlying PLFS container structure (hostdir.* and index.* files within the backend directory).

On GPFS, metadata is maintained by all servers, potentially providing better performance for metadata intensive workloads. As shown by Hedges et al., the file creation rate on GPFS is much higher than on a Lustre system, provided that the files are being created in distinct directories; the use of fine grained directory locking in GPFS makes file creation slower in the same directory [63].

GPFS makes use of a much smaller stripe size than Lustre (typically 16 KB or 64 KB) and sets the stripe width adaptively. For large parallel writes, data can be striped across all available GPFS servers, potentially providing a much greater maximum bandwidth [63].

3.3.3 The Parallel Log-structured File System

On top of parallel file systems like Lustre or GPFS, virtual file systems may provide an additional performance boost by transforming parallel file operations to be more appropriate for the underlying file system. One such example of this is the parallel log-structured file system (PLFS) [11] developed at the Los Alamos National Laboratory (LANL).

PLFS is a virtual file system that makes use of file partitioning and a log-structure (as described in Section 2.2.2) to improve the performance of parallel file operations. Each file within the PLFS mount point appears to an application as though it is a single file; PLFS, however, creates a container structure, with a data file and an index for each process or compute node. This provides each process with its own unique file stream, potentially increasing the available bandwidth. Figure 3.9 demonstrates how a six-rank (two processes per compute node) execution would view a single file and how it would be stored within the PLFS backend directory.

                     Minerva            Sierra             Cab
Processor            Intel Xeon 5650    Intel Xeon 5660    Intel Xeon E5-2670
CPU Speed            2.66 GHz           2.8 GHz            2.6 GHz
Cores per Node       12                 12                 16
Memory per Node      24 GB              24 GB              32 GB
Nodes                492                1,856              1,200
Interconnect         QLogic TrueScale 4× QDR InfiniBand
File System          See Table 3.2      See Table 3.3      See Table 3.3

Table 3.1: Hardware specification of the Minerva, Sierra and Cab supercomputers.

Minerva File System
File System                GPFS
I/O Servers                2
Theoretical Bandwidth (a)  ≈ 4 GB/s
                           Storage            Metadata
Number of Disks            96                 24
Disk Size                  2 TB               300 GB
Spindle Speed              7,200 RPM          15,000 RPM
Bus Connection             Nearline SAS       SAS
RAID Configuration         Level 6 (8 + 2)    Level 1+0

Table 3.2: Configuration for the GPFS installation connected to Minerva.

(a) Theoretical Bandwidth refers to the maximum rate at which data can be transferred to the file servers and is therefore bounded only by the network interconnect.

In order to use PLFS on a supercomputer, either: the FUSE file system driver must be installed; a custom MPI library must be built; or applications must be rewritten to use the PLFS API directly. In Chapter 5 an alternative solution is provided, in addition to an in-depth investigation into why PLFS achieves the performance gains reported by its developers [11].

3.4 Computing Platforms

The work presented in this thesis has been carried out on four distinct HPC systems. Three of these are built from commodity hardware; one is a machine installed at the University of Warwick and the other two systems are installed at the Lawrence Livermore National Laboratory (LLNL) in the United States.

OCF lscratchc File System
                           2011–2012                            2013
File System                Lustre                               Lustre
I/O Servers                24                                   32
Theoretical Bandwidth (a)  ≈ 30 GB/s                            ≈ 30 GB/s
                           Storage          Metadata            Storage          Metadata
Number of Disks            3,600            30 (+2) (b)         4,800            30 (+2) (b)
Disk Size                  450 GB           147 GB              450 GB           147 GB
Spindle Speed              10,000 RPM       15,000 RPM          10,000 RPM       15,000 RPM
Bus Connection             SAS              SAS                 SAS              SAS
RAID Configuration         Level 6 (8 + 2)  Level 1+0           Level 6 (8 + 2)  Level 1+0

Table 3.3: Configuration for the lscratchc Lustre File System installed at LLNL in 2011 (for the experiments in Chapter 5) and 2013 (for the experiments in Chapter 6).

(a) Theoretical Bandwidth refers to the maximum rate at which data can be transferred to the file servers and is therefore bounded only by the network interconnect.
(b) The MDS used by OCF’s lscratchc file system uses 32 disks: two configured in RAID-1 for journalling data, 28 disks configured in RAID-1+0 for the data volume itself and a further two disks to be used as hot spares.

The final machine used was the now decommissioned IBM BlueGene/P (BG/P) system that was installed at the Daresbury Laboratory in the United Kingdom.

Specifically, the machines are:

Minerva
A capacity (used for many small tasks) supercomputer installed at the Centre for Scientific Computing within the University of Warwick. Minerva is an IBM iDataPlex system consisting of 492 nodes, each containing two hex-core Westmere-EP processors clocked at 2.66 GHz. The system is served by a small GPFS installation and the nodes are connected via QLogic’s TrueScale 4× QDR InfiniBand. The full specification can be found in Tables 3.1 and 3.2.

Sierra
A capability (used for a few very large tasks) HPC system installed in the Open Compute Facility (OCF) at LLNL. Sierra is a Dell Xanadu 3 Cluster consisting of 1,856 compute nodes, each containing two hex-core Westmere-EP processors running at 2.8 GHz. The interconnect is a QLogic QDR InfiniBand fat-tree (very similar to Minerva). Sierra is connected to LLNL’s “islanded I/O” network, and can therefore make use of various different Lustre installations. In this thesis, work has been predominantly performed on the lscratchc file system due to its locality to Sierra. The experiments on Sierra were all performed prior to 2013, when the lscratchc file system was upgraded from 360 to 480 OSTs. More details can be found in Tables 3.1 and 3.3.

Cab
A capacity supercomputer installed in the OCF at LLNL. Cab is a Cray-built Xtreme-X cluster with 1,200 batch nodes, each containing two oct-core Xeon E5-2670 processors clocked at 2.6 GHz. An InfiniBand fat-tree connects each of the nodes and, like Sierra, Cab is connected to LLNL’s islanded I/O network. The work in this thesis was performed on the lscratchc file system after its upgrade to 480 OSTs. More information can be found in Tables 3.1 and 3.3.

BG/P
Daresbury’s BG/P system was a single cabinet, consisting of 1,024 compute nodes. Each node contained a single quad-processor compute card clocked at 850 MHz. The BlueGene/P architecture featured dedicated networks for point-to-point communications and MPI collective operations. File system and complex operating system calls (such as timing routines) were routed over the MPI collective tree to specialised higher-performance login or I/O nodes, enabling the design of the BlueGene compute node kernel to be significantly simplified to reduce background compute noise. The BG/P at Daresbury used a compute-node to I/O server ratio of 1:32; however, differing ratios were provided by IBM to support varying levels of workload I/O intensity. The BlueGene used in this thesis was supported by a GPFS storage solution with a hierarchical storage structure, where data was written to Fibre Channel disks initially (Stage 1 in Table 3.5) before being staged onto slower SATA-connected hard disks later (Stage 2 in Table 3.5). Furthermore, data and metadata were stored on the same storage medium. Daresbury’s BG/P compute and I/O configuration is summarised in Tables 3.4 and 3.5, respectively.

STFC BlueGene Platform
Processor          PowerPC 450
CPU Speed          850 MHz
Cores per Node     4
Nodes              1,024
Interconnects      3D Torus; Collective Tree
Storage System     See Table 3.5

Table 3.4: Hardware configuration for the IBM BlueGene/P system at the Daresbury Laboratory.

STFC BlueGene Platform File System
File System                GPFS
I/O Servers                4
Theoretical Bandwidth (a)  ≈ 6 GB/s
                           Stage 1            Stage 2
Number of Disks            110                35
Disk Size                  147 GB             500 GB
Spindle Speed              15,000 RPM         7,200 RPM
Bus Connection             Fibre Channel      SATA
RAID Configuration         Level 6 (8 + 2)    Level 5 (4 + 1)

Table 3.5: Configuration for the GPFS installation connected to the Daresbury Laboratory’s BlueGene/P, where data is first written to Fibre Channel connected disks before being staged to slower SATA disks.

(a) Theoretical Bandwidth refers to the maximum rate at which data can be transferred to the file servers and is therefore bounded only by the network interconnect.

3.5 Input/Output Benchmarking Applications

Throughout this thesis, work has been performed using a variety of different benchmarks. Specifically, this thesis makes extensive use of four benchmarks which are representative of a broad range of high performance applications.

These applications are:

IOR
A parameterised benchmark that performs I/O operations through both the HDF-5 and the MPI-IO interfaces [120, 121]. The application can be configured to be representative of a large number of science applications with minimal configuration.

FLASH-IO
A benchmark that replicates the HDF-5 checkpointing routines found in the FLASH [4, 155] thermonuclear star modelling code [47, 109]. The local problem size can be configured at compile time to behave in the same way as any given FLASH dataset.

BT
An application from the NAS Parallel Benchmark (NPB) Suite which has been configured by NASA to replicate I/O behaviour from several important internal production codes [7, 8].

mpi_io_test
A parameterised benchmark developed at LANL, primarily used for benchmarking the performance of PLFS. In particular, mpi_io_test provides an interface for writing N-to-N, N-to-M and N-to-1, allowing for a comparison of writing techniques [95].

Of these four applications, two are standard benchmarks used for the assessment of parallel file systems (IOR and mpi_io_test), while the other two have been chosen as they recreate the I/O behaviour of much larger codes but with a reduced compute time and less configuration than their parent codes (FLASH-IO and BT). This permits the investigation of system configurations that may have an impact on the I/O performance of the larger codes, without requiring considerable machine resources.

In addition to these applications, a custom benchmark has been written to assess the impact of some of the tools presented in this thesis on the performance of the MPI communication library (see Chapter 4). A further benchmarking application has also been written to explore the effect of contention on the Lustre file system (see Chapter 6).


3.6 Summary

The hardware and software in use on modern supercomputers varies drastically between different organisations and installations, but the principles that dictate performance remain largely the same. In this chapter the history and structure of I/O in parallel computation has been described, starting with the development and improvement of HDDs (which continue to dominate HPC I/O installations [151]), to the creation of the first networked file system, and up to the DFSs in use at the time of writing.

Modern parallel file systems make use of an object-based storage approach which is not dissimilar to the operation of standalone file systems such as ext4.

Files are divided into discrete blocks and, where on standalone file systems these blocks are spread across a single disk, on a DFS the blocks are distributed amongst several separate disks and file servers. The structure of these files and the properties associated with them are then stored in a metadata database, which may itself be distributed.

With the decreasing cost of solid state drives, their use in HPC installations is increasing. Modern HPC systems are beginning to combine both HDDs and SSDs into tiered architectures – using SSDs as a staging area, before committing data to slower HDDs at a non-critical time [151]. The idea of using tiered storage is not new and is used in the BlueGene/P used in this thesis – where data is written initially to fast Fibre Channel disks, before being moved to slower disks. The use of tiered/hybrid I/O systems is changing the performance characteristics of parallel file systems [138]. However, much of the work in this thesis will similarly apply when SSD adoption increases; ensuring data consistency through the use of file locking will still reduce the performance of large distributed writes and contention will still hamper the performance of shared file systems, albeit to a lesser extent.

The primary purpose of modern day parallel storage for science applications is to provide an interface through which applications can store the results of long-running computations. The data generated by these applications can be used for additional purposes beyond producing the answers to important scientific questions. Data written throughout an execution of a scientific application can be used to visualise the progression of a computation and to facilitate software resilience. It is estimated that on exascale machines, applications may have to survive multiple node failures per day; the focus of this thesis is on the checkpointing routines that are used in scientific applications to provide snapshots, enabling application state recovery following a failure.

Throughout this thesis, multiple applications and hardware systems are used to assess the current state of parallel I/O and how it must adapt to solve the challenges exascale computation will bring. The hardware configurations used throughout the remainder of this thesis are summarised in this chapter, along with the applications that are used to assess them.

CHAPTER 4 I/O Tracing and Application Optimisation

As the HPC industry moves towards exascale computing, the increasing number of compute components will have huge implications for system reliability. As a result, checkpointing – where the system state is periodically written to persistent storage so that, in the case of a hardware or software fault, the computation can be restored and resumed – is becoming common-place. The cost of checkpointing is a slowdown at specific points in the application in order to achieve some level of resilience. Understanding the cost of checkpointing, and the opportunities that might exist for optimising this behaviour, presents a genuine opportunity to improve the performance of parallel applications at scale.

Performing I/O operations in parallel using MPI-IO or file format libraries, such as the hierarchical data format (HDF-5), has partially encouraged code designers to treat these libraries as a black box, instead of investigating and optimising the data storage operations required by their applications. Their focus has largely been improving compute performance, often leaving data-intensive operations to third-party libraries. Without configuring these libraries for specific systems, the result has often been poor I/O performance that has not realised the full potential of expensive parallel disk systems [9, 10, 66, 141, 146].

This chapter documents the design, implementation and application of the RIOT I/O Toolkit (referred to throughout the remainder of this thesis by the recursive acronym RIOT), described previously [141, 142, 144], to demonstrate the I/O behaviours of three standard benchmarks at scale on three contrasting HPC systems. RIOT is a collection of tools developed specifically to enable the tracing and subsequent analysis of application I/O activity. The tool is able to trace parallel file operations performed by the ROMIO layer (see Section 2.2.3 for details) and relate these to their underlying POSIX file operations. This recording of low-level parameters permits analysis of I/O middleware, file format libraries, application behaviour and, to some extent, even the underlying file systems used by large clusters.

Figure 4.1: Tracing and analysis workflow using the RIOT toolkit.

4.1 The RIOT I/O Toolkit

The left-hand side of Figure 4.1 depicts the usual flow of I/O in parallel applications; generally, applications either use the MPI-IO file interface directly, or use a third-party library such as HDF-5 or NetCDF. In both cases, MPI is ultimately used to perform the read and write operations. In turn, MPI calls upon the MPI-IO library which, in the case of both OpenMPI and MPICH, is the ROMIO implementation [127]. The ROMIO file system driver [125] then calls the file system’s operations to read/write the data from/to the file system.

RIOT is an I/O tracing tool that can be used either as a dynamically loaded library (via runtime pre-loading and linking) or as a static library (linked at compile time). In the case of the former, the shared library uses function interpositioning to place itself into the library stack immediately prior to execution.

When compiled as a dynamic library, RIOT redefines several functions from the POSIX API and MPI libraries – when the running application makes calls to these functions, control is instead passed to handlers in the RIOT library. These handlers allow the original function to be performed, timed and recorded into a log file for each MPI rank. By using the dynamically loadable libriot, application recompilation is avoided completely; RIOT is therefore able to operate on existing application binaries and remain agnostic to compiler and implementation language.
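A minimal sketch of this style of dynamic interposition is shown below for the POSIX write() call; it is an illustration of the general technique rather than RIOT’s actual implementation, and the two accumulator variables are invented for the example. Such a library would typically be built as a shared object and activated by setting LD_PRELOAD before launching the application.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* In-memory totals, mirroring the idea of buffering trace data in memory
 * rather than performing I/O inside the traced operation itself. */
static double total_write_time = 0.0;
static size_t total_bytes_written = 0;

/* Interposed write(): time the real libc call and record it in memory. */
ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    struct timeval start, end;

    if (real_write == NULL)
        real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

    gettimeofday(&start, NULL);
    ssize_t ret = real_write(fd, buf, count);
    gettimeofday(&end, NULL);

    total_write_time += (end.tv_sec - start.tv_sec)
                      + (end.tv_usec - start.tv_usec) / 1e6;
    if (ret > 0)
        total_bytes_written += (size_t)ret;
    return ret;
}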

For situations where dynamic linking is either not desirable or is only available in a limited capacity (such as in the BG/P system used in this study), a static library can be built. The RIOT software makes use of macro functions in order to control how the library is built (i.e. whether a statically linked library or a dynamically loadable library should be built). A compiler wrapper is then used to compile RIOT into a parallel application using the --wrap functionality found in the linker. Listing 4.1 shows how one function (namely the MPI_File_open() function) looks within RIOT.

As shown in Figure 4.1, libriot intercepts I/O calls at three positions. In the first instance, MPI-IO calls are intercepted and redirected through RIOT, using either the PMPI interface, or via dynamic or static linking; in the second instance, POSIX calls made by the MPI library are intercepted; and in the final instance, any POSIX calls made by the ROMIO file system interface are caught and processed by RIOT.

Traced events in RIOT are recorded in a buffer stored in main memory. While the size of the buffer is configurable, experiments have suggested that a buffer of 8 MB is sufficient for the experiments in this thesis and adds minimal overhead to the application. A buffer of this size allows approximately 340,000 file operations to be stored before needing to be flushed to the disk. This delay of logging (by storing events in memory) may have a small effect on compute performance (since the memory access patterns may change), but storing trace data in memory helps to prevent any distortion of application I/O performance.

int FUNCTION_DECLARE(MPI_File_open)(MPI_Comm comm, char *filename,
                                    int amode, MPI_Info info, MPI_File *fh)
{
    // The FUNCTION_DECLARE macro controls how functions are defined,
    // depending on whether the static or dynamic library is being built.
    DEBUG_ENTER;

    // Map the real MPI_File_open command to __real_MPI_File_open
    MAP(MPI_File_open);

    // Add the file to the database
    int fileid = addFile(filename);

    // Add a record to the log
    addRecord(BEGIN_MPI_OPEN, fileid, 0);

    // Perform the correct operation
    int ret = __real_MPI_File_open(comm, filename, amode, info, fh);

    // Add an end record to the log
    addRecord(END_MPI_OPEN, fileid, 0);

    DEBUG_EXIT;
    return ret;
}

Listing 4.1: Source code demonstrating how the MPI_File_open function is interpositioned in RIOT.

In the event that the buffer becomes full, the data is written out to disk and the buffer is reset. This repeats until the application has terminated.

Time consistency is established across multiple nodes by overloading the MPI_Init() function to force all ranks to wait at the start of execution on an MPI_Barrier() before each resetting their respective timers; after this initial barrier, each rank can progress uninterrupted by RIOT. This is especially important on architectures such as IBM's BlueGene, as applications can take several minutes to start across the whole cluster. Synchronising in this manner enables more accurate ordering of events even if nodes have experienced a significant degree of time drift.
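A minimal sketch of this synchronisation, using the standard PMPI profiling interface rather than RIOT's own machinery (the riot_time_origin variable and trace_timestamp() helper are hypothetical):

#include <mpi.h>

// Minimal sketch (not RIOT's source) of synchronising per-rank timers by
// overloading MPI_Init through the standard PMPI profiling interface.
static double riot_time_origin = 0.0;    // hypothetical per-rank time origin

int MPI_Init(int *argc, char ***argv)
{
    int ret = PMPI_Init(argc, argv);     // perform the real initialisation
    PMPI_Barrier(MPI_COMM_WORLD);        // wait until every rank has started
    riot_time_origin = PMPI_Wtime();     // reset this rank's timer origin
    return ret;
}

double trace_timestamp(void)             // timestamps are then taken
{                                        // relative to the common origin
    return PMPI_Wtime() - riot_time_origin;
}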

After the recording of an application trace is complete, a post-execution analysis phase can be conducted (see Figure 4.1).


During execution, RIOT builds a file lookup table and for each operation only stores the time, the rank, a file identifier, an operation identifier and the file offset. After execution, these log files are merged and time-sorted into a single master log file, as well as a master file database.
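As a rough illustration of why an 8 MB buffer holds approximately 340,000 events, the record below is one plausible packed layout for the five fields listed above; it is not RIOT's actual definition, but at 24 bytes per record an 8 MiB buffer accommodates 8 MiB / 24 B ≈ 349,000 events, in line with the figure quoted earlier.

#include <stdint.h>

// A hypothetical, naturally aligned layout for one trace event, consistent
// with the fields RIOT stores per operation.
struct riot_event {
    double   time;      // 8 bytes: timestamp relative to the rank's origin
    uint16_t rank;      // 2 bytes: MPI rank
    uint16_t file_id;   // 2 bytes: index into the file lookup table
    uint32_t op_id;     // 4 bytes: operation identifier
    uint64_t offset;    // 8 bytes: file offset of the operation
};                      // 24 bytes in total, with no padding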

Using the information stored, RIOT can:

• Produce a complete runtime trace of an application's I/O behaviour;

• Demonstrate the file locking behaviour of a particular file system;

• Calculate the effective POSIX bandwidth achieved by MPI to the file system;

• Visualise the decomposition of an MPI file operation into a series of POSIX operations; and,

• Demonstrate how POSIX operations are queued and then serialised by the I/O servers.

Throughout this thesis, a distinction is made between effective MPI-IO and POSIX bandwidths – MPI-IO bandwidths refer to the data throughput of the MPI functions on a per MPI-rank basis. POSIX bandwidths relate to the data throughput of the POSIX read/write operations as if performed serially and called directly by the MPI library. This distinction is made due to the inability to accurately report the perceived POSIX bandwidth because of the non-deterministic nature of parallel POSIX writes. The perceived POSIX bandwidth is therefore bounded below by the perceived MPI bandwidth (since the POSIX bandwidths must necessarily be at least as fast as the MPI bandwidths), and is bounded above by the effective POSIX bandwidth multiplied by the number of ranks (assuming a perfect parallel execution of each POSIX operation).
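Writing $B$ for bandwidth and $n$ for the number of MPI ranks (notation introduced here only for convenience), the bounds described above can be summarised as

\[ B^{\mathrm{perceived}}_{\mathrm{MPI}} \;\le\; B^{\mathrm{perceived}}_{\mathrm{POSIX}} \;\le\; n \times B^{\mathrm{effective}}_{\mathrm{POSIX}}. \]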

4.1.1 Feasibility Study

To ensure RIOT does not significantly affect the runtime behaviour and performance of scientific codes, an I/O benchmark has been specifically designed to assess the overheads introduced by the use of RIOT. The application performs a known set of read and write operations over a series of files. Each process performs 100 read and write operations in 4 MB blocks. The benchmark application was executed on three of the test platforms used in this thesis in three distinct configurations: (i) without RIOT; (ii) with RIOT configured to only trace POSIX file operations; and, (iii) with RIOT performing a complete trace of MPI and POSIX file activity. The six MPI operations chosen for this feasibility study were: MPI_File_read/write(), MPI_File_read_all/write_all() and MPI_File_read_at_all/write_at_all(); analysis of the scientific codes used throughout this thesis, and other similar applications, suggests that these functions are amongst the most commonly used for performing parallel I/O (see Appendix A for more details).

Figure 4.2 shows the time taken to perform 100 MPI_File_write_all() and MPI_File_read_all() operations at differing core counts (results for additional functions are shown in Appendix A). From these experiments it is clear that RIOT adds minimal overhead to an application's runtime, although it is particularly difficult to precisely quantify this overhead since the machines employed operate production workloads.

As shown by the confidence intervals in Figure 4.2, on Minerva, repeated runs produce nearly identical results due to the relatively small size of the machine and the lack of heavy utilisation on the I/O backplane. For Sierra, results vary more widely due to several I/O intensive applications running on the same storage subsystem simultaneously. On BG/P the results are similarly varied, and in some cases the application runs vary more widely due to the use of I/O aggregator nodes in addition to the compute nodes. Nevertheless, the results of these experiments show that the average overhead of RIOT is rarely greater than 5% for MPI_File operations. Low overhead tracing is a key feature in the design of RIOT, and is an important consideration for profiling activities associated with large codes that may already take considerable lengths of time to run in their own right.

Figure 4.2: Total runtime of the RIOT overhead analysis benchmark for the functions MPI_File_write_all() and MPI_File_read_all(), on three platforms at varying core counts, with three different configurations: no RIOT tracing, POSIX RIOT tracing and complete RIOT tracing.

4.2 File System Analysis

One key use-case of RIOT is to trace the write behaviour of scientific codes.

To demonstrate this, analysis has been performed on three distinct codes (one of which was executed in two different configurations). Each of the codes was executed using the default configuration options for the test machine in question. For both Minerva and Sierra, data was pushed to the disks using the UNIX File System (UFS) MPI-IO driver (ad_ufs). For Minerva, data was striped across its two servers with metadata operations being distributed between these two servers. For Sierra, metadata operations were performed on a dedicated metadata server, while data was striped across two OSTs.

4.2.1 Distributed File Systems – Lustre and GPFS

As outlined in Chapter 3, the three test clusters employed in this chapter make use of two different file systems – both Minerva and BG/P make use of GPFS, while Sierra uses a Lustre installation. The I/O backplane used by Minerva and that used by Sierra may seem vastly different, but the default configuration of lscratchc means that the performance of both are similar since, in each case, files are usually striped over two OSTs. Both GPFS installations adapt to their workload, though as stated previously, this usually means striping data over the two available servers in Minerva's case. As demonstrated in Figure 4.3(a), at low core counts Sierra achieves the fastest write speed for IOR using MPI-IO, though this is soon exceeded by BG/P as the number of cores is increased. Figure 4.3 shows that for IOR and FLASH-IO, Minerva's performance follows the trend of Sierra, though it performs slightly worse due to the slower hardware being employed.

It is interesting to note that IOR writing through the HDF-5 middleware library (Figure 4.3(b)) exhibits very different performance to the same benchmark running with only MPI-IO, despite writing similar amounts of data to the same offsets on both Sierra and Minerva. The performance of FLASH-IO (Figure 4.3(c)) also suggests that a significant performance defect exists in the HDF-5 library. On each of these systems, the parallel HDF-5 library, by default, attempts to use data-sieving in order to transform many discontinuous small writes into a single much larger write. In order to do this, a large region (containing the target file locations) is locked and read into memory. The small changes are then made to the block in memory, and the data is then written back out to persistent storage in a single write operation. While this offers a large improvement in performance for small unaligned writes [126], many HPC applications are constructed to perform larger sequential file operations.
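The cycle just described can be sketched as follows. This is a conceptual illustration using plain POSIX calls, not ROMIO's actual implementation: the covering region is locked and read, the small update is patched into the buffer in memory, and the whole region is written back in a single operation.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Conceptual sketch of one data-sieving write cycle (not ROMIO's code).
static void sieved_write(int fd, off_t region_start, size_t region_len,
                         const char *data, off_t data_offset, size_t data_len,
                         char *sieve_buf)
{
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = region_start, .l_len = (off_t)region_len };
    fcntl(fd, F_SETLKW, &lk);                          // lock the region

    pread(fd, sieve_buf, region_len, region_start);    // read existing data
    memcpy(sieve_buf + (data_offset - region_start),   // apply the small
           data, data_len);                            // update in memory
    pwrite(fd, sieve_buf, region_len, region_start);   // write region back

    lk.l_type = F_UNLCK;                               // release the lock
    fcntl(fd, F_SETLK, &lk);
}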

Figure 4.3: User-perceived bandwidth for applications on the three test systems: (a) IOR with MPI-IO; (b) IOR with HDF-5; (c) FLASH-IO; (d) BT problem size C.

When using data-sieving, the use of file locks helps to maintain file coherence. However, as RIOT is able to demonstrate, when writes do not overlap, the locking, reading and unlocking of file regions may create a significant overhead – this is discussed further in Section 4.3.

The results in Figure 4.3(d) show that the BT mini-application achieves by far the greatest performance on all three test systems (note the logarithmic scale). On the BG/P system, its performance at 256 cores is significantly greater than at 64 cores. Due to the architecture of the machine and the relatively small amount of data that each process writes at this scale, the data is flushed very quickly to the I/O node's cache and this gives the illusion that the data has been written to disk at speeds in excess of 1 GB/s. For much larger output sizes the same effect is not seen, since the writes are much larger and therefore cannot be flushed to the cache at the same speed. This is demonstrated in the performance of IOR (Figures 4.3(a) and 4.3(b)) and FLASH-IO (Figure 4.3(c)).

Note that while the I/O performance of Minerva and Sierra plateaus quite early, the I/O performance of the BG/P system does not. A commodity cluster using MPI will often use ROMIO hints such as collective buffering [92] to reduce the contention for the file system; the BG/P performs what could be considered "super" collective buffering, where 32 nodes send all of their I/O traffic through a single aggregator node. In addition to this, BG/P also uses faster disks and a purpose-written MPI-IO file system driver (ad_bgl). The exceptional scaling behaviour observed in Figure 4.3(d) can be attributed to this configuration. As the output size and the number of participating nodes increases, contention begins to affect performance.

Although the configuration of the BlueGene's file system was somewhere between that of Sierra and Minerva, it provided twice the number of file servers as Minerva and therefore striped its data over four servers instead of two. Additionally, the disks were configured such that data was committed first to Fibre Channel connected hard disk drives, before being staged to slower SATA disks. The use of a tiered file system (where the I/O is performed from dedicated nodes to FC-connected burst buffers, before being committed to SATA disks) and MPI-IO features such as collective buffering and data-sieving (which can be done at an I/O node level, rather than on each compute node) enabled the BG/P's GPFS installation to perform far better than the other file systems.

The write performance on each of the commodity clusters is roughly 2–3× the write speed of a single consumer-grade hard disk. Considering that these systems consist of hundreds (or thousands) of disks, configured to read and write in parallel, it is clear that the full potential of the hardware is not being realised with the current configurations. Analysing the effective bandwidth of each of the codes (i.e. the total amount of data written, divided by the total time taken by all nodes) shows that data is being written very slowly to the individual disks when running at scale. The effective MPI and POSIX bandwidth achieved by each of the applications can be seen in Figures 4.4, 4.5, 4.6 and 4.7.

Figure 4.4: Effective POSIX and MPI bandwidth for IOR through MPI-IO.

Figure 4.5: Effective POSIX and MPI bandwidth for IOR through HDF-5.

While one would expect the POSIX bandwidth to slightly exceed the MPI bandwidth (due to a small processing overhead in the MPI library), the degree to which this is true demonstrates a much larger than expected overhead in the MPI library. For IOR, using MPI-IO directly (Figure 4.4), on Minerva, the effective POSIX bandwidth is often more than twice the effective MPI bandwidth, but peaks at only 11.105 MB/s for the single node case. For the much larger Sierra supercomputer, for the single node case the effective MPI and POSIX bandwidths are almost equivalent but again peak at only 4.173 MB/s. Figures 4.5, 4.6 and 4.7 demonstrate a similar trend, showing that the low effective POSIX bandwidth achieved does not nearly approach the potential performance of each storage system.

Figure 4.6: Effective POSIX and MPI bandwidth for FLASH-IO.

Figure 4.7: Effective POSIX and MPI bandwidth for BT Problem C, as measured by RIOT.

On the Lustre system data is striped across two OSTs, where each OST is a RAID-6 caddy consisting of 10 disk drives. As the disks are Serial Attached SCSI (SAS), each individual disk should have a maximum bandwidth of either 150 MB/s or 300 MB/s, giving a maximum potential bandwidth of 1,200 MB/s or 2,400 MB/s¹. While increasing the amount of parallelism in use for computation reduces the time to solution for applications, as the storage resources in use are not similarly scaled, the added contention harms storage performance. On the GPFS systems, a similar effective bandwidth is shown, though the number of storage targets data is striped across is not known, as GPFS stripes dynamically. This poor level of performance may be partially attributed to two problems: (i) disk seek time, and (ii) file system contention. In the former case, since data is being accessed simultaneously from many different nodes and users, the file servers must constantly seek for the information that is required. In the latter case, since reads and writes to a single file must maintain some degree of consistency, contention for a single file can become prohibitive.

¹ The SAS version in use on lscratchc is unknown, and therefore may run at 3.0 Gbit/s or 6.0 Gbit/s.

From the results presented in Figure 4.3 and Appendix B, it is clear that Sierra generally has a much higher performance I/O subsystem than Minerva. However, the BG/P's file system far outperforms both clusters when scaled. The unusual interconnect and architecture that it uses allows its compute nodes to flush their data to the I/O aggregator's cache quickly, allowing computation to continue. Similarly, when the writes are small, Minerva can be shown to outperform Sierra, mainly due to the locality of its I/O backplane. However, when HDF-5 is in use on Minerva, the achievable bandwidth is much lower than that of the other machines due to file-locking and the poor read performance of its hard disk drives.

Ultimately, both Sierra and Minerva exhibit similar performance (as expected when using only two OSTs of lscratchc). However, Sierra's performance does slightly exceed Minerva's in almost all cases due to the use of faster enterprise-class disks and centralised metadata storage, decreasing the amount of processing that each OSS has to perform. The BG/P solution exhibits the greatest performance due to the use of four OSSs, fibre channel connected disks, and dedicated I/O aggregator nodes. As demonstrated in the next section, when the I/O operations required are analysed and well understood, better performance can be achieved on both Lustre and GPFS with minimal effort.

4.3 Middleware Analysis and Optimisation

The experiments with FLASH-IO and IOR, both through HDF-5, demonstrate that a large performance gap exists between using the HDF-5 file format library and performing I/O directly via MPI-IO. While a slight slowdown may be expected, since there is an additional layer of abstraction in the software stack to traverse, the decrease in performance is quite large (up to a 50% slowdown). Figure 4.8 shows the percentage of time spent in each of the four main contributing POSIX functions to MPI_File_write operations.

Figure 4.8: Percentage of time spent in POSIX functions for FLASH-IO on three platforms: (a) Minerva; (b) Sierra; (c) BG/P.

For the Minerva supercomputer, at low core counts, there is a significant overhead associated with file locking (Figure 4.8(a)). In the worst case, on a single node, this represents an approximate 30% decrease in performance. The reason for the use of file locking in HDF-5 is that data-sieving is used by default to write small unaligned blocks in much larger blocks. The penalty for this is that data must be read into memory prior to writing; this behaviour can prove to be a large overhead for many applications, where the writes may perform much better were data-sieving to be disabled. Figure 4.8(c) shows that the BG/P does not perform data-sieving, as evidenced by the lack of read functions. However, due to the use of dedicated I/O nodes, the compute nodes spend approximately 80% of their MPI write time waiting for the I/O nodes to complete.

In contrast to Minerva, the same locking overhead is not experienced by Sierra; however, up to 20% of the MPI write time is spent waiting for other ranks. It is also of note that Minerva's storage subsystem is backed by relatively slow HDDs; Sierra, on the other hand, uses much quicker enterprise-class drives, providing a much smaller seek time, a much greater bandwidth and various other performance advantages (e.g. greater rotational vibration tolerance, larger cache, etc.). As a consequence of this, a single Sierra I/O node can service a read request much more quickly than one of Minerva's, providing an overall greater level of service.

Figure 4.9: Composition of a single, collective MPI write operation on MPI ranks 0 and 1 of a two core run of FLASH-IO, called from the HDF-5 middleware library in its default configuration.

Figure 4.10: Composition of a single, collective MPI write operation on MPI ranks 0 and 1 of a two core run of FLASH-IO, called from the HDF-5 middleware library after data-sieving has been disabled.

Using RIOT's tracing and visualisation capabilities, the execution of a small run of the FLASH-IO benchmark (using a 16 × 16 × 16 grid size and only two cores) can be investigated. Figure 4.9 shows the composition of a single MPI-IO write operation in terms of its POSIX operations. Rank 0 spends the majority of its MPI_File_write time performing read, lock and unlock operations, whereas Rank 1 spends much of its time performing only lock, unlock and write operations. Since Rank 1 writes to the end of the file, increasing the end-of-file pointer, there is no data for it to read in during data-sieving; Rank 0, on the other hand, will always have data to read, as Rank 1 will have increased the file size, effectively creating zeroed data between Rank 0's position and the new end-of-file pointer.

Both ranks splitting one large write into five "lock, read, write, unlock" cycles is indicative of using data-sieving, with the default 512 KB buffer, to write approximately 2.5 MB of data. When performing a write of this size, where all the data is "new", data-sieving may be detrimental to performance. In order to test this hypothesis the MPI_Info_set operations present in the FLASH-IO source code (used to set the MPI-IO hints) can be modified to disable data-sieving. Figure 4.10 shows that, with the modified configuration, the MPI-IO write operation is consumed by a single write operation, and the time taken to perform the write is 40% shorter than that found in Figure 4.9.
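A minimal sketch of how such hints can be set is given below; this is not the FLASH-IO source, and the file name is illustrative. The keys romio_ds_write and romio_cb_write are the ROMIO hint names controlling data-sieving and collective buffering for writes.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Build an info object that disables ROMIO's data-sieving for writes
    // while leaving collective buffering enabled.
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable");
    MPI_Info_set(info, "romio_cb_write", "enable");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "flash_io_test.h5",   // illustrative name
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    // ... collective writes would be issued here ...

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

In HDF-5 based codes such as FLASH-IO, the info object is supplied to the library when the file is created, so the same hints reach ROMIO unchanged.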

Using the problem size benchmarked in Figures 4.3 and 4.6 (24 × 24 × 24), the original experiments were repeated on both Minerva and Sierra using between 1 and 32 compute nodes (12 to 384 cores) in three configurations: firstly, in the original configuration; secondly, with data-sieving disabled; and, finally, with collective buffering enabled and data-sieving disabled. Figure 4.11(a) demonstrates the resulting improvement on Minerva, showing a 2× increase in write bandwidth over the unmodified code. Better performance is observed when using collective buffering. On Sierra (Figure 4.11(b)) there is a similar improvement in performance (approximately a 2× increase in bandwidth). On a single node (12 cores), disabling only data-sieving is slightly faster than also using collective buffering; beyond this, collective buffering increases the bandwidth by between 5% and 20% (numeric data and confidence intervals are shown in Appendix C). Of particular note is the performance at 384 cores, where disabling collective buffering increases performance; however, the increased variance in the results at this scale indicates that this may be a side effect of background machine noise.

Figure 4.11: Perceived bandwidth for the FLASH-IO benchmark in its original configuration (Original), with data-sieving disabled (No DS), and with collective buffering enabled and data-sieving disabled (CB and No DS) on Minerva and Sierra, as measured by RIOT.

This result does not mean that data-sieving will always decrease performance; in the case that data in an output file is being updated (rather than a new output file generated), using data-sieving to make small differential changes may improve performance [26].

4.4 Summary

Parallel I/O operations continue to represent a significant bottleneck in large-scale parallel scientific applications. This is, in part, because of the slower rate of development that parallel storage has witnessed when compared to that of microprocessors. Other causes include limited optimisation at code level and the use of complex file formatting libraries. Contemporary applications can often exhibit poor I/O performance because code developers lack an understanding of how their codes use I/O resources and how best to optimise for this.

In this chapter the design, implementation and application of RIOT has been presented. RIOT is a toolkit with which some of these issues might be addressed. RIOT's ability to intercept, record and analyse information relating to file reads, writes and locking operations has been demonstrated using three standard industry I/O benchmarks. RIOT has been used on two commodity clusters as well as an IBM BG/P supercomputer.

The results generated by the tool illustrate the difference in performance between the relatively small storage subsystem installed on the Minerva cluster and the much larger Sierra I/O backplane. While there is a large difference in the size and complexity of these I/O systems, some of the performance differences originate from the contrasting hardware and file systems that they use and how the applications make use of these. Furthermore, through using the BG/P located at STFC Daresbury Laboratory, it has been shown that exceptional performance can be achieved on small I/O subsystems where dedicated I/O aggregators and tiered storage systems are used as burst buffers, allowing data to be quickly flushed from the compute node to an intermediate node.

RIOT provides the opportunity to:

• Calculate not only the bandwidth perceived by a user, but also the effective bandwidth achieved by the I/O servers. This has highlighted a significant overhead in MPI, showing that the POSIX write operations to the disk account for little over half of the MPI write time. It has also been shown that much of the time taken by MPI is consumed by file locking behaviours and the serialisation of file writes by the I/O servers.

• Demonstrate the significant overhead associated with using the HDF-5 library to store data grids. Through the data extracted by RIOT, it has been shown that on a small number of cores, the time spent acquiring and releasing file locks can consume nearly 30% of the file write time. Furthermore, on small-scale, multi-user I/O systems, reading data into memory before writing, in order to perform data-sieving, can prove very costly.

• Visualise the write behaviour of MPI when data-sieving is in use, showing how large file writes are segmented into many 512 KB lock, read, write, unlock cycles. Through adjusting the MPI hints to disable data-sieving it has been shown that on some platforms, and for some applications, data-sieving may negatively impact performance.

The investigation into the use of RIOT to analyse the behaviour of parallel storage continues in the next chapter, but already its use in identifying optimisation opportunities has been demonstrated. RIOT affords developers an opportunity to understand exactly how configuration options change the I/O behaviour and thus affect performance. By analysing the current performance behaviour of HDF-5 based applications a speed-up of at least 2× can be achieved with a system's "stock" MPI installation, without affecting other applications or services on the system.

The results in this chapter have also highlighted the potential that exists in tiered storage systems, suggesting that these could very well be the answer to affordable, efficient and performant storage systems at exascale.

CHAPTER 5

Analysis and Rapid Deployment of the Parallel Log-Structured File System

As the performance of I/O systems continues to diverge substantially from that of the supercomputers that they support, a number of projects have been initiated to look for software- and hardware-based solutions to address this concern. One such solution is the parallel log-structured file system (PLFS) – which was created at the Los Alamos National Laboratory (LANL) [11] and is now being commercialised by EMC Corporation (EMC²). PLFS makes use of (i) a log-structure, where write operations are performed sequentially to the disk regardless of intended file offsets (keeping the offsets in an index structure instead) [108]; and (ii) file partitioning, where a write to a single file is instead transparently transposed into a write to many files, thus increasing the number of available file streams [135].
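Conceptually, the index that makes the log-structure work maps each logically addressed extent to wherever it was actually appended in a per-process data log. The structure below is an illustrative sketch of such an entry; it is not PLFS's on-disk index format.

#include <stddef.h>
#include <sys/types.h>

// Illustrative index entry for a log-structured layout (conceptual only).
struct index_entry {
    off_t  logical_offset;   // offset the application asked to write at
    off_t  physical_offset;  // offset actually used in the data log
    size_t length;           // length of the extent in bytes
    pid_t  writer;           // process that produced the extent
};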

Currently PLFS can be deployed in one of three ways: (i) through a file system in userspace (FUSE) mount point, requiring installation and access to the FUSE Linux kernel module and its supporting drivers and libraries [42]; (ii) through an MPI-IO file system driver built into the Message Passing Interface (MPI) library [125]; or (iii) through the rewriting of an application to use the PLFS API directly [80]. These methods therefore require either the installation of additional software, recompilation of the MPI application stack (and, subsequently, the application itself) or modification of the application's source code. In HPC centres which have a focus on reliability, or which lack the time and/or expertise to manage the installation and maintenance of PLFS, it may be seen as too onerous to be of use.

In this chapter an analysis of PLFS is performed using RIOT in order to demonstrate why PLFS increases the potential bandwidth available to applications. Due to the implications of installing and maintaining PLFS on a large system, an alternative approach to using PLFS is also presented [143]. This approach will facilitate rapid deployment of PLFS, and therefore allow application developers to accelerate their I/O operations without the burdens associated with PLFS installation. The techniques outlined are applicable to many virtual file systems and allow users to forgo the need to rewrite applications, obtain specific file/system access permissions, or modify the application stack.

5.1 Analysis of PLFS

The primary goal of PLFS is to intercept standard I/O operations and transparently translate them from N processes writing to a single file, to N processes writing to N files. The middleware creates a "view" over the N files, so that the calling application can operate on these files as if they were all concatenated into a single file. The use of multiple files by the PLFS layer helps to significantly improve file write times, as multiple, smaller files can be written simultaneously. Furthermore, improved read times have also been reported when using the same number of processes to read back the file as were used in its creation [103].

Table 5.1 presents the average perceived and effective MPI-IO and POSIX bandwidths achieved by the BT benchmark when running with the PLFS MPI-IO file system driver (ad_plfs) and without it, using the UNIX file system MPI-IO driver (ad_ufs). Note that, as previously, effective bandwidth in this table refers to the bandwidth of the operations as if called serially and hence is much lower than the perceived bandwidth.

As shown throughout Chapter 4, the effective POSIX write bandwidth decreases significantly as the size of application runs is increased. PLFS partially reverses this trend, as the individual POSIX writes are no longer dependent on operations performed by other processes (which are operating on their own files) and can therefore be flushed to the file server's cache much more quickly. The log-structured nature of PLFS also increases the bandwidth, as data can be written in a non-deterministic sequential manner with a log file keeping track of the data ordering. For a class C execution on 256 cores, PLFS increases the perceived bandwidth from 115.191 MB/s up to 3,122.271 MB/s on the Sierra cluster, representing a 27× increase in write performance. This increase is partially attributable to the use of a separate file per MPI rank, meaning that each file stream is writing stripes to two potentially different servers, making use of a larger majority of the I/O subsystem; the effect this may have on other users of the file system is discussed in the next chapter.

                                        ad_ufs                          ad_plfs
                                16        64        256        16        64        256
Minerva
  User Perceived Bandwidth    252.888   233.456   173.696    397.579   440.546   660.373
  Speed-up                                                    1.572x    1.887x    3.802x
  Effective POSIX Bandwidth   218.024    80.091    19.049    196.738   124.340   105.597
  Speed-up                                                    0.902x    1.552x    5.543x
  Effective MPI Bandwidth      15.883     3.651     0.678     25.036     6.899     2.580
  Speed-up                                                    1.576x    1.900x    3.805x
Sierra
  User Perceived Bandwidth    212.486   126.102   115.191    405.495  1505.819  3122.271
  Speed-up                                                    1.908x   11.941x   27.105x
  Effective POSIX Bandwidth   155.754    41.970     7.977    299.084   538.130   437.880
  Speed-up                                                    1.920x   12.822x   54.893x
  Effective MPI Bandwidth      13.346     1.970     0.450     20.806    23.720    12.183
  Speed-up                                                    1.559x   12.041x   27.073x

Table 5.1: Perceived and effective bandwidth (MB/s) for BT class C through MPI-IO and PLFS, as well as the speed-up generated by PLFS.

Smaller gains are seen on Minerva, but due to its inferior I/O hardware and GPFS directory-level locking, this is to be expected. There are fewer I/O servers to service read and write requests on Minerva and as a result there is much less bandwidth available for the compute nodes.

Figure 5.1 demonstrates that during the execution of BT on 256 cores, concurrent POSIX write calls wait much less time for access to the file system. As each process is writing to its own unique file, it has access to a unique file stream, reducing file system contention. For non-PLFS writes a stepping effect is prominent, where all POSIX writes are queued and complete in a serialised, non-deterministic manner. Conversely, on larger I/O installations, PLFS writes do not exhibit this stepping behaviour, and on smaller I/O installations they exhibit this behaviour to a much lesser extent, as the writes are not waiting on other processes to acquire and release file locks.

Figure 5.1: Concurrent write() operations for BT class C on 256 cores on Minerva and Sierra.

5.2 Rapid Deployment of PLFS

To reduce the burden of installing and maintaining a PLFS mount point on large production machines, this thesis details the development of an alternative approach to using PLFS. This rapid deployment solution is called 'LDPLFS' – as it is dynamically linked (using the Linux linker ld) immediately prior to execution, enabling calls to POSIX file operations to be transparently retargeted to PLFS equivalents. Like RIOT, this library requires only a simple environment variable to be exported in order for applications to make use of PLFS – existing compiled binaries, middleware and submission scripts require no modification [143].

This section describes the design and implementation of LDPLFS, showing how a dynamically loadable library can be used to retarget POSIX file operations to PLFS-specific file operations. Its performance is assessed on a collection of standard UNIX tools, as well as on three parallel applications running at scale on the Minerva and Sierra supercomputers. The performance at scale not only demonstrates the applicability of this technique for using virtual parallel file systems, but also demonstrates one of the shortcomings of PLFS.

int open(const char *filename, int flags, mode_t mode);
int plfs_open(Plfs_fd *fd, const char *filename, int flags, pid_t pid,
              mode_t mode, Plfs_open_opt *open_opt);

ssize_t write(int fd, const void *buf, size_t count);
ssize_t plfs_write(Plfs_fd *plfsfd, const void *buf, size_t count,
                   off_t offset, pid_t pid);

ssize_t read(int fd, void *buf, size_t count);
ssize_t plfs_read(Plfs_fd *plfsfd, void *buf, size_t count, off_t offset);

Listing 5.1: Open, read and write functions from the POSIX and PLFS APIs.

ssize_t read(int fd, void *buf, size_t count)
{
    ssize_t ret;

    // check if fd is a plfs file or a normal file
    if (plfs_files.find(fd) != plfs_files.end()) {
        // if the file is a plfs file, find its current virtual offset
        off_t offset = lseek(fd, 0, SEEK_CUR);
        // perform plfs read function
        ret = plfs_read(plfs_files.find(fd)->second->fd,
                        (char *) buf, count, offset);
        // update the virtual offset
        lseek(fd, ret, SEEK_CUR);
    } else {
        // perform a standard read on a normal file
        ret = __real_read(fd, buf, count);
    }
    return ret;
}

Listing 5.2: Source code demonstrating POSIX-PLFS translation in LDPLFS.

LDPLFS is a dynamic library specifically designed to interpose POSIX file functions and retarget them to PLFS equivalents. By using the Linux loader, LDPLFS overloads many of the POSIX file symbols (e.g. open, read, write), causing an augmented implementation to be executed at runtime¹. This allows existing binaries and application stacks to be used without the need for recompilation.

¹ Although LDPLFS makes use of the LD_PRELOAD environment variable in order to be dynamically loaded, other libraries can also make use of the dynamic loader (by appending multiple libraries to the environment variable), allowing tracing tools to be used alongside LDPLFS.


Figure 5.2: The control flow of LDPLFS in an application's execution.

Due to the difference in semantics between the POSIX and PLFS APIs, LDPLFS must perform two essential book-keeping tasks. Firstly, LDPLFS must return a valid POSIX file descriptor to the application, despite PLFS using an alternative structure to store file properties. Secondly, as the PLFS API requires an explicit offset to be provided, LDPLFS must maintain a file pointer for each PLFS file. Listing 5.1 shows three POSIX functions and their PLFS equivalents. Listing 5.2 and the listings in Appendix D show how these POSIX functions can be transparently transformed to make use of the PLFS alternatives.

When a file is opened from within a pre-defined PLFS mount point, a PLFS file descriptor (Plfs_fd) pointer is created and the file is opened with the plfs_open() function (using default settings for Plfs_open_opt and the value of getpid() for pid_t). In order to return a valid POSIX file descriptor (fd) to the application, a temporary file (in our case a temporary file created by tmpfile()) is also opened. The file descriptor of the temporary file is then stored in a look-up table and related to the Plfs_fd pointer. Future POSIX operations on a particular fd will then either be transparently passed onto the POSIX API, or, if a look-up entry exists, passed to the PLFS library.

In order to provide the correct file offset to the PLFS functions, a file pointer is maintained through lseek() operations on the temporary POSIX file descriptor. As demonstrated in Listing 5.2, when a POSIX operation is to be performed on a PLFS container, the current offset of the temporary file is established (through a call to lseek(fd, 0, SEEK_CUR)), a PLFS operation is performed (again using getpid() where needed), and then finally, the temporary file pointer is updated (once again through the use of lseek()). Figure 5.2 shows the control flow of an application when using LDPLFS.

5.2.1 Performance Analysis

Feasibility Study

The initial assessment of LDPLFS was conducted on Minerva. The MPI-IO Test application from LANL was used to write a total of 1 GB per process in 8 MB blocks [95]. Collective blocking MPI-IO operations were employed with tests using PLFS through the FUSE kernel library, the ad_plfs MPI-IO driver and LDPLFS. In all cases the OpenMPI library used was version 1.4.3 with PLFS version 2.0.1. The performance results were then compared to the achieved bandwidth figures from the default ad_ufs MPI-IO driver without PLFS.

Tests were conducted on between 1 and 64 compute nodes using 1, 2 and 4 cores per node²,³. Each run was conducted with collective buffering enabled and in the default MPI-IO configuration⁴ in order to provide better performance with minimal configuration changes. The node-wise performance should remain largely consistent while the number of cores per node is varied – in each case there remains only one process on each node performing the file system write. As the number of cores per node is increased, an overhead is incurred because of the presence of on-node communication and synchronisation.

² Due to machine usage limits, using all 12 cores per node would limit the results to a maximum of 16 compute nodes, decreasing the scalability of the results.
³ In some cases, other jobs were present on the compute nodes in use. Full numeric data along with the 95% confidence intervals are given in Appendix E.
⁴ The default collective buffering behaviour is to allocate a single aggregator per distinct compute node.

Figure 5.3: Benchmarked MPI-IO bandwidths on FUSE, the ad_plfs driver, LDPLFS and the standard ad_ufs driver (without PLFS): write and read bandwidths at 1, 2 and 4 processes per node.

Figure 5.3 demonstrates promising results, showing that the performance of LDPLFS closely follows the performance of PLFS through ROMIO and is significantly better than FUSE (up to 2×) in almost all cases. It is interesting to note that on occasion, LDPLFS performs better than the ad_plfs MPI-IO driver; however, as can be seen from the confidence intervals, this is largely an artefact of machine noise (numerical data can be found in Appendix E). On Minerva, the performance of FUSE is worse than standard MPI-IO by 20% on average for parallel writes. FUSE is known to degrade performance, due to additional memory copies and extra context switches [70], and while this overhead is addressed by Bent et al. [11], the I/O set-up used in that study is much larger than that used by Minerva, and makes use of custom optimised hardware. It is therefore likely that the much greater performance increase generated by PLFS masks this overhead much better than the I/O hardware of Minerva.

                   PLFS Container               Standard UNIX File
cp (read)          100.713 (97.182, 103.949)    114.279 (110.878, 119.906)
cp (write)         107.587 (106.473, 110.473)
cat                 25.186 (23.413, 26.155)      25.433 (22.548, 28.469)
grep               130.662 (127.643, 131.396)   128.863 (126.445, 129.093)
md5sum              26.970 (26.018, 27.035)      26.781 (25.612, 28.511)

Table 5.2: Time in seconds for UNIX commands to complete using PLFS through LDPLFS, and without PLFS.

Standard UNIX Tool Performance

One of the current difficulties associated with the practical use of PLFS is the complexity of managing PLFS containers. Since FUSE treats a PLFS mount point as a self-contained file system, using the files in any application is trivial. However, when using either of the alternative solutions for PLFS, applications must either use MPI or be rewritten for PLFS. Files created under a PLFS mount point appear inside the "backend" directory as directories themselves with hundreds of files (see Figure 3.9 in Chapter 3). Visualising data or post-processing the information output becomes difficult in this scenario; this is one of the problems LDPLFS additionally addresses. As LDPLFS operates at the POSIX call level, it can be used with any standard UNIX tools as well as parallel science and engineering applications.

Table 5.2 presents the performance of several standard UNIX tools operating on a 4 GB PLFS file container. Note that the file copy (cp) times correspond to copying a file from a PLFS container to a standard UNIX file and vice versa. These can be compared to a single time for copying from and to a standard UNIX file.

Since each of these commands is a serial application, each command was executed on the login node of Minerva. It is promising to see that the time for each of the commands to complete is largely the same for both standard UNIX files and PLFS container structures. These results show that PLFS is marginally faster when copying to or from a PLFS file than a normal UNIX file. This improvement in performance may be attributed to the increased number of file streams available, improving the bandwidth achievable from the file servers.

The results presented above position LDPLFS as a viable solution to improving the performance of I/O in parallel, as well as showing that there is no substantial performance hit when using LDPLFS to interact with PLFS mount points using serial (non-MPI) applications. Using LDPLFS, it is now much easier to assess the performance of PLFS on a variety of systems without the overhead associated with compiling a customised MPI library, or requesting access to the FUSE kernel driver. In the next section, the performance of LDPLFS at much larger scales is demonstrated using a small set of I/O intensive mini-applications.

Parallel Application Performance

Figure 5.3 shows that on Minerva, PLFS improves performance by approximately 2× for parallel applications. Because of the relatively small I/O set-up employed by Minerva, achieving performance increases such as those reported by Bent et al. [11] – where a high-end PanFS I/O solution is used – is most likely not possible. In order to better demonstrate how PLFS and LDPLFS perform on a much more substantial I/O set-up, two applications have been used to benchmark the lscratchc file system attached to Sierra (see Table 3.3 in Chapter 3).

For this study the previously introduced BT solver and FLASH-IO have been used. For BT, problem class C (162 × 162 × 162) has been used, writing a total of 6.4 GB of data during an execution, as well as the D problem class (408 × 408 × 408), writing a total of 136 GB of data. The application is strong scaled – as the number of cores is increased, the global problem size remains the same, with each process operating over a smaller sub-problem. For the C problem class, the global problem size is relatively small, and can only be scaled to 1,024 cores before the local problem size becomes too small to operate on correctly. Conversely, for the D problem class, the global problem size is so large that on fewer than 64 cores the execution time becomes prohibitive. For this reason this study employs between 4 and 1,024 cores for the C problem class, and between 64 and 4,096 cores for the D problem class.

Figure 5.4: BT benchmarked MPI-IO bandwidths using MPI-IO, as well as PLFS through ROMIO and LDPLFS: (a) problem class C; (b) problem class D.

Figures 5.4 and 5.5 show the achieved bandwidth for the two mini-applications in their default configurations using the system's pre-installed OpenMPI version 1.3.4 library (ad_ufs); with the system's OpenMPI library and LDPLFS; and finally with the MPI-IO PLFS file system driver (ad_plfs) compiled into a customised build of OpenMPI (version 1.4.3). The performance of PLFS through the two methods is largely the same, with a slight divergence at scale – as previously, full numerical data can be found in Appendix E.

Since LDPLFS retargets POSIX file operations transparently and uses various structures in memory to maintain file consistency, a change in the local problem size may affect the LDPLFS performance due to the memory access patterns changing and additional context switching. Furthermore, write caching can produce a large difference in performance – where data is small enough to fit in cache, the writing of that data to disk can be delayed.

Write caching is most prevalent in the BT application where, at large scale, small amounts of data are being written by each process during each parallel write step. For the C problem class (Figure 5.4(a)), 6.4 GB of data is written in 20 separate MPI write calls, causing approximately 300 KB of data to be written by each process at each step. When writing to a single file, the file server must make sure that writes are completed before allowing other processes to write to the file. This causes each write command to wait on all other processes, leading to relatively poor performance. Conversely, through PLFS, each process writes to its own file, thereby allowing the write to be cleared to cache almost instantaneously.

In Figure 5.4(b), the performance peaks at nearly 3,000 MB/s (point A) due to the increased parallelism exposed by PLFS. The performance then rapidly decreases at 1,024 cores (point B), where each process is writing approximately 136 MB, in 20 steps. It is likely that these writes (of approximately 7 MB each) are marginally too large for the system's cache and therefore must be written to disk. This potentially creates a large amount of contention on the file system, causing performance that is equivalent to vanilla MPI-IO. However, when using 4,096 cores, each write is less than 2 MB per process, writing only 34 MB per process during the execution (point C). This causes the write-caching effects seen in Figure 5.4(a) to reappear.
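As a quick check of the per-write sizes quoted above (taking 136 GB as $136 \times 1024$ MB and 20 write steps):

\[ \frac{136 \times 1024\ \mathrm{MB}}{1024 \times 20} \approx 6.8\ \mathrm{MB\ per\ write\ at\ 1{,}024\ cores}, \qquad \frac{136 \times 1024\ \mathrm{MB}}{4096 \times 20} \approx 1.7\ \mathrm{MB\ per\ write\ at\ 4{,}096\ cores}. \]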

FLASH-IO is a weak scaled problem, and for these experiments the local problem size was set to 24 × 24 × 24. This causes each process to write approximately 205 MB to the disk, through the HDF-5 library [73]. Runs were conducted on between 1 node and 256 nodes, using all 12 cores each time, thus using up to 3,072 cores. Note that as the number of compute nodes is increased, so too is the output file size. Since each process was writing the same total amount of data, over the same number of time steps, caching effects were less prevalent for these weak scaled problems.

Interestingly, Figure 5.5 shows that as the core count is increased on FLASH-IO, the write speed of MPI-IO gently increases up to approximately 550 MB/s. However, when using PLFS a sharp increase in write speed is demonstrated up to 192 cores (or 16 nodes), at which point the average write speed reaches approximately 1,650 MB/s, before decreasing to 210 MB/s at 3,072 cores.

Figure 5.5: FLASH-IO benchmarked MPI-IO bandwidths using MPI-IO, as well as PLFS through ROMIO and LDPLFS.

This decrease in performance may be explained by contention on the MDS; the Lustre file system uses a dedicated MDS and therefore, as the number of cores is increased, the performance may plateau and then decrease due to the MDS becoming a bottleneck in the system. Since PLFS operates using multiple files per core (at least one for the data and one for the index), it uses many more files as the problem is scaled, potentially putting a large load on the MDS. This bottleneck would be less evident in the BT mini-application due to the small write sizes. However, without direct access to the MDS this theory is difficult to explore fully.

Another explanation for the decrease in performance at scale is that, as many files are created and each single file is striped across two servers (in the default case), at larger core counts data is split into a large number of stripes (2 × the number of files). Creating more stripes than there are OSTs may introduce large overheads on the file servers and may increase contention (as many of the files will be striped across shared OSTs). This is investigated further in Chapter 6.


5.3 Summary

File I/O operations have, in many cases, been one of the last aspects considered during application optimisation. In this chapter the performance gains of PLFS have been partially explored and LDPLFS, which offers the opportunity to accelerate file read and write activities without modification to the machine's environment or an application's source code, has been presented.

Specifically, it has been shown that PLFS creates many more file streams and may therefore avoid file system contention by removing the need for file locks. However, at large scale, PLFS may actually be detrimental to performance by creating more file system stripes than there are servers.

In order to address the issue of installation and maintenance of PLFS, the development of a dynamically loadable library has also been presented that allows users to assess the benefits of PLFS without the installation burden.

The performance of LDPLFS has been compared to PLFS using the FUSE Linux kernel module, PLFS using the MPI-IO file system driver and the original MPI-IO library without PLFS. In this comparison LDPLFS was able to offer approximately equivalent performance to using PLFS through the MPI-IO file system driver and improved performance over FUSE.

LDPLFS is a solution which requires only the PLFS library and itself to be built, with no system administrator actions, thus forgoing the need to install FUSE or a custom MPI library. The library is loadable from only a single environment variable, yet is potentially able to offer a significant improvement in parallel I/O activity. The work presented in this chapter may be useful to several industry partners, as such a solution helps to address concerns which may arise over the security model of FUSE and the significant investment associated with the recompilation of applications using a custom MPI-IO middleware. LDPLFS therefore straddles the gap between offering improved application performance and the effort associated with the installation of traditional PLFS.

CHAPTER 6

Parallel File System Performance Under Contention

The optimisation of I/O performance has largely been the responsibility of application developers, configuring their own software to achieve the best performance – a responsibility that has often been ignored. Software solutions to achieving better performance, such as custom-built MPI-IO drivers that target specific file systems, are often not installed by default. For instance, on the systems installed at the Lawrence Livermore National Laboratory (LLNL), an optimised Lustre-specific driver is not installed despite the software being available as standard in most MPI libraries.

The experiments described so far in this thesis have been performed using the system “stock” MPI installations, or – in the case of Chapter 5 – a version of MPI compiled with PLFS built in. In this chapter, an MPI library with an optimised Lustre driver is built and utilised on the Cab machine (see Section 3.4 for details) at LLNL. Research by Behzad et al. [9, 10], Lind [76] and You et al. [148] suggests that performance can be improved by as much as two orders of magnitude with the correct settings and the Lustre-optimised driver.

In this chapter a parameter sweep is used to search for an optimised configuration for a small test with IOR. Through this, a performance improvement of 49× is shown. However, while performance is vastly improved by adjusting the Lustre settings, optimal performance for one application can reduce the quality of service (QoS) provided to other users on a shared system such as Cab. The effect of OST contention on I/O intensive workloads is demonstrated, showing that using optimised settings for four competing tasks may result in a 3–4× reduction in performance for each application. Reducing the number of requested resources increases the system's QoS, at minimal cost to the overall performance of the four tasks.

    Option           Value
    API              MPI-IO
    Write file       On
    Read file        Off
    Block Size       4 MB
    Transfer Size    1 MB
    Segment Count    100

Table 6.1: IOR configuration options for experiments.

Finally, the optimised Lustre configuration is compared to the performance of PLFS on the same file system. Below 512 processes, PLFS boosts performance over that which can be achieved with an optimal Lustre configuration; however, at scale, PLFS becomes detrimental to performance, just as was the case in the previous chapter.

6.1 Effective Use of Uncontended Parallel File Systems

As demonstrated in Chapters 4 and 5, the large parallel file systems connected to some of the most powerful supercomputers in the world are currently being under-utilised – partially due to a lack of available software drivers, partially due to lack of optimisation in the applications themselves. The work presented by Behzad et al. [10] shows how using the Lustre-specific MPI-IO driver (ad_lustre) distributed with most MPI implementations can lead to performance improvements of up to 100× over the default installation. The authors use a genetic algorithm to search the parameter space for an optimised configuration [10], varying the stripe factor (the number of OSTs to use), the stripe size, the number of collective buffering nodes and the collective buffer size (as well as some HDF-5 specific options).
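These parameters are passed to the MPI-IO layer as hints through an MPI_Info object when a file is opened. The sketch below shows one way this might be done in C; striping_factor, striping_unit, cb_nodes and cb_buffer_size are standard ROMIO hint names, but the values shown are purely illustrative and whether a given hint is honoured depends on the file system driver in use.

```c
/* Sketch: passing Lustre striping and collective buffering hints to MPI-IO.
 * Values are illustrative; striping hints only take effect at file creation. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "160");       /* stripe count (OSTs)        */
    MPI_Info_set(info, "striping_unit", "134217728");   /* stripe size: 128 MB        */
    MPI_Info_set(info, "cb_nodes", "64");                /* collective buffering nodes */
    MPI_Info_set(info, "cb_buffer_size", "16777216");    /* collective buffer: 16 MB   */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, &fh);

    /* ... collective writes (e.g. MPI_File_write_all) would be issued here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```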

In this thesis a small IOR problem is configured in order to demonstrate the benefits and consequences of this form of performance tuning. IOR was configured such that each process wrote 100 4 MB blocks to a file in chunks of 1 MB. The parameters were chosen such that each write matched the default stripe size, providing the best possible performance in Lustre's default configuration. In order to find an optimised Lustre configuration, a parameter sweep was performed on 64 nodes (64 × 16 = 1,024 cores). The collective buffer size was set to the default value (16 MB) and each node contributed one collective buffering process, meaning there was a total of 64 buffering processes. To reduce the search space, a linear search was conducted with a stripe count between 1 and 160 (as there is a 160 OST limit in Lustre version 2.4.0, which is used on the OCF machines) and a stripe size between 1 and 256 MB.

Figure 6.1: Write bandwidth achieved over 1,024 cores by varying just the stripe count (a) and just the stripe size (b).

Selected results of this parameter search are shown in Figures 6.1 and 6.2 (additional numeric data can be found in Appendix F). Using the default Lustre configuration (stripe count = 2, stripe size = 1 MB), the application achieves an average of 313 MB/s. Through varying the stripe size, performance can be increased from this baseline bandwidth up to 395 MB/s, and through varying the stripe count performance can be increased much further, up to a maximum of 4,075 MB/s.

Figure 6.2 shows that through varying both parameters the maximum bandwidth is found when using 160 stripes of size 128 MB; performance increases from the baseline 313 MB/s up to 15,609 MB/s, representing a 49× improvement in write performance. This result largely echoes previous work, where the greatest performance is usually found by striping across the maximum number of OSTs and writing stripes that are a multiple of the application's I/O block size [10, 76, 148].

Figure 6.2: Write bandwidth achieved over 1,024 cores by varying both the stripe count and the stripe size.

That the optimal performance is found when exploiting the maximum amount of available parallelism may seem obvious, but on many systems, it is simply not possible to achieve this without a rebuilt software stack. Moreover, there are a finite number of available file servers and storage targets; exploiting a larger proportion of these may be optimal on a quiet system, but when many tasks require I/O simultaneously, the performance may decrease due to OST contention.

6.2 Quantifying the Performance of Contended File Systems

On a multi-user system, with limited resources, using a large percentage of the OSTs available may be detrimental to the rest of the system. The lscratchc file system used in this chapter exposes 480 OSTs to the user1. The assignment of OSTs to files is performed at file creation time, with targets assigned at random (based on current usage, to maintain an approximately even capacity). This suggests that three jobs using 160 OSTs each would fully occupy the file system, if the assignment guaranteed no overlaps. However, as OSTs are assigned randomly, for two jobs (of which the first uses 160/480 of the available OSTs), approximately one third of the OSTs assigned to the second job will be on OSTs in use by the first job.

1 The lscratchc file system was upgraded from 360 OSTs to 480 OSTs between the experiments in Chapters 4 and 5, and the experiments in this chapter.

\[ D_{\mathrm{inuse}}(n) = D_{\mathrm{inuse}}(n-1) + r_j - r_j \left( \frac{D_{\mathrm{inuse}}(n-1)}{D_{\mathrm{total}}} \right) \tag{6.1} \]

If each job (j ∈ {1, ..., n}) requests r_j OSTs, the total number of OSTs in use (Dinuse) after each job starts is described by Equation 6.1, where Dinuse(0) = 0. Each time a new job starts, the number of OSTs in use increases by the size of the request, minus the average number of OST collisions that occur. If each job is requesting the same number of resources (R) – which may be the case if a parameter sweep has determined that the optimal configuration is when the maximum number of OSTs are used – then the number of OSTs in use can be simplified to:

\[ D_{\mathrm{inuse}} = D_{\mathrm{total}} - D_{\mathrm{total}} \left( 1 - \frac{R}{D_{\mathrm{total}}} \right)^{n} \tag{6.2} \]

With these two equations the average load of each OST (Dload) can be calculated, for any particular workload, by taking the number of stripes requested in total, and dividing it by the number of OSTs in use. A load of 1 would imply that each OST is only in use by a single job, whereas a higher number would indicate that there are a number of collisions on some OSTs, potentially resulting in a job switching overhead.

\[ D_{\mathrm{req}} = R \times n \tag{6.3} \]

\[ D_{\mathrm{load}} = \frac{D_{\mathrm{req}}}{D_{\mathrm{inuse}}} \tag{6.4} \]
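As an illustration, Equations 6.2–6.4 can be transcribed directly into a few lines of C to reproduce the figures used in the remainder of this section (for example, the R = 160 columns of Table 6.2). This snippet is purely a worked example of the equations above, not part of any tool developed in this thesis.

```c
/* Worked example of Equations 6.2-6.4: expected number of OSTs in use, and
 * their average load, when n jobs each request R stripes at random from a
 * pool of D_total OSTs. Prints the Dtotal = 480, R = 160 column of Table 6.2. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double d_total = 480.0;   /* OSTs exposed by lscratchc */
    const double R = 160.0;         /* stripes requested per job */

    for (int n = 1; n <= 10; n++) {
        double d_inuse = d_total - d_total * pow(1.0 - R / d_total, n);  /* Eq 6.2 */
        double d_req   = R * n;                                          /* Eq 6.3 */
        double d_load  = d_req / d_inuse;                                /* Eq 6.4 */
        printf("%2d jobs: Dinuse = %7.3f  Dreq = %4.0f  Dload = %5.3f\n",
               n, d_inuse, d_req, d_load);
    }
    return 0;
}
```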


            Dtotal = 480, R = 160               Dtotal = 480, R = 64
    Jobs   Dinuse     Dreq   Dload       Jobs   Dinuse     Dreq   Dload
       1   160.000     160   1.000          1    64.000      64   1.000
       2   266.667     320   1.200          2   119.467     128   1.071
       3   337.778     480   1.421          3   167.538     192   1.146
       4   385.185     640   1.662          4   209.199     256   1.224
       5   416.790     800   1.919          5   245.306     320   1.304
       6   437.860     960   2.192          6   276.599     384   1.388
       7   451.907    1120   2.478          7   303.719     448   1.475
       8   461.271    1280   2.775          8   327.223     512   1.565
       9   467.514    1440   3.080          9   347.593     576   1.657
      10   471.676    1600   3.392         10   365.247     640   1.752

Table 6.2: The average number of OSTs in use and their average load based on the number of concurrent I/O intensive jobs.

Table 6.2 demonstrates this for the lscratchc file system where each job is requesting the previously discovered optimal number of stripes, 160, and when that request is reduced to 64. With 10 simultaneous I/O intensive jobs, an average of 4 collisions will occur on each OST, though a small subset of OSTs may well incur all 10 potential collisions (and some may incur none), reducing the performance of the file system for every job. By reducing the size of the stripe requests, the load is reduced, possibly avoiding many of the bottlenecks associated with OST contention.

In order to ascertain how the OSTs in the lscratchc file system behave under contention, a study was undertaken using a custom-written benchmark that creates a split communicator, allowing each process to read and write to its own file in a single MPI application. The benchmark then opens a number of files, with the same Lustre configuration (a single 1 MB stripe). Using the stripe offset MPI hint, the OST to use is specified such that every rank writes to a file striped on the same target. Figure 6.3 shows the per-process bandwidth achieved with a varying number of contended file writes.
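A simplified reconstruction of the core of such a benchmark is sketched below, under stated assumptions: per-rank file names, a one-stripe layout requested through the standard striping_factor and striping_unit hints, and no timing or validation code. The driver-specific hint used in the study to pin every file to the same OST is omitted, as its name depends on the MPI-IO driver in use.

```c
/* Simplified sketch of a per-process contention benchmark: the world
 * communicator is split so that every rank writes a single-stripe file of
 * its own. Timing, repetition and the OST-placement hint are omitted. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One communicator (and hence one file) per rank. */
    MPI_Comm self;
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &self);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "1");      /* a single stripe... */
    MPI_Info_set(info, "striping_unit", "1048576");  /* ...of 1 MB         */

    char fname[64];
    snprintf(fname, sizeof(fname), "contend.%d.dat", rank);

    static char buf[1048576];
    memset(buf, rank & 0xff, sizeof(buf));

    MPI_File fh;
    MPI_File_open(self, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_write(fh, buf, (int) sizeof(buf), MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Comm_free(&self);
    MPI_Finalize();
    return 0;
}
```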

Figure 6.3 shows the average per-task performance for concurrent tasks writing to the same OST simultaneously. The shaded area represents the estimated performance as calculated from the single job experiment's 95% confidence intervals and scaled linearly; as lscratchc is already a shared-user file system, there is some variance in performance with no forced contention. The graph shows that, as the number of jobs is increased, the performance diverges from the top of the ideal scaling line, illustrating the overhead associated with task serialisation.

Figure 6.3: The performance per-task of the lscratchc file system under contention, with the ideal upper and lower bounds.

To investigate this more thoroughly, a job was submitted to Cab that created four identical IOR executions, each running simultaneously with the configuration stated in Table 6.1. Each task used 64 nodes (1,024 processes) and thus the total job consumed 4,096 cores, and the MPI hints were specified according to the previously discovered optimal values (Figure 6.2). As can be seen in Figure 6.4, each individual application achieved approximately 4,500 MB/s – a 3.44× reduction from the peak value (15,609 MB/s) seen in Figure 6.2.

Using the mean of five experiments, Table 6.3 and Figure 6.5 demonstrate how reducing the number of stripes per job increases the OST availability to the rest of the system while having a minimal effect on performance. Using as few as 32 stripes per file, the average bandwidth achieved by each of the four applications is 3,500 MB/s but, by Equation 6.2, only 115 OSTs will be in use in the average case, providing an average OST load of ≈1.1. Furthermore, Table 6.3 shows that when using a stripe count of 160, there are 42 OSTs that are being contended by 3 of the 4 jobs and there are 7 OSTs being contended by all 4. By reducing the demand to 64 stripes, the performance is reduced by ≈14% while the number of OSTs in use is reduced by ≈37%, leaving more resources available for a larger number of tasks, while also reducing the number of collisions significantly.

Figure 6.4: Performance of each of 4 tasks over 5 repetitions where all tasks are contending the file system.

          Average     Total                OST Usage (tasks per OST)        Predicted          Actual
      R   BW          BW          Dreq      1       2       3      4      Dinuse   Dload    Dinuse   Dload
          (MB/s)      (MB/s)
     32   3654.061    14616.244    128    103.2    11.2    0.8    0.0    115.759   1.106   115.200   1.111
     64   3910.507    15642.026    256    172.6    35.8    3.4    0.4    209.199   1.224   212.200   1.207
     96   4042.980    16171.918    384    199.4    76.4    9.8    0.6    283.392   1.355   286.200   1.342
    128   4172.166    16688.662    512    211.6   111.4   22.4    2.6    341.182   1.501   348.000   1.472
    160   4541.366    18165.462    640    191.8   147.0   41.8    7.2    385.185   1.662   387.800   1.650

Table 6.3: Average and total bandwidth achieved across four tasks for a varying stripe count request, along with the average number of OSTs in use by exactly 1, 2, 3 and 4 of the tasks respectively.

Although the optimal performance on lscratchc, with four competing tasks, is still found using the maximum number of OSTs allowed, the bandwidth achieved is almost a quarter of the previously achieved maximum. On file systems where there are fewer OSTs (such as those used by Behzad et al. [10]), any job contention will decrease the achievable performance and may be detrimental to the rest of the system. To demonstrate this further, the equations presented in this thesis have been applied to the configuration of the Stampede supercomputer described in [10]. Table 6.4 shows the predicted OST load for Stampede's file system using the optimal stripe count found by Behzad et al. for the VPIC-IO application (128 stripes on a file system with 58 OSSs and 160 OSTs). Table 6.4 demonstrates that with only three equivalent simultaneous tasks with a similar I/O demand, the OST load figure suggests that the majority of the OSTs are being used by as many as two or three simultaneous tasks.

Figure 6.5: Graphical representation of the data in Table 6.3, showing optimal performance at 160 stripes per file, but very minor performance degradation at just 32 stripes per file.

    Dtotal = 160, R = 128

    Jobs    Dinuse    Dreq    Dload
       1   128.000     128    1.000
       2   153.600     256    1.667
       3   158.720     384    2.419
       4   159.744     512    3.205
       5   159.949     640    4.001
       6   159.990     768    4.800
       7   159.998     896    5.600
       8   160.000    1024    6.400
       9   160.000    1152    7.200
      10   160.000    1280    8.000

Table 6.4: OST usage and average load for the Stampede I/O setup described by Behzad et al. [10].

6.3 Performance Comparison: Lustre vs. PLFS

In the previous chapter, PLFS was shown to produce a noticeable performance increase on LLNL systems under certain conditions. However, this thesis has already demonstrated that the performance gap is reduced when using the optimised MPI-IO driver. Furthermore, the results in the previous chapter (Figure 5.5) show that at scale, PLFS performs worse than even the unoptimised UNIX file system (UFS) MPI-IO file system driver (ad_ufs).

Figure 6.6: Write bandwidth achieved for IOR through an optimised Lustre configuration and through the PLFS MPI-IO driver.

Figure 6.6 shows the performance of the lscratchc file system when running the IOR problem described in Table 6.1 on Cab. PLFS operates by creating a separate data and index file for each rank, in directories controlled by a hashing function; this increases the number of file streams available and consequently increases the number of Lustre OSTs in use. As the files are written by PLFS through POSIX file system calls, each file is created with the system default configuration of two 1 MB stripes per file (unless otherwise specified using the lfs control program).

It should be noted that as PLFS creates a large number of files, with randomly placed stripes, there is a larger variance in PLFS performance. An execution running with 256 processes will create 256 data files, requiring 512 stripes.

Experimentally, this produces an average OST load of 1.58 and a bandwidth between 3,329.9 MB/s and 11,539.4 MB/s (average 7,126.9 MB/s). Conversely, through the Lustre driver, the variance is much lower as at most 160 stripes will be created with no collisions between OSTs. Due to background machine noise it is difficult to know what the load is on each OST at any given time, but generally PLFS performs better when the number of OSTs experiencing a high number of collisions is minimised.


Figure 6.7: The number of OST collisions for IOR running through PLFS with 512 cores: (a) Execution 1; (b) Execution 2.

Figure 6.7 shows the number of OST collisions in the PLFS backend directory for the 512 cores case, for two of the five experiments. At 512 cores, the performance of PLFS reaches its peak before Lustre begins to provide better performance. Equation 6.2 can be amended to work for PLFS by treating each rank as a separate task with 2 stripes (i.e. R = 2) and setting the number of tasks (n) to the number of ranks in use.

\[ D_{\mathrm{inuse}} = D_{\mathrm{total}} - D_{\mathrm{total}} \left( 1 - \frac{2}{D_{\mathrm{total}}} \right)^{n} \tag{6.5} \]

\[ D_{\mathrm{load}} = \frac{2 \times n}{D_{\mathrm{inuse}}} \tag{6.6} \]
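As a worked example supporting the figures quoted in the next paragraph, substituting n = 512 ranks and Dtotal = 480 OSTs into Equations 6.5 and 6.6 gives:

\[ D_{\mathrm{inuse}} = 480 - 480 \left( 1 - \frac{2}{480} \right)^{512} \approx 423.4, \qquad D_{\mathrm{load}} = \frac{2 \times 512}{423.4} \approx 2.4 \]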

With this in mind, it becomes inevitable that on a reasonable Lustre file system, PLFS provides a small-scale fix but overwhelms the file system at higher core counts. Using Equations 6.5 and 6.6, at 512 cores on the lscratchc file system there is an average of 2.4 tasks using each OST; by 688 cores, there are 3 tasks per OST (which is shown in Figure 6.3 to still provide "good" performance); at 2,048 and 4,096 cores, the number of collisions reaches 8.53 and 17.06 respectively, which begins to saturate the file system and decreases performance not just for the host application, but for all other users of the file system too. Table 6.5 and Appendix G show the numeric collision statistics for other application configurations. Table 6.5 specifically shows that at 4,096 cores, most OSTs are being used for the stripe data of between 10 and 23 files and, in one of the five repeated experiments, a single OST is being used by 35 competing ranks, placing a large overhead on its potential performance.

                          Experiment
    Collisions      1       2       3       4       5
         0          0       0       0       0       0
         1          0       0       0       0       0
         2          1       0       0       0       0
         3          0       0       1       0       0
         4          0       0       0       0       0
         5          0       1       1       1       0
         6          1       2       2       2       4
         7          2       4       2      10       2
         8          8       7       5       3       7
         9          9      10      13      16      15
        10         15      13      21      18      18
        11         26      18      30      21      25
        12         33      38      34      37      29
        13         48      46      36      33      37
        14         45      48      38      40      48
        15         28      33      45      51      46
        16         51      49      32      44      46
        17         42      42      46      41      36
        18         30      35      34      29      33
        19         44      46      39      34      29
        20         28      21      27      20      25
        21         24      18      21      22      26
        22         17      14      14      12      17
        23         10       9      14      11      10
        24          6      12       4       8       9
        25          1       3       8      11       7
        26          5       5       8       9       5
        27          4       5       4       3       2
        28          0       0       1       1       3
        29          2       1       0       2       0
        30          0       0       0       0       0
        31          0       0       0       0       0
        32          0       0       0       1       0
        33          0       0       0       0       0
        34          0       0       0       0       0
        35          0       0       0       0       1

    Dinuse        480     480     480     480     480
    Dload       17.07   17.07   17.07   17.07   17.07
    BW (MB/s) 3042.06 3077.16 3083.26 3084.89 3057.90

Table 6.5: Stripe collision statistics for the PLFS backend directory running with 4,096 cores.

6.4 Summary

In this chapter it has been shown that current-day I/O systems perform much better than some literature suggests [11, 24, 29]. However, this level of performance can only be achieved if the file system is being used correctly. Behzad et al. [9, 10] suggest that on Lustre file systems, a good level of performance can be found and that up to a point, the file system scales almost linearly. However, due to restrictions in the version of Lustre used for the experiments in this thesis, the maximum number of OSTs that can be used is only 160; this suggests that when problems are scaled up to millions of nodes (as may be the case for exascale), the I/O performance will not scale. Particular versions of Lustre already scale beyond this OST limit [41], but they are not currently being used by some of the biggest supercomputing centres (such as the OCF at LLNL).

In this chapter an exhaustive search algorithm has been used to perform a parameter sweep to find a more performant configuration for the Lustre file system connected to the Cab supercomputer. After finding such a configuration for a small IOR problem, the effect of I/O contention on the achievable bandwidth is analysed. This chapter uses a small problem to show the effect of contention on an easily scalable job. This analysis therefore demonstrates what happens in the best possible case, showing that even well aligned jobs are affected heavily by contention. A series of metrics have been proposed that capture the contention that is created by several jobs competing for a shared resource. Section 6.2 shows that when jobs are run with an optimised configuration on a contended resource, performance drops considerably, and using fewer resources vastly improves system availability with a minor performance degradation.

The previous chapter explored the work of LANL and EMC2 in creating PLFS, which has been shown to provide significant speed-ups at medium-scale. However, this thesis has demonstrated that PLFS may be harmful to performance on Lustre file systems at large-scale. The work in this chapter provides an explanation of this phenomenon within the framework of the provided Lustre contention metrics.

Using the equations given, the load of each OST can be calculated for both competing I/O intensive applications and for PLFS-based applications. With the results from these equations, various file system purchasing decisions can be made; for instance, the number of OSTs can be increased in order to reduce the OST load for a theoretically "average" I/O workload. Furthermore, the benefits PLFS may have on an application can be calculated based on the scale at which it will be run, as well as on the number of OSTs available for the task.

While, at the time of writing, the I/O backplanes in modern day systems are being under-utilised, with the correct configuration options and some optimisation by application developers, acceptable performance can be achieved with relatively little effort. Making these changes to applications and file system configurations will not only improve current scientific applications, but will also benefit future systems and inform future I/O system developers on how to best proceed towards exascale-class storage.

CHAPTER 7

Discussion and Conclusions

The work presented in this thesis outlines the current and potential future state of I/O in parallel applications. As supercomputing approaches billion-way parallelism [119], node failures will become more prevalent; failure resilience will be required to ensure computation will continue almost uninterrupted. The usual approach of checkpointing may not be sufficient at exascale to recover from failures [19, 44]. However, many science codes have settled code-bases and will therefore be resistant to significant changes in the way resilience is maintained. Additionally, checkpointing often provides additional benefits, such as sub-problem analysis and visualisation. For this reason, work in improving storage systems is still necessary if exascale computing is to become a reality [105].

Much of the literature on improving parallel I/O is focused on changing the way current file systems operate, to perform write operations in a way that is more conducive to spinning hard disks, but many of the comparisons presented are against popular file systems operating in an unoptimised fashion [11]. As demonstrated by Behzad et al. [10] and Hedges et al. [63], much better performance can come from traditional parallel file systems like Lustre and IBM's General Parallel File System (GPFS). Often, large parallel systems are configured to provide "acceptable" I/O performance for a large number of concurrent general-purpose applications but, as demonstrated in this thesis, better performance can be achieved without negatively impacting global availability, if the workload of a parallel machine is well understood.

Specifically, Chapter 4 has demonstrated that by analysing the I/O behaviour of parallel codes, the large overhead associated with MPI and file-formatting libraries can be reduced by selecting options that are more suitable for an application's I/O patterns. Through analysis with the RIOT I/O toolkit, the inefficiencies of data-sieving were identified in two HDF-5 based applications where, through adjusting the MPI-IO driver options, performance was improved by eliminating the "lock, read, write, unlock" cycles that data-sieving induces. By analysing the information gathered by RIOT these bottlenecks were identified and subsequently performance was improved by at least 2×.

Chapter 5 further demonstrated the applicability of RIOT to analysing the underlying behaviour of parallel file systems. Specifically, the gains reported by Bent et al. were investigated, showing that at low core counts, the parallel log-structured file system (PLFS) provides a boost in performance over an unoptimised MPI-IO driver [11]; however, at scale PLFS overwhelms the Lustre file system used in this thesis. The results presented by Bent et al. show some performance numbers for both Lustre and GPFS, but report the largest gains on a PanFS installation, whereby specialised hardware and software is used to boost bandwidth. For most potential users of PLFS, it is likely that more widespread file systems such as Lustre and GPFS will be in use and, as such, Chapter 5 outlined a method for rapidly evaluating the potential performance increase that PLFS could provide; the results in this thesis demonstrate an order of magnitude speed-up over the default system performance for some specific benchmarks and configurations.

The work in this thesis concluded in Chapter 6, with the comparison of the optimised Lustre driver (ad_lustre), which is often not available by default, to the unoptimised UNIX file system driver (ad_ufs). Previous work has demonstrated that better performance can be achieved with the ad_lustre driver and by exposing more of the parallelism inherent in the Lustre file system architecture [10, 76, 148]. Chapter 6 demonstrates similar results to previous studies, showing that the best performance often comes from exploiting the maximum amount of parallelism available. This thesis has extended previous studies to demonstrate that the use of locally optimal settings may be harmful to global availability, where jobs that are competing for a shared resource (such as a parallel file system) collide with each other, reducing respective performance.

Further, this thesis has also evaluated the performance of PLFS at scale on a Lustre file system, showing that the PLFS architecture creates large amounts of self-contention, reducing the resources available to the entire system. The metrics derived in Chapter 6 will allow users of large shared systems to evaluate the impact their jobs may have on other users of the system, as well as allow procurement decisions to be made based on an expected I/O workload.

7.1 Limitations

This thesis concentrates largely on the write performance of particular benchmarks, with little discussion of improving read performance. For HPC applications the initial state is usually loaded from an input deck, and from this point on the state is only written out to disk at particular intervals. Shan et al. suggest that write activities dominate on parallel machines because (i) post-processing and visualisation tasks are often performed on separate systems to the computation; (ii) most checkpoint files are never read back; and, (iii) input files are often very small [121]. Despite this, much of the work in this thesis is equally applicable to improving read performance. The work in Chapter 5 focuses on PLFS, a file system designed specifically to accelerate parallel write performance [11]. It has been shown, however, that PLFS also improves read performance [103], and much of the work on the Lustre file system in Chapter 6 will similarly improve both read and write performance.

Another potential limitation of this thesis is the use of unoptimised MPI-IO drivers in Chapters 4 and 5. However, the usage of the ad_ufs driver – or a lack of customised optimisation instructions – is common on large parallel systems, and is therefore representative of how these parallel machines are typically used. Unlike commodity clusters using GPFS file systems, there is a custom MPI-IO driver for the BlueGene architecture (ad_bgl) and its benefits were demonstrated in Chapter 4, where the BG/P achieved the highest average bandwidth in most experiments. For the LLNL clusters a Lustre driver was built specifically for Chapter 6 and, again, demonstrated the benefit of optimised drivers for improving I/O throughput.

When performing experiments on a shared resource (such as a file system), the background machine load adds variability to the results. The machines used throughout this thesis were all production supercomputers at the time the experiments were performed, with a range of background loads. Minerva, being the smallest of the four systems, demonstrates the lowest variability of the clusters due to many background serial tasks that are not I/O-intensive. The two machines at LLNL use a shared file system and, due to their size and background load, produce variable results. Finally, the decommissioned BG/P was not heavily loaded, but the architecture and use of aggregator nodes added variability to its performance. To allay these problems, full numerical data is given in the appendices, showing confidence intervals to capture the effect of background noise on the experiments performed.

One final limitation of this thesis is the small set of applications that were used throughout. All four benchmarks used are common throughout the literature and are also often used by industry to assess parallel file system performance. Two of the benchmarks are highly configurable applications that have been designed such that they can be made to replicate the I/O behaviour of other science applications, and the final two applications recreate the performance characteristics of two production-grade codes. These four benchmarks are broadly representative of a large proportion of the I/O routines in many science codes; in particular, two of the benchmarks perform their I/O through the HDF-5 library, which is widespread in science applications1. The results in this thesis will therefore similarly apply to other applications and systems.

Furthermore, much of the research in this thesis has been thoroughly tested against custom-written benchmarks designed to stress-test the results in order to demonstrate the applicability of the tools and techniques presented.

1 See http://www.hdfgroup.org/users.html


7.2 Future Work

The work in this thesis largely focuses on two themes: (i) the improvement of current generation I/O performance; and (ii) the procurement and potential performance of future I/O systems. Currently many users are experiencing poor performance from their systems due to a lack of optimisation and understanding of their applications' I/O routines. Attempts have been made to bridge this gap with the use of auto-tuning to configure the MPI-IO options to get better performance [10], but this gives little regard to the system as a whole. As the results presented in this thesis demonstrate, a good quality of service can be achieved with parameter tuning. To reduce the burden on application developers, this parameter selection must be performed by the file system using some form of auto-tuning that is workload-aware. File systems such as Lustre must improve how resources are assigned, using some online monitoring and historical usage data to inform how resources should be distributed. While users continue to run their applications with "default" behaviours, the myth that current generation I/O performance is poor will remain. Optimisations made to file systems and applications to improve the current performance of storage systems will help inform future system developers by identifying some of the problems that will inevitably appear at exascale.

Procurement decisions for I/O systems currently rely on looking at the aggregate bandwidth provided by the OSSs and OSTs, and the capacity that is required. To make more sensible decisions about procurement, the usage patterns of the file system need to be better understood. One such study into doing this is being undertaken at the Argonne National Laboratory, using Darshan [21]. Currently the tools available either monitor an entire system with little information about the individual jobs, or monitor a specific job with little information about the rest of the system; by monitoring at both points, the current usage patterns and the users' needs may be better understood.

Furthermore, the effects of background load upon I/O performance need to be better understood. On capacity clusters (where there are many smaller jobs running simultaneously), the network and file system load can create large variances in runtimes, and this has long proved a problem in analytical performance modelling [59, 98]. Understanding noisy environments will help to assess the performance potential of a file system better and aid the procurement process. This issue can be partially overcome using simulation, with some form of simulated machine noise (typically Gaussian) to introduce the variances that are observed on production systems. However, the simulation platforms that exist at the time of writing [61, 107] focus only on the compute performance at scale on shared-user machines; these must be extended to include I/O simulation, alongside background machine noise, to truly predict how applications will scale when they reach billion-way parallelism. Work has been done in simulating single disks and simple file systems [145]; with the addition of DiskSim and a simulation platform like SST [107], the communication between compute nodes and I/O nodes can be investigated ahead of procurement, with the effect of OST placement being investigated to find smarter algorithms for data placement and resource allocation.

7.3 The Road to Exascale

Throughout this thesis, it has remained clear that current generation I/O systems will not scale up to exascale. With this in mind, either applications need to change how resilience is provided or storage systems require a redesign.

Many developers are reluctant to vastly change their stable codebases to change how fault tolerance is performed in their applications; therefore checkpointing will have to improve significantly, such that it does not become a bottleneck to performance at exascale [46]. Many of the results in this thesis offer small glimpses of what may be expected of I/O at exascale. Firstly, the work in Chapter 4 shows how an extended I/O hierarchy may improve performance significantly [17]. On Daresbury's BG/P system data was aggregated by dedicated high-performance nodes, after I/O operations had been performed on the compute nodes, and then sent to the file system. The file system also had a tiered architecture, where data was initially sent to fast fibre channel hard disk drives, before being committed to SATA disks. With the ever decreasing price, and increasing capacity, of solid state drives (SSDs), it seems likely that SSDs may be used as burst-buffers at exascale, helping to speed-match disk writes, before data is committed to spinning disks during computation [78, 96, 138].

Figure 7.1: The memory hierarchy (registers, CPU cache, main memory, I/O aggregators, burst buffers, parallel disk storage), with potentially two additional layers for improved I/O performance on supercomputers.

Figure 7.1 shows how this augmented hierarchy may look.

The work that has been presented in Chapters 5 and 6 demonstrates one potential answer to the current I/O problem, showing that PLFS may benefit users running at small- to medium-scale, but does so by violating constraints in the file system (by exceeding the limit of Lustre stripes allowed). However, at large scale or on a large capacity system, PLFS either performs poorly or negatively impacts the rest of the file system for other users; at exascale, it is likely that without PLFS being re-engineered, this problem will get worse.

File systems will need to adapt to the upcoming challenges of billion-way parallelism by allowing more scalability and customisation. Load monitors may help the file system to exploit novel heuristics to make better decisions about where to place data blocks, reducing the risk of overloading particular storage targets. The equations in Chapter 6 should allow for smarter procurement decisions to be made using an approximation of performance under load. Present-day large file system installations do perform much better than some literature would suggest, but only when used appropriately by the users. However, if users continue to run their applications with the default I/O configuration options, it is likely that I/O performance will continue to advance much slower than compute performance.

Recent efforts into reducing the time spent writing checkpoints have focussed on determining an optimal checkpointing period, such that the minimum amount of time is consumed by checkpointing activities to achieve a particular level of resilience. Cappello et al. investigate the use of preventive migration and preventive checkpointing, showing that preventive migration is better in the short term but that at large-scale (2^20 nodes), both techniques will achieve poor utilisation unless the mean time between failures is large [18]. Aupy et al. have derived mathematical models to determine the optimum checkpointing period with respect to time and power consumption [5].

For developers willing to change the way fault-tolerance is provided in their applications, the path to exascale may be clearer. While the MPI library contains mechanisms for handling application faults [57], some researchers have extended the MPI specification to provide more explicit fault-tolerance mechanisms [45]. However, even with MPI able to detect and continue in the presence of faults, techniques are required to recover lost computation.

Zheng et al. describe a double checkpointing algorithm, whereby each process is assigned a "buddy" process with which it exchanges an in-memory checkpoint at regular intervals [153]. In the event of a process failure, the buddy process transfers a checkpoint to a new process, which is swapped into the application to replace the failed process. Dongarra et al. extend this algorithm with a triple checkpointing algorithm that provides greater resilience at the expense of a small additional overhead [37].

Elliott et al. outline the use of partial redundancy in MPI applications, where the communication routines in the MPI library are intercepted and the effort is replicated transparently to an application [43]. If a process in the system fails, their RedMPI library begins to redirect the traffic from a redundant process such that the application is unaware of the crashed process.

Regardless, present-day large file system installations do perform much better than some literature would suggest, but only when used appropriately by the users. However, if users continue to run their applications with the default I/O configuration options, it is likely that I/O performance will continue to advance much slower than compute performance.

Bibliography

[1] S. R. Alam, H. N. El-Harake, K. Howard, N. Stringfellow, and F. Verzel-

loni. Parallel I/O and the Metadata Wall. In Proceedings of the 6th

Workshop on Parallel Data Storage (PDSW’11), pages 13–18, Seattle,

WA, 2011. ACM, New York, NY.

[2] G. Alam´asi, C. Archer, J. G. Casta˜nos, C. C. Erway, P. Heidelberger,

X. Martorell, J. E. Moreira, K. Pinnow, J. Ratterman, N. Smeds,

B. Steinmacher-burrow, W. Gropp, and B. Toonen. Implementing MPI

on the BlueGene/L Supercomputer. Lecture Notes in Computer Science

(LNCS), 3149:833–845, August–September 2004.

[3] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP:

Incorporating Long Messages into the LogP Model – One Step Closer

Towards a Realistic Model for Parallel Computation. In Proceedings of

the 7th Annual ACM Symposium on Parallel Algorithms and Architectures

(SPAA’95), pages 95–105, Santa Barbara, CA, July 1995. ACM, New

York, NY.

[4] Argonne National Laboratory. Parallel I/O Benchmarking Consor-

tium. http://www.mcs.anl.gov/research/projects/pio-benchmark/

(accessed February 21, 2011), 2011.

[5] G. Aupy, A. Benoit, T. H´erault, Y. Robert, and J. Dongarra. Optimal

Checkpointing Period: Time vs. Energy. Lecture Notes in Computer Sci-

ence (LNCS), 8551:203–214, August 2014.

[6] D. A. Bader, J. Berry, S. Kahan, R. Murphy, E. J. Riedy, and J. Willcock.

Graph 500. http://www.graph500.org (accessed November 29, 2013),

2013.

105 BIBLIOGRAPHY

[7] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter,

L. Dagum, R. A. Fatoohi, S. Fineberg, P. O. Frederickson, T. A. Lasinski,

R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga.

The NAS Parallel Benchmarks. NASA Ames Research Center, Mo↵et

Field, CA, March 1994.

[8] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter,

L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S.

Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The

NAS Parallel Benchmarks. International Journal of High Performance

Computing Applications, 5(3):63–73, 1991.

[9] B. Behzad, S. Byna, S. M. Wild, and M. Snir. Improving Parallel I/O

Autotuning with Performance Modeling. Technical Report ANL/MCS-

P5066-0114, Argonne National Laboratory, Argonne, IL, January 2014.

[10] B. Behzad, H. V. T. Luu, J. Huchette, S. Byna, Prabhat, R. Aydt,

Q. Koziol, and M. Snir. Taming Parallel I/O Complexity with Auto-

tuning. In Proceedings of the 25th ACM/IEEE International Confer-

ence for High Performance Computing, Networking, Storage and Anal-

ysis (SC’13), pages 1–12, Denver, CO, November 2013. IEEE Computer

Society, Washington, DC.

[11] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez,

M. Polte, and M. Wingate. PLFS: A Checkpoint Filesystem for Parallel

Applications. In Proceedings of the 21st ACM/IEEE International Confer-

ence for High Performance Computing, Networking, Storage and Analysis

(SC’09), pages 21:1–21:12, Portland, OR, November 2009. ACM, New

York, NY.

[12] R. F. Bird, S. J. Pennycook, S. A. Wright, and S. A. Jarvis. Towards

a Portable and Future-proof Particle-in-Cell Plasma Physics Code. In

106 BIBLIOGRAPHY

Proceedings of the 1st International Workshop on OpenCL (IWOCL’13),

Atlanta, GA, May 2013. Georgia Institute of Technology, GA.

[13] R. F. Bird, S. A. Wright, D. A. Beckingsale, and S. A. Jarvis. Performance

Modelling of Magnetohydrodynamics Codes. Lecture Notes in Computer

Science (LNCS), 7587:197–209, July 2013.

[14] J. S. Bucy, J. Schindler, S. W. Schlosser, and G. R. Ganger. The DiskSim

Simulation Environment Version 4.0 Reference Manual, 2008.

[15] Bull. BullX Cluster Suite Application Developer’s Guide. Les Clayes-sous-

Bois, Paris, April 2010.

[16] R. A. Bunt, S. J. Pennycook, S. A. Jarvis, B. L. Lapworth, and Y. K.

Ho. Model-Led Optimisation of a Geometric Multigrid Application. In

Proceedings of the 15th IEEE International Conference on High Perfor-

mance Computing and Communications (HPCC’13), pages 1–12. IEEE

Computer Society, Los Alamitos, CA, November 2013.

[17] D. Camp, H. Childs, A. Chourasia, C. Garth, and K. I. Joy. Evaluating

the Benefits of an Extended Memory Hierarchy for Parallel Streamline Al-

gorithms. In The 2011 IEEE Symposium on Large Data Analysis and Vi-

sualization (LDAV’11), pages 57–64, Providence, RI, October 2011. IEEE

Computer Society, Los Alamitos, CA.

[18] F. Cappello, H. Casanova, and Y. Robert. Preventive Migration vs. Pre-

ventive Checkpointing for Extreme Scale Supercomputers. Parallel Pro-

cessing Letters, 21(2):111–132, June 2011.

[19] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir. Toward

Exascale Resilience. International Journal of High Performance Comput-

ing Applications, 23(4):374–388, November 2009.

[20] R. Card, T. Ts’o, and S. Tweedie. Design and Implementation of the

Second Extended Filesystem. In The Proceedings of the 1st Dutch Inter-

107 BIBLIOGRAPHY

national Symposium on Linux, Amsterdam, The Netherlands, December

1994. State University of Groningen, The Netherlands.

[21] P. H. Carns, R. Latham, R. B. Ross, K. Iskra, S. Lang, and K. Riley. 24/7

Characterization of Petascale I/O Workloads. In Proceedings of the IEEE

International Conference on Cluster Computing and Workshops (CLUS-

TER’09), pages 1–10, New Orleans, LA, September 2009. IEEE Computer

Society, Los Alamitos, CA.

[22] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A

Parallel File System for Linux Clusters. In Proceedings of the 4th Annaul

Linux Showcase and Conference (ALS’00), pages 317–327, Atlanta, GA,

October 2000. USENIX Association.

[23] P. H. Carns, B. W. Settlemyer, and W. B. Ligon III. Using Server-to-

Server Communication in Parallel File Systems to Simplify Consistency

and Improve Performance. In Proceedings of the 20th ACM/IEEE Interna-

tional Conference for High Performance Computing, Networking, Storage

and Analysis (SC’08), pages 6:1–6:8, Austin, TX, 2008. IEEE Computer

Society Press, Piscataway, NJ.

[24] Y. Chanel. Study of the Lustre File System Performances Before its

Installation in a Computing Cluster. http://upcommons.upc.edu/pfc/

bitstream/2099.1/5626/1/49964.pdf (accessed June 20, 2011), 2008.

[25] Y. Chen. Towards Scalable I/O Architecture for Exascale Systems. In Pro-

ceedings of the 2011 ACM International Workshop on Many Task Com-

puting on Grids and Supercomputers (MTAGS’11), pages 43–48, Seattle,

WA, November 2011. ACM New York, NY.

[26] Y. Chen, Y. Lu, P. Amritkar, R. Thakur, and Y. Zhuang. Performance

model-directed data sieving for high-performance i/o. The Journal of

Supercomputing, pages 1–25, September 2014.

108 BIBLIOGRAPHY

[27] I. D. Chivers and J. Sleightholme. An Introduction to Sun Studio. ACM

SIGPLAN Fortran Forum, 28(1):13–25, April 2009.

[28] W. chun Feng and K. W. Cameron. The Green 500. http://www.

green500.org (accessed February 20, 2014), 2014.

[29] J. Cope, M. Oberg, H. M. Tufo, and M. Woitaszek. Shared Parallel Filesys-

tems in Heterogeneous Linux Multi-Cluster Environments. In Proceedings

of the 6th LCI International Conference on Linux Clusters, pages 1–21,

Chapel Hill, NC, 2005. Linux Clusters Institute, NM.

[30] D. Culler, R. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos,

R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model

of Parallel Computation. In Proceedings of the 4th ACM SIGPLAN Sym-

posium on Principles and Practice of Parallel Programming (PPoPP’93),

pages 1–12, San Diego, CA, July 1993. ACM, New York, NY.

[31] L. Dagum and R. Menon. OpenMP: An Industry Standard API for

Shared-Memory Programming. IEEE Computational Science & Engineer-

ing, 5(1):46–55, January–March 1998.

[32] J. A. Davis, G. R. Mudalige, S. D. Hammond, J. A. Herdman, I. Miller,

and S. A. Jarvis. Predictive Analysis of a Hydrodynamics Application on

Large-Scale CMP Clusters. Computer Science – Research and Develop-

ment, 26(3–4):175–185, June 2011.

[33] Department of Energy. Department of Energy Exascale Strategy – Re-

port to Congress. Technical report, United States Department of Energy,

Washington, DC 20585, June 2013.

[34] V. Deshpande, X. Wu, and F. Mueller. Auto-Generation of Communica-

tion Benchmark Traces. SIGMETRICS Performance Evaluation Review,

40(2):99–105, October 2012.

109 BIBLIOGRAPHY

[35] P. M. Dickens and J. Logan. Towards a High Performance Implementation

of MPI-IO on the Lustre File System. In Proceedings of the OTM 2008

Confederated International Conferences, CoopIS, DOA, GADA, IS, and

ODBASE 2008. Part I on On the Move to Meaningful Internet Systems

(OTM’08), pages 870–885, Monterrey, Mexico, January 2008. Springer-

Verlag, Berlin, Heidelberg.

[36] P. M. Dickens and J. Logan. A High Performance Implementation of MPI-

IO for a Lustre File System Environment. Concurrency and Computation:

Practice & Experience, 22(11):1433–1449, August 2010.

[37] J. Dongarra, T. H´erault, and Y. Robert. Performance and Reliability

Trade-o↵s for the Double Checkpointing Algorithm. International Journal

of Networking and Computing, 4(1):23–41, 2014.

[38] J. J. Dongarra. The MPI Profiling Interface. http://www.netlib.org/

utk/papers/mpi-book/node182.html (accessed January 29, 2013), 1995.

[39] J. J. Dongarra. Performance of Various Computers Using Standard Linear

Equations Software. Technical Report CS-89-85, University of Tennessee,

Knoxville, TN, February 2013.

[40] S. S. Dosanjh, R. F. Barrett, D. W. Doerfler, S. D. Hammond, K. S.

Hemmert, M. A. Heroux, P. T. Lin, K. T. Pedretti, A. F. Rodrigues,

T. G. Trucano, and J. P. Luitjens. Exascale Design Space Exploration

and Co-design. Future Generation Computer Systems, 30:46–58, January

2014.

[41] O. Drokin. Lustre File Striping Across a Large Number of OSTs. In Pro-

ceedings of the 2011 Lustre User Group (LUG’13), pages 1–17, Orlando,

FL, April 2011. OpenSFS, LUG.

[42] P. R. E↵ert and D. S. Parker. File Systems in User Space. In Proceedings

of the 1993 USENIX Winter Conference, pages 229–240, San Diego, CA,

January 1993. USENIX Association, Berkeley, CA.

110 BIBLIOGRAPHY

[43] J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. En-

gelmann. Combining Partial Redundancy and Checkpointing for HPC.

In Proceedings of the 32nd IEEE Intenational Conference on Distributed

Computing Systems (ICDCS’12), pages 615–626, Macau, China, June

2012. IEEE Computer Society, Washington, DC.

[44] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of

Rollback-Recovery Protocols in Message-Passing Systems. ACM Comput-

ing Surveys, 34(3):375–408, September 2002.

[45] G. E. Fagg and J. J. Dongarra. FT-MPI: Fault Tolerant MPI, Supporting

Dynamic Applications in a Dynamic World. Lecture Notes in Computer

Science (LNCS), 1908:346–353, 2000.

[46] D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and

R. Brightwell. Detection and Correction of Silent Data Corruption for

Large-scale High-performance Computing. In Proceedings of the 24th

ACM/IEEE International Conference for High Performance Computing,

Networking, Storage and Analysis (SC’12), pages 78:1–78:12, Salt Lake

City, UT, November 2012. IEEE Computer Society, Washington, DC.

[47] B. Fryxell, K. Olson, P. Ricker, F. X. Timmes, M. Zingale, D. Q. Lamb,

P. MacNeice, R. Rosner, J. W. Truran, and H. Tufo. FLASH: An Adaptive

Mesh Hydrodynamics Code for Modeling Astrophysical Thermonuclear

Flashes. The Astrophysical Journal Supplement Series, 131(1):273, 2000.

[48] K. Fuerlinger, N. J. Wright, and D. Skinner. E↵ective Performance Mea-

surement at Petascale Using IPM. In Proceedings of the IEEE 16th Inter-

national Conference on Parallel and Distributed Systems (ICPADS’10),

pages 373–380, Shanghai, China, December 2010. IEEE Computer Soci-

ety, Washington, DC.

[49] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M.

Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Cas-

111 BIBLIOGRAPHY

tain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals,

Concept, and Design of a Next Generation MPI Implementation. Lecture

Notes in Computer Science (LNCS), 3241:97–104, September 2004.

[50] M. Geimer, F. Wolf, B. J. N. Wylie, E. Abrah´am,´ D. Becker, and B. Mohr.

The Scalasca Performance Toolset Architecture. Concurrency and Com-

putation: Practive & Experience, 22(6):702–719, April 2010.

[51] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. J. Kelly. Per-

formance Analysis of the OP2 Framework on Many-Core Architectures.

SIGMETRICS Performance Evaluation Review, 38(4):9–15, March 2011.

[52] J. Gim and Y. Won. Extract and Infer Quickly: Obtaining Sector Geom-

etry of Modern Hard Disk Drives. ACM Transactions on Storage (TOS),

6(2):6–26, July 2010.

[53] J. Gim and Y. Won. Relieving the Burden of Track Switch in Modern

Hard Disk Drives. Multimedia Systems, 17(3):219–235, June 2011.

[54] J. Gim, Y. Won, J. Chang, J. Shim, and Y. Park. DIG: Rapid Character-

ization of Modern Hard Disk Drive and Its Performance Implication. In

Proceedings of the 5th IEEE International Workshop on Storage Network

Architecture and Parallel I/Os (SNAPI’08), pages 74–83. IEEE Computer

Society, Washington, DC, September 2008.

[55] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: a call graph

execution profiler. In Proceedings of the SIGPLAN Symposium on Com-

piler Construction (SIGPLAN’82), pages 120–126. ACM New York, NY,

ApriL 1982.

[56] W. Gropp. MPICH2: A New Start for MPI Implementations. Lecture

Notes in Computer Science (LNCS), 2474:7, October 2002.

112 BIBLIOGRAPHY

[57] W. Gropp and E. Lusk. Fault Tolerance in Message Passing Interface

Programs. International Journal of High Performance Computing Appli-

cations, 18(3):363–372, August 2004.

[58] T. J. Hacker, F. Romero, and C. D. Carothers. An Analysis of Clus-

tered Failures on Large Supercomputing Systems. Journal of Parallel and

Distributed Computing (JPDC), 69(7):652–665, July 2009.

[59] S. D. Hammond. Performance Modelling and Simulation of High Perfor-

mance Computing Systems. PhD thesis, University of Warwick, Coventry,

UK, 2011.

[60] S. D. Hammond, G. R. Mudalige, J. A. Smith, J. A. Davis, A. B. Mills,

S. A. Jarvis, J. Holt, I. Miller, J. A. Herdman, and A. Vadgama. Perfor-

mance Prediction and Procurement in Practice: Assessing the Suitability

of Commodity Cluster Components for Wavefront Codes. IET Software,

3(6):509–521, December 2009.

[61] S. D. Hammond, G. R. Mudalige, J. A. Smith, S. A. Jarvis, J. A.

Herdman, and A. Vadgama. WARPP: A Toolkit for Simulating High-

Performance Parallel Scientific Codes. In Proceedings of the 2nd Inter-

national Conference on Simulation Tools and Techniques (Simutools’09),

pages 19:1–19:10, Rome, Italy, 2009. Institute for Computer Sciences,

Social-Informatics and Telecommunications Engineering, Brussels, Bel-

gium.

[62] J. He, J. Bent, G. Grider, G. Gibson, C. Maltzahn, and X.-H. Sun. Discov-

ering Structure in Unstructured I/O. In 2012 SC Companion: High Per-

formance Computing, Networking Storage and Analysis (SCC’12), pages

1–6, Salt Lake City, UT, November 2012. IEEE Computer Society, Wash-

ington, DC.

[63] R. Hedges, K. Fitzgerald, M. Gary, and D. M. Stearman. Comparison of

Leading Parallel NAS File Systems on Commodity Hardware. In Proceed-

113 BIBLIOGRAPHY

ings of the 5th Annual Workshop on Petascale Data Storage (PDSW’10),

page 1, New Orleans, LA, November 2010. IEEE Computer Society, Los

Alamitos, CA.

[64] T. Hey. Richard Feynman and Computation. Contemporary Physics,

40(4):257–265, 1999.

[65] C. Hill, C. DeLuca, V. Balaji, M. Suarez, and A. da Silva. The Archi-

tecture of the Earth System Modeling Framework. Computing in Science

and Engineering, 6(1):18–28, January 2004.

[66] M. Howison. Tuning HDF5 for Lustre File Systems. In Proceedings of

the Workshop on Interfaces and Abstractions for Scientific Data Storage

(IASDS’10), Heraklion, Crete, Greece, September 2012. IEEE Computer

Society, Los Alamitos, CA.

[67] W. Hsu and A. J. Smith. The Performance Impact of I/O Optimiza-

tions and Disk Improvements. IBM Journal of Research and Development,

48(2):255–289, March 2004.

[68] L. P. Huse, K. Omang, H. O. Bugge, H. Ry, A. T. Haugsdal, and E. Rus-

tad. ScaMPI – Design and Implementation. Lecture Notes in Computer

Science (LNCS), 1734:249–261, January 1999.

[69] Intel. Intel VTune Amplifier XE 2013. https://software.intel.com/

en-us/intel-vtune-amplifier-xe (accessed December 20, 2013), 2013.

[70] S. Ishiguro, J. Murakami, Y. Oyama, and O. Tatebe. Optimizing Local

File Accesses for FUSE-Based Distributed Storage. In 2012 SC Com-

panion: High Performance Computing, Networking Storage and Analy-

sis (SCC’12), pages 760–765, Salt Lake City, UT, November 2012. IEEE

Computer Society, Washington, DC.

[71] D. J. Kerbyson, A. Hoisie, and H. J. Wasserman. A Performance Com-

parison Between the Earth Simulator and Other Terascale Systems on

114 BIBLIOGRAPHY

a Characteristic ASCI Workload: Research Articles. Concurrency and

Computation: Practive & Experience, 17(10):1219–1238, August 2005.

[72] D. E. Knuth. Ancient Babylonian Algorithms. Communications of the

ACM Magazine, 15(7):671–677, July 1972.

[73] Q. Koziol and R. Matzke. HDF5 – A New Generation of HDF: Reference

Manual and User Guide. Champaign, IL, 1998.

[74] J. Layton. HPC Storage – Getting Started with I/O Profiling. http://hpc.admin-magazine.com/Articles/HPC-Storage-I-O-Profiling

(accessed February 02, 2012), 2012.

[75] J. Li, W.-k. Liao, A. N. Choudhary, R. B. Ross, R. Thakur, W. Gropp,

R. Latham, A. Siegel, B. Gallagher, and M. Zingale. Parallel netCDF:

A High-Performance Scientific I/O Interface. In Proceedings of the 15th

ACM/IEEE International Conference for High Performance Computing,

Networking, Storage and Analysis (SC’03), pages 39–50, Pheonix, AZ,

November 2003. ACM, New York, NY.

[76] B. Lind. Lustre Tuning Parameters. In Proceedings of the 2013 Lustre

User Group (LUG’13), pages 1–12, San Diego, CA, April 2013. OpenSFS,

LUG.

[77] N. Liu, C. D. Carothers, J. Cope, P. H. Carns, R. B. Ross, A. Crume, and

C. Maltzahn. Modeling a Leadership-Scale Storage System. In Proceedings

of the 9th International Conference on Parallel Processing and Applied

Mathematics (PPAM’11), pages 10–19, Torun, Poland, 2012. Springer-

Verlag, Berlin, Heidelberg.

[78] N. Liu, J. Cope, P. H. Carns, C. D. Carothers, R. B. Ross, G. Grider,

A. Crume, and C. Maltzahn. On the Role of Burst Buffers in Leadership-

Class Storage Systems. In Proceedings of the 28th IEEE Conference on

Massive Data Storage (MSST’12), pages 1–11, Pacific Grove, CA, April

2012. IEEE Computer Society, Los Alamitos, CA.


[79] J. Logan and P. M. Dickens. Towards an Understanding of the Perfor-

mance of MPI-IO in Lustre File Systems. In Proceedings of the 2008 IEEE

International Conference on Cluster Computing (CLUSTER’08), pages

330–335, Tsukuba, Ibaraki, Japan, September 2008. IEEE Computer So-

ciety, Los Alamitos, CA.

[80] Los Alamos National Laboratory. PLFS Manual. http://institute.

lanl.gov/plfs/man/index.html (accessed June 1, 2014), 2014.

[81] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas,

R. Rabenseifner, and D. Takahashi. The HPC Challenge (HPCC) Bench-

mark Suite. In Proceedings of the 18th ACM/IEEE International Confer-

ence for High Performance Computing, Networking, Storage and Analysis

(SC’06), pages 213:1–213:105, Tampa, FL, November 2006. ACM New

York, NY.

[82] M. Massie. Ganglia Monitoring System. http://ganglia.

sourceforge.net (accessed January 16, 2012), 2011.

[83] J. D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Per-

formance Computers. http://www.cs.virginia.edu/stream/ (accessed

May 12, 2014), 1995.

[84] K. Mehta, J. Bent, A. Torres, G. Grider, and E. Gabriel. A Plugin for

HDF5 Using PLFS for Improved I/O Performance and Semantic Analysis.

In 2012 SC Companion: High Performance Computing, Networking Stor-

age and Analysis (SCC’12), pages 746–752, Salt Lake City, UT, November

2012. IEEE Computer Society, Washington, DC.

[85] M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE

Communications Magazine, 41(8):84–90, August 2003.

[86] Message Passing Interface Forum. MPI: A Message Passing Interface Stan-

dard Version 2.2. High Performance Computing Applications, 12(1–2):1–

647, 2009.


[87] H. Meuer, E. Strohmaier, J. J. Dongarra, and H. D. Simon. Top 500

Supercomputer Sites. http://top500.org (accessed May 5, 2013), 2013.

[88] E. Molina-Estolano, C. Maltzahn, J. Bent, and S. A. Brandt. Building

a Parallel File System Simulator. Journal of Physics: Conference Series,

180(1), 2009.

[89] G. R. Mudalige, S. D. Hammond, J. A. Smith, and S. A. Jarvis. Pre-

dictive Analysis and Optimisation of Pipelined Wavefront Computations.

In Proceedings of the 23rd IEEE International Symposium on Parallel &

Distributed Processing (IPDPS’09), pages 1–8, Rome, Italy, May 2009.

IEEE Computer Society, Washington, DC.

[90] W. E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach.

VAMPIR: Visualization and Analysis of MPI Resources. Supercomputer,

12(1):69–80, January 1996.

[91] D. F. Nagle, D. Serenyi, and A. Matthews. The Panasas ActiveScale

Storage Cluster: Delivering Scalable High Bandwidth Storage. In Pro-

ceedings of the 16th ACM/IEEE International Conference for High Per-

formance Computing, Networking, Storage and Analysis (SC’04), pages

53–62, Washington, DC, USA, 2004. IEEE Computer Society.

[92] B. Nitzberg and V. M. Lo. Collective Buffering: Improving Parallel I/O

Performance. In Proceedings of the 6th IEEE International Symposium

on High Performance Distributed Computing (HPDC’97), pages 148–157,

Portland, OR, August 1997. IEEE Computer Society, Washington, DC.

[93] M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. Sca-

laTrace: Scalable Compression and Replay of Communication Traces for

High-Performance Computing. Journal of Parallel and Distributed Com-

puting (JPDC), 69(8):696–710, August 2009.

[94] P. Nowoczynski, N. Stone, J. Yanovich, and J. Sommerfield. Zest: Check-

point Storage System for Large Supercomputers. In Proceedings of the


3rd Annual Workshop on Petascale Data Storage (PDSW’08), pages 1–5,

Austin, TX, November 2008. IEEE Computer Society, Los Alamitos, CA.

[95] J. Nunez and J. Bent. MPI-IO Test User’s Guide. http://institutes.

lanl.gov/data/software/ (accessed February 21, 2011), 2011.

[96] X. Ouyang, S. Marcarelli, and D. Panda. Enhancing Checkpoint Perfor-

mance with Staging IO and SSD. In Proceedings of the 8th IEEE In-

ternational Workshop on Storage Network Architecture and Parallel I/O

(SNAPI’10), pages 13–20, Incline Village, NV, May 2010. IEEE Computer

Society, Los Alamitos, CA.

[97] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant

Arrays of Inexpensive Disks (RAID). In Proceedings of the ACM SIGMOD

International Conference on Management of Data (SIGMOD’88), pages

109–116, Chicago, IL, 1988. ACM, New York, NY.

[98] S. J. Pennycook, S. D. Hammond, G. R. Mudalige, S. A. Wright, and S. A.

Jarvis. On the Acceleration of Wavefront Applications using Distributed

Many-Core Architectures. The Computer Journal, 55(2):138–153, Febru-

ary 2012.

[99] S. J. Pennycook, S. D. Hammond, S. A. Wright, J. A. Herdman, I. Miller,

and S. A. Jarvis. An Investigation of the Performance Portability

of OpenCL. Journal of Parallel and Distributed Computing (JPDC),

73(11):1439–1450, November 2013.

[100] J. Piernas Cánovas and J. Nieplocha. Implementation and Evaluation

of Active Storage in Modern Parallel File Systems. Parallel Computing,

36(1):26–47, January 2010.

[101] J. Piernas Cánovas, J. Nieplocha, and E. J. Felix. Evaluation of Active

Storage Strategies for the Lustre Parallel File System. In Proceedings

of the 19th ACM/IEEE International Conference for High Performance


Computing, Networking, Storage and Analysis (SC’07), pages 1–10, Reno,

NV, November 2007. ACM New York, NY.

[102] S. Plimpton. Fast Parallel Algorithms for Short-Range Molecular Dynam-

ics. Journal of Computational Physics, 117(1):1–19, March 1995.

[103] M. Polte, J. F. Lofstead, J. Bent, G. Gibson, S. A. Klasky, Q. Liu,

M. Parashar, K. Schwan, and M. Wolf. ... And Eat It Too: High

Read Performance in Write-Optimized HPC I/O Middleware File For-

mats. In Proceedings of the 4th Annual Workshop on Petascale Data

Storage (PDSW’09), pages 21–25, Portland, OR, November 2009. ACM,

New York, NY.

[104] M. Polte, J. Simsa, W. Tantisiriroj, G. Gibson, S. Dayal, M. Chainani, and

D. K. Uppugandla. Fast Log-based Concurrent Writing of Checkpoints.

In Proceedings of the 3rd Annual Workshop on Petascale Data Storage

(PDSW’08), pages 1–4, Austin, TX, November 2008. IEEE Computer

Society, Los Alamitos, CA.

[105] I. Raicu, I. T. Foster, and P. Beckman. Making a Case for Distributed File

Systems at Exascale. In Proceedings of the 3rd International Workshop

on Large-scale System and Application Performance (LSAP’11), pages

11–18. ACM, New York, NY, June 2011.

[106] R. K. Rew and G. P. Davis. NetCDF: An Interface for Scientific Data

Access. IEEE Computer Graphics and Applications, 10(4):76–82, 1990.

[107] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield,

M. I. Weston, R. Riesen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and

B. Jacob. The Structural Simulation Toolkit. SIGMETRICS Performance

Evaluation Review, 38(4):37–42, March 2011.

[108] M. Rosenblum and J. K. Ousterhout. The Design and Implementation of

a Log-Structured File System. ACM Transactions on Computer Systems,

10(1):26–52, February 1992.


[109] R. Rosner, A. Calder, J. Dursi, B. Fryxell, D. Q. Lamb, J. C. Niemeyer,

K. Olson, P. Ricker, F. X. Timmes, J. W. Truran, H. Tufo, Y.-N. Young,

and M. Zingale. Flash Code: Studying Astrophysical Thermonuclear

Flashes. Computing in Science & Engineering, 2(2):33–41, March–April

2000.

[110] M. S. Rothberg. Disk Drive Refreshing Zones Based On Serpentine Access

of Disk Surfaces. US Patent 7872822 http://www.google.com/patents/

US7872822, January 2011.

[111] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling.

Computer, 27(3):17–28, March 1994.

[112] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design

and Implementation of the Sun Network Filesystem. In The Proceedings

of the USENIX 1985 Summer Conference, pages 119–130, Portland, OR,

June 1985. USENIX Association.

[113] J. Schindler and G. R. Ganger. Automated Disk Drive Characterization.

Technical Report CMU-CS-99-176, Parallel Data Laboratory, School of

Computer Science, Carnegie Mellon University, Pittsburgh, PA, December

1999.

[114] J. Schindler and G. R. Ganger. Automated Disk Drive Characterization.

In Proceedings of the 2000 ACM SIGMETRICS International Conference

on Measurement and Modeling of Computer Systems (SIGMETRICS’00),

pages 112–113, Santa Clara, CA, 2000. ACM, New York, NY.

[115] F. Schmuck and R. Haskin. GPFS: A Shared-Disk File System for Large

Computing Clusters. In Proceedings of the 1st USENIX Conference on

File and Storage Technologies (FAST’02), pages 231–244, Monterey, CA,

January 2002. USENIX Association Berkeley, CA.


[116] B. Schroeder and G. A. Gibson. Understanding Failures in Petascale

Computers. Journal of Physics: Conference Series, 78(1):012022, July

2007.

[117] P. Schwan. Lustre: Building a File System for 1,000-node Clusters. In

Proceedings of the Linux Symposium, pages 380–386, Ottawa, Ontario,

Canada, July 2003. The Linux Symposium.

[118] M. Seger. Collectl. http://collectl.sourceforge.net (accessed Jan-

uary 16, 2012), 2011.

[119] J. Shalf, S. S. Dosanjh, and J. Morrison. Exascale Computing Technology

Challenges. In Proceedings of the 9th International Conference on High

Performance Computing for Computational Science (VECPAR’10), pages

1–25. Springer-Verlag Berlin, Heidelberg, January 2011.

[120] H. Shan, K. Antypas, and J. Shalf. Characterizing and Predicting the

I/O Performance of HPC Applications using a Parameterized Synthetic

Benchmark. In Proceedings of the 20th ACM/IEEE International Confer-

ence for High Performance Computing, Networking, Storage and Analysis

(SC’08), Austin, TX, November 2008. IEEE Press Piscataway, NJ.

[121] H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for

HPC Platforms. Lawrence Berkeley National Laboratory, Berkeley, CA,

May 2007.

[122] S. S. Shende and A. D. Malony. The Tau Parallel Performance Sys-

tem. International Journal of High Performance Computing Applications,

20(2):287–311, May 2006.

[123] E. Shriver, A. Merchant, and J. Wilkes. An Analytic Behavior Model for

Disk Drives with Readahead Caches and Request Reordering. SIGMET-

RICS Performance Evaluation Review, 26(1):182–191, June 1998.


[124] G. E. Suh, S. Devadas, and L. Rudolph. Analytical Cache Models with

Applications to Cache Partitioning. In Proceedings of the 15th Inter-

national Conference on Supercomputing (ICS’01), pages 1–12, Sorrento,

Italy, 2001. ACM, New York, NY.

[125] R. Thakur, W. Gropp, and E. Lusk. An Abstract-Device Interface for

Implementing Portable Parallel-I/O Interfaces. In Proceedings of the 6th

Symposium on the Frontiers of Massively Parallel Computation (FRON-

TIERS’96), pages 180–187, Annapolis, MD, October 1996. IEEE Com-

puter Society, Los Alamitos, CA.

[126] R. Thakur, W. Gropp, and E. Lusk. Data-Sieving and Collective I/O

in ROMIO. In Proceedings of the 7th Symposium on the Frontiers of

Massively Parallel Computation (FRONTIERS’99), pages 182–191, An-

napolis, MD, February 1999. IEEE Computer Society, Los Alamitos, CA.

[127] R. Thakur, E. Lusk, and W. Gropp. Users Guide for ROMIO: A

High-Performance, Portable MPI-IO Implementation. Technical Report

ANL/MCS-TM-234, Mathematics and Computer Science Division, Ar-

gonne National Laboratory, Argonne, IL, October 1997.

[128] The Mantevo Project. The Mantevo Benchmarks. https://mantevo.

org/ (accessed November 1, 2014), 2014.

[129] C. A. Thekkath, J. Wilkes, and E. D. Lazowska. Techniques for File

System Simulation. Software: Practice and Experience, 24(11):981–999,

1994.

[130] D. Thiébaut, H. S. Stone, and J. L. Wolf. Improving Disk Cache Hit-

Ratios Through Cache Partitioning. IEEE Transactions on Computers,

41(6):665–676, June 1992.

[131] W. F. Tichy and Z. Ruan. Towards a Distributed File System. Technical

Report CSD-TR-480, Purdue University, West Lafayette, IN, May 1984.


[132] D. Toussaint, R. van de Water, R. Sugar, U. Heller, S. Gottlieb, M. Oktay,

J. Foley, J. Laiho, C. Winterowd, A. Bazavov, M. Lightman, J. Kim,

L. Levkova, C. Bernard, C. DeTar, R. Zhou, S. Qiu, S. Lee, J. Osborn, and

J. Hetrick. The MIMD Lattice Computation (MILC) Collaboration.

http://www.physics.utah.edu/~detar/milc/ (accessed November 1, 2014), 2014.

[133] S. Tweedie. Journalling the ext2fs File System. In Proceedings of the Linux

Expo 1998 (LinuxExpo’98), Durham, NC, May 1998. Duke University,

Durham, NC.

[134] K. Vijayakumar, F. Mueller, X. Ma, and P. C. Roth. Scalable I/O Tracing

and Analysis. In Proceedings of the 4th Annual Workshop on Petascale

Data Storage (PDSW’09), pages 26–31, Portland, OR, November 2009.

ACM, New York, NY.

[135] Y. Wang and D. Kaeli. Profile-Guided I/O Partitioning. In Proceedings

of the 17th Annual International Conference on Supercomputing (ICS’03),

pages 252–260, San Francisco, CA, June 2003. ACM, New York, NY.

[136] Y. Wang and D. Kaeli. Source Level Transformations to Improve I/O

Data Partitioning. In Proceedings of the 1st International Workshop on

Storage Network Architecture and Parallel I/Os (SNAPI’03), pages 27–35,

New Orleans, LA, September 2003. ACM, New York, NY.

[137] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn.

Ceph: A Scalable, High-Performance Distributed File System. In Pro-

ceedings of the 7th USENIX Symposium on Operating Systems Design

and Implementation (OSDI’06), pages 307–320, Seattle, WA, November

2006. USENIX Association.

[138] B. Welch and G. Noer. Optimizing a Hybrid SSD/HDD HPC Storage

System Based on File Size Distributions. In Proceedings of the 29th IEEE

Symposium on Mass Storage Systems and Technologies (MSST’13), pages


1–12, Long Beach, CA, May 2013. IEEE Computer Society, Washington,

DC.

[139] B. L. Wolman and T. M. Olson. IOBENCH: A System Independent IO

Benchmark. ACM SIGARCH Computer Architecture News, 17(5):55–70,

1989.

[140] B. L. Worthington, G. R. Ganger, and Y. N. Patt. Scheduling Algorithms

for Modern Disk Drives. SIGMETRICS Performance Evaluation Review,

22(1):241–251, May 1994.

[141] S. A. Wright, S. D. Hammond, S. J. Pennycook, R. F. Bird, J. A. Herd-

man, I. Miller, A. Vadgama, A. H. Bhalerao, and S. A. Jarvis. Parallel

File System Analysis Through Application I/O Tracing. The Computer

Journal, 56(2):141–155, February 2013.

[142] S. A. Wright, S. D. Hammond, S. J. Pennycook, and S. A. Jarvis. Light-

weight Parallel I/O Analysis at Scale. Lecture Notes in Computer Science

(LNCS), 6977:235–249, October 2011.

[143] S. A. Wright, S. D. Hammond, S. J. Pennycook, I. Miller, J. A. Herd-

man, and S. A. Jarvis. LDPLFS: Improving I/O Performance without

Application Modification. In Proceedings of the 26th IEEE International

Parallel & Distributed Processing Symposium Workshops & PhD Forum

(IPDPSW’12), pages 1352–1359, Shanghai, China, 2012. IEEE Computer

Society, Washington, DC.

[144] S. A. Wright, S. J. Pennycook, S. D. Hammond, and S. A. Jarvis. RIOT

– A Parallel Input/Output Tracer. In Proceedings of the 27th Annual UK

Performance Engineering Workshop (UKPEW’11), pages 25–39, Brad-

ford, UK, July 2011. The University of Bradford, Bradford, UK.

[145] S. A. Wright, S. J. Pennycook, and S. A. Jarvis. Towards the Automated

Generation of Hard Disk Models Through Physical Geometry Discovery.


In Proceedings of the 3rd International Workshop on Performance Model-

ing, Benchmarking and Simulation of High Performance Computing Sys-

tems (PMBS’12), pages 1–8, Salt Lake City, UT, November 2012. ACM,

New York, NY.

[146] M. Yang and Q. Koziol. Using Collective IO Inside a High Performance

IO Software Package – HDF5. Technical report, National Center for Su-

percomputing Applications, University of Illinois, Urbana-Champaign, IL,

2006.

[147] C. S. Yoo, R. Sankaran, and J. H. Chen. Three-dimensional Direct Nu-

merical Simulation of a Turbulent Lifted Hydrogen Jet Flame in Heated

Coflow: Flame Stabilization and Structure. Journal of Fluid Mechanics,

640:453–481, December 2009.

[148] H. You, Q. Liu, Z. Li, and S. Moore. The Design of an Auto-Tuning I/O

Framework on Cray XT5 System. In Proceedings of the 2011 Cray Users

Group Conference (CUG’11), pages 1–10, Fairbanks, AK, May 2011. Cray

Inc.

[149] Y. Yuan, Y. Wu, Q. Wang, G. Yang, and W. Zheng. Job Failures in High

Performance Computing Systems: A Large-scale Empirical Study. Com-

puters & Mathematics with Applications, 63(2):365–377, January 2012.

[150] S. Zangenehpour. Method and Apparatus for Positioning Head of

Disk Drive Using Zone-Bit-Recording. US Patent 5257143 http://www.

google.com/patents/US5257143, October 1993.

[151] S. Zertal. Using Solid State Disks for HPC Applications. In Proceedings

of the International Conference on Computing, Networking and Digital

Technologies (ICCNDT’12), pages 209–217, Sanad, Bahrain, 2012. The

Society of Digital Information and Wireless Communication.

[152] T. Zhao, V. March, S. Dong, and S. See. Evaluation of a Performance

Model of Lustre File System. In Proceedings of the 5th Annual Chi-


naGrid Conference (ChinaGrid’10), pages 191–196, Guangzhou, Guang-

dong, China, 2010. IEEE Computer Society, Washington, DC.

[153] G. Zheng, L. Shi, and L. V. Kale. FTC-Charm++: An In-memory

Checkpoint-based Fault Tolerant Runtime for Charm++ and MPI. In

Proceedings of the 2004 IEEE Conference on Cluster Computing (CLUS-

TER’04), pages 93–103, San Diego, CA, September 2004. IEEE Computer

Society, Los Alamitos, CA.

[154] Y. Zhu and Y. Hu. Disk Built-in Caches: Evaluation on System Per-

formance. In The 11th IEEE/ACM International Symposium on Model-

ing, Analysis and Simulation of Computer Telecommunications Systems

(MASCOTS’03), pages 306–313, Orlando, FL, October 2003. IEEE Com-

puter Society, Los Alamitos, CA.

[155] M. Zingale. FLASH I/O Benchmark Routine – Parallel HDF 5. http://

www.ucolick.org/~zingale/flash_benchmark_io/ (accessed February 21, 2011), 2011.

APPENDIX A

RIOT Feasibility Study – Additional Results

[Figure A.1: two panels of runtime (s) against core count (Minerva 12–96, Sierra 12–96, BG/P 32–128) for (a) MPI_File_write() and (b) MPI_File_read(); series: None, POSIX, Complete.]

Figure A.1: Total runtime of RIOT overhead analysis software for the functions MPI_File_write() and MPI_File_read(), on three platforms, with three different configurations: No RIOT tracing, POSIX RIOT tracing and complete RIOT tracing.


[Figure A.2: two panels of runtime (s) against core count (Minerva 12–96, Sierra 12–96, BG/P 32–128) for (a) MPI_File_write_at_all() and (b) MPI_File_read_at_all(); series: None, POSIX, Complete.]

Figure A.2: Total runtime of RIOT overhead analysis software for the functions MPI_File_write_at_all() and MPI_File_read_at_all(), on three platforms, with three different configurations: No RIOT tracing, POSIX RIOT tracing and complete RIOT tracing.

                 AWE   IOR    NPB   S3D    Mantevo  HDF5   MILC   LAMMPS  ESMF   Total
                       [120]  [7]   [147]  [128]    [73]   [132]  [102]   [65]
read             15    1      0     0      0        2      2      0       0      21
read_all         43    1      0     12     44       0      0      3       4      107
read_at          13    1      1     0      0        17     0      1       1      34
read_at_all      41    1      1     0      0        7      0      2       0      53
read_ordered     47    1      0     0      0        0      0      0       0      48
read_shared      15    0      0     0      0        0      0      0       0      15
write            27    1      0     0      12       3      2      0       0      45
write_all        35    1      0     12     12       0      0      3       3      66
write_at         11    1      1     0      0        19     0      9       4      45
write_at_all     35    1      1     0      0        8      0      7       0      54
write_ordered    50    1      0     0      0        0      0      0       0      51
write_shared     15    0      0     0      0        0      0      0       0      15

Table A.1: Incidence of MPI_File_* function calls in 9 application suites, benchmarks and I/O libraries.


                   Minerva                   Sierra                    BG/P
Tracing            24      48      96        24      48      96        32      64      128

MPI_File_write()
None               44.00   84.84   156.75    45.71   64.49   119.72    71.68   112.17  173.02
POSIX              44.22   93.05   150.72    40.44   67.07   113.41    68.99   113.52  299.38
All                44.36   84.66   155.48    46.33   70.72   104.61    69.00   98.76   209.22
Change (%)         0.82    0.21    0.81      1.35    9.67    12.63     3.74    11.95   20.92

MPI_File_write_all()
None               26.65   51.64   101.61    39.58   76.12   124.45    71.37   100.51  137.85
POSIX              26.17   51.95   101.08    38.59   68.44   131.63    70.59   100.93  135.17
All                26.99   50.66   99.95     38.11   64.32   127.30    70.70   100.09  139.02
Change (%)         1.28    1.92    1.64      3.70    15.50   2.29      0.95    0.42    0.85

MPI_File_write_at_all()
None               12.86   26.00   55.31     37.61   70.60   129.59    62.16   70.12   97.41
POSIX              11.81   28.27   55.39     36.64   63.91   109.65    61.48   70.87   98.30
All                13.20   27.08   54.51     36.27   73.06   140.14    60.78   72.06   96.96
Change (%)         2.59    4.16    1.44      3.57    3.48    8.14      2.22    2.77    0.46

MPI_File_read()
None               7.46    12.17   25.30     1.79    7.91    7.87      48.09   57.67   183.73
POSIX              6.73    11.99   25.10     5.43    7.64    7.92      48.45   57.27   190.51
All                7.05    12.46   26.10     2.36    5.00    12.82     48.41   57.60   179.90
Change (%)         5.52    2.37    3.16      32.11   36.78   62.98     0.67    0.13    2.08

MPI_File_read_all()
None               21.64   26.29   41.45     15.67   28.06   53.38     57.86   68.32   205.37
POSIX              19.85   27.17   42.31     18.63   24.73   58.97     56.27   64.52   211.50
All                20.67   26.62   41.24     16.99   25.74   63.82     58.97   68.57   209.91
Change (%)         4.50    1.22    0.51      8.43    8.29    19.56     1.91    0.37    2.21

MPI_File_read_at_all()
None               2.60    3.76    6.42      4.94    4.55    6.55      36.89   38.89   43.20
POSIX              2.33    3.79    5.87      3.54    4.72    5.56      36.45   39.97   44.06
All                2.14    3.70    5.60      4.83    6.26    5.31      38.53   40.63   45.33
Change (%)         17.58   1.78    12.74     2.21    37.58   18.96     4.44    4.47    4.93

Table A.2: Average time (s) to perform one hundred 4 MB operations: without RIOT, with only POSIX tracing and with complete MPI and POSIX RIOT tracing. The change in time is shown between full RIOT tracing and no RIOT tracing.
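The overheads shown in Table A.2 come from RIOT wrapping each I/O call. As background, the sketch below shows one generic way an MPI routine can be interposed on and timed through the standard PMPI profiling interface; it is illustrative only (the symbol names and output format are not RIOT's), and the buffer argument follows the MPI-2.2 signature cited in [86].

/* Illustrative PMPI interposition sketch (not RIOT's implementation):
 * the wrapper is linked ahead of the MPI library, times the call, and
 * forwards to the real routine via its PMPI_ name. */
#include <mpi.h>
#include <stdio.h>

int MPI_File_write(MPI_File fh, void *buf, int count,
                   MPI_Datatype datatype, MPI_Status *status)
{
    double start = MPI_Wtime();
    int rc = PMPI_File_write(fh, buf, count, datatype, status); /* real call */
    double elapsed = MPI_Wtime() - start;

    fprintf(stderr, "MPI_File_write: %d elements in %.6f s\n", count, elapsed);
    return rc;
}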

APPENDIX B

Numeric Data for Perceived and Effective Bandwidth

         Perceived MPI                     Effective MPI                 Effective POSIX
Cores    B/W       95% CI                  B/W     95% CI                B/W     95% CI

Minerva
12       34.920    (26.942, 42.899)        2.907   (2.243, 3.571)        11.105  (8.640, 13.571)
24       44.742    (41.636, 47.848)        1.806   (1.702, 1.911)        4.916   (4.234, 5.597)
48       57.940    (51.178, 64.702)        1.064   (0.993, 1.134)        2.335   (1.737, 2.934)
96       65.293    (58.449, 72.137)        0.612   (0.568, 0.656)        1.077   (0.893, 1.261)
192      62.171    (56.461, 67.882)        0.314   (0.285, 0.343)        0.524   (0.505, 0.543)
384      61.051    (51.638, 70.464)        0.155   (0.131, 0.178)        0.244   (0.208, 0.281)

Sierra
12       46.988    (45.605, 48.371)        3.914   (3.799, 4.030)        4.173   (4.047, 4.299)
24       60.227    (57.508, 62.945)        2.507   (2.394, 2.621)        2.693   (2.578, 2.809)
48       71.567    (67.811, 75.323)        1.489   (1.411, 1.568)        1.590   (1.507, 1.674)
96       73.738    (67.218, 80.259)        0.767   (0.699, 0.834)        0.836   (0.757, 0.915)
192      108.503   (101.225, 115.780)      0.559   (0.521, 0.596)        0.620   (0.577, 0.663)
384      163.115   (155.688, 170.541)      0.413   (0.394, 0.433)        0.513   (0.469, 0.557)
768      177.109   (170.777, 183.441)      0.225   (0.219, 0.232)        0.273   (0.273, 0.274)
1536     159.163   (152.303, 166.023)      0.102   (0.098, 0.107)        0.121   (0.119, 0.122)

BG/P
32       141.436   (135.834, 147.037)      4.802   (4.611, 4.992)        29.302  (28.141, 30.462)
64       255.139   (245.034, 265.244)      4.328   (4.157, 4.500)        26.960  (25.892, 28.028)
128      424.403   (407.594, 441.211)      3.583   (3.441, 3.725)        23.886  (22.940, 24.832)
256      502.093   (482.207, 521.978)      2.077   (1.995, 2.159)        12.171  (11.689, 12.653)
512      490.853   (471.412, 510.293)      1.016   (0.975, 1.056)        5.079   (4.877, 5.280)
1024     246.878   (237.101, 256.656)      0.250   (0.240, 0.260)        2.398   (2.303, 2.493)

Table B.1: Perceived MPI, effective MPI and effective POSIX bandwidths for IOR through HDF-5 on Minerva, Sierra and BG/P.


         Perceived MPI                     Effective MPI                  Effective POSIX
Cores    B/W       95% CI                  B/W      95% CI                B/W     95% CI

Minerva
12       127.875   (119.601, 136.148)      10.655   (9.966, 11.345)       18.465  (16.925, 20.004)
24       109.681   (96.938, 122.424)       4.569    (4.039, 5.100)        9.175   (8.112, 10.239)
48       131.647   (119.650, 143.645)      2.742    (2.493, 2.992)        6.139   (5.394, 6.883)
96       142.439   (130.622, 154.256)      1.483    (1.360, 1.606)        3.201   (3.043, 3.360)
192      142.493   (131.716, 153.270)      0.742    (0.686, 0.798)        1.546   (1.433, 1.659)
384      164.584   (157.378, 171.790)      0.429    (0.410, 0.447)        0.874   (0.844, 0.903)

Sierra
12       207.496   (159.452, 255.540)      17.290   (13.287, 21.293)      33.002  (25.311, 40.692)
24       220.744   (163.499, 277.989)      9.197    (6.812, 11.581)       19.090  (13.827, 24.354)
48       205.559   (164.239, 246.880)      4.282    (3.422, 5.143)        10.768  (8.950, 12.586)
96       227.694   (225.807, 229.580)      2.371    (2.349, 2.392)        7.015   (6.333, 7.698)
192      234.375   (209.612, 259.138)      1.221    (1.092, 1.350)        3.714   (3.545, 3.883)
384      274.170   (241.217, 307.124)      0.714    (0.628, 0.800)        2.532   (2.187, 2.876)
768      273.953   (256.134, 291.771)      0.357    (0.333, 0.380)        1.327   (1.251, 1.403)
1536     308.307   (285.644, 330.969)      0.201    (0.186, 0.215)        0.622   (0.592, 0.652)

BG/P
32       141.099   (135.511, 146.687)      4.421    (4.246, 4.596)        30.101  (28.909, 31.293)
64       247.044   (237.260, 256.829)      3.870    (3.717, 4.023)        26.799  (25.738, 27.860)
128      402.305   (386.371, 418.238)      3.150    (3.025, 3.274)        23.680  (22.742, 24.617)
256      472.961   (454.229, 491.693)      1.851    (1.777, 1.924)        12.405  (11.914, 12.896)
512      483.548   (464.397, 502.699)      0.949    (0.911, 0.986)        5.659   (5.435, 5.883)
1024     245.878   (236.140, 255.616)      0.240    (0.231, 0.250)        2.298   (2.207, 2.389)

Table B.2: Perceived MPI, effective MPI and effective POSIX bandwidths for IOR through MPI-IO on Minerva, Sierra and BG/P.

         Perceived MPI                     Effective MPI                  Effective POSIX
Cores    B/W       95% CI                  B/W      95% CI                B/W     95% CI

Minerva
12       47.618    (44.166, 51.070)        3.965    (3.678, 4.252)        12.387  (8.558, 16.217)
24       41.117    (37.282, 44.952)        1.693    (1.543, 1.844)        3.181   (3.041, 3.322)
48       58.350    (51.025, 65.674)        1.205    (1.054, 1.357)        1.840   (1.587, 2.093)
96       64.538    (50.901, 78.176)        0.663    (0.520, 0.806)        0.971   (0.767, 1.176)
192      80.422    (74.442, 86.401)        0.412    (0.382, 0.443)        0.578   (0.531, 0.625)
384      80.507    (75.378, 85.635)        0.206    (0.190, 0.222)        0.292   (0.257, 0.328)

Sierra
12       147.348   (142.026, 152.669)      10.616   (10.080, 11.152)      27.228  (23.044, 31.412)
24       134.902   (119.046, 150.757)      5.511    (4.890, 6.132)        12.541  (10.025, 15.058)
48       154.352   (145.217, 163.488)      3.098    (2.903, 3.293)        5.483   (5.281, 5.685)
96       135.228   (128.708, 141.748)      1.385    (1.322, 1.447)        2.218   (2.151, 2.285)
192      132.892   (127.572, 138.212)      0.685    (0.658, 0.712)        1.179   (1.152, 1.206)
384      125.428   (121.639, 129.217)      0.323    (0.313, 0.333)        0.528   (0.507, 0.548)
768      134.317   (126.149, 142.485)      0.172    (0.161, 0.182)        0.246   (0.238, 0.255)
1536     129.882   (118.563, 141.200)      0.084    (0.079, 0.090)        0.106   (0.100, 0.112)

BG/P
32       172.849   (166.003, 179.695)      5.635    (5.412, 5.858)        28.985  (27.837, 30.133)
64       278.380   (267.355, 289.405)      4.515    (4.336, 4.694)        23.845  (22.901, 24.790)
128      361.356   (347.044, 375.667)      2.903    (2.788, 3.018)        19.857  (19.070, 20.643)
256      555.880   (533.864, 577.896)      2.220    (2.132, 2.308)        15.836  (15.209, 16.463)
512      579.125   (556.188, 602.061)      1.151    (1.106, 1.197)        7.332   (7.041, 7.622)
1024     213.647   (205.185, 222.109)      0.210    (0.201, 0.218)        2.148   (2.063, 2.233)

Table B.3: Perceived MPI, effective MPI and effective POSIX bandwidths for FLASH-IO through HDF-5 on Minerva, Sierra and BG/P.


         Perceived MPI                     Effective MPI                Effective POSIX
Cores    B/W       95% CI                  B/W      95% CI              B/W       95% CI

Minerva
1        680.528   (645.412, 715.643)      680.528  (645.412, 715.643)  1134.095  (1041.052, 1227.139)
4        375.478   (333.395, 417.561)      94.264   (83.570, 104.958)   605.905   (499.718, 712.092)
16       252.887   (203.750, 302.025)      15.883   (12.795, 18.972)    218.024   (170.025, 266.023)
64       233.456   (214.638, 252.275)      3.651    (3.356, 3.946)      80.091    (71.018, 89.163)
256      173.696   (163.691, 183.700)      0.678    (0.639, 0.718)      19.049    (17.748, 20.349)

Sierra
1        220.643   (210.264, 231.022)      220.643  (210.264, 231.022)  235.509   (222.126, 248.892)
4        210.142   (191.608, 228.675)      52.576   (47.922, 57.229)    294.799   (265.016, 324.582)
16       212.486   (164.552, 260.420)      13.346   (10.324, 16.368)    155.754   (123.069, 188.439)
64       126.102   (120.729, 131.474)      1.970    (1.887, 2.054)      41.970    (38.573, 45.368)
256      115.191   (107.091, 123.291)      0.450    (0.418, 0.482)      7.977     (7.511, 8.443)
1024     96.632    (88.439, 104.825)       0.094    (0.086, 0.102)      1.555     (1.443, 1.667)

BG/P
16       84.149    (80.816, 87.482)        5.373    (5.161, 5.586)      132.102   (126.870, 137.334)
64       151.457   (145.458, 157.455)      2.877    (2.763, 2.991)      49.056    (47.113, 50.999)
256      278.103   (267.089, 289.118)      8.578    (8.238, 8.917)      21.102    (20.266, 21.938)
1024     504.126   (484.160, 524.092)      3.447    (3.310, 3.583)      8.746     (8.400, 9.093)

Table B.4: Perceived MPI, effective MPI and effective POSIX bandwidths for BT class C on Minerva, Sierra and BG/P.

         Perceived MPI                     Effective MPI                Effective POSIX
Cores    B/W       95% CI                  B/W      95% CI              B/W       95% CI

Minerva
16       304.569   (198.904, 410.234)      19.044   (12.338, 25.750)    444.626   (394.055, 495.197)
64       437.006   (431.559, 442.452)      6.830    (6.745, 6.916)      167.190   (153.586, 180.794)
256      382.567   (364.322, 400.813)      1.494    (1.423, 1.565)      47.438    (43.525, 51.351)

Sierra
16       155.918   (146.865, 164.971)      9.760    (8.629, 10.892)     122.848   (116.058, 129.637)
64       135.191   (127.960, 142.422)      2.113    (2.000, 2.226)      59.092    (53.286, 64.898)
256      141.174   (136.546, 145.803)      0.551    (0.533, 0.570)      13.832    (13.603, 14.061)
1024     133.270   (130.813, 135.728)      0.130    (0.128, 0.133)      2.874     (2.698, 3.049)
4096     118.793   (109.880, 127.705)      0.029    (0.027, 0.031)      0.585     (0.543, 0.628)

Table B.5: Perceived MPI, effective MPI and effective POSIX bandwidths for BT class D on Minerva and Sierra.

APPENDIX C

FLASH-IO Analysis and Optimisation Data

Cores                     12        24        48        96         192        384
Data Written (MB)         2460.540  4921.067  9842.121  19684.229  39368.445  78736.878
MPI_File_write() calls    373       733       1453      2893       5773       11533
POSIX write() calls       5173      10333     20653     41293      82573      165133
POSIX read() calls        5088      10176     20352     40704      81408      162816
Locks requested           5088      10176     20352     40704      81408      162816
MPI write time (s)        623.911   2929.945  8320.767  31598.843  95974.556  384706.897
POSIX write time (s)      218.345   1550.154  5467.534  21533.644  68600.040  274331.768
POSIX read time (s)       220.656   885.581   2474.529  9005.330   26156.646  108415.408
Lock time (s)             183.823   485.925   374.036   1050.601   1199.257   1922.617
Unlock time (s)           0.100     6.292     0.861     1.698      3.439      6.732

Table C.1: MPI and POSIX function statistics for FLASH-IO on Minerva.

Cores                     12        24        48        96
Data Written (MB)         2460.540  4921.067  9842.121  19684.229
MPI_File_write() calls    373       743       1453      2893
POSIX write() calls       5173      10333     20653     41293
POSIX read() calls        5088      10176     20352     40704
Locks requested           5088      10176     20352     40704
MPI write time (s)        232.368   905.980   3190.456  14248.187
POSIX write time (s)      2460.540  4921.067  9842.121  19684.229
POSIX read time (s)       2118.813  4491.588  9108.378  18388.578
Lock time (s)             2.492     5.993     12.581    26.313
Unlock time (s)           2.486     5.953     11.853    21.604

Table C.2: MPI and POSIX function statistics for FLASH-IO on Sierra 12 to 96 cores.

Cores                     192        384         768         1536
Data Written (MB)         39368.445  78736.878   157473.742  314947.472
MPI_File_write() calls    7633       15203       27112       46942
POSIX write() calls       82573      165133      330252      660492
POSIX read() calls        81408      162816      325632      651264
Locks requested           81408      162816      325632      651264
MPI write time (s)        57538.796  244006.624  918510.086  3777346.881
POSIX write time (s)      39368.445  78736.878   157473.742  314947.472
POSIX read time (s)       38071.639  75773.789   155339.023  313195.438
Lock time (s)             74.785     198.840     615.579     2409.944
Unlock time (s)           64.528     121.028     304.644     1057.103

Table C.3: MPI and POSIX function statistics for FLASH-IO on Sierra 192 to 1536 cores.


Cores                     32        64         128        256        512         1024
Data Written (MB)         6558.887  13120.292  26243.103  52488.725  104979.968  209962.454
MPI_File_write() calls    1011      1971       3891       7731       15411       30771
POSIX write() calls       1971      3891       7731       15411      30771       61491
POSIX read() calls        0         0          0          0          0           0
Locks requested           1971      3891       7731       15411      30771       61491
MPI write time (s)        1163.928  2905.728   9039.741   23642.281  91188.504   1001093.819
POSIX write time (s)      226.287   550.226    1321.619   3314.475   14318.643   97742.674
POSIX read time (s)       0.000     0.000      0.000      0.000      0.000       0.000
Lock time (s)             24.593    47.661     103.852    136.110    251.895     1012.799
Unlock time (s)           3.877     8.003      17.224     22.775     27.985      49.410

Table C.4: MPI and POSIX function statistics for FLASH-IO on BG/P.

          Original                       DS off                         DS off, CB on
Cores     B/W       95% CI               B/W       95% CI               B/W       95% CI

Minerva
12        119.725   (115.730, 123.720)   259.342   (251.667, 267.017)   231.090   (187.192, 274.989)
24        126.640   (110.132, 143.148)   248.891   (217.405, 280.377)   262.451   (252.218, 272.684)
48        129.594   (120.512, 138.676)   260.645   (244.414, 276.876)   283.877   (271.360, 296.393)
96        141.710   (138.635, 144.785)   266.396   (241.769, 291.023)   302.803   (289.431, 316.174)
192       141.796   (130.662, 152.930)   265.169   (254.324, 276.014)   309.424   (287.468, 331.380)
384       137.442   (133.356, 141.528)   268.473   (256.129, 280.816)   310.508   (302.516, 318.501)

Sierra
12        268.676   (238.434, 298.918)   352.054   (330.295, 373.813)   334.271   (320.138, 348.404)
24        199.376   (175.188, 223.564)   276.024   (258.802, 293.246)   344.338   (316.827, 371.849)
48        222.221   (220.018, 224.423)   318.886   (306.690, 331.081)   404.295   (385.334, 423.256)
96        240.606   (217.310, 263.902)   362.012   (336.092, 387.932)   471.701   (448.554, 494.848)
192       298.036   (291.031, 305.040)   467.650   (436.188, 499.112)   549.125   (523.263, 574.986)
384       275.910   (261.112, 290.708)   501.472   (463.454, 539.490)   448.509   (427.101, 469.917)

Table C.5: FLASH-IO performance on Minerva and Sierra with collective buffering and data sieving optimisation options.
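The "DS off" and "DS off, CB on" columns in Table C.5 correspond to ROMIO's data-sieving and collective-buffering controls, which are normally toggled through MPI-IO hints. The fragment below is a minimal sketch of passing such hints at file-open time; the file name and communicator are placeholders, and the exact hint values used for the runs above are not reproduced here.

/* Sketch: disable ROMIO data sieving and enable collective buffering for
 * writes via MPI-IO hints. The file name here is a placeholder. */
#include <mpi.h>

MPI_File open_with_hints(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable"); /* data sieving off        */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* collective buffering on */

    char fname[] = "flash_io_chkpt";                 /* placeholder file name   */
    MPI_File fh;
    MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}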

APPENDIX D

LDPLFS Source Code Examples

ssize_t write(int fd, const void *buf, size_t count) {
    ssize_t ret;

    // check if fd is a plfs file or a normal file
    if (plfs_files.find(fd) != plfs_files.end()) {
        // if the file is a plfs file, find its current virtual offset
        off_t offset = lseek(fd, 0, SEEK_CUR);
        // perform plfs write operation
        ret = plfs_write(plfs_files.find(fd)->second->fd,
                         (const char *) buf, count, offset, getpid());

        // update the virtual offset
        lseek(fd, ret, SEEK_CUR);
    } else {
        // perform a standard write on a normal file
        ret = __real_write(fd, buf, count);
    }
    return ret;
}

Listing D.1: Source code for the write() function in LDPLFS.

int close(int fd) {
    // check if fd is a plfs file or not
    if (plfs_files.find(fd) != plfs_files.end()) {
        // only close the PLFS file if fd hasn't been duplicated
        if (!isDuplicated(fd)) {
            // close the file and remove it from the book keeping
            plfs_close(plfs_files.find(fd)->second->fd, getpid(), getuid(),
                       plfs_files.find(fd)->second->mode, NULL);
            delete plfs_files.find(fd)->second->path;
            delete plfs_files.find(fd)->second;
        }
        // remove fd from book keeping
        plfs_files.erase(fd);
    }
    // close either a real file or the 'virtual file'
    int ret = __real_close(fd);
    return ret;
}

Listing D.2: Source code for the close() function in LDPLFS.


int open(const char *path, int flags, ...) {
    int ret;
    char *cpath = resolvePath(path);

    // determine if the path given is in a plfs mount
    if (is_plfs_path(cpath)) {
        mode_t mode;

        if ((flags & O_CREAT) == O_CREAT) {
            va_list argf;
            va_start(argf, flags);
            mode = va_arg(argf, mode_t);
            va_end(argf);
        } else {
            int m = plfs_mode(cpath, &mode);
        }

        // create a plfs file pointer
        plfs_file *tmp = new plfs_file();

        // open the given file using the plfs open command
        int err = plfs_open(&(tmp->fd), cpath, flags, getpid(), mode, NULL);
        // in the event of an error, set errno correctly
        if (err != 0) {
            errno = -err;  // invert error code correctly
            ret = -1;
            delete tmp;
        } else {
            // create a tmp file to store seek information
            ret = fileno(__real_tmpfile());

            tmp->path = new std::string(cpath);
            tmp->mode = flags;

            // add the pairing to a hash table
            plfs_files.insert(std::pair<int, plfs_file *>(ret, tmp));
        }
    } else {
        // treat as a standard file
        if ((flags & O_CREAT) == O_CREAT) {
            va_list argf;
            va_start(argf, flags);
            mode_t mode = va_arg(argf, mode_t);
            va_end(argf);
            ret = __real_open(path, flags, mode);
        } else {
            ret = __real_open(path, flags);
        }
    }

    free(cpath);

    return ret;
}

Listing D.3: Source code for the open() function in LDPLFS.
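The listings above call __real_open(), __real_write(), __real_close() and __real_tmpfile() to reach the underlying C library, but these excerpts do not show how those entry points are obtained. A common pattern for an LD_PRELOAD-style interposition library is to look up the next definition of each symbol with dlsym(RTLD_NEXT, ...); the GNU linker's --wrap option, whose __real_ naming these excerpts echo, is another route. The sketch below illustrates the dlsym pattern only and does not reproduce LDPLFS's actual bootstrap code.

// Illustrative only: resolving the "real" libc functions for an LD_PRELOAD
// shim. The typedefs and initialiser shown here are a sketch, not LDPLFS.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>

typedef ssize_t (*real_write_t)(int, const void *, size_t);
typedef int     (*real_close_t)(int);

static real_write_t __real_write = 0;
static real_close_t __real_close = 0;

// Runs automatically when the shared object is loaded; each dlsym(RTLD_NEXT,
// ...) call returns the next definition of the symbol after this library.
__attribute__((constructor))
static void resolve_real_symbols(void)
{
    __real_write = (real_write_t) dlsym(RTLD_NEXT, "write");
    __real_close = (real_close_t) dlsym(RTLD_NEXT, "close");
}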

APPENDIX E

LDPLFS Numeric Data

Bandwidth (B/W, with 95% CI) for the standard ad_ufs driver and for PLFS via FUSE, ad_plfs and LDPLFS:

Read
Nodes 1:   ad_ufs 174.153 (153.315, 194.991);  FUSE 132.702 (122.900, 142.503);  ad_plfs 184.471 (169.849, 199.094);  LDPLFS 170.047 (149.238, 190.856)
Nodes 2:   ad_ufs 180.987 (176.741, 185.234);  FUSE 173.380 (160.506, 186.254);  ad_plfs 194.687 (179.015, 210.360);  LDPLFS 193.223 (172.575, 213.871)
Nodes 4:   ad_ufs 200.636 (194.517, 206.756);  FUSE 205.273 (195.698, 214.848);  ad_plfs 216.168 (209.079, 223.257);  LDPLFS 209.455 (202.026, 216.884)
Nodes 8:   ad_ufs 204.591 (199.800, 209.381);  FUSE 216.165 (212.248, 220.083);  ad_plfs 216.047 (212.250, 219.845);  LDPLFS 218.693 (215.405, 221.981)
Nodes 16:  ad_ufs 197.569 (179.152, 215.987);  FUSE 200.321 (176.539, 224.103);  ad_plfs 204.896 (182.338, 227.453);  LDPLFS 217.068 (215.258, 218.878)
Nodes 32:  ad_ufs 200.377 (188.224, 212.531);  FUSE 193.445 (176.928, 209.962);  ad_plfs 195.587 (179.690, 211.485);  LDPLFS 208.442 (192.302, 224.583)
Nodes 64:  ad_ufs 192.005 (185.070, 198.940);  FUSE 202.738 (198.525, 206.951);  ad_plfs 201.505 (192.903, 210.107);  LDPLFS 215.595 (213.859, 217.331)

Write
Nodes 1:   ad_ufs 115.951 (104.492, 127.410);  FUSE 62.947 (58.815, 67.079);     ad_plfs 138.206 (117.090, 159.322);  LDPLFS 106.517 (70.634, 142.399)
Nodes 2:   ad_ufs 131.026 (103.939, 158.114);  FUSE 91.774 (64.647, 118.901);    ad_plfs 148.143 (138.184, 158.103);  LDPLFS 155.341 (136.122, 174.559)
Nodes 4:   ad_ufs 123.323 (96.784, 149.862);   FUSE 108.932 (103.186, 114.677);  ad_plfs 131.954 (126.243, 137.666);  LDPLFS 136.569 (131.707, 141.431)
Nodes 8:   ad_ufs 153.378 (127.400, 179.357);  FUSE 124.776 (98.018, 151.533);   ad_plfs 176.427 (169.889, 182.965);  LDPLFS 153.583 (125.570, 181.596)
Nodes 16:  ad_ufs 167.082 (162.040, 172.124);  FUSE 139.542 (122.410, 156.675);  ad_plfs 184.252 (178.311, 190.192);  LDPLFS 159.486 (138.883, 180.088)
Nodes 32:  ad_ufs 145.436 (136.966, 153.907);  FUSE 137.018 (127.210, 146.826);  ad_plfs 183.427 (178.636, 188.219);  LDPLFS 173.023 (161.362, 184.684)
Nodes 64:  ad_ufs 154.127 (141.509, 166.745);  FUSE 122.871 (120.504, 125.237);  ad_plfs 186.549 (181.085, 192.013);  LDPLFS 190.150 (186.756, 193.544)

Table E.1: Read and write performance of PLFS through FUSE, the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Minerva, using 1 core per node.


Bandwidth (B/W, with 95% CI) for the standard ad_ufs driver and for PLFS via FUSE, ad_plfs and LDPLFS:

Read
Nodes 1:   ad_ufs 76.458 (69.306, 83.609);     FUSE 51.996 (50.600, 53.393);     ad_plfs 169.281 (161.425, 177.137);  LDPLFS 178.569 (172.671, 184.466)
Nodes 2:   ad_ufs 112.283 (105.157, 119.410);  FUSE 79.251 (76.315, 82.188);     ad_plfs 205.034 (197.616, 212.453);  LDPLFS 207.050 (192.733, 221.367)
Nodes 4:   ad_ufs 140.725 (136.685, 144.765);  FUSE 118.506 (112.562, 124.450);  ad_plfs 216.213 (212.995, 219.431);  LDPLFS 213.887 (208.873, 218.900)
Nodes 8:   ad_ufs 151.341 (144.449, 158.233);  FUSE 156.072 (151.289, 160.855);  ad_plfs 215.321 (211.779, 218.864);  LDPLFS 219.013 (217.735, 220.290)
Nodes 16:  ad_ufs 164.695 (156.390, 172.999);  FUSE 182.058 (180.030, 184.087);  ad_plfs 213.500 (205.120, 221.881);  LDPLFS 214.439 (211.739, 217.139)
Nodes 32:  ad_ufs 160.321 (145.501, 175.140);  FUSE 191.395 (182.814, 199.976);  ad_plfs 218.613 (217.969, 219.256);  LDPLFS 218.425 (216.898, 219.953)
Nodes 64:  ad_ufs 173.786 (163.189, 184.383);  FUSE 193.368 (182.256, 204.480);  ad_plfs 219.345 (216.882, 221.809);  LDPLFS 217.369 (215.761, 218.977)

Write
Nodes 1:   ad_ufs 58.826 (43.887, 73.766);     FUSE 59.829 (56.688, 62.971);     ad_plfs 131.404 (117.308, 145.500);  LDPLFS 100.627 (91.149, 110.104)
Nodes 2:   ad_ufs 51.981 (42.369, 61.594);     FUSE 69.101 (65.868, 72.335);     ad_plfs 91.449 (77.016, 105.882);    LDPLFS 92.688 (89.200, 96.177)
Nodes 4:   ad_ufs 75.159 (68.094, 82.225);     FUSE 86.946 (84.810, 89.081);     ad_plfs 109.437 (100.339, 118.535);  LDPLFS 103.375 (94.687, 112.063)
Nodes 8:   ad_ufs 82.866 (76.931, 88.801);     FUSE 105.507 (102.064, 108.949);  ad_plfs 133.665 (125.743, 141.587);  LDPLFS 125.284 (122.299, 128.269)
Nodes 16:  ad_ufs 131.388 (123.676, 139.099);  FUSE 113.819 (112.163, 115.475);  ad_plfs 163.389 (157.067, 169.711);  LDPLFS 140.184 (138.104, 142.263)
Nodes 32:  ad_ufs 128.644 (104.467, 152.821);  FUSE 113.880 (107.410, 120.351);  ad_plfs 174.800 (161.922, 187.679);  LDPLFS 158.293 (156.781, 159.805)
Nodes 64:  ad_ufs 160.263 (151.144, 169.382);  FUSE 90.344 (88.655, 92.033);     ad_plfs 184.219 (171.440, 196.998);  LDPLFS 163.318 (161.144, 165.492)

Table E.2: Read and write performance of PLFS through FUSE, the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Minerva, using 2 cores per node.


Bandwidth (B/W, with 95% CI) for the standard ad_ufs driver and for PLFS via FUSE, ad_plfs and LDPLFS:

Read
Nodes 1:   ad_ufs 87.556 (74.016, 101.097);    FUSE 72.647 (68.509, 76.785);     ad_plfs 159.387 (150.973, 167.800);  LDPLFS 132.534 (98.890, 166.178)
Nodes 2:   ad_ufs 110.367 (99.314, 121.420);   FUSE 113.900 (110.046, 117.754);  ad_plfs 185.589 (163.692, 207.486);  LDPLFS 174.380 (146.030, 202.730)
Nodes 4:   ad_ufs 131.679 (115.139, 148.218);  FUSE 151.274 (141.944, 160.604);  ad_plfs 209.524 (202.633, 216.416);  LDPLFS 186.016 (166.381, 205.652)
Nodes 8:   ad_ufs 149.701 (139.861, 159.542);  FUSE 115.713 (112.006, 119.420);  ad_plfs 212.314 (206.274, 218.354);  LDPLFS 189.479 (170.760, 208.198)
Nodes 16:  ad_ufs 163.049 (156.890, 169.208);  FUSE 175.760 (171.247, 180.272);  ad_plfs 210.180 (204.446, 215.914);  LDPLFS 202.981 (192.125, 213.837)
Nodes 32:  ad_ufs 171.889 (163.390, 180.387);  FUSE 165.813 (153.764, 177.862);  ad_plfs 209.694 (204.347, 215.040);  LDPLFS 198.946 (186.768, 211.124)
Nodes 64:  ad_ufs 165.700 (147.207, 184.194);  FUSE 167.750 (159.057, 176.444);  ad_plfs 215.947 (212.615, 219.280);  LDPLFS 206.272 (199.712, 212.833)

Write
Nodes 1:   ad_ufs 88.878 (79.106, 98.650);     FUSE 58.888 (50.195, 67.580);     ad_plfs 118.508 (98.583, 138.433);   LDPLFS 89.247 (66.860, 111.635)
Nodes 2:   ad_ufs 101.847 (84.651, 119.043);   FUSE 62.316 (60.470, 64.162);     ad_plfs 132.943 (100.633, 165.253);  LDPLFS 99.963 (91.592, 108.333)
Nodes 4:   ad_ufs 84.651 (77.047, 92.254);     FUSE 60.640 (47.694, 73.586);     ad_plfs 129.401 (116.887, 141.915);  LDPLFS 108.328 (98.829, 117.827)
Nodes 8:   ad_ufs 82.580 (69.041, 96.118);     FUSE 52.533 (48.067, 56.999);     ad_plfs 146.761 (135.788, 157.733);  LDPLFS 131.950 (117.438, 146.461)
Nodes 16:  ad_ufs 111.891 (104.910, 118.872);  FUSE 76.339 (71.192, 81.485);     ad_plfs 162.167 (153.775, 170.558);  LDPLFS 146.053 (134.681, 157.425)
Nodes 32:  ad_ufs 122.804 (119.148, 126.461);  FUSE 125.411 (122.552, 128.270);  ad_plfs 159.126 (154.316, 163.937);  LDPLFS 147.484 (142.124, 152.844)
Nodes 64:  ad_ufs 131.420 (125.213, 137.628);  FUSE 104.973 (96.186, 113.760);   ad_plfs 176.809 (169.651, 183.966);  LDPLFS 158.409 (155.963, 160.855)

Table E.3: Read and write performance of PLFS through FUSE, the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Minerva, using 4 cores per node.


Bandwidth (B/W, with 95% CI) for the standard ad_ufs driver and for PLFS via ad_plfs and LDPLFS:

Cores 4:     ad_ufs 317.570 (265.909, 369.231);  ad_plfs 330.930 (330.028, 331.832);      LDPLFS 339.713 (313.237, 366.190)
Cores 16:    ad_ufs 449.537 (438.342, 460.731);  ad_plfs 613.103 (565.199, 661.008);      LDPLFS 628.990 (586.299, 671.681)
Cores 64:    ad_ufs 390.245 (353.780, 426.710);  ad_plfs 1653.827 (1546.650, 1761.004);   LDPLFS 1487.917 (1426.418, 1549.415)
Cores 256:   ad_ufs 327.010 (326.834, 327.186);  ad_plfs 3722.670 (3527.516, 3917.824);   LDPLFS 2846.797 (2409.288, 3284.305)
Cores 1024:  ad_ufs 264.995 (255.185, 274.805);  ad_plfs 3021.910 (1929.717, 4114.103);   LDPLFS 3074.595 (2388.605, 3760.585)

Table E.4: Write performance in BT class C for PLFS through the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Sierra.

Bandwidth (B/W, with 95% CI) for the standard ad_ufs driver and for PLFS via ad_plfs and LDPLFS:

Cores 64:    ad_ufs 321.480 (180.948, 462.012);  ad_plfs 1348.010 (1280.434, 1415.586);   LDPLFS 1155.093 (996.537, 1313.650)
Cores 256:   ad_ufs 449.895 (390.521, 509.269);  ad_plfs 2876.855 (2711.829, 3041.881);   LDPLFS 2211.220 (2085.182, 2337.258)
Cores 1024:  ad_ufs 251.987 (239.848, 264.125);  ad_plfs 264.223 (228.386, 300.060);      LDPLFS 231.193 (160.736, 301.651)
Cores 4096:  ad_ufs 284.315 (273.173, 295.457);  ad_plfs 1565.760 (1554.471, 1577.049);   LDPLFS 1548.845 (1526.982, 1570.708)

Table E.5: Write performance in BT class D for PLFS through the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Sierra.

Bandwidth (B/W, with 95% CI) for the standard ad_ufs driver and for PLFS via ad_plfs and LDPLFS:

Nodes 1:     ad_ufs 332.150 (323.457, 340.842);  ad_plfs 282.513 (272.341, 292.685);      LDPLFS 308.306 (293.990, 322.622)
Nodes 2:     ad_ufs 234.704 (207.942, 261.466);  ad_plfs 677.992 (658.070, 697.913);      LDPLFS 563.059 (540.204, 585.914)
Nodes 4:     ad_ufs 323.962 (280.407, 367.516);  ad_plfs 1073.471 (1032.116, 1114.825);   LDPLFS 1023.501 (965.397, 1081.605)
Nodes 8:     ad_ufs 382.921 (368.875, 396.967);  ad_plfs 1468.949 (1405.361, 1532.538);   LDPLFS 1458.562 (1375.307, 1541.817)
Nodes 16:    ad_ufs 406.641 (374.487, 438.794);  ad_plfs 1642.854 (1548.991, 1736.718);   LDPLFS 1632.436 (1575.404, 1689.467)
Nodes 32:    ad_ufs 439.282 (422.463, 456.102);  ad_plfs 1284.342 (1238.790, 1329.894);   LDPLFS 1224.472 (1192.328, 1256.616)
Nodes 64:    ad_ufs 464.171 (427.776, 500.566);  ad_plfs 709.821 (698.345, 721.296);      LDPLFS 714.732 (707.204, 722.260)
Nodes 128:   ad_ufs 477.430 (465.562, 489.298);  ad_plfs 201.701 (95.634, 307.769);       LDPLFS 378.024 (365.059, 390.988)
Nodes 256:   ad_ufs 518.736 (485.697, 551.776);  ad_plfs 183.701 (130.657, 236.746);      LDPLFS 204.371 (195.527, 213.215)

Table E.6: Write performance in FLASH-IO for PLFS through the ad_plfs MPI-IO driver and LDPLFS compared to the standard ad_ufs MPI-IO driver on Sierra.

APPENDIX F

Optimality Search Numeric Data

Stripe          Stripe Count
Size (MB)       1       2       4       8        16       32       64       128       160
1               242.20  313.02  562.23  972.19   1202.30  1932.38  2910.96  4071.30   4074.45
2               216.35  318.84  544.99  936.85   1365.31  2343.35  3899.43  5310.18   6009.99
4               212.47  278.69  591.76  1016.44  1620.91  2916.30  4280.74  5852.94   7956.12
8               222.50  307.35  607.45  1011.24  1786.70  3318.84  4696.76  7106.03   9467.26
16              215.28  347.28  539.68  1135.96  2155.92  3899.88  6152.11  8684.13   11285.07
32              192.90  284.25  654.65  1339.34  2407.39  4051.20  6745.11  11288.24  13884.42
64              216.00  398.92  764.62  1522.98  3057.95  4079.76  7770.03  13621.08  14768.08
128             229.82  387.00  767.69  1492.31  2993.44  4063.87  7114.70  13799.54  15609.38
256             192.05  394.92  808.21  1554.25  3006.16  4250.09  8005.93  13519.03  14753.50

Table F.1: Numerical data for Figure 6.2, displaying bandwidth achieved by IOR on 1,024 cores, while varying Lustre stripe size and stripe count.

         Achieved                          Ideal
Tasks    B/W       95% CI                  B/W       95% CI
1        211.265   (156.310, 266.220)      211.265   (156.310, 266.220)
2        90.777    (76.506, 105.049)       105.632   (78.155, 133.110)
3        84.758    (64.103, 105.413)       70.422    (52.103, 88.740)
4        54.092    (43.168, 65.017)        52.816    (39.077, 66.555)
5        42.952    (35.381, 50.523)        42.253    (31.262, 53.244)
6        36.591    (30.655, 42.528)        35.211    (26.052, 44.370)
7        29.626    (21.640, 37.611)        30.181    (22.330, 38.031)
8        25.393    (24.602, 26.184)        26.408    (19.539, 33.277)
9        21.458    (18.787, 24.129)        23.474    (17.368, 29.580)
10       17.964    (16.566, 19.362)        21.126    (15.631, 26.622)
11       16.608    (15.237, 17.978)        19.206    (14.210, 24.202)
12       15.939    (13.847, 18.030)        17.605    (13.026, 22.185)
13       14.231    (12.610, 15.852)        16.251    (12.024, 20.478)
14       13.299    (12.245, 14.352)        15.090    (11.165, 19.016)
15       12.768    (11.977, 13.559)        14.084    (10.421, 17.748)
16       10.048    (9.446, 10.650)         13.204    (9.769, 16.639)

Table F.2: Numerical data for Figure 6.3, displaying bandwidth per task under contention, along with the idealised values.
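The "Ideal" column in Table F.2 is consistent, row for row, with sharing the single-task bandwidth evenly between the N contending tasks; this is a reading of the numbers above rather than a definition quoted from the main text:

\[
B_{\mathrm{ideal}}(N) \;=\; \frac{B_{\mathrm{achieved}}(1)}{N} \;=\; \frac{211.265}{N},
\qquad \text{e.g. } B_{\mathrm{ideal}}(16) = 211.265/16 \approx 13.204 .
\]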

APPENDIX G

PLFS Performance and Stripe Collision Data

                        Experiment
Collisions      1         2         3         4         5
0               32        32        32        32        32

D_inuse         32        32        32        32        32
D_load          1.00      1.00      1.00      1.00      1.00
BW (MB/s)       424.67    802.4     1175.22   259.56    1102.96

Table G.1: Stripe collision statistics for PLFS backend directory running with 16 cores.
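Across Tables G.1–G.8 the reported D_load equals the total number of PLFS data stripes divided by D_inuse (the number of distinct storage targets holding at least one stripe), and the stripe totals in these runs equal twice the core count; this is inferred from the numbers rather than quoted from the main text:

\[
D_{\mathrm{load}} \;=\; \frac{N_{\mathrm{stripes}}}{D_{\mathrm{inuse}}},
\qquad N_{\mathrm{stripes}} = 2 \times N_{\mathrm{cores}},
\quad \text{e.g. } \frac{2 \times 2048}{480} \approx 8.53 \text{ in Table G.8}.
\]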

                        Experiment
Collisions      1         2         3         4         5
0               60        52        57        64        57
1               2         6         2         0         2
2               0         0         1         0         1

D_inuse         62        58        60        64        60
D_load          1.03      1.10      1.07      1.0       1.07
BW (MB/s)       1052.73   562.23    640.72    736.52    644.43

Table G.2: Stripe collision statistics for PLFS backend directory running with 32 cores.

                        Experiment
Collisions      1         2         3         4         5
0               85        85        101       94        95
1               20        17        12        14        15
2               1         3         1         2         1

D_inuse         106       105       114       110       111
D_load          1.21      1.22      1.12      1.16      1.15
BW (MB/s)       1158.81   2067.78   948.45    3903.84   804.60

Table G.3: Stripe collision statistics for PLFS backend directory running with 64 cores.


                        Experiment
Collisions      1         2         3         4         5
0               155       162       166       153       147
1               40        36        36        41        47
2               7         6         6         7         5
3               0         1         0         0         0

D_inuse         202       205       208       201       199
D_load          1.27      1.25      1.23      1.27      1.29
BW (MB/s)       6799.10   1555.19   1894.95   1904.80   6919.04

Table G.4: Stripe collision statistics for PLFS backend directory running with 128 cores.

                        Experiment
Collisions      1         2         3         4         5
0               192       192       180       175       183
1               110       101       92        99        102
2               18        29        35        37        32
3               9         5         7         7         6
4               2         1         3         0         1
5               0         1         0         0         0

D_inuse         331       329       317       318       324
D_load          1.55      1.56      1.62      1.61      1.58
BW (MB/s)       5403.23   9726.24   5635.64   11539.40  3329.90

Table G.5: Stripe collision statistics for PLFS backend directory running with 256 cores.

                        Experiment
Collisions      1         2         3         4         5
0               121       135       122       116       129
1               134       126       134       129       133
2               97        88        85        94        82
3               49        55        56        45        54
4               21        22        21        20        28
5               6         6         6         12        2
6               1         1         2         1         1
7               0         0         0         0         1
8               0         0         0         1         0

D_inuse         429       433       426       418       430
D_load          2.39      2.36      2.40      2.45      2.38
BW (MB/s)       12062.68  10469.38  10234.97  9768.07   11081.99

Table G.6: Stripe collision statistics for PLFS backend directory running with 512 cores.


                        Experiment
Collisions      1         2         3         4         5
0               32        33        35        36        24
1               53        61        64        59        62
2               97        82        75        80        95
3               90        92        98        84        103
4               82        77        80        76        71
5               53        64        62        59        62
6               34        34        27        40        26
7               20        12        16        27        19
8               9         8         12        3         9
9               3         6         6         5         5
10              2         4         1         1         1

D_inuse         475       473       476       470       477
D_load          4.31      4.33      4.30      4.36      4.29
BW (MB/s)       8524.22   8610.75   8754.29   8536.84   8449.57

Table G.7: Stripe collision statistics for PLFS backend directory running with 1,024 cores.

                        Experiment
Collisions      1         2         3         4         5
0               1         2         0         1         1
1               8         6         4         3         2
2               13        5         6         8         7
3               21        18        18        17        24
4               35        39        40        33        42
5               52        58        50        46        57
6               66        66        62        55        66
7               46        54        66        86        64
8               57        65        71        79        56
9               55        48        52        47        41
10              46        38        36        54        34
11              27        30        25        18        28
12              27        22        28        9         22
13              11        15        14        7         22
14              8         6         7         8         7
15              2         7         1         3         0
16              2         1         0         1         2
17              2         0         0         3         4
18              1         0         0         2         1

D_inuse         480       480       480       480       480
D_load          8.53      8.53      8.53      8.53      8.53
BW (MB/s)       5794.14   5585.98   5800.88   5592.34   5708.71

Table G.8: Stripe collision statistics for PLFS backend directory running with 2,048 cores.

            ad_lustre                              ad_plfs
Cores       B/W        95% CI                      B/W        95% CI
16          403.748    (390.728, 416.768)          752.962    (398.413, 1107.511)
32          404.714    (393.092, 416.336)          727.326    (558.951, 895.701)
64          857.348    (832.819, 881.877)          1776.696   (648.889, 2904.503)
128         1987.512   (1908.244, 2066.780)        3814.616   (1375.185, 6254.047)
256         4354.983   (4288.692, 4421.273)        7126.882   (4159.661, 10094.103)
512         8985.136   (8777.611, 9192.661)        10723.418  (9947.064, 11499.772)
1024        13859.578  (12582.684, 15136.472)      8575.134   (8474.058, 8676.210)
2048        16200.156  (15441.574, 16958.738)      5696.410   (5604.855, 5787.965)
4096        16917.112  (16291.584, 17542.640)      3069.054   (3052.824, 3085.284)

Table G.9: Numeric data for Figure 6.6, showing the performance of IOR through Lustre and PLFS.
