University of Warwick Institutional Repository
Total Page:16
File Type:pdf, Size:1020Kb
CORE Metadata, citation and similar papers at core.ac.uk Provided by Warwick Research Archives Portal Repository University of Warwick institutional repository: http://go.warwick.ac.uk/wrap A Thesis Submitted for the Degree of PhD at the University of Warwick http://go.warwick.ac.uk/wrap/3713 This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself. Please refer to the repository record for this item for information to help you to cite it. Our policy information is available from the repository home page. Addressing Concerns in Performance Prediction: The Impact of Data Dependencies and Denormal Arithmetic in Scientific Codes by Brian Patrick Foley A thesis submitted to the University of Warwick in partial fulfilment of the requirements for admission to the degree of Doctor of Philosophy Department of Computer Science September 2009 Abstract To meet the increasing computational requirements of the scientific community, the use of parallel programming has become commonplace, and in recent years distributed applications running on clusters of computers have become the norm. Both parallel and distributed applications face the problem of predictive uncertainty and variations in runtime. Modern scientific applications have varying I/O, cache, and memory profiles that have significant and difficult to predict effects on their runtimes. Data-dependent sensitivities such as the costs of denormal floating point calculations introduce more variations in runtime, further hindering predictability. Applications with unpredictable performance or which have highly variable run- times can cause several problems. If the runtime of an application is unknown or varies widely, workflow schedulers cannot efficiently allocate them to compute nodes, leading to the under-utilisation of expensive resources. Similarly, a lack of accurate knowledge of the performance of an application on new hardware can lead to misguided procurement decisions. In heavily parallel applications, minor variations in runtime on individual nodes can have disproportionate effects on the overall application runtime. Even on a smaller scale, a lack of certainty about an application’s runtime can preclude its use in real-time or time-critical applications such as clinical diagnosis. This thesis investigates two sources of data-dependent performance variability. The first source is algorithmic and is seen in a state-of-the-art C++ biomedical imaging application. It identifies the cause of the variability in the application and develops a means of characterising the variability. This ‘probe task’ based model is adapted for use with a workflow scheduler, and the scheduling improvements it brings are examined. The second source of variability is more subtle as it is micro-architectural in nature. Depending on the input data, two runs of an application executing exactly the same sequence of instructions and with exactly the same memory access patterns can have large differences in runtime due to deficiencies in common hardware implementations of denormal arithmetic1. An exception-based profiler is written to detect occurrences of denormal arithmetic and it is shown how this is insufficient to isolate the sources of denormal arithmetic in an application. A novel tool based on the Valgrind binary instrumentation framework is developed which can trace the origins of denormal values and the frequency of their occurrence in an application’s data structures. This second tool is used to isolate and remove the cause of denormal arithmetic both from a simple numerical code, and then from a face recognition application. 1Denormal or subnormal numbers are described in Sec. 1.3 and Sec. 4.3 ii Acknowledgements Firstly I’d like to thank my supervisor Stephen Jarvis for overseeing my years at Warwick, and providing encouragement and guidance through academia and the world of High Performance Computing. Thanks to Dan Spooner for many stimulating and varied discussions over the course of three years. A dozen papers could be written on the subjects discussed, and a small fortune spent on the shiny toys from Silicon Valley we examined. Thanks to Paul Isitt for being an accommodating lab-mate, paper co-author and general collaborator as we explored the work of HPSG and the performance pre- diction community. Thanks to Russell Boyatt for providing much useful technical advice, many movie recommendations, and for being an obliging sounding board. I’m extremely grateful to my friends and family for their moral support. Thanks mum and Niamh for your patience, continued support, and encouragement. I wish I could similarly thank dad for setting me on this path in the first place, but sadly he is no longer with us. Thanks to Gordon, Darina, and Patrick for providing levity, suggestions, and pep talks as the occasion required. I’m sure you’ll be glad never to hear the word ‘thesis’ from me again! iii This thesis is dedicated in loving memory of my father, 1952–2009 NON HABERI, SED ESSE iv Declarations This thesis is presented in accordance with the regulations for the degree of Doctor of Philosophy. It has been composed by myself and has not been submitted in any previous application for any degree. The work described in this thesis has been undertaken by myself except where otherwise stated. The following publications relate to the thesis text: Brian P. Foley, Paul J. Isitt, Daniel P. Spooner, Stephen A. Jarvis, and Graham • R. Nudd. Implementing Performance Services in Globus Toolkit v3. In Proceed- ings of the UK Performance Engineering Workshop, July 2–8 2004, Bradford, UK. [Appendix B] Brian P. Foley, Daniel P. Spooner, Paul J. Isitt, Stephen A. Jarvis, and Graham • R. Nudd. Performance Prediction for a Code with Data-dependent Runtimes. Awarded best paper at UK eScience All Hands Meeting, September 19–22 2005, Nottingham, UK. [Chapter 3] Stephen A. Jarvis, Daniel P. Spooner, Brian P. Foley, Paul J. Isitt, Graham R. Nudd. • Predictive Workflow Scheduling for Multi-Clusters and Grids. CORS/INFORMS Joint International Meeting, May 16–19 2004, Banff, Alberta, Canada. [Chapter 3] Paul J. Isitt, Stephen A. Jarvis, Brian P. Foley, Daniel P. Spooner, and Graham • R. Nudd. Towards Performance-aware Real-time Management for On-demand Computing Environments. In the 7th International Workshop on Performability Modeling of Computer and Communication Systems (PMCCS), September 23–24 2005, Torino, Italy. [Chapter 3] Stephen A. Jarvis, Brian P. Foley, Paul J. Isitt, Daniel Ruckert,¨ and Graham R. • Nudd. Performance Prediction for a Code with Data-dependent Runtimes. In Concurrency and Computation: Practice and Experience 19:1–12, 2007. [Chapter 3] Brian P.Foley,and Stephen A. Jarvis. Tracing Denormal Arithmetic with Valgrind. • Submitted to the 37th International Symposium Computer Architecture (ISCA 2010), June 19–23, Saint Malo, France. [Sec. 5], [Chapter 6] The thesis builds on the research on PACE, [Pap95, KHCN96, CKPN99] TITAN, [SCT+02, SJC+03, SKD+03] and performance modelling [Har99, TLHKN02, MJSN06] performed at the High Performance Systems Group. v Contents Abstract ii Acknowledgements iii Declarationsv Contents vi List of Tablesx List of Figures xii Abbreviations xiv Chapter 1 Introduction1 1.1 CPU performance prediction.......................1 1.2 Context and Previous Research......................7 1.3 Thesis Contributions............................8 1.4 Thesis Overview.............................. 10 Chapter 2 Performance prediction and its application 13 2.1 Performance modelling tools....................... 14 2.1.1 SimpleScalar............................. 14 2.1.2 PACE................................. 15 2.1.3 WARPP............................... 17 2.1.4 Prophesy............................... 19 2.1.5 Performance Evaluation Process Algebra (PEPA)........ 21 2.2 Performance monitoring tools....................... 25 2.2.1 Network Weather System..................... 26 2.2.2 MDS................................. 27 2.3 An analytical performance model of Sweep3D............. 31 2.4 Scheduling as an application........................ 35 2.5 Summary................................... 39 Chapter 3 Performance modelling and scheduling of a data-dependent code 41 3.1 nreg Image Registration.......................... 42 vi 3.2 The nreg algorithm............................. 43 3.3 nreg’s computational costs......................... 45 3.4 Parallelising nreg .............................. 47 3.5 Predictive model.............................. 49 3.5.1 Pre-model work parameter.................... 51 3.6 IXI workflows................................ 54 3.7 Incremental prediction........................... 54 3.8 User interaction............................... 56 3.9 Speculative scheduling........................... 57 3.10 Case study.................................. 58 3.11 Summary................................... 62 Chapter 4 Floating point and denormal handling 65 4.1 Fixed point arithmetic........................... 66 4.2 Floating point arithmetic.......................... 68 4.2.1 Scientific notation.......................... 68 4.2.2 Floating point calculations..................... 69 4.3 IEEE-754................................... 70 4.3.1 Gradual underflow......................... 72 4.3.2 Mathematical properties...................... 73 4.3.3 IEEE-754 implementations..................... 74 4.4 DIP: A denormal profiler for Linux x86................. 76 4.4.1 Floating point exceptions on the 80x86.............. 76 4.4.2 Using exception handlers....................