Use of Continuous Integration Tools for Application Performance Monitoring
Verónica G. Vergara Larrea, Wayne Joubert and Chris Fuson
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Oak Ridge, TN, USA
Email: {vergaravg, joubert, fusoncb} at ornl.gov

Abstract—High performance computing systems are becoming increasingly complex, both in node architecture and in the multiple layers of software stack required to compile and run applications. As a consequence, the likelihood is increasing for application performance regressions to occur as a result of routine upgrades of system software components which interact in complex ways. The purpose of this study is to evaluate the effectiveness of continuous integration tools for application performance monitoring on HPC systems. In addition, this paper also describes a prototype system for application performance monitoring based on Jenkins, a Java-based continuous integration tool. The monitoring system described leverages several features in Jenkins to track application performance results over time. Preliminary results and lessons learned from monitoring applications on Cray systems at the Oak Ridge Leadership Computing Facility are presented.

Keywords-continuous integration; application monitoring; supercomputers

I. INTRODUCTION

High performance computing (HPC) systems are growing in complexity. New types of hardware components continue to be added, such as computational accelerators and non-volatile random-access memory (NVRAM). The software to support this additional complexity is also growing. These hardware and software components can interact in complex ways that are not entirely orthogonal in their coupling. For example, an upgraded library could have a higher memory requirement due to new algorithms, which could in turn impact the amount of memory available for message passing interface (MPI) communication buffers.

Applications themselves are also complex and are being combined into intricate workflows. Resource allocations on large HPC systems are costly and continue to be competitively awarded; users' ability to accomplish their research goals is negatively impacted when system performance unexpectedly degrades. The causes of such degradations are not always obvious; large, complex HPC systems often lack transparency to allow users and center staff the ability to understand what issues are impacting performance. Furthermore, while systems become larger and more complex, center staff workforce typically remains a nearly-fixed resource; this points to the need for automated means of monitoring system performance.

At the system administration level, tools are currently available to evaluate system health when such action is needed, for example, after a reboot [1]. However, from the user standpoint, the highest priority is to validate the reliability and performance of the system as experienced by user applications. Of paramount importance to users is that applications run reliably, accurately, and efficiently. To test this, there is no substitute for running the applications themselves under realistic test conditions. Additionally, it is important to test auxiliary aspects of the system which directly impact the user experience. These include performance characteristics of software such as system libraries, system performance characteristics that can be tested using kernel benchmarks, and system configuration issues such as default modules which impact the code development experience of users.
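As an illustration of how such auxiliary configuration checks can be scripted, the following Python sketch compares the currently loaded default modules against an expected baseline. The module names and the use of the LOADEDMODULES environment variable set by the environment-modules system are assumptions made for this example only, not a description of the prototype presented later.

    import os

    # Expected default modules for the system under test (illustrative values only).
    EXPECTED_DEFAULTS = {"PrgEnv-pgi", "cray-mpich", "cray-libsci"}

    def check_default_modules():
        """Flag drift in the default module configuration after a system change."""
        # The environment-modules system exports loaded modules as a colon-separated
        # list, e.g. "PrgEnv-pgi/5.2.40:cray-mpich/7.0.4"; strip versions to compare.
        loaded = {m.split("/")[0]
                  for m in os.environ.get("LOADEDMODULES", "").split(":") if m}
        missing = EXPECTED_DEFAULTS - loaded
        unexpected = loaded - EXPECTED_DEFAULTS
        if missing or unexpected:
            print("Default module drift detected:")
            print("  missing:   ", sorted(missing))
            print("  unexpected:", sorted(unexpected))
            return False
        print("Default modules match the expected configuration.")
        return True

    if __name__ == "__main__":
        raise SystemExit(0 if check_default_modules() else 1)

A check of this kind can run as one step of a periodic test job, with a nonzero exit code signaling configuration drift to whatever harness launched it.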
In the past, efforts have been made to monitor application performance over time on HPC systems, notably with the NERSC Sustained System Performance metric [2] and the Department of Defense's sustained systems performance monitoring effort [3]. Similarly, the goal of the present work is to select a small set of application codes and tests to run periodically over time, e.g., daily, weekly, monthly or quarterly, to detect performance variances.
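To make the notion of a performance variance concrete, the following Python sketch flags a periodic run whose figure of merit deviates from the mean of a recent window of runs by more than a chosen tolerance. The CSV layout, window size, threshold, and file name are illustrative assumptions rather than the policy of the system described in this paper.

    import csv
    import statistics

    def flag_variance(history_csv, window=10, tolerance=0.10):
        """Return True if the newest result deviates from the recent mean by > tolerance.

        history_csv is assumed to hold one row per periodic run:
            date,application,figure_of_merit
        """
        with open(history_csv, newline="") as f:
            runs = [float(row["figure_of_merit"]) for row in csv.DictReader(f)]
        if len(runs) < 2:
            return False  # not enough history to compare against
        baseline = statistics.mean(runs[-(window + 1):-1])  # mean of preceding window
        latest = runs[-1]
        relative_change = (latest - baseline) / baseline
        if abs(relative_change) > tolerance:
            print(f"Variance detected: latest={latest:.3f}, baseline={baseline:.3f} "
                  f"({relative_change:+.1%})")
            return True
        return False

    if __name__ == "__main__":
        flag_variance("minife_weekly.csv")  # hypothetical results file for one test

More robust statistics, such as the median and interquartile range, can be substituted when run-to-run noise is heavy-tailed.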
Whether for the purpose of application performance monitoring over time, or for the similar task of acceptance testing of new systems, it is not uncommon practice for each center to develop in-house tools customized to the specific requirements of the relevant task (see, e.g., [4]). In recent years, however, HPC systems have leveraged not only commodity hardware but also software tools used by the broader software community. Continuous integration (CI) is an increasingly practiced methodology for software development and delivery. To support CI, multiple GUI-based tools have been developed which have been found effective for managing periodic or on-demand software building, testing and reporting of results. Though targeted to a slightly different workflow than HPC application-centric system health monitoring, these tools offer promise for improving productivity for the latter workflows. They also inherit the advantages of off-the-shelf software. Compared to in-house solutions, they are more likely to have a familiar interface that is easily learned by incoming center staff. Furthermore, if well-supported by third parties or the community, there is potentially less need for center staff commitment to support these tools for development, maintenance, debugging, and porting.

The purpose of this paper is to examine popular CI tools to evaluate their appropriateness for use in monitoring application performance over time on HPC systems. The primary focus here is Jenkins [5], for which we describe the implementation of a monitoring system as well as present results and experiences from using this tool.

The remainder of this paper is organized as follows. First, we describe the requirements for an application performance monitoring tool. We then describe the tools evaluated. As a result of this evaluation, we selected Jenkins to use for implementing the application performance monitoring tool. In the following section, we describe the specific implementation of this system and then present results from use of this tool on the Oak Ridge National Laboratory (ORNL) Titan system. As an alternative, we also examine briefly the Splunk data monitoring and analysis tool [7], which does not satisfy all project requirements but has strengths for the reporting component of the targeted workflow. Finally, we present discussion, lessons learned and conclusions.

II. REQUIREMENTS

The overall objective of this work is to deploy a tool usable by center personnel to monitor the performance of applications and kernels, and to validate system configuration over time on an HPC system. This objective can be clarified and made more specific by enumerating a set of user stories to illustrate how such a tool might be used on a routine basis:
• As a center staff member, I want to be able to view the behavior and performance of each of a set of applications over time, so that I can determine quickly whether there has been a performance variance.
• If a variance is observed, I would like to diagnose its cause, e.g., by examining the run logs or correlating the event in time with a known system upgrade, so that remedial action can be taken.
• I would like to easily start, stop, or modify the periodic build and run process for an application (including management of how the tool is run across system reboots) so that the level of reporting can be maintained.

In all of these scenarios, only minimal time and effort should be needed by a user employing the tool on a routine basis during day-to-day operations. The well-known principles of usability engineering are apropos in this context (see, e.g., [8]), leading to design objectives such as minimizing the number of keystrokes or mouse clicks required to accomplish a given task.

With this backdrop, we now enumerate specific requirements. These fall under two categories: requirements for the tool itself as software, and requirements for how the tool must function as part of the application monitoring workflow described above. It is clear that the continuous integration workflow does not exactly match the HPC system application performance monitoring workflow: for the former, the application code itself is the primary point of failure to be tested, whereas with the latter, the unexpected performance variances from the system software stack and hardware are primarily in focus, leading to different failure response actions. Because of this less-than-perfect match, it is not to be expected that CI tools will necessarily be deeply customized to the requirements for an HPC system application performance monitoring tool. Thus, it may be that all requirements are not met entirely, or that some customization of the tool may be needed, via additional code, use of or development of plugins, or modification of the tool source code itself. In each case, an evaluation must be made to determine whether it is better to use the tool under these conditions or to write a better-customized tool entirely from scratch.

In order to evaluate the available continuous integration tools and select a tool for developing the prototype, we created the following list of features describing how the ideal tool would match the application monitoring workflow:
• INTERFACE: provide a graphical user interface (GUI), with support for a command line interface (CLI) as needed.
• JOB MANAGEMENT: allow full job control via the GUI and provide a mechanism to execute jobs on-demand and periodically, with capacity for organizing and grouping tests by objective or by system.
• VISIBILITY: provide an interactive dashboard to view and analyze results in big-picture form with the ability to