Performance Profiling and Debugging on the K Computer

Performance Profiling and Debugging on the K computer Keiichi Ida Yasuyuki Ohno Shunsuke Inoue Kazuo Minami We have developed application-development support tools for the K computer. This paper describes profiling functions for raising the performance of applications and debugging functions for testing applications as main functions of these tools. In developing these tools, we first defined the work procedure that a user would follow for improving the performance of an application and testing it. We then investigated the form that these application-development support tools should take for each task in that procedure. We here introduce these tools in conjunction with those tasks. Additionally, while the large-scale configuration of the K computer is one of its major features, existing profilers and debuggers for large-scale applications still have problems that have yet to be solved, and we here describe new measures for addressing those problems. We also touch upon profiling functions specifically developed for the advanced hardware of the K computer such as the high- performance SPARC64 VIIIfx processor and Tofu interconnect. 1. Introduction presenting our development objectives and then The K computernote)i is a massively parallel introduce various development support tools supercomputer consisting of more than 80 000 corresponding to the defined work procedure. nodes using advanced hardware such as the high-performance SPARC64 VIIIfx processor 2. Development objectives and Tofu interconnect. We have developed We established three major objectives in application-development support tools for this our development of profiling and debugging supercomputer. functions: support large-scale parallel-processing This paper describes the main functions applications and advanced hardware, make use of these support tools: profiling functions of standard interfaces, and achieve advanced for achieving high-performance applications graphical user interface (GUI) functions. and debugging functions for testing those Like the system itself, application software applications. In setting out to develop these for the K computer is large-scale in nature profiling and debugging functions, we first and thus must use the cutting-edge hardware defined the work procedure that a user would making up this supercomputer. For this reason, follow for achieving and debugging a high- our first development objective was to support performance application. We begin this paper by large-scale, parallel-processing applications and new functions geared to various types note)i “K computer” is the English name that RIKEN has been using for the of new and advanced hardware. Though supercomputer of this project since July explained in detail later in this paper, these 2010. “K” comes from the Japanese word functions include profiling and debugging “Kei,” which means ten peta or 10 to the 16th power. functions supporting applications running tens FUJITSU Sci. Tech. J., Vol. 48, No. 3, pp. 331–339 (July 2012) 331 K. Ida et al.: Performance Profiling and Debugging on the K computer of thousands of processes in parallel, a large- improvements in user terminals and is limited in scale data display function for presenting the GUI functions and performance. To resolve these performance information of tens of thousands issues, we adopted Adobe System’s AIR4) runtime of processes, self-check functions for the run environment as a new GUI platform owing to its time system (RTS) library and Message Passing foundation in Rich Internet Applications (RIA) Interface (MPI) library, profiling functions technology and its rapid evolution in the field supporting the performance instrumentation of Web applications. With AIR, an AIR client counters of the SPARC64 VIIIfx processor, and a receives only as much data as needed from the communications-cost display function taking the server and locally performs all processing for Tofu interconnect into account. redrawing—such as when the image viewpoint At the same time, the current environment changes—so that the graphic performance of the surrounding development support tools user’s terminal can be fully exploited. is dominated by a trend toward standard interfaces, the use of open source software for 3. Application performance tool components, and the construction of software improvement steps platforms independent of hardware. With this The significance of a supercomputer lies in in mind, our second development objective was its ability to perform high-speed calculations, and to give due consideration toward open systems determining the performance of an application and implement as much as possible application- running on a supercomputer is therefore an development functions on standard interfaces. essential and important endeavor. Moreover, To be more specific, we adopted the DWARF 21) if the performance so determined turns out to debugging file format as an interface between be inadequate, it is imperative that the source the compiler and various tools, which makes of such performance degradation be analyzed it possible to employ a variety of debug tools and efforts be made to improve the application’s provided in open-source format. Additionally, performance. This work for determining and we adopted the Performance Application improving an application’s performance consists Programming Interface (PAPI)2) as the interface of multiple steps beginning with performance for obtaining information from the SPARC64 characteristics analysis (basic profiling) and VIIIfx performance instrumentation counters continuing with a set of mutually independent that enable various types of CPU performance, performance improvement steps (detailed such as number of executing instructions and number of cache misses, to be precisely measured. Finally, to implement a log output Basic profiler Detailed profiler function supporting various types of events, we adopted VampirTrace3) specifications considering Performance improvement their status as a de facto standard. in CPU-operation processing A high-performance GUI is needed to Performance characteristics Performance improvement display the profiling and debugging information analysis in thread-parallelization processing of large-scale parallel-processing applications. Performance improvement Our third development objective was therefore to in process-parallelization processing achieve advanced GUI functions. Many existing application-development support tools use X Window System (X11) as a GUI, but this system Figure 1 1 cannot easily take advantage of performance Application performance improvement steps. Application performance improvement steps. 332 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) K. Ida et al.: Performance Profiling and Debugging on the K computer profiling), as shown in Figure 1. Among the by making rigorous measurements of three steps in the detailed profiler, we here take the times that CPU processing, thread- up performance improvement in CPU-operation parallelization processing, and process- processing and performance improvement in parallelization processing are executed process-parallelization processing. within a measurement interval in contrast to sampling 3.1 Performance characteristics analysis 3) Individually analyze measured times The objective of this first step is to obtain instead of processing them as statistical a comprehensive understanding of application averages, thereby supporting even cases in performance under normal operating conditions. which performance characteristics change To this end, we developed a “basic profiler” with every time a user function is called. four basic functions. 1) Obtain performance information simply on 3.2 Performance improvement in CPU- the basis of highly optimized object code operation processing without re-compilation The high-performance SPARC64 VIIIfx 2) Obtain basic performance information processor incorporates various types of by simply adding an fipp command to an performance instrumentation counters, which ordinary job-run script can be used to classify the total execution time 3) Display processing conditions in terms of an instruction sequence in terms of CPU of CPU-operation processing, thread- operating states (executing instructions, waiting parallelization processing, and process- for memory access, waiting for an operation to parallelization processing complete, etc.). This method, which is commonly 4) Obtain performance information by called cycle accounting,5) enables developers to sampling run-time data at fixed intervals, determine where in the processor bottlenecks are thereby minimizing and fixing overhead. occurring and to analyze and improve processor Once the basic profiler has uncovered performance. If a performance problem is the processing necessary to raise application known to exist in CPU-operation processing, the performance, it is time to move on to the next developer can use the hardware monitor function steps. CPU-operation processing, thread- of the detailed profiler to analyze the problem parallelization processing, and process- using the cycle accounting method (Figure 2). parallelization processing are mutually To give an example of a performance- independent from the viewpoint of performance, improvement procedure, more effective use of the and each one requires a separate approach cache would have to be achieved if it was found by to improving performance. Nevertheless, the cycle accounting that waiting for memory access detailed profiler that we have developed is used was occupying

Load more