JOURNAL OF SOFTWARE, VOL. 6, NO. 12, DECEMBER 2011 2399

RPPA: A Remote Parallel Program Performance Analysis Tool

Yunlong Xu, Zeng Zhao, Weiguo Wu*, and Yixin Zhao
School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an, China
Email: [email protected], [email protected]

Abstract—Parallel program performance analysis plays an important role in exploring parallelism and improving the efficiency of parallel programs. To ease performance analysis for remote programs, this paper presents a remote parallel program performance analysis tool, RPPA (Remote Parallel Performance Analyzer), which is based on dynamic code instrumentation. RPPA adopts a hierarchical structure consisting of three parts: client, server, and computing nodes. Performance analysis tasks are submitted to the server via the graphical user interface of the client, and then the actual analysis processes are started on the computing nodes by the server to collect performance data for visualization in the client. The performance information gained by RPPA is comprehensive and intuitive, so it is quite helpful for users to analyze performance, locate program bottlenecks, and optimize programs.

Index Terms—parallel programming tools, program performance analysis, dynamic code instrumentation, performance visualization

I. INTRODUCTION

In the past few years, parallel computers have grown very fast, while software for parallel computing has grown comparatively slowly. This situation makes the efficiency of parallel systems relatively low, so hardware performance cannot be fully utilized [1]. Software has therefore become the bottleneck of parallel computing and limits the widespread use of parallel computers. For instance, parallel program development is challenging due to the lack of effective tools for coding, debugging, performance analysis, and optimization.

Motivated by the preceding challenge, a remote visualization parallel program performance analysis tool, RPPA, which aims to help programmers optimize the performance of applications and make full use of parallel computing resources, is designed and developed based on a survey of existing tools and our former research on a parallel program integrated development environment (IDE). RPPA provides a solution for the performance analysis of MPI [4] applications running on parallel computers with SMP nodes. To gain comprehensive performance information about programs, a dynamic instrumentation based approach, which includes modifying, deleting, and inserting code to change the execution behavior of programs, is applied to the performance measurement. The framework of RPPA consists of three parts: client, server, and computing nodes. A performance analysis task is committed to the server through the client graphical user interface (GUI); a daemon running on the server then starts the program performance data collection processes on several computing nodes; later, the server gathers the performance data files from the computing nodes and sends them back to the client for visualization.

The remainder of this article is organized as follows. First, we consider related work and point out RPPA's strengths through comparison with existing tools in the next section. In Section 3, we describe the overall architecture of RPPA. After that, we present the methodology applied in the client, the server, and the computing nodes in detail in Section 4. In Section 5, we present the evaluation of a RPPA prototype. Section 6 is a discussion. Finally, we consider future work and conclude the paper.

The journal article is based on the conference paper, "Research and Design of a Remote Visualization Parallel Program Performance Analysis Tool," which appeared in PAAP'10.
*Corresponding author: Weiguo Wu

II. RELATED WORK

Performance analysis is crucial to parallel program development. This work combines knowledge from various fields, e.g., parallel computing, computer architecture, and data mining. Related work on program performance analysis has been carried out by many organizations and agencies.

MPE [2] is an extension of MPICH [3], a portable implementation of MPI. It consists of a large number of programming interfaces and examples for correctness debugging, performance analysis, and visualization. It supports several file formats for logging the performance information of programs, which can be viewed by Upshot, Jumpshot [5], and other visualization tools.

Paradyn [6], developed by the University of Wisconsin, is a software package which aids in analyzing the performance of large-scale parallel applications. It supports analyzing MPI and PVM [7] applications. The cause of performance problems is systematically detected by inserting and modifying program code automatically, and performance bottlenecks are automatically searched for by means of a W3 (i.e., When, Where, and Why) model.

TAU [8] is a parallel program performance analysis system developed by the University of Oregon, Juelich

© 2011 ACADEMY PUBLISHER doi:10.4304/jsw.6.12.2399-2406

Research Center, and Los Alamos National Laboratory together. It focuses on system robustness, flexibility, portability, and integration with other related tools, and supports multiple programming languages, including C/C++, Fortran, Java, and Python. An MPI wrapper library is provided to analyze the performance of MPI functions, as well as a log format conversion tool.

VENUS (Visual Environment of Neocomputer Utility System) [17] is a parallel program performance visualization environment developed by Xi'an Jiaotong University. It provides solutions for monitoring, analyzing, and optimizing large-scale PVM parallel programs based on a profiling method. The system supports real-time and post-mortem performance visualization. Measurement code is automatically inserted into the source code of programs to collect performance data for data analysis and performance visualization. In this way, performance analysis helps users optimize their programs.

Open|SpeedShop [9], a dynamic binary instrumentation based performance tool using DPCL [16]/Dyninst [12], aims to overcome a common limitation of performance analysis tools: each tool alone is often limited in scope and comes with widely varying interfaces and workflow constraints, requiring different changes in the often complex build and execution infrastructure of the target application. Thus it provides efficient, easy to apply, and integrated performance analysis for parallel systems.

HPCToolkit [10], developed by Rice University, is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCToolkit can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs and multithreaded programs at a low cost. Call path profiles for fully optimized codes can also be collected without compiler support.

The tools enumerated above have a common defect: performance visualization and data collection are both carried out on parallel computers or clusters. It is hard for users who are not familiar with parallel systems to install and apply these tools. This defect hinders the widespread use of parallel program development environments and parallel computing resources to some extent. Therefore, there is an urgent need for a user-friendly remote parallel program performance analysis tool; RPPA is designed to meet this need. Users connect to remote parallel systems over the internet via RPPA's client GUI, which is part of a parallel program IDE built on Eclipse Plug-in [11] technology. Interactive operations and performance visualization are both carried out in the client, while the actual operations are executed on remote clusters that are transparent to common users. RPPA is easy to use and portable to Windows and other operating systems.

III. OVERALL ARCHITECTURE

The RPPA tool adopts a hierarchical structure, through which users are shielded from the complexity of parallel systems. The performance information of programs is presented in RPPA's GUI in an intuitive way, which is quite helpful for locating performance bottlenecks and improving the efficiency of programs. A performance analysis task is committed to the server through the graphical client; a daemon running on the server starts the program performance data collection processes, whose core step is dynamic instrumentation, on several computing nodes; the server then gathers the performance data files from the computing nodes and sends them back to the client for visualization.

RPPA is customized to the cluster structure. The overall architecture of RPPA is depicted in Figure 1.

Figure 1. Overall architecture of RPPA (client modules for command handling, task launching, data downloading, and visualization; server modules for data collection launching and data file gathering; computing nodes running the MPI communication library, performance data collectors, MPI processes, and performance data files).

The client connects with the server over the internet, and the computing nodes connect with the server, which can itself also be a computing node, over a high-speed interconnection network. Users control the whole system through a GUI, which includes two core modules for submitting performance analysis tasks and for downloading and visualizing performance data. The server-side operational details (e.g., how the run-time processes are assigned to the computing nodes) are transparent to users. As a result, operations are greatly simplified for users who are not familiar with the underlying parallel cluster structure. The server, the entry node of a parallel cluster, is responsible for maintaining the connection between the client and the cluster, accepting instructions from the client, and then starting performance analysis processes on the computing nodes to collect performance data. The actual performance analysis tasks and program execution tasks are carried out on the computing nodes: RPPA creates user program processes, into which measurement code is inserted, controls their execution, generates performance data files, and integrates and analyzes the files when the analysis processes finish.

IV. DETAILED DESIGN

A. Client

The client provides an interface for users to commit


analysis tasks and visualizes the performance analysis results. The GUI, implemented with Eclipse Plug-in technology, greatly simplifies users' operation. The client is composed of four functional modules: the performance analysis command handling module, the performance analysis launching module, the performance data downloading module, and the performance data visualization controlling module. The relationship between the four modules is shown in Figure 2.

Figure 2. Structure of client (the user command interface dispatches launch, download, and visualize commands to the corresponding modules, which exchange performance data files through the communication interface).

Performance Analysis Command Handling Module. This module is implemented by extending plug-in extension points of the Eclipse platform. Graphical items for performance analysis (e.g., buttons, menus, pop-up lists, and editors) are added to the original GUI of the Eclipse platform. Each item corresponds to a specific performance analysis operation. This module supports interactions between users and RPPA.

Performance Analysis Launching Module. This module sends commands to the server and starts remote performance analysis tasks. A customized Eclipse launching mode is implemented by extending the launchModes extension point of the original platform, and the actions carried out while launching are implemented by extending the launchDelegates extension point of Eclipse.

Performance Data Downloading Module. Performance data files are distributed among the computing nodes when an analysis task finishes, due to the C/S architecture adopted by RPPA. This module gathers and integrates the distributed performance data files and then transfers the integrated file to the client for visualization.

Performance Data Visualization Controlling Module. The structure of this module is shown in Figure 3.

Figure 3. Performance visualization control module of client (the visualization controller reads performance data files through the parser module and renders them in the visualization views).

Performance information can be viewed from multiple aspects through the various display commands and views RPPA provides. A command deals with a specific kind of data set, and a view displays one kind of performance information. It is worth mentioning that, in accordance with the characteristics of parallel programs, a view of performance comparison between multiple processes is provided.

Performance data visualization is an important means to help users analyze program performance and pinpoint performance bottlenecks, because it is difficult to assess a program's performance directly from a large volume of performance data files of various entries, not to mention that the data volume grows rapidly as the application scales. Therefore, RPPA offers vivid and intuitive graphical charts to help users visualize performance information, analyze program performance, and locate bottlenecks quickly.

B. Server

The server is the communication relay between the client and the computing nodes: it serves as the entry point of a cluster, which is composed of several computing nodes, and controls the execution of tasks on the computing nodes; it also receives commands from the client and responds to these commands. The functionalities of the server include launching performance analysis tasks, gathering performance data files from the computing nodes, and maintaining communication with the client. The performance data file gathering module is the core of the server.

According to when the collected data is analyzed and visualized, program performance analysis falls into two categories: real-time analysis and post-mortem analysis. Each mode has its strength: the real-time mode can adjust data collection while a task is running, while the post-mortem mode introduces only minor perturbation into the original program. As for RPPA, if a real-time mode were adopted in the C/S structure, the server would have to receive performance analysis instructions frequently, process them, transmit them to the computing nodes on which they are actually executed, and finally transfer a large volume of data back to the client during execution. This whole process, greatly affected by the network, would introduce major perturbation into the original program. Thus the post-mortem mode is adopted in RPPA.

Unlike the launch process of a common MPI parallel program, the launch process of a performance analysis task has to ensure that all MPI processes running on the computing nodes are under the control of the program performance analysis tool, specifically the control of the performance analysis launching module. In this way, measurement code is instrumented into user programs, and then performance data can be collected.


The launch process of an MPI parallel program under the control of the launching module is shown in Figure 4.

Figure 4. MPI program starting process under the control of the data collection module (the launching module on computing node A starts local and remote performance data collection processes via rsh/ssh, which in turn start and communicate with the user "hello" processes on nodes A, B, and C).

When receiving a performance analysis launch command, the launching module of the server (e.g., node A in Figure 4) analyzes the parameters of the command, creates a performance analysis task, sets some parameters (e.g., the name and full path of the user program, the number of processes) for the task, and then starts data collection processes on the computing nodes. The data collection processes analyze the parameters and then create the user ("hello") processes based on the name of the user processes and the parameters of the performance analysis task. After that, the hello processes are considered part of the whole task and are managed by the MPI runtime environment.

C. Computing Nodes

The computing nodes, on which the MPI runtime environment is deployed, are the substrate of RPPA. A bijection is set up between MPI processes and performance data collection processes on the computing nodes: the data collection processes start and control the MPI processes. The performance data collection module of the computing nodes is launched by the performance data collection launching module of the server. MPI processes are started by the data collection module according to some parameters (e.g., the absolute path of the MPI program). The collection module then inserts measurement code into the MPI processes and integrates the generated performance data files when they finish.

The main purpose of performance analysis is to help users locate program bottlenecks, i.e., code segments which account for large portions of the running time. Besides running time, the number of function calls is also collected, to estimate the average running time of each function. As for MPI parallel programs, the communication volume of MPI functions also concerns programmers a lot. To sum up, three kinds of data need to be collected: running time, number of function calls, and communication volume.

Performance data is collected by means of code instrumentation [12], which includes modifying, deleting, and inserting code to change the execution behavior of programs. According to the timing of code insertion, instrumentation can be divided into two categories: static and dynamic code instrumentation. The static one inserts code before the execution of programs, and the dynamic one, per contra, during execution. Static instrumentation introduces little perturbation into the original programs, but the information it collects is not comprehensive, while dynamic instrumentation is able to collect comprehensive performance information at the cost of major perturbation. Using dynamic instrumentation, programs are simply resumed, rather than recompiled, relinked, and restarted after instrumentation. Apparently, dynamic instrumentation is more complicated to implement than the static one.

In this paper, we use DyninstAPI [12], a dynamic binary instrumentation interface developed by the University of Maryland, to insert binary code into running processes. Instrumentation processes (i.e., mutators) start the user program processes which are to be instrumented (i.e., mutatees) and attach themselves to the user processes. Measurement code segments (i.e., snippets) are inserted into the user processes by the instrumentation processes while the user processes are suspended, and performance data is collected when the instrumented code of the user processes is executed.

The process of performance data collection is described as Algorithm 1.

Algorithm 1: Performance data collection procedure
Input: A, attributes of the user task
Output: PF, an integrated performance data file
define: proc_array, array of MPI process pids
proc_array ← spawn(A)
suspend(proc_array)
foreach i in proc_array
    define: node, name of computing node
    define: file, path of data file
    file ← new(node, i)
    define: procImage, image of user process
    procImage ← getImage(i)
    define: usrModule, module of user process
    usrModule ← search(procImage)
    define: func_array, array of user functions
    func_array ← search(usrModule)
    for j in func_array
        define: snippet, binary code segment
        define: gettimeofday, function to get current time
        define: fscanf/fprintf, functions to read/write a file
        snippet ← generate(gettimeofday, fscanf/fprintf)
        insert(j, snippet, file)
        if j is an MPI function
            define: commAttr_array, array of communication attributes (e.g., volume, source, destination)
            snippet ← generate(commAttr_array, file)
            insert(j, snippet)
        end if
    end for
    wake(proc_array)
    if i finishes
        acknowledge(server, file, i)
    end if
end foreach
while (TRUE)
    if all processes complete
        define: file_array, array of performance data files
        PF ← integrate(file_array)
        return PF
    end if
end while


We note that, when no MPI communication function is found, the data collection process skips the step of collecting MPI communication information and goes ahead rather than returning, because the program being analyzed could be a serial program, which can also be executed by the mpiexec command in MPI runtime environments.

V. EVALUATION

This section describes the evaluation of a RPPA prototype in two aspects: functional evaluation and perturbation evaluation. The pros and cons of RPPA are also presented on the basis of the evaluations.

A. Test Environment and Test Case

In this paper, the evaluations are carried out on a parallel computing system, a Dawning 4000L cluster composed of 18 nodes. Each node is equipped with an Intel(R) Xeon(TM) CPU (3.00 GHz × 2), 2 GB memory, a Broadcom BCM5721 1000Base-T LAN adapter, a Linux 2.4 kernel, and the OpenMPI [14] 1.2.3 parallel computing environment. The test case is an MPI parallel program solving an ocean circulation problem using the two-dimensional stommel model, developed by Timothy H. Kaiser.

B. Functional Evaluation of Performance Visualization

Users can check performance information in 7 different views, including: function running time of a single process, function call numbers of a single process, communication volume of a single process, function running time comparison of multiple processes, function call number comparison of multiple processes, communication volume comparison of multiple processes, and a space-time diagram of all processes. In this evaluation, the test program is composed of 5 performance data collection processes, each of which creates a stommel process, on 2 computing nodes separately. Data files are downloaded and performance information is visualized when the data collection task finishes. The function running time of one process is shown in Figure 5; the function running time comparison of multiple processes is shown in Figure 6; and the space-time diagram of all processes is shown in Figure 7.

The functional evaluation shows that the performance data collected by RPPA, which is based on dynamic instrumentation, is comprehensive, because performance data for all kinds of functions (i.e., user-defined functions, external library functions, and standard MPI communication functions) can be collected through dynamic instrumentation. This evaluation also shows that RPPA meets the requirements of program performance analysis and brings great convenience to programmers by providing a user-friendly GUI.

C. Perturbation Evaluation

The perturbation introduced into the original programs by an analysis tool is an important metric for measuring the performance of the analysis tool itself [15]. The perturbation is reflected in the running time of programs, and the degree of perturbation can be estimated by comparing the running time of a program with measurement code inserted against the running time of the original program.

What concerns us most is the difference in running time between the original program and the program with code inserted. The running of parallel programs is affected by many factors; in this paper, the perturbation evaluation is influenced mainly by network status and system background workload. Thus a single node, on which no processes run besides Linux system processes, is used to run the evaluation programs. In this way, the network and system load factors are ruled out. The Paradyn performance analysis tool, which also uses the Dyninst tool for dynamic instrumentation, is selected for comparison with RPPA. To make the evaluation data more representative, we carry out 3 kinds of tests using the stommel program: running the program without interference from any performance analysis tool, with interference from RPPA, and with interference from the Paradyn tool. 10 groups of these 3 kinds of tests are conducted. The test results are shown in Figure 8. Test type 1 represents the running time without interference from any analysis tool, type 2 represents the running time with RPPA, and type 3 represents the Paradyn tool.

It can be seen that, under the same conditions (i.e., the same running environment and the same types of performance data to be collected), the running time of the stommel program with interference from RPPA is closer to the running time without any tools than to the running time with Paradyn. From this perspective, RPPA surpasses the Paradyn program performance analysis tool. However, Figure 8 also shows that both RPPA and Paradyn introduce major perturbations into the original programs. The primary cause is that they both build on dynamic instrumentation, which intrudes on the original programs frequently by copying, modifying, transferring, and inserting code during the runtime of the programs. But for the same reason, the performance data collected by analysis tools based on dynamic instrumentation is comprehensive, because performance information about all kinds of functions can be gained. Obviously, major perturbation is the price of comprehensive performance information.

VI. DISCUSSION

In this section, we discuss some new ideas and methods to improve RPPA. As mentioned in the previous section, dynamic instrumentation based performance analysis tools introduce major perturbation into the original programs. Hence we seek a novel approach to collect performance data on the server and computing nodes. A multi-level instrumentation based data collection approach is proposed in our recent research. The implementation of this approach is part of our recent and future work and is planned to be presented in follow-up work. In this section, we first discuss why this approach would reduce the perturbation, and how it works.


Figure 5. Function running time of one process

Figure 6. Function running time comparison of multi-processes

Figure 7. Space-time diagram of all processes


Figure 8. Running time comparison of the stommel program interfered with by different analysis tools (running time in seconds, 0 to 3.5, for 10 groups of tests; test type 1 = no analysis tool, test type 2 = RPPA, test type 3 = Paradyn).

As the analysis in Section V shows, dynamic instrumentation collects comprehensive program performance information at the cost of major perturbation. Therefore we propose the multi-level instrumentation based approach, through which the comprehensiveness of the performance information would not be compromised. The essence of this approach is to collect different kinds of information in distinct, appropriate ways, and to perform dynamic instrumentation only when really necessary. We separate the functions whose performance information concerns programmers most into three categories: user-defined functions, standard MPI functions, and external library functions. Performance information for user-defined functions is collected through a source-level instrumentation [8] based method. As for standard MPI functions, we adopt PMPI [4], the MPI standard profiling interface, to collect their performance information. And for external library functions, to the best of our knowledge, dynamic instrumentation is the best method to collect their performance information. We note that the source-level instrumentation could be replaced by the dynamic one when, for some reason, no files other than the binaries are available. We also note that dynamic instrumentation could be skipped if the external library functions' performance information is not a concern or if only little perturbation can be tolerated by users; there is still plenty of performance information presented to programmers even if the external library functions are ignored.

By using this multi-level approach, the perturbation introduced into the original programs would surely be at a lower level, and the comprehensiveness of the performance information would not be compromised. Firstly, the source-level instrumentation introduces little perturbation into the original programs, because only a little measurement code is inserted into the source files, at the entry and exit points separately, at compilation time; and all the information that the dynamic approach collects can also be collected by this source-level one. Secondly, when using PMPI profiling to collect standard MPI functions' performance information, there is no need to copy, modify, transfer, or insert code during the runtime of the programs, because all MPI_-prefixed functions are substituted by equivalent PMPI_-based ones in which some additional measurement code is added. Thus less perturbation is introduced, and the comprehensiveness of the performance information is likewise not compromised.

The philosophy of this tool resembles SaaS (i.e., Software-as-a-Service) in cloud computing, which has emerged and received significant attention in the media recently. The cloud conceals the complexity of the infrastructure from common users through the internet. Cloud computing partly refers to software delivered as services over the internet. Users could shift computing power away from local servers, across the network cloud, and into large clusters of machines hosted by companies such as Amazon, Google, IBM, Microsoft, Yahoo!, and so on. We plan to transplant our parallel program IDE's client, which includes RPPA's client, into the internet environment by rewriting its GUI with web programming languages. The Eclipse platform would then be replaced by a web browser. And with the support of a daemon running on the server node of a remote cluster, users could develop parallel programs remotely over the internet on a laptop rather than on the parallel cluster. In this way, great convenience would be brought to programmers, while excellent portability of this tool would also be achieved.

VII. CONCLUSION AND FUTURE WORK

A remote parallel program performance analysis tool, RPPA, is designed based on research into the existing program performance analysis tools, most of which are not user-friendly. RPPA meets the urgent need for parallel program performance analysis mentioned in Section 1. The workflow of RPPA is as follows: performance analysis tasks are submitted to the server through the client; then performance data collection processes are started on the computing nodes by the server to collect performance data; at last, the program performance information is visualized in the client. RPPA is user-friendly and easy to use, and the collected performance information is comprehensive, so it is quite helpful for users to analyze performance, locate program bottlenecks, and optimize programs.

However, there is a common shortcoming of dynamic instrumentation based performance analysis tools: major perturbation is introduced into the original programs. So research on a multi-level instrumentation based performance analysis tool which introduces only minor perturbation is part of our future work. Our future work also includes providing support for various kinds of parallel programming models and for heterogeneous high performance computing systems.


ACKNOWLEDGMENT

This work was supported by the National High-Tech Research and Development Plan of China under Grant Nos. 2009AA01A131 and 2009AA01A135, and by the Chinese Ministry of Science and Technology under Grant No. 2009DFA12110.

We would like to thank all the people who gave helpful suggestions on our work, especially the reviewers of PAAP'10 for their comments on the original conference paper [18]. We would also like to thank all the members of the integration and application of heterogeneous resources service project team, Department of Computer Science and Technology, Xi'an Jiaotong University, for their support of our work.

REFERENCES

[1] C.C. Chen and M.X. Chen, "A generic parallel computing model for the distributed environment," Proc. of the 2006 7th International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 4-7, December 2006.
[2] A. Chan, W. Gropp, and E. Lusk, "User's guide for MPE: extensions for MPI programs," Technical Report MCS-TM-ANL-98, Argonne National Laboratory, 1998.
[3] W. Gropp, "MPICH2: A new start for MPI implementations," Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 2474, pp. 37-42, 2002.
[4] Message Passing Interface Forum, "MPI: A message-passing interface standard," Technical Report, University of Tennessee, 1994.
[5] O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward scalable performance visualization with Jumpshot," International Journal of High Performance Computing Applications, vol. 13, pp. 277-288, 1999.
[6] B.P. Miller, M. Callaghan, G. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, et al., "The Paradyn parallel performance measurement tool," Computer, vol. 28, pp. 37-46, November 1995.
[7] W. Gropp and E. Lusk, "Goals guiding design: PVM and MPI," Proc. of the IEEE International Conference on Cluster Computing, pp. 257-265, 2002.
[8] S. Shende and A.D. Malony, "The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, pp. 287-311, 2006.
[9] M. Schulz, J. Galarowicz, D. Maghrak, W. Hachfeld, D. Montoya, and S. Cranford, "Open|SpeedShop: An open source infrastructure for parallel performance analysis," Scientific Programming, vol. 16, pp. 105-121, April 2008.
[10] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N.R. Tallent, "HPCTOOLKIT: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice & Experience, vol. 22, pp. 1-7, April 2010.
[11] E. Clayberg and D. Rubel, Eclipse Plug-ins, Addison-Wesley Professional, 2008.
[12] B. Buck and J.K. Hollingsworth, "An API for runtime code patching," International Journal of High Performance Computing Applications, vol. 14, pp. 317-329, 2000.
[13] G. Ravipati, A. Bernat, N. Rosenblum, B.P. Miller, and J.K. Hollingsworth, "Towards the deconstruction of Dyninst," Technical Report, University of Maryland, 2007.
[14] E. Gabriel, G. Fagg, G. Bosilca, T. Angskun, J. Dongarra, J. Squyres, et al., "Open MPI: goals, concept, and design of a next generation MPI implementation," Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 3241, pp. 97-104, 2004.
[15] J. Liu, M.M. Shen, and W.M. Zheng, "Research on perturbation imposed on parallel programs by debugger," Chinese Journal of Computers, vol. 25, no. 2, pp. 122-127, 2002.
[16] L. DeRose, T. Hoover, and J. Hollingsworth, "The dynamic probe class library — an infrastructure for developing instrumentation for performance tools," Proc. of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[17] X.H. Shi, Y.L. Zhao, S.Q. Zheng, and D.P. Qian, "VENUS: A general parallel performance visualization environment," Mini-micro Systems, vol. 19, pp. 1-7, 1998.
[18] Y.L. Xu, Z. Zhao, W.G. Wu, and Y.X. Zhao, "Research and design of a remote visualization parallel program performance analysis tool," Proc. of the Third International Symposium on Parallel Architectures, Algorithms and Programming, pp. 214-220, December 2010.

Yunlong Xu was born in Sichuan, China, in 1987. He is currently a Ph.D. candidate at the School of Electronic and Information Engineering, Xi'an Jiaotong University, China. His research interests include parallel computing and cloud computing.

Zeng Zhao received his master's degree from the School of Electronic and Information Engineering, Xi'an Jiaotong University, China, in 2010. He is currently a member of Tencent, Inc., China.

Weiguo Wu, born in 1963, is a professor and Ph.D. supervisor at the School of Electronic and Information Engineering, Xi'an Jiaotong University, China. He received his master's and Ph.D. degrees from Xi'an Jiaotong University. His research interests include high-performance computer architecture, massive storage systems, computer networks, and embedded systems.

Yixin Zhao received his master's degree from the School of Electronic and Information Engineering, Xi'an Jiaotong University, China, in 2009.

© 2011 ACADEMY PUBLISHER