Performance of Multi-Process and Multi-Thread Processing on Multi-Core SMT Processors

Pages: 16

File type: PDF, size: 1020 KB

Hiroshi Inoue and Toshio Nakatani

H. Inoue and T. Nakatani are with IBM Research – Tokyo, Kanagawa-ken 242-0001, Japan (e-mail: {inouehrs, nakatani}@jp.ibm.com).

Abstract—Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence achieving good performance scalability of programs with respect to the numbers of cores (core scalability) and SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on multi-core SMT processors, this paper compares the performance scalability of two parallelization models (using multiple processes and using multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms, Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability than the multi-process model by reducing the number of cache misses and DTLB misses, whereas both models achieve roughly equal core scalability. We also show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of both models, we implemented a memory allocator for a PHP runtime to reduce DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from the same memory page for each processor core. When using all of the hardware threads on a Niagara, the core-aware allocator reduces the DTLB misses by 46.7% compared to the default allocator, and it improves the performance by 3.0%.

I. INTRODUCTION

Modern high-performance processors provide thread-level parallelism by using multiple hardware threads within each core using Simultaneous Multi-Threading (SMT)¹ [1]. For example, Sun's UltraSPARC T2 (Niagara 2) processor [2] provides 8-way multi-threading in each core, the UltraSPARC T1 (Niagara) processor [3] and IBM's POWER7 processor [4] provide 4-way multi-threading, and Intel's Core i7 and Xeon (Nehalem) processors [5] provide 2-way multi-threading. Such SMT processors can increase the efficiency of CPU resource usage by allowing multiple threads to run on a single core, even for workloads with limited instruction-level parallelism. Multiple threads using SMT can also hide memory access latencies. Due to the growing gap between processor and memory speed, SMT is becoming more important.

(¹ A Niagara core cannot execute instructions from multiple software threads in the same processor cycle, and thus the multi-threading in Niagara should be called fine-grained multi-threading. In this paper, we use the word SMT in a wider sense that includes fine-grained multi-threading.)

To exploit the thread-level parallelism available in such a system, programs have to be parallelized. A programmer can write a parallel program using multiple processes (the multi-process model) or multiple threads within one process (the multi-thread model) on today's systems. Multiple threads in one process share the virtual memory space, while multiple processes do not share the same memory space. This difference typically results in a larger memory footprint and better inter-process isolation for the multi-process model. For example, the Apache Web server supports both a multi-process model (the prefork model) and a multi-thread model (the worker model). The prefork model generates multiple Web-server processes to handle many HTTP connections, while the worker model generates multiple threads. The prefork model is more reliable but consumes more memory.

In this paper, we study the interactions between the parallelization model and the processor architecture aspects of a multi-core SMT processor. To identify how to achieve higher performance on the multi-core SMT processor, this paper focuses on performance comparisons between a multi-thread model and a multi-process model on two types of hardware parallelism (core scalability and SMT scalability). We measure the throughput of standard Java benchmarks, SPECjbb2005 and SPECjvm2008, using the multi-process model and the multi-thread model, and then we analyze detailed architecture-level statistics such as cache misses and TLB misses on two processors, Sun's Niagara and Intel's Nehalem. The Niagara has eight cores and four SMT threads in each core, whereas the Nehalem has four cores and two SMT threads in each core. In the measured benchmarks, the threads did not communicate with each other, so that we could focus on the architecture-level behavior.

To test core scalability, we increased the number of cores with only one SMT thread in each core. To test SMT scalability, we increased the number of SMT threads in each core. Although many previous studies [6-11] measured the core scalability and SMT scalability of programs on multi-core SMT processors, this is the first work to focus on the effects of the parallelization model on multi-core SMT processors.

Our results showed that the multi-thread model achieved much better SMT scalability than the multi-process model because the multi-thread model generated a smaller number of cache misses and DTLB misses due to its smaller memory footprint. In contrast to the better SMT scalability of the multi-thread model, on average both models achieved almost comparable core scalability on Niagara, and the multi-process model achieved better core scalability than the multi-thread model on Nehalem. We found that, contrary to our expectations, the multi-thread model generated up to 7.4 times more DTLB misses than the multi-process model when we used many cores.

We confirmed that our observation was not specific to Java workloads by testing a real-world server program written in PHP on Niagara. The multi-thread model achieved better SMT scalability than the multi-process model by reducing the DTLB misses, but it showed comparable core scalability. We also observed that the performance advantage of the multi-thread model was reduced as the number of cores increased. To achieve better performance on SMT processors using many cores with SMT threads, we implemented a new memory allocator in the multi-threaded PHP runtime. This allocator avoids sharing a page among threads running on different cores. It reduced the DTLB misses by about half and improved the performance when using all of the hardware threads on Niagara. Modern memory allocators are often aware of the remote and local memories on NUMA systems and control allocations to minimize costly remote memory accesses. Our allocator provides similar control at a finer granularity even on a non-NUMA system and improves the performance.

There are two main contributions in this paper. (1) We demonstrate that the smaller footprint of the multi-thread model does not always result in higher performance; it may even degrade performance compared to the multi-process model due to more frequent DTLB misses on today's multi-core processors. The performance of the multi-thread model and the multi-process model depends on the underlying processor architecture, and the interactions are complicated. (2) We show that a memory allocator that tracks the processor core executing each program can improve performance on a multi-core SMT processor by significantly reducing the DTLB misses.

The rest of the paper is organized as follows. Section 2 covers related work. Section 3 summarizes the experimental results with Java benchmarks.

II. RELATED WORK

[…] the performance scalability of the SAP Java application server on a Niagara processor and on Intel's Xeon (Clovertown) processors, which do not support SMT, while changing the number of JVMs from one to four. The one-JVM configuration in their work corresponds to the multi-thread model, and using more JVMs makes the program more like the multi-process model, which is an extreme case of the multiple-JVM configurations. They showed that using more JVMs eased the lock contention while it increased the memory footprint and cache misses. As a result, a 2-JVM configuration achieved the best performance on the Niagara system and a 4-JVM configuration peaked on the Xeon system. They concluded that the amount of cache memory was the key factor determining the best configuration. Our results showed that the 4-way SMT of Niagara might be another important factor that reduced the performance of the 4-JVM configuration on Niagara.

Many previous studies identified lock contention as the dominant cause of poor scalability on multi-core SMT processors [9-11]. For example, Ishizaki et al. [11] reported that the performance of Java server workloads running on a Niagara 2 processor was improved by up to 132% by removing the contended locks. We found that appropriate use of the parallelization model is also an important factor in improving the performance of programs on multi-core SMT processors. The programs we used in this paper did not actually suffer from lock contention even when we used all of the hardware threads of a Niagara processor. This was because we selected programs that did not share data between the software threads, so as to focus on the interactions between the parallelization model and the underlying microarchitecture. Server-side programs for Web applications, in which each thread serves a different client request without communicating with the others, are important examples of such programs that do not share data among software threads. Other important examples include Hadoop (an open-source implementation of the MapReduce runtime written in Java) and scientific workloads using MPI.

In this paper, we propose a new memory allocator for
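The excerpt above describes the core-aware allocator only in prose. As a rough illustration of the idea, here is a minimal C sketch assuming a Linux-style sched_getcpu(); NUM_CORES, BLOCK_SIZE, pool_refill, and the per-core pool layout are hypothetical names invented for this sketch, not the paper's actual PHP-runtime implementation, and locking and error handling are omitted for brevity.

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_getcpu(), Linux-specific */
    #include <stddef.h>
    #include <sys/mman.h>

    #define NUM_CORES  8    /* hypothetical: e.g., one pool per Niagara core */
    #define PAGE_SIZE  8192
    #define BLOCK_SIZE 64

    typedef struct block { struct block *next; } block_t;

    /* One free list per core: every page in a list is handed out only to
     * threads running on that core, so each core touches fewer pages. */
    static block_t *pool[NUM_CORES];

    static void pool_refill(int core) {
        char *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        for (size_t off = 0; off + BLOCK_SIZE <= PAGE_SIZE; off += BLOCK_SIZE) {
            block_t *b = (block_t *)(page + off);
            b->next = pool[core];
            pool[core] = b;
        }
    }

    void *core_aware_alloc(void) {
        int core = sched_getcpu() % NUM_CORES;  /* core running this thread */
        if (pool[core] == NULL)
            pool_refill(core);                  /* carve blocks from a fresh page */
        block_t *b = pool[core];
        pool[core] = b->next;
        return b;
    }

The point of the design is that every page in a pool is touched only by threads scheduled on one core, so each core's working set spans fewer pages and incurs fewer DTLB misses; a production version would also need per-pool locks and a free path.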
Recommended publications
  • Distributed Algorithms with Theoretic Scalability Analysis of Radial and Looped Load Flows for Power Distribution Systems
    Electric Power Systems Research 65 (2003) 169–177, www.elsevier.com/locate/epsr. Distributed algorithms with theoretic scalability analysis of radial and looped load flows for power distribution systems. Fangxing Li *, Robert P. Broadwater, ECE Department, Virginia Tech, Blacksburg, VA 24060, USA. Received 15 April 2002; received in revised form 14 December 2002; accepted 16 December 2002. Abstract: This paper presents distributed algorithms for both radial and looped load flows for unbalanced, multi-phase power distribution systems. The distributed algorithms are developed from tree-based sequential algorithms. Formulas of scalability for the distributed algorithms are presented. It is shown that computation time dominates communication time in the distributed computing model. This provides benefits to real-time load flow calculations, network reconfigurations, and optimization studies that rely on load flow calculations. Also, test results match the predictions of the derived formulas. This shows the formulas can be used to predict the computation time when additional processors are involved. © 2003 Elsevier Science B.V. All rights reserved. Keywords: Distributed computing; Scalability analysis; Radial load flow; Looped load flow; Power distribution systems. 1. Introduction: Parallel and distributed computing has been applied to many scientific and engineering computations such as weather forecasting and nuclear simulations [1,2]. It also has been applied to power system analysis calculations [3–14]. […] Also, the method presented in Ref. [10] was tested in radial distribution systems with no more than 528 buses. More recent works [11–14] presented distributed implementations for power flows or power-flow-based algorithms like optimizations and contingency analysis. These works also targeted power transmission systems.
  • Scalability and Performance Management of Internet Applications in the Cloud
    Hasso-Plattner-Institut, University of Potsdam, Internet Technology and Systems Group. Scalability and Performance Management of Internet Applications in the Cloud. A thesis submitted for the degree of "Doktor der Ingenieurwissenschaften" (Dr.-Ing.) in IT Systems Engineering, Faculty of Mathematics and Natural Sciences, University of Potsdam. By: Wesam Dawoud. Supervisor: Prof. Dr. Christoph Meinel. Potsdam, Germany, March 2013. This work is licensed under a Creative Commons License: Attribution – Noncommercial – No Derivative Works 3.0 Germany. To view a copy of this license visit http://creativecommons.org/licenses/by-nc-nd/3.0/de/. Published online at the Institutional Repository of the University of Potsdam: URL http://opus.kobv.de/ubp/volltexte/2013/6818/, URN urn:nbn:de:kobv:517-opus-68187, http://nbn-resolving.de/urn:nbn:de:kobv:517-opus-68187. To my lovely parents. To my lovely wife Safaa. To my lovely kids Shatha and Yazan. Acknowledgements: At Hasso Plattner Institute (HPI), I had the opportunity to meet many wonderful people. It is my pleasure to thank those who supported me to make this thesis possible. First and foremost, I would like to thank my Ph.D. supervisor, Prof. Dr. Christoph Meinel, for his continuous support. In spite of his tight schedule, he always found the time to discuss, guide, and motivate my research ideas. The thanks are extended to Dr. Karin-Irene Eiermann for assisting me even before moving to Germany. I am also grateful to Michaela Schmitz. She managed everything well to make everyone's life easier. I owe thanks to Dr. Nemeth Sharon for helping me to improve my English writing skills.
  • Oracle® Developer Studio 12.6
    Oracle® Developer Studio 12.6: C++ User's Guide. Part No: E77789, July 2017. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs.
  • Parallel Programming
    Parallel Computing Hardware. Shared memory: multiple CPUs are attached to the bus; all processors share the same primary memory; the same memory address on different CPUs refers to the same memory location. The CPU-to-memory connection becomes a bottleneck, so shared memory computers cannot scale very well. Distributed memory: each processor has its own private memory; computational tasks can only operate on local data; memory can grow without limit by adding nodes, but programming is more difficult. OpenMP versus MPI. OpenMP (Open Multi-Processing): easy to use; loop-level parallelism (non-loop-level parallelism is more difficult); limited to shared memory computers; cannot handle very large problems. MPI (Message Passing Interface): requires low-level, more difficult programming; scalable in cost/size; can handle very large problems. MPI on distributed memory: each processor can access only the instructions/data stored in its own memory. The machine has an interconnection network that supports passing messages between processors. A user specifies a number of concurrent processes when the program begins. Every process executes the same program, though the flow of execution may depend on the processor's unique ID number (e.g., "if (my_id == 0) then ..."). Each process performs computations on its local variables, then communicates with other processes (and repeats), to eventually achieve the computed result. In this model, processors pass messages both to send/receive information and to synchronize with one another. Introduction to MPI. Communicators and groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
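    The SPMD pattern described above is easy to see in a complete program. The following is a minimal generic MPI example in C (not code from these course notes): every process runs the same program, branches on its rank, and the partial results are combined by message passing.

        /* Minimal MPI illustration: every process runs this same program
         * and branches on its unique rank (my_id, as in the text above). */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int my_id, n_procs;
            MPI_Comm_rank(MPI_COMM_WORLD, &my_id);   /* this process's ID */
            MPI_Comm_size(MPI_COMM_WORLD, &n_procs); /* number of processes */

            int local = my_id + 1, total = 0;
            /* Each process computes on local data; the results are then
             * combined by message passing (here, a reduction to rank 0). */
            MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

            if (my_id == 0)
                printf("sum over %d processes = %d\n", n_procs, total);

            MPI_Finalize();
            return 0;
        }

    Compiled with mpicc and started with, e.g., mpirun -np 4, each process works on its own local variable and only rank 0 prints the reduced result.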
  • Scalability and Optimization Strategies for GPU Enhanced Neural Networks (GeNN)
    Scalability and Optimization Strategies for GPU Enhanced Neural Networks (GeNN). Naresh Balaji1, Esin Yavuz2, Thomas Nowotny2. {T.Nowotny, E.Yavuz}@sussex.ac.uk, [email protected]. 1National Institute of Technology, Tiruchirappalli, India. 2School of Engineering and Informatics, University of Sussex, Brighton, UK. Abstract: Simulation of spiking neural networks has traditionally been done on high-performance supercomputers or large-scale clusters. Utilizing the parallel nature of neural network computation algorithms, GeNN (GPU Enhanced Neural Network) provides a simulation environment that performs on general-purpose NVIDIA GPUs with a code-generation-based approach. GeNN allows the users to design and simulate neural networks by specifying the populations of neurons at different stages, their synapse connection densities, and the model of individual neurons. In this report we describe work on how to scale synaptic weights based on the configuration of the user-defined network to ensure sufficient spiking and subsequent effective learning. We also discuss optimization strategies particular to GPU computing: sparse representation of synapse connections and occupancy-based block-size determination. Keywords: Spiking Neural Network, GPGPU, CUDA, code-generation, occupancy, optimization. 1. Introduction: The computational performance of traditional single-core CPUs has steadily increased since their arrival, primarily due to the combination of increased clock frequency, process technology advancements, and compiler optimizations. The clock frequency was pushed higher and higher, from 5 MHz to 3 GHz in the years from 1983 to 2002, until it reached a stall a few years ago [1], when the power consumption and dissipation of the transistors reached peaks that could not be compensated for by cost [2].
  • Real-Time Performance During CUDA™: A Demonstration and Analysis of RedHawk™ CUDA RT Optimizations
    A Concurrent Real-Time White Paper. 2881 Gateway Drive, Pompano Beach, FL 33069, (954) 974-1700, www.concurrent-rt.com. Real-Time Performance During CUDA™: A Demonstration and Analysis of RedHawk™ CUDA RT Optimizations. By: Concurrent Real-Time Linux® Development Team, November 2010. Overview: There are many challenges to creating a real-time Linux distribution that provides guaranteed low process-dispatch latencies and minimal process run-time jitter. Concurrent Real-Time's RedHawk Linux distribution meets and exceeds these challenges, providing a hard real-time environment on many qualified hardware configurations, even in the presence of a heavy system load. However, there are additional challenges faced when guaranteeing real-time performance of processes while CUDA applications are simultaneously running on the system. The proprietary CUDA driver supplied by NVIDIA® frequently makes demands upon kernel resources that can dramatically impact real-time performance. This paper discusses a demonstration application developed by Concurrent to illustrate that RedHawk Linux kernel optimizations allow hard real-time performance guarantees to be preserved even while demanding CUDA applications are running. The test results will show how RedHawk performance compares to CentOS performance running the same application. The design and implementation details of the demonstration application are also discussed in this paper. Demonstration: This demonstration features two selectable real-time test modes: 1. Jitter mode: measure and graph the run-time jitter of a real-time process. 2. PDL mode: measure and graph the process-dispatch latency of a real-time process. While the demonstration is running, it is possible to switch between these modes at any time.
  • Unit 4: Processes and Threads in Distributed Systems
    Unit 4: Processes and Threads in Distributed Systems. Thread: A program has one or more loci of execution, and each locus of execution is called a thread. In traditional operating systems, each process has an address space and a single thread of execution. A thread is the smallest unit of processing that can be scheduled by an operating system: a single sequential stream of execution within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Within a process, threads allow multiple streams of execution. Thread structure: A process is used to group resources together, and threads are the entities scheduled for execution on the CPU. A thread has a program counter that keeps track of which instruction to execute next; it has registers, which hold its current working variables; and it has a stack, which contains the execution history, with one frame for each procedure called but not yet returned from. Although a thread must execute in some process, the thread and its process are different concepts and can be treated separately. What threads add to the process model is to allow multiple executions to take place in the same process environment, to a large degree independent of one another. Having multiple threads running in parallel in one process is similar to having multiple processes running in parallel in one computer. Figure: (a) Three processes each with one thread. (b) One process with three threads. In the latter case, the threads share an address space, open files, and other resources; in the former case, the processes share physical memory, disks, printers, and other resources.
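    To make the distinction concrete, here is a minimal generic pthreads sketch in C (not from the unit text): three threads in one process share the same global variable, whereas three single-threaded processes would each have a private copy.

        /* Three threads in one process: they share globals and the heap,
         * unlike three separate processes, which each get their own copy. */
        #include <pthread.h>
        #include <stdio.h>

        static int shared = 0;                 /* visible to all threads */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg) {
            long id = (long)arg;               /* each thread has its own stack... */
            pthread_mutex_lock(&lock);
            shared += 1;                       /* ...but they update shared state */
            pthread_mutex_unlock(&lock);
            printf("thread %ld running\n", id);
            return NULL;
        }

        int main(void) {
            pthread_t t[3];
            for (long i = 0; i < 3; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            for (int i = 0; i < 3; i++)
                pthread_join(t[i], NULL);
            printf("shared = %d\n", shared);   /* prints 3: one address space */
            return 0;
        }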
  • Parallel System Performance: Evaluation & Scalability
    Parallel System Performance: Evaluation & Scalability. • Factors affecting parallel system performance: algorithm-related, parallel-program-related, and architecture/hardware-related. • Workload-driven quantitative architectural evaluation: select applications or a suite of benchmarks to evaluate the architecture on a real or simulated machine, and from the measured results compute performance metrics: speedup, system efficiency, redundancy, utilization, quality of parallelism. • Resource-oriented workload scaling models describe how the speedup of a parallel computation is affected subject to specific constraints: (1) problem constrained (PC): fixed-load model; (2) time constrained (TC): fixed-time model; (3) memory constrained (MC): fixed-memory model. • Parallel performance scalability, for a given parallel system and a given parallel computation/problem/algorithm — informally, the ability of parallel system performance to increase with increased problem size and system size — with its definition, conditions of scalability, and factors affecting scalability. • Parallel program performance: the goal of parallel processing is to maximize the fixed-problem-size speedup: Speedup = Time(1)/Time(p) ≤ Sequential Work / max over any processor of (Work + Synch Wait Time + Comm Cost + Extra Work), so the most heavily loaded processor limits speedup; parallelizing overheads are reduced by balancing computations/overheads (workload) on processors. (EECC756 – Shaaban, lec #9, Spring 2013, 4-23-2013; Parallel Computer Architecture, Chapter 4; Parallel Programming, Chapter 1, handout #1.)
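    As a worked instance of this bound (with invented numbers for illustration): if the sequential work is 100 time units and, on p = 4 processors, the most loaded processor accumulates 25 units of work plus 5 units of synchronization wait and 10 units of communication cost, then Speedup ≤ 100 / (25 + 5 + 10) = 2.5, well below the ideal speedup of 4 — the busiest processor, not the average one, limits performance.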
  • GPU Concurrency
    GPU Concurrency. Robert Searles, 5/26/2021. Execution scheduling & management: with pre-emptive scheduling, processes share the GPU through time-slicing and scheduling is managed by the system; with concurrent scheduling, processes run on the GPU simultaneously and the user creates and manages the scheduling streams. CUDA concurrency mechanisms compared (Streams / MPS / MIG): partition type: single process / logical / physical; max partitions: unlimited / 48 / 7; performance isolation: no / by percentage / yes; memory protection: no / yes / yes; memory bandwidth QoS: no / no / yes; error isolation: no / no / yes; cross-partition interop: always / IPC / limited IPC; reconfigure: dynamic / at process launch / when idle. (MPS: Multi-Process Service; MIG: Multi-Instance GPU.) CUDA streams — stream semantics: 1. Two operations issued into the same stream will execute in issue order: operation B issued after operation A will not begin to execute until operation A has completed. 2. Two operations issued into separate streams have no ordering prescribed by CUDA: operation A issued into stream 1 may execute before, during, or after operation B issued into stream 2. Operation: usually a cudaMemcpyAsync or a kernel call; more generally, most CUDA API calls that take a stream parameter, as well as stream callbacks. Stream examples — host/device execution concurrency: Kernel<<<b, t>>>(…); // this kernel execution can overlap with cpuFunction(…); // this host code. Concurrent kernels: Kernel<<<b, t, 0, streamA>>>(…); Kernel<<<b, t, 0, streamB>>>(…); // these kernels have the possibility to execute concurrently. In practice, concurrent
  • Multiprocessing and Scalability
    Multiprocessing and Scalability. A. R. Hurson, Computer Science and Engineering, The Pennsylvania State University. Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uni-processor systems. However, due to a number of difficult problems, the potential of these machines has been difficult to realize. This is because of: advances in technology (the rate of increase in performance of uni-processors); the complexity of multiprocessor system design, which drastically affected the cost and implementation cycle; and the programmability of multiprocessor systems, that is, the design complexity of parallel algorithms and parallel programs. Programming a parallel machine is more difficult than programming a sequential one. In addition, it takes much more effort to port an existing sequential program to a parallel machine than to a newly developed sequential machine. The lack of good parallel programming environments and standard parallel languages has further aggravated this issue. As a result, the absolute performance of many early concurrent machines was not significantly better than available or soon-to-be-available uni-processors. Recently, there has been an increased interest in large-scale or massively parallel processing systems. This interest stems from many factors, including: Advances in integrated technology. Very
  • Debugging Multicore & Shared-Memory Embedded Systems
    Debugging Multicore & Shared-Memory Embedded Systems. Classes 249 & 269, 2007 edition. Jakob Engblom, PhD, Virtutech, [email protected]. Scope & context of this talk: the multiprocessor revolution; programming multicore; (in)determinism; error sources; debugging techniques. Some material is specific to shared-memory symmetric multiprocessors and multicore designs, as there are lots of problems particular to these, but most concepts are general to almost any parallel application: the problem is really with parallelism and concurrency rather than a particular design choice. Introduction & background — multiprocessing: what, why, and when? The multicore revolution is here! The imminent event of parallel computers with many processors taking over from single processors has been declared before... this time it is for real. Why? More instruction-level parallelism is hard to find (very complex designs are needed for small gain, while thread-level parallelism appears alive and well); clock frequency scaling is slowing drastically (too much power and heat when pushing the envelope); chips cannot communicate across the die fast enough (better to design small local units with short paths); billions of transistors must be used effectively (easier to reuse a basic unit many times); and there is potential for very easy scaling: just keep adding processors/cores for higher (peak) performance. Parallel processing: John Hennessy, interviewed in ACM Queue, sees the following eras of computer architecture evolution: 1. Initial efforts and early designs, 1940: ENIAC, Zuse, Manchester, etc. 2. Instruction-set architecture, mid-1960s: starting with the IBM System/360, multiple machines with the same compatible instruction set. 3.
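    A minimal generic C example of the nondeterminism such debugging targets (not from the class materials): two threads increment a shared counter without synchronization, so updates are lost and the final value differs from run to run.

        /* Classic shared-memory data race: the final count varies between
         * runs because "counter++" is a non-atomic read-modify-write. */
        #include <pthread.h>
        #include <stdio.h>

        static long counter = 0;               /* shared, unprotected */

        static void *racer(void *arg) {
            (void)arg;
            for (int i = 0; i < 1000000; i++)
                counter++;                     /* load, add, store: can interleave */
            return NULL;
        }

        int main(void) {
            pthread_t a, b;
            pthread_create(&a, NULL, racer, NULL);
            pthread_create(&b, NULL, racer, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            /* Expected 2000000, but typically less -- and different each run. */
            printf("counter = %ld\n", counter);
            return 0;
        }

    Running it several times typically prints different totals below 2000000, which is exactly the kind of error source and (in)determinism the talk enumerates.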
  • PaScal Viewer: a Tool for the Visualization of Parallel Scalability Trends
    PaScal Viewer: a Tool for the Visualization of Parallel Scalability Trends. 1st Anderson B. N. da Silva, Pró-Reitoria de Pesquisa, Inovação e Pós-Graduação, Instituto Federal da Paraíba, João Pessoa, Brazil, [email protected]. 2nd Daniel A. M. Cunha, Dept. de Eng. de Comp. e Automação, Univ. Federal do Rio Grande do Norte, Natal, Brazil, [email protected]. 3rd Vitor R. G. Silva, Dept. de Eng. de Comp. e Automação, Univ. Federal do Rio Grande do Norte, Natal, Brazil, [email protected]. 4th Alex F. de A. Furtunato, Diretoria de Tecnologia da Informação, Instituto Federal do Rio Grande do Norte, Natal, Brazil, [email protected]. 5th Samuel Xavier de Souza, Dept. de Eng. de Comp. e Automação, Univ. Federal do Rio Grande do Norte, Natal, Brazil, [email protected]. Abstract—Taking advantage of the growing number of cores in supercomputers to increase the scalability of parallel programs is an increasing challenge. Many advanced profiling tools have been developed to assist programmers in the process of analyzing data related to the execution of their program. Programmers can act upon the information generated by these data and make their programs reach higher performance levels. However, the information provided by profiling tools is generally designed for a particular environment. The configuration of this environment includes the number of cores, their operating frequency, and the size of the input data or the problem size. Among the various collected information, the elapsed time in each function of the code, the number of function calls, and the memory consumption of the program can be cited to name just a few [2].