Enabling Optimizations to Achieve Higher Performance on the HP PA-RISC Architecture

Version 1.0

California Language Laboratory
11000 Wolfe Road
Cupertino, California 95014

Last Printing: October 8, 1997

(c) Copyright 1997 HEWLETT-PACKARD COMPANY.

The information contained in this document is subject to change without notice. HEWLETT-PACKARD MAKES NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Hewlett-Packard shall not be liable for errors contained herein or for incidental or consequential damages in connection with furnishing, performance, or use of this material. Hewlett-Packard assumes no responsibility for the use or reliability of its software on equipment that is not furnished by Hewlett-Packard.

This document contains proprietary information which is protected by copyright. All rights are reserved. No part of this document may be photocopied, reproduced, or translated to another language without the prior written consent of Hewlett-Packard Company.


1.0 Introduction

The HP PA-RISC architecture is designed to deliver industry-leading performance on today’s commercial and technical applications. Users can enhance the performance of their applications on the PA-RISC architecture by adopting strategies that go beyond merely compiling with the –O default optimization option. In this regard, it is important to understand the advanced performance enhancing features of HP compilers to achieve the highest possible performance. This document discusses factors that have traditionally limited application performance and motivates the use of advanced HP compiler functionality to address those limitations. Some guidelines on coding strategies to facilitate compiler optimizations are also presented.

Many of the performance-enhancing compiler features that will be discussed in this document are already available in HP’s compilers for the PA-RISC architecture. Users are encouraged to use these features on current PA-RISC systems not only to achieve higher performance on PA-RISC systems, but also to be better positioned to exploit the full performance potential afforded by the next generation HP/Intel Enhanced Mode architecture (which will be referred to as the EM architecture in this document). The EM architecture is an advanced architecture that provides a number of unique features for achieving significantly higher performance. HP is developing advanced compiler optimization technology for hardware implementations of this architecture.

This document is organized into sections that discuss the following topics:
❍ The factors that affect application performance and the impact of compiler optimizations on performance.
❍ The optimization levels supported by HP’s compilers and a brief description of the functionality provided at each optimization level.
❍ The factors that can limit the degree of optimization effectiveness and suggestions on how to improve optimization effectiveness.
❍ A summary of the recommendations on enabling compiler optimizations to maximize application performance on the PA-RISC architecture.

Much of the discussion in the current version of this document is focused on enabling optimizations to improve the performance of CPU-intensive integer applications.


2.0 Performance Factors and Compiler Optimizations

The execution time of any application can be characterized by the following equation:

Execution Time = (Number of Machine Instructions Executed)
               × (Average Number of Clock Cycles Required to Execute a Machine Instruction)
               × (Clock Cycle Time)

Since performance is inversely proportional to execution time, higher performance is achieved by executing fewer instructions, spending fewer machine cycles to execute each instruction on average, and/or by increasing the clock frequency.
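As a hypothetical illustration of the equation above: a program that executes 100 million instructions at an average of 1.5 cycles per instruction on a processor with a 5 nanosecond clock cycle (200 MHz) runs in 100,000,000 × 1.5 × 5 ns = 0.75 seconds. Halving the average cycles per instruction would halve the execution time without changing the instruction count or the clock frequency.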

It is well known that algorithmic improvements are a primary means of enhancing application performance since they can often significantly reduce the number of instructions executed. In this document we’ll assume that applicable algorithmic improvements have already been made. Instead, we will focus on further means to boost performance on a PA-RISC architecture system at any given clock frequency. In particular, we’ll examine how compiler optimizations can help reduce both the number of machine instructions executed as well as the average number of cycles required per instruction to deliver higher performance.

In the past, optimizing compilers tended to focus principally on reducing the number of instructions executed by an application. However, for most current processors, and for the PA-RISC architecture in particular, the compiler plays a significant role in reducing the average number of cycles per instruction, or CPI for short. HP’s compilers for the PA-RISC architecture implement a number of optimizations to reduce the overall CPI. It is important to understand the factors that determine the CPI and how compiler optimizations help improve this important performance metric.

2.1 Instruction-level Parallelism

The more instructions that are executed in parallel in each clock cycle on average, the lower the CPI. PA-RISC architecture systems support a very high degree of instruction-level parallelism. However, the compiler must determine the optimal machine instruction sequence to fully exploit the instruction-level parallelism provided by the hardware. Specifically, the hardware can execute multiple instructions in the same clock cycle only if there are no operand dependencies between them. It is the compiler’s responsibility to identify and group independent instructions so that they may be executed concurrently and thus lower the overall CPI.
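As a rough source-level sketch of operand dependencies (hypothetical code; the grouping itself is performed by the compiler on machine instructions, not on C statements):

/* The two adds below have no operand dependencies on each other,
   so a processor that can issue multiple instructions per cycle
   may execute them in the same clock cycle.  The final add depends
   on both results and must wait for them. */
int sum4(int a0, int a1, int b0, int b1)
{
    int t0 = a0 + b0;   /* independent of t1 */
    int t1 = a1 + b1;   /* independent of t0 */
    return t0 + t1;     /* depends on both t0 and t1 */
}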

2.2 Pipeline stalls

Modern processors typically execute instructions in multiple pipelined stages. Each instruction’s execution proceeds from one pipeline stage to the next at every clock cycle. Multiple instructions may be executing concurrently in different pipeline stages. However, if an instruction is unable to proceed from one pipeline stage to the next, instructions in earlier pipeline stages are also prevented from proceeding, resulting in a pipeline stall. Pipeline stalls typically occur if the machine resources or data values required by an instruction are not yet available. For instance, if instruction B uses the result computed by instruction A, and if instruction A has not yet finished computing its result, instruction B will be unable to start computing its result and hence cause a pipeline stall.

Pipeline stalls increase the average number of cycles required to execute each instruction. The more often the pipeline is stalled due to data dependencies, the higher the CPI. For example, loading a data value into a register typically takes two cycles to execute; therefore, if an instruction that reads the contents of the loaded register immediately follows the load instruction, the processor is stalled for one cycle, resulting in a loss of performance. However, the compiler can try to place other instructions between the load operation and the instruction that uses the loaded value to avoid the pipeline stall.
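A hypothetical before-and-after sketch at the source level (the compiler performs this reordering on machine instructions, not on C statements):

/* Before scheduling: the add that uses 'v' immediately follows
   the load of 'v', so the pipeline stalls while the load completes. */
int before(int *p, int s, int k)
{
    int v = *p;     /* load                                 */
    s = s + v;      /* uses v on the very next cycle: stall */
    k = k + 1;      /* independent work                     */
    return s + k;
}

/* After scheduling: the compiler moves the independent increment
   between the load and its use, filling the stall cycle. */
int after(int *p, int s, int k)
{
    int v = *p;     /* load                                 */
    k = k + 1;      /* executes while the load completes    */
    s = s + v;      /* v is available by now                */
    return s + k;
}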

2.3 Branch prediction

Source-code if-statements typically get translated into branch instructions. A branch instruction may alter the flow of control and cause a disruption in pipelined instruction processing. When a branch instruction enters the processor pipeline, the processor would ordinarily have to wait until the branch instruction progresses to a later pipeline stage, when the outcome of the branch is known, before it can determine which instructions to fetch into the pipeline next. Clearly, with this simplistic strategy, portions of the processor pipeline would remain idle during the execution of branch instructions. Moreover, the performance impact can be quite significant, since branch instructions tend to occur rather frequently in typical applications.

To accelerate branch execution, modern processors use special mechanisms to predict the outcomes of branches in advance. This allows instructions to be fetched and executed from the predicted branch destination speculatively. The predicted branch outcome is later compared against the actual outcome once it is known. If the prediction is correct, then the pipeline disruption is effectively avoided. Otherwise, the speculatively executed instructions have to be removed from the pipeline and execution is restarted at the correct branch destination.

Accurate branch prediction is becoming an increasingly important factor for high performance since mispredicted branches can significantly increase the CPI. Two types of branches account for most of the branch misprediction penalties:
❑ Conditional branches — because the result of the branch condition is not known until later, as discussed above.
❑ Indirect branches — because the target of the indirect branch may not be known until later. Primary sources of indirect branches are procedure returns, virtual function calls in C++, indirect calls using function pointers, and switch statements, as illustrated below.
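The following sketch (with hypothetical names) shows two common source constructs that compile into indirect branches:

int dispatch(int op, int x, int (*handler)(int))
{
    switch (op) {           /* large switch statements are often   */
    case 0:  return x + 1;  /* compiled into an indirect branch    */
    case 1:  return x - 1;  /* through a jump table                */
    case 2:  return x * 2;
    case 3:  return x / 2;
    default: break;
    }
    return handler(x);      /* a call through a function pointer is
                               an indirect branch whose target is
                               not known until run time */
}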


PA-RISC hardware systems incorporate sophisticated branch prediction mechanisms to enhance branch prediction accuracy. In addition, compilers for PA-RISC systems can significantly improve the percentage of branches that are correctly predicted through the following techniques:
❑ Giving hints to the hardware about the most likely outcome of a conditional branch using architected branch hinting mechanisms.
❑ Giving hints to the hardware about the likely target of an indirect branch instruction.
❑ Reducing the number of branches executed by eliminating branches. For example, compilers can eliminate branches resulting from procedure calls and returns through procedure inlining.

2.4 Instruction and data cache miss rates

Most modern computer systems use instruction and data caches to bridge the gap between processor and memory speeds. If a needed instruction or data item is available in the cache, the processor does not have to wait for it to be retrieved from main memory. However, if the next instruction is not in the instruction cache (i-cache) or if the data accessed by a currently executing instruction is not in the data cache (d-cache), the processor pipeline stalls until the required instruction or data item can be transferred from main memory to the cache. For high speed processors, this transfer can take many clock cycles. Therefore, the higher the i-cache and d-cache miss rates, the higher the CPI, and hence, the lower the performance.

Compilers can help improve the effectiveness of the caches on PA-RISC systems through the following techniques:
❑ Arranging instructions so that the most frequently used instructions are next to one another for more efficient use of the i-cache.
❑ Inserting instructions into the code stream to prefetch instructions and data from memory to the caches well before they are needed, and thus overlap the memory-to-cache transfer with other useful computations.
❑ Reorganizing data-intensive loop constructs to minimize cache misses.

2.5 TLB miss rate

Modern processors and operating systems support a virtual memory model, which allows each process to access memory independently. Each process accesses memory by specifying a virtual address that needs to be converted into a physical memory address before the memory transfer is initiated. Processors use a structure known as a Translation Look-aside Buffer, or TLB for short, to cache frequently used virtual-to-physical memory address translations. The virtual-to-physical address translations are maintained on a per-page basis where pages are typically 4K bytes in size. The TLB is a fixed size table and therefore all mappings of virtual-to-physical addresses can not be kept in the TLB at the same time. A TLB miss occurs when the virtual-to-physical address mapping for a required instruction or data item is not in the TLB. When a TLB miss occurs, the relevant address translation is retrieved from the d-cache or main memory, which may take tens to hundreds of clock cycles. TLB misses are thus quite expensive and the higher the number of TLB misses, the higher the CPI, and hence, the lower the performance.


Compilers can help reduce TLB misses, particularly instruction TLB misses. For many applications a significant amount of time is spent executing a small portion of the total code stream. If the compiler is able to identify which sections of the code are executed the most often, it can arrange to re-position those code fragments contiguously so that they fit into a smaller number of pages. This can reduce the number of TLB entries required during the execution of the program and potentially reduce the number of TLB misses.

3.0 Optimization Levels Provided by HP Compilers

The PA-RISC compilers provide the optimization levels listed below:

Optimization Level 1 — enabled by the +O1 compiler option.

This is the lowest optimization level, where a limited set of optimizations is performed. The optimizations performed at this level include common sub-expression elimination, redundant load elimination, and peephole optimizations, each limited in scope to small sections of procedures.

Optimization Level 2 — enabled by the +O2 compiler option.

This is the optimization level that is selected by default when you specify the –O compiler option. Users should typically see a significant performance improvement compared to level 1 or no optimization. Most of the discussion in this document, however, is focused on how to achieve performance beyond what is attained at this level.

At optimization level 2, the compiler performs a number of transformations over the entire scope of individual procedures, such as common sub-expression elimination, loop invariant code motion, peephole optimizations, and elimination of unnecessary loads and stores, to reduce the number of instructions executed by the program. Additionally, the compiler performs instruction scheduling to reduce the CPI.

Optimization Level 3 — enabled by the +O3 compiler option.

At this level, the compiler performs procedure inlining. Only those procedures within the same source file are subject to procedure inlining at this optimization level.1 Procedure inlining reduces the number of instructions by eliminating the overhead of procedure calls and also helps reduce the CPI by eliminating branches and increasing the scope of instruction scheduling. Procedure inlining is particularly effective for applications that have many small routines which are frequently executed.
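As a hypothetical illustration, a small routine like the following is an ideal inlining candidate:

/* point_dist2() is short and called from many frequently executed
   loops; inlining it removes the call and return branches and lets
   the compiler schedule its arithmetic together with the caller's
   own instructions. */
typedef struct { int x, y; } point;

static int point_dist2(point a, point b)
{
    int dx = a.x - b.x;
    int dy = a.y - b.y;
    return dx * dx + dy * dy;
}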

Interprocedural alias analysis is also performed for file static variables at optimization level 3. In addition, the compiler also restructures certain types of data intensive loops to improve the effectiveness of the d-cache at this level.

1. Note that for C++ programs, the compiler tries to inline procedures identified as inlining candidates in the source code even at lower optimization levels.


Optimization Level 4 — enabled by the +O4 compiler option.

At this level, the compiler performs procedure inlining across all files compiled at +O4. Since most applications have multiple source files, it is important to use +O4 to increase the effectiveness of procedure inlining. Level 4 optimization is most effective when used in conjunction with Profile Based Optimization (PBO), which is discussed later.

Interprocedural alias analysis is also performed for global variables when the +Owhole_program_mode option is specified at optimization level 4.

In general, to achieve the best possible performance on the PA-RISC architecture, applications should be compiled at the highest acceptable level of optimization. While maximum performance is typically achieved at optimization level 4, the time taken to compile large applications at this level can be significantly greater than at lower optimization levels. Nevertheless, users can be better positioned to exploit the full performance potential of the HP PA-RISC architecture by evaluating the effectiveness of optimization level 4 for their applications and adopting it in production builds and testing processes as appropriate.

4.0 Limiters to Effective Optimizations and Solutions

When compilers perform the kinds of optimizations discussed above, they typically face a number of challenges. These challenges and limiters are discussed next, along with strategies that HP compiler users can adopt to enhance optimization effectiveness and thus achieve significantly higher performance on their applications.

4.1 Knowledge of program execution profile

When a compiler compiles a source program it does not have precise knowledge of the program execution profile. The program execution profile encompasses information about the typical outcomes of conditional branches, how often each procedure is called, what portions of a program are executed the most often, etc. In the absence of such knowledge, the compiler is forced to make educated guesses about the execution profile using heuristics. These heuristics in some cases do a reasonable job of predicting the actual execution profile, while in other cases they are not very reliable or even applicable. Here are some specific examples where the lack of execution profile information can limit optimization effectiveness in HP compilers:

1. When the compiler performs procedure inlining within a file at +O3 or across files at +O4, it uses heuristics to guide which routines should be inlined and where inlining should be performed. However, the compiler may perform more inlining than needed and in some cases even hurt performance due to the increased code size. If the compiler has a better idea of which routines are executed most frequently and how often a particular routine is called from a given call site, it can be more judicious in performing procedure inlining and the resulting code is likely to perform much better. For large applications (more than 100k lines), +O4 and procedure inlining are likely to be the most effective when the compiler has information about the program execution profile.


2. The compiler uses heuristics to decide which data values should be kept in registers to effectively use the available hardware registers. These heuristics work well in general. However, registers can be allocated more efficiently if the compiler knows which sections of a routine are executed most frequently. Knowledge of the execution profile can ensure that the most important variables are kept in registers and the minimum number of registers are used. Reducing the number of registers in use is quite important since it can help avoid the overhead of saving and restoring registers when the number of registers required exceeds the number of registers actually available.

3. In most integer applications, typically four to ten instructions are executed before a branch instruction is encountered. This does not leave enough room for the compiler to schedule instructions to avoid pipeline stalls. The compiler needs to move instructions from the most likely destination of each branch and schedule them with instructions in the block of code preceding the branch. Without knowing the actual execution frequencies for each target of a conditional branch, the compiler has to use heuristics to predict the most likely branch destination. The compiler’s branch prediction heuristics are not always accurate and can sometimes diminish the effectiveness of instruction scheduling.

4. The compiler tries to reorder the code so that the most frequently executed routines are placed together. This optimization is important for reducing instruction TLB misses as discussed earlier. However, without knowing which routines are executed most frequently, it is very difficult for the compiler to lay out the code optimally.

5. The compiler tries to provide hints to the hardware for branches to reduce branch misprediction penalties. Such hints are useful only if the compiler itself can accurately predict the most likely target of a given branch.

6. When performing loop optimizations such as loop invariant code motion or loop unrolling, compilers typically assume that the loop body is executed many times. While this assumption is usually true, some loops are executed only one or two times. Loop optimizations can hurt performance if the transformed loops are not frequently executed. Armed with knowledge of how often loops are executed, the compiler can be more judicious in performing loop optimizations.

In summary, knowledge of the program execution profile would allow HP compilers to:
❑ Replace heuristics with more precise information to significantly improve the effectiveness of optimizations, particularly procedure inlining, instruction scheduling, register allocation, and loop optimizations.
❑ Perform certain optimizations that can significantly improve the CPI. These optimizations include the generation of branch hints and code layout.

4.1.1 Profile Based Optimization (PBO)

HP’s compilers for the PA-RISC architecture provide a framework called PBO to automatically gather and feed back information about the program execution profile to the compiler. Significant performance improvement has been achieved using PBO on PA-RISC systems.2 PBO will be even more important for achieving the highest possible performance on the next generation HP/Intel EM architecture.


4.1.2 PBO Usage

The PBO functionality provided by HP’s PA-RISC production compilers is described here. The compilers for the next generation HP/Intel EM architecture will provide an improved framework but the basic principles are likely to remain the same.

PBO can be used with any level of optimization. In other words, if profile information is available, it is used by optimizations performed at each level. A three-step procedure is used to optimize applications with PBO:

1. In the first step, the program is instrumented to collect an execution profile. In PA-RISC compilers, the +I option causes the code to be instrumented. (It is likely that this step will be optional for the HP/Intel EM architecture.)

2. The second step is to identify common input data sets for your application and to run the instrumented executable with the selected representative input data sets. A profile database is automatically generated when the program terminates and is updated for each subsequent application run.

3. The final step is to re-compile the program using the profile database as input with the appropriate level of optimization (preferably +O4). This is done on PA-RISC systems by specifying the +P compiler option. The resulting binary is highly optimized to achieve increased application performance for the representative input data sets.
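A hypothetical build sequence for a C application (program and file names are illustrative; consult the man pages for the exact options supported by your compiler release):

cc +O4 +I -o myapp myapp.c        # step 1: build an instrumented executable
./myapp < representative.input    # step 2: run it; the profile database is written automatically
cc +O4 +P -o myapp myapp.c        # step 3: re-compile using the collected profile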

PBO improves the effectiveness of many optimizations performed at +O2 and +O3 and is highly recommended for +O4 optimization. PBO also enables certain profile-dependent optimizations in the compiler that are performed only when the application is compiled with the +P option.

While PBO is very effective and safe to use, users may face the following challenges in using PBO:
❑ PBO generally involves some changes to the build process and Makefiles. However, these changes only need to be made once and the performance benefits can be reaped for years to come.
❑ PBO requires representative input data sets for gathering the profile information. Identification of representative data sets may not be easy for some applications. In practice, however, even PBO performed with input data sets that are not the most representative tends to be more effective than optimization performed without any profile information.

It is highly recommended that users start using the PBO framework on PA-RISC machines. They will not only get higher performance on PA-RISC systems but will also be better positioned to exploit the full performance potential on HP/Intel EM architecture systems.

For a more detailed description of how to use PBO on PA-RISC systems, please refer to the man pages and the on-line documentation provided with the language products.

2. Recently, the performance of a large commercial CAD application on a PA-8000 based system was improved by almost 35% by using PBO and selective inlining performed at optimization level 4.


4.2 Aliasing

Aliasing significantly hinders the compiler’s ability to extract maximum performance. Aliasing occurs when there are multiple ways to reference the same data location. The most common example of aliasing involves pointer dereferences.

In general, two different pointer variables can point to the same data location. The HP C compiler can keep track of local pointer variables that are assigned the addresses of discrete objects, but determining the actual addresses pointed to by a global pointer variable, a formal pointer parameter, or a pointer returned by a function call is almost impossible. For such pointers, the compiler makes very conservative assumptions and assumes that these pointers can point to all globals and that every pair of such pointers can potentially point to the same memory. Due to these conservative pointer aliasing assumptions, the compiler is very constrained when dealing with such inscrutable pointer dereferences.
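A small hypothetical example illustrates why a formal pointer parameter forces conservative assumptions:

/* Because the compiler cannot prove that 'dst' and 'src' refer to
   different memory, the store through 'dst' may modify '*src'.
   The compiler must therefore reload '*src' on every iteration
   instead of keeping its value in a register. */
void scale(int *dst, int *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = *src * 2;
}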

Pointer aliasing hinders compiler optimizations in two ways:
❑ The compiler does not generate code to maintain aliased objects in registers and instead generates more instructions to access such objects from memory.
❑ The compiler is not free to move the use of an object ahead of a definition of an aliased object, thus curtailing the effectiveness of instruction scheduling.

In general, the compiler can not use the declared type of pointer variables to infer that pointers to two different types can not point to the same memory location. In the C language, for instance, it is possible to assign a pointer of one type to a pointer of another type. Pointer aliasing can be a significant performance inhibitor when optimizing C and C++ programs.

The PA-RISC architecture provides advanced features that allow native compilers to mitigate the performance impact of pointer aliasing. Nevertheless, the techniques mentioned below should be considered to improve application performance in the presence of pointer aliasing.

4.2.1 Using Compiler Options for Improved Aliasing Information

Compiling programs with the +O3 or +O4 options improves the compiler’s ability to analyze pointer variables for potential aliases. In particular, the compiler has better knowledge of global variables at optimization levels 3 and 4. At optimization level 3, the compiler can identify file static variables whose addresses are never taken and exclude them as potential aliases of pointer dereferences. When the +Owhole_program_mode option is used at optimization level 4, the compiler can identify globals whose addresses are never taken and exclude them as potential aliases of pointer dereferences. Additionally, at optimization level 4, the compiler can analyze the potential side effects of procedure calls and can use this information to maintain global variables in registers across call sites.

4.2.2 Source Changes to Eliminate Aliasing

Sometimes it is possible to clearly expose the lack of pointer aliasing to the compiler through source code changes. The source code changes suggested below can be quite effective in facilitating compiler optimizations for performance-critical routines.


Example 1.

If a routine makes heavy use of the value of a global variable or a value loaded through pointer dereferencing, it is better to copy the value into a local variable at an infrequently executed location and then replace all of the original references with references to the local variable. This can be done safely only if one is certain that the global variable or the memory location pointed to by the dereferenced pointer variable is not modified by called functions or through some other pointer variable.

With this source change, the compiler can keep the value of the local variable in a register and avoid additional instructions that would otherwise be needed to access the value from its home memory location.

In the following source example, global variable g_var and the value, *q, are accessed inside the loop. The loop also contains an indirect store through another pointer variable, p. The compiler has to conservatively assume that p can point to g_var, or that p can point to the global pointer q, or that p and q can point to the same memory location. Hence, it has to load the values of g_var, p, and q from memory in each loop iteration.

int g_var;
int *q;

slow()
{
    int i;
    int *p = foo();

    for (i = 0; i < 10; i++) {
        *p++ = g_var + *q + i;
    }
}

However, assume that the code is modified as below, where the values of g_var and *q are copied into local variables which are then used inside the loop. Now the values assigned to the local variables will be kept in registers and the three loads inside the loop will be eliminated, thus speeding up the execution of the loop body. This is because the compiler is assured that the local variables val and q_val can not be aliased to *p since their addresses have not been captured.


fast()
{
    int i;
    int *p = foo();

    /* assign to local variables */
    int q_val = *q;
    int val = g_var;

    for (i = 0; i < 10; i++) {
        *p++ = val + q_val + i;
    }
}

In summary, better code is generated if references to global variables and pointer dereferences can be replaced by references to local variables. Also, global pointers should be avoided because they cause additional overhead: both the pointer variable and the object pointed to by the pointer variable are assumed to alias with all other pointers.

Example 2.

It is common to have multiple levels of pointer indirection in C programs. The following construct

p->x_ptr->y_ptr->z_ptr

is an example of an expression involving multiple levels of pointer indirection. When the same complex pointer expression involving one or more levels of pointer indirection is repeatedly executed, it is better to copy the pointer expression into a local pointer and then use the local pointer for subsequent references, as illustrated below.

For example, instead of writing

p->x_ptr->y_ptr->z_ptr = g_p;
p->x_ptr->y_ptr->d1 = 10;

considerably less code would be generated if the code is modified as follows:

q = p->x_ptr->y_ptr;
q->z_ptr = g_p;
q->d1 = 10;

where q is a local pointer. While this should be safe to do most of the time, one should make sure that no other pointer assignments can modify any of the pointers in the expression p->x_ptr->y_ptr.


4.3 Specialized Optimizations

There are certain optimizations that may not universally result in performance improvements and therefore the compiler does not perform these specialized optimizations by default. The relevant compiler options need to be explicitly specified to enable them. One important optimization that falls in this category is explicit data-cache prefetching. Explicit data cache prefetching is an advanced compiler optimization that is performed by the HP compilers for PA-8000 based systems.

To prefetch data into the d-cache, the compiler needs to insert explicit prefetch instructions into the code stream. If the prefetched data is already present in the d-cache, explicit cache prefetching can actually hurt performance due to the overhead of extra instructions. Unfortunately, it is generally very difficult for a compiler to accurately predict whether the data accessed by a program will be present in the d-cache.

Due to the unpredictable nature of d-cache usage, explicit data cache prefetching is performed by the PA-8000 HP compilers only when a special compiler option is specified, namely the +Odataprefetch command line option.

Explicit data cache prefetching can significantly improve the performance of loops with high d-cache miss rates. When the +Odataprefetch option is specified, the compiler automatically inserts instructions inside loops to prefetch data several loop iterations before they are needed. The cache data transfer is thus overlapped with the execution of the intervening loop iterations.
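For instance, a long, regular array traversal such as the following hypothetical loop is the kind of code that benefits; with +Odataprefetch, the compiler conceptually issues a prefetch several elements ahead (for example, for a[i+k], where it chooses the distance k) on each pass through the loop:

/* With a large array that does not fit in the d-cache, nearly every
   new cache line touched below may otherwise stall the processor
   while it waits on memory. */
double sum(double *a, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}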

Optimizations to reduce cache miss rates will continue to be very important for achieving high performance. HP compilers for the next generation HP/Intel EM architecture are likely to provide compiler options and pragmas to annotate memory references that are likely to incur d-cache misses, as well as an improved PBO framework to guide the compiler’s cache prefetching decisions and other cache-related optimizations. More details will be provided on compiler features for improving cache performance in a later version of this document.

4.4 Conservative Assumptions

Compilers must make conservative assumptions about program behavior to ensure that applications continue to behave as expected when optimized. For example, when the compiler encounters a call to the C library routine strcpy(), it must assume that the program may have defined its own version of strcpy() and therefore it can not replace the strcpy() invocation with a fast sequence of inline instructions.

In many cases, these assumptions are too conservative, and the compiler could have made more aggressive assumptions without changing the behavior of the application. Therefore, HP compilers provide a number of options which allow the compiler to make more aggressive assumptions to improve the performance of the generated code. These options should be used with care and after making sure that the aggressive optimization assumptions will not break the application.

The PA-RISC compiler provides the following useful options that should be seriously considered by users interested in improving application performance. These options are supported by the compilers for the PA-RISC architecture and will be supported for the next generation HP/Intel EM architecture as well.

+Olibcalls

This option allows the compiler to assume that the program being compiled has not defined its own versions of library routines such as strcpy(), memset(), memcpy(), alloca(), and fabs(). With this assumption, the compiler can replace calls to these routines with very fast inline instruction sequences. This option can yield significant performance gains for applications that spend a lot of their execution time in these library routines.

+ESlit

If your program does not modify literal strings, this option can direct the compiler to allocate string literals in the code area. This option can usually provide a moderate performance improvement.

+Onofltacc

This option is primarily intended to improve the performance of floating-point applications. By default, the compiler has to assume that preserving a high degree of floating-point accuracy is important and therefore it does not perform any optimizations that can affect the accuracy of floating-point computation results. However, if an application can tolerate small variations in floating-point results, this option allows the compiler to perform more aggressive optimizations and can provide significant performance improvements. One such optimization is the replacement of floating-point division operations occurring in loops with a less expensive floating-point reciprocal multiplication operation, where the reciprocal value of the divisor is computed outside the loop.
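A hypothetical sketch of the reciprocal transformation, shown here at the source level (the compiler performs it on the generated code):

/* Original: one expensive floating-point divide per iteration. */
void scale_div(double *a, double *b, int n, double d)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] / d;
}

/* With +Onofltacc, the compiler may transform the loop as if it
   were written this way: the reciprocal is computed once outside
   the loop and a cheaper multiply is used inside.  Results can
   differ in the last bits, which is why the option is opt-in. */
void scale_recip(double *a, double *b, int n, double d)
{
    double r = 1.0 / d;
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * r;
}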

In general, there are many source code changes that can help improve performance. There is little that the optimizer can do that compares with the implementation of a significantly better algo- rithm. However, there are a few simple changes that should be considered to help improve perfor- mance.

4.5.1 Data Types

Choices of data types can impact performance. Depending on the data types of variables, more or less code can be generated by the compiler. The following examples can help guide better choices for data types:

Example 1.

For computations involving different C integer data types (short, int, long, char), the compiler generates extra instructions to convert a value of the shorter type into the longer type. In the ILP32 data model (the default mode for 32-bit applications), both int and long data type objects are 32 bits in length, but in the LP64 data model (the default mode for 64-bit applications), int objects are 32 bits while long objects are 64 bits in length. Therefore, while mixing int and long variables in arithmetic expressions in the ILP32 data model does not result in extra instructions, it will result in extra instructions under the LP64 data model. Thus, it is recommended that, when possible, one should avoid mixing data types in computations.
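A minimal hypothetical illustration under the LP64 data model:

long add_count(long total, int count)
{
    /* Under LP64, 'count' (32 bits) must be sign-extended to 64 bits
       before it can be added to 'total' (64 bits), costing an extra
       instruction on each use.  Declaring both operands as the same
       type avoids the conversion. */
    return total + count;
}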

Example 2.

In the LP64 data model, pointers are 64 bits in length. Therefore, whenever a 32-bit signed integer is added to a pointer, the compiler needs to generate an extra instruction to sign-extend the 32-bit integer value into a 64-bit value prior to the pointer addition. These extra instructions may be avoided by using unsigned integer or long data types in address arithmetic.

In the following example,

int i, *p, *q, x[100], y[100];
unsigned int j;

main()
{
    *(p + i) = *(q + j);
    x[i] = y[j];
}

more code will be generated for (p+i) compared to (q+j) under the LP64 data model because i is a signed int variable while j is an unsigned int variable. Similarly, more code will be generated for the store to x[i] compared to the load of y[j].

4.5.2 Malloc Optimization

If the malloc() C library routine is high in the execution profile, it may be prudent to change the memory allocation algorithm of malloc() by calling the mallopt() C library routine in order to improve performance. For example:

#include <malloc.h>

mallopt(M_MXFAST, 128);

This tells the malloc() library routine to use an alternative algorithm for all blocks smaller than 128 bytes in size. Although this algorithm may potentially waste more space, it is usually much faster than the default algorithm. This can help improve the performance of applications that allocate and free small blocks of memory frequently.


5.0 Summary

To achieve the best possible performance on HP PA-RISC architecture systems, users must go beyond the mere use of the –O command-line option when compiling their applications. Users should start using PBO and higher levels of optimization (+O3 and +O4) on PA-RISC systems today to better position themselves to exploit the full performance potential of the HP/Intel EM architecture. In particular, PBO is a key enabler for achieving the best performance on the HP/Intel EM architecture and users should consider early adoption of PBO in their production build processes.

Various source changes suggested in this document may be considered for performance-critical routines and should help improve application performance on PA-RISC systems as well.

Optimizations aimed at increasing cache efficiency will also be increasingly important for high frequency HP/Intel EM hardware systems. Future versions of this document will provide more information on how users can guide compilers in reducing the overhead of cache misses.

6.0 More Information

The following documents provide additional information on compiler optimization and performance tuning with PA-RISC compilers:
❑ The HP PA-RISC Compiler Optimization Technology White Paper: on your 10.x system at /opt/langtools/newconfig/white_papers/optimize.ps
❑ “Performance Tuning with PA-RISC Compilers”, Carl Burch, InterWorks ’97, Tutorial #58, April 12-17, 1997
❑ “Optimization for a Superscalar Out-of-Order Machine”, Anne Holler, Proceedings of the 29th Annual International Symposium on Microarchitecture, 1996
❑ “Advanced Performance Features of the 64-bit PA-8000”, Doug Hunt, Proceedings of the Spring 1995 COMPCON
❑ “PA-8500: The Continuing Evolution of the PA-8000 Family”, Gregg Lesartre and Doug Hunt, Proceedings of the Spring 1997 COMPCON
