Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing
Total Page:16
File Type:pdf, Size:1020Kb
Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing John Curreri, Seth Koehler, Brian Holland, and Alan D. George NSF Center for High-Performance Reconfigurable Computing (CHREC) ECE Department, University of Florida {curreri, koehler, holland, george}@chrec.org Abstract software developers typically rely heavily upon debugging and performance analysis tools. In order to High-Level Languages (HLLs) for FPGAs (Field- accommodate the typical software development process, Programmable Gate Arrays) facilitate the use of application mappers support debug of HLL source code reconfigurable computing resources for application on a traditional microprocessor rather than relying upon developers by using familiar, higher-level syntax, HDL simulators and logic analyzers. However, current semantics, and abstractions, typically enabling faster commercial application mappers provide few (if any) development times than with traditional Hardware runtime tools to debug or analyze application Description Languages (HDLs). However, this performance at the HLL source-code level while abstraction is typically accompanied by some loss of executing on one or more FPGAs. While methods and performance as well as reduced transparency of tools for debugging FPGAs have been well researched application behavior, making it difficult to understand and even developed, such as for the Sea Cucumber HLL and improve application performance. While runtime which has tool support for runtime debugging [3], tools for performance analysis are often featured in research is currently lacking in runtime performance development with traditional HLLs for serial and analysis tools for FPGAs, especially when HLLs are parallel programming, HLL-based applications for featured. FPGAs have an equal or greater need yet lack these Without performance tools to assist in analyzing tools. This paper presents a novel and portable application behavior on the FPGA, potential framework for runtime performance analysis of HLL performance gains of reconfigurable computing may be applications for FPGAs, including a prototype tool for lost. A major advantage of reconfigurable computing is performance analysis with Impulse C, a commercial performance increase obtained from application-specific HLL for FPGAs. As a case study, this tool is used to hardware optimizations. Application mappers attempt to locate performance bottlenecks in a molecular dynamics create optimized hardware by extracting parallelism out application. of amenable HLL statements (e.g. performing the iterations of a loop in parallel or in a pipeline if Keywords: Performance analysis, profile, trace, possible). However, the amount of parallelism reconfigurable computing, FPGA, high-level language, extracted, and the overall structure of the design in application mapper, Impulse C, Carte hardware, can depend heavily upon the way in which the algorithm is expressed as well as the techniques used by 1. Introduction the application mapper to translate source code to hardware. Runtime performance analysis (hereafter the Today’s application mappers (i.e. compilers that “runtime” is assumed) allows the application developer translate High-Level Languages (HLLs) to hardware to better understand application behavior in terms of configurations on FPGAs, such as Impulse C [1] or both computation and communication, aiding the Carte [2]) simplify software developers’ transition to developer in locating and removing performance reconfigurable computing and its performance bottlenecks in each. Complex or dynamic data advantages without the steep learning curve associated dependencies during computation, shared with traditional Hardware Description Languages communication channels, and load balancing among (HDLs). While HDL developers have become parallelized components can be very difficult to predict accustomed to debugging their code via simulators, 1 and yet significantly affect performance. Performance monitor the performance of hardware inside the FPGA analysis tools for HLL codes must allow the developer itself [5]. This paper significantly extends our previous to monitor these types of areas while presenting analysis work on performance analysis for HDL applications [6], from the perspective of the HLL source code. which discusses a framework for monitoring Unfortunately, many well-researched debugging performance inside the FPGA called the Hardware techniques may not be suited for runtime performance Measurement Module (HMM), by expanding this analysis. For example, halting an FPGA to read back its framework to support the challenges of HLL mappers. state may not be viable due to the unacceptable level of disturbance caused to the application’s behavior and Original Application timing since the FPGA will be temporarily inaccessible Instrument to the CPU, causing performance problems that did not exist before. Instrumented Application Execute Alternatively, performance can be analyzed through Execution simulation. However, cycle-accurate simulations of Measure Environment complex designs on an FPGA are slow and increase in complexity as additional system components are added Measured Data File Analyze to the simulation. Most (if not all) cycle-accurate Present (Automatically) simulators for FPGAs focus upon signal analysis and do not present the results at the HLL source-code level for a Visualizations software developer. Analyze This paper focuses upon performance analysis of an (Manually) HLL application on a reconfigurable system by Potential Bottlenecks monitoring the application at runtime. We have gained Optimize the majority of our insight about performance analysis Modified Application with application mappers from a prototype performance Optimized Application analysis tool that we have developed in this research for Impulse C. Impulse C, a product of Impulse Accelerated Figure 1. Performance Analysis Steps Technologies, maps a reduced set of C statements to HDL for use on a variety of platforms. In addition, 3. Background examination here of Carte, a product of SRC Computers, provides an alternate example of how C code can be Performance analysis can be divided into six steps mapped to an FPGA along with initial ramifications (derived from Maloney’s work on the TAU performance found for a performance analysis tool supporting Carte. analysis framework for traditional processors [7]) whose The remainder of this paper is organized as follows. end goal is to produce an optimized application. These Section 2 discusses related work while Section 3 steps are Instrument , Measure , Execute , Analyze , provides background information in runtime Present , and Optimize (see Figure 1). The performance analysis. Next, Section 4 covers the instrumentation step inserts the necessary code (i.e. for challenges of performance analysis for HLLs targeting additional hardware in the FPGA’s case) to access and FPGAs. Section 5 then presents a case study using a record application data at runtime, such as variables or molecular dynamics application written in Impulse C. signals to capture performance indicators. Measurement Finally, Section 6 concludes and presents ideas for is the process of recording and storing the performance future work. data at runtime while the application is executing. After execution, analysis of performance data to identify 2. Related Work potential bottlenecks can be performed in one of two ways. Some tools can automatically analyze the To the best of our knowledge from a comprehensive measured data, while other tools rely solely upon the literature search, little previous work exists concerning developer to analyze the results. In either case, data is performance analysis for FPGAs. Hardware typically presented to the user via text, charts, or other performance measurement modules have been integrated visualizations to allow for further analysis. Finally, into FPGAs before; however, they were designed optimization is performed by modifying the specifically for monitoring the execution of soft-core application’s code based upon insights gained via the processors [4]. The Owl framework, which provides previous steps. Since automated optimization is an open performance analysis of system interconnects, uses area of research, optimization at present is typically a FPGAs for performance analysis, but does not actually manual process. Finally, these steps may be repeated as 2 many times as the developer deems necessary, resulting Instrumentation can also be inserted after the in an optimized application. This methodology is application has been mapped from HLL to HDL. employed by a number of existing tools for parallel Instrumentation of VHDL or Verilog provides the most performance analysis including TAU [7], PPW [8] flexibility since measurement hardware can be fully KOJAK [9] and HPCToolkit [10]. customized to the application’s needs, rather than depending upon built-in HLL timing functions. 4. HLL Performance Analysis Challenges However, using instrumentation below the HLL source level does require additional effort to map information While all stages of performance analysis mentioned gathered at the HDL level back to the source level. This above are of interest for application mappers, we limit process is problematic due to the diversity of mapping our discussion to the challenges of instrumentation, schemes and translation techniques employed by various measurement, and analysis for the remainder of this application mappers and even among different versions paper. The