Profiling Tools for Hardware/Software Partitioning of Embedded Applications

Dinesh C. Suresh, Walid A. Najjar, Frank Vahid, Jason R. Villarreal, Greg Stitt
University of California, Riverside

ABSTRACT

Loops constitute the most executed segments of programs and are therefore the best candidates for hardware/software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and that support combined function and loop profiling. One tool relies on an instruction set simulator and can therefore be augmented with simulation of architecture and micro-architecture features; the other is based on compile-time instrumentation through gcc and therefore incurs very little slowdown compared to the original program. We use the profiling results to identify the compute core of each benchmark and study the effect of compile-time optimization on the distribution of cores in a program. As an example application of these tools in hardware/software partitioning, we also study the potential speedup that can be achieved on a configurable system-on-a-chip consisting of a CPU embedded on an FPGA.

Categories and Subject Descriptors
C.3 [Performance of Systems]: Measurement techniques, Design studies – profiling techniques, hardware/software partitioning.

General Terms
Measurement, Performance, Design, Experimentation.

Keywords
Hardware/software partitioning, loop analysis, compiler optimization.

1. INTRODUCTION

Embedded software is the key contributor to embedded system performance and power consumption. Program execution tends to spend most of its time in a small fraction of the code, a feature known as the "90-10 rule": 90% of the execution time comes from 10% of the code. By their very nature, embedded applications tend to follow the 90-10 rule even more closely than desktop applications.

Tools seeking to optimize the performance and/or energy consumption of embedded software should therefore focus first on finding that critical code. Possible optimizations include aggressive recompilation, customized instruction synthesis, customized memory hierarchy synthesis, and hardware/software partitioning [10,2], all focusing on the critical code regions. Of those critical code regions, about 85% are inner loops, while the remaining 15% are functions. A partitioning tool should thus focus first on finding the most critical software loops and understanding their execution statistics, after which it should try partitioning alternatives coupled with hardware loop transformations (such as loop unrolling). Our particular interest is the hardware/software partitioning of programs, but our methods apply to the other optimization approaches as well.

Many profiling tools have been developed. Some, like gprof, provide only function-level profiling and lack the more detailed information, such as loop statistics, that is necessary for partitioning. Tools that do profile at a more detailed level tend to focus on statements or blocks, so a user interested in loops must implement additional functionality on top of them. Furthermore, many profiling tools, like ATOM [12] or Spix [11], are specific to a particular microprocessor family.

Instruction-level profiling tools can be tuned to provide useful information about the percentage of time spent in the different loops of a program. They can be broadly classified into two categories: compilation-based and simulation-based instruction profilers. A compilation-based profiler instruments the program by adding counters to its basic blocks; during execution, the counter values are written to a separate file. A simulation-based instruction profiler uses an instruction set simulator and can be further classified as static or dynamic. A simulation-based dynamic profiler obtains the instruction profile during execution of the code on the simulator, whereas in static profiling the execution is written to a trace and the trace is then processed to obtain instruction counts. For very large applications, the trace generated by static profiling can grow to unmanageable proportions. Even though dynamic profiling is slow compared to compiler-based instrumentation, a variety of architectural parameters can be tuned and studied while the program is profiled on a full system simulator.
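To make the trace-driven static variant concrete, the sketch below post-processes an instruction-address trace into per-region counts. It is an illustration only: the one-address-per-line trace format, the file and function names, and the region table are our assumptions, not part of any tool discussed here.

```python
# Sketch: static, trace-based instruction profiling over address ranges.
from collections import Counter

def profile_trace(trace_path, regions):
    """regions: {label: (start_addr, end_addr)} for the code regions of
    interest (e.g., loop bodies). Returns executed-instruction counts."""
    counts = Counter()
    with open(trace_path) as trace:
        for line in trace:
            addr = int(line, 16)          # one executed address per line
            for label, (start, end) in regions.items():
                if start <= addr <= end:
                    counts[label] += 1
    return counts

# Hypothetical usage: count instructions executed inside one loop body.
# counts = profile_trace("app.trace", {"loop_A": (0x8048F10, 0x8048F2A)})
```

The trace itself is exactly what makes this approach unwieldy for long runs: the file grows with every executed instruction, which is the scaling problem noted above.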


We have developed a profiling toolset that focuses on collecting loop-level information for a very large variety of microprocessor platforms and that supports both the instrumentation and the simulation paradigms. We achieved this goal by building on top of two very popular tools – gcc for instrumentation and Simics for simulation – while keeping the output identical for the two paradigms, enabling easy switching between them. Both gcc and Simics, and hence our tool, support dozens of popular microprocessors. We call our toolset the Frequent Loop Analysis Toolset (FLAT).

2. RELATED WORK

Profilers like gprof are helpful to the extent of determining the time spent in function calls. However, to make judicious hardware/software partitioning decisions, knowledge of the program at the granularity of loops is imperative.

ATOM [12] provides a toolset that lets the user track a program's behavior by inserting analysis routines at interesting parts of the program. When the program is executed, the analysis routines collect information about various parts of the program and dump the results to a separate file. ATOM provides the following tools for instruction profiling: hiprof, pixie and uprof. The hiprof tool can provide sampled program counters for different program events, the pixie tool provides basic block profile information, and uprof is useful for profiling non-time events and can provide procedure, source line and assembler profiles for a program.

The Harvard Atom Like Tool (HALT) [3] provides a flexible way to add routines to programs produced by the SUIF compiler. Users indicate interesting parts of the program by labeling them with SUIF annotations. HALT looks for these annotations and inserts function calls to analysis routines that match the type of the annotation. Using different analysis routines, HALT provides a number of hardware simulators, performs branch stream analysis, and records statistics for profile-driven optimizations. HALT is helpful for obtaining information regarding branch prediction, code layout, instruction scheduling, and register allocation. It has been ported to MIPS and Alpha processors.

Optimally Profiling and Tracing Programs [4] inserts counters in the control flow graphs (CFGs) in order to record the execution counts of the basic blocks of a program. QPT [15] is an instruction-profiling tool based on the algorithms described in [4] and is targeted at the SPARC architecture. It supports two modes of instruction profiling: a slow mode that inserts a counter for every basic block, and a quick mode that relies on inserting counters on an infrequently executed subset of edges in the control flow graph. CPROF [15] processes program traces generated by QPT and annotates source lines and data structures with the appropriate cache miss statistics.

ProfileMe [5] samples instructions as they move through an out-of-order issue pipeline and reports statistics like cache miss rates.

LooAn [1] is a profiling tool that gives loop- and function-level information. However, since it is a static profiler, its trace files scale up to unmanageable proportions for very large programs.

Shade [14] combines instruction set simulation with trace generation capability. It uses a user-specified trace analyzer to control program execution and the extent of trace generation. The analyzer code is generated dynamically and is cached for reuse.

ALTO [16] develops whole-program data flow analysis and code optimization techniques for link-time program optimization and is targeted at the DEC Alpha architecture.

SpixTools [11] is an instruction profiling toolset intended for the SPARC architecture; it consists of two tools, Spix and Spixstat. Spix generates basic block execution profiles while Spixstat generates statistics on instruction count, branch behavior, opcode usage, etc. Loop information can be easily deduced from the tool's output.

The VTune [13] Performance Analyzer collects, analyzes and displays software performance data from the program level down to a specific function, module or instruction in a developer's source code. VTune runs on Windows and Linux and targets Intel processors.

IDtrace [17] is an instrumentation tool for the Intel architecture on Unix platforms. It produces a variety of trace types, such as profile, memory reference, and full execution traces. Primitive post-processing tools, which read output files, view traces, and compute basic profile data, are included in the IDtrace package.

Cacheprof [18] is an execution-driven memory simulator for the x86 architecture. It annotates each instruction that reads or writes memory and links a cache simulator into the resulting executable. Upon execution, the data references are trapped and sent to the simulator. Besides producing a procedure-level summary, Cacheprof reports the number of memory references and the number of misses for each line of the source code.

In contrast, FLAT is intended to provide loop- and function-level information for a wide variety of platforms. FLATC works on all platforms to which the GNU C Compiler (gcc) has been ported. FLATSIM can produce loop-level statistics for a variety of platforms such as x86, StrongARM, MIPS and SPARC.

3. FLAT: FREQUENT LOOP ANALYSIS TOOLS

Instruction profiling tools provide the information on which useful hardware/software partitioning decisions can be based. The Frequent Loop Analysis Toolset (FLAT) is a profiling toolset written in Python that reports the execution time of a given application at the granularity of both loops and functions. Loop profiles can be obtained in two different ways: the first method is to instrument the compiler to output the frequency of each loop, while the other is to use an instruction set simulator to find the execution count of loops. Both methods have their own advantages and disadvantages: the compilation-based approach is much faster, while the simulation-based approach proves more useful for tuning the various architectural aspects of the program.

During hardware/software partitioning, frequently executed functions often prove to be the favorite candidates for hardware mapping. However, a frequently executed function could contain many infrequently executed loops that contribute to its total execution time. Since loops perform the bulk of the computation, the return on silicon real estate is maximized if a frequently executed loop of the program is chosen instead of the frequent function mentioned above. The output provided by FLAT is useful in deciding whether a loop or a function should be mapped onto hardware. FLAT treats functions as loops that iterate once per call. FLAT comprises two profiling tools: FLATC and FLATSIM.

Figure 1 shows the tool flow for FLATC. FLATC uses gcc to obtain the basic block counts of a program. Information about loop names and function calls is obtained from the disassembled instructions. Once the loops and function calls are identified, the execution percentage of each can be determined from the execution percentages of the basic blocks (a sketch of this step follows Figure 1). Since FLATC uses gcc, it is portable across a variety of platforms and places no restrictions on the kind of code that can be profiled: if the code can be compiled with gcc, it can be profiled with FLATC. Compile-time instrumentation adds roughly 15% more instructions to the binary in order to accomplish profiling.

Figure 1. Tool flow for FLATC: the software source is compiled by gcc into an instrumented executable that emits basic block counts; objdump supplies the disassembled instructions; FLATC combines the two into loop and function statistics.
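As a rough illustration of that post-processing step (not FLATC's actual source), the sketch below pairs an objdump-style disassembly with a map of gcc-derived basic-block counts. The line format, the function names, and the backward-branch heuristic for loop detection are our assumptions:

```python
# Sketch of a FLATC-style post-processing pass.
import re

def find_loops(disasm_lines):
    """Treat a branch to a lower address as a loop back-edge.
    Each line is assumed to look like: ' 8048f2a: ... jne 8048f10 ...'."""
    branch = re.compile(r'^\s*([0-9a-f]+):.*\bj\w+\s+([0-9a-f]+)', re.I)
    loops = []
    for line in disasm_lines:
        m = branch.match(line)
        if m:
            src, dst = int(m.group(1), 16), int(m.group(2), 16)
            if dst <= src:                   # backward branch => loop
                loops.append((dst, src))     # (loop start, loop end)
    return loops

def loop_percentages(loops, block_counts):
    """block_counts: {block_start_addr: (exec_count, instr_count)} from
    the gcc instrumentation run. Returns % of executed instructions
    attributable to each loop's address range."""
    total = sum(c * n for c, n in block_counts.values()) or 1
    return {(s, e): 100.0 * sum(c * n for a, (c, n) in block_counts.items()
                                if s <= a <= e) / total
            for s, e in loops}
```

Treating a backward branch as a back-edge is the usual heuristic for reducible control flow; irreducible loops would need help from the compiler's own loop records.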


Figure 2 shows the tool flow for FLATSIM. FLATSIM uses Virtutech's Simics instruction set simulator to do the instruction profiling. Simics is a full system simulation platform, capable of simulating high-end target systems with sufficient fidelity and speed to boot and run operating systems and commercial workloads. Simics provides a controlled, deterministic, and fully virtualized environment for a variety of hardware and software engineering tasks. Hence, we decided to instrument the Simics modules to get realistic instruction profile estimates. Simics is not an open source simulator; however, the source code for its add-on modules is included with the distribution.

Figure 2. Tool flow for FLATSIM: the executable is disassembled with objdump and the disassembled instructions feed FLATSIM, while the executable itself runs on Simics; FLATSIM combines the two into loop and function statistics.

The functionality of the simulator can be extended by modifying the existing modules or by creating custom modules. One such module supplied with the Simics distribution is the id-splitter module, which handles all cache accesses and redirects them to the instruction or data cache accordingly. FLATSIM relies on getting the instruction profile from a modified version of the id-splitter module. The modification is as follows: a tree structure containing all the loop addresses is introduced into the id-splitter module; during execution, if an instruction belongs to one of the loops, the counter associated with that loop is incremented (the counting logic is sketched below); finally, information about loops and function calls is written to a file. FLATSIM analyzes this file and prints out the loop execution information. Table 1 shows the output of FLAT for the Diffie-Hellman (DH) key exchange application from the NetBench [19] benchmark suite.
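The id-splitter modification itself lives in Simics' C module sources; the Python sketch below mirrors only the counting logic described above, with a sorted address table standing in for the tree. The class name, the fetch callback, and the assumption of non-overlapping loop ranges are ours:

```python
# Sketch of the per-loop instruction counting done in the modified module.
import bisect

class LoopCounter:
    def __init__(self, loop_ranges):
        # loop_ranges: sorted, non-overlapping list of (start, end, name)
        self.starts = [s for s, _, _ in loop_ranges]
        self.ranges = loop_ranges
        self.counts = {name: 0 for _, _, name in loop_ranges}

    def on_instruction_fetch(self, addr):
        """Invoked once per instruction the simulator fetches."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0:
            start, end, name = self.ranges[i]
            if start <= addr <= end:
                self.counts[name] += 1

    def dump(self, path):
        # Written out at the end of the run; FLATSIM reads this file.
        with open(path, "w") as f:
            for name, n in sorted(self.counts.items(), key=lambda kv: -kv[1]):
                f.write(f"{name} {n}\n")
```

The binary search keeps the per-instruction overhead logarithmic in the number of loops, which matters when every fetched instruction passes through the hook.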

FLAT maintains a DAG-like representation to hold the data structures for loops and functions. Every loop and function is associated with a name, and the loops and functions are named in a hierarchical fashion. For example, the loop name NN_AddDigitMult_1 in Table 1 refers to the first loop in the function NN_AddDigitMult, and the first sub-loop of this loop would be named NN_AddDigitMult_1_1. The fields printed in FLAT's loop output are the loop name, the number of loop iterations, the total number of instructions executed in the loop, and the percentage of the execution time spent in the loop. The function statistics consist of the function name, the number of times it was called, its static size, the total number of instructions executed inside the function, and the percentage of time spent in the function.

Table 1. FLAT's output for the DH application

| Loop name | Frequency | Loop size | Total inst. exec. (millions) | % |
|---|---|---|---|---|
| | 1 | 54563 | 4178 | 100.00 |
| | 3192157 | 246 | 598 | 14.31 |
| | 3198863 | 221 | 548 | 13.12 |
| | 1529347 | 237 | 237 | 5.66 |
| | 260926 | 602 | 80 | 1.92 |
| | 691293 | 137 | 64 | 1.53 |
| | 1384533 | 57 | 54 | 1.29 |
| | 232447 | 204 | 39 | 0.93 |
| | 333351 | 256 | 33 | 0.79 |
| | 280283 | 139 | 28 | 0.67 |

| Function name | Frequency | Static size | Total inst. exec. (millions) | % |
|---|---|---|---|---|
| | 5972388 | 297 | 1479 | 35.39 |
| | 2985562 | 365 | 673 | 16.12 |
| | 2986826 | 339 | 627 | 15.01 |
| | 1436259 | 319 | 263 | 6.28 |
| | 240351 | 799 | 151 | 3.61 |
| | 240351 | 1057 | 126 | 3.02 |
| | 1302080 | 122 | 79 | 1.88 |
| | 650132 | 263 | 78 | 1.85 |
| | 212257 | 421 | 47 | 1.11 |
| | 279953 | 213 | 40 | 0.95 |
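The hierarchical names could be generated by a structure along these lines (a sketch; the class layout and the underscore separator are our guesses at FLAT's internal representation, not its actual source):

```python
# Sketch: hierarchical loop/function naming over a parent/child structure.
class LoopNode:
    def __init__(self, name=None, parent=None):
        self.children = []
        self.parent = parent
        if parent is None:
            self.name = name                 # function node
        else:
            parent.children.append(self)
            # child k of a node extends the parent's name with "_k"
            self.name = f"{parent.name}_{len(parent.children)}"

func = LoopNode(name="NN_AddDigitMult")
loop1 = LoopNode(parent=func)   # -> NN_AddDigitMult_1
sub1 = LoopNode(parent=loop1)   # -> NN_AddDigitMult_1_1
print(loop1.name, sub1.name)
```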


Table 2. Cumulative percentage of execution time in the top 10 loops of SPECINT (threshold = 5%)

| Benchmark | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | No. of core loops | Core type | ACLC* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bzip2 | 18 | 37 | 46 | 55 | 61 | 67 | 73 | 77 | 82 | 85 | 6 | Weak | 11.15 |
| Crafty | 25 | 33 | 40 | 46 | 52 | 57 | 62 | 67 | 71 | 75 | 7 | Weak | 8.84 |
| Eon | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 16 | 17 | 18 | 0 | Coreless | 0.00 |
| Gap | 17 | 26 | 34 | 41 | 47 | 51 | 56 | 60 | 63 | 67 | 5 | Weak | 9.30 |
| Gzip | 29 | 43 | 54 | 65 | 73 | 81 | 86 | 90 | 93 | 96 | 6 | Strong | 13.53 |
| MCF | 24 | 47 | 58 | 68 | 76 | 83 | 89 | 94 | 94 | 95 | 7 | Strong | 12.66 |
| Parser | 8 | 15 | 21 | 27 | 33 | 38 | 43 | 47 | 50 | 54 | 6 | Weak | 6.38 |
| Twolf | 21 | 28 | 35 | 41 | 45 | 49 | 52 | 55 | 58 | 60 | 4 | Weak | 10.25 |
| Vortex | 14 | 23 | 29 | 35 | 37 | 40 | 42 | 43 | 45 | 46 | 4 | Weak | 8.71 |
| Vpr | 30 | 42 | 52 | 60 | 66 | 72 | 77 | 81 | 85 | 87 | 7 | Strong | 11.00 |
| Average | 19 | 30 | 38 | 45 | 50 | 55 | 59 | 63 | 66 | 68 | 5 | Weak | 10.03 |

Table 3. Cumulative percentage of execution time in the top 10 loops of MediaBench (threshold = 5%)

| Benchmark | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | No. of core loops | Core type | ACLC* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADPCM | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 1 | Strong | 99.95 |
| G721decode | 47 | 69 | 87 | 92 | 95 | 96 | 97 | 98 | 98 | 97 | 3 | Strong | 28.95 |
| G721encode | 46 | 68 | 85 | 90 | 93 | 94 | 95 | 96 | 97 | 97 | 3 | Strong | 28.36 |
| Jpegdecode | 41 | 67 | 77 | 86 | 89 | 90 | 90 | 90 | 90 | 90 | 4 | Strong | 21.47 |
| Jpegencode | 29 | 42 | 54 | 65 | 75 | 84 | 89 | 92 | 93 | 95 | 6 | Strong | 14.03 |
| Mpegdecode | 76 | 81 | 84 | 87 | 89 | 90 | 92 | 93 | 93 | 94 | 1 | Strong | 76.68 |
| Pegwit | 40 | 71 | 81 | 85 | 89 | 92 | 94 | 97 | 98 | 98 | 3 | Strong | 26.83 |
| Average | 52 | 69 | 80 | 86 | 89 | 92 | 93 | 94 | 95 | 95 | 4 | Strong | 21.42 |

* ACLC – Average Core Loop Contribution

4. BENCHMARKS

We analyze an extensive collection of SPECINT [20] integer applications (mcf, bzip2, vpr, vortex, crafty, parser), MediaBench [21] applications (ADPCM, G721, MPEG, JPEG, Pegwit), cryptographic applications (AES, 3DES, RC4, RC6, IDEA, Blowfish, SEAL and SHA1), and network applications (dh, drr, tl, route and url, all from NetBench [19]). The benchmarks were compiled for the x86 architecture using the GNU C Compiler (gcc version 2.95.3). The execution times of the first 10 loops of the different benchmarks are shown in Tables 2, 3 and 4.

5. CORE IDENTIFICATION

We define the core as the set of all loops whose execution time is higher than a threshold value. If no loop of an application contributes more than the threshold value, we classify the application as coreless. We refer to a core whose contribution is close to 90% as a strong core, and to one whose contribution is close to 50% as a weak core. Tables 2, 3 and 4 show the loop contributions of the first 10 loops of the SPECINT, MediaBench and NetBench/security applications, respectively. We find that the loop contribution decreases successively as we move towards the tenth loop.

We identify the cores across the different benchmarks in the following manner: for each application, all frequent loops that take up more than 5% (a fixed threshold) of the total execution time are considered part of the core. From Tables 3 and 4, it is clear that embedded system applications like MediaBench, NetBench and the cryptographic algorithms have a very high core contribution. Cryptographic applications tend to have much smaller code sizes than the media or network applications and are hence characterized by very strong cores; on average, their cores consist of two loops. The NetBench applications have strong cores, with a core size of three loops for most applications. Media applications have moderately strong cores, while most of the SPECINT applications have weak cores; eon is the exception, being coreless. A sketch of this classification rule follows.
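The sketch below applies the 5% threshold and computes the Average Core Loop Contribution (ACLC, the core's contribution divided by the number of core loops). The exact strong/weak cutoff between the "close to 90%" and "close to 50%" descriptions above is not stated in the text, so the 70% midpoint used here is our assumption, as is the example profile:

```python
# Sketch: core identification from per-loop execution percentages.
def identify_core(loop_percents, threshold=5.0):
    core = [p for p in loop_percents if p > threshold]
    if not core:
        return core, "coreless", 0.0
    contribution = sum(core)
    kind = "strong" if contribution >= 70.0 else "weak"   # assumed cutoff
    aclc = contribution / len(core)   # Average Core Loop Contribution
    return core, kind, aclc

# A hypothetical profile where three loops dominate:
print(identify_core([46, 22, 17, 4, 3, 2]))
# -> ([46, 22, 17], 'strong', 28.33...)
```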


Table 4. Cumulative percentage of execution time in the top 10 loops of the security and NetBench applications (threshold = 5%)

| Benchmark | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | No. of core loops | Core type | ACLC* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AES | 78 | 93 | 97 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | Strong | 46.50 |
| Blowfish | 62 | 87 | 93 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 3 | Strong | 31.10 |
| DES | 46 | 90 | 94 | 98 | 98 | 98 | 98 | 98 | 98 | 98 | 2 | Strong | 45.00 |
| IDEA | 48 | 87 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 3 | Strong | 31.67 |
| RC4 | 95 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 1 | Strong | 95.00 |
| RC6 | 87 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | Strong | 50.00 |
| Seal | 63 | 98 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | Strong | 48.93 |
| SHA1 | 75 | 98 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | Strong | 48.91 |
| CRC | 87 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | Strong | 49.45 |
| TL | 71 | 82 | 90 | 93 | 95 | 97 | 98 | 98 | 99 | 99 | 3 | Strong | 29.90 |
| URL | 77 | 95 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 | Strong | 47.47 |
| DRR | 26 | 48 | 65 | 80 | 85 | 87 | 91 | 93 | 95 | 95 | 4 | Strong | 20.09 |
| Average | 68 | 90 | 95 | 97 | 97 | 98 | 98 | 98 | 99 | 99 | 2 | Strong | 45.35 |

* ACLC – Average Core Loop Contribution

For the unoptimized and the optimized versions of each benchmark, Tables 6, 7 and 8 show the total number of instructions in the benchmark, the number of instructions in its core, and the percentage of instructions contributed by the core. The benchmarks were optimized using the GNU C Compiler (gcc) at its highest optimization level (O3), with loop unrolling enabled.

6. CORE OPTIMIZATION

In general, compiler optimization reduces the overall dynamic instruction count of a program. It also involves a great deal of code movement and code redistribution; hence, after optimization, the distribution of cores in a program is often altered. Table 5 shows the percentage contribution of the top four loops of the MCF application from the SPECINT suite; one can observe that optimization results in extensive code movement. For the unoptimized version, the second loop of one function has the highest contribution. Due to compiler optimizations such as function inlining, an additional loop is introduced into that function, and its third loop becomes the most frequent loop after optimization. The contribution of one loop increases from 8.28% to 15.91% while that of another decreases. Overall, the contribution of the top four loops to the execution time increases from 57% to 75%. Compiler optimizations thus have a strong impact not only on the size of the computation core but also on its composition and on the distribution of the frequently executed loops.

Table 5. Effect of compiler optimization on the loop distribution of the MCF benchmark

| Loop rank | Unoptimized code: % exec. | Optimized code: % exec. |
|---|---|---|
| 1 | 23.44 | 32.07 |
| 2 | 18.07 | 15.91 |
| 3 | 8.28 | 13.56 |
| 4 | 7.54 | 13.22 |
| Total | 57.33 | 74.76 |

Graph 1. Percentage of execution time spent in the first four loops of the security and NetBench applications (dh, aes, crc, des, drr, nat, rc4, rc6, tl, idea, route, seal, sha1, url, blowfish, ipchains).


Graph 2. Percentage of execution time spent in the first four loops of SPECINT.

Graph 3. Percentage of execution time spent in the first four loops of MediaBench.

If the core is more conducive to optimization than the rest of the program, then most of the optimization will be centered on the core. However, the extent to which the core and the rest of the program are affected by optimization largely depends on the nature of the application. In order to quantify the impact of optimization on the core, we define a new metric, the Core to Program Reduction Ratio (CPRR): the ratio of the decrease in core size to the decrease in program size. CPRR is computed as follows:

    CPRR = (decrease in core instructions) / (decrease in program instructions)

One can visualize any program as consisting of a core and a non-core portion. In the definition of CPRR, the numerator denotes the decrease in core instructions between the unoptimized and the optimized program; the denominator is the decrease in total dynamic instruction count due to optimization. If the decrease in dynamic instruction count is taken as proportional to the extent of optimization, then a CPRR of 50% means that the core and the rest of the program were optimized equally. A CPRR value higher than 50% implies that the core is more amenable to optimization than the rest of the program, while a CPRR of less than 50% implies that the impact of optimization on the non-core portion of the program is higher.

As illustrated in Tables 6, 7 and 8, programs like the security and NetBench applications have high CPRR values, while applications like the SPECINT benchmarks tend to exhibit lower CPRR. We computed the CPRR for SPECINT in order to show the effect of compiler optimization on the kernels of large applications. Of course, the embedded applications (MediaBench, NetBench and the security applications) consist of small kernels, while the SPECINT benchmarks are complete programs; this explains why the computation core of the embedded applications is more explicit and dominant.

7. PARTITIONING FOR CSoC

In this section we evaluate the potential speedups that can be achieved by mapping the optimized cores to hardware. Configurable system-on-chip (CSoC) platforms like the Xilinx Virtex II Pro [7], the Altera Excalibur [6] and the Triscend A7 [8] are examples of architectures that are ideal targets for migrating the core loops to hardware. The obvious objective of migrating code to hardware is the speedup that can be achieved; one should also note, however, that not all loops are conducive to hardware mapping. Figure 3 shows a target architecture that could benefit from mapping loops to hardware.


Table 6. CPRR for the security and NetBench applications

| Benchmark | Program unopt. (10^6 inst.) | Core unopt. (10^6 inst.) | % core in unopt. program | Program opt. (10^6 inst.) | Core opt. (10^6 inst.) | % core in opt. program | Decrease in program size (%) | CPRR (%) |
|---|---|---|---|---|---|---|---|---|
| AES | 1023 | 1020 | 99.75 | 577 | 577 | 99.99 | 43.52 | 99.43 |
| Blowfish | 1770 | 1680 | 94.94 | 572 | 572 | 100.00 | 67.70 | 92.53 |
| CRC | 56 | 55 | 99.81 | 19 | 19 | 100.00 | 66.79 | 99.71 |
| DES | 4390 | 4290 | 97.61 | 2390 | 2390 | 100.00 | 45.58 | 94.81 |
| IDEA | 2710 | 2580 | 95.19 | 1530 | 1530 | 100.00 | 43.59 | 88.97 |
| IPChains | 26.02 | 24.09 | 92.57 | 11 | 10 | 91.63 | 59.40 | 93.21 |
| RC4 | 740 | 740 | 99.99 | 328 | 328 | 100.00 | 55.63 | 99.99 |
| RC6 | 802 | 780 | 97.20 | 323 | 323 | 99.99 | 59.72 | 95.40 |
| Seal | 168 | 168 | 99.94 | 61 | 61 | 100.00 | 63.73 | 99.90 |
| TL | 6 | 5 | 93.12 | 2 | 2 | 90.85 | 64.26 | 94.39 |
| URL | 1580 | 1580 | 99.50 | 1430 | 1420 | 99.65 | 9.96 | 98.09 |
| DRR | 18 | 15 | 80.37 | 16 | 13 | 80.33 | 10.92 | 80.68 |
| Average | 1072 | 1042 | 96 | 642 | 641 | 96.58 | 49.23 | 94.76 |

Table 7. CPRR for SPECINT

| Benchmark | Program unopt. (10^9 inst.) | Core unopt. (10^9 inst.) | % core in unopt. program | Program opt. (10^9 inst.) | Core opt. (10^9 inst.) | % core in opt. program | Decrease in program size (%) | CPRR (%) |
|---|---|---|---|---|---|---|---|---|
| Gzip | 10800 | 6650 | 61.5 | 6860 | 4310 | 62.87 | 36.93 | 58.21 |
| Vpr | 12400 | 7310 | 58.96 | 6120 | 3310 | 54.16 | 50.7 | 63.62 |
| MCF | 5470 | 3480 | 63.72 | 2010 | 1820 | 90.74 | 63.22 | 48 |
| Crafty | 14900 | 6530 | 43.82 | 9460 | 4430 | 46.79 | 36.46 | 38.64 |
| Parser | 24600 | 6670 | 27.08 | 12100 | 4870 | 40.13 | 50.72 | 14.39 |
| Bzip2 | 13900 | 7600 | 54.48 | 7300 | 3650 | 50.01 | 47.65 | 59.4 |
| Twolf | 23900 | 9800 | 40.99 | 16100 | 5710 | 35.49 | 32.73 | 52.29 |
| Average | 15153 | 6863 | 50.03 | 8564 | 4014 | 54.31 | 45.49 | 47.79 |

Table 8. CPRR for MediaBench

| Benchmark | Program unopt. (10^6 inst.) | Core unopt. (10^6 inst.) | % core in unopt. program | Program opt. (10^6 inst.) | Core opt. (10^6 inst.) | % core in opt. program | Decrease in program size (%) | CPRR (%) |
|---|---|---|---|---|---|---|---|---|
| ADPCM | 56 | 56 | 100 | 32 | 32 | 100 | 43.27 | 99.99 |
| JpegDecode | 1930 | 1660 | 85.87 | 1260 | 1070 | 84.50 | 34.66 | 88.45 |
| JpegEncode | 103 | 67 | 64.75 | 77 | 61 | 79.43 | 26.04 | 23.04 |
| MpegEncode | 1280 | 1100 | 86.51 | 917 | 752 | 81.95 | 28.15 | 98.17 |
| G721decode | 1800 | 1650 | 91.61 | 1010 | 938 | 93.24 | 44.15 | 89.54 |
| G721encode | 1920 | 1720 | 89.53 | 1070 | 947 | 88.58 | 44.33 | 90.72 |
| Average | 1181 | 1042 | 86.37 | 728 | 633 | 87.95 | 36.76 | 81.65 |
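The CPRR column of these tables follows directly from the instruction counts. A minimal check in Python, using the Blowfish row of Table 6 (counts in millions, as printed there):

```python
# CPRR: decrease in core instructions over decrease in program instructions.
def cprr(prog_unopt, core_unopt, prog_opt, core_opt):
    return 100.0 * (core_unopt - core_opt) / (prog_unopt - prog_opt)

print(cprr(1770, 1680, 572, 572))   # ~92.5, vs. the printed 92.53
```

The small difference against the printed value comes from the rounding of the published counts.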


Table 9. Cumulative execution percentage and speedup for the top five loops of SPECINT (%k: cumulative execution time of the top k loops; Sk: overall speedup when those k loops are mapped to hardware)

| Benchmark | %1 | %2 | %3 | %4 | %5 | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Bzip2 | 19.55 | 30.19 | 45.94 | 50.86 | 59.16 | 0.77 | 0.83 | 0.96 | 1.01 | 1.10 |
| Crafty | 24.23 | 32.68 | 40.09 | 46.79 | 52.21 | 0.79 | 0.85 | 0.91 | 0.97 | 1.02 |
| Gzip | 31.36 | 45.89 | 59.46 | 67.96 | 77.46 | 0.84 | 0.96 | 1.10 | 1.22 | 1.37 |
| MCF | 32.47 | 64.58 | 83.72 | 90.7 | 97.35 | 0.85 | 1.17 | 1.50 | 1.68 | 1.89 |
| Parser | 14.26 | 23.50 | 32.19 | 40.13 | 47.35 | 0.74 | 0.79 | 0.85 | 0.91 | 0.97 |
| Twolf | 24.37 | 29.74 | 37.44 | 42.90 | 48.68 | 0.79 | 0.83 | 0.89 | 0.93 | 0.99 |
| Vortex | 14.80 | 26.63 | 29.19 | 37.22 | 41.14 | 0.74 | 0.81 | 0.83 | 0.89 | 0.92 |
| Vpr | 29.2 | 32.79 | 44.62 | 54.16 | 61.08 | 0.83 | 0.84 | 0.95 | 1.04 | 1.12 |
| Average | 23.78 | 35.75 | 46.58 | 53.84 | 60.55 | 0.79 | 0.89 | 1.00 | 1.08 | 1.17 |

Table 10. Cumulative execution percentage and speedup for the top five loops of MediaBench

| Benchmark | %1 | %2 | %3 | %4 | %5 | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|---|---|---|---|---|
| ADPCM | 99.94 | 99.99 | 99.99 | 99.99 | 99.99 | 1.99 | 1.99 | 1.99 | 1.99 | 1.99 |
| G721decode | 46.42 | 72.15 | 88.23 | 93.24 | 96.31 | 0.96 | 1.28 | 1.64 | 1.76 | 1.86 |
| G721encode | 38.62 | 67.24 | 80.11 | 88.57 | 92.73 | 0.90 | 1.21 | 1.43 | 1.62 | 1.74 |
| Jpegdecode | 50.66 | 66.31 | 78.07 | 84.67 | 88.15 | 1.00 | 1.19 | 1.39 | 1.53 | 1.61 |
| Jpegencode | 50.08 | 60.75 | 71.35 | 79.66 | 86.17 | 1.00 | 1.12 | 1.27 | 1.42 | 1.56 |
| Mpegdecode | 60.16 | 68.13 | 74.34 | 78.75 | 82.22 | 1.11 | 1.22 | 1.32 | 1.40 | 1.47 |
| Pegwit | 45.79 | 80.22 | 82.91 | 85.29 | 87.03 | 0.96 | 1.43 | 1.49 | 1.54 | 1.58 |
| Average | 55.96 | 73.55 | 82.29 | 89.17 | 90.37 | 1.06 | 1.30 | 1.47 | 1.59 | 1.67 |

Table 11. Cumulative execution percentage and speedup for the top five loops of the security and NetBench applications

| Benchmark | %1 | %2 | %3 | %4 | %5 | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|---|---|---|---|---|
| AES | 91.88 | 100.00 | 100.00 | 100.00 | 100.00 | 1.71 | 1.99 | 1.99 | 1.99 | 1.99 |
| Blowfish | 96.5 | 99.85 | 100.00 | 100.00 | 100.00 | 1.86 | 1.99 | 1.99 | 1.99 | 1.99 |
| DES | 61.64 | 91.41 | 97.04 | 99.95 | 99.99 | 1.13 | 1.70 | 1.88 | 1.99 | 1.99 |
| IDEA | 64.45 | 97.09 | 100.00 | 100.00 | 100.00 | 1.17 | 1.88 | 1.99 | 1.99 | 1.99 |
| RC4 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 1.99 | 1.99 | 1.99 | 1.99 | 1.99 |
| RC6 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 1.99 | 1.99 | 1.99 | 1.99 | 1.99 |
| SEAL | 55.32 | 98.03 | 99.99 | 100.00 | 100.00 | 1.05 | 1.92 | 1.99 | 1.99 | 1.99 |
| SHA1 | 91.43 | 99.99 | 100.00 | 100.00 | 100.00 | 1.70 | 1.99 | 1.99 | 1.99 | 1.99 |
| CRC | 99.06 | 99.62 | 100.00 | 100.00 | 100.00 | 1.96 | 1.98 | 1.99 | 1.99 | 1.99 |
| TL | 41.50 | 67.12 | 85.79 | 90.85 | 94.38 | 0.92 | 1.20 | 1.55 | 1.69 | 1.79 |
| URL | 76.36 | 94.89 | 99.20 | 99.65 | 99.91 | 1.35 | 1.81 | 1.96 | 1.98 | 1.99 |
| DRR | 26.08 | 47.84 | 64.11 | 80.33 | 84.67 | 0.81 | 0.98 | 1.16 | 1.43 | 1.53 |
| Average | 75.35 | 91.32 | 95.51 | 97.57 | 98.25 | 1.47 | 1.79 | 1.87 | 1.92 | 1.94 |


Figure 3. Target architecture for a CSoC: a CPU (with L1 cache) and a configurable logic fabric on the same chip, sharing on-chip memory that is backed by external memory.

One might observe that no additional delay is required to fetch data for the configurable logic fabric. In this section, we describe an analysis of the achievable speedup based on the results obtained from the profiling tools. Note that in this analysis we do not assume any overlap in computation between the CPU and the FPGA on the CSoC; this is a pessimistic but fair assumption. The execution time is then

    CSoC_time = CPU_time + FPGA_time
    CPU_time  = SW_only_time - SW_loop_time
    FPGA_time = SW_loop_time / HS

and the speedup S is

    S = SW_only_time / (CPU_time + FPGA_time)
      = SW_only_time / (SW_only_time - SW_loop_time + SW_loop_time / HS)
      = 1 / (1 - (1 - 1/HS) * (SW_loop_time / SW_only_time))
      = 1 / (1 - 0.47 * %loop_execution)        for HS = 1.9

where SW_only_time is the time for software-only execution and SW_loop_time is the time taken on the CPU by the loops that will be mapped to hardware. HS (hardware speedup) denotes the speedup expected on a loop by mapping it to hardware. From past results [1] we have computed this speedup to be 19 in terms of cycle count. However, our experience shows that the clock frequency obtainable on an FPGA is about 10 times lower than the CPU frequency; for the remainder of this analysis we therefore assume HS = 1.9, which gives the coefficient (1 - 1/1.9) ≈ 0.47 above.

The overall speedup is the ratio of SW_only_time to the CSoC time. From Tables 9, 10 and 11, it is clear that mapping the first five most frequent loops gives average speedups of 1.17, 1.67 and 1.94 for the SPECINT, MediaBench and NetBench/security applications, respectively. It should be noted that not all loops are suitable for mapping onto hardware.
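The model is easy to experiment with; below is a direct transcription in Python. The loop fractions fed in are illustrative, and the per-benchmark speedups in Tables 9-11 come from the authors' fuller analysis, so they need not match this idealized bound:

```python
# The speedup model above, assuming no CPU/FPGA overlap and HS = 1.9.
def csoc_speedup(loop_fraction, hs=1.9):
    """loop_fraction: fraction of software-only time spent in the
    loops moved to hardware (0.0 to 1.0)."""
    return 1.0 / (1.0 - (1.0 - 1.0 / hs) * loop_fraction)

print(csoc_speedup(0.90))   # ~1.74 for a 90% core
print(csoc_speedup(1.00))   # 1.9, the upper bound set by HS
```

The upper bound equal to HS makes the trade-off explicit: a strong core is a prerequisite for approaching the hardware speedup, which is why the embedded benchmarks benefit far more than SPECINT.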

8. CONCLUSION

We propose a loop analysis toolset to support hardware/software partitioning. We provide a Simics-based loop analyzer that profiles an application and allows fine-tuning of its various architectural aspects, and an instrumentation-based loop analyzer that profiles an application without significant slowdown compared to the actual execution. For a wide range of benchmarks, we identify the cores of the programs and then study the effect of compiler optimization on the distribution of those cores. We find that the cores are optimized more than the rest of the program. On average, the contribution of the first 2-4 loops of the embedded applications is roughly 90%, while the first 6 loops of the SPECINT benchmark suite contribute almost 55% of the execution time. We observe that mapping the five most frequent loops to hardware is beneficial for the MediaBench, network and cryptographic applications.

ACKNOWLEDGEMENT

This work was supported in part by NSF Award ITR 0083080.

REFERENCES

[1] J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, "Improving Software Performance with Configurable Logic", Kluwer Journal on Design Automation of Embedded Systems, Volume 7, Issue 4, pages 325-329, November 2002.

[2] K. Kucukcakar, "An ASIP Design Methodology for Embedded Systems", International Workshop on Hardware/Software Codesign, pages 17-21, 1999.

[3] C. Young, The Harvard Atom Like Tool Manual (HALT), http://citeseer.nj.nec.com/121315.html

[4] T. Ball and J. R. Larus, "Optimally Profiling and Tracing Programs", ACM Symposium on Principles of Programming Languages, pages 59-70, 1992.

[5] J. Dean, J. Hicks, C. Waldspurger, W. Weihl and G. Chrysos, "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors", in Proceedings of the International Symposium on Microarchitecture, December 1997.

[6] Altera Corporation, Altera Excalibur, http://www.altera.com/products/devices/excalibur/exc-index.html

[7] Xilinx Inc., Xilinx Virtex II Pro Handbook, http://www.xilinx.com/publications/products/v2pro/handbook

[8] Triscend Corp., Triscend A-7 Chip, http://www.triscend.com/products/a7.htm


[9] Simics Simulator, http://www.simics.net

[10] V. Bala, E. Duesterwald and S. Banerjia, "Dynamo: A Transparent Dynamic Optimization System", in Proceedings of the ACM SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 1-12, Vancouver, Canada, June 2000.

[11] B. Cmelik, "SpixTools Introduction and User's Manual", Technical Report SMLI TR-93-6, Sun Microsystems Laboratories, Mountain View, CA, February 1993.

[12] A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools", in ACM Conference on Programming Language Design and Implementation, pages 196-205, Orlando, FL, June 1994.

[13] Intel VTune Performance Analyzer, http://www.intel.com/software/products/vtune/

[14] R. F. Cmelik and D. Keppel, "Shade: A Fast Instruction-Set Simulator for Execution Profiling", Tech. Rep. SMLI 93-12 and UWCSE 93-06-06, Sun Microsystems Laboratories and the University of Washington, 1993.

[15] WARTS, http://www.cs.wisc.edu/~larus/warts.html

[16] R. Muth, S. Debray and S. Watterson, "alto: A Link-Time Optimizer for the Compaq Alpha", Technical Report 98-14, Dept. of Computer Science, The University of Arizona, December 1998.

[17] J. Pierce and T. Mudge, "IDtrace - A Tracing Tool for i486 Simulation", in Proceedings of the International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS), pages 419-420, 1994.

[18] Cacheprof, http://www.cacheprof.org

[19] G. Memik, W. H. Mangione-Smith and W. Hu, "NetBench: A Benchmarking Suite for Network Processors", in Proceedings of the International Conference on Computer-Aided Design (ICCAD), pages 39-42, San Jose, CA, November 2001.

[20] SPECINT 2000, http://www.specbench.org/cpu2000/

[21] C. Lee, M. Potkonjak and W. H. Mangione-Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems", International Symposium on Microarchitecture (MICRO 30), pages 292-303, Research Triangle Park, NC, December 1997.
