A Hardware/Software Infrastructure for Performance Monitoring on Leon3 Multicore Platforms

authors omitted for blind review

Abstract—Monitoring applications at run-time and evaluating the recorded statistical data of the underlying microarchitecture is a key requirement of many hardware architects and system designers as well as high-performance software developers. To fulfill this requirement, most modern CPUs for High Performance Computing (HPC) are equipped with Performance Monitoring Units (PMU) comprising a set of hardware counters, which can be configured to monitor a rich set of events. Unfortunately, embedded and reconfigurable systems mostly lack this feature. Towards rapid exploration of High Performance Embedded Computing in the near future, we believe that PMU support for these systems is necessary. In this paper, we propose a PMU infrastructure that supports monitoring of up to seven concurrent events. The PMU infrastructure is implemented on an FPGA and integrated into a Leon3 platform. We also show the integration of our PMU infrastructure with perf_event, the standard PMU architecture of the Linux kernel.

I. INTRODUCTION

For many decades, computer architects have been using simulators to expose and analyze performance metrics by running workloads on the simulated architecture. The results collected from simulation may be inaccurate in some cases, either because the workloads run on top of an operating system or because the simulators do not consider all the relevant micro-architectural aspects. More recently, so-called full system simulators such as gem5 [1] are being used by many researchers and system architects in order to accurately gather full statistical data at the system level. In case the investigated architecture is already available as an implementation, performance data can also be collected at runtime with high accuracy and often with higher speed than simulation [2]. To that end, many modern high-performance processors feature a performance monitoring unit (PMU) that allows collecting performance data. A PMU is essentially a set of counters and registers inside the processor that can be programmed to capture the events happening during application execution. At the end of a measurement, performance monitoring software reads out the PMU counter values and aggregates the results.

Performance monitoring is not only useful in high-performance computing but also in embedded and reconfigurable computing. Especially the exploration of reconfigurable computing has led many researchers to propose ideas for system optimization and adaptation at runtime, such as self-tuning caches [3]. Runtime adaptation techniques demand a hardware/software infrastructure capable of system performance measurements in real-time. While PMUs have recently been added to some well-known embedded processors such as ARM [4], Blackfin [5], and SuperH [6], as well as to the ARM cores in the Xilinx Zynq [7], a performance monitoring feature is mostly lacking for soft cores embedded into FPGAs.

The main contribution of this paper is the presentation of a hardware/software infrastructure for performance monitoring for the Leon3 platform, which is a widely-used open source soft core based on the Sparc-v8 architecture [8]. In the remainder of the paper, we first discuss the background of PMUs in Section II and then describe the hardware and software implementation of our PMU for Leon3 multicores in Section III. In Section IV we present experimental results for MiBench workloads and show the overhead incurred by performance monitoring. Finally, Section V concludes the paper.

II. BACKGROUND – PERFORMANCE MONITORING UNITS

Performance analysis based on performance monitoring units (PMUs) requires both a hardware and a software infrastructure. The hardware infrastructure for recording statistical data at the micro-architectural level during program execution basically consists of sets of control registers and counters. The control registers can be programmed to select the specific events that should be captured, which are then counted. The configuration written to the control registers also determines whether and which interrupts are generated on a counter overflow, whether data is collected only for user mode or also for kernel mode execution, and generally enables or disables data collection. While most modern processors include some form of PMU [9], the number of measurable events and hardware counters varies between processors [4], [10]. Events commonly available for monitoring include the number of CPU cycles, miss rates for the different levels of instruction, data or unified caches and for TLBs, or IPC values.

The software infrastructure for a PMU needs to set the configuration for monitoring, start and stop data collection, and finally read out and aggregate counter values into performance measures. There exist several tools and system interfaces supporting the collection of statistical data from PMUs. Among them, PAPI [11] and the perf tool [12] are commonly used in high-performance computing. These tools rely on a system interface running in kernel mode to access the PMU hardware counters. Running the system interface in kernel mode is advantageous since the hardware counters can easily be saved and restored during a context switch, allowing for per-thread performance monitoring. In the Linux operating system, two patches for performance monitoring are widely used: perfctr [13] and perfmon2 [14]. More recently, the perf_event interface [15] and the perf tool have been included into the main Linux kernel source code. The perf tool works tightly with the perf_event interface and makes performance monitoring straightforward for users since there is no need to compile and add patches.
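To make these steps concrete, the following minimal user-space sketch shows how a single counter can be configured, started, and read through the Linux perf_event interface. The sketch is illustrative and not part of the presented system; error handling is omitted for brevity.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* glibc provides no wrapper for perf_event_open, so the raw syscall is used. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
            struct perf_event_attr attr;
            long long count;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* count retired instructions */
            attr.disabled = 1;                        /* configure first, enable later */
            attr.exclude_kernel = 1;                  /* user-mode events only */

            fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            /* ... code under measurement ... */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
            read(fd, &count, sizeof(count));          /* read out the counter value */
            printf("instructions: %lld\n", count);
            close(fd);
            return 0;
    }

The remainder of this paper describes the hardware and kernel-mode support that backs such calls on the Leon3 platform.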
III. PMU DESIGN AND INTEGRATION

Leon3 [8] is a 32-bit open source processor platform from Gaisler Research implementing the Sparc-v8 architecture [16]. The platform supports multiple cores with a shared memory and is popular among researchers due to the rich set of supported tools and operating systems and its extensibility. In this section, we describe the architecture of our performance measurement units and how they integrate into the Leon3 platform as well as into the standard Linux performance measurement infrastructure.

A. The Architecture
We have tailored PMUs for Leon3 multi-core architectures. To properly handle the performance counters for workloads migrating between the processors, the PMUs are replicated for each processor core, and the performance counters are stored to and loaded from the context of an application by the OS kernel during each context switch. Fig. 1 shows the PMU placement in our current implementation, which is tailored for processor monitoring. The PMUs are connected to the signal pulses driven out from the Integer Units (IU), from the L1 instruction and data cache controllers, and from the I-TLB as well as the D-TLB modules of the MMU. The standard open-source Leon3 architecture does not include L2 caches or floating-point units. However, our PMU architecture can easily be extended to monitor and aggregate events from such sources or from custom hardware modules and reconfiguration controllers in reconfigurable multiprocessor-on-chip systems.
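As an illustration of the per-core replication and per-thread accounting described above, a context-switch hook could save and restore the counters along the following lines. This is a hypothetical sketch, not code from our implementation: leon3pmu_read_counter is the accessor shown later in Fig. 4, while leon3pmu_write_counter, NUM_COUNTERS, and the task context layout are assumptions.

    #define NUM_COUNTERS 7  /* assumption: seven event counters per core */

    struct task_pmu_context {
            u32 saved[NUM_COUNTERS];
    };

    /* Hypothetical sketch: migrate counter state between tasks. */
    static void pmu_switch_context(struct task_pmu_context *prev,
                                   struct task_pmu_context *next)
    {
            int i;

            for (i = 0; i < NUM_COUNTERS; i++) {
                    prev->saved[i] = leon3pmu_read_counter(i);  /* save outgoing task */
                    leon3pmu_write_counter(i, next->saved[i]);  /* restore incoming task */
            }
    }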

Fig. 1: PMU modules are located close to the monitored event sources. The interrupt signals for overflowed events are handled by the MP IRQ controller module.

Fig. 2 illustrates the hardware implementation of the overall PMU system. Each event counter subsystem consists of an event source multiplexer, a control logic block, a counter with corresponding overflow logic, and a common logic block for overflow interrupt generation. The heart of a performance counter subsystem is its control block. The control block takes input from a global and a local control register as well as the interrupt acknowledge from the interrupt controller. The functionality of the global register follows the ARM Cortex-A9 PMU architecture [4] and allows us to reset and enable all event counters by a single register access. This feature is of use for the Linux perf_event performance monitoring infrastructure.

Through the local control register, a performance counter subsystem can be cleared (clr) and enabled (en), and counting can be started even if the measured processor core is entering super user mode (su). Furthermore, the local control register determines whether an overflow interrupt should be triggered and which event to measure. Currently defined event sources are the CPU clock cycle count, the number of executed instructions, the number of instruction and data cache read accesses, the number of instruction and data cache read misses, and the number of data cache write accesses as well as misses. The signal input of the first counter subsystem is hardwired to monitor the execution time, providing an accurate measurement basis to which the other measurement counters can be normalized.

The interrupt acknowledge input signal of the control block depends on the state of the associated counter. If the counter reaches its overflow threshold and interrupt generation is enabled for this measurement subsystem, an interrupt is generated and a status bit in the PMU's global interrupt overflow register is set. The overflow comparator signals the overflow to the control logic which, in turn, clears the counter and sets it inactive. The activated software interrupt handler checks which event counter has triggered an interrupt and updates the corresponding perf event counter variables. Afterwards, the interrupt handler releases the event counter by pulsing an interrupt acknowledgment signal through the MP IRQ controller to the control logic blocks.

The presence of interrupt logic is the reason for selecting 32-bit wide event counter registers. While using wider registers, for instance 64 bits, would make a counter overflow less likely, there are cases where it is required to generate an interrupt after counting some specified number of events. Additionally, even counting events at 100 MHz causes an overflow interrupt request only roughly every 43 seconds (2^32 events at 100 MHz). Since the time overhead for the interrupt handler needed to serve a PMU interrupt is negligible, we avoided wider event counting registers for the sake of a compact architecture and a smaller memory map. However, the counter widths and the memory map can easily be adapted in our design if wider counters are required.

B. PMU Registers - Address Mapping and Access

Table I shows the address mapping for the overall PMU system. Instead of defining new instructions for reading and writing PMU registers, the extended Address Space Identifier (ASI) lda/sta instructions of the Sparc architecture are used [8]. These instructions are available only in system mode. ASI = 0x02 is reserved for system control registers and has an unused address range from 0xC0 to 0xFC, which is used to map the PMU registers. The following sample code demonstrates how these registers can be accessed:

    u32 val;
    asm volatile ("lda [%1] %2, %0" : "=r"(val) : "r"(0xC0), "i"(0x02));

In this sample code, the ASI value 0x02 encoded in the lda instruction makes the IU load the current value of the global control register at address 0xC0 and store it in the variable val.

TABLE I: Memory map of the PMU system. The maximum number of event counters supported is limited to 7 due to the current limitations of our system design. The maximum number of monitored events can be up to 256. Currently, cycles, instructions, L1:I / L1:D accesses and read misses as well as L1:D write accesses and misses, and ITLB as well as DTLB misses are supported.

Registers accessed via ASI = 0x02:

    Register (32 bits)                                    Address
    Global control (RW)                                   0xC0
      - Bit[0]:    enable all event counters (en)
      - Bit[1]:    reset/clear all event counters (rst)
      - Bit[2]:    reset/clear cycle counter (cyc.rst)
      - Bit[7..3]: number of event counters supported
      - Bit[31]:   reset/clear IRQ pending
    Overflow status (RW)                                  0xC4
      - Bit[0]:    overflow of the cycle counter
      - Bit[n..1]: overflow of event counter n..1
      - Bit[31]:   indication for IRQ pending
    Cycle counter (RW)                                    0xC8
      - Bit[31..0]: value of the monitored cycle counter
    Cycle counter control (RW)                            0xCC
      - Bit[7..0]: reserved
      - Bit[8]:    enable the counter (en)
      - Bit[9]:    reset/clear the counter (clr)
      - Bit[10]:   counting kernel/user mode (su)
      - Bit[11]:   interrupt enable (irq_en)
    The i-th event counter (RW)                           0xD0 + 8*i
      - Bit[31..0]: value of the monitored event counter
    The i-th event counter control (RW)                   0xD4 + 8*i
      - Bit[7..0]: event identifier (event_id)
      - Bit[8]:    enable the counter (en)
      - Bit[9]:    reset/clear the counter (clr)
      - Bit[10]:   counting kernel/user mode (su)
      - Bit[11]:   interrupt enable (irq_en)
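As a usage sketch of the address map in Table I, the control register of event counter i could be programmed as follows. This is an illustrative assumption derived from Table I, not code from the presented software stack.

    #define PMU_CTRL_EN     (1u << 8)   /* enable the counter (en) */
    #define PMU_CTRL_CLR    (1u << 9)   /* reset/clear the counter (clr) */
    #define PMU_CTRL_SU     (1u << 10)  /* counting kernel/user mode (su) */
    #define PMU_CTRL_IRQEN  (1u << 11)  /* interrupt enable (irq_en) */

    /* Sketch: clear event counter i and start counting event 'event_id'
     * with overflow interrupts enabled. Per Table I, the control register
     * of event counter i resides at 0xD4 + 8*i within ASI 0x02. */
    static inline void pmu_start_event(int i, u32 event_id)
    {
            u32 ctrl = (event_id & 0xFF) | PMU_CTRL_CLR | PMU_CTRL_EN | PMU_CTRL_IRQEN;

            asm volatile ("sta %0, [%1] %2"
                          : /* no outputs */
                          : "r"(ctrl), "r"(0xD4 + 8 * i), "i"(0x02)
                          : "memory");
    }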

Fig. 2: The design of the PMU module per core, comprising the fixed cycle counter and the configurable event counters. The PMU monitors the event signals and manages the counters, depending on a set of control registers. The signals for the overflow interrupt are handled by the MP IRQ Controller.

C. Handling Overflow Interrupt

There are two basic ways of introducing interrupt sources to Leon3 processors. First, peripheral devices that are connected to the AMBA bus can use the AMBA interrupt lines. The AMBA bus interrupt controller then has to prioritize and relay the interrupt requests to the Leon3 MP IRQ controller. A PMU using this method of interrupt generation would need to implement an AMBA bus slave controller and accept the temporal overhead of the AMBA bus interrupt controller. Additionally, interrupts from other peripheral devices may have an impact on the measurement precision.

The second option, and the one we have chosen in this work, for interfacing to the interrupt logic of the Leon3 processor is to directly connect the interrupt sources to the internal logic of the MP IRQ controller. To that end, all PMU interrupt request lines are aggregated by an OR gate and sourced into the external interrupt (EIRQ) handling circuitry of the MP IRQ controller. This is shown in Fig. 3. This method also has the advantage that the number of AMBA bus devices is not increased.
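To connect this hardware path back to the software side described in Section III-A, the kernel handler behind the EIRQ might scan the overflow status register and acknowledge each overflowed counter roughly as follows. This is an illustrative sketch based on the register map in Table I; pmu_update_perf_count and pmu_ack_counter are hypothetical helpers.

    /* Illustrative sketch (assumption): dispatch PMU overflow interrupts. */
    static irqreturn_t pmu_overflow_dispatch(int irq, void *dev)
    {
            u32 status;
            int i;

            /* Read the overflow status register at 0xC4 (Table I). */
            asm volatile ("lda [%1] %2, %0"
                          : "=r"(status)
                          : "r"(0xC4), "i"(0x02));

            for (i = 0; i < 8; i++) {   /* bit 0: cycle counter, bits 1..7: events */
                    if (status & (1u << i)) {
                            pmu_update_perf_count(i); /* hypothetical: update perf state */
                            pmu_ack_counter(i);       /* hypothetical: pulse interrupt ack */
                    }
            }
            return IRQ_HANDLED;
    }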

Fig. 3: The additional logic gates inside the MP IRQ Controller of the Leon3 platform support handling PMU interrupts for overflowed event counters.

D. System Integration - The Software Stack

Fig. 4 presents the integration of our PMUs into the standard Linux perf_event infrastructure. Instead of extending the standard Sparc-64 PMU code, we have adapted the perf_event.c code from the ARM PMU implementation due to similarities in the interrupt handling mechanism.

From the user space perspective, the perf_event interface and the perf tool work together as follows: The perf tool invokes an application for measurement. Depending on the input parameters, the perf tool passes the event sources to monitor to the kernel's perf_event measurement infrastructure. The infrastructure in turn configures the PMU infrastructure and starts profiling. When the application finishes its execution, the perf tool reads out the event counters and aggregates the final results via the perf_event interface. An example of the output of the perf tool is given in Fig. 5.

IV. EXPERIMENTS AND RESULTS

This section reports on our experiments and results. The system configuration used for the Leon3 platform is shown in Table III. We have synthesized the Leon3 platform and programmed it onto a Xilinx ML605 Virtex-6 board. The root file system is located on a CF card. Since our goal was to focus on collecting and analyzing statistical data for embedded system workloads, we have chosen a subset of MiBench [17], a free benchmark suite, for experimentation.

A. Hardware resource usage

Before presenting the benchmark results, Table II sums up the hardware overhead for the PMU subsystem when implemented for a single-, dual- and quad-core Leon3 platform. For the single-core variant, the overheads in flip-flops and look-up tables amount to 2.0 and 4.4 per cent, respectively. When doubling the core number of a Leon system, the integer units get duplicated, while busses and peripherals are typically not replicated. Therefore, the size of a Leon system grows linearly with the number of cores, but the hardware effort for the busses and peripherals stays almost unchanged. This explains why, when doubling and quadrupling the core number and the according PM units, the overhead for the PM units compared to the total size of a Leon3 platform increases from 2.0 over 3.2 to 4.4 per cent for the number of flip-flops and from 4.4 over 6.0 to 9.2 per cent for the number of look-up tables.

TABLE II: Hardware resource usage for implementing a Leon3 PMU and the total overhead in % compared to a PMU-less Leon3 system.

                           1 core          2 cores         4 cores
                           FF      LUT     FF      LUT     FF      LUT
    PMU                    303     886     606     1641    1212    3956
    MP IRQ without PMU     101     205     173     354     285     661
    MP IRQ with PMU        102     210     175     368     289     694
    Leon3 with PMU         15371   20956   19879   29413   28848   47142
    Increase [%]           2.0     4.4     3.2     6.0     4.4     9.2

B. Measurement Output

Table IV displays all our measurements for the selected benchmarks. We have monitored eight events: CPU cycles, instructions, L1:I load misses, L1:I loads, L1:D loads, L1:D load misses, L1:D stores, and L1:D store misses. The presented results cover events captured during user mode, but exclude kernel mode execution. Since in our prototypical implementation the maximum number of event counters is seven, we had to invoke the perf tool twice in order to avoid multiplexing events, which can lead to inaccurate results during measurement. The first invocation of the perf tool counts the number of CPU cycles and instructions, the second gathers the remaining events (example invocations are sketched after Fig. 4). We have repeated each measurement 10 times and computed the coefficient of variation (CV = σ/µ), i.e., the standard deviation divided by the mean.

The chosen benchmark programs are deterministic in the sense that for a given input data set they execute exactly the same instructions. The reason that the collected values for the number of instructions are nevertheless not fully deterministic is the perf tool, which also runs in user space. Once the perf tool receives a signal indicating that the application has stopped, it has to spend time disabling the counters in the event control registers. This leads to a certain deviation in consecutive measurements, which is, however, well below 1%. The variations in the number of cycles, L1:I read, L1:D read, and L1:D store miss events can be explained by initially varying states of the cache lines. Thus, for more accurate results, the presented PMU infrastructure should be extended by a cache flush and a preheat procedure to unify the starting conditions for all benchmarks.

The following excerpts of our perf_event.c port, shown in Fig. 4 together with the software stack (USER: application, perf tool; Linux kernel: perf_event, arch//kernel/perf_event.c; HW: Leon3 PMU), implement the counter access and interrupt handling:

    static inline u32 leon3pmu_read_counter(int idx)
    {
            u32 value;
            /* Counter values start at 0xC8 and are 8 bytes apart (Table I). */
            asm volatile ("lda [%1] %2, %0"
                          : "=r"(value)
                          : "r"(idx * 8 + 0xC8), "i"(0x02));
            return value;
    }

    static inline void leon3pmu_ctrl_write(enum leon3_counters counter, unsigned long val)
    {
            /* Control registers start at 0xCC and are 8 bytes apart (Table I). */
            asm volatile ("sta %0, [%1] %2\n"
                          : /* no outputs */
                          : "r"(val), "r"(counter * 8 + 0xCC), "i"(0x02)
                          : "memory");
    }

    static irqreturn_t leon3pmu_handle_irq(int irq_num, void *dev)
    {
            u32 val;
            ...
            /* Read the overflow status register. */
            asm volatile ("lda [%1] %2, %0"
                          : "=r"(val)
                          : "r"(0xC4), "i"(0x02));
            ...
    }

Fig. 4: System integration with perf_event.
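Concretely, the two perf invocations used to gather the events for Table IV could look as follows; the I-cache event names are assumed here to follow the generic perf naming scheme:

    # perf stat -e cycles,instructions ./app
    # perf stat -e L1-icache-loads,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./app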

    # perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./queens -c 14

    14 queens on a 14x14 board...
    ...there are 365596 solutions

    Performance counter stats for './queens -c 14':

    8295258591  cycles
    5309555048  instructions             # 0.640 IPC
     965961200  L1-dcache-loads
         50932  L1-dcache-load-misses
     191890081  L1-dcache-stores
      27365021  L1-dcache-store-misses

    115.170000000 seconds time elapsed

Fig. 5: Example: launch and output of the perf tool.

TABLE III: Leon3 platform and system configuration.

    Clock Frequency      75 MHz
    Integer Unit         Yes
    Floating Point       Software
    Instruction Cache    2-way associative, 8 KB, 32 bytes/line, LRU
    ITLB                 8 entries
    Data Cache           2-way associative, 4 KB, 16 bytes/line, LRU
    DTLB                 8 entries
    MMU                  Yes
    MP IRQ Controller    External IRQ (EIRQ) = 14
    PMU                  New feature
    Linux Kernel         2.6.36.4, patch from Gaisler
    Toolchain            Pre-built Linux toolchain from Gaisler

V. CONCLUSION AND DISCUSSION

In this paper, we present a performance measurement infrastructure for the single- and multicore Leon3 processing platform. The infrastructure integrates seamlessly into the standard Linux performance measurement architecture perf_event and allows for a comfortable and accurate analysis of microarchitectural measurements using the standard Linux profiling tools. From the reconfigurable system perspective, the presented performance measurement infrastructure also supports monitoring events generated by the reconfigurable platform, such as partial reconfiguration times.

TABLE IV: Statistical data collected for a subset of the MiBench workloads. The coefficient of variation (CV = σ/µ) is computed by running the perf tool 10 times on the same application.

                  1 core                            2 cores                           4 cores
    Benchmark     Cycles[%] Inst.[%] Time[s] IPC    Cycles[%] Inst.[%] Time[s] IPC    Cycles[%] Inst.[%] Time[s] IPC
    basicmath     0.006     0.000    23.005  0.596  0.052     0.000    23.176  0.594  0.040     0.000    23.434  0.594
    bitcounts     0.003     0.000    0.821   0.758  0.017     0.000    0.888   0.757  0.056     0.000    0.893   0.755
    qsort         0.017     0.001    0.978   0.373  0.081     0.000    1.019   0.373  0.173     0.000    1.048   0.371
    jpeg          0.063     0.002    1.085   0.567  0.082     0.001    1.147   0.563  0.169     0.000    1.161   0.564
    dijkstra      0.052     0.001    1.767   0.486  0.007     0.000    1.81    0.486  0.123     0.056    1.819   0.485
    patricia      0.007     0.001    5.421   0.207  0.040     0.001    5.479   0.206  0.145     0.000    5.617   0.206
    stringsearch  0.023     0.004    0.077   0.255  0.136     0.000    0.147   0.251  0.109     0.000    0.152   0.251
    FFT           0.007     0.000    27.131  0.599  0.008     0.000    27.232  0.598  0.042     0.003    25.419  0.597
    CV            0.022     0.001    -       -      0.053     0.000    -       -      0.107     0.007    -       -

                  1 core                            2 cores                           4 cores
                  L1:I ld.  L1:D ld.  L1:D st.      L1:I ld.  L1:D ld.  L1:D st.      L1:I ld.  L1:D ld.  L1:D st.
    Benchmark     miss[%]   miss[%]   miss[%]       miss[%]   miss[%]   miss[%]       miss[%]   miss[%]   miss[%]
    basicmath     0.040     0.247     0.346         0.078     0.546     1.474         0.140     0.463     1.232
    bitcounts     0.136     0.176     0.382         0.102     0.102     0.318         0.256     0.106     0.472
    qsort         0.027     0.018     0.160         0.041     0.046     0.154         0.139     0.030     0.092
    jpeg          0.125     0.024     0.076         0.537     0.045     0.066         2.213     0.079     0.055
    dijkstra      0.052     0.025     0.104         0.052     0.018     0.024         0.601     0.187     0.539
    patricia      0.006     0.150     0.184         0.004     0.150     0.124         0.012     0.251     0.232
    stringsearch  0.108     0.010     0.009         0.189     0.040     0.117         0.170     0.053     0.039
    FFT           0.057     0.166     0.393         0.062     0.298     2.558         0.298     0.659     2.243
    CV            0.069     0.102     0.207         0.133     0.156     0.604         0.479     0.229     0.613

REFERENCES

[1] N. L. Binkert, B. M. Beckmann, G. Black, S. K. Reinhardt, A. G. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[2] D. Zaparanuks, M. Jovic, and M. Hauswirth, "Accuracy of performance counter measurements," in Performance Analysis of Systems and Software (ISPASS), 2009, pp. 23–32.
[3] A. Gordon-Ross and F. Vahid, "A self-tuning configurable cache," in Design Automation Conference (DAC), 44th ACM/IEEE, 2007, pp. 234–237.
[4] ARM, ARM Cortex-A9 Technical Reference Manual. [Online]. Available: http://infocenter.arm.com
[5] Analog Devices, ADSP-BF535 Blackfin Processor Hardware Reference. [Online]. Available: http://www.analog.com/static/imported-files/processor_manuals/ADSP-BF535_hwr_rev3.3.pdf
[6] Renesas, SH7750 documentation. [Online]. Available: http://documentation.renesas.com/doc/products/tool/rej10b0110_sh7750.pdf
[7] Xilinx Inc., Zynq-7000 All Programmable SoC Technical Reference Manual, UG585 (v1.7), Feb. 2014.
[8] Aeroflex Gaisler, GRLIB IP Library. [Online]. Available: http://www.gaisler.com/products/grlib/grlib.pdf
[9] B. Sprunt, "The basics of performance-monitoring hardware," IEEE Micro, vol. 22, no. 4, pp. 64–71, 2002.
[10] SUN Microsystems, OpenSPARC T2 Core Microarchitecture Specification. [Online]. Available: http://www.oracle.com/technetwork/systems/opensparc
[11] J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra, H. You, and M. Zhou, "Experiences and lessons learned with a portable interface to hardware performance counters," in PADTAD Workshop, IPDPS, 2003, p. 289.
[12] The Linux perf tool. [Online]. Available: http://www.perf.wiki.kernel.org/index.php
[13] M. Pettersson, perfctr. [Online]. Available: http://user.it.uu.se/~mikpe/linux/perfctr
[14] S. Eranian, "Perfmon2: A flexible performance monitoring interface for Linux." [Online]. Available: https://www.kernel.org/doc/ols/2006/ols2006v1-pages-269-288.pdf
[15] V. M. Weaver, "Linux perf_event features and overhead," in FastPath Workshop, 2013.
[16] SUN Microsystems, SPARC V8 Manual. [Online]. Available: http://www.gaisler.com/doc/sparcv8.pdf
[17] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in IEEE 4th Annual Workshop on Workload Characterization, 2001, pp. 3–14.