A Hardware/Software Infrastructure for Performance Monitoring on Leon3 Multicore Platforms

authors omitted for blind review

Abstract—Monitoring applications at run-time and evaluating the recorded statistical data of the underlying microarchitecture is a key requirement of many hardware architects and system designers as well as high-performance software developers. To fulfill this requirement, most modern CPUs for High Performance Computing (HPC) are equipped with Performance Monitoring Units (PMU) comprising a set of hardware counters, which can be configured to monitor a rich set of events. Unfortunately, embedded and reconfigurable systems mostly lack this feature. Towards rapid exploration of High Performance Embedded Computing in the near future, we believe that PMU support for these systems is necessary. In this paper, we propose a PMU infrastructure that supports monitoring of up to seven concurrent events. The PMU infrastructure is implemented on an FPGA and integrated into a Leon3 platform. We also show the integration of our PMU infrastructure with perf_event, the standard PMU architecture of the Linux kernel.

I. INTRODUCTION

For many decades, computer architects have been using simulators to expose and analyze performance metrics by running workloads on the simulated architecture. The results collected from simulation may be inaccurate in some cases, either because the workloads run on top of an operating system or because the simulators do not consider all the relevant micro-architectural aspects. More recently, so-called full system simulators such as gem5 [1] are being used by many researchers and system architects in order to accurately gather full statistical data at the system level. In case the investigated architecture is already available as an implementation, performance data can also be collected at runtime with high accuracy and often with higher speed than simulation [2]. To that end, many modern high-performance processors feature a performance monitoring unit (PMU) that allows collecting performance data. A PMU is essentially a set of counters and registers inside the processor that can be programmed to capture the events happening during application execution. At the end of a measurement, performance monitoring software reads out the PMU counter values and aggregates the results.

Performance monitoring is not only useful in high-performance computing but also in embedded and reconfigurable computing. Especially the exploration of reconfigurable computing has led many researchers to propose ideas for system optimization and adaptation at runtime, such as self-tuning caches [3]. Runtime adaptation techniques demand a hardware/software infrastructure capable of system performance measurements in real-time. While PMUs have recently been added to some well-known embedded processors such as ARM [4], Blackfin [5], and SuperH [6], as well as to the ARM cores in the Xilinx Zynq [7], a performance monitoring feature is mostly lacking for soft cores embedded into FPGAs.

The main contribution of this paper is the presentation of a hardware/software infrastructure for performance monitoring for the Leon3 platform, which is a widely-used open source soft core based on the Sparc-v8 architecture [8]. In the remainder of the paper, we first discuss the background of PMUs in Section II and then describe the hardware and software implementation of our PMU for Leon3 multicores in Section III. In Section IV we present experimental results for MiBench workloads and show the overhead incurred by performance monitoring. Finally, Section V concludes the paper.

II. BACKGROUND – PERFORMANCE MONITORING UNITS

Performance analysis based on performance monitoring units (PMUs) requires both a hardware and a software infrastructure. The hardware infrastructure for recording statistical data at the micro-architectural level during program execution basically consists of sets of control registers and counters. The control registers can be programmed to select the specific events that should be captured, which are then counted. The configuration written to the control registers also determines whether and which interrupts are generated on a counter overflow, whether data is collected only for user mode or also for kernel mode execution, and generally enables or disables data collection. While most modern processors include some form of PMU [9], the number of measurable events and hardware counters varies between processors [4], [10]. Events commonly available for monitoring include the number of CPU cycles, miss rates for the different levels of instruction, data or unified caches and for TLBs, or IPC values.

The software infrastructure for a PMU needs to set the configuration for monitoring, start and stop data collection, and finally read out and aggregate counter values into performance measures. There exist several tools and system interfaces supporting the collection of statistical data from PMUs. Among them, PAPI [11] and the perf tool [12] are commonly used in high-performance computing. These tools rely on a system interface running in kernel mode to access the PMU hardware counters. Running the system interface in kernel mode is advantageous since the hardware counters can easily be saved and restored during a context switch, allowing for per-thread performance monitoring. In the Linux operating system, two patches for performance monitoring are widely used: perfctr [13] and perfmon2 [14]. More recently, the perf_event interface [15] and the perf tool have been included into the main Linux kernel source code. The perf tool works tightly with the perf_event interface and makes performance monitoring straightforward for users since there is no need to compile and add patches.
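To make these steps concrete, the following minimal user-space sketch shows how a single counter can be configured, started, and read through the Linux perf_event interface. The sketch is illustrative and not part of the presented system; error handling is omitted for brevity.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* glibc provides no wrapper for perf_event_open, so the raw syscall is used. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
            struct perf_event_attr attr;
            long long count;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* count retired instructions */
            attr.disabled = 1;                        /* configure first, enable later */
            attr.exclude_kernel = 1;                  /* user-mode events only */

            fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            /* ... code under measurement ... */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
            read(fd, &count, sizeof(count));          /* read out the counter value */
            printf("instructions: %lld\n", count);
            close(fd);
            return 0;
    }

The remainder of this paper describes the hardware and kernel-mode support that backs such calls on the Leon3 platform.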
III. PMU DESIGN AND INTEGRATION

Leon3 [8] is a 32-bit open source processor platform from Gaisler Research implementing the Sparc-v8 architecture [16]. The platform supports multiple cores with a shared memory and is popular among researchers due to the rich set of supported tools and operating systems and its extensibility. In this section, we describe the architecture of our performance measurement units and how they integrate into the Leon3 platform as well as into the standard Linux performance measurement infrastructure.

A. The Architecture
We have tailored PMUs for Leon3 multi-core architectures. To properly handle the performance counters for workloads migrating between the processors, the PMUs are replicated for each processor core, and the performance counters are stored to and loaded from the context of an application by the OS kernel during each context switch. Fig. 1 shows the PMU placement in our current implementation, which is tailored for processor monitoring. The PMUs are connected to the signal pulses driven out from the Integer Units (IU), from the L1 instruction and data cache controllers, and from the I-TLB as well as the D-TLB modules of the MMU. The standard open-source Leon3 architecture does not include L2 caches or floating-point units. However, our PMU architecture can easily be extended to monitor and aggregate events from such sources or from custom hardware modules and reconfiguration controllers in reconfigurable multiprocessor-on-chip systems.
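As an illustration of the per-core replication and per-thread accounting described above, a context-switch hook could save and restore the counters along the following lines. This is a hypothetical sketch, not code from our implementation: leon3pmu_read_counter is the accessor shown later in Fig. 4, while leon3pmu_write_counter, NUM_COUNTERS, and the task context layout are assumptions.

    #define NUM_COUNTERS 7  /* assumption: seven event counters per core */

    struct task_pmu_context {
            u32 saved[NUM_COUNTERS];
    };

    /* Hypothetical sketch: migrate counter state between tasks. */
    static void pmu_switch_context(struct task_pmu_context *prev,
                                   struct task_pmu_context *next)
    {
            int i;

            for (i = 0; i < NUM_COUNTERS; i++) {
                    prev->saved[i] = leon3pmu_read_counter(i);  /* save outgoing task */
                    leon3pmu_write_counter(i, next->saved[i]);  /* restore incoming task */
            }
    }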

Fig. 1: PMU modules are located close to the monitored event sources. The interrupt signals for overflowed events are handled by the MP IRQ controller module.

Fig. 2 illustrates the hardware implementation of the overall PMU system. Each event counter subsystem consists of an event source multiplexer, a control logic block, a counter with corresponding overflow logic, and a common logic block for overflow interrupt generation. The heart of a performance counter subsystem is its control block. The control block takes input from a global and a local control register as well as the interrupt acknowledge from the interrupt controller. The functionality of the global register follows the ARM Cortex-A9 PMU architecture [4] and allows us to reset and enable all event counters by a single register access. This feature is of use for the Linux perf_event performance monitoring infrastructure.

Through the local control register, a performance counter subsystem can be cleared (clr) and enabled (en), and counting can be started even if the measured processor core is entering super user mode (su). Furthermore, the local control register determines whether an overflow interrupt should be triggered and which event to measure. Currently defined event sources are the CPU clock cycle count, the number of executed instructions, the number of instruction and data cache read accesses, the number of instruction and data cache read misses, and the number of data cache write accesses as well as misses. The signal input of the first counter subsystem is hardwired to monitor the execution time, providing an accurate measurement basis to which the other measurement counters can be normalized.

The interrupt acknowledge input signal of the control block depends on the state of the associated counter. If the counter reaches its overflow threshold and interrupt generation is enabled for this measurement subsystem, an interrupt is generated and a status bit in the PMU's global interrupt overflow register is set. The overflow comparator signals the overflow to the control logic which, in turn, clears the counter and sets it inactive. The activated software interrupt handler checks which event counter has triggered an interrupt and updates the corresponding perf event counter variables. Afterwards, the interrupt handler releases the event counter by pulsing an interrupt acknowledgment signal through the MP IRQ controller to the control logic blocks.

The presence of interrupt logic is the reason for selecting 32-bit wide event counter registers. While using wider registers, for instance 64 bits, would make a counter overflow less likely, there are cases where it is required to generate an interrupt after counting some specified number of events. Additionally, even counting events at 100 MHz causes an overflow interrupt request only roughly every 43 seconds (2^32 events at 100 MHz). Since the time overhead for the interrupt handler needed to serve a PMU interrupt is negligible, we avoided wider event counting registers for the sake of a compact architecture and a smaller memory map. However, the counter widths and the memory map can easily be adapted in our design if wider counters are required.

B. PMU Registers - Address Mapping and Access

Table I shows the address mapping for the overall PMU system. Instead of defining new instructions for reading and writing PMU registers, the extended Address Space Identifier (ASI) lda/sta instructions of the Sparc architecture are used [8]. These instructions are available only in system mode. ASI = 0x02 is reserved for system control registers and has an unused address range from 0xC0 to 0xFC, which is used to map the PMU registers. The following sample code demonstrates how these registers can be accessed:

    u32 val;
    asm volatile ("lda [%1] %2, %0" : "=r"(val) : "r"(0xC0), "i"(0x02));

In this sample code, the ASI value 0x02 encoded in the lda instruction makes the IU load the current value of the global control register at address 0xC0 and store it in the variable val.

TABLE I: Memory map of the PMU system. The maximum number of event counters supported is limited to 7 due to the current limitations of our system design. The maximum number of monitored events can be up to 256. Currently, cycles, instructions, L1:I / L1:D accesses and read misses as well as L1:D write accesses and misses, and ITLB as well as DTLB misses are supported.

Registers accessed via ASI = 0x02:

    Register (32 bits)                                    Address
    Global control (RW)                                   0xC0
      - Bit[0]:    enable all event counters (en)
      - Bit[1]:    reset/clear all event counters (rst)
      - Bit[2]:    reset/clear cycle counter (cyc.rst)
      - Bit[7..3]: number of event counters supported
      - Bit[31]:   reset/clear IRQ pending
    Overflow status (RW)                                  0xC4
      - Bit[0]:    overflow of the cycle counter
      - Bit[n..1]: overflow of event counter n..1
      - Bit[31]:   indication for IRQ pending
    Cycle counter (RW)                                    0xC8
      - Bit[31..0]: value of the monitored cycle counter
    Cycle counter control (RW)                            0xCC
      - Bit[7..0]: reserved
      - Bit[8]:    enable the counter (en)
      - Bit[9]:    reset/clear the counter (clr)
      - Bit[10]:   counting kernel/user mode (su)
      - Bit[11]:   interrupt enable (irq_en)
    The i-th event counter (RW)                           0xD0 + 8*i
      - Bit[31..0]: value of the monitored event counter
    The i-th event counter control (RW)                   0xD4 + 8*i
      - Bit[7..0]: event identifier (event_id)
      - Bit[8]:    enable the counter (en)
      - Bit[9]:    reset/clear the counter (clr)
      - Bit[10]:   counting kernel/user mode (su)
      - Bit[11]:   interrupt enable (irq_en)
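As a usage sketch of the address map in Table I, the control register of event counter i could be programmed as follows. This is an illustrative assumption derived from Table I, not code from the presented software stack.

    #define PMU_CTRL_EN     (1u << 8)   /* enable the counter (en) */
    #define PMU_CTRL_CLR    (1u << 9)   /* reset/clear the counter (clr) */
    #define PMU_CTRL_SU     (1u << 10)  /* counting kernel/user mode (su) */
    #define PMU_CTRL_IRQEN  (1u << 11)  /* interrupt enable (irq_en) */

    /* Sketch: clear event counter i and start counting event 'event_id'
     * with overflow interrupts enabled. Per Table I, the control register
     * of event counter i resides at 0xD4 + 8*i within ASI 0x02. */
    static inline void pmu_start_event(int i, u32 event_id)
    {
            u32 ctrl = (event_id & 0xFF) | PMU_CTRL_CLR | PMU_CTRL_EN | PMU_CTRL_IRQEN;

            asm volatile ("sta %0, [%1] %2"
                          : /* no outputs */
                          : "r"(ctrl), "r"(0xD4 + 8 * i), "i"(0x02)
                          : "memory");
    }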

Fig. 2: The design of the PMU module per core, comprising the fixed cycle counter and the configurable event counters. The PMU monitors the event signals and manages the counters, depending on a set of control registers. The signals for the overflow interrupt are handled by the MP IRQ Controller.

C. Handling Overflow Interrupt

There are two basic ways of introducing interrupt sources to Leon3 processors. First, peripheral devices that are connected to the AMBA bus can use the AMBA interrupt lines. The AMBA bus interrupt controller then has to prioritize and relay the interrupt requests to the Leon3 MP IRQ controller. A PMU using this method of interrupt generation would need to implement an AMBA bus slave controller and accept the temporal overhead of the AMBA bus interrupt controller. Additionally, interrupts from other peripheral devices may have an impact on the measurement precision.

The second option, and the one we have chosen in this work, for interfacing to the interrupt logic of the Leon3 processor is to directly connect the interrupt sources to the internal logic of the MP IRQ controller. To that end, all PMU interrupt request lines are aggregated by an OR gate and sourced into the external interrupt (EIRQ) handling circuitry of the MP IRQ controller. This is shown in Fig. 3. This method also has the advantage that the number of AMBA bus devices is not increased.
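To connect this hardware path back to the software side described in Section III-A, the kernel handler behind the EIRQ might scan the overflow status register and acknowledge each overflowed counter roughly as follows. This is an illustrative sketch based on the register map in Table I; pmu_update_perf_count and pmu_ack_counter are hypothetical helpers.

    /* Illustrative sketch (assumption): dispatch PMU overflow interrupts. */
    static irqreturn_t pmu_overflow_dispatch(int irq, void *dev)
    {
            u32 status;
            int i;

            /* Read the overflow status register at 0xC4 (Table I). */
            asm volatile ("lda [%1] %2, %0"
                          : "=r"(status)
                          : "r"(0xC4), "i"(0x02));

            for (i = 0; i < 8; i++) {   /* bit 0: cycle counter, bits 1..7: events */
                    if (status & (1u << i)) {
                            pmu_update_perf_count(i); /* hypothetical: update perf state */
                            pmu_ack_counter(i);       /* hypothetical: pulse interrupt ack */
                    }
            }
            return IRQ_HANDLED;
    }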

Fig. 3: The additional logic gates inside the MP IRQ Controller of the Leon3 platform support handling PMU interrupts for overflowed event counters.

D. System Integration - The Software Stack

Fig. 4 presents the integration of our PMUs into the standard Linux perf_event infrastructure. Instead of extending the standard Sparc-64 PMU code, we have adapted the perf_event.c code from the ARM PMU implementation due to similarities in the interrupt handling mechanism.

From the user space perspective, the perf_event interface and the perf tool work together as follows: The perf tool invokes an application for measurement. Depending on the input parameters, the perf tool passes the event sources to monitor to the kernel's perf_event measurement infrastructure. The infrastructure in turn configures the PMU infrastructure and starts profiling. When the application finishes its execution, the perf tool reads out the event counters and aggregates the final results via the perf_event interface. An example of the output of the perf tool is given in Fig. 5.

IV. EXPERIMENTS AND RESULTS

This section reports on our experiments and results. The system configuration used for the Leon3 platform is shown in Table III. We have synthesized the Leon3 platform and programmed it onto a Xilinx ML605 Virtex-6 board. The root file system is located on a CF card. Since our goal was to focus on collecting and analyzing statistical data for embedded system workloads, we have chosen a subset of MiBench [17], a free benchmark suite, for experimentation.

A. Hardware resource usage

Before presenting the benchmark results, Table II sums up the hardware overhead for the PMU subsystem when implemented for a single-, dual- and quad-core Leon3 platform. For the single-core variant, the overheads in flip-flops and look-up tables amount to 2.0 and 4.4 per cent, respectively. When doubling the core number of a Leon system, the integer units get duplicated, while busses and peripherals are typically not replicated. Therefore, the size of a Leon system grows linearly with the number of cores, but the hardware effort for the busses and peripherals stays almost unchanged. This explains why, when doubling and quadrupling the core number and the according PM units, the overhead for the PM units compared to the total size of a Leon3 platform increases from 2.0 over 3.2 to 4.4 per cent for the number of flip-flops and from 4.4 over 6.0 to 9.2 per cent for the number of look-up tables.

TABLE II: Hardware resource usage for implementing a Leon3 PMU and the total overhead in % compared to a PMU-less Leon3 system.

                           1 core          2 cores         4 cores
                           FF      LUT     FF      LUT     FF      LUT
    PMU                    303     886     606     1641    1212    3956
    MP IRQ without PMU     101     205     173     354     285     661
    MP IRQ with PMU        102     210     175     368     289     694
    Leon3 with PMU         15371   20956   19879   29413   28848   47142
    Increase [%]           2.0     4.4     3.2     6.0     4.4     9.2

B. Measurement Output

Table IV displays all our measurements for the selected benchmarks. We have monitored eight events: CPU cycles, instructions, L1:I load misses, L1:I loads, L1:D loads, L1:D load misses, L1:D stores, and L1:D store misses. The presented results cover events captured during user mode, but exclude kernel mode execution. Since in our prototypical implementation the maximum number of event counters is seven, we had to invoke the perf tool twice in order to avoid multiplexing events, which can lead to inaccurate results during measurement. The first invocation of the perf tool counts the number of CPU cycles and instructions, the second gathers the remaining events (example invocations are sketched after Fig. 4). We have repeated each measurement 10 times and computed the coefficient of variation (CV = σ/µ), i.e., the standard deviation divided by the mean.

The chosen benchmark programs are deterministic in the sense that for a given input data set they execute exactly the same instructions. The reason that the collected values for the number of instructions are nevertheless not fully deterministic is the perf tool, which also runs in user space. Once the perf tool receives a signal indicating that the application has stopped, it has to spend time disabling the counters in the event control registers. This leads to a certain deviation in consecutive measurements, which is, however, well below 1%. The variations in the number of cycles, L1:I read, L1:D read, and L1:D store miss events can be explained by initially varying states of the cache lines. Thus, for more accurate results, the presented PMU infrastructure should be extended by a cache flush and a preheat procedure to unify the starting conditions for all benchmarks.

The following excerpts of our perf_event.c port, shown in Fig. 4 together with the software stack (USER: application, perf tool; Linux kernel: perf_event, arch//kernel/perf_event.c; HW: Leon3 PMU), implement the counter access and interrupt handling:

    static inline u32 leon3pmu_read_counter(int idx)
    {
            u32 value;
            /* Counter values start at 0xC8 and are 8 bytes apart (Table I). */
            asm volatile ("lda [%1] %2, %0"
                          : "=r"(value)
                          : "r"(idx * 8 + 0xC8), "i"(0x02));
            return value;
    }

    static inline void leon3pmu_ctrl_write(enum leon3_counters counter, unsigned long val)
    {
            /* Control registers start at 0xCC and are 8 bytes apart (Table I). */
            asm volatile ("sta %0, [%1] %2\n"
                          : /* no outputs */
                          : "r"(val), "r"(counter * 8 + 0xCC), "i"(0x02)
                          : "memory");
    }

    static irqreturn_t leon3pmu_handle_irq(int irq_num, void *dev)
    {
            u32 val;
            ...
            /* Read the overflow status register. */
            asm volatile ("lda [%1] %2, %0"
                          : "=r"(val)
                          : "r"(0xC4), "i"(0x02));
            ...
    }

Fig. 4: System integration with perf_event.
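Concretely, the two perf invocations used to gather the events for Table IV could look as follows; the I-cache event names are assumed here to follow the generic perf naming scheme:

    # perf stat -e cycles,instructions ./app
    # perf stat -e L1-icache-loads,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./app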

    # perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./queens -c 14

    14 queens on a 14x14 board...
    ...there are 365596 solutions

    Performance counter stats for './queens -c 14':

    8295258591  cycles
    5309555048  instructions             # 0.640 IPC
     965961200  L1-dcache-loads
         50932  L1-dcache-load-misses
     191890081  L1-dcache-stores
      27365021  L1-dcache-store-misses

    115.170000000 seconds time elapsed

Fig. 5: Example: launch and output of the perf tool.

TABLE III: Leon3 platform and system configuration.

    Clock Frequency      75 MHz
    Integer Unit         Yes
    Floating Point       Software
    Instruction Cache    2-way associative, 8 KB, 32 bytes/line, LRU
    ITLB                 8 entries
    Data Cache           2-way associative, 4 KB, 16 bytes/line, LRU
    DTLB                 8 entries
    MMU                  Yes
    MP IRQ Controller    External IRQ (EIRQ) = 14
    PMU                  New feature
    Linux Kernel         2.6.36.4, patch from Gaisler
    Toolchain            Pre-built Linux toolchain from Gaisler

V. CONCLUSION AND DISCUSSION

In this paper, we present a performance measurement infrastructure for the single- and multicore Leon3 processing platform. The infrastructure integrates seamlessly into the standard Linux performance measurement architecture perf_event and allows for a comfortable and accurate analysis of microarchitectural measurements using the standard Linux profiling tools. From the reconfigurable system perspective, the presented performance measurement infrastructure also supports monitoring events generated by the reconfigurable platform, such as partial reconfiguration times.

TABLE IV: Statistical data collected for a subset of the MiBench workloads. The coefficient of variation (CV = σ/µ) is computed by running the perf tool 10 times on the same application.

                  1 core                            2 cores                           4 cores
    Benchmark     Cycles[%] Inst.[%] Time[s] IPC    Cycles[%] Inst.[%] Time[s] IPC    Cycles[%] Inst.[%] Time[s] IPC
    basicmath     0.006     0.000    23.005  0.596  0.052     0.000    23.176  0.594  0.040     0.000    23.434  0.594
    bitcounts     0.003     0.000    0.821   0.758  0.017     0.000    0.888   0.757  0.056     0.000    0.893   0.755
    qsort         0.017     0.001    0.978   0.373  0.081     0.000    1.019   0.373  0.173     0.000    1.048   0.371
    jpeg          0.063     0.002    1.085   0.567  0.082     0.001    1.147   0.563  0.169     0.000    1.161   0.564
    dijkstra      0.052     0.001    1.767   0.486  0.007     0.000    1.81    0.486  0.123     0.056    1.819   0.485
    patricia      0.007     0.001    5.421   0.207  0.040     0.001    5.479   0.206  0.145     0.000    5.617   0.206
    stringsearch  0.023     0.004    0.077   0.255  0.136     0.000    0.147   0.251  0.109     0.000    0.152   0.251
    FFT           0.007     0.000    27.131  0.599  0.008     0.000    27.232  0.598  0.042     0.003    25.419  0.597
    CV            0.022     0.001    -       -      0.053     0.000    -       -      0.107     0.007    -       -

                  1 core                            2 cores                           4 cores
                  L1:I ld.  L1:D ld.  L1:D st.      L1:I ld.  L1:D ld.  L1:D st.      L1:I ld.  L1:D ld.  L1:D st.
    Benchmark     miss[%]   miss[%]   miss[%]       miss[%]   miss[%]   miss[%]       miss[%]   miss[%]   miss[%]
    basicmath     0.040     0.247     0.346         0.078     0.546     1.474         0.140     0.463     1.232
    bitcounts     0.136     0.176     0.382         0.102     0.102     0.318         0.256     0.106     0.472
    qsort         0.027     0.018     0.160         0.041     0.046     0.154         0.139     0.030     0.092
    jpeg          0.125     0.024     0.076         0.537     0.045     0.066         2.213     0.079     0.055
    dijkstra      0.052     0.025     0.104         0.052     0.018     0.024         0.601     0.187     0.539
    patricia      0.006     0.150     0.184         0.004     0.150     0.124         0.012     0.251     0.232
    stringsearch  0.108     0.010     0.009         0.189     0.040     0.117         0.170     0.053     0.039
    FFT           0.057     0.166     0.393         0.062     0.298     2.558         0.298     0.659     2.243
    CV            0.069     0.102     0.207         0.133     0.156     0.604         0.479     0.229     0.613

REFERENCES

[1] N. L. Binkert, B. M. Beckmann, G. Black, S. K. Reinhardt, A. G. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[2] D. Zaparanuks, M. Jovic, and M. Hauswirth, "Accuracy of performance counter measurements," in Performance Analysis of Systems and Software (ISPASS), 2009, pp. 23–32.
[3] A. Gordon-Ross and F. Vahid, "A self-tuning configurable cache," in Design Automation Conference (DAC), 44th ACM/IEEE, 2007, pp. 234–237.
[4] ARM, ARM Cortex-A9 Technical Reference Manual. [Online]. Available: http://infocenter.arm.com
[5] Analog Devices, ADSP-BF535 Blackfin Processor Hardware Reference. [Online]. Available: http://www.analog.com/static/imported-files/processor_manuals/ADSP-BF535_hwr_rev3.3.pdf
[6] Renesas, SH7750 documentation. [Online]. Available: http://documentation.renesas.com/doc/products/tool/rej10b0110_sh7750.pdf
[7] Xilinx Inc., Zynq-7000 All Programmable SoC Technical Reference Manual, UG585 (v1.7), Feb. 2014.
[8] Aeroflex Gaisler, GRLIB IP Library. [Online]. Available: http://www.gaisler.com/products/grlib/grlib.pdf
[9] B. Sprunt, "The basics of performance-monitoring hardware," IEEE Micro, vol. 22, no. 4, pp. 64–71, 2002.
[10] SUN Microsystems, OpenSPARC T2 Core Microarchitecture Specification. [Online]. Available: http://www.oracle.com/technetwork/systems/opensparc
[11] J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra, H. You, and M. Zhou, "Experiences and lessons learned with a portable interface to hardware performance counters," in PADTAD Workshop, IPDPS, 2003, p. 289.
[12] The Linux perf tool. [Online]. Available: http://www.perf.wiki.kernel.org/index.php
[13] M. Pettersson, perfctr. [Online]. Available: http://user.it.uu.se/~mikpe/linux/perfctr
[14] S. Eranian, "Perfmon2: A flexible performance monitoring interface for Linux." [Online]. Available: https://www.kernel.org/doc/ols/2006/ols2006v1-pages-269-288.pdf
[15] V. M. Weaver, "Linux perf_event features and overhead," in FastPath Workshop, 2013.
[16] SUN Microsystems, SPARC V8 Manual. [Online]. Available: http://www.gaisler.com/doc/sparcv8.pdf
[17] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in IEEE 4th Annual Workshop on Workload Characterization, 2001, pp. 3–14.