Poster Abstract: Fast and Accurate Cycle Estimation through Hybrid Instruction Set Simulation for Embedded Systems

Kilho Lee∗, Wookhyun Han∗, Jaewoo Lee†, Hoon Sung Chwa∗, and Insik Shin∗
∗School of Computing, KAIST, South Korea
†Dept. of Computer and Information Science, University of Pennsylvania, USA

Motivation. Execution time analysis is essential during the design of real-time embedded systems to verify that all timing requirements are met. With the rapid increase in the complexity of modern hardware components, it becomes much more difficult to develop an accurate timing model for a target hardware, which serves as a basis for static timing analysis. Recently, simulation-based dynamic timing analysis techniques are becoming an attractive solution to predict the execution time of software in a fast and accurate manner. However, most existing simulation-based timing analysis techniques are limited to simulating the temporal behavior of a processor without consideration of other peripheral devices such as storage and network, leading to less accuracy. In this paper, we propose an accurate cycle estimation framework which allows multiple instruction set simulators to be used to simulate not only processors but also diverse peripheral devices. An instruction set simulator runs on a host machine to mimic the functional behavior of instructions running on a target hardware. It allows the execution time of software to be estimated quickly and accurately, and a system to be validated even when its target hardware does not yet exist or is not available.

Fig. 1: The cycle estimator with hybrid instruction set simulator architecture.

Approach. To support full system cycle estimation including peripherals, we propose an accurate cycle estimation methodology through hybrid instruction set simulation, which combines QEMU [1] and OVPsim [2], [3] as depicted in Figure 1. QEMU is well suited to capturing the temporal behavior of processors; however, it does not support emulation of important peripherals which affect the system timing, such as caches and TLBs. In contrast, OVPsim is not suitable for capturing the temporal behavior of processors, but it captures the effect of peripherals on the system cycle well due to its sophisticated peripheral emulation capability. The proposed approach takes the benefits of both QEMU and OVPsim and enables highly accurate cycle estimation for the full system. We extend QEMU and OVPsim with additional components: the processor pipeline model and the cycle counter on QEMU, and the cache/TLB event counter and the peripheral event counter on OVPsim.

We built a proof-of-concept prototype of our approach and conducted several benchmarks on the prototype. The prototype emulates the processor, cache, and TLB of Raspberry Pi [4]. In order to evaluate the accuracy of the proposed approach, we ran each benchmark on the prototype and on the real Raspberry Pi hardware, and compared the estimated cycles and the measured cycles. The benchmark result shows that our hybrid approach accurately estimates the system cycles while capturing the temporal behaviors of both processors and peripherals, and thereby effectively improves the accuracy compared to when either QEMU or OVPsim is used alone.

Future Works. At the current stage, we showed preliminary results with a proof-of-concept prototype. The prototype only covers ARM processor, cache, and TLB behaviors. Our future works include (a) extending the prototype to emulate diverse processors and other types of peripherals such as storage and networks, (b) elaborating each processor and peripheral model to emulate its complex temporal behaviors, and (c) conducting extensive benchmarks with realistic software.

ACKNOWLEDGMENT

This work was conducted at High-Speed Vehicle Research Center of KAIST with the support of Defense Acquisition Program Administration (DAPA) and Agency for Defense Development (ADD).

REFERENCES

[1] F. Bellard, “QEMU, a fast and portable dynamic translator,” in USENIX Annual Technical Conference, FREENIX Track, 2005, pp. 41–46.
[2] Imperas Inc., OVP World home page. [Online]. Available: http://www.ovpworld.org
[3] B. Bailey, “System level virtual prototyping becomes a reality with OVP donation from Imperas,” White Paper, June, vol. 1, 2008.
[4] “Raspberry Pi Model B.” [Online]. Available: https://www.raspberrypi.org/products/model-b/

Fast and Accurate Cycle Estimation through Hybrid Instruction Set Simulation for Embedded Systems

Kilho Lee∗, Wookhyun Han∗, Jaewoo Lee†, Hoon Sung Chwa∗, and Insik Shin∗
∗School of Computing, KAIST, South Korea
†Dept. of Computer and Information Science, University of Pennsylvania, USA

Abstract: Execution time analysis is essential during the design of real-time embedded systems to verify that all timing requirements are met. With the rapid increase in the complexity of modern hardware components, it becomes much more difficult to develop an accurate timing model for a target hardware, which serves as a basis for static timing analysis. Recently, simulation-based dynamic timing analysis techniques are becoming an attractive solution to predict the execution time of software in a fast and accurate manner. However, most existing simulation-based timing analysis techniques are limited to simulating the temporal behavior of a processor without consideration of other peripheral devices such as storage and network, leading to less accuracy. In this paper, we propose an accurate cycle estimation framework which allows multiple instruction set simulators to be used to simulate not only processors but also diverse peripheral devices. We build a proof-of-concept prototype of our proposed approach, run several benchmarks, and evaluate how accurately our proposed framework estimates the execution time of software running on a target hardware.

I. INTRODUCTION

Estimating the execution time of software components is standard practice during the design of real-time embedded systems to verify that all timing requirements are met. With advances in technology, hardware components in real-time systems are rapidly becoming more complex. For example, modern processors provide advanced computer architecture features such as caches, pipelines, branch prediction, and out-of-order execution, and they are interconnected with diverse peripherals like I/O and network devices. Such a rapid increase in complexity makes it much more difficult to predict the temporal behavior of software in a fast and accurate manner.

Existing execution time analysis techniques broadly fall into two classes: static and dynamic. Static analysis attempts to estimate execution time by examining the structure of the software and modeling the underlying hardware without executing the software directly on the hardware. This approach requires the manual development of a hardware model for each target hardware, and it is extremely difficult to model accurately the effect of modern complex architectural features on the timing behavior. On the other hand, dynamic analysis measures the execution time of software directly on the real hardware. There are several ways to do dynamic analysis. One of the most widely used techniques is instruction set simulation (ISS). An instruction set simulator runs on a host machine to mimic the functional behavior of instructions running on a target hardware. ISS allows the execution time of software to be estimated quickly and accurately, and a system to be validated even when its target hardware does not yet exist or is not available.

Most existing ISS approaches for cycle estimation are limited to processor modeling [1], [2]. Thach et al. [1] considered the cache model in processors. Stattelmann et al. [2] accelerated analysis speed by utilizing timing analysis. The cycle estimation techniques of previous work are mainly focused on processors and caches. They cannot accurately estimate cycles for complex software which exploits diverse system peripherals, such as memory, I/O devices, and network devices, because they cannot capture how those peripherals affect the system cycle. For accurate cycle estimation, the estimator should consider not only the behaviors of processors and caches, but also the behaviors of peripherals.

To support full system cycle estimation including peripherals, we propose an accurate cycle estimation methodology through hybrid ISS, which combines QEMU [3] and OVPsim [4], [5]. Although QEMU is well suited to capturing the temporal behavior of processors, it does not support emulation of important peripherals which affect the system timing, such as caches and TLBs. In contrast, OVPsim is not suitable for capturing the temporal behavior of processors, but it captures the effect of peripherals on the system cycle well due to its sophisticated peripheral emulation capability. The proposed approach takes the benefits of both QEMU and OVPsim and enables highly accurate cycle estimation for the full system.

We build a proof-of-concept prototype of our approach and conduct several benchmarks on the prototype. Our prototype emulates the processor, cache, and TLB of Raspberry Pi [6]. In order to evaluate the accuracy of the proposed approach, we run each benchmark on the prototype and on the real Raspberry Pi hardware, and compare the estimated cycles and the measured cycles.

The benchmark result shows that our hybrid approach accurately estimates the system cycles while capturing the temporal behaviors of both processors and peripherals, and thereby effectively improves the accuracy compared to when either QEMU or OVPsim is used alone.

II. RELATED WORKS

A cycle-accurate simulator reflects the target hardware logic and conforms to the cycle-by-cycle behavior of the target system. It can produce almost the same number of cycles as actual execution on the target hardware. However, cycle-accurate simulators tend to be slow, and designing the target system from scratch is complex. In contrast, instruction set simulators focus on functional accuracy of instructions and fast

Code Executor Virtual Machine Runtime

Cycle Counter Virtual Machine Runtime Code Executor Cache/TLB Event Counter

Code Cache Platform Constructor Platform Constructor Code Cache Peripheral Simulation Dynamic Binary Translator Engine Processor Pipeline Model Peripheral Event Counter Dynamic Peripheral Simulation Binary Translator Engine

Cycle Estimation Engine Required Cycles for Cache & Peripheral Target Binary Execution Target Binary Execution Processor Event Statistics on Host Machine on Host Machine

(a) QEMU (b) OVPsim (c) The hybrid cycle estimator Fig. 1: The architecture overview of QEMU, OVPsim, and the hybrid cycle estimator. simulations. In result, instruction set simulators are prevailing manner. Modern softwares typically utilize diverse hardware simulation tools for designing and testing embedded systems. resources such as memory, storage, audio/video, and network. QEMU [3] is an open-source instruction set simulator that Therefore, the cycle estimator requires to simulate not only emulates processors through dynamic . It processors, but also diverse peripherals. In addition, since supports various processors, but not enough peripherals to modern softwares are getting complex and heavy, the cycle emulate diverse embedded systems. estimator should be fast enough to analyze the complex Gem5 [7] is an another open-source instruction set simulator software in a timely manner. Thereby, the cycle estimator composed of modularized hardware component models. It should rely on instruction set simulators, not on cycle accurate provides the full system simulation mode and the system call simulators. In this paper, we propose a novel cycle estimation emulation mode. It supports various instruction set architec- technique which uses multiple instruction set simulators (i.e., tures, GPU models, and memory models. However, similar to QEMU and OVPsim) to achieve fast and accurate estimation QEMU it also lacks of diverse hardware peripheral supports. and support diverse hardware components. OVPsim [4], [5] is a full system instruction set simulator B. System Design: Hybrid Architecture that supports various processor models, memory model com- ponents, and peripherals. It supports the behavior modeling of We propose a hybrid cycle estimator composed of QEMU devices including processors, memory, and peripherals. Thus, and OVPsim, as depicted in Figure 1(c). 
To achiece a fast OVPsim is an adequate simulation platform to design various cycle estimation without losing accuracy, our proposed cycle embedded systems. However, it is a closed-source, so this estimator uses multiple instruction set simulators, instead of limits to extend towards a cycle estimator. cycle-accurate simulators. Although both QEMU and OVPsim Estimation of cycles on such fast instruction set simulators are designed to emulate a target machine and execute a while maintaining simulation speed is an important problem target machine’s binary on a host machine, they could be a to be resolved. There is a couple of existing works to estimate great foundation for cycle estimation. However, either QEMU cycles on instruction set simulators. Thach et al. [1] estimated or OVPsim has inherent weaknesses for full system cycle cycles on QEMU. They split up pipeline scheduling into estimation. The proposed hybrid architecture cycle estimator two phases, a static scheduling that obtains pre-estimation enables that QEMU and OVPsim complement each other, cycles and a dynamic adaptation that refines estimated cycles and it consequently provides accurate cycle estimation for the with runtime factors. They build a cache simulator in QEMU full system with diverse and flexible hardware components and check cache hit or miss in dynamic adaptation phase. support. Stattelmann et al. [2] combined QEMU and existing timing The proposed cycle estimator utilizes QEMU and OVP- analysis to create a timing database. The pre-estimated timing sim at the same time, to capture the temporal behavior of database can be used in QEMU to estimate cycles with processor pipelines and events from peripherals, respectively. slight overhead. 
These works focus on estimating cycles while Capturing the temporal behavior of a processor requires to simulating fast, but none of these approaches focus on cycle extend the core part of the instruction set simulator internals estimation with simulating full embedded systems including (i.e., instruction fetch and translation routine). Since QEMU peripherals. is an open-source project, it is highly extensible to include its core emulation routines. In contrast, OVPsim is a closed- III.APPROACH source, it is very difficult to extend its core emulation part. The estimation of the processor cycle requires modification of A. System Goal not only the processor models but also the emulation engine. The main objective of this paper is to develop a simulation- Thereby OVPsim is not suitable to estimate the processor based full system cycle estimator which provides the approx- cycles, even though it is capable to generate a customized imation of required cycles for a target binary in a timely processor model with its . It is also important to capture events from diverse periph- virtual machine runtime, captures the events from caches and erals which affect the system cycles. For example, waiting TLBs, and passes the event statistics to the cycle estimation for data from cache, TLB, or storage may incur pipeline engine. In addition, OVPsim supports peripheral modeling stalls and thereby increase in the required cycles. Both QEMU through Behavioral Hardware Modeling (BHM) and Peripheral and OVPsim provide peripheral emulations. However, QEMU Programming Model (PPM) APIs. BHM supports the behavior emulates only limited types of peripherals related to the system models of peripherals, while PPM provides interfaces to the functional behaviors, so it cannot emulate other peripherals platforms such as bus and network connections. 
The peripheral affecting the system temporal behaviors, such as cache, TLB, event counter runs on top of the peripheral simulation engine and system bus. In contrast, OVPsim provides more sophisti- by using these APIs, captures the events from peripherals, and cated peripheral emulation including the peripherals affecting passes the statistics to the cycle estimation engine. the system temporal behaviors. Besides, OVPsim provides Cycle Estimation Engine The cycle estimation engine col- peripheral simulation engine and APIs to ease of develop- lects the cycle counter result from QEMU and the peripheral ing customized peripheral models, while QEMU requires to event statistics from OVPsim. The estimation engine calculates implement from scratch for customized peripheral models. cycle penalties for peripherals based on the event statistics and each hardware specification. After that, the estimation engine C. Cycle Estimation combines the required cycles for processor and the cycle Processor. The proposed estimator utilizes QEMU to es- penalties, and consequently calculates the required cycles for timate the required cycles for processors. QEMU executes a the full system. code executor and a DBT (Dynamic Binary Translator) with a code cache, as depicted in Figure 1(a). The code executor IV. EVALUATION fetches a code block from the target binary, and executes the A. Experiment setup block if it is already translated for the host machine. If the code block is not translated yet when the code executor fetches it, We implement a proof-of-concept prototype of the proposed the code executor calls the DBT to translate the block and approach and evaluate how accurately the proposed approach executes the block after code translation. The DBT caches estimates cycles. To this end, we run integer benchmarks every translated code block into the code cache to increase on both our cycle estimator prototype and a target hardware emulation speed. respectively, and compare the results. 
As shown in Figure 1(c), we extend QEMU with two addi- As a target hardware system, we use RaspberryPi Model- tional components: the processor pipeline model and the cycle B [6] composed of single-core ARM1176JZF [8] of 700 MHz, counter. The processor pipeline model runs with the DBT. L1 cache of 16 KB, L2 cache of 128 KB, main memory of 512 It calculates the required cycles for each target instruction MB, and other peripherals such as USB, ethernet, and GPIO. in the code block while considering the processor pipeline RaspberryPi is equipped with diverse hardware components behaviors, and consequently calculates the required cycles for and provides enough computing functionalities with plenty of each code block. The cycle counter runs together with the tools and documentations, which is a suitable target hardware code executor and counts the number of executions for each for experiments. We use Fibonacci number calculation as a code block. Since the required cycles for each code block is computation intensive benchmark and Matrix multiplication as calculated by the processor pipeline model, the cycle counter a memory intensive benchmark. Each benchmark takes input can estimate the total cycles used by the processor to run the parameters, such as the target Fibonacci number and the matrix binary. QEMU passes the estimated required cycles into the size. cycle estimation engine as an intermediate result for the full To evaluate the accuracy of the proposed approach, we system cycle estimation. estimate the required cycles for each benchmark by using Cache, TLB, and other peripherals. The estimator uses our prototype. After that, we run each benchmark on the real OVPsim to estimate the required cycles for cache, TLB, and hardware and measure the elapsed cycles for the benchmark other peripherals. As shown in Figure 1(b), OVPsim executes by using the PMU (Performance Monitoring Unit) in the Virtual Machine Runtime, Platform Constructor and Peripheral hardware. 
We conduct 50 trials for each measurement on the Simulation Engine. The virtual machine runtime is generated real hardware and use the average of them as a final result, through the platform constructor by combining hardware mod- in order to alleviate measurement noises from the PMU, the els for a target machine, and it emulates the target machine underlying operating systems, and the randomness of cache based on the hardware models. In addition, OVPsim adopts accesses. In contrast we conduct only one trial for each the peripheral simulation engine which emulates behaviors of cycle estimation, because our prototype calculates the required diverse peripherals. cycles based on the target binary and the input parameters and As shows in Figure 1(c), we extend OVPsim with two does not contain any noises. We use the estimated required components: the cache/TLB event counter and the peripheral cycles normalized to the average of the measured elapsed event counter. For emulation of caches and TLBs, OVPsim uti- cycles as the accuracy of the cycle estimation. In order to lizes memory component modeling through Virtual Machine elaborate the effectiveness of the hybrid architecture, we break Interface (VMI) API. Modeled caches and TLBs are embedded down each estimation result into QEMU Only and Hybrid. in the virtual machine runtime, and then important events such QEMU Only utilizes our extended QEMU to estimate cycles as the number of cache hit/miss and TLB hit/miss can be for the processor only, and Hybrid utilizes our full system to captured. The cache/TLB event counter runs on top of the estimate cycles for the processor and the peripherals. QEMU Only Hybrid Measurement the measurement with a large matrix size has a relatively higher standard deviation, which comes from random cache 0.99 0.99 0.99 0.99 0.99 0.99 1 miss patterns due to the underlying . 
It could be addressed by more sophisticated experimental environment, 0.5 such as conducting measurements while disabling timer inter- rupts.

V. CONCLUSIONAND FUTURE WORKS Normalized Cycles Normalized 0 107 108 109 This paper proposed a fast and accurate simulation-based Fibonacci Number (n) dynamic cycle estimator. In order to emulate various hardware Fig. 2: Fibonacci number benchmark result. components and increase its estimation accuracy, the proposed estimator has a hybrid architecture composed of the extended QEMU and OVPsim. We built a proof-of-concept prototype QEMU Only Hybrid Measurement of our approach and conducted several benchmarks with the

0.98 0.97 1.01 prototype. The benchmark results showed that the proposed 1 0.91 0.84 approach accurately estimates the required cycles of target

0.53 software even the software intensively utilizes diverse hard- 0.5 ware components such as caches and TLBs. At the current stage, we showed preliminary results with a proof-of-concept prototype. The prototype only covers ARM Normalized Cycles Normalized 0 processor, cache, and TLB behaviors. Our future works include 64*64 96*96 128*128 Matrix Size (m*n) (a) extending the prototype to emulate diverse processors and other types of peripherals such as storages and networks, (b) Fig. 3: Matrix multiplication benchmark result. elaborating each processor and peripheral model to emulate its complex temporal behaviors, (c) and conducting extensive benchmarks with realistic softwares. We expect that this paper B. Experiment result would be a great starting point to develop an accurate cycle es- We evaluate the accuracy of our proposed approach with timation technique for complex target softwares which utilize benchmarks on the prototype system. Figures 2 and 3 show diverse hardware components. that the proposed approach accurately estimates the required cycles. They plot the normalized cycles to the average of the ACKNOWLEDGMENT cycle measurement on the real hardware. The error bars on the This work was conducted at High-Speed Vehicle Research graph represent the standard deviations of the measurement Center of KAIST with the support of Defense Acquisition trials. Program Administration (DAPA) and Agency for Defense The Fibonacci number benchmark shows the accuracy of Development (ADD). our cycle estimation for the processor. In Figure 2, QEMU REFERENCES Only results in very accurate cycle estimations. Hybrid is similar to QEMU Only, since the estimated cycles by the [1] D. Thach, Y. Tamiya, S. Kuwamura, and A. Ike, “Fast cycle estimation methodology for instruction-level ,” in 2012 Design, Automation OVPsim occupies very small portion of the Hybrid. This is Test in Europe Conference Exhibition (DATE), March 2012, pp. 248–251. 
because the Fibonacci number benchmark takes almost all [2] S. Stattelmann, S. Ottlik, A. Viehl, O. Bringmann, and W. Rosenstiel, cycles for ALU operations with no memory accesses, so the “Combining instruction set simulation and wcet analysis for embedded software performance estimation,” in 7th IEEE International Symposium extended QEMU in our proposed system plays a major role on Industrial Embedded Systems (SIES’12). IEEE, 2012, pp. 295–298. to estimate cycles. The accurate results of QEMU Only for [3] F. Bellard, “Qemu, a fast and portable dynamic translator.” in USENIX the Fibonacci number benchmark reflects that the proposed Annual Technical Conference, FREENIX Track, 2005, pp. 41–46. [4] Imperas inc. ovp world home page. [Online]. Available: processor pipeline model can accurately estimate the required http://www.ovpworld.org cycles for the processor. [5] B. Bailey, “System level virtual prototyping becomes a reality with ovp In addition, the matrix multiplication benchmark shows donation from imperas,” White Paper, June, vol. 1, 2008. [6] “Raspberrypi model-b.” [Online]. Available: the effectiveness of our hybrid architecture. Since the matrix https://www.raspberrypi.org/products/model-b/ multiplication is a memory-intensive task, pipeline stalls due to [7] “The gem5 simulator.” [Online]. Available: http://gem5.org/ cache and TLB misses occupy a major portion of the elapsed [8] “Arm1176.” [Online]. Available: https://www.arm.com/products/processors/classic/arm11/arm1176.php cycles. In Figure 3, the estimation accuracy of QEMU Only decreases as matrix size increases, since the QEMU does not know about the temporal behaviors of cache and other peripherals. The proposed hybrid architecture then enables OVPsim to complement this limitation of QEMU. OVPsim estimates pipeline stalls from cache and TLB misses by using peripheral models and compensates the estimation of QEMU by adding cycle penalties from the pipeline stalls. 
As a result, Hybrid results in more accurate cycle estimation. Note that
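As an illustration of the processor part of Section III-C, the following is a minimal Python sketch of how a code cache, a per-block pipeline model, and a cycle counter might cooperate. It is not the authors' QEMU extension: the class ProcessorCycleSketch, the function toy_pipeline_model, and the per-instruction costs are all hypothetical, and real pipeline modeling would also account for hazards and stalls.

```python
class ProcessorCycleSketch:
    """Toy stand-in for the extended QEMU described in Section III-C.

    All names here are hypothetical; this is not QEMU code.
    """

    def __init__(self, pipeline_model):
        self.pipeline_model = pipeline_model  # callable: instructions -> cycles
        self.code_cache = {}    # block address -> (translated block, cycles per run)
        self.exec_counts = {}   # block address -> number of executions
        self.translations = 0   # how often the "DBT" had to translate

    def run_block(self, address, instructions):
        if address not in self.code_cache:
            # Code cache miss: translate once and estimate the block's cycles once.
            translated = tuple(instructions)
            cycles = self.pipeline_model(translated)
            self.code_cache[address] = (translated, cycles)
            self.translations += 1
        # Cycle counter: count every execution of the block.
        self.exec_counts[address] = self.exec_counts.get(address, 0) + 1

    def total_cycles(self):
        # Per-block cycles multiplied by the number of executions.
        return sum(self.code_cache[addr][1] * count
                   for addr, count in self.exec_counts.items())


def toy_pipeline_model(instructions):
    # Made-up per-instruction costs; a real model would consider pipeline behavior.
    costs = {"add": 1, "mul": 2, "ldr": 3}
    return sum(costs[op] for op in instructions)


sim = ProcessorCycleSketch(toy_pipeline_model)
loop_body = ["ldr", "mul", "add"]          # 3 + 2 + 1 = 6 cycles per run
for _ in range(4):
    sim.run_block(0x8000, loop_body)       # translated once, executed four times
sim.run_block(0x8010, ["add", "add"])      # 2 cycles, executed once
print(sim.total_cycles())
```

The point of the design is that the cost of a block is computed once, at translation time, so counting executions afterwards adds very little overhead to the emulation loop.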
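The combination step of the cycle estimation engine in Section III-C can be read as simple arithmetic: the processor cycles plus, for each event type, the event count multiplied by a per-event cycle penalty. The sketch below assumes exactly that linear form, which the paper does not spell out; the event names and penalty values are made up for illustration and are not taken from the Raspberry Pi specification.

```python
def peripheral_penalty(event_counts, penalty_per_event):
    """Cycle penalty implied by the event statistics (assumed linear in the counts)."""
    return sum(count * penalty_per_event[name]
               for name, count in event_counts.items())


def estimate_full_system(processor_cycles, event_counts, penalty_per_event):
    """Processor cycles plus peripheral cycle penalties."""
    return processor_cycles + peripheral_penalty(event_counts, penalty_per_event)


# Hypothetical inputs: a processor estimate and event statistics.
processor_cycles = 13750
events = {"l1_miss": 300, "tlb_miss": 20}
penalties = {"l1_miss": 10, "tlb_miss": 40}   # made-up cycles per event

total = estimate_full_system(processor_cycles, events, penalties)
print(total)
```

With these numbers the penalty is 300 * 10 + 20 * 40 = 3800 cycles, so the full system estimate is 17550 cycles.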
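The accuracy metric of Section IV-A, the estimated cycles normalized to the average of the measured cycles, can be sketched as follows. The function name normalized_accuracy and all numbers are hypothetical (five made-up trials instead of the 50 used in the paper), and the sample standard deviation is an assumption about how the error bars were computed.

```python
from statistics import mean, stdev


def normalized_accuracy(estimated_cycles, measured_trials):
    """Return (estimate / mean measurement, relative std. dev. of the measurements)."""
    average = mean(measured_trials)
    return estimated_cycles / average, stdev(measured_trials) / average


# Made-up PMU readings from repeated runs on the real hardware.
measured = [1000, 1010, 990, 1005, 995]

qemu_only_acc, rel_std = normalized_accuracy(900, measured)   # processor only
hybrid_acc, _ = normalized_accuracy(990, measured)            # processor + penalties

print(qemu_only_acc, hybrid_acc, rel_std)
```

A value close to 1 means the estimate matches the averaged measurement; in this toy data the hybrid estimate is closer to 1 than the processor-only estimate.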