Fast and Accurate Cycle Estimation Through Hybrid Instruction Set Simulation for Embedded Systems
Poster Abstract: Fast and Accurate Cycle Estimation through Hybrid Instruction Set Simulation for Embedded Systems

Kilho Lee∗, Wookhyun Han∗, Jaewoo Lee†, Hoon Sung Chwa∗, and Insik Shin∗
∗School of Computing, KAIST, South Korea
†Dept. of Computer and Information Science, University of Pennsylvania, USA
[email protected]

Motivation
Execution time analysis is essential during the design of real-time embedded systems to verify that all timing requirements are met. With the rapid increase in the complexity of modern hardware components, it becomes much more difficult to develop an accurate timing model for the target hardware, which serves as a basis for static timing analysis. Recently, simulation-based dynamic timing analysis techniques have become an attractive way to predict the execution time of software quickly and accurately. However, most existing simulation-based timing analysis techniques only simulate the temporal behavior of the processor and do not consider peripheral devices such as storage and network, which reduces their accuracy. In this paper, we propose an accurate cycle estimation framework that uses multiple instruction set simulators to simulate not only processors but also diverse peripheral devices. An instruction set simulator runs on a host machine to mimic the functional behavior of instructions running on the target hardware. It makes it possible to estimate the execution time of software quickly and accurately, and to validate a system even when its target hardware does not yet exist or is not available.

Approach
To support full-system cycle estimation including peripherals, we propose an accurate cycle estimation methodology through hybrid instruction set simulation, which combines QEMU [1] and OVPsim [2], [3] as depicted in Figure 1. QEMU is well suited to capturing the temporal behavior of processors; however, it does not emulate important peripherals that affect system timing, such as caches and TLBs. In contrast, OVPsim is not suitable for capturing the temporal behavior of processors, but it captures the effect of peripherals on the system cycle count well because of its sophisticated peripheral emulation capability. The proposed approach takes the benefits of both QEMU and OVPsim and enables highly accurate cycle estimation for the full system. We extend QEMU and OVPsim with additional components: the processor pipeline model and the cycle counter on QEMU, and the cache/TLB event counter and the peripheral event counter on OVPsim.

[Fig. 1: The cycle estimator with hybrid instruction set simulator architecture. Blocks labeled in the figure: Target Binary, Code Executor, Virtual Machine Runtime, Cycle Counter, Cache/TLB Event Counter, Code Cache, Platform Constructor, Peripheral Simulation Engine, Dynamic Binary Translator, Processor Pipeline Model, Peripheral Event Counter, Cycle Estimation Engine, Required Cycles for Processor, Cache & Peripheral Event Statistics.]

We built a proof-of-concept prototype of our approach and ran several benchmarks on it. The prototype emulates the processor, cache, and TLB of the Raspberry Pi [4]. To evaluate the accuracy of the proposed approach, we ran each benchmark on the prototype and on real Raspberry Pi hardware, and compared the estimated and measured cycle counts. The benchmark results show that our hybrid approach accurately estimates the system cycles while capturing the temporal behavior of both processors and peripherals, and thereby improves accuracy compared to using either QEMU or OVPsim alone.

Future Works
At the current stage, we have shown preliminary results with a proof-of-concept prototype. The prototype covers only ARM processor, cache, and TLB behaviors. Our future work includes (a) extending the prototype to emulate diverse processors and other types of peripherals such as storage and networks, (b) elaborating each processor and peripheral model to emulate its complex temporal behavior, and (c) conducting extensive benchmarks with realistic software.

ACKNOWLEDGMENT
This work was conducted at the High-Speed Vehicle Research Center of KAIST with the support of the Defense Acquisition Program Administration (DAPA) and the Agency for Defense Development (ADD).

REFERENCES
[1] F. Bellard, “QEMU, a fast and portable dynamic translator,” in USENIX Annual Technical Conference, FREENIX Track, 2005, pp. 41–46.
[2] Imperas Inc., OVP World home page. [Online]. Available: http://www.ovpworld.org
[3] B. Bailey, “System level virtual prototyping becomes a reality with OVP donation from Imperas,” White Paper, June, vol. 1, 2008.
[4] “Raspberry Pi Model B.” [Online]. Available: https://www.raspberrypi.org/products/model-b/

Fast and Accurate Cycle Estimation through Hybrid Instruction Set Simulation for Embedded Systems

Kilho Lee∗, Wookhyun Han∗, Jaewoo Lee†, Hoon Sung Chwa∗, and Insik Shin∗
∗School of Computing, KAIST, South Korea
†Dept. of Computer and Information Science, University of Pennsylvania, USA
[email protected]

Abstract
Execution time analysis is essential during the design of real-time embedded systems to verify that all timing requirements are met. With the rapid increase in the complexity of modern hardware components, it becomes much more difficult to develop an accurate timing model for the target hardware, which serves as a basis for static timing analysis. Recently, simulation-based dynamic timing analysis techniques have become an attractive way to predict the execution time of software quickly and accurately. However, most existing simulation-based timing analysis techniques only simulate the temporal behavior of the processor and do not consider peripheral devices such as storage and network, which reduces their accuracy. In this paper, we propose an accurate cycle estimation framework that uses multiple instruction set simulators to simulate not only processors but also diverse peripheral devices. We build a proof-of-concept prototype of our proposed approach, run several benchmarks, and evaluate how accurately our proposed framework estimates the execution time of software running on the target hardware.

I. INTRODUCTION
Estimating the execution time of software components is standard practice during the design of real-time embedded systems to verify that all timing requirements are met. With advances in technology, hardware components in real-time systems are rapidly becoming more complex. For example, modern processors provide advanced architectural features such as caches, pipelines, branch prediction, and out-of-order execution, and they are interconnected with diverse peripherals such as I/O and network devices. This rapid increase in complexity makes it much more difficult to predict the temporal behavior of software quickly and accurately.

Existing execution time analysis techniques broadly fall into two classes: static and dynamic. Static analysis attempts to estimate execution time by examining the structure of the software and modeling the underlying hardware, without executing the software directly on the hardware. This approach requires the manual development of a hardware model for each target, and it is extremely difficult to model accurately the effect of modern, complex architectural features on timing behavior. On the other hand, dynamic analysis measures the execution time of software directly on the real hardware. There are several ways to do dynamic analysis. One of

Most existing instruction set simulators (ISSs) for cycle estimation are limited to processor modeling [1], [2]. Thach et al. [1] considered a cache model in processors. Stattelmann et al. [2] sped up the analysis by utilizing timing analysis. The cycle estimation techniques of previous work are mainly focused on processors and caches. They cannot accurately estimate cycles for complex software that exploits diverse system peripherals, because they cannot capture how those peripherals, such as memory, I/O devices, and network devices, affect the system cycle count. Accurate cycle estimation should consider not only the behavior of processors and caches, but also the behavior of peripherals.

To support full-system cycle estimation including peripherals, we propose an accurate cycle estimation methodology through hybrid ISS, which combines QEMU [3] and OVPsim [4], [5]. Although QEMU is well suited to capturing the temporal behavior of processors, it does not emulate important peripherals that affect system timing, such as caches and TLBs. In contrast, OVPsim is not suitable for capturing the temporal behavior of processors, but it captures the effect of peripherals on the system cycle count well because of its sophisticated peripheral emulation capability. The proposed approach takes the benefits of both QEMU and OVPsim and enables highly accurate cycle estimation for the full system.

We built a proof-of-concept prototype of our approach and ran several benchmarks on it. Our prototype emulates the processor, cache, and TLB of the Raspberry Pi [6]. To evaluate the accuracy of the proposed approach, we ran each benchmark on the prototype and on real Raspberry Pi hardware, and compared the estimated and measured cycle counts.

The benchmark results show that our hybrid approach accurately estimates the system cycles while capturing the temporal behavior of both processors and peripherals, and thereby improves accuracy compared to using either QEMU or OVPsim alone.

II. RELATED WORKS
A cycle-accurate simulator reflects a target