ARMSim: An Instruction-Set Simulator for the ARM processor Alpa Shah [[email protected]] - Columbia University

Abstract the ARMSim system. Hereafter, the A hardware simulator is a piece of that machines that are simulated will be referred emulates specific hardware devices, enabling to as target machines, and the system on execution of software that is written and which the simulator is actually run is compiled for those devices, on alternate systems. referred to as the host machine. This paper describes a simulator for the ARM processor, which is widely used in embedded The remainder of the paper is organized as devices like PDAs, cellular phones etc. ARMSim follows: Section 2 discusses different is a lightweight ISA (Instruction Set simulation strategies, and then discusses ISS Architecture) level simulator and a trace strategies in detail. Existing related work on generator too. It has some optimizations at the simulators is discussed in Section 3. It also decoder level to improve performance. describes the ideas and design principles behind ARMSim. The structure and 1. Introduction architecture of ARMSim is presented in Simulator or Virtual machine technology is Section 4. Section 5 describes the realization an integral part of many computing systems of the design, and Section 6 discusses some today. This technology is incredibly useful possible extensions and concludes the since it makes it possible for users and paper. analyzers to test and execute software well before the actual hardware is available, or in 2. Simulation Strategies absence of the actual hardware. Some The best simulation method depends on the benefits of using simulators are: application of the simulation results. This section outlines several simulation strategies • Simulators are flexible and thus new and their applications. features or components can easily be added. • Architectural level Simulation: • It is possible to model anything, including Logic designers build Architectural something that might not be possible to simulators to express and test new designs. do on the hardware. These allow emulation of the different • It allows stress testing of programs like parts of a processor, using either the operating systems by simulating complex simple core, or the core and the data interrupt and exception conditions. caches and other components. These are • Since simulators are built in software, they generally not intended for executing are more deterministic. The deterministic target system binaries on an alternate behavior of simulators makes programs platform, but rather to allow research into execution reproducible, and thus helps in the modification of the internal datapaths locating problems. of the processor.

Primary applications of simulators consist of • Direct Execution: computer architecture studies and Target machine binaries can be executed performance tuning of compiled software, natively on the simulator host processor and the compilation process itself. Various by encasing the program in an types of simulators exist, each addressing environment that makes it execute as different aspects like clock cycle rate, though it were on the simulated system. modeling the microprocessor chip logic, This technique requires that either the modeling the program execution host system have the same instruction set environment, etc. This paper concentrates as the target, or that the program be on an instruction set simulator [ISS] recompiled for the host architecture. implementation for the ARM processor[1], Instructions that cannot execute directly on the host are replaced with procedure ideas. These simulators are written to test calls to simulator code. This method is concepts and processor design tradeoffs; also known as Dynamic Recompilation flexibility is important and speed is not of [Dynarecs]. Native execution of the primary importance. They often gather recompiled code leads to a much faster statistics, and this constrains and slows execution of the simulated software, but down the simulator. ISS are also referred they have lengthy context switching, i.e. as “complete system instruction set when the host processor has to switch to simulator,” as well as “completer system target processes. This can make the simulator.” Furthermore, complete system simulation slower. Most direct execution simulators can be designed to be fully simulators statically inspect and translate deterministic. They effectively address all instructions before simulation begins. two major problems in real-time analysis: viz. difficulty in reproducing the effects of • Threaded Code[13]: program execution, and time distortion This is a simulation technique where each resulting from intrusion. Since simulators op-code in the target machine instruction are implemented fully in software, they set is mapped to the address of some can be fully tailored for specific uses. ISS (lower level) code in the simulator system, may not address issues like the delays to perform the appropriate operation. This from devices such as disk subsystems, and can be implemented efficiently in machine the memory hierarchy, including the code on most processors by simply memory management unit, caches and performing an indirect jump to the data-buses. Timing accuracy in ISS is of address, which is the next instruction. less importance than predictability This method does not suffer from lengthy preciseness, and determinism. An artificial context switching. system has few unpredictable factors. Thus, a simulated system starting • Instruction set simulators: execution in a known state will always Instruction set simulators [ISS] execute proceed along the same path. This is target machine programs by simulating useful for experiments and debugging the effects of each instruction on a target purposes. A user can detect misbehaving machine, one instruction at a time. blocks of code that take more than Instruction set simulators are attractive for expected time, and thus restart the process their flexibility: they can in principle, to examine the offending blocks in greater model any computer, gather any statistics, detail and thus trace the execution. and run any program that the target architecture would run. They easily serve Several techniques can improve the as backend systems for traditional performance of Instruction set simulators. debuggers as well as architecture design Instead of decoding the operation fields tools such as cache simulators. A lot of each time an instruction is executed, the temporal debuggers have recently started instruction is translated once into a form using ISS. An ISS can dispatch instructions that is faster to execute. This idea has been by fetching from a simulated memory, used in a variety of simulators for a isolating the operation code fields, and number of applications. Several efficient also branching, based on the values of memory models have also been proposed. these fields. Once dispatched, reading and manipulation of variables that represent 3. Existing Work the target system’s state are used to There has been a lot of research work on simulate the instruction’s semantics. software simulation of processors. These generally employ the above-mentioned When CPU architects design a new strategies in addition to some new methods. machine they typically write an This section describes some popular instruction-level simulator to test their simulators for different architectures, and also specific ones for ARM. The ARMSim co-processor instructions that vary widely system is compared with existing ARM- from system to system depending on the specific simulators. actual co-processor module available on the chip. ARMSim is used to execute ARM • SimOS[9]: SimOS takes a promising binaries on a different target machine, and approach that handles both user and gather some statistics based on the execution kernel code by dynamically translating of the binaries. It executes one instruction at target instructions into short sequences of a time and updates the processor state host-native code. Unlike most direct accordingly. execution simulators, SimOS keeps only a small part of the target machine’s state in 4. Structure of ARMSim host registers and so is able to multiplex ARMSim is a simulator of the 32-bit between target processors quickly. ARM RISC processor[1]. ARMSim simulates • Shade[2]: Shade simulates SPARC the entire instruction set except for those processor by using dynamic compilation. requiring use of the co-processor unit. The • Talisman[8]: Talisman is a simulator for co-processor is not a functional unit multicomputers based on the threaded- standard across all ARM processors, and it code strategy. was not possible to come up with a common • Mint[10]: Mint is a simulator for MIPS subset of operations that all co-processors R3000. would support. Figure (a) illustrates the processor of the ARM as simulated by The following are existing simulators ARMSim. The following are specific developed for the ARM processor, using functional features of ARMSim: some of the above-mentioned techniques: • System binaries: ARMSim takes in binaries • ARMPhetmaine[12]: this simulator is based for ARM processor for simulation, instead on the direct compilation technique. of assembly code. Thus, code compiled for • tARMac: also based on the same method. the target processor can be directly • SWARM[11]: initially designed as an ARM executed. module to plug into the SimOS system • Binary data representation: ARMSim treats developed at Stanford University. It is data as being stored in the big-endian now is an independent ARM simulator. It representation. implements the core and a small amount • Determinism: The program behavior must of internal co-processors at a basic level, be repeating for testing purposes. Being and provides full support for registers, completely software-simulated, ARMSim memory caches, and the external memory guarantees that the outcome of program hierarchy. behavior is deterministic. • SimARM[15]: an instruction set simulator • Low startup: Being an instruction level [ISS]. ARMulator[14] is another ISS with a simulator, the system has no initial startup slight variation: it ensures identical cycle- time, and no preprocessing is required as count for instructions. This means that in the case of direct compilations or instructions take the same number of threaded-code systems. cycles to execute as if run on real ARM • Extensible: The implementation of the hardware. This is important for precise system is based on a modular design, and simulation since compilers can optimize this makes it possible to extend the code that takes advantage of the cycle- simulator in order to support other counts of specific instructions. modules like I/O devices, and complex memory hierarchies, if needed. 3.1 The ARMSim Approach: • Statistics: ARMSim has been designed to ARMSim is an instruction level simulator support tracing the execution of a for the ARM processor that allows tracing of program and gather statistics. ARMSim program execution. ARMSim simulates the can be used to collect information like entire ISA except for the non-standardized frequency of data memory accesses, the number of branches that are taken, and fetched for execution. This technique is other analyses on program behavior based called decode-once-execute-many-times. on instruction profiling. 4.2.3 Execute The execute phase of the iteration involves computation of memory addresses (to get the operands), if necessary, and the actual operation of the current instruction. This operation updates the state of the system as determined by the instruction type.

4.3 Memory model ARMSim does not explicitly model the Fig. (a) memory hierarchy of the real ARM processor. Being an instruction-level 4.1 Modeling Processor simulator, ARMSim can avoid having to ARMSim is a behavioral modeling of the ARM accurately emulate movement of data processor. This aspect made possible numerous between the memory, the caches and the abstractions from the underlying hardware, e.g. processor, and instead can represent all simplified memory replacing the complete memory-accesses as being uniform. memory hierarchy subsystem, unification of all Instruction memory is stored as a linear the individual functional units in the CPU, etc. array of 21-bit values, as shown in figure (b). The ARM processor elements simulated in ARMSim are program and data memory, the status register, CPSR, the ALU, barrel shifter, conditional execution of instructions, general purpose registers. The state of the system is defined by the values in the registers and memory.

4.2 Modeling Basic Instruction execution ARMSim simulates program execution by iterating through a cycle of instruction Fig. (b) fetching, decoding and execution. 4.3.1 Data Memory 4.2.1 Fetch The ARM supports data transfer in 3 Just like in any microprocessor[6], ARMSim different sizes: byte, halfword, and word internally maintains a program counter to between the registers and memory. determine the next instruction to be fetched ARMSim implements the transfer of from the instruction memory. This program memory data of all the sizes while counter is updated automatically upon each maintaining the endian-ness and signed-ness instruction fetch, and can also be explicitly of the data. updated by instructions. 4.3.2 Program Memory 4.2.2 Decode ARMSim does not simulate the ARM The class of each fetched instruction is first processor in the 16-bit THUMB mode. This identified, and then, the different fields such has the effect that all instructions are stored as the opcode, operands, of the instruction at word-aligned instructions, and all are decoded according to the type of the instructions fetches operate as 32-bit data instruction. ARMSim achieves some transfers. optimization based on the decoding step – each instruction needs to be decoded only 4.4 Processor State modeling once, regardless of the number of times it is Of the various modes of operations possible on the ARM, the most interesting one is the 5.1. ARMCore: a unified CPU: ALU, barrel user mode. Most of the running time of shifter programs is spent executing instructions in ARMSim interprets the execution of all this mode. ARMSim maintains the exact set target system instructions thus obviating the of on-chip registers as are visible to need for separate functional units. instructions executing in the user mode on 5.2. ARMCore: CPSR and general purpose the ARM processor. Out of these registers, registers the CPSR register is updated based on the The 17 general purpose and CPSR registers result of instruction execution, while the visible in the user mode in the ARM data in the 16 general purpose registers can processor are implemented in ARMSim. be explicitly set by the instructions. 5.3. Instruction identification, decoding 4.5 Statistics gathering ARMSim identifies the type of each fetched ARMSim allows tracing through the instruction by performing a bitwise execution of the programs. Some of the comparison of the 32-bit instruction with a information that ARMSim can record are: predefined set of masks. It then creates an • Instruction address, effective address instance of a subclass of the Instruction class • Decoded opcode value that will automatically decode the • Number of branches taken remaining different fields of the instruction • Number of memory references such as the operands, immediate values, and condition codes. 5. Realization of the ARMSim 5.4. Decode-once-execute-many-times system Compiler-generated executable programs, in In this section of the paper, I describe the general, contain many blocks of code that various system design decisions I have are iterated numerous times. ARMSim made while building ARMSim. An ISS is by attempts to utilize this fact to achieve an nature much slower at interpreting optimization in the simulation – since all 32- programs than running the program bit instructions are decoded and executed in natively on the intended hardware, or even the form of an Instruction object, ARMSim dynamic recompiling simulators that can will create a single instance for all iterations optimize the interpretation by actually of the instruction at each memory address. translating certain parts of the target system This allows the simulator to avoid having to program into native code for the host re-create new instances for the instructions system that the simulator itself is running in each iteration of the loop. Instead, after on. However, this sort of translation executing the instruction at the given decreases the amount of control on the memory location the first time, ARMSim simulation – such a thing would prevent will preserve this instance and reuse it for any extensive form of behavioral analysis of any future execution of that instruction. execution. For example, it would be Also, ARMSim carries out all processor- impossible to count the number of times a state-independent decoding only for the first certain block of code in the target program time that the instruction is executed – these has been executed, especially if the original are generally field extractions such as target code is significantly different from its source/destination register identification, translated version. Moreover, with this sort and immediate operands evaluation. of simulation, it is hard to maintain uniformity of conditions among different 5.5. Realization of the memory system host system architectures. ARMSim is a fully abstraction interpretative simulator, and it incorporates ARMSim treats the memory system as being some optimizations that help speed up the uniform, with constant access time, overall execution time of target programs regardless of the address of the memory portably across different host systems. data being accessed. Instructions are stored at word-aligned addresses. Since executing January 22–26, 1990,Washington, DC, instructions could conceivably request byte- USA, pages 53–64, Berkeley, CA, USA, data transfers at non-aligned addresses, Jan. 1990. USENIX. ARMSim permits memory data transfers at [4] P. Magnusson and B. Werner: “Efficient non-word-aligned memory locations. memory simulation in SimICS”. In Proceedings of the 28th Annual Simulation 6. Conclusions, Future Extensions Symposium, 1995. ARMSim is a simulator that is well suited to [5] P. S. Magnusson: “Efficient instruction simulate the overall behavior of execution of cache simulation and execution programs that are intended for execution on profiling with a threaded-code an ARM system. Some important uses for interpreter”. In Proceedings of Winter ARMSim would be: comparisons of the Simulation Conference 97, 1997. quality of compiled code as produced by [6] John Hennessy and David Patterson. different compilers and/or compiler Computer Organization and Design: options. The modular nature of the The Hardware-Software Interface implementation of ARMSim makes it easy (Appendix A, by James R. Larus), to: Morgan Kaufman, 1993. • Can more closely represent the memory [7] Emmett Witchel, Mendel Rosenblum. hierarchy without having to rewrite major Embra: Fast and Flexible Machine parts of the system – this can then be used Simulation. Proceedings of the 1996 to monitor memory access patterns in test ACM SIGMETRICS Conference on programs such as temporal and spatial Measurement and Modeling of locality, size of memory requirement, etc. Computer Systems, Philadelphia, 1996. • Supplement the system with newer [8] Bedicheck, R. 1995. Talisman: Fast and definitions of instructions. accurate multicomputer simulation. In • Incorporate clock cycles-per-instruction Proceedings of the 1995 ACM [CPI] for each class of instruction; possibly SIGMETRICS Conference on include some mechanism for adjusting the Measurement and Modeling of CPI for instructions that cause a cache- Computer Systems (May), 14±24. miss. [9] Mendel Rosenblum, Edouard Bugnion, • Provide an “architectural” level garbage Scott Devine, and Stephen A. Herrod. collector that will intelligently maintain Using the SimOS Machine Simulator to the instruction pointers for those Study Complex Computer Systems. instructions that are in frequent use, e.g. in ACMTOMACS, 1997. a loop, and clear out the rest. A [10] J. E. Veenstra and R. J. Fowler. MINT: A sophisticated version would even perform Front End for Efficient Simulation of prefetching and create appropriate Shared-Memory. In Proceedings of instruction pointers before the instructions MASCOTS, pages 201-207, January 1994. are actually required. [11] SWARM 0.44 Documentation Michael Dales Department of Computing Science, University of Glasgow,Scotland 7. References [12] The Design of ARMphetamine 2 Julian [1] ARM7TDMI ARM processors User’s Brown, University of Bristol Manual, Advanced Risc Machines Ltd. [13] James R. Bell Threaded code. CACM, [2] R. Cmelik, D. Keppel [1994]: “Shade: A 16(6). June 1973 Fast Instruction Set Simulator for [14] The ARMulator; Document number: Execution Profiling," Proceedings of ARM DAI 0032; Issued 1999; SIGMETRICS, ACM, Nashville, TN, 128- http://www.arm.com/ 137. [15] SimARM; Green Hills Software Inc. [3] R. C. Bedichek: “Some efficient http://www.ghs.com/ architecture simulation techniques”. In USENIX Association, editor, Proceedings of the Winter 1990 USENIX Conference,