Evaluation of a Java Processor
Total Page:16
File Type:pdf, Size:1020Kb
Downloaded from orbit.dtu.dk on: Sep 27, 2021 Evaluation of a Java Processor Schoeberl, Martin Published in: Tagungsband Austrochip 2005 Publication date: 2005 Document Version Early version, also known as pre-print Link back to DTU Orbit Citation (APA): Schoeberl, M. (2005). Evaluation of a Java Processor. In Tagungsband Austrochip 2005 (pp. 127-134) General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Users may download and print one copy of any publication from the public portal for the purpose of private study or research. You may not further distribute the material or use it for any profit-making activity or commercial gain You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Evaluation of a Java Processor Martin Schoeberl Institute of Computer Engineering Vienna University of Technology, Austria [email protected] TABLE I Abstract— In this paper, we will present the evaluation results for a Java processor, with respect to size and performance. The JOP AND VARIOUS JAVA PROCESSORS Java Optimized Processor (JOP) is an implementation of the Java virtual machine (JVM) in a low-cost FPGA. JOP is the smallest Target Size Speed hardware realization of the JVM available to date. Due to the technology Logic RAM [MHz] efficient implementation of the stack architecture, JOP is also Altera, Xilinx JOP 1830LCs 3KB 100 smaller than a comparable RISC processor in an FPGA. FPGA Although JOP is intended as a processor for embedded real- picoJava No realization 128K gates 38KB time systems, whereas accurate worst case execution time analysis aJile ASIC 0.25µ 25K gates 48KB 100 is more important than average case performance, its general Moon Altera FPGA 3660LCs 4KB performance is still important. We will see that a real-time Lightfoot Xilinx FPGA 3400LCs 4KB 40 Komodo Xilinx FPGA 2600LCs 33 processor architecture does not need to be slow. FemtoJava Xilinx FPGA 2710LCs 0.5KB 56 I. INTRODUCTION In this paper, we will present the evaluation results for a Java processor [1], with respect to size and performance. available as an encrypted HDL source for Altera FPGAs or as This Java processor is called JOP – which stands for Java VHDL or Verilog source code. Optimized Processor –, based on the assumption that a full The Lightfoot 32-bit core [11] is a hybrid 8/32-bit processor native implementation of all Java Virtual Machine (JVM) based on the Harvard architecture. Program memory is 8 bits [2] bytecode instructions is not a useful approach. JOP is a wide and data memory is 32 bits wide. The core contains a Java processor for embedded real-time systems, in particular 3-stage pipeline with an integer ALU, a barrel shifter and a a small processor for resource-constrained devices with time- 2-bit multiply step unit. According to DCT, the performance predictable execution of Java programs. is typically 8 times better than RISC interpreters running at Table I lists the relevant Java processors available to date. the same clock speed. Sun introduced the first version of picoJava [3] in 1997. Sun’s Komodo [12] is a multithreaded Java processor with a picoJava is the Java processor most often cited in research four-stage pipeline. It is intended as a basis for research papers. It is used as a reference for new Java processors and on real-time scheduling on a multithreaded microcontroller. as the basis for research into improving various aspects of a The unique feature of Komodo is the instruction fetch unit Java processor. Ironically, this processor was never released with four independent program counters and status flags for as a product by Sun. A redesign followed in 1999, known four threads. A priority manager is responsible for hardware as picoJava-II that is now freely available with a rich set real-time scheduling and can select a new thread after each of documentation [4], [5]. The architecture of picoJava is bytecode instruction. a stack-based CISC processor implementing 341 different FemtoJava [13] is a research project to build an application instructions and is the most complex Java processor available. specific Java processor. The bytecode usage of the embedded The processor can be implemented [6] in about 440K gates. application is analyzed and a customized version of FemtoJava aJile’s JEMCore is a direct-execution Java processor that is generated in order to minimize the resource usage. Femto- is available as both an IP core and a stand alone processor Java is not included in Section IV, as the processor could not [7], [8]. It is based on the 32-bit JEM2 Java chip developed run even the simplest benchmark. by Rockwell-Collins. The processor contains 48KB zero wait Besides the real Java processors a few FORTH chips state RAM and peripheral components. 16KB of the RAM is (Cjip [14], PSC1000 [15]) are marketed as Java processors. used for the writable control store. The remaining 32KB is Java coprocessors (Jazelle [16], JSTAR [17]) provide Java used for storage of the processor stack. execution speedup for general-purpose processors. Vulcan ASIC’s Moon processor is an implementation of From the Table I we can see that JOP is the smallest the JVM to run in an FPGA. The execution model is the realization of a hardware JVM in an FPGA and also has the often-used mix of direct, microcode and trapped execution. highest clock frequency. As described in [9], a simple stack folding is implemented In the following section, a brief overview of the architecture in order to reduce five memory cycles to three for instruction of JOP is given, followed by a more detailed description of the sequences like push-push-add. The Moon2 processor [10] is microcode. Section III compares JOP’s resource usage with bytecode branch condition next bytecode microcode branch condition Busy JOP Core Memory Interface BC Address BC Data Bytecode Bytecode Bytecode Microcode Microcode Microcode Fetch Cache Fetch, translate Fetch and Decode Execute and branch branch Data Control Fetch Control branch Extension spill, Data bytecode branch fill B Multiplier Decode A Stack Stack Address RAM generation Data Control Stack I/O Interface Interrupt Fig. 2. Datapath of JOP example), (b) the control for the memory and the I/O module, Fig. 1. Block diagram of JOP and (c) the multiplexer for the read data that is loaded into the top-of-stack register. The write data from the top-of-stack (A) is connected directly to all modules. other soft-core processors. In the Section IV, a number of different solutions for embedded Java are compared at the A. The Processor Pipeline bytecode level and at the application level. JOP is a fully pipelined architecture with single cycle II. JOPARCHITECTURE execution of microcode instructions and a novel approach to mapping Java bytecode to these instructions. Figure 2 shows JOP is a stack computer with its own instruction set, the datapath for JOP. called microcode in this paper. Java bytecodes are translated Three stages form the JOP core pipeline, executing mi- into microcode instructions or sequences of microcode. The crocode instructions. An additional stage in the front of the difference between the JVM and JOP is best described as the core pipeline fetches Java bytecodes – the instructions of following: the JVM – and translates these bytecodes into addresses in The JVM is a CISC stack architecture, whereas JOP microcode. Bytecode branches are also decoded and exe- is a RISC stack architecture. cuted in this stage. The second pipeline stage fetches JOP Figure 1 shows JOP’s major function units. A typical instructions from the internal microcode memory and executes configuration of JOP contains the processor core, a memory microcode branches. Besides the usual decode function, the interface and a number of IO devices. third pipeline stage also generates addresses for the stack The processor core contains the three microcode pipeline RAM. As every stack machine instruction has either pop or stages microcode fetch, decode and execute and an additional push characteristics, it is possible to generate fill or spill translation stage bytecode fetch. The module called extension addresses for the following instruction at this stage. The last provides the link between the processor core, and the memory pipeline stage performs ALU operations, load, store and stack and IO modules. The ports to the other modules are the spill or fill. At the execution stage, operations are performed address and data bus for the bytecode instructions, the two with the two topmost elements of the stack. top elements of the stack (A and B), input to the top-of-stack A stack machine with two explicit registers for the two (Data) and a number of control signals. There is no direct topmost stack elements and automatic fill/spill needs neither an connection between the processor core and the external world. extra write-back stage nor any data forwarding. Details of this The memory interface provides a connection between the two-level stack architecture are described in [19]. The short main memory and the processor core. It also contains the pipeline results in short branch delays. Therefore, a hard to bytecode cache. The extension module controls data read and analyze, with respect to Worst Case Execution Time (WCET), write. The busy signal is used by the microcode instruction branch prediction logic can be avoided.