BlueJEP: A Flexible and High-Performance Java Embedded Processor

Flavius Gruian
Dept. of Computer Science, Lund University
221 00 Lund, Sweden
fl[email protected]

Mark Westmijze
Dept. of Computer Science, University of Twente
Enschede, The Netherlands
[email protected]

ABSTRACT

This paper presents BlueJEP, a novel Java embedded processor, developed using the relatively new Bluespec SystemVerilog (BSV) environment. The starting point for BlueJEP is a micro-programmed, pipelined, Java-optimized processor (JOP), written in VHDL. Our BSV solution features a number of design choices, including a longer pipeline, that make the design more flexible, maintainable and high-performance. BlueJEP also appears to be an excellent platform for exploring a number of Java-specific techniques, both in hardware (bytecode folding, memory management, and caching strategies) and in software (runtime environment, bytecode optimizations). Tests and measurements were carried out both through simulation and on implementations running on a Xilinx FPGA.

Categories and Subject Descriptors

C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; C.1.3 [Processor Architectures]: Other Architecture Styles—pipeline processors, stack-oriented processors

Keywords

Java processor, embedded systems, Bluespec

1. INTRODUCTION

In this paper, we introduce BlueJEP, a Java embedded processor developed in Bluespec SystemVerilog. Having the Java Optimized Processor [15], a VHDL design, as a starting point, our solution takes advantage of a higher level of abstraction language to implement a number of architectural choices specific to our processor. In particular, BlueJEP features a six-stage pipelined, micro-programmed stack machine, with speculative execution, operand forwarding and stage bypassing. The simpler stages and higher-level specification make our design more modular and flexible, suitable for architecture exploration and Java-tuned solutions. Furthermore, experimental results show that BlueJEP performance is at least as high as that of other Java processors in its class, with a small penalty in the chip area.

The design described in this paper uses a high level of abstraction language for hardware specification called Bluespec SystemVerilog (BSV [3]). BSV is a rule-based, strongly-typed, declarative hardware specification language making use of Term Rewriting Systems [11] to describe computation as atomic state changes. Although relatively new, Bluespec seems to have captured the interest of industry and academia, and a number of designs written using BSV are making their appearance (e.g. [4, 5, 19]).

The paper is organized as follows. In Section 2, we briefly review some of the related work. Section 3 describes in detail the BlueJEP architecture, while Section 4 gives the method and results of its FPGA implementation. A discussion regarding our design choices and architectural extensions makes the subject of Section 5. Finally, Section 6 presents our conclusions.

2. RELATED WORK

There are several Java processors reported in the research community, some even available as soft cores or chips, many designed for embedded systems and a few even for real-time applications. Relevant approaches are briefly listed here.

Sun's PicoJava-II [16], freely available, is arguably the most complex Java processor currently, a re-design of an older solution which was never released. Its architecture features a stack-based, six-stage pipelined CISC processor, implementing 341 different instructions. Folding of up to four instructions is also implemented.

aJile's JEMCore, based on Rockwell-Collins' JEM2 design, is available both as IP and as a standalone processor known as aJ-100 [1], a 0.25 µm ASIC operating at 100 MHz. The 32-bit core is a micro-programmed solution, comprising ROM and RAM control stores, an ALU, and a 24-element register file. JEMCore implements, besides the native JVM bytecodes, extended bytecodes for I/O and threading support, along with floating point arithmetic.

DCT's Lightfoot 32-bit core [6] is a hybrid 8-bit instruction, 32-bit data path Harvard dual-stack RISC architecture. The core comprises a three-stage pipeline, with an integer ALU including a barrel shifter and a multiplication unit. Lightfoot has 128 fixed instructions and 128 reconfigurable, soft bytecodes.

Vulcan Machines' Moon2 [18] is a 32-bit processor core available as soft IP for FPGA or ASIC implementation. The Moon core features an ALU, a 256-element internal stack, an optional code cache, and a micro-program memory for the operation sequence required by each bytecode.

The Komodo micro-controller [13] includes a multithreaded Java processor core, which is a micro-programmed, four-stage pipeline. Its remarkable feature is the four-way instruction fetch unit, with independent program counters and flags, for hardware real-time scheduling of four threads.

FemtoJava is a research project focused on developing low-power Java processors for embedded applications. One of the versions features a five-stage pipelined stack machine, later extended to a VLIW machine [2], synthesized for an FPGA. Details about the FPGA make, the clock speed, and whether it actually ran on the FPGA are unclear.
3. THE DESIGN

Our Java processor architecture has its roots in the Java Optimized Processor (JOP, [15]), and exhibits many features that are commonly found in modern Java embedded processors. It is a six-stage pipelined, micro-programmed processor with a stack machine core, in order to follow the CISC features of a Java virtual machine. Simple bytecodes can be executed as single micro-instructions, while the more complex ones are implemented either as micro-programs or as Java methods. The target systems are embedded devices with limited memory and device size, possibly with real-time requirements. Therefore, BlueJEP does not offer a complete and general Java environment. For instance, class loading is carried out offline and an executable image (still composed of bytecodes) is generated pre-runtime. The strength of our approach comes from using the high-level features of Bluespec SystemVerilog, which makes the specification small, flexible, and maintainable. A more detailed description of the BlueJEP architecture follows.

The system architecture suited for our processor is a typical system-on-chip, implemented in our case on an FPGA. Depicted in Figure 1, such a system contains the processor, a RAM (storing the Java application and heap), a serial port (RS232), a timer, and some general purpose input/output (LEDs and switches), all connected through a system bus.

[Figure 1 shows the BlueJEP processor, a RAM holding the bj.hex image, a monitor & debug block, an RS232 port, a timer and GPIO, all attached to the On-chip Peripheral Bus (OPB).]

Figure 1: A typical architecture containing BlueJEP.

3.1 The Starting Point

There are a number of reasons that drove our design and development of BlueJEP. Initially, the intention was to re-implement the VHDL-written JOP ([15]) in the newer Bluespec SystemVerilog. The VHDL design is a four-stage pipelined processor. The first stage fetches bytecodes from the memory (method cache) and translates them to micro-program addresses. The second stage fetches micro-instructions. The third decodes and generates the necessary stack addresses, while the last executes and writes back the results. This last stage carries out all the accesses to the stack, all the spills/fills and the local variable reads/writes. To simplify and speed up the stack access, the topmost two values of the stack are mirrored by two dedicated registers.

When starting on our BSV design, we decided to go for an almost identical architecture to the VHDL solution. In order to use the tools already implemented for JOP (micro-assembler and executable image generator), we also wanted to have the same micro-instruction set and micro-code. Nevertheless, as we became more familiar with BSV, we decided that a longer pipeline would be more interesting, more flexible and modular, and hopefully faster. Furthermore, the micro-instruction set changed slightly, in order to adapt to the new architecture and support micro-instruction folding, although from the programmer's point of view this is rather transparent. Additionally, we went for an interface more suitable for integration in the Xilinx Embedded Development Kit, namely the On-chip Peripheral Bus (OPB).

As in the case of JOP, the intention was to obtain an architecture suitable for real-time systems, without complicated timing behavior, where the worst case behavior is both easy to estimate and not much worse than the average case. Furthermore, targeting embedded systems would also mean avoiding power hungry techniques. On the other hand, we also wanted to get as high a performance as possible out of our BlueJEP, and allow enough flexibility for testing a number of techniques that would require adding or altering registers, functional units, caches, etc.

3.2 BlueJEP Pipeline

The processor went through at least three different versions until crystallizing into the six-stage pipeline depicted in Figure 2. Earlier versions would stall the pipeline any time a data or a control hazard occurred, which meant more complex control. The current version only stalls on data hazards, and uses speculative execution on branches, which means simpler control and higher performance, but wider pipeline registers.

The speculative execution used by BlueJEP is rather simple, since (micro-code) branches are assumed as always not taken. Normally, the pipeline is kept full with bytecodes and micro-instructions that follow sequentially, regardless of micro-jumps, since the effect is only felt in the last stage. Whenever an unexpected deviation of control occurs, the pipeline is flushed and the execution resumes using the context¹ associated with the instruction that caused the branch. This requires sending the context along through the pipeline, calling for wider pipeline registers.

The stages see the pipeline registers as searchable FIFOs (except bcfifo and decfifo, which are regular FIFOs), in order to check for stall conditions – which usually means searching for instructions with a certain destination. Implicit conditions for stalling a stage are given by attempting to enqueue into a full FIFO or dequeue from an empty FIFO.

¹Java program counter jpc, micro-program counter pc, and stack pointer sp.

[Figure 2 shows the six pipeline stages – Fetch Bytecode, Fetch micro-I, Decode, Fetch Stack & Register, Execute and Write-back – linked through the bcfifo, decfifo, fsfifo, exfifo and wbfifo pipeline registers, with forward, bypass and rollback paths. Also shown are the bytecode cache (JPC, load cache), the micro-ROM, the BC2microA and jump tables, the registers (PC, OPD, SP, VP, CacheCtl, MD, MwA, MrA), the stack with constants, and the bus interface (OPB).]

Figure 2: The BlueJEP pipeline
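To make the stalling mechanism concrete, the following BSV fragment is a minimal sketch of how a stage can use such a searchable pipeline register. The SearchFIFO interface, the Entry type and the writesTo test are our own illustrative assumptions, not the actual BlueJEP code: the explicit rule guard realizes the data-hazard stall, while the implicit conditions of first/deq/enq provide the empty/full stalls mentioned above.

   typedef Bit#(8) Ref;               // a source/destination reference (assumed width)

   typedef struct {
      Ref src;                        // location this entry will read
      Ref dst;                        // location this entry will write
   } Entry deriving (Bits, Eq);

   // Assumed interface: a FIFO that can also be searched for a pending destination.
   interface SearchFIFO;
      method Action enq(Entry e);
      method Entry  first;
      method Action deq;
      method Bool   writesTo(Ref r);  // does any queued entry have dst == r?
   endinterface

   // Skeleton of one pipeline stage: the guard stalls the stage while an
   // in-flight entry downstream still writes our source (RAW hazard); the
   // implicit conditions of first/deq/enq give the empty/full stalls for free.
   module mkStageSketch#(SearchFIFO inQ, SearchFIFO outQ)(Empty);
      rule run (!outQ.writesTo(inQ.first.src));
         let e = inQ.first;
         inQ.deq;
         outQ.enq(e);                 // a real stage would also transform e here
      endrule
   endmodule

In the real pipeline each stage additionally transforms the entry (decoding it, fetching operands, and so on) before passing it downstream, and may search more than one downstream FIFO.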

3.2.1 Stage 1: Fetch Bytecode

The Fetch Bytecode stage fetches bytes from the bytecode cache and feeds them to the next stage, along with their translation into micro-addresses (using the BC2microA table) and their associated jpc. The bytecode cache fetches words from the external memory through the bus interface and manages the jpc. The BCCache interface is generic enough to accommodate anything from simple one-word caches to complex multi-method caches. The experimental results reported later use the standard configuration with a 1KB, single-method cache and concurrent load. This stage only stalls if the bcfifo is full.

3.2.2 Stage 2: Fetch micro-I

The Fetch micro-I stage keeps track of the micro-program counter (pc), fetching new micro-instructions from the micro-ROM, and feeds them to the next stage along with their associated pc. Whenever the micro-code needs to fetch a Java operand, it dequeues a byte from the bcfifo and stores it in an operand register, opd. Whenever the micro-code for the current bytecode is completed, and the next must be executed, it dequeues a micro-address from the bcfifo and updates the pc accordingly. This stage, too, stalls only if its input FIFO is empty or the decfifo is full.

3.2.3 Stage 3: Decode

The Decode stage dequeues and decodes the next micro-instruction from the decfifo. The decoded micro-instructions, enqueued into fsfifo, are either data moving instructions (one source and one destination) or operations (two sources, one operation, one destination). References (sources and destination) may be immediate values, registers, or stack addresses referred by values or registers. Necessary register values are fetched in this stage, while stack locations are fetched in the next. Along with references and operations, the context information received from the decfifo is passed on, augmented with the current value of sp.

To give a few examples, micro-code jumps are decoded into operations with the destination being the register pc. The sources are the pc and an immediate operand obtained by translating a jump index into an offset through a jump table. Stack operations, such as Add for instance, require the top-of-stack address (given by the stack pointer register sp) in order to figure out the absolute addresses of the two top values. The destination is also a stack address, given by sp-1. All these addresses are passed on to the next stage, which will fetch their contents. The same goes for operations accessing local stack variables through the variable pointer register vp. Furthermore, the sp must be updated accordingly, depending on the kind of stack operation (Pop or Push).

The Decode stage stalls if the contents of a register (sp, vp, ...) is required, but about to be changed by an instruction present in the later stages (RAW hazard).

3.2.4 Stage 4: Fetch Stack

The decoded instruction contains immediate values or the stack addresses of the operands that will be used either in the Execute or in the Write-back stage. Stack contents are fetched here, unless the reference is about to be modified by an instruction present in the following stages, in which case this stage stalls. Operations and fetched values are passed on to the Execute stage, while data moving instructions bypass Execute, if it is idle, and go directly into the wbfifo. In order to accelerate the pipeline execution, if the required references are about to be modified by the Write-back stage, the operands are forwarded directly from the wbfifo.

3.2.5 Stage 5: Execute

The Execute stage dequeues two values and an operation identifier from the exfifo and executes the operation to obtain a result. Conditional branches are partially handled here, as the operation is simply discarded if the condition is false, or passed on to the next stage if the condition is true. Thus, along with the context received from the exfifo, a destination and a single value are enqueued further into the wbfifo.
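The decoded form handed from Decode to the later stages can be pictured with the BSV types below. This is only a sketch under our own naming and width assumptions, not the actual BlueJEP type definitions: a reference is an immediate, a register or a stack address, and a decoded micro-instruction is either a data move or a two-source operation.

   typedef Bit#(32) Word;
   typedef Bit#(8)  StackAddr;   // assumed widths
   typedef Bit#(4)  OpCode;      // opaque operation identifier

   typedef enum { PC, OPD, SP, VP, MD, MWA, MRA, CACHECTL } RegName
      deriving (Bits, Eq);

   // A reference is an immediate value, a register, or a stack location.
   typedef union tagged {
      Word      Imm;
      RegName   Reg;
      StackAddr Stack;
   } Reference deriving (Bits, Eq);

   // Decode emits either a data move (one source, one destination) or an
   // operation (two sources, one operation, one destination).
   typedef union tagged {
      struct { Reference src;  Reference dst; }                            Move;
      struct { Reference srcA; Reference srcB; OpCode op; Reference dst; } Oper;
   } Decoded deriving (Bits, Eq);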
3.2.6 Stage 6: Write-back

Finally, the values dequeued from the wbfifo are written back to the right destination (register or stack address) in the Write-back stage. Not all registers are available for both read and write. For example, opd can only be read, while mwa and mra may only be written. However, the access control is managed within the Registers module. Furthermore, this is the stage that may issue a pipeline flush and a context rollback, in the case when pc and jpc are explicitly changed. In addition, this stage also controls the bytecode cache, by issuing a cache refill when explicitly requested by the micro-code (through the CacheCtl register). Just as in JOP, having explicit cache refills on method invokes and returns makes the timing analysis easier. This, in turn, contributes to the predictability of the processor, required in real-time embedded systems.

3.3 Register Mapped External Memory

Although there are no specific bytecodes for accessing external memory directly (due to security reasons), such access is essential for the JVM internals, peripheral access, memory management, etc. In BlueJEP the access to the external memory is carried out through a set of three registers: mwa, mra, and md. To read a memory location, a Stmra micro-instruction must be issued, which writes the top of stack into the mra register and starts an external read access through the bus interface. The bus interface will then update the md register with the data arriving on the bus, and the micro-code can issue a Ldmd to push this data (or the register md, if the bus access has completed already) onto the stack. On a write, the md contents, popped from the stack via Stmd, are written back to the memory by issuing a Stmwa instruction, which sets the mwa address register and starts a bus write access.
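On the hardware side, this register-mapped scheme can be sketched in BSV as below. The OpbMaster interface is an invented stand-in for the real bus logic, and the module is a simplified illustration of the mra/mwa/md behaviour described above, not the actual implementation; note how the guard on ldmd gives the blocking behaviour for a read that has not completed yet.

   typedef Bit#(32) Word;

   // Assumed stand-in for the OPB master port; the real bus logic differs.
   interface OpbMaster;
      method Action readReq(Word addr);
      method ActionValue#(Word) readResp;
      method Action writeReq(Word addr, Word data);
   endinterface

   // The external memory access registers as seen by the micro-code.
   interface MemRegs;
      method Action stmra(Word addr);   // Stmra: set read address, start a bus read
      method Action stmwa(Word addr);   // Stmwa: set write address, write md to memory
      method Action stmd(Word data);    // Stmd: latch the data popped from the stack
      method Word   ldmd();             // Ldmd: the read data, once it has arrived
   endinterface

   module mkMemRegsSketch#(OpbMaster bus)(MemRegs);
      Reg#(Word) mra <- mkRegU;
      Reg#(Word) mwa <- mkRegU;
      Reg#(Word) md  <- mkRegU;
      Reg#(Bool) rdPending <- mkReg(False);

      // the bus interface updates md when the requested word arrives
      rule finishRead (rdPending);
         let d <- bus.readResp;
         md <= d;
         rdPending <= False;
      endrule

      method Action stmra(Word addr);
         mra <= addr;
         bus.readReq(addr);
         rdPending <= True;
      endmethod
      method Action stmd(Word data);
         md <= data;
      endmethod
      method Action stmwa(Word addr);
         mwa <= addr;
         bus.writeReq(addr, md);
      endmethod
      method Word ldmd() if (!rdPending);   // blocks until the access has completed
         return md;
      endmethod
   endmodule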
3.4 Micro-Code Aspects

Depending on the micro-code, that is, on what sequence of micro-instructions needs to be executed for each bytecode, there are four tables/ROMs that need to be generated.

• The micro-ROM contains the micro-code for all the hardware-implemented bytecodes.

• The BC2microA table maps bytecodes to micro-code addresses.

• The jump table translates each of the available indexes (up to 32) into an address offset used in the micro-code jumps for updating the pc.

• The initial stack contains the constants used in the micro-code.

Usually (e.g. in the VHDL-specified JOP), a single executable, a micro-assembler, needs to generate all these data. Whenever the encoding of the micro-instruction set changes, the micro-assembler (along with the description of the decode stage) must be updated as well.

Our BSV solution (Figure 3) uses a generator, bluejasm, which translates the assembler code into an intermediate BSV file, generator.bsv. This file, along with the micro-instruction set definition from types.bsv, is compiled into a stand-alone simulator by the BSV compiler. Finally, this executable (genrom) outputs the .hex memory image files for the aforementioned tables. The advantage is that if the micro-instruction set encoding changes, bluejasm does not require any updates, since the contents of the generator.bsv file are independent of the encoding. Thus, all of the new .hex files can be obtained automatically.

[Figure 3 shows the tool flow: the .asm micro-code is translated by bluejasm into generator.bsv, which together with types.bsv is compiled by the BSV compiler (-sim) into the genrom executable; genrom then emits the micro-ROM, the jump table, the BC2microA table and the initial stack images.]

Figure 3: Micro-ROM generation and related tool flow. Grey boxes are generated files or standard tools. Colored boxes complete Figure 2.
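The generated images are ordinary hex files, so they can be brought into a BSV design with the standard RegFile library. The sketch below shows the idea for the micro-ROM and the BC2microA table; the file names, widths and the Tables interface are our assumptions for illustration, not the BlueJEP sources.

   import RegFile::*;

   typedef Bit#(10) MicroAddr;   // assumed micro-address width
   typedef Bit#(8)  Bytecode;
   typedef Bit#(16) MicroWord;   // assumed encoded micro-instruction width

   interface Tables;
      method MicroWord microInstr(MicroAddr a);
      method MicroAddr microEntry(Bytecode bc);
   endinterface

   // Sketch only: load the genrom-produced images into lookup tables.
   module mkTables(Tables);
      RegFile#(MicroAddr, MicroWord) uRom <- mkRegFileLoad("microrom.hex", 0, 1023);
      RegFile#(Bytecode, MicroAddr)  bc2a <- mkRegFileLoad("bc2microa.hex", 0, 255);

      method MicroWord microInstr(MicroAddr a);
         return uRom.sub(a);
      endmethod
      method MicroAddr microEntry(Bytecode bc);
         return bc2a.sub(bc);
      endmethod
   endmodule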

3.5 Run-time Aspects

As with most embedded Java systems, the runtime class library for BlueJEP is smaller than in a typical desktop environment. The processor, although truly executing bytecodes, uses a specific memory image which has been obtained offline, through a custom class loader and linker, BlueJim. The tool flow from a .java file to an executable BlueJEP image is depicted in Figure 4.

[Figure 4 shows the tool flow: the user application (UserApp.java) and the BlueJEP run-time library sources (Object.java, Native.java, JVM.java, String.java), compiled with javac and packaged as bjrt.jar, are processed by BlueJim into the bj.hex image, which is loaded to RAM.]

Figure 4: Obtaining an executable image for the BlueJEP processor.

The BlueJim class loader and linker is an in-house application written in Java, and it uses the Byte Code Engineering Library (BCEL) to parse .class files and generate a proper BlueJEP executable. The main task of BlueJim is to transform all generic references into memory addresses associated with the classes and constants used in the application. Furthermore, all the unused methods and classes are discarded and the virtual tables are re-organized if necessary, in order to minimize the image size. This step is carried out by marking all the classes reachable from the main method of the user application, plus a few other essential runtime classes. Finally, the image generator also translates native calls (invokestatics of methods from the Native class) into custom bytecodes, which are not used in the standard JVMs. These custom bytecodes correspond to BlueJEP-specific operations, such as direct memory access, internal register and stack access, and possibly memory management functions.

4. IMPLEMENTATION & EXPERIMENTS

The BlueJEP processor specification has been entirely written in Bluespec SystemVerilog [3], along with an OPB bus controller, main memory and a few dummy peripherals, in order to obtain the system depicted in Figure 1. Using the BSV compiler version 2006.11, this specification was used to generate, at first, a standalone simulator (host executable, clock cycle accurate) for fast test and debug. Verilog code was then produced just for the processor, which was simulated and further synthesized to a net-list using the Xilinx ISE 9.1i tool chain [21]. Along with the required wrappers and configuration files, this net-list was then used within the Xilinx EDK 9.1, together with already available Xilinx IPs (bus, UART controller, timer, memory, debug cores, and other general purpose IO), to generate the final system. The design was synthesized, mapped, placed and routed on a Xilinx Virtex-II (XC2V1000, fg456-4) FPGA. At this point we employed Xilinx ChipScope [20] to tap various signals, including the bus, and compare the values obtained from the real hardware against the values obtained using the Bluespec standalone simulation of our design. This step was used to detect and fix bugs or design misses, and confirmed that the implementation behaves according to the BSV specification.

Note that the design was tested at four different levels (BSV simulator, behavioral Verilog simulation, post place-and-route simulation, and in FPGA) during the design process, each being useful in revealing a number of issues. For example, at some point, behavioral simulation and post-PAR simulation were not producing identical results. This occurred due to memory initialization in a Bluespec library file, marked as simulation only, which was easily fixed but hard to detect. Another issue was an ISE synthesis discrepancy between block RAM and distributed RAM inference. For minimal area, we allowed ISE to infer RAM as block or distributed, but, as it turned out, the design functioned properly only when we restricted it to distributed RAM, which is one of the reasons the used device area is so large.

4.1 Device Area

The device area used by the BlueJEP processor on the aforementioned FPGA is given in Table 1. Compared to a JOP version with the same bus interface, cache configuration and synthesis options, BlueJEP takes double the area. However, a larger area is expected, given the longer pipeline and the higher abstraction level of a BSV specification. Note that, taking only the logic into consideration, the 2422 Xilinx 4LUTs of BlueJEP are only slightly more than the 1830 Altera LCs of JOP reported in [14].

Table 1: BlueJEP device area on a XC2V1000, fg456-4, optimized for speed, distributed RAM.

  Resources    Taken   Available   Percentage
  Slices        3460        5120          68%
  Flip-Flops     756       10240           7%
  4LUTs         6858       10240          66%
                (2422 used as logic, 4436 used as RAM)
4.2 Clock Speed

We synthesized BlueJEP for a number of Xilinx devices, and the maximum clock speeds are reported in Table 2, compared to equivalent versions of JOP. All the BlueJEP figures and the first JOP column were obtained with Xilinx ISE 9.1, optimized for speed, using only distributed RAM. The JOP version in this first column has an identical bus (OPB) and cache configuration to the BlueJEP processor. The figures in the last two columns for JOP were taken from [12]. Only the Virtex-II versions of BlueJEP and JOP were actually run on our evaluation board.

Since the Virtex-II versions are the most similar, synthesized with the same parameters, we take a closer look at these results. Having a six-stage pipeline, BlueJEP is expected to be able to work at a higher clock speed than the four-stage pipeline of JOP. Theoretically, if the stages are perfectly balanced, to achieve the same work in six clock cycles instead of four, one should be able to increase the clock frequency by 50%. This would put an upper bound on the six-stage pipeline at 90 MHz, which sets the clock frequency of the BlueJEP solution within acceptable values, rather close to the bound. In fact, since throughput is of more importance than latency, the six-stage design ends up being about 42% faster than the four-stage solution, for the tested synthesis parameters.

Table 2: Maximum clock speeds for BlueJEP and JOP for various Xilinx devices.

                    Virtex-II   Spartan3   Virtex5
                    XC2V-4      XS3-5      XC5VLX30-3
  JOP               60 MHz      66 MHz     200 MHz
  BlueJEP           85 MHz      76 MHz     221 MHz
  clock factor
  (fBlueJEP/fJOP)   1.42        1.15       1.10

4.3 Bytecode Execution Speed

Clock speed is not, however, the single best measure of performance, especially for Java processors. One needs to consider the execution speed of bytecodes in order to give a more accurate figure of performance. Having a micro-instruction set, micro-code and executable format similar to JOP, the clock cycles required by BlueJEP to execute bytecodes are in the range of those of JOP (see Table 3). Note that the figures given in the table were obtained for an external memory access delay of two clock cycles, for a fair comparison to the data in [14], when in reality the minimal OPB cycle time with arbitration is three clock cycles. The figure in the last column is the relative speed-up, rs = (fBlueJEP/fJOP) * (ccJOP/ccBlueJEP), for BlueJEP compared to JOP, for a clock factor of 1.42 (Table 2).

Table 3: Execution time in clock cycles for BlueJEP and JOP as given in [14].

  Bytecode(s)          JOP cc   BlueJEP cc   rs (1.42)
  iload iadd                2            3        0.95
  iinc                     11           13        1.20
  ldc                       9           12        1.06
  if_icmplt taken           6           23        0.37
  if_icmplt n/taken         6            8        1.06
  getfield                 23           38        0.86
  getstatic                15           18        1.18
  iaload                   29           45        0.92
  invoke                  126          166        1.08
  invoke static           100          111        1.28
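As a concrete reading of the last column, the speed-up formula can be checked against the iinc row, using only the numbers already given in Tables 2 and 3:

  rs = (fBlueJEP/fJOP) * (ccJOP/ccBlueJEP) = 1.42 * (11/13) ≈ 1.20

so iinc, despite taking two extra clock cycles on BlueJEP, still runs about 20% faster on the Virtex-II implementation once the higher clock frequency is taken into account.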
The longer execution times on BlueJEP come, in this case, from pipeline stalls. For example, in the case of iinc the extra two clock cycles are due to a stall caused by altering the vp register combined with reading the stack location pointed to by vp. Forwarding registers from the Write-back to the Decode stage would reduce such stalls. In the case of taken if_cmp* bytecodes, the longer delay comes from a number of differences, such as flushing a longer pipeline, a different way of handling micro-code branches, and a different micro-program for those bytecodes. Nevertheless, when one takes into account the faster clock of BlueJEP, many bytecodes are actually executed faster than on JOP.

5. DISCUSSION

This section gathers a number of BlueJEP-related issues that we considered interesting, such as the pros and cons of using Bluespec SystemVerilog, the rationale behind some of the design choices, and finally possible and planned extensions to our processor.

5.1 BSV Specifics

One fact worth noticing is that BlueJEP, along with its test system, was completely written from scratch in Bluespec SystemVerilog. We had very limited knowledge of and experience with BSV before embarking on the work described in this paper, and learning BSV was also one of our goals. Given that BSV is a high-level language, we found a number of advantages and drawbacks in using it for designing our Java processor, as detailed in the following.

5.1.1 Coding

One of the advantages of using a high-level specification language is the reduced design time, reflected also in the number of code lines. Comparing our BlueJEP specification to a similarly configured version of JOP (in VHDL) revealed that the BSV design has about one third fewer lines than the VHDL solution. After compiling the BSV design to Verilog, however, the number of Verilog lines is larger than in the VHDL solution. This is explained on one hand by the longer pipeline of BlueJEP and on the other by the fact that compilers often produce more code than hand-written solutions. Nevertheless, the Verilog output is not meant to be read or modified directly by the designers.

Smaller code usually translates into increased maintainability and flexibility. Indeed, from our experience, modifying the BlueJEP design is rather easy, because it is written in BSV and also due to the modularity and simplicity of the pipeline stages. As an example, when adding new micro-instructions, the encoding does not need to be specified explicitly and the micro-ROM generator does not require any modifications (see Figure 3). What has to be done is only to include the new micro-instruction mnemonic in the existing micro-instruction type and to alter the decode stage to handle the new instruction. The execute stage might also need some changes in case a new and exotic operation must be carried out. Finally, new registers might need to be added. Nevertheless, all these are simple, well-localized changes.

5.1.2 Test & Debug

Testing a BSV design for correct functionality is facilitated by a number of features and module libraries. The majority of design misses and implementation bugs can be detected very early in the design by writing test modules for each BSV module. This is simplified by using the readily available StmtFSM library from the BSV standard distribution, for the automatic generation of FSMs from sequences of method calls and other operations. Modules with generic types were tested with simple types at first. Debug messages and the possibility to define string conversion functions for each type also helped. The possibility to generate a standalone executable simulator on the host platform, before moving to hardware simulators, was very useful. To summarize, the whole testing process was more software-like, instead of inspecting simulation waveforms on the screen.

Nevertheless, behavioral simulation of the Verilog output from the BSV compiler, as well as lower level hardware simulations, were essential in detecting other problems, such as issues with module boundary signals and BSV library bugs. Finally, being able to monitor signals at runtime, on the FPGA, using ChipScope [20], was the ultimate method to validate the implementation.
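As an illustration of this software-like style, the fragment below is a minimal sketch of a per-module test written as a linear sequence of method calls, which the StmtFSM library turns into an FSM that runs once after reset. It is not one of the actual BlueJEP testbenches; the device under test (here just a library FIFO) and the values are placeholders.

   import StmtFSM::*;
   import FIFO::*;

   module mkFifoTb(Empty);
      FIFO#(Bit#(8)) dut <- mkFIFO;          // stand-in for the module under test

      Stmt test =
         seq
            dut.enq(8'h2A);                  // drive a value in
            action                           // observe what comes out
               $display("got %h", dut.first);
               dut.deq;
            endaction
         endseq;

      mkAutoFSM(test);   // starts after reset, ends simulation when done
   endmodule

A real test would replace the FIFO with the module under test and check the results instead of merely printing them.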
5.1.3 Results

Synthesis results, the device area in particular, appear to be a step back from smaller, hand-written VHDL designs such as JOP. The reasons behind these results follow.

As with any higher level of abstraction language, there is a price to pay for specifying designs in BSV. Design time is drastically reduced (two to three times shorter), but the final implementation is harder to control at a lower level. For example, BSV register files (RegFile), used to model memories, have five read ports and one write port, and end up in the Verilog description as five single-port memories. In particular, due to some quirk in the ISE synthesis tools, we had to tag all the memories in the design as distributed memories (although block memories are available), leading to a sub-optimal device area utilization. It is the job of the synthesis tools to identify and optimize such constructs, while preserving correctness. Thus, the tasks usually taken up by the designer in order to optimize the implementation now rely heavily on the compilers and synthesis tools. BSV allows some control by providing a way to import Verilog designs into the BSV code, which, however, cannot be included in the standalone simulation.

Optimizing the timing of the design is another issue that becomes harder with higher abstraction level specifications. First, it is not easy to identify the critical paths reported back by the synthesis tools, paths which now include signals that were not written by the designer. Second, a slight change in the BSV specification might have a huge and distributed impact on the Verilog output.

To conclude, BSV is excellent for rapid prototyping and architectural exploration, but once the decisions are taken and a final implementation must be employed, it might be necessary to rewrite and refine certain modules at register-transfer level in Verilog or VHDL.

5.2 Design Choices

A number of design choices differentiate our solution from the initial JOP architecture, as we detail in the following.

5.2.1 Six Stages Pipeline

The decision to implement a six- instead of a four-stage pipeline was taken in order to have simpler stages (and therefore a faster clock) and a more flexible pipeline. For example, the Execute stage performs operations on two immediate values, as opposed to just on tos and tos-1 as in JOP. These values are fetched or produced by either the FetchStack or the Decode stage. This opens the possibility of operating on any two locations from the stack, which is essential for folding micro-instructions. Since not every micro-instruction requires the Execute stage, we also added the bypass directly to the Write-back stage, in order to speed up the execution. Furthermore, feeding back values to FetchStack directly from the wbfifo, before they are written back to the stack, was another way to gain clock cycles. Normally, all these increase the complexity of the control, have an impact on the area, and may increase the critical path. As usual, the solution ends up being a trade-off between area, clock speed and pipeline latency. Exploring various architectural choices proved to be rather easy using the BSV design flow. For example, we decided not to forward registers from the wbfifo to the Decode stage, as this would shorten the execution by only one or two clock cycles for rather long code sequences, at the expense of increased control and clock period.

5.2.2 Speculative Execution

Another choice was whether to stall or to use speculative execution in case of micro-code branches. Early versions used to stall the Fetch micro-I stage whenever an operation modifying the micro-pc was present in the later stages. Current versions use speculative execution in the sense that micro-instructions are always fetched sequentially, but when the micro-pc is altered by a branch (instead of being incremented), the pipeline is flushed, the context is restored using the branch context (sp, jpc) and execution resumes. In the case of stalling, more control is needed to search through the FIFOs and identify instructions that may alter the micro-code control flow. In the case of speculative execution, not taking branches is always faster than stalling, but having to flush and refill the pipeline makes taking branches slower. Having to pass the context along makes the pipeline registers wider, but there is no need to search through the FIFOs.

5.2.3 Bus Interface

Choosing the On-chip Peripheral Bus interface was motivated by the good OPB support provided by the Xilinx EDK, including a library of tested and optimized peripherals and debug IPs working on OPB. It would have been possible, of course, to avoid using the EDK, implement the whole system (Figure 1) in BSV, and go directly to Xilinx ISE for synthesis. In fact, for testing and debug purposes we did specify an entire system in BSV, but used very simple dummy peripherals, good for simulation only.

5.2.4 Micro-Instruction Set

The micro-instruction set encoding went through some changes as well. The micro-instruction type was initially defined as a union, with one element for each specific micro-instruction, some having specific operands. This structure was rather flat, but made the micro-code easy to read and to translate into BSV constructs. Later on we realized that classifying instructions according to their effect on the stack is better for identifying folding patterns. Therefore we changed the type to a union of four classes, reflecting stack operations, namely producers, consumers, operations, and specials. Specials include branches and nop. Although this made the micro-code less legible, using the assembler (Figure 3) meant that we could keep the programmer's view of the micro-instruction set unchanged, and only adjust the back-end.
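The reorganized type can be sketched along these lines; this is our own condensed rendering for illustration, not the contents of types.bsv, and the mnemonics listed are only examples (Ldmd, Stmra, Stmwa and Stmd appear in Section 3.3, the rest are assumed):

   typedef Bit#(5) JumpIndex;   // up to 32 jump-table entries

   // Illustrative mnemonics only; the real lists are longer.
   typedef enum { LdOpd, LdConst, LdVar, Ldmd } Producer  deriving (Bits, Eq);
   typedef enum { StVar, Stmra, Stmwa, Stmd  } Consumer  deriving (Bits, Eq);
   typedef enum { Add, Sub, Shl, Ushr        } Operation deriving (Bits, Eq);

   typedef union tagged {
      Producer  Prod;     // pushes one value onto the stack
      Consumer  Cons;     // pops one value from the stack
      Operation Op;       // consumes two values, produces one
      JumpIndex Branch;   // micro-code jump, resolved through the jump table
      void      Nop;
   } MicroInstr deriving (Bits, Eq);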
5.3 Extensions

One of the reasons behind developing BlueJEP was to have a flexible architecture suitable for testing and evaluating a few Java-specific techniques, such as bytecode folding, support for hardware-assisted memory management, various method caching strategies and even compiler optimizations. Two of these are briefly mentioned in the following.

5.3.1 Micro-instruction Folding

Bytecode folding is a method of compacting stack-machine-specific code, in order to reduce stack accesses. For example, adding two local variables into a third requires four bytecodes: first loading the two operands onto the top of the stack, then carrying out the addition and storing the result back on the stack, and finally writing back the top of the stack into the destination variable. A three-address machine is able to do this in one instruction. Folding is commonly used in JIT compilers and employed in hardware by a few native Java processors (e.g. [7, 16, 17]).

The folding extension for BlueJEP, currently under evaluation, is slightly different in the sense that it occurs at the micro-instruction rather than the bytecode level. Each micro-instruction is classified according to its effect on the stack, as a producer (p), consumer (c), operation (o) or special (branches and the nop). In order to implement folding, only the first three stages need to be modified. In particular, the Decode stage is the one identifying the patterns, consisting of up to four micro-instructions (ppoc being the only pattern of length four). This means that the Fetch micro-I stage must be able to provide up to four consecutive micro-instructions in the same clock cycle. Furthermore, these four micro-instructions could be spread over several bytecodes, which means that the Fetch Bytecode stage must be able to provide four bytecodes at once. Naturally, this complicates the architecture, increasing the device area and possibly the critical path. Although in theory there is a considerable speed-up that can be achieved by folding patterns of up to four micro-instructions², pipeline stalls or flushes and the negative impact on the clock speed of such a complex architecture appear to render these gains untenable. Extensions for shorter and fewer, selected patterns are currently under investigation, as a compromise between the folding ratio and the clock speed degradation. We are also examining the relation between folding and the current architectural choices, such as bypassing and operand forwarding.

²Folding gives a speed-up of 1.6, assuming micro-instructions are available and can be executed every clock cycle.
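For the local-variable addition above, the four micro-instructions form exactly this ppoc window (load, load, add, store). A sketch of the pattern check, with a StackEffect classification that we assume can be derived from the micro-instruction type, could look as follows in BSV:

   import Vector::*;

   // producer, consumer, operation, special
   typedef enum { P, C, O, S } StackEffect deriving (Bits, Eq);

   // True when a window of four consecutive micro-instructions matches the
   // only length-four pattern, ppoc (e.g. load, load, add, store).
   function Bool isPPOC(Vector#(4, StackEffect) w);
      return (w[0] == P) && (w[1] == P) && (w[2] == O) && (w[3] == C);
   endfunction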
5.3.2 Memory Management Support

Based on the work described in [9], we designed a memory management unit (MMU), also written in BSV. Complex bytecodes such as new, newarray and anewarray, as well as the garbage collection cycle, initially implemented in micro-code or Java functions, are currently available through the MMU. The interface between the MMU and BlueJEP was implemented through a few data registers and one control register, accessed via specific micro-instructions. The communication is carried out asynchronously. Parameters are sent by writing the MMU data registers, and an operation is initiated by writing to the MMU control register. Results can be read back via blocking reads from the MMU data registers. The required modifications to the BlueJEP architecture are minimal, confined to adding a few new registers and providing read and write operations for these, mapped to MMU operations. Currently only stop-the-world, mark-compact garbage collection is implemented, but a concurrent version, suitable for time-triggered garbage collection [8], is under development. For more details please refer to [10].
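The register-based, asynchronous hand-shake can be pictured with the BSV sketch below. The interface, the command encoding and the single data register are our own simplifications for illustration, not the actual MMU interface; the method guards give the blocking behaviour, so the micro-code simply stalls on a read until the MMU has finished.

   typedef Bit#(32) Word;

   // Assumed command encoding, for illustration only.
   typedef enum { AllocObject, AllocArray, StartGC } MmuCmd deriving (Bits, Eq);

   interface MmuRegs;
      method Action writeData(Word x);     // parameters, via a data register
      method Action writeCtrl(MmuCmd c);   // starts an MMU operation
      method Word   readData();            // blocking read of the result
   endinterface

   module mkMmuRegsSketch(MmuRegs);
      Reg#(Word)   dataReg <- mkRegU;
      Reg#(MmuCmd) cmdReg  <- mkRegU;
      Reg#(Bool)   busy    <- mkReg(False);

      // Stand-in for the real MMU: the actual unit runs for many cycles,
      // dispatches on cmdReg, and deposits its result in the data register.
      rule finishOp (busy);
         busy <= False;
      endrule

      method Action writeData(Word x) if (!busy);
         dataReg <= x;
      endmethod
      method Action writeCtrl(MmuCmd c) if (!busy);
         cmdReg <= c;
         busy <= True;                     // operation runs asynchronously
      endmethod
      method Word readData() if (!busy);   // stalls the caller while busy
         return dataReg;
      endmethod
   endmodule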
6. CONCLUSION

In this paper we introduced BlueJEP, a native Java processor specified in Bluespec SystemVerilog, suitable for embedded systems. Based on an existing Java processor, JOP, our design features a flexible six-stage pipeline, speculative execution, operand forwarding and stage bypassing. Its flexibility is partly due to using a higher abstraction level language, BSV, and partly due to the processor architecture. The downside of our approach is the increase in the device area, an expected consequence of raising the abstraction level of the specification and of the longer pipeline. Performance-wise, BlueJEP is comparable to JOP, having a faster maximum clock speed, but requiring more clock cycles for the same operations. No special efforts were put into optimizing the design for area or speed, however.

Currently the BlueJEP processor is used as an evaluation platform for hardware solutions to several Java-specific techniques, such as bytecode folding, garbage collection, and efficient method caches.

Acknowledgments

We would like to thank Martin Schoeberl for the useful discussions, suggestions and JOP-related data, which helped improve this paper considerably. We would also like to thank Bluespec for tools and support.

7. REFERENCES

[1] aJile Systems. http://www.ajile.com.
[2] A. C. S. Beck and L. Carro. A VLIW low power Java processor for embedded applications. In SBCCI '04: Proceedings of the 17th Symposium on Integrated Circuits and System Design, pages 157–162, New York, NY, USA, 2004. ACM Press.
[3] Bluespec, Inc. http://www.bluespec.com, 2007.
[4] N. Dave. Designing a processor in Bluespec. Master's thesis, MIT, Cambridge, MA, January 2005.
[5] N. Dave, M. Pellauer, S. Gerding, and Arvind. 802.11a transmitter: A case study in microarchitectural exploration. In International Conference on Formal Methods and Models for Codesign (MEMOCODE'06), pages 59–68, July 2006.
[6] Digital Communication Technologies. Lightfoot 32-bit Java processor core. Data sheet, September 2001.
[7] M. W. El-Kharashi, F. Gebali, K. F. Li, and F. Zhang. The JAFARDD processor: a Java architecture based on a folding algorithm, with reservation stations, dynamic translation, and dual processing. IEEE Transactions on Consumer Electronics, 48(4):1004–1015, November 2002.
[8] S. Gestegard-Robertz and R. Henriksson. Time-triggered garbage collection. In Proceedings of the ACM SIGPLAN Languages, Compilers, and Tools for Embedded Systems, June 2003.
[9] F. Gruian and Z. Salcic. Designing a concurrent hardware garbage collector for small embedded systems. In Asia-Pacific Computer Systems Architecture Conference, pages 281–294, 2005.
[10] F. Gruian and M. Westmijze. BluEJAMM: A Bluespec embedded Java architecture with memory management. In SYNASC'07 Real-Time and Embedded Systems Workshop, September 2007. To be presented.
[11] J. C. Hoe and Arvind. Hardware synthesis from term rewriting systems. In VLSI '99: Proceedings of the IFIP TC10/WG10.5 Tenth International Conference on Very Large Scale Integration, pages 595–619, Deventer, The Netherlands, 2000. Kluwer, B.V.
[12] JopWiki. http://www.jopwiki.com/.
[13] J. Kreuzinger, U. Brinkschulte, M. Pfeffer, S. Uhrig, and T. Ungerer. Real-time event-handling and scheduling on a multithreaded Java microcontroller. Microprocessors and Microsystems, 27(1):19–31, February 2003.
[14] M. Schoeberl. Evaluation of a Java processor. In Tagungsband Austrochip 2005, pages 127–134, Vienna, Austria, October 2005.
[15] M. Schoeberl. JOP: A Java Optimized Processor for Embedded Real-Time Systems. PhD thesis, Vienna University of Technology, January 2005.
[16] Sun Microsystems. PicoJava-II microarchitecture guide. Technical Report 960-1160-11, Sun Microsystems, 1999.
[17] L.-R. Ton, L.-C. Chang, J.-J. Shann, and C.-P. Chung. A software/hardware cooperated stack operations folding model for Java processors. Journal of Systems and Software, 72(3):377–387, 2004.
[18] Vulcan Machines Ltd. http://www.vulcanmachines.co.uk/, August 2007.
[19] R. E. Wunderlich and J. C. Hoe. In-system FPGA prototyping of an Itanium microarchitecture. In International Conference on Computer Design, October 2004.
[20] Xilinx. ChipScope Pro Software and Cores User Guide, v9.1.01 edition, January 2007.
[21] Xilinx Inc. http://www.xilinx.com/, 2007.