BlueJEP: A Flexible and High-Performance Java Embedded Processor

Flavius Gruian
Dept. of Computer Science, Lund University
221 00 Lund, Sweden
fl[email protected]

Mark Westmijze
Dept. of Computer Science, University of Twente
Enschede, The Netherlands
[email protected]

ABSTRACT

This paper presents BlueJEP, a novel Java embedded processor, developed using the relatively new Bluespec SystemVerilog (BSV) environment. The starting point for BlueJEP is a micro-programmed, pipelined, Java-optimized processor (JOP), written in VHDL. Our BSV solution features a number of design choices, including a longer pipeline, that make the design more flexible, maintainable and high-performance. BlueJEP also appears to be an excellent platform for exploring a number of Java-specific techniques, both in hardware (bytecode folding, memory management, and caching strategies) and in software (runtime environment, bytecode optimizations). Tests and measurements were carried out both through simulation and on implementations running on a Xilinx FPGA.

Categories and Subject Descriptors

C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; C.1.3 [Processor Architectures]: Other Architecture Styles—pipeline processors, stack-oriented processors

Keywords

Java processor, embedded systems, Bluespec

1. INTRODUCTION

In this paper, we introduce BlueJEP, a Java embedded processor developed in Bluespec SystemVerilog. Having the Java Optimized Processor [15], a VHDL design, as a starting point, our solution takes advantage of a higher level of abstraction language to implement a number of architectural choices specific to our processor. In particular, BlueJEP features a six-stage pipelined, micro-programmed stack machine, with speculative execution, operand forwarding and stage bypassing. The simpler stages and higher-level specification make our design more modular and flexible, suitable for architecture exploration and Java-tuned solutions. Furthermore, experimental results show that BlueJEP performance is at least as high as that of other Java processors in its class, with a small penalty in the chip area.

The design described in this paper uses a high level of abstraction language for hardware specification called Bluespec SystemVerilog (BSV [3]). BSV is a rule-based, strongly-typed, declarative hardware specification language making use of Term Rewriting Systems [11] to describe computation as atomic state changes. Although relatively new, Bluespec seems to have captured the interest of industry and academia, and a number of designs written using BSV are making their appearance (e.g. [4, 5, 19]).

The paper is organized as follows. In Section 2, we briefly review some of the related work. Section 3 describes in detail the BlueJEP architecture, while Section 4 gives the method and results of its FPGA implementation. A discussion regarding our design choices and architectural extensions makes the subject of Section 5. Finally, Section 6 presents our conclusions.

2. RELATED WORK

There are several Java processors reported in the research community, some even available as soft cores or chips, many designed for embedded systems and a few even for real-time applications. Relevant approaches are briefly listed here.

Sun's PicoJava-II [16], freely available, is arguably the most complex Java processor currently, a re-design of an older solution which was never released. Its architecture features a stack-based, six-stage pipelined CISC processor, implementing 341 different instructions. Folding of up to four instructions is also implemented.

aJile's JEMCore, based on Rockwell-Collins' JEM2 design, is available both as IP and as a standalone processor known as aJ-100 [1], a 0.25 µm ASIC operating at 100 MHz. The 32-bit core is a micro-programmed solution, comprising ROM and RAM control stores, an ALU, and a 24-element register file. JEMCore implements, besides the native JVM bytecodes, extended bytecodes for I/O and threading support, along with floating point arithmetic.

DCT's Lightfoot 32-bit core [6] is a hybrid 8-bit instruction, 32-bit data path Harvard dual-stack RISC architecture. The core comprises a three-stage pipeline, with an integer ALU including a barrel shifter and a multiplication unit. Lightfoot has 128 fixed instructions and 128 reconfigurable, soft bytecodes.

Vulcan Machines' Moon2 [18] is a 32-bit processor core available as soft IP for FPGA or ASIC implementation. The Moon core features an ALU, a 256-element internal stack, an optional code cache, and a micro-program memory for the operation sequence required by each bytecode.

The Komodo micro-controller [13] includes a multithreaded Java processor core, which is a micro-programmed, four-stage pipeline. Its remarkable feature is the four-way instruction fetch unit, with independent program counters and flags, for hardware real-time scheduling of four threads.

FemtoJava is a research project focused on developing low-power Java processors for embedded applications. One of the versions features a five-stage pipelined stack machine, later extended to a VLIW machine [2], synthesized for an FPGA. Details about the FPGA make, the clock speed, and whether it actually ran on the FPGA are unclear.
3. THE DESIGN

Our Java processor architecture has its roots in the Java Optimized Processor (JOP, [15]), and exhibits many features that are commonly found in modern Java embedded processors. It is a six-stage pipelined, micro-programmed processor with a stack machine core, in order to follow the CISC features of a Java virtual machine. Simple bytecodes can be executed as single micro-instructions, while the more complex ones are implemented either as micro-programs or as Java methods. The target systems are embedded devices with limited memory and device size, possibly with real-time requirements. Therefore, BlueJEP does not offer a complete and general Java environment. For instance, class loading is carried out offline and an executable image (still composed of bytecodes) is generated pre-runtime. The strength of our approach comes from using the high-level features of Bluespec SystemVerilog, which makes the specification small, flexible, and maintainable. A more detailed description of the BlueJEP architecture follows.

The system architecture suited for our processor is a typical system-on-chip, implemented in our case on an FPGA. Depicted in Figure 1, such a system contains the processor, a RAM (storing the Java application and heap), a serial port (RS232), a timer, and some general purpose input/output (LEDs and switches), all connected through a system bus.

[Figure 1 shows the BlueJEP processor, a RAM holding the bj.hex image, a monitor & debug block, an RS232 port, a timer and GPIO, all attached to the On-chip Peripheral Bus (OPB).]

Figure 1: A typical architecture containing BlueJEP.

3.1 The Starting Point

There are a number of reasons that drove our design and development of BlueJEP. Initially, the intention was to re-implement the VHDL-written JOP ([15]) in the newer Bluespec SystemVerilog. The VHDL design is a four-stage pipelined processor. The first stage fetches bytecodes from the memory (method cache) and translates them to micro-program addresses. The second stage fetches micro-instructions. The third decodes and generates the necessary stack addresses, while the last executes and writes back the results. This last stage carries out all the accesses to the stack, all the spills/fills and the local variable reads/writes. To simplify and speed up the stack access, the topmost two values of the stack are mirrored by two dedicated registers.

When starting on our BSV design, we decided to go for an almost identical architecture to the VHDL solution. In order to use the tools already implemented for JOP (micro-assembler and executable image generator), we also wanted to have the same micro-instruction set and micro-code. Nevertheless, as we became more familiar with BSV, we decided that a longer pipeline would be more interesting, more flexible and modular, and hopefully faster. Furthermore, the micro-instruction set changed slightly, in order to adapt to the new architecture and support micro-instruction folding, although from the programmer's point of view this is rather transparent. Additionally, we went for an interface more suitable for integration in the Xilinx Embedded Development Kit, namely the On-chip Peripheral Bus (OPB).

As in the case of JOP, the intention was to obtain an architecture suitable for real-time systems, without complicated timing behavior, where the worst case behavior is both easy to estimate and not much worse than the average case. Furthermore, targeting embedded systems would also mean avoiding power hungry techniques. On the other hand, we also wanted to get as high a performance as possible out of our BlueJEP, and allow enough flexibility for testing a number of techniques that would require adding or altering registers, functional units, caches, etc.

3.2 BlueJEP Pipeline

The processor went through at least three different versions until crystallizing into the six-stage pipeline depicted in Figure 2. Earlier versions would stall the pipeline any time a data or a control hazard occurred, which meant more complex control. The current version only stalls on data hazards, and uses speculative execution on branches, which means simpler control and higher performance, but wider pipeline registers.

The speculative execution used by BlueJEP is rather simple, since (micro-code) branches are assumed as always not taken. Normally, the pipeline is kept full with bytecodes and micro-instructions that follow sequentially, regardless of micro-jumps, since the effect is only felt in the last stage. Whenever an unexpected deviation of control occurs, the pipeline is flushed and the execution resumes using the context¹ associated with the instruction that caused the branch. This requires sending the context along through the pipeline, calling for wider pipeline registers.

The stages see the pipeline registers as searchable FIFOs (except bcfifo and decfifo, which are regular FIFOs), in order to check for stall conditions – which usually means searching for instructions with a certain destination. Implicit conditions for stalling a stage are given by attempting to enqueue into a full FIFO or dequeue from an empty FIFO.

¹Java program counter jpc, micro-program counter pc, and stack pointer sp.

[Figure 2 shows the six pipeline stages – Fetch Bytecode, Fetch micro-I, Decode, Fetch Stack & Register, Execute and Write-back – linked through the bcfifo, decfifo, fsfifo, exfifo and wbfifo pipeline registers, with forward, bypass and rollback paths. Also shown are the bytecode cache (JPC, load cache), the micro-ROM, the BC2microA and jump tables, the registers (PC, OPD, SP, VP, CacheCtl, MD, MwA, MrA), the stack with constants, and the bus interface (OPB).]

Figure 2: The BlueJEP pipeline
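To make the stalling mechanism concrete, the following BSV fragment is a minimal sketch of how a stage can use such a searchable pipeline register. The SearchFIFO interface, the Entry type and the writesTo test are our own illustrative assumptions, not the actual BlueJEP code: the explicit rule guard realizes the data-hazard stall, while the implicit conditions of first/deq/enq provide the empty/full stalls mentioned above.

   typedef Bit#(8) Ref;               // a source/destination reference (assumed width)

   typedef struct {
      Ref src;                        // location this entry will read
      Ref dst;                        // location this entry will write
   } Entry deriving (Bits, Eq);

   // Assumed interface: a FIFO that can also be searched for a pending destination.
   interface SearchFIFO;
      method Action enq(Entry e);
      method Entry  first;
      method Action deq;
      method Bool   writesTo(Ref r);  // does any queued entry have dst == r?
   endinterface

   // Skeleton of one pipeline stage: the guard stalls the stage while an
   // in-flight entry downstream still writes our source (RAW hazard); the
   // implicit conditions of first/deq/enq give the empty/full stalls for free.
   module mkStageSketch#(SearchFIFO inQ, SearchFIFO outQ)(Empty);
      rule run (!outQ.writesTo(inQ.first.src));
         let e = inQ.first;
         inQ.deq;
         outQ.enq(e);                 // a real stage would also transform e here
      endrule
   endmodule

In the real pipeline each stage additionally transforms the entry (decoding it, fetching operands, and so on) before passing it downstream, and may search more than one downstream FIFO.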

3.2.1 Stage 1: Fetch Bytecode

The Fetch Bytecode stage fetches bytes from the bytecode cache and feeds them to the next stage, along with their translation into micro-addresses (using the BC2microA table) and their associated jpc. The bytecode cache fetches words from the external memory through the bus interface and manages the jpc. The BCCache interface is generic enough to accommodate anything from simple one-word caches to complex multi-method caches. The experimental results reported later use the standard configuration with a 1KB, single-method cache and concurrent load. This stage only stalls if the bcfifo is full.

3.2.2 Stage 2: Fetch micro-I

The Fetch micro-I stage keeps track of the micro-program counter (pc), fetching new micro-instructions from the micro-ROM, and feeds them to the next stage along with their associated pc. Whenever the micro-code needs to fetch a Java operand, it dequeues a byte from the bcfifo and stores it in an operand register, opd. Whenever the micro-code for the current bytecode is completed, and the next must be executed, it dequeues a micro-address from the bcfifo and updates the pc accordingly. This stage, too, stalls only if its input FIFO is empty or the decfifo is full.

3.2.3 Stage 3: Decode

The Decode stage dequeues and decodes the next micro-instruction from the decfifo. The decoded micro-instructions, enqueued into fsfifo, are either data moving instructions (one source and one destination) or operations (two sources, one operation, one destination). References (sources and destination) may be immediate values, registers, or stack addresses referred by values or registers. Necessary register values are fetched in this stage, while stack locations are fetched in the next. Along with references and operations, the context information received from the decfifo is passed on, augmented with the current value of sp.

To give a few examples, micro-code jumps are decoded into operations with the destination being the register pc. The sources are the pc and an immediate operand obtained by translating a jump index into an offset through a jump table. Stack operations, such as Add for instance, require the top-of-stack address (given by the stack pointer register sp) in order to figure out the absolute addresses of the two top values. The destination is also a stack address, given by sp-1. All these addresses are passed on to the next stage, which will fetch their contents. The same goes for operations accessing local stack variables through the variable pointer register vp. Furthermore, the sp must be updated accordingly, depending on the kind of stack operation (Pop or Push).

The Decode stage stalls if the contents of a register (sp, vp, ...) is required, but about to be changed by an instruction present in the later stages (RAW hazard).

3.2.4 Stage 4: Fetch Stack

The decoded instruction contains immediate values or the stack addresses of the operands that will be used either in the Execute or in the Write-back stage. Stack contents are fetched here, unless the reference is about to be modified by an instruction present in the following stages, in which case this stage stalls. Operations and fetched values are passed on to the Execute stage, while data moving instructions bypass Execute, if it is idle, and go directly into the wbfifo. In order to accelerate the pipeline execution, if the required references are about to be modified by the Write-back stage, the operands are forwarded directly from the wbfifo.

3.2.5 Stage 5: Execute

The Execute stage dequeues two values and an operation identifier from the exfifo and executes the operation to obtain a result. Conditional branches are partially handled here, as the operation is simply discarded if the condition is false, or passed on to the next stage if the condition is true. Thus, along with the context received from the exfifo, a destination and a single value are enqueued further into the wbfifo.
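The decoded form handed from Decode to the later stages can be pictured with the BSV types below. This is only a sketch under our own naming and width assumptions, not the actual BlueJEP type definitions: a reference is an immediate, a register or a stack address, and a decoded micro-instruction is either a data move or a two-source operation.

   typedef Bit#(32) Word;
   typedef Bit#(8)  StackAddr;   // assumed widths
   typedef Bit#(4)  OpCode;      // opaque operation identifier

   typedef enum { PC, OPD, SP, VP, MD, MWA, MRA, CACHECTL } RegName
      deriving (Bits, Eq);

   // A reference is an immediate value, a register, or a stack location.
   typedef union tagged {
      Word      Imm;
      RegName   Reg;
      StackAddr Stack;
   } Reference deriving (Bits, Eq);

   // Decode emits either a data move (one source, one destination) or an
   // operation (two sources, one operation, one destination).
   typedef union tagged {
      struct { Reference src;  Reference dst; }                            Move;
      struct { Reference srcA; Reference srcB; OpCode op; Reference dst; } Oper;
   } Decoded deriving (Bits, Eq);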
3.2.6 Stage 6: Write-back

Finally, the values dequeued from the wbfifo are written back to the right destination (register or stack address) in the Write-back stage. Not all registers are available for both read and write. For example, opd can only be read, while mwa and mra may only be written. However, the access control is managed within the Registers module. Furthermore, this is the stage that may issue a pipeline flush and a context rollback, in the case when pc and jpc are explicitly changed. In addition, this stage also controls the bytecode cache, by issuing a cache refill when explicitly requested by the micro-code (through the CacheCtl register). Just as in JOP, having explicit cache refills on method invokes and returns makes the timing analysis easier. This, in turn, contributes to the predictability of the processor, required in real-time embedded systems.

3.3 Register Mapped External Memory

Although there are no specific bytecodes for accessing external memory directly (due to security reasons), such access is essential for the JVM internals, peripheral access, memory management, etc. In BlueJEP the access to the external memory is carried out through a set of three registers: mwa, mra, and md. To read a memory location, a Stmra micro-instruction must be issued, which writes the top of stack into the mra register and starts an external read access through the bus interface. The bus interface will then update the md register with the data arriving on the bus, and the micro-code can issue a Ldmd to push this data (or the register md, if the bus access has completed already) onto the stack. On a write, the md contents, popped from the stack via Stmd, are written back to the memory by issuing a Stmwa instruction, which sets the mwa address register and starts a bus write access.
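On the hardware side, this register-mapped scheme can be sketched in BSV as below. The OpbMaster interface is an invented stand-in for the real bus logic, and the module is a simplified illustration of the mra/mwa/md behaviour described above, not the actual implementation; note how the guard on ldmd gives the blocking behaviour for a read that has not completed yet.

   typedef Bit#(32) Word;

   // Assumed stand-in for the OPB master port; the real bus logic differs.
   interface OpbMaster;
      method Action readReq(Word addr);
      method ActionValue#(Word) readResp;
      method Action writeReq(Word addr, Word data);
   endinterface

   // The external memory access registers as seen by the micro-code.
   interface MemRegs;
      method Action stmra(Word addr);   // Stmra: set read address, start a bus read
      method Action stmwa(Word addr);   // Stmwa: set write address, write md to memory
      method Action stmd(Word data);    // Stmd: latch the data popped from the stack
      method Word   ldmd();             // Ldmd: the read data, once it has arrived
   endinterface

   module mkMemRegsSketch#(OpbMaster bus)(MemRegs);
      Reg#(Word) mra <- mkRegU;
      Reg#(Word) mwa <- mkRegU;
      Reg#(Word) md  <- mkRegU;
      Reg#(Bool) rdPending <- mkReg(False);

      // the bus interface updates md when the requested word arrives
      rule finishRead (rdPending);
         let d <- bus.readResp;
         md <= d;
         rdPending <= False;
      endrule

      method Action stmra(Word addr);
         mra <= addr;
         bus.readReq(addr);
         rdPending <= True;
      endmethod
      method Action stmd(Word data);
         md <= data;
      endmethod
      method Action stmwa(Word addr);
         mwa <= addr;
         bus.writeReq(addr, md);
      endmethod
      method Word ldmd() if (!rdPending);   // blocks until the access has completed
         return md;
      endmethod
   endmodule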
3.4 Micro-Code Aspects

Depending on the micro-code, that is, on what sequence of micro-instructions needs to be executed for each bytecode, there are four tables/ROMs that need to be generated.

• The micro-ROM contains the micro-code for all the hardware-implemented bytecodes.

• The BC2microA table maps bytecodes to micro-code addresses.

• The jump table translates each of the available indexes (up to 32) into an address offset used in the micro-code jumps for updating the pc.

• The initial stack contains the constants used in the micro-code.

Usually (e.g. in the VHDL-specified JOP), a single executable, a micro-assembler, needs to generate all these data. Whenever the encoding of the micro-instruction set changes, the micro-assembler (along with the description of the decode stage) must be updated as well.

Our BSV solution (Figure 3) uses a generator, bluejasm, which translates the assembler code into an intermediate BSV file, generator.bsv. This file, along with the micro-instruction set definition from types.bsv, is compiled into a stand-alone simulator by the BSV compiler. Finally, this executable (genrom) outputs the .hex memory image files for the aforementioned tables. The advantage is that if the micro-instruction set encoding changes, bluejasm does not require any updates, since the contents of the generator.bsv file are independent of the encoding. Thus, all of the new .hex files can be obtained automatically.

[Figure 3 shows the tool flow: the .asm micro-code is translated by bluejasm into generator.bsv, which together with types.bsv is compiled by the BSV compiler (-sim) into the genrom executable; genrom then emits the micro-ROM, the jump table, the BC2microA table and the initial stack images.]

Figure 3: Micro-ROM generation and related tool flow. Grey boxes are generated files or standard tools. Colored boxes complete Figure 2.
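The generated images are ordinary hex files, so they can be brought into a BSV design with the standard RegFile library. The sketch below shows the idea for the micro-ROM and the BC2microA table; the file names, widths and the Tables interface are our assumptions for illustration, not the BlueJEP sources.

   import RegFile::*;

   typedef Bit#(10) MicroAddr;   // assumed micro-address width
   typedef Bit#(8)  Bytecode;
   typedef Bit#(16) MicroWord;   // assumed encoded micro-instruction width

   interface Tables;
      method MicroWord microInstr(MicroAddr a);
      method MicroAddr microEntry(Bytecode bc);
   endinterface

   // Sketch only: load the genrom-produced images into lookup tables.
   module mkTables(Tables);
      RegFile#(MicroAddr, MicroWord) uRom <- mkRegFileLoad("microrom.hex", 0, 1023);
      RegFile#(Bytecode, MicroAddr)  bc2a <- mkRegFileLoad("bc2microa.hex", 0, 255);

      method MicroWord microInstr(MicroAddr a);
         return uRom.sub(a);
      endmethod
      method MicroAddr microEntry(Bytecode bc);
         return bc2a.sub(bc);
      endmethod
   endmodule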

3.5 Run-time Aspects

As with most embedded Java systems, the runtime class library for BlueJEP is smaller than in a typical desktop environment. The processor, although truly executing bytecodes, uses a specific memory image which has been obtained offline, through a custom class loader and linker, BlueJim. The tool flow from a .java file to an executable BlueJEP image is depicted in Figure 4.

[Figure 4 shows the tool flow: the user application (UserApp.java) and the BlueJEP run-time library sources (Object.java, Native.java, JVM.java, String.java), compiled with javac and packaged as bjrt.jar, are processed by BlueJim into the bj.hex image, which is loaded to RAM.]

Figure 4: Obtaining an executable image for the BlueJEP processor.

The BlueJim class loader and linker is an in-house application written in Java, and it uses the Byte Code Engineering Library (BCEL) to parse .class files and generate a proper BlueJEP executable. The main task of BlueJim is to transform all generic references into memory addresses associated with the classes and constants used in the application. Furthermore, all the unused methods and classes are discarded and the virtual tables are re-organized if necessary, in order to minimize the image size. This step is carried out by marking all the classes reachable from the main method of the user application, plus a few other essential runtime classes. Finally, the image generator also translates native calls (invokestatics of methods from the Native class) into custom bytecodes, which are not used in the standard JVMs. These custom bytecodes correspond to BlueJEP-specific operations, such as direct memory access, internal register and stack access, and possibly memory management functions.

4. IMPLEMENTATION & EXPERIMENTS

The BlueJEP processor specification has been entirely written in Bluespec SystemVerilog [3], along with an OPB bus controller, main memory and a few dummy peripherals, in order to obtain the system depicted in Figure 1. Using the BSV compiler version 2006.11, this specification was used to generate, at first, a standalone simulator (host executable, clock cycle accurate) for fast test and debug. Verilog code was then produced just for the processor, which was simulated and further synthesized to a net-list using the Xilinx ISE 9.1i tool chain [21]. Along with the required wrappers and configuration files, this net-list was then used within the Xilinx EDK 9.1, together with already available Xilinx IPs (bus, UART controller, timer, memory, debug cores, and other general purpose IO), to generate the final system. The design was synthesized, mapped, placed and routed on a Xilinx Virtex-II (XC2V1000, fg456-4) FPGA. At this point we employed Xilinx ChipScope [20] to tap various signals, including the bus, and compare the values obtained from the real hardware against the values obtained using the Bluespec standalone simulation of our design. This step was used to detect and fix bugs or design misses, and confirmed that the implementation behaves according to the BSV specification.

Note that the design was tested at four different levels (BSV simulator, behavioral Verilog simulation, post place-and-route simulation, and in FPGA) during the design process, each being useful in revealing a number of issues. For example, at some point, behavioral simulation and post-PAR simulation were not producing identical results. This occurred due to memory initialization in a Bluespec library file, marked as simulation only, which was easily fixed but hard to detect. Another issue was an ISE synthesis discrepancy between block RAM and distributed RAM inference. For minimal area, we allowed ISE to infer RAM as block or distributed, but, as it turned out, the design functioned properly only when we restricted it to distributed RAM, which is one of the reasons the used device area is so large.

4.1 Device Area

The device area used by the BlueJEP processor on the aforementioned FPGA is given in Table 1. Compared to a JOP version with the same bus interface, cache configuration and synthesis options, BlueJEP takes double the area. However, a larger area is expected, given the longer pipeline and the higher abstraction level of a BSV specification. Note that, taking only the logic into consideration, the 2422 Xilinx 4LUTs of BlueJEP are only slightly more than the 1830 Altera LCs of JOP reported in [14].

Table 1: BlueJEP device area on a XC2V1000, fg456-4, optimized for speed, distributed RAM.

  Resources    Taken   Available   Percentage
  Slices        3460        5120          68%
  Flip-Flops     756       10240           7%
  4LUTs         6858       10240          66%
                (2422 used as logic, 4436 used as RAM)
4.2 Clock Speed

We synthesized BlueJEP for a number of Xilinx devices, and the maximum clock speeds are reported in Table 2, compared to equivalent versions of JOP. All the BlueJEP figures and the first JOP column were obtained with Xilinx ISE 9.1, optimized for speed, using only distributed RAM. The JOP version in this first column has an identical bus (OPB) and cache configuration to the BlueJEP processor. The figures in the last two columns for JOP were taken from [12]. Only the Virtex-II versions of BlueJEP and JOP were actually run on our evaluation board.

Since the Virtex-II versions are the most similar, synthesized with the same parameters, we take a closer look at these results. Having a six-stage pipeline, BlueJEP is expected to be able to work at a higher clock speed than the four-stage pipeline of JOP. Theoretically, if the stages are perfectly balanced, to achieve the same work in six clock cycles instead of four, one should be able to increase the clock frequency by 50%. This would put an upper bound on the six-stage pipeline at 90 MHz, which sets the clock frequency of the BlueJEP solution within acceptable values, rather close to the bound. In fact, since throughput is of more importance than latency, the six-stage design ends up being about 42% faster than the four-stage solution, for the tested synthesis parameters.

Table 2: Maximum clock speeds for BlueJEP and JOP for various Xilinx devices.

                    Virtex-II   Spartan3   Virtex5
                    XC2V-4      XS3-5      XC5VLX30-3
  JOP               60 MHz      66 MHz     200 MHz
  BlueJEP           85 MHz      76 MHz     221 MHz
  clock factor
  (fBlueJEP/fJOP)   1.42        1.15       1.10

4.3 Bytecode Execution Speed

Clock speed is not, however, the single best measure of performance, especially for Java processors. One needs to consider the execution speed of bytecodes in order to give a more accurate figure of performance. Having a micro-instruction set, micro-code and executable format similar to JOP, the clock cycles required by BlueJEP to execute bytecodes are in the range of those of JOP (see Table 3). Note that the figures given in the table were obtained for an external memory access delay of two clock cycles, for a fair comparison to the data in [14], when in reality the minimal OPB cycle time with arbitration is three clock cycles. The figure in the last column is the relative speed-up, rs = (fBlueJEP/fJOP) * (ccJOP/ccBlueJEP), for BlueJEP compared to JOP, for a clock factor of 1.42 (Table 2).

Table 3: Execution time in clock cycles for BlueJEP and JOP as given in [14].

  Bytecode(s)          JOP cc   BlueJEP cc   rs (1.42)
  iload iadd                2            3        0.95
  iinc                     11           13        1.20
  ldc                       9           12        1.06
  if_icmplt taken           6           23        0.37
  if_icmplt n/taken         6            8        1.06
  getfield                 23           38        0.86
  getstatic                15           18        1.18
  iaload                   29           45        0.92
  invoke                  126          166        1.08
  invoke static           100          111        1.28
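As a concrete reading of the last column, the speed-up formula can be checked against the iinc row, using only the numbers already given in Tables 2 and 3:

  rs = (fBlueJEP/fJOP) * (ccJOP/ccBlueJEP) = 1.42 * (11/13) ≈ 1.20

so iinc, despite taking two extra clock cycles on BlueJEP, still runs about 20% faster on the Virtex-II implementation once the higher clock frequency is taken into account.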
The longer execution times on BlueJEP come, in this case, from pipeline stalls. For example, in the case of iinc the extra two clock cycles are due to a stall caused by altering the vp register combined with reading the stack location pointed to by vp. Forwarding registers from the Write-back to the Decode stage would reduce such stalls. In the case of taken if_cmp* bytecodes, the longer delay comes from a number of differences, such as flushing a longer pipeline, a different way of handling micro-code branches, and a different micro-program for those bytecodes. Nevertheless, when one takes into account the faster clock of BlueJEP, many bytecodes are actually executed faster than on JOP.

5. DISCUSSION

This section gathers a number of BlueJEP-related issues that we considered interesting, such as the pros and cons of using Bluespec SystemVerilog, the rationale behind some of the design choices, and finally possible and planned extensions to our processor.

5.1 BSV Specifics

One fact worth noticing is that BlueJEP, along with its test system, was completely written from scratch in Bluespec SystemVerilog. We had very limited knowledge of and experience with BSV before embarking on the work described in this paper, and learning BSV was also one of our goals. Given that BSV is a high-level language, we found a number of advantages and drawbacks in using it for designing our Java processor, as detailed in the following.

5.1.1 Coding

One of the advantages of using a high-level specification language is the reduced design time, reflected also in the number of code lines. Comparing our BlueJEP specification to a similarly configured version of JOP (in VHDL) revealed that the BSV design has about one third fewer lines than the VHDL solution. After compiling the BSV design to Verilog, however, the number of Verilog lines is larger than in the VHDL solution. This is explained on one hand by the longer pipeline of BlueJEP and on the other by the fact that compilers often produce more code than hand-written solutions. Nevertheless, the Verilog output is not meant to be read or modified directly by the designers.

Smaller code usually translates into increased maintainability and flexibility. Indeed, from our experience, modifying the BlueJEP design is rather easy, because it is written in BSV and also due to the modularity and simplicity of the pipeline stages. As an example, when adding new micro-instructions, the encoding does not need to be specified explicitly and the micro-ROM generator does not require any modifications (see Figure 3). What has to be done is only to include the new micro-instruction mnemonic in the existing micro-instruction type and to alter the decode stage to handle the new instruction. The execute stage might also need some changes in case a new and exotic operation must be carried out. Finally, new registers might need to be added. Nevertheless, all these are simple, well-localized changes.

5.1.2 Test & Debug

Testing a BSV design for correct functionality is facilitated by a number of features and module libraries. The majority of design misses and implementation bugs can be detected very early in the design by writing test modules for each BSV module. This is simplified by using the readily available StmtFSM library from the BSV standard distribution, for the automatic generation of FSMs from sequences of method calls and other operations. Modules with generic types were tested with simple types at first. Debug messages and the possibility to define string conversion functions for each type also helped. The possibility to generate a standalone executable simulator on the host platform, before moving to hardware simulators, was very useful. To summarize, the whole testing process was more software-like, instead of inspecting simulation waveforms on the screen.

Nevertheless, behavioral simulation of the Verilog output from the BSV compiler, as well as lower level hardware simulations, were essential in detecting other problems, such as issues with module boundary signals and BSV library bugs. Finally, being able to monitor signals at runtime, on the FPGA, using ChipScope [20], was the ultimate method to validate the implementation.
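As an illustration of this software-like style, the fragment below is a minimal sketch of a per-module test written as a linear sequence of method calls, which the StmtFSM library turns into an FSM that runs once after reset. It is not one of the actual BlueJEP testbenches; the device under test (here just a library FIFO) and the values are placeholders.

   import StmtFSM::*;
   import FIFO::*;

   module mkFifoTb(Empty);
      FIFO#(Bit#(8)) dut <- mkFIFO;          // stand-in for the module under test

      Stmt test =
         seq
            dut.enq(8'h2A);                  // drive a value in
            action                           // observe what comes out
               $display("got %h", dut.first);
               dut.deq;
            endaction
         endseq;

      mkAutoFSM(test);   // starts after reset, ends simulation when done
   endmodule

A real test would replace the FIFO with the module under test and check the results instead of merely printing them.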
5.1.3 Results

Synthesis results, the device area in particular, appear to be a step back from smaller, hand-written VHDL designs such as JOP. The reasons behind these results follow.

As with any higher level of abstraction language, there is a price to pay for specifying designs in BSV. Design time is drastically reduced (two to three times shorter), but the final implementation is harder to control at a lower level. For example, BSV register files (RegFile), used to model memories, have five read ports and one write port, and end up in the Verilog description as five single-port memories. In particular, due to some quirk in the ISE synthesis tools, we had to tag all the memories in the design as distributed memories (although block memories are available), leading to a sub-optimal device area utilization. It is the job of the synthesis tools to identify and optimize such constructs, while preserving correctness. Thus, the tasks usually taken up by the designer in order to optimize the implementation now rely heavily on the compilers and synthesis tools. BSV allows some control by providing a way to import Verilog designs into the BSV code, which, however, cannot be included in the standalone simulation.

Optimizing the timing of the design is another issue that becomes harder with higher abstraction level specifications. First, it is not easy to identify the critical paths reported back by the synthesis tools, paths which now include signals that were not written by the designer. Second, a slight change in the BSV specification might have a huge and distributed impact on the Verilog output.

To conclude, BSV is excellent for rapid prototyping and architectural exploration, but once the decisions are taken and a final implementation must be employed, it might be necessary to rewrite and refine certain modules at register-transfer level in Verilog or VHDL.

5.2 Design Choices

A number of design choices differentiate our solution from the initial JOP architecture, as we detail in the following.

5.2.1 Six Stages Pipeline

The decision to implement a six- instead of a four-stage pipeline was taken in order to have simpler stages (and therefore a faster clock) and a more flexible pipeline. For example, the Execute stage performs operations on two immediate values, as opposed to just on tos and tos-1 as in JOP. These values are fetched or produced by either the FetchStack or the Decode stage. This opens the possibility of operating on any two locations from the stack, which is essential for folding micro-instructions. Since not every micro-instruction requires the Execute stage, we also added the bypass directly to the Write-back stage, in order to speed up the execution. Furthermore, feeding back values to FetchStack directly from the wbfifo, before they are written back to the stack, was another way to gain clock cycles. Normally, all these increase the complexity of the control, have an impact on the area, and may increase the critical path. As usual, the solution ends up being a trade-off between area, clock speed and pipeline latency. Exploring various architectural choices proved to be rather easy using the BSV design flow. For example, we decided not to forward registers from the wbfifo to the Decode stage, as this would shorten the execution by only one or two clock cycles for rather long code sequences, at the expense of increased control and clock period.

5.2.2 Speculative Execution

Another choice was whether to stall or to use speculative execution in case of micro-code branches. Early versions used to stall the Fetch micro-I stage whenever an operation modifying the micro-pc was present in the later stages. Current versions use speculative execution in the sense that micro-instructions are always fetched sequentially, but when the micro-pc is altered by a branch (instead of being incremented), the pipeline is flushed, the context is restored using the branch context (sp, jpc) and execution resumes. In the case of stalling, more control is needed to search through the FIFOs and identify instructions that may alter the micro-code control flow. In the case of speculative execution, not taking branches is always faster than stalling, but having to flush and refill the pipeline makes taking branches slower. Having to pass the context along makes the pipeline registers wider, but there is no need to search through the FIFOs.

5.2.3 Bus Interface

Choosing the On-chip Peripheral Bus interface was motivated by the good OPB support provided by the Xilinx EDK, including a library of tested and optimized peripherals and debug IPs working on OPB. It would have been possible, of course, to avoid using the EDK, implement the whole system (Figure 1) in BSV, and go directly to Xilinx ISE for synthesis. In fact, for testing and debug purposes we did specify an entire system in BSV, but used very simple dummy peripherals, good for simulation only.

5.2.4 Micro-Instruction Set

The micro-instruction set encoding went through some changes as well. The micro-instruction type was initially defined as a union, with one element for each specific micro-instruction, some having specific operands. This structure was rather flat, but made the micro-code easy to read and to translate into BSV constructs. Later on we realized that classifying instructions according to their effect on the stack is better for identifying folding patterns. Therefore we changed the type to a union of four classes, reflecting stack operations, namely producers, consumers, operations, and specials. Specials include branches and nop. Although this made the micro-code less legible, using the assembler (Figure 3) meant that we could keep the programmer's view of the micro-instruction set unchanged, and only adjust the back-end.
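The reorganized type can be sketched along these lines; this is our own condensed rendering for illustration, not the contents of types.bsv, and the mnemonics listed are only examples (Ldmd, Stmra, Stmwa and Stmd appear in Section 3.3, the rest are assumed):

   typedef Bit#(5) JumpIndex;   // up to 32 jump-table entries

   // Illustrative mnemonics only; the real lists are longer.
   typedef enum { LdOpd, LdConst, LdVar, Ldmd } Producer  deriving (Bits, Eq);
   typedef enum { StVar, Stmra, Stmwa, Stmd  } Consumer  deriving (Bits, Eq);
   typedef enum { Add, Sub, Shl, Ushr        } Operation deriving (Bits, Eq);

   typedef union tagged {
      Producer  Prod;     // pushes one value onto the stack
      Consumer  Cons;     // pops one value from the stack
      Operation Op;       // consumes two values, produces one
      JumpIndex Branch;   // micro-code jump, resolved through the jump table
      void      Nop;
   } MicroInstr deriving (Bits, Eq);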
5.3 Extensions

One of the reasons behind developing BlueJEP was to have a flexible architecture suitable for testing and evaluating a few Java-specific techniques, such as bytecode folding, support for hardware-assisted memory management, various method caching strategies and even compiler optimizations. Two of these are briefly mentioned in the following.

5.3.1 Micro-instruction Folding

Bytecode folding is a method of compacting stack-machine-specific code, in order to reduce stack accesses. For example, adding two local variables into a third requires four bytecodes: first loading the two operands onto the top of the stack, then carrying out the addition and storing the result back on the stack, and finally writing back the top of the stack into the destination variable. A three-address machine is able to do this in one instruction. Folding is commonly used in JIT compilers and employed in hardware by a few native Java processors (e.g. [7, 16, 17]).

The folding extension for BlueJEP, currently under evaluation, is slightly different in the sense that it occurs at the micro-instruction rather than the bytecode level. Each micro-instruction is classified according to its effect on the stack, as a producer (p), consumer (c), operation (o) or special (branches and the nop). In order to implement folding, only the first three stages need to be modified. In particular, the Decode stage is the one identifying the patterns, consisting of up to four micro-instructions (ppoc being the only pattern of length four). This means that the Fetch micro-I stage must be able to provide up to four consecutive micro-instructions in the same clock cycle. Furthermore, these four micro-instructions could be spread over several bytecodes, which means that the Fetch Bytecode stage must be able to provide four bytecodes at once. Naturally, this complicates the architecture, increasing the device area and possibly the critical path. Although in theory there is a considerable speed-up that can be achieved by folding patterns of up to four micro-instructions², pipeline stalls or flushes and the negative impact on the clock speed of such a complex architecture appear to render these gains untenable. Extensions for shorter and fewer, selected patterns are currently under investigation, as a compromise between the folding ratio and the clock speed degradation. We are also examining the relation between folding and the current architectural choices, such as bypassing and operand forwarding.

²Folding gives a speed-up of 1.6, assuming micro-instructions are available and can be executed every clock cycle.
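For the local-variable addition above, the four micro-instructions form exactly this ppoc window (load, load, add, store). A sketch of the pattern check, with a StackEffect classification that we assume can be derived from the micro-instruction type, could look as follows in BSV:

   import Vector::*;

   // producer, consumer, operation, special
   typedef enum { P, C, O, S } StackEffect deriving (Bits, Eq);

   // True when a window of four consecutive micro-instructions matches the
   // only length-four pattern, ppoc (e.g. load, load, add, store).
   function Bool isPPOC(Vector#(4, StackEffect) w);
      return (w[0] == P) && (w[1] == P) && (w[2] == O) && (w[3] == C);
   endfunction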
5.3.2 Memory Management Support

Based on the work described in [9], we designed a memory management unit (MMU), also written in BSV. Complex bytecodes such as new, newarray and anewarray, as well as the garbage collection cycle, initially implemented in micro-code or Java functions, are currently available through the MMU. The interface between the MMU and BlueJEP was implemented through a few data registers and one control register, accessed via specific micro-instructions. The communication is carried out asynchronously. Parameters are sent by writing the MMU data registers, and an operation is initiated by writing to the MMU control register. Results can be read back via blocking reads from the MMU data registers. The required modifications to the BlueJEP architecture are minimal, confined to adding a few new registers and providing read and write operations for these, mapped to MMU operations. Currently only stop-the-world, mark-compact garbage collection is implemented, but a concurrent version, suitable for time-triggered garbage collection [8], is under development. For more details please refer to [10].
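The register-based, asynchronous hand-shake can be pictured with the BSV sketch below. The interface, the command encoding and the single data register are our own simplifications for illustration, not the actual MMU interface; the method guards give the blocking behaviour, so the micro-code simply stalls on a read until the MMU has finished.

   typedef Bit#(32) Word;

   // Assumed command encoding, for illustration only.
   typedef enum { AllocObject, AllocArray, StartGC } MmuCmd deriving (Bits, Eq);

   interface MmuRegs;
      method Action writeData(Word x);     // parameters, via a data register
      method Action writeCtrl(MmuCmd c);   // starts an MMU operation
      method Word   readData();            // blocking read of the result
   endinterface

   module mkMmuRegsSketch(MmuRegs);
      Reg#(Word)   dataReg <- mkRegU;
      Reg#(MmuCmd) cmdReg  <- mkRegU;
      Reg#(Bool)   busy    <- mkReg(False);

      // Stand-in for the real MMU: the actual unit runs for many cycles,
      // dispatches on cmdReg, and deposits its result in the data register.
      rule finishOp (busy);
         busy <= False;
      endrule

      method Action writeData(Word x) if (!busy);
         dataReg <= x;
      endmethod
      method Action writeCtrl(MmuCmd c) if (!busy);
         cmdReg <= c;
         busy <= True;                     // operation runs asynchronously
      endmethod
      method Word readData() if (!busy);   // stalls the caller while busy
         return dataReg;
      endmethod
   endmodule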
6. CONCLUSION

In this paper we introduced BlueJEP, a native Java processor specified in Bluespec SystemVerilog, suitable for embedded systems. Based on an existing Java processor, JOP, our design features a flexible six-stage pipeline, speculative execution, operand forwarding and stage bypassing. Its flexibility is partly due to using a higher abstraction level language, BSV, and partly due to the processor architecture. The downside of our approach is the increase in the device area, an expected consequence of raising the abstraction level of the specification and of the longer pipeline. Performance-wise, BlueJEP is comparable to JOP, having a faster maximum clock speed, but requiring more clock cycles for the same operations. No special efforts were put into optimizing the design for area or speed, however.

Currently the BlueJEP processor is used as an evaluation platform for hardware solutions to several Java-specific techniques, such as bytecode folding, garbage collection, and efficient method caches.

Acknowledgments

We would like to thank Martin Schoeberl for the useful discussions, suggestions and JOP-related data, which helped improve this paper considerably. We would also like to thank Bluespec for tools and support.

7. REFERENCES

[1] aJile Systems. http://www.ajile.com.
[2] A. C. S. Beck and L. Carro. A VLIW low power Java processor for embedded applications. In SBCCI '04: Proceedings of the 17th Symposium on Integrated Circuits and System Design, pages 157–162, New York, NY, USA, 2004. ACM Press.
[3] Bluespec, Inc. http://www.bluespec.com, 2007.
[4] N. Dave. Designing a processor in Bluespec. Master's thesis, MIT, Cambridge, MA, January 2005.
[5] N. Dave, M. Pellauer, S. Gerding, and Arvind. 802.11a transmitter: A case study in microarchitectural exploration. In International Conference on Formal Methods and Models for Codesign (MEMOCODE'06), pages 59–68, July 2006.
[6] Digital Communication Technologies. Lightfoot 32-bit Java processor core. Data sheet, September 2001.
[7] M. W. El-Kharashi, F. Gebali, K. F. Li, and F. Zhang. The JAFARDD processor: a Java architecture based on a folding algorithm, with reservation stations, dynamic translation, and dual processing. IEEE Transactions on Consumer Electronics, 48(4):1004–1015, November 2002.
[8] S. Gestegard-Robertz and R. Henriksson. Time-triggered garbage collection. In Proceedings of the ACM SIGPLAN Languages, Compilers, and Tools for Embedded Systems, June 2003.
[9] F. Gruian and Z. Salcic. Designing a concurrent hardware garbage collector for small embedded systems. In Asia-Pacific Computer Systems Architecture Conference, pages 281–294, 2005.
[10] F. Gruian and M. Westmijze. BluEJAMM: A Bluespec embedded Java architecture with memory management. In SYNASC'07 Real-Time and Embedded Systems Workshop, September 2007. To be presented.
[11] J. C. Hoe and Arvind. Hardware synthesis from term rewriting systems. In VLSI '99: Proceedings of the IFIP TC10/WG10.5 Tenth International Conference on Very Large Scale Integration, pages 595–619, Deventer, The Netherlands, 2000. Kluwer, B.V.
[12] JopWiki. http://www.jopwiki.com/.
[13] J. Kreuzinger, U. Brinkschulte, M. Pfeffer, S. Uhrig, and T. Ungerer. Real-time event-handling and scheduling on a multithreaded Java microcontroller. Microprocessors and Microsystems, 27(1):19–31, February 2003.
[14] M. Schoeberl. Evaluation of a Java processor. In Tagungsband Austrochip 2005, pages 127–134, Vienna, Austria, October 2005.
[15] M. Schoeberl. JOP: A Java Optimized Processor for Embedded Real-Time Systems. PhD thesis, Vienna University of Technology, January 2005.
[16] Sun Microsystems. PicoJava-II microarchitecture guide. Technical Report 960-1160-11, Sun Microsystems, 1999.
[17] L.-R. Ton, L.-C. Chang, J.-J. Shann, and C.-P. Chung. A software/hardware cooperated stack operations folding model for Java processors. Journal of Systems and Software, 72(3):377–387, 2004.
[18] Vulcan Machines Ltd. http://www.vulcanmachines.co.uk/, August 2007.
[19] R. E. Wunderlich and J. C. Hoe. In-system FPGA prototyping of an Itanium microarchitecture. In International Conference on Computer Design, October 2004.
[20] Xilinx. ChipScope Pro Software and Cores User Guide, v9.1.01 edition, January 2007.
[21] Xilinx Inc. http://www.xilinx.com/, 2007.