Towards a Java Multiprocessor
Total Page:16
File Type:pdf, Size:1020Kb
Towards a Java Multiprocessor Christof Pitter Martin Schoeberl Institute of Computer Engineering Institute of Computer Engineering Vienna University of Technology, Austria Vienna University of Technology, Austria [email protected] [email protected] ABSTRACT century. According to [9], we are reaching the limits of exploiting This paper describes the first steps towards a Java multiprocessor ILP efficiently. Unfortunately, semiconductor technology has also system on a single chip for embedded systems. The chip multipro- run across limitations in recent years because of the theoretical lim- cessing (CMP) system consists of a homogeneous set of processing its informed by the theory of physics, e.g., electrical signals cannot elements and a shared memory. Each processor core is based on the travel faster than the speed of light. As a result, the frequency that Java Optimized Processor (JOP). A major challenge in CMP is the used to increase exponentially has leveled off [17]. shared memory access of multiple CPUs. The proposed memory To sustain the rapid growth of computation power new system ar- arbiter resolves possible emerging conflicts of parallel accesses to chitecture advancements have to be made. According to [9] the fu- the shared memory using a fixed priority scheme. Furthermore, the ture direction of computer systems is chip multiprocessing (CMP). paper describes the boot-up of the CMP. We verify the proposed Such a system combines two or more processing elements and a so- CMP architecture by the implementation of the prototype called phisticated communication network on a single chip. A major ad- JopCMP. JopCMP consists of multiple JOPs and a shared mem- vantage of this approach is that progress in computation power does ory. Finally yet importantly, the first implementation of the CMP not come along with an increase of the hardware complexity of the composed of two/three JOPs in an FPGA enables us to present a single processors. However, the shared components like memory comparison of the performance between a single-core JOP and the and I/O of a multiprocessor system produce new challenges that CMP version by running real applications. have to be addressed. Real-time programs are naturally multi-threaded and a good can- didate for on-chip multiprocessors with shared memory. The Java Categories and Subject Descriptors virtual machine (JVM) thread model supports threads that share C.3 [Special-Purpose and Application-Based Systems]: Real- the main memory. Therefore, a multiprocessor JVM in hardware time and embedded systems; D.3.4 [Programming Languages]: is a viable option. The basis for the CMP architecture has been Processors—Run-time environments, Java; B.7.1 [Integrated Cir- set with the Java Optimized Processor (JOP) [23, 24, 25]. JOP is cuits]: Types and Design Styles—Microprocessors and microcom- the implementation of a JVM in hardware. The used application puters model is described in [22, 28] and is based on the Ravenscar Ada profile [7]. In order to generate a small and predictable processor, Keywords several advanced and resource-consuming features (such as instruc- tion folding or branch prediction) were omitted from the design of Multiprocessor, Java, Shared Memory JOP. The resulting low resource usage of JOP makes it possible to integrate more than one processor in a low-cost field programmable 1. INTRODUCTION gate array (FPGA). Modern applications demand ever-increasing computation power. In this paper, we propose a CMP architecture consisting of a They act as the main drivers for the semiconductor industry. For number of JOPs and a shared memory. The shared memory is uni- over 35 years, the speed of transistors has become faster and the formly accessible by the homogeneous processing cores. An arbi- frequency of the clock rate accordingly. Additionally the number tration unit takes care of conflicts due to parallel memory requests. of transistors on an integrated circuit for a given cost doubles ev- Furthermore, we describe the implementation of the caching and ery 24 months, as described by Moore’s Law [20]. The availability synchronization mechanisms. Additionally we present JopCMP, of more transistors has been used by introducing the instruction- the first prototype of the CMP with JOP cores. JopCMP is com- level parallelism (ILP) approach, which was the primary processor posed of multiple JOP cores, integrated in an FPGA, and an exter- design objective between the mid-1980s and the start of the 21st nal memory. A novel memory arbiter controls the memory access of the various JOPs to the shared memory. It resolves possible emerging conflicts of parallel accesses to the shared memory. In comparison to existent memory arbiters of system-on-chip (SoC) Permission to make digital or hard copies of all or part of this work for busses (e.g. AMBA [3]), the proposed arbitration process is per- personal or classroom use is granted without fee provided that copies are formed in the same cycle as the request happens. This increases not made or distributed for profit or commercial advantage and that copies the bandwidth and eases the time predictability of each memory bear this notice and the full citation on the first page. To copy otherwise, to access. Therefore, the implementation of the JopCMP represents a republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. first step towards a time predictable multiprocessor. The ultimate JTRES ’07 September 26-28, 2007 Vienna, Austria goal of our research work is a multiprocessor for safety-critical ap- Copyright 2007 ACM 978-59593-813-8/07/9 ...$5.00. plications. Moreover, an acceptable performance compared with Multiprocessors with a heterogeneous architecture combine a mainstream non real-time Java systems is an objective. core CPU for controlling and communication tasks and additional The rest of the paper is structured as follows. Section 2 presents digital signaling processing elements, interface processors or mo- related work. In Section 3, we describe the proposed CMP archi- bile multimedia processing units. These units are connected to- tecture and go into details of the memory model and caching. Ad- gether using multi-level buses or switches. Some functional units ditionally the issue of synchronization is examined. Section 4 de- may have their own individual memories along with shared mem- scribes the first implementation of the JopCMP system, including ory structures. They are often tailored for specific applications. the boot-up sequence and the shared memory access management Some examples of heterogeneous multiprocessors include the No- of the CMP. Section 5 presents experiments with the JopCMP pro- madik [1] from ST designed for mobile multimedia applications, totype. We compare the performance of the CMP against a sin- the Nexperia PNX-8500 [8] from Philips aimed at digital video en- gle processor. Finally, Section 6 concludes the paper and provides tertainment systems, or the OMAP family [19] from TI designed to guidelines for future work. support 2.5G and 3G wireless applications. Gaisler Research AB designed and implemented a homogeneous 2. RELATED WORK multiprocessor system. It consists of a centralized shared memory and up to four LEON processor cores that are based on the SPARC In this paper, we argue that the replication of a simple pipeline on V8 architecture [11]. This embedded system is available as a syn- a chip is a more effective use of transistors than the implementation thesizable VHDL model. Therefore, it is well suited for SoC de- of super-scalar architectures. The following two subsections are signs. We could not find any literature concerning WCET analysis about the progress made in CMP. regarding the multiprocessor. Another example is the ARM11 MPCore [4]. It introduces a 2.1 Mainstream Multiprocessors pre-integrated symmetric multiprocessor consisting of up to four Due to the power wall [9], the trend towards CMP can be seen ARM11 microarchitecture processors. The 8-stage pipeline archi- in mainstream processors. Currently, three quite different architec- tecture, independent data and instruction caches, and a memory tures are state-of-the-art: management unit for the shared memory make a timing analysis difficult. 1. Multi-core versions of super-scalar architectures (Intel/AMD) The two leaders of the FPGA market Altera and Xilinx both pro- vide software tools and intellectual property (IP) processors to de- 2. Multi-core chip with simple RISC processors (Sun Niagara) sign CMP systems [2, 5]. The Nios II CPUs depict the processing units of the multiprocessor architecture from Altera. It is easy to 3. The CELL architecture create a CMP architecture using the GUI interface of the System- on-a-programmable-chip (SOPC) builder, a tool of the Altera Quar- Mainstream desktop processors from Intel and AMD include tus II design suite. Nevertheless, the dependence on specific IP two or four super-scalar, simultaneous multithreading processors, cores is unavoidable when designing such a system. which are just replications of the original, complex cores. Sun In this paper, we concentrate on homogeneous multiprocessors took a completely different approach with its Niagara T1 [15]. The consisting of two or more similar CPUs sharing a main memory. T1 contains eight processor cores. Each core consists of a sim- Even though much research has been done on multiprocessors, the ple six-stage, single-issue pipeline similar to the original five-stage timing analysis of the systems has so far been disregarded. RISC pipeline. The additional pipeline stage adds fine-grained multithreading. The first version of the chip contains just a sin- gle floating-point unit that is shared by all eight processors. The 3. CMP ARCHITECTURE design is targeted to server workloads. According to [12, 30], two different possibilities of a tightly cou- The Cell multiprocessor [10, 13, 14] is an example of a hetero- pled multiprocessor system exist (see Figure 1).