The JAFARDD: A Java Architecture Based on a Folding Algorithm, with Reservation Stations, Dynamic Translation, and Dual Processing

by

Mohamed Watheq Ali Kamel El-Kharashi
B.Sc., Ain Shams University, 1992
M.Sc., Ain Shams University, 1996

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

in the Department of Electrical and Computer Engineering

We accept this dissertation as conforming to the required standard

Dr. F. Gebali, Supervisor (Department of Electrical and Computer Engineering)

Dr. K. F. Li, Supervisor (Department of Electrical and Computer Engineering)

Dr. N. J. Dimopoulos, Departmental Member (Department of Electrical and Computer Engineering)

Dr. D. M. Miller, Outside Member (Department of Computer Science)

Dr. H. Alnuweiri, External Examiner (Department of Electrical and Computer Engineering, University of British Columbia)

© Mohamed Watheq Ali Kamel El-Kharashi, 2002
University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

Supervisors: Dr. F. Gebali and Dr. K. F. Li

ABSTRACT

Java's cross-platform virtual machine arrangement and the special features that make it ideal for writing network applications also have a tremendous negative impact on its performance. In spite of its relatively weak performance, Java's success has motivated the search for techniques to enhance its execution. This work presents the JAFARDD (a Java Architecture based on a Folding Algorithm, with Reservation stations, Dynamic translation, and Dual processing) processor, designed to accelerate Java processing. JAFARDD dynamically translates Java bytecodes to RISC instructions to facilitate the use of a typical general-purpose RISC core. This enables the exploitation of instruction level parallelism among the translated instructions using well-established techniques, and facilitates the migration to Java-enabled hardware.

Designing hardware for Java requires an extensive knowledge and understanding of its instruction set architecture, which was acquired through a comprehensive behavioral analysis by benchmarking. Many aspects of the Java workload behavior were collected and the resulting statistics were analyzed. This helped identify performance-critical aspects that are candidates for hardware support. Our analysis surpasses other similar ones in terms of the number of aspects studied and the coverage of the recommendations made.

Next, a global analysis of the design space of Java processors was carried out. Different hardware design options and alternatives that are suitable for Java were explored and their trade-offs were examined. We especially focused on the design methodology, execution engine organization, parallelism exploitation, and support for high-level language features. This analysis helped identify innovative design ideas such as the use of a modified Tomasulo's algorithm. This, in turn, motivated the development of a bytecode folding algorithm that integrates with the reservation station concept in JAFARDD.

While examining the behavioral analysis and the design space exploration ideas, a list of global architectural design principles started to emerge. These principles ensure that JAFARDD can execute Java efficiently and were taken into consideration while the various instruction pipeline modules were designed.

Results from the behavioral analysis also confirmed that Java's stack architecture creates virtual data dependencies that limit performance and prohibit instruction level parallelism. To overcome this drawback, stack operation folding has been suggested in the literature to enhance performance by grouping contiguous instructions that have true data dependencies into a compound instruction. We have developed a folding algorithm that, unlike existing ones, does not require the folded instructions to be consecutive. To the best of our knowledge, our folding algorithm is the only one that permits nested pattern folding, tolerates variations in folding groups, and detects and resolves folding hazards completely. By incorporating this algorithm into a Java processor, the need for, and therefore the limitations of, a stack are eliminated.
In addition to an efficient dual processing configuration (i.e., Java and RISC), JAFARDD is empowered with a number of innovative design features, including: an adaptive feedback fetch policy that copes with the variation in Java instruction size, a smart bytecode queue that compensates for the lack of a stack, an on-chip local variable file to facilitate operand access, an early tag assignment to dispatched instructions to reduce processing delay, and a specialized load/store unit that preprocesses object-oriented instructions.

The functionality of JAFARDD has been successfully demonstrated through VHDL modeling and simulation. Furthermore, benchmarking using SPECjvm98 showed that the introduced techniques indeed speed up Java execution. Our bytecode folding algorithm speeds up execution by an average factor of about 1.29, eliminating an average of 97% of the stack instructions and 50% of the overall instructions. Compared to other proposals, JAFARDD combines Java bytecode folding with dynamic hardware translation, while maintaining the RISC nature of the processor, making this a much more flexible and general approach.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Trademarks
Acknowledgement
Dedication

1  Introduction
   1.1  The Java Virtual Machine
        1.1.1  Different Bytecode Manipulation Methods
        1.1.2  Performance-Hindering Features
   1.2  Motivations
   1.3  Research Objectives
   1.4  Design Challenges
   1.5  Methodology and Road Map

2  Java Processor Architectural Requirements: A Quantitative Study
   2.1  Introduction
   2.2  Experiment Framework
        2.2.1  Study Platform
        2.2.2  Trace Generation
               2.2.2.1  JDK Core Organization
               2.2.2.2  Core Instrumentation
               2.2.2.3  Execution Trace Components
        2.2.3  Trace Processing
        2.2.4  Benchmarking
   2.3  JVM Instruction Set Architecture (ISA)
        2.3.1  Class Files and Their Loader
        2.3.2  Execution Engine
        2.3.3  Runtime Data Areas
   2.4  Access Patterns for Data Types
        2.4.1  JVM Data Types
        2.4.2  Single-Type Operations
        2.4.3  Type Conversion Operations
   2.5  Addressing Modes
        2.5.1  JVM Addressing Modes
        2.5.2  Runtime Resolution and Quick Instruction Processing
        2.5.3  General Usage Patterns
        2.5.4  Quick Execution
   2.6  Instruction Set Utilization
        2.6.1  JVM Instruction Set
        2.6.2  Dynamic Opcode Distribution
        2.6.3  Frequency of Null Usage
        2.6.4  Branch Prediction
        2.6.5  Effect on Stack Size
   2.7  Instruction Encoding
        2.7.1  JVM Instruction Encoding
        2.7.2  Local Variable Indices
        2.7.3  Constant Pool Indices
        2.7.4  Immediates
        2.7.5  Dynamic Instruction Length
        2.7.6  Array Indices
        2.7.7  Branching Distances
        2.7.8  Encoding Requirements for Different Instruction Classes
   2.8  Execution Time Requirements
        2.8.1  Performance Critical Instructions
   2.9  Method Invocation Behavior
        2.9.1  Hierarchical Method Invocations
        2.9.2  Local Variables
        2.9.3  Operand Stack
        2.9.4  Stack Primitive Operations
        2.9.5  Native Invocations
   2.10 Effects of Object Orientation
        2.10.1  Frequency of Constructors Invocations
        2.10.2  Heavily Used Classes
        2.10.3  Multithreading
        2.10.4  Memory Management Performance
   2.11 Conclusions

3  Design Space Analysis of Java Processors
   3.1  Introduction
   3.2  Design Space Trees
   3.3  Design Methodology
        3.3.1  Generality
               3.3.1.1  Java-Specific Approaches
               3.3.1.2  General-Purpose Approaches
        3.3.2  Bytecode Processing Capacity
        3.3.3  Bytecode Issuing Capacity
        3.3.4  Complex Bytecode Processing
   3.4  Execution Engine Organization
        3.4.1  Complexity
               3.4.1.1  JVM Compared to RISC Cores
               3.4.1.2  CISC Features for JVM Implementation
               3.4.1.3  Decoupling from RISC and CISC
        3.4.2  Registers
        3.4.3  Caches
        3.4.4  Stack
               3.4.4.1  Frame Organization
               3.4.4.2  Realization
               3.4.4.3  Suitable Size
               3.4.4.4  Components
               3.4.4.5  Spill/Refill Management
        3.4.5  Execution Units
   3.5  Parallelism Exploitation
        3.5.1  Pipelining
        3.5.2  Multiple-Issuing
               3.5.2.1  VLIW Organization for Bytecodes
               3.5.2.2  Superscalarity through Reservation Stations
        3.5.3  Thread Level Parallelism
   3.6  Support for High-level Language Features
        3.6.1  General Features
        3.6.2  Object Orientation
        3.6.3  Exception Handling
        3.6.4  Symbolic Resolution
        3.6.5  Garbage Collection
   3.7  Conclusions

4  Overview of the JAFARDD Microarchitecture
   4.1  Introduction
   4.2  Global Architectural Design Principles
   4.3  Design Features
   4.4  Processing Phases
   4.5  Pipeline Stages
   4.6  Overview of the JAFARDD Architecture
   4.7  Dual Processing Architecture
   4.8  Conclusions

5  An Operand Extraction (OPEX) Bytecode Folding Algorithm
   5.1  Introduction
   5.2  Folding Java Stack Bytecodes
   5.3  The OPEX Bytecode Folding Algorithm
        5.3.1  Classification of JVM Instructions
        5.3.2  Anchor Instructions
        5.3.3  Basics of the Algorithm
        5.3.4  Tagging
        5.3.5  Model Details
   5.4  Algorithm Hazards
        5.4.1  Hazard Detection
        5.4.2  Resolution by Local Variable Renaming
   5.5  Operation of the FIG Unit
        5.5.1  State Diagram
        5.5.2  Algorithm
   5.6  Special Pattern Optimizations
   5.7  Conclusions

6  Architecture Details
   6.1  Introduction
   6.2  Front End
        6.2.1  BF Architectural Design Principles
        6.2.2  An Adaptive Feedback Fetch Policy
        6.2.3  Bytecode Processing within the BF
        6.2.4  BF Architectural Block
        6.2.5  BF Operation Steps
   6.3  Folding Administration
        6.3.1  FIG Architectural Design Principles
        6.3.2  Bytecode Processing within the FIG
        6.3.3  FIG Architectural Module
        6.3.4  FIG Operation Steps
   6.4  Queuing, Folding, and Dynamic Translation
        6.4.1  Bytecode Queue Manager (BQM)
               6.4.1.1  BQM Architectural Design Principles
               6.4.1.2  BQM-Internal Operations
               6.4.1.3  Bytecode Processing within the BQM
               6.4.1.4  BQM Architectural Module
               6.4.1.5  BF Minicontroller
               6.4.1.6  BQM Operation Steps
               6.4.1.7  BQM Operation Control, Priorities, and Sequencing
        6.4.2  Folding Translator (FT)
               6.4.2.1  FT Architectural Design Principles
               6.4.2.2  Produced Instruction Format
               6.4.2.3  Bytecode Processing within the FT
               6.4.2.4  FT Architectural Module
               6.4.2.5  Non-Translated Instructions
   6.5  Dynamic Scheduling and Execution
        6.5.1  Local Variable File (LVF)
               6.5.1.1  LVF Architectural Design Principles
               6.5.1.2  LVF-Internal Operations
               6.5.1.3  Early Assignment of Instruction/LV Tags
               6.5.1.4  LVF Architectural Module
               6.5.1.5  LVF Minicontroller
               6.5.1.6  LVF Operation Steps
               6.5.1.7  LVF Operation Control, Priorities, and Sequencing
        6.5.2  Reservation Stations (RSs)
               6.5.2.1  RS Architectural Design Principles
               6.5.2.2  Tomasulo's Algorithm for Java Processors
               6.5.2.3  LV Renaming
               6.5.2.4  Instruction Shelving
               6.5.2.5  Instruction Dispatching
               6.5.2.6  RS Internal Operations
               6.5.2.7  RS Unit Architectural Module
               6.5.2.8  RS Unit Minicontroller
               6.5.2.9  RS Operation Steps
               6.5.2.10 RS Operation Control, Priorities, and Sequencing
        6.5.3  Generic Execution Unit (EX)
               6.5.3.1  EX Architectural Design Principles
               6.5.3.2  Generic EX Architectural Module
        6.5.4  Load/Store Execution Unit (LS)
               6.5.4.1  LS Architectural Design Principles
               6.5.4.2  Bytecode Processing within the LS
               6.5.4.3  LS Architectural Module
               6.5.4.4  LS Operation Steps
        6.5.5  Common Data Bus (CDB)
   6.6  Conclusions

7  A Processing Example and Performance Evaluation
   7.1  Introduction
   7.2  A Comprehensive Illustrative Example
   7.3  Speedup Formula
   7.4  Experimental Framework
        7.4.1  Study Platform
        7.4.2  Trace Processing
        7.4.3  Benchmarking
   7.5  Generated Folding Patterns
   7.6  Analysis of Folding Patterns
   7.7  Performance Enhancements
   7.8  Processor Modules at Work
   7.9  Global Picture
   7.10 Alternative Architectures
   7.11 Conclusions

8  Conclusions
   8.1  Summary
   8.2  Contributions
        8.2.1  Java Workload Characterization
        8.2.2  A Methodical Design Space Analysis
        8.2.3  The Identification of Global Architectural Design Principles
        8.2.4  Stack Dependency Resolution
        8.2.5  An Operand Extraction Bytecode Folding Algorithm
        8.2.6  An Adaptive Bytecode Fetch Policy
        8.2.7  Dynamic Binary Translation
        8.2.8  A RISC Core Driven by a Deep and Dynamic Pipeline
        8.2.9  On-Chip Local Variable File
        8.2.10 A Modified Tomasulo's Algorithm
        8.2.11 Complex Instruction Handling via a Load/Store Unit
        8.2.12 A Dual Processing Architecture
   8.3  Directions for Future Research

Bibliography

Appendix A  Folding Notation

Appendix B  Sequencing Notation

List of Tables

Table 2.1   A comparison between UltraSPARC and JVM architectures
Table 2.2   JVM-supported data types
Table 2.3   JVM addressing modes with examples
Table 2.4   Prefix codes for JBCs
Table 2.5   Classes of JVM instructions
Table 2.6   An example of JVM code generation for a Java program
Table 2.7   Dynamic frequencies of using different instruction classes
Table 2.8   Total execution time for different instruction classes
Table 2.9   The top ranked instructions in execution frequency, execution time per call, and total execution time
Table 2.10  Method invocation statistics
Table 2.11  Observations and recommendations for JVM instruction set design
Table 2.12  Observations and recommendations for JVM HLL support

Table 3.1   Summary of the recommended design features for Java hardware

Table 5.1   Foldable categories of JVM instructions
Table 5.2   Unfoldable JVM instructions
Table 5.3   Information related to each foldable JVM instruction category
Table 5.4   Different folding pattern templates recognized by the folding information generation unit (FIG)
Table 5.5   Information generated by the state machine about recognized folding templates
Table 5.6   Folding optimization by combining successive destroyer, duplicator, and/or swapper anchors

Table 6.1   Mapping different anchor instructions to the folding operations performed by the bytecode queue manager (BQM)
Table 6.2   Summary of folding operations done in the framework of terminating BQM-internal operations
Table 6.3   JVM opcodes terminated internally at the BQM
Table 6.4   Examples on the folding operations done in the framework of terminating BQM-internal operations
Table 6.5   Summary of folding operations done in the framework of non-terminating BQM-internal operations and corresponding BQM output
Table 6.6   Examples on the folding operations done in the framework of non-terminating BQM-internal operations
Table 6.7   Ministeps involved in performing BQM-internal operations
Table 6.8   Sequencing description of the ministeps involved in performing each of the BQM-internal operations
Table 6.9   Mapping different folding templates to the folding translator unit (FT) output
Table 6.10  Examples on dynamic translation for different folding templates
Table 6.11  Correspondence between no-effect, producer, consumer, and increment JVM categories and the dynamically produced instruction fields
Table 6.12  Mapping different anchor instructions to the operations performed by the local variable file (LVF)
Table 6.13  Port usage for different LVF-internal operations
Table 6.14  Mapping different folding templates to the local variable file (LVF) inputs
Table 6.15  Steps required for each LVF-internal operation
Table 6.16  Steps required for RS unit-internal operations
Table 6.17  JVM load/store anchor operations internally implemented by the LS
Table 6.18  Mapping of folding group inputs and destination tag onto the LS internal registers
Table 6.19  Steps needed to perform LS operations

Table 7.1   SPECjvm98 Java benchmark suite summary
Table 7.2   Associating JVM instruction categories with their basic requirements and mapping them to relevant modules in JAFARDD
Table 7.3   Comparison between the three approaches in supporting Java in hardware: direct stack execution, hardware interpretation, and hardware translation

Table 8.1   How JAFARDD addresses the global architectural design principles

List of Figures

Figure 1.1   Alternative approaches for running Java programs
Figure 1.2   Research methodology and dissertation road map

Figure 2.1   Sample of collected JVM trace components
Figure 2.2   JVM runtime system components
Figure 2.3   Distribution of data accesses by type
Figure 2.4   Usage of type conversion instructions
Figure 2.5   Summary of addressing mode usage
Figure 2.6   Summary of quick execution of instructions
Figure 2.7   Summary of executing instructions that may change the program counter
Figure 2.8   Statistics of conditional branches
Figure 2.9   Distribution of the effect on stack size for each instruction class
Figure 2.10  Instruction layout for JVM
Figure 2.11  Distribution of number of bits representing a LV index
Figure 2.12  Distribution of number of bits representing a CP index
Figure 2.13  Distribution of number of bits representing immediates
Figure 2.14  Distribution of dynamic bytecode count per instruction
Figure 2.15  Distribution of number of operands per instruction
Figure 2.16  Distribution of number of bits representing an array index
Figure 2.17  Distribution of number of bits representing the offset and absolute jump-to address
Figure 2.18  Instruction encoding requirements for each instruction class
Figure 2.19  Distribution of levels of method invocations
Figure 2.20  Distribution of number of LVs allocated by individual method invocations

Figure 2.21  Distribution of accumulated LVs allocated through hierarchical invocations
Figure 2.22  Overlapping and non-overlapping stack allocation schemes
Figure 2.23  Summary of individual method invocation stack sizes
Figure 2.24  Summary of stack sizes in hierarchical method invocations
Figure 2.25  Summary of stack size changes due to instruction execution
Figure 2.26  Summary of instruction effects on stack size
Figure 2.27  Summary of stack sizes ignored per method invocation
Figure 2.28  Distribution of invoking different constructors
Figure 2.29  Total execution time per Object, Thread, and String classes
Figure 2.30  Total execution time for object-specific instructions
Figure 2.31  Time requirements for memory allocations

Figure 3.1   Java processors design space
Figure 3.2   Design tree elementary primitives
Figure 3.3   Different hardware design approaches for Java processors
Figure 3.4   Execution engine organization design space
Figure 3.5   Design space of the stack
Figure 3.6   Different ways of organizing classical S-caches
Figure 3.7   Design space tree for parallelism exploitation
Figure 3.8   Design space for supporting Java high-level language features

Figure 4.1   General JBC processing phases
Figure 4.2   Pipeline stages for JBC processing
Figure 4.3   Block diagram of the JAFARDD architecture
Figure 4.4   JBC, RISC, and dual execution processor pipelines
Figure 4.5   A dual processor architecture capable of running Java and other HLLs

Figure 5.1   Steps in executing stack instructions
Figure 5.2   JBC folding example
Figure 5.3   Nested pattern folding example
Figure 5.4   An example showing the OPEX bytecode folding algorithm at work
Figure 5.5   Making incomplete folding groups complete by tagging

Figure 5.6   Checking the OPEX bytecode folding algorithm for dependency violations
Figure 5.7   WAR hazard detection and resolution
Figure 5.8   State diagram for the operation of the FIG unit
Figure 5.9   Parameters and pointers relevant to the bytecode queue and the auxiliary bytecode queue as maintained by the FIG unit

Figure 6.1   Normal and overflow BQ zones
Figure 6.2   Inputs and outputs of the BF and the I-cache
Figure 6.3   Keeping the FIG out of the main pipeline flow path
Figure 6.4   Inputs and outputs of the BQM
Figure 6.5   Inputs and outputs of the BF minicontroller
Figure 6.6   Arrangements made within the BQM to prepare for JBC folding and issuing
Figure 6.7   BQM output assembled from queued JBCs
Figure 6.8   How the BQ looks like after executing each BQM-internal operation
Figure 6.9   Generating producers while executing BQM-internal operations that put JBCs on the bytecode queue manager (BQM) output
Figure 6.10  Tag handling inside the bytecode queue manager (BQM)
Figure 6.11  Inputs and outputs of the FT and the instruction format at its output and the entrance of the dynamic scheduling and execution pipeline stage
Figure 6.12  JAFARDD native instruction format
Figure 6.13  An example of translating a folding group into a JAFARDD instruction
Figure 6.14  Inputs and outputs of the tag generation unit (TG)
Figure 6.15  Inputs and outputs of the local variable file (LVF)
Figure 6.16  Inputs and outputs of the LVF minicontroller
Figure 6.17  A register entry and a reservation station entry in Tomasulo's algorithm
Figure 6.18  Design space for LV renaming
Figure 6.19  Design space for shelving
Figure 6.20  Design space for instruction dispatch scheme
Figure 6.21  Inputs and outputs of a reservation station (RS) unit

Figure 6.22  Inputs and outputs of the immediate/LV selector
Figure 6.23  Inputs and outputs of the RS unit minicontroller
Figure 6.24  Inputs and outputs of a generic execution unit (EX)
Figure 6.25  Inputs and outputs to the LS unit
Figure 6.26  Inputs and outputs to the CDB arbiter

Figure 7.1   A JBC trace example
Figure 7.2   Percentages of occurrence of different instruction categories
Figure 7.3   Percentages of occurrence of different folding cases recognized by the folding information generation (FIG) unit
Figure 7.4   Percentages of patterns that are recognized and folded by JAFARDD
Figure 7.5   Percentages of occurrence of anchors with different numbers of producers (u) and consumers (r)
Figure 7.6   Percentages of tagged patterns among all folded patterns
Figure 7.7   Percentages of nested patterns among all folded patterns
Figure 7.8   Percentages of eliminated instructions relative to all instructions and relative to stack instructions (producers and non-anchor consumers) only
Figure 7.9   Speedup of folding
Figure 7.10  Percentages of occurrence of different folding operations performed by the bytecode queue manager (BQM)
Figure 7.11  Percentages of occurrence of different folding patterns at the output of the folding translator unit (FT)
Figure 7.12  Percentages of occurrence of different operations performed by the local variable file (LVF)
Figure 7.13  Percentages of occurrence of different folding patterns processed by the load/store unit (LS)
Figure 7.14  Percentages of usage of different execution units (EXs)

List of Abbreviations

ALU      Arithmetic and Logic Unit
API      Application Programming Interface
ASIC     Application-Specific Integrated Circuit
BC       Bytecode Counter
BF       Bytecode Fetch Unit
BQ       Bytecode Queue
BQM      Bytecode Queue Manager
BR       Branch Unit
CDB      Common Data Bus
CFG      Complete Folding Group
CISC     Complex Instruction Set Computer
CN       Constant
CP       Constant Pool
CPI      Average Clock Cycles Per Instruction
CPU      Central Processing Unit
D-cache  Data Cache
EX       Execution Unit
FIG      Folding Information Generation Unit
FIQ      Folding Information Queue
FP       Floating Point Unit
FRAME    Stack Frame Base Register
FT       Folding Translator
FSM      Finite State Machine
GUI      Graphical User Interface
HLL      High-Level Language
I-cache  Instruction Cache
IFG      Incomplete Folding Group

ILP      Instruction Level Parallelism
ISA      Instruction Set Architecture
JAFARDD  A Java Architecture based on a Folding Algorithm, with Reservation Stations, Dynamic Translation, and Dual Processing
JBC      Java Bytecode
JDK      Java Development Kit
JIT      Just-in-Time
JPC      Java Instruction Per Clock Cycle
JVM      Java Virtual Machine
LS       Load/Store Unit
LV       Local Variable
LVF      Local Variable File
MI       Multi-Cycle Integer Unit
MMU      Memory Management Unit
OOO      Out-Of-Order
OPEX     Operand Extraction
OPTOP    Top of Operand Stack Register
OS       Operating System
PC       Program Counter
RAW      Read-After-Write
RISC     Reduced Instruction Set Computer
RS       Reservation Station
S-cache  Operand Stack Cache
SI       Single-Cycle Integer Unit
SIMD     Single Instruction Multiple Data
TLP      Thread Level Parallelism
TG       Tag Generation Unit
VARS     Local Variable Base Register
VLIW     Very Long Instruction Word
WAR      Write-After-Read
WAW      Write-After-Write

Trademarks

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Trademarks and registered trademarks used in this work, of which the author is aware, are listed below. All other trademarks are the property of their respective owners.

aJ-100 is a registered trademark of aJile Systems, Inc.
Bigfoot is a registered trademark of Digital Communication Technologies (DCT) company
Cjip is a registered trademark of Imsys company
GP1000 is a registered trademark of Imsys company
Java is a registered trademark of Sun Microsystems, Inc.
JEM1 is a registered trademark of Rockwell Automation, Inc.
JEM2 is a registered trademark of Rockwell Automation, Inc.
JSCP is a registered trademark of Negev Software Industries (NSI), Ltd.
Lightfoot is a registered trademark of Digital Communication Technologies (DCT) company
MAJC is a registered trademark of Sun Microsystems, Inc.
microJava is a registered trademark of Sun Microsystems, Inc.
MIPS is a registered trademark of MIPS Technologies, Inc.
PA-RISC is a registered trademark of HP company
PSC1000 is a registered trademark of Patriot Scientific (PTSC) company
picoJava-I is a registered trademark of Sun Microsystems, Inc.
picoJava-II is a registered trademark of Sun Microsystems, Inc.
RX000 is a registered trademark of MIPS Technologies, Inc.
StrongARM is a registered trademark of ARM company
VAX is a registered trademark of HP company

Acknowledgement

All praise be to Allah the High, "who teacheth by the pen, teacheth man that which he knew not.", Quran [96:4, 96:5]. I say what Prophet Solomon said: "O my Lord! so order me that I may be grateful for Thy favours, which thou hast bestowed on me and on my parents, and that I may work the righteousness that will please Thee: And admit me, by Thy Grace, to the ranks of Thy righteous Servants.", Quran [27:19].

I would like to express my deepest appreciation to my dissertation supervisors, Dr. Fayez Gebali and Dr. Kin Li. I greatly valued the freedom and flexibility with which they entrusted me, the generous financial support they provided me, and the environment they made available to me for learning and quality research. I also thank them for helping me improve my technical writing skills. I have benefited substantially as a result of their many comments and reviews. They have always been patient and supportive. Perhaps most of all, they were as much friends as they were mentors. Their confidence in me and my abilities was an incentive for me to finish this work.

Next, I am very grateful to Professors Nikitas Dimopoulos and Michael Miller for serving on my supervisory committee, and Dr. Hussein Alnuweiri for agreeing to be the external examiner in my oral examination. Their time and effort are highly appreciated.

I will be forever indebted to Dr. Dave Berry, Dr. Nigel Horspool, Dr. Micaela Serra, Dr. Ali Shoja, Dr. Issa Traore, and Dr. Geraldine Van Gyn, for seeing the potential in me, and for giving me the opportunity to gain an enormous teaching experience and improve it.

This dissertation would never have been written without the generous help and support that I have received from numerous colleagues along the way. I would now like to take this opportunity to express my sincerest thanks to: Mohamed Abdallah, Ahmad Afsahi, Mostofa Akbar, Casey Best, Zeljko Blazek, Claudio Costi, John Dorocicz, Tim Ducharme, Hossam Fattah, Jeff Homsberger, Ekram Hossain, Ken Kent, Erik Laxdal, Akif Nazar, Stephen Neville, Newaz Rafiq, Hamdi Sheibani, Caedmon Somers, and Michael Zastre.

At this point, I would like to thank Lynne Barrett, Moneca Bracken, Steve Campbell, Isabel Campos, Nancy M. Chan, Kevin Jones, Vicky Smith, Mary-Anne Teo, Nanyan Wang, and Xiaofeng Wang, all of whom helped me in some capacity along the way.

Lastly, my special thanks extend to all my friends: Yousry Abdel-Hamid, Saleh Arhim, Mohamed Fayed, Yousif Hamad, and Mohamed Yasein, who made me feel at home!

Dedication

To all my students, who always motivate me to excel!

Chapter 1

Introduction

Java is a general-purpose object-oriented programming language with syntax similar to C++. The Java language and its virtual machine are excellent applications of dynamic object-oriented ideas and techniques to a C-based language.

Java was introduced to aid the development of software in heterogeneous environments, which requires a platform-independent and secure paradigm. Additionally, to enable efficient exchange over the Internet, Java programs need to be small, fast, secure, and portable. These needs led to a design that is rather different from established practice. However, instead of serving as a testbed for new and experimental software technology, Java combines technologies that have already been tried and proven in other languages.

The authors of Java wrote an influential white paper explaining their design goals and accomplishments [1]. In summary, Java is most notable for its simplicity, robustness, safety, platform independence and portability, object-orientation, strict type checking, support of runtime code loading, forbidden direct memory access, automatic garbage collection, structured exception handling, distributed operation, and multithreading. Besides this, Java also rids itself of many extraneous language features, thus offering safe execution [2, 3, 4, 5]. These characteristics, encapsulated in the write once, run anywhere promise, make Java an ideal tool and a current de facto standard for writing web applications that can run on any client CPU. Effectively, Java operates in a server-based mode: zero-cost client administration with applications downloaded to clients on demand.

This introductory chapter is organized as follows. Section 1.1 presents a brief introduction to the Java virtual machine. Research motivations and objectives are summarized in Sections 1.2 and 1.3. Section 1.4 discusses the challenges expected in designing hardware support for Java. Finally, we detail the adopted methodology and dissertation road map.

1.1 The Java Virtual Machine

Java's portability is attained by running Java on an intermediate virtual platform instead of compiling it to code specific to a particular processor [6, 7, 8]. The Java Virtual Machine (JVM) is the name coined by Sun Microsystems for Java's underlying hypothetical target architecture. The JVM is virtual because it is defined by an implementation-independent specification. Java programs are first compiled to an intermediate, processor-independent bytecode representation. Java bytecodes (JBCs) are then interpreted at runtime into native code for execution on any processor-operating system (OS) combination that has an implementation of the JVM specification, either emulated in software or supported directly in hardware. The JVM provides a cross-platform package not only in the horizontal dimension, but also in the temporal one: regardless of what new platforms appear, only the JVM and the Java compiler need to be ported, not the applications [4, 6, 7, 8, 9, 10, 11].
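To make this concrete, the toy example below (not taken from the dissertation; the class and method names are invented) shows the kind of processor-independent bytecode stream that a Java compiler produces and that any JVM implementation, in software or hardware, then executes. The disassembly in the comments is the style of listing produced by the JDK's javap -c tool.

```java
// Illustrative only: a trivial Java method and the bytecodes javac emits for it.
// The class and method names are invented; the mnemonics are standard JVM ones.
class Portable {
    static int add(int a, int b) {
        return a + b;
        // javap -c Portable (method add):
        //   0: iload_0    // push local variable 0 (a) onto the operand stack
        //   1: iload_1    // push local variable 1 (b)
        //   2: iadd       // pop both operands, push a + b
        //   3: ireturn    // pop the sum and return it
        // The same .class file runs unchanged on any JVM implementation.
    }
}
```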

1.1.1 Different Bytecode Manipulation Methods

The execution of a Java program could take one of many alternative routes that map the virtual machine to the native one [12]. Figure 1.1 contrasts these routes:

• Interpretation. Java interpreters translate JBCs on the fly into the underlying CPU native code, emulating the virtual machine in software. Interpretation is simple, relatively easy to implement on any existing processor, and does not require much memory. However, it involves a time-consuming loop that fetches, decodes, and executes the bytecodes until the end of the program. This software emulation of the JVM results in executing more instructions than just the JBCs, which affects performance significantly. In addition, the software decoder, which is usually implemented as one big case statement, results in inefficient use of hardware resources. Moreover, interpretation decreases the locality of the I-cache and branch prediction, as it considers all JBCs as just data [13].

• Just-in-time (JIT) compilation. JIT compilers translate JBCs at runtime into native code for future execution [14, 15, 16]. However, they do not have to translate the same code over and over again, because they cache the translated native code. This can result in significant performance speedups over an interpreter. But a JIT compiler sometimes takes an unacceptable amount of time to do its job and expands code size considerably. Generally, a JIT compiler does not speed up Java applications significantly, because it does not incorporate aggressive optimizations, in order to avoid time and memory overheads. Additionally, the generated code depends on the target processor, which complicates the porting process [17, 18, 19, 20].

• Adaptive compilation. An adaptive compiler behaves like a programmer who profiles a piece of code and optimizes only its time-critical portions (hot spots). Employing an adaptive compiler, the virtual machine compiles only in a JIT fashion and optimizes the hot spots. Although a hot spot can move at runtime, only a small part of the program is compiled on the fly. Thus, the program memory footprint remains small and more time is available to perform optimizations [21].

• Off-line compilation. These compilers convert JBCs to native machine code just before execution, which requires all classes to be distributed and compiled prior to use. Since this process is performed off-line and the resultant code is saved on a disk, additional time may be devoted to optimizations [21].

• Native compilation. Another way of getting Java programs executed is to compile them to the underlying machine's native code. This makes Java lose its platform independence. As an alternative to going directly to processor native code, some compilers produce C code. This allows using all the advanced optimization techniques of the underlying C compiler [21].

• Direct native execution. Dedicated processors that directly execute JBCs without the overhead of interpretation or JIT compilation could yield the best performance for more complex Java applications. They can deliver much better performance by providing special Java-centric optimizations that make efficient use of processor resources, like the cache and the branch prediction unit [22]. Hardware support for Java is the topic of study in this dissertation.

Figure 1.1. Alternative approaches for running Java programs.

1.1.2 Performance-Hindering Features

Java's unique features, which make it so appealing, also contribute to its poor performance:

• Productivity. The Java environment provides some features that increase programmer productivity at the cost of runtime efficiency, e.g., checking array bounds.

• Abstraction. The JVM contains a software representation of a CPU, complete with its own instruction set. This hardware abstraction contributes a large factor to Java's slow performance.

• Stack architecture. The JVM is based on a stack architecture. This architecture was chosen to facilitate the generation of compact code, for reasons such as minimizing bandwidth requirements and download times over the Internet, and to be compatible on different platforms [23, 24]. However, stack referencing creates a bottleneck that adversely affects performance (a short illustration follows this list).

• Runtime code translation. JBCs are translated either by an interpreter or a JIT compiler at runtime into native code. This extra step in execution slows down the runtime performance of Java programs.

• Dynamic nature. Being a dynamically bound object-oriented language, Java suffers from the overhead of loading classes.

• Security. Effective, but restrictive, security measures demand many runtime checks that affect performance.

• Exceptions. Java employs a large number of runtime checks and exception generation, including array-bounds exceptions, null pointers, division-by-zero, illegal string-to-number conversions, invalid type casting, etc. These consume a significant portion of the execution time.

• Multithreading. Incorporating multithreading at the language construct level imposes synchronization overhead.

• Garbage collection. Managing the memory system at runtime consumes CPU time and resources.
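The following sketch, an assumed illustration rather than an excerpt from this work, shows why the stack model limits instruction level parallelism: every intermediate value is routed through the operand stack, so consecutive bytecodes are chained by stack reads and writes. The "folded" register-style form in the last comment is only schematic; the precise folding mechanism is developed in later chapters.

```java
// Schematic illustration (assumed example, not from the dissertation) of how
// stack referencing serializes simple work, and how folding could remove it.
class StackBottleneck {
    static int sum(int b, int c) {
        int a = b + c;
        // Bytecodes produced by javac for "int a = b + c;":
        //   iload_0    // push b                 (stack write)
        //   iload_1    // push c                 (stack write)
        //   iadd       // pop b and c, push b+c  (two stack reads, one write)
        //   istore_2   // pop b+c into a         (stack read)
        // Four instructions touch the operand stack six times, and each one
        // depends on the stack state left by its predecessor, so a naive
        // stack machine cannot overlap them.
        //
        // A folded, register-style equivalent (schematic only):
        //   add LV2, LV0, LV1   // a = b + c in one operation, no stack traffic
        return a;
    }
}
```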

1.2 Motivations

Java's virtual arrangement and its special features that make it ideal for network software applications have a tremendous negative impact on its performance [25, 26]. The performance gap between JBCs and optimized native binaries is very large. To execute Java code, the normal instruction cycle (fetch, decode, etc.) is repeated twice: at the virtual machine and at the host hardware. Radhakrishnan et al. have shown that a Java interpreter requires, on average, 35 SPARC instructions to emulate each JBC instruction [25]. Adding to that the overhead of fetching and decoding the bytecodes at the virtual machine level, we conclude that the main bottleneck in executing Java is the emulation in software. Compiling Java in advance can greatly improve its performance; however, this method leaves Java no more portable than other programming languages.

A number of approaches have been proposed to enhance Java performance, and among these, hardware support has many distinct advantages [27, 28, 29, 30, 31, 32, 33]. Reducing the virtual machine thickness by moving some of its functionality into hardware will enhance Java's performance [12, 22, 34, 35, 36, 37, 38, 39].

Java is neither the first high-level language (HLL) to run on top of a virtual machine nor the first one to be implemented in hardware. The idea of converting virtual machines to real ones, or of supporting an HLL in hardware, has been tried before, but with mixed results. Even in the early days of computing, computer designers were looking for ways to support HLLs. This led to three distinct periods. The stack architectures that were popular in the 1960s represented the first good match for HLLs. However, they were withdrawn in the 1980s, except for the x86 floating point architecture, which combines a stack and a register set. Java hardware, however, is now reviving these concepts again [40, 41]. The second period, which took place in the 1970s, replaced some of the software functionality with hardware in order to reduce software development and execution costs. This provided high-level architectures that could simplify the task of software designers, like that in DEC's VAX. In the 1980s, the third period, sophisticated compilers permitted the use of simple load-store RISC machines [42].

If we extrapolate to the next decade of computer design, we might expect new directions that serve contemporary software features and anticipated computational workloads. Modern applications incorporate new paradigms (e.g., object-orientation), evolving functionality (e.g., internetworking and the web), vital modules (e.g., garbage collection), and necessary application requirements (e.g., security). In our opinion, these features will probably require processor designers to support them at the hardware level. This has triggered Java, multimedia, digital signal processing, and network extensions to general-purpose processor design, which will lead to overall system acceleration and better performance.

The transition from the traditional desktop computing paradigm to a secure, portable model opens up an unprecedented opportunity for Java processors [43, 44, 45, 46, 47]. Java processors will enrich the design of embedded multimedia devices and smart cards, making the on-demand delivery of a wide variety of services a reality [48, 49, 50, 51, 52]. Applications like e-commerce, remote banking, wireless hand-held devices, information appliances, and Internet peripherals are just a few examples of systems that are awaiting, and would benefit from, research that provides hardware support for Java [53, 54, 55, 56, 57, 58, 59, 60].

1.3 Research Objectives

This dissertation explores the feasibility of accommodating Java execution in the framework of a modern processor. We present the JAFARDD processor, a Java Architecture based on a Folding Algorithm, with Reservation Stations, Dynamic Translation, and Dual Processing, to accelerate Java bytecode execution. Our research aims at designing processors that:

• Provide better Java performance. Tailoring general-purpose processors to Java requirements will enable them to deliver much better Java performance than processors designed to run C or any other language.

• Are general. By providing primitives required for generic HLL concepts, these processors will also be suitable for non-Java programming languages.

• Implement the Java virtual machine logically. A logical hardware comprehension capability of the JBCs will narrow the semantic gap between the virtual machine and the native one. This allows Java's distinguished features to be efficiently utilized at the processor level while continuing to support other programming languages.

• Utilize a RISC core. Instead of having a complete Java-specific processor, dynamically translating Java binaries to RISC instructions facilitates the use of a typical RISC core that executes simple instructions quickly. This approach also enables the exploration of instruction level parallelism (ILP) among the translated instructions using well-established techniques and facilitates the migration to Java-enabled hardware.

• Handle stack operations intelligently. Innovative ideas like bytecode folding and dynamic translation will be incorporated to accommodate stack operations in an efficient, register-based RISC framework. Our objective is to provide a stackless architecture that is capable of executing stack-based JBCs. This will reduce the negative impact of the JVM's stack architecture.

• Extract parallelism transparently. From the programmer's point of view, the processor will appear as a JVM. However, parallelism will be extracted transparently through techniques like bytecode folding, with no programming overhead.

1.4 Design Challenges

Designing processors that are optimized for Java brings up some new ideas in hardware design that have not been addressed effectively and completely before, and that are thus challenging for any such attempt [12, 22]. These hardware optimization issues include:

• Maintaining processor generality. Although Java-enabled processors are required to support Java itself, performance degradation for any non-Java application might not be affordable. This necessity is a challenge for any microarchitecture design.

• Handling the JVM's variable-length instructions. The JVM's variable instruction length makes instruction fetching and decoding difficult, as it requires a large amount of pre-decoding and caching of previous execution properties.

• Overcoming stack architecture deficiency. In a direct JVM stack realization, stack access consumes extra clock cycles. Furthermore, stack referencing introduces virtual dependencies between successive instructions that limit ILP. Matters become worse with processors that do not have an on-chip dedicated stack cache, as the data cache has to be accessed during the processing of almost every instruction. This consumes more clock cycles, especially if cache misses are encountered. Therefore, innovative design ideas are required to handle the stack bottleneck.

• Processing complex JVM instructions. Although the JVM's intermediate instruction set is not at the same complexity level as that of an HLL, it does contain some operations that are more complex than a regular CISC or RISC instruction (e.g., method invocations and object-oriented instructions). It is a challenge to achieve a high level of performance given the overhead of these instructions. Their execution consumes many clock cycles and involves a number of memory accesses.

• Coping with naturally sequential operations. Java uses a large number of inherently sequential and sophisticated operations, e.g., method invocations and constant pool resolution. These operations require many clock cycles and memory references for completion. Coping with them in an inexpensive way constitutes a major challenge for achieving a high processing throughput [61].

• Managing excessive hardware requirements. Achieving a high level of performance by supporting Java in hardware requires more on-chip modules. This extra hardware may slow down the execution speed and result in a large core die size, making it a challenge to compete with other RISC processors.

1.5 Methodology and Road Map

The work performed in this research is best presented and viewed as a seven-stage process, as shown in Figure 1.2, which also highlights the contributions of each stage as well as the iterative feedback paths in the design process. Work done at each stage is organized and presented in a subsequent chapter.

Figure 1.2. Research methodology and dissertation road map. The number in () indicates the stage, whereas the number in [] indicates the chapter in the dissertation.

Designing hardware for Java requires an extensive knowledge of the JVM internals. At the first stage, benchmarking the Java virtual machine, we conducted a comprehensive behavioral analysis of the Java instruction set architecture through benchmarking. Meaningful information about access patterns for data types and addressing modes, instruction set utilization, instruction encoding, execution time requirements, method invocation behavior, and the effect of object orientation was collected and analyzed. This stage, presented in Chapter 2, resulted in a clearer understanding of the Java workload characterization and the architectural requirements of Java hardware.

The second stage, design space analysis, included a global analysis of the design space of hardware support for Java. At this stage, presented in Chapter 3, we explored different hardware design options that are suitable for Java by examining the design methodology, execution engine organization, parallelism exploitation, and support for HLL features. We weighed different design alternatives and highlighted their trade-offs.

Chapter 4 documents two stages: global architectural design principles and the JAFARDD outline. We compiled the Java workload characterization obtained in the first stage and the design space exploration obtained in the second stage into a list of architectural design principles at the global level that are necessary to ensure JAFARDD can execute Java efficiently. Based on the outcome of these two stages, we proposed the JAFARDD processor.

Results gathered from benchmarking the JVM confirmed that the main bottleneck in executing Java is the underlying stack architecture. To overcome this deficiency, we introduced the operand extraction (OPEX) bytecode folding algorithm in the fifth stage. This folding algorithm permits nested pattern folding, tolerates variations in folding groups, and detects and resolves folding hazards completely. By incorporating this algorithm into a Java processor, the need for, and therefore the limitations of, a stack are eliminated. Chapter 5 presents the operation details of this folding algorithm.

In the sixth stage, presented in Chapter 6, the JAFARDD architecture details are studied. We discuss the distinguishing features of JAFARDD in that chapter, emphasizing the instruction pipeline modules.

Finally, in the performance evaluation stage, the functionality of JAFARDD was successfully demonstrated through VHDL modeling and simulation. Benchmarking of our proposal using SPECjvm98 to assess performance gains was also carried out. Chapter 7 summarizes the findings. The dissertation ends with conclusions and future work in Chapter 8.

Chapter 2

Java Processor Architectural Requirements: A Quantitative Study

2.1 Introduction

Designing hardware for Java requires an extensive working knowledge of its virtual machine organization and functionality. The JVM instruction set architecture (ISA) defines categories of operations that manipulate several data types and uses a well-defined set of addressing modes [6, 7, 62]. The JVM specification defines the instruction encoding mechanism required to package this information into a bytecode stream. It also includes details about the different modules needed to process these bytecodes. At runtime, the JVM implementation and the execution environment affect instruction execution performance. This is manifested directly in the wall-clock time needed to perform a certain task, and indirectly in the various overheads associated with executing the job [8].

While the JVM ISA shares many general aspects with traditional processors, it also has its own distinguishing features, because the JVM is an intermediate layer for an HLL. For example, a generic branch prediction hardware mechanism affects the processing of all programming languages, including Java. On the other hand, method invocation handling could be specific to the JVM's stack model.

The goal of this chapter is to conduct a comprehensive behavioral analysis of the JVM ISA and its support for Java as an HLL. Benchmarking the Java ISA reveals its execution characteristics. We will analyze access patterns for data types and addressing modes, as well as instruction encoding parameters. Additionally, the characteristics of executed instructions will be measured and the utilization of Java classes will be assessed. Recommendations for hardware improvements and encoding formats will be provided.

In order to carry out the analysis, a Java interpreter is instrumented to produce a benchmark trace. Meaningful data is collected and analyzed. General architectural requirements for a Java processor are then suggested. In doing this study, we followed the methodology used by Patterson and Hennessy in studying instruction set design [42].

The study of the Java ISA is an important part of improving its performance. Our rationale for conducting such a study is based on the observation that modern programs spend 80-90% of their time accessing only 10-20% of the ISA [42]. To be most effective, optimization efforts should focus on just the 10-20% that really matters to the execution speed of a typical program [63]. The results collected here may be used to devise a bytecode encoding scheme that is suitable for a broad range of Java-supporting CPUs. The results may also affect the internal datapath design of a Java architecture.

This chapter is organized as follows. The experimental framework is explained in Section 2.2. Section 2.3 is a brief introductory description of the JVM ISA. The analysis of the JVM instruction set design is discussed in Sections 2.4 to 2.7. Sections 2.6 to 2.10 examine HLL support at the processor level through the analysis of instruction set utilization, instruction execution time, method invocation behavior, and the effect of object orientation. Section 2.11 draws related conclusions.

2.2 Experiment Framework

Here, we explain how the code trace is generated and behavioral information is extracted.

2.2.1 Study Platform

The machine used in this study is an UltraSPARC 1/140 that has a single UltraSPARC I processor running at 143 MHz with 64 Mbytes of memory. The OS is Solaris 2.6 Hardware 3/98. We used the Java compiler and interpreter of Sun's Java Development Kit (JDK) version 1.1.3. In order to gain some insight into the benchmark platform, Table 2.1 examines the architectural features of the UltraSPARC that map well to some of the JVM's characteristics, which might affect JBC execution [64].

Table 2.1. A comparison between UltraSPARC and JVM architectures.

UltraSPARC features | Corresponding JVM characteristics
64-bit architecture | 32-bit architecture
32-bit instruction length | Variable number of bytecodes
32-bit registers | Majority of the data types are 32 bits, which could be mapped easily on these registers; the rest are 64 bits
Supports 8-, 16-, 32-, and 64-bit integers | Supports all these data types, plus characters, single and double precision floating points, references, and return values
Provides signed (in two's complement) and unsigned operations | Does not provide unsigned operations; signed operations are done in two's complement
Requires memory alignment | Does not enforce any alignment
Big endian architecture | Big endian virtual architecture
Instructions use triadic register addresses | This could be used in software folding [65]
Addressing modes: register-register and register-immediate | Local variables and stack entries could be mapped onto processor registers to use these two addressing modes
Stack is allocated in memory; no hardware stack is provided | A stack-based machine
Supports both single and double precision floating-point operations | Supports both single and double precision floating-point operations
Instruction classes: load/store, read/write, ALU, control flow, control register, floating point operations, and coprocessor operations | Instruction classes: scalar data transfer, stack, ALU, object manipulation, control flow, and other complex ones
Incorporates dynamic branch prediction | This could help executing conditional branches
Windowed register files | This could help in nested method invocation set-up
Has on-chip graphics support | JVM class libraries include a comprehensive set of graphics packages
Provides atomic read-then-set memory operation and spin locks for synchronization | Multithreading synchronization is done via monitors, which could benefit from these hardware primitives
Has an on-chip MMU | Requires extensive memory management at runtime
Solaris does not support garbage collection | Incorporates garbage collection

2.2.2 Trace Generation

A Java interpreter was instrumented by inserting probes to produce the required trace. This enabled information gathering when the interpreter fetched JBCs and started executing them. Inserting trace-collecting statements requires access to the source code of a JVM implementation. For this purpose, we obtained a licensed source release for JDK version 1.1.3 from Sun [66].

2.2.2.1 JDK Core Organization

JDK is organized into a number of modules to implement the JVM. The module of interest is the core itself, which is responsible for executing Java methods' bytecodes. This core is implemented in a single file, executeJava.c. The main method in it is executeJava(). The body of this function is a long infinite loop that emulates the work of a processor. The pseudocode in Algorithm 2.1 shows the different stages in this infinite loop. The execute stage may involve fetching other bytecodes to retrieve any required operands and/or writing results back. If the executed opcode involves branching to a new location or invoking another method, the program counter is updated. The loop also contains some other advanced stages to deal with exceptions and monitors.

Algorithm 2.1 Pseudocode for the JVM execution engine.

    initialize the program counter and the stack top pointer;
    while (true) {
        /* Fetch: retrieves the bytecode that the PC points to */
        opcode = memory[program counter];
        /* Decode: interprets the fetched opcode and picks the appropriate action */
        switch (opcode) {
            case opcode_xxx:
                START_TIMER;                    /* statement added for trace generation */
                /* Execute: issues native instructions that perform the opcode function */
                opcode execution statements;
                STOP_TIMER;                     /* statement added for trace generation */
                TRACE_COLLECTING_STATEMENTS;    /* statement added for trace collection */
                /* Pointer update */
                adjust program counter;
                adjust stack top pointer;
                break;
            ...
        }
        advanced stages;
    }

2.2.2.2 Core Instrumentation

The Sun source code uses a simple trace statement that produces limited information. We changed this to collect more data. This trace statement was placed inside the switch-case statement just after opcode execution and before pointer update, as shown in Algorithm 2.1. An accurate mechanism was needed to determine the dynamic execution time for each of the executed opcodes. To ensure minimal overhead, the timer started just before and stopped immediately after opcode execution, as shown in Algorithm 2.1. Another critical issue was the need to use a high resolution timer since most instructions execute in the order of microseconds. A high resolution timer in Sun Solaris was used. Accessed via the system call gethrtime(), the timer reports time in nanoseconds.

2.2.2.3 Execution Trace Components

Figure 2.1 shows an annotated trace sample, including examples of all the collected components. The trace shown is more than a collection of dynamically executed instructions; it contains information about the stack state, branch prediction, etc. The binary information is converted into an easy-to-understand symbolic form (e.g., each constant pool (CP) index was accompanied by the symbolic name of the referenced item). The figure illustrates the level of detail in the collected information, which may help in making hardware decisions.

2.2.3 Trace Processing

A benchmark was run on the instrumented bytecode interpreter. This produced a collection of raw data stored in the form of lines, each documenting the trace of executing one JVM instruction. These data were then consolidated in an analyzable form for each component of interest. Each JVM instruction has certain properties, including the mnemonic, opcode, addressing mode, class and subclass, sub-subclass, data type, and data type size. Information was extracted from each trace line and stored in a JVM database. Queries were applied to the database system to obtain statistical information, such as data type utilization, etc. Statistical information was then converted into a graphical form for easier interpretation.

[Annotated trace sample: each trace line records the program counter, time consumed (nsec), number of JBCs, stack change, opcode, accessed local variable number, jump offset and jump-to address, number of arguments, immediate, branch status (taken/not taken), CP index, array index, and the effect on stack size (push, pop, push and pop, or no effect); method entry and exit events and the current thread start address are also logged.]

Figure 2.1. Sample of collected JVM trace components.

2.2.4 Benchmarking

Pendragon Software's CaffeineMark 3.0 was used as the benchmark since it is computationally interesting, exercises various aspects of the JVM, and was the most commonly used benchmark at the time of our study [67]. It is a synthetic benchmark suite that runs nine tests: Sieve (finds prime numbers using the classic Sieve of Eratosthenes), Loop (uses sorting and sequence generation to measure loop performance), Logic (executes decision-making instructions), String (manipulates strings to test symbolic processing capabilities), Float (simulates a 3D rotation of objects around a point), Method (executes methods recursively to see how well the JVM handles method invocations), Graphics (draws random rectangles and lines), Image (draws a sequence of 3D graphics repeatedly), and Dialog (writes a set of values into labels and editboxes on a form). CaffeineMark involves extensive use of computations, in both integer and floating point formats. It also uses the graphic classes of the application programming interface (API). In addition, it employs a large amount of data movement, as well as garbage collection in array manipulation. It also evaluates the performance of executing object-oriented instructions. An important advantage of CaffeineMark is that it does not require any user input, so it is ideal for measuring user-independent performance. It consists of 21 classes having a total of 32,094 static JBCs. The generated trace produced 7,522,280 trace lines, each representing an individual executed instruction.

2.3 JVM Instruction Set Architecture (ISA)

The JVM specification defines an ISA and some functional units to be implemented either virtually (using software emulation) or physically (as a part of the host processor). The ISA defines categories of operations that process several data types and a well defined set of addressing modes. The JVM specification also defines the instruction encoding mechanism required to package this information into the bytecode stream. Figure 2.2 gives a detailed block diagram of the software implementation of the JVM. The difference between this organization and a hardware implementation is that in the software scheme, JVM modules are emulated, while in hardware-based Java systems, different modules can exist physically within the processor. The JVM is based on a stack architecture. As a zero-address machine with no accumulator, all operations on data utilize the stack. One of the most pervasive architectural impacts of the stack architecture is on instruction encoding and the number of instructions needed to perform a task. The instruction format tends to be simple and easy to encode, and yields a small average instruction length. However, it requires a higher instruction count. But above all, the stack organization improves portability by making JVM implementation on a wide variety of host architectures possible [23]. It is worth mentioning that the instruction set's stack-centered approach was chosen over a register-centered approach since it lends itself to architectures with only a few or irregular registers, e.g., the Intel 80x86. In this section, we highlight the architectural view of the JVM.¹

¹Discussion of native method support, the exception handler, the threading manager, the security manager, and the garbage collector is out of the scope of this chapter. Here, we discuss only the modules in Figure 2.2 that are relevant to this dissertation.

[Block diagram: class files arrive from the network or disk through the class loader; the runtime data area comprises the method area (with the class constant pool), the heap, the registers (PC, VARS, OPTOP, FRAME), the Java stack (local variables, operand stack, frame data), and the native method stack; the execution engine interacts with the Java API class libraries and the native method libraries; the runtime support modules include the garbage collector, security manager, bytecode verifier, threading manager, exception handler, and native method support.]

Figure 2.2. JVM runtime system components.

2.3.1 Class Files and Their Loader

Java programs are compiled into Java class files. This is the data format used to transmit Java programs locally or over the Internet. Class files contain several critical pieces of code and static data: class (or interface) method bytecodes, a symbolic reference to the class's superclass, a list of the fields defined by the class, a class CP, and any other data required by the runtime system [68]. The JVM specification precisely defines the layout of the information stored in the class file. This is to guarantee transportability to all machines. Once loaded into memory, however, this information can be stored in any format needed by a particular JVM implementation. The class loader module is employed to load classes and interfaces into a runtime instantiation of the JVM. Some classes, such as built-in classes, are loaded from the disk. Others may be loaded over the Internet. Specifically, class loaders are responsible for loading, linking (including verification, preparation, and resolution), and initialization of classes. These actions are usually carried out dynamically the first time a class is referenced.
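As a concrete illustration of the loading step, the following is a minimal, hypothetical sketch (not taken from the JDK sources) of a user-defined class loader built on the standard java.lang.ClassLoader API; the DirectoryClassLoader name, the readClassBytes helper, and the file-based lookup are assumptions made for this example.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Minimal sketch: a class loader that reads a .class file from a directory
    // and hands the raw bytes to defineClass(); the linking steps (verification,
    // preparation, resolution) are then carried out as the class is prepared and used.
    public class DirectoryClassLoader extends ClassLoader {
        private final Path root;   // directory holding the class files (assumption)

        public DirectoryClassLoader(Path root) {
            this.root = root;
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            try {
                byte[] bytes = readClassBytes(name);
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }

        private byte[] readClassBytes(String name) throws IOException {
            // e.g., "pkg.Foo" -> <root>/pkg/Foo.class
            return Files.readAllBytes(root.resolve(name.replace('.', '/') + ".class"));
        }
    }

Calling loadClass("pkg.Foo") on such a loader would trigger loading the first time the class is referenced, with initialization deferred until first use, mirroring the dynamic behavior described above.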

2.3.2 Execution Engine

The Java execution engine acts as the machine's virtual processor. This piece of software or hardware receives a stream of bytecodes that make up a method, decodes them, and actually carries out the instructions contained in them. The JBCs are executed one at a time. An execution engine fetches an opcode and its operands, if any. It executes the action specified by the opcode and then fetches the next opcode. Execution of bytecodes continues until a thread completes, either by returning from its starting method or by throwing an exception that is not caught.

2.3.3 Runtime Data Areas

As Figure 2.2 shows, the JVM memory hierarchy is divided logically into the four Java runtime data areas listed below (in addition to a native method stack). For software implementations of the JVM, all these data areas reside in the main memory. However, hardware implementations may elect to move some or all parts into hardware.

• Method area Information extracted from parsing loaded class files (such as JBCs, types, static data, etc.) is stored in this logical memory area. It is equivalent to the code segment in other architectures. As it is garbage collected, it is allowed to expand and shrink in size and need not be contiguous. The method area includes the class CP, which is an ordered set of constants (CNs) (referenced by indices) associated with each loaded type. This set includes literals (string, integer, and floating-point CNs) and symbolic references to types, fields, methods, and other CN objects that are referred to by the class structure or by the executing code. It is essentially a symbol table. The class CP plays a central role in the dynamic linking of Java programs.
• Heap At runtime, instantiated objects and arrays are allocated on a single heap.
• Registers The JVM makes use of four special-purpose user-invisible registers: (1) the program counter (PC), which is one word in size and can hold both a native pointer and

a return address; (2) the local variable base (VARS), which contains a pointer to the base of the current method's set of local variables (LVs); (3) the top of operand stack (OPTOP), which contains a pointer to the top of the current method's operand stack; and (4) the stack frame base (FRAME), which points to the base of the current stack frame in memory.
• Java stack Each newly spawned thread gets its own stack that stores the state of the thread's non-native method invocations in discrete stack frames. A new frame is pushed onto the thread's stack whenever it invokes a method. This frame is popped and discarded upon returning from the invocation, and the previous frame becomes the current one. Method invocation state includes: (1) the LVs, which are organized as a zero-based array of words accessed through indices. Compilers place the invocation parameters into this array first, in their declaration order. (In the case of an instance method, the hidden reference this is placed in location number zero.) In effect, the instruction set treats LVs as a set of registers that are referenced by indices; (2) the operand stack, which is organized as an array of 32-bit words, accessed only through pushing and popping values, to hold intermediate expression calculations. It is not a global stack; each method invocation is given its own; (3) the frame data, which includes the next PC and pointers to the CP area, the method's JBCs, debugging data, the monitor area, and the calling frame. The LVs and the operand stack are stored directly on the Java stack, whereas the frame data includes pointers to the actual data. The sizes of the LV area and the operand stack depend on the needs of each individual method. These sizes are determined at compile time and included in the class file data for each method. The size of the frame data is implementation dependent.
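To make this frame layout concrete, here is a minimal sketch, under our own naming assumptions (it is not part of the JVM specification or of JAFARDD), of how a software JVM might represent one frame of the Java stack.

    // Minimal sketch of one Java stack frame as described above.
    // Field names and the FrameData contents are illustrative assumptions.
    public class Frame {
        final int[] localVariables;   // LV array; parameters first, "this" in slot 0 for instance methods
        final int[] operandStack;     // per-invocation operand stack of 32-bit words
        int top = 0;                  // index of the next free operand-stack slot (OPTOP analogue)
        final FrameData frameData;    // return PC and pointers to the CP, method JBCs, calling frame, ...

        Frame(int maxLocals, int maxStack, FrameData frameData) {
            // Both sizes come from the class file; they are fixed per method at compile time.
            this.localVariables = new int[maxLocals];
            this.operandStack = new int[maxStack];
            this.frameData = frameData;
        }

        void push(int word) { operandStack[top++] = word; }
        int pop()           { return operandStack[--top]; }
    }

    // Bookkeeping kept alongside the frame (pointers, not the data itself).
    class FrameData {
        int returnPc;        // next PC in the calling method
        Object constantPool; // pointer to the class CP
        Frame caller;        // the previous (calling) frame
    }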

2.4 Access Patterns for Data Types

This section studies the access patterns for JVM data types. Heavily used data types should be incorporated into Java hardware for performance improvement. Access pattern information will also prove useful when decisions are made about storage allocation.

2.4.1 JVM Data Types

JVM instructions have one slight oddity: there is somewhat more type information than is strictly necessary. The JVM requires that the types of all operands be determined at compile time. This type information is usually encoded in the opcode or by referencing the CP, which includes entries for different data types. To help achieve this, the JVM provides plenty of programming data types, including advanced constructs like Java fields, objects, and threads. Table 2.2 shows JVM-supported data types and their sizes [8]. JVM data types can be divided into two sets: primitive types, which are data themselves (they do not refer to anything), and reference types, which refer to dynamically created objects. A value of a reference data type can be a null value, a reference to a class instance (including arrays), or a reference to a class instance implementing an interface.

Table 2.2. JVM-supported data types.

Data type | Size (bytes) | Description
Primitive, numeric, integer:
  byte | 1 | signed 2's complement integer
  short | 2 | signed 2's complement integer
  int | 4 | signed 2's complement integer
  long | 8 | signed 2's complement integer
  char | 2 | 16-bit unsigned Unicode character
Primitive, numeric, float:
  float | 4 | IEEE 754 single precision floating point
  double | 8 | IEEE 754 double precision floating point
Primitive:
  returnAddress | 4 | used as addresses for jsr, ret, jsr_w
Reference | 4 | reference to a Java object, array, or null

2.4.2 Single-Type Operations

Figure 2.3 shows the distribution of data accesses by type. In this figure, generic refers to operations that could be applied to any data type. Operations that have no associated data type are grouped under void. As shown in the figure, integer and reference are the data types most used by the typed operations. Java architectures therefore need to support integer and reference data types in hardware; e.g., they could provide multiple functional units that process integer and reference types in a superscalar fashion. Also, since the JVM is a 32-bit architecture, it uses 32-bit data types more than any other type. The width of the register file and CPU datapaths must be designed accordingly.

[Bar chart: percentage of data type accesses for byte, short, char, float, long, double, duplex, reference, returnAddress, void, and generic.]

Figure 2.3. Distribution of data accesses by type.

2.4.3 Type Conversion Operations

The JVM has a set of instructions that convert data from one type to another. This is necessary for a strongly typed programming language like Java. Figure 2.4 shows that conversion from integer dominates all type conversion operations, especially conversion to character. Only the data types that were involved in type conversion operations are shown in this figure. This information, combined with the results from Subsection 2.4.2, implies that integer conversion operations are the ones used the most. For better Java performance, an ALU design needs to perform this conversion in one clock cycle or less.
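For illustration, the kind of Java code that produces such a conversion is shown below; the class and method names are ours, and the bytecode in the comments is what javac typically emits.

    // Integer-to-character conversion as it appears at the bytecode level.
    class ConversionExample {
        char asChar(int i) {
            return (char) i;
            // iload_1   // push the int argument (slot 0 holds "this")
            // i2c       // truncate to an unsigned 16-bit char
            // ireturn   // char results are returned through the int return path
        }
    }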

2.5 Addressing Modes

This section examines the usage of JVM addressing modes.

[Bar chart: percentage distribution of the resultant types of type conversion instructions (double, long, float, int, and others).]

Figure 2.4. Usage of type conversion instructions.

2.5.1 JVM Addressing Modes

Memory addressing defines how memory addresses are interpreted and how they are specified. This determines what object is accessed as a function of the address and its size. The traditional concept of addressing modes, as used in general-purpose processors, is not exactly applicable to the JVM, which uses a stack-based intermediate language. This stack architecture, together with the object orientation features of Java, is reflected in the combination of traditional (e.g., immediate) and non-traditional (e.g., LV and class CP indexed) addressing modes shown in Table 2.3. There are seven addressing modes in the JVM [8]:

• Immediate In this mode, instructions do not access memory. The operand is stored directly in the bytecode stream, either explicitly in the bytecode following the opcode or hardcoded in the opcode itself. Immediates are used as CNs, branching addresses, or for table switching. This mode is also used for specifying the type of an array.
• Local variable These are the instructions that address LVs. The LV index is either mentioned explicitly in the next bytecode or hardcoded in the opcode. The iinc instruction is in this group. To form the actual address, the JVM offsets the LV index

Table 2.3. JVM addressing modes with examples.

Addressing mode | Operand source(s) | Examples
Immediate | Bytecode stream | bipush 5, iconst_0, goto 13014
Local variable | VARS† [Bytecode stream]‡ | iload_3, istore 5, iinc 3 4
Stack | Stack top | pop, dup, fdiv, lrem
Constant pool indexed | CP[Bytecode stream] | ldc 220, getstatic 540, invokevirtual 591, new 2
Array indexed | Heap*[Stack top] | iastore, faload, arraylength
Object reference | Heap[Stack top] | athrow, monitorenter, monitorexit
Quick reference | Heap[Stack top] | ldc_quick, getfield_quick, new_quick
† VARS is the base of the LV area.
‡ [] means indexing in the specified area.
* The heap is the place for storing arrays and objects.

from the base of the current LV area (stored in register VARS).
• Stack These instructions access the stack top only. Though the stack is involved in all addressing modes, this category involves stack operands only. No other memory is further referenced. Its effective address is stored in the register OPTOP.
• Constant pool indexed These are the instructions that reference the CP. The index given here specifies an offset into the CP area.
• Array indexed Array referencing is done by providing the array reference and the index. Combining the array reference with the index gives a heap address for the array element to be accessed.
• Object reference These are the instructions that access objects without going through the CP. The object reference specifies the heap address of the object.
• Quick reference All the quick instructions use this mode, and may include some of the previous addressing modes, which are already resolved to address the heap.

2.5.2 Runtime Resolution and Quick Instruction Processing

The JVM uses the information from the CP to resolve symbolic references. Resolution is the process of finding the entity identified by the symbolic reference and replacing that reference with a direct reference. The symbolic reference is resolved into a direct reference by searching through the method area until the referenced entity is found (possibly loading or creating it if it does not already exist). Once resolved to an address, the actual value can be accessed. Method invocations and field references also require resolution. Note that entries in the CP are resolved only once, since the runtime system caches the information it obtains after resolving the item. This requires maintaining pointers directly to the resolved item. Runtime resolution is the key to part of Java's appeal; it provides dynamic linking to support independent and incremental compilation of Java classes. Sun's JVM implementation uses an optimization strategy for runtime resolution that involves self-modifying code. Quick variations of some opcodes exist to support efficient dynamic binding and optimize performance [22]. The first time a method or field is referenced, the original instruction resolves which method or field is correct for the current runtime environment. The original instruction also replaces itself with its quick variant. The next time the program encounters this particular instruction, the quick variation skips the time consuming resolution process, and simply performs the desired operation directly on the already resolved method or field. The quick opcodes have the same semantics as the defined, documented opcodes they replace. But the quick versions assume that the CP resolution process has already been completed successfully. The JVM uses hidden opcodes, not exposed as part of the defined architecture, for this quick variant. These opcodes mostly hold the original name suffixed with _quick.
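As a rough sketch of this rewrite mechanism, our own simplification (not Sun's code) of an interpreter case for getfield might look as follows; the numeric value used for the quick opcode, the slowResolve helper, and the cache structure are assumptions for illustration only.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the quick-rewrite idea: the first execution of getfield
    // resolves the constant-pool entry, caches the result, and overwrites the
    // opcode so that later executions skip the resolution step entirely.
    public class QuickRewriteSketch {
        static final int GETFIELD = 0xB4;        // standard JVM opcode for getfield
        static final int GETFIELD_QUICK = 0xCE;  // assumed value for the hidden quick variant

        final Map<Integer, Integer> resolvedOffsets = new HashMap<>(); // CP index -> field offset

        int interpretGetfield(byte[] code, int pc) {
            int opcode = code[pc] & 0xFF;
            int cpIndex = ((code[pc + 1] & 0xFF) << 8) | (code[pc + 2] & 0xFF);
            if (opcode == GETFIELD) {
                int offset = slowResolve(cpIndex);       // symbolic search through the method area
                resolvedOffsets.put(cpIndex, offset);    // cache the direct reference
                code[pc] = (byte) GETFIELD_QUICK;        // self-modify: next time take the fast path
                return offset;
            } else {                                     // GETFIELD_QUICK
                return resolvedOffsets.get(cpIndex);     // already resolved; no search needed
            }
        }

        int slowResolve(int cpIndex) {
            // Placeholder for the real resolution of a CP entry to a field offset.
            return cpIndex * 4;
        }
    }

After the first execution, the overwritten opcode routes every later execution through the cached, already-resolved path, which is the effect measured in Figure 2.6.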

2.5.3 General Usage Patterns

Addressing mode usage patterns are shown in Figure 2.5. LV access dominates all other addressing modes. The immediate access and the quick reference also occur frequently. Quick reference refers to operations that were replaced with a quick version for efficiency. Hardcoded addressing modes, in which the operand value is encoded in the instruction itself, occurred in more than one third of the total addressing modes used. The size requirements of immediates and LV indices will be elaborated later in Section 2.7. The graph also shows a significant use of stack addressing, which is reasonable for a stack-based machine like the JVM. It can be concluded that any Java architecture needs to support at least the LV, immediate, quick reference, and stack addressing modes.

[Bar chart: usage of each addressing mode (immediate, local variable, stack, constant pool indexed, array indexed, object reference, quick reference), split into hardcoded and non-hardcoded forms.]

Figure 2.5. Summary of addressing mode usage.

2.5.4 Quick Execution

The symbolic representation of the Java class file slows down JVM execution [21, 26]. Therefore, hardware techniques for symbolic resolution and dynamic linking are required to speed up Java performance. Sun's JDK adopts a software technique for overcoming this issue, by executing quick versions of JVM instructions that have previously been resolved [4]. Furthermore, quick wide versions of some instructions are provided. As Figure 2.6 shows, the quick versions are executed much more than the non-quick ones of the same instruction. Of the instructions that have both versions, about 75.7% are executed in quick mode, 23.7% in quick wide mode, and only 0.6% run in non-quick mode. Hardware mechanisms replacing cached non-quick instructions with their quick versions could speed up execution.

2.6 Instruction Set Utilization

This section studies the frequency of using the different instruction classes. Amdahl's rule for performance improvement suggests making the common case fast and the rare case correct [63, 69].

[Bar chart: for each instruction that has quick variants (getfield, putfield, getstatic, putstatic, ldc, ldc_w, ldc2_w, new, anewarray, multianewarray, checkcast, instanceof, and the invoke family), the percentage executed in non-quick, quick, and quick wide form.]

Figure 2.6. Summary of quick execution of instructions.

Therefore, the design of a Java processor should concentrate on the most frequently used instructions and support them efficiently [13, 15, 16, 22, 31, 39].

2.6.1 JVM Instruction Set

The JVM supports a rich instruction set. This includes instructions for arithmetic, flow control, accessing elements in arrays, etc. Many of them are similar to those of a traditional processor, whereas others are specific to Java. Supported instructions range from high-level ones (like those supporting object orientation) to low-level operations (like those manipulating the stack). In this subsection we study the different instructions supported by the JVM [1, 5, 8, 12]. Although the general properties of the JVM instruction set are similar to those of many standard CPUs, the unique combination of the different advanced features made for Java has resulted in some distinguishing twists [6, 7, 8]:

• Relatively large group The JVM architecture defines 201 standard opcodes that can

appear in valid .class files. In addition, Sun's implementation describes 26 quick variations and three reserved opcodes.
• Incomplete set The Java instruction set is not complete since it lacks some major instruction types. In the JVM, there is no system call support. This in itself is an obstacle in building a Java OS. Extensions to the JVM are required for this. Also, the JVM does not provide any support for graphics, string manipulation, decimal processing, etc. Currently, these functions are supported through the class library.
• Self-contained opcodes Many JVM instructions consist of only an opcode and take no operands. This irregularity complicates the hardware fetch and decode process, as the opcode has to be decoded first to decide if an operand needs to be fetched. Additionally, to handle these one-byte opcodes, the JVM specification requires all instructions in the class file (except two that deal with table jumping) to be aligned on byte boundaries. However, word alignment is more efficient for execution on some systems.
• Typed operations Most JVM opcodes indicate the type they operate on. Values on the operand stack must be used in a manner appropriate to their type. It is illegal to push data of a certain type and pop it as another. This feature allows instructions to be verified all at once by a dataflow analyzer at loading time, rather than verifying each instruction as it is executed, which speeds up processing. The mnemonics for these instructions indicate their type by a single-character prefix (like fstore). Instructions such as instanceof/arraylength do not include a prefix, as their type is obvious (object reference/array). Only a few instructions operate on all types (like dup). Some opcodes, such as goto, do not operate on typed values and so do not have a prefix. Table 2.4 lists prefix codes for JBCs.
• Nonorthogonality Two aspects of the ISA are said to be orthogonal if they are independent (primary components of an instruction set are operations, data types, and addressing modes). To enable each opcode to be represented by a single byte, not all operations are supported on all types. Most operations are not supported for the types byte, short, and char. These types are converted to integer when moved from the heap or method area to the stack frame. They are operated on as integers and then converted back to byte, short, or char before being stored back into the heap or method area. The nonorthogonality can also be found in some instructions that are

associated with a data type, like push (bipush), while others are not, like dup/pop. This asymmetry makes programming and implementing the JVM more complex. In addition, some operations are supported for objects but not for scalars; e.g., there is no new function for scalar data types.
• Synonymity As shown before, many of the instructions that normally operate on different data types have identical functions. For instance, iload and fload both copy a 32-bit value from a LV to the top of the stack. The only difference is that subsequent instructions treat the data moved by iload as an integer, but that moved by fload as a floating point value. Sometimes the type information is necessary to perform the operation correctly. Internal to an implementation of the JVM, these instructions can be treated as synonymous. Thus, the underlying hardware can implement a smaller subset of distinct operations.
• Hybrid data types To improve the performance of the commonly used scalar operands, the JVM has instructions for both scalar operations and object-oriented programming. It supports the basic scalar types by having instructions to directly load, store, and operate on these built-in types. On the other hand, the JVM is integrated with object-oriented features that reflect the semantics of the Java programming language. It has instructions to allocate new objects, access data members, and call methods within objects. In this duality, Java is unique. Most other languages' virtual machines are either all scalar or all object-oriented. (A short example following Table 2.4 illustrates the typed-prefix and nonorthogonality points.)

Table 2.4. Prefix codes for JBCs.

Type | Prefix code | Example
byte | b | baload
short | s | saload
int | i | istore
long | l | ldiv
char | c | castore
float | f | fsub
double | d | ddiv
reference | a | aastore
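As promised above, here is a small illustration (our own example, with the bytecode javac typically produces shown as comments) of both the single-character type prefixes and the widening of byte operands to int:

    // Byte arithmetic is performed as int arithmetic and narrowed on the way out.
    class WideningExample {
        byte add(byte a, byte b) {
            return (byte) (a + b);
            // iload_1   // load a (slot 0 holds "this"); bytes live on the stack as ints
            // iload_2   // load b
            // iadd      // integer add; there is no "badd" opcode
            // i2b       // narrow the int result back to byte
            // ireturn   // return through the int path
        }
    }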

In what follows we will classify the Java instruction set into classes. Inside each class, further subclasses will be provided. Classifying the instruction set is useful in designing the underlying hardware support. In the course of our analysis, the outline characteristics of each class will be mentioned. Table 2.5 divides JVM instructions into classes and gives the instruction count for each class [1, 5, 8]. Examples from each class are also included in the table. To give a flavor, Table 2.6 shows a piece of Java code converted to JVM code.

• Scalar data transfer These instructions transfer data between the stack and LVs or push CNs on the stack. The LV access subclass, which is the largest single group of JVM instructions, includes scalar loads/stores from/to LVs. The CN push subclass includes several const instructions that push CNs on top of the stack, implementing the common cases efficiently in time and space. This subclass also contains the bipush and sipush instructions that push immediate integer CNs onto the stack. Also included in this class are the nop and wide opcodes.
• ALU These are the classic stack-architecture operations: one-byte opcodes that fetch operands from the top of the stack, operate on them, then push the result back onto the stack. Within this class, further divisions include: (1) arithmetic, which are the add, sub, mul, div, rem, and neg instructions applied to integer, long, float, and double data types; (2) logical, which are the bitwise and shift operations applied to integer and long data types; (3) conversion, which carry out different type conversions; (4) comparison, which are provided for the basic scalar types long, float, and double; and (5) LV increment, which is the iinc instruction that adds an immediate value directly to a LV, speeding up this frequently occurring operation.
• Stack All instructions specific to the stack, e.g., pop and dup, are categorized here. These instructions do not access any other storage area.
• Object manipulation One of Java's most powerful aspects is its incorporation of the object orientation features at the JBC level and in the memory model. A number of opcodes support operations on objects (including array operations). The object-specific subclass further includes: (1) object creation, including the opcode new; (2) field access, which is the set of put and get instructions that take care of retrieval and storage of data in object data fields; (3) type checking, which includes checkcast and instanceof to check the class type; and (4) CP access, which includes the ldc instructions that resolve a CP item and push the resolved CN value onto the stack. The array-specific subclass is used for allocating arrays and for loading and storing

Table 2.5. Classes of JVM instructions.

Class (count) | Subclass (count) | Sub-subclass (count) | Examples
Scalar data transfer (69) | No effect (2) | | nop, wide
 | LV access (50) | | iload_1, istore_0
 | CN push on the stack (17) | | iconst_0, bipush
ALU (57) | Arithmetic (24) | | iadd, lsub
 | Logical (12) | | iand, lor
 | Conversion (15) | | i2f, f2l, d2l
 | Comparison (5) | | lcmp, fcmpg
 | LV increment (1) | | iinc
Stack (9) | Stack-generic | | pop, dup2, swap
Object manipulation (46) | Object-specific (25) | Object creation (1) | new
 | | Field access (14) | getfield, putfield
 | | Type checking (4) | instanceof
 | | CP access (6) | ldc, ldc_w
 | Single-dimensional array-specific (21) | | newarray, iaload
Control flow (21) | Conditional branch (16) | Equal/not equal (6) | ifeq, if_icmpeq
 | | Null/not null (2) | ifnull, ifnonnull
 | | Greater than or equal (4) | ifge, ifgt
 | | Less than or equal (4) | iflt, ifle
 | Unconditional branch (2) | | goto, goto_w
 | Subroutine jump (2) | | jsr, jsr_w
 | Subroutine return (1) | | ret
Complex (28) | Multidimensional array-specific (2) | | multianewarray
 | Table jump (2) | | lookupswitch
 | Method handling (18) | | invokespecial, return
 | Exception throw (1) | | athrow
 | Synchronization (2) | | monitorenter
 | Reserved opcodes (3) | | breakpoint, impdep1

array elements (from/to LVs).
• Control flow The JVM has an optimized conditional and unconditional branch architecture that implements compare and branch constructs efficiently in time and space.

Table 2.6. An example of JVM code generation for a Java program.

Java source:

    class Data {
        int data[];
        int sum() {
            int tmp[] = data;
            int total = 0;
            for (int i = tmp.length; --i >= 0; )
                total += tmp[i];
            return total;
        }
    }

JVM code (with comments):

        aload_0        Load this
        getfield 9     Load data
        astore_1       Store in tmp
        iconst_0       Push 0
        istore_2       0 in total
        aload_1        Load tmp
        arraylength    Get length
        istore_3       Store in i
    A:  iinc 3 -1      Subtract 1 from i
        iload_3        Load i
        iflt B         Exit on i < 0
        iload_2        Load total
        aload_1        Load tmp
        iload_3        Load i
        iaload         Load tmp[i]
        iadd           Add to total
        istore_2       Store in total
        goto A         Do it again
    B:  iload_2        Load total
        ireturn        Return total

There is a set of instructions performing the same functions but associated with different data types. Branching instructions are organized in a number of subclasses: conditional and unconditional branches, and subroutine jumps and returns.
• Complex Instructions in this class require complex processing. They are divided into a number of subclasses: multidimensional array-specific, table jump, method handling, exception throw, thread synchronization, and the reserved opcodes. Instructions in this class can be very long, both in execution time and code size.

2.6.2 Dynamic Opcode Distribution

Table 2.7 shows the dynamic frequency of using different instruction classes in CaffeineMark and the breakdown of each class. The most heavily used instruction classes include the scalar data transfer (specifically LV access) and object manipulation (specifically object-specific) ones. The following observations are made from Table 2.7:

• The most commonly used instructions are loads/stores between the LVs and the top of the operand stack. About 43% (83.20% of 51.76%) of the executed instructions belong to this subclass. Providing hardware support for it will improve Java performance. One possible way involves allocating LVs to on-chip general-purpose registers.
• The arithmetic operations exceed all others in the ALU class. So, ALU designs should be optimized for arithmetic.
• Within the object manipulation class, the object-specific group is used the most. Instructions in this group are accessed mainly in the non-quick version, whereas instructions in the array-specific group are accessed mainly in the quick version. In the object-specific subclass, the field access group has the highest frequency of usage. This is reasonable as Java is an object-oriented language. It can be concluded that hardware support for object orientation is a vital way to enhance Java performance. This support should facilitate field access.
• CP access, which involves memory access, does not occur often.
• The conditional comparisons are the most used among the control flow instructions, with equal or not-equal to zero being the most frequently used in this subclass. (In the JVM, all comparisons are performed with respect to zero.) This shows that a RISC-style zero register (R0) will be helpful here.
• Of interest is to compare the two types of conditional instructions in the JVM: compare stack and branch (the conditional branch subclass), and compare stack and push (the comparison subclass). The former accounts for 10.57% (84.04% of 12.58%) of the total instructions executed while the latter accounts for 0.003% (0.03% of 10.70%) of the total. This indicates that comparison is mainly done for the purpose of branching. Optimizing the comparison process to suit branching could offer a significant overall performance improvement.

Table 2.7. Dynamic frequencies of using different instruction classes.

Class (frequency %) | Subclass (frequency %) | Sub-subclass (frequency %)
Scalar data transfer (51.76) | No effect (0.00) |
 | LV access (83.20) |
 | CN push on the stack (16.80) |
ALU (10.70) | Arithmetic (57.88) |
 | Logical (9.23) |
 | Conversion (2.36) |
 | Comparison (0.03) |
 | LV increment (30.50) |
Stack (1.41) | Stack-generic |
Object manipulation (18.13) | Object-specific (58.40) | Object creation (0.04)
 | | Field access (96.20)
 | | Type checking (1.16)
 | | CP access (2.60)
 | Single-dimensional array-specific (41.60) |
Control flow (12.58) | Conditional branch (84.04) | Equal/not equal (56.97)
 | | Null/not null (4.43)
 | | Greater than or equal (14.55)
 | | Less than or equal (24.05)
 | Unconditional branch (15.78) |
 | Subroutine jump (0.09) |
 | Subroutine return (0.09) |
Complex (5.42) | Multidimensional array-specific (0.00) |
 | Table jump (0.61) |
 | Method handling (96.97) |
 | Exception throw (0.00) |
 | Synchronization (2.42) |
 | Reserved opcodes (0.00) |

• Within the complex group, the method handling instructions are used the most. This is expected due to the object-oriented nature of Java. As the benchmark does not throw any exceptions, the percentage of exception instructions used is zero,

though this may not be the case in other applications.

2.6.3 Frequency of Null Usage

Another interesting observation from Table 2.7 is the considerable usage of the CN null. The three instructions that make use of this CN (aconst_null, ifnonnull, and ifnull) are executed 39,601 times, which is equivalent to 0.55% of the total instructions executed. Taking into consideration that Java is a fully object-oriented language, we suggest reserving another register for the null value. This will help object orientation performance in general, and the comparison with null in particular.

2.6.4 Branch Prediction

Modern processors speculatively predict the result of conditional control flow instructions. (Usually, devised speculation methodologies depend on data collected through benchmarks.) Figure 2.7 summarizes the benchmarked frequency of instruction groups that may change the program counter. Forward conditional instructions are the majority. Total taken branches exceed the untaken ones. This general observation remains unchanged if we split branches according to their direction. Based on that, it is suggested that the default speculation direction for branch prediction be taken. Figure 2.8 divides the execution frequency of each of the conditional branch instructions between taken and untaken. Our studies show that 65% of the conditional branches are taken in the benchmarked Java applications. This is a bit less than the typical percentage (75-80%) found in other non-Java applications [42]. A sophisticated branch prediction unit that takes into consideration the opcode being executed could be designed based on this information.

2.6.5 Effect on Stack Size

Figure 2.9 shows the effect of instruction execution on the stack size. Such information is helpful for designing hardware that maximizes ILP without increasing stack access contention. Instructions issued for parallel execution should not access the stack simultaneously, otherwise contention will occur and the potential speed improvement will not be realized. Having a multiport stack might alleviate the contention problem. In addition, if instruction folding is used, issued instructions should have a desirable effect on stack size.

[Bar chart: percentage of executed instructions that may change the PC, broken down into forward taken conditional, forward untaken conditional, backward taken conditional, backward untaken conditional, forward unconditional, and backward unconditional.]

Figure 2.7. Summary of executing instructions that may change the program counter.

2.7 Instruction Encoding

This section quantitatively analyzes the different fields in JVM instructions. The objective is to determine the optimum number of bits required for mapping JVM instruction fields onto proposed native instruction formats. The analysis presented here should not be considered as contradicting the JVM specification, which dictates the exact size of each instruction field. It is meant to provide guidelines for a Java architecture's native instruction format to include JVM instruction fields while attaining generality and efficiency.

2.7.1 JVM Instruction Encoding

The JVM specification requires JBCs to be provided as a stream of bytes grouped as variable-length instructions. However, it hardly mentions any encoding format for instructions [4, 9, 11]. In Figure 2.10, we show a possible layout of the instruction format that could accommodate different JVM instructions. Processing this irregular format in hardware might work against the generality of the architecture and could adversely affect the performance of instruction fetch and decode, as well as the pipeline efficiency.

[Bar chart: taken versus untaken percentages for each conditional branch instruction (ifeq through if_acmpne, ifnull, and ifnonnull).]

Figure 2.8. Statistics of conditional branches.

[Bar chart: for each instruction class (scalar data transfer, ALU, stack, object manipulation, control flow, complex), the percentage of instructions that push, pop, push and pop, or have no effect on the stack.]

Figure 2.9. Distribution of the effect on stack size for each instruction class.

[Instruction layouts shown in the figure, one field per column:]

    opcode
    opcode | local variable number
    opcode | immediate
    opcode | constant pool index
    opcode | local variable number | immediate
    opcode | immediate
    opcode | constant pool index
    opcode | branch offset
    wide | opcode | local variable number
    opcode | constant pool index | immediate
    opcode | constant pool index | immediate | immediate
    wide | opcode | local variable number | immediate
    opcode | branch offset
    opcode | variable number of arguments

Figure 2.10. Instruction layout for JVM. Dashed lines at the bottom of the figure show byte boundaries.

Each JVM instruction consists of a 1-byte opcode followed by zero or more operands. The opcode indicates the operation to be performed. Operands supply extra information needed by the JVM to perform the operation specified by the opcode. The opcode itself indicates whether it is followed by operands and the form the operands (if any) take (including the addressing mode information). Depending on the opcode, the JVM may refer to data stored in other areas in addition to (or instead of) operands that trail the opcode. These data may be entries in the current class CP, any of the current frame's LVs, or a value sitting on top of the current frame's operand stack. The following observations could be made about the way JBCs are encoded:
• Hardcoded operands The JVM has a drawback in that many of its instructions have an implicit operand. These include instructions to compare with zero, store a CN (0, 1, 2, or 3) in a word, or access a certain LV (0, 1, 2, or 3).

• Small-size instructions The JVM, with its well defined architecture, results in very compact code; JBCs are relatively small. There are no register specifiers, LVs are accessed relative to a base pointer, and instructions implicitly operate on the stack.
• Multiple operand sources Although the JVM instruction set's main focus is the operand stack, other sources of operands are also available, including the opcode itself, the bytecode stream, and the class CP.
• Expandable operands Some of the JVM instructions have a wide format that is formed by preceding the instruction with the wide prefix. This, in the grand tradition of the x86, expands the LV or CP entry index for the general loads and stores from one byte to two (a byte-level sketch follows this list).
• Variable length format JVM instructions are followed by a variable number of operands in zero or more bytes (up to 4, except in table switching instructions). The number of operands changes not only from one instruction to another but also for the same instruction. Some instructions, like the table switching ones, support a variable number of operands. This results in a variable-length instruction set.
• Various branch destinations In the JVM, the destination address of a control flow/method invocation instruction must always be specified. There are three ways of specifying the addressing target: (1) direct jumps, in which control flow instructions specify the address explicitly in a PC-relative form; (2) table switching, which provides a number of addresses to select from based on a key value; and (3) virtual functions, which allow different routines to be called depending on the type of the data.
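To make the hardcoded, short, and wide encodings above concrete, here is a small illustration using the standard opcode byte values; the wrapper class is ours and simply holds the byte sequences.

    // Byte-level encodings of three ways to load a local variable.
    class EncodingExamples {
        static final byte[] HARDCODED_LOAD = { (byte) 0x1B };             // iload_1: LV index hardcoded in the opcode
        static final byte[] SHORT_LOAD     = { (byte) 0x15, 5 };          // iload 5: one-byte LV index operand
        static final byte[] WIDE_LOAD      = { (byte) 0xC4, (byte) 0x15,  // wide iload 300: the wide prefix
                                               (byte) (300 >> 8), (byte) (300 & 0xFF) }; // widens the index to two bytes
    }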

2.7.2 Local Variable Indices

Instead of specifying a set of general-purpose registers, the JVM adopts the concept of LV referencing. As shown in Figure 2.11, Java methods typically require up to 16 LVs (with an average and standard deviation of 4 and 2.48 LVs, respectively), though the specification allows referencing up to 255 LVs, or 65,535 in the case of wide instructions. The graph also shows almost no preference in accessing these variables. In designing hardware for Java, it is suggested that LVs be allocated to on-chip storage. A general-purpose register file can be configured to work as a storage area for LVs, allowing Java programs to run faster.


Figure 2.11. Distribution of number of bits representing a LV index.

2.7.3 Constant Pool Indices

The JVM's CP is a collection of all the symbolic data needed by a class to reference fields, classes, interfaces, and methods. A CP index can be up to 16 bits in size, or up to 32 bits in the wide format. Figure 2.12 shows that 16 bits cover almost all CP accesses (the average is 8 bits and the standard deviation is 2.9 bits). Seven bits are enough to cover more than 50% of the total CP index usage and 10 bits can cover more than 80%. Based on the statistics shown in Figure 2.12, it is suggested that 13 bits be used to encode CP index values in JVM instructions, as this will cover 98% of the cases.

2.7.4 Immediates

As a stack machine, the JVM does not rely heavily on immediates for ALU operations. Immediates in the JVM are either pushed on the stack or used as an offset in control flow or table switching instructions. Immediates could have a length of up to 32 bits. As shown in Figure 2.13, immediates used in CaffeineMark are up to 18 bits in length (the average is 4 bits and the standard deviation is 2.07 bits).


Figure 2.12. Distribution of number of bits representing a CP index.

The most frequently used immediates are 3 bits. Since hardcoded immediates are not counted in this study, the measurements show less frequent use of 0, 1, 2, and 3 bits than on architectures that do not use hardcoded operands. (When the immediate is part of the opcode encoding, we say it is hardcoded.) Three bits are enough to cover more than 50% of the total immediate usage and 6 bits can cover more than 80%. Based on the statistics in Figure 2.13, it is suggested that 8 bits be used to encode immediate values, as this will cover 98% of the cases. Situations that require more than 8 bits can be handled by the compiler using a special wide format. Another solution involves saving large immediates in the CP and addressing them through an index.

2.7.5 Dynamic Instruction Length

JVM instructions support a variable number of operands, resulting in a variable length instruction format. Figures 2.14 and 2.15 show the dynamic count of JBCs and operands, respectively. As shown in Figure 2.14, JBCs have an average length of 1.77 bytes (with a standard deviation of 1.4). This is far less than typical RISC instructions. However, in some complex instructions, like table switching, instructions can be up to 36 bytes in length.


Figure 2.13. Distribution of number of bits representing immediates.

It is easy to see that padding JVM instructions so that each is 3 or 4 bytes would lead to a code expansion of 69.49% or 125.99%, respectively. On the other hand, Figure 2.15 shows that 99.97% of instructions have up to 2 operands. The average opcode has one operand and the standard deviation is 0.59. Fetched opcodes that are to be changed dynamically before execution are considered to have zero dynamic length. These include the non-quick instructions that are converted to a quick form, and the invokespecial instructions that are converted to another invoke form. (The executeJava engine deals with these cases using the withOpcodeChanged macro.) For these instructions, length is measured using the actually executed opcode instead of the fetched one. A number of conclusions for hardware support can be drawn based on these results. First, a very long instruction word (VLIW) machine can easily accommodate more than one JVM instruction per word. Second, the small size of the average JVM instruction lessens the size requirements of the instruction prefetch buffer and the instruction cache blocks. Finally, despite the network transfer overhead as a result of about 50% code expansion, a generalized fixed-length instruction format may lead to more efficient applet execution, e.g., faster instruction fetching, faster decoding, and better pipeline operations.


Figure 2.14. Distribution of dynamic bytecode count per instruction.


Figure 2.15. Distribution of number of operands per instruction.

2.7.6 Array Indices

Java carries array information down to the JVM level. At runtime, the index required to access an array is popped from the operand stack. Although the index can be up to 32 bits, Figure 2.16 shows that CaffeineMark does not require more than 13 bits to access arrays (the average and standard deviation are 3 and 2.79 bits, respectively). For instruction encoding, a 3-bit index size covers more than 50% of the total array accesses and 5 bits cover more than 80%. The graph shows an interesting pattern: the maximum occurs at 0, then a decaying behavior follows, with a sudden drop after 6. This indicates that about 90% of the accesses are to the first 64 elements of an array. This gives a useful guideline in data caching: the first few elements (up to the 64th) of an array should be cached. This could also have an impact on cache organization, such as the block size, replacement strategies, etc.

Figure 2.16. Distribution of number of bits representing an array index.

2.7.7 Branching Distances

JBC branches include only an offset, and the runtime execution engine must convert it internally to the corresponding absolute address. As shown in Figure 2.17 (stars), CaffeineMark requires a target offset of less than 10 bits (the average is 4 bits and the standard deviation is 1.49 bits), though up to 16 bits are allowed. In the design of an instruction format, 4 bits cover more than 50% and 7 bits cover more than 80% of the cases. Eight bits appear sufficient for more than 98% of the offset distances. Furthermore, if a branch target buffer is used for branch speculation, a buffer of 512 entries (256 forward and 256 backward) is sufficient. Figure 2.17 (circles) also shows the statistics for the absolute jump-to addresses.
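To make the branch target buffer suggestion concrete, the sketch below models a small direct-mapped buffer that records the resolved target of each taken branch and is probed with the branch PC on the next encounter. The 512-entry size comes from the observation above; the indexing scheme and the class itself are illustrative assumptions, not part of the measured JVM or of JAFARDD.

    final class BranchTargetBuffer {
        private static final int ENTRIES = 512;            // 512 entries, per the observation above
        private final long[] tags = new long[ENTRIES];     // branch PC stored as the tag
        private final long[] targets = new long[ENTRIES];
        private final boolean[] valid = new boolean[ENTRIES];

        private static int index(long branchPc) {
            return (int) (branchPc & (ENTRIES - 1));        // low-order bits of the branch PC
        }

        /** Record the resolved target of a taken branch. */
        void update(long branchPc, long targetPc) {
            int i = index(branchPc);
            tags[i] = branchPc;
            targets[i] = targetPc;
            valid[i] = true;
        }

        /** Predicted target address, or -1 if this branch misses in the buffer. */
        long lookup(long branchPc) {
            int i = index(branchPc);
            return (valid[i] && tags[i] == branchPc) ? targets[i] : -1;
        }
    }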

Figure 2.17. Distribution of number of bits representing the target offset (stars) and absolute jump-to address (circles).

2.7.8 Encoding Requirements for Different Instruction Classes

Figure 2.18 shows the instruction-encoding requirements for each instruction class: LV index size, CP index size, immediate size, number of bytecodes, and operand count. An interesting observation from these figures is that each class has its own special requirements. We can see from Figure 2.18(a) that the scalar data transfer instructions require, on average, the largest index size for LV access. The object manipulation and complex groups use the largest CP index size (Figure 2.18(b)). The control flow group needs the largest immediate size (Figure 2.18(c)). The control flow class also requires the longest instruction length (Figure 2.18(d)), and the control flow and complex classes require the largest operand count (Figure 2.18(e)). The ALU and the stack operations appear to need only moderate encoding. These results give hints for designing a suitable instruction format for each instruction class; e.g., the control flow instruction format should accommodate large immediate sizes.

2.8 Execution Time Requirements

Execution time for JVM instructions depends on the execution environment. The results shown here simply give the relative execution time for the different instruction classes in our benchmarking environment (Sun JDK on UltraSPARC machines). Java processor designers can then identify features that are most heavily used and most time-consuming. Table 2.8 summarizes the execution time requirements for each class and its subclasses. It illustrates clearly the significant difference in execution time between instruction classes:

• LV access is time consuming. Allocating LVs on-chip will save some of this time.

• CN push instructions are also time consuming. Having a way to generate and push a CN on the stack will help here.

• Arithmetic instructions exceed all other ALU operations in time requirement, so the ALU design should be optimized for arithmetic.

• The object manipulation class of instructions takes a significant portion of the execution time of CaffeineMark. This reiterates the importance of improving object manipulation for overall Java performance. The object-oriented instructions' poor response is due to main memory access, CP access, and garbage collection.

• Object-specific instructions take significantly more time than the array ones, with field access being the most time consuming.

• Within the control flow group, the conditional branches' execution time, especially that of the equal/not-equal instructions, far exceeds the others.

Figure 2.18. Instruction encoding requirements for each instruction class: (a) local variable index size (bytes); (b) constant pool index size (bytes); (c) immediate size (bytes); (d) number of bytecodes; (e) operand count. Note that all numbers are averages and therefore not integers.

Table 2.8. Total execution time for different instruction classes. Each class, subclass, and sub-subclass is followed by its percentage of average time (%).

Scalar data transfer (6.50): No effect 0.00; LV access 84.50; CN push on the stack 15.50.
ALU (1.79): Arithmetic 50.77; Logical 9.68; Conversion 2.23; Comparison 0.10; Local variable increment 37.22.
Stack (0.17): Stack-generic.
Object manipulation (18.78): Object-specific 88.99 (Object creation 0.00; Field access 99.32; Type checking 0.44; CP access 0.24); Single-dimensional array-specific 11.01.
Control flow (2.60): Conditional branch 85.99 (Equal/not equal 54.80; Null/not null 4.75; Greater than or equal 14.30; Less than or equal 26.15); Unconditional branch 13.87; Subroutine jump 0.08; Subroutine return 0.06.
Complex (70.16): Multidimensional array-specific 0.00; Table jump 0.02; Method handling 99.94; Exception throw 0.00; Synchronization 0.04; Reserved opcodes 0.00.

• The total time requirement for method handling instructions is heavy. This shows that their execution is time-intensive and requires some hardware support.

2.8.1 Performance Critical Instructions

Panels (a)-(c) of Table 2.9 show the top 10 ranked instructions in execution frequency, the top 10 ranked instructions in execution time per invocation, and the top 20 ranked instructions in total execution time, respectively. From these panels it is worth noting that the top 17 ranked instructions in total execution time account for more than 90% of the total execution time. This agrees with the well-known rule that 10% of the instructions (the benchmark executed 166 different opcodes) take about 90% of the execution time. Also, the two most time-consuming instructions per invocation are also the top two in total execution time.

2.9 Method Invocation Behavior

Object orientation encourages heavy use of small methods that are invoked frequently. In the JVM, each method invocation requires a certain amount of setup. Each method invocation has its own stack frame, which includes the LV area, the operand stack, and other relevant information. The purpose of this section is to study the behavior of JVM method invocation in detail. We assume a common LV pool, that is, each method allocates LVs from this pool and keeps them (even if it invokes another method) until it returns. We also assume that subsequent method invocations allocate their operand stacks on top of previous stacks. In this study, Java constructors are considered normal methods. Table 2.10 gives some method invocation statistics.

2.9.1 Hierarchical Method Invocations

As Figure 2.19 shows, JVM method invocations can be up to 120 levels deep, with an average of 34 levels (standard deviation of 3.6). The most frequent number of nested invocations is 21. This result is useful for designing hardware to support method invocations, as it gives some idea about the typical resource requirements, e.g., the operand stack, LV area, etc.

2.9.2 Local Variables

Each invoked JVM method has its own LV area for result storage and parameter passing. Previously, we saw the importance of handling LVs effectively. One suggestion is to cache this area directly on-chip. In this case, the critical question is how much on-chip area should be allocated for this purpose.

Table 2.9. The top ranked instructions in execution frequency, execution time per call, and total execution time. For each opcode, the table gives its execution count and rank, its average execution time and rank, and its total execution time, rank, and percentage.

Panel (a), the top 10 ranked instructions in execution frequency: aload_0, iload, getfield_quick, iload_1, istore, iload_2, iconst_1, iinc, ifeq, and iadd.

Panel (b), the top 10 ranked instructions in execution time per call: invokestatic, invokestatic_quick, getstatic, invokeinterface, putstatic, multianewarray_quick, invokevirtual, invokespecial, putfield, and getfield.

Panel (c), the top 20 ranked instructions in total execution time: invokestatic_quick, invokestatic, putfield_quick, putfield_quick_w, aload_0, iload, getfield_quick, getfield_quick_w, iinc, invokevirtual, ifeq, invokevirtual_quick, getstatic, iload_1, istore, iadd, iload_2, iconst_1, putfield2_quick, and ifne.

Table 2.10. Method invocation statistics.

Total number of executed methods: 177,774
Average number of trace lines per method: 40.263
Average number of bytecodes per method: 71.396

Figure 2.19. Distribution of levels of method invocations.

This area is not only constrained by individual method invocation requirements but also by the accumulated requirements through hierarchical invocations.

Figure 2.20 shows the number of LVs allocated by individual method invocations. Although up to 18 variables can be allocated, the average is 3 (with a standard deviation of 2.08). Two LVs are enough for more than 50% of the total method invocations and four cover about 80% of the cases. This figure suggests that allocating eight variables in hardware for each method invocation will be sufficient; eight variables cover more than 98% of total method invocations. Figure 2.21 shows the number of accumulated LVs allocated through hierarchical method invocations. Accumulated means the total number used by all invoked methods through the hierarchy. Here, we assume a pool of LVs for all pending methods. The maximum number of variables required can be up to 256, with an average of 8 and a standard deviation of 2. With today's technology, an on-chip area of 256 (or even 512) registers reserved for LVs can easily be accommodated. This will improve the performance of the JVM.

Figure 2.20. Distribution of number of LVs allocated by individual method invocations.

Figure 2.21. Distribution of accumulated LVs allocated through hierarchical invocations (number of LVs in log2).

2.9.3 Operand Stack

Each JVM method has its own operand stack, which is used as a working area for intermediate results. Initial and final data are handled as LVs. The maximum size of the operand stack is specified by each method. This is required by the JVM specification so that the appropriate stack size can be allocated before execution starts. Although the JVM allocates this size at method setup time, a method might not use all of the allocated area. This means that the required stack size never exceeds the preallocated one. The non-overlapping scheme in Figure 2.22(b) illustrates the difference between the required and preallocated stack sizes.

Figure 2.22. Overlapping (a) and non-overlapping (b) stack allocation schemes.

Figure 2.23 shows the percentage of the different stack sizes allocated and the percentage of the various sizes actually required by method invocations. A typical JVM method invocation is preallocated up to 12 words (with an average of 4 and a standard deviation of 1.7) and requires almost the same number (the average is 3 and the standard deviation is 1.74). Stack preallocations through hierarchical method invocations require up to 1024 locations, as shown in the plot in Figure 2.24, where we gathered the operand stacks from successive invocations into one common stack. The average is 4 and the standard deviation is 2.18. Sixteen locations can support more than 75% of the hierarchical invocations. Easy handling of the operand stack requires a specific hardware module. We suggest 1024 locations as the size of this stack, which should not be an obstacle with today's technology.

Figure 2.23. Summary of individual method invocation stack sizes (required versus allocated).

Figure 2.24. Summary of stack sizes in hierarchical method invocations (required versus allocated, maximum stack size in log2 words).

We also suggest allocating a new stack frame on top of the caller frame, starting from its current location rather than after the boundary of the caller's preallocated stack. This is motivated by the observation that the actual accumulated stack sizes are concentrated in the lower stack portions. To guarantee that no overflow occurs, the active method's preallocated stack size should still be kept. This overlapping scheme will minimize the required stack size; Figure 2.22 illustrates the savings gained by this scheme, and a small sketch of the two allocation policies follows.

2.9.4 Stack Primitive Operations

The basic effect of any JVM instruction on the stack is either pushing or popping. The amount of pushing and popping is of special importance in designing an architecture to support the Java operand stack. Figure 2.25 shows the change in stack size during execution. The figure shows that the stack normally handles very few words: pushing or popping of up to two words covers more than 98% of all stack size changes. The average is 0 words pushed or popped, with a standard deviation of 1.15. From Figure 2.25 we see that JVM instructions effectively span from pushing 2 words to popping 4 words. Method invocations pop their parameters from the stack.

Figure 2.25. Summary of stack size changes due to instruction execution.

As Figure 2.25 shows, a typical JVM method pops up to 10 parameters. Figure 2.26 partitions all JVM instructions according to their effect on stack size. An interesting pattern can be observed: pushing on the stack is more common than popping from it. The explanation for this is that, upon returning, a JVM method can ignore some data on the stack, which can be considered as popped already. Figure 2.27 shows the distribution of this ignored stack size. A hardware stack structure that is built using a register file may need a mechanism to handle flushing the ignored stack data. As the figure shows, although the neglected stack data can be up to seven words, the ignored stack size is mostly one or two words.

Figure 2.26. Summary of instruction effects on stack size (push, pop, push/pop, and no effect).

We conclude from these results that a Java processor should provide hardware support for its stack. This stack module can either be a specific unit or a reconfigurable register file. In either case, basic stack operations need to be more efficient, ideally completing in less than one clock cycle. Special attention should be paid to method invocations. A hardware stack could accelerate stack setup and stack data flushing by creating new stack frames and destroying existing ones as quickly as possible. Support for parallelism is also desirable, which could be achieved by permitting random access or multiport access to the stack.

Figure 2.27. Summary of stack sizes ignored per method invocation.

2.9.5 Native Invocations

In evaluating native instruction execution, we counted the occurrences of the word native in the trace. We found it in 215 trace lines, which account for 0.0029% of the total number of trace lines. This verifies that Java is highly portable.

2.10 Effects of Object Orientation

Results in Sections 2.6 and 2.8 show that the object manipulation group takes a major share of the instruction utilization and execution time. In this section we concentrate on some aspects of object orientation.

2.10.1 Frequency of Constructor Invocations

Constructors are vital for object orientation. They are invoked once for each instantiation of a class. Figure 2.28 shows the most invoked constructors. The String class is instantiated far more than the others. This hints at hardware support for string manipulation.

Figure 2.28. Distribution of invoking different constructors (Rectangle, MouseEvent, FocusEvent, CharToByteSymbol, CharToByteX11Dingbats, EventQueue, MToolkit, CharToByte8859_1, Thread, String, and ThreadGroup).

2.10.2 Heavily Used Classes

It is important to know the most invoked classes so that they can be optimized. In CaffeineMark, the most heavily used classes are the Object, Thread, and String classes, together with their variants such as ThreadGroup. (Only API classes are considered; we excluded the user-defined and mouse-movement classes, as they differ from case to case.) This result should not be surprising. As a fully object-oriented language, Java makes heavy use of the Object class: in Java, all classes inherit from the Object class. Multithreading services are used by web applets to run concurrently with the browser: all applets must use the Thread class and inherit its characteristics. The String class is used extensively for symbolic referencing.

Further analysis of the performance of instructions that are heavily used inside these classes will help in better identifying performance-critical instructions. Figure 2.29 shows the execution time spent manipulating the three heavily used classes, per instruction class. The object manipulation and complex groups dominate the others. As a generalized profile for the object manipulation instructions, Figure 2.30 shows the execution time spent manipulating the three classes for the individual non-quick and quick object-specific opcodes.

Figure 2.29. Total execution time per Object, Thread, and String classes, by instruction class (time on a logarithmic scale).

2.10.3 Multithreading

CaffeineMark does not make heavy use of multithreading; it executes a total of 9 threads. To get a rough idea about multithreading utilization, we counted the occurrences of the word thread in the trace.

Figure 2.30. Total execution time for object-specific instructions (checkcast, checkcast_quick, getfield, getfield_quick, getfield_quick_w, getstatic, getstatic_quick, instanceof, instanceof_quick, new_quick, putfield, putfield_quick, putfield_quick_w, putstatic, and putstatic_quick; time on a logarithmic scale).

We found the word thread in 1763 lines, which account for 0.0234% of the total number of lines. As our benchmark is not a multithreaded application, this shows that even though there is no multithreading at the programmer level, multiple threads are still used. Generally, threads are used behind the scenes for the graphical user interface (GUI), networking, etc. This indicates the performance gain that could be obtained by providing support for multithreading at the hardware level.

2.10.4 Memory Management Performance

An effect of Java's object-oriented programming style is that new objects are allocated for temporary values during program execution. This means that new objects are required for each interrupt or exception. This adds much overhead, especially to the Thread class, which relies heavily on interrupts and exceptions for context switching and error handling. Java also depends heavily on garbage collection to free any unused memory fragments. Thus, the garbage collector's efficiency greatly affects the performance of Java. The garbage collector is normally assigned to a low-priority thread. This means that it starts only when no other task is being executed. When the JVM faces a situation in which it needs to allocate memory and there is hardly enough, the garbage collector is rushed to free some of the unused fragments. Figure 2.31 shows the time spent by every memory allocation instruction over the lifetime of CaffeineMark. (Sun's JDK, which was benchmarked, uses the mark-and-sweep algorithm for garbage collection.) The figure shows a steady time response with rising and falling spikes. The rising spikes indicate instances in which the garbage collector was called to free some space; as a matter of fact, the allocation instruction itself was blocked waiting for the garbage collector to finish. The falling edges show that there was enough memory for fast allocation. We can also see from the figure that the garbage collector invocations are spread out at regular intervals. This suggests the benefit of having a module working in the background, observing memory references and deleting unused objects regularly.
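For readers unfamiliar with mark-and-sweep collection, the toy sketch below illustrates the idea referred to above: reachable objects are marked from a set of roots, and everything left unmarked is swept away. The Node and ToyHeap classes are invented for illustration and have no relation to the internals of Sun's JDK collector.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    /** A heap object that may reference other heap objects. */
    class Node {
        final List<Node> refs = new ArrayList<>();
        boolean marked;                              // set during the mark phase
    }

    /** Minimal mark-and-sweep collector over an explicit object list. */
    class ToyHeap {
        private final List<Node> allObjects = new ArrayList<>();
        final List<Node> roots = new ArrayList<>();  // stack, LV, and static references

        Node allocate() {
            Node n = new Node();
            allObjects.add(n);
            return n;
        }

        /** Mark phase: flag every object reachable from the roots. */
        private void mark() {
            Deque<Node> work = new ArrayDeque<>(roots);
            while (!work.isEmpty()) {
                Node n = work.pop();
                if (!n.marked) {
                    n.marked = true;
                    work.addAll(n.refs);
                }
            }
        }

        /** Sweep phase: discard unmarked objects and clear marks for the next cycle. */
        private void sweep() {
            allObjects.removeIf(n -> !n.marked);
            allObjects.forEach(n -> n.marked = false);
        }

        void collect() {
            mark();
            sweep();
        }
    }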

2.11 Conclusions

In this chapter we conducted a behavioral analysis of the JVM ISA. The major contribution of this work is that it collects many aspects of the Java workload behavior on the tested JVM (Sun's JDK).

Figure 2.31. Time requirements for memory allocations.

To the best of our knowledge, this is the only work that covers all the aspects discussed in a single study. (Other studies have focused on smaller subsets of the aspects considered here [70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85].) Recommendations are made for JVM architectural requirements by analyzing the statistics collected during benchmark execution. Tables 2.11 and 2.12 summarize the observations and recommendations for a Java processor's architectural requirements. Our study clearly shows that the advanced features of Java are its weakest links in terms of performance. It is exactly the aspects of Java relating to modularity, object orientation, and platform independence that are the major performance bottlenecks. These features come at the cost of inefficient memory management and method invocation overhead. Hardware support is required to increase the efficiency of object instructions and the object manipulation management system. Among possible ideas for this support are the following:

• Allocate a global CP. A heap for temporary objects (which have a very short life) and other high-priority objects (such as exceptions) can be allocated efficiently using this global pool. Garbage collection can be optimized for this pool to make allocation of temporary objects more efficient.

Table 2.11. Observations and recommendations for JVM instruction set design.

Access patterns for data types
  Single-type operations
    O: Integer instructions dominate typed operations
    R: Support integer and reference data types
    R: Adopt 32-bit datapaths and register files
    R: Provide multiple integer functional units
    R: Process integer operations efficiently
  Type conversion operations
    O: Conversion from integer dominates
    O: Conversion from integer to character dominates
    R: The ALU should process conversion operations quickly
Addressing modes
  General
    O: Traditional addressing modes are not applicable
    O: LV, immediate, and stack accesses dominate
    R: Support LV access
    R: Support stack addressing
    R: Support immediate addressing
  Hardcoded operands
    R: Encourage hardcoding code generation
  Quick opcodes
    O: The quick versions are executed more frequently
    R: Hardware needs to support dynamic linking
    R: Hardware for symbolic resolution is required
    R: Replace cached non-quick operations with the quick version
Instruction encoding
  General
    R: Could select a format different from the specification
  Immediates
    R: An 8-bit field in the instruction is sufficient
  Array access
    R: A 6-bit index is suitable
    R: Cache up to the first 64 array elements
  CP indexing
    R: A 16-bit field is required
  LV indexing
    R: A 5-bit index field is required
    R: Allocate LVs on-chip
    R: Use the register file as a LV reservoir
    R: Allocate at least 32 registers for LVs
  Branching distance
    R: An 8-bit offset is the general case
    R: Provide a 512-instruction branch target buffer
  Instruction length
    O: JVM instructions are variable in length
    O: The average is 1.77 bytes
    O: Two-operand operations are the common case
    R: VLIWs can accommodate multiple JVM instructions
    R: Relax the size requirements for the instruction prefetch buffer
    R: Relax the size requirements for cache blocks

Table 2.12. Observations and recommendations for JVM HLL support.

Instruction utilization
  Most frequently used instruction subclasses
    O: Load/store from/to LVs
    O: Object-specific instructions
    O: Conditional branches
  Comparisons
    R: Have a zero register in hardware
    R: Have a null register in hardware
  Branch prediction
    O: Forward conditional branching is common
    O: Compare stack and branch is used more than compare stack and push
    O: Both forward and backward branches are usually taken
  Effect on stack size
    R: Avoid parallelizing instructions that access the stack simultaneously
  Impact on instruction encoding
    O: The control flow class requires the longest instruction length
    O: The object manipulation set uses the largest CP index size
    O: The control flow instructions need the largest immediate size
    O: The scalar data transfer subclass accesses LVs with a larger index size
Time requirement
  General
    O: Method handling instructions are the most time consuming
    O: Programs spend 90% of execution time in 10% of the opcodes
Method invocation
  Hierarchical calls
    O: Invocations can be up to 120 levels deep
  Required LVs
    R: Eight LVs per method is a typical requirement
    R: A total of 256 LVs is sufficient most of the time
  Operand stack
    O: Stack size per method averages 12 words
    O: Pushing on the stack exceeds popping from it
    R: A total of 1024 locations is sufficient most of the time
    R: Overlapping stack allocation is recommended
  Stack primitive operations
    R: Provide a hardware stack
    R: Accelerate stack setup and flushing
    R: Provide a stack with multiple ports
    R: Construct a stack that allows random access
  Native invocations
    O: Java is highly portable
Object orientation
  Constructor use
    O: String constructors dominate
  Heavily used classes
    O: String manipulation classes are heavily used
    R: Support string operations
  Multithreading
    O: Multithreading is at the heart of the JVM
  Memory management
    R: Support efficient garbage collectors

• Global LV area. Our study shows that manipulating LVs is one of the critical performance issues. This suggests a global pool of registers for LV allocation.

• Stack cache. Since the JVM is a stack machine, hardware support for the operand stack will improve performance. Caching the stack in hardware will reduce memory latency. This will positively affect not only the stack instructions, which occur frequently, but also nearly all other instructions that implicitly access the stack.

We note that benchmarking results depend on the OS and hardware used in making the measurements. In addition, a different Java compiler may yield slightly different results. We do feel that our measurements, obtained using CaffeineMark, and the analysis performed are highly indicative of the behavior of a large class of typical Java applications. In general, the results collected here are tied to the JVM specification. However, the bytecode interpreter and the Java compiler used have a certain impact on some of the results collected. For example, the optimization techniques adopted by the compiler might affect the various percentages of instruction utilization. Similar performance studies targeting the interpreters and compilers would be needed to decouple the effect of these software components on the statistics gathered. The work in this chapter has been published in [86, 87, 88, 89, 90]. The next chapter analyzes the design space of Java processors by exploring the available design options and their trade-offs.

Chapter 3

Design Space Analysis of Java Processors

3.1 Introduction

The best way to describe and discuss design methodologies for Java processors is to present them within their design space framework [91]. In this chapter, we explore the different options available to the architectural designer. Our study aims at highlighting these options and their trade-offs by traversing the design space tree top down. We will concentrate on the innovative features that are special to Java execution. Some early design decisions for our proposed architecture will be made and presented here. We will use picoJava-II, an already existing Java processor, as an example to examine what it offers and what desirable features are missing in its design [92]. Figure 3.1 shows the most important design aspects of Java processors emphasized in this chapter: design methodology (Section 3.3), execution engine organization (Section 3.4), parallelism exploitation (Section 3.5), and support for HLL features (Section 3.6). The design space notation is introduced in Section 3.2. Section 3.7 presents related conclusions.

3.2 Design Space Trees

The study presented in this chapter follows the same approach used in [91] of using the design space and its trees. This approach allows exploring different design options and aspects in a convenient graphical way. Figure 3.2 shows the three primitives used in building the design trees and expressing the meaning of their branches.

Figure 3.1. Java processors design space: design methodology (Section 3.3), execution engine organization (Section 3.4), parallelism exploitation (Section 3.5), and support for high-level language features (Section 3.6).

In the course of our study, options selected for use in our architecture, if any, are shown in italics.

Figure 3.2. Design tree elementary primitives: design aspect A involves the design aspects B and C; design aspect D has two possible choices, E and F; design aspect G has two orthogonal alternatives, H and I.

3.3 Design Methodology

Java processors can be implemented using a number of approaches. Each approach has its own pros and cons. Here we discuss the different methodologies to design hardware support for Java. Figure 3.3 explores the design space of Java processor hardware design methodology.

3.3.1 Generality

Java processors could either be Java-specific, with a core that only recognizes Java, or general-purpose. General-purpose processors execute Java, as well as other languages, on a core that may have some Java-related features that convert JBCs to the processor's native code.

Figure 3.3. Different hardware design approaches for Java processors. The design methodology involves four aspects: generality (Java-specific, via direct realization or coprocessing, versus general-purpose, characterized by the support method [feature support, or conversion via interpretation or translation, implemented hardwired or microcoded] and by the native programming model [JVM, extended JVM, JVM/RISC, or RISC]); bytecode processing capacity (individual or folded); bytecode issuing capacity (single or multiple); and complex bytecode processing (hardware or trapped).

3.3.1.1 Java-Specific Approaches

Java-specific processors are designed to execute JBCs as their only native instruction set. Two schemes can be classified under this approach:

• Direct realization This approach incorporates a direct implementation of the JVM model in hardware, producing a stack machine executing most of the simple JVM opcodes. (In Subsection 3.3.4 we will discuss the different approaches for handling complex JVM opcodes, including method invocations, exception handling, etc.) Sun Microsystems' Java chips picoJava-I, picoJava-II, and microJava [93, 94, 95, 96, 97], DCT's Lightfoot and Bigfoot processors [98], and Patriot Scientific's PSC1000 [21, 99, 100] fall under this category. Applications written in languages other than Java can still run on these Java-specific processors; however, this requires compilers that produce JBCs. Sun Microsystems, for example, introduced C/C++ compilers that generate JBCs [21]. Some academic researchers have shown the complete process of designing Java-specific cores using VLSI and FPGA technologies [101, 102, 103, 104, 105, 106].

• Coprocessing JBC coprocessors are attached to a general-purpose CPU core to translate JBCs into native instructions on the fly. This approach requires no modifications to the host CPU core, OS, or application software [107, 108, 109, 110]. Whenever the CPU encounters a JVM bytecode stream, it passes it to this coprocessor. This unit can be interfaced to the CPU through a plug-in card, or it can reside on the same chip (in a two-module ASIC configuration). This approach facilitates migrating existing applications to Java-based ones while still maintaining the generality of the system. Sun Microsystems introduced its Java blaster, an ISA card that turns PCs into Java-based computers [26]. NSI's Java software coprocessor JSCP includes its own memory manager and thread scheduler; this makes it portable to real-time OSs that may have only part of the functionality necessary to implement Java [59].

3.3.1.2 General-Purpose Approaches

General-purpose architectures for executing Java have two logical views: a JVM ISA and a RISC-based one. In the JVM view, programmers are provided with an architecture that is optimized for JBC execution; in the other view, a normal RISC architecture is available. At the heart of the architecture, a RISC core is utilized for both views. A combination of hardware and software techniques is used to transform JBCs into the processor's native instructions. The statically determinable type state property of JBCs enables an easy realization of such a transformation [1]. The native ISA of such processors could either be a common one, e.g., MIPS, or a special instruction set designed to closely match the JVM instruction set. Processors using this approach have the advantage of being general-purpose, which allows executing code produced by non-Java applications. However, such approaches are usually more complex to construct than Java-specific ones. The design space for general-purpose approaches consists of two items: support method and native programming model. General-purpose processors follow one of two methods to support Java in hardware: feature support or conversion.

• Feature support Processors can support Java by providing features that match the JVM architecture. Sun Microsystems' MAJC architecture is an example of such processors [111, 112, 113, 114, 115]. To support Java multithreading, MAJC exploits scalable parallelism at the instruction, data, and thread levels. At the highest level of parallelism, support is provided as multiple processors on a chip; next, hardware support for threading is included; then comes an improved VLIW architecture; and at the lowest level of parallelism is the single instruction multiple data (SIMD) structure [116, 117, 118, 119, 120, 121, 122, 123, 124]. MIPS is working on improvements to its RX000 architecture aimed at speeding up the execution of Java code and saving memory and bandwidth [26]. ARM tuned its StrongARM architecture for stack-based languages such as Java; their processor can move a stack frame in and out of the register file with a single instruction [97, 125].

• Conversion Processors could include additional hardware to convert JBCs to native code. Converted instructions are fed to the processor core for execution. Conversion can be done through on-chip modules or off-chip coprocessors.

There are two common JBC conversion methods: hardware interpretation and hardware translation:

• Hardware interpretation Hardware interpretation techniques incorporate additional modules that perform the functions of a software interpreter. In this approach, each JBC is mapped to several native instructions. Radhakrishnan et al. propose a hardware interpretation unit that utilizes a microcode ROM; their unit could be added to general-purpose processors to enhance Java execution [126, 127, 128].

• Hardware translation Processors using this approach dynamically translate JBCs to native instructions. Different from the hardware interpretation technique, hardware translation maps many JBCs to one native instruction. Related JBCs are grouped first (via folding, for example) and then translated to a single native instruction. Glossner et al. proposed translating JVM instructions dynamically to Delft-Java instructions. Using a form of hardware register allocation, they transform stack bottlenecks into pipeline dependencies that are dealt with by well-established techniques (no folding is involved) [129, 130, 131, 132, 133]. In this dissertation, we propose to dynamically translate JBC folding groups into the MIPS instruction format (a simple illustration follows this list). Our approach is more economical and more general than Glossner's, as it allows the use of off-the-shelf RISC cores and produces a more general architecture.
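As an informal illustration of what such a fold-and-translate step does, the sketch below recognizes the common pattern iload_x, iload_y, iadd, istore_z and emits a single three-operand RISC-style add, assuming (purely for illustration) that each local variable is mapped onto a register r1, r2, and so on. The class, the pattern set, and the register naming are hypothetical and are not the folding algorithm developed later in this dissertation.

    final class FoldingTranslator {

        /** Returns a RISC-like instruction if the 4-bytecode window folds, else null. */
        static String tryFoldAdd(String[] window) {
            if (window.length == 4
                    && window[0].startsWith("iload_")
                    && window[1].startsWith("iload_")
                    && window[2].equals("iadd")
                    && window[3].startsWith("istore_")) {
                String srcA = "r" + window[0].charAt(6);   // producer of the first operand
                String srcB = "r" + window[1].charAt(6);   // producer of the second operand
                String dst  = "r" + window[3].charAt(7);   // consumer of the result
                return "add " + dst + ", " + srcA + ", " + srcB;
            }
            return null;   // not a foldable group; fall back to one-by-one translation
        }

        public static void main(String[] args) {
            String[] group = { "iload_1", "iload_2", "iadd", "istore_3" };
            System.out.println(tryFoldAdd(group));   // prints: add r3, r1, r2
        }
    }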

Bytecode converter circuits can be implemented in two ways: hardwired or microcoded.

• Hardwired Hardwired Java processors implement all bytecode conversion modules using random logic circuits.

• Microcoded Microcoded Java chips implement the JVM instruction set as a microcode library that contains interpretation code for every supported bytecode. Imsys's rewritable microcode chips GP1000 and Cjip [134, 135, 136] have instruction sets for Java, Forth, and C/C++ using a JVM-like stack architecture. aJile Systems' aJ-100 Java processor also supports the critical parts of the JVM's instruction set in microcode on stack-based embedded processors [137, 138]. Rockwell has modified an in-house processor to execute JBCs directly as its native instruction set by replacing the instruction set with new microcode for bytecode interpretation, producing JEM1 and JEM2 [21, 51, 139]. As mentioned before, Radhakrishnan et al. proposed a microcoded Java chip that interprets JBCs dynamically in hardware [127, 140].

General-purpose processors with Java support have either a JVM, an extended JVM, a JVM/RISC, or a RISC native programming model:

• JVM This type of processor has only one native programming model, the execution of JVM code. However, as the JVM's current instruction set is not sufficient to do system-level operations, this approach has very limited applications.

• Extended JVM These Java processors extend beyond the defined JBCs with some system-level operations that increase the generality of the processor in executing non-Java code. A careful inspection of the JVM ISA reveals a lack of opcodes executing the following operations: direct memory access, interrupt handling, multithreading assistance, multimedia extensions, cache administration, and status monitoring. JVMs typically rely on library calls to the underlying OS to perform these functions. picoJava-II extends the JVM ISA with 115 new instructions [92]. The added instruction types are diagnostic, register reads and writes, arbitrary loads and stores, other language support, and system software support. Extending the JVM would enable programming languages other than Java to run on such processors.

• JVM/RISC Processors of this type have two integrated programming models: one based on the JVM ISA and the other based on a general RISC machine. Siemens has developed a smart Java card based on Siemens' Triple-E embedded processors. The processor's instruction set is a superset of the original architecture's instruction set; the resulting CPU is bilingual, executing either Triple-E code or JBCs [141]. Strom et al. have implemented a processor core in which two virtual processors operating independently share the same datapath [142, 143, 144]. However, in their proposal the two processors do not run in parallel. Strom's proposal is based on the observation that approximately 50% of the RISC instruction set is common with the JVM architecture. By extending the RISC instruction set to include the JVM-specific instructions, both JVM and RISC code can run on their processor. In this dissertation, we propose a dual-input pipeline architecture, where JBCs and RISC binaries enter the pipeline separately at the front end. JBCs are folded and dynamically translated to RISC binaries, which are then processed by the back-end modules of the pipeline.

• RISC These are RISC processors that provide feature support for Java without executing any of the JBCs natively.

3.3.2 Bytecode Processing Capacity

JBCs could be processed or translated individually or folded. Bytecode-folding-based processors could execute stack pushes and pops in zero time by folding them with the main ALU operations. If the stack and LVs are allocated to registers, data dependency resolution techniques such as bypassing and data forwarding can save the time spent manipulating the stack. Sun Microsystems' Java chips picoJava-I, picoJava-II, and microJava execute the JBCs folded [66, 145, 146, 147, 148, 149, 150]. Translation-based processors could also fold JBCs before translating them, as we propose in this dissertation.

3.3.3 Bytecode Issuing Capacity

Converted JBCs can be issued singly or in multiples.

• Single-issuing In this option, the conversion code for a single JBC is issued one native instruction at a time to a regular RISC core [140].

• Multiple-issuing Three scenarios could take place here: (1) multiple JBCs are converted to one native instruction that is issued to a single-issuing core (as we propose in this dissertation); (2) a single JBC is converted to multiple instructions that are issued to a multiple-issuing core; or (3) multiple JBCs are converted to multiple instructions that are issued to a multiple-issuing core.

3.3.4 Complex Bytecode Processing

Executing all JVM instructions in hardware might not be feasible, as the JBCs differ widely in complexity. For instance, while the JVM has an instruction like iconst_0 that simply pushes a zero on the stack, it also has multianewarray, which creates a multidimensional array. Implementing such a complex instruction in hardware may complicate the design and make it unnecessarily slow. A more aggressive approach might select to implement in hardware only those instructions that directly improve Java execution, including the simplest ones. Complex, less commonly used instructions (e.g., array/object management, synchronization, method invocation, etc.) should be trapped and emulated in software. Overall, 23 JBCs are trapped and emulated in software in picoJava-II [92, 151].
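The decode-time decision this paragraph describes can be sketched as a simple routing function, shown below. The opcode set listed as trapped is illustrative only; it is neither picoJava-II's actual list of 23 trapped bytecodes nor a design decision of this dissertation.

    import java.util.Set;

    final class ComplexBytecodeFilter {

        // Example complex opcodes (mnemonics only); a real design would enumerate
        // the full set of bytecodes chosen for software emulation.
        private static final Set<String> TRAPPED = Set.of(
                "multianewarray", "invokeinterface", "monitorenter",
                "monitorexit", "athrow", "new");

        enum Route { HARDWARE, SOFTWARE_TRAP }

        /** Simple opcodes go to the datapath; complex ones raise a trap for emulation. */
        static Route route(String mnemonic) {
            return TRAPPED.contains(mnemonic) ? Route.SOFTWARE_TRAP : Route.HARDWARE;
        }

        public static void main(String[] args) {
            System.out.println(route("iconst_0"));        // HARDWARE
            System.out.println(route("multianewarray"));  // SOFTWARE_TRAP
        }
    }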

3.4 Execution Engine Organization

In this section, we examine the main processing engine that is suitable for both Java and other languages. We will discuss, as shown in the design tree in Figure 3.4, themes related to complexity, registers, caches, stack, and execution units (EXs).

Figure 3.4. Execution engine organization design space: complexity (RISC, CISC, decoupling from RISC and CISC), registers, caches (instruction, data, operand stack, others), stack, and execution units (ALU, type conversion unit, memory and I/O interface).

3.4.1 Complexity

ISAs range in complexity between the CISC and RISC styles. For a processor supporting an HLL, the CISC framework might be appealing initially, as this style may be more suitable for executing the complex operations incorporated. However, this paradigm may not be able to exploit the efficiencies of all modern RISC technologies. In this subsection, we compare the JVM against RISC, examine which CISC features would help in executing Java, and discuss how an architecture decoupled from both styles could be more desirable.

3.4.1.1 JVM Compared to RISC Cores

Here we compare both RISC and JVM execution in order to identify their differences and assess the suitability of the RISC style for Java. Based on this study we can highlight architectural features that need to be modified to narrow the semantic gap between the two.

• Register centricity In contrast to RISC machines, which depend on a large set of registers, JVM addressing modes involve stack manipulation, making the JVM stack the dual of the RISC registers. This duality has a number of aspects: (1) being local to the CPU, registers are faster than the JVM stacks that are implemented in memory; (2) registers are more efficient than the stack, as they allow out-of-order (OOO) evaluation of expressions, whereas the stack forces only in-order evaluation; and (3) RISC code simply reuses the value stored in the register file without any need for reloading or duplication, whereas popping a value from the operand stack deletes it. If this value is needed in subsequent operations, a load instruction must be issued or the special instruction dup can be used to duplicate operands on the stack to reduce unnecessary loads (see the sketch following this list). In summary, allocating the operand stack and LVs in hardware in a reconfigurable register file would narrow the JVM/RISC gap, reduce operand access time, and reduce memory traffic.

• Memory access LVs are accessed in the JVM only for loading and storing. All operations are done on the stack, except in the case of the iinc instruction, which increments a LV directly. Allocating LVs and the stack in hardware diminishes the frequency of memory access and makes the JVM a load-store architecture.

• ISA simplicity The JVM is similar to RISC in that it has a simple instruction set, which is efficient for pipelining and compiler optimization. However, the JVM has a group of object-oriented instructions, which are somewhat complex. Using a RISC core requires emulating these instructions in software.

• Instruction length RISC encodes all instructions in a fixed length (usually 32 bits). However, JVM operations are variable in length; the most frequently used lengths are from 1 to 4 bytes. As shown in Chapter 2, the JVM dynamic average instruction length is 1.7732 bytes and the standard deviation is 1.4. So, compared with RISC code, JVM code has a smaller size. This is an advantage for the JVM since a number of its JBCs can be folded into a single RISC instruction.

• Data types RISC compilers rarely know much about the data types they manipulate. The JVM, by contrast, employs types from a strictly defined hierarchy. Type information should be filtered out before passing the instruction to a RISC-based processor.

• Addressing modes Like other stack languages, the JVM works in only one of two addressing modes: stack or memory. Operands of an instruction come from the same source (the stack in the case of ALU operations). These homogeneous and simple addressing modes are similar to those used in RISC instructions.

• Parallelism JVM stack code requires a lot of effort (e.g., swaps of stack entries, instruction folding, reservation stations, etc.) to parallelize, whereas extracting ILP from RISC code is easier.
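To make the register-centricity contrast concrete, the snippet below shows, in comments, the stack code the Java compiler produces for a small expression and a plausible register form of the same computation. The RISC register assignment (r1 through r5) is an assumption made only for this illustration.

    class ReuseExample {
        int f(int a, int b) {             // instance method: a is LV 1, b is LV 2
            return (a + b) * (a - b);
            // Stack (JVM) code produced by javac:
            //   iload_1, iload_2, iadd, iload_1, iload_2, isub, imul, ireturn
            //   (a and b must be pushed again for the second subexpression)
            // Register (RISC-like) code, assuming a in r1, b in r2, result in r3:
            //   add r4, r1, r2
            //   sub r5, r1, r2
            //   mul r3, r4, r5
            //   (the register values are simply reused; no reload or dup is needed)
        }
    }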

3.4.1.2 CISC Features for JVM Implementation

Interestingly enough, the CISC style provides some merits that suit Java’s stack structure:

• Stack manipulation CISC processors are rich in such instructions (the x86 uses the stack for floating point operations). CISC's push and pop instructions are very handy for the JVM, but they were replaced in RISC cores with two instructions: a register increment or decrement followed by a load or a store. So, CISC engines might handle some Java stack operations more efficiently than some RISC processors.

• Register count Although traditional CISC cores appear to be in a weak position compared to RISC in the number of available physical registers, Java's stack organization compensates for this by using LVs and the operand stack.

• Instruction length The JVM is like CISC machines in supporting variable-length encoding. Experimental results in Chapter 2 show that JVM instructions range from 1 to over 100 bytes in length. Variable-length encoding promotes small code size but complicates the decoding process.

• Wide formats In contrast to the RISC processors, the JVM borrows the wide format from some of the CISC processors (like the x86). The JVM's wide modifier extends the next instruction to accommodate longer indices. RISC has eliminated such complications and casts all instructions in the same format.

3.4.1.3 Decoupling from RISC and CISC

An efficient JVM implementation should combine the advantages of both CISC and RISC designs in one microchip. This approach assumes neither a pure RISC nor a classic CISC, but a hybrid style. Such an organization provides a front-end CISC section in the CPU chip to translate the JVM code into RISC-like instructions for a back-end RISC core that performs superscalar and/or superpipelined execution [42]. The methodology adopted in this dissertation follows this line of development.

3.4.2 Registers

The JVM provides no directly accessible general-purpose registers; it only uses four user-invisible special-purpose registers to keep the state information necessary for code execution [9]. Although the JVM does not require accessing general-purpose registers, a good JVM hardware realization is likely to use hardware registers to hold at least the first few LVs. In this case, getting the compiler to keep the more frequently used values in lower-numbered LVs may improve performance. In addition, a general-purpose processor that runs both Java and non-Java code will definitely demand a register set for the non-Java code. picoJava-II has introduced additional registers to serve as CP pointers, keep track of filling the S-cache, manage threads, function as a program status word register, represent traps and breakpoints, handle C/C++ procedural calls, define allowable memory regions, and contain the version number and hardware configuration [9].

3.4.3 Caches

Java processors need substantial caching support in various forms to bring data and instructions closer to the core for fast processing [152, 153]. In addition to the usual instruction and data caching, stack and LV caching are required (Figure 3.4):

• Instruction cache (I-cache) JBCs need to be brought into an on-chip I-cache. This cache can be smaller than those used by RISC processors, as JVM instructions are shorter on average. For example, picoJava-II incorporates a direct-mapped, write-through I-cache that can range from 0 to 16 Kbytes in size and has a line size of 16 bytes, with a 32-bit-wide path to the memory bus and the pipeline [154].

• Data cache (D-cache) D-caches are required to cache the object heap, CPs, stack overflow, etc. picoJava-II includes a D-cache that ranges in size from 0 to 16 Kbytes and has a datapath and line size like the I-cache. To improve cache efficiency, this cache is a two-way set-associative, write-back, and write-allocate one [154].

• Operand stack cache (S-cache) From the programmer's perspective, the JVM stack resides in memory. For performance enhancement, part of that stack needs to be kept on-chip in an S-cache, as shown before. For instance, the picoJava-II core caches the operand stack's top entries in its 64-entry on-chip S-cache [154].

• Others Other caches can be used to facilitate branching (branch target buffer and branch prediction cache) and object addressing. As the LV access subcategory is the most utilized instruction group (Chapter 2), LV caching would also speed up Java.

3.4.4 Stack

As the JVM is a stack machine, implementing an advanced hardware stack can substantially improve its performance. Figure 3.5 depicts the design space of the stack.

Figure 3.5. Design space of the stack: frame organization (unified, partitioned), realization (classical, configurable register file), suitable stack size, components (S-cache, addressing unit, S-cache manager, special-purpose registers), and spill/refill management.

3.4.4.1 Frame Organization

Java stack frames consist of three parts: LVs, the operand stack, and frame data. The JVM specification does not require the three to be physically located in a single place, giving the JVM implementer some freedom to select a convenient way to organize the Java frame [155]. However, the specification requires accommodating both integer and floating-point data, together with all other JVM-supported data types, in one stack structure. The frame organization design space (Figure 3.5) can be split into:

• Unified In this organization, all stack parts are collected together in one shared area, which is usually arranged as a classical stack with some smart features. Subsequent frames are allocated on top of their predecessors. Mapping Java frames onto such an organization might be easier, but it would require full routing from each element to the different EXs to facilitate folding, etc. On the other hand, having only one hardware module for the stack shared among different threads requires certain arrangements to reach the different stack parts on each context switch. The picoJava-II stack uses this structure to optimize parameter passing from a caller to a method. Method calls are designed to allow stack parts to overlap between methods, enabling direct parameter passing without copying: the machine pushes parameter values onto the top of the operand stack, where they become part of the called method's LV area. Doing so saves some space, which in turn reduces the required stack spilling and filling operations [92].

• Partitioned This approach allocates a different hardware entity for each frame part, allowing the use of a different internal organization for each one. For example, it might be more appropriate to organize the LV area as a register file, which matches very well the indexing mechanism utilized by the JVM, whereas the operand stack needs to be arranged as a last-in first-out structure or as a small scratchpad memory, as will be explained soon. On the other hand, as the frame data is just a list of pointers, a set of special-purpose registers might be suitable for storing it. Partitioning the frame also allows a pool of resources to be allocated for each part and shared among different threads, which facilitates dynamic context switching. When it comes to routing the processor datapath, this approach is more efficient: it facilitates connecting each part to different EXs, which is an advantage for bytecode folding. Due to its suitability for parallelism exploitation, this approach is incorporated in this dissertation.

3.4.4.2 Realization

Although JBCs require a specific stack logical organization, it can be realized physically using different techniques, as shown in Figure 3.5:

• Classical Stacks can be realized using an on-chip memory organized as a classical last-in first-out structure. Although this is easier to control, it serves as a dedicated unit for Java only. Also, thread accesses to the stack have to be serialized, which, together with the serial nature of stack code, diminishes achieving any form of ILP. Classical stacks can be split into two types according to the methodology of pushing/popping (Figure 3.6): (1) a pointer-based structure that incorporates a sequentially addressed memory with a pointer to the top entry. This structure is simple, but has address calculation latency. Additionally, direct routing from the stack to different EXs requires access to all stack entries; and (2) a fixed-top stack where the top of the stack always resides at a fixed location. All existing elements are pushed down upon pushing an element and pulled back up upon popping an element. Although this structure might be a bit more complicated than the first type, direct routing can be achieved by connecting a few top elements of the stack to the datapath.
• Configurable register file A RISC-style register file can be configured to work as a stack or a register file. This is more flexible than a dedicated stack in allowing both Java and non-Java code to be executed on the same architecture. JBC execution can benefit from the random access capabilities of register files in accessing hidden LVs and directly passing their values to EXs, which aids in the case of folding JBCs. Such an organization has another advantage: it facilitates hardware translation of JBCs by creating a circular address space that implements a buffer or a stack.
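To contrast the two classical organizations in software terms, here is a minimal behavioral sketch, assuming a 64-entry stack as in picoJava-II; the class names and sizes are illustrative only.

    // Pointer-based organization: a memory array plus a top pointer.
    final class PointerBasedStack {
        private final int[] mem = new int[64];
        private int top = -1;                     // index of the current top entry
        void push(int v) { mem[++top] = v; }
        int pop()        { return mem[top--]; }
    }

    // Fixed-top organization: entry 0 is always the top of the stack, so only a
    // few leading entries need to be wired to the datapath. Pushing shifts the
    // existing items down; popping pulls them back up.
    final class FixedTopStack {
        private final int[] entries = new int[64];
        private int depth = 0;
        void push(int v) {
            System.arraycopy(entries, 0, entries, 1, depth);      // push items down
            entries[0] = v;
            depth++;
        }
        int pop() {
            int v = entries[0];
            System.arraycopy(entries, 1, entries, 0, depth - 1);  // pull items up
            depth--;
            return v;
        }
    }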

[Figure 3.6 panels: (a) a pointer-based stack structure, with a stack top pointer selecting the top element and a connection to the core datapath; (b) a fixed-top stack structure, with the top elements wired directly to the core datapath.]

Figure 3.6. Different ways of organizing classical S-caches.

picoJava-II implements its stack as a single pointer-based classical structure containing 64 32-bit entries. This organization aims at providing quick random access to LVs, but it cannot be configured to work as a regular register file [92].

3.4.4.3 Suitable Size

The JVM assumes an unlimited stack. However, when it comes to a hardware realization of an S-cache, a question about its appropriate size is raised as a trade-off between access time and hardware resources. The larger the stack, the less the need to go back to the D-cache to spill/fill information, whereas increasing the S-cache size implies allocating more on-chip resources. picoJava-II includes a single 64-entry, 32-bit-wide S-cache for the entire stack [92].

3.4.4.4 Components

An on-chip stack module needs, at least, the following units:

• Stack cache This unit caches the topmost stack frames corresponding to the currently active execution threads.
• Stack addressing unit This unit generates the addresses required to access target stack locations. The main task here is to make stack accessing independent of the organization technique (stack or register file).
• Stack manager This unit prevents S-cache underflows and overflows by moving data to and from the D-cache in advance. We will explore the design of this unit shortly.
• Special purpose entries These entries store the current frame pointer, the LV base, and the stack pointer. They are used by the stack addressing unit to generate the addresses required to access the data in the S-cache.

3.4.4.5 Spill/Refill Management

The finite size of the on-chip S-cache implies that it actually holds only the top few entries of the main stack. However, an application could call several methods in a row, overflowing the stack (the results in Chapter 2 show that hierarchical method invocations can go to 120 levels deep!). Conceptually, pushing the next item on the stack would cause the bottom entry to be flushed out to memory or, more likely, to the on-chip D-cache. Likewise, when the stack is running out of elements, popping the last element would cause the S-cache to refill from the D-cache. This means that any hardware implementation of a Java stack should be accompanied by an intelligent spill/refill module. picoJava-II predicts and prepares for S-cache growth in the background through a process called the dribbler. In the event that an S-cache overflow is predicted, values are saved into the D-cache in the background. When there is room for data in the S-cache, values are restored. In practice, picoJava-II applies a hysteresis technique to this fill and flush logic. When the S-cache is nearly full, the dribbler begins moving items from the bottom of the S-cache to memory. Likewise, when the S-cache is nearly empty, the chip starts loading from memory so that the S-cache will not run dry. Programmers can set the S-cache's high/low water mark through a 5-bit field in a control register [92].
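As a rough illustration of the hysteresis idea behind the dribbler, the following sketch keeps the on-chip occupancy between two water marks by spilling and refilling in the background; the thresholds, capacity, and interface are illustrative assumptions, not the actual picoJava-II parameters.

    // Hysteresis-based spill/refill in the spirit of the picoJava-II dribbler.
    final class SpillRefillManager {
        static final int CAPACITY   = 64;   // S-cache entries (illustrative)
        static final int HIGH_WATER = 56;   // start spilling above this occupancy
        static final int LOW_WATER  = 8;    // start refilling below this occupancy

        private int occupancy;              // entries currently held on-chip

        void onPush() { occupancy++; }      // called by the stack cache
        void onPop()  { occupancy--; }

        // Runs in the background, overlapped with normal execution.
        void dribble(DataCache dCache) {
            if (occupancy > HIGH_WATER) {
                dCache.spillBottomEntry();      // move the oldest entry off-chip
                occupancy--;
            } else if (occupancy < LOW_WATER && dCache.hasSpilledEntries()) {
                dCache.refillBottomEntry();     // bring an entry back on-chip
                occupancy++;
            }
        }

        // Hypothetical interface to the D-cache side of the transfer.
        interface DataCache {
            void spillBottomEntry();
            void refillBottomEntry();
            boolean hasSpilledEntries();
        }
    }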

3.4.5 Execution Units

Java processors need to include a number of ALUs to achieve a higher degree of parallelism. Moreover, executing Java in hardware imposes some additional functional requirements. For example, for a strictly typed language like Java, type conversion is very important. A simple type conversion unit implemented as a combinational logic circuit can do this task. Additionally, a memory and I/O interface unit is needed to serve as the link between the processor core and any other interfaces that could reside on or outside the same die. picoJava-II's system bus is 32 bits wide, suitable for 32-bit data and I/O transfers [156].

3.5 Parallelism Exploitation

Higher processor performance can be achieved by increasing the degree of parallelism in executing the operations. This parallelism can be achieved at different levels of instruction issuing and execution. In this section we explore the opportunities for parallelism in Java processors. Mainly, two disciplines will be discussed: instruction level parallelism (ILP) [157, 158, 159, 160, 161] and thread level parallelism (TLP) (Figure 3.7).

3.5.1 Pipelining

In addition to the aspects considered while designing regular processor pipelines, the JVM brings other new issues [69]. These issues are related to stack processing, object-oriented manipulation, multithreading, etc. Figure 3.7 shows the design space for Java processor pipelines. The number of pipeline stages is highly dependent on the processor internal organization and the bytecode execution methodology. However, deeper pipelines have a special benefit for multithreaded processors: they allow processing from more than one thread to take place at the same time through the pipeline.

[Figure 3.7 design-space tree: parallelism exploitation splits into instruction level parallelism and thread level parallelism. Instruction level parallelism covers pipelining (depth: short pipelined or superpipelined; properties: branch prediction in scalar branch or branch vectored form, and instruction buffering) and multiple-issuing (scheduling methodology: static via VLIW, or dynamic via instruction folding and superscalarity; VLIW reorganization at compile time, by a thin JVM layer in hardware, or through a hardware adaptor). Thread level parallelism covers synchronization support, issue capacity (simplex threading or simultaneous multithreading), and scheduling policy (static or dynamic).]

Figure 3.7. Design space tree for parallelism exploitation.

picoJava-I uses only four stages, whereas picoJava-II splits two of these stages into two parts, resulting in a 6-stage pipeline [92]. The stack nature of JBCs imposes some restrictions and exhibits certain characteristics on the instruction pipeline in the direct hardware execution of JBCs (an execution approach that transforms JBCs to another form may not encounter some of these issues):

• In processors directly executing JBCs, an instruction almost always depends on the result of the previous one, so only one operation at a time is allowed to occupy the execute stage. Therefore, OOO superscalar execution will not be possible.
• In all likelihood, a D-cache access is required as the input for the next stack instruction. Processors executing JBCs directly should not overlap access to the D-cache with the execution of the next instruction, as is the case in picoJava-II.
• Accesses to the S-cache should take only one clock cycle, as is the case for accessing the register file in most processors.
• If the loaded data is required by the next instruction, an additional one-cycle load-use penalty will be incurred.
• Since compute instructions operate on stack data only and never on memory data, if the stack resides on chip, both compute and memory access instructions can be processed simultaneously in the pipeline, as in Harvard architectures.

As in most advanced processors, an advanced branch prediction methodology should be used [162, 163, 164, 165, 166, 167]. This is especially critical for the performance of Java applications on deeply pipelined processors. picoJava-II does not have branch prediction logic; it simply assumes every branch is not taken, and there is a 3-cycle penalty when a branch is taken [92]. Generally, the JVM needs two forms of branch prediction (Figure 3.7):

1. Scalar branching Here the instruction has one jump-to address. All traditional techniques used for this purpose can be applied here.
2. Branch vectored This is useful in dealing with switch-case/table-lookup statements at the hardware level. The JVM incorporates two table lookup instructions that involve multiple target addresses. Direct hardware support for these instructions faces a more complex branch prediction problem, requiring the prediction of one of multiple jump-to locations. The trivial solution involves treating switches as likely to re-use the last choice. HP's PA-RISC provides such an instruction [168].
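To ground the branch vectored case, consider a small Java example; that a dense switch such as this is compiled by javac into a single tableswitch bytecode reflects common compiler behavior and is not specific to any processor discussed here.

    // A dense switch like this is normally compiled into one tableswitch
    // bytecode; its multiple jump-to addresses are exactly what a
    // "branch vectored" predictor would have to anticipate.
    final class SwitchExample {
        static int decode(int opcode) {
            switch (opcode) {
                case 0:  return 1;   // each case is one entry in the jump table
                case 1:  return 2;
                case 2:  return 4;
                case 3:  return 8;
                default: return 0;
            }
        }
    }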

An instruction buffer is needed to provide storage for prefetched instructions, thus keeping the subsequent pipeline modules busy all the time. A well designed instruction buffer can also help in branch prediction. picoJava-II incorporates a 16-byte instruction buffer.

3.5.2 Multiple-Issuing

Instruction issue and execution are closely related; the more parallel the instruction execution, the higher the number of instructions that should be issued. In this subsection, we study how Java execution in hardware could be accelerated through multiple-issue cores. As Figure 3.7 shows, the scheduling methodology offers two options: static through VLIW and dynamic through superscalarity. The dynamic scheduling required for higher issuing rates can be achieved in a number of ways, such as instruction folding and reservation stations.

3.5.2.1 VLIW Organization for Bytecodes

VLIW-based processors execute pre-scheduled instructions [169, 170]. Scheduling is done at compilation time rather than at execution time. Although this static nature might prevent achieving higher issuing rates, it has the appeal of resulting in simpler hardware than the dynamic approach. It is easy to recognize a group of several combinable JVM instructions (like load, add, and store) and translate them into a single three-address operation. A high performance implementation will cache the current stack frame (or a few frames) in the processor registers; thus, if LVs are used, this sequence would be implemented as a single register-to-register operation. Executing JBCs in their current format on a VLIW engine demands reorganizing them. This can be done using one of three methods (Figure 3.7):

• At compile time Java compilers need to produce the required VLIW format. This requires modifying the class file format and the JVM specification in general. Compatibility issues make this approach difficult.
• A thin JVM layer in hardware A very thin JVM layer running on a Java processor could be designed to produce the VLIW formats from JBC streams.
• Through a hardware adaptor A recommended way is to organize a hardware adaptor to form the required VLIW format. While translating JBCs to the processor native ISA, it can reorganize them as VLIWs. A reasonably sized cache can then keep these words for later referencing, resulting in faster translation.

3.5.2.2 Superscalarity through Reservation Stations

From the previous discussions, it is obvious that JVM stack code is very difficult, if not impossible, to execute in a superscalar form. This is mainly due to the virtual dependency between successive instructions, created by using the stack. The instructions themselves may not be related at all, but the stack forces the instructions to be issued, executed, and retired in order. As a matter of fact, any processor that directly implements the JVM architecture might not achieve superscalarity. For example, picoJava-II is not a superscalar architecture; operations are executed and retired in order. If an operand has to be fetched from the main memory during a cache miss, the CPU must stall to wait for it. Structures that incorporate superscalar JVM architectures executing stack operations in parallel with adjacent instructions need first to find an efficient way to handle the above dilemma. The use of reservation stations is one possible approach. Reservation stations allow OOO execution, register renaming, and dynamic scheduling. Such a structure demands allocating both the stack and LVs in registers. It will be the responsibility of a specialized unit (e.g., the folding unit) to interface the stack-based instruction stream to the register file and handle stack sharing among different threads.
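As a rough sketch of what a Tomasulo-style reservation station entry holds, assuming the stack and LVs have been mapped onto a register file as suggested above; the field names are illustrative and do not correspond to the JAFARDD structures described in later chapters.

    // One reservation station entry: it waits until both operands are captured,
    // snooping a common data bus for results tagged by producing stations.
    final class ReservationStation {
        boolean busy;
        String  op;          // operation to perform (e.g., "iadd")
        int     vj, vk;      // operand values, once available
        int     qj, qk;      // tags of producing stations (0 means value ready)
        int     destTag;     // tag broadcast with this station's result

        boolean readyToIssue() {
            return busy && qj == 0 && qk == 0;   // both operands captured
        }

        // Snoop the common data bus: capture a result whose tag we wait for.
        void snoopCdb(int tag, int value) {
            if (qj == tag) { vj = value; qj = 0; }
            if (qk == tag) { vk = value; qk = 0; }
        }
    }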

3.5.3 Thread Level Parallelism

Multithreaded architectures represent a promising methodology in processor design to achieve a higher degree of parallelism and to hide input/output latencies [90]. These designs support building scalable structures to exploit TLP [171, 172, 173, 174, 175, 176]. Although thread primitives are an integral part of Java and the JVM supports multiple concurrent threads, not much information about threads is provided in the JVM specification. In current implementations, threads are supported by runtime libraries and routines. The design space for TLP includes each of the following [177]:

• Synchronization support The JVM relies on monitors to synchronize multithreaded access to different objects. picoJava-II achieves monitor entry and exit operations by the use of object header bits and an entry cache [92].
• Issue capacity In simplex threading, one instruction from one thread is issued at a time. Although simple in structure, it lacks the ability to exploit ILP. In simultaneous multithreading, one or more instructions from one or more threads are issued per cycle, thus achieving both ILP and TLP [178, 179, 180, 181]. As a fully multithreaded language, Java offers a golden opportunity for simultaneous multithreading designs.
• Scheduling policy There exist two scheduling policy types: static, where the hardware supports a fixed number of threads, and dynamic, where a pool of resources is shared among a large number of threads as long as there are enough resources. Java's multithreading feature dictates the use of a dynamic scheduling policy.

The multithreading issues will not be explored further in this dissertation.

3.6 Support for High-level Language Features

Java execution can also be enhanced by supporting its HLL features. This can be achieved directly by incorporating high level operations in the native processor ISA [182, 183, 184, 185]. Another, indirect way is to include some hardware modules that support high-level features like garbage collection or object addressing. In this section we discuss a number of methods for accelerating Java through HLL support (Figure 3.8). However, HLL support will not be addressed further in this dissertation.

[Figure 3.8 design-space tree: support for high-level language features splits into general features (plenty of registers, symmetry, orthogonality), object orientation (field initialization, dynamic method invocation, dynamic linking, object addressing), exception handling, symbolic resolution (caching, pre-resolution, pipelining), and garbage collection (software assisting constructs: write barriers, special purpose instructions, a tag bit; or full hardware support).]

Figure 3.8. Design space for supporting Java high-level language features.

3.6.1 General Features

Any ISA can provide some features to assist in executing Java without sacrificing the generality of the architecture [186, 187] (Figure 3.8):

• Plenty of registers This reduces the number of memory accesses, provides more on-chip resources for context switching, and assists compiler optimization.
• Symmetry This implies having an ISA that associates the same addressing modes with accessing operands in all instruction formats.
• Orthogonality This means that every data type is supported by the full opcode set.

3.6.2 Object Orientation

As a fully object-oriented language, Java requires some support for object orientation [152, 153, 188]. Objects mainly consist of three components: data, methods, and some symbol tables (to help access the first two components). Hardware could support object orientation in field initialization, dynamic method invocation, dynamic linking, and object addressing (Figure 3.8), as found in [131, 152, 153, 188, 189].

3.6.3 Exception Handling

Java is well known for its runtime support of error checking. It reacts to errors by throwing exceptions; one example is the array bounds check. Nothing is said about exception handling in the specification of picoJava-II. But throwing an exception may dramatically affect the overall performance. Most probably, raising an exception results in an I-cache miss. This requires going back to the main memory to transfer a block of instructions into the I-cache, which increases the processing time. In addition, directing the execution engine towards the exception handler flushes the pipeline, thus adding to the overall exception penalty. So, Java processors are in desperate need of advanced techniques for processing exceptions [190, 191]. Methods used for achieving precise interrupts can be useful here. In addition, handlers for the frequently raised exceptions (like array bound checking) can be cached to shorten the processing latency.
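A small Java example of the bounds check mentioned above; the method and class names are illustrative.

    // Java's runtime error checking in action: an out-of-range access does not
    // touch memory, it throws ArrayIndexOutOfBoundsException, whose handler a
    // processor could cache because such exceptions are raised frequently.
    final class BoundsCheckExample {
        static int readElement(int[] a, int i) {
            try {
                return a[i];          // implicit bounds check on every array access
            } catch (ArrayIndexOutOfBoundsException e) {
                return -1;            // illustrative recovery path
            }
        }
    }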

3.6.4 Symbolic Resolution

The Java system requires the virtual machine to resolve symbolic references at runtime. For a traditional processor, the compiler and linker resolve all symbolic references and generate a simple addressing mode that requires only one memory reference to load a value from memory. Ideas to help this process include (Figure 3.8):

• Caching This requires having an on-chip global repository for dynamically resolved symbols. This container stores symbols with their resolved values.
• Pre-resolution The JVM specification requires symbols to be resolved upon request, but the hardware can speculatively resolve symbols and cache them for future use.
• Pipelining A special pipeline stage can be included for the resolution process. Overlapping the resolution with other useful tasks alleviates its latency effect.
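The caching idea can be sketched in software as a small map from constant-pool indices to resolved values; the representation of a resolved symbol as an address, and the resolver interface, are illustrative assumptions only.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of a repository for dynamically resolved symbols.
    final class ResolutionCache {
        private final Map<Integer, Long> resolved = new HashMap<>();

        long resolve(int cpIndex, Resolver resolver) {
            Long cached = resolved.get(cpIndex);
            if (cached != null) return cached;          // hit: reuse earlier work
            long value = resolver.resolveSymbol(cpIndex); // miss: resolve (possibly
            resolved.put(cpIndex, value);                 // speculatively) and cache
            return value;
        }

        // Hypothetical interface to the actual resolution machinery.
        interface Resolver { long resolveSymbol(int cpIndex); }
    }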

3.6.5 Garbage Collection

Modern programming languages lean towards supporting garbage collectors to accelerate the software development cycle and reduce runtime memory leaks [192]. However, garbage collection imposes challenges on system design. For instance, it can ruin the caching mechanism: its exhaustive search can destroy all the work that the cache and the branch predictor controller have done to keep the most current and important data close to the CPU. Hardware support for garbage collection can overcome these challenges (Figure 3.8). This support ranges between two extremes: hardware-assisted constructs for software implementations and full hardware support [54, 193, 194, 195]. Hardware-assisted constructs used by picoJava-II include write barriers, special purpose instructions, and tagging bits [92].
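For context, a write barrier of the kind picoJava-II assists in hardware can be sketched in software as an extra bookkeeping step attached to every reference store; the card-marking scheme, table sizes, and names below are illustrative assumptions, not a description of picoJava-II's mechanism.

    // Illustrative software write barrier: every reference store also marks a
    // card so the collector can limit its scan to recently modified regions.
    final class CardMarkingHeap {
        private static final int CARD_SHIFT = 9;            // 512-slot cards
        private final byte[] cardTable = new byte[1 << 11];
        private final Object[] heap = new Object[1 << 20];

        void storeReference(int slot, Object value) {
            heap[slot] = value;                              // the actual store
            cardTable[slot >>> CARD_SHIFT] = 1;              // the write barrier
        }
    }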

3.7 Conclusions

Virtual machine implementations of Java include either a processor realization or a system software design with clever software support. In this study, we addressed the hardware solution, which appears to combine the best performance with the best memory footprint. We carried out our study through the design space approach. Design space trees allow exploring different design options in a convenient graphical form. From this study it is quite clear that Java processors provide their designers with plenty of choices from which to select. In addition, the nature of Java and its virtual machine brings some new design issues into consideration. The depth and diversity of the studied design trees indicate that there are various opportunities for accelerating Java through hardware support.

In the course of our analysis, we made our best effort to examine every relevant aspect. We started at the top level and recursively visited each of the leaves. Investigated design options included the design methodology, the execution engine organization, parallelism exploitation, and support for HLL features. For each of these items, further divisions were made to examine various options. Our exploration was guided by two major issues: (1) supporting modern processor features while (2) maintaining the generality of the processor. Modern processors lean towards pipelining, multiple instruction issuing, multithreading, caching, etc. Each of these aspects was discussed in detail in the appropriate context. We believe that the success of a Java processor is correlated with its generality, represented in its ability to execute non-Java applications. While traversing the design trees, some recommendations for Java architectures were made. Table 3.1 summarizes our study results, highlighting the recommended technique(s) for each examined aspect. We conclude here that, by applying new hardware principles to the creation of Java chips, the architecture may deliver fast, native execution of Java.

Here, we want to stress the importance of handling JVM stack operations efficiently. As this study makes clear, the stack is involved in executing almost every JBC. With the stack being a main bottleneck in the architecture, we conclude that providing innovative solutions for this problem would enhance the overall performance of Java programs. In the following chapters, we will show our design methodology for a Java processor and justify our selections. Based on the design space exploration examined in this chapter, decisions made on most of the execution engine organization and features will be presented. The work in this chapter has been published in [196]. The next chapter introduces the JAFARDD processor and discusses its global architectural design principles.

Table 3.1. Summary of the recommended design features for Java hardware.

Design aspect: Recommendation

Design methodology
  Generality support method: Conversion
  Conversion method: Translation
  Converter implementation: Hardwired
  Native programming model: JVM/RISC
  Bytecode processing capacity: Folded
  Bytecode issuing capacity: Multiple
  Complex bytecode processing: Trapped

Execution engine organization
  Complexity: Decoupling from RISC and CISC
  Registers: Incorporate a register-based architecture
  Caches: Include instruction, data, operand stack, and other caches
  Stack frame organization: Partition the frame
  Stack realization: Configurable register file
  Suitable stack size: Refer to Chapter 2
  Stack components: S-cache, addressing unit, S-cache manager, and special purpose registers
  Spill/refill management: Use the picoJava-II dribbler mechanism
  Execution units: Have a type conversion unit; design ALUs suitable for multiple-issuing cores

Parallelism exploitation
  Pipelining depth: Superpipelined
  Branch prediction: Support scalar and vectored branching
  Instruction buffering: Include an instruction buffer
  VLIW organization: Use a hardware adaptor
  Superscalarity: Use reservation stations
  Multithreading synchronization support: Use object header bits and cache monitor entries
  Issue capacity: Design as simultaneous multithreading
  Scheduling policy: Dynamic scheduling is more efficient

Support for high-level language features
  General features: Consider having plenty of registers, symmetry, and orthogonality
  Object orientation: Boost field initialization, method invocation, dynamic linking, and object addressing in hardware
  Exception handling: Cache most frequently raised exceptions
  Symbolic resolution: Employ caching, pre-resolution, and pipelining
  Garbage collection: Rely on hardware support

Chapter 4

Overview of the JAFARDD Microarchitecture

4.1 Introduction

This chapter describes the organization of JAFARDD (a Java Architecture based on a Folding Algorithm, with Reservation Stations, Dynamic Translation, and Dual Processing), a processor designed to enhance JBC execution. JAFARDD is a novel stackless architecture for executing JBCs directly in hardware. Our Operand Extraction (OPEX) bytecode folding algorithm, described in detail in Chapter 5, and the well known Tomasulo algorithm [197] work in concert to deliver higher performance. JAFARDD dynamically translates JBCs to RISC-style instructions. This facilitates the use of a typical RISC core to execute simple instructions quickly, and hence, a faster clock can be used. Many of our performance enhancement techniques improve on the key language features in Java. In this chapter, a global view of JAFARDD's architectural design principles is presented with an overview of the proposed architecture. In Chapter 6, details of each pipeline stage and on-chip unit will be thoroughly discussed. We will reflect on the global design principles and examine how each pipeline module is designed to address these concerns. For clarity of presentation, the following assumptions and simplifications are made, without affecting the generality of our discussion:

1. No nop instructions are processed, as is the case for most Java programs [87].
2. wide modifiers are ignored. Benchmark results show a very low occurrence of this modifier in real Java code [86, 87].
3. The cache reads only complete JVM instructions per access. That is, instructions of length greater than one are not split between two successive cache reads.
4. Only integer and objectref data types are processed. Extensions to other data types follow a similar approach, but might unnecessarily complicate our discussion. Therefore, we choose to describe the integer portion of the datapath and the object processing only.

This chapter is organized as follows. Section 4.2 discusses the global architectural design principles of JAFARDD. The design features of the proposed microarchitecture are presented in Section 4.3. Section 4.4 outlines the processing phases of JBCs, whereas Section 4.5 summarizes the instruction pipeline stages. A detailed block diagram of the JAFARDD architecture is given in Section 4.6. In Section 4.7, we give a brief idea about the dual processing capability of JAFARDD. Conclusions are drawn in Section 4.8.

4.2 Global Architectural Design Principles

After a close examination of our bytecode folding algorithm (Chapter 5), we realized that dynamic translation of JBCs to native code is an integral part of performance improvement. We have also studied the well established Tomasulo's technique that facilitates OOO execution so that processing speed can be further enhanced. We have identified several architectural design principles at the global level that are necessary to ensure that JAFARDD can execute Java efficiently.

Architectural Design Principle 1 Common case support A major guideline for designing JAFARDD is to fully comply with the well known Amdahl's law, which recommends making the common case fast and the rare case correct [42, 63].

Architectural Design Principle 2 Stack dependency resolution Stack access is a major bottleneck in JVM implementations. To achieve any ILP, JAFARDD has to be stack-independent. This implies that some of the stack functionality has to be compensated for by other modules.

Architectural Design Principle 3 Working data elements storage on-chip In software implementations of the JVM, LVs, the stack, and other data elements reside in the main memory (or the data cache). To improve performance, frequently accessed data elements should be moved closer to the processor.

Architectural Design Principle 4 Exploitation of ILP ILP exploitation among JBCs and the translated RISC instructions is required to speed up processing.

Architectural Design Principle 5 Dynamic scheduling Like most modern processors, JAFARDD should incorporate dynamic scheduling and OOO execution of JBCs. Pipeline modules need to be organized in such a way as to facilitate these mechanisms and at the same time handle hazards.

Architectural Design Principle 6 Utilization of a Java-independent RISC core Instead of a standalone Java-specific core, building a Java processor based on a typical RISC core would allow performance enhancement using well established techniques. Therefore, JAFARDD should try to match the Java requirements with constructs that are general enough to be used in other languages, without incorporating Java-specific language features themselves in the instruction set.

Architectural Design Principle 7 Making minimal changes to the software systems To facilitate practical use, the architecture should be organized in a way that requires no or minimal changes in the software.

Architectural Design Principle 8 Hardware support for object-oriented processing As Java is a fully object-oriented language, efficient hardware support for object processing would be desirable.

Architectural Design Principle 9 Folding information generation off the critical path Using critical path analysis, Radhakrishnan et al. showed that fold and decode is the longest pipeline stage in the picoJava-II processor [127, 146, 159]. To reduce delay, they proposed to remove bytecode folding from the main pipeline stream using a fill unit and a decoded bytecode cache. JAFARDD needs to employ these findings by generating folding information off the main pipeline.

Architectural Design Principle 10 Maintaining a high bytecode folding rate Folding information should be generated at a rate higher than the actual folding rate to keep pipeline modules utilized at all times, thus addressing Architectural Design Principle 9.

4.3 Design Features

Design features that characterize JAFARDD and address the global architectural design principles are: (1) OPEX bytecode folding to provide a stackless venue for executing stack code; (2) dynamic binary translation of folding groups to permit JBC processing using a RISC core; (3) dynamic Tomasulo's hardware to facilitate runtime scheduling; (4) a deep and dynamic pipeline that provides high throughput and efficient ILP exploitation; and (5) load/store units to manipulate object-oriented instructions. No software interpretation is done in JAFARDD. JBCs are folded by hardware into groups and then dynamically translated to a native RISC format using a hardwired core. This approach: (1) achieves a high JBC processing rate; (2) facilitates the use of a typical RISC core to execute simple instructions quickly; (3) enables ILP exploitation among translated instructions using well established RISC techniques; (4) provides backward compatibility to execute non-Java software and facilitates the migration to Java-enabled hardware; (5) is able to generate highly optimized native code; and (6) narrows the semantic gap between the JVM and the processor, without adding extra semantic content to the native instructions.

4.4 Processing Phases

From the JBCs' perspective, they are processed in the following four phases (Figure 4.1): (1) Front end, which feeds raw JBCs into the instruction pipeline by controlling the instruction cache; (2) Folding administration, which snoops on the instruction cache output bus to generate information about folding groups based on the read bytecodes (this phase runs in parallel with the main processing stream); (3) Queuing, folding, and dynamic translation, which folds and translates JBCs on the fly to a decoded RISC instruction format; and (4) Dynamic scheduling and execution, which hosts the modified Tomasulo algorithm for JBC execution.

4.5 Pipeline Stages

In the JAFARDD instruction pipeline, the four processing phases are realized in the following seven stages (Figure 4.2): (1) Bytecode fetch, which fetches JBCs from the instruction cache and sends them to the bytecode queue (outside the main pipeline flow, and parallel to it, lies the folding administration off-pipeline stage, which captures fetched JBCs, analyzes them, and generates folding information); (2) Bytecode queue and fold, a dual-task stage that first queues the bytecodes fetched from the instruction cache, then, at issue time, dequeues and folds those bytecodes using the folding information generated in the folding administration stage; (3) Bytecode translation, which dynamically translates folded JBCs into the fields of a decoded RISC instruction format; (4) Local variable access, which accesses LVs as required; (5) Instruction shelve, which hosts reservation stations at the entry of the various execution units; (6) Execute, which is the execution stage; and (7) Write back, which writes the result back into LVs.

[Figure 4.1: the front end feeds bytecodes to the queuing, folding, and dynamic translation phase, which passes RISC code to the dynamic scheduling and execution phase; the folding administration phase supplies folding information in parallel.]

Figure 4.1. General JBC processing phases.

[Figure 4.2: the pipeline stages for JBC processing are bytecode fetch; bytecode queue and fold; bytecode translation; local variable access; instruction shelve; execute; and write back. The front end covers bytecode fetch; queuing, folding, and dynamic translation covers the queue/fold and translation stages; dynamic scheduling and execution covers the remaining stages. The folding administration stage snoops on the I-cache bus alongside the main pipeline.]

Figure 4.2. Pipeline stages for JBC processing.

4.6 Overview of the JAFARDD Architecture

Figure 4.3 shows the layout of JAFARDD architecture. The four processing phases are partitioned as the dashed regions in the figure.

[Figure 4.3 block diagram: the front end contains the bytecode fetch unit (BF), the instruction cache (I-cache), and the bytecode counter (BC). The queuing, folding, and dynamic translation phase contains the bytecode queue manager (BQM), the bytecode queue (BQ), the folding information generation unit (FIG) and folding information queue (FIQ) forming the folding administration, and the folding translator (FT), which produces operations and LV indices. The dynamic scheduling and execution phase contains the local variable file (LVF), the tag generation unit (TG), the reservation stations (RS), the branch unit(s) (BR), the arithmetic and logic unit(s) (ALU), the load/store unit(s) (LS), the data cache (D-cache), and the common data bus (CDB).]

Figure 4.3. Block diagram of the JAFARDD architecture. Detailed signal information is omitted for simplicity. Modules needed for folding have thick borders. Processing phases are partitioned into dashed regions.

At runtime, JBCs are brought into the instruction cache (I-cache). The bytecode fetch unit (BF) moves several JBCs, as indexed by the bytecode counter (BC), from the I-cache into the bytecode queue (BQ). At the same time, parallel to the main pipeline, the folding information generation unit (FIG) snoops on the I-cache bus to generate folding information on the read JBCs and saves it in the folding information queue (FIQ). At issue time, the bytecode queue manager (BQM) folds JBCs based on the FIQ information. Each folding group is translated into a single, RISC-style instruction in the folding translator (FT).

In many software implementations of the JVM, LVs reside in the main memory or the data cache (D-cache). In JAFARDD, a register file holding the JVM-visible LVs is used to improve data access. This implies that referencing an LV has the same simplicity and flexibility as handling a register in a RISC processor, with no memory access. In addition, some other LVs used internally, such as by the Tomasulo algorithm, are also included in this local variable file (LVF). It is assumed that all LVs are available in the three-port LVF. Method setup, which is responsible for fitting all the LVs into the LVF, will not be discussed here. Interested readers should refer to a JVM specification reference [6, 7, 8].

There are three types of execution units (EXs). The load/store units (LSs) are capable of accessing the main memory or D-cache to retrieve/store object fields and CP items, as well as initiating memory-related instructions by trapping to emulation routines in the OS. Also included is a set of EXs that perform regular ALU-type operations; the ALU classes are single-cycle integer units (SIs), multi-cycle integer units (MIs), and floating point units (FPs). Branch units (BRs), specialized in processing branch instructions and possibly coupled with prediction and recovery mechanisms, are the third type of execution unit. Associated with each EX is a set of reservation stations (RSs) to ensure the availability of all required operands before allowing the corresponding instructions to proceed to the EXs. These RSs are used for efficient runtime scheduling according to the Tomasulo algorithm. In the same pipeline stage as the LVF is the tag generation unit (TG), which generates destination tags for the RSs, as will be explained in Chapter 6. Results are broadcast on the common data bus (CDB) to be captured by other modules.

Details of the structure and operation of each module in Figure 4.3 will be clarified in Chapter 6. Observe that the proposed processor architecture does not include any major control unit. Control is inherent in the pipelined architecture of the processor and is supported by a few distributed minicontrollers that are constructed as simple combinational circuits.

4.7 Dual Processing Architecture

Using the instruction pipeline convention presented, the JBCs are processed in the stages shown in Figure 4.4(a). Examining a typical RISC pipeline (Figure 4.4(b)), we draw the conclusion that the two pipelines can co-exist in a seamless implementation (Figure 4.4(c)).

[Figure 4.4 pipelines: (a) JBC processor pipeline: bytecode fetch, bytecode queue and fold, bytecode translation, local variable manipulation, instruction shelve, execute, write back. (b) RISC core processor pipeline: instruction fetch, instruction queue, instruction decode, register manipulation, instruction shelve, execute, write back. (c) Dual execution Java processor pipeline: bytecode/instruction fetch, bytecode queue and fold/instruction queue, bytecode translation/instruction decode, local variable/register manipulation, instruction shelve, execute, write back.]

Figure 4.4. JBC, RISC, and dual execution processor pipelines.

The first three stages of the dual architecture pipeline (Figure 4.4(c)) call for separate modules for the JBC and the native code streams. Once the folded JBCs are translated into native code, the remaining pipeline stages are unique. Thus, the dynamic JBC translation should be designed to ensure complete decoupling of the two halves of the pipeline. Our dual architecture pipeline allows the execution of both Java and non-Java code on the same core. Figure 4.5 shows a high level view of this arrangement. Although the processor structure has only one microarchitecture, the thin hardware translation module introduces a duality to its ISA. From a programmer's perspective, there are two logical views: (1) a JVM ISA, which is a stack-based machine supporting all Java standard features, e.g., object orientation; (2) a general RISC machine capable of exploiting ILP and containing features in addition to those required by the JVM. It is worth mentioning that we are not proposing a complete elimination of the JVM, but just to keep a thin, fast layer that drives the underlying hardware by providing dynamic linking, bytecode verification, memory management, etc. For the rest of this dissertation, we will focus on the JBC processing path of the pipeline.

[Figure 4.5: a native compiler produces native binaries directly, while a Java compiler produces bytecodes that the thin virtual machine preprocesses and the hardware translation module turns into native binaries.]

Figure 4.5. A dual processor architecture capable of running Java and other HLLs.

4.8 Conclusions

In this chapter, JAFARDD, a novel processor architecture aimed at enhancing Java performance, was introduced. Our processor adopts the OPEX bytecode folding algorithm for folding JBCs. In addition to its flexibility, this algorithm allows the recognition of nested patterns. These two characteristics allow folding as many patterns as possible, which diminishes the need for a stack. Reservation stations are employed for dynamic scheduling. Incorporating RSs, together with the BQM and LVF, is our approach to stack dependency resolution. The work in this chapter has been published in [198, 199, 200, 201]. In Chapter 5, we present and discuss in detail the OPEX bytecode folding algorithm. In Chapter 6, the operational details of the JAFARDD pipeline are examined. The performance of our processor is evaluated in Chapter 7.

Chapter 5

An Operand Extraction (OPEX) Bytecode Folding Algorithm

5.1 Introduction

As discussed before, the JVM is based on a stack architecture. This stack architecture was chosen to facilitate compact code generation, for reasons such as minimizing bandwidth requirements and download times over the Internet, and to be compatible across different platforms [7, 23, 24, 202]. Direct hardware implementation of this architecture inherits all the weaknesses of the stack paradigm, in particular, the introduction of virtual data dependencies between successive instructions. Almost all instructions need to access the top of the operand stack (or stack for short) as a source and/or destination, therefore severely limiting performance and prohibiting any form of ILP, such as superscalarity. To compensate for these stack implementation limitations, Sun Microsystems first realized a smart stack architecture, and a number of enhancements have recently been suggested [92, 95, 97, 203, 204, 205, 206, 207, 208]. A common trait in these proposals is to employ some form of stack operation folding. Existing folding approaches check each JBC instruction with the following one to see if they could be folded together or not (as will be explained later). If they happen to be foldable, the folded instruction becomes a new instruction that is checked for foldability with the succeeding instructions, and so on [92, 95, 97, 203, 204, 205, 206, 207, 208]. Although these suggested folding techniques help improve Java's execution time, they fail to completely overcome the serial nature of the stack operations. In this chapter, we propose a different stack operation folding algorithm specifically designed to overcome the limitations of existing algorithms. This technique is referred to as the operand extraction (OPEX) bytecode folding algorithm. The motivation of the introduced technique is to allow simultaneous multiple execution of JBCs and nested pattern folding. The main concept of this technique is to identify principal operations and subsequently pick their operands out of the Java BQ. Once a group of instructions is identified for folding, instruction issuing related information is stored. A processor core could then make use of this information to issue multiple instructions per clock cycle. Results from a trace-driven benchmark evaluation clearly showed performance gains. Furthermore, this algorithm removes the virtual dependencies imposed by the stack paradigm, thus permitting ILP. This chapter is organized as follows. Bytecode folding is described and existing techniques are briefly reviewed in Section 5.2. The proposed algorithm is presented in detail in Section 5.3. Potential data hazards, along with techniques to detect and resolve them, are discussed in Section 5.4. The operation of the folding unit is included in Section 5.5. Methods for providing further optimizations are highlighted in Section 5.6. Section 5.7 presents related conclusions.

5.2 Folding Java Stack Bytecodes

In a direct JVM stack implementation, most instructions reference an actual data stack [7, 23, 24]. This leads to the following typical scenario as depicted in Figure 5.1(a): certain instructions (producers) push CNs or data from LVs on top of the stack (steps 1 and 2 in the figure). Then, another instruction (e.g., operator) gets one or more items from the stack top (steps 3 and 4), operates on them, and writes the result back to the stack (step 5). Finally, an instruction (consumer) moves the data from the stack top to a LV (step 6). Executing JBCs consumes extra clock cycles to access the stack. Moreover, stack reference creates a bottleneck that adversely affects performance and introduces a form of virtual dependency between successive instructions. Furthermore, processors that do not include an on-chip dedicated S-cache have to access the D-cache (or directly the main memory) for almost every instruction. This consumes more clock cycles especially if cache misses are encountered. In contrast to that, RISC-style processors do not have to pay that price to read from or write to registers as they are always located on-chip.

Example 5.1 The simple arithmetic expressions LV1 = LV2 - LV3 and LV4 = LV5 + LV6, where LVi is LV number i, translate into the JBC sequence shown in the middle column of Figure 5.2.

[Figure 5.1 panels: (a) without folding, (b) with complete folding, and (c) with incomplete folding; each panel shows the local variables, the stack, and the ALU, with numbered steps marking the sequence of events.]

Figure 5.1. Steps in executing stack instructions. Numbers represent sequence of events.

Although there is no true data dependency between the above two statements, the second expression cannot be evaluated concurrently with, or preceding, the first one, as both expressions use the stack as an intermediate target. Operations have to be issued and executed in the sequence shown in the figure, resulting in 8 clock cycles to execute these two expressions. ■

To improve Java performance, Sun Microsystems introduced the notion of bytecode folding [21, 26, 92, 94, 95, 97, 139, 151, 154, 156]. Contiguous instructions that have a true data dependency are grouped together in a compound instruction, called a folding group. A folding group could be executed in one clock cycle using the following scenario, depicted in Figure 5.1(b): an operator retrieves operands directly from the source, instead of the stack (steps 1 and 2 in the figure), operates on them, and finally writes the execution result back to the destination (step 3) without referencing the stack. (The difference between complete and incomplete folding will be clarified later.)

Issuing sequence   Without folding   With folding
1                  iload_2           iload_2, iload_3, isub, istore_1
2                  iload_3           iload 5, iload 6, iadd, istore 4
3                  isub
4                  istore_1
5                  iload 5
6                  iload 6
7                  iadd
8                  istore 4

Figure 5.2. JBC folding example. JBCs (or folding groups) are issued in sequence starting with the topmost line. Assume both unfolded and folded instructions take 1 clock cycle to execute. The middle column shows JBCs issued one at a time (without folding) consuming a total of 8 clock cycles. The last column shows JBCs issued in groups (folded) consuming a total of 2 cycles.

Example 5.2 The third column in Figure 5.2 shows how folding could reduce the number of clock cycles needed to execute the two expressions in Example 5.1. With this optimization, the two expressions take only two clock cycles, provided that sufficient resources (i.e., load/store units, data paths, etc.) are available. ■

Existing folding mechanisms can be classified into two categories: one based on pattern matching and the other based on recursive grouping.

Pattern matching folding units are designed to recognize certain patterns (sequences of instructions) [9, 21, 22, 26, 97, 153, 159]. They try to match a certain number of instructions at the front of the BQ with the deepest predefined pattern that they are designed to detect. (The depth, d, of a folding pattern is defined as the number of folded instructions in the pattern.) Sequences like (producer, producer, operator, consumer), (producer, operator, consumer), (producer, producer, operator), and (operator, consumer) are commonly used patterns. Sun Microsystems' folding algorithm is based on the pattern matching technique [32, 92, 94, 95, 139, 151, 154, 156].

A recursive grouping folding algorithm starts at the BQ front and recursively checks every two consecutive instructions for foldability [203, 204, 205, 206, 207, 208]. If the two checked instructions are foldable, they will be marked, in the BQ, as a single combined instruction. (Two consecutive instructions are said to be foldable if they can be grouped together and issued as one compound instruction.) This compound instruction is then checked with the following unfolded one, and so on recursively, until no folding is possible. For implementation feasibility, folding cannot continue indefinitely; such a technique therefore specifies an upper limit for foldability (n): the maximum number of bytecodes that could be folded at once. (A folding algorithm is described as n-foldable if it is capable of folding up to n stack instructions. Thus, if d is the depth of a pattern generated by a certain folding algorithm, (∀d)(d ≤ n).) Reaching this limit necessitates terminating the foldability check. Chang et al. have proposed a 4-foldable strategy that is based on this technique [203, 204, 205, 206]. Kim et al. have proposed modifications to Chang's technique [207, 208].

Both the pattern matching and recursive grouping approaches result in folding groups of length greater than or equal to one bytecode. (A single Java instruction is issued if it cannot be folded with the following instruction.) After executing these folded instructions, the next instruction is then tested for foldability with the one that follows it, and so on. Folding the JBCs using any of these approaches enhances the performance over the direct execution approach by bypassing the stack and routing the operands directly to/from the ALU from/to the referenced LVs. This saves the clock cycles that would have been taken by the producer and consumer stack instructions that are folded. (Note that these existing approaches do not fold all stack instructions.)
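The recursive grouping idea can be sketched in a few lines; the foldability test below is reduced to a placeholder, whereas a real folding unit would inspect the producer/operator/consumer roles of each pair, and the n-foldable limit of 4 is only an example taken from the strategies cited above.

    // Sketch of recursive grouping: starting at the front of the bytecode queue,
    // keep merging the current compound group with the next bytecode while the
    // pair remains foldable and the n-foldable limit has not been reached.
    final class RecursiveGroupingFolder {
        static final int N_FOLDABLE_LIMIT = 4;   // e.g., a 4-foldable strategy

        int foldGroupLength(String[] queue) {
            int length = 1;                       // a single bytecode is always issuable
            while (length < N_FOLDABLE_LIMIT
                    && length < queue.length
                    && foldable(queue, length)) {
                length++;                         // mark one more bytecode as folded
            }
            return length;                        // bytecodes issued as one compound op
        }

        // Placeholder foldability check: a real unit examines how the compound
        // group and the next bytecode use the operand stack.
        private boolean foldable(String[] queue, int nextIndex) {
            String next = queue[nextIndex];
            return !(next.startsWith("if") || next.startsWith("goto"));
        }
    }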
However, the above two folding approaches suffer from the following shortcomings, which result from the use of simple linear search and/or pattern matching procedures: (1) they do not recognize nested patterns; (2) although the per-issue capacity is enhanced by folding, folding groups have to be issued in order and one group at a time, as the above folding approaches fail to completely resolve the virtual dependency between instructions; and (3) these approaches are not robust enough. Small changes in a foldable pattern, though not affecting the semantics of the instructions, cannot be handled. For example, single or multiple nop's or a wide modifier inserted in a pattern either disrupts the folding unit or requires a complex architecture to handle it.

Example 5.3 The two arithmetic statements in Example 5.1 above are rearranged and nested in the middle column in Figure 5.3 with a nop added. Without folding, executing the JBC sequence takes 9 clock cycles. 5. An Operand Extraction (OPEX) Bytecode Folding Algorithm 104

Issuing Without With sequence folding folding 1 iload_2 iload-2 2 iload_3 i load-3 3 iload 5 iload 5, iload 6, iadd, istore 4 4 iload 6 isub 5 iadd nop 6 istore 4 is tore-1 7 isub 8 nop 9 istore_l

Figure 5.3. Nested pattern folding example. JBCs/folding groups are issued in sequence starting with the topmost line. Assume both unfolded and folded instructions take 1 clock cycle to execute. The middle column shows JBCs, which contain nested folding groups, issued one at a time (without folding) consuming a total of 9 clock cycles. The last colunm shows these JBCs issued in folding groups consuming a total of 6 cycles.

As the two leading load instructions cannot be grouped with the inner pattern, pattern matching techniques fail to find a pattern that contains the four consecutive lo a d 's . Al­ though recursive grouping mechanisms couldform a folding group that starts at ilo a d -2 and ends at is to r e 4, it will issue the two preceding 1 oad ' s separately as they do not have any matching operation or consumer. In both cases and as is shown in the third column o f Figure 5.3, the result is the same: the two leading lo a d 's have to be issued separately thereby destroying the outer pattern and taking extra unnecessary clock cycles. The is u b operation in the third column in Figure 5.3 cannot be easily issued before the preceding ia d d operation, although the two operations are completely independent. Out o f order issuing o f operations that involve manipulating the stack demands accessibility to all stack elements, which complicates the stack design and limits the overall size o f a processor’s possible on-chip S-cache. In most cases, operations have to be issued one at a time and in order. The nop operation in the second column ofFigure 5.3 complicates folding the preced- 5. An Operand Extraction (OPEX) Bytecode Folding Algorithm 105 ing isu b operation with the following is to r e .l. Either a more complex folding unit should be used or more folding patterns with nop ' s located at all possible locations in a pattern, should be recognized by the algorithm. ■

5.3 The OPEX Bytecode Folding Algorithm

In this section we introduce a new and powerful folding algorithm that is free of the short­ comings described in the previous section; it handles nested patterns, allows multiple- issuing and thus OOO execution of foldable patterns per clock cycle, and is more robust compared to the two previously discussed approaches. Our main motivation is to allow multiple-issuing of JVM instructions. We will start with a categorization of JVM instruc­ tions followed by some necessary definitions. Then, we will explain the details of the algorithm.

5.3.1 Classification of JVM Instructions

JVM is characterized by a rich set of instructions that manipulate the stack [6, 7, 8]. We classify these instructions into 12 types according to the way they handle the stack'. Ta­ ble 5.1 shows all those instructions that are foldable by our proposed algorithm. In contrast. Table 5.2 shows all those instructions that are considered unfoldable.

• No-effect (N) an instruction that does nothing. This group includes the nop no-operation, and the wide modifier, which is just a directive that instructs the processing engine to expect longer indices.
• Producer (P) a simple instruction that loads the stack with a CN or a LV value. Operations with arrays, objects, or CP sources are not classified under this category.
• Consumer (C) a simple instruction that removes a datum from the operand stack and stores it into an LV. Operations with array destinations are not classified under this category.

¹For the purpose of discussion in this chapter, we divide the instruction classes in Table 2.5 in Chapter 2 into finer groups. Correspondence between the classes in Table 2.5 and this chapter's categories is as follows: the scalar data transfer class is divided into the no-effect, producer, and consumer categories; the ALU class corresponds to the operator and independent categories; the stack class is divided into the destroyer, duplicator, and swapper categories; the object manipulation class is divided into the load and store categories; the control flow class corresponds to the branch category; and the complex group is the same in both chapters.

Table 5.1. Foldable categories of JVM instructions.

Type          Symbol         Description                        Examples
no-effect     N              has no effect on stack             nop
producer      P(CN x)        loads stack with CN x              iconst_0, bipush
              P(LV x)        loads stack from LV x              iload_1
consumer      C(LV x)        stores stack top into LV x         istore_0
operator      OA             an arithmetic operation            iadd, lsub
              OL             a logic operation                  iand, lor
              ON             a type conversion operation        i2f, f2l, d2l
              OM             a comparison operation             lcmp, fcmpg
independent   I(LV x, CN y)  increments LV x with CN y          iinc
destroyer     D              pops 1 or 2 words from stack top   pop
duplicator    U(1,2)         duplicates 1 or 2 stack entries    dup2
swapper       W              swaps 2 stack entries              swap
load          LN             allocates a new object             new, newarray
              LO             gets an object/class field         getfield
              LC             checks object type                 instanceof, checkcast
              LP             loads from the CP                  ldc, ldc_w
              LA             loads an array element             iaload
store         SA             stores an array element            iastore
              SO             stores an object/class field       putfield
branch        BZ             a compare-with-0 branch            ifeq
              BC             a 2-number-comparison branch       if_icmpeq
              BU             unconditional branch               goto, goto_w
              BJ             subroutine jump                    jsr, jsr_w
              BR             subroutine return                  ret

• Operator (O) ALU instructions that operate on stack data: they remove data from the top of the stack, perform an ALU operation on them, and store the result back on the stack. Type conversion and comparison operations are also included in this group.
• Independent (I) the iinc instruction, which increments an LV without affecting the stack.
• Destroyer (D) instructions that remove some stack entries.
• Duplicator (U) instructions that duplicate stack entries by inserting copies at specific locations.

• Swapper (W) instructions that alter the order of data on the stack by swapping their locations, without changing the data.
• Load (L) instructions that load the operand stack from the main memory (or the D-cache). This includes data retrieval from objects, classes, arrays, or the CP. These load instructions pop object references, array indices, and/or CP indices from the operand stack and push back the referenced element. This category also includes memory allocation for new objects and type checking for existing ones; these instructions pop CP indices from the stack and push back object references. In JAFARDD, instructions in this category trap to OS routines that perform the required instruction in software.
• Store (S) instructions that remove data from the operand stack and store them in the main memory (or the D-cache). Operations that store data into objects, classes, or arrays are classified under this category. Store instructions pop object references and/or array indices, together with the value to be stored, from the operand stack and push nothing back. Instructions in this category trap to OS routines that emulate the required instruction in software.
• Branch (B) instructions that branch to a specific target address either conditionally or unconditionally. Conditional branches jump based on comparing the top element of the stack with zero or with the next element down the stack. Unconditional branches include instructions that jump to a target address without any condition (e.g., goto) and subroutine jump and return (e.g., jsr and ret). Branches that are stack-independent (e.g., goto and ret) could be classified under the independent category, but we prefer to keep them here.
• Complex (X) a complex instruction that is considered unfoldable by our algorithm. Such an instruction could be a multidimensional array creation, table jump, method invocation and return, exception throw, or synchronization operation. Table 5.2 summarizes the folding obstacle for each category. Instructions in this group may or may not write back their result on the operand stack. It is more efficient, from a hardware realization point of view, to emulate these in software by trapping them to an OS kernel. Reserved opcodes are also considered unfoldable for obvious reasons.

Table 5.2. Unfoldable JVM instructions.

Category                          Symbol   Opcodes                           Reasons
Multidimensional array creation   Ma       multianewarray                    variable operand count (in the range 1-255)
Table jump                        Mt       lookupswitch, tableswitch         variable argument count (unlimited)
Method invocation                 Mv       invokeinterface, invokespecial,   variable operand count (in the range 0-255)
                                           invokestatic, invokevirtual
Method return                     Mr       ireturn, lreturn, freturn,        discards the current stack
                                           dreturn, areturn, return
Exception throw                   Me       athrow                            discards the current stack
Synchronization                   Mm       monitorenter, monitorexit         requires OS support
Reserved opcodes                  Ms       breakpoint, impdep1, impdep2      not implemented
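For readers who prefer code to tables, the classification can be summarized in a few lines. The following Java sketch is ours and purely illustrative (the enum name, the comments, and the helper method are not part of the JAFARDD design); it simply lists the twelve categories and marks the complex one as unfoldable.

public enum FoldingCategory {
    NO_EFFECT,    // e.g., nop, wide
    PRODUCER,     // e.g., iconst_0, iload_1
    CONSUMER,     // e.g., istore_0
    OPERATOR,     // e.g., iadd, lcmp, i2f
    INDEPENDENT,  // iinc
    DESTROYER,    // pop, pop2
    DUPLICATOR,   // dup, dup2, ...
    SWAPPER,      // swap
    LOAD,         // getfield, iaload, ldc, new, ...
    STORE,        // putfield, iastore, ...
    BRANCH,       // ifeq, goto, jsr, ret, ...
    COMPLEX;      // invokevirtual, athrow, ... (unfoldable)

    /** True for the categories that the OPEX algorithm may place in a folding group. */
    public boolean isFoldable() {
        return this != COMPLEX;
    }
}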

5.3.2 Anchor Instructions

In the OPEX bytecode folding algorithm, we introduce the notion of anchor instructions. An anchor instruction is the principal instruction in a folding group that can alter the value or the order of the stack contents. Anchor instructions do this alteration by processing rather than by data transfer to/from LVs. An anchor could be an operator, a swapper, a duplicator, a destroyer, a load, a store, or a branch instruction. The anchor instruction together with the necessary producer(s) and consumer form a folding group. Producer instructions cannot be anchors. Consumers could be anchors in folding groups that contain only producers without any other type of anchor instruction. Each folding group needs one and only one anchor instruction. Complex bytecodes are not folded and require emulation. Table 5.3 shows, among other information, the equivalent notation for each anchor type and the estimated execution cycles. (Refer to Appendix A for the folding notation.) For the clock cycles per instruction (CPI), we use numbers similar to those mentioned in [204] for comparison purposes.

Table 5.3. Information related to each foldable JVM instruction category.

Type           Percentage in     Estimated   Effect on      Anchor
               benchmarks†       CPI         stack size‡    notation
no-effect      0.53              1           ↔              n/a
producer       36.67             1           ↑              n/a
consumer       7.24              1           ↓              A1,0 ★
operator       17.12             1           ↕              A[1,2],1
independent    2.23              1           ↔              A0,0
destroyer      0.07              1           ↓              A[1,2],0
duplicator     4.08              1           ↑              A[1,2],0
swapper        0.00              1           ↔              A2,0
load           14.55             5           ↕              A[0,2],1
store          2.01              5           ↓              A[1,3],0
branch         8.21              1           ↓              A[1,2],0 or A0,1 ◇
complex        7.29              10          n/a            n/a

• † These figures were generated using the SPECjvm98 benchmark suite, as will be explained in Chapter 7.
• ‡ Effect on stack size: ↑, ↓, ↕, and ↔ mean push, pop, push and pop, and no effect, respectively.
• ★ Consumers could be anchors in folding groups that contain only producers without any other type of anchor instruction.
• ◇ Branch instructions that have a consumer do not need producers and vice versa.

By observing the supported anchors in Table 5.3, we can see that they demand up to three producers. A zero producer-count anchor is included, which is the case of loading from the CP when the index is supplied as an argument to the JVM instruction. Furthermore, the three-producer case is included to cover store instructions that have an array as a destination. The latter case mandates a three-read-port LVF in any design that implements this algorithm. Also, some of the complex instructions require a producer count that has no upper limit (e.g., table jump operands). On the other hand, all anchors generate zero or one data item; that is, they all require at most one consumer.
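The anchor notation can be read directly as a pair of producer bounds plus a consumer count. The following Java record is a minimal, hypothetical model of that notation (the name, validation limits, and example constants are ours, mirroring Table 5.3); it is not part of the hardware description.

public record AnchorSignature(int minProducers, int maxProducers, int consumers) {

    public AnchorSignature {
        // Per Table 5.3, anchors demand between zero and three producers ...
        if (minProducers < 0 || maxProducers > 3 || minProducers > maxProducers)
            throw new IllegalArgumentException("producer count must lie in [0, 3]");
        // ... and generate zero or one data item, i.e. need at most one consumer.
        if (consumers < 0 || consumers > 1)
            throw new IllegalArgumentException("anchors need zero or one consumer");
    }

    // Examples taken from Table 5.3.
    public static final AnchorSignature OPERATOR = new AnchorSignature(1, 2, 1); // A[1,2],1
    public static final AnchorSignature LOAD     = new AnchorSignature(0, 2, 1); // A[0,2],1
    public static final AnchorSignature STORE    = new AnchorSignature(1, 3, 0); // A[1,3],0
}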

5.3.3 Basics of the Algorithm

The OPEX bytecode folding algorithm works according to the following scenario (see Figure 4.3 for the referenced architectural modules). Snooping on the I-cache bus, the FIG unit searches for the first anchor instruction and marks it as a reference point; then, scanning backward in the bytecodes read from the I-cache, the producers needed for the anchor instruction are identified. (The properties of JBCs guarantee the existence of the necessary producers [1].) If the anchor instruction requires a consumer, the algorithm scans forward in the bytecodes read from the I-cache to see if one exists immediately following the anchor instruction. If it is found, a window that starts at the first producer and ends at the consumer is marked as a complete folding group (CFG). If the consumer expected to follow the anchor immediately is not there, a folding group starting from the first producer up to the anchor only is marked as an incomplete folding group (IFG). The starting and ending locations of the folding groups in the BQ, together with other information (e.g., the type of the anchor and its position within the folding group, information about the folding pattern, the type of operation to be performed by the EX, etc.), are added to the FIQ for later use. After that, the algorithm repeats itself, looking for the next anchor instruction (in the JBCs read from the I-cache) and picking its producer(s) and consumer, if necessary, and so on. Later, in the execution stages, the FT translates all instructions within a folding group into a single RISC-like instruction. The number of clock cycles necessary to execute the folding group is equal to that needed for the anchor instruction alone in the case of issuing the bytecodes without folding. As a result, the cycles that would have been taken by the producers and consumers in the case of unfolded execution are completely eliminated. Instructions generated from this dynamic translation process could be executed in order, OOO, or concurrently with other instructions. Thus, it is an advantage of our algorithm to permit OOO execution of JBCs. Section 5.5 explains in detail the operation of the FIG unit and shows, using a state diagram and pseudocode, how the OPEX bytecode folding algorithm is incorporated in it.
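To make the scenario above concrete, the following Java fragment sketches a software rendering of one scan pass over a bytecode window. It is a deliberately simplified illustration under our own assumptions (coarse categories, a fixed producers-needed count per anchor, no nesting, no tagging, and no hazard handling); the real FIG unit is a hardware state machine, described in Section 5.5.

import java.util.ArrayList;
import java.util.List;

final class OpexScanSketch {

    /** Coarse bytecode categories for this sketch: producer, consumer, anchor, other. */
    enum Cat { P, C, A, OTHER }

    /** A folding group: window indices [start, end]; complete (CFG) or incomplete (IFG). */
    record Group(int start, int end, boolean complete) {}

    /**
     * Marks folding groups in a bytecode window. producersNeeded[i] gives the
     * number of producers demanded by the anchor at position i (0 to 3).
     */
    static List<Group> scan(Cat[] window, int[] producersNeeded) {
        List<Group> groups = new ArrayList<>();
        for (int i = 0; i < window.length; i++) {
            if (window[i] != Cat.A) continue;          // 1. locate the next anchor
            int start = i, need = producersNeeded[i];
            while (need > 0 && start > 0 && window[start - 1] == Cat.P) {
                start--;                               // 2. scan backward for producers
                need--;
            }
            boolean cfg = i + 1 < window.length && window[i + 1] == Cat.C;
            int end = cfg ? i + 1 : i;                 // 3. scan forward for a consumer
            groups.add(new Group(start, end, cfg));    // CFG if found, otherwise IFG
            i = end;                                   // 4. repeat after this group
        }
        return groups;
    }
}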

Example 5.4 Figure 5.4 shows the OPEX bytecode folding algorithm at work. Stage (a) identifies the anchor instruction iadd. Stage (b) extracts the necessary operands, whereas stage (c) picks the consumer (if necessary). Once a group is found, it is marked for folding as in stage (d). The example shows how to handle two nested patterns, as in stage (e), where the outer pattern (iload_0, istore 4) forms a folding group. Furthermore, the two folded groups could be issued simultaneously or OOO. Also, observe that the existence of nop does not disrupt the folding process, which shows the generality and flexibility of the algorithm. ■

Figure 5.4. An example showing the OPEX bytecode folding algorithm at work. Bytecodes are executed starting at the top. (a) Scan for the first anchor instruction. (b) Scan backward and pick the needed operands. The top horizontal line marks the beginning of a folding group. (c) Scan forward for the consumer. The bottom horizontal line marks the end of a folding group. (d) Mark a folding group as a compound instruction. (e) Issue the folding group and remove it from the BQ. A new folding group is identified, and so on. (The bytecode sequence shown is iconst_1, iload_0, iconst_2, nop, iload_1, iadd, istore_3, istore 4, iload_2.)

We can observe from Example 5.4 that the producer iload_0, folded with the anchor istore 4, physically exists in the BQ before the other folding group that was recognized earlier. Thus, our algorithm allows folding of nested patterns easily, which is another advantage over existing algorithms. However, folding nested patterns might result in additional data hazards! Section 5.4 clarifies the situations that might lead to these hazards and shows how the algorithm can be modified slightly to handle them.

5.3.4 Tagging

As explained above, folding groups can either be complete or incomplete. CFGs are self-contained; they include the consumers that are necessary for their anchor instructions and do not need to access the stack at all (Figure 5.1(b)). On the other hand, IFGs need consumers that are not present in the BQ. Thus, if the instructions within a window recognized here as an IFG were executed individually, one at a time, in a processor that does not employ any kind of folding, the final result would have been left on the stack (Figure 5.1(c)).

IFGs are made complete by using tagged consumers (T_LVt) and tagged producers (Q_LVt), where T denotes the tagged consumer, Q denotes the tagged producer, and LVt is the tagged LV index. Tagged consumers are artificial consumers attached by the OPEX bytecode folding algorithm, at runtime and at the hardware level, to an IFG to make it complete. When this folding group is removed from the BQ, it is replaced in the BQ with a tagged producer. This tagged producer will then be used as a producer in a following folding group. (CFGs are removed from the BQ without any replacement.) Tagged producers and consumers reference tagged local variables that are temporarily allocated, at runtime and at the hardware level, from the LVF to hold intermediate results transferred between folding groups. This requires the processor design to provide additional on-chip LVs that are invisible to the JVM. Tagged LVs play an important role in decoupling the virtual dependency between folding groups. The result produced by an IFG goes to a tagged LV (instead of the stack top), which in turn serves as a data source for succeeding tagged producers. It is worth observing that both the tagged consumer which is attached to an IFG and the corresponding tagged producer which replaces the same IFG reference the same tagged LV. A tagged LV stays marked as being in use until it is no longer referenced by any tagged producer in the BQ. From the implementation point of view, we selected two of the unused JVM opcodes and renamed them tload and tstore to act as tagged producer and consumer, respectively. (The internal implementation of the BQM ensures that tagged instructions take one byte each, as will be explained in Chapter 6.)

Example 5.5 Figure 5.5 illustrates the process of tagging as used in the OPEX bytecode folding algorithm. In this figure, the folding group that contains the anchor isub is a CFG, whereas the group that contains iadd is an incomplete one. By inserting a tagged consumer (as in part b) and a tagged producer (as in parts c and d), the iadd instruction group is made complete. ■

It is worth mentioning that introducing tags demands the use of an advanced instruction scheduling technique like reservation stations, where the operand source must be identified within the execution engine [42, 91, 197]. This will be fully explained in Chapter 6.
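The following Java sketch illustrates the tagging idea only; the pool size, class, and method names are ours, and the real mechanism operates on hardware tag entries rather than strings. An incomplete group is closed with a tstore to a freshly allocated tagged LV, and a matching tload is left behind in the queue.

import java.util.ArrayDeque;
import java.util.Deque;

final class TaggedLvPoolSketch {

    private final Deque<Integer> freeTags = new ArrayDeque<>();

    TaggedLvPoolSketch(int invisibleLvCount) {
        for (int t = 0; t < invisibleLvCount; t++) freeTags.push(t);
    }

    /** Completes an IFG: appends "tstore t" to the issued group and returns the tag t. */
    int completeIncompleteGroup(StringBuilder issuedGroup) {
        int tag = freeTags.pop();                 // allocate a JVM-invisible, tagged LV
        issuedGroup.append(" tstore ").append(tag);
        return tag;
    }

    /** The bytecode that replaces the removed IFG in the BQ. */
    String replacementProducer(int tag) {
        return "tload " + tag;                    // later folded like any other producer
    }

    /** Called once no tagged producer in the BQ references the tag any more. */
    void release(int tag) {
        freeTags.push(tag);
    }
}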

5.3.5 Model Details

Figure 5.5. Making incomplete folding groups complete by tagging. JBCs are executed starting at the top. (a) The BQ before folding. (b) A folding group with a tagged consumer (LV7 is a tagged LV). (c) The BQ after issuing the first folding group. (d) Another folding group with a tagged producer.

Careful and extensive examination of different possible bytecode sequences revealed five different templates that can accommodate all possible folding patterns. These patterns are used by the OPEX bytecode folding algorithm in the FIG. Table 5.4 shows how these five different cases are issued and what the BQ looks like after issuing them:

• Case 1 A CFG is identified for folding. An anchor finds the needed consumer in the BQ. Both are removed from the BQ along with the needed producers. This case covers operator, load, and store anchors.
• Case 2 IFGs are identified for folding. An anchor cannot find the needed consumer in the BQ. The FIG unit realizes that this is an IFG when it encounters a producer or a subsequent anchor. The anchor, together with the necessary operands, is issued with an augmented tagged consumer. In the BQ, the anchor and its producers are replaced with a tagged producer. This case covers operator, load, and store anchors.
• Case 3 The FIG unit encounters a consumer before any other anchor. Thus, it considers this consumer as an anchor to form a CFG. The consumer together with its required producer are removed from the BQ.
• Case 4 An anchor that does not need a consumer always forms a CFG. In this case, the anchor together with its required producers are removed from the BQ. This case covers destroyer, swapper, duplicator, and some branch anchors.
• Case 5 Independent and some branch anchors form CFGs by themselves.

Table 5.4. Different folding pattern templates recognized by the folding information generation unit (FIG). For each template (cases 1a, 1b, 2a, 2b, 2c, 2d, 3, 4, and 5), the table shows the BQ before folding, the issued folding group, the BQ after folding, and an example bytecode sequence.

In this algorithm, complex instructions are issued individually without being attached to any folding group. Future extensions of the algorithm could consider the possibility of forming folding groups of these complex instructions.

5.4 Algorithm Hazards

In the OPEX bytecode folding algorithm, folding nested patterns results in OOO issuing of JBCs. For example, in Figure 5.4(d), the instructions iconst_2 and iload_1 are issued before iconst_1 and iload_0. Hence, for correctness, the algorithm must check for dependency violations. Such dependencies are often called data hazards. There are three types of data hazards: Read After Write (RAW), Write After Read (WAR), and Write After Write (WAW) [42, 69, 209]. (The read and write operations refer to LV accesses.) In this section, we investigate whether the OPEX bytecode folding algorithm introduces any data hazards, how to detect them, and how they can be resolved. This section is concerned only with hazards that are caused by the OPEX bytecode folding algorithm. Regular RISC pipeline hazards could still occur between folding groups if they are executed OOO. Any pipelined and/or superscalar core designed to support our folding technique must be able to handle these usual hazards with standard techniques. One necessary, but not sufficient, condition for RAW and WAW hazards to occur is OOO issuing of write instructions (with respect to subsequent write or read instructions in the original program sequence). In the JVM, writing to LVs is done by consumers (C) and reading from them is done by producers (P). Since in our algorithm consumers are always extracted for folding and issued in order with respect to subsequent consumers and producers, RAW or WAW hazards cannot be introduced by our algorithm. However, WAR hazards could occur! A consumer could be issued before a preceding producer, and if both instructions target the same LV, this LV will be written by the consumer before being read by the producer. Formulated differently, a WAR hazard occurs if a consumer instruction attempts to write to an LV that is to be read by a pending producer instruction. As this contradicts the semantics of the instruction sequence, the program being executed will produce the wrong result.

Example 5.6 Consider the bytecode sequence shown in Figure 5.6(a), with ? and X denoting unknown and don't care bytecode instructions, respectively. To demonstrate OOO issuing, assume that two nested folding groups can be recognized from the shown bytecode sequence. The inner folding group (iload_1, istore_0) in lines 2 and 3 will be issued first, whereas the outer group (?, X) in lines 1 and 4 will be issued next. This means that the bytecode in line 1 will be issued OOO with respect to the bytecodes in lines 2 and 3. To check for dependency violations, consider the following three cases:

1. A WAW occurs between the two groups if ? is an istore_0, as it will write to LV0 after the istore_0 in line 3 writes to it (Figure 5.6(b)). However, ? cannot be a consumer; otherwise it would have been folded with a preceding producer or a preceding anchor and its needed producer(s), according to our algorithm. Thus, a WAW cannot occur between the two groups.
2. A RAW occurs between the two groups if ? is an istore_1, as it will write to LV1 after the iload_1 in line 2 reads from it (Figure 5.6(c)). However, ? cannot be a consumer, as explained above. Thus, a RAW cannot occur between the two groups.

3. A WAR occurs between the two groups if ? is an iload_0, as it will read from LV0 after the istore_0 in line 3 writes to it (Figure 5.6(d)). This could occur according to our algorithm! Therefore, the OPEX bytecode folding algorithm could suffer from a WAR hazard.

1   ?          istore_0   istore_1   iload_0
2   iload_1    iload_1    iload_1    iload_1
3   istore_0   istore_0   istore_0   istore_0
4   X          X          X          X
    (a)        (b)        (c)        (d)

Figure 5.6. Checking the OPEX bytecode folding algorithm for dependency violations. JBCs are executed starting at the top. (a) A sample bytecode sequence with two recognized nested folding groups. ? and X denote unknown and don't care bytecode instructions, respectively. (b) ? is assumed to be an istore_0 to check if a WAW could occur. (c) ? is assumed to be an istore_1 to check if a RAW could occur. (d) ? is assumed to be an iload_0 to check if a WAR could occur.

When issuing an IFG, a tagged consumer instruction is attached. The tagged LV allocated to this consumer is a free LV that is not currently being used by other producers or consumers. Therefore, the tagged LV associated with an IFG cannot be the source of a WAR hazard. Hence, hazards can only occur with CFGs. Notice that the independent folding category, i.e., iinc, could cause a hazard if it updates the value stored in an LV before a pending producer reads this LV. The OPEX bytecode folding algorithm first detects hazards of this kind and then resolves them efficiently.

5.4.1 Hazard Detection

Hazard detection can be implemented in the FIG by scanning the read but not yet folded JBCs and checking whether there exists any producer whose source is the same as the destination of the consumer in the current folding group. Formally, the necessary condition for a WAR hazard can be stated as follows:

Let the first bytecode in the recognized folding group be located at position n + 1 (n + 1 < BQsize) in the BQ and the folded consumer instruction be C_LVj. A WAR hazard occurs if ∃ P_LVj at position i (0 ≤ i ≤ n) in the BQ.

Example 5.7 Figure 5.7(a) shows a bytecode sequence that will suffer from a WAR hazard if the original OPEX bytecode folding algorithm is applied. Figure 5.7(b) applies the above detection condition to check the existence of a WAR hazard. The > symbols mark a detected hazard. ■

1   iload_0    > iload_0    iload_0, tstore 7                    tload 7
2   iload_3      iload_3    iload_3, iconst_2, iadd, istore_0    iload_1
3   iconst_2     iconst_2
4   iadd         iadd
5   istore_0     istore_0
6   iload_1      iload_1
    (a)          (b)        (c)                                  (d)

Figure 5.7. WAR hazard detection and resolution. JBCs are executed starting at the top. (a) A sample code that will suffer from a WAR hazard if our folding algorithm is applied. (b) The box contains an identified folding group. The > symbols mark a detected hazard. (c) Issued folding groups in sequence (LV7 is a tagged LV). (d) The BQ after issuing the folding group.

5.4.2 Resolution by Local Variable Renaming

This potential hazard can be resolved by LV renaming. Three steps are taken in sequence: (1) a folding group of the form (P_LVi, tstore LVt), where LVi is the cause of contention and LVt is a new tagged LV allocated at runtime, is issued; (2) every P_LVi existing in the BQ before the folding group that contains the contending consumer instruction is replaced with tload LVt. This ensures that the source value is moved to a safe place and will provide the correct value when read later; and (3) the folding group that contains the contending consumer instruction is issued.

Example 5.8 Figures 5.7(c) and (d) show how to resolve the hazard detected in Example 5.7 by LV renaming. ■
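The detection condition and the three renaming steps can be paraphrased in a few lines of Java. The sketch below is ours and only illustrative (a "pending" bytecode stands for one that has been read but not yet folded); in JAFARDD the check and the rename are performed in hardware by the FIG, as shown in Section 5.5.

import java.util.List;

final class WarHazardSketch {

    record Pending(boolean isProducer, int lvIndex) {}

    /** WAR hazard: some pending producer still has to read the LV the folded consumer writes. */
    static boolean warHazard(List<Pending> pendingBeforeGroup, int consumerLv) {
        return pendingBeforeGroup.stream()
                .anyMatch(p -> p.isProducer() && p.lvIndex() == consumerLv);
    }

    /**
     * Resolution by renaming: first issue (producer of lv, tstore tag), then rewrite
     * every pending producer of lv in the BQ into "tload tag", and only then issue
     * the folding group that contains the contending consumer.
     */
    static String renameGroup(int lv, int tag) {
        return "iload_" + lv + " tstore " + tag;
    }
}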

5.5 Operation of the FIG Unit

In this section, we explain the detailed operation of the FIG unit using a state diagram and an algorithm. As explained in Chapter 4, the FIG unit scans the bus between the I-cache and the BQM looking for folding groups. This bus is made wide enough to accommodate several bytecodes, thus providing the capacity for folding more than one group. When a folding group is recognized, information about its location in the BQ is added to the FIQ for later use by the BQM in bytecode folding and issuing. Note that the FIG unit itself does not actually fold the bytecodes; it only generates information for the BQM to locate folding groups within the BQ. To be able to generate information about new folding groups without actually folding previously recognized groups or having access to the BQ itself, the FIG unit simply keeps internal state information that simulates bytecode folding and issuing. It is worth mentioning that the OPEX bytecode folding algorithm does not check for bytecode errors. It is assumed that the JBC compiler produces well-structured bytecodes, e.g., the guaranteed existence of the necessary operands [1]. Bytecode-related error checking is done at the software level by a bytecode verifier.

5.5.1 State Diagram

The operation of the FIG unit is summarized in the state diagram shown in Figure 5.8. States represent bytecodes read by the algorithm but not yet folded. Transitions between states are made in response to reading a new opcode (input). At that moment, if a new folding group is recognized, its template is associated with the transition (output). In the diagram, transition labels have the format: read opcode folding category / recognized folding group template (if any). The five recognized folding templates correspond to the five cases detailed in Subsection 5.3.5 and summarized in Table 5.4.

5.5.2 Algorithm

The pseudocode for the operation of the FIG unit is given in Algorithm 5.1. The algorithm core is represented by the procedure main. It is invoked each time a new group of JBCs is read from the I-cache (queueEnable is asserted) or a folding group is issued (dequeueEnable is asserted). The algorithm requires and manipulates the following data structures:

Figure 5.8. State diagram for the operation of the FIG unit. States represent the folding categories of scanned, but not yet folded, opcodes. Transition labels have the format: read opcode folding category / recognized folding group template (if any). Symbols are used in a generic way to represent opcode folding categories, e.g., a transition labeled A0,0/A0,0 indicates that an input opcode of folding category A0,0 results in recognizing a folding template that includes an opcode of the same category.

• opcodeDB an array of information about each JVM opcode, including its folding category, folding subcategory (whether the opcode references an LV, a CN, etc.), anchor notation, corresponding instruction length, operand count, and the index of the referenced LV ("-1" indicates that no actual LV is referenced).
• FIQ a queue of information about folding groups, including their starting bytecode index in the BQ and bytecode count. (In an actual implementation of the algorithm, other information could also be stored in this queue, e.g., the type of the anchor and its position within the folding group, the folding pattern, the type of operation to be performed by the execution unit, etc.) The FIQ has two operations associated with it, removeEntryAtHead() and addEntry(startBytecode, bytecodeCount).
• auxBQ an auxiliary bytecode queue that maintains opcode information (e.g., folding category, instruction length, required number of operands, and whether a consumer is needed or not) about read but not yet folded bytecodes. This information is obtained initially from the opcodeDB and modified later by the algorithm (if needed).

Once information about a folding group is added to the FIQ, entries pertaining to it are removed from this queue. The two auxBQ pointers, auxBQstartPointer and auxBQcurrentPointer, keep track of the start of the current folding group and the last read opcode in the queue, respectively. As shown in Figure 5.9, these two pointers are mapped to the BQ using another two pointers, BQstartPointer and BQcurrentPointer. Allowable operations on the auxBQ are addEntry(opcodeDB[opcode]), deleteEntries(startingEntry, noOfEntries), duplicateEntries(startingEntry, noOfEntries, duplicateOpcode), and swapEntries(startingEntry, noOfEntries). All these operations update the auxBQcurrentPointer.

Figure 5.9. Parameters and pointers relevant to the bytecode queue and the auxiliary bytecode queue as maintained by the FIG unit (instruction, operand, anchor, consumer, and extra counts; auxBQstartPointer and auxBQcurrentPointer in the auxBQ; BQrenamePointer, BQstartPointer, and BQcurrentPointer in the BQ).
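As a reading aid for Algorithm 5.1, the sketch below restates the three data structures in Java. The field sets are abbreviated and the names follow the pseudocode; the actual entries carry more fields (folding pattern, anchor position, EX operation type, etc.), so this is an assumption-laden outline rather than the real layout.

import java.util.ArrayDeque;
import java.util.Deque;

final class FigDataStructuresSketch {

    /** One opcodeDB entry: static information about a JVM opcode. */
    record OpcodeInfo(String category, String subcategory, int length,
                      int operandCount, int lvIndex /* -1 if none */) {}

    /** One FIQ entry: where a recognized folding group lives in the BQ. */
    record FoldingInfo(int startBytecode, int bytecodeCount) {}

    /** One auxBQ entry: a read-but-not-yet-folded bytecode, as tracked by the FIG. */
    static final class AuxBqEntry {
        String category;
        int length;
        int lvIndex = -1;           // set to -1 once the LV has been renamed
    }

    final Deque<FoldingInfo> fiq = new ArrayDeque<>();   // consumed later by the BQM
    final Deque<AuxBqEntry> auxBq = new ArrayDeque<>();  // internal FIG bookkeeping
}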

With the assertion of queueEnable, the bytecodes read from the I-cache and their count are passed to the main procedure in the parameters cacheBus and fetchLength, respectively. Scanning the cacheBus using the inputPointer, main performs the following steps in sequence for each read opcode:

1. Insert the opcode information, including any explicit LV reference, in the auxBQ.
2. Call the stateTransition procedure that realizes the state machine in Figure 5.8. (We are not including the code of this procedure as it is a literal realization of the diagram.) If a folding group is recognized, stateTransition returns TRUE and the algorithm continues to the following step. Otherwise, the algorithm jumps to step 8.

3. Set some parameters according to the template of the folding group, as shown in Table 5.5. The starting bytecode and the size of the folding group can be determined from these parameters. The meanings of these parameters are shown, together with other relevant queue pointers, in Figure 5.9.
4. Set the folding group's instruction count and its starting location in the auxBQ.
5. If stateTransition indicates that a hazard is possible (by setting the specialInstruction parameter to hazard), call WARprocess to check the read but not yet folded bytecodes and to look for possible contention between source LVs and the current group's destination LV. If such LVs are found, a rename entry is added to the FIQ with a pointer (BQrenamePointer in Figure 5.9) to the contending producer and a zero folding group size, indicating a non-folding, renaming operation. The renamed LV index is set to "-1" in the auxBQ to avoid future renames.
6. Call foldingInformationGeneration to determine the folding group's potential starting bytecode and size in the BQ using the lengths of the JVM instructions in the folding group. This information is then added to the FIQ.
7. Call adjustQueuePointers to simulate the effect of issuing the folding group and removing it from the BQ by removing its corresponding entries in the auxBQ and updating the BQcurrentPointer. There are three exceptions to this: (1) if the folding group contains a swapper anchor instruction, only the opcode is deleted and the auxBQ is requested to swap the two operands; (2) if the folding group contains a duplicator anchor instruction, only the opcode is deleted and the auxBQ is requested to duplicate the operand according to the duplication opcode; and (3) if the folding group is incomplete, a tagged producer (represented by the artificial opcode taggedProducer) replaces it in the auxBQ. (Tagged consumers are attached to IFGs at the BQM when they are issued.)
8. Advance the inputPointer.

With the assertion of the dequeueEnable when a folding group is issued, the head of the FIQ is removed. This operation can take place concurrently with analyzing read JBCs.

Algorithm 5.1 Pseudocode for the operation of the FIG unit.

/* Global variables (maintained between procedure calls) */
opcodeDB: opcodeInformationArray = initializeOpcodeDB();

Table 5.5. Information generated by the state machine about recognized folding templates.

Folding parameter      Description
operandCount           number of operands needed
anchorCount            whether an anchor is included
consumerCount          whether a consumer is included
extraCount             number of opcodes that are scanned but are not part of the folding group
specialInstruction     special processing needed by the folding group: hazard, tag, duplicate, swap, or none

For each of the five folding templates, the table gives the value (or range of values) taken by each parameter.

FIQ: foldingInformationArray;
auxBQ: opcodeInformationArray;
auxBQstartPointer, auxBQcurrentPointer: integer = 0;
BQstartPointer, BQcurrentPointer: integer = 0;
operandCount, anchorCount, consumerCount: integer;
extraCount, instructionCount, bytecodeCount: integer;
specialInstruction: string;

procedure main(queueEnable: boolean, dequeueEnable: boolean,
               cacheBus: byte[], fetchLength: integer)
  inputPointer: integer = 0;
  opcode, instructionLength: integer;
begin
  if (queueEnable == TRUE) then begin
    while (inputPointer < fetchLength) do begin
      opcode = cacheBus[inputPointer];
      auxBQ.addEntry(opcodeDB[opcode]);
      instructionLength = opcodeDB[opcode].length;
      BQcurrentPointer = BQcurrentPointer + instructionLength;
      if ((instructionLength > 1) and (opcodeDB[opcode].subcategory == "lv")) then begin
        auxBQ[auxBQcurrentPointer - 1].LVindex = cacheBus[inputPointer + 1];
      end if;

      if (stateTransition() == TRUE) then begin
        instructionCount = operandCount + anchorCount + consumerCount;
        auxBQstartPointer = auxBQcurrentPointer - extraCount - instructionCount;
        if (specialInstruction == "hazard") then begin
          WARprocess();
        end if;
        foldingInformationGeneration();
        adjustQueuePointers();
      end if;
      inputPointer = inputPointer + instructionLength;
    end while;
  end if;
  if (dequeueEnable == TRUE) then begin
    FIQ.removeEntryAtHead();
  end if;
end main;

procedure WARprocess()
  BQrenamePointer: integer = 0;
begin
  for i = 0 to auxBQstartPointer - 1 do begin
    if (auxBQ[i].LVindex == auxBQ[auxBQcurrentPointer - extraCount].LVindex) then begin
      FIQ.addEntry(BQrenamePointer, 0);
      auxBQ[i].LVindex = -1;
    end if;
    BQrenamePointer = BQrenamePointer + auxBQ[i].length;
  end for;
end WARprocess;

procedure foldingInformationGeneration()
begin
  bytecodeCount = 0;
  for i = 0 to instructionCount - 1 do begin
    bytecodeCount = bytecodeCount + auxBQ[auxBQstartPointer + i].length;
  end for;
  BQstartPointer = BQcurrentPointer - bytecodeCount;
  if (extraCount == 1) then begin
    BQstartPointer = BQstartPointer - auxBQ[auxBQcurrentPointer - 1].length;
  end if;
  FIQ.addEntry(BQstartPointer, bytecodeCount);
end foldingInformationGeneration;

procedure adjustQueuePointers()
begin
  if (specialInstruction == "swap") then begin
    auxBQ.swapEntries(auxBQstartPointer, 1);
    auxBQ.deleteEntries(auxBQcurrentPointer - extraCount, 1);
    BQcurrentPointer = BQcurrentPointer - 1;
  end if;
  else if (specialInstruction == "duplicate") then begin
    auxBQ.duplicateEntries(auxBQstartPointer, instructionCount - 1,
                           auxBQ[auxBQcurrentPointer - extraCount].opcode);
    auxBQ.deleteEntries(auxBQcurrentPointer - extraCount, 1);
    BQcurrentPointer = BQcurrentPointer + bytecodeCount - 2;
  end if;

  else if (specialInstruction == "tag") then begin
    auxBQ.addEntry(opcodeDB[taggedProducer]);
    auxBQ.deleteEntries(auxBQstartPointer, instructionCount);
    BQcurrentPointer = BQcurrentPointer - bytecodeCount + 2;
  end if;
  else begin
    auxBQ.deleteEntries(auxBQstartPointer, instructionCount);
    BQcurrentPointer = BQcurrentPointer - bytecodeCount;
  end else;
end adjustQueuePointers;

It is worth observing that as the algorithm is modeled as a finite state machine, its time complexity is linear. Additionally, verifying the coverage of the algorithm was performed by inspecting all out-going arcs in Figure 5.8.

5.6 Special Pattern Optimizations

The OPEX bytecode folding algorithm is further enhanced by combining some destroyer, swapper, or duplicator anchor instructions and issuing them simultaneously. Combining anchor instructions may result in eliminating the effect of one or both of them or in producing a new anchor instruction. This process is best performed at the FIG unit, and the corresponding information should be included in the FIQ. Table 5.6 summarizes the different optimizations supported and shows their combined effect. (Special pattern optimizations are not included in the state diagram and the algorithm in Section 5.5 for simplicity.)

Table 5.6. Folding optimization by combining successive destroyer, duplicator, and/or swapper anchors.

case   sequence                     optimization                                issued as
1      swap swap                    ignore both effects                         nop†
2      dup swap                     ignore swap effect                          dup
3      dup pop                      ignore both effects                         nop
4      dup pop2                     pop only one word                           pop
5      dup2 pop2                    ignore both effects                         nop
6      D(t) D(u), e.g., pop2 pop    combine their effects into one destroyer    D(t + u), e.g., pop3‡

• † nop is ignored at runtime.
• ‡ destroyer(t + u), where t + u > 3, is not among the standard JVM instructions. It is assumed that the hardware will support such a combined instruction at runtime.

5.7 Conclusions

We have presented and evaluated a new approach to Java bytecode folding using operand extraction. The algorithm is mainly motivated by our desire to issue folding groups concurrently and OOO to facilitate superscalar support for Java programs.

In the OPEX bytecode folding algorithm, principal operations in folding groups are first identified and their operands are picked from the BQ. Our algorithm overcomes the stated shortcomings of existing folding techniques. It recognizes nested patterns as it first looks for anchors and then extracts their operands. To decouple the virtual dependency between successive folding groups, IFGs are tagged. Eliminating virtual instruction dependency provides an opportunity for multiple-issuing and OOO execution of JBCs. Furthermore, the algorithm is robust: small changes in folding patterns can be handled easily. All these features allow folding as many patterns as possible, which leads to a great reduction in the required on-chip stack size.

We introduced the notion of anchor instructions and formulated the anchor expression for each JVM instruction category. A state machine model together with the corresponding pseudocode were presented for the introduced folding algorithm. The algorithm was analyzed for possible hazards, and a hazard detection and resolution mechanism was included. We have developed a VHDL model of a processor that supports the OPEX bytecode folding algorithm. We use it as a testbed to experiment with the above techniques.

Further extensions to this algorithm could involve folding some of the complex instructions that are currently not considered. The work in this chapter has been published in [65, 210, 211, 212, 213].

The next chapter studies JAFARDD's architecture and discusses its distinguishing features, emphasizing the instruction pipeline modules.

Chapter 6

Architecture Details

6.1 Introduction

In this chapter, we detail the JAFARDD architecture and the mechanisms utilized to dynamically translate Java stack-dependent bytecodes into RISC-style stack-independent instructions. We especially focus on how virtual stack dependency is resolved by the use of the OPEX bytecode folding algorithm coupled with Tomasulo's algorithm. The discussion flow in this chapter matches the flow of JBCs through each of the bytecode processing phases discussed in Section 4.4. Section 6.2 describes the front end portion of the pipeline. The folding administration unit is presented in Section 6.3, whereas Section 6.4 discusses the architecture of the queuing, folding, and dynamic translation modules and describes special hardware features incorporated to assist in dynamic translation. Dynamic scheduling and execution stages are included in Section 6.5. Conclusions are drawn in Section 6.6. In discussing the operation of the different processor modules, we adopt a unified approach. For every module, we present the following topics in sequence (if applicable): (1) architectural design principles, where we reflect the Java processing design principles listed in Section 4.2 on the module design; (2) internal operations, where the internal operations of the module are summarized; (3) bytecode processing, where we show how the module contributes to processing JBCs or folding groups; (4) architectural block, where the architecture (inputs, outputs, and internal data structures) of the module is specified; (5) operation steps, where the algorithm used by the module to perform its operations is explained; and (6) operation control, priorities, and sequencing, where the rules that govern the operation of the module to guarantee the correctness of the algorithm are discussed.

6.2 Front End

The front end phase contains only one pipeline stage: bytecode fetch. In this phase, the BF acts in conjunction with the BR to speculatively and adaptively fetch JBCs from the I-cache into the BQ.

6.2.1 BF Architectural Design Principles

Maintaining a high bytecode folding rate (as formulated in Architectural Design Principle 10) implies that the fetch rate should be kept as high as possible to keep the pipeline continuously fed. Thus, the bytecode fetch policy should be sufficiently aggressive to cope with the variable JVM instruction length, which results in a variable JBC issue size, and with the disruptions due to branches and cache misses. For that purpose, we propose an adaptive feedback bytecode fetch policy. (Details of branch prediction and speculative issuing are beyond the scope of this dissertation.)

6.2.2 An Adaptive Feedback Fetch Policy

JAFARDD adopts an adaptive feedback fetch policy in which the number of JBCs fetched every clock cycle is set according to the number of JBCs issued in the previous clock cycle (fed back from the BQM) and the current BQ size. The BF internally keeps track of the current BQ size and updates it on every queuing and dequeuing of JBCs. The BF works in two modes: transient and steady state. When the BQ is empty or has to be flushed (at program start, method invocation, or a mispredicted branch), the BF enables the maximum allowable fetch capacity from the I-cache. On the other hand, at steady state, it tries to compensate for issued JBCs (up to the maximum fetch capacity). Internally, the BQ has two zones (Figure 6.1): one for normal use and the other for overflow. Typical numbers of entries in the normal and overflow zones are 32 and 16, respectively. Normally, the BQ is filled to the end of the normal zone. However, some of the BQM-internal operations might result in the expansion of the BQ size (as will be explained later), which might require filling beyond the boundary of the BQ normal zone. In this case, JBCs are queued in the overflow zone. The overflow zone is also used to resolve possible deadlock situations in the BQ. In situations when the normal zone is full and no folding groups could be constructed from


Figure 6.2. Inputs and outputs of the BF and the I-cache.

… or jumps. The algorithm consists of four parts. The first one adjusts the internally maintained BQ size (currentBQsize) by the number of issued JBCs. The next one enables bytecode fetch to fill the normal BQ size. The next part enables fetching one extra bytecode into the BQ overflow zone. The algorithm gives the second part a higher priority over the third one; that is, if the fillEnable signal is active at a time when there is space in the nominal BQ, the fillEnable request is ignored. The last part of the algorithm handles the default case, where no JBCs are fetched.

Algorithm 6.1 Pseudocode for the adaptive BF's feedback bytecode fetch algorithm.

external CONSTANT normalBQsize: integer;
external CONSTANT overflowBQsize: integer;
external CONSTANT maxFetchSize: integer;
VARIABLE currentBQsize: integer := 0;

procedure BF(issueEnable: in boolean, bytecodeCount: in integer,
             fillEnable: in boolean, queueEnable: out boolean,
             fetchLength: out integer, BC: out integer)
begin
  if (issueEnable == TRUE) then begin
    currentBQsize := currentBQsize - bytecodeCount;
  end if;
  if (currentBQsize < normalBQsize) then begin
    queueEnable := TRUE;
    fetchLength := normalBQsize - currentBQsize;
    fetchLength := min(maxFetchSize, fetchLength);

    currentBQsize := currentBQsize + fetchLength;
    BC := BC + fetchLength after 1 clock cycle;
  end if;
  else if ((fillEnable == TRUE) and
           (currentBQsize < (normalBQsize + overflowBQsize))) then begin
    queueEnable := TRUE;
    fetchLength := 1;
    currentBQsize := currentBQsize + 1;
    BC := BC + 1 after 1 clock cycle;
  end if;
  else begin
    queueEnable := FALSE;
    fetchLength := 0;
  end else;
end;

6.3 Folding Administration

In Chapter 5 we discussed in detail the OPEX bytecode folding algorithm for Java processors. Here, we show the hardware implementation of the algorithm within the FIG in the off-pipeline folding administration stage and its interaction with the rest of the pipeline.

6.3.1 FIG Architectural Design Principles

According to the OPEX bytecode folding algorithm, the FIG scans the bus between the I-cache and the BQM searching for folding groups. To maintain a high bytecode folding rate (as required by Architectural Design Principle 10), the I-cache bus must be wide enough to accommodate several JBCs, thus providing the capacity for folding more than one group every clock cycle. By this means, information about folding groups can be generated at a rate that surpasses the rate of their consumption (one every clock cycle) by other units in the pipeline (Architectural Design Principle 9). The size of the FIQ is selected to match the maximum number of folding groups that could be generated from the BQ. This requirement is compiled into the efficient FIG organization shown in Figure 6.3.

foldingInfo = (startBytecode, bytecodeCount, opcodePosition, opcodeSize, insertOffset, foldingOperation)

Figure 6.3. Keeping the FIG out of the main pipeline flow path. The FIG snoops on the I-cache bus and can generate folding information for more than one folding group every clock cycle, while the BQM consumes information about a single folding group at a time from the FIQ.

6.3.2 Bytecode Processing within the FIG

For the bytecode processing done within the FIG, refer to Subsection 5.3.5.

6.3.3 FIG Architectural Module

As shown in Figure 6.3, there are four input signals to the FIG. queueEnable indicates that a group of JBCs of length fetchLength is being read on the cacheBus. When a folding group is to be issued, dequeueEnable is asserted to enable removing the corresponding entry from the FIQ. The output of this module is the entry at the head of the FIQ (foldingInfo).

6.3.4 FIG Operation Steps

For the operation steps of the OPEX bytecode folding algorithm, refer to Section 5.5.

6.4 Queuing, Folding, and Dynamic Translation

Two modules work together to dynamically translate JBCs: the BQM in the bytecode queue and fold pipeline stage, and the FT in the bytecode translation stage. The BQM uses information generated by the FIG to fold a series of JBCs. The FT in turn translates these folded JBCs into RISC-like native binaries that can be handled by other, non-Java modules. In what follows, we look at the operation and architecture of both modules.

6.4.1 Bytecode Queue Manager (BQM)

Administration of BQ operations is performed by the BQM. Here, we focus on the details of the BQM architecture and functionality.

6.4.1.1 BQM Architectural Design Principles

Architectural Design Principle 2 requires stack independence, thus extending the functionality of the BQM beyond feeding instructions from a regular instruction queue into the pipeline. Hence, two further requirements are specific to the BQM: (1) the capability of extensively and intelligently rearranging JBCs to compensate for the lack of a stack; and (2) the folding of JBCs as specified by the FIG. Therefore, we propose having an on-chip dynamic BQ that emulates typical stack operations, e.g., swapping and duplication (Architectural Design Principle 3). The BQM-internal architecture is structured in a way that facilitates integration with a stackless architecture that is capable of executing stack code. It also plays a significant role in the dynamic scheduling and OOO execution provided by Tomasulo's mechanism.

6.4.1.2 BQM-Internal Operations

A regular queue-like structure usually has two operations: queue and dequeue. A new set of BQM-internal operations is added to emulate some of the required stack operations, e.g., destroy top of stack, duplicate stack top, etc. For each folding group, the FIG specifies the BQM operation to be performed and includes it among the group information in the FIQ. In total, the BQM is capable of performing seven different internal operations (an Invalid operation disables the BQM); a compact sketch of these operations is given after the following list:

• (Remove - V): removes, without replacement, a CFG from the BQ and puts its JBCs on the BQM output.
• (Augment - A): removes an IFG from the BQ, makes it complete by augmenting a tstore instruction to it, and puts the new complete group on the BQM output. A tload instruction replaces the removed JBCs in the BQ.

• (Increment - I): removes, without replacement, an iinc instruction from the BQ and puts it on the BQM output.
• (Rename - N): renames an LV, which is pending in the BQ, to a tagged one. Nothing changes in the BQ, but a renamed folding group that consists of a producer and a tagged consumer is put on the BQM output.
• (Destroy - D): eliminates a folding group from the BQ. Nothing is put on the BQM output and nothing replaces the disposed JBCs in the BQ.
• (Duplicate - U): duplicates a group of JBCs in the BQ. Nothing is put on the BQM output.
• (Swap - W): changes the order of JBC storage in the BQ. Nothing is put on the BQM output.
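The sketch announced above is given here: a Java enum naming the seven BQM-internal operations plus the disabling Invalid value. The enum and its helper are illustrative only; in hardware, the operation is simply a field of the FIQ entry selected by the FIG.

public enum BqmOperation {
    INVALID,    // disables the BQM for this cycle
    REMOVE,     // V: remove a CFG and place its JBCs on the output
    AUGMENT,    // A: remove an IFG, append a tstore, leave a tload behind in the BQ
    INCREMENT,  // I: remove an iinc and place it on the output
    RENAME,     // N: issue a (producer, tagged consumer) pair; the BQ is unchanged
    DESTROY,    // D: delete a group from the BQ, nothing on the output
    DUPLICATE,  // U: duplicate JBCs inside the BQ, nothing on the output
    SWAP;       // W: reorder JBCs inside the BQ, nothing on the output

    /** Terminating operations finish at the BQM and inject a bubble downstream. */
    public boolean isTerminating() {
        return this == DESTROY || this == DUPLICATE || this == SWAP;
    }
}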

Table 6.1 maps folding anchors to BQM-internal operations. (Refer to Appendix A for the folding notation.) In this table we use the notation BQM_{V,A,I,N,D,U,W} to describe the operation performed by the BQM in executing a certain anchor type. The symbols in the subscript denote the seven supported BQM-internal operations. If a certain operation is to be performed, the corresponding subscript is included; otherwise, it is omitted. Note that the rename operation occurs only when there is a hazard. Close examination of the BQM notation for each operation helps us outline an architecture for the module. It is worth mentioning that some of the bytecode details are not recognized at this stage, e.g., normal LVs are not extracted from the producers or consumers. However, tagging (including tag selection) is done here.

6.4.1.3 Bytecode Processing within the BQM

We can distinguish two sets of BQM-internal operations in Subsection 6.4.1.2: terminating and non-terminating operations. Terminating operations finish the corresponding JBC processing at the BQM level without passing them to subsequent modules. These operations mimic different stack operations (e.g., dup and pop). Table 6.2 summarizes the folding operations performed by each of these operations. Table 6.3 lists the JVM opcodes that are terminated internally at the BQM. Examples of terminating operations are included in Table 6.4. Non-terminating operations contribute to the intermediate processing of JBCs.

Table 6.1. Mapping different anchor instructions to the folding operations performed by the bytecode queue manager (BQM).

Type          Anchor notation   Folding template       BQM notation
consumer      A1,0              P C                    BQM_{V,N}
operator      A[1,2],1          P...P A[1,2],1 C       BQM_{V,N}
              A[1,2],1          P...P A[1,2],1 T       BQM_A
independent   A0,0              A0,0                   BQM_I
destroyer     A[1,2],0          P...P A[1,2],0         BQM_D
duplicator    A[1,2],0          P...P A[1,2],0         BQM_U
swapper       A2,0              P P A2,0               BQM_W
load          A0,1              A0,1 C                 BQM_{V,N}
              A0,1              A0,1 T                 BQM_A
              A[1,2],1          P...P A[1,2],1 C       BQM_{V,N}
              A[1,2],1          P...P A[1,2],1 T       BQM_A
store         A[1,3],0          P...P A[1,3],0         BQM_V
branch        A0,0              A0,0                   BQM_V
              A0,1              A0,1 C                 BQM_{V,N}
              A0,1              A0,1 T                 BQM_A
              A[1,2],0          P...P A[1,2],0         BQM_V

They produce intermediate results for the subsequent modules to continue processing. Table 6.5 summarizes the folding operations performed by each of these operations and shows the corresponding output produced by the BQM. Examples of non-terminating operations are included in Table 6.6.

6.4.1.4 BQM Architectural Module

Figure 6.4 shows the inputs and outputs of the BQM module. The BQM inputs can be divided into two sets. The first enables JBC queuing (queueEnable), provides the count of JBCs to be queued (fetchLength), and supplies the data to be queued (cacheBus). The second set is read from the FIQ when the signal dequeueEnable is asserted to select a JBC window for issuing on the issueBytecodes output port. This window is defined by a starting bytecode (startBytecode) and a length (bytecodeCount). This input group also includes a pointer to the folding group opcode and its arguments (opcodePosition) and its size (opcodeSize) to enable reading the opcode on the issueOpcode output port. insertOffset is used in the case of duplication or swapping to indicate the insertion point of the JBC. (The folding algorithm sets opcodePosition to -1 in the case of consumer anchors, which results in reading a nop opcode on issueBytecodes.) BQM operation is controlled by the FIG via the foldingOperation signal. Non-terminating BQM-internal operations read folding groups on the issueBytecodes output bus, whereas a terminating operation reads nop's on this output bus, which eventually injects a bubble into the pipeline. The RS minicontroller (explained in Subsection 6.4.1.5) negates the dequeueEnable signal to insert a bubble in the pipeline when there is no empty RS entry.

Table 6.2. Summary of folding operations done in the framework of terminating BQM-internal operations. For the destroy, duplicate, and swap operations, the table shows the BQ before folding, the folding template, and the BQ after folding.

Table 6.3. JVM opcodes terminated internally at the BQM.

Category                                        Opcodes    Symbol   Folding group fields
DESTROYER (pop stack element(s) u (and v))      pop        D(1)     (u:X:4)
                                                pop2       D(2)     (u:X:8) or (u:X:4), (v:X:4)
DUPLICATOR (duplicate stack element(s)          dup        U(1)     (u:X:4)
u (and v))                                      dup_x1     U(1)     (u:X:4)
                                                dup_x2     U(1)     (u:X:4)
                                                dup2       U(2)     (u:X:8) or (u:X:4), (v:X:4)
                                                dup2_x1    U(2)     (u:X:8) or (u:X:4), (v:X:4)
                                                dup2_x2    U(2)     (u:X:8)
SWAPPER (swap stack elements v and u)           swap       W        (u:X:4), (v:X:4)

Table 6.4. Examples of the folding operations done in the framework of terminating BQM-internal operations, showing for each the BQ before folding, the BQM operation, and the BQ after folding.

Table 6.5. Summary of folding operations done in the framework of non-terminating BQM-internal operations (remove, augment, increment, and rename). For each operation, the table lists the BQ before folding, the folding template, the BQ after folding, and the BQM output (opcode and producers/consumer).

Table 6.6. Examples of the folding operations done in the framework of non-terminating BQM-internal operations, showing for each the BQM operation, the BQ before folding, the BQM output, and the BQ after folding.

Figure 6.4. Inputs and outputs of the BQM. Inputs: queueEnable, fetchLength, cacheBus, dequeueEnable, startBytecode, bytecodeCount, opcodePosition, opcodeSize, insertOffset, and foldingOperation. Outputs: issueBytecodes and issueOpcode.

Internally, the BQ is a linear first-in first-out array of entries (BQentries), with currentBQpointer pointing to the last bytecode in the BQ. Each entry in the BQ holds a JBC, a tagged flag (Q), and an ignore flag (I). The tagged flag (Q) indicates that the corresponding bytecode is an inserted, tagged producer and has to be prefixed with a tload bytecode while being dispatched, to form a tagged producer. The tload bytecode itself is not saved internally in the BQ, to simplify operation and to save space; only the tag is saved in the BQ. When a 2-byte producer is replaced with a tload bytecode, the ignore flag (I) is asserted to mark the high-order byte to be discarded at issue time.

The BQM maintains a pool of tags in a linear array of entries (tagEntries). The tag entry count is maintained in the constant (tagCount). When needed, a free tag (freeTag) is selected to be used for a new tagged producer or consumer.
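A minimal Java sketch of the BQ entry layout and the tag pool described above follows; the class and field names (BQEntry, TagPool, freeTag, and so on) are illustrative, and only the (JBC, Q, I) entry format and the reference-count-per-tag bookkeeping come from the text.

    // Illustrative sketch of a BQ entry and of the BQM tag pool.
    final class BQEntry {
        byte jbc;        // the queued Java bytecode
        boolean tagged;  // Q flag: inserted tagged producer, prefixed with tload at issue time
        boolean ignore;  // I flag: high-order byte of a replaced 2-byte producer, discarded at issue time
    }

    final class TagPool {
        private final int[] tagEntries;   // reference count per tag; 0 means the tag is free

        TagPool(int tagCount) { tagEntries = new int[tagCount]; }

        // Select a free tag for a new tagged producer or consumer ("get a free tag" ministep).
        int freeTag() {
            for (int i = 0; i < tagEntries.length; i++) {
                if (tagEntries[i] == 0) { tagEntries[i]++; return i; }
            }
            return -1;   // no free tag available
        }

        void incrementUse(int tag) { tagEntries[tag]++; }   // duplicating a tagged bytecode
        void decrementUse(int tag) { tagEntries[tag]--; }   // issuing a tagged bytecode
    }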

6.4.1.5 BF Minicontroller

As shown in Subsection 6.2.4, a minicontroller is needed to generate two signals (issueEnable and fillEnable) to control the BF based on the folding operation (foldingOperation) indicated by the FIG. This controller is located in the same pipeline stage as the BQM and is structured as a simple combinational logic circuit (Figure 6.5). The following logic equations summarize the assertion conditions for the output control signals.

Figure 6.5. Inputs and outputs of the BF minicontroller (inputs: foldingOperation and dequeueEnable; outputs: issueEnable and fillEnable).

issueEnable ← (foldingOperation != invalid) · (foldingOperation != rename) · (foldingOperation != duplicate)    (6.1)

fillEnable ← (foldingOperation == invalid)    (6.2)
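A minimal combinational sketch of Equations 6.1 and 6.2 in Java, assuming a hypothetical FoldingOperation enumeration of the FIG commands; the names are illustrative and do not come from the dissertation's hardware description.

    // Hypothetical sketch of the BF minicontroller logic (Equations 6.1 and 6.2).
    enum FoldingOperation { INVALID, REMOVE, AUGMENT, INCREMENT, RENAME, DESTROY, DUPLICATE, SWAP }

    final class BFMinicontroller {
        // issueEnable: asserted for every valid folding operation except rename and duplicate.
        static boolean issueEnable(FoldingOperation op) {
            return op != FoldingOperation.INVALID
                && op != FoldingOperation.RENAME
                && op != FoldingOperation.DUPLICATE;
        }

        // fillEnable: asserted only when no valid folding operation is indicated by the FIG.
        static boolean fillEnable(FoldingOperation op) {
            return op == FoldingOperation.INVALID;
        }
    }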

6.4.1.6 BQM Operation Steps

Allowable operations on the BQ are deleteEntries(startingEntry, noOfEntries), duplicateEntries(startingEntry, insertingPoint, boundryEntry), and swapEntries(startingEntry, insertingPoint, boundryEntry). All these operations update the currentBQpointer. Each

BQM operation is decomposed into a set of ministeps. Some of the ministeps are shared among different operations. Table 6.7 lists all distinct ministeps used by different BQM-internal operations. For each ministep, the table shows the accomplished action(s). As shown in this table, some of these ministeps are parameterized. Figure 6.6 visualizes the operations done for JBC folding and issuing. As Figure 6.7 explains, issuing from non-terminating operations is a multiplexing process: producer(s), opcodes, and consumers from different folding groups are multiplexed on one output bus.

Figure 6.6. Arrangements made within the BQM to prepare for JBC folding and issuing. The diagram shows only the non-terminating BQM operations that result in reading JBCs on the BQM output (remove, augment, increment, and rename), with the bytecode window boundaries determined by the FIG unit. The legend distinguishes on-the-fly inserted bytecodes, on-the-fly generated and inserted new tags, and fields duplicated from the JBC window. Numbers in circles correspond to the ministep numbers in Table 6.7.

Figure 6.8 shows how the BQ looks after executing each of the BQM-internal operations. Only the flags (I and Q) that are asserted by the operations are shown in the figure. These flags then control the producer assembling process as depicted in Figure 6.9. Figure 6.10 visualizes the tag-related BQM ministeps. Internally, the BQM maintains a pool of tags. It keeps track of the count of bytecodes referencing each tag so that a free tag can be selected when needed. Issuing a tagged bytecode decrements the count attached to the corresponding tag, while duplicating a tagged bytecode increments the attached count.

Table 6.7. Ministeps involved in performing BQM-internal operations.

1. Read producers (RP):
     for i = startBytecode to opcodePosition - 1 do
       if (BQentries[i].I == 0) then
         if (BQentries[i].Q == 1) then issueBytecodes[m++] := tload;
         issueBytecodes[m++] := BQentries[i].JBC;

2. Decrement count attached to tags in the folding group (DC):
     for i = startBytecode to opcodePosition - 1 do
       if (BQentries[i].Q == 1) then tagEntries[BQentries[i].JBC]--;

3. Read opcode (opcode position, opcode size) (RO(pos, size)):
     for i = 0 to size - 1 do
       issueOpcode[i] := BQentries[pos++].JBC;

4. Read consumer (RC):
     for i = opcodePosition + opcodeSize to startBytecode + bytecodeCount - 1 do
       issueBytecodes[m++] := BQentries[i].JBC;

5. Remove read entries (starting JBC, number of entries) (RE(pos, size)):
     BQentries.deleteEntries(pos, size);

6. Get a free tag (GT):
     for i = 0 to tagCount - 1 do
       if (tagEntries[i] == 0) then begin freeTag := i; tagEntries[i]++; end if;

7. Generate a tagged consumer (GC):
     issueBytecodes[m++] := tstore;
     issueBytecodes[m++] := freeTag;

8. Insert a tagged producer (IP):
     BQentries[startBytecode] := (freeTag, 1, 0);

9. Rename an LV (RL):
     if (BQentries[startBytecode - 1].JBC == "iload") then BQentries[startBytecode - 1].I := 1;
     BQentries[startBytecode] := (freeTag, 1, 0);

10. Read a producer to rename its LV (RR):
      if (BQentries[startBytecode - 1].JBC == "iload") then
        issueBytecodes[m++] := BQentries[startBytecode - 1].JBC;
        issueBytecodes[m++] := BQentries[startBytecode].JBC;

11. Duplicate entries (DE):
      BQentries.duplicateEntries(startBytecode, insertOffset, opcodePosition);
      currentBQpointer := currentBQpointer + bytecodeCount - opcodeSize;

12. Increment count attached to tags in the folding group (IC):
      for i = startBytecode to startBytecode + bytecodeCount - 1 do
        if (BQentries[i].Q == 1) then tagEntries[BQentries[i].JBC]++;

13. Swap entry orders (SE):
      BQentries.swapEntries(startBytecode, insertOffset, opcodePosition);

† The index m, which is initialized to zero, specifies the position on the output bus (issueBytecodes) where the next JBC is written.
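To make the ministep notation concrete, here is a small, sequential Java sketch of two of the Table 6.7 ministeps (read producer(s) and read consumer), assuming parallel arrays for the BQ fields; the class, method names, and the tload encoding constant are illustrative placeholders, and the real ministeps are hardwired, parallel datapaths rather than loops.

    import java.util.ArrayList;
    import java.util.List;

    final class MinistepSketch {
        static final byte TLOAD = (byte) 0xCA;  // placeholder encoding for the internal tload bytecode

        // Ministep 1 (RP): emit producers, prefixing tagged entries with tload
        // and skipping bytes marked by the ignore flag.
        static List<Byte> readProducers(byte[] jbc, boolean[] q, boolean[] ignore,
                                        int startBytecode, int opcodePosition) {
            List<Byte> issue = new ArrayList<>();
            for (int i = startBytecode; i < opcodePosition; i++) {
                if (ignore[i]) continue;       // I flag: discard the high-order byte
                if (q[i]) issue.add(TLOAD);    // Q flag: re-create the tagged producer
                issue.add(jbc[i]);
            }
            return issue;
        }

        // Ministep 4 (RC): emit the consumer bytecodes that follow the folding group opcode.
        static List<Byte> readConsumer(byte[] jbc, int opcodePosition, int opcodeSize,
                                       int startBytecode, int bytecodeCount) {
            List<Byte> issue = new ArrayList<>();
            for (int i = opcodePosition + opcodeSize; i < startBytecode + bytecodeCount; i++) {
                issue.add(jbc[i]);
            }
            return issue;
        }
    }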

In summary, the proposed BQM architecture not only compensates for the lack of the processor stack, but also allows for efficient processing of JBCs. The smart BQM permits random access to the queue elements, which facilitates insertion/duplication and deletion/removal of the JBCs from the queue.

Table 6.8. Sequencing description of the ministeps involved in performing each of the BQM-internal operations.

Operation | Sequencing description
remove    | DC, (RP | RO(opcodePosition, opcodeSize) | RC), RE(startBytecode, bytecodeCount)
augment   | DC, (RP | RO(opcodePosition, opcodeSize) | (GT, GC)), RE(startBytecode, bytecodeCount - 2), IP
increment | (RP | RO(startBytecode, 1)), RE(startBytecode, bytecodeCount)
rename    | ((GT, RL) | RR), GC
destroy   | DC, RE(startBytecode, bytecodeCount)
duplicate | IC, DE, RE(opcodePosition, opcodeSize)
swap      | SE, RE(opcodePosition, opcodeSize)

Figure 6.7. BQM output assembled from queued JBCs. The BQM-internal operation, specified by the FIG via foldingOperation, controls this assembly process: the producers, opcode, and consumer fields coming from the remove, augment, increment, and rename operations are multiplexed onto the single BQM output (for rename, a nop is placed in the opcode field).


6.4.1.7 BQM Operation Control, Priorities, and Sequencing

BQM-internal operations are done one at a time. For each folding group, an operation is specified by the FIG. The only exception is when there is a need for LV renaming,

Figure 6.8. How the BQ looks after executing each BQM-internal operation (remove, augment, increment, rename, destroy, and duplicate), showing the BQ contents before and after folding. Numbers in circles correspond to the ministep numbers in Table 6.7.

Figure 6.9. Generating producers while executing BQM-internal operations that put JBCs on the bytecode queue manager (BQM) output. The Q and I flags control this process. A null result is dropped from the output.

where the FIG issues two consecutive commands to the BQM. The first one is for renaming (BQM_N) and the other one is for normal folding group removal (BQM_V). This is indicated in Table 6.1 by (BQM_{V,N}). Table 6.8 shows a possible sequencing description of each of

Figure 6.10. Tag handling inside the bytecode queue manager (BQM): a new tag is selected from the pool when its instruction count is zero, and issuing or duplicating tagged bytecodes decrements or increments that count. Numbers in circles correspond to the ministep numbers in Table 6.7.

the BQM-internal operations in terms of the ministeps listed in Table 6.7. The process of queuing JBCs read from the cacheBus is not shown, as it is a straightforward appending operation that updates the currentBQpointer. Notice that these ministeps are hardwired and thus could be parallelized easily (if the semantics allow that).

6.4.2 Folding Translator (FT)

An important property of JBCs is that their statically determinable type state enables simple on-the-fly translation of JBCs into efficient machine code [1]. The FT utilizes this property to dynamically translate a folding group into a decoded RISC instruction format that is usable by the subsequent stages.

6.4.2.1 FT Architectural Design Principles

The FT is situated in the middle of the pipeline as the boundary between Java-specific modules and Java-independent (RISC-style) ones. This facilitates incorporating a RISC core that could work independently from Java, addressing Architectural Design Principle 6.

Because JBCs are fetched and locally stored as pure JVM instructions, JBCs generated by typical Java compilers can be executed on JAFARDD without any modifications as required by Architectural Design Principle 7.

6.4.2.2 Produced Instruction Format

The FT dynamically translates a folding group's producers and consumers into a set of LV indices. When these indices are assembled with the folding group anchor and an output from the BQM, a typical RISC instruction is obtained, as in Figure 6.11. Modules following the FT treat these indices as if they were RISC register specifiers. The fields in this instruction, together with their sizes and meanings, are shown in Figure 6.12 and explained as follows¹:

Figure 6.11. Inputs and outputs of the FT and the instruction format at its output and the entrance of the dynamic scheduling and execution pipeline stage. The folding group anchor (issueOpcode) supplies the opcode (24 bits); the folding group bytecodes on issueBytecodes, i.e. the producer(s) and consumer (if any), are translated by the FT into the LV indices rs (16 bits), rt (16 bits), and rd (16 bits) on the LVbus, together with fieldTypes (4 bits).

Figure 6.12. JAFARDD native instruction format. Field sizes are shown in bits: opcode (24) holds the folding anchor and its arguments; rs (16) the first source; rt (16) the second source; rd (16) the third source or the destination; and fieldTypes (4) the source types.

¹ Field sizes are selected in a way that matches the JVM specification. However, much smaller sizes could be used according to the benchmarking results presented in Chapter 2. For example, source fields could be sufficiently encoded in 8 bits.

• opcode (24 bits): JVM operation and 0 to 2 arguments. JVM opcodes can be up to 3 bytes in length.
• rs (16 bits): First source operand.
• rt (16 bits): Second source operand.
• rd (16 bits): Third source operand or LV destination index.
• fieldTypes (4 bits): Four decoding bits, divided into three groups, that specify the type of values saved in the rs, rt, and rd fields, respectively (two bits correspond to rd). We use the vector notation fieldTypes[rs rt rd0 rd1] to specify the values stored in this field. The decoding bits for the rs and rt fields are encoded in the same way: 0 indicates that the value saved in the field is to be interpreted as an LV index (default), whereas 1 indicates an immediate value. When either of these fields is not used, a 0 is saved in the field and in the corresponding decoding bit, as it is not a destructive action to read an unwanted LV. The rd decoding bits are different: they not only have to specify whether the field is being used as an index for a source or for a destination, but also whether the field is being used at all, since it is destructive to write to an unwanted LV (a tag would become associated with that LV). Thus, the decoding bits for rd are split into two: the left bit specifies whether rd is a source (0, the default) or a destination (1), and the right bit shows the source type, as for the other two fields. When the rd field is not used, its decoding bits are set to 00. (A decoding sketch is given below.)
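The following is a small decoding sketch for the fieldTypes vector [rs rt rd0 rd1], assuming rs occupies the most significant of the four bits; the record and accessor names are illustrative.

    // Hypothetical decoder for the 4-bit fieldTypes vector [rs rt rd0 rd1].
    record FieldTypes(boolean rsImmediate, boolean rtImmediate,
                      boolean rdIsDestination, boolean rdImmediate) {

        static FieldTypes decode(int bits) {        // rs is assumed to be the most significant bit
            return new FieldTypes(
                ((bits >> 3) & 1) == 1,             // rs: 0 = LV index, 1 = immediate
                ((bits >> 2) & 1) == 1,             // rt: 0 = LV index, 1 = immediate
                ((bits >> 1) & 1) == 1,             // rd0: 0 = source (or unused), 1 = destination LV index
                (bits & 1) == 1);                   // rd1: source type when rd is used as a source
        }
    }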

6.4.2.3 Bytecode Processing Within the FT

The FT is responsible for dynamically translating folding groups' producers and consumers into instruction fields. This dynamic translation process is summarized in Table 6.9 for each relevant folding template. Starting from a folding group, this table shows the values generated for each instruction field at the output of the FT. Figure 6.13 illustrates the folding translation process with an example. Table 6.10 includes more examples. Table 6.11 shows the JBCs that are dynamically translated by the FT and how they are mapped to the instruction fields in Figure 6.12. iinc is treated as a special case as it holds its necessary LV producer and consumer indices in the operation arguments, provided in the bytecode stream, instead of having them provided by other JVM instructions. (In a stack-based JVM realization, iinc does not reference the stack.) Later, this opcode is interpreted as an addition. The wide bytecode is treated by the FT as a directive that affects the width of

Table 6.9. Mapping different folding templates to the folding translator unit (FT) output. For each folding template the table gives the opcode and the translated rs, rt, and rd fields together with the fieldTypes vector: producer LV indices are placed in the source fields (decoded as LV indices), producer constants are placed as immediates (decoded as immediate values), and the consumer LV index, when present, is placed in rd (decoded as a destination).

the operand. Other JBCs (e.g., nop), if passed to the FT, are considered as don't cares.

Figure 6.13. An example of translating a folding group into a JAFARDD instruction: the JVM folding group iload_3 iload_2 iadd istore_0 is translated into opcode = 96 (iadd), rs = 3, rt = 2, rd = 0, and fieldTypes = 0010.
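Expressed as data, the example of Figure 6.13 might look as follows; the TranslatedInstruction class is an illustrative container, and only the field values (opcode 96, rs 3, rt 2, rd 0, fieldTypes 0010) come from the figure.

    // Sketch of the FT output for the folding group (iload_3, iload_2, iadd, istore_0).
    final class TranslatedInstruction {
        int opcode;      // 24-bit JVM operation plus up to two arguments
        int rs, rt, rd;  // 16-bit LV indices or immediates
        int fieldTypes;  // 4-bit vector [rs rt rd0 rd1]

        static TranslatedInstruction iaddExample() {
            TranslatedInstruction insn = new TranslatedInstruction();
            insn.opcode = 96;          // iadd (0x60)
            insn.rs = 3;               // first producer: iload_3 -> LV index 3
            insn.rt = 2;               // second producer: iload_2 -> LV index 2
            insn.rd = 0;               // consumer: istore_0 -> LV index 0
            insn.fieldTypes = 0b0010;  // rs and rt are LV indices; rd is a destination LV index
            return insn;
        }
    }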

6.4.2.4 FT Architectural Module

Figure 6.11 shows the inputs and outputs of the FT module. issueBytecodes provides the folding group producers and consumer. The anchor instruction itself (issueOpcode) bypasses the FT to the next modules. As such, the FT does not have to identify the opcode and separate it from the producers and consumer, which simplifies its design. The excep-

Table 6.10. Examples of dynamic translation for different folding templates. Each row gives a folding group example and the resulting opcode, rs, rt, rd, and fieldTypes values; for instance, the group iload_3 iconst_1 iadd istore_2 translates to opcode iadd, rs = 3, rt = 1 (immediate), rd = 2, fieldTypes = 0110, and the group iconst_3 istore_2 translates to a nop opcode with rs = 3 (immediate), rd = 2, and fieldTypes = 1010.

Table 6.11. Correspondence between no-effect, producer, consumer, and increment JVM instructions and the translated instruction fields. The categories are: NO EFFECT (nop); PRODUCERS, which load the stack with a constant (bipush, iconst_<a>, sipush) or from an LV (iload_<b>, iload, wide iload, tload); CONSUMERS, which store the stack into an LV (istore_<b>, istore, wide istore, tstore); and INDEPENDENT, which increment an LV with a constant (iinc, wide iinc). For each opcode the table lists its symbol, the folding group operands it supplies or consumes, and the instruction field (rs, rt, or rd) it maps to.

tions from that are the consumers and the iinc anchor, which is duplicated by the BQM to be included in the producer JBCs in addition to being read as the opcode (as required by the

FT operation). The output of the FT is the LVbus, which holds a set of LV indices to be read or written by the LVF. Internally, the FT has a combinational design that translates the input format into the output format. No internal data structures are needed for the operation of this module.

6.4.2.5 Non-Translated Instructions

Complex operations in the JVM, as summarized in Table 5.2, are not folded and thus are not translated in JAFARDD. The table mentions the folding obstacle for each category. These bytecodes represent only 5 to 8% of the executed JBCs in our benchmark studies [86, 87]. In the rare case where the hardware encounters a nonfoldable JBC, execution is trapped to the OS for software emulation. This approach fully complies with Architectural Design Principle 1.

6.5 Dynamic Scheduling and Execution

Like most modern processors, JAFARDD dynamically schedules instructions for execution by incorporating the reservation station mechanism. In this section, we highlight the structure of the local variable file (LVF), reservation stations (RSs), and execution units (EXs). We emphasize how they help to achieve a high degree of parallelism, and dynamic scheduling and execution, in JBC processing.

6.5.1 Local Variable File (LVF)

The FT translates a folding group's producers and consumer into a set of LV indices. Modules following the translator interpret these indices as if they were RISC register specifiers. In this subsection, we focus on the details of the LVF architecture and its functionality.

6.5.1.1 LVF Architectural Design Principles

To address Architectural Design Principle 3, an LVF is included on chip to permit manipulating the JVM's LVs as regular RISC registers, as well as providing the venue for adding some LVF-specific operations. In addition to the regular register file operations, JAFARDD's LVF performs advanced operations that compensate for the lack of a stack, exploit ILP among JBCs, and facilitate the interaction with Tomasulo's algorithm implementation, as specified by Architectural Design Principles 2, 4, and 5.

6.5.1.2 LVF-Internal Operations

A regular register-file-like structure usually has two operations: operand read and result write. In the framework of Tomasulo's algorithm as adopted in this work, result write is changed to result capture (from the CDB). An operation is needed for reserving the destination LV to indicate the absence of the value. In addition, two operations are added to the LVF to compensate for the lack of a stack: LV-to-LV copy and immediate write into an LV. Thus, in total, the LVF performs five different internal operations controlled by a set of enable signals:

• LV-to-LV copy (Copy - C): copies an LV to another one, which is useful for folding groups that consist of an LV producer and an LV consumer, e.g., (iload_2 istore_0).
• Immediate write (Write - W): writes a constant directly into one of the LVs, which is useful for folding groups that consist of a CN producer and an LV consumer, e.g., (iconst_2 istore_0).
• Source operand read (Read(n) - D(n)): reads three LVs, together with their tags. As an LV read is a harmless operation, this operation is always enabled. However, n indicates the number of operands actually required by the instruction.
• Register reserve (Reserve - V): marks an LV as busy waiting for a result. A tag is attached to it and at the same time forwarded to the next pipeline stage. If this operation is not enabled (as in the case of anchors that need no consumers), a zero tag is forwarded to the following pipeline stages.
• Result capture (Capture - T): the LVF is always snooping on the CDB for broadcasted values. Once a value is captured, its tag is compared with all entries. If a match is found, the value is stored and the internal tag is nullified.

A behavioural sketch of these five operations is given below.
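The sketch assumes each LV entry is a (value, tag) pair as described in this section; class and method names are illustrative, and the sequential Java abstracts away the single-cycle, multi-port hardware behaviour.

    // Behavioural sketch of the five LVF-internal operations.
    final class LocalVariableFileSketch {
        private final long[] value;
        private final int[] tag;           // 0 means the value is ready

        LocalVariableFileSketch(int size) { value = new long[size]; tag = new int[size]; }

        // Copy (C): LV-to-LV copy, e.g. for (iload_2, istore_0).
        void copy(int copyIndex, int writeIndex) {
            value[writeIndex] = value[copyIndex];
            tag[writeIndex] = tag[copyIndex];
        }

        // Write (W): immediate write, e.g. for (iconst_2, istore_0).
        void write(int writeIndex, long writeValue) {
            value[writeIndex] = writeValue;
            tag[writeIndex] = 0;
        }

        // Read (D(n)): read a source operand together with its tag.
        long[] read(int readIndex) { return new long[] { value[readIndex], tag[readIndex] }; }

        // Reserve (V): mark the destination LV busy by attaching a freshly generated tag.
        void reserve(int reserveIndex, int newTag) { tag[reserveIndex] = newTag; }

        // Capture (T): snoop the CDB and complete every LV waiting on the broadcast tag.
        void capture(long cdbValue, int cdbTag) {
            for (int i = 0; i < tag.length; i++) {
                if (tag[i] == cdbTag) { value[i] = cdbValue; tag[i] = 0; }
            }
        }
    }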

Table 6.12 lists the anchors that need access to the LVF and maps their requirements onto the LVF-internal operations. (Destroyer, swapper, and duplicator anchors are not shown in this table as they are executed and terminated within the BQM and do not affect the LVF.) The table also shows folding templates for those anchors. In this table, we use the notation LV_{C,W,D(n),V,T} to describe the operation performed by the LVF in executing a certain anchor type. The symbols in the subscript denote the five supported LVF operations. If the LVF is required to perform a certain operation, the corresponding subscript is included; otherwise, it is omitted. As read operations involve multiple operands, we attach the number of operands to be read to the symbol D. (Note that although read is always enabled, we only include in the notation the number of LVs that need to be read.) Also observe that a reserve operation usually needs a capture operation to follow, if the attached tag is not overwritten by another reserve operation. Thus, both reserve and its corresponding capture, indicated by LV_{V,T}, do not take place in the same clock cycle. Close examination of the LV notation for each operation helps us outline the architecture of the module.

Table 6.12. Mapping different anchor instructions to the operations performed by the local variable file (LVF).

Type        | Folding template        | LVF notation        | Folding group example
consumer    | P_CNx C_LVy             | LV_W                | iconst_0 tstore 7
consumer    | P_LVx C_LVy             | LV_C                | iload_1 istore_0
operator    | P...P A_{[1,2],1} C     | LV_{D([1,2]),V,T}   | iload_1 ineg istore_0
independent | A_{0,0} = I             | LV_{D(1),V,T}       | iinc 3 1
load        | A_{0,1} C               | LV_{V,T}            | ldc 10 istore_0
load        | P...P A_{[1,2],1} C     | LV_{D([1,2]),V,T}   | aload_1 getfield 0 2 istore_0
store       | P...P A_{[1,3],0}       | LV_{D([1,3])}       | iload_2 putstatic 0 2
branch      | A_{0,0}                 | LV                  | goto
branch      | A_{0,1} C               | LV_{V,T}            | jsr 2 astore_0
branch      | P...P A_{[1,2],0}       | LV_{D([1,2])}       | iload_0 ifeq 30

6.5.1.3 Early Assignment of Instruction/LV Tags

In the original Tomasulo algorithm, the tag attached to a destination LV is equal to the index of the RS that hosts the pending instruction. Thus, dispatching an instruction along the pipeline mandates getting an available RS index back to the LVF. There are two approaches to get the tag: (1) send a request to the RS controller for an available RS index; the pipeline is stalled until the tag comes back to the LVF, and then the instruction proceeds to the assigned

RS unit. Successor instructions that access tagged LVs are dispatched with these tags. The weak point of this approach is the pipeline stall due to the interaction between the dispatching mechanism and the RS controller. (2) Allow the current instruction and the following ones to proceed without getting the RS tag. The tag is assigned to the instruction when it enters its RS. At that moment, the tag is sent back to the LVF and is also forwarded to already dispatched instructions that reference that destination. Although this approach does not need to stall the pipeline, controlling the flow of the tag would be quite complex. The problems with the above two mechanisms stem from the tag's counterflowing in the pipeline from the RS unit back to the LVF. In our architecture, we use a novel tag requisition technique: a separate tag generation unit (TG) located in the same pipeline stage as the LVF (Figure 4.3) is introduced. For each destination LV specified by a translated folding group, a renaming tag is obtained from the TG to be assigned to the instruction and to the destination LV. This tag propagates together with the instruction operands to the hosting RS and EX. The renaming tag assignment remains valid until the instruction is retired. The TG listens on the CDB for broadcasted tags in order to return them to the internally maintained free tag pool. Figure 6.14 shows the input and output signals of the TG. It always holds a newTag at its output. When the LVF captures this tag, it raises the tagTaken signal to ask for a new tag to be generated. The pipeline does not need to be stalled during this process.

Figure 6.14. Inputs and outputs of the tag generation unit (TG). Inputs: tagTaken, CDBtag, and CDBenable; output: newTag.

Internally, the TG keeps a pool of tags with a status flag that indicates whether each tag is in use or not. It also snoops on the CDB. When CDBenable is asserted, the TG captures the broadcasted tag (CDBtag) and internally frees it for reuse. With the number of tags equaling the total number of reservation station entries plus one, the TG is guaranteed to have an unused tag every clock cycle. This tag-early assignment approach has a number of merits: (1) it saves the overhead

(time and circuitry) of communicating with the RS unit and getting the tag back to the LVF; (2) a new tag is generated every clock cycle and in the same pipeline stage as the LVF, which means zero-cycle tag generation; and (3) it eliminates the need for information counterflow in the pipeline, which simplifies the design and speeds up processing. The cost of a separate TG is the extra storage needed at each RS to store the tag together with the instruction, as the tag will no longer be equal to the RS unit index. We believe that the advantages outweigh the disadvantages.
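A small Java sketch of the TG behaviour described above; the names are illustrative, and tag 0 is treated as "no tag" to match the LVF convention.

    // Hypothetical sketch of the tag generation unit (TG).
    final class TagGeneratorSketch {
        private final boolean[] inUse;    // index 0 is reserved: a zero tag means "value ready"

        TagGeneratorSketch(int totalRSEntries) { inUse = new boolean[totalRSEntries + 2]; }

        // newTag output: with (total RS entries + 1) tags, a free tag exists every clock cycle.
        int newTag() {
            for (int t = 1; t < inUse.length; t++) {
                if (!inUse[t]) return t;
            }
            throw new IllegalStateException("tag pool exhausted");
        }

        // tagTaken input: the LVF captured the offered tag, so mark it as in use.
        void tagTaken(int tag) { inUse[tag] = true; }

        // CDB snooping: a broadcast tag is returned to the free pool.
        void cdbBroadcast(int cdbTag) { inUse[cdbTag] = false; }
    }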

6.5.1.4 LVF Architectural Module

Figure 6.15 shows the inputs and outputs of the LVF module. By observing the supported anchors in Table 6.12 and their LVF notation, it can be concluded that there is a need to read up to three LVs at once. The zero producer-count anchor is the case of load anchors reading from the CP with the index supplied as the opcode argument. Store operations that have arrays as destinations need three producers. This mandates having a three-read-port LVF. (Some non-foldable instructions require a producer count that has no upper limit, e.g., tableswitch.) On the other hand, from the internal LV operation description, we see that we need only single write, copy, and CDB ports. Table 6.13 shows how each LVF-internal operation makes use of the different input indices, enable signal values and tags, and output values and tags. newTag and tagTaken are used to communicate with the TG. This arrangement supports a high degree of parallelism, as multiple operations can be enabled in one clock cycle, as will be explained shortly. copy and write operations are of a special nature, as they cannot be enabled concurrently. The benchmarking of the OPEX folding algorithm indicated that these two operations are not executed as often as the other operations, as shown in Chapter 2. Thus, we elected to reuse one of the copy ports to index the destination LV for write operations. Table 6.14 shows the instruction field that needs to be connected to each LVF index port for each folding template. From this table the following connections can be made: copyIndex to rs, writeIndex to rd, readIndex[3] to [rs, rt, rd], and reserveIndex to rd. These connections are permanently wired even if the corresponding LVF operation is not needed, since different enable signals control the information transfer process. Internally, the LVF entries are organized as a linear array (LVentries), like any regular register file. Each entry in this array has a tag field for integration with the RSs. Entries in

Figure 6.15. Inputs and outputs of the local variable file (LVF). Inputs: copyIndex, copyEnable, writeIndex, writeEnable, writeValue, readIndex(3), reserveIndex, reserveEnable, CDBvalue, CDBtag, CDBenable, and newTag; outputs: readValue(3), readTag(3), destinationTag, and tagTaken. The suffix (3) attached to the read signals indicates an aggregation of three signals indexed from 0 to 2.

Table 6.13. Port usage for different LVF-internal operations.

Operation | Enable signal   | Input value / tag   | Output value / tag        | Source LV index | Destination LV index
Copy      | copyEnable      | n/a                 | n/a                       | copyIndex       | writeIndex
Write     | writeEnable     | writeValue / n/a    | n/a                       | n/a             | writeIndex
Read†     | always enabled  | n/a                 | readValue[i] / readTag[i] | readIndex[i]    | n/a
Reserve*  | reserveEnable   | n/a                 | n/a / destinationTag      | n/a             | reserveIndex
Capture‡  | CDBenable       | CDBvalue / CDBtag   | n/a                       | n/a             | LV_x

† We use the suffix [i] to designate a certain port from an aggregation of ports.
‡ x ∈ {x | LVentries[x].tag == CDBtag}; the CDBtag is used as a matching tag, i.e., the LV that holds a matching tag is targeted by the operation.
* When reserve is not enabled, the destination tag (destinationTag) is cleared.

Table 6.14. Mapping different folding templates to the local variable file (LVF) inputs. For each folding template the table lists the values of the copy, write, and reserve enable signals and the instruction fields ([rs, rt, rd]) routed to the copy, write, read[3], and reserve index ports; for example, an LV producer with an LV consumer asserts copy (with rs and rd), a CN producer with an LV consumer asserts write, and templates with a destination LV assert reserve with rd.

this array have the simple format (value, tag). A zero tag indicates a ready value, whereas a non-zero tag signals the absence of a valid value. Entries in the LVF are split into two groups: one group includes the JVM-visible LVs, whereas the other group includes the tagged LVs allocated by the BQM and used by the folding algorithm. From the processor design point of view, there is no architectural difference in manipulating these two LV types.

6.5.1.5 LVF Minicontroller

The minicontroller in Figure 6.16 is needed to generate the three control signals that control the LVF: reserveEnable, copyEnable, and writeEnable. (As will be shown later, the reserveEnable signal is also propagated forward to control the RSs.) The LVF minicontroller generates these signals based on the issueOpcode (generated by the BQM), the instruction fieldTypes (generated by the FT), and whether the LVF is enabled by the control signal dequeueEnable. This controller is structured as a simple combinational logic circuit. Table 6.14 shows the control signal values for the different folding templates. From this table, we deduce the following logic equations for the different control signals:

nil ← (opcode == nop)    (6.3)

reserveEnable ← fieldTypes[rd0] · ¬nil · dequeueEnable    (6.4)

Figure 6.16. Inputs and outputs of the LVF minicontroller. Inputs: issueOpcode, fieldTypes, and dequeueEnable; outputs: copyEnable, writeEnable, and reserveEnable.

copyEnable ← ¬fieldTypes[rs] · nil · dequeueEnable    (6.5)

writeEnable ← fieldTypes[rs] · nil · dequeueEnable    (6.6)
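A combinational sketch of Equations 6.3 through 6.6 in Java, assuming fieldTypes is the 4-bit vector [rs rt rd0 rd1] and that a nop issueOpcode identifies consumer-only folding groups; the constant and method names are placeholders.

    // Hypothetical sketch of the LVF minicontroller enable-signal logic.
    final class LVFMinicontrollerSketch {
        static final int NOP = 0;   // placeholder encoding for the nop opcode

        static boolean nil(int issueOpcode) { return issueOpcode == NOP; }               // (6.3)

        static boolean reserveEnable(int op, int ft, boolean dequeueEnable) {
            boolean rd0 = ((ft >> 1) & 1) == 1;        // destination LV present
            return rd0 && !nil(op) && dequeueEnable;   // (6.4)
        }

        static boolean copyEnable(int op, int ft, boolean dequeueEnable) {
            boolean rsImm = ((ft >> 3) & 1) == 1;
            return !rsImm && nil(op) && dequeueEnable; // (6.5): LV source copied into an LV
        }

        static boolean writeEnable(int op, int ft, boolean dequeueEnable) {
            boolean rsImm = ((ft >> 3) & 1) == 1;
            return rsImm && nil(op) && dequeueEnable;  // (6.6): immediate written into an LV
        }
    }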

6.5.1.6 LVF Operation Steps

Table 6.15 gives the sequential steps of each LVF-internal operation. Operations are performed according to the precedence hierarchy shown in the table. Operations with low precedence numbers are evaluated before operations with higher numbers, while operations with the same precedence cannot be executed in the same clock cycle.

6.5.1.7 LVF Operation Control, Priorities, and Sequencing

A set of conditions must be satisfied for a certain LVF-internal operation to occur. In what follows we give the logic equation for each operation (all events are enabled at the rising clock edge):

copy ← copyEnable · ¬reserveEnable    (6.7)

write ← writeEnable · ¬reserveEnable    (6.8)

read ← 1    (6.9)

reserve ← reserveEnable    (6.10)

capture ← CDBenable · ¬copyEnable · ¬writeEnable · ¬reserveEnable    (6.11)

To allow for a higher degree of parallelism, the LVF supports multiple operations in a single clock cycle. However, the LVF architecture, operation efficiency, and correctness impose the following constraints on performing different operations in a single clock cycle:

Table 6.15. Steps required for each LVF-internal operation.

capture:
  for all x ∈ {x | LVentries[x].tag == CDBtag} do
    LVentries[x].value := CDBvalue;
    LVentries[x].tag := 0;

copy:
  LVentries[writeIndex].value := LVentries[copyIndex].value;
  LVentries[writeIndex].tag := LVentries[copyIndex].tag;

write:
  LVentries[writeIndex].value := writeValue;
  LVentries[writeIndex].tag := 0;

read:
  for i = 0 to 2 do
    if ((writeEnable == 1) and (writeIndex == readIndex[i])) then
      readValue[i] := writeValue; readTag[i] := 0;
    else if ((copyEnable == 1) and (writeIndex == readIndex[i])) then
      readValue[i] := LVentries[copyIndex].value; readTag[i] := LVentries[copyIndex].tag;
    else if ((CDBenable == 1) and (LVentries[readIndex[i]].tag == CDBtag)) then
      readValue[i] := CDBvalue; readTag[i] := 0;
    else
      readValue[i] := LVentries[readIndex[i]].value; readTag[i] := LVentries[readIndex[i]].tag;

reserve:
  LVentries[reserveIndex].tag := newTag;
  destinationTag := newTag;
  tagTaken := 1 until the next falling clock edge;

• copy and write cannot be done in the same clock cycle because the latter uses the writeIndex port for the destination LV.
• copy and write have higher priorities than capture. This means that if capture is enabled with copy and/or write and targets the same destination LV as either of them, capture is ignored.
• If capture, copy, or write is enabled targeting the same LV as any of the sources of the read operation in the same clock cycle, the value being written into the destination LV is directly forwarded to the output read port without waiting for it to be saved in the corresponding LV first. The previous priority rule also applies here: copy and write cancel a capture operation.
• reserve overrides all operations targeting the same LV.
• read does not need to wait for capture, copy, or write to finish before it starts. The input to the latter operations, if any of them is enabled and targeting the same destination LV, can be forwarded to the read mechanism. This is extremely useful, especially if the capture operation is enabled and the captured tag is the same as that of any of the read LVs. In this case, it is more efficient to forward the captured value to the output with a null tag. (In fact, if this case is not handled, a deadlock situation might arise: if a tag that has just been broadcast on the CDB is forwarded from the LVF to an RS, the entry will stall in its RS forever, or the RS may even capture a wrong value.)
• The above rules imply that an operation's output could be overridden by another, higher priority operation. For optimization, a lower priority operation is disabled if a higher priority operation is active and targeting the same LV.

The following formula puts the above constraints into a sequencing description that shows the allowable LVF operations in a single clock cycle (refer to Appendix B for a description of the notation used).

(¬C · ¬W · ¬V ⇒ T) | (V, (C ⊕ W)) | (((C ⊕ W), T) ⇒ D) | V    (6.12)

6.5.2 Reservation Stations (RSs)

RSs were first introduced by Tomasulo for the floating point unit of the IBM 360/91 [42, 197, 214] to facilitate OOO instruction issue. A number of extensions of this algorithm were then suggested [215, 216, 217, 218, 219]. In this subsection, we present the RS design used in JAFARDD.

6.5.2.1 RS Architectural Design Principles

RSs provide a venue and a mechanism for LV renaming, dynamic scheduling, and pipeline hazard handling, which permit achieving a higher degree of ILP (Architectural Design Principles 4 and 5). These three features are also effective tools for stack dependency resolution (Architectural Design Principle 2).

6.5.2.2 Tomasulo's Algorithm for Java Processors

Instead of being blocked at the local variable access pipeline stage waiting for a valid operand to become available (a RAW hazard), or for a potential false data hazard to be resolved (WAW or WAR), or for a busy EX to be freed, an instruction can be dispatched conditionally. The instruction, together with any available operands, is buffered in an RS. A not-yet-available LV operand is renamed to the index of the instruction that will produce the expected operand. A free EX will start processing when the source operands are available. When processing is completed, the EX will forward the result to the awaiting RSs, and will write back to an LV if there is no WAW conflict. The result forwarding and writing are done via the CDB. To meet the requirements of Tomasulo's algorithm and the OPEX bytecode folding algorithm, instructions and operands are tagged. Instructions are tagged by assigning a tag of value t to them at the local variable access pipeline stage. Operands are tagged by expanding their representation in the LVF from just an indexed value (v) to an indexed, ordered set (v, t), where v is the operand value and t is an associated LV tag. If v holds the operand value, t is set to 0. Otherwise, t is set equal to the tag associated with the instruction that is to produce the operand value v. When an instruction is issued, it is tagged with a tag t. As Tomasulo's algorithm involves early reading of the operands (known as rename bound), when an instruction is issued to the instruction shelve pipeline stage, either (v, 0) (if the LV is not tagged) or (0, t) (if the LV is tagged) is forwarded with the instruction. When an instruction is dispatched to its EX, the instruction's LV tag is memorized by that EX. When the instruction is executed and the result is available, the corresponding EX broadcasts the result with the instruction tag t on the CDB to all modules. If a broadcasted tag matches that of a tagged LV in the LVF or of a to-be-read operand in an RS, the result is captured and the corresponding tag is internally nullified. Having all its operands, an instruction can then start execution if the required EX is free.

Figure 6.17. A register entry and a reservation station entry in Tomasulo's algorithm. (a) A register entry holds a tag (t) and a value (v). (b) A reservation station entry holds the operation, two source operands, each with a tag (t_i) and a value (v_i), and the destination tag. A zero tag means the corresponding value is ready.
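As a data-structure sketch of Figure 6.17, the two entry formats could be represented as follows; field names are illustrative, and a zero tag means the corresponding value is ready.

    // Sketch of the two tagged entry formats used by Tomasulo's algorithm here.
    final class TomasuloEntriesSketch {
        // (a) A register (LV) entry: an indexed, ordered pair (value, tag).
        static final class RegisterEntry {
            long value;
            int tag;       // 0: value is valid; otherwise the tag of the producing instruction
        }

        // (b) A reservation station entry: operation, tagged source operands, and the
        // tag under which the result will be broadcast on the CDB.
        static final class RSEntry {
            int operation;
            long[] sourceValue = new long[2];
            int[] sourceTag = new int[2];
            int destinationTag;

            boolean ready() {              // executable once every source tag is cleared
                for (int t : sourceTag) if (t != 0) return false;
                return true;
            }
        }
    }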

6.5.2.3 LV Renaming

Figure 6.18 shows the LV renaming design space from [91, 220, 221]. From the implementation point of view, as our architecture does not require any changes in the Java compilation phase or any JBC preprocessing, LV renaming is done dynamically during the issuing and execution phases.

Figure 6.18. Design space for LV renaming: implementation (dynamic), renaming scope (partial or full), layout of the rename buffers (merged with the LVF, separate from the LVF, or held with another structure (RS)), type handling (same or different places for different data types), number of rename buffers, basic access mechanism (associative, indexed, or capture), and operand fetch policy (rename bound or dispatch bound). Options selected for use in our architecture are shown in italic in the original diagram: dynamic implementation, full renaming scope, rename buffers held with another structure (RS), different places for different data types, capture access, and rename-bound operand fetch.

By specifying a unified LV structure for fixed and floating point data, the JVM specification makes it easy to incorporate a full renaming scope. Consequently, JAFARDD does not incorporate explicit separation in the renaming process; it renames all data types. In JAFARDD, it is quite natural to hold the rename buffers within another structure (the RSs). We also have different places (RSs) for handling different data types. The number of rename buffers is implementation dependent. Operands in our processor are not fetched by searching the renaming buffers; if operands are not available in their corresponding LVs, they are instead captured when broadcasted. Thus, a required operand cannot be held in more than one location at any one time. This has the advantage of saving the time and space cost of associative search, indirect indexing, or parallel access to multiple entries. In JAFARDD, operands are fetched during renaming (rename bound). If the operand is not available, the associated tag is read. The rename rate usually equals the issue rate of the processor to avoid any bottlenecks. As JAFARDD issues one folding group per cycle, it does one rename per cycle.

6.5.2.4 Instruction Shelving

Figure 6.19 shows our design strategies in the instruction shelving design space [91, 220, 222]. Our design shelves all data and instruction types, thus providing a full shelving scope.

Figure 6.19. Design space for shelving: scope of shelving (partial or full), layout of the shelving buffers (standalone (RS) or combined; individual, group, or central; number of entries and of read and write ports), and operand fetch policy (issue bound or dispatch bound). Options selected for use in our architecture are shown in italic in the original diagram: full scope, standalone individual RSs, and issue-bound operand fetch.

As individual stations are dedicated to each EX, RSs are standalone buffers used exclusively for shelving. The number of shelving buffers is implementation dependent. As

RSs are not shared between EXs, one port for reading and another for writing are enough. An issue-bound policy is adopted to fetch operands. During instruction issuing, the source and destination LV indices of the issued instructions are forwarded to the LVF for fetching the referenced operands or their associated tags, and for assigning tags to the destination LVs, respectively.

6.5.2.5 Instruction Dispatching

Instruction dispatching consists of scheduling instructions held in a particular RS unit for execution and sending them to the allocated EX. (In this dissertation, we say that an instruction is issued or dispatched when it is proceeding to the FT or to an EX, respectively.) Figure 6.20 shows our design strategies on the dispatching design space tree [91, 220].

Figure 6.20. Design space for the instruction dispatch scheme: dispatch policy (dispatch rule, arbitration rule, and dispatch order: in-order, partially out-of-order, or out-of-order), dispatch rate (single or multiple), method of checking the availability of operands (direct check of the scoreboard bits or check of the explicit status bits), and treatment of an empty RS (straightforward or bypassing). Options selected for use in our architecture are shown in italic in the original diagram: out-of-order dispatch order, multiple dispatch rate, check of the explicit status bits, and bypassing.

The dispatch rule specifies when instructions are considered executable. Employing RSs takes care of resolving false data dependencies. Furthermore, if a suitable EX is available, then those instructions whose operands are available are executable. Our arbitration rule selects the earliest instruction in the sequence when more than one instruction is executable. We employ an OOO instruction dispatch order policy. A non-executable instruction does not block the dispatch of subsequent executable instructions. Instead, any executable instruction held in the shelving buffer is eligible for dispatch. This implies, from the implementation point of view, that all of the shelving buffer entries have to be inspected for executability.

As our architecture has multiple EXs with individual RSs, it can afford a multiple instruction dispatch rate. Another issue in the dispatching design space is the method used to check whether all operands of the instructions held in the shelving buffers are available. Since JAFARDD fetches operands during instruction issue, a check of the explicit status bits (the tag field) is enough at dispatch time. An important issue related to instruction dispatch is the treatment of all-empty or all-non-executable RSs. We selected an advanced approach for handling the situation when instructions with all operands available arrive at all-empty or all-non-executable RSs: bypassing. Here, some additional circuitry permits instructions to bypass an empty RS unit and be immediately forwarded to the EXs to eliminate any unnecessary delay.

6.5.2.6 RS Internal Operations

In addition to the basic RS operations (instruction shelving and dispatch, and result capture), JAFARDD also incorporates two advanced operations (bypass and forward); i.e., the RS unit performs five different internal operations controlled by a set of conditions:

• Instruction shelving (Shelve - S(j)): an instruction missing one or more of its operands, or targeting a busy EX, is shelved (queued) in a free RS entry. Operand values and/or tags, the operation, and the destination tag are all stored in this entry.
• Instruction dispatch (Dispatch - D(j)): an instruction with all operands available is dispatched (dequeued) from its RS entry for execution by a free EX.
• Result capture from the CDB (Capture - T(j, i)): the CDB announces a newly released result together with its destination tag to be captured by awaiting RSs. If the associated tag matches an instruction's operand tag, the broadcasted result is captured and saved as the corresponding operand value.
• Empty RS bypass (Bypass - B): if all RS entries are empty or there is no executable entry, a new executable instruction bypasses the RS unit directly to a free EX. This arrangement optimizes the operation of the pipeline.
• Result forwarding (Forward - F(i)): a result is forwarded from the CDB to an operand of an instruction arriving at the input of the RS unit (called the input instruction). There are two scenarios for this: in one situation, called forward and shelve (F_S(i)), the result is forwarded to an input instruction that has an operand with a tag matching the broadcasted one, before the instruction is shelved. In the other scenario, called forward and bypass (F_B(i)), the result is forwarded to an input instruction to let it bypass empty RSs. F(i) denotes CDB result forwarding to the input instruction in either of these scenarios.

Forwarding is a crucial process for the operation of the RS. If an input instruction that has a tag matching the CDB tag is stored in the RS without result forwarding, the pending operand will miss the opportunity to capture the CDB result. Later, this pending tag might match another broadcasted tag, which would lead to incorrect operation.
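A much-simplified, sequential Java sketch of the shelve, dispatch, capture, and bypass operations; all names are illustrative, forwarding is folded into the bypass check, and the single-cycle hardware semantics and the per-EX wiring are not modelled.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Iterator;

    final class RSUnitSketch {
        static final class Entry {
            int operation, destinationTag;
            long[] sourceValue = new long[3];
            int[] sourceTag = new int[3];      // 0 means the operand value is present

            boolean ready() {
                for (int t : sourceTag) if (t != 0) return false;
                return true;
            }
        }

        private final Deque<Entry> entries = new ArrayDeque<>();
        private final int rsSize;

        RSUnitSketch(int rsSize) { this.rsSize = rsSize; }

        boolean full()  { return entries.size() == rsSize; }
        boolean empty() { return entries.isEmpty(); }
        boolean ready() { return entries.stream().anyMatch(Entry::ready); }

        // Shelve: queue an instruction that cannot go straight to its EX.
        void shelve(Entry e) { entries.addLast(e); }

        // Dispatch: send the earliest executable entry to the (free) EX.
        Entry dispatch() {
            for (Iterator<Entry> it = entries.iterator(); it.hasNext(); ) {
                Entry e = it.next();
                if (e.ready()) { it.remove(); return e; }
            }
            return null;
        }

        // Capture: complete any shelved operand whose tag matches the CDB broadcast.
        void capture(long cdbValue, int cdbTag) {
            for (Entry e : entries) {
                for (int i = 0; i < 3; i++) {
                    if (e.sourceTag[i] == cdbTag) { e.sourceValue[i] = cdbValue; e.sourceTag[i] = 0; }
                }
            }
        }

        // Bypass: an arriving instruction with all operands available skips the RS entirely
        // when the RS is empty (or holds no executable entry) and the EX is free.
        boolean canBypass(Entry input, boolean exBusy) {
            return input.ready() && (empty() || !ready()) && !exBusy;
        }
    }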

6.5.2.7 RS Unit Architectural Module

JAFARDD includes a number of RS units. Internally, each RS unit has a set of entries stored in the array RSentries, each having slots for the operands (sourceValue[3] and sourceTag[3]), an operation (operation), and a result tag (resultTag). Each RS entry can accommodate up to three operands, and the maximum number of entries is defined by the design constant RSsize. The inputs and outputs of an RS unit are shown in Figure 6.21. There are three input groups. The first group inputs the instruction fields to be shelved: operandValue, readTag, destinationTag, and issueOpcode; the signal RSenable enables the reading of these fields. The second group of signals (CDBvalue, CDBtag, and CDBenable) captures the result from the CDB. The third group contains the signal EXbusy, which indicates whether the corresponding EX is busy or free. The outputs of this module are directed towards its EX: EXsource, EXoperation, and EXdestinationTag; the signal EXenable enables reading these values into the corresponding EX. The operandValue is generated from the LVF's readValue and the immediate fields (rs, rt, and rd) via the immediate/LV selector shown in Figure 6.22. The immediate fields are passed to the selector in the array immValue. This circuit selects between the immediate value from the instruction and the value read from the LVF according to the fieldTypes embedded in the translated instruction. Each RS unit outputs three flags: RSready indicates the availability of at least one instruction to be dispatched, RSfull flags the lack of internal space in the RS unit, and RSempty is asserted when there is no entry in the RS unit. RSempty will not be asserted together with either of the other two flags. Two internal counters, entryCount (which holds the number of current entries in an RS unit) and readyCount (which holds the number of entries that are ready for dispatch), are used to generate these flags.

Figure 6.21. Inputs and outputs of a reservation station (RS) unit. Inputs: operandValue(3), readTag(3), destinationTag, issueOpcode, RSenable, CDBvalue, CDBtag, CDBenable, and EXbusy; outputs: EXoperation, EXsource(3), EXdestinationTag, EXenable, RSempty, RSfull, and RSready.

6.5.2.8 RS Unit Minicontroller

The minicontroller in Figure 6.23 is needed to control all RS units and, consequently, the pipeline. This controller generates a set of enable signals, the RSenable's, that are used to control the different RS units. The RS minicontroller is structured as a simple combinational circuit that selects a single appropriate RS unit using the issueOpcode and reserveEnable signals carried forward from previous stages, the RSfull and RSready signals from the individual RS units, and the EXbusy signals from all EXs. Empty space in the RSs and the operand data types are the major factors in this selection process. (JVM opcodes indicate the type of the expected operands, e.g., iadd is used to add two integer operands.) The RS minicontroller also generates the signal dequeueEnable that enables the BQM to dequeue another folding group. This signal is negated to insert a bubble in the pipeline when there is no free RS to host the current instruction and the instruction cannot be dispatched directly to a suitable EX (when either the EX is busy or the instruction operands are not available).

Figure 6.22. Inputs and outputs of the immediate/LV selector. Inputs: fieldTypes[4], immValue[3], and readValue[3]; output: operandValue[3].

Figure 6.23. Inputs and outputs of the RS unit minicontroller. Inputs: issueOpcode, reserveEnable, the RSready and RSfull signals from all RS units, and the EXbusy signals from all EXs; outputs: dequeueEnable and the RSenable signals to all RS units.

6.5.2.9 RS Operation Steps

Table 6.16 gives the steps of the RS unit-internal operations. Operations start execution when the appropriate condition is asserted. (Refer to the formulas in Subsection 6.5.2.10.)

6.5.2.10 RS Operation Control, Priorities, and Sequencing

A set of conditions needs to be satisfied for a certain RS unit-internal operation to occur. In what follows we give the optimized logic equation for each operation. Events are enabled at the rising clock edge. (In these formulas, j is an RS entry and i is an operand in this entry.) For more readable equations, the logic condition taggedInput(i) is defined to indicate that the operand is tagged.

RSempty ← (entryCount == 0)    (6.13)

RSfull ← (entryCount == RSsize)    (6.14)

RSready ← (readyCount != 0)    (6.15)

taggedInput(i) ← (readTag(i) != 0)    (6.16)

dispatch ← RSready · ¬EXbusy · RSenable    (6.17)

shelve ← ((EXbusy + Σ_{i=0}^{2} taggedInput(i) · ¬forward(i)) · ¬RSfull + RSready · ¬EXbusy) · RSenable    (6.18)

Table 6.16. Steps required for RS unit-internal operations.

dispatch:
  for k = 0 to entryCount - 1 do
    if (not (∃i (RSentries[k].sourceTag[i] != 0))) then begin j := k; break; end if;
  EXsource := RSentries[j].sourceValue;
  EXoperation := RSentries[j].operation;
  EXdestinationTag := RSentries[j].resultTag;
  entryCount--;
  EXenable := 1 until the next falling clock edge;

forward:
  F := {k | readTag[k] == CDBtag};

shelve:
  for all i ∈ F do
    RSentries[entryCount].sourceValue[i] := CDBvalue; RSentries[entryCount].sourceTag[i] := 0;
  for all i ∉ F do
    RSentries[entryCount].sourceValue[i] := readValue[i]; RSentries[entryCount].sourceTag[i] := readTag[i];
  RSentries[entryCount].operation := issueOpcode;
  RSentries[entryCount].resultTag := destinationTag;
  if (not (∃i (RSentries[entryCount].sourceTag[i] != 0))) then readyCount++;
  entryCount++;

bypass:
  for all i ∈ F do EXsource[i] := CDBvalue;
  for all i ∉ F do EXsource[i] := readValue[i];
  EXoperation := issueOpcode;
  EXdestinationTag := destinationTag;

capture:
  for j = 0 to entryCount - 1 do
    S := {k | RSentries[j].sourceTag[k] != 0};
    if (S != ∅) then
      if (∀i ∈ S (RSentries[j].sourceTag[i] == CDBtag)) then readyCount++;
    for all i ∈ {k | RSentries[j].sourceTag[k] == CDBtag} do
      RSentries[j].sourceValue[i] := CDBvalue; RSentries[j].sourceTag[i] := 0;

bypass ← (Π_{i=0}^{2} (¬taggedInput(i) + forward(i))) · (RSempty + ¬RSready) · ¬EXbusy · RSenable    (6.19)

forward(i) ← (readTag(i) == CDBtag) · CDBenable    (6.20)

forward_S(i) ← forward(i) · shelve    (6.21)

forward_B(i) ← forward(i) · bypass    (6.22)

capture(j, i) ← (RSentries(j).tag(i) == CDBtag) · CDBenable    (6.23)

To achieve a higher degree of parallelism, RS units enable multiple internal operations in a single clock cycle. However, the correctness and efficiency of the RS unit's operation impose the following constraints on performing different operations in a single clock cycle:

• capture and dispatch cannot apply to the same RS entry in the same clock cycle. If an entry becomes ready for dispatch after capturing the CDB result in a certain clock cycle, it cannot be dispatched in that same clock cycle due to the first-in first-out scheduling policy. Also, capturing the CDB result into a certain instruction and then dispatching it would certainly require a longer clock period.
• When shelve and dispatch target the same RS entry in the same clock cycle, the shelve operation is converted to a bypass.
• shelve and capture cannot target the same RS entry in the same clock cycle; instead, the capture operation is converted to a forward.
• bypass cannot be done in the same clock cycle as shelve or dispatch, as an input can either bypass the RS or be shelved, and the RS unit output can be either an RS entry or a new input that bypasses the RS unit. Consequently, the following item applies.
• forward_B cannot be done in the same clock cycle as shelve or dispatch, since shelve/dispatch and bypass cannot take place in the same clock cycle.
• forward has to be done first, before a related shelve or bypass is done.

Now, we formulate the above constraints into a sequencing description that shows the allowable RS unit-internal operations in a single clock cycle (refer to Appendix B for a description of the notation used):

(D(k); (F_B ⇒ B)) | (¬B → (F_S ⇒ S(l))) | T(m, i), where l ≠ k ≠ m    (6.24)

6.5.3 Generic Execution Unit (EX)

In this subsection, we present the architectural module for a generic EX that provides a framework for all EXs. In the following subsection we discuss in detail the structure of the LS, as it is directly related to JBC execution; in fact, the provided LS design is tailored to JVM requirements. Details of the design of the other EX types are outside the scope of this dissertation.

6.5.3.1 EX Architectural Design Principles

Multiple EXs are included in the processor to allow ILP (Architectural Design Principles 4 and 5). These EXs are of three types: ALUs, load/store units (LSs), and branch units (BRs). ALU types are single-cycle integer units (SIs), multi-cycle integer units (MIs), or floating point units (FPs).

6.5.3.2 Generic EX Architectural Module

Figure 6.24 shows the block diagram of a generic EX. Inputs to this module include the operation (EXoperation), the source operands (EXsource), the tag to be associated with the result (EXdestinationTag), and an enable signal for the EX (EXenable). The EX also gets a signal (writeEnable) to indicate when it can write its result on the CDB so as to avoid collision with other EX outputs. Outputs of this module include a signal to the CDB arbiter asking for permission to write on the CDB (outEnable), the result (CDBresult), its tag (CDBtag), and a signal that indicates whether the unit is busy (EXbusy).


Figure 6.24. Inputs and outputs of a generic execution unit (EX).
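As an illustration only, the port list of Figure 6.24 can be modelled as the following Java interface. The interface and method names are hypothetical and are introduced purely to restate the signal directions; they are not part of the JAFARDD design.

    // Hypothetical model of the generic EX port list of Figure 6.24.
    public interface GenericEx {
        // Inputs driven by the reservation station unit.
        void setOperation(int exOperation);
        void setSources(long[] exSource);              // the three source operands
        void setDestinationTag(int exDestinationTag);
        void setEnable(boolean exEnable);
        // Input driven by the CDB arbiter: permission to drive the CDB this cycle.
        void setWriteEnable(boolean writeEnable);

        // Outputs.
        boolean outEnable();   // request to the CDB arbiter
        long cdbResult();      // result value to broadcast on the CDB
        int cdbTag();          // tag associated with the result
        boolean exBusy();      // the unit cannot accept a new operation
    }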

6.5.4 Load/Store Execution Unit (LS)

The LS handles instructions that manipulate objects (including arrays, as they are objects in Java) or need access to the CP. We classify such instructions under the load and store anchor categories. The JVM specification requires objects and the CP to reside in the main memory. To maintain this requirement, JAFARDD needs to access the main memory or the D-cache. Furthermore, execution of such instructions is either very complicated, requires services from the underlying OS, or both. For example, these instructions might require looking up a class dynamically from the list of classes loaded in the system, allocating memory space for an object, dynamically loading a class file, etc. [6, 7, 8]. Some of these tasks need to be coordinated with the underlying OS. Thus, the significant complexity and/or the dependency on the underlying OS motivate their implementation in software.

6.5.4.1 LS Architectural Design Principles

Specialized LSs serve as the gateway for the hardware to reach emulating software routines for object-oriented instructions (Architectural Design Principle 8).

6.5.4.2 Bytecode Processing within the LS

Table 6.17 shows the JBCs that are executed by the LS and how they integrate with folding group producers and consumers.

Table 6.17. JVM load/store anchor operations internally implemented by the LS.

LOAD ANCHORS
• Load the i-th element of an array with reference r into LVa: aaload, baload, caload, daload, faload, iaload, laload, saload (folding symbol L; the result LVa carries LVa.tag).
• Allocate a new object of type CP(j), or of type TYp and size c in the case of arrays, and load the new object reference into LVa: anewarray j:2, new j:2, newarray p:1.
• Retrieve an item of type CP(j) from the object/class with reference r into LVa: arraylength, getfield j:2, getstatic j:2.
• Check that an object with reference r is type compatible with the type CP(j): checkcast j:2, instanceof j:2.
• Load an item from CP(j) into LVa: ldc j:1, ldc_w j:2, ldc2_w j:2.

STORE ANCHORS
• Store an element v into the i-th element of an array with reference r: aastore, bastore, castore, dastore, fastore, iastore, lastore, sastore (folding symbol S).
• Store an item v of type CP(j) into the object/class with reference r: putfield j:2, putstatic j:2.

• L_CPj and S_CPj denote loads and stores that take a CP index j as an argument; L_TYp denotes a load that takes a type p as its argument.
• LVx.tag is the EXdestinationTag that is attached to LVx.

6.5.4.3 LS Architectural Module

As shown in Figure 6.25, the LS shares the following signals with a generic EX: EXoperation (opcode and its argument), EXsource(3), EXdestinationTag, EXenable, and EXbusy. As the LS is not directly responsible for writing the result, it is not connected to the signals writeEnable, CDBresult, outEnable, and CDBtag. Two extra flags are added: trap, to activate the OS trap mechanism, and done, which tells the unit that the trap routine is completed. Internally, the LS includes four registers to pass inputs and the operation to the trap mechanism.


Figure 6.25. Inputs and outputs of the LS unit.

Values stored in these registers are set according to the operation to be performed. Figure 6.25, Table 6.17, and Table 6.18 suggest a possible mapping of folding group inputs, destination tag, and operation onto these registers. As can be seen, R1 stores the second operand or the opcode argument, R2 stores the opcode, R3 stores the first operand, and R4 stores the third operand and the destination tag. An internal controller manages the register multiplexors and generates the appropriate values for the output flags, trap and EXbusy, based on the EXoperation, done, and EXenable signals.

6.5.4.4 LS Operation Steps

To simplify our discussion, we make some assumptions: (1) Nested load/store operations are not supported; that is, a new LS operation is not allowed to start until a previous one completes. (2) LS operations are type-safe. This means that the folding group producers and consumers are of types that match the operation to be performed.

Table 6.18. Mapping of folding group inputs and destination tag onto the LS internal registers. Each row pairs a folding template with the corresponding LS register contents: R2 holds the opcode, R3 holds source[1], R1 holds source[2] or the opcode argument, and R4 holds source[3] or the destination tag.

• L_CPj and S_CPj are loads and stores that have CP index j as an argument. L_TYp is a load that has a type p as its argument.
• LVx.tag is the EXdestinationTag that is attached to LVx.

We assume that the JBCs run through a software bytecode verifier that ensures the type correctness of the class file, and that any runtime type mismatch is handled by a software exception mechanism. (3) Emulation routines are ready to handle exceptions that may occur. (4) Quick operations are not supported (quick processing is beyond the scope of this dissertation). (5) We ignore the effect of trapping on the pipeline; in fact, trapping might result in pipeline flushing, etc.

We use an instruction emulation trapping approach in executing the load/store instructions. When the hardware encounters any of these instructions, the LS throws a hardware exception that traps to an OS routine, which carries out the execution steps. Three options exist for emulation: (1) emulate the instruction execution using primitive JBCs (this option mandates extending the JVM to perform some low-level operations, e.g., direct main memory access); (2) inject native RISC instructions into the second half of the pipeline (both of these options use processor instructions, i.e., a hardwired approach, in emulation); and (3) use a firmware approach that executes a microprogram stored in an on-chip ROM. Regardless of the selected approach, the trapping mechanism follows the steps below:

1. A setup phase is done first when an operation reaches the LS. Depending on the instruction, the operation operands (provided by the folding group producers), the operation argument (provided in the bytecode stream following the operation), the

operation itself, and/or the destination tag are copied into the LS's internal registers.
2. Once the needed data are ready in the corresponding registers, the LS controller generates a signal to trigger the CPU to make an instruction emulation trap. We elected to have all operations trap to the same memory address and leave it to the software to take different actions for different operations [42].
3. Depending on which JBC caused the trap, the exception handler invokes a particular software/firmware routine that emulates the trapped JBC. In performing the desired operation, the invoked subroutine has full access to the LS internal registers.
4. The invoked subroutine performs a series of steps to accomplish its task. Table 6.19 shows highlights of the steps needed for each load/store subcategory.
5. When an operation that produces a result completes, the result is broadcast on the CDB together with the tag, for pending LVs and RSs to capture. (Load instructions produce a result, whereas store ones do not.)
6. When the operation completes, the result is written back. After half a clock cycle, the EX indicates that it is not busy and thus is ready to handle another operation.
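As a minimal sketch of step 3, the dispatch of a trapped JBC to its emulation routine can be pictured as the switch below, keyed on the opcode held in R2. All names are hypothetical, only three opcodes are shown, and the emulation bodies are stubs; the real routines would perform the actions of Table 6.19.

    // Hypothetical sketch of the instruction-emulation trap handler (step 3).
    // lsRegs[0..3] correspond to the LS internal registers R1..R4.
    class LsTrapHandler {
        void handleTrap(int[] lsRegs) {
            int opcode = lsRegs[1];                       // R2 holds the trapped opcode
            switch (opcode) {
                case 0x32: emulateAaload(lsRegs);   break;  // aaload: array element load
                case 0xB4: emulateGetfield(lsRegs); break;  // getfield: object field load
                case 0xB5: emulatePutfield(lsRegs); break;  // putfield: object field store
                default:   emulateOther(lsRegs);            // remaining load/store anchors
            }
        }

        // Stub emulation routines; see Table 6.19 for the steps they would perform.
        private void emulateAaload(int[] r)   { /* effective address, checks, D-cache load */ }
        private void emulateGetfield(int[] r) { /* CP index, resolve, D-cache field load */ }
        private void emulatePutfield(int[] r) { /* CP index, resolve, D-cache field store */ }
        private void emulateOther(int[] r)    { /* other anchors per Table 6.19 */ }
    }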

6.5.5 Common Data Bus (CDB)

When an EX has its result ready, it broadcasts it on the CDB. However, having multiple EXs raises the possibility of bus contention, so a CDB arbiter is required to resolve it. The arbiter works as follows (Figure 6.26): one or more EXs indicate a need to write on the CDB by asserting their outEnable signals. The CDB arbiter schedules simultaneous requests in a round-robin fashion that lets only one of them write on the bus at any time. The EXs know their turn by receiving the writeEnable signals from the arbiter. At that moment, the enabled EX can go ahead and write the result on the CDB. Concurrently, the arbiter asserts the CDBenable output signal to inform all processor modules of the event. Result broadcast takes just one clock cycle, after which the EX removes the data from the bus, and then the arbiter enables another unit, and so on. It is worthwhile to observe that not all operations require writing to the CDB. Anchors that require no consumer (i.e., that carry the no-consumer anchor notation) do not write to LVs.

Table 6.19. Steps needed to perform LS operations.

Subcategory and action(s) or bookkeeping:

LOAD ANCHORS
load an array element: (1) calculate the effective address; (2) validate the operand-destination compatibility; (3) load the array element from the D-cache; (4) extend the loaded value to 4 bytes*.
allocate a new object: (1) construct an index into the CP of the current class*; (2) retrieve the item at the CP index*; (3) resolve the symbolic reference to get a class type, or get the object type from the opcode arguments; (4) allocate memory for the new object; (5) initialize the object fields to their initial values; (6) load a reference to the instance.
retrieve an item from an object/class field: (1) construct an index into the CP of the current class*; (2) retrieve the item at the CP index*; (3) resolve the symbolic reference to get a class type; (4) retrieve the object field from the D-cache.
check object type: (1) construct an index into the CP of the current class*; (2) retrieve the item at the CP index*; (3) resolve the symbolic reference to get a class type; (4) perform type checking.
load an item from CP: (1) construct an index into the CP of the current class*; (2) retrieve the item at the CP index*; (3) resolve the symbolic reference to get a class type; (4) load the CP element from the D-cache.

STORE ANCHORS
store an element into an array: (1) calculate the effective address; (2) validate the operand-destination compatibility; (3) truncate the value to be stored; (4) store the element into the D-cache*.
store an item into an object/class field: (1) construct an index into the CP of the current class*; (2) retrieve the item at the CP index*; (3) resolve the symbolic reference to get a class type; (4) store the object field into the D-cache.

• OS routines involve exception processing. If any operand is invalid, an exception is thrown.
• * Operation is performed only if needed.


Figure 6.26. Inputs and outputs to the CDB arbiter.
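The round-robin arbitration described above can be sketched in Java as follows. This is an illustration only, under simplifying assumptions (single-cycle grants, a fixed number of EXs); the class and method names are invented for the sketch.

    // Hypothetical sketch of a round-robin CDB arbiter: of the EXs asserting
    // outEnable, exactly one receives writeEnable each cycle, rotating fairly.
    public final class CdbArbiter {
        private final int numUnits;
        private int lastGranted = -1;     // index of the EX granted in the previous cycle

        public CdbArbiter(int numUnits) {
            this.numUnits = numUnits;
        }

        // outEnable[i] is true when EX i requests the CDB; the returned vector
        // carries at most one true writeEnable, and CDBenable is simply its OR.
        public boolean[] clock(boolean[] outEnable) {
            boolean[] writeEnable = new boolean[numUnits];
            for (int offset = 1; offset <= numUnits; offset++) {
                int candidate = (lastGranted + offset) % numUnits;
                if (outEnable[candidate]) {
                    writeEnable[candidate] = true;   // grant the bus for one cycle
                    lastGranted = candidate;
                    break;                           // only one EX drives the CDB
                }
            }
            return writeEnable;
        }
    }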

6.6 Conclusions

In this chapter, the details of JAFARDD's architecture were presented. JAFARDD adapts the well-known Tomasulo algorithm to folding-based Java processors. Our architecture uses the OPEX bytecode folding algorithm for folding JBCs. In addition to its flexibility, this algorithm recognizes nested patterns. These two characteristics enable folding as many patterns as possible, which diminishes the need for a stack. Reservation stations are employed for dynamic scheduling.

On closer inspection of the Java processor requirements presented in Chapter 4, the OPEX bytecode folding algorithm, and the Tomasulo algorithm, we have identified, and thus incorporated, several techniques for tuning the Tomasulo algorithm to our architecture: (1) folding information generation off-line; (2) IFG processing; (3) folding-dependent hazard resolution; (4) BQ management; (5) an on-chip LVF; and (6) specialized LS units and OS traps.

The features presented in this chapter are specific to Java processing. However, other languages that share similar characteristics with Java can also make use of these hardware features to improve their performance. In Subsection 6.5.2, we fine-tuned the reservation station structure to suit the needs of JBC processing. We identified and justified our architecture's position in the design space of renaming, shelving, and dispatching. The idea of early tag assignment is equally applicable to any other architecture. We conclude that though our discussion was specific to JBC processing, the presented features can also be used to improve the performance of other architectures. The work in this chapter has been published in [198, 199, 200, 201].

The next chapter examines JAFARDD from a global perspective. It illustrates JAFARDD's microoperations by a comprehensive example and assesses its performance using a benchmark study.

Chapter 7

A Processing Example and Performance Evaluation

7.1 Introduction

In this chapter, we look at JAFARDD from a global perspective. JAFARDD's microoperations are illustrated by a comprehensive example in Section 7.2. This example focuses on the folding and translation processes. Section 7.3 defines the formula used to evaluate the speedup obtained in JAFARDD. The experimental framework is explained in Section 7.4. Sections 7.5 and 7.6 show and analyze the different folding patterns recognized by JAFARDD. Potential performance enhancement is estimated in Section 7.7. Section 7.8 evaluates the operation of the different modules in the processor. We then look at how different stack features are compensated for in our stackless architecture in Section 7.9. Section 7.10 compares JAFARDD's approach to other similar hardware techniques for speeding up Java. Finally, Section 7.11 draws some conclusions on JAFARDD's performance.

7.2 A Comprehensive Illustrative Example

To show how the various modules work together, consider the following Java code segment: (x = 5; this.f += (x++)/4;), where x is an integer method variable, f is an integer object field, and this is a pointer to the current class. Figure 7.1 traces the execution of the corresponding JBCs at each pipeline step (a step takes one clock cycle).

Here, we assume that the JBCs are already queued in the BQ and, as such, all related folding information is already generated and saved in the FIQ. For clarity of presentation, each shown entry in the BQ contains one complete JVM instruction. (In a real implementation, each byte occupies a single entry in the BQ.) Both queues grow bottom up. Both the BQ and the FIQ get input from the I-cache bus. The folding information is saved in the FIQ as an ordered set (st, bc, op, os, io, f-operation), where st is the starting location of the folding group in the BQ, bc is the group bytecode count, op is the opcode position in the BQ, os is the opcode size, io is the BQ insert offset for duplicated and swapped JBCs, and f-operation is the desired BQM-internal folding operation. Dequeued FIQ outputs are used to control the BQ, whereas dequeued JBCs from the BQ are passed to the FT.

The FT translates folding group JBCs read from the BQ into RISC-like binaries. The folding group opcode and immediate values are buffered in the buffer B for one clock cycle to allow the LVF to supply the data/tags needed for the folding group. Information from the FT to the LVF specifies the operation to be done by the latter. Information saved in the buffer and read at the LVF output forms a single RISC instruction to be passed as an input to the RS unit. For example, idiv 5 4, t3 means integer divide 5 by 4 and save the result into the LV that holds the tag 3 (tagged LVs are prefixed with t). For clarity of presentation, we show one unified RS unit and only one general EX that is capable of performing all desired operations. RS fields are: the opcode, a (value (vi), tag (ti)) pair for each of the three operands, and a destination tag (td). The BQ, LVF, TG, and RS monitor the CDB for broadcast results and tags.

In Figure 7.1, step (a) shows the processor state just before executing this piece of code. LV0 is used to hold the value of this, which is set by the JVM to 20. LV1 holds the value of x. The code finishes execution in step (p). The trace includes the 18 JBCs of the 11 JVM instructions in the code segment. On a stack processor that directly executes the JBCs without folding or dynamic scheduling (assuming a seven-stage pipeline like JAFARDD and that the pipeline is ideal to the extent that each instruction takes one clock cycle), this code takes 11 + a setup time of (7 - 1) = 17 cycles and requires a stack height of 4 words. Thus, the Java instructions per clock cycle (JPC) metric for this machine is 0.65. On JAFARDD, three tload instructions are added, bringing the total JVM instruction count to 14. It takes 14 cycles (including pipeline setup time) to execute this code, resulting in a JPC of 1. Thus, JAFARDD's JPC is 54% better than a stack processor's.
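For reference, the code segment of the example, written as a small Java method, together with the bytecode sequence that a typical javac produces for it, is sketched below. The listing is consistent with the 11 JVM instructions (18 bytes) of the trace; the class name is invented, and the constant-pool index of f is written as the pair 0 5 only to match the notation used in Figure 7.1 (actual compiler output shows a single CP index, whose value depends on the class file).

    // Illustrative only: the example statements and their typical bytecodes.
    class Example {
        int f;
        void run() {
            int x;
            x = 5;                 // iconst_5, istore_1
            this.f += (x++) / 4;   // aload_0, dup, getfield 0 5, iload_1, iinc 1 1,
                                   // bipush 4, idiv, iadd, putfield 0 5
        }
    }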

Figure 7.1. A JBC trace example. Steps (a) through (p) show, for each clock cycle, the contents of the FIQ, the BQ, the FT and its buffer, the LVF, the TG, the RS, the EX, and the CDB while the code segment is folded, translated, and executed.

7.3 Speedup Formula

Here we introduce the speedup formula used to assess the performance gain associated with the architectural features introduced in JAFARDD. OPEX bytecode folding, the key feature of JAFARDD, eliminates the CPU clock cycles needed for executing the non-anchor instructions in a folding group and the anchor instructions that were removed by optimization. With this saving, JAFARDD achieves an overall program execution speedup. The most indicative performance evaluation factor, the program execution time, is given by [42, 69]:

CPU execution time = CPI × C × T    (7.1)

where CPI is the average clock cycles per JVM instruction, C is the total dynamic instruction count, and T is the clock period. We can formulate the speedup as:

Speedup = CPU execution time_{Before folding} / CPU execution time_{After folding}    (7.2)

One way to model the effect of folding is to set the individual CPIs (CPI_i) for eliminated stack instructions to zero. This has the advantage of keeping the instruction count (C) in Equation 7.1 unaffected by folding. Furthermore, if we assume the same clock period (T) (for the purpose of comparison), the speedup formula can be reduced to:

Speedup = CPI_{Before folding} / CPI_{After folding}    (7.3)

Thus, we can adopt the average CPI alone as the performance measure. To focus on issues directly related to the OPEX bytecode folding, we assume that JAFARDD is cache-miss free and has an ideal instruction pipeline that can resolve control and data hazards without stalling. The CPI can then be calculated as follows: (1) divide the instruction set into several categories (see Table 5.1); all instructions in a category, say i, have the same individual CPI, say CPI_i; (2) measure the relative frequency, f_i, of executed instructions for each category; and (3) the program average CPI is given by the formula [42, 69]:

CPI = ∑_{i=1}^{n} CPI_i × f_i    (7.4)

We define the following subfractions (relative to the total executed instructions): a_i is the portion of the anchor instructions that are folded; b_i is the portion of the unfolded complex instructions; c_i is the portion of the unfolded non-anchor instructions; d_i is the portion of the non-anchor instructions that are folded; e_i is the portion of the anchor instructions eliminated through optimization. After folding, two of these factors, d_i and e_i, are removed. For a certain instruction class, the prefolding f_i (f_{i,pre}) can be defined in terms of these subfractions as follows:

f_{i,pre} = a_i + b_i + c_i + d_i + e_i    (7.5)

whereas the postfolding f_i (f_{i,post}) can be defined as:

f_{i,post} = a_i + b_i + c_i.    (7.6)

Using these introduced factors, the speedup can be calculated as:

Speedup = (∑_i CPI_i × f_{i,pre}) / (∑_i CPI_i × f_{i,post})    (7.7)

This formula illustrates how much performance improvement can be achieved in JAFARDD through the use of the OPEX bytecode folding algorithm. From the formula, we can easily see that performance could be enhanced by increasing the portion of the non-anchor instructions that are folded and the anchor instructions that are eliminated through optimization.

The above formula assumes a practical pipeline where different instructions have different latencies. Long instruction latencies might lead to pipeline stalls. Now, let us consider a processor model that is more optimized, in which all instructions take one clock cycle. (This could be achieved by employing proper pipelining and latency-hiding techniques, e.g., having multiple pipelined functional units with an efficient OOO scheduler; the Tomasulo's algorithm approach employed by JAFARDD is a good example of this.) In this case, the optimum speedup (Speedup_op) is achieved by setting all CPI_i's to unity:

Speedup_op = 1 + (∑_i (d_i + e_i)) / (∑_i (a_i + b_i + c_i))    (7.8)

This speedup represents the upper bound on the performance improvement that could be obtained in JAFARDD. Further speedup could be achieved by multiple issuing of JBC folding groups.
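As a purely hypothetical illustration of Equation 7.8 (the fractions below are assumed for the arithmetic only, not measured): if folding and optimization together were to eliminate half of all executed instructions, i.e., ∑_i (d_i + e_i) = 0.5 and therefore ∑_i (a_i + b_i + c_i) = 0.5, then

Speedup_op = 1 + 0.5 / 0.5 = 2,

that is, the program would need half as many cycles under the unity-CPI assumption.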

7.4 Experimental Framework

To evaluate the performance of JAFARDD, a trace-driven simulation was used. The results of evaluating JAFARDD using the SPECjvm98 benchmark suite are summarized and explained in the following subsections.

7.4.1 Study Platform

The platform used to perform this study was a Sun Ultra 5/10 with a 270 MHz UltraSPARC-IIi processor and 128 MB of RAM. The OS was Solaris 2.6. The Solaris JDK interpreter version 1.2.2_G5 with native threads from Sun Microsystems was used to generate a dynamic trace.

7.4.2 Trace Processing

This trace was processed using a Perl script to calculate the fractions and subfractions (used in Equations 7.5 and 7.6). These numbers were used to produce the results presented below.
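The Perl script itself is not reproduced in this dissertation. As an illustration only, the kind of counting it performs can be sketched in Java as below, assuming a hypothetical trace format of one opcode mnemonic per line; the category mapping shown is a deliberately simplified stand-in for Table 5.1.

    // Illustrative sketch: compute per-category relative frequencies from a trace file.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class TraceFractions {
        public static void main(String[] args) throws IOException {
            Map<String, Long> counts = new HashMap<>();
            long total = 0;
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String mnemonic;
                while ((mnemonic = in.readLine()) != null) {
                    counts.merge(categoryOf(mnemonic.trim()), 1L, Long::sum);
                    total++;
                }
            }
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                System.out.printf("%-20s f_i = %.4f%n", e.getKey(), (double) e.getValue() / total);
            }
        }

        // Hypothetical, highly simplified category mapping (stand-in for Table 5.1).
        static String categoryOf(String op) {
            if (op.startsWith("iload") || op.startsWith("aload")
                    || op.startsWith("iconst") || op.startsWith("bipush")) return "producer";
            if (op.startsWith("istore") || op.startsWith("astore")) return "consumer";
            if (op.equals("getfield") || op.equals("putfield")) return "load/store anchor";
            return "other";
        }
    }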

7.4.3 Benchmarking

In benchmarking JAFARDD, we wanted to use a standard suite to fairly compare our results with existing folding-based architectures. Currently, almost all Java-related published research uses SPECjvm98. Since it was released by SPEC in late 1998, SPECjvm98 has become the de facto benchmark for Java [223]. Before that, researchers used non-standard benchmarks. Sun Microsystems has used javac and raytracer [92, 95]. Chang et al. have reported the performance of their, as well as Sun Microsystems', folding algorithms using a combination of benchmarks (e.g., javac and CaffeineMark) [203, 204, 205, 206, 207, 208]. For the purpose of validating our results and their consistency with others, we observe that javac and raytracer are included in SPECjvm98 [223]. In addition, in Chapter 2 we experimented with CaffeineMark and obtained results consistent with those produced by SPECjvm98 [86, 87, 88]. Therefore, we are confident that the comparisons between our results and others' findings are valid, consistent, and meaningful.

As shown in Table 7.1, SPECjvm98 includes a set of real-program benchmarks intended to evaluate the performance and efficiency of the JVM client platform. The SPECjvm98 suite tests arithmetic (integer and floating point) and logical instructions, library calls, and I/O, as well as other common features. However, it does not directly exercise the GUI, graphics, or networking components. Each benchmark can run with three different input sizes (classes) known as s1, s10, and s100 [223]. For the purpose of the experiment reported here, we restricted the results to the size s10.

Table 7.1. SPECjvm98 Java benchmark suite summary.

Program Description Instruction count

jess         Java expert system shell based on NASA's CLIPS expert system                                            75 × 10^6
jack         A Java parser generator with lexical analysis, based on the Purdue Compiler Construction Tool Set;
             an early version of what is now called JavaCC                                                          207 × 10^6
check        Simple program to test various JVM features                                                              2 × 10^6
mpegaudio    Decoder to decompress audio files that conform to the ISO MPEG Layer-3 audio specification            1099 × 10^6
mtrt         Dual-threaded raytracer algorithm                                                                      126 × 10^6
compress     Modified Lempel-Ziv method (LZW) to compress/decompress large files                                    862 × 10^6
javac        The JDK 1.0.2 Java compiler                                                                             43 × 10^6
db           Performs multiple database functions on a memory-resident database                                      63 × 10^6
total        The entire SPECjvm98 suite                                                                            2476 × 10^6

7.5 Generated Folding Patterns

JAFARDD recognizes a set of folding patterns with different folding depths. Figure 7.2 divides the executed instructions into groups with respect to whether they are anchors or not. From the figure we can see that the instructions executed the most are the producers, followed by the non-load/non-store anchors. This is consistent over all benchmarks. The high percentage of producers indicates good potential for speedup by folding.

Figure 7.3 shows the percentages of occurrence of the different folding cases (see Table 5.4). Overall, Case 2, which deals with tagging, is executed the most in the majority of the benchmarks. Thus, it can be speculated that runtime performance will be improved by adopting the tagging technique. The second most executed case is Case 4, which deals with anchors that do not need consumers (e.g., branches).


Figure 7.2. Percentages of occurrence of different instruction categories.

These results are reflected in the design of the FT, which translates folding groups into RISC-style instructions. Figure 7.4 further looks into the patterns that are folded the most and their percentages of usage for each benchmark. The variety of folded patterns found in the benchmarks under study indicates the generality of the architecture. From the figure, we see that the most common folding pattern is the binary operator in an IFG (i.e., PPOT). This shows the improvement gained by tagging. Also of noticeable value, especially with tagging, are unary and binary load anchors in IFGs (i.e., PPLT and PUT). The significant percentage of folded load anchor instructions shows the importance of having an LS unit in JAFARDD. Store anchor instructions are not as popular as load anchors, though. Branches are executed often across all benchmarks, which means that the capability of folding them with the necessary producers will significantly enhance performance. Destroyer and duplicator anchors are the least executed and thus the least folded. Also, anchors that do not need any producers are executed rarely. In general, we observe that binary anchors are executed the most. Looking at all benchmarks, we see consistent behavior (except in Jess and Jack).

Figure 7.5 categorizes and counts folding groups by the number of producers and consumers.


Figure 7.3. Percentages of occurrence of different folding cases recognized by the folding information generation (FIG) unit.

We can see from the figure that the binary anchors that need a consumer, i.e., u = 2, v = 1 (whether or not a consumer actually follows the anchor), are executed the most. We can also observe that most of the folded patterns involve binary anchor instructions. This result is also reflected in the design of the FT.

7.6 Analysis of Folding Patterns

Preparing folding patterns to be issued concurrently and handling nested patterns are the main strengths of the OPEX bytecode folding algorithm. In this section, we see how efficiently JAFARDD performs these tasks.

Figure 7.6 shows the percentages of tagged folding patterns relative to all folding patterns recognized by JAFARDD for each benchmark and in total. Percentages within each benchmark range from 10% to 66%, with an absolute average of around 50%. This shows that plenty of folding patterns are issued incomplete with a tagged consumer, and that more patterns take their producers from tagged LVs.


Figure 7.4. Percentages of patterns that are recognized and folded by JAFARDD. Refer to Table 5.1 for the meaning of the symbols in the different patterns. Note that in calculating the total percentages, absolute instruction counts were used.

The main advantage of tagging is that it decouples dependent folding groups, which permits concurrent and/or OOO execution. Without tagging, these considerable percentages of folding groups would have to wait for the completion of the value-producing instruction and would have to be issued in order.

Figure 7.7 shows the percentages of nested patterns recognized in JAFARDD for each benchmark and in total. Percentages range from 23.36% to 42.17%, with an average of 36.30%. This shows how the OPEX bytecode folding algorithm facilitates folding of nested patterns. Other folding techniques fail to recognize outer nested patterns and therefore issue them unfolded, reducing the overall efficiency.


Figure 7.5. Percentages of occurrence of anchors with different numbers of producers (u) and consumers (v).

We also measured the percentages of patterns eliminated by optimization for each benchmark and in total. In general, the optimization patterns do not occur significantly in the studied benchmarks because of the low percentages of swapper, duplicator, and destroyer anchors, which are the ones that can be optimized away in the produced JVM trace. Although JAFARDD recognizes these patterns, other hardware implementations of the OPEX bytecode folding algorithm may choose to ignore them to save realization complexity.

7.7 Performance Enhancements

Figure 7.8 shows the percentages of eliminated instructions (i.e., producers and consumers folded in JAFARDD) with respect to all stack instructions (stack instructions include all producers and consumers) and with respect to all instructions in the benchmarks under study. For all benchmarks, the percentages of eliminated stack instructions are over 93%. Chang et al. report that Sun Microsystems' folding algorithm can fold up to 60% of all stack instructions [22, 32, 92, 95, 154, 204].


Figure 7.6. Percentages of tagged patterns among all folded patterns.

Chang et al.'s 4-foldable strategy can eliminate up to 84% of all stack instructions. Figure 7.9 shows the actual and optimum (with unity CPI_i for all categories) speedup for JAFARDD. We can see that JAFARDD achieves an actual speedup between 1.10 and 1.37 and an optimum speedup between 1.56 and 2.25. Chang et al. report that their 2-, 3-, and 4-foldable strategies result in maximum actual program speedups of 1.22, 1.32, and 1.34, respectively, as compared to a stack machine without folding [203]. So, in addition to the previously mentioned advantages, JAFARDD surpasses other techniques in both the percentage of eliminated instructions and the speedup achieved.

7.8 Processor Modules at Work

Figure 7.10 shows the percentages of occurrence of the different BQM-internal operations (Tables 6.5 and 6.2). Overall, Remove and Augment are executed the most. These operations produce folding groups that are translated naturally to typical RISC instructions. Therefore, these introduced operations mapped efficiently to a RISC processing core.


Figure 7.7. Percentages of nested patterns among all folded patterns.

Figure 7.11 examines the frequency of the different patterns translated by the FT (Table 6.9). The most common folding pattern is the binary operator that needs a consumer, perfectly matching the RISC style. Figure 7.12 shows the percentages of occurrence of the LVF-internal operations (Table 6.12). Overall, Capture and Reserve are executed the most. This demonstrates the strong link between the Java-dependent portion of the pipeline and the reservation station structure. The frequencies of processing different patterns at the LS are shown in Figure 7.13. Load anchors with producers are the most common folding patterns, and these map well onto RISC-style instructions. EX usage is benchmarked in Figure 7.14. Operations that are terminated at the BQM or LVF have a low percentage of occurrence, and most folding groups require access to EXs. This figure suggests that each type of EX is used significantly.


Figure 7.8. Percentages of eliminated instructions relative to all instructions and relative to stack instructions (producers and non-anchor consumers) only.

7.9 Global Picture

JAFARDD executes the JVM stack code without having a physical stack. This means that the functionality of the stack has to be compensated for and/or handled by other modules. Both the OPEX bytecode folding and Tomasulo's algorithms orchestrate this process. In this section we look at the different stack-related functionalities that any hardware supporting Java is required to provide and/or compensate for. We then look at how JAFARDD fulfills these requirements. In other words, we show here how the missing stack features are distributed over JAFARDD's modules. Table 7.2 shows, for each JVM instruction category, the JVM basic hardware requirements, e.g., the stack operation, and whether the LVs, EXs, and D-cache need to be accessed. The table also shows the various modules in JAFARDD that share in processing each of these categories. From the table, it is clear that the JAFARDD architecture compensates for all missing stack operations and surpasses the JVM basic requirements.


Figure 7.9. Speedup of folding. Actual denotes the speedup obtained by using CPI_i's equal to or greater than unity, whereas optimum refers to the optimum case where all CPI_i's are set to unity.

7.10 Alternative Architectures

Besides JAFARDD's approach, there exist a number of hardware approaches that support Java. In this section, the hardware translation approach is compared to two other commonly used Java hardware approaches: direct stack execution of JBCs on a stack machine, and hardware interpretation to native RISC binaries. Sun Microsystems, Patriot Scientific, and others provide processors that directly realize the JVM in hardware [92, 95, 99]. Radhakrishnan et al. proposed using hardware support for JBC interpretation [126]. Their architecture performs the translation of JBCs to native code in hardware, thus eliminating much of the overhead of software translation. Translation is done using a microcode ROM, which contains the translated code for each supported JBC. We selected these two approaches specifically because JAFARDD could gain advantages from each approach while avoiding some of their drawbacks.


Figure 7.10. Percentages of occurrence of different folding operations performed by the bytecode queue manager (BQM).

JAFARDD folds JBCs into groups as proposed in the direct execution approach, but does not execute the JBCs natively. On the other hand, translating JBCs to a native RISC format is similar to the hardware interpretation approach in that a RISC core is used as the main execution core. However, JAFARDD folds the JBCs before translating them and does not employ any form of microcode lookup. Table 7.3 lists key features and options in supporting Java in hardware and compares the three approaches.

Some of these approaches enhance Java performance by eliminating some of the stack virtual data dependency (first approach) or by permitting efficient RISC code execution (second approach). However, both techniques have their own limitations. The direct stack execution approach relies on a Java-specific core, thus restricting the generality of the processor. On the other hand, although Radhakrishnan's proposal has some general advantages, the microcoded control is tightly coupled to the JVM specification. It also involves a microcode ROM lookup of individual JBCs, which may limit speedup gains.


Figure 7.11. Percentages of occurrence of different folding patterns at the output of the folding translator unit (FT).

Considering generality in executing different languages and the independence of the native instructions from the JVM specification as metrics of flexibility, we can conclude that JAFARDD is more flexible than the other two approaches.

7.11 Conclusions

In this chapter, we illustrated the operations of JAFARDD by a detailed example and evaluated its performance via benchmarking. We also presented a global view of the architecture and showed how it compensates for the lack of a stack. Comparison with other relevant proposals shows the flexibility of JAFARDD's architecture.

We have developed a VHDL model of JAFARDD. We use it as a testbed to experiment with the techniques introduced in this dissertation. Results show that the pipeline modules interface and cooperate as expected. Traces of the tags show that the integrated reservation stations and the OPEX bytecode folding algorithm are a promising approach.


Figure 7.12. Percentages of occurrence of different operations performed by the local variable file (LVF).

Moreover, benchmarking the operations of the individual modules strongly manifests the effectiveness of the introduced techniques and the relevance of the incorporated internal operations. The benchmark evaluation shows that the techniques introduced in JAFARDD are useful in speeding up Java execution. The OPEX bytecode folding algorithm speeds up JBC execution by an average of about 1.29. Furthermore, it eliminates an average of about 97% of the stack instructions and an average of about 50% of the overall executed instructions. Benchmark traces show that an average of about 48% of the patterns are tagged and an average of about 36% of the total patterns are nested, which illustrates the performance gain that could be achieved in JAFARDD. The work in this chapter has been published in [65, 198, 199, 200, 201, 210, 211, 212, 213].

The next chapter concludes this dissertation and presents future research directions.


Figure 7.13. Percentages of occurrence of different folding patterns processed by the load/store unit (LS).

Table 7.2. Associating JVM instruction categories with their basic requirements and mapping them to relevant modules in JAFARDD. For each instruction category (no-effect, producer, consumer, operator, independent, destroyer, duplicator, swapper, load, store, and branch), the table gives its effect on the stack, its basic JVM requirements (LVs, EX, D-cache), the JAFARDD modules that share in processing it (BQM, FT, LVF, TG, RS, EX, D-cache, CDB), and the folding translation used (ALU, LS, or BR).

• All instructions require access to the I-cache and the BF.
• Effect on stack: push, duplicate, pop, push and pop, no effect, or swap.
• The D-cache holds object memory and the CP. In the JVM specification, the LV area resides in the D-cache; in JAFARDD it resides in the LVF.


Figure 7.14. Percentages of usage of different execution units (EXs).

Table 7.3. Comparison between the three approaches to supporting Java in hardware.

Feature/option             Direct stack execution   Hardware interpretation   Hardware translation
general or specialized     specialized              general                   general
stack cache included       yes                      no                        no
JBC folding                yes                      no                        yes
employs RISC core          no                       yes                       yes
native language            JVM                      RISC-like                 RISC-like
translation methodology    n/a                      microcoded                hardwired
dynamic scheduling         no                       yes                       yes
ILP exploitation           no                       yes                       yes
flexibility                no                       some                      yes

Chapter 8

Conclusions

8.1 Summary

This dissertation explored the feasibility of accommodating Java execution in the framework of modern processors. We presented the JAFARDD processor, a Java Architecture based on a Folding Algorithm, with Reservation Stations, Dynamic Translation, and Dual Processing, to accelerate JBC execution.

JAFARDD dynamically translates JBCs into RISC instructions to facilitate the utilization of a typical general-purpose RISC core at the back end. This approach enables the exploitation of the ILP among the translated instructions using well-established techniques, and facilitates migration to Java-enabled hardware.

JAFARDD's design was guided by a set of global architectural design principles identified earlier in the research. Towards this end, Table 8.1 shows how each of these design principles was addressed while designing the various JAFARDD modules.

In this conclusions chapter, we detail the contributions of this dissertation (Section 8.2) and highlight some directions for future research (Section 8.3).

8.2 Contributions

This work has made a number of contributions to the improvement of Java execution through hardware support. These contributions are detailed in the following subsections. It is worth mentioning that some of the innovative ideas/techniques presented are not only applicable to Java processing, but could also in general be extended to any processor design.

Table 8.1. How JAFARDD addresses the global architectural design principles.

No.  Global architectural design principle                    Corresponding JAFARDD module and feature
1    Common case support                                      FT: translates only frequently used JBCs. LS: traps non-foldable JBCs to OS routines.
2    Stack dependency resolution                              BQM: folds JBCs as specified by the FIG; extensively and intelligently rearranges JBCs to emulate typical stack operations. LVF: performs internal operations that emulate typical stack operations. RS: processes tagged IFGs.
3    Working data elements storage on-chip                    BQM: included on-chip. LVF: included on-chip.
4    Exploitation of ILP                                      LVF: performs internal operations that exploit ILP among JBCs. RS: provides a venue and mechanism for LV renaming and pipeline hazard handling. EX: multiple units are included.
5    Dynamic scheduling                                       LVF: interacts with Tomasulo's hardware. RS: provides a venue and mechanism for dynamic scheduling. EX: multiple units are included.
6    Utilization of a Java-independent RISC core              FT: dynamically translates folding groups into RISC instructions.
7    Minimal changes to the software                          FT: does not require any JBC preprocessing.
8    Hardware support for object-oriented processing          LS: serves as the gateway for the hardware to reach emulating software routines.
9    Folding information generation off the critical path     FIG: generates folding groups at a rate that surpasses the rate of their consumption.
10   Maintaining a high bytecode folding rate                 BF: an adaptive feedback fetch policy. FIG: the I-cache bus is wide enough to accommodate several folding groups.

8.2.1 Java Workload Characterization

Using a benchmarking-based experimental framework, we conducted a comprehensive behavioral analysis of a typical Java workload. This study collected many aspects of the Java workload behavior from an execution trace, analyzed the statistics obtained, and made detailed recommendations for a Java processor's architectural requirements. We examined access patterns for data types, addressing modes, instruction encoding, instruction set utilization, instruction execution time, method invocation behavior, and the impact of object orientation. The results obtained help identify performance-critical aspects that are candidates for hardware support to narrow the semantic gap between the virtual machine and the native one. This study surpasses other similar ones in terms of the number of aspects studied and the coverage of the recommendations made.

8.2.2 A Methodical Design Space Analysis

We explored different hardware design options and alternatives that are suitable for Java, and their trade-offs. We especially focused on the design methodology, execution engine organization, parallelism exploitation, and support for high-level language features. The pros and cons of each option and alternative were discussed. Using this information, a computer architect could easily infer the best design configurations that meet a target application's specification, cost, and desired performance. The visual representation of the design trees helps identify innovative design ideas that would provide enhanced execution for Java. These ideas include a modified Tomasulo's algorithm, an improved folding algorithm, a dual processing architecture, and others.

8.2.3 The Identification of Global Architectural Design Principles

The behavioral analysis results and the design space exploration ideas were compiled into a list of global architectural design principles. These principles ensure JAFARDD can execute Java efficiently and are taken into consideration as the various instruction pipeline modules are designed. Some of these principles are recommended for addressing JVM performance-hindering features: stack dependency resolution, working data elements storage on-chip, hardware support for object-oriented processing, and folding information generation off the critical path. Other principles aim at achieving better performance: common case support, exploitation of ILP, dynamic scheduling, utilization of a Java-independent RISC core, making minimal changes to the software systems, and maintaining a high bytecode folding rate.

8.2.4 Stack Dependency Resolution

Results gathered about the JVM behavior confirmed that data dependency in stack operations limits the performance of Java processors. Individual stack operations are executed one at a time, thus consuming extra stack access cycles and creating virtual data dependency that limits performance and prohibits any form of ILP.

JAFARDD resolves the stack dependency by a number of novel techniques. We introduced the OPEX bytecode folding algorithm (Subsection 8.2.5) that eliminates the need for a stack. We also designed the processor internal modules in a way that compensates for the lack of the stack. For example, both the on-chip BQ and the LVF support a set of internal operations that emulate the required stack operations. Additionally, we tailored Tomasulo's algorithm for the execution of JBCs (Subsection 8.2.10). These combined features enabled a stackless processor design that is capable of executing Java stack code.

8.2.5 An Operand Extraction Bytecode Folding Algorithm

Stack operation folding has been suggested in the literature to enhance Java performance by grouping contiguous instructions that have true data dependencies into compound instructions. In this work, we extended existing folding algorithms by removing the restriction that the folded instructions must be consecutive. To the best of our knowledge, our folding algorithm is the only one that permits nested pattern folding, tolerates variations in folding groups, and detects and resolves folding hazards. By incorporating it into any Java processor design, the need for, and therefore the limitations of, a stack are eliminated.

Benchmarking shows excellent performance gains as compared to other folding algorithms. The OPEX bytecode folding algorithm speeds up JBC execution by an average of about 1.29. Furthermore, it eliminates an average of about 97% of the stack instructions and an average of about 50% of the overall executed instructions. Benchmark traces show that an average of about 48% of the patterns are tagged and an average of about 36% of the total patterns are nested, which illustrates the performance gain that can be achieved. Issuing tagged patterns decouples the virtual dependency between successive folding groups, whereas folding nested patterns enables recognizing more patterns, which decreases the number of JBCs issued unfolded. Processing both tagged and nested patterns indeed eliminates the need for a stack.

It is worth mentioning that although the OPEX bytecode folding algorithm was presented in the context of a hardware implementation, it can be adopted with minor changes in software implementations. It could be included, for example, in a JIT framework.

8.2.6 An Adaptive Bytecode Fetch Policy

We addressed the issue of variable-length instructions, which complicates JBC fetching and decoding, by an adaptive feedback fetch policy. In this policy, the number of bytecodes fetched every clock cycle is set according to the number of those issued in the previous clock cycle and those being queued. This policy helps keep the fetch rate as high as possible, thus maintaining a high folding rate and ensuring the pipeline is continuously fed.

8.2.7 Dynamic Binary Translation

The statically determinable type state of the JBCs enables simple on the fly translation into efficient native code. We utilized this property to dynamically translate folding groups into decoded RISC instruction formats. This facilitates incorporating a RISC core that could work independently from Java. As a result, no software preprocessing is required for JAFARDD. JBCs generated by typical Java compilers can run directly on it. This approach is able to generate highly optimized native code and narrows the semantic gap between the JVM and the processor, without adding extra semantic contents to the native code.

8.2.8 A RISC Core Driven by a Deep and Dynamic Pipeline

JAFARDD is designed to employ a deep and dynamic pipeline that provides high throughput and efficient mechanisms for parallelism exploitation. This pipeline drives a typical back-end RISC core. This approach executes simple instructions quickly and enables parallelism exploitation among translated instructions using well-established RISC techniques.

8.2.9 On-Chip Local Variable File

Results gathered about the JVM behavior confirmed that manipulating the LVs is one of the critical performance issues. We included the LVs on chip to permit manipulating them as regular RISC registers. The novel LVF provides the venue for adding some JVM-specific internal operations that compensate for the lack of a stack, exploit parallelism among JBCs, and facilitate the interaction with the Tomasulo algorithm's implementation.
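One way to picture such an operation set is sketched below: a small register-file-like LVF whose reads and writes stand in for the pushes and pops that the stack-based bytecodes imply. The size and method names are illustrative assumptions, not the dissertation's module interface.

// A minimal sketch of an on-chip local variable file (LVF) exposing
// register-style reads and writes that replace operand-stack traffic.
class LocalVariableFileSketch {
    private final int[] lv = new int[64];   // assumed number of on-chip LVs

    // Emulates what an iload_x would have pushed: a register-style read.
    int readLV(int index) {
        return lv[index];
    }

    // Emulates what an istore_x would have popped: a register-style write.
    void writeLV(int index, int value) {
        lv[index] = value;
    }

    public static void main(String[] args) {
        LocalVariableFileSketch lvf = new LocalVariableFileSketch();
        // The folded group "iload_1; iload_2; iadd; istore_3" becomes a single
        // read-read-write sequence on the LVF, with no operand stack involved.
        lvf.writeLV(1, 10);
        lvf.writeLV(2, 32);
        lvf.writeLV(3, lvf.readLV(1) + lvf.readLV(2));
        System.out.println(lvf.readLV(3));   // 42
    }
}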

8.2.10 A Modified Tomasulo’s Algorithm

In this work, we tailored Tomasulo's algorithm for use with JVM stack code execution. We used reservation stations as a venue and mechanism for LV renaming, dynamic scheduling, and pipeline hazard handling, which permit a higher degree of parallelism. They allow JBCs to be issued and executed OOO. These three features are also effective tools for stack dependency resolution and eliminate the need for a stack.

One of the salient contributions of our approach is the early assignment of instruction tags. Instead of generating tags at the reservation stations and letting them counterflow in the pipeline to the LVF, a separate tag generation unit is located in the same pipeline stage as the LVF to provide tags earlier in the process. This early tag assignment approach has a number of merits: it saves the overhead of obtaining tags; a new tag is generated every clock cycle in the same pipeline stage as the LVF (which amounts to zero-cycle tag generation); and it eliminates the need for information counterflow in the pipeline (which simplifies the design and speeds up processing).
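A rough behavioral model of the early tag assignment is sketched below: a tag generator located with the LVF hands out a fresh tag each cycle at issue, so the destination LV can be renamed immediately, and the completed result is later matched by a tag broadcast, as in Tomasulo's algorithm. The entry layout, tag space, and method names are assumptions for illustration.

// A minimal sketch of early tag assignment alongside the LVF, assuming a
// simple wrap-around tag space. Entry layouts are illustrative only.
class EarlyTagSketch {
    private int nextTag = 0;
    private static final int TAG_SPACE = 256;   // assumed tag space size

    // LVF entry: a committed value, plus an optional pending tag to wait on.
    static final class LvEntry { int value; Integer pendingTag; }

    private final LvEntry[] lvf = new LvEntry[64];
    EarlyTagSketch() { for (int i = 0; i < lvf.length; i++) lvf[i] = new LvEntry(); }

    // Called in the same pipeline stage as the LVF access: one new tag per
    // cycle, so no counterflow from the reservation stations is needed.
    int issue(int destLV) {
        int tag = nextTag;
        nextTag = (nextTag + 1) % TAG_SPACE;
        lvf[destLV].pendingTag = tag;   // rename the destination LV
        return tag;                     // carried forward with the instruction
    }

    // Broadcast of a completed result, as in Tomasulo's algorithm.
    void writeBack(int tag, int value) {
        for (LvEntry e : lvf) {
            if (e.pendingTag != null && e.pendingTag == tag) {
                e.value = value;
                e.pendingTag = null;
            }
        }
    }

    public static void main(String[] args) {
        EarlyTagSketch p = new EarlyTagSketch();
        int t = p.issue(3);          // rename LV3 at issue time
        p.writeBack(t, 42);          // later, the result arrives by tag broadcast
        System.out.println(p.lvf[3].value);   // 42
    }
}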

8.2.11 Complex Instruction Handling via a Load/Store Unit

The specialized LS handles instructions that manipulate objects or need access to the CP. Execution of such instructions is either very complicated, requires services from the underlying OS, or both. This significant complexity and/or dependency on the underlying OS motivated us to implement them in software. The LS serves as the gateway through which the hardware reaches the emulating software routines for these instructions efficiently. It interacts smoothly with the pipeline organization and the incorporated Tomasulo's algorithm.
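The role of the LS can be pictured as a dispatcher that executes simple memory operations directly and routes complex, OS-dependent bytecodes to software emulation routines. The sketch below is a behavioral illustration under that assumption; the handler interface is invented for the example, while the opcode constants are the standard JVM values.

// A minimal behavioral sketch of a load/store unit routing complex
// bytecodes (object manipulation, constant-pool access) to software
// emulation routines. The handler interface is an illustrative assumption.
import java.util.Map;

class LoadStoreUnitSketch {
    static final int NEW = 0xBB, INVOKEVIRTUAL = 0xB6, GETFIELD = 0xB4;

    interface Handler { void emulate(int operand); }

    // Software routines reached through the LS unit.
    private final Map<Integer, Handler> traps = Map.<Integer, Handler>of(
            NEW,           idx -> System.out.println("emulate new, CP index " + idx),
            INVOKEVIRTUAL, idx -> System.out.println("emulate invokevirtual, CP index " + idx),
            GETFIELD,      idx -> System.out.println("emulate getfield, CP index " + idx));

    void dispatch(int opcode, int operand) {
        Handler h = traps.get(opcode);
        if (h != null) {
            h.emulate(operand);            // complex: handled in software
        } else {
            System.out.println("simple load/store handled in hardware");
        }
    }

    public static void main(String[] args) {
        new LoadStoreUnitSketch().dispatch(0xBB, 7);
    }
}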

8.2.12 A Dual Processing Architecture

Our dual architecture pipeline allows the execution of both Java and non-Java code on the same core. From a programmer's perspective, there are two logical views: a JVM, which is a stack machine supporting all Java features, and a general RISC machine capable of exploiting parallelism and containing features in addition to those required by the JVM. This duality in views maintains the generality of the processor, provides backward compatibility to execute non-Java code, and facilitates the migration to Java-enabled hardware.

8.3 Directions for Future Research

This dissertation has made many significant contributions towards improving Java execution using hardware support. However, there are still many unexplored areas that could potentially benefit from further research. Some of these areas include the following:

• Providing better support for unfoldable bytecodes. JAFARDD emulates JVM complex instructions in software. Further research could aim at folding some of these bytecodes or seeking more efficient support for them. This could provide better performance for method invocations and other performance-critical aspects of the JVM.
• Executing folding groups concurrently and OOO. This would lead to superscalar support for Java programs.
• Investigating the suitability of a VLIW paradigm for JBC execution. Research could be initiated to design the FT module for organizing translated folding groups into VLIW instructions.
• Processing Java and non-Java code concurrently. Although our architecture supports dual processing, it does not permit Java and non-Java code to be interleaved in the pipeline. An advanced scheduler is needed to regulate the flow of both codes into the back end of the pipeline. A concurrent dual processing environment would improve the performance of multitasking/multithreading and I/O-intensive environments.
• Studying the branching behavior of JVM programs. The performance of control flow instructions is well known to have a great impact on overall program execution performance. Therefore, further studies of branch behavior are needed for designing better hardware support for Java.
• Enhancing quick opcode processing. Implementing self-modifying code, resulting from Java's quick opcode processing, is an open research area. Self-modifying code in hardware can be costly. It requires flushing the structures that improve performance, including the pipeline, prefetch buffers, and decoded instruction cache.

• Supporting garbage collection in hardware. A garbage collector mechanism cannot be interrupted; otherwise the list of references and unreferenced memory might be corrupted. Consequently, garbage collection must run atomically, i.e., once it starts, it must run to completion without interruption [224]. This might limit performance, as the application will have to stay idle, which imposes restrictions on real-time processing. Memory consumption can also be a problem in the case of performing an incremental garbage collection. Providing support for a garbage collector helps non-Java languages as well as Java.
• Assisting Java's HLL features. Hardware constructs that could aid Java's high-level features, without affecting its generality, are vital for the design. Other HLLs running on such cores would also benefit from such features.
• Accelerating exception processing. Yet another possibility for improving Java performance could be achieved by faster processing of Java exceptions. Methods used for achieving precise interrupts can be useful here.
• Running Java threads on a simultaneous multithreading architecture. Another promising research direction would be to investigate the suitability of a simultaneous multithreading platform for Java execution.
• Evaluating the design at the device level. The presented design of JAFARDD was performed at the behavioral level. This design needs to be brought down to the device level so that the power consumption, operating speed, and required die area can be estimated. This might trigger further fine tuning of the design to make it suitable for power-aware devices and embedded systems.
• Tuning JAFARDD for embedded systems. Embedded system research is an exciting field. JAFARDD needs to be further tuned to suit such applications. For example, the power and area requirements need to be optimized.
• Exploring the applicability of the presented concepts to other processors. The ideas presented in this work are not only applicable to Java, but could also be extended to other processor designs. For example, the design of network processing units could benefit from the bytecode folding concepts in packet processing.

Bibliography

[1] J. Gosling, "Java intermediate bytecodes," in Proceedings of the ACM SIGPLAN Workshop on Intermediate Representation (IR'95), San Francisco, CA, USA, Jan. 22, 1995, ACM SIGPLAN Notices, pp. 111-118.
[2] M. Campione, K. Walrath, and A. Huml, The Java Tutorial: A Short Course on the Basics, Addison-Wesley, third edition, 2001.
[3] K. Arnold, J. Gosling, and D. Holmes, The Java Programming Language, The Java Series, Addison-Wesley, third edition, 2000.
[4] J. Gosling, B. Joy, G. Steele, and G. Bracha, The Java Language Specification, The Java Series, Addison-Wesley, second edition, June 2000.
[5] J. Gosling and H. McGilton, "The Java language environment," A White Paper, Sun Microsystems, May 1996.
[6] J. Meyer and T. Downing, Java Virtual Machine, O'Reilly and Associates, 1997.
[7] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, The Java Series, Addison-Wesley, second edition, Apr. 1999.
[8] B. Venners, Inside the Java 2 Virtual Machine, McGraw-Hill, 1999.
[9] A. S. Tanenbaum and J. R. Goodman, Structured Computer Organization, Prentice-Hall, fourth edition, 1999.
[10] B.-S. Yang, S.-M. Moon, S. Park, J. Lee, S. Lee, J. Park, Y. C. Chung, S. Kim, K. Ebciôglu, and E. Altman, "LaTTe: a Java VM just-in-time compiler with fast and efficient register allocation," in Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, CA, USA, Oct. 12-16, 1999, pp. 128-138.
[11] M. J. Murdocca and V. P. Heuring, Principles of Computer Architecture, Prentice-Hall, 2000.
[12] B. Case, "Java virtual machine should stay virtual," Microprocessor Report, vol. 10, no. 5, pp. 14-15, Apr. 15, 1996.
[13] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhall, and W.-M. W. Hwu, "Optimizing NET compilers for improved Java performance," IEEE Computer, vol. 30, no. 6, pp. 67-75, June 1997.
[14] T. Cramer, R. Friedman, T. Miller, D. Seberger, R. Wilson, and M. Wolczko, "Compiling Java just in time," IEEE Micro, pp. 36-43, May 1997.

[15] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhall, and W.-M. W. Hwu, "A study of the cache and branch performance issues with running Java on current hardware platforms," in Proceedings of the 42nd IEEE International Computer Conference (IEEE CompCon '97), San Jose, CA, USA, Feb. 23-26, 1997, pp. 211-216.
[16] C.-H. A. Hsieh, J. C. Gyllenhall, and W.-M. W. Hwu, "Java bytecode to native code translation: The Caffeine prototype and preliminary results," in Proceedings of the 29th IEEE/ACM Annual International Symposium on Microarchitecture (MICRO-29), Paris, France, Dec. 2-4, 1996, pp. 90-97.
[17] R. Radhakrishnan, "Improving performance of contemporary programming paradigms," A Ph.D. Research Proposal, Department of Electrical and Computer Engineering, The University of Texas at Austin, Apr. 1999.
[18] A. Murthy, N. Vijaykrishnan, and A. Sivasubramaniam, "How can hardware support Just-in-Time compilation?," in Proceedings of the First Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '99), Austin, TX, USA, Oct. 10, 1999, pp. 15-19.
[19] A. Krall and R. Grad, "CACAO - a 64 bit Java VM just-in-time compiler," Concurrency: Practice and Experience, vol. 9, no. 11, pp. 1017-1030, Nov. 1997.
[20] A. Krall, "Efficient Java VM just-in-time compilation," in Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques (PACT), Paris, France, Oct. 12-18, 1998.
[21] T. Halfhill, "How to soup up Java. Part 1," Byte Magazine, vol. 23, no. 5, pp. 60-74, May 1998.
[22] B. Case, "Implementing the Java virtual machine. Java's complex instruction set can be built in software or hardware," Microprocessor Report, vol. 10, no. 4, pp. 12-17, Mar. 25, 1996.
[23] P. Koopman, Stack Computers: The New Wave, Ellis Horwood, 1989.
[24] C. K. Yuen, "Stack and RISC," Computer Architecture News, vol. 27, no. 1, pp. 3-9, Mar. 1999.
[25] R. Radhakrishnan, J. Rubio, and L. K. John, "Characterization of Java applications at the bytecode level and at Ultra-SPARC machine code levels," in Proceedings of the International Conference on Computer Design (ICCD '99), Austin, TX, USA, Oct. 11-13, 1999, pp. 281-284.
[26] P. Wayner, "How to soup up Java. Part II: Nine recipes for fast, easy Java - a maturing Java brings a wide variety of programming solutions to developers," Byte Magazine, vol. 23, no. 5, pp. 76-80, May 1998.
[27] N. Vijaykrishnan, "Java technology-based : The past, the future," in Pioneers' Progress with picoJava Technology home page, Sun Microsystems, [online document], [Aug. 28, 2002] Available at: http://www.sun.com/microelectronics/picoJava/pioneers/vol3/licensee-PennState.html, Spring 1999.

[28] N. Vijaykrishnan and M. I. Wolczko, Eds., Java Microarchitectures, Kluwer Academic Publishers, Norwell, MA, USA, Apr. 2002.
[29] I. H. Kazi, B. Stanley, D. J. Lilja, A. Verma, and S. Davis, "Techniques for obtaining high performance in Java programs," Tech. Rep. HPPC-99-01, Department of Electrical Engineering and Department of Computer Science, High-Performance Parallel Computing Research Group, University of Minnesota, Minneapolis, MN, USA, 1999.
[30] R. Radhakrishnan, "Microarchitectural techniques to improve performance of Java bytecode execution," A Research Proposal, Department of Electrical and Computer Engineering, The University of Texas at Austin, Apr. 1999.
[31] M. W. El-Kharashi and F. Elguibaly, "Java microprocessors: Computer architecture implications," in Proceedings of the 1997 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM '97), Victoria, BC, Canada, Aug. 20-22, 1997, vol. 1, pp. 277-280.
[32] J. Turley, "Sun reveals first Java processor core," Microprocessor Report, vol. 10, no. 14, pp. 28-31, Oct. 28, 1996.
[33] J. Wharton, B. Case, D. S. Hardin, M. Hopkins, J. Novistsky, and M. Tremblay, "Software or silicon - what's the best route to Java?," Hotchips Panel Session, Aug. 18-20, 1996.
[34] M. Levy, "Java to go: Part 1. Accelerators process byte codes for portable and embedded applications," Microprocessor Report, vol. 15, no. 7, pp. 1, 5-7, Feb. 12, 2001.
[35] M. Levy, "Java to go: Part 2. InSilicon takes Java acceleration to JVXtremes," Microprocessor Report, vol. 15, no. 10, pp. 7-8, Mar. 5, 2001.
[36] M. Levy, "Java to go: Part 3. Chicory system's Java accelerator pours a HotShot," Microprocessor Report, vol. 15, no. 13, pp. 9-11, Mar. 26, 2001.
[37] M. Levy, "Java to go: Part 4. Heterogeneous multiprocessing for Java applications," Microprocessor Report, vol. 15, no. 15, pp. 4-8, June 4, 2001.
[38] M. Levy, "Java to go: The finale," Microprocessor Report, vol. 15, no. 15, pp. 9-15, June 4, 2001.
[39] B. Case, "Java performance advancing rapidly," Microprocessor Report, vol. 10, no. 7, pp. 17-19, May 27, 1996.
[40] J. Bayko, "Great microprocessors of the past and present (V 12.3.0) in John Bayko (Tau)'s Home Page, [online document]," [Aug. 28, 2002] Available at: http://www3.sk.sympatico.ca/jbayko/cpu.html, Mar. 2002.

[41] D. Takahashi, "Java chips make a comeback in Red Herring's Home Page, [online document]," [Aug. 28, 2002] Available at: http://www.redherring.com/industries/2001/0712/1630019763.html, July 12, 2001.
[42] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, third edition, 2002.
[43] K. Sakamura, "A Java-enabled evolution," IEEE Micro, pp. 2-3, July 2001.
[44] Sun Microsystems, Inc., "Sun blazes another trail - introducing the microJava 701 microprocessor," Press Releases, Sun Microsystems, Mountain View, CA, USA, Oct. 1997.
[45] Sun Microsystems, Inc., "The JavaChip processor: Redefining the processor market," A White Paper WPR-0011-03, Sun Microsystems, Nov. 1997.
[46] Sun Microsystems, Inc., "The benefits of the Java technology to microprocessors," A White Paper, Sun Microsystems, Nov. 1997.
[47] Sun Microsystems, Inc., "The burgeoning market for Java processors. Inside the networked future: The unprecedented opportunity for Java systems," A White Paper WPR-96-043, Sun Microsystems, Oct. 1996.
[48] S. H. Leibson, "Jazz joins VLIW juggernaut. CMP and Java as an HDL take system-on-chip design to parallel universe," Microprocessor Report, vol. 14, no. 4, pp. 34-40, Mar. 27, 2000.
[49] H. Ploog, R. Kraudelt, N. Bannow, T. Rachui, F. Golatowski, and D. Timmermann, "A two step approach in the development of a Java silicon machine (JSM) for small embedded systems," in Proceedings of the First Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '99), Austin, TX, USA, Oct. 10, 1999, pp. 45-49.
[50] M. Baentsch, P. Buhler, T. Eirich, F. Horing, and M. Oestreicher, "JavaCard - from hype to reality," IEEE Concurrency, pp. 36-43, Oct. 1999.
[51] R. W. Atherton, "Moving Java to the factory," IEEE Spectrum, pp. 18-23, Dec. 1998.
[52] P. Dibble, "The reality of real-time Java," Computer Design, pp. 70-76, Aug. 1998.
[53] Y. M. Lee, B.-C. Tak, H.-S. Maeng, and S.-D. Kim, "Real-time Java virtual machine for information appliances," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 949-957, Nov. 2000.
[54] C.-M. Lin and T.-F. Chen, "Dynamic memory management for real-time embedded Java chips," in Proceedings of the 7th International Conference on Real-Time Computing Systems and Applications (RTCSA 2000), Cheju Island, South Korea, Dec. 12-14, 2000, pp. 49-56.

[55] S. Hachiya, "Java use in mobile information devices: Introducing JTRON," IEEE Micro, pp. 16-21, July 2001.
[56] S. A. Ito, L. Cairo, and R. P. Jacobi, "Making Java work for microcontroller applications," IEEE Design and Test of Computers, pp. 100-110, Sept. 2001.
[57] M. Berekovic, H. Kloos, and P. Pirsch, "Hardware realization of a Java virtual machine for high performance multimedia applications," Journal of VLSI Signal Processing Systems, vol. 22, no. 1, pp. 31-43, Aug. 1999.
[58] C. E. McDowell, B. R. Montague, M. R. Allen, E. A. Baldwin, and M. E. Montoreano, "JAVACAM: Trimming Java down to size," IEEE Internet Computing, pp. 53-59, May 1998.
[59] R. Grehan, "JavaSoft's embedded specification overdue, but many tool vendors aren't waiting," Computer Design, pp. 14, 16, 18, Apr. 1998.
[60] M. Berekovic, H. Kloos, and P. Pirsch, "Hardware realization of a Java virtual machine for high performance multimedia applications," in Proceedings of the 1997 IEEE Workshop on Signal Processing Systems (SIPS 97) - Design and Implementation, Leicester, UK, Nov. 1997, pp. 479-488.
[61] J. R. Larus and M. D. Hill, "Implementing the Java VM in hardware," Course Notes, Computer Sciences Department, University of Wisconsin-Madison, Sept. 1996.
[62] J. Engel, Programming for the Java Virtual Machine, Addison-Wesley, 1999.
[63] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the AFIPS 1967 Spring Joint Computer Conference, Atlantic City, NJ, USA, Apr. 18-20, 1967, vol. 30, pp. 483-485.
[64] Sparc International, Inc., The Sparc Architecture Manual, Prentice-Hall, 1992.
[65] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A robust stack folding approach for Java processors: An operand extraction-based algorithm," Journal of Systems Architecture, vol. 47, no. 8, pp. 697-726, Dec. 2001.
[66] Sun Microsystems, Inc., "Sun community source licensing for picoJava technology, [online document]," [Aug. 28, 2002] Available at: http://www.sun.com/microelectronics/communitysource/, 1999.
[67] Pendragon Software Corporation, "CaffeineMark 3.0, [online document]," [Aug. 28, 2002] Available at: http://www.webfayre.com/pendragon/cm3, 1997.
[68] D. N. Antonioli and M. Pilz, "Analysis of the Java class file format," Tech. Rep. ifi-98.04, Department of Computer Science, University of Zurich, Apr. 1998.

[69] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, second edition, 1998.
[70] R. Radhakrishnan, N. Vijaykrishnan, L. K. John, A. Sivasubramaniam, J. Rubio, and J. Sabarinathan, "Java runtime systems: Characterization and architectural implications," IEEE Transactions on Computers, vol. 50, no. 2, pp. 131-146, Feb. 2001.
[71] R. Radhakrishnan, N. Vijaykrishnan, L. K. John, and A. Sivasubramaniam, "Architectural issues in Java runtime systems," in Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA-6), Toulouse, France, Jan. 8-12, 2000, pp. 387-398.
[72] R. Radhakrishnan, N. Vijaykrishnan, L. K. John, and A. Sivasubramaniam, "Architectural issues in Java runtime systems," Tech. Rep., Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, 1999.
[73] R. Radhakrishnan, J. Rubio, L. K. John, and N. Vijaykrishnan, "Execution characteristics of Just-In-Time compilers," Tech. Rep. TR-990717-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, July 1999.
[74] T. Li, L. K. John, V. Narayanan, A. Sivasubramaniam, J. Sabarinathan, and A. Murthy, "Using complete system simulation to characterize SPECjvm98 benchmarks," in Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, NM, USA, May 8-11, 2000, pp. 22-33.
[75] A. Barisone, F. Bellotti, R. Berta, and A. DeGloria, "UltraSparc instruction level characterization of Java virtual machine workload," in Workload Characterization for Computer System Design (Digest of the Workshop on Workload Characterization (WWC-99), Oct. 9, 1999, Austin, TX, USA), L. K. John and A. M. G. Maynard, Eds., chapter 1, pp. 1-24, Kluwer Academic Publishers, 2000.
[76] J. M. Bull, L. A. Smith, M. D. Westhead, D. S. Henty, and R. A. Davey, "A methodology for benchmarking Java grande applications," in Proceedings of the ACM 1999 Java Grande Conference, San Francisco, CA, USA, June 12-14, 1999, pp. 81-88.
[77] C. Daly, J. Morgan, J. Power, and J. Waldron, "Platform independent dynamic Java virtual machine analysis: the Java grande forum benchmark suite," in Proceedings of the ACM 2001 Java Grande Conference, Palo Alto, CA, USA, June 1-4, 2001, pp. 106-115.
[78] J.-S. Kim and Y. Hsu, "Analyzing memory reference traces of Java programs," in Workload Characterization for Computer System Design (Digest of the Workshop on Workload Characterization (WWC-99), Oct. 9, 1999, Austin, TX, USA), L. K. John and A. M. G. Maynard, Eds., chapter 1, pp. 25-48, Kluwer Academic Publishers, 2000.
[79] S. Liang and D. Viswanathan, "Comprehensive profiling support in the Java virtual machine," in Proceedings of the 5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99), May 3-7, 1999, pp. 229-240.

[80] Ø. Strøm, A. Klauseie, and E. J. Aas, "A study of dynamic instruction frequencies in byte compiled Java programs," in Proceedings of the 25th EUROMICRO Conference (EUROMICRO '99), Milan, Italy, Sept. 8-10, 1999, vol. 1, pp. 232-235.
[81] M. Thomas, V. Getov, M. Williams, and R. Whitney, "Benchmarking Java on the IBM SP2," in Proceedings of the ACM 1999 Java Grande Conference, San Francisco, CA, USA, June 12-14, 1999.
[82] J. Waldron, "Dynamic bytecode usage by object oriented Java programs," in Proceedings of the Technology of Object-Oriented Languages and Systems, Nancy, France, June 7-10, 1999, p. 246.
[83] J. Kreuzinger, R. Zulauf, A. Schulz, T. Ungerer, M. Pfeffer, U. Brinkschulte, and C. Krakowski, "Performance evaluations and chip-space requirements of a multithreaded Java microcontroller," in Proceedings of the Second Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '00), Austin, TX, USA, Sept. 17, 2000, pp. 32-36.
[84] D. Gregg, J. Power, and J. Waldron, "Benchmarking the Java virtual architecture," in Java Microarchitectures, N. Vijaykrishnan and M. I. Wolczko, Eds., chapter 1, pp. 1-18, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[85] J. Rubio, "Characteristics of Java applications at the bytecode level," M.S. thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, May 1999.
[86] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A quantitative study for Java microprocessor architectural requirements. Part I: Instruction set design," Microprocessors and Microsystems, vol. 24, no. 5, pp. 225-236, Sept. 1, 2000.
[87] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A quantitative study for Java microprocessor architectural requirements. Part II: High-level language support," Microprocessors and Microsystems, vol. 24, no. 5, pp. 237-250, Sept. 1, 2000.
[88] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "Quantitative analysis for Java microprocessor architectural requirements: Instruction set design," in Proceedings of the First Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '99), Austin, TX, USA, Oct. 10, 1999, pp. 50-54.
[89] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "Architectural requirements for Java processors: A quantitative analysis," Tech. Rep. ECE98-5, Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada, Nov. 1998.
[90] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "Multithreaded processors: The upcoming generation for multimedia chips," in Proceedings of the IEEE Symposium on Advances in Digital Filtering and Signal Processing (DFSP '98), Victoria, BC, Canada, June 5-6, 1998, pp. 111-115.

[91] D. Sima, T. Fountain, and P. Kacsuk, Advanced Computer Architecture: A Design Space Approach, Addison-Wesley, 1997.
[92] H. McGhan and J. M. O'Connor, "picoJava: A direct execution engine for Java bytecode," IEEE Computer, vol. 31, no. 10, pp. 22-30, Oct. 1998.
[93] S. Hangel and M. O'Connor, "Performance analysis and validation of the picoJava processor," IEEE Micro, pp. 66-72, May 1999.
[94] F. Azhari, "The picoJava processor core," A presentation at JavaOne 1999, June 13-18, 1999.
[95] J. M. O'Connor and M. Tremblay, "picoJava-I: The Java virtual machine in hardware," IEEE Micro, pp. 45-53, Mar. 1997.
[96] J. Turley, "microJava pushes bytecode performance. Sun's microJava 701 based on new generation of picoJava core," Microprocessor Report, vol. 12, no. 2, pp. 9-12, Nov. 17, 1997.
[97] P. Wayner, "Sun gambles on Java chips: Are Java chips better than general purpose CPUs? Or will new compilers make them obsolete?," Byte Magazine, vol. 21, no. 11, pp. 79, 80, 82, 84, 86, 88, Nov. 1996.
[98] M. Levy, "DCT marches into Java processors. Lightfoot and Bigfoot processors offer new twist to Java execution," Microprocessor Report, vol. 16, no. 4, pp. 14-22, Jan. 28, 2002.
[99] Patriot Scientific Corporation, "PTSC Java accelerator chip, [online document]," [Aug. 16, 2000] Available at: http://www.ptsc.com/psc1000.
[100] J. Turley, "New embedded CPU goes ShBoom," Microprocessor Report, vol. 10, no. 5, pp. 1, 6-10, Apr. 15, 1996.
[101] D. S. Hardin, "Crafting a Java virtual machine in silicon," IEEE Instrumentation and Measurement Magazine, vol. 4, pp. 54-56, Mar. 2001.
[102] S. Dey, D. Panigrahi, L. Chen, C. N. Taylor, K. Sekar, and P. Sanchez, "Using a soft core in a SOC design: Experiences with picoJava," IEEE Design and Test of Computers, pp. 60-71, July 2000.
[103] A. Kim and M. Chang, "Designing a Java microprocessor core using FPGA technology," Computing and Control Engineering Journal, vol. 11, no. 3, pp. 135-141, June 2000.
[104] A. Kim and M. Chang, "Designing a Java microprocessor core using FPGA technology," in Proceedings of the 11th Annual IEEE International ASIC Conference, Rochester, NY, USA, Sept. 13-16, 1998, pp. 13-17.

[105] S. Kimura, H. Kida, K. Takagi, T. Abematsu, and K. Watanabe, "An application specific Java processor with reconfigurabilities," in Proceedings of the Asia and South Pacific Design Automation Conference 2000 (ASP-DAC 2000), Yokohama, Japan, Jan. 25-28, 2000, pp. 25-26.
[106] J. M. P. Cardoso and H. C. Neto, "Macro-based hardware compilation of Java bytecodes into a dynamic reconfigurable computing system," in Proceedings of the 7th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, Apr. 20-23, 1999, pp. 2-11.
[107] T. Halfhill, "JSTAR coprocessor accelerates Java," Microprocessor Report, vol. 14, no. 4, pp. 29-33, Mar. 27, 2000.
[108] K. B. Kent and M. Serra, "Hardware architecture for Java in a hardware/software co-design of the virtual machine," in To Appear in the Proceedings of the EUROMICRO Symposium on Digital System Design (DSD '2002), Dortmund, Germany, Sept. 4-6, 2002.
[109] K. B. Kent and M. Serra, "Context switching in a hardware/software co-design of the Java virtual machine," in Proceedings of the Designer's Forum of Design Automation and Test in Europe (DATE) 2002, Paris, France, Mar. 4-6, 2002, pp. 81-86.
[110] K. B. Kent and M. Serra, "Hardware/software co-design of a Java virtual machine," in Proceedings of the 11th IEEE International Workshop on Rapid System Prototyping (RSP 2000), Paris, France, June 21-23, 2000, pp. 66-71.
[111] Sun Microsystems, Inc., "MAJC architecture tutorial [online document]," A White Paper WPR-0064-01, [Aug. 28, 2002] Available at: http://www.sun.com/processors/whitepapers/majctutorial.pdf, Sept. 1999.
[112] Sun Microsystems, Inc., "MAJC architecture features [online document]," A White Paper, [Aug. 28, 2002] Available at: http://www.sun.com/processors/whitepapers/majcintro.html.
[113] Sun Microsystems, Inc., "MAJC 5200 product bulletin in MAJC home page, Sun Microsystems, [online document]," [Aug. 28, 2002] Available at: http://www.sun.com/processors/MAJC/03.SME.125ProdBul.pdf.
[114] Sun Microsystems, Inc., "A high performance microprocessor for multimedia computing [online document]," A White Paper, [Aug. 28, 2002] Available at: http://www.sun.com/processors/whitepapers/5200wp.html.
[115] Sun Microsystems, Inc., "MAJC processor family in MAJC home page, Sun Microsystems, [online document]," [Aug. 28, 2002] Available at: http://www.sun.com/processors/MAJC/.
[116] S. Sudharsanan, "MAJC-5200: A high performance microprocessor for multimedia computing," in Proceedings of the 15th Workshop on Parallel and Distributed Processing (IPDPS 2000), Cancun, Mexico, May 2-4, 2000, Lecture Notes in Computer Science, Springer-Verlag, Berlin (LNCS 1800), pp. 163-170.

[117] M. Tremblay, J. Chan, S. Chaudhry, A. W. Conigliaro, and S. S. Tse, "The MAJC architecture: A synthesis of parallelism and scalability," IEEE Micro, pp. 12-25, Nov. 2000.
[118] M. Tremblay, "Wirespeed multimedia computing," A Keynote Presentation at Media Processors 2000, an IS&T/SPIE Conference, Jan. 27-28, 2000.
[119] M. Tremblay, "MAJC-5200: A VLIW convergent MPSOC," A Presentation at Microprocessor Forum, Oct. 4-8, 1999.
[120] M. Tremblay, "MAJC: An architecture for the new millennium," in Proceedings of the Hot Chips 11 Symposium, Palo Alto, CA, USA, Aug. 15-17, 1999, pp. 275-288, A Tutorial.
[121] T. Halfhill, "Sun reveals secrets of "magic". New MAJC architecture has VLIW, chip multiprocessing up its sleeve," Microprocessor Report, vol. 13, no. 11, pp. 13-17, Aug. 23, 1999.
[122] L. Gwennap, "MAJC gives VLIW a new twist: New Sun instruction set is powerful but simpler than IA-64," Microprocessor Report, vol. 13, no. 12, pp. 12-15, Sept. 13, 1999.
[123] B. Case, "Sun makes MAJC with mirrors. Dual on-chip mirror-image processor cores cooperate for high performance," Microprocessor Report, vol. 13, pp. 18-21, Oct. 25, 1999.
[124] P. Sriram, S. Sudharsanan, and A. Gulati, "MPEG-2 video decompression on a multiprocessing VLIW microprocessor," in Proceedings of the IEEE International Conference on Consumer Electronics (ICCE 2000), Los Angeles, CA, USA, June 13-15, 2000, pp. 483-485.
[125] D. Pountain, "StrongARM tactics," Byte Magazine, vol. 21, no. 1, pp. 153-154, Jan. 1996.
[126] R. Radhakrishnan, R. Bhargava, and L. K. John, "Improving Java performance using hardware translation," in Proceedings of the International Conference on Supercomputing (ICS '01), Sorrento, Italy, June 18-21, 2001, pp. 427-439.
[127] R. Radhakrishnan, Microarchitectural Techniques to Enable Efficient Java Execution, Ph.D. thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, Aug. 2000.
[128] R. Radhakrishnan and L. John, "A decoupled translate execute (DTE) architecture to improve performance of Java execution," in Proceedings of the First Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '99), Austin, TX, USA, Oct. 10, 1999, pp. 25-29.

[129] J. Glossner and S. Vassiliadis, "The Delft-Java engine: Microarchitecture and Java acceleration," in Java Microarchitectures, N. Vijaykrishnan and M. I. Wolczko, Eds., chapter 6, pp. 105-121, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[130] C. J. Glossner, The Delft-Java Engine, Ph.D. thesis, Delft University of Technology, Delft, The Netherlands, Nov. 2001.
[131] J. Glossner and S. Vassiliadis, "Delft-Java dynamic translation," in Proceedings of the 25th EUROMICRO Conference (EUROMICRO '99) Informatics: Theory and Practice for The New Millennium, Milan, Italy, Sept. 8-10, 1999, vol. 1, pp. 57-62.
[132] C. J. Glossner and S. Vassiliadis, "Delft-Java link translation buffer," in Proceedings of the 24th EUROMICRO Conference (EUROMICRO '98), Vasteras, Sweden, Aug. 25-27, 1998, vol. 1, pp. 221-228.
[133] C. J. Glossner and S. Vassiliadis, "The Delft-Java engine: An introduction," in Proceedings of the Third International Euro-Par Conference (Euro-Par '97 Parallel Processing), Passau, Germany, Aug. 26-29, 1997, Lecture Notes in Computer Science, Springer-Verlag, Berlin (LNCS 1300), pp. 766-770.
[134] T. Halfhill, "Imsys hedges bets on Java: Rewritable-microcode chip has instruction sets for Java, Forth, C/C++," Microprocessor Report, vol. 15, no. 4, pp. 1-4, Aug. 14, 2000.
[135] T. Halfhill, "GP1000 has rewritable microcode. Imsys processor executes Java bytecodes concurrent microcode processes," Microprocessor Report, vol. 13, no. 2, pp. 14-17, Dec. 28, 1998.
[136] Imsys AB, "Low cost and low power solutions for embedded Java applications, [online document]," [Aug. 28, 2002] Available at: http://www.javamachine.com/homeeng.htm.
[137] T. Halfhill, "Embedded Java chips get real: Bytecode-native aJ-100 handles real-time processing," Microprocessor Report, vol. 15, no. 2, pp. 31-36, Aug. 7, 2000.
[138] aJile Systems, Inc., "aJ-100 reference manual version 2.0," Manual, aJile Systems, Nov. 2000.
[139] J. Turley, "Embedded news: Rockwell unearths Java JEM," Microprocessor Report, vol. 11, pp. 10-11, Oct. 15, 1997.
[140] R. Radhakrishnan, L. K. John, R. Bhargava, and D. Talla, "Improving Java performance in embedded and general-purpose processors," in Java Microarchitectures, N. Vijaykrishnan and M. I. Wolczko, Eds., chapter 5, pp. 79-104, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[141] J. Turley, "Most significant bits: Siemens nabs Java for smart cards," Microprocessor Report, vol. 11, no. 11, pp. 4-5, 9, Aug. 4, 1997.
[142] Ø. Strøm, VLSI Realization of an Embedded Microprocessor Core with Support for Java Instructions, Ph.D. thesis, Norwegian University of Science and Technology, Trondheim, Norway, Oct. 2000.

[143] Ø. Strøm and E. J. Aas, "A novel microprocessor architecture for executing byte compiled Java code," in Proceedings of the International Conference on Chip Design Automation (ICDA 2000), a Part of the 16th IFIP World Computer Congress, Aug. 22-24, 2000.
[144] Ø. Strøm and E. J. Aas, "A novel approach for executing bytecode compiled Java code in hardware," IEEE Computer Society's Technical Committee on Computer Architecture (TCCA) Newsletter, vol. 28, no. 2, pp. 21-23, June 2000.
[145] Sun Microsystems, Inc., "picoJava-II programmer's reference," Manual, Sun Microsystems, Mountain View, CA, USA, Mar. 1999.
[146] Sun Microsystems, Inc., "picoJava-II microarchitecture guide," Manual, Sun Microsystems, Mountain View, CA, USA, Mar. 1999.
[147] Sun Microsystems, Inc., "microJava-701 Java processor," Data Sheet, Sun Microsystems, Mountain View, CA, USA, Jan. 1998.
[148] Sun Microsystems, Inc., "microJava-701 Java processor," User Guide, Sun Microsystems, Mountain View, CA, USA, Jan. 1998.
[149] Sun Microsystems, Inc., "picoJava-II Java processor core," Data Sheet, Sun Microsystems, Mountain View, CA, USA, Apr. 1998.
[150] Sun Microsystems, Inc., "picoJava-I Java processor core," Data Sheet, Sun Microsystems, Mountain View, CA, USA, Dec. 1997.
[151] M. Tremblay and J. M. O'Connor, "picoJava: A hardware implementation of the Java virtual machine," Hotchips Presentation, Aug. 18-20, 1996.
[152] N. Vijaykrishnan, Issues in the Design of a Java Processor Architecture, Ph.D. thesis, College of Engineering, University of South Florida, Tampa, FL, USA, July 1998.
[153] N. Vijaykrishnan, N. Ranganathan, and R. Gadekarla, "Object-oriented architectural support for a Java processor," in Proceedings of the 12th European Conference on Object-Oriented Programming (ECOOP '98), New York, NY, USA, July 20-24, 1998, Lecture Notes in Computer Science, Springer-Verlag, Berlin (LNCS 1445), pp. 330-354.
[154] Sun Microsystems, Inc., "picoJava-I microprocessor core architecture," A White Paper WPR-0014-01, Sun Microsystems, Oct. 1996.
[155] M. Lentczner, "Java's virtual world," Microprocessor Report, vol. 10, no. 4, pp. 8-11, 17, Mar. 25, 1996.
[156] Sun Microsystems, Inc., "Sun microelectronics' picoJava-I posts outstanding performance," A White Paper WPR-0015-01, Sun Microsystems, Oct. 1996.

[157] K. Scott and K. Skadron, "BLP: Applying ILP techniques to bytecode execution," in Proceedings of the Second Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '00), Austin, TX, USA, Sept. 17, 2000, pp. 27-31.
[158] K. Scott and K. Skadron, "BLP: Applying ILP techniques to bytecode execution," Tech. Rep. CS-2000-05, Department of Computer Science, University of Virginia, Charlottesville, VA, USA, Feb. 2000.
[159] R. Radhakrishnan, D. Talla, and L. K. John, "Allowing for ILP in an embedded Java processor," in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-2000), Vancouver, BC, Canada, June 10-14, 2000, pp. 294-305.
[160] R. Radhakrishnan, J. Rubio, and L. John, "Instruction level parallelism in Java," Tech. Rep. TR-990719-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, 1999.
[161] Y. Li, S. Li, X. Wang, and W. Chu, "JAViR - exploiting instruction level parallelism for Java machine by using virtual registers," in Proceedings of the Second European IASTED International Conference on Parallel and Distributed Systems, Vienna, Austria, July 1-3, 1998.
[162] N. Vijaykrishnan and N. Ranganathan, "Tuning branch predictors to support virtual method invocation in Java," in Proceedings of the 5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99), May 3-7, 1999, pp. 217-228.
[163] T. Li and L. K. John, "OS-aware prediction: Alleviating user/kernel branch aliasing in Java processing," Tech. Rep. TR-010720-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, July 2001.
[164] T. Li, L. K. John, and N. Vijaykrishnan, "Understanding control flow transfer and its predictability in Java processing," Tech. Rep. TR-010108-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, Jan. 2001.
[165] T. Li, L. K. John, and N. Vijaykrishnan, "Improving branch predictability in Java processing," Tech. Rep. TR-010215-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, Feb. 2001.
[166] T. Li and L. K. John, "Understanding control flow transfer and its predictability in Java processing," in Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Tucson, AZ, USA, Nov. 4-6, 2001, pp. 65-76.
[167] T. Li, L. K. John, N. Vijaykrishnan, and A. Sivasubramaniam, "Branch behavior of Java runtime systems and its microarchitectural," Tech. Rep. TR-000625-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, June 2000.

[168] A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro, pp. 27-32, Mar. 1997.
[169] J. A. Fisher, "Very long instruction word architectures and ELI-512," in Proceedings of the 10th Annual International Symposium on Computer Architecture (ISCA-1983), Stockholm, Sweden, June 13-17, 1983, pp. 140-150.
[170] B. R. Rau, "Dynamically scheduled VLIW processors," in Proceedings of the 26th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-26), Austin, TX, USA, Dec. 1-3, 1993, pp. 80-92.
[171] K. Buchenrieder, R. Kress, A. Pyttel, A. Sedlmeier, and C. Veith, "Scalable processor architecture for Java with explicit thread support," Electronics Letters, vol. 33, no. 18, pp. 1532-1534, June 19, 1997.
[172] I. Watson, G. Wright, and A. El-Mahdy, "VLSI architecture using lightweight threads (VAULT)," in Proceedings of the First Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '99), Austin, TX, USA, Oct. 10, 1999, pp. 40-44.
[173] K. Watanabe, W. Chu, and Y. Li, "Exploiting Java instruction/thread level parallelism with horizontal multithreading," in Proceedings of the 6th Australian Computer Systems Architecture Conference (ACSAC 2001), Gold Coast, Queensland, Australia, Jan. 29-Feb. 4, 2001, pp. 122-129.
[174] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and T. Ungerer, "A multithreaded Java microcontroller for thread-oriented real-time event-handling," in Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, CA, USA, Oct. 12-16, 1999, pp. 34-39.
[175] D. S. Hardin, A. P. Mass, M. H. Masters, and N. M. Mykris, "An efficient hardware implementation of Java bytecodes, threads, and processes for embedded and real-time applications," in Java Microarchitectures, N. Vijaykrishnan and M. I. Wolczko, Eds., chapter 3, pp. 41-54, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[176] Y. Luo and L. K. John, "Workload characterization of multithreaded Java servers," Tech. Rep. TR-010815-01, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, Aug. 2001.
[177] C.-M. Chung and S.-D. Kim, "A dualthreaded Java processor for Java multithreading," in Proceedings of the 1998 International Conference on Parallel and Distributed Systems (ICPADS '98), Tainan, Taiwan, Dec. 14-16, 1998, pp. 693-700.
[178] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen, "Simultaneous multithreading: A platform for next-generation processors," IEEE Micro, pp. 12-19, Sept. 1997.

[179] R. Espasa and M. Valero, "Exploring instruction- and data-level parallelism," IEEE Micro, pp. 20-27, Sept. 1997.
[180] J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen, "Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading," ACM Transactions on Computer Systems, vol. 15, no. 3, pp. 322-354, Aug. 1997.
[181] K. S. Loh and W. F. Wong, "Multiple context multithreaded superscalar processor architecture," Journal of Systems Architecture, vol. 46, no. 3, pp. 243-258, Jan. 2000.
[182] E. B. Fernandez and T. Lang, Software-Oriented Computer Architecture, IEEE Computer Society Press, 1986.
[183] IEEE, "IEEE trial-use standard for extended high-level languages implementations for microprocessors," IEEE Standard 755, 1985.
[184] K. W. Ng and K. Y. Mok, "The high level language and operating system support features of advanced microprocessors. Part I: High level language support features," Microprocessing and Microprogramming, vol. 19, pp. 203-218, 1987.
[185] K. W. Ng, "The high level language and operating systems support features of advanced microprocessors. Part II: Operating system support features," Microprocessing and Microprogramming, vol. 19, pp. 277-289, 1987.
[186] V. Milutinovic, Advances in Computer Architecture, John Wiley and Sons, second edition, 1982.
[187] V. Milutinovic, Advanced Microprocessors and High-Level Language Computer Architecture, IEEE Computer Society Press, 1986.
[188] N. Vijaykrishnan and N. Ranganathan, "Supporting object accesses in a Java processor," IEE Proceedings - Computers and Digital Techniques, vol. 147, no. 6, pp. 435-443, Nov. 2000.
[189] N. Vijaykrishnan and N. Ranganathan, "Object addressing support for a Java processor," in Proceedings of the 6th International Conference on Advanced Computing (ADCOM '98), Pune, India, Dec. 14-16, 1998, pp. 61-67.
[190] T. Ogasawara, H. Komatsu, and T. Nakatani, "A study of exception handling and its dynamic optimization in Java," in Proceedings of the OOPSLA '01 Conference on Object-Oriented Programming Systems, Languages and Applications, Tampa Bay, FL, USA, Oct. 14-18, 2001, pp. 83-95.
[191] A. Thekkath and H. M. Levy, "Hardware and software support for efficient exception handling," in Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), San Jose, CA, USA, Oct. 4-7, 1994, pp. 110-119.

[192] J. Cohen, "Garbage collection of linked data structures," ACM Computing Surveys, vol. 13, no. 3, pp. 341-367, Sept. 1981.
[193] O. Agesen, D. Detlefs, and J. E. B. Moss, "Garbage collection and local variable type-precision and liveness in Java virtual machines," in Proceedings of the Programming Language Design and Implementation, Montreal, Canada, June 17-19, 1998, pp. 269-279.
[194] G. Chen, R. Shetty, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and M. Wolczko, "Tuning garbage collection in an embedded Java environment," in Proceedings of the 8th International Symposium on High-Performance Computer Architecture (HPCA-8), Cambridge, MA, USA, Feb. 2-6, 2002.
[195] A. Berlea, S. Cotofana, I. Athanasiu, J. Glossner, and S. Vassiliadis, "Garbage collection for the Delft Java processor," in Proceedings of the 18th IASTED International Conference on Applied Informatics (AI '2000), Innsbruck, Austria, Feb. 14-17, 2000, pp. 232-238.
[196] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "Hardware adaptations for Java: A design space approach," Tech. Rep. ECE99-1, Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada, Jan. 1999.
[197] R. M. Tomasulo, "An efficient algorithm for exploiting multiple arithmetic units," IBM Journal of Research and Development, vol. 11, no. 1, pp. 25-33, Jan. 1967.
[198] M. W. El-Kharashi, F. Gebali, and K. F. Li, "Stack dependency resolution for Java processors based on hardware folding and translation: A bytecode processing analysis," in Java Microarchitectures, N. Vijaykrishnan and M. I. Wolczko, Eds., chapter 4, pp. 55-77, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[199] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "Adapting Tomasulo's algorithm for bytecode folding based Java processors," ACM Computer Architecture News, pp. 1-8, Dec. 2001.
[200] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A Java processor architecture with bytecode folding and dynamic scheduling," in Proceedings of the 2001 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM '01), Victoria, BC, Canada, Aug. 26-28, 2001, vol. 1, pp. 307-310.
[201] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A Java processor architecture based on hardware bytecode folding and translation, and Tomasulo's algorithm with reservation stations," Tech. Rep. ECE01-5, Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada, July 2001.
[202] W. Munsil and C.-J. Wang, "Reducing stack usage in Java bytecode execution," Computer Architecture News, vol. 26, no. 1, pp. 7-11, Mar. 1998.

[203] L.-C. Chang, L.-R. Ton, M.-F. Kao, and C.-P. Chung, "Stack operations folding in Java processors," IEE Proceedings - Computers and Digital Techniques, vol. 145, no. 5, pp. 333-340, Sept. 1998.
[204] H.-M. Tseng, L.-C. Chang, L.-R. Ton, M.-F. Kao, S.-S. Shang, and C.-P. Chung, "Performance enhancement by instruction folding strategies of a Java processor," in Proceedings of the International Conference on Computer Systems Technology for Industrial Applications - Internet and Multimedia (CSIA '97), Hsinchu, Taiwan, Apr. 23-25, 1997, pp. 286-293.
[205] L.-R. Ton, L.-C. Chang, M.-F. Kao, H.-M. Tseng, S.-S. Shang, R.-L. Ma, D.-C. Wang, and C.-P. Chung, "Instruction folding in Java processor," in Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS '97), Seoul, Korea, Dec. 11-13, 1997, pp. 138-143.
[206] L.-R. Ton, L.-C. Chang, and C.-P. Chung, "Exploiting Java bytecode parallelism by enhanced POC folding model," in Proceedings of the 6th International Euro-Par Conference (Euro-Par 2000 Parallel Processing), Munich, Germany, Aug. 29-Sept. 1, 2000, Lecture Notes in Computer Science, Springer-Verlag, Berlin (LNCS 1900), pp. 994-997.
[207] A. Kim and M. Chang, "Advanced POC model-based Java instruction folding mechanism," in Proceedings of the 26th EUROMICRO Conference (EUROMICRO '00), Maastricht, The Netherlands, Sept. 5-7, 2000, vol. 1, pp. 332-338.
[208] A. Kim and M. Chang, "An advanced instruction folding mechanism for a stackless Java processor," in Proceedings of the International Conference on Computer Design (ICCD '00), Austin, TX, USA, Sept. 17-20, 2000, pp. 565-566.
[209] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
[210] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A new methodology for stack operations folding for Java microprocessors," in High Performance Computing Systems and Applications, N. J. Dimopoulos and K. F. Li, Eds., chapter 17, pp. 235-251, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[211] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "An operand extraction-based stack folding algorithm for Java processors," in Proceedings of the Second Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, held in conjunction with the International Conference on Computer Design (ICCD '00), Austin, TX, USA, Sept. 17, 2000, pp. 22-26.
[212] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "A novel approach for stack operations folding for Java processors," IEEE Computer Society's Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 104-107, Sept. 2000.
[213] M. W. El-Kharashi, F. Elguibaly, and K. F. Li, "Generalized stack folding for Java processors: An operand extraction-based algorithm," Tech. Rep. ECE00-2, Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada, July 2000.

[214] W.-M. Hwu and Y. Patt, "HPSm, a high performance restricted data flow architecture having minimal functionality," in Proceedings of the 13th Annual International Symposium on Computer Architecture (ISCA-1986), Tokyo, Japan, June 2-5, 1986, pp. 297-306.
[215] G. S. Sohi, "Instruction issue logic for high performance, interruptible, multiple functional unit, pipelined computers," IEEE Transactions on Computers, vol. 39, no. 3, pp. 349-359, Mar. 1990.
[216] A. R. Peszkun and G. S. Sohi, "The performance potential of multiple functional unit processors," in Proceedings of the 15th Annual International Symposium on Computer Architecture (ISCA-1988), Honolulu, Hawaii, USA, May 30-June 2, 1988, pp. 37-44.
[217] G. S. Sohi and S. Vajapeyam, "Instruction issue logic for high performance, interruptable pipelined processors," in Proceedings of the 14th Annual International Symposium on Computer Architecture (ISCA-1987), Pittsburgh, PA, USA, June 2-5, 1987, pp. 27-34.
[218] S. Weiss and J. E. Smith, "Instruction issue logic in pipelined supercomputers," IEEE Transactions on Computers, vol. 33, no. 11, pp. 1013-1022, Nov. 1984.
[219] P. N. Golla and E. C. Lin, "A dynamic scheduling logic for exploiting multiple functional units in single chip multithreaded architectures," in Proceedings of the 1999 ACM Symposium on Applied Computing, San Antonio, TX, USA, Feb. 29-Mar. 2, 1993, pp. 466-473.
[220] D. Sima, "Superscalar instruction issue," IEEE Micro, pp. 28-39, Sept. 1997.
[221] D. Sima, "The design space of register renaming techniques," IEEE Micro, pp. 70-83, Sept. 2000.
[222] D. Sima, "The design space of shelving," Journal of Systems Architecture, vol. 45, no. 11, pp. 863-885, May 1999.
[223] Standard Performance Evaluation Corporation (SPEC), "SPECjvm98 benchmark (V 1.04) in SPEC web site, [online document]," [Aug. 28, 2002] Available at: http://www.spec.org/osg/jvm98, Feb. 2001.
[224] P. R. Wilson, "Uniprocessor garbage collection techniques," in Proceedings of the 1992 International Workshop on Memory Management, St. Malo, France, Sept. 16-18, 1992, pp. 90-97.

Appendix A

Folding Notation

To ensure uniformity in this dissertation, the following folding notation is used (unless otherwise mentioned):

• Java programs are compiled into an intermediate representation targeting the JVM. Each JVM instruction is structured as a series of bytes called Java bytecodes (JBCs), or bytecodes for short, consisting of a one-byte opcode followed by zero or more arguments. Some opcodes could be prefixed with the one-byte wide modifier. The JVM instruction iadd is an example of an opcode that has no arguments. Another example of a JVM instruction is wide iinc 92 300. This instruction consists of the wide modifier, the opcode iinc, and the two arguments 92 and 300, each stored in two bytes, giving a total of six bytecodes.
• The terms operands and arguments are used differently here. Operands are values popped from the stack, whereas arguments are taken from the JVM instruction.
• JVM instructions are described using the notation:

[modifier] mnemonic argument : length [argument : length ...]

Length is in bytes.
• A_{u,v} describes an anchor instruction, where u (u ∈ {0, 1, 2, 3}) describes the number of required operands that are to be popped from the stack (if folding is not applied) and v (v ∈ {0, 1}) describes the number of results that are to be pushed onto the stack (if folding is not applied). v indicates whether a consumer is required for the instruction (when v = 1) or not (when v = 0). When a certain type of anchor could have multiple values for u or v, we will use the set notation [i, j] to denote the set of possible values.

• In the expression P → A_{[1,2]}, P's count (1 or 2, as in [1,2]) is equal to the number of operands required for A, as indicated by the first suffix of A ([1,2]).
• The symbol K_x^l(j) is used to describe instruction categories. It denotes a subcategory l of instruction category K that takes a parameter j (if applicable) and targets LV x or CN x. Symbols without suffixes denote all category members.
• In this dissertation, P is used to refer to both tagged and untagged producers, whereas Q is used to refer to tagged producers only. Similarly, C is used to refer to both tagged and untagged consumers, whereas T is used to refer to tagged consumers only.
• X means don't care.
• n/a means not applicable.
• z > 0, z' < z + 1.
• l means a or b.
• — indicates a field that is not used.
• u → v denotes that u implies v.
• Folding group fields are described using the notation:

(operand : type : length) [, (operand : type : length), ...] → result

Length is in bytes. X means any data type can be used. Some field manipulation functions will be used: (1) Z(v, t, l, f) zero-extends v of type t and length l bytes to f bytes; (2) S(v, t, l, f) sign-extends v of type t and length l bytes to f bytes; and (3) T(v, t, l, f) truncates v of type t and length l bytes to f bytes. Although JVM constants are to be pushed as integers (4 bytes) onto the stack, our design only extends them as 2-byte integers to fit the processor internal instruction format. At the EXs, constant values are indeed extended to 4 bytes. This restriction in constant sizes does not affect the correctness of executing Java programs, as our benchmarking showed that all constants used had two or fewer bytes.
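For concreteness, the sketch below models the three field manipulation functions as plain helpers operating on integer values; lengths are in bytes, as above. This is a simplified illustration only: the type parameter t is omitted because it does not change the bit manipulation here, and the target length f is carried mainly for symmetry with the notation.

// A minimal sketch of the zero-extend, sign-extend, and truncate helpers
// Z, S, and T, modeled on long values. Lengths l and f are in bytes.
class FieldOpsSketch {
    // Z(v, l, f): zero-extend a value of length l bytes; the upper bytes
    // of the f-byte result become zero (a long already covers f here).
    static long zeroExtend(long v, int l, int f) {
        long mask = (l >= 8) ? -1L : (1L << (8 * l)) - 1;
        return v & mask;
    }

    // S(v, l, f): sign-extend a value of length l bytes to f bytes.
    static long signExtend(long v, int l, int f) {
        int shift = 64 - 8 * l;
        return (v << shift) >> shift;   // arithmetic shift replicates the sign bit
    }

    // T(v, l, f): truncate a value of length l bytes to its low f bytes.
    static long truncate(long v, int l, int f) {
        long mask = (f >= 8) ? -1L : (1L << (8 * f)) - 1;
        return v & mask;
    }

    public static void main(String[] args) {
        System.out.println(signExtend(0xFF, 1, 2));  // -1: 0xFF read as a signed byte
        System.out.println(zeroExtend(0xFF, 1, 2));  // 255
        System.out.println(truncate(0x1234, 2, 1));  // 52 (0x34)
    }
}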

Appendix B

Sequencing Notation

Assume that A and B are different operations and C is a logical condition. The following notation is used in sequencing descriptions. We use the word enable to mean that an operation is enabled by an external signal or by satisfying a certain condition.

• Conditional execution: if B is enabled, it can be executed only if A is also enabled. A could be a compound logical expression that includes negation, ANDing (·), and/or ORing (+).
• Conditional enable: A ← C: A is enabled if C is satisfied. C can be a complex logical condition.
• Exclusive enable: A ⊕ B: A and B cannot be enabled at the same time. The enabled one is executed.
• Sequential execution: A, B: A is executed first, followed by B (if both are enabled).
• Concurrent execution: A|B: both A and B are executed simultaneously (if enabled).
• Priority execution: A; B: A has higher priority than B. That is, if both are enabled, A is executed and B is not.
• Input forwarding: if A is enabled, its input is forwarded to B (if B is enabled), whether A itself is executed or not.
• Sequential execution with output forwarding: A » B: if A is enabled, its output is forwarded to B. B has to be enabled in this case. Both A and B are executed sequentially, with A first.
• Parentheses have higher precedence.
• Σ_i x(i) means x(1) + ··· + x(n).
• Π_i x(i) means x(1) ··· x(n).
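As a small constructed example (not one of the dissertation's module sequences), the expression A; (B | C) reads: if A is enabled, A executes and neither B nor C does; otherwise B and C execute simultaneously, provided each is enabled. Likewise, A ← ready · ¬stall enables A only when the signal ready is asserted and stall is not, illustrating a compound enabling condition built from ANDing and negation.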