UNIVERSITY OF CALIFORNIA
SANTA CRUZ

SANTA CRUZ INSTRUCTION PROCESSOR WITH SCOREBOARDING

A thesis submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE
in
COMPUTER ENGINEERING

by
Vidyuth Srivatsaa

June 2013

The Thesis of Vidyuth Srivatsaa is approved:
Professor Jose Renau, Chair
Professor Alexandre Brandwajn
Mr. Nils Graef

Tyrus Miller
Vice Provost and Dean of Graduate Studies

Copyright © by Vidyuth Srivatsaa 2013

Table of Contents

List of Figures
List of Tables
Abstract
Dedication
Acknowledgments
1 Introduction
  1.1 Overview
  1.2 Superscalar Processors
  1.3 Stage Implementation
  1.4 SCCORE ISA
  1.5 SCIPS Implementation
2 Fetch Unit
3 Scoreboard
  3.1 Register File
  3.2 Pipelining Hazards
  3.3 Instruction Scheduling in SCIPS
  3.4 Scoreboard verification
4 Execute Unit
  4.1 Forwarding
5 Data Cache
  5.1 Introduction to Caches
  5.2 Data cache stages
  5.3 Data cache FSM
  5.4 Data cache verification
6 Evaluation
7 Conclusion
A Glossary
B Testing and Synthesis
Bibliography

List of Figures

1.1 Simple 2-issue Superscalar pipeline
1.2 Retry protocol
1.3 Stage implementation without Retry
1.4 Stage implementation with Retry
1.5 SPARC v8 instruction format
1.6 SCIPS in a nutshell
3.1 Top level of Scoreboard module
3.2 Structure of the Register File
4.1 Top level of the Execute unit in SCIPS
4.2 Register Forwarding
4.3 Forwarding in the Execute unit in SCIPS
5.1 Data cache in SCIPS
5.2 Top level of data cache in SCIPS
5.3 Illustration of skewed associative cache hashing among banks in SCIPS
5.4 Cache address structure
5.5 Data cache bank structure

List of Tables

6.1 Scoreboard synthesis results
6.2 Execute Unit synthesis results
6.3 Data Cache synthesis results
6.4 SCIPS core synthesis results

Abstract

Santa Cruz Instruction Processor with Scoreboarding
by
Vidyuth Srivatsaa

This thesis describes the Santa Cruz Instruction Processor with Scoreboarding (SCIPS), an aggressive 64-bit, 2-way superscalar processor with Scoreboard logic that implements the SCCORE ISA and is designed in SystemVerilog. SCIPS consists of a Scoreboard unit, which serves as the control section that streams instructions to the Execute unit while resolving potential hazards, and the Execute unit, which comprises 5 different functional units, including a skewed associative data cache.

SCIPS is a flexible pipelined design owing to the Stage implementation, which tolerates any latency in the pipeline. This allows SCIPS to be re-pipelined at will, which helps achieve significant power savings in addition to performance improvements.

Dedicated to my close family who have constantly supported and encouraged me.

Acknowledgments

First and foremost, I would like to thank my adviser, Professor Jose Renau. He has imparted excellent knowledge and given me remarkable support to help me accomplish what I have. He has provided many ideas which have helped me expand my breadth and depth of knowledge in Computer Architecture. Thanks for being patient and always being available for any help!

I would like to thank my thesis reading committee members, Prof. Brandwajn and Mr. Nils Graef, Senior Engineering Manager at LSI, for taking the time to read my thesis and for providing valuable feedback on short notice.

I would also like to thank Jason and Pranav for working on the execute stage; James for helping with the Scoreboard; Elnaz for the Stage implementation. I would also like to thank Carol for helping me on many occasions.

Lastly, and most importantly, I would like to thank my parents, grandparents and my sister for their constant motivation and encouragement. Big thanks go to my Uncle, Aunt and Anika.
Chapter 1
Introduction

Mobile processors are ubiquitous owing to the great demand for handheld devices. The challenge is to provide an energy-efficient solution without compromising on performance. SCIPS is a general-purpose processor implementing the SCCORE ISA that maintains a low power profile and a low silicon footprint while targeting a good performance point.

1.1 Overview

This chapter briefly introduces the different units of the processor. It also discusses techniques used in the processor design, such as the Stage implementation, gives an overview of the Instruction Set Architecture, and briefly goes over pipelining, forwarding and superscalar processors.

Chapters 2, 3, 4 and 5 cover the different SCIPS blocks individually. Chapter 2 is about the instruction fetch stage. Chapter 3 explains Scoreboarding, the dynamic scheduling scheme used in SCIPS. Chapter 4 briefly covers the Execute unit, which is the compute engine of the processor. Chapter 5 describes the data cache and goes over the details of the skewed-associative cache used in this processor.

Chapter 6 summarizes the implementation results of the SCIPS core and its sub-blocks as reported by a synthesis tool.

1.2 Superscalar Processors

SCIPS is a superscalar processor. A superscalar processor implements instruction-level parallelism within a single processor, thereby allowing higher throughput than would otherwise be possible at a given clock rate. Superscalar architectures include all features of pipelining but, in addition, several instructions can execute simultaneously in the same pipeline stage. SCIPS issues 2 instructions per cycle, so it is a 2-way superscalar processor. The number of instructions that can execute in parallel depends on the number and kind of parallel units available. Figure 1.1 shows an ideal pipeline flow of such a 2-way superscalar processor.
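As a rough illustration of the ideal 2-issue flow described above, the following Python sketch computes which cycle each instruction occupies each stage when nothing stalls. The five stage names are generic textbook placeholders and are an assumption; this excerpt does not name the SCIPS pipeline stages, and the functions `ideal_schedule`, `total_cycles` and `render` are hypothetical helpers introduced only for this example.

```python
from math import ceil

# Hypothetical textbook stage names; the source does not name SCIPS's stages here.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(num_instructions, issue_width=2, stages=STAGES):
    """Return {instruction index: {stage name: cycle}} for a stall-free pipeline."""
    schedule = {}
    for i in range(num_instructions):
        start = i // issue_width  # instructions enter in groups of issue_width
        schedule[i] = {stage: start + k for k, stage in enumerate(stages)}
    return schedule


def total_cycles(num_instructions, issue_width=2, stages=STAGES):
    """Cycles needed to drain the pipeline when nothing stalls."""
    if num_instructions == 0:
        return 0
    return len(stages) + ceil(num_instructions / issue_width) - 1


def render(schedule):
    """Draw one row per instruction and one column per cycle."""
    last = max(max(s.values()) for s in schedule.values())
    rows = []
    for i, s in sorted(schedule.items()):
        by_cycle = {cycle: name for name, cycle in s.items()}
        cells = [by_cycle.get(c, "").ljust(4) for c in range(last + 1)]
        rows.append(f"i{i}: " + "".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(render(ideal_schedule(4)))
```

Under these assumptions, four instructions finish in 6 cycles with an issue width of 2 and in 8 cycles with an issue width of 1. That gap is the throughput gain the section refers to, and it shrinks whenever dependencies prevent two instructions from issuing together.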
Superscalar processors are limited by the amount of available instruction-level parallelism, by the complexity and time cost of the dispatcher and its associated dependency-checking logic, and by the extra overhead of branch instruction processing.

Figure 1.1: Simple 2-issue Superscalar pipeline

1.3 Stage Implementation

The Stage implementation can be understood as a handshaking mechanism that tolerates latency in the computations of each module in the system and in the communication between modules. This allows transformations that can be used in optimization techniques such as re-pipelining an unbalanced design.

The stage module can be thought of as a FIFO of size 1. Each data input has a corresponding Valid signal, so when data arrives with Valid deasserted, that data is discarded and not stored anywhere. When data does arrive with Valid asserted but the next stage responds with a Retry, the data is held until the Retry signal is released. When the current module issues a Retry to the previous stage, it signals that it cannot accept any new data and requests that the data be sent again. Figure 1.2 shows basic module-to-module communication with stage signals, where one module wants to send data to another.

Figure 1.2: Retry protocol. (Block A's q, qValid and qRetry ports connect to Block B's din, dinValid and dinRetry ports.)

To illustrate this functionality, two timing diagrams are given: Figure 1.3, where the test block does not get a Retry on incoming data, and Figure 1.4, where it gets a Retry on incoming data for a couple of clock cycles.

While the stage implementation has a small additional area overhead, it eases pipeline complexity, as an entire pipeline can be held in the Retry state without having to worry about losing data.
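The Valid/Retry handshake described in this section can be sketched as a small cycle-level model in Python. This is a minimal sketch and not the thesis's SystemVerilog stage module: the class `Stage`, the helper `run`, and the choice that an empty slot accepts data even while downstream asserts Retry are all assumptions made for illustration. Only the signal names (din, dinValid, dinRetry, q, qValid, qRetry) come from Figure 1.2.

```python
class Stage:
    """Cycle-level model of a one-entry FIFO with Valid/Retry handshaking."""

    def __init__(self):
        self.q = None         # data held in the single slot (the q output)
        self.q_valid = False  # qValid: True when the slot holds valid data

    def din_retry(self, q_retry):
        """dinRetry sent upstream: slot is full and downstream refuses it."""
        return self.q_valid and q_retry

    def clock(self, din, din_valid, q_retry):
        """Advance one clock edge; return True if din was accepted."""
        if self.din_retry(q_retry):
            # Downstream asked for a retry: hold the data, accept nothing new.
            return False
        # Slot is empty or is being drained this cycle (assumed behavior).
        # Data arriving without Valid is discarded, leaving the slot empty.
        self.q = din if din_valid else None
        self.q_valid = din_valid
        return din_valid


def run(items, retry_pattern):
    """Push items through one Stage.

    retry_pattern[c] is the consumer's Retry (qRetry) in cycle c.
    Returns the items the consumer received, in order.
    """
    stage, pending, received = Stage(), list(items), []
    for q_retry in retry_pattern:
        # The consumer samples the stage output before the clock edge.
        if stage.q_valid and not q_retry:
            received.append(stage.q)
        din_valid = bool(pending)
        din = pending[0] if pending else None
        if stage.clock(din, din_valid, q_retry):
            pending.pop(0)  # producer advances only when its data was accepted
    return received


if __name__ == "__main__":
    print(run([1, 2, 3], [False, True, True, False, False, False]))
```

In this model, two cycles of Retry delay delivery by two cycles but neither drop nor duplicate data, which is the property that lets a whole pipeline be held without loss.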
Additionally, the pipeline need not be balanced, as this implementation gives the flexibility to have parallel pipelines of different lengths.

Figure 1.3: Stage implementation without Retry.

Figure 1.4: Stage implementation with Retry.

1.4 SCCORE ISA

The Instruction Set Architecture (ISA) serves as the boundary between hardware and software. It gives us the ability to precisely describe how to invoke and access the different operations, modes and storage locations supported by the hardware.

SCIPS implements the pre-decoded SCCORE micro-ops. The SPARCv8 and ARM ISAs can be decoded to the SCCORE ISA. It is a vectored ISA that includes support for ARM's Thumb instruction set, which improves compiled code density by using a compact 16-bit encoding for a subset of ARM instructions. Pre-decoded µOPs include simple ALU, branch ALU, Load/Store ALU and Complex ALU instructions, with support for conditional moves. The opcode in SCCORE is one byte wide. There are 32 integer registers and 32 floating point registers. Additionally, the ISA supports condition codes, and hence has an ICC register. There is a floating point rounding register which specifies the rounding mode. The instruction format, which is an implementation of SPARCv8 [12], is shown in Figure 1.5.

(Reproduced from The SPARC Architecture Manual: Version 8, Figure 5-1, Summary of Instruction Formats.)

Format 1 (op = 1): CALL
    op | disp30
    (bit positions marked: 31, 29, 0)

Format 2 (op = 0): SETHI & Branches (Bicc, FBfcc, CBccc)
    op | rd | op2 | imm22
    op | a | cond | op2 | disp22
    (bit positions marked: 31, 29, 28, 24, 21, 0)

Format 3 (op = 2 or 3): Remaining instructions
    op | rd | op3 | rs1 | i=0 | asi | rs2
    op | rd | op3 | rs1 | i=1 | simm13
    op | rd | op3 | rs1 | opf | rs2
    (bit positions marked: 31, 29, 24, 18, 13, 12, 4, 0)

Figure 1.5: SPARC v8 instruction format

Instruction Fields

The instruction fields are interpreted as follows:

op and op2: These 2- and 3-bit fields encode the 3 major formats and the format 2 instructions according to Tables 5-1 and 5-2.
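To make the three formats of Figure 1.5 concrete, the following Python sketch splits a 32-bit SPARC v8 instruction word into its fields. It decodes the SPARC v8 layout only and is not the SCCORE micro-op encoding, which this excerpt does not specify beyond its byte-wide opcode. The helper names `bits`, `sign_extend` and `decode` are hypothetical, and decoding every non-SETHI format 2 word with the branch layout is a simplifying assumption.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret a width-bit field as a two's-complement signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode(word):
    """Split a SPARC v8 instruction word into the fields of its format."""
    op = bits(word, 31, 30)
    if op == 1:  # Format 1: CALL
        return {"format": 1, "op": op,
                "disp30": sign_extend(bits(word, 29, 0), 30)}
    if op == 0:  # Format 2: SETHI and branches
        op2 = bits(word, 24, 22)
        if op2 == 4:  # SETHI
            return {"format": 2, "op": op, "rd": bits(word, 29, 25),
                    "op2": op2, "imm22": bits(word, 21, 0)}
        # Simplification: all other op2 values use the branch layout here.
        return {"format": 2, "op": op, "a": bits(word, 29, 29),
                "cond": bits(word, 28, 25), "op2": op2,
                "disp22": sign_extend(bits(word, 21, 0), 22)}
    # Format 3 (op = 2 or 3): remaining instructions
    fields = {"format": 3, "op": op, "rd": bits(word, 29, 25),
              "op3": bits(word, 24, 19), "rs1": bits(word, 18, 14),
              "i": bits(word, 13, 13),
              # opf overlaps i and asi; only meaningful for FPop/CPop.
              "opf": bits(word, 13, 5)}
    if fields["i"]:
        fields["simm13"] = sign_extend(bits(word, 12, 0), 13)
    else:
        fields["asi"] = bits(word, 12, 5)
        fields["rs2"] = bits(word, 4, 0)
    return fields


if __name__ == "__main__":
    print(decode(0x01000000))  # SETHI with rd = 0 and imm22 = 0 (the SPARC nop)
```

The top two bits select the format, so a decoder needs only the op field, plus op2 for format 2, before it knows where every other field sits.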