Organization of the Motorola 88110 Superscalar RISC Microprocessor
Total Page:16
File Type:pdf, Size:1020Kb
Organization of the Motorola 88110 Superscalar RISC Microprocessor Motorola’s second-generation RISC microprocessor employs advanced techniques for exploit- ing instruction-level parallelism, including superscak instruction issue, out-of-order instruc- tion completion, speculative execution, dynamic hx&ruction rescheduling, and two parallel, high-bandwidth, on-chip caches. Designed to serve as the central processor in low-cost per- sonal computers and workstations, the 88110 supports demanding graphics and digital signal processing applications. Keith Diefendorff otorola designers conceived of the puters and workstation systems. Thus, our de- 88000 RISC (reduced instruction- sign objective was good performance at a given Michael Allen set computer) architecture to cost, rather than ultimate performance at any cost. Dl simplify construction of micropro- We recognized that the personal computer envi- Motorola cessors capable of exploiting high degrees of in- ronment is moving toward highly interactive soft- struction-level parallelism without sacrificing clock ware, user-oriented interfaces, voice and image speed. The designers held architectural complexity processing, and advanced graphics and video, to a minimum to eliminate pipeline bottlenecks all of which would place extremely high demands and remove limitations on concurrent instruction on integer, floating-point, and graphics process- execution. ing capabilities. At the same time, we realized The 88100/200 is the first implementation of the 88110 would have to meet these performance the 88000 architecture. It is a three-chip set, re- demands while operating with the inexpensive quiring one CPU (88100) chip and two (or more> DRAM systems typically found in low-cost per- cache (88200) chips. The CPU’s simple scalar sonal computers. design uses multiple concurrent execution units To achieve the performance goals set for the with out-of-order instruction completion to ap- 88110, we needed to obtain more parallelism than proach a throughput of one instruction per clock was achieved in earlier microprocessors. To this cycle. end, we decided to use a superscalar micro- The second-generation, single-chip 88110 RISC architecture to exploit additional instruction-level microprocessor employs superscalar instruction parallelism. Superscalar machines are distin- issue and out-of-order instruction execution tech- guished by their ability to dispatch multiple in- niques to achieve a throughput greater than one structions each clock cycle from a conventional instruction per clock cycle. linear instruction stream. This approach has shown good speedup on general-purpose applications Overview and was a good match to available CMOS tech- In designing the 88110, we aimed at a general- nology. (We believe Agerwala and Cocke coined purpose microprocessor, suitable primarily for use the superscalar term.‘) as the central processor in low-cost personal com- We selected the superscalar approach over 40 IEEE Micro 0272-1732/92/0400-0040$03.00 0 1992 IEEE . - - - other fine-grain parallelism approaches. such as the vector interlocked pipelines and a precise exception model, but it machine and the VLIW (very long instruction word) approach. allows out-of-order instruction completion, some out-of-order because it appeared to be more effective for our intended instruction issue. and branch prediction with speculative ex- application. With a limited transistor budget. spending tran- ecution past branches. sistors on vector hardware would have meant sacrificing sca- We optimized each execution unit for low latency: The lar performance and would have yielded a machine that branch unit uses a branch target instruction cache to reduce suffered from the Amdahl’s Law phenomenon’ on general- branch latency. The integer and graphics units are one-cycle purpose applications. units; the floating-point adder and multiplier are three-cycle, The VLIW approach3 would have introduced severe soft- fully pipelined. IEEE extended-precision units. The load/store ware compatibility restrictions, by exposing hardware paral- unit provides fast access to the cache on a hit. But it is also lelism to the object code program and thereby limiting future highly buffered to increase tolerance to long memory latency implementation flexibility. Also, the VLIW speedup, while on a miss and to allow dynamic reordering of loads and substantial on code with abundant parallelism (such as sci- stores for runtime overlapping of tight loops. entific applications), is less significant on general-purpose The on-chip caches are organized in a Harvard arrange- applications. This limited speedup is due in part to the code ment. giving the processor simultaneous access to instruc- expansion: The inefficient use of opcode bits to control un- tions and data. Each 8-Kbyte, two-way, set-associative cache used execution units and the aggressive loop unrolling re- provides 64 bits each clock cycle to its respective unit. The quired to schedule the available execution unit parallelism write-back data cache is nonblocking for some types of ac- effectively.’ cesses, and it follows a four-state MESI (modified, exclusive, The superscalar approach appeared to be a better match shared, invalid) protocol” for coherence with other caches in to CMOS technology than a superpipelined approach. Super- multiprocessor systems. It also supports selective cache by- scalar designs rely primarily on spatial parallelism-multiple pass and software prefetching from user mode. Two inde- operations running concurrently on separate hardware- pendent, 40-entry, fully associative address translation achieved by duplicating hardware resources such as execu- look-aside buffers U’LBs) support a demand-paged virtual tion units and register file ports. Superpipelined designs, on the other hand, emphasize temporal parallel- ism--overlapping multiple operations on a common piece of hardware- achieved through more deeply pipe- lined execution units with faster clock cycles. As a result, superscalar ma- chines generally require more tran- sistors, whereas superpipelined designs require faster transistors and more careful circuit design to mini- mize the effects of clock skew. Some literature indicates that superscalar and superpipelined machines of the same degree would perform roughly the same.j We felt that CMOS tech- nology generally favors replicating circuitry over increasing clock cycle rates, since CMOS circuit density his- torically has increased at a much faster rate than circuit speed. The 88110 microarchitecture, illus- trated in Figure 1, employs a sym- * * metrical superscalar instruction 32-bit address bus 64-bit data bus dispatch unit, which dispatches two TLB Translation look-aside buffer instructions each clock cycle into an array of 10 concurrent execution units. The design implements fully Figure 1. Block diagram of the 88110. April 1992 41 .. 881 IO RISC returning a 64-bit quotient. l Additional information returned from the integer com- pare instruction to improve string-handling capability. l Addition of static branch prediction to the branch opcodes, providing a mechanism by which the com- piler gives the hardware a hint as to the direction a given (4 (W conditional branch is likely to go. We estimate that the compiler potentially can statically predict more than 85 MC88110 MC88110 MC88110 percent of the dynamically executed conditional branches I I . I correctly. We also believe that runtime branch profiling can further improve this rate in specific cases. I2 cache L2 cache L2 cache l An option that allows I6-bit immediate address offsets I I I and literal constants to be treated as signed numbers 1 (the 88100 treats them only as unsigned). 1 DRAM 1 Floating-point archi~ extensions. Our enhance- 64 ments of the floating-point architecture were more signifi- cant. Anticipating heavy use of the processor for graphics Figure 2. Target system configurations: single (a), dual (b), and DSP and greater use of floating-point data in many general- and multiprocessors (c). purpose PC applications, we added the following: memory environment. A common, 64-bit, external bus ser- l An extended floating-point register file to provide regis- vices cache misses. The demultiplexed, pipelined bus sup- ter name space for floating-point variables and frequently ports burst mode, split transactions, and bus snooping. accessed constants beyond that provided in the 88100 We designed the 88110 especially for the three basic sys- architecture. The extended register file contains thirty- tem configurations shown in Figure 2: two 80-bit registers. Each register can hold one floating- point number of any precision-single, double, or l single processors tightly coupled to low-cost DRAMS; double-extended. For compatibility with existing code, l dual-processor systems, also coupled to inexpensive single- and double-precision floating-point numbers con- DRAMS, either in a symmetrical multiprocessing arrange- tinue to be supported in the generai izgister file as well. ment or with one of the 88110s dedicated to a particular The compiler can use the additional register name space function such as graphics or digital signal processing to improve code schedules and reduce memory refer- (DSP); and ences This feature alone results in a speedup of more l medium-scale shared-memory multiprocessor systems, than 15 percent on the SPEC(Systems Performance Evalu- with each processor using local secondary static RAM ation Cooperative) floating-point benchmarks.’ The MAM) cache, which we call L2. speedup