A Semi-Custom Design Flow in High-Performance Microprocessor Design Gregory A

A Semi-custom Design Flow in High-performance Microprocessor Design Gregory A. Northrop Pong-Fei Lu IBM Research IBM Research Yorktown Heights, NY 10598 Yorktown Heights, NY 10598 [email protected] [email protected] ABSTRACT RISC-like micro-architecture. In this paper we present techniques shown to significantly The physical design of these processors makes extensive use of enhance the custom circuit design process typical of high- hierarchy, partitioning the chip into functional units (instruction, performance microprocessors. This methodology combines FXU, FPU…) and units into macros. There are typically ~6 units flexible custom circuit design with automated tuning and and ~200 distinct macros (about 600 total instances), and a physical design tools to provide new opportunities to optimized macro can have anywhere from 1K to 100K or more transistors. design throughout the development cycle. The macro serves as the primary partitioning unit for logic entry (HDL), and a common macro connectivity description is used for Keywords both functional models for verification and for circuit design. Standard cell, circuit tuning, custom design, methodology. Boolean verification is used to ensure functional equivalence between macro HDL and a schematic representation, which is verified against the physical design. All macros and units are 1. INTRODUCTION fully floorplanned objects, and the global wiring (chip and unit) The development of high performance microprocessors requires is done hierarchically, using a wiring contract methodology. concurrent design at many levels (logical, circuit, physical) with Macros are all characterized for timing, noise, etc., and large teams and tightly interlocked schedules. Often the best represented by models at the global level. Timing rules are design flow is one that most effectively addresses the natural generated for all macros using static transistor-level simulation. conflicts within this flow (e.g., logic stability vs. timing closure), Global timing is run at both the unit and the chip level, using in contrast to one that simply applies the most modern or these macro rules and global wire extractions. The resulting aggressive approach in each domain. This paper describes such a timing is routinely used to generate assertions for all macros, case, in the development and use of a semi-custom design which are used with the static transistor level timing to drive methodology which has significantly enhanced several timing closure. generations of IBM zSeries (S/390) processors [1-3], as well as the IBM POWER4 processor[4]. The coordinated use of a Concurrent design with carefully controlled feedback and common parameterized gate representation, standard cell iteration are the keys to bringing such a design to closure. generation capabilities, place and route merged with custom Circuit and physical design start as soon as sufficient logic is physical design, static transistor level timing + formal circuit defined, while the early emphasis is on simulation for functional tuning, and gain-based synthesis have all led to significant verification. Early floorplanning and initial circuit design are improvements in both quality-of-result and time-to-market in the used to check the cycle time feasibility, and as the design conventional static CMOS design domain. matures, the emphasis shifts from strictly functional verification to logic modification and repartitioning as the primary 2. CUSTOM PROCESSOR DESIGN mechanism for achieving timing closure. This means that the The circuit design methodology described in this paper was efficiency, turn-around-time, and flexibility of the circuit design developed and applied over 3 generations of the microprocessor methods are as important to the ultimate chip cycle-time family used in the IBM eServer zSeries (S/390 mainframe) [1-3]. performance as is the intrinsic circuit performance. These processors have achieved frequencies in excess of 1GHz Circuit implementations of macros fall into 3 classes, with each in a 0.18u CMOS technology, combining high frequency with a type occupying about 1/3 of the area: arrays, (cache, table logic) relatively shallow pipeline (~7 stages) and extensive use of synthesized random logic macros (RLMs) for controls, and full millicode [2] to implement the complex S/390 architecture with a custom dataflow. The dataflow is done predominantly in static, with dynamic circuitry reserved for only extremely critical Permission to make digital or hard copies of all or part of this work for function, in order to meet power requirements. Traditionally, personal or classroom use is granted without fee provided that copies are full custom design is used for arrays, register files, and the not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy dataflow stacks that are typical of the instruction and execution otherwise, or republish, to post on servers or to redistribute to lists, units. Custom design is very effective at optimizing performance requires prior specific permission and/or a fee. and achieving a high area efficiency, particularly where elements DAC 2001, June 18-22, 2001, Las Vegas, Nevada, USA. are identical across the bit range of the data stack, as hierarchy Copyright 2001 ACM 1-58113-297-2/01/0006…$5.00. and careful tiling lead to highly optimal designs. This clearly applies to register files, working registers, muxes, and the like. However, many of the most labor intensive and critical functions, composed of multiple levels of these gates. Note that this particularly those that implement more complex numerical bookset is only a schematic representation; there is no directly functions (adders, incrementers, and comparators) do not make associated layout. such a compelling case for full custom design. They are far less A cell generation tool, described in more detail in section 3.5, is regular across the stack, more complex, and often do not tile very designed specifically to produce layout in a row-based standard easily. They are often timing critical, and although the logical cell image for arbitrary values of NW and PW for each primitive function usually has a stable definition early in the design cell. In addition to its use in semi-custom design, this tool was process, the most appropriate circuit architecture may evolve. It also used to create a conventional library of discrete sizes for use is this class of function which is the primary application for the with synthesis to build the RLMs. This standard cell library had semi-custom design flow outlined below. non-parameterized cells with all the conventional views, including timing rules required for synthesis. The sizes were 3. SEMI-CUSTOM DESIGN FLOW selected by generating a reasonably dense matrix of NW,PW 3.1 Primitive parameterized bookset values for each primitive cell and running it through the cell The basic building block used in this methodology is a set of generator. The NW,PW values were cast as a power level parameterized gates, called the primitive bookset, an example of (NW+PW) and a rise/fall ratio (PW/NW), also called the beta which is shown in Fig 1. These gates are generally a single level ratio. The ratio of adjacent power levels was around 1.25, and of conventional (inverting) static CMOS, with complementary the range of beta ratios was adjusted to specific rise/fall times. pull-up and pull-down nfet and pfet trees, wherein the Depending on the complexity of the function, there were from parameterization simply scales the pfets with a parameter PW 10 to 25 power levels and from 1 to 4 beta ratios for each primitive type. Including a physically compatible set of cells using low Vt fets, there are ~1200 cells in this library. This type of library, commonly referred to as a “tall thin” library, provides PW PW the flexible sizing required for synthesis to achieve maximum cycle time performance in the RLM control logic. NW PW = 8u 3.2 Overall design flow The design flow used to implement custom designs built from the primitive bookset is summarized in Fig 2. This flow is largely NW*T automated, with most of the work contained in the initial design and the place and route floorplan. After an initial pass, iteration NW = 10u of the design is a relatively rapid process. Figure 1. Parameterized gate and sample instance, showing parameters NW and PW. The expressions on the transistors Schematic design Timing are widths, and T represents an optional taper factor. restructure and the nfets with a parameter NW. In general, each fet can have Flatten to gate level Extraction an additional fixed multiplier, T, which is used to define multiple flavors of each logic type providing tradeoffs between the delays from each input pin. With the exception of the XOR Tune sizes Paras Select flatten and XNOR functions, all these primitive gates are a single level itics of inverting logic. A complete set of represented topologies can Map to standard cell Layout be found in Table 1. Table 1. List of parameterized logic types Layout custom cells Place and route Logic type Comments INVERTER Generate layout for Schematics with NAND2…NAND4 Multiple T remaining cells all leaf cell NOR2…NOR3 Multiple T AOI21 & OAI21 Multiple T Figure 2. Complete design flow using the primitive bookset. AOI22 & OAI22 The custom designed schematics can be a hierarchical XOR2 & XNOR2 Pass-gate style combination of parameterized and custom designed cells. MUX2 2-Way transmission gate mux PGMERGE G+P*C for adders (restricted AOI21) 3.3 Design flow details DRXOR2 Dual rail xor (true & comp inputs) This section catalogs some of the details of each step in the flow Together, these parameterized gates form a basis set capable of from Fig 2. Note that not all steps are needed, and a wide covering most of the design space for combinational static variety of combinations have been used, depending upon the circuitry found in a conventional ASIC library, since more details and needs of each design.

A Semi-Custom Design Flow in High-Performance Microprocessor Design Gregory A

A Comprehensive Stochastic Design Methodology for Hold-Timing Resiliency in Voltage-Scalable Design

Overcoming the Challenges in Very Deep Submicron for Area Reduction, Power Reduction and Faster Design Closure

Design Closure: Power Constraints, Best Practices for an Accurate Report Power Estimation

NOT for Printing

SEMICONDUCTOR COLLABORATIVE DESIGN PROCESS Enable Collaborative Design for Complex Semiconductor Projects

Design for Manufacturing (Dfm) in Submicron Vlsi Design

Designing a Chip Challenges, Trends, and Latin America Opportunity

Physical Design of a 3D-Stacked Heterogeneous Multi-Core Processor

LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure

UNIVERSITY of CALIFORNIA SAN DIEGO IC Physical Design

Ultrafast Design Methodology Guide for the Vivado Design Suite

Constraint Driven I/O Planning and Placement for Chip-Package Co-Design ∗