
SPARC Joint Programming Specification 1 Implementation Supplement: Sun UltraSPARC III

Sun Microsystems Proprietary/Need-to-Know JRC Contributed Material

Working Draft 1.0.5, 10 Sep 2002

Part No.: 806-6754-1

Copyright 2001 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, California 94303 U.S.A. All rights reserved. Portions of this document are protected by copyright 1994 SPARC International, Inc. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. Sun, Sun Microsystems, the Sun logo, SunSoft, SunDocs, SunExpress, and Solaris are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements. RESTRICTED RIGHTS: Use, duplication, or disclosure by the U.S. Government is subject to restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).

DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Copyright 2001 Sun Microsystems, Inc., 901 San Antonio Road • Palo Alto, CA 94303-4900 U.S.A. All rights reserved.

This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form, by any means, without the prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is protected by copyright and licensed from Sun suppliers.

Parts of this product may be derived from Berkeley BSD systems licensed from the University of California. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. The following notice applies to Netscape Communicator™: Copyright 1995 Netscape Communications Corporation. All rights reserved.

Sun, Sun Microsystems, the Sun logo, AnswerBook2, docs.sun.com, and Solaris are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based on an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK and Sun™ graphical user interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox graphical user interface; this license also covers Sun’s licensees who implement the OPEN LOOK graphical user interface and otherwise comply with Sun’s written licenses. THIS PUBLICATION IS PROVIDED "AS IS" AND NO WARRANTY, EXPRESS OR IMPLIED, IS GIVEN, INCLUDING WARRANTIES OF MERCHANTABILITY, FITNESS OF THE PUBLICATION FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT OF THIRD-PARTY PRODUCTS. THIS DISCLAIMER OF WARRANTY WOULD NOT APPLY TO THE EXTENT THAT IT IS HELD TO BE LEGALLY NULL AND VOID.

Contents

1. Overview 1 1.1 Navigating the UltraSPARC III Implementation Supplement 1 1.2 Fonts and Notational Conventions 2 1.3 The UltraSPARC III Processor 2 1.3.1 Component Overview 2 1.3.2 Instruction Issue Unit (IIU) 4 1.3.3 Integer Execution Unit (IEU) 5 1.3.4 Data Cache Unit (DCU) 5 1.3.5 Floating Point and Graphics Unit (FGU) 6 1.3.6 External Memory Unit (EMU) 6 1.3.7 System Interface Unit (SIU) 7 1.4 Chip Differences from UltraSPARC I, II 7 1.4.1 Bootbus Limitations 8 1.4.2 Instruction Set Extensions 8 VIS Extensions 8 Interval Arithmetic Support 9 1.4.3 Instruction Differences 9 1.4.4 Memory Subsystem 10 Caches 10 Cache Flushing 11 Translation Lookaside Buffers (TLBs) 12 1.4.5 Interrupts 13 1.4.6 Address Space Size 13 1.4.7 Error Correction 13

1.4.8 Registers 14 Address Space Identifier (ASI) Registers 14 Ancillary State Registers (ASRs) 15 1.4.9 Noncacheable Store Compression 15 1.4.10 Summary of Differences 16

2. Definitions and Acronyms 19

3. Architectural Overview 23

4. Data Formats 25 4.2.3 Floating-Point, Quad-Precision 25 4.3 Graphics Data Formats 25

5. Registers 27 5.1.7 Floating-Point State Register (FSR) 27 FSR_nonstandard_fp (NS) 27 FSR_floating-point_trap_type (ftt) 28 FSR_current_exception (cexc) 28 5.2.1 PSTATE Register 29 5.2.9 Version (VER) Register 29 5.2.11 Ancillary State Registers (ASRs) 30 Performance Control Register (PCR) (ASR 16) 30 Performance Instrumentation Counter (PIC) Register (ASR 17) 30 Dispatch Control Register (DCR) (ASR 18) 30 5.2.12 Registers Referenced Through ASIs 33 Data Cache Unit Control Register (DCUCR) 33 Data Watchpoint Registers 35 Instruction Trap Register 36

6. Instructions 37 6.1 Processor Pipeline 37 6.1.1 Instruction-Fetch Stages 39 A-stage (Address Generation) 39 P-stage (Predictor Address Generation) 39 F-stage (Fetch) 39

B-stage (Branch Target Computation) 39 6.1.2 Instruction Issue Stages 40 I-stage (Instruction Issue) 40 R-stage (Register) 40 6.1.3 Integer Instruction Execution: E-stage (Execute) 41 6.1.4 Floating-Point and VIS Instruction Execution 42 C-stage (Cache) 42 M-stage (Miss) 42 W-stage (Write) 42 X-stage (Extend) 43 6.1.5 Trap (T) and Done (D) Stages 43 T-stage (Trap) 43 D-stage (Done) 43 6.2 Grouping Rules 43 6.2.1 Execution Order 44 6.2.2 Integer Register Dependencies to Instructions in the MS Pipeline 44 6.2.3 Integer Instructions Within a Group 45 6.2.4 Same-Group Bypass 45 6.2.5 Floating Point Unit Operand Dependencies 45 Latency and Destination Register Addresses 45 PDIST Special Cases 46 Helpers 46 Floating-Point Grouping Rules 47 6.3 Conditional Moves 51 6.4 Instruction Latencies and Dispatching Properties 52

7. Traps 61 7.1.2 Error_state 61 7.4.2 Trap Type (TT) 61

8. Memory Models 63 8.1 Programmer-Visible Properties of Models 64 8.1.1 Differences Between Memory Models 64 8.1.2 Current Memory Model 65 8.2 Memory Location Identification 65

8.3 Memory Accesses and Cacheability 65 8.3.1 Coherence Domains 66 Cacheable Accesses 66 Noncacheable and Side-Effect Accesses 66 8.3.2 Global Visibility and Memory Ordering 67 8.4 Memory Synchronization 68 8.4.1 MEMBAR #Sync 68 8.4.2 MEMBAR Rules 68 8.5 Atomic Operations 69 8.6 Nonfaulting Load 71 8.7 Prefetch Instructions 71 8.8 Block Loads and Stores 72 8.9 I/O and Accesses with Side Effects 72 8.9.1 Instruction Prefetch to Side-Effect Locations 73 8.9.2 Instruction Prefetch Exiting Red State 73 8.9.3 UltraSPARC III Internal ASIs 74 8.10 Store Compression 74

A. Instruction Definitions: UltraSPARC III Extensions 77 A.2 Alignment Instructions (VIS I) 78 A.3 Three-Dimensional Array Addressing Instructions (VIS I) 78 A.4 Block Load and Store Instructions (VIS I) 79 A.5 Byte Mask and Shuffle Instructions (VIS II) 83 A.13 Floating-Point Add and Subtract 83 A.26 Load Floating-Point 84 A.27 Load Floating-Point Alternate 84 A.30 Load Quadword, Atomic 84 A.33 Logical Operate Instructions (VIS I) 85 A.35 Memory Barrier 86 A.42 Partial Store (VIS I) 91 A.43 Partitioned Add/Subtract (VIS I) 91 A.44 Partitioned Multiply (VIS I) 92 A.47 Pixel Formatting (VIS I) 92 A.47.5 FPMERGE Instruction 92 A.49 Prefetch Data 93 A.49.1 SPARC V9 Prefetch Variants 93

A.55 Set Interval Arithmetic Mode (VIS II) 94 A.59 SHUTDOWN Instruction (VIS I) 94 A.61 Store Floating Point 94 A.62 Store Floating Point into Alternate Space 95

B. IEEE Std 754-1985 Requirements for SPARC V9 97 B.3 Overflow, Underflow, and Inexact Traps 97 B.6 Floating-Point Nonstandard Mode 98 B.6.1 Subnormal Operands 98 B.6.2 Subnormal Results 99 B.6.3 NaN Operands 100

C. Implementation Dependencies 101 C.1 List of Implementation Dependencies 102 C.2 SPARC V9 General Information 112 C.2.1 Level 2 Compliance (Impl. Dep. #1) 113 C.2.2 Unimplemented Opcodes, ASIs, and the ILLTRAP Instruction 113 C.2.3 Trap Levels (Impl. Dep. #37, 38, 39, 40, 114, 115) 113 C.2.4 Trap Handling (Impl. Dep. #16, 32, 33, 35, 36, 216, 217) 114 C.2.5 TPC/TnPC and Resets 114 C.2.6 SIR Support (Impl. Dep. #116) 114 C.2.7 TICK Register 115 C.2.8 Population Count Instruction (POPC) 115 C.2.9 Secure Software 115 C.3 SPARC V9 Integer Operations 115 C.3.1 Integer Register File and Window Control Registers (Impl. Dep. #2) 116 C.3.2 Clean Window Handling (Impl. Dep. #102) 116 C.3.3 Integer Multiply and Divide 116 C.3.4 Version Register (Impl. Dep. #2, 13, 101, 104) 116 C.4 SPARC V9 Floating-Point Operations 117 C.4.1 Subnormal Operands/Results; NaN Operands 117 C.4.2 Overflow, Underflow, and Inexact Traps (Impl. Dep. #3, 55) 117 C.4.3 Quad-Precision Floating-Point Operations (Impl. Dep. #3) 118 C.4.4 Floating-Point Upper and Lower Dirty Bits in FPRS Register 119 C.4.5 Floating-Point Status Register (Impl. Dep. #13, 19, 22, 23, 24) 119

C.5 SPARC V9 Memory-Related Operations 121 C.6 Non-SPARC V9 Extensions 123 C.6.3 DCR Register Extensions 123 C.6.4 Other Extensions 123

D. Formal Specification of the Memory Models 125

E. Opcode Maps 127

F. Memory Management Unit 129 F.1 Virtual Address Translation 129 F.2 Translation Table Entry (TTE) 130 F.4 Hardware Support for TSB Access 131 F.4.1 Typical TLB Miss/Refill Sequence 132 F.4.2 Faults and Traps 133 F.8 Reset, Disable, and RED_state Behavior 133 F.10 Internal Registers and ASI Operations 134 F.10.3 Accessing MMU Registers 134 F.10.3 Instruction/Data MMU TLB Tag Access Registers 134 F.10.4 I/D TLB Data In, Data Access, and Tag Read Registers 134 Data In and Data Access Registers 134 F.10.6 I/D Translation Storage Buffer Base Registers 135 F.10.7 I/D TSB Extension Registers 136 F.10.12 I/D TLB CAM Diagnostic Register 137 F.12 Translation Lookaside Buffer Hardware 138 F.12.2 TLB Replacement Policy 138 F.12.3 TSB Pointer Logic Hardware Description 140

G. Assembly Language Syntax 143

H. Software Considerations 145

I. Extending the SPARC V9 Architecture 147

J. Programming with the Memory Models 149

K. Changes from SPARC V8 to SPARC V9 151

L. Address Space Identifiers 153

L.1 Address Space Identifiers and Address Spaces 153 L.2 ASI Values 153 L.3 ASI Assignments 154 L.3.1 Supported SPARC V9 ASIs 154 L.3.2 UltraSPARC III ASI Assignments 154 L.3.3 Special Memory Access ASIs 157 Block Load and Store ASIs 157 Partial Store ASIs 157

M. Caches and Cache Coherency 159 M.1 Cache Organization 159 M.1.1 Virtual Indexed, Physical Tagged Caches (VIPT) 159 M.1.2 Physical Indexed, Physical Tagged Caches (PIPT) 160 Instruction Cache (I-Cache) 160 External and Write Caches (E-Cache, W-Cache) 161 M.2 Cache Flushing 162 M.2.1 Address Aliasing Flushing 162 M.2.2 Committing Block Store Flushing 163 M.2.3 Displacement Flushing 163 M.3 Coherence Tables 164 M.3.1 Processor State Transition and the Generated Transaction 164 M.3.2 Snoop Output and Input 166 M.3.3 Transaction Handling 170

N. Interrupt Handling 175 N.4 Interrupt ASR Registers 175 N.4.2 Interrupt Vector Dispatch Register 175 N.4.3 Interrupt Vector Dispatch Status Register 175 N.4.5 Interrupt Vector Receive Register 175

O. Reset and RED_state 177 O.1 RED_state Characteristics 177 O.2 Resets 178 O.2.1 Hard Power-on Reset (Hard POR, Power-on Reset, POK Reset) 178 O.2.2 System Reset (Soft POR, Fireplane Reset, POR) 179 O.2.3 Externally Initiated Reset (XIR) 179

O.2.4 Software-Initiated Reset (SIR) 179 O.3 RED_state Trap Vector 179 O.4 Machine States 180

P. Error Handling 185 P.1 Error Classes 186 P.2 Corrective Actions 186 P.2.1 Fatal Error (FERR) 186 P.2.2 Precise Traps 187 P.2.3 Deferred Traps 188 Error Barriers 188 TPC, TNPC, and Deferred Traps 188 Enabling Deferred Traps 189 Errors Leading to Deferred Traps 189 Special Access Sequence for Recovering Deferred Traps 189 Deferred Trap Handler Functionality 189 P.2.4 Disrupting Traps 190 P.2.5 Multiple Traps 191 P.2.6 Entering RED_state 191 P.3 Memory Errors 191 P.3.1 E-cache Data ECC Error 191 Hw_corrected E-cache Data ECC Errors 191 Sw_correctable E-cache Data ECC Errors 192 Uncorrectable E-cache Data ECC Errors 194 P.3.2 Errors on the System Bus 196 Hw_Corrected System Bus Data and MTag ECC Errors 196 Uncorrectable System Bus Data ECC Errors 197 Uncorrectable System Bus MTag Errors 198 System Bus BERR Errors 198 System Bus Timeout Errors 199 System Bus Hardware Timeouts 200 P.3.3 Memory Errors and Prefetch 200 Memory Errors and Prefetch by the Instruction Fetcher and I-cache 200 P.3.4 Memory Errors and Interrupt Transmission 201 P.3.5 Cache Flushing in the Event of Multiple Errors 202

P.4 Error Registers 202 P.4.1 E-cache Error Enable Register 202 P.4.2 Asynchronous Fault Status Register 204 AFSR Fields 204 Clearing the AFSR 206 P.4.3 ECC Syndromes 208 E_SYND 208 M_SYND 212 P.4.4 Asynchronous Fault Address Register 213 P.5 Error Reporting Summary 214 P.6 Overwrite Policy 217 P.6.1 AFAR Overwrite Policy 217 P.6.2 E_SYND Data ECC Syndrome Overwrite Policy 218 P.6.3 M_SYND MTag ECC Syndrome Overwrite Policy 218 P.7 Multiple Errors and Nested Traps 219 P.8 Further Details on Detected Errors 219 P.8.1 E-cache Data ECC Error 220 The UCC Error 220 The UCU Error 221 The EDC Error 221 The EDU Error 222 The WDC Error 223 The WDU Error 223 The CPC Error 224 The CPU Error 224 P.8.2 System Bus ECC Errors 224 The CE Error 224 The UE Error 225 The EMC Error 225 The EMU Error 225 The IVC Error 226 The IVU Error 226 P.9 Further Details of ECC Error Processing 226 P.9.1 System Bus ECC Error Detection 226 P.9.2 System Bus ECC Error Injection 227 P.9.3 E-cache ECC Errors 228

P.9.4 When Are Traps Taken? 228 P.9.5 When Are Interrupts Taken? 232 P.9.6 Error Barriers 233 P.10 UltraSPARC III Behavior Under Asynchronous Error Conditions 235 P.10.1 External Cache Access Errors 235 P.10.2 UltraSPARC III Behavior with Error from System Bus 242 P.11 External Memory Unit Error Handling 251 P.11.1 Asynchronous Fault Status Register 252 P.11.2 EMU Error Status Register 253 P.11.3 EMU Error Mask Register 257 P.11.4 EMU Shadow Register 258 P.11.5 EMU Shadow Scan Chain Order 259 Mask Chain Order 259 Shadow Scan Chain Order 259

Q. Performance Instrumentation 265 Q.1 Performance Control and Counters 265 Q.2 Performance Instrumentation Counter Events 268 Q.2.1 Instruction Execution Rates 268 Q.2.2 IIU Statistics 269 Q.2.3 IIU Stall Counts 269 Q.2.4 R-stage Stall Counts 270 Q.2.5 Recirculate Counts 270 Q.2.6 Memory Access Statistics 271 Q.2.7 System Interface Statistics 272 Q.2.8 Software Statistics 272 Q.2.9 Floating-Point Operation Statistics 272 Q.2.10 Memory Controller Statistics 273 Q.2.11 PCR.SL and PCR.SU Encoding 273

R. Specific Information About Fireplane Interconnect 277 R.1 Power Management 277 R.1.1 Low Power Mode 277 R.2 Fireplane Interconnect ASI Extensions 278 Fireplane Port ID Register 278 Fireplane Configuration Register 279

Fireplane Interconnect Address Register 283 R.3 RED_state and Reset Values 283

S. Summary of Differences Between UltraSPARC III and SPARC64 V 285

T. UltraSPARC III Chip Identification 287 T.1 Allocation of Identification Codes for UltraSPARC III 287 T.2 Identification Registers on UltraSPARC III 288 T.2.1 JTAGID (JTAG Identification Register) 288 T.2.2 Version Register (V9) 289 T.2.3 FIREPLANE_PORT_ID MID field 289 T.2.4 Serial ID Register 290 T.2.5 Revision Register 290 T.2.6 Identification Code on UltraSPARC III Package 291

U. Memory Controller 293 U.1 Memory Subsystem 293 U.2 Programmable Registers 297 U.2.1 Memory Timing Control I (Mem_Timing1_CTL) 297 Timing Diagrams for Mem_Timing1_CTL Fields 301 Settings of Mem_Timing1_CTL 302 U.2.2 Memory Timing Control II (Mem_Timing2_CTL) 303 Timing Diagram for Mem_Timing2_CTL Fields 305 Settings of Mem_Timing2_CTL 305 U.2.3 Memory Timing Control III (Mem_Timing3_CTL) 306 rfr_int Field of Mem_Timing3_CTL Register 306 Settings of Mem_Timing3_CTL in Energy Star 1/32 Mode 307 U.2.4 Memory Timing Control IV (Mem_Timing4_CTL) 308 Settings of Mem_Timing4_CTL in Energy Star 1/32 Mode 308 U.2.5 Memory Address Decoding Registers 309 Address Decoding Logic 310 Bank-Size Settings 311 U.2.6 Memory Address Control Register 313 banksel_n_rowaddr_size Encoding 316 Settings of Mem_Address_Control 319 Bank Select/Row and Column Address Generation Logic 320 U.3 Memory Initialization 321

U.4 Energy Star 1/32 Mode 323

V. Debug and Diagnostics Support 325 V.1 Diagnostics Control and Accesses 325 V.2 Floating-Point Control 326 V.3 Data Cache Unit Control Register (DCUCR) 326 V.4 Instruction Cache Diagnostic Accesses 326 V.4.1 Instruction Cache Instruction Fields 327 V.4.2 Instruction Cache Tag/Valid Fields 327 IC_tag: I-cache tag numbers 328 V.4.3 Instruction Cache Snoop Tag Fields 329 V.5 Branch Predictor Array Accesses 331 V.6 Data Cache Diagnostic Accesses 331 V.6.1 Data Cache Data Field 332 V.6.2 Data Cache Tag/Valid Fields 333 V.6.3 Data Cache Microtag Fields 334 V.6.4 Data Cache Snoop Tag Access 334 V.6.5 Data Cache Invalidate 335 V.7 External Cache Diagnostics Accesses 336 V.7.1 External Cache Control Register 336 V.7.2 External Cache Data/ECC Fields 337 V.7.3 External Cache Data Staging Registers 337 V.7.4 External Cache Tag/State Field Diagnostics Accesses 339 V.8 Write Cache Diagnostic Accesses 340 V.8.1 Write Cache Diagnostic Valid Bits Register 340 V.8.2 Write Cache Diagnostic Bank Valid Bits Register 341 V.8.3 Write Cache Diagnostic Data Register 342 V.8.4 Write Cache Tag/Valid Fields 343 V.8.5 Write Cache Snoop Tag Register 344 V.9 Integer Unit Design for Test (DFT) 345 V.9.1 IU Shadow Scan Registers 345 V.9.2 IU Observability Bus Signals 346

Bibliography 351

S. CHAPTER 1

Overview

In this chapter, we discuss these topics:
■ Navigating the UltraSPARC III Implementation Supplement on page 1
■ The UltraSPARC III Processor on page 2
■ Chip Differences from UltraSPARC I, II on page 7

For a general overview of SPARC V9 architecture, please refer to Section 1.3 of Commonality.

1.1 Navigating the UltraSPARC III Implementation Supplement

We suggest that you approach this Implementation Supplement to the SPARC V9 Joint Programming Specification as follows. First, familiarize yourself with the UltraSPARC III processor and its components by reading these sections:
■ The UltraSPARC III Processor on page 2
■ Component Overview on page 2

If you are familiar with UltraSPARC I or II, then review their differences from UltraSPARC III in this section:
■ Chip Differences from UltraSPARC I, II on page 7

Study the terminology in Chapter 2, Definitions and Acronyms. Then, for details of architectural changes, see the remaining chapters in this portion of the book as your interests dictate. The Sun Microsystems implementation portion of the book (UltraSPARC-III) closely follows the organization of The SPARC Architecture Manual, Version 9. We suggest you keep a copy close at hand.

We added two new chapters: Chapter U, Memory Controller, and Chapter V, Debug and Diagnostics Support.


1.2 Fonts and Notational Conventions

Fonts and notational conventions are as described in Section 1.2 of Commonality.

1.3 The UltraSPARC III Processor

The UltraSPARC III processor is a high-performance, highly integrated superscalar processor implementing the 64-bit SPARC V9 RISC architecture. It can sustain the execution of up to four instructions per cycle, even in the presence of conditional branches and cache misses, mainly because the units asynchronously feed instructions and data to the rest of the pipeline. Instructions that are predicted to be executed are issued in program order to multiple functional units, executed in parallel, and for added parallelism can be completed out-of-order. To further increase the number of instructions executed per cycle, instructions from two basic blocks can be issued in the same group.

The chip fully implements the 64-bit SPARC V9 architecture. It supports a 64-bit virtual address space and a 43-bit physical address space. The core instruction set has been extended to include graphics instructions that provide the most common operations related to two-dimensional image processing, two- and three-dimensional graphics and image compression algorithms, and parallel operations on pixel data with 8- and 16-bit components.

The execution time of an application is the product of three factors:
■ The number of instructions generated by the compiler
■ The average number of cycles required per instruction
■ The cycle time of the processor

The architecture and implementation coupled with new compiler techniques make it possible to reduce each component while not degrading the other two.
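The three-factor product described above can be written out as a small calculation. This is a minimal sketch: the function name and the sample numbers are illustrative assumptions and do not describe any measured UltraSPARC III workload.

```python
def execution_time_seconds(instruction_count, cycles_per_instruction, cycle_time_seconds):
    """Execution time = instructions x average cycles per instruction x cycle time."""
    return instruction_count * cycles_per_instruction * cycle_time_seconds


# Hypothetical workload: one billion instructions, 0.5 cycles per instruction
# on average, and a 1 ns cycle time.
baseline = execution_time_seconds(1_000_000_000, 0.5, 1e-9)

# Halving any single factor halves the total, provided the other two are unchanged.
fewer_instructions = execution_time_seconds(500_000_000, 0.5, 1e-9)
```

Because the factors multiply, an improvement in one of them pays off only if it does not inflate the others, which is the point the paragraph above makes.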

1.3.1 Component Overview

The UltraSPARC III processor contains these components:
■ Instruction Issue Unit (IIU)
■ Floating Point and Graphics Unit (FGU)
■ Integer Execution Unit (IEU)
■ Data Cache Unit (DCU)
■ External Memory Unit (EMU)
■ System Interface Unit (SIU)

FIGURE 1-1 illustrates the major units, which are described in the next subsections.

[Figure: block diagram of the Instruction Issue Unit (IIU), Floating Point and Graphics Unit (FGU), Integer Execution Unit (IEU), Data Cache Unit (DCU), External Memory Unit (EMU), and System Interface Unit (SIU), with their connections to the E$ SRAM, local DRAM memory, and the system interconnect.]

FIGURE 1-1 UltraSPARC III Major Units

1.3.2 Instruction Issue Unit (IIU)

The IIU fetches instructions and dispatches them into the execution pipelines. Major blocks are defined in Table 1-1.

TABLE 1-1 Instruction Issue Unit Major Blocks

Name                            Description

Instruction cache               32-Kbyte, 4-way associative, 32-byte line, 128 bits/cycle
Branch Predict Array            16K entries, 2-bit predictor/entry
PC tracking logic
Instruction Translation Buffer  8-Kbyte pages: 128-entry, 2-way associative
                                64-Kbyte, 512-Kbyte, and 4-Mbyte locked pages: 16-entry, fully associative
Instruction queue               16-entry
Predecoding logic               Predecode bits are generated on an instruction cache fill and saved in the instruction cache.
Steering Logic                  Steers up to four instructions per cycle into the six execution pipes defined as follows:
                                  Pipe  Action
                                  A0    Integer Execute and Floating-Point Load
                                  A1    Integer Execute and Floating-Point Load
                                  MS    Load/Store/Special
                                  FGM   Floating Point and Graphics Multiply
                                  FGA   Floating Point and Graphics Add
                                  BR    Branch
Next Fetch Address calculation  A-stage logic to determine the address of the next four instruction fetch groups.

1.3.3 Integer Execution Unit (IEU)

The IEU carries out execution of all integer arithmetic, logical, and shift instructions. Table 1-2 describes the IEU major blocks.

TABLE 1-2 Integer Execution Unit Major Blocks

Block                                      Description

Working register file                      7 read ports, 3 write ports
Architectural register file                3 write ports, 1 transfer port
Two integer execution pipelines            64-bit ALU and shifters; one 64-bit virtual address for Floating-Point Load
Virtual address adder for the MS pipe      64-bit
Arithmetic and special instruction unit    Latency-tolerant integer instructions, e.g., mulx, mulscc, tadd
Trap generation and recirculate logic
Register and pipeline dependency           maxtl = 5
checking and bypassing logic               8 register file windows

1.3.4 Data Cache Unit (DCU)

The DCU handles all sourcing and sinking of data for load and store instructions. TABLE 1-3 describes the data cache major blocks.

TABLE 1-3 Data Cache Unit Major Blocks

Block                    Description

Data cache               64-Kbyte, 4-way associative, 32-byte line, write-through. Provides low latency data source for loads.
Write cache              2-Kbyte, 4-way associative, 64-byte line. Reduces store bandwidth to the external cache by merging stores.
Data Translation Buffer  8-Kbyte pages: 512-entry, 2-way associative
                         64-Kbyte, 512-Kbyte, and 4-Mbyte locked pages: 16-entry, fully associative
Store queue              Decouples the pipeline from the latency of store operations. Allows the pipeline to continue flowing while the store waits for data and eventually writes into the write cache. Stores may be grouped with the instruction upon which they depend.
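The cache parameters in TABLE 1-3 determine how many sets each cache has under the usual set-associative arithmetic (size divided by ways times line size). The following is a minimal sketch of that arithmetic only; the helper name is hypothetical, and the actual indexing and banking of the hardware arrays may differ from this textbook model.

```python
KBYTE = 1024


def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size must be a multiple of ways * line size")
    return sets


# Parameters taken from TABLE 1-3.
data_cache_sets = cache_sets(64 * KBYTE, ways=4, line_bytes=32)
write_cache_sets = cache_sets(2 * KBYTE, ways=4, line_bytes=64)

print(data_cache_sets, write_cache_sets)
```

Under this model the data cache has 512 sets and the write cache has 8 sets.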

1.3.5 Floating Point and Graphics Unit (FGU)

The Floating Point and Graphics Unit provides five pipelines to execute all floating-point instructions and most VIS graphics instructions. A combination of two floating-point or VIS operations can be started each cycle. Most floating-point instructions complete in four cycles (exceptions are div and sqrt).

Table 1-4 defines the FGU major blocks.

TABLE 1-4 Floating Point and Graphics Unit (FGU) Major Blocks

Name                                   Description

Floating-point register file           32-, 64-bit registers. 5 read and 4 write ports
Floating-point multiply pipe
Floating-point add/sub pipe
Floating-point graphics ALU pipe       Shares FRF ports with the floating-point add/sub pipe
Floating-point graphics multiply pipe  Shares FRF ports with the floating-point multiply pipe
Floating-point div/sqrt pipe           Shares FRF ports with the floating-point multiply pipe

1.3.6 External Memory Unit (EMU)

The External Memory Unit controls the operation of the external cache and external DRAM memory. Table 1-5 describes the major blocks of the EMU.

TABLE 1-5 Major Units of the External Memory Unit

Name                    Description

External cache tags     90 Kbytes, 1-cycle throughput, 2-cycle latency.
Writeback buffer        512 bytes. Holds 2 or 4 dirty external cache lines, depending on the external cache configuration.
External cache control  Schedules and services the primary-cache and system requests for external cache data. The unit also controls the read and update of external cache tags to maintain cache coherency.
DRAM controller         Supports up to 4 banks, 8 Gbytes of EDO DRAM.

The External Cache Unit (ECU) and Memory Control Unit (MCU) are contained in the EMU and are included in the above description.

1.3.7 System Interface Unit (SIU)

Table 1-6 describes the SIU major blocks.

TABLE 1-6 System Interface Unit Major Blocks

Name                                Description

External data switch control        Control logic schedules all movement of data between the data switch and local memory.
UltraSPARC III data switch control  Control logic schedules all movement of data between the data switch and UltraSPARC III/local memory/Sun™ Fireplane Interconnect.
Data path control                   Control logic and data path for incoming data.
Control registers control           Programmable control registers, control logic for error reporting, interrupt, ASI read/write access.
Pending queues                      Coherent pending queue (CPQ); noncoherent pending queue (NCPQ) to store all pending transactions; outgoing request queue (ORQ); and pending tag array (PTA).
Pending queue control               Control logic to enqueue and dequeue the transactions in CPQ and NCPQ; control logic for snoop pipeline (PTA), bootbus interface, HBM control, and E* control.
Transaction outgoing buffer         Buffers for requests and data from E$ and W$ and control logic to generate the outgoing Fireplane Interconnect transaction and data.

1.4 Chip Differences from UltraSPARC I, II

The UltraSPARC III processor differs from previous UltraSPARC processors in several key areas, including:
■ Bootbus limitations
■ Instruction set extensions
■ Instruction differences
■ Memory subsystem
■ Interrupts
■ Address space size
■ Error correction
■ Registers
■ Noncacheable store compression

This section describes the differences and concludes with a summary table of differences.

1.4.1 Bootbus Limitations

All bootbus addresses must be mapped as side-effect pages with the TTE E bit set. In addition, programmers must not issue the following memory operations to any bootbus address:
■ Prefetch instructions
■ Block load and store instructions
■ Any memory operations with ASI_PHYS_USE_EC or ASI_PHYS_USE_EC_LITTLE
■ Partial store instructions

1.4.2 Instruction Set Extensions

UltraSPARC III has added Sun proprietary extensions to the SPARC V9 Instruction Set Architecture (ISA), in addition to those implemented in UltraSPARC I. The extensions were defined with close cooperation of the software groups (WABI, Graphics, OS, Compiler) that will use them. The extensions are in the areas of VIS extensions, prefetch enhancement, and interval arithmetic support.

VIS Extensions

Three new VIS instructions were added:
■ Byte Mask: Sets the Graphics Status Register (GSR) for a following byte shuffle operation. One byte mask can be issued per instruction group as the last instruction of the group. Byte Mask is a break-after instruction.
■ Byte Shuffle: Allows any set of 8 bytes to be extracted from a pair of double-precision, floating-point registers and written to a destination double-precision, floating-point register. The 32-bit byte mask field of the GSR specifies the pattern of source bytes for the byte shuffle instruction.
■ Edge(ncc): Two variants: the original instruction sets the integer condition codes, and the new instruction does not set condition codes. Differences between the variants are as follows:

Edge | Edgencc
Sets integer condition codes | Does not set integer condition codes
Single-instruction group | Groupable with other instructions

Because of implementation restrictions in the pipeline, all instructions that set condition codes and execute in the MS pipeline stage must be in a single instruction group.
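As an illustration of the Byte Shuffle description above, the following minimal Python sketch models the byte selection on plain integers. The function name `bshuffle`, the big-endian numbering of the 16 source bytes, and the mapping of mask bits <31:28> to the most significant destination byte are assumptions made for the sketch; consult Commonality for the authoritative definition.

```python
def bshuffle(rs1: int, rs2: int, gsr_mask: int) -> int:
    """Model Byte Shuffle on two 64-bit source values and a 32-bit mask.

    Hypothetical helper for illustration; not taken from the specification.
    """
    # Assumed numbering: source byte 0 is the most significant byte of rs1,
    # source byte 15 is the least significant byte of rs2.
    src = rs1.to_bytes(8, "big") + rs2.to_bytes(8, "big")
    out = bytearray(8)
    for i in range(8):
        # Assumed layout: each 4-bit nibble of the mask selects one source
        # byte, with mask<31:28> choosing the most significant result byte.
        selector = (gsr_mask >> (28 - 4 * i)) & 0xF
        out[i] = src[selector]
    return int.from_bytes(out, "big")
```

Under these assumptions, a mask of 0x01234567 copies the first source unchanged, and 0xFEDCBA98 returns the bytes of the second source in reverse order.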

Interval Arithmetic Support

One new instruction was added to improve the efficiency of interval arithmetic computations. The Set Interval Arithmetic Mode (SIAM) instruction enables the rounding mode bits in the Floating-Point Status Register (FSR) to be overridden without the overhead of modifying the RD field of the FSR. Updates directly to FSR are expensive because they flush the pipeline.

Please refer to A.55 in Commonality for details.

1.4.3 Instruction Differences

Several instructions changed.
■ SHUTDOWN: Energy Star compliance is achieved through a different mechanism than that used by the UltraSPARC I processor. For compatibility, the SHUTDOWN instruction in UltraSPARC III executes as a NOP. See SHUTDOWN Instruction (VIS I) on page 94 for more details.
■ FLUSH: Since the processor maintains consistency between the instruction cache and all store and atomic instructions, the FLUSH instruction is used only to clear the pipeline. Unlike the case with the UltraSPARC I processor, the FLUSH address is ignored. It is not used for instruction cache flushing and is not propagated to the system. A single FLUSH at the end of a sequence of stores is sufficient to synchronize the pipeline.
■ Floating-point conversion instructions: Because of implementation restrictions, the following conversion instructions between integer and floating-point formats generate an unfinished_FPop exception for certain ranges of operands or results, as shown in TABLE 1-7.

TABLE 1-7 Integer/Floating-Point to Floating-Point/Integer unfinished_FPop Exception Conditions

Instruction | Unfinished Trap Ranges
FsTOi | result < −2^31, result ≥ 2^31, Inf, NaN
FsTOx | |result| ≥ 2^52, Inf, NaN
FdTOi | result < −2^31, result ≥ 2^31, Inf, NaN
FdTOx | |result| ≥ 2^52, Inf, NaN
FiTOs | operand < −2^22, operand ≥ 2^22
FxTOs | operand < −2^22, operand ≥ 2^22
FxTOd | operand < −2^51, operand ≥ 2^51

When the above instructions take an unfinished_FPop trap, system software must properly emulate the instruction and resume execution.

■ Floating-point subnormal/NaN handling: Because of implementation restrictions, the processor generates an unfinished_FPop exception over a different range of subnormal and NaN operands and results than that of the UltraSPARC I processor. When an unfinished_FPop trap is generated, it is expected that system software will properly emulate the instruction and resume execution. See NaN Operands on page 100 for more details on FPop operand ranges that generate an unfinished_FPop trap.
■ Ticc reserved field checking: The processor checks the reserved field of the Ticc instruction for zero and generates an illegal_instruction trap if the field is nonzero. Neither UltraSPARC I nor UltraSPARC II processors checked the Ticc reserved field for zero.
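The ranges in TABLE 1-7 can be expressed as two small predicates, which may be useful when writing or testing an emulation handler. This is a sketch only: the function names are hypothetical, and "result" is assumed here to mean the source value converted with round-toward-zero.

```python
import math

# Exponent n of the 2^n limit for each integer-to-floating-point conversion.
INT_TO_FP_LIMIT = {"FiTOs": 22, "FxTOs": 22, "FxTOd": 51}


def int_to_fp_unfinished(instruction: str, operand: int) -> bool:
    """True if the integer operand lies in a TABLE 1-7 trap range."""
    n = INT_TO_FP_LIMIT[instruction]
    return operand < -(1 << n) or operand >= (1 << n)


def fp_to_int_unfinished(instruction: str, value: float) -> bool:
    """True if the floating-point source lies in a TABLE 1-7 trap range."""
    if math.isnan(value) or math.isinf(value):
        return True
    result = math.trunc(value)  # assumption: result is rounded toward zero
    if instruction in ("FsTOi", "FdTOi"):
        return result < -(1 << 31) or result >= (1 << 31)
    if instruction in ("FsTOx", "FdTOx"):
        return abs(result) >= (1 << 52)
    raise ValueError(f"unknown instruction: {instruction}")
```

Note the asymmetry in the table: the negative limit itself (for example, −2^22 for FiTOs) does not trap, while the positive limit does.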

1.4.4 Memory Subsystem

The memory subsystem design is new. Differences include changes in the caches, cache flushing, and translation lookaside buffers (TLBs).

Caches

The UltraSPARC III memory system comprises five caches: four on-chip and one external to the chip.
■ Data cache (D-cache): A 64-Kbyte, 4-way associative, virtually indexed, physically tagged (VIPT) cache. The D-cache is write-through, no write-allocate, not included in the external cache. The line size is 32 bytes, no subblocking. The D-cache needs to be flushed only if an alias is created with virtual address bit 13. VA<13> is the only virtual bit used to index the D-cache.
■ Instruction cache (I-cache): A 32-Kbyte, 4-way associative, physically indexed, physically tagged (PIPT) cache. The I-cache is not included in the external cache. The line size is 32 bytes, no subblocking. The I-cache is kept consistent with the store stream of the processor as well as with external stores from other processors. The I-cache never needs to be flushed, not even for address aliases.
■ Write cache (W-cache): A 2-Kbyte, 4-way associative, physically indexed, physically tagged (PIPT) cache. The line size is 64 bytes with 32-byte subblocks. The W-cache reduces bandwidth to the external cache by coalescing and bursting stores to the external cache. The W-cache is included in the external cache; all lines in the W-cache have a corresponding line allocated in the external cache. The data state of the W-cache line always supersedes the state of the data in the corresponding external cache line. It is necessary to flush the W-cache for stable storage.
■ External cache (E-cache): A 1- to 8-Mbyte, direct mapped, physically indexed, physically tagged cache. The E-cache is write-allocate, write-back. It is necessary to flush the E-cache for stable storage.

Cache Flushing

The UltraSPARC III D-cache differs in size and organization from the UltraSPARC I D-cache and so requires changes to the algorithms used to flush the cache.

The virtually indexed caches need to be flushed when a virtual address alias is created. Caches that contain modified data need to be flushed for stable storage.

Following are flushing requirements for specific caches:
■ Data cache: The UltraSPARC III D-cache is the only cache that needs to be flushed when a virtual address alias is created. Like the UltraSPARC I D-cache, the UltraSPARC III D-cache uses one virtual address bit for indexing the cache and thus creates an alias boundary of 16 Kbytes for the D-cache.
■ Instruction cache: The processor maintains consistency of the on-chip I-cache with the stores from all processors so that a FLUSH instruction is needed only to ensure the pipeline is consistent. This means a single flush is sufficient at the end of a sequence of stores that updates the instruction stream to ensure correct operation. Unlike the case with the UltraSPARC I processor, the FLUSH instruction does not propagate externally since all I-caches in an UltraSPARC III multiprocessor system are maintained consistent. Since the I-cache is a PIPT cache, it does not have to be flushed for virtual address aliases. The I-cache never contains modified data, so it does not have to be flushed for stable storage.
■ External and write caches: Since the E-cache and W-cache can contain modified data, they must be flushed for stable storage. The W-cache is included in the E-cache, so it is sufficient to flush a block from the E-cache; if there is a corresponding block in the W-cache, it will also be flushed. The recommended procedure to flush modified data from the E-cache back to memory is as follows:
  ■ Load the block (64 bytes) into the floating-point registers by using FP loads or Block Load.
  ■ Write the floating-point registers to memory with a Block Store Commit.
  ■ Issue MEMBAR #Sync to ensure completion.
  The Block Store with Commit instruction will invalidate the block from both the E-cache and the W-cache. Both of these caches are physically indexed, so they do not need to be flushed for address aliases.
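The 16-Kbyte alias boundary quoted above follows from the D-cache geometry (64 Kbytes, 4-way). The sketch below works through that arithmetic and checks whether two virtual mappings of the same physical page index different D-cache sets. The function names are hypothetical, and the 8-Kbyte minimum page size is an assumption taken from the 8-Kbyte TLB page size mentioned in this chapter.

```python
DCACHE_BYTES = 64 * 1024   # D-cache capacity
DCACHE_WAYS = 4            # 4-way associative
PAGE_BYTES = 8 * 1024      # assumed minimum page size


def alias_boundary() -> int:
    """Bytes covered by one way, which is the span indexed by the address."""
    return DCACHE_BYTES // DCACHE_WAYS


def needs_dcache_flush(va_a: int, va_b: int) -> bool:
    """True if two virtual aliases of one physical page index differently."""
    # Index bits above the page offset are virtual; here that is only VA<13>.
    index_mask = (alias_boundary() - 1) & ~(PAGE_BYTES - 1)
    return (va_a & index_mask) != (va_b & index_mask)
```

Under these assumptions the virtual index mask is 0x2000, so aliases whose virtual addresses differ by a multiple of 16 Kbytes do not require a flush.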

Translation Lookaside Buffers (TLBs)

The instruction and data translation lookaside buffers share the same organization in the UltraSPARC III processor. The instruction and data (I/D) TLBs use a 16-entry, fully associative TLB to hold entries for 4-Mbyte and 512-Kbyte page sizes and all locked pages of any size.

In addition to the small, fully associative TLB, UltraSPARC III has two large TLBs used exclusively for 8-Kbyte page entries. For data accesses, the TLB is a 512-entry, 2-way associative D-TLB. For instruction accesses, the TLB is a 128-entry, 2-way associative I-TLB.

Note – The lock bit is not used in the 8-Kbyte TLBs.

Other TLB differences are described below.
■ TLB flushing: Both the instruction and data TLBs now have a demap-all operation that removes all unlocked Translation Table Entries (TTEs). See I/D TLB CAM Diagnostic Register on page 137 for details.
■ TTE format: The UltraSPARC III processor now has the additional elements in the TTE format:
  ■ Physical Address field: Expanded from 28 bits (PA<40:13>, TTE<40:13>) to 30 bits (PA<42:13>, TTE<42:13>).
  See Translation Table Entry (TTE) on page 130 for more information.
■ Synchronous Fault Status Registers (SFSR) extensions: A new fault type was added to the FT field of the SFSR to indicate an I/D TLB miss, and one status bit was added to the I/D TLB SFSRs:
  ■ NF: Set to signify that the faulting operation was a speculative load instruction.
  See I/D Synchronous Fault Status Registers (SFSR) in Commonality for more details on the SFSR.
■ I/D Translation Storage Buffer (TSB) Register: Three new register extensions of the I/D TSB register were added to the UltraSPARC III processor. These registers allow a different TSB virtual address base to be used for each of the three virtual address spaces (primary, secondary, nucleus) in the DTLB and two virtual address spaces (primary, nucleus) in the ITLB. On an I/D TLB miss, the processor selects which TSB Extension Register to use to form the TSB base address, based on the virtual space accessed by the faulting instruction. See I/D TSB Extension Registers on page 136 for more details.
■ TLB Data Access Register: The access address for the TLB Data Access Register has been expanded to enable access to three TLBs, each with up to 512 entries. See I/D TLB CAM Diagnostic Register on page 137 for details.
■ TLB Diagnostic Register: A new register replaces the function of the diagnostic bits in the TTE. See I/D TLB CAM Diagnostic Register on page 137.
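The widening of the TTE Physical Address field described above can be illustrated with a pair of bit masks. This is a minimal sketch: the helper names are hypothetical, and only the bit ranges TTE<40:13> and TTE<42:13> come from the text.

```python
def field_mask(hi: int, lo: int) -> int:
    """Mask with bits hi..lo (inclusive) set."""
    return ((1 << (hi - lo + 1)) - 1) << lo


PREVIOUS_PA_MASK = field_mask(40, 13)   # 28-bit field, PA<40:13>
ULTRA3_PA_MASK = field_mask(42, 13)     # 30-bit field, PA<42:13>


def tte_physical_page(tte_data: int, pa_mask: int = ULTRA3_PA_MASK) -> int:
    """Return the physical page base address held in a TTE data word."""
    return tte_data & pa_mask
```

A TTE data word with bit 42 set keeps that bit only under the wider mask, which is what makes the PA<42> = 1 address space reachable through the MMU.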

1.4.5 Interrupts

The UltraSPARC III processor extends the sun-5 interrupt architecture previously implemented in the UltraSPARC I processor in these areas:
■ Module ID fields: Extended from 5 bits to 10 bits.
■ Interrupt Transmit BUSY/NACK bits: Extended from 1 pair to 32 pairs, enabling pipelining of outgoing interrupts.
■ Data Dispatch and Receive Registers: Expanded from 3 to 8, enabling up to 64 bytes to be transmitted in an interrupt.
■ System Tick Interrupt bit: Added to the soft interrupt register.
■ Interrupts: Now occur only before AX pipe instructions.

For more details, please refer to Appendix N, Interrupt Handling.

1.4.6 Address Space Size

The UltraSPARC III processor extends both the virtual and physical address space previously implemented. It implements the full 64-bit virtual address range defined in the SPARC V9 architecture. There are no VA holes, as in UltraSPARC I. The physical address range has also been extended from 41 bits to 43 bits.

Physical addresses with PA<42> = 1 form the noncacheable address space. Physical addresses 40000000000₁₆ to 7FFFFFFFFFF₁₆ are therefore in the noncacheable area.
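The PA<42> rule above amounts to a single bit test on a 43-bit physical address. The following sketch shows that test; the function name is hypothetical.

```python
PA_BITS = 43  # physical address width stated in this section


def is_noncacheable(pa: int) -> bool:
    """True if the physical address lies in the noncacheable space."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 43 bits")
    return bool((pa >> 42) & 1)  # PA<42> = 1 selects noncacheable space
```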

1.4.7 Error Correction

Error correction differs from the UltraSPARC I and II handling, as follows:
■ External cache: The processor uses ECC protection on the E-cache instead of parity protection. It requires software correction and recovery for single-bit E-cache ECC read errors, which are signaled as a precise error. See E-cache Data ECC Error on page 191 for details.
■ System interface: A new ECC code has been defined for ECC protection across 132 data bits (9 ECC bits) and 3 MTag bits (4 ECC bits) on the system bus and on the data switch. The syndromes for these codes differ from the syndromes used previously. The processor requires software correction and recovery for single-bit system ECC errors, which are signaled as disrupting errors. See ECC Syndromes on page 208 for details.

1.4.8 Registers

Differences in registers include enhancements to ASI registers and ASR registers.

Address Space Identifier (ASI) Registers

Changes to the ASI registers include those to the following registers:
■ SRAM diagnostic registers: Several new diagnostic ASI registers were added for the following on-chip SRAMs:
  ■ Write cache
  ■ Branch predict array
  ■ I/D TLB CAM
  Changes were made to fields of existing UltraSPARC II diagnostic ASI registers for the following on-chip SRAMs:
  ■ Data cache
  ■ External cache
  ■ Instruction cache
  These ASI registers were removed:
  ■ UDB Error Register
  ■ UDB Control Register
  See Chapter , Debug and Diagnostics Support, for more details.
■ Instruction Trap Register: An instruction breakpoint register was added to enable an illegal_instruction trap to be generated on an arbitrary opcode. See Section N.5, Software Interrupt Register, in Commonality for details.
■ Asynchronous Fault Status Register (AFSR): Several changes were made to add new fault types (E-cache ECC errors) and remove old fault types (SDB errors). See Asynchronous Fault Status Register on page 204 for details.
■ Asynchronous Fault Address Register (AFAR): The AFAR was extended to handle a 43-bit physical address. It is now updated on several errors that previously did not capture the address. See Asynchronous Fault Status Register on page 204 for details.
■ Software Interrupt Register (SOFTINT): The SOFTINT register has an additional bit added to signal SYSTEM TICK COMPARE interrupts. See Section N.5, Software Interrupt Register, in Commonality for details.
■ System Interface Registers: The UPA interface ASI has been reused for two new Fireplane Interconnect registers: a configuration register and an address register. See Fireplane Interconnect ASI Extensions on page 278 for details.

Ancillary State Registers (ASRs)

Changes to the ASRs include changes to the following registers:
■ System Tick and Compare: Two new ASRs were added to support a system clock: ASR 18₁₆, a System Tick Register (analogous to the per-processor tick register ASR 4₁₆), and ASR 19₁₆, a System Tick Compare Register (analogous to the per-processor tick compare register 17₁₆). For details, see TICK Register on page 115.
■ Graphics Status Register (GSR): New fields were added to the GSR:
  ■ 32-bit MASK field used by the BSHUFFLE instruction
  ■ 1-bit IM field to enable interval arithmetic round mode
  ■ 2-bit IRND field to specify round mode for interval arithmetic
  For details see Section 5.2.11 of Commonality.
■ Performance Control Register (PCR): The Performance Control Register has been extended to enable a larger number of performance events to be measured. For details, see Performance Instrumentation Counter Events on page 268.
■ Dispatch Control Register (DCR): Many control fields were added to the Dispatch Control Register to aid in debugging first silicon. For details, see Dispatch Control Register (DCR) (ASR 18) on page 30.

1.4.9 Noncacheable Store Compression

Like previous implementations, the UltraSPARC III processor uses a 16-byte buffer to merge adjacent noncacheable stores into a single external data transaction. This merging greatly increases store bandwidth to the graphics frame buffer. A change in the algorithm for determining when to break merging improves store bandwidth to graphics devices.

For more details see Store Compression on page 74.

1.4.10 Summary of Differences

TABLE 1-8 summarizes the differences between the UltraSPARC II and UltraSPARC III processors.

TABLE 1-8 UltraSPARC III and UltraSPARC II Differences (1 of 3)

Element | UltraSPARC III Processor | UltraSPARC II Processor
trap | Waits for all outstanding instructions and all deferred errors. (Note: MEMBAR #Sync has error isolation, too.) See Deferred Traps on page 188. | Waits for all outstanding instructions.
PSTATE.tle (trap little-endian) after WDR/XIR/SIR/RED | Unchanged. See TABLE O-1 on page 180. | 0. See Table 10-1 in UltraSPARC I & II User’s Manual.
Internal ASI load/store and MEMBAR #Sync/DONE RETRY / | MEMBAR #Sync is required 1. Makes side effects visible. 2. Between internal ASI store and internal ASI load. See TABLE 8-4 on page 72. | Makes side effects visible.
MEMBAR Semantics | See TABLE 8-1 on page 68 |
MEMBAR #LoadLoad | NOP | All loads wait for completion of all loads.
MEMBAR #StoreLoad | #Sync | All loads wait for completion of all loads.
MEMBAR #LoadStore | NOP | All stores wait for completion of all loads.
MEMBAR #StoreStore, STBAR | NOP | All stores wait for completion of all stores.
MEMBAR #Lookaside | #Sync | Waits until all outstanding memory accesses are complete.
MEMBAR #Sync | #Sync | #Sync (waits for all outstanding instructions and all deferred errors).
FLUSH | Flushes all fetched instructions (address don’t care). See Instruction Differences on page 9. | Creates an I$ invalidate request. (MEMBAR #StoreStore is needed in PSO and RMO.)
Store and block store | Always creates I$ invalidate request. | Never creates I$ invalidate request.
Memory model support | TSO only [arch outstanding bug #7012] | TSO (Total Store Order): Process loads in program order. Process stores in program order. PSO (Partial Store Order): Stores are not ordered with respect to each other. RMO (Relaxed Memory Order): No priority rules for any two memory references.
prefetch instruction | Supported. See Instruction Differences on page 9. | Supported.

TABLE 1-8 UltraSPARC III and UltraSPARC II Differences (2 of 3)

Element | UltraSPARC III Processor | UltraSPARC II Processor
Load miss | Recirculates (blocking) | Goes to load buffer (nonblocking).
Store miss | Goes to W$ | Goes to store buffer.
Virtual address | 64 bit | 44 bit (with VA hole)
Physical address | 43 bit | 41 bit
SHUTDOWN instruction | NOP (privileged). | Goes to power-down mode.
Demap for IMMU | Doesn’t support demap with secondary context. See TABLE F-14 in Commonality. | Supported.
TPC/TNPC after XIR | Inaccurate. (TPC = PC and ~1F₁₆, TNPC = TPC+4) TPC/TnPC and Resets on page 114. Note: Neither UltraSPARC III nor UltraSPARC II can return to previous condition after XIR. | Accurate.
TICK_CMPR.tick_cmpr after POR | 0. See TABLE O-1 on page 180 | Unknown.
GSR.IM after POR | 0. See TABLE O-1 on page 180. | Unknown.
OBS control | By obs[0] pin and %dcr.obs. See OBS document. | %dcr.obs ASR register
PIC overflow | Causes trap type 4F₁₆. See Performance Control and Counters on page 265. | Does not check.
Reserved field of TCC instruction | Checked. See Instruction Differences on page 9. | Does not check.
Subnormal handling for FPU | More cases cause unfinished_FPop trap than with UltraSPARC II. | Some cases cause unfinished_FPop trap.
sequence_error by FPU | Never causes sequence_error | May cause sequence_error.
FLUSH instruction causes trap? | Never. | May cause data_access_MMU_miss; may cause data_access_exception
prefetcha instruction to internal ASI | MEMBAR #Sync is required. | MEMBAR #Sync is not required because prefetcha is NOP.
TTE<49:43> | <49:48> is reserved; <47> snoop bit | <49:41> is Diag
Async fault status register | *** Bit assignment is completely different *** |
I/D TSB Extension ASI Registers | Available. | Not available.
ASI store/load to/from I$ valid bit during I$ on | Forbidden. PRM open bug #7138 | No problem.
I$ | Physically indexed, physically tagged (with microtag). | Physically indexed, physically tagged.

TABLE 1-8 UltraSPARC III and UltraSPARC II Differences (3 of 3)

Element | UltraSPARC III Processor | UltraSPARC II Processor
D$ | Virtually indexed, physically tagged. | Virtually indexed, physically tagged.
DMMU-SFSR/DMMU-SFAR | dtlb miss trap update. | dtlb miss trap does not update.
FSR.NS=1 & NaN | taken fp_exception_ieee_754. | Trap not taken.
fadd/fsub/fdivs with NaN/Subnormal operand | Sometimes takes fp_exception_ieee_754 or unfinished_FPop trap. | Takes no trap.

S. CHAPTER 2

Definitions and Acronyms

Terminology and acronyms specific to UltraSPARC III, the Sun Microsystems implementation of SPARC V9, are presented below. For definition of terms that are common to all implementations, please refer to Chapter 2 of Commonality.

ARF Architectural Register File.

Ax Either the A0 or A1 pipeline.

BBC Bootbus controller.

CDS Crossbar Data Switch.

CSR Control Status Register.

CPQ Coherent pending queue.

D-cache processor Data memory cache.

DCU Data Cache Unit.

DFT Designed for test.

DQM Data input/output Mask. Q stands for either input or output.

E-cache External memory cache (external to the processor chip).

ECU External Cache Unit.

EMU External Memory Unit.

FFA Synonym for FGA.

FGA Floating-point/Graphics ALU pipeline.

FGM Floating-point/Graphics Multiply pipeline.

FGU Floating Point and Graphics Unit.

FP0 Synonym for FGM.

FP1 Synonym for FGA.

FRF Floating-point Register File.

HBM Hierarchical Bus Mode.

HPE Hardware Prefetch Enable.

I-cache processor Instruction memory.

IEU Instruction Execution Unit.

IIU Instruction Issue Unit.

LPA Local Physical Address.

MCU Memory Control Unit.

module A master or slave device that attaches to the shared-memory bus.

MOESI A cache-coherence protocol. Each of the letters stands for one of the states that a cache line can be in, as follows: M, modified, dirty data with no outstanding shared copy; O, owned, dirty data with outstanding shared copy(s); E, exclusive, clean data with no outstanding shared copy; S, shared, clean data with outstanding shared copy(s); I, invalid, invalid data.
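The MOESI state definitions above reduce to two properties per state: whether the line holds dirty data and whether another shared copy may be outstanding. A small lookup sketch follows; the class and attribute names are illustrative and do not come from the specification.

```python
from enum import Enum


class LineState(Enum):
    """MOESI states as (description, dirty, shared-copy-outstanding)."""
    M = ("modified", True, False)
    O = ("owned", True, True)
    E = ("exclusive", False, False)
    S = ("shared", False, True)
    I = ("invalid", False, False)

    @property
    def dirty(self) -> bool:
        return self.value[1]

    @property
    def shared(self) -> bool:
        return self.value[2]


def holds_only_valid_copy_of_data(state: LineState) -> bool:
    """Dirty lines (M or O) differ from memory and must not be dropped."""
    return state.dirty
```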

NCPQ Noncoherent pending queue.

ORQ Outgoing request queue.

PIPT Physically indexed, physically tagged.

PIVT Physically indexed, virtually tagged.

PTA Pending tag array.

RTO Read to own.

RTOR Read to own remote. A reissued RTO transaction.

RTS Read to share.

RTSM Read to share Mtag. An RTS to modify MTag transaction.

SAM SPARC Architecture Manual, Version 9.

scrub To write data from the W-cache to the E-cache.

SDRAM Synchronous Dynamic Random Access Memory

SIG Single-Instruction Group; sometimes shortened to “single-group.”

SIU System Interface Unit.

SPE Software prefetch enable.

SSM Scalable shared memory.

UE User process error.

victimize [Error handling]

VIPT Virtually indexed, physically tagged.

VIS Visual Instruction Set.

VIVT Virtually indexed, virtually tagged.

WAS Write after write.

WRF Working Register File.

S. CHAPTER 3

Architectural Overview

Please refer to Chapter 3 in Commonality.

S. CHAPTER 4

Data Formats

Please refer to Chapter 4 in Commonality. That chapter contains information from Chapter 4, Data Formats, from The SPARC Architecture Manual, version 9.

Following is information specific to the UltraSPARC III implementation of SPARC V9:
■ Floating-Point, Quad-Precision on page 25
■ Graphics Data Formats on page 25

Note that section headings below correspond to those of Chapter 4 in Commonality.

4.2.3 Floating-Point, Quad-Precision

TABLE 4-6 in Chapter 4 of Commonality describes the memory and register alignment for doubleword and quadword. Please note the following:

Implementation Note – Floating-point quad is not implemented in the processor. Quad-precision operations, except floating-point multiply-add and multiply-subtract, are emulated in the OS kernel.

4.3 Graphics Data Formats

Graphics instructions are optimized for short integer arithmetic, where the overhead of converting to and from floating point is significant. Image components can be 8 or 16 bits; intermediate results are 16 or 32 bits.

Sun frame buffer pixel component ordering is α,G,B,R.

Please refer to Section 4.3 in Commonality for details of pixel graphics format (4.3.1), Fixed16 Graphics format (4.3.2), and Fixed32 Graphics format (4.3.3).

S. CHAPTER 5

Registers

For general information, please see Chapter 5 in Commonality. Implementation-specific information about the following registers is presented below (section numbers correspond to those in Commonality):
■ FSR_nonstandard_fp (NS) on page 27
■ PSTATE Register on page 29
■ Version (VER) Register on page 29
■ Performance Control Register (PCR) (ASR 16) on page 30
■ Performance Instrumentation Counter (PIC) Register (ASR 17) on page 30
■ Dispatch Control Register (DCR) (ASR 18) on page 30
■ Data Cache Unit Control Register (DCUCR) on page 33
■ Data Watchpoint Registers on page 35
■ Instruction Trap Register on page 36

For information on MMU registers, please refer to Section F.10, Internal Registers and ASI Operations, on page 134.

In addition, several registers are described in Chapter , Debug and Diagnostics Support, including the following:
■ Miscellaneous registers supporting diagnostic accesses
■ Integer Unit Design for Test (DFT) on page 345 (shadow scan copies of IU registers)

The SPARC V9 architecture also defines two implementation-dependent registers: the IU Deferred-Trap Queue and the Floating-point Deferred-Trap Queue (FQ); the UltraSPARC III processor does not implement either of these queues. See Chapter 5 of Commonality for more information about deferred-trap queues.

5.1.7 Floating-Point State Register (FSR)

FSR_nonstandard_fp (NS)

If a floating-point operation generates a subnormal value on UltraSPARC III and FSR.NS = 1, the value is replaced by a floating-point zero value of the same sign (impl. dep. #18). This replacement is usually performed in hardware. However, in the following cases, when a subnormal value is generated in the course of the instruction and FSR.NS = 1, an fp_exception_other exception with FSR.ftt = 2 (unfinished_FPop) is taken and trap handler software is expected to replace the subnormal value with a zero value of the appropriate sign:
■ fadd of numbers with opposite signs
■ fsub of numbers with the same signs
■ fdtos
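The replacement that a trap handler is expected to perform in these cases can be sketched as follows. This is an illustrative model for double-precision values only, with hypothetical function names; it is not the handler used by any particular operating system.

```python
import math
import struct


def is_subnormal(x: float) -> bool:
    """True if x is a nonzero double with a zero biased exponent."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0 and fraction != 0


def apply_ns(result: float, fsr_ns: int) -> float:
    """Replace a subnormal result with a same-signed zero when FSR.NS = 1."""
    if fsr_ns == 1 and is_subnormal(result):
        return math.copysign(0.0, result)
    return result
```

With FSR.NS = 0 the value passes through unchanged, which matches the description above that the replacement applies only in nonstandard mode.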

FSR_floating-point_trap_type (ftt)

UltraSPARC III triggers fp_exception_other with trap type unfinished_FPop under the conditions described in Section B.6. These conditions differ from the JPS1 “standard” set and are described in Section 5.1.7 of Commonality (impl. dep. #248).

FSR_current_exception (cexc)

UltraSPARC III follows the specification in Table 5-10 of Commonality, for treatment of FSR.cexc when a floating-point operation causes an overflow or underflow, with differences in two cases. Those cases are listed in the following table, with the differences marked in bold italic font:

TABLE 5-10 Setting of FSR.cexc bits

Column groups: Exception(s) Detected in f.p. operation (of, uf, nx); Trap Enable Mask bits in FSR.TEM (OFM, UFM, NXM); Current Exception bits in FSR.cexc (ofc, ufc, nxc).

of | uf | nx | OFM | UFM | NXM | fp_exception_ieee_754 Trap Occurs? | ofc | ufc | nxc | Notes
- | ✔ | ✔ | x | 0 | 1 | yes | 0 | 1 | 0 |
✔ | - | ✔ | 0 | x | 1 | yes | 1 | 0 | 0 | (2)

Notes: (2) Overflow is always accompanied by inexact.

Although UltraSPARC III hardware returns the FSR.cexc settings shown above, it is assumed that the trap handler for the fp_exception_ieee_754 trap will change FSR.cexc to correspond to the table in Commonality before FSR.cexc is observed by nonprivileged software. Therefore, as observed by nonprivileged software, the system will appear to adhere to Table 5-10 in Commonality.

5.2.1 PSTATE Register

The UltraSPARC III processor supports two additional sets (privileged only) of eight 64-bit global registers: interrupt globals and MMU globals. These additional registers are called the trap globals. Two 1-bit fields, PSTATE.IG and PSTATE.MG, were added to the PSTATE register to select which set of global registers to use.

When PSTATE.AM = 1, UltraSPARC III writes the full 64-bit program counter value to the destination register of a CALL, JMPL, or RDPC instruction. When PSTATE.AM = 1 and a trap occurs, UltraSPARC III writes the full 64-bit program counter value to TPC[TL] (impl. dep. #125).

When PSTATE.AM = 1 and an exception occurs, UltraSPARC III writes the full 64-bit address to the D-SFAR (impl. dep. #241).

Note – Exiting RED_state by writing 0 to PSTATE.RED in the delay slot of a JMPL instruction is not recommended. A noncacheable instruction prefetch can be made to the jmpl target, which can be in a cacheable memory area. That case may result in a bus error on some systems and cause an instruction_access_error trap. Programmers can mask the trap by setting the NCEEN bit in the E-cache Error Enable Register to 0, but this solution masks all noncorrectable error checking. Exiting RED_state with DONE or RETRY avoids the problem.

5.2.9 Version (VER) Register

The version register, VER, is a read-only register that specifies fixed parameters of an implementation. It is read with the RDPR instruction.

On UltraSPARC III, the VER register reads as listed in TABLE 5-11.

TABLE 5-11 VER Register Encoding in UltraSPARC III

Bit     VER Field   Value
63:48   manuf       003E₁₆ (Sun’s JEDEC code) (impl. dep. #104)
47:32   impl        0014₁₆ (impl. dep. #13)
31:24   mask        Depends on the tape-out, as listed below
23:16   Reserved    0
15:8    maxtl       5
7:5     Reserved    0
4:0     maxwin      7

Tape-out #   VER.mask value
3.2          52₁₆
3.4          54₁₆
3.9          59₁₆
3.B          5B₁₆
3.C          5C₁₆

If any additional revision of UltraSPARC III is fabricated, it will receive a mask number greater than 5C₁₆.
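The field layout in TABLE 5-11 can be exercised with a short sketch. The helper names (`extract`, `decode_ver`) and the example value are illustrative assumptions, not part of any Sun software; the value is simply assembled from the table entries for tape-out 3.4.

```python
# Bit ranges (high, low) taken from TABLE 5-11; reserved fields are omitted.
VER_FIELDS = {
    "manuf": (63, 48),
    "impl": (47, 32),
    "mask": (31, 24),
    "maxtl": (15, 8),
    "maxwin": (4, 0),
}


def extract(value, hi, lo):
    """Return bits hi:lo of a 64-bit register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_ver(ver):
    """Split a raw VER value into its named fields."""
    return {name: extract(ver, hi, lo) for name, (hi, lo) in VER_FIELDS.items()}


# Hypothetical raw value built from the table for tape-out 3.4 (mask = 0x54).
example_ver = (0x003E << 48) | (0x0014 << 32) | (0x54 << 24) | (5 << 8) | 7
fields = decode_ver(example_ver)
```

On real hardware the raw value would come from a privileged RDPR of VER; here it is constructed by hand so that the decoding can be checked in isolation.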

5.2.11 Ancillary State Registers (ASRs)

Please refer to Section 5.2.11 of Commonality for details of the ASRs.

Performance Control Register (PCR) (ASR 16)

UltraSPARC III implements the PCR register as described in Section 5.2.11 of Commonality.

Bits 47:32, 26:17, and bit 3 of PCR are unused in UltraSPARC III (impl. dep. #207). They read as zeroes and writes to them are ignored. PCR.NC = 0.

In UltraSPARC III, access to PCR is strictly privileged; when PSTATE.PRIV = 0, an attempt to execute either an RDPCR or a WRPCR instruction causes a privileged_opcode exception (impl. dep. #250).

See Appendix Q, Performance Instrumentation, for a detailed discussion of the PCR and PIC register usage and event count definitions.

Performance Instrumentation Counter (PIC) Register (ASR 17)

The PIC register is implemented as described in SPARC JPS1 Commonality.

For PICL/PICU encodings of specific event counters, see Appendix Q, Performance Instrumentation.

Note – PICU and PICL were previously referred to as PIC1 and PIC0, respectively, in early UltraSPARC III documentation.

Dispatch Control Register (DCR) (ASR 18)

The Dispatch Control Register is accessed through ASR 18 and should only be accessed in privileged mode. Nonprivileged accesses to this register cause a privileged_opcode trap.

See also TABLE O-1 on page 180 for the state of this register after reset.

The Dispatch Control Register is illustrated in FIGURE 5-1. The OBS and IFPOE bits are described in the subsections below.

Bits    63:14   13:12   11:6   5     4     3    2   1       0
Field   -       -       OBS    BPE   RPE   SI   -   IFPOE   MS

FIGURE 5-1 Dispatch Control Register (ASR 18)

DCR bits 13:12 are unused on UltraSPARC III; writes to them are ignored and they read as 0 (impl. dep. #203).

UltraSPARC III implements the standard semantics for DCR.BPE, DCR.RPE, DCR.SI, and DCR.MS, as described in Section 5.2.11 of SPARC JPS1 Commonality (impl. dep. #204).

DCR Control for Observability Bus (OBS) (Impl. Dep. #203). To accommodate a system request to enable control of the observability bus through software, a major change was made in the control configuration of the bus. In addition to using pulses at obsdata<0>, bits 11:6 of the Dispatch Control Register (DCR) can be programmed to select the set of signals to be observed at obsdata<9:0>. Note that this feature is an addition to the existing control setup and is available only on versions of UltraSPARC III beyond TO_2.0 (tapeout 2.0).

TABLE 5-12 shows the mapping between the settings on bits 11:6 of the DCR and the values seen on the observability bus.

TABLE 5-12 Signals Observed at obsdata<9:0> for Settings on Bits 11:6 of the DCR

DCR 11:6  Signal source     obsdata<9>  obsdata<8>  obsdata<7>  obsdata<6>  obsdata<5>  obsdata<4>  obsdata<3>  obsdata<2>  obsdata<1>
101xxx    ECC*              0           0           0           ec_cor      ec_uncor    sys_uncor   sys_cor     0           0
100xxx    Clk grid          l2clk/4     l2clk/2     l2clk       l1clk/4     l1clk/2     l1clk       1           1           1
110xxx    IU*               a0_valid    a1_valid    ms_valid    br_valid    fa_valid    fm_valid    ins_comp    mispred     recirc
111000    IOT delta (up)    cbu<8>      cbu<7>      cbu<6>      cbu<5>      cbu<4>      cbu<3>      cbu<2>      cbu<1>      impctl<2>
111001    IOT delta (down)  cbd<8>      cbd<7>      cbd<6>      cbd<5>      cbd<4>      cbd<3>      cbd<2>      cbd<1>      impctl<3>
111011    IOL delta (up)    cbu<8>      cbu<7>      cbu<6>      cbu<5>      cbu<4>      cbu<3>      cbu<2>      cbu<1>      impctl<0>
111010    IOL delta (down)  cbd<8>      cbd<7>      cbd<6>      cbd<5>      cbd<4>      cbd<3>      cbd<2>      cbd<1>      impctl<1>
111110    IOR delta (up)    cbu<8>      cbu<7>      cbu<6>      cbu<5>      cbu<4>      cbu<3>      cbu<2>      cbu<1>      impctl<4>
111111    IOR delta (down)  cbd<8>      cbd<7>      cbd<6>      cbd<5>      cbd<4>      cbd<3>      cbd<2>      cbd<1>      impctl<5>
111101    IOB delta (up)    cbu<8>      cbu<7>      cbu<6>      cbu<5>      cbu<4>      cbu<3>      cbu<2>      cbu<1>      impctl<6>
111100    IOB delta (down)  cbd<8>      cbd<7>      cbd<6>      cbd<5>      cbd<4>      cbd<3>      cbd<2>      cbd<1>      impctl<7>

* ECC = Error Correcting Code; IU = Integer Unit

Note that in every valid setting of the DCR obs control bits, bit 11 must be set to 1.
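As a minimal sketch of how software might compose a DCR value for the observability bus, the following helper places a 6-bit selector into bits 11:6 and rejects selectors whose top bit (DCR bit 11) is 0. The names `OBS_SELECT` and `set_obs` are hypothetical, and the don't-care ("x") bits of the ECC, clock-grid, and IU encodings are arbitrarily set to 0 here.

```python
OBS_SHIFT = 6      # OBS occupies DCR bits 11:6
OBS_MASK = 0x3F    # six selector bits

# Selector values from TABLE 5-12; "x" positions are assumed to be 0.
OBS_SELECT = {
    "ECC": 0b101000,
    "CLK_GRID": 0b100000,
    "IU": 0b110000,
    "IOT_UP": 0b111000,
    "IOT_DOWN": 0b111001,
}


def set_obs(dcr, selector):
    """Return dcr with bits 11:6 replaced by selector; other bits are preserved."""
    if selector & ~OBS_MASK:
        raise ValueError("selector must fit in six bits")
    if not selector & 0b100000:
        raise ValueError("DCR bit 11 must be 1 in every valid OBS setting")
    return (dcr & ~(OBS_MASK << OBS_SHIFT)) | (selector << OBS_SHIFT)


# Example: keep the low control bits of an existing DCR value, select IU signals.
new_dcr = set_obs(0b101011, OBS_SELECT["IU"])
```

The actual write would be a privileged write to ASR 18; only the value computation is shown.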

Below are some requirements and recommendations for using the DCR to control the observability bus.
■ Use only one or the other mode of control for the observability bus; that is, control either by pulses at obsdata<0> or by programming of the DCR bits.
■ As long as the por_n pin is asserted, the state of obsdata<9:0> will always be 0. Once the device has been reset, the default state becomes visible on the bus. Note that the DCR OBS control bits are reset to all 0’s on a software POR trap. Until DCR bit 11 is programmed to 1, the obsdata<0> mode of control will have precedence.
■ There is a latency of approximately 5 CPU cycles between writing the DCR and the signals corresponding to that setting appearing at obsdata<9:0>.

For more information on the observability bus, please refer to IU Observability Bus Signals on page 346.

Single Issue Disable (bit 3). When DCR.SI = 0, the processor is in “single-issue” mode; that is, it allows only one instruction to be outstanding at a time. Under certain conditions when DCR.SI = 0 and the data cache is enabled (DCUCR.DC = 1; see footnote 1), the processor can stop dispatching instructions and deadlock. The data cache should normally be disabled (DCUCR.DC = 0) when operating in single-issue mode (DCR.SI = 0). If the D-cache must be enabled during single-issue operation, processor lock-up can be avoided by placing a benign MS pipeline instruction (such as “rd %y, %g0”) after each store or after each block of stores (because stores do not block other stores from retiring).

1. The precise conditions for deadlock: processor in single-issue mode (DCR.SI = 0), data cache enabled (DCUCR.DC = 1), a store hits the Dcache, followed by any load steered to the MS pipeline with no non-load MS instructions between the store and the load. Reference: UltraSPARC III Erratum #16, Bug ID#6888.
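The workaround described above (a benign MS pipeline instruction after each block of stores) could be applied mechanically to an instruction listing. The sketch below is only an illustration: the store mnemonic set is a partial, assumed list, instructions are plain strings, and delay slots and annulled instructions are not considered.

```python
# Illustrative subset of store mnemonics; a real tool would need the full set.
STORE_MNEMONICS = {"stb", "sth", "stw", "st", "stx"}
BENIGN_MS = "rd %y, %g0"


def is_store(line):
    """True if the first token of an assembly line is a known store mnemonic."""
    parts = line.split()
    return bool(parts) and parts[0] in STORE_MNEMONICS


def pad_stores(lines):
    """Insert a benign MS instruction after every run of consecutive stores."""
    out = []
    prev_was_store = False
    for line in lines:
        current_is_store = is_store(line)
        if prev_was_store and not current_is_store:
            out.append(BENIGN_MS)
        out.append(line)
        prev_was_store = current_is_store
    if prev_was_store:
        out.append(BENIGN_MS)  # the listing ended inside a block of stores
    return out


padded = pad_stores(["stx %g1, [%o0]", "stx %g2, [%o0 + 8]", "ldx [%o1], %g3"])
```

Padding once per block rather than once per store follows from the observation that stores do not block other stores from retiring.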

Interrupt Floating-Point Operation Enable (bit 1). The IFPOE bit enables system software to take interrupts on FP instructions. This enable bit is cleared by hardware at power-on. System software must set the bit as needed. When this bit is enabled, UltraSPARC III forces an fp_disabled trap when an interrupt occurs on FP-only code. The trap handler is then responsible for checking whether the FP is indeed disabled. If it is not, the trap handler then enables interrupts to take the pending interrupt (impl. dep. #203).

Note – This behavior deviates from SPARC V9 trap priorities in that interrupts are of lower priorities than the other two types of FP exceptions (fp_exception_ieee_754, fp_exception_other).

■ This mechanism is triggered for an FP instruction only if none of the approximately 12 preceding instructions across the two integer, load/store, and branch pipelines are valid, under the assumption that they are better suited to take the interrupt (only one trap entry/exit).
■ Upon entry, the handler must check both the TSTATE.PEF and FPRS.FEF bits. If TSTATE.PEF = 1 and FPRS.FEF = 1, the handler has been entered because of an interrupt, either interrupt_vector or interrupt_level. In such a case:

■ The fp_disabled handler should enable interrupts (that is, set PSTATE.IE = 1), then issue an integer instruction (for example, add %g0,%g0,%g0). An interrupt is triggered on this instruction.

■ UltraSPARC III then enters the appropriate interrupt handler (PSTATE.IE is turned off here) for the type of interrupt.

■ At the end of the handler, the interrupted instruction is RETRY’d after returning from the interrupt. The add %g0,%g0,%g0 is RETRY’d.

■ The fp_disabled handler then returns to the original process with a RETRY.

■ The “interrupted” FPop is then retried (taking an fp_exception_ieee_754 or fp_exception_other at this time if needed).
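The decision made on entry to the fp_disabled handler can be summarized in a few lines. This is a sketch of the control flow only: the two flags are passed in as booleans rather than read from TSTATE and FPRS, and the function names and action strings are hypothetical.

```python
def classify_fp_disabled(tstate_pef, fprs_fef):
    """Classify an fp_disabled trap taken while DCR.IFPOE = 1.

    If both enable bits are set, the FPU was not really disabled, so the
    trap was forced by a pending interrupt.
    """
    if tstate_pef and fprs_fef:
        return "pending_interrupt"
    return "fp_disabled"


def handler_actions(tstate_pef, fprs_fef):
    """List, in order, the steps the handler takes for each case."""
    if classify_fp_disabled(tstate_pef, fprs_fef) == "pending_interrupt":
        return [
            "set PSTATE.IE = 1",
            "issue add %g0,%g0,%g0 (the interrupt is taken on this instruction)",
            "RETRY back to the interrupted FPop",
        ]
    return ["handle the genuine fp_disabled condition"]


steps = handler_actions(True, True)
```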

Multiscalar Dispatch Enable (MS) (bit 0). In UltraSPARC III, DCR.MS operates as described in Section 5.2 of Commonality and has no additional side effects (impl. dep. #204).

5.2.12 Registers Referenced Through ASIs

Data Cache Unit Control Register (DCUCR)

ASI 45₁₆ (ASI_DCU_CONTROL_REGISTER), VA = 0₁₆

The Data Cache Unit Control Register contains fields that control several memory-related hardware functions. The functions include instruction, write, and data caches, MMUs, and watchpoint setting. Most of DCUCR’s functions are described in Section 5.2.12 of Commonality; details specific to UltraSPARC III are described in this section.

After a power-on reset (POR), all fields of DCUCR are set to 0. After a WDR, XIR, or SIR reset, all fields of DCUCR except WE are set to 0 and WE is left unchanged (impl. dep. #240).

See TABLE O-1 on page 180 for the state of this register after reset or RED_state trap.

The Data Cache Unit Control Register, as implemented in UltraSPARC III, is illustrated in FIGURE 5-2 and described in TABLE 5-13. In the table, bits are grouped by function rather than by strict bit sequence.

Bits    63:50  49  48  47  46  45  44  43  42  41  40:33  32:25  24  23  22  21  20:4  3   2   1   0
Field   -      CP  CV  ME  RE  -   -   -   SL  WE  PM     VM     PR  PW  VR  VW  -     DM  IM  DC  IC

FIGURE 5-2 DCU Control Register Access Data Format (ASI 45₁₆)

TABLE 5-13 DCU Control Register Description

Bits     Field     Type   Description

MMU Control
49       CP        RW     UltraSPARC III implements this cacheability bit as described in Section 5.2.12 of Commonality (impl. dep. #232).
48       CV        RW     UltraSPARC III implements this cacheability bit as described in Section 5.2.12 of Commonality (impl. dep. #232).
3        DM               DMMU Enable. Implemented as described in Section 5.2.12 of Commonality.
2        IM               IMMU Enable. Implemented as described in Section 5.2.12 of Commonality.

Store Queue Control
47       ME        RW     Noncacheable Store Merging Enable. If cleared, no merging of noncacheable, non-side-effect store data will occur. Each noncacheable store will generate a system bus (Fireplane) transaction (impl. dep. #240).
46       RE               R-A-W Bypass Enable. If cleared, no bypassing of data from the store queue to a dependent load instruction will occur. All load instructions will have their R-A-W predict field cleared (impl. dep. #240).

Second Load Control
42       SL               Second Load Steering Enable. If cleared, all load type instructions will be steered to the MS pipeline and no floating-point load type instructions will be issued to the Ax pipelines (impl. dep. #240).

Cache Control
41       WE               Write Cache Enable. If 0, all W-cache references will be handled as W-cache misses. Each store queue entry will perform an RMW transaction to the E-cache, and the W-cache will be maintained in a clean state. Software is required to flush the W-cache (force it to a clean state) before setting this bit to 0 (impl. dep. #240).
1        DC               UltraSPARC III implements DCUCR.DC. When DCUCR.DC = 0 (D-cache is disabled), the data cache is not updated. When the D-cache is reenabled, software must flush any inconsistent lines from the D-cache (impl. dep. #252).
0        IC               UltraSPARC III implements DCUCR.IC. When DCUCR.IC = 0 (I-cache is disabled), the instruction cache is not updated. When the I-cache is reenabled, software must invalidate any inconsistent lines from the instruction cache (impl. dep. #253).

Watchpoint Control
40:33    PM<7:0>          DCU Physical Address Data Watchpoint Mask. Implemented as described in Section 5.2.12 of Commonality.
32:25    VM<7:0>          DCU Virtual Address Data Watchpoint Mask. Implemented as described in Section 5.2.12 of Commonality.
24, 23   PR, PW           DCU Physical Address Data Watchpoint Enable. Implemented as described in Section 5.2.12 of Commonality.
22, 21   VR, VW           DCU Virtual Address Data Watchpoint Enable. Implemented as described in Section 5.2.12 of Commonality.

Under certain conditions when DCR.SI = 0 and the data cache is enabled (DCUCR.DC = 1), the processor can stop dispatching instructions (deadlock). The data cache should normally be disabled (DCUCR.DC = 0) when operating in single-issue mode (DCR.SI = 0). See Single Issue Disable (bit 3) on page 32 for a description of this condition and methods to avoid it when DCUCR.DC = 1.
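A configuration check for this combination might look like the sketch below. The bit positions are those given for DCR.SI (bit 3) and DCUCR.DC (bit 1) in this chapter; the function names are illustrative, and the check flags only the register combination, not the precise instruction sequence that triggers the deadlock.

```python
DCR_SI_BIT = 3    # DCR.SI, Single Issue Disable
DCUCR_DC_BIT = 1  # DCUCR.DC, D-cache enable


def bit(value, position):
    """Return the single bit of value at the given position."""
    return (value >> position) & 1


def single_issue_hazard(dcr, dcucr):
    """True when single-issue mode (SI = 0) is combined with an enabled D-cache."""
    return bit(dcr, DCR_SI_BIT) == 0 and bit(dcucr, DCUCR_DC_BIT) == 1


# Example: SI = 0 with DC = 1 is the combination the text warns about.
risky = single_issue_hazard(dcr=0b0000, dcucr=0b0010)
```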

Data Watchpoint Registers

On UltraSPARC III, watchpoint comparison is only done on the MS (memory) pipeline of the processor; any second-issued Ax pipe FP loads will not trigger a watchpoint. For reliable use of the watchpoint mechanism, the second FP load feature (DCUCR.SL) must be disabled (impl. dep. #244).

Note – The first implementation of UltraSPARC III supports a 43-bit physical address space. Software is responsible to write a zero-extended 64-bit address into the PA Data Watchpoint Register.

Instruction Trap Register

UltraSPARC III implements the Instruction Trap Register (impl. dep. #205).

The Instruction Trap Register enables an illegal_instruction trap to be generated when an instruction opcode is dispatched from the R stage of the UltraSPARC III pipeline. This facility is used to implement instruction breakpoints.

In UltraSPARC III, the least significant 11 bits (bits 10:0) of a CALL or branch (BPcc, FBPcc, Bicc, BPr) instruction in the instruction cache contain the sum of the least significant 11 bits of the architectural instruction encoding (as appears in main memory) and the least significant 11 bits of the virtual address of the CALL/branch instruction (impl. dep. #245).

Therefore, software on UltraSPARC III that writes the Instruction Trap Register to cause a trap on a CALL or branch instruction must either
■ set bits 10:0 of the Mask field to 0 to mask out the implementation-dependent bits from the comparison, or
■ place in bits 10:0 of the Match field the sum of bits 10:0 of the instruction and bits 12:2 of the instruction’s virtual address (that is, Match<10:0> = instruction<10:0> + VA<12:2>).
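The second option can be illustrated with a small computation. The sketch assumes that the sum is truncated to 11 bits (any carry out of bit 10 is discarded), which the text implies by describing an 11-bit field but does not state outright; the function name is hypothetical.

```python
MASK_11 = 0x7FF  # eleven low-order bits


def itr_match_low_bits(instruction, va):
    """Compute Match<10:0> = instruction<10:0> + VA<12:2>, kept to 11 bits.

    Assumption: the carry out of bit 10 is dropped.
    """
    return ((instruction & MASK_11) + ((va >> 2) & MASK_11)) & MASK_11


# Example: a CALL whose low 11 bits are 0x005, located at VA 0x1000.
match_low = itr_match_low_bits(0x40000005, 0x1000)
```

The remaining bits of the Match field would be taken directly from the architectural instruction encoding.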

On UltraSPARC III, the instruction breakpoint facility is not fully implemented. Although instruction breakpoints do work in the majority of cases, the possibility exists that:
■ An instruction breakpoint that should have been taken will be missed. One example (see footnote 1) is an instruction in the delay slot of a DCTI in which the static prediction bit predicts “not taken”. If the DCTI is actually taken (and therefore the delay slot executed), the delay slot instruction may be requeued from the processor’s mispredict queue (MQ); in that case, the breakpoint (instruction trap) will not be triggered.
■ Arbitrary instructions will be lost from the instruction stream (a potentially catastrophic error).

1. Reference: UltraSPARC III Erratum #46, Bug ID#7046

S. CHAPTER 6

Instructions

This chapter departs from the organization of Chapter 6 in Commonality. Instead, it focuses on the needs of compiler writers and others who are interested in scheduling instructions to optimize program performance on the UltraSPARC III pipeline. The chapter discusses these subjects:
■ Processor Pipeline on page 37
■ Grouping Rules on page 43
■ Conditional Moves on page 51
■ Instruction Latencies and Dispatching Properties on page 52, with a table that lists these characteristics for the complete UltraSPARC III instruction set.

Please refer to Chapter 6 of Commonality for general information on instructions.

6.1 Processor Pipeline

The UltraSPARC III processor pipeline consists of 15 stages. Instructions pass through the pipeline in order of groups of up to six instructions. The pipeline stages are referred to by the following mnemonic single-letter names, with the reference shown below:

Stage   Name
A       Address generation
P       Predictor address generation
F       Fetch
B       Branch target computation
I       Instruction issue
J       (extra stage)
S       (extra stage)
R       Register
E       Execute
C       Cache
M       Miss
W       Write
X       eXtend
T       Trap
D       Done

Rather than executing all instructions in a single pipeline, the processor has several separate pipelines, each dedicated to execution of a particular class of instructions. The execution pipelines start after the R-stage of the pipe. Not all the execution pipelines use all of the pipeline stages, and in some cases (long-latency, floating-point pipes), the stage names must be extended with a cycle number, such as D1, D2, and so on. The following sections provide a stage-by-stage description of the pipeline. While reading the following sections, you may find it useful to refer to FIGURE 6-1, which illustrates at which pipeline stage each of the large architectural structures resides.

[Diagram not reproduced. Structures labeled in the figure, listed from the A-stage down to the D-stage: instruction cache (32 KB, 4-way, 32-byte lines), branch predictor (BP), and ITLB; instruction queue (4 × 4) and instruction steering (B, I); dependency check and working register file, 7R 3W (R); floating-point register file, VA adder, A0, A1, special unit (MS), D-cache (64 KB, 4-way), and DTLB (E); D-cache tag (C); miss detection and sign-extend/align (M); FP add/subtract and graphics ALU (FGA) and FP multiply/divide and graphics multiply (FGM) (W, X); store queue (T); architectural register file, 3W 1T (D); write cache, 2 KB.]

FIGURE 6-1 Pipeline Diagram

6.1.1 Instruction-Fetch Stages

The instruction-fetch pipeline stages, A, P, F, and B, are described below.

A-stage (Address Generation)

The address stage generates and selects the fetch address to be used by the instruction cache in the next cycle. The address that can be selected in this stage for instruction fetching comes from several sources, including:
■ Sequential PC
■ Branch target (from I-stage)
■ Trap target
■ Interrupt
■ Predicted return target
■ JMPL target

This stage also contains the instruction prefetch buffer, which holds up to eight instructions prefetched from the sequential stream.

P-stage (Predictor Address Generation)

The predictor address stage starts fetching four instructions from the instruction cache. Since the I-cache has a two-cycle latency, the P-stage and the F-stage are both used to complete an I-cache access. Although the I-cache has a two-cycle latency, it is pipelined and can access a new set of four instructions every cycle. The address used to start an I-cache access was generated in the previous cycle.

F-stage (Fetch)

The F-stage is used for the second half of the I-cache access. At the end of this stage, up to four instructions from an I-cache line (32 bytes) are latched for use in the I-stage. An I-cache fetch group is not permitted to cross an I-cache line boundary.
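The effect of the line-crossing restriction on fetch-group size can be sketched as follows. The calculation assumes that the line boundary is the only limit on the group (it ignores taken branches, cache misses, and any other alignment constraint), and the function name is illustrative.

```python
LINE_BYTES = 32   # I-cache line size
INSTR_BYTES = 4   # every SPARC instruction is one 32-bit word
MAX_FETCH = 4     # at most four instructions per fetch group


def fetch_group_size(pc):
    """Number of instructions fetched starting at pc without crossing a line."""
    if pc % INSTR_BYTES:
        raise ValueError("instruction addresses are word aligned")
    remaining_in_line = (LINE_BYTES - pc % LINE_BYTES) // INSTR_BYTES
    return min(MAX_FETCH, remaining_in_line)


# A fetch that starts 12 bytes before the end of a line yields three instructions.
size = fetch_group_size(0x1014)
```

This suggests that branch targets placed in the first five word slots of a line allow a full group of four to be fetched.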

The P-stage also accesses the Branch Predictor (BP). The BP is a small, single-cycle access SRAM whose output is latched at the end of the P-stage. The BP predicts the direction of all conditional branches, based on the PC of the branch and the direction history of the most recent conditional branches.

B-stage (Branch Target Computation)

The B-stage is the final stage of the I-Fetch pipeline, A P F B. In this stage, the four fetched instructions are first available in a register. The processor analyzes the instructions, looking for DCTIs that can alter the path of execution. It finds the first DCTI, if any, among the four instructions and computes (if PC relative) or predicts (if register based) its target address. If this DCTI is predicted taken, the target address is passed to the A-stage to begin fetching from that stream; if predicted not taken, the target is passed on to the CTI queue for use in case of mispredict. Also in the B-stage, the computation of the hit or miss status of the instruction fetch is begun, so that the validity of the four instructions can be reported to the instruction queue.

6.1.2 Instruction Issue Stages

The I-stage and R-stage make up the instruction issue stages. The J-stage and the S-stage are additional stages for use when needed.

I-stage (Instruction Issue)

In the I-stage, the four instructions fetched from the I-cache are entered into the instruction queue. The instruction queue is a 4-wide-by-4-deep structure. Each instruction position from the I-cache is directly wired to one of the 4-deep queue positions. No rotation is done to put the instructions into age order.

Each instruction in the fetch group is steered to one of six possible execution pipelines: A0, A1, BR, MS, FGA, or FGM, described in TABLE 6-1.

TABLE 6-1 Execution Pipelines

Pipeline Description

A0         Integer ALU and Floating-point Load pipe 0
A1         Integer ALU and Floating-point Load pipe 1
BR         Branch pipe
MS         Memory/Special pipe
FGM        Floating Point/Graphics multiply pipe
FGA        Floating Point/Graphics ALU pipe

R-stage (Register)

The integer working register file is accessed during the R-stage for the operands of the instructions (up to three) that have been steered to the A0, A1, and MS pipelines. At the end of the R-stage, results from previous instructions are bypassed in place of the register file operands if required.

Up to two floating-point or graphics instructions are sent to the Floating Point/ Graphics Unit in this stage.

The register and pipeline dependencies between the instructions in the group and the instructions in the execution pipelines are calculated concurrently with the register file access. If a dependency is found, the dependent instruction is held in the R-stage until the dependency is released.

The S- and R-stages of each pipeline form a 2-entry queue. The head end of this queue is the R-stage. If the queue contains one instruction, then it is in R; if the queue contains two instructions, then the older instruction is in R and the younger instruction is in S. In every cycle, zero or one instruction may be removed from this queue (that is, going from R to E), and zero or one instruction may be inserted into this queue (that is, going from J to either R or, if R is occupied, to S).

When the queue contains two instructions (that is, if S is occupied), then no instructions may be added to the queue in that cycle, even if one is being removed. If the queue is “out of order” (that is, it contains an instruction that is younger than another instruction still in J), then no instructions may be added to the queue in that cycle, even if one is being removed. Otherwise, the queue may receive a new instruction in any cycle.
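The queue behavior described in the two preceding paragraphs can be modeled with a small per-cycle simulation. This is a behavioral sketch only: the class and method names are hypothetical, and the "out of order" condition is supplied by the caller as a flag rather than derived from instruction ages.

```python
class SRQueue:
    """Two-entry queue per pipeline: entries[0] is in R, entries[1] is in S."""

    def __init__(self):
        self.entries = []

    def step(self, issue_from_r, incoming=None, out_of_order=False):
        """Advance one cycle; return (instruction sent to E or None, inserted?)."""
        # Insertion is decided from the state at the start of the cycle, so a
        # full queue refuses a new instruction even if R drains this cycle.
        can_insert = len(self.entries) < 2 and not out_of_order
        removed = self.entries.pop(0) if (issue_from_r and self.entries) else None
        inserted = False
        if incoming is not None and can_insert:
            self.entries.append(incoming)
            inserted = True
        return removed, inserted


queue = SRQueue()
queue.step(False, "i0")                  # i0 enters R
queue.step(False, "i1")                  # i1 enters S
full_cycle = queue.step(True, "i2")      # i0 leaves, but i2 is refused
next_cycle = queue.step(True, "i2")      # i1 leaves, i2 is accepted
```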

6.1.3 Integer Instruction Execution: E-stage (Execute)

Integer instructions in the A0 and A1 pipelines compute their results in the E-stage. The instructions include most arithmetic, all shift, and all logical instructions. Their results are available for bypassing to dependent instructions that are in the R-stage, resulting in single-cycle execution for most integer instructions. The A0 and A1 pipelines are the only two sources of bypass results in the E-stage.

Other integer instructions are steered to the MS pipe and if necessary are sent with their operands to the special execution unit in this stage. Most MS instructions do not proceed [?] to the System Interface Unit (SIU). They can start their execution during the E-stage but will not produce any results to be bypassed until the C-stage.

Load instructions steered to the MS pipe start accessing the data cache during the E-stage. Since the D-cache is SAM [?] accessed, loads do not have to wait for their virtual address to be computed before starting the D-cache access.

Concurrently with the start of the D-cache access, the virtual address is calculated in an adder in the MS pipeline. The virtual address adder drives the data TLB, which starts its access at the end of the E-stage.

Floating-point and graphics instructions access the floating-point register file in the E-stage to obtain their operands. At the end of the E-stage, the results from previous completing floating-point/graphic instructions can be bypassed to the E-stage instructions.

Conditional branch instructions in the BR pipe resolve their directions in the E-stage. Based on their original predicted direction, a mispredict signal is computed and sent to the A-stage for possible refetching of the correct instruction stream.

JMPL and RETURN instructions compute their target addresses in the E-stage of the MS pipe. The results are sent to the A-stage to start fetching instructions from the target stream.

6.1.4 Floating-Point and VIS Instruction Execution

Execution of floating-point and VIS instructions is done in the C-stage, M-stage, W-stage, and X-stage.

C-stage (Cache)

The data cache delivers results for doubleword (64-bit) and unsigned word (32-bit) integer loads in the C-stage. The D-TLB is accessed in parallel with the D-cache access and provides a translated physical address at the end of the C-stage.

Special instruction unit results are produced at the end of this stage and can be bypassed to waiting dependent instructions in the R-stage—minimum 2-cycle latency for SIU instructions. The integer pipelines, A0 and A1, write their results back to the working register file in the C-stage.

The C-stage is the first stage of execution for floating-point and VIS instructions in the FGA and FGM pipelines.

M-stage (Miss)

Data cache misses are detected in the M-stage by a comparison of the physical address from the DTLB to the physical address in the D-cache tags. If the load required additional alignment or sign extension (such as signed word, all halfword, and all byte loads), that alignment is carried out in this stage, resulting in a 3-cycle latency for those load operations. D-cache miss requests are sent to the external cache at this time. This stage is used for the second execution cycle of floating-point and VIS instructions.

Load data are available to the floating-point pipelines in the M-stage. Although this is one cycle later than the integer pipelines, a perceived 2-cycle floating-point load latency results from the floating-point execution pipes being pushed back one cycle relative to the integer execution pipelines. The floating-point register file access occurs in E-stage, one cycle after integer register file access.

W-stage (Write)

In the W-stage, the MS integer pipe results are written into the working register file. The W-stage is also used as the third execution cycle of floating-point and VIS instructions.

X-stage (Extend)

The X pipeline stage is the last execution stage for most floating-point operations (except divide and sqrt) and for all VIS instructions. Floating-point results from this stage are available for bypass to dependent instructions that will be entering the C-stage in the next cycle.

6.1.5 Trap (T) and Done (D) Stages

This section describes the stages that interrupt or complete instruction execution.

T-stage (Trap)

Traps, including floating-point and integer traps, are signalled in this stage. Any instructions that are younger than the trapping instruction must invalidate their results before reaching the D-stage to prevent their results from being erroneously written into the architectural or floating-point register files.

D-stage (Done)

Integer results are written into the architectural register file in this stage. At this point, they are fully committed and are visible to any traps generated from younger instructions in the pipeline.

Floating-point results are written into the floating-point register file in this stage. These results are visible to any traps generated from younger instructions.

6.2 Grouping Rules

A group is a collection of instructions that launch into the execution pipelines in the same cycle. For example, if four instructions are being evaluated for dispatch in the R-stage and only three of those are allowed to execute in the E-stage, those three instructions are considered to be a group. The fourth instruction, which was held back, is not considered part of the group of instructions that was dispatched.

Instruction grouping rules are necessary for the following reasons.
■ Each pipeline can run only a subset of instructions.
■ Resource dependencies may require special spacing of instructions.
■ Data dependencies may require separation of instructions.
■ Some instructions are multicycle in nature and require special sequencing flows.
■ Special rules can help ease the logic/timing burden.

Before continuing, we define a few terms that apply to instructions.

break-before: The instruction will always be the first instruction of a group.

break-after: The instruction will always be the last instruction of a group.

single-instruction group (SIG): The instruction will not issue with any other instructions; it will be the only instruction in the group. (SIG is sometimes shortened herein to “single-group.”)

instruction latency: The number of processor cycles after issuing an instruction that a following data-dependent instruction can issue.

blocking, multicycle: The instruction reserves one or more of the execution pipelines for more than one cycle. The reserved pipelines are not available for other instructions to issue into until the blocking, multicycle instruction completes.

6.2.1 Execution Order

● Rule: Within the R-stage, some of the instructions can be dispatched and others cannot. If an instruction is younger than an instruction that is not able to dispatch, then the younger instruction will not be dispatched. “Younger” and “older” refer to instruction order within the program. If two instructions are executed sequentially, then the instruction with the smaller address is considered the older of the two. That is, if the two instructions are executed individually, the instruction that would be executed first is the older instruction.
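The in-order dispatch rule amounts to taking the longest ready prefix of the instructions in the R-stage. A minimal sketch, assuming the readiness of each instruction has already been determined and is supplied oldest first (the function name is illustrative):

```python
def dispatch_group(ready_flags):
    """Return the indices (oldest first) of the instructions that dispatch.

    Dispatch stops at the first instruction that cannot go; every younger
    instruction is held back even if it would otherwise be ready.
    """
    group = []
    for index, ready in enumerate(ready_flags):
        if not ready:
            break
        group.append(index)
    return group


# The third-oldest instruction is blocked, so the fourth is held back as well.
group = dispatch_group([True, True, False, True])
```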

6.2.2 Integer Register Dependencies to Instructions in the MS Pipeline

● Rule: If an operand of an instruction in the R-stage matches the destination register of an instruction in the MS pipeline’s E-stage, then the instruction in the R-stage may not proceed. The MS pipeline has no E-stage bypass. If an operand of an instruction in the R-stage matches the destination register of an instruction in the MS pipeline’s C-stage, then the instruction in the R-stage may not proceed if the instruction in the MS pipeline’s C-stage does not generate its data until the M-stage. For example, LDSB does not have the load data until the M-stage, but LDX has its data in the C-stage. Thus, LDX would not cause an interlock, but LDSB would. Most instructions in the MS pipeline have their data by the M-stage, so there is no dependency check on the MS pipeline’s M-stage destination register. In the case of multicycle MS instructions, the data are always available by the M-stage of the last helper flow.
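The interlock condition in this rule can be written as a predicate. The sketch below uses register names as strings and takes the "data not available until the M-stage" property of the C-stage instruction as an input flag; the function and parameter names are hypothetical, and special cases such as %g0 are not modeled.

```python
def ms_interlock(operands, ms_e_dest=None, ms_c_dest=None, c_data_in_m_stage=False):
    """True if an R-stage instruction must be held because of the MS pipeline.

    operands          : registers read by the R-stage instruction
    ms_e_dest         : destination of the MS instruction in E, if any
    ms_c_dest         : destination of the MS instruction in C, if any
    c_data_in_m_stage : the C-stage instruction produces its data only in M
    """
    # No E-stage bypass exists in the MS pipeline.
    if ms_e_dest is not None and ms_e_dest in operands:
        return True
    # A C-stage producer interlocks only if its data arrives in the M-stage.
    if ms_c_dest is not None and ms_c_dest in operands and c_data_in_m_stage:
        return True
    return False


# LDX in C (data ready in C) does not interlock; LDSB in C (data in M) does.
ldx_case = ms_interlock({"%g1"}, ms_c_dest="%g1", c_data_in_m_stage=False)
ldsb_case = ms_interlock({"%g1"}, ms_c_dest="%g1", c_data_in_m_stage=True)
```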

6.2.3 Integer Instructions Within a Group

● Rule: Integer instructions within a group are not allowed to write the same destination register. By not writing the same destination register at the same time, we simplify bypass logic and register file write-enable determination, and we avoid potential write-after-write (WAW) errors as well. The group is broken before the second instruction that would write the same destination register (break-before). This rule applies only to integer instructions writing integer registers. Floating-point instructions and floating-point loads (done in the integer A0, A1, and MS pipelines) can be grouped so that two or more instructions in the same group can write the same floating-point destination register.

6.2.4 Same-Group Bypass

● Rule: Same-group bypass is disallowed. The group bypass rule states that no instruction can bypass its result to another instruction in the same group. The one exception to this rule is store. A store instruction can get its store data (rd), but not its address operands (rs1, rs2), from an instruction in the same group.
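The two integer grouping rules (no shared integer destination within a group, and no same-group bypass except for store data) can be combined in one check. The sketch below is illustrative only: the `Insn` record and `can_join_group` are hypothetical names, and only integer registers are modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Insn:
    name: str
    dest: Optional[str] = None        # integer destination register, if any
    sources: Tuple[str, ...] = ()     # rs1/rs2 (ALU or address operands)
    store_data: Optional[str] = None  # rd of a store, i.e. the data stored


def can_join_group(group, candidate):
    """Return True if candidate may be added after the older insns in group."""
    older_dests = {i.dest for i in group if i.dest is not None}
    if candidate.dest is not None and candidate.dest in older_dests:
        return False  # two integer writers of one register: break-before
    if any(src in older_dests for src in candidate.sources):
        return False  # same-group bypass is disallowed
    # store_data is deliberately not checked: a store may take its data
    # (but not its address operands) from an instruction in the same group.
    return True
```

A store whose data register is produced by an older ADD in the same group is accepted, whereas a store whose address operand depends on that ADD starts a new group.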

6.2.5 Floating Point Unit Operand Dependencies

The FPU grouping rules are better understood after a review of FPU operand dependencies, as described below.

Latency and Destination Register Addresses

Floating-point operations have longer latencies than do integer instructions. Moreover, the various floating-point instructions have varying latencies, and we must check the precision of the floating-point operands. Floating-point operands can be generated by the FGA, FGM, Ax, and MS pipelines.

In the current implementation, there are no floating-point latencies that are less than two or more than four cycles (except for floating-point divide, square root, and PDIST → PDIST). Thus, in the Integer Execution Unit, we create a pipeline of destination register addresses that runs from the E-stage to the W-stage for the FGA and FGM pipelines.

The floating-point results from Ax and MS pipelines are limited to floating-point loads, which are available to the floating-point pipelines in the M-stage. Thus, we need to create a pipeline of destination register addresses that runs only from the E-stage to the C-stage.

Associated with these destination register address pipelines is latency information that is calculated in the E-stage and based on the instruction opcode. The FGA and FGM pipelines can run instructions with variable latency. This additional logic enables compares only on the cycles where floating-point data are not available.

Floating-point divide and square root instructions are variable latency, and that latency is fairly large. Divides and square root operations are not pipelined, so the div/sqrt unit can be “busy.” There are two stages for the divide and square root, which the Integer Execution Unit calls the first part and the finishing part. Each of these stages has a destination register address. We propagate the address from the first part to the finishing part when the Floating Point and Graphics Unit tells us that the first part of the operation is complete. We dequeue the address from the finishing part when the Floating Point and Graphics Unit tells us that the operation is complete.

With all of the various pipelines combined, 14 stages of the pipeline can have a destination register that we must check before we can release a floating-point instruction in the R-stage.

Each floating-point operand and result can be either double or single precision. Single-precision registers can be contained within a double-precision address without there being an exact match. To make the comparison, we use a special 5-bit comparator that takes into account the precision of the two register addresses. For example, floating-point double-precision register %f0 encompasses floating-point single precision registers %f0 and %f1. So, a comparison between single precision %f1 and double-precision %f0 should result in the match logic indicating a dependency.
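The precision-aware comparison can be sketched as an overlap test on the 32-bit register words each operand occupies. This is an assumption-laden model: it works on architectural register numbers (for example, 0 for %f0) rather than on the encoded 5-bit register fields that the hardware comparator sees, and the function names are hypothetical.

```python
def fp_words(reg, double):
    """Return the set of 32-bit register words covered by an FP operand."""
    if double:
        if reg % 2:
            raise ValueError("double-precision registers are even-numbered")
        return {reg, reg + 1}  # %f0 (double) covers single %f0 and %f1
    return {reg}


def fp_regs_match(reg_a, a_double, reg_b, b_double):
    """Return True if the two FP operands overlap, i.e. a dependency exists."""
    return bool(fp_words(reg_a, a_double) & fp_words(reg_b, b_double))
```

With this model, single-precision %f1 against double-precision %f0 reports a match, while single-precision %f2 against double-precision %f0 does not.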

PDIST Special Cases

PDIST-to-dependent-PDIST is handled as a special case with 1-cycle latency. PDIST latency to any other dependent operation is 4-cycle latency. In addition, a PDIST cannot be issued if there is an ST, block store, or partial store instruction in the M-stage of the pipeline. PDIST issue is delayed if there is a store type instruction two groups ahead of it.

Helpers

Sometimes an instruction requires multiple flows in the pipeline as part of its operation. We call the extra flows that follow the initial instruction flow helper cycles. The only pipeline that executes such instructions is the MS pipeline. If an instruction requires a helper, that helper is generated in the R-stage. The helper generation logic will generate only as many helpers as the instruction requires.

Most of the time, we can determine the number of helpers by examining only the opcode. However, some recirculate cases run the recirculated instruction differently than the original flow down the pipe, and some instructions, like integer multiply and divide, require variable numbers of helpers. Other helper lengths are determined by outside units, for example, the DCU for atomic memory instructions.

Floating-Point Grouping Rules

● Rule: Instructions requiring helpers are always break-after. There can be no instruction in a group that is younger than an instruction that requires helpers. Another way of saying this is “an instruction that requires helpers will be the youngest in its group.” This rule preserves the in-order execution of the integer instructions and simplifies the logic.

● Rule: Helpers are always single-group. Helpers block the pipe from executing other instructions; thus, instructions with helpers are blocking. A helper cycle is always alone in a group. No other instruction will ever be dispatched from the R-stage if there is a helper cycle in the R-stage.

● There are no special rules concerning integer set-cc and integer branch instructions. Integer set-cc instructions can be grouped in any way with integer branches. In fact, any number of set-cc instructions in any order relation to the branch are allowed, provided that they do not violate any other rules. No special rules apply to this specific case. Integer set-cc instructions in the A1 and A0 pipelines can compute a taken/untaken result in the E-stage, which is the same stage in which the branch is evaluating the correctness of its prediction. The control logic guarantees that the correct condition codes are used in the evaluation. The set-cc instructions in the MS pipeline are all Ax instructions. Since all Ax instructions require helpers, a set-cc instruction and dependent branch instruction will never be in the same group. The Ax set-cc makes its condition codes available in the C-stage of its last helper cycle. This case lines up perfectly with the following branch’s E-stage prediction evaluation, so there is no special rule with set-cc instructions in the MS pipeline.

● Rule: Window changing instructions are single-group. The window changing instructions SAVE, RESTORE, and RETURN are all single-group instructions. These instructions are never grouped with any other instruction. This rule greatly simplifies the tracking of register file addresses.

● Rule: Window changing instructions force bubbles after. The window changing instructions SAVE, RESTORE, and RETURN also force a subsequent pipeline bubble. A bubble is distinct from a helper cycle in that there is nothing valid in the pipeline within a bubble. During the bubble, control logic transfers the new window from the Architectural Register File (ARF) to the Working Register File (WRF).

● Rule: Disallow fast window-changing flip-flopping. In this case, fast flip-flopping means changing from one window to the next window and back again before the first change has left the pipeline. Consider this sequence of code.

ADD → %l0    R E C M W X T D
SAVE         R E C M W X T D
RESTORE      R E C M W X T D

Both SAVE and RESTORE instructions transfer the contents of the new windows in their E-stages from the architectural register file to the working register file. The ARF is written in the D-stage of all instructions. In this example, if the RESTORE is allowed, the value of %l0 in the WRF will not reflect the result of the ADD because the transfer in the RESTORE’s E-stage occurs before the ADD’s result is written to the ARF. The restore is held in the R-stage until all previous window ARF results are written.

● Rule: Write ASR and Write PR instructions are single-group. WRASR and WRPR are always the youngest instructions in a group. That case prevents problems with an instruction being dependent on the result of the write, which occurs late in the pipeline.

We may decide at a later date to make these instructions single-group.

● Rule: Write ASR and Write PR force seven bubbles after. To guarantee that any instruction that starts in the R-stage is started with the most up-to-date status registers, WRASR and WRPR force bubbles after they are dispatched. Thus, if a WRASR or a WRPR instruction is in the pipeline anywhere from the E-stage to the T-stage, no instructions are dispatched from the R-stage (bubbles are forced in).

● Rule: Read ASR and Read PR force up to six bubbles before (break-before multicycle). An instruction can update the ASRs and PRs. Therefore, if an RDASR or RDPR instruction is in the R-stage and any valid instruction is in the integer pipelines from the E-stage to the X-stage, UltraSPARC III does not allow the RDASR and RDPR instructions to be dispatched. Instead, we wait for all pipeline state to write the ASRs and privileged registers and then read them.

● Rule: Block Load and Block Store are single-group and multicycle. For simplicity in the Integer Execution Unit and for the memory system, BLD and BST are single-group instructions with helpers.

● Rule: FLUSH is single-group and forces seven bubbles after. To simplify the Instruction Issue Unit and Integer Execution Unit, the FLUSH instruction is single-group. This makes instruction cancellation and issue easier. FLUSH is held in the R-stage until the store queue and the pipeline from the E-stage through the D-stage are empty.

● Rule: FLUSHW is single-group. To simplify the Integer Execution Unit handling of the register file window flush, the FLUSHW instruction is single-group.

● Rule: MEMBAR (#Sync, #Lookaside, #StoreLoad, #MemIssue) is single-group. To simplify the Integer Execution Unit and memory system, MEMBAR is a single-group instruction. MEMBAR will not dispatch until the DCU is prepared for these instructions.

● Rule: SAVED and RESTORED are single-group. To simplify the Integer Execution Unit’s window tracking, SAVED and RESTORED are single-group instructions.

● Rule: Software-initiated reset (SIR) is single-group. For simplicity, SIR is a single-group instruction.

● Rule: Load FSR (LDFSR) is single-group and forces seven bubbles after. For simplicity, LDFSR is a single-group instruction.

● Rule: DONE and RETRY are single-group. For simplicity, we handle the DONE and RETRY (trap exit) instructions as single-group. During this time, we are attempting to update the working register file, restore the state prior to the trap, etc., so it is easier to make these instructions single-group.

● Rule: DONE and RETRY force seven bubbles after. It takes a few cycles to properly restore the pre-trap state and the working register file from the architectural register file, so we force bubbles after the trap exit instructions to give us the cycles to do it all. We will not accept a new instruction until the trap exit instruction leaves the pipeline (also known as D+1).

● Rule: Floating-point divide/square-root is busy. The floating-point divide/square-root unit is a nonpipelined unit. The Integer Execution Unit sets a busy bit for each of the two stages of the div/sqrt and depends on the Floating Point and Graphics Unit to clear them. Only the first part of the div/sqrt is considered to have a busy unit, so once the first part is complete, a new floating-point div/sqrt operation can be started.

● Rule: Floating-point divide/square-root needs a write slot in FGM pipeline. In the stage in which a div/sqrt is moved from the first part to the last part, we cannot issue any instructions to the FGM pipeline. This constraint provides the write slot in the FGM pipeline so the div/sqrt can write the floating-point register file.

● Rule: Floating-point store is dependent on floating-point divide/square-root. The floating-point divide/square-root unit has a latency longer than the normal pipeline. As a result, should a floating-point store depend on the result of a floating-point divide/square-root, the floating-point store instruction may not be dispatched until the floating-point divide/square-root instruction has completed. This restriction is somewhat pessimistic since the store instruction bypasses its store data late in the pipeline. Thus, it should be possible to release the floating-point store somewhat before the floating-point divide/square-root is completed. At this time, we are implementing the pessimistic solution. This approach may change should the floating-point performance require it.

● Rule: Unused floating-point source operands have %f0 dependency. The following floating-point instructions will exhibit dependencies on %f0 when a prior floating-point instruction in an older instruction group uses %f0 as the destination register. The false dependency is not recognized for same-group instructions. The intergroup dependency checking logic does not treat these instructions specially even though they do not use an frs1 and/or frs2 operand.

■ %f0 dependency of frs2: fzero{s}, fone{s}, fsrc1{s}, fnot1{s}

■ %f0 dependency of frs1: fpack16, fpackfix, fsqrt{s,d,q}

● Rule: Graphics Status Register (GSR) Write instructions are break-after. The SIAM, BMASK, and FALIGNADDR instructions write the GSR. The BSHUFFLE and FALIGNDATA instructions read the GSR in their operation. Because of GSR write latency, a GSR reader cannot be in the same group as a GSR writer unless the GSR reader is older than the GSR writer. The simplest solution to this dependency is to make all GSR write instructions break-after.

Note – The WRGSR instruction is not included in this rule as a special case. The WRGSR instruction is already break-after by virtue of being a WRASR instruction.
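The divide/square-root busy rule above, together with the two-part destination address tracking described under Latency and Destination Register Addresses, can be pictured as a small state holder. The class below is a toy model, not the actual control logic: the class and method names are hypothetical, and it does not model the FGM write-slot constraint or a first part completing while the finishing part is still occupied.

```python
class DivSqrtTracker:
    """Toy model of the two div/sqrt destination-address slots."""

    def __init__(self):
        self.first_part = None      # dest register of the op in the first part
        self.finishing_part = None  # dest register of the op in the finishing part

    def busy(self):
        # Only the first part counts as a busy unit.
        return self.first_part is not None

    def issue(self, dest):
        if self.busy():
            raise RuntimeError("div/sqrt busy: instruction stays in the R-stage")
        self.first_part = dest

    def first_part_complete(self):
        # Signalled by the FGU: propagate the address, clearing the busy state.
        self.finishing_part, self.first_part = self.first_part, None

    def operation_complete(self):
        # Signalled by the FGU: dequeue the address of the finished operation.
        done, self.finishing_part = self.finishing_part, None
        return done

    def pending_destinations(self):
        """Registers a dependent FP instruction must still be checked against."""
        return {d for d in (self.first_part, self.finishing_part) if d is not None}
```

Once `first_part_complete()` has been called, a second divide can be issued even though the first one has not yet written its result.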

6.3 Conditional Moves

The compiler needs to have a detailed model of the implementation of the various conditional moves so it can optimally schedule code. TABLE 6-2 describes the implementation of the five classes of SPARC V9 conditional moves in the UltraSPARC III pipeline. FADD and ADD instructions (shaded rows) are also described as a reference for comparison with the conditional move instructions.

TABLE 6-2 SPARC V9 Conditional Moves

Instruction  Rd Latency  Pipes Used         Busy Cycles  Groupable  {i,f}CC Dependency

FMOVicc      3 cycles    FGA and BR         1            Yes        iCC: 0
FMOVfcc      3 cycles    FGA and BR         1            Yes        fCC: 0
FMOVr        3 cycles    FGA and MS         1            Yes        N/A
FADD         4 cycles    FGA                1            Yes        N/A
ADD          1 cycle     One of A0 or A1    1            Yes        N/A
MOVcc        2 cycles    MS and BR          1            Yes        iCC: 0
MOVR         2 cycles    MS and BR          1            Yes        N/A

Where:

Rd latency — The number of processor cycles until the destination register is available for bypassing to a dependent instruction.

Pipes used — The UltraSPARC III pipelines that the instruction busies when it is issued. The pipelines are:

A0   Integer execution (arithmetic, logical, shift, 2nd FP load)
A1   Integer execution (arithmetic, logical, shift, 2nd FP load)
MS   Load/Store
BR   Branch
FGA  Floating/Graphics Add

Busy cycles — The number of cycles that the pipelines are not available for other instructions to be issued. A value of 1 signifies a fully pipelined instruction.

Groupable — Whether instructions using pipelines other than those used by the conditional move can be issued in the same cycle as the conditional move.

{i,f}CC dependency — The number of cycles that a CC setting instruction must be scheduled ahead of the conditional move in order to avoid incurring pipeline stall cycles.
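As a rough illustration of how a scheduler might use the Rd latency column of TABLE 6-2, the sketch below computes the earliest cycle at which a dependent instruction can issue without stalling. The dictionary and function names are hypothetical, and the model ignores pipe conflicts, grouping rules, and condition-code dependencies.

```python
# Rd latencies in cycles, copied from TABLE 6-2.
RD_LATENCY = {
    "FMOVicc": 3, "FMOVfcc": 3, "FMOVr": 3,
    "FADD": 4, "ADD": 1,
    "MOVcc": 2, "MOVR": 2,
}


def earliest_consumer_cycle(producer, issue_cycle):
    """Earliest issue cycle for an instruction that reads the producer's rd."""
    return issue_cycle + RD_LATENCY[producer]
```

For example, a consumer of a MOVcc issued in cycle 10 can issue in cycle 12 at the earliest, one cycle later than it could after an ADD.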

6.4 Instruction Latencies and Dispatching Properties

In this section, a machine description is given in the form of a table (TABLE 6-3) dealing with dispatching properties and latencies of operations. The static or nominal properties are modelled in the following terms (columns in TABLE 6-3), which are discussed below.
■ Latencies
■ Blocking properties in dispatching
■ Pipe resources (Ax, FGA, FGM, MS, BR)
■ Break rules in grouping (before, after, single-group)

The dynamic properties such as effect of cache miss are dealt with elsewhere.

Latency. In the Latency column, latencies are minimum cycles at which a dependent operation (consumer) can be dispatched relative to the producer operation without causing a dependency stall.

Operations like ADDcc produce two results, one in the destination register and another in condition code. For such operations, latencies are stated as a pair x,y where x is for the destination register dependence and y is for the condition code.

A zero latency implies that the producer and consumer operations can be grouped together in a single group, as in {SUBcc, BE %icc}.

Operations like UMUL have different latencies, depending on operand values. These are given as a range, min–max, for example, 6–8 in UMUL. Operations like LDFSR involve waiting for a specified condition. Such cases are described by footnotes and a notation like 32+ for CASA (meaning at least 32 cycles).

Cycles for branch operations (like BPcc) give the dispatching cycle of the retiring target operation relative to the branch. A pair of numbers, like 0,8, is given, depending on the outcome of a branch prediction, where 0 means a correct branch prediction and 8 means a mispredicted case.

Special cases such as FCMP(s,d), in which latencies depend on the type of consuming operations, are described in footnotes (bracketed, e.g., [1]).
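The latency notations described above (single values, pairs such as 1,0, ranges such as 6-8, open-ended values such as 32+, and bracketed footnote references) are regular enough to parse mechanically. The helper below is an illustrative sketch with hypothetical names. It handles syntax only: whether a pair means register/condition-code or correct/mispredicted depends on the instruction, and parenthesized alternatives such as 20(14) are not handled.

```python
import re


def parse_range(text):
    """Parse '6', '6-8', or '32+' into (min, max); max is None when open-ended."""
    text = text.strip()
    if text.endswith("+"):
        return (int(text[:-1]), None)
    low, _, high = text.replace("–", "-").partition("-")
    return (int(low), int(high) if high else int(low))


def parse_latency(cell):
    """Parse a Latency cell into a list of (min, max) tuples, one per result."""
    cell = re.sub(r"\[\d+\]", "", cell)  # drop footnote references like [2]
    return [parse_range(part) for part in cell.split(",")]
```

For instance, `parse_latency("7-8,6-7 [2]")` yields one range for the destination register and one for the condition code.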

Blocking. The Blocking column gives the number of cycles during which further dispatching is blocked. Operations like FDIVd have limited blocking property, that is, the blocking is limited to another FD operation in succession. Such cases are noted with footnotes.

Pipe. The Pipe column specifies the resource usage. Operations like MOVcc require more than one resource, as designated by the notation MS & BR. The operation LDF can dispatch to either MS or Ax as indicated.

Break and SIG. Grouping properties are given in columns Break and SIG (single-instruction group). In the Break column, an entry can be “Before,” meaning that this operation causes a break in a group so that the operation starts a new group. Operations like RDCCR require dispatching to be stalled until all operations in flight are completed (reach D-stage); in such cases, details are provided in a footnote reference in the Break column.

Operations like ALIGNADDR must be the last in an instruction group, causing a break in the group of type “After.”

Certain operations must be a single-instruction group. They are designated by “Yes” in the SIG column. Break “Before” and “After” are implied by SIG “Yes.”

TABLE 6-3 UltraSPARC III Instruction Latencies and Dispatching Properties

Instruction              Latency          Blocking    Pipe      Break        SIG

ADD                      1                            Ax
ADDcc                    1,0 [1]                      Ax
ADDC                     5                4           MS                     Yes
ADDCcc                   6,5 [2]          5           MS                     Yes
ALIGNADDR                2                            MS        After
ALIGNADDRL               2                            MS        After
AND                      1                            Ax
ANDcc                    1,0 [1]                      Ax
ANDN                     1                            Ax
ANDNcc                   1,0 [1]                      Ax
ARRAY(8,16,32)           2                            MS
Bicc                     0, 8 [3]         0, 5 [4]    BR
BMASK                    2                            MS        After
BPcc                     0, 8 [3]         0, 5 [4]    BR
BPR                      0, 8 [3]         0, 5 [4]    BR & MS
BSHUFFLE                 3                            FGA                    Yes
CALL label               0-3 [5]                      BR & MS
CASA                     32+              31+         MS        After
CASXA                    32+              31+         MS        After
DONE                     7                Yes         BR & MS                Yes
EDGE(8,16,32)(L)         5                4           MS                     Yes
EDGE(8,16,32)N           2                            MS
EDGE(8,16,32)LN          2                            MS
FABS(s,d)                3                            FGA
FADD(s,d)                4                            FGA
FALIGNDATA               3                            FGA
FANDNOT1(s)              3                            FGA
FANDNOT2(s)              3                            FGA
FAND(s)                  3                            FGA
FBPfcc                                                BR
FBfcc                                                 BR
FCMP(s,d)                1,5 [6]                      FGA
FCMPE(s,d)               1,5 [6]                      FGA
FCMPEQ(16,32)            4                            MS & FGM
FCMPGT(16,32)            4                            MS & FGM
FCMPLE(16,32)            4                            MS & FGM
FCMPNE(16,32)            4                            MS & FGM
FDIVd                    20(14) [7]       17(11) [8]  FGM
FDIVs                    17(14) [7]       14(11) [8]  FGM
FEXPAND                  3                            FGA
FiTO(s,d)                4                            FGA
FLUSH                    8                7           BR & MS   Before [9]   Yes
FLUSHW                                    Yes         MS                     Yes
FMOV(s,d)                3                            FGA
FMOV(s,d)cc              3                            FGA & BR
FMOV(s,d)fcc             3                            FGA & BR
FMOV(s,d)r               3                            FGA & MS
FMUL(s,d)                4                            FGM
FMUL8SUx16               4                            FGM
FMUL8ULx16               4                            FGM
FMUL8x16                 4                            FGM
FMUL8x16AL               4                            FGM
FMUL8x16AU               4                            FGM
FMULD8SUx16              4                            FGM
FMULD8ULx16              4                            FGM
FNAND(s)                 3                            FGA
FNEG(s,d)                3                            FGA
FNOR(s)                  3                            FGA
FNOT1(s)                 3                            FGA
FNOT2(s)                 3                            FGA
FONE(s)                  3                            FGA
FORNOT1(s)               3                            FGA
FORNOT2(s)               3                            FGA
FOR(s)                   3                            FGA
FPACKFIX                 4                            FGM
FPACK(16,32)             4                            FGM
FPADD(16, 16s, 32, 32s)  3                            FGA
FPMERGE                  3                            FGA
FPSUB(16, 16s, 32, 32s)  3                            FGA
FsMULd                   4                            FGM
FSQRTd                   29(14) [7]       26(11) [8]  FGM
FSQRTs                   23(14) [7]       20(11) [8]  FGM
FSRC1(s)                 3                            FGA
FSRC2(s)                 3                            FGA
F(s,d)TO(d,s)            4                            FGA
F(s,d)TOi                4                            FGA
F(s,d)TOx                4                            FGA
FSUB(s,d)                4                            FGA
FXNOR                    3                            FGA
FXOR(s)                  3                            FGA
FxTO(s,d)                4                            FGA
FZERO(s)                 3                            FGA
ILLTRAP                                               MS
JMPL reg,%o7             0-4, 9-10 [10]   0-3, 8-9    MS & BR
JMPL %i7+8,%g0           3-5, 10-12 [11]  2-4, 9-11   MS & BR
JMPL %o7+8, %g0          0-4, 9 [12]      0-3, 8      MS & BR
LDD                      2                Yes         MS        After
LDDA                     2                Yes         MS        After
LDDF(A)                  3                            MS or Ax
LDF(A)                   3                            MS or Ax
LDFSR                    [23]             Yes         MS                     Yes
LDSB(A)                  3                            MS
LDSH(A)                  3                            MS
LDSTUB(A)                31+              30+         MS        After
LDSW(A)                  3                            MS
LDUB(A)                  3                            MS
LDUH(A)                  3                            MS
LDUW(A)                  2                            MS
LDX(A)                   2                            MS
LDXFSR                   [23]             Yes         MS                     Yes
MEMBAR #LoadLoad         [13]                         MS                     Yes
MEMBAR #LoadStore        [13]                         MS                     Yes
MEMBAR #Lookaside        [14]                         MS                     Yes
MEMBAR #MemIssue         [14]                         MS                     Yes
MEMBAR #StoreLoad        [14]                         MS                     Yes
MEMBAR #StoreStore       [13]                         MS                     Yes
MEMBAR #Sync             [15]                         MS                     Yes
MOVcc                    2                            MS & BR
MOVfcc                   2                            MS & BR
MOVr                     2                            MS
MULScc                   6,5 [2]          5           MS                     Yes
MULX                     6-9              5-8         MS        After
OR                       1                            Ax
ORcc                     1,0 [1]                      Ax
ORN                      1                            Ax
ORNcc                    1,0 [1]                      Ax
PDIST                    4                            FGM
PREFETCH(A)                                           MS
PST                                                   MS
RDASI                    4                            MS        Before [16]
RDASR                    4                            MS        Before [16]
RDCCR                    4                            MS        Before [16]
RDFPRS                   4                            MS        Before [16]
RDPC                     4                            MS        Before [16]
RDPR                     4                            MS        Before [16]
RDTICK                   4                            MS        Before [16]
RDY                      4                            MS        Before [16]
RESTORE                  2                1           MS        Before [17]  Yes
RESTORED                                              MS                     Yes
RETRY                    2                Yes         BR & MS   After
RETURN                   2,9 [18]         1,8         BR & MS   Before [19]  Yes
SAVE                     2                1           MS        Before [20]  Yes
SAVED                    2                Yes         MS                     Yes
SDIV                     39               38          MS        After
SDIVcc                   40,39 [2]        39          MS        After
SDIVX                    71               70          MS        After
SETHI                    1                            Ax
SHUTDOWN                 NOP              NOP         MS        NOP          NOP
SIAM                                      Yes         MS                     Yes
SIR                                       Yes         BR & MS                Yes
SLL(X)                   1                            Ax
SMUL                     6-7              5-6         MS        After
SMULcc                   7-8,6-7 [2]      6-8         MS        After
SRA(X)                   1                            Ax
SRL(X)                   1                            Ax
STB(A)                                                MS
STBAR                    [21]                         MS                     Yes
STD(A)                   2                            MS                     Yes
STDF(A)                                               MS
STF(A)                                                MS
STFSR                    9                            MS        Before [22]  Yes
STH(A)                                                MS
STW(A)                                                MS
STX(A)                                                MS
STXFSR                   9                            MS        Before [22]  Yes
SUB                      1                            Ax
SUBcc                    1,0 [1]                      Ax
SUBC                     5                4           MS                     Yes
SUBCcc                   6,5 [2]          5           MS                     Yes
SWAP(A)                  31+              30+         MS        After
TADDcc                   5                Yes         MS                     Yes
TSUBcc                   5                Yes         MS                     Yes
T(i,x)cc                                              BR & MS
UDIV(X)                  40               39          MS        After
UDIVcc                   41,40 [2]        40          MS        After
UDIVX                    71               70          MS        After
UMUL                     6-8              5-7         MS        After
UMULcc                   7-8,6-7 [2]      6-8         MS        After
WRASI                    16                           BR & MS                Yes
WRASR                    7                            BR & MS                Yes
WRCCR                    7                            BR & MS                Yes
WRFPRS                   7                            BR & MS                Yes
WRPR                     7                            BR & MS                Yes
WRY                      7                            BR & MS                Yes
XNOR                     1                            Ax
XNORcc                   1,0 [1]                      Ax
XOR                      1                            Ax
XORcc                    1,0 [1]                      Ax

1. These operations produce two results: destination register and condition code (%icc, %xcc). The latency is 1 in the former case and 0 in the latter case. For example, SUBcc and BE %icc are grouped together (0 latency).
2. These operations produce two results: destination register and condition code (%icc, %xcc). The latency is given as a pair of numbers, m,n — for the register and condition code, respectively. When latencies vary in a range, such as in UMULcc, this range is indicated by pair - pair.
3. Latency is x,y for correct,incorrect branch prediction. It is measured as the difference in the dispatching cycle of the retiring target instruction and that of the branch.
4. Blocking cycles are x,y for correct,incorrect branch prediction. They are measured as the difference in the dispatching cycle of instruction in the delay slot (or target, if annulled) that retires and that of the branch.
5. Native Call and Link with immediate target address (label).
6. Latency (through fccn) depends on operations that use fccn: 1 for FMOV(s,d)fcc (due to FA resource constraint); 5 for FBfcc, FBPfcc, MOVfcc.
7. Latency in parentheses applies when operands involve IEEE special values (NaN, INF), including zero and illegal values.
8. Blocking is limited to another FD operation in succession; otherwise, nonblocking. Blocking cycles in parentheses apply when operands involve special and illegal values.
9. Dispatching stall (7+ cycles) until all stores in flight retire.
10. 0–4 if predicted true; 9–10 if mispredicted.
11. Latency is taken to be the difference in dispatching cycles from jmpl to target operation, including the effect of an operation in the delay slot. Blocking cycles thus may include cycles due to restore in the delay slot. In a given pair x,y, x applies when predicted correctly and y when predicted incorrectly. Each x or y may be a range of values.
12. 0–4 if predicted true; 9 if mispredicted.
13. This membar has NOP semantics, since the ordering specified is implicitly done by processor (memory model is TSO).
14. All operations in flight complete as in membar #Sync.
15. All operations in flight complete.
16. Issue stalls a minimum of 7 cycles until all operations in flight are done (get to D-stage).
17. Dispatching stalls until previous SAVE in flight, if any, reaches D-stage.
18. 2 if predicted correctly, 9 otherwise. Similarly for blocking cycles.
19. Dispatching stalls until previous RESTORE in flight, if any, reaches D-stage.
20. Dispatching stalls until previous RESTORE in flight, if any, reaches D-stage.
21. Same as membar #StoreStore, which is NOP.
22. Dispatching stalls until all FP operations in flight are done.
23. Wait for completion of all FP operations in flight.

S. CHAPTER 7

Traps

For general information on traps, please refer to Chapter 7 in Commonality.

In addition, please note the following.

Note – If an instruction breakpoint triggers an illegal_instruction trap, then the illegal_instruction trap has a higher priority than does a privileged_opcode trap.

7.1.2 Error_state

Upon entry into error_state, UltraSPARC III generates an immediate watchdog_reset (WDR) (impl. dep. #254).

7.4.2 Trap Type (TT)

UltraSPARC III implements all mandatory SPARC V9 and SPARC JPS1 exceptions, as described in Chapter 7 of Commonality. In addition, it implements the fast_ECC_error exception listed in TABLE 7-1, which is specific to UltraSPARC III (impl. dep. #202).

On UltraSPARC III, all traps are precise except for:
■ deferred traps from hardware failures encountered during memory accesses, described in Deferred Traps on page 188
■ disrupting traps from hardware failures encountered during memory accesses, described in Disrupting Traps on page 190
■ certain cases of the fast_ECC_error exception, described in TABLE 7-1

TABLE 7-1 Exception Specific to UltraSPARC III

Exception or Interrupt Request: fast_ECC_error
TT: 070₁₆
Global Register Set: AG
Priority: 2
Description: fast_ECC_error is taken on an ECC error from the external cache. The trap handler is required to flush the cache line containing the error from both the D$ and E$ since incorrect data would have already been written into the D$. The UltraSPARC III hardware will automatically correct single-bit ECC errors on the E$ writeback when the trap handler performs the E$ flush. After the caches are flushed, the instruction that encountered the error should be retried; the corrected data will then be brought back in from memory and reinstalled in the D$ and E$. In the case of a fast_ECC_error caused by a floating-point load (LDF, LDDF, LDFA, or LDDFA), in some cases the destination register(s) of the load may be overwritten by invalid data before the trap occurs [1] (impl. dep. #44). Thus, fast_ECC_error in such a case is not technically a precise trap; however, PC and nPC will be correctly recorded and all instructions issued before the trapped instruction will have completed.

1. Reference: UltraSPARC III Erratum #116, Bug ID#7220

S. CHAPTER 8

Memory Models

The SPARC V9 architecture is a model that specifies the behavior observable by software on SPARC V9 systems. Therefore, access to memory can be implemented in any manner, as long as the behavior observed by software conforms to that of the models described in Commonality and defined in Appendix D, Formal Specification of the Memory Models, also in Commonality.

The SPARC V9 architecture defines three different memory models: Total Store Order (TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO). All SPARC V9 processors must provide Total Store Order (or a more strongly ordered model, for example, Sequential Consistency) to ensure SPARC V8 compatibility. Processor consistency is guaranteed in all memory models.

This chapter departs from the organization of Chapter 8 in Commonality. It describes the characteristics of the memory models for UltraSPARC III in sections organized as follows:
■ Programmer-Visible Properties of Models
■ Memory Location Identification
■ Memory Accesses and Cacheability
■ Memory Synchronization
■ Atomic Operations
■ Nonfaulting Load
■ Prefetch Instructions
■ Block Loads and Stores
■ I/O and Accesses with Side Effects
■ Store Compression

8.1 Programmer-Visible Properties of Models

The programmer-visible properties are the same for all three models.
■ Loads are processed in program order; that is, there is an implicit MEMBAR #LoadLoad between them.
■ Loads can bypass earlier stores. Any such load that bypasses earlier stores must check (snoop) the store buffer for the most recent store to that location. A MEMBAR #Lookaside is not needed between a store and a subsequent load to the same noncacheable address.
■ A MEMBAR #StoreLoad must be used to prevent a load from bypassing a prior store, if Strong Sequential Order, as defined in The SPARC Architecture Manual V-9 (page 118), is desired.
■ Stores are processed in program order.
■ Stores cannot bypass earlier loads.
■ Accesses with the TTE.E bit set, such as those that have side effects, are all strongly ordered with respect to each other.
■ An external cache or write cache update is delayed on a store hit until all previous stores reach global visibility. For example, a cacheable store following a noncacheable store will not appear globally visible until the noncacheable store has become globally visible; there is an implicit MEMBAR #MemIssue between them.
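The interaction between load bypassing and the store buffer can be illustrated with a small software model. The following Python sketch is not a description of the hardware; the class and method names (StoreBufferModel, store, load, drain_one) are hypothetical and exist only for illustration. It assumes a single processor and shows a load bypassing earlier buffered stores while still observing the most recent buffered store to the same address.

```python
from collections import deque


class StoreBufferModel:
    """Toy model: stores retire to memory in program order; loads may bypass them."""

    def __init__(self):
        self.memory = {}        # address -> value that is globally visible
        self.buffer = deque()   # pending (address, value) pairs, oldest first

    def store(self, address, value):
        # A store enters the buffer; it is not yet globally visible.
        self.buffer.append((address, value))

    def load(self, address):
        # Snoop the buffer from youngest to oldest for a matching address.
        for pending_address, pending_value in reversed(self.buffer):
            if pending_address == address:
                return pending_value
        # No match: the load bypasses all buffered stores and reads memory.
        return self.memory.get(address, 0)

    def drain_one(self):
        # Stores become globally visible strictly in program order.
        address, value = self.buffer.popleft()
        self.memory[address] = value


model = StoreBufferModel()
model.store(0x100, 1)
model.store(0x100, 2)
model.store(0x200, 7)
snooped = model.load(0x100)    # sees the youngest buffered store to 0x100
bypassed = model.load(0x300)   # no buffered store matches: reads memory
model.drain_one()              # only the oldest store becomes visible
visible_after_one_drain = model.memory[0x100]
```

In this model the load from 0x300 completes while three stores are still buffered, which is the bypass that a MEMBAR #StoreLoad would prevent.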

8.1.1 Differences Between Memory Models

One difference between memory models is the amount of freedom an implementation is allowed in order to obtain higher performance during program execution. These memory models specify any constraints placed on the ordering of memory operations in uniprocessor and shared memory multiprocessor environments.

Although a program written for a weaker memory model potentially benefits from higher execution rates, it may require explicit memory synchronization instructions to function correctly if data are shared. The MEMBAR instruction is a SPARC V9 memory synchronization primitive that enables a programmer to explicitly control the ordering in a sequence of memory operations. (For a description of MEMBAR, see Memory Synchronization on page 68.)

Stores to UltraSPARC III internal ASIs, block loads, and block stores are outside of the memory model; that is, they need MEMBARs to control ordering. See Instruction Prefetch to Side-Effect Locations on page 73 and Block Load and Store Instructions (VIS I) on page 79.

Atomic load-stores are treated as both a load and a store and can only be applied to cacheable address spaces.

8.1.2 Current Memory Model

The current memory model is indicated in the PSTATE.MM field. It is unaffected by normal traps but is set to TSO (PSTATE.MM = 0) when the processor enters RED_state.

8.2 Memory Location Identification

A memory location is identified by an 8-bit address space identifier (ASI) and a 64-bit (virtual) address. The 8-bit ASI can be obtained from an ASI register or included in a memory access instruction. The ASI distinguishes among and provides an attribute to different 64-bit address spaces. For example, the ASI is used by the MMU and memory access hardware for control of virtual-to-physical address translations, access to implementation-dependent control and data registers, and access protection. Attempts by nonprivileged software (PSTATE.PRIV = 0) to access restricted ASIs (ASI<7> = 0) cause a privileged_action exception.
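The restricted-ASI check described above reduces to a test of bit 7 of the ASI. The following Python sketch illustrates that rule only; the function names are hypothetical, and the ASI values 0x04 and 0x80 in the usage lines are arbitrary examples of an ASI with bit 7 clear and one with bit 7 set.

```python
def is_restricted_asi(asi):
    """Return True when bit 7 of the 8-bit ASI is 0 (a restricted ASI)."""
    if not 0 <= asi <= 0xFF:
        raise ValueError("ASI must fit in 8 bits")
    return (asi & 0x80) == 0


def check_asi_access(asi, pstate_priv):
    """Return the name of the exception the access causes, or None if it is allowed."""
    if is_restricted_asi(asi) and pstate_priv == 0:
        return "privileged_action"
    return None


nonprivileged_restricted = check_asi_access(0x04, pstate_priv=0)
privileged_restricted = check_asi_access(0x04, pstate_priv=1)
nonprivileged_unrestricted = check_asi_access(0x80, pstate_priv=0)
```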

8.3 Memory Accesses and Cacheability

Memory is logically divided into real memory (cached) and I/O memory (noncached with and without side effects) spaces. Real memory spaces can be accessed without side effects. For example, a read from real memory space returns the information most recently written. In addition, an access to real memory space does not result in program-visible side effects. In contrast, a read from I/O space may not return the most recently written information and may result in program-visible side effects.

8.3.1 Coherence Domains

Two types of memory operations are supported in UltraSPARC III: cacheable and noncacheable accesses, as indicated by the page translation (TTE.CP, TTE.CV). SPARC V9 does not specify memory ordering between cacheable and noncacheable accesses. UltraSPARC III maintains TSO ordering between memory references regardless of their cacheability.

Cacheable Accesses

Accesses within the coherence domain are called cacheable accesses. They have these properties:
■ Data reside in real memory locations.
■ Accesses observe supported cache coherency protocol(s).
■ The unit of coherence is 64 bytes.

Noncacheable and Side-Effect Accesses

Accesses outside of the coherence domain are called noncacheable accesses. Some of these memory(-mapped) locations may have side effects when accessed. They have the following properties:
■ Data might not reside in real memory locations. Accesses may result in programmer-visible side effects. An example is memory-mapped I/O control registers, such as those in a UART.
■ Accesses do not observe supported cache coherency protocol(s).
■ The smallest unit in each transaction is a single byte.

Noncacheable accesses with the TTE.E bit set (those having side effects) are all strongly ordered with respect to other noncacheable accesses with the E bit set. In addition, store compression is disabled for these accesses. Speculative loads with the E bit set cause a data_access_exception trap (with SFSR.FT = 2, speculative load to page marked with E bit).

Noncacheable accesses with the TTE.E bit cleared (non-side-effect accesses) are processor consistent and obey TSO memory ordering. In particular, processor consistency ensures that a noncacheable load that references the same location as a previous noncacheable store will load the data of the previous store.

Note – Noncacheable operations are not ordered across the Sun Fireplane Interconnect and bootbus interfaces of UltraSPARC III. Operations within each bus (Fireplane, boot) are kept ordered in compliance with the sun-5/4u system architecture, but no order is enforced between I/O buses.

Note – Side effect, as indicated in TTE.E, does not imply noncacheability.

8.3.2 Global Visibility and Memory Ordering

A memory access is considered globally visible when one of the following events occurs:
■ Read or write permission is granted for a cacheable transaction in scalable shared memory (SSM) mode.
■ The transaction request is issued for a noncacheable transaction in SSM mode.
■ The transaction request is issued when not in SSM mode.

More details, including a definition of “SSM mode,” can be found in the Sun Fireplane Interconnect specification.

To ensure the correct ordering between cacheable and noncacheable domains, explicit memory synchronization is needed in the form of MEMBAR instructions. CODE EXAMPLE 8-1 illustrates the issues involved in mixing cacheable and noncacheable accesses.

CODE EXAMPLE 8-1 Memory Ordering and MEMBAR Examples

Assume that all accesses go to non-side-effect memory locations.

Process A:
While (1)
{
    Store D1: data produced
1   MEMBAR #StoreStore (needed in PSO, RMO for SPARC-V9 compliance)
    Store F1: set flag
    While F1 is set (spin on flag)
        Load F1
2   MEMBAR #LoadLoad, #LoadStore (needed in RMO for SPARC-V9 compliance)
    Load D2
}

Process B:
While (1)
{
    While F1 is cleared (spin on flag)
        Load F1
2   MEMBAR #LoadLoad, #LoadStore (needed in RMO for SPARC-V9 compliance)
    Load D1
    Store D2
1   MEMBAR #StoreStore (needed in PSO, RMO for SPARC-V9 compliance)
    Store F1: clear flag
}

8.4 Memory Synchronization

UltraSPARC III achieves memory synchronization through the MEMBAR (STBAR in SPARC V8) and FLUSH instructions, which give software explicit control of memory ordering in program execution. MEMBAR has several variations. All MEMBARs are implemented in one of two ways in UltraSPARC III: as a NOP or with MEMBAR #Sync semantics. Since the processor always executes with TSO memory ordering semantics, three of the ordering MEMBARs are implemented as NOPs. TABLE 8-1 lists the MEMBAR implementations.

TABLE 8-1 MEMBAR Semantics

MEMBAR                Semantics
#LoadLoad             NOP. All loads wait for completion of all previous loads.
#LoadStore            NOP. All stores wait for completion of all previous loads.
#Lookaside            #Sync. Waits until store buffer is empty.
#StoreStore, STBAR    NOP. All stores wait for completion of all previous stores.
#StoreLoad            #Sync. All loads wait for completion of all previous stores.
#MemIssue             #Sync. Waits until all outstanding memory accesses complete.
#Sync                 #Sync. Waits for all outstanding instructions and all deferred errors.
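TABLE 8-1 can be expressed as a simple lookup. The Python sketch below transcribes the table; the names MEMBAR_IMPLEMENTATION and effective_membar are hypothetical. The treatment of a MEMBAR that combines several mask bits (the result is #Sync if any requested bit is implemented as #Sync) is an assumption made for illustration and is not stated in the table.

```python
# Transcription of TABLE 8-1: how each MEMBAR variant is implemented.
MEMBAR_IMPLEMENTATION = {
    "#LoadLoad": "NOP",
    "#LoadStore": "NOP",
    "#Lookaside": "#Sync",
    "#StoreStore": "NOP",
    "STBAR": "NOP",
    "#StoreLoad": "#Sync",
    "#MemIssue": "#Sync",
    "#Sync": "#Sync",
}


def effective_membar(mask_names):
    """Return "#Sync" if any requested variant is implemented as #Sync, else "NOP".

    Assumption: a combined MEMBAR is as strong as its strongest member.
    """
    implementations = [MEMBAR_IMPLEMENTATION[name] for name in mask_names]
    return "#Sync" if "#Sync" in implementations else "NOP"


ordering_only = effective_membar(["#LoadLoad", "#StoreStore"])
with_store_load = effective_membar(["#LoadLoad", "#StoreLoad"])
```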

8.4.1 MEMBAR #Sync

MEMBAR #Sync forces all outstanding instructions and all deferred errors to be completed before any instructions after the MEMBAR are issued.

8.4.2 MEMBAR Rules

TABLE 8-2 summarizes the cases where the programmer must insert a MEMBAR to ensure ordering between two memory operations on UltraSPARC III. The MEMBAR requirements are independent of the currently selected memory model of TSO, PSO, or RMO. Use TABLE 8-2 for ordering purposes only. Be sure not to confuse memory operation ordering with processor consistency or deterministic operation; MEMBARs are required for deterministic operation of certain ASI register updates.

Caution – The MEMBAR requirements for UltraSPARC III are weaker than the requirements of SPARC V9 or sun-5/4u. To ensure code portability across systems, use the stronger of the MEMBAR requirements of SPARC V9 or sun-5/4u.

Read the table as follows: Read from row to column; the first memory operation in program order in a row is followed by the memory operation found in the column. Two symbols are used as table entries:
■ #: No intervening operation is required because Fireplane-compliant systems automatically order R before C.
■ M: MEMBAR #Sync or MEMBAR #MemIssue or MEMBAR #StoreLoad is required.

TABLE 8-2 Summary of MEMBAR Rules (TSO, PSO, and RMO Memory Models)

From Row      To Column Operation C:
Operation R:  load  store  atomic  load_nc_e  store_nc_e  load_nc_ne  store_nc_ne  bload  bstore  bload_nc  bstore_nc
load          #     #      #       #          #           #           #            M      M       M         M
store         M     #      #       M          #           M           #            M      M       M         M
atomic        #     #      #       #          #           #           #            M      M       M         M
load_nc_e     #     #      #       #          #           #           #            M      M       M         M
store_nc_e    M     #      #       #          #           M           #            M      M       M         M
load_nc_ne    #     #      #       #          #           #           #            M      M       M         M
store_nc_ne   M     #      #       M          #           M           #            M      M       M         M
bload         M     M      M       M          M           M           M            M      M       M         M
bstore        M     M      M       M          M           M           M            M      M       M         M
bload_nc      M     M      M       M          M           M           M            M      M       M         M
bstore_nc     M     M      M       M          M           M           M            M      M       M         M
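For tooling that checks instruction sequences, TABLE 8-2 can be encoded as a lookup from a (row, column) pair to a yes/no answer. The Python sketch below is a transcription of the table entries; the names OPERATIONS, RULES, and membar_required are hypothetical.

```python
# Column order of TABLE 8-2 (the same names are used for the rows).
OPERATIONS = [
    "load", "store", "atomic", "load_nc_e", "store_nc_e",
    "load_nc_ne", "store_nc_ne", "bload", "bstore", "bload_nc", "bstore_nc",
]

# One string per row; character i answers "row operation followed by OPERATIONS[i]".
# "#" = no intervening operation required, "M" = MEMBAR required.
RULES = {
    "load":        "#######MMMM",
    "store":       "M##M#M#MMMM",
    "atomic":      "#######MMMM",
    "load_nc_e":   "#######MMMM",
    "store_nc_e":  "M####M#MMMM",
    "load_nc_ne":  "#######MMMM",
    "store_nc_ne": "M##M#M#MMMM",
    "bload":       "MMMMMMMMMMM",
    "bstore":      "MMMMMMMMMMM",
    "bload_nc":    "MMMMMMMMMMM",
    "bstore_nc":   "MMMMMMMMMMM",
}


def membar_required(first, second):
    """Return True if a MEMBAR must separate `first` from a later `second`."""
    return RULES[first][OPERATIONS.index(second)] == "M"


store_then_load = membar_required("store", "load")
load_then_store = membar_required("load", "store")
```

As the Caution above notes, these entries describe UltraSPARC III only; portable code should follow the stronger SPARC V9 or sun-5/4u requirements.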

8.5 Atomic Operations

SPARC V9 provides three atomic instructions to support mutual exclusion. They are:
■ SWAP: Atomically exchanges the lower 32 bits in an integer register with a word in memory. This instruction is issued only after store buffers are empty. Subsequent loads interlock on earlier SWAPs. If a page is marked as virtually noncacheable but physically cacheable (TTE.CV = 0 and TTE.CP = 1), allocation is done to the E- and W-caches only. This includes all of the atomic-access instructions.
■ LDSTUB: Behaves like a SWAP except that it loads a byte from memory into an integer register and atomically writes all 1's (FF₁₆) into the addressed byte.
■ Compare and Swap (CAS(X)A): Combines a load, compare, and store into a single atomic instruction. It compares the value in an integer register to a value in memory. If they are equal, the value in memory is swapped with the contents of a second integer register. If they are not equal, the value in memory is still loaded into the second integer register, but the register's previous contents are not stored to memory. The E-cache will still go into M-state, even if there is no store. All of these operations are carried out atomically; in other words, no other memory operation can be applied to the addressed memory location until the entire compare-and-swap sequence is completed.

These instructions behave like both a load and store access, but the operation is carried out indivisibly. These instructions can be used only in the cacheable domain (not in noncacheable I/O addresses).
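The register and memory effects of the three atomic instructions can be summarized in a few lines of code. The Python sketch below models only the values read and written; it is single-threaded and therefore does not model atomicity, cache state, or traps. The function names and the dictionary used as memory are hypothetical.

```python
def swap(memory, address, register_value):
    """SWAP: exchange the low 32 bits of a register with a memory word.

    Returns the new register value (the old memory word).
    """
    old_word = memory[address]
    memory[address] = register_value & 0xFFFFFFFF
    return old_word


def ldstub(memory, address):
    """LDSTUB: return the addressed byte and write all 1's (0xFF) to it."""
    old_byte = memory[address]
    memory[address] = 0xFF
    return old_byte


def compare_and_swap(memory, address, compare_value, swap_value):
    """CAS: store swap_value only if memory equals compare_value.

    The destination register receives the old memory value in both cases.
    """
    old_value = memory[address]
    if old_value == compare_value:
        memory[address] = swap_value
    return old_value


lock = {0x40: 0x00}
first_attempt = ldstub(lock, 0x40)    # 0x00: the lock was free and is now held
second_attempt = ldstub(lock, 0x40)   # 0xFF: the lock was already held

counter = {0x80: 5}
failed = compare_and_swap(counter, 0x80, compare_value=4, swap_value=9)
succeeded = compare_and_swap(counter, 0x80, compare_value=5, swap_value=9)
```

The LDSTUB usage shows the usual spin-lock idiom: a returned value of 0 means the caller acquired the lock.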

These atomic instructions can be used with the ASIs listed in TABLE 8-3. Access with a restricted ASI in unprivileged mode (PSTATE.PRIV = 0) results in a privileged_action trap. Atomic accesses with noncacheable addresses cause a data_access_exception trap (with SFSR.FT = 4, atomic to page marked noncacheable). Atomic accesses with unsupported ASIs cause a data_access_exception trap (with SFSR.FT = 8, illegal ASI value or virtual address).

TABLE 8-3 ASIs That Support SWAP, LDSTUB, and CAS

ASI Name                              Access
ASI_NUCLEUS (LITTLE)                  Restricted
ASI_AS_IF_USER_PRIMARY (LITTLE)       Restricted
ASI_AS_IF_USER_SECONDARY (LITTLE)     Restricted
ASI_PRIMARY (LITTLE)                  Unrestricted
ASI_SECONDARY (LITTLE)                Unrestricted
ASI_PHYS_USE_EC (LITTLE)              Restricted

Note – Atomic accesses with nonfaulting ASIs are not allowed, because the latter have the load-only attribute.

8.6 Nonfaulting Load

A nonfaulting load behaves like a normal load, with the following exceptions:
■ It does not allow side-effect access. An access with the TTE.E bit set causes a data_access_exception trap (with SFSR.FT = 2, speculative load to page marked E bit).
■ It can be applied to a page with the TTE.NFO (nonfault access only) bit set; other types of accesses cause a data_access_exception trap (with SFSR.FT = 10₁₆, normal access to page marked NFO).

These loads are issued with ASI_PRIMARY_NO_FAULT{_LITTLE} or ASI_SECONDARY_NO_FAULT{_LITTLE}. A store with a NO_FAULT ASI causes a data_access_exception trap (with SFSR.FT = 8, illegal RW).

When a nonfaulting load encounters a TLB miss, the operating system should attempt to translate the page. If the translation results in an error, then 0 is returned and the load completes silently.
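The outcomes described in this section can be collected into one decision function. The Python sketch below is an illustration, not a description of the MMU: the exception class, the dictionary used to represent a page, and the order in which the conditions are tested are all assumptions. An ordinary translation fault on a normal access is outside the scope of this section and is reported with a generic LookupError.

```python
class DataAccessException(Exception):
    """Stand-in for the data_access_exception trap; fault_type mirrors SFSR.FT."""

    def __init__(self, fault_type, reason):
        super().__init__(f"data_access_exception (SFSR.FT = {fault_type:#x}): {reason}")
        self.fault_type = fault_type


def access(page, kind, no_fault_asi):
    """Model one access.

    page: None when translation fails, else a dict with boolean keys "e"
          (side effect) and "nfo" (nonfault access only) and an int "data".
    kind: "load" or "store".
    no_fault_asi: True when one of the *_NO_FAULT ASIs is used.
    """
    if no_fault_asi and kind == "store":
        raise DataAccessException(0x8, "illegal RW")
    if no_fault_asi:
        if page is None:
            return 0  # translation error: the load completes silently with 0
        if page["e"]:
            raise DataAccessException(0x2, "speculative load to page marked E bit")
        return page["data"]
    if page is None:
        raise LookupError("ordinary translation fault; outside this sketch")
    if page["nfo"]:
        raise DataAccessException(0x10, "normal access to page marked NFO")
    return page["data"] if kind == "load" else None


def fault_type(page, kind, no_fault_asi):
    """Return the SFSR.FT value raised by the access, or None if it completes."""
    try:
        access(page, kind, no_fault_asi)
    except DataAccessException as exc:
        return exc.fault_type
    return None


nfo_page = {"e": False, "nfo": True, "data": 42}
io_page = {"e": True, "nfo": False, "data": 7}
```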

Typically, optimizers use nonfaulting loads to move loads across conditional control structures that guard their use. This technique potentially increases the distance between a load of data and the first use of that data, in order to hide latency. The technique allows more flexibility in code scheduling and improves performance in certain algorithms by removing address checking from the critical code path.

For example, when following a linked list, nonfaulting loads allow the null pointer to be accessed safely in a speculative, read-ahead fashion; the page at virtual address 0₁₆ can safely be accessed with no penalty. The NFO bit in the MMU marks pages that are mapped for safe access by nonfaulting loads but that can still cause a trap by other, normal accesses.

Thus, programmers can trap on wild pointer references (many programmers count on an exception being generated when accessing address 0₁₆ to debug code) while benefiting from the acceleration of nonfaulting access in debugged library routines.

8.7 Prefetch Instructions

The processor implements all SPARC V9 prefetch instructions except for prefetch page. All prefetches check the external cache before issuing a system request for the requested data.

TABLE 8-4 describes prefetch instructions.

TABLE 8-4 Prefetch Instructions

Prefetch Instruction for    Description
Several Reads (fcn = 0)     64 bytes of data from the specified target address are prefetched by means of an RTS transaction and installed in the E-cache. The data are installed into the E-cache in Shared or Exclusive state.
One Read (fcn = 1)          64 bytes of data from the specified target address are prefetched by means of an RTS transaction and installed in the cache.
Several Writes (fcn = 2)    64 bytes of data from the specified target address are prefetched by means of an RTS transaction and installed in the E-cache. The data are installed into the E-cache in Exclusive state if possible (that is, if no other processor responds to the snoop).
One Write (fcn = 3)         64 bytes of data from the specified target address are prefetched by means of an RTS transaction and installed in the E-cache. The data are installed into the E-cache in Exclusive state if possible (that is, if no other processor responds to the snoop).
Page (fcn = 4)              Implemented as a NOP.

8.8 Block Loads and Stores

Block load and store instructions work like normal floating-point load and store instructions, except that the data size (granularity) is 64 bytes per transfer. See Block Load and Store Instructions (VIS I) on page 79 for a full description of the instructions.

8.9 I/O and Accesses with Side Effects

I/O locations might not behave with memory semantics. Loads and stores could have side effects; for example, a read access could clear a register or pop an entry off a FIFO. A write access could set a register address port so that the next access to that address will read or write a particular internal register. Such devices are considered order sensitive. Also, such devices may only allow accesses of a fixed size, so store merging of adjacent stores or stores within a 16-byte region would cause an error (see Store Compression on page 74).

The UltraSPARC III MMU includes an attribute bit in each page translation, TTE.E, which when set signifies that this page has side effects. Accesses other than block loads or stores to pages that have this bit set exhibit the following behavior:
■ Noncacheable accesses are strongly ordered with respect to each other.
■ Noncacheable loads with the E bit set will not be issued to the system until all previous control transfers are resolved.
■ Noncacheable store compression is disabled for E-bit accesses.
■ Exactly those E-bit accesses implied by the program are made in program order.
■ Nonfaulting loads are not allowed and cause a data_access_exception exception (with SFSR.FT = 2, speculative load to page marked E bit).
■ A MEMBAR may be needed between side-effect and non-side-effect accesses while in PSO and RMO modes, for portability across SPARC V9 processors, as well as in some cases of TSO. See TABLE 8-2 on page 69.

8.9.1 Instruction Prefetch to Side-Effect Locations

The processor does instruction prefetching and follows branches that it predicts are taken. Addresses mapped by the IMMU can be accessed even though they are not actually executed by the program. Normally, locations with side effects or that generate timeouts or bus errors are not mapped by the IMMU, so prefetching will not cause problems.

When running with the IMMU disabled, software must avoid placing data in the path of a control transfer instruction target or sequentially following a trap or conditional branch instruction. Data can be placed sequentially following the delay slot of a BA, BPA (p = 1), CALL, or JMPL instruction. Instructions should not be placed closer than 256 bytes to locations with side effects.

8.9.2 Instruction Prefetch Exiting Red State

Exiting RED_state by writing 0 to PSTATE.RED in the delay slot of a JMPL instruction is not recommended. A noncacheable instruction prefetch may be made to the JMPL target, which may be in a cacheable memory area. This situation can result in a bus error on some systems and can cause an instruction access error trap. Programmers can mask the trap by setting the NCEEN bit in the E-cache Error Enable Register to 0, but doing so will mask all noncorrectable error checking. Exiting RED_state with DONE or RETRY or with the destination of the JMPL noncacheable will avoid the problem.

8.9.3 UltraSPARC III Internal ASIs

ASIs in the ranges 30₁₆–6F₁₆ and 72₁₆–7F₁₆ are used for accessing internal states. Stores to these ASIs do not follow the normal memory-model ordering rules. Correct operation can be assured by adhering to the following requirements:
■ A MEMBAR #Sync is needed after a store to an internal ASI other than MMU ASIs before the point that side effects must be visible. This MEMBAR must precede the next load or non-internal store. To avoid data corruption, the MEMBAR must also occur before¹ the delay slot of a delayed control transfer instruction of any type. Alternatively, a MEMBAR #Sync could be inserted at the beginning of any vulnerable trap handler². “Vulnerable” trap handlers are those which contain one or more LDXAs from any internal ASI (ASIs 0x30-0x6F, 0x72-0x77, and 0x7A-0x7F). However, this may cause unacceptable performance reduction in some trap handlers, so this is not the preferred alternative.
■ A FLUSH, DONE, or RETRY is needed after a store to an internal IMMU ASI (ASI 50₁₆–52₁₆, 54₁₆–5F₁₆), an I-cache ASI (66₁₆–6F₁₆), or the IC bit in the DCU Control Register, prior to the point that side effects must be visible. A store to DMMU registers other than the context ASIs can use a MEMBAR #Sync. To avoid data corruption, the MEMBAR must also occur before³ the delay slot of a delayed control transfer instruction of any type.

If the store is to an IMMU state register (ASI = 50₁₆, virtual address = 18₁₆), then the FLUSH, DONE, or RETRY must immediately⁴ follow the store. Furthermore, one of the following must be true, to prevent an intervening ITLB miss from causing stale data to be stored:

■ The code must be locked down in the ITLB, or
■ The store and the subsequent FLUSH, DONE, or RETRY should be kept on the same 8 KB page of instruction memory.
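The ASI ranges named in this section lend themselves to a small classifier, for example in a tool that audits handler code. The Python sketch below encodes only the ranges listed above; the function names are hypothetical. It does not identify DMMU context registers or the IC bit of the DCU Control Register, because their ASI numbers are not given in this section, so its answer for those cases should not be relied on.

```python
def is_internal_asi(asi):
    """ASIs 0x30-0x6F and 0x72-0x7F access internal state."""
    return 0x30 <= asi <= 0x6F or 0x72 <= asi <= 0x7F


def is_immu_asi(asi):
    """Internal IMMU ASIs: 0x50-0x52 and 0x54-0x5F."""
    return 0x50 <= asi <= 0x52 or 0x54 <= asi <= 0x5F


def is_icache_asi(asi):
    """I-cache ASIs: 0x66-0x6F."""
    return 0x66 <= asi <= 0x6F


def sync_after_store(asi):
    """Return the synchronization named in the text for a store to `asi`.

    Returns None for ASIs that are not internal. DMMU context registers and
    the DCU Control Register IC bit are not distinguished (see lead-in).
    """
    if not is_internal_asi(asi):
        return None
    if is_immu_asi(asi) or is_icache_asi(asi):
        return "FLUSH, DONE, or RETRY"
    return "MEMBAR #Sync"


immu_store = sync_after_store(0x50)
other_internal_store = sync_after_store(0x30)
ordinary_store = sync_after_store(0x80)
```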

8.10 Store Compression

Consecutive non-side-effect, noncacheable stores can be combined into aligned 16-byte entries in the store buffer to improve store bandwidth. Cacheable stores will naturally coalesce in the write cache rather than be compressed in the store buffer. Noncacheable stores can be compressed only with adjacent noncacheable stores. To maintain strong ordering for I/O accesses, stores with the side-effect attribute (E bit set) cannot be combined with any other stores.

1. Reference: UltraSPARC III Erratum #64, Bug ID #7102
2. Reference: UltraSPARC III Erratum #64, Bug ID #7102
3. Reference: UltraSPARC III Erratum #64, Bug ID #7102
4. Reference: UltraSPARC III Erratum #63, Bug ID #7100

A 16-byte noncacheable merge buffer is used to coalesce adjacent noncacheable stores. Noncacheable stores will continue to coalesce into the 16-byte buffer until one of the following conditions occurs:
■ The data are pulled from the noncacheable merge buffer by the target device.
■ The store would overwrite a previously written entry (a valid bit is kept for each of the 16 bytes).

Caution – This behavior is unique to the UltraSPARC III processor and differs from previous UltraSPARC implementations.

■ The store is not within the current address range of the merge buffer (within the 16-byte aligned merge region).
■ The store is a cacheable store.
■ The store is to a side-effect page.
■ MEMBAR #Sync.
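The termination conditions listed above can be exercised with a small software model of the merge buffer. The Python sketch below is illustrative only: the class name, its methods, and the representation of a drained buffer as a (base address, valid bytes) pair are assumptions. Cacheable and side-effect stores are modeled only as events that close the buffer; the path they themselves take is not modeled, and a pull by the target device would correspond to calling drain() directly.

```python
MERGE_BYTES = 16


class MergeBufferModel:
    """Toy model of a 16-byte noncacheable merge buffer with per-byte valid bits."""

    def __init__(self):
        self.base = None                   # aligned base of the open region
        self.data = [None] * MERGE_BYTES   # None means the byte's valid bit is clear
        self.issued = []                   # (base, {offset: byte}) per drained buffer

    def drain(self):
        # Send the valid bytes out as one transaction and empty the buffer.
        if self.base is not None:
            valid = {i: b for i, b in enumerate(self.data) if b is not None}
            self.issued.append((self.base, valid))
        self.base = None
        self.data = [None] * MERGE_BYTES

    def membar_sync(self):
        self.drain()

    def store(self, address, payload, cacheable=False, side_effect=False):
        if cacheable or side_effect:
            self.drain()   # such stores never merge; they only close the buffer
            return
        base = address & ~(MERGE_BYTES - 1)
        offset = address - base
        if offset + len(payload) > MERGE_BYTES:
            raise ValueError("store crosses the 16-byte merge region")
        overwrites = any(self.data[offset + i] is not None for i in range(len(payload)))
        if self.base is not None and (base != self.base or overwrites):
            self.drain()   # out of range, or would overwrite a valid byte
        self.base = base
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte


buffer = MergeBufferModel()
buffer.store(0x1000, b"\x01\x02\x03\x04")
buffer.store(0x1004, b"\x05\x06\x07\x08")   # adjacent: merges with the first store
buffer.store(0x1000, b"\xaa")               # would overwrite: drains first
buffer.store(0x1010, b"\x01")               # different 16-byte region: drains first
buffer.membar_sync()                        # MEMBAR #Sync: drains what remains
```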

S. APPENDIX A

Instruction Definitions: UltraSPARC III Extensions

The UltraSPARC III processor extends the standard SPARC V9 instruction set with four classes of instructions, as described in Commonality. This appendix lists additional, UltraSPARC III-specific information for the following instructions. The section numbers correspond to those of the same instructions in Commonality.
■ Alignment Instructions (VIS I)
■ Three-Dimensional Array Addressing Instructions (VIS I)
■ Block Load and Store Instructions (VIS I)
■ Byte Mask and Shuffle Instructions (VIS II)
■ Load Floating-Point
■ Load Floating-Point Alternate
■ Logical Operate Instructions (VIS I)
■ Memory Barrier
■ Partial Store (VIS I)
■ Partitioned Add/Subtract (VIS I)
■ Partitioned Multiply (VIS I)
■ Pixel Formatting (VIS I)
■ Prefetch Data
■ Set Interval Arithmetic Mode (VIS II)
■ SHUTDOWN Instruction (VIS I)
■ Store Floating Point
■ Store Floating Point into Alternate Space


A.2 Alignment Instructions (VIS I)

Please refer to Section A.2 of Commonality for details.

Note – For good performance, the result of faligndata should not be used as a source operand for a 32-bit graphics instruction in the next three instruction groups.

A.3 Three-Dimensional Array Addressing Instructions (VIS I)

Please refer to Section A.3 of Commonality for details of these instructions.

Note – To maximize reuse of external cache and TLB data, software should block array references of a large image to the 64-Kbyte level. This means processing elements within a 32x64x64 block.


A.4 Block Load and Store Instructions (VIS I)

Please refer to Section A.4 of Commonality for principal details.

Note that block load and block store, by definition, use the LDDFA and STDFA opcodes combined with one of several ASIs dedicated for use by block load and store operations.

Rules

Note – Block load and block store instructions are used for transferring large blocks of data (more than 256 bytes), for example, in C library routines bcopy() and bfill(). They do not allocate in the data cache or external cache on a miss. They update the external cache on a hit. One BLD and, in the most extreme case, up to 15 (maximum) BSTs can be outstanding on the interconnect at one time.

Since UltraSPARC III stalls the pipeline to wait for completion of a block load (BLD), the following rules are more constraining than necessary for UltraSPARC III. However, software written to follow the full set of rules described in JPS1 Commonality and this section will run correctly on UltraSPARC I, II, and III.

BLD does not provide register dependency interlocks as do ordinary load instructions.
■ Before referencing BLD data, a second BLD (to a different set of registers) or a MEMBAR #Sync must be performed. If a second BLD is used to synchronize against returning data, UltraSPARC III will continue execution before all data has been returned. The programmer is then responsible for scheduling instructions so registers are only used when they become valid.
■ To determine when data is valid, the programmer must count instruction groups containing FP operate instructions (not FP loads or stores). The lowest-numbered register being loaded by the first BLD may be referenced in the first instruction group following the second BLD, using an FP operate instruction only.
■ The second-lowest-numbered register may be referenced in the second instruction group containing an FP operate instruction, and so on. The best case grouping of FP operate instructions should be assumed (UltraSPARC III can issue two FP operate instructions simultaneously, assuming they are in different FP classes).
■ If this BLD/BLD synchronization mechanism is used, the initial reference to the BLD data must be by an FP operate instruction (not an FP store), and only instruction groups with FP operate instructions are counted when determining BLD data availability.

If the above rules are violated, data from before or after the BLD may be returned by a subsequent instruction reading a register being filled by the BLD operation.
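The group-counting rule for BLD/BLD synchronization can be checked mechanically once a schedule has been divided into instruction groups. The Python sketch below is one possible reading of the rules above, not a tool supplied with the processor: the function names and the input representation are hypothetical, and the caller is assumed to supply the grouping (using the best-case grouping, as the rules require). Register index 0 denotes the lowest-numbered register loaded by the first BLD.

```python
def first_safe_group(register_index):
    """1-based count of FP-operate groups after the second BLD at which the
    register with this index (0 = lowest-numbered) may first be referenced."""
    return register_index + 1


def find_violations(groups):
    """Check a schedule of instruction groups issued after the second BLD.

    groups: one entry per instruction group, in issue order. An entry is None
            if the group contains no FP operate instruction; otherwise it is
            the set of first-BLD register indices read by FP operate
            instructions in that group (possibly empty).
    Returns a list of (group_position, register_index) pairs read too early.
    """
    violations = []
    fp_groups_seen = 0
    for position, reads in enumerate(groups):
        if reads is None:
            continue  # groups without FP operate instructions are not counted
        fp_groups_seen += 1
        for register_index in sorted(reads):
            if fp_groups_seen < first_safe_group(register_index):
                violations.append((position, register_index))
    return violations


safe_schedule = [{0}, None, {1}, {2}]
unsafe_schedule = [{0}, None, {1}, {2, 3}]
safe_result = find_violations(safe_schedule)
unsafe_result = find_violations(unsafe_schedule)
```

In the second schedule, register index 3 is read in the third FP-operate group, one group earlier than the rule allows.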


If a MEMBAR #Sync is used to synchronize on BLD data, there are no restrictions on data usage, although this will have lower performance. No other MEMBARs can be used to provide data synchronization for BLD.

FP operate instructions can be issued in a single group with FP stores. If BLD/BLD synchronization is used, FP operates and FP stores can be interlaced. This allows an FP operate to reference the returning data before using the data in any FP store (normal store or block store).

Typically, the FP operate instruction will be an FMOVD or FALIGNDATA. UltraSPARC III also continues execution, without register interlocks, before all the store data for BSTs are transferred from the register file.

If store data registers are overwritten before the next block store or MEMBAR #Sync instruction, then the following rule must be observed:
■ The first register can be overwritten in the same instruction group as the BST, the second register can be overwritten in the instruction group following the block store, and so on. If this rule is violated, the store may read the old data or the newly overwritten data from a register.

When determining correctness of a code sample, be aware that a given UltraSPARC implementation may provide more interlocking than required above. For example, there may be partial register interlocks (such as an interlock on the lowest-numbered register). So, software that doesn't meet the above constraints may appear to work on a particular platform. However, to be portable across all UltraSPARC implementations, software must follow all of the above rules.

There must be a MEMBAR #Sync or a trap following a BST before a DONE, RETRY, or WRPR to PSTATE instruction is executed. If this rule is violated, instructions after the DONE, RETRY, or WRPR to PSTATE may not see the effects of the updated PSTATE register.

BLD does not follow memory model ordering with respect to stores. In particular, read-after-write and write-after-read hazards to overlapping addresses are not detected. The side-effects bit (TTE.E) associated with the access is ignored (see Translation Table Entry (TTE) on page 130). Some ordering considerations follow:
■ If ordering with respect to earlier stores is important (for example, a block load that overlaps previous stores), then there must be an intervening MEMBAR #StoreLoad or stronger MEMBAR.
■ If ordering with respect to later stores is important (for example, a block load that overlaps a subsequent store), then there must be an intervening MEMBAR #LoadStore or a reference to the block load data. This restriction does not apply when a trap is taken, so the trap handler does not have to worry about pending block loads.
■ If the BLD overlaps a previous or later store and there is no intervening MEMBAR, then the trap or data referencing the BLD may return data from before or after the store.


BST does not follow memory model ordering with respect to loads, stores, or flushes. In particular, read-after-write, write-after-write, flush-after-write, and write-after-read hazards to overlapping addresses are not detected. The side-effects bit associated with the access is ignored. Some ordering considerations follow:
■ If ordering with respect to earlier or later loads or stores is important, then there must be an intervening reference to the load data (for earlier loads) or an appropriate MEMBAR instruction. This restriction does not apply when a trap is taken, so the trap handler does not have to worry about pending block stores.
■ If the BST overlaps a previous load and there is no intervening load data reference or MEMBAR #StoreLoad instruction, then the load may return data from before or after the store and the contents of the block are undefined.
■ If the BST overlaps a later load and there is no intervening trap or MEMBAR #LoadStore instruction, then the contents of the block are undefined.
■ If the BST overlaps a later store or flush and there is no intervening trap or MEMBAR #Sync instruction, then the contents of the block are undefined.
■ If the ordering of two successive BST instructions (overlapping or not) is required, then a MEMBAR #Sync must occur between the BST instructions.

Block operations do not obey the ordering restrictions of the currently selected processor memory model (TSO, PSO, RMO). Block operations always execute under an RMO memory ordering model. Explicit MEMBAR instructions are required to order block operations among themselves or with respect to normal memory operations. In addition, block operations do not conform to dependence order on the issuing processor. That is, no read-after-write, write-after-read, or write-after-write checking occurs between block operations. Explicit MEMBAR #Sync instructions are required to enforce dependence ordering between block operations that reference the same address.
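The BLD and BST ordering considerations in this section can be condensed into a small lookup. The following Python sketch is illustrative only: the name `required_barrier` and the operation labels are hypothetical, and the table encodes only the overlapping-access cases as they are stated in the text (a reference to the load data or a trap, where the text allows one, is not modeled).

```python
# Hypothetical lookup of the MEMBAR named in the text for overlapping
# accesses around a block load (BLD) or block store (BST).
# Keys are (earlier operation, later operation) in program order.
BLOCK_OP_BARRIERS = {
    ("store", "bld"): "#StoreLoad",  # BLD overlapping an earlier store
    ("bld", "store"): "#LoadStore",  # BLD overlapping a later store
    ("load", "bst"): "#StoreLoad",   # BST overlapping a previous load
    ("bst", "load"): "#LoadStore",   # BST overlapping a later load
    ("bst", "store"): "#Sync",       # BST overlapping a later store
    ("bst", "flush"): "#Sync",       # BST overlapping a later flush
    ("bst", "bst"): "#Sync",         # two successive BSTs that must be ordered
}


def required_barrier(earlier, later):
    """Return the MEMBAR mask named in the text, or None if the pair is not covered."""
    return BLOCK_OP_BARRIERS.get((earlier, later))
```

For example, `required_barrier("bst", "bst")` yields `"#Sync"`, matching the rule for two successive BST instructions; pairs outside the cases listed above return `None` rather than guessing.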

Typically, BLD and BST will be used in loops where software can ensure that the data being loaded and the data being stored do not overlap. The loop will be preceded and followed by the appropriate MEMBARs to ensure that there are no hazards with loads and stores outside the loops. CODE EXAMPLE A-1 demonstrates the loop.

Example

CODE EXAMPLE A-1 Byte-Aligned Block Copy Inner Loop with BLD/BST

Note that the loop must be unrolled two times to achieve maximum performance. All FP registers are double-precision. Eight versions of this loop are needed to handle all the cases of doubleword misalignment between the source and destination.

loop:
        faligndata %f0, %f2, %f34
        faligndata %f2, %f4, %f36
        faligndata %f4, %f6, %f38
        faligndata %f6, %f8, %f40
        faligndata %f8, %f10, %f42
        faligndata %f10, %f12, %f44
        faligndata %f12, %f14, %f46
        addcc      %l0, -1, %l0
        bg,pt      l1
        fmovd      %f14, %f48                   ! branch delay slot
        ! (end of loop handling)
l1:
        ldda       [regaddr] ASI_BLK_P, %f0     ! block load of next source block
        stda       %f32, [regaddr] ASI_BLK_P    ! block store of aligned data
        faligndata %f48, %f16, %f32
        faligndata %f16, %f18, %f34
        faligndata %f18, %f20, %f36
        faligndata %f20, %f22, %f38
        faligndata %f22, %f24, %f40
        faligndata %f24, %f26, %f42
        faligndata %f26, %f28, %f44
        faligndata %f28, %f30, %f46
        addcc      %l0, -1, %l0
        be,pn      done
        fmovd      %f30, %f48                   ! branch delay slot
        ldda       [regaddr] ASI_BLK_P, %f16
        stda       %f32, [regaddr] ASI_BLK_P
        ba         loop
        faligndata %f48, %f0, %f32              ! branch delay slot
done:
        ! (end of loop processing)


A.5 Byte Mask and Shuffle Instructions (VIS II)

Please refer to Section A.5 of Commonality for details.

Note – The BMASK instruction uses the MS pipeline and so cannot be grouped with a store, nonprefetchable load, or a special instruction. The integer rd register result is available after a 2-cycle latency. A younger BMASK can be grouped with an older BSHUFFLE (BMASK is “break-after” (see page 44)).

Results have a 4-cycle latency to other dependent FG instructions. The FGA pipe is used to execute BSHUFFLE. The GSR mask must be set at or before the instruction group previous to the BSHUFFLE (GSR.mask dependency). BSHUFFLE is fully pipelined (one per cycle).

A.13 Floating-Point Add and Subtract

Please refer to section A.13 of Commonality for details.

When FSR.NS = 0, the processor operates in standard floating-point mode. FADD or FSUB with a subnormal result causes an fp_exception_other exception with FSR.ftt = unfinished_FPop, system software emulates the instruction, and the correct numerical result is calculated. UltraSPARC II and UltraSPARC III operate identically in this case.

When FSR.NS = 1, the processor operates in “nonstandard” floating-point mode (see Section 5.1.7 in Commonality). When FSR.NS = 1 and FADD or FSUB produces a subnormal result on UltraSPARC II, the result is replaced by 0 in hardware, without trapping. On UltraSPARC III, an fp_exception_other exception occurs with FSR.ftt = unfinished_FPop (even though the processor is operating in nonstandard floating-point mode; see footnote 1). System software then emulates the instruction and calculates the correct numerical result (instead of replacing the result with 0).

So UltraSPARC III may produce a different (albeit more accurate) result than UltraSPARC II in the following situation:
■ FADD or FSUB produces a subnormal result
■ FSR.NS = 1

1. References: UltraSPARC III Erratum #142, Bugtraq ID# 4303733


A.26 Load Floating-Point

Please refer to section A.26 of Commonality for details.

If a load floating-point instruction causes a trap due to any type of data access error, the contents of the destination register are undefined (impl. dep. #44). See also Table 7-1, Exception Specific to UltraSPARC III, on page 62. In UltraSPARC III, if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned for an LDDF instruction, an LDDF_mem_address_not_aligned exception is generated (impl. dep. #109). Since UltraSPARC traps on all LDQF(A) instructions (illegal_instruction), the LDQF_mem_address_not_aligned trap does not exist (impl. dep. #111).
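The alignment behavior for LDDF can be pictured as a three-way classification of the effective address. The sketch below is a hedged illustration: the function name `lddf_alignment_outcome` is hypothetical, and the final branch (an address that is not even 32-bit aligned raising the generic mem_address_not_aligned exception) is an assumption taken from general SPARC V9 behavior rather than from this section.

```python
def lddf_alignment_outcome(effective_address):
    """Classify an LDDF effective address by alignment (illustrative model only)."""
    if effective_address % 8 == 0:
        # Doubleword aligned: no alignment exception.
        return "ok"
    if effective_address % 4 == 0:
        # 32-bit aligned but not 64-bit aligned (impl. dep. #109).
        return "LDDF_mem_address_not_aligned"
    # Assumption from SPARC V9, not stated in this section.
    return "mem_address_not_aligned"
```

Under these assumptions, an address such as 0x1004 falls into the second case, which is the one a trap handler could emulate with two 32-bit loads.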

A.27 Load Floating-Point Alternate

Please refer to section A.27 of Commonality for details.

If a load floating-point instruction causes a trap due to any type of data access error, the contents of the destination register are undefined (impl. dep. #44). See also Table 7-1, Exception Specific to UltraSPARC III, on page 62. In UltraSPARC III, an LDDFA instruction causes an LDDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned (impl. dep. #109). Since UltraSPARC traps on all LDQF(A) instructions (illegal_instruction), the LDQF_mem_address_not_aligned trap does not exist (impl. dep. #111).

A.30 Load Quadword, Atomic

Please refer to section A.30 of Commonality for details.

In UltraSPARC III, a Load Quadword Atomic instruction that accesses a page marked as read-only by the MMU will cause a fast_data_access_protection exception. Although this is a bug in the implementation, its effect is expected to be minimal, since Load Quadword Atomic is a privileged instruction primarily used for loading TTE entries from memory pages that are writable. Therefore, the exception may rarely if ever occur, and if it does occur, the load will still complete correctly (just more slowly, due to the exception).

A.33 Logical Operate Instructions (VIS I)

Description The standard 64-bit versions of these instructions perform one of sixteen 64-bit logical operations between rs1 and rs2. The result is stored in rd. The 32-bit (single-precision) version of these instructions performs 32-bit logical operations.

Note – For good performance, the result of a single logical should not be used as part of a 64-bit graphics instruction source operand in the next three instruction groups. Similarly, the result of a standard logical should not be used as a 32-bit graphics instruction source operand in the next three instruction groups.


A.35 Memory Barrier

Opcode | op3     | Operation
MEMBAR | 10 1000 | Memory Barrier

Format (3)

Bits  | 31:30 | 29:25 | 24:19 | 18:14  | 13  | 12:7 | 6:4   | 3:0
Field | 10    | 0     | op3   | 0 1111 | i=1 | -    | cmask | mmask

Assembly Language Syntax membar membar_mask
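As a worked illustration of the Format (3) layout, the following Python sketch assembles the 32-bit MEMBAR instruction word from its fields. The helper name `encode_membar` is hypothetical, and the individual mask bit assignments (#LoadLoad through #Sync) are taken from the SPARC V9 MEMBAR definition rather than from this section.

```python
# mmask occupies bits <3:0> and cmask bits <6:4>; the bit assignments below
# follow the SPARC V9 MEMBAR definition (assumed here, not listed in this section).
MMASK = {"#LoadLoad": 0x1, "#StoreLoad": 0x2, "#LoadStore": 0x4, "#StoreStore": 0x8}
CMASK = {"#Lookaside": 0x1, "#MemIssue": 0x2, "#Sync": 0x4}


def encode_membar(*masks):
    """Build the 32-bit MEMBAR instruction word for the given mask names."""
    mmask = 0
    cmask = 0
    for name in masks:
        if name in MMASK:
            mmask |= MMASK[name]
        elif name in CMASK:
            cmask |= CMASK[name]
        else:
            raise ValueError("unknown MEMBAR mask: " + name)
    word = 0b10 << 30        # op = 10
    word |= 0 << 25          # rd = 0
    word |= 0b101000 << 19   # op3 = 10 1000
    word |= 0b01111 << 14    # rs1 = 0 1111
    word |= 1 << 13          # i = 1
    word |= cmask << 4       # cmask in bits <6:4>
    word |= mmask            # mmask in bits <3:0>
    return word
```

With these assumptions, `encode_membar("#Sync")` evaluates to 0x8143E040 and `encode_membar("#StoreLoad")` to 0x8143E002; several masks can be combined in one call.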

Description The information included in this section should not be used for the decision as to when MEMBARs should be added to software that needs to be compliant across all UltraSPARC-based platforms. The operations of bload/bstore on UltraSPARC III are generally more ordered with respect to other operations, compared to UltraSPARC I and UltraSPARC II. Code written and found to “work” on UltraSPARC III may not work on UltraSPARC I and UltraSPARC II if it does not follow the rules for bload/bstore specified for those processors. Code that happens to work on UltraSPARC I and UltraSPARC II may not work on UltraSPARC III if it did not meet the coding guidelines specified for those processors. In no case is the coding requirement for UltraSPARC III more restrictive than that for UltraSPARC I and UltraSPARC II.

Software developers should not use the information in this section for determining the need for MEMBARs but instead should rely on the SPARC V9 MEMBAR rules. These UltraSPARC III rules are less restrictive than SPARC V9, UltraSPARC I, and UltraSPARC II rules and are never more restrictive.

MEMBAR Rules

The UltraSPARC III hardware uses the following rules to guide the interlock implementation.

1. Noncacheable load or store with side-effect bit on will always be blocked.

2. Cacheable or noncacheable bload will not be blocked.

3. VA<12:5> of a load (cacheable or noncacheable) will be compared with the VA<12:5> of all entries in the Store Queue. When a match is detected, this load (cacheable or noncacheable) will be blocked.

4. An insertion of MEMBAR is required if Strong Ordering is desired while not fitting rules 1 to 3.
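Rule 3 amounts to comparing an 8-bit slice of the virtual address against every store-queue entry. The sketch below models only that comparison; `va_12_5` and `load_is_blocked` are hypothetical names, the store queue is represented as a plain list of store virtual addresses, and the exemption for block loads in rule 2 is deliberately not modeled.

```python
def va_12_5(va):
    """Extract VA<12:5>, the 8-bit field used by the interlock comparison."""
    return (va >> 5) & 0xFF


def load_is_blocked(load_va, store_queue_vas):
    """Return True if the load's VA<12:5> matches any store-queue entry (rule 3 only)."""
    tag = va_12_5(load_va)
    return any(va_12_5(store_va) == tag for store_va in store_queue_vas)
```

Because only bits 12 through 5 take part, two addresses that differ solely above bit 12 (for example 0x2040 and 0x4040) still match, so in this model the load is blocked even though the addresses are distinct.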

TABLE A-1 and TABLE A-2 reflect the hardware interlocking mechanism implemented in UltraSPARC III. The tables are read from Row to Column, the first memory operation in program order being in Row followed by the memory operation found in Column. The two symbols used as table entries are:
■ #: No intervening operation required because Fireplane-compliant systems automatically order R before C
■ M: MEMBAR #Sync or MEMBAR #MemIssue or MEMBAR #StoreLoad required

When VA<12:5> of a column operation does not match VA<12:5> of a row operation and strong ordering is desired, the MEMBAR rules summarized in TABLE A-1 reflect UltraSPARC III’s hardware implementation.

TABLE A-1 MEMBAR rules for column VA<12:5> ≠ row VA<12:5> while desiring Strong Ordering

From row R / To column C | load | load from internal ASI | store | store to internal ASI | atomic | load_nc_e | store_nc_e | load_nc_ne | store_nc_ne | bload | bstore | bstore_commit | bload_nc | bstore_nc
load                     | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | M      | #             | M        | M
load from internal ASI   | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
store                    | M    | #                      | #     | #                     | #      | M         | #          | M          | #           | M     | M      | #             | M        | M
store to internal ASI    | #    | M                      | #     | #                     | #      | #         | #          | #          | #           | M     | #      | #             | M        | M
atomic                   | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | M      | #             | M        | M
load_nc_e                | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | M      | #             | M        | M
store_nc_e               | M    | #                      | #     | #                     | #      | #         | #          | M          | #           | M     | M      | #             | M        | M
load_nc_ne               | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | M      | #             | M        | M
store_nc_ne              | M    | #                      | #     | #                     | #      | M         | #          | M          | #           | M     | M      | #             | M        | M
bload                    | M    | #                      | M     | #                     | M      | M         | M          | M          | M           | M     | M      | #             | M        | M
bstore                   | M    | #                      | M     | #                     | M      | M         | M          | M          | M           | M     | M      | #             | M        | M
bstore_commit            | M    | #                      | M     | #                     | M      | M         | M          | M          | M           | M     | M      | #             | M        | M
bload_nc                 | M    | #                      | M     | #                     | M      | M         | M          | M          | M           | M     | M      | #             | M        | M
bstore_nc                | M    | #                      | M     | #                     | M      | M         | M          | M          | M           | M     | M      | #             | M        | M

When VA<12:5> of a column operation matches VA<12:5> of a row operation, the MEMBAR rules summarized in TABLE A-2 reflect UltraSPARC III’s hardware implementation.

TABLE A-2 MEMBAR rules for column VA<12:5> = row VA<12:5> while desiring Strong Ordering

From row R / To column C | load | load from internal ASI | store | store to internal ASI | atomic | load_nc_e | store_nc_e | load_nc_ne | store_nc_ne | bload | bstore | bstore_commit | bload_nc | bstore_nc
load                     | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
load from internal ASI   | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
store                    | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | #      | #             | #        | #
store to internal ASI    | #    | M                      | #     | #                     | #      | #         | #          | #          | #           | M     | #      | #             | M        | M
atomic                   | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
load_nc_e                | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
store_nc_e               | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | #      | #             | M        | #
load_nc_ne               | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
store_nc_ne              | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | #      | #             | M        | #
bload                    | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
bstore                   | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | M     | #      | #             | #        | #
bstore_commit            | M    | #                      | M     | #                     | M      | M         | M          | M          | M           | M     | M      | #             | M        | M
bload_nc                 | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | #        | #
bstore_nc                | #    | #                      | #     | #                     | #      | #         | #          | #          | #           | #     | #      | #             | M        | #

Special Rules for Quad LDD (ASI 24₁₆ and ASI 2C₁₆)

MEMBAR is only required before quad LDD if VA<12:5> of a preceding store to the same address space matches VA<12:5> of the quad LDD.

R-A-W Bypassing Algorithm

Load data can be bypassed from previous stores before they become globally visible (that is, the data for the load is supplied from the store queue). However, not all types of loads can receive bypassed data, and not all types of stores can supply it.

All types of load instructions can get data from the store queue, except the following load instructions:
■ Signed loads (ldsb, ldsh, ldsw)
■ Atomics
■ Load double to integer register file (ldd)
■ Quad loads to integer register file
■ Load from FSR register
■ Block loads
■ Short floating-point loads
■ Loads from internal ASIs

All types of store instructions can give data to a load, except the following store instructions:
■ Floating-point partial stores
■ Store double from integer register file (std)
■ Store part of atomic
■ Short FP stores
■ Stores to pages with side-effect bit set
■ Stores to noncacheable pages

The algorithm used in UltraSPARC III for R-A-W bypassing is as follows:

if ( (Load/store access the same physical address) and
     (Load/store endianness is the same) and
     (Load/store size is the same) and
     (Load data can get its data from store queue) and
     (Store data in store queue can give its data to a load) )
then
    Load will get its data from store queue
else
    Load will get its data from the memory system
endif
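The bypass decision can also be written as an executable predicate. The Python sketch below is an assumption-laden model, not the hardware logic: the `Access` record, the `kind` labels, and the function name `load_bypasses_from_store` are all hypothetical, and the two exclusion sets simply mirror the lists of load and store types that cannot take part in bypassing.

```python
from dataclasses import dataclass

# Hypothetical labels for the load types that cannot take data from the store queue.
NON_BYPASS_LOADS = {"signed", "atomic", "ldd", "quad", "fsr", "block",
                    "short_fp", "internal_asi"}
# Hypothetical labels for the store types that cannot give data to a load.
NON_BYPASS_STORES = {"fp_partial", "std", "atomic", "short_fp",
                     "side_effect", "noncacheable"}


@dataclass
class Access:
    pa: int               # physical address
    size: int             # access size in bytes
    little_endian: bool   # endianness of the access
    kind: str = "normal"  # one of the labels above, or "normal"


def load_bypasses_from_store(load, store):
    """Return True if the load takes its data from the store queue entry."""
    return (load.pa == store.pa
            and load.little_endian == store.little_endian
            and load.size == store.size
            and load.kind not in NON_BYPASS_LOADS
            and store.kind not in NON_BYPASS_STORES)
```

When the predicate is false, the load obtains its data from the memory system instead, which corresponds to the else branch of the algorithm.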

R-A-W Detection Algorithm

When data for a load cannot be bypassed from previous stores before they become globally visible (store data is not yet retired from the store queue), the load is recirculated after the R-A-W hazard is removed. The following conditions can cause this recirculation:
■ Load data can be bypassed from more than one store in the store queue.
■ The load’s VA<12:0> matches a store’s VA<12:0> and store data cannot be bypassed from the store queue.
■ The load’s VA<12:5> matches a store’s VA<12:5> (but load VA<12:3> and store VA<12:3> mismatch) and the load misses the D-cache.
■ Load is from side-effect page (page attribute E = 1) when the store queue contains one or more stores to side-effect pages.
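The four recirculation conditions can be expressed as one predicate over a load and a single store-queue entry. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, the caller is assumed to know already whether the store can bypass its data and how many stores are bypass candidates, and a real store queue would be checked entry by entry.

```python
def field(va, hi, lo):
    """Extract the bit field VA<hi:lo>."""
    return (va >> lo) & ((1 << (hi - lo + 1)) - 1)


def load_recirculates(load_va, store_va, *, bypassable, bypass_candidates=1,
                      dcache_hit=True, load_side_effect=False,
                      queued_side_effect_stores=0):
    """Return True if any of the four listed conditions applies (illustrative)."""
    # Data could be bypassed from more than one store in the store queue.
    if bypass_candidates > 1:
        return True
    # VA<12:0> match, but the store data cannot be bypassed.
    if field(load_va, 12, 0) == field(store_va, 12, 0) and not bypassable:
        return True
    # VA<12:5> match with VA<12:3> mismatch, and the load misses the D-cache.
    if (field(load_va, 12, 5) == field(store_va, 12, 5)
            and field(load_va, 12, 3) != field(store_va, 12, 3)
            and not dcache_hit):
        return True
    # Load from a side-effect page while side-effect stores are queued.
    return load_side_effect and queued_side_effect_stores > 0
```

In this model a load at 0x1000 behind a store at 0x1010 recirculates only when it misses the D-cache, because the two addresses share VA<12:5> but differ in VA<12:3>.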

A.42 Partial Store (VIS I)

Please refer to Section A.42 in Commonality for details.

Watchpoint exceptions on Partial Store instructions occur conservatively on UltraSPARC III. The DCUCR Data Watchpoint masks are only checked for nonzero value (watchpoint enabled). The byte store mask (r[rs2]) in the Partial Store instruction is ignored, and a watchpoint exception can occur even if the mask is zero (that is, no store will take place) (impl. dep. #249).

Note – If the byte ordering is little-endian, the byte enables generated by this instruction are swapped with respect to big-endian.

A.43 Partitioned Add/Subtract (VIS I)

Description The standard versions of these instructions perform four 16-bit or two 32-bit partitioned adds or subtracts between the corresponding fixed-point values contained in the source operands (rs1, rs2). For subtraction, rs2 is subtracted from rs1. The result is placed in the destination register (rd).

The single-precision versions of these instructions (FPADD16S, FPSUB16S, FPADD32S, FPSUB32S) perform two 16-bit or one 32-bit partitioned add(s) or subtract(s); only the low 32-bits of the destination register are affected.

Note – For good performance, the result of a single FPADD should not be used as part of a source operand of a 64-bit graphics instruction in the next instruction group. Similarly, the result of a standard FPADD should not be used as a 32-bit graphics instruction source operand in the next three instruction groups.


A.44 Partitioned Multiply (VIS I)

Note – For good performance, the result of a partitioned multiply should not be used as a source operand of a 32-bit graphics instruction in the next three instruction groups.

A.47 Pixel Formatting (VIS I)

Please refer to Section A.47 of Commonality for principal pixel formatting instructions.

Description The FPACK instructions convert to a lower-precision fixed or pixel format. Input values are clipped to the dynamic range of the output format. Packing applies a scale factor from GSR.scale to allow flexible positioning of the binary point.

Note – For good performance, the result of an FPACK (including FPACK32) should not be used as part of a 64-bit graphics instruction source operand in the next three instruction groups. The result of FEXPAND or FPMERGE should not be used as a 32-bit graphics instruction source operand in the next three instruction groups.

A.47.5 FPMERGE Instruction

Back-to-back FPMERGEs cannot be done on adjacent cycles.

A.49 Prefetch Data

Please refer to A.49, Prefetch Data, of Commonality for principal details.

A.49.1 SPARC V9 Prefetch Variants

PREFETCH(A) instructions are implemented as NOPs in UltraSPARC III.

The prefetch variant is selected by the fcn field of the instruction. In accordance with SPARC V9, fcn values 4–15 cause an illegal_instruction exception.

PREFETCH(A) instructions execute as shown below.

fcn               | Prefetch Function
0                 | NOP
1                 | NOP
2                 | NOP
3                 | NOP
17 (11₁₆)         | NOP
18 (12₁₆)         | NOP
19 (13₁₆)         | NOP
20 (14₁₆)         | Equivalent to fcn = 0 (NOP)
21 (15₁₆)         | Equivalent to fcn = 1 (NOP)
22 (16₁₆)         | Equivalent to fcn = 2 (NOP)
23 (17₁₆)         | Equivalent to fcn = 3 (NOP)
24–31 (18₁₆–1F₁₆) | NOP
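The fcn decoding above can be summarized as a small classifier. The sketch below is illustrative only: the function name `prefetch_behavior` and the returned strings are hypothetical, and fcn = 16 is reported as not listed because the table in this section does not give an entry for it.

```python
def prefetch_behavior(fcn):
    """Classify a 5-bit PREFETCH(A) fcn value per the table in this section."""
    if not 0 <= fcn <= 31:
        raise ValueError("fcn is a 5-bit field")
    if 4 <= fcn <= 15:
        # In accordance with SPARC V9, these values trap.
        return "illegal_instruction"
    if fcn == 16:
        # No entry for fcn = 16 appears in the table; left unspecified here.
        return "not listed"
    # fcn 0-3 and 17-31 all execute as NOPs on UltraSPARC III.
    return "NOP"
```

A simulator or disassembler could use such a helper to flag the trapping encodings while treating every listed variant as a no-op.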


A.55 Set Interval Arithmetic Mode (VIS II)

Description The SIAM instruction sets the GSR.IM and GSR.IRND fields as follows:

GSR.IM = mode<2>

GSR.IRND = mode<1:0>

Note – SIAM is a groupable, break-after instruction. It enables the interval rounding mode to be changed every cycle without flushing the pipeline. FPops in the same instruction group as a SIAM instruction use the previous rounding mode.
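The field assignment performed by SIAM is a simple bit split of the 3-bit mode operand. The following sketch shows that split; the function name `siam_fields` and the dictionary return shape are hypothetical conveniences, not part of the architecture.

```python
def siam_fields(mode):
    """Split the 3-bit SIAM mode operand into GSR.IM and GSR.IRND."""
    if not 0 <= mode <= 0b111:
        raise ValueError("mode is a 3-bit field")
    return {
        "IM": (mode >> 2) & 0b1,  # GSR.IM = mode<2>
        "IRND": mode & 0b11,      # GSR.IRND = mode<1:0>
    }
```

For instance, a mode of 0b101 sets GSR.IM to 1 and GSR.IRND to 1, while 0b011 leaves GSR.IM at 0 and sets GSR.IRND to 3.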

A.59 SHUTDOWN Instruction (VIS I)

Please refer to Section A.59 of Commonality for principal details.

In privileged mode, the SHUTDOWN instruction executes as a NOP in UltraSPARC III. An external system signal is used to enter and leave Energy Star mode. See the sun-4u/sun-5 Sun System Architecture Specification for more details.

A.61 Store Floating Point

Please refer to section A.61 of Commonality for details. In UltraSPARC III, an STDF instruction causes an STDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned (impl. dep. #110). Since UltraSPARC traps on all STQF(A) instructions (illegal_instruction), the STQF_mem_address_not_aligned trap does not exist (impl. dep. #112).


A.62 Store Floating Point into Alternate Space

Please refer to section A.62 of Commonality for details. In UltraSPARC III, an STDFA instruction causes an STDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned (impl. dep. #110). Since UltraSPARC traps on all STQF(A) instructions (illegal_instruction), the STQF_mem_address_not_aligned trap does not exist (impl. dep. #112).

S.APPENDIX B

IEEE Std 754-1985 Requirements for SPARC V9

The IEEE Std 754-1985 floating-point standard contains a number of implementation dependencies.

Please see Appendix B of Commonality for choices for these implementation dependencies, to ensure that SPARC V9 implementations are as consistent as possible.

Following is information specific to the UltraSPARC III implementation of SPARC V9 in these sections:
■ Overflow, Underflow, and Inexact Traps on page 97
■ Floating-Point Nonstandard Mode on page 98
■ Subnormal Operands on page 98
■ Subnormal Results on page 99
■ NaN Operands on page 100

B.3 Overflow, Underflow, and Inexact Traps

The UltraSPARC III processor implements precise floating-point exception handling. Underflow is detected before rounding. Prediction of overflow, underflow, and inexact traps for divide and square root is used to simplify the hardware (impl. dep. #3, 55).

For floating-point divide, pessimistic prediction occurs when underflow/overflow cannot be determined from examining the source operand exponents. For divide and square root, pessimistic prediction of inexact occurs unless one of the operands is a zero, NaN, or infinity. When pessimistic prediction occurs and the exception is enabled, an fp_exception_other (with FSR.ftt = 2, unfinished_FPop) trap is generated. System software will properly handle these cases and resume execution. If the exception is not enabled, the actual result status is used to update the aexc/cexc bits of the FSR.

Note – Significant performance degradation may be observed while the system is running with the inexact exception enabled.

B.6 Floating-Point Nonstandard Mode

UltraSPARC III implements a nonstandard mode, enabled when FSR.NS = 1 and GSR.IM = 0.

The UltraSPARC III processor handles some cases of subnormal operands or results directly in hardware and traps on the rest. In the trapping cases, an fp_exception_other (with FSR.ftt = 2, unfinished_FPop) trap is signalled and these operations are handled in system software. These cases are listed in TABLE B-1, TABLE B-2, and TABLE B-3.

Since trapping on subnormal operands and results can be quite costly, the UltraSPARC III processor supports the nonstandard floating-point result option of the SPARC V9 architecture. If FSR.NS = 1 and GSR.IM = 0, then subnormal operands or results encountered in trapping cases are flushed to 0 and no fp_exception_other trap is taken.

UltraSPARC III triggers fp_exception_other with trap type unfinished_FPop under the conditions described in sections B.6.1 through B.6.2. These conditions differ from the JPS1 “standard” set described in Section 5.1.7 of Commonality (impl. dep. #248).

B.6.1 Subnormal Operands

If FSR.NS = 1 and GSR.IM = 0 and subnormal operands are encountered, then the subnormal operand(s) of these operations are replaced by zero(es) with the same sign as the operand being replaced. During the floating-point operation with normalized operands:
■ If a division-by-zero or an invalid-operation condition occurs, then the corresponding condition(s) are signalled in FSR.aexc and FSR.cexc, and an fp_exception_ieee_754 trap occurs (if enabled by FSR.TEM).
■ If neither a division-by-zero nor an invalid condition occurs, then an inexact condition plus any other detected floating-point exception conditions are signalled in FSR.aexc and FSR.cexc and an fp_exception_ieee_754 trap occurs (if enabled by FSR.TEM).

If FSR.NS = 0 or GSR.IM = 1, then subnormal operands generate traps according to TABLE B-1. For multiply, Er is the biased sum of the exponents plus 1.

TABLE B-1 Subnormal Operand Trapping Cases (FSR.NS = 0 or GSR.IM = 1)

Operations | One subnormal operand | Two subnormal operands
f(sd)to(ds), fsqrt(sd) | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop). | N/A.
fadd/sub(sd), fsmuld, fdiv(sd) | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop). | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop).
fmuls | If result is not 0, fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop). | If result is not 0, fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop).
fmuld | If result is not 0, fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop). | If result is not 0, fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop).

B.6.2 Subnormal Results

If FSR.NS = 1 and GSR.IM = 0 and a floating-point operation produces a subnormal result:
■ If the operation was FADD, FSUB, or FdTOs, then an fp_exception_other trap (with FSR.ftt = unfinished_FPop) is generated.
■ If the operation was other than FADD, FSUB, or FdTOs, then
    ■ the result is replaced by a zero value with the same sign,
    ■ inexact and underflow conditions are signalled in FSR.aexc and FSR.cexc,
    ■ an fp_exception_ieee_754 trap occurs (if enabled by FSR.TEM).
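The nonstandard-mode handling of subnormal results reduces to a short decision procedure. The sketch below is a hedged model: the function name `subnormal_result_action` and its string results are hypothetical, the operation is identified by mnemonic only, and the standard-mode case is deferred to TABLE B-2 rather than reproduced.

```python
# Operations that still trap on a subnormal result in nonstandard mode.
TRAPPING_OPS = {"FADD", "FSUB", "FdTOs"}


def subnormal_result_action(op, ns, im):
    """Describe what happens when `op` produces a subnormal result.

    ns is FSR.NS and im is GSR.IM, each given as 0 or 1.
    """
    if ns == 1 and im == 0:
        if op in TRAPPING_OPS:
            return "fp_exception_other (unfinished_FPop)"
        # Result is replaced by a same-signed zero; an fp_exception_ieee_754
        # trap follows only if enabled by FSR.TEM.
        return "flush to signed zero; signal inexact and underflow"
    # FSR.NS = 0 or GSR.IM = 1: trapping is decided per TABLE B-2.
    return "see TABLE B-2"
```

Note that setting GSR.IM to 1 takes the processor out of the nonstandard path even when FSR.NS is 1, which is why both inputs are required.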

If FSR.NS = 0 or GSR.IM = 1, then subnormal results generate traps according to TABLE B-2. In the table, Er is the biased sum of the exponents plus 1; Erd is the biased exponent of the result before rounding.

TABLE B-2 Subnormal Result Trapping Cases (FSR.NS = 0 or GSR.IM = 1)

Operations | Condition
f(sd)to(ds), fsqrt(sd) | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop).
fadd/sub(sd) | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop).
fdivs | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop) if: 1) Er ≥ −24 and result is positive and round mode is +∞; 2) Er ≥ −24 and result is negative and round mode is −∞
fdivd | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop) if: 1) Er ≥ −53 and result is positive and round mode is +∞; 2) Er ≥ −53 and result is negative and round mode is −∞
fsmuld | Never signal fp_exception_other.
fmuls | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop) if: 1) Er > −24 and Erd < 1; 2) Er ≥ −24 and result is positive and round mode is +∞; 3) Er ≥ −24 and result is negative and round mode is −∞
fmuld | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop) if: 1) Er > −53 and Erd < 1; 2) Er ≥ −53 and result is positive and round mode is +∞; 3) Er ≥ −53 and result is negative and round mode is −∞

B.6.3 NaN Operands

The UltraSPARC III processor generates an fp_exception_other (with FSR.ftt = 2, unfinished_FPop) exception for some cases of floating-point operations with NaN operands, as shown in TABLE B-3. The state of the FSR.NS bit does not affect the generation of an unfinished_FPop on NaN operands.

TABLE B-3 NaN Operand Trapping Cases

Operations | 1 NaN operand | 2 NaN operands
f(sd)to(ix), f(sd)to(ds) | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop). | N/A.
fadd/sub(sd) | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop). | fp_exception_other trap (with FSR.ftt = 2, unfinished_FPop).
fmul(sd), fsmuld, fdiv(sd), fsqrt(sd) | Never signal fp_exception_other. | Never signal fp_exception_other.

S.APPENDIX C

Implementation Dependencies

This appendix summarizes implementation dependencies in the SPARC V9 standard. In SPARC V9 the notation “IMPL. DEP. #nn:” identifies the definition of an implementation dependency; the notation “(impl. dep. #nn)” identifies a reference to an implementation dependency. These dependencies are described by their number nn in TABLE C-1. For UltraSPARC III, these numbers have been removed from the body of this document to make the document more readable.

In the following section, TABLE C-1 is modified to include a description of the manner in which UltraSPARC III has resolved each implementation dependency.
■ List of Implementation Dependencies on page 102

For more information, please refer to Appendix C in Commonality.

Additionally, this appendix describes these aspects of the UltraSPARC III implementation of SPARC V9:
■ SPARC V9 General Information on page 112
■ SPARC V9 Integer Operations on page 115
■ SPARC V9 Floating-Point Operations on page 117
■ SPARC V9 Memory-Related Operations on page 121
■ Non-SPARC V9 Extensions on page 123

C.1 List of Implementation Dependencies

TABLE C-1 provides a complete list of how each implementation dependency is treated in the UltraSPARC III implementation.

TABLE C-1 UltraSPARC III Implementation Dependencies (1 of 11)

Nbr UltraSPARC III Implementation Notes Page 1 Software emulation of instructions 113 The UltraSPARC III processor meets Level 2 SPARC V9 compliance as specified in The SPARC Architecture Manual-Version 9. 2 Number of IU registers 116 The UltraSPARC III processor implements an 8-window, 64-bit integer register file, and NWINDOWS =8. 3 Incorrect IEEE Std 754-1985 results 97, 118 The UltraSPARC III processor implements precise floating-point exception handling. All quad-precision floating-point instructions cause an fp_exception_other (with FSR.ftt =3,unimplemented_FPop) trap. These operations are emulated in system software. 4-5 Reserved. 6 I/O registers privileged status 121 7 I/O register definitions 121 8 RDASR/WRASR target registers 121, 265 Software can use read/write ancillary state register instructions to read/ write implementation-dependent processor registers (ASRs 18–31). 9 RDASR/WRASR privileged status 121, 265 Whether each of the implementation-dependent read/write ancillary state register instructions (for ASRs 18–31) is privileged is implementation dependent. 10-12 Reserved. 13 VER.impl 29, 116 The impl field of the Version Register is a 16-bit implementation code, 001416, that uniquely identifies the UltraSPARC III processor class CPU. 14-15 Reserved. 16 IU deferred-trap queue 114 UltraSPARC III does not make use of deferred traps or implement an IU deferred-trap queue. 17 Reserved.

102 SPARC V9 JPS1 Implementation Supplement: Sun UltraSPARC-III • Working Draft 1.0.5, 10 Sep 2002 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material TABLE C-1 UltraSPARC III Implementation Dependencies (2 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

18     Nonstandard IEEE 754-1985 results    27, 98, 120
       When bit 22 (NS) of the FSR is set, the UltraSPARC III processor can deliver a non-IEEE-754-compatible result. In particular, subnormal operands and results can be flushed to 0.

19     FPU version, FSR.ver    120
       The ver field of the FSR identifies a particular implementation of the UltraSPARC III Floating Point and Graphics Unit architecture.

20-21  Reserved.

22     FPU TEM, cexc, and aexc    120
       TEM is a 5-bit trap enable mask for the IEEE-754 floating-point exceptions. If a floating-point operate instruction produces one or more exceptions, then the corresponding cexc/aexc bits are set and an fp_exception_ieee_754 (with FSR.ftt = 1, IEEE_754_exception) exception is generated.

23     Floating-point traps    120
       The UltraSPARC III processor implements precise floating-point exceptions.

24     FPU deferred-trap queue (FQ)    120
       The UltraSPARC III processor implements precise floating-point exceptions and does not require a floating-point deferred-trap queue.

25     RDPR of FQ with nonexistent FQ
       UltraSPARC III does not implement (or need) an FQ, so an attempt to read the FQ causes an illegal_instruction exception.

26-28  Reserved.

29     Address space identifier (ASI) definitions    154
       See TABLE L-1 on page 154 for the full list of ASIs implemented in UltraSPARC III.

30     ASI address decoding    154
       UltraSPARC III decodes all eight bits of ASIs. See TABLE L-1 on page 154.

31     Catastrophic error exceptions    —
       UltraSPARC III does not implement the internal_processor_error exception.

32     Deferred traps    114
       The UltraSPARC III processor supports precise trap handling for all operations except for deferred or disrupting traps from hardware failures encountered during memory accesses.

33     Trap precision    114
       The UltraSPARC III processor supports precise trap handling for all operations except for deferred or disrupting traps from hardware failures encountered during memory accesses.

34     Interrupt clearing

35     Implementation-dependent traps    114
       UltraSPARC III does not implement any implementation-specific traps.

TABLE C-1 UltraSPARC III Implementation Dependencies (3 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

36     Trap priorities    62, 114
       The priorities of particular traps are relative and are implementation dependent because a future version of the architecture may define new traps, and implementations may define implementation-dependent traps that establish new relative priorities. UltraSPARC III implements the priority 2 fast_ECC_error exception.

37     Reset trap    114
       On a reset trap, TPC may be inaccurate. The value saved in TPC may differ from the correct trap PC by 63 bytes; this value is not necessarily equal to the PC of any instruction that has been or is scheduled to be executed. The value saved in TnPC will always be 4 more than the value saved in TPC.

38     Effect of reset trap on implementation-dependent registers    180
       The effect of reset traps on all UltraSPARC III registers is enumerated in TABLE O-1 on page 180.

39     Entering error_state on implementation-dependent errors    113, 177
       Upon error_state entry, the processor automatically recovers through [?] watchdog reset (WDR) into RED_state.

40     Error_state processor state    113
       CWP updates for window traps that enter error_state are the same as when error_state is not entered.

41     Reserved.

42     FLUSH instruction    121
       Whether FLUSH traps is implementation dependent. On the UltraSPARC III processor, the FLUSH effective address is ignored. FLUSH does not access the data MMU and cannot generate a data MMU miss or exception.

43     Reserved.

44     Data access FPU trap    62, 84
       If a load floating-point instruction causes a trap due to any type of data access error, the contents of the destination register are undefined.

45-46  Reserved.

47     RDASR    48
       If an RDASR instruction is in the R-stage and any valid instruction is in the integer pipelines from the E-stage to the X-stage, then the RDASR instruction cannot be dispatched.

48     WRASR    48
       If a WRASR instruction is in the pipeline anywhere from the E-stage to the T-stage, no instructions are dispatched from the R-stage.

49-54  Reserved.

TABLE C-1 UltraSPARC III Implementation Dependencies (4 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

55     Floating-point underflow detection    97, 117
       See FSR_underflow in Section 5.1.7 of Commonality for details. Note that the UltraSPARC III processor implements precise floating-point exception handling. Prediction of overflow, underflow, and inexact traps for divide and square root is used to simplify the hardware.

56-100 Reserved.

101    Maximum trap level    116
       The UltraSPARC III processor supports five trap levels; that is, MAXTL = 5.

102    Clean windows trap    116
       UltraSPARC III generates a clean_window trap when a SAVE instruction requests a window and there are no more clean windows. System software must then initialize all registers in the next available window(s) to 0 before returning to the requesting context.

103    Prefetch instructions    93, 121
       PREFETCH(A) instructions with fcn = 0–4 are implemented. In accordance with SPARC V9, PREFETCH(A) instructions with fcn = 5–15 cause an illegal_instruction trap.

104    VER.manuf    29, 116
       The manuf field of the Version Register is a 16-bit manufacturer code, 003E₁₆ (Sun’s JEDEC number), that identifies the manufacturer of the processor.

105    TICK register    115
       The UltraSPARC III processor implements a 63-bit TICK counter.

106    IMPDEPn instructions    124
       The UltraSPARC III processor extends the standard SPARC V9 instruction set. Unimplemented IMPDEP1 and IMPDEP2 opcodes encountered during execution cause an illegal_instruction trap.

107    Unimplemented_LDD trap    122
       LDD instructions are directly executed in hardware, so the unimplemented_LDD exception does not exist in UltraSPARC III.

108    Unimplemented_STD trap    122
       STD instructions are directly executed in hardware, so the unimplemented_STD exception does not exist in UltraSPARC III.

109    LDDF_mem_address_not_aligned    84, 122
       LDDF(A) causes an LDDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned.

110    STDF_mem_address_not_aligned    94, 95, 122
       STDF(A) causes an STDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned.

111    LDQF_mem_address_not_aligned    84
       This trap is not used in SPARC JPS1. See Section 7.2.5 in Commonality.

112    STQF_mem_address_not_aligned    94, 95
       This trap is not used in SPARC JPS1. See Section 7.2.5 in Commonality.

TABLE C-1 UltraSPARC III Implementation Dependencies (5 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

113    Implemented memory models    122
       The UltraSPARC III processor supports only the TSO memory model.

114    RED_state trap vector address (RSTVaddr)    113, 180
       The RED_state trap vector is located at an implementation-dependent address referred to as RSTVaddr. In UltraSPARC III, the RED_state trap vector address (RSTVaddr) is 256 Mbytes below the top of the virtual address space. Virtual address FFFF FFFF F000 0000₁₆ is passed through to physical address 7FF F000 0000₁₆ in RED_state.

115    RED_state processor state    113, 177
       (What occurs after the processor enters RED_state is implementation dependent.) In UltraSPARC III, a reset or trap that sets PSTATE.RED (including a trap in RED_state) will clear the DCU Control Register, including enable bits for I-cache, D-cache, IMMU, DMMU, and virtual and physical watchpoints.

116    SIR_enable control flag    114
       As for all SPARC V9 JPS1 processors, in the UltraSPARC III processor, SIR is permanently enabled. A software-initiated reset (SIR) is initiated by execution of a SIR instruction while in privileged mode. In nonprivileged mode, a SIR instruction behaves as a NOP.

117    MMU disabled prefetch behavior    121
       When the data MMU is disabled, accesses default to the settings in the Data Cache Unit Control Register CP and CV bits. Note: E is the inverse of CP.

118    Identifying I/O locations    73, 122
       (The manner in which I/O locations are identified is implementation dependent.)

119    Unimplemented values for PSTATE.MM    —
       Because UltraSPARC III implements only the TSO memory model, PSTATE.MM always reads as 00₂ and writes to it are ignored.

120    Coherence and atomicity of memory operations    66
       Cacheable accesses observe supported cache coherency protocols; the unit of coherence is 64 bytes. Noncacheable and side-effect accesses do not observe supported cache coherency protocols; the smallest unit in each transaction is a single byte.

121    Implementation-dependent memory model    122
       The UltraSPARC III processor implements only the TSO memory model.

122    FLUSH latency    121
       When a FLUSH operation is performed, the UltraSPARC III processor guarantees that earlier code modifications will be visible across the whole system.

123    Input/output (I/O) semantics    72, 122
       The UltraSPARC III MMU includes an attribute bit E in each page translation, which when set signifies that this page has side effects.

TABLE C-1 UltraSPARC III Implementation Dependencies (6 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

124    Implicit ASI when TL > 0    —
       (See JPS1 Commonality Section 8.3.)

125    Address masking    29
       When PSTATE.AM = 1, UltraSPARC III writes the full 64-bit program counter value to the destination register of a CALL, JMPL, or RDPC instruction. When PSTATE.AM = 1 and a trap occurs, UltraSPARC III writes the full 64-bit program counter value to TPC[TL] (impl. dep. #125).

126    Register Windows State Registers width    —
       In UltraSPARC III, each register windows state register is implemented in 3 bits. (See Commonality impl. dep. #126.)

202    fast_ECC_error trap    61
       UltraSPARC III implements the fast_ECC_error trap. It indicates that an ECC error was detected in an external cache, and its trap type is 070₁₆.

203    Dispatch Control Register bits 13:6 and 1    31, 33, 182
       In UltraSPARC III, bits 11:6 of the Dispatch Control Register (DCR) can be programmed to select the set of signals to be observed at obsdata<9:0>. DCR bit 1 is the Interrupt Floating-Point Operation Enable (IFPOE) bit.

204    DCR bits 5:3 and 0    31, 33, 182
       In UltraSPARC III, DCR.SI operates as described in Section 5.2 of Commonality and has no additional side effects.

205    Instruction Trap Register    36
       UltraSPARC III implements the Instruction Trap Register.

206    SHUTDOWN instruction    94
       In privileged mode, the SHUTDOWN instruction executes as a NOP in UltraSPARC III.

207    PCR register bits 47:32, 26:17, and bit 3
       Bits 47:32, 26:17, and bit 3 of PCR are unused in UltraSPARC III. They read as zero and should be written only to zero or to a value previously read from those bits.

208    Ordering of errors captured in instruction execution
       (The order in which errors are captured in instruction execution is implementation dependent. Ordering may be in program order or in order of detection.)

209    Software intervention after instruction-induced error
       (Precision of the trap to signal an instruction-induced error of which recovery requires software intervention is implementation dependent.)

210    ERROR output signal
       (The causes and the semantics of the ERROR output signal are implementation dependent.)

211    Error logging registers’ information
       (The information that the error logging registers preserve beyond the reset induced by an ERROR signal is implementation dependent.)

TABLE C-1 UltraSPARC III Implementation Dependencies (7 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

212    Trap with fatal error
       (Generation of a trap along with ERROR signal assertion upon detection of a fatal error is implementation dependent.)

213    AFSR.PRIV    205, 216
       PRIV accumulates the state of the PSTATE.PRIV bit at the time the event is detected, rather than the PSTATE.PRIV value associated with the instruction that caused the access which returns the error.

214    Enable/disable control for deferred traps
       (Whether an implementation provides an enable/disable control feature for deferred traps is implementation dependent.)

215    Error barrier    188
       On UltraSPARC III, DONE and RETRY instructions implicitly provide an error barrier function as MEMBAR #Sync.

216    data_access_error trap precision    215
       (The precision of a data_access_error trap is implementation dependent.) On UltraSPARC III, data_access_error causes a deferred trap.

217    instruction_access_error trap precision    215
       (The precision of an instruction_access_error trap is implementation dependent.) On UltraSPARC III, instruction_access_error causes a deferred trap.

218    async_data_error
       (Whether the async_data_error trap is implemented is implementation dependent. If it does exist, it indicates that an error is detected in a processor core, and its trap type is 40₁₆.)

219    Asynchronous Fault Address Register (AFAR) allocation
       (Allocation of the Asynchronous Fault Address Register (AFAR) is implementation dependent.) There may be one instance or multiple instances of AFAR. Although the ASI for AFAR is defined as 4D₁₆, the virtual address of AFAR if there are multiple AFARs is implementation dependent.

220    Addition of logging and control registers for error handling
       (Whether the implementation supports additional logging/control registers for error handling is implementation dependent.)

221    Special/signalling ECCs
       (The method to generate “special” or “signalling” ECCs and whether processor-ID is embedded into the data associated with special/signalling ECCs is implementation dependent.)

222    TLB organization    129
       TLB organization of UltraSPARC III is described in Appendix F.

223    TLB multiple-hit detection
       (Whether TLB multiple-hit detection is supported in JPS1 is implementation dependent.)

TABLE C-1 UltraSPARC III Implementation Dependencies (8 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

224    MMU physical address width    129, 130
       The physical address width supported by UltraSPARC III is 43 bits. Because of this, TTE bits 46:43 always read as zero, and writes to them are ignored.

225    TLB locking of entries
       (The mechanism by which entries in TLB are locked is implementation dependent in JPS1.)

226    TTE support for CV bit    131
       Whether the CV bit is supported in TTE is implementation dependent in JPS1. The CV bit is fully implemented in UltraSPARC III. (See also impl. dep. #232.)

227    TSB number of entries    136
       The maximum number of entries in a TSB is implementation dependent in JPS1. In UltraSPARC III, the maximum number of TSB entries is 512 × 2⁷, or 64K entries. (See also impl. dep. #236.)

228    TSB_Hash supplied from TSB or context-ID register    136
       Whether TSB_Hash is supplied from a TSB register or from a context-ID register is implementation dependent in JPS1. In UltraSPARC III, TSB_Hash is supplied from a TSB extension register.

229    TSB_Base address generation
       Whether the implementation generates the TSB_Base address by exclusive-ORing the TSB_Base register and a TSB register or by taking the TSB_Base field directly from the TSB register is implementation dependent in JPS1. This implementation dependency is only to maintain compatibility with the TLB miss handling software of UltraSPARC I/II.

230    data_access_exception trap
       (The causes of a data_access_exception trap are implementation dependent in JPS1, but there are several mandatory causes of the data_access_exception trap.)

231    MMU physical address variability
       (The variability of the width of the physical address is implementation dependent in JPS1, and if variable, the initial width of the physical address after reset is also implementation dependent in JPS1.)

232    DCU Control Register CP and CV bits    34
       (Whether CP and CV bits exist in the DCU Control Register is implementation dependent in JPS1.) UltraSPARC III fully implements the CP and CV “cacheability” bits. (See also impl. dep. #226.)

233    TSB_Hash field
       (Whether the TSB_Hash field is implemented in the I/D Primary/Secondary/Nucleus TSB Extension Register is implementation dependent in JPS1.)

234    TLB replacement algorithm
       The replacement algorithm for a TLB entry is implementation dependent in JPS1.

TABLE C-1 UltraSPARC III Implementation Dependencies (9 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

235    TLB data access address assignment
       (The MMU TLB data access address assignment and the purpose of the address are implementation dependent in JPS1.)

236    TSB_Size field width    136
       The width of the TSB_Size field in the TSB Base Register is implementation dependent; the permitted range is from 2 to 6 bits. The least significant bit of TSB_Size is always at bit 0 of the TSB register. Any bits unimplemented at the most significant end of TSB_Size read as 0, and writes to them are ignored. In UltraSPARC III, TSB_Size is 3 bits wide, occupying bits 2:0 of the TSB register. The maximum number of TSB entries is, therefore, 512 × 2⁷ (64K entries). (See also impl. dep. #227 and impl. dep. #228.)

237    DSFAR/DSFSR for JMPL/RETURN mem_address_not_aligned    133
       On a mem_address_not_aligned trap that occurs during a JMPL or RETURN instruction, UltraSPARC III updates the D-SFAR and D-SFSR registers with the fault address and status, respectively.

238    TLB page offset for large page sizes    130
       When page offset bits for larger page sizes (PA<15:13>, PA<18:13>, and PA<21:13> for 64-Kbyte, 512-Kbyte, and 4-Mbyte pages, respectively) are stored in the TLB, it is implementation dependent whether the data returned from those fields by a Data Access read are zero or the data previously written to them. On UltraSPARC III, for larger page sizes, these fields read back as the data previously written to them.

239    Register access by ASIs 55₁₆ and 5D₁₆    134
       On UltraSPARC III, loads and stores to ASIs 55₁₆ and 5D₁₆ access the IMMU and DMMU (respectively) TLB CAM Diagnostic Registers.

240    DCU Control Register bits 47:41    34, 35
       UltraSPARC III implements all seven bits, for store queue and prefetch control.

241    Address Masking and DSFAR    29
       When PSTATE.AM = 1 and an exception occurs, UltraSPARC III writes the full 64-bit address to the Data Synchronous Fault Address Register (DSFAR).

242    TLB lock bit    131
       An implementation containing multiple TLBs may implement the L (lock) bit in all TLBs but is only required to implement a lock bit in one TLB for each page size. If the lock bit is not implemented in a particular TLB, it reads as 0 and writes to it are ignored. In UltraSPARC III, the TLB lock bit is only implemented in the DMMU 16-entry, fully associative TLB, and the IMMU 16-entry, fully associative TLB. In the TLBs dedicated to 8 KB page translations (DMMU 512-entry, 2-way associative TLB and IMMU 128-entry, 2-way associative TLB), each TLB entry’s lock bit reads as 0 and writes to it are ignored.

TABLE C-1 UltraSPARC III Implementation Dependencies (10 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

243    Interrupt Vector Dispatch Status Register BUSY/NACK pairs    175
       The number of BUSY/NACK bit pairs implemented in the Interrupt Vector Dispatch Status Register is implementation dependent. In UltraSPARC III, 32 BUSY/NACK pairs are implemented in the Interrupt Vector Dispatch Status Register.

244    Data Watchpoint Reliability    35
       On UltraSPARC III, watchpoint comparison is only done on the MS (memory) pipeline of the processor; any second-issued Ax pipe FP loads will not trigger a watchpoint. For reliable use of the watchpoint mechanism, the second FP load feature (DCUCR.SL) must be disabled.

245    Call/Branch Displacement Encoding in I-Cache    36
       In UltraSPARC III, the least significant 11 bits (bits 10:0) of a CALL or branch (BPcc, FBPcc, Bicc, BPr) instruction in the instruction cache contain the sum of the least significant 11 bits of the architectural instruction encoding (as it appears in main memory) and the least significant 11 bits of the virtual address of the CALL/branch instruction.

246    VA<38:29> for Interrupt Vector Dispatch Register Access    175
       UltraSPARC III interprets all 10 bits of VA<38:29> when the Interrupt Vector Dispatch Register is written.

247    Interrupt Vector Receive Register SID Fields    175
       UltraSPARC III sets all 10 physical module ID (MID) bits in the SID_U and SID_L fields of the Interrupt Vector Receive Register. UltraSPARC III obtains SID_U from VA<38:34> of the interrupt source and SID_L from VA<33:29> of the interrupt source.

248    Conditions for fp_exception_other with unfinished_FPop    28, 98
       UltraSPARC III triggers fp_exception_other with trap type unfinished_FPop under the conditions described in Section B.6. These conditions differ from the JPS1 “standard” set, described in Commonality Section 5.1.7.

249    Data Watchpoint for Partial Store Instruction    91
       Watchpoint exceptions on Partial Store instructions occur conservatively on UltraSPARC III. The DCUCR Data Watchpoint masks are only checked for a nonzero value (watchpoint enabled). The byte store mask (r[rs2]) in the Partial Store instruction is ignored, and a watchpoint exception can occur even if the mask is zero (that is, no store will take place).

250    PCR accessibility when PSTATE.PRIV = 0    30, 265
       In UltraSPARC III, access to PCR is strictly privileged; when PSTATE.PRIV = 0, an attempt to execute either an RDPCR or a WRPCR instruction causes a privileged_opcode exception.

251    Reserved.    —

252    DCUCR.DC (Data Cache Enable)    35
       UltraSPARC III implements DCUCR.DC. When DCUCR.DC = 0 (D-cache is disabled), the data cache is not updated. When the D-cache is reenabled, software must flush any inconsistent lines from the D-cache.

TABLE C-1 UltraSPARC III Implementation Dependencies (11 of 11)

Nbr    UltraSPARC III Implementation Notes    Page

253    DCUCR.IC (Instruction Cache Enable)    35
       UltraSPARC III implements DCUCR.IC. When DCUCR.IC = 0 (I-cache is disabled), the instruction cache is not updated. When the I-cache is reenabled, software must invalidate any inconsistent lines from the instruction cache.

254    Means of exiting error_state    61
       Upon entry into error_state, UltraSPARC III generates an immediate watchdog_reset (WDR).

255    LDDFA with ASI E0₁₆ or E1₁₆ and misaligned destination register number    157
       For LDDFA with ASI E0₁₆ or E1₁₆, if a misaligned (not a multiple of 8) destination register number is specified, UltraSPARC III generates an illegal_instruction exception.

256    LDDFA with ASI E0₁₆ or E1₁₆ and misaligned memory address    157
       For LDDFA with ASI E0₁₆ or E1₁₆, if a misaligned (not 64-byte aligned) memory address is specified, UltraSPARC III generates a mem_address_not_aligned exception.

257    LDDFA with ASI C0₁₆–C5₁₆ or C8₁₆–CD₁₆ and misaligned memory address    157
       For LDDFA with ASI C0₁₆–C5₁₆ or C8₁₆–CD₁₆, if a misaligned (not 8-byte aligned) memory address is specified, UltraSPARC III generates a data_access_exception exception, with fault type 08₁₆ recorded in DSFSR.FTYPE.
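
The TSB sizing stated in impl. dep. #227 and #236 of TABLE C-1 can be checked with a little arithmetic. The following is a minimal sketch, assuming that the number of TSB entries is 512 × 2^TSB_Size and that TSB_Size occupies bits 2:0 of the TSB register; the function and constant names are hypothetical and are not part of any Sun software.

```python
# Sketch only: names are hypothetical; the 512 * 2**TSB_Size rule is an
# assumption consistent with the "512 x 2^7 (64K entries)" maximum in TABLE C-1.
TSB_SIZE_BITS = 3                      # TSB_Size is 3 bits wide (bits 2:0)
TSB_SIZE_MASK = (1 << TSB_SIZE_BITS) - 1


def tsb_entries(tsb_register: int) -> int:
    """Return the number of TSB entries selected by the TSB_Size field."""
    tsb_size = tsb_register & TSB_SIZE_MASK   # extract bits 2:0
    return 512 * (1 << tsb_size)


def max_tsb_entries() -> int:
    """Largest entry count reachable with a 3-bit TSB_Size field."""
    return tsb_entries(TSB_SIZE_MASK)
```

With TSB_Size = 0 the sketch yields 512 entries, and with all three bits set it yields 65,536 entries, matching the 64K figure in the table.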

C.2 SPARC V9 General Information

The general information provided in this section covers the following subjects:
■ Level 2 compliance
■ Unimplemented opcodes, ASIs, and the ILLTRAP instruction
■ Trap levels
■ Trap handling
■ TPC/TnPC and reset
■ SIR support
■ TICK register
■ Population count (POPC) instruction
■ Secure software

C.2.1 Level 2 Compliance (Impl. Dep. #1)

The UltraSPARC III processor meets Level 2 SPARC V9 compliance as specified in The SPARC Architecture Manual-Version 9. The processor correctly interprets all nonprivileged operations and all privileged elements of the architecture.

Note – System emulation routines shipped with the UltraSPARC III processor, such as those for quad-precision floating-point operations, must also be Level 2 compliant.

C.2.2 Unimplemented Opcodes, ASIs, and the ILLTRAP Instruction

SPARC V9 unimplemented opcodes, reserved opcodes, ILLTRAP opcodes, and instructions with invalid reserved fields (other than reserved FPops or fields in graphics instructions that reference floating-point registers) cause an illegal_instruction trap when encountered during execution.

The reserved field in the Tcc instruction is checked in the UltraSPARC III processor, unlike the case in UltraSPARC I and II, where the contents of the field were ignored. Reserved FPops and reserved fields in graphics instructions that reference floating-point registers cause an fp_exception_other (with FSR.ftt = unimplemented_FPop) trap. Unimplemented and reserved ASI values cause a data_access_exception trap.

C.2.3 Trap Levels (Impl. Dep. #37, 38, 39, 40, 114, 115)

The UltraSPARC III processor supports five trap levels; that is, MAXTL=5.

Note – A WRPR instruction that writes a value greater than MAXTL (5) to the trap level (TL) register results in 5 (MAXTL) being written to TL.

Normal execution is at TL = 0. Traps at MAXTL – 1 bring the processor into RED_state. If a trap is generated while the processor is operating at TL = MAXTL, then the processor enters error_state and generates a watchdog reset (WDR). CWP updates for window traps that enter error_state are the same as when error_state is not entered.
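
The trap-level rules above can be summarized in a small model. This is a sketch under stated assumptions, not a description of the hardware: the function names are hypothetical, a trap is assumed to raise TL by one as in SPARC V9, and the watchdog reset that follows error_state is not modeled.

```python
MAXTL = 5  # UltraSPARC III supports five trap levels


def write_tl(value: int) -> int:
    """Model a WRPR to TL: values above MAXTL are written as MAXTL."""
    return min(value, MAXTL)


def take_trap(tl: int) -> tuple:
    """Return (resulting state, new TL) for a trap taken at trap level tl."""
    if tl == MAXTL:
        # A trap at TL = MAXTL enters error_state; the processor then
        # generates a watchdog reset (WDR), which this sketch does not model.
        return ("error_state", tl)
    if tl == MAXTL - 1:
        # A trap at MAXTL - 1 brings the processor into RED_state.
        return ("RED_state", tl + 1)
    return ("normal", tl + 1)
```

For example, a trap taken during normal execution (TL = 0) yields ("normal", 1), while a trap taken at TL = 4 yields ("RED_state", 5).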

Note – The RED_state trap vector address (RSTVaddr) is 256 Mbytes below the top of the virtual address space. Virtual address FFFF FFFF F000 0000₁₆ is passed through to physical address 7FF F000 0000₁₆ in RED_state.
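
The two addresses in the note are consistent with a pass-through that keeps only the low-order physical address bits. The sketch below assumes exactly that (truncation to the 43-bit physical address width given in impl. dep. #224); the names are hypothetical, and the real pass-through logic may differ in details this document does not state.

```python
PA_WIDTH = 43                         # physical address width (impl. dep. #224)
PA_MASK = (1 << PA_WIDTH) - 1

# RSTVaddr: 256 Mbytes (2**28 bytes) below the top of the 64-bit VA space.
RSTV_VADDR = (1 << 64) - (1 << 28)


def red_state_passthrough(va: int) -> int:
    """Assumed RED_state pass-through: keep the low 43 bits of the VA."""
    return va & PA_MASK
```

Under this assumption, `red_state_passthrough(RSTV_VADDR)` gives 7FF F000 0000₁₆, the physical address quoted above.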

For a complete description of traps and RED_state handling, please refer to Chapter 7, Traps, in Commonality. See also Appendix O, Reset and RED_state.

C.2.4 Trap Handling (Impl. Dep. #16, 32, 33, 35, 36, 216, 217)

The UltraSPARC III processor supports precise trap handling for all operations except:
■ deferred traps from hardware failures encountered during memory accesses, described in Deferred Traps on page 188
■ disrupting traps from hardware failures encountered during memory accesses, described in Disrupting Traps on page 190
■ certain cases of the fast_ECC_error exception, described in Table 7-1, Exception Specific to UltraSPARC III, on page 62

All traps supported in the UltraSPARC III processor are listed in Section 7.6, Exception and Interrupt Descriptions, in Commonality, plus Table 7-1, Exception Specific to UltraSPARC III, on page 62 of this document.

C.2.5 TPC/TnPC and Resets

The UltraSPARC III processor does not save an accurate TPC and TnPC on externally initiated reset (XIR) or power-on reset (POR) traps. The TPC saved will be close to that of either the next instruction to be executed or the last instruction executed, as described below.

On a reset trap, TPC will have zeroes in bit positions <5:0>. Bit positions <63:6> will be equal to the corresponding bits of the PC of either the next instruction to be executed or the last instruction executed. The latter value will be seen only in the event that the next instruction to be executed has PC<5:0> = 000000₂ and is not the target of a CTI. TnPC will be equal to TPC + 4.
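
As an illustration of the rule above, the following sketch computes the values saved on a reset trap from whichever PC the hardware happens to capture (next or last instruction). The function name is hypothetical, and the choice between the two candidate PCs is left to the caller because the text defines it only for one special case.

```python
MASK64 = (1 << 64) - 1


def reset_trap_pcs(captured_pc: int) -> tuple:
    """Return (TPC, TnPC) saved on a reset trap for the captured PC.

    TPC keeps bits <63:6> of the captured PC and has zeroes in bits <5:0>;
    TnPC is always TPC + 4.
    """
    tpc = captured_pc & MASK64 & ~0x3F   # clear bit positions <5:0>
    tnpc = (tpc + 4) & MASK64
    return tpc, tnpc
```

For a captured PC of 1000 0044₁₆, the sketch returns a TPC of 1000 0040₁₆ and a TnPC of 1000 0044₁₆.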

Here is a simpler but less precise way to describe this behavior: On a reset trap, TPC may be inaccurate. The value saved in TPC may differ from the correct trap PC by 63 bytes; this value is not necessarily equal to the PC of any instruction that has been or is scheduled to be executed. The value saved in TnPC will always be 4 more than the value saved in TPC.

C.2.6 SIR Support (Impl. Dep. #116)

In the UltraSPARC III processor, a software-initiated reset (SIR) is initiated by execution of a SIR instruction while in privileged mode. In nonprivileged mode, a SIR instruction behaves as a NOP. See also Software-Initiated Reset (SIR) on page 179.

C.2.7 TICK Register

The UltraSPARC III processor implements a 63-bit TICK counter. For the state of this register after reset, see TABLE O-1 on page 180. The TICK register format is described in TABLE C-2.

TABLE C-2 TICK Register Format

Bit    Field    Type    Use/Description

63     NPT      RW
       Nonprivileged Trap enable. If set, an attempt by nonprivileged software to read the TICK register causes a privileged_opcode trap. If clear, nonprivileged software can read the register with the RDTICK instruction. The register can only be written by privileged software. A write attempt by nonprivileged software causes a privileged_opcode trap.

62:0   counter  RW
       63-bit elapsed CPU clock cycle counter.

Note – TICK.NPT is set and TICK.counter is cleared after a POR. TICK.NPT is unchanged and TICK.counter is cleared after an XIR.
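
The TICK layout and access rules in TABLE C-2 can be modeled as follows. This is a minimal sketch: the class and function names are hypothetical, and the exception class merely stands in for the privileged_opcode trap.

```python
NPT_BIT = 63
COUNTER_MASK = (1 << 63) - 1          # bits 62:0


class PrivilegedOpcodeTrap(Exception):
    """Stand-in for the privileged_opcode trap."""


def decode_tick(raw: int) -> tuple:
    """Split a 64-bit TICK value into (NPT, counter)."""
    return (raw >> NPT_BIT) & 1, raw & COUNTER_MASK


def read_tick(raw: int, privileged: bool) -> int:
    """Model a read of TICK: nonprivileged reads trap when NPT is set."""
    npt, _ = decode_tick(raw)
    if npt and not privileged:
        raise PrivilegedOpcodeTrap("nonprivileged read with TICK.NPT = 1")
    return raw


def write_tick(new_value: int, privileged: bool) -> int:
    """Model a write to TICK: only privileged software may write."""
    if not privileged:
        raise PrivilegedOpcodeTrap("nonprivileged write to TICK")
    return new_value & ((1 << 64) - 1)
```

After a POR, the note above corresponds to a raw value with NPT set and the counter equal to zero, so nonprivileged reads trap until privileged software clears NPT.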

C.2.8 Population Count Instruction (POPC)

The population count (POPC) instruction is not directly executed in hardware. Execution of a POPC opcode causes an illegal_instruction trap, and the instruction is emulated in software.
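
For illustration, the core of a software emulation of POPC might look like the sketch below. It is not the emulation code shipped with any system; the function name is hypothetical, and the loop uses the common clear-lowest-set-bit technique rather than any method named in this document.

```python
MASK64 = (1 << 64) - 1


def emulate_popc(operand: int) -> int:
    """Count the one bits in a 64-bit operand."""
    value = operand & MASK64
    count = 0
    while value:
        value &= value - 1   # clear the lowest set bit
        count += 1
    return count
```

A trap handler would decode the trapping instruction, call a routine of this kind on the source operand, write the result to the destination register, and resume after the emulated instruction.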

C.2.9 Secure Software

To establish an enhanced security environment, it might be necessary to initialize certain processor states between contexts. Examples of such states are the contents of integer and floating-point register files, condition codes, and state registers. See also Clean Window Handling (Impl. Dep. #102), below.

C.3 SPARC V9 Integer Operations

In this section, we discuss the following subjects:
■ Integer Register file and Window Control Registers
■ Clean window handling
■ Integer multiply and divide
■ Version Register

C.3.1 Integer Register File and Window Control Registers (Impl. Dep. #2)

The UltraSPARC III processor implements an 8-window, 64-bit integer register file (NWINDOWS = 8). The UltraSPARC III processor truncates values stored in CWP, CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, and the cwp field of TSTATE to three bits. Truncation includes implicit updates to these registers by SAVE(D) and RESTORE(D) instructions. The upper two bits of these registers are read as 0.
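
The truncation described above amounts to keeping the low three bits of any value written to a window state register. A minimal sketch, with hypothetical function names and with CWP stepping by one on SAVE and RESTORE as in SPARC V9:

```python
NWINDOWS = 8
WINDOW_MASK = NWINDOWS - 1            # 0b111: three implemented bits


def write_window_reg(value: int) -> int:
    """Value actually held after a write: only bits 2:0 are kept."""
    return value & WINDOW_MASK


def cwp_after_save(cwp: int) -> int:
    """Implicit CWP update by SAVE, truncated to three bits."""
    return write_window_reg(cwp + 1)


def cwp_after_restore(cwp: int) -> int:
    """Implicit CWP update by RESTORE, truncated to three bits."""
    return write_window_reg(cwp - 1)
```

Writing 9 therefore leaves 1 in the register, and a SAVE at CWP = 7 wraps CWP to 0.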

C.3.2 Clean Window Handling (Impl. Dep. #102)

SPARC V9 introduced the clean window to enhance security and integrity during program execution. A clean window is defined as a register window containing either all zeroes or addresses and data that belong to the current context. The CLEANWIN register records the number of available clean windows.

A clean_window trap is generated when a SAVE instruction requests a window and there are no more clean windows. System software must then initialize all registers in the next available window(s) to 0 before returning to the requesting context.

C.3.3 Integer Multiply and Divide

Integer multiplications (MULScc, UMUL{cc}, SMUL{cc}, MULX) and divisions (SDIV{cc}, UDIV{cc}, UDIVX, SDIVX) are directly executed in hardware.

Multiplications are done 16 bits at a time with early exit when the final result is generated. Divisions use a 1-bit nonrestoring division algorithm.
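
To show what a 1-bit nonrestoring division does, here is a textbook version for unsigned operands. It is a sketch of the general technique only: the hardware datapath, its handling of signed operands, and its exception behavior are not described in this section, and the function name is hypothetical.

```python
def nonrestoring_divide(dividend: int, divisor: int, bits: int = 64) -> tuple:
    """Unsigned nonrestoring division producing one quotient bit per step.

    Returns (quotient, remainder). The partial remainder is allowed to go
    negative; instead of restoring it, the next step adds the divisor.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        next_bit = (dividend >> i) & 1
        if remainder >= 0:
            remainder = 2 * remainder + next_bit - divisor
        else:
            remainder = 2 * remainder + next_bit + divisor
        # The quotient bit is 1 when the partial remainder is nonnegative.
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor          # single final correction step
    return quotient, remainder
```

Because each iteration retires exactly one quotient bit, a 64-bit divide in this model always takes 64 steps, in contrast to the early-exit behavior described for multiplication.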

Note – For best performance, the rs1 operand should be the smaller of the two operands of a multiply operation. This optimization is compatible with the multiply operation in UltraSPARC I.

C.3.4 Version Register (Impl. Dep. #2, 13, 101, 104)

Consult the product data sheet for the content of the Version Register for an implementation. For the state of this register after a reset, see TABLE O-1 on page 180.

TABLE C-3 lists the Version Register format.

TABLE C-3 Version Register Format

Bits    Field    Type (RW)    Use/Description

63:48   manuf    R
        Manufacturer ID. 16-bit manufacturer code, 003E₁₆ (Sun’s JEDEC number), that identifies the manufacturer of the processor. Previously, for UltraSPARC I and UltraSPARC II, the manufacturer’s code was 0017₁₆ (TI’s JEDEC number).

47:32   impl     R
        Implementation ID. 16-bit implementation code, 0014₁₆, that uniquely identifies the UltraSPARC III processor class CPU.

31:24   mask     R
        Mask set version. 8-bit mask set revision number that comes from the version block. This field is broken up into two subfields, major_rev <31:28> and minor_rev <27:24>. The field is set equal to the tapeout revision. For the initial tapeout, it is 0.0 or 10₁₆ for the eight bits.

23:16   -        R
        Reserved.

15:8    maxtl    R
        Maximum trap level supported. Maximum number of supported trap levels beyond level 0. This is the same as the largest possible value for the TL register. For the UltraSPARC III processor, maxtl = 5.

7:5     -        R
        Reserved.

4:0     maxwin   R
        Maximum number of windows of integer register file. Maximum index number available for use as a valid CWP value. The value is NWINDOWS − 1. For the UltraSPARC III processor, maxwin = 7.
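
The field positions in TABLE C-3 translate directly into shifts and masks. The sketch below decodes a raw 64-bit Version Register value; the function name and the sample value are hypothetical (in particular, the mask field of the sample is illustrative only and is not taken from a product data sheet).

```python
def decode_ver(raw: int) -> dict:
    """Split a 64-bit Version Register value into its TABLE C-3 fields."""
    mask = (raw >> 24) & 0xFF
    return {
        "manuf": (raw >> 48) & 0xFFFF,   # bits 63:48
        "impl": (raw >> 32) & 0xFFFF,    # bits 47:32
        "mask": mask,                    # bits 31:24
        "major_rev": mask >> 4,          # bits 31:28
        "minor_rev": mask & 0xF,         # bits 27:24
        "maxtl": (raw >> 8) & 0xFF,      # bits 15:8
        "maxwin": raw & 0x1F,            # bits 4:0
    }


# Hypothetical sample value assembled from the constants quoted in TABLE C-3.
SAMPLE_VER = (0x003E << 48) | (0x0014 << 32) | (0x10 << 24) | (5 << 8) | 7
```

Decoding `SAMPLE_VER` yields manuf = 003E₁₆, impl = 0014₁₆, maxtl = 5, and maxwin = 7, from which software can derive NWINDOWS as maxwin + 1.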

C.4 SPARC V9 Floating-Point Operations

In this section, we discuss implementation-dependent floating-point operations. For information on nonstandard mode, please refer to Floating-Point Nonstandard Mode on page 98.

C.4.1 Subnormal Operands/Results; NaN Operands

See Appendix B, Floating-Point Nonstandard Mode, for a description of FSR.NS’s effects on floating-point operations.

C.4.2 Overflow, Underflow, and Inexact Traps (Impl. Dep. #3, 55)

The UltraSPARC III processor implements precise floating-point exception handling. Underflow is detected before rounding. Prediction of overflow, underflow, and inexact traps for divide and square root simplifies the hardware.

For divide, pessimistic prediction occurs when underflow/overflow cannot be determined from examination of the source operand exponents. For divide and square root, pessimistic prediction of inexact occurs unless one of the operands is a zero, NaN, or infinity. When pessimistic prediction occurs and the exception is enabled, an fp_exception_other (with FSR.ftt = 2, unfinished_FPop) trap is generated. System software properly handles these cases and resumes execution. If the exception is not enabled, the actual result status updates the aexc/cexc bits of the Floating-Point Status Register.

Note – Major performance degradation can occur while the system is running with the inexact exception enabled.

C.4.3 Quad-Precision Floating-Point Operations (Impl. Dep. #3)

All quad-precision floating-point instructions, listed in TABLE C-4, cause an fp_exception_other (with FSR.ftt = 3, unimplemented_FPop) trap. These operations are emulated in system software.

TABLE C-4 Unimplemented Quad-Precision Floating-Point Instructions

Instruction Description

F{s,d}TOq   Converts single-/double- to quad-precision FP
F{i,x}TOq   Converts 32-/64-bit integer to quad-precision FP
FqTO{s,d}   Converts quad- to single-/double-precision FP
FqTO{i,x}   Converts quad-precision FP to 32-/64-bit integer
FCMP{E}q    Quad-precision FP compares
FMOVq       Quad-precision FP move
FMOVqcc     Quad-precision FP move, set FP condition code
FMOVqr      Quad-precision FP move if register match condition
FABSq       Quad-precision FP absolute value
FADDq       Quad-precision FP addition
FDIVq       Quad-precision FP division
FdMULq      Double- to quad-precision FP multiply
FMULq       Quad-precision FP multiply
FNEGq       Quad-precision FP negation
FSQRTq      Quad-precision FP square root
FSUBq       Quad-precision FP subtraction

Note – Quad loads and stores are illegal instructions in the UltraSPARC III processor.

C.4.4 Floating-Point Upper and Lower Dirty Bits in FPRS Register

The FPRS_dirty_upper (DU) and FPRS_dirty_lower (DL) bits in the Floating Point Registers State (FPRS) Register are set when an instruction modifying the corresponding upper and lower half of the floating-point register file is dispatched. Instructions that modify floating-point register files include floating-point operate, graphics, floating-point loads, and block load instructions.

FPRS.DU and FPRS.DL can be set pessimistically, even when the instruction that modifies the floating-point register file is nullified.

C.4.5 Floating-Point Status Register (Impl. Dep. #13, 19, 22, 23, 24)

The UltraSPARC III processor supports precise traps and implements all three exception fields (TEM, cexc, and aexc) conforming to IEEE Standard 754-1985. See TABLE O-1 on page 180 for the state of the Floating-Point Status Register (FSR).

TABLE C-5 lists the format of the FSR.

TABLE C-5 FSR Register Format

Bits    Field   Type   Use/Description
63:38   —       R      Reserved.
37:36   fcc3    RW     Floating-point condition code (set 3).
35:34   fcc2    RW     Floating-point condition code (set 2).
33:32   fcc1    RW     Floating-point condition code (set 1). Together with fcc0 <11:10>, these form four sets of 2-bit floating-point condition codes, which are modified by the FCMP{E} (and LD{X}FSR) instructions. The FBfcc, FBPfcc, FMOVcc, and MOVcc instructions use one of the condition code sets to determine conditional control transfers and conditional register moves.
31:30   RD      RW     Rounding direction. IEEE Std. 754-1985 rounding direction, as follows:
                           RD   Round toward
                           0    Nearest (even if tie)
                           1    0
                           2    +∞
                           3    −∞
                       When GSR.IM = 1, the value of FSR.RD is overridden by GSR.RND.
29:28   —       R      Reserved.
27:23   TEM     RW     IEEE-754 trap enable mask. Five-bit trap enable mask for the IEEE-754 floating-point exceptions. If a floating-point operate instruction produces one or more exceptions, then the corresponding cexc/aexc bits are set and an fp_exception_ieee_754 (with FSR.ftt = 1, IEEE_754_exception) exception is generated.
22      NS      RW     Nonstandard floating-point results. Ignored when GSR.IM = 1. When GSR.IM = 0 and FSR.NS = 0, IEEE-754-compatible floating-point results are produced. In particular, subnormal operands or results can cause a trap. See Graphics Status Register (GSR) (ASR19) in Section 5.2.11 of Commonality. When this field is set, the UltraSPARC III processor can deliver a non-IEEE-754-compatible result. In particular, subnormal operands and results can be flushed to 0. See Floating-Point Nonstandard Mode on page 98 for further details.
21:20   —       R      Reserved.
19:17   ver     R      FPU version number. Identifies a particular implementation of the UltraSPARC III Floating Point and Graphics Unit architecture.
16:14   ftt     RW     Floating-point trap type. The 3-bit floating-point trap-type field is set whenever a floating-point instruction causes the fp_exception_ieee_754 or fp_exception_other traps. Trap type values and trap signalled are:
                           ftt   Trap Type             Trap Signalled
                           0     None                  —
                           1     IEEE_754_exception    fp_exception_ieee_754
                           2     unfinished_FPop       fp_exception_other
                           3     unimplemented_FPop    fp_exception_other
                           4     sequence_error        Not used
                           5     hardware_error        Not used
                           6     invalid_fp_register   Not used
                           7     Reserved              —
13      qne     RW     Floating-point deferred-trap queue (FQ) not empty. Not used because the UltraSPARC III processor implements precise floating-point exceptions.
12      —       R      Reserved.
11:10   fcc0    RW     Floating-point condition code (set 0). See <37:32>. Note: fcc0 is the same as the fcc in SPARC V8.
9:5     aexc    RW     Accumulated outstanding exceptions. Five-bit accrued exception field accumulates ieee_754_exceptions while the corresponding floating-point exception trap is disabled by FSR.TEM.
4:0     cexc    RW     Current outstanding exceptions. Five-bit current exception field specifies the most recently generated ieee_754_exceptions from the last executed FPop.
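As an illustration of the FSR layout in TABLE C-5, the following sketch extracts each field from a 64-bit value. The helper names and the sample value are hypothetical; the bit positions, rounding-direction encodings, and ftt names are those listed in the table.

```python
FTT_NAMES = {
    0: "None",
    1: "IEEE_754_exception",
    2: "unfinished_FPop",
    3: "unimplemented_FPop",
    4: "sequence_error",
    5: "hardware_error",
    6: "invalid_fp_register",
    7: "Reserved",
}
ROUND_TOWARD = {0: "nearest", 1: "0", 2: "+inf", 3: "-inf"}


def field(value, hi, lo):
    """Return bits hi:lo of value, right-justified."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fsr(fsr):
    """Decode an FSR value using the bit positions of TABLE C-5."""
    return {
        "fcc3": field(fsr, 37, 36),
        "fcc2": field(fsr, 35, 34),
        "fcc1": field(fsr, 33, 32),
        "rd": ROUND_TOWARD[field(fsr, 31, 30)],
        "tem": field(fsr, 27, 23),
        "ns": field(fsr, 22, 22),
        "ver": field(fsr, 19, 17),
        "ftt": FTT_NAMES[field(fsr, 16, 14)],
        "qne": field(fsr, 13, 13),
        "fcc0": field(fsr, 11, 10),
        "aexc": field(fsr, 9, 5),
        "cexc": field(fsr, 4, 0),
    }


# Hypothetical sample: RD = 2, all traps enabled, ftt = 2, one cexc bit set.
sample = (2 << 30) | (0x1F << 23) | (2 << 14) | 0b00001
decoded = decode_fsr(sample)
print(decoded)
```

A trap handler that sees ftt decode to unfinished_FPop would take the software-completion path described in Section C.4.2.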

C.5 SPARC V9 Memory-Related Operations

We discuss or provide pointers to 11 memory-related operations in this section.

Load/store alternate address space (impl. dep. #5, 29, 30). See Address Space Identifiers and Address Spaces on page 153.

Load/store ancillary state register (impl. dep. #6, 7, 8, 9, 47, 48). See Section 5.2.11 in Commonality.

MMU implementation. The UltraSPARC III processor memory management is based on software-managed instruction and data TLBs and in-memory TSBs backed by a Software Translation table.

See Appendix F, Memory Management Unit.

FLUSH and self-modifying code (impl. dep. #122). FLUSH synchronizes code and data spaces after code space is modified during program execution. On the UltraSPARC III processor, the FLUSH effective address is ignored. FLUSH does not access the data MMU and cannot generate a data MMU miss or exception.

SPARC V9 specifies that the FLUSH instruction has no latency on the issuing processor. In other words, a store to instruction space prior to the FLUSH instruction is visible immediately after the completion of FLUSH. When a FLUSH operation is performed, the UltraSPARC III processor guarantees that earlier code modifications will be visible across the whole system.

See Memory Synchronization on page 68.

PREFETCH(A) (impl. dep. #103, 117). PREFETCH(A) instructions with fcn = 0–4 are implemented. In accordance with SPARC V9, PREFETCH(A) instructions with fcn = 5–15 cause an illegal_instruction trap.

PREFETCH(A) instructions with fcn = 17–31 execute as NOPs.

See A.48 in Commonality and Prefetch Data on page 93.

Nonfaulting load and MMU disable (impl. dep. #117). When the data MMU is disabled, accesses default to the settings in the Data Cache Unit Control Register CP and CV bits. Note: E is the inverse of CP.

When the data MMU is disabled (DCUCR.DMU = 0), the side-effect bit (TTE.E) is set to 1.

Nonfaulting loads encountered when the MMU is disabled cause a data_access_exception trap with SFSR.FT = 2 (speculative load to page with side- effect attribute).

See MMU Control in TABLE 5-13 on page 34.

LDD/STD handling (impl. dep. #107, 108). LDD and STD instructions are directly executed in hardware.

Note – LDD and STD are deprecated in SPARC V9. In the UltraSPARC III processor, it is more efficient to use LDX and STX for accessing 64-bit data. LDD and STD take longer to execute than do two 32- or 64-bit loads/stores.

Floating-point mem_address_not_aligned (impl. dep. #109, 110, 111, 112).

LDQF(A) and STQF(A) cause an LDDF/STDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned.

LDQF(A) and STQF(A) are not directly executed in hardware. They cause an illegal_instruction trap.

Supported memory models (impl. dep. # 113, 121). The UltraSPARC III processor supports only the TSO memory model.

See Chapter 8, Memory Models.

I/O operations (impl. dep. #118, 123). See I/O and Accesses with Side Effects on page 72.

Quad loads/stores. Quad loads and stores cause an illegal_instruction exception in the UltraSPARC III processor. These include LDQF(A) and STQF(A).

C.6 Non-SPARC V9 Extensions

C.6.3 DCR Register Extensions

See Dispatch Control Register (DCR) (ASR 18) on page 30 for a description of UltraSPARC III extensions to the Dispatch Control Register (impl. dep. #203, 204, and 205).

C.6.4 Other Extensions

The remaining implementation dependencies of non-SPARC V9 extensions are briefly described below.

Cache subsystem. There are two levels of caches: one level on-chip; one level external.

See Appendix M, Caches and Cache Coherency.

Memory Management Unit. The UltraSPARC III processor implements a multilevel memory management scheme.

See Appendix F, Memory Management Unit.

Error handling. The UltraSPARC III processor implements a set of programmer-visible error and exception registers.

See Appendix P, Error Handling.

Block memory operations. The UltraSPARC III processor supports 64-byte block memory operations that use a block of 8 double-precision, floating-point registers as a temporary buffer.

See Block Load and Store Instructions (VIS I) on page 79.

Partial stores. The UltraSPARC III processor supports 8-, 16-, and 32-bit partial stores to memory.

See Partial Store (VIS I) on page 91.

Short floating-point loads and stores. The UltraSPARC III processor supports 8- and 16-bit loads and stores to the floating-point registers.

See A.58 in Commonality.

Atomic quad load. The UltraSPARC III processor supports 128-bit atomic quad load (privileged only) operations to a pair of integer registers.

See A.30 in Commonality.

Interrupt vector handling. Processors and I/O devices can interrupt a selected processor by assembling and sending an interrupt packet consisting of eight 64-bit interrupt data words. Thus, hardware interrupts and cross-calls can have the same hardware mechanism and can share a common software interface for processing.

See Sections N.1 and N.2 in Commonality.

Instruction set extensions (impl. dep. #106). The UltraSPARC III processor extends the standard SPARC V9 instruction set. Unimplemented IMPDEP1 and IMPDEP2 opcodes encountered during execution cause an illegal_instruction trap.

See A.24 in Commonality.

Performance instrumentation. See Performance Instrumentation Counter Events on page 268.

Debug and diagnostics support. See Chapter , Debug and Diagnostics Support.

Low-power support. The UltraSPARC III processor supports a low-power mode to reduce power requirements during idle periods.

See Power Management on page 277.

S.APPENDIX D

Formal Specification of the Memory Models

Please refer to Appendix D of Commonality.

S.APPENDIX E

Opcode Maps

UltraSPARC III implements exactly the instruction set specified in JPS1. Therefore, please refer to Appendix E in Commonality for UltraSPARC III opcode maps.

S.APPENDIX F

Memory Management Unit

The Memory Management Unit (MMU) conforms to the requirements set forth in Appendix F of Commonality. In particular, the MMU supports a 64-bit virtual address space, software TLB-miss processing only (no hardware page table walk), simplified protection encoding, and multiple page sizes.

This chapter describes the Memory Management Unit, as seen by the operating system software, in these sections:

■ Virtual Address Translation on page 129
■ Translation Table Entry (TTE) on page 130
■ Hardware Support for TSB Access on page 131
■ Reset, Disable, and RED_state Behavior on page 133
■ Internal Registers and ASI Operations on page 134
■ Translation Lookaside Buffer Hardware on page 138

Section numbers in this appendix correspond to those in Appendix F of Commonality. However, figures and tables are numbered consecutively.

F.1 Virtual Address Translation

A 64-bit virtual address (VA) space is supported, with 43 bits of physical address (PA) (impl. dep. #224).

Each MMU consists of one or more Translation Lookaside Buffers (TLBs), including micro-TLB structures. The organization of the TLB structures used for the various page sizes may differ between the Instruction MMU and the Data MMU. In the D-MMU, a 512-entry, 2-way associative TLB is used for 8-Kbyte page translations, and a 16-entry, fully associative TLB is used for 64-Kbyte, 512-Kbyte, and 4-Mbyte page translations and locked pages of all four sizes. In the I-MMU, a 128-entry, 2-way associative TLB is used for 8-Kbyte page translations, and a 16-entry, fully associative TLB is used for 64-Kbyte, 512-Kbyte, and 4-Mbyte page translations and locked pages of all four sizes.

F.2 Translation Table Entry (TTE)

The Translation Table Entry (TTE) is the equivalent of a SPARC V8 page table entry; it holds information for a single page mapping. The TTE is divided into two 64-bit words representing the tag and data of the translation. Just as in a hardware cache, the tag is used to determine whether there is a hit in the TSB; if there is a hit, then the data are fetched by software.

The configuration of the TTE is illustrated in FIGURE F-1 and described in TABLE F-1 (see also Section F.2 in Commonality).

Tag:   G <63> | — <62:61> | Context <60:48> | — <47:42> | VA_tag<63:22> <41:0>

Data:  V <63> | Size <62:61> | NFO <60> | IE <59> | Soft2 <58:50> | Reserved <49:43> | PA<42:13> <42:13> | Soft <12:7> | L <6> | CP <5> | CV <4> | E <3> | P <2> | W <1> | G <0>

FIGURE F-1 Translation Storage Buffer (TSB) Translation Table Entry (TTE)
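The TTE layout in FIGURE F-1 can be made concrete with a small decoder. This is a sketch only: the function names and sample value are hypothetical, and the mapping of the Size field to page sizes (0 = 8-Kbyte through 3 = 4-Mbyte) is assumed from Commonality rather than stated in this appendix.

```python
# Assumed Size encoding (from Commonality, not restated in this appendix).
PAGE_SIZES = {0: "8-Kbyte", 1: "64-Kbyte", 2: "512-Kbyte", 3: "4-Mbyte"}


def bits(value, hi, lo):
    """Return bits hi:lo of value, right-justified."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_tte_tag(tag):
    """Decode the 64-bit TTE tag word."""
    return {
        "g": bits(tag, 63, 63),
        "context": bits(tag, 60, 48),
        "va_tag": bits(tag, 41, 0),  # holds VA<63:22>
    }


def decode_tte_data(data):
    """Decode the 64-bit TTE data word."""
    return {
        "v": bits(data, 63, 63),
        "size": PAGE_SIZES[bits(data, 62, 61)],
        "nfo": bits(data, 60, 60),
        "ie": bits(data, 59, 59),
        "soft2": bits(data, 58, 50),
        "pa": bits(data, 42, 13) << 13,  # physical address of the page
        "soft": bits(data, 12, 7),
        "l": bits(data, 6, 6),
        "cp": bits(data, 5, 5),
        "cv": bits(data, 4, 4),
        "e": bits(data, 3, 3),
        "p": bits(data, 2, 2),
        "w": bits(data, 1, 1),
        "g": bits(data, 0, 0),
    }


# Hypothetical TTE: valid, 64-Kbyte, locked, physically cacheable, writable.
sample_data = (1 << 63) | (1 << 61) | (0x1234 << 13) | (1 << 6) | (1 << 5) | (1 << 1)
tte = decode_tte_data(sample_data)
print(tte)
```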

TABLE F-1 TSB and TTE Bit Description

Bit            Field      Description
Data – 49:47   Reserved   Reserved, read as 0.
Data – 46:43   Reserved   Since UltraSPARC III has a 43-bit physical address space, these bits in a TTE always read as 0, and writes to them are ignored (impl. dep. #224).
Data – 42:13   PA         The physical page number. On UltraSPARC III, page offset bits for larger page sizes (PA<15:13>, PA<18:13>, and PA<21:13> for 64-Kbyte, 512-Kbyte, and 4-Mbyte pages, respectively) are stored in the TLB and returned for a Data Access read but are ignored during normal translation. The data returned from those fields by a Data Access read are the data previously written to them (impl. dep. #238).
Data – 6       L          If the lock bit is set, then the TTE entry will be “locked down” when it is loaded into the TLB; that is, if this entry is valid, it will not be replaced by the automatic replacement algorithm invoked by an ASI store to the Data In Register. The lock bit has no meaning for an invalid entry. Arbitrary entries can be locked down in the TLB. Software must ensure that at least one entry is not locked when replacing a TLB entry; otherwise, a locked entry will be replaced. Since the 16-entry, fully associative TLB is shared for all locked entries as well as for 4-Mbyte and 512-Kbyte pages, the total number of locked pages is limited to 15 or fewer. In UltraSPARC III, the TLB lock bit is implemented only in the DMMU 16-entry, fully associative TLB and the IMMU 16-entry, fully associative TLB. In the TLBs dedicated to 8-Kbyte page translations (DMMU 512-entry, 2-way associative TLB and IMMU 128-entry, 2-way associative TLB), each TLB entry’s lock bit reads as 0, and writes to it are ignored.
Data – 5,      CP,        The cacheable-in-physically-indexed-cache bit and cacheable-in-virtually-indexed-cache bit determine the placement of data in the caches. UltraSPARC III fully implements the CV bit (impl. dep. #226). The following table describes how CP and CV control cacheability in specific UltraSPARC III caches.
Data – 4       CV

               Meaning of TTE when placed in:
Cacheable      I-TLB (Instruction Cache      D-TLB (Data Cache
(CP, CV)       PA-indexed)                   VA-indexed)
00, 01         Noncacheable                  Noncacheable
10             Cacheable E-cache, I-cache    Cacheable E-cache and W-cache
11             Cacheable E-cache, I-cache    Cacheable E-cache, D-cache, and W-cache

The MMU does not operate on the cacheable bits but merely passes them through to the cache subsystem. The CV bit in the IMMU is read as zero and ignored when written.
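The CP/CV cacheability table can be read as a small lookup. The sketch below only restates that table; the function name and the "I-TLB"/"D-TLB" string keys are illustrative choices, not part of the specification.

```python
def caches_for(tlb, cp, cv):
    """Return the caches a TTE may use, per the (CP, CV) cacheability table.

    tlb is "I-TLB" or "D-TLB"; cp and cv are the TTE bits (0 or 1).
    """
    if tlb not in ("I-TLB", "D-TLB"):
        raise ValueError("tlb must be 'I-TLB' or 'D-TLB'")
    if not cp:
        return ()  # (CP, CV) = 00 or 01: noncacheable
    if tlb == "I-TLB":
        return ("E-cache", "I-cache")  # CV makes no difference here
    if cv:
        return ("E-cache", "D-cache", "W-cache")
    return ("E-cache", "W-cache")


print(caches_for("D-TLB", 1, 0))
```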

F.4 Hardware Support for TSB Access

The MMU hardware provides services to allow the TLB-miss handler to efficiently reload a missing TLB entry for an 8-Kbyte or 64-Kbyte page. These services include:

■ Formation of TSB Pointers, based on the missing virtual address and address space
■ Formation of the TTE Tag Target used for the TSB tag comparison
■ Efficient atomic write of a TLB entry with a single store ASI operation
■ Alternate globals on MMU-signaled traps

Please refer to Section F.4 of Commonality for additional details.

F.4.1 Typical TLB Miss/Refill Sequence

A typical TLB-miss and -refill sequence is the following:

1. A TLB miss causes either a fast_instruction_access_MMU_miss or a fast_data_access_MMU_miss exception.

2. The appropriate TLB-miss handler loads the TSB Pointers and the TTE Tag Target with loads from the MMU registers.

3. Using this information, the TLB miss handler checks to see if the desired TTE exists in the TSB. If so, the TTE data are loaded into the TLB Data In Register to initiate an atomic write of the TLB entry chosen by the replacement algorithm.

4. If the TTE does not exist in the TSB, then the TLB-miss handler jumps to the more sophisticated, and slower, TSB miss handler.
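Steps 2 through 4 can be sketched as a software model. Everything named here is hypothetical: the TSB is modeled as a Python list of (tag, data) pairs, and the two callbacks stand in for the store to the TLB Data In Register and for the slower TSB miss handler. A real handler runs as trap-level assembly and reads the pointer and tag target from MMU registers.

```python
def tlb_miss_handler(tsb, tsb_index, tag_target, write_tlb_data_in, tsb_miss_handler):
    """Model of the fast TLB-miss path.

    tsb               list of (tte_tag, tte_data) pairs (model of the in-memory TSB)
    tsb_index         entry selected by the hardware-formed TSB pointer
    tag_target        value read from the TTE Tag Target register
    write_tlb_data_in callable modeling the store to the TLB Data In Register
    tsb_miss_handler  callable modeling the slower TSB miss handler
    """
    tte_tag, tte_data = tsb[tsb_index]
    # A zero XOR result means the stored tag equals the tag target (TSB hit).
    if (tte_tag ^ tag_target) == 0:
        write_tlb_data_in(tte_data)
        return "tsb_hit"
    tsb_miss_handler()
    return "tsb_miss"


# Hypothetical two-entry TSB used to exercise both paths.
model_tsb = [(0x00, 0xAAA), (0x42, 0xBBB)]
loaded, slow_path = [], []
hit = tlb_miss_handler(model_tsb, 1, 0x42, loaded.append, lambda: slow_path.append(True))
miss = tlb_miss_handler(model_tsb, 1, 0x43, loaded.append, lambda: slow_path.append(True))
print(hit, miss)
```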

The virtual address used in the formation of the pointer addresses comes from the Tag Access Register, which holds the virtual address and context of the load or store responsible for the MMU exception. See Translation Table Entry (TTE) on page 130.

Note – There are no separate physical registers in hardware for the pointer registers; rather, they are implemented through a dynamic reordering of the data stored in the Tag Access and the TSB registers.

Hardware provides pointers for the most common cases of 8-Kbyte and 64-Kbyte page miss processing. These pointers give the virtual addresses where the 8-Kbyte and 64-Kbyte TTEs are stored if either is present in the TSB.

n is defined to be the TSB_Size field of the TSB register; it ranges from 0 to 7. Note that TSB_Size refers to the size of each TSB when the TSB is split. The symbol ‖ designates concatenation of bit vectors, and ⊕ indicates an exclusive-or operation.

For a shared TSB (TSB register split field = 0):

8K_POINTER = TSB_Base[63:13+n] ⊕ TSB_Extension[63:13+n] ‖ VA[21+n:13] ‖ 0000

64K_POINTER = TSB_Base[63:13+n] ⊕ TSB_Extension[63:13+n] ‖ VA[24+n:16] ‖ 0000

For a split TSB (TSB register split field = 1):

8K_POINTER = TSB_Base[63:14+n] ⊕ TSB_Extension[63:14+n] ‖ 0 ‖ VA[21+n:13] ‖ 0000

64K_POINTER = TSB_Base[63:14+n] ⊕ TSB_Extension[63:14+n] ‖ 1 ‖ VA[24+n:16] ‖ 0000
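The pointer expressions above can be checked with a short model. This is a sketch under stated assumptions: the function and parameter names are hypothetical, only the bit ranges written in the expressions are modeled, and the TSB_Hash contribution described in Section F.10.7 is left out.

```python
MASK64 = (1 << 64) - 1


def bits(value, hi, lo):
    """Return bits hi:lo of value, right-justified."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def tsb_pointer(tsb_base, tsb_ext, va, n, split, is_64k):
    """Form the 8-Kbyte or 64-Kbyte TSB pointer.

    n is TSB_Size (0..7); split is the TSB register split field;
    is_64k selects the 64K_POINTER instead of the 8K_POINTER.
    """
    if not 0 <= n <= 7:
        raise ValueError("TSB_Size must be in 0..7")
    va_lo = 16 if is_64k else 13
    index = bits(va, va_lo + 8 + n, va_lo)  # VA[24+n:16] or VA[21+n:13]
    top_lo = 14 + n if split else 13 + n
    top = (((tsb_base ^ tsb_ext) >> top_lo) << top_lo) & MASK64
    # In a split TSB, one extra bit selects the 8-Kbyte (0) or 64-Kbyte (1) half.
    select = (1 << (13 + n)) if (split and is_64k) else 0
    return top | select | (index << 4)  # each TTE occupies 16 bytes


base = 0x10000000  # hypothetical, suitably aligned TSB base
shared_8k = tsb_pointer(base, 0, 0x2000, 0, False, False)
shared_64k = tsb_pointer(base, 0, 0x10000, 0, False, True)
split_64k = tsb_pointer(base, 0, 0x10000, 0, True, True)
print(hex(shared_8k), hex(shared_64k), hex(split_64k))
```

With a shared TSB, the 8-Kbyte page at VA 2000₁₆ and the 64-Kbyte page at VA 10000₁₆ yield the same pointer, which is the conflict that the Split bit description warns about; with a split TSB the two pointers differ.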

For a more detailed description of the pointer logic with pseudocode and hardware implementation, see TSB Pointer Logic Hardware Description on page 140.

The TSB Tag Target is formed by aligning the missing access VA (from the Tag Access Register) and the current context to positions found above in the description of the TTE tag, allowing a simple XOR instruction for TSB hit detection.

F.4.2 Faults and Traps

On a mem_address_not_aligned trap that occurs during a JMPL or RETURN instruction, UltraSPARC III updates the D-SFAR and D-SFSR registers with the fault address and status, respectively (impl. dep. #237).

F.8 Reset, Disable, and RED_state Behavior

Please refer to Section F.8 of Commonality for general details.

When the I-MMU is disabled, it truncates all instruction accesses to the physical address size (implementation dependent) and passes the physically cacheable bit (Data Cache Unit Control Register CP bit) to the cache system. The access does not generate an instruction_access_exception trap.

Note – While the DMMU is disabled and the default CV bit in the Data Cache Unit Control Register is set to 0, data in the D-cache can be accessed only through load and store alternates to the internal D-cache access ASI. Normal loads and stores bypass the D-cache. Data in the D-cache cannot be accessed by load or store alternates that use ASI_PHYS_*. Other caches are physically indexed or are still accessible.

F.10 Internal Registers and ASI Operations

Please refer to Section F.10 of Commonality for details.

F.10.3 Accessing MMU Registers

The CAM Diagnostic Register described in TABLE F-2 is an implementation-specific register in UltraSPARC III (impl. dep. #239).

TABLE F-2 MMU Internal Registers and ASI Operations

IMMU ASI DMMU ASI VA<63:0> Access Register or Operation Name

55₁₆   5D₁₆   40000₁₆–60FF8₁₆   Read/Write   I/D TLB CAM Diagnostic Register

F.10.3 Instruction/Data MMU TLB Tag Access Registers

After a data_access_exception, the contents of the Context field of the D-MMU Tag Access Register are undefined.

Caution – When the D-MMU causes a trap due to a protection violation or other exception, software should use the context number from D-SFSR.CT instead of from the Context field of the D-TLB Tag Access Register.

F.10.4 I/D TLB Data In, Data Access, and Tag Read Registers

Data In and Data Access Registers

Writes to the TLB Data In register require the virtual address to be set to 0.

The format of the TLB Data Access register virtual address is illustrated in FIGURE F-2 and described in TABLE F-3.

— <63:19> | 0 <18> | TLB # <17:16> | — <15:12> | TLB Entry <11:3> | 0 <2:0>

FIGURE F-2 I/D MMU TLB Data Access Address

TABLE F-3 TLB Data Access Register

Bit     Field       Type   Description
17:16   TLB #       RW     The TLB to access, as defined below.
                               TLB #   TLB Type                                                                           # of Entries
                               0       Fully associative; 64-Kbyte, 512-Kbyte, and 4-Mbyte page sizes and locked pages    16
                               2       2-way associative; 8-Kbyte page size                                               128 (IMMU), 512 (DMMU)
11:3    TLB Entry   RW     The TLB entry number to be accessed, in the range 0–511. Not all TLBs will have all 512 entries. All TLBs regardless of size are accessed from 0 to N − 1, where N is the number of entries in the TLB.
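A diagnostic routine would combine the two fields of TABLE F-3 into the virtual address used for the Data Access ASI. The helper below is an illustrative sketch; its name and the "I"/"D" keys are hypothetical, and the entry counts are the ones quoted in the table.

```python
# Entry counts per (MMU, TLB #), as listed in TABLE F-3.
TLB_ENTRIES = {("I", 0): 16, ("D", 0): 16, ("I", 2): 128, ("D", 2): 512}


def tlb_data_access_va(mmu, tlb, entry):
    """Build the TLB Data Access virtual address.

    mmu is "I" or "D"; tlb is the TLB # field (0 or 2);
    entry is the TLB entry number (0 to N - 1).
    """
    if (mmu, tlb) not in TLB_ENTRIES:
        raise ValueError("unknown MMU/TLB combination")
    if not 0 <= entry < TLB_ENTRIES[(mmu, tlb)]:
        raise ValueError("entry out of range for this TLB")
    return (tlb << 16) | (entry << 3)  # TLB # in <17:16>, entry in <11:3>


last_dmmu_8k = tlb_data_access_va("D", 2, 511)
last_immu_fa = tlb_data_access_va("I", 0, 15)
print(hex(last_dmmu_8k), hex(last_immu_fa))
```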

F.10.6 I/D Translation Storage Buffer Base Registers

The Translation Storage Buffer (TSB) registers provide information for the hardware formation of TSB pointers and tag target, to assist software in quickly handling TLB misses. If the TSB concept is not employed in the software memory management strategy and therefore the Pointer and Tag Access Registers are not used, then the TSB registers need not contain valid data.

The TSB register is illustrated in FIGURE F-3 and described in TABLE F-4.

TSB_Base (virtual) <63:13> | Split <12> | — <11:3> | TSB_Size <2:0>

FIGURE F-3 MMU I/D TSB Registers

TABLE F-4 TSB Register Description

Bit     Field          Type   Description
63:13   I/D TSB_Base   RW     Provides the base virtual address of the Translation Storage Buffer. Software must ensure that the TSB base is aligned on a boundary equal to the size of the TSB or both TSBs in the case of a split TSB.
12      Split          RW     When Split = 1, the TSB 64-Kbyte pointer address is calculated assuming separate (but abutting and equally sized) TSB regions for the 8-Kbyte and the 64-Kbyte TTEs. In this case, TSB_Size refers to the size of each TSB. The TSB 8-Kbyte pointer address calculation is not affected by the value of the Split bit. When Split = 0, the TSB 64-Kbyte pointer address is calculated assuming that the same lines in the TSB are shared by 8-Kbyte and 64-Kbyte TTEs, called a “common TSB” configuration.
                              Caution: In the “common TSB” configuration (TSB.Split = 0), 8-Kbyte and 64-Kbyte page TTEs can conflict unless the TLB miss handler explicitly checks the TTE for page size. Therefore, do not use the common TSB mode in an optimized handler. For example, suppose an 8-Kbyte page at VA = 2000₁₆ and a 64-Kbyte page at VA = 10000₁₆ both exist, which is a legal situation. These both map to the second TSB line (line 1) and have the same VA tag of 0. Therefore, there is no way for the miss handler to distinguish these TTEs by the TTE tag alone, and unless the miss handler checks the TTE data, it may load an incorrect TTE.
2:0     I/D TSB_Size   RW     UltraSPARC III implements a 3-bit TSB_Size field (impl. dep. #236). The TSB_Size field provides the size of the TSB as follows:
                              • The number of entries in the TSB (or each TSB if split) = 512 × 2^TSB_Size.
                              • The number of entries in the TSB ranges from 512 entries at TSB_Size = 0 (8-Kbyte common TSB, 16-Kbyte split TSB), to 64K entries at TSB_Size = 7 (1-Mbyte common TSB, 2-Mbyte split TSB).
                              Note: Any update to the TSB register immediately affects the data that are returned from later reads of the Tag Target and TSB Pointer Registers.
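The size figures quoted for TSB_Size follow from the entry count and the 16-byte TTE (two 64-bit words, per Section F.2). The sketch below reproduces that arithmetic; the function name is hypothetical.

```python
TTE_BYTES = 16  # tag word plus data word, 64 bits each


def tsb_geometry(tsb_size, split):
    """Return (entries per TSB, total bytes) for a TSB_Size value of 0..7."""
    if not 0 <= tsb_size <= 7:
        raise ValueError("TSB_Size is a 3-bit field")
    entries = 512 << tsb_size            # 512 x 2^TSB_Size
    regions = 2 if split else 1          # a split TSB has two equal regions
    return entries, entries * TTE_BYTES * regions


for size in (0, 7):
    print(size, tsb_geometry(size, False), tsb_geometry(size, True))
```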

F.10.7 I/D TSB Extension Registers

Please refer to Section F.10.7 of Commonality for information on TSB Extension Registers.

The TSB Extension registers are defined as follows:

TSB_EXT<63:13> (virtual) <63:13> | Split <12> | TSB_Hash <11:3> | TSB_Size <2:0>

FIGURE F-4 MMU I/D TSB Extension Registers

In UltraSPARC III, TSB_Hash (bits 11:3 of the extension registers) is exclusive-ORed with the calculated TSB offset to provide a “hash” into the TSB (impl. dep. #228). Changing the TSB_Hash field on a per-process basis minimizes the collision of TSB entries between different processes.

F.10.12 I/D TLB CAM Diagnostic Register

Accesses to the TLB Diagnostic Register require the virtual address to select a TLB and a TLB entry. The format of the TLB Diagnostic Register virtual address is described below and illustrated in FIGURE F-5.

Bit Field Type Description

17:16   TLB #   RW   The number of the TLB to access, as follows:
                         TLB   TLB type                                                                           Entries
                         0     Fully associative; 64-Kbyte, 512-Kbyte, and 4-Mbyte page sizes and locked pages    16
                     [ARE THERE MORE?]

11:3 TLB Entry # RW The number of the TLB entry to be accessed, in the range 0–15. Not all TLBs will have all 512 entries. All TLBs regardless of size are accessed from 0 to N - 1, where N is the number of entries in the TLB.

— <63:19> | 1 <18> | TLB # <17:16> | — <15:12> | TLB Entry # <11:3> | 0 <2:0>

FIGURE F-5 MMU TLB Diagnostic Access Virtual Address

The format for the CAM Diagnostic Register is described below and illustrated in FIGURE F-6.

Bit   Field      Type   Description

6     LRU        RW     The LRU bit in the CAM, read-write.

5:3   RAM SIZE   R      The 3-bit page size field from the RAM, read-only.

2:0   CAM SIZE   R      The 3-bit page size field from the CAM, read-only.

— <63:7> | LRU <6> | RAM SIZE <5:3> | CAM SIZE <2:0>

FIGURE F-6 I/D MMU TLB CAM Diagnostic Registers

An ASI store to the TLB CAM Diagnostic Register initiates an internal atomic write to the specified TLB entry. The TLB RAM and CAM entry data are obtained from the store data.

An ASI load from the TLB CAM Diagnostic Register initiates an internal read of the data portion of the specified TLB RAM and CAM entry.

F.12 Translation Lookaside Buffer Hardware

This section briefly describes the TLB hardware. For more detailed information, refer to the Section F.12 of Commonality or the corresponding microarchitecture specification.

F.12.2 TLB Replacement Policy

On an automatic replacement write to the TLB, the D-MMU picks the entry to write, based on the following rules:

1. If the new entry maps to an 8-Kbyte unlocked page, then the replacement is directed to the 8-Kbyte, 2-way TLB. Otherwise, the replacement occurs in the fully associative TLB.

2. If replacement is directed to the 8-Kbyte, 2-way TLB, then the replacement set index is generated from the TLB Tag Access Register: bits <20:13> for the D-MMU, and bits <18:13> for the I-MMU.

3. If replacement is directed to the fully associative TLB, then the following alternatives are evaluated:

a. The first invalid entry is replaced (measuring from entry 0). If there is no invalid entry, then

b. the first unused, unlocked (LRU bit clear) entry will be replaced (measuring from entry 0). If there is no unused unlocked entry, then

c. all used bits are reset, and the process is repeated from Step 3b.

The replacement operation is undefined if all entries in the fully associative TLB have their lock bit set.
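The rules above can be sketched in C as follows. This is an illustrative software model, not the hardware implementation: the fa_entry_t type, the function names, and the -1 return value for the all-locked case (which the text leaves undefined) are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one fully associative TLB entry. */
typedef struct {
    bool valid;
    bool locked;
    bool used;   /* the "used" (LRU) bit */
} fa_entry_t;

/* Rule 2: set index for the 8-Kbyte, 2-way TLB from the Tag Access Register. */
unsigned dtlb_set_index(uint64_t tag_access) { return (unsigned)((tag_access >> 13) & 0xff); } /* <20:13> */
unsigned itlb_set_index(uint64_t tag_access) { return (unsigned)((tag_access >> 13) & 0x3f); } /* <18:13> */

/* Rule 3: pick the entry to replace in the fully associative TLB.
 * Returns -1 when every entry is valid and locked; the specification
 * leaves that case undefined, so -1 is only a convention of this sketch. */
int fa_pick_victim(fa_entry_t *tlb, size_t n)
{
    size_t i;
    bool any_unlocked = false;

    for (i = 0; i < n; i++)              /* Step 3a: first invalid entry */
        if (!tlb[i].valid)
            return (int)i;

    for (i = 0; i < n; i++)
        if (!tlb[i].locked)
            any_unlocked = true;
    if (!any_unlocked)
        return -1;

    for (int pass = 0; pass < 2; pass++) {
        for (i = 0; i < n; i++)          /* Step 3b: first unused, unlocked entry */
            if (!tlb[i].used && !tlb[i].locked)
                return (int)i;
        for (i = 0; i < n; i++)          /* Step 3c: reset all used bits, retry 3b */
            tlb[i].used = false;
    }
    return -1;                           /* not reached when an unlocked entry exists */
}
```

Because Step 3c clears every used bit, the second pass of Step 3b is guaranteed to find a victim whenever at least one entry is unlocked.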

CODE EXAMPLE F-1 presents code for a pseudorandom replacement algorithm.

CODE EXAMPLE F-1 Pseudorandom Replacement Algorithm

//**********************************************************
//
// One-hot 4-bit random pattern generator using Type 2 LFSR.
//
// Initial pattern : 5'b10101;
// Primitive polynomial : x^4 + x^3 + x + 1;
//
// Usage:
//   rand_out  one-hot 4-bit random pattern output.
//   event_in  generate next pattern.
//
//**********************************************************
module lfsr (rand_out, event_in, reset, clk);

    output [3:0] rand_out;
    input event_in;
    input reset;
    input clk;

    wire [3:0] polynomial = 4'b1011; // Polynomial except most significant bit.
    wire [4:0] seed = 5'b10101;
    wire [4:0] lfsr_reg;
    wire [4:0] lfsr_out = (lfsr_reg ^ seed);

    // storage element input
    wire [4:0] lfsr_in = {lfsr_out[0], ({4{lfsr_out[0]}} & polynomial) ^ lfsr_out[4:1]};

    dffe #(5) ff_lfsr (lfsr_reg, lfsr_in, ~reset, event_in, clk);

    assign rand_out = {~lfsr_out[1] & ~lfsr_out[0], ~lfsr_out[1] & lfsr_out[0],
                        lfsr_out[1] & ~lfsr_out[0],  lfsr_out[1] & lfsr_out[0]};
endmodule

F.12.3 TSB Pointer Logic Hardware Description

FIGURE F-7 illustrates the generation of the 8-Kbyte and 64-Kbyte pointers; CODE EXAMPLE F-2 presents pseudocode for D-MMU pointer logic.

[Figure: block diagram of TSB pointer formation. Labeled signals: TSB_Base<63:21>, TSB_Base<20:13>, TSB_Split, TSB_Size<2:0>, 64k_not8k, the extension register selected by VA space (TSB_Ext_Pri, TSB_Ext_Sec, TSB_Ext_Nuc) with fields Ext<20:13> and Ext<11:3>, VA<32:22>, and VA<24:16> (64 Kbyte) or VA<21:13> (8 Kbyte). Pointer fields: <63:21>, <20:13> (from the TSB Size Logic), <12:4>, and <3:0> = 0. The TSB Size Logic for bit N (0 ≤ N ≤ 7) takes VA<25+N> (64 Kbyte), VA<22+N> (8 Kbyte), Ext<20:13>, and TSB_Base<13+N> as inputs, under the conditions (N = TSB_Size) && TSB_Split and N ≥ TSB_Size.]

FIGURE F-7 Formation of TSB Pointers for 8-Kbyte and 64-Kbyte TTE

CODE EXAMPLE F-2 Pseudocode for DMMU Pointer Logic

int64 GenerateTSBPointer(
    int64 va,           // Missing virtual address
    PointerType type,   // 8K_POINTER or 64K_POINTER
    int64 TSBBase,      // TSB Register<63:13> << 13
    Boolean split,      // TSB Register<12>
    int TSBSize,        // TSB Register<2:0>
    SpaceType space)    // Primary, Secondary, or Nucleus
{
    int64 vaPortion;
    int64 TSBBaseMask;
    int64 splitMask;

    // Shift va towards lsb appropriately and
    // zero out the original va page offset
    vaPortion = (va >> ((type == 8K_POINTER) ? 9 : 12)) & 0xfffffffffffffff0;

    switch (space) {
    case Primary:
        TSBBase ^= TSB_EXT_pri;
        vaPortion ^= TSB_EXT_pri << 1 & 0x1ff0;
        vaPortion ^= TSB_EXT_pri & 0x1fe000;
        break;
    case Secondary:
        TSBBase ^= TSB_EXT_sec;
        vaPortion ^= TSB_EXT_sec << 1 & 0x1ff0;
        vaPortion ^= TSB_EXT_sec & 0x1fe000;
        break;
    case Nucleus:
        TSBBase ^= TSB_EXT_nuc;
        vaPortion ^= TSB_EXT_nuc << 1 & 0x1ff0;
        vaPortion ^= TSB_EXT_nuc & 0x1fe000;
        break;
    }

    // TSBBaseMask marks the bits from TSB Base Reg
    TSBBaseMask = 0xffffffffffffe000 << (split ? (TSBSize + 1) : TSBSize);

    if (split) {
        // There's only one bit in question for split
        splitMask = 1 << (13 + TSBSize);
        if (type == 8K_POINTER)
            // Make sure we're in the lower half
            vaPortion &= ~splitMask;
        else
            // Make sure we're in the upper half
            vaPortion |= splitMask;
    }
    return (TSBBase & TSBBaseMask) | (vaPortion & ~TSBBaseMask);
}
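For readers who want to experiment with the pointer arithmetic, the pseudocode of CODE EXAMPLE F-2 can be transcribed into compilable C roughly as follows. This is a sketch under stated assumptions: the caller passes the extension register already selected for the address space (tsb_ext) instead of a space selector, and the names tsb_pointer, pointer_type_t, POINTER_8K, and POINTER_64K are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { POINTER_8K, POINTER_64K } pointer_type_t;

/* tsb_ext is the TSB extension register for the selected space
 * (primary, secondary, or nucleus); pass 0 for no extension. */
uint64_t tsb_pointer(uint64_t va, pointer_type_t type, uint64_t tsb_base,
                     bool split, unsigned tsb_size, uint64_t tsb_ext)
{
    /* Shift va toward the lsb and clear the low 4 bits (16-byte TTE). */
    uint64_t va_portion = (va >> (type == POINTER_8K ? 9 : 12)) & ~UINT64_C(0xf);
    uint64_t base_mask;

    tsb_base ^= tsb_ext;
    va_portion ^= (tsb_ext << 1) & UINT64_C(0x1ff0);
    va_portion ^= tsb_ext & UINT64_C(0x1fe000);

    /* base_mask marks the bits taken from the TSB base register. */
    base_mask = UINT64_C(0xffffffffffffe000) << (split ? tsb_size + 1 : tsb_size);

    if (split) {
        uint64_t split_mask = UINT64_C(1) << (13 + tsb_size);
        if (type == POINTER_8K)
            va_portion &= ~split_mask;   /* lower half of the TSB */
        else
            va_portion |= split_mask;    /* upper half of the TSB */
    }
    return (tsb_base & base_mask) | (va_portion & ~base_mask);
}
```

With a base of 10000000₁₆, no extension, and TSB_Size 0, the second 8-Kbyte page (VA 2000₁₆) maps to entry 1 at offset 10₁₆; with the split bit set, the corresponding 64-Kbyte pointer lands in the upper half at offset 2010₁₆.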

S.APPENDIX G

Assembly Language Syntax

Please refer to Appendix G of Commonality.

S.APPENDIX H

Software Considerations

Please refer to Appendix H of Commonality.

S.APPENDIX I

Extending the SPARC V9 Architecture

Please refer to Appendix I of Commonality.

S.APPENDIX J

Programming with the Memory Models

Please refer to Appendix J of Commonality.

S.APPENDIX K

Changes from SPARC V8 to SPARC V9

Please refer to Appendix K of Commonality.

S.APPENDIX L

Address Space Identifiers

The UltraSPARC III processor supports both big- and little-endian byte orderings. The default data access byte ordering after a power-on reset is big-endian. Instruction fetches are always big-endian.

L.1 Address Space Identifiers and Address Spaces

A SPARC V9 processor provides an address space identifier (ASI) with every address sent to memory. The ASI does the following:
■ Distinguishes between different address spaces
■ Provides an attribute that is unique to an address space
■ Maps internal control and diagnostics registers within a processor

UltraSPARC III memory management hardware translates a 64-bit virtual address and an 8-bit ASI to a 43-bit physical address.

L.2 ASI Values

Internal ASIs (also called nontranslating ASIs) are in the ranges 30₁₆–6F₁₆, 72₁₆–77₁₆, and 7A₁₆–7F₁₆. These ASIs are not translated by the MMU. Instead, they pass through their virtual addresses as physical addresses. Accesses to internal ASIs with an invalid virtual address have undefined behavior: they may or may not cause a data_access_exception trap, and they may or may not alias onto a valid virtual address. Software should not rely on any specific behavior.
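As a small illustration, the three ranges above can be captured in a C predicate. The function name asi_is_internal is hypothetical and is used here only for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the 8-bit ASI falls in one of the internal
 * (nontranslating) ranges: 0x30-0x6F, 0x72-0x77, or 0x7A-0x7F. */
bool asi_is_internal(uint8_t asi)
{
    return (asi >= 0x30 && asi <= 0x6f) ||
           (asi >= 0x72 && asi <= 0x77) ||
           (asi >= 0x7a && asi <= 0x7f);
}
```

For example, ASI 45₁₆ (the DCU Control Register) is internal, while 70₁₆ and 80₁₆ are not.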

Note – MEMBAR #Sync is generally needed after stores and prefetches to internal ASIs. A FLUSH, DONE, or RETRY is needed after stores to internal ASIs that affect instruction accesses. See Instruction Prefetch to Side-Effect Locations on page 73 and UltraSPARC III Internal ASIs on page 74.

L.3 ASI Assignments

Please refer to Section L.3 of Commonality.

L.3.1 Supported SPARC V9 ASIs

Please refer to Section L.3.1 of Commonality.

L.3.2 UltraSPARC III ASI Assignments

TABLE L-1 defines all ASIs supported by UltraSPARC III that are not defined by either SPARC V9 or JPS1. These can be used only with LDXA, STXA, or LDDFA, STDFA instructions, unless otherwise noted. Other-length accesses cause a data_access_exception trap.

In TABLE L-1, the superscript numbers in the Type or Description column have the following meaning.

Number  Meaning
1       Read- or write-only access will cause a data_access_exception trap.
2       8-bit, 16-bit, 32-bit and 64-bit accesses are allowed.
3       LDDA, STDFA, or STXA only. Other types of accesses cause a data_access_exception trap.
4       LDDFA or STDFA only. Other types of accesses cause a data_access_exception trap.
5       Can be used with LDSTUBA, SWAPA, CAS(X)A.
6       Causes a data_access_exception trap if the page being accessed is privileged.
7       Not for customer use. Part ID#, laser programmed, unique for each chip.
8       See UltraSPARC III testability documents.

TABLE L-1 UltraSPARC III ASI Extensions (1 of 4)

Value ASI Name (Suggested Macro Syntax) Type VA Description Page

0016–1316 (JPS1) — 1E16–2316 (SPARC V9) —

TABLE L-1 UltraSPARC III ASI Extensions (Continued) (2 of 4)

Value ASI Name (Suggested Macro Syntax) Type VA Description Page

2416 (JPS1) — 2516–2B16 (SPARC V9) — 2C16 (JPS1) — 2D16–2F16 (SPARC V9) — 3016 (Reserved for ASI_PCACHE_STATUS_DATA) 3116 (Reserved for ASI_PCACHE_DATA) 3216 (Reserved for ASI_PCACHE_TAG) 3316 (Reserved for ASI_PCACHE_SNOOP_TAG) 3416–3716 (SPARC V9) 3816 ASI_WCACHE_VALID_BITS W-cache Valid Bits 340 diagnostic access

3916 ASI_WCACHE_DATA RW W-cache data RAM 342 diagnostic access

3A16 ASI_WCACHE_TAG RW W-cache tag RAM 343 diagnostic access

3B16 ASI_WCACHE_SNOOP_TAG RW W-cache snoop tag RAM 344 diagnostic access

3C16–3F16 (SPARC V9) Implementation dependent 4016 ASI_SRAM_FAST_INIT W 4116 (SPARC V9) — 1 4216 ASI_DCACHE_INVALIDATE W D-cache Invalidate 335 diagnostic access

4316 ASI_DCACHE_UTAG RW D-cache uTag diagnostic 334 access

4416 ASI_DCACHE_SNOOP_TAG RW D-cache snoop tag RAM 334 diagnostic access

4516 ASI_DCU_CONTROL_REG RW 016 D-cache Unit Control — Register

4616 ASI_DCACHE_DATA RW D-cache data RAM 332 diagnostic access

4716 ASI_DCACHE_TAG RW D-cache tag/valid RAM 333 diagnostic access

4816–4916 (JPS1) 4A16 ASI_FIREPLANE_CONFIG_REG RW 016 Fireplane Config Register 279 4A16 ASI_FIREPLANE_ADDRESS_REG RW 0816 Fireplane Address 283 Register

4B16 ASI_ESTATE_ERROR_EN_REG RW 016 Estate error enable 202 register

4C16–4D16 (JPS1)

TABLE L-1 UltraSPARC III ASI Extensions (Continued) (3 of 4)

Value ASI Name (Suggested Macro Syntax) Type VA Description Page

4E16 ASI_ECACHE_TAG (ASI_EC_TAG) RW <22:6> E-cache tag state RAM 339 data diagnostic access

4F16 (SPARC V9) — 5016 (JPS1) — — 5116–5416 (JPS1) — 5516 (JPS1) (ASI_ITLB_DATA_ACCESS_REG) RW 016–20FF816 I/O TLB data access — registers

5516 ASI_ITLB_CAM_ACCESS_REG RW 4000016– IMMU TLB CAM 137 60FF816 diagnostic register 5616–6516 (JPS1) — 66 16 ASI_ICACHE_INSTR (ASI_IC_INSTR) RW I-cache RAM diagnostic 327 access

6716 ASI_ICACHE_TAG (ASI_IC_TAG) RW I-cache tag/valid RAM 327 diagnostic access

6816 ASI_ICACHE_SNOOP_TAG (ASI_IC_STAG) RW I -cache snoop tag RAM 329 diagnostic access

6916–6E16 (SPARC V9) — 6F16 ASI_BRANCH_PREDICTION_ARRAY RW Branch Prediction RAM 331 diagnostic access 4,6 7016–7116 (SPARC V9)RW — 7216 ASI_MCU_CTRL_REG 7316 (SPARC V9) — 7416 ASI_ECACHE_DATA RW E-cache data staging 337 register

7516 ASI_ECACHE_CONTROL (ASI_EC_CTRL) RW 016 E-cache control register 336 1 7616 ASI_ECACHE_W (ASI_EC_W) W E-cache data RAM 337 diagnostic write access

7716–7916 (JPS1) — 7A–7E16 (SPARC V9) — 7E 16 ASI_ECACHE_R (ASI_EC_R) R016 E-cache data RAM 337 diagnostic read access

7F16 (JPS1) — 8016–BF16 (SPARC V9) — C016–CB16 (JPS1) — CC16–CD16 (JPS1) — CE16–CF16 (SPARC V9) — D016–D316 (JPS1) — D416–D716 (SPARC V9) — D816–DB16 (JPS1) — DC16–DF16 (SPARC V9) —

TABLE L-1 UltraSPARC III ASI Extensions (Continued) (4 of 4)

Value ASI Name (Suggested Macro Syntax) Type VA Description Page

E016–E116 (JPS1) — E216–EF16 (SPARC V9) — F016–F116 (JPS1) — F216–F716 (SPARC V9) — F816–F916 (JPS1) — FA16–FF16 (SPARC V9) —

L.3.3 Special Memory Access ASIs

Block Load and Store ASIs

ASIs E0₁₆ and E1₁₆ exist only for use with STDFA instructions as Block Store with Commit operations (see Block Load and Store Instructions (VIS I) on page 79). Neither ASI E0₁₆ nor E1₁₆ should be used with LDDFA; however, if either is used, the LDDFA behaves as follows:

1. If a destination register number rd is specified that is not a multiple of 8 ("misaligned" rd), an UltraSPARC III processor generates an illegal_instruction exception (impl. dep. #255).

2. If a misaligned (not 64-byte aligned) memory address is specified, an UltraSPARC III processor generates a mem_address_not_aligned exception (impl. dep. #256).

3. If both rd and the memory address are correctly aligned, the processor generates a data_access_exception.
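The three cases can be modeled as a simple decision function. This sketch assumes that, when both rd and the address are misaligned, the cases are checked in the order listed (rd first); the text does not state that priority explicitly. The enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    EXC_ILLEGAL_INSTRUCTION,
    EXC_MEM_ADDRESS_NOT_ALIGNED,
    EXC_DATA_ACCESS_EXCEPTION
} lddfa_exc_t;

/* Exception raised by LDDFA with ASI 0xE0 or 0xE1, per the list above.
 * Assumption: the rd check takes priority over the address check. */
lddfa_exc_t lddfa_block_commit_exception(unsigned rd, uint64_t addr)
{
    if (rd % 8 != 0)
        return EXC_ILLEGAL_INSTRUCTION;      /* case 1: misaligned rd */
    if (addr % 64 != 0)
        return EXC_MEM_ADDRESS_NOT_ALIGNED;  /* case 2: not 64-byte aligned */
    return EXC_DATA_ACCESS_EXCEPTION;        /* case 3: both aligned */
}
```

Note that every path raises an exception: an LDDFA with these ASIs never completes as a load.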

Partial Store ASIs

ASIs C0₁₆–C5₁₆ and C8₁₆–CD₁₆ exist for use with the STDFA instruction for Partial Store operations (see Partial Store (VIS I) on page 91). None of these ASIs should be used with LDDFA; however, if one of them is used, the LDDFA behaves as follows:

1. If a misaligned (not 8-byte aligned) memory address is specified, an UltraSPARC III processor generates a data_access_exception, with fault type 08₁₆ recorded in DSFSR.FTYPE (impl. dep. #257).

2. If the memory address is correctly aligned, the processor generates a data_access_exception.

S.APPENDIX M

Caches and Cache Coherency

This chapter describes the use of caches and contains these sections:
■ Cache Organization on page 159
■ Cache Flushing on page 162
■ Coherence Tables on page 164

M.1 Cache Organization

In this section we describe two cache organizations: virtual indexed, physical tagged caches and physical indexed, physical tagged.

M.1.1 Virtual Indexed, Physical Tagged Caches (VIPT)

The data cache (D-cache) is a virtual indexed, physical tagged (VIPT) cache. Virtual addresses index into the cache tag and data arrays while the D-MMU (that is, the DTLBs) is accessed in parallel. The resulting tag is compared against the translated physical address to determine a cache hit.

A side effect inherent in a virtual-indexed cache is address aliasing. See Address Aliasing Flushing on page 162.

The D-cache is a write-through, nonallocating on write miss, 64-Kbyte, pseudo-4-way associative cache with a 32-byte block per line. Data accesses bypass the D-cache when the D-cache enable (DC) bit in the DCU Control Register is clear. If the DM bit in the Data Cache Unit Control Register is clear, then cacheability is determined by the CP and CV bits. If the access is mapped by the DMMU as non-virtual-cacheable, then load misses will not allocate in the D-cache. For more information on the DM, CP, or CV bits, see Data Cache Unit Control Register (DCUCR) on page 33.

Note – A nonvirtual cacheable access may access data in the D-cache from an earlier cacheable access to the same physical block, unless the D-cache is disabled. Software must flush the D-cache when changing a physical page from cacheable to noncacheable (see Cache Flushing on page 162).

M.1.2 Physical Indexed, Physical Tagged Caches (PIPT)

Caches in the PIPT organization category are the instruction cache, external cache, and write cache.

Instruction Cache (I-Cache)

The I-cache is a 32-Kbyte, pseudo-4-way set-associative, write-invalidate cache with 32-byte blocks. Instruction fetches bypass the I-cache in the following cases:
■ The I-cache enable (IC) or IMMU enable (IM) bits in the Data Cache Unit Control Register are clear.
■ The CP bit in the Data Cache Unit Control Register is set.
■ The processor is in RED_state.
■ The fetch is mapped by the IMMU as nonphysical cacheable.

The I-cache snoops stores from other processors or DMA transfers, as well as stores in the same processor and block commit store (see Block Load and Store Instructions (VIS I) on page 79).

The FLUSH instruction is not required to maintain coherency. Stores and block store commits invalidate the I-cache but do not flush instructions that have already been prefetched into the pipeline. A FLUSH, DONE, or RETRY instruction can be used to flush the pipeline.

If a program changes the I-cache mode from I-cache-OFF to I-cache-ON, then the next instruction fetch always causes an I-cache miss, even if it would otherwise hit. This rule applies even when the DONE instruction turns on the I-cache by changing the processor status from RED_state to normal mode. For example,

(in RED_state)
    setx  0x37e0000000007, %g1, %g2
    stxa  %g2, [%g0]0x45    // Turn on I-cache when processor
                            // returns to normal mode.
    done                    // Escape from RED_state.

(back to normal mode)
    nop                     // 1st instruction; this always causes an I-cache miss.

External and Write Caches (E-Cache, W-Cache)

The level-2 caches—the E-cache and the W-cache—are physical indexed, physical tagged (PIPT). These caches have no references to virtual address and context information. The operating system needs no knowledge of such caches after initialization, except for stable storage management and error handling.

Instruction fetches bypass the E-cache in the following cases:
■ The I-MMU is disabled and the CP bit in the Data Cache Unit Control Register is not set.
■ The processor is in RED_state.
■ The access is mapped by the I-MMU as nonphysical cacheable.

Data accesses bypass the E-cache if the D-MMU enable bit in the DCU Control Register is clear or if the access is mapped by the D-MMU as nonphysical cacheable (unless ASI_PHYS_USE_EC is used).

The system must provide a noncacheable, scratch memory for booting code use until the MMUs are enabled.

The E-cache is a unified, writeback, allocating, direct-mapped cache. The E-cache does not include the contents of the instruction cache and data cache. Its size ranges from 1 Mbyte to 8 Mbytes. Its line size varies from 64 bytes to 512 bytes, with 64-byte subblocks. See TABLE M-1.

TABLE M-1 External Cache Organizations

External Cache Size   Line Size   64-Byte Sublines per Line
1 Mbyte               64 bytes    1
4 Mbytes              256 bytes   4
8 Mbytes              512 bytes   8

Block loads and block stores, which load or store a 64-byte block of data from memory to the Floating Point Register file, do not allocate into the E-cache, to avoid pollution. Prefetch Read Once instructions, which load a 64-byte block of data, do not allocate into the E-cache.

The W-cache is a 2-Kbyte, 4-way associative cache, with 64 bytes per line and 32-byte subblocks. The W-cache is included in the E-cache, and flushing the E-cache ensures that the W-cache has also been flushed.

M.2 Cache Flushing

Data in the write-invalidate or write-through caches can be flushed by invalidation of the entry in the cache. Modified data in the E-cache and W-cache must be written back to memory when flushed.

Cache flushing is required in the following cases:
■ A data cache flush is needed when a physical page is changed from (virtually) cacheable to (virtually) noncacheable or when an illegal address aliasing is created (see Address Aliasing Flushing, below). Flushing is done with a displacement flush (see Displacement Flushing on page 163) or by use of ASI accesses. See Data Cache Diagnostic Accesses on page 331.
■ An external cache flush is needed for stable storage. Flushing is done with either a displacement flush or a store with ASI_BLK_COMMIT. Flushing the external cache will flush the corresponding blocks from write cache. See Committing Block Store Flushing on page 163.
■ External, data, and instruction cache flushes may be required when an ECC error occurs on a read from the Sun Fireplane Interconnect or the external cache. Asynchronous Fault Status Register on page 204 describes the case when a flush on an error is required. When an ECC error occurs, invalid data may be written into one of the caches and the cache lines must be flushed to prevent further corruption of data.

M.2.1 Address Aliasing Flushing

A side effect inherent in a virtual-indexed cache is illegal address aliasing. Aliasing occurs when multiple virtual addresses map to the same physical address.

Caution – Since the D-cache is indexed with the virtual address bits and is larger than the minimum page size, it is possible for the different aliased virtual addresses to end up in different cache blocks. Such aliases are illegal because updates to one cache block will not be reflected in aliased cache blocks.

Normally, software avoids illegal aliasing by forcing aliases to have the same address bits (virtual color) up to an alias boundary. The minimum alias boundary is 16 Kbytes. This size may increase in future designs.
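As an illustration of the virtual color rule, the following C sketch checks whether two virtual addresses that map the same physical page are legal aliases. It assumes the 16-Kbyte minimum alias boundary stated above and an 8-Kbyte minimum page size; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALIAS_BOUNDARY UINT64_C(0x4000)  /* 16 Kbytes (minimum alias boundary) */
#define MIN_PAGE_SIZE  UINT64_C(0x2000)  /* 8 Kbytes (assumed minimum page size) */

/* Virtual color: the page-number bits that lie below the alias boundary. */
unsigned virtual_color(uint64_t va)
{
    return (unsigned)((va & (ALIAS_BOUNDARY - 1)) / MIN_PAGE_SIZE);
}

/* Two mappings of the same physical page are legal aliases only if
 * they have the same virtual color. */
bool alias_is_legal(uint64_t va1, uint64_t va2)
{
    return virtual_color(va1) == virtual_color(va2);
}
```

With these constants there are only two colors, so virtual addresses 10000₁₆ and 24000₁₆ are legal aliases, while 10000₁₆ and 12000₁₆ are not.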

When the alias boundary is violated, software must flush the D-cache if the page was virtual cacheable. In this case, only one mapping of the physical page can be allowed in the DMMU at a time.

Alternatively, software can turn off virtual caching of illegally aliased pages. Doing so allows multiple mappings of the alias to be in the DMMU and avoids flushing of the D-cache each time a different mapping is referenced.

Note – A change in virtual color when allocating a free page does not require a D-cache flush because the D-cache is write-through.

M.2.2 Committing Block Store Flushing

Stable storage must be implemented by software cache flush. Examples of stable storage are battery-backed memory and a transaction log. Data that are present and modified in the E-cache or the W-cache must be written back to the stable storage.

Two ASIs (ASI_BLK_COMMIT_PRIMARY and ASI_BLK_COMMIT_SECONDARY) perform these writebacks efficiently when software can ensure exclusive write access to the block being flushed. These ASIs write back to memory the data from the Floating Point Registers and invalidate the entry in the cache. The data in the Floating Point Registers must first be loaded by a block load instruction. A MEMBAR #Sync instruction can be used to ensure that the flush is complete. See also Block Load and Store Instructions (VIS I) on page 79.

M.2.3 Displacement Flushing

Cache flushing can also be accomplished by a displacement flush. This procedure reads a range of addresses that map to the corresponding cache line being flushed, forcing out modified entries in the local cache. Take care to ensure that the range of read-only addresses is mapped in the MMU before starting a displacement flush; otherwise, the TLB miss handler may put new data into the caches.
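The read loop of a displacement flush can be sketched in C as below. This is only an outline under several assumptions: the region is cacheable, already mapped in the MMU, and at least as large as the cache being displaced, and one read per line is enough to displace a direct-mapped cache. The function name and parameters are hypothetical; real flush code would normally be written in assembly with the addresses chosen by the operating system.

```c
#include <stddef.h>
#include <stdint.h>

/* Touch one byte in every cache line of a region as large as the cache,
 * so that each read displaces whatever line previously occupied that index.
 * Returns the number of lines touched (0 if line_bytes is 0). */
size_t displacement_flush(const volatile uint8_t *region,
                          size_t cache_bytes, size_t line_bytes)
{
    size_t lines = 0;

    if (line_bytes == 0)
        return 0;
    for (size_t off = 0; off < cache_bytes; off += line_bytes) {
        (void)region[off];   /* volatile read: forces the load to be issued */
        lines++;
    }
    return lines;
}
```

The volatile qualifier keeps the compiler from optimizing away reads whose values are never used.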

See Cache Flushing on page 11 for details on displacement flushing.

Note – Diagnostic ASI accesses to the E-cache can be used to invalidate a line, but they are not an alternative to displacement flushing. Modified data in the E-cache will not be written back to memory when these ASI accesses are used. See Data Cache Diagnostic Accesses on page 331.

M.3 Coherence Tables

The set of tables in this section describes the cache coherence protocol that governs the behavior of the processor on the Sun™ Fireplane Interconnect.

M.3.1 Processor State Transition and the Generated Transaction

Tables in this section summarize the following:
■ Hit/Miss, State Change, and Transaction Generated for Processor Action
■ Combined Tag/MTag States
■ Derivation of DTags, CTags, and MTags from Combined Tags

TABLE M-1 Hit/Miss, State Change, and Transaction Generated for Processor Action

                                   Processor Action
Combined                                                                                              Block Store       Dirty
State   Mode                Load              Store/Swap         Block Load        Block Store       w/ Commit         Victim
I       ~SSM                Miss: RTS         Miss: RTO          Miss: RS          Miss: WS          Miss: WS          None
        SSM & LPA           Miss: RTS         Miss: RTO          Miss: RS          Miss: R_WS        Miss: R_WS        None
        SSM & LPA & retry   MTag miss: R_RTS  MTag miss: R_RTO   MTag miss: R_RS   Invalid           Invalid           None
        SSM & ~LPA          Miss: R_RTS       Miss: R_RTO        Miss: R_RS        Miss: R_WS        Miss: R_WS        None
E       ~SSM                Hit               Hit: E->M          Hit               Hit: E->M         Miss: WS          None
        SSM & LPA           Hit               Hit: E->M          Hit               Hit: E->M         Miss: R_WS        None
        SSM & LPA & retry   Invalid
        SSM & ~LPA          Hit               Hit: E->M          Hit               Hit: E->M         Miss: R_WS        None
S       ~SSM                Hit               Miss: RTO          Hit               Miss: WS          Miss: WS          None
        SSM & LPA           Hit               MTag miss: RTO     Hit               Miss: R_WS        Miss: R_WS        None
        SSM & LPA & retry   Invalid           MTag miss: R_RTO   Invalid           Invalid           Invalid           None
        SSM & ~LPA          Hit               MTag miss: R_RTO   Hit               Miss: R_WS        Miss: R_WS        None
O       ~SSM                Hit               Miss: RTO          Hit               Miss: WS          Miss: WS          WB
        SSM & LPA           Hit               MTag miss: RTO     Hit               Miss: R_WS        Miss: R_WS        WB
        SSM & LPA & retry   Invalid           MTag miss: R_RTO   Invalid           Invalid           Invalid           None
        SSM & ~LPA          Hit               MTag miss: R_RTO   Hit               Miss: R_WS        Miss: R_WS        R_WB
Os      ~SSM                Invalid (Os is legal only in SSM mode)
        SSM & LPA           Hit               MTag miss: R_RTO   Hit               Miss: R_WS        Miss: R_WS        WB
        SSM & LPA & retry   Invalid           MTag miss: R_RTO   Invalid           Invalid           Invalid           None
        SSM & ~LPA          Hit               MTag miss: R_RTO   Hit               Miss: R_WS        Miss: R_WS        R_WB
M       ~SSM                Hit               Hit                Hit               Hit               Miss: WS          WB
        SSM & LPA           Hit               Hit                Hit               Hit               Miss: R_WS        WB
        SSM & LPA & retry   Invalid
        SSM & ~LPA          Hit               Hit                Hit               Hit               Miss: R_WS        R_WB

TABLE M-2 Combined Tag/MTag States

              MTag State
CTag State    gI    gS    gM
cM            I     Os    M
cO            I     Os    O
cE            I     S     E
cS            I     S     S
cI            I     I     I

TABLE M-3 Deriving DTags, CTags, and MTags from Combined Tags

Combined Tags (CCTags)   DTag   CTag   MTag
I                        dI     cI     gI
E                        dS     cE     gM
S                        dS     cS     gS
O                        dO     cO     gM
Os                       dO     cO     gS
M                        dO     cM     gM
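The combination rule of TABLE M-2 can be expressed as a lookup table in C. This is an illustrative sketch only; the enum and function names are hypothetical and are not part of the hardware description.

```c
typedef enum { C_I, C_S, C_E, C_O, C_M } ctag_t;                  /* cI, cS, cE, cO, cM */
typedef enum { G_I, G_S, G_M } mtag_t;                            /* gI, gS, gM */
typedef enum { CC_I, CC_S, CC_E, CC_O, CC_OS, CC_M } cctag_t;     /* I, S, E, O, Os, M */

/* Combined state for a given CTag and MTag, transcribed from TABLE M-2. */
cctag_t combine_tags(ctag_t c, mtag_t g)
{
    static const cctag_t table[5][3] = {
        /*          gI     gS      gM   */
        [C_I] = { CC_I,  CC_I,   CC_I },
        [C_S] = { CC_I,  CC_S,   CC_S },
        [C_E] = { CC_I,  CC_S,   CC_E },
        [C_O] = { CC_I,  CC_OS,  CC_O },
        [C_M] = { CC_I,  CC_OS,  CC_M },
    };
    return table[c][g];
}
```

An MTag of gI always yields the combined state I, and a gS MTag downgrades cE to S and cO or cM to Os, which agrees with the CTag and MTag columns of TABLE M-3.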

M.3.2 Snoop Output and Input

TABLE M-4 summarizes snoop output and DTag transition; TABLE M-5 summarizes snoop input and CIQ operation queueing.

TABLE M-4 Snoop Output and DTag Transition

                                  DTag        Shared  Owned   Error   Next DTag
Snooped Request                   State       Output  Output  Output  State      Action for Snoop Pipeline
Own RTS (for data)                dI          0       0       0       dT         Own RTS wait data
                                  dS          1       0       1       dS         Error
                                  dO          1       0       1       dO         Error
                                  dT          1       0       1       dT         Error
Own RTS (for instructions)        dI          0       0       0       dS         Own RTS inst wait data
                                  dS          1       0       1       dS         Error
                                  dO          1       0       1       dO         Error
                                  dT          1       0       1       dT         Error
Foreign RTS                       dI          0       0               dI         None
                                  dS          1       0               dS         None
                                  dO          1       1               dO         Foreign RTS copyback
                                  dT          1       0               dS         None
Own RTO                           dI          0       0               dO         Own RTO wait data
                                  dS & ~SSM   1       1               dO         Own RTO no data
                                  dS & SSM    0       0               dO         Own RTO wait data
                                  dO          1       1               dO         Own RTO no data
                                  dT          1       1       1       dO         Error
Foreign RTO                       dI          0       0               dI         None
                                  dS          0       0               dI         Foreign RTO Invalidate
                                  dO          0       1               dI         Foreign RTO copyback-Invalidate
                                  dT          0       0               dI         Foreign RTO Invalidate
Own RS                            dI          0       0               dI         Own RS wait data
                                  dS          0       0       1       dS         Error
                                  dO          0       0       1       dO         Error
                                  dT          0       0       1       dT         Error
Foreign RS                        dI          0       0               dI         None
                                  dS          0       0               dS         None
                                  dO          0       1               dO         Foreign RS copyback-discard
                                  dT          0       0               dT         None
Own WB                            dI          0       1               dI         Own WB (cancel)
                                  dS          0       1               dI         Own WB (cancel)
                                  dO          0       0               dI         Own WB
                                  dT          0       1       1       dI         Error
Foreign WB                        dI          0       0       0       dI         None
                                  dS          0       0       0       dS         None
                                  dO          0       0       0       dO         None
                                  dT          0       0       0       dT         None
Foreign RTSM                      dI          0       0               dI         None
                                  dS          0       0               dS         Foreign RTSM
                                  dO          0       1               dS         fRTSM copyback (if cache line is not in the write cache)
                                  dO          0       1               dI         fRTSR copyback (if cache line is in the write cache)
                                  dT          0       0               dS         Foreign RTSM
Own RTSR (issued by SSM device)   dI          0       0               dO         Own RTSR wait data
                                  dS          0       0       0       dO         Own RTSR wait data
                                  dO          0       0       1       dO         Own RTSR wait data, Error
                                  dT          0       0       1       dT         Own RTSR wait data, Error
Foreign RTSR                      dI          0       0               dI         None
                                  dS          1       0               dS         None
                                  dO          1       1               dS         Foreign RTSR
                                  dT          1       0               dS         None
Own RTOR (issued by SSM device)   dI          0       0               dO         Own RTOR wait data
                                  dS          0       0               dO         Own RTOR wait data
                                  dO          0       0               dO         Own RTOR wait data
                                  dT          0       0       1       dO         Error
Foreign RTOR                      dI          0       0               dI         None
                                  dS          0       0               dI         Foreign RTOR invalidate
                                  dO          0       0               dI         Foreign RTOR invalidate
                                  dT          0       0               dI         Foreign RTOR invalidate
Own RSR                           dI          0       0               dI         Own RSR wait data
                                  dS          0       0       1       dS         Error
                                  dO          0       0       1       dO         Error
                                  dT          0       0       1       dT         Error
Foreign RSR                       dI          0       0               dI         None
                                  dS          0       0               dS         None
                                  dO          0       0               dO         None
                                  dT          0       0               dT         None
Own WS                            dI          0       0               dI         Own WS
                                  dS          0       0               dI         Own invalidate WS
                                  dO          0       0               dI         Own invalidate WS
                                  dT          0       0               dI         Own invalidate WS
Foreign WS                        dI          0       0               dI         None
                                  dS          0       0               dI         Invalidate
                                  dO          0       0               dI         Invalidate
                                  dT          0       0               dI         Invalidate

TABLE M-5 Snoop Input and CIQ Operation Queueing

                                   Shared  Owned   Error
Action from Snoop Pipeline         Input   Input   (out)   Operation Queued in CIQ
Own RTS wait data                  1       X               RTS Shared
                                   0       0               RTS ~Shared
                                   0       1       1       RTS Shared, Error
Own RTS inst wait data             X       X               RTS Shared
Foreign RTS copyback               X       1               Copyback
                                   X       0       1       Copyback, Error
Own RTO no data                    1       X               RTO nodata
                                   0       X       1       RTO nodata, error
Own RTO wait data                  1       X       1       RTO data, error
                                   0       X               RTO data
Foreign RTO Invalidate             X       X               Invalidate
Foreign RTO copyback-Invalidate    X       0       1       Copyback-Invalidate, Error
                                   0       1               Copyback-Invalidate
                                   1       1               Invalidate
Own RS wait data                   X       X               RS data
Foreign RS copyback-discard        X       0       1       Error
                                   X       1               Copyback-discard
Foreign RTSM copyback              X       0       1       RTSM copyback, Error
                                   X       1               RTSM copyback
Own RTSR wait data                 1       X               RTSR shared
                                   0       X               RTSR ~shared
Own RTOR wait data                 X       X               RTOR data
Foreign RTOR Invalidate            X       X               Invalidate
Own RSR                            X       X               RS data
Own WS                             X       X               Own WS
Own WB                             X       X               Own WB
Own Invalidate WS                  X       X               Own Invalidate WS
Invalidate                         X       X               Invalidate

M.3.3 Transaction Handling

Tables in this section summarize handling of the following:
■ Transactions at the head of CIQ
■ No snoop transactions
■ Transactions internal to UltraSPARC III

TABLE M-6 Transaction Handling at Head of CIQ (1 of 3)

Operation at Head of CIQ CCTag MTag (in/out) Error Retry Next CCTag RTS Shared I gM (in) S gS (in) S gI (in) 1 I M, O, E, S, Os X 1 RTS ~Shared I gM (in) E gS (in) S gI (in) 1 I M, O, E, S, Os X (in) 1 RTSR Shared I gM (in) O gS (in) Os gI (in) 1 1 I M, O, E S, Os X 1 RTSR ~Shared I gM (in) M gS (in) Os

TABLE M-6 Transaction Handling at Head of CIQ (2 of 3)

Operation at Head of CIQ CCTag MTag (in/out) Error Retry Next CCTag gI (in) 1 1 I M, O, E, S, Os X (in) 1 RTO nodata I, M, E, Os None 1 O, S None M RTO data & ~SSM M, E, O, Os, S X (in) 1 I gM (in) M gS (in) 1 Os gI (in) 1 I RTO data & SSM M, E, Os, O X(in) 1 I, S gM(in) M gS(in) 1 Os gi(in) 1 I RTOR data M,E X (in) 1 O gM (in) 1 O S, Os, I gM (in) M S, O, Os, I gS (in) 1 1 Os gI (in) 1 1 I Foreign RTSR I None I M, O None 1 No change E, Os, S None S RTSM I X (in) 1 M, O, Os None 1 S E, S None S RTSM copyback M, O gM (out) S (if cache line is not in the W-cache) M, O gM (out) I (if cache line is in the W-cache) Os gS (out) S E, S, I 1 copyback M, O gM (out) O Os gS (out) Os I gI (out) I

TABLE M-6 Transaction Handling at Head of CIQ (3 of 3)

Operation at Head of CIQ CCTag MTag (in/out) Error Retry Next CCTag E, S gM (out) 1 S Invalidate X I copyback-Invalidate M, O gM (out) I Os gS (out) I I gI (out) I E, S gM (out) 1 I copyback-discard M, O gM (out) No change Os gS (out) Os I gI (out) I E, S gM (out) 1 No change RS data X gM(in) No change X gS(in) No change X gI(in) 1 No change Own WS X gM(out) I Own WB M, O gM (out) I Os gS (out) I I gI (out) I S, E gM (out) 1 I Own Invalidate WS X gM(out) I

TABLE M-7 No Snoop Transaction Handling

Processor Action Combined Block Store Dirty State Mode Load Store/ Swap Block Load Block Store Commit Victim No snoop Miss: Miss: Miss: Miss: Miss: None I RTS_ns RTO_ns RS_ns WS WS S No snoop Error No snoop Hit Hit Hit Hit Miss: None E E->M E->M WS No snoop Hit Hit Hit Hit Miss: WB M WS O No snoop Error Os No snoop Error

RTS_ns, RTO_ns, and RS_ns are transactions internal to UltraSPARC III and are not visible on the Sun Fireplane Interconnect.

TABLE M-8 Internal Transaction Handling

Operation at Head of CIQ CCTag MTag (in/out) Error Next CCTag RTS_ns I gM (in) E gS (in) 1 S gI (in) 1 I S, E, M, O, Os 1 RTO_ns I gM (in) M gS (in) 1 O gI (in) 1 I S, E, M, O, Os 1 RS_ns I gM (in) I gS(in) 1 I gI(in) 1 I S, E, M, O, Os 1 WS X gM(out) I gS, gI (out) 1 I

S.APPENDIX N

Interrupt Handling

Please refer to Appendix N in Commonality.

N.4 Interrupt ASR Registers

Please refer to Section N.4 of Commonality for details of these registers.

N.4.2 Interrupt Vector Dispatch Register

UltraSPARC III interprets all 10 bits of VA<38:29> when the Interrupt Vector Dispatch Register is written (impl. dep. #246).

N.4.3 Interrupt Vector Dispatch Status Register

In UltraSPARC III, 32 BUSY/NACK pairs are implemented in the Interrupt Vector Dispatch Status Register (impl. dep. #243).

N.4.5 Interrupt Vector Receive Register

UltraSPARC III sets all 10 physical module ID (MID) bits in the SID_U and SID_L fields of the Interrupt Vector Receive Register. UltraSPARC III obtains SID_U from VA<38:34> of the interrupt source and SID_L from VA<33:29> of the interrupt source (impl. dep. #247).
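The field extraction described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and recombining the two fields into a single 10-bit value with SID_U as the upper half is an assumption rather than something this section states.

```python
def split_interrupt_source(va):
    """Return (SID_U, SID_L) from the bit ranges quoted above."""
    sid_u = (va >> 34) & 0x1F  # VA<38:34>, five bits
    sid_l = (va >> 29) & 0x1F  # VA<33:29>, five bits
    return sid_u, sid_l


def module_id(va):
    """Assumed recombination: SID_U forms the upper five bits of the MID."""
    sid_u, sid_l = split_interrupt_source(va)
    return (sid_u << 5) | sid_l


# Hypothetical source address with SID_U = 0b10101 and SID_L = 0b00011.
example_va = (0b10101 << 34) | (0b00011 << 29)
print(split_interrupt_source(example_va), module_id(example_va))
```

Under that assumption the recombined value is simply VA<38:29>, which matches the 10-bit range used by the Interrupt Vector Dispatch Register.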

S.APPENDIX O

Reset and RED_state

This appendix examines RED_state (Reset, Error, and Debug state) in the following sections:
■ RED_state Characteristics on page 177
■ Resets on page 178
■ RED_state Trap Vector on page 179
■ Machine States on page 180

O.1 RED_state Characteristics

A reset or trap that sets PSTATE.RED (including a trap in RED_state) will clear the DCU Control Register, including enable bits for I-cache, D-cache, I-MMU, D-MMU, and virtual and physical watchpoints. The characteristics of RED_state include the following:
■ The default access in RED_state is noncacheable, so there must be noncacheable scratch memory somewhere in the system.
■ The D-cache, watchpoints, and DMMU can be enabled by software in RED_state, but any trap will disable them again.
■ The IMMU and consequently the I-cache are always disabled in RED_state. Disabling overrides the enable bits in the DCU control register.
■ When PSTATE.RED is explicitly set by a software write, there are no side effects other than that the IMMU is disabled. Software must create the appropriate state itself.
■ A trap when TL = MAXTL − 1 immediately brings the processor into RED_state. In addition, a trap when TL = MAXTL immediately brings the processor into error_state. Upon error_state entry, the processor automatically recovers through watchdog reset (WDR) into RED_state.
■ A trap to error_state immediately triggers watchdog reset (WDR).
■ A Signal Monitor (SIGM) instruction generates an SIR trap on the local processor.
■ Trap to software-initiated reset causes an SIR trap on the processor and brings the processor into RED_state.
■ The External Reset pin generates an XIR trap, which is used for system debug or Fireplane Interconnect transactions.
■ The caches continue to snoop and maintain coherence if DVMA or other processors are still issuing cacheable accesses.
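The trap-level rules in the list above can be summarized as a small decision function. The sketch below is illustrative only: the function name is hypothetical, MAXTL = 5 is taken from the VER.MAXTL entry of TABLE O-1, and "execute_state" is the SPARC V9 name for normal operation, which this appendix does not itself use.

```python
MAXTL = 5  # VER.MAXTL as listed in TABLE O-1


def state_after_trap(tl, in_red_state=False):
    """Return the processor state entered when a trap is taken at trap level tl."""
    if tl == MAXTL:
        # A trap at TL = MAXTL enters error_state; WDR then leads to RED_state.
        return "error_state"
    if tl == MAXTL - 1 or in_red_state:
        # A trap at TL = MAXTL - 1, or any trap taken while in RED_state.
        return "RED_state"
    return "execute_state"


for level in range(MAXTL + 1):
    print(level, state_after_trap(level))
```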

Note – Exiting RED_state by writing 0 to PSTATE.RED in the delay slot of a JMPL is not recommended. A noncacheable instruction prefetch can be made to the JMPL target, which may be in a cacheable memory area. This condition could result in a bus error on some systems and cause an instruction_access_error trap. You can mask the trap by setting the NCEEN bit in the ESTATE_ERR_EN register to 0, but this approach will mask all noncorrectable error checking. Exiting RED_state with DONE or RETRY avoids the problem.

O.2 Resets

Reset priorities from highest to lowest are power-on resets (POR, hard or soft), externally initiated reset (XIR), watchdog reset (WDR), and software-initiated reset (SIR).

O.2.1 Hard Power-on Reset (Hard POR, Power-on Reset, POK Reset)

A hard power-on reset (Hard POR) occurs when the POK pin is activated and stays asserted until the processor is within its specified operating range. When the POK pin is active, all other resets and traps are ignored. Power-on reset has a trap type of 1 at physical address offset 20₁₆. Any pending external transactions are canceled. After power-on reset, software must initialize values specified as “unknown” in TABLE O-1 on page 180. In particular, the valid and microtag bits in the I-cache (see Instruction Cache Diagnostic Accesses on page 326), the valid and microtag bits in the D-cache (see Data Cache Diagnostic Accesses on page 331), and all E-cache tags and data (see External Cache Diagnostics Accesses on page 336) must be cleared before the caches are enabled. The I-MMU and D-MMU TLBs must also be initialized as described in Reset, Disable, and RED_state Behavior on page 133.

The MCU refresh control register as well as the Fireplane configuration register must be initialized after a power-on reset.

In SSM (Scalable Shared Memory) systems, the MTags contained in memory must be initialized before any Fireplane transactions are generated.

Caution – Executing a DONE or RETRY instruction when TSTATE is uninitialized after a POR can damage the chip. The POR boot code should initialize TSTATE<3:0>, using wrpr writes, before any DONE or RETRY instructions are executed.

O.2.2 System Reset (Soft POR, Fireplane Reset, POR)

A system reset occurs when the Reset pin is activated. When the Reset pin is active, all other resets and traps are ignored. System reset has a trap type of 1 at physical address offset 20₁₆. Any pending external transactions are canceled. Memory refresh continues uninterrupted during a system reset.

O.2.3 Externally Initiated Reset (XIR)

Please refer to Section O.2.1 of Commonality.

error_state and Watchdog Reset (WDR)

Please refer to Section O.2.2 of Commonality.

O.2.4 Software-Initiated Reset (SIR)

Please refer to Section O.2.3 of Commonality.

O.3 RED_state Trap Vector

When a SPARC V9 processor processes a reset or trap that enters RED_state, it takes a trap at an offset relative to the RED_state trap vector base address (RSTVaddr); the base address is at virtual address FFFF FFFF F000 0000₁₆, which passes through to physical address 7FF F000 0000₁₆ (impl. dep. #114).
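As a worked example, the vector addresses can be computed from the base address and the per-reset offsets listed in the PC and nPC rows of TABLE O-1. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the key "RED_state" simply mirrors the table's column label.

```python
RSTV_VADDR = 0xFFFF_FFFF_F000_0000  # RED_state trap vector base (virtual address)

# Offsets as given in the PC row of TABLE O-1.
VECTOR_OFFSET = {
    "POR": 0x20,        # hard POR and system reset share this offset
    "WDR": 0x40,
    "XIR": 0x60,
    "SIR": 0x80,
    "RED_state": 0xA0,  # column labelled RED_state in the table
}


def red_state_vector(kind):
    """Return (PC, nPC) for the given reset or trap kind."""
    pc = RSTV_VADDR | VECTOR_OFFSET[kind]
    return pc, pc + 4  # nPC is the next instruction word, as in the nPC row


for kind in VECTOR_OFFSET:
    pc, npc = red_state_vector(kind)
    print(f"{kind:9s} PC={pc:016X} nPC={npc:016X}")
```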

O.4 Machine States

TABLE O-1 shows the machine states created as a result of any reset or when RED_state is entered. RSTVaddr is often abbreviated as RSTV in the table.

TABLE O-1 Machine State After Reset and When Entering RED_state (1 of 4)

Name Fields Hard_POR System Reset WDR XIR SIR RED_state‡ Integer registers Unknown Unchanged Unchanged Floating-point Unknown Unchanged Unchanged registers External Cache 0 0 Unchanged Control Register

RSTVaddr value VA = FFFF FFFF F000 0000₁₆ PA = 7FF F000 0000₁₆

PC RSTV | 20₁₆ RSTV | 20₁₆ RSTV | 40₁₆ RSTV | 60₁₆ RSTV | 80₁₆ RSTV | A0₁₆
nPC RSTV | 24₁₆ RSTV | 24₁₆ RSTV | 44₁₆ RSTV | 64₁₆ RSTV | 84₁₆ RSTV | A4₁₆
PSTATE MM 0 (TSO) 0 (TSO) 0 (TSO) RED 1(RED_state) 1(RED_state) 1(RED_state) PEF 1 (FPU on) 1 (FPU on) 1 (FPU on) AM 0 (Full 64-bit 0 (Full 64-bit 0 (Full 64-bit address) address address PRIV 1 (Privileged 1 (Privileged 1 (Privileged mode) mode) mode) IE 0 (Disable 0 (Disable 0 (Disable interrupts) interrupts) interrupts) AG 1 (Alternate 1 (Alternate 1 (Alternate globals selected) globals globals selected) selected) CLE 0 (Current 0 (Current PSTATE.TLE little-endian) little-endian) TLE 0 (Trap little- 0 (Trap little- Unchanged endian) endian) IG 0 (Interrupt 0 (Interrupt 0 (Interrupt globals not selected) globals not globals not selected) selected) MG 0 (MMU 0 (MMU 0 (MMU globals not selected) globals not globals not selected) selected) TBA<63:15> Unknown Unchanged Unchanged Y Unknown Unchanged Unchanged

TABLE O-1 Machine State After Reset and When Entering RED_state (2 of 4)

Name Fields Hard_POR System Reset WDR XIR SIR RED_state‡ PIL Unknown Unchanged Unchanged CWP Unknown Unchanged Unchanged except for register window traps TT[TL] 1 1 Unchanged 3 4 trap type CCR Unknown Unchanged Unchanged ASI Unknown Unchanged Unchanged TL MAXTL MAXTL Min(TL+1, MAXTL)

TPC[TL] Unknown Unchanged PC PC & ~1F₁₆ PC TNPC[TL] Unknown Unchanged nPC nPC=PC+4 nPC TSTATE CCR Unknown Unchanged CCR ASI Unknown Unchanged ASI PSTATE Unknown Unchanged PSTATE CWP Unknown Unchanged CWP PC Unknown Unchanged PC nPC Unknown Unchanged nPC TICK NPT 1 1 Unchanged Unchanged Unchanged counter Restart at 0 Restart at 0 Count Restart at 0 Count CANSAVE Unknown Unchanged Unchanged CANRESTORE Unknown Unchanged Unchanged OTHERWIN Unknown Unchanged Unchanged CLEANWIN Unknown Unchanged Unchanged WSTATE OTHER Unknown Unchanged Unchanged NORMAL Unknown Unchanged Unchanged

VER MANUF 003E₁₆ IMPL 0014₁₆ MASK Mask dependent MAXTL 5 MAXWIN 7 FSR all 0 0 Unchanged FPRS all Unknown Unchanged Unchanged

Non-SPARC V9 ASRs SOFTINT Unknown Unchanged Unchanged TICK_COMPARE INT_DIS 1 (off) 1 (off) Unchanged TICK_CMPR 0 0 Unchanged STICK NPT 1 1 Unchanged counter 0 0 Count

TABLE O-1 Machine State After Reset and When Entering RED_state (3 of 4)

Name Fields Hard_POR System Reset WDR XIR SIR RED_state‡ STICK_COMPARE INT_DIS 1 (off) 1 (off) Unchanged TICK_CMPR 0 0 Unchanged PCR S1 Unknown Unchanged Unchanged S0 Unknown Unchanged Unchanged UT (trace user) Unknown Unchanged Unchanged ST (trace system) Unknown Unchanged Unchanged PRIV (priv access) Unknown Unchanged Unchanged PIC all Unknown Unknown Unknown GSR IM 0 0 Unchanged others Unknown Unchanged Unchanged DCR MS (impl. dep. #204) 0 0 Unchanged SI (impl. dep. #204) 0 0 RPE (impl. dep. #204) 0 0 BPE (impl. dep. #204) 0 0 OBS (impl. dep. #203) 0 0 IFPOE (impl. dep. 0 0 #203)

Non-SPARC V9 ASIs Fireplane Information See RED_state and Reset Values on page 283 DCUCR WE 0(off) 0(off) Unchanged all others 0 (off) 0 (off) 0 (off) INSTRUCTION_TRAP all 0 (off) 0 (off) Unchanged VA_WATCHPOINT Unknown Unchanged Unchanged PA_WATCHPOINT Unknown Unchanged Unchanged I-SFSR, D-SFSR ASI Unknown Unchanged Unchanged FT Unknown Unchanged Unchanged E Unknown Unchanged Unchanged CTXT Unknown Unchanged Unchanged PRIV Unknown Unchanged Unchanged W Unknown Unchanged Unchanged OW (overwrite) Unknown Unchanged Unchanged FV (SFSR valid) 0 0 Unchanged NF Unknown Unchanged Unchanged TM Unknown Unchanged Unchanged DMMU_SFAR Unknown Unchanged Unchanged INTR_DISPATCH all 0 0 Unchanged INTR_RECEIVE BUSY 0 0 Unchanged MID Unknown Unchanged Unchanged

TABLE O-1 Machine State After Reset and When Entering RED_state (4 of 4)

Name Fields Hard_POR System Reset WDR XIR SIR RED_state‡ ESTATE_ERR_EN all 0 (all off) 0 (all off) Unchanged AFAR PA Unknown Unchanged Unchanged AFSR all 0 Unchanged Unchanged Rfr_CSR all Unknown Unchanged Unchanged Mem_Timing_CSR all Unknown Unchanged Unchanged Mem_Addr_Dec all Unknown Unchanged Unchanged Mem_Addr_Cntl all Unknown Unchanged Unchanged

Other Processor-Specific States Processor and external cache tags, Unknown Unchanged Unchanged microtags and data (includes data, instruction, and write caches) Cache snooping Enabled Instruction Queue Empty Store Queue Empty Empty Unchanged I-TLB, D-TLB Mappings in #2 (2- Unknown Unknown1 Unchanged way set-associative) Mappings in #0 Unknown Unknown Unchanged (fully set- and invalid associative) E (side effect) bit 1 1 1 NC (non-cacheable) 1 1 1 bit

1. Reference: UltraSPARC III Erratum #86, Bug IDs #7155 and 7156

‡ Processor states are only updated according to the following table if RED_state is entered because of a reset or a trap. If RED_state is entered because the PSTATE.RED bit was explicitly set to 1, then software must create the appropriate states itself.

UPDATED - this register field is updated from its shadow register.

S.APPENDIX P

Error Handling

Appendix P describes processor behavior to a programmer writing operating system and service processor diagnosis and recovery code for UltraSPARC III. Section headings differ from those of Appendix P of Commonality.

Errors are checked in data arriving at or passing through the processor from the E-cache and system bus, and in MTags arriving from the system bus. In addition, several protocol and internal errors are detectable.

Note – For clarity, this appendix refers to the level-1 instruction cache and data cache as I-cache and D-cache, the write cache as W-cache, and the level-2 external cache as E-cache.

Error information is logged in the Asynchronous Fault Address Register and Asynchronous Fault Status Register. Errors are logged even if their corresponding traps are disabled.

The appendix contains these sections:
■ Error Classes on page 186
■ Corrective Actions on page 186
■ Memory Errors on page 191
■ Error Registers on page 202
■ Error Reporting Summary on page 214
■ Overwrite Policy on page 217
■ Multiple Errors and Nested Traps on page 219
■ Further Details on Detected Errors on page 219
■ Further Details of ECC Error Processing on page 226
■ UltraSPARC III Behavior Under Asynchronous Error Conditions on page 235
■ External Memory Unit Error Handling on page 251

P.1 Error Classes

The classes of errors that can occur are:

1. Hardware-corrected errors (hw_corrected) — Corrected automatically by the hardware. A trap is optionally generated to log the error condition.

2. Software-correctable errors (sw_correctable) — Not corrected automatically by the hardware but are correctable by the software.

3. Uncorrectable errors — Not correctable by either software or hardware.

P.2 Corrective Actions

Errors are handled by generation of one of the following actions:
■ Fatal error: Reported when the processor must be reset before continuing.
■ Precise traps: Signalled for sw_correctable errors and one form of uncorrectable error that require system intervention before normal processor execution can continue.
■ Deferred traps: Reported for uncorrectable errors requiring immediate attention, but not system reset.
■ Disrupting traps: Report errors that may need logging and clearing but do not otherwise affect processor execution.

Trap vectors and trap priorities are specified in Tables 7.3 and 7.4 of Commonality.

P.2.1 Fatal Error (FERR)

It is usually impossible to recover a domain that suffers a system snoop request parity error, invalid coherence state, system interface protocol error, or internal system interface error at a processor. When these errors occur, the normal recovery mechanism is to reset the coherence domain to which the processor belongs. When one of these fatal errors is detected by a processor, the processor asserts its ERROR output pin. The response of the system when an ERROR pin is asserted depends on the system design. External Memory Unit Error Handling on page 251 contains detailed information regarding the error conditions that could cause the ERROR pin to be asserted.

Since the AFSR is not reset by a system reset event, error logging information is preserved. The system can generate a domain reset in response to assertion of an ERROR pin, and software can then examine the system registers to determine that the reset was due to an FERR. The AFSR of all processors can be read to determine the source and cause of the FERR. TABLE O-1 on page 180 defines the machine states after a reset.

Most errors that lead to FERR do not cause any special processor behavior. However, an uncorrectable error in an MTag is unique in normally causing the processor both to assert its ERROR output pin and to begin trap execution. Uncorrectable errors in MTags are normally fatal and should lead to the affected coherence domain being reset.

ERROR is asserted only when one of the FERR conditions first becomes valid, and the pulse width of the ERROR signal is 8 system clock cycles. The FERR bits remain sticky after a system reset so that the fatal error(s) can be interrogated, but the ERROR pin will not be asserted again. The existing FERR conditions are cleared once system software clears AFSR.PERR or AFSR.IERR. A power-on reset cycle of the processor is desirable after it has experienced a fatal error.

P.2.2 Precise Traps

A precise trap occurs before any program-visible state has been modified by the instruction to which the TPC points. When a precise trap occurs, several conditions are true.

1. The PC saved in TPC[TL] points to a valid instruction that will be executed by the program. The nPC saved in TNPC[TL] points to the instruction that will be executed after it.

2. All instructions issued before the one pointed to by the TPC have completed execution.

3. Any instructions issued after the one pointed to by the TPC remain unexecuted.

A precise trap occurs when a sw_correctable E-cache ECC error is detected as the result of a D-cache load miss or atomic instruction miss or an I-cache miss. These error conditions are the only ones that require software support to correct single-bit ECC errors; all other single-bit ECC errors are corrected by hardware.

A precise trap also occurs when an uncorrectable E-cache ECC error is detected as the result of a D-cache load miss, atomic instruction, or I-cache miss. If the affected line is in the S or E MOESI state, then software can recover from this problem in the precise trap handler. If the affected line is in the M or O states, then a process or the whole domain must be terminated.
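The recovery rule in the preceding paragraph can be written as a small lookup. This is a minimal sketch with a hypothetical function name; the text only covers the S, E, M, and O states, so the sketch rejects anything else instead of guessing.

```python
def recovery_after_uncorrectable_ecache_error(moesi_state):
    """Map the MOESI state of the affected line to the outcome described above."""
    if moesi_state in ("S", "E"):
        # The handler can recover, per the text, when the line is in S or E.
        return "recover in precise trap handler"
    if moesi_state in ("M", "O"):
        # The cached copy was the one in error, so work depending on it must stop.
        return "terminate process or domain"
    raise ValueError(f"state {moesi_state!r} is not covered by this rule")


for state in ("S", "E", "M", "O"):
    print(state, "->", recovery_after_uncorrectable_ecache_error(state))
```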

An E-cache error can be detected when the instruction fetcher misses in the I-cache. It can also be detected when the I-cache autonomously fetches the second 32-byte line of a 64-byte E-cache subblock. If an error detected in this way is on an instruction that is never executed, then the precise trap associated with the error is never taken. However, the error will be logged in AFSR and AFAR.

E-cache errors associated with speculative load (etc.) fetches to the data cache, where the data are not used, are never logged in AFSR or AFAR, do not load data into the D-cache, and never cause a precise trap.

P.2.3 Deferred Traps

Deferred traps may corrupt the processor state. Such traps lead to termination of the currently executing process or result in a system reset if the system state has been corrupted. Error logging information allows software to determine if system state has been corrupted.

Error Barriers

A MEMBAR #Sync instruction provides an error barrier for deferred traps. It ensures that deferred traps from earlier accesses will not be reported after the MEMBAR. A MEMBAR #Sync should be used when context switching or anytime the PSTATE.PRIV bit is changed, to provide error isolation between processes.

On UltraSPARC III, DONE and RETRY instructions implicitly provide the same function as MEMBAR #Sync so that they act as error barriers (impl. dep. #215). Errors reported as the result of fetching user code after a DONE or RETRY are always reported after the DONE or RETRY.

Traps do not provide the same function as MEMBAR #Sync. See Error Barriers on page 233 for a detailed description of error barriers and traps.

TPC, TNPC, and Deferred Traps

After a deferred trap, the contents of TPC[TL] and TNPC[TL] are undefined (except for the special peek sequence described in Appendix P of Commonality). They do not generally contain the oldest nonexecuted instruction and its next PC. Because of this, execution cannot normally be resumed from the point that the trap is taken. Instruction access errors are reported before execution of the instruction that caused the error, but TPC does not necessarily point to the corrupted instruction.

Enabling Deferred Traps

When an error that leads to a deferred trap occurs, the trap is taken only if the NCEEN bit is set in the E-cache Error Enable Register (see E-cache Error Enable Register on page 202). The deferred trap is an instruction_access_error if the error occurred as the result of fetching instructions. The deferred trap is a data_access_error if the error occurred as the result of a load, store, or atomic data access instruction.

The NCEEN bit should normally be set. If NCEEN is clear, the processor will compute with corrupt data and instructions when an uncorrectable error occurs.
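The selection of the deferred trap type can be sketched as follows. The function and the access-kind strings are hypothetical names chosen for illustration; only the NCEEN gating and the two trap names come from the text above.

```python
def deferred_trap_for(access_kind, nceen):
    """Return the deferred trap name, or None when NCEEN masks the trap."""
    if not nceen:
        # With NCEEN clear no trap is taken and execution continues
        # with the corrupt data or instructions.
        return None
    if access_kind == "instruction_fetch":
        return "instruction_access_error"
    if access_kind in ("load", "store", "atomic"):
        return "data_access_error"
    raise ValueError(f"unknown access kind: {access_kind!r}")


print(deferred_trap_for("instruction_fetch", nceen=True))
print(deferred_trap_for("load", nceen=True))
print(deferred_trap_for("load", nceen=False))
```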

NCEEN also controls a number of disrupting traps associated with uncorrectable errors.

Errors Leading to Deferred Traps

Deferred traps are generated by the following errors:
■ Uncorrectable E-cache ECC error as the result of a Block Load operation. (EDU:BLD)
■ Uncorrectable system bus data ECC error in the system bus read of memory or I/O. Uncorrectable ECC errors on cache fills will be reported for any ECC error in the cache block, not just the referenced word. (UE)
■ Uncorrectable system bus MTag ECC error for any incoming data, including interrupt vectors. These errors also cause the processor to assert its ERROR output pin, so whether the trap is ever executed depends on system design. (EMU)
■ Unmapped (TO) or bus error (BERR) as the result of a system bus read of memory or I/O. Intentional peeks and pokes to test presence and operation of devices are recoverable only if performed as described in the following section.

Special Access Sequence for Recovering Deferred Traps

Section P.2.3 in Commonality describes the special access sequence for recovering deferred traps.

Deferred Trap Handler Functionality

The following is a possible sequence, similar to that described in Section P.2.3 of Commonality, for handling errors resulting in unexpected deferred traps. Within the trap handler:

1. Log the error(s).

2. Reset the error logging bits in AFSR error register. Perform a MEMBAR #Sync to complete internal ASI stores.

3. Panic if AFSR.PRIV is set and not performing an intentional peek/poke; otherwise, try to continue.

4. Invalidate the D-cache by writing each line of the D-cache at ASI_DCACHE_TAG. This step is not required for instruction_access_error events, but, in the event of multiple errors arriving before a trap can be taken, the instruction_access_error trap can be taken to handle both an instruction and a data error. It is simplest to invalidate the D-cache in all cases.

5. Abort the current process.

6. For user process UE errors in a conventional UNIX system: Once all processes using the physical page in error have been signalled and terminated, clear the UE from main memory as part of the normal page recycling mechanism, by writing the page-zero routine so that it uses block store instructions. The trap handler does not usually have to clear out a UE in main memory.

7. Resume execution.
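The ordering of the steps above can be sketched as a function that returns the actions a handler would perform. This is an illustration rather than handler code: the function name and the action strings are hypothetical, and step 6 is left out because the text places it in the page recycling path, outside the trap handler.

```python
def deferred_trap_handler(afsr_priv, intentional_peek_poke):
    """Return the ordered actions for one unexpected deferred trap."""
    actions = [
        "log errors",             # step 1
        "clear AFSR error bits",  # step 2
        "MEMBAR #Sync",           # step 2, completes the internal ASI stores
    ]
    if afsr_priv and not intentional_peek_poke:
        actions.append("panic")   # step 3
        return actions
    actions += [
        "invalidate D-cache",     # step 4, by writes at ASI_DCACHE_TAG
        "abort current process",  # step 5
        "resume execution",       # step 7
    ]
    return actions


print(deferred_trap_handler(afsr_priv=False, intentional_peek_poke=False))
print(deferred_trap_handler(afsr_priv=True, intentional_peek_poke=False))
```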

P.2.4 Disrupting Traps

Please refer to Section P.2.4 of Commonality for general information on disrupting traps.

Hw_corrected ECC errors result from detection of a single-bit ECC error as the result of a system bus read or E-cache access (except for sw_correctable E-cache errors). Hw_corrected errors are logged in the Asynchronous Fault Status Register and, except for interrupt vector fetches, in the Asynchronous Fault Address Register. If the correctable error (CEEN) bit in the E-cache Error Enable Register is set to 1, then an ECC_error exception is generated.

E-cache data ECC errors are discussed in E-cache Data ECC Error on page 191. Uncorrectable E-cache data ECC errors as the result of a read to satisfy a store merge, writeback, or copyout require only logging on this processor, because this processor is not using the affected data. Consequently, a disrupting ECC_error trap is taken instead of a deferred trap. This approach avoids panics when the system displaces corrupted user data from the cache.

The ECC_error disrupting trap is enabled by PSTATE.IE. PIL has no effect on ECC_error traps.
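For hw_corrected errors, the two enabling conditions described above combine as shown in the sketch below. The function name is hypothetical, and the `pil` parameter exists only to make the point that PIL does not enter into the decision. The error is logged in either case; the sketch covers only whether the trap can be taken.

```python
def hw_corrected_trap_taken(ceen, pstate_ie, pil=0):
    """True when a hw_corrected error can raise the ECC_error trap."""
    del pil  # accepted and ignored: PIL has no effect on ECC_error traps
    return bool(ceen and pstate_ie)


for ceen in (False, True):
    for ie in (False, True):
        print(f"CEEN={int(ceen)} IE={int(ie)} ->", hw_corrected_trap_taken(ceen, ie, pil=15))
```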

Note – To prevent multiple traps from the same error, software should not reenable interrupts until after the disrupting error status bit in AFSR is cleared.

P.2.5 Multiple Traps

See When Are Traps Taken? on page 228 for a discussion of what happens when multiple traps occur at once.

P.2.6 Entering RED_state

In the event of a catastrophic hardware fault that produces repeated errors, or of any of a variety of programming faults, the processor can take a number of nested traps, leading to an eventual entry into RED_state. RED_state entry is not normally recoverable. However, programs in RED_state can usefully provide a diagnosis of the problem encountered before some corrective action is attempted. The I-cache and D-cache are automatically disabled by hardware when it clears the IC and DC bits in the DCU Control Register on entering RED_state. The E-cache and W-cache state are unchanged.

P.3 Memory Errors

Memory errors include:
■ E-cache data ECC error
■ Errors on the system bus

P.3.1 E-cache Data ECC Error

E-cache data ECC errors, listed below, are described in the following subsections.
■ Hw_corrected E-cache data ECC errors: Single-bit ECC errors that are corrected by hardware
■ Sw_correctable E-cache data ECC errors: E-cache ECC errors that require software intervention
■ Uncorrectable E-cache data ECC errors: Multiple-bit ECC errors that are not correctable by hardware or software

Hw_corrected E-cache Data ECC Errors

Hw_corrected ECC errors occur on single-bit errors detected as the result of these transactions:

■ W-cache read accesses to the E-cache to merge W-cache data with E-cache data to scrub W-cache data back to the E-cache (AFSR.EDC)
Note: The action of writing data from the W-cache to the E-cache is referred to as scrubbing.
■ Reads of the E-cache by the processor in order to perform a writeback or copyout to the system bus (AFSR.WDC, AFSR.CPC)

Hw_corrected errors optionally produce an ECC_error disrupting trap, enabled by the CEEN bit in the E-cache Error Enable Register, to carry out error logging.

Sw_correctable E-cache Data ECC Errors

Sw_correctable errors occur on single-bit data errors detected as the result of the following transactions:
■ Reads of data from the E-cache to fill the D-cache
■ Reads of data from the E-cache to fill the I-cache
■ Execution of an atomic instruction

All these events cause the processor to set AFSR.UCC. A sw_correctable error will generate a precise fast_ECC_error trap if the UCEEN bit is set in the E-cache Error Enable Register. The fast_ECC_error trap handler should carry out the following sequence of actions to correct the error.

1. Read the address of the correctable error from the AFAR register.

2. Invalidate the entire D-cache, using writes to ASI_DCACHE_TAG to zero the valid bits in the tags.

3. Displacement-flush the E-cache line that contained the error. If the line was modified, a single-bit error will be corrected when the processor reads the data from the E-cache and writes it to the system bus to perform the writeback. This operation may set the AFSR.WDC or AFSR.WDU bit. If the offending line was in O or M MOESI state and another processor happened to read the line while the trap handler was executing, the AFSR.CPC or CPU bit could be set.

4. Log the error.

5. Clear the AFSR.UCC, UCU, WDC, WDU, CPC, and CPU fields.

6. Displacement-flush any cacheable fast_ECC_error exception vector or cacheable fast_ECC_error trap handler code or data from the E-cache.

7. Reexecute the instruction that caused the error by means of RETRY.
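The seven steps above can be sketched against a stand-in object that only records what a real handler would do with privileged ASI accesses. Everything named here (`FakeCpu`, its methods, the example AFAR value) is hypothetical; the sketch shows the ordering of the steps, not real handler code.

```python
class FakeCpu:
    """Hypothetical stand-in for the privileged operations; it only records them."""

    def __init__(self, afar, dcache_lines=4):
        self.afar = afar
        self.dcache_valid = [True] * dcache_lines
        self.afsr = {"UCC": 1, "UCU": 0, "WDC": 0, "WDU": 0, "CPC": 0, "CPU": 0}
        self.ops = []

    def read_afar(self):
        self.ops.append("read AFAR")
        return self.afar

    def invalidate_dcache(self):
        # Stands in for writes to ASI_DCACHE_TAG that zero every valid bit.
        self.dcache_valid = [False] * len(self.dcache_valid)
        self.ops.append("invalidate D-cache")

    def displacement_flush(self, what):
        self.ops.append(f"flush {what}")

    def log_error(self, addr):
        self.ops.append(f"log {addr:#x}")

    def clear_afsr(self, fields):
        for name in fields:
            self.afsr[name] = 0
        self.ops.append("clear AFSR fields")


def fast_ecc_error_handler(cpu):
    addr = cpu.read_afar()                       # step 1
    cpu.invalidate_dcache()                      # step 2
    cpu.displacement_flush("error line")         # step 3
    cpu.log_error(addr)                          # step 4
    cpu.clear_afsr(("UCC", "UCU", "WDC", "WDU", "CPC", "CPU"))  # step 5
    cpu.displacement_flush("handler footprint")  # step 6
    return "RETRY"                               # step 7


cpu = FakeCpu(afar=0x1234_5640)
outcome = fast_ecc_error_handler(cpu)
print(outcome, cpu.ops)
```

The AFSR fields are cleared only after the displacement flush because, as step 3 notes, the flush itself may set WDC or WDU.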

Corrupt data are never stored in the I-cache.

Data in error are stored in the D-cache. If the data were read from the E-cache as the result of a load instruction or an atomic instruction, corrupt data will be stored in the D-cache. However, if the data were read as the result of a block load operation, corrupt data will not be stored in the D-cache. Store instructions never cause fast_ECC_error traps directly; only load and atomic operations do. ST instructions never result in corrupt data being loaded into the D-cache.

The entire D-cache is invalidated because there are circumstances when the AFAR used by the trap routine does not point to the line in error in the D-cache. This can happen when multiple errors are reported and when an instruction prefetch has logged an E-cache error in AFSR and AFAR but not generated a trap. Displacing the wrong line by mistake from the D-cache would lead to a data correctness problem. Displacing the wrong line by mistake from the E-cache will only lead to the same error being reported twice. The second time the error is reported, the AFAR is likely to be correct. Corrupt data are never stored in the D-cache without a trap being generated to allow it to be cleared out.

Note – While the above code description appears to be appropriate only for correctable E-cache data errors, it is actually effective for uncorrectable E-cache data errors as well.

In the event that it is handling an uncorrectable error, the action at step 3, “Displacement flush the E-cache line that contained the error,” will either invalidate the line in error or will, if it has been modified, write it out to memory via the system bus. If the E-cache still returns an uncorrectable error when the processor reads it to perform a writeback, the WDU bit will be set in the AFSR during this trap handler, which would generate a disrupting trap later if it were not cleared somewhere in this handler. In this case, the processor will write deliberately bad signalling ECC back to memory.

Either way, when the fast_ECC_error trap handler exits and retries the offending instruction, the previously faulty line will be refetched from main memory. It will either be correct, so the program will continue correctly, or will still contain an uncorrectable error, in which case the processor will take a deferred instruction_access_error or data_access_error trap. These later traps must perform the proper cleanup for the uncorrectable error. The fast_ECC_error trap routine does not need to execute any complex cleanup operations.

Encountering a sw_correctable error while executing the sw_correctable trap routine is unlikely to be recoverable. Three approaches avoid the problem.

1. The sw_correctable exception handler code can be written normally, in cacheable space. If a single-bit error exists in exception handler code in the E-cache, then other single bit E-cache data errors will be unrecoverable. To reduce the probability of this, the sw_correctable exception handler code can be flushed from the E-cache at the end of execution. This solution does not cover cases where the E-cache has a hard fault on a data bit, giving a sw_correctable error on every access.

2. All the exception handler code can be placed on a noncacheable page. This solution does cover hard faults on data bits in the E-cache, at least adequately to report a diagnosis or to remove the processor from a running domain, provided that the actual exception vector for the fast_ECC_error trap is not in the E-cache. Exception vectors are normally in cacheable memory. To avoid fetching the exception vector from the E-cache, flush it from the E-cache in the fast_ECC_error trap routine.

3. Exception handler code can be placed in cacheable memory, but only in the first 32 bytes of each 64-byte E-cache subblock. At the end of the 32 bytes, the code has to branch to the beginning of another E-cache subblock. The first 32 bytes of each E-cache subblock fetched from system bus are sent directly to the instruction unit without being fetched from the E-cache. None of the I-cache or D-cache lines fetched may be in the E-cache. This behavior does cover hard faults on data bits in the E-cache for systems that do not have noncacheable memory from which the trap routine can be run. The exception vector and the trap routine must all be flushed from the E-cache after the trap routine has executed.
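The placement rule in approach 3 lends itself to a mechanical check. The following C sketch is illustrative only: the function and macro names are hypothetical, and it assumes the 64-byte subblock and 32-byte leading half described above, together with the fixed 4-byte SPARC instruction size.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCK_BYTES 64u  /* E-cache subblock size */
#define DIRECT_BYTES   32u  /* leading bytes sent directly to the instruction unit */
#define INSN_BYTES      4u  /* SPARC instructions are 4 bytes wide */

/* True if an instruction at addr lies in the first 32 bytes of its subblock. */
static bool in_direct_half(uint64_t addr)
{
    return (addr % SUBBLOCK_BYTES) < DIRECT_BYTES;
}

/* Instruction slots left in the first half, counting the one at addr.
 * Returns 0 if addr is already in the second half of the subblock. */
static unsigned direct_slots_remaining(uint64_t addr)
{
    uint64_t off = addr % SUBBLOCK_BYTES;
    return off < DIRECT_BYTES ? (unsigned)((DIRECT_BYTES - off) / INSN_BYTES) : 0u;
}
```

A build-time or link-time script could apply such a check to every handler instruction address and reject an image in which handler code falls in the second half of a subblock.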

Note – If, by some means, the processor does encounter a sw_correctable E-cache ECC error while executing the fast_ECC_error trap handler, the processor may recurse into RED_state and not record the event in the AFSR, leading to difficult diagnosis. The processor will set the AFSR.ME bit for multiple sw_correctable events, but this behavior is expected to occur routinely, when an AFAR and AFSR are captured for an instruction that is prefetched and then discarded.

Note – The fast_ECC_error trap uses the alternate global registers. If a sw_correctable E-cache error occurs while the processor is running some other trap that uses alternate global registers (such as spill and fill traps), there may be no practical way to recover the system state. The fast_ECC_error routine should note this condition and, if necessary, reset the domain rather than recover from the sw_correctable event. One way to look for the condition is to check whether the TL of the fast_ECC_error trap handler is greater than 1.
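The TL test suggested in the note above might be expressed as follows. This is a minimal sketch, not a complete policy: the type and function names are hypothetical, and real handler code would read TL from the processor and might apply further checks before deciding to reset.

```c
/* Hypothetical outcome of the check made on entry to the handler. */
typedef enum { ECC_RECOVER, ECC_RESET_DOMAIN } ecc_action_t;

/* trap_level is the TL value observed inside the fast_ECC_error handler.
 * TL greater than 1 means the error interrupted another trap handler,
 * which may itself have been using the alternate global registers. */
static ecc_action_t fast_ecc_choose_action(unsigned trap_level)
{
    return (trap_level > 1) ? ECC_RESET_DOMAIN : ECC_RECOVER;
}
```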

Uncorrectable E-cache Data ECC Errors

Uncorrectable E-cache data ECC errors occur on multibit data errors detected as the result of the following transactions:

■ Reads of data from the E-cache to fill the D-cache

■ Reads of data from the E-cache to fill the I-cache

■ Execution of an atomic instruction

These events set AFSR.UCU. An uncorrectable E-cache data ECC error that is the result of an instruction fetch, or a data read caused by any instruction other than a block load, causes a fast_ECC_error trap. As described in Sw_correctable E-cache Data ECC Errors on page 192, these errors will be recoverable by the trap handler if the line at fault was in the S or E MOESI state.

Uncorrectable E-cache data ECC errors can also occur on multibit data errors detected as the result of the following transactions:

■ Reads of data from an O or M state line to respond to an incoming snoop request (copyout)

■ Reads of data from an O or M state line to write it back to memory (writeback)

■ Reads of data from the E-cache to merge with bytes being written by the processor (store merge)

For copyout, the processor reading the uncorrectable error from its E-cache sets its AFSR.CPU bit. In this case, deliberately bad signalling ECC is sent with the data to the chip issuing the snoop request. If the chip issuing the snoop request is a processor, it takes an instruction_access_error or data_access_error trap. If the chip issuing the snoop request is an I/O device, it will have some device-specific error reporting mechanism that the device driver must handle.

The processor being snooped logs error information in AFSR.

For writeback, the processor reading the uncorrectable error from its E-cache sets its AFSR.WDU bit.

If an uncorrectable E-cache data ECC error occurs as the result of a writeback or a copyout, deliberately bad signalling ECC is sent with the data to the system bus. Correct system bus ECC for the uncorrectably corrupted data is computed and transmitted on the system bus, and data bits <127:126> are inverted as the corrupt 128-bit word is transmitted on the system bus. This behavior signals to other devices that the word is corrupt and should not be used. The error information is logged in the AFSR, and an optional disrupting ECC_error trap is generated if the NCEEN bit is set in the E-cache Error Enable Register. Software should log the writeback error or the copyout error so that a subsequent uncorrectable system bus ECC error, reported by this chip or any other chip, can be correlated back to the E-cache ECC error.
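The data-bit inversion used to mark a word as corrupt can be modeled as below. This is a sketch only: word128_t and mark_word_corrupt are hypothetical names, the 128-bit word is represented as two 64-bit halves, and the computation of the system bus ECC itself is not modeled.

```c
#include <stdint.h>

/* A 128-bit system bus word; hi holds bits <127:64>, lo holds bits <63:0>. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} word128_t;

/* Invert data bits <127:126>, leaving all other bits unchanged. */
static word128_t mark_word_corrupt(word128_t w)
{
    w.hi ^= UINT64_C(3) << 62;  /* bits 62 and 63 of hi are bits 126 and 127 */
    return w;
}
```

Because the marking is an inversion, applying it twice restores the original bit pattern; a receiver sees the mismatch only because the ECC was computed before the inversion.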

For an uncorrectable E-cache data ECC error on a store merge, the processor sets AFSR.EDU. Data can be read from E-cache by the W-cache eviction and merge unit, to victimize a W-cache entry and to scrub modified bytes in the W-cache back to the E-cache. If the W-cache is turned off for some reason, then the store buffer causes the procedure to happen on every store instruction for cacheable data. On these E-cache reads, if an uncorrectable error occurs, a disrupting ECC_error trap is generated if the NCEEN bit is set in the E-cache Error Enable Register, and deliberately bad signalling ECC is scrubbed back to the E-cache. Correct ECC is computed for the corrupt merged data, then ECC check bits 0 and 1 are inverted in the check word scrubbed to the E-cache.

A merge operation also can occur as part of a writeback or copyout event, rather than eviction from the W-cache and update of the E-cache. For writeback or copyout merges, AFSR.WDU or AFSR.CPU is set, and not AFSR.EDU, which only occurs on store merges.

A copyout operation that happens to hit in the processor writeback buffer sets AFSR.WDU, not AFSR.CPU.
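The correspondence between the kind of E-cache read and the AFSR status bit set for an uncorrectable error can be summarized in a small lookup. The enumerations and function name below are hypothetical, introduced only to restate the rules of this section; they do not represent bit positions or any hardware interface.

```c
/* Kind of E-cache read on which the uncorrectable error was detected. */
typedef enum {
    EC_FILL_OR_ATOMIC,        /* D-cache fill, I-cache fill, or atomic instruction */
    EC_COPYOUT,               /* read to answer an incoming snoop request */
    EC_WRITEBACK,             /* read to write the line back to memory */
    EC_STORE_MERGE,           /* read to merge with bytes written by the processor */
    EC_COPYOUT_HITS_WB_BUFFER /* copyout that hits in the writeback buffer */
} ec_read_t;

/* AFSR status bit reported for the event. */
typedef enum { BIT_UCU, BIT_CPU, BIT_WDU, BIT_EDU } afsr_ue_bit_t;

static afsr_ue_bit_t ecache_ue_status_bit(ec_read_t kind)
{
    switch (kind) {
    case EC_FILL_OR_ATOMIC:         return BIT_UCU;
    case EC_COPYOUT:                return BIT_CPU;
    case EC_WRITEBACK:              return BIT_WDU;
    case EC_STORE_MERGE:            return BIT_EDU;
    case EC_COPYOUT_HITS_WB_BUFFER: return BIT_WDU;  /* not CPU */
    }
    return BIT_UCU;  /* not reached for valid enumerators */
}
```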

P.3.2 Errors on the System Bus

Errors on the system bus are detected as the result of the following accesses:

■ Cacheable and noncacheable read accesses to the system bus. Data ECC, MTag ECC, BERR, and TO (no MAPPED) responses are always checked on all these accesses.

■ Fetching interrupt vector data by the processor on the system bus. Data ECC, MTag ECC, and BERR responses are always checked on interrupt vector fetches.

■ Transmitting interrupts by the processor on the system bus. The TO (no MAPPED) response is checked on interrupt vector transmission.

Hw_Corrected System Bus Data and MTag ECC Errors

ECC is checked for data and MTags arriving at the processor from the system bus. Single-bit errors in data and MTags are fixed in hardware. A single-bit data error as the result of a system bus read from memory or I/O sets the AFSR.CE bit. A single-bit data error as the result of an interrupt vector fetch sets AFSR.IVC. A single-bit MTag error as the result of a system bus read from memory or I/O or as the result of an interrupt vector fetch sets AFSR.EMC.

Hw_corrected system bus errors cause an ECC_error disrupting trap if the CEEN bit is set in the E-cache Error Enable Register. The hw_corrected error is corrected by the system bus interface hardware at the processor, and the processor automatically uses the corrected data.

MTag ECC correctness is checked whether or not the processor is configured in SSM mode by setting the SSM bit in the Fireplane Configuration Register.

All four MTag values associated with a 64-byte system bus read are checked for ECC correctness.

Uncorrectable System Bus Data ECC Errors

An uncorrectable system bus data ECC error as the result of a system bus read from memory or I/O sets the AFSR.UE bit. The ECC syndrome is captured in E_SYND, and the AFSR.PRIV bit is set if PSTATE.PRIV was set at the time the error was detected.

Uncorrectable system bus data ECC errors on read accesses to a cacheable space will install the bad ECC from the system bus directly in the E-cache. This action prevents the bad data from being used or written back to memory with good ECC bits. Uncorrectable ECC errors from the system bus on cache fills are reported for any ECC error in the 64-byte E-cache subblock, not just the referenced word. The error information is logged in the AFSR. An instruction_access_error or data_access_error deferred trap is generated, provided that the NCEEN bit is set in the E-cache Error Enable Register. If NCEEN were clear, the processor would operate incorrectly on the corrupt data.

An uncorrectable error as the result of a system bus read for an instruction fetch causes an instruction_access_error deferred trap. An uncorrectable error as the result of a data read, a load, atomic, or store operation causes a data_access_error deferred trap. Store operations cause system bus reads when cacheable data referenced are not present in the processor caches. See Multiple Errors and Nested Traps on page 219 for the behavior in the event of multiple errors being detected simultaneously.
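The choice of deferred trap described above depends only on the kind of access and on NCEEN. A minimal C sketch of that rule follows; the enumerations and the function name are hypothetical and serve only to restate the text.

```c
#include <stdbool.h>

/* Access that caused the system bus read. */
typedef enum { ACC_IFETCH, ACC_LOAD, ACC_ATOMIC, ACC_STORE } bus_access_t;

typedef enum {
    TRAP_NONE,                      /* error is logged in the AFSR only */
    TRAP_INSTRUCTION_ACCESS_ERROR,
    TRAP_DATA_ACCESS_ERROR
} deferred_trap_t;

/* Deferred trap taken for an uncorrectable system bus data ECC error.
 * nceen is the NCEEN bit of the E-cache Error Enable Register. */
static deferred_trap_t ue_deferred_trap(bus_access_t access, bool nceen)
{
    if (!nceen)
        return TRAP_NONE;
    return (access == ACC_IFETCH) ? TRAP_INSTRUCTION_ACCESS_ERROR
                                  : TRAP_DATA_ACCESS_ERROR;
}
```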

When an uncorrectable error is present in a 64-byte E-cache subblock read from the system bus in order to complete a load or atomic instruction, corrupt data will be installed in the D-cache. The deferred trap handler should invalidate the D-cache during recovery, as described in Deferred Trap Handler Functionality on page 189. When an uncorrectable error is present in data fetched from system bus to complete a store instruction, corrupt data are not loaded into the D-cache, but the trap handler cannot differentiate this situation. The normal procedure for the deferred trap handler should be to invalidate the D-cache.

Corrupt data are never stored in the I-cache.

An uncorrectable system bus data ECC error on a read to a noncacheable space is handled in the same way as cacheable accesses, except that the error cannot be stored in the processor caches, so there is no need to flush them.

An uncorrectable system bus data ECC error as the result of an interrupt vector fetch sets AFSR.IVU in the processor fetching the vector. The error is not reported to the chip that generates the interrupt. When the uncorrectable interrupt vector data are read by the interrupt vector fetch hardware of the processor receiving the interrupt, a disrupting ECC_error exception is generated. The processor will store the uncorrected interrupt vector data in the internal interrupt registers unmodified, as it is received from the system bus. The interrupt_vector trap will be inhibited where the interrupt data is known to be corrupted.

Uncorrectable System Bus MTag Errors

An uncorrectable MTag ECC error as the result of a system bus read of memory or I/O sets AFSR.EMU.

Whether or not the processor is configured in SSM mode by setting the SSM bit in the Fireplane Configuration Register, system bus MTag ECC is checked.

All four MTag values associated with a 64-byte system bus read are checked for ECC correctness.

Uncorrectable errors in MTags arriving at the processor from the system bus are not normally recoverable. When the processor detects one of these errors, it asserts its ERROR output pin. The response of the system to the assertion of the ERROR output pin is system dependent but usually results in the reset of all the chips in the affected coherence domain.

In addition to asserting its ERROR output pin, the processor takes an instruction_access_error or data_access_error deferred trap if the NCEEN bit is set in the E-cache Error Enable Register.

Whether the trap taken has any effect or meaning depends on the system’s response to the processor ERROR output pin.

The effect of an uncorrectable MTag ECC error on the E-cache state is undefined.

System bus MTag ECC is checked on interrupt vector fetch operations and on read accesses to uncacheable space, even though the MTag has little meaning for these. An uncorrectable error in MTag will still result in the ERROR output pin being asserted.

An uncorrectable error in MTag for a read access to uncacheable space takes an instruction_access_error or data_access_error deferred trap if the NCEEN bit is set in the E-cache Error Enable Register.

An uncorrectable error in MTag ECC for an interrupt vector fetch operation causes an instruction_access_error or data_access_error deferred trap if the NCEEN bit is set in the E-cache Error Enable Register. AFSR.EMU will be set. AFAR will be captured, despite the fact that the address has little meaning. An interrupt_vector trap will still be generated.

System Bus BERR Errors

A BERR may be returned in response to a system bus read operation. In this case, the processor handles the event in the same way as specified above (in Uncorrectable System Bus Data ECC Errors on page 197) except for the following differences.

1. The AFSR.BERR bit is set instead of AFSR.UE.

2. The AFSR and AFAR overwrite priorities are used for BERR, rather than the UE priorities.

3. Data bits <1:0> of each of the four 128-bit correction words written to the E-cache are inverted to create signalling ECC if the access is cacheable. The E-cache will forward the data to the pipeline but will not install this data in E-cache.

The processor treats both system bus termination code DSTAT = 2 (timeout error) and DSTAT = 3 (bus error) as the same event. Both will cause AFSR.BERR to be set and cause the same signalling ECC to be sent to the E-cache. These conditions are checked on both cacheable and noncacheable reads. Neither sets AFSR.TO.

The processor checks the DSTAT returned for an interrupt vector fetch, sets AFSR.BERR, and captures AFAR (even though it has little meaning) for DSTAT = 2 or DSTAT = 3 returns.

System Bus Timeout Errors

The AFSR.TO bit is set when no device responds with a MAPPED status as the result of the system bus address phase. This is not a hardware timeout operation, which causes an AFSR.PERR event. It is also different from a DSTAT = 2 timeout response for a system bus transaction, which actually sets AFSR.BERR.
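The three conditions that are easily confused here map to three different AFSR status bits. The following C sketch restates that mapping; the enumerations and function name are hypothetical and are not part of any hardware or software interface.

```c
/* System bus condition observed by the processor. */
typedef enum {
    BUS_NO_MAPPED_RESPONSE,  /* no device asserted MAPPED in the address phase */
    BUS_DSTAT_2,             /* DSTAT = 2, timeout error termination code */
    BUS_DSTAT_3,             /* DSTAT = 3, bus error termination code */
    BUS_HARDWARE_TIMEOUT     /* a chip took too long to complete an operation */
} bus_event_t;

/* AFSR status bit that records the condition. */
typedef enum { STATUS_TO, STATUS_BERR, STATUS_PERR } bus_status_t;

static bus_status_t bus_event_status(bus_event_t ev)
{
    switch (ev) {
    case BUS_NO_MAPPED_RESPONSE: return STATUS_TO;
    case BUS_DSTAT_2:            /* treated as the same event as DSTAT = 3 */
    case BUS_DSTAT_3:            return STATUS_BERR;
    case BUS_HARDWARE_TIMEOUT:   return STATUS_PERR;  /* fatal error event */
    }
    return STATUS_PERR;  /* not reached for valid enumerators */
}
```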

A TO can be returned in response to a system bus read or write operation. In this case, the processor handles the event in the same way as specified above (at Uncorrectable System Bus Data ECC Errors on page 197), except for the following differences.

1. The AFSR.TO bit is set instead of AFSR.UE.

2. The AFSR and AFAR overwrite priorities for TO are used, rather than the UE priorities.

3. If the access is a cacheable read transfer, the data value from the system bus and the ECC present on the system bus are written to the E-cache but the data will not be installed in the E-cache.

4. If the access is a write transfer, a deferred data_access_error trap will still be taken. This applies even if the event was a writeback from E-cache, not directly related to instruction processing.

It’s possible that no destination asserts MAPPED when the processor attempts to send an interrupt. This event, too, causes AFSR.TO to be set and AFSR.PRIV and AFAR to be captured, although the exact meaning of the captured information is not clear. A deferred data_access_error trap will be taken.

Copyout events from E-cache cannot see an AFSR.TO event because they are responding to system bus transactions, not initiating system bus transactions.

System Bus Hardware Timeouts

The AFSR.TO bit is set when no device responds with a MAPPED status as the result of the system bus address phase. This action is not a hardware timeout operation.

In addition to the AFSR.TO functionality, there are hardware timeouts that detect that a chip is taking too long to complete a system bus operation. This timeout might come into effect if, say, a target device developed a fault during an access to it.

Hardware timeouts are reported as AFSR.PERR fatal error events, no matter what bus activity was taking place. No other status is logged in the AFSR or AFAR, although JTAG registers do capture more information. See External Memory Unit Error Handling on page 251 for details.

P.3.3 Memory Errors and Prefetch

In this section we describe memory errors caused by prefetch by the instruction fetcher, instruction cache, hardware, and software.

Memory Errors and Prefetch by the Instruction Fetcher and I-cache

The instruction fetcher sends requests for instructions to the I-cache before it is certain that the instructions will ever be executed. This occurs, for example, when a branch is mispredicted. These requests appear as perfectly normal operations to the rest of the processor and the rest of the system, which cannot tell that they are prefetch requests.

One of the requests from the instruction fetcher to the I-cache can miss in the I-cache and cause a fetch from the E-cache and can also miss in the E-cache and cause a fetch from the system bus.

In addition, any read by the instruction fetcher of the first 32-byte I-cache line from a 64-byte E-cache subblock can cause an automatic read by the I-cache of the subsequent 32-byte I-cache line from the E-cache.

In the event of an error from the E-cache for one of these fetches, a fast_ECC_error trap is generated, provided that the fetched instruction is actually executed. If the instruction marked as encountering an error is discarded without being executed, no trap is generated. However, AFSR and AFAR will still be updated with the E-cache error status.

In the event of an error from the system bus for an instruction fetch, the processor works exactly as normal, with the AFSR and AFAR being set and a deferred instruction_access_error trap being taken, despite the fact that the faulty line has not yet been used in the committed instruction stream and may, in fact, never be used.

The above applies to speculative fetches well beyond a branch and also to annulled instructions in the delay slot of a branch.

Corrupt data are never stored in the I-cache.

The execution unit can issue speculative requests for data because of load instructions (but not stores or atomic operations). These can miss in the D-cache and go to the E-cache. However, in all circumstances, if the data are not to be used, the execution unit cancels the fetch before the E-cache can detect any errors. The AFSR and AFAR are not updated, the D-cache is not loaded with corrupt data, and no trap is taken.

Speculative data fetches that are later discarded never cause system bus reads. Speculation around store instructions never causes system bus reads for stores that will not be executed.

P.3.4 Memory Errors and Interrupt Transmission

System-bus-data ECC errors for an interrupt vector fetch operation are treated specially. Hw_corrected, interrupt-vector-data ECC errors set AFSR.IVC (not AFSR.CE) and correct the error in hardware before writing the vector data into the interrupt receive registers. Uncorrectable interrupt-vector-data ECC errors set AFSR.IVU (not AFSR.UE) and write the received vector data unchanged into the interrupt receive registers. An uncorrectable error in interrupt data causes an ECC_error trap, not an interrupt_vector trap. A hw_corrected error in interrupt data causes both an interrupt_vector and an ECC_error trap. AFSR.E_SYND will be captured; AFAR will not be captured. AFSR.PRIV will be updated with the state that happens to be in PSTATE.PRIV at the time the event is detected.

System bus MTag ECC errors for interrupt vector fetches, whether in an SSM system or not, are treated exactly as though the bus cycle was a read access to I/O or memory. AFSR.EMC or EMU is set; M_SYND and AFAR are captured. The value captured in AFAR is not meaningful. AFSR.PRIV is updated with the state that happens to be in PSTATE.PRIV at the time the event is detected. For AFSR.EMU events, the processor asserts its ERROR output pin and takes a deferred instruction_access_error trap or data_access_error trap. For AFSR.EMC events, an ECC_error disrupting trap is taken. Both of these events also generate an interrupt_vector trap.

A system bus DSTAT = 2 or DSTAT = 3 (timeout or bus error) response from the interrupting device to an interrupt vector fetch operation sets AFSR.BERR at the processor that is fetching the interrupt vector and captures AFAR (even though it has little meaning). The interrupt vector data received in this transfer is written into the interrupt receive registers, and an interrupt_vector exception is generated, even though the data may be incorrect. A deferred instruction_access_error trap or data_access_error trap is also generated.
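The outcomes for errors on an interrupt vector fetch, as described in this section, can be collected into one table. The C sketch below is a restatement for reference only: all type and function names are hypothetical, and it does not model the CEEN and NCEEN enables that gate the error traps.

```c
#include <stdbool.h>

/* Error seen on an interrupt vector fetch. */
typedef enum {
    IV_DATA_HW_CORRECTED,
    IV_DATA_UNCORRECTABLE,
    IV_MTAG_HW_CORRECTED,
    IV_MTAG_UNCORRECTABLE,
    IV_DSTAT_2_OR_3
} iv_error_t;

typedef enum { IVBIT_IVC, IVBIT_IVU, IVBIT_EMC, IVBIT_EMU, IVBIT_BERR } iv_afsr_bit_t;
typedef enum { IVTRAP_ECC_ERROR, IVTRAP_DEFERRED_ACCESS_ERROR } iv_error_trap_t;

typedef struct {
    iv_afsr_bit_t   afsr_bit;          /* status bit set in the AFSR */
    iv_error_trap_t error_trap;        /* error trap taken, if enabled */
    bool            interrupt_vector;  /* interrupt_vector trap also generated */
    bool            error_pin;         /* ERROR output pin asserted */
} iv_outcome_t;

static iv_outcome_t iv_fetch_outcome(iv_error_t e)
{
    static const iv_outcome_t table[] = {
        [IV_DATA_HW_CORRECTED]  = { IVBIT_IVC,  IVTRAP_ECC_ERROR,             true,  false },
        [IV_DATA_UNCORRECTABLE] = { IVBIT_IVU,  IVTRAP_ECC_ERROR,             false, false },
        [IV_MTAG_HW_CORRECTED]  = { IVBIT_EMC,  IVTRAP_ECC_ERROR,             true,  false },
        [IV_MTAG_UNCORRECTABLE] = { IVBIT_EMU,  IVTRAP_DEFERRED_ACCESS_ERROR, true,  true  },
        [IV_DSTAT_2_OR_3]       = { IVBIT_BERR, IVTRAP_DEFERRED_ACCESS_ERROR, true,  false },
    };
    return table[e];
}
```

Only the uncorrectable data case suppresses the interrupt_vector trap; in every other case the interrupt is still delivered, even where the vector data may be incorrect.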

A processor transmitting an interrupt may receive no MAPPED response to its system bus address cycle. This is treated exactly as though the bus cycle was a read access to I/O or memory. AFSR.TO will be set, AFAR will be captured (although its meaning is uncertain), and AFSR.PRIV will be updated with the state that happens to be in PSTATE.PRIV at the time the event is detected. A deferred data_access_error trap will be generated.

P.3.5 Cache Flushing in the Event of Multiple Errors

If a software trap handler needs to flush a line from any processor cache to ensure correct operation as part of recovery from an error, and multiple uncorrectable errors are reported in the AFSR either through multiple sticky bits or through AFSR.ME, then the value stored in AFAR may not be the only line needing to be flushed. In this case, the trap handler should flush all D-cache contents from the processor to be sure of flushing all the required lines.
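The decision described above can be sketched in C as follows. This is an illustration under stated assumptions: the function names are hypothetical, only the ME bit position (bit 53) is taken from this appendix, and the mask of uncorrectable-error status bits is supplied by the caller rather than defined here.

```c
#include <stdbool.h>
#include <stdint.h>

#define AFSR_ME (UINT64_C(1) << 53)  /* accumulating multiple-error bit */

/* Count the bits set in x. */
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* True if AFAR may not identify every line that needs flushing, in which
 * case the handler should flush the entire D-cache.
 * uncorrectable_mask selects the AFSR sticky bits that report
 * uncorrectable errors (caller-supplied assumption). */
static bool must_flush_entire_dcache(uint64_t afsr, uint64_t uncorrectable_mask)
{
    return (afsr & AFSR_ME) != 0 ||
           popcount64(afsr & uncorrectable_mask) > 1;
}
```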

P.4 Error Registers

Note – MEMBAR #Sync is generally needed after stores to error ASI registers. See Instruction Prefetch to Side-Effect Locations on page 73.

P.4.1 E-cache Error Enable Register

Refer to TABLE O-1 on page 180 for the state of this register after reset.

ASI = 4B₁₆, VA<63:0> = 00₁₆

Name: ASI_ESTATE_ERROR_EN_REG

TABLE P-1 E-cache Error Enable Register Format

Bits Field RW Use

18 FMT Force error on the outgoing system MTag ECC. When this bit is 1, the contents of the FMECC field are transmitted as the system bus MTag ECC bits, for all data sent to the system bus by this processor. These data include writeback, copyout, interrupt vector, and noncacheable store data.

17:14 FMECC Forced error on the outgoing system MTag ECC vector. 4-bit ECC vector to transmit as the system bus MTag ECC bits.

13 FMD Force error on the outgoing system data ECC. When this bit is 1, the contents of the FDECC field are transmitted as the system-bus-data ECC bits, for all data sent to the system bus by this processor. These data include writeback, copyout, interrupt vector, and noncacheable store data.

12:4 FDECC Force error on the outgoing system Data ECC vector. 9-bit ECC vector to transmit as the system-bus-data ECC bits.

The FMT and FMD fields allow test code to confirm correct operation of system bus error detection hardware and software. To check E-cache error detection, test programs should use the E-cache diagnostic access operations.

3 UCEEN Enable fast_ECC_error trap on sw_correctable and uncorrectable E-cache error. If set, a sw_correctable or uncorrectable E-cache ECC error will generate a precise fast_ECC_error trap. This event can only occur on reads of the E-cache by this processor for instruction fetches, data loads, and atomic operations; this event cannot occur on merge, writeback, and copyout operations. This bit enables the traps associated with the AFSR.UCC and UCU bits.

2 Reserved RW Write 0 to this bit.

1 NCEEN RW Enable instruction_access_error, data_access_error, or ECC_error trap on uncorrectable ECC errors and system errors. If set, an uncorrectable system bus data or MTag ECC error, system bus TO, or system bus BERR as the result of an instruction fetch causes a deferred instruction_access_error trap, and as the result of an LD (etc.) instruction causes a deferred data_access_error trap. Also, if NCEEN is set, an uncorrectable E-cache data error as the result of a read by this processor to complete a store merge into the E-cache, or for a writeback or copyout, generates a disrupting ECC_error trap. Also, if NCEEN is set, an uncorrectable system bus data or MTag ECC error, system bus TO, or system bus BERR as the result of an interrupt vector fetch generates an instruction_access_error, data_access_error, or ECC_error trap. If NCEEN is clear, the error is logged in the AFSR and no trap is generated. This bit enables the errors associated with the AFSR.EMU, EDU, WDU, CPU, IVU, UE, BERR, and TO bits. Note: Executing code with NCEEN clear can lead to the processor executing instructions with uncorrectable errors spuriously, because it will not take traps on uncorrectable errors.

0 CEEN RW Enable ECC_error trap on hw_corrected ECC errors. If set, a hw_corrected data or MTag ECC error detected as the result of a system bus read causes an ECC_error disrupting trap. Also, if set, a hw_corrected E-cache data error as the result of a read by this processor to complete a store merge into the E-cache, or for a writeback or copyout, generates a disrupting ECC_error trap. If CEEN is clear, the error is logged in the AFSR and no trap is generated. This bit enables the errors associated with the AFSR.EDC, WDC, CPC, IVC, CE, and EMC bits.
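The field layout in TABLE P-1 translates directly into bit masks. The following C sketch uses the bit positions given in the table; the macro and function names are hypothetical, and the sketch only composes register values. It does not show the ASI store (or the MEMBAR #Sync that generally follows it) needed to write the register.

```c
#include <stdint.h>

#define EER_CEEN         (UINT64_C(1) << 0)     /* bit 0 */
#define EER_NCEEN        (UINT64_C(1) << 1)     /* bit 1 */
/* bit 2 is reserved and must be written as 0 */
#define EER_UCEEN        (UINT64_C(1) << 3)     /* bit 3 */
#define EER_FDECC_SHIFT  4
#define EER_FDECC_MASK   (UINT64_C(0x1FF) << EER_FDECC_SHIFT)  /* bits 12:4 */
#define EER_FMD          (UINT64_C(1) << 13)    /* bit 13 */
#define EER_FMECC_SHIFT  14
#define EER_FMECC_MASK   (UINT64_C(0xF) << EER_FMECC_SHIFT)    /* bits 17:14 */
#define EER_FMT          (UINT64_C(1) << 18)    /* bit 18 */

/* Return eer with FMD set and FDECC replaced by the 9-bit vector ecc9. */
static uint64_t eer_force_data_ecc(uint64_t eer, unsigned ecc9)
{
    eer &= ~EER_FDECC_MASK;
    eer |= ((uint64_t)(ecc9 & 0x1FFu) << EER_FDECC_SHIFT) | EER_FMD;
    return eer;
}

/* Return eer with FMT set and FMECC replaced by the 4-bit vector ecc4. */
static uint64_t eer_force_mtag_ecc(uint64_t eer, unsigned ecc4)
{
    eer &= ~EER_FMECC_MASK;
    eer |= ((uint64_t)(ecc4 & 0xFu) << EER_FMECC_SHIFT) | EER_FMT;
    return eer;
}
```

For example, a value with all three trap enables set and no forced ECC would be composed as EER_UCEEN | EER_NCEEN | EER_CEEN.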

P.4.2 Asynchronous Fault Status Register

The Asynchronous Fault Status Register (AFSR) accumulates all errors that have occurred since its fields were last cleared. The AFSR is updated according to the policy described in TABLE P-12 on page 214.

AFSR Fields

The AFSR is logically divided into six fields:

■ Bit 53, the accumulating multiple-error (ME) bit, is set when an uncorrectable or sw_correctable error occurs and the AFSR status bit that reports that error is already set to 1. Multiple errors of different types are indicated by setting more than one of the AFSR status bits.

Note – The ME bit is not set if multiple hw_corrected errors with the same status bit occur: only uncorrectable and sw_correctable set the bit.

Note – The ME bit is not set when multiple ECC errors occur within a single 64-byte system bus transaction. The first ECC error in a 16-byte subunit will be logged. Errors in the following 16-byte subunits from the same 64-byte transaction are ignored.

■ Bit 52, the accumulating privilege-error (PRIV), is set when an error is detected at a time when PSTATE.PRIV = 1. If this bit is set for an uncorrectable error, system state has been corrupted. (The corruption may be limited and may be recoverable if this situation occurs as the result of code, as described in Special Access Sequence for Recovering Deferred Traps on page 189.)

PRIV accumulates the privilege state of the processor at the time errors are detected, until software clears PRIV.

PRIV accumulates the state of the PSTATE.PRIV bit at the time the event is detected, rather than the PSTATE.PRIV value associated with the instruction that caused the access which returns the error (impl. dep. #213).

Note – MEMBAR #Sync is required before an ASI store that changes the PSTATE.PRIV bit to act as an error barrier between previous transactions that were launched with a different PRIV state; see Instruction Prefetch to Side-Effect Locations on page 73. This requirement ensures that privileged operations which fault will be recorded as privileged in AFSR.PRIV.

■ Bits <51:50>, PERR and IERR, indicate that either an internal inconsistency has occurred in the system interface logic or that a protocol error has occurred on the system bus. If either of these conditions occurs, the processor asserts its ERROR output pin. The AFSR may be read after the reset event used to recover from the error condition, in order to discover the cause. In addition, the specific cause is logged in a JTAG scannable flop. See External Memory Unit Error Handling for JTAG details.

The IERR status bit indicates that an event has been detected which is likely to have its source inside the processor reporting the problem. The PERR status bit indicates that the error source may well be elsewhere in the system, not in the processor reporting the problem. However, this differentiation cannot be perfect. These are merely likely outcomes. Further error recording elsewhere in the system is desirable for accurate diagnosis.

■ Bits <51:33> are sticky error bits that record the most recently detected errors. Each sticky bit accumulates errors that have been detected since the last write to clear the bit.

■ Bits <19:16> contain the data MTag ECC syndrome captured on a system bus MTag ECC error. The syndrome field captures the status of the first occurrence of the highest-priority error according to the M_SYND overwrite policy. After the AFSR sticky bit, corresponding to the error for which the M_SYND is reported, is cleared, the contents of the M_SYND field will be unchanged but will be unfrozen for further error capture.

■ Bits <8:0> contain the data ECC syndrome. The syndrome field captures the status of the first occurrence of the highest-priority error according to the E_SYND overwrite policy. After the AFSR sticky bit, corresponding to the error for which the E_SYND is reported, is cleared, the contents of the E_SYND field will be unchanged but will be unfrozen for further error capture.

Clearing the AFSR

The AFSR must be explicitly cleared by software; it is not cleared automatically by a read of AFSR. Writes to the AFSR RW1C bits (<53:33>) with particular bits set will clear the corresponding bits in the AFSR. Writes to AFSR bits with particular bits clear will not affect the corresponding bits in the AFSR. To prevent multiple traps for the same error, bits associated with disrupting traps must be cleared before interrupts are reenabled by setting PSTATE.IE. The syndrome fields are read-only, and writes to these fields are ignored.
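The write-one-to-clear rule can be illustrated with a small software model. This is only a sketch: `afsr_after_write` and `RW1C_MASK` are hypothetical names introduced here, and the model covers nothing beyond the rule stated above (a 1 written to bits <53:33> clears the bit, a 0 leaves it alone, and writes to other fields are ignored).

```python
# Hypothetical software model of a write to the AFSR (illustration only).
RW1C_MASK = ((1 << 54) - 1) & ~((1 << 33) - 1)  # bits <53:33>

def afsr_after_write(afsr: int, written: int) -> int:
    """Return the AFSR contents after software writes `written`."""
    # Only RW1C bits react to a write: 1 clears, 0 leaves unchanged.
    # Syndrome and reserved fields are read-only, so they keep their value.
    return afsr & ~(written & RW1C_MASK)

UE = 1 << 34
CE = 1 << 33
before = UE | CE | 0x1C9                       # two sticky bits plus an E_SYND value
after = afsr_after_write(before, UE | 0x1FF)   # clear UE; the syndrome write is ignored
```

In this model, a handler that writes back exactly the value it read clears every sticky bit it observed and leaves untouched any bit that was not yet set in the value read.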

If software attempts to clear error bits at the same time as an error occurs, one of two things will happen.

1. The clear will appear to happen before the error occurs. The state of the syndrome, ME, PRIV, and sticky bits, and the state of the AFAR, will all be consistent with the clear having happened before the error occurs.

2. The clear will appear to happen after the error occurs. The state of the syndrome, ME, PRIV, and sticky bits, and the state of the AFAR, will all be consistent with the clear having happened after the error occurs.

Software must clear the PERR and IERR bits by writing a 1 to the corresponding bit positions. When either of these bits is written with a 1, all of the JTAG shadow scan flops associated with that error bit will be cleared. Refer to EMU Shadow Register on page 258 for details of shadow scan flops.

When multiple events have been logged by the various bits in AFSR, at most one of these events will have its status captured in AFAR. AFAR will be unlocked and available to capture the address of another event as soon as the one bit in AFSR that corresponds to the event logged in AFAR is cleared. For example, suppose a CE is detected and then a UE (which overwrites the AFAR). If AFSR.UE is then cleared but AFSR.CE is not, AFAR will be unlocked and ready for another event, even though AFSR.CE is still set.

This same argument also applies to AFSR.M_SYND and AFSR.E_SYND fields. Each field will be unlocked and available for further error capture when the specific AFSR status bit, associated with the event logged in the field, is cleared.

Note – There is no mechanism for software to clear the M_SYND and E_SYND fields. After they are unlocked, they retain the value recorded. The only way to initialize these fields to known values is deliberately to take errors that set the fields.

Refer to TABLE O-1 on page 180 for the state of this register after reset.

ASI = 0x4C, VA<63:0> = 0x00 Name: ASI_ASYNC_FAULT_STATUS

TABLE P-2 Asynchronous Fault Status Register

Bits    Field     RW    Use
63:54   Reserved  R     -
53      ME        RW1C  Multiple error of same type occurred.
52      PRIV      RW1C  Privileged state error has occurred.
51      PERR      RW1C  System interface protocol error.
50      IERR      RW1C  Internal processor error.
49      ISAP      RW1C  System request parity error on incoming address.
48      EMC       RW1C  Hw_corrected system bus MTag ECC error.
47      EMU       RW1C  Uncorrectable system bus MTag ECC error.
46      IVC       RW1C  Hw_corrected system bus data ECC error for read of interrupt vector.
45      IVU       RW1C  Uncorrectable system bus data ECC error for read of interrupt vector.
44      TO        RW1C  Unmapped error from system bus.
43      BERR      RW1C  Bus error response from system bus.
42      UCC       RW1C  Sw_correctable E-cache ECC error for instruction fetch or data access other than block load.
41      UCU       RW1C  Uncorrectable E-cache ECC error for instruction fetch or data access other than block load.
40      CPC       RW1C  Copyout hw_corrected ECC error. Note: This bit is not set if the copyout hits in the writeback buffer. Instead, the WDC bit is set.
39      CPU       RW1C  Copyout uncorrectable ECC error. Note: This bit is not set if the copyout hits in the writeback buffer. Instead, the WDU bit is set.
38      WDC       RW1C  Hw_corrected ECC error from E-cache for writeback.
37      WDU       RW1C  Uncorrectable ECC error from E-cache for writeback.
36      EDC       RW1C  Hw_corrected ECC error from E-cache for store merge or block load.
35      EDU       RW1C  Uncorrectable ECC error from E-cache for store merge or block load.
34      UE        RW1C  Uncorrectable system bus data ECC error for read of memory or I/O.
33      CE        RW1C  Hw_corrected system bus data ECC error for read of memory or I/O.
32:20   Reserved  R     -
19:16   M_SYND    R     System bus MTag ECC syndrome.
15:9    Reserved  R     -
8:0     E_SYND    R     System bus or E-cache data ECC syndrome.
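As an illustration of the layout in TABLE P-2, the following sketch splits a 64-bit AFSR value into its status bits and syndrome fields. The bit positions come from the table; the function name `decode_afsr` and the dictionary layout are assumptions made for this example only.

```python
# Bit positions taken from TABLE P-2; names of helpers are hypothetical.
AFSR_STATUS_BITS = {
    53: "ME", 52: "PRIV", 51: "PERR", 50: "IERR", 49: "ISAP",
    48: "EMC", 47: "EMU", 46: "IVC", 45: "IVU", 44: "TO", 43: "BERR",
    42: "UCC", 41: "UCU", 40: "CPC", 39: "CPU", 38: "WDC", 37: "WDU",
    36: "EDC", 35: "EDU", 34: "UE", 33: "CE",
}

def decode_afsr(afsr: int) -> dict:
    """Split an AFSR value into set status bits and the two syndromes."""
    status = [name
              for bit, name in sorted(AFSR_STATUS_BITS.items(), reverse=True)
              if (afsr >> bit) & 1]
    return {
        "status": status,                # highest bit number first
        "m_synd": (afsr >> 16) & 0xF,    # bits <19:16>
        "e_synd": afsr & 0x1FF,          # bits <8:0>
    }

example = decode_afsr((1 << 52) | (1 << 34) | (0xB << 16) | 0x1C9)
```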

P.4.3 ECC Syndromes

This section provides details of the ECC syndromes: E_SYND and M_SYND.

E_SYND

The AFSR.E_SYND field contains a 9-bit value that indicates which data bit of a 128-bit quadword contains a single-bit error. This field reports the ECC syndrome for system bus and E-cache ECC errors of all types: hw_corrected, sw_correctable, and uncorrectable. The ECC coding scheme used is described in the system bus specification.

TABLE P-3 shows how to interpret TABLE P-4, which lists the 9-bit ECC syndromes that correspond to a single-bit error for each of the 128 data bits. To locate a syndrome in the table, use the low-order 3 bits of the data bit number to find the column and the high-order 4 bits of the data bit number to find the row. For example, data bit number 126 is at column 0x6, row 0xF and has a syndrome of 0x1C9.

TABLE P-3 Key to Interpreting TABLE P-4

Interpretation                                       Example
Data bit number, decimal                             126
Data bit number, hexadecimal                         0x7E
Data bit number, 7-bit binary                        111 1110
High 4 bits, binary                                  1111
Low 3 bits, binary                                   110
High 4 bits, hexadecimal row number                  0xF
Low 3 bits, hexadecimal column number                0x6
Syndrome returned for 1-bit error in data bit 126    0x1C9

TABLE P-4 Data Single-Bit Error ECC Syndromes (indexed by data bit number, all entries 9-bit hexadecimal ECC syndrome)

Row            Column (low 3 bits, hexadecimal)
(high 4 bits)  0    1    2    3    4    5    6    7
0              03B  127  067  097  10F  08F  04F  02C
1              147  0C7  02F  01C  117  032  08A  04A
2              01F  086  046  026  09B  08C  0C1  0A1
3              01A  016  061  091  052  00E  109  029
4              02A  019  105  085  045  025  015  103
5              031  00D  083  043  051  089  023  007
6              0B9  049  013  0A7  057  00B  07A  187
7              0F8  11B  079  034  178  1D8  05B  04C
8              064  1B4  037  03D  058  13C  1B1  03E
9              1C3  0BC  1A0  1D4  1CA  190  124  13A
A              1C0  188  122  114  184  182  160  118
B              181  150  148  144  142  141  130  0A8
C              128  121  0E0  094  112  10C  0D0  0B0
D              10A  106  062  1B2  0C8  0C4  0C2  1F0
E              0A4  0A2  098  1D1  070  1E8  1C6  1C5
F              068  1E4  1E2  1E1  1D2  1CC  1C9  1B8

TABLE P-5 shows the 9-bit ECC syndromes that correspond to a single-bit error for each of the nine ECC check bits for the E-cache and system bus error correcting codes used for data.

TABLE P-5 Data Check Bit Single-Bit Error Syndromes

Check Bit Number    AFSR.E_SYND
0                   0x001
1                   0x002
2                   0x004
3                   0x008
4                   0x010
5                   0x020
6                   0x040
7                   0x080
8                   0x100

Any other syndrome found in the AFSR.E_SYND field indicates either no error (syndrome 0) or a multibit error.
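The lookups described by TABLE P-3 through TABLE P-5 can be sketched in software as follows. The syndrome values are transcribed from TABLE P-4, and the check-bit syndromes follow the one-bit-per-check-bit pattern of TABLE P-5. The helper names (`syndrome_for_data_bit`, `classify_e_synd`) are hypothetical, and the sketch does not distinguish the M2, M3, and M4 multibit classes.

```python
# Single-bit data syndromes transcribed from TABLE P-4.
# Row index = high 4 bits of the data bit number, column = low 3 bits.
DATA_BIT_SYNDROMES = [
    [0x03B, 0x127, 0x067, 0x097, 0x10F, 0x08F, 0x04F, 0x02C],
    [0x147, 0x0C7, 0x02F, 0x01C, 0x117, 0x032, 0x08A, 0x04A],
    [0x01F, 0x086, 0x046, 0x026, 0x09B, 0x08C, 0x0C1, 0x0A1],
    [0x01A, 0x016, 0x061, 0x091, 0x052, 0x00E, 0x109, 0x029],
    [0x02A, 0x019, 0x105, 0x085, 0x045, 0x025, 0x015, 0x103],
    [0x031, 0x00D, 0x083, 0x043, 0x051, 0x089, 0x023, 0x007],
    [0x0B9, 0x049, 0x013, 0x0A7, 0x057, 0x00B, 0x07A, 0x187],
    [0x0F8, 0x11B, 0x079, 0x034, 0x178, 0x1D8, 0x05B, 0x04C],
    [0x064, 0x1B4, 0x037, 0x03D, 0x058, 0x13C, 0x1B1, 0x03E],
    [0x1C3, 0x0BC, 0x1A0, 0x1D4, 0x1CA, 0x190, 0x124, 0x13A],
    [0x1C0, 0x188, 0x122, 0x114, 0x184, 0x182, 0x160, 0x118],
    [0x181, 0x150, 0x148, 0x144, 0x142, 0x141, 0x130, 0x0A8],
    [0x128, 0x121, 0x0E0, 0x094, 0x112, 0x10C, 0x0D0, 0x0B0],
    [0x10A, 0x106, 0x062, 0x1B2, 0x0C8, 0x0C4, 0x0C2, 0x1F0],
    [0x0A4, 0x0A2, 0x098, 0x1D1, 0x070, 0x1E8, 0x1C6, 0x1C5],
    [0x068, 0x1E4, 0x1E2, 0x1E1, 0x1D2, 0x1CC, 0x1C9, 0x1B8],
]

def syndrome_for_data_bit(bit: int) -> int:
    """Forward lookup, as in TABLE P-3: row = bit<6:3>, column = bit<2:0>."""
    return DATA_BIT_SYNDROMES[bit >> 3][bit & 0x7]

# Reverse maps: syndrome -> data bit, and syndrome -> check bit (TABLE P-5).
SYNDROME_TO_DATA_BIT = {syndrome_for_data_bit(b): b for b in range(128)}
SYNDROME_TO_CHECK_BIT = {1 << n: n for n in range(9)}

def classify_e_synd(synd: int):
    """Classify a 9-bit E_SYND value as (kind, bit number or None)."""
    if synd == 0:
        return ("no error", None)
    if synd in SYNDROME_TO_DATA_BIT:
        return ("data bit", SYNDROME_TO_DATA_BIT[synd])
    if synd in SYNDROME_TO_CHECK_BIT:
        return ("check bit", SYNDROME_TO_CHECK_BIT[synd])
    return ("multibit", None)
```

For instance, `classify_e_synd(0x1C9)` reports data bit 126, matching the worked example in TABLE P-3.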

TABLE P-6 provides the syndrome description for TABLE P-7, and TABLE P-7 maps all E_SYND ECC syndromes to the indicated event.

TABLE P-6 Legend for TABLE P-7, ECC Syndromes

Syndrome    Meaning
---         No error
0–127       Data single-bit error, data bit 0–127
C0–C8       ECC check single-bit error, check bit 0–8
M2          Probable double-bit error within a nibble
M3          Probable triple-bit error within a nibble
M4          Probable quad-bit error within a nibble
M           Multibit error

Three syndromes in particular from TABLE P-7 are useful. These are the syndromes corresponding to the three different, deliberately inserted, bad ECC conditions, the signalling ECC codes, used by the processor.

■ For a BERR event from the system bus for a cacheable load, data bits <1:0> are inverted in the data stored in the E-cache. The syndrome seen when one of these signalling words is read is 0x11C.
■ For an uncorrectable ECC error from the E-cache, data bits <127:126> are inverted in data sent to the system bus as part of a writeback or copyout. The syndrome seen when one of these signalling words is read is 0x071.
■ For an uncorrectable ECC error on the E-cache read done to complete a store merge event, where bytes written by the processor are merged with bytes from an E-cache line, ECC check bits <1:0> are inverted in the data scrubbed back to the E-cache. The syndrome seen when one of these signalling words is read is 0x003.
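Each of the three signalling syndromes is numerically equal to the exclusive-OR of the two single-bit syndromes listed in TABLE P-4 and TABLE P-5 for the inverted bits. The sketch below checks that arithmetic; the dictionary keys and the function name are hypothetical labels chosen for this example.

```python
# Single-bit syndromes copied from TABLE P-4 (data bits) and TABLE P-5
# (check bits). Keys are hypothetical labels used only in this sketch.
SINGLE_BIT_SYNDROME = {
    "D0": 0x03B, "D1": 0x127,        # data bits 0 and 1
    "D126": 0x1C9, "D127": 0x1B8,    # data bits 126 and 127
    "C0": 0x001, "C1": 0x002,        # check bits 0 and 1
}

def two_bit_syndrome(a: str, b: str) -> int:
    """XOR of the single-bit syndromes of the two inverted bits."""
    return SINGLE_BIT_SYNDROME[a] ^ SINGLE_BIT_SYNDROME[b]

berr_signal = two_bit_syndrome("D0", "D1")             # bus error signalling word
writeback_signal = two_bit_syndrome("D126", "D127")    # writeback or copyout
store_merge_signal = two_bit_syndrome("C0", "C1")      # store merge scrub
```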

TABLE P-7 ECC Syndromes

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0     ---  C0   C1   M2   C2   M2   M3   47   C3   M2   M2   53   M2   41   29   M
1     C4   M    M    50   M2   38   25   M2   M2   33   24   M2   11   M    M2   16
2     C5   M    M    46   M2   37   19   M2   M    31   32   M    7    M2   M2   10
3     M2   40   13   M2   59   M    M2   66   M    M2   M2   0    M2   67   71   M
4     C6   M    M    43   M    36   18   M    M2   49   15   M    63   M2   M2   6
5     M2   44   28   M2   M    M2   M2   52   68   M2   M2   62   M2   M3   M3   M4
6     M2   26   106  M2   64   M    M2   2    120  M    M2   M3   M    M3   M3   M4
7     116  M2   M2   M3   M2   M3   M    M4   M2   58   54   M2   M    M4   M4   M3
8     C7   M2   M    42   M    35   17   M2   M    45   14   M2   21   M2   M2   5
9     M    27   M    M    99   M    M    3    114  M2   M2   20   M2   M3   M3   M
A     M2   23   113  M2   112  M2   M    51   95   M    M2   M3   M2   M3   M3   M2
B     103  M    M2   M3   M2   M3   M3   M4   M2   48   M    M    73   M2   M    M3
C     M2   22   110  M2   109  M2   M    9    108  M2   M    M3   M2   M3   M3   M
D     102  M2   M    M    M2   M3   M3   M    M2   M3   M3   M2   M    M4   M    M3
E     98   M    M2   M3   M2   M    M3   M4   M2   M3   M3   M4   M3   M    M    M
F     M2   M3   M3   M    M3   M    M    M    56   M4   M    M3   M4   M    M    M
10    C8   M    M2   39   M    34   105  M2   M    30   104  M    101  M    M    4
11    M    M    100  M    83   M    M2   12   87   M    M    57   M2   M    M3   M
12    M2   97   82   M2   78   M2   M2   1    96   M    M    M    M    M    M3   M2
13    94   M    M2   M3   M2   M    M3   M    M2   M    79   M    69   M    M4   M
14    M2   93   92   M    91   M    M2   8    90   M2   M2   M    M    M    M    M4
15    89   M    M    M3   M2   M3   M3   M    M    M    M3   M2   M3   M2   M    M3
16    86   M    M2   M3   M2   M    M3   M    M2   M    M3   M    M3   M    M    M3
17    M    M    M3   M2   M3   M2   M4   M    60   M    M2   M3   M4   M    M    M2
18    M2   88   85   M2   84   M    M2   55   81   M2   M2   M3   M2   M3   M3   M4
19    77   M    M    M    M2   M3   M    M    M2   M3   M3   M4   M3   M2   M    M
1A    74   M    M2   M3   M    M    M3   M    M    M    M3   M    M3   M    M4   M3
1B    M2   70   107  M4   65   M2   M2   M    127  M    M    M    M2   M3   M3   M
1C    80   M2   M2   72   M    119  118  M    M2   126  76   M    125  M    M4   M3
1D    M2   115  124  M    75   M    M    M3   61   M    M4   M    M4   M    M    M
1E    M    123  122  M4   121  M4   M    M3   117  M2   M2   M3   M4   M3   M    M
1F    111  M    M    M    M4   M3   M3   M    M    M    M3   M    M3   M2   M    M

M_SYND

The AFSR.M_SYND field contains a 4-bit ECC syndrome for the 3-bit MTags of the system bus. TABLE P-8 shows the 4-bit syndrome corresponding to a single-bit error in each of the MTag data or correction bits.

TABLE P-8 MTag Single-Bit Error ECC Syndromes

Bit Number     AFSR.M_SYND
MTag Data 0    0x7
MTag Data 1    0xB
MTag Data 2    0xD
MTag ECC 0     0x1
MTag ECC 1     0x2
MTag ECC 2     0x4
MTag ECC 3     0x8

A complete MTag syndrome table is shown in TABLE P-9.

TABLE P-9 Syndrome Table for MTag ECC

Syndrome<3:0>    Error Indication
0x0              None
0x1              MTag ECC 0
0x2              MTag ECC 1
0x3              Double bit (UE)
0x4              MTag ECC 2
0x5              Double bit (UE)
0x6              Double bit (UE)
0x7              MTag Data 0
0x8              MTag ECC 3
0x9              Double bit (UE)
0xA              Double bit (UE)
0xB              MTag Data 1
0xC              Double bit (UE)
0xD              MTag Data 2
0xE              Multiple bit (UE)
0xF              Double bit (UE)
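Because the MTag syndrome is only 4 bits wide, TABLE P-9 can be held as a 16-entry lookup. The following is a minimal sketch of such a decoder; `decode_m_synd` is a hypothetical name, and the field position <19:16> is taken from TABLE P-2.

```python
# TABLE P-9 as a 16-entry lookup, indexed by Syndrome<3:0>.
MTAG_SYNDROME = [
    "None",              # 0x0
    "MTag ECC 0",        # 0x1
    "MTag ECC 1",        # 0x2
    "Double bit (UE)",   # 0x3
    "MTag ECC 2",        # 0x4
    "Double bit (UE)",   # 0x5
    "Double bit (UE)",   # 0x6
    "MTag Data 0",       # 0x7
    "MTag ECC 3",        # 0x8
    "Double bit (UE)",   # 0x9
    "Double bit (UE)",   # 0xA
    "MTag Data 1",       # 0xB
    "Double bit (UE)",   # 0xC
    "MTag Data 2",       # 0xD
    "Multiple bit (UE)", # 0xE
    "Double bit (UE)",   # 0xF
]

def decode_m_synd(afsr: int) -> str:
    """Extract AFSR<19:16> and return the TABLE P-9 error indication."""
    return MTAG_SYNDROME[(afsr >> 16) & 0xF]
```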

The M_SYND field is locked by the AFSR.EMC and EMU bits. The E_SYND field is locked by the AFSR.UE, CE, UCU, UCC, EDU, EDC, WDU, WDC, CPU, CPC, IVU, and IVC bits. So, a data ECC error can lead to the data ECC syndrome being recorded in E_SYND, perhaps with a CE status, then a later MTag ECC error event can store the MTag ECC syndrome in M_SYND, perhaps with an EMC status. The two are independent.

P.4.4 Asynchronous Fault Address Register

This register is captured when one of the AFSR error status bits that capture address is set (see TABLE P-12 for details). The address corresponds to the first occurrence in the AFSR of the highest-priority error that captures the address according to the AFAR overwrite policy. Address capture is reenabled by clearing the corresponding error bit in AFSR. See Clearing the AFSR on page 206 for a description of the behavior when clearing occurs at the same time as an error.

Refer to TABLE O-1 on page 180 for the state of this register after reset.

ASI = 0x4D, VA<63:0> = 0x0 Name: ASI_ASYNC_FAULT_ADDRESS

TABLE P-10 describes the AFAR register bits.

TABLE P-10 Asynchronous Fault Address Register

Bits    Field      RW   Use
63:43   Reserved   R    -
42:4    PA<42:4>   RW   Physical address of faulting 16-byte component. Bits <5:4> isolate the fault to a 128-bit subunit within a 512-bit coherency block.
3:0     Reserved   R    -
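A short sketch of how software might split an AFAR value according to TABLE P-10. The names `decode_afar`, `block_base`, and `subunit` are assumptions made for illustration; only the bit positions come from the table.

```python
PA_MASK = ((1 << 43) - 1) & ~0xF   # PA<42:4>; bits <3:0> and <63:43> are reserved

def decode_afar(afar: int) -> dict:
    """Split an AFAR value into the fields described in TABLE P-10."""
    pa = afar & PA_MASK
    return {
        "pa": pa,                      # address of the faulting 16-byte component
        "block_base": pa & ~0x3F,      # start of the 64-byte (512-bit) coherency block
        "subunit": (pa >> 4) & 0x3,    # PA<5:4>: which 128-bit subunit of the block
    }

example = decode_afar(0x123456789A)
```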

In the event of multiple errors within a 64-byte block, AFAR captures only the first-detected, highest-priority error, with rare exceptions (see footnote 1): on UltraSPARC-III, there are cases where the wrong physical address may be logged when certain combinations of multiple simultaneous errors occur, as shown in TABLE P-11.

TABLE P-11 Simultaneous Error Combinations Resulting in Incorrect AFAR Value

        Simultaneous Error Sources
Case    ECU                   SIU
1       UCU or UCC            UE or EMU or CE or EMC or TO or BERR
2       EDU or WDU or CPU     CE or EMC or TO or BERR
3       EDC or WDC or CPC     TO or BERR

P.5 Error Reporting Summary

TABLE P-12 summarizes error reporting. Speculative instruction fetches that encounter system bus errors are treated exactly as fetch operations. Applicable notes follow the table.

TABLE P-12 Error Reporting Summary

                                                                   AFSR    Trap   Trap               ----- Priority ➌ -----  Set    Set
                                                                   status  taken  controlled  FERR?  M_SYND  E_SYND  AFAR    PRIV?  ME?  Flush
Error Event                                                        bit     ➊      by          ➋                              ➍      ➎    ➏
System unrecoverable error                                         PERR    None   -           1      0       0       0       0      0    0
Internal unrecoverable error                                       IERR    None   -           1      0       0       0       0      0    0
Incoming system address parity error                               ISAP    None   -           1      0       0       0       0      1    0
Uncorrectable system bus data ECC: all but interrupt vector fetch  UE      ID     NCEEN       0      0       2       3       1      1    1
Uncorrectable system bus data ECC: interrupt vector fetch          IVU     C      NCEEN       0      0       2       0       1      1    0
Hw_corrected system bus data ECC: all but interrupt vector fetch   CE      C      CEEN        0      0       1       2       1      0    0
Hw_corrected system bus data ECC: interrupt vector fetch           IVC     C      CEEN        0      0       1       0       1      0    0
Uncorrectable system bus MTag ECC                                  EMU     ID     NCEEN       1      2       0       3       1      1    1
Hw_corrected system bus MTag ECC                                   EMC     C      CEEN        0      1       0       2       1      0    0
Uncorrectable E-cache ECC: load/I-Fetch/atomic                     UCU     FC     UCEEN       0      0       2       4       1      1    2
Uncorrectable E-cache ECC: store                                   EDU     C      NCEEN       0      0       2       3       0      1    0
Uncorrectable E-cache ECC: Block Load                              EDU     D      NCEEN       0      0       2       3       1      1    0
Uncorrectable E-cache ECC: writeback                               WDU     C      NCEEN       0      0       2       3       0      1    0
Uncorrectable E-cache ECC: copyout                                 CPU     C      NCEEN       0      0       2       3       0      1    0
Sw_correctable E-cache ECC: load/I-Fetch/atomic                    UCC     FC     UCEEN       0      0       1       4       1      1    2
Hw_corrected E-cache ECC: store                                    EDC     C      CEEN        0      0       1       2       0      0    0
Hw_corrected E-cache ECC: Block Load                               EDC     C      CEEN        0      0       1       2       1      0    0
Hw_corrected E-cache ECC: writeback                                WDC     C      CEEN        0      0       1       2       0      0    0
Hw_corrected E-cache ECC: copyout                                  CPC     C      CEEN        0      0       1       2       0      0    0
DSTAT = 2 or 3 response, all except “bus error”                    BERR    ID     NCEEN       0      0       0       1       1      1    0
no MAPPED response, “time-out”                                     TO      ID     NCEEN       0      0       0       1       1      1    0

1. Reference: UltraSPARC III Erratum #90, Bug ID #7163

➊ Trap types:
■ ID: instruction_access_error or data_access_error trap, depending on whether the error is the result of an instruction fetch or a data access instruction. Always deferred.
■ D: data_access_error trap, because the error is always the result of a data access instruction. Always deferred.
■ C: ECC_error trap. Always disrupting.
■ FC: fast_ECC_error trap. Always precise.
■ None: No trap is taken, processor continues normal execution.

➋ FERR (fatal error). A 1 in the FERR column means that the processor will assert its ERROR output pin for this event. Detailed processor behavior is not specified. It is expected that the system will reset the processor.

➌ Priority: All priority entries work as follows. Associated with the AFAR and the AFSR.M_SYND and E_SYND fields is a separate stored “priority” for the data in the field. This priority is stored in the processor but is not visible to the programmer. When the AFSR is empty and no errors have been logged, the effective priority stored for each field is 0. Whenever an event to be logged in the AFSR or AFAR occurs, compare the priority specified for each field for that event to the priority stored internal to the processor for that field. If the priority for the field for the event is numerically higher than the priority stored internal to the processor for that field, then (a) update the field with the value appropriate for the event that has just occurred and (b) update the stored priority in the processor with the priority specified in the table for that field and new event. Note: This procedure implies that fields with a 0 priority in the above table are never stored for that event.

Also associated with each field in the AFAR and AFSR is a record of the AFSR status bit that caused that field to be recorded. Again, this record of status bit is not visible to the programmer. The record of the controlling status bit is updated whenever the field is updated according to the priority rules described in the previous paragraph. When the controlling status bit recorded for a particular field is cleared, the field is cleared and the stored priority for the field is set to 0.

This behavior can lead to odd situations. If first a UE occurs to capture E_SYND and then an EDU occurs, the EDU does not update E_SYND because it has the same priority as UE. Trap handler software clears AFSR.UE but leaves AFSR.EDU set. E_SYND will be unchanged but will be unfrozen for further error capture. A CE occurs, capturing E_SYND again. Despite the fact that AFSR.EDU has a higher priority than AFSR.CE, the E_SYND will be that for the AFSR.CE. A similar odd situation can occur with AFAR.
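The priority rule and the odd situation just described can be reproduced with a small software model. This is a sketch under stated assumptions, not a description of the hardware: the class `CapturedField` and its attribute names are hypothetical, the priorities are the E_SYND values from TABLE P-12, and the model assumes that clearing the controlling status bit leaves the field contents in place while resetting the stored priority to 0.

```python
# E_SYND priorities as listed in TABLE P-12 (0 means "never captured").
E_SYND_PRIORITY = {
    "UE": 2, "IVU": 2, "EDU": 2, "WDU": 2, "UCU": 2, "CPU": 2,
    "CE": 1, "IVC": 1, "EDC": 1, "WDC": 1, "UCC": 1, "CPC": 1,
}

class CapturedField:
    """Hypothetical model of one captured field (E_SYND, M_SYND, or AFAR)."""

    def __init__(self, priorities):
        self.priorities = priorities
        self.value = 0
        self.stored_priority = 0   # not visible to software on real hardware
        self.owner = None          # status bit that caused the last capture

    def log(self, status_bit, value):
        prio = self.priorities.get(status_bit, 0)
        # Capture only if the event's priority is strictly higher.
        if prio > self.stored_priority:
            self.value, self.stored_priority, self.owner = value, prio, status_bit

    def clear(self, status_bit):
        # Clearing the controlling bit unlocks the field; the value is
        # assumed to be retained until the next capture.
        if status_bit == self.owner:
            self.stored_priority, self.owner = 0, None

e_synd = CapturedField(E_SYND_PRIORITY)
e_synd.log("UE", 0x11C)    # captured: priority 2 > 0
e_synd.log("EDU", 0x071)   # ignored: same priority as UE
e_synd.clear("UE")         # unlocked; AFSR.EDU would still be set
e_synd.log("CE", 0x1C9)    # captured, although EDU outranks CE
```

After this sequence the modelled E_SYND holds the CE syndrome even though the EDU event was never cleared, which is the situation the paragraph above warns about.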

➍ PRIV: A 1 in the “set PRIV” column implies that the specified event will set the AFSR.PRIV bit if the PSTATE.PRIV bit is 1 at the time the event is detected (impl. dep. #213). A 0 implies that the event has no effect on the AFSR.PRIV bit. AFSR.PRIV accumulates the privilege status of the specified error events detected since the last time the AFSR.PRIV bit was cleared.

➎ ME: A 1 in the “set ME” column implies that the specified event will cause the AFSR.ME bit to be set if the status bit specified for that event is already set at the time the event happens. A 0 in the “set ME” column implies that multiple events do not cause the AFSR.ME bit to be set. AFSR.ME accumulates the multiple error status of the specified error events detected since the last time the AFSR.ME bit was cleared.

➏ Flushing: The “Flush” column contains a 0 if a D-cache flush is never needed for correctness. It contains a 1 if a D-cache flush is needed only if the read access is to a cacheable address. It contains a 2 if a D-cache flush is always needed. Note that, for some of these errors, an E-cache flush or a main memory update is desirable to eliminate errors still stored in E-cache or DRAM. However, the system does not need these to ensure that the data stored in the caches does not lead to undetected data corruption. The entries in the table deal only with data correctness.

D-cache flushes should not be needed for instruction_access_error traps, but, in the event of both an instruction and data error being found before the trap handler can be entered, an instruction_access_error routine may have to recover from a data error. It is simplest to invalidate the D-cache for both instruction_access_error and data_access_error traps.

P.6 Overwrite Policy

This section describes the overwrite policy for error status when multiple errors have occurred. Errors are captured in the order that they are detected, not necessarily in program order.

The overwrite policies are described by the “priority” entries in TABLE P-12 on page 214. The descriptions here set out the policies at length.

For the behavior when errors arrive at the same time as the AFSR is being cleared by software, see Clearing the AFSR on page 206.

P.6.1 AFAR Overwrite Policy

Class 4: UCU, UCC (the highest priority)

Class 3: UE, EDU, EMU, WDU, CPU

Class 2: CE, EDC, EMC, WDC, CPC

Class 1: TO, BERR (the lowest priority)

Priority for AFAR updates: (UCU, UCC)>(UE, EDU, EMU, WDU, CPU)>(CE, EDC, EMC, WDC, CPC)>(TO, BERR)

The physical address of the first error within a class ((UCU, UCC), (UE, EDU, EMU, WDU, CPU), (CE, EDC, EMC, WDC, CPC), (TO, BERR)) is captured in the AFAR until the associated error status bit is cleared in AFSR or an error from a higher-priority class occurs. A (CE, EDC, EMC, WDC, CPC) error overwrites prior (TO, BERR) errors. A (UE, EDU, EMU, WDU, CPU) error overwrites prior (CE, EDC, EMC, WDC, CPC), (TO, BERR) errors. A (UCU, UCC) error overwrites prior (UE, EDU, EMU, WDU, CPU), (CE, EDC, EMC, WDC, CPC), (TO, BERR) errors.
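The class comparison behind the AFAR overwrite policy can be sketched as below. The class numbers are the ones listed above; the function name `afar_overwrites` is hypothetical, and the sketch deliberately ignores both the unlocking of AFAR when a status bit is cleared and any special handling of errors signalled in the same clock cycle.

```python
# AFAR overwrite classes, as listed in this section (4 = highest priority).
AFAR_CLASS = {
    **dict.fromkeys(("UCU", "UCC"), 4),
    **dict.fromkeys(("UE", "EDU", "EMU", "WDU", "CPU"), 3),
    **dict.fromkeys(("CE", "EDC", "EMC", "WDC", "CPC"), 2),
    **dict.fromkeys(("TO", "BERR"), 1),
}

def afar_overwrites(new_bit: str, held_bit: str) -> bool:
    """True if an error logged as `new_bit` replaces the AFAR contents
    captured for `held_bit` (whose status bit is still set)."""
    return AFAR_CLASS[new_bit] > AFAR_CLASS[held_bit]
```

Within a class the first error wins, so `afar_overwrites("EDU", "UE")` is false in this model, while `afar_overwrites("UCC", "UE")` is true.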

There is one exception to the above AFAR overwrite policy. Whenever an error from system bus access (UE, EMU, CE, EMC, TO, or BERR) is signalled in the same clock cycle as an internal error (UCU, UCC, EDU, WDU, CPU, EDC, WDC, or CPC), the physical address of the system error will be saved in the AFAR, even though the trap executed may be (for example) fast_ECC_error. It is not possible for software to differentiate this event from the same errors arriving on different clock cycles, but the probability of having simultaneous errors is extremely low. This difficulty only applies to AFAR. AFSR fields do not suffer this confusion on simultaneously arriving errors, and the normal overwrite priorities apply there.

The policy of flushing the entire D-cache on a deferred data_access_error trap or a precise fast_ECC_error trap avoids problems with the AFAR showing an inappropriate address in the following situations:
■ Multiple errors occur.
■ Simultaneous E-cache and system bus errors occur.
■ An E-cache error is captured in AFSR and AFAR, yet no trap is generated because the event was a speculative instruction later discarded. Later, a trap finds the old AFAR.
■ A UE was detected in the first half of a 64-byte block from the system bus, but the second half of the 64-byte block, also in error, was loaded into the D-cache.

P.6.2 E_SYND Data ECC Syndrome Overwrite Policy

Class 2: UE, IVU, EDU, WDU, UCU, CPU (the highest priority)

Class 1: CE, IVC, EDC, WDC, UCC, CPC (the lowest priority)

Priority for E_SYND updates: (UE, IVU, EDU, WDU, UCU, CPU)>(CE, IVC, EDC, WDC, UCC, CPC)

The ECC syndrome of the first data ECC error within a class is captured in the AFSR.E_SYND field until the associated error status bit is cleared in the AFSR or an error from a higher-priority class occurs. A (UE, IVU, EDU, WDU, UCU, CPU) error overwrites prior (CE, IVC, EDC, WDC, UCC, CPC) errors.

P.6.3 M_SYND MTag ECC Syndrome Overwrite Policy

Class 2: EMU (the highest priority)

Class 1: EMC (the lowest priority)

Priority for M_SYND updates: (EMU)>(EMC)

The ECC syndrome of the first MTag ECC error within a class is captured in the AFSR.M_SYND field until the associated error status bit is cleared in the AFSR or an error from a higher-priority class occurs. An EMU error overwrites prior EMC errors.

P.7 Multiple Errors and Nested Traps

The AFSR.ME bit is set when there are multiple uncorrectable errors associated with the same sticky bit (multiple occurrences of ISAP, EMU, IVU, TO, BERR, UCU, CPU, WDU, EDU, or UE errors) in different data transfers. For example, one ISAP error and one EMU error will not set the ME bit, but two ISAP errors will.
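The accumulation rule for AFSR.ME can be sketched as follows. The set of status bits is the list given in the paragraph above; the function `log_event` and the use of a Python set to stand in for the AFSR sticky bits are assumptions made for illustration.

```python
# Status bits whose repeated occurrence sets AFSR.ME, per the list above.
SETS_ME = {"ISAP", "EMU", "IVU", "TO", "BERR", "UCU", "CPU", "WDU", "EDU", "UE"}

def log_event(sticky: set, status_bit: str) -> set:
    """Record one error event in a set standing in for the AFSR sticky bits."""
    # ME is set only when the same status bit was already set.
    if status_bit in SETS_ME and status_bit in sticky:
        sticky.add("ME")
    sticky.add(status_bit)
    return sticky

afsr_bits = set()
log_event(afsr_bits, "ISAP")
log_event(afsr_bits, "EMU")
me_after_different_errors = "ME" in afsr_bits   # one ISAP and one EMU
log_event(afsr_bits, "ISAP")
me_after_repeat = "ME" in afsr_bits             # second ISAP
```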

If multiple errors leading to the same trap type are reported before a trap is taken for any one of them, then only one trap will be taken for all those errors.

If multiple errors leading to different trap types are reported before a trap is taken for any one of them, then one trap of each type will be taken: for example, one instruction_access_error and one data_access_error.

Multiple errors occurring in separate correction words of a single transaction, whether an E-cache read or a system bus read, do not set the AFSR.ME bit.

P.8 Further Details on Detected Errors

This section includes more extensive description for detailed diagnosis. Memory errors include the following two classes:
■ E-cache Data ECC Error
■ Errors on the System Bus

Different errors within each class are logged accordingly in AFSR.

FIGURE P-1 is a simplified block diagram of the on-chip caches and the external interfaces; it illustrates the main data paths and the terminology used below for logging the different kinds of errors. TABLE P-12 on page 214 is the main reference for all aspects of each individual error. The descriptive paragraphs in the following sections clarify the key concepts.

[Figure: block diagram of the UltraSPARC III CPU chip (I-cache, D-cache, W-cache, block-load buffer, merge unit, E-cache control, writeback buffer, system interface and memory control) with the external E-cache SRAM, DRAM memory, UltraSPARC III data switch, and other CPUs, data switches, memory, and I/O controllers.]

FIGURE P-1 UltraSPARC III Data Paths

P.8.1 E-cache Data ECC Error

The following errors are described in this section: UCC, UCU, EDC, EDU, WDC, WDU, CPC, and CPU.

The UCC Error

When a cacheable fetch misses the I-cache, a cacheable load misses the D-cache, or an atomic operation is performed, and the access hits the E-cache, data is read from the E-cache SRAM and checked for the correctness of its ECC. If a single-bit error is detected, then the UCC bit is set to log this error condition. The error is software correctable. A precise fast_ECC_error trap is generated provided that the UCEEN bit of the E-cache Error Enable Register is set. For correctness, a software-initiated flush of the D-cache is required because the faulty word will already have been loaded into the D-cache and will be used if the trap routine retries the faulting instruction.

E-cache errors are not loaded into the I-cache, so there is no need to flush the I-cache.

A software-initiated E-cache flush is desirable so that the corrected data can be brought back from memory later. Without the E-cache flush, a further single-bit error is likely the next time this word is fetched from the E-cache.

Multiple occurrences of this error cause the AFSR.ME to be set.

In the event that the UCC event is for an instruction fetch that is later discarded without the instruction being executed, no trap will be taken.

The UCU Error

When a cacheable load misses the I-cache or D-cache, or an atomic operation misses the D-cache and hits the E-cache, an E-cache read is performed and the data read back from the E-cache SRAM is checked for the correctness of its ECC. If a multibit error is detected, it is recognized as an uncorrectable error and the UCU bit is set to log this error condition. A precise fast_ECC_error trap is generated provided that the UCEEN bit of the E-cache Error Enable Register is set.

For correctness, a software-initiated flush of the D-cache is required because the faulty word may already have been loaded into the D-cache and will be used without any error trap if the trap routine retries the faulting instruction.

Corrupt data are never stored in the I-cache.

A software-initiated flush of the E-cache is required if this event is not to recur the next time the word is fetched from E-cache. The flush may need to be linked with a correction of a multibit error in main memory if that also is corrupted.

Multiple occurrences of this error cause AFSR.ME to be set.

In the event that the UCU event is for an instruction fetch that is later discarded without the instruction being executed, no trap will be taken.

The EDC Error

The AFSR.EDC status bit is set in two ways: errors in block loads to the processor and errors in reading E-cache as the result of store merges into the E-cache.

When a block-load instruction misses the D-cache and hits the E-cache, an E-cache read is performed and the data read back from the E-cache SRAM is checked for the correctness of its ECC. If a single-bit error is detected, then the EDC bit is set to log this error condition. A disrupting ECC_error trap is generated in this case while hardware proceeds to correct the error.

The modified data from a store operation normally resides in W-cache first. There is no need to read the data from E-cache immediately. The W-cache consists of a set of modified byte values and a mask of modified bytes. When these data need to be evicted from W-cache, they go into the merge unit as the temporary staging area. Data in the E-cache need to be brought back to merge with the updated copy in the merge unit and then be scrubbed back to E-cache. As a debug feature, UltraSPARC III can be operated with W-cache turned off. In this case, every store will incur a read-modify-write of E-cache to ensure that the updated data from the store operation are always merged with the original data in E-cache.

When there is a need to merge the data in W-cache with the original data in E-cache because the W-cache line is being victimized by another store instruction, or whenever a store is executed when the W-cache is turned off, an E-cache read is performed and the data read back from the E-cache SRAM will be checked for the correctness of its ECC. If a single-bit error is detected, the EDC bit is set to log this error condition. A disrupting ECC_error trap is generated in this case while hardware proceeds to correct the error. The corrected data are then merged with the newly written data in the merge unit and scrubbed back to E-cache.

After the merge, the line in the E-cache will be correct. No more single-bit errors should be seen for this line unless a fault introduces a new one.

If this same hardware situation is detected as the result of a merge for a writeback or copyout event, the WDC or CPC bits are set, not the EDC bit.

If the entire 64-byte block is modified while the line is still present in the W-cache, then evicting this 64-byte block from the W-cache does not require an E-cache read. A Block Store hitting the E-cache operates in a similar fashion, as does a Block Store Commit operation. This behavior is a speed optimization in hardware and also minimizes the possibility of data corruption.
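The eviction rule above can be sketched as a small model. This is only an illustration, assuming the W-cache mask is represented as a 64-bit integer with one bit per byte; the names `needs_ecache_read` and `merge_line` are hypothetical and do not correspond to any hardware interface.

```python
LINE_BYTES = 64                      # line size stated in the text
FULL_MASK = (1 << LINE_BYTES) - 1    # every byte of the line modified


def needs_ecache_read(byte_mask: int) -> bool:
    """Return True if evicting a W-cache line with this mask needs an E-cache read.

    Bit i of byte_mask is 1 when byte i of the line was modified by a store.
    A fully modified line carries all 64 bytes itself, so no merge read is needed.
    """
    return (byte_mask & FULL_MASK) != FULL_MASK


def merge_line(ecache_line: bytes, wcache_bytes: bytes, byte_mask: int) -> bytes:
    """Overlay the modified W-cache bytes on the original E-cache line."""
    assert len(ecache_line) == LINE_BYTES and len(wcache_bytes) == LINE_BYTES
    return bytes(
        wcache_bytes[i] if (byte_mask >> i) & 1 else ecache_line[i]
        for i in range(LINE_BYTES)
    )
```

In this model, only a partially modified line exposes the store to an E-cache ECC error, which matches the observation that a fully modified line avoids the read.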

The EDU Error

The AFSR.EDU status bit is set in two ways: errors in block loads to the processor and errors in reading E-cache to merge data to complete store operations.

When a block load misses the D-cache and hits the E-cache, an E-cache read is performed and the data read back from the E-cache SRAM is checked for the correctness of its ECC. If a multibit error is detected, it is recognized as an uncorrectable error and the EDU bit is set to log this error condition. A deferred data_access_error trap is generated in this case, provided that the NCEEN bit is set in the E-cache Error Enable Register.

See The EDC Error for a description of the operation of the write merge mechanism. When a multibit error is detected in the original line data read from the E-cache to complete a write merge, it is recognized as an uncorrectable error and the EDU bit is set to log this error condition. A disrupting ECC_error trap is generated in this case, provided that the NCEEN bit is set in the E-cache Error Enable Register.

After the write merge operation, the line in the E-cache will be scrubbed with the new merged data. To indicate that the data are unusable, deliberately bad signalling ECC is scrubbed to each 128-bit word in the line that contained an uncorrectable error in the original E-cache line. Correct E-cache ECC is computed for the corrupt 128-bit word, then ECC check bits C<1:0> of the 128-bit word are inverted before the word is scrubbed back to the E-cache.

Uncorrectable ECC errors detected as the result of a write merge operation to complete a writeback or copyout do not set EDU, but set WDU or CPU instead.

If an E-cache uncorrectable error is detected as the result of either a write merge or a block load operation and the AFSR.EDU status bit is already set, then AFSR.ME will be set.
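The multiple-error rule can be illustrated with a short sketch. It models the AFSR as a set of bit names rather than a register image, because the bit positions are not given in this section; the function name `log_uncorrectable` is hypothetical.

```python
def log_uncorrectable(afsr: set, bit: str) -> set:
    """Log an uncorrectable-error status bit in a set-based model of the AFSR.

    If the same status bit is already set, the ME (multiple error) bit
    is set as well, as the text describes for EDU.
    """
    updated = set(afsr)
    if bit in updated:
        updated.add("ME")
    updated.add(bit)
    return updated
```

For example, logging "EDU" twice in succession leaves both EDU and ME set in the model.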

The WDC Error

For a writeback operation, when a modified line in the E-cache is being victimized to make way for a new line, an E-cache read is performed and the data read back from the E-cache SRAM is checked for the correctness of its ECC. The data read back from E-cache is put in the writeback buffer as the staging area for the writeback operation. If a single-bit error is detected, the WDC bit is set to log this error condition. A disrupting ECC_error trap is generated in this case, provided that the CEEN bit is set in the E-cache Error Enable Register. Hardware proceeds to correct the error, and corrected data are written out to memory through the system bus.

The WDU Error

For a writeback operation, an E-cache read is performed and the data read back from the E-cache SRAM are checked for the correctness of their ECC. The data read back from E-cache are put in the writeback buffer as the staging area for the writeback operation. When a multibit error is detected, it is recognized as an uncorrectable error and the WDU bit is set to log this error condition. A disrupting ECC_error trap is generated in this case, provided that the NCEEN bit is set in the E-cache Error Enable Register.

When the processor reads uncorrectable E-cache data and writes it to the system bus in a writeback operation, it computes correct system bus ECC for the corrupt data, then inverts bits <127:126> of the data to signal to other devices that the data are not usable.

Multiple occurrences of this error cause AFSR.ME to be set.

The CPC Error

For a copyout operation to serve a snoop request from another chip, an E-cache read is performed and the data read back from the E-cache SRAM are checked for the correctness of their ECC. If a single-bit error is detected, then the CPC bit is set to log this error condition. A disrupting ECC_error trap is generated in this case, provided that the CEEN bit is set in the E-cache Error Enable Register. Hardware proceeds to correct the error, and corrected data are sent to the snooping device.

This bit is not set if the copyout happens to hit in the writeback buffer because the line is being victimized. Instead, the WDC bit is set. See The WDC Error for an explanation.

The CPU Error

For a copyout operation, an E-cache read is performed and the data read back from the E-cache SRAM are checked for the correctness of their ECC. If a multibit error is detected, it is recognized as an uncorrectable error and the CPU bit is set to log this error condition. A disrupting ECC_error trap is generated in this case, provided that the NCEEN bit is set in the E-cache Error Enable Register.

Multiple occurrences of this error cause AFSR.ME to be set.

When the processor reads uncorrectable E-cache data and writes it to the system bus in a copyback operation, it computes correct system bus ECC for the corrupt data, then inverts bits <127:126> of the data to signal to other devices that the data are not usable.

This bit is not set if the copyout hits in the writeback buffer. Instead, the WDU bit is set. See The WDC Error for an explanation of this behavior.

If a copyout needs to merge data from W-cache and detects a UE in E-cache as the result of the merge, C<1:0> is inverted in the data written back to the E-cache and D<127:126> is inverted in the data written to the system bus.
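The E-cache cases in this section can be summarized as a lookup table. The sketch below is one possible encoding of the rules stated above; the operation labels and the function name `ecache_status_bit` are hypothetical.

```python
# (operation, severity) -> AFSR status bit, as described in this section.
ECACHE_ERROR_BITS = {
    ("block_load", "single"): "EDC",
    ("store_merge", "single"): "EDC",
    ("block_load", "multi"): "EDU",
    ("store_merge", "multi"): "EDU",
    ("writeback", "single"): "WDC",
    ("writeback", "multi"): "WDU",
    ("copyout", "single"): "CPC",
    ("copyout", "multi"): "CPU",
}


def ecache_status_bit(operation: str, severity: str,
                      hits_writeback_buffer: bool = False) -> str:
    """Return the AFSR bit logged for an E-cache ECC error.

    A copyout that hits in the writeback buffer is logged as a
    writeback error (WDC or WDU) instead of CPC or CPU.
    """
    if operation == "copyout" and hits_writeback_buffer:
        operation = "writeback"
    return ECACHE_ERROR_BITS[(operation, severity)]
```

The table covers only which status bit is logged. Trap type and the CEEN or NCEEN gating are described in the individual subsections.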

P.8.2 System Bus ECC Errors

The following errors are described in this section: CE, UE, EMC, EMU, IVC, and IVU.

The CE Error

When data are entering the UltraSPARC III chip from the system bus, the data are checked for the correctness of their ECC. If a single-bit error is detected, the CE bit is set to log this error condition. A disrupting ECC_error trap is generated in this case if the CEEN bit is set in the E-cache Error Enable Register. Hardware will correct the error.

The UE Error

When data are entering the UltraSPARC III chip from the system bus, the data are checked for the correctness of their ECC. If a multibit error is detected, it is recognized as an uncorrectable error and the UE bit is set to log this error condition. Provided that the NCEEN bit is set in the E-cache Error Enable Register, a deferred instruction_access_error or data_access_error trap is generated, depending on whether the read was to satisfy an instruction fetch or a load operation.

Multiple occurrences of this error cause the AFSR.ME to be set.

The EMC Error

When data are entering the UltraSPARC III chip from the system bus, the MTags are checked for the correctness of ECC. If a single-bit error is detected, then the EMC bit is set to log this error condition. A disrupting ECC_error trap is generated, provided that the CEEN bit is set in the E-cache Error Enable Register. Hardware will correct the error.

MTags are also checked for ECC correctness on noncacheable reads and interrupt vector fetch operations and will generate a trap if there is an ECC error.

MTag ECC is checked for an interrupt vector fetch operation. For a hw_corrected MTag error, AFSR.EMC will be set, AFAR captured, and an ECC_error trap generated.

The EMU Error

When data are entering the UltraSPARC III chip from the system bus, the MTags are checked for the correctness of ECC. If a multibit error is detected, it is recognized as an uncorrectable error and the EMU bit is set to log this error condition. Provided that the NCEEN bit is set in the E-cache Error Enable Register, a deferred instruction_access_error or data_access_error trap is generated, depending on whether the read was to satisfy an instruction fetch or a load operation.

Multiple occurrences of this error cause the AFSR.ME to be set.

MTags are also checked for ECC correctness on noncacheable reads and interrupt vector fetch operations and will generate a trap if there is an ECC error. On an interrupt-vector-fetch uncorrectable MTag ECC error, AFSR.EMU is set and a deferred instruction_access_error or data_access_error trap is generated. In addition, the received interrupt vector is stored in the interrupt receive registers, and an interrupt_vector trap is taken. This may all be moot, because the processor will assert its ERROR output pin.

The IVC Error

When interrupt vector data are entering the UltraSPARC III chip from the system bus, the data are checked for ECC correctness. If a single-bit error is detected, then the IVC bit is set to log this error condition. A disrupting ECC_error trap is generated, provided that the CEEN bit is set in the E-cache Error Enable Register. Hardware will correct the error.

The IVU Error

When interrupt vector data are entering the UltraSPARC III chip from the system bus, the data are checked for ECC correctness. If a multibit error is detected, it is recognized as an uncorrectable error and the IVU bit is set to log this error condition. A disrupting ECC_error trap is generated, provided that the NCEEN bit is set in the E-cache Error Enable Register.

Multiple occurrences of this error cause AFSR.ME to be set.

A multibit error in received interrupt vector data still causes the data to be stored in the interrupt receive registers but does not cause an interrupt_vector disrupting trap.
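The six system bus errors follow a regular pattern that can be sketched as a classifier. This is an illustrative summary of the subsections above, not a description of the hardware logic; `BusErrorReport` and `classify_system_bus_error` are hypothetical names, and the `field` labels are chosen for this example only.

```python
from typing import NamedTuple


class BusErrorReport(NamedTuple):
    afsr_bit: str     # status bit logged in the AFSR
    enable_bit: str   # bit of the E-cache Error Enable Register that gates the trap
    trap: str         # trap generated when the enable bit is set


def classify_system_bus_error(field: str, multibit: bool,
                              instruction_fetch: bool = False) -> BusErrorReport:
    """Classify an ECC error on incoming system bus data.

    field is "data", "mtag", or "interrupt_vector" (labels assumed here).
    """
    enable = "NCEEN" if multibit else "CEEN"
    if field == "interrupt_vector":
        # Both IVC and IVU produce a disrupting ECC_error trap.
        return BusErrorReport("IVU" if multibit else "IVC", enable, "ECC_error")
    if field not in ("data", "mtag"):
        raise ValueError(f"unknown field: {field}")
    if multibit:
        bit = "UE" if field == "data" else "EMU"
        # Deferred trap; the type depends on what the read was satisfying.
        trap = "instruction_access_error" if instruction_fetch else "data_access_error"
    else:
        bit = "CE" if field == "data" else "EMC"
        trap = "ECC_error"
    return BusErrorReport(bit, enable, trap)
```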

P.9 Further Details of ECC Error Processing

Further details of ECC error processing include the following topics:
■ System bus ECC error detection and injection
■ External cache ECC errors
■ Occasions when traps are taken
■ Occasions when interrupts are taken
■ Error barriers

P.9.1 System Bus ECC Error Detection

For incoming data from the system bus, ECC error checking is turned on when the system bus DSTAT bits indicate that the data are valid and ECC has been generated for this data. ECC is not checked for system bus data that return with DSTAT = 1, 2, or 3.

ECC errors may occur in either the data or MTag field. UltraSPARC III can store only one data ECC syndrome and one MTag ECC syndrome for every 64 bytes of incoming data even though it does detect errors in every 16 bytes of data.

The syndrome of the first ECC error detected, whether hw_corrected or uncorrectable, is saved in an internal error register.

If the first occurrence of an ECC error is uncorrectable, then the error register is locked and all subsequent errors within the 64-byte block are ignored.

If the first occurrence of an ECC error is hw_corrected, then subsequent correctable errors within the 64-byte block are corrected but not logged. A subsequent uncorrectable error will overwrite the syndrome of the correctable error. At this point, the error register is locked.
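The capture and locking rules for the internal error register can be expressed as a small state machine. The following is a sketch only; the class name `SyndromeRegister` and its fields are hypothetical, and the syndrome values passed in are arbitrary integers.

```python
class SyndromeRegister:
    """Model of the single syndrome slot kept per 64-byte system bus block."""

    def __init__(self):
        self.syndrome = None        # no error captured yet
        self.uncorrectable = False
        self.locked = False

    def observe(self, syndrome: int, uncorrectable: bool) -> None:
        """Record one error found while checking a 16-byte portion of the block."""
        if self.locked:
            return                  # register locked: later errors are ignored
        if self.syndrome is None:
            # First error of the block is always captured.
            self.syndrome = syndrome
            self.uncorrectable = uncorrectable
            self.locked = uncorrectable
        elif uncorrectable:
            # An uncorrectable error overwrites an earlier correctable one, then locks.
            self.syndrome = syndrome
            self.uncorrectable = True
            self.locked = True
        # Otherwise: a later correctable error is corrected but not logged.
```

In this model, the sequence correctable, correctable, uncorrectable, uncorrectable leaves the register holding the syndrome of the first uncorrectable error.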

P.9.2 System Bus ECC Error Injection

Not only does UltraSPARC III perform ECC checking for incoming system bus data, it also generates ECC check bits for outgoing system bus data. A problem occurs when new ECC check bits are generated for data that contain an uncorrectable error (ECC or bus error). With new ECC check bits, there is no way to detect that the original data were bad.

To fix this problem without the need to add an additional interface, a new uncorrectable ECC error is injected into the data after the new ECC check bits have been generated. In this way, the receiver will detect the uncorrectable error when it performs its own ECC checking. The deliberately bad ECC is known as a signalling ECC.

For BERR events coming from the system bus and being stored with deliberately bad signalling ECC in the E-cache, an uncorrectable error is injected by inverting data bits <1:0> after correct ECC is generated for the corrupt data.

For a TO event coming from the system bus, the data and ECC values present on the system bus at the time that the TO is detected are stored unchanged in the E-cache. Any result can be returned when the affected E-cache line is read.

For UE events coming from the system bus, the data and ECC values present on the system bus are stored unchanged in the E-cache. An uncorrectable error should be returned when the E-cache line is read but the syndrome is not defined.

For uncorrectable ECC errors detected in copyout or writeback data from the E-cache, an uncorrectable error is injected into outgoing data by inversion of the data bits <127:126> after correct ECC is generated for the corrupt data.

For uncorrectable ECC errors detected in an E-cache read to complete a store merge event associated with a store instruction, ECC check bits <1:0> are inverted in the data scrubbed back to the E-cache.

A line that arrives as BERR and is stored in the E-cache with data bits <1:0> inverted can then be rewritten as part of a store merge operation with check bits <1:0> inverted and can eventually be written back out to the system bus with data bits <127:126> inverted.

The E_SYND reported for correction words with data bits <1:0> inverted is always 11C₁₆. The E_SYND reported for correction words with data bits <127:126> inverted is always 071₁₆. The E_SYND reported for correction words with check bits <1:0> inverted is always 003₁₆.
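The three signalling patterns can be sketched as bit operations on a 128-bit word and its check bits. This is a minimal illustration, assuming the correct ECC has already been computed elsewhere (the ECC generator is not modeled); the pattern names and `inject_signalling` are hypothetical. The E_SYND values are those stated in the text.

```python
# Bit masks for the three signalling patterns described in the text.
DATA_LOW_MASK = 0b11            # data bits <1:0>: BERR lines stored in the E-cache
DATA_HIGH_MASK = 0b11 << 126    # data bits <127:126>: copyout and writeback data
CHECK_LOW_MASK = 0b11           # check bits C<1:0>: store merge scrubbed to the E-cache

# E_SYND value the text reports for each pattern (hexadecimal).
E_SYND = {"data_low": 0x11C, "data_high": 0x071, "check_low": 0x003}


def inject_signalling(data: int, check: int, pattern: str) -> tuple:
    """Apply one signalling pattern to a 128-bit word and its check bits.

    `check` is assumed to be the correct ECC already generated for `data`.
    """
    if pattern == "data_low":
        return data ^ DATA_LOW_MASK, check
    if pattern == "data_high":
        return data ^ DATA_HIGH_MASK, check
    if pattern == "check_low":
        return data, check ^ CHECK_LOW_MASK
    raise ValueError(f"unknown signalling pattern: {pattern}")
```

Because each pattern is a fixed inversion applied after correct ECC generation, the resulting syndrome does not depend on the data, which is consistent with the constant E_SYND values listed above.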

P.9.3 E-cache ECC Errors

ECC error checking for E-cache data is turned on for all read transactions from the E-cache whenever the EC_ECC_ENABLE bit of the E-cache Error Status Register is asserted.

UltraSPARC III can store only one data ECC syndrome and one MTag ECC syndrome for every 32 bytes of E-cache data even though it is possible to detect errors in every 16 bytes of data.

The syndrome of the first ECC error detected, whether hw_corrected, sw_correctable, or uncorrectable, is saved in an internal error register.

If the first occurrence of an ECC error is uncorrectable, then the internal error register is locked and all subsequent errors within the 32 bytes are ignored.

If the first occurrence of an ECC error is correctable, then subsequent correctable errors within the 32 bytes are ignored. A subsequent uncorrectable error will overwrite the syndrome of the correctable error stored in the internal error register. At this point, the internal error register is locked. This applies to both hw_corrected and sw_correctable errors.

The internal error register is cleared and unlocked once the error has been logged in the AFSR.

P.9.4 When Are Traps Taken?

Precise traps such as fast_instruction_access_MMU_miss and fast_ECC_error are taken explicitly when the instruction with the problem is executed.

Disrupting and deferred traps are not associated with particular instructions. In fact, the processor takes these traps only when a valid instruction that will definitely be executed (not discarded as a result of speculation) is moving through particular internal pipelines.

TABLE P-13 lists, for each trap type, the pipelines in which a valid instruction must be present before trap processing is initiated.

TABLE P-13 Traps and When They Are Taken

Type of Trap | Initiate Trap Processing When a Valid Instruction Is In
instruction_access_error | (See note)
data_access_error | BR or MS pipe
interrupt_vector | BR, MS, A0 or A1 pipe (but see note)
ECC_error | BR or MS pipe
interrupt_level_n | BR, MS, A0 or A1 pipe (but see note)

Note – instruction_access_error events and many precise traps produce instructions that cannot be executed. So, the instruction fetcher dispatches the affected instruction to the BR pipe, specially marked to cause a trap. It is true to say that instruction_access_error traps are taken only when an instruction is in the BR pipe, but this restriction has no practical effect, because the error itself places an instruction in the BR pipe. So, instruction_access_error traps are taken as soon as the instruction fetcher dispatches the faulty instruction.

See P.9.5, When Are Interrupts Taken?, for additional information on taking interrupt traps.

TABLE P-13 specifies when the processor will “initiate trap processing,” meaning that the processor will consider what trap to take. When the processor has initiated trap processing, it will take one of the outstanding traps, but not necessarily the one that caused the trap processing to be initiated.

When the processor encounters an event that should lead to a trap, that trap type becomes pending. The processor continues to fetch and execute instructions until a valid instruction is issued to a pipe that is sensitive to a trap that is pending. During this time, further traps may become pending.

When the next committed instruction is issued to a pipe that is sensitive to any pending trap, the following events occur.

1. The processor ceases to issue instructions in the normal processing stream.

2. If the pending trap is a precise or disrupting trap, then the processor waits for the completion of all the system bus reads that have already started. The processor does not wait if a deferred trap will be taken. During this waiting time, more traps can become pending.

3. The processor takes the highest-priority pending trap. If a deferred trap (data_access_error or instruction_access_error) is pending, then the processor begins execution of either the data_access_error or instruction_access_error trap code. If both data_access_error and instruction_access_error traps are pending, the instruction_access_error trap is taken because it is higher-priority, as listed in Table 7-4 in Commonality. Taking a data_access_error trap clears the pending status for data access errors. Taking an instruction_access_error trap clears the pending status for instruction access errors and has no effect on the pending status of data access errors. Because of the priorities as shown in Table 7-4 (Commonality), there cannot be an instruction_access_error trap pending at the time a data_access_error trap is taken. Taking any trap makes all precise traps no longer pending.

The preceding description implies that a pipe has to have a valid instruction to initiate trap handling, but once trap handling is initiated, any of the pending traps can be taken, not just ones to which that pipe is sensitive. So, if the processor is executing A0 pipe instructions and a data_access_error is pending but cannot be taken, an interrupt_vector can arrive and enable the data_access_error trap to be executed even though only A0 pipe instructions are present.
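The pipe sensitivity rules of TABLE P-13 can be sketched as a lookup. This is an illustration only; `SENSITIVE_PIPES` and `initiates_trap_processing` are hypothetical names. The entry for instruction_access_error reflects the note that the faulty instruction is itself dispatched to the BR pipe.

```python
# Trap type -> pipes in which a valid instruction initiates trap processing.
SENSITIVE_PIPES = {
    "instruction_access_error": {"BR"},
    "data_access_error": {"BR", "MS"},
    "ECC_error": {"BR", "MS"},
    "interrupt_vector": {"BR", "MS", "A0", "A1"},
    "interrupt_level_n": {"BR", "MS", "A0", "A1"},
}


def initiates_trap_processing(pipe: str, pending: set) -> bool:
    """Return True if a valid instruction in `pipe` initiates trap processing.

    Processing starts when the pipe is sensitive to at least one pending trap.
    Once started, any pending trap may be taken, not only the one that
    the pipe is sensitive to.
    """
    return any(pipe in SENSITIVE_PIPES.get(trap, set()) for trap in pending)
```

With only data_access_error pending, an A0 pipe instruction does not initiate trap processing in this model. Adding a pending interrupt_vector makes the same A0 instruction sufficient, which is the situation described in the preceding paragraph.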

If a data_access_error trap becomes pending but cannot be taken because neither the BR nor MS pipe has a valid instruction, the processor continues to fetch and execute instructions. If an instruction_access_error trap then becomes pending, the offending instruction will be specially issued to the BR pipe to allow trap processing to be initiated. The processor then will examine both the pending traps and take the instruction_access_error trap, say, at TL = 1, because it is higher priority. The data_access_error remains pending. When the first BR or MS pipe instruction is executed in the instruction_access_error trap routine, the data_access_error trap routine will run, say, at TL = 2.

This is an odd situation. Despite the fact that the data_access_error has lower priority than the instruction_access_error trap, the data_access_error trap routine runs at a higher TL, within an enclosing instruction_access_error trap routine and before the bulk of that routine. This is the opposite of the usual effect that interrupt priorities have.

The result of this situation is that at the time that the trap handler begins, only one data_access_error trap is executed for all data access errors that have been detected by this time, and only one instruction_access_error trap is executed for all instruction access errors.

Processor action is always determined by the trap priorities listed in Table 7-4 in Commonality, except for one special case, and that is for a precise fast_ECC_error trap pending at the same time as a deferred data_access_error or instruction_access_error trap. In this one case only, the higher-priority deferred trap will be taken and the precise trap will no longer be pending.

If a deferred trap is taken while a precise trap is pending, that precise trap will no longer be pending. So, a data_access_error trap routine might see multiple events logged in AFSR associated with both data and instruction access errors and might also see an E-cache error event logged. The E-cache error event would normally be associated with a precise trap, but the deferred trap happened to arrive at the same time and make the precise trap no longer pending. If the deferred trap routine executes RETRY (an unlikely event in itself), then the precise trap may become pending again, but that would depend on the E-cache giving the same error again.

Pending disrupting traps are affected by the current state of PSTATE.IE and PIL. All the disrupting traps, interrupt_vector, ECC_error, and interrupt_level_[1-15] are temporarily inhibited (that is, their pending status is hidden) when PSTATE.IE is 0. Interrupt_level_[1-15] traps are also temporarily inhibited when PIL is greater than or equal to the interrupt level.
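The masking of disrupting traps by PSTATE.IE and PIL can be sketched as a filter over the pending traps. This is a simplified model; the function name `visible_disrupting` and the split into named traps and numeric interrupt levels are assumptions made for the example.

```python
def visible_disrupting(pending_traps: set, pending_levels: set,
                       pstate_ie: bool, pil: int) -> tuple:
    """Return the disrupting traps that are not currently inhibited.

    pending_traps holds names such as "interrupt_vector" and "ECC_error".
    pending_levels holds pending interrupt_level_n numbers (1 to 15).
    Inhibited traps stay pending; they are only hidden from selection.
    """
    if not pstate_ie:
        return set(), set()      # PSTATE.IE = 0 hides every disrupting trap
    # interrupt_level_n is also hidden when PIL >= n.
    levels = {level for level in pending_levels if level > pil}
    return set(pending_traps), levels
```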

As an example, consider the following (highly unlikely!) sequence of events.

1. An interrupt vector arrives with a hw_corrected system bus data ECC error. This makes ECC_error and interrupt_vector traps pending.

The processor continues to execute instructions looking for a BR, MS, A0, or A1 pipe instruction.

2. The processor executes a system bus read, the RTO associated with an earlier store instruction, and detects an uncorrectable system bus data ECC error. This makes a data_access_error trap pending.

The processor continues executing instructions looking for a BR, MS, A0, or A1 pipe instruction.

3. The processor reads an instruction from the E-cache and detects an E-cache data ECC error. This makes a precise fast_ECC_error trap pending.

4. An earlier prefetch from the system bus by the instruction fetcher, of an instruction now known not to be used, completes. This instruction has a UE, which makes an instruction_access_error pending.

The instruction fetcher dispatches the corrupt instruction, specially marked, in the BR pipe. Because the processor can now take a trap, it inhibits further instruction execution and waits for outstanding system bus reads to complete. When the reads have completed, the processor then examines the various pending traps and begins to execute the deferred instruction_access_error trap, because deferred traps are handled before fast_ECC_error (as a special case) and instruction_access_error has the higher priority of the two pending deferred traps. This makes the instruction_access_error trap and all precise traps no longer pending.

The processor takes only one trap at a time. It begins executing the instruction_access_error trap by fetching the exception vector and executing the instruction there. As part of SPARC V9 trap processing, the processor clears PSTATE.IE, so the ECC_error and interrupt_vector traps cannot be taken at the moment and so are no longer pending (although they’re still remembered, just hidden). The processor will now be running with TL = 1.

When the instruction_access_error trap routine executes a BR or MS pipe instruction, the data_access_error trap routine will run at TL = 2. If that routine explicitly set PSTATE.IE, then the interrupt_vector and ECC_error traps would become pending again. After the next BR, MS, A0, or A1 pipe instruction, the processor, after waiting for outstanding reads to complete, would take the interrupt_vector trap, which would run at TL = 3.

However, assuming the data_access_error trap routine does not set PSTATE.IE, then it will run uninterrupted to completion at TL = 2. It’s a deferred trap, so it’s not possible to return to the TL = 1 routine correctly. Recovery at this time is a matter for the system designer.

P.9.5 When Are Interrupts Taken?

The processor is only sensitive to interrupt_vector and interrupt_level_[1-15] traps when a valid instruction is in the BR, MS, A0, or A1 pipes. If the processor is executing only FGM or FGA pipe instructions, it will not take the interrupt. This behavior could lead to unacceptably long delays in interrupt processing.

The processor takes special action to avoid this problem when all the following conditions are true:
■ PSTATE.PEF = 1
■ FPRS.FEF = 1
■ An interrupt_vector or interrupt_level_[1-15] trap becomes pending
■ PSTATE.IE = 1
■ (For interrupt_level_[1-15] traps) PIL is less than the pending interrupt level

When this situation occurs and the processor detects that none of the approximately 12 instructions waiting to be dispatched can take the interrupt because they are all FGM or FGA pipe operations, then the processor takes a precise fp_disabled trap on one of the floating-point instructions. This behavior occurs despite the fact that PSTATE.PEF and FPRS.FEF are both still 1.

It is the responsibility of the fp_disabled trap routine to detect this situation and then to execute a BR, MS, A0, or A1 pipe operation while PSTATE.IE is 1, to ensure that any pending interrupt is taken. The fp_disabled trap routine can then execute a RETRY instruction to return to the original code.

It is possible that there will be some conditions where the processor has already handled the interrupt by the time the fp_disabled trap routine is reached, and the fp_disabled trap routine should not be disturbed if no interrupt occurs while it is running. It is also possible that this mechanism is triggered unexpectedly, when static analysis of the program would not lead one to believe it should come into play. Trap handler software should cope with these situations.

This special handling does not apply to ECC_error disrupting traps. These traps can be postponed by executing only A0 or A1 pipe instructions and FGM and FGA. The detection scheme does not check for A0 and A1 instructions and does not check for pending ECC_error traps, so it is possible to write programs that will postpone ECC_error traps for a long time. However, these traps are not critical to system operation in the way that routine interrupts are.

The special handling of interrupts in the presence of floating-point operations is enabled by the DCR.IFPOE bit (see 33). When this bit is 0, no fp_disabled traps will be generated if the floating point unit is enabled. When this bit is 1, an interrupt that cannot be otherwise delivered will result in an fp_disabled trap.
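The conditions under which the processor forces an fp_disabled trap can be collected into one predicate. This is a sketch of the rules in this section, not of the detection hardware; the function name and parameters are hypothetical, and requiring a non-empty instruction queue is an assumption of this model (the trap is taken on one of the waiting floating-point instructions).

```python
FP_PIPES = {"FGM", "FGA"}


def forces_fp_disabled(pef: bool, fef: bool, ie: bool, ifpoe: bool, pil: int,
                       vector_pending: bool, pending_levels: set,
                       queued_pipes: list) -> bool:
    """Return True if an undeliverable interrupt would force an fp_disabled trap.

    pef, fef, ie, ifpoe model PSTATE.PEF, FPRS.FEF, PSTATE.IE, and DCR.IFPOE.
    queued_pipes lists the target pipe of each instruction waiting to be dispatched.
    """
    if not (pef and fef and ie and ifpoe):
        return False
    # An interrupt_level_n trap counts only when PIL is less than its level.
    deliverable = vector_pending or any(level > pil for level in pending_levels)
    if not deliverable:
        return False
    # Every waiting instruction must be an FGM or FGA pipe operation.
    return bool(queued_pipes) and all(pipe in FP_PIPES for pipe in queued_pipes)
```

A pending ECC_error trap is deliberately absent from the inputs, since the text states that the detection scheme does not check for it.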

P.9.6 Error Barriers

A MEMBAR #Sync instruction causes the processor to wait until all system bus reads are complete and the store queue is empty before continuing. Stores will have completed any system bus activity (including any RTO for a cacheable store), but the store data may still be in the W-cache and may not have reached the E-cache.

A DONE or RETRY instruction behaves exactly like a MEMBAR #Sync instruction for the purpose of error isolation. The processor waits for outstanding data loads or stores to complete before continuing.

Traps do not serve as error barriers in the way that MEMBAR #Sync does.

User code can issue a store instruction that misses in the D-cache and E-cache and so will generate a system bus RTO operation. After the store instruction, the user code can go on to trap into privileged code through an explicit software trap instruction, a TLB miss, a spill or fill trap, or an arithmetic exception (such as a floating-point trap). None of these trap events wait for the user code’s pending stores to be issued, let alone to complete. The processor’s store queue can hold several outstanding stores, any or all of which can require system bus activity that can lead to a deferred data_access_error trap. These stores can issue and complete on the system bus after a trap has changed PSTATE.PRIV to 1, and errors as the result of the stores can be logged with AFSR.PRIV = 1, despite the fact that they come from user code.
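The privilege anomaly for stores can be illustrated with a toy model of the store queue. This is a deliberately simplified sketch, assuming that an error is logged with whatever PSTATE.PRIV holds when the store completes on the system bus; the class `StoreQueueModel` and its methods are hypothetical.

```python
class StoreQueueModel:
    """Toy model: errors are tagged with PSTATE.PRIV at completion time."""

    def __init__(self):
        self.pstate_priv = 0
        self.queue = []    # (privilege when issued, whether the store will fail)
        self.logged = []   # (privilege when issued, AFSR.PRIV as logged)

    def store(self, will_fail: bool = False) -> None:
        """Queue a store; it does not touch the system bus yet."""
        self.queue.append((self.pstate_priv, will_fail))

    def trap(self) -> None:
        """Enter privileged code. A trap does not drain the store queue."""
        self.pstate_priv = 1

    def drain(self) -> None:
        """Complete all queued stores on the system bus."""
        for issued_priv, will_fail in self.queue:
            if will_fail:
                self.logged.append((issued_priv, self.pstate_priv))
        self.queue.clear()
```

A failing user store followed by `trap()` and then `drain()` records `(0, 1)` in this model: the store was issued by user code, yet the error is logged as privileged.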

It happens that the detailed timing behavior of the processor prevents the same anomaly with load operations. Uncorrectable errors on user load operations will always present a deferred data_access_error trap with AFSR.PRIV = 0.

It is possible to use a MEMBAR #Sync instruction near the front of all trap routines and to handle specially deferred traps that occur there, to properly isolate user-space hardware faults from system-space faults. However, the cost in execution time for the error-free case is significant, and programmers may decide that the additional small gain in possible system availability is not worth the cost in throughput. If there is no MEMBAR #Sync, the effect will be that a very small fraction of errors that might perhaps have terminated only one user process will, instead, result in a reboot of the affected coherence domain.

Neither traps nor explicit MEMBAR #Sync instructions provide a barrier for explicit PREFETCH operations. PREFETCH instructions that are executed before a trap or MEMBAR #Sync can cause system bus activity after the trap or MEMBAR #Sync. Most errors from PREFETCH operations are ignored, specifically to avoid problems with system bus activity, but, when the prefetched data are fetched by an RTSR system bus operation in an SSM system, errors cannot be ignored, and a deferred trap caused by an uncorrectable MTag ECC error as the result of a PREFETCH operation using an RTSR transaction in an SSM system may be presented after a MEMBAR #Sync.

Disrupting and precise traps do not act as though a MEMBAR #Sync was executed at the time of the trap. The reason is that the disrupting and precise traps wait for all reads that have been issued on the system bus to be complete, but not for the store queue to be empty, before beginning to execute the trap.

If several stores are present in the store queue, each store can potentially result in a data_access_error trap as a result of a system bus problem. Because deferred trap processing does not wait for all stores to complete, the data_access_error trap routine can start as soon as an error is detected as the result of the first store. (Execution of the trap routine still may be delayed until the right pipe includes a valid instruction, though). Once the data_access_error routine has started, a further store from the original store queue can result in system bus activity that eventually returns an error and causes another data_access_error trap to become pending. This can (once the correct pipe has a valid instruction in it) start another data_access_error trap routine, at TL = 2. This process can continue until all available trap levels are exhausted and the processor begins RED_state execution.

Entering RED_state has no effect on the contents of the store queue, W-cache, or prefetch queue. If RED_state has been reached because of repeated error events as described above, a further trap can occur from RED_state when TL already equals MAXTL. This situation will cause the processor to take a watchdog reset. The watchdog reset does disable the prefetch queue but has no effect on the W-cache. However, W-cache entries will not be flushed, because the processor will not be doing cacheable accesses. The watchdog reset dumps any remaining contents of the store queue. The value of AFSR is unchanged on a watchdog reset, so diagnosis code can deduce a reason for the event.
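The escalation described above, from repeated data_access_error traps through RED_state to a watchdog reset, can be sketched as a small model. This is an illustration only, not a description of the hardware: it assumes that each failing store still in the store queue raises exactly one further trap, that a trap taken at TL = MAXTL - 1 enters RED_state, and that MAXTL is passed in by the caller. The names nest_deferred_traps, trap_result, and trap_outcome are hypothetical.

```c
#include <assert.h>

/* Hypothetical outcome labels for this sketch. */
enum trap_outcome { OUTCOME_NORMAL, OUTCOME_RED_STATE, OUTCOME_WATCHDOG_RESET };

struct trap_result {
    int tl;                      /* trap level reached */
    enum trap_outcome outcome;
};

/*
 * Model: every failing store raises one data_access_error trap, and each
 * trap routine starts before the previous one returns, so TL only rises.
 * Assumptions: a trap taken at TL == max_tl - 1 enters RED_state, and a
 * trap arriving when TL == max_tl causes a watchdog reset.
 */
struct trap_result nest_deferred_traps(int start_tl, int failing_stores, int max_tl)
{
    struct trap_result r = { start_tl, OUTCOME_NORMAL };

    assert(max_tl >= 1 && start_tl >= 0 && start_tl < max_tl);
    assert(failing_stores >= 0);

    for (int i = 0; i < failing_stores; i++) {
        if (r.tl == max_tl) {                /* no trap level left */
            r.outcome = OUTCOME_WATCHDOG_RESET;
            break;
        }
        r.tl++;                              /* next trap routine starts */
        if (r.tl == max_tl)
            r.outcome = OUTCOME_RED_STATE;   /* trap was taken at max_tl - 1 */
    }
    return r;
}
```

With two failing stores and a starting TL of 0, the model stops at TL = 2, matching the second trap routine mentioned in the text. With enough failing stores it reports RED_state and then a watchdog reset.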

P.10 UltraSPARC III Behavior Under Asynchronous Error Conditions

The tables in this section describe the behavior of UltraSPARC III under various asynchronous error conditions—errors encountered when accessing the external cache or errors encountered on the system bus. In the tables, each table entry describes a specific condition under which an asynchronous error has been encountered. The detailed actions taken by the processor under each error condition are provided to illustrate the behavior of the processor.

P.10.1 External Cache Access Errors

This section presents information about E-cache errors in the following tables:
■ TABLE P-14, E-cache Data CE and UE Errors
■ TABLE P-15, Writeback and Copyout

TABLE P-14 E-cache Data CE and UE Errors

| Event | Error Logged in AFSR | Flag Fast ECC Error | E-cache Data | L1 Cache Data | Pipeline Action | Comment |
|---|---|---|---|---|---|---|
| I-cache fill request with CE in the critical 32-byte E-cache data; these data are used later | UCC | Yes | Original data + original ECC | Bad data not in I-cache | Bad data dropped | Precise trap |
| I-cache fill request with UE in the critical 32-byte E-cache data; these data are used later | UCU | Yes | Original data + original ECC | Bad data not in I-cache | Bad data dropped | Precise trap |
| I-cache fill request with CE in the noncritical 32-byte (the second 32-byte) E-cache data; these data are used later | UCC | Yes | Original data + original ECC | Bad data not in I-cache | Bad data dropped | Precise trap |
| I-cache fill request with UE in the noncritical 32-byte (the second 32-byte) E-cache data; these data are used later | UCU | Yes | Original data + original ECC | Bad data not in I-cache | Bad data dropped | Precise trap |
| I-cache fill request with CE in the critical 32-byte E-cache data, but these data are never used later, or I-cache fill request with CE in the noncritical 32-byte (the second 32-byte) E-cache data, but these data are never used later | UCC | Yes | Original data + original ECC | Bad data not in I-cache | Bad data dropped | No trap |
| I-cache fill request with UE in the critical 32-byte E-cache data, but these data are never used later, or I-cache fill request with UE in the noncritical 32-byte (the second 32-byte) E-cache data, but these data are never used later | UCU | Yes | Original data + original ECC | Bad data not in I-cache | Bad data dropped | No trap |
| D-cache 32-byte load request with CE in the critical 32-byte E-cache data | UCC | Yes | Original data + original ECC | Bad data in D-cache | Bad data dropped | Precise trap |
| D-cache 32-byte load request with UE in the critical 32-byte E-cache data | UCU | Yes | Original data + original ECC | Bad data in D-cache | Bad data dropped | Precise trap |
| D-cache FP-64-bit load request with CE in the critical 32-byte E-cache data | UCC | Yes | Original data + original ECC | Bad data in D-cache | Bad data dropped | Precise trap |
| D-cache FP-64-bit load request with UE in the critical 32-byte E-cache data | UCU | Yes | Original data + original ECC | Bad data in D-cache | Bad data dropped | Precise trap |
| D-cache atomic request with CE in the critical 32-byte E-cache data | UCC | Yes | Original data + original ECC | Bad data in D-cache | Bad data dropped | Precise trap |
| D-cache atomic request with UE in the critical 32-byte E-cache data | UCU | Yes | Original data + original ECC | Bad data in D-cache | Bad data dropped | Precise trap |
| D-cache block-load request with CE in the first 32-byte E-cache data, or D-cache block-load request with CE in the second 32-byte E-cache data | EDC | No | Original data + original ECC | Good data in block-load buffer | Good data taken | Disrupting trap |
| D-cache block-load request with UE in the first 32-byte E-cache data, or D-cache block-load request with UE in the second 32-byte E-cache data | EDU | No | Original data + original ECC | Bad data in block-load buffer | Bad data in FP register file | Deferred trap |
| W-cache exclusive request (stores/block stores) with CE in the critical 32-byte E-cache data, or W-cache exclusive request (stores) with CE in the noncritical 32-byte (second 32-byte) E-cache data | No error logged | No | Original data + original ECC | W-cache gets the permission to modify the data | W-cache proceeds to modify the data | No trap; E-cache tag access only, without E-cache data access |
| W-cache exclusive request (stores/block stores) with UE in the critical 32-byte E-cache data, or W-cache exclusive request (stores) with UE in the second 32-byte E-cache data | No error logged | No | Original data + original ECC | W-cache gets the permission to modify the data | W-cache proceeds to modify the data | No trap; E-cache tag access only, without E-cache data access |
| (W-cache eviction) W-cache read request with CE in the critical 32-byte E-cache data; then, W-cache request of scrubbing partially modified data with CE in the critical 32-byte E-cache data | EDC | No | Original data + original ECC, and then gets merged data + regenerated ECC | W-cache gets good data, and then W-cache sends out merged data + regenerated ECC to E-cache | No action | Disrupting trap |
| (W-cache eviction) W-cache read request with UE in the critical 32-byte E-cache data, and then W-cache request of scrubbing partially modified data with UE in the critical 32-byte E-cache data | EDU | No | Original data + original ECC, and then gets merged data + regenerated ECC | W-cache gets raw data and the indication of bad data; then, W-cache sends out merged data + regenerated ECC to E-cache | No action | Disrupting trap, and W-cache flips the 2 least significant ECC check bits C[1:0] in both lower and upper 16 bytes |
| (W-cache eviction) W-cache request of scrubbing 32-byte modified data with CE in the critical 32-byte E-cache data | No error logged | No | W-cache data + regenerated ECC | W-cache sends out 32-byte modified data + regenerated ECC to E-cache | No action | No trap |
| (W-cache eviction) W-cache request of scrubbing 32-byte modified data with UE in the critical 32-byte E-cache data | No error logged | No | W-cache data + regenerated ECC | W-cache sends out 32-byte modified data + regenerated ECC to E-cache | No action | No trap |
| ASI read E-cache data request with CE in the first 32-byte E-cache data, or ASI read E-cache data request with CE in the second 32-byte E-cache data | No | No | Original data + original ECC | N/A | Corrected data (note that UltraSPARC I/II returned uncorrected data) | No trap |
| ASI read E-cache data request with UE in the first 32-byte E-cache data, or ASI read E-cache data request with UE in the second 32-byte E-cache data | No | No | Original data + original ECC | N/A | N/A | No trap |

TABLE P-15 Writeback and Copyout

| Event | Error Logged in AFSR | E-cache Data | Data Sent to System Bus | Comment |
|---|---|---|---|---|
| Writeback encountering CE in the first 32-byte E-cache data, or Writeback encountering CE in the second 32-byte E-cache data | WDC | Original data stay in E-cache, and ETag invalidated | Corrected data + corrected ECC | Disrupting trap |
| Writeback encountering UE in the first 32-byte E-cache data, or Writeback encountering UE in the second 32-byte E-cache data | WDU | Original data stay in E-cache, and ETag invalidated | SIU flips the most significant 2 bits of data D<127:126> of the corresponding upper or lower 16-byte data | Disrupting trap |
| Copyout hits in the writeback buffer because the line is being victimized where a CE has already been detected | WDC | Original data stay in E-cache, and ETag invalidated | Corrected data + corrected ECC | Disrupting trap |
| Copyout hits in the writeback buffer because the line is being victimized where a UE has already been detected | WDU | Original data stay in E-cache, and ETag invalidated | SIU flips the most significant 2 bits of data D<127:126> of the corresponding upper or lower 16-byte data | Disrupting trap |
| Copyout encountering CE in the first 32-byte E-cache data, or Copyout encountering CE in the second 32-byte E-cache data | CPC | Original data stay in E-cache, and ETag marked as O-state or Os-state or S-state if servicing f_RTS; original data stay in E-cache, and ETag marked as S-state if servicing f_RTSM not hitting W-cache; original data stay in E-cache, and ETag remains unchanged if servicing f_RS; original data stay in E-cache, and ETag invalidated if servicing f_RTO; original E-cache data merged with W-cache data for copyout, and original E-cache data stay in E-cache and Tag invalidated if servicing f_RTSM that hits W-cache | Corrected data + corrected ECC | Disrupting trap |
| Copyout encountering UE in the first 32-byte E-cache data, or Copyout encountering UE in the second 32-byte E-cache data | CPU | Original data stay in E-cache, and ETag marked as O-state or Os-state or S-state if servicing f_RTS; original data stay in E-cache, and ETag marked as S-state if servicing f_RTSM not hitting W-cache; original data stay in E-cache, and ETag remains unchanged if servicing f_RS; original data stay in E-cache, and ETag invalidated if servicing f_RTO; original E-cache data merged with W-cache data for copyout, and original E-cache data stay in E-cache and Tag invalidated if servicing f_RTSM that hits W-cache | SIU flips the most significant 2 bits of data D<127:126> of the corresponding upper or lower 16-byte data | Disrupting trap |
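The UE rows of TABLE P-15 say that the SIU flips the most significant 2 bits, D<127:126>, of the affected upper or lower 16-byte data before sending it to the system bus. The bit manipulation can be sketched as follows. This is a minimal illustration, assuming each 16-byte quantity is held as two 64-bit halves with hi carrying D<127:64>; the type data128 and the functions flip_d127_126 and mark_uncorrectable are hypothetical names, not part of any Sun interface.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-byte data quantity. Assumed layout for this sketch:
 * hi holds D<127:64>, lo holds D<63:0>. */
struct data128 {
    uint64_t hi;
    uint64_t lo;
};

/* Invert D<127:126> and leave the other 126 bits untouched. */
struct data128 flip_d127_126(struct data128 d)
{
    d.hi ^= UINT64_C(3) << 62;   /* bits 63:62 of hi are D<127:126> */
    return d;
}

/* Flip the marker bits in one 16-byte half of a 32-byte quantity
 * (half 0 = lower 16 bytes, half 1 = upper 16 bytes). */
void mark_uncorrectable(struct data128 halves[2], int half)
{
    assert(half == 0 || half == 1);
    halves[half] = flip_d127_126(halves[half]);
}
```

Because the operation is an exclusive OR, applying it twice to the same half restores the original value.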

Note – AFAR points to 16-byte boundary as dictated by the address specified by the E-cache access.

Note – When UE and CE occur in the same 32-byte data, both CE and UE are reported but AFAR will point to the UE case on the 16-byte boundary.

Note – “D-cache FP-64-bit load” means any one of the following four kinds of FP- load instructions: lddf, ldf, lddfa (with some ASIs), or ldfa (with some ASIs).
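The two AFAR notes above for E-cache access errors can be read as a simple rule: both CE and UE are reported, and AFAR holds the 16-byte-aligned address of the UE when one is present. A minimal sketch of that rule follows. The structure ecache_report and the function report_ecache_error are hypothetical, and AFAR is modeled as a plain 64-bit value rather than with its real field layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What gets reported for one 32-byte E-cache access (sketch only). */
struct ecache_report {
    bool ce_logged;
    bool ue_logged;
    uint64_t afar;    /* address rounded down to a 16-byte boundary */
};

/*
 * ce_pa and ue_pa are the addresses at which the CE and the UE were seen.
 * Each is used only when the matching flag is true.
 */
struct ecache_report report_ecache_error(bool ce, uint64_t ce_pa,
                                         bool ue, uint64_t ue_pa)
{
    struct ecache_report r;

    assert(ce || ue);             /* at least one error must be present */
    r.ce_logged = ce;             /* both kinds are reported ... */
    r.ue_logged = ue;
    /* ... but the UE address wins when both occur. */
    r.afar = (ue ? ue_pa : ce_pa) & ~UINT64_C(0xF);
    return r;
}
```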

P.10.2 UltraSPARC III Behavior with Error from System Bus

This section presents information about system bus errors in the following tables:
■ TABLE P-16, CE, UE, TO, and BERR Errors
■ TABLE P-17, EMC and EMU Errors
■ TABLE P-18, IVC and IVU Errors

TABLE P-16 CE, UE, TO, and BERR Errors

| Event | Error Logged in AFSR | Flag Fast ECC Error | E-cache Data | E-cache State | L1 Cache Data | Pipeline Action | Comment |
|---|---|---|---|---|---|---|---|
| I-cache fill request with CE in the critical 32-byte data from system bus | CE | No | Corrected data + corrected ECC | S | Good data in I-cache | Good data taken | Disrupting trap |
| I-cache fill request with UE in the critical 32-byte data from system bus | UE | Yes | Raw UE data + raw ECC | S | Bad data not installed in I-cache | Bad data dropped | Deferred trap (UE) will be taken and precise trap (fast_ECC_error) will be dropped |
| I-cache fill request with CE in the noncritical (second) 32-byte data from system bus | CE | No | Corrected data + corrected ECC | S | No action | No action | Disrupting trap |
| I-cache fill request with UE in the noncritical (second) 32-byte data from system bus | UE | No | Raw UE data + raw ECC | S | No action | No action | Deferred trap |
| D-cache load 32-byte fill request with CE in the critical 32-byte data from system bus | CE | No | Corrected data + corrected ECC | E/S | Good data in D-cache | Good data taken | Disrupting trap |
| D-cache load 32-byte fill request with UE in the critical 32-byte data from system bus | UE | Yes | Raw UE data + raw ECC | E/S | Bad data in D-cache | Bad data dropped | Deferred trap (UE) will be taken and precise trap (fast_ECC_error) will be dropped |
| D-cache load 32-byte fill request with CE in the noncritical (second) 32-byte data from system bus | CE | No | Corrected data + corrected ECC | E/S | No action | No action | Disrupting trap |
| D-cache load 32-byte fill request with UE in the noncritical (second) 32-byte data from system bus | UE | No | Raw UE data + raw ECC | E/S | No action | No action | Deferred trap |
| D-cache FP-64-bit load fill request with CE in the critical 32-byte data from system bus | CE | No | Corrected data + corrected ECC | E/S | Good & critical 32-byte data in D-cache | Good data taken | Disrupting trap |
| D-cache FP-64-bit load fill request with UE in the critical 32-byte data from system bus | UE | Yes | Raw UE data + raw ECC | E/S | Bad data in D-cache | Bad data dropped | Deferred trap (UE) will be taken and precise trap (fast_ECC_error) will be dropped |
| D-cache FP-64-bit load fill request with CE in the noncritical (second) 32-byte data from system bus | CE | No | Corrected data + corrected ECC | E/S | Good & critical 32-byte data in D-cache | No action | Disrupting trap |
| D-cache FP-64-bit load fill request with UE in the noncritical (second) 32-byte data from system bus | UE | No | Raw UE data + raw ECC | E/S | Bad data not in cache | No action | Deferred trap |
| D-cache block-load fill request with CE in the critical 32-byte data from system bus, or D-cache block-load fill request with CE in the noncritical (second) 32-byte data from system bus | CE | No | Not installed | N/A | Good data in cache block-load buffer | Good data taken | Disrupting trap |
| D-cache block-load fill request with UE in the critical 32-byte data from system bus, or D-cache block-load fill request with UE in the noncritical (second) 32-byte data from system bus | UE | No | Not installed | E/S | Bad data in block-load buffer | Bad data dropped | Deferred trap |
| D-cache atomic fill request with CE in the critical 32-byte data from system bus | CE | No | Corrected data + corrected ECC | M | Good & critical 32-byte data in D-cache | Good data taken | Disrupting trap |
| D-cache atomic fill request with UE in the critical 32-byte data from system bus | UE | Yes | Raw UE data + raw ECC | M | Bad data in D-cache | Bad data dropped | Deferred trap (UE) will be taken and precise trap (fast_ECC_error) will be dropped |
| D-cache atomic fill request with CE in the noncritical (second) 32-byte data from system bus | CE | No | Corrected data + corrected ECC | M | Good & critical 32-byte data in D-cache | No action | Disrupting trap |
| D-cache atomic fill request with UE in the noncritical (second) 32-byte data from system bus | UE | No | Raw UE data + raw ECC | M | No action | No action | Deferred trap |
| (stores) W-cache exclusive fill request with CE in the critical 32-byte data from system bus, or (stores) W-cache exclusive fill request with CE in the noncritical (second) 32-byte data from system bus | CE | No | Corrected data + corrected ECC | M | W-cache gets permission to modify the data after all 64 bytes of data have been received | W-cache proceeds to modify the data | Disrupting trap |
| (stores) W-cache exclusive fill request with UE in the critical 32-byte data from system bus, or (stores) W-cache exclusive fill request with UE in the noncritical (second) 32-byte data from system bus | UE | No | Raw UE data + raw ECC | M | W-cache gets permission to modify the data after all 64 bytes of data have been received | W-cache proceeds to modify the data, and UE will be reflected at the time of merging with E-cache data | Deferred trap |
| Cacheable I-cache fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data not installed in I-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| Noncacheable I-cache fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data not installed in I-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| Cacheable D-cache 32-byte fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data installed in D-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| Noncacheable D-cache 32-byte fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data not installed in D-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| Cacheable D-cache FP-64-bit load fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data installed in D-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| Noncacheable D-cache FP-64-bit load fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data not installed in D-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| Cacheable D-cache block-load request for unmapped address | TO | No | Not installed | N/A | Garbage data installed in block-load buffer | Garbage data in FP register file | Deferred trap |
| Noncacheable D-cache block-load request for unmapped address | TO | No | Not installed | N/A | Garbage data installed in block-load buffer | Garbage data in FP register file | Deferred trap |
| Cacheable D-cache atomic fill request for unmapped address | TO | Yes | Not installed | N/A | Garbage data installed in D-cache | Garbage data not taken | Deferred trap (TO) will be taken and precise trap (fast_ECC_error) will be dropped |
| (cacheable stores) W-cache exclusive fill request with unmapped address | TO | No | Not installed | N/A | ECU gives both (~grant) and valid to W-cache | W-cache will drop the store | Deferred trap |
| W-cache cacheable block store missing E-cache (write stream) with unmapped address, or W-cache cacheable block store commit request (write stream) with unmapped address | TO | No | N/A | N/A | N/A | N/A | Deferred trap |
| Noncacheable W-cache store with unmapped address | TO | No | N/A | N/A | N/A | N/A | Deferred trap |
| Noncacheable W-cache block store with unmapped address | TO | No | N/A | N/A | N/A | N/A | Deferred trap |
| E-cache eviction with unmapped address | TO | No | N/A | N/A | N/A | N/A | Deferred trap; eviction writeback data will be dropped; cache coherence is violated |
| Outgoing interrupt request with unmapped address | TO | No | N/A | N/A | N/A | N/A | Deferred trap; corresponding busy bit in Interrupt Vector Dispatch Status Register will automatically be reset to 0 |
| Noncacheable I-cache fill request with BERR response | BERR | Yes | Not installed | N/A | Not installed | Garbage data not taken | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped |
| Noncacheable D-cache 32-byte fill request with BERR response | BERR | Yes | Not installed | N/A | Not installed | Garbage data not taken | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped |
| Noncacheable D-cache FP-64-bit fill request with BERR response | BERR | Yes | Not installed | N/A | Not installed | Garbage data not taken | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped |
| Noncacheable D-cache block-load fill request with BERR response | BERR | No | Not installed | N/A | Not installed | Garbage data not taken | Deferred trap |
| Cacheable I-cache fill request with BERR in the critical 32-byte data from system bus | BERR | Yes | Garbage data installed | S | Not installed | Garbage data not taken | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable I-cache fill request with BERR in the noncritical (second) 32-byte data from system bus | BERR | No | Garbage data installed | S | No action | No action | Deferred trap; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache load 32-byte fill request with BERR in the critical 32-byte data from system bus | BERR | Yes | Garbage data installed | E/S | Installed | Garbage data not taken | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache load 32-byte fill request with BERR in the noncritical (second) 32-byte data from system bus | BERR | No | Garbage data installed | E/S | No action | No action | Deferred trap; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache FP-64-bit load fill request with BERR in the critical 32-byte data from system bus | BERR | Yes | Garbage data installed | E/S | Garbage data in D-cache | Garbage data dropped | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache FP-64-bit load fill request with BERR in the noncritical (second) 32-byte data from system bus | BERR | No | Garbage data installed | E/S | Garbage data not in cache | No action | Deferred trap; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache block-load fill request with BERR in the critical 32-byte data from system bus, or Cacheable D-cache block-load fill request with BERR in the noncritical (second) 32-byte data from system bus | BERR | No | Not installed | N/A | Garbage data in block-load buffer | Garbage data dropped | Deferred trap; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache atomic fill request with BERR in the critical 32-byte data from system bus | BERR | Yes | Garbage data installed | M | Garbage data in D-cache | Garbage data dropped | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable D-cache atomic fill request with BERR in the noncritical (second) 32-byte data from system bus | BERR | No | Garbage data installed | M | No action | No action | Deferred trap; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |
| Cacheable Prefetch 0, 1, 2, 3 fill request with BERR in the critical 32-byte data from system bus due to non-RTSR transaction, or Cacheable Prefetch 0, 1, 2, 3 fill request with BERR in the second 32-byte data from system bus due to non-RTSR transaction | No error logged | No | Not installed | N/A | Not installed | Garbage data not taken | No trap |
| (cacheable stores) W-cache exclusive fill request with BERR in the critical 32-byte data from system bus, or (cacheable stores) W-cache exclusive fill request with BERR in the second 32-byte data from system bus | BERR | No | Garbage data installed | M | W-cache gets permission to modify the data after all 64 bytes of data have been received | W-cache proceeds to modify the data, and UE will be reflected at the time of merging with E-cache data | Deferred trap (BERR) will be taken and precise trap (fast_ECC_error) will be dropped; the 2 least significant data bits [1:0] in both lower and upper 16 bytes are flipped |

TABLE P-17 EMC and EMU Errors

| Event | Error Logged in AFSR | Error Pin | Comment |
|---|---|---|---|
| I-cache fill request with CE in the MTag of the critical 32-byte data from system bus, or I-cache fill request with CE in the MTag of the noncritical (second) 32-byte data from system bus | EMC | No | Disrupting trap |
| I-cache fill request with UE in the MTag of the critical 32-byte data from system bus, or I-cache fill request with UE in the MTag of the noncritical (second) 32-byte data from system bus | EMU | Yes | Deferred trap |
| D-cache fill request with CE in the MTag of the critical 32-byte data from system bus, or D-cache fill request with CE in the MTag of the noncritical (second) 32-byte data from system bus | EMC | No | Disrupting trap |
| D-cache fill request with UE in the MTag of the critical 32-byte data from system bus, or D-cache fill request with UE in the MTag of the noncritical (second) 32-byte data from system bus | EMU | Yes | Deferred trap |
| D-cache atomic fill request with CE in the MTag of the critical 32-byte data from system bus, or D-cache atomic fill request with CE in the MTag of the noncritical (second) 32-byte data from system bus | EMC | No | Disrupting trap |
| D-cache atomic fill request with UE in the MTag of the critical 32-byte data from system bus, or D-cache atomic fill request with UE in the MTag of the noncritical (second) 32-byte data from system bus | EMU | Yes | Deferred trap |
| (stores) W-cache exclusive fill request with CE in the MTag of the critical 32-byte data from system bus, or (stores) W-cache exclusive fill request with CE in the MTag of the noncritical (second) 32-byte data from system bus | EMC | No | Disrupting trap |
| (stores) W-cache exclusive fill request with UE in the MTag of the critical 32-byte data from system bus, or (stores) W-cache exclusive fill request with UE in the MTag of the noncritical (second) 32-byte data from system bus | EMU | Yes | Deferred trap |

Note – For MTag error, when data are delivered to E$, then SIU reports the error. If the data are not delivered to E$, then SIU does not report the error.
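TABLE P-17 and the note above reduce to a small decision: a correctable MTag error logs EMC and causes a disrupting trap without asserting the error pin, an uncorrectable one logs EMU, asserts the error pin, and causes a deferred trap, and nothing is reported when the data are not delivered to the E-cache. The lookup below is a sketch of that summary only. All type and function names are hypothetical, and the AFSR bits are modeled as enum labels rather than real register positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mtag_ecc  { MTAG_CE, MTAG_UE };            /* error seen in the MTag */
enum afsr_bit  { AFSR_EMC, AFSR_EMU };          /* labels, not bit positions */
enum trap_kind { TRAP_DISRUPTING, TRAP_DEFERRED };

struct mtag_action {
    enum afsr_bit logged;     /* which AFSR bit is set */
    bool error_pin;           /* Error Pin column of the table */
    enum trap_kind trap;      /* Comment column of the table */
};

/*
 * Returns the reported action, or NULL when the SIU does not report the
 * error because the data were not delivered to the E-cache.
 */
const struct mtag_action *mtag_error_action(enum mtag_ecc kind,
                                            bool delivered_to_ecache)
{
    static const struct mtag_action table[] = {
        [MTAG_CE] = { AFSR_EMC, false, TRAP_DISRUPTING },
        [MTAG_UE] = { AFSR_EMU, true,  TRAP_DEFERRED },
    };

    assert(kind == MTAG_CE || kind == MTAG_UE);
    if (!delivered_to_ecache)
        return NULL;
    return &table[kind];
}
```

The request type (I-cache, D-cache, atomic, or W-cache exclusive fill) is deliberately not a parameter, because every row of the table gives the same result for a given error kind.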

TABLE P-18 IVC and IVU Errors

| Event | Error Logged in AFSR | Error Pin | Interrupt Vector Receive Register Busy Bit Setting | Interrupt Data Register | Comment |
|---|---|---|---|---|---|
| Interrupt vector with CE in the critical 32-byte data from system bus, or Interrupt vector with CE in the noncritical (second) 32-byte data from system bus | IVC | No | Yes | Corrected interrupt data | Disrupting trap and interrupt taken |
| Interrupt vector with UE in the critical 32-byte data from system bus, or Interrupt vector with UE in the noncritical (second) 32-byte data from system bus | IVU | No | No | Garbage data | Disrupting trap and interrupt dropped |
| Interrupt vector with CE in the MTag of the critical 32-byte data from system bus, or Interrupt vector with CE in the MTag of the noncritical (second) 32-byte data from system bus | EMC | No | Yes if no IVU | Corrected interrupt data | Disrupting trap and interrupt taken if no IVU; interrupt dropped if IVU |
| Interrupt vector with UE in the MTag of the critical 32-byte data from system bus, or Interrupt vector with UE in the MTag of the noncritical (second) 32-byte data from system bus | EMU | Yes | Yes if no IVU | Received interrupt data | Deferred trap and interrupt taken if no IVU; interrupt dropped if IVU |

Note – AFAR points to 32-byte boundary as dictated by the system bus specification.

Note – When CE and UE occur in the same 32-byte data, only UE will be reported.

P.11 External Memory Unit Error Handling

This section describes the additional error cases that are specific to the External Memory Unit (EMU) design. Extra error detection and handling additions to EMU help designers quickly isolate fatal hardware errors during the system bring-up phase and improve the RAS features of the processor.

Errors detected in EMU are divided into two categories:
■ Bus protocol errors: Error conditions that violate system bus protocol and UltraSPARC III coherency tables (cache consistency and snoop result errors).
■ Internal errors: Error conditions that point to inconsistent or illegal operations on some of the EMU’s finite state machines.

The Asynchronous Fault Status Register and three EMU registers (Error Status, Error Mask, and Error Shadow) detect and report EMU errors. FIGURE P-2 illustrates the logic for these registers.

[Figure not reproduced. Labeled elements: error inputs Error_0 through Error_k; Mask bit_0 through Mask bit_k; Error Mask Register (JTAG scan chain); Error Status Register (system register); Error Shadow Register (JTAG scan chain); control signals Enable, Clear_n, Load_En, Shift_En_M, Shift_En_S, Scan_In_M, Scan_In_S, Scan_Out_M, Scan_Out_S; outputs PERR and IERR to the AFSR.]

FIGURE P-2 EMU Error Detection and Reporting Logic

P.11.1 Asynchronous Fault Status Register

Bus protocol errors are reported by setting the PERR field of the AFSR. Internal errors are reported by setting the IERR bit in the AFSR. Once those fields are set, the processor asserts its ERROR output pin for eight consecutive system cycles. For information on how to clear those bits in the AFSR, please refer to Asynchronous Fault Status Register on page 204.

P.11.2 EMU Error Status Register

Fatal hardware errors (bus protocol errors and internal errors) are reported in the EMU Error Status Register (EESR) if their corresponding mask bits are 0 in the EMU Error Mask Register (EEMR). EESR content can be updated only if no prior error is logged in the register, so only the first error is logged and subsequent errors are ignored. Multiple errors can be reported if they happen in the same cycle.

Once an error is logged in the EESR, a corresponding bit (PERR or IERR) in the AFSR is also set and an error signal is asserted. Errors that are logged in the EESR can be cleared when their associated field in the AFSR is cleared by software. The EESR is reset to 0 only during power-on reset; other resets have no effect on this register.

The EESR is accessible through JTAG. A shadow scan chain can capture the value of the EESR and shift it out of the chip for examination.
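The logging rule described above (only the first error is kept, errors raised in the same cycle are captured together, and masked errors are not logged) can be sketched as a small software model. This is an illustrative sketch only: the class name, method names, and the `masked_bits` parameter are hypothetical, the mask is expressed directly as EESR bit positions, and clearing is simplified to resetting the whole register rather than tracking the PERR and IERR fields separately.

```python
class EmuErrorStatus:
    """Toy model of first-error-wins logging; not a hardware description."""

    def __init__(self):
        self.eesr = 0  # logged error bits; 0 after power-on reset

    def report(self, errors_this_cycle, masked_bits=0):
        """Offer the errors raised in one cycle; return True if any were logged."""
        unmasked = errors_this_cycle & ~masked_bits
        if self.eesr != 0 or unmasked == 0:
            return False  # a prior error is already logged, or nothing to log
        self.eesr = unmasked  # all unmasked same-cycle errors are captured together
        return True

    def clear(self):
        """Stand-in for software clearing the associated AFSR field (simplified)."""
        self.eesr = 0
```

In this model, a second call to `report` after an error has been logged leaves `eesr` unchanged until `clear` is called.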

TABLE P-18 Parity and ECC Errors (CREG)

Bit | Field | Error Type | Description
0 | S_PERR | PERR | Parity error on system address bus. Copy of ISAP field of the AFSR.
1 | M_ECC | PERR | Uncorrectable ECC error on system data. Copy of ORed value of UE and IVU fields of the AFSR.
2 | E_ECC | PERR | Uncorrectable error on external cache data. Copy of ORed value of EDU, WDU, CPU, and UCU fields of the AFSR.
3 | MT_ECC | PERR | Uncorrectable ECC error on MTag value. Copy of EMU field of the AFSR.

The physical address of the error transaction for each entry in the preceding table, except ISAP, is reported in the AFAR.

TABLE P-19 Internal Errors of the MCU (CREG)

Bit | Field | Error Type | Description
4 | CANCL_NH | IERR | Request to cancel a transaction that has never entered the MCU queues.
5 | NO_REFSH | IERR | Refresh starvation on one of SDRAM banks.
6 | MQ_OV | PERR | Memory controller backing queue overflows after PauseOut is asserted.

TABLE P-20 Internal Errors of the Write Cache (CREG)

Bit | Field | Error Type | Description
7 | PRB_MH | IERR | Multiple-way probe hits.
8 | ST_MH | IERR | Multiple-way store hits.

TABLE P-21 System Bus Protocol Error – Data (DPCTL)

Bit | Field | Error Type | Description
9 | UDT | PERR | Undefined DTransID. Read Tx: incoming DTransID doesn’t match any outstanding ATransID. Write Tx: incoming DTransID doesn’t match any outstanding TargID.
10 | UTT | PERR | Undefined TTransID. Incoming TTransID doesn’t match any outstanding ATransID.
11 | MTARG | PERR | Multiple TargetID issued for the same write transaction.
12 | UDG | PERR | Unexpected DTransID grant.
13 | UTG | PERR | Unexpected TargetID, TTransID grant.

TABLE P-22 Internal Errors of the DPCTL (DPCTL)

Bit | Field | Error Type | Description
14 | LWQ_OV | IERR | Local Write Queue Overflow.
15 | LWQ_UF | IERR | Local Write Queue Underflow.
16 | FRDQ_OV | IERR | Foreign Read Queue Overflow.
17 | FRDQ_UF | IERR | Foreign Read Queue Underflow.
18 | C2MS_WER | IERR | Overwrite a valid C2MS entry by trying to update the valid entry of a local write transaction.
19 | C2MS_IR | IERR | Request to invalidate an unoccupied C2MS entry.
20 | S2M_WER | IERR | Overwrite a valid S2M entry.
21 | FRARB_OV | IERR | Foreign Read Arbitration Queue Overflow.
22 | FRARB_UF | IERR | Foreign Read Arbitration Queue Underflow.
23 | M2SARB_OV | IERR | M2S Arbitration Queue Overflow.
24 | M2SARB_UF | IERR | M2S Arbitration Queue Underflow.
25 | LWARB_OV | IERR | Local Write Arbitration Queue Overflow.
26 | LWARB_UF | IERR | Local Write Arbitration Queue Underflow.
27 | WRD_UE | IERR | Unexpected write data request, write data check. Write data request for unissued TargID.
28 | RDR_UE | IERR | Unexpected read data ready.
29 | DROB_WER | IERR | Overwrite a valid DROB entry.
30 | DROB_IR | IERR | Request to invalidate an invalid DROB entry.

TABLE P-23 System Bus Protocol Errors – Transaction (QCTL)

Bit | Field | Error Type | Description
31 | USC | PERR | Undefined system bus command.
32 | CPQ_TO | PERR | CPQ system bus timeout.
33 | NCPQ_TO | PERR | NCPQ system bus timeout.
34 | WQ_TO | PERR | Write transaction timeout.
35 | TID_TO | PERR | TargetID timeout. Occurs when UltraSPARC III sends out a valid TargetID but no data arrive after the specified timeout period.
36 | AID_LK | PERR | ATransID leakage error. Remote transaction R_* is issued by the processor, but the reissued transaction is unable to complete.
37 | CPQ_OV | PERR | CPQ overflows after PauseOut is asserted.
38 | NCPQ_OV | IERR | NCPQ overflows after PauseOut is asserted.
39 | CPQ_UF | IERR | CPQ underflow.
40 | NCPQ_UF | IERR | NCPQ underflow.
41 | ORQ_OV | PERR | ORQ overflows after PauseOut is asserted.
42 | ORQ_UF | IERR | ORQ underflow. Incoming is asserted when ORQ is empty and HBM mode is set.
43 | HBM_CON | PERR | HBM mode contention. Incoming asserts 2 cycles after PreReq.
44 | HBM_ERR | PERR | HBM mode error. PreReq or Incoming is asserted while HBM mode is not set.

Note – Timeout errors that are reported in the EMU Error Status Register are fatal hardware errors, raised when timeout counters in the EMU have expired. Unmapped memory, unmapped noncached, and unmapped interrupt transactions are reported as nonfatal errors in the TO field of the AFSR.

TABLE P-24 Cache Consistency Errors (QCTL)

Bit | Field | Error Type | Description
45 | RTS_ER | IERR | Detect a local RTS on the bus with: PTA state ≠ dI
46 | RTO_ER | IERR | Detect a local RTO on the bus with: E$ state = M; PTA state = dT
47 | WB_ER | IERR | Detect a local WB with: PTA state = dT
48 | RS_ER | IERR | Detect a local RS on the bus with: PTA state ≠ dI
49 | RTSR_ER | IERR | Detect a local RTSR on the bus with: PTA state = dT or dO
50 | RTOR_ER | IERR | Detect a local RTOR with: PTA state = dT
51 | RSR_ER | IERR | Detect a local RSR on the bus with: PTA state ≠ dI

TABLE P-25 Snoop Result Errors (QCTL)

Bit | Field | Error Type | Description
52 | RTS_SE | PERR | Local RTS Shared with Error: SharedIn = 0 and OwnedIn = 1
53 | RTO_NDE | PERR | Local RTO no data and SharedIn = 0
54 | RTO_WDE | PERR | Local RTO wait data with SharedIn = 1

TABLE P-26 MTag Errors (QCTL)

Bit | Field | Error Type | Description
55 | SSM_MT | PERR | MTag ≠ gM in non-SSM mode.
56 | SSM_URT | PERR | Unexpected remote transaction (R_*) in non-SSM mode.
57 | SSM_URE | PERR | Unexpected reissued transaction from SSM device (transactions that are not initiated by UltraSPARC III).
58 | SSM_IMT | PERR | Illegal MTag on returned data: MTag = gI for RTSR, RSR; MTag = gI, gS for RTOR

TABLE P-27 Internal Errors on the PENDQ and QCTL (QCTL)

Bit | Field | Error Type | Description
59 | CPBK_MH | IERR | Multiple hits in fast copyback buffer.
60 | PTA_OV | IERR | Too many transactions hit the same PTA entry (attempt to increment PTA counter > 23).
61 | PTA_UDS | IERR | Undefined PTA state.

TABLE P-28 Internal Errors of the TOB (TOB)

Bit | Field | Error Type | Description
62 | AID_ERR | IERR | Trying to retire inactive AID.
63 | AID_ILL | IERR | Illegal AID (transaction with AID = 0).
64 | AID_UD | IERR | Undefined AID for retry transaction request (request for a retry Tx with an inactive AID).
65 | WB_FSM_ILL | IERR | Writeback state machine encounters illegal state.
66 | WBAR_OV | IERR | WBAR queue overflow.
67 | RTOV | IERR | Retry queue overflow.
68 | MRET | IERR | Multiple retire requests for the same transaction.
69 | MPF | IERR | Multiple Pull Flag requests for the same transaction.
70 | USB_OV | IERR | USB buffer overflow.
71 | CWBB_UE | IERR | Unexpected writeback or copyback request for data from the CWBB.
72 | CUSB_UE | IERR | Unexpected data request for noncached data buffer.

TABLE P-29 Internal Errors of the ECU (TOB)

Bit | Field | Error Type | Description
73 | CAM_OV | IERR | Overflow condition for the blocking CAM in the miss block.
74 | WBE_UF | IERR | Underflow condition for a writeback entry; a WB entry is retired multiple times.
75 | MRQ_ERR | IERR | Illegal miss request. Src, src_idx, size, ... are not legal.
76 | MPT_ERR | IERR | Miss request protocol error. Handshaking protocol (ec_si_rq, si_ec_req_ack) between SIU and ECU is broken.

P.11.3 EMU Error Mask Register

The EMU Error Mask Register (EEMR) is used to disable error generation of certain error conditions. Each bit in the EEMR controls a group of errors in the EESR. Once a bit is set in the EEMR, error logging for the affected fields in the EESR is disabled and the processor’s ERROR output pin will not be asserted for these events. This behavior does not affect reporting of these errors in the AFSR.

EEMR is designed as a unique JTAG scan chain because it is not desirable to change the content of the EEMR when scanning out errors for debugging. Since the only purpose of this scan chain is to provide the masking capability for error logging, capture and update functions are not needed. However, during the shifting phase on this scan chain, the JTAG’s TAP controller needs to signal error detection logic blocks so that error reporting is disabled during that period.

EEMR is reset to all 0’s (except EEMR<1> is set to 1) by power-on reset. Resets other than power-on reset have no effect on this register.

TABLE P-30 Error Mask Register

Bits | Field | Affected EESR bits | Description
0 | M_PERR | EESR<0> and EESR<3> | Masks address parity error and MTag ECC error on system address bus.
1 | M_ECC | EESR<2:1> | Masks all data ECC errors that are detected by UltraSPARC III. Default (after reset) value is 1, masked.
2 | M_MCU | EESR<6:4> | Masks all internal errors in the MCU.
3 | M_WCU | EESR<8:7> | Masks all internal errors in the Write Cache Unit.
4 | M_UID | EESR<11:9> | Masks all undefined DTransIDs, TTransID on system data bus, multiple TargID for the same write transaction.
5 | M_UG | EESR<13:12> | Masks all unexpected grants on system data bus.
6 | M_DPCTL | EESR<30:14> | Masks all internal errors in the DPCTL.
7 | M_USC | EESR<31> | Masks all undefined commands on system address bus.
8 | M_TO | EESR<36:32> | Masks all timeout errors.
9 | M_OV | EESR<38:37> | Masks all overflow conditions in CPQ and NCPQ.
10 | M_UF | EESR<40:39> | Masks underflow conditions of CPQ and NCPQ.
11 | M_ORQ | EESR<42:41> | Masks ORQ underflow, overflow errors.
12 | M_HBM | EESR<44:43> | Masks all errors related to HBM mode operations.
13 | M_CCE | EESR<51:45> | Masks all cache consistency errors.
14 | M_SRE | EESR<54:52> | Masks all snoop result errors.
15 | M_MTE | EESR<58:55> | Masks all MTag-related errors.
16 | M_PENDQ | EESR<61:59> | Masks all internal errors in the PENDQ and QCTL.
17 | M_TOB | EESR<72:62> | Masks all internal errors in the TOB.
18 | M_ECU | EESR<76:73> | Masks all internal errors in the ECU.
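Because each EEMR bit covers a group of EESR bits, it can help to expand a mask value into the set of EESR bit positions it suppresses. The following is a minimal sketch that transcribes the groupings from TABLE P-30; the names `EEMR_GROUPS`, `masked_eesr_bits`, and `POWER_ON_EEMR` are hypothetical and exist only for illustration.

```python
# (EEMR bit, field, inclusive EESR bit ranges as (low, high)), per TABLE P-30
EEMR_GROUPS = [
    (0, "M_PERR", [(0, 0), (3, 3)]),
    (1, "M_ECC", [(1, 2)]),
    (2, "M_MCU", [(4, 6)]),
    (3, "M_WCU", [(7, 8)]),
    (4, "M_UID", [(9, 11)]),
    (5, "M_UG", [(12, 13)]),
    (6, "M_DPCTL", [(14, 30)]),
    (7, "M_USC", [(31, 31)]),
    (8, "M_TO", [(32, 36)]),
    (9, "M_OV", [(37, 38)]),
    (10, "M_UF", [(39, 40)]),
    (11, "M_ORQ", [(41, 42)]),
    (12, "M_HBM", [(43, 44)]),
    (13, "M_CCE", [(45, 51)]),
    (14, "M_SRE", [(52, 54)]),
    (15, "M_MTE", [(55, 58)]),
    (16, "M_PENDQ", [(59, 61)]),
    (17, "M_TOB", [(62, 72)]),
    (18, "M_ECU", [(73, 76)]),
]

# Power-on reset value described in the text: all 0's except EEMR<1> = 1.
POWER_ON_EEMR = 0b10


def masked_eesr_bits(eemr):
    """Return an integer whose set bits are the EESR positions masked by eemr."""
    mask = 0
    for bit, _field, ranges in EEMR_GROUPS:
        if (eemr >> bit) & 1:
            for low, high in ranges:
                width = high - low + 1
                mask |= ((1 << width) - 1) << low
    return mask
```

With the power-on value, only EESR<2:1> is suppressed, which matches the default described for M_ECC.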

P.11.4 EMU Shadow Register

For each bit in the EESR, there is a corresponding bit in the EMU Shadow Register (ESR) to allow designers to gain visibility to the error status of the EMU. The ESR consists of scannable flip-flops that are connected to form a JTAG data scan chain. Since it is not required to clear error bits in the EESR through JTAG, the EMU Shadow Register carries out only two functions:
■ Capturing values on the EESR to scannable flops in the ESR
■ Shifting out the captured values through the scan-out port

Both EMU Shadow Register and EMU Error Mask Register are shifted at JTAG’s TCK clock frequency.

P.11.5 EMU Shadow Scan Chain Order

The following sections list the chain order for the EMU mask chain and EMU shadow chain.

Mask Chain Order

EMU mask chain, M_ECC is closest to TDI

attribute REGISTER_NAME of UltraSPARC III. Entity is
■ "M_ECC(MASKERR<0>)," &
■ "M_PARITY(MASKERR<1>), " &
■ "M_WCU(MASKERR<2>), " &
■ "M_MCU(MASKERR<3>), " &
■ "M_UG(MASKERR<4>), " &
■ "M_UID(MASKERR<5>), " &
■ "M_DPCTL(MASKERR<6>), " &
■ "M_ECU(MASKERR<7>), " &
■ "M_TOB(MASKERR<8>), " &
■ "M_TO(MASKERR<9>), " &
■ "M_CCE(MASKERR<10>), " &
■ "M_HBM(MASKERR<11>), " &
■ "M_ORQ(MASKERR<12>), " &
■ "M_PENDQ(MASKERR<13>), " &
■ "M_UF(MASKERR<14>), " &
■ "M_USC(MASKERR<15>), " &
■ "M_MTE(MASKERR<16>), " &
■ "M_OV(MASKERR<17>), " &
■ "M_SRE(MASKERR<18>)";

Shadow Scan Chain Order

EMU shadow scan chain, E_SYND<0> is closest to TDI

attribute REGISTER_NAME of UltraSPARC III. Entity is
■ "E_SYND<0>(SHADOW_EMU<0>), " &
■ "E_SYND<1>(SHADOW_EMU<1>), " &
■ "E_SYND<2>(SHADOW_EMU<2>), " &
■ "E_SYND<3>(SHADOW_EMU<3>), " &
■ "E_SYND<4>(SHADOW_EMU<4>), " &
■ "E_SYND<5>(SHADOW_EMU<5>), " &
■ "E_SYND<6>(SHADOW_EMU<6>), " &

■ "E_SYND<7>(SHADOW_EMU<7>), " &
■ "E_SYND<8>(SHADOW_EMU<8>), " &
■ "0(SHADOW_EMU<9>), " &
■ "0(SHADOW_EMU<10>), " &
■ "0(SHADOW_EMU<11>), " &
■ "0(SHADOW_EMU<12>), " &
■ "0(SHADOW_EMU<13>), " &
■ "0(SHADOW_EMU<14>), " &
■ "0(SHADOW_EMU<15>), " &
■ "M_SYND<0>(SHADOW_EMU<16>), " &
■ "M_SYND<1>(SHADOW_EMU<17>), " &
■ "M_SYND<2>(SHADOW_EMU<18>), " &
■ "M_SYND<3>(SHADOW_EMU<19>), " &
■ "0(SHADOW_EMU<20>), " &
■ "0(SHADOW_EMU<21>), " &
■ "0(SHADOW_EMU<22>), " &
■ "0(SHADOW_EMU<23>), " &
■ "0(SHADOW_EMU<24>), " &
■ "0(SHADOW_EMU<25>), " &
■ "0(SHADOW_EMU<26>), " &
■ "0(SHADOW_EMU<27>), " &
■ "0(SHADOW_EMU<28>), " &
■ "0(SHADOW_EMU<29>), " &
■ "0(SHADOW_EMU<30>), " &
■ "0(SHADOW_EMU<31>), " &
■ "0(SHADOW_EMU<32>), " &
■ "CE(SHADOW_EMU<33>), " &
■ "UE(SHADOW_EMU<34>), " &
■ "EDU(SHADOW_EMU<35>), " &
■ "EDC(SHADOW_EMU<36>), " &
■ "WDU(SHADOW_EMU<37>), " &
■ "WDC(SHADOW_EMU<38>), " &
■ "CPU(SHADOW_EMU<39>), " &
■ "CPC(SHADOW_EMU<40>), " &
■ "UCU(SHADOW_EMU<41>), " &
■ "UCC(SHADOW_EMU<42>), " &
■ "BERR(SHADOW_EMU<43>), " &
■ "TO(SHADOW_EMU<44>), " &
■ "IVU(SHADOW_EMU<45>), " &
■ "IVC(SHADOW_EMU<46>), " &
■ "EMU(SHADOW_EMU<47>), " &
■ "EMC(SHADOW_EMU<48>), " &
■ "ISAP(SHADOW_EMU<49>), " &
■ "IERR(SHADOW_EMU<50>), " &
■ "PERR(SHADOW_EMU<51>), " &
■ "PRIV(SHADOW_EMU<52>), " &
■ "ME(SHADOW_EMU<53>), " &

■ "0(SHADOW_EMU<54>), " &
■ "0(SHADOW_EMU<55>), " &
■ "0(SHADOW_EMU<56>), " &
■ "0(SHADOW_EMU<57>), " &
■ "0(SHADOW_EMU<58>), " &
■ "0(SHADOW_EMU<59>), " &
■ "0(SHADOW_EMU<60>), " &
■ "0(SHADOW_EMU<61>), " &
■ "0(SHADOW_EMU<62>), " &
■ "0(SHADOW_EMU<63>), " &
■ "AFAR<4>(SHADOW_EMU<64>), " &
■ "AFAR<5>(SHADOW_EMU<65>), " &
■ "AFAR<6>(SHADOW_EMU<66>), " &
■ "AFAR<7>(SHADOW_EMU<67>), " &
■ "AFAR<8>(SHADOW_EMU<68>), " &
■ "AFAR<9>(SHADOW_EMU<69>), " &
■ "AFAR<10>(SHADOW_EMU<70>), " &
■ "AFAR<11>(SHADOW_EMU<71>), " &
■ "AFAR<12>(SHADOW_EMU<72>), " &
■ "AFAR<13>(SHADOW_EMU<73>), " &
■ "AFAR<14>(SHADOW_EMU<74>), " &
■ "AFAR<15>(SHADOW_EMU<75>), " &
■ "AFAR<16>(SHADOW_EMU<76>), " &
■ "AFAR<17>(SHADOW_EMU<77>), " &
■ "AFAR<18>(SHADOW_EMU<78>), " &
■ "AFAR<19>(SHADOW_EMU<79>), " &
■ "AFAR<20>(SHADOW_EMU<80>), " &
■ "AFAR<21>(SHADOW_EMU<81>), " &
■ "AFAR<22>(SHADOW_EMU<82>), " &
■ "AFAR<23>(SHADOW_EMU<83>), " &
■ "AFAR<24>(SHADOW_EMU<84>), " &
■ "AFAR<25>(SHADOW_EMU<85>), " &
■ "AFAR<26>(SHADOW_EMU<86>), " &
■ "AFAR<27>(SHADOW_EMU<87>), " &
■ "AFAR<28>(SHADOW_EMU<88>), " &
■ "AFAR<29>(SHADOW_EMU<89>), " &
■ "AFAR<30>(SHADOW_EMU<90>), " &
■ "AFAR<31>(SHADOW_EMU<91>), " &
■ "AFAR<32>(SHADOW_EMU<92>), " &
■ "AFAR<33>(SHADOW_EMU<93>), " &
■ "AFAR<34>(SHADOW_EMU<94>), " &
■ "AFAR<35>(SHADOW_EMU<95>), " &
■ "AFAR<36>(SHADOW_EMU<96>), " &
■ "AFAR<37>(SHADOW_EMU<97>), " &
■ "AFAR<38>(SHADOW_EMU<98>), " &
■ "AFAR<39>(SHADOW_EMU<99>), " &
■ "AFAR<40>(SHADOW_EMU<100>), " &

■ "AFAR<41>(SHADOW_EMU<101>), " &
■ "AFAR<42>(SHADOW_EMU<102>), " &
■ "0(SHADOW_EMU<103>), " &
■ "M_ECC(SHADOW_EMU<104>), " &
■ "MQ_OV(SHADOW_EMU<105>), " &
■ "CANCL_NH(SHADOW_EMU<106>), " &
■ "PRB_MH(SHADOW_EMU<107>), " &
■ "MT_ECC(SHADOW_EMU<108>), " &
■ "S_PERR(SHADOW_EMU<109>), " &
■ "ST_MH(SHADOW_EMU<110>), " &
■ "NO_REFSH(SHADOW_EMU<111>), " &
■ "E_ECC(SHADOW_EMU<112>), " &
■ "UTT(SHADOW_EMU<113>), " &
■ "UTG(SHADOW_EMU<114>), " &
■ "UDG(SHADOW_EMU<115>), " &
■ "MTARG(SHADOW_EMU<116>), " &
■ "LWQ_OV(SHADOW_EMU<117>), " &
■ "WRD_UE(SHADOW_EMU<118>), " &
■ "DROB_WER(SHADOW_EMU<119>), " &
■ "LWARB_UF(SHADOW_EMU<120>), " &
■ "C2MS_IR(SHADOW_EMU<121>), " &
■ "S2M_WER(SHADOW_EMU<122>), " &
■ "M2SARB_OV(SHADOW_EMU<123>), " &
■ "FRDQ_OV(SHADOW_EMU<124>), " &
■ "DROB_IR(SHADOW_EMU<125>), " &
■ "FRDQ_UF(SHADOW_EMU<126>), " &
■ "RDR_UE(SHADOW_EMU<127>), " &
■ "C2MS_WER(SHADOW_EMU<128>), " &
■ "LWQ_UF(SHADOW_EMU<129>), " &
■ "FRARB_OV(SHADOW_EMU<130>), " &
■ "UDT(SHADOW_EMU<131>), " &
■ "LWARB_OV(SHADOW_EMU<132>), " &
■ "FRARB_UF(SHADOW_EMU<133>), " &
■ "M2SARB_UF(SHADOW_EMU<134>), " &
■ "CAM_OV(SHADOW_EMU<135>), " &
■ "AID_ILL(SHADOW_EMU<136>), " &
■ "WB_FSM_ILL(SHADOW_EMU<137>), " &
■ "MRET(SHADOW_EMU<138>), " &
■ "RTOV(SHADOW_EMU<139>), " &
■ "MPT_ERR(SHADOW_EMU<140>), " &
■ "AID_UD(SHADOW_EMU<141>), " &
■ "AID_ERR(SHADOW_EMU<142>), " &
■ "MPF(SHADOW_EMU<143>), " &
■ "CWBB_UE(SHADOW_EMU<144>), " &
■ "USB_OV(SHADOW_EMU<145>), " &
■ "WBE_UF(SHADOW_EMU<146>), " &
■ "CUSB_UE(SHADOW_EMU<147>), " &

■ "WBAR_OV(SHADOW_EMU<148>), " &
■ "MRQ_ERR(SHADOW_EMU<149>), " &
■ "RTO_WDE(SHADOW_EMU<150>), " &
■ "RTSR_ER(SHADOW_EMU<151>), " &
■ "NCPQ_OV(SHADOW_EMU<152>), " &
■ "RTOR_ER(SHADOW_EMU<153>), " &
■ "RSR_ER(SHADOW_EMU<154>), " &
■ "PTA_UDS(SHADOW_EMU<155>), " &
■ "TID_TO(SHADOW_EMU<156>), " &
■ "WQ_TO(SHADOW_EMU<157>), " &
■ "AID_LK(SHADOW_EMU<158>), " &
■ "NCPQ_TO(SHADOW_EMU<159>), " &
■ "ORQ_UF(SHADOW_EMU<160>), " &
■ "CPBK_MH(SHADOW_EMU<161>), " &
■ "RTO_ER(SHADOW_EMU<162>), " &
■ "RS_ER(SHADOW_EMU<163>), " &
■ "HBM_ERR(SHADOW_EMU<164>), " &
■ "WB_ER(SHADOW_EMU<165>), " &
■ "ORQ_OV(SHADOW_EMU<166>), " &
■ "CPQ_UF(SHADOW_EMU<167>), " &
■ "CPQ_TO(SHADOW_EMU<168>), " &
■ "NCPQ_UF(SHADOW_EMU<169>), " &
■ "RTS_ER(SHADOW_EMU<170>), " &
■ "CPQ_OV(SHADOW_EMU<171>), " &
■ "HBM_CON(SHADOW_EMU<172>), " &
■ "RTO_NDE(SHADOW_EMU<173>), " &
■ "PTA_OV(SHADOW_EMU<174>), " &
■ "SSM_MT(SHADOW_EMU<175>), " &
■ "SSM_URE(SHADOW_EMU<176>), " &
■ "SSM_IMT(SHADOW_EMU<177>), " &
■ "RTS_SE(SHADOW_EMU<178>), " &
■ "SSM_URT(SHADOW_EMU<179>), " &
■ "USC(SHADOW_EMU<180>)";
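When working with a captured shadow chain, a lookup from chain position to signal name is convenient. The sketch below parses entries of the form `NAME(CHAIN<position>)` out of a chain-order listing like the one above. The function name `parse_chain` and the regular expression are illustrative assumptions, not part of any Sun tool; the pattern tolerates an optional quote between the name and the opening parenthesis.

```python
import re

# NAME, optional quote, then (CHAIN<position>); NAME may itself contain <n>.
ENTRY = re.compile(r'(?P<name>[\w<>]+)"?\((?P<chain>\w+)<(?P<pos>\d+)>\)')


def parse_chain(listing):
    """Return {chain position: signal name} for every entry found in listing."""
    return {int(m["pos"]): m["name"] for m in ENTRY.finditer(listing)}
```

For example, parsing the full shadow chain listing would let a debug script label each bit shifted out of the chip by name rather than by index.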

APPENDIX Q

Performance Instrumentation

Note – This appendix will be substantially updated in the next revision.

Up to two performance events can be measured simultaneously. The Performance Control Register (PCR) controls event selection and filtering (that is, counting in privileged or nonprivileged mode) for a pair of 32-bit performance instrumentation counters (PICs). This chapter describes the performance instrumentation in these sections:
■ Performance Control and Counters
■ Performance Instrumentation Counter Events

Q.1 Performance Control and Counters

The 64-bit PCR and PIC are accessed through read/write Ancillary State Register instructions (RDASR/WRASR). PCR and PIC are located at ASRs 16 (10₁₆) and 17 (11₁₆), respectively. Access to the PCR is privileged. Attempted access via RDPCR or WRPCR when PSTATE.PRIV = 0 causes a privileged_opcode exception (impl. dep. #250). Software can restrict nonprivileged access to PICs by setting the PCR.PRIV field while in privileged mode. When PCR.PRIV = 1, an attempt by nonprivileged software to access the PICs causes a privileged_action trap. Software can control event measurements in nonprivileged or privileged modes by setting the PCR.UT (user trace) and PCR.ST (system trace) fields.

Each of the two 32-bit PICs, except PICL and PICU, can accumulate over 4 billion events before wrapping around silently. Extended event logging can be accomplished by periodic reading of the contents of the PICs before each overflows. Additional statistics can be collected by use of the two PICs over multiple passes of program execution.

Note – Overflow of PICL or PICU causes a disrupting trap and SOFTINT register bit 15 to be set to 1 and then causes an interrupt_level_15 trap.

Two events can simultaneously be measured by setting the PCR.select fields along with the PCR.UT and PCR.ST fields. The selected statistics are reflected during subsequent accesses to the PICs. The difference between the values read from the PIC on two reads reflects the number of events that occurred between them for the selected PICs. Software can only rely on read-to-read counts of the PIC for accurate timing and not on write-to-read counts. See also TABLE O-1 on page 180 for the state of these registers after reset.
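The read-to-read difference described above can be sketched as follows. This is a minimal illustration that assumes PICU occupies the upper 32 bits and PICL the lower 32 bits of the 64-bit PIC value; that layout and the helper names `split_pic` and `events_between` are assumptions made for the example. The subtraction is taken modulo 2^32 so that a single wraparound between the two reads still yields the correct count.

```python
MASK32 = 0xFFFF_FFFF


def split_pic(pic):
    """Split a 64-bit PIC value into (PICU, PICL); assumes PICU is the upper half."""
    return (pic >> 32) & MASK32, pic & MASK32


def events_between(first_read, second_read):
    """Per-counter event counts between two PIC reads, modulo 2**32."""
    u0, l0 = split_pic(first_read)
    u1, l1 = split_pic(second_read)
    return (u1 - u0) & MASK32, (l1 - l0) & MASK32
```

The result is only meaningful if each counter wraps at most once between the two reads, which is why the text recommends reading the PICs periodically.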

The Performance Control Register (PCR) is described in detail in Section 5.2.11 of Commonality.

FIGURE Q-1 provides an example of operational flow of use of the PCR and PIC when performance instrumentation is used.

FIGURE Q-1 Operational Flow of PCR/PIC

[Flow diagram not reproduced. Recoverable steps: start; set up PCR (hi_select_value → PCR.SU, low_select_value → PCR.SL, [0,1] → PCR.UT, [0,1] → PCR.ST, [0,1] → PCR.PRIV); 0 → PIC; accumulate stat in PIC; PIC → r[rd]; context switch to B (PCR → [saveA1], PIC → [saveA2]); switch to context B; accumulate stat in PIC; PIC → r[rd]; back to context A; context switch to A ([saveA1] → PCR, [saveA2] → PIC); accumulate stat in PIC; PIC → r[rd]; end.]

Q.2 Performance Instrumentation Counter Events

PICs enable collection of information about a variety of events. Tables in this section describe the counters that provide data for the following:
■ Instruction execution rates
■ IIU stalls and R-stage stalls
■ Recirculate events
■ Statistics about memory access, software, floating-point operation, and the memory controller

Note – Read-read counts could be inaccurate by as much as 20.

The section also lists the events from which the six-bit PCR SU and SL fields can select.

Q.2.1 Instruction Execution Rates

Using the two counters described in TABLE Q-1 to measure instruction completion and cycles allows calculation of the average number of instructions completed per cycle.

TABLE Q-1 Counters for Instruction Execution Rates

Counter | Description
Cycle_cnt [PICL,PICU] | Accumulated cycles. This count is similar to the SPARC V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields.
Instr_cnt [PICL,PICU] | The number of instructions completed. Annulled, mispredicted or trapped instructions are not counted.
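As a small worked example, the average number of instructions completed per cycle is the ratio of the Instr_cnt delta to the Cycle_cnt delta over the same interval. The function name below is hypothetical, and the inputs are assumed to be read-to-read differences already taken from the two counters.

```python
def instructions_per_cycle(instr_delta, cycle_delta):
    """Average instructions completed per cycle over one measurement interval."""
    if cycle_delta <= 0:
        raise ValueError("cycle_delta must be positive")
    return instr_delta / cycle_delta
```

Since both events can be selected on either counter, one can be placed on PICL and the other on PICU so that both deltas cover the same interval.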

Q.2.2 IIU Statistics

The counters listed in TABLE Q-2 record branch prediction statistics for taken and untaken branches. A retired branch in the following descriptions refers to a branch that reaches the D-stage without being invalidated.

TABLE Q-2 Counters for Collecting IIU Statistics

Counter | Description
IU_Stat_Br_miss_taken [PICL] | The count of retired branches that were predicted to be taken, but in fact were not taken.
IU_Stat_Br_miss_untaken [PICU] | The count of retired branches that were predicted to be untaken, but in fact were taken.
IU_Stat_Br_count_taken [PICL] | Count of retired taken branches.
IU_Stat_Br_count_untaken [PICU] | Count of retired untaken branches.

Q.2.3 IIU Stall Counts

IIU stalls, counted by the counters listed in TABLE Q-3, are the major cause of pipeline stalls (bubbles) from the steering and dependency stages of the pipeline. Stalls are counted for each clock at which the associated condition is true.

TABLE Q-3 Counters for IIU Stalls

Counter | Description
Dispatch0_IC_miss [PICL] | I-queue is empty from instruction cache miss. This count includes external cache miss processing if an external cache miss also occurs.
Dispatch0_mispred [PICU] | I-queue is empty because of branch misprediction.
Dispatch0_br_target [PICL] | I-queue is empty because of a branch target address calculation.
Dispatch0_2nd_br [PICL] | I-queue is empty because of a refetch of a second branch within a fetch group.
Dispatch_rs_mispred [PICL] | I-queue is empty because of a return address stack misprediction.

Q.2.4 R-stage Stall Counts

Stalls are counted for each clock at which the associated condition is true, as described in TABLE Q-4.

TABLE Q-4 Counters for R-stage Stalls

Counter | Description
Rstall_storeQ [PICL] | Store queue cannot hold additional stores and a store instruction is the first instruction in the group.
Rstall_FP_use [PICU] | First instruction in the group depends on an earlier floating-point result that is not yet available.
Rstall_IU_use [PICL] | First instruction in the group depends on an earlier integer result that is not yet available.

Q.2.5 Recirculate Counts

Recirculation instrumentation is implemented through the counters listed in TABLE Q-5.

TABLE Q-5 Counters for Recirculation

Counter | Description
Re_RAW_miss [PICU] | There is a load in the E-stage and there is a nonbypassable read-after-write hazard with an earlier store instruction. This condition means that load data are being delayed by completion of an earlier store.
Re_endian_miss [PICU] | There is a little-endian load in the E-stage that was predicted to be a normal latency, big-endian load.
Re_FPU_bypass [PICU] | An FPU bypass condition that does not have a direct bypass path occurred.
Re_DC_miss [PICU] | A data cache miss occurred.
Re_EC_miss [PICU] | An external cache miss occurred.

Q.2.6 Memory Access Statistics

Instruction, data, write and external cache access statistics can be collected through the counters listed in TABLE Q-6. Counts are updated by each cache access, regardless of whether the access will be used.

TABLE Q-6 Counters for Memory Access Statistics

Counter | Description

Instruction Cache
IC_ref [PICL] | I-cache references. I-cache references are fetches of up to four instructions from an aligned block of eight instructions. I-cache references are generally prefetches and do not correspond exactly to the instructions executed.
IC_miss [PICU] | I-cache misses.
IC_miss_cancelled [PICU] | I-cache misses canceled.
ITLB_miss [PICU] | I-TLB miss traps taken.

Data Cache
DC_rd [PICL] | D-cache read references (including accesses that subsequently trap). References to pages that are not virtually cacheable (TTE CV bit = 0) are not counted.
DC_rd_miss [PICU] | D-cache read misses.
DC_wr [PICL] | D-cache write references (including accesses that subsequently trap). Non-data-cacheable accesses are not counted.
DC_wr_miss [PICU] | D-cache write misses.
DTLB_miss [PICU] | D-TLB miss traps taken.

Write Cache
WC_miss [PICU] | W-cache misses.
WC_snoop_cb [PICU] | W-cache copybacks generated by external snoops.
WC_scrubbed [PICU] | W-cache hits to clean lines.
WC_wb_wo_read [PICU] | W-cache writebacks not requiring a read.

External Cache
The E-cache write hit count is determined by subtraction of the read hit and the instruction hit count from the total E-cache hit count. The E-cache write reference count is determined by subtraction of the D-cache read miss and I-cache misses from the total E-cache references. Because of write caching, this is not the same as D-cache write misses.
Note: A block load or store access is counted as eight references. Atomics count the read and write individually.
EC_ref [PICL] | Total E-cache references. Noncacheable accesses are not counted. A 64-byte request is counted as 1.
EC_misses [PICU] | Total number of E-cache misses sent to the System Interface Unit (SIU).
EC_write_hit_RTO [PICL] | E-cache hits that do a read-to-own bus transaction.
EC_wb [PICU] | The number of dirty subblocks that produce writebacks due to E-cache misses.
EC_snoop_inv [PICL] | E-cache invalidates generated from an external snoop.
EC_snoop_cb [PICU] | E-cache copybacks generated from an external snoop.
EC_rd_miss [PICL] | E-cache read misses from D-cache requests.
EC_ic_miss [PICU] | E-cache read misses from instruction cache requests.
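The two derived E-cache quantities described in the External Cache portion of TABLE Q-6 are simple subtractions, sketched below. The function and parameter names are hypothetical. The inputs are assumed to be event counts for the same workload; because several of the underlying counters share a PIC, they would in practice be gathered over multiple passes of program execution.

```python
def ecache_write_references(ec_ref, dc_rd_miss, ic_miss):
    """Write references = total E-cache references - D-cache read misses - I-cache misses."""
    return ec_ref - dc_rd_miss - ic_miss


def ecache_write_hits(total_hits, read_hits, instruction_hits):
    """Write hits = total E-cache hits - read hits - instruction hits."""
    return total_hits - read_hits - instruction_hits
```

As the table notes, the derived write reference count is not the same as the D-cache write miss count because of write caching.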

Q.2.7 System Interface Statistics

System interface statistics are collected through the counters listed in TABLE Q-7.

TABLE Q-7 Counters for System Interface Statistics

Counter | Description
SI_snoop [PICL] | Count of snoops from other processors (foreign).
SI_ciq_flow [PICL] | Count of system cycles with flow control (PauseOut) asserted from this processor.
SI_owned [PICL] | Count of times owned_in is asserted on our requests.

Q.2.8 Software Statistics

Software statistics are collected through the counters listed in TABLE Q-8.

TABLE Q-8 Counters for Software Statistics

Counter | Description
SW_count_0 [PICL] | Software-generated count of occurrences of sethi %hi(0xfc000),%g0 instruction.
SW_count_1 [PICU] | Software-generated count of occurrences of sethi %hi(0xfc000),%g0 instruction.

Q.2.9 Floating-Point Operation Statistics

Floating-point operation statistics are collected through the counters listed in TABLE Q-9.

TABLE Q-9 Counters for Floating-Point Operation Statistics

Counter | Description
FA_pipe_completion [PICL] | Count of the number of instructions that complete execution on the Floating-Point/Graphics ALU (FGA) pipeline.
FM_pipe_completion [PICU] | Count of the number of instructions that complete execution on the Floating-Point/Graphics Multiply (FGM) pipeline.

Q.2.10 Memory Controller Statistics

Memory controller statistics are collected through the counters listed in TABLE Q-10.

TABLE Q-10 Counters for Memory Controller Statistics

Counter | Description
MC_reads_0 [PICL] | Number of read requests completed to memory bank 0.
MC_writes_0 [PICU] | Number of write requests completed to memory bank 0.
MC_reads_1 [PICL] | Number of read requests completed to memory bank 1.
MC_writes_1 [PICU] | Number of write requests completed to memory bank 1.
MC_reads_2 [PICL] | Number of read requests completed to memory bank 2.
MC_writes_2 [PICU] | Number of write requests completed to memory bank 2.
MC_reads_3 [PICL] | Number of read requests completed to memory bank 3.
MC_writes_3 [PICU] | Number of write requests completed to memory bank 3.
MC_stalls_0 [PICL] | Number of processor cycles that requests were stalled in the MCU queues because bank 0 was busy with a previous request. The delay could be due to data bus contention, bank busy, data availability for a write, etc.
MC_stalls_1 [PICU] | Number of processor cycles that requests were stalled in the MCU queues because bank 1 was busy with a previous request. The delay could be due to data bus contention, bank busy, data availability for a write, etc.
MC_stalls_2 [PICL] | Number of processor cycles that requests were stalled in the MCU queues because bank 2 was busy with a previous request. The delay could be due to data bus contention, bank busy, data availability for a write, etc.
MC_stalls_3 [PICU] | Number of processor cycles that requests were stalled in the MCU queues because bank 3 was busy with a previous request. The delay could be due to data bus contention, bank busy, data availability for a write, etc.

Q.2.11 PCR.SL and PCR.SU Encoding

TABLE Q-11 lists the PCR.SL selection bit field encoding; TABLE Q-12 lists the PCR.SU encoding.

TABLE Q-11 PCR.SL Selection Bit Field Encoding

PCR.SL Value   PICL Selection
000000         Cycle_cnt
000001         Instr_cnt
000010         Dispatch0_IC_miss
000011         Dispatch0_br_target
000100         Dispatch0_2nd_br
000101         Rstall_storeQ
000110         Rstall_IU_use
000111         Reserved
001000         IC_ref
001001         DC_rd
001010         DC_wr
001011         Reserved
001100         EC_ref
001101         EC_write_hit_RTO
001110         EC_snoop_inv
001111         EC_rd_miss
010000         PC_port0_rd
010001         SI_snoop
010010         SI_ciq_flow
010011         SI_owned
010100         SW_count_0
010101         IU_Stat_Br_miss_taken
010110         IU_Stat_Br_count_taken
010111         Dispatch_rs_mispred
011000         FA_pipe_completion
011001-011111  Reserved
100000         MC_reads_0
100001         MC_reads_1
100010         MC_reads_2
100011         MC_reads_3
100100         MC_stalls_0
100101         MC_stalls_2

TABLE Q-12 PCR.SU Selection Bit Field Encoding

PCR.SU Value   PICU Selection
000000         Cycle_cnt
000001         Instr_cnt
000010         Dispatch0_mispred
000011         IC_miss_cancelled
000100         Re_endian_miss
000101         Re_FPU_bypass
000110         Re_DC_miss
000111         Re_EC_miss
001000         IC_miss
001001         DC_rd_miss
001010         DC_wr_miss
001011         Rstall_FP_use
001100         EC_misses
001101         EC_wb
001110         EC_snoop_cb
001111         EC_ic_miss
010000         Re_PC_miss
010001         ITLB_miss
010010         DTLB_miss
010011         WC_miss
010100         WC_snoop_cb
010101         WC_scrubbed
010110         WC_wb_wo_read
010111         Reserved
011000         PC_soft_hit
011001         PC_snoop_inv
011010         PC_hard_hit
011011         PC_port1_rd
011100         SW_count_1
011101         IU_Stat_Br_miss_untaken
011110         IU_Stat_Br_count_untaken
011111         PC_MS_misses
100000         MC_writes_0
100001         MC_writes_1
100010         MC_writes_2
100011         MC_writes_3
100100         MC_stalls_1
100101         MC_stalls_3
100110         Re_RAW_miss
100111         FM_pipe_completion
101000-111111  Reserved
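As an illustration of how the two encoding tables might be used together, the following Python sketch looks up the 6-bit selection values for one PICL event and one PICU event. Only the memory controller events are transcribed, and the names `PICL_SELECT`, `PICU_SELECT`, and `select_values` are hypothetical helpers, not part of any Sun software.

```python
# Selection encodings transcribed from TABLE Q-11 (PICL) and TABLE Q-12 (PICU).
# Only the memory controller events are listed; the dictionary and function
# names are illustrative assumptions.
PICL_SELECT = {
    "MC_reads_0": 0b100000,
    "MC_reads_1": 0b100001,
    "MC_reads_2": 0b100010,
    "MC_reads_3": 0b100011,
    "MC_stalls_0": 0b100100,
    "MC_stalls_2": 0b100101,
}
PICU_SELECT = {
    "MC_writes_0": 0b100000,
    "MC_writes_1": 0b100001,
    "MC_writes_2": 0b100010,
    "MC_writes_3": 0b100011,
    "MC_stalls_1": 0b100100,
    "MC_stalls_3": 0b100101,
}


def select_values(picl_event, picu_event):
    """Return the (SL, SU) 6-bit selection values for a pair of events.

    Raises KeyError if an event is requested on a counter that cannot
    count it (for example, MC_writes_0 on PICL).
    """
    return PICL_SELECT[picl_event], PICU_SELECT[picu_event]
```

Because each event is tied to one counter, reads and writes to the same bank can be counted in a single run, whereas MC_stalls_0 and MC_stalls_2 share PICL and need separate runs.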

S.APPENDIX R

Specific Information About Fireplane Interconnect

This appendix describes the UltraSPARC III system bus in the following sections:
■ Power Management
■ Fireplane Interconnect ASI Extensions
■ RED_state and Reset Values

R.1 Power Management

Power-down mode supports Energy Star compliance. Energy Star specifies a system power dissipation of 30 watts in the standby mode. To support this requirement, the goal is 2 watts for the processor and one-half watt for the remainder of the module when in the power-down mode.

R.1.1 Low Power Mode

Low power mode is entered when an external system device asserts the Fireplane_slow signal. In low power mode, the processor divides the internal clock by the amount specified by the E*_CLK field of the Fireplane Interconnect configuration register; see Fireplane Configuration Register on page 279 for details. Whenever the Fireplane_slow signal is asserted, the contents of the E*_CLK field are copied to the clock divider; the Fireplane_slow signal can then be deasserted.

To exit power-down mode, software resets the E*_CLK field to the full clock rate value. Then, another assertion of the Fireplane_slow signal copies the new clock rate selection to the clock divider.

Caution – Software must exit low power mode in two clock rate steps: from 1/32 to 1/2, then from 1/2 to full clock rate. First, software selects 1/2 clock rate and initiates the low power sequence. Second, software selects full clock rate in the E*_CLK field of the Fireplane Configuration Register and initiates a second low power sequence. Failure to transition from 1/32 to 1/2 and then to full clock rate could result in incorrect program behavior.

Prior to the assertion of the Fireplane_slow signal, the external system device asserts a Fireplane_freeze signal to quiesce all activity on the processor interconnect. The processor responds by halting all outgoing Fireplane Interconnect requests and deasserting the Fireplane_freeze_ack signal (a wired-or signal). After all Fireplane masters have responded to the Fireplane_freeze signal, the external device asserts the Fireplane_slow signal to transition to a new clock frequency.

In low power mode, the processor continues to function as normal but at a reduced rate. There is no need for software to flush any caches to preserve state—all caches and memory will retain state and continue to operate. The only requirement of software is to set up the E*_CLK field of the Fireplane Configuration Register before a low power clock transition occurs.
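The two-step exit rule in the Caution above can be sketched in Python as follows. The E*_CLK encodings are those given for the Fireplane Configuration Register (0 = full rate, 1 = 1/2, 2 = 1/32); the constant and function names are hypothetical, and the sketch models only the order of E*_CLK values, not the Fireplane_freeze/Fireplane_slow handshake itself.

```python
# E*_CLK encodings as defined for the Fireplane Configuration Register.
E_CLK_FULL = 0   # full processor clock rate
E_CLK_HALF = 1   # 1/2 clock rate
E_CLK_1_32 = 2   # 1/32 clock rate


def exit_low_power_steps(current_e_clk):
    """Return the E*_CLK values to program, in order, to reach full rate.

    Each returned value is assumed to be followed by one Fireplane_slow
    assertion from the external system device, which copies the value
    to the clock divider.
    """
    if current_e_clk == E_CLK_1_32:
        # 1/32 must pass through 1/2 before returning to full rate.
        return [E_CLK_HALF, E_CLK_FULL]
    if current_e_clk == E_CLK_HALF:
        return [E_CLK_FULL]
    if current_e_clk == E_CLK_FULL:
        return []
    raise ValueError("reserved E*_CLK encoding: %d" % current_e_clk)
```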

R.2 Fireplane Interconnect ASI Extensions

Fireplane Interconnect ASI extensions include:
■ Fireplane Port ID Register
■ Fireplane Configuration Register
■ Fireplane Address Register

Fireplane Port ID Register

The per-processor FIREPLANE_PORT_ID Register can be accessed only from the Fireplane bus as a read-only, noncacheable, slave access at offset 0 of the address space of the processor port.

This register indicates the capability of the CPU module. See TABLE O-1 on page 180 for the state of this register after reset. The FIREPLANE_PORT_ID Register is illustrated in FIGURE R-1 and described in TABLE R-1.

FC₁₆ <63:56> | Reserved <55:27> | AID <26:17> | M/S <16> | MID <15:10> | MT <9:4> | MR <3:0>

FIGURE R-1 FIREPLANE_PORT_ID Register Format

TABLE R-1 FIREPLANE_PORT_ID Register Format

Bits   Field     Description
63:56  FC₁₆      A one-byte field containing the value FC₁₆. This field is used by the OpenBoot PROM to indicate that no Fcode PROM is present.
55:27  -         Reserved.
26:17  AID       10-bit Fireplane Agent ID, read-only field, copy of AID field of the Fireplane Configuration Register.
16     M/S       Reserved: Master or Slave bit. Indicates whether the agent is a Fireplane Interconnect master or a Fireplane Interconnect slave device: 1 = master, 0 = slave.
15:10  MID       Manufacturer ID, read-only field. 03E₁₆ for UltraSPARC III. Consult the product data sheet for the content of this register.
9:4    MT        Module Type, read-only field, copy of MT field of the Fireplane Interconnect Configuration Register.
3:0    MR        Module revision, read-only field, copy of the MR field of the Fireplane Interconnect Configuration Register.
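A minimal Python sketch of how a 64-bit FIREPLANE_PORT_ID value could be split into the fields of TABLE R-1 is shown below. The helper names `bits` and `decode_port_id` are illustrative assumptions; how the raw value is read from the Fireplane bus is outside the scope of the sketch.

```python
def bits(value, hi, lo):
    """Extract the inclusive bit range <hi:lo> from an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_port_id(reg):
    """Split a FIREPLANE_PORT_ID value into the fields of TABLE R-1."""
    return {
        "FC": bits(reg, 63, 56),    # expected to read as 0xFC
        "AID": bits(reg, 26, 17),   # 10-bit Fireplane Agent ID
        "M/S": bits(reg, 16, 16),   # 1 = master, 0 = slave
        "MID": bits(reg, 15, 10),   # manufacturer ID
        "MT": bits(reg, 9, 4),      # module type
        "MR": bits(reg, 3, 0),      # module revision
    }
```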

Fireplane Configuration Register

The Fireplane Configuration Register can be accessed at ASI 4A₁₆, VA = 0. This is a 64-bit register; non-64-bit aligned accesses cause a mem_address_not_aligned trap. See TABLE O-1 on page 180 for the state of this register after reset. The register is illustrated in FIGURE R-2 and described in TABLE R-2.

Reserved <63:61> | DTL_6 <60:59> | DTL_5 <58:57> | DTL_4 <56:55> | DTL_3 <54:53> | DTL_2 <52:51> | DTL_1 <50:49> | MR <48:45> | MT <44:39>

TOF <38> | TOL <37:34> | - <33> | DEAD <32> | E*_CLK <31:30> | - <29> | DBG <28:27> | AID <26:17> | CLK <16:15> | CBND <14:9> | CBASE <8:3> | SLOW <2> | HBM <1> | SSM <0>

FIGURE R-2 Fireplane Configuration Register Format

TABLE R-2 FIREPLANE_CONFIG Register Format

Bits   Field
    Description

60:59  DTL_6
58:57  DTL_5
56:55  DTL_4
54:53  DTL_3
52:51  DTL_2
50:49  DTL_1
    DTL_*<1:0>: DTL termination mode.
    0₁₆: Reserved
    1₁₆: DTL-end – Termination pullup
    2₁₆: DTL-mid – 25 ohm pulldown
    3₁₆: DTL-2 – Termination pullup and 25 ohm pulldown
    See TABLE R-3 on page 281 for DTL pin configuration.

48:45  MR<3:0>
    Module revision. Written at boot time by the OpenBoot PROM (OBP) code, which reads it from the module serial PROM.

44:39  MT<5:0>
    Module type. Written at boot time by OBP code, which reads it from the module serial PROM.

38     TOF
    Timeout Freeze mode. If set, all timeout counters are reset and stop counting.

37:34  TOL<3:0>
    Timeout Log value. Timeout period is 2^(10 + (2 × TOL)) Fireplane cycles. Setting TOL ≥ 10 results in the max timeout period of 2^30 Fireplane cycles. Setting TOL = 9 results in the Fireplane timeout period of 1.75 seconds. A TOL value of 0 should not be used since the timeout could occur immediately or as much as 2^10 Fireplane cycles later.

    TOL  Timeout Period    TOL  Timeout Period
    0    2^10              8    2^26
    1    2^12              9    2^28
    2    2^14              10   2^30
    3    2^16              11   2^31
    4    2^18              12   2^32
    5    2^20              13   2^32 + 2^31
    6    2^22              14   2^33
    7    2^24              15   2^33 + 2^32

32     DEAD
    0: Back-to-back Fireplane Request Bus Mastership.
    1: Inserts a dead cycle in between bus masters on the Fireplane Request Bus.

31:30  E*_CLK<1:0>
    Selects the processor clock ratio to be used for the next E* clock rate transition. The defined clock ratios are:
    0: Full processor clock rate.
    1: After the next E* clock rate transition the processor will operate at 1/2 of the standard clock rate.
    2: After the next E* clock rate transition the processor will operate at 1/32 of the standard clock rate.
    3: Reserved.

28:27  DBG<1:0>
    Debug.
    0: Up to 15 outstanding transactions allowed
    1: Up to 8 outstanding transactions allowed
    2: Up to 4 outstanding transactions allowed
    3: One outstanding transaction allowed

26:17  AID<9:0>
    Contains the 10-bit Fireplane bus agent identifier for this processor. This field must be initialized on power-up before any Fireplane transactions are initiated.

16:15  CLK<1:0>
    Processor to Fireplane clock ratio. This field may only be written during initialization before any Fireplane transactions are initiated.
    0: 4:1 processor to system clock ratio
    1: 5:1 processor to system clock ratio
    2: 6:1 processor to system clock ratio
    3: Reserved

14:9   CBND<5:0>
    CBND address limit. Physical address bits <42:37> are compared to the CBND field. If PA<42:37> ≥ CBASE and PA<42:37> < CBND, then PA is in COMA space (Remote_WriteBack not issued in SSM mode).

8:3    CBASE<5:0>
    CBASE address limit. Physical address bits <42:37> are compared to the CBASE field. If PA<42:37> ≥ CBASE and PA<42:37> < CBND, then PA is in COMA space (Remote_WriteBack not issued in SSM mode).

2      SLOW
    If set, it expects snoop responses from other Fireplane agents, using the slow snoop response as defined in the Fireplane Interface Specification.

1      HBM
    Hierarchical bus mode. If set, uses the Fireplane Interconnect protocol for a multilevel transaction request bus. If cleared, UltraSPARC III uses the Fireplane Interconnect protocol for a single-level transaction request bus.

0      SSM
    If set, performs Fireplane transactions in accordance with the Fireplane Interconnect Scalable Shared Memory model. See the Sun Fireplane Interconnect Specification for more details.
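To make the TOL and CBASE/CBND descriptions concrete, here is a small Python sketch that reads those fields out of a FIREPLANE_CONFIG value. The timeout list simply transcribes the TOL table above, and the COMA check applies the stated comparison on PA<42:37>. All names are hypothetical, and the sketch makes no claim about how hardware evaluates these fields internally.

```python
# Timeout periods in Fireplane cycles, indexed by TOL, transcribed from the
# TOL table in TABLE R-2.
TOL_PERIOD = [
    2**10, 2**12, 2**14, 2**16, 2**18, 2**20, 2**22, 2**24,
    2**26, 2**28, 2**30, 2**31, 2**32, 2**32 + 2**31, 2**33, 2**33 + 2**32,
]


def timeout_period(config):
    """Return the timeout period selected by TOL (config bits 37:34)."""
    tol = (config >> 34) & 0xF
    return TOL_PERIOD[tol]


def in_coma_space(config, pa):
    """Apply the documented test: CBASE <= PA<42:37> < CBND."""
    cbase = (config >> 3) & 0x3F   # config bits 8:3
    cbnd = (config >> 9) & 0x3F    # config bits 14:9
    pa_hi = (pa >> 37) & 0x3F      # physical address bits 42:37
    return cbase <= pa_hi < cbnd
```

For TOL values 0 through 10 the table entries agree with the formula 2^(10 + 2 × TOL); the sketch uses the table so that the larger TOL values are also covered.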

TABLE R-3 DTL Pin Configurations

DTL Groups | Power-on Reset State | Workstation | Midrange Server (mid, end) | Enterprise Server (mid, end)

DTL_1  3₁₆ 3₁₆ 2₁₆ 1₁₆ 2₁₆
    Group 0: COMMAND_L<1:0>, ADDRESS_L<42:4>, MASK_L<9:0>, ATRANSID_L<8:0>
    Group 2: ADDRARBOUT_L, ADDRARBIN_L<4:0>
    Group 8: ADDRPTY_L

DTL_2  1₁₆ 3₁₆ 1₁₆ 1₁₆ 2₁₆
    Group 1: INCOMING_L, PREREQIN_L

DTL_3  3₁₆ 1₁₆ 1₁₆ 1₁₆ 1₁₆ 2₁₆
    Group 3: PAUSEOUT_L, MAPPEDOUT_L, SHAREDOUT_L, OWNEDOUT_L

DTL_4  3₁₆ 3₁₆ 1₁₆ 1₁₆ 2₁₆
    Group 5: DTRANSID_L<8:0>, DTARG_L, DSTAT_L<1:0>
    Group 6: TARGID_L<8:0>, TTRANSID_L<8:0>, PARITYBIDI_L

DTL_5  1₁₆ 1₁₆ 1₁₆ 1₁₆ 1₁₆ 1₁₆
    Group 11: ERROR_L, FREEZE_L, FREEZEACK_L, CHANGE_L

DTL_6  1₁₆ 1₁₆ 1₁₆ 1₁₆ 1₁₆ 1₁₆
    Group 4: PAUSEIN_L, OWNEDIN_L, SHAREDIN_L, MAPPEDIN_L

Note – The Fireplane bootbus signals CAStrobe_L, ACD_L, and Ready_L have their DTL configuration programmable through two UltraSPARC III package pins. All other Fireplane DTL signals that do not have a programmable configuration are configured as DTL-end.

Note – Several fields of the Fireplane Configuration Register do not take effect until after a soft POR is performed. If these fields are read before a soft POR, then the value last written will be returned. However, this value may not be the one currently being used by the processor. The fields that require a soft POR to take effect are DTL, CLK, MT, and MR.

Fireplane Interconnect Address Register

The Fireplane Interconnect Address Register can be accessed at ASI 4A₁₆, VA = 08₁₆. This is a 64-bit register; non-64-bit-aligned accesses cause a mem_address_not_aligned trap.

FIGURE R-3 illustrates the register, where Address is the 20-bit physical address of the Fireplane-accessible registers (Fireplane_port_id and Memory Controller registers). These address bits correspond to Fireplane bus address bits PA<42:23>.

Reserved <63:43> | Address <42:23> | Reserved <22:0>

FIGURE R-3 Fireplane Control Register

See TABLE O-1 on page 180 for the state of this register after reset.
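The following Python sketch shows one way the Address field might be turned into the base physical address of the Fireplane-accessible registers. It assumes, as the register layout suggests, that the field occupies register bits 42:23 and maps directly onto PA<42:23>. The 400000₁₆ offset is the one this supplement gives for Mem_Timing1_CTL; treating the masked register value as the FIREPLANE_ADDRESS_REG term of that formula is an assumption of this sketch, and the function names are hypothetical.

```python
# Address field: register bits 42:23, corresponding to PA<42:23>.
ADDRESS_MASK = ((1 << 20) - 1) << 23


def fireplane_base_pa(address_reg):
    """Return the base PA, ignoring the reserved register bits."""
    return address_reg & ADDRESS_MASK


def mem_timing1_pa(address_reg):
    """Return the assumed PA of Mem_Timing1_CTL (base + 0x400000)."""
    return fireplane_base_pa(address_reg) + 0x400000
```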

R.3 RED_state and Reset Values

Reset values and RED_state for Fireplane-specific machines are listed in TABLE R-4.

TABLE R-4 Fireplane-Specific Machine State After Reset and in RED_state

Name / Field        Hard_POR          System Reset                         WDR, XIR, SIR, RED_state³

FIREPLANE_PORT_ID *
    FC              FC₁₆
    AID             0
    MID             3E₁₆
    MT              undefined¹
    MR              undefined¹

FIREPLANE_CONFIG
    SSM             0                 unchanged                            unchanged
    HBM             0                 updated²                             unchanged
    SLOW            0                 updated²                             unchanged
    CBASE           0                 unchanged                            unchanged
    CBND            0                 unchanged                            unchanged
    CLK             0                 updated²                             unchanged
    MT              0                 updated²                             unchanged
    MR              0                 updated²                             unchanged
    E*_CLK          0                 unchanged                            unchanged
    AID             0                 <9:5> unchanged, <4:0> updated²      unchanged
    DBG             0                 unchanged                            unchanged
    DTL             see TABLE R-3     updated²                             unchanged
    TOL             0                 unchanged                            unchanged
    TOF             0                 unchanged                            unchanged

FIREPLANE_ADDRESS
    all             unknown           unchanged                            unchanged

1. The state of Module Type (MT) and Module Revision (MR) fields of FIREPLANE_PORT_ID after reset is not defined. Typically, software reads a module PROM after a reset, then updates MT and MR.
2. This field of the FIREPLANE_CONFIG register is not immediately updated when written by software. Instead, software writes to a shadow register, and the contents of this field of the shadow register are not copied to the active FIREPLANE_CONFIG register until the next reset event. Writes to this field of the FIREPLANE_CONFIG register have no visible effect until a reset occurs.
3. See TABLE O-1 on page 180 for the state of the registers after reset.

S.APPENDIX S

Summary of Differences Between UltraSPARC III and SPARC64 V

Please refer to Appendix S of the SPARC64 V Implementation Supplement.

S.APPENDIX T

UltraSPARC III Chip Identification

T.1 Allocation of Identification Codes for UltraSPARC III

Allocation is listed in TABLE T-1. All values are in hexadecimal, written most-significant-bit-first, and labelled as follows:
■ J  JTAGID version
■ M  V9 version mask_major
■ m  V9 version mask_minor

TABLE T-1 Allocation Summary

Tape-out  Process      Revision (12 bits)  JTAGID (32 bits)  Version (64 bits)
TO_3.2    15C05.A-Al   252₁₆               2919C07D₁₆        003E001452000507₁₆
TO_3.4    15C05.A-Al   554₁₆               5919C07D₁₆        003E001454000507₁₆
TO_3.9    15C05.B-Cu   F59₁₆               F919C07D₁₆        003E001459000507₁₆
TO_3.B    15C05.B-Cu   65B₁₆               6919C07D₁₆        003E00145B000507₁₆
TO_3.C    15C05.B-Cu   85C₁₆               8919C07D₁₆        003E00145C000507₁₆

Digit labels: the three Revision digits are J, M, m; the leading JTAGID digit is J; the ninth and tenth Version digits from the left (VER<31:24>) are M and m.

T.2 Identification Registers on UltraSPARC III

The four identification registers on UltraSPARC III are listed in TABLE T-2:

TABLE T-2 UltraSPARC III ID Registers

Register        Width
JTAGID          32 bits
Version (VER)   64 bits
SerialID        64 bits
Revision        12 bits

T.2.1 JTAGID (JTAG Identification Register)

The 32-bit JTAGID register is accessible only from the TAP controller (JTAG) and complies with the IEEE 1149.1 standard. The JTAG instruction code to access this register is FC₁₆. Bit 31 is the most significant bit, which is closest to TDI. Bits of JTAGID are described in TABLE T-3.

TABLE T-3 JTAGID Register Bits

Bits   Field    Description
31:28  version  Version. These 4 bits are incremented on every tape-out (whether all layers or not), and the layout is such that the change can be made on any metal layer.
27:12  part_no  Part Number. These 16 bits identify the part; UltraSPARC III is 919C₁₆.
11:1   manuf    Manufacturing ID. Manufacturer’s JEDEC code in a compressed format, which is 03E₁₆.
0      -        The least significant bit of the JTAGID register is always 1.
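The field boundaries in TABLE T-3 can be checked against the JTAGID values in TABLE T-1 with a short Python sketch such as the one below. The function name `decode_jtagid` is a hypothetical helper; the example value used in the usage note is the TO_3.2 JTAGID from TABLE T-1.

```python
def decode_jtagid(jtagid):
    """Split a 32-bit JTAGID value into the fields of TABLE T-3."""
    return {
        "version": (jtagid >> 28) & 0xF,      # bits 31:28
        "part_no": (jtagid >> 12) & 0xFFFF,   # bits 27:12
        "manuf": (jtagid >> 1) & 0x7FF,       # bits 11:1
        "lsb": jtagid & 0x1,                  # bit 0, always 1
    }
```

Applied to 2919C07D₁₆, this yields version 2, part number 919C₁₆, and manufacturing ID 03E₁₆, in agreement with the table.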

T.2.2 Version Register (V9)

The 64-bit Version (VER) register is only accessible via the SPARC V9 RDPR instruction. It is not accessible through an ASI, ASR, or the JTAG interface. In SPARC assembly language, the instruction is ‘rdpr %ver, regrd’. The contents of the VER register on UltraSPARC III are described in Version (VER) Register on page 29.

For comparative and historical purposes, VER.impl on various UltraSPARC implementations has been encoded as listed in TABLE T-4.

TABLE T-4 Ver.impl Values

Implementation   Sun Project name   VER.impl
UltraSPARC I     Spitfire           0010₁₆
UltraSPARC II    Blackbird          0011₁₆
UltraSPARC IIi   Sabre              0012₁₆
UltraSPARC II    Sapphire Black     0013₁₆
UltraSPARC IIi   Sapphire Red       0013₁₆
UltraSPARC III   Cheetah            0014₁₆

The VER.mask field consists of two 4-bit subfields: the “major” mask number (VER bits 31:28) and the “minor” mask number (VER bits 27:24). The major mask number bits are incremented for every major release (that is, release of all layers). The minor mask number bits are incremented for every minor release (that is, fewer than all layers released; usually one or more metal layers).

T.2.3 FIREPLANE_PORT_ID MID field

The 6-bit MID field in the FIREPLANE_PORT_ID register contains the 6 least significant bits (3E₁₆) of Sun’s JEDEC code. This register can be found in the cregs_dp block.

T.2.4 Serial ID Register

The SerialID register is 64 bits wide; its bits are laser fuse-programmed during manufacturing. The JTAG instruction code to access this register is FD₁₆. This register is also accessible from the ASI bus at ASI address 53₁₆ (ASI_SERIAL_ID¹). Bit 0 is closest to TDO. Register bits are described in TABLE T-5.

TABLE T-5 SerialID Register Bits

Bit (Fuse) #   Meaning
63:41          Reserved (read as 0)
40             Bin number
39:16          Lot number
15:10          Wafer number
9:5            Column number on wafer
4:0            Row number on wafer
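As a sketch of how the fuse fields in TABLE T-5 could be unpacked from a 64-bit SerialID value, consider the following Python fragment. The function name and dictionary keys are illustrative assumptions; the bit ranges come directly from the table.

```python
def decode_serial_id(serial_id):
    """Split a 64-bit SerialID value into the fields of TABLE T-5."""
    return {
        "bin": (serial_id >> 40) & 0x1,        # bit 40
        "lot": (serial_id >> 16) & 0xFFFFFF,   # bits 39:16
        "wafer": (serial_id >> 10) & 0x3F,     # bits 15:10
        "column": (serial_id >> 5) & 0x1F,     # bits 9:5
        "row": serial_id & 0x1F,               # bits 4:0
    }
```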

T.2.5 Revision Register

The 12-bit Revision register is composed from part of the JTAGID register and parts of the Version register. It is only accessible through JTAG, and the JTAG instruction code is FE₁₆. Register bits are listed in TABLE T-6.

TABLE T-6 Revision Register Bits

Bits   Field name   Source
11:8   version      JTAGID<31:28>
7:4    mask_major   VER<31:28>
3:0    mask_minor   VER<27:24>
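The composition described in TABLE T-6 can be illustrated with a short Python sketch that rebuilds the 12-bit Revision value from JTAGID and VER. The function name `revision` is hypothetical; the inputs in the usage note are the JTAGID and Version values listed for each tape-out in TABLE T-1.

```python
def revision(jtagid, ver):
    """Compose the 12-bit Revision value from JTAGID and VER (TABLE T-6)."""
    version = (jtagid >> 28) & 0xF    # JTAGID<31:28>
    mask_major = (ver >> 28) & 0xF    # VER<31:28>
    mask_minor = (ver >> 24) & 0xF    # VER<27:24>
    return (version << 8) | (mask_major << 4) | mask_minor
```

For TO_3.2, `revision(0x2919C07D, 0x003E001452000507)` gives 252₁₆, matching the Revision column of TABLE T-1; the other four tape-outs check out the same way.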

1. ASI_SERIAL_ID was called ASI_DEVICE_ID+SERIAL_ID in some older documents

T.2.6 Identification Code on UltraSPARC III Package

The identification codes listed in TABLE T-7 are printed on UltraSPARC III packages.

TABLE T-7 UltraSPARC III Package ID Codes

Tapeout  Process      Sun Internal Part #  Marketing Part #
TO_3.2   15C05.A-Al   100-6682-04          SME1050ALGA-600
TO_3.2   15C05.A-Al   100-6683-04          SME1050ALGA-750
TO_3.2   15C05.A-Al   100-6753-04          SME1050ALGA-900
TO_3.4   15C05.A-Al   100-6984-01          SME1050ALGA-750
TO_3.4   15C05.A-Al   100-6985-01          SME1050ALGA-900
TO_3.9   15C05.B-Cu   100-XXXX-01          SME1050ALGA-600
TO_3.9   15C05.B-Cu   100-6954-01          SME1050ALGA-750
TO_3.9   15C05.B-Cu   100-6870-02          SME1050ALGA-900
TO_3.B   15C05.B-Cu   100-7003-01          SME1050ALGA-750
TO_3.B   15C05.B-Cu   100-7004-01          SME1050ALGA-900
TO_3.C   15C05.B-Cu   (TBD)                (TBD)

S.APPENDIX U

Memory Controller

The memory system consists of the memory controller, external address latches, external control buffers, and two physical banks—one even and one odd bank. This chapter focuses on the memory controller, briefly describing the other entities of the memory system. The chapter contains these sections:
■ Memory Subsystem
■ Programmable Registers
■ Memory Initialization
■ Energy Star 1/32 Mode

U.1 Memory Subsystem

Each physical bank consists of four 144-bit SDRAM DIMMs. Each physical bank can have one or two logical banks, for a maximum of four logical banks. These four banks share a 576-bit data bus. The even and odd physical banks have individual address buses buffered by a latch. They also have individual control buses, nonlatched but buffered. The memory controller supports overlapping cycles to the SDRAM banks for maximum main memory performance. FIGURE U-1 illustrates the functional blocks of the memory subsystem. FIGURE U-2 offers a more detailed look.

[Figure: block diagram of the memory subsystem. The UltraSPARC III memory controller drives clocks (CLKs, through CLK Buf), control signals (CTLs, through CTL Buf), and addresses (ADDRs, through the even and odd address latches as Latched_adr_even and Latched_adr_odd) to the 144-bit SDRAM DIMMs. The DIMMs connect through the 576-bit data bus to the bit-sliced data switches.]

FIGURE U-1 Memory Subsystem

Individual chip selects (CS) and clock enable (CKE) are buffered for each bank. Bank pairs, on the same DIMM, share the other control signals (RAS, CAS, WE, and A0–A14). Data are bused to all banks. Bank interleave is optional and by address only (A9–A6), and not data. Data are accessed by all 576 bits (four DIMMs) in one cycle.

[Figure: detailed block diagram. The memory controller drives two address latches (enabled by Latch_en_E and Latch_en_O) that supply A0-A12 and BS0-1 (address and bank select) to the DIMMs, and a control buffer that supplies RAS, CAS, WE, CKE, and CS. The even DIMMs (banks 0 and 2) receive RAS0_L, CAS0_L, WE0_L, CKE_L_0, CKE_L_2, CS_L_0, and CS_L_2; the odd DIMMs (banks 1 and 3) receive RAS1_L, CAS1_L, WE1_L, CKE_L_1, CKE_L_3, CS_L_1, and CS_L_3. Each DIMM has a 144-bit data path onto the 576-bit data bus, which connects through the CDS bit-sliced data switch (288-bit data) to the System Data Bus.]

FIGURE U-2 Detailed Block Diagram

Minimum memory configuration is 256 Mbytes (only four DIMM slots are populated with 64-Mbyte, single-bank DIMMs), and maximum memory configuration is 8 Gbytes (all eight DIMM slots are populated with 1-Gbyte, dual-bank DIMMs). TABLE U-1 and TABLE U-2 list examples of the different DIMM sizes supported by the memory controller.

The refresh rates are dependent on the number of DIMMs installed, and the refresh interval specified by the vendor is based on the number of internal banks in the SDRAM. See the refresh formulas in Tim1 and Tim3 descriptions. Internal banks are not otherwise considered. Bank selects are considered part of the row address. No internal bank interleave or individual precharge is performed.

TABLE U-1 Sizes of Supported Single Bank DIMMs

Single-Bank 144-pin DIMM   Base Device            # of Devices   Min System Memory   Max System Memory   SUN Part #
64 Mbyte                   64 Mbyte (4M x 16)     9              256 Mbyte           512 Mbyte           501-5398
128 Mbyte                  128 Mbyte (8M x 16)    9              512 Mbyte           1 Gbyte             -
256 Mbyte                  256 Mbyte (16M x 16)   9              1 Gbyte             2 Gbyte             -
512 Mbyte                  256 Mbyte (32M x 8)    18             2 Gbyte             4 Gbyte             -

TABLE U-2 Sizes of Supported Double Bank DIMMs

Dual-Bank 144-pin DIMM   Base Device            # of Devices   Min System Memory   Max System Memory   SUN Part #
128 Mbyte                64 Mbyte (4M x 16)     18             512 Mbyte           1 Gbyte             501-4489
256 Mbyte                64 Mbyte (8M x 8)      36             1 Gbyte             2 Gbyte             -
256 Mbyte                128 Mbyte (8M x 16)    18             1 Gbyte             2 Gbyte             501-5401
512 Mbyte                128 Mbyte (16M x 8)    36             2 Gbyte             4 Gbyte             -
1 Gbyte                  128 Mbyte (32M x 4)    72             4 Gbyte             8 Gbyte             501-5399
512 Mbyte                256 Mbyte (16M x 16)   18             2 Gbyte             4 Gbyte             501-5030
1 Gbyte                  256 Mbyte (32M x 8)    36             4 Gbyte             8 Gbyte             501-5031
2 Gbyte                  256 Mbyte (64M x 4)    72             8 Gbyte             16 Gbyte            -

Note – This is not an exhaustive list. Other SDRAM and stacked DIMM configurations might be supported if the maximum row and column address limitations are considered.
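The Min and Max System Memory columns in TABLE U-1 and TABLE U-2 follow from the slot counts described earlier: one physical bank is four DIMMs, and a fully populated system has eight. A minimal Python sketch of that arithmetic is given below; the function name is hypothetical, and it assumes all populated slots hold DIMMs of the same size.

```python
DIMMS_PER_PHYSICAL_BANK = 4   # each physical bank consists of four DIMMs
PHYSICAL_BANKS = 2            # one even and one odd physical bank


def system_memory_range_mbytes(dimm_mbytes):
    """Return (min, max) system memory in Mbytes for a given DIMM size.

    Minimum: one physical bank populated (four DIMMs).
    Maximum: both physical banks populated (eight DIMMs).
    """
    minimum = DIMMS_PER_PHYSICAL_BANK * dimm_mbytes
    maximum = PHYSICAL_BANKS * minimum
    return minimum, maximum
```

For example, 64-Mbyte DIMMs give 256 Mbytes minimum and 512 Mbytes maximum, as in the first row of TABLE U-1.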

U.2 Programmable Registers

The memory controller can interface to a variety of SDRAM grades and densities and processor/system frequency. To accomplish this interfacing, the various device- dependent and system-dependent parameters are kept in the Memory Control Programmable Registers.

Nine Memory Control Registers reside in Sun Fireplane Interconnect noncacheable memory space and are accessible from both Fireplane and through internal ASI space as PIO slave access only:
■ Memory Timing Control I (Mem_Timing1_CTL) (1)
■ Memory Timing Control II (Mem_Timing2_CTL) (1)
■ Memory Timing Control III (Mem_Timing3_CTL) (1)
■ Memory Timing Control IV (Mem_Timing4_CTL) (1)
■ Memory Address Decoding Registers (4)
■ Memory Address Control Register (1)

Caution – Writing to all of the memory controller registers must always be done with full, 64-bit (doubleword) stores. Byte, halfword, word, and partial store operations are not supported.

Note – In the following register descriptions, some bits are defined as reserved (R). All reserved bits should always be programmed to 0.

U.2.1 Memory Timing Control I (Mem_Timing1_CTL)

PA = FIREPLANE_ADDRESS_REG + 400000₁₆
ASI 72₁₆, VA = 0₁₆

Mem_Timing1_CTL is a 64-bit-wide register that together with Mem_Timing2_CTL controls the functionality of the memory controller in full-speed mode and Energy Star 1/2 mode (except for sdram_ctl_dly and sdram_clk_dly fields; see Mem_Timing3_CTL).

The Mem_Timing1_CTL register is illustrated in FIGURE U-3 and described in TABLE U-3.

sdram_ctl_dly <63:60> | sdram_clk_dly <59:57> | R <56> | auto_rfr_cycle <55:49> | rd_wait <48:44> | pc_cycle <43:38> | wr_more_ras_pw <37:32>

rd_more_ras_pw <31:26> | act_wr_dly <25:20> | act_rd_dly <19:14> | bank_present <13:12> | rfr_int <11:3> | set_mode_reg <2> | rfr_enable <1> | prechg_all <0>

FIGURE U-3 Memory Timing1 Control Register (Mem_Timing1_CTL)

Note – At power-on, all fields, except prechg_all, rfr_enable, and set_mode_reg, come up unknown. prechg_all, rfr_enable, and set_mode_reg are reset to 0 at power-on.
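The register layout in FIGURE U-3 can be captured as a table of bit ranges, as in the Python sketch below. The field names and positions come from the figure; the `FIELDS` table and the `decode_mem_timing1` helper are illustrative assumptions and say nothing about how the register is actually read.

```python
# Field name -> (high bit, low bit), following FIGURE U-3. Bit 56 is reserved.
FIELDS = {
    "sdram_ctl_dly": (63, 60),
    "sdram_clk_dly": (59, 57),
    "auto_rfr_cycle": (55, 49),
    "rd_wait": (48, 44),
    "pc_cycle": (43, 38),
    "wr_more_ras_pw": (37, 32),
    "rd_more_ras_pw": (31, 26),
    "act_wr_dly": (25, 20),
    "act_rd_dly": (19, 14),
    "bank_present": (13, 12),
    "rfr_int": (11, 3),
    "set_mode_reg": (2, 2),
    "rfr_enable": (1, 1),
    "prechg_all": (0, 0),
}


def decode_mem_timing1(reg):
    """Split a 64-bit Mem_Timing1_CTL value into its named fields."""
    return {
        name: (reg >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in FIELDS.items()
    }
```

Together with the reserved bit 56, the listed ranges cover all 64 bits of the register exactly once.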

TABLE U-3 Mem_Timing1_CTL Register Description (1 of 4)

Bit    Field
    Description

63:60  sdram_ctl_dly<3:0>
    Used in full-speed mode only. Controls the delay between the system clock at processor input and the SDRAM control/address outputs driven by the processor. The delay is required to align the SDRAM control signals sourced by the MCU with the SDRAM clock at the DIMMs.
    Note: The maximum setting for this parameter is system_clk*2 - 1; for example, for cpu:system clock ratio of 4:1, the maximum value for this parameter is 7.
    FIGURE U-4 on page 301 shows SDRAM clock and control skews. sdram_clk_dly is set to provide a centered CDS data latch point for SDRAM reads. sdram_ctl_dly is then set to provide proper address and control setup and hold at the SDRAM.

59:57  sdram_clk_dly<2:0>
    Used in full-speed mode only. Controls the delay between system clock at processor input and the SDRAM clock output driven by the processor. This delay is required to align the memory read data from the DIMM with the latch point for the system data switch.

56     -
    Reserved.

55:49  auto_rfr_cycle<6:0>
    Along with parameter cmd_pw in Mem_Timing2_CTL, controls the autorefresh (cas before ras refresh) row cycle time. No command will be issued to the memory bank to which the AUTO_RFR command was issued until after (cmd_pw + 1 + auto_rfr_cycle) cpu clock cycles.


48:44  rd_wait<4:0>
    Controls the number of wait states during a read cycle. To have 0 wait state, set this field to 0. To have X SDRAM cycles of wait state, where X is 1 or 2, set this field to (cmd_pw + (X − 1) * number_of_cpu_cycles/sdram_cycle).
    In general, the processor can generate SDRAM read and write cycles with wait states for slow memory topologies. The processor does that by using the SDRAM clock enable signal for read cycles and by delaying the write data command for write cycles. Clock suspends are programmed in cpu clock cycles and conform to the programmed control signal phase shift. The following table shows the number of cpu clocks for which the clock enable signal is false.

    Read Wait States          Ratio 4 rd_wait<4:0>   Ratio 5 rd_wait<4:0>   Ratio 6 rd_wait<4:0>
    rd_wait (1 wait state)    00111                  01001                  01011
    rd_wait (2 wait states)   01111                  10011                  10111
    rd_wait (3 wait states)   11111                  NA                     NA

    For writes, the parameter that is varied to insert wait states is act_wr_dly<6:0>, also in Tim1 and Tim3. The SDRAM clock edge that latches the write command also latches the data. If the data are slow, then the write command is delayed by the appropriate number of SDRAM clocks. The method for programming write wait states is to add the existing act_wr_dly<6:0> parameter to a value equal to the number of cpu clocks in an SDRAM clock. This value is ratio dependent. Adding wait states to a CSR set is an involved process that requires reprogramming most fields to account for the extra SDRAM clock(s) between SDRAM cycles.

43:38  pc_cycle<5:0>
    Controls the row precharge time for both read and write cycles.

37:32  wr_more_ras_pw<5:0>
    Along with parameters act_wr_dly and cmd_pw, controls the interval for which the row stays active during a write cycle.

31:26  rd_more_ras_pw<5:0>
    Along with parameters act_rd_dly, cmd_pw, and rd_wait, controls the interval for which the row stays active during a read cycle.

25:20  act_wr_dly<5:0>
    Along with parameter cmd_pw in Mem_Timing2_CTL, controls the delay between ACTIVATE command and WRITE command. (cmd_pw + 1 + act_wr_dly) cpu clock cycles after the ACTIVATE command is issued, the WRITE command is issued.

19:14  act_rd_dly<5:0>
    Along with parameter cmd_pw in Mem_Timing2_CTL, controls the delay between ACTIVATE command and READ command. (cmd_pw + 1 + act_rd_dly) cpu clock cycles after the ACTIVATE command is issued, the READ command is issued. cmd_pw is the parameter that controls pulse width of SDRAM control signals, as defined further in Memory Timing Control II (Mem_Timing2_CTL) on page 303.

TABLE U-3 Mem_Timing1_CTL Register Description (3 of 4)

Bit | Field | Description
13:12 | bank_present<1:0> | Specifies the number of memory banks in the system. These bits must be set by software after decoding the DIMM SPD PROM. The memory controller uses this parameter to decide whether to perform the power-on initialization sequence and the regular autorefresh cycles to memory bank pairs. When the memory controller is disabled for the particular bank (by setting mem_addr_dec.valid = 0), the content of this field is ignored. The allowed encodings are: 2’b00: Banks 0 and 2 present (only even banks are present). 2’b01: Banks 0, 1, 2, and 3 present (all four banks are present). 2’b10: Banks 1 and 3 present (only odd banks are present). 2’b11: Reserved. Note: Refresh is always done in bank pairs. If single-bank DIMMs are installed, the configuration for the bank pair is used.
11:3 | rfr_int<8:0> | Specifies the time interval between row refreshes. During normal mode (full clock rate), this parameter is specified in quanta of 32 cpu cycles; see the description of Mem_Timing3_CTL for Energy Star 1/2 and 1/32 modes. The deduction by Rd/Wr_latency (specified in quanta of 32, or 1 cpu clock cycle) is to account for the case where a refresh to a bank is needed immediately after a read or write operation is started on that bank.

The full-speed refresh rate formula for one-physical-bank systems is:

\[ \text{rfr\_int} = \text{Lower Bound}\left( \frac{\text{Refresh Period} / \text{number of rows}}{\text{processor clk period} \times 32} - \text{Rd/Wr\_latency} \right) \]

The full-speed refresh rate formula for two-physical-bank systems is:

\[ \text{rfr\_int} = \text{Lower Bound}\left( \frac{\text{Refresh Period} / \text{number of rows}}{\text{processor clk period} \times 64} - \text{Rd/Wr\_latency} \right) \]

FIGURE U-5 on page 302 shows fixed-bank-pair timing for a four-bank refresh.

2 | set_mode_reg | The positive edge transition on this signal triggers the logic inside the MCU to send the MODE_REG_SET command to the present memory banks. Note: This field should not be set to 1 until all other fields of the Memory Control Registers are programmed with correct values; otherwise, the memory controller state machines might go into an unknown state, possibly causing a hang condition.
1 | rfr_enable | 1 = enable refresh; 0 = disable refresh. Power-on reset is the only reset condition that clears the rfr_enable bit. The rfr_enable bit can be written at any time. If a refresh operation is in progress when this bit is cleared, it will continue to completion. Note: This field should not be set to 1 until all other fields of the Memory Control Registers are programmed with correct values; otherwise, the memory controller state machines might go into an unknown state, possibly causing a hang condition.

TABLE U-3 Mem_Timing1_CTL Register Description (4 of 4)

Bit | Field | Description
0 | prechg_all | The positive edge transition on this signal triggers the logic inside the MCU to send the PRECHARGE_ALL command to the present memory banks. Note: This field should not be set to 1 until all other fields of the Memory Control Registers are programmed with correct values; otherwise, the memory controller state machines might go into an unknown state, possibly causing a hang condition.
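The act_wr_dly and act_rd_dly descriptions in TABLE U-3 give the ACTIVATE-to-command delay as (cmd_pw + 1 + delay field) cpu clock cycles. The Python sketch below simply evaluates that expression for the preliminary act_wr_dly settings listed for each clock ratio. The function and table names are hypothetical, and pairing each ratio with its cmd_pw value follows the valid settings given in the cmd_pw field description.

```python
# cmd_pw valid settings per ratio, and the preliminary act_wr_dly settings
# quoted in "Settings of Mem_Timing1_CTL" (600 MHz processor, 125 MHz DIMM).
CMD_PW = {4: 0b0111, 5: 0b1001, 6: 0b1011}
ACT_WR_DLY = {4: 0x08, 5: 0x0A, 6: 0x0C}


def activate_to_command_cycles(cmd_pw: int, act_dly: int) -> int:
    """cpu clock cycles from ACTIVATE to the following READ or WRITE command."""
    return cmd_pw + 1 + act_dly


for ratio in (4, 5, 6):
    cycles = activate_to_command_cycles(CMD_PW[ratio], ACT_WR_DLY[ratio])
    print(f"ratio {ratio}: {cycles} cpu clock cycles")
```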

Timing Diagrams for Mem_Timing1_CTL Fields

FIGURE U-4 illustrates examples of clock and control skews for sdram_ctl_dly<3:0> (see TABLE U-3 on page 298).

[Timing diagram: CPU clock and Fireplane clock (processor and Fireplane are fixed). SDRAM controls can be skewed by sdram_ctl_dly from 0 to 7, 9, or 11 for ratios 4, 5, and 6 (control clock example: skew by one). The SDRAM clock can be skewed by sdram_clk_dly from 0 to 3, 4, or 5 for ratios 4, 5, and 6 (SDRAM clock example: skew by two).]

FIGURE U-4 SDRAM Clock and Control Skews

FIGURE U-5 illustrates timing for rfr_int<8:0> (see TABLE U-3).

[Timing diagram, SDRAM ratio 4:1: Fireplane clock; refresh commands on CS_bank_0 and CS_bank_2 with RAS_even and CAS_even; refresh commands on CS_bank_1 and CS_bank_3 with RAS_odd and CAS_odd. Marked intervals: auto_rfr_cycle, rfr_int, and a read or write to the same bank.]

FIGURE U-5 Four-Bank Refresh Diagram Showing Fixed-bank-pair Timing

Settings of Mem_Timing1_CTL

Below are specified the preliminary settings for all the fields of Mem_Timing1_CTL as a function of DIMM grade, processor, and system frequency. The final settings for these fields will be specified by the analog engineer responsible for system board design and will be based on the Spice analysis of the memory subsystem.

■ 125 MHz DIMM, 600 MHz processor, 150 MHz system (system_processor_clkr = 4): {4’h0, 4’h0, 7’h25, 5’h0, 6’h0d, 6’h07, 6’h07, 6’h08, 6’h08, 2’b01, 9’h046, 1’b0, 1’b1, 1’b0}

■ 125 MHz DIMM, 600 MHz processor, 120 MHz system (system_processor_clkr = 5): {4’h0, 4’h0, 7’h32, 5’h0, 6’h11, 6’h09, 6’h09, 6’h0a, 6’h0a, 2’b01, 9’h046, 1’b0, 1’b1, 1’b0}

■ 125 MHz DIMM, 600 MHz processor, 100 MHz system (system_processor_clkr = 6): {4’h0, 4’h0, 7’h3c, 5’h0, 6’h15, 6’h0b, 6’h0b, 6’h0c, 6’h0c, 2’b01, 9’h046, 1’b0, 1’b1, 1’b0}
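The settings above are written as concatenations of sized constants, most significant field first. As an illustration, the following Python sketch packs the ratio-4 list into a single 64-bit register value. The helper name pack_fields is hypothetical. The sketch also assumes that the second 4-bit item covers sdram_clk_dly<2:0> together with the reserved bit below it, since the list contains 14 items whose widths sum to 64.

```python
def pack_fields(fields):
    """Concatenate (width, value) pairs, most significant field first."""
    word = 0
    total_width = 0
    for width, value in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value:#x} does not fit in {width} bits")
        word = (word << width) | value
        total_width += width
    return word, total_width


# Ratio-4 preliminary setting (125 MHz DIMM, 600 MHz processor, 150 MHz system),
# copied from the list above in the same order.
RATIO4_FIELDS = [
    (4, 0x0), (4, 0x0), (7, 0x25), (5, 0x0), (6, 0x0D), (6, 0x07), (6, 0x07),
    (6, 0x08), (6, 0x08), (2, 0b01), (9, 0x046), (1, 0), (1, 1), (1, 0),
]

word, width = pack_fields(RATIO4_FIELDS)
print(f"{width} bits: {word:#018x}")
```

Checking that the widths sum to 64 before writing the register is a cheap guard against a dropped or mis-sized field.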

U.2.2 Memory Timing Control II (Mem_Timing2_CTL)

PA = FIREPLANE_ADDRESS_REG + 400008₁₆
ASI 72₁₆, VA = 8₁₆

Mem_Timing2_CTL is a 64-bit-wide register that, together with Mem_Timing1_CTL, controls the functionality of the Memory Controller. All fields except wr_msel_dly and rd_msel_dly are used in both full- and half-speed modes. The values of wr_msel_dly and rd_msel_dly used during half-speed mode are specified in the Mem_Addr_CTL register.

FIGURE U-6 shows the Mem_Timing2_CTL fields; the fields are described in TABLE U-4.

63:58 wr_msel_dly | 57:52 rd_msel_dly | 51:50 wrdata_thld | 49:44 rdwr_rd_ti_dly | 43 auto_prechg_enbl | 42:38 rd_wr_pi_more_dly
37:32 rd_wr_ti_dly | 31:27 wr_wr_pi_more_dly | 26:21 wr_wr_ti_dly | 20:16 rdwr_rd_pi_more_dly | 15 R | 14:0 sdram_mode_reg_data

FIGURE U-6 Memory Timing2 Control Register (Mem_Timing2_CTL)

Note – At power-on, all fields of Mem_Timing2_CTL come up in unknown state.

TABLE U-4 Memory Timing2 Control Register Description

Bit | Field | Description
63:58 | wr_msel_dly<5:0> | Used in full-speed mode only. Controls the number of delay cycles after the ACTIVATE command is issued before the data switch’s msel command can be issued to drive data onto the memory data bus for a memory write cycle. A setting of 0 for this parameter is valid. rd_msel_dly and wr_msel_dly are set to provide correct latch points for read data and CDS data drive points for write data, as shown in FIGURE U-7 on page 305.
57:52 | rd_msel_dly<5:0> | Used in full-speed mode only. Controls the number of delay cycles after the ACTIVATE command is issued before the data switch’s msel command can be issued to register the memory read data into the data switch. A setting of 0 for this parameter is valid. See FIGURE U-7 on page 305.
51:50 | wrdata_thld<1:0> | Specifies the memory write data hold time, in number of system clock cycles. The encoding is as follows: 2’b00: hold time = 1 system clock cycle. 2’b01: hold time = 2 system clock cycles. 2’b10: hold time = 3 system clock cycles. 2’b11: hold time = 4 system clock cycles.

TABLE U-4 Memory Timing2 Control Register Description (Continued)

Bit | Field | Description
49:44 | rdwr_rd_ti_dly<5:0> | Controls the delay between a read or write transaction and a following read transaction to the opposite DIMM group of the first read/write. The delay ensures contention-free operation on the common data bus and address bus.
43 | auto_prechg_enbl | Specifies SDRAM read/write access mode. When this bit is set to 1, all read and write operations sent out by the MCU have auto-precharge on (ADDRESS[10] = 1’b1). When the bit is set to 0, read and write accesses are done with auto-precharge off (ADDRESS[10] = 1’b0).
42:38 | rd_wr_pi_more_dly<4:0> | Along with parameter rd_wr_ti_dly, controls the delay between a read transaction and a following write transaction to the other logical bank of the same DIMM group as the read. The delay ensures contention-free operation on the common data bus, address bus, cas, and we. A setting of 0 for this parameter is valid.
37:32 | rd_wr_ti_dly<5:0> | Controls the delay between a read transaction and a following write transaction to the opposite DIMM group of the read. The delay ensures contention-free operation on the common data bus and address bus.
31:27 | wr_wr_pi_more_dly<4:0> | Along with wr_wr_ti_dly, controls the delay between two write transactions to the two logical banks of the same DIMM group. The delay ensures contention-free operation on the common data bus, address bus, cas, and we. A setting of 0 for this parameter is valid.
26:21 | wr_wr_ti_dly<5:0> | Controls the delay between two write transactions to two different DIMM groups. The delay ensures contention-free operation on the common data bus and address bus.
20:16 | rdwr_rd_pi_more_dly<4:0> | Along with rdwr_rd_ti_dly, controls the delay between a read or write transaction and a following read transaction to the other logical bank of the same DIMM group as the first read/write. The delay ensures contention-free operation on the common data bus, address bus, cas, and we.
15 | - | Reserved.
14:0 | sdram_mode_reg_data<14:0> | Specifies the address line value during the SDRAM mode-register-set command cycle, which is issued only after system power-up. The register bit numbers align with the SDRAM address bits at the processor. Note: The specific SDRAM vendor specification will have mode register bit definitions. This is a JEDEC standard. All undefined bits should be set to 0. Note: For UltraSPARC III, bits 0, 1, 2, and 3 must be set to 0. These bits set SDRAM burst length = 1 and SDRAM burst type = sequential. sdram_mode_reg_data<6:4> specifies the CAS latency. The encoding is as follows: 3’b000: CAS latency of 1. 3’b010: CAS latency of 2. 3’b011: CAS latency of 3. Others: Reserved. Note: The Memory Controller is only tested to support a CAS latency of 2.

Timing Diagram for Mem_Timing2_CTL Fields

As described in TABLE U-4 on page 303, fields rd_msel_dly and wr_msel_dly are set to provide correct latch points for read and write data. FIGURE U-7 illustrates the control points.

[Timing diagram: Ch_clk and SDRAM_clk (one CPU clk skew); SDRAM_DATA, showing hi-Z, SDRAM data delay, and data-valid intervals; CDS_in and CDS_out with their data path delays, CPMS clk Q, and CPMS clk to hi-Z; system bus. Latch and drive points are marked.]

FIGURE U-7 CDS Data Control Points

Settings of Mem_Timing2_CTL

Below are specified the preliminary settings for all the fields of Mem_Timing2_CTL as a function of DIMM grade, processor, and system frequency. The final settings for these fields will be specified by the analog engineer responsible for system board design and will be based on the Spice analysis of the memory subsystem.

■ 125 MHz DIMM, 600 MHz processor, 150 MHz system (system_processor_clkr = 4): {6’h08, 6’h17, 2’b10, 6’h0b, 1’b1, 5’h0, 6’h1b, 5’h07, 6’h0b, 5’h07, 1’b0, 15’h000}

■ 125 MHz DIMM, 600 MHz processor, 120 MHz system (system_processor_clkr = 5): {6’h0a, 6’h18, 2’b10, 6’h0f, 1’b1, 5’h0, 6’h13, 5’h09, 6’h0f, 5’h09, 1’b0, 15’h000}

■ 125 MHz DIMM, 600 MHz processor, 100 MHz system (system_processor_clkr = 6): {6’h0c, 6’h1d, 2’b10, 6’h13, 1’b1, 5’h0, 6’h1b, 5’h0b, 6’h13, 5’h0b, 1’b0, 15’h000}

U.2.3 Memory Timing Control III (Mem_Timing3_CTL)

PA = FIREPLANE_ADDRESS_REG + 400038₁₆
ASI 72₁₆, VA = 38₁₆

Mem_Timing3_CTL and Mem_Timing4_CTL are 64-bit-wide registers that together control the functionality of the memory controller during 1/32 speed mode. The register contents automatically replace those used during Energy Star half-mode. Note that phase lock loop stability requires sequencing through half-mode between full speed and 1/32 mode. Also note that for half-mode, only half_mode_wr_msel_dly, half_mode_sdram_clk_dly, half_mode_rd_msel_dly, and half_mode_sdram_ctl_dly replace the contents of the full-speed registers Mem_Timing1_CTL and Mem_Timing2_CTL. Refer to the Energy Star transitions document for a full description of the sequences.

Note – At power-on, all fields except prechg_all, rfr_enable, and set_mode_reg come up unknown. prechg_all, rfr_enable, and set_mode_reg are reset to 0 at power-on.

FIGURE U-8 illustrates the Mem_Timing3_CTL register. Definitions for the fields are similar to those in Mem_Timing1_CTL except for rfr_int, which is described below. Note that all bits in this register must be set properly if Energy Star mode is used.

63:60 sdram_ctl_dly | 59:57 sdram_clk_dly | 56 R | 55:49 auto_rfr_cycle | 48:44 rd_wait | 43:38 pc_cycle | 37:32 wr_more_ras_pw
31:26 rd_more_ras_pw | 25:20 act_wr_dly | 19:14 act_rd_dly | 13:12 bank_present | 11:3 rfr_int | 2 set_mode_reg | 1 rfr_enable | 0 prechg_all

FIGURE U-8 Memory Timing3 Control Register (Mem_Timing3_CTL)

rfr_int Field of Mem_Timing3_CTL Register

rfr_int<8:0> (Mem_Timing3_CTL<11:3>) specifies the time interval between row refreshes. In Energy Star 1/32 mode, the field sets the refresh rate. For Energy Star 1/32 mode, the parameter is specified in quanta of processor cycles. The divisor is 1 for this mode. This rate and divisor are automatically selected when Energy Star 1/32 mode is entered. Note that during Energy Star half-speed mode, the refresh rate is automatically adjusted by the divisor being changed to 16; however, the rate is specified in Mem_Timing4_CTL, as is full-speed mode. The deduction by Rd/Wr_latency (specified in quanta of 1 processor clock cycle) is to account for the case where a refresh to a bank is needed immediately after a read or write operation is started on that bank.

Note – Set this parameter carefully because refresh cycles use a significant amount of the memory bandwidth in Energy Star 1/32 mode.

FIGURE U-9 and FIGURE U-10 provide the formulas for full-speed refresh rates.

\[ \text{rfr\_int} = \text{Lower Bound}\left( \frac{\text{Refresh Period} / \text{number of rows}}{\text{processor clk period}} - \text{Rd/Wr\_latency} \right) \]

FIGURE U-9 Full-speed Refresh Rate Formula for One Physical Bank System

\[ \text{rfr\_int} = \text{Lower Bound}\left( \frac{\text{Refresh Period} / \text{number of rows}}{\text{processor clk period} \times 2} - \text{Rd/Wr\_latency} \right) \]

FIGURE U-10 Full-speed Refresh Rate Formula for Two Physical Bank Systems
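The two formulas differ only in the factor applied to the processor clock period, so they can be evaluated with one routine. The Python sketch below is a minimal illustration, not a definitive procedure. It reads "Lower Bound" as rounding down (an assumption), takes the quantum as a parameter (1 here, and 32 for the full-speed case described for Mem_Timing1_CTL), and uses a 64 ms refresh period with 4096 rows purely as hypothetical example values. The function name rfr_int and its parameters are illustrative.

```python
from fractions import Fraction
from math import floor


def rfr_int(refresh_period_s, rows, cpu_clk_hz, rd_wr_latency,
            quantum=1, physical_banks=1):
    """Evaluate the refresh-interval formula and check it fits rfr_int<8:0>."""
    if physical_banks not in (1, 2):
        raise ValueError("formulas are given for one or two physical banks")
    per_row = Fraction(refresh_period_s) / rows        # seconds per row refresh
    clk_period = Fraction(1, cpu_clk_hz)                # processor clk period
    # ASSUMPTION: "Lower Bound" means rounding down to an integer.
    value = floor(per_row / (clk_period * quantum * physical_banks)
                  - rd_wr_latency)
    if not 0 <= value <= 0x1FF:
        raise ValueError("result does not fit in the 9-bit rfr_int field")
    return value


# Hypothetical DIMM parameters: 64 ms refresh period, 4096 rows.
period = Fraction(64, 1000)
print(rfr_int(period, 4096, 18_750_000, 0))                    # 1/32 mode, one bank
print(rfr_int(period, 4096, 18_750_000, 0, physical_banks=2))  # 1/32 mode, two banks
print(rfr_int(period, 4096, 600_000_000, 3, quantum=32))       # full speed, one bank
```

Exact rational arithmetic is used so that the rounding step is not affected by floating-point error.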

Settings of Mem_Timing3_CTL in Energy Star 1/32 Mode

Below are specified the preliminary settings for all the fields of Mem_Timing3_CTL as a function of DIMM grade, processor, and system frequency. The final settings for these fields will be specified by the analog engineer responsible for system board design and will be based on the Spice analysis of the memory subsystem.

■ 125 MHz DIMM, 18.75 MHz processor, 150 MHz system (system_processor_clkr =4) {4’h0, 4’h0, 1’h0, 7’h03, 5’h0, 6’h03, 6’h05, 6’h05, 6’h00, 6’h00, 2’b01, 9’h03c, 1’b0, 1’b1, 1’b0}

■ 125 MHz DIMM, 18.75 MHz processor, 120 MHz system (system_processor_clkr =5) {4’h0, 4’h0, 1’h0, 7’h04, 5’h0, 6’h05, 6’h07, 6’h07, 6’h00, 6’h00, 2’b01, 9’h04b, 1’b0, 1’b1, 1’b0}

■ 125 MHz DIMM, 18.75 MHz processor, 100 MHz system (system_processor_clkr =6) {4’h0, 4’h0, 1’h0, 7’h05, 5’h0, 6’h05, 6’h09, 6’h09, 6’h00, 6’h00, 2’b01, 9’h05a, 1’b0, 1’b1, 1’b0}

U.2.4 Memory Timing Control IV (Mem_Timing4_CTL)

PA = FIREPLANE_ADDRESS_REG + 400040₁₆
ASI 72₁₆, VA = 40₁₆

Energy Star 1/32 mode timing control register. See the description of Mem_Timing3_CTL on page 306.

Note – At power-on, all fields come up unknown.

FIGURE U-11 illustrates the Mem_Timing4_CTL register. Definitions for the fields are identical to those in Mem_Timing2_CTL. Note that all bits in this register must be set properly if Energy Star mode is used.

63:58 wr_msel_dly | 57:52 rd_msel_dly | 51:50 wrdata_thld | 49:44 rdwr_rd_ti_dly | 43 auto_prechg_enbl | 42:38 rd_wr_pi_more_dly
37:32 rd_wr_ti_dly | 31:27 wr_wr_pi_more_dly | 26:21 wr_wr_ti_dly | 20:16 rdwr_rd_pi_more_dly | 15 R | 14:0 sdram_mode_reg_data

FIGURE U-11 Memory Timing4 Control Register (Mem_Timing4_CTL)

Settings of Mem_Timing4_CTL in Energy Star 1/32 Mode

Below are specified the preliminary settings for all the fields of Mem_Timing4_CTL as a function of DIMM grade, processor, and system frequency. The final settings for these fields will be specified by the analog engineer responsible for system board design and will be based on the Spice analysis of the memory subsystem.

■ 125 MHz DIMM, 18.75 MHz processor, 150 MHz system (system_processor_clkr =4) {6’h00, 6’h03, 2’b10, 6’h0b, 1’b1, 5’h0, 6’h1b, 5’h07, 6’h0b, 5’h07, 1’b0, 15’h000}

■ 125 MHz DIMM, 18.75 MHz processor, 120 MHz system (system_processor_clkr =5) {6’h00, 6’h04, 2’b10, 6’h0f, 1’b1, 5’h0, 6’h23, 5’h09, 6’h0f, 5’h09, 1’b0, 15’h000}

■ 125 MHz DIMM, 18.75 MHz processor, 100 MHz system (system_processor_clkr =6) {6’h00, 6’h05, 2’b10, 6’h13, 1’b1, 5’h0, 6’h2b, 5’h0b, 6’h13, 5’h0b, 1’b0, 15’h000}

U.2.5 Memory Address Decoding Registers

PA = FIREPLANE_ADDRESS_REG + 400010₁₆–400028₁₆
ASI 72₁₆, VA = 10₁₆–28₁₆

There are four memory address decoding registers, one for each logical bank. The format of the Memory Address Decoding Register is shown in FIGURE U-12; the fields are described in TABLE U-5.

63 valid | 62:53 Reserved | 52:41 UK | 40:37 Reserved | 36:20 UM | 19:18 Reserved | 17:14 LK | 13:12 Reserved | 11:8 LM | 7:0 Reserved

FIGURE U-12 Memory Address Decoding Register (mem_addr_dec)

TABLE U-5 Memory Address Decoding Register Fields

Bit | Field | Description
63 | valid | Valid bit. When this bit is set to 0, the memory bank is disabled.
62:53, 40:37, 19:18, 13:12, 7:0 | - | Reserved.
52:41 | UK<11:0> | Upper mask field to mask match for physical address <37:26>. Note: UK<10> is always set to zero (no mask A36) because the CAS address generation logic uses A35:A26 only to generate the CAS address. Using A36 for the CAS address would require not using A36 for the SDRAM bank selects and would eliminate small (64 Mbit) SDRAM from the supported devices.
36:20 | UM<16:0> | Upper match field to match physical address <42:26>.
17:14 | LK<3:0> | Lower mask field to mask match for physical address <9:6>.
11:8 | LM<3:0> | Lower match field to match physical address <9:6>.

Note – The bit fields are wider than the parameters. Always set undefined bits (reserved for future architectures) to 0.

Address Decoding Logic

The address decoding logic decodes the different fields in the Memory Address Decoding Register to determine whether a particular logical bank should respond to a physical address, as illustrated in FIGURE U-13.

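[Diagram: the physical address is divided into bits <42:26>, <25:10>, <9:6>, and <5:0>. PA<42:26> (17 bits) is compared with UM<16:0>; UK<11:0> masks the comparison of PA<37:26> (12 bits), and PA<42:38> (5 bits) is not masked. PA<9:6> (4 bits) is compared with LM<3:0>, with LK<3:0> as the mask. The results are combined with the valid bit (V): 1 => Respond, 0 => Ignore.]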

FIGURE U-13 Address Decoding Logic

The matching of upper physical address bits against the bank’s Upper Match (UM<16:0>) field determines whether the physical address falls into the memory segment in which the bank resides. The Upper Mask field (UK<11:0>) provides masks for address matching of the 12 physical address bits <37:26>. A 1 in a UK bit means that matching of the corresponding physical address bit is ignored. The number of address bits to be masked depends on the segment size, as described below.

Notes – UK<10> is reserved and always set to 0.

UK<11> is a special mask for some SSM systems and is not set for SDRAM bank size definition. Refer to the Sun Fireplane Interconnect specification for a full description.

UK<1:0> are always set to 0 because 16-Mbit SDRAM is not supported by this controller. The reason is that DRAM A11 is a CAS address and not a bank select bit.

When interleaving on LSB, the address space spanned by all banks that are interleaved together is called a segment. Segment size is a function of interleaving factor and the bank size.

The minimum segment size is achieved when the interleaving factor is 1 and a 64-Mbyte single-bank DIMM is used (smallest DIMM supported, comprising nine 64-Mbit SDRAM devices), which equals 64 Mbytes per DIMM x 4 DIMMs per bank x 1-way = 256 Mbytes (28 address bits). In this case, the remaining upper address bits <42:28> must match against the UM field; hence, the UK field will be set to 12’h003.

The maximum segment size is achieved when the interleaving factor is 16 and a 2-Gbyte DIMM is used (largest DIMM supported, comprising seventy-two 64-Mbyte x 4 SDRAM devices), which equals 1 Gbyte per DIMM x 4 DIMMs per bank x 16-way = 64 Gbytes (36 address bits). In this case, only physical address bits <42:36> have to match against the UM field; hence, the UK field will be set to 12’h3ff. So, to support interleaving factors of 1, 2, 4, 8, and 16 and bank sizes from 64 Mbytes to 1 Gbyte, the UK field must be 10 bits wide.

The Upper Match field is determined by the base address of the segment. All the unmasked address bits in address <42:26> must be specified. The base address of each memory segment can be chosen to be anywhere as long as it aligns to a segment boundary. For example, if two banks of 64-Mbyte SIMMs are interleaved together, the segment size will be 64 Mbytes per SIMM x 4 SIMMs per bank x 2-way = 512 Mbytes (29 address bits). The address range for this segment can be 20000000₁₆–3FFFFFFF₁₆. The Upper Mask will be 12’h007 (only matching physical address <42:29>). The Upper Match will be 17’h00008.
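The decoding rule can be expressed as two masked comparisons gated by the valid bit. The following Python sketch is one reading of FIGURE U-13 and the surrounding text, not a description of the actual logic implementation. The function name bank_responds is hypothetical, and the example uses the 512-Mbyte segment above with the 2-way LK/LM settings.

```python
def bank_responds(pa, valid, um, uk, lm, lk):
    """Return True if a logical bank should respond to physical address pa."""
    if not valid:
        return False
    upper_pa = (pa >> 26) & 0x1FFFF      # PA<42:26>, compared with UM<16:0>
    lower_pa = (pa >> 6) & 0xF           # PA<9:6>, compared with LM<3:0>
    # A 1 in UK or LK means "ignore this bit". UK<11:0> covers PA<37:26>;
    # PA<42:38> has no mask bits and is always compared.
    upper_care = 0x1FFFF & ~(uk & 0xFFF)
    lower_care = 0xF & ~(lk & 0xF)
    upper_ok = ((upper_pa ^ um) & upper_care) == 0
    lower_ok = ((lower_pa ^ lm) & lower_care) == 0
    return upper_ok and lower_ok


# Segment 0x20000000-0x3FFFFFFF, 2-way interleave: UM = 0x8, UK = 0x007,
# LK = 1110; one bank uses LM = xxx0, the other LM = xxx1.
for pa in (0x2000_0000, 0x2000_0040, 0x3FFF_FFC0, 0x4000_0000):
    even = bank_responds(pa, True, 0x8, 0x007, 0b0000, 0b1110)
    odd = bank_responds(pa, True, 0x8, 0x007, 0b0001, 0b1110)
    print(f"{pa:#011x}: even bank={even}, odd bank={odd}")
```

Within the segment exactly one of the two banks responds to each 64-byte line, and neither responds to an address outside the segment.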

Bank-Size Settings

TABLE U-6 through TABLE U-10 summarize UK settings for various bank sizes.

TABLE U-6 UK Settings for Bank Size = 256 Mbytes (4 x 64-Mbyte DIMMs)

Interleaving Factor | Segment Size | UK<11:0>
1-way | 256 Mbyte | 0000 0000 0011₂
2-way | 512 Mbyte | 0000 0000 0111₂
4-way | 1 Gbyte | 0000 0000 1111₂
8-way | 2 Gbyte | 0000 0001 1111₂
16-way | 4 Gbyte | 0000 0011 1111₂

TABLE U-7 UK Settings for Bank Size = 512 Mbytes (4 x 128-Mbyte DIMMs)

Interleaving Factor | Segment Size | UK<11:0>
1-way | 512 Mbyte | 0000 0000 0111₂
2-way | 1 Gbyte | 0000 0000 1111₂
4-way | 2 Gbyte | 0000 0001 1111₂
8-way | 4 Gbyte | 0000 0011 1111₂
16-way | 8 Gbyte | 0000 0111 1111₂

TABLE U-8 UK Settings for Bank Size = 1 Gbyte (4 x 256-Mbyte DIMMs)

Interleaving Factor | Segment Size | UK<11:0>
1-way | 1 Gbyte | 0000 0000 1111₂
2-way | 2 Gbyte | 0000 0001 1111₂
4-way | 4 Gbyte | 0000 0011 1111₂
8-way | 8 Gbyte | 0000 0111 1111₂
16-way | 16 Gbyte | 0000 1111 1111₂

TABLE U-9 UK Settings for Bank Size = 2 Gbytes (4 x 512-Mbyte DIMMs)

Interleaving Factor | Segment Size | UK<11:0>
1-way | 2 Gbyte | 0000 0001 1111₂
2-way | 4 Gbyte | 0000 0011 1111₂
4-way | 8 Gbyte | 0000 0111 1111₂
8-way | 16 Gbyte | 0000 1111 1111₂
16-way | 32 Gbyte | 0001 1111 1111₂

TABLE U-10 UK Settings for Bank Size = 4 Gbytes (4 x 1 Gbyte DIMMs)

Interleaving Factor | Segment Size | UK<11:0>
1-way | 4 Gbyte | 0000 0011 1111₂
2-way | 8 Gbyte | 0000 0111 1111₂
4-way | 16 Gbyte | 0000 1111 1111₂
8-way | 32 Gbyte | 0001 1111 1111₂
16-way | 64 Gbyte | 0011 1111 1111₂
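Every entry in TABLE U-6 through TABLE U-10 follows one pattern: the UK value masks the address bits from bit 26 up to the top bit of the segment offset. The Python sketch below reproduces the tables from the bank size and interleaving factor. The function name uk_for_segment is a hypothetical helper, and the range checks reflect only the sizes that appear in these tables.

```python
def uk_for_segment(bank_size_bytes: int, interleave_factor: int) -> int:
    """Return UK<11:0> for a segment of bank_size_bytes * interleave_factor."""
    if interleave_factor not in (1, 2, 4, 8, 16):
        raise ValueError("interleaving factor must be 1, 2, 4, 8, or 16")
    segment = bank_size_bytes * interleave_factor
    if segment < (1 << 28) or segment & (segment - 1):
        raise ValueError("segment must be a power of two of at least 256 Mbytes")
    uk = (segment >> 26) - 1        # one mask bit per address bit above A25
    if uk > 0x3FF:
        raise ValueError("segment larger than 64 Gbytes; UK<9:0> exhausted")
    return uk                        # UK<10> and UK<11> are left at 0


MBYTE = 1 << 20
for bank_mbytes in (256, 512, 1024, 2048, 4096):
    row = [format(uk_for_segment(bank_mbytes * MBYTE, f), "012b")
           for f in (1, 2, 4, 8, 16)]
    print(bank_mbytes, row)
```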

The matching of physical address <9:6> against the bank’s Lower Match field determines which lines within a memory segment the bank should respond to. These are the actual address bits over which the interleave is done. The Lower Mask field masks matching of some physical address bits. The number of bits to be masked depends on the interleave factor. The setting of LK and LM, based on the interleaving factor, is shown in TABLE U-11.

TABLE U-11 Interleaving Factors for Setting LK and LM

Interleaving Factor | LK<3:0> | LM<3:0>
1-way | 1111 | xxxx
2-way | 1110 | xxx0, xxx1
4-way | 1100 | xx00, xx01, xx10, xx11
8-way | 1000 | x000, x001, x010, x011, x100, x101, x110, x111
16-way | 0000 | 0000 – 1111
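The LK column of TABLE U-11 is the 4-bit complement of (interleaving factor − 1), and the LM column enumerates one value per interleaved bank. The Python sketch below generates both. The function names are hypothetical, and the don't-care (x) bits of LM are written as 0 here, which is an arbitrary choice.

```python
def lk_for_interleave(factor: int) -> int:
    """Return LK<3:0>: a 1 masks (ignores) the corresponding bit of PA<9:6>."""
    if factor not in (1, 2, 4, 8, 16):
        raise ValueError("interleaving factor must be 1, 2, 4, 8, or 16")
    return ~(factor - 1) & 0xF


def lm_values(factor: int) -> list:
    """Return one LM<3:0> value per bank, with don't-care bits written as 0."""
    lk_for_interleave(factor)        # reuse the validity check
    return list(range(factor))


for factor in (1, 2, 4, 8, 16):
    lk = lk_for_interleave(factor)
    lms = [format(v, "04b") for v in lm_values(factor)]
    print(f"{factor}-way: LK={lk:04b} LM={lms}")
```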

U.2.6 Memory Address Control Register

PA = FIREPLANE_ADDRESS_REG + 400030₁₆
ASI 72₁₆, VA = 30₁₆

Memory Address Control is a 64-bit register that controls the row and column address generation for all four logical banks of memory. Also controlled here are the pulse widths for the address latch enable signal and for the SDRAM control signals, which are used for all speed modes: full, Energy Star half-mode, and 1/32 mode.

This register also controls memory data timing and clock and control skews in Energy Star half-mode. The following bits automatically replace the values in Mem_Timing1_CTL and Mem_Timing2_CTL when Energy Star half-mode is entered:
■ half_mode_wr_msel_dly
■ half_mode_rd_msel_dly
■ half_mode_sdram_ctl_dly
■ half_mode_sdram_clk_dly

The format of the Memory Address Control Register is shown in FIGURE U-14; the fields are described in TABLE U-12.

63 R | 62:60 addr_le_pw | 59:56 cmd_pw
55:50 half_mode_wr_msel_dly | 49:44 half_mode_rd_msel_dly | 43:40 half_mode_sdram_ctl_dly | 39:37 half_mode_sdram_clk_dly
36 R | 35:32 banksel_n_rowaddr_size_b3 | 31:27 enc_intlv_b3 | 26:23 banksel_n_rowaddr_size_b2 | 22:18 enc_intlv_b2
17:14 banksel_n_rowaddr_size_b1 | 13:9 enc_intlv_b1 | 8:5 banksel_n_rowaddr_size_b0 | 4:0 enc_intlv_b0

FIGURE U-14 Memory Address Control Register (Mem_Addr_Cntl)

TABLE U-12 Memory Address Control Register Fields (1 of 3)

Bit | Field | Description
63, 36 | - | Reserved.
62:60 | addr_le_pw<2:0> | Specifies the ADDR_LE_ low pulse width, in number of processor clock cycles. The encoding is as follows: 3’b000: addr_le pulse width is 1 processor clk cycle. 3’b001: addr_le pulse width is 2 processor clk cycles. 3’b010: addr_le pulse width is 3 processor clk cycles. 3’b011: addr_le pulse width is 4 processor clk cycles. 3’b100: addr_le pulse width is 5 processor clk cycles. 3’b101: addr_le pulse width is 6 processor clk cycles. 3’b110: addr_le pulse width is 7 processor clk cycles. 3’b111: addr_le pulse width is 8 processor clk cycles. This field is used for all speed modes: full, Energy Star half-mode, and 1/32 mode.
59:56 | cmd_pw<3:0> | Specifies the width of the CKE, CS, RAS, CAS, and WE signals sourced by the memory controller, in number of processor clock cycles. Valid settings are 0111 for ratio 4, 1001 for ratio 5, and 1011 for ratio 6 systems. This field is used for all speed modes: full, Energy Star half-mode, and 1/32 mode.
55:50 | half_mode_wr_msel_dly<5:0> | Used in half-speed mode only. Controls the number of delay cycles after the ACTIVATE command is issued before the data switch’s msel command can be issued to drive data onto the memory data bus for a memory write cycle. A setting of 0 for this parameter is valid.

TABLE U-12 Memory Address Control Register Fields (2 of 3)

Bit | Field | Description
49:44 | half_mode_rd_msel_dly<5:0> | Used in half-speed mode only. Controls the number of delay cycles after the ACTIVATE command is issued before the data switch’s msel command can be issued to register the memory read data into the data switch. A setting of 0 for this parameter is valid.
43:40 | half_mode_sdram_ctl_dly<3:0> | Used in half-speed mode only. Controls the delay between the system clock at the processor input and the SDRAM control/address outputs driven by the processor. The delay is required to align the SDRAM control signals sourced by the MCU with the SDRAM clock at the DIMMs. The maximum setting for this parameter is system_clk*2 − 1; for example, for a processor:system clock ratio of 4:1, the maximum value for this parameter is 7.
39:37 | half_mode_sdram_clk_dly<2:0> | Used in half-speed mode only. Controls the delay between the system clock at the processor input and the SDRAM clock output driven by the processor. This delay is required to align the memory read data from the DIMM with the latch point for the system data switch. The maximum setting for this parameter is system_clk − 1; for example, for a processor:system clock ratio of 4:1, the maximum value for this parameter is 3.
35:32, 26:23, 17:14, 8:5 | banksel_n_rowaddr_size_b3<3:0>, banksel_n_rowaddr_size_b2<3:0>, banksel_n_rowaddr_size_b1<3:0>, banksel_n_rowaddr_size_b0<3:0> | Specify the number of address bits used for the bank select address and row address of the SDRAM device used to populate memory bank 3, 2, 1, and 0, respectively. The encodings are defined in banksel_n_rowaddr_size Encoding on page 316. Note: Only two settings for banksel_n_rowaddr_size are valid: 1000 and 0100.

TABLE U-12 Memory Address Control Register Fields (3 of 3)

Bit | Field | Description
31:27, 22:18, 13:9, 4:0 | enc_intlv_b3<4:0>, enc_intlv_b2<4:0>, enc_intlv_b1<4:0>, enc_intlv_b0<4:0> | Specify the interleaving factor for memory banks 3, 2, 1, and 0, respectively, as follows:

enc_intlv<4:0> | Interleaving Factor
0000x | 1-way (no interleaving)
0001x | 2-way
001xx | 4-way
01xxx | 8-way
1xxxx | 16-way

Note: The five encodings above are all the valid values for the enc_intlv parameter. Setting this parameter to a value that is not one of the five valid values will result in a hardware malfunction. Interleave can be performed over a segment of memory made up of different-sized banks. For example, two 256-Mbyte banks can be interleaved with one 512-Mbyte bank. In this case, the total segment size is 1 Gbyte, the two 256-Mbyte banks are set up for 4-way interleave, and the 512-Mbyte bank is set up for 2-way interleave.
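The enc_intlv patterns form a priority encoding: the position of the most significant 1 selects the interleaving factor. The Python sketch below decodes a 5-bit value on that basis. It assumes that each x in the encoding table is a don't-care bit, which the table suggests but the text does not state outright. The function name decode_enc_intlv is hypothetical.

```python
def decode_enc_intlv(enc: int) -> int:
    """Return the interleaving factor selected by an enc_intlv<4:0> value."""
    if not 0 <= enc <= 0b11111:
        raise ValueError("enc_intlv is a 5-bit field")
    # ASSUMPTION: bits shown as 'x' in the encoding table are don't-care.
    for bit, factor in ((4, 16), (3, 8), (2, 4), (1, 2)):
        if enc & (1 << bit):
            return factor
    return 1                         # 0000x: no interleaving


for enc in (0b00000, 0b00010, 0b00100, 0b01000, 0b10000):
    print(f"{enc:05b} -> {decode_enc_intlv(enc)}-way")
```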

banksel_n_rowaddr_size Encoding

TABLE U-13 shows the settings of banksel_n_rowaddr_size encoding for several supported SDRAM devices. The table is not an exhaustive list of SDRAM-supported devices. Other possibilities must map into the banksel_n_rowaddr_size table for bank select bits, row address bits, and column address bits.

TABLE U-13 Examples of banksel_n_rowaddr_size_bx Values

banksel_n_rowaddr_size | Total Bank Select and RAS Address Bits | Single Bank DIMM Size in Mbytes | SDRAM Device | # of internal banks
0100 | 14 | 64 (9 parts) | 4 Mb x 16 (64 Mb) | 4
0100 | 14 | 128 (18 parts) | 8 Mb x 8 (64 Mb) | 4
0100 | 14 | 256 (36 parts) | 16 Mb x 4 (64 Mb) | 4
0100 | 14 | 128 (9 parts) | 8 Mb x 16 (128 Mb) | 4
0100 | 14 | 256 (18 parts) | 16 Mb x 8 (128 Mb) | 4
0100 | 14 | 512 (36 parts) | 32 Mb x 4 (128 Mb) | 4
1000 | 15 | 256 (9 parts) | 16 Mb x 16 (256 Mb) | 4
1000 | 15 | 512 (18 parts) | 32 Mb x 8 (256 Mb) | 4

TABLE U-13 Examples of banksel_n_rowaddr_size_bx Values (Continued)

banksel_n_rowaddr_size | Total Bank Select and RAS Address Bits | Single Bank DIMM Size in Mbytes | SDRAM Device | # of internal banks
1000 | 15 | 1024 (36 parts) | 64 Mb x 4 (256 Mb) | 4
1000 | 15 | 512 (9 parts) | 32 Mb x 16 (512 Mb) | 8
1000 | 15 | 1024 (18 parts) | 64 Mb x 8 (512 Mb) | 4

Note – For double external bank DIMMs, multiply the number of devices and the bank capacity by 2. Internal and external banks are not associated in any way.

TABLE U-14 through TABLE U-18 show the physical address mapping to SDRAM Bank select, RAS address, and CAS address for all supported SDRAMs. All SDRAM supported by the memory controller must map into those tables. Because CAS address depends on the bank interleave factor, we show a table for all bank interleave possibilities.

TABLE U-14 banksel_n_rowaddr_size_b3<3:0> Encoding for enc_intlv_b0<4:0> = 1-Way

banksel_n_         Total  SDRAM Bank   SDRAM Device   enc_intlv  Bank Select and Row Address Mapping      Physical Address to CAS Address Mapping
rowaddr_size<3:0>         Select Bits  Row Addr Bits  <4:0>
0100               14     1            13             00001      BS[0] - PA27; RAS<12:0> - PA26:14        CAS<11,9:0> - PA<30:28>, PA<9:6>, PA<13:10>
0100               14     2            12             00001      BS<1:0> - PA27:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<30:28>, PA<9:6>, PA<13:10>
1000               15     1            14             00001      BS[0] - PA28; RAS<13:0> - PA27:14        CAS<11,9:0> - PA<31:29>, PA<9:6>, PA<13:10>
1000               15     2            13             00001      BS<1:0> - PA28:27; RAS<12:0> - PA26:14   CAS<11,9:0> - PA<31:29>, PA<9:6>, PA<13:10>
1000               15     3            12             00001      BS<2:0> - PA28:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<31:29>, PA<9:6>, PA<13:10>

TABLE U-15 banksel_n_rowaddr_size_b3<3:0> Encoding for enc_intlv_b0<4:0> = 2-Way

banksel_n_         Total  SDRAM Bank   SDRAM Device   enc_intlv  Bank Select and Row Address Mapping      Physical Address to CAS Address Mapping
rowaddr_size<3:0>         Select Bits  Row Addr Bits  <4:0>
0100               14     1            13             00010      BS[0] - PA27; RAS<12:0> - PA26:14        CAS<11,9:0> - PA<31:28>, PA<9:7>, PA<13:10>
0100               14     2            12             00010      BS<1:0> - PA27:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<31:28>, PA<9:7>, PA<13:10>
1000               15     1            14             00010      BS[0] - PA28; RAS<13:0> - PA27:14        CAS<11,9:0> - PA<32:29>, PA<9:7>, PA<13:10>
1000               15     2            13             00010      BS<1:0> - PA28:27; RAS<12:0> - PA26:14   CAS<11,9:0> - PA<32:29>, PA<9:7>, PA<13:10>
1000               15     3            12             00010      BS<2:0> - PA28:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<32:29>, PA<9:7>, PA<13:10>

TABLE U-16 banksel_n_rowaddr_size_b3<3:0> Encoding for enc_intlv_b0<4:0> = 4-Way

banksel_n_         Total  SDRAM Bank   SDRAM Device   enc_intlv  Bank Select and Row Address Mapping      Physical Address to CAS Address Mapping
rowaddr_size<3:0>         Select Bits  Row Addr Bits  <4:0>
0100               14     1            13             00100      BS[0] - PA27; RAS<12:0> - PA26:14        CAS<11,9:0> - PA<32:28>, PA<9:8>, PA<13:10>
0100               14     2            12             00100      BS<1:0> - PA27:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<32:28>, PA<9:8>, PA<13:10>
1000               15     1            14             00100      BS[0] - PA28; RAS<13:0> - PA27:14        CAS<11,9:0> - PA<33:29>, PA<9:8>, PA<13:10>
1000               15     2            13             00100      BS<1:0> - PA28:27; RAS<12:0> - PA26:14   CAS<11,9:0> - PA<33:29>, PA<9:8>, PA<13:10>
1000               15     3            12             00100      BS<2:0> - PA28:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<33:29>, PA<9:8>, PA<13:10>

TABLE U-17 banksel_n_rowaddr_size_b3<3:0> Encoding for enc_intlv_b0<4:0> = 8-Way

banksel_n_         Total  SDRAM Bank   SDRAM Device   enc_intlv  Bank Select and Row Address Mapping      Physical Address to CAS Address Mapping
rowaddr_size<3:0>         Select Bits  Row Addr Bits  <4:0>
0100               14     1            13             01000      BS[0] - PA27; RAS<12:0> - PA26:14        CAS<11,9:0> - PA<33:28>, PA[9], PA<13:10>
0100               14     2            12             01000      BS<1:0> - PA27:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<33:28>, PA[9], PA<13:10>
1000               15     1            14             01000      BS[0] - PA28; RAS<13:0> - PA27:14        CAS<11,9:0> - PA<34:29>, PA[9], PA<13:10>
1000               15     2            13             01000      BS<1:0> - PA28:27; RAS<12:0> - PA26:14   CAS<11,9:0> - PA<34:29>, PA[9], PA<13:10>
1000               15     3            12             01000      BS<2:0> - PA28:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<34:29>, PA[9], PA<13:10>

TABLE U-18 banksel_n_rowaddr_size_b3<3:0>Encoding for enc_intlv_b0<4:0> = 16-Way

banksel_n_         Total  SDRAM Bank   SDRAM Device   enc_intlv  Bank Select and Row Address Mapping      Physical Address to CAS Address Mapping
rowaddr_size<3:0>         Select Bits  Row Addr Bits  <4:0>
0100               14     1            13             10000      BS[0] - PA27; RAS<12:0> - PA26:14        CAS<11,9:0> - PA<34:28>, PA<13:10>
0100               14     2            12             10000      BS<1:0> - PA27:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<34:28>, PA<13:10>
1000               15     1            14             10000      BS[0] - PA28; RAS<13:0> - PA27:14        CAS<11,9:0> - PA<35:29>, PA<13:10>
1000               15     2            13             10000      BS<1:0> - PA28:27; RAS<12:0> - PA26:14   CAS<11,9:0> - PA<35:29>, PA<13:10>
1000               15     3            12             10000      BS<2:0> - PA28:26; RAS<11:0> - PA25:14   CAS<11,9:0> - PA<35:29>, PA<13:10>
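The pattern shared by TABLE U-14 through TABLE U-18 can be sketched in a few lines. The following is an illustrative model only, not the controller's logic: the function and parameter names are hypothetical, and the CAS result is returned as the plain 11-bit concatenation of the three listed physical address groups, without modeling how those bits are wired to the CAS<11,9:0> pins.

```python
def pa_bits(pa: int, hi: int, lo: int) -> int:
    """Extract physical address bits PA<hi:lo> as an integer."""
    return (pa >> lo) & ((1 << (hi - lo + 1)) - 1)


def map_address(pa: int, rowaddr_size: int, ways: int, bs_bits: int):
    """Return (bank_select, ras, cas) following the table pattern.

    rowaddr_size: 0b0100 (14 bank-select + row bits) or 0b1000 (15 bits).
    ways: interleave factor (1, 2, 4, 8, or 16).
    bs_bits: number of SDRAM bank select bits for the device.
    """
    total = {0b0100: 14, 0b1000: 15}[rowaddr_size]
    shift = {1: 0, 2: 1, 4: 2, 8: 3, 16: 4}[ways]
    if not 1 <= bs_bits <= (2 if total == 14 else 3):
        raise ValueError("bank select bit count not listed in the tables")

    top = 13 + total                      # PA27 for 0100, PA28 for 1000
    bank_select = pa_bits(pa, top, top - bs_bits + 1)
    ras = pa_bits(pa, top - bs_bits, 14)

    # Upper CAS group starts just above the bank-select/row field and
    # grows by one bit per doubling of the interleave factor.
    upper = pa_bits(pa, top + 3 + shift, top + 1)
    # Middle group PA<9:6+shift> shrinks by the same amount.
    mid_width = 4 - shift
    mid = pa_bits(pa, 9, 6 + shift) if mid_width else 0
    low = pa_bits(pa, 13, 10)

    cas = (upper << (mid_width + 4)) | (mid << 4) | low
    return bank_select, ras, cas
```

For example, with rowaddr_size 0100 and 4-way interleave the sketch draws the CAS value from PA<32:28>, PA<9:8>, and PA<13:10>, matching the first row of TABLE U-16.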

Settings of Mem_Address_Control

Below are specified preliminary settings for all the fields of Mem_Address_CTL as a function of DIMM grade, processor, and system frequency. The final settings for these fields will be specified by the analog engineer responsible for system board design and will be based on the Spice analysis of the memory subsystem, the DIMMs installed in the system as defined by the SPD PROMS, and the desired interleave factors determined by the OBP or POST startup algorithms.

■ 125 MHz DIMM, 600 MHz processor, 150 MHz system (system_processor_clkr = 4)
  {1'b0, 3'b011, 4'h7, 6'h06, 6'h0d, 4'h2, 3'b000, 1'b0, 4'h0, 5'h00, 4'h0, 5'h00, 4'h0, 5'h00, 4'h0, 5'h00}

■ 125 MHz DIMM, 600 MHz processor, 120 MHz system (system_processor_clkr = 5)
  {1'b0, 3'b100, 4'h9, 6'h07, 6'h10, 4'h3, 3'b000, 1'b0, 4'h0, 5'h00, 4'h0, 5'h00, 4'h0, 5'h00, 4'h0, 5'h00}

■ 125 MHz DIMM, 600 MHz processor, 100 MHz system (system_processor_clkr = 6)
  {1'b0, 3'b010, 4'hb, 6'h0c, 6'h11, 4'h0, 3'b000, 1'b0, 4'h0, 5'h00, 4'h0, 5'h00, 4'h0, 5'h00, 4'h0, 5'h00}
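The settings above are written as sized literals in a brace-delimited concatenation, and the sixteen widths add up to 64 bits. The sketch below packs such a list into a single integer, assuming (as in Verilog-style concatenation) that the first listed field is the most significant. The helper name is hypothetical, the literals are typed with ASCII apostrophes, and no field names are attached because this passage does not give them.

```python
import re

_LITERAL = re.compile(r"(\d+)'([bh])([0-9a-fA-F]+)")


def pack_concat(literals):
    """Pack sized literals such as "3'b011" or "6'h0d" MSB-first.

    Returns (value, total_width).
    """
    value = 0
    total_width = 0
    for lit in literals:
        match = _LITERAL.fullmatch(lit)
        if match is None:
            raise ValueError(f"not a sized literal: {lit!r}")
        width = int(match.group(1))
        base = 2 if match.group(2) == "b" else 16
        field = int(match.group(3), base)
        if field >= (1 << width):
            raise ValueError(f"{lit!r} does not fit in {width} bits")
        value = (value << width) | field
        total_width += width
    return value, total_width


# First listed setting: 125 MHz DIMM, 600 MHz processor, 150 MHz system.
SETTING_150MHZ = [
    "1'b0", "3'b011", "4'h7", "6'h06", "6'h0d", "4'h2", "3'b000", "1'b0",
] + ["4'h0", "5'h00"] * 4

if __name__ == "__main__":
    value, width = pack_concat(SETTING_150MHZ)
    print(f"{width} bits: {value:#018x}")
```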

Bank Select/Row and Column Address Generation Logic

The MCU supports up to 15-bit banksel and row addresses (total sum of both) and up to 11-bit column addresses. The banksel and row address is fixed at physical address<28:14>. Four bits of the column address are fixed at physical address<13:10>. The remaining seven bits of the column address are determined by two fields, banksel_n_rowaddr_size and enc_intlv, of the Memory Address Control Register, as illustrated in FIGURE U-15 and explained in more detail later in the text.

Note – This configuration is changed from earlier versions of the Programmer’s Reference Manual. Physical address 27 and 26 are not used for CAS address generation because 16-Mbit SDRAMs are not supported by this controller. Shifts by two and three are never used because PA<27:26> maps to a smaller segment than is supported.

[Figure: block diagram. Banksel_n_Row_Address<14:0> is taken from physical address<28:14>. Physical address<35:28> (see note) is shifted right by 0 or 1 under control of banksel_n_rowaddr_size (labeled 0100 = shift right 1, 1000 = shift right 0) to give a 7-bit intermediate column address: PA<34:28> for 64 Mb and 128 Mb SDRAM, or PA<35:29> for 256 Mb and 512 Mb SDRAM. This value and physical address<9:6> form an 11-bit value that is shifted right by 0, 1, 2, 3, or 4 under control of enc_intlv to give 7 bits, which are combined with physical address<13:10> to form Column_Address<10:0>.]

FIGURE U-15 Row and Column Address Generation

U.3 Memory Initialization

The SDRAM initialization sequence requires a minimum pause of 200 µsec after power-up followed by a PRECHARGE_ALL command and eight AUTO_RFR cycles, then followed by a MODE_REG_SET command to program the SDRAM parameters (burst length, CAS latency, etc.).

A special sequence controlled by software is required before any accesses can be made to memory:
■ The hardware powers up with the DQM signal asserted. This is an SDRAM vendor requirement to avoid internal and external bank data bus contention because of unknowns at power-up.
■ First, the I2C PROMs on the DIMMs are read. The size of the installed memory is used to program the Memory Controller Unit (MCU) address registers, interleave, bank present, refresh interval, and CAS latency CSR fields.
■ The information on the Fireplane clock rate and the processor ratio is used to determine the CSR timing fields. Now, software removes DQM by a write to the BBC.
■ Next, software gives the PRECHARGE_ALL command to the MCU by setting a bit in Timing control 1. Software waits for the precharge to be completed.
■ Then, the mode register set command is given to the MCU by setting another bit in Timing control 1.
■ Next, eight refresh cycles must complete to all banks. Note that the refresh interval can be set to a temporary small value (8) to speed up this action. Software waits for this to be completed.
■ Finally, the correct refresh rate is set in Timing control 1 per the refresh interval formula.

The specific OpenBoot PROM programming sequence follows:

1. Release DQM by writing to the BBC port. Wait TBD time for DQM to reach stable state at all SDRAMs.

2. Set up CSR. Load valid values in
   ASI_MCU_ADDR_CTRL_REG
   ASI_MCU_ADDR_DEC_REG0
   ASI_MCU_ADDR_DEC_REG1
   ASI_MCU_ADDR_DEC_REG2
   ASI_MCU_ADDR_DEC_REG3
   ASI_MCU_TIM1_CTRL_REG
   ASI_MCU_TIM2_CTRL_REG
   ASI_MCU_TIM3_CTRL_REG
   ASI_MCU_TIM4_CTRL_REG
   Note: rfr_int should be set to 8, and rfr_enable should be set to 0 in ASI_MCU_TIM1_CTRL_REG.

Note: In steps 3 through 6, xx-xx refers to the previous contents of the bits.

3. Precharge all. Load ASI_MCU_TIM1_CTRL_REG with xx-xx_000001000_0_0_1. Delay 50 system clocks to allow time for the precharge command to be issued to all SDRAMs.

4. Set mode register

Load ASI_MCU_TIM1_CTRL_REG with xx-xx_000001000_1_0_0

Delay 100 system clocks to allow time for the set mode register command to be issued to all SDRAM.

5. Burst eight refresh cycles to each of four banks. Load ASI_MCU_TIM1_CTRL_REG with xx-xx_000001000_0_1_0. Delay 100 system clocks to allow time for all required refresh cycles to complete at all SDRAMs.

6. Set proper refresh rate from formula. Load ASI_MCU_TIM1_CTRL_REG with xx-xx_{set proper rate}_0_1_0. Delay 50 system clocks to allow new setting to take effect.

SDRAM initialization is now finished, and normal memory accesses are allowed.
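Steps 3 through 6 each rewrite the same low-order pattern of ASI_MCU_TIM1_CTRL_REG while preserving the remaining bits. The sketch below computes the four values and their delays. It rests on several assumptions that this passage does not confirm: that the pattern occupies the least-significant 12 bits, that it is a 9-bit rfr_int followed by three single control bits, and that those bits are, in order, a set-mode-register bit, rfr_enable, and a precharge-all bit. The names set_mode_reg and prechg_all, and the final refresh interval used in the example, are hypothetical.

```python
def tim1_value(previous, rfr_int, set_mode_reg, rfr_enable, prechg_all):
    """Replace the assumed low 12 bits of TIM1, keeping the xx-xx bits."""
    if not 0 <= rfr_int < (1 << 9):
        raise ValueError("rfr_int is assumed to be a 9-bit field")
    low = (rfr_int << 3) | (set_mode_reg << 2) | (rfr_enable << 1) | prechg_all
    return (previous & ~0xFFF) | low


def init_sequence(previous, final_rfr_int):
    """Return (description, register value, delay in system clocks) per step."""
    return [
        ("3. precharge all", tim1_value(previous, 8, 0, 0, 1), 50),
        ("4. set mode register", tim1_value(previous, 8, 1, 0, 0), 100),
        ("5. burst refresh", tim1_value(previous, 8, 0, 1, 0), 100),
        ("6. final refresh rate", tim1_value(previous, final_rfr_int, 0, 1, 0), 50),
    ]


if __name__ == "__main__":
    # 0x30 is an arbitrary placeholder, not a value from the refresh formula.
    for name, value, delay in init_sequence(0xABCDE000, 0x30):
        print(f"{name}: write {value:#x}, then wait {delay} system clocks")
```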

U.4 Energy Star 1/32 Mode

Energy Star 1/32 mode is a special mode that saves system power by slowing the clocks while preserving as much memory performance as possible. The SDRAM is operated with different timing in this mode. The activate cycle (RAS and CS) and the command cycle (CAS and CS) occur in concurrent clock cycles without the 1-clock lag in the full- and half-speed modes. CAS latency is still set at 2 or as programmed for full speed. Refresh cycles use a significant amount of the memory bandwidth in this mode, so program refresh_int carefully.

S.APPENDIX V

Debug and Diagnostics Support

This chapter defines debug and diagnostics support in these sections:
■ Diagnostics Control and Accesses
■ Floating-Point Control
■ Instruction Cache Diagnostic Accesses
■ Branch Predictor Array Accesses
■ Data Cache Diagnostic Accesses
■ External Cache Diagnostics Accesses
■ Write Cache Diagnostic Accesses
■ Integer Unit Design for Test (DFT)

See also Dispatch Control Register (DCR) (ASR 12₁₆) on page 30 and Data Cache Unit Control Register (DCUCR) on page 33.

V.1 Diagnostics Control and Accesses

The diagnostics control and data registers are accessed through RDASR/WRASR or Load/Store Alternate instructions.

All debug and diagnostics accesses are doubleword-aligned, 64-bit accesses. Nonaligned accesses cause a mem_address_not_aligned exception. Accesses must use LDXA/STXA/LDDFA/STDFA instructions. Using another type of load or store causes a data_access_exception exception (with SFSR.FT = 8, illegal ASI size). Attempts to access these registers while in nonprivileged mode cause a data_access_exception exception (with SFSR.FT = 1, privilege violation). User accesses can be accomplished through system calls to these facilities. See I/D Synchronous Fault Status Register (SFSR) in Commonality for SFSR details.

Caution – An STXA to any internal debug or diagnostic register requires a MEMBAR #Sync before another load instruction is executed. Furthermore, the MEMBAR must be executed in or before the delay slot of a delayed control transfer instruction of any type. This requirement is not just to guarantee that the result of the STXA is seen; it is also imposed because the STXA may corrupt the load data if there is no intervening MEMBAR #Sync.

V.2 Floating-Point Control

Two state bits (PSTATE.PEF and FPRS.FEF) in the SPARC V9 architecture provide the means to disable direct floating-point execution. If either field is set to 0, an fp_disabled exception is taken when any floating-point instruction is encountered.

Note – Graphics instructions that use the floating-point register file and instructions that read or update the Graphics Status Register (GSR) are treated as floating-point instructions. They cause an fp_disabled exception if either PSTATE.PEF or FPRS.FEF is zero. See Graphics Status Register (GSR) (ASR 19) in Section 5.2.11 of Commonality for more information.

V.3 Data Cache Unit Control Register (DCUCR)

See Section 5.2.12 in Commonality and Data Cache Unit Control Register (DCUCR) on page 33 in this document for a description of DCUCR.

V.4 Instruction Cache Diagnostic Accesses

Three I-cache diagnostic accesses are supported:
■ I-cache instruction fields
■ I-cache tag/valid fields
■ I-cache snoop tag fields

V.4.1 Instruction Cache Instruction Fields

ASI 66₁₆, VA<63:16> = 0, VA<15:14> = IC_way, VA<13:3> = IC_addr, VA<2:0> = 0

Name: ASI_ICACHE_INSTR

The instruction cache instruction fields are described below and illustrated in FIGURE V-1.

Bits    Field     Description
15:14   IC_way    This 2-bit field selects a way (4-way associative).
13:3    IC_addr   This 11-bit index <13:3> selects a 32-bit instruction and associated predecode bits.

Bits    63:42   41:32          31:0
Field   —       IC_predecode   IC_instr

where: IC_predecode is the 10 predecode bits associated with the instruction field, and IC_instr is a 32-bit instruction field. IC_addr corresponds to VA<12:2> of the instruction address.

Bits    63:16   15:14    13:3      2:0
Field   —       IC_way   IC_addr   —

FIGURE V-1 Instruction Cache Instruction Access Address Format
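The address and data layouts for ASI_ICACHE_INSTR reduce to simple shifts and masks. The sketch below builds the diagnostic virtual address from a way and an index, and splits a returned 64-bit value into its predecode and instruction fields. The function names are hypothetical, and the code models only the bit layout; it does not issue the LDXA/STXA access itself.

```python
def icache_instr_va(way: int, index: int) -> int:
    """Build the VA for ASI_ICACHE_INSTR: IC_way in <15:14>, IC_addr in <13:3>."""
    if not 0 <= way < 4:
        raise ValueError("IC_way is a 2-bit field")
    if not 0 <= index < (1 << 11):
        raise ValueError("IC_addr is an 11-bit field")
    return (way << 14) | (index << 3)


def split_icache_instr(data: int):
    """Split 64-bit access data into (IC_predecode<41:32>, IC_instr<31:0>)."""
    predecode = (data >> 32) & 0x3FF
    instr = data & 0xFFFFFFFF
    return predecode, instr


if __name__ == "__main__":
    print(hex(icache_instr_va(way=1, index=0)))
    print(split_icache_instr((0x2AB << 32) | 0x01000000))
```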

V.4.2 Instruction Cache Tag/Valid Fields

ASI 67₁₆, VA<63:16> = 0, VA<15:14> = IC_way, VA<13:5> = IC_addr, VA<4:3> = IC_tag

Name: ASI_ICACHE_TAG

The instruction cache tag and valid fields are illustrated in FIGURE V-2 and described in TABLE V-1.

Bits    63:16   15:14    13:5      4:3      2:0
Field   —       IC_way   IC_addr   IC_tag   —

FIGURE V-2 Instruction Cache Tag/Valid Access Address Format

TABLE V-1 Fields in I-Cache Tag/Valid Access Address

Bits    Field     Description
15:14   IC_way    A 2-bit field that selects a way (4-way associative).
13:5    IC_addr   IC_addr<13:6> corresponds to bits VA<12:5> of the instruction address and is used to index the cache for writing.
                  For reading, IC_addr<13:5> corresponds to bits VA<12:4>, since only half of the data from a cache line can be read in each cycle.
                  IC_addr<5> is used to access the I-cache Valid and load store prediction bits. It is a "don't care" in accessing the Physical Address Tag and Microtag fields.
                  IC_addr<5> = 0 corresponds to the upper half of the load store predict bits.
                  IC_addr<5> = 1 corresponds to the lower half of the load store predict bits.
                  The valid bit will always be read out.
4:3     IC_tag    Instruction cache tag number: 0, 1, and 2 (upper bits; lower bits). For clarity, the microtags are illustrated in the subsection following the table.

After an ASI load or store instruction with either ASI 67₁₆ or ASI 68₁₆, instruction cache consistency may be broken, even if the instruction cache is disabled. The reason is that snoops and invalidates to the instruction cache may collide with the ASI load/store. Thus, before these ASI accesses, the instruction cache must be turned off. Then, before the instruction cache is turned on again, all of the instruction cache valid bits must be cleared to keep cache consistency.

IC_tag: I-cache tag numbers

FIGURE V-3 through FIGURE V-7 illustrate the meaning of the tag numbers in IC_tag. In the figures, Undefined means the value of these bits is undefined on reads and must be masked off by software.

0 = Physical Address Tag
where: IC_tag is the 30-bit physical tag field (PA<42:13> of the associated instructions).

Bits    63:38   37:8     7:0
Field   —       IC_tag   Undefined

FIGURE V-3 Data Format for I-cache Physical Address Tag Field

1 = Microtag
where: IC_utag is the 8-bit virtual microtag field (VA<20:13> of the associated instructions).

Bits    63:46   45:38     37:0
Field   —       IC_utag   Undefined

FIGURE V-4 Data Format for I-cache Microtag Field

Caution – The I-cache microtags must be initialized after power-on before the instruction cache is enabled. For each of the four ways of each index of the instruction cache, the microtags must contain a unique value. For example, for index 0, the four microtags could be initialized to 0, 1, 2, and 3, respectively. The values need not be unique across indices; in the previous example, 0, 1, 2, and 3 could also be used for cache indices 1, 2, 3, and so on.

2 = Valid/predict tag
where: Valid is the Valid bit for the line and IC_vpred is the 8-bit LPB bits for eight instructions starting at the 32-byte-boundary-aligned address given by IC_addr.

Bits    63:55   54      53:46           45:0
Field   —       Valid   IC_vpred<7:0>   Undefined

FIGURE V-5 Format for Writing I-cache Valid/Predict Tag Field Data

2 = Valid/predict tag (reading, upper bits)
where: Valid and IC_vpred are as defined for FIGURE V-5.

Bits    63:51   50      49:46           45:0
Field   —       Valid   IC_vpred<7:4>   Undefined

FIGURE V-6 Format for Reading Upper Bits of Valid/Predict Tag Field Data

2 = Valid/predict tag (reading, lower bits)
where: Valid and IC_vpred are as defined for FIGURE V-5.

Bits    63:51   50      49:46           45:0
Field   —       Valid   IC_vpred<3:0>   Undefined

FIGURE V-7 Format for Reading Lower Bits of Valid/Predict Tag Field Data

V.4.3 Instruction Cache Snoop Tag Fields

ASI 68₁₆, VA<63:16> = 0, VA<15:14> = IC_way, VA<13:6> = IC_addr, VA<5:0> = 0

Name: ASI_ICACHE_SNOOP_TAG

The address format for the I-cache snoop tag fields is described below and illustrated in FIGURE V-8.

Bits    Field     Description
15:14   IC_way    This 2-bit field selects a way (4-way associative).
13:6    IC_addr   This 8-bit index (VA<12:5>) selects a cache tag.

Bits    63:16   15:14    13:6      5:0
Field   —       IC_way   IC_addr   —

FIGURE V-8 Address Format of Instruction Cache Snoop Tag

The data format for the I-cache snoop tag is described below and illustrated in FIGURE V-9.

Bits    Field          Description
63:38   Undefined      The value of these bits is undefined on reads and must be masked off by software.
37:8    IC_snoop_tag   The 30-bit physical tag field (PA<42:13> of the associated instructions).

Bits    63:38       37:8           7:0
Field   Undefined   IC_snoop_tag   Undefined

FIGURE V-9 Data Format of Instruction Cache Snoop Tag

After an ASI load or store instruction with either ASI 67₁₆ or ASI 68₁₆, instruction cache consistency may be broken, even if the instruction cache is disabled. The reason is that snoops and invalidates to the instruction cache may collide with the ASI load/store. Thus, before these ASI accesses, the instruction cache must be turned off. Then, before the instruction cache is turned on again, all of the instruction cache valid bits must be cleared to keep cache consistency.

V.5 Branch Predictor Array Accesses

ASI 6F₁₆, VA<63:16> = 0, VA<15:3> = BPA_addr, VA<2:0> = 0

Name: ASI_BRANCH_PREDICTION_ARRAY

The address format of the branch-prediction array access is illustrated in FIGURE V-10, where: BPA_addr is a 13-bit index (VA<15:3>) that selects a branch prediction array entry.

Bits    63:16   15:3       2:0
Field   —       BPA_addr   —

FIGURE V-10 Branch Prediction Array Access Address Format

The branch prediction array entry is described below and illustrated in FIGURE V-11.

Bits   Field       Description
63:4   Undefined   The value of these bits is undefined on reads and must be masked off by software.
3:2    PT_Bit      The two predict bits if the last prediction was TAKEN.
1:0    PNT_Bits    The two predict bits if the last prediction was NOT_TAKEN.

Undefined PNT_Bits PT_Bits 63 3 2 1 0

FIGURE V-11 Branch Prediction Array Data Format

V.6 Data Cache Diagnostic Accesses

Five D-cache ASI accesses are supported:
■ D-cache Data field (ASI 46₁₆)
■ D-cache Tag/Valid fields (ASI 47₁₆)
■ D-cache snoop tag access (ASI 44₁₆)
■ D-cache microtag fields (ASI 43₁₆)
■ D-cache Invalidate (ASI 42₁₆)

V.6.1 Data Cache Data Field

ASI 46₁₆, VA<63:16> = 0, VA<15:14> = DC_way, VA<13:3> = DC_addr, VA<2:0> = 0

Name: ASI_DCACHE_DATA

The address format for D-cache data access is described below and illustrated in FIGURE V-12.

Bits    Field     Description
15:14   DC_way    A 2-bit index that selects an associative way (4-way associative).
13:3    DC_addr   An 11-bit index that selects a 64-bit data field.

Bits    63:16   15:14    13:3      2:0
Field   —       DC_way   DC_addr   —

FIGURE V-12 Data Cache Data Access Address Format

The data format for D-cache data access is illustrated in FIGURE V-13, where DC_data is 64-bit data.

Bits    63:0
Field   DC_data

FIGURE V-13 Data Cache Data Access Data Format

Note – A MEMBAR #Sync is required before and after a load or store to ASI_DCACHE_DATA.

V.6.2 Data Cache Tag/Valid Fields

ASI 47₁₆, VA<63:16> = 0, VA<15:14> = DC_way, VA<13:5> = DC_addr, VA<4:0> = 0

Name: ASI_DCACHE_TAG

The address format for the D-cache Tag/Valid fields is described below and illustrated in FIGURE V-14.

Bits    Field     Description
15:14   DC_way    A 2-bit index that selects an associative way (4-way associative).
13:5    DC_addr   A 9-bit index that selects a tag/valid field (512 tags).

Bits    63:16   15:14    13:5      4:0
Field   —       DC_way   DC_addr   —

FIGURE V-14 Data Cache Tag/Valid Access Address Format

The data format for D-cache Tag/Valid access is described below and shown in FIGURE V-15.

Bits   Field      Description
30:1   DC_tag     The 30-bit physical tag (PA<42:13> of the associated data).
0      DC_valid   The 1-bit valid field.

Bits    63:31   30:1     0
Field   —       DC_tag   DC_valid

FIGURE V-15 Data Cache Tag/Valid Access Data Format
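As a small worked example of the two layouts above, the sketch below forms the ASI_DCACHE_TAG virtual address for a given way and index, and decodes a returned value into the physical address bits held in the tag plus the valid flag. The function names are hypothetical, and only the bit arithmetic is modeled.

```python
def dcache_tag_va(way: int, index: int) -> int:
    """Build the VA for ASI_DCACHE_TAG: DC_way in <15:14>, DC_addr in <13:5>."""
    if not 0 <= way < 4:
        raise ValueError("DC_way is a 2-bit field")
    if not 0 <= index < 512:
        raise ValueError("DC_addr is a 9-bit field (512 tags)")
    return (way << 14) | (index << 5)


def decode_dcache_tag(data: int):
    """Return (physical address with PA<42:13> filled in, valid flag)."""
    tag = (data >> 1) & ((1 << 30) - 1)   # DC_tag occupies bits 30:1
    valid = bool(data & 1)                # DC_valid is bit 0
    return tag << 13, valid


if __name__ == "__main__":
    print(hex(dcache_tag_va(way=2, index=511)))
    print(decode_dcache_tag((0x12345 << 1) | 1))
```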

V.6.3 Data Cache Microtag Fields

ASI 43₁₆, VA<63:16> = 0, VA<15:14> = DC_way, VA<13:5> = DC_addr, VA<4:0> = 0

Name: ASI_DCACHE_UTAG

The address format for the D-cache microtag access is described below and illustrated in FIGURE V-16.

Bits    Field     Description
15:14   DC_way    A 2-bit index that selects an associative way (4-way associative).
13:5    DC_addr   A 9-bit index that selects a tag/valid field (512 tags).

Bits    63:16   15:14    13:5      4:0
Field   —       DC_way   DC_addr   —

FIGURE V-16 Data Cache Microtag Access Address Format

The data format for D-cache microtag access is illustrated in FIGURE V-17, where DC_utag is the 8-bit virtual microtag (VA<21:14> of the associated data). A MEMBAR #Sync is required before and after a load or store to ASI_DCACHE_UTAG.

Bits    63:8       7:0
Field   Reserved   DC_utag

FIGURE V-17 Data Cache Microtag Access Data Format

Caution – The data cache microtags must be initialized after power-on before the data cache is enabled. For each of the four ways of each index of the data cache, the microtags must contain a unique value; for example, for index 0, the four microtags could be initialized to 0, 1, 2, and 3, respectively. The values need not be unique across indices; in the previous example, 0, 1, 2, and 3 could also be used for cache indices 1, 2, 3, and so on.
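One way to satisfy the uniqueness requirement in the caution above is to store the way number itself as the microtag, as in the 0, 1, 2, 3 example. The sketch below only enumerates the (virtual address, data) pairs that such an initialization would write to ASI_DCACHE_UTAG; the generator name is hypothetical, and the actual stores (with the required MEMBAR #Sync around each) are left to the caller.

```python
def dcache_utag_init_writes(ways: int = 4, indices: int = 512):
    """Yield (va, utag) pairs giving each way of each index a distinct microtag.

    VA layout: DC_way in <15:14>, DC_addr in <13:5>.
    The microtag value is simply the way number, so values repeat across
    indices but never within one index.
    """
    for index in range(indices):
        for way in range(ways):
            yield (way << 14) | (index << 5), way


if __name__ == "__main__":
    writes = list(dcache_utag_init_writes())
    print(len(writes), "stores; first four:", [(hex(va), v) for va, v in writes[:4]])
```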

V.6.4 Data Cache Snoop Tag Access

ASI 44₁₆, VA<63:16> = 0, VA<15:14> = DC_way, VA<13:5> = DC_addr, VA<4:0> = 0

Name: ASI_DCACHE_SNOOP_TAG

The address format for the D-cache snoop tag fields is described below and illustrated in FIGURE V-18.

Bits    Field     Description
15:14   DC_way    A 2-bit index that selects an associative way (4-way associative).
13:5    DC_addr   A 9-bit index that selects a snoop tag field (512 tags).

Bits    63:16   15:14    13:5      4:0
Field   —       DC_way   DC_addr   —

FIGURE V-18 Data Cache Snoop Tag Access Address Format

FIGURE V-19 illustrates the data format for D-cache snoop tag access, where: DC_snoop_tag is the 30-bit physical snoop tag (PA<42:13> of the associated data).

Bits    63:31      30:1           0
Field   Reserved   DC_snoop_tag   —

FIGURE V-19 Data Cache Snoop Tag Access Data Format

V.6.5 Data Cache Invalidate

ASI 42₁₆

Name: ASI_DCACHE_INVALIDATE

A store that uses the Data Cache Invalidate ASI invalidates a D-cache line that matches the supplied physical address from the data cache. A load to this ASI returns an undefined value and does not invalidate a D-cache line.

The address format for D-cache invalidate is illustrated in FIGURE V-20, where: the D-cache line matching PhysicalAddress is invalidated. If there is no matching D-cache entry, then the ASI store is a NOP.

Bits    63:43      42:5               4:0
Field   Reserved   Physical Address   X

FIGURE V-20 Data Cache Invalidate Address Format

V.7 External Cache Diagnostics Accesses

Separate ASIs are provided for reading and writing the E-cache tags and data. This section describes four E-cache accesses:
■ E-cache Control Register
■ E-cache Data/ECC fields
■ E-cache data staging registers
■ E-cache tag/state field diagnostics accesses

V.7.1 External Cache Control Register

ASI 75₁₆, VA = 0₁₆

Name: ASI_ECACHE_CTRL

The E-cache Control Register is illustrated in FIGURE V-21 and described in TABLE V-2.

Bits    Field
63:21   —
20      zz
19      trace_in
18      trace_out
17      reserved
16      EC_turn_rw
15      EC_early
14:13   EC_size
12:11   EC_clock
10      EC_ECC_en
9       EC_ECC_force
8:0     EC_check

FIGURE V-21 External Cache Control Register

TABLE V-2 External Cache Control Register Fields

Bits    Field          Description
20      ZZ             E-cache sleep mode.
                       0 = External cache sleep mode control is disabled.
                       1 = External cache sleep mode control is enabled.
19      trace_in       Data trace-in cycles.
                       0 = 2 cycles
                       1 = 3 cycles
18      trace_out      Reserved.
16      EC_turn_rw     External cache data turnaround cycle, read->write.
                       0 = 1 SRAM cycle between read->write (not supported in 3:1 clock mode, EC_clock = 0).
                       1 = 2 SRAM cycles between read->write.
15      EC_early       Reserved.
14:13   EC_size        The size specified here affects the size of the EC_addr field in the E-cache Data Register. See the next section.
                       0 = 1-Mbyte E-cache size
                       1 = 4-Mbyte E-cache size
                       2 = 8-Mbyte E-cache size
                       3 = unused
12:11   EC_clock       <12:11> = 0 selects 3:1 E-cache clock ratio.
                       <12:11> = 1 selects 4:1 E-cache clock ratio.
                       <12:11> = 2 Reserved.
                       <12:11> = 3 Reserved.
10      EC_ECC_en      If set, enables ECC checking on E-cache data bits.
9       EC_ECC_force   If set, forces EC_check<8:0> onto E-cache ECC bits.
8:0     EC_check       ECC check vector to force onto ECC bits.
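The field positions in TABLE V-2 can be captured in a small lookup table. The sketch below decodes a 64-bit E-cache Control Register value into named fields; the dictionary and function names are hypothetical, and the code is only a model of the bit layout, not a driver for the register.

```python
# (high bit, low bit) for each field listed in TABLE V-2.
EC_CTRL_FIELDS = {
    "zz": (20, 20),
    "trace_in": (19, 19),
    "trace_out": (18, 18),
    "EC_turn_rw": (16, 16),
    "EC_early": (15, 15),
    "EC_size": (14, 13),
    "EC_clock": (12, 11),
    "EC_ECC_en": (10, 10),
    "EC_ECC_force": (9, 9),
    "EC_check": (8, 0),
}

# EC_size encodings with a defined meaning (3 is unused).
EC_SIZE_MBYTES = {0: 1, 1: 4, 2: 8}


def decode_ec_ctrl(value: int) -> dict:
    """Split a register value into the fields of TABLE V-2."""
    return {
        name: (value >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in EC_CTRL_FIELDS.items()
    }


if __name__ == "__main__":
    fields = decode_ec_ctrl((1 << 20) | (1 << 13) | (1 << 11) | (1 << 10) | 0x155)
    print(fields, "->", EC_SIZE_MBYTES[fields["EC_size"]], "Mbyte E-cache")
```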

V.7.2 External Cache Data/ECC Fields

ASI 76₁₆ (writing) or 7E₁₆ (reading), VA<63:23> = 0,
VA<19:5> = EC_addr, VA<4:0> = 0 (1 MB)
VA<21:5> = EC_addr, VA<4:0> = 0 (4 MB)
VA<22:5> = EC_addr, VA<4:0> = 0 (8 MB)

Name: ASI_ECACHE_W, ASI_ECACHE_R

The E-cache data/ECC field is described below and illustrated in FIGURE V-22.

Bits   Field     Description
19:5   EC_addr   The size of this field is determined by the EC_size field specified in the External Cache Control Register. See the preceding section.
                 1 Mbyte: use a 15-bit index <19:5> that reads and writes a 32-byte field from the E-cache to and from the E-cache Data Staging registers (see next section).
                 4 Mbytes: use a 17-bit index <21:5> that reads and writes a 32-byte field from the E-cache to and from the E-cache Data Staging registers (see next section).
                 8 Mbytes: use an 18-bit index <22:5> that reads and writes a 32-byte field from the E-cache to and from the E-cache Data Staging registers (see next section).
                 Note: When Data is don't care, use %g0.

Bits    63:23      22:5 (19:5 for 1 Mbyte, 21:5 for 4 Mbytes)   4:0
Field   Reserved   EC_addr                                      Reserved

FIGURE V-22 External Cache Data Access Address Format

V.7.3 External Cache Data Staging Registers

ASI 74₁₆, VA<63:6> = 0, VA<5:3> = Staging Register Number 0-4, 5-7 Reserved.

Name: ASI_ECACHE_DATA

The E-cache address format for the Data Staging register is described below and illustrated in FIGURE V-23.

Bits   Field                  Description
5:3    data_register_number   Selects one of five staging registers: 0-3 = EC_data, 4 = EC_data_ECC.

Bits    63:6       5:3                    2:0
Field   Reserved   data_register_number   unused

FIGURE V-23 External Cache Data Staging Register Address Format

The E-cache data format for the Data Staging register is described below and illustrated in FIGURE V-24.

Bits   Field       Description
63:0   EC_data_N   64-bit staged E-cache data.
                   EC_data_0<255:192> corresponds to VA<5:3> = 000
                   EC_data_1<191:128> corresponds to VA<5:3> = 001
                   EC_data_2<127:64> corresponds to VA<5:3> = 010
                   EC_data_3<63:0> corresponds to VA<5:3> = 011
                   VA<5:3> = 100 corresponds to EC_data_4_ECC_hi and EC_data_ECC_lo

Bits    63:0
Field   EC_data_N

FIGURE V-24 External Cache Staging Register Data Format

The data format of E-cache staging ECC access is illustrated in FIGURE V-25, where EC_data_ECC_hi and EC_data_ECC_lo are each a 9-bit E-cache data ECC field.

Bits    63:18   17:9             8:0
Field   —       EC_data_ECC_hi   EC_data_ECC_lo

FIGURE V-25 External Cache Staging ECC Access Data Format

V.7.4 External Cache Tag/State Field Diagnostics Accesses

ASI 4E₁₆, VA<63:23> = 0, VA<19:6> = EC_tag_addr for 1 Mbyte, VA<21:8> = EC_tag_addr for 4 Mbyte, VA<22:9> = EC_tag_addr for 8 Mbyte, VA<5:0> = 0

Name: ASI_ECACHE_TAG (4E₁₆)

FIGURE V-26 illustrates the address format for E-cache tag access.

1 M   Bits    63:20   19:6          5:0
      Field   —       EC_tag_addr   —

4 M   Bits    63:22   21:8          7:0
      Field   —       EC_tag_addr   —

8 M   Bits    63:23   22:9          8:0
      Field   —       EC_tag_addr   —

FIGURE V-26 External Cache Tag Access Address Format
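In all three cache sizes EC_tag_addr is 14 bits wide and only its position in the VA changes. A minimal C sketch of forming the tag access address follows; the function name and the size_mb parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Place a 14-bit EC_tag_addr into the VA for ASI_ECACHE_TAG.
 * The field starts at VA<6>, VA<8>, or VA<9> for 1-, 4-, or
 * 8-Mbyte caches (FIGURE V-26); all other bits are zero. */
static uint64_t ec_tag_access_va(unsigned tag_addr, unsigned size_mb)
{
    unsigned shift;

    assert(size_mb == 1 || size_mb == 4 || size_mb == 8);
    assert(tag_addr < (1u << 14));
    shift = (size_mb == 1) ? 6 : (size_mb == 4) ? 8 : 9;
    return (uint64_t)tag_addr << shift;
}
```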

The data format for E-cache Tag/State access is described below and illustrated in FIGURE V-27.

Bits               Field      Description
See FIGURE V-27.   EC_tag     23-bit physical tag field.
                              EC_tag<43:21> = PA<42:20> of associated data for 1 Mbyte.
                              EC_tag<43:23> = PA<42:22> of associated data for 4 Mbytes.
                              EC_tag<43:24> = PA<42:23> of associated data for 8 Mbytes.
See FIGURE V-27.   EC_state   The eight 3-bit E-cache state fields for the 8-Mbyte case are encoded as follows:
                              EC_stateN<2:0> = 000  Invalid
                              EC_stateN<2:0> = 001  Shared
                              EC_stateN<2:0> = 010  Exclusive
                              EC_stateN<2:0> = 011  Owner
                              EC_stateN<2:0> = 100  Modified
                              EC_stateN<2:0> = 101  Reserved
                              EC_stateN<2:0> = 110  Owner/Shared
                              EC_stateN<2:0> = 111  Reserved

1 M:  - <63:44> | EC_tag <43:21> | unused <20:3> | EC_state0 <2:0>

4 M:  - <63:44> | EC_tag <43:23> | unused <22:12> | EC_state3 <11:9> | EC_state2 <8:6> | EC_state1 <5:3> | EC_state0 <2:0>

8 M:  - <63:44> | EC_tag <43:24> | EC_state7 <23:21> | EC_state6 <20:18> | EC_state5 <17:15> | EC_state4 <14:12> | EC_state3 <11:9> | EC_state2 <8:6> | EC_state1 <5:3> | EC_state0 <2:0>

FIGURE V-27 External Cache Tag/State Access Data Format
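The tag and state layouts of FIGURE V-27 can be decoded as in the C sketch below. The helper names and the size_mb parameter are hypothetical; the bit positions and state encodings are those listed above.

```c
#include <assert.h>
#include <stdint.h>

/* EC_stateN occupies data bits <3N+2:3N>. Only state 0 exists for
 * 1 Mbyte, states 0-3 for 4 Mbytes, and states 0-7 for 8 Mbytes. */
static unsigned ec_state(uint64_t data, unsigned n)
{
    assert(n < 8);
    return (unsigned)((data >> (3 * n)) & 0x7);
}

/* Name of a 3-bit state encoding, per the table above. */
static const char *ec_state_name(unsigned state)
{
    static const char *const names[8] = {
        "Invalid", "Shared", "Exclusive", "Owner",
        "Modified", "Reserved", "Owner/Shared", "Reserved"
    };
    return names[state & 0x7];
}

/* Physical address bits covered by EC_tag. The tag always ends at
 * data bit 43 and starts at bit 21, 23, or 24; it maps to PA one
 * bit lower (EC_tag<43:lsb> = PA<42:lsb-1>). */
static uint64_t ec_tag_pa(uint64_t data, unsigned size_mb)
{
    unsigned lsb;
    uint64_t tag;

    assert(size_mb == 1 || size_mb == 4 || size_mb == 8);
    lsb = (size_mb == 1) ? 21 : (size_mb == 4) ? 23 : 24;
    tag = (data >> lsb) & ((UINT64_C(1) << (44 - lsb)) - 1);
    return tag << (lsb - 1);
}
```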

V.8 Write Cache Diagnostic Accesses

Five W-cache ASI accesses are supported:
■ W-cache diagnostic valid bits register
■ W-cache diagnostic bank valid bits register
■ W-cache diagnostic data register
■ W-cache Tag/Valid fields
■ W-cache snoop tag register

Note – Writing to any bits in this register to influence the behavior of the W-cache may produce undefined results. Read/write access to this register is for diagnostic use only.

V.8.1 Write Cache Diagnostic Valid Bits Register

ASI 38₁₆, VA<63:11> = 0, VA<10:9> = WC_way, VA<8:6> = WC_addr, VA<5:0> = Reserved

Name: ASI_WCACHE_VALID_BITS

The address format for W-cache diagnostic valid bits access is described below and illustrated in FIGURE V-28.

Bits   Field      Description
10:9   WC_way     A 2-bit entry that selects an associative way (4-way associative).
8:6    WC_addr    A 3-bit index (VA<8:6>) that selects a W-cache entry.
5:0    Reserved   Must be 0.

Reserved <63:11> | WC_way <10:9> | WC_addr <8:6> | Reserved <5:0>

FIGURE V-28 Write Cache Diagnostic Valid Bits Access Address Format

FIGURE V-29 illustrates the data format for W-cache diagnostic valid bits access, where wcache_valid_bits is a doubleword that holds one valid bit for each data byte in the cache line; bit 63 corresponds to byte 0. A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_VALID_BITS.

wcache_valid_bits <63:0>

FIGURE V-29 Write Cache Diagnostic Valid Bits Access Data Format
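The address format and the reversed bit-to-byte ordering can be sketched in C as follows. The function names are hypothetical, and the required MEMBAR #Sync and the ASI access itself are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* VA for ASI_WCACHE_VALID_BITS: WC_way in VA<10:9>, WC_addr in VA<8:6>. */
static uint64_t wc_valid_bits_va(unsigned way, unsigned addr)
{
    assert(way < 4 && addr < 8);
    return ((uint64_t)way << 9) | ((uint64_t)addr << 6);
}

/* Valid bit for one byte of the 64-byte line; bit 63 maps to byte 0. */
static int wc_byte_valid(uint64_t valid_bits, unsigned byte)
{
    assert(byte < 64);
    return (int)((valid_bits >> (63 - byte)) & 1);
}
```

Because bit 63 describes byte 0, a mask with only the least significant bit set marks the last byte of the line as valid.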

V.8.2 Write Cache Diagnostic Bank Valid Bits Register

ASI 38₁₆, VA<63:11> = 2, VA<10:9> = WC_way, VA<8:6> = WC_addr, VA<5:0> = Reserved

Name: ASI_WCACHE_BANK_VALID_BITS

The address format for W-cache diagnostic bank valid bits access is described below and illustrated in FIGURE V-30.

Bits   Field      Description
10:9   WC_way     A 2-bit entry that selects an associative way (4-way associative).
8:6    WC_addr    A 3-bit index (VA<8:6>) that selects a W-cache entry.
5:0    Reserved   Must be 0.

Reserved <63:13> | 1 <12> | Reserved <11> | WC_way <10:9> | WC_addr <8:6> | Reserved <5:0>

FIGURE V-30 Write Cache Diagnostic Valid Bits Access Address Format

The data format for W-cache diagnostic bank valid bits access is described below and illustrated in FIGURE V-31. A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_VALID_BITS.

Bits   Field             Description
1      WC_vbank0_valid   Valid bit for the wcache_valid_bits<63:32>.
0      WC_vbank1_valid   Valid bit for the wcache_valid_bits<31:0>.

Reserved <63:2> | WC_vbank0_valid <1> | WC_vbank1_valid <0>

FIGURE V-31 Write Cache Diagnostic Bank Valid Bits Access Data Format
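The bank valid bits access shares ASI 38₁₆ with the valid bits access and is distinguished by VA<63:11> = 2, that is, VA<12> = 1. The C sketch below forms the address and decodes the two bank bits; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* VA for the bank valid bits access: VA<63:11> = 2 sets VA<12>,
 * WC_way goes in VA<10:9>, and WC_addr goes in VA<8:6>. */
static uint64_t wc_bank_valid_va(unsigned way, unsigned addr)
{
    assert(way < 4 && addr < 8);
    return (UINT64_C(2) << 11) | ((uint64_t)way << 9) | ((uint64_t)addr << 6);
}

/* Bit 1: valid bit for wcache_valid_bits<63:32>. */
static int wc_vbank0_valid(uint64_t data) { return (int)((data >> 1) & 1); }

/* Bit 0: valid bit for wcache_valid_bits<31:0>. */
static int wc_vbank1_valid(uint64_t data) { return (int)(data & 1); }
```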

V.8.3 Write Cache Diagnostic Data Register

ASI 39₁₆, VA<63:12> = 0, VA<11> = WC_port, VA<10:9> = WC_way, VA<8:6> = WC_addr, VA<5:3> = WC_dbl_word

Name: ASI_WCACHE_DATA

The address format for W-cache diagnostic data access is described below and illustrated in FIGURE V-32.

Bits   Field         Description
11     WC_port       A 1-bit field that selects port 0 or port 1 of the dual-ported RAM. Both ports give the same values, and this bit is used only during manufacture testing.
10:9   WC_way        A 2-bit entry that selects an associative way (4-way associative).
8:6    WC_addr       A 3-bit index (VA<8:6>) that selects a W-cache entry.
5:3    WC_dbl_word   A 3-bit field that selects one of 8 doublewords, read from the Data Return.
2:0    Reserved      Must be 0.

Reserved <63:12> | WC_port <11> | WC_way <10:9> | WC_addr <8:6> | WC_dbl_word <5:3> | Reserved <2:0>

FIGURE V-32 Write Cache Diagnostic Data Access Address Format

The data format for W-cache diagnostic data access is shown in FIGURE V-33. A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_DATA. In FIGURE V-33, wcache_data is a doubleword of W-cache data.

wcache_data <63:0>

FIGURE V-33 Write Cache Diagnostic Data Access Data Format
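A C sketch of composing the ASI_WCACHE_DATA address from its four fields follows. The function name is hypothetical, and the surrounding MEMBAR #Sync instructions and the ASI access are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* VA for ASI_WCACHE_DATA (FIGURE V-32):
 * WC_port <11>, WC_way <10:9>, WC_addr <8:6>, WC_dbl_word <5:3>. */
static uint64_t wc_data_va(unsigned port, unsigned way,
                           unsigned addr, unsigned dbl_word)
{
    assert(port < 2 && way < 4 && addr < 8 && dbl_word < 8);
    return ((uint64_t)port << 11) | ((uint64_t)way << 9) |
           ((uint64_t)addr << 6)  | ((uint64_t)dbl_word << 3);
}
```

Stepping dbl_word from 0 to 7 with the other fields fixed walks the eight doublewords of one 64-byte W-cache line.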

V.8.4 Write Cache Tag/Valid Fields

ASI 3A₁₆, VA<63:12> = 0, VA<11> = WC_port, VA<10:9> = WC_way, VA<8:6> = WC_addr, VA<5:0> = Reserved

Name: ASI_WCACHE_TAG

The address format of W-cache tag register access is described below and illustrated in FIGURE V-34.

Bits   Field      Description
11     WC_port    A 1-bit field that selects port 0 or port 1 of the dual-ported RAM. Both ports give the same values, and this bit is used only during manufacture testing.
10:9   WC_way     A 2-bit entry that selects an associative way (4-way associative).
8:6    WC_addr    A 3-bit index (VA<8:6>) that selects a W-cache entry.
5:0    Reserved   Must be 0.

Reserved <63:12> | WC_port <11> | WC_way <10:9> | WC_addr <8:6> | Reserved <5:0>

FIGURE V-34 Write Cache Tag Register Access Address Format

The data format for W-cache tag register access is described below and illustrated in FIGURE V-35. A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_TAG.

Bits    Field             Description
61      WC_vbank0_valid   Valid bit for the W-cache RAM data <511:256>.
60      WC_vbank1_valid   Valid bit for the W-cache RAM data <255:0>.
59:37   Reserved2         Must be zero. Note: Writing a nonzero value to this field may generate an undefined result. Software should not rely on any specific behavior.
36:0    WC_physical_tag   A 37-bit physical tag (PA<42:6>) of the associated data. Note: Setting ASI_WCACHE_TAG.WC_physical_tag and ASI_WCACHE_SNOOP_TAG.WC_physical_tag of the same W-cache entry to different values has undefined results.

Reserved <63:62> | WC_vbank0_valid <61> | WC_vbank1_valid <60> | Reserved2 <59:37> | WC_physical_tag <36:0>

FIGURE V-35 Write Cache Tag Register Access Data Format
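The tag register data format can be unpacked as in the following C sketch. The struct and function names are hypothetical; the bit positions are those of FIGURE V-35.

```c
#include <stdint.h>

/* Hypothetical decoded view of an ASI_WCACHE_TAG doubleword. */
struct wc_tag {
    int vbank0_valid;   /* bit 61: W-cache RAM data <511:256> valid */
    int vbank1_valid;   /* bit 60: W-cache RAM data <255:0> valid   */
    uint64_t pa;        /* PA<42:6> rebuilt from WC_physical_tag    */
};

static struct wc_tag wc_tag_decode(uint64_t data)
{
    struct wc_tag t;

    t.vbank0_valid = (int)((data >> 61) & 1);
    t.vbank1_valid = (int)((data >> 60) & 1);
    /* The 37-bit tag holds PA<42:6>, so shift it back up by 6. */
    t.pa = (data & ((UINT64_C(1) << 37) - 1)) << 6;
    return t;
}
```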

V.8.5 Write Cache Snoop Tag Register

ASI 3B₁₆, VA<63:12> = 0, VA<11> = WC_port, VA<10:9> = WC_way, VA<8:6> = WC_addr, VA<5:0> = 0

Name: ASI_WCACHE_SNOOP_TAG

The address format of W-cache snoop tag access is described below and illustrated in FIGURE V-36. A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_SNOOP_TAG.

Bits   Field      Description
11     WC_port    A 1-bit field that selects port 0 or port 1 of the dual-ported RAM. Both ports give the same values, and this bit is used only during manufacture testing.
10:9   WC_way     A 2-bit entry that selects an associative way (4-way associative).
8:6    WC_addr    A 3-bit index (VA<8:6>) that selects a W-cache entry.
5:0    Reserved   Must be 0.

Reserved <63:12> | WC_port <11> | WC_way <10:9> | WC_addr <8:6> | Reserved <5:0>

FIGURE V-36 Write Snoop Tag Access Address Format

The data format for W-cache snoop tag access is described below and illustrated in FIGURE V-37.

Bits   Field             Description
37     WC_valid          A 1-bit field that indicates a valid physical tag entry.
36:0   WC_physical_tag   The 37-bit physical tag (PA<42:6>) of associated data. Note: Setting ASI_WCACHE_TAG.WC_physical_tag and ASI_WCACHE_SNOOP_TAG.WC_physical_tag of the same W-cache entry to different values has undefined results.

Reserved <63:38> | WC_valid <37> | physical_tag <36:0>

FIGURE V-37 Write Cache Snoop Tag Access Data Format
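Because differing tag and snoop tag values for the same entry have undefined results, diagnostic software may want to compare the two before writing. The C sketch below shows one possible check; the function names are hypothetical, and the comparison covers only the 37-bit WC_physical_tag field.

```c
#include <stdint.h>

#define WC_PHYSICAL_TAG_MASK ((UINT64_C(1) << 37) - 1)   /* bits <36:0> */

/* WC_valid is bit 37 of the snoop tag doubleword. */
static int wc_snoop_valid(uint64_t snoop_data)
{
    return (int)((snoop_data >> 37) & 1);
}

/* PA<42:6> rebuilt from the snoop tag's WC_physical_tag field. */
static uint64_t wc_snoop_pa(uint64_t snoop_data)
{
    return (snoop_data & WC_PHYSICAL_TAG_MASK) << 6;
}

/* Nonzero if ASI_WCACHE_TAG and ASI_WCACHE_SNOOP_TAG data carry
 * the same WC_physical_tag; other bits are ignored. */
static int wc_tags_consistent(uint64_t tag_data, uint64_t snoop_data)
{
    return (tag_data & WC_PHYSICAL_TAG_MASK) ==
           (snoop_data & WC_PHYSICAL_TAG_MASK);
}
```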

V.9 Integer Unit Design for Test (DFT)

The following sections describe design-for-test features that make the internal state of the processor core visible for debugging.

V.9.1 IU Shadow Scan Registers

The following architectural registers in the IU have shadow scan copies.
■ PC<63:6> (iblock_2.cq_dp.grp_pc_e<63:6>)
  This is the high-order portion of the PC of the first instruction in the E-stage of the pipeline (or the next instruction expected to enter the E-stage if there are no instructions in E).
■ PSTATE<11:0> (ieu.ms_spreg.ms_spreg_dp_17.arch_pstate<11:0>)
  , RED, PEF, AM, PRIV, IE, AG}
■ TPC<63:0> (ieu.ms_spreg.ms_spreg_dp_64.arch_tpc<63:0>)
■ TnPC<63:0> (ieu.ms_spreg.ms_spreg_dp_64.arch_tnpc<63:0>)
■ TT<8:0> (ieu.ms_spreg.ms_spreg_dp_17.arch_tt<8:0>)
■ TL<2:0> (ieu.ms_spreg.ms_spreg_cntl.arch_tl<2:0>)
■ TSTATE<39:0> (ieu.ms_spreg.ms_spreg_dp_64.arch_tstate<39:0>)
  {CCR<7:0>, ASI<7:0>, 4’bz, PSTATE<11:0>, 5’bz, CWP<2:0>}
■ ASI<7:0> (ieu.ms_spreg.ms_spreg_dp_17.arch_asi<7:0>)
■ CWP<2:0> (ieu.ms_spreg.ms_spreg_cntl.arch_cwp<2:0>)
■ CCR<7:0> (ieu.ms_spreg.ms_spreg_dp_17.arch_ccr<7:0>)
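The TSTATE<39:0> concatenation above lists fields from most to least significant, with widths 8, 8, 4, 12, 5, and 3 bits. Assuming a scanned value has been collected into an integer with TSTATE<0> in bit 0, the fields can be separated as in this C sketch. The struct and function names are hypothetical, and the undriven (z) bits are simply ignored.

```c
#include <stdint.h>

/* Hypothetical decoded view of the 40-bit shadow scan TSTATE value. */
struct tstate_fields {
    unsigned ccr;      /* TSTATE<39:32> */
    unsigned asi;      /* TSTATE<31:24> */
    unsigned pstate;   /* TSTATE<19:8>  */
    unsigned cwp;      /* TSTATE<2:0>   */
};

static struct tstate_fields tstate_decode(uint64_t tstate)
{
    struct tstate_fields f;

    f.cwp    = (unsigned)(tstate & 0x7);             /* bits <7:3> are z   */
    f.pstate = (unsigned)((tstate >> 8) & 0xFFF);    /* bits <23:20> are z */
    f.asi    = (unsigned)((tstate >> 24) & 0xFF);
    f.ccr    = (unsigned)((tstate >> 32) & 0xFF);
    return f;
}
```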

V.9.2 IU Observability Bus Signals

The following comments were extracted from :/sys/mod/cpu/iu/iiu/grouping_l/ch_staging_ctl/rtl/observability.v

See also DCR Control for Observability Bus (OBS) (Impl. Dep. #203) on page 31.

All IU observability bus signals will transition at the sys_clk rate. This transition is controlled by an input, observ_ce_a1, which is flopped once and then used as a clock-enable for all the output flops. The examples below show a divide-by-4 sys_clk.

[Timing diagram: clk, observ_ce_a1, observ_ce]

The accum signals are asserted if the indicated condition was true at any time within a sampling window of the same period as observ_ce, as shown below.

// observ_ce_a1             =====
// observ_ce                =====
// ii_a0_valid_r            ===
// ii_a0_valid_r_d1         ===
// ii_a0_valid_r_accum_a1   ======
// ii_a0_valid_r_accum      ======
// (R-valid window)         ======

Valid R-stage instructions will first become visible on the ii_<pipe>_valid_r_accum outputs sometime between the (R+2) and (R+5) cycles. See the preceding example. In other words, ii_<pipe>_valid_r_accum is asserted for an N-cycle period if and only if ii_<pipe>_valid_r was asserted at some time during the N-cycle window that was exactly (N+2) cycles earlier than this N-cycle period.

output ii_a0_valid_r_accum; // A0-pipe R-stage had valid instructions.
output ii_a1_valid_r_accum; // A1-pipe R-stage had valid instructions.
output ii_ms_valid_r_accum; // MS-pipe R-stage had valid instructions.
output ii_br_valid_r_accum; // BR-pipe R-stage had valid instructions.
output ii_fa_valid_r_accum; // FA-pipe R-stage had valid instructions.
output ii_fm_valid_r_accum; // FM-pipe R-stage had valid instructions.

ii_instr_compl_accum was originally documented as being asserted in the observability bus period that begins between the (D+3) and (D+6) cycle (inclusive) of the completed instruction. As implemented, however, it is asserted in the observability bus period that begins between the (D+4) and (D+7) cycle (inclusive) of the completed instruction.

Completed instructions will first become visible on the ii_instr_compl_accum output sometime between their (D+4) and (D+7) cycles, inclusive. In other words, ii_instr_compl_accum is asserted for an N-cycle period if and only if one or more instructions completed in D at some time in the N-cycle window that was exactly (N+3) cycles earlier than this N-cycle period. For example:

// observ_ce_a1              =====
// observ_ce                 =====
// (instr1)                  E C M W X T D
// (instr2)                  RECMWXTD1234
// (instr3)                  RECMWXTD1234
// (instr4)                  RECMWXTD1234567
// (instr5)                  RECMWXTD1234567
// (instr6)                  RECMWXTD1234567
// instr_cnt_d_d2[2:0]       010000000002300000000
// instr_compl_d_d2          = = =
// ii_instr_compl_accum_a1   ======
// ii_instr_compl_accum      ======
// (D-compl window)          ======
//
output ii_instr_compl_accum; // Instructions have completed.
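The latency ranges quoted above, (R+2) to (R+5) for the valid_r_accum outputs and (D+4) to (D+7) for ii_instr_compl_accum, both follow from an event waiting for the next observability period boundary after a fixed minimum delay. The C sketch below models that arithmetic only. It assumes that periods begin on cycles congruent to a fixed phase modulo N, and the function and parameter names are hypothetical; it is not derived from the RTL.

```c
#include <assert.h>
#include <stdint.h>

/* First cycle at which an event becomes visible on an accum output.
 * event_cycle: cycle of the R-stage valid or D-stage completion.
 * min_latency: 2 for valid_r_accum, 4 for instr_compl_accum.
 * n:           observability period in sys_clk cycles (4 in the examples).
 * phase:       cycle number modulo n on which periods begin (assumed). */
static uint64_t first_visible_cycle(uint64_t event_cycle, unsigned min_latency,
                                    unsigned n, unsigned phase)
{
    uint64_t earliest, rem;

    assert(n > 0 && phase < n);
    earliest = event_cycle + min_latency;
    rem = earliest % n;
    /* Advance to the next cycle that is congruent to phase modulo n. */
    return earliest + (phase + n - rem) % n;
}
```

With n = 4 the result always lies between event_cycle + min_latency and event_cycle + min_latency + 3, which reproduces the four-cycle ranges stated in the text.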

The in_progress signals are asserted if the indicated operation was “in progress” in the cycle in which observ_ce is asserted. A recirculate or mispredict is defined to be in progress (that is, recirc_in_progress_a1 or mispred_in_progress_a1 is asserted) between the (D+3)-cycle of the recirculating or mispredicting instruction and the (D+2)-cycle of the next instruction to complete, inclusive.

The net effect is that ii_recirc_in_progress or ii_mispred_in_progress is first asserted sometime between the (D+4) and (D+7) cycles, inclusive, of the recirculating or mispredicting instruction and is not deasserted until sometime between the (D+4) and (D+7) cycles, inclusive, of the next instruction to complete. In other words, ii_(recirc/mispred)_in_progress is asserted for one or more N-cycle periods if and only if a recirculated/mispredicted instruction reached D at some time in the N-cycle window that was exactly (N+3) cycles earlier than the first N-cycle period of assertion. Once asserted, the first N-cycle period of deassertion occurs when any instruction completes during the N-cycle window that was exactly (N+3) cycles earlier than the deassertion period. For example,

// observ_ce_a1             =====
// observ_ce                =====
// Ld                       RECMWXTD1234
// Ld                       APFBIJRECMWXTD1234
// iq_recirc_t              =
// recirc_d_d2              ==
// instr_compl_d_d2         =
// recirc_in_progress_a1    ======
// ii_recirc_in_progress    ======
// (D-recirc window)        ====
// (D-compl window)         ===

// observ_ce_a1             =====
// bicc                     RRCMWXTD1234567
// (dly)                    APFBIJRECMWXTD1234567
// ie_mispred_qual_c        =
// mispred_d_d2             =
// instr_compl_d_d2         ==
// mispred_in_progress_a1   ======
// ii_mispred_in_progress   ======
// (D-mispred window)       ====
// (D-compl window)         ====

output ii_recirc_in_progress;  // Recirculate in progress
output ii_mispred_in_progress; // Mispredict in progress


Bibliography

General References

Please refer to Bibliography in Commonality.

351 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material 352 SPARC V9 JPS1 Implementation Supplement: Sun UltraSPARC-III • Working Draft 1.0.5, 10 Sep 2002 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material Index

A tag/valid access format 327 A pipeline stage 39, 41 latch 294 A0 pipeline 40, 229, 230 space identifier (ASI) 65, 153 A1 pipeline 40, 229 space size, differences from UltraSPARC I 13 accesses virtual-to-physical translation 65 branch predictor array 331 address alias flushing 162 cacheable 66 aexc field of FSR 98 data cache diagnostic 331 AFAR debug and diagnostics 325 capture 201, 225 external cache capturing status of multiple events 206 diagnostics 336 correctable error address 192 tag/state field diagnostics 339 error information logging 185 Fireplane Configuration Register 279 error logging 188 I/O 72 extension 14 instruction cache hardware-corrected ECC errors 190 diagnostics 326 instruction fetch error 200 tag/valid address format 327 overwrite policy noncacheable 66 exception to 218 nonfaulting ASIs 70 priorities 217 real memory space 65 overwrite priorities, BERR handling 199 restricted ASI 65 PA field 213 with side effects 64, 66, 72 pointing to 32-byte boundary 251 write cache 340 priority 216 add/subtract instructions, partitioned 91 register capture 213 address state after clearing 206 aliasing 159 state after reset 183 bank select/row, generation 320 AFSR capture, reenabling 213 BERR field 198, 199, 201, 204, 207, 215, 227, 228, data cache access format 332 247 Fireplane Address Register 283 bit description 207 illegal address alliasing 162 capturing 213 instruction cache CE field 196, 201, 204, 206, 208, 213, 214, 216, 224, instruction access format 327 251 changes to 14

S. Index 1 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material clearing 206 225, 226 CPC field 192, 204, 207, 213, 215, 222, 224 on watchdog reset 235 CPU field 192, 195, 204, 207, 213, 215 overwrite priorities, BERR handling 199 E_SYND field PERR field 199, 200, 205, 206, 207, 214, 252 capture 201 PRIV field capture by UE 216 accumulating state 205 data priority 216 captured information 199 description 208 deferred trap 190 ECC syndrome capture 197 description 207 initializing 207 ECC errors 197 locking 213 fault recording 205 overwrite policy 206, 218 logging errors from stores 233 unlocking 207 setting 216 value reported for correction words with bits updating 201 inverted 228 state after reset 183 EDC field 192, 204, 207, 213, 215, 221 TO field 199, 202, 204, 207, 215, 227 EDU field 195, 204, 208, 213, 215, 216, 222 UCC field 192, 203, 207, 213, 215, 220 EMC field 196, 201, 204, 207, 213, 215, 225, 249, UCU field 195, 203, 207, 213, 215 251 UE field 197, 198, 201, 204, 206, 208, 213, 214, 216, EMU error logging 253 225, 227, 251 EMU field 198, 201, 204, 207, 213, 215, 225, 249, WDC field 192, 204, 207, 213, 215, 222 251 WDU field 192, 193, 195, 196, 204, 207, 213, 215, error accumulation 204 223, 224 error information logging 185 writes to 206 error logging 188 alias error logging, deferred traps 189 address 159 error logging, system bus data ECC error 197 boundary 162 hardware-corrected ECC errors 190 alternate global registers 194 IERR field 206, 207, 214, 252 MMU 132 ISAP field 207, 214 ancillary state registers (ASRs) 102 IVC field 201, 204, 207, 213, 215, 226, 251 ancillary state registers (ASRs) See ASRs IVU field 197, 201, 204, 207, 213, 214, 226, 251 Architectural Register File (ARF) 48 logging multiple events 206, 231 ARF (Architectural Register File) 48 logical field ASI accumulating multiple-error (ME) 204 atomic access 70 accumulating privilege-error (PRIV) 205 data cache accesses 331, 340 data ECC syndrome 206 internal 153 internal 
inconsistency in system interface logic load from TLB Data Access register 138 205 nonfaulting 70 most recently detected errors (sticky bits) 205 nontranslating 153 system bus MTag ECC error 206 operations 134 M_SYND field registers, differences from UltraSPARC I 14 data priority 216 registers, removed in UltraSPARC III 14 ECC syndromes 212 registers, state after reset 181 initializing 207 restricted 65 locking 213 store from ASI 330 overwrite policy 218 UltraSPARC III internal 74 unlocking 207 ASI_AS_IF_USER_PRIMARY 70 ME field 194, 202, 207, 216, 219, 221, 223, 224, ASI_AS_IF_USER_PRIMARY_LITTLE 70

S. Index 2 SPARC V9 JPS1 Implementation Supplement: Sun UltraSPARC-III • Working Draft 1.0.5, 10 Sep 2002 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material ASI_AS_IF_USER_SECONDARY 70 ASI_SECONDARY 70 ASI_AS_IF_USER_SECONDARY_LITTLE 70 ASI_SECONDARY_LITTLE 70 ASI_ASYNC_FAULT_ADDRESS 213 ASI_SECONDARY_NO_FAULT 71 ASI_ASYNC_FAULT_STATUS 207 ASI_SECONDARY_NO_FAULT_LITTLE 71 ASI_BLK_COMMIT 162 ASI_SERIAL_ID 290 ASI_BLK_COMMIT_PRIMARY 163 ASI_SRAM_FAST_INIT 155 ASI_BLK_COMMIT_SECONDARY 163 ASI_WCACHE_DATA 155, 342 ASI_BRANCH_PREDICTION_ARRAY 156, 331 ASI_WCACHE_SNOOP_TAG 155, 344 ASI_DCACHE_DATA 155, 332 ASI_WCACHE_TAG 155, 343 ASI_DCACHE_INVALIDATE 155 ASI_WCACHE_VALID_BITS 155, 340, 341 ASI_DCACHE_SNOOP_TAG 155, 335 ASRs ASI_DCACHE_TAG 155, 190, 192, 333, 334, 335 differences from UltraSPARC I 15 ASI_DCACHE_UTAG 155 grouping rules 48 ASI_DCU_CONTROL_REG 155 implementation dependent, read/write 102 ASI_DCU_CONTROL_REGISTER 33 Asynchronous Fault Address Register, See AFAR ASI_DEVICE_ID+SERIAL_ID 290 Asynchronous Fault Status Register, See AFSR ASI_EC_CTRL 156 atomic instructions ASI_EC_R 156 and ECC errors 192 ASI_EC_TAG 156 compare and swap 70 ASI_EC_W 156 LDSTUB 70 ASI_ECACHE_CONTROL 156 mutual exclusion support 69 ASI_ECACHE_CTRL 336 and store queue 89 ASI_ECACHE_DATA 156, 337 SWAP 69 ASI_ECACHE_R 156, 337 use with ASIs 70 ASI_ECACHE_TAG 156, 339 Ax pipeline 35, 111 ASI_ECACHE_W 156, 337 ASI_ESTATE_ERROR_EN_REG 155, 203 ASI_FIREPLANE_ADDRESS_REG 155 B ASI_FIREPLANE_CONFIG_REG 155 B pipeline stage 39 ASI_IC_INSTR 156 bank segment size, See memory bank segment size ASI_IC_TAG 156 bank select/row address logic 320 ASI_ICACHE_INSTR 156, 327 BERR, See AFSR BERR field ASI_ICACHE_SNOOP_TAG 329 big-endian ASI_ICACHE_TAG 156, 327 default ordering after POR 153 ASI_ITLB_CAM_ACCESS_REG 156 instruction fetches 153 ASI_ITLB_DATA_ACCESS_REG 156 swapping in partial store instructions 91 ASI_MCU_CTRL_REG 156 bin number 290 ASI_NUCLEUS 70 bit vectors, concatenation 
132 ASI_NUCLEUS_LITTLE 70 BLD, See block load instruction ASI_PCACHE_DATA 155 block ASI_PCACHE_SNOOP_TAG 155 load and store instructions ASI_PCACHE_STATUS_DATA 155 compliance across UltraSPARC platforms 86 ASI_PCACHE_TAG 155 data size (granularity) 72 ASI_PHYS_USE_EC 8, 70, 161 E-cache access counting 271 ASI_PHYS_USE_EC_LITTLE 8, 70 external cache allocation 161 ASI_PRIMARY 70 UltraSPARC III specifics 79 ASI_PRIMARY_LITTLE 70 load instruction 72 ASI_PRIMARY_NO_FAULT 71 in block store flushing 163 ASI_PRIMARY_NO_FAULT_LITTLE 71 E-cache allocation 161

Working Draft 1.0.5, 10 Sep 2002 S. Index 3 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material ECC error 189 bubbles 269 grouping 49 bubbles, See stalls ordering 80 bus protocol error 252 and store queue 89 BUSY/NACK pairs 175 memory operations 123 bypass, data cache 159 operations and memory model 81 byte mask overlapping stores 80 See also BMASK instruction store instruction 222 grouping 83 data size (granularity) 72 instruction 8 E-cache allocation 161 byte ordering 91, 153 grouping 49 byte shuffle instruction, See BSHUFFLE instruction ordering 81 and PDIST 46 use in loops 81 C store with commit 11 C pipeline stage 42, 43 use in loops 81 cache BMASK instruction coherency protocol 66, 106 and BSHUFFLE instruction 83 consistency errors 255 and MS pipeline 83 control fields of DCUCR 35 grouping rules 50 diagram 220 bootbus address differences from UltraSPARC I 10 mapping 8 external 161 unallowed memory operations 8 flushing 162 booting code 161 invalidation 335 BR pipeline 40, 229, 230 level 1 159 branch instructions 36 level 2 161 branch instructions, conditional 41 miss 42 branch prediction organization 159 in B pipeline stage 39 physical indexed, physical tagged 160 mispredict signal 41 scrubbing 192, 195 predictor array accesses 331 write 161 statistics for taken/untaken 269 cacheability, determination of 159 tag/valid field access address 331 cacheable accesses Branch Predictor (BP) 39 indication 66 BRANCH_PREDICTION_ARRAY properties 66 BPA_addr field 331 CALL instruction 36 PNT_Bits (prediction not taken) field 331 CAM Diagnostic Register 134 PT_Bits field 331, 332 CAM Diagnostic Register format 137 break-after, definition 44 CANRESTORE register 181 break-after, See grouping rules CANSAVE register 181 break-before, definition 44 CAS(X)A instruction 70, 154 break-before, See grouping rules catastrophic hardware fault 191 breakpoint 14, 36 CCR register 181 BSHUFFLE instruction CE error, See system bus ECC error UE and BMASK instruction 83 CEEN 
field, See Error Enable Register CEEN field fully pipelined 83 cexc field of FSR 98 grouping rules 50 cexc field of FSR register 28 BST, See block store instruction clean_window exception 105, 116 bubble, vs. helper 48 CLEANWIN register 116, 181

S. Index 4 SPARC V9 JPS1 Implementation Supplement: Sun UltraSPARC-III • Working Draft 1.0.5, 10 Sep 2002 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material clearing the AFSR 206 load miss 187 clipping values, See FPACK instruction microtags 334 coherence miss 42 domain 66, 198 miss counts 271 unit of 66 and RED_state 191 coherent pending queue (CPQ) 7 reference counts 271 compliance snoop tag access 334 Energy Star 277 tag/valid access address format 333, 334, 335 SPARC V9 102, 113 tag/valid access data format 333, 334 condition code register 181 tag/valid fields 333, 334, 335 conditional branch instructions 41 unit control register, See DCUCR conditional move instructions Data Cache Unit Control Register, See DCU and grouping rules 51 DCUCR Context field of the D-TLB Tag Access Register 134 data corruption 192 corrupt data 192 data path diagram 220 counters, measuring instruction completion 268 data switches 220, 294 counts, recirculating 270 data TLB 12 CPC E-cache ECC error 224 data watchpoint registers 35, 111 CPQ 7 data_access_error exception 193, 195, 197, 198, 199, CPU E-cache ECC error 224 201, 204, 215, 217, 218, 222, 225, 229 cross call 124 data_access_exception exception 66, 70, 71, 73, 113, Crossbar Data Switch (CDS), See data switches 122, 153, 154, 157, 325 CSR timing fields 322 data_register_number 338 CTI queue 40 DC_wr 271 current exception (cexc) field of FSR register 28 DC_wr_miss 271 CWP register 181 D-cache, See data cache cycles accumulated, count 268 DCACHE_DATA DC_addr field 332, 333, 334, 335 DC_data field 332 D DC_way field 332 D pipeline stage 43, 269 DCACHE_SNOOP_TAG 335 data cache DCACHE_TAG access 41 DC_addr field 335 access statistics 271 DC_tag field 333 and RED_state 177 DC_utag field 334 and block load/store 79 DC_valid field 333 bypassing 159 DC_way field 333, 334, 335 data access address format 332 DCR data access data format 332 access 31 data field 332 changes to 15 description 159 IFPOE field 33, 233 
diagnostic accesses 331 OBS field 31, 107 Enable bit 35 requirements for controlling observability bus 32 error handling 197 SI field 32 flushing 11, 160, 162, 216, 221 state after reset 182 flushing after certain traps 218 DCTI 39 flushing after multiple errors 202 DCU invalidating for certain traps 217 control register 177 invalidation 190, 192, 335 Control Register, See DCUCR

Working Draft 1.0.5, 10 Sep 2002 S. Index 5 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material determining helper cycles 47 division algorithm 116 enable_D-MMU bit 34 D-MMU physical_address_data_watchpoint_enable 35 and RED_state 177 virtual_address_data_watch-point_enable 35 context # upon a trap 134 virtual_address_data_watchpoint_mask 35 disabled, effect on D-cache 133 DCUCR pointer logic pseudocode 141 access data format 34 TLB replacement policy 138 CP (cacheability) field 34, 133, 159, 160, 161 virtual address translation 129 CV (cacheability) field 34, 133, 159 DONE instruction DC (D-cache enable) field 35, 159 after internal store to ASI 74 DM field 34, 159 after stores to ASIs 154 IC (I-cache enable) field 35, 74, 160, 191 and BST 80 IM (IMMU enable) field 34, 160 error barrier for deferred trap 188 ME field 34 error isolation 233 PM (physical address) field 35 exiting RED_state 29, 178 PR (PA data watchpoint enable) field 35 flushing pipeline 160 PW (PA data watchpoint enable) field 35 grouping rules 49 RE (R-A-W bypass enable) field 34 when TSTATE uninitialized 179 SL (second load steering enable) field 34 D-SFAR register SL, disabling for FP loads 35, 111 exception address (64-bit) 29 state after reset 182 state after reset 182 VM (virtual address) field 35 update after trap 133 VR (VA data watchpoint enable) field 35 D-SFSR VW (VA data watchpoint enable) field 35 FTYPE field upon data_access_exception 157 WE (W-cache enable) field 35 D-SFSR register debug accesses 325 See also SFSR deferred traps 61, 103, 114 CT (context) field 134 and TPC/TNPC 188 state after reset 182 errors generating 189 update after trap 133 handling unexpected traps 189 D-TLB special handling with MEMBAR #Sync 234 access 42 when taken 229 for data accesses 12 delayed control-transfer instruction (DCTI) 39 state after reset 183 diagnostics Tag Access Register, See Tag Access Register accesses 325 DVMA 178 external cache accesses 336 tag/state field accesses 339 E 
disabled MMU 106, 121 E pipeline stage 41, 43 Dispatch Control Register (DCR), See DCR E*_CLK, See Fireplane Configuration Register Dispatch_rs_mispred 269 E*_CLK field Dispatch0_2nd_br 269 E_SYND field of AFSR, See AFSR E_SYND field Dispatch0_br_target 269 EC_ic_miss 272 displacement flush 163, 192 EC_misses 271 disrupting trap E-cache, See external cache error barriers 234 ECACHE_CTRL processing 230 data trace-in cycles 336 when taken 229 EC_check field 337 disrupting traps 61, 114 EC_clock field 337

S. Index 6 SPARC V9 JPS1 Implementation Supplement: Sun UltraSPARC-III • Working Draft 1.0.5, 10 Sep 2002 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material EC_ECC_en field 337 overwrite policy (data) 218 EC_ECC_force field 337 overwrite policy (MTag) 218 External Cache data turnaround cycle 336 storage 226 size specification 336 ECC_error exception 190, 192, 195, 196, 197, 201, sleep mode 336 204, 215, 221, 222, 223, 224, 226, 229, 233 ECACHE_W ECU (External Cache Unit), See EMU EC_addr field 337 EDC, See AFSR EDC field EC_data field 338 Edge instructions 8 EC_data_ECC field 338 Edgencc instruction 8 ECC EDU, See AFSR EDU field bad condition, deliberately inserted 210, 227 EEMR, See EMU Error Mask Register check bit generation 227 EESR, See EMU Error Status Register check vector 337 EMC error, See system bus ECC error EMC correcting corrupted merged data 196 EMC, See AFSR EMC field error EMU See also external cache and system bus ECC er- error categories 252 ror error handling 251 atomic instructions 192 field of AFSR, See AFSR EMU field CPC 224 logical blocks 6 CPU 224 mask chain order 259 EDC 221 registers EDU 222 Error Mask Register (EEMR) external cache 228 bit description 258 hardware correctable 227 disabling certain error conditions 257 hardware corrected 190 purpose 257 logged in EESR 253 reset 258 MTag interrupt vector fetches 201 shifting 259 simultaneous E-cache and system bus errors updating content 253 218 Error Status Register (EESR) single-bit 190 fatal hardware error reporting 253 software correctable 187 Shadow Register 258 software correctable (double) 193 shadow scan chain order 259 system bus data 196, 201 EMU error, See system bus ECC error EMU system data 253 Energy Star UCC 220 1/32 mode uncorrectable 187, 190, 227 behavior 323 uncorrectable MTag 198 control 313 uncorrectable system bus data 197 time interval, row refreshes 306 WDC 223 timing control 308 WDU 223 compliance with 9 error correction 62 compliance, power-down 
mode 277 MTag correctness checking 196, 225 Fireplane support 277 protection, on external cache 13 half-mode 306, 313 signalling 227 mode, entering/leaving 94 syndrome saving system power (1/32 mode) 323 data single-bit errors 209 setting refresh rate 306 E_SYND field, 9-bit 209 error barriers 233 E_SYND field, single-bit 208 Error Enable Register M_SYND field 212 CEEN field 190, 192, 196, 204, 214, 224, 225, 226

Working Draft 1.0.5, 10 Sep 2002 S. Index 7 Sun Microsystems Proprietary/Need-To-Know – JRC Contributed Material FDECC field 203 external cache FMD field 203 access errors 235 FMECC field 203 access statistics 271 FMT field 203 and block load/store 79 ISAPEN field 204 bypassing by instruction fetches 161 NCEEN field 29, 73, 189, 195, 198, 204, 214, 222, Control register See ECACHE_CTRL 224, 225 correctable/uncorrectable data errors 193 register format 202 data access address format 337 state after reset 183 data access bypass 161 UCEEN field 192, 203, 220, 221 data ECC error ERROR output pin 186, 187, 189, 198, 201, 215, 225, correctable/uncorrectable 228 252 CPU 224 Error Status Register EDC 221 EC_ECC_ENABLE field 228 EDU 222 error_state 61 hardware corrected 191 error_state, and watchdog reset 113, 179 simultaneous with system bus error 218 errors software correctable 192 cache consistency (QCTL) 255 software correctable (double) 193 clearing AFSR bits 206 storage 228 deferred 188 UCC 220, 235 disrupting 190 UCU 221, 235, 236 ECC uncorrectable 194 See ECC error and external cache data ECC er- uncorrectable data 253 ror WDC 223 fatal 186 WDU 223 handling after unexpected deferred trap 189 data staging register address format 338 handling differences from UltraSPARC I 13 data/ECC fields 337 hardware-corrected 186 description 161 internal vs. 
protocol 205 diagnostics accesses 163, 336 isolation 233 ECC errors 228 logging 192 ECC fields 337 logging in EESR 253 error correction 13 memory 191 error detection 188 multiple, indicator 204 Error Enable Register 189, 190, 203 multiple, uncorrectable 219 Error Enable Register, See also Error Enable parity, on system address bus 253 Register recoverable ECC errors 190 flushing 11, 162, 221 reporting summary 214 handling bad ECC 197 software correctable 190 invalidating a line 163 software-correctable 186 parity error 189 timeout (fatal) 255 PIPT 161 uncorrectable 186, 197, 234 and RED_state 191 ESR, See EMU Shadow Register scrubbing 223, 227 ESTATE_ERROR_EN, See Error Enable Register staging ECC access data format 338 eviction, write cache 238 staging register data format 338 exceptions tag access address format 339 fp_exception_other 83 tag/state access data format 340 execution pipelines 40 tag/state field 339 extended instructions 105, 124 update 64

External Cache Unit (ECU), See also EMU entries Mem_Timing1_CTL register 297 External Memory Unit (EMU), See EMU entries Mem_Timing2_CTL register 303 External Reset pin 178 Mem_Timing3_CTL register 306 Externally Initiated Reset (XIR) 114, 178 Mem_Timing4_CTL register 308 Memory Address Control register 313 Fireplane Configuration Register F access 279 F pipeline stage 39 accessing 279 FADD instruction 83 CBASE field 281 FADD instructions 99, 100 CBND field 281 fadd of numbers with opposite signs 28 CLK field 281 FALIGNADDR instruction E*_CLK field 278, 280 grouping rules 50 reset values 284 FALIGNDATA instruction HBM (hierarchical bus mode) field 281 grouping rules 50 non-64-bit aligned accesses 279 performance considerations 78 SLOW field 281 fast flip-flopping 48 SSM field 196, 281 fast_data_access_MMU_miss exception 132 updating fields 284 fast_ECC_error exception 61, 62, 104, 114, 192, 193, Fireplane Interconnect 66 194, 200, 203, 215, 218, 220, 228 FIREPLANE_ADDRESS_REG, See Fireplane fast_instruction_access_MMU_miss exception 132, Address Register 228 FIREPLANE_PORT_ID fatal error AID field 279 hardware 253 MID field 279, 289 processor behavior 215 MR (module revision) field 279, 284 reporting 253 MT (module type) field 279, 284 when reported 186 register accessing 278 FDIV instructions 100 Fireplane_slow signal 277 FDIVd instruction 99, 100 FiTOs instruction 9 FDIVs instruction 99, 100 floating point FdTOi instruction 9, 100 deferred-trap queue (FQ) 27, 103 FdTOs instruction 99, 100 divide/square root 49, 50 fdtos instruction 28 exception handling 97, 102, 105, 117 FdTOx instruction 9, 100 grouping rules 47–50 FERR, See fatal error interrupt handling 233 FEXPAND instruction, pixel formatting 92 latencies 45 FFA, See FGA pipeline mem_address_not_aligned 122 FGA pipeline 4, 19, 38, 40,
42, 45, 46, 51, 52, 83, 232, NaN operands 100 233, 272 nonstandard mode 98 FGM pipeline 4, 19, 38, 40, 42, 45, 46, 52, 232, 233, operation statistics 272 272 pipelines 45 Fireplane register file access 41 ASI extensions 278 short load instructions 123 DTL signals 282 short store instructions 123 power management square root 97, 118 exiting low power mode 278 store instructions 50 power-down mode 277 subnormal value generation 27 reset values 283 trap types 120 Fireplane Address Register floating-point

queue, See deferred-trap queue (FQ) FPSUB32S instruction 91 Floating-Point Registers State, See FPRS FQ, See floating-point deferred-trap queue (FQ) Floating-point Status Register, See FSR FSMULd instruction 100 floating-point trap type (ftt) field of FSR register 28 FsMULd instruction 100 floating-point trap types FSQRT instructions 99, 100 unfinished_FPop 83 FSR FLUSH instruction aexc field 98, 119, 120 after internal store 74 cexc field 98, 119, 120 after stores to ASIs 154 fcc field in SPARC V8 120 differences from UltraSPARC I 9 fcc0 field 120 grouping rule 49 fcc1 field 119 I-cache 11, 160 fcc2 field 119 memory ordering control 68 fcc3 field 119 self-modifying code 104, 121 ftt field 28, 120 flushing nonstandard floating-point operation 28 address aliasing 162 NS field 27, 120 data cache 11, 202, 216, 221 = 0 99 differences from UltraSPARC I 11 = 1 27, 98, 99 displacement 163, 192 =0 83 external cache 11, 161, 221 =1 83 instruction cache 11 qne field 120 TLB 12 RD field 9, 120 write cache 11, 161 state after reset 181 FLUSHW instruction, grouping rule 49 TEM field 119, 120 FMD, See Error Enable Register FMD field ver field 120 FMECC, See Error Enable Register FMECC field FsTOd instruction 99, 100 FMOVcc instructions FsTOi instruction 9, 100 grouping rules 51 FsTOx instruction 9, 100 FMUL instructions 100 FSUB instruction 83 FMULd instruction 99, 100 FSUB instructions 99, 100 FMULs instruction 99, 100 fsub of numbers with the same signs 28 fp_disabled exception 33, 232, 233, 326 ftt field of FSR register 28 fp_exception_ieee_754 exception 33, 99 FxTOd instruction 9 fp_exception_other exception 28, 33, 83, 97, 98, 100, FxTOs instruction 9 102, 113, 118 FP0, See FGM pipeline FP1, See FGA pipeline G FPACK instructions 92 global registers FPACK, performance usage 92 interrupt 29 FPADD16S instruction 91 MMU 132 FPADD32S instruction 91 trap 29 FPMERGE
instruction, back-to-back execution 92 global visibility 67 FPops, reserved 113 graphics FPRS register 8-bit data format 25 DL (dirty lower) bit 119 instructions using floating-point register file 326 DU (dirty upper) bit 119 treatment as floating-point instruction 326 FEF field 33, 326 Graphics Status Register (GSR), See GSR state after reset 181 grouping rules 43–50 FPSUB16S instruction 91 BMASK and BSHUFFLE 83

SIAM instruction 94 IC_addr field 328 GSR IC_tag field 328 byte mask instruction 8 IC_utag field 328 mask, setting before BSHUFFLE 83 IC_vpred field 329 new fields 15 IC_way field 328 reading or updating 326 identification code scale factor for packing 92 allocation of 287 SIAM instruction sets IM and RND 94 printed on package 291 state after reset 182 IEEE Std 754-1985 97, 119 write instruction latency 50 IEEE Std. 1149.1 288 ieee_754_exception exception 120 IIU H branch prediction statistics 269 hard power-on reset 178 stall counts 269 hardware illegal address aliasing 162 fatal errors 253, 255 illegal_instruction exception 61, 93, 105, 113, 121, interlocking mechanism 87 124, 157 interrupts 124 illegally aliased page 163 timeouts 200 ILLTRAP opcode 113 TLB 138 I-MMU hardware-corrected errors 186 bypassing E-cache 161 HBM 7 disabled 73, 133 helper and instruction prefetching 73 cycle 47 TLB replacement policy 138 execution order 47 virtual address translation 129 generation 46 implementation dependency 101 in pipelines 46 inexact trap 117 hierarchical bus mode (HBM) 7 instruction breakpoint 14, 36 breakpoint, trap priorities 61 I bypass 45 I pipeline stage 40 conditional branch 41 I/D Translation Storage Buffer Register dependency check 44 differences from UltraSPARC I 12 dispatching properties 52 I/O Edge 8 access 72, 74 Edgencc 8 identifying locations 106 execution order 44 memory 65 execution rates 268 memory-mapped 66 explicit synchronization 81 noncacheable address 70 fetcher 200, 229 IC_miss 271 grouping rules 43–50 IC_miss_cancelled 271 issue unit (IIU) 3, 4 I-cache, See instruction cache latency 44, 46, 52 ICACHE_INSTR multicycle, blocking 44 IC_addr field 327, 330 number completed 268 IC_instr field 327 prefetch 73, 178 IC_way field 327, 330 quad-precision, floating point 102, 118
UC_addr field 330 SIAM 9 ICACHE_TAG single-group, See single-instruction group

window-saving 47 DPCTL 254 with helpers 49 ECU 257 writing integer register 45 PENDQ and QCTL 256 instruction cache TOB 257 access statistics 271 write cache 253 bypassing 160 internal_processor_error exception 103 consistency 11, 330 interrupt corrupt data 221 data description 160 interrupt_trap inhibition 197 diagnostic access 326 uncorrectable error 201 disabled in RED_state 177, 191 differences from UltraSPARC I 13 effect of mode change 160 on floating-point instructions 33 Enable bit 35 global registers 29 flushing 11 hardware 124 instruction access address 327 packet 124 instruction fields 327 processing 232 microtags 328, 329 uncorrected interrupt vector data 197 miss 187 vector fetch 197, 198, 201 miss counts 271 Interrupt Vector Dispatch Register 175 physical indexed, physical tagged (PIPT) 160 Interrupt Vector Dispatch Status Register 175 prefetch miss 200 Interrupt Vector Receive Register 175 reference counts 271 interrupt_level_n exception 229 snoop tag field 329 interrupt_vector exception 33, 197, 198, 201, 225, 226, snoop tag/valid field access data 330 229 tag/valid field interval arithmetic access address format 327 SIAM instruction 94 access data 328, 329 support 9 illustration 327 INTR_DISPATCH register 182 instruction queue, state after reset 183 INTR_RECEIVE register 182 instruction TLB 12 invalidation Instruction Trap Register 36 data cache 190, 192 instruction_access_error exception 29, 178, 190, 192, invalidation of data cache 335 193, 195, 197, 198, 200, 201, 204, 215, 217, 225, 229 I-SFSR register, See SFSR instruction_access_exception exception 133 I-TLB INSTRUCTION_TRAP register 182 instruction accesses 12 instructions per cycle (IPC) 268 state after reset 183 integer Tag Access Register, See Tag Access Register divide 116 IU Deferred-Trap Queue register 27 execution unit (IEU) 3, 5 IVC error, See system bus ECC error IVC
multiplication/division 116 IVU error, See system bus ECC error IVU register file 102, 116 integer register file access 40 interleave J factors for bank sizes 311–313 J pipeline stage 40, 41 memory 310 JEDEC code 288, 289 specifying factor for 316 JMPL instruction 29, 42, 133, 178 internal ASI 153 JTAG internal errors accessing EESR 253 definition 252 accessing ESR 258

Identification register (JTAGID) 288 Marketing part # 291 scan chain, EEMR 257 MAXTL 177 MCU 178 MCU (Memory Control Unit) 6 L Mem_Addr_CTL register 183, 303 latency Mem_Addr_Dec register 183 BMASK and BSHUFFLE 83 mem_address_not_aligned exception 110, 133, 157, floating-point operations 45 279, 283, 325 FPADD instruction 91 Mem_Timing_CSR register 183 partitioned multiply 92 Mem_Timing1_CTL register LDD instruction 89, 122 bit replacement 313 LDDF instruction 62, 241 fields LDDF_mem_address_not_aligned exception 122 act_rd_dly 299 LDDFA instruction 62, 154, 241, 325 act_wr_dly 299 block load 79 auto_rfr_cycle 298 LDF instruction 62, 241 bank_present 300 LDFA instruction 62, 241 pc_cycle 299 LDFSR instruction 49, 89 prechg_all 301 LDQF instruction 122 rd_more_ras_pw 299 LDQFA instruction 122 rfr_enable field 300 LDSB instruction 89 rfr_int 300 LDSH instruction 89 sdram_clk_dly 298 LDSTUB instruction 70 sdram_ctl_dly 298 LDSW instruction 89 set_mode_reg 300 LDXA instruction 154, 325 wr_more_ras_pw 299 LDXFSR instruction 89 Mem_Timing2_CTL register level 1 cache, flushing 162 bit replacement 313 level 2 cache 161 fields little-endian auto_prechg_enbl 304 byte ordering 153 illustration 303 ordering in partial store instructions 91 rd_msel_dly 303 load instructions, getting data from store queue 89 rd_wr_pi_more_dly 304 load recirculation 90 rd_wr_ti_dly 304 logical operations 85 rdwr_rd_pi_more_dly 304 lot number 290 rdwr_rd_ti_dly 304 low power mode 124, 277 sdram_mode_reg_data 304 LRU, TLB entry replacement 138 wr_msel_dly 303 wr_wr_pi_more_dly 304 wr_wr_ti_dly 304 M wrdata_thld 303 M pipeline stage 42 Mem_Timing3_CTL register M_SYND field of AFSR, See AFSR M_SYND field field settings 307 machine state illustration 306 after reset 180 rfr_int field 306 in RED_state 180 Mem_Timing4_CTL register
MAPPED fields response 196, 215 illustration 308 response from interrupt transmission 202 settings 308 status 199, 200 refresh rate setting 306

with Mem_Timing3_CTL register 306 rules for interlock implementation 86 MEMBAR UltraSPARC III specifics 86 #LoadLoad 16, 64, 67 memory #LoadStore 16, 67 access statistics 271 #LoadStore and block store 81 bank, access counts 273 #Lookaside 64 cached 65 #MemIssue 64, 87 controller 293 #StoreLoad 64 current model, indication 65 and BLD 80 errors and BST 81 hardware corrected 191 changes from UltraSPARC II 16 interrupts 201 for strong ordering 87 MTag 196 #StoreStore 106, 121 prefetch 200–?? and BST 81 software corrected 192 changes from UltraSPARC II 16 system bus accesses 196 code example 67 types 191 #Sync global visibility of memory accesses 67 after BST 80 initialization 321 after internal ASI store 74 interleave 310 after stores to error ASI registers 202 location 65 after stores/prefetches to ASIs 154 minimum/maximum bank segment size 311 after STXA 326 minimum/maximum configuration 296 before ASI store that changes PSTATE.PRIV models 205 and block operations 81 before/after load/store ordering and block store 81 ASI_DCACHE_DATA 332 partial store order (PSO) 63, 81 ASI_DCACHE_UTAG 334 relaxed memory order (RMO) 81 ASI_WCACHE_DATA 343 strongly ordered 74, 87 ASI_WCACHE_SNOOP_TAG 344 supported 122 ASI_WCACHE_TAG 344 total store order (TSO) 63, 81 ASI_WCACHE_VALID_BITS 341, 342 types 63 changes from UltraSPARC II 16 noncacheable, scratch 161 E-cache flushing 11 ordering 67 error isolation 233 sequence before access 321 error isolation for deferred trap 188 subsystem, differences from UltraSPARC I 10 semantics 68 synchronization 68 for strong ordering 87 Memory Address Control Register to ensure complete flush 163 addr_le_pw 314 to handle deferred traps 234 bank select/row address fields 315 with PREFETCHA instruction 17 cmd_pw field 314 within trap handler 189 half_mode_dram_clk_dly field 315 instruction 86 illustration 314 definition 64 interleave
fields 316 explicit synchronization 67 Memory Address Decoding Register grouping rules 49 decoding logic 310 memory ordering 68 illustration 309 side-effect accesses 73 LK (lower mask) field 309 single group 49 LM (lower match) field 309 QUAD_LDD requirement 89 UK (upper mask) field 309

UM (upper match) field 309 ECC syndrome 228 valid field 309 errors (QCTL) 256 Memory Control Unit (MCU) 6 single-bit error 212 memory controller multiplication algorithm 116 internal errors 253 mutual exclusion, atomic instructions 69 programmable registers 297–320 statistics, counters 273 writing to registers 297 N Memory Management Unit, See MMU NaN operands 100 memory-mapped I/O 66 NCEEN field, See Error Enable Register NCEEN merge buffer 75, 195 field microtags NCPQ 7 data cache 334 noncacheable instruction cache 328, 329 accesses 66 mispredict queue (MQ) 36 I/O address 70 mispredict signal 41 instruction prefetch 73, 178 miss in D-cache 42 store compression 75 MMU 129–138 store compression, differences from UltraSPARC alternate global registers 132 I 15 conformity 129 store merging enable 34 control fields of DCUCR 34 noncoherent pending queue (NCPQ) 7 disable 133 nonfaulting disabled 106, 121 ASIs and atomic accesses 70 global registers 29 load I/D TLB Diagnostic Register 134 and TLB miss 71 I/D TSB Extension Registers 136 behavior 71 I/D TSB Registers 135 use by optimizer 71 TLB CAM Diagnostic Registers 137 when MMU disabled 122 TLB Diagnostic Access Address 137 nonstandard floating point 98 mode nonstandard floating-point operation 27 low power 277 nontranslating ASI 153 SSM 281 nPC register 180 standby 277 NS field of FSR 27 MOVcc instructions NS field, See FSR NS field grouping rules 51 NWINDOWS value MOVR instructions integer register file 102, 116 grouping rules 51 windows of VER 117 MS pipeline and data_access_error 230 description 40 O E-stage bypass 44 OBS, See DCR initiating trap processing 229 obsdata 31, 32 instruction requirements 8 observability bus integer instruction execution 41 configuration changes 31 watchpoint comparison 35 DCR control requirements 32 and W-stage 42
mapping for bits 11:6 31 MTag opcodes ECC error 198, 226, 234 reserved fields in 113 ECC error for interrupt vector fetches 201 unimplemented 113

OpenBoot PROM programming sequence 322 EC_wb 271 operand EC_write_hit_clean 271 NaN 100 IC_ref 271 subnormal 98 PRIV field 265 ordering ST field 265, 266, 268 block load 80 state after reset 182 block store 81 unused bits 30, 107 ORQ 7 UT field 265, 266, 268 OTHERWIN register 181 PDIST, instruction latency 46 outgoing request queue (ORQ) 7 pending tag array (PTA) 7 overflow exception 97, 118 pending trap, See trap pending overflow trap 117 Performance Control Register, See PCR overwrite policies performance hints AFAR overwrite 217 faligndata usage 78 data ECC syndrome 218 FPACK usage 92 MTag ECC syndrome 218 FPADD usage 91 logical operate instructions 85 partitioned multiply usage 92 P Performance Instrumentation Counter Register, See P pipeline stage 39 PIC PA Data Watchpoint Register 35 physical address translation 42 PA_WATCHPOINT register 182 physically indexed, physically tagged, See PIPT PACK instructions 92 cache packing instructions, See FPACK instructions PIC register part # access 265 marketing 291 event logging 266 Sun internal 291 events 268 partial store instruction 46, 123 PICL 30, 273 partial store instructions 157 PICU 30 partial store order, See PSO memory model SL selection bit field encoding 273 partitioned add/subtract instructions 91 state after reset 182 partitioned multiply instructions 92 SU selection bit field encoding 275 PC register 180, 187 PIC0, See PIC register PICL field PC, Instr_cnt 268 PIC1, See PIC register PICU field PCR PIL register 181, 190, 231, 232 access 265 pipeline extension 15 See also processor pipeline function A0 40, 41, 229, 230 Cycle_cnt 268 A1 40, 229 DC_hit 271 accepting interrupts 232 DC_ref 271 Ax 35, 111 Dispatch0_2nd_br 269 BR 40, 229, 230 Dispatch0_br_target 269 conditional moves 51 Dispatch0_IC_miss 269 dependencies 41 Dispatch0_mispred 269 FFA, See FGA pipeline EC_hit 272 FGA 4, 19, 38, 40, 42,
45, 46, 51, 52, 83, 232, 233, EC_ref 271 272 EC_snoop_inv 271 FGM 4, 19, 38, 40, 42, 45, 46, 52, 232, 233, 272 EC_snoop_wb 272 FP0, See FGM pipeline

FP1, See FGA pipeline prefetch MS 8, 35, 40, 41, 42, 229, 230 data errors 234 pending interrupt 233 instruction, noncacheable 178 stages instructions 73 A 39, 41 PREFETCH instruction B 39 descriptions 71 C 42, 43 external cache allocation 161 D 43, 269 memory barriers 234 E 41, 43 variants 93 F 39 PREFETCHA instruction 93, 105, 121 I 40 privileged registers 48 J 40, 41 privileged_action exception 65, 70, 265 M 42 privileged_opcode exception 30, 61, 115, 265 mnemonics 37 processor pipeline P 39 See also pipeline stages R 36, 40, 43, 270 address stage 39 S 40, 41 branch target computation stage 39 T 43 cache stage 42 W 42 done stage 43 stalls, causes 269 execute stage 41 trap processing initiation 229 fetch stage 39 PIPT cache instruction issue 40 caches so organized 160 miss stage 42 E-cache 11 predictor address generation 39 I-cache 10, 11, 160 register stage 40 level 2 caches 161 trap stage 43 W-cache 10 PSO memory model 63, 67, 68, 73 pixel formatting 92 PSTATE pixel orderings 25 AM field 29 POK pin 178 IE field 33, 190, 206, 231, 232, 233 population count (POPC) instruction 115 IG field 29 power-down mode 277 MG field 29 power-on reset (POR) MM field 65, 106 hard reset when POK pin activated 178 PEF field 232, 326 and I-cache microtags 329 PRIV field 65, 70, 188, 197, 201, 205, 216 soft reset 282 stores after trap 233 software 32 RED field system reset when Reset pin activated 179 clearing DCUCR 106, 177 TICK register fields 115 exiting RED_state 73, 178 precise trap explicitly set 177 E-cache error 188 register 29 error barriers 234 state after reset 180 handling 103, 114 WRPR instruction and BST 80 occurrence 187 PTA 7 processing 230 when taken 228 prediction Q branch 331 quad load instruction 89 pessimistic for divide and square root 97, 118 quad loads/stores 119

quad-precision floating point instructions 102, 118 CLEANWIN 116 queue Data Cache Unit Control 34 instruction, state after reset 183 EMU Error Mask 253 SPARC V9, unimplemented registers 27 EMU Error Status 253 store, state after reset 183 External Cache Error Enable 190 Fireplane Configuration 279 Floating Point Registers State (FPRS) 119 R Floating-Point Status (FSR) 27 R pipeline stage 36, 40, 43 Floating-point Status Register (FSR) 119 R-A-W global trap 29 Bypass Enable bit in DCUCR 34 Instruction Trap 36 bypassing algorithm 89 integer file 116 bypassing data from store queue 34 IU Control 31 detection algorithm 90 Memory Address Control 313, 314 RD field of FSR register 9 Memory Address Decoding 309 RDASR instruction 265 Performance Instrumentation Counter (PIC) 30 diagnostic control/data 325 programmable types 297 dispatching 48, 104 PSTATE 29 forcing bubbles before 48, 104 Refresh Control 298 RDCCR instruction 53 SOFTINT 14 RDPCR instruction 30 STICK 15 RDPR instruction 29, 289 values after reset 180 dispatching 48 Version 116 forcing bubbles before 48 Window Control 116 RDTICK instruction 115 relaxed memory order (RMO), See RMO memory Re_DC_miss counter 270 model Re_EC_miss counter 270 reserved Re_endian_miss counter 270 fields in opcodes 113 Re_FPU_bypass counter 270 instructions 113 Re_RAW_miss counter 270 reset real memory 65 Fireplane values 283 recirculation instrumentation 270 PSTATE.RED 106, 177 RED_state register values after reset 180 default memory model 65 Software-Initiated Reset (SIR) 178 effect of entering 234 system 179 entering 191 watchdog reset 177 exiting 29, 73 Reset pin 179 Fireplane Interconnect 283 RESTORED instruction, single group 49 instruction cache bypassing 160 restricted ASI 65 issuing trap when TL = MAXTL 234 RETRY instruction 33 MMU behavior 133 after fp_disabled trap 233 physical address 106, 113 after internal
store to ASI 74 trap vector 106, 179 after stores to ASIs 154 trap vector address (RSTVaddr) 106, 113 and BST 80 refresh rate, formulas, for setting 307 error barrier for deferred trap 188 register error isolation 233 access exiting RED_state 29, 178 floating-point 41 flushing pipeline 160 integer 40 grouping rules 49

reexecuting instruction that caused error 192 grouping rules 50 use with IFPOE 33 interval arithmetic support 9 when TSTATE uninitialized 179 rounding 94 RETURN instruction 42, 133 setting GSR fields 94 Rfr_CSR register 183 side effect RMO memory model 63, 67, 68, 73, 81 accesses 66, 72 R-stage stall counts 270 attribute 122 Rstall_FP_use counter 270 and block load 80 Rstall_IU_use counter 270 instruction placement 73 Rstall_storeQ counter 270 instruction prefetching 73 RSTVaddr 106, 113, 179 visible 66 RTO operation 233 SIG, See single-instruction group RTS transaction 72 SIGM instruction 178 signalling ECC 227 single-bit ECC error 190 S single-instruction group 44, 47, 48, 49, 53 S pipeline stage 40, 41 single-issue mode, See DCUCR SI (single-issue Safari, See Fireplane entries disable) field SAVE instruction, clean_window exception 105, 116 SIR instruction SAVED instruction, single group 49 grouping rule 49 Scalable Shared Memory mode, See SSM mode SIGM support 106, 114 scrubbing SIU 250, 271 external cache 223, 227 snoop tags write cache 192, 195 data cache 334 SDRAM, initialization sequence 321 instruction cache 329 Second Load Steering Enable bit 34 snooping security instruction cache 160 clean windows 116 result errors 256 enhanced security environment 115 snoop counts 272 self-modifying code 121 store buffer 64 Serial ID register 288, 290 SOFTINT register 14, 181 Set Interval Arithmetic Mode (SIAM) instruction 9 software correctable errors 190 SETCC instruction, grouping 47 software statistics, counters 272 SFSR software-correctable errors 186 See also D-SFSR and I-SFSR Software-Initiated Reset (SIR) 49, 178 extensions 12 SPARC V9 compliance 102, 113 FT field 12 speculative data fetches 201 extension: I/D TLB miss 12 speculative load 66, 122, 188 FT = 10 71 SRAM FT = 2 66, 71, 73, 122
changes 14 FT = 4 70 new diagnostic registers 14 FT = 8 70, 71, 325 SSM mode NF field 12 checking for configuration 198 state after reset 182 enabling for Fireplane transactions 281 short floating-point load instruction 89 globally visible memory access 67 SHUTDOWN instruction MTag errors 256 differences from UltraSPARC I 9 RTSR system bus operation 234 execution as NOP 94 SSM system 234 SIAM instruction SSM, See Fireplane Configuration Register, SSM

field IVU 226 stable storage 163 simultaneous with E-cache error 218 stalls UE 224, 225 counted 269 hardware timeouts 200 pipeline 269 hardware-corrected errors 196 R Stage counts 270 instruction fetch error 200 standby mode 277 MTag error 198 STBAR instruction 68 problem with several stores on queue 234 STD instruction 122 protocol error 205 STDF_mem_address_not_aligned exception 122 protocol errors STDFA instruction 154, 325 data (DPCTL) 254 block store 79 transaction (QCTL) 255 STICK register 15, 181 termination code DSTAT 199 STICK_COMPARE register 182 timeout error 199 store uncorrectable data ECC error on read 197 buffer system interface merging 72, 195 logic error 205 snooping 64 registers 14 compression 66, 74 statistics, counters 272 instructions, giving data to a load 89 system interface unit (SIU) instructions 42 noncacheable, coalescing 75 system power dissipation 277 queue R-stage stall count 270 state after reset 183 T STQF instruction 105, 122 T pipeline stage 43 STQFA instruction 105, 122 Tag Access Register 132, 134 Strong Sequential Order 64 TAP controller 288 strongly ordered memory model 74, 87 tape-out 288 STXA instruction 154, 325 TBA register 180 STXA instruction, caution 326 Tcc instruction, reserved field 113 subnormal operand 98 TDI 288 Sun internal part # 291 Ticc instruction 10 SW_count_0 272 TICK register SW_count_1 272 counter field 115 SWAP instruction 69 format 115 Synchronous Fault Status Register, See SFSR implementation 115 system bus NPT field 115 access error, exception to AFAR overwrite policy state after reset 181 218 TICK_COMPARE register 181 accesses causing errors 196 timeout (TO) error BERR as result of read operation 198 copyout events 199 copyout events 199 handling 199 DSTAT 226 TL register 181, 230 ECC error TL, See trap level CE 224 TLB detection 196, 226 and 3-dimensional arrays 78 EMU 225 CAM Diagnostic
Register 134, 137 injection 227 data access 42 IVC 226 Data Access register 12, 134

Data In register 132, 134 pending 229–232 Diagnostic Register 13 precise trap handling 103, 114 differences from UltraSPARC I 12 privileged_opcode 115 D-TLB state after reset 183 processing 229–232 flushing 12 processing, when initiated 229 hardware 138 STDF_mem_address_not_aligned 105 I-TLB state after reset 183 when taken 229 miss and nonfaulting load 71 trap globals 29 miss counts 271 trap handler miss handler 131, 132 deferred trap procedure 189 miss processing 129 ECC errors 62 miss/refill sequence 132 traps missing entry 131 deferred traps 61, 114 MMU implementation 121 disrupting traps 61, 114 replacement policy 138 TSB TNPC register 181, 187, 188 Extension Registers total store order, See TSO memory model new additions 12 TPC register 181, 187, 188 TSB_Hash field 136 TPC/TnPC behavior on resets 114 I/D Translation Storage Buffer Register 135 Translation miss handler 132 Lookaside Buffer, See TLB MMU implementation 121 Storage Buffer (TSB) 135, 136 pointer logic 140 Storage Buffer, See TSB pointer logic hardware 140 Table Entry (TTE) 130 SB_Size field 132 Table Entry, See TTE shared 132 trap split 133 atomic accesses 70 Tag Target Register 133 atomic instructions 70 TSO memory model 63, 65, 66, 68, 73 clean_window 105, 116 TSTATE register data_access_exception 113, 122 initializing 179 deferred traps 103 PEF field 33 disabled floating-point exceptions 120 state after reset 181 fp_disabled 33 TT register 181 fp_exception_ieee_754 33 TTE fp_exception_other 28, 33, 98, 113 configuration 130 generation for prefetched instruction 200 CP (cacheability) field 66, 69, 131 illegal_instruction 113 CV (cacheability) field 66, 69, 131 instruction_access_error 29 E field 64, 66, 71, 73, 122 interrupt_vector 197 entry, locking in TSB 131 LDDF_mem_address_not_aligned 105, 122 format 12 level L (lock)
field 131 TL = 1 232 NFO field 71 TL = 2 232 PA (physical page number) field 130 TL = 3 232 SPARC V8 equivalent 130 TL = MAXTL 177, 234 Tag Target 131 TL = MAXTL - 1 177 level (MAXTL) 105, 113 noncacheable accesses 66 U overflow/underflow/inexact 117 UART 66

UCC, See AFSR UCC field watchdog_reset (WDR) 113, 177, 235 UCEEN field, See Error Enable Register UCEEN watchpoint field control fields of DCUCR 35 UCU, See AFSR UCU field data 35, 111 UE, See AFSR UE field and RED_state 177 UltraSPARC I 7, 86 using reliably 35, 111 UltraSPARC II 7, 86 WC_miss 271 uncorrectable error 186, 197 WC_scrubbed 271 underflow exception 97, 118 WC_snoop_cb 271 underflow trap 117 WC_wb_wo_read 271 unfinished_FPop exception 9, 10, 28, 97, 98, 100, 118 W-cache, See write cache unimplemented opcodes 113 WDC E-cache error 223 unimplemented_FPop exception 102, 118 WDU E-cache ECC error 223 user code issuing store instructions 233 window changing 48 Working Register File (WRF) 48 WRASR instruction V accessing PCR/PIC 265 VA_WATCHPOINT register 182 diagnostic control/data 325 value clipping, See FPACK instruction forcing bubbles after 48, 104 VER grouping rule 48 impl field 117, 289 WRF (Working Register File) 48 manuf field 117 WRGSR instruction 50 mask 290 write cache mask field 117 access statistics 271 maxtl field 117 characteristics 10 maxwin field 117 data eviction 222 register 181, 289 description 161 VER register 29 diagnostic accesses 155, 340 VIPT cache Diagnostic Bank Valid Bits Register 341 behavior 159 Diagnostic Valid Bits Register 340 D-cache 10, 159 Enable bit 35 side effect 159 eviction 238 virtual address 65 flushing 11, 161 virtual address 0 71 internal errors 253 virtual caching 163 merge mechanism 222 virtual color 162 miss counts 271 virtually indexed, physically tagged, See VIPT cache and RED_state 191 virtual-to-physical address translation 65 scrubbing 192, 195 VIS extensions scrubbing data to E-cache 195 byte mask 8 writeback operation 223 byte shuffle 8 WRPCR instruction 30 differences from UltraSPARC I 8 WRPR instruction edge variants 8 forcing bubbles after 48 VIS instruction execution 42 grouping rule 48 to
PSTATE and BST 80 WSTATE register 181 W W pipeline stage 42 wafer number 290

X XIR, See Externally Initiated Reset (XIR)

Y Y register 180

Z zero virtual address 71
