Index

Note: Online information is listed by print page number and a period followed by “e” with online page number (54.e1). Page references preceded by a single letter with a hyphen refer to appendices. Page references followed by “f,” “t,” and “b” refer to figures, tables, and boxes, respectively.

0-9, and symbols ID (ASID), 436 VAX floating-point formats, D-29 inadequate, 497.e5–497.e6 ALU control, 249–251. See also 1-bit ALU, A-26–A-29. See also shared, 507–508 Arithmetic logic unit (ALU) Arithmetic logic unit (ALU) single physical, 507, 507–508 bits, 250–251, 250f adder, A-27f virtual, 436 logic, C-6–C-7 CarryOut, A-28 Address translation mapping to gates, C-4–C-7 for most significant bit, A-33f for ARM Cortex-A53, 458f truth tables, C-5f, C-5f illustrated, A-29f defined, 418–419 ALU control block, 253 logical unit for AND/OR, A-27f fast, 428–430 defined, C-4–C-6 performing AND, OR, and addition, for Intel Core i7, 458f generating ALU control bits, C-6f A-31, A-33f TLB for, 428–430 ALUOp, 250, C-6b–C-7b 64-bit ALU, A-29–A-31.
See also Address-control lines, C-26f bits, 250, 251 Arithmetic logic unit (ALU) Addresses control signal, 253 from 63 copies of 1-bit ALU, A-34f base, 69 Amazon Web Services (AWS), 415b with 64 1-bit ALUs, A-30f byte, 70 AMD Opteron X4 (Barcelona), 533, 534f defining in Verilog, A-36–A-37 defined, 68 AMD64, 148, 148, 215, 173.e5 illustrated, A-35f memory, 78b Amdahl’s law, 391, 493–494 ripple carry adder, A-29 virtual, 418–419, 438, 439b corollary, 49 7090/7094 hardware, 248.e6 Addressing defined, 49 base, 118f fallacy, 546 A in branches, 115–117 and (and), 64f displacement, 118 AND gates, A-12–A-13, C-7 Absolute references, 127 immediate, 118f AND operation, 90, A-6 Abstractions PC-relative, 115–116, 118f andi (and immediate), 64f hardware/software interface, 22 register, 118f Annual failure rate (AFR), 408–409 principle, 22 RISC-V modes, 117–118 versus MTTF of disks, 408b–409b to simplify design, 11 x86 modes, 151 Antidependence, 325 Accumulator architectures, 173.e1–173.e2 Addressing modes Antifuse, A-77 Acronyms, 9 desktop architectures, D-5–D-6 Apple computer, 54.e6 Active matrix, 18 Advanced Vector Extensions (AVX), 216, Apple iPad 2 A1395, 20f add (add), 64f 217 logic board of, 20f addi (add immediate), 64f, 72, 84 AGP, B-9–B-10 processor integrated circuit of, 21f Addition, 172–175.
See also Arithmetic Algol-60, 173.e6 Application binary interface (ABI), 22 binary, 172b–173b Aliasing, 434 Application programming interfaces floating-point, 196–199, 204 Alignment restriction, 70 (APIs) operands, 173, 173 All-pairs N-body algorithm, B-65 defined, B-4 significands, 195b–196b Alpha architecture graphics, B-14 speed, 175b bit count instructions, D-29 Architectural registers, 335–336 Address interleaving, 370–371 floating-point instructions, D-28–D-29 Arithmetic, 170 Address select logic, C-24, C-25f instructions, D-27–D-29 addition, 172–175 Address space, 418, 421b no divide, D-28 addition and subtraction, 172–175 extending, 467b PAL code, D-28 division, 181–189 flat, 467 unaligned load-store, D-28 fallacies and pitfalls, 220–223

Arithmetic (Continued) microcode, C-30 Biased notation, 81, 193 floating-point, 189–214 number acceptance, 126 Binary, 82 historical perspective, 225 object file, 126 ASCII versus, 109b multiplication, 175–181 Assembly language, 15f conversion to decimal numbers, 77b parallelism and, 214–215 defined, 14, 125 defined, 74 Streaming SIMD Extensions and floating-point, 205f Bisection bandwidth, 525 advanced vector extensions in illustrated, 15f Bit maps x86, 215–216 programs, 125 defined, 18 subtraction, 172–175 RISC-V, 64f, 85b–86b goal, 18 subword parallelism, 214–215 translating into machine language, storing, 18 subword parallelism and matrix 85b–86b Bit-Interleaved Parity (RAID 3), 481.e4 multiply, 216–220 Asserted signals, 240, A-4 Bits Arithmetic instructions. See also Associativity ALUOp, 250, 251 Instructions in caches, 395b–396b defined, 14 desktop RISC, D-11f, D-11f degree, increasing, 394–396, 442 dirty, 428b embedded RISC, D-13f increasing, 399–400 guard, 212 logical, 241–242 set, tag size versus, 399b–400b patterns, 212b–213b operands, 67–74 Atomic compare and swap, 123b reference, 426b Arithmetic intensity, 531–532 Atomic exchange, 122 rounding, 212 Arithmetic logic unit (ALU). See also Atomic fetch-and-increment, 123b sign, 75 ALU control; Control units Atomic memory operation, B-21 state, C-8–C-10 1-bit, A-26–A-29 Attribute interpolation, B-43–B-44 sticky, 212 64-bit, A-29–A-31 auipc’s effect, 156 valid, 374–376 before forwarding, 297f Automobiles, computer application in, 4 Blocking assignment, A-24 branch datapath, 244–245 Average memory access time (AMAT), Blocking factor, 404 hardware, 174 392 Block-Interleaved Parity (RAID 4), 481.
memory-reference instruction calculating, 392b e4–481.e5 use, 235 Blocks for register values, 242 B combinational, A-4–A-5 R-format operations, 243f defined, 365–366 signed-immediate input, 300 Bandwidth, 29–30 finding, 442–443 ARM Cortex-A53, 234, 332–340 bisection, 525 flexible placement, 392–396 address translation for, 458f external to DRAM, 388 least recently used (LRU), 399 caches in, 459f memory, 388 locating in cache, 397–399 data cache miss rates for, 460f network, 523–524 miss rate and, 381f memory hierarchies of, 457 Barrier synchronization, B-18 multiword, mapping addresses to, performance of, 460–462 defined, B-20 380b–381b specification, 333f for thread communication, B-34 placement locations, 441 TLB hardware for, 458f Base addressing, 69, 118 placement strategies, 394 ARPAnet, 54.e9 Base registers, 69 replacement selection, 399 Arrays, 405f Basic block, 95b replacement strategies, 444 logic elements, A-18–A-20 Benchmarks, 528–538 spatial locality exploitation, 381 multiple dimension, 210 defined, 46 state, A-4–A-5 pointers versus, 141–144 Linpack, 528, 248.e2–248dir.e3, valid data, 374–376 procedures for setting to zero, 141f 248.e3 Bonding, 28 ASCII multiprocessor, 528–538 Boolean algebra, A-6–A-7 binary numbers versus, 109b NAS parallel, 530 Bounds check shortcut, 96 character representation, 108f parallel, 529f Branch datapath defined, 108–109 PARSEC suite, 530 ALU, 244–245 symbols, 111 SPEC CPU, 46–48 operations, 244–245 Assemblers, 125–127 SPEC power, 48–49 Branch if Equal (beq), A-32 defined, 14 SPECrate, 528 Branch if greater than or equal, unsigned function, 125–127 Stream, 538b (bgeu), 95–96

Branch if less than (blt) instruction, compiling assignment with registers, set-associative cache, 395 95–96 67b–68b steps, 383 Branch if less than, unsigned (bltu), compiling while loops in, 94b–95b in write-through cache, 383 95–96 sort algorithms, 141f Cache performance, 388–408 Branch instructions translation hierarchy, 124f calculating, 390b–391b pipeline impact, 306f translation to RISC-V assembly hit time and, 391–392 Branch not taken language, 65 impact on processor performance, assumption, 305–306 variables, 104b 390–391 defined, 244 C.mmp, 577.e3–577.e4 Cache-aware instructions, 470 Branch prediction C++ language, 173.e7, 150.e26 Caches, 373–388. See also Blocks buffers, 308 Cache blocking and matrix multiply, accessing, 376–382 as control hazard solution, 272 463–466 in ARM Cortex-A53, 459f defined, 271–272 Cache coherence, 452–456 associativity in, 395b–396b dynamic, 272, 308–312 coherence, 452 bits in, 380b static, 322 consistency, 452 bits needed for, 380 Branch predictors enforcement schemes, 454 contents illustration, 377f accuracy, 310 implementation techniques, 482. defined, 19–22, 373–374 correlation, 310–311 e10–482.e11 direct-mapped, 374, 375f, 380, 392 information from, 310–311 migration, 454 empty, 376 tournament, 311–312 problem, 452, 453f, 456b FSM for controlling, 447–452 Branch table, 97–98 protocol example, 482.e11–482.e15 fully associative, 393 Branch taken protocols, 454 GPU, B-38 cost reduction, 306–307 replication, 454 inconsistent, 383 defined, 244 snooping protocol, 454–456 index, 378 Branch target snoopy, 482.e16 in Intel Core i7, 459f addresses, 244 state diagram, 482.e15f Intrinsity FastMATH example, buffers, 310 Cache coherency protocol, 482.e11–482. 385–387 Branches.
See also Conditional e15 locating blocks in, 397–399 branches finite-state transition diagram, 482.e14f locations, 375f addressing in, 115–117 functioning, 482.e13f multilevel, 388, 400–403 compiler creation, 93–94 mechanism, 482.e13f nonblocking, 458 decision, moving up, 306–307 state diagram, 482.e15f physically addressed, 434–435 delayed, 272, 306–308 states, 482.e12 physically indexed, 434b–435b ending, 95b write-back cache, 482.e14f physically tagged, 434b–435b execution in ID stage, 307 Cache controllers, 457 primary, 400, 407–408 pipelined, 308b coherent cache implementation secondary, 400, 407–408 target address, 306–307 techniques, 482.e10–482.e11 set-associative, 393 Branch-on-zero instruction, 258–259 implementing, 482.e1 simulating, 466b Bubble Sort, 140 snoopy cache coherence, 482.e16 size, 379–381 Bubbles, 303 SystemVerilog, 482.e1–482.e4 split, 387b Bus-based coherent multiprocessors, Cache hits, 458 summary, 387–388 577.e1 Cache misses tag field, 378 Buses, A-18–A-19 block replacement on, 443–444 tags, 482.e1f, 482.e10–482.e11, 482.e11 Bytes capacity, 445, 446 virtual memory and TLB integration, addressing, 70 compulsory, 445 433–435 order, 70 conflict, 445 virtually addressed, 434 defined, 382 virtually indexed, 434 C direct-mapped cache, 394 virtually tagged, 434 fully associative cache, 396 write-back, 384, 385, 444 C language handling, 382–383 write-through, 383, 385, 444 assignment, compiling into RISC-V, memory-stall clock cycles, 389 writes, 383–385 65b reducing with flexible block placement, Callee, 99, 101 compiling, 144–145, 150.e1–150.e2 392–396 Caller, 99

Capabilities, 497.e12 Clocks, A-47–A-49 Comparisons Capacity misses, 445 edge, A-47, A b -49b constant operands in, 72–74 Carry lookahead, A-37–A-47 in edge-triggered design, A-72f signed versus unsigned, 95–96 4-bit ALUs using, A-43f skew, A-73 Compilers, 125 adder, A-38 specification, A-56f branch creation, 94b fast, with first level of abstraction, synchronous system, A-47–A-48 brief history, 173.e7–173.e8 A-38–A-40 Cloud computing, 522–523 conservative, 150.e6 fast, with “infinite” hardware, A-38 defined, 7 defined, 14 fast, with second level of abstraction, Cluster networking, 527–528, 553.e1, 553. front end, 150.e2 A-40–A-45 e3–553.e5, 553.e6–553.e9 function, 14, 125 plumbing analogy, A-41f, A-42f Clusters, 577.e7–577.e8 high-level optimizations, 150.e3–150.e4 ripple carry speed versus, Ab -45b defined, 490, 520, 577.e7 ILP exploitation, 368.e4–368.e5 summary, A-45–A-47 isolation, 521 Just In Time (JIT), 133 Carry save adders, 181 organization, 489 optimization, 141, 173.e8 CDC 6600, 368.e2, 54.e6 scientific computing on, 577.e7 speculation, 321–322 Cell phones, 6–7 Cm*, 577.e3–577.e4 structure, 150.e1f Central processor unit (CPU). See also CMOS (complementary metal oxide Compiling Processors semiconductor), 41 C assignment statements, 65b classic performance equation, 36–40 Coarse-grained multithreading, 504–505 C language, 94b–95b, 144–145, 150.
defined, 19 Cobol, 173.e6 e1, 150.e2 execution time, 32, 33–34 Code generation, 150.e12 floating-point programs, 206b–207b performance, 33–35 Code, 150.e6 if-then-else, 93b system, time, 32 Cold-start miss, 445 in Java, 150.e18–150.e19 time, 389 Collision misses, 445 procedures, 100b–101b, 102b–103b time measurements, 33–34 Column major order, 403 recursive procedures, 102b–103b user, time, 32 Combinational blocks, A-4–A-5 while loops, 94b–95b Cg pixel shader program, B-15 Combinational control units, C-4–C-8 Compressed sparse row (CSR) matrix, Characters Combinational elements, 238 B-55, B-56 ASCII representation, 108–109 Combinational logic, 239, A-3–A-4, Compulsory misses, 445, 446–447 in Java, 111–113 A-9–A-20 Computer architects, 11–13 Chips, 19, 25–26, 26 arrays, A-18–A-19 abstraction to simplify design, 11 manufacturing process, 26 decoders, A-9–A-10 common case fast, 11 Classes defined, A-4–A-5 dependability via redundancy, 12 defined, 150.e14 don’t cares, A-17–A-18 hierarchy of memories, 12 packages, 150.e20 multiplexors, A-10 Moore’s law, 11 Clock cycles ROMs, A-14–A-16 parallelism, 12 defined, 33 two-level, A-11–A-14 pipelining, 12 memory-stall, 389 Verilog, A-23–A-26 prediction, 12 number of registers and, 67 Commercial computer development, 54.
Computers worst-case delay and, 260 e3–54.e9 application classes, traditional, 3 Clock cycles per instruction (CPI), Commit units applications, 4 35–36, 270 buffer, 327 arithmetic for, 170 one level of caching, 400 defined, 327 characteristics, 54.e12f two levels of caching, 400 in update control, 332b commercial development, 54.e3–54.e9 Clock rate Common case fast, 11 component organization, 17f defined, 33 Common subexpression elimination, components, 17f frequency switched as function of, 41 150.e5 design measure, 53 power and, 40 Communication, 23–24 desktop, 5 Clocking methodology, 239–241, A-47 overhead, reducing, 44–45 embedded, 5–6 edge-triggered, 239, A-47, A-72–A-73 thread, B-34 first, 54.e2 level-sensitive, A-73–A-74, A-74–A-75 Compact code, 173.e3–173.e4 in information revolution, 4 for predictability, 239 Compare and branch zero, 307 instruction representation, 81–89

performance measurement, 54.e1–54.e3 branch not taken assumption, 305–306 Cores post-PC era, 6–7 branch prediction as solution, 272 defined, 43 servers, 5 delayed decision approach, 272b number per chip, 43 Condition codes/flags, 96 dynamic branch prediction, 308–312 Correlation predictor, 310–311 Conditional branches logic implementation in Verilog, 366. Cosmic Cube, 577.e6–577.e7 changing program counter with, 310b e7–366.e10 CPU, 9 compiling if-then-else into, 93b pipeline stalls as solution, 270f Cray computers, 248.e4, 248.e5 defined, 92–93 pipeline summary, 312–313 Critical word first, 382 desktop RISC, D-16f solutions, 270f Crossbar networks, 525–526 embedded RISC, D-16f static multiple-issue processors and, 322 CTSS (Compatible Time-Sharing implementation, 97b Control lines System), 497.e13 in loops, 117 asserted, 254 CUDA programming environment, 513, PA-RISC, D-34–D-36, D-35f in datapath, 253f B-5–B-6 PC-relative addressing, 115–116 execution/address calculation, 289 barrier synchronization, B-18, B-34 RISC, D-10–D-16 final three stages, 291f development, B-17, B-17–B-18 SPARC, D-10–D-12 instruction decode/register file read, hierarchy of thread groups, B-18 Conditional move instructions, 311b– 289 kernels, B-19, B-24 312b instruction fetch, 289 key abstractions, B-18 Conflict misses, 445 memory access, 289 paradigm, B-19–B-22 Constant memory, B-40 setting of, 254 parallel plus-scan template, B-61f Constant operands, 72–74 values, 289 per-block shared memory, B-58 frequent occurrence, 72 write-back, 289 plus-reduction implementation, Content Addressable Memory (CAM), Control signals B-63f 398b–399b ALUOp, 253 programs, B-6, B-24, B-24 Context switch, 436b defined, 240 scalable parallel programming with, Control effect of, 254f B-17–B-23 ALU, 249–251 multi-bit, 254 shared memories, B-18 challenge, 313 pipelined datapaths with, 288–292 threads, B-36 finalizing, 259 truth
tables, C-14f Cyclic redundancy check, 413b–414b forwarding, 298 Control units, 237–238. See also Cylinder, 372 FSM, C-8–C-22 Arithmetic logic unit (ALU) implementation, optimizing, C-27 address select logic, C-24, C-25f D mapping to hardware, C-3–C-4, combinational, implementing, C-4–C-8, C-8–C-22, C-22–C-28, C-4–C-8 D flip-flops, A-50–A-51, A-52 C-28–C-32, C-32–C-33 with explicit counter, C-23f D latches, A-50–A-51, A-51 memory, C-26f illustrated, 255f Data bits, 411f organizing, to reduce logic, C-31–C-32 logic equations, C-11–C-12 Data flow analysis, 150.e8 pipelined, 288–292 main, designing, 251–254 Data hazards, 266–269, 292–305. See also Control and status register (CSR) access as microcode, C-28f Hazards instructions, 462–463 -state outputs, C-10, C-12b–C-13b forwarding, 266–267, 292–305 Control flow graphs, 150.e8, 150.e9 output, 249–251, C-10 load-use, 267–269, 306 illustrated examples, 150.e8f, 150.e9f, RISC-V, C-10f stalls and, 301–305 150.e11f Cooperative thread arrays (CTAs), B-30 Data parallel problem decomposition, Control functions Coprocessors B-17, B-18f ALU, mapping to gates, C-4–C-7 defined, 210b Data race, 121 defining, 254 Core RISC-V instruction set. See also MIPS Data selectors, 235–236 PLA, implementation, C-7, C-20 abstract view, 236f Data transfer instructions. See also ROM, encoding, C-19 desktop RISC, D-9f Instructions for single-cycle implementation, implementation, 234–235 defined, 68, 69 259–260 implementation illustration, 237f load, 69 Control hazards, 269–272, 305–313 overview, 235–238 offset, 69 branch delay reduction, 306–308 subset, 234 store, 70–71

Datacenters, 7 bubble insertion and, 303 illustrated, 375f Data-level parallelism, 498 detection, 295b memory block location, 393f Datapath elements name, 325 misses, 395b–396b defined, 241 sequence, 293 single comparator, 397 sharing, 246–247 Design total number of bits, 380 Datapaths compromises and, 84 Dirty bit, 428b branch, 244–245 datapath, 241 Dirty, 428b building, 241–249 digital, 343 Disk memory, 371–373 control signal truth tables, C-14f logic, 238–241 Displacement addressing, 118 control unit, 255f main control unit, 251–254 Distributed Block-Interleaved Parity defined, 19 memory hierarchy, challenges, 447f (RAID 5), 481.e5–481.e6 design, 241 pipelining instruction sets, 265 Divide algorithm, 184b exception handling, 316f Desktop and server RISCs. See also Dividend, 182 for fetching instructions, 243f Reduced instruction set computer Division, 181–189 for hazard resolution via forwarding, (RISC) architectures algorithm, 183f 300f addressing modes, D-6 dividend, 182 for memory instructions, 245 architecture summary, D-4f, D-4f divisor, 182 in operation for branch-if-equal arithmetic/logical instructions, D-11f Divisor, 182 instruction, 258–259 conditional branches, D-16 divu (Divide Unsigned).
See also Arithmetic in operation for load instruction, 257f constant extension summary, D-9f, faster, 186–187 in operation for R-type instruction, D-9f floating-point, 204–210 256f control instructions, D-11f hardware, 182–185 operation of, 254–259 conventions equivalent to MIPS core, hardware, improved version, 185f pipelined, 274–292 D-12f operands, 182 for RISC-V architecture, 247 data transfer instructions, D-10f quotient, 182 for R-type instructions, 254–257 features added to, D-45f remainder, 182 single, creating, 245–249 floating-point instructions, D-12f in RISC-V, 187 single-cycle, 273 instruction formats, D-7f signed, 185–186 static two-issue, 324f multimedia extensions, D-16–D-18 SRT, 187 Deasserted signals, 240, A-4 multimedia support, D-18f Don’t cares, A-17–A-18 DEC PDP-8, 173.e2f Desktop computers, defined, 5 example, Ab -17b –Ab -18b Decimal numbers Device driver, 553.e4 term, 251 binary number conversion to, 77b DGEMM (Double precision General Double data rate (DDR), 369–370 defined, 74 Matrix Multiply), 216–217, 340, Double Data Rate (DDR) SDRAM, Decision-making instructions, 92–98 342–343, 403, 528 369–370, A-64 Decoders, A-9–A-10 cache blocked version of, 405f Double precision. See also Single two-level, A-64 optimized C version of, 218f, 340f, 464f precision Decoding machine language, 118–120 performance, 342f, 406f defined, 191 Defect, 26–27 Dicing, 27 FMA, B-45, B-45–B-46 Delayed branches, 272. See also Branches Dies, 26–27 GPU, B-45, Bb -74b as control hazard solution, 272 Digital design pipeline, 343 representation, 210–212 embedded RISCs and, D-23 Digital signal-processing (DSP) Doubleword, 67, 151 reducing, 306–308 extensions, D-19 Dual inline memory modules (DIMMs), Delayed decision, 272b DIMMs (dual inline memory modules), 371 DeMorgan’s theorems, A-11 497.e4 Dynamic branch prediction, 308–312.
See Denormalized numbers, 214 Direct Data IO (DDIO), 553.e6 also Control hazards Dependability via redundancy, 12 Direct memory access (DMA), 553.e2f, branch prediction buffer, 308 Dependable memory hierarchy, 408–414 553.e3 loops and, 310b failure, defining, 408–410 Direct3D, B-13 Dynamic hardware predictors, 272 Dependences Direct-mapped caches. See also Caches Dynamic multiple-issue processors, 320, between pipeline registers, 236–237 address portions, 397f 326–331. See also Multiple issue between pipeline registers and ALU choice of, 398–399 pipeline scheduling, 327–331 inputs, 295–296 defined, 374, 392 superscalar, 326

Dynamic pipeline scheduling, 327–331 Embedded computers, 5–6 imprecise, 319b commit unit, 327 application requirements, 6 interrupts versus, 313 concept, 327 design, 5 pipelined computer example, 316b– hardware-based speculation, 329–331 growth, 54.e11 317b primary units, 328f Embedded Benchmark in pipelined implementation, 315–319 reorder buffer, 332b Consortium (EEMBC), 54.e11 precise, 319b reservation station, 327 Embedded RISCs. See also Reduced reasons for, 314–315 Dynamic random access memory instruction set computer (RISC) result due to overflow in add (DRAM), 368, 369–371, A-62–A-64 architectures instruction, 318f bandwidth external to, 388 addressing modes, D-6 in RISC-V architecture, 314–315 cost, 23 architecture summary, D-4f, D-4f saving/restoring stage on, 438 defined, 19, A-62 arithmetic/logical instructions, D-14f Executable files DIMM, 497.e4 conditional branches, D-16 defined, 127–129 Double Data Rate (DDR), 369–370 constant extension summary, D-9f, Execute or address calculation stage, 280 early board, 497.e4f D-9f Execute/address calculation GPU, B-37–B-38 control instructions, D-15f control line, 289 growth of capacity, 25f data transfer instructions, D-13f load instruction, 280 history, 497.e1 delayed branch and, D-23 store instruction, 280 internal organization of, 370f DSP extensions, D-19 Execution time pass transistor, A b -62b –Ab -64b general purpose registers, D-5 CPU, 32, 33–34 SIMM, 497.e5f, 497.e4 instruction conventions, D-15f pipelining and, 274 single-transistor, A-63f instruction formats, D-8f as valid performance measure, 50–51 size, 388 multiply-accumulate approaches, Explicit counters, C-23–C-24, C-26f speed, 23–24 D-19f Exponents, 190 synchronous (SDRAM), 369–370, A- Encoding 59, A-64 defined, C-31 F two-level decoder, A-64 RISC-V instruction, 85f, 119f Dynamically linked libraries (DLLs), ROM control function, C-18 Failures, synchronizer, A-75–A-76
130–132 ROM logic function, A-15 Fallacies. See also Pitfalls defined, 130 x86 instruction, 153–154 Amdahl’s law, 546 lazy procedure linkage version, 130 ENIAC (Electronic Numerical Integrator arithmetic, 220 and Calculator), 497.e1, 54.e2–54.e3, assembly language for performance, E 54.e2, 54.e3 158b EPIC, 368.e4 commercial binary compatibility Early restart, 382b Error correction, A-64–A-66 importance, 158b Edge-triggered clocking methodology, Error Detecting and Correcting Code defined, 49 239, 240, A-47, A-72–A-73 (RAID 2), 481.e4 GPUs, B-72, B-75 advantage, A-48 Error detection, A-65–A-66 low utilization uses little power, 50b clocks, A-72–A-73 Error detection code, 410 peak performance, 546b drawbacks, A-73–A-74 Ethernet, 23–24 pipelining, 343 illustrated, A-49f EX stage powerful instructions mean higher rising edge/falling edge, A-47 load instructions, 280f performance, 157 EDSAC (Electronic Delay Storage overflow exception detection, 315, 318f right shift, 220b Automatic Calculator), 497.e2f, 497. store instructions, 282f False sharing, 455 e1, 54.e2 Exabyte, 6f Fast carry Eispack, 248.e2–248dir.e3, 248.e3 Exception enable, 437b with first level of abstraction, Electrically erasable programmable read- Exceptions, 313–319 A-38–A-40 only memory (EEPROM), 371 association, 319b with “infinite” hardware, A-38 Elements datapath with controls for handling, with second level of abstraction, combinational, 238 316f A-40–A-45 datapath, 241, 246–247 defined, 191, 313 Fast Fourier Transforms (FFT), B-53 memory, A-49–A-57 detecting, 313 Fault avoidance, 409 state, 238, 240, 242f, A-47, Ab -49b event types and, 313 Fault forecasting, 409

Fault tolerance, 409 intermediate calculations, 210–211 Format fields, C-31 Fermi architecture, 513, 542 operands, 205f Fortran, 173.e6 Field programmable devices (FPDs), overflow, 190 Forwarding, 292–305 A-77–A-78 packed format, 216 ALU before, 297f Field programmable gate arrays (FPGAs), precision, 221 control, 298 A-77 procedure with two-dimensional datapath for hazard resolution, 300f Fields matrices, 80b defined, 266–267 defined, 83 programs, compiling, 79b–80b graphical representation, 267f format, C-31 registers, 210b illustrations, 366.e25 names, 83 representation, 190–191 multiple results and, 269 RISC-V, 83–89 RISC-V instruction frequency for, 224f multiplexors, 298f Files, register, 242, 247, A b -49b, RISC-V instructions, 204–210 pipeline registers before, 297f A-53–A-55 rounding, 210–211 with two instructions, 266b–267b Fine-grained multithreading, 504 sign and magnitude, 190 Verilog implementation, 366.e3–366.e5 Finite-state machines (FSMs), 447–452, SSE2 architecture, 215, 215f Fractions, 190, 191 A-66–A-71 subtraction, 204 Frame buffer, 18 control, C-8–C-22 underflow, 190 Frame pointers, 104 controllers, 450f units, 211–212 Front end, 150.e2 for multicycle control, C-9f in x86, 215f Fully associative caches.
See also Caches for simple cache controller, 451–452 Floating vectors, 248.e2 block replacement strategies, 443–444 implementation, 449, A-69 Floating-point addition, 196–199 choice of, 443 Mealy, 450 arithmetic unit block diagram, 200f defined, 393 Moore, 450b–451b binary, 197b–199b memory block location, 393f next-state function, 449, A-66 illustrated, 198f misses, 396 output function, A-66, A-68 instructions, 204–210 Fully connected networks, 525 state assignment, A-69 steps, 196, 196, 196, 196–197 Fused-multiply-add (FMA) operation, state register implementation, A-70f Floating-point arithmetic (GPUs), 212b, B-45 style of, 450b–451b B-41–B-46 synchronous, A-66 basic, B-42 G SystemVerilog, 482.e6f double precision, B-45–B-46, B b -74b traffic light example, A-67 performance, B-44 Game consoles, B-9 Flash memory, 371 specialized, B-42–B-44 Gates, A-3–A-4, A-4–A-9 defined, 23 supported formats, B-42 AND, A-12–A-13, C-7 Flat address space, 467 operations, B-44 delays, A-45 Flip-flops Floating-point control and status register mapping ALU control function to, D flip-flops, A-50–A-51, A-52 (fcsr), 191 C-4–C-7 defined, A-50–A-51 Floating-point instructions NAND, A-8–A-9 Floating point, 189–214 desktop RISC, D-12f NOR, A-8–A-9, A-49f assembly language, 205f SPARC, D-31–D-32 Gather-scatter, 501, 542 backward step, 248.e3–248.e4 Floating-point multiplication, 199–204 General Purpose GPUs (GPGPUs), B-5 binary to decimal conversion, 195b binary, 203b–204b General-purpose registers, 147 branch, 204 illustrated, 202f architectures, 173.e2f challenges, 224 instructions, 204 embedded RISCs, D-5 diversity versus portability, 248.e2–248.
significands, 199–203 Generate e3 steps, 199–201, 201, 201, 201, 201–203 defined, A-39 division, 204 Flow-sensitive information, 150.e13b– example, Ab -44b –Ab -45b first dispute, 248.e1–248.e2 150.e14b super, A-40 form, 190 Flushing instructions, 306, 307–308 Gigabyte, 6f fused multiply add, 212b exceptions and, 317b Global common subexpression guard digits, 211b For loops, 142, 150.e25 elimination, 150.e5 history, 248.e2 inner, 150.e23 Global memory, B-21, B-39 IEEE 754 standard, 191–196 SIMD and, 577.e2 Global miss rates, 406b

Global optimization, 150.e4–150.e10 Guard digits High-level languages, 14–16 code, 150.e6 defined, 210–211 benefits, 16 implementing, 150.e7 rounding with, 211b computer architectures, 173.e4 Global pointers, 104b importance, 16 GPU computing. See also Graphics H High-level optimizations, 150.e3–150.e4 processing units (GPUs) Hit rate, 366 defined, B-5–B-6 Half precision, B-42 Hit time visual applications, B-6 Halfwords, 112 cache performance and, 391–392 GPU system architectures, B-7–B-12 Hamming, Richard, 410 defined, 366–367 graphics logical pipeline, B-10 Hamming distance, 410 Hit under miss, 458 heterogeneous, B-7–B-9 Hamming Error Correction Code (ECC), Hold time, A-52–A-53 implications for, B-24–B-25 410–411 Horizontal microcode, C-32 interfaces and drivers, B-9–B-10 calculating, 410 Hot-swapping, 481.e6–481.e7 unified, B-10–B-11 Hard disks Human genome project, 4 Graph coloring, 150.e11 access times, 23 Graphics displays defined, 23 I computer hardware support, 18 Hardware LCD, 18 as hierarchical layer, 13f I/O, 553.e1–553.e2, 553.e2, 553.e2 Graphics logical pipeline, B-10 language of, 14–16 on system performance, 481.e1b–481. Graphics processing units (GPUs), 512– operations, 63–67 e2b 519. See also GPU computing supporting procedures in, 98–108 I/O benchmarks. See Benchmarks as accelerators, 512 synthesis, A-21 IBM 360/85, 497.e5 attribute interpolation, B-43–B-44 translating microprograms to, IBM 701, 54.e4 defined, 46, 496–497, B-3 C-28–C-32 IBM 7030, 368.e1 evolution, B-5 virtualizable, 416 IBM ALOG, 248.e6 fallacies and pitfalls, B-72–B-75 Hardware description languages. See also IBM Blue Gene, 577.e8, 577.e8–577.e9 floating-point arithmetic, B-16, Verilog IBM Personal Computer, 173.e5, 54.e7 B-41–B-46, B-74 defined, A-20 IBM System/360 computers, 54.e5f, 368.
GeForce 8-series generation, B-5 using, A-20–A-26 e1, 248.e5, 248.e6 general computation, Bb -73b VHDL, A-20–A-21 IBM z/VM, 497.e12 General Purpose (GPGPUs), B-5 Hardware multithreading, 504–507 ID stage graphics mode, B-6 coarse-grained, 504–505 branch execution in, 307, 308 graphics trends, B-4 options, 505f load instructions, 280f history, B-3–B-4 simultaneous, 505 store instruction in, 279f logical graphics pipeline, B-13–B-14 Hardware-based speculation, 329–331 IEEE 754 floating-point standard, 191– mapping applications to, B-55–B-72 Harvard architecture, 54.e3 196, 192f, 248.e7–248.e9. See also memory, 512 Hazard detection units, 301 Floating point multilevel caches and, 512 pipeline connections for, 304–305 first chips, 248.e7–248.e9 N-body applications, B-65–B-68 Hazards. See also Pipelining in GPU arithmetic, B-42 NVIDIA architecture, 513–515 control, 269–272, 305–313 implementation, 248.e9 parallel memory system, B-36–B-41 data, 266–269, 292–305 rounding modes, 211–212 parallelism, 513, B-76 forwarding and, 300b today, 248.e9 performance doubling, B-4 structural, 265–266, 282 If statements, 115–116 perspective, 517–519 Heap If-then-else, 93b programming, B-12–B-25 allocating space on, 104, 145 programming interfaces to, B-17 defined, 105 Immediate addressing, 118 real-time graphics, B-13 Heterogeneous systems, B-4–B-5 Immediate instructions, 72 Graphics shader programs, B-14–B-15 architecture, B-7–B-12 Imprecise interrupts, 319b, 368.e2–368.e3 Gresham’s Law, 225, 248.e1 defined, B-3 Index-out-of-bounds check, 96 Grid computing, 523b–524b Hexadecimal numbers, 82 Induction variable elimination, 150.e6 Grids, B-19 binary number conversion to, 82f, 83b Inheritance, 150.e14 GTX 280, 538–539 Hierarchy of memories, 12 In-order commit, 328–329

Input devices, 16–17 Instructions, 60, D-25–D-27, D-40, D-40–D-43. See also Arithmetic instructions; MIPS; Operands Instructions per clock cycle (IPC), 320 Inputs, 251 Integrated circuits (ICs), 19. See also specific chips Instances, 150.e14 Instruction count, 36, 38 add immediate, 72–74 cost, 27 Instruction decode/register file read stage addition, 174 defined, 25 control line, 288–292 Alpha, D-27–D-29 manufacturing process, 26 load instruction, 277 arithmetic-logical, 241–242 very large-scale (VLSIs), 25 store instruction, 282 ARM, D-36–D-38 Intel Core i7, 46–49, 234, 491, 538–543 Instruction execution illustrations, 366.e16–366.e25 assembly, 65 address translation for, 458f basic block, 95b architectural registers, 335–336 clock cycle 9, 366.e24f cache-aware, 470 caches in, 459f clock cycles 1 and 2, 366.e20f conditional branch, 92–93, 93b memory hierarchies of, 457–462 clock cycles 3 and 4, 366.e21f conditional move, 311b–312b microarchitecture, 335 clock cycles 5 and 6, 366.e22f data transfer, 68 performance of, 460 clock cycles 7 and 8, 366.e23f decision-making, 92–98 SPEC CPU benchmark, 46–48 examples, 366.e24–366.e25 defined, 14, 62 SPEC power benchmark, 48–49 forwarding, 366.e25, 366.e25 desktop RISC conventions, D-12f TLB hardware for, 458f no hazard, 366.e16 as electronic signals, 81–82 Intel Core i7 920, 335–338 pipelines with stalls and forwarding, 366.e25 embedded RISC conventions, D-15f microarchitecture, 335 encoding, 85f Intel Core i7 960 Instruction fetch stage fetching, 243f benchmarking and rooflines of, 538–543 control line, 289 floating-point, 204–210 load instruction, 277 floating-point (x86), 215f Intel Core i7 Pipelines, 332–340, 335–338 store instruction, 282 flushing, 306, 307–308 memory components, 336f Instruction formats, 153 immediate, 72 performance, 338–340 defined, 82 introduction to, 62–63 program performance, 339b desktop/server RISC architectures, D-7f
left-to-right flow, 275 specification, 333f embedded RISC architectures, D-8f load, 69 Intel IA-64 architecture, 173.e2f I-type, 84 logical operations, 89–92 Intel Paragon, 577.e6–577.e7 MIPS, 146f M32R, D-40 Intel Threading Building Blocks, B-60 RISC-V, 146f memory access, B-33–B-34 Intel x86 R-type, 84, 251–252 memory-reference, 235 clock rate and power for, 40f x86, 153–154 multiplication, 181 Interference graphs, 150.e10 Instruction latency, 344–345 nop, 302–303 Interleaving, 388 Instruction mix, 39–40, 54.e9 PA-RISC, D-34–D-36 Interprocedural analysis, 150.e13b–150.e14b Instruction set architecture performance, 35–36 Interrupt enable, 437b branch address calculation, 244 pipeline sequence, 302f Interrupt-driven I/O, 553.e3 defined, 22, 52 PowerPC, D-12–D-13, D-32–D-34 Interrupts history, 161 PTX, B-31, B-32f defined, 191, 313 maintaining, 52 representation in computer, 81–89 event types and, 313 protection and, 417 restartable, 438–439 exceptions versus, 313 thread, B-31–B-34 resuming, 438b–439b imprecise, 319b, 368.e2–368.e3 virtual machine support, 416 R-type, 241–242, 246–247 precise, 319b Instruction sets, B-49 SPARC, D-29–D-32 vectored, 314 MIPS-32, 146f store, 71 Intrinsity FastMATH processor, 385–387 RISC-V, 160 store-conditional doubleword, 122–123 caches, 386f x86 growth, 156f subtraction, 174 data miss rates, 387f, 397f Instruction-level parallelism (ILP), 342–343. See also Parallelism SuperH, D-39–D-40 read processing, 432f thread, B-30–B-31 TLB, 430–433 compiler exploitation, 368.e4–368.e5 Thumb, D-38–D-39 write-through processing, 432f defined, 43b, 319–320 vector, 498–500 Inverted page tables, 427 exploitation, increasing, 331 as words, 62 Issue packets, 322–323 and matrix multiply, 340–343 x86, 146–155 I-type, 87b

J lbu (load byte, unsigned), 64f Load-reserved doubleword, 122–123 ld (load doubleword), 64f Load-store architectures, 173.e2 Java Leaf procedures. See also Procedures Load upper immediate, 113–114 bytecode, 132 defined, 102 Load-use data hazard, 267–269, 306 bytecode architecture, 150.e10–150.e12 example, 112f Load-use stalls, 306 characters in, 111–113 Least recently used (LRU) Load word, 113b compiling in, 150.e18–150.e19 as block replacement strategy, 443–444 Load word unsigned, 113b goals, 132 defined, 399 Local area networks (LANs), 24. See also Networks interpreting, 132, 144–145, 150.e14 pages, 424–426 keywords, 150.e20 Least significant bits Local memory, B-21, B-40 method invocation in, 150.e20 defined, 74 Local miss rates, 406b pointers, 150.e25–150.e26 SPARC, D-31 Local optimization, 150.e4. See also Optimization primitive types, 150.e25 Left-to-right instruction flow, 275 programs, starting, 132–133 Level-sensitive clocking, A-73–A-74, A-74–A-75 implementing, 150.e7 reference types, 150.e25 Locality sort algorithms, 141f defined, A-73–A-74 principle, 364–365 strings in, 111–113 two-phase, A-74 spatial, 364, 367b translation hierarchy, 132f lh (load halfword), 64f temporal, 364, 367b while loop compilation in, 150.e17b–150.e18b lhu (load halfword, unsigned), 64f Lock synchronization, 121 Link, 553.e1–553.e2 Locks, 508–511 Java Virtual Machine (JVM), 145, 150.e15 Linkers, 127–129 Logic Jump-and-link register instruction (jalr), 97–98, 99 defined, 127 address select, C-24, C-25f executable files, 127–129 ALU control, C-6–C-7 Jump instructions, D-26 steps, 127 combinational, 240, A-5, A-9–A-20 branch instruction versus, 248f Linking object files, 128b–129b components, 239 control and datapath for, 249 Linpack, 528, 248.e2–248.e3, 248.e3 control unit equations, C-11f implementing, 235–238 Liquid crystal displays (LCDs), 18 design, 238–241 instruction format, 248
LISP, SPARC support, D-30 equations, A-7b Just In Time (JIT) compilers, 133, 550 Live range, 150.e10 minimization, A-18 Livermore Loops, 54.e10 programmable array (PAL), A-77 K Load balancing, 495b–496b sequential, A-4–A-5, A-55–A-57 Load byte, 109 two-level, A-11–A-14 Karnaugh maps, A-18 Load doubleword, 69, 71–72 Logical operations, 89–92 Kernel mode, 435 Load instructions. See also Store instructions AND, 90 Kernels desktop RISC, D-11f, D-11f CUDA, B-19, B-24 access, B-41 embedded RISC, D-13f defined, B-19–B-22 base register, 252 NOT, 91 Kilobyte, 6f compiling with, 71b OR, 91 datapath in operation for, 257f shifts, 90 L defined, 69 xor, 91 EX stage, 280f Long instruction word (LIW), 368.e4 LAPACK, 221–222 halfword unsigned, 112 Lookup tables (LUTs), A-77–A-78 Large-scale multiprocessors, 577.e6, 577.e6–577.e7 ID stage, 279f Loop unrolling IF stage, 279f defined, 325–326, 150.e3–150.e4 Latches load byte unsigned, 78 for multiple-issue pipelines, 325b–326b D latch, A-50–A-51, A-51 load half, 112 register renaming and, 325 defined, A-50–A-51 MEM stage, 281f Loops, 94–96 Latency pipelined datapath in, 284f conditional branches in, 115–116 instruction, 344–345 signed, 78b for, 142 memory, B-74b unit for implementing, 245f prediction and, 310b pipeline, 274b unsigned, 78b test, 142, 143 use, 323–325 WB stage, 281f while, compiling, 94b–95b lb (load byte), 64f Loaders, 130 lr.d (load reserved), 64f

lui (load upper immediate), 64f read-only (ROM), A-14–A-16 variance, 407b lw (load word), 64f SDRAM, 369–370 virtual memory, 417–441 lwu (load word, unsigned), 64f secondary, 23 Memory rank, 371 shared, B-17, B-39–B-40 Memory technologies, 368–373 M spaces, B-39 disk memory, 371–373 SRAM, A-57–A-59 DRAM technology, 368, 369–371 M32R, D-15, D-40 stalls, 390 flash memory, 371 Machine code, 82 technologies for building, 24–28 SRAM technology, 368, 369 Machine instructions, 82 texture, B-40 Memory-mapped I/O, 553.e2 Machine language, 15f virtual, 417–441 Memory-stall clock cycles, 389 branch offset in, 116b–117b volatile, 22–23 Message passing decoding, 118–120 Memory access instructions, B-33–B-34 defined, 519 defined, 14, 82 Memory access stage multiprocessors, 519–524 illustrated, 15f control line, 290f Metastability, A-75–A-76 RISC-V, 87–89 load instruction, 280f Methods SRAM, 19–22 store instruction, 280 defined, 150.e14 translating RISC-V assembly language into, 85b–86b Memory bandwidth, 538–539, 547b invoking in Java, 150.e19–150.e20 Memory consistency model, 456b Microarchitectures, 335 Main memory, 418. See also Memory
Memory elements, A-49–A-57 Intel Core i7 920, 335–338 defined, 23 clocked, A-50 Microcode page tables, 427 D flip-flop, A-50–A-51, A-52 assembler, C-30 physical addresses, 418 D latch, A-51 control unit as, C-28f Mapping applications, B-55–B-72 DRAMs, A-62–A-64 defined, C-27 Mark computers, 54.e3 flip-flop, A-50 dispatch ROMs, C-30, C-30f Matrix multiply, 216–220, 543–546 hold time, A-52–A-53 horizontal, C-32 Mealy machine, 450, A-67, A-70–A-71, A-71b latch, A-50 vertical, C-32 setup time, A-52–A-53, A-53f Microinstructions, C-31 Mean time to failure (MTTF), 408–409 SRAMs, A-57–A-59 Microprocessors versus AFR of disks, 408b–409b unclocked, A-50 design shift, 491 improving, 409–410 Memory hierarchies, 535 multicore, 8, 43, 490–491 Media Access Control (MAC) address, 553.e6 of ARM cortex-A53, 457–462 Microprograms block (or line), 365–366 as abstract control representation, C-30–C-31 Megabyte, 6f cache performance, 388–408 Memory caches, 373–388 field translation, C-28–C-29 addresses, 78b common framework, 441–447 translating to hardware, C-28–C-32 affinity, 536f defined, 365 Migration, 454 atomic, B-21 design challenges, 447b Million instructions per second (MIPS), 51 bandwidth, 369–370, 387b development, 497.e5–497.e7 cache, 19–22, 373–388, 388–408 exploiting, 362 Minterms CAM, 398b–399b of Intel Core i7, 457–462 defined, A-12–A-13, C-20 constant, B-40 level pairs, 366f in PLA implementation, C-20 control, C-26 multiple levels, 365 MIP-map, B-44 defined, 19 overall operation of, 433b–434b MIPS and RISC-V DRAM, 19, 369–371, A-62–A-64 parallelism and, 452–456, 481.e1–481.e2
common features between, 145 flash, 23 MIPS-16 global, B-21, B-39 pitfalls, 466–470 16-bit instruction set, D-41–D-42 GPU, 512 program execution time and, 407 immediate fields, D-41 instructions, datapath for, 245 quantitative design parameters, 441f instructions, D-40–D-43 local, B-21, B-40 redundant arrays and inexpensive disks, 456 MIPS core instruction changes, D-42–D-43 main, 23 nonvolatile, 22–23 reliance on, 367 PC-relative addressing, D-41 operands, 68–72 structure, 365f MIPS-32 instruction set, 145 parallel system, B-36–B-41 structure diagram, 368f MIPS-64 instructions, 145, D-25–D-27

conditional procedure call instructions, D-27 vector versus, 499b–500b organization, 489, 519 Multiple dimension arrays, 210 for performance, 547 constant shift amount, D-25 Multiple instruction multiple data (MIMD), 548–549 shared memory, 490–491, 507–512 jump/call not PC-relative, D-26 software, 491f move to/from control registers, D-26 defined, 497, 498 TFLOPS, 577.e5 nonaligned data transfers, D-25 first multiprocessor, 577.e3–577.e4 UMA, 508 NOR, D-25 Multiple instruction single data (MISD), 497–498 Multistage networks, 525–526 parallel single precision floating-point operations, D-27 Multiple issue, 320 Multithreaded multiprocessor architecture, B-25–B-36 reciprocal and reciprocal square root, D-27 code scheduling, 324b–325b conclusion, B-36 dynamic, 320, 326–331 ISA, B-31–B-34 SYSCALL, D-25 issue packets, 322–323 massive multithreading, B-25–B-26 TLB instructions, D-26–D-27 loop unrolling and, 325b–326b multiprocessor, B-26–B-27 Mirroring, 481.e4 processors, 320 multiprocessor comparison, B-35–B-36 Miss penalty static, 320, 322–326 SIMT, B-27–B-29 defined, 366–367 throughput and, 330b special function units (SFUs), B-35 determination, 381–382 Multiple processors, 543–546 streaming processor (SP), B-34 multilevel caches, reducing, 400–403 Multiple-clock-cycle pipeline diagrams, 284–285 thread instructions, B-30–B-31 Miss rates threads/thread blocks management, B-30 block size versus, 381–382 five instructions, 286f Multithreading, B-25–B-26 data cache, 442f illustrated, 285–288 coarse-grained, 504–505 defined, 366 Multiplexors, A-10 defined, 496–497 global, 406b controls, 449 fine-grained, 504 improvement, 381–382 in datapath, 253f hardware, 504–507 Intrinsity FastMATH processor, 387 defined, 235–236 simultaneous (SMT), 505 local, 406b forwarding, control values, 298f Must-information, 150.e13b–150.e14b miss sources, 446 selector control, 249 Mutual exclusion, 121 split cache, 387b
two-input, A-10 Miss under miss, 458 Multiplicand, 176 N MMX (MultiMedia eXtension), 215 Multiplication, 175–181. See also Arithmetic Moore machines, 450, A-67, A-70–A-71, A-71b Name dependence, 325 fast, hardware, 180 NAND gates, A-8–A-9 Moore’s law, 11, 369, 512, 553.e1–553.e2, B-72b faster, 180–181 NAS (NASA Advanced Supercomputing), 530 first algorithm, 178f Most significant bit floating-point, 199–204 N-body 1-bit ALU for, A-33f hardware, 176–180 all-pairs algorithm, B-65 defined, 74 instructions, 181 GPU simulation, B-71 MS-DOS, 497.e15, 497.e15 operands, 181 mathematics, B-65–B-66 Multicore, 507–512 product, 181 multiple threads per body, B-68–B-72 Multicore multiprocessors, 8, 43 sequential version, 176–180 optimization, B-67 defined, 8, 490–491 signed, 180 performance comparison, B-69–B-70 MULTICS (Multiplexed Information and Computing Service), 497.e8 Multiplier, 176 results, B-70–B-72 Multiply algorithm, 176–180 shared memory use, B-67–B-68 Multilevel caches. See also Caches Multiply-add (MAD), B-42 Negation shortcut, 78–79 complications, 406b Multiprocessors Nested procedures, 102–104 defined, 388, 406b benchmarks, 528–538 compiling recursive procedure showing, 102b–103b miss penalty, reducing, 400–403 bus-based coherent, 577.e6 performance of, 400b–401b defined, 490 NetFPGA 10-Gigabit Ethernet card, 553.e1f, 553.e2f summary, 407–408 historical perspective, 551 Multimedia extensions large-scale, 577.e6, 577.e6–577.e7 Network of Workstations, 577.e7–577.e8 desktop/server RISCs, D-16–D-18 message-passing, 519–524 Network topologies, 524–527 as SIMD extensions to instruction sets, 577.e3 multithreaded architecture, B-26–B-27, B-36 implementing, 526–527 multistage, 527f

Networking, 553.e3–553.e4 rasterization, B-50 Operations operating system in, 553.e3–553.e5 ROP, B-50–B-51 atomic, implementing, 122 performance improvement, 553.e6–553.e9 scalability, B-51 hardware, 63–67 sorting performance, B-54–B-55 logical, 89–92 Networks, 23–24 special function approximation statistics, B-43f x86 integer, 151–152 advantages, 23 Optimization bandwidth, 525 special function unit (SFU), B-50 class explanation, 150.e13f crossbar, 525–526 streaming multiprocessor (SM), B-48–B-49 compiler, 141f fully connected, 525 control implementation, C-27 local area (LANs), 23–24 streaming processor, B-49–B-50 global, 150.e4–150.e10 multistage, 525–526 streaming processor array (SPA), B-46 high-level, 150.e3–150.e4 wide area (WANs), 23–24 texture/processor cluster (TPC), B-47 local, 150.e4–150.e10, 150.e7 Newton’s iteration, 210b NVIDIA GPU architecture, 513–515 manual, 144 Next state NVIDIA GTX 280, 539f, 540f or (inclusive or), 64f nonsequential, C-24 NVIDIA Tesla GPU, 538–543 OR operation, 174, A-6 sequential, C-23–C-24 ori (inclusive or immediate), 64f Next-state function, 449, A-66 O Out-of-order execution defined, 449 defined, 328 implementing, with sequencer, C-22–C-28 Object files, 128b–129b performance complexity, 406b–407b debugging information, 127 processors, 332b Next-state outputs, C-27, C-12b–C-13b header, 126 Output devices, 16–17 example, C-12 linking, 128b–129b Overflow implementation, C-12–C-13 relocation information, 126 defined, 75, 190 logic equations, C-12b–C-13b static data segment, 126 detection, 174 truth tables, C-13–C-15 symbol table, 127 exceptions, 316f No Redundancy (RAID 0), 481.e3 text segment, 126 floating-point, 191 No write allocation, 384 Object-oriented languages. See also Java
occurrence, 173 Nonblocking assignment, A-24 brief history, 173.e7 saturation and, 175b Nonblocking caches, 332b, 458 defined, 145, 150.e14 subtraction, 173 Nonuniform memory access (NUMA), 508 One’s complement, 81, A-29 Opcodes P Nonvolatile memory, 22–23 control line setting and, 254 Nops, 302–303 defined, 83, 252 P + Q redundancy (RAID 6), 481.e6 NOR gates, A-8–A-9 OpenGL, B-13 Packed floating-point format, 216 cross-coupled, A-49f OpenMP (Open MultiProcessing), 510b–511b, 530 Page faults, 424. See also Virtual memory D latch implemented with, A-51f for data access, 459 NOR operation, D-25 Operands, 67–74. See also Instructions defined, 418–419 NOT operation, 91, A-6 32-bit immediate, 113–114 handling, 420, 437–439 Numbers adding, 173 virtual address causing, 430–433 binary, 74 arithmetic instructions, 67 Page tables, 443 computer versus real-world, 213 compiling assignment when in memory, 69b defined, 422–423 decimal, 74, 77b illustrated, 425f denormalized, 214 constant, 72–74 indexing, 422–423 hexadecimal, 83 division, 181–189 inverted, 427 signed, 74–81 floating-point, 205f levels, 427 unsigned, 74–81 memory, 68–72 main memory, 427 NVIDIA GeForce 8800, B-46–B-55 multiplication, 175–181 register, 422–423 all-pairs N-body algorithm, B-71 RISC-V, 64f storage reduction techniques, 427 dense linear algebra computations, B-51–B-53 Operating systems updating, 422 brief history, 497.e8 VMM, 439b FFT performance, B-53 defined, 13 Pages. See also Virtual memory instruction set, B-49 encapsulation, 22 defined, 418–419 performance, B-51 in networking, 553.e3–553.e5 dirty, 428b

finding, 422–423 task, B-24 Physical addresses, 418 LRU, 424–426 task-level, 490 mapping to, 418–419 offset, 419 thread, B-22 space, 507, 509b–511b physical number, 419 Paravirtualization, 470 Physically addressed caches, 434–435 placing, 422–423 PA-RISC, D-14, D-17 Pipeline registers size, 420f branch vectored, D-35 before forwarding, 296–298 virtual number, 419 conditional branches, D-34, D-35f dependences, 295–296, 296f Parallel bus, 553.e1–553.e2 debug instructions, D-36 forwarding unit selection, 300 Parallel execution, 121 decimal operations, D-35 Pipeline stalls, 268 Parallel memory system, B-36–B-41. See also Graphics processing units (GPUs) extract and deposit, D-35 avoiding with code reordering, 268b–269b instructions, D-34–D-36 load and clear instructions, D-36 data hazards and, 301–305 caches, B-38 multiply/add and multiply/subtract, D-36 insertion, 303f constant memory, B-40 load-use, 306 DRAM considerations, B-37–B-38 nullification, D-34 as solution to control hazards, 270f global memory, B-39 nullifying branch option, D-25 Pipelined branches, 308b load/store access, B-41 store bytes short, D-36 Pipelined control, 288–292. See also Control local memory, B-40 synthesized multiply and divide, D-34–D-35 memory spaces, B-39 control lines, 288–289, 289 MMU, B-38–B-39 Parity, 481.e4 overview illustration, 304f ROP, B-41 bits, 410–411 specifying, 289 shared memory, B-39–B-40 code, 418, A-64–A-65 Pipelined datapaths, 274–292 surfaces, B-41 PARSEC (Princeton Application Repository for Shared Memory Computers), 530 with connected control signals, 292f texture memory, B-40 with control signals, 288–292 Parallel processing programs, 492–497 corrected, 284f creation difficulty, 492–497 Pass transistor, A-62b–A-64b illustrated, 277f defined, 490 PCI-Express (PCIe), 527, B-7–B-8, 553.e1–553.e2
in load instruction stages, 284f for message passing, 509b–511b Pipelined dependencies, 294f great debates in, 577.e4–577.e6 PC-relative addressing, 115–116, 118 Pipelines for shared address space, 509b–511b Peak floating-point performance, 532 branch instruction impact, 306f use of, 547 Pentium bug morality play, 222f effectiveness, improving, 368.e3–368.e4 Parallel reduction, B-62 Performance, 28–40 execute and address calculation stage, 278, 280 Parallel scan, B-60–B-63 assessing, 28 CUDA template, B-61f classic CPU equation, 36–40 five-stage, 262, 278, 286b–288b inclusive, B-60 components, 38f graphic representation, 267f, 284–288 tree-based, B-62f CPU, 33–35 instruction decode and register file read stage, 276f, 280 Parallel software, 491 defining, 29–32 Parallelism, 12, 43b, 319–332 equation, using, 36–40 instruction fetch stage, 277f, 280 and computer arithmetic, 214–215 improving, 34b–35b instructions sequence, 302f data-level, 224, 498 instruction, 35–36 latency, 274b debates, 577.e4–577.e6 measuring, 32–33, 54.e9 memory access stage, 278, 280 GPUs and, 512, B-76 program, 9–10 multiple-clock-cycle diagrams, 284–285 instruction-level, 43, 319–320, 331 ratio, 31 memory hierarchies and, 452–456, 481.e1–481.e2 relative, 31b performance bottlenecks, 330–331 response time, 30b single-clock-cycle diagrams, 284–285 multicore and, 507b sorting, B-49–B-50 stages, 262–263 multiple issue, 320b throughput, 30b static two-issue, 323f multithreading and, 505 time measurement, 32 write-back stage, 277, 282 performance benefits, 44 Personal computers (PCs), 7f Pipelining, 12, 260–274 process-level, 490 defined, 5 advanced, 331–332 redundant arrays and inexpensive disks, 456 Personal mobile device (PMD) benefits, 260 defined, 6–7 control hazards, 269–272 subword, D-17 Petabyte, 6f data hazards, 266–269

Pipelining (Continued) PowerPC two-issue, 323–325 exceptions and, 315–319 algebraic right shift, D-33 vector, 497–498 execution time and, 274b branch registers, D-32–D-33 VLIW, 322 fallacies, 343–344 condition codes, D-12–D-13 Product, 176 hazards, 265–269 instructions, D-12–D-13 Product of sums, A-11 instruction set design for, 265 instructions unique to, D-32–D-34 Program counters (PCs), 241 laundry analogy, 261f load multiple/store multiple, D-33 changing with conditional branch, 311b–312b overview, 260–274 logical shifted immediate, D-33 paradox, 261–262 rotate with mask, D-33 defined, 99, 241 performance improvement, 265 Precise interrupts, 319b exception, 435, 437 pitfall, 343–344 Prediction, 12 incrementing, 241, 243f simultaneous executing instructions, 274b 2-bit scheme, 310 instruction updates, 277 accuracy, 310 Program performance speed-up formula, 263 dynamic branch, 308–312 elements affecting, 39t structural hazards, 265–266, 282 loops and, 310b understanding, 9 summary, 312–313 steady-state, 310 Programmable array logic (PAL), A-77 throughput and, 274b Prefetching, 470, 534 Programmable logic arrays (PLAs) Pitfalls. See also Fallacies
Primitive types, 150.e25 component dots illustration, A-16f address space extension, 382–383 Procedure calls control function implementation, C-7f, C-20 arithmetic, 220–223 preservation across, 104 associativity, 467b Procedures, 98–108 defined, A-12–A-13 defined, 49 compiling, 100b–101b example, A-13b–A-14b GPUs, B-74 compiling, showing nested procedure linking, 100b–101b illustrated, A-13f ignoring memory system behavior, 466b ROMs and, A-15–A-16 execution steps, 98 size, C-20 memory hierarchies, 466–470 frames, 104 truth table implementation, A-13 out-of-order processor evaluation, 467b leaf, 102 Programmable logic devices (PLDs), A-77 nested, 102b–103b Programmable ROMs (PROMs), A-14 performance equation subset, 50b recursive, 106b Programming languages. See also specific languages pipelining, 343–344 for setting arrays to zero, 141f pointer to automatic variables, 159b sort, 135 brief history of, 173.e6–173.e7 sequential word addresses, 159b strcpy, 110b–111b object-oriented, 145 simulating cache, 466 string copy, 110b–111b variables, 67 software development with multiprocessors, 546b swap, 134 Programs Process identifiers, 436 assembly language, 125 VMM implementation, 468–470 Process-level parallelism, 490 Java, starting, 132–133 Pixel shader example, B-15–B-17 Processors, 232 parallel processing, 490 Pixels, 18 control, 19 starting, 124–133 Pointers as cores, 43 translating, 124–133 arrays versus, 141–144 datapath, 19 Propagate frame, 104 defined, 17b, 19 defined, A-39 global, 104b dynamic multiple-issue, 320 example, A-44b–A-45b incrementing, 143 multiple-issue, 320 super, A-40 Java, 150.e25–150.e26 out-of-order execution, 332b, 406b–407b Protected keywords, 150.e20 stack, 99, 102–104 Protection Polling, 553.e6 performance growth, 44f defined, 418 Pop, 99 ROP, B-12, B-41 implementing, 435–437 Power speculation, 321–322
mechanisms, 497.e12 clock rate and, 40 static multiple-issue, 320, 322–326 VMs for, 414–415 critical nature of, 53 streaming, B-34 Protection group, 481.e4 efficiency, 331–332 superscalar, 326, 505–506, 368.e4 Pseudo MIPS relative, 41b–42b technologies for building, 24–28 defined, 225

Pseudoinstructions right half, 278 defined, 125 RISC-V conventions, 253f summary, 126 spilling, 71 Pthreads (POSIX threads), 530 Status, 314 PTX instructions, B-31, B-32f temporary, 68, 100 Public keywords, 150.e20 variables, 68 Push Relative performance, 31b defined, 99 group types, D-3–D-4 Relative power, 41b–42b using, 102–104 instruction set lineage, D-44f Reliability, 408–409 Reduction, 509 Remainder, defined, 182 Q Redundant arrays of inexpensive disks (RAID), 481.e1–481.e2 Reorder buffers, 332b Replication, 454 Quad words, 151 history, 481.e6–481.e7 Requested word first, 382 Quicksort, 401b–403b, 402f RAID 0, 481.e3 Request-level parallelism, 522 Quotient, 182 RAID 1, 481.e4 Reservation stations RAID 2, 481.e4 buffering operands in, 327 R RAID 3, 481.e4 defined, 327 RAID 4, 481.e4–481.e5 Response time, 30b Race, A-72–A-73 RAID 5, 481.e5–481.e6 Restartable instructions, 438–439 Radix sort, 401b–403b, 402f, B-63–B-65 RAID 6, 481.e6 Return address, 99 CUDA code, B-64f spread of, 481.e5 R-format implementation, B-63–B-65 summary, 481.e6–481.e7 ALU operations, 243f RAID. See Redundant arrays of inexpensive disks (RAID)
use statistics, 481.e6f Ripple carry Reference bit, 426b adder, A-29 RAM, 9 References carry lookahead speed versus, A-45b Raster operation (ROP) processors, B-12, B-41, B-50–B-51 absolute, 127 types, 150.e25 RISC-V, 62, 85–87 fixed function, B-41 Register addressing, 118f architecture, 188f Raster refresh buffer, 18 Register allocation, 150.e10–150.e12 arithmetic instructions, 63 Rasterization, B-50 Register files, A-49b, A-53–A-55 arithmetic/logical instructions not in, D-21f, D-23f Ray casting (RC), 542 in behavioral Verilog, A-56 Read-only memories (ROMs), A-14–A-16 defined, 242, A-49b, A-53 assembly instruction, mapping, 81b–82b single, 247, 247 control entries, C-16b–C-18b two read ports implementation, A-54f common extensions to, D-20–D-25 control function encoding, C-19 with two read ports/one write port, A-54f compiling C assignment statements into, 65b dispatch, C-25f implementation, C-15–C-19 write port implementation, A-55f compiling complex C assignment into, 66b logic function encoding, A-15 Register-memory architecture, 173.e2 overhead, C-18 Registers, 148, 149–151 control instructions not in, D-21f PLAs and, A-15–A-16, A-16 architectural, 314, 335–336 control registers, 437b programmable (PROM), A-14 base, 69 control unit, C-10 total size, C-15–C-16 clock cycle time and, 67 data transfer instructions not in, D-20f, D-22f Read-stall cycles, 389 compiling C assignment with, 67b–68b Read-write head, 371 defined, 67 divide in, 187 Receive message routine, 519 destination, 252 exceptions in, 314–315 Recursive procedures, 106b. See also Procedures
floating-point, 210b fields, 83–89 left half, 278 floating-point instructions, 204–210 clone invocation, 102 number specification, 242 floating-point instructions not in, D-22f Reduced instruction set computer (RISC) architectures, D-3–D-5, D-5–D-9, D-9–D-16, D-16–D-18, D-19, D-20–D-25, D-25–D-27, D-27–D-29, D-29–D-32, D-32–D-34, D-34–D-36, D-36–D-38, D-38–D-39, D-39–D-40, D-40, D-40–D-43, D-43–D-45, 368.e3, 173.e4. See also Desktop and server RISCs; Embedded RISCs page table, 422–423 pipeline, 295–296, 295–296, 296f, 300 instruction classes, 157f primitives, 67 instruction encoding, 85f, 119f renaming, 325 instruction formats, 120, 146f

RISC-V (Continued) Seek , 372 Sign extension, 244 instruction set , 62 , 159–160 , 224 , 234 , Segmentation , 421 b defi ned , 78 b D-9–D-16 Selector values , A-10 shortcut , 78–79 machine language , 87–89 Semiconductors , 25–26 Signals memory addresses , 70f Send message routine , 519 asserted , 240 , A-4 memory allocation for program and Sensitivity list, A-23–A-24 control , 240 , 253 , 253 , 253 data , 106f Sequencers deasserted , 240 , A-4 multiply in , 181 explicit , C-32 Signed division, 185–186 Pseudo , 224f implementing next-state function Signed multiplication, 180 register conventions , 107 f with, C-22–C-28 Signed numbers, 74–81 static multiple issue with, 322–326 Sequential logic , A-4 sign and magnitude , 75 R o o fl ine model, 532–533 , 534 f , 535 Servers , 481.e6 . See also Desktop and treating as unsigned , 96 with ceilings , 536 f server RISCs Signifi cands , 191–192 computational roofl ine , 533 , 535 cost and capability , 5 addition , 196–197 illustrated , 532f Service accomplishment, 408–409 multiplication , 199–203 Opteron generations , 533 Service interruption, 408 Silicon , 25–26 with overlapping areas shaded, 537 f Set-associative caches , 393 . See also as key hardware technology , 53 p e a k fl oating-point performance, 536 f Caches crystal ingot, 26 peak memory performance , 540 f address portions , 397f defi ned , 25–26 with two kernels , 537f block replacement strategies , 443 w a f e r s , 2 6 R o t a t i o n a l d e l a y . 
See Rotational latency choice of , 442 Silicon crystal ingot , 26 Rotational latency , 373 four-way , 394f , 397 SIMD (Single Instruction Multiple Rounding , 210–211 memory-block location , 393f Data) , 496–497 , 548–549 accurate , 210–211 misses , 395 b –396b computers , 577.e1–577.e3 bits , 212 n -way , 393 data vector , B-35 with guard digits , 211 b two-way , 394 f extensions , 577.e3 IEEE 754 modes , 211–212 Set less than instruction (slt) , A-31 for loops and , 577.e2 Row-major order, 209 b –210b , 403 Setup time , A-52–A-53 , A-53 f massively parallel multiprocessors , R-type, defi ned , 87 b sh (store halfword), 64 f 577.e1 R-type instructions, 246 b –247b Shaders small-scale , 577.e3 datapath for, 254–257 defi ned , B-14 vector architecture , 498–500 datapath in operation for, 256 f fl oating-point arithmetic , B-14 in x86 , 498 RV32 , 73b graphics , B-14–B-15 SIMMs (single inline memory modules), RV64 , 73b pixel example , B-15–B-17 497.e5f , 497.e4 Shading languages, B-14 Simple programmable logic devices S Shadowing , 481.e4 (SPLDs), A-77 S h a r e d m e m o r y . See also Memory Simplicity , 65–67 Saturation , 175 b as low-latency memory, B-21 Simultaneous multithreading (SMT) , sb (store byte), 64 f caching in , B-58–B-60 505 SB-type instruction format , 115 CUDA , B-58 support , 505 f sc.d (store conditional) , 64f N-body and , B-66f thread-level parallelism, 505 SCALAPAK , 221–222 per-CTA , B-39 unused issue slots , 505f Scaling SRAM banks , B-40 Single error correcting/Double error strong , 495 Shared memory multiprocessors (SMP), correcting (SEC/DEC) , 410–414 weak , 495 507–512 Single instruction single data (SISD), 498 , Scientifi c notation defi ned , 490–491 , 507–508 502–504 adding numbers in , 197 single physical address space , 507 Single precision. 
See also Double defined, 189 synchronization, 508–511 precision for reals, 189 Shift left logical immediate (slli), 90 binary representation, 194b sd (store doubleword), 64f Shift right arithmetic (srai), 90 defined, 191 Search engines, 4 Shift right logical immediate (srli), 90 Single-clock-cycle pipeline diagrams, Secondary memory, 23 Sign and magnitude, 190 285–288 Sectors, 371–372 Sign bit, 77 illustrated, 287f

Single-cycle datapaths. See also Datapaths least significant bits, D-31f Stacks illustrated, 275f multiple precision floating-point allocating space on, 104 instruction execution, 276f results, D-32 for arguments, 99 Single-cycle implementation nonfaulting loads, D-32 defined, 99 control function for, 259 overlapping integer operations, D-31 pop, 99 nonpipelined execution versus quadruple precision floating-point push, 99, 102–104 pipelined execution, 264f arithmetic, D-36 Stalls, 268 non-use of, 259–260 register windows, D-29–D-30 avoiding with code reordering, penalty, 260 support for LISP and Smalltalk, D-30 268b–269b pipelined performance versus, Sparse matrices, B-55–B-58 behavioral Verilog with detection, 262b–263b Sparse Matrix-Vector multiply (SpMV), 366.e5–366.e7 Single-instruction multiple-thread B-55, B-57f, B-58 data hazards and, 301–305 (SIMT), B-27–B-29 CUDA version, B-57f illustrations, 366.e25 overhead, B-35 serial code, B-57f insertion into pipeline, 303f multithreaded warp scheduling, B-28f shared memory version, B-59f load-use, 306 processor architecture, B-28–B-29 Spatial locality, 364 memory, 389 warp execution and divergence, large block exploitation of, 381 as solution to control hazard, 269 B-29–B-30 tendency, 367 write-back scheme, 390 Single-program multiple data (SPMD), SPEC, 54.e10–54.e11 write buffer, 389 B-22 CPU benchmark, 46–48 Standby spares, 481.e7 sll (shift left logical), 64f power benchmark, 48–49 State slli (shift left logical immediate), 64f SPEC89, 54.e10 in 2-bit prediction scheme, 310 Smalltalk-80, 173.e7, 173.e7 SPEC92, 54.e11 assignment, A-69, C-27 Smart phones, 7 SPEC95, 54.e11 bits, C-8–C-10 Snooping protocol, 454–456 SPEC2000, 54.e11 exception, saving/restoring, 438 Snoopy cache coherence, 482.e16 SPEC2006, 54.e11 logic components, 239 Software optimization SPECrate, 528 specification of, 422b via blocking, 403–407 SPECratio, 47–48 State elements
Software Special function units (SFUs), B-35, clock and, 239 layers, 13f B-50 combinational logic and, 239 multiprocessor, 490 defined, B-42–B-43 defined, 238–239, A-47 parallel, 491 Speculation, 321–322 inputs, 239 as service, 7, 522, 548 hardware-based, 329–331 register file, A-49b systems, 13 implementation, 321 in storing/accessing instructions, Sort algorithms, 141f performance and, 321, 322 242f Sort procedure, 135. See also Procedures problems, 321 Static branch prediction, 322 code for body, 136–138 recovery mechanism, 321 Static data full procedure, 139–140 Speed-up challenge segment, 105 passing parameters in, 138 balancing load, 495b–496b Static multiple-issue processors, 320, preserving registers in, 138–139 bigger problem, 494b–495b 322–326. See also Multiple issue procedure call, 138 Spilling registers, 71b–72b, 99 control hazards and, 322–323 register allocation for, 136–141 Split algorithm, 542 instruction sets, 322 Sorting performance, B-54–B-55 Split caches, 387b with RISC-V ISA, 322–326 Space allocation sra (shift right arithmetic), 64f Static random access memories (SRAMs), on heap, 105–108 srai (shift right arithmetic immediate), 368, 369, A-57–A-66 on stack, 104 64f array organization, A-61f SPARC srl (shift right logical), 64f basic structure, A-60f annulling branch, D-23–D-25 srli (shift right logical immediate), 64f defined, 19–22, A-57 CASA, D-31–D-32 Stack architectures, 173.e3–173.e4 fixed access time, A-57 conditional branches, D-10–D-16 Stack pointers large, A-58 fast traps, D-30 adjustment, 102–104 read/write initiation, A-58 floating-point operations, D-31 defined, 99 synchronous (SSRAMs), A-59 instructions, D-29–D-32 values, 101f three-state buffers, A-58, A-59f

Static variables, 104b Subtraction, 172–175. See also Arithmetic SystemVerilog Steady-state prediction, 310 binary, 172b–173b cache controller, 482.e1–482.e4 Sticky bits, 212 floating-point, 204 cache data and tag modules, 482.e16 Store buffers, 332b negative number, 174 FSM, 482.e6f Store byte, 109 overflow, 174 simple cache block diagram, 482.e3f Store-conditional doubleword, 122–123 Subword parallelism, 214–215, 342f, D-17 type declarations, 482.e1f and matrix multiply, 216–220 Store doubleword, 70–71 Sum of products, A-11, A-12b T Store instructions. See also Supercomputers, 368.e2 Load instructions defined, 5 Tablets, 7f access, B-41 SuperH, D-15, D-39–D-40 Tags base register, 252 Superscalars defined, 374 compiling with, 71 defined, 326, 368.e3–368.e4 in locating block, 397 conditional, 122–123 dynamic pipeline scheduling, 326–327 page tables and, 424 defined, 71b multithreading options, 492 size of, 399b–400b EX stage, 282f Supervisor Exception Cause Register Tail call, 107 ID stage, 279f (SCAUSE), 314 Task identifiers, 436 IF stage, 279f Supervisor exception program counter Task parallelism, B-24 instruction dependency, 300b (SEPC), 314, 362, 437 Task-level parallelism, 490 MEM stage, 281f address capture, 317–319 Tebibyte (TiB), 5 unit for implementing, 245f defined, 315–317 Tesla PTX ISA, B-31 WB stage, 281f in restart determination, 314 arithmetic instructions, B-33 Store word, 113b Supervisor exception return (sret), 435 barrier synchronization, B-34 Stored program concept, 63 Supervisor Page Table Base Register GPU thread instructions, B-32f as computer principle, 88b (SPTBR), 425f memory access instructions, 206 illustrated, 88f Supervisor Trap Vector (STVEC), 319b Temporal locality, 364 principles, 159–160 Surfaces, B-41 tendency, 367 Strcpy procedure, 110b–111b. sw (store word), 64f Temporary registers, 68, 100 See also Procedures Swap procedure, 134.
See also Terabyte (TB), 6f as leaf procedure, 111 Procedures defined, 5 pointers, 111 body code, 134–135 Texture memory, B-40 Stream benchmark, 538b full, 135, 139–140 Texture/processor cluster (TPC), B-47 Streaming multiprocessor (SM), B-13 register allocation, 134–135 TFLOPS multiprocessor, 577.e4–577.e5, Streaming processors, B-34, B-49–B-50 Swap space, 424 577.e5 array (SPA), B-41, B-46 Symbol tables, 126 Thrashing, 440 Streaming SIMD Extension 2 (SSE2) Synchronization, 121–124, 542 Thread blocks, 516f floating-point architecture, 215 barrier, B-18, B-20, B-34 creation, B-23 Streaming SIMD Extensions (SSE) and defined, 508–511 defined, B-19 advanced vector extensions lock, 121 managing, B-30 in x86, 215 overhead, reducing, 44–45 memory sharing, B-20–B-21 Stretch computer, 368.e1f, 368.e1 unlock, 121 synchronization, B-20–B-21 Strings Synchronizers Thread parallelism, B-22 defined, 109–111 from D flip-flop, A-75f Threads in Java, 111–113 defined, A-75 creation, B-23 representation, 108f failure, A-75–A-76 CUDA, B-36 Strip mining, 500b Synchronous DRAM (SDRAM), 369, A-59, ISA, B-31–B-34 Striping, 481.e3 A-64 managing, B-30 Strong scaling, 495 Synchronous SRAM (SSRAM), A-59 memory latencies and, B-74b Structural hazards, 265, 282 Synchronous system, A-47–A-48 multiple, per body, B-68–B-72 STXR (store exclusive register), 122–123 Syntax tree, 150.e2 warps, B-27–B-28 sub (subtract), 64f System calls, defined, 362 Three Cs model, 445b Subnormals, 214 Systems software, 13 Three-state buffers, A-58, A-59f

Throughput U instructions, 499 defined, 29–30 multimedia extensions and, multiple issue and, 320 Unconditional branches, 93 498–500 pipelining and, 262 Underflow, 190 scalar versus, 500–501 Thumb, D-15f, D-38–D-39 Unicode Vectored interrupts, 314 Timing alphabets, 111 Verilog asynchronous inputs, A-75–A-76 defined, 111 behavioral definition of RISC-V ALU, level-sensitive, A-74–A-75 example alphabets, 112f A-25f methodologies, A-71–A-77 Unified GPU architecture, B-10–B-11 behavioral definition with bypassing, two-phase, A-74f illustrated, B-11f 366.e4f TLB misses, 429. See also Translation-lookaside buffer (TLB) processor array, B-11–B-12 behavioral definition with stalls for Uniform memory access (UMA), 508, B-9 loads, 366.e6f handling, 437–439 multiprocessors, 508 behavioral specification, A-21, 366.e1–366.e3 occurrence, 437 Units problem, 440 commit, 327, 332b behavioral specification of multicycle Tomasulo’s algorithm, 368.e2 control, 237–238, 249–251, C-4–C-8, MIPS design, 366.e11f Touchscreen, 19 C-10f, C-12–C-13 behavioral specification with Tournament branch predictors, defined, 211–212 simulation, 366.e1–366.e3 311–312 floating point, 211–212 behavioral specification with stall Tracks, 371–372 hazard detection, 301, 304–305 detection, 366.e5–366.e7 Transfer time, 373 for load/store implementation, 245f behavioral specification with synthesis, Transistors, 25 special function (SFUs), B-35, 366.e10–366.e16 Translation-lookaside buffer (TLB), B-42–B-43, B-50 blocking assignment, A-24 428–430, D-26–D-27, 497.e5.
UNIVAC I, 54.e4f, 54.e3–54.e4 branch hazard logic implementation, See also TLB misses UNIX, 173.e7, 497.e10, 497.e13, 497.e13, 366.e7–366.e10 associativities, 430 497.e14 combinational logic, A-23–A-26 illustrated, 429f AT&T, 497.e14 datatypes, A-21–A-23 integration, 433 Berkeley version (BSD), 497.e14 defined, A-20–A-21 Intrinsity FastMATH, 430–433 genius, 497.e16 forwarding implementation, 366.e3–366.e5 typical values, 430 history, 497.e13, 497.e14 Transmit driver and NIC hardware Unlock synchronization, 121 modules, A-23f time versus receive driver and NIC Unsigned numbers, 74–81 multicycle MIPS datapath, 366.e13f hardware time, 553.e7f Use latency nonblocking assignment, A-24 Tree-based parallel scan, B-62f defined, 323–325 operators, A-22–A-23 Truth tables, A-5 one-instruction, 323–325 program structure, A-23 ALU control lines, C-5f reg, A-21, A-21 for control bits, 251 V RISC-V ALU definition in, A-36–A-37 datapath control outputs, C-17f sensitivity list, A-23–A-24 datapath control signals, C-14f Vacuum tubes, 25f sequential logic specification, defined, 251 Valid bit, 374–376 A-55–A-57 example, A-5b Variables structural specification, A-21 next-state output bits, C-15f C language, 104b wire, A-21, A-21, A-22 PLA implementation, A-13 programming language, 67 Vertical microcode, C-32 Two’s complement representation, register, 67 Very large-scale integrated (VLSI) 76 static, 104b circuits, 25 advantage, 77 storage class, 104b Very Long Instruction Word (VLIW) negation shortcut, 78b–79b type, 104b defined, 322 rule, 80b VAX architecture, 173.e3, 497.e6 first generation computers, sign extension shortcut, 79b–80b Vector lanes, 500 368.e4 Two-level logic, A-11–A-14 Vector processors, 497–504.
processors, 322 Two-phase clocking, A-74, A-74f See also Processors VHDL, A-20–A-21 TX-2 computer, 577.e3 conventional code comparison, Video graphics array (VGA) controllers, 499b–500b B-3–B-4

Virtual addresses Warps, B-27–B-28 write-through cache, 384, 385 causing page faults, 438 Weak scaling, 495 Write-stall cycles, 389 defined, 418–419 Wear levelling, 371 Write-through caches. See also Caches mapping from, 418–419 While loops, 94b–95b advantages, 444 size, 420–421 Whirlwind, 497.e1 defined, 383, 444 Virtual machine monitors (VMMs) Wide area networks (WANs), 24. See also tag mismatch, 384 defined, 414 Networks implementing, 468b Wide immediate operands, 113–114 X laissez-faire attitude, 468 Words page tables, 439b accessing, 68 x86, 146–155 in performance improvement, 417 defined, 67 Advanced Vector Extensions in, requirements, 416 double, 151 215–216 Virtual machines (VMs), 414–417 load, 69, 71 brief history, 173.e5–173.e6 benefits, 414–415 quad, 151 conclusion, 154–155 illusion, 439b store, 71b data addressing modes, 149–151 instruction set architecture support, Working set, 440 evolution, 96 417 World Wide Web, 4 first address specifier encoding, 155f performance improvement, 417 Worst-case delay, 260 instruction encoding, 153–154 for protection improvement, 414–415 Write buffers instruction formats, 154f Virtual memory, 417–441. See also Pages defined, 385 instruction set growth, 156f address translation, 418–419, 428–430 stalls, 381 instruction types, 152f integration, 433–435 write-back cache, 385 integer operations, 151–152 for large virtual addresses, 426–427 Write invalidate protocols, 454 registers, 149–151 mechanism, 440 Write serialization, 453–454 SIMD in, 496–497 motivations, 417–418 Write-back caches.
See also Caches Streaming SIMD Extensions in, page faults, 418–419, 424 advantages, 444 215–216 protection implementation, 435–437 cache coherency protocol, 482.e4 typical instructions/functions, 154f segmentation, 421b complexity, 385 typical operations, 153f summary, 439–441 defined, 384, 444 unique, D-36–D-38 virtualization of, 439b stalls, 389 Xerox Alto computer, 54.e7–54.e9 writes, 428 write buffers, 385 XMM, 215 Virtualizable hardware, 416 Write-back stage xor (exclusive or), 64f Virtually addressed caches, 434 control line, 290f xori (exclusive or immediate), 64f Visual computing, B-3 load instruction, 280 Volatile memory, 22 store instruction, 282 Y Writes W complications, 384b–385b Yahoo! Cloud Serving Benchmark expense, 440 (YCSB), 530 Wafers, 26 handling, 383–385 Yield, 27 defects, 26–27 memory hierarchy handling of, YMM, 216 dies, 27, 27, 27, 28 331–332 yield, 27 schemes, 384 Z Warehouse Scale Computers (WSCs), 7, virtual memory, 427 519–524, 548 write-back cache, 384, 385 Zettabyte, 6f