Index

Note: Online information is listed by chapter and section number followed by page (OL3.11-7). Page references preceded by a single letter with hyphen refer to appendices.

1-bit ALU, B-26–29. See also Arithmetic addiu (Add Imm.Unsigned), 119 AGP, C-9 logic unit (ALU) Address interleaving, 381 Algol-60, OL2.21-7 adder, B-27 Address select logic, D-24, D-25 Aliasing, 444 CarryOut, B-28 Address space, 428, 431 Alignment restriction, 69–70 for most signifi cant bit, B-33 extending, 479 All-pairs N-body algorithm, C-65 illustrated, B-29 fl at, 479 Alpha architecture logical unit for AND/OR, B-27 ID (ASID), 446 bit count instructions, E-29 performing AND, OR, and addition, inadequate, OL5.17-6 fl oating-point instructions, E-28 B-31, B-33 shared, 519–520 instructions, E-27–29 32-bit ALU, B-29–38. See also Arithmetic single physical, 517 no divide, E-28 logic unit (ALU) unmapped, 450 PAL code, E-28 defi ning in Verilog, B-35–38 virtual, 446 unaligned load-store, E-28 from 31 copies of 1-bit ALU, B-34 Address translation VAX fl oating-point formats, E-29 illustrated, B-36 for ARM cortex-A8, 471 ALU control, 259–261. See also ripple carry adder, B-29 defi ned, 429 Arithmetic logic unit (ALU) tailoring to MIPS, B-31–35 fast, 438–439 bits, 260 with 32 1-bit ALUs, B-30 for Intel core i7, 471 logic, D-6 32-bit immediate operands, 112–113 TLB for, 438–439 mapping to gates, D-4–7 7090/7094 hardware, OL3.11-7 Address-control lines, D-26 truth tables, D-5 Addresses ALU control block, 263 A 32-bit immediates, 113–116 defi ned, D-4 base, 69 generating ALU control bits, D-6 Absolute references, 126 byte, 69 ALUOp, 260, D-6 Abstractions defi ned, 68 bits, 260, 261 hardware/soft ware interface, 22 memory, 77 control signal, 263 principle, 22 virtual, 428–431, 450 Amazon Web Services (AWS), 425 to simplify design, 11 Addressing AMD Opteron X4 (Barcelona), 543, 544 Accumulator architectures, OL2.21-2 32-bit immediates, 113–116 AMD64, 151, 224, OL2.21-6 Acronyms, 9 base, 116 Amdahl’s law, 401, 503 Active matrix, 18 displacement, 116 corollary, 49 add (Add), 64 immediate, 116 defi ned, 49 add.d (FP Add Double), A-73 in jumps and branches, 113–116 fallacy, 556 add.s (FP Add Single), A-74 MIPS modes, 116–118 and (AND), 64 Add unsigned instruction, 180 PC-relative, 114, 116 AND gates, B-12, D-7 addi (Add Immediate), 64 pseudodirect, 116 AND operation, 88 Addition, 178–182. See also Arithmetic register, 116 AND operation, A-52, B-6 binary, 178–179 x86 modes, 152 andi (And Immediate), 64 fl oating-point, 203–206, 211, A-73–74 Addressing modes, A-45–47 Annual failure rate (AFR), 418 instructions, A-51 desktop architectures, E-6 versus.MTTF of disks, 419–420 operands, 179 addu (Add Unsigned), 64 Antidependence, 338 signifi cands, 203 Advanced Vector Extensions (AVX), 225, Antifuse, B-78 speed, 182 227 Apple computer, OL1.12-7 I-1 I-2 Index

Apple iPad 2 A1395, 20 TLB hardware for, 471 programs, 123 logic board of, 20 ARM instructions, 145–147 translating into machine language, 84 processor integrated circuit of, 21 12-bit immediate fi eld, 148 when to use, A-7–9 Application binary interface (ABI), 22 addressing modes, 145 Asserted signals, 250, B-4 Application programming interfaces block loads and stores, 149 Associativity (APIs) brief history, OL2.21-5 in caches, 405 defi ned, C-4 calculations, 145–146 degree, increasing, 404, 455 graphics, C-14 compare and conditional branch, increasing, 409 Architectural registers, 347 147–148 set, tag size versus, 409 Arithmetic, 176–236 condition fi eld, 324 Atomic compare and swap, 123 addition, 178–182 data transfer, 146 Atomic exchange, 121 addition and subtraction, 178–182 features, 148–149 Atomic fetch-and-increment, 123 division, 189–195 formats, 148 Atomic memory operation, C-21 fallacies and pitfalls, 229–232 logical, 149 Attribute interpolation, C-43–44 fl oating-point, 196–222 MIPS similarities, 146 Automobiles, computer application in, 4 historical perspective, 236 register-register, 146 Average memory access time (AMAT), multiplication, 183–188 unique, E-36–37 402 parallelism and, 222–223 ARMv7, 62 calculating, 403 Streaming SIMD Extensions and ARMv8, 158–159 advanced vector extensions in x86, ARPANET, OL1.12-10 B 224–225 Arrays, 415 subtraction, 178–182 logic elements, B-18–19 Backpatching, A-13 subword parallelism, 222–223 multiple dimension, 218 Bandwidth, 30–31 subword parallelism and matrix pointers versus, 141–145 bisection, 532 multiply, 225–228 procedures for setting to zero, 142 external to DRAM, 398 Arithmetic instructions. See also ASCII memory, 380–381, 398 Instructions binary numbers versus, 107 network, 535 desktop RISC, E-11 character representation, 106 Barrier synchronization, C-18 embedded RISC, E-14 defi ned, 106 defi ned, C-20 logical, 251 symbols, 109 for thread communication, C-34 MIPS, A-51–57 Assembler directives, A-5 Base addressing, 69, 116 operands, 66–, 73 Assemblers, 124–126, A-10–17 Base registers, 69 Arithmetic intensity, 541 conditional code assembly, A-17 Basic block, 93 Arithmetic logic unit (ALU). See also defi ned, 14, A-4 Benchmarks, 538–540 ALU control; Control units function, 125, A-10 defi ned, 46 1-bit, B-26–29 macros, A-4, A-15–17 Linpack, 538, OL3.11-4 32-bit, B-29–38 microcode, D-30 multicores, 522–529 before forwarding, 309 number acceptance, 125 multiprocessor, 538–540 branch datapath, 254 object fi le, 125 NAS parallel, 540 hardware, 180 pseudoinstructions, A-17 parallel, 539 memory-reference instruction use, 245 relocation information, A-13, A-14 PARSEC suite, 540 for register values, 252 speed, A-13 SPEC CPU, 46–48 R-format operations, 253 symbol table, A-12 SPEC power, 48–49 signed-immediate input, 312 Assembly language, 15 SPECrate, 538–539 ARM Cortex-A8, 244, 345–346 defi ned, 14, 123 Stream, 548 address translation for, 471 drawbacks, A-9–10 beq (Branch On Equal), 64 caches in, 472 fl oating-point, 212 bge (Branch Greater Th an or Equal), 125 data cache miss rates for, 474 high-level languages versus, A-12 bgt (Branch Greater Th an), 125 memory hierarchies of, 471–475 illustrated, 15 Biased notation, 79, 200 performance of, 473–475 MIPS, 64, 84, A-45–80 Big-endian byte order, 70, A-43 specifi cation, 345 production of, A-8–9 Binary numbers, 81–82 Index I-3

ASCII versus, 107 operations, 254 C conversion to decimal numbers, 76 Branch delay slots defi ned, 73 defi ned, 322 C.mmp, OL6.15-4 Bisection bandwidth, 532 scheduling, 323 C language Bit maps Branch equal, 318 assignment, compiling into MIPS, defi ned, 18, 73 Branch instructions, A-59–63 65–66 goal, 18 jump instruction versus, 270 compiling, 145, OL2.15-2–2.15-3 storing, 18 list of, A-60–63 compiling assignment with registers, Bit-Interleaved Parity (RAID 3), OL5.11- pipeline impact, 317 67–68 5 Branch not taken compiling while loops in, 92 Bits assumption, 318 sort algorithms, 141 ALUOp, 260, 261 defi ned, 254 translation hierarchy, 124 defi ned, 14 Branch prediction translation to MIPS assembly language, dirty, 437 as control hazard solution, 284 65 guard, 220 buff ers, 321, 322 variables, 102 patterns, 220–221 defi ned, 283 C++ language, OL2.15-27, OL2.21-8 reference, 435 dynamic, 284, 321–323 Cache blocking and matrix multiply, rounding, 220 static, 335 475–476 sign, 75 Branch predictors Cache coherence, 466–470 state, D-8 accuracy, 322 coherence, 466 sticky, 220 correlation, 324 consistency, 466 valid, 383 information from, 324 enforcement schemes, 467–468 ble (Branch Less Th an or Equal), 125 tournament, 324 implementation techniques, Blocking assignment, B-24 Branch taken OL5.12-11–5.12-12 Blocking factor, 414 cost reduction, 318 migration, 467 Block-Interleaved Parity (RAID 4), defi ned, 254 problem, 466, 467, 470 OL5.11-5–5.11-6 Branch target protocol example, OL5.12-12–5.12-16 Blocks addresses, 254 protocols, 468 combinational, B-4 buff ers, 324 replication, 468 defi ned, 376 Branches. See also Conditional snooping protocol, 468–469 fi nding, 456 branches snoopy, OL5.12-17 fl exible placement, 402–404 addressing in, 113–116 state diagram, OL5.12-16 least recently used (LRU), 409 compiler creation, 91 Cache coherency protocol, OL5.12- loads/stores, 149 condition, 255 12–5.12-16 locating in cache, 407–408 decision, moving up, 318 fi nite-state transition diagram, OL5.12- miss rate and, 391 delayed, 96, 255, 284, 318–319, 322, 15 multiword, mapping addresses to, 390 324 functioning, OL5.12-14 placement locations, 455–456 ending, 93 mechanism, OL5.12-14 placement strategies, 404 execution in ID stage, 319 state diagram, OL5.12-16 replacement selection, 409 pipelined, 318 states, OL5.12-13 replacement strategies, 457 target address, 318 write-back cache, OL5.12-15 spatial locality exploitation, 391 unconditional, 91 Cache controllers, 470 state, B-4 Branch-on-equal instruction, 268 coherent cache implementation valid data, 386 Bubble Sort, 140 techniques, OL5.12-11–5.12-12 blt (Branch Less Th an), 125 Bubbles, 314 implementing, OL5.12-2 bne (Branch On Not Equal), 64 Bus-based coherent multiprocessors, snoopy cache coherence, OL5.12-17 Bonding, 28 OL6.15-7 SystemVerilog, OL5.12-2 Boolean algebra, B-6 Buses, B-19 Cache hits, 443 Bounds check shortcut, 95 Bytes Cache misses Branch datapath addressing, 70 block replacement on, 457 ALU, 254 order, 70, A-43 capacity, 459 I-4 Index

Cache misses (Continued) virtual memory and TLB integration, defi ned, 33 compulsory, 459 440–441 memory-stall, 399 confl ict, 459 virtually addressed, 443 number of registers and, 67 defi ned, 392 virtually indexed, 443 worst-case delay and, 272 direct-mapped cache, 404 virtually tagged, 443 Clock cycles per instruction (CPI), 35, fully associative cache, 406 write-back, 394, 395, 458 282 handling, 392–393 write-through, 393, 395, 457 one level of caching, 410 memory-stall clock cycles, 399 writes, 393–395 two levels of caching, 410 reducing with fl exible block placement, Callee, 98, 99 Clock rate 402–404 Callee-saved register, A-23 defi ned, 33 set-associative cache, 405 Caller, 98 frequency switched as function of, 41 steps, 393 Caller-saved register, A-23 power and, 40 in write-through cache, 393 Capabilities, OL5.17-8 Clocking methodology, 249–251, B-48 Cache performance, 398–417 Capacity misses, 459 edge-triggered, 249, B-48, B-73 calculating, 400 Carry lookahead, B-38–47 level-sensitive, B-74, B-75–76 hit time and, 401–402 4-bit ALUs using, B-45 for predictability, 249 impact on processor performance, 400 adder, B-39 Clocks, B-48–50 Cache-aware instructions, 482 fast, with fi rst level of abstraction, edge, B-48, B-50 Caches, 383–398. See also Blocks B-39–40 in edge-triggered design, B-73 accessing, 386–389 fast, with “infi nite” hardware, B-38–39 skew, B-74 in ARM cortex-A8, 472 fast, with second level of abstraction, specifi cation, B-57 associativity in, 405–406 B-40–46 synchronous system, B-48–49 bits in, 390 plumbing analogy, B-42, B-43 Cloud computing, 533 bits needed for, 390 ripple carry speed versus, B-46 defi ned, 7 contents illustration, 387 summary, B-46–47 Cluster networking, 537–538, OL6.9-12 defi ned, 21, 383–384 Carry save adders, 188 Clusters, OL6.15-8–6.15-9 direct-mapped, 384, 385, 390, 402 Cause register defi ned, 30, 500, OL6.15-8 empty, 386–387 defi ned, 327 isolation, 530 FSM for controlling, 461–462 fi elds, A-34, A-35 organization, 499 fully associative, 403 OLC 6600, OL1.12-7, OL4.16-3 scientifi c computing on, OL6.15-8 GPU, C-38 Cell phones, 7 Cm*, OL6.15-4 inconsistent, 393 Central processor unit (CPU). See also CMOS (complementary metal oxide index, 388 Processors semiconductor), 41 in Intel Core i7, 472 classic performance equation, 36–40 Coarse-grained multithreading, 514 Intrinsity FastMATH example, coprocessor 0, A-33–34 Cobol, OL2.21-7 395–398 defi ned, 19 Code generation, OL2.15-13 locating blocks in, 407–408 execution time, 32, 33–34 Code , OL2.15-7 locations, 385 performance, 33–35 Cold-start miss, 459 multilevel, 398, 410 system, time, 32 Collision misses, 459 nonblocking, 472 time, 399 Column major order, 413 physically addressed, 443 time measurements, 33–34 Combinational blocks, B-4 physically indexed, 443 user, time, 32 Combinational control units, D-4–8 physically tagged, 443 Cg pixel shader program, C-15–17 Combinational elements, 248 primary, 410, 417 Characters Combinational logic, 249, B-3, B-9–20 secondary, 410, 417 ASCII representation, 106 arrays, B-18–19 set-associative, 403 in Java, 109–111 decoders, B-9 simulating, 478 Chips, 19, 25, 26 defi ned, B-5 size, 389 manufacturing process, 26 don’t cares, B-17–18 split, 397 Classes multiplexors, B-10 summary, 397–398 defi ned, OL2.15-15 ROMs, B-14–16 tag fi eld, 388 packages, OL2.15-21 two-level, B-11–14 tags, OL5.12-3, OL5.12-11 Clock cycles Verilog, B-23–26 Index I-5

Commercial computer development, Moore’s law, 11 forwarding, 307 OL1.12-4–1.12-10 parallelism, 12 FSM, D-8–21 Commit units pipelining, 12 implementation, optimizing, D-27–28 buff er, 339–340 prediction, 12 for jump instruction, 270 defi ned, 339–340 Computers mapping to hardware, D-2–32 in update control, 343 application classes, 5–6 memory, D-26 Common case fast, 11 applications, 4 organizing, to reduce logic, D-31–32 Common subexpression elimination, arithmetic for, 176–236 pipelined, 300–303 OL2.15-6 characteristics, OL1.12-12 Control fl ow graphs, OL2.15-9–2.15-10 Communication, 23–24 commercial development, OL1.12- illustrated examples, OL2.15-9, overhead, reducing, 44–45 4–1.12-10 OL2.15-10 thread, C-34 component organization, 17 Control functions Compact code, OL2.21-4 components, 17, 177 ALU, mapping to gates, D-4–7 Comparison instructions, A-57–59 design measure, 53 defi ning, 264 fl oating-point, A-74–75 desktop, 5 PLA, implementation, D-7, list of, A-57–59 embedded, 5, A-7 D-20–21 Comparisons, 93 fi rst, OL1.12-2–1.12-4 ROM, encoding, D-18–19 constant operands in, 93 in information revolution, 4 for single-cycle implementation, 269 signed versus unsigned, 94–95 instruction representation, 80–87 Control hazards, 281–282, 316–325 Compilers, 123–124 performance measurement, OL1.12-10 branch delay reduction, 318–319 branch creation, 92 PostPC Era, 6–7 branch not taken assumption, 318 brief history, OL2.21-9 principles, 86 branch prediction as solution, 284 conservative, OL2.15-6 servers, 5 delayed decision approach, 284 defi ned, 14 Condition fi eld, 324 dynamic branch prediction, front end, OL2.15-3 Conditional branches 321–323 function, 14, 123–124, A-5–6 ARM, 147–148 logic implementation in Verilog, high-level optimizations, OL2.15-4 changing program counter with, 324 OL4.13-8 ILP exploitation, OL4.16-5 compiling if-then-else into, 91 pipeline stalls as solution, 282 Just In Time (JIT), 132 defi ned, 90 pipeline summary, 324 machine language production, A-8–9, desktop RISC, E-16 simplicity, 317 A-10 embedded RISC, E-16 solutions, 282 optimization, 141, OL2.21-9 implementation, 96 static multiple-issue processors and, speculation, 333–334 in loops, 115 335–336 structure, OL2.15-2 PA-RISC, E-34, E-35 Control lines Compiling PC-relative addressing, 114 asserted, 264 C assignment statements, 65–66 RISC, E-10–16 in datapath, 263 C language, 92–93, 145, OL2.15- SPARC, E-10–12 execution/address calculation, 300 2–2.15-3 Conditional move instructions, 324 fi nal three stages, 303 fl oating-point programs, 214–217 Confl ict misses, 459 instruction decode/register fi le read, if-then-else, 91 Constant memory, C-40 300 in Java, OL2.15-19 Constant operands, 72–73 instruction fetch, 300 procedures, 98, 101–102 in comparisons, 93 memory access, 302 recursive procedures, 101–102 frequent occurrence, 72 setting of, 264 while loops, 92–93 Constant-manipulating instructions, values, 300 Compressed sparse row (CSR) matrix, A-57 write-back, 302 C-55, C-56 Content Addressable Memory (CAM), Control signals Compulsory misses, 459 408 ALUOp, 263 Computer architects, 11–12 Context switch, 446 defi ned, 250 abstraction to simplify design, 11 Control eff ect of, 264 common case fast, 11 ALU, 259–261 multi-bit, 264 dependability via redundancy, 12 challenge, 325–326 pipelined datapaths with, 300–303 hierarchy of memories, 12 fi nishing, 269–270 truth tables, D-14 I-6 Index

Control units, 247. See also Arithmetic shared memories, C-18 in operation for R-type instruction, logic unit (ALU) threads, C-36 266 address select logic, D-24, D-25 Cyclic redundancy check, 423 operation of, 264–269 combinational, implementing, D-4–8 Cylinder, 381 pipelined, 286–303 with explicit counter, D-23 for R-type instructions, 256, 264–265 illustrated, 265 D single, creating, 256 logic equations, D-11 single-cycle, 283 main, designing, 261–264 D fl ip-fl ops, B-51, B-53 static two-issue, 336 as microcode, D-28 D latches, B-51, B-52 Deasserted signals, 250, B-4 MIPS, D-10 Data bits, 421 Debugging information, A-13 -state outputs, D-10, D-12–13 Data fl ow analysis, OL2.15-11 DEC PDP-8, OL2.21-3 output, 259–261, D-10 Data hazards, 278, 303–316.See also Decimal numbers Conversion instructions, A-75–76 Hazards binary number conversion to, 76 Cooperative thread arrays (CTAs), C-30 forwarding, 278, 303–316 defi ned, 73 Coprocessors, A-33–34 load-use, 280, 318 Decision-making instructions, 90–96 defi ned, 218 stalls and, 313–316 Decoders, B-9 move instructions, A-71–72 Data layout directives, A-14 two-level, B-65 Core MIPS instruction set, 236. See also Data movement instructions, A-70–73 Decoding machine language, 118–120 MIPS Data parallel problem decomposition, Defect, 26 abstract view, 246 C-17, C-18 Delayed branches, 96.See also Branches desktop RISC, E-9–11 Data race, 121 as control hazard solution, 284 implementation, 244–248 Data segment, A-13 defi ned, 255 implementation illustration, 247 Data selectors, 246 embedded RISCs and, E-23 overview, 245 Data transfer instructions.See also for fi ve-stage pipelines, 26, 323–324 subset, 244 Instructions reducing, 318–319 Cores defi ned, 68 scheduling limitations, 323 defi ned, 43 load, 68 Delayed decision, 284 number per chip, 43 off set, 69 DeMorgan’s theorems, B-11 Correlation predictor, 324 store, 71 Denormalized numbers, 222 Cosmic Cube, OL6.15-7 Datacenters, 7 Dependability via redundancy, 12 Count register, A-34 Data-level parallelism, 508 Dependable memory hierarchy, 418–423 CPU, 9 Datapath elements failure, defi ning, 418 Cray computers, OL3.11-5–3.11-6 defi ned, 251 Dependences Critical word fi rst, 392 sharing, 256 between pipeline registers, 308 Crossbar networks, 535 Datapaths between pipeline registers and ALU CTSS (Compatible Time-Sharing branch, 254 inputs, 308 System), OL5.18-9 building, 251–259 bubble insertion and, 314 CUDA programming environment, 523, control signal truth tables, D-14 detection, 306–308 C-5 control unit, 265 name, 338 barrier synchronization, C-18, C-34 defi ned, 19 sequence, 304 development, C-17, C-18 design, 251 Design hierarchy of thread groups, C-18 exception handling, 329 compromises and, 161 kernels, C-19, C-24 for fetching instructions, 253 datapath, 251 key abstractions, C-18 for hazard resolution via forwarding, digital, 354 paradigm, C-19–23 311 logic, 248–251, B-1–79 parallel plus-scan template, C-61 for jump instruction, 270 main control unit, 261–264 per-block shared memory, C-58 for memory instructions, 256 memory hierarchy, challenges, 460 plus-reduction implementation, C-63 for MIPS architecture, 257 pipelining instruction sets, 277 programs, C-6, C-24 in operation for branch-on-equal Desktop and server RISCs.See also scalable parallel programming with, instruction, 268 Reduced instruction set computer C-17–23 in operation for load instruction, 267 (RISC) architectures Index I-7

addressing modes, E-6 Dividend, 189 bandwidth external to, 398 architecture summary, E-4 Division, 189–195 cost, 23 arithmetic/logical instructions, E-11 algorithm, 191 defi ned, 19, B-63 conditional branches, E-16 dividend, 189 DIMM, OL5.17-5 constant extension summary, E-9 divisor, 189 Double Date Rate (DDR), 379–380 control instructions, E-11 Divisor, 189 early board, OL5.17-4 conventions equivalent to MIPS core, divu (Divide Unsigned), A-52.See also GPU, C-37–38 E-12 Arithmetic growth of capacity, 25 data transfer instructions, E-10 faster, 194 history, OL5.17-2 features added to, E-45 fl oating-point, 211, A-76 internal organization of, 380 fl oating-point instructions, E-12 hardware, 189–192 pass transistor, B-63 instruction formats, E-7 hardware, improved version, 192 SIMM, OL5.17-5, OL5.17-6 multimedia extensions, E-16–18 instructions, A-52–53 single-transistor, B-64 multimedia support, E-18 in MIPS, 194 size, 398 types of, E-3 operands, 189 speed, 23 Desktop computers, defi ned, 5 quotient, 189 synchronous (SDRAM), 379–380, Device driver, OL6.9-5 remainder, 189 B-60, B-65 DGEMM (Double precision General signed, 192–194 two-level decoder, B-65 Matrix Multiply), 225, 352, 413, 553 SRT, 194 Dynamically linked libraries (DLLs), cache blocked version of, 415 Don’t cares, B-17–18 129–131 optimized C version of, 226, 227, 476 example, B-17–18 defi ned, 129 performance, 354, 416 term, 261 lazy procedure linkage version, 130 Dicing, 27 Double data rate (DDR), 379 Dies, 26, 26–27 Double Data Rate RAMs (DDRRAMs), E Digital design pipeline, 354 379–380, B-65 Digital signal-processing (DSP) Double precision.See also Single precision Early restart, 392 extensions, E-19 defi ned, 198 Edge-triggered clocking methodology, DIMMs (dual inline memory modules), FMA, C-45–46 249, 250, B-48, B-73 OL5.17-5 GPU, C-45–46, C-74 advantage, B-49 Direct Data IO (DDIO), OL6.9-6 representation, 201 clocks, B-73 Direct memory access (DMA), OL6.9-4 Double words, 152 drawbacks, B-74 Direct3D, C-13 Dual inline memory modules (DIMMs), illustrated, B-50 Direct-mapped caches.See also Caches 381 rising edge/falling edge, B-48 address portions, 407 Dynamic branch prediction, 321–323.See EDSAC (Electronic Delay Storage choice of, 456 also Control hazards Automatic Calculator), OL1.12-3, defi ned, 384, 402 branch prediction buff er, 321 OL5.17-2 illustrated, 385 loops and, 321–323 Eispack, OL3.11-4 memory block location, 403 Dynamic hardware predictors, 284 Electrically erasable programmable read- misses, 405 Dynamic multiple-issue processors, 333, only memory (EEPROM), 381 single comparator, 407 339–341.See also Multiple issue Elements total number of bits, 390 pipeline scheduling, 339–341 combinational, 248 Dirty bit, 437 superscalar, 339 datapath, 251, 256 Dirty , 437 Dynamic pipeline scheduling, 339–341 memory, B-50–58 Disk memory, 381–383 commit unit, 339–340 state, 248, 250, 252, B-48, B-50 Displacement addressing, 116 concept, 339–340 Embedded computers, 5 Distributed Block-Interleaved Parity hardware-based speculation, 341 application requirements, 6 (RAID 5), OL5.11-6 primary units, 340 defi ned, A-7 div (Divide), A-52 reorder buff er, 343 design, 5 div.d (FP Divide Double), A-76 reservation station, 339–340 growth, OL1.12-12–1.12-13 div.s (FP Divide Single), A-76 Dynamic random access memory Embedded Benchmark Divide algorithm, 190 (DRAM), 378, 379–381, B-63–65 Consortium (EEMBC), OL1.12-12 I-8 Index

Embedded RISCs. See also Reduced datapath with controls for handling, pipelining, 355–356 instruction set computer (RISC) 329 powerful instructions mean higher architectures defi ned, 180, 326 performance, 159 addressing modes, E-6 detecting, 326 right shift , 229 architecture summary, E-4 event types and, 326 False sharing, 469 arithmetic/logical instructions, E-14 imprecise, 331–332 Fast carry conditional branches, E-16 instructions, A-80 with “infi nite” hardware, B-38–39 constant extension summary, E-9 interrupts versus, 325–326 with fi rst level of abstraction, B-39–40 control instructions, E-15 in MIPS architecture, 326–327 with second level of abstraction, data transfer instructions, E-13 overfl ow, 329 B-40–46 delayed branch and, E-23 PC, 445, 446–447 Fast Fourier Transforms (FFT), C-53 DSP extensions, E-19 pipelined computer example, 328 Fault avoidance, 419 general purpose registers, E-5 in pipelined implementation, 327–332 Fault forecasting, 419 instruction conventions, E-15 precise, 332 Fault tolerance, 419 instruction formats, E-8 reasons for, 326–327 Fermi architecture, 523, 552 multiply-accumulate approaches, E-19 result due to overfl ow in add Field programmable devices (FPDs), B-78 types of, E-4 instruction, 330 Field programmable gate arrays (FPGAs), Encoding saving/restoring stage on, 450 B-78 defi ned, D-31 Exclusive OR (XOR) instructions, A-57 Fields fl oating-point instruction, 213 Executable fi les, A-4 Cause register, A-34, A-35 MIPS instruction, 83, 119, A-49 defi ned, 126 defi ned, 82 ROM control function, D-18–19 linker production, A-19 format, D-31 ROM logic function, B-15 Execute or address calculation stage, 292 MIPS, 82–83 x86 instruction, 155–156 Execute/address calculation names, 82 ENIAC (Electronic Numerical Integrator control line, 300 Status register, A-34, A-35 and Calculator), OL1.12-2, OL1.12- load instruction, 292 Files, register, 252, 257, B-50, B-54–56 3, OL5.17-2 store instruction, 292 Fine-grained multithreading, 514 EPIC, OL4.16-5 Execution time Finite-state machines (FSMs), 451–466, Error correction, B-65–67 as valid performance measure, 51 B-67–72 Error Detecting and Correcting Code CPU, 32, 33–34 control, D-8–22 (RAID 2), OL5.11-5 pipelining and, 286 controllers, 464 Error detection, B-66 Explicit counters, D-23, D-26 for multicycle control, D-9 Error detection code, 420 Exponents, 197–198 for simple cache controller, 464–466 Ethernet, 23 External labels, A-10 implementation, 463, B-70 EX stage Mealy, 463 load instructions, 292 F Moore, 463 overfl ow exception detection, 328 next-state function, 463, B-67 store instructions, 294 Facilities, A-14–17 output function, B-67, B-69 Exabyte, 6 Failures, synchronizer, B-77 state assignment, B-70 Exception enable, 447 Fallacies. See also Pitfalls state register implementation, B-71 Exception handlers, A-36–38 add immediate unsigned, 227 style of, 463 defi ned, A-35 Amdahl’s law, 556 synchronous, B-67 return from, A-38 arithmetic, 229–232 SystemVerilog, OL5.12-7 Exception program counters (EPCs), 326 assembly language for performance, traffi c light example, B-68–70 address capture, 331 159–160 Flash memory, 381 copying, 181 commercial binary compatibility characteristics, 23 defi ned, 181, 327 importance, 160 defi ned, 23 in restart determination, 326–327 defi ned, 49 Flat address space, 479 transferring, 182 GPUs, C-72–74, C-75 Flip-fl ops Exceptions, 325–332, A-33–38 low utilization uses little power, 50 D fl ip-fl ops, B-51, B-53 association, 331–332 peak performance, 556 defi ned, B-51 Index I-9

Floating point, 196–222, 224 Floating-point instructions, A-73–80 Fully associative caches. See also Caches assembly language, 212 absolute value, A-73 block replacement strategies, 457 backward step, OL3.11-4–3.11-5 addition, A-73–74 choice of, 456 binary to decimal conversion, 202 comparison, A-74–75 defi ned, 403 branch, 211 conversion, A-75–76 memory block location, 403 challenges, 232–233 desktop RISC, E-12 misses, 406 diversity versus portability, OL3.11- division, A-76 Fully connected networks, 535 3–3.11-4 load, A-76–77 Function code, 82 division, 211 move, A-77–78 Fused-multiply-add (FMA) operation, fi rst dispute, OL3.11-2–3.11-3 multiplication, A-78 220, C-45–46 form, 197 negation, A-78–79 fused multiply add, 220 SPARC, E-31 G guard digits, 218–219 square root, A-79 history, OL3.11-3 store, A-79 Game consoles, C-9 IEEE 754 standard, 198, 199 subtraction, A-79–80 Gates, B-3, B-8 instruction encoding, 213 truncation, A-80 AND, B-12, D-7 intermediate calculations, 218 Floating-point multiplication, 206–210 delays, B-46 machine language, 212 binary, 210–211 mapping ALU control function to, MIPS instruction frequency for, 236 illustrated, 209 D-4–7 MIPS instructions, 211–213 instructions, 211 NAND, B-8 operands, 212 signifi cands, 206 NOR, B-8, B-50 overfl ow, 198 steps, 206–210 Gather-scatter, 511, 552 packed format, 224 Flow-sensitive information, OL2.15-15 General Purpose GPUs (GPGPUs), precision, 230 Flushing instructions, 318, 319 C-5 procedure with two-dimensional defi ned, 319 General-purpose registers, 150 matrices, 215–217 exceptions and, 331 architectures, OL2.21-3 programs, compiling, 214–217 For loops, 141, OL2.15-26 embedded RISCs, E-5 registers, 217 inner, OL2.15-24 Generate representation, 197–202 SIMD and, OL6.15-2 defi ned, B-40 rounding, 218–219 Formal parameters, A-16 example, B-44 sign and magnitude, 197 Format fi elds, D-31 super, B-41 SSE2 architecture, 224–225 Fortran, OL2.21-7 Gigabyte, 6 subtraction, 211 Forward references, A-11 Global common subexpression underfl ow, 198 Forwarding, 303–316 elimination, OL2.15-6 units, 219 ALU before, 309 Global memory, C-21, C-39 in x86, 224 control, 307 Global miss rates, 416 Floating vectors, OL3.11-3 datapath for hazard resolution, 311 Global optimization, OL2.15-5 Floating-point addition, 203–206 defi ned, 278 code, OL2.15-7 arithmetic unit block diagram, 207 functioning, 306 implementing, OL2.15-8–2.15-11 binary, 204 graphical representation, 279 Global pointers, 102 illustrated, 205 illustrations, OL4.13-26–4.13-26 GPU computing. See also Graphics instructions, 211, A-73–74 multiple results and, 281 processing units (GPUs) steps, 203–204 multiplexors, 310 defi ned, C-5 Floating-point arithmetic (GPUs), pipeline registers before, 309 visual applications, C-6–7 C-41–46 with two instructions, 278 GPU system architectures, C-7–12 basic, C-42 Verilog implementation, OL4.13- graphics logical pipeline, C-10 double precision, C-45–46, C-74 2–4.13-4 heterogeneous, C-7–9 performance, C-44 Fractions, 197, 198 implications for, C-24 specialized, C-42–44 Frame buff er, 18 interfaces and drivers, C-9 supported formats, C-42 Frame pointers, 103 unifi ed, C-10–12 operations, C-44 Front end, OL2.15-3 Graph coloring, OL2.15-12 I-10 Index

Graphics displays Handlers defi ned, 376 computer hardware support, 18 defi ned, 449 Hit under miss, 472 LCD, 18 TLB miss, 448 Hold time, B-54 Graphics logical pipeline, C-10 Hard disks Horizontal microcode, D-32 Graphics processing units (GPUs), 522– access times, 23 Hot-swapping, OL5.11-7 529. See also GPU computing defi ned, 23 Human genome project, 4 as accelerators, 522 Hardware attribute interpolation, C-43–44 as hierarchical layer, 13 I defi ned, 46, 506, C-3 language of, 14–16 evolution, C-5 operations, 63–66 I fallacies and pitfalls, C-72–75 supporting procedures in, 96–106 I/O, A-38–40, OL6.9-2, OL6.9-3 fl oating-point arithmetic, C-17, C-41– synthesis, B-21 memory-mapped, A-38 46, C-74 translating microprograms to, D-28–32 on system performance, OL5.11-2 GeForce 8-series generation, C-5 virtualizable, 426 I/O benchmarks.See Benchmarks general computation, C-73–74 Hardware description languages. See also IBM 360/85, OL5.17-7 General Purpose (GPGPUs), C-5 Verilog IBM 701, OL1.12-5 graphics mode, C-6 defi ned, B-20 IBM 7030, OL4.16-2 graphics trends, C-4 using, B-20–26 IBM ALOG, OL3.11-7 history, C-3–4 VHDL, B-20–21 IBM Blue Gene, OL6.15-9–6.15-10 logical graphics pipeline, C-13–14 Hardware multithreading, 514–517 IBM Personal Computer, OL1.12-7, mapping applications to, C-55–72 coarse-grained, 514 OL2.21-6 memory, 523 options, 516 IBM System/360 computers, OL1.12-6, multilevel caches and, 522 simultaneous, 515–517 OL3.11-6, OL4.16-2 N-body applications, C-65–72 Hardware-based speculation, 341 IBM z/VM, OL5.17-8 NVIDIA architecture, 523–526 Harvard architecture, OL1.12-4 ID stage parallel memory system, C-36–41 Hazard detection units, 313–314 branch execution in, 319 parallelism, 523, C-76 functions, 314 load instructions, 292 performance doubling, C-4 pipeline connections for, 314 store instruction in, 291 perspective, 527–529 Hazards, 277–278. See also Pipelining IEEE 754 fl oating-point standard, 198, programming, C-12–24 control, 281–282, 316–325 199, OL3.11-8–3.11-10. See also programming interfaces to, C-17 data, 278, 303–316 Floating point real-time graphics, C-13 forwarding and, 312 fi rst chips, OL3.11-8–3.11-9 summary, C-76 structural, 277, 294 in GPU arithmetic, C-42–43 Graphics shader programs, C-14–15 Heap implementation, OL3.11-10 Gresham’s Law, 236, OL3.11-2 allocating space on, 104–106 rounding modes, 219 Grid computing, 533 defi ned, 104 today, OL3.11-10 Grids, C-19 Heterogeneous systems, C-4–5 If statements, 114 GTX 280, 548–553 architecture, C-7–9 I-format, 83 Guard digits defi ned, C-3 If-then-else, 91 defi ned, 218 Hexadecimal numbers, 81–82 Immediate addressing, 116 rounding with, 219 binary number conversion to, 81–82 Immediate instructions, 72 Hierarchy of memories, 12 Imprecise interrupts, 331, OL4.16-4 H High-level languages, 14–16, A-6 Index-out-of-bounds check, 94–95 benefi ts, 16 Induction variable elimination, OL2.15-7 Half precision, C-42 computer architectures, OL2.21-5 Inheritance, OL2.15-15 Halfwords, 110 importance, 16 In-order commit, 341 Hamming, Richard, 420 High-level optimizations, OL2.15-4–2.15- Input devices, 16 Hamming distance, 420 5 Inputs, 261 Hamming Error Correction Code (ECC), Hit rate, 376 Instances, OL2.15-15 420–421 Hit time Instruction count, 36, 38 calculating, 420–421 cache performance and, 401–402 Instruction decode/register fi le read stage Index I-11

control line, 300 compiler exploitation, OL4.16-5–4.16-6 PA-RISC, E-34–36 load instruction, 289 defi ned, 43, 333 performance, 35–36 store instruction, 294 exploitation, increasing, 343 pipeline sequence, 313 Instruction execution illustrations, and matrix multiply, 351–354 PowerPC, E-12–13, E-32–34 OL4.13-16–4.13-17 Instructions, 60–164, E-25–27, E-40–42. PTX, C-31, C-32 clock cycle 9, OL4.13-24 See also Arithmetic instructions; remainder, A-55 clock cycles 1 and 2, OL4.13-21 MIPS; Operands representation in computer, 80–87 clock cycles 3 and 4, OL4.13-22 add immediate, 72 restartable, 450 clock cycles 5 and 6, OL4.13-23, addition, 180, A-51 resuming, 450 OL4.13-23 Alpha, E-27–29 R-type, 252 clock cycles 7 and 8, OL4.13-24 arithmetic-logical, 251, A-51–57 shift , A-55–56 examples, OL4.13-20–4.13-25 ARM, 145–147, E-36–37 SPARC, E-29–32 forwarding, OL4.13-26–4.13-31 assembly, 66 store, 71, A-68–70 no hazard, OL4.13-17 basic block, 93 store conditional, 122 pipelines with stalls and forwarding, branch, A-59–63 subtraction, 180, A-56–57 OL4.13-26, OL4.13-20 cache-aware, 482 SuperH, E-39–40 Instruction fetch stage comparison, A-57–59 thread, C-30–31 control line, 300 conditional branch, 90 Th umb, E-38 load instruction, 289 conditional move, 324 trap, A-64–66 store instruction, 294 constant-manipulating, A-57 vector, 510 Instruction formats, 157 conversion, A-75–76 as words, 62 ARM, 148 core, 233 x86, 149–155 defi ned, 81 data movement, A-70–73 Instructions per clock cycle (IPC), 333 desktop/server RISC architectures, E-7 data transfer, 68 Integrated circuits (ICs), 19. See also embedded RISC architectures, E-8 decision-making, 90–96 specifi c chips I-type, 83 defi ned, 14, 62 cost, 27 J-type, 113 desktop RISC conventions, E-12 defi ned, 25 jump instruction, 270 division, A-52–53 manufacturing process, 26 MIPS, 148 as electronic signals, 80 very large-scale (VLSIs), 25 R-type, 83, 261 embedded RISC conventions, E-15 Intel Core i7, 46–49, 244, 501, 548–553 x86, 157 encoding, 83 address translation for, 471 Instruction latency, 356 exception and interrupt, A-80 architectural registers, 347 Instruction mix, 39, OL1.12-10 exclusive OR, A-57 caches in, 472 Instruction set architecture fetching, 253 memory hierarchies of, 471–475 ARM, 145–147 fi elds, 80 microarchitecture, 338 branch address calculation, 254 fl oating-point (x86), 224 performance of, 473 defi ned, 22, 52 fl oating-point, 211–213, A-73–80 SPEC CPU benchmark, 46–48 history, 163 fl ushing, 318, 319, 331 SPEC power benchmark, 48–49 maintaining, 52 immediate, 72 TLB hardware for, 471 protection and, 427 introduction to, 62–63 Intel Core i7 920, 346–349 thread, C-31–34 jump, 95, 97, A-63–64 microarchitecture, 347 virtual machine support, 426–427 left -to-right fl ow, 287–288 Intel Core i7 960 Instruction sets, 235, C-49 load, 68, A-66–68 benchmarking and roofl ines of, ARM, 324 load linked, 122 548–553 design for pipelining, 277 logical operations, 87–89 Intel Core i7 Pipelines, 344, 346–349 MIPS, 62, 161, 234 M32R, E-40 memory components, 348 MIPS-32, 235 memory access, C-33–34 performance, 349–351 Pseudo MIPS, 233 memory-reference, 245 program performance, 351 x86 growth, 161 multiplication, 188, A-53–54 specifi cation, 345 Instruction-level parallelism (ILP), 354. negation, A-54 Intel IA-64 architecture, OL2.21-3 See also Parallelism nop, 314 Intel Paragon, OL6.15-8 I-12 Index

Intel Th reading Building Blocks, C-60 jr (Jump Register), 64 lhu (Load Halfword Unsigned), 64 Intel x86 J-type instruction format, 113 li (Load Immediate), 162 clock rate and power for, 40 Jump instructions, 254, E-26 Link, OL6.9-2 Interference graphs, OL2.15-12 branch instruction versus, 270 Linkers, 126–129, A-18–19 Interleaving, 398 control and datapath for, 271 defi ned, 126, A-4 Interprocedural analysis, OL2.15-14 implementing, 270 executable fi les, 126, A-19 Interrupt enable, 447 instruction format, 270 function illustration, A-19 Interrupt handlers, A-33 list of, A-63–64 steps, 126 Interrupt-driven I/O, OL6.9-4 Just In Time (JIT) compilers, using, 126–129 Interrupts 132, 560 Linking object fi les, 126–129 defi ned, 180, 326 Linpack, 538, OL3.11-4 event types and, 326 K Liquid crystal displays (LCDs), 18 exceptions versus, 325–326 LISP, SPARC support, E-30 imprecise, 331, OL4.16-4 Karnaugh maps, B-18 Little-endian byte order, A-43 instructions, A-80 Kernel mode, 444 Live range, OL2.15-11 precise, 332 Kernels Livermore Loops, OL1.12-11 vectored, 327 CUDA, C-19, C-24 ll (Load Linked), 64 Intrinsity FastMATH processor, 395–398 defi ned, C-19 Load balancing, 505–506 caches, 396 Kilobyte, 6 Load instructions. See also Store data miss rates, 397, 407 instructions read processing, 442 L access, C-41 TLB, 440 base register, 262 write-through processing, 442 Labels block, 149 Inverted page tables, 436 global, A-10, A-11 compiling with, 71 Issue packets, 334 local, A-11 datapath in operation for, 267 LAPACK, 230 defi ned, 68 J Large-scale multiprocessors, OL6.15-7, details, A-66–68 OL6.15-9–6.15-10 EX stage, 292 j (Jump), 64 Latches fl oating-point, A-76–77 jal (Jump And Link), 64 D latch, B-51, B-52 halfword unsigned, 110 Java defi ned, B-51 ID stage, 291 bytecode, 131 Latency IF stage, 291 bytecode architecture, OL2.15-17 instruction, 356 linked, 122, 123 characters in, 109–111 memory, C-74–75 list of, A-66–68 compiling in, OL2.15-19–2.15-20 pipeline, 286 load byte unsigned, 76 goals, 131 use, 336–337 load half, 110 interpreting, 131, 145, OL2.15-15– lbu (Load Byte Unsigned), 64 load upper immediate, 112, 113 2.15-16 Leaf procedures. See also Procedures MEM stage, 293 keywords, OL2.15-21 defi ned, 100 pipelined datapath in, 296 method invocation in, OL2.15-21 example, 109 signed, 76 pointers, OL2.15-26 Least recently used (LRU) unit for implementing, 255 primitive types, OL2.15-26 as block replacement strategy, 457 unsigned, 76 programs, starting, 131–132 defi ned, 409 WB stage, 293 reference types, OL2.15-26 pages, 434 Load word, 68, 71 sort algorithms, 141 Least signifi cant bits, B-32 Loaders, 129 strings in, 109–111 defi ned, 74 Loading, A-19–20 translation hierarchy, 131 SPARC, E-31 Load-store architectures, OL2.21-3 while loop compilation in, OL2.15- Left -to-right instruction fl ow, 287–288 Load-use data hazard, 280, 318 18–2.15-19 Level-sensitive clocking, B-74, B-75–76 Load-use stalls, 318 Java Virtual Machine (JVM), 145, defi ned, B-74 Local area networks (LANs), 24. See also OL2.15-16 two-phase, B-75 Networks Index I-13

Local labels, A-11 M main, 23 Local memory, C-21, C-40 nonvolatile, 22 Local miss rates, 416 M32R, E-15, E-40 operands, 68–69 Local optimization, OL2.15-5. Machine code, 81 parallel system, C-36–41 See also Optimization Machine instructions, 81 read-only (ROM), B-14–16 implementing, OL2.15-8 Machine language, 15 SDRAM, 379–380 Locality branch off set in, 115 secondary, 23 principle, 374 decoding, 118–120 shared, C-21, C-39–40 spatial, 374, 377 defi ned, 14, 81, A-3 spaces, C-39 temporal, 374, 377 fl oating-point, 212 SRAM, B-58–62 Lock synchronization, 121 illustrated, 15 stalls, 400 Locks, 518 MIPS, 85 technologies for building, 24–28 Logic SRAM, 21 texture, C-40 address select, D-24, D-25 translating MIPS assembly language usage, A-20–22 ALU control, D-6 into, 84 virtual, 427–454 combinational, 250, B-5, B-9–20 Macros volatile, 22 components, 249 defi ned, A-4 Memory access instructions, C-33–34 control unit equations, D-11 example, A-15–17 Memory access stage design, 248–251, B-1–79 use of, A-15 control line, 302 equations, B-7 Main memory, 428. See also Memory load instruction, 292 minimization, B-18 defi ned, 23 store instruction, 292 programmable array (PAL), page tables, 437 Memory bandwidth, 551, 557 B-78 physical addresses, 428 Memory consistency model, 469 sequential, B-5, B-56–58 Mapping applications, C-55–72 Memory elements, B-50–58 two-level, B-11–14 Mark computers, OL1.12-14 clocked, B-51 Logical operations, 87–89 Matrix multiply, 225–228, 553–555 D fl ip-fl op, B-51, B-53 AND, 88, A-52 Mealy machine, 463–464, B-68, B-71, D latch, B-52 ARM, 149 B-72 DRAMs, B-63–67 desktop RISC, E-11 Mean time to failure(MTTF), 418 fl ip-fl op, B-51 embedded RISC, E-14 improving, 419 hold time, B-54 MIPS, A-51–57 versus AFR of disks, 419–420 latch, B-51 NOR, 89, A-54 Media Access Control (MAC) address, setup time, B-53, B-54 NOT, 89, A-55 OL6.9-7 SRAMs, B-58–62 OR, 89, A-55 Megabyte, 6 unclocked, B-51 shift s, 87 Memory Memory hierarchies, 545 Long instruction word (LIW), addresses, 77 of ARM cortex-A8, 471–475 OL4.16-5 affi nity, 545 block (or line), 376 Lookup tables (LUTs), B-79 atomic, C-21 cache performance, 398–417 Loop unrolling bandwidth, 380–381, 397 caches, 383–417 defi ned, 338, OL2.15-4 cache, 21, 383–398, 398–417 common framework, 454–461 for multiple-issue pipelines, 338 CAM, 408 defi ned, 375 register renaming and, 338 constant, C-40 design challenges, 461 Loops, 92–93 control, D-26 development, OL5.17-6–5.17-8 conditional branches in, 114 defi ned, 19 exploiting, 372–498 for, 141 DRAM, 19, 379–380, B-63–65 of Intel core i7, 471–475 prediction and, 321–323 fl ash, 23 level pairs, 376 test, 142, 143 global, C-21, C-39 multiple levels, 375 while, compiling, 92–93 GPU, 523 overall operation of, 443–444 lui (Load Upper Imm.), 64 instructions, datapath for, 256 parallelism and, 466–470, OL5.11-2 lw (Load Word), 64 layout, A-21 pitfalls, 478–482 lwc1 (Load FP Single), A-73 local, C-21, C-40 program execution time and, 417 I-14 Index

Memory hierarchies (Continued) defi ned, B-12, D-20 data transfer instructions not in, E-20, quantitative design parameters, 454 in PLA implementation, D-20 E-22 redundant arrays and inexpensive MIP-map, C-44 fl oating-point instructions not in, E-22 disks, 470 MIPS, 64, 84, A-45–80 instruction set, 233, 244–248, E-9–10 reliance on, 376 addressing for 32-bit immediates, MIPS-16 structure, 375 116–118 16-bit instruction set, E-41–42 structure diagram, 378 addressing modes, A-45–47 immediate fi elds, E-41 variance, 417 arithmetic core, 233 instructions, E-40–42 virtual memory, 427–454 arithmetic instructions, 63, A-51–57 MIPS core instruction changes, E-42 Memory rank, 381 ARM similarities, 146 PC-relative addressing, E-41 Memory technologies, 378–383 assembler directive support, A-47–49 MIPS-32 instruction set, 235 disk memory, 381–383 assembler syntax, A-47–49 MIPS-64 instructions, E-25–27 DRAM technology, 378, 379–381 assembly instruction, mapping, 80–81 conditional procedure call instructions, fl ash memory, 381 branch instructions, A-59–63 E-27 SRAM technology, 378, 379 comparison instructions, A-57–59 constant shift amount, E-25 Memory-mapped I/O, OL6.9-3 compiling C assignment statements jump/call not PC-relative, E-26 use of, A-38 into, 65 move to/from control registers, E-26 Memory-stall clock cycles, 399 compiling complex C assignment into, nonaligned data transfers, E-25 Message passing 65–66 NOR, E-25 defi ned, 529 constant-manipulating instructions, parallel single precision fl oating-point multiprocessors, 529–534 A-57 operations, E-27 Metastability, B-76 control registers, 448 reciprocal and reciprocal square root, Methods control unit, D-10 E-27 defi ned, OL2.15-5 CPU, A-46 SYSCALL, E-25 invoking in Java, OL2.15-20–2.15-21 divide in, 194 TLB instructions, E-26–27 static, A-20 exceptions in, 326–327 Mirroring, OL5.11-5 mfc0 (Move From Control), A-71 fi elds, 82–83 Miss penalty mfh i (Move From Hi), A-71 fl oating-point instructions, 211–213 defi ned, 376 mfl o (Move From Lo), A-71 FPU, A-46 determination, 391–392 Microarchitectures, 347 instruction classes, 163 multilevel caches, reducing, 410 Intel Core i7 920, 347 instruction encoding, 83, 119, A-49 Miss rates Microcode instruction formats, 120, 148, A-49–51 block size versus, 392 assembler, D-30 instruction set, 62, 162, 234 data cache, 455 control unit as, D-28 jump instructions, A-63–66 defi ned, 376 defi ned, D-27 logical instructions, A-51–57 global, 416 dispatch ROMs, D-30–31 machine language, 85 improvement, 391–392 horizontal, D-32 memory addresses, 70 Intrinsity FastMATH processor, 397 vertical, D-32 memory allocation for program and local, 416 Microinstructions, D-31 data, 104 miss sources, 460 Microprocessors multiply in, 188 split cache, 397 design shift , 501 opcode map, A-50 Miss under miss, 472 multicore, 8, 43, 500–501 operands, 64 MMX (MultiMedia eXtension), 224 Microprograms Pseudo, 233, 235 Modules, A-4 as abstract control representation, register conventions, 105 Moore machines, 463–464, B-68, B-71, D-30 static multiple issue with, 335–338 B-72 fi eld translation, D-29 MIPS core Moore’s law, 11, 379, 522, OL6.9-2, translating to hardware, D-28–32 architecture, 195 C-72–73 Migration, 467 arithmetic/logical instructions not in, Most signifi cant bit Million instructions per second (MIPS), E-21, E-23 1-bit ALU for, B-33 51 common extensions to, E-20–25 defi ned, 74 Minterms control instructions not in, E-21 move (Move), 139 Index I-15

Move instructions, A-70–73 Multiplicand, 183 defi ned, 506 coprocessor, A-71–72 Multiplication, 183–188. See also fi ne-grained, 514 details, A-70–73 Arithmetic hardware, 514–517 fl oating-point, A-77–78 fast, hardware, 188 simultaneous (SMT), 515–517 MS-DOS, OL5.17-11 faster, 187–188 multu (Multiply Unsigned), A-54 mul.d (FP Multiply Double), A-78 fi rst algorithm, 185 Must-information, OL2.15-5 mul.s (FP Multiply Single), A-78 fl oating-point, 206–208, A-78 Mutual exclusion, 121 mult (Multiply), A-53 hardware, 184–186 Multicore, 517–521 instructions, 188, A-53–54 N Multicore multiprocessors, 8, 43 in MIPS, 188 defi ned, 8, 500–501 multiplicand, 183 Name dependence, 338 MULTICS (Multiplexed Information multiplier, 183 NAND gates, B-8 and Computing Service), OL5.17- operands, 183 NAS (NASA Advanced Supercomputing), 9–5.17-10 product, 183 540 Multilevel caches. See also Caches sequential version, 184–186 N-body complications, 416 signed, 187 all-pairs algorithm, C-65 defi ned, 398, 416 Multiplier, 183 GPU simulation, C-71 miss penalty, reducing, 410 Multiply algorithm, 186 mathematics, C-65–67 performance of, 410 Multiply-add (MAD), C-42 multiple threads per body, C-68–69 summary, 417–418 Multiprocessors optimization, C-67 Multimedia extensions benchmarks, 538–540 performance comparison, C-69–70 desktop/server RISCs, E-16–18 bus-based coherent, OL6.15-7 results, C-70–72 as SIMD extensions to instruction sets, defi ned, 500 shared memory use, C-67–68 OL6.15-4 historical perspective, 561 Negation instructions, A-54, A-78–79 vector versus, 511–512 large-scale, OL6.15-7–6.15-8, OL6.15- Negation shortcut, 76 Multiple dimension arrays, 218 9–6.15-10 Nested procedures, 100–102 Multiple instruction multiple data message-passing, 529–534 compiling recursive procedure (MIMD), 558 multithreaded architecture, C-26–27, showing, 101–102 defi ned, 507, 508 C-35–36 NetFPGA 10-Gigagit Ethernet card, fi rst multiprocessor, OL6.15-14 organization, 499, 529 OL6.9-2, OL6.9-3 Multiple instruction single data (MISD), 507 for performance, 559 Network of Workstations, OL6.15- Multiple issue, 332–339 shared memory, 501, 517–521 8–6.15-9 code scheduling, 337–338 soft ware, 500 Network topologies, 534–537 dynamic, 333, 339–341 TFLOPS, OL6.15-6 implementing, 536 issue packets, 334 UMA, 518 multistage, 537 loop unrolling and, 338 Multistage networks, 535 Networking, OL6.9-4 processors, 332, 333 Multithreaded multiprocessor operating system in, OL6.9-4–6.9-5 static, 333, 334–339 architecture, C-25–36 performance improvement, OL6.9- throughput and, 342 conclusion, C-36 7–6.9-10 Multiple processors, 553–555 ISA, C-31–34 Networks, 23–24 Multiple-clock-cycle pipeline diagrams, massive multithreading, C-25–26 advantages, 23 296–297 multiprocessor, C-26–27 bandwidth, 535 fi ve instructions, 298 multiprocessor comparison, C-35–36 crossbar, 535 illustrated, 298 SIMT, C-27–30 fully connected, 535 Multiplexors, B-10 special function units (SFUs), C-35 local area (LANs), 24 controls, 463 streaming processor (SP), C-34 multistage, 535 in datapath, 263 thread instructions, C-30–31 wide area (WANs), 24 defi ned, 246 threads/thread blocks management, Newton’s iteration, 218 forwarding, control values, 310 C-30 Next state selector control, 256–257 Multithreading, C-25–26 nonsequential, D-24 two-input, B-10 coarse-grained, 514 sequential, D-23 I-16 Index

Next-state function, 463, B-67 NVIDIA GPU architecture, 523–526 compiler, 141 defi ned, 463 NVIDIA GTX 280, 548–553 control implementation, D-27–28 implementing, with sequencer, NVIDIA Tesla GPU, 548–553 global, OL2.15-5 D-22–28 high-level, OL2.15-4–2.15-5 Next-state outputs, D-10, D-12–13 O local, OL2.15-5, OL2.15-8 example, D-12–13 manual, 144 implementation, D-12 Object fi les, 125, A-4 or (OR), 64 logic equations, D-12–13 debugging information, 124 OR operation, 89, A-55, B-6 truth tables, D-15 defi ned, A-10 ori (Or Immediate), 64 No Redundancy (RAID 0), OL5.11-4 format, A-13–14 Out-of-order execution No write allocation, 394 header, 125, A-13 defi ned, 341 Nonblocking assignment, B-24 linking, 126–129 performance complexity, 416 Nonblocking caches, 344, 472 relocation information, 125 processors, 344 Nonuniform memory access (NUMA), static data segment, 125 Output devices, 16 518 symbol table, 125, 126 Overfl ow Nonvolatile memory, 22 text segment, 125 defi ned, 74, 198 Nops, 314 Object-oriented languages. See also Java detection, 180 nor (NOR), 64 brief history, OL2.21-8 exceptions, 329 NOR gates, B-8 defi ned, 145, OL2.15-5 fl oating-point, 198 cross-coupled, B-50 One’s complement, 79, B-29 occurrence, 75 D latch implemented with, B-52 Opcodes saturation and, 181 NOR operation, 89, A-54, E-25 control line setting and, 264 subtraction, 179 NOT operation, 89, A-55, B-6 defi ned, 82, 262 Numbers OpenGL, C-13 P binary, 73 OpenMP (Open MultiProcessing), 520, computer versus real-world, 221 540 P+Q redundancy (RAID 6), OL5.11-7 decimal, 73, 76 Operands, 66–73. See also Instructions Packed fl oating-point format, 224 denormalized, 222 32-bit immediate, 112–113 Page faults, 434. See also Virtual memory hexadecimal, 81–82 adding, 179 for data access, 450 signed, 73–78 arithmetic instructions, 66 defi ned, 428 unsigned, 73–78 compiling assignment when in handling, 429, 446–453 NVIDIA GeForce 8800, C-46–55 memory, 69 virtual address causing, 449, 450 all-pairs N-body algorithm, C-71 constant, 72–73 Page tables, 456 dense linear algebra computations, division, 189 defi ned, 432 C-51–53 fl oating-point, 212 illustrated, 435 FFT performance, C-53 memory, 68–69 indexing, 432 instruction set, C-49 MIPS, 64 inverted, 436 performance, C-51 multiplication, 183 levels, 436–437 rasterization, C-50 shift ing, 148 main memory, 437 ROP, C-50–51 Operating systems register, 432 scalability, C-51 brief history, OL5.17-9–5.17-12 storage reduction techniques, 436–437 sorting performance, C-54–55 defi ned, 13 updating, 432 special function approximation encapsulation, 22 VMM, 452 statistics, C-43 in networking, OL6.9-4–6.9-5 Pages. See also Virtual memory special function unit (SFU), C-50 Operations defi ned, 428 streaming multiprocessor (SM), atomic, implementing, 121 dirty, 437 C-48–49 hardware, 63–66 fi nding, 432–434 streaming processor, C-49–50 logical, 87–89 LRU, 434 streaming processor array (SPA), C-46 x86 integer, 152, 154–155 off set, 429 texture/processor cluster (TPC), Optimization physical number, 429 C-47–48 class explanation, OL2.15-14 placing, 432–434 Index I-17

size, 430 Paravirtualization, 482 space, 517, 521 virtual number, 429 PA-RISC, E-14, E-17 Physically addressed caches, 443 Parallel bus, OL6.9-3 branch vectored, E-35 Pipeline registers Parallel execution, 121 conditional branches, E-34, E-35 before forwarding, 309 Parallel memory system, C-36–41. See debug instructions, E-36 dependences, 308 also Graphics processing units decimal operations, E-35 forwarding unit selection, 312 (GPUs) extract and deposit, E-35 Pipeline stalls, 280 caches, C-38 instructions, E-34–36 avoiding with code reordering, 280 constant memory, C-40 load and clear instructions, E-36 data hazards and, 313–316 DRAM considerations, C-37–38 multiply/add and multiply/subtract, insertion, 315 global memory, C-39 E-36 load-use, 318 load/store access, C-41 nullifi cation, E-34 as solution to control hazards, 282 local memory, C-40 nullifying branch option, E-25 Pipelined branches, 319 memory spaces, C-39 store bytes short, E-36 Pipelined control, 300–303. See also MMU, C-38–39 synthesized multiply and divide, Control ROP, C-41 E-34–35 control lines, 300, 303 shared memory, C-39–40 Parity, OL5.11-5 overview illustration, 316 surfaces, C-41 bits, 421 specifying, 300 texture memory, C-40 code, 420, B-65 Pipelined datapaths, 286–303 Parallel processing programs, 502–507 PARSEC (Princeton Application with connected control signals, 304 creation diffi culty, 502–507 Repository for Shared Memory with control signals, 300–303 defi ned, 501 Computers), 540 corrected, 296 for message passing, 519–520 Pass transistor, B-63 illustrated, 289 great debates in, OL6.15-5 PCI-Express (PCIe), 537, C-8, OL6.9-2 in load instruction stages, 296 for shared address space, 519–520 PC-relative addressing, 114, 116 Pipelined dependencies, 305 use of, 559 Peak fl oating-point performance, 542 Pipelines Parallel reduction, C-62 Pentium bug morality play, 231–232 branch instruction impact, 317 Parallel scan, C-60–63 Performance, 28–36 eff ectiveness, improving, OL4.16- CUDA template, C-61 assessing, 28 4–4.16-5 inclusive, C-60 classic CPU equation, 36–40 execute and address calculation stage, tree-based, C-62 components, 38 290, 292 Parallel soft ware, 501 CPU, 33–35 fi ve-stage, 274, 290, 299 Parallelism, 12, 43, 332–344 defi ning, 29–32 graphic representation, 279, 296–300 and computers arithmetic, 222–223 equation, using, 36 instruction decode and register fi le data-level, 233, 508 improving, 34–35 read stage, 289, 292 debates, OL6.15-5–6.15-7 instruction, 35–36 instruction fetch stage, 290, 292 GPUs and, 523, C-76 measuring, 33–35, OL1.12-10 instructions sequence, 313 instruction-level, 43, 332, 343 program, 39–40 latency, 286 memory hierarchies and, 466–470, ratio, 31 memory access stage, 290, 292 OL5.11-2 relative, 31–32 multiple-clock-cycle diagrams, multicore and, 517 response time, 30–31 296–297 multiple issue, 332–339 sorting, C-54–55 performance bottlenecks, 343 multithreading and, 517 throughput, 30–31 single-clock-cycle diagrams, 296–297 performance benefi ts, 44–45 time measurement, 32 stages, 274 process-level, 500 Personal computers (PCs), 7 static two-issue, 335 redundant arrays and inexpensive defi ned, 5 write-back stage, 290, 294 disks, 470 Personal mobile device (PMD) Pipelining, 12, 272–286 subword, E-17 defi ned, 7 advanced, 343–344 task, C-24 Petabyte, 6 benefi ts, 272 task-level, 500 Physical addresses, 428 control hazards, 281–282 thread, C-22 mapping to, 428–429 data hazards, 278 I-18 Index

Pipelining (Continued) branch registers, E-32–33 two-issue, 336–337 exceptions and, 327–332 condition codes, E-12 vector, 508–510 execution time and, 286 instructions, E-12–13 VLIW, 335 fallacies, 355–356 instructions unique to, E-31–33 Product, 183 hazards, 277–278 load multiple/store multiple, E-33 Product of sums, B-11 instruction set design for, 277 logical shift ed immediate, E-33 Program counters (PCs), 251 laundry analogy, 273 rotate with mask, E-33 changing with conditional branch, 324 overview, 272–286 Precise interrupts, 332 defi ned, 98, 251 paradox, 273 Prediction, 12 exception, 445, 447 performance improvement, 277 2-bit scheme, 322 incrementing, 251, 253 pitfall, 355–356 accuracy, 321, 324 instruction updates, 289 simultaneous executing instructions, dynamic branch, 321–323 Program libraries, A-4 286 loops and, 321–323 Program performance speed-up formula, 273 steady-state, 321 elements aff ecting, 39 structural hazards, 277, 294 Prefetching, 482, 544 understanding, 9 summary, 285 Primitive types, OL2.15-26 Programmable array logic (PAL), B-78 throughput and, 286 Procedure calls Programmable logic arrays (PLAs) Pitfalls. See also Fallacies convention, A-22–33 component dots illustration, B-16 address space extension, 479 examples, A-27–33 control function implementation, D-7, arithmetic, 229–232 frame, A-23 D-20–21 associativity, 479 preservation across, 102 defi ned, B-12 defi ned, 49 Procedures, 96–106 example, B-13–14 GPUs, C-74–75 compiling, 98 illustrated, B-13 ignoring memory system behavior, 478 compiling, showing nested procedure ROMs and, B-15–16 memory hierarchies, 478–482 linking, 101–102 size, D-20 out-of-order processor evaluation, 479 execution steps, 96 truth table implementation, B-13 performance equation subset, 50–51 frames, 103 Programmable logic devices (PLDs), B-78 pipelining, 355–356 leaf, 100 Programmable ROMs (PROMs), B-14 pointer to automatic variables, 160 nested, 100–102 Programming languages. See also specifi c sequential word addresses, 160 recursive, 105, A-26–27 languages simulating cache, 478 for setting arrays to zero, 142 brief history of, OL2.21-7–2.21-8 soft ware development with sort, 135–139 object-oriented, 145 multiprocessors, 556 strcpy, 108–109 variables, 67 VMM implementation, 481, 481–482 string copy, 108–109 Programs Pixel shader example, C-15–17 swap, 133 assembly language, 123 Pixels, 18 Process identifi ers, 446 Java, starting, 131–132 Pointers Process-level parallelism, 500 parallel processing, 502–507 arrays versus, 141–145 Processors, 242–356 starting, 123–132 frame, 103 as cores, 43 translating, 123–132 global, 102 control, 19 Propagate incrementing, 143 datapath, 19 defi ned, B-40 Java, OL2.15-26 defi ned, 17, 19 example, B-44 stack, 98, 102 dynamic multiple-issue, 333 super, B-41 Polling, OL6.9-8 multiple-issue, 333 Protected keywords, OL2.15-21 Pop, 98 out-of-order execution, 344, 416 Protection Power performance growth, 44 defi ned, 428 clock rate and, 40 ROP, C-12, C-41 implementing, 444–446 critical nature of, 53 speculation, 333–334 mechanisms, OL5.17-9 effi ciency, 343–344 static multiple-issue, 333, 334–339 VMs for, 424 relative, 41 streaming, C-34 Protection group, OL5.11-5 PowerPC superscalar, 339, 515–516, OL4.16-5 Pseudo MIPS algebraic right shift , E-33 technologies for building, 24–28 defi ned, 233 Index I-19

instruction set, 235 Reduced instruction set computer (RISC) number specifi cation, 252 Pseudodirect addressing, 116 architectures, E-2–45, OL2.21-5, page table, 432 Pseudoinstructions OL4.16-4. See also Desktop and pipeline, 308, 309, 312 defi ned, 124 server RISCs; Embedded RISCs primitives, 66 summary, 125 group types, E-3–4 Receiver Control, A-39 Pthreads (POSIX threads), 540 instruction set lineage, E-44 Receiver Data, A-38, A-39 PTX instructions, C-31, C-32 Reduction, 519 renaming, 338 Public keywords, OL2.15-21 Redundant arrays of inexpensive disks right half, 290 Push (RAID), OL5.11-2–5.11-8 spilling, 71 defi ned, 98 history, OL5.11-8 Status, 327, A-35 using, 100 RAID 0, OL5.11-4 temporary, 67, 99 RAID 1, OL5.11-5 Transmitter Control, A-39–40 Q RAID 2, OL5.11-5 Transmitter Data, A-40 RAID 3, OL5.11-5 usage convention, A-24 Quad words, 154 RAID 4, OL5.11-5–5.11-6 use convention, A-22 Quicksort, 411, 412 RAID 5, OL5.11-6–5.11-7 variables, 67 Quotient, 189 RAID 6, OL5.11-7 Relative performance, 31–32 spread of, OL5.11-6 Relative power, 41 R summary, OL5.11-7–5.11-8 Reliability, 418 use statistics, OL5.11-7 Relocation information, A-13, A-14 Race, B-73 Reference bit, 435 Remainder Radix sort, 411, 412, C-63–65 References defi ned, 189 CUDA code, C-64 absolute, 126 instructions, A-55 implementation, C-63–65 forward, A-11 Reorder buff ers, 343 RAID, See Redundant arrays of types, OL2.15-26 Replication, 468 inexpensive disks (RAID) unresolved, A-4, A-18 Requested word fi rst, 392 RAM, 9 Register addressing, 116 Request-level parallelism, 532 Raster operation (ROP) processors, C-12, Register allocation, OL2.15-11–2.15-13 Reservation stations C-41, C-50–51 Register fi les, B-50, B-54–56 buff ering operands in, 340–341 fi xed function, C-41 defi ned, 252, B-50, B-54 defi ned, 339–340 Raster refresh buff er, 18 in behavioral Verilog, B-57 Response time, 30–31 Rasterization, C-50 single, 257 Restartable instructions, 448 Ray casting (RC), 552 two read ports implementation, B-55 Return address, 97 Read-only memories (ROMs), B-14–16 with two read ports/one write port, Return from exception (ERET), 445 control entries, D-16–17 B-55 R-format, 262 control function encoding, D-18–19 write port implementation, B-56 ALU operations, 253 dispatch, D-25 Register-memory architecture, OL2.21-3 defi ned, 83 implementation, D-15–19 Registers, 152, 153–154 Ripple carry logic function encoding, B-15 architectural, 325–332 adder, B-29 overhead, D-18 base, 69 carry lookahead speed versus, B-46 PLAs and, B-15–16 callee-saved, A-23 Roofl ine model, 542–543, 544, 545 programmable (PROM), B-14 caller-saved, A-23 with ceilings, 546, 547 total size, D-16 Cause, A-35 computational roofl ine, 545 Read-stall cycles, 399 clock cycle time and, 67 illustrated, 542 Read-write head, 381 compiling C assignment with, 67–68 Opteron generations, 543, 544 Receive message routine, 529 Count, A-34 with overlapping areas shaded, 547 Receiver Control register, A-39 defi ned, 66 peak fl oating-point performance, Receiver Data register, A-38, A-39 destination, 83, 262 542 Recursive procedures, 105, A-26–27. See fl oating-point, 217 peak memory performance, 543 also Procedures left half, 290 with two kernels, 547 clone invocation, 100 mapping, 80 Rotational delay.See Rotational latency stack in, A-29–30 MIPS conventions, 105 Rotational latency, 383 I-20 Index

Rounding, 218 n-way, 403 data vector, C-35 accurate, 218 two-way, 404 extensions, OL6.15-4 bits, 220 Setup time, B-53, B-54 for loops and, OL6.15-3 with guard digits, 219 sh (Store Halfword), 64 massively parallel multiprocessors, IEEE 754 modes, 219 Shaders OL6.15-2 Row-major order, 217, 413 defi ned, C-14 small-scale, OL6.15-4 R-type instructions, 252 fl oating-point arithmetic, C-14 vector architecture, 508–510 datapath for, 264–265 graphics, C-14–15 in x86, 508 datapath in operation for, 266 pixel example, C-15–17 SIMMs (single inline memory modules), Shading languages, C-14 OL5.17-5, OL5.17-6 S Shadowing, OL5.11-5 Simple programmable logic devices Shared memory. See also Memory (SPLDs), B-78 Saturation, 181 as low-latency memory, C-21 Simplicity, 161 sb (Store Byte), 64 caching in, C-58–60 Simultaneous multithreading (SMT), sc (Store Conditional), 64 CUDA, C-58 515–517 SCALAPAK, 230 N-body and, C-67–68 support, 515 Scaling per-CTA, C-39 thread-level parallelism, 517 strong, 505, 507 SRAM banks, C-40 unused issue slots, 515 weak, 505 Shared memory multiprocessors (SMP), Single error correcting/Double error Scientifi c notation 517–521 correcting (SEC/DEC), 420–422 adding numbers in, 203 defi ned, 501, 517 Single instruction single data (SISD), 507 defi ned, 196 single physical address space, 517 Single precision. See also Double for reals, 197 synchronization, 518 precision Search engines, 4 Shift amount, 82 binary representation, 201 Secondary memory, 23 Shift instructions, 87, A-55–56 defi ned, 198 Sectors, 381 Sign and magnitude, 197 Single-clock-cycle pipeline diagrams, Seek, 382 Sign bit, 76 296–297 Segmentation, 431 Sign extension, 254 illustrated, 299 Selector values, B-10 defi ned, 76 Single-cycle datapaths. See also Datapaths Semiconductors, 25 shortcut, 78 illustrated, 287 Send message routine, 529 Signals instruction execution, 288 Sensitivity list, B-24 asserted, 250, B-4 Single-cycle implementation Sequencers control, 250, 263–264 control function for, 269 explicit, D-32 deasserted, 250, B-4 defi ned, 270 implementing next-state function with, Signed division, 192–194 nonpipelined execution versus D-22–28 Signed multiplication, 187 pipelined execution, 276 Sequential logic, B-5 Signed numbers, 73–78 non-use of, 271–272 Servers, OL5. See also Desktop and server sign and magnitude, 75 penalty, 271–272 RISCs treating as unsigned, 94–95 pipelined performance versus, 274 cost and capability, 5 Signifi cands, 198 Single-instruction multiple-thread Service accomplishment, 418 addition, 203 (SIMT), C-27–30 Service interruption, 418 multiplication, 206 overhead, C-35 Set instructions, 93 Silicon, 25 multithreaded warp scheduling, C-28 Set-associative caches, 403. See also as key hardware technology, 53 processor architecture, C-28 Caches crystal ingot, 26 warp execution and divergence, address portions, 407 defi ned, 26 C-29–30 block replacement strategies, 457 wafers, 26 Single-program multiple data (SPMD), choice of, 456 Silicon crystal ingot, 26 C-22 four-way, 404, 407 SIMD (Single Instruction Multiple Data), sll (Shift Left Logical), 64 memory-block location, 403 507–508, 558 slt (Set Less Th an), 64 misses, 405–406 computers, OL6.15-2–6.15-4 slti (Set Less Th an Imm.), 64 Index I-21

sltiu (Set Less Th an Imm.Unsigned), 64 Spatial locality, 374 push, 98, 100 sltu (Set Less Th an Unsig.), 64 large block exploitation of, 391 recursive procedures, A-29–30 Smalltalk-80, OL2.21-8 tendency, 378 Stalls, 280 Smart phones, 7 SPEC, OL1.12-11–1.12-12 as solution to control hazard, 282 Snooping protocol, 468–470 CPU benchmark, 46–48 avoiding with code reordering, 280 Snoopy cache coherence, OL5.12-7 power benchmark, 48–49 behavioral Verilog with detection, Soft ware optimization SPEC2000, OL1.12-12 OL4.13-6–4.13-8 via blocking, 413–418 SPEC2006, 233, OL1.12-12 data hazards and, 313–316 Sort algorithms, 141 SPEC89, OL1.12-11 illustrations, OL4.13-23, OL4.13-30 Soft ware SPEC92, OL1.12-12 insertion into pipeline, 315 layers, 13 SPEC95, OL1.12-12 load-use, 318 multiprocessor, 500 SPECrate, 538–539 memory, 400 parallel, 501 SPECratio, 47 write-back scheme, 399 as service, 7, 532, 558 Special function units (SFUs), C-35, C-50 write buff er, 399 systems, 13 defi ned, C-43 Standby spares, OL5.11-8 Sort procedure, 135–139. See also Speculation, 333–334 State Procedures hardware-based, 341 in 2-bit prediction scheme, 322 code for body, 135–137 implementation, 334 assignment, B-70, D-27 full procedure, 138–139 performance and, 334 bits, D-8 passing parameters in, 138 problems, 334 exception, saving/restoring, 450 preserving registers in, 138 recovery mechanism, 334 logic components, 249 procedure call, 137 Speed-up challenge, 503–505 specifi cation of, 432 register allocation for, 135 balancing load, 505–506 State elements Sorting performance, C-54–55 bigger problem, 504–505 clock and, 250 Source fi les, A-4 Spilling registers, 71, 98 combinational logic and, 250 Source language, A-6 SPIM, A-40–45 defi ned, 248, B-48 Space allocation byte order, A-43 inputs, 249 on heap, 104–106 features, A-42–43 in storing/accessing instructions, on stack, 103 getting started with, A-42 252 SPARC MIPS assembler directives support, register fi le, B-50 annulling branch, E-23 A-47–49 Static branch prediction, 335 CASA, E-31 speed, A-41 Static data conditional branches, E-10–12 system calls, A-43–45 as dynamic data, A-21 fast traps, E-30 versions, A-42 defi ned, A-20 fl oating-point operations, E-31 virtual machine simulation, A-41–42 segment, 104 instructions, E-29–32 Split algorithm, 552 Static multiple-issue processors, 333, least signifi cant bits, E-31 Split caches, 397 334–339. See also Multiple issue multiple precision fl oating-point Square root instructions, A-79 control hazards and, 335–336 results, E-32 sra (Shift Right Arith.), A-56 instruction sets, 335 nonfaulting loads, E-32 srl (Shift Right Logical), 64 with MIPS ISA, 335–338 overlapping integer operations, E-31 Stack architectures, OL2.21-4 Static random access memories (SRAMs), quadruple precision fl oating-point Stack pointers 378, 379, B-58–62 arithmetic, E-32 adjustment, 100 array organization, B-62 register windows, E-29–30 defi ned, 98 basic structure, B-61 support for LISP and Smalltalk, E-30 values, 100 defi ned, 21, B-58 Sparse matrices, C-55–58 Stack segment, A-22 fi xed access time, B-58 Sparse Matrix-Vector multiply (SpMV), Stacks large, B-59 C-55, C-57, C-58 allocating space on, 103 read/write initiation, B-59 CUDA version, C-57 for arguments, 140 synchronous (SSRAMs), B-60 serial code, C-57 defi ned, 98 three-state buff ers, B-59, B-60 shared memory version, C-59 pop, 98 Static variables, 102 I-22 Index

Status register sub.d (FP Subtract Double), A-79 cache data and tag modules, OL5.12-6 fi elds, A-34, A-35 sub.s (FP Subtract Single), A-80 FSM, OL5.12-7 Steady-state prediction, 321 Subnormals, 222 simple cache block diagram, OL5.12-4 Sticky bits, 220 Subtraction, 178–182. See also Arithmetic type declarations, OL5.12-2 Store buff ers, 343 binary, 178–179 Store instructions. See also Load fl oating-point, 211, A-79–80 T instructions instructions, A-56–57 access, C-41 negative number, 179 Tablets, 7 base register, 262 overfl ow, 179 Tags block, 149 subu (Subtract Unsigned), 119 defi ned, 384 compiling with, 71 Subword parallelism, 222–223, 352, E-17 in locating block, 407 conditional, 122 and matrix multiply, 225–228 page tables and, 434 defi ned, 71 Sum of products, B-11, B-12 size of, 409 details, A-68–70 Supercomputers, OL4.16-3 Tail call, 105–106 EX stage, 294 defi ned, 5 Task identifi ers, 446 fl oating-point, A-79 SuperH, E-15, E-39–40 Task parallelism, C-24 ID stage, 291 Superscalars Task-level parallelism, 500 IF stage, 291 defi ned, 339, OL4.16-5 Tebibyte (TiB), 5 instruction dependency, 312 dynamic pipeline scheduling, 339 Telsa PTX ISA, C-31–34 list of, A-68–70 multithreading options, 516 arithmetic instructions, C-33 MEM stage, 295 Surfaces, C-41 barrier synchronization, C-34 unit for implementing, 255 sw (Store Word), 64 GPU thread instructions, C-32 WB stage, 295 Swap procedure, 133. See also Procedures memory access instructions, C-33–34 Store word, 71 body code, 135 Temporal locality, 374 Stored program concept, 63 full, 135, 138–139 tendency, 378 as computer principle, 86 register allocation, 133 Temporary registers, 67, 99 illustrated, 86 Swap space, 434 Terabyte (TB) , 6 principles, 161 swc1 (Store FP Single), A-73 defi ned, 5 Strcpy procedure, 108–109. See also Symbol tables, 125, A-12, A-13 Text segment, A-13 Procedures Synchronization, 121–123, 552 Texture memory, C-40 as leaf procedure, 109 barrier, C-18, C-20, C-34 Texture/processor cluster (TPC), pointers, 109 defi ned, 518 C-47–48 Stream benchmark, 548 lock, 121 TFLOPS multiprocessor, OL6.15-6 Streaming multiprocessor (SM), C-48–49 overhead, reducing, 44–45 Th rashing, 453 Streaming processors, C-34, C-49–50 unlock, 121 Th read blocks, 528 array (SPA), C-41, C-46 Synchronizers creation, C-23 Streaming SIMD Extension 2 (SSE2) defi ned, B-76 defi ned, C-19 fl oating-point architecture, 224 failure, B-77 managing, C-30 Streaming SIMD Extensions (SSE) and from D fl ip-fl op, B-76 memory sharing, C-20 advanced vector extensions in x86, Synchronous DRAM (SRAM), 379–380, synchronization, C-20 224–225 B-60, B-65 Th read parallelism, C-22 Stretch computer, OL4.16-2 Synchronous SRAM (SSRAM), B-60 Th reads Strings Synchronous system, B-48 creation, C-23 defi ned, 107 Syntax tree, OL2.15-3 CUDA, C-36 in Java, 109–111 System calls, A-43–45 ISA, C-31–34 representation, 107 code, A-43–44 managing, C-30 Strip mining, 510 defi ned, 445 memory latencies and, C-74–75 Striping, OL5.11-4 loading, A-43 multiple, per body, C-68–69 Strong scaling, 505, 517 Systems soft ware, 13 warps, C-27 Structural hazards, 277, 294 SystemVerilog Th ree Cs model, 459–461 sub (Subtract), 64 cache controller, OL5.12-2 Th ree-state buff ers, B-59, B-60 Index I-23

Th roughput sign extension shortcut, 78 register, 67 defi ned, 30–31 Two-level logic, B-11–14 static, 102 multiple issue and, 342 Two-phase clocking, B-75 storage class, 102 pipelining and, 286, 342 TX-2 computer, OL6.15-4 type, 102 Th umb, E-15, E-38 VAX architecture, OL2.21-4, OL5.17-7 Timing U Vector lanes, 512 asynchronous inputs, B-76–77 Vector processors, 508–510. See also level-sensitive, B-75–76 Unconditional branches, 91 Processors methodologies, B-72–77 Underfl ow, 198 conventional code comparison, two-phase, B-75 Unicode 509–510 TLB misses, 439. See also Translation- alphabets, 109 instructions, 510 lookaside buff er (TLB) defi ned, 110 multimedia extensions and, 511–512 entry point, 449 example alphabets, 110 scalar versus, 510–511 handler, 449 Unifi ed GPU architecture, C-10–12 Vectored interrupts, 327 handling, 446–453 illustrated, C-11 Verilog occurrence, 446 processor array, C-11–12 behavioral defi nition of MIPS ALU, problem, 453 Uniform memory access (UMA), 518, B-25 Tomasulo’s algorithm, OL4.16-3 C-9 behavioral defi nition with bypassing, Touchscreen, 19 multiprocessors, 519 OL4.13-4–4.13-6 Tournament branch predicators, 324 Units behavioral defi nition with stalls for Tracks, 381–382 commit, 339–340, 343 loads, OL4.13-6–4.13-8 Transfer time, 383 control, 247–248, 259–261, D-4–8, behavioral specifi cation, B-21, OL4.13- Transistors, 25 D-10, D-12–13 2–4.13-4 Translation-lookaside buff er (TLB), defi ned, 219 behavioral specifi cation of multicycle 438–439, E-26–27, OL5.17-6. See fl oating point, 219 MIPS design, OL4.13-12–4.13-13 also TLB misses hazard detection, 313, 314–315 behavioral specifi cation with associativities, 439 for load/store implementation, 255 simulation, OL4.13-2 illustrated, 438 special function (SFUs), C-35, C-43, behavioral specifi cation with stall integration, 440–441 C-50 detection, OL4.13-6–4.13-8 Intrinsity FastMATH, 440 UNIVAC I, OL1.12-5 behavioral specifi cation with synthesis, typical values, 439 UNIX, OL2.21-8, OL5.17-9–5.17-12 OL4.13-11–4.13-16 Transmit driver and NIC hardware time AT&T, OL5.17-10 blocking assignment, B-24 versus.receive driver and NIC hardware Berkeley version (BSD), OL5.17-10 branch hazard logic implementation, time, OL6.9-8 genius, OL5.17-12 OL4.13-8–4.13-10 Transmitter Control register, A-39–40 history, OL5.17-9–5.17-12 combinational logic, B-23–26 Transmitter Data register, A-40 Unlock synchronization, 121 datatypes, B-21–22 Trap instructions, A-64–66 Unresolved references defi ned, B-20 Tree-based parallel scan, C-62 defi ned, A-4 forwarding implementation, Truth tables, B-5 linkers and, A-18 OL4.13-4 ALU control lines, D-5 Unsigned numbers, 73–78 MIPS ALU defi nition in, B-35–38 for control bits, 260–261 Use latency modules, B-23 datapath control outputs, D-17 defi ned, 336–337 multicycle MIPS datapath, OL4.13-14 datapath control signals, D-14 one-instruction, 336–337 nonblocking assignment, B-24 defi ned, 260 operators, B-22 example, B-5 V program structure, B-23 next-state output bits, D-15 reg, B-21–22 PLA implementation, B-13 Vacuum tubes, 25 sensitivity list, B-24 Two’s complement representation, 75–76 Valid bit, 386 sequential logic specifi cation, B-56–58 advantage, 75–76 Variables structural specifi cation, B-21 negation shortcut, 76 C language, 102 wire, B-21–22 rule, 79 programming language, 67 Vertical microcode, D-32 I-24 Index

Very large-scale integrated (VLSI) W memory hierarchy handling of, circuits, 25 457–458 Very Long Instruction Word (VLIW) Wafers, 26 schemes, 394 defi ned, 334–335 defects, 26 virtual memory, 437 fi rst generation computers, OL4.16-5 dies, 26–27 write-back cache, 394, 395 processors, 335 yield, 27 write-through cache, 394, 395 VHDL, B-20–21 Warehouse Scale Computers (WSCs), 7, Write-stall cycles, 400 Video graphics array (VGA) controllers, 531–533, 558 Write-through caches. See also Caches C-3–4 Warps, 528, C-27 advantages, 458 Virtual addresses Weak scaling, 505 defi ned, 393, 457 causing page faults, 449 Wear levelling, 381 tag mismatch, 394 defi ned, 428 While loops, 92–93 mapping from, 428–429 Whirlwind, OL5.17-2 X size, 430 Wide area networks (WANs), 24. See also Virtual machine monitors (VMMs) Networks x86, 149–158 defi ned, 424 Words Advanced Vector Extensions in, 225 implementing, 481, 481–482 accessing, 68 brief history, OL2.21-6 laissez-faire attitude, 481 defi ned, 66 conclusion, 156–158 page tables, 452 double, 152 data addressing modes, 152, 153–154 in performance improvement, 427 load, 68, 71 evolution, 149–152 requirements, 426 quad, 154 fi rst address specifi er encoding, 158 Virtual machines (VMs), 424–427 store, 71 historical timeline, 149–152 benefi ts, 424 Working set, 453 instruction encoding, 155–156 defi ned, A-41 World Wide Web, 4 instruction formats, 157 illusion, 452 Worst-case delay, 272 instruction set growth, 161 instruction set architecture support, Write buff ers instruction types, 153 426–427 defi ned, 394 integer operations, 152–155 performance improvement, 427 stalls, 399 registers, 152, 153–154 for protection improvement, 424 write-back cache, 395 SIMD in, 507–508, 508 simulation of, A-41–42 Write invalidate protocols, 468, 469 Streaming SIMD Extensions in, Virtual memory, 427–454. See also Pages Write serialization, 467 224–225 address translation, 429, 438–439 Write-back caches. See also Caches typical instructions/functions, 155 integration, 440–441 advantages, 458 typical operations, 157 mechanism, 452–453 cache coherency protocol, OL5.12-5 Xerox Alto computer, OL1.12-8 motivations, 427–428 complexity, 395 XMM, 224 page faults, 428, 434 defi ned, 394, 458 protection implementation, stalls, 399 Y 444–446 write buff ers, 395 segmentation, 431 Write-back stage Yahoo! Cloud Serving Benchmark summary, 452–453 control line, 302 (YCSB), 540 virtualization of, 452 load instruction, 292 Yield, 27 writes, 437 store instruction, 294 YMM, 225 Virtualizable hardware, 426 Writes Virtually addressed caches, 443 complications, 394 Z Visual computing, C-3 expense, 453 Volatile memory, 22 handling, 393–395 Zettabyte, 6