Computer Architectures: Advanced CPU Design
Tien-Fu Chen
National Chung Cheng Univ.
©byTien-FuChen@CCU Adv CPU-0
MMX technology
• Basic concepts
  – small native data types
  – compute-intensive operations
  – a lot of inherent parallelism => single-instruction multiple-data (SIMD)
• Features
  – packed data types
  – a rich set of MMX instructions to perform parallel operations
  – saturation arithmetic: unlike regular arithmetic, results are not truncated or wrapped around; an out-of-range result is clamped to the largest or smallest representable number
  – parallel compare
  – overlapped operations
  – pack/unpack data types
  – compatible extension of the architecture

Packed Data Types (small data types packed into one register)
• Dual usage of the floating-point registers
• Enhanced instruction set operating in parallel fashion
  – In total, 57 MMX instructions are added to IA.
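To make packed data types and saturation arithmetic concrete, here is a minimal Python sketch. It is a model, not MMX code: a 64-bit register is represented as a list of eight unsigned bytes, and the function names `add_wrap` and `add_saturate` are illustrative assumptions. On real hardware each function would correspond to a single packed-add instruction over the whole register (the MMX instructions PADDB and PADDUSB, respectively).

```python
# Illustrative model only: a 64-bit MMX register viewed as eight unsigned bytes.
LANES = 8
MAX_U8 = 255

def add_wrap(a, b):
    """Lane-wise add with wraparound (modulo 256), like regular arithmetic."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def add_saturate(a, b):
    """Lane-wise add that clamps each result to the largest unsigned byte."""
    return [min(x + y, MAX_U8) for x, y in zip(a, b)]

# Hypothetical data: brighten eight pixels by 10.
pixels = [250, 10, 128, 255, 0, 200, 100, 30]
boost = [10] * LANES

wrapped = add_wrap(pixels, boost)      # bright pixels wrap to dark values
clamped = add_saturate(pixels, boost)  # bright pixels stay at 255
print(wrapped)
print(clamped)
```

With wraparound, the pixel at 250 becomes 4, which is why saturation is the usual choice for image and audio data.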
Fast DSP computation
Performance of Matrix Multiplication
Performance comparison between IA and MMX: working example on matrix and vector multiplication

Vector-vector multiplication     Traditional IA       MMX
  No. of loads                   32                   8
  No. of multiplies              16                   4
  No. of adds                    15                   3
  Loop control*                  12                   0
  Other overhead                 0                    3
  Final result save              1                    1
  Instruction count              76                   19
  Cycle count**                  200                  12

Matrix-vector multiplication     Traditional IA       MMX
  Total instructions             4(4x76+3) = 1228     4(4x19+3) = 316
  Both under optimized mode      1200 cycles          207 cycles

Comparison result: speed-up of 5.8 times

* Assume we perform 4 MACs (out of 16) per loop iteration of our code:
    for (K = 1; K < 5; K++) { Mac(K); }
  So each loop iteration costs 3 loop-control instructions: increment, compare, and branch.
** 1) The cycle count is dominated by the non-pipelined, 11-cycle integer multiply operation.
   2) 4 mispredictions in total when exiting the loops.
   3) All data are in on-chip caches.
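The figures in the comparison above can be checked with a few lines of arithmetic. The sketch below copies the per-row counts from the slide and recomputes the totals and the speed-up; the dictionary keys and the helper name `total_instructions` are illustrative, and the formula `4 * (4 * n + 3)` is taken from the slide as given rather than derived here.

```python
# Per-vector-multiply instruction breakdown, copied from the slide's table.
ia = {"loads": 32, "multiplies": 16, "adds": 15,
      "loop_control": 4 * 3,  # 4 iterations x (increment, compare, branch)
      "other": 0, "save": 1}
mmx = {"loads": 8, "multiplies": 4, "adds": 3,
       "loop_control": 0, "other": 3, "save": 1}

def total_instructions(per_vector):
    """The slide's formula for the matrix-vector case: 4(4n + 3)."""
    return 4 * (4 * per_vector + 3)

ia_count = sum(ia.values())
mmx_count = sum(mmx.values())
ia_total = total_instructions(ia_count)
mmx_total = total_instructions(mmx_count)

# Cycle counts in optimized mode, as quoted on the slide.
speedup = 1200 / 207

print(ia_count, mmx_count)    # instruction counts per vector multiply
print(ia_total, mmx_total)    # total instructions
print(round(speedup, 1))      # cycle-count speed-up
```

Note that the instruction count shrinks by about 4x, while the cycle-count speed-up is larger because the MMX version also avoids the slow integer multiply and the loop mispredictions.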
More Parallelism
• Streaming SIMD Extensions (SSE), since Pentium III
  – Physically adds eight new 128-bit XMM registers and 70 new instructions. New machine state is introduced.
  – Supports four 32-bit single-precision floating-point operations in parallel. Recall that the MMX SIMD instructions operate on integers only.
• Streaming SIMD Extensions 2 (SSE2), since Pentium 4
  – Uses the XMM registers. No new machine state.
  – 144 new instructions added.
  – Supports parallel double-precision floating-point operations.
• IA-64 Itanium™ architecture
  – Enable, enhance, express, and exploit parallelism: at the process/thread level for programmers, and at the instruction level for compilers. All explicitly.
Objectives of IA-64 Instruction Set Architecture (ISA)
• Intel and HP technology alliance
• Enable industry-leading system performance
  – Breakthrough performance
  – Headroom
• Enable compatibility with today's IA-32 software and PA-RISC software
• Allow scalability over a wide range of implementations
• Full 64-bit computing
Next Generation Terminology
• EPIC (Explicitly Parallel Instruction Computing): the next-generation processor technology
  – analogous terms: RISC, CISC
• IA-64 (Intel Architecture, 64-bit): the architecture that incorporates EPIC technology
  – analogous terms: IA-32, PA-RISC
• Merced processor: the project name for Intel's first IA-64-based implementation
  – analogous terms: Pentium II, PA-8500
Features of IA-64 Architecture
• Explicit parallelism
  – ILP is explicit in machine code
  – Compiler analyzes and identifies parallelism at compile time
• Predication enhances parallelism
• Speculation minimizes the effect of memory latency
• IA-64 processors are massively resourced
  – Many registers
  – Many functional units
  – Inherently scalable
• Performance, headroom, binary compatibility
Predication: Features and Benefits
• Compiler is given a larger scheduling scope
  – Nearly all instructions can be predicated
  – State is updated if an instruction's predicate is true; otherwise the instruction acts as a NOP
  – Compiler assigns predicates; compare instructions set them
  – Architecture provides 64 1-bit predicate registers (PR)
• Predicated execution removes branches
  – Converts a control dependence to a data dependence
  – Reduces mispredict penalties
• Parallel execution through larger basic blocks
  – Effective use of parallel hardware
Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• IA-64: instruction set architecture; EPIC is the type
  – EPIC = 2nd-generation VLIW?
• Itanium™ is the first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
• 128 64-bit integer registers + 128 82-bit floating-point registers
  – Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
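Predicated execution, as described in the slides above, can be illustrated by if-conversion of a small branch. The Python sketch below models the semantics only: Python has no predicate registers, and its conditional expression is itself a branch inside the interpreter. The helper `predicated` and the names `p1` and `p2` are hypothetical stand-ins for a predicated instruction and two of the 64 predicate registers.

```python
def predicated(pred, new_value, old_value):
    """Model of one predicated instruction: commit the result when the
    predicate is true, otherwise leave the old state (act as a NOP)."""
    return new_value if pred else old_value

def abs_diff_branchy(a, b):
    # Control dependence: which subtraction runs depends on the branch.
    if a > b:
        r = a - b
    else:
        r = b - a
    return r

def abs_diff_predicated(a, b):
    # A compare sets a predicate and its complement.
    p1 = a > b
    p2 = not p1
    # Both subtractions are issued with no branch in between; each one
    # updates r only if its predicate is true (a data dependence on p1/p2).
    r = None
    r = predicated(p1, a - b, r)
    r = predicated(p2, b - a, r)
    return r

for a, b in [(3, 10), (10, 3), (5, 5)]:
    print(a, b, abs_diff_branchy(a, b), abs_diff_predicated(a, b))
```

Because the two subtractions no longer sit in separate basic blocks, a scheduler is free to issue them in the same cycle, and there is no branch left to mispredict.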
Binary Compatibility
[Figure: binary compatibility. Inputs: high-level language source (C, C++, Fortran, COBOL), IA-32 object code, PA-RISC object code. Stages: default compiler and optimizer, dynamic translator. Output: native IA-64 code. Other labels: HP-UX and NT; application source compatible (C, C++ and FTN); design criteria; systems architecture; transparent to user.]
Play: Next generation ISA

VLIW Processor Architectures for DSP
• Why VLIW architecture?
  – VLIW is especially suitable for DSP applications
  – DSP algorithms are dominated by data-parallel computation and consist of tight core loops executed repeatedly, e.g., convolution, FFT
  – Single-chip high-performance VLIW processors with multiple FUs are commercially available
VLIW Architecture
• Instruction-level parallelism (ILP)
  – Multiple different FUs operate in parallel.
  – Each instruction contains an operation code for each FU.
• Data-level parallelism (DLP)
  – A single FU is divided to perform the same operation on multiple smaller-precision data.
• Instruction set architecture
  – Each processor has its own special instructions to further enhance performance.
  – e.g., Complex_multiply for FFT and autocorrelation algorithms
• Memory I/O
  – Via DMA controller
  – Predictable access time
  – Hide the data transfer time behind the processing time by doing independent work
  – Real-time requirement
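The data-level parallelism bullet above (one FU divided to operate on several smaller-precision values) can be modeled in software. The sketch below treats a 32-bit word as four 8-bit lanes and adds them without letting carries cross lane boundaries, which is in effect what a partitioned ALU does by cutting its carry chain. The function names are illustrative assumptions, and the masking trick is a common software technique, not a description of any particular processor's hardware.

```python
LOW7 = 0x7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH = 0x80808080  # top bit of every 8-bit lane

def partitioned_add8(a, b):
    """Add four 8-bit lanes packed in 32-bit words, with wraparound per lane."""
    # Add the low 7 bits of each lane; the sum cannot carry out of the lane.
    low = (a & LOW7) + (b & LOW7)
    # Fold in the top bits with XOR so no carry leaves the lane.
    return (low ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF

def pack(lanes):
    """Pack four bytes (lane 0 first) into one 32-bit word."""
    return lanes[0] | (lanes[1] << 8) | (lanes[2] << 16) | (lanes[3] << 24)

def unpack(word):
    """Split a 32-bit word into four bytes (lane 0 first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

a = pack([0xFF, 1, 2, 3])
b = pack([1, 1, 1, 1])

plain = unpack((a + b) & 0xFFFFFFFF)       # carry from lane 0 leaks into lane 1
partitioned = unpack(partitioned_add8(a, b))
print(plain)
print(partitioned)
```

The plain 32-bit add corrupts lane 1 with the carry from lane 0, while the partitioned add keeps the four results independent.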
TI TMS320C62
• 256 bits per instruction (8 x 32-bit)
• 2 clusters
  – Each with 4 FUs
  – Each with 16 32-bit registers
  – One cross-cluster read port each way
• Two integer ALUs support partitioned instructions
• Programmable DMA controller with two 32-KB memories
TI TMS320C80
• ILP, DLP, multiple processors on a single chip
• 4 ADSPs (DSP + VLIW)
  – A 16-bit MUL, a 3-input 32-bit ALU, a branch unit, 2 load/store units
  – 3 zero-overhead loop controllers
  – One 2-KB I-cache, four 2-KB D-caches
• RISC processor
  – FPU: FPMAC
  – A 4-KB I-cache, a 4-KB D-cache
• DMA (Transfer Controller)
  – Supports various types of data transfers with complex address calculation
• No support for some powerful instructions
  – SAD, inner product

Philips Trimedia TM1000
• 27 FUs, coprocessor for MPEG-2 decoding
• No DMA controller, 16-KB D-cache, 32-KB I-cache
• One PCI port, MM I/O
• Issues 5 simultaneous instructions per cycle
• DSPALU: partitioned instructions
• DSPMUL: partitioned instructions, inner product
Transmeta's Crusoe Processor, TM5400
• General-purpose microprocessor based on VLIW
  – Difficulties: binary code compatibility, very complicated compiler
• Supports x86 (MS Windows, Linux)
  – x86 code morphing software using dynamic binary code translation
• 2 integer units, 1 FPU, 1 load/store unit, 1 branch unit
  – 64-KB 16-way L1 D-cache
  – 64-KB 8-way I-cache
  – 256-KB L2 cache
  – 64 32-bit GPRs
  – VLIW instruction size: 64 or 128 bits; 4 instructions per cycle; supports partitioned instructions

Crusoe: A low-power x86 processor
• Crusoe processor = software + hardware
• Code Morphing software (3/4)
  – Dynamically translates x86 instructions into VLIW instructions
  – Provides x86 compatibility
  – Optimization and scheduling by software
• VLIW hardware (1/4)
  – 128-bit Very Long Instruction Word processor
  – Simple and fast
  – Fewer transistors
• Low power, x86 compatibility, PC performance
Crusoe VLIW
Code Morphing Software
• A dynamic translation system that resides in ROM
• The first program to start executing at boot
• Drawing the hardware/software line
  – Software: decodes x86 instructions and generates parallel molecules
  – Hardware: executes them using a simple, high-speed VLIW engine
• Decoding and scheduling
  – Translation cache: CMS translates instructions once and saves the resulting translation for re-use
    => The translation is skipped the next time
Play: Transmeta Crusoe
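The translation cache idea (translate once, reuse afterwards) can be sketched in a few lines. This is a toy model under stated assumptions, not Transmeta's implementation: the class `TranslationCache`, the function `toy_translate`, the address `0x4000`, and the string form of the "molecules" are all hypothetical, and the real CMS also optimizes and schedules the code it translates.

```python
class TranslationCache:
    """Toy model: cache translations keyed by the x86 block's start address."""

    def __init__(self, translate):
        self._translate = translate
        self._cache = {}
        self.translations_done = 0  # counts how often the slow path runs

    def lookup(self, address, x86_block):
        if address not in self._cache:
            # Slow path: translate once and save the result for re-use.
            self._cache[address] = self._translate(x86_block)
            self.translations_done += 1
        # Fast path: return the saved translation.
        return self._cache[address]

def toy_translate(x86_block):
    # Stand-in for decoding and scheduling; just tags each instruction.
    return tuple("vliw:" + op for op in x86_block)

cache = TranslationCache(toy_translate)
hot_loop = ("add eax, ebx", "cmp eax, ecx", "jne loop")

# Executing the same block 1000 times triggers only one translation.
for _ in range(1000):
    molecules = cache.lookup(0x4000, hot_loop)

print(cache.translations_done)
print(molecules)
```

The cost of translation is paid once per block, so code that spends most of its time in a few hot loops amortizes that cost quickly.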