Computer Architectures: Advanced CPU Design

Tien-Fu Chen

National Chung Cheng Univ.

©byTien-FuChen@CCU Adv CPU-0

MMX Technology

• Basic concepts
  – Small native data types
  – Compute-intensive operations
  – A lot of inherent parallelism => single-instruction, multiple-data (SIMD)

• Features
  – Packed data types
  – A rich set of MMX instructions that perform parallel operations
  – Saturation arithmetic: unlike regular arithmetic, results are not truncated or wrapped around; they are clamped to the largest or smallest representable value
  – Parallel compare
  – Overlapped operations
  – Pack/unpack of data types
  – Compatible extension to the architecture

Packed Data Types (small data types packed into one register)

• Dual usage of the floating-point registers

• Enhanced instruction set operating in parallel fashion
  – In total, 57 MMX instructions are added to IA.

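The difference between wraparound and saturation arithmetic on packed data can be sketched in a few lines. This is only an illustrative model, assuming eight unsigned 8-bit lanes packed into one 64-bit word; the helper names are hypothetical and the code imitates the behavior of a packed byte add, not the actual MMX hardware.

```python
def unpack_bytes(word):
    """Split a 64-bit integer into eight unsigned 8-bit lanes (lane 0 = lowest byte)."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def pack_bytes(lanes):
    """Pack eight unsigned 8-bit lanes back into one 64-bit integer."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def packed_add_wrap(a, b):
    """Lane-wise add where each lane wraps around modulo 256 (regular arithmetic)."""
    return pack_bytes([(x + y) & 0xFF
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


def packed_add_saturate(a, b):
    """Lane-wise add where each lane is clamped to the largest value, 255."""
    return pack_bytes([min(x + y, 0xFF)
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


# Hypothetical pixel-like data: several lanes overflow 8 bits when added.
a = pack_bytes([250, 10, 128, 255, 0, 1, 200, 100])
b = pack_bytes([10, 20, 128, 1, 0, 1, 100, 100])

wrapped = unpack_bytes(packed_add_wrap(a, b))
saturated = unpack_bytes(packed_add_saturate(a, b))
print("wraparound:", wrapped)
print("saturation:", saturated)
```

For pixel data the saturated result is usually the useful one: a bright pixel that overflows stays at full brightness instead of wrapping to a dark value.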

Fast DSP computation

Performance of Matrix Multiplication

Performance comparison between IA and MMX: working example on matrix and vector multiplication

Vector-vector multiplication
                           Traditional IA       MMX
  No. of loads                  32                8
  No. of multiplies             16                4
  No. of adds                   15                3
  Loop control*                 12                0
  Other overhead                 0                3
  Final result save              1                1
  Instruction count             76               19
  Cycle count**                200               12

Matrix-vector multiplication
                           Traditional IA       MMX
  Total instructions       4(4x76+3) = 1228     4(4x19+3) = 316
  Both under optimized mode   1200 cycles       207 cycles

Comparison result: speedup of 5.8 times

* Assume we perform 4 MACs (out of 16) per loop iteration of our code: for (K = 1; K < 5; K++) { Mac(K); } Each iteration therefore costs 3 loop-control instructions: increment, compare, and branch.

** 1) The cycle count is dominated by the nonpipelined, 11-cycle integer multiply operation. 2) 4 mispredictions in total when exiting the loops. 3) All data are in on-chip caches.
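The figures quoted above can be cross-checked with a short calculation. The sketch below only re-derives numbers that appear on the slide (per-category instruction counts, the 4(4 x count + 3) formula, and the cycle counts); the dictionary keys and function name are made up for illustration.

```python
# Per-category instruction counts for one vector-vector multiplication,
# copied from the slide.
ia = {"loads": 32, "multiplies": 16, "adds": 15,
      "loop_control": 12, "other_overhead": 0, "result_save": 1}
mmx = {"loads": 8, "multiplies": 4, "adds": 3,
       "loop_control": 0, "other_overhead": 3, "result_save": 1}

ia_count = sum(ia.values())
mmx_count = sum(mmx.values())


def matrix_vector_instructions(per_vector_count):
    """Total instruction count using the slide's formula 4(4 x count + 3)."""
    return 4 * (4 * per_vector_count + 3)


ia_total = matrix_vector_instructions(ia_count)
mmx_total = matrix_vector_instructions(mmx_count)

# Cycle counts for the matrix-vector case, both in optimized mode.
ia_cycles, mmx_cycles = 1200, 207
speedup = ia_cycles / mmx_cycles

print(ia_count, mmx_count, ia_total, mmx_total, round(speedup, 1))
```

Note that the speedup is computed from cycle counts, not instruction counts; the instruction-count ratio (1228 to 316) is closer to 3.9.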

More Parallelism

• Streaming SIMD Extensions (SSE), since Pentium III
  – Physically adds eight new 128-bit XMM registers and 70 new instructions. New machine state is introduced.
  – Supports four 32-bit single-precision floating-point operations in parallel. Recall that all MMX SIMD instructions operate on integers only.

• Streaming SIMD Extensions 2 (SSE2), since Pentium 4
  – Uses the XMM registers. No new machine state.
  – 144 new instructions added.
  – Supports double-precision floating-point operations in parallel.

• IA-64 Itanium™ architecture
  – Enable, enhance, express, and exploit parallelism: at the process/thread level for programmers and at the instruction level for compilers. All explicitly.

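The lane counts for SSE and SSE2 follow directly from the 128-bit register width. The sketch below is a rough model under that assumption: it computes how many elements fit in one XMM register and imitates a four-lane single-precision add. The function names are hypothetical, and `struct` is used only to round each result to 32-bit precision.

```python
import struct

XMM_BITS = 128  # width of one XMM register


def lanes(element_bits):
    """Number of elements of the given width that fit in one XMM register."""
    return XMM_BITS // element_bits


def to_single(x):
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def packed_add_single(a, b):
    """Lane-wise add of two four-element single-precision vectors."""
    if len(a) != lanes(32) or len(b) != lanes(32):
        raise ValueError("expected four 32-bit lanes per operand")
    return [to_single(x + y) for x, y in zip(a, b)]


single_lanes = lanes(32)   # SSE: single precision
double_lanes = lanes(64)   # SSE2: double precision
result = packed_add_single([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
register_bytes = len(struct.pack("<4f", *result))
print(single_lanes, double_lanes, result, register_bytes)
```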
Objectives of IA-64 Instruction Set Architecture (ISA)

• Intel and HP technology alliance

• Enable industry-leading system performance
  – Breakthrough performance
  – Headroom

• Enable compatibility with today’s IA-32 software and PA-RISC software

• Allow scalability over a wide range of implementations

• Full 64-bit computing


Next Generation Terminology

• EPIC (Explicitly Parallel Instruction Computing): the next-generation processor technology
  – e.g., RISC, CISC

• IA-64 (Intel Architecture, 64-bit): the architecture that incorporates EPIC technology
  – e.g., IA-32, PA-RISC

• Merced processor: the project name for Intel’s first IA-64-based implementation
  – e.g., Pentium II, PA-8500

Features of IA-64 Architecture

• Explicit parallelism
  – ILP is explicit in machine code
  – Compiler analyzes and identifies parallelism at compile time

• Predication enhances parallelism

• Speculation minimizes the effect of memory latency

• IA-64 processors are massively resourced
  – Many registers
  – Many functional units
  – Inherently scalable

• Performance, headroom, binary compatibility


Predication: Features and Benefits

• Compiler is given a larger scheduling scope
  – Nearly all instructions can be predicated
  – State is updated if an instruction's predicate is true; otherwise the instruction acts as a NOP
  – Compiler assigns predicates; compare instructions set them
  – Architecture provides 64 1-bit predicate registers (PR)

• Predicated execution removes branches
  – Converts a control dependence to a data dependence
  – Reduces mispredict penalties

• Parallel execution through larger basic blocks
  – Effective use of parallel hardware

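Turning a control dependence into a data dependence can be illustrated with a toy model of if-conversion. The sketch below is not IA-64 code; it assumes a hypothetical `predicated` helper that commits a result only when its predicate is true, and the predicate names `p1` and `p2` are made up for the example.

```python
def predicated(pred, state, dest, value):
    """Commit `value` to state[dest] only if `pred` is true; otherwise act as a NOP.

    The `if` here stands in for the hardware nullifying the instruction; it is
    not a program branch that could be mispredicted.
    """
    if pred:
        state[dest] = value


def abs_diff_branching(a, b):
    """Original form: the result depends on which way the branch goes."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    """If-converted form: both arms are issued, predicates select the result."""
    state = {"r": None}
    # A compare sets a complementary pair of predicates.
    p1 = a > b
    p2 = not p1
    # Both operations sit in one straight-line block and are independent of
    # each other, so they could be scheduled in parallel.
    predicated(p1, state, "r", a - b)
    predicated(p2, state, "r", b - a)
    return state["r"]


print(abs_diff_branching(7, 2), abs_diff_predicated(7, 2))
```

The two arms now depend on the predicate values (data) rather than on a branch outcome (control), which is what lets the compiler merge them into one larger basic block.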
Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”

• IA-64 is the instruction set architecture; EPIC is the type
  – EPIC = 2nd-generation VLIW?

• Itanium™ is the first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process

• 128 64-bit integer registers + 128 82-bit floating-point registers
  – Not separate register files per functional unit as in old VLIW

• Hardware checks dependencies (interlocks => binary compatibility over time)

• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

Binary Compatibility

[Figure: paths to native IA-64 code. High-level language source (C, C++, Fortran, COBOL) goes through the native compiler and optimizer; IA-32 object code and PA-RISC object code go through a dynamic translator. Operating systems shown: HP-UX and NT.]

• Application source compatible
• Design criteria
• C, C++ and FTN
• Systems architecture
• Transparent to user
• Default

VLIW Processor Architectures for DSP

• Why VLIW architecture?
  – VLIW is especially suitable for DSP applications
  – DSP algorithms are dominated by data-parallel computation and consist of core tight loops executed repeatedly (e.g., convolution, FFT)
  – Single-chip high-performance VLIW processors with multiple functional units (FUs) are commercially available


VLIW Architecture

• Instruction-level parallelism (ILP)
  – Multiple different FUs operate in parallel
  – Each instruction contains an operation code for each FU

• Data-level parallelism (DLP)
  – A single FU is partitioned to perform the same operation on multiple smaller-precision data

• Instruction set architecture
  – Each processor has its own special instructions to further enhance performance
  – e.g., Complex_multiply for FFT and autocorrelation algorithms

• Memory I/O
  – Via DMA controller
  – Predictable access time
  – Hide the data transfer time behind the processing time through independent work
  – Real-time requirement
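Why overlapping DMA transfers with computation helps can be seen from a simple timing model. This is a back-of-the-envelope sketch, assuming fixed per-block transfer and compute times and double buffering (the DMA controller fetches the next block while the core works on the current one); the function names and the example numbers are hypothetical.

```python
def serial_time(n_blocks, t_transfer, t_compute):
    """Each block is transferred, then processed, with no overlap."""
    return n_blocks * (t_transfer + t_compute)


def overlapped_time(n_blocks, t_transfer, t_compute):
    """DMA fetches block i+1 while the core processes block i."""
    if n_blocks == 0:
        return 0
    # The first transfer and the last computation cannot be hidden;
    # in between, each step costs whichever of the two activities is slower.
    return t_transfer + (n_blocks - 1) * max(t_transfer, t_compute) + t_compute


# Hypothetical workload: 4 blocks, 3 time units to transfer, 5 to process.
print(serial_time(4, 3, 5), overlapped_time(4, 3, 5))
```

When computation is the slower activity, nearly all transfer time disappears from the total, and because DMA access time is predictable the total itself is predictable, which matters for real-time requirements.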

TI TMS320C62

• 256 bits per instruction (8 x 32-bit)

• 2 clusters
  – Each with 4 FUs
  – Each with 16 32-bit registers
  – One cross-cluster read port each way

• Two integer ALUs support partitioned instructions

• Programmable DMA controller with two 32-KB memories


TI TMS320C80

• ILP, DLP, multiple processors on a single chip

• 4 ADSPs (DSP + VLIW)
  – A 16-bit MUL, a 3-input 32-bit ALU, a branch unit, 2 load/store units
  – 3 zero-overhead loop controllers
  – One 2-KB I-cache, four 2-KB D-caches

• RISC processor
  – FPU: FP MAC
  – A 4-KB I-cache, a 4-KB D-cache

• DMA (Transfer Controller)
  – Supports various types of data transfers with complex address calculation

• No support for some powerful instructions
  – SAD, inner product

Philips Trimedia TM1000

• 27 FUs, coprocessor for MPEG-2 decoding

• No DMA controller; 16-KB D-cache, 32-KB I-cache

• One PCI port, MM I/O

• Issues 5 simultaneous instructions per cycle

• DSPALU: partitioned instructions

• DSPMUL: partitioned instructions, inner product


Transmeta’s Crusoe Processor, TM5400

• General-purpose processor based on VLIW
  – Difficulties: binary code compatibility, very complicated compiler

• Support (MS Windows, Linux):
  – x86 Code Morphing software using dynamic binary code translation

• 2 integer units, 1 FPU, 1 load/store unit, 1 branch unit
  – 64-KB 16-way L1 D-cache
  – 64-KB 8-way I-cache
  – 256-KB L2 cache
  – 64 32-bit GPRs
  – VLIW instruction size: 64 or 128 bits; 4 instructions per cycle; supports partitioned instructions

Crusoe: A low-power x86 processor

• Crusoe processor = software + hardware

Code Morphing software
• Dynamically translates x86 instructions into VLIW instructions
• Provides x86 compatibility
• Optimization and scheduling by software

VLIW hardware
• 128-bit very long instruction word processor
• Simple and fast
• Fewer transistors

[Figure labels: 3/4 beside Code Morphing software, 1/4 beside VLIW hardware]

Low power, x86 compatibility, PC performance


Crusoe VLIW

Code Morphing Software

A dynamic translation system that resides in ROM; it is the first program to start executing at boot.

• Drawing the H/W and S/W line
  – Software: decodes x86 instructions and generates parallel molecules
  – Hardware: executes them using a simple, high-speed VLIW engine

• Decoding and scheduling
  – Translation cache: CMS translates instructions once and saves the resulting translation for reuse, so translation is skipped the next time the same code runs

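The translate-once idea behind the translation cache can be sketched as a simple lookup table keyed by code address. This is a toy model, not Transmeta's implementation: the class, the `toy_translate` stand-in, and the string-based "molecule" format are all assumptions made for illustration.

```python
class TranslationCache:
    """Caches translated blocks so each x86 block is translated only once."""

    def __init__(self, translate):
        self._translate = translate  # function: x86 block -> translated block
        self._cache = {}             # block start address -> translation
        self.translations = 0        # how many times the translator actually ran

    def lookup(self, address, x86_block):
        """Return the translation for the block at `address`, translating on a miss."""
        if address not in self._cache:
            self._cache[address] = self._translate(x86_block)
            self.translations += 1
        return self._cache[address]


def toy_translate(x86_block):
    """Stand-in translator: wraps each instruction in a tag (purely hypothetical)."""
    return tuple(f"vliw<{insn}>" for insn in x86_block)


cache = TranslationCache(toy_translate)
block = ("mov eax, 1", "add eax, ebx")

first = cache.lookup(0x1000, block)    # miss: the translator runs
second = cache.lookup(0x1000, block)   # hit: the saved translation is reused
print(cache.translations, first)
```

The cost of translation is paid on the first execution only, which is why frequently executed loops benefit most from this scheme.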