Advance CPU Architecture

Computer Architectures
Advance CPU Design
Tien-Fu Chen
National Chung Cheng Univ.
©byTien-FuChen@CCU Adv CPU-0

MMX technology
• Basic concepts
  – small native data types
  – compute-intensive operations
  – a lot of inherent parallelism => single-instruction multiple data (SIMD)
• Features
  – packed data types
  – a rich set of MMX instructions to perform parallel operations
  – saturation arithmetic: unlike regular arithmetic, results are not truncated and do not wrap around; an out-of-range result is replaced by the largest or smallest representable number
  – parallel compare
  – overlapped operations
  – pack/unpack data types
  – compatible extension architectures

Packed Data Types (small data types packed into one register)
• Dual usage of the floating-point registers
• Enhanced instruction set operating in parallel fashion
  – A total of 57 MMX instructions are added to IA.

Fast DSP computation

Performance of Matrix Multiplication
Performance comparison between IA and MMX: working example on matrix and vector multiplication

                                     Traditional IA      MMX
Vector-vector multiplication
  No. of loads                       32                  8
  No. of multiplies                  16                  4
  No. of adds                        15                  3
  Loop control *                     12                  0
  Other overhead                     0                   3
  Final result save                  1                   1
  Instruction count                  76                  19
  Cycle count **                     200                 12
Matrix-vector multiplication
  Total instructions                 4(4x76+3) = 1228    4(4x19+3) = 316
  Cycles (both in optimized mode)    1200                207
Comparison result: speedup of 5.8 times

* Assume we perform 4 MACs (out of 16) per loop iteration of our code:
    for (K = 1; K < 5; K++) { Mac(K); }
  So each iteration has 3 loop-control instructions: increment, compare, and branch.
** 1) The cycle count is dominated by the non-pipelined, 11-cycle integer multiply operation.
   2) 4 mispredictions in total when exiting the loops.
   3) All data are in on-chip caches.

More Parallelisms
• Streaming SIMD Extension (SSE), since Pentium III
  – Physically adds eight new 128-bit XMM registers and 70 instructions. New machine state is introduced.
  – Supports four 32-bit single-precision floating-point operations in parallel. Recall that all MMX SIMD instructions operate on integers only.
• Streaming SIMD Extension 2 (SSE2), since Pentium 4
  – Uses the XMM registers. No new machine state.
  – 144 new instructions added.
  – Supports double-precision floating-point parallel operations.
• IA-64 Itanium™ Architecture
  – Enable, enhance, express, and exploit parallelism: at the process/thread level for programmers, at the instruction level for compilers. All explicitly.

Objectives of IA-64 Instruction Set Architecture (ISA)
• Intel and HP Technology Alliance
• Enable industry-leading system performance
  – Breakthrough performance
  – Headroom
• Enable compatibility with today's IA-32 software and PA-RISC software
• Allow scalability over a wide range of implementations
• Full 64-bit computing

Next Generation Terminology
• EPIC (Explicitly Parallel Instruction Computing): the next generation processor technology
  – comparable terms: RISC, CISC
• IA-64 (Intel Architecture, 64-bit): the architecture that incorporates EPIC technology
  – comparable terms: IA-32, PA-RISC
• Merced processor: the project name for Intel's first IA-64-based implementation
  – comparable terms: Pentium II, PA-8500

Features of IA-64 Architecture
• Explicit parallelism
  – ILP is explicit in machine code
  – The compiler analyzes and identifies parallelism at compile time
• Predication enhances parallelism
• Speculation minimizes the effect of memory latency
• IA-64 processors are massively resourced
  – Many registers
  – Many functional units
  – Inherently scalable
• Performance, headroom, binary compatibility

Predication: Features and Benefits
• Compiler given larger scheduling scope
  – Nearly all instructions can be predicated
  – State is updated if an instruction's predicate is true; otherwise the instruction acts as a NOP
  – The compiler assigns predicates; compare instructions set them
  – The architecture provides 64 1-bit predicate registers (PR)
• Predicated execution removes branches
  – Converts a control dependence to a data dependence
  – Reduces mispredict penalties
• Parallel execution through larger basic blocks
  – Effective use of parallel hardware

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• IA-64 is the instruction set architecture; EPIC is the type
  – EPIC = 2nd generation VLIW?
• Itanium™ is the first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
• 128 64-bit integer registers + 128 82-bit floating-point registers
  – Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

Binary Compatibility
[Figure: paths to IA-64 code. High-level language source (C, C++, Fortran, COBOL) passes through the native compiler and optimizer to native IA-64 code; IA-32 object code and PA-RISC object code pass through a dynamic translator to IA-64.]
• Application source compatible
• Design criteria
• C, C++ and FTN
• Systems architecture
• Transparent to user
• Default
• HP-UX and NT
Play: Next generation ISA

VLIW Processor Architectures for DSP
• Why VLIW architecture?
  – VLIW is especially suitable for DSP applications
  – DSP algorithms are dominated by data-parallel computation and consist of core tight loops executed repeatedly (convolution, FFT)
  – Single-chip high-performance VLIW processors with multiple FUs are commercially available

VLIW Architecture
• Instruction-Level Parallelism (ILP)
  – Multiple different FUs operate in parallel
  – Each instruction contains an operation code for each FU
• Data-Level Parallelism (DLP)
  – A single FU is divided to perform the same operation on multiple smaller-precision data
• Instruction Set Architecture
  – Each processor has its own instructions to further enhance performance
  – Complex_multiply for FFT and autocorrelation algorithms
• Memory I/O
  – Via DMA controller
  – Predictable access time
  – Hide the data transfer time behind the processing time by doing independent work
  – Real-time requirement

TI TMS320C62
• 256 bits per instruction (8 x 32-bit)
• 2 clusters
  – Each with 4 FUs
  – Each with 16 32-bit registers
  – One cross-cluster read port each way
• Two integer ALUs support partitioned instructions
• Programmable DMA controller with two 32-KB memories

TI TMS320C80
• ILP, DLP, multiple processors on a single chip
• 4 ADSPs (DSP+VLIW)
  – A 16-bit MUL, a 3-input 32-bit ALU, a branch unit, 2 load/store units
  – 3 zero-overhead loop controllers
  – One 2-KB I-cache, four 2-KB D-caches
• RISC processor
  – FPU: FPMAC
  – A 4-KB I-cache, a 4-KB D-cache
• DMA (Transfer Controller)
  – Supports various types of data transfers with complex address calculation
• No support for some powerful instructions
  – SAD, inner product

Philips Trimedia TM1000
• 27 FUs, coprocessor for MPEG-2 decoding
• No DMA controller, 16 KB D-cache, 32 KB I-cache
• One PCI port, MM I/O
• Issues 5 simultaneous instructions per cycle
• DSPALU: partitioned instructions
• DSPMUL: partitioned instructions, inner product

Transmeta's Crusoe Processor, TM5400
• General-purpose microprocessor based on VLIW
  – Difficult: binary code compatibility, very complicated compiler
• Supports x86 (MS Windows, Linux)
  – x86 Code Morphing software using dynamic binary code translation
• 2 integer units, 1 FPU, 1 load/store unit, 1 branch unit
  – 64 KB 16-way L1 D-cache
  – 64 KB 8-way I-cache
  – 256 KB L2 cache
  – 64 32-bit GPRs
  – VLIW instruction size: 64 or 128 bits, 4 instructions per cycle. Supports partitioned instructions.

Crusoe: A low-power x86 processor
• Crusoe processor = software + hardware
• Code Morphing software (3/4)
  – Dynamically translates x86 instructions into VLIW instructions
  – Provides x86 compatibility
  – Optimization and scheduling by software
• VLIW hardware (1/4)
  – 128-bit Very Long Instruction Word processor
  – Simple and fast
  – Fewer transistors
• Low power, x86 compatibility, PC performance

Crusoe VLIW

Code Morphing Software
A dynamic translation system that resides in ROM; it is the first program to start executing at boot.
• Drawing the H/W and S/W line
  – Software: decodes x86 instructions and generates parallel molecules
  – Hardware: executes them using a simple, high-speed VLIW engine
• Decoding and scheduling
  – Translation cache: CMS translates instructions once and saves the resulting translation for re-use, so the translation is skipped the next time
Play: Transmeta Crusoe
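
The translation-cache bullet can be illustrated with a small sketch. This is a toy model in Python, not Transmeta's implementation: the names `TranslationCache`, `translate_block`, and `run` are hypothetical, a block of x86 code is stood in for by a tuple of instruction strings keyed by its start address, and "translation" is only a string rewrite. It shows the translate-once, reuse-afterwards behaviour and nothing more.

```python
from typing import Callable, Dict, Tuple

Block = Tuple[str, ...]  # stand-in for a block of x86 instructions


def translate_block(block: Block) -> Block:
    """Hypothetical translator: tags each instruction as translated."""
    return tuple(f"vliw({insn})" for insn in block)


class TranslationCache:
    """Toy translation cache keyed by the x86 block's start address."""

    def __init__(self, translator: Callable[[Block], Block]) -> None:
        self._translator = translator
        self._cache: Dict[int, Block] = {}
        self.translations = 0  # how many times the translator actually ran

    def run(self, address: int, block: Block) -> Block:
        # Translate only on a miss; later executions reuse the saved result.
        if address not in self._cache:
            self._cache[address] = self._translator(block)
            self.translations += 1
        return self._cache[address]


cache = TranslationCache(translate_block)
loop_body = ("mov eax, [esi]", "add eax, ebx", "dec ecx", "jnz loop")
for _ in range(1000):  # a hot loop executes the same block many times
    native = cache.run(0x1000, loop_body)
print(cache.translations)  # the translator ran once for 1000 executions
```

The cost of translating is paid once per block, so frequently executed code such as loop bodies runs from the saved translation on every later pass.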

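Returning to the saturation arithmetic bullet on the MMX slide, a small worked sketch may help. It models four unsigned 8-bit lanes as a Python list, whereas MMX packs eight bytes into one 64-bit register and performs the saturating add in a single instruction (PADDUSB). The function names `add_wrap` and `add_saturate` are made up for illustration.

```python
from typing import List

LANE_MAX = 0xFF  # largest value an unsigned 8-bit lane can hold


def add_wrap(a: List[int], b: List[int]) -> List[int]:
    """Lane-wise add with wraparound (regular modulo-256 arithmetic)."""
    return [(x + y) & LANE_MAX for x, y in zip(a, b)]


def add_saturate(a: List[int], b: List[int]) -> List[int]:
    """Lane-wise add that clamps at LANE_MAX instead of wrapping."""
    return [min(x + y, LANE_MAX) for x, y in zip(a, b)]


pixels = [200, 100, 255, 0]
boost = [100, 100, 1, 0]
print(add_wrap(pixels, boost))      # [44, 200, 0, 0]
print(add_saturate(pixels, boost))  # [255, 200, 255, 0]
```

With wraparound, brightening a pixel of 200 by 100 yields 44, a dark pixel; with saturation it stays at 255, which is why saturation suits image and audio data.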