Advance CPU Architecture
Computer Architectures: Advance CPU Design
Tien-Fu Chen, National Chung Cheng Univ.
© by Tien-Fu Chen @ CCU

MMX technology
- Basic concepts
  - small native data types
  - compute-intensive operations
  - a lot of inherent parallelism => single-instruction, multiple-data (SIMD)
- Features
  - packed data types
  - a rich set of MMX instructions to perform parallel operations
  - saturation arithmetic: unlike regular arithmetic, results do not truncate or wrap around; they clamp to the largest or smallest representable number
  - parallel compare
  - overlapped operations
  - pack/unpack data types
  - compatible extension of the architecture

Packed Data Types (small data types packed into one register)
- Dual usage of the floating-point registers
- Enhanced instruction set operating in parallel fashion
  - In total, 57 MMX instructions are added to IA.

Fast DSP computation

Performance of Matrix Multiplication
Performance comparison between IA and MMX: working example on matrix and vector multiplication

                                     Traditional IA        MMX
Vector-vector multiplication
  No. of loads                             32                8
  No. of multiplies                        16                4
  No. of adds                              15                3
  *Loop control                            12                0
  Other overhead                            0                3
  Final result save                         1                1
  Instr. count                             76               19
  **Cycle count                           200               12
Matrix-vector multiplication
  Total instrs.                  4(4x76+3) = 1228   4(4x19+3) = 316
  Cycles (both in optimized mode)        1200              207

Comparison result: speedup of 5.8 times.

* Assume we perform 4 MACs (out of 16) per loop iteration of our code:
      for ( K = 1; K < 5; K++) { Mac (K); }
  So each loop costs 3 instructions per iteration: increment, compare, and branch.
** 1) The cycle count is dominated by the nonpipelined, 11-cycle integer multiply operation.
   2) 4 mispredictions in total when exiting the loops.
   3) All data are in on-chip caches.

More Parallelisms
- Streaming SIMD Extension (SSE), since Pentium III.
  - Physically adds eight new 128-bit XMM registers and 70 instructions. New machine state is introduced.
  - Supports four 32-bit single-precision floating-point operations in parallel. Recall that the MMX SIMD instructions are all integer-only.
- Streaming SIMD Extension 2 (SSE2), since Pentium 4.
  - Uses the XMM registers. No new machine state.
  - 144 new instructions added.
  - Supports double-precision floating-point parallel operations.
- IA-64 Itanium(TM) Architecture.
  - Enable, enhance, express, and exploit parallelism at the proc./thread level for programmers and at the instruction level for compilers, all explicitly.

Objectives of IA-64 Instruction Set Architecture (ISA)
- Intel and HP technology alliance
- Enable industry-leading system performance
  - Breakthrough performance
  - Headroom
- Enable compatibility with today's IA-32 software and PA-RISC software
- Allow scalability over a wide range of implementations
- Full 64-bit computing

Next Generation Terminology
- EPIC (Explicitly Parallel Instruction Computing): the next-generation processor technology
  - same category as, e.g., RISC, CISC
- IA-64 (Intel Architecture, 64-bit): the architecture that incorporates EPIC technology
  - same category as, e.g., IA-32, PA-RISC
- Merced processor: the project name for Intel's first IA-64-based implementation
  - same category as, e.g., Pentium II, PA-8500

Features of IA-64 Architecture
- Explicit parallelism
  - ILP is explicit in machine code
  - the compiler analyzes and identifies parallelism at compile time
- Predication enhances parallelism
- Speculation minimizes the effect of memory latency
- IA-64 processors are massively resourced
  - Many registers
  - Many functional units
  - Inherently scalable
- Performance, headroom, binary compatibility

Predication: Features and Benefits
- Compiler given larger scheduling scope
  - Nearly all instructions can be predicated
  - State is updated if an instruction's predicate is true; otherwise the instruction acts as a NOP
  - Compiler assigns predicates; compare instructions set them
  - Architecture provides 64 1-bit predicate registers (PR)
- Predicated execution removes branches
  - Converts a control dependence to a data dependence
  - Reduces mispredict penalties
- Parallel execution through larger basic blocks
  - Effective use of parallel hardware

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
- IA-64: instruction set architecture; EPIC is the type
  - EPIC = 2nd-generation VLIW?
- Itanium(TM): the first implementation (2001)
  - Highly parallel and deeply pipelined hardware at 800 MHz
  - 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
- 128 64-bit integer registers + 128 82-bit floating-point registers
  - Not separate register files per functional unit as in old VLIW
- Hardware checks dependencies (interlocks => binary compatibility over time)
- Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

Binary Compatibility
[Figure: paths to IA-64. Labels: high-level language (C, C++, Fortran, COBOL); IA-32 object code; PA-RISC object code; native; native IA-64 code; dynamic translator; HP-UX and NT; IA-64.]
- Application source compatible
- Design criteria
- C, C++ and FTN
- Systems architecture
- Transparent to user
- Default compiler and optimizer
Play: Next generation ISA

VLIW Processor Architectures for DSP
- Why VLIW architecture?
  - VLIW is especially suitable for DSP applications
  - DSP algorithms are dominated by data-parallel computation and consist of core tight loops executed repeatedly.
    - Convolution, FFT
  - Single-chip high-performance VLIW processors with multiple FUs are commercially available.

VLIW Architecture
- Instruction-Level Parallelism (ILP)
  - Multiple different FUs in parallel.
  - Each instruction contains an operation code for each FU.
- Data-Level Parallelism (DLP)
  - A single FU is divided to perform the same operation on multiple smaller-precision data.
- Instruction Set Architecture
  - Each processor has its own instructions to further enhance performance.
  - Complex_multiply for FFT and autocorrelation algorithms
- Memory I/O
  - Via DMA controller
  - Predictable access time
  - Hide the data transfer time behind the processing time by independent work
  - Real-time requirement

TI TMS320C62
- 256 bits per instr. (8 x 32-bit)
- 2 clusters
  - Each with 4 FUs
  - Each with 16 32-bit registers
  - One cross-cluster read port each way
- Two integer ALUs support partitioned instr.
- Programmable DMA controller with two 32-KB memories

TI TMS320C80
- ILP, DLP, multiple processors on a single chip
- 4 ADSP (DSP+VLIW)
  - A 16-bit MUL, a 3-input 32-bit ALU, a branch unit, 2 load/store units.
  - 3 zero-overhead loop controllers
  - One 2-KB I-cache, four 2-KB D-caches
- RISC processor
  - FPU: FPMAC
  - A 4-KB I-cache, a 4-KB D-cache
- DMA (Transfer Controller)
  - Supports various types of data transfers with complex address calculation.
- No support for some powerful instrs.
  - SAD, inner-product

Philips Trimedia TM1000
- 27 FUs, coprocessor for MPEG-2 decoding
- No DMA controller, 16 KB D-cache, 32 KB I-cache
- One PCI port, MM I/O
- Issues 5 simultaneous instr. per cycle
- DSPALU: partitioned instr.
- DSPMUL: partitioned instr., inner-product

Transmeta's Crusoe Processor, TM5400
- General-purpose microprocessor based on VLIW.
  - Difficult: binary code compatibility, very complicated compiler
- Supports x86 (MS Windows, Linux):
  - x86 Code Morphing software using dynamic binary code translation.
- 2 integer units, 1 FPU, 1 load/store, 1 branch
  - 64 KB 16-way L1 D-cache
  - 64 KB 8-way I-cache
  - 256 KB L2 cache
  - 64 32-bit GPRs
  - VLIW instr. size: 64 or 128 bits, 4 instr. per cycle. Supports partitioned instr.

Crusoe: A low-power x86 processor
- Crusoe processor = software + hardware
- Code Morphing software (labeled 3/4 in the figure)
  - Dynamically translates x86 instructions into VLIW instructions
  - Provides x86 compatibility
  - Optimization and scheduling by software
- VLIW hardware (labeled 1/4 in the figure)
  - 128-bit Very Long Instruction Word processor
  - Simple and fast
  - Fewer transistors
- Low power, x86 compatibility, PC performance

Crusoe VLIW

Code Morphing Software
A dynamic translation system that resides in a ROM; it is the first program to start executing at boot.
- Drawing the H/W and S/W line
  - Software: decodes x86 instructions and generates parallel molecules
  - Hardware: executes them using a simple, high-speed VLIW engine
- Decoding and scheduling
  - Translation cache: CMS translates instructions once, saving the resulting translation for re-use
    - Skips the translation the next time
Play: Transmeta Crusoe
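
The translation-cache idea from the Code Morphing Software slide (translate once, save the result, skip the translation the next time) can be illustrated with a small C sketch. Everything in it is an assumption made for illustration and is not taken from Transmeta's implementation: the names `tcache`, `tcache_lookup`, `translate_block`, and `TCACHE_SLOTS` are hypothetical, the cache is assumed to be direct-mapped and keyed by the x86 block address, and the "translation" is just a number standing in for generated VLIW code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCACHE_SLOTS 256u /* hypothetical size; must be a power of two */

typedef struct {
    uint32_t x86_addr;  /* start address of the translated x86 block */
    uint32_t native_id; /* stand-in for a pointer to the VLIW translation */
    int      valid;     /* nonzero once the slot holds a translation */
} tcache_entry;

typedef struct {
    tcache_entry slot[TCACHE_SLOTS];
    unsigned translations; /* how many times the slow path ran */
    unsigned hits;         /* how many lookups reused a saved translation */
} tcache;

/* Start with an empty cache and zeroed counters. */
static void tcache_init(tcache *tc)
{
    assert(tc != NULL);
    memset(tc, 0, sizeof *tc);
}

/* Stand-in for the expensive decode-and-schedule step (assumption:
 * real translation would emit VLIW molecules, not a number). */
static uint32_t translate_block(uint32_t x86_addr)
{
    return x86_addr ^ 0xC0DEu;
}

/* Return the translation for x86_addr, translating only on a miss. */
static uint32_t tcache_lookup(tcache *tc, uint32_t x86_addr)
{
    assert(tc != NULL);
    tcache_entry *e = &tc->slot[x86_addr & (TCACHE_SLOTS - 1u)];

    if (e->valid && e->x86_addr == x86_addr) {
        tc->hits++; /* saved translation is reused */
        return e->native_id;
    }

    /* Miss: translate, then overwrite whatever occupied the slot. */
    e->x86_addr  = x86_addr;
    e->native_id = translate_block(x86_addr);
    e->valid     = 1;
    tc->translations++;
    return e->native_id;
}
```

Looking up the same address twice after `tcache_init` runs `translate_block` only once; the second call is served from the saved entry. A direct-mapped table was chosen here only because it is the shortest thing to write: two block addresses that map to the same slot evict each other, which a real system would presumably handle with a larger or associative structure.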