Intel® Itanium™ Processor Microarchitecture Overview

Harsh Sharangpani, Principal Engineer and IA-64 Microarchitecture Manager, Intel Corporation
Microprocessor Forum, October 5-6, 1999

Unveiling the Intel® Itanium™ Processor Design
• Leading-edge implementation of the IA-64 architecture for world-class performance
• New capabilities for systems that fuel the Internet Economy
• Strong progress on initial silicon

Itanium™ Processor Goals
• World-class performance on high-end applications
  – High performance for commercial servers
  – Supercomputer-level floating point for technical workstations
• Large memory management with 64-bit addressing
• Robust support for mission-critical environments
  – Enhanced error correction, detection & containment
• Full IA-32 instruction set compatibility in hardware
• Deliver across a broad range of industry requirements
  – Flexible for a variety of OEM designs and operating systems
Deliver world-class performance and features for servers, workstations, and emerging Internet applications.

EPIC Design Philosophy
• Maximize performance via EPIC hardware & software synergy
• Advanced features enhance instruction-level parallelism
  – Predication, speculation, ...
• Massive hardware resources for parallel execution
• High-performance EPIC building block
[Chart: performance versus time, placing EPIC above VLIW, OOO/superscalar, RISC, and CISC]
Achieving performance at the most fundamental level.

Itanium™ EPIC Design Maximizes SW-HW Synergy
• Architecture features programmed by the compiler: register stack & rotation, predication, branch hints, explicit parallelism, data & control speculation, memory hints
• Microarchitecture features in hardware:
  – Fetch: fast, simple instruction cache & branch predictors
  – Issue: simple, fast 6-issue
  – Register handling: 128 general registers & 128 FP registers, register remap & stack engine
  – Control, bypasses & dependencies: speculation deferral management
  – Parallel resources: 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 load/store units, 32-entry ALAT
  – Memory subsystem: three levels of cache (L1, L2, L3)

Breakthrough Levels of Parallelism
• M F I + M F I: 6 instructions provide 12 parallel ops/clock (SP: 20 parallel ops/clock) for digital content creation & scientific computing
  – Load 4 DP (8 SP) values via 2 ld-pair instructions, with 2 post-increment ALU ops
  – 4 DP FLOPS (8 SP FLOPS)
  – 2 ALU ops
• M I B + M I B: 6 instructions provide 8 parallel ops/clock for enterprise & Internet applications
  – 2 loads with 2 post-increment ALU ops
  – 2 ALU ops
  – 1 branch hint + 1 branch instruction
Itanium™ delivers greater instruction-level parallelism than any contemporary processor.

Highlights of the Itanium™ Pipeline
• 6-wide EPIC hardware under precise compiler control
  – Parallel hardware and control for predication & speculation
  – Efficient mechanism for enabling register stacking & rotation
  – Software-enhanced branch prediction
• 10-stage in-order pipeline with cycle time designed for:
  – Single-cycle ALU (4 ALUs globally bypassed)
  – Low latency from the data cache
• Dynamic support for run-time optimization
  – Decoupled front end with prefetch to hide fetch latency
  – Aggressive branch prediction to reduce branch penalty
  – Non-blocking caches and a register scoreboard to hide load latency
Parallel, deep, and dynamic pipeline designed for maximum throughput.
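The per-clock figures on the "Breakthrough Levels of Parallelism" slide above can be reproduced with a simple tally of what each bundle slot contributes. The sketch below is one reading of that slide rather than an Intel-published formula: an ld-pair counts as two loads plus a post-increment add, an FMAC as two FLOPS, and a branch hint as one op; the struct and variable names are illustrative only.

/*
 * Rough tally of the per-clock operation counts on the
 * "Breakthrough Levels of Parallelism" slide, assuming two
 * 3-instruction bundles issue every cycle.
 */
#include <stdio.h>

struct slot { const char *what; int ops; };

int main(void) {
    /* MFI + MFI: digital content creation & scientific computing (DP). */
    struct slot mfi_dp[] = {
        { "ld-pair (2 DP values)", 2 },   /* M slot */
        { "post-increment add",    1 },   /* folded into the M slot */
        { "FMAC (multiply + add)", 2 },   /* F slot */
        { "ALU op",                1 },   /* I slot */
    };
    int dp = 0;
    for (int i = 0; i < 4; i++) {
        printf("  2 x %-24s -> %d ops\n", mfi_dp[i].what, 2 * mfi_dp[i].ops);
        dp += 2 * mfi_dp[i].ops;          /* two bundles: each slot counts twice */
    }
    printf("MFI+MFI (DP): %d parallel ops/clock\n", dp);   /* 12 */

    /* MIB + MIB: enterprise & Internet applications. */
    int mib = 2 * (1 /* load */ + 1 /* post-increment add */ + 1 /* ALU op */)
            + 1 /* branch hint */ + 1 /* branch */;
    printf("MIB+MIB: %d parallel ops/clock\n", mib);        /* 8 */
    return 0;
}

Doubling the load and FMAC contributions for packed single precision gives the 20-op/clock figure quoted for SP code.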
10-Stage In-Order Core Pipeline
• Stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)
• Front end (IPG, FET, ROT): prefetch/fetch of up to 6 instructions/cycle, hierarchy of branch predictors, decoupling buffer
• Instruction delivery (EXP, REN): dispersal of up to 6 instructions on 9 ports, register remapping, register stack engine
• Operand delivery (WLD, REG): register read + bypasses, register scoreboard, predicated dependencies
• Execution (EXE, DET, WRB): 4 single-cycle ALUs, 2 load/store units, advanced load control, predicate delivery & branch, NaT/exception/retirement

Front End: Prefetch & Fetch (IPG, FET, ROT)
• Software-triggered prefetch loads target code early using branch hints
• Streaming prefetch of large blocks via a hint on the branch
• Early prefetch of small blocks via the BRP instruction
• I-fetch of 32 bytes/clock feeds an 8-bundle decoupling buffer
• Buffer allows the front end to fetch even when the back end is stalled
• Hides instruction cache misses and branch bubbles
[Diagram: IP mux and I-cache & ITLB feeding the 8-bundle buffer; branch predictor structures & resteers feed back into the IP mux]
Aggressive instruction fetch hardware to feed a highly parallel, high-performance machine.

Front End: Branch Prediction (IPG, FET, ROT)
• Branch hints combine with a predictor hierarchy to improve branch prediction: four progressive resteers
• 4 Target Address Registers (TARs) programmed by "importance" hints
• 512-entry 2-level adaptive predictor provides dynamic direction prediction
• 64-entry Branch Target Address Cache (BTAC) contains the footprint of upcoming branch targets (programmed by branch hints, and allocated dynamically)
[Diagram: loop exit corrector, return stack buffer, target address registers, adaptive 2-level predictor, branch target address cache, and two branch address calculators resteer the IP mux ahead of the I-cache & ITLB and the 8-bundle buffer]
Intelligent branch prediction improves performance across all workloads.

Instruction Delivery: Dispersal (EXP)
• Stop bits eliminate dependency checking
• Templates simplify routing
• First-available dispersal from 6 syllables to 9 issue ports
  – Keep issuing until a stop bit, resource oversubscription, or asymmetry
[Diagram: dispersal network routing syllables S0-S5 to issue ports M0, M1, I0, I1, F0, F1, B0, B1, B2]
Achieves highly parallel execution with simple hardware.

Instruction Delivery: Stacking (REN)
• Massive 128-entry register file accommodates multiple variable-sized procedures via stacking
• Eliminates most register spill/fill at procedure interfaces
• Achieved transparently to the compiler, using register remapping via parallel adders
• Stack engine performs the few required spills/fills
[Diagram: stack engine with spill/fill injection and stall control; integer, FP & predicate renamers feed ports M0, M1, I0, I1, F0, F1]
Unique register model enables faster execution of object-oriented code.
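The remapping described on the stacking slide can be made concrete with a toy model: a stacked virtual register (r32 and up) is translated to a physical slot by adding the current frame's base offset, so a call just moves the base instead of copying registers. The sizes match the Itanium register file (32 static plus 96 stacked general registers), but the names (remap, call, bof) are illustrative assumptions, and a real register stack engine would also spill and fill frames to memory when the physical slots wrap.

/*
 * Toy model of stacked-register remapping: virtual register number ->
 * physical slot via a base-offset add, as hinted by the "parallel
 * adders" on the stacking slide.  Illustrative only.
 */
#include <stdio.h>

#define N_STATIC  32   /* r0..r31: not renamed */
#define N_STACKED 96   /* physical registers backing r32..r127 */

static int bof = 0;    /* bottom-of-frame offset of the current frame */

/* One "parallel adder" per operand: virtual register -> physical slot. */
static int remap(int vreg) {
    if (vreg < N_STATIC)
        return vreg;                                  /* static registers pass through */
    return N_STATIC + (vreg - N_STATIC + bof) % N_STACKED;
}

/* On a call the callee's frame starts where the caller's outputs begin;
 * a real stack engine would spill old frames if the 96 slots run out. */
static void call(int caller_local_regs) {
    bof = (bof + caller_local_regs) % N_STACKED;
}

int main(void) {
    printf("caller's r34 -> physical slot %d\n", remap(34));   /* 34 */
    call(8);                       /* caller had 8 local stacked registers */
    printf("callee's r32 -> physical slot %d\n", remap(32));   /* 40 */
    return 0;
}

Because the translation is just an add per operand, it fits in the rename stage, and most procedure calls complete without moving any register contents at all.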
Operand Delivery (WLD, REG)
• Multiported register file + mux hierarchy delivers operands in REG
• Unique "delayed stall" mechanism used for register dependencies
• Avoids pipeline flush or replay on unavailable data
• Stall computed in REG, but the core pipeline stalls in EXE
• Special Operand Latch Manipulation (OLM) captures data returns into the operand latches, to mimic a register file read
[Diagram: 128-entry integer register file (8 read / 6 write ports) with bypass muxes feeding the ALUs; scoreboard, dependency comparators, OLM comparators, and delayed-stall control over source and destination operands and predicates]
Avoids pipeline flush to enable a more effective, higher-throughput pipeline.

Predicate Delivery (EXE)
• All instructions read operands and execute; they are canceled at retirement if their predicates are off
• Predicates generated in EXE (by compares), delivered in DET, and fed into retirement, branch execution, and dependency detection
• Smart control network cancels false stalls on predicated dependencies
• Dependency detection for cancelled producer/consumer pairs (REG)
[Diagram: integer and FP compares write the predicate register file; predicates go to dependency detection (x6), branch execution (x3), and retirement (x6)]
Higher performance through removal of branch penalties in server and workstation applications.

Parallel Branch Execution (DET, WRB)
• Speculation + predication result in clusters of branches
• Execution of 3 branches/clock optimizes for clustered branches
• Branch execution in DET allows compares and the branches that consume them in the same issue group
[Diagram: three branch units read predicates and the branch bundle; IP-relative address and direction validation, resteer using the most recent branch prediction info, and retirement]
Parallel branch hardware extends the performance benefits of EPIC technology.

Speculation Hardware (DET, WRB)
• Control speculation support requires minimal hardware
  – A computed memory exception is delivered with the data as a token (NaT)
  – NaTs propagate through subsequent executions like source data
• Data speculation enabled efficiently via the ALAT structure
  – 32 outstanding advanced loads
  – Indexed by register ids; keeps a partial physical-address tag
  – 0-clock checks: the dependent use can be issued in parallel with the check
[Diagram: the TLB's physical address and the advanced-load and speculative-load (NaT) status feed a 32-entry ALAT, exception logic, and the check instruction alongside the memory subsystem]
Efficient elimination of memory …
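A small software model of the ALAT behaviour just described shows why the check can be free: the advanced load records its destination register id and a partial physical-address tag, every store snoops and invalidates matching entries, and the later check only asks whether the entry survived. The 32-entry size matches the slide; the entry layout, tag width, and function names below are illustrative guesses, not Intel's implementation.

/*
 * Minimal sketch of an ALAT: ld.a allocates an entry, stores snoop and
 * invalidate it, chk.a / ld.c just tests whether it is still there.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ALAT_ENTRIES 32
#define TAG(pa)      ((uint32_t)((pa) >> 3) & 0xFFFFF)   /* partial tag */

struct alat_entry { bool valid; int reg_id; uint32_t tag; };
static struct alat_entry alat[ALAT_ENTRIES];

/* ld.a: remember that reg_id holds data speculatively loaded from pa. */
static void advanced_load(int reg_id, uint64_t pa) {
    struct alat_entry *e = &alat[reg_id % ALAT_ENTRIES];
    e->valid = true; e->reg_id = reg_id; e->tag = TAG(pa);
}

/* Any store snoops the ALAT and kills entries whose tag matches. */
static void store(uint64_t pa) {
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (alat[i].valid && alat[i].tag == TAG(pa))
            alat[i].valid = false;
}

/* chk.a / ld.c: a miss means the speculation failed and recovery
 * code (or a reload) is needed. */
static bool check(int reg_id) {
    struct alat_entry *e = &alat[reg_id % ALAT_ENTRIES];
    return e->valid && e->reg_id == reg_id;
}

int main(void) {
    advanced_load(35, 0x1000);      /* ld.a  r35 = [0x1000]  */
    store(0x2000);                  /* unrelated store       */
    printf("check r35: %s\n", check(35) ? "ok" : "recover");   /* ok */
    store(0x1000);                  /* conflicting store     */
    printf("check r35: %s\n", check(35) ? "ok" : "recover");   /* recover */
    return 0;
}

Because the check is only an entry lookup keyed by register id, the dependent use can issue in parallel with it, and recovery code runs only when a conflicting store actually hit the same address.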