IA-64 and Itanium™ Processor Architecture Overview

Intel® Itanium™ Architecture: Refresh on some features relevant for developers
Heinz Bast [[email protected]]
March 2005
Copyright © 2001, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Itanium® Processor Architecture: Selected Features
• 64-bit addressing, flat memory model
• Instruction-level parallelism (6-way)
• Large register files
• Automatic Register Stack Engine
• Predication
• Software pipelining support: register rotation, loop control hardware
• Sophisticated branch architecture
• Control & data speculation
• Powerful 64-bit integer architecture
• Advanced 82-bit floating-point architecture
• Multimedia support (MMX™ technology)

Traditional Architectures: Limited Parallelism
• The original source code is compiled into sequential machine code, which the hardware's multiple functional units then execute.
• The available execution units are used inefficiently.
• Today's processors are often 60% idle.

Intel® Itanium® Architecture: Explicit Parallelism
• The compiler produces parallel machine code for the hardware's multiple functional units.
• The Itanium architecture compiler views a wider scope, making more efficient use of execution resources.
• This increases parallel execution.

EPIC Instruction Parallelism
• Source code is compiled into instruction groups (series of bundles) with no RAW or WAW dependencies; instructions in a group are issued in parallel, depending on resources.
• Instruction bundles hold 3 instructions plus a template: 3 x 41 bits + 5 bits = 128 bits.
• Up to 6 instructions executed per clock.

Instruction Level Parallelism
Instruction groups:
• No RAW or WAW dependencies.
• Delimited by 'stops' (;;) in assembly code.
• Instructions in a group are issued in parallel, depending on available resources.

    instr 1      // 1st group
    instr 2 ;;   // 1st group
    instr 3      // 2nd group
    instr 4      // 2nd group

Instruction bundles:
• 3 instructions and 1 template in a 128-bit bundle.
• Instruction dependencies are expressed by 'stops'.
• Instruction groups can span multiple bundles.

    { .mii
      ld4 r28=[r8]    // load
      add r9=2,r1     // Int op.
      add r30=1,r1    // Int op.
    }

Bundle layout (128 bits): instruction 2 (41 bits) | instruction 1 (41 bits) | instruction 0 (41 bits) | template (5 bits). Template example: Memory (M), Memory (M), Integer (I) = MMI. This gives a flexible issue capability.
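A small C sketch can make the 3 x 41 + 5 = 128-bit arithmetic concrete. It packs a 5-bit template and three 41-bit slots into a 128-bit bundle stored as two 64-bit words; the field order (template in the lowest bits, then slots 0 through 2) follows the usual IA-64 bundle layout, but the slot values and the pack_bundle helper are placeholders for illustration, not an encoder for real Itanium opcodes.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint64_t lo, hi;                 /* 128-bit bundle as two 64-bit halves */
    } bundle_t;

    /* Pack a 5-bit template and three 41-bit slots into one 128-bit bundle:
     * bits 0-4 template, 5-45 slot 0, 46-86 slot 1, 87-127 slot 2. */
    static bundle_t pack_bundle(uint8_t tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
    {
        const uint64_t m41 = (1ULL << 41) - 1;
        bundle_t b;
        b.lo  = (uint64_t)(tmpl & 0x1F);     /* template: bits 0-4            */
        b.lo |= (s0 & m41) << 5;             /* slot 0:   bits 5-45           */
        b.lo |= (s1 & m41) << 46;            /* slot 1, low 18 bits: 46-63    */
        b.hi  = (s1 & m41) >> 18;            /* slot 1, high 23 bits: 64-86   */
        b.hi |= (s2 & m41) << 23;            /* slot 2:   bits 87-127         */
        return b;
    }

    int main(void)
    {
        /* Placeholder slot values 1, 2, 3 and template 0 (an MII-style bundle). */
        bundle_t b = pack_bundle(0, 1, 2, 3);
        printf("bundle = 0x%016llx%016llx\n",
               (unsigned long long)b.hi, (unsigned long long)b.lo);
        return 0;
    }

The template tells the hardware which unit type (M, I, F, B) each slot targets; the MMI example above is one such template, and that is what gives the architecture its flexible issue capability.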
Large Register Set
• General registers GR0-GR127: 64-bit, each with a NaT bit; GR0-GR31 static, GR32-GR127 stacked.
• Floating-point registers FR0-FR127: 82-bit; FR0 reads as +0.0 and FR1 as +1.0; FR0-FR31 static, FR32-FR127 rotating.
• Predicate registers PR0-PR63: 1-bit; PR0 is always 1; PR0-PR15 static, PR16-PR63 rotating.
• Branch registers BR0-BR7: 64-bit.
• Application registers AR0-AR127: 64-bit.

Predication
• Predicate registers activate or deactivate instructions.
• Predicate registers are set by compare instructions, e.g. cmp.eq p1,p2 = r2,r3
• (Almost) all instructions can be predicated:

    (p1) ldfd   f32=[r32],8
    (p2) fmpy.d f36=f6,f36

• Predication eliminates branching in if/else logic blocks, creates larger code blocks for optimization, and simplifies start-up/shutdown of pipelined loops.

Predication Code Example: Absolute Difference of Two Numbers

C code:

    if (r2 >= r3)
        r4 = r2 - r3;
    else
        r4 = r3 - r2;

Non-predicated pseudo code:

        cmpGE r2, r3
        jump_zero P2
    P1: sub r4 = r2, r3
        jump end
    P2: sub r4 = r3, r2
    end: ...

Predicated assembly code:

    cmp.ge p1,p2 = r2,r3 ;;
    (p1) sub r4 = r2,r3
    (p2) sub r4 = r3,r2

Predication removes branches and enables parallel execution.

Register Stack Engine
The traditional use of a procedure stack in memory for procedure call management demands a large overhead. The Intel® Itanium™ processor family uses the general register stack for procedure call management, thus eliminating the frequent memory accesses.

Register Stack
• GRs 0-31 are global to all procedures.
• Stacked registers begin at GR32 and are local to each procedure.
• Each procedure's register stack frame varies from 0 to 96 registers.
• Only the GRs implement a register stack; the FRs, PRs, and BRs are global to all procedures.
• The register stack frames of caller (PROC A) and callee (PROC B) overlap within the 96 stacked registers GR32-GR127.
• Register Stack Engine (RSE): upon stack overflow/underflow, registers are saved/restored to/from a backing store transparently.
This optimizes the call/return mechanism.
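To see why the overlapping frames make calls cheap, here is a toy C model of the stacked-register renaming just described: each procedure names its frame GR32 upward, and a base offset maps those names onto one growing physical stack, so the caller's outputs and the callee's inputs are literally the same registers. The frame sizes and the alloc_frame/callee_base/phys_reg helpers are invented for illustration; the real RSE does this in hardware across 96 stacked registers with a memory backing store.

    #include <stdio.h>

    /* Each frame addresses its registers as GR32..GR(32+locals+outputs-1);
     * 'base' says where that frame starts in the physical stacked-register file. */
    typedef struct { int base, locals, outputs; } frame_t;

    /* alloc: give the current procedure 'locals' + 'outputs' stacked registers */
    static frame_t alloc_frame(int base, int locals, int outputs)
    {
        frame_t f = { base, locals, outputs };
        return f;
    }

    /* call: the callee's frame begins where the caller's outputs begin */
    static int callee_base(frame_t caller) { return caller.base + caller.locals; }

    /* map a virtual register name (GR32 upward) to a physical stacked register */
    static int phys_reg(frame_t f, int gr) { return f.base + (gr - 32); }

    int main(void)
    {
        frame_t a = alloc_frame(0, 14, 6);               /* PROC A: 14 locals, 6 outputs */
        frame_t b = alloc_frame(callee_base(a), 16, 8);  /* PROC B after call and alloc  */

        /* PROC A's first output and PROC B's first input map to the same
         * physical register, so argument passing needs no memory traffic. */
        printf("PROC A GR%d -> phys %d\n", 32 + a.locals, phys_reg(a, 32 + a.locals));
        printf("PROC B GR32 -> phys %d\n", phys_reg(b, 32));
        return 0;
    }

Both lines print the same physical index, which is the output/input overlap; on return the hardware simply switches back to the caller's base, the "restores the stack frame of the caller" step shown on the next slide.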
Register Stack Engine at Work
• A call changes the frame to contain only the caller's output registers.
• The alloc instruction sets the frame region to the desired size; it takes three architecture parameters: local, output, and rotating.
• A return restores the stack frame of the caller.
Diagram: PROC A's frame spans GR32-GR52, with locals (inputs) in GR32-GR46 and outputs in GR46-GR52. The call leaves PROC B with only PROC A's outputs, renamed to start at GR32; alloc then grows PROC B's frame (locals/inputs up to GR48, outputs up to GR56); the return restores PROC A's GR32-GR52 frame.
Improved execution speed in OO languages, e.g. Java, C++.

Register Rotation
Example: 8 general registers rotating in a counted loop (br.ctop). The floating-point and predicate registers:
• always rotate the same set of registers (FR32-127);
• rotate in the same direction as the general registers: the highest register's value rotates to the lowest register number (wraparound), while all other values rotate towards larger register numbers;
• rotate at the same time as the general registers (at the modulo-scheduled loop instruction).

    Register   Before    After br.ctop taken
    gr32       123       29      (wraparound from gr39)
    gr33       8189      123
    gr34       0         8189
    gr35       99        0
    gr36       abc       99
    gr37       9ad6      abc
    gr38       beef      9ad6
    gr39       29        beef
    gr40       4567      4567
    gr41       818       818

Software Pipelining
• A sequential loop executes load, compute, and store one after the other; a software-pipelined loop overlaps these stages across iterations.
• Traditional architectures use loop unrolling, which results in code expansion and increased cache misses.
• Itanium™ software pipelining uses rotating registers, which allows overlapping execution of multiple loop instances.
Itanium™ provides direct support for software pipelining.

Software Pipelined Loop
Consider the C code:

    for (i = 0; i < n; i++)
        y[i] = a * x[i];

Pseudo code:

    loop: ldfd    x[i]
          fmpy.d  y[i] = a, x[i]
          stfd    y[i]
          br.ctop loop

Assume the following instruction latencies (cycle counts for demonstration purposes only):
• ldfd (fp load): 4 cycles
• fmpy.d (fp multiply): 2 cycles
• stfd (fp store): 1 cycle
• br.ctop (counted-loop branch): 1 cycle
ldfd, fmpy.d, stfd, and br.ctop can be issued in the same instruction group (only without RAW or WAW dependencies).

Software Pipelined Loop (for n = 8)
Prolog:

    Cycle 1: ld x[1]
    Cycle 2: ld x[2]
    Cycle 3: ld x[3]
    Cycle 4: ld x[4]
    Cycle 5: ld x[5]   fmpy y[1]=a,x[1]
    Cycle
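The slide's cycle-by-cycle listing is truncated at this point in the source. Under its assumed latencies (ldfd 4 cycles, fmpy.d 2 cycles, stfd 1 cycle) and the further assumption that one new iteration is started per cycle, the full schedule for n = 8 can be regenerated with a short C sketch; it uses the demonstration cycle counts from the slide, not real hardware latencies.

    #include <stdio.h>

    int main(void)
    {
        const int n = 8;                 /* "For n = 8", as on the slide         */
        const int LD_LAT = 4, MUL_LAT = 2;
        int last = n + LD_LAT + MUL_LAT; /* cycle in which the last stfd issues  */

        for (int cycle = 1; cycle <= last; cycle++) {
            printf("Cycle %2d:", cycle);
            int ld  = cycle;                     /* iteration whose ld issues now   */
            int mul = cycle - LD_LAT;            /* iteration whose fmpy issues now */
            int st  = cycle - LD_LAT - MUL_LAT;  /* iteration whose stfd issues now */
            if (ld  <= n)             printf("  ld x[%d]", ld);
            if (mul >= 1 && mul <= n) printf("  fmpy y[%d]=a,x[%d]", mul, mul);
            if (st  >= 1 && st  <= n) printf("  stfd y[%d]", st);
            printf("\n");
        }
        return 0;
    }

Its output reproduces the prolog above (cycle 5 issues ld x[5] together with fmpy y[1]=a,x[1]); cycles 1-6 fill the pipeline, cycles 7-8 form the kernel in which load, multiply, and store all run for different iterations, and cycles 9-14 drain it. Register rotation is what lets each in-flight x[i]/y[i] keep its own register across these overlapped iterations.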