Computer Power Management Rules
Jim Kardach, Retired Chief Power Architect, Intel


Video: http://www.youtube.com/watch?v=cZ6akewB0ps

HW1
- Has been posted on the online schedule.
- Due on March 3rd, 1 pm. Submit in class.
- Hard deadline: no homework accepted after the deadline.
- No collaboration is allowed.
Chenyang Lu, CSE 467S

The Power Problem
- Processors improve performance at the cost of power; performance/watt remains low.
- Solution: hardware offers mechanisms for saving power, and software executes power management policies.

Power vs. Energy
- Power: energy consumed per unit time (1 watt = 1 joule/second).
- Power → heat. Energy → battery life.

Why Worry About Energy? Intel vs. Duracell
[Figure: improvement relative to year 0 over six years. Processor (MIPS), hard disk (capacity), and memory (capacity) improve many times over, while battery (energy stored) barely moves.]
- There is no Moore's Law for batteries: roughly 2-3% growth per year.

Trend in Power Density
[Figure: power density (W/cm²) versus process node from 1.5 µm to 0.07 µm. The i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 climb from about 1 W/cm² past the "hot plate" level toward nuclear reactor, rocket nozzle, and Sun's-surface levels. Source: Fred Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Intel Corp., Micro 1999.]

Trend in Cooling Solutions
[Figure-only slide.]

Power
- Hardware support
- Power management policy
- Power manager
- Holistic approach

CMOS Power Consumption
- Voltage drops: power consumption ∝ V².
- Toggling: more activity → higher power.
- Leakage when inactive.

Power-Saving Features
- Voltage drops: reduce the power supply voltage.
- Toggling: run at a lower clock frequency; reduce activity; disable function units when not in use.
- Leakage: disconnect parts from the power supply when not in use.

Dynamic Voltage Scaling
- Why voltage scaling? Power ∝ V², so reducing the supply voltage saves energy; but a lower voltage forces a lower clock frequency, a tradeoff between performance and energy.
- Why dynamic? Peak computing demand is much higher than average.
- Changing the voltage takes time, to stabilize the power supply and the clock.

Examples
- StrongARM SA-1100 takes two supplies: VDD is the main 3.3 V supply; VDDX is 1.5 V.
- AMD K6-2+: 8 frequencies (200-600 MHz); voltages of 1.4 V and 2.0 V; transition time of 0.4 ms for a voltage change.
- PowerPC 603: can shut down unused execution units; cache organized into subarrays to reduce active circuitry.

Intel SpeedStep
[Figure: P-states of the Intel Core 2 Duo E6600 and the Intel Pentium M.]

Linux DVFS Governors
- Performance: always run at the maximum frequency.
- Powersave: always run at the lowest frequency.
- Ondemand: automatically adjust the frequency according to CPU usage.
- Conservative: like ondemand, but adjusts more conservatively.
- Userspace: run at a fixed frequency set by the user.

Ondemand
- Initial implementation in Linux 2.6.9.
- For all CPUs: if more than 80% busy, go to P0 (maximum frequency); if less than 20% busy, step the frequency down by 20%.
- Multiple improvements since 2.6.9.

Get & Set CPU Frequency
- Get the current frequency:
    /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_cur_freq
    Example: 2400000 (2.4 GHz)
- Frequencies and governors available:
    /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_available_frequencies
    Example: 2400000 2133000 1867000 1600000
    /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_available_governors
    Example: ondemand userspace performance powersave conservative
- Set the frequency (requires root privilege; a scripted equivalent is sketched below):
    echo userspace > /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_governor
    echo 2133000 > /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_setspeed
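The same sysfs interface can be driven from a small script. Below is a minimal Python sketch (not from the slides): the CPU index, governor name, and target frequency are illustrative values, writes require root privilege, and some attributes (for example scaling_available_frequencies) may be absent depending on the cpufreq driver in use.

    # Minimal sketch: drive the cpufreq sysfs interface from Python.
    # The CPU index, governor, and target frequency are illustrative values;
    # writes require root privilege, and some attributes (e.g.
    # scaling_available_frequencies) may not exist with every cpufreq driver.
    import os

    CPU = 0
    BASE = f"/sys/devices/system/cpu/cpu{CPU}/cpufreq"

    def read_attr(name):
        """Return the contents of a cpufreq sysfs attribute as a stripped string."""
        with open(os.path.join(BASE, name)) as f:
            return f.read().strip()

    def write_attr(name, value):
        """Write a cpufreq sysfs attribute (needs root)."""
        with open(os.path.join(BASE, name), "w") as f:
            f.write(str(value))

    if __name__ == "__main__":
        print("current frequency (kHz):", read_attr("scaling_cur_freq"))
        print("available frequencies  :", read_attr("scaling_available_frequencies"))
        print("available governors    :", read_attr("scaling_available_governors"))

        # Programmatic equivalent of the two echo commands above (run as root):
        # write_attr("scaling_governor", "userspace")
        # write_attr("scaling_setspeed", 2133000)

As in the echo example, the governor must be switched to userspace first; only then does writing scaling_setspeed take effect.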
Clock Gating
- Applicable to clocked digital components: processors, controllers, memories.
- Stopping the clock stops signal propagation in the circuits.
- Pro: short transition time, because clock generation is not stopped, only clock distribution.
- Con: relatively high power consumption, because the clock itself still consumes energy and leakage is not prevented.

Supply Shutdown
- Disconnect parts from the power supply when not in use.
- Pros: general; saves the most power.
- Con: long transition time.

Example: SA-1100
Three power modes:
- Run: normal operation.
- Idle: stops the CPU clock, with I/O logic still powered.
- Sleep: shuts off most chip activity.

SA-1100 SLEEP
- RUN → SLEEP: (30 µs) flush CPU state (registers) to memory; (30 µs) reset processor state and set the wakeup event; (30 µs) shut down the clock.
- SLEEP → RUN: (10 ms) ramp up the power supply; (150 ms) stabilize the clock; (negligible) CPU boot.

Intel Core Duo Processor SV
    State  Description                      Vcc (V)  Power (W)
    C0     High Frequency Mode (P0)         1.3      31
    C0     Low Frequency Mode (Pn)          1.0
    C1     Auto Halt / Stop Grant (HFM)              15.8
    C1E    Enhanced Halt (LFM)                       4.8
    C2     Stop Clock (HFM)                          15.5
    C2E    Enhanced Stop Clock (LFM)                 4.7
    C3     Deep Sleep (HFM)                          10.5
    C3E    Enhanced Deep Sleep (LFM)                 3.4
    C4     Intel Deeper Sleep               0.85     2.2
    DC4    Intel Enhanced Deeper Sleep      0.80     1.8
Sources: Intel® Core™ Duo Processor 65 nm Process Datasheet; Ottawa Linux Symposium, July 19, 2006.

The Mote Revolution
[Figure from: Joseph Polastre, Robert Szewczyk, Cory Sharp, and David Culler, "The Mote Revolution: Low Power Wireless Sensor Network Devices," Hot Chips 16.]

Power Consumption: Computer with Wireless NIC
[Figure-only slide.]

Power
- Hardware support
- Power management policy
- Power manager
- Holistic approach

Approaches
- Static power management: does not depend on activity. Example: user-activated power-down.
- Dynamic power management: adapts to activity at run time. Example: automatically disabling function units.

Dynamic Power Management
- Inherent tradeoff: energy vs. performance.
- Fundamental premises: the workload is non-uniform during operation, and it is possible to predict the workload with some degree of accuracy.

PowerPC 603 Activity
Percentage of time idle for SPEC integer/floating-point workloads:
    Unit             SPECint92  SPECfp92
    D cache          29%        28%
    I cache          29%        17%
    load/store       35%        17%
    fixed-point      38%        76%
    floating-point   99%        30%
    system register  89%        97%

Problem Formulations
- Minimize energy under performance constraints (real-time applications).
- Optimize performance under energy/power constraints: battery lifetime (energy), temperature (power).

Power Down/Up Cost
- Going into and out of an inactive mode costs time and energy.
- Must determine whether entering an inactive mode is worthwhile.
- Model power states with a Power State Machine (PSM).

SA-1100 Power State Machine
[Figure: three states, RUN (P_ON = 400 mW), IDLE (P = 50 mW), and SLEEP (P_OFF = 0.16 mW). RUN↔IDLE transitions take 10 µs each way; entering SLEEP takes 90 µs; SLEEP→RUN takes 160 ms. Transition power P_TR = P_ON.]

Greedy Policy
- Go to sleep immediately when the system becomes idle.
- Works when the transition time is negligible, e.g., between IDLE and RUN in the SA-1100.
- Does not work when the transition time is long, e.g., between SLEEP and RUN/IDLE in the SA-1100. Better solutions are needed; a rough cost comparison is sketched below.
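To make the cost concrete, here is a minimal Python sketch (an illustration, not from the slides) that plugs the SA-1100 power-state-machine numbers into a crude energy model and compares the greedy sleep decision against simply staying in IDLE for a few idle-period lengths. The chosen idle periods are arbitrary, and charging full transition power when an idle period ends mid-transition is a simplifying assumption.

    # Minimal sketch: energy of the greedy "sleep as soon as idle" policy on the
    # SA-1100 power state machine, using the numbers from the slides
    # (P_ON = 400 mW, P_IDLE = 50 mW, P_OFF = 0.16 mW, enter SLEEP in 90 us,
    # leave SLEEP in 160 ms, transition power P_TR = P_ON).
    P_ON, P_IDLE, P_OFF = 0.400, 0.050, 0.00016   # watts
    T_ENTER, T_EXIT = 90e-6, 160e-3               # seconds
    P_TR = P_ON

    def energy_greedy_sleep(t_idle):
        """Energy (J) if the system sleeps for an idle period of t_idle seconds."""
        t_tr = T_ENTER + T_EXIT
        if t_idle <= t_tr:
            # Crude assumption: an idle period shorter than the transition is
            # spent entirely at transition power.
            return t_idle * P_TR
        return t_tr * P_TR + (t_idle - t_tr) * P_OFF

    def energy_stay_idle(t_idle):
        """Energy (J) if the system simply stays in the IDLE state instead."""
        return t_idle * P_IDLE

    if __name__ == "__main__":
        for t_idle in (0.05, 0.2, 1.0, 5.0):      # idle-period lengths in seconds
            e_sleep = energy_greedy_sleep(t_idle)
            e_idle = energy_stay_idle(t_idle)
            better = "sleep" if e_sleep < e_idle else "stay idle"
            print(f"idle {t_idle:4.2f} s: sleep {e_sleep * 1e3:6.2f} mJ, "
                  f"idle {e_idle * 1e3:6.2f} mJ -> {better}")

With these numbers, sleeping only beats staying in IDLE for idle periods longer than roughly 1.3 s, which motivates the break-even analysis that follows.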
Break-Even Time T_BE
- The minimum idle time required to compensate for the cost of entering an inactive state.
- Entering an inactive state is beneficial only if the idle time exceeds T_BE.

Break-Even Time, P_TR ≤ P_ON
- P_TR: power consumption during the transition. P_ON: power consumption when active.
- When P_TR ≤ P_ON, the T_BE of an inactive state is simply the total time it takes to enter and leave the state: T_BE = T_TR = T_ON,OFF + T_OFF,ON.
- For SLEEP in the SA-1100: T_BE = 160 ms + 90 µs.

[The SA-1100 power state machine figure is repeated here for reference.]

Break-Even Time, P_TR > P_ON
- T_BE must include additional idle time to compensate for the extra power consumed during the transition:
    T_BE = T_TR + T_TR (P_TR - P_ON) / (P_ON - P_OFF)
- Reducing T_BE saves more energy: a shorter T_TR, a larger difference P_ON - P_OFF, or a lower P_TR.

Inherent Exploitability
- The achievable energy saving depends on the workload, i.e., on the distribution of idle periods.
- Given an idle period T_idle > T_BE, the energy saved is
    ES(T_idle) = (T_idle - T_TR)(P_ON - P_OFF) + T_TR (P_ON - P_TR)
- Assumptions: no performance penalty, and an ideal manager that knows the workload in advance.

Inherent Exploitability Based on a Real Workload
[Figure-only slide.]

Time-Power Product
- A workload-independent metric: C_S = T_BE · P_OFF.
- An inactive state with a lower C_S may save more energy.
- Only a crude estimate; it may not be representative of real power savings.

Predictive Techniques
- Event of interest: p = {T_idle > T_BE}, predicted from history.
- Observed event: o, which triggers the state transition.
- Objective: predict p based on o.

Metrics
- Safety: the conditional probability Prob(p | o), i.e., given that the observed event occurs, the probability that T_idle > T_BE. Ideally, safety = 1.
- Efficiency: Prob(o | p), i.e., given that T_idle > T_BE, the probability of predicting it correctly.
- Overprediction → high performance penalty → poor safety.
- Underprediction → wasted energy → poor efficiency.

Fixed Timeout Policy
- Enter the inactive state once the system has been idle for T_TO (so the observed event is o: T_idle > T_TO).
- Wake up in response to activity.
- Hypothesis: if the system has been idle for T_TO, it will continue to be idle for T_idle - T_TO > T_BE.

Choosing T_TO
- Increasing T_TO improves safety, but reduces efficiency. A small numerical illustration follows.
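The break-even formula and the safety/efficiency definitions above can be exercised directly. The following Python sketch (illustrative only) computes T_BE for the SA-1100 SLEEP state and then evaluates a fixed-timeout policy over a synthetic trace of idle periods; the trace and the candidate T_TO values are made up, and the metrics follow the slide's definitions of p and o.

    # Minimal sketch: break-even time from the formula above, plus safety and
    # efficiency of a fixed-timeout policy over a synthetic trace of idle
    # periods. The trace and the candidate T_TO values are made up.
    def break_even_time(t_tr, p_tr, p_on, p_off):
        """T_BE = T_TR + T_TR*(P_TR - P_ON)/(P_ON - P_OFF); the second term is zero when P_TR <= P_ON."""
        t_be = t_tr
        if p_tr > p_on:
            t_be += t_tr * (p_tr - p_on) / (p_on - p_off)
        return t_be

    def timeout_policy_metrics(idle_periods, t_to, t_be):
        """Safety = Prob(p | o) and efficiency = Prob(o | p) for o: T_idle > T_TO, p: T_idle > T_BE."""
        predicted = [t for t in idle_periods if t > t_to]   # observed event o
        long_idle = [t for t in idle_periods if t > t_be]   # event of interest p
        correct = [t for t in predicted if t > t_be]
        safety = len(correct) / len(predicted) if predicted else 1.0
        efficiency = len(correct) / len(long_idle) if long_idle else 1.0
        return safety, efficiency

    if __name__ == "__main__":
        # SA-1100 SLEEP: T_TR = 160 ms + 90 us, P_TR = P_ON, so T_BE = T_TR.
        t_be = break_even_time(160e-3 + 90e-6, 0.400, 0.400, 0.00016)
        print(f"T_BE = {t_be * 1e3:.1f} ms")

        idle_trace = [0.05, 0.8, 0.12, 2.5, 0.3, 4.0, 0.09, 1.5]   # seconds, synthetic
        for t_to in (0.1, 0.5, 1.0):
            s, e = timeout_policy_metrics(idle_trace, t_to, t_be)
            print(f"T_TO = {t_to:.1f} s: safety = {s:.2f}, efficiency = {e:.2f}")

On this made-up trace, raising T_TO from 0.1 s to 1.0 s lifts safety from about 0.83 to 1.0 while efficiency falls from 1.0 to 0.6, mirroring the tradeoff stated above.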