Low-Power, High-Performance Architecture of the Pwrficient Processor Family

Total Page:16

File Type:pdf, Size:1020Kb

Low-Power, High-Performance Architecture of the Pwrficient Processor Family Low-Power, High-Performance Architecture of the PWRficient Processor Family Presented by Tse-Yu Yeh Director, Architecture & Verification Hot Chips 18 Aug 21, 2006 8/21/2006 © Copyright 2006 P.A. Semi, Inc. 1 Hthi 18 Acknowledgement THE ARCHITECTURE, DESIGN, AND IMPLEMENTATION OF THE PWRFICIENT PROCESSOR FAMILY ARE THE OUTCOME OF THE UNTIRING EFFORTS OF THE ENTIRE TEAM AT P.A. SEMI, INC. 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 2 Outline Design paradigm Family Core Performance and power Summary 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 3 Low-Power Design Paradigm Design goal —970-class performance at less than 7 watts per core Power dissipation is a primary consideration at all levels Circuit and Process Voltage/frequency scaling Multiple power planes for optimal voltage selection per region Microarchitecture and Logic Clock gating to reduce power of idle circuits Active and pre-charge standby modes in external DRAM array Sizing and hierarchy of cache structures Architecture Integration to reduce I/O interface power Internal and external power-saving modes — CPU, memory controller, PCIe Verification Monitor power consumption against budget throughout the design process Software Power management of I/O devices and CPU 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 4 Power-Influenced µ/Architectural Choices Extensive fine-grained clock gating used throughout core and SOC If a function can be performed sequentially without performance loss, why build power-hungry parallel mechanisms? The smaller, the better Design employs smaller subsections that are powered up for frequent accesses Much attention to clocking only those elements that are performing work Each major block’s design criteria included goals for power in addition to the traditional area, complexity & timing budgets Energy per memory access at each level of cache vs. DRAM Leakage, voltage, and execution frequency Overall execution time saving Speculation has to be effective or it becomes a power sink 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 5 Fine-Grain Clock Gating Reduces Dynamic Power 80 Coarse-Grain Clock Gating 70 60 50 Reset and Flop Normal Thermal Initialization Operation Virus 40 % of Flops Clocked 30 20 Time 10 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 6 Device-Specific Vdd Reduces Static and Dynamic Power 45 Conventional approach 40 Operate at 1.1V across entire process 35 30 range 25 Fast parts tend to be very leaky 20 15 10 5 0 FF @ 1.1V TT @ 1.1V SS @ 1.1V 45 P.A. Semi approach 40 35 Operate at device-specific optimal Vdd 30 Partition power plane for optimal 25 voltage selection per region 20 Enables full process range for power 15 yield 10 5 0 FF @ 0.8V TT @ 0.9V SS @ 1.1V Leakage Power Dynamic Power 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 7 PWRficient Family 8/21/2006 © Copyright 2006 P.A. Semi, Inc. 8 Hthi 18 PWRficient Family PA6T core —the CPU Power Architecture compliant, 64-bit, high-performance FPU and VMX 7W @ 2GHz worst-case power dissipation CONEXIUM™—the on-chip coherent interconnect Scalable cross-bar interconnect 1–8 SMP cores 1 or 2 L2 caches, sized 512KB–8MB 1–4 64-bit DDR2 memory controllers ENVOI™—the I/O system SERDES I/O—PCI Express®, XAUI, SGMII Offload engines—TCP/IP, iSCSI, cryptography, and RAID Support I/O—Boot bus, UARTs, SMBus, GPIOs 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 9 PWRficient PA6T-1682M Block Diagram 64KB 64KB 64KB 64KB I-Cache D-Cache I-Cache D-Cache 2MB2MB DDR2 DDR2 L2 Cache DDR2 DDR2 L2 Cache Controller Controller PA6T 2GHz Core PA6T 2GHz Core Controller Controller CONEXIUM™ Interchange Power Control TTM*TTM* I/OI/O OffloadOffload System Control BridgeBridge DMADMA CacheCache EnginesEngines PTM†† Interrupt/GPIO PTM ENVOI™ Intelligent I/O SMBus (3) OctalOctal PCIPCI DualDual 10GbE10GbE QuadQuad GbEGbE ExpressExpress XAUIXAUI SGMIISGMII UART (2) ConfigurableConfigurable Boot Bus SERDESSERDES (24)(24) *Transaction trace memory †Peripheral trace memory 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 10 The PA6T Core 8/21/2006 © Copyright 2006 P.A. Semi, Inc. 11 Hthi 18 PA6T Core Features Fully compliant Power Architecture High-performance memory implementation hierarchy @ 2GHz Power Architecture version 2.04 L1 data—32GB/s read or write Full FPU and VMX SIMD capabilities L2 data—16GB/s read plus 64-bit with 32-bit capability 16GB/s write DDR2-1067—16GB/s read or write Super-scalar, out-of-order design 16 transactions in flight L1 instruction cache Fetch four instructions per cycle into 64- entry scheduler CONEXIUM Interchange Issue up to 3 per cycle into 6 functional units 1G transactions per second Branch predictors 64GB/s peak data rate Strongly ordered memory model MOESI coherency protocol Issue out of order, retire in order Hypervisor and virtualization support 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 12 PA6T Low-Power Features Improved branch prediction Minimize incorrect speculation to avoid wasting power Efficient superscalar design Index-based, out-of-order execution engine minimizes the use of CAMs and avoids unnecessary replays Energy-efficient memory pipelines Hierarchical address translation to balance power consumption and performance Highly integrated memory pipeline that supports out-of-order execution and sequential consistency with minimal data movement Coherent memory subsystem Coherent memory subsystem designed for high throughput and low latency while minimizing the energy used per reference Extensive power-management capabilities Doze, nap, and sleep power-down modes trade varying degrees of power savings with recovery time 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 13 Processor Pipeline 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 IF IT ID DE UC PM MP SE SP IS RR AL AD AG TR DT DD LW IntegerInteger InstructionInstruction InstructionInstruction InstructionInstruction FetchFetch DecodeDecode && MapMap IssueIssue FloatingFloating PointPoint VMXVMX BranchBranch PredictionPrediction Load/StoreLoad/Store 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 14 Front End I-cache 1 2 3 4 5 6 7 IF IT ID DE UC PM MP 64KB 2-way associative IERATIERAT 2-cycle 4 instructions/fetch µIERATµIERAT µSeqµSeq 4 fetches to CONEXIUM Next-line pre-fetch after a miss I-CacheI-Cache DecDec µDecµDec MapMap Instruction buffer 4 fetch groups (16 instructions) Decode Next Select RFRF DepDep Line MapMap MapMap 4 µ ops/dispatch Taken Map Not Taken 4 µ ops renamed/cycle 64 rename registers RSB Pred TAP Target Target back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 15 Branch Processing Branch prediction 1 2 3 4 5 6 7 IF IT ID DE UC PM MP 0-cycle bubble: 16-entry next-fetch prediction IERATIERAT 4-cycle bubble (taken prediction) Path: 16K bi-mode taken/not taken µIERAT (4 predicted) µIERAT µSeq 16-entry return stack 64-entry history-assisted target predictor I-Cache Dec µDec Map IP-relative Early verification NextNext SelectSelect RF Dep Branch execution LineLine Map Map TakenTaken 2-cycle branch execution on integer P1 NotNot Branch misprediction TakenTaken 13-cycle path mispredict RSBRSB 13-cycle target mispredict PredPred TAPTAP TargetTarget TargetTarget back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 16 Scheduler and Execution Pipes Schedule & Issue 8 9 10 11 12 13 14 15 16 … 20 64-entry schedule buffer SE SP IS RR AL AD 6464 dependency 3 pickers Pre-slotted Replay from scheduler IntInt P0:IntegerP0:Integer RFRF Replay prevention Retire PickPick CRCR P1:Integer/BrP1:Integer/Br 00 BRBR Commit to architecture registers PickPick EvalEval IssueIssue Precise exception 11 P1:FPUP1:FPU ArithmeticArithmetic 88 4 µ ops/cycle P1:FPUP1:FPU CompareCompare PickPick FPUFPU 3 pickers, 5 execution pipes + 22 RF RF P1:VMXP1:VMX FloatFloat 88 one load/store pipe P1:VMXP1:VMX ComplexComplex IntegerInteger 66 VMXVMX Integer (P0&1): 2-cycle Micro RFRF Micro P0:VMXP0:VMX PermutePermute FP (P1): 8-cycle OpOp BufferBuffer P0:VMXP0:VMX EasyEasy IntInt VMX (P1): 8-cycle FP or 6- cycle complex integer VMX (P0): 2-cycle permute or simple integer Load/store (P2) back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 17 Load/Store Processing Latency —Load to Use 10 11 12 13 14 15 IS RR AL AD L1 cache 4 AG TR DT DD L2 cache 24 Open page 100 IntInt P0:IntegerP0:Integer RFRF Closed page 120 data addr Remote L1 42 CRCR P1:Integer/BrP1:Integer/Br BRBR Loads and Stores Issued out of order FPUFPU RF Strongly-ordered stores Issue RF Issue 32-Entry Load/Store Queue VMX D-CacheD-Cache VMX MRB RF MRB RF HWHW PrefetchPrefetch Re-ordered for consistency Checking and retire CONEXIUM AddrAddr Load/StoreLoad/Store DERATDERAT Store-to-load forwarding MicroMicro GenGen QueueQueue OpOp 16 blocks in flight BufferBuffer SNPSNP 12-entry hardware prefetcher DupDup TagTag back to main diagram 8/21/2006 ©Copyright 2006 P.A. Semi, Inc. Hotchips 18 18 Address Translation 1 2 3 4 I-address translation IF IT ID DE 4-entry direct µ IERAT IERAT 64-entry 2-way IERAT D-address translation µIERAT 128-entry 2-way DERAT EA VA SLB—64-entry fully-assoc I-Cache CPU TLB—1024-entry 4-way SLB TLB CPU SLB TLB BUS HTWHTW BUS I/FI/F H/W Page table walk Reload D-CacheD-Cache ERAT RA 4-walks in flight EA 64-bit effective address Addr Load/Store Prefetch of next PTE Addr DERAT Load/Store VA 65-bit virtual address Gen DERAT Queue Gen Queue RA 44-bit real (physical) address after a TLB miss EA RA ERAT Effective-to-real address AG TR DT DD translation SLB Segment look-aside buffer 12 13 14 15 HTW Hardware page-table walk back to main diagram 8/21/2006 ©Copyright 2006 P.A.
Recommended publications
  • Memory Centric Characterization and Analysis of SPEC CPU2017 Suite
    Session 11: Performance Analysis and Simulation ICPE ’19, April 7–11, 2019, Mumbai, India Memory Centric Characterization and Analysis of SPEC CPU2017 Suite Sarabjeet Singh Manu Awasthi [email protected] [email protected] Ashoka University Ashoka University ABSTRACT These benchmarks have become the standard for any researcher or In this paper, we provide a comprehensive, memory-centric charac- commercial entity wishing to benchmark their architecture or for terization of the SPEC CPU2017 benchmark suite, using a number of exploring new designs. mechanisms including dynamic binary instrumentation, measure- The latest offering of SPEC CPU suite, SPEC CPU2017, was re- ments on native hardware using hardware performance counters leased in June 2017 [8]. SPEC CPU2017 retains a number of bench- and operating system based tools. marks from previous iterations but has also added many new ones We present a number of results including working set sizes, mem- to reflect the changing nature of applications. Some recent stud- ory capacity consumption and memory bandwidth utilization of ies [21, 24] have already started characterizing the behavior of various workloads. Our experiments reveal that, on the x86_64 ISA, SPEC CPU2017 applications, looking for potential optimizations to SPEC CPU2017 workloads execute a significant number of mem- system architectures. ory related instructions, with approximately 50% of all dynamic In recent years the memory hierarchy, from the caches, all the instructions requiring memory accesses. We also show that there is way to main memory, has become a first class citizen of computer a large variation in the memory footprint and bandwidth utilization system design.
    [Show full text]
  • Overview of the SPEC Benchmarks
    9 Overview of the SPEC Benchmarks Kaivalya M. Dixit IBM Corporation “The reputation of current benchmarketing claims regarding system performance is on par with the promises made by politicians during elections.” Standard Performance Evaluation Corporation (SPEC) was founded in October, 1988, by Apollo, Hewlett-Packard,MIPS Computer Systems and SUN Microsystems in cooperation with E. E. Times. SPEC is a nonprofit consortium of 22 major computer vendors whose common goals are “to provide the industry with a realistic yardstick to measure the performance of advanced computer systems” and to educate consumers about the performance of vendors’ products. SPEC creates, maintains, distributes, and endorses a standardized set of application-oriented programs to be used as benchmarks. 489 490 CHAPTER 9 Overview of the SPEC Benchmarks 9.1 Historical Perspective Traditional benchmarks have failed to characterize the system performance of modern computer systems. Some of those benchmarks measure component-level performance, and some of the measurements are routinely published as system performance. Historically, vendors have characterized the performances of their systems in a variety of confusing metrics. In part, the confusion is due to a lack of credible performance information, agreement, and leadership among competing vendors. Many vendors characterize system performance in millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS). All instructions, however, are not equal. Since CISC machine instructions usually accomplish a lot more than those of RISC machines, comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek. 9.1.1 Simple CPU Benchmarks Truth in benchmarking is an oxymoron because vendors use benchmarks for marketing purposes.
    [Show full text]
  • 3 — Arithmetic for Computers 2 MIPS Arithmetic Logic Unit (ALU) Zero Ovf
    Chapter 3 Arithmetic for Computers 1 § 3.1Introduction Arithmetic for Computers Operations on integers Addition and subtraction Multiplication and division Dealing with overflow Floating-point real numbers Representation and operations Rechnerstrukturen 182.092 3 — Arithmetic for Computers 2 MIPS Arithmetic Logic Unit (ALU) zero ovf 1 Must support the Arithmetic/Logic 1 operations of the ISA A 32 add, addi, addiu, addu ALU result sub, subu 32 mult, multu, div, divu B 32 sqrt 4 and, andi, nor, or, ori, xor, xori m (operation) beq, bne, slt, slti, sltiu, sltu With special handling for sign extend – addi, addiu, slti, sltiu zero extend – andi, ori, xori overflow detection – add, addi, sub Rechnerstrukturen 182.092 3 — Arithmetic for Computers 3 (Simplyfied) 1-bit MIPS ALU AND, OR, ADD, SLT Rechnerstrukturen 182.092 3 — Arithmetic for Computers 4 Final 32-bit ALU Rechnerstrukturen 182.092 3 — Arithmetic for Computers 5 Performance issues Critical path of n-bit ripple-carry adder is n*CP CarryIn0 A0 1-bit result0 ALU B0 CarryOut0 CarryIn1 A1 1-bit result1 B ALU 1 CarryOut 1 CarryIn2 A2 1-bit result2 ALU B2 CarryOut CarryIn 2 3 A3 1-bit result3 ALU B3 CarryOut3 Design trick – throw hardware at it (Carry Lookahead) Rechnerstrukturen 182.092 3 — Arithmetic for Computers 6 Carry Lookahead Logic (4 bit adder) LS 283 Rechnerstrukturen 182.092 3 — Arithmetic for Computers 7 § 3.2 Addition and Subtraction 3.2 Integer Addition Example: 7 + 6 Overflow if result out of range Adding +ve and –ve operands, no overflow Adding two +ve operands
    [Show full text]
  • SEWIP Program Leverages COTS P 36 P 28 an Interview with Deon Viergutz, Vice President of Cyber Solutions at Lockheed Martin Information Systems & Global Solutions
    @military_cots John McHale Obsolescence trends 8 Special Report Shipboard displays 44 Mil Tech Trends Predictive analytics 52 Industry Spotlight Aging avionics software 56 MIL-EMBEDDED.COM September 2015 | Volume 11 | Number 6 RESOURCE GUIDE 2015 P 62 Navy SEWIP program leverages COTS P 36 P 28 An interview with Deon Viergutz, Vice President of Cyber Solutions at Lockheed Martin Information Systems & Global Solutions Military electronics market overview P 14 Volume 11 Number 6 www.mil-embedded.com September 2015 COLUMNS BONUS – MARKET OVERVIEW Editor’s Perspective 14 C4ISR funding a bright spot in military 8 Tech mergers & military electronics electronics market obsolescence By John McHale, Editorial Director By John McHale Q&A EXECUTIVE OUTLOOK Field Intelligence 10 Metadata: When target video 28 Defending DoD from cyberattacks, getting to data is not enough the left of the boom By Charlotte Adams 14 An interview with Deon Viergutz, Vice President of Cyber Solutions at Lockheed Martin Information Mil Tech Insider Systems & Global Solutions 12 Broadwell chip boosts GPU performance for COTS SBCs 32 RF and microwave innovation drives military By Aaron Frank radar and electronic warfare applications An interview with Bryan Goldstein, DEPARTMENTS General Manager of the Aerospace and Defense, Analog Devices 22 Defense Tech Wire By Mariana Iriarte SPECIAL REPORT 60 Editor’s Choice Products Shipboard Electronics 112 University Update 36 U.S. Navy’s electronic warfare modernization On DARPA’s cybersecurity radar: 36 effort centers on COTS Algorithmic and side-channel attacks By Sally Cole, Senior Editor By Sally Cole 114 Connecting with Mil Embedded 44 Key to military display technologies: Blog – The fascinating world of System integration By Tom Whinfrey, IEE Inc.
    [Show full text]
  • Asustek Computer Inc.: Asus P6T
    SPEC CFP2006 Result spec Copyright 2006-2014 Standard Performance Evaluation Corporation ASUSTeK Computer Inc. (Test Sponsor: Intel Corporation) SPECfp 2006 = 32.4 Asus P6T Deluxe (Intel Core i7-950) SPECfp_base2006 = 30.6 CPU2006 license: 13 Test date: Oct-2008 Test sponsor: Intel Corporation Hardware Availability: Jun-2009 Tested by: Intel Corporation Software Availability: Nov-2008 0 3.00 6.00 9.00 12.0 15.0 18.0 21.0 24.0 27.0 30.0 33.0 36.0 39.0 42.0 45.0 48.0 51.0 54.0 57.0 60.0 63.0 66.0 71.0 70.7 410.bwaves 70.8 21.5 416.gamess 16.8 34.6 433.milc 34.8 434.zeusmp 33.8 20.9 435.gromacs 20.6 60.8 436.cactusADM 61.0 437.leslie3d 31.3 17.6 444.namd 17.4 26.4 447.dealII 23.3 28.5 450.soplex 27.9 34.0 453.povray 26.6 24.4 454.calculix 20.5 38.7 459.GemsFDTD 37.8 24.9 465.tonto 22.3 470.lbm 50.2 30.9 481.wrf 30.9 42.1 482.sphinx3 41.3 SPECfp_base2006 = 30.6 SPECfp2006 = 32.4 Hardware Software CPU Name: Intel Core i7-950 Operating System: Windows Vista Ultimate w/ SP1 (64-bit) CPU Characteristics: Intel Turbo Boost Technology up to 3.33 GHz Compiler: Intel C++ Compiler Professional 11.0 for IA32 CPU MHz: 3066 Build 20080930 Package ID: w_cproc_p_11.0.054 Intel Visual Fortran Compiler Professional 11.0 FPU: Integrated for IA32 CPU(s) enabled: 4 cores, 1 chip, 4 cores/chip, 2 threads/core Build 20080930 Package ID: w_cprof_p_11.0.054 CPU(s) orderable: 1 chip Microsoft Visual Studio 2008 (for libraries) Primary Cache: 32 KB I + 32 KB D on chip per core Auto Parallel: Yes Secondary Cache: 256 KB I+D on chip per core File System: NTFS Continued on next page Continued on next page Standard Performance Evaluation Corporation [email protected] Page 1 http://www.spec.org/ SPEC CFP2006 Result spec Copyright 2006-2014 Standard Performance Evaluation Corporation ASUSTeK Computer Inc.
    [Show full text]
  • Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions
    Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions Margaret Martonosi David Brooks Pradip Bose VET NOV TES TAM EN TVM DE I VI GE T SV B NV M I NE Moore’s Law & Power Dissipation... Moore’s Law: ❚ The Good News: 2X Transistor counts every 18 months ❚ The Bad News: To get the performance improvements we’re accustomed to, CPU Power consumption will increase exponentially too... (Graphs courtesy of Fred Pollack, Intel) Why worry about power dissipation? Battery life Thermal issues: affect cooling, packaging, reliability, timing Environment Hitting the wall… ❚ Battery technology ❚ ❙ Linear improvements, nowhere Past: near the exponential power ❙ Power important for increases we’ve seen laptops, cell phones ❚ Cooling techniques ❚ Present: ❙ Air-cooled is reaching limits ❙ Power a Critical, Universal ❙ Fans often undesirable (noise, design constraint even for weight, expense) very high-end chips ❙ $1 per chip per Watt when ❚ operating in the >40W realm Circuits and process scaling can no longer solve all power ❙ Water-cooled ?!? problems. ❚ Environment ❙ SYSTEMS must also be ❙ US EPA: 10% of current electricity usage in US is directly due to power-aware desktop computers ❙ Architecture, OS, compilers ❙ Increasing fast. And doesn’t count embedded systems, Printers, UPS backup? Power: The Basics ❚ Dynamic power vs. Static power vs. short-circuit power ❙ “switching” power ❙ “leakage” power ❙ Dynamic power dominates, but static power increasing in importance ❙ Trends in each ❚ Static power: steady, per-cycle energy cost ❚ Dynamic power: power dissipation due to capacitance charging at transitions from 0->1 and 1->0 ❚ Short-circuit power: power due to brief short-circuit current during transitions.
    [Show full text]
  • Continuous Profiling: Where Have All the Cycles Gone?
    Continuous Profiling: Where Have All the Cycles Gone? JENNIFER M. ANDERSON, LANCE M. BERC, JEFFREY DEAN, SANJAY GHEMAWAT, MONIKA R. HENZINGER, SHUN-TAK A. LEUNG, RICHARD L. SITES, MARK T. VANDEVOORDE, CARL A. WALDSPURGER, and WILLIAM E. WEIHL Digital Equipment Corporation This article describes the Digital Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec. per 333MHz processor), yet with low overhead (1–3% slowdown for most workloads). Analysis tools supplied with the profiling system use the sample data to produce a precise and accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is being spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them. Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems; D.2.2 [Software Engineering]: Tools and Techniques—profiling tools; D.2.6 [Pro- gramming Languages]: Programming Environments—performance monitoring; D.4 [Oper- ating Systems]: General; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance General Terms: Performance Additional Key Words and Phrases: Profiling, performance understanding, program analysis, performance-monitoring hardware An earlier version of this article appeared at the 16th ACM Symposium on Operating System Principles (SOSP), St.
    [Show full text]
  • A Bibliography of Publications in IEEE Micro
    A Bibliography of Publications in IEEE Micro Nelson H. F. Beebe University of Utah Department of Mathematics, 110 LCB 155 S 1400 E RM 233 Salt Lake City, UT 84112-0090 USA Tel: +1 801 581 5254 FAX: +1 801 581 4148 E-mail: [email protected], [email protected], [email protected] (Internet) WWW URL: http://www.math.utah.edu/~beebe/ 16 September 2021 Version 2.108 Title word cross-reference -Core [MAT+18]. -Cubes [YW94]. -D [ASX19, BWMS19, DDG+19, Joh19c, PZB+19, ZSS+19]. -nm [ABG+16, KBN16, TKI+14]. #1 [Kah93i]. 0.18-Micron [HBd+99]. 0.9-micron + [Ano02d]. 000-fps [KII09]. 000-Processor $1 [Ano17-58, Ano17-59]. 12 [MAT 18]. 16 + + [ABG+16]. 2 [DTH+95]. 21=2 [Ste00a]. 28 [BSP 17]. 024-Core [JJK 11]. [KBN16]. 3 [ASX19, Alt14e, Ano96o, + AOYS95, BWMS19, CMAS11, DDG+19, 1 [Ano98s, BH15, Bre10, PFC 02a, Ste02a, + + Ste14a]. 1-GHz [Ano98s]. 1-terabits DFG 13, Joh19c, LXB07, LX10, MKT 13, + MAS+07, PMM15, PZB+19, SYW+14, [MIM 97]. 10 [Loc03]. 10-Gigabit SCSR93, VPV12, WLF+08, ZSS+19]. 60 [Gad07, HcF04]. 100 [TKI+14]. < [BMM15]. > [BMM15]. 2 [Kir84a, Pat84, PSW91, YSMH91, ZACM14]. [WHCK18]. 3 [KBW95]. II [BAH+05]. ∆ 100-Mops [PSW91]. 1000 [ES84]. 11- + [Lyl04]. 11/780 [Abr83]. 115 [JBF94]. [MKG 20]. k [Eng00j]. µ + [AT93, Dia95c, TS95]. N [YW94]. x 11FO4 [ASD 05]. 12 [And82a]. [DTB01, Dur96, SS05]. 12-DSP [Dur96]. 1284 [Dia94b]. 1284-1994 [Dia94b]. 13 * [CCD+82]. [KW02]. 1394 [SB00]. 1394-1955 [Dia96d]. 1 2 14 [WD03]. 15 [FD04]. 15-Billion-Dollar [KR19a].
    [Show full text]
  • Specfp Benchmark Disclosure
    SPEC CFP2006 Result Copyright 2006-2016 Standard Performance Evaluation Corporation Cisco Systems SPECfp2006 = Not Run Cisco UCS C220 M4 (Intel Xeon E5-2667 v4, 3.20 GHz) SPECfp_base2006 = 125 CPU2006 license: 9019 Test date: Mar-2016 Test sponsor: Cisco Systems Hardware Availability: Mar-2016 Tested by: Cisco Systems Software Availability: Dec-2015 0 30.0 60.0 90.0 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 630 660 690 720 750 780 810 840 900 410.bwaves 572 416.gamess 43.1 433.milc 72.5 434.zeusmp 225 435.gromacs 63.4 436.cactusADM 894 437.leslie3d 427 444.namd 31.6 447.dealII 67.6 450.soplex 48.3 453.povray 62.2 454.calculix 61.6 459.GemsFDTD 242 465.tonto 52.1 470.lbm 799 481.wrf 120 482.sphinx3 91.6 SPECfp_base2006 = 125 Hardware Software CPU Name: Intel Xeon E5-2667 v4 Operating System: SUSE Linux Enterprise Server 12 SP1 (x86_64) CPU Characteristics: Intel Turbo Boost Technology up to 3.60 GHz 3.12.49-11-default CPU MHz: 3200 Compiler: C/C++: Version 16.0.0.101 of Intel C++ Studio XE for Linux; FPU: Integrated Fortran: Version 16.0.0.101 of Intel Fortran CPU(s) enabled: 16 cores, 2 chips, 8 cores/chip Studio XE for Linux CPU(s) orderable: 1,2 chips Auto Parallel: Yes Primary Cache: 32 KB I + 32 KB D on chip per core File System: xfs Secondary Cache: 256 MB I+D on chip per core System State: Run level 3 (multi-user) Continued on next page Continued on next page Standard Performance Evaluation Corporation [email protected] Page 1 http://www.spec.org/ SPEC CFP2006 Result Copyright 2006-2016 Standard Performance
    [Show full text]
  • Intel Corporation: Lenovo T400 (Intel Core 2 Duo T9900)
    SPEC CFP2006 Result spec Copyright 2006-2014 Standard Performance Evaluation Corporation Intel Corporation SPECfp2006 = 21.2 Lenovo T400 (Intel Core 2 Duo T9900) SPECfp_base2006 = 20.5 CPU2006 license: 13 Test date: Mar-2009 Test sponsor: Intel Corporation Hardware Availability: Mar-2009 Tested by: Intel Corporation Software Availability: Nov-2008 0 1.00 3.00 5.00 7.00 9.00 11.0 13.0 15.0 17.0 19.0 21.0 23.0 25.0 27.0 29.0 32.0 27.6 410.bwaves 27.6 19.9 416.gamess 19.1 18.8 433.milc 18.8 434.zeusmp 24.1 19.7 435.gromacs 19.3 30.7 436.cactusADM 30.0 437.leslie3d 17.5 16.2 444.namd 16.2 24.5 447.dealII 22.1 17.1 450.soplex 16.9 30.4 453.povray 24.3 20.0 454.calculix 20.0 16.3 459.GemsFDTD 16.5 19.2 465.tonto 17.5 470.lbm 15.3 21.2 481.wrf 21.2 30.8 482.sphinx3 31.0 SPECfp_base2006 = 20.5 SPECfp2006 = 21.2 Hardware Software CPU Name: Intel Core 2 Duo T9900 Operating System: Windows XP Professional w/ SP2 (64-bit) CPU Characteristics: Compiler: Intel C++ Compiler Professional 11.0 for IA32 CPU MHz: 3066 Build 20080930 Package ID: w_cproc_p_11.0.054 Intel Visual Fortran Compiler Professional 11.0 FPU: Integrated for IA32 CPU(s) enabled: 2 cores, 1 chip, 2 cores/chip Build 20080930 Package ID: w_cprof_p_11.0.054 CPU(s) orderable: 1 chip Microsoft Visual Studio 2008 (for libraries) Primary Cache: 32 KB I + 32 KB D on chip per core Auto Parallel: Yes Secondary Cache: 6 MB I+D on chip per chip File System: NTFS Continued on next page Continued on next page Standard Performance Evaluation Corporation [email protected] Page 1 http://www.spec.org/
    [Show full text]
  • Compactpci and Avancedtca Systems
    ® VOLUME 11 • NUMBER 9 DECEMBER 2007 CompactPCI www.compactpci-systems.com ® www.advancedtca-systems.com and AdvancedTCA Systems The Magazine for Developers of Open Communication, Industrial, and Rugged Systems COLUMNS PRODUCTS 8 Editor’s Foreword AdvancedTCA Summit 2007 19 AdvancedMCs, PrAMCs, Carriers By Joe Pavlat Sponsored by: Adax,® Inc. 41 Storage FEATURES 43 Blades/AdvancedTCA CompactPCI Sponsored by: Emerson Network Power; Sun Microsystems HIGH AVAILABILITY ® 10 Achieving high availability and management 50 Power andwith the latest standard COTS technologies By Dr. Asif Naseem, GoAhead Software 35 Connectors AdvancedTCA Sponsored by:Systems Harting Technology Group; CONEC FABRICS 18 Comparing Ethernet and RapidIO 55 Test & Development By Tom Roberts, Mercury Computer Systems 27 Enclosures/Packaging AdvancedTCA Sponsored by: Carlo Gavazzi Computing Solutions; Schroff 30 The critical importance of shelf management 73 VoIP in AdvancedTCA By Frank Fitzgerald, Carlo Gavazzi Computing Solutions 59 Integrated Systems Sponsored by: Alliance Systems; Kontron MicroTCA 67 MicroTCA 38 MicroTCA power module input connectors Sponsored by: CorEdge Networks; Motorola Inc. By Juergen Hahn-Barth, CONEC Corporation 23 Networking/Communications INTERVIEW 62 Speaking of middleware 71 PMCs, PrPMCs, Carriers Sponsored by: Xembedded, Inc. An interview with Jim Lawrence and Chris Lanfear, Enea 47 Switches WEB RESOURCES 51 Single Board Computers Subscribe to the magazine or E-letter at: Sponsored by: Aitech Defense Systems, Inc.; General Dynamics www.opensystems-publishing.com/subscriptions Industry news: COVER (Clockwise): Read: www.compactpci-systems.com/news The HDCIII is a high density SS7/ATM controller from ADAX (www.adax.com). The HDCIII provides Submit: www.opensystems-publishing.com/news/submit 8 E1/T1 trunks, simultaneous support for 248 MTP2 LSLs, HSLs, and SS7 ATM AAL5, and offers AMC, PMC, PCI-X and PCIe versions from a single driver.
    [Show full text]
  • Historical Perspective and Further Reading 4.7
    4.7 Historical Perspective and Further Reading 4.7 From the earliest days of computing, designers have specified performance goals—ENIAC was to be 1000 times faster than the Harvard Mark-I, and the IBM Stretch (7030) was to be 100 times faster than the fastest computer then in exist- ence. What wasn’t clear, though, was how this performance was to be measured. The original measure of performance was the time required to perform an individual operation, such as addition. Since most instructions took the same exe- cution time, the timing of one was the same as the others. As the execution times of instructions in a computer became more diverse, however, the time required for one operation was no longer useful for comparisons. To take these differences into account, an instruction mix was calculated by mea- suring the relative frequency of instructions in a computer across many programs. Multiplying the time for each instruction by its weight in the mix gave the user the average instruction execution time. (If measured in clock cycles, average instruction execution time is the same as average CPI.) Since instruction sets were similar, this was a more precise comparison than add times. From average instruction execution time, then, it was only a small step to MIPS. MIPS had the virtue of being easy to understand; hence it grew in popularity. The In More Depth section on this CD dis- cusses other definitions of MIPS, as well as its relatives MOPS and FLOPS! The Quest for an Average Program As processors were becoming more sophisticated and relied on memory hierar- chies (the topic of Chapter 7) and pipelining (the topic of Chapter 6), a single exe- cution time for each instruction no longer existed; neither execution time nor MIPS, therefore, could be calculated from the instruction mix and the manual.
    [Show full text]