A Chip Multithreaded Processor for Network-Facing Workloads

Total Page:16

File Type:pdf, Size:1020Kb

A Chip Multithreaded Processor for Network-Facing Workloads A CHIP MULTITHREADED PROCESSOR FOR NETWORK- FACING WORKLOADS THE ARTICLE DESCRIBES AN EARLY IMPLEMENTATION OF A CHIP MULTITHREADED (CMT) PROCESSOR, SUN MICROSYSTEM’S TECHNOLOGY TO SUPPORT ITS THROUGHPUT COMPUTING STRATEGY. CMT PROCESSORS ARE DESIGNED TO HANDLE WORKLOADS CONSISTING OF MULTIPLE THREADS WITH LIGHT COMPUTATIONAL DEMANDS, TYPICAL OF WEB SERVERS AND OTHER NETWORK-FACING SYSTEMS. With its throughput computing The processor described in this article is a strategy, Sun Microsystems seeks to reverse member of Sun’s first generation of CMT long-standing trends towards increasingly processors designed to efficiently execute net- elaborate processor designs by focusing work-facing workloads. Network-facing sys- instead on simple, replicated designs that tems primarily service network clients and are effectively exploit threaded workloads and often grouped together under the label “Web thus enable radically higher levels of perfor- servers.” Systems of this type tend to favor com- mance scaling.1 pute dense form factors, such as racks and Throughput computing is based on chip blades.2 The processor’s dual-thread execution Sanjiv Kapil multithreading processor design technology. capability, compact die size, and minimal In CMT technology, maximizing the amount power consumption combine to produce high Harlan McGhan of work accomplished per unit of time or throughput performance per watt, per transis- other relevant resource, rather than minimiz- tor, and per square millimeter of die area. Given Jesse Lawrendra ing the time needed to complete a given task the short design cycle Sun needed to create the or set of tasks, defines performance. Similar- processor, the result is a compelling early proof Sun Microsystems ly, a CMT processor might seek to use power of the value of throughput computing. more efficiently by clocking at a lower fre- quency—if this tradeoff lets it generate more Design goals work per watt of expended power. By CMT To satisfy target market requirements, the standards, the best processor accomplishes the design team set the following design goals: most work per second of time, per watt of expended power, per square millimeter of die • Focus on throughput (number of area, and so on (that is, it operates most threads), not single-thread performance. efficiently). • Maximize performance-watt ratio. 20 Published by the IEEE Computer Society 0272-1732/04/$20.00 2004 IEEE • Minimize die area. of translating this design to current 130-nm • Minimize design cost for one- to four- copper metal technology, targeting a frequen- way servers. cy of 1,200 MHz, proved formidable. Minimizing power consumption was a con- Meeting these goals lets system designers cern throughout the design, affecting not only pack processors tightly together to build inex- the choice of thread execution engine, but also pensive boxes or racks offering very high per- the design of every functional block on the formance per cubic foot. Alternatively, they chip. The processor also provides low-power can use the higher packing densities and lower operating modes, allowing software to selec- costs possible with the design to provide more tively reduce power consumption during peri- memory or greater functional integration. ods of little or no activity by reducing the Three additional requirements constrained chip’s operating frequency to 1/2 (idle mode) the four basic design goals: minimizing time or 1/32 (sleep mode) of normal clock speed. to market and development cost while main- To further reduce costs (and reduce system- taining full scalable processor architecture level power consumption by eliminating the (Sparc) V9 (64-bit) binary compatibility.3 need to drive external buses), we replaced the These constraints dictated our decision to large (4- to 8-Mbyte) external L2 cache typi- leverage existing UltraSparc designs to the cal of UltraSparc II systems with 1 Mbyte of greatest extent possible, while the design goals on-chip L2 cache. Moving the L2 cache on motivated our unusual decision to combine chip also substantially improved performance elements from two different UltraSparc gen- by reducing access latencies; moreover, cache erations. frequency automatically scales up when the processor’s clock rate increases. Overview To achieve compatibility with current-gen- To maximize thread density while mini- eration systems, we took the design’s memo- mizing both die area and power consumption, ry controller and JBus system interface from we rejected the third-generation 29-million- current UltraSparc IIIi processors, reworking transistor UltraSparc III processor design4 in them to interface with the dual UltraSparc II favor of the execution engine from the far pipelines. Indeed, to maximize leverage and smaller and less power-hungry first-genera- minimize costs, we designed the processor to tion 5.2-million-transistor UltraSparc I be pin compatible with UltraSparc IIIi, fit- design.5,6 This choice let us create a processor ting into the same 959-pin ceramic micro- incorporating two complete four-way super- PGA package used for the existing product. scalar execution engines on a single die, con- Finally, we implemented a robust set of reli- ferring simple dual-thread capability (one ability, availability, and serviceability (RAS) thread per core), while keeping both die area features to ensure high levels of data integri- and power consumption to a minimum. ty, and we added new CMT features to con- Achieving even higher thread counts, trol the processor’s dual-thread operation. although suitable for the target market, was Figure 1 (next page) is a block diagram of the inconsistent with the processor’s stringent design, showing the major functional units schedule and die size goals. and their interconnections. The decision to use the UltraSparc I execu- tion engine did introduce a complication, Thread execution engine however: porting the older design, originally We lifted the thread execution engine essen- implemented in a mid-1990s 0.5-micron, 3.3- tially intact from the UltraSparc II processor. V processor, to contemporary deep-submicron The execution engine’s design comprises a technology. To avoid needless work, the design nine-stage execution pipeline, incorporating effort started from the most advanced version four-way superscalar instruction fetch and dis- of the basic UltraSparc I design available: a patch to a set of nine functional units: four 250-nanometer aluminum metal database cre- integer, including load/store and branch; three ated in 1998 for the last revision of the Ultra- floating-point (FP), including divide/square Sparc II processor (targeted at an upper root; and two single-instruction, multiple- frequency of 450 MHz). Even so, the problem data (SIMD) operations. The pipeline also MARCH–APRIL 2004 21 HOT CHIPS 15 Double data rate 1 (DDR1) SDRAM L2 cache 1 19 16 L2 cache 0 35 137 128 128 32 128 32 128 30 30 JBus Memory control JBus Core 1 Core 0 unit 128 unit (MCU) 128 unit 128 128 128 128 57 128 JBus Control pins Address pins Data bus Address/data shared bus Figure 1. Processor block diagram. We reworked the memory controller and JBus system interface, originally from UltraSparc IIIi processors, to interface with UltraSparc II pipelines. includes a set of eight overlapping general- total core, including the execution units, both purpose register windows, a set of 32 floating- L1 caches, the L2 cache controller, the point registers, and two separate 16-Kbyte L1 load/store unit, and associated miscellaneous caches for instructions and data together with logic, aggregated to less than 29 mm2. Thus, associated memory management units, each together the two cores take up only about a with a 64-entry translation look-aside buffer quarter of the 206-mm2 die. (TLB) for caching recently mapped virtual- The integer unit dissipates only about 3 to-physical address pairs. watts, with the floating-point unit adding less Between its two four-issue pipelines, the than another 1.5 watts. Maximum power dis- processor can fetch and issue eight instruc- sipation by the entire execution engine is less tions per clock cycle. A 12-entry instruction than 7 watts, so between them, the two engines buffer decouples instruction fetch from are responsible for less than half the 32 watts instruction issue to buffer any discrepancies maximum dissipated by the processor chip. between these two rates. Figure 2 details the UltraSparc II pipeline. On-chip L2 cache When implemented in 130-nm technolo- Supporting each of the two UltraSparc II gy, the 64-bit, superscalar UltraSparc II execution engines in the processor is a high- pipeline design proved remarkably efficient in performance, 512-Kbyte, four-way set-asso- both area and power. The basic integer unit, ciative, on-chip L2 cache. A 128-bit data bus including two full arithmetic logic units, connects each core to the L2. The total load- required only about 8 mm2 of area, with the to-use latency for the L2 is 9 to 10 cycles, with FP and SIMD units, which share execution four cycles needed to transit in and out of the resources, adding less than 5 mm2 more. The cache itself, including error-correcting code 22 IEEE MICRO I-Cache 16 Kbyte, two-way 1 Fetch set association (2WSA) 12 entry I-Buffer 2 Decode 3 General purpose (GP) Group register windows Floating-point (FP)/single-instruction, Integer ops multiple-data (SIMD) ops 4 Execute 4 Register FP register file D-Cache 5 Cache 5 X1 16 Kbyte, 1WSA 1. Four instructions read from I-Cache, branch prediction, target obtained Miss 6 N1 6 X2 2. Instructions predecoded and to L2 Hit put into instruction buffer or 3. Up to four instructions grouped for issue, 7 7 integer ops read general purpose regs memory- N2 X3 Sign-extended 4. Integer ops execute, FP/SIMD ops read management floating-point registers unit 5. Loads/stores access D-Cache, FP/SIMD ops begin execution 6.
Recommended publications
  • Memory Centric Characterization and Analysis of SPEC CPU2017 Suite
    Session 11: Performance Analysis and Simulation ICPE ’19, April 7–11, 2019, Mumbai, India Memory Centric Characterization and Analysis of SPEC CPU2017 Suite Sarabjeet Singh Manu Awasthi [email protected] [email protected] Ashoka University Ashoka University ABSTRACT These benchmarks have become the standard for any researcher or In this paper, we provide a comprehensive, memory-centric charac- commercial entity wishing to benchmark their architecture or for terization of the SPEC CPU2017 benchmark suite, using a number of exploring new designs. mechanisms including dynamic binary instrumentation, measure- The latest offering of SPEC CPU suite, SPEC CPU2017, was re- ments on native hardware using hardware performance counters leased in June 2017 [8]. SPEC CPU2017 retains a number of bench- and operating system based tools. marks from previous iterations but has also added many new ones We present a number of results including working set sizes, mem- to reflect the changing nature of applications. Some recent stud- ory capacity consumption and memory bandwidth utilization of ies [21, 24] have already started characterizing the behavior of various workloads. Our experiments reveal that, on the x86_64 ISA, SPEC CPU2017 applications, looking for potential optimizations to SPEC CPU2017 workloads execute a significant number of mem- system architectures. ory related instructions, with approximately 50% of all dynamic In recent years the memory hierarchy, from the caches, all the instructions requiring memory accesses. We also show that there is way to main memory, has become a first class citizen of computer a large variation in the memory footprint and bandwidth utilization system design.
    [Show full text]
  • Overview of the SPEC Benchmarks
    9 Overview of the SPEC Benchmarks Kaivalya M. Dixit IBM Corporation “The reputation of current benchmarketing claims regarding system performance is on par with the promises made by politicians during elections.” Standard Performance Evaluation Corporation (SPEC) was founded in October, 1988, by Apollo, Hewlett-Packard,MIPS Computer Systems and SUN Microsystems in cooperation with E. E. Times. SPEC is a nonprofit consortium of 22 major computer vendors whose common goals are “to provide the industry with a realistic yardstick to measure the performance of advanced computer systems” and to educate consumers about the performance of vendors’ products. SPEC creates, maintains, distributes, and endorses a standardized set of application-oriented programs to be used as benchmarks. 489 490 CHAPTER 9 Overview of the SPEC Benchmarks 9.1 Historical Perspective Traditional benchmarks have failed to characterize the system performance of modern computer systems. Some of those benchmarks measure component-level performance, and some of the measurements are routinely published as system performance. Historically, vendors have characterized the performances of their systems in a variety of confusing metrics. In part, the confusion is due to a lack of credible performance information, agreement, and leadership among competing vendors. Many vendors characterize system performance in millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS). All instructions, however, are not equal. Since CISC machine instructions usually accomplish a lot more than those of RISC machines, comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek. 9.1.1 Simple CPU Benchmarks Truth in benchmarking is an oxymoron because vendors use benchmarks for marketing purposes.
    [Show full text]
  • 3 — Arithmetic for Computers 2 MIPS Arithmetic Logic Unit (ALU) Zero Ovf
    Chapter 3 Arithmetic for Computers 1 § 3.1Introduction Arithmetic for Computers Operations on integers Addition and subtraction Multiplication and division Dealing with overflow Floating-point real numbers Representation and operations Rechnerstrukturen 182.092 3 — Arithmetic for Computers 2 MIPS Arithmetic Logic Unit (ALU) zero ovf 1 Must support the Arithmetic/Logic 1 operations of the ISA A 32 add, addi, addiu, addu ALU result sub, subu 32 mult, multu, div, divu B 32 sqrt 4 and, andi, nor, or, ori, xor, xori m (operation) beq, bne, slt, slti, sltiu, sltu With special handling for sign extend – addi, addiu, slti, sltiu zero extend – andi, ori, xori overflow detection – add, addi, sub Rechnerstrukturen 182.092 3 — Arithmetic for Computers 3 (Simplyfied) 1-bit MIPS ALU AND, OR, ADD, SLT Rechnerstrukturen 182.092 3 — Arithmetic for Computers 4 Final 32-bit ALU Rechnerstrukturen 182.092 3 — Arithmetic for Computers 5 Performance issues Critical path of n-bit ripple-carry adder is n*CP CarryIn0 A0 1-bit result0 ALU B0 CarryOut0 CarryIn1 A1 1-bit result1 B ALU 1 CarryOut 1 CarryIn2 A2 1-bit result2 ALU B2 CarryOut CarryIn 2 3 A3 1-bit result3 ALU B3 CarryOut3 Design trick – throw hardware at it (Carry Lookahead) Rechnerstrukturen 182.092 3 — Arithmetic for Computers 6 Carry Lookahead Logic (4 bit adder) LS 283 Rechnerstrukturen 182.092 3 — Arithmetic for Computers 7 § 3.2 Addition and Subtraction 3.2 Integer Addition Example: 7 + 6 Overflow if result out of range Adding +ve and –ve operands, no overflow Adding two +ve operands
    [Show full text]
  • Asustek Computer Inc.: Asus P6T
    SPEC CFP2006 Result spec Copyright 2006-2014 Standard Performance Evaluation Corporation ASUSTeK Computer Inc. (Test Sponsor: Intel Corporation) SPECfp 2006 = 32.4 Asus P6T Deluxe (Intel Core i7-950) SPECfp_base2006 = 30.6 CPU2006 license: 13 Test date: Oct-2008 Test sponsor: Intel Corporation Hardware Availability: Jun-2009 Tested by: Intel Corporation Software Availability: Nov-2008 0 3.00 6.00 9.00 12.0 15.0 18.0 21.0 24.0 27.0 30.0 33.0 36.0 39.0 42.0 45.0 48.0 51.0 54.0 57.0 60.0 63.0 66.0 71.0 70.7 410.bwaves 70.8 21.5 416.gamess 16.8 34.6 433.milc 34.8 434.zeusmp 33.8 20.9 435.gromacs 20.6 60.8 436.cactusADM 61.0 437.leslie3d 31.3 17.6 444.namd 17.4 26.4 447.dealII 23.3 28.5 450.soplex 27.9 34.0 453.povray 26.6 24.4 454.calculix 20.5 38.7 459.GemsFDTD 37.8 24.9 465.tonto 22.3 470.lbm 50.2 30.9 481.wrf 30.9 42.1 482.sphinx3 41.3 SPECfp_base2006 = 30.6 SPECfp2006 = 32.4 Hardware Software CPU Name: Intel Core i7-950 Operating System: Windows Vista Ultimate w/ SP1 (64-bit) CPU Characteristics: Intel Turbo Boost Technology up to 3.33 GHz Compiler: Intel C++ Compiler Professional 11.0 for IA32 CPU MHz: 3066 Build 20080930 Package ID: w_cproc_p_11.0.054 Intel Visual Fortran Compiler Professional 11.0 FPU: Integrated for IA32 CPU(s) enabled: 4 cores, 1 chip, 4 cores/chip, 2 threads/core Build 20080930 Package ID: w_cprof_p_11.0.054 CPU(s) orderable: 1 chip Microsoft Visual Studio 2008 (for libraries) Primary Cache: 32 KB I + 32 KB D on chip per core Auto Parallel: Yes Secondary Cache: 256 KB I+D on chip per core File System: NTFS Continued on next page Continued on next page Standard Performance Evaluation Corporation [email protected] Page 1 http://www.spec.org/ SPEC CFP2006 Result spec Copyright 2006-2014 Standard Performance Evaluation Corporation ASUSTeK Computer Inc.
    [Show full text]
  • Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions
    Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions Margaret Martonosi David Brooks Pradip Bose VET NOV TES TAM EN TVM DE I VI GE T SV B NV M I NE Moore’s Law & Power Dissipation... Moore’s Law: ❚ The Good News: 2X Transistor counts every 18 months ❚ The Bad News: To get the performance improvements we’re accustomed to, CPU Power consumption will increase exponentially too... (Graphs courtesy of Fred Pollack, Intel) Why worry about power dissipation? Battery life Thermal issues: affect cooling, packaging, reliability, timing Environment Hitting the wall… ❚ Battery technology ❚ ❙ Linear improvements, nowhere Past: near the exponential power ❙ Power important for increases we’ve seen laptops, cell phones ❚ Cooling techniques ❚ Present: ❙ Air-cooled is reaching limits ❙ Power a Critical, Universal ❙ Fans often undesirable (noise, design constraint even for weight, expense) very high-end chips ❙ $1 per chip per Watt when ❚ operating in the >40W realm Circuits and process scaling can no longer solve all power ❙ Water-cooled ?!? problems. ❚ Environment ❙ SYSTEMS must also be ❙ US EPA: 10% of current electricity usage in US is directly due to power-aware desktop computers ❙ Architecture, OS, compilers ❙ Increasing fast. And doesn’t count embedded systems, Printers, UPS backup? Power: The Basics ❚ Dynamic power vs. Static power vs. short-circuit power ❙ “switching” power ❙ “leakage” power ❙ Dynamic power dominates, but static power increasing in importance ❙ Trends in each ❚ Static power: steady, per-cycle energy cost ❚ Dynamic power: power dissipation due to capacitance charging at transitions from 0->1 and 1->0 ❚ Short-circuit power: power due to brief short-circuit current during transitions.
    [Show full text]
  • Continuous Profiling: Where Have All the Cycles Gone?
    Continuous Profiling: Where Have All the Cycles Gone? JENNIFER M. ANDERSON, LANCE M. BERC, JEFFREY DEAN, SANJAY GHEMAWAT, MONIKA R. HENZINGER, SHUN-TAK A. LEUNG, RICHARD L. SITES, MARK T. VANDEVOORDE, CARL A. WALDSPURGER, and WILLIAM E. WEIHL Digital Equipment Corporation This article describes the Digital Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec. per 333MHz processor), yet with low overhead (1–3% slowdown for most workloads). Analysis tools supplied with the profiling system use the sample data to produce a precise and accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is being spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them. Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems; D.2.2 [Software Engineering]: Tools and Techniques—profiling tools; D.2.6 [Pro- gramming Languages]: Programming Environments—performance monitoring; D.4 [Oper- ating Systems]: General; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance General Terms: Performance Additional Key Words and Phrases: Profiling, performance understanding, program analysis, performance-monitoring hardware An earlier version of this article appeared at the 16th ACM Symposium on Operating System Principles (SOSP), St.
    [Show full text]
  • Computer Architectures an Overview
    Computer Architectures An Overview PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 25 Feb 2012 22:35:32 UTC Contents Articles Microarchitecture 1 x86 7 PowerPC 23 IBM POWER 33 MIPS architecture 39 SPARC 57 ARM architecture 65 DEC Alpha 80 AlphaStation 92 AlphaServer 95 Very long instruction word 103 Instruction-level parallelism 107 Explicitly parallel instruction computing 108 References Article Sources and Contributors 111 Image Sources, Licenses and Contributors 113 Article Licenses License 114 Microarchitecture 1 Microarchitecture In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures.[1] Implementations might vary due to different goals of a given design or due to shifts in technology.[2] Computer architecture is the combination of microarchitecture and instruction set design. Relation to instruction set architecture The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, address and data formats among other things. The Intel Core microarchitecture microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA. The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALU)s and even larger elements.
    [Show full text]
  • Specfp Benchmark Disclosure
    SPEC CFP2006 Result Copyright 2006-2016 Standard Performance Evaluation Corporation Cisco Systems SPECfp2006 = Not Run Cisco UCS C220 M4 (Intel Xeon E5-2667 v4, 3.20 GHz) SPECfp_base2006 = 125 CPU2006 license: 9019 Test date: Mar-2016 Test sponsor: Cisco Systems Hardware Availability: Mar-2016 Tested by: Cisco Systems Software Availability: Dec-2015 0 30.0 60.0 90.0 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 630 660 690 720 750 780 810 840 900 410.bwaves 572 416.gamess 43.1 433.milc 72.5 434.zeusmp 225 435.gromacs 63.4 436.cactusADM 894 437.leslie3d 427 444.namd 31.6 447.dealII 67.6 450.soplex 48.3 453.povray 62.2 454.calculix 61.6 459.GemsFDTD 242 465.tonto 52.1 470.lbm 799 481.wrf 120 482.sphinx3 91.6 SPECfp_base2006 = 125 Hardware Software CPU Name: Intel Xeon E5-2667 v4 Operating System: SUSE Linux Enterprise Server 12 SP1 (x86_64) CPU Characteristics: Intel Turbo Boost Technology up to 3.60 GHz 3.12.49-11-default CPU MHz: 3200 Compiler: C/C++: Version 16.0.0.101 of Intel C++ Studio XE for Linux; FPU: Integrated Fortran: Version 16.0.0.101 of Intel Fortran CPU(s) enabled: 16 cores, 2 chips, 8 cores/chip Studio XE for Linux CPU(s) orderable: 1,2 chips Auto Parallel: Yes Primary Cache: 32 KB I + 32 KB D on chip per core File System: xfs Secondary Cache: 256 MB I+D on chip per core System State: Run level 3 (multi-user) Continued on next page Continued on next page Standard Performance Evaluation Corporation [email protected] Page 1 http://www.spec.org/ SPEC CFP2006 Result Copyright 2006-2016 Standard Performance
    [Show full text]
  • Intel Corporation: Lenovo T400 (Intel Core 2 Duo T9900)
    SPEC CFP2006 Result spec Copyright 2006-2014 Standard Performance Evaluation Corporation Intel Corporation SPECfp2006 = 21.2 Lenovo T400 (Intel Core 2 Duo T9900) SPECfp_base2006 = 20.5 CPU2006 license: 13 Test date: Mar-2009 Test sponsor: Intel Corporation Hardware Availability: Mar-2009 Tested by: Intel Corporation Software Availability: Nov-2008 0 1.00 3.00 5.00 7.00 9.00 11.0 13.0 15.0 17.0 19.0 21.0 23.0 25.0 27.0 29.0 32.0 27.6 410.bwaves 27.6 19.9 416.gamess 19.1 18.8 433.milc 18.8 434.zeusmp 24.1 19.7 435.gromacs 19.3 30.7 436.cactusADM 30.0 437.leslie3d 17.5 16.2 444.namd 16.2 24.5 447.dealII 22.1 17.1 450.soplex 16.9 30.4 453.povray 24.3 20.0 454.calculix 20.0 16.3 459.GemsFDTD 16.5 19.2 465.tonto 17.5 470.lbm 15.3 21.2 481.wrf 21.2 30.8 482.sphinx3 31.0 SPECfp_base2006 = 20.5 SPECfp2006 = 21.2 Hardware Software CPU Name: Intel Core 2 Duo T9900 Operating System: Windows XP Professional w/ SP2 (64-bit) CPU Characteristics: Compiler: Intel C++ Compiler Professional 11.0 for IA32 CPU MHz: 3066 Build 20080930 Package ID: w_cproc_p_11.0.054 Intel Visual Fortran Compiler Professional 11.0 FPU: Integrated for IA32 CPU(s) enabled: 2 cores, 1 chip, 2 cores/chip Build 20080930 Package ID: w_cprof_p_11.0.054 CPU(s) orderable: 1 chip Microsoft Visual Studio 2008 (for libraries) Primary Cache: 32 KB I + 32 KB D on chip per core Auto Parallel: Yes Secondary Cache: 6 MB I+D on chip per chip File System: NTFS Continued on next page Continued on next page Standard Performance Evaluation Corporation [email protected] Page 1 http://www.spec.org/
    [Show full text]
  • Historical Perspective and Further Reading 4.7
    4.7 Historical Perspective and Further Reading 4.7 From the earliest days of computing, designers have specified performance goals—ENIAC was to be 1000 times faster than the Harvard Mark-I, and the IBM Stretch (7030) was to be 100 times faster than the fastest computer then in exist- ence. What wasn’t clear, though, was how this performance was to be measured. The original measure of performance was the time required to perform an individual operation, such as addition. Since most instructions took the same exe- cution time, the timing of one was the same as the others. As the execution times of instructions in a computer became more diverse, however, the time required for one operation was no longer useful for comparisons. To take these differences into account, an instruction mix was calculated by mea- suring the relative frequency of instructions in a computer across many programs. Multiplying the time for each instruction by its weight in the mix gave the user the average instruction execution time. (If measured in clock cycles, average instruction execution time is the same as average CPI.) Since instruction sets were similar, this was a more precise comparison than add times. From average instruction execution time, then, it was only a small step to MIPS. MIPS had the virtue of being easy to understand; hence it grew in popularity. The In More Depth section on this CD dis- cusses other definitions of MIPS, as well as its relatives MOPS and FLOPS! The Quest for an Average Program As processors were becoming more sophisticated and relied on memory hierar- chies (the topic of Chapter 7) and pipelining (the topic of Chapter 6), a single exe- cution time for each instruction no longer existed; neither execution time nor MIPS, therefore, could be calculated from the instruction mix and the manual.
    [Show full text]
  • An Evaluation of the TRIPS Computer System
    Appears in the Proceedings of the 14th International Conference on Architecture Support for Programming Languages and Operating Systems An Evaluation of the TRIPS Computer System Mark Gebhart Bertrand A. Maher Katherine E. Coons Jeff Diamond Paul Gratz Mario Marino Nitya Ranganathan Behnam Robatmili Aaron Smith James Burrill Stephen W. Keckler Doug Burger Kathryn S. McKinley Department of Computer Sciences The University of Texas at Austin [email protected] www.cs.utexas.edu/users/cart Abstract issue-width scaling of conventional superscalar architec- The TRIPS system employs a new instruction set architec- tures. Because of these trends, major microprocessor ven- ture (ISA) called Explicit Data Graph Execution (EDGE) dors have abandoned architectures for single-thread perfor- that renegotiates the boundary between hardware and soft- mance and turned to the promise of multiple cores per chip. ware to expose and exploit concurrency. EDGE ISAs use a While many applications can exploit multicore systems, this block-atomic execution model in which blocks are composed approach places substantial burdens on programmers to par- of dataflow instructions. The goal of the TRIPS design is allelize their codes. Despite these trends, Amdahl’s law dic- to mine concurrency for high performance while tolerating tates that single-thread performance will remain key to the emerging technology scaling challenges, such as increas- future success of computer systems [9]. ing wire delays and power consumption. This paper eval- In response to semiconductor scaling trends, we designed uates how well TRIPS meets this goal through a detailed a new architecture and microarchitecture intended to extend ISA and performance analysis.
    [Show full text]
  • Performance Evaluation of Vmware and Virtualbox
    2012 IACSIT Hong Kong Conferences IPCSIT vol. 29 (2012) © (2012) IACSIT Press, Singapore Performance Evaluation of VMware and VirtualBox Deepak K Damodaran1+, Biju R Mohan2, Vasudevan M S 3 and Dinesh Naik 4 Information Technology, National Institute of Technology Karnataka, India Abstract. Virtualization is a framework of dividing the resources of a computer into multiple execution environments. More specific it is a layer of software that provides the illusion of a real machine to multiple instances of virtual machines. Virtualization offers a lot of benefits including flexibility, security, ease to configuration and management, reduction of cost and so forth, but at the same time it also brings a certain degree of performance overhead. Furthermore, Virtual Machine Monitor (VMM) is the core component of virtual machine (VM) system and its effectiveness greatly impacts the performance of whole system. In this paper, we measure and analyze the performance of two virtual machine monitors VMware and VirtualBox using LMbench and IOzone, and provide a quantitative and qualitative comparison of both virtual machine monitors. Keywords: Virtualization, Virtual machine monitors, Performance. 1. Introduction Years ago, a problem aroused. How to run multiple operating systems on the same machine at the same time. The solution to this problem was virtual machines. Virtual machine monitor (VMM) the core part of virtual machines sits between one or more operating systems and the hardware and gives the illusion to each running operating system that it controls the machine. Behind the scenes, however the VMM actually is in control of the hardware, and must multiplex running operating systems across the physical resources of the machine.
    [Show full text]