High-Performance Power-Efficient x86-64 Server and Desktop Processors Using the Core Codenamed "Bulldozer"

HIGH-PERFORMANCE POWER-EFFICIENT X86-64 SERVER AND DESKTOP PROCESSORS
Using the core codenamed "Bulldozer"
Sean White, 19 August 2011

THE DIE | Overview, Floorplan, and Technology

THE DIE | Photograph
(Die photograph.)

THE DIE | What's on it?
- Eight "Bulldozer" cores: high-performance, power-efficient AMD64 cores
  - First generation of a new execution core family from AMD (Family 15h)
  - Two cores in each Bulldozer module
  - 128 KB of Level 1 Data Cache: 16 KB/core, 64-byte cacheline, 4-way associative, write-through
  - 256 KB of Level 1 Instruction Cache: 64 KB/Bulldozer module, 64-byte cacheline, 2-way associative
  - 8 MB of Level 2 Cache: 2 MB/Bulldozer module, 64-byte cacheline, 16-way associative
- Integrated northbridge which controls:
  - 8 MB of Level 3 Cache: 64-byte cacheline, 16-way associative, MOESI
  - Two 72-bit wide DDR3 memory channels
  - Four 16-bit receive / 16-bit transmit HyperTransport™ links

THE DIE | Floorplan (315 mm²)
(Floorplan diagram: four Bulldozer modules, each paired with its 2 MB L2 cache, surround the northbridge and four 2 MB L3 cache blocks in the center of the die; the DDR3 PHY, four HyperTransport™ PHYs, and miscellaneous I/O line the die edges.)

THE DIE | Process Technology
- 32 nm Silicon-On-Insulator (SOI) Hi-K Metal Gate (HKMG) process from GlobalFoundries
- 11-metal-layer stack, low-k dielectric
- Dual strain liners and eSiGe to improve performance
- Multiple VT (HVT, RVT, LVT) and long-channel transistors

  Layer     Type   Pitch
  CPP*      -      130 nm
  M01-M03   1x     104 nm
  M04-M05   1.3x   130 nm
  M06-M08   2x     208 nm
  M09       4x     416 nm
  M10-M11   16x    1.6 µm
  * Contacted poly pitch

THE BULLDOZER CORE | A few highlights

THE BULLDOZER CORE | What's different about this core?
A Bulldozer module contains two cores.
- Functions with high utilization that can't be shared without significant compromises exist per core (e.g., integer pipelines, Level 1 data caches).
- Other functions are shared between the cores (e.g., floating point pipelines, Level 2 cache).
- This allows the two cores to each use a larger, higher-performance function (e.g., the floating point unit) as they need it, for less total die area than building separate, smaller copies for each core.
(Module diagram: shared fetch and decode feed two integer schedulers and one FP scheduler; each core has its own integer pipelines and L1 data cache, while the two 128-bit FMAC pipelines and the L2 cache are shared.)
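Because the two cores of a module share the FPU and the L2 cache, where software places its threads matters. Below is a minimal sketch (Linux, C) of pinning two threads onto one module's cores; it assumes, purely for illustration, that the sibling cores of module m are enumerated as logical CPUs 2m and 2m+1. A real system should query the CPU topology (sysfs, hwloc, etc.) rather than rely on that numbering.

/* Minimal sketch: pin two threads onto the two cores of one Bulldozer
 * module using Linux CPU affinity.  The "module m -> CPUs 2m, 2m+1"
 * mapping is an assumption for illustration only.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    pin_to_cpu(cpu);   /* the two threads now share one module's FPU and L2 */
    printf("thread running on CPU %d\n", cpu);
    return NULL;
}

int main(void)
{
    int module = 0;                               /* target module index (assumption) */
    int cpus[2] = { 2 * module, 2 * module + 1 }; /* assumed sibling enumeration */
    pthread_t t[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Whether co-locating two threads on one module (sharing L2) or spreading them across modules (each getting a full FPU) is faster depends on the workload; the sketch only shows the mechanics.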
THE BULLDOZER CORE | Microarch: Shared Frontend
(Core block diagram, repeated on the following slides: prediction queue, instruction cache, L1/L2 BTBs, fetch queue, microcode ROM, and four x86 decoders feed two integer schedulers and a shared FP scheduler; each core has its own address-generation and execution pipelines, load/store unit, L1 DTLB, and L1 data cache; the two 128-bit FMAC pipes, data prefetcher, and L2 cache are shared.)
- Decoupled predict and fetch pipelines
- Prediction-directed instruction prefetch
- Instruction cache: 64 KB, 2-way
- 32-byte fetch
- Instruction TLBs:
  - Level 1: 72 entries, mixed page sizes
  - Level 2: 512 entries, 4-way, 4 KB pages
- Branch fusion

THE BULLDOZER CORE | Microarch: Separate Integer Units
- Thread retire logic
- Physical Register File (PRF)-based register renaming
- Unified scheduler per core
- Way-predicted 16 KB Level 1 data cache
- Data TLB: 32-entry, fully associative
- Fully out-of-order load/store
  - 2 x 128-bit loads/cycle
  - 1 x 128-bit store/cycle
  - 40-entry load queue, 24-entry store queue

THE BULLDOZER CORE | Microarch: Shared FPU
- Co-processor organization; reports completion back to the parent core
- Dual 128-bit Floating Point Multiply/Accumulate (FMAC) pipelines
- Dual 128-bit packed-integer pipelines
- PRF-based register renaming
- A single floating point scheduler serves both cores

THE BULLDOZER CORE | New Floating Point Instructions
- SSSE3, SSE4.1, SSE4.2 (AMD and Intel): video encoding and transcoding; biometrics algorithms; text-intensive applications
- AESNI, PCLMULQDQ (AMD and Intel): applications using AES encryption; secure network transactions; disk encryption (MSFT BitLocker); database encryption (Oracle); cloud security
- AVX (AMD and Intel): floating-point-intensive applications, e.g. signal processing/seismic, multimedia, scientific simulations, financial analytics, 3D modeling
- FMA4 (AMD unique): HPC applications
- XOP (AMD unique): numeric applications; multimedia applications; algorithms used for audio/radio
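Software that wants to use these extensions should test for them at run time. Below is a minimal sketch of CPUID-based feature detection in C using the GCC/Clang <cpuid.h> helper; the bit positions follow the commonly published CPUID layout and should be checked against vendor documentation before being relied on.

/* Minimal sketch: run-time detection of the instruction-set extensions
 * listed above via CPUID (GCC/Clang <cpuid.h>).
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {   /* standard leaf 1, ECX */
        printf("SSSE3     : %u\n", (ecx >> 9)  & 1);
        printf("SSE4.1    : %u\n", (ecx >> 19) & 1);
        printf("SSE4.2    : %u\n", (ecx >> 20) & 1);
        printf("AESNI     : %u\n", (ecx >> 25) & 1);
        printf("PCLMULQDQ : %u\n", (ecx >> 1)  & 1);
        printf("AVX       : %u\n", (ecx >> 28) & 1);
    }
    /* AMD-specific extensions are reported in extended leaf 0x80000001. */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        printf("XOP       : %u\n", (ecx >> 11) & 1);
        printf("FMA4      : %u\n", (ecx >> 16) & 1);
    }
    return 0;
}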
THE BULLDOZER CORE | Microarch: Shared Level 2 Cache
- 16-way unified L2 cache
- L2 TLB and page walker: 1024-entry, 8-way; services both instruction and data requests
- Multiple data prefetchers
- 23 outstanding L2 cache misses for memory-system concurrency

THE NORTHBRIDGE | A few of its highlights

THE NORTHBRIDGE | Simplified Block Diagram
(Block diagram: the four Bulldozer modules with their L2 caches and APICs connect through the System Request Queue (SRQ) and crossbar to the 8 MB L3 cache arrays and controller, two DRAM controllers with DDR PHYs, and four HyperTransport™ controllers with HT/PCI-E PHYs.)

THE NORTHBRIDGE | SRQ, Crossbar, Memory Controller
System Request Queue and Crossbar: the northbridge SRQ and crossbar form the traffic hub of the design, routing requests from any requestor to the appropriate destination, for example a core request to the L3 cache, or a request arriving from another die or chip over a HyperTransport™ link and bound for the memory controller.
Memory Controller: each memory controller enforces cache coherency and proper ordering of operations for the memory space it "owns", distributing this responsibility across multi-die or multi-chip systems. This lets such systems scale rather than bottleneck at a single coherency/ordering choke point. The memory controller also issues the read/write requests to the DRAM controllers.

THE NORTHBRIDGE | DRAM Controllers, L3 Cache, Probe Filter
DDR3 DRAM controllers: two per die. They support Unbuffered (UDIMM), Registered (RDIMM) and Load-Reduced (LRDIMM) memory at 1.50 V, 1.35 V and 1.25 V depending on the processor package, with frequencies up to DDR3-1866. New power-saving features include low-voltage DDR3, fast self-refresh entry with internal clocks gated, aggressive precharge power-down, tristating of address/command/bank signals on chip-select deassertion, activity throttling for thermal or power reduction, and putting receive circuits in standby when no reads are active.
L3 cache and probe filter: up to 8 MB of Level 3 cache, normally shared, which can be partitioned; L3 data is ECC protected (single-bit correct, double-bit detect).
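As a quick sanity check on the figures quoted in this deck, the sketch below totals the per-die cache capacities and computes the peak bandwidth of the two DDR3 channels at the maximum quoted DDR3-1866 rate. The bandwidth figure is an upper bound: the supported speed depends on the processor package, and only 64 of each channel's 72 bits carry data (the rest are ECC).

/* Minimal sketch: per-die cache totals and peak DDR3 bandwidth derived
 * from the numbers in the slides above (assumptions noted inline).
 */
#include <stdio.h>

int main(void)
{
    const int modules    = 4;                 /* Bulldozer modules per die            */
    const int l1d_kb     = 16 * 2 * modules;  /* 16 KB per core, 2 cores per module   */
    const int l1i_kb     = 64 * modules;      /* 64 KB per module                     */
    const int l2_kb      = 2048 * modules;    /* 2 MB per module                      */
    const int l3_kb      = 8192;              /* 8 MB shared L3                       */

    const double mts        = 1866.0;         /* DDR3-1866 transfers/s, in millions   */
    const int    channels   = 2;
    const int    data_bytes = 8;              /* 64 data bits of each 72-bit channel  */
    const double peak_gbs   = mts * 1e6 * data_bytes * channels / 1e9;

    printf("Per die: L1D %d KB, L1I %d KB, L2 %d KB, L3 %d KB\n",
           l1d_kb, l1i_kb, l2_kb, l3_kb);
    printf("Peak DRAM bandwidth: %.1f GB/s across %d channels (upper bound)\n",
           peak_gbs, channels);
    return 0;
}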