Architecture of Parallel Computers CSC / ECE 506: BlueGene


Architecture of Parallel Computers CSC / ECE 506
BlueGene Architecture
Lecture 24, 7/31/2006
Dr Steve Hunter

BlueGene/L Program
• December 1999: IBM Research announced a 5-year, US$100M effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
– Advance the state of the art of scientific simulation.
– Advance the state of the art in computer design and software for capability and capacity markets.
• November 2001: Announced research partnership with Lawrence Livermore National Laboratory (LLNL).
• November 2002: Announced planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
• May 11, 2004: Four racks of DD1 (4096 nodes at 500 MHz) running Linpack at 11.68 TFlop/s. Ranked #4 on the 23rd Top500 list.
• June 2, 2004: Two racks of DD2 (1024 nodes at 700 MHz) running Linpack at 8.655 TFlop/s. Ranked #8 on the 23rd Top500 list.
• September 16, 2004: 8 racks running Linpack at 36.01 TFlop/s.
• November 8, 2004: 16 racks running Linpack at 70.72 TFlop/s. Ranked #1 on the 24th Top500 list.
• December 21, 2004: First 16 racks of BG/L accepted by LLNL.

BlueGene/L Program
• Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
– A joint development of IBM and DOE’s National Nuclear Security Administration (NNSA), installed at DOE’s Lawrence Livermore National Laboratory
• BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/)
– It has reached a Linpack benchmark performance of 280.6 TFlop/s (“teraflops”, or trillions of calculations per second) and remains the only system ever to exceed 100 TFlop/s.
– BlueGene holds the #1, #2, and #8 positions in the top 10.
• “Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications” (Overview of BG/L system architecture, IBM JRD)
– Design approach was to use a very high level of integration that made simplicity in packaging, design, and bring-up possible
– JRD issue available at: http://www.research.ibm.com/journal/rd49-23.html

BlueGene/L Program
• BlueGene is a family of supercomputers.
– BlueGene/L is the first step, intended as a multipurpose, massively parallel, and cost-effective supercomputer (12/04)
– BlueGene/P is the petaflop generation (12/06)
– BlueGene/Q is the third generation (~2010)
• Requirements for future generations
– Processors will be more powerful.
– Networks will be higher bandwidth.
– Applications developed on BlueGene/L will run well on BlueGene/P.

BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt
• 3D interconnect supports many scientific simulations, since nature as we see it is 3D

BlueGene/L Fundamentals
• Cellular architecture
– Large numbers of low-power, more efficient processors interconnected
• Rmax of 280.6 Teraflops
– Maximal LINPACK performance achieved
• Rpeak of 360 Teraflops
– Theoretical peak performance
• 65,536 dual-processor compute nodes
– 700 MHz IBM PowerPC 440 processors
– 512 MB memory per compute node, 16 TB in entire system
– 800 TB of disk space
• 2,500 square feet

Comparing Systems (Peak)
[Figure: Supercomputer peak performance. Peak speed (flops, 1E+2 to 1E+17) versus year introduced (1940 to 2010), from ENIAC through the CDC, CRAY, and ASCI machines to Blue Gene/L and projected petaflop and multi-petaflop systems. Technology eras marked: vacuum tubes, transistors, ICs, vectors, parallel vectors, MPPs. Doubling time = 1.5 yr.]

Comparing Systems (Byte/Flop)
System               Byte/Flop   Note                       Year
Red Storm            2.0                                    2003
Earth Simulator      2.0                                    2002
Intel Paragon        1.8                                    1992
nCUBE/2              1.0                                    1990
ASCI Red             1.0 (0.6)                              1997
T3E                  0.8                                    1996
BG/L                 1.5         0.75 (torus) + 0.75 (tree) 2004
Cplant               0.1                                    1997
ASCI White           0.1                                    2000
ASCI Q               0.05        Quadrics                   2003
ASCI Purple          0.1                                    2004
Intel Cluster        0.1         IB                         2004
Intel Cluster        0.008       GbE                        2003
Virginia Tech        0.16        IB                         2003
Chinese Acad of Sc   0.04        QsNet                      2003
NCSA - Dell          0.04        Myrinet                    2003

Comparing Systems (GFlops/Watt)
• Power efficiencies of recent supercomputers
– Blue: IBM machines
– Black: Other US machines
– Red: Japanese machines
[Figure: GFlops/Watt comparison. Source: IBM Journal of Research and Development]

Comparing Systems
                      ASCI White   ASCI Q   Earth Simulator   Blue Gene/L
Machine Peak (TF/s)   12.3         30       40.96             367
Total Mem. (TBytes)   8            33       10                32
Footprint (sq ft)     10,000       20,000   34,000            2,500
Power (MW)*           1            3.8      6-8.5             1.5
Cost ($M)             100          200      400               100
# Nodes               512          4096     640               65,536
MHz                   375          1000     500               700
* 10 megawatts is approximately the usage of 11,000 households

BG/L Summary of Performance Results
• DGEMM (Double-precision GEneral Matrix-Multiply):
– 92.3% of dual-core peak on 1 node
– Observed performance at 500 MHz: 3.7 GFlops
– Projected performance at 700 MHz: 5.2 GFlops (tested in lab up to 650 MHz)
• LINPACK:
– 77% of peak on 1 node
– 70% of peak on 512 nodes (1435 GFlops at 500 MHz)
• sPPM (simplified Piecewise Parabolic Method), UMT2000:
– Single-processor performance roughly on par with POWER3 at 375 MHz
– Tested on up to 128 nodes (also NAS Parallel Benchmarks)
• FFT (Fast Fourier Transform):
– Up to 508 MFlops on a single processor at 444 MHz (TU Vienna)
– Pseudo-ops performance (5N log N) at 700 MHz of 1300 MFlops (65% of peak)
• STREAM: impressive results even at 444 MHz:
– Tuned: Copy: 2.4 GB/s, Scale: 2.1 GB/s, Add: 1.8 GB/s, Triad: 1.9 GB/s
– Standard: Copy: 1.2 GB/s, Scale: 1.1 GB/s, Add: 1.2 GB/s, Triad: 1.2 GB/s
– At 700 MHz: would beat STREAM numbers for most high-end microprocessors
• MPI:
– Latency: < 4000 cycles (5.5 µs at 700 MHz)
– Bandwidth: full link bandwidth demonstrated on up to 6 links

BlueGene/L Architecture
• To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology
– This approach was chosen because of the performance/power advantage
– In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10
– Industry focus on performance / rack
» Performance / rack = Performance / watt × Watt / rack
» Watt / rack = 20 kW for power and thermal cooling reasons
• Power and cooling
– Using conventional techniques, a 360 TFlops machine would require 10-20 megawatts.
– BlueGene/L uses only 1.76 megawatts

Microprocessor Power Density Growth
[Figure]

System Power Comparison
• BG/L, 2048 processors: 20.1 kW
• 450 ThinkPads: 20.3 kW
(LS Mok, 4/2002)

BlueGene/L Architecture
• Networks were chosen with extreme scaling in mind
– Scale efficiently in terms of both performance and packaging
– Support very small messages
» As small as 32 bytes
– Include hardware support for collective operations
» Broadcast, reduction, scan, etc.
• Reliability, Availability and Serviceability (RAS) is another critical issue for scaling
– BG/L needs to be reliable and usable even at extreme scaling limits
– 20 fails per 1,000,000,000 hours = 1 node failure every 4.5 weeks
• System software and monitoring are also important to scaling
– BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model
– MPI is the dominant message-passing model, with hardware features added and parameters tuned for it

RAS (Reliability, Availability, Serviceability)
• System designed for RAS from top to bottom
– System issues
» Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
» Extensive data logging (voltage, temperature, recoverable errors, …) for failure forecasting
» Nearly no single points of failure
– Chip design
» ECC on all SRAMs
» All dataflow outside processors is protected by error-detection mechanisms
» Access to all state via noninvasive back door
– Low-power, simple design leads to higher reliability
– All interconnects have multiple error detection and correction coverage
» Virtually zero escape probability for link errors

BlueGene/L System
• 136.8 Teraflop/s on LINPACK (64K processors)
• 1 TF = 1,000,000,000,000 Flops
• Rochester Lab, 2005
[Figures: BlueGene/L System; Physical Layout of BG/L; Midplanes and Racks]

The Compute Chip
• System-on-a-chip (SoC)
• 1 ASIC
– 2 PowerPC processors
– L1 and L2 caches
– 4 MB embedded DRAM
– DDR DRAM interface and DMA controller
– Network connectivity hardware
– Control / monitoring equipment
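
Several of the headline figures quoted in these slides can be cross-checked with simple arithmetic. The Python sketch below is illustrative only and is not part of the lecture. It rests on two assumptions that the slides do not state outright: each PowerPC 440 core completes 4 floating-point operations per cycle (inferred from the DGEMM figure of 3.7 GFlops being about 92% of dual-core peak at 500 MHz), and the "20 fails per 1,000,000,000 hours" rate applies per node. All function names are hypothetical.

```python
# Back-of-envelope checks of figures quoted in the slides.
# ASSUMPTION: 4 flops per cycle per core (inferred from the DGEMM numbers,
# not stated in the slides).
FLOPS_PER_CYCLE_PER_CORE = 4
CORES_PER_NODE = 2  # "dual-processor compute nodes"


def node_peak_gflops(clock_mhz):
    """Theoretical peak of one compute node in GFlop/s."""
    return CORES_PER_NODE * FLOPS_PER_CYCLE_PER_CORE * clock_mhz / 1000.0


def system_peak_tflops(nodes, clock_mhz):
    """Theoretical peak of a whole machine in TFlop/s."""
    return nodes * node_peak_gflops(clock_mhz) / 1000.0


def efficiency(measured, peak):
    """Fraction of peak achieved (both arguments in the same unit)."""
    return measured / peak


def system_mtbf_weeks(nodes, fails_per_1e9_hours):
    """Mean time between node failures for the whole system, in weeks.

    ASSUMPTION: the quoted failure rate is per node and failures are
    independent, so the rates simply add up across nodes.
    """
    failures_per_hour = nodes * fails_per_1e9_hours / 1e9
    return 1.0 / failures_per_hour / (24 * 7)


def microseconds_to_cycles(latency_us, clock_mhz):
    """Convert a latency in microseconds to clock cycles."""
    return latency_us * clock_mhz


if __name__ == "__main__":
    # Node peak: 4.0 GFlop/s at 500 MHz, 5.6 GFlop/s at 700 MHz.
    print("node peak @500 MHz:", node_peak_gflops(500), "GFlop/s")
    print("node peak @700 MHz:", node_peak_gflops(700), "GFlop/s")

    # Full machine: 65,536 nodes at 700 MHz (compare with the 367 TF/s
    # in the comparison table).
    print("system peak:", system_peak_tflops(65_536, 700), "TFlop/s")

    # LINPACK on 512 nodes at 500 MHz: 1435 GFlop/s measured.
    peak_512 = 512 * node_peak_gflops(500)
    print("LINPACK efficiency, 512 nodes:", efficiency(1435, peak_512))

    # RAS: 20 fails per 1e9 hours per node, 65,536 nodes.
    print("system MTBF:", system_mtbf_weeks(65_536, 20), "weeks")

    # MPI latency: 5.5 microseconds at 700 MHz, expressed in cycles.
    print("MPI latency:", microseconds_to_cycles(5.5, 700), "cycles")
```

Under these assumptions the numbers line up with the slides: the full machine peaks at roughly 367 TFlop/s, the 512-node LINPACK run lands at about 70% of peak, the system-wide failure interval is about 4.5 weeks, and 5.5 µs corresponds to fewer than 4000 cycles at 700 MHz.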