The HPC Challenge Benchmark: a Candidate for Replacing LINPACK

2007 SPEC Benchmark Workshop, January 21, 2007, Radisson Hotel Austin North

The HPC Challenge Benchmark: A Candidate for Replacing LINPACK in the TOP500?
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Outline
♦ Look at LINPACK
♦ Brief discussion of the DARPA HPCS Program
♦ HPC Challenge Benchmark
♦ Answer the question

What Is LINPACK?
♦ Most people think LINPACK is a benchmark.
♦ LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.
♦ The project had its origins in 1974.
♦ LINPACK: "LINear algebra PACKage"
  ¾ Written in Fortran 66

Computing in 1974
♦ High-performance computers:
  ¾ IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
♦ Fortran 66
♦ Run efficiently
♦ BLAS (Level 1)
  ¾ Vector operations
♦ Trying to achieve software portability
♦ The LINPACK package was released in 1979
  ¾ About the time of the Cray 1

The Accidental Benchmarker
♦ Appendix B of the Linpack Users' Guide
  ¾ Designed to help users extrapolate execution time for the Linpack software package
♦ First benchmark report from 1977
  ¾ Cray 1 to DEC PDP-10
  ¾ Dense matrices, linear systems, least squares problems, singular values

LINPACK Benchmark?
♦ The LINPACK Benchmark is a measure of a computer's floating-point rate of execution for solving Ax = b.
  ¾ It is determined by running a computer program that solves a dense system of linear equations.
♦ Information is collected and available in the LINPACK Benchmark Report.
♦ Over the years the characteristics of the benchmark have changed a bit.
  ¾ In fact, there are three benchmarks included in the Linpack Benchmark report.
♦ LINPACK Benchmark since 1977
  ¾ Dense linear system solved with LU factorization using partial pivoting
  ¾ Operation count: 2/3 n³ + O(n²)
  ¾ Benchmark measure: MFlop/s
  ¾ The original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.
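None of the following code is from the slides; it is a minimal, self-contained C sketch of the kind of measurement a Linpack-style run makes: factor a dense matrix with LU and partial pivoting, solve Ax = b, time it, and convert the nominal 2/3 n³ + 2 n² operation count into an MFlop/s rate. The function name lu_solve and the problem size are illustrative choices; the real benchmark uses the reference dgefa/dgesl (or HPL) code rather than this toy loop nest.

```c
/* Toy Linpack-style timing (illustrative only, not the benchmark code):
 * LU factorization with partial pivoting, triangular solves, and an
 * MFlop/s figure based on the nominal 2/3*n^3 + 2*n^2 operation count. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Factor the row-major n x n matrix A in place (full row swaps, pivot
 * indices in piv), then solve A x = b, overwriting b with x. */
static void lu_solve(int n, double *A, double *b, int *piv)
{
    for (int k = 0; k < n; k++) {
        int p = k;                                   /* find pivot row */
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i*n + k]) > fabs(A[p*n + k])) p = i;
        piv[k] = p;
        if (p != k)                                  /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = A[k*n + j]; A[k*n + j] = A[p*n + j]; A[p*n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {            /* eliminate below the pivot */
            A[i*n + k] /= A[k*n + k];
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }
    for (int k = 0; k < n; k++) {                    /* apply row interchanges to b */
        double t = b[k]; b[k] = b[piv[k]]; b[piv[k]] = t;
    }
    for (int k = 0; k < n; k++)                      /* forward solve L y = P b */
        for (int i = k + 1; i < n; i++) b[i] -= A[i*n + k] * b[k];
    for (int k = n - 1; k >= 0; k--) {               /* back solve U x = y */
        for (int j = k + 1; j < n; j++) b[k] -= A[k*n + j] * b[j];
        b[k] /= A[k*n + k];
    }
}

int main(void)
{
    int n = 1000;      /* the classic Linpack-100 run uses n = 100; a larger
                        * n gives a time that clock() can resolve here */
    double *A = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n);
    int *piv = malloc(sizeof(int) * n);
    srand(1);
    for (int i = 0; i < n * n; i++) A[i] = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX - 0.5;

    clock_t t0 = clock();
    lu_solve(n, A, b, piv);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double flops = 2.0 / 3.0 * n * n * n + 2.0 * n * n;
    printf("n = %d  time = %.6f s  rate = %.1f MFlop/s\n",
           n, secs, flops / secs / 1e6);
    free(A); free(b); free(piv);
    return 0;
}
```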
For Linpack with n = 100
♦ Not allowed to touch the code.
♦ Only set the optimization in the compiler and run.
♦ Provides a historical look at computing.
♦ Table 1 of the report (52 pages of the 95-page report)
  ¾ http://www.netlib.org/benchmark/performance.pdf

Linpack Benchmark Over Time
♦ In the beginning there was only the Linpack 100 Benchmark (1977)
  ¾ n = 100 (80 KB); a size that would fit in all the machines
  ¾ Fortran; 64-bit floating point arithmetic
  ¾ No hand optimization (only compiler options); source code available
♦ Linpack 1000 (1986)
  ¾ n = 1000 (8 MB); wanted to see higher performance levels
  ¾ Any language; 64-bit floating point arithmetic
  ¾ Hand optimization OK
♦ Linpack Table 3 (Highly Parallel Computing, 1991) (Top500, 1993)
  ¾ Any size (n as large as you can; n = 10⁶ means 8 TB and roughly 6 hours)
  ¾ Any language; 64-bit floating point arithmetic
  ¾ Hand optimization OK
  ¾ Strassen's method not allowed (confuses the operation count and rate)
  ¾ Reference implementation available
♦ In all cases results are verified by checking that ||Ax − b|| / (||A|| ||x|| n ε) = O(1) (sketched in code below)
♦ Operation count: factorization 2/3 n³ + O(n²); solve 2 n²

Motivation for Additional Benchmarks
♦ From the Linpack Benchmark and the Top500: "no single number can reflect overall performance"
♦ Good
  ¾ One number
  ¾ Simple to define & easy to rank
  ¾ Allows problem size to change with machine and over time
  ¾ Stresses the system with a run of a few hours
♦ Bad
  ¾ Emphasizes only "peak" CPU speed and number of CPUs
  ¾ Does not stress local bandwidth
  ¾ Does not stress the network
  ¾ Does not test gather/scatter
  ¾ Ignores Amdahl's Law (only does weak scaling)
♦ Ugly
  ¾ MachoFlops
  ¾ Benchmarketeering hype
♦ Clearly need something more than Linpack
♦ HPC Challenge Benchmark
  ¾ Test suite stresses not only the processors, but the memory system and the interconnect.
  ¾ The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack.

At the Time the Linpack Benchmark Was Created …
♦ If we think about computing in the late 70's, perhaps the LINPACK benchmark was a reasonable thing to use.
♦ The memory wall was not so much a wall as a step.
♦ In the 70's, things were more in balance
  ¾ The memory kept pace with the CPU
  ¾ n cycles to execute an instruction, n cycles to bring in a word from memory
♦ Showed compiler optimization
♦ Today it provides a historical base of data

Many Changes
♦ Many changes in our hardware over the past 30 years
  ¾ Superscalar, vector, distributed memory, shared memory, multicore, …
  [Chart: Top500 systems by architecture class (single processor, SMP, MPP, SIMD, constellations, cluster), 1993-2006]
♦ While there have been some changes to the Linpack Benchmark, not all of them reflect the advances made in the hardware.
♦ Today's memory hierarchy is much more complicated.
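As a concrete illustration of the verification test quoted in the "Linpack Benchmark Over Time" slide above, here is a short C sketch (not from the slides). It computes the scaled residual using infinity norms, which is one reasonable choice; the production benchmarks use closely related scalings. The caller is assumed to keep unmodified copies of A and b, since a solver typically overwrites them.

```c
/* Sketch of the verification test: the scaled residual
 * ||A x - b|| / (||A|| ||x|| n eps) should be O(1) for a correct solve.
 * Infinity norms are used here; the real benchmarks use closely
 * related scalings. */
#include <float.h>
#include <math.h>

double scaled_residual(int n, const double *A, const double *x, const double *b)
{
    double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;
    for (int i = 0; i < n; i++) {
        double ri = -b[i], rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            ri     += A[i*n + j] * x[j];       /* accumulate (A x - b)_i        */
            rowsum += fabs(A[i*n + j]);        /* row sum for the inf-norm of A */
        }
        if (fabs(ri)   > rnorm) rnorm = fabs(ri);
        if (rowsum     > anorm) anorm = rowsum;
        if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
    }
    return rnorm / (anorm * xnorm * n * DBL_EPSILON);
}
```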
High Productivity Computing Systems (HPCS)
Goal: provide a generation of economically viable high productivity computing systems for the national security and industrial user community (2010; started in 2002).
Focus on:
  ¾ Real (not peak) performance of critical national security applications
    ¾ Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
  ¾ Programmability: reduce the cost and time of developing applications
  ¾ Software portability and system robustness
[Diagram: HPCS program focus areas (performance characterization & prediction, programming models, hardware technology, software technology, system architecture), framed by industry R&D and analysis & assessment]
Fill the critical technology and capability gap: today (late-80's HPC technology) ... to ... the future (quantum/bio computing).

HPCS Roadmap
  ¾ 5 vendors in phase 1; 3 vendors in phase 2; 1+ vendors in phase 3
  ¾ MIT Lincoln Laboratory leading the measurement and evaluation team
  ¾ Phase 1, concept study: $20M (2002); Phase 2, advanced design & prototypes: $170M (2003-2005); Phase 3, full-scale development of petascale systems: ~$250M each (2006-2010)
[Diagram: roadmap from concept study to advanced design & prototypes to full-scale development, with a new evaluation framework, test evaluation framework, and validated procurement evaluation methodology from the evaluation team]

Predicted Performance Levels for Top500
[Chart: measured and predicted Linpack performance (TFlop/s, log scale) for the Top500 Total, #1, #10, and #500, from June 2003 through June 2009]

A PetaFlop Computer by the End of the Decade
♦ At least 10 companies are developing a Petaflop system in the next decade.
  ¾ Cray, IBM, Sun (2+ Pflop/s Linpack; 6.5 PB/s data streaming BW; 3.2 PB/s bisection BW; 64,000 GUPS)
  ¾ Dawning, Galactic, Lenovo (Chinese companies)
  ¾ Hitachi, NEC, Fujitsu (Japanese; "Life Simulator" (10 Pflop/s); Keisoku project, $1B over 7 years)
  ¾ Bull

PetaFlop Computers in 2 Years!
♦ Oak Ridge National Lab
  ¾ Planned for 4th quarter 2008 (1 Pflop/s peak)
  ¾ From Cray's XT family; interconnect based on Cray XT technology
  ¾ Uses quad-core chips from AMD: 23,936 chips, each a quad-core processor (95,744 processors)
  ¾ Each processor does 4 flops/cycle; cycle time of 2.8 GHz
  ¾ Hypercube connectivity
  ¾ 6 MW, 136 cabinets
♦ Los Alamos National Lab
  ¾ Roadrunner (2.4 Pflop/s peak)
  ¾ Uses IBM Cell and AMD processors
  ¾ 75,000 cores

HPC Challenge Goals
♦ To examine the performance of HPC architectures using kernels with more challenging memory access patterns than the Linpack Benchmark
  ¾ The Linpack benchmark works well on all architectures ― even cache-based, distributed-memory multiprocessors ― due to:
    1. Extensive memory reuse
    2. Scalable with respect to the amount of computation
    3. Scalable with respect to the communication volume
    4. Extensive optimization of the software
♦ To complement the Top500 list
♦ Stress CPU, memory system, interconnect
♦ Allow for optimizations
  ¾ Record the effort needed for tuning
  ¾ Base run requires MPI and BLAS
♦ Provide verification & archiving of results

Tests on Single Processor and System
● Local: only a single processor (core) is performing computations.
● Embarrassingly Parallel: each processor (core) in the entire system is performing computations, but they do not communicate with each other explicitly.
● Global: all processors in the system are performing computations and they explicitly communicate with each other.

HPC Challenge Benchmark
Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest.
1. LINPACK (HPL) ― MPI Global (Ax = b)
2. STREAM ― Local; single CPU
   *STREAM ― Embarrassingly parallel
3. PTRANS (A ← A + B^T) ― MPI Global
4. RandomAccess ― Local; single CPU
   *RandomAccess ― Embarrassingly parallel
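To make the "Local" category concrete, here is a single-threaded C sketch (illustrative sizes and a simplified setup, not the HPCC reference code, which adds timing rules, verification, and MPI variants) of the two locally run kernels just listed: the STREAM triad, which measures sustainable memory bandwidth, and the RandomAccess update loop, which reports giga-updates per second (GUP/s).

```c
/* Single-threaded sketch of two HPCC "Local" kernels (illustrative only). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1 << 24)            /* STREAM vectors: 16M doubles each      */
#define TBITS 24                   /* RandomAccess table: 2^24 64-bit words */
#define POLY  UINT64_C(7)          /* feedback value of the HPCC-style RNG  */

int main(void)
{
    /* STREAM triad: a[i] = b[i] + s*c[i]; bandwidth counts 3 doubles moved
     * per iteration (load b, load c, store a). */
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
           *c = malloc(N * sizeof *c), s = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++) a[i] = b[i] + s * c[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("triad: %.2f GB/s (check %.1f)\n",
           3.0 * sizeof(double) * N / secs / 1e9, a[N - 1]);

    /* RandomAccess: XOR pseudo-random values into random table locations;
     * the rate is reported in GUP/s (billions of updates per second). */
    uint64_t size = UINT64_C(1) << TBITS, nupdate = 4 * size;
    uint64_t *tab = malloc(size * sizeof *tab);
    for (uint64_t i = 0; i < size; i++) tab[i] = i;

    uint64_t ran = 1;
    t0 = clock();
    for (uint64_t i = 0; i < nupdate; i++) {
        ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);  /* LFSR-style step   */
        tab[ran & (size - 1)] ^= ran;                      /* read-modify-write */
    }
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("RandomAccess: %.4f GUP/s (check %llu)\n",
           (double)nupdate / secs / 1e9, (unsigned long long)tab[0]);

    free(a); free(b); free(c); free(tab);
    return 0;
}
```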