Experiences in Developing Lightweight System Software for Massively Parallel Systems

Slide 1: Experiences in Developing Lightweight System Software for Massively Parallel Systems
Barney Maccabe, Professor, Computer Science, University of New Mexico
June 23, 2008, USENIX LASCO Workshop, Boston, MA

Slide 2: Simplicity
Butler Lampson, "Hints for Computer System Design," IEEE Software, vol. 1, no. 1, January 1984:
• Make it fast, rather than general or powerful
• Don't hide power
• Leave it to the client
"Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away." (A. de Saint-Exupéry)

Slide 3: MPP Operating Systems (section divider)

Slide 4: MPP OS Research
Timeline figure, 1991 to 2005: SUNMOS (message passing) on the Intel Paragon; Puma/Cougar (levels of trust) on Intel ASCI/Red; Cplant with Portals (commodity hardware); Catamount on Cray Red Storm (a re-engineering of the Puma LWK, allowing a direct comparison); Unified Config (Cplant features); JRTOS (application-driven real time).

Slide 5: Partitioning for Specialization (Red Storm)

Slide 6: Functional Partitioning
• Service nodes: authentication and authorization; job launch, job control, and accounting.
• Compute nodes: memory, processor, communication; the trusted compute kernel passes the user id to the file system; isolation through communication controls.
• I/O nodes: storage and external communication.

Slide 7: Compute Node Structure
• QK (Quintessential Kernel): mechanism. Provides communication and address spaces; fixed size, with the rest of memory left to the PCT; loads the PCT.
• PCT (Process Control Thread): policy. Trusted agent on the node; application load; task scheduling.
• Applications: work.
(Figure: a compute node with application tasks and the PCT running above the Q-Kernel.)

Slide 8: Trust Structure
Applications and runtime servers are not trusted; the PCT and the Q-Kernel (the OS) are trusted; the hardware sits beneath the Q-Kernel.

Slide 9: Is it good? It's not bad... other things are bad...
• Intel Paragon, 1993 (1,842 nodes), #1 6/1994–11/1994: OSF-1/AD was a failure on the compute nodes.
• Intel ASCI/Red, 1997 (9,000 processors), first teraflop system, #1 6/1997–11/2000, 40 hours MTBI: OS noise when using full-featured kernels (Livermore and LANL experiences).
• Red Storm, 2005 (Cray XT3), 10,000 processors.
• Historical problem: OS researchers only got to study broken systems at scale.

Slide 10: Compare to Blue Gene/L
BG/L runs applications on CNK on the compute nodes, with a trampoline to the I/O nodes (servers); Catamount runs tasks and the PCT on the Q-Kernel. Rough analogy: QK = hypervisor, PCT = Dom 0.

Slide 11: OS Noise (section divider)

Slide 12: OS Noise
• OS interference: the OS uses resources the application could have used in order to do things not directly related to what the application is doing. It does not include things like handling TLB misses; it may include message handling (if the application is not waiting).
• OS noise (jitter): the variation in OS interference.
• Fixed work (selfish): measure the variation in time to complete.
• Fixed time (FTQ): measure the variation in the amount of work completed.
• E.g., garbage collection: noise is usually there to do good things.

Slide 13: FTQ (Fixed Time Quantum)
FTQ measures application work completed in a fixed time quantum (benchmark by Matt Sottile, LANL, now at Google). Figure: FTQ on Catamount, showing the PCT quantum and the QK quantum. (Source: Larry Kaplan, Cray Inc., FastOS07, June 18, 2007, slide 8.)

Slide 14: FTQ on Linux
Figure: FTQ trace on Linux. (Source: Larry Kaplan, Cray Inc., FastOS07, June 18, 2007, slide 9.)

Slide 15: FTQ on ASC Purple
Figures: typical FTQ output from Purple, and the day after Purple's firmware upgrade (the VM is using cycles)… (Source: Terry Jones, LLNL.)
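To make the FTQ measurement described on slides 12 through 15 concrete, here is a minimal sketch of the idea, not the actual FTQ benchmark: count how many units of trivial work complete in each fixed time quantum, so that quanta with lower counts are quanta in which the OS took cycles from the application. The quantum length, sample count, and work loop below are illustrative assumptions.

```c
/*
 * Minimal fixed-time-quantum probe in the spirit of FTQ (illustrative only).
 * It records how much trivial work fits into each quantum; dips in the
 * per-quantum count indicate OS interference.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define QUANTUM_NS  1000000L   /* 1 ms quantum (assumed value) */
#define NUM_SAMPLES 10000      /* number of quanta to record */

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    static uint64_t work_done[NUM_SAMPLES];
    volatile uint64_t sink = 0;   /* keeps the work loop from being optimized away */

    for (int i = 0; i < NUM_SAMPLES; i++) {
        uint64_t start = now_ns();
        uint64_t count = 0;
        /* Do trivial, fixed-cost work until the quantum expires. */
        while (now_ns() - start < QUANTUM_NS) {
            sink += count;
            count++;
        }
        work_done[i] = count;     /* lower count => the OS stole cycles */
    }

    for (int i = 0; i < NUM_SAMPLES; i++)
        printf("%d %llu\n", i, (unsigned long long)work_done[i]);
    return 0;
}
```

Plotting the per-quantum counts, as in the FTQ figures above, makes periodic kernel activity show up as regular dips.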
Slide 16: What's the big deal?
Figures: a diagram of processes p0 through p6; the probability of encountering noise as a function of the number of nodes, following y = 1 - 0.99^x (a 1% chance of noise per node); and collective time versus number of nodes for noise probability = 1% with service time = 20 µs, probability = 5% with service time = 10 µs, and probability = 10% with service time = 10 µs.

Slide 17: Noise does matter
Injected noise with high frequency and low duration (2,500 Hz, 25 µs; 2.5% total noise injected). Figures: percent slowdown versus number of nodes for POP, Sage, and CTH. (Source: Kurt Ferreira, UNM.)

Slide 18: Noise does matter, really
Injected noise with low frequency and high duration (10 Hz, 2,500 µs; again 2.5% total noise injected). Figures: percent slowdown versus number of nodes for POP, Sage, and CTH; the slowdowns are far larger than in the high-frequency case. (Source: Kurt Ferreira, UNM.)

Slide 19: Dealing with noise
• Minimize noise: lots of short noise is better than small amounts of long noise; make "noisy" services optional.
• Block-synchronize system services: synchronizing tens of thousands of nodes is hard.
• Hardware support, both for noisy operations (e.g., a global clock) and for operations affected by noise (e.g., collective offload).
• Develop noise-tolerant algorithmic approaches, the equivalent of latency-tolerant and fault-oblivious approaches (i.e., accept that noise will eventually dominate all other things).
• Define how applications can be noise tolerant (e.g., avoid ALLREDUCE).
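The curves on slide 16 follow from a simple independence argument: if each node is perturbed by noise during a collective with probability p, the chance that the collective is delayed by at least one node is 1 - (1 - p)^N, which is the y = 1 - 0.99^x curve for p = 1%. The sketch below tabulates that model; the per-node probability and service time are taken from the slide's legend, and the expected-delay estimate is a simplification, not necessarily the model used to generate the slide's plot.

```c
/*
 * Noise-amplification model behind slide 16 (illustrative).
 * P(N) = 1 - (1 - p)^N is the probability that at least one of N nodes
 * experiences a noise event during a collective; the expected extra time
 * is roughly P(N) * service_time (a deliberate simplification).
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double p = 0.01;           /* per-node noise probability (1%) */
    const double service_us = 20.0;  /* noise service time in microseconds */

    for (int nodes = 1; nodes <= 16384; nodes *= 2) {
        double p_hit = 1.0 - pow(1.0 - p, (double)nodes);
        printf("%6d nodes: P(collective hits noise) = %.4f, "
               "expected extra time ~ %.1f us\n",
               nodes, p_hit, p_hit * service_us);
    }
    return 0;
}
```

Even at a 1% per-node probability, the chance of hitting noise approaches 1 within a few hundred nodes, which is why interference that looks negligible on one node dominates at scale.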
Slide 20: Linux
What was the question?

Slide 21: The 800 lb Penguin (on a diet)
Figure (built up over several animation steps): the full Linux stack (rlogin, emacs, ssh, telnet, email, DNS, MPI, TCP/IP, gcc, glibc, the kernel, I/O buses, and network, video, and disk devices) is progressively stripped down to a compute-node diet: no "broken" hardware, a limited number of devices, and minimal services (e.g., CNL, ZeptoOS).
"Linux's cleverness is not in the software, but in the development model." Rob Pike, "Systems Software Research is Irrelevant," 2/2000.

Slide 22: Building Compute Node Linux
• BusyBox (embedded)
• No remote login
• Add "capacity" features as needed by the application
• Libraries
• NFS, LDAP
Figures: FTQ on Linux versus FTQ evolving on CNL; complex interactions in changing the quantum. (Source: Larry Kaplan, Cray Inc., FastOS07, June 18, 2007, slides 9–10.)

Slide 23: CTH on Catamount and CNL
CTH scaling study (shaped charge, 1.75M cells per process). Figure: time per timestep (seconds) versus number of nodes for CNL and the LWK. (Source: Courtenay Vaughan, Sandia. Caution: preliminary results.)

Slide 24: Partisn on Catamount and CNL
Partisn scaling study (SN problem, 24x24x24 cells per process). Figure: time (seconds) versus number of nodes for CNL and the LWK; at 8,096 nodes, 213 s on CNL versus 143 s on the LWK, roughly a 50% increase. (Source: Courtenay Vaughan, Sandia. Caution: preliminary results.)

Slide 25: Keeping up with Linux
Figure: lines of code in the Linux kernel (millions) versus days since the release of 1.0.0 (14 March 1994), broken out by the 1.2, 2.0, 2.2, 2.4, and 2.6 kernel series. (Data from Oded Koren.)

Slide 26: Catamount is Nimble
• The source code is small enough that developers can keep it in their heads; Catamount is under 100,000 lines of code.
• Early example: the dual processors on ASCI/Red. Heater mode; message co-processor mode (the designed and expected mode of use); compute co-processor mode, aka "stunt mode"; and virtual node mode, a six man-month effort (implementing FIFOs) that became the standard mode.

Slide 27: Adding Multicore Support
SMARTMAP (Brightwell, Pedretti, and Hudson): map every core's memory view into every other core's memory map; almost threads, almost processes. It modified about 20 lines of kernel code, plus an in-line function (3 lines of code) to access another core's memory. Open MPI was modified at two levels: the Byte Transport Layer (BTL), which requires two copies, and the Message Transport Layer (MTL), with message matching in Portals. Less than a man-month to implement. (An illustrative sketch of the cross-core address computation appears at the end of this section.)

Slide 28: SMARTMAP Performance
Figure: ping-pong time (microseconds) versus message size (bytes) for BTL generic Portals (interrupt, 2 copies), MTL generic Portals (interrupt, 1 copy), BTL SMARTMAP (2 copies), and MTL SMARTMAP (1 copy). (Source: Ron Brightwell, Sandia.)

Slide 29: Why Linux... Why not...
• Why Linux: community (easier to hire Linux specialists; lots of eyes to find solutions, and others care); environment (performance tools, development tools and compilers, libraries); Highlander: there will be one.
• Why not: one is the loneliest number (diversity is a good thing); Linux is a moving target; it is hard to get changes into Linux; HPC is not the goal; shrinking Linux eliminates parts of the environment; when does it stop being Linux? Catamount as a virtualization layer.

Slide 30: Lightweight Storage Systems (section divider)

Slide 31: Basic Idea
• Apply the lightweight design philosophy to storage systems.
• Enforce access control: authentication, capabilities with revocation.
• Enable consistency: lightweight transactions.
• Expose the full power of the storage resources to applications; applications manage bandwidth to storage.
• "Off line" metadata updates ("meta bots").

Slide 32: File/Object Creates
Figures: Lustre file creates versus LWFS object creates, throughput (ops/sec) versus number of clients for 1, 2, 4, 8, and 16 I/O servers; note the different scales on the y axes. (Source: Ron Oldfield, Sandia.)

Slide 33: Checkpoints
1.
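As referenced under slide 27, the following is an illustrative sketch of a SMARTMAP-style cross-core address computation. The fixed 512 GB slot per core (one x86-64 top-level page-table entry, hence the shift by 39) and the helper name are assumptions drawn from the published SMARTMAP description, not code from this deck, and the program only demonstrates the address arithmetic; real cross-core loads and stores require the kernel to install the corresponding page-table mappings.

```c
/*
 * SMARTMAP-style address computation (illustrative sketch, not Catamount code).
 * The idea from slide 27: every core's address space is mapped into every
 * other core's virtual address space at a fixed offset, so "remote" memory
 * can be reached with ordinary loads and stores once the address is computed.
 */
#include <stdio.h>
#include <stdint.h>

/* Address at which an object located at 'vaddr' in the process on 'core'
 * would appear in the caller's address space, assuming one 512 GB
 * top-level page-table slot per core (an assumption for this sketch). */
static inline void *remote_address(unsigned core, void *vaddr)
{
    uintptr_t slot = ((uintptr_t)(core + 1)) << 39;
    return (void *)(slot | (uintptr_t)vaddr);
}

int main(void)
{
    int local = 42;   /* stand-in for a message buffer owned by this rank */

    for (unsigned core = 0; core < 4; core++)
        printf("an object at %p on core %u would appear locally at %p\n",
               (void *)&local, core, remote_address(core, &local));
    return 0;
}
```

The appeal on a lightweight kernel is that the mapping is static and identical on every core, so the translation is a few lines of arithmetic rather than a system call, which is consistent with the single-copy MTL numbers on slide 28.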