Early Evaluation of the Cray XT5

Total Page:16

File Type:pdf, Size:1020Kb

Early Evaluation of the Cray XT5 Early Evaluation of the Cray XT5 Patrick Worley, Richard Barrett, Jeffrey Kuehn Oak Ridge National Laboratory CUG 2009 May 6, 2009 Omni Hotel at CNN Center Atlanta, GA Acknowledgements • Research sponsored by the Climate Change Research Division of the Office of Biological and Environmental Research, by the Fusion Energy Sciences Program, and by the Office of Mathematical, Information, and Computational Sciences, all in the Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. • This research used resources (Cray XT4 and Cray XT5) of the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. • These slides have been authored by a contractor of the U.S. Government under contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes. 2 Prior CUG System Evaluation Papers 1. CUG 2008: The Cray XT4 Quad-core : A First Look (Alam, Barrett, Eisenbach, Fahey, Hartman-Baker, Kuehn, Poole, Sankaran, and Worley) 2. CUG 2007: Comparison of Cray XT3 and XT4 Scalability (Worley) 3. CUG 2006: Evaluation of the Cray XT3 at ORNL: a Status Report (Alam, Barrett, Fahey, Messer, Mills, Roth, Vetter, and Worley) 4. CUG 2005: Early Evaluation of the Cray XD1 (Fahey, Alam, Dunigan, Vetter, and Worley) 5. CUG 2005: Early Evaluation of the Cray XT3 at ORNL (Vetter, Alam, Dunigan, Fahey, Roth, and Worley) 6. CUG 2004: ORNL Cray X1 Evaluation Status Report (Agarwal, et al) 7. CUG 2003: Early Evaluation of the Cray X1 at ORNL (Worley and Dunigan) (and subsystem or application-specific views of system performance) 8. CUG 2006: Performance of the Community Atmosphere Model on the Cray X1E and XT3 (Worley) 9. CUG 2005: Comparative Analysis of Interprocess Communication on the X1, XD1, and XT3 (Worley, Alam, Dunigan, Fahey, Vetter) 10. CUG 2004: The Performance Evolution of the Parallel Ocean Program on the Cray X1 (Worley and Levesque) 3 What is an Early Evaluation? 1. A complete evaluation: a. microkernel, kernel, application benchmarks, chosen to examine major subsystems and to be representative of anticipated workload b. optimized with respect to obvious compiler flags, system environment variables (esp. MPI) and configuration options 2. performed quickly (in time for CUG ) a. not exhaustive (can’t answer all questions nor examine all options) b. minimal code modifications 3. with a goal of determining: a. performance promise (a lower bound) b. performance characteristics (good and bad) c. usage advice for users 4. in the context of an “evolving” system, subject to: a. HW instability b. system software upgrades 4 Target System Cray XT5 at ORNL (JaguarPF) - 18,722 compute nodes, 8 processor cores per node, 2 GB memory per core: • 149,776 processor cores and 299,552 TB memory - Compute node contains two 2.3 GHz quad-core Opteron processors (AMD 2356 “Barcelona”) linked with dual HyperTransport connections and DDR2-800 NUMA memory - 3D Torus (25x32x24) with Cray SeaStar2+ NIC (9.6 GB/s peak bidirectional BW in each of 6 directions; 6 GB/s sustained) - Version 2.1 of the Cray Linux Environment (CLE) operating system (as of February 2009) Compared with Cray XT4 at ORNL (Jaguar) - 7832 compute nodes, 4 processor cores per node: 31,328 processor cores - Compute node contains one 2.1 GHz quad-core “Budapest” Opteron and DDR2-800 UMA memory 5 Initial Evaluation Questions • Performance impacts of changes in - Processor architecture (2.3 GHz Barcelona vs. 2.1 GHz Budapest) - Node architecture (2 socket and NUMA vs. 1 socket and UMA): • Additional memory contention? Utility of large page support? • OpenMP performance (8-way maximum vs. 4-way maximum) • MPI communication performance - Intranode and Internode - Point-to-point - Collective • Performance characteristics of running at increased scale • Nature and impacts of performance variability • Application performance 6 Status of Evaluation • Evaluation far from complete: - System not open to general evaluation studies, only studies in support of early science application codes. - Scaling application codes to 150,000 cores is requiring re- examination of algorithms and implementations. - No full system scaling studies as of yet because of high cost and special requirements (e.g. interactive session). - Performance variability makes aspects of evaluation difficult: • There appear to be multiple sources of variability, some that may be eliminated easily, once diagnosed properly, and some that may be intrinsic to the system. • May be possible to mitigate impact of intrinsic variability once it has been diagnosed adequately. • Too much data to present in 30 minute talk. Will describe highlights of preliminary results. 7 Talk Outline 1. Single node performance a. Kernels: DGEMM, FFT, RandomAccess, STREAM b. Application codes: POP, CAM 2. MPI communication performance a. Point-to-point: Intra- and Inter-node b. Collective: Barrier, Allreduce c. HALO 3. Application codes: approaches, performance, and progress a. AORSA b. XGC1 c. CAM 8 Matrix Multiply Benchmark (DGEMM) Evaluated performance of libsci routine for matrix multiply. Achieved 89% of peak, Some degradation observed from running benchmark on all cores simultaneously. Behavior similar to that on XT4-quad, scaled by difference in clocks. 9 Other HPCC Single Node Benchmarks (ratio of JaguarPF to Jaguar) Core Core Socket Node Performance Performance Performance Performance (1core active) (all active) (all active) (all active) FFT 1.074 1.134 1.134 2.267 RandomAccess 1.094 1.139 1.139 2.277 STREAM 0.998 0.937 0.937 1.874 Spatial locality Apps (like STREAM) see small penalty from increased contention – memory controller/channel limitation Temporal locality Apps (like FFT) see moderate improvement Even low locality apps (Like RandomAccess) see some benefit SMP Performance Ratio (MultiCore to SingleCore) Jaguar JaguarPF % Improvement FFT 0.704 0.743 5.6% RandomAccess 0.645 0.671 4.0% STREAM 0.408 0.383 -6.2% SMP efficiency improved for apps that weren't bandwidth limited Memory bandwidth hungry apps suffer from increased contention At JaguarPF scale, 5% ~ 4000 cores Lessons: Eliminate unnecessary memory traffic Consider replacing MPI w/ OpenMP on node (see MPI results) Parallel Ocean Program Using POP to investigate performance impact of memory contention. Using all cores in a node can degrade performance by as much as 45% compared to assigning one process per node. It is still much better to use all cores for a fixed number of nodes. 12 Parallel Ocean Program For a single process, XT5 performance is nearly the same as the XT4. However, when using all cores in a socket XT5 performance was greater than 1.3X that of the XT4 in Oct. 08, and greater than 1.15X in Feb. 09 and May 09, both of which are greater than the difference in the clock speed. 13 Community Atmosphere Model For the Finite Volume dynamics solver, XT5 is 1.38X faster than the XT4 on a single quad-core processor, and 1.36X faster on two quad-core processors. Using the same number of nodes (but only two cores per processor) increases the advantage to 1.44 and 1.45, respectively. Physics dominates runtime in these experiments. 14 Computation Benchmarks: Summary 1. DGEMM “sanity check” looks good. 2. FFT, RandomAccess, and STREAM demonstrate steadily decreasing advantage of XT5 node over XT4 as contention for (main) memory increases. 3. XT5 per node performance is better than the XT4 per node performance no matter how measure it, for both POP and CAM. Memory contention (?) can degrade performance, especially for POP. 4. For this level of parallelism, OpenMP did not improve performance of CAM for same number of cores (not shown). 5. POP all-core performance has degraded by 15% since October. Performance when not using all cores in a socket is essentially unchanged. CAM performance has not changed significantly over this period, but CAM is more compute intensive for this problem granularity than is POP. 15 MPI Point-to-Point Performance Bidirectional bandwidth for single process pair when single pair communicating and when multiple pairs communicating simultaneously. One pair, two pairs and eight pairs achieve same total internode bandwidth. These experiments were unable to saturate network bandwidth. 16 MPI Point-to-Point Performance Same data, but in log-log plot. Intranode latency over 10 times lower than internode latency. Internode latency of single pair is half that of two pair, and 1/5 that of 8 pair. Latency is (also) not affected by number of nodes communicating, in these experiments. 17 MPI Point-to-Point Performance Log-log plots of bidirectional bandwidth between nodes for different platforms, both for a single pair and when all pairs exchange data simultaneously. XT5 and XT4- quadcore demonstrate same total internode performance, resulting in XT5 per pair performance to be half that of XT4 for simultaneous swaps. 18 MPI_Barrier Performance Observed minimum and average MPI_Barrier performance for one MPI process per node and for 8 MPI processes per node. Note preference for power-of-two number of nodes for minimum. Data not collected on a dedicated system. 19 MPI_Barrier Performance Observed minimum, average, and worst case MPI_Barrier performance for one MPI process per node and for 8 MPI processes per node. Worst case tends to increase with node count, but high costs can occur for relatively small node counts also. 20 MPI_Barrier: Platform Comparison Lower latency? less frequent or smaller performance perturbations? give advantage to Catamount and to BG/P results. 21 MPI_Barrier: Platform Comparison Minimum times are comparable between Catamount and CLE results. Big difference is in maximum times.
Recommended publications
  • New CSC Computing Resources
    New CSC computing resources Atte Sillanpää, Nino Runeberg CSC – IT Center for Science Ltd. Outline CSC at a glance New Kajaani Data Centre Finland’s new supercomputers – Sisu (Cray XC30) – Taito (HP cluster) CSC resources available for researchers CSC presentation 2 CSC’s Services Funet Services Computing Services Universities Application Services Polytechnics Ministries Data Services for Science and Culture Public sector Information Research centers Management Services Companies FUNET FUNET and Data services – Connections to all higher education institutions in Finland and for 37 state research institutes and other organizations – Network Services and Light paths – Network Security – Funet CERT – eduroam – wireless network roaming – Haka-identity Management – Campus Support – The NORDUnet network Data services – Digital Preservation and Data for Research Data for Research (TTA), National Digital Library (KDK) International collaboration via EU projects (EUDAT, APARSEN, ODE, SIM4RDM) – Database and information services Paituli: GIS service Nic.funet.fi – freely distributable files with FTP since 1990 CSC Stream Database administration services – Memory organizations (Finnish university and polytechnics libraries, Finnish National Audiovisual Archive, Finnish National Archives, Finnish National Gallery) 4 Current HPC System Environment Name Louhi Vuori Type Cray XT4/5 HP Cluster DOB 2007 2010 Nodes 1864 304 CPU Cores 10864 3648 Performance ~110 TFlop/s 34 TF Total memory ~11 TB 5 TB Interconnect Cray QDR IB SeaStar Fat tree 3D Torus CSC
    [Show full text]
  • Petaflops for the People
    PETAFLOPS SPOTLIGHT: NERSC housands of researchers have used facilities of the Advanced T Scientific Computing Research (ASCR) program and its EXTREME-WEATHER Department of Energy (DOE) computing predecessors over the past four decades. Their studies of hurricanes, earthquakes, NUMBER-CRUNCHING green-energy technologies and many other basic and applied Certain problems lend themselves to solution by science problems have, in turn, benefited millions of people. computers. Take hurricanes, for instance: They’re They owe it mainly to the capacity provided by the National too big, too dangerous and perhaps too expensive Energy Research Scientific Computing Center (NERSC), the Oak to understand fully without a supercomputer. Ridge Leadership Computing Facility (OLCF) and the Argonne Leadership Computing Facility (ALCF). Using decades of global climate data in a grid comprised of 25-kilometer squares, researchers in These ASCR installations have helped train the advanced Berkeley Lab’s Computational Research Division scientific workforce of the future. Postdoctoral scientists, captured the formation of hurricanes and typhoons graduate students and early-career researchers have worked and the extreme waves that they generate. Those there, learning to configure the world’s most sophisticated same models, when run at resolutions of about supercomputers for their own various and wide-ranging projects. 100 kilometers, missed the tropical cyclones and Cutting-edge supercomputing, once the purview of a small resulting waves, up to 30 meters high. group of experts, has trickled down to the benefit of thousands of investigators in the broader scientific community. Their findings, published inGeophysical Research Letters, demonstrated the importance of running Today, NERSC, at Lawrence Berkeley National Laboratory; climate models at higher resolution.
    [Show full text]
  • Technical Manual for a Coupled Sea-Ice/Ocean Circulation Model (Version 3)
    OCS Study MMS 2009-062 Technical Manual for a Coupled Sea-Ice/Ocean Circulation Model (Version 3) Katherine S. Hedström Arctic Region Supercomputing Center University of Alaska Fairbanks U.S. Department of the Interior Minerals Management Service Anchorage, Alaska Contract No. M07PC13368 OCS Study MMS 2009-062 Technical Manual for a Coupled Sea-Ice/Ocean Circulation Model (Version 3) Katherine S. Hedström Arctic Region Supercomputing Center University of Alaska Fairbanks Nov 2009 This study was funded by the Alaska Outer Continental Shelf Region of the Minerals Manage- ment Service, U.S. Department of the Interior, Anchorage, Alaska, through Contract M07PC13368 with Rutgers University, Institute of Marine and Coastal Sciences. The opinions, findings, conclusions, or recommendations expressed in this report or product are those of the authors and do not necessarily reflect the views of the U.S. Department of the Interior, nor does mention of trade names or commercial products constitute endorsement or recommenda- tion for use by the Federal Government. This document was prepared with LATEX xfig, and inkscape. Acknowledgments The ROMS model is descended from the SPEM and SCRUM models, but has been entirely rewritten by Sasha Shchepetkin, Hernan Arango and John Warner, with many, many other con- tributors. I am indebted to every one of them for their hard work. Bill Hibler first came up with the viscous-plastic rheology we are using. Paul Budgell has rewritten the dynamic sea-ice model, improving the solution procedure and making the water- stress term implicit in time, then changing it again to use the elastic-viscous-plastic rheology of Hunke and Dukowicz.
    [Show full text]
  • Introducing the Cray XMT
    Introducing the Cray XMT Petr Konecny Cray Inc. 411 First Avenue South Seattle, Washington 98104 [email protected] May 5th 2007 Abstract Cray recently introduced the Cray XMT system, which builds on the parallel architecture of the Cray XT3 system and the multithreaded architecture of the Cray MTA systems with a Cray-designed multithreaded processor. This paper highlights not only the details of the architecture, but also the application areas it is best suited for. 1 Introduction past two decades the processor speeds have increased several orders of magnitude while memory speeds Cray XMT is the third generation multithreaded ar- increased only slightly. Despite this disparity the chitecture built by Cray Inc. The two previous gen- performance of most applications continued to grow. erations, the MTA-1 and MTA-2 systems [2], were This has been achieved in large part by exploiting lo- fully custom systems that were expensive to man- cality in memory access patterns and by introducing ufacture and support. Developed under code name hierarchical memory systems. The benefit of these Eldorado [4], the Cray XMT uses many parts built multilevel caches is twofold: they lower average la- for other commercial system, thereby, significantly tency of memory operations and they amortize com- lowering system costs while maintaining the sim- munication overheads by transferring larger blocks ple programming model of the two previous genera- of data. tions. Reusing commodity components significantly The presence of multiple levels of caches forces improves price/performance ratio over the two pre- programmers to write code that exploits them, if vious generations. they want to achieve good performance of their ap- Cray XMT leverages the development of the Red plication.
    [Show full text]
  • Cray XT and Cray XE Y Y System Overview
    Crayyy XT and Cray XE System Overview Customer Documentation and Training Overview Topics • System Overview – Cabinets, Chassis, and Blades – Compute and Service Nodes – Components of a Node Opteron Processor SeaStar ASIC • Portals API Design Gemini ASIC • System Networks • Interconnection Topologies 10/18/2010 Cray Private 2 Cray XT System 10/18/2010 Cray Private 3 System Overview Y Z GigE X 10 GigE GigE SMW Fibre Channels RAID Subsystem Compute node Login node Network node Boot /Syslog/Database nodes 10/18/2010 Cray Private I/O and Metadata nodes 4 Cabinet – The cabinet contains three chassis, a blower for cooling, a power distribution unit (PDU), a control system (CRMS), and the compute and service blades (modules) – All components of the system are air cooled A blower in the bottom of the cabinet cools the blades within the cabinet • Other rack-mounted devices within the cabinet have their own internal fans for cooling – The PDU is located behind the blower in the back of the cabinet 10/18/2010 Cray Private 5 Liquid Cooled Cabinets Heat exchanger Heat exchanger (XT5-HE LC only) (LC cabinets only) 48Vdc flexible Cage 2 buses Cage 2 Cage 1 Cage 1 Cage VRMs Cage 0 Cage 0 backplane assembly Cage ID controller Interconnect 01234567 Heat exchanger network cable Cage inlet (LC cabinets only) connection air temp sensor Airflow Heat exchanger (slot 3 rail) conditioner 48Vdc shelf 3 (XT5-HE LC only) 48Vdc shelf 2 L1 controller 48Vdc shelf 1 Blower speed controller (VFD) Blooewer PDU line filter XDP temperature XDP interface & humidity sensor
    [Show full text]
  • Pubtex Output 2006.05.15:1001
    Cray XD1™ Release Description Private S–2453–14 © 2006 Cray Inc. All Rights Reserved. Unpublished Private Information. This unpublished work is protected to trade secret, copyright and other laws. Except as permitted by contract or express written permission of Cray Inc., no part of this work or its content may be used, reproduced or disclosed in any form. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Autotasking, Cray, Cray Channels, Cray Y-MP, GigaRing, LibSci, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray Apprentice2, Cray C++ Compiling System, Cray C90, Cray C90D, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray SeaStar, Cray S-MP,
    [Show full text]
  • Jaguar and Kraken -The World's Most Powerful Computer Systems
    Jaguar and Kraken -The World's Most Powerful Computer Systems Arthur Bland Cray Users’ Group 2010 Meeting Edinburgh, UK May 25, 2010 Abstract & Outline At the SC'09 conference in November 2009, Jaguar and Kraken, both located at ORNL, were crowned as the world's fastest computers (#1 & #3) by the web site www.Top500.org. In this paper, we will describe the systems, present results from a number of benchmarks and applications, and talk about future computing in the Oak Ridge Leadership Computing Facility. • Cray computer systems at ORNL • System Architecture • Awards and Results • Science Results • Exascale Roadmap 2 CUG2010 – Arthur Bland Jaguar PF: World’s most powerful computer— Designed for science from the ground up Peak performance 2.332 PF System memory 300 TB Disk space 10 PB Disk bandwidth 240+ GB/s Based on the Sandia & Cray Compute Nodes 18,688 designed Red Storm System AMD “Istanbul” Sockets 37,376 Size 4,600 feet2 Cabinets 200 3 CUG2010 – Arthur Bland (8 rows of 25 cabinets) Peak performance 1.03 petaflops Kraken System memory 129 TB World’s most powerful Disk space 3.3 PB academic computer Disk bandwidth 30 GB/s Compute Nodes 8,256 AMD “Istanbul” Sockets 16,512 Size 2,100 feet2 Cabinets 88 (4 rows of 22) 4 CUG2010 – Arthur Bland Climate Modeling Research System Part of a research collaboration in climate science between ORNL and NOAA (National Oceanographic and Atmospheric Administration) • Phased System Delivery • Total System Memory – CMRS.1 (June 2010) 260 TF – 248 TB DDR3-1333 – CMRS.2 (June 2011) 720 TF • File Systems
    [Show full text]
  • Cray XT3/XT4 Software
    Cray XT3/XT4 Software: Status and Plans David Wallace Cray Inc ABSTRACT: : This presentation will discuss the current status of software and development plans for the CRAY XT3 and XT4 systems. A review of major milestones and accomplishments over the past year will be presented. KEYWORDS: ‘Cray XT3’, ‘Cray XT4’, software, CNL, ‘compute node OS’, Catamount/Qk processors, support for larger system configurations (five 1. Introduction rows and larger), a new version of SuSE Linux and CFS Lustre as well as a host of other new features and fixes to The theme for CUG 2007 is "New Frontiers ," which software deficiencies. Much has been done in terms of reflects upon how the many improvements in High new software releases and adding support for new Performance Computing have significantly facilitated hardware. advances in Technology and Engineering. UNICOS/lc is 2.1 UNICOS/lc Releases moving to new frontiers as well. The first section of this paper will provide a perspective of the major software The last twelve months have been very busy with milestones and accomplishments over the past twelve respect to software releases. Two major releases months. The second section will discuss the current status (UNICOS/lc 1.4 and 1.5) and almost weekly updates have of the software with respect to development, releases and provided a steady stream of new features, enhancements support. The paper will conclude with a view of future and fixes to our customers. UNICOS/lc 1.4 was released software plans. for general availability in June 2006. A total of fourteen minor releases have been released since last June.
    [Show full text]
  • The Gemini Network
    The Gemini Network Rev 1.1 Cray Inc. © 2010 Cray Inc. All Rights Reserved. Unpublished Proprietary Information. This unpublished work is protected by trade secret, copyright and other laws. Except as permitted by contract or express written permission of Cray Inc., no part of this work or its content may be used, reproduced or disclosed in any form. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Autotasking, Cray, Cray Channels, Cray Y-MP, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray Apprentice2, Cray C90, Cray C90D, Cray C++ Compiling System, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray SeaStar, Cray SeaStar2, Cray SeaStar2+, Cray SHMEM, Cray S-MP, Cray SSD-T90, Cray SuperCluster, Cray SV1, Cray SV1ex, Cray SX-5, Cray SX-6, Cray T90, Cray T916, Cray T932, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, Cray Threadstorm, Cray UNICOS, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray X-MP, Cray XMS, Cray XMT, Cray XR1, Cray XT, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray Y-MP EL, Cray-1, Cray-2, Cray-3, CrayDoc, CrayLink, Cray-MP, CrayPacs, CrayPat, CrayPort, Cray/REELlibrarian, CraySoft, CrayTutor, CRInform, CRI/TurboKiva, CSIM, CVT, Delivering the power…, Dgauss, Docview, EMDS, GigaRing, HEXAR, HSX, IOS, ISP/Superlink, LibSci, MPP Apprentice, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RapidArray, RQS, SEGLDR, SMARTE, SSD, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, TurboKiva, UNICOS MAX, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc.
    [Show full text]
  • A Cray Manufactured XE6 System Called “HERMIT” at HLRS in Use in Autumn 2011
    Press release For release 21/04/2011 More capacity to the PRACE Research Infrastructure: a Cray manufactured XE6 system called “HERMIT” at HLRS in use in autumn 2011 PRACE, the Partnership for Advanced Computing in Europe, adds more capacity to the PRACE Research Infrastructure as the Gauss Centre for Supercomputing (GCS) partner High Performance Computing Center of University Stuttgart (HLRS) will deploy the first installation step of a system called Hermit, a 1 Petaflop/s Cray system in the autumn of 2011. This first installation step is to be followed by a 4-5 Petaflop/s second step in 2013. The contract with Cray includes the delivery of a Cray XE6 supercomputer and the future delivery of Cray's next-generation supercomputer code-named "Cascade." The new Cray system at HLRS will serve as a supercomputing resource for researchers, scientists and engineers throughout Europe. HLRS is one of the leading centers in the European PRACE initiative and is currently the only large European high performance computing (HPC) center to work directly with industrial partners in automotive and aerospace engineering. HLRS is also a key partner of the Gauss Centre for Supercomputing (GCS), which is an alliance of the three major supercomputing centers in Germany that collectively provide one of the largest and most powerful supercomputer infrastructures in the world. "Cray is just the right partner as we enter the era of petaflops computing. Together with Cray's outstanding supercomputing technology, our center will be able to carry through the new initiative for engineering and industrial simulation. This is especially important as we work at the forefront of electric mobility and sustainable energy supply", said prof.
    [Show full text]
  • Cray DVS: Data Virtualization Service
    Cray DVS: Data Virtualization Service Stephen Sugiyama and David Wallace, Cray Inc. ABSTRACT: Cray DVS, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with the Unicos/lc 2.1 release. DVS is a configurable service that provides compute-node access to a variety of file systems across the Cray high-speed network. The flexibility of DVS makes it a useful solution for many common situations at XT sites, such as providing I/O to compute nodes from NFS file systems. A limited set of use cases will be supported in the initial release but additional features will be added in the future. KEYWORDS: DVS, file systems, network protocol used in the communication layer takes advantage 1. Introduction of the high-performance network. The Cray Data Virtualization Service (Cray DVS) is Most file systems are not capable of supporting a network service that provides compute nodes hundreds or thousands or tens of thousands file system transparent access to file systems mounted on service clients. With DVS, there may be thousands of clients that nodes. DVS projects file systems mounted on service connect to a small number of DVS servers; the underlying nodes to the compute nodes: the remote file system file system sees only the aggregated requests from the appears to be local to applications running on the DVS servers. compute nodes. This solves several common problems: • Access to multiple file systems in the data center 2. Architecture • Resource overhead of file system clients on the The Cray XT series of supercomputers contain compute nodes compute nodes and service nodes.
    [Show full text]
  • Understanding the Cray X1 System
    NAS Technical Report NAS-04-014 December 2004 Understanding the Cray X1 System Samson Cheung, Ph.D. Halcyon Systems Inc. NASA Advanced Supercomputing Division NASA Ames Research Center [email protected] Abstract This paper helps the reader understand the characteristics of the Cray X1 vector supercomputer system, and provides hints and information to enable the reader to port codes to the system. It provides a comparison between the basic performance of the X1 platform and other platforms available at NASA Ames Research Center. A set of codes, solving the Laplacian equation with different parallel paradigms, is used to understand some features of the X1 compiler. An example code from the NAS Parallel Benchmarks is used to demonstrate performance optimization on the X1 platform.* Key Words: Cray X1, High Performance Computing, MPI, OpenMP Introduction The Cray vector supercomputer, which dominated in the ‘80s and early ‘90s, was replaced by clusters of shared-memory processors (SMPs) in the mid-‘90s. Universities across the U.S. use commodity components to build Beowulf systems to achieve high peak speed [1]. The supercomputer center at NASA Ames Research Center had over 2,000 CPUs of SGI Origin ccNUMA machines in 2003. This technology path provides excellent price/performance ratio; however, many application software programs (e.g., Earth Science, validated vectorized CFD codes) sustain only a small faction of the peak without a major rewrite of the code. The Japanese Earth Simulator demonstrated the low price/performance ratio by using vector processors connected to high-bandwidth memory and high-performance networking [2]. Several scientific applications sustain a large fraction of its 40 TFLOPS/sec peak performance.
    [Show full text]