Early Operations Experience with the Cray X1 at the Oak Ridge National Laboratory Center for Computational Sciences
Arthur Bland, Richard Alexander, Steven M. Carter, Kenneth D. Matney, Sr., Oak Ridge National Laboratory

ABSTRACT: Oak Ridge National Laboratory (ORNL) has a history of acquiring, evaluating, and hardening early high-performance computing systems. In March 2003, ORNL acquired one of the first production Cray X1 systems. We describe our experiences with installing, integrating, and maintaining this system, along with plans for expansion and further system development.

1. Introduction

The Center for Computational Sciences (CCS) at the Oak Ridge National Laboratory (ORNL) is an advanced computing research test bed sponsored by the U.S. Department of Energy's (DOE) Office of Advanced Scientific Computing Research. Its primary mission is to evaluate and bring to production readiness new computer architectures and to apply those resources to solving important scientific problems for the DOE. In 2002 the CCS proposed evaluating the Cray X1 for use as a scientific computing system by the DOE Office of Science. The first phase of this five-phase plan was funded, and the first of eight cabinets was installed in March 2003. In collaboration with other national laboratories and universities, the CCS is currently evaluating the system to determine its strengths and weaknesses. The ultimate goal of the project is to provide data for the DOE in its selection of the architecture of a leadership-class computer system for open scientific research. The CCS has outlined a path to this goal that grows the Cray X1 and then upgrades to the follow-on system, code-named "Black Widow."

2. Overview of the CCS

The CCS was founded in 1991 by the DOE as part of the High Performance Computing and Communications program and was one of two high-performance computing research centers created in DOE national laboratories. At the time, massively parallel computer systems were just starting to make their commercial debut into a marketplace for scientific computing that had been dominated by parallel vector systems from Cray Research. ORNL was well suited to this task, having had a Cray X-MP system as well as six years of experience converting vector applications to run on experimental parallel systems such as the Intel iPSC series.

From the beginning, the CCS had three primary missions:

• Evaluate experimental new architectures to see how they work for DOE Office of Science applications;
• Work with the manufacturers of these systems to bring the systems to production readiness;
• Use these new systems on a few applications that are both important to the nation and cannot be addressed with the conventional computer systems of the day.

Over the last decade, the applications of interest at the CCS have centered around six major areas:

• Materials Science – Many of ORNL's major programs center on materials science. ORNL is the site of the ongoing construction of the nation's largest unclassified science project, the Spallation Neutron Source. This $1.4 billion project will provide neutron streams for exploring the structure of both hard and soft materials.
• Global Climate Change – The DOE has a major program in global climate modelling to understand the energy needs of the world and the impact of energy production and usage on our environment. The CCS has a major program in Climate and Carbon Research.
• Computational Biology – The fields of genomics and proteomics require massive computational power to understand the basis of all life. DOE's biology program began with the Manhattan Project to understand the effects of radiation on cellular biology. That program has grown into a major effort in genomics.
• Fusion – Fusion reactors hold the promise of providing virtually unlimited power from abundant fuel. Modelling these reactors has proven to be a daunting task that requires the most powerful computers available.
• Computational Chemistry – Modelling of complex chemical processes helps us understand everything from combustion to drug design.
• Astrophysics – Basic research into the origins of the universe provides an understanding of the nature of matter and energy. Modelling a supernova explosion is among the most computationally complex phenomena being attempted today.

Today the CCS provides a capability computing center with over 6 TFLOP/s of peak capacity for DOE researchers and our university partners. The DOE Scientific Discovery through Advanced Computing (SciDAC) program is a $50 million program to develop a new generation of scientific application programs and is a major initiative within the DOE. The CCS was selected to provide the majority of the compute infrastructure for SciDAC.

ORNL and the CCS have a long history of evaluating new, experimental computer architectures. In all, over 30 different computer systems have been evaluated over the last 15 years, with many more under active or planned evaluation. Some of the major systems where ORNL has evaluated a prototype, beta, or serial number one system include: KSR-1, Intel iPSC/1, iPSC/2, iPSC/860, Intel Touchstone Sigma and Paragon MP, IBM Winterhawk-1, Winterhawk-2, Nighthawk-1, Nighthawk-2, S80, Regatta H, Compaq AlphaServer SC40 and SC45, and the Cray X1. In the near future, the CCS is planning evaluations of the SGI Altix with Madison processors, the Red Storm system jointly designed by Sandia National Laboratories and Cray, Inc., the IBM Blue Gene/D system, and the Cray Black Widow machine.

3. Facilities and Infrastructure

The CCS provides the facilities, computing, networking, visualization, and archival storage resources needed by the DOE Office of Science's computational scientists.

3.1 Computational Science Building

ORNL has long recognized that computational science is an equal "third leg" of the scientific discovery triangle with experiment and theory. However, the facilities needed to support leadership-class computers have not been funded by the DOE for unclassified scientific research. To bridge the gap, in 2002 ORNL broke ground on a new computational sciences building (CSB) as well as another new building to house the Joint Institute for Computational Sciences (JICS). The CSB provides 170,000 square feet of space to house the laboratory's computing infrastructure. The centerpiece is a 40,000 square foot computer room with up to eight megawatts of power and up to 4,800 tons of air conditioning. When completed in June 2003, this will be the nation's largest computer center for scientific research. The CSB also houses the fibre-optic communications hub for the laboratory, providing direct access to high-speed, wide-area networks linking ORNL to other major research laboratories and universities. Adjacent to the computer room is a new 1,700 square foot visualization theater to help researchers glean understanding and knowledge from the terabytes of data generated by their simulations. Finally, the building has office space for 450 staff members of the Computing and Computational Sciences Directorate of ORNL. An additional 52,000 square foot building to house JICS is under construction across the street from the CSB. JICS is the university outreach arm of the CCS, and this building will provide laboratory, office, classroom, and conference space for faculty and students collaborating with ORNL researchers.

3.2 Computer Systems

The major production computer system of the CCS is a 4.5 TFLOP/s IBM Power4 Regatta H system. The machine is composed of 27 nodes, each with 32 processors and from 32 GB to 128 GB of memory. The system is linked with two planes of IBM's Switch2 or "Colony" interconnect providing high-speed communications among the nodes of the machine. In 2003, the CCS will be partnering with IBM to test the next-generation interconnect technology, code-named "Federation". This will provide an order of magnitude increase in the bandwidth between nodes while cutting the latency in half.

The second major system is the new Cray X1 system. Today, this machine has 32 MSPs and 128 GB of memory for a total of 400 GFLOP/s of peak performance. By the end of September 2003, the system will be expanded to 256 MSPs and 1 TB of memory, providing a peak performance of 3.2 TFLOP/s.

Additional systems include a 1 TFLOP/s IBM Power3 system and a 427 GFLOP/s Compaq SC40. This summer, we will also add a 1.5 TFLOP/s SGI Altix system with 256 Madison processors and 2 TB of globally addressable memory.
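These peak figures follow from simple per-unit arithmetic. The short Python sketch below reproduces the quoted totals from nominal per-unit rates; the 12.8 GFLOP/s per X1 MSP and 5.2 GFLOP/s per Power4 processor used here are inferred from the quoted totals rather than stated in this paper:

    # Sanity check of the Section 3.2 peak-performance figures.
    # Assumed per-unit rates (inferred from the quoted totals, not taken
    # from this paper): 12.8 GFLOP/s per Cray X1 MSP and 5.2 GFLOP/s per
    # IBM Power4 processor.
    X1_GFLOPS_PER_MSP = 12.8
    P4_GFLOPS_PER_CPU = 5.2

    print(32 * X1_GFLOPS_PER_MSP)        # 409.6  -> quoted as "400 GFLOP/s"
    print(256 * X1_GFLOPS_PER_MSP)       # 3276.8 -> quoted as "3.2 TFLOP/s"
    print(27 * 32 * P4_GFLOPS_PER_CPU)   # 4492.8 -> quoted as "4.5 TFLOP/s"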
3.3 Networking

Gone are the days of hungry graduate students and researchers making the pilgrimage to the computer center to worship at the altar of the giant brain. Today, the majority of the researchers using high-performance computers around the world will never get near or see the computer system. Fully two-thirds of the users of the CCS are located outside of Oak Ridge. Thus our challenge is to provide the networking infrastructure and tools to make accessing the computers and their data just as fast, convenient, and secure as if the resources were local to the user. The CCS has two primary connections to our users. For DOE users, the CCS is connected to the DOE's Energy Sciences Network (ESnet), which is a private research network. ESnet provides an OC12 connection (622 megabits per second) linking to an OC48 (2.5 Gbps) backbone. To link to our university partners, ORNL has contracted with Qwest to provide an OC192 (10 Gbps) link from ORNL to Atlanta, where it connects to the Internet2 backbone and links to major research universities around the country.
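To put those link rates in perspective, the sketch below estimates line-rate transfer times for one terabyte of simulation output over each connection. This is an illustrative back-of-the-envelope calculation, not a measurement: it ignores protocol overhead and competing traffic, so real transfers would take longer.

    # Idealized transfer times for 1 TB of data over the wide-area links
    # described in Section 3.3, at nominal SONET line rates.  Protocol
    # overhead and shared traffic are ignored, so these are best-case times.
    LINKS_MBPS = {"OC12": 622, "OC48": 2488, "OC192": 10000}

    bits = 1e12 * 8  # one terabyte, in bits
    for name, mbps in LINKS_MBPS.items():
        hours = bits / (mbps * 1e6) / 3600
        print(f"{name}: {hours:.1f} hours")  # OC12 ~3.6 h, OC48 ~0.9 h, OC192 ~0.2 h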
But providing the wires is only the first piece of a complex stack of software …

The system arrived on Tuesday, March 18, 2003. It is a 32-MSP, liquid-cooled chassis with a single IOC attached.