Bringing Up Red Storm: Lessons to Remember
Robert A. Ballance and John P. Noe
Sandia National Laboratories, Albuquerque, New Mexico
May 4, 02007
{[email protected], [email protected]}
SAND 2007-2904C

Abstract

Capability computer systems are designed, developed, and operated to provide the computational power to investigate and solve some of the most difficult and compelling problems facing humanity. These unique resources can be extremely helpful in scientific endeavors, but present exciting challenges for the operators and customers who utilize their power. While many of the difficulties in getting capability jobs run are inherent in the job itself, others intrude by way of operational issues. Consider Red Storm: in Jumbo mode, its largest configuration, it provides over 13,000 dual-core processors; over 30,000 links; 3,700 embedded Linux sensors; 88 RAID controllers; 50 10GigE network connections; plus cooling, power, and environmental sensors. For each component, the only definitely known state is "down." This overview looks at capability computing from the aspect of operations, but draws many conclusions about the effects of system design on the effectiveness of system operation.

[Figure 2: ASCI Red, Installed 01997]

Overview

Sandia National Laboratories has been building large systems for 15 years now. Prior to that, the Labs had been a purchaser and operator of systems like the Paragon, Thinking Machines CM-2, and Cray vector systems. During that time, Sandians have designed, deployed, and operated the first Teraflop system (ASCI Red), several of the first large commodity clusters (CPLANT), and this week's # system, Red Storm. The first author joined Sandia several years ago, after working with the University of New Mexico and NCSA in standing up Los Lobos, one of the first Linux clusters to break the Top 100. (Sandia's CPLANT was another; HPTI built the third.) Today, Sandia also operates one of the largest Linux clusters in the world, Thunderbird, with 8,960 processors.

Figure 1 maps several of many sets of influences on our thinking and on the subject of today's talk. Solid lines indicate direct lines of descendants, from the MPPs like the nCube and Paragon through ASCI Red to the Cray XT4 line, and from the first Beowulf clusters through commercial offerings like Los Lobos to Thunderbird. Dashed lines indicate influences.

Operating systems also play a key role, with the red lines trending the use of small, lightweight kernels, and the blue lines following variants of Linux. In 02007, Cray is working to blend the red and blue lines, which might bring us to something like a Cray Purple, but that name has already been taken!

The lower tier of Figure 1 reminds us that many systems have been designed, fewer have been built, and even fewer have become ongoing commercial successes!

The systems named, of course, are only a handful of those involved in a great conversation. The topic concerns Exaflops for science and engineering: how to achieve them, how to deliver them, how to apply them, and how to fund them. In 02007, the next clear step is to a Petaflop, and from a theoretical Petaflop to a sustained, production, delivered, multi-petaflop system. There is no doubt that any of several vendors can construct a system with a PetaOp peak. What we've found, however, is that getting from the PetaOp peak to the sustained, production, multi-petaOp system will take sustained effort from a broad-based team having diverse talents.
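To make the scale of the operational problem concrete, here is a toy back-of-the-envelope sketch. The component counts echo the abstract's census of Red Storm; the per-component availability figure is purely hypothetical, chosen only to illustrate why, on a machine this large, something is essentially always down.

```python
# Toy model: expected number of "down" parts at any instant.
# Component counts follow the abstract; the availability value
# is a hypothetical illustration, not a measured figure.
components = {
    "dual-core processors": 13_000,   # "over 13,000" per the abstract
    "links": 30_000,
    "embedded Linux sensors": 3_700,
    "RAID controllers": 88,
    "10GigE connections": 50,
}
availability = 0.9999  # hypothetical: each part is up 99.99% of the time

total = sum(components.values())
expected_down = total * (1 - availability)
print(f"{total:,} components -> about {expected_down:.1f} down at any instant")
```

Even at four nines per component, the expected number of failed parts at any moment is several, which is one way to read the abstract's remark that "down" is the only definitely known state.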
T3 Isn’t it interesting that we don’t have a unit for Whirlwind•Stretch•BBN•ETA•Alliant•Elexsi•MasPar•Convex •Kendall Square•CDC•Cray 3•… sustained human eff ort? We need a unit that would Figure 1: Systems from a Sandian’s Perspective signify honest attempts and would harmonize with Petafl ops and Exabytes, something like Giga-tries 1 ability, reliability, and operability another: versatility. The cost, of course, is the overhead of maintaining two installations. In that sense, Red Storm is like having twins. Figure 3 shows the end of ASCI Red. Red was #1 on the Top 500 list for 3.5 years; 7 lists! Red was decommissioned last year, not because it wasn’t useful, but because we could no longer af- ford to keep it running. When it was turned off, there were still jobs on the mesh and more jobs in the queue. ASCI Red served production cycles for over 9 years from its initial operation until its retirement. That’s a long time in supercomputer years! Of course there were upgrades along the way which took advantage of the rapid pace of technology maturation. Processors, memory, network interconnect, and disk storage were all upgraded. The Clock of the Long Now Figure 3: ASCI Red, 02006 “I want to build a clock that ticks once a year. The century hand ad- or Peta-d’oh’s. One of the motivations for this paper, and one of vances once every one hundred years, and the cuckoo comes out on the the missions for organizations like CUG and LCI, is to reduce the millennium. I want the cuckoo to come out every millennium for the number of wasted human cycles in building ever-more-compli- next 10,000 years. If I hurry I should finish the clock in time to see the cated systems. cuckoo come out for the first time.” Danny Hillis, [8] But, lest we get too optimistic, let’s consider the following four very different pictures. Figure shows ASCI Red. Red was Figure 4 shows a prototype of the Clock of the Long Now — an notable for several reasons. 
It successfully deployed a Lightweight attempt to build a computer, and a very engineered artifact — that Kernel on compute nodes called Cougar. Cougar descended from will continue to deliver cycles and be both maintainable and SUNMOS — the Sandia/UNM OS — that ran on the nCube, and understandable for 10,000 years [4]. The cuckoo did come out which became PUMA on the Paragon. Red also had an inte- the first time, seven years ago, and the clock struck twice: once grated service (RAS) network that provided for system health for each millennium c.e. You might also be interested in Brian and maintenance checks for all nodes. TheRAS design on Red Eno’s study of the bells for the clock; they are based on a 10-digit influenced bothCPLANT and Red Storm. Finally, Red provided permutation so that each day the clock will sound a unique bell a simple mechanism to switch compute nodes from Unclassified sequence! Of course, that sequence will eventually wrap around, to Classified processing and back gain. This red/black switching in a bit over 10,000 years. procedure allows Sandia to deploy capability computing on either The Tacoma Narrows Bridge network by sharing the compute partitions. The cost is relatively low: additional service and I/O nodes, and a mechanism for the Figure 5 shows the Tacoma Narrows Bridge, or “Galloping Gertie”. switch. The duplicated I/O partitions have an additional benefit Bridge builders have had a history of building new structures that to operations, as well as a cost. The benefit is that during major creatively used new materials and new designs that pushed the upgrades, you can keep a portion of the system in production envelope of bridge engineering. Just as regularly, on a well-docu- while other portions are being worked. Add to terms like scal- mented 30-year cycle, bridges were built and collapsed. This cycle was originally noticed by Sibley, and is documented in Henry Petroski’s “Engineers of Dreams” [1]. 
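The bell-sequence arithmetic mentioned above is easy to check: with ten bells rung in a different order each day, the number of distinct orderings is 10!, which works out to roughly the clock's ten-thousand-year design life.

```python
# Checking the bell arithmetic for the Clock of the Long Now:
# 10 bells, one distinct permutation rung per day, with no
# ordering repeated until all have been used.
from math import factorial

orderings = factorial(10)      # 3,628,800 distinct bell sequences
years = orderings / 365.25     # one sequence per day -> years to wrap
print(f"{orderings:,} daily sequences last about {years:,.0f} years")
```

The wrap-around point lands in the neighborhood of ten thousand years, in line with the clock's intended lifetime.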
The amazing thing about the Tacoma Narrows Bridge is not that it was built, or that it fell down, but that the original architecture was carefully studied and engineered, and then built even when there were open questions about its stability. The width-to-span ratio of the deck was 1:76, while the next most comparable bridge (the Golden Gate) at the time had a width-to-length ratio of 1:46.5. Tacoma Narrows was truly a "stretch goal".

[Figure 4: Clock of the Long Now]
[Figure 5: Tacoma Narrows Bridge]

Bridges have a lot in common with supercomputers; they cost a lot, take a long time to build, carry a lot of traffic, and are only missed when they are gone. Like bridges, building a supercomputer often starts with a need and a vision. From those seeds begin lengthy rounds of funding negotiations, politics, further negotiations, and redesigns. Stretch goals and new technologies often determine the design of the system. Finally, if the team is lucky, the system is built and jobs start running. Computer engineers tend not to lose the same number of workmen to death, accident, or caisson disease as the bridge builders, but we may lose the same number of systems in the end.

Figure 6: Red Storm, 02006

                   Red Storm            Thunderbird
  Stance           Red + Black          Black
  Service Nodes    320 + 320
  Compute Nodes    12,960               4,480
  Compute Cores    25,920               8,960
  Segments         3360/6240/3360
  Interconnect     SeaStar Mesh         Infiniband
  Disk             170TB + 170TB        46TB + 420TB shared
  OS               Linux + Catamount    Linux

Seek First to Emulate
• Learn from the past
• Simulate the future

Learn from the past

This first principle is a restatement of Petroski's arguments for the use of historical precedent in the design of new structures. It was phrased succinctly in a quote by John Gall: "A complex system that works is invariably found to have evolved from a simple system
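The machine counts in Figure 6 can be cross-checked in a few lines: Red Storm's three segments should sum to its compute-node total, and for these dual-core machines the core count should be twice the node count. A minimal sketch:

```python
# Sanity-checking the Figure 6 numbers for Red Storm and Thunderbird.
red_storm_segments = (3_360, 6_240, 3_360)  # red / center / black sections
red_storm_nodes = 12_960
thunderbird_nodes = 4_480
cores_per_node = 2                          # dual-core processors

# The segment counts account for every compute node.
assert sum(red_storm_segments) == red_storm_nodes

print("Red Storm cores:  ", red_storm_nodes * cores_per_node)
print("Thunderbird cores:", thunderbird_nodes * cores_per_node)
```

The check confirms the internal consistency of the table: 3,360 + 6,240 + 3,360 = 12,960 nodes, giving 25,920 Red Storm cores, and 4,480 Thunderbird nodes give the 8,960 processors cited in the Overview.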