The TSUBAME Grid: Redefining Supercomputing
One of the world's leading technical institutes, the Tokyo Institute of Technology (Tokyo Tech) created the fastest supercomputer in Asia, and one of the largest outside of the United States. Using Sun x64 servers and data servers deployed in a grid architecture, Tokyo Tech built a cost-effective, flexible supercomputer that meets the demands of compute- and data-intensive applications. With hundreds of systems incorporating thousands of processors and terabytes of memory, the TSUBAME grid delivers 47.38 TeraFLOPS of sustained performance and 1 petabyte (PB) of storage to users running common off-the-shelf applications.

Highlights
• The Tokyo Tech Supercomputer and UBiquitously Accessible Mass storage Environment (TSUBAME) redefines supercomputing
• 648 Sun Fire™ X4600 servers deliver 85 TeraFLOPS of peak raw compute capacity
• 42 Sun Fire X4500 Data Servers provide access to 1 petabyte of networked storage
• ClearSpeed Advance accelerator boards configured in 360 compute nodes help the grid exceed 47 TeraFLOPS of sustained Linpack performance
• Eight Voltaire Grid Director ISR9288 high-speed InfiniBand switches keep traffic in the grid moving
• Sun N1™ Grid Engine software distributes jobs across systems in the grid
• An innovative and integrated software stack enables common off-the-shelf applications, including PC applications, to run on the grid

Supercomputing demands
Tokyo Tech set out to build the largest, and most flexible, supercomputer in Japan. With numerous groups providing input into the size and functionality of the system, the new supercomputing campus grid infrastructure had several key requirements. Groups focused on large-scale, high-performance distributed parallel computing required a mix of 32- and 64-bit systems that could run the Linux operating system and be capable of providing over 1,200 SPECint2000 (peak) and 1,200 SPECfp2000 (peak) performance per CPU, combining for over 20,000 SPECfp_rate2000 (peak) and 36 TeraFLOPS of sustained Linpack performance across the system. Each server in the grid had to incorporate at least eight CPUs and 16 GB of shared access memory, with over half the servers capable of 32 GB, and total grid memory of 5 TB or more.

Not content with sheer size, Tokyo Tech was looking to bring supercomputing to everyday use. Unlike traditional, monolithic systems based on proprietary solutions that service the needs of the few, the new supercomputing architecture had to be able to run commercial off-the-shelf and open source applications, including structural analysis applications like ABAQUS and MSC/NASTRAN, computational chemistry tools like Amber and Gaussian, and statistical analysis packages like SAS, Matlab, and Mathematica.

With a wide range of researchers throughout the university accessing the system, as well as collaborators all over the world, data storage was a key concern. Over a petabyte of physical storage capacity was required, with no data loss across the entire system for 1,000 years. A parallel file system with a total RAID I/O transfer rate of 5 GB/second was needed to support over 1,000 NFS mount points along with fast parallel file systems like Lustre.

The TSUBAME supercomputing grid
The ninth largest supercomputer in the world today as measured by TOP500, the TSUBAME grid is powered by 648 Sun Fire™ X4600 servers with 11,088 AMD Opteron™ processor cores and 21 terabytes of memory. With all systems interconnected via InfiniBand technology and capable of accessing 1 petabyte of hard disk storage in parallel, the TSUBAME grid delivers 47.38 TeraFLOPS of sustained performance.

Integrated by NEC and incorporating technology from ClearSpeed Technology, Inc., ClusterFS, and Voltaire, as well as the Sun N1™ System Manager and Sun N1 Grid Engine software, the TSUBAME grid can run both the Solaris™ Operating System (OS) and Linux to deliver applications to users and speed scientific algorithms and data processing.
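As a rough consistency check, the headline memory and storage figures can be recovered from the per-node configurations described in the architecture sections below; the 32 GB per compute server used here is an inferred average, not a figure stated in this overview.

\[
648 \ \text{servers} \times 32 \ \text{GB per server} \approx 20{,}736 \ \text{GB} \approx 21 \ \text{TB of aggregate memory}
\]
\[
42 \ \text{data servers} \times 48 \ \text{drives} \times 500 \ \text{GB} = 1{,}008{,}000 \ \text{GB} \approx 1 \ \text{PB of raw storage}
\]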
TSUBAME grid system architecture
Designed by Sun, the TSUBAME grid consists of 648 Sun Fire X4600 servers running SuSE Linux Enterprise Server 9 SP3 configured into capacity, capability, and shared memory clusters. Together, these systems give users access to 11,088 processor cores on high-performance, dual-core, Next-Generation AMD Opteron processors, along with 21 TB of memory. Each Sun Fire X4600 server incorporates two PCI-Express 4x single data rate (SDR) InfiniBand host adapters for connection to the network.

648 servers. 21 TB of memory. 1 PB of data storage. 47.38 TeraFLOPS. All in 35 days.

High-performance x64 compute servers
Sun Fire X4600 servers are fast and energy efficient, and are the only four-way x64 servers to scale to 16-way in a compact 4RU form factor. Indeed, this powerful rackmount server scales quickly from four to eight sockets, simply by adding modular processor boards. This innovative design enables Sun Fire X4600 systems to be upgraded and scaled to next-generation processors and memory without disrupting the existing software and network environment. Sun Fire X4600 servers support up to 64 GB of DDR-400 memory with ECC.

By integrating high-performance AMD Opteron processors with massive data storage, Sun Fire X4500 servers provide high storage density and fast throughput rates at nearly half the cost of traditional solutions. In fact, these systems deliver four-way x64 server performance and up to 24 TB of direct attached storage in a 4U form factor, with 1 GB/second throughput from disks to network and 2 GB/second throughput to memory. Sun Fire X4500 servers support up to 16 GB of DDR-400 memory with ECC.

High-speed InfiniBand interconnect
All Sun Fire X4600 compute systems and Sun Fire X4500 data servers are connected to an InfiniBand network through eight Voltaire Grid Director ISR9288 high-speed InfiniBand switches. Each switch provides 20 Gbps of bidirectional bandwidth for 288 InfiniBand ports in a single 14U chassis, enabling 1,352 server and storage links. Up to 11.52 Tbps of full bisectional switch bandwidth in a fat-tree architecture is possible, with less than 420 nanoseconds of latency between any two ports. As a result, Voltaire ISR9288 switches can be interconnected to form large clusters consisting of thousands of nodes.

The InfiniBand connectivity schema is designed to provide the TSUBAME grid with optimum network balancing, maximum availability, and high performance. All Sun Fire X4500 and Sun Fire X4600 servers are connected to one of six edge InfiniBand switches. These six switches are in turn connected to two Voltaire ISR9288 core switches. With 24 links between each edge and core switch, the system has a blocking factor of 5:1 and a maximum of nine node hops. Multiple paths are available through the core switches, fostering high availability. In addition, each InfiniBand host adapter installed in the Sun Fire X4600 compute servers is attached to a different line board. As a result, each link is connected to one of the 24 chips on each line board, providing optimum distribution across the edge switches.

In the TSUBAME grid, 360 compute servers are configured with a ClearSpeed Advance accelerator board for added floating-point performance. The accelerator board combines two CSX600 processors in a PCI-X form factor and delivers 96 GFLOPS of theoretical peak performance and 50 GFLOPS of sustained double-precision matrix multiply (DGEMM of BLAS) performance while averaging 25 watts of power consumption.
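The double-precision matrix multiply these boards accelerate is the standard BLAS DGEMM routine. The sketch below is a plain CBLAS call on the host, shown only as a point of reference; whether such a call runs on the host CPUs or is offloaded to the accelerator depends on which BLAS library the application is linked against, and the build line here is an illustrative assumption rather than a documented TSUBAME command.

    /*
     * Minimal DGEMM sketch using the C interface to BLAS (CBLAS).
     * Computes C = alpha * A * B + beta * C for small row-major matrices.
     * Illustrative build line (library name is an assumption):
     *   cc dgemm_demo.c -o dgemm_demo -lcblas
     */
    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        /* A is 2x3, B is 3x2, so C is 2x2 (row-major storage). */
        double A[2 * 3] = { 1, 2, 3,
                            4, 5, 6 };
        double B[3 * 2] = {  7,  8,
                             9, 10,
                            11, 12 };
        double C[2 * 2] = { 0 };

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 3,        /* M, N, K       */
                    1.0, A, 3,      /* alpha, A, lda */
                    B, 2,           /* B, ldb        */
                    0.0, C, 2);     /* beta, C, ldc  */

        printf("C = [ %g %g ; %g %g ]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }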
Ultra high-density storage
Forty-two high-performance Sun Fire X4500 servers running Red Hat Enterprise Linux 4 provide storage for the TSUBAME grid. These high-density data servers each incorporate 48 direct attached, hot-swappable 500 GB SATA drives, for a total storage capacity of 1 PB. Each Sun Fire X4500 server also includes one PCI-X 4x SDR InfiniBand host adapter.

Figure 1. The TSUBAME grid system architecture: 648 Sun Fire X4600 nodes with ClearSpeed CSX600 accelerators, a 1,440 Gbps InfiniBand network built on eight Voltaire ISR9288 switches, 42 Sun Fire X4500 data servers, an NEC iStorage S1800AT array, and connectivity to external networks and grids.

TSUBAME grid software
A wide variety of software packages run on the compute and data servers and work together to make the TSUBAME grid widely accessible to users.

Compute server software stack
All Sun Fire X4600 servers in the TSUBAME grid run the SuSE Linux Enterprise Server 9 SP3 environment, as well as the following:

• Sun N1 Grid Engine 6.0 software provides distributed resource management for user jobs running on the grid. The Sun N1 Grid Engine software runs on a Sun Fire X4100 management server within the grid.
• Lustre client software provides access to the Lustre parallel file system.
• PGI 6.1 and GNU (gcc) compilers are installed on all compute nodes in the cluster.
• A variety of Message Passing Interface (MPI) tools, such as MPICH, OpenMPI, and HP-MPI, are installed for application portability. Some of these tools utilize the IP over InfiniBand (IPoIB) protocol rather than native InfiniBand protocols.
• The Voltaire ibhost tool enables applications to employ MPI communication over the InfiniBand network. Based on MVAPICH, the Voltaire implementation includes several enhancements for the TSUBAME grid, including support for two accelerator cards in a single system, a shared receive queue, and adaptive FASTPATH.

Making the grid accessible
What makes TSUBAME unique is its ability to make vast computing and storage resources available to a wide range of users running off-the-shelf applications with ease. The Sun N1 Grid Engine software makes this possible by managing how jobs are allocated to systems in the grid, without users needing to know the underlying details of where jobs run.

By using the Sun N1 Grid Engine software, the physical systems that comprise the TSUBAME grid can be viewed logically (Figure 2). Users log in to the grid via login nodes that are load balanced using a round-robin policy. Sessions are then transferred to an interactive node by the Sun N1 Grid Engine software.
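To make the workflow concrete, the following is a minimal sketch of the kind of MPI program a user might compile with one of the installed MPI stacks and hand to the Sun N1 Grid Engine scheduler. The build and submission commands in the comments are illustrative assumptions; actual parallel environment names, queues, and wrapper scripts are site-specific and not documented here.

    /*
     * Minimal MPI sketch: each rank reports its host name, and rank 0
     * collects a simple reduction, exercising communication over the
     * InfiniBand fabric (or IPoIB, depending on the MPI stack in use).
     *
     * Illustrative build and submission (names are assumptions):
     *   mpicc mpi_demo.c -o mpi_demo
     *   qsub -pe mpi 16 run_mpi_demo.sh
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank = 0, size = 0, len = 0, sum = 0;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* Each rank contributes its rank number; rank 0 receives the total. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        printf("rank %d of %d running on %s\n", rank, size, host);
        if (rank == 0)
            printf("sum of ranks = %d (expected %d)\n", sum, size * (size - 1) / 2);

        MPI_Finalize();
        return 0;
    }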