The TSUBAME Grid: Redefining Supercomputing

One of the world's leading technical institutes, the Tokyo Institute of Technology (Tokyo Tech) created the fastest supercomputer in Asia, and one of the largest outside of the United States. Using Sun x64 servers and data servers deployed in a grid architecture, Tokyo Tech built a cost-effective, flexible supercomputer that meets the demands of compute- and data-intensive applications. With hundreds of systems incorporating thousands of processors and terabytes of memory, the TSUBAME grid delivers 47.38 TeraFLOPS¹ of sustained performance and 1 petabyte (PB) of storage to users running common off-the-shelf applications.

Highlights

• The Tokyo Tech Supercomputer and UBiquitously Accessible Mass storage Environment (TSUBAME) redefines supercomputing
• 648 Sun Fire™ X4600 servers deliver 85 TeraFLOPS of peak raw compute capacity
• 42 Sun Fire X4500 Data Servers provide access to 1 petabyte of networked storage
• ClearSpeed Advance accelerator boards configured in 360 compute nodes help the grid exceed 47 TeraFLOPS sustained Linpack performance
• Eight Voltaire Grid Director ISR9288 high-speed InfiniBand switches keep traffic in the grid moving
• Sun N1™ Grid Engine software distributes jobs across systems in the grid
• An innovative and integrated software stack enables common off-the-shelf applications, including PC applications, to run on the grid

Supercomputing demands

Tokyo Tech set out to build the largest, and most flexible, supercomputer in Japan. With numerous groups providing input into the size and functionality of the system, the new supercomputing campus grid infrastructure had several key requirements. Groups focused on large-scale, high-performance distributed parallel computing required a mix of 32- and 64-bit systems that could run Linux and be capable of providing over 1,200 SPECint2000 (peak) and 1,200 SPECfp2000 (peak) performance per CPU, combining for over 20,000 SPECfp_rate2000 (peak) and 36 TeraFLOPS sustained Linpack performance across the system. Each server in the grid had to incorporate at least eight CPUs and 16 GB of shared access memory, with over half the servers capable of 32 GB, and total grid memory of 5 TB or more.

With a wide range of researchers throughout the university accessing the system, as well as collaborators all over the world, data storage was a key concern. Over a petabyte of physical storage capacity was required, with no data loss across the entire system for 1,000 years. A parallel file system with a total RAID I/O transfer rate of 5 GB/second was needed to support over 1,000 NFS mount points along with fast parallel file systems like Lustre.

Not content with sheer size, Tokyo Tech was looking to bring supercomputing to everyday use. Unlike traditional, monolithic systems based on proprietary solutions that service the needs of the few, the new supercomputing architecture had to be able to run commercial off-the-shelf and open source applications, including structural analysis applications like ABAQUS and MSC/NASTRAN, computational chemistry tools like Amber and Gaussian, and statistical analysis packages like SAS, Matlab, and Mathematica.

The TSUBAME supercomputing grid

The ninth largest supercomputer in the world today as measured by TOP500², the TSUBAME grid is powered by 648 Sun Fire™ X4600 servers with 11,088 AMD Opteron™ processor cores and 21 terabytes of memory. With all systems interconnected via InfiniBand technology and capable of accessing 1 petabyte of hard disk storage in parallel, the TSUBAME grid delivers 47.38 TeraFLOPS of sustained performance. Integrated by NEC and incorporating technology from ClearSpeed Technology, Inc., ClusterFS, and Voltaire, as well as the Sun N1™ System Manager and Sun N1 Grid Engine software, the TSUBAME grid can run both the Solaris™ Operating System (OS) and Linux to deliver applications to users and speed scientific algorithms and data processing.

TSUBAME grid system architecture

648 servers. 21 TB of memory. 1 PB of data storage. 47.38 TeraFLOPS. All in 35 days.

Designed by Sun, the TSUBAME grid consists of 648 Sun Fire X4600 servers running SuSE Linux Enterprise Server 9 SP3, configured into capacity, capability, and shared memory clusters. Together, these systems provide users access to 11,088 processor cores from high-performance, dual-core, Next-Generation AMD Opteron processors, and 21 TB of memory. Each Sun Fire X4600 server incorporates two PCI-Express 4x single data rate (SDR) InfiniBand host adapters for connection to the network.

High-performance x64 compute servers

Sun Fire X4600 servers are fast and energy efficient, and are the only four-way x64 servers to scale to 16-way in a compact 4RU form factor. Indeed, this powerful rackmount server scales quickly from four to eight sockets, simply by adding modular processor boards. This innovative design enables Sun Fire X4600 systems to be upgraded and scaled to next-generation processors and memory without disrupting the existing software and network environment. Sun Fire X4600 servers support up to 64 GB of DDR-400 memory with ECC.

In the TSUBAME grid, 360 compute servers are configured with a ClearSpeed Advance accelerator board for added floating-point performance. The accelerator board combines two CSX600 processors in a PCI-X form factor and delivers 96 GFLOPS theoretical peak performance and 50 GFLOPS sustained double-precision matrix multiply (DGEMM of BLAS) performance, while averaging 25 watts of power consumption.
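For readers unfamiliar with the benchmark, DGEMM is the standard double-precision matrix-multiply routine in BLAS. The sketch below shows the kind of call an accelerated BLAS library is meant to speed up; the matrix size, the Fortran-style dgemm_ binding, and the idea of substituting an accelerated BLAS at link time are illustrative assumptions, not details drawn from the TSUBAME configuration.

```c
/*
 * Minimal sketch of a double-precision matrix multiply (DGEMM) call
 * through the Fortran-style BLAS interface: C = alpha*A*B + beta*C.
 * Linking against an optimized or accelerated BLAS library is what
 * turns this single call into high sustained floating-point work.
 */
#include <stdio.h>
#include <stdlib.h>

/* Fortran BLAS binding; an accelerated library provides the same symbol. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    const int n = 1024;                 /* illustrative matrix size */
    const double alpha = 1.0, beta = 0.0;
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    if (!a || !b || !c)
        return 1;

    for (long i = 0; i < (long)n * n; i++) {
        a[i] = 1.0;                     /* trivial test data */
        b[i] = 2.0;
        c[i] = 0.0;
    }

    /* Column-major, no transposition: C = A * B */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    printf("c[0] = %f (expected %f)\n", c[0], 2.0 * n);

    free(a);
    free(b);
    free(c);
    return 0;
}
```

Because the work is concentrated in a single library call, redirecting it to faster hardware is typically a matter of linking against a different BLAS implementation rather than changing application code.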
Ultra high-density storage

Forty-two high-performance Sun Fire X4500 servers running RedHat Enterprise Linux 4 provide storage for the TSUBAME grid. These high-density data servers each incorporate 48 direct attached, hot-swappable 500 GB SATA drives, for a total storage capacity of 1 PB. Each Sun Fire X4500 server also includes one PCI-X 4x SDR InfiniBand host adapter.

By integrating high-performance AMD Opteron processors with massive data storage, Sun Fire X4500 servers provide high storage density and fast throughput rates at nearly half the cost of traditional solutions. In fact, these systems deliver four-way x64 server performance and up to 24 TB of direct attached storage in a 4U form factor, with 1 GB/second throughput from disks to network and 2 GB/second throughput to memory. Sun Fire X4500 servers support up to 16 GB of DDR-400 memory with ECC.

High-speed InfiniBand interconnect

All Sun Fire X4600 compute systems and Sun Fire X4500 data servers are connected to an InfiniBand network through eight Voltaire Grid Director ISR9288 high-speed InfiniBand switches. Each switch provides 20 Gbps bidirectional bandwidth for 288 InfiniBand ports in a single 14U chassis, enabling 1,352 server and storage links. Up to 11.52 Tbps full bisectional switch bandwidth in a fat-tree architecture is possible, with less than 420 nanoseconds of latency between any two ports. As a result, Voltaire ISR9288 switches can be interconnected to form large clusters consisting of thousands of nodes.

The InfiniBand connectivity schema is designed to provide the TSUBAME grid with optimum network balancing, maximum availability, and high performance. All Sun Fire X4500 and Sun Fire X4600 servers are connected to one of six edge InfiniBand switches. These six switches are in turn connected to two Voltaire ISR9288 core switches. With 24 links between each edge and core switch, the system has a blocking factor of 5:1 and a maximum of nine node hops. Multiple paths are available through the core switches, fostering high availability. Also, each InfiniBand host adapter installed in a Sun Fire X4600 compute server is attached to a different line board. As a result, each link is connected to one of the 24 chips on each line board, providing optimum distribution across the edge switches.

Figure 1. The TSUBAME grid system architecture: 648 Sun Fire X4600 compute nodes with ClearSpeed CSX600 accelerators, an InfiniBand network (1,440 Gbps) built from eight Voltaire ISR 9288 switches, 42 Sun Fire X4500 storage servers, NEC iStorage S1800AT storage, and external network devices for grid connectivity.
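The quoted 5:1 blocking factor can be checked with simple arithmetic from the figures above. The short program below assumes the 1,338 host links (two per Sun Fire X4600, one per Sun Fire X4500) are spread evenly across the six edge switches, and that each edge switch dedicates 24 links to each of the two core switches; the even distribution is our assumption for illustration, not a statement from the design documents.

```c
/* Rough arithmetic behind the quoted ~5:1 blocking factor of the
 * TSUBAME InfiniBand fat tree, under the assumptions stated above. */
#include <stdio.h>

int main(void)
{
    int compute_links = 648 * 2;   /* two SDR HCAs per Sun Fire X4600   */
    int storage_links = 42 * 1;    /* one HCA per Sun Fire X4500        */
    int edge_switches = 6;
    int core_switches = 2;
    int links_per_edge_core = 24;  /* links between each edge/core pair */

    double host_links_per_edge =
        (double)(compute_links + storage_links) / edge_switches;
    double uplinks_per_edge = (double)core_switches * links_per_edge_core;

    printf("host links per edge switch : %.1f\n", host_links_per_edge);
    printf("uplinks per edge switch    : %.1f\n", uplinks_per_edge);
    printf("approximate blocking factor: %.1f : 1\n",
           host_links_per_edge / uplinks_per_edge);
    return 0;
}
```

The result, roughly 4.6 host links per uplink under these assumptions, is consistent with the approximately 5:1 figure quoted for the design.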

TSUBAME grid software

A wide variety of software packages run on the compute and data servers and work together to make the TSUBAME grid widely accessible to users.

Compute server software stack

All Sun Fire X4600 servers in the TSUBAME grid run the SuSE Linux Enterprise Server 9 SP3 environment, as well as the following:

• Sun N1 Grid Engine 6.0 software provides distributed resource management for user jobs running on the grid. The Sun N1 Grid Engine software runs on a Sun Fire X4100 management server within the grid.
• Lustre client software provides access to the parallel file systems running on Sun Fire X4500 data servers.
• Sun N1 System Manager 1.3 software enables the provisioning and management of grid resources over a dedicated 100 Mbps Ethernet network. Data and applications do not utilize the Ethernet network at any time.
• The NEC Operations Management Application provides access to user home directories located on NEC iStorage systems.
• PGI 6.1 and GNU (gcc) compilers are installed on all compute nodes in the cluster.
• A variety of Message Passing Interface (MPI) tools, such as MPICH, OpenMPI, and HP-MPI, are installed for application portability. Some of these tools utilize the IP over InfiniBand (IPoIB) protocol rather than native InfiniBand protocols.
• The Voltaire ibhost tool enables applications to employ MPI communication over the InfiniBand network. Based on MVAPICH, the Voltaire implementation includes several enhancements for the TSUBAME grid, including support for two accelerator cards in a single system, a shared receive queue, and adaptive FASTPATH.

Data server software stack

All Sun Fire X4500 servers in the TSUBAME grid run the RedHat Enterprise Linux 4 operating environment, as well as the Lustre software from Cluster File Systems, Inc. The Lustre 1.4.7 software provides scalable, distributed file systems in the TSUBAME grid. This object-based cluster file system simplifies the configuration of large quantities of storage systems. In addition, all Sun Fire X4500 servers run the Sun N1 System Manager software.

Making the grid accessible

What makes TSUBAME unique is its ability to make vast computing and storage resources available to a wide range of users running off-the-shelf applications with ease. The Sun N1 Grid Engine software makes this possible by managing how jobs are allocated to systems in the grid — without users needing to know the underlying details of where jobs run.

By using the Sun N1 Grid Engine software, the physical systems that comprise the TSUBAME grid can be viewed logically (Figure 2). Users log in to the grid via login nodes that are load balanced using a round-robin policy. Sessions are then transferred to an interactive node by the Sun N1 Grid Engine software. It is on these interactive servers that users create and submit jobs, and compile and run applications. Batch nodes are available for batch job processing, as well as for applications like Matlab and Mathematica. Jobs from these systems run only on execution nodes. Management nodes take care of licensing, backup and restore operations, high-availability NFS (HA-NFS) access, resource accounting, and more. A sketch of the kind of job that flows through these nodes appears after Figure 2.

Figure 2. A logical view of the TSUBAME grid: UNIX and Windows login nodes, four interactive nodes, 644 batch nodes, execution nodes, and management nodes, supported by N1 Grid Engine master and shadow master hosts, a license server, an accounting database server, an NFS server, and Sun Web Console monitoring.
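As a concrete picture of the kind of user job that flows through these nodes, the minimal MPI program below could be built with any of the MPI stacks named earlier (for example, the mpicc wrapper from MPICH or OpenMPI) and run across execution nodes. The program is a generic illustration, not an application taken from the TSUBAME workload, and the queue names and submission options used with N1 Grid Engine are site-specific and therefore omitted.

```c
/* Minimal MPI program: each rank reports itself, and rank 0 sums a value
 * from all ranks with a collective reduction. Compile with an MPI wrapper
 * compiler (e.g. mpicc) and launch with mpirun/mpiexec. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d is alive\n", rank, size);

    /* Collective reduction over the InfiniBand (or IPoIB) fabric. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}
```

In practice such a binary would be submitted to the scheduler rather than launched by hand, with the choice of MPI implementation determining whether messages travel over native InfiniBand or IPoIB.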

"Not only is the performance of the Sun Grid HPC environment extremely impressive today, but the ability of the architecture to scale rapidly is really phenomenal, and will enable us to grow our environment to meet our needs for many years to come — no matter how compute-intensive our projects may be."

— Professor Satoshi Matsuoka, Head of Research Infrastructure, Global Scientific Information and Computing Center, Tokyo Institute of Technology

Figure 3. The TSUBAME supercomputing grid infrastructure at the Tokyo Institute of Technology

Grid achievements

Creating a supercomputer is no small task. With a little imagination and access to superior technology, Sun and Tokyo Tech created an elegant, high-performance grid solution that can grow and change to meet user demand.

• High performance — In high-performance computing environments, speed is essential. The TSUBAME grid provides 47.38 TeraFLOPS sustained (Rmax) and 82.12 TeraFLOPS peak (Rpeak) Linpack performance, with an efficiency rating of 57.69 percent as measured by TOP500³. Tokyo Tech plans to increase the capacity and performance of the grid to reach 100 TeraFLOPS in the future.

• Fast deployment — Most proprietary supercomputers take as long as two years to go from inception to design, through implementation, and deployment. The TSUBAME grid took a mere 35 days to build, giving the institute wider access to more compute power significantly sooner than expected. This accomplishment was a direct result of Sun's grid design expertise, combined with Sun Customer Ready Systems configuration support and NEC integration expertise.

• Easy serviceability — Keeping systems going is essential if work is to get done. Built using racked Sun Fire x64 servers designed for accessibility, TSUBAME is easy to service and manage. Systems can be easily added and removed, and cabling is clean and organized.

• Ability to run any application — Unlike many supercomputers designed for dedicated applications, the TSUBAME grid can run common off-the-shelf applications, including PC applications, without porting code. As a result, more students can effortlessly share the resources of this powerful grid.

• Scalability, configurability, extensibility — Protecting technology investments is key for any organization. By deploying a grid architecture using off-the-shelf components that can scale horizontally, vertically, or diagonally, and be configured and repurposed as needed, Tokyo Tech can take advantage of extensive compute power today and be sure the system will scale as demand rises. Tokyo Tech expects to increase grid capacity to enable more users to access these vital resources.

Sun and Tokyo Tech

Tokyo Tech is a premier technological institute. Sun provides the products needed to create innovative computing infrastructures. Together, Sun and Tokyo Tech have built a supercomputer-class grid that costs less to run.

1,2,3. TOP500 Supercomputing Sites, November 2006, http://www.top500.org/lists/2006/11

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN Web sun.com
© 2006 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, N1, Solaris, and Sun Fire are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. AMD Opteron and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices, Inc. SPEC, SPECint2000, SPECfp2000, and SPECfp_rate2000 are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). Information subject to change without notice. Printed in USA 12/06 SunWIN #491937