A Case Study on 4 Parallel Machines
By:
1. U. Sankar Tejaswi (03CS1001)
2. Aditya Awasthi (03CS1020)
3. Kunal Silku (03CS1013)
4. Pippari Suresh Chandra (03CS1024)

THUNDER

Thunder was built by California Digital Corporation at Lawrence Livermore National Laboratory (LLNL) in 2004. It is currently ranked #14 on the Top 500 list.

Operating system: CHAOS 3.0 (Clustered High Availability Operating System)
File system: Lustre Lite cluster-wide file system
Interconnect: Quadrics QsNetII

Thunder at a Glance
- Nodes: 1024
- CPUs per node: 4
- Total CPUs: 4096
- CPU speed: 1.4 GHz
- Bisection bandwidth: 920 GB/s
- Peak performance: 22.9 TFlops
- Cache per node: 4 MB
- Memory per node: 8 GB
- Total memory: 8.2 TB
- Memory type: SDRAM
- Total disk space: 75 GB (local), 151 TB (global)
- OS: CHAOS 3.0

CHAOS

The CHAOS software stack includes:
- Kernel: a Red Hat kernel with VM and device support for the Quadrics Elan4 and Lustre.
- Quadrics libelan, IP/Elan, MPI, etc.: the Quadrics software environment used to run parallel programs.
- lm_sensors: hardware monitoring.
- fping: node status tool that pings nodes in parallel.
- Intel compilers: the Intel IA32 FORTRAN, C, and C++ compilers.
- PGI compilers: the Portland Group Fortran and C compilers.

SLURM

SLURM (Simple Linux Utility for Resource Management) is the cluster's resource manager. Its primary functions are:
• Monitoring the state of nodes.
• Logically organizing the nodes into partitions with flexible parameters.
• Working with the Distributed Production Control System (DPCS), which manages the order of job initiations.
• Allocating both node and interconnect resources to jobs.
• Monitoring the state of running jobs, including resource utilization rates.
A sketch of the kind of parallel job SLURM launches appears at the end of this section.

LUSTRE LITE Features

- Decouples computational and storage resources.
- Serves clients ranging from desktop users to file servers.
- Simplifies data access and usability by providing a consistent view of the distributed file system.
- Supports redundancy and provides uninterrupted service in the event of a failure.
- Lustre uses Object Storage Targets (OSTs) and Metadata Servers (MDSs).
- Distributed OSTs do the actual file system I/O and the interfacing with storage devices.
- Replicated, failover MDSs keep track of high-level file and file system changes.
- Strong file and metadata locking semantics maintain total coherency of the file system.
- File locking is distributed: each OST handles the locks for the objects it stores.

LUSTRE Functionality

- Lustre treats files as objects that are located through MDSs and stored on Object-Based Disks (OBDs).
- Metadata servers support all file system namespace operations and direct actual file I/O requests to the OSTs.
- Metadata servers keep a transactional record of file system metadata changes and cluster status, and support failover.

Differences from a conventional file system:
- Inodes refer to objects on OSTs.
- File creation goes through the MDS.
- I/O is done through the OSTs, independent of the MDS.
- The MDS is updated only when additional namespace changes are made.

Lustre Fail-over Mechanism

- A powerful and unique recovery mechanism: time-out, query, redirect.
- When failover OSTs are not available, Lustre automatically adapts: new file creation operations automatically avoid a malfunctioning OST.
(Figures: Lustre fail-over mechanism.)
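To make the resource-management picture concrete, below is a minimal sketch of the kind of MPI job that SLURM allocates node and interconnect resources for on Thunder. It is an illustrative example written for this case study, not LLNL code: the program only reports which node each task landed on, and the 4-tasks-per-node layout mentioned in the comments is an assumption based on Thunder's 4 CPUs per node.

/* Illustrative MPI job for a cluster like Thunder.
 * Each task reports its rank and the node it was placed on; that placement
 * is the result of SLURM's node allocation, and the Quadrics MPI library
 * carries the communication over QsNetII. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ntasks, namelen;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Get_processor_name(node, &namelen);

    /* With 4 CPUs per node, SLURM would typically place 4 such tasks per node. */
    printf("task %d of %d on node %s\n", rank, ntasks, node);

    MPI_Finalize();
    return 0;
}

Under SLURM such a program would normally be started with srun, which selects nodes from a partition and tracks where each task runs; the exact launch options are site-specific and not covered by the case study.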
RED STORM: CRAY XT3 SUPERCOMPUTER, SCALABLE BY DESIGN

Red Storm Specs
- Ranked 9th in the current Top 500 list of supercomputers.
- Theoretical peak: 41.47 TFlops.
- Architecture: distributed-memory MIMD.
- Number of compute processors: 10,368.
- Total memory: 31.2 TB.
- Disk storage: 240 TB.

Salient Features of Cray XT3
- Scalable processing elements, each with its own high-performance AMD Opteron processor and memory.
- High-bandwidth, low-latency interconnect.
- MPP-optimized operating system.
- Standards-based programming environment.
- High-speed, highly reliable I/O system.

Scalable Processing Elements
Processing elements (PEs) are the basic building blocks. Each PE consists of an Opteron 64-bit processor, dedicated memory, and a HyperTransport link to the Cray SeaStar communication engine.

AMD Opteron Processor
- Highly associative 1 MB cache.
- Out-of-order execution.
- Eliminates the "Northbridge", giving a very low-latency path to local memory: less than 60 nanoseconds.
(Figure: a typical North/South bridge layout. Courtesy: wikipedia.org)

3D Torus Connected Architecture
- Compute PEs run a lightweight kernel; service PEs run Linux and can be configured for I/O, login, network, etc.
- Each PE is connected to 6 neighbours (an MPI sketch of this topology follows the configuration table below).
- Removes PCI bottlenecks.
(Figure: scalable architecture. Courtesy: cray.com)

Scalable OS: UNICOS/lc
- Compute PEs run the Catamount microkernel, which manages virtual memory, memory protection, and basic scheduling.
- Service PEs run full-featured Linux and provide:
  - Login: Linux utilities, shell, and commands.
  - I/O: connectivity to the global parallel file system.
  - System: system services such as the system database.
  - Network: high-speed connectivity.
- Jobs are submitted from login PEs or through the PBS Pro batch program, which is integrated with the system PE's scheduler.

Memory
- Can be configured with 1-8 GB per PE.
- Protected by Chipkill technology:
  - Bit scattering, so at most one bit per ECC word is affected by a chip failure.
  - Dynamic bit-steering: if one chip fails, a spare chip is used to replace it.
(Figure: bit scattering in Chipkill. Courtesy: dell.com)

Scalable Interconnect
- High bandwidth, low latency.
- Cray SeaStar chips in a 3D torus topology, with no switches.
- Carries all message traffic and I/O traffic.

Cray SeaStar Components
1. HyperTransport link
2. DMA engine
3. Communication and management processor
4. High-speed interconnect router
5. Service port

HyperTransport Link
- Low packet overhead: 8 bytes for a read, 12 bytes for a write.
- 8-bit parallel link.
(Figures: HyperTransport packet format and link. Courtesy: hypertransport.org)

Other Important Features
- The DMA engine and its associated PowerPC 440 processor offload message preparation and demultiplexing tasks from the Opterons.
- Working with the Cray XT3 OS, they provide a direct path from application to hardware.

Scalable I/O
- Data RAID (Redundant Array of Inexpensive Disks) is connected directly to the I/O PEs over the high-speed interconnect.
- Can be configured to the desired bandwidth by selecting an appropriate number of RAIDs and service PEs.
- Data moves directly between application space and the I/O system (zero copy).

XT3 Configuration (Courtesy: cray.com)

                                     6 Cabinets   24 Cabinets   96 Cabinets   320 Cabinets
Compute PEs                                 548          2260          9108          30508
Service PEs                                  14            22            54            106
Max memory (TB)                             4.3          17.7          71.2            239
Aggregate memory bandwidth (TB/s)           2.5          14.5          58.3            196
Interconnect topology                    6x12x8      12x12x16      24x16x24       40x32x24
Peak bisection bandwidth (TB/s)             0.7           2.2           5.8           11.7
Floor space                                  12            72           336           1200
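As a concrete illustration of how an application sees the 3D torus described above, the sketch below builds a periodic 3D process grid with MPI's Cartesian topology routines and looks up each PE's six neighbours. It is an assumption, not Cray code: MPI_Dims_create picks a grid shape to fit whatever job size is launched, whereas Red Storm's physical torus is fixed (e.g. 6x12x8 for the 6-cabinet configuration in the table).

/* Sketch: a periodic 3D process grid, the application-level analogue of
 * the XT3's 3D torus in which every PE has six neighbours.
 * The grid shape is derived from the number of MPI tasks at run time;
 * mapping it onto the physical torus is left to the system software. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int ntasks, rank, coords[3], minus, plus;
    int dims[3] = {0, 0, 0};       /* let MPI factor the task count into 3D */
    int periods[3] = {1, 1, 1};    /* wrap around in every dimension: a torus */
    MPI_Comm torus;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    MPI_Dims_create(ntasks, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    MPI_Comm_rank(torus, &rank);
    MPI_Cart_coords(torus, rank, 3, coords);

    /* Neighbours along dimension 0; dimensions 1 and 2 work the same way,
     * giving each PE its six nearest neighbours. */
    MPI_Cart_shift(torus, 0, 1, &minus, &plus);

    printf("rank %d at (%d,%d,%d): x-neighbours %d and %d in a %dx%dx%d torus\n",
           rank, coords[0], coords[1], coords[2], minus, plus,
           dims[0], dims[1], dims[2]);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}

On Red Storm such a job would be submitted through PBS Pro from a login PE, as described above; the reorder flag passed to MPI_Cart_create allows the MPI library to renumber ranks so that neighbouring grid points land on nearby PEs.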
SGI ALTIX COLUMBIA SUPERCOMPUTER

About the Columbia Supercomputer
- The fastest cluster in the world; it currently occupies 4th place on the Top 500 list, with Blue Gene the only parallel supercomputer placed higher.
- Built by SGI.
- Based on SGI Altix 3700 fourth-generation supercomputers.
- Actively used by NASA for science and engineering, the agency's missions, and the Vision for Space Exploration.

Columbia System Facts
- Based on the SGI NUMAflex architecture:
  - 20 SGI Altix 3700 superclusters, each with 512 processors.
  - Global shared memory across each set of 512 processors.
- 10,240 Intel Itanium 2 processors:
  - Current processor speed: 1.5 GHz.
  - Current cache: 6 MB.
- 1 terabyte of memory per 512 processors, for 20 terabytes of total memory.
- Operating environment:
  - Linux-based operating system.
  - PBS Pro job scheduler.
  - Intel Fortran/C/C++ compilers.
  - SGI ProPack 3.2 software.
- Interconnect:
  - SGI NUMAlink.
  - InfiniBand network.
  - 10-gigabit Ethernet.
  - 1-gigabit Ethernet.
- Storage:
  - Online: 440 terabytes of Fibre Channel RAID storage.
  - Archive storage capacity: 10 petabytes.

NUMAflex Architecture
- Uses the SGI NUMA (cache-coherent, non-uniform memory access) protocol, implemented directly in hardware for performance, together with a modular packaging scheme.
- The key to the NUMAflex design of the Altix is a controller ASIC, referred to as the SHUB, that interfaces to the Itanium 2 front-side bus, to the memory DIMMs (Dual Inline Memory Modules, each with a 64-bit data path to its memory chips), to the I/O subsystem, and to other NUMAflex components in the system.

Global Shared Memory
- A single memory address space visible to all system resources, including microprocessors and I/O, across all nodes.
- Allows data to be accessed directly and extremely quickly, without having to go through I/O or networking bottlenecks.
- Requires a sophisticated system memory interconnect such as NUMAlink and software that enables shared-memory calls, such as the Message Passing Toolkit (MPT) and XPMEM from SGI. (A shared-memory programming sketch appears at the end of this section.)

Components of the Altix 3700
- C-brick (compute brick): 4 processors, 2 SHUBs, and up to 32 GB of memory.
- M-brick (memory brick): essentially a C-brick without the processors; it can be placed in any location in the interconnect fabric that could be occupied by a C-brick.
- R-brick (an 8-port NUMAlink 3 router brick): used to build the interconnect fabric between the C-bricks and M-bricks.
- IX-brick (the base I/O brick) and PX-brick (a PCI-X expansion brick): attach to the C-brick via the I/O channel.
- D-brick2 (a second-generation JBOD brick).

Altix C-Brick Schematic
(Figure taken from the white paper "The SGI Altix 3000 Global Shared-Memory Architecture" by Michael Woodacre, Derek Robb, Dean Roe, and Karl Feind.)

Altix C-Brick Functioning
- Each SHUB ASIC in a C-brick supports four DDR buses.
- Each DDR bus may contain up to four DIMMs (each DIMM is 72 bits wide: 64 bits of data and 8 bits of ECC).
- The four memory buses are independent and can operate simultaneously to provide up to 12.8 GB per second of memory bandwidth.
- Each SHUB ASIC contains a directory cache for the most recent cache-coherency state information, keeping look-up times low.

NUMAflex (Processors + Memory)
- Allows the system to scale up to 512 processors, all working together in a cache-coherent manner.
- The Itanium 2 supports up to four processors on a bus.
- If the required data is not in one of the on-die caches, the processor sends a request for a cache line of data from the global shared memory.
- Hardware keeps the data coherent without software intervention, through snooping and a cache directory.

Memory and Cache Hierarchy
(Figure taken from the white paper "The SGI Altix 3000 Global Shared-Memory Architecture" by Michael Woodacre, Derek Robb, Dean Roe, and Karl Feind.)

Cache Sharing
- Capable of coherently sharing cache lines among up to 512 processors.
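To illustrate what programming against Columbia's global shared memory looks like, here is a small shared-memory sketch. It uses OpenMP purely as a generic stand-in; the case study names MPT and XPMEM as SGI's toolkits, and the choice of OpenMP is this write-up's assumption. Every thread works on one array in a single address space, and the cache-coherence hardware described above keeps their views consistent without any explicit messaging in the program.

/* Shared-memory sketch: one address space, many processors.
 * OpenMP here stands in for any shared-memory programming model;
 * it is not taken from the case study. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   /* 16M doubles in one shared array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;
    if (a == NULL)
        return 1;

    /* Parallel first touch: on a NUMA system each page tends to be placed in
     * memory local to the thread that first writes it, which spreads the
     * array across nodes while it remains one globally addressable object. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = 1.0 / (double)(i + 1);

    /* All threads read the shared array with plain loads; the hardware
     * directory keeps their caches coherent, and the reduction merges
     * the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.6f using up to %d threads\n", sum, omp_get_max_threads());
    free(a);
    return 0;
}

The same loops on a distributed-memory machine would need explicit message passing; on the Altix the SHUB directory makes the sharing transparent to the code.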