Paragon XP/S 150 @ Oak Ridge National Labs

Intel Paragon XP/S Overview

I Distributed-memory MIMD multicomputer

I 2D array of nodes, performing both OS functionality and user computation
  ● Main memory physically distributed among nodes (16–64 MB / node)
  ● Each node contains two Intel i860 XP processors: an application processor for the user’s program and a message processor for inter-node communication

I Balanced design: speed and memory capacity matched to interconnection network, storage facilities, etc.
  ● Interconnect bandwidth scales with number of nodes
  ● Efficient even with thousands of processors


Paragon XP/S Nodes

I Network Interface Controller (NIC)
  ● Connects node to its PMRC
  ● Parity-checked, full-duplex router with error checking

I Message processor
  ● Intel i860 XP processor
  ● Handles all details of sending / receiving a message between nodes, including protocols, packetization, etc.
  ● Supports global operations including broadcast, synchronization, sum, min, and, or, etc.

I Application processor
  ● Intel i860 XP processor (42 MIPS, 50 MHz clock) to execute user programs

I 16–64 MB of memory

Paragon XP/S Node Interconnection

I 2D mesh chosen after extensive analytical studies and simulation

I Paragon Mesh Routing Chip (PMRC) / iMRC routes traffic in the mesh
  ● 0.75 µm, triple-metal CMOS
  ● Routes traffic in four directions and to and from the attached node at > 200 MB/s
    I 40 ns to make routing decisions and close appropriate switches
    I Transfers are parity checked, router is pipelined, routing is deadlock-free
  ● Backplane is an active backplane of router chips rather than a mass of cables

Paragon XP/S Usage

I OS is based on UNIX, provides distributed system services and full UNIX to every node
  ● System is divided into partitions, some for I/O, some for system services, rest for user applications

I Applications can run on an arbitrary number of nodes without change
  ● Run on a larger number of nodes to process larger data sets or to achieve required performance

I Users have client/server access, can submit jobs over a network, or log in directly to any node
  ● Comprehensive resource control utilities for allocating, tracking, and controlling system resources

Paragon XP/S Programming

I MIMD architecture, but supports various programming models: SPMD, SIMD, MIMD, shared memory, vector shared memory (see sketch below)

I CASE tools including:
  ● Optimizing compilers for FORTRAN, C, C++, Ada, and Data Parallel FORTRAN
  ● Interactive Debugger
  ● Parallelization tools: FORGE, CAST
  ● Intel’s ProSolver library of equation solvers
  ● Intel’s Performance Visualization System (PVS)
  ● Performance Analysis Tools (PAT)
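The SPMD model and the global operations mentioned earlier (broadcast, sum, min) map directly onto a message-passing library. The sketch below uses present-day MPI purely as an illustration of the idea: every node runs the same program and joins a global sum. The Paragon's own interface was Intel's NX library, whose calls differ.

    /* SPMD sketch: every node runs this same program.
     * MPI is used here for illustration only; the Paragon itself shipped
     * with Intel's NX message-passing library. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many nodes?  */

        local = (double)rank;                   /* per-node partial result */

        /* Global sum across all nodes -- the kind of operation the
         * Paragon's message processors accelerated. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d nodes = %g\n", nprocs, global);

        MPI_Finalize();
        return 0;
    }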


Thinking Machines CM-5 Overview

I Distributed-memory MIMD multicomputer
  ● SIMD or MIMD operation

I Processing nodes are supervised by a control processor, which runs UNIX
  ● Control processor broadcasts blocks of instructions to the processing nodes and initiates execution
  ● SIMD operation: nodes are closely synchronized, blocks broadcast as needed
  ● MIMD operation: nodes fetch instructions independently and synchronize only as required by the algorithm

I Nodes may be divided into partitions
  ● One control processor, called the partition manager, per partition
  ● Partitions may exchange data

Thinking Machines CM-5 Overview (cont.)

I Other control processors, called I/O Control Processors, manage the system’s I/O devices
  ● Scale to achieve necessary I/O capacity
  ● DataVaults to provide storage

I Control processors in general
  ● Scheduling user tasks, allocating resources, servicing I/O requests, accounting, security, etc.
  ● May execute some code
  ● No arithmetic accelerators, but additional I/O connections
  ● In a small system, one control processor may play a number of roles
  ● In a large system, control processors are often dedicated to particular tasks (partition manager, I/O control processor, etc.)

Thinking Machines CM-5 @ NCSA

CM-5 Nodes

I Processing nodes
  ● SPARC CPU
    I 22 MIPS
  ● 8–32 MB of memory
  ● Network interface
  ● (Optional) 4 pipelined vector processing units
    I Each can perform up to 32 million double-precision floating-point operations per second
    I Including divide and square root

I Fully configured CM-5 would have
  ● 16,384 processing nodes
  ● 512 GB of memory
  ● Theoretical peak performance of 2 teraflops


CM-5 Networks

I Control Network
  ● Tightly coupled communication services
  ● Optimized for fast response, low latency
  ● Functions: synchronizing processing nodes, broadcasts, reductions, parallel prefix operations

I Data Network
  ● 4-ary hypertree, optimized for high bandwidth
  ● Functions: point-to-point communication for tens of thousands of items simultaneously
  ● Responsible for eventual delivery of messages accepted
  ● Network Interface connects nodes or control processors to the Control or Data Network (memory-mapped control unit)

Tree Networks (Reference Material)

I Binary Tree
  ● 2^k – 1 nodes arranged into a complete binary tree of depth k–1
  ● Diameter is 2(k–1)
  ● Bisection width is 1

I Hypertree
  ● Low diameter of a binary tree plus improved bisection width
  ● Hypertree of degree k and depth d
    I From “front”, looks like a k-ary tree of height d
    I From “side”, looks like an upside-down binary tree of height d
    I Join both views to get the complete network
  ● 4-ary hypertree of depth d
    I 4^d leaves and 2^d(2^(d+1) – 1) nodes
    I Diameter is 2d
    I Bisection width is 2^(d+1)
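A quick numeric check of the formulas above (illustrative code, not part of the original notes): it shows why the Data Network uses a hypertree — the binary tree's bisection width stays at 1 while the 4-ary hypertree's grows as 2^(d+1).

    /* Compare binary-tree and 4-ary-hypertree metrics from the formulas above. */
    #include <stdio.h>

    int main(void)
    {
        /* Complete binary tree of depth k-1: 2^k - 1 nodes,
         * diameter 2(k-1), bisection width 1. */
        for (int k = 2; k <= 5; k++) {
            long nodes = (1L << k) - 1;
            printf("binary tree k=%d: %ld nodes, diameter %d, bisection 1\n",
                   k, nodes, 2 * (k - 1));
        }

        /* 4-ary hypertree of depth d: 4^d leaves, 2^d(2^(d+1) - 1) nodes,
         * diameter 2d, bisection width 2^(d+1). */
        for (int d = 1; d <= 4; d++) {
            long leaves = 1L << (2 * d);                    /* 4^d */
            long nodes  = (1L << d) * ((1L << (d + 1)) - 1);
            printf("hypertree   d=%d: %ld leaves, %ld nodes, diameter %d, bisection %ld\n",
                   d, leaves, nodes, 2 * d, 1L << (d + 1));
        }
        return 0;
    }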

CM-5 Usage

I Runs CMost, an enhanced version of SunOS

I User task sees a control processor acting as a Partition Manager (PM), a set of processing nodes, and inter-processor communication facilities
  ● User task is a standard UNIX process running on the PM, and one on each of the processing nodes
  ● The CPU scheduler schedules the user task on all processors simultaneously

I User tasks can read and write directly to the Control Network and Data Network
  ● Control Network has hardware for broadcast, reduction, parallel prefix operations, barrier synchronization (see sketch below)
  ● Data Network provides reliable, deadlock-free point-to-point communication
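The Control Network's reductions and parallel prefix operations correspond to what message-passing libraries call reduce and scan. A minimal sketch using present-day MPI for illustration (the CM-5's native library exposed these operations through its own calls):

    /* Parallel prefix (scan) sketch -- the class of operation the CM-5
     * Control Network performed in hardware.  MPI used for illustration. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, prefix;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank + 1;   /* each node contributes one value */

        /* prefix on node i becomes value_0 + value_1 + ... + value_i */
        MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("node %d: prefix sum = %d\n", rank, prefix);
        MPI_Finalize();
        return 0;
    }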

IBM SP2 Overview

I Distributed-memory MIMD multicomputer

I Scalable POWERparallel 1 (SP1)
  ● Development started February 1992, delivered to users in April 1993

I Scalable POWERparallel 2 (SP2)
  ● 120-node systems delivered 1994
    I 4–128 nodes: RS/6000 workstation with POWER2 processor, 66.7 MHz, 267 MFLOPS
    I POWER2 used in RS/6000 workstations, gives compatibility with existing software
  ● 1997 version (NUMA):
    I High Node (SMP node, 16 nodes max): 2–8 PowerPC 604, 112 MHz, 224 MFLOPS, 64 MB–2 GB memory
    I Wide Node (128 nodes max): 1 P2SC (POWER2 Super Chip — the 8-chip POWER2 chip set on one chip), 135 MHz, 640 MFLOPS, 64 MB–2 GB memory

IBM SP2 @ Oak Ridge National Labs

IBM SP2 Overview (cont.)

I RS/6000 as system console

I SP2 runs various combinations of serial, parallel, interactive, and batch jobs
  ● Partition between types can be changed
  ● High nodes — interactive nodes for code development and job submission
  ● Thin nodes — compute nodes
  ● Wide nodes — configured as servers, with extra memory, storage devices, etc.

I A system “frame” contains 16 thin nodes or 8 wide nodes
  ● Includes redundant power supplies; nodes are hot swappable within the frame
  ● Includes a high-performance switch for low-latency, high-bandwidth communication

IBM SP2 Processors

I POWER2 processor
  ● Various versions from 20 to 62.5 MHz
  ● RISC processor, load-store architecture
    I Floating-point multiply & add instruction with a latency of 2 cycles, pipelined for initiation of a new one each cycle
    I Conditional branch to decrement and test a “count register” (without fixed-point unit involvement), good for loop closings

I POWER2 processor chip set
  ● 8 semi-custom chips: Instruction Cache Unit, four Data Cache Units, Fixed-Point Unit (FXU), Floating-Point Unit (FPU), and Storage Control Unit
    I 2 execution units per FXU and FPU
    I Can execute 6 instructions per cycle: 2 FXU, 2 FPU, branch, condition register
    I Options: 4-word memory bus with 128 KB data cache, or 8-word with 256 KB

IBM SP2 Interconnection Network

I General
  ● Multistage High Performance Switch (HPS) network, with extra stages added to keep bandwidth to each processor constant
  ● Message delivery
    I PIO for short messages with low latency and minimal message overhead
    I DMA for long messages
  ● Multi-user support — hardware protection between partitions and users, guaranteed fairness of message delivery

I Routing
  ● Packet switched = each packet may take a different route
  ● Cut-through = if output is free, starts sending without buffering first
  ● Wormhole routing = buffer on subpacket basis if buffering is necessary


IBM SP2 AIX Parallel Environment

I Parallel Operating Environment — based on AIX, includes Desktop interface
  ● Partition Manager to allocate nodes, copy tasks to nodes, invoke tasks, etc.
  ● Program Marker Array — (online) squares graphically represent program tasks
  ● System Status Array — (offline) squares show percent of CPU utilization

I Parallel Message Passing Library

I Visualization Tool — view online and offline performance
  ● Group of displays for communication characteristics or performance (connectivity graph, inter-processor communication, message status, etc.)

I Parallel Debugger

nCUBE Overview

I Distributed-memory MIMD multicomputer (with hardware to make it look like a shared-memory multiprocessor)

I History
  ● nCUBE 1 — 1985
  ● nCUBE 2 — 1989
    I 34 GFLOPS, scalable
    I ?–8192 processors
  ● nCUBE 3 — 1995
    I 1–6.5 TFLOPS, 65 TB memory, 24 TB/s hypercube interconnect, 1024 I/O channels of 3 GB/s each, scalable
    I 8–65,536 processors

I Operation
  ● Can be partitioned into “subcubes”
  ● Programming paradigms: SPMD, inter-subcube processing, client/server

nCUBE 3 Processor

I 0.6 µm, 3-layer CMOS, 2.7 million transistors, 50 MHz, 16 KB data cache, 16 KB instruction cache, 100 MFLOPS
  ● Argument against an off-the-shelf processor: shared memory, vector floating-point units, and aggressive caches are necessary in the workstation market but superfluous here

I ALU, FPU, virtual memory management unit, caches, SDRAM controller, 18-port message router, and 16 DMA channels
  ● ALU for integer operations, FPU for floating-point operations, both 64 bit
    I Most integer operations execute in one 20 ns clock cycle
    I FPU can complete two single- or double-precision operations in one clock cycle
  ● Virtual memory pages can be marked as “non-resident”; the system will generate messages to transfer the page to the local node

nCUBE 3 Interconnect

I Hypercube interconnect
  ● Adding a hypercube dimension doubles the number of processors, but processors can be added in increments of 8 (see sketch below)
  ● Wormhole routing + adaptive routing around blocked or faulty nodes

I ParaChannel I/O array
  ● Separate network of nCUBE processors for load distribution and I/O sharing
  ● 8 computational nodes (nCUBE processors plus local memory) connect directly to one ParaChannel node, and can also communicate with those nodes via the regular hypercube network
  ● ParaChannel nodes can connect to RAID mass storage, SCSI disks, etc.
    I One I/O array can be connected to more than 400 disks
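The hypercube numbering behind "adding a dimension doubles the processors" is simple: node addresses are d-bit strings, and two nodes are connected exactly when their addresses differ in one bit. A generic illustration (not nCUBE-specific code):

    /* Hypercube addressing: in a d-dimensional hypercube the 2^d nodes
     * are numbered 0..2^d-1, and node n is wired to the d nodes whose
     * numbers differ from n in exactly one bit (n XOR 2^i). */
    #include <stdio.h>

    static void print_neighbors(unsigned node, unsigned dim)
    {
        printf("node %u:", node);
        for (unsigned i = 0; i < dim; i++)
            printf(" %u", node ^ (1u << i));   /* flip bit i */
        printf("\n");
    }

    int main(void)
    {
        unsigned dim = 4;                      /* 4-cube: 16 nodes */
        for (unsigned n = 0; n < (1u << dim); n++)
            print_neighbors(n, dim);
        return 0;
    }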


nCUBE 3 Software

I Parallel Software Environment
  ● nCX microkernel OS — runs on all compute nodes and I/O nodes
  ● UNIX functionality
  ● Programming languages including FORTRAN 90, C, C++, as well as HPF, Parallel Prolog, and Data Parallel C

MediaCUBE Overview

I Emphasized on their web page; for delivery of interactive video to client devices over a network (from LAN-based training to video-on-demand to homes)
  ● MediaCUBE 30 = 270 1.5-Mbps data streams, 750 hours of content
  ● MediaCUBE 3000 = 20,000 data streams, 55,000 hours of content

Kendall Square Research KSR1 Overview

I COMA distributed-memory MIMD multicomputer (with hardware to make it look like a shared-memory multiprocessor)

I 6 years in development, 36 variations in 1992 (8 cells for $500k, 1088 for $30m)
  ● 8 cells: 320 MFLOPS, 256 MB memory, 210 GB disk, 210 MB/s I/O
  ● 1088 cells: 43 GFLOPS, 34 GB memory, 15 TB disk, 15 GB/s I/O

I Each system includes:
  ● Processing Modules, each containing up to 32 APRD Cells including 1 GB of ALLCACHE memory
  ● Disk Modules, each containing 10 GB
  ● I/O adapters
  ● Power Modules, with battery backup

KSR1 @ Oak Ridge National Labs

Kendall Square Research KSR1 Processor Cells

I Each APRD (ALLCACHE Processor, Router, and Directory) Cell contains:
  ● 64-bit Floating Point Unit, 64-bit Integer Processing Unit
  ● Cell Execution Unit for address generation
  ● 4 Cell Interconnection Units, External I/O Unit
  ● 4 Cache Control Units
  ● 32 MB of Local Cache, 512 KB of subcache

I Custom 64-bit processor: 1.2 µm, up to 450,000 transistors each, packaged on an 8x13x1 printed circuit board
  ● 20 MHz clock
  ● Can execute 2 instructions per cycle


Kendall Square Research KSR1 ALLCACHE System

I The ALLCACHE system moves an address set requested by a processor to the Local Cache on that processor
  ● Provides the illusion of a single sequentially-consistent shared memory

I Memory space consists of all the 32 MB local caches
  ● No permanent location for an “address”
  ● Addresses are distributed and based on processor need and usage patterns
  ● Each processor is attached to a Search Engine, which finds addresses and their contents and moves them to the local cache, while maintaining cache coherence throughout the system
    I 2 levels of search groups for scalability

Kendall Square Research KSR1 Programming Environment

I KSR OS = enhanced OSF/1 UNIX
  ● Scalable, supports multiple computing modes including batch, interactive, OLTP, and database management and inquiry

I Programming languages
  ● FORTRAN with automatic parallelization
  ● C
  ● PRESTO parallel runtime system that dynamically adjusts to the number of available processors and the size of the current problem

Cray T3D Overview

I NUMA shared-memory MIMD multiprocessor

I DEC Alpha 21064 processors arranged into a virtual 3D torus (hence the name)
  ● 32–2048 processors, 512 MB–128 GB of memory, 4.2–300 GFLOPS
  ● Parallel vector processor (Cray Y-MP or C90) acts as a host computer, and runs the scalar / vector parts of the program
  ● Each processor has a local memory, but the memory is globally addressable
  ● 3D torus is virtual — the physical layout includes redundant nodes that can be mapped into the torus if a node fails

I System contains:
  ● Processing element nodes, Interconnect network, I/O gateways


T3D Processing Element Nodes and I/O Gateways

I Node contains 2 PEs; each PE contains:
  ● DEC Alpha 21064
    I 150 MHz, 64 bits, double issue, 8/8 KB L1 caches
    I Support for L2 cache, eliminated in favor of improving latency to main memory
  ● 16–64 MB of local DRAM
    I Access local memory: latency 87–253 ns
    I Access remote memory: 1–2 µs (~8x)
  ● Alpha has 43 bits of virtual address space, only 32 bits for physical address space — external registers in the node provide 5 more bits for a 37-bit physical address
  ● Hide latency:
    I Alpha has a FETCH instruction that can initiate a memory prefetch
    I Remote stores are buffered — 4 words accumulate before the store is performed
    I BLT redistributes data asynchronously (overlap computation and communication)

I Node also contains:
  ● Network interface
  ● Block transfer engine (BLT) — asynchronously distributes data between PE memories and the rest of the system

I I/O gateways
  ● Between host and T3D, or between T3D and an I/O cluster (workstations, tape drives, disks, and / or networks)

T3D Interconnection Network

I Interconnection network
  ● Between PE nodes and I/O gateways
  ● 3D torus between routers, each router connecting to a PE node or I/O gateway
  ● Dimension-order routing: when a message leaves a node, it first travels in the X dimension, then Y, then Z
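Dimension-order routing is easy to sketch: at every hop the message moves along the lowest dimension in which the current coordinate still differs from the destination (X, then Y, then Z). The code below is purely illustrative; it treats the network as a plain 3D mesh and ignores the torus wrap-around links, which a real T3D router would use to take the shorter way around each ring.

    /* Dimension-order (X, then Y, then Z) routing sketch for a 3D mesh. */
    #include <stdio.h>

    struct coord { int x, y, z; };

    /* One routing step: advance one hop along the first dimension
     * in which the current node still differs from the destination. */
    static struct coord next_hop(struct coord cur, struct coord dst)
    {
        if (cur.x != dst.x)      cur.x += (dst.x > cur.x) ? 1 : -1;
        else if (cur.y != dst.y) cur.y += (dst.y > cur.y) ? 1 : -1;
        else if (cur.z != dst.z) cur.z += (dst.z > cur.z) ? 1 : -1;
        return cur;              /* unchanged once cur == dst */
    }

    int main(void)
    {
        struct coord cur = {0, 0, 0}, dst = {2, 1, 1};
        while (cur.x != dst.x || cur.y != dst.y || cur.z != dst.z) {
            cur = next_hop(cur, dst);
            printf("-> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
        }
        return 0;
    }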

Cray T3D Usage

I Processors can be divided into partitions
  ● System administrator can define a set of processors as a pool, specifying batch use, interactive use, or both
  ● User can request a specific number of processors from a pool for an application; MAX selects that number of processors and organizes them into a partition

I A Cray Y-MP C90 multiprocessor is used as a UNIX server together with the T3D
  ● OS is MAX, Massively Parallel UNIX
  ● All I/O is attached to the Y-MP, and is available to the T3D
  ● Shared file system between Y-MP & T3D
  ● Some applications run on the Y-MP, others on the T3D, some on both

Cray T3D Programming

I Programming models (see sketch below)
  ● Message passing based on PVM
  ● High-Performance FORTRAN — implicit remote accessing
  ● MPP FORTRAN — implicit and explicit communication

I OS is UNICOS MAX, a superset of UNIX
  ● Distributed OS with functionality divided between host and PE nodes (microkernels only on PE nodes)
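The globally addressable local memories described in the overview were also exposed through Cray's one-sided SHMEM library, which originated on the T3D. The sketch below is written against the modern OpenSHMEM API rather than the original Cray calls, purely as an illustration of the remote-store style of communication.

    /* One-sided remote access sketch in the style of Cray SHMEM,
     * written here against the modern OpenSHMEM API for illustration. */
    #include <shmem.h>
    #include <stdio.h>

    long value = 0;   /* symmetric: exists at the same address on every PE */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Each PE stores its rank directly into its right neighbor's
         * 'value' variable -- a remote store with no matching receive. */
        shmem_long_p(&value, (long)me, (me + 1) % npes);

        shmem_barrier_all();
        printf("PE %d of %d: value written by my left neighbor = %ld\n",
               me, npes, value);

        shmem_finalize();
        return 0;
    }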


Cray T3E Overview

I T3D = 1993; T3E = 1995 successor (300 MHz, $1M); T3E-900 = 1996 model (450 MHz, $0.5M)

I T3E system = 6–2048 processors, 3.6–1228 GFLOPS, 1–4096 GB memory
  ● PE = DEC Alpha 21164 processor (300 MHz, 600 MFLOPS, quad issue), local memory, control chip, router chip
    I L2 cache is on-chip so it can’t be eliminated, but the off-chip L3 can be and is
    I 512 external registers per process
  ● GigaRing Channel attached to each node and to I/O devices and other networks
  ● T3E-900 = same with faster processors, up to 1843 GFLOPS

I Ohio Supercomputer Center (OSC) has a T3E with 128 PEs (300 MHz), 76.8 GFLOPS, 128 MB memory / PE

Convex Exemplar SPP-1000 Overview

I ccNUMA shared-memory MIMD
  ● 4–128 HP PA 7100 RISC processors, up to 25 GFLOPS
  ● 256 MB – 32 GB memory
  ● Logical view: processors – crossbar – shared memory – crossbar – peripherals

I System made up of “hypernodes”, each of which contains 8 processors and 4 cache memories (each 64–512 MB) connected by a crossbar switch
  ● Hypernodes connected in a ring via Coherent Toroidal Interconnect, an implementation of IEEE 1596-1992, Scalable Coherency Interface
    I Hardware support for remote memory access
    I Keeps the caches at each processor consistent with each other

Convex Exemplar SPP-1000 Processors

I HP PA7100 RISC processor
  ● 555,000 transistors, 0.8 µm

I 64 bits wide, external 1 MB data cache & 1 MB instruction cache
  ● Reads take 1 cycle, writes take 2

I Can execute one integer and one floating-point instruction per cycle

I Floating Point Unit can multiply, divide / square root, etc., as well as multiply-add and multiply-subtract
  ● Most fp operations take two cycles; divide and square root take 8 cycles for single precision and 15 for double precision

I Supports multiple threads, and hardware semaphore & synchronization operations


Silicon Graphics POWER CHALLENGEarray Overview

I ccNUMA shared-memory MIMD

I “Small” systems
  ● POWER CHALLENGE — up to 144 MIPS R8000 processors or 288 MIPS R10000 processors, with up to 109 GFLOPS, 128 GB memory, and 28 TB of disk
  ● POWERnode system — shared-memory multiprocessor of up to 18 MIPS R8000 processors or 36 MIPS R10000 processors, with up to 16 GB of memory

I POWER CHALLENGEarray consists of up to 8 POWER CHALLENGE or POWERnode systems
  ● Programs that fit within a POWERnode can use the shared-memory model
  ● Larger programs can span POWERnodes

Silicon Graphics POWER CHALLENGEarray Programming

I Fine- to medium-grained parallelism
  ● Shared-memory techniques within a POWERnode, using parallelizing FORTRAN and C compilers

I Medium- to coarse-grained parallelism
  ● Shared-memory within a POWERnode or message-passing between POWERnodes
  ● Applications based on message-passing will run within a POWERnode, and libraries such as MPI or PVM will use the shared memory instead

I Large applications
  ● Hierarchical programming, using a combination of the two techniques (see sketch below)
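A minimal sketch of the hierarchical model above: message passing between POWERnodes combined with shared-memory parallelism inside each one. OpenMP directives stand in here for SGI's parallelizing compiler directives (an assumption for illustration); MPI provides the message-passing layer named in the notes.

    /* Hybrid sketch: MPI between POWERnodes, compiler-directive
     * (OpenMP-style) shared-memory parallelism within each node. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Shared-memory loop inside one node: threads split this node's
         * strided share of the iterations. */
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < N; i += nprocs)
            local += 1.0 / (double)(i + 1);

        /* Message passing combines the per-node partial sums. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("H(%d) = %f computed across %d nodes\n", N, total, nprocs);

        MPI_Finalize();
        return 0;
    }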

SGI Origin 2000 @ Boston University

Silicon Graphics Origin 2000 Overview

I ccNUMA shared-memory MIMD

I Various models, 2–128 MIPS R10000 processors, 16 GB – 1 TB memory, 6–200 GB/s peak I/O bandwidth
  ● Largest model also supports 96 PCI cards and 160 Ultra SCSI devices
  ● Processing node board contains two R10000 processors, part of the shared memory, directory for cache coherence, node interface, and I/O interface

I ccNUMA (SGI says they supply 95% of ccNUMA systems worldwide)
  ● Crossbar switches that scale upwards

I Packaged solutions for business (file serving, data mining), Internet (media serving), & high-performance computing
