![Intel Paragon XP/S Overview Distributed-Memory MIMD](https://data.docslib.org/img/3a60ab92a6e30910dab9bd827208bcff-1.webp)
## Intel MP Paragon XP/S 150 @ Oak Ridge National Labs

## Intel Paragon XP/S Overview

- Distributed-memory MIMD multicomputer
- 2D array of nodes, performing both OS functionality and user computation
  - Main memory physically distributed among the nodes (16–64 MB per node)
  - Each node contains two Intel i860 XP processors: an application processor for the user's program and a message processor for inter-node communication
- Balanced design: speed and memory capacity matched to the interconnection network, storage facilities, etc.
  - Interconnect bandwidth scales with the number of nodes
  - Efficient even with thousands of processors

## Paragon XP/S Nodes

- Network Interface Controller (NIC)
  - Connects the node to its PMRC
- Message processor
  - Intel i860 XP processor
  - Handles all details of sending / receiving a message between nodes, including protocols, packetization, etc.
  - Supports global operations including broadcast, synchronization, sum, min, and, or, etc.
- Application processor
  - Intel i860 XP processor (42 MIPS, 50 MHz clock) to execute user programs
- 16–64 MB of memory

## Paragon XP/S Node Interconnection

- 2D mesh chosen after extensive analytical studies and simulation
- Paragon Mesh Routing Chip (PMRC) / iMRC routes traffic in the mesh
  - Parity-checked, full-duplex router with error checking
  - 0.75 µm, triple-metal CMOS
  - Routes traffic in four directions and to and from the attached node at > 200 MB/s
  - 40 ns to make routing decisions and close the appropriate switches
- Transfers are parity checked, the router is pipelined, and routing is deadlock-free
- The backplane is an active backplane of router chips rather than a mass of cables

## Paragon XP/S Usage

- OS is based on UNIX and provides distributed system services and full UNIX to every node
  - The system is divided into partitions: some for I/O, some for system services, and the rest for user applications
- Applications can run on an arbitrary number of nodes without change
  - Run on a larger number of nodes to process larger data sets or to achieve the required performance
- Users have client/server access: they can submit jobs over a network or log in directly to any node
  - Comprehensive resource control utilities for allocating, tracking, and controlling system resources

## Paragon XP/S Programming

- MIMD architecture, but supports various programming models: SPMD, SIMD, MIMD, shared memory, vector shared memory (a message-passing sketch follows this list)
- CASE tools including:
  - Optimizing compilers for FORTRAN, C, C++, Ada, and Data Parallel FORTRAN
  - Interactive Debugger
  - Parallelization tools: FORGE, CAST
  - Intel's ProSolver library of equation solvers
  - Intel's Performance Visualization System (PVS)
  - Performance Analysis Tools (PAT)
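The SPMD model above and the message processor's global operations (broadcast, synchronization, sum, min, etc.) correspond to collective operations in a message-passing library. The Paragon had its own native message-passing interface, which is not reproduced here; the sketch below is a minimal, hypothetical SPMD program written with present-day MPI calls purely to illustrate the pattern of a broadcast followed by a global sum.

```c
/* Minimal SPMD sketch (illustrative only, not the Paragon's native API):
 * every node runs the same program, node 0 broadcasts a parameter,
 * each node computes a partial result, and a global sum combines them,
 * i.e. the kind of broadcast/sum operation the message processor accelerates. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double scale = 0.0, partial, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0)
        scale = 2.5;                  /* parameter known only to node 0 */

    /* broadcast the parameter to every node */
    MPI_Bcast(&scale, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    partial = scale * rank;           /* stand-in for real per-node work */

    /* global sum across all nodes, result delivered to node 0 */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum over %d nodes = %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}
```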
## Thinking Machines CM-5 Overview

- Distributed-memory MIMD multicomputer
  - SIMD or MIMD operation
- Processing nodes are supervised by a control processor, which runs UNIX
  - The control processor broadcasts blocks of instructions to the processing nodes and initiates execution
  - SIMD operation: nodes are closely synchronized, blocks broadcast as needed
  - MIMD operation: nodes fetch instructions independently and synchronize only as required by the algorithm
- Nodes may be divided into partitions
  - One control processor, called the partition manager, per partition
  - Partitions may exchange data

## Thinking Machines CM-5 Overview (cont.)

- Other control processors, called I/O Control Processors, manage the system's I/O devices
  - Scale to achieve the necessary I/O capacity
  - DataVaults provide storage
- Control processors in general
  - Schedule user tasks, allocate resources, service I/O requests, and handle accounting, security, etc.
  - May execute some code
  - No arithmetic accelerators, but additional I/O connections
  - In a small system, one control processor may play a number of roles
  - In a large system, control processors are often dedicated to particular tasks (partition manager, I/O control processor, etc.)

## Thinking Machines CM-5 @ NCSA

## CM-5 Nodes

- Processing nodes
  - SPARC CPU (22 MIPS)
  - 8–32 MB of memory
  - Network interface
  - (Optional) 4 pipelined vector processing units
    - Each can perform up to 32 million double-precision floating-point operations per second
    - Including divide and square root
- A fully configured CM-5 would have
  - 16,384 processing nodes
  - 512 GB of memory
  - A theoretical peak performance of 2 teraflops

## CM-5 Networks

- Control Network
  - Tightly coupled communication services
  - Optimized for fast response, low latency
  - Functions: synchronizing the processing nodes, broadcasts, reductions, parallel prefix operations
- Data Network
  - 4-ary hypertree, optimized for high bandwidth
  - Functions: point-to-point communication for tens of thousands of items simultaneously
  - Responsible for eventual delivery of accepted messages
- Network Interface connects nodes or control processors to the Control or Data Network (memory-mapped control unit)

## Tree Networks (Reference Material)

- Binary tree
  - 2^k - 1 nodes arranged into a complete binary tree of depth k - 1
  - Diameter is 2(k - 1)
  - Bisection width is 1
- Hypertree
  - Low diameter of a binary tree plus improved bisection width
  - A hypertree of degree k and depth d:
    - From the "front", looks like a k-ary tree of height d
    - From the "side", looks like an upside-down binary tree of height d
    - Join both views to get the complete network
  - A 4-ary hypertree of depth d (checked numerically in the sketch below):
    - 4^d leaves and 2^d (2^(d+1) - 1) nodes
    - Diameter is 2d
    - Bisection width is 2^(d+1)
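The hypertree figures above follow directly from the formulas on the slide and are easy to verify. The short C program below is only a numeric check of those formulas (the helper names are invented for this example); it is not CM-5 code.

```c
/* Numeric check of the 4-ary hypertree figures quoted above. */
#include <stdio.h>

static long leaves(int d)          { return 1L << (2 * d); }                      /* 4^d */
static long total_nodes(int d)     { return (1L << d) * ((1L << (d + 1)) - 1); }  /* 2^d (2^(d+1) - 1) */
static long diameter(int d)        { return 2L * d; }                             /* 2d */
static long bisection_width(int d) { return 1L << (d + 1); }                      /* 2^(d+1) */

int main(void)
{
    for (int d = 1; d <= 6; d++)
        printf("depth %d: %ld leaves, %ld nodes, diameter %ld, bisection width %ld\n",
               d, leaves(d), total_nodes(d), diameter(d), bisection_width(d));
    return 0;
}
```

For example, at depth 3 this prints 64 leaves and 120 nodes, matching 4^3 and 2^3 (2^4 - 1).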
## CM-5 Usage

- Runs Cmost, an enhanced version of SunOS
- A user task sees a control processor acting as a Partition Manager (PM), a set of processing nodes, and inter-processor communication facilities
  - The user task is a standard UNIX process running on the PM, plus one on each of the processing nodes
  - The CPU scheduler schedules the user task on all processors simultaneously
- User tasks can read and write directly to the Control Network and Data Network
  - The Control Network has hardware for broadcast, reduction, parallel prefix operations, and barrier synchronization
  - The Data Network provides reliable, deadlock-free point-to-point communication

## IBM SP2 Overview

- Distributed-memory MIMD multicomputer
- Scalable POWERparallel 1 (SP1)
  - Development started in February 1992; delivered to users in April 1993
- Scalable POWERparallel 2 (SP2)
  - 120-node systems delivered in 1994
    - 4–128 nodes: RS/6000 workstation with a POWER2 processor, 66.7 MHz, 267 MFLOPS
    - POWER2 is used in RS/6000 workstations, which gives compatibility with existing software
  - 1997 version (NUMA):
    - High Node (SMP node, 16 nodes max): 2–8 PowerPC 604, 112 MHz, 224 MFLOPS, 64 MB–2 GB memory
    - Wide Node (128 nodes max): 1 P2SC (POWER2 Super Chip, the 8-chip set on one chip), 135 MHz, 640 MFLOPS, 64 MB–2 GB memory

## IBM SP2 @ Oak Ridge National Labs

## IBM SP2 Overview (cont.)

- RS/6000 as system console
- SP2 runs various combinations of serial, parallel, interactive, and batch jobs
  - The partition between job types can be changed
  - High nodes: interactive nodes for code development and job submission
  - Thin nodes: compute nodes
  - Wide nodes: configured as servers, with extra memory, storage devices, etc.
- A system "frame" contains 16 thin-processor nodes or 8 wide-processor nodes
  - Includes redundant power supplies; nodes are hot-swappable within the frame
  - Includes a high-performance switch for low-latency, high-bandwidth communication

## IBM SP2 Processors

- POWER2 processor
  - Various versions from 20 to 62.5 MHz
  - RISC processor, load-store architecture
    - Floating-point multiply & add instruction with a latency of 2 cycles, pipelined so a new one can be initiated each cycle
    - Conditional branch that decrements and tests a "count register" (without fixed-point unit involvement), good for loop closings
- POWER2 processor chip set
  - 8 semi-custom chips: Instruction Cache Unit, four Data Cache Units, Fixed-Point Unit (FXU), Floating-Point Unit (FPU), and Storage Control Unit
    - 2 execution units per FXU and FPU
    - Can execute 6 instructions per cycle: 2 FXU, 2 FPU, branch, condition register
    - Options: 4-word memory bus with 128 KB data cache, or 8-word bus with 256 KB

## IBM SP2 Interconnection Network

- General
  - Multistage High Performance Switch (HPS) network, with extra stages added to keep the bandwidth to each processor constant
  - Message delivery
    - PIO for short messages, with low latency and minimal message overhead
    - DMA for long messages
  - Multi-user support: hardware protection between partitions and users, guaranteed fairness of message delivery
- Routing
  - Packet switched: each packet may take a different route
  - Cut-through: if the output is free, starts sending without buffering first
  - Wormhole routing: buffers on a subpacket basis if buffering is necessary

## IBM SP2 AIX Parallel Environment

- Parallel Operating Environment: based on AIX, includes a Desktop interface
  - Partition Manager to allocate nodes, copy tasks to nodes, invoke tasks, etc.
  - Program Marker Array: (online) squares graphically represent program tasks
  - System Status Array: (offline) squares show percent of CPU utilization
- Parallel Message Passing Library
- Visualization Tool: view online and offline performance

## nCUBE Overview

- Distributed-memory MIMD multicomputer (with hardware to make it look like a shared-memory multiprocessor)
- History
  - nCUBE 1 (1985)
  - nCUBE 2 (1989)
    - 34 GFLOPS, scalable
    - ?–8192 processors
  - nCUBE 3 (1995)
    - 1–6.5 TFLOPS, 65 TB memory, 24 TB/s hypercube interconnect (addressing sketched below), 1024 3 GB/s I/O channels, scalable
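In a hypercube interconnect like nCUBE's, 2^n nodes get n-bit addresses and two nodes are directly linked exactly when their addresses differ in one bit, so each node has n links and the diameter is only n. The sketch below is a generic illustration of that addressing scheme, not nCUBE system code, and the function name is invented for this example.

```c
/* Generic hypercube addressing sketch (illustrative; not nCUBE system code).
 * In an n-dimensional hypercube the 2^n nodes are numbered 0 .. 2^n - 1,
 * and flipping one address bit yields one of the n directly connected
 * neighbors, so the diameter of the network is n. */
#include <stdio.h>

/* Print the n neighbors of `node` in an n-dimensional hypercube. */
static void print_neighbors(unsigned node, int n)
{
    printf("node %u:", node);
    for (int bit = 0; bit < n; bit++)
        printf(" %u", node ^ (1u << bit));   /* flip one bit per link */
    printf("\n");
}

int main(void)
{
    const int n = 4;                         /* 4-dimensional cube: 16 nodes */
    for (unsigned node = 0; node < (1u << n); node++)
        print_neighbors(node, n);
    return 0;
}
```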