Advanced Computer Architecture
Interconnection Networks
Slides courtesy © Michel Dubois, Murali Annavaram and Per Stenström 1 Parallel Computer Systems
M M M L2 Cache
Interconnection network Interconnection network
L1 L1 L1 CMP CMP CMP
P P P
INTERCONNECT BETWEEN PROCESSOR CHIPS (SYSTEM AREA NETWORK--SAN) INTERCONNECT BETWEEN CORES ON EACH CHIP (ON-CHIP-NETWORK-- OCN or NETWORK ON A CHIP--NOC)
OTHERS (NOT COVERED): WAN (WIDE-AREA NETWORK) LAN (LOCAL AREA NETWORK)
2 Example: Mesh
Switch N.I.
Node
Switch B
Link A
CONNECTS NODES: CACHE MODULES, MEMORY MODULES, CMPS… NODES ARE CONNECTED TO SWITCHES THROUGH A NETWORK INTERFACE (NI) SWITCH: CONNECTS INPUT PORTS TO OUTPUT PORTS LINKS: WIRES TRANSFERING SIGNALS BETWEEN SWITCHES LINKS WIDTH, CLOCK TRANSFER CAN BE SYNCHRONOUS OR ASYNCHRONOUS FROM A TO B: HOP FROM SWITCH TO SWITCH DECENTRALIZED (DIRECT)
3 Simple Communication Model Machine A Machine B in-buffer out-buffer out-buffer in-buffer POINT-TO-POINT REQUEST/REPLY
E B
C A F
D MULTICAST G BROADCAST
POINT-TO-POINT MESSAGE TRANSFER REQUEST/REPLY: REQUEST CARRIES ID OF SENDER MULTICAST: ONE TO MANY BROADCAST: ONE TO ALL 4 Messages and Packets
MESSAGES CONTAIN THE INFORMATION TRANSFERED
MESSAGES ARE BROKEN DOWN INTO PACKETS
PACKETS ARE SENT ONE BY ONE
PAYLOAD: MESSAGE--NOT RELEVANT TO INTERCONNECTION HEADER/TRAILER: CONTAINS INFORMATION TO ROUTE PACKET ERROR CODE: ECC TO DETECT AND CORRECT TRANSMISSION ERRORS HEADER+ECC+TRAILER = PACKET ENVELOPE
5 Example: Bus ADDRESS DATA CONTROL ARBITRATION
BUS=WIRES BROADCAST/BROADCALL COMMUNICATION. NEEDS ARBITRATION. CENTRALIZED vs DISTRIBUTED ARBITRATION LINE MULTIPLEXING (ADDRESS/DATA FOR EXAMPLE) PIPELINING FOR EXAMPLE: ARBITRATION => ADDRESS => DATA SPLIT-TRANSACTION BUS vs CIRCUIT-SWITCHED BUS
CENTRALIZED (INDIRECT) LOW COST SHARED LOW BANDWIDTH
6 Switching Strategy DEFINES HOW CONNECTIONS ARE ESTABLISHED IN THE NETWORK CIRCUIT SWITCHING ESTABLISH A CONNECTION FOR THE DURATION OF THE NETWORK SERVICE EXAMPLE 1: REMOTE MEMORY READ ON A BUS: CONNECT WITH REMOTE NODE HOLD THE BUS WHILE THE REMOTE MEMORY IS ACCESSED RELEASE THE BUS WHEN THE DATA HAS BEEN RETURNED EXAMPLE 2: REMOTE MEMORY READ IN MESH ESTABLISH PATH IN NETWORK HOLD PATH UNTIL DATA IS RETURNED ALSO GOOD FOR TRANSMITTING PACKETS CONTINUOUSLY BETWEEN TWO NODES PACKET SWITCHING MULTIPLEX SEVERAL SERVICES BY SENDING PACKETS WITH ADDRESSES EXAMPLE 1: REMOTE MEMORY READ ON A BUS SEND A REQUEST PACKET TO REMOTE NODE RELEASE BUS WHILE REMOTE MEMORY ACCESS TAKES PLACE REMOTE NODE SENDS REPLY PACKET TO REQUESTER EXAMPLE 2: REMOTE MEMORY READ ON A MESH 7 Packet Switching Strategy
TWO STRATEGIES: STORE-AND-FORWARD AND CUT-THROUGH PACKET SWITCHING IN STORE-AND-FORWARD PACKETS MOVE FROM NODE TO NODE AND ARE STORED IN BUFFERS IN EACH NODE IN CUT-THROUGH, PACKETS MOVE THROUGH NODES IN PIPELINED FASHION, SO THAT THE ENTIRE PACKET MOVES THROUGH SEVERAL NODES AT ONE TIME, WITHOUT INTERMEDIATE BUFFERING
Store-and-forward Cut-through switching switching Switch L
L-2 cycles Switch 2
Switch 1
Time 8 Two Implementations of Cut-Through
IN PRACTICE WE MUST DEAL WITH CONFLICTS AND STALLED PACKETS
VIRTUAL CUT-THROUGH SWITCHING: EACH NODE HAS ENOUGH BUFFERING FOR THE ENTIRE PACKET THE ENTIRE PACKET IS BUFFERED IN THE NODE WHEN THERE IS A TRANSMISSION CONFLICT
WHEN TRAFFIC IS CONGESTED AND CONFLICTS ARE HIGH, VIRTUAL CUT- THROUGH BEHAVES LIKE STORE-AND-FORWARD
WORMHOLE SWITCHING: EACH NODE HAS ENOUGH BUFFERING FOR A FLIT (FLOW CONTROL UNIT) A FLIT IS MADE OF CONSECUTIVE PHITs (PHYSICAL TRANSFER UNIT), i.e. THE BITS TRANSFERRED PER CLOCK (SIZE= THE WIDTH OF A LINK) THE FLIT IS A FRACTION OF A PACKET, SO THE PACKET MUST BE STORED IN SEVERAL CONSECUTIVE NODES (ONE FLIT PER NODE) ON ITS PATH AS IT MOVES THROUGH THE NETWORK. THE FLIT MUST BE ABLE TO CONTAIN THE HEADER OF A PACKET
IN VIRTUAL CUT-THROUGH THE FLIT IS THE WHOLE PACKET
9 Switch Microarchitecture
From Duato and Pinkston in Hennessy and Patterson, 4th edition PHYSICAL CHANNEL = LINK VIRTUAL CHANNEL = BUFFERS + LINK LINK IS TIME-MULTIPLEXED AMONG FLITS
10 Latency Models
END-TO-END PACKET LATENCY = SENDER OH + TIME OF FLIGHT + TRANSMISSION TIME + ROUTING TIME + RECEIVER OH SENDER OVERHEAD: CREATING THE ENVELOPE AND MOVING PACKET TO NI TIME OF FLIGHT: TIME TO SEND A BIT FROM SOURCE TO DESTINATION WHEN THE ROUTE IS ESTABLISHED AND WITHOUT CONFLICTS. (INCLUDES SWITCHING TIME.) TRANSMISSION TIME: TIME TO TRANSFER A PACKET FROM SOURCE TO DESTINATION, ONCE THE FIRST BIT HAS ARRIVED AT DESTINATION PHIT: NUMBER OF BITS TRANSFERED ON A LINK PER CYCLE TRANSMISSION TIME = PACKET SIZE/PHIT SIZE FLIT: FLOW CONTROL UNIT
11 Latency Models
ROUTING TIME: TIME TO SET UP SWITCHES SWITCHING TIME: DEPENDS ON SWITCHING STRATEGY (STORE-AND-FORWARD vs CUT-THROUGH vs CIRCUIT-SWITCHED). AFFECTS TIME OF FLIGHT AND IS INCLUDED IN THAT. RECEIVER OVERHEAD: TIME TO STRIP OUT ENVELOPE AND MOVE PACKET IN
MEASURES OF LATENCY: ROUTING DISTANCE: NB OF LINKS TRAVERSED BY A PACKET AVERAGE ROUTING DISTANCE: AVERAGE OVER ALL PAIRS OF NODES NETWORK DIAMETER: LONGEST ROUTING DISTANCE OVER ALL PAIRS OF NODES
PACKETS OF A MESSAGE CAN BE PIPELINED TRANSFER PIPELINE HAS 3 STAGES SENDER OVERHEAD-->TRANSMISSION -->RECEIVER OVERHEAD TOTAL MESSAGE TIME = TIME FOR THE FIRST PACKET + (N-1)/PIPELINE THROUGHPUT
END-TO-END MESSAGE LATENCY = SENDER OH + TIME OF FLIGHT + TRANSMISSION TIME + ROUTING TIME + (N-1) X MAX (SENDER OH, TRANSMISSION TIME, RECEIVER OH)
12 Latency Models PACKET LATENCY = SENDER OH + TOF (INCL.SWITCHING TIME) + TRANSMISSION TIME + ROUTING TIME + RECEIVER OH
R: ROUTING TIME PER SWITCH; N: NB OF PHITS; L: NB OF SWITCHES; TOF: TIME OF FLIGHT
CIRCUIT SWITCHING ROUTING: LxR (TO SET THE SWITCHES) + L (TO ACK TOSENDER)= Lx(R+1) PACKET LATENCY = SENDER OH + ToF + TRANSMISSION + ROUTING + REC OH = SENDER OH + L + N + Lx(R+1) + RECEIVER OH = SENDER OH + N + Lx(R+2) + RECEIVER OH
STORE-AND-FORWARD PACKET LATENCY = SENDER OH + L x N + N + LxR + RECEIVER OH = SENDER OH + L x N + N + LxR + RECEIVER OH
CUT-THROUGH SWITCHING PACKET LATENCY = SENDER OH + L + N + LxR + RECEIVER OH = SENDER OH + Lx(R+1) + N + RECEIVER OH.
13 Bandwidth Models
BOTTLENECKS INCREASE LATENCY TRANSFERS ARE PIPELINED UNLOADED BANDWIDTH = PACKET_SIZE/MAX(SENDER OV, RECEIVER OV, TRANSMISSION TIME) NETWORK CONTENTION AFFECTS LATENCY AND EFFECTIVE BANDWIDTH (NOT COUNTED IN ABOVE FORMULA)
BISECTION WIDTH NETWORK IS SEEN AS A GRAPH VERTICES ARE SWITCHES AND EDGES ARE LINKS BISECTION IS A CUT THROUGH A MINIMUM SET OF EDGES SUCH THAT THE CUT DIVIDES THE NETWORK GRAPH INTO TWO ISOMORPHIC --I.E., SAME-- SUBGRAPHS EXAMPLE: MESH MEASURES BANDWIDTH WHEN ALL NODES IN ONE SUBGRAPH COMMUNICATE ONLY WITH NODES IN THE OTHER SUBGRAPH
AGGREGATE BANDWIDTH BANDWIDTH ACROSS ALL LINKS DIVIDED BY THE NUMBER OF NODES
14 Interconnecting Two Devices Basic Network Structure and Functions • Media and Form Factor
Metal layers InfiniBand Ethernet connectors
Fiber Optics
Cat5E twisted pair Media type
Coaxial cables Printed circuit boards Myrinet connectors OCNs SANs LANs WANs 0.01 1 10 100 >1,000 Distance (meters) 15 Topologies
INDIRECT NETWORKS: IN IS CENTRALIZED BUS CROSSBAR SWITCH MULTISTAGE INTERCONNECTION NETWORK (MIN)
0 1 2 3 4 5 6 7 2-BY-2 CROSSBAR 2by2 0 0 0 1 1 1 2 2 2 3 3 3 4 5 4 4 6 5 5 7
6 PERFECT_SHUFFLE 6 PERFECT_SHUFFLE CROSSBAR (8by8) PERFECT_SHUFFLE 7 7 OMEGA NETWORK 16 Topologies
INDIRECT NETWORKS: IN IS CENTRALIZED TREE BUTTERFLY 0 0 1 1
2 2 3 3
4 5 0 1 2 3 6 4 4 5 5 BINARY TREE
6 6 7 7
BUTTERFLY: EMBEDDED TREES
BEST TO CONNECT DIFFERENT TYPES OF NODES 17 Topologies
DIRECT NETWORKS: NODES ARE DIRECTLY CONNECTED TO ONE ANOTHER DECENTRALIZED LINEAR ARRAY AND RING
MESH AND TORI
18 Topologies
DIRECT NETWORKS: NODES ARE DIRECTLY CONNECTED TO ONE ANOTHER HYPERCUBE AND k-ARY n-CUBE
Centralized Switched (Indirect) Networks Multistage interconnection networks (MINs) Crossbar split into several stages consisting of smaller crossbars Complexity grows as O(N × log N), where N is # of end nodes Inter-stage connections represented by a set of permutation functions
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
Omega topology, perfect-shuffle exchange
20 Network Topology Centralized Switched (Indirect) Networks
0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111
21 16 port, 4 stage Omega network Network Topology Centralized Switched (Indirect) Networks
0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111
22 16 port, 4 stage Baseline network Network Topology Centralized Switched (Indirect) Networks
0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111
23 16 port, 4 stage Butterfly network Network Topology Centralized Switched (Indirect) Networks
0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111
24 16 port, 4 stage Cube network Network Topology
Centralized Switched (Indirect) Networks Multistage interconnection networks (MINs) MINs interconnect N input/output ports using k x k switches
logkN switch stages, each with N/k switches N/k(logkN) total number of switches Example: Compute the switch and link costs of interconnecting 4096 nodes using a crossbar relative to a MIN, assuming that switch cost grows quadratically with the number of input/output ports (k). Consider the following values of k: MIN with 2 x 2 switches MIN with 4 x 4 switches MIN with 16 x 16 switches
25 Network Topology
Centralized Switched (Indirect) Networks Example: compute the relative switch and link costs, N = 4096
2 cost(crossbar)switches = 4096
cost(crossbar)links = 8192
2 2 relative_cost(2 × 2)switches = 4096 / (2 × 4096/2 × log2 4096) = 170
relative_cost(2 × 2)links = 8192 / (4096 × (log2 4096 + 1)) = 2/13 = 0.1538
2 2 relative_cost(4 × 4)switches = 4096 / (4 × 4096/4 × log4 4096) = 170
relative_cost(4 × 4)links = 8192 / (4096 × (log4 4096 + 1)) = 2/7 = 0.2857
2 2 relative_cost(16 × 16)switches = 4096 / (16 × 4096/16 × log16 4096) = 85
relative_cost(16 × 16)links = 8192 / (4096 × (log16 4096 + 1)) = 2/4 = 0.5
26 Network Topology
Centralized Switched (Indirect) Networks Relative switch and link costs for various values of k and N (crossbar relative to a MIN)
0-0.5 0.5-1 1-1.5 1.5-2
350 2 300 250 1.5 200 1 150 8192 8192 100 512 512 0.5 N 32 50 32 N 0 2 0 2 2 16 128 1024 8192 2 16 128 1024 8192 k k Relative switch cost Relative link cost
27 Network Topology – Reduced Blocking
Centralized Switched (Indirect) Networks
0 0 1 1
2 2 3 3
4 4 5 5
6 6 7 7
8 8 9 9
10 10 11 11
12 12 13 13
14 14 15 15
28 16 port Crossbar network Network Topology
Centralized Switched (Indirect) Networks
0 0 1 1
2 2 3 3
4 4 5 5
6 6 7 7
8 8 9 9
10 10 11 11
12 12 13 13
14 14 15 15
29 16 port, 3-stage Clos network Network Topology
Centralized Switched (Indirect) Networks
0 0 1 1
2 2 3 3
4 4 5 5
6 6 7 7
8 8 9 9
10 10 11 11
12 12 13 13
14 14 15 15
30 16 port, 5-stage Clos network Network Topology
Centralized Switched (Indirect) Networks
0 0 1 1
2 2 3 3
4 4 5 5
6 6 7 7
8 8 9 9
10 10 11 11
12 12 13 13
14 14 15 15
31 16 port, 7 stage Clos network = Benes topology Network Topology
Centralized Switched (Indirect) Networks Network Bisection Bidirectional MINs 0 1 Increase modularity 2 Reduce hop count, d 3
Fat tree network 4 Nodes at tree leaves 5 Switches at tree 6 vertices 7 Total link bandwidth 8 is constant across all 9 tree levels, with full 10 bisection 11 bandwidth 12 Equivalent to folded 13 Benes topology 14 Preferred topology in 15 32 many SANs Folded Clos = Folded Benes = Fat tree network Topology Comparison Interconnection Switch Network Bisection Network network degree diameter width size Crossbar N 1 N N switch
Butterfly k logk N N/2 N (built from k- by-k switches)
k-ary tree k+1 2logk N 1 N Linear array 2 N-1 1 N
Ring 2 N/2 2 N
n-by-n mesh 4 2(n-1) n N=n2
n-by-n torus 4 n 2n N=n2
k-dimensional k k 2k-1 N=2k hypercube k-ary n-cube 2k nk/2 2kn-1 N=nk
SWITCH DEGREE: NUMBER OF PORTS IN EACH SWITCH (SWITCH COMPLEXITY) NETWORK DIAMETER: WORST-CASE ROUTING DISTANCE BETWEEN ANY TWO NODES BISECTION WIDTH: NUMER OF LINKS IN BISECTION (WORST-CASE BW) NETWORK SIZE: NUMER OF NODES
33 Routing Algorithms
REFERS TO MECHANISMS TO HANDLE CONFLICTS IN SWITCH-BASED NETWORKS
Input Output ports ports Transmit
Rdy 4-by-4 crossbar
BUFFERS AT INPUT AND OUTPUT PORTS IN VIRTUAL CUT-THROUGH: BUFFER FOR ENTIRE PACKET IN WORMHOLE: BUFFER FOR INTEGRAL NUMBER OF FLITS LINK-LEVEL FLOW CONTROL HANDSHAKE SIGNAL Rdy INDICATES WHETHER FLITS CAN BE TRANSMITTED TO THE DESTINATION BUFFERING FOR CUT-THROUGH (whole packet) vs WORMHOLE (a few flits) HOT SPOT CONTENTION AND TREE SATURATION
34 Routing Algorithms
USE SOURCE AND/OR DESTINATION ADDRESSES OMEGA NETWORK
USE THE DESTINATION ADDRESS IN THIS CASE, 3 BITS
USE ith BIT OF THE DESTINATION (di) TO SELECT UPPER OR LOWER OUTPUT PORT FOR STAGE n-1-i 0 => UP 1 => DOWN
EXAMPLE: ROUTE FROM 4 TO 6 DESTINATION ADDRESS IS 110 DOWN IN STAGE 0 DOWN IN STAGE 1 UP IN STAGE 2
35 Routing Algorithms
BUTTERFLY NETWORK
USE RELATIVE ADDRESS BITWISE EXCLUSIVE OR OF SOURCE AND DESTINATION ADDRESSES TO FORM THE ROUTING ADDRESS IF BIT i IS ZERO, ROUTE STRAIGHT IF BIT i IS ONE, ROUTE ACROSS
EXAMPLE: ROUTE FROM 4 TO 6 SOURCE: 100 DESTINATION: 110 EX-OR: 010
36 Routing Algorithms
DIMENSION-ORDER ROUTING (DETERMINISTIC) NUMBER NODES SO THAT LOWER LEFT CORNER IS (0,0) AND UPPER RIGHT CORNER IS (n-1,n-1) USE RELATIVE ADDRESS:
(X,Y)=(XB-XA,YB-YA) ROUTE FIRST HORIZONTALLY DECREMENT X WHEN X=0, ROUTE VERTICALLY DECREMENT Y UNTIL Y=0 EXAMPLE: MOVE PACKET FROM A TO B RELATIVE ADDRESS IS (X,Y)=(n-2,n-2) FIRST MOVE PACKET HORIZONTALLY DECREMENT X BY 1 (RIGHT MOVE) WHEN X=0, MOVE PACKET VERTICALLY DECREMENT Y BY 1 (UP MOVE)
37 Routing Issues
Deadlock A route from node A to node B blocks a route from node C to node D, and vice versa Can be caused by competition for ports, outputs, or intermediate switch resources Adaptive routing algorithms Routing algorithm can dynamically change routes based on traffic
38 Routing for Large Systems
Cloud servers Typically use hierarchical Ethernet switches Example: Facebook moving to 100G Ethernet backbone in data centers Relatively long routing latency Although typically implemented with a switch fabric, emulates a common contention network Supercomputers Infiniband Switched fabric by design Was envisioned to replace PCI in 1999 Multichannel links (like PCI-e). Now peaks at 25Gbs/channel Proprietary Not many left. Cray, NEC
39 Routing for Small Systems
Intel Xeon PHI Bi-directional cache coherent ring Tilera Planar mesh Rings are common on-chip Fewer wires than full interconnect On-chip bandwidth high enough to compensate for multi-hop transmission
40 Departed Interconnects
Fiber Distributed Data Iinterface (FDDI) 100 Mbs token-passing dual ring First use of 8B/10B! High-performance Parallel Interrface (HIPPI) Used for supercomputer connections Point-to-point link Scalable Coherent Interronncet (SCI) Point-to-point connections with directory-based cache coherence Developed as a standard before anyone used it Myrinet Fiber interconnect for supercomputers Asynchronous Transfer Mode (ATM) Cell switching (48B cells) devised by telecoms Intended to emulate telecom circuit-switched network 41