Advanced Computer Architecture

Interconnection Networks

Slides courtesy © Michel Dubois, Murali Annavaram and Per Stenström 1 Parallel Computer Systems

M M M L2 Cache

Interconnection network Interconnection network

L1 L1 L1 CMP CMP CMP

P P P

 INTERCONNECT BETWEEN PROCESSOR CHIPS (SYSTEM AREA NETWORK--SAN)  INTERCONNECT BETWEEN CORES ON EACH CHIP (ON-CHIP-NETWORK-- OCN or NETWORK ON A CHIP--NOC)

 OTHERS (NOT COVERED):  WAN (WIDE-AREA NETWORK)  LAN (LOCAL AREA NETWORK)

2 Example: Mesh

Switch N.I.

Node

Switch B

Link A

 CONNECTS NODES: CACHE MODULES, MEMORY MODULES, CMPS…  NODES ARE CONNECTED TO SWITCHES THROUGH A NETWORK INTERFACE (NI)  SWITCH: CONNECTS INPUT PORTS TO OUTPUT PORTS  LINKS: WIRES TRANSFERING SIGNALS BETWEEN SWITCHES  LINKS  WIDTH, CLOCK  TRANSFER CAN BE SYNCHRONOUS OR ASYNCHRONOUS  FROM A TO B: HOP FROM SWITCH TO SWITCH DECENTRALIZED (DIRECT)

3 Simple Communication Model Machine A Machine B in-buffer out-buffer out-buffer in-buffer POINT-TO-POINT REQUEST/REPLY

E B

C A F

D MULTICAST G BROADCAST

 POINT-TO-POINT MESSAGE TRANSFER  REQUEST/REPLY: REQUEST CARRIES ID OF SENDER  MULTICAST: ONE TO MANY  BROADCAST: ONE TO ALL 4 Messages and Packets

 MESSAGES CONTAIN THE INFORMATION TRANSFERED

 MESSAGES ARE BROKEN DOWN INTO PACKETS

 PACKETS ARE SENT ONE BY ONE

 PAYLOAD: MESSAGE--NOT RELEVANT TO INTERCONNECTION  HEADER/TRAILER: CONTAINS INFORMATION TO ROUTE PACKET  ERROR CODE: ECC TO DETECT AND CORRECT TRANSMISSION ERRORS  HEADER+ECC+TRAILER = PACKET ENVELOPE

5 Example: ADDRESS DATA CONTROL ARBITRATION

 BUS=WIRES  BROADCAST/BROADCALL COMMUNICATION.  NEEDS ARBITRATION.  CENTRALIZED vs DISTRIBUTED ARBITRATION  LINE MULTIPLEXING (ADDRESS/DATA FOR EXAMPLE)  PIPELINING  FOR EXAMPLE: ARBITRATION => ADDRESS => DATA  SPLIT-TRANSACTION BUS vs CIRCUIT-SWITCHED BUS

CENTRALIZED (INDIRECT) LOW COST SHARED LOW BANDWIDTH

6 Switching Strategy DEFINES HOW CONNECTIONS ARE ESTABLISHED IN THE NETWORK CIRCUIT SWITCHING  ESTABLISH A CONNECTION FOR THE DURATION OF THE NETWORK SERVICE  EXAMPLE 1: REMOTE MEMORY READ ON A BUS:  CONNECT WITH REMOTE NODE  HOLD THE BUS WHILE THE REMOTE MEMORY IS ACCESSED  RELEASE THE BUS WHEN THE DATA HAS BEEN RETURNED  EXAMPLE 2: REMOTE MEMORY READ IN MESH  ESTABLISH PATH IN NETWORK  HOLD PATH UNTIL DATA IS RETURNED  ALSO GOOD FOR TRANSMITTING PACKETS CONTINUOUSLY BETWEEN TWO NODES PACKET SWITCHING  MULTIPLEX SEVERAL SERVICES BY SENDING PACKETS WITH ADDRESSES  EXAMPLE 1: REMOTE MEMORY READ ON A BUS  SEND A REQUEST PACKET TO REMOTE NODE  RELEASE BUS WHILE REMOTE MEMORY ACCESS TAKES PLACE  REMOTE NODE SENDS REPLY PACKET TO REQUESTER  EXAMPLE 2: REMOTE MEMORY READ ON A MESH 7 Packet Switching Strategy

 TWO STRATEGIES: STORE-AND-FORWARD AND CUT-THROUGH PACKET SWITCHING  IN STORE-AND-FORWARD PACKETS MOVE FROM NODE TO NODE AND ARE STORED IN BUFFERS IN EACH NODE  IN CUT-THROUGH, PACKETS MOVE THROUGH NODES IN PIPELINED FASHION, SO THAT THE ENTIRE PACKET MOVES THROUGH SEVERAL NODES AT ONE TIME, WITHOUT INTERMEDIATE BUFFERING

Store-and-forward Cut-through switching switching Switch L

L-2 cycles Switch 2

Switch 1

Time 8 Two Implementations of Cut-Through

IN PRACTICE WE MUST DEAL WITH CONFLICTS AND STALLED PACKETS

 VIRTUAL CUT-THROUGH SWITCHING:  EACH NODE HAS ENOUGH BUFFERING FOR THE ENTIRE PACKET  THE ENTIRE PACKET IS BUFFERED IN THE NODE WHEN THERE IS A TRANSMISSION CONFLICT

WHEN TRAFFIC IS CONGESTED AND CONFLICTS ARE HIGH, VIRTUAL CUT- THROUGH BEHAVES LIKE STORE-AND-FORWARD

 WORMHOLE SWITCHING:  EACH NODE HAS ENOUGH BUFFERING FOR A FLIT (FLOW CONTROL UNIT)  A FLIT IS MADE OF CONSECUTIVE PHITs (PHYSICAL TRANSFER UNIT), i.e. THE BITS TRANSFERRED PER CLOCK (SIZE= THE WIDTH OF A LINK)  THE FLIT IS A FRACTION OF A PACKET, SO THE PACKET MUST BE STORED IN SEVERAL CONSECUTIVE NODES (ONE FLIT PER NODE) ON ITS PATH AS IT MOVES THROUGH THE NETWORK.  THE FLIT MUST BE ABLE TO CONTAIN THE HEADER OF A PACKET

IN VIRTUAL CUT-THROUGH THE FLIT IS THE WHOLE PACKET

9 Switch Microarchitecture

 From Duato and Pinkston in Hennessy and Patterson, 4th edition PHYSICAL CHANNEL = LINK VIRTUAL CHANNEL = BUFFERS + LINK LINK IS TIME-MULTIPLEXED AMONG FLITS

10 Latency Models

END-TO-END PACKET LATENCY = SENDER OH + TIME OF FLIGHT + TRANSMISSION TIME + ROUTING TIME + RECEIVER OH  SENDER OVERHEAD: CREATING THE ENVELOPE AND MOVING PACKET TO NI  TIME OF FLIGHT: TIME TO SEND A BIT FROM SOURCE TO DESTINATION WHEN THE ROUTE IS ESTABLISHED AND WITHOUT CONFLICTS. (INCLUDES SWITCHING TIME.)  TRANSMISSION TIME: TIME TO TRANSFER A PACKET FROM SOURCE TO DESTINATION, ONCE THE FIRST BIT HAS ARRIVED AT DESTINATION  PHIT: NUMBER OF BITS TRANSFERED ON A LINK PER CYCLE  TRANSMISSION TIME = PACKET SIZE/PHIT SIZE  FLIT: FLOW CONTROL UNIT

11 Latency Models

 ROUTING TIME: TIME TO SET UP SWITCHES  SWITCHING TIME: DEPENDS ON SWITCHING STRATEGY  (STORE-AND-FORWARD vs CUT-THROUGH vs CIRCUIT-SWITCHED).  AFFECTS TIME OF FLIGHT AND IS INCLUDED IN THAT.  RECEIVER OVERHEAD: TIME TO STRIP OUT ENVELOPE AND MOVE PACKET IN

 MEASURES OF LATENCY:  ROUTING DISTANCE: NB OF LINKS TRAVERSED BY A PACKET  AVERAGE ROUTING DISTANCE: AVERAGE OVER ALL PAIRS OF NODES  NETWORK DIAMETER: LONGEST ROUTING DISTANCE OVER ALL PAIRS OF NODES

 PACKETS OF A MESSAGE CAN BE PIPELINED  TRANSFER PIPELINE HAS 3 STAGES  SENDER OVERHEAD-->TRANSMISSION -->RECEIVER OVERHEAD  TOTAL MESSAGE TIME = TIME FOR THE FIRST PACKET + (N-1)/PIPELINE

END-TO-END MESSAGE LATENCY = SENDER OH + TIME OF FLIGHT + TRANSMISSION TIME + ROUTING TIME + (N-1) X MAX (SENDER OH, TRANSMISSION TIME, RECEIVER OH)

12 Latency Models PACKET LATENCY = SENDER OH + TOF (INCL.SWITCHING TIME) + TRANSMISSION TIME + ROUTING TIME + RECEIVER OH

R: ROUTING TIME PER SWITCH; N: NB OF PHITS; L: NB OF SWITCHES; TOF: TIME OF FLIGHT

 CIRCUIT SWITCHING  ROUTING: LxR (TO SET THE SWITCHES) + L (TO ACK TOSENDER)= Lx(R+1)  PACKET LATENCY = SENDER OH + ToF + TRANSMISSION + ROUTING + REC OH = SENDER OH + L + N + Lx(R+1) + RECEIVER OH = SENDER OH + N + Lx(R+2) + RECEIVER OH

 STORE-AND-FORWARD  PACKET LATENCY = SENDER OH + L x N + N + LxR + RECEIVER OH = SENDER OH + L x N + N + LxR + RECEIVER OH

 CUT-THROUGH SWITCHING  PACKET LATENCY = SENDER OH + L + N + LxR + RECEIVER OH = SENDER OH + Lx(R+1) + N + RECEIVER OH.

13 Bandwidth Models

 BOTTLENECKS INCREASE LATENCY  TRANSFERS ARE PIPELINED  UNLOADED BANDWIDTH = PACKET_SIZE/MAX(SENDER OV, RECEIVER OV, TRANSMISSION TIME)  NETWORK CONTENTION AFFECTS LATENCY AND EFFECTIVE BANDWIDTH (NOT COUNTED IN ABOVE FORMULA)

 BISECTION WIDTH  NETWORK IS SEEN AS A GRAPH  VERTICES ARE SWITCHES AND EDGES ARE LINKS  BISECTION IS A CUT THROUGH A MINIMUM SET OF EDGES SUCH THAT THE CUT DIVIDES THE NETWORK GRAPH INTO TWO ISOMORPHIC --I.E., SAME-- SUBGRAPHS  EXAMPLE: MESH  MEASURES BANDWIDTH WHEN ALL NODES IN ONE SUBGRAPH COMMUNICATE ONLY WITH NODES IN THE OTHER SUBGRAPH

 AGGREGATE BANDWIDTH  BANDWIDTH ACROSS ALL LINKS DIVIDED BY THE NUMBER OF NODES

14 Interconnecting Two Devices Basic Network Structure and Functions • Media and Form Factor

Metal layers InfiniBand connectors

Fiber Optics

Cat5E twisted pair Media type

Coaxial cables Printed circuit boards Myrinet connectors OCNs SANs LANs WANs 0.01 1 10 100 >1,000 Distance (meters) 15 Topologies

INDIRECT NETWORKS: IN IS CENTRALIZED  BUS   MULTISTAGE INTERCONNECTION NETWORK (MIN)

0 1 2 3 4 5 6 7 2-BY-2 CROSSBAR 2by2 0 0 0 1 1 1 2 2 2 3 3 3 4 5 4 4 6 5 5 7

6 PERFECT_SHUFFLE 6 PERFECT_SHUFFLE CROSSBAR (8by8) PERFECT_SHUFFLE 7 7 OMEGA NETWORK 16 Topologies

INDIRECT NETWORKS: IN IS CENTRALIZED  TREE  BUTTERFLY 0 0 1 1

2 2 3 3

4 5 0 1 2 3 6 4 4 5 5 BINARY TREE

6 6 7 7

BUTTERFLY: EMBEDDED TREES

BEST TO CONNECT DIFFERENT TYPES OF NODES 17 Topologies

DIRECT NETWORKS: NODES ARE DIRECTLY CONNECTED TO ONE ANOTHER DECENTRALIZED  LINEAR ARRAY AND RING

 MESH AND TORI

18 Topologies

DIRECT NETWORKS: NODES ARE DIRECTLY CONNECTED TO ONE ANOTHER HYPERCUBE AND k-ARY n-CUBE

19

 Centralized Switched (Indirect) Networks  Multistage interconnection networks (MINs)  Crossbar split into several stages consisting of smaller crossbars  Complexity grows as O(N × log N), where N is # of end nodes  Inter-stage connections represented by a set of permutation functions

0 0

1 1

2 2

3 3

4 4

5 5

6 6

7 7

Omega topology, perfect-shuffle exchange

20 Network Topology Centralized Switched (Indirect) Networks

0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111

21 16 port, 4 stage Omega network Network Topology Centralized Switched (Indirect) Networks

0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111

22 16 port, 4 stage Baseline network Network Topology Centralized Switched (Indirect) Networks

0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111

23 16 port, 4 stage Butterfly network Network Topology Centralized Switched (Indirect) Networks

0000 0000 0001 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111

24 16 port, 4 stage Cube network Network Topology

 Centralized Switched (Indirect) Networks  Multistage interconnection networks (MINs)  MINs interconnect N input/output ports using k x k switches

 logkN switch stages, each with N/k switches  N/k(logkN) total number of switches  Example: Compute the switch and link costs of interconnecting 4096 nodes using a crossbar relative to a MIN, assuming that switch cost grows quadratically with the number of input/output ports (k). Consider the following values of k:  MIN with 2 x 2 switches  MIN with 4 x 4 switches  MIN with 16 x 16 switches

25 Network Topology

 Centralized Switched (Indirect) Networks  Example: compute the relative switch and link costs, N = 4096

2 cost(crossbar)switches = 4096

cost(crossbar)links = 8192

2 2 relative_cost(2 × 2)switches = 4096 / (2 × 4096/2 × log2 4096) = 170

relative_cost(2 × 2)links = 8192 / (4096 × (log2 4096 + 1)) = 2/13 = 0.1538

2 2 relative_cost(4 × 4)switches = 4096 / (4 × 4096/4 × log4 4096) = 170

relative_cost(4 × 4)links = 8192 / (4096 × (log4 4096 + 1)) = 2/7 = 0.2857

2 2 relative_cost(16 × 16)switches = 4096 / (16 × 4096/16 × log16 4096) = 85

relative_cost(16 × 16)links = 8192 / (4096 × (log16 4096 + 1)) = 2/4 = 0.5

26 Network Topology

 Centralized Switched (Indirect) Networks  Relative switch and link costs for various values of k and N (crossbar relative to a MIN)

0-0.5 0.5-1 1-1.5 1.5-2

350 2 300 250 1.5 200 1 150 8192 8192 100 512 512 0.5 N 32 50 32 N 0 2 0 2 2 16 128 1024 8192 2 16 128 1024 8192 k k Relative switch cost Relative link cost

27 Network Topology – Reduced Blocking

 Centralized Switched (Indirect) Networks

0 0 1 1

2 2 3 3

4 4 5 5

6 6 7 7

8 8 9 9

10 10 11 11

12 12 13 13

14 14 15 15

28 16 port Crossbar network Network Topology

 Centralized Switched (Indirect) Networks

0 0 1 1

2 2 3 3

4 4 5 5

6 6 7 7

8 8 9 9

10 10 11 11

12 12 13 13

14 14 15 15

29 16 port, 3-stage Clos network Network Topology

 Centralized Switched (Indirect) Networks

0 0 1 1

2 2 3 3

4 4 5 5

6 6 7 7

8 8 9 9

10 10 11 11

12 12 13 13

14 14 15 15

30 16 port, 5-stage Clos network Network Topology

 Centralized Switched (Indirect) Networks

0 0 1 1

2 2 3 3

4 4 5 5

6 6 7 7

8 8 9 9

10 10 11 11

12 12 13 13

14 14 15 15

31 16 port, 7 stage Clos network = Benes topology Network Topology

Centralized Switched (Indirect) Networks Network Bisection  Bidirectional MINs 0 1  Increase modularity 2  Reduce hop count, d 3

network 4  Nodes at tree leaves 5  Switches at tree 6 vertices 7  Total link bandwidth 8 is constant across all 9 tree levels, with full 10 bisection 11 bandwidth 12  Equivalent to folded 13 Benes topology 14  Preferred topology in 15 32 many SANs Folded Clos = Folded Benes = Fat Topology Comparison Interconnection Switch Network Bisection Network network degree diameter width size Crossbar N 1 N N switch

Butterfly k logk N N/2 N (built from k- by-k switches)

k-ary tree k+1 2logk N 1 N Linear array 2 N-1 1 N

Ring 2 N/2 2 N

n-by-n mesh 4 2(n-1) n N=n2

n-by-n torus 4 n 2n N=n2

k-dimensional k k 2k-1 N=2k hypercube k-ary n-cube 2k nk/2 2kn-1 N=nk

SWITCH DEGREE: NUMBER OF PORTS IN EACH SWITCH (SWITCH COMPLEXITY) NETWORK DIAMETER: WORST-CASE ROUTING DISTANCE BETWEEN ANY TWO NODES BISECTION WIDTH: NUMER OF LINKS IN BISECTION (WORST-CASE BW) NETWORK SIZE: NUMER OF NODES

33 Routing Algorithms

 REFERS TO MECHANISMS TO HANDLE CONFLICTS IN SWITCH-BASED NETWORKS

Input Output ports ports Transmit

Rdy 4-by-4 crossbar

 BUFFERS AT INPUT AND OUTPUT PORTS  IN VIRTUAL CUT-THROUGH: BUFFER FOR ENTIRE PACKET  IN WORMHOLE: BUFFER FOR INTEGRAL NUMBER OF FLITS  LINK-LEVEL FLOW CONTROL  HANDSHAKE SIGNAL  Rdy INDICATES WHETHER FLITS CAN BE TRANSMITTED TO THE DESTINATION  BUFFERING FOR CUT-THROUGH (whole packet) vs WORMHOLE (a few flits)  HOT SPOT CONTENTION AND TREE SATURATION

34 Routing Algorithms

 USE SOURCE AND/OR DESTINATION ADDRESSES OMEGA NETWORK

USE THE DESTINATION ADDRESS IN THIS CASE, 3 BITS

USE ith BIT OF THE DESTINATION (di) TO SELECT UPPER OR LOWER OUTPUT PORT FOR STAGE n-1-i 0 => UP 1 => DOWN

EXAMPLE: ROUTE FROM 4 TO 6 DESTINATION ADDRESS IS 110 DOWN IN STAGE 0 DOWN IN STAGE 1 UP IN STAGE 2

35 Routing Algorithms

BUTTERFLY NETWORK

USE RELATIVE ADDRESS BITWISE EXCLUSIVE OR OF SOURCE AND DESTINATION ADDRESSES TO FORM THE ROUTING ADDRESS IF BIT i IS ZERO, ROUTE STRAIGHT IF BIT i IS ONE, ROUTE ACROSS

EXAMPLE: ROUTE FROM 4 TO 6 SOURCE: 100 DESTINATION: 110 EX-OR: 010

36 Routing Algorithms

DIMENSION-ORDER ROUTING (DETERMINISTIC) NUMBER NODES SO THAT LOWER LEFT CORNER IS (0,0) AND UPPER RIGHT CORNER IS (n-1,n-1) USE RELATIVE ADDRESS:

(X,Y)=(XB-XA,YB-YA) ROUTE FIRST HORIZONTALLY DECREMENT X WHEN X=0, ROUTE VERTICALLY DECREMENT Y UNTIL Y=0 EXAMPLE: MOVE PACKET FROM A TO B RELATIVE ADDRESS IS (X,Y)=(n-2,n-2) FIRST MOVE PACKET HORIZONTALLY DECREMENT X BY 1 (RIGHT MOVE) WHEN X=0, MOVE PACKET VERTICALLY DECREMENT Y BY 1 (UP MOVE)

37 Routing Issues

 Deadlock  A route from node A to node B blocks a route from node C to node D, and vice versa  Can be caused by competition for ports, outputs, or intermediate switch resources  Adaptive routing algorithms  Routing algorithm can dynamically change routes based on traffic

38 Routing for Large Systems

 Cloud servers  Typically use hierarchical Ethernet switches  Example: Facebook moving to 100G Ethernet backbone in data centers  Relatively long routing latency  Although typically implemented with a switch fabric, emulates a common contention network  Supercomputers  Infiniband  by design  Was envisioned to replace PCI in 1999  Multichannel links (like PCI-e). Now peaks at 25Gbs/channel  Proprietary  Not many left. Cray, NEC

39 Routing for Small Systems

 Intel Xeon PHI  Bi-directional cache coherent ring  Tilera  Planar mesh  Rings are common on-chip  Fewer wires than full interconnect  On-chip bandwidth high enough to compensate for multi-hop transmission

40 Departed Interconnects

 Fiber Distributed Data Iinterface (FDDI)  100 Mbs token-passing dual ring  First use of 8B/10B!  High-performance Parallel Interrface (HIPPI)  Used for supercomputer connections  Point-to-point link  Scalable Coherent Interronncet (SCI)  Point-to-point connections with directory-based cache coherence  Developed as a standard before anyone used it  Myrinet  Fiber interconnect for supercomputers  Asynchronous Transfer Mode (ATM)  Cell switching (48B cells) devised by telecoms  Intended to emulate telecom circuit-switched network 41