Overview

● on-chip networks
  ■ systems-on-chip
  ■ bus-based networks: arbitration, transfer modes, (cache) coherency
  ■ mesh-based networks: hierarchies, message organization & routing
  ■ issues: signal propagation delay, energy per bit transferred, reliability
● cache coherency reconsidered and a case for the message passing paradigm
● intermission
● the Intel Single-Chip Cloud Computer
  ■ motivations: high-performance & power-efficient network
  ■ architecture
  ■ programming model: the Rock Creek Communication Environment
    ◆ motivations; overview; transferring data between cores
    ◆ programming examples; performance
● refs: [1] On-chip Communication Architectures, Pasricha & Dutt, 2010, Morgan Kaufmann; [2] The Future of Microprocessors, Borkar & Chien, CACM, May 2011

On-chip Networks: Bus-based

● traditional model; buses have address and data pathways
● there can be several 'masters' (devices that operate the bus, e.g. CPU, DMA engine)
● in a multicore context, there can be many! (a scalability issue)
● hence arbitration is a complex issue (and takes time!)
● techniques for improving bus utilization:
  ■ burst transfer mode: multiple requests (in a regular pattern) once granted access
  ■ pipelining: place the next address on the bus as the data from the previous request is placed on it
  ■ broadcast: one master sends to others, e.g. cache coherency - snoop, invalidation
● (figure - courtesy [1])

Systems on a Chip

● former systems can now be integrated into a single chip
● usually for special-purpose systems
● high speed per price, power
● often have hierarchical networks

On-Chip Networks: Current and Next Generations

● buses: only one device can access (make a 'transaction') at one time
● crossbars: devices split in 2 groups of size p
  ■ can have p transactions at once, provided at most 1 per device
  ■ e.g. T2: p = 9 cores and L2$ banks (OS Slide Cast Ch 5, p86); Fermi GPU: p = 14?
  ■ for largish p, need to be internally organized as a series of stages
● may also be organized as a ring (e.g. Cell Broadband Engine)
● for larger p, may need a more scalable topology such as a 2-D mesh

[Figures: bus to connect a cluster; second-level bus to connect clusters (hierarchy of busses); second-level mesh-based network (hierarchy of networks) (courtesy [1]). Figure 12: hybrid switching for network-on-a-chip (hierarchical networks, courtesy [2])]

On-chip Network Mechanisms: Protocols and Routing

● unit of communication is a message, of limited length
  ■ e.g. a T2 L1$ cache line invalidation (writeback) has a 'payload' of 8 (8+16) bytes
● generally, routers break messages into fixed-size packets
● each packet is routed using circuit-based switching (a connection is established between source and destination – enables 'pipelined' transmission)
● unit of transmission is a flit, with a phit being transferred each cycle
  ■ routing: the destination's id in the header flit is used to determine the route

Cache Coherency Considered Harmful (the 'Coherency Wall')

● a core writes data at address x in its L1$; it must invalidate x in the other L1$s
● standard protocols require a broadcast message for every cache line invalidation
  ■ maintaining the (MOESI) protocol also requires a broadcast on every miss
  ■ the energy cost of each is O(p); the overall cost is O(p²)!
  ■ also causes contention (hence delay) in the network (worse than O(p²)?)
● directory-based protocols can direct invalidation messages to only the caches holding the same data
  ■ far more scalable (e.g. SGI Altix large-scale SMP), for lightly-shared data
  ■ worse otherwise; also introduces overhead through indirection
  ■ also, for each cached line, need a bit vector of length p: O(p²) storage cost
● false sharing in any case results in wasted traffic (illustrated in the sketch below)
  [figure: processes P0..P7 each write their own byte b0..b7 of the same cache line]
● hey, what about GPUs?
● atomic instructions, synchronizing the memory system down to the L2$, have a (large) O(p) energy cost each (hence O(p²) total!)
● cache line size is sub-optimal for messages on on-chip networks
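To make the false-sharing cost concrete, here is a small stand-alone illustration (not from the lecture): 8 threads each update their own byte of a single cache line, so every write invalidates that line in the other cores' L1$s, while the padded variant gives each counter its own line. The thread count, repetition count and the assumed 64-byte line size are arbitrary choices of this sketch.

/* build: cc -O2 -pthread falseshare.c   (file name is illustrative) */
#include <pthread.h>
#include <stdio.h>

#define P    8
#define REPS 10000000

struct shared { volatile unsigned char b[P]; };            /* b0..b7 share one cache line    */
struct padded { volatile unsigned char b; char pad[63]; }; /* one counter per (64-byte) line */

static struct shared s;
static struct padded q[P];

static void *worker(void *arg) {
    int i, id = *(int *)arg;
    for (i = 0; i < REPS; i++) {
        s.b[id]++;   /* falsely shared: each write invalidates the line in the other L1$s */
        q[id].b++;   /* padded: private line, no invalidation traffic                     */
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    int id[P], i;
    for (i = 0; i < P; i++) { id[i] = i; pthread_create(&t[i], NULL, worker, &id[i]); }
    for (i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("done: s.b[0] = %d, q[0].b = %d\n", s.b[0], q[0].b);
    return 0;
}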


On-chip Network Issues: Signal Delay and Energy/Bit Transfer

● energy cost to access data off-chip (DRAM) is 100× more than from registers
● delay for cross-chip interconnect grows super-linearly with length
● the ratio of this to transistor switching time scales exponentially with shrinking feature size (≈ 60× @ 65nm, ≈ 130× @ 32nm) [1]
  ■ can be reduced by using 'repeaters', but still ≈ 6× @ 65nm, ≈ 12× @ 32nm [1]
  ■ this however increases energy! (n.b. mesh networks are electrically equivalent)
● soon, interconnects are expected to consume 50× more energy than logic circuits!
● an application at a futuristic 3 TFLOPs (d.p.) will need (say) 9 × 64 = 576 Tb/s data transfer
● if operands on average need to move 1mm to the FPU (e.g. from the L1$), then @ 0.1 pJ/bit, this consumes 57W! [2] (worked figures below)
  [Figure 11. On-die interconnect delay and energy (45nm): (a) wire delay (ps) and wire energy vs. on-die interconnect length (mm); (b) on-die network energy per bit (pJ/bit), measured and extrapolated, from 0.5µ to 8nm — courtesy [2]]

Shared Memory Programming Paradigm Challenged

● ref: [3] Kumar et al, The Case For Message Passing On Many-Core Chips
● traditionally: the paradigm to use when hardware shared memory is available
  ■ faster and easier!
● challenged in the late 90's: message passing code was faster for large p on large-scale SGI NUMA shared memory machines (Blackford & Whaley, 98)
  ■ message passing: separate processes run on each CPU, communicating via some message passing library
    ◆ note: the default paradigm for the Xe cluster, 2 × 4 cores per node
  ■ reason: better separation of the hardware-shared memory between different processes
  ■ confirmed by numerous similar studies since, especially if fast on-board/on-chip communication channels are available
● is it easier? (note: message passing can also be used for synchronization)
  ■ SM is a "difficult programming model" [3] (relaxed consistency memory models)
  ■ data races, non-determinism; more effort to optimize in the end [5, p6]
  ■ no safety / composability / …
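For the record, the 576 Tb/s and 57W figures follow directly from the quoted rates, assuming (as the 9 × 64 factor suggests) 3 operands of 64 bits per double-precision FLOP:

  3 TFLOP/s × 3 operands/FLOP × 64 bit/operand = 576 Tb/s
  576 × 10^12 bit/s × 0.1 pJ/bit = 57.6 W ≈ 57 W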

Message Passing Programming: the Essential Ideas

● ringsend.c: pass tokens around a ring (a reconstructed listing is sketched below)
● c.f. Posix sockets ( , )
● compile and run:
    xe:/short/c07/mpi$ mpicc -o ringsend ringsend.c
    xe:/short/c07/mpi$ mpirun -np 5 ringsend 2
● message passing mechanism (courtesy LLNL)
● ref: LLNL Tutorial
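A minimal MPI version of ringsend.c, consistent with its output strings and the run command above; the variable names, the default of one repetition, and the use of MPI_Sendrecv_replace are choices of this sketch, not necessarily those of the original program.

/* ringsend.c (sketch): pass a token around a ring of MPI processes.
 * Each process starts with its own rank as the token; each repetition
 * shifts all tokens one place to the right around the ring. */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, nprocs, token, i;
    int reps = (argc > 1) ? atoi(argv[1]) : 1;   /* e.g. "mpirun -np 5 ringsend 2" */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("%d: I am %d of %d\n", rank, rank, nprocs);

    token = rank;
    for (i = 0; i < reps; i++)
        /* send the current token to the right neighbour and receive a new
         * one from the left; the combined call cannot deadlock */
        MPI_Sendrecv_replace(&token, 1, MPI_INT,
                             (rank + 1) % nprocs, 0,
                             (rank + nprocs - 1) % nprocs, 0,
                             MPI_COMM_WORLD, &status);

    printf("%d: after %d reps, have token from process %d\n", rank, reps, token);
    MPI_Finalize();
    return 0;
}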

SCC Architecture

● chip layout [5, p24], tile layout [5, p25]
● overall organization [5, p27]
● router architecture [5, p28] (VC = virtual channel)
  ■ uses wormhole routing (Dally 86)
● what the SCC looks like: test board [5, p30]; system overview [5, p31]
  ■ the 4 memory controllers (and memory banks; c.f. T2, Fermi) are part of the network
● memory hierarchy: programmer's view [5, p35]
  ■ globally accessible per-core test-and-set registers provide time- and power-efficient synchronization (see the lock sketch below)
● power management:
  ■ voltage / frequency islands (variable V and f) [5, p47]
  ■ slightly super-linear relationship: chip voltage & power [5, p33]
  ■ power breakdown: in the cores and routers [5, p34]; note: cores 87W → 5W
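To make the synchronization point concrete, here is a generic test-and-set spinlock written with portable C11 atomics; it only illustrates the kind of primitive such a register provides and is not the SCC's own register interface.

/* Test-and-set lock sketch: atomic_flag stands in for a hardware T&S register. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear = free, set = held */
static int counter = 0;                        /* shared state protected by the lock */

static void lock_acquire(void) {
    /* test-and-set returns the previous value: spin until it was clear (free) */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                                      /* spin */
}

static void lock_release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

int main(void) {
    lock_acquire();
    counter++;                                 /* critical section */
    lock_release();
    printf("counter = %d\n", counter);
    return 0;
}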

The Intel Single Chip Cloud Computer: Motivations

● refs:
    [4] Tim Mattson et al, The 48-core SCC processor: the programmer's view, SC'10
    [5] Tim Mattson, The Future of Many Core Computing
    (the following notes are a guide/commentary/supplement to [5])
● the supercomputer of 20 years ago now on a chip [5, p19]
● motivations: prototype for future many-core computing research
  ■ high-performance, power-efficient on-chip network
  ■ fine-grained power management (V & f)
  ■ non cache coherent: message-based programming support
  ■ cores of modest performance (P54C)
● 'Cloud Computer'? The P54C predates … support . . .
● (figure - courtesy Intel+RSCS)

Programming Model: the Rock Creek Communication Environment

● research goals [5, p36]; high level view [5, p38], SPMD execution model
● RCCE architecture for 3 platforms [5, p37]
● message passing in RCCE [5, p38-41], using the message passing buffer (MPB, fast shared memory)
  ■ shared cache memory has the MPBT flag, in the page table (propagated down to the L1$ on a miss)
  ■ note the limited size of the shared memory area!
● issues in the lack of coherency, when performing [5, p39]:
  ■ a put(): must invalidate in the L1$ before the write
  ■ a get(): must ensure stale data is not in the L1$ (why is the MPB cached at all?)
● special allocation routine for (MP) shared memory [5, p40] (c.f. CUDA)
  ■ must be called by all participating processes, so all will be able to access the same address offset (A)
● send and receive are implemented in terms of put() & get() (copy to/from MPB (shared) memory) [5, p42]
● comm. is synchronous: senders must wait till the corresponding receiver does the receive

Example RCCE Program: ringsend.c on the SCC

"RCCE.h"

∗ ℄ {

"%d: I am process %d of %d\n"

< {

{

} {

}

}

"%d: after %d reps, I have token from process %d\n"

}
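A minimal RCCE version, assuming the basic blocking interface described in [4, 5] (RCCE_init, RCCE_ue, RCCE_num_ues, and RCCE_send/RCCE_recv taking a char buffer, a size and a UE id); because RCCE communication is synchronous, UE 0 sends first while all other UEs receive first, which avoids deadlock by serialising the ring. Variable names and loop structure are choices of this sketch.

/* ringsend.c (RCCE sketch): pass a token around a ring of SCC cores (UEs). */
#include <stdio.h>
#include <stdlib.h>
#include "RCCE.h"

int main(int argc, char *argv[]) {
    int me, nues, token, incoming, i;
    int reps = (argc > 1) ? atoi(argv[1]) : 1;

    RCCE_init(&argc, &argv);
    me   = RCCE_ue();                 /* this core's unit-of-execution id */
    nues = RCCE_num_ues();
    printf("%d: I am process %d of %d\n", me, me, nues);

    token = me;
    for (i = 0; i < reps; i++) {
        int right = (me + 1) % nues, left = (me + nues - 1) % nues;
        if (me == 0) {                /* UE 0 breaks the symmetry: send, then receive */
            RCCE_send((char *)&token,    sizeof(token),    right);
            RCCE_recv((char *)&incoming, sizeof(incoming), left);
        } else {                      /* everyone else receives, then forwards        */
            RCCE_recv((char *)&incoming, sizeof(incoming), left);
            RCCE_send((char *)&token,    sizeof(token),    right);
        }
        token = incoming;
    }

    printf("%d: after %d reps, I have token from process %d\n", me, reps, token);
    RCCE_finalize();
    return 0;
}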

● run by rccerun -nue 4 ringsend (similar program [5, p68-9])


SCC Performance, Issues and Outlook

● HPL performance [5, p43,45]
● comparison with the Fujitsu AP+ (1996, 64 SuperSPARCs on a torus network); cycle counts worked out below:

                        AP+        SCC [4]
    clock:              50 MHz     1000 MHz
    message latency:    10 µs      5 µs
    message bandwidth:  50 MB/s    125 MB/s
    L1$:                16 KB      16 KB

● (short-term) consequences of non-coherency
  ■ loss of 'single system image' (at the OS level)
  ■ each core has its own OS & filesystem (you ssh to each core!)
  ■ future: software/OS-managed distributed shared memory
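For scale, the quoted message latencies can be expressed in each machine's own core clock cycles (a direct conversion from the table above):

  AP+: 10 µs × 50 MHz   = 500 cycles
  SCC:  5 µs × 1000 MHz = 5000 cycles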
