Overview

● on-chip networks
  ■ systems-on-chip
  ■ bus-based networks: arbitration, transfer modes, (cache) coherency
  ■ mesh-based networks: hierarchies, message organization & routing
  ■ issues: signal propagation delay, energy per bit transferred, reliability
● cache coherency reconsidered and a case for the message passing paradigm
● intermission
● the Intel Single-Chip Cloud Computer
  ■ motivations: high-performance & power-efficient network
  ■ architecture
  ■ programming model: the Rock Creek Communication Environment
    ◆ motivations; overview; transferring data between cores
    ◆ programming examples; performance
● refs: [1] On-chip Communication Architectures, Pasricha & Dutt, 2010, Morgan Kaufmann; [2] The Future of Microprocessors, Borkar & Chien, CACM, May 2011

On-chip Networks: Bus-based

● traditional model; buses have address and data pathways
● there can be several 'masters' (devices that operate the bus, e.g. CPU, DMA engine)
● in a multicore context, there can be many! (a scalability issue)
● hence arbitration is a complex issue (and takes time!)
● techniques for improving bus utilization:
  ■ burst transfer mode: multiple requests (in a regular pattern) once granted access
  ■ pipelining: place the next address on the bus as the data from the previous request is placed on it
  ■ broadcast: one master sends to others, e.g. cache coherency - snoop, invalidation
● (figure - courtesy [1])

Systems on a Chip

● former systems can now be integrated into a single chip
● usually for special-purpose systems
● high speed per price, power
● often have hierarchical networks

On-Chip Networks: Current and Next Generations

● buses: only one device can access (make a 'transaction') at one time
● crossbars: devices split in 2 groups of size p
  ■ can have p transactions at once, provided at most 1 per device
  ■ e.g. T2: p = 9 cores and L2$ banks (OS Slide Cast Ch 5, p86); Fermi GPU: p = 14?
  ■ for largish p, need to be internally organized as a series of stages
● may also be organized as a ring (e.g. Cell Broadband Engine)
● for larger p, may need a more scalable topology such as a 2-D mesh

[Figures: bus to connect a cluster; second-level bus to connect clusters (hierarchy of busses); second-level mesh-based network (hierarchy of networks) (courtesy [1]). Figure 12: hybrid switching for network-on-a-chip (hierarchical networks, courtesy [2])]

On-chip Network Mechanisms: Protocols and Routing

● unit of communication is a message, of limited length
  ■ e.g. a T2 L1$ cache line invalidation (writeback) has a 'payload' of 8 (8+16) bytes
● generally, routers break messages into fixed-size packets
● each packet is routed using circuit-based switching (a connection is established between source and destination – enables 'pipelined' transmission)
● unit of transmission is a flit, with a phit being transferred each cycle
  ■ routing: the destination's id in the header flit is used to determine the route

Cache Coherency Considered Harmful (the 'Coherency Wall')

● a core writes data at address x in its L1$; it must invalidate x in the other L1$s
● standard protocols require a broadcast message for every cache line invalidation
  ■ maintaining the (MOESI) protocol also requires a broadcast on every miss
  ■ the energy cost of each is O(p); the overall cost is O(p²)!
  ■ also causes contention (hence delay) in the network (worse than O(p²)?)
● directory-based protocols can direct invalidation messages to only the caches holding the same data
  ■ far more scalable (e.g. SGI Altix large-scale SMP), for lightly-shared data
  ■ worse otherwise; also introduces overhead through indirection
  ■ also, for each cached line, need a bit vector of length p: O(p²) storage cost
● false sharing in any case results in wasted traffic (illustrated in the sketch below)
  [figure: processes P0..P7 each write their own byte b0..b7 of the same cache line]
● hey, what about GPUs?
● atomic instructions, synchronizing the memory system down to the L2$, have a (large) O(p) energy cost each (hence O(p²) total!)
● cache line size is sub-optimal for messages on on-chip networks
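To make the false-sharing cost concrete, here is a small stand-alone illustration (not from the lecture): 8 threads each update their own byte of a single cache line, so every write invalidates that line in the other cores' L1$s, while the padded variant gives each counter its own line. The thread count, repetition count and the assumed 64-byte line size are arbitrary choices of this sketch.

/* build: cc -O2 -pthread falseshare.c   (file name is illustrative) */
#include <pthread.h>
#include <stdio.h>

#define P    8
#define REPS 10000000

struct shared { volatile unsigned char b[P]; };            /* b0..b7 share one cache line    */
struct padded { volatile unsigned char b; char pad[63]; }; /* one counter per (64-byte) line */

static struct shared s;
static struct padded q[P];

static void *worker(void *arg) {
    int i, id = *(int *)arg;
    for (i = 0; i < REPS; i++) {
        s.b[id]++;   /* falsely shared: each write invalidates the line in the other L1$s */
        q[id].b++;   /* padded: private line, no invalidation traffic                     */
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    int id[P], i;
    for (i = 0; i < P; i++) { id[i] = i; pthread_create(&t[i], NULL, worker, &id[i]); }
    for (i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("done: s.b[0] = %d, q[0].b = %d\n", s.b[0], q[0].b);
    return 0;
}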


On-chip Network Issues: Signal Delay and Energy/Bit Transfer

● energy cost to access data off-chip (DRAM) is 100× more than from registers
● delay for cross-chip interconnect grows super-linearly with length
● the ratio of this to transistor switching time scales exponentially with shrinking feature size (≈ 60× @ 65nm, ≈ 130× @ 32nm) [1]
  ■ can be reduced by using 'repeaters', but still ≈ 6× @ 65nm, ≈ 12× @ 32nm [1]
  ■ this however increases energy! (n.b. mesh networks are electrically equivalent)
● soon, interconnects are expected to consume 50× more energy than logic circuits!
● an application at a futuristic 3 TFLOPs (d.p.) will need (say) 9 × 64 = 576 Tb/s data transfer
● if operands on average need to move 1mm to the FPU (e.g. from the L1$), then @ 0.1 pJ/bit, this consumes 57W! [2] (worked figures below)
  [Figure 11. On-die interconnect delay and energy (45nm): (a) wire delay (ps) and wire energy vs. on-die interconnect length (mm); (b) on-die network energy per bit (pJ/bit), measured and extrapolated, from 0.5µ to 8nm — courtesy [2]]

Shared Memory Programming Paradigm Challenged

● ref: [3] Kumar et al, The Case For Message Passing On Many-Core Chips
● traditionally: the paradigm to use when hardware shared memory is available
  ■ faster and easier!
● challenged in the late 90's: message passing code was faster for large p on large-scale SGI NUMA shared memory machines (Blackford & Whaley, 98)
  ■ message passing: separate processes run on each CPU, communicating via some message passing library
    ◆ note: the default paradigm for the Xe cluster, 2 × 4 cores per node
  ■ reason: better separation of the hardware-shared memory between different processes
  ■ confirmed by numerous similar studies since, especially if fast on-board/on-chip communication channels are available
● is it easier? (note: message passing can also be used for synchronization)
  ■ SM is a "difficult programming model" [3] (relaxed consistency memory models)
  ■ data races, non-determinism; more effort to optimize in the end [5, p6]
  ■ no safety / composability / …
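For the record, the 576 Tb/s and 57W figures follow directly from the quoted rates, assuming (as the 9 × 64 factor suggests) 3 operands of 64 bits per double-precision FLOP:

  3 TFLOP/s × 3 operands/FLOP × 64 bit/operand = 576 Tb/s
  576 × 10^12 bit/s × 0.1 pJ/bit = 57.6 W ≈ 57 W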

Message Passing Programming: the Essential Ideas

● ringsend.c: pass tokens around a ring (a reconstructed listing is sketched below)
● c.f. Posix sockets ( , )
● compile and run:
    xe:/short/c07/mpi$ mpicc -o ringsend ringsend.c
    xe:/short/c07/mpi$ mpirun -np 5 ringsend 2
● message passing mechanism (courtesy LLNL)
● ref: LLNL Tutorial
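A minimal MPI version of ringsend.c, consistent with its output strings and the run command above; the variable names, the default of one repetition, and the use of MPI_Sendrecv_replace are choices of this sketch, not necessarily those of the original program.

/* ringsend.c (sketch): pass a token around a ring of MPI processes.
 * Each process starts with its own rank as the token; each repetition
 * shifts all tokens one place to the right around the ring. */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, nprocs, token, i;
    int reps = (argc > 1) ? atoi(argv[1]) : 1;   /* e.g. "mpirun -np 5 ringsend 2" */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("%d: I am %d of %d\n", rank, rank, nprocs);

    token = rank;
    for (i = 0; i < reps; i++)
        /* send the current token to the right neighbour and receive a new
         * one from the left; the combined call cannot deadlock */
        MPI_Sendrecv_replace(&token, 1, MPI_INT,
                             (rank + 1) % nprocs, 0,
                             (rank + nprocs - 1) % nprocs, 0,
                             MPI_COMM_WORLD, &status);

    printf("%d: after %d reps, have token from process %d\n", rank, reps, token);
    MPI_Finalize();
    return 0;
}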

SCC Architecture

● chip layout [5, p24], tile layout [5, p25]
● overall organization [5, p27]
● router architecture [5, p28] (VC = virtual channel)
  ■ uses wormhole routing (Dally 86)
● what the SCC looks like: test board [5, p30]; system overview [5, p31]
  ■ the 4 memory controllers (and memory banks; c.f. T2, Fermi) are part of the network
● memory hierarchy: programmer's view [5, p35]
  ■ globally accessible per-core test-and-set registers provide time- and power-efficient synchronization (see the lock sketch below)
● power management:
  ■ voltage / frequency islands (variable V and f) [5, p47]
  ■ slightly super-linear relationship: chip voltage & power [5, p33]
  ■ power breakdown: in the cores and routers [5, p34]; note: cores 87W → 5W
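To make the synchronization point concrete, here is a generic test-and-set spinlock written with portable C11 atomics; it only illustrates the kind of primitive such a register provides and is not the SCC's own register interface.

/* Test-and-set lock sketch: atomic_flag stands in for a hardware T&S register. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear = free, set = held */
static int counter = 0;                        /* shared state protected by the lock */

static void lock_acquire(void) {
    /* test-and-set returns the previous value: spin until it was clear (free) */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                                      /* spin */
}

static void lock_release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

int main(void) {
    lock_acquire();
    counter++;                                 /* critical section */
    lock_release();
    printf("counter = %d\n", counter);
    return 0;
}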

The Intel Single Chip Cloud Computer: Motivations

● refs:
    [4] Tim Mattson et al, The 48-core SCC processor: the programmer's view, SC'10
    [5] Tim Mattson, The Future of Many Core Computing
    (the following notes are a guide/commentary/supplement to [5])
● the supercomputer of 20 years ago now on a chip [5, p19]
● motivations: prototype for future many-core computing research
  ■ high-performance, power-efficient on-chip network
  ■ fine-grained power management (V & f)
  ■ non cache coherent: message-based programming support
  ■ cores of modest performance (P54C)
● 'Cloud Computer'? The P54C predates … support . . .
● (figure - courtesy Intel+RSCS)

Programming Model: the Rock Creek Communication Environment

● research goals [5, p36]; high level view [5, p38], SPMD execution model
● RCCE architecture for 3 platforms [5, p37]
● message passing in RCCE [5, p38-41], using the message passing buffer (MPB, fast shared memory)
  ■ shared cache memory has the MPBT flag, in the page table (propagated down to the L1$ on a miss)
  ■ note the limited size of the shared memory area!
● issues in the lack of coherency, when performing [5, p39]:
  ■ a put(): must invalidate in the L1$ before the write
  ■ a get(): must ensure stale data is not in the L1$ (why is the MPB cached at all?)
● special allocation routine for (MP) shared memory [5, p40] (c.f. CUDA)
  ■ must be called by all participating processes, so all will be able to access the same address offset (A)
● send and receive are implemented in terms of put() & get() (copy to/from MPB (shared) memory) [5, p42]
● comm. is synchronous: senders must wait till the corresponding receiver does the receive

Example RCCE Program: ringsend.c on the SCC

"RCCE.h"

∗ ℄ {

"%d: I am process %d of %d\n"

< {

{

} {

}

}

"%d: after %d reps, I have token from process %d\n"

}
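A minimal RCCE version, assuming the basic blocking interface described in [4, 5] (RCCE_init, RCCE_ue, RCCE_num_ues, and RCCE_send/RCCE_recv taking a char buffer, a size and a UE id); because RCCE communication is synchronous, UE 0 sends first while all other UEs receive first, which avoids deadlock by serialising the ring. Variable names and loop structure are choices of this sketch.

/* ringsend.c (RCCE sketch): pass a token around a ring of SCC cores (UEs). */
#include <stdio.h>
#include <stdlib.h>
#include "RCCE.h"

int main(int argc, char *argv[]) {
    int me, nues, token, incoming, i;
    int reps = (argc > 1) ? atoi(argv[1]) : 1;

    RCCE_init(&argc, &argv);
    me   = RCCE_ue();                 /* this core's unit-of-execution id */
    nues = RCCE_num_ues();
    printf("%d: I am process %d of %d\n", me, me, nues);

    token = me;
    for (i = 0; i < reps; i++) {
        int right = (me + 1) % nues, left = (me + nues - 1) % nues;
        if (me == 0) {                /* UE 0 breaks the symmetry: send, then receive */
            RCCE_send((char *)&token,    sizeof(token),    right);
            RCCE_recv((char *)&incoming, sizeof(incoming), left);
        } else {                      /* everyone else receives, then forwards        */
            RCCE_recv((char *)&incoming, sizeof(incoming), left);
            RCCE_send((char *)&token,    sizeof(token),    right);
        }
        token = incoming;
    }

    printf("%d: after %d reps, I have token from process %d\n", me, reps, token);
    RCCE_finalize();
    return 0;
}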

● run by rccerun -nue 4 ringsend (similar program [5, p68-9])


SCC Performance, Issues and Outlook

● HPL performance [5, p43,45]
● comparison with the Fujitsu AP+ (1996, 64 SuperSPARCs on a torus network); cycle counts worked out below:

                        AP+        SCC [4]
    clock:              50 MHz     1000 MHz
    message latency:    10 µs      5 µs
    message bandwidth:  50 MB/s    125 MB/s
    L1$:                16 KB      16 KB

● (short-term) consequences of non-coherency
  ■ loss of 'single system image' (at the OS level)
  ■ each core has its own OS & filesystem (you ssh to each core!)
  ■ future: software/OS-managed distributed shared memory
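For scale, the quoted message latencies can be expressed in each machine's own core clock cycles (a direct conversion from the table above):

  AP+: 10 µs × 50 MHz   = 500 cycles
  SCC:  5 µs × 1000 MHz = 5000 cycles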
