Bus-Based On-Chip Networks: Current and Next Generations

Overview

● on-chip communication networks
  ■ bus-based networks: arbitration, transfer modes (cache coherency)
  ■ mesh-based networks: hierarchies, message organization & routing
  ■ issues: signal propagation delay, energy per bit transferred, reliability
● cache coherency reconsidered, and a case for the message passing paradigm
● intermission
● the Intel Single-Chip Cloud Computer
  ■ motivations: a high-performance, power-efficient network
  ■ architecture: memory hierarchy, power management
  ■ programming model: the Rock Creek Communication Environment
    ◆ motivations; overview; transferring data between cores
    ◆ programming examples; performance
● refs:
  [1] Pasricha & Dutt, On-Chip Communication Architectures, Morgan Kaufmann, 2010
  [2] Borkar & Chien, The Future of Microprocessors, CACM, May 2011

Systems on a Chip

● former systems can now be integrated into a single chip
● usually for special-purpose systems
● high speed per price and per watt
● often have hierarchical networks
  (figures courtesy [1], [2])

On-chip Networks: Bus-based

● the traditional model; buses have address and data pathways
● there can be several 'masters' (devices that operate the bus, e.g. CPU, DMA engine)
  ■ in a multicore context, there can be many (a scalability issue)
● hence arbitration is a complex issue (and takes time!)
● techniques for improving bus utilization:
  ■ burst transfer mode: multiple requests (in a regular pattern) once granted access
  ■ pipelining: place the next address on the bus as the data from the previous request is placed on it
  ■ broadcast: one master sends to all others, e.g. cache coherency (snoop, invalidation)
  (figure courtesy [1])

On-Chip Networks: Current and Next Generations

● buses: only one device can access (make a 'transaction') at one time
● crossbars: devices split into 2 groups of size p
  ■ can have p transactions at once, provided at most 1 per device
  ■ e.g. T2: p = 9 cores and L2$ banks (OS Slide Cast Ch 5, p86); Fermi GPU: p = 14?
  ■ for largish p, need to be internally organized as a series of switches
● may also be organized as a ring (e.g. Cell Broadband Engine)
● for larger p, may need a more scalable topology such as a 2-D mesh
  (Figure 12, hybrid switching for a network-on-a-chip: a bus connecting a cluster; a second-level bus connecting clusters (a hierarchy of busses); a second-level router-based network (a hierarchy of networks); courtesy [1], [2])

On-chip Network Mechanisms: Protocols and Routing

● the unit of communication is a message, of limited length
  ■ e.g. a T2 L1$ cache line invalidation (writeback) has a 'payload' of 8 (8+16) bytes
● generally, routers break messages into fixed-size packets
● each packet is routed using circuit-based switching (a connection is established between source and destination, enabling 'pipelined' transmission)
● the unit of transmission is a flit, with a phit being transferred each cycle
  ■ routing: the destination's id in the header flit is used to determine the path (see the sketch below)
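To make the header-flit routing concrete, here is a minimal sketch, not from the lecture, of dimension-order (XY) routing on a 2-D mesh: each router compares the destination coordinates carried in the header flit with its own, forwarding the packet fully in X before turning into Y. All names (xy_route, the port enum) are illustrative.

    /* Hypothetical sketch of dimension-order (XY) routing on a 2-D mesh.
     * The destination id travels in the header flit; each router derives
     * its output port from it. Names and types are illustrative only. */
    #include <stdio.h>

    enum port { LOCAL, EAST, WEST, NORTH, SOUTH };

    /* Output port at router (rx, ry) for a packet headed to (dx, dy):
     * route fully in X first, then in Y. */
    enum port xy_route(int rx, int ry, int dx, int dy)
    {
        if (dx > rx) return EAST;
        if (dx < rx) return WEST;
        if (dy > ry) return NORTH;
        if (dy < ry) return SOUTH;
        return LOCAL;                 /* arrived: deliver to the core */
    }

    int main(void)
    {
        /* Trace a packet from (0,0) to (2,1). */
        int x = 0, y = 0;
        const char *name[] = { "LOCAL", "EAST", "WEST", "NORTH", "SOUTH" };
        for (;;) {
            enum port p = xy_route(x, y, 2, 1);
            printf("router (%d,%d) -> %s\n", x, y, name[p]);
            if (p == LOCAL) break;
            x += (p == EAST) - (p == WEST);
            y += (p == NORTH) - (p == SOUTH);
        }
        return 0;
    }

Dimension-order routing is attractive on-chip because it is deadlock-free on a mesh (packets never turn from Y back into X) and needs only a comparator per dimension in each router; the circuit-based ('pipelined') switching above then streams the remaining flits along the path the header establishes.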
Cache Coherency Considered Harmful (the 'Coherency Wall')

● a core writes data at address x in its L1$; x must then be invalidated in all other L1$s
● standard protocols require a broadcast message for every cache line invalidation
  ■ maintaining the (MOESI) protocol also requires a broadcast on every miss
  ■ the energy cost of each broadcast is O(p); the overall cost is O(p^2)!
  ■ broadcasts also cause contention (hence delay) in the network (worse than O(p^2)?)
● directory-based protocols can direct invalidation messages to only the caches holding the same data
  ■ far more scalable (e.g. SGI Altix large-scale SMP), for lightly-shared data
  ■ worse otherwise; also introduces overhead through indirection
  ■ also, for each cached line, need a bit vector of length p: an O(p^2) storage cost
● false sharing in any case results in wasted traffic
  (figure: processors P0..P7 each writing its own byte b0..b7 of the same cache line; see the demo below)
● hey, what about GPUs?
● atomic instructions, synchronizing the memory system down to the L2$, have a (large) O(p) energy cost each (hence O(p^2) in total!)
● cache line size is sub-optimal for messages on on-chip networks

On-chip Network Issues: Signal Delay and Energy/Bit Transfer

● the energy cost to access data off-chip (DRAM) is 100x more than from registers
● the delay of a cross-chip interconnect grows super-linearly with its length
● the ratio of this delay to the transistor switching time scales exponentially with shrinking feature size (about 60x @ 65nm, about 130x @ 32nm) [1]
  ■ it can be reduced by using 'repeaters', but is still about 6x @ 65nm, about 12x @ 32nm [1]
  ■ this however increases energy! (n.b. mesh networks are electrically equivalent)
● soon, interconnects are expected to consume 50x more energy than logic circuits!
  (Figure 11: on-die interconnect delay (ps) and energy (pJ/bit) vs. interconnect length (mm), at 45nm; wire delay and wire energy, measured and extrapolated; courtesy [2])
● an application at a futuristic 3 TFLOPs (d.p.) will need (say) 9 x 64 = 576 Tb/s of data transfer
● if operands on average need to move 1mm to the FPU (e.g. from the L1$), then @ 0.1 pJ/bit this consumes 57W! [2] (arithmetic worked below)
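The 57W figure is direct arithmetic on the slide's assumptions. Reading the 9 as 3 TFLOP/s times (say) 3 operands per flop, each of 64 bits (my interpretation of the slide's 9 x 64):

    $3\,\mathrm{TFLOP/s} \times 3\,\mathrm{operand/FLOP} \times 64\,\mathrm{bit/operand} = 576\,\mathrm{Tb/s}$
    $576 \times 10^{12}\,\mathrm{bit/s} \times 0.1\,\mathrm{pJ/bit} = 57.6\,\mathrm{W}$

At a chip power budget of order 100W, operand movement alone would consume over half of it, which is the slide's point.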
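The false-sharing bullet on the coherency slide above can be demonstrated with a small C program (mine, not the lecture's): two threads update different bytes that happen to share a cache line, so the line ping-pongs between the two L1$s even though no byte is actually shared.

    /* Illustrative false-sharing demo; build with: cc -pthread demo.c
     * Two threads write *different* bytes of the same cache line; the
     * coherency protocol still bounces the line between the L1$s. */
    #include <pthread.h>
    #include <stdio.h>

    #define REPS 10000000

    /* b[0] and b[1] share a cache line (lines are typically 64 bytes). */
    static volatile char b[2];

    static void *writer(void *arg)
    {
        int idx = *(int *)arg;
        for (long i = 0; i < REPS; i++)
            b[idx]++;         /* each write invalidates the other core's copy */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        int i0 = 0, i1 = 1;
        pthread_create(&t0, NULL, writer, &i0);
        pthread_create(&t1, NULL, writer, &i1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("done: b = {%d, %d}\n", b[0], b[1]);
        return 0;
    }

Padding each counter out to its own cache line eliminates the traffic; under a snooping protocol every one of these needless invalidations is an O(p) broadcast.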
Shared Memory Programming Paradigm Challenged

● ref: [3] Kumar et al, The Case for Message Passing on Many-Core Chips
● traditionally: the paradigm to use when hardware shared memory is available
  ■ faster and easier!
● challenged in the late 90's: message passing code ran faster for large p on large-scale SGI NUMA shared-memory machines (Blackford & Whaley, 98)
  ■ message passing: separate processes run on each CPU, communicating via some message passing library
    ◆ note: the default paradigm for the Xe cluster, 2 x 4 cores per node
  ■ reason: better separation of the hardware-shared memory into different processes
  ■ confirmed by numerous similar studies since, especially where fast on-board/on-chip communication channels are available
● is it easier? (note: message passing can also be used for synchronization)
  ■ SM is a "difficult programming model" [3] (relaxed-consistency memory models)
  ■ data races, non-determinism; more effort to optimize in the end [5, p6]
  ■ no safety / composability / modularity

Message Passing Programming: the Essential Ideas

● ringsend.c: pass tokens around a ring

    #include "mpi.h"
    int main(int argc, char *argv[]) {
        int id, p, reps = argc > 1 ? atoi(argv[1]) : 1;
        ...                           // MPI calls to initialize id and p
        printf("%d: I am process %d of %d\n", id, id, p);
        int token = id, leftId = (p + id - 1) % p, rightId = (id + 1) % p;
        for (int i = 0; i < reps; i++) {
            MPI_Send(&token, sizeof(int), ..., rightId, ...);
            MPI_Recv(&token, sizeof(int), ..., leftId, ...);
        } // for (...)
        printf("%d: after %d reps, have token from process %d\n",
               id, reps, token);
    }

  (a runnable completion appears at the end of these notes)
● c.f. Posix sockets (send(), recv())
● compile and run:
    xe:/short/c07/mpi$ mpicc -o ringsend ringsend.c
    xe:/short/c07/mpi$ mpirun -np 5 ringsend 2
● message passing mechanism (figure courtesy LLNL)
● ref: LLNL MPI Tutorial

The Intel Single Chip Cloud Computer: Motivations

Refs:
  [4] Tim Mattson et al, The 48-core SCC Processor: the Programmer's View, SC'10
  [5] Tim Mattson, The Future of Many Core Computing
  (the following notes are a guide/commentary/supplement to [5])

● the supercomputer of 20 years ago is now on a chip [5, p19]
● motivations: a prototype for future many-core computing research
  ■ a high-performance, power-efficient on-chip network
  ■ fine-grained power management (V & f)

SCC Architecture

● chip layout [5, p24]; tile layout [5, p25]
● overall organization [5, p27]
● router architecture [5, p28] (VC = virtual channel)
  ■ uses wormhole routing (Dally 86)
● what the SCC looks like: test board [5, p30]; system overview [5, p31]
  ■ the 4 memory controllers (and memory banks; c.f. T2, Fermi) are part of the network
● memory hierarchy: the programmer's view [5, p35]
  ■ globally accessible per-core test-and-set registers provide time- and power-efficient synchronization (a generic sketch of the primitive appears at the end of these notes)
● power management:
  ■ voltage / frequency islands (variable V and f) [5, p47]
  ■ slightly super-linear relationship between chip voltage and power [5, p33]
  ■ power breakdown: in the cores and routers [5, p34]; note: cores 87W -> 5W

Programming Model: the Rock Creek Communication Environment

● research goals [5, p36]; high-level view [5, p38]; SPMD execution model
● RCCE software architecture for 3 platforms [5, p37]
● message passing in RCCE [5, p38-41], using the message passing buffer (MPB: fast shared memory)
  ■ shared cache memory has the MPBT flag in the page table (propagated down to the L1$ on a miss)
  ■ note the limited size of the shared memory area!
● issues from the lack of coherency, when performing [5, p39]:
  ■ a put(): must invalidate in the L1$ before the write
  ■ a get(): must ensure stale data is not in the L1$ (why is the MPB cached at all?)
  (a sketch of this discipline appears at the end of these notes)
● a special RCCE_malloc(sz) for (MP) shared memory [5, p40] (c.f.
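The ringsend.c on the slide elides the routine MPI arguments with '...'. One possible completion, with my choices of MPI_BYTE counts, tag 0, MPI_COMM_WORLD, and MPI_STATUS_IGNORE, is:

    /* ringsend.c, completed: one possible filling-in of the elided
     * MPI arguments (MPI_BYTE counts, tag 0, MPI_COMM_WORLD). */
    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int id, p, reps = argc > 1 ? atoi(argv[1]) : 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        printf("%d: I am process %d of %d\n", id, id, p);

        int token = id, leftId = (p + id - 1) % p, rightId = (id + 1) % p;
        for (int i = 0; i < reps; i++) {
            /* NB: a blocking send-then-receive ring deadlocks for messages
             * too large for eager delivery; a single int is safe in practice. */
            MPI_Send(&token, sizeof(int), MPI_BYTE, rightId, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, sizeof(int), MPI_BYTE, leftId, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        printf("%d: after %d reps, have token from process %d\n",
               id, reps, token);
        MPI_Finalize();
        return 0;
    }

With mpirun -np 5 ringsend 2 the token moves one process to the right per iteration, so process i finishes holding the token that started at process (i + 3) mod 5 (process 0 reports the token from process 3).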
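The per-core test-and-set registers mentioned on the SCC Architecture slide expose the classic primitive below. This generic C11 sketch uses atomic_flag to stand in for the SCC's memory-mapped register; it is an illustration of the primitive, not SCC code.

    /* Generic test-and-set spinlock, illustrating the primitive that
     * the SCC exposes as a per-core register. C11 atomics stand in
     * for the hardware register here. */
    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void lock_acquire(void)
    {
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin: each retry is one read of the T&S location */
    }

    static void lock_release(void)
    {
        atomic_flag_clear(&lock);
    }

On the SCC the flag lives in a register on a tile rather than in cacheable memory, so synchronization avoids off-chip memory accesses and coherency broadcasts (the 'time- and power-efficient' claim above).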
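The put()/get() coherency discipline on the RCCE slide can be sketched as follows. This is a hypothetical illustration, not RCCE source: mpb_put, mpb_get, and mpb_invalidate_l1 are invented names; on the real chip the invalidation is an added instruction (CL1INVMB) that invalidates all MPBT-tagged L1$ lines [4].

    /* Hypothetical sketch of the RCCE-style put/get discipline on the
     * SCC's non-coherent message passing buffer (MPB). All names are
     * invented; the real RCCE library differs in detail. */
    #include <string.h>

    typedef volatile unsigned char *mpb_ptr;  /* MPBT-tagged shared memory */

    /* Invalidate all MPBT-tagged lines in this core's L1$ (modeled as a
     * stub here; CL1INVMB on the real SCC). */
    static void mpb_invalidate_l1(void) { /* hardware-specific */ }

    /* put: copy from private memory into the MPB. Invalidate the L1$
     * first, or the write could merge with a stale cached line. */
    static void mpb_put(mpb_ptr dst, const void *src, size_t n)
    {
        mpb_invalidate_l1();
        memcpy((void *)dst, src, n);
    }

    /* get: copy from the MPB into private memory. Invalidate first so
     * the load cannot be satisfied by stale MPB data cached earlier. */
    static void mpb_get(void *dst, mpb_ptr src, size_t n)
    {
        mpb_invalidate_l1();
        memcpy(dst, (const void *)src, n);
    }

The MPB is cached at all so that transfers run at L1$ speed; the price is exactly this software-managed invalidation, since the hardware never keeps MPBT lines coherent.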
