Bus-Based On-Chip Networks: Current and Next Generations


Overview

● on-chip communication networks
  ■ systems-on-chip
  ■ bus-based networks: arbitration, transfer modes (cache coherency)
  ■ mesh-based networks: hierarchies, message organization & routing
  ■ issues: signal propagation delay, energy per bit transferred, reliability
● cache coherency reconsidered, and a case for the message passing paradigm
● intermission
● the Intel Single-Chip Cloud Computer
  ■ motivations: a high-performance, power-efficient network
  ■ architecture: memory hierarchy, power management
  ■ programming model: the Rock Creek Communication Environment
    ◆ motivations; overview; transferring data between cores
    ◆ programming examples; performance
● refs: [1] Pasricha & Dutt, On-Chip Communication Architectures, Morgan Kaufmann, 2010; [2] Borkar & Chien, The Future of Microprocessors, CACM, May 2011

Systems on a Chip

● systems that formerly required multiple chips can now be integrated into a single chip
● usually for special-purpose systems
● high speed per price and per watt
● often have hierarchical networks
● (figure courtesy [1])

On-chip Networks: Bus-based

● the traditional model; buses have address and data pathways
● there can be several 'masters' (devices that operate the bus, e.g. CPU, DMA engine)
  ■ in a multicore context, there may be many! (a scalability issue)
  ■ hence arbitration is a complex issue (and takes time!) (see the arbiter sketch below)
● techniques for improving bus utilization:
  ■ burst transfer mode: multiple requests (in a regular pattern) once granted access
  ■ pipelining: place the next address on the bus as the data from the previous request is placed on it
● broadcast: one master sends to all others, e.g. cache coherency - snoop, invalidation
● (figure courtesy [1])

On-Chip Networks: Current and Next Generations

● buses: only one device can access (make a 'transaction') at a time
● crossbars: devices split into 2 groups of size p
  ■ can have p transactions at once, provided at most 1 per device
  ■ e.g. T2: p = 9 cores and L2$ banks (OS Slide Cast Ch 5, p86); Fermi GPU: p = 14?
  ■ for largish p, need to be internally organized as a series of switches
● may also be organized as a ring (e.g. Cell Broadband Engine)
● for larger p, may need a more scalable topology such as a 2-D mesh
● Figure 12 (courtesy [1, 2]): hybrid switching for a network-on-a-chip; a bus connecting a cluster, a second-level bus connecting clusters (a hierarchy of busses), and a second-level router-based network (a hierarchy of networks)
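To make the arbitration point above concrete, here is a minimal round-robin bus arbiter sketch in C. It is illustrative only: arbitrate, request_mask and last_grant are names invented for this sketch, not part of any real bus IP; real arbiters implement the same priority rotation in parallel hardware, often with extra policies (e.g. higher priority for latency-critical masters).

  #include <stdint.h>

  /* Minimal round-robin arbiter sketch: bit m of request_mask is set when
   * master m requests the bus; the search starts just after the master
   * granted last time, so no requester is starved. */
  int arbitrate(uint32_t request_mask, int num_masters, int last_grant) {
      for (int i = 1; i <= num_masters; i++) {
          int m = (last_grant + i) % num_masters;
          if (request_mask & (1u << m))
              return m;             /* grant the bus to master m */
      }
      return -1;                    /* no master is requesting this cycle */
  }

Even this trivial policy examines all p masters per grant; the slide's point is that with many masters the arbitration logic, and the time it takes, grows with p.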
On-chip Network Mechanisms: Protocols and Routing

● the unit of communication is a message, of limited length
  ■ e.g. a T2 L1$ cache line invalidation (writeback) has a 'payload' of 8 (8+16) bytes
● generally, routers break messages into fixed-size packets
● each packet is routed using circuit-based switching (a connection is established between source and destination, which enables 'pipelined' transmission)
● the unit of transmission is a flit, with a phit being transferred each cycle
  ■ routing: the destination's id in the header flit is used to determine the path

On-chip Network Issues: Signal Delay and Energy/Bit Transfer

● the energy cost of accessing data off-chip (DRAM) is 100× that of accessing registers
● the delay of a cross-chip interconnect grows super-linearly with its length
● the ratio of this delay to the transistor switching time scales exponentially with shrinking feature size (≈ 60× @ 65nm, ≈ 130× @ 32nm) [1]
  ■ it can be reduced by using 'repeaters', but is still ≈ 6× @ 65nm, ≈ 12× @ 32nm [1]
  ■ this however increases energy! (n.b. mesh networks are electrically equivalent)
● soon, interconnects are expected to consume 50× more energy than logic circuits!
● an application at a futuristic 3 TFLOPs (d.p.) will need (say) 9 × 64 = 576 Tb/s of data transfer
● if operands on average need to move 1mm to the FPU (e.g. from the L1$), then @ 0.1 pJ/bit this consumes 57W! (see the back-of-envelope check below)
● Figure 11 (courtesy [2]): on-die interconnect delay and energy (45nm); measured and extrapolated wire delay (ps) and wire energy (pJ/bit) vs. on-die interconnect length (mm), and on-die network energy per bit (pJ) across process generations (0.5µ to 8nm)

Cache Coherency Considered Harmful (the 'Coherency Wall')

● a core writes data at address x in its L1$; x must then be invalidated in all other L1$s
● standard protocols require a broadcast message for every cache line invalidation
  ■ maintaining a (MOESI) protocol also requires a broadcast on every miss
  ■ the energy cost of each broadcast is O(p); the overall cost is O(p²)!
  ■ broadcasts also cause contention (hence delay) in the network (worse than O(p²)?)
● directory-based protocols can direct invalidation messages to only the caches holding the same data
  ■ far more scalable (e.g. SGI Altix large-scale SMP) for lightly-shared data
  ■ worse otherwise; they also introduce overhead through indirection
  ■ also, each cached line needs a bit vector of length p: an O(p²) storage cost
● false sharing in any case results in wasted traffic (figure: P0 writes byte b0 of a cache line whose bytes b0–b7 are used by processes P0–P7)
● hey, what about GPUs?
● atomic instructions, which synchronize the memory system down to the L2$, have a (large) O(p) energy cost each (hence O(p²) in total!)
● the cache line size is sub-optimal for messages on on-chip networks

Shared Memory Programming Paradigm Challenged

● ref: [3] Kumar et al, The Case For Message Passing On Many-Core Chips
● traditionally the paradigm to use when hardware shared memory is available
  ■ faster and easier!
● challenged in the late 90's: message passing code ran faster for large p on large-scale SGI NUMA shared-memory machines (Blackford & Whaley, 98)
  ■ message passing: separate processes run on each CPU, communicating via some message passing library
    ◆ note: the default paradigm for the Xe cluster, 2 × 4 cores per node
  ■ reason: better separation of the hardware-shared memory between different processes
  ■ confirmed by numerous similar studies since, especially where fast on-board/on-chip communication channels are available
● is it easier? (note: message passing can also be used for synchronization)
  ■ shared memory is a "difficult programming model" [3] (relaxed consistency memory models)
  ■ data races, non-determinism; more effort to optimize in the end [5, p6]
  ■ no safety / composability / modularity
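The bandwidth and power figures on the signal-delay slide follow from simple arithmetic. This short C program is a sketch reproducing the 576 Tb/s and ≈57W values from the numbers stated above; the "3 words moved per FLOP" factor (roughly 2 operands in, 1 result out, giving the slide's 9 Twords/s) is an assumption, not stated in the source.

  #include <stdio.h>

  int main(void) {
      double flops         = 3e12;  /* 3 TFLOPs, double precision (from the slide) */
      double words_per_op  = 3.0;   /* assumption: ~2 operands in + 1 result out   */
      double bits_per_word = 64.0;
      double pj_per_bit    = 0.1;   /* @ ~1mm average operand movement (slide)     */

      double bits_per_s = flops * words_per_op * bits_per_word;
      double watts      = bits_per_s * pj_per_bit * 1e-12;

      printf("data transfer: %.0f Tb/s\n", bits_per_s / 1e12);  /* 576 Tb/s */
      printf("interconnect power: %.1f W\n", watts);            /* 57.6 W   */
      return 0;
  }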
Message Passing Programming: the Essential Ideas

● ringsend.c: pass tokens around a ring

  #include <stdio.h>
  #include <stdlib.h>
  #include "mpi.h"

  int main(int argc, char *argv[]) {
      int id, p, reps = argc > 1 ? atoi(argv[1]) : 1;
      MPI_Init(&argc, &argv);               // MPI calls to initialize id and p
      MPI_Comm_rank(MPI_COMM_WORLD, &id);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      printf("%d: I am process %d of %d\n", id, id, p);
      int token = id, leftId = (id + p - 1) % p, rightId = (id + 1) % p;
      for (int i = 0; i < reps; i++) {
          // tag and communicator were elided on the slide; standard values used here
          MPI_Send(&token, sizeof(int), MPI_BYTE, rightId, 0, MPI_COMM_WORLD);
          MPI_Recv(&token, sizeof(int), MPI_BYTE, leftId, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
      } // for (...)
      printf("%d: after %d reps, have token from process %d\n", id, reps, token);
      MPI_Finalize();
      return 0;
  }

● c.f. Posix sockets (send(), recv())
● compile and run:

  xe:/short/c07/mpi$ mpicc -o ringsend ringsend.c
  xe:/short/c07/mpi$ mpirun -np 5 ringsend 2

● message passing mechanism (figure courtesy LLNL)
● ref: LLNL MPI Tutorial

The Intel Single Chip Cloud Computer: Motivations

Refs:
[4] Tim Mattson et al, The 48-core SCC processor: the programmer's view, SC'10
[5] Tim Mattson, The Future of Many Core Computing
(the following notes are a guide/commentary/supplement to [5])

● the supercomputer of 20 years ago is now on a chip [5, p19]
● motivations: a prototype for future many-core computing research
  ■ a high-performance, power-efficient on-chip network
  ■ fine-grained power management (V & f)

SCC Architecture

● chip layout [5, p24]; tile layout [5, p25]
● overall organization [5, p27]
● router architecture [5, p28] (VC = virtual channel)
  ■ uses wormhole routing (Dally 86)
● what the SCC looks like: test board [5, p30]; system overview [5, p31]
  ■ the 4 memory controllers (and memory banks; c.f. T2, Fermi) are part of the network
● memory hierarchy: the programmer's view [5, p35]
  ■ globally accessible per-core test-and-set registers provide time- and power-efficient synchronization
● power management:
  ■ voltage / frequency islands (variable V and f) [5, p47]
  ■ a slightly super-linear relationship between chip voltage and power [5, p33]
  ■ power breakdown: in the cores and routers [5, p34]; note: cores 87W → 5W

Programming Model: the Rock Creek Communication Environment

● research goals [5, p36]; high-level view [5, p38]; SPMD execution model
● RCCE software architecture for 3 platforms [5, p37]
● message passing in RCCE [5, p38-41], using the message passing buffer (MPB: fast shared memory)
  ■ shared cache memory has the MPBT flag in the page table (propagated down to the L1$ on a miss)
  ■ note the limited size of the shared memory area!
● issues arising from the lack of coherency, when performing [5, p39] (illustrated in the sketch below):
  ■ a put(): must invalidate in the L1$ before the write
  ■ a get(): must ensure stale data is not in the L1$ (why is the MPB cached at all?)
● a special RCCE_shmalloc(sz) for (MP) shared memory [5, p40] (c.f. …)
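To make the coherency rules of [5, p39] concrete, here is an illustrative C sketch of a put/get discipline for a cached but non-coherent message passing buffer. All names here (mpb_put, mpb_get, mpb_invalidate_l1) are invented for the sketch; on the real SCC the invalidation of MPBT-tagged lines is a special instruction, and the actual operations are RCCE library calls.

  #include <string.h>

  /* Hypothetical primitive standing in for the SCC's invalidation of all
   * MPBT-tagged L1 lines (a single instruction on the real chip). */
  extern void mpb_invalidate_l1(void);

  /* put: copy from private memory into a (cached, non-coherent) MPB. */
  void mpb_put(char *dest_mpb, const char *src, size_t n) {
      mpb_invalidate_l1();  /* invalidate in the L1$ before the write, so the
                               data reaches the MPB rather than a stale line */
      memcpy(dest_mpb, src, n);
  }

  /* get: copy out of the MPB into private memory. */
  void mpb_get(char *dst, const char *src_mpb, size_t n) {
      mpb_invalidate_l1();  /* ensure stale data from an earlier transfer is
                               not read out of the L1$ */
      memcpy(dst, src_mpb, n);
  }

Because the hardware never invalidates MPBT lines on behalf of another core, both directions must invalidate explicitly; caching the MPB is still worthwhile because it lets transfers move at cache-line rather than word granularity.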