Hardware Design of Message Passing Architecture on Heterogeneous System

Total Page:16

File Type:pdf, Size:1020Kb

Hardware Design of Message Passing Architecture on Heterogeneous System HARDWARE DESIGN OF MESSAGE PASSING ARCHITECTURE ON HETEROGENEOUS SYSTEM by Shanyuan Gao A dissertation submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering Charlotte 2013 Approved by: Dr. Ronald R. Sass Dr. James M. Conrad Dr. Jiang Xie Dr. Stanislav Molchanov ii c 2013 Shanyuan Gao ALL RIGHTS RESERVED iii ABSTRACT SHANYUAN GAO. Hardware design of message passing architecture on heterogeneous system. (Under the direction of DR. RONALD R. SASS) Heterogeneous multi/many-core chips are commonly used in today's top tier supercomputers. Similar heterogeneous processing elements | or, computation ac- celerators | are commonly found in FPGA systems. Within both multi/many-core chips and FPGA systems, the on-chip network plays a critical role by connecting these processing elements together. However, The common use of the on-chip network is for point-to-point communication between on-chip components and the memory in- terface. As the system scales up with more nodes, traditional programming methods, such as MPI, cannot effectively use the on-chip network and the off-chip network, therefore could make communication the performance bottleneck. This research proposes a MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE improves the communication performance by offloading the communication workload from the general processing elements. On the other hand, the MPE provides direct interface to the heterogeneous processing ele- ments which can eliminate the data path going around the OS and libraries. Detailed experimental results have shown that the MPE can significantly reduce the com- munication time and improve the overall performance, especially for heterogeneous computing systems because of the tight coupling with the network. Additionally, a hybrid \MPI+X" computing system is tested and it shows MPE can effectively of- fload the communications and let the processing elements play their strengths on the computation. iv ACKNOWLEDGMENTS No words can express my appreciation to my advisor, Dr. Ron Sass. Your wisdom and character have persistently nurtured and guided me through all the joys and difficulties during this journey, and continually so. I am grateful to my committee, Professor James Conrad, Professor Jiang Xie, and Professor Stanislav Molchanov. Your advices have challenged me in every aspect and perfected this work. I would like to thank the RCS lab (Andy, Will, Robin, Bin, Scott, Yamuna, Rahul, Ashwin, Shweta, and countless others). Our untiring discussions have inspired me to explore new directions. Your help has contributed to many parts of this work. I would like to thank my mother and my father for their full-hearted support. Rong, thank you for your love. Without you, nothing of this matters. v TABLE OF CONTENTS LIST OF TABLES viii LIST OF FIGURES ix LIST OF ABBREVIATIONS xi CHAPTER 1: INTRODUCTION1 1.1 High-Performance Computing1 1.2 Interconnect and Communication2 1.3 Motivation4 1.4 Thesis Question7 CHAPTER 2: BACKGROUND 11 2.1 Field Programmable Gate Array 11 2.2 Computational Science 12 2.3 Top500 Supercomputers 12 2.4 Message-Passing 13 2.5 Benchmarks 15 2.6 Amdahl's Law 17 2.7 Communication Model 17 CHAPTER 3: RELATED WORK 19 3.1 MPI Related Research 19 3.1.1 Point-to-point Communication 19 3.1.2 Collective Communication 19 3.1.3 Hardware Optimization 21 3.2 On-chip Message-Passing 23 3.2.1 Raw 23 3.2.2 Intel Terascale Computing 24 3.2.3 RAMP 24 vi 3.2.4 Reconfigurable Computing Cluster 25 CHAPTER 4: DESIGN 26 4.1 Design Infrastructure 26 4.1.1 Off-chip Network 26 4.1.2 On-chip Network 27 4.1.3 Network Interface 28 4.1.4 Base System 28 4.1.5 Miscellaneous IP Cores 30 4.2 Stage 1: Hardware Message-Passing Engine 30 4.2.1 Point-to-point Communication 30 4.2.2 Collective Communication 31 4.3 Stage 2: Heterogeneous System 35 4.3.1 Parallel FFT Operation 38 4.3.2 Parallel Matrix-Vector Multiplication 41 CHAPTER 5: EVALUATION AND ANALYSIS 45 5.1 Evaluation Infrastructure 45 5.2 Testing Methodology 45 5.3 Stage 1 Experiment 46 5.3.1 Barrier Performance Result 46 5.3.2 Broadcast Performance Result 48 5.3.3 Reduce Performance Result 52 5.3.4 Allreduce Performance Result 54 5.3.5 Summary 55 5.4 Communication Model 57 5.4.1 Linear Fitting for Barrier 57 5.4.2 Latency Model 58 5.5 Stage 2 Experiment 63 vii 5.5.1 Parallel Fast Fourier Transformation 65 5.5.2 Parallel Matrix-Vector Multiplication 68 5.5.3 Hybrid Computing System 71 5.6 Validation 72 CHAPTER 6: CONCLUSION 75 REFERENCES 77 viii LIST OF TABLES TABLE 1.1: Top500 HPC trend3 TABLE 5.1: Broadcast measurement vs. simulation 59 TABLE 5.2: Reduce measurement vs. simulation 63 TABLE 5.3: Bandwidth measurement vs. simulation 64 TABLE 5.4: Resource utilization of hardware MPE 73 TABLE 5.5: Performance improvement of hardware MPE 73 TABLE 5.6: Performance improvement of MPE on heterogeneous systems 74 ix LIST OF FIGURES FIGURE 1.1: Configuration of heterogeneous HPC system4 FIGURE 1.2: Communication of future heterogeneous HPC system5 FIGURE 1.3: Proportion of collective operation in synthetic benchmark6 FIGURE 1.4: Traditional operation flow8 FIGURE 1.5: Operation flow of hardware message-passing8 FIGURE 2.1: Diagram of barrier operation 14 FIGURE 2.2: Diagram of broadcast operation 15 FIGURE 2.3: Diagram of reduce operation 15 FIGURE 2.4: Amdahl's Law, performance gain of parallelism 18 FIGURE 4.1: Direct connected off-chip network with the router 27 FIGURE 4.2: 4-ary 3-cube torus network 27 FIGURE 4.3: On-chip components connected around the crossbar switch 28 FIGURE 4.4: Signal interface of LocalLink 28 FIGURE 4.5: Hardware base system with the on-chip router 29 FIGURE 4.6: FSM of barrier operation 32 FIGURE 4.7: FSM of broadcast operation 33 FIGURE 4.8: FSM of reduce operation 34 FIGURE 4.9: Topologies of hardware collective communication 34 FIGURE 4.10: Block diagram of MPE in heterogeneous system 36 FIGURE 4.11: Typical programming method for parallel heterogeneous system 37 FIGURE 4.12: FFT Decimation-In-Time 39 FIGURE 4.13: FFT Decimation-In-Frequency 39 FIGURE 4.14: Block diagram of FFT Core 41 FIGURE 4.15: Block diagram of FFT IO 42 FIGURE 4.16: Block diagram of vector-vector multiplication 43 x FIGURE 4.17: Hybrid of hardware thread and software thread 44 FIGURE 5.1: MPE barrier using different topologies 47 FIGURE 5.2: Software barrier 47 FIGURE 5.3: Bandwidth of different broadcast topologies 49 FIGURE 5.4: Bandwidth of software broadcast 49 FIGURE 5.5: Latency of different broadcast topologies 51 FIGURE 5.6: Latency of software broadcast 51 FIGURE 5.7: Bandwidth of different reduce topologies 53 FIGURE 5.8: Bandwidth of software reduce 53 FIGURE 5.9: Latency of different reduce topologies 54 FIGURE 5.10: Latency of software reduce 55 FIGURE 5.11: Execution time of different allreduce topologies 56 FIGURE 5.12: Latency of software allreduce 56 FIGURE 5.13: Mathematic fitting for MPE barrier 58 FIGURE 5.14: Time chart of broadcast operation 59 FIGURE 5.15: Latency simulation of broadcast 60 FIGURE 5.16: Time chart of reduce operation 61 FIGURE 5.17: Latency simulation of reduce 62 FIGURE 5.18: Max bandwidth simulation of broadcast and reduce 64 FIGURE 5.19: Communication impact on software FFT 65 FIGURE 5.20: Comparison of communication with software FFT 66 FIGURE 5.21: Communication impact on hardware FFT 67 FIGURE 5.22: Comparison of communication with hardware FFT 67 FIGURE 5.23: Communication impact on software MACC computation 68 FIGURE 5.24: Communication impact on accelerated MACC computation 69 FIGURE 5.25: Hardware MACC and MPE 70 FIGURE 5.26: Communication impact on hybrid computing system 72 xi LIST OF ABBREVIATIONS CPU Central Processing Unit COTS Commodity Off The Shelf IC Integrated Circuit PCB Printed Circuit Board IP Intellectual Property FPU Floating Point Unit FPGA Field Programmable Gate Array Flops Floating point operations per second FIFO First In, First Out PetaFlops 1015 Flops TeraFlops 1012 Flops GigaFlops 109 Flops MPI Message-Passing Interface PE Processing Element SOC System on Chip PLB Processor Local Bus MPMC Multiport Memory Controllor PC Personal Computer UART Universal Asynchronous Receiver/Transmitter P2P Point-to-point CHAPTER 1: INTRODUCTION Frequency scaling has played a major role in pushing the computer industry for- ward. Nevertheless, due to the memory wall, the instruction-level parallelism (ILP) wall, and the power wall [1,2], conventional frequency scaling has shown diminish- ing returns in performance in past few years. As a result, academia and industry research has shifted the focus towards the multi/many-core era. New single-chip ar- chitectures have been designed and manufactured to explore and exploit parallelism rather than single-thread performance. To make use of the massive amount of cores | or, processing elements (PEs) | the on-chip and off-chip interconnect becomes the critical component of these new architectures. Presently, these multi/many-core pro- cessors follow traditional multi-chip symmetric multiprocessor (SMP) designs, which are integrated onto a single chip. Consequently, interconnect designs mainly provide point-to-point communication and are mostly used for the general shared-memory processor model. This is sufficient for desktop personal computers; however, in large scale systems with many heterogeneous PEs, this could lead to seriously inefficient communication and make the general-purpose processor the bottleneck of the systems.
Recommended publications
  • Efficient Inter-Core Communications on Manycore Machines
    ZIMP: Efficient Inter-core Communications on Manycore Machines Pierre-Louis Aublin Sonia Ben Mokhtar Gilles Muller Vivien Quema´ Grenoble University CNRS - LIRIS INRIA CNRS - LIG Abstract—Modern computers have an increasing num- nication mechanisms enabling both one-to-one and one- ber of cores and, as exemplified by the recent Barrelfish to-many communications. Current operating systems operating system, the software they execute increasingly provide various inter-core communication mechanisms. resembles distributed, message-passing systems. To sup- port this evolution, there is a need for very efficient However it is not clear yet how these mechanisms be- inter-core communication mechanisms. Current operat- have on manycore processors, especially when messages ing systems provide various inter-core communication are intended to many recipient cores. mechanisms, but it is not clear yet how they behave In this report, we study seven communication mech- on manycore processors. In this report, we study seven anisms, which are considered state-of-the-art. More mechanisms, that are considered state-of-the-art. We show that these mechanisms have two main drawbacks that precisely, we study TCP, UDP and Unix domain sock- limit their efficiency: they perform costly memory copy ets, pipes, IPC and POSIX message queues. These operations and they do not provide efficient support mechanisms are supported by traditional operating sys- for one-to-many communications. We do thus propose tems such as Linux. We also study Barrelfish message ZIMP, a new inter-core communication mechanism that passing, the inter-process communication mechanism of implements zero-copy inter-core message communications and that efficiently handles one-to-many communications.
    [Show full text]
  • D-Bus, the Message Bus System Training Material
    Maemo Diablo D-Bus, The Message Bus System Training Material February 9, 2009 Contents 1 D-Bus, The Message Bus System 2 1.1 Introduction to D-Bus ......................... 2 1.2 D-Bus architecture and terminology ................ 3 1.3 Addressing and names in D-Bus .................. 4 1.4 Role of D-Bus in maemo ....................... 6 1.5 Programming directly with libdbus ................. 9 1 Chapter 1 D-Bus, The Message Bus System 1.1 Introduction to D-Bus D-Bus (the D originally stood for "Desktop") is a relatively new inter process communication (IPC) mechanism designed to be used as a unified middleware layer in free desktop environments. Some example projects where D-Bus is used are GNOME and Hildon. Compared to other middleware layers for IPC, D-Bus lacks many of the more refined (and complicated) features and for that reason, is faster and simpler. D-Bus does not directly compete with low level IPC mechanisms like sock- ets, shared memory or message queues. Each of these mechanisms have their uses, which normally do not overlap the ones in D-Bus. Instead, D-Bus aims to provide higher level functionality, like: Structured name spaces • Architecture independent data formatting • Support for the most common data elements in messages • A generic remote call interface with support for exceptions (errors) • A generic signalling interface to support "broadcast" type communication • Clear separation of per-user and system-wide scopes, which is important • when dealing with multi-user systems Not bound to any specific programming language (while providing a • design that readily maps to most higher level languages, via language specific bindings) The design of D-Bus benefits from the long experience of using other mid- dleware IPC solutions in the desktop arena and this has allowed the design to be optimised.
    [Show full text]
  • Message Passing Model
    Message Passing Model Giuseppe Anastasi [email protected] Pervasive Computing & Networking Lab. (PerLab) Dept. of Information Engineering, University of Pisa PerLab Based on original slides by Silberschatz, Galvin and Gagne Overview PerLab Message Passing Model Addressing Synchronization Example of IPC systems Message Passing Model 2 Operating Systems Objectives PerLab To introduce an alternative solution (to shared memory) for process cooperation To show pros and cons of message passing vs. shared memory To show some examples of message-based communication systems Message Passing Model 3 Operating Systems Inter-Process Communication (IPC) PerLab Message system – processes communicate with each other without resorting to shared variables. IPC facility provides two operations: send (message ) – fixed or variable message size receive (message ) If P and Q wish to communicate, they need to: establish a communication link between them exchange messages via send/receive The communication link is provided by the OS Message Passing Model 4 Operating Systems Implementation Issues PerLab Physical implementation Single-processor system Shared memory Multi-processor systems Hardware bus Distributed systems Networking System + Communication networks Message Passing Model 5 Operating Systems Implementation Issues PerLab Logical properties Can a link be associated with more than two processes? How many links can there be between every pair of communicating processes? What is the capacity of a link? Is the size of a message that the link can accommodate fixed or variable? Is a link unidirectional or bi-directional? Message Passing Model 6 Operating Systems Implementation Issues PerLab Other Aspects Addressing Synchronization Buffering Message Passing Model 7 Operating Systems Overview PerLab Message Passing Model Addressing Synchronization Example of IPC systems Message Passing Model 8 Operating Systems Direct Addressing PerLab Processes must name each other explicitly.
    [Show full text]
  • Message Passing
    COS 318: Operating Systems Message Passing Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Sending A Message Within A Computer Across A Network P1 P2 Send() Recv() Send() Recv() Network OS Kernel OS OS COS461 2 Synchronous Message Passing (Within A System) Synchronous send: u Call send system call with M u send system call: l No buffer in kernel: block send( M ) recv( M ) l Copy M to kernel buffer Synchronous recv: u Call recv system call M u recv system call: l No M in kernel: block l Copy to user buffer How to manage kernel buffer? 3 API Issues u Message S l Buffer (addr) and size l Message type, buffer and size u Destination or source send(dest, msg) R l Direct address: node Id, process Id l Indirect address: mailbox, socket, recv(src, msg) channel, … 4 Direct Addressing Example Producer(){ Consumer(){ ... ... while (1) { for (i=0; i<N; i++) produce item; send(Producer, credit); recv(Consumer, &credit); while (1) { send(Consumer, item); recv(Producer, &item); } send(Producer, credit); } consume item; } } u Does this work? u Would it work with multiple producers and 1 consumer? u Would it work with 1 producer and multiple consumers? u What about multiple producers and multiple consumers? 5 Indirect Addressing Example Producer(){ Consumer(){ ... ... while (1) { for (i=0; i<N; i++) produce item; send(prodMbox, credit); recv(prodMbox, &credit); while (1) { send(consMbox, item); recv(consMbox, &item); } send(prodMbox, credit); } consume item; } } u Would it work with multiple producers and 1 consumer? u Would it work with 1 producer and multiple consumers? u What about multiple producers and multiple consumers? 6 Indirect Communication u Names l mailbox, socket, channel, … u Properties l Some allow one-to-one mbox (e.g.
    [Show full text]
  • Network on Chip for FPGA Development of a Test System for Network on Chip
    Network on Chip for FPGA Development of a test system for Network on Chip Magnus Krokum Namork Master of Science in Electronics Submission date: June 2011 Supervisor: Kjetil Svarstad, IET Norwegian University of Science and Technology Department of Electronics and Telecommunications Problem Description This assignment is a continuation of the project-assignment of fall 2010, where it was looked into the development of reactive modules for application-test and profiling of the Network on Chip realization. It will especially be focused on the further development of: • The programmability of the system by developing functionality for more advanced surveillance of the communication between modules and routers • Framework that will be used to test and profile entire applications on the Network on Chip The work will primarily be directed towards testing and running the system in a way that resembles a real system at run time. The work is to be compared with relevant research within similar work. Assignment given: January 2011 Supervisor: Kjetil Svarstad, IET i Abstract Testing and verification of digital systems is an essential part of product develop- ment. The Network on Chip (NoC), as a new paradigm within interconnections; has a specific need for testing. This is to determine how performance and prop- erties of the NoC are compared to the requirements of different systems such as processors or media applications. A NoC has been developed within the AHEAD project to form a basis for a reconfigurable platform used in the AHEAD system. This report gives an outline of the project to develop testing and benchmarking systems for a NoC.
    [Show full text]
  • Multiprocess Communication and Control Software for Humanoid Robots Neil T
    IEEE Robotics and Automation Magazine Multiprocess Communication and Control Software for Humanoid Robots Neil T. Dantam∗ Daniel M. Lofaroy Ayonga Hereidx Paul Y. Ohz Aaron D. Amesx Mike Stilman∗ I. Introduction orrect real-time software is vital for robots in safety-critical roles such as service and disaster response. These systems depend on software for Clocomotion, navigation, manipulation, and even seemingly innocuous tasks such as safely regulating battery voltage. A multi-process software design increases robustness by isolating errors to a single process, allowing the rest of the system to continue operating. This approach also assists with modularity and concurrency. For real-time tasks such as dynamic balance and force control of manipulators, it is critical to communicate the latest data sample with minimum latency. There are many communication approaches intended for both general purpose and real-time needs [19], [17], [13], [9], [15]. Typical methods focus on reliable communication or network-transparency and accept a trade-off of increased mes- sage latency or the potential to discard newer data. By focusing instead on the specific case of real-time communication on a single host, we reduce communication latency and guarantee access to the latest sample. We present a new Interprocess Communication (IPC) library, Ach,1 which addresses this need, and discuss its application for real-time, multiprocess control on three humanoid robots (Fig. 1). There are several design decisions that influenced this robot software and motivated development of the Ach library. First, to utilize decades of prior development and engineering, we implement our real-time system on top of a POSIX-like Operating System (OS)2.
    [Show full text]
  • Embedded Networks on Chip for Field-Programmable Gate Arrays by Mohamed Saied Abdelfattah a Thesis Submitted in Conformity With
    Embedded Networks on Chip for Field-Programmable Gate Arrays by Mohamed Saied Abdelfattah A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto © Copyright 2016 by Mohamed Saied Abdelfattah Abstract Embedded Networks on Chip for Field-Programmable Gate Arrays Mohamed Saied Abdelfattah Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 2016 Modern field-programmable gate arrays (FPGAs) have a large capacity and a myriad of embedded blocks for computation, memory and I/O interfacing. This allows the implementation of ever-larger applications; however, the increase in application size comes with an inevitable increase in complexity, making it a challenge to implement on-chip communication. Today, it is a designer's burden to create a customized communication circuit to interconnect an application, using the fine-grained FPGA fab- ric that has single-bit control over every wire segment and logic cell. Instead, we propose embedding a network-on-chip (NoC) to implement system-level communication on FPGAs. A prefabricated NoC improves communication efficiency, eases timing closure, and abstracts system-level communication on FPGAs, separating an application's behaviour and communication which makes the design of complex FPGA applications easier and faster. This thesis presents a complete embedded NoC solution, includ- ing the NoC architecture and interface, rules to guide its use with FPGA design styles, application case studies to showcase its advantages, and a computer-aided design (CAD) system to automatically interconnect applications using an embedded NoC. We compare NoC components when implemented hard versus soft, then build on this component-level analysis to architect embedded NoCs and integrate them into the FPGA fabric; these NoCs are on average 20{23× smaller and 5{6× faster than soft NoCs.
    [Show full text]
  • An Early Performance Evaluation of Many Integrated Core Architecture Based SGI Rackable Computing System
    NAS Technical Report: NAS-2015-04 An Early Performance Evaluation of Many Integrated Core Architecture Based SGI Rackable Computing System Subhash Saini,1 Haoqiang Jin,1 Dennis Jespersen,1 Huiyu Feng,2 Jahed Djomehri,3 William Arasin,3 Robert Hood,3 Piyush Mehrotra,1 Rupak Biswas1 1NASA Ames Research Center 2SGI 3Computer Sciences Corporation Moffett Field, CA 94035-1000, USA Fremont, CA 94538, USA Moffett Field, CA 94035-1000, USA {subhash.saini, haoqiang.jin, [email protected] {jahed.djomehri, william.f.arasin, dennis.jespersen, piyush.mehrotra, robert.hood}@nasa.gov rupak.biswas}@nasa.gov ABSTRACT The GPGPU approach relies on streaming multiprocessors and Intel recently introduced the Xeon Phi coprocessor based on the uses a low-level programming model such as CUDA or a high- Many Integrated Core architecture featuring 60 cores with a peak level programming model like OpenACC to attain high performance of 1.0 Tflop/s. NASA has deployed a 128-node SGI performance [1-3]. Intel’s approach has a Phi serving as a Rackable system where each node has two Intel Xeon E2670 8- coprocessor to a traditional Intel processor host. The Phi has x86- core Sandy Bridge processors along with two Xeon Phi 5110P compatible cores with wide vector processing units and uses coprocessors. We have conducted an early performance standard parallel programming models such as MPI, OpenMP, evaluation of the Xeon Phi. We used microbenchmarks to hybrid (MPI + OpenMP), UPC, etc. [4-5]. measure the latency and bandwidth of memory and interconnect, Understanding performance of these heterogeneous computing I/O rates, and the performance of OpenMP directives and MPI systems has become very important as they have begun to appear functions.
    [Show full text]
  • Message Passing
    Message passing CS252 • Sending of messages under control of programmer – User-level/system level? Graduate Computer Architecture – Bulk transfers? Lecture 24 • How efficient is it to send and receive messages? – Speed of memory bus? First-level cache? Network Interface Design • Communication Model: Memory Consistency Models – Synchronous » Send completes after matching recv and source data sent » Receive completes after data transfer complete from matching send – Asynchronous Prof John D. Kubiatowicz » Send completes after send buffer may be reused http://www.cs.berkeley.edu/~kubitron/cs252 4/27/2009 cs252-S09, Lecture 24 2 Synchronous Message Passing Asynch. Message Passing: Optimistic Source Destination Source Destination (1) Initiate send Recv Psrc, local VA, len (1) Initiate send (2) Address translation on Psrc Send Pdest, local VA, len (2) Address translation (3) Local/remote check Send (P , local VA, len) (3) Local/remote check dest (4) Send-ready request Send-rdy req (4) Send data (5) Remote check for (5) Remote check for posted receive; on fail, Tag match allocate data buffer posted receive Wait Tag check Data-xfer req (assume success) Processor Allocate buffer Action? (6) Reply transaction Recv-rdy reply (7) Bulk data transfer Source VA Dest VA or ID Recv P , local VA, len Time src Data-xfer req Time • More powerful programming model • Constrained programming model. • Wildcard receive => non-deterministic • Deterministic! What happens when threads added? • Destination contention very limited. • Storage required within msg layer? • User/System boundary? 4/27/2009 cs252-S09, Lecture 24 3 4/27/2009 cs252-S09, Lecture 24 4 Asynch. Msg Passing: Conservative Features of Msg Passing Abstraction Destination • Source knows send data address, dest.
    [Show full text]
  • Inter-Process Communication
    QNX Inter-Process Communication QNX IPC 2008/01/08 R08 © 2006 QNX Software Systems. QNX and Momentics are registered trademarks of QNX Software Systems in certain jurisdictions NOTES: QNX, Momentics, Neutrino, Photon microGUI, and “Build a more reliable world” are registered trademarks in certain jurisdictions, and PhAB, Phindows, and Qnet are trademarks of QNX Software Systems Ltd. All other trademarks and trade names belong to their respective owners. 1 IntroductionIntroduction IPC: • Inter-Process Communication • two processes exchange: – data – control – notification of event occurrence IPC Thread Thread Process A Process B QNX IPC 2008/01/08 R08 2 © 2006, QNX Software Systems. QNX and Momentics are registered trademarks of QNX Software Systems in certain jurisdictions NOTES: 2 IntroductionIntroduction QNX Neutrino supports a wide variety of IPC: – QNX Core (API is unique to QNX or low-level) • includes: – QNX Neutrino messaging – QNX Neutrino pulses – shared memory – QNX asynchronous messaging • the focus of this course module – POSIX/Unix (well known, portable API’s ) • includes: – signals – shared memory – pipes (requires pipe process) – POSIX message queues (requires mqueue or mq process) – TCP/IP sockets (requires io-net process) • the focus of the POSIX IPC course module QNX IPC 2008/01/08 R08 3 © 2006, QNX Software Systems. QNX and Momentics are registered trademarks of QNX Software Systems in certain jurisdictions NOTES: API: Application Programming Interface 3 InterprocessInterprocess Communication Communication Topics: Message Passing Designing a Message Passing System (1) Pulses How a Client Finds a Server Server Cleanup Multi-Part Messages Designing a Message Passing System (2) Client Information Structure Issues Related to Priorities Designing a Message Passing System (3) Event Delivery Asynchronous Message Passing QNET Shared Memory Conclusion QNX IPC 2008/01/08 R08 4 © 2006, QNX Software Systems.
    [Show full text]
  • A Comparison of Four Series of Cisco Network Processors
    Advanced Computational Intelligence: An International Journal (ACII), Vol.3, No.4, October 2016 A COMPARISON OF FOUR SERIES OF CISCO NETWORK PROCESSORS Sadaf Abaei Senejani 1, Hossein Karimi 2 and Javad Rahnama 3 1Department of Computer Engineering, Pooyesh University, Qom, Iran 2 Sama Technical and Vocational Training College, Islamic Azad University, Yasouj Branch, Yasouj, Iran 3 Department of Computer Science and Engineering, Sharif University of Technology, Tehran, Iran. ABSTRACT Network processors have created new opportunities by performing more complex calculations. Routers perform the most useful and difficult processing operations. In this paper the routers of VXR 7200, ISR 4451-X, SBC 7600, 7606 have been investigated which their main positive points include scalability, flexibility, providing integrated services, high security, supporting and updating with the lowest cost, and supporting standard protocols of network. In addition, in the current study these routers have been explored from hardware and processor capacity viewpoints separately. KEYWORDS Network processors, Network on Chip, parallel computing 1. INTRODUCTION The expansion of science and different application fields which require complex and time taking calculations and in consequence the demand for fastest and more powerful computers have made the need for better structures and new methods more significant. This need can be met through solutions called “paralleled massive computers” or another solution called “clusters of computers”. It should be mentioned that both solutions depend highly on intercommunication networks with high efficiency and appropriate operational capacity. Therefore, as a solution combining numerous cores on a single chip via using Network on chip technology for the simplification of communications can be done.
    [Show full text]
  • MYTHIC MULTIPLIES in a FLASH Analog In-Memory Computing Eliminates DRAM Read/Write Cycles
    MYTHIC MULTIPLIES IN A FLASH Analog In-Memory Computing Eliminates DRAM Read/Write Cycles By Mike Demler (August 27, 2018) ................................................................................................................... Machine learning is a hot field that attracts high-tech The startup plans by the end of this year to tape out investors, causing the number of startups to explode. As its first product, which can store up to 50 million weights. these young companies strive to compete against much larg- At that time, it also aims to release the alpha version of its er established players, some are revisiting technologies the software tools and a performance profiler. The company veterans may have dismissed, such as analog computing. expects to begin volume shipments in 4Q19. Mythic is riding this wave, using embedded flash-memory technology to store neural-network weights as analog pa- Solving a Weighty Problem rameters—an approach that eliminates the power consumed Mythic uses Fujitsu’s 40nm embedded-flash cell, which it in moving data between the processor and DRAM. plans to integrate in an array along with digital-to-analog The company traces its origin to 2012 at the University converters (DACs) and analog-to-digital converters (ADCs), of Michigan, where CTO Dave Fick completed his doctoral as Figure 1 shows. By storing a range of analog voltages in degree and CEO Mike Henry spent two and a half years as the bit cells, the technique uses the cell’s voltage-variable a visiting scholar. The founders worked on a DoD-funded conductance to represent weights with 8-bit resolution. This project to develop machine-learning-powered surveillance drones, eventually leading to the creation of their startup.
    [Show full text]