Hardware Architectures
Recommended publications
Part 2: Packet Transmission
LAN technologies and network topology
• LANs and shared media
• Locality of reference
• Star, bus and ring topologies
• Medium access control protocols
• Links between rooms/buildings

Mesh networks
• Early local networks used dedicated links between each pair of computers
• Some useful properties
  – hardware and frame details can be tailored for each link
  – easy to enforce security and privacy

Disadvantages of meshes
• Poor scalability
• Many links would follow the same physical path

Shared Communication Channels
• Shared LANs invented in the 1960s
• Rely on computers sharing a single medium
• Computers coordinate their access
• Low cost
• But not suitable for wide area – communication delays inhibit coordination

Locality of reference
• LANs now connect more computers than any other form of network
• The reason LANs are so popular is the principle of locality of reference
  – physical locality of reference: computers are more likely to communicate with those nearby
  – temporal locality of reference: a computer is more likely to communicate with the same computers repeatedly

LAN topologies
• LANs may be categorised according to topology: ring, bus or star

Pros and cons
• Star is more robust, but the hub may be a bottleneck
• Ring enables easy coordination, but is sensitive to a cable being cut
• Bus requires less wiring, but is also sensitive to a cable being cut

Example bus network: Ethernet
• Single coaxial cable – the ether – to which computers connect
• IEEE standard specifies details
  – data rates
  – maximum length and minimum separation
  – frame formats
2.5 Classification of Parallel Computers
2.5.1 Granularity

In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronisation events.

• fine level (same operations with different data)
  ◦ vector processors
  ◦ instruction-level parallelism
  ◦ fine-grain parallelism:
    – Relatively small amounts of computational work are done between communication events
    – Low computation-to-communication ratio
    – Facilitates load balancing
    – Implies high communication overhead and less opportunity for performance enhancement
    – If granularity is too fine, it is possible that the overhead required for communication and synchronisation between tasks takes longer than the computation
• operation level (different operations simultaneously)
• problem level (independent subtasks)
  ◦ coarse-grain parallelism:
    – Relatively large amounts of computational work are done between communication/synchronisation events
    – High computation-to-communication ratio
    – Implies more opportunity for performance increase
    – Harder to load balance efficiently

2.5.2 Hardware: Pipelining

(was used in supercomputers, e.g. the Cray-1)

With N elements in the pipeline, and L clock cycles per element, the whole calculation takes L + N cycles; without pipelining it takes L * N cycles. Example of good code for pipelining:

    do i = 1, k
       z(i) = x(i) + y(i)
    end do

Vector processors provide fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but recursion, for example, is hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.
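To make the contrast concrete, here is a short sketch in C (our own illustration, not from the original notes): the first loop's iterations are independent, so a pipelined or vector unit can stream through them, while the second carries a loop-carried dependence (each iteration needs the previous result), which defeats pipelining and vectorisation.

    /* Independent iterations: pipelines and vectorises well
       (the Fortran loop above, rewritten in C). */
    void vector_add(int k, const float *x, const float *y, float *z)
    {
        for (int i = 0; i < k; i++)
            z[i] = x[i] + y[i];
    }

    /* Loop-carried dependence (a recursion): iteration i needs z[i-1],
       so the hardware cannot overlap iterations. */
    void recurrence(int k, float *z)
    {
        for (int i = 1; i < k; i++)
            z[i] = 0.5f * z[i - 1] + z[i];
    }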
Distributed Algorithms with Theoretic Scalability Analysis of Radial and Looped Load Flows for Power Distribution Systems
Electric Power Systems Research 65 (2003) 169–177
www.elsevier.com/locate/epsr

Distributed algorithms with theoretic scalability analysis of radial and looped load flows for power distribution systems

Fangxing Li *, Robert P. Broadwater
ECE Department, Virginia Tech, Blacksburg, VA 24060, USA

Received 15 April 2002; received in revised form 14 December 2002; accepted 16 December 2002

Abstract

This paper presents distributed algorithms for both radial and looped load flows for unbalanced, multi-phase power distribution systems. The distributed algorithms are developed from tree-based sequential algorithms. Formulas of scalability for the distributed algorithms are presented. It is shown that computation time dominates communication time in the distributed computing model. This provides benefits to real-time load flow calculations, network reconfigurations, and optimization studies that rely on load flow calculations. Also, test results match the predictions of derived formulas. This shows the formulas can be used to predict the computation time when additional processors are involved. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Distributed computing; Scalability analysis; Radial load flow; Looped load flow; Power distribution systems

1. Introduction

Parallel and distributed computing has been applied to many scientific and engineering computations such as weather forecasting and nuclear simulations [1,2]. It also has been applied to power system analysis calculations [3–14].

Also, the method presented in Ref. [10] was tested in radial distribution systems with no more than 528 buses. More recent works [11–14] presented distributed implementations for power flows or power flow based algorithms like optimizations and contingency analysis. These works also targeted power transmission systems.
Chapter 5 Multiprocessors and Thread-Level Parallelism
Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 5: Multiprocessors and Thread-Level Parallelism
Copyright © 2012, Elsevier Inc. All rights reserved.

Contents
1. Introduction
2. Centralized SMA – shared memory architecture
3. Performance of SMA
4. DMA – distributed memory architecture
5. Synchronization
6. Models of Consistency

1. Introduction. Why multiprocessors?
• Need for more computing power
  – Data-intensive applications
  – Utility computing requires powerful processors
• Several ways to increase processor performance
  – Increased clock rate: limited ability
  – Architectural: ILP, CPI – increasingly more difficult
  – Multi-processor, multi-core systems: more feasible based on current technologies
• Advantages of multiprocessors and multi-core
  – Replication rather than unique design

Multiprocessor types
• Symmetric multiprocessors (SMP)
  – Share single memory with uniform memory access/latency (UMA)
  – Small number of cores
• Distributed shared memory (DSM)
  – Memory distributed among processors
  – Non-uniform memory access/latency (NUMA)
  – Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Important ideas
• Technology drives the solutions. Multi-cores have altered the game!!
• Thread-level parallelism (TLP) vs ILP
• Computing and communication deeply intertwined
• Write serialization exploits broadcast communication
Scalability and Performance Management of Internet Applications in the Cloud
Hasso-Plattner-Institut, University of Potsdam
Internet Technology and Systems Group

Scalability and Performance Management of Internet Applications in the Cloud

A thesis submitted for the degree of "Doktors der Ingenieurwissenschaften" (Dr.-Ing.) in IT Systems Engineering, Faculty of Mathematics and Natural Sciences, University of Potsdam

By: Wesam Dawoud
Supervisor: Prof. Dr. Christoph Meinel
Potsdam, Germany, March 2013

This work is licensed under a Creative Commons License: Attribution – Noncommercial – No Derivative Works 3.0 Germany. To view a copy of this license visit http://creativecommons.org/licenses/by-nc-nd/3.0/de/

Published online at the Institutional Repository of the University of Potsdam:
URL http://opus.kobv.de/ubp/volltexte/2013/6818/
URN urn:nbn:de:kobv:517-opus-68187
http://nbn-resolving.de/urn:nbn:de:kobv:517-opus-68187

To my lovely parents. To my lovely wife Safaa. To my lovely kids Shatha and Yazan.

Acknowledgements

At Hasso Plattner Institute (HPI), I had the opportunity to meet many wonderful people. It is my pleasure to thank those who supported me to make this thesis possible. First and foremost, I would like to thank my Ph.D. supervisor, Prof. Dr. Christoph Meinel, for his continuous support. In spite of his tight schedule, he always found the time to discuss, guide, and motivate my research ideas. The thanks are extended to Dr. Karin-Irene Eiermann for assisting me even before moving to Germany. I am also grateful to Michaela Schmitz. She managed everything well to make everyone's life easier. I owe thanks to Dr. Nemeth Sharon for helping me to improve my English writing skills.
L22: Parallel Programming Language Features (Chapel and MapReduce)
L22: Parallel Programming Language Features (Chapel and MapReduce)
December 1, 2009

Administrative
• Schedule for the rest of the semester
  – "Midterm Quiz" = long homework
    – Handed out over the holiday (Tuesday, Dec. 1)
    – Return by Dec. 15
  – Projects
    – 1 page status report on Dec. 3 – handin cs4961 pdesc <file, ascii or PDF ok>
    – Poster session dry run (to see material) Dec. 8
    – Poster details (next slide)
• Mailing list: [email protected]

Poster Details
• I am providing: foam core, tape, push pins, easels
• Plan on 2 ft by 3 ft or so of material (9-12 slides)
• Content:
  – Problem description and why it is important
  – Parallelization challenges
  – Parallel algorithm
  – How are two programming models combined?
  – Performance results (speedup over sequential)

Outline
• Global view languages
• Chapel programming language
• Map-Reduce (popularized by Google)
• Reading: Ch. 8 and 9 in textbook
• Sources for today's lecture
  – Brad Chamberlain, Cray
  – John Gilbert, UCSB

Shifting Gears
• What are some important features of parallel programming languages (Ch. 9)?
  – Correctness
  – Performance
  – Scalability
  – Portability

Global View Versus Local View
• P-independence
  – If and only if a program always produces the same output on the same input regardless of number or arrangement of processors
• Global view
  – A language construct that preserves P-independence
  – Example (today's lecture)
• Local view
  – Does not preserve P-independent program behavior
  – Example from previous lecture? And what about ease …
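As a toy illustration of the map-reduce pattern named in the outline, here is a sequential sketch in C (our own example, not Google's API or anything from the lecture): "map" applies a function independently to every element, and "reduce" folds the results with an associative operator, which is what licenses a runtime to run both phases in parallel and combine partial reductions in any order.

    #include <stdio.h>

    typedef int (*map_fn)(int);
    typedef int (*reduce_fn)(int, int);

    static int square(int x) { return x * x; }       /* example map    */
    static int add(int a, int b) { return a + b; }   /* example reduce */

    /* Sequential reference semantics; a parallel runtime may apply `m`
       to elements in any order and merge partial sums, which is valid
       because `r` is associative. */
    static int map_reduce(const int *in, int n, map_fn m, reduce_fn r, int init)
    {
        int acc = init;
        for (int i = 0; i < n; i++)
            acc = r(acc, m(in[i]));
        return acc;
    }

    int main(void)
    {
        int data[] = {1, 2, 3, 4};
        printf("%d\n", map_reduce(data, 4, square, add, 0)); /* prints 30 */
        return 0;
    }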
The Design of PROFIBUS-DP HUB Based on FPGA
International Conference on Computer Science and Service System (CSSS 2014)

The Design of PROFIBUS-DP HUB based on FPGA

Jiao Ping, Zhou Tong
Lab. of Networked Control Systems, Shenyang Institute of Automation (SIA), Chinese Academy of Sciences, Shenyang, China
[email protected], [email protected]

Abstract—This paper analyzes the PROFIBUS-DP physical-layer signal states, DP timing, and other characteristics. PROFIBUS-DP HUBs have commonly suffered from poor general applicability, time-consuming communication, required message analysis, dropped frames, etc.; this paper proposes a new design of a PROFIBUS-DP HUB based on an FPGA. The design contains a detector that monitors the physical-layer signal state of the DP bus. Using logic synthesised on the FPGA, it controls the direction of data flow at the physical layer, independently of the upper-layer protocol and without message analysis; it is protocol-compliant, uses standard RS-485, and is baud-rate adaptive.

… and specified in the standard EIA RS-485; the traffic rate is from 9.6 kbps to 12 Mbps [2, 3, 6]. IEC 1158-2 specifies that PROFIBUS-DP supports 126 stations (station addresses 0 to 125, covering master and slave stations) [4]. A single PROFIBUS-DP segment supports at most 32 stations; a DP HUB or DP repeater must be used to connect more stations or to change the network topology. As the figure shows, a DP HUB or DP repeater divides the DP network into several subnet segments [7]. Each interface of the DP HUB is a repeater, which can independently drive a PROFIBUS …
Scalability and Optimization Strategies for GPU Enhanced Neural Networks (GeNN)
Naresh Balaji (1), Esin Yavuz (2), Thomas Nowotny (2)
{T.Nowotny, E.Yavuz}@sussex.ac.uk, [email protected]
(1) National Institute of Technology, Tiruchirappalli, India
(2) School of Engineering and Informatics, University of Sussex, Brighton, UK

Abstract: Simulation of spiking neural networks has traditionally been done on high-performance supercomputers or large-scale clusters. Utilizing the parallel nature of neural network computation algorithms, GeNN (GPU Enhanced Neural Network) provides a simulation environment that performs on general-purpose NVIDIA GPUs with a code-generation based approach. GeNN allows users to design and simulate neural networks by specifying the populations of neurons at different stages, their synapse connection densities, and the model of individual neurons. In this report we describe work on how to scale synaptic weights based on the configuration of the user-defined network to ensure sufficient spiking and subsequent effective learning. We also discuss optimization strategies particular to GPU computing: sparse representation of synapse connections and occupancy-based block-size determination.

Keywords: Spiking Neural Network, GPGPU, CUDA, code-generation, occupancy, optimization

1. Introduction

The computational performance of traditional single-core CPUs has steadily increased since their arrival, primarily due to the combination of increased clock frequency, process technology advancements, and compiler optimizations. The clock frequency was pushed higher and higher, from 5 MHz to 3 GHz in the years from 1983 to 2002, until it reached a stall a few years ago [1], when the power consumption and dissipation of the transistors reached peaks that could not be compensated by its cost [2].
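As a concrete picture of what "sparse representation of synapse connections" means in general, here is a compressed sparse row (CSR) sketch in C. This is our own generic illustration and assumes nothing about GeNN's actual data structures, which the excerpt does not show: only the synapses that exist are stored, so delivering a spike touches just the real targets instead of scanning a dense matrix.

    #include <stddef.h>

    /* CSR-style sparse connectivity: presynaptic neuron i projects to
       col_idx[row_ptr[i] .. row_ptr[i+1]-1], with weights w[...] at the
       same offsets. */
    typedef struct {
        size_t n_pre;           /* number of presynaptic neurons       */
        const size_t *row_ptr;  /* length n_pre + 1                    */
        const size_t *col_idx;  /* one entry per synapse               */
        const float  *w;        /* synaptic weights, same length       */
    } SparseSynapses;

    /* Deliver a spike from presynaptic neuron i: accumulate its weights
       into the postsynaptic input array. */
    static void deliver_spike(const SparseSynapses *s, size_t i, float *input)
    {
        for (size_t k = s->row_ptr[i]; k < s->row_ptr[i + 1]; k++)
            input[s->col_idx[k]] += s->w[k];
    }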
Computer Hardware Architecture Lecture 4
Manfred Liebmann
Technische Universität München, Chair of Optimal Control
Center for Mathematical Sciences, M17
[email protected]
November 10, 2015

Reading List
• Pacheco – An Introduction to Parallel Programming (Chapters 1-2)
  – Introduction to computer hardware architecture from the parallel programming angle
• Hennessy, Patterson – Computer Architecture: A Quantitative Approach
  – Reference book for computer hardware architecture
All books are available on the Moodle platform!

UMA Architecture

[Figure 1: A uniform memory access (UMA) multicore system]

Access times to main memory are the same for all cores in the system!

NUMA Architecture

[Figure 2: A non-uniform memory access (NUMA) multicore system]

Access times to main memory differ from core to core depending on the proximity of the main memory. This architecture is often used in dual- and quad-socket servers, due to improved memory bandwidth.

Cache Coherence

[Figure 3: A shared memory system with two cores and two caches]

What happens if the same data element z1 is manipulated in two different caches? The hardware enforces cache coherence, i.e. consistency between the caches. Expensive!

False Sharing

The cache coherence protocol works on the granularity of a cache line. If two threads manipulate different elements within a single cache line, the cache coherency protocol is activated to ensure consistency, even if every thread is only manipulating its own data.
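The false-sharing effect is easy to reproduce. Below is a minimal sketch in C with POSIX threads (our own example, not from the lecture; it assumes 64-byte cache lines and is built with cc -O2 -pthread): two threads first increment adjacent counters that share a cache line, then counters aligned onto separate lines. Timing the two phases externally typically shows the padded version running several times faster, even though each thread only ever touches its own counter.

    #include <pthread.h>
    #include <stdio.h>

    enum { ITERS = 100000000 };

    static struct { volatile long val; } shared[2];              /* adjacent: one line   */
    static struct { _Alignas(64) volatile long val; } padded[2]; /* one line each (C11)  */

    static void *bump_shared(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            shared[id].val++;   /* both threads hit the same cache line */
        return NULL;
    }

    static void *bump_padded(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            padded[id].val++;   /* each thread owns its own cache line  */
        return NULL;
    }

    static void run(void *(*fn)(void *))
    {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, fn, (void *)id);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
    }

    int main(void)
    {
        run(bump_shared);   /* slow: coherence traffic bounces the line */
        run(bump_padded);   /* fast: no line is shared                  */
        printf("%ld %ld\n", shared[0].val, padded[0].val);
        return 0;
    }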
Parallel Programming
Libraries and Implementations

Outline
• MPI – distributed memory de-facto standard
• Using MPI
• OpenMP – shared memory de-facto standard
• Using OpenMP
• CUDA – GPGPU de-facto standard
• Using CUDA
• Others
  – Hybrid programming
  – Xeon Phi programming
  – SHMEM
  – PGAS

MPI Library: distributed, message-passing programming

Message-passing concepts: explicit parallelism
• In message-passing all the parallelism is explicit
• The program includes specific instructions for each communication
  – What to send or receive
  – When to send or receive
  – Synchronisation
• It is up to the developer to design the parallel decomposition and implement it
  – How will you divide up the problem?
  – When will you need to communicate between processes?

Message Passing Interface (MPI)
• MPI is a portable library used for writing parallel programs using the message-passing model
• You can expect MPI to be available on any HPC platform you use
• Based on a number of processes running independently in parallel
  – The HPC resource provides a command to launch multiple processes simultaneously (e.g. mpiexec, aprun)
• There are a number of different implementations, but all should support the MPI 2 standard
  – As with different compilers, there will be variations between implementations, but all the features specified in the standard should work
  – Examples: MPICH2, OpenMPI

Point-to-point communications
• A message sent by one process and received by another
• Both processes are actively involved in the communication – not necessarily at the same time
• Wide variety of semantics provided:
  – Blocking vs. non-blocking
  – Ready vs. synchronous vs. buffered
  – Tags, communicators, wild-cards
  – Built-in and custom data-types
• Can be used to implement any communication pattern
• Collective operations, if applicable, can be more efficient

Collective communications
• A communication that involves all processes
  – "all" within a communicator, i.e. …
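A minimal point-to-point example in C, as a hedged sketch of the concepts above (standard MPI calls only; error handling omitted): rank 0 sends one integer to rank 1 with a blocking send, and rank 1 posts a matching blocking receive. Launch with e.g. mpiexec -n 2 ./a.out.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* blocking send: message tag 0, destination rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking receive: matches on source rank and tag */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }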
Field Networking Solutions
Courtesy of Steven Engineering, Inc. • 230 Ryan Way, South San Francisco, CA 94080-6370 • Main Office: (650) 588-9200 • Outside Local Area: (800) 258-9200 • www.stevenengineering.com

Types of Fieldbus Networks*

Field Networking 101: Features and Benefits of Fieldbus Networks

The combination of intelligent field devices, digital bus networks, and various open communications protocols is producing extraordinary results at process plants around the world. Just as our ability to retrieve, share, and analyze data has increased tremendously by use of the Internet and PC network technology in our homes and at our desktops, so has our ability to control and manage our process plants improved. Digital connectivity in process manufacturing plants provides an infrastructure for the flow of real-time data from the process level, making it available throughout our enterprise networks. This data is being used at all levels of the enterprise to provide increased process monitoring and control, inventory and materials planning, advanced diagnostics, maintenance planning, and asset management.

Fieldbus networks provide an array of features and benefits that make them an excellent choice in nearly all process control environments. Compared to conventional technology, fieldbus networks deliver the following benefits:

Reduced field wiring costs
– Two wires from the control room to many devices

Reduced commissioning costs
– Less time and personnel needed to perform I/O wiring checkouts
– No time spent calibrating intermediate signals (such as 4-20 mA signals)
– Digital values are delivered directly from field devices, increasing accuracy

Reduced engineering/operating costs
– Much smaller space required for panels, I/O racks, and connectivity boxes
– Fewer I/O cards and termination panels for control system equipment
– Lower power consumption by control system
Efficiently Triggering, Debugging and Decoding Low-Speed Serial Buses
Hello, my name is Jerry Mark, I am a Product Marketing Engineer at Tektronix. The following presentation discusses aspects of embedded design techniques for "Efficiently Triggering, Debugging, and Decoding Low-Speed Serial Buses".

Agenda
• Introduction
  – Parallel interconnects
  – Transition from parallel to serial buses
  – High-speed versus low-speed serial data buses
• Low-speed serial data buses
  – Challenges
  – Technology reviews
  – Measurement solutions
• Summary

In this presentation we will begin by taking a glance at the transition from parallel to serial data and the challenges this presents to engineers. Then we will briefly review some of the most widely used low-speed serial buses in industry today and some of their key characteristics. We will turn our attention to the key measurements on these buses. Subsequently, we will present by example how the low-speed serial solution from Tektronix addresses these challenges.

Parallel Interconnects

The traditional way to connect digital devices used parallel buses.
• Advantages
  – Simple point-to-point connections
  – All signals are transmitted in parallel, simultaneously
  – Easy to capture the state of the bus (if you have enough channels!)
  – Decoding the bus is relatively easy
• Disadvantages
  – Occupies a lot of circuit board space
  – All high-speed connections must be the same length
  – Many connections limit reliability
  – Connectors may be very large

Parallel buses present all of the bits in parallel, as shown in the logic analyzer display. A parallel connection between two ICs can be as simple as point-to-point circuit board traces for each of the data lines. If you can physically probe these lines, it is easy to display the bus state on a logic analyzer or mixed signal oscilloscope.
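As a concrete taste of what "decoding" a low-speed serial bus involves, here is a toy UART (8N1) byte decoder in C. It is our own sketch rather than anything from the presentation, and it assumes an idle-high line sampled at 16x the baud rate: find the falling start-bit edge, sample each of the 8 data bits (LSB first) at its center, and check the stop bit.

    #include <stdio.h>
    #include <stdint.h>

    #define OSR 16  /* samples per bit (16x oversampling assumed) */

    /* samples: array of 0/1 line samples; returns the decoded byte,
       or -1 if there is no complete, valid 8N1 frame. */
    static int uart_decode_byte(const uint8_t *samples, int n)
    {
        int i = 0;
        while (i < n && samples[i] == 1)
            i++;                                 /* idle high: wait for start edge */
        if (i + 10 * OSR > n)
            return -1;                           /* not enough samples for a frame */
        int byte = 0;
        for (int bit = 0; bit < 8; bit++) {      /* data bits, LSB first */
            int center = i + OSR + bit * OSR + OSR / 2;
            byte |= samples[center] << bit;
        }
        if (samples[i + 9 * OSR + OSR / 2] != 1) /* stop bit must be high */
            return -1;
        return byte;
    }

    int main(void)
    {
        /* Synthesize one frame carrying 0xA5 and decode it back. */
        uint8_t s[12 * OSR];
        int byte = 0xA5, k = 0;
        for (int j = 0; j < OSR; j++) s[k++] = 1;       /* idle       */
        for (int j = 0; j < OSR; j++) s[k++] = 0;       /* start bit  */
        for (int bit = 0; bit < 8; bit++)
            for (int j = 0; j < OSR; j++) s[k++] = (byte >> bit) & 1;
        for (int j = 0; j < 2 * OSR; j++) s[k++] = 1;   /* stop+idle  */
        printf("decoded 0x%02X\n", uart_decode_byte(s, k)); /* 0xA5   */
        return 0;
    }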