Multiple Cores + SIMD + Threads) (And Their Performance Characteristics
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures Abstract 1 Introduction
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures Per Stenstromt, Truman Joe, and Anoop Gupta Computer Systems Laboratory Stanford University, CA 94305 Abstract machine. Currently, many research groups are pursuing the design and construction of such multiprocessors [12, 1, 10]. Two interesting variations of large-scale shared-memory ma- As research has progressed in this area, two interesting vari- chines that have recently emerged are cache-coherent mm- ants have emerged, namely CC-NUMA (cache-coherent non- umform-memory-access machines (CC-NUMA) and cache- uniform memory access machines) and COMA (cache-only only memory architectures (COMA). They both have dis- memory architectures). Examples of the CC-NUMA ma- tributed main memory and use directory-based cache coher- chines are the Stanford DASH multiprocessor [12] and the MIT ence. Unlike CC-NUMA, however, COMA machines auto- Alewife machine [1], while examples of COMA machines are matically migrate and replicate data at the main-memoty level the Swedish Institute of Computer Science’s Data Diffusion in cache-line sized chunks. This paper compares the perfor- Machine (DDM) [10] and Kendall Square Research’s KSR1 mance of these two classes of machines. We first present a machine [4]. qualitative model that shows that the relative performance is Common to both CC-NUMA and COMA machines are the primarily determined by two factors: the relative magnitude features of distributed main memory, scalable interconnection of capacity misses versus coherence misses, and the gramr- network, and directory-based cache coherence. Distributed hirity of data partitions in the application. We then present main memory and scalable interconnection networks are es- quantitative results using simulation studies for eight prtraUeI sential in providing the required scalable memory bandwidth, applications (including all six applications from the SPLASH while directory-based schemes provide cache coherence with- benchmark suite). -
Parallel Patterns for Adaptive Data Stream Processing
Università degli Studi di Pisa Dipartimento di Informatica Dottorato di Ricerca in Informatica Ph.D. Thesis Parallel Patterns for Adaptive Data Stream Processing Tiziano De Matteis Supervisor Supervisor Marco Danelutto Marco Vanneschi Abstract In recent years our ability to produce information has been growing steadily, driven by an ever increasing computing power, communication rates, hardware and software sensors diffusion. is data is often available in the form of continuous streams and the ability to gather and analyze it to extract insights and detect patterns is a valu- able opportunity for many businesses and scientific applications. e topic of Data Stream Processing (DaSP) is a recent and highly active research area dealing with the processing of this streaming data. e development of DaSP applications poses several challenges, from efficient algorithms for the computation to programming and runtime systems to support their execution. In this thesis two main problems will be tackled: • need for high performance: high throughput and low latency are critical re- quirements for DaSP problems. Applications necessitate taking advantage of parallel hardware and distributed systems, such as multi/manycores or cluster of multicores, in an effective way; • dynamicity: due to their long running nature (24hr/7d), DaSP applications are affected by highly variable arrival rates and changes in their workload charac- teristics. Adaptivity is a fundamental feature in this context: applications must be able to autonomously scale the used resources to accommodate dynamic requirements and workload while maintaining the desired Quality of Service (QoS) in a cost-effective manner. In the current approaches to the development of DaSP applications are still miss- ing efficient exploitation of intra-operator parallelism as well as adaptations strategies with well known properties of stability, QoS assurance and cost awareness. -
Detailed Cache Coherence Characterization for Openmp Benchmarks
1 Detailed Cache Coherence Characterization for OpenMP Benchmarks Anita Nagarajan, Jaydeep Marathe, Frank Mueller Dept. of Computer Science, North Carolina State University, Raleigh, NC 27695-7534 [email protected], phone: (919) 515-7889 Abstract models in their implementation (e.g., [6], [13], [26], [24], [2], [7]), and they operate at different abstraction levels Past work on studying cache coherence in shared- ranging from cycle-accuracy over instruction-level to the memory symmetric multiprocessors (SMPs) concentrates on operating system interface. On the performance tuning end, studying aggregate events, often from an architecture point work mostly concentrates on program analysis to derive op- of view. However, this approach provides insufficient infor- timized code (e.g., [15], [27]). Recent processor support for mation about the exact sources of inefficiencies in paral- performance counters opens new opportunities to study the lel applications. For SMPs in contemporary clusters, ap- effect of applications on architectures with the potential to plication performance is impacted by the pattern of shared complement them with per-reference statistics obtained by memory usage, and it becomes essential to understand co- simulation. herence behavior in terms of the application program con- In this paper, we concentrate on cache coherence simu- structs — such as data structures and source code lines. lation without cycle accuracy or even instruction-level sim- The technical contributions of this work are as follows. ulation. We constrain ourselves to an SPMD programming We introduce ccSIM, a cache-coherent memory simulator paradigm on dedicated SMPs. Specifically, we assume the fed by data traces obtained through on-the-fly dynamic absence of workload sharing, i.e., only one application runs binary rewriting of OpenMP benchmarks executing on a on a node, and we enforce a one-to-one mapping between Power3 SMP node. -
A Middleware for Efficient Stream Processing in CUDA
Noname manuscript No. (will be inserted by the editor) A Middleware for E±cient Stream Processing in CUDA Shinta Nakagawa ¢ Fumihiko Ino ¢ Kenichi Hagihara Received: date / Accepted: date Abstract This paper presents a middleware capable of which are expressed as kernels. On the other hand, in- out-of-order execution of kernels and data transfers for put/output data is organized as streams, namely se- e±cient stream processing in the compute uni¯ed de- quences of similar data records. Input streams are then vice architecture (CUDA). Our middleware runs on the passed through the chain of kernels in a pipelined fash- CUDA-compatible graphics processing unit (GPU). Us- ion, producing output streams. One advantage of stream ing the middleware, application developers are allowed processing is that it can exploit the parallelism inherent to easily overlap kernel computation with data trans- in the pipeline. For example, the execution of the stages fer between the main memory and the video memory. can be overlapped with each other to exploit task par- To maximize the e±ciency of this overlap, our middle- allelism. Furthermore, di®erent stream elements can be ware performs out-of-order execution of commands such simultaneously processed to exploit data parallelism. as kernel invocations and data transfers. This run-time One of the stream architecture that bene¯t from capability can be used by just replacing the original the advantages mentioned above is the graphics pro- CUDA API calls with our API calls. We have applied cessing unit (GPU), originally designed for acceleration the middleware to a practical application to understand of graphics applications. -
2.5 Classification of Parallel Computers
52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor. -
Chapter 1: Computer Abstractions and Technology 1.6 – 1.7: Performance and Power
Chapter 1: Computer Abstractions and Technology 1.6 – 1.7: Performance and power ITSC 3181 Introduction to Computer Architecture https://passlaB.githuB.io/ITSC3181/ Department of Computer Science Yonghong Yan [email protected] https://passlab.github.io/yanyh/ Lectures for Chapter 1 and C Basics Computer Abstractions and Technology • Lecture 01: Chapter 1 – 1.1 – 1.4: Introduction, great ideas, Moore’s law, aBstraction, computer components, and program execution • Lecture 02: C Basics; Memory and Binary Systems • Lecture 03: Number System, Compilation, Assembly, Linking and Program Execution ☛• Lecture 04: Chapter 1 – 1.6 – 1.7: Performance, power and technology trends • Lecture 05: – 1.8 - 1.9: Multiprocessing and Benchmarking 2 § 1.6 Performance 1.6 Defining Performance • Which airplane has the best performance? Boeing 777 Boeing 777 Boeing 747 Boeing 747 BAC/Sud BAC/Sud Concorde Concorde Douglas Douglas DC- DC-8-50 8-50 0 100 200 300 400 500 0 2000 4000 6000 8000 10000 Passenger Capacity Cruising Range (miles) Boeing 777 Boeing 777 Boeing 747 Boeing 747 BAC/Sud BAC/Sud Concorde Concorde Douglas Douglas DC- DC-8-50 8-50 0 500 1000 1500 0 100000 200000 300000 400000 Cruising Speed (mph) Passengers x mph 3 Response Time and Throughput • Response time çè Latency – How long it takes to do a task • Throughput çè Bandwidth – Total work done per unit time • e.g., tasks/transactions/… per hour • How are response time and throughput affected by – Replacing the processor with a faster version? – Adding more processors? • We’ll focus on response time for now… 4 Relative Performance • Define Performance = 1/Execution Time • “X is n time faster than Y”, i.e. -
AMD Accelerated Parallel Processing Opencl Programming Guide
AMD Accelerated Parallel Processing OpenCL Programming Guide November 2013 rev2.7 © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI, the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trade- marks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdic- tions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or other- wise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right. AMD’s products are not designed, intended, authorized or warranted for use as compo- nents in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or envi- ronmental damage may occur. -
Analysis and Optimization of I/O Cache Coherency Strategies for Soc-FPGA Device
Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device Seung Won Min Sitao Huang Mohamed El-Hadedy Electrical and Computer Engineering Electrical and Computer Engineering Electrical and Computer Engineering University of Illinois University of Illinois California State Polytechnic University Urbana, USA Urbana, USA Ponoma, USA [email protected] [email protected] [email protected] Jinjun Xiong Deming Chen Wen-mei Hwu IBM T.J. Watson Research Center Electrical and Computer Engineering Electrical and Computer Engineering Yorktown Heights, USA University of Illinois University of Illinois [email protected] Urbana, USA Urbana, USA [email protected] [email protected] Abstract—Unlike traditional PCIe-based FPGA accelerators, However, choosing the right I/O cache coherence method heterogeneous SoC-FPGA devices provide tighter integrations for different applications is a challenging task because of between software running on CPUs and hardware accelerators. it’s versatility. By choosing different methods, they can intro- Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these duce different types of overheads. Depending on data access options can have inadvertent effects on the achieved bandwidths patterns, those overheads can be amplified or diminished. In depending on applications and data access patterns. To provide our experiments, we find using different I/O cache coherence the most efficient communications between CPUs and accelera- methods can vary overall application execution times at most tors, understanding the data transaction behaviors and selecting 3.39×. This versatility not only makes designers hard to decide the right I/O cache coherence method is essential. -
GPU Implementation Over IPTV Software Defined Networking
Esmeralda Hysenbelliu. Int. Journal of Engineering Research and Application www.ijera.com ISSN : 2248-9622, Vol. 7, Issue 8, ( Part -1) August 2017, pp.41-45 RESEARCH ARTICLE OPEN ACCESS GPU Implementation over IPTV Software Defined Networking Esmeralda Hysenbelliu* Information Technology Faculty, Polytechnic University of Tirana, Sheshi “Nënë Tereza”, Nr.4, Tirana, Albania Corresponding author: Esmeralda Hysenbelliu ABSTRACT One of the most important issue in IPTV Software defined Network is Bandwidth Issue and Quality of Service at the client side. Decidedly, it is required high level quality of images in low bandwidth and for this reason it is needed different transcoding standards (Compression of image as much as it is possible without destroying the quality of it) as H.264, H265, VP8 and VP9. During a test performed in SMC IPTV SDN Cloud Network, it was observed that with a server HP ProLiant DL380 g6 with two physical processors there was not possible to transcode in format H.264 more than 30 channels simultaneously because CPU’s achieved 100%. This is the reason why it was immediately needed to use Graphic Processing Units called GPU’s which offer high level images processing. After GPU superscalar processor was integrated and done functional via module NVENC of FFEMPEG Program, number of channels transcoded simultaneously was tremendous increased (more than 100 channels). The aim of this paper is to real implement GPU superscalar processors in IPTV Cloud Networks by achieving improvement of performance to more than 60%. Keywords - GPU superscalar processor, Performance Improvement, NVENC, CUDA --------------------------------------------------------------------------------------------------------------------------------------- Date of Submission: 01 -05-2017 Date of acceptance: 19-08-2017 --------------------------------------------------------------------------------------------------------------------------------------- I. -
SIMD Extensions
SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel. -
Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline
Introduction to multi-threading and vectorization Matti Kortelainen LArSoft Workshop 2019 25 June 2019 Outline Broad introductory overview: • Why multithread? • What is a thread? • Some threading models – std::thread – OpenMP (fork-join) – Intel Threading Building Blocks (TBB) (tasks) • Race condition, critical region, mutual exclusion, deadlock • Vectorization (SIMD) 2 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading Image courtesy of K. Rupp 3 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading • One process on a node: speedups from parallelizing parts of the programs – Any problem can get speedup if the threads can cooperate on • same core (sharing L1 cache) • L2 cache (may be shared among small number of cores) • Fully loaded node: save memory and other resources – Threads can share objects -> N threads can use significantly less memory than N processes • If smallest chunk of data is so big that only one fits in memory at a time, is there any other option? 4 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization What is a (software) thread? (in POSIX/Linux) • “Smallest sequence of programmed instructions that can be managed independently by a scheduler” [Wikipedia] • A thread has its own – Program counter – Registers – Stack – Thread-local memory (better to avoid in general) • Threads of a process share everything else, e.g. – Program code, constants – Heap memory – Network connections – File handles -
Parallel Programming
Parallel Programming Parallel Programming Parallel Computing Hardware Shared memory: multiple cpus are attached to the BUS all processors share the same primary memory the same memory address on different CPU’s refer to the same memory location CPU-to-memory connection becomes a bottleneck: shared memory computers cannot scale very well Parallel Programming Parallel Computing Hardware Distributed memory: each processor has its own private memory computational tasks can only operate on local data infinite available memory through adding nodes requires more difficult programming Parallel Programming OpenMP versus MPI OpenMP (Open Multi-Processing): easy to use; loop-level parallelism non-loop-level parallelism is more difficult limited to shared memory computers cannot handle very large problems MPI(Message Passing Interface): require low-level programming; more difficult programming scalable cost/size can handle very large problems Parallel Programming MPI Distributed memory: Each processor can access only the instructions/data stored in its own memory. The machine has an interconnection network that supports passing messages between processors. A user specifies a number of concurrent processes when program begins. Every process executes the same program, though theflow of execution may depend on the processors unique ID number (e.g. “if (my id == 0) then ”). ··· Each process performs computations on its local variables, then communicates with other processes (repeat), to eventually achieve the computed result. In this model, processors pass messages both to send/receive information, and to synchronize with one another. Parallel Programming Introduction to MPI Communicators and Groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.