Detecting False Sharing Efficiently and Effectively
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Memory Interference Characterization Between CPU Cores and Integrated Gpus in Mixed-Criticality Platforms
Memory Interference Characterization between CPU cores and integrated GPUs in Mixed-Criticality Platforms Roberto Cavicchioli, Nicola Capodieci and Marko Bertogna University of Modena And Reggio Emilia, Department of Physics, Informatics and Mathematics, Modena, Italy [email protected] Abstract—Most of today’s mixed criticality platforms feature be put on the memory traffic originated by the iGPU, as it Systems on Chip (SoC) where a multi-core CPU complex (the represents a very popular architectural paradigm for computing host) competes with an integrated Graphic Processor Unit (iGPU, massively parallel workloads at impressive performance per the device) for accessing central memory. The multi-core host Watt ratios [1]. This architectural choice (commonly referred and the iGPU share the same memory controller, which has to to as General Purpose GPU computing, GPGPU) is one of the arbitrate data access to both clients through often undisclosed reference architectures for future embedded mixed-criticality or non-priority driven mechanisms. Such aspect becomes crit- ical when the iGPU is a high performance massively parallel applications, such as autonomous driving and unmanned aerial computing complex potentially able to saturate the available vehicle control. DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts In order to maximize the validity of the presented results, due to parallel accesses to main memory by both CPU cores we considered different platforms featuring different memory and iGPU, so to motivate the need of novel paradigms for controllers, instruction sets, data bus width, cache hierarchy memory centric scheduling mechanisms. We analyzed different configurations and programming models: well known and commercially available platforms in order to i NVIDIA Tegra K1 SoC (TK1), using CUDA 6.5 [2] for estimate variations in throughput and latencies within various memory access patterns, both at host and device side. -
A Cache-Efficient Sorting Algorithm for Database and Data Mining
A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations using Graphics Processors Naga K. Govindaraju, Nikunj Raghuvanshi, Michael Henson, David Tuft, Dinesh Manocha University of North Carolina at Chapel Hill {naga,nikunj,henson,tuft,dm}@cs.unc.edu http://gamma.cs.unc.edu/GPUSORT Abstract els [3]. These techniques have been extensively studied for CPU-based algorithms. We present a fast sorting algorithm using graphics pro- Recently, database researchers have been exploiting the cessors (GPUs) that adapts well to database and data min- computational capabilities of graphics processors to accel- ing applications. Our algorithm uses texture mapping and erate database queries and redesign the query processing en- blending functionalities of GPUs to implement an efficient gine [8, 18, 19, 39]. These include typical database opera- bitonic sorting network. We take into account the commu- tions such as conjunctive selections, aggregates and semi- nication bandwidth overhead to the video memory on the linear queries, stream-based frequency and quantile estima- GPUs and reduce the memory bandwidth requirements. We tion operations,and spatial database operations. These al- also present strategies to exploit the tile-based computa- gorithms utilize the high memory bandwidth, the inherent tional model of GPUs. Our new algorithm has a memory- parallelism and vector processing functionalities of GPUs efficient data access pattern and we describe an efficient to execute the queries efficiently. Furthermore, GPUs are instruction dispatch mechanism to improve the overall sort- becoming increasingly programmable and have also been ing performance. We have used our sorting algorithm to ac- used for sorting [19, 24, 35]. celerate join-based queries and stream mining algorithms. -
How to Solve the Current Memory Access and Data Transfer Bottlenecks: at the Processor Architecture Or at the Compiler Level?
Hot topic session: How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? Francky Catthoor, Nikil D. Dutt, IMEC , Leuven, Belgium ([email protected]) U.C.Irvine, CA ([email protected]) Christoforos E. Kozyrakis, U.C.Berkeley, CA ([email protected]) Abstract bedded memories are correctly incorporated. What is more controversial still however is how to deal with dynamic ap- Current processor architectures, both in the pro- plication behaviour, with multithreading and with highly grammable and custom case, become more and more dom- concurrent architecture platforms. inated by the data access bottlenecks in the cache, system bus and main memory subsystems. In order to provide suf- 2. Explicitly parallel architectures for mem- ficiently high data throughput in the emerging era of highly ory performance enhancement (Christo- parallel processors where many arithmetic resources can work concurrently, novel solutions for the memory access foros E. Kozyrakis) and data transfer will have to be introduced. The crucial question we want to address in this hot topic While microprocessor architectures have significantly session is where one can expect these novel solutions to evolved over the last fifteen years, the memory system rely on: will they be mainly innovative processor archi- is still a major performance bottleneck. The increasingly tecture ideas, or novel approaches in the application com- complex structures used for speculation, reordering, and piler/synthesis technology, or a mix. caching have improved performance, but are quickly run- ning out of steam. These techniques are becoming ineffec- tive in terms of resource utilization and scaling potential. -
Shared Memory Multiprocessors
Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory Multiprocessors Symmetric Multiprocessors (SMPs) • Symmetric access to all of main memory from any processor Dominate the server market • Building blocks for larger systems; arriving to desktop Attractive as throughput servers and for parallel programs • Fine-grain resource sharing • Uniform access via loads/stores • Automatic data movement and coherent replication in caches • Useful for operating system too Normal uniprocessor mechanisms to access data (reads and writes) • Key is extension of memory hierarchy to support multiple processors 2 Supporting Programming Models Message passing Programming models Compilation Multiprogramming or library Communication abstraction User/system boundary Shared address space Operating systems support Hardware/software boundary Communication hardware Physical communication medium • Address translation and protection in hardware (hardware SAS) • Message passing using shared memory buffers – can be very high performance since no OS involvement necessary • Focus here on supporting coherent shared address space 3 Natural Extensions of Memory System P1 Pn Switch P1 Pn (Interleaved) First-level $ $ $ Bus (Interleaved) Main memory Mem I/O devices (a) Shared cache (b) Bus-based shared memory P1 Pn P1 Pn $ $ $ $ Mem Mem Interconnection network Interconnection network Mem Mem (c) Dancehall (d) Distributed-memory 4 Caches and Cache Coherence Caches play key role in all cases • Reduce average data access time • Reduce bandwidth -
Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline
Introduction to multi-threading and vectorization Matti Kortelainen LArSoft Workshop 2019 25 June 2019 Outline Broad introductory overview: • Why multithread? • What is a thread? • Some threading models – std::thread – OpenMP (fork-join) – Intel Threading Building Blocks (TBB) (tasks) • Race condition, critical region, mutual exclusion, deadlock • Vectorization (SIMD) 2 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading Image courtesy of K. Rupp 3 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading • One process on a node: speedups from parallelizing parts of the programs – Any problem can get speedup if the threads can cooperate on • same core (sharing L1 cache) • L2 cache (may be shared among small number of cores) • Fully loaded node: save memory and other resources – Threads can share objects -> N threads can use significantly less memory than N processes • If smallest chunk of data is so big that only one fits in memory at a time, is there any other option? 4 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization What is a (software) thread? (in POSIX/Linux) • “Smallest sequence of programmed instructions that can be managed independently by a scheduler” [Wikipedia] • A thread has its own – Program counter – Registers – Stack – Thread-local memory (better to avoid in general) • Threads of a process share everything else, e.g. – Program code, constants – Heap memory – Network connections – File handles -
Prefetching for Complex Memory Access Patterns
UCAM-CL-TR-923 Technical Report ISSN 1476-2986 Number 923 Computer Laboratory Prefetching for complex memory access patterns Sam Ainsworth July 2018 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 http://www.cl.cam.ac.uk/ c 2018 Sam Ainsworth This technical report is based on a dissertation submitted February 2018 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Churchill College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Abstract Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernible pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetch- ing compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3 geometric mean speedup on the most complex × processors, and 2.7 on the simplest. I also provide an analysis of when and why this × is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher. -
Eth-28647-02.Pdf
Research Collection Doctoral Thesis Adaptive main memory compression Author(s): Tuduce, Irina Publication Date: 2006 Permanent Link: https://doi.org/10.3929/ethz-a-005180607 Rights / License: In Copyright - Non-Commercial Use Permitted This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library Doctoral Thesis ETH No. 16327 Adaptive Main Memory Compression A dissertation submitted to the Swiss Federal Institute of Technology Zurich (ETH ZÜRICH) for the degree of Doctor of Technical Sciences presented by Irina Tuduce Engineer, TU Cluj-Napoca born April 9, 1976 citizen of Romania accepted on the recommendation of Prof. Dr. Thomas Gross, examiner Prof. Dr. Lothar Thiele, co-examiner 2005 Seite Leer / Blank leaf Seite Leer / Blank leaf Seite Leer lank Abstract Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. Since fast memory is expensive, an economical solution to that desire is a memory hierarchy organized into several levels. Each level has greater capacity than the preceding but it is less quickly accessible. The goal of the memory hierarchy is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. In the last decades, the processor performance improved much faster than the performance of the memory levels. The memory hierarchy proved to be a scalable solution, i.e., the bigger the perfomance gap between processor and memory, the more levels are used in the memory hierarchy. For instance, in 1980 microprocessors were often designed without caches, while in 2005 most of them come with two levels of caches on the chip. -
2.2 Adaptive Routing Algorithms and Router Design 20
https://theses.gla.ac.uk/ Theses Digitisation: https://www.gla.ac.uk/myglasgow/research/enlighten/theses/digitisation/ This is a digitised version of the original print thesis. Copyright and moral rights for this work are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This work cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Enlighten: Theses https://theses.gla.ac.uk/ [email protected] Performance Evaluation of Distributed Crossbar Switch Hypermesh Sarnia Loucif Dissertation Submitted for the Degree of Doctor of Philosophy to the Faculty of Science, Glasgow University. ©Sarnia Loucif, May 1999. ProQuest Number: 10391444 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a com plete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. uest ProQuest 10391444 Published by ProQuest LLO (2017). Copyright of the Dissertation is held by the Author. All rights reserved. This work is protected against unauthorized copying under Title 17, United States C ode Microform Edition © ProQuest LLO. ProQuest LLO. -
Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs Samuel K
University of New Mexico UNM Digital Repository Computer Science ETDs Engineering ETDs Fall 12-1-2018 Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs Samuel K. Gutiérrez Follow this and additional works at: https://digitalrepository.unm.edu/cs_etds Part of the Numerical Analysis and Scientific omputC ing Commons, OS and Networks Commons, Software Engineering Commons, and the Systems Architecture Commons Recommended Citation Gutiérrez, Samuel K.. "Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs." (2018). https://digitalrepository.unm.edu/cs_etds/95 This Dissertation is brought to you for free and open access by the Engineering ETDs at UNM Digital Repository. It has been accepted for inclusion in Computer Science ETDs by an authorized administrator of UNM Digital Repository. For more information, please contact [email protected]. Samuel Keith Guti´errez Candidate Computer Science Department This dissertation is approved, and it is acceptable in quality and form for publication: Approved by the Dissertation Committee: Professor Dorian C. Arnold, Chair Professor Patrick G. Bridges Professor Darko Stefanovic Professor Alexander S. Aiken Patrick S. McCormick Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs by Samuel Keith Guti´errez B.S., Computer Science, New Mexico Highlands University, 2006 M.S., Computer Science, University of New Mexico, 2009 DISSERTATION Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy Computer Science The University of New Mexico Albuquerque, New Mexico December 2018 ii ©2018, Samuel Keith Guti´errez iii Dedication To my beloved family iv \A Dios rogando y con el martillo dando." Unknown v Acknowledgments Words cannot adequately express my feelings of gratitude for the people that made this possible. -
Leaper: a Learned Prefetcher for Cache Invalidation in LSM-Tree Based Storage Engines
Leaper: A Learned Prefetcher for Cache Invalidation in LSM-tree based Storage Engines ∗ Lei Yang1 , Hong Wu2, Tieying Zhang2, Xuntao Cheng2, Feifei Li2, Lei Zou1, Yujie Wang2, Rongyao Chen2, Jianying Wang2, and Gui Huang2 fyang lei, [email protected] fhong.wu, tieying.zhang, xuntao.cxt, lifeifei, zhencheng.wyj, rongyao.cry, beilou.wjy, [email protected] Peking University1 Alibaba Group2 ABSTRACT 101 Hit ratio Frequency-based cache replacement policies that work well QPS on page-based database storage engines are no longer suffi- Latency cient for the emerging LSM-tree (Log-Structure Merge-tree) based storage engines. Due to the append-only and copy- 100 on-write techniques applied to accelerate writes, the state- of-the-art LSM-tree adopts mutable record blocks and issues Normalized value frequent background operations (i.e., compaction, flush) to reorganize records in possibly every block. As a side-effect, 0 50 100 150 200 such operations invalidate the corresponding entries in the Time (s) Figure 1: Cache hit ratio and system performance churn (QPS cache for each involved record, causing sudden drops on the and latency of 95th percentile) caused by cache invalidations. cache hit rates and spikes on access latency. Given the ob- with notable examples including LevelDB [10], HBase [2], servation that existing methods cannot address this cache RocksDB [7] and X-Engine [14] for its superior write perfor- invalidation problem, we propose Leaper, a machine learn- mance. These storage engines usually come with row-level ing method to predict hot records in an LSM-tree storage and block-level caches to buffer hot records in main mem- engine and prefetch them into the cache without being dis- ory. -
Decompose and Conquer: Addressing Evasive Errors in Systems on Chip
Decompose and Conquer: Addressing Evasive Errors in Systems on Chip by Doowon Lee A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in the University of Michigan 2018 Doctoral Committee: Professor Valeria M. Bertacco, Chair Assistant Professor Reetuparna Das Professor Scott Mahlke Associate Professor Zhengya Zhang Doowon Lee [email protected] ORCID iD: 0000-0003-0046-7746 c Doowon Lee 2018 For my wife, Ghaeun ii ACKNOWLEDGMENTS I would like to express my deepest gratitude to my advisor, Professor Valeria Bertacco. She has affected me in many aspects. She is the most enthusiastic person I have ever met; she always inspires me to achieve more than I could imagine. She has offered an extraordinary research mentorship throughout my Ph.D. journey. She has been consistently available whenever I need her thoughtful advice and keen insight. Also, she has been very patient about my research endeavors, allowing me to navigate different research topics. Her guidance on writing and presentation has been helpful for me to become an outstanding professional after graduation. I also thank my dissertation committee members, Professors Scott Mahlke, Reetuparna Das, and Zhengya Zhang. Their comments on my research projects have been invaluable for me to improve this dissertation. In particular, in-depth discussions with them stimulate me to think out of the box, which in turn strengthens the impacts of the research proposed in the dissertation. Our research group members have helped enjoy a long journey to the degree at the University of Michigan. First of all, I am grateful to Professor Todd Austin for his advice on my research projects. -
A Design Methodology for Soft-Core Platforms on FPGA with SMP Linux
Muttillo et al. EURASIP Journal on Embedded Systems (2016) 2016:15 EURASIP Journal on DOI 10.1186/s13639-016-0051-9 Embedded Systems RESEARCH Open Access A design methodology for soft-core platforms on FPGA with SMP Linux, OpenMP support, and distributed hardware profiling system Vittoriano Muttillo1*, Giacomo Valente1, Fabio Federici1, Luigi Pomante1, Marco Faccio1,CarloTieri2 and Serenella Ferri2 Abstract In recent years, the use of multiprocessor systems has become increasingly common. Even in the embedded domain, the development of platforms based on multiprocessor systems or the porting of legacy single-core applications are frequent needs. However, such designs are often complicated, as embedded systems are characterized by numerous non-functional requirements and a tight hardware/software integration. This work proposes a methodology for the development and validation of an embedded multiprocessor system. Specifically, the proposed method assumes the use of a portable, open source API to support the parallelization and the possibility of prototyping the system on a field-programmable gate array. On this basis, the proposed flow allows an early exploration of the hardware configuration space, a preliminary estimate of performance, and the rapid development of a system able to satisfy the design specifications. An accurate assessment of the actual performance of the system is then enforced by the use of an hardware-based profiling subsystem. The proposed design flow is described, and a version specifically designed for LEON3 processor