Multi-Threading for Multi-Core Architectures


Multithreading and Parallel Microprocessors
Stephen Jenks
Electrical Engineering and Computer Science
[email protected]
[Title-slide images: Intel Core Duo and AMD Athlon 64 X2]

Mostly Worked on Clusters

Also Build Really Big Displays
• HIPerWall: 200 million pixels
• 50 displays
• 30 Power Mac G5s

Outline
• Parallelism in microprocessors
• Multicore processor parallelism
• Parallel programming for shared memory
  • OpenMP
  • POSIX threads
  • Java threads
• Parallel microprocessor bottlenecks
• Parallel execution models to address bottlenecks
  • Memory interface
  • Cache-to-cache (coherence) interface
• Current and future CMP technology

Parallelism in Microprocessors
• Pipelining is the most prevalent form
  • Developed in the 1960s
  • Used in everything, even microcontrollers
  • Decreases cycle time
  • Allows up to 1 instruction per cycle (IPC)
  • No programming changes
  • Some Pentium 4s have more than 30 stages!
[Figure: pipeline stages Fetch, Decode, Register Access, ALU, and Write Back, separated by buffers]

More Microprocessor Parallelism
• Superscalar execution allows instruction-level parallelism (ILP)
  • Replace the single ALU with multiple functional units
  • Dispatch several instructions at once
• Out-of-order execution
  • Execute based on data availability
  • Requires a reorder buffer
• More than 1 IPC
• No program changes
[Figure: the single ALU becomes multiple functional units: FP, INT, INT, and Load/Store]

Thread-Level Parallelism
• Simultaneous multithreading (SMT)
  • Execute instructions from several threads at the same time
  • Intel Hyper-Threading, IBM Power 5/6, Cell
• Chip multiprocessors (CMP)
  • More than 1 CPU per chip
  • AMD Athlon 64 X2, Intel Core Duo, IBM Power 4/5/6, Xenon, Cell
[Figure: an SMT core sharing Int, FP, and L/S units between Thread 1 and Thread 2; a CMP with CPU1 and CPU2 sharing an L2 cache and a system/memory interface]

Chip Multiprocessors
• Several CPU cores
  • Independent execution
  • Symmetric (for now)
• Shared memory hierarchy
  • Private L1 caches
  • Shared L2 cache (Intel Core) or private L2 caches kept coherent via crossbar (AMD)
  • Shared memory interface
  • Shared system interface
• Lower clock speed
• Shared resources can help or hurt!
[Die photos of the Intel Core Duo and AMD Athlon 64 X2; images from Intel and AMD]
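The deck does not show how a program discovers how many cores a chip multiprocessor exposes, so here is a minimal sketch (not from the slides) using the sysconf call available on Linux and most Unix-like systems. The thread counts used in the later pthreads examples could be chosen this way.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Number of processors currently online (cores, or hardware threads on SMT parts) */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncpus < 1)
            ncpus = 1;  /* fall back to one worker if the query fails */
        printf("Detected %ld online processor(s); using %ld worker thread(s)\n", ncpus, ncpus);
        return 0;
    }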
Quad Cores Today
[Figure: block diagrams of three quad-core configurations, labeled Core 2 Xeon (Mac Pro), Dual-Core Opteron, and Core 2 Quad/Extreme, showing CPU cores, L2 caches, system/memory interfaces, a frontside bus with external memory controller, and the Opteron's on-chip memory controller and HyperTransport link]

Shared Memory Parallel Programming
• Could just run multiple programs at once (multiprogramming)
  • Good idea, but long tasks still take long
• Need to partition work among processors
  • Implicitly (get the compiler to do it)
    • Intel C/C++/Fortran compilers do pretty well
    • OpenMP code annotations help
    • Not reasonable for complex code
  • Explicitly (thread programming)
• Primary needs
  • Scientific computing
  • Media encoding and editing
  • Games

Multithreading
• Definitions
  • Process: a program in execution, with CPU state (registers, PC), resources, and an address space
  • Thread: a lightweight process, with its own CPU state and stack, sharing resources and address space with other threads in the same process
• Thread operations
  • Create / spawn
  • Join / destroy
  • Suspend and resume
• Uses
  • Solve a problem together (divide and conquer)
  • Do different things: manage the game economy, NPC actions, screen drawing, sound, input handling

OpenMP Programming Model
• Implicit parallelism with source code annotations:

    #pragma omp parallel for private (i,k)
    for (i = 0; i < nx; i++)
        for (k = 0; k < nz; k++) {
            ez[i][0][k] = 0.0;
            ez[i][1][k] = 0.0;
            …
        }

• The compiler reads the pragma and parallelizes the loop
  • Partitions work among threads (one per CPU)
  • Variables i and k are private to each thread
  • Other variables (the ez array, for example) are shared across all threads
  • Can force parallelization of "unsafe" loops

Thread Pitfalls
• Shared data
  • Two threads each perform A = A + 1:
    Thread 1: 1) load A into R1, 2) add 1 to R1, 3) store R1 to A
    Thread 2: 1) load A into R1, 2) add 1 to R1, 3) store R1 to A
  • Mutual exclusion preserves correctness: locks/mutexes, semaphores, monitors, Java "synchronized"
• False sharing
  • Non-shared data packed into the same cache line (e.g., int thread1data; int thread2data;)
  • The cache line ping-pongs between CPUs when the threads access their own data
• Locks for heap access
  • malloc() is expensive because of mutual exclusion
  • Use private heaps
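The false-sharing pitfall above is easy to reproduce. The sketch below is my own illustration, not from the deck: two threads each increment only their own counter, but without the padding the counters land on the same cache line and the line bounces between cores. It assumes pthreads and a 64-byte cache line; on many multi-core machines the padded layout runs noticeably faster than the unpadded one.

    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64          /* assumed cache line size in bytes */
    #define ITERS 100000000L

    /* Padded so each counter sits on its own cache line; remove the pad to see false sharing. */
    struct padded_counter {
        volatile long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[id].value++;   /* each thread touches only its own counter */
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[2];
        for (long t = 0; t < 2; t++)
            pthread_create(&tids[t], NULL, worker, (void *)t);
        for (int t = 0; t < 2; t++)
            pthread_join(tids[t], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }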
POSIX Threads
• IEEE 1003.4 (Portable Operating System Interface) committee
• Lightweight "threads of control"/processes operating within a single address space
• A typical "process" contains a single thread in its address space
• Threads run concurrently and allow
  • Overlapping I/O and computation
  • Efficient use of multiprocessors
• Also called pthreads

Concept of Operation
1. When the program starts, the main thread is running
2. The main thread spawns child threads as needed
3. The main thread and child threads run concurrently
4. Child threads finish and join with the main thread
5. The main thread terminates when the process ends

Approximate Pi with pthreads

    /* the thread control function */
    void* PiRunner(void* param)
    {
        int threadNum = (int) param;
        int i;
        double h, sum, mypi, x;

        printf("Thread %d starting.\n", threadNum);

        h = 1.0 / (double) iterations;
        sum = 0.0;
        for (i = threadNum + 1; i <= iterations; i += threadCount) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        /* now store the result into the result array */
        resultArray[threadNum] = mypi;

        printf("Thread %d exiting.\n", threadNum);
        pthread_exit(0);
    }

More Pi with pthreads: main()

    /* get the default attributes and set up for creation */
    for (i = 0; i < threadCount; i++) {
        pthread_attr_init(&attrs[i]);
        /* system-wide contention */
        pthread_attr_setscope(&attrs[i], PTHREAD_SCOPE_SYSTEM);
    }

    /* create the threads */
    for (i = 0; i < threadCount; i++) {
        pthread_create(&tids[i], &attrs[i], PiRunner, (void*)i);
    }

    /* now wait for the threads to exit */
    for (i = 0; i < threadCount; i++)
        pthread_join(tids[i], NULL);

    pi = 0.0;
    for (i = 0; i < threadCount; i++)
        pi += resultArray[i];
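The two slide fragments above rely on globals and setup that the deck does not show. Below is a self-contained version so the example can be compiled and run as-is (for instance with cc -pthread pi.c); the globals iterations, threadCount, and resultArray, the MAX_THREADS limit, and the command-line handling are my assumptions, and the attribute-scope setup from the slide is omitted for brevity.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_THREADS 64

    static int iterations = 10000000;       /* number of rectangles in the integration */
    static int threadCount = 2;             /* how many worker threads to spawn */
    static double resultArray[MAX_THREADS]; /* one partial sum per thread */

    /* the thread control function: integrate 4/(1+x^2) over this thread's strips */
    static void *PiRunner(void *param)
    {
        int threadNum = (int)(long)param;
        double h = 1.0 / (double)iterations;
        double sum = 0.0;

        for (int i = threadNum + 1; i <= iterations; i += threadCount) {
            double x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        resultArray[threadNum] = h * sum;   /* store this thread's partial result */
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        pthread_t tids[MAX_THREADS];
        double pi = 0.0;

        if (argc > 1)
            threadCount = atoi(argv[1]);
        if (threadCount < 1 || threadCount > MAX_THREADS)
            threadCount = 2;

        for (long i = 0; i < threadCount; i++)
            pthread_create(&tids[i], NULL, PiRunner, (void *)i);
        for (int i = 0; i < threadCount; i++)
            pthread_join(tids[i], NULL);

        for (int i = 0; i < threadCount; i++)
            pi += resultArray[i];
        printf("pi is approximately %.15f using %d threads\n", pi, threadCount);
        return 0;
    }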
Java Threads
• Threading and synchronization are built in
• An object can have an associated thread
  • Subclass Thread or implement Runnable
  • The "run" method is the thread body
  • "synchronized" methods provide mutual exclusion
• Main program
  • Calls the "start" method of Thread objects to spawn them
  • Calls "join" to wait for completion

Parallel Microprocessor Problems
• The memory interface was already too slow for one core/thread
• Now multiple threads access memory simultaneously, overwhelming the memory interface
• Parallel programs can run as slowly as sequential ones!
[Figure: then, a single CPU with its own path to memory; now, two CPUs sharing an L2 cache and a single system/memory interface]

Our Solution: Producer/Consumer Parallelism Using the Cache
[Figure: conventional approach, two threads each doing half the work and both bottlenecked on memory; producer/consumer approach, a producer thread and a consumer thread communicating through the cache]

Converting to Producer/Consumer

    for (i = 1; i < nx - 1; i++) {
        for (j = 1; j < ny - 1; j++) {
            /* Update Magnetic Field */
            for (k = 1; k < nz - 1; k++) {
                double invmu = 1.0/mu[i][j][k];
                double tmpx = rx*invmu;
                double tmpy = ry*invmu;
                double tmpz = rz*invmu;
                hx[i][j][k] += tmpz * (ey[i][j][k+1] - ey[i][j][k])
                             - tmpy * (ez[i][j+1][k] - ez[i][j][k]);
                hy[i][j][k] += tmpx * (ez[i+1][j][k] - ez[i][j][k])
                             - tmpz * (ex[i][j][k+1] - ex[i][j][k]);
                hz[i][j][k] += tmpy * (ex[i][j+1][k] - ex[i][j][k])
                             - tmpx * (ey[i+1][j][k] - ey[i][j][k]);
            }
            /* Update Electric Field */
            for (k = 1; k < nz - 1; k++) {
                double invep = 1.0/ep[i][j][k];
                double tmpx = rx*invep;
                double tmpy = ry*invep;
                double tmpz = rz*invep;
                ex[i][j][k] += tmpy * (hz[i][j][k] - hz[i][j-1][k])
                             - tmpz * (hy[i][j][k] - hy[i][j][k-1]);
                ey[i][j][k] += tmpz * (hx[i][j][k] - hx[i][j][k-1])
                             - tmpx * (hz[i][j][k] - hz[i-1][j][k]);
                ez[i][j][k] += tmpx * (hy[i][j][k] - hy[i-1][j][k])
                             - tmpy * (hx[i][j][k] - hx[i][j-1][k]);
            }
        }
    }

Synchronized Pipelined Parallelism Model (SPPM)
[Figure: conventional spatial decomposition versus the producer/consumer (SPPM) arrangement]

SPPM Features
• Benefits
  • Memory bandwidth is the same as the sequential version
  • Performance improvement (usually)
  • Easy in concept
• Drawbacks
  • Complex programming
  • Some synchronization overhead
  • Not always faster than SDM (or sequential)

SPPM Performance (Normalized)
[Charts: normalized performance for FDTD and a red-black equation solver]

So What's Up With AMD CPUs?
• How can SPPM be slower than sequential?
  • Fetching from the other core's cache is slower than fetching from memory!
  • That makes the consumer slower than the producer!
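The deck describes the producer/consumer structure but does not show its synchronization code. The sketch below is my own illustration of one way to pipeline the FDTD loop from the "Converting to Producer/Consumer" slide with pthreads: the producer updates the magnetic field one i-plane at a time and publishes its progress, and the consumer updates the electric field for a plane only after that plane's magnetic-field update has finished, so the data it reads is still warm in the shared cache. The names update_h_plane, update_e_plane, and NX are hypothetical stand-ins for the slide's loop bodies and grid size.

    #include <pthread.h>
    #include <stdio.h>

    #define NX 64   /* hypothetical grid size in x; stands in for nx in the slide code */

    /* Hypothetical helpers: each would contain the corresponding j/k loop nest from the
     * "Converting to Producer/Consumer" slide, restricted to a single i-plane. */
    static void update_h_plane(int i) { (void)i; /* magnetic-field update for plane i */ }
    static void update_e_plane(int i) { (void)i; /* electric-field update for plane i */ }

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static int h_done = 0;   /* highest i-plane whose magnetic-field update has finished */

    /* Producer: sweeps the grid updating the magnetic field and publishes progress. */
    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 1; i < NX - 1; i++) {
            update_h_plane(i);
            pthread_mutex_lock(&lock);
            h_done = i;
            pthread_cond_signal(&done);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Consumer: waits until the producer has finished plane i, then updates the electric
     * field for plane i while the freshly written magnetic-field data is still cached. */
    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 1; i < NX - 1; i++) {
            pthread_mutex_lock(&lock);
            while (h_done < i)
                pthread_cond_wait(&done, &lock);
            pthread_mutex_unlock(&lock);
            update_e_plane(i);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("one FDTD timestep pipelined across two threads\n");
        return 0;
    }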