Computer Hardware Engineering

Total Pages: 16

File Type: pdf, Size: 1020 KB

Computer Hardware Engineering, IS1200, spring 2016
Lecture 13: SIMD, MIMD, and Parallel Programming
David Broman, Associate Professor, KTH Royal Institute of Technology, and Assistant Research Engineer, University of California, Berkeley
Slides version 1.0

Course Structure
Module 1: C and Assembly Programming; Module 2: I/O Systems; Module 3: Logic Design; Module 4: Processor Design; Module 5: Memory Hierarchy; Module 6: Parallel Processors and Programs. This lecture (LE13) belongs to Module 6.

Abstractions in Computer Systems
From top to bottom: networked systems and systems of systems; application software and the operating system (software); the instruction set architecture (the hardware/software interface); the microarchitecture and logic and building blocks (digital hardware design); digital circuits; analog circuits, devices, and physics (analog design and physics).

Agenda
Part I: SIMD, Multithreading, and GPUs (data-level parallelism, DLP).
Part II: MIMD, Multicore, and Clusters (task-level parallelism, TLP).
Part III: Parallelization in Practice (DLP + TLP).

Part I: SIMD, Multithreading, and GPUs
Acknowledgement: the structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

SISD, SIMD, and MIMD (Revisited)
Architectures are classified by instruction stream (single or multiple) and data stream (single or multiple):
SISD (single instruction, single data): e.g., the Intel Pentium 4.
SIMD (single instruction, multiple data): e.g., SSE in x86. SIMD exploits data-level parallelism; examples are multimedia extensions (e.g., SSE, the Streaming SIMD Extension) and vector processors.
MISD (multiple instruction, single data): no examples today.
MIMD (multiple instruction, multiple data): e.g., the Intel Core i7. MIMD exploits task-level parallelism; examples are multicore and cluster computers.
Graphics processing units (GPUs) are both SIMD and MIMD.

Subword Parallelism and Multimedia Extensions
Subword parallelism is when a wide data word is operated on in parallel. This is the same as SIMD or data-level parallelism: one instruction operates on multiple data items, for example four 32-bit values at once.
MMX (MultiMedia eXtension): the first SIMD extension by Intel, introduced in Pentium processors in 1997; integers only.
3DNow!: AMD's extension, which added single-precision floating point (1998).
SSE (Streaming SIMD Extension): introduced by Intel in the Pentium III (1999); included single-precision FP.
AVX (Advanced Vector Extension): supported by both Intel and AMD (processors available in 2011); added support for 256-bit registers and double-precision FP.
NEON: the multimedia extension for ARMv7 and ARMv8 (32 registers that are 8 bytes wide, or 16 registers that are 16 bytes wide).
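As a concrete illustration of subword parallelism, here is a minimal C sketch (my own example, not from the slides; it assumes a CPU and compiler with AVX support and uses the intrinsics from immintrin.h). A single SIMD operation adds four packed doubles, which typically maps to the vaddpd instruction discussed on the next slide.

```c
/* Hedged sketch: add four doubles at a time with AVX intrinsics.
 * Assumes AVX hardware; compile with e.g. gcc -mavx avx_add.c */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    double a[4] = { 1.0,  2.0,  3.0,  4.0};
    double b[4] = {10.0, 20.0, 30.0, 40.0};
    double c[4];

    __m256d va = _mm256_loadu_pd(a);    /* load 4 packed doubles into a 256-bit register */
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_add_pd(va, vb); /* one operation adds all 4 element pairs (vaddpd) */
    _mm256_storeu_pd(c, vc);            /* store the 4 results consecutively in memory */

    for (int i = 0; i < 4; i++)
        printf("%.1f\n", c[i]);         /* prints 11.0 22.0 33.0 44.0 */
    return 0;
}
```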
Streaming SIMD Extension (SSE) and Advanced Vector Extension (AVX)
In SSE (and the later SSE2), assembly instructions use a two-operand format. For example, addpd %xmm0, %xmm4 means %xmm4 = %xmm4 + %xmm0; note the reversed operand order, with the destination written last. The registers (e.g., %xmm4) are 128 bits wide in SSE/SSE2. The suffix "pd" means packed double-precision FP: the instruction operates on as many FP values as fit in the register.
AVX introduced a three-operand format, added the prefix "v" (for vector) to distinguish AVX instructions from SSE, and renamed the registers to %ymm, which are 256 bits wide. For example, vaddpd %ymm0, %ymm1, %ymm4 means %ymm4 = %ymm0 + %ymm1, and vmovapd %ymm4, (%r11) moves the result to the memory address stored in %r11 (a 64-bit register), storing the four 64-bit FP values in consecutive order in memory.

Vector Processors
Vector processors are an older but elegant form of SIMD, dating back to the vector supercomputers of the 1970s (Seymour Cray). They have large vector registers: for example, 32 vector registers, each holding 64 elements of 64 bits. With sequential code in a loop (MUL, ADD, loop branch, etc., repeated for each element), the pipeline stalls between dependent instructions for every element. In a vector processor, operations are pipelined and stall only once per vector operation. Efficient use of memory and low instruction bandwidth give good energy characteristics.

Recall the idea of a multi-issue uniprocessor
A multi-issue uniprocessor executes only one hardware thread (context switching must be done in software). Typically, all functional units cannot be fully utilized by a single-threaded program; in the slide diagram, white space marks an unused issue slot/functional unit.

Hardware Multithreading
In a multithreaded processor, several hardware threads share the same functional units. The purpose of multithreading is to hide latencies and avoid stalls due to cache misses and the like. Coarse-grained multithreading switches threads only at costly stalls, e.g., last-level cache misses, and cannot overcome the throughput losses of short stalls. Fine-grained multithreading switches between hardware threads every cycle, giving better utilization.

Simultaneous Multithreading (SMT)
Simultaneous multithreading (SMT) combines multithreading with a multiple-issue, dynamically scheduled pipeline. It can fill the holes that multiple issue alone cannot utilize with cycles from other hardware threads, and thus gives better utilization. Example: Hyper-Threading is Intel's name for its implementation of SMT. That is why a processor can have 2 real cores while the OS shows 4 cores (4 hardware threads).
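To see SMT from the software side, the following minimal sketch (my own, assuming a POSIX system; not from the slides) asks the operating system how many logical processors, i.e. hardware threads, are online and starts one software thread per hardware thread. On a 2-core processor with Hyper-Threading it would report 4.

```c
/* Hedged sketch: one software thread per hardware thread reported by the OS.
 * Assumes POSIX; compile with e.g. gcc smt_demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *work(void *arg) {
    printf("software thread %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN); /* online logical processors (hardware threads) */
    printf("hardware threads: %ld\n", n);

    if (n < 1) n = 1;
    if (n > 64) n = 64;                     /* keep the example bounded */
    pthread_t tid[64];
    for (long i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```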
Graphics Processing Units (GPUs)
A graphics processing unit (GPU) utilizes multithreading, MIMD, SIMD, and ILP. The main form of parallelism it exploits is data-level parallelism. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model from NVIDIA. All parallelism is expressed as CUDA threads, which is why the model is also called Single Instruction Multiple Thread (SIMT). A GPU consists of a set of multithreaded SIMD processors (called streaming multiprocessors in NVIDIA terminology), for instance 16 of them. The main idea is to execute a massive number of threads and to use multithreading to hide latency; however, the latest GPUs also include caches.

Part II: MIMD, Multicore, and Clusters
Acknowledgement: the structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

Shared Memory Multiprocessor (SMP)
A shared memory multiprocessor (SMP) has a single physical address space across all processors. An SMP is almost always the same as a multicore processor: in the slide diagram, each processor core has its own L1 cache, and the cores share an L2 cache and main memory. In a uniform memory access (UMA) multiprocessor, the latency of accessing memory does not depend on which processor makes the access. In a nonuniform memory access (NUMA) multiprocessor, memory can be divided between processors, resulting in different latencies. Processors (cores) in an SMP communicate via shared memory.

Cache Coherence
The cores' local caches could lead to different cores seeing different values for the same memory address. This is called the cache coherence problem. Time step 1: Core 1 reads memory position X, and the value is stored in Core 1's cache. Time step 2: Core 2 reads memory position X, and the value is stored in Core 2's cache. Time step 3: Core 1 writes to memory, and Core 2 now sees the incorrect (stale) value.

Snooping Protocol
Cache coherence can be enforced using a cache coherence protocol, for instance a write-invalidate protocol such as the snooping protocol. In the slide example, Core 2 reads memory position X, and Core 2 then tries to read memory position X again.
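As a software-level view of cores communicating through shared memory on an SMP, here is a hedged sketch (my own, not from the slides; it uses POSIX threads and C11 atomics). One thread writes a value to memory position X and sets a flag; the other spins on the flag and then reads X. The hardware's cache coherence protocol is what makes the writer's updates visible in the reader's cache, while the acquire/release atomics keep the compiler and CPU from reordering the two accesses.

```c
/* Hedged sketch: shared-memory communication between two threads on an SMP.
 * Assumes POSIX and C11; compile with e.g. gcc -std=c11 coherence.c -lpthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;                 /* "memory position X", written by one core */
static atomic_int ready = 0;     /* flag telling the other core that X is valid */

static void *writer(void *arg) {
    (void)arg;
    data = 42;                                            /* write to shared memory */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                 /* spin until the write is visible */
    printf("reader sees data = %d\n", data);              /* coherence delivers the up-to-date 42 */
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```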