CIS 501 Computer Architecture, Unit 10: Hardware Multithreading (MT)


This Unit: Multithreading (MT)
• Why multithreading (MT)?
  • Utilization vs. performance
• Three implementations
  • Coarse-grained MT (CGMT)
  • Fine-grained MT (FGMT)
  • Simultaneous MT (SMT)
[Figure: the course's usual system stack: Application, OS, Compiler, Firmware, CPU/I/O/Memory, Digital Circuits, Gates & Transistors]

Readings
• H+P: Chapter 3.5-3.6
• Paper: Tullsen et al., "Exploiting Choice…"

Performance and Utilization
• Performance (IPC) is important
• Utilization (actual IPC / peak IPC) is important too
• Even moderate superscalars (e.g., 4-way) are not fully utilized
  • Average sustained IPC: 1.5-2 → less than 50% utilization
  • Causes: mis-predicted branches, cache misses (especially L2), data dependences
• Multithreading (MT)
  • Improves utilization by multiplexing multiple threads on a single CPU
  • If one thread cannot fully utilize the CPU, maybe 2, 4 (or 100) can

Superscalar Under-utilization
• Time evolution of the issue slots of a 4-issue processor: during a cache miss, all slots sit idle
[Figure: issue slots over time for a plain superscalar; a cache miss leaves a long stretch of empty slots]

Simple Multithreading
• Fill the idle slots with instructions from another thread
• Where does it find a thread? Same problem as multi-core
  • Same shared-memory abstraction
[Figure: the same issue-slot diagram with the miss cycles filled by a second thread]

Latency vs. Throughput
• MT trades (single-thread) latency for throughput
  – Sharing the processor degrades the latency of individual threads
  + But improves the aggregate latency of both threads
  + Improves utilization
• Example
  • Thread A: individual latency = 10s, latency with thread B = 15s
  • Thread B: individual latency = 20s, latency with thread A = 25s
  • Sequential latency (first A then B, or vice versa): 30s
  • Parallel latency (A and B simultaneously): 25s
  – MT slows each thread by 5s
  + But improves total latency by 5s
• Different workloads have different parallelism
  • SpecFP has lots of ILP (can use an 8-wide machine)
  • Server workloads have TLP (can use multiple threads)

MT Implementations: Similarities
• How do multiple threads share a single processor?
  • Different sharing mechanisms for different kinds of structures, depending on what kind of state a structure stores
• No state: ALUs
  • Dynamically shared
• Persistent hard state (aka "context"): PC, registers
  • Replicated
• Persistent soft state: caches, branch predictor
  • Dynamically partitioned (as on a multi-programmed uni-processor)
  • TLBs need thread IDs; caches and bpred tables don't
  • Exception: ordered "soft" state (BHR, RAS) is replicated
• Transient state: pipeline latches, ROB, reservation stations
  • Partitioned … somehow

MT Implementations: Differences
• Main question: thread scheduling policy
  • When to switch from one thread to another?
• Related question: pipeline partitioning
  • How exactly do threads share the pipeline itself?
• The choice depends on
  • What kind of latencies (specifically, what lengths) you want to tolerate
  • How much single-thread performance you are willing to sacrifice
• Three designs
  • Coarse-grain multithreading (CGMT)
  • Fine-grain multithreading (FGMT)
  • Simultaneous multithreading (SMT)

The Standard Multithreading Picture
• Time evolution of the issue slots; color = thread (a toy model of this picture follows below)
[Figure: side-by-side issue-slot diagrams for Superscalar, CGMT, FGMT, and SMT]
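The standard picture can be made concrete with a back-of-the-envelope model. The Python sketch below is illustrative only: the per-cycle ready-instruction counts, the two-thread traces, and the simplified policies are all invented here (CGMT is omitted to keep it short). It just counts filled issue slots under a plain superscalar, strict round-robin FGMT, and SMT.

```python
# Back-of-the-envelope issue-slot model for the standard picture. Each
# thread is a list of per-cycle ready-instruction counts (0 = stalled, say
# on a cache miss); the machine issues up to WIDTH insns per cycle. The
# traces and policy details below are invented for illustration.

WIDTH = 4
threads = [
    [3, 1, 0, 0, 0, 4, 2, 1],   # thread A: stalled cycles 2-4 (an L2 miss)
    [2, 2, 3, 1, 0, 2, 4, 2],   # thread B
]

def utilization(issued):
    return sum(issued) / (WIDTH * len(issued))

def superscalar(ts):
    # One thread only: slots go empty on stalls and on low-ILP cycles.
    return [min(WIDTH, ready) for ready in ts[0]]

def fgmt(ts):
    # Strict round-robin, "L2 miss or no": with few threads it can land on
    # a stalled thread, which is why FGMT wants many threads.
    return [min(WIDTH, ts[c % len(ts)][c]) for c in range(len(ts[0]))]

def smt(ts):
    # Fill each cycle's slots from all threads: recovers horizontal waste too.
    issued = []
    for c in range(len(ts[0])):
        slots = WIDTH
        for t in ts:
            slots -= min(slots, t[c])
        issued.append(WIDTH - slots)
    return issued

for name, policy in [("superscalar", superscalar), ("FGMT", fgmt), ("SMT", smt)]:
    got = policy(threads)
    print(f"{name:11s} issued/cycle={got} utilization={utilization(got):.0%}")
```

On these invented traces the lone thread fills 34% of the slots, two-thread round-robin FGMT manages only 38% because it sometimes lands on the stalled thread (exactly why FGMT wants many threads), and SMT reaches 69% by also filling leftover slots within each cycle.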
Coarse-Grain Multithreading (CGMT)
• Coarse-grain multithreading (CGMT)
  + Sacrifices very little single-thread performance (of one thread)
  – Tolerates only long latencies (e.g., L2 misses)
• Thread scheduling policy
  • Designate a "preferred" thread (e.g., thread A)
  • Switch to thread B on a thread A L2 miss
  • Switch back to A when A's L2 miss returns
• Pipeline partitioning
  • None; flush on a switch
  – Can't tolerate latencies shorter than twice the pipeline depth
  • Need a short in-order pipeline for good performance
• Example: IBM Northstar/Pulsar
[Figure: pipeline with per-thread register files and a thread scheduler; an "L2 miss?" signal triggers the switch]

Fine-Grain Multithreading (FGMT)
• Fine-grain multithreading (FGMT)
  – Sacrifices significant single-thread performance
  + Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)
• Thread scheduling policy
  • Switch threads every cycle (round-robin), L2 miss or no
  • Multiple threads are in the pipeline at once
• Pipeline partitioning
  • Dynamic, no flushing
  • Pipeline length doesn't matter so much
  – Needs a lot of threads, and many threads → many register files
• Extreme example: Denelcor HEP
  • So many threads (100+) that it didn't even need caches
  • Failed commercially; not popular today
[Figure: pipeline with one register file per thread and a round-robin thread scheduler]

Vertical and Horizontal Under-Utilization
• FGMT and CGMT reduce vertical under-utilization
  • Loss of all slots in an issue cycle
• They do not help with horizontal under-utilization
  • Loss of some slots in an issue cycle (in a superscalar processor)
[Figure: issue-slot diagrams over time for CGMT, FGMT, and SMT]

Simultaneous Multithreading (SMT)
• What can issue insns from multiple threads in one cycle?
  • The same thing that issues insns from multiple parts of the same program: out-of-order execution
• Simultaneous multithreading (SMT): OOO + FGMT
  • Aka "hyper-threading"
  • Observation: once insns are renamed, the scheduler doesn't care which thread they come from (well, for non-loads at least)
• SMT: replicate the map table, share a (larger) physical register file
• Some examples
  • IBM Power5: 4-way issue, 2 threads
  • Intel Pentium4: 3-way issue, 2 threads
  • Intel "Nehalem": 4-way issue, 2 threads
  • Alpha 21464: 8-way issue, 4 threads (canceled)
  • Notice a pattern? #threads (T) * 2 = issue width (N)
[Figure: OOO pipeline with per-thread map tables and a shared physical register file]

SMT Resource Partitioning
• Physical regfile and insn buffer entries are shared at fine grain
  • They are physically unordered, so fine-grain sharing is possible
• How are physically ordered structures (ROB/LSQ) shared?
  – Fine-grain sharing would entangle commit (and squash)
  • Allowing threads to commit independently is important

Static & Dynamic Resource Partitioning
• Static partitioning
  • T equal-sized contiguous partitions
  ± No starvation, but sub-optimal utilization (fragmentation)
• Dynamic partitioning
  • P > T partitions; available partitions are assigned on a need basis
  ± Better utilization, but possible starvation
  • ICOUNT: a fetch policy that prefers the thread with the fewest in-flight insns (see the sketch below)
• Couple both with larger ROBs/LSQs
[Figure: statically partitioned vs. dynamically partitioned ROB/LSQ entries across threads]
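ICOUNT is from the Tullsen et al. paper in the readings. Below is a minimal sketch of the core heuristic; the Thread bookkeeping, fetch width, and random completion model are invented for illustration (the real policy also handles fetching from multiple threads per cycle).

```python
# Minimal sketch of the ICOUNT fetch heuristic from Tullsen et al.: each
# cycle, fetch from the thread with the fewest instructions in flight, so
# a stalled thread naturally stops receiving fetch bandwidth and shared
# queue entries. The fetch width and the completion model are invented.

import random

FETCH_WIDTH = 4

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.in_flight = 0      # fetched but not yet committed insns

def icount_pick(threads):
    # Prefer the thread with the fewest in-flight instructions.
    return min(threads, key=lambda t: t.in_flight)

threads = [Thread(0), Thread(1)]
random.seed(0)                  # reproducible toy run
for cycle in range(8):
    chosen = icount_pick(threads)
    chosen.in_flight += FETCH_WIDTH          # fetch this cycle's insns
    for t in threads:                        # some insns commit each cycle
        t.in_flight = max(0, t.in_flight - random.randint(0, 3))
    print(f"cycle {cycle}: fetch T{chosen.tid}, "
          f"in-flight = {[t.in_flight for t in threads]}")
```

Preferring the lowest count means a stalled thread stops being fetched automatically: its in-flight count stays high, so fetch bandwidth and shared queue entries flow to the threads that are actually making progress.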
Multithreading Issues
• Shared soft state (caches, branch predictors, TLBs, etc.)
  • Key example: cache interference
  • A general concern for all MT variants
  • Can the working sets of multiple threads fit in the caches?
  • Shared-memory SPMD threads help here
    + Same insns → shared I$
    + Shared data → less D$ contention
  • MT is good for workloads with shared insns/data
  • To keep miss rates low, SMT might need a larger L2 (which is OK): out-of-order execution tolerates L1 misses
• Large physical register file (and map table)
  • #physical registers = (#threads * #arch-regs) + #in-flight insns
  • #map table entries = #threads * #arch-regs

Notes About Sharing Soft State
• Caches are shared naturally…
  • Physically tagged: address translation distinguishes different threads
• … but TLBs need explicit thread IDs to be shared
  • Virtually tagged: entries of different threads are otherwise indistinguishable
  • Thread IDs are only a few bits: enough to identify on-chip contexts
• Thread IDs make sense in the BTB (branch target buffer)
  • BTB entries are already large; a few extra bits per entry won't matter
  • Another thread's target prediction → automatic mis-prediction
• … but not in the BHT (branch history table)
  • BHT entries are small; a few extra bits per entry is huge overhead
  • Another thread's direction prediction → mis-prediction not automatic
• Ordered soft state should be replicated
  • Examples: branch history register (BHR), return address stack (RAS)
  • Otherwise it becomes meaningless… fortunately, it is typically small

Multithreading vs. Multicore
• If you wanted to run multiple threads, would you build…
  • a multicore: multiple separate pipelines?
  • a multithreaded processor: a single larger pipeline?
• Both will get you throughput on multiple threads
  • A multicore's cores will be simpler, with a possibly faster clock
  • SMT will get you better performance (IPC) on a single thread: SMT is basically an ILP engine that converts TLP into ILP

Research: Speculative Multithreading
• Speculative multithreading
  • Uses multiple threads/processors for single-thread performance
  • Speculatively parallelizes sequential loops that might not be parallel
  • Processing elements (PEs) are arranged in a logical ring
  • The compiler or hardware assigns iterations to consecutive PEs
  • Hardware tracks logical order to detect mis-parallelization (sketched below)
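To make the mis-parallelization check concrete, here is a toy model: consecutive iterations are assigned to a ring of PEs and their read/write sets are compared in logical order. The batch-commit scheme, the set representation, and the squash policy are invented simplifications; real designs (e.g., Multiscalar) detect violations in hardware as accesses happen, using structures like an address resolution buffer.

```python
# Toy model of speculative multithreading. Consecutive loop iterations are
# assigned to a ring of PEs and run in parallel; bookkeeping over each
# iteration's read/write sets detects cross-iteration dependence violations
# (mis-parallelization), squashing the violating iteration and everything
# logically after it. All structures here are invented simplifications.

NUM_PES = 4

def run_speculative(iterations):
    """iterations: list of (read_set, write_set) pairs, in program order."""
    i, squashes = 0, 0
    while i < len(iterations):
        batch = iterations[i:i + NUM_PES]   # next iterations, one per PE
        violated = None
        for later in range(1, len(batch)):
            for earlier in range(later):
                # A logically-later iteration reading an address that a
                # logically-earlier one writes means they could not safely
                # run in parallel: a cross-iteration RAW dependence.
                if batch[later][0] & batch[earlier][1]:
                    violated = later
                    break
            if violated is not None:
                break
        if violated is None:
            i += len(batch)        # whole batch commits in logical order
        else:
            squashes += 1
            i += violated          # commit the independent prefix, redo the rest
    return squashes

# Loop-carried: iteration i reads a[i-1], writes a[i] -> squashes constantly.
dep = [({f"a[{i-1}]"}, {f"a[{i}]"}) for i in range(8)]
# Independent: iteration i reads b[i], writes a[i] -> parallelizes cleanly.
par = [({f"b[{i}]"}, {f"a[{i}]"}) for i in range(8)]
print(run_speculative(dep), run_speculative(par))   # -> 7 0
```

The contrast between the two inputs is the whole point of the technique: a truly independent loop commits in clean batches and gets a near-NUM_PES speedup, while a loop-carried dependence forces repeated squashes and degrades toward sequential execution.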