IV. Parallel Operating Systems

Total pages: 16

File type: PDF, size: 1020 KB

João Garcia, Paulo Ferreira, and Paulo Guedes
IST/INESC, Lisbon, Portugal

1. Introduction
2. Classification of Parallel Computer Systems
   2.1 Parallel Computer Architectures
   2.2 Parallel Operating Systems
3. Operating Systems for Symmetric Multiprocessors
   3.1 Process Management
   3.2 Scheduling
   3.3 Process Synchronization
   3.4 Examples of SMP Operating Systems
4. Operating Systems for NORMA Environments
   4.1 Architectural Overview
   4.2 Distributed Process Management
   4.3 Parallel File Systems
   4.4 Popular Message-Passing Environments
5. Scalable Shared Memory Systems
   5.1 Shared Memory MIMD Computers
   5.2 Distributed Shared Memory
References

Summary. Parallel operating systems are the interface between parallel computers (or computer systems) and the applications (parallel or not) that are executed on them. They translate the hardware's capabilities into concepts usable by programming languages.

Great diversity marked the beginning of parallel architectures and their operating systems. This diversity has since been reduced to a small set of dominating configurations: symmetric multiprocessors running commodity applications and operating systems (UNIX and Windows NT), and multicomputers running custom kernels and parallel applications. Additionally, there is some (mostly experimental) work done towards the exploitation of the shared memory paradigm on top of networks of workstations or personal computers.

In this chapter, we discuss the operating system components that are essential to support parallel systems and the central concepts surrounding their operation: scheduling, synchronization, multi-threading, inter-process communication, memory management and fault tolerance.

Currently, SMP computers are the most widely used multiprocessors. Users find it a very attractive model to have a computer which, although it derives its processing power from a set of processors, does not require any changes to applications and only minor changes to the operating system. Furthermore, the most popular parallel programming languages have been ported to SMP architectures, also enabling the execution of demanding parallel applications on these machines.

However, users who want to exploit parallel processing to the fullest use those same parallel programming languages on top of NORMA computers. These multicomputers with fast interconnects are the ideal hardware support for message-passing parallel applications. The surviving commercial models with NORMA architectures are very expensive machines, which one will find running calculus-intensive applications such as weather forecasting or fluid dynamics modelling.
We also discuss some of the experiments that have been made both in hardware (DASH, Alewife) and in software systems (TreadMarks, Shasta) to deal with the scalability issues of maintaining consistency in shared-memory systems and to prove their applicability on a large scale.

1. Introduction

Parallel operating systems are primarily concerned with managing the resources of parallel machines. This task faces many challenges: application programmers demand all the performance possible, many hardware configurations exist and change very rapidly, yet the operating system must increasingly be compatible with the mainstream versions used in personal computers and workstations, due both to user pressure and to the limited resources available for developing new versions of these systems. In this chapter, we describe the major trends in parallel operating systems. Since the architecture of a parallel operating system is closely influenced by the hardware architecture of the machines it runs on, we have divided our presentation into three major groups: operating systems for small-scale symmetric multiprocessors (SMP), operating system support for large-scale distributed memory machines, and scalable distributed shared memory machines.

The first group includes the current versions of Unix and Windows NT, where a single operating system centrally manages all the resources. The second group comprehends both large-scale machines connected by special-purpose interconnections and networks of personal computers or workstations, where the operating system of each node locally manages its resources and collaborates with its peers via message passing to globally manage the machine. The third group is composed of non-uniform memory access (NUMA) machines, where each node's operating system manages the local resources and interacts at a very fine-grain level with its peers to manage and share global resources.

We address the major architectural issues in these operating systems, with a greater emphasis on the basic mechanisms that constitute their core: process management, file system, memory management and communications. For each component of the operating system, the presentation covers its architecture, describes some of the interfaces included in these systems to provide programmers with a greater level of flexibility and power in exploiting the hardware, and addresses the implementation efforts to modify the operating system in order to explore the parallelism of the hardware. These topics include support for multi-threading, internal parallelism of the operating system components, distributed process management, parallel file systems, inter-process communication, low-level resource sharing and fault isolation. These features are illustrated with examples from both commercial operating systems and research proposals from industry and academia.

2. Classification of Parallel Computer Systems

2.1 Parallel Computer Architectures

Operating systems were created to present users and programmers with a view of computers that allows them to be used while abstracting away from the details of the machine's hardware and the way it operates. Parallel operating systems in particular enable user interaction with computers with parallel architectures.
The physical architecture of a computer system is therefore an important starting point for understanding the operating system that controls it. There are two famous classifications of parallel computer architectures: Flynn's [Fly72] and Johnson's [Joh88].

2.1.1 Flynn's classification of parallel architectures. Flynn divides computer architectures along two axes, according to the number of data sources and the number of instruction sources that a computer can process simultaneously. This leads to four categories of computer architectures:

SISD - Single Instruction Single Data. This is the most widespread architecture, where a single processor executes a sequence of instructions that operate on a single stream of data. Examples of this architecture are the majority of computers in use nowadays, such as personal computers (with some exceptions such as Intel dual-Pentium machines), workstations and low-end servers. These computers typically run either a version of the Microsoft Windows operating system or one of the many variants of the UNIX operating system (Sun Microsystems' Solaris, IBM's AIX, Linux...), although these operating systems include support for multiprocessor architectures and are therefore not strictly operating systems for SISD machines.

MISD - Multiple Instruction Single Data. No existing computer corresponds to the MISD model, which was included in this description mostly for the sake of completeness. However, we could envisage special-purpose machines that could use this model, for example a "cracker computer" with a set of processors where all are fed the same stream of ciphered data, which each tries to crack using a different algorithm.

SIMD - Single Instruction Multiple Data. The SIMD architecture is used mostly for supercomputer vector processing and calculus. The programming model of SIMD computers is based on distributing sets of homogeneous data among an array of processors. Each processor executes the same operation on a fraction of a data element. SIMD machines can be further divided into pipelined and parallel SIMD. Pipelined SIMD machines execute the same instruction for an array of data elements by fetching the instruction from memory only once and then executing it in an overlapped way on a pipelined high-performance processor (e.g. the Cray 1). In this case, each fraction of a data element (vector) is in a different stage of the pipeline. In contrast to pipelined SIMD, parallel SIMD machines are made from dedicated processing elements connected by high-speed links. The processing elements are special-purpose processors with narrow data paths (some as narrow as one bit) which execute a restricted set of operations. Although there is a design effort to keep processing
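The data-parallel model behind SIMD can be made concrete with a small sketch. The following is a purely illustrative C rendering, not taken from the chapter, of applying a single operation to every element of a homogeneous data set; on a parallel SIMD machine each processing element would apply the operation to its own element, while on a pipelined SIMD machine the elements would stream through the stages of one pipeline.

    #include <stdio.h>

    #define N 8

    /* The "single instruction": scale and offset one element. */
    static float scale_and_offset(float x) { return 2.0f * x + 1.0f; }

    int main(void) {
        float data[N] = {0, 1, 2, 3, 4, 5, 6, 7};
        float result[N];

        /* Sequential rendering of the data-parallel idea: the same operation
           is applied, element by element, to a homogeneous data set. */
        for (int i = 0; i < N; i++)
            result[i] = scale_and_offset(data[i]);

        for (int i = 0; i < N; i++)
            printf("%.1f ", result[i]);
        printf("\n");
        return 0;
    }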
Recommended publications
  • Parallel Programming
    Parallel Computing Hardware. Shared memory: multiple CPUs are attached to the bus; all processors share the same primary memory; the same memory address on different CPUs refers to the same memory location; the CPU-to-memory connection becomes a bottleneck, so shared memory computers cannot scale very well. Distributed memory: each processor has its own private memory; computational tasks can only operate on local data; available memory can grow without bound by adding nodes; requires more difficult programming.
    OpenMP versus MPI. OpenMP (Open Multi-Processing): easy to use; loop-level parallelism; non-loop-level parallelism is more difficult; limited to shared memory computers; cannot handle very large problems. MPI (Message Passing Interface): requires low-level, more difficult programming; scalable in cost/size; can handle very large problems.
    MPI. Distributed memory: each processor can access only the instructions/data stored in its own memory. The machine has an interconnection network that supports passing messages between processors. A user specifies a number of concurrent processes when the program begins. Every process executes the same program, though the flow of execution may depend on the processor's unique ID number (e.g. "if (my_id == 0) then ..."); see the sketch below. Each process performs computations on its local variables, then communicates with other processes (and repeats), to eventually achieve the computed result. In this model, processors pass messages both to send/receive information and to synchronize with one another.
    Introduction to MPI. Communicators and Groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
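    A minimal sketch of the MPI model described in the excerpt, using the standard MPI C API; the rank test mirrors the "if (my_id == 0)" pattern above, and the values exchanged are arbitrary illustration data.

      /* Every process runs this same program; behaviour branches on the rank. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[]) {
          int my_id, n_procs;

          MPI_Init(&argc, &argv);                  /* user launches N concurrent processes */
          MPI_Comm_rank(MPI_COMM_WORLD, &my_id);   /* unique ID of this process */
          MPI_Comm_size(MPI_COMM_WORLD, &n_procs); /* total processes in the communicator */

          if (my_id == 0) {
              /* rank 0 collects one message from every other process */
              for (int src = 1; src < n_procs; src++) {
                  int value;
                  MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  printf("received %d from process %d\n", value, src);
              }
          } else {
              int value = my_id * my_id;           /* computation on local variables */
              MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
          }

          MPI_Finalize();
          return 0;
      }

    With a typical MPI installation this would be built with mpicc and started with something like "mpirun -np 4 ./a.out", giving four concurrent processes.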
  • Real-Time Performance During CUDA™ a Demonstration and Analysis of Redhawk™ CUDA RT Optimizations
    A Concurrent Real-Time White Paper (2881 Gateway Drive, Pompano Beach, FL 33069, (954) 974-1700, www.concurrent-rt.com). Real-Time Performance During CUDA™: A Demonstration and Analysis of RedHawk™ CUDA RT Optimizations. By: Concurrent Real-Time Linux® Development Team, November 2010.
    Overview. There are many challenges to creating a real-time Linux distribution that provides guaranteed low process-dispatch latencies and minimal process run-time jitter. Concurrent Real-Time's RedHawk Linux distribution meets and exceeds these challenges, providing a hard real-time environment on many qualified hardware configurations, even in the presence of a heavy system load. However, there are additional challenges faced when guaranteeing real-time performance of processes while CUDA applications are simultaneously running on the system. The proprietary CUDA driver supplied by NVIDIA® frequently makes demands upon kernel resources that can dramatically impact real-time performance. This paper discusses a demonstration application developed by Concurrent to illustrate that RedHawk Linux kernel optimizations allow hard real-time performance guarantees to be preserved even while demanding CUDA applications are running. The test results will show how RedHawk performance compares to CentOS performance running the same application. The design and implementation details of the demonstration application are also discussed in this paper.
    Demonstration. This demonstration features two selectable real-time test modes: 1. Jitter Mode: measure and graph the run-time jitter of a real-time process. 2. PDL Mode: measure and graph the process-dispatch latency of a real-time process. While the demonstration is running, it is possible to switch between these different modes at any time.
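    The white paper's demonstration code is not reproduced in the excerpt; the following is a rough, self-contained POSIX sketch (with assumed priority and loop sizes) of what a "Jitter Mode" style measurement does: time a fixed unit of work repeatedly under a real-time scheduling policy and report the spread of the observed execution times.

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <time.h>

      /* Current time in microseconds from a monotonic clock. */
      static double now_us(void) {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
      }

      int main(void) {
          struct sched_param sp = { .sched_priority = 50 };
          if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
              perror("sched_setscheduler");        /* needs privileges; otherwise runs non-real-time */

          double min = 1e12, max = 0.0;
          for (int i = 0; i < 100000; i++) {
              double t0 = now_us();
              volatile long acc = 0;               /* fixed unit of work */
              for (long j = 0; j < 10000; j++)
                  acc += j;
              double dt = now_us() - t0;
              if (dt < min) min = dt;
              if (dt > max) max = dt;
          }
          /* run-time jitter: spread between the fastest and slowest iteration */
          printf("min %.2f us, max %.2f us, jitter %.2f us\n", min, max, max - min);
          return 0;
      }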
  • Unit: 4 Processes and Threads in Distributed Systems
    Unit 4: Processes and Threads in Distributed Systems.
    Thread. A program has one or more loci of execution; each execution is called a thread of execution. In traditional operating systems, each process has an address space and a single thread of execution. A thread is the smallest unit of processing that can be scheduled by an operating system; it is a single sequential stream of execution within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Within a process, threads allow multiple streams of execution.
    Thread Structure. A process is used to group resources together, and threads are the entities scheduled for execution on the CPU. A thread has a program counter that keeps track of which instruction to execute next, registers that hold its current working variables, and a stack that contains the execution history, with one frame for each procedure called but not yet returned from. Although a thread must execute in some process, the thread and its process are different concepts and can be treated separately. What threads add to the process model is to allow multiple executions to take place in the same process environment, to a large degree independent of one another. Having multiple threads running in parallel in one process is similar to having multiple processes running in parallel in one computer. Figure: (a) Three processes, each with one thread. (b) One process with three threads. In the former case, the threads share an address space, open files, and other resources; in the latter case, the processes share physical memory, disks, printers and other resources.
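    A minimal pthreads sketch of "one process with three threads" as described above: the global counter lives in the address space shared by all threads, while each thread has its own stack, registers and program counter. Thread count and loop bounds are arbitrary.

      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 3

      static int shared_counter = 0;                    /* shared: lives in the common address space */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg) {
          long id = (long)arg;                          /* private: local variables live on this thread's stack */
          for (int i = 0; i < 1000; i++) {
              pthread_mutex_lock(&lock);                /* synchronize access to the shared data */
              shared_counter++;
              pthread_mutex_unlock(&lock);
          }
          printf("thread %ld done\n", id);
          return NULL;
      }

      int main(void) {
          pthread_t tids[NTHREADS];
          for (long i = 0; i < NTHREADS; i++)
              pthread_create(&tids[i], NULL, worker, (void *)i);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(tids[i], NULL);
          printf("shared_counter = %d\n", shared_counter);  /* 3 * 1000, visible to every thread */
          return 0;
      }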
  • Gpu Concurrency
    GPU Concurrency - Robert Searles, 5/26/2021.
    Execution Scheduling & Management. Pre-emptive scheduling: processes share the GPU through time-slicing; scheduling is managed by the system. Concurrent scheduling: processes run on the GPU simultaneously; the user creates and manages scheduling streams.
    CUDA Concurrency Mechanisms:
                                  Streams          MPS              MIG
      Partition Type              Single process   Logical          Physical
      Max Partitions              Unlimited        48               7
      Performance Isolation       No               By percentage    Yes
      Memory Protection           No               Yes              Yes
      Memory Bandwidth QoS        No               No               Yes
      Error Isolation             No               No               Yes
      Cross-Partition Interop     Always           IPC              Limited IPC
      Reconfigure                 Dynamic          Process launch   When idle
    (MPS: Multi-Process Service; MIG: Multi-Instance GPU)
    CUDA Streams - Stream Semantics. 1. Two operations issued into the same stream will execute in issue order: Operation B issued after Operation A will not begin to execute until Operation A has completed. 2. Two operations issued into separate streams have no ordering prescribed by CUDA: Operation A issued into stream 1 may execute before, during, or after Operation B issued into stream 2. Operation: usually cudaMemcpyAsync or a kernel call; more generally, most CUDA API calls that take a stream parameter, as well as stream callbacks.
    Stream Examples. Host/device execution concurrency:
      Kernel<<<b, t>>>(...);              // this kernel execution can overlap with
      cpuFunction(...);                   // this host code
    Concurrent kernels:
      Kernel<<<b, t, 0, streamA>>>(...);  // these kernels have the possibility
      Kernel<<<b, t, 0, streamB>>>(...);  // to execute concurrently
    In practice, concurrent
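    A minimal host-side sketch of the stream semantics above, restricted to the CUDA runtime C API (no kernel launches) and compiled with nvcc: the two copies issued into streamA execute in issue order, while the copy issued into streamB has no prescribed ordering relative to them and may overlap. Buffer sizes are arbitrary and error checking is reduced to one final query.

      #include <cuda_runtime.h>
      #include <stdio.h>

      int main(void) {
          const size_t n = 1 << 20;
          float *hostA, *hostB, *devA, *devB;
          cudaStream_t streamA, streamB;

          cudaMallocHost((void **)&hostA, n * sizeof(float)); /* pinned memory, needed for true async copies */
          cudaMallocHost((void **)&hostB, n * sizeof(float));
          cudaMalloc((void **)&devA, n * sizeof(float));
          cudaMalloc((void **)&devB, n * sizeof(float));
          cudaStreamCreate(&streamA);
          cudaStreamCreate(&streamB);

          /* same stream: the second copy starts only after the first completes */
          cudaMemcpyAsync(devA, hostA, n * sizeof(float), cudaMemcpyHostToDevice, streamA);
          cudaMemcpyAsync(hostA, devA, n * sizeof(float), cudaMemcpyDeviceToHost, streamA);

          /* separate stream: no ordering prescribed relative to streamA's operations */
          cudaMemcpyAsync(devB, hostB, n * sizeof(float), cudaMemcpyHostToDevice, streamB);

          cudaStreamSynchronize(streamA);                     /* host waits for each stream to drain */
          cudaStreamSynchronize(streamB);
          printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

          cudaStreamDestroy(streamA);
          cudaStreamDestroy(streamB);
          cudaFree(devA); cudaFree(devB);
          cudaFreeHost(hostA); cudaFreeHost(hostB);
          return 0;
      }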
  • Hyper-Threading Technology Architecture and Microarchitecture
    Hyper-Threading Technology Architecture and Microarchitecture. Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller (Desktop Products Group, Intel Corp.), and Michael Upton (CPU Architecture, Desktop Products Group, Intel Corp.). Index words: architecture, microarchitecture, Hyper-Threading Technology, simultaneous multi-threading, multiprocessor.
    ABSTRACT. Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
    INTRODUCTION. The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement - super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches - have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power
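    The software-visible effect described in the abstract (logical processors that the operating system schedules just like physical ones) can be illustrated with a small Linux-specific sketch. It is not taken from the paper and assumes glibc's CPU affinity API.

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void) {
          /* On a Hyper-Threading machine this count includes the logical processors. */
          long n_logical = sysconf(_SC_NPROCESSORS_ONLN);
          printf("OS reports %ld logical processors\n", n_logical);

          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(0, &set);                              /* pin this process to logical CPU 0 */
          if (sched_setaffinity(0, sizeof(set), &set) != 0)
              perror("sched_setaffinity");
          else
              printf("now restricted to logical CPU 0\n");
          return 0;
      }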
  • Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks
    Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks. Timothy G. Armstrong, Zhao Zhang (Department of Computer Science, University of Chicago); Daniel S. Katz, Michael Wilde, Ian T. Foster (Computation Institute, University of Chicago & Argonne National Laboratory). [email protected], [email protected], [email protected], [email protected], [email protected]
    Abstract. In order for many-task applications to be attractive candidates for running on high-end supercomputers, they must be able to benefit from the additional compute, I/O, and communication performance provided by high-end HPC hardware relative to clusters, grids, or clouds. Typically this means that the application should use the HPC resource in such a way that it can reduce time to solution beyond what is possible otherwise. Furthermore, it is necessary to make efficient use of the computational resources, achieving high levels of utilization. Satisfying these twin goals is not trivial, because while the parallelism in many task computations can vary over time, on many large machines the allocation policy requires that worker CPUs be provisioned and also relinquished in large blocks rather than individually.
    ... as a worker and allocate one task per node. If tasks are single-threaded, each core or virtual thread can be treated as a worker. The second feature of many-task applications is an emphasis on high performance. The many tasks that make up the application effectively collaborate to produce some result, and in many cases it is important to get the results quickly. This feature motivates the development of techniques to efficiently run many-task applications on HPC hardware. It allows people to design and develop performance-critical applications in a many-task style and enables the scaling up of existing many-task applications to run on much larger
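    A minimal pthreads sketch (hypothetical worker and task names) of the many-task pattern discussed above: a fixed pool of workers, standing in for one worker per core or node, repeatedly pulls independent tasks from a shared queue. Once fewer tasks remain than workers, the workers that sit idle are exactly the "trailing task" utilization problem the paper addresses.

      #include <pthread.h>
      #include <stdio.h>

      #define NWORKERS 4
      #define NTASKS   10

      static int next_task = 0;
      static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

      static void run_task(int id, long worker) {
          volatile long work = 0;
          for (long i = 0; i < 20000000; i++)      /* stand-in for a real, independent task */
              work += i;
          printf("worker %ld finished task %d\n", worker, id);
      }

      static void *worker_main(void *arg) {
          long worker = (long)arg;
          for (;;) {
              pthread_mutex_lock(&queue_lock);
              int id = next_task < NTASKS ? next_task++ : -1;  /* claim the next task, if any */
              pthread_mutex_unlock(&queue_lock);
              if (id < 0)
                  break;                           /* queue drained: this worker now sits idle */
              run_task(id, worker);
          }
          return NULL;
      }

      int main(void) {
          pthread_t workers[NWORKERS];
          for (long w = 0; w < NWORKERS; w++)
              pthread_create(&workers[w], NULL, worker_main, (void *)w);
          for (int w = 0; w < NWORKERS; w++)
              pthread_join(workers[w], NULL);
          return 0;
      }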
  • Intel's Hyper-Threading
    Intel's Hyper-Threading - Cody Tinker & Christopher Valerino.
    Introduction:
    ● How a thread, the smallest portion of code that can be run independently, is handled plays a core role in enhancing a program's parallelism.
    ● Because modern computers process many tasks and programs simultaneously, techniques that allow threads to be handled more efficiently, lowering processor downtime, are valuable.
    ● The motive is to reduce the number of idle resources of a processor.
    ● Examples of processors that use hyper-threading: Intel Xeon D-1529, Intel i7-6785R, Intel Pentium D1517.
    ● Two main trends have been followed to increase parallelism: increase the number of cores on a chip, and increase core throughput.
    What is Multithreading?
    ● Executing more than one thread at a time. Normally, an operating system handles a program by scheduling individual threads and then passing them to the processor.
    ● Two different types of hardware multithreading: temporal (one thread per pipeline stage) and simultaneous (SMT, multiple threads per pipeline stage). Intel's hyper-threading is an SMT design.
    Hyper-threading:
    ● Hyper-threading is the hardware solution to increasing processor throughput by decreasing resource idle time.
    ● It allows multiple concurrent threads to be executed: threads are interleaved so that resources not being used by one thread are used by others. Most processors are limited to 2 concurrent threads per physical core; some support 8 concurrent threads per physical core.
    ● Needs the ability to fetch instructions from
  • COSC 6385 Computer Architecture - Multi-Processors (IV) Simultaneous Multi-Threading and Multi-Core Processors Edgar Gabriel Spring 2011
    COSC 6385 Computer Architecture - Multi-Processors (IV): Simultaneous multi-threading and multi-core processors. Edgar Gabriel, Spring 2011.
    Moore's Law: long-term trend in the number of transistors per integrated circuit; the number of transistors doubles every ~18 months. Source: http://en.wikipedia.org/wki/Images:Moores_law.svg
    What do we do with that many transistors? Optimizing the execution of a single instruction stream through:
    - Pipelining: overlap the execution of multiple instructions. Example: all RISC architectures; Intel x86 underneath the hood.
    - Out-of-order execution: allow instructions to overtake each other in accordance with code dependencies (RAW, WAW, WAR). Example: all commercial processors (Intel, AMD, IBM, SUN).
    - Branch prediction and speculative execution: reduce the number of stall cycles due to unresolved branches. Example: (nearly) all commercial processors.
    - Multi-issue processors: allow multiple instructions to start execution per clock cycle. Superscalar (Intel x86, AMD, ...) vs. VLIW architectures.
    - VLIW/EPIC architectures: allow compilers to indicate independent instructions per issue packet. Example: Intel Itanium series.
    - Vector units: allow for the efficient expression and execution of vector operations (see the sketch after this list). Example: SSE, SSE2, SSE3, SSE4 instructions.
    Limitations of optimizing a single instruction
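    As referenced in the vector-units item above, a minimal sketch of what such units expose to the programmer, using SSE intrinsics (assumes an x86 compiler with SSE enabled, e.g. gcc -msse): one packed instruction adds four single-precision floats at once, where the scalar loop needs four separate additions.

      #include <immintrin.h>
      #include <stdio.h>

      int main(void) {
          float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

          /* scalar version: four separate add instructions */
          for (int i = 0; i < 4; i++)
              c[i] = a[i] + b[i];

          /* vector version: one 128-bit add on four packed floats */
          __m128 va = _mm_loadu_ps(a);
          __m128 vb = _mm_loadu_ps(b);
          __m128 vc = _mm_add_ps(va, vb);
          _mm_storeu_ps(c, vc);

          printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
          return 0;
      }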
  • IMTEC-91-58 High-Performance Computing: Industry Uses Of
    United States General Accounting Office (GAO) Report to Congressional Requesters: HIGH-PERFORMANCE COMPUTING - Industry Uses of Supercomputers and High-Speed Networks. GAO/IMTEC-91-58.
    United States General Accounting Office, Washington, D.C. 20548. Information Management and Technology Division. B-244488. July 30, 1991.
    The Honorable Ernest F. Hollings, Chairman, Senate Committee on Commerce, Science, and Transportation. The Honorable Al Gore, Chairman, Subcommittee on Science, Technology, and Space, Senate Committee on Commerce, Science, and Transportation. The Honorable George E. Brown, Jr., Chairman, House Committee on Science, Space, and Technology. The Honorable Robert S. Walker, Ranking Minority Member, House Committee on Science, Space, and Technology. The Honorable Tim Valentine, Chairman, Subcommittee on Technology and Competitiveness, House Committee on Science, Space, and Technology. The Honorable Tom Lewis, Ranking Minority Member, Subcommittee on Technology and Competitiveness, House Committee on Science, Space, and Technology.
    This report responds to your October 2, 1990, and March 11, 1991, requests for information on supercomputers and high-speed networks. You specifically asked that we:
    • provide examples of how various industries are using supercomputers to improve products, reduce costs, save time, and provide other benefits;
    • identify barriers preventing the increased use of supercomputers; and
    • provide examples of how certain industries are using and benefitting from high-speed networks.
    As agreed with the Senate Committee on Commerce, Science, and Transportation, and Subcommittee on Science, Technology, and Space, our review of supercomputers examined five industries - oil, aerospace, automobile, and chemical and pharmaceutical.
  • Introduction to Parallel Computing
    Introduction to Parallel Computing. Plamen Krastev, Office: 38 Oxford, Room 117, Email: [email protected], FAS Research Computing, Harvard University.
    Objectives: to introduce you to the basic concepts and ideas in parallel computing; to familiarize you with the major programming models in parallel computing; to provide you with guidance for designing efficient parallel programs.
    Outline: Introduction to Parallel Computing / High Performance Computing (HPC); concepts and terminology; parallel programming models; parallelizing your programs; parallel examples.
    What is High Performance Computing? [Slide image: Pravetz 82 and 8M, Bulgarian Apple clones. Image credit: flickr.] The Odyssey supercomputer is the major computational resource of FAS RC: 2,140 nodes / 60,000 cores, 14 petabytes of storage. Using the world's fastest and largest computers to solve large and complex problems.
    Serial Computation: Traditionally software has been written for serial computations: to be run on a single computer having a single Central Processing Unit (CPU); a problem is broken into a discrete set of instructions; instructions are executed one after another; only one instruction can be executed at any moment in time.
    Parallel Computing: In the simplest sense, parallel
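    A minimal OpenMP sketch (assumes a compiler with OpenMP support, e.g. gcc -fopenmp) of the step from the serial model above to parallel computation: the same loop, with its iterations divided among the available threads and the partial sums combined by a reduction.

      #include <omp.h>
      #include <stdio.h>

      #define N 1000000

      int main(void) {
          double sum = 0.0;

          /* iterations are split across threads; each thread keeps a private
             partial sum which the reduction clause combines at the end */
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < N; i++)
              sum += 1.0 / (i + 1.0);

          printf("harmonic sum of %d terms = %f (up to %d threads)\n",
                 N, sum, omp_get_max_threads());
          return 0;
      }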
  • Computer-System Architecture (Cont.) Symmetrically Constructed Clusters (Cont.) Advantages: 1
    Computer-System Architecture (cont.) - Symmetrically Constructed Clusters (cont.). Advantages: 1. Greater computational power by running applications concurrently on all computers in the cluster. Disadvantages: 1. Applications must be written specifically to take advantage of the cluster by using a technique known as parallelization. Parallelization - dividing a program into separate components that run in parallel on individual computers in the cluster. [Figure: general structure of a clustered system - computers attached to an interconnect and to a storage area network (SAN).]
    Operating-System Structure. The operating system provides the environment within which programs are executed. Operating systems must have the ability to multiprogram. (A single program cannot, in general, keep either the CPU or the I/O devices busy at all times.) Single users frequently have multiple programs running. Multiprogramming increases CPU utilization by organizing jobs (code and data) so that the CPU always has one to execute. The operating system keeps several jobs in memory (on disk) simultaneously, which is called a job pool. This pool consists of all processes residing on disk awaiting allocation of main memory (to be executed). [Figure: memory layout for the multiprogramming system - the operating system at address 0, followed by jobs 1 through 5, up to 512M.] The set of jobs in memory can be a subset of the jobs kept in the job pool. The operating system picks and begins to execute one of the jobs in memory. Eventually, the job may have to wait for some task, such as an I/O operation, to complete.
  • Lect. 9: Multithreading
    Lect. 9: Multithreading
    ▪ Memory latencies, and even latencies to lower-level caches, are becoming longer with respect to processor cycle times.
    ▪ There are basically 3 ways to hide/tolerate such latencies by overlapping computation with the memory access: dynamic out-of-order scheduling, prefetching, and multithreading.
    ▪ OOO execution and prefetching allow overlap of computation and memory access within the same thread (these were covered in CS3 Computer Architecture).
    ▪ Multithreading allows overlap of memory access of one thread/process with computation by another thread/process.
    Blocked Multithreading - basic idea:
    - Recall multi-tasking: on I/O, a process is context-switched out of the processor by the OS. [Timeline figure: process 1 runs until a system call for I/O, the OS interrupt handler runs, process 2 runs, and process 1 resumes on I/O completion.]
    - With multithreading, a thread/process is context-switched out of the pipeline by the hardware on longer-latency operations. [Timeline figure: process 1 runs until a long-latency operation, a hardware context switch brings in process 2, and a later switch resumes process 1.]
    - Unlike in multi-tasking, context is still kept in the processor and the OS is not aware of any changes.
    - Context switch overhead is minimal (usually only a few cycles).
    - Unlike in multi-tasking, the completion of the long-latency operation does not trigger a context switch (the blocked thread is simply marked