UC Santa Cruz UC Santa Cruz Electronic Theses and Dissertations


Title: Section Based Program Analysis to Reduce Overhead of Detecting Unsynchronized Thread Communication
Permalink: https://escholarship.org/uc/item/8vx8d8wv
Author: Das, Madan Mohan
Publication Date: 2015
Peer reviewed | Thesis/dissertation

eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA
SANTA CRUZ

SECTION BASED PROGRAM ANALYSIS TO REDUCE OVERHEAD OF DETECTING UNSYNCHRONIZED THREAD COMMUNICATION

A dissertation submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER ENGINEERING
by
Madan Mohan Das
March 2015

The Dissertation of Madan Mohan Das is approved:
Professor Jose Renau, Chair
Professor Cormac Flanagan
Professor Anujan Varma
Tyrus Miller, Vice Provost and Dean of Graduate Studies

Copyright © by Madan Mohan Das 2015

Table of Contents

List of Figures
List of Tables
Abstract
Acknowledgments
1 Introduction
    1.1 Contributions
    1.2 Thesis Organization
2 Related Work
    2.1 Static Race Detection
    2.2 Dynamic Race Detection
    2.3 Deterministic Runtime Systems
    2.4 Software Transactional Memory (STM)
    2.5 Data Flow Analysis
    2.6 Pointer Analysis Background
        2.6.1 Flow sensitive vs. insensitive pointer analysis
        2.6.2 Context sensitive vs. insensitive pointer analysis
        2.6.3 Dynamic vs. Static pointer analysis
    2.7 Some important Flow insensitive methods
        2.7.1 Andersen's flow insensitive analysis
        2.7.2 Steensgaard's algorithm
3 Finding Disjoint Thread Sections
    3.1 Terminologies
    3.2 SBPA Pointer Analysis Framework
        3.2.1 Perform modref analysis per section
        3.2.2 Adaptive non-unification for points-to sets of function arguments
        3.2.3 Field sensitivity for array elements
    3.3 Constructing the Reduced ICFG
    3.4 Single-Threaded Thread Sections (Single-TS)
    3.5 Disjoint Thread Sections (Disjoint-TS)
    3.6 Overall Instrumentation Flow
4 Programmer Annotations and MTROM
    4.1 Marking Parallel Code Sections
    4.2 Multi Thread Read Only Memory
5 Loop Invariant Log Motion
    5.1 Scalar Loop Invariant Log Motion (SLILM)
    5.2 Vector Loop Invariant Log Motion (VLILM)
    5.3 Result of Applying LILM
6 Experimental Results
    6.1 Experimental Setup
    6.2 Results
        6.2.1 Overall Results
        6.2.2 Analysis of Reduction in Instrumentation
        6.2.3 Benchmark Insights
        6.2.4 Compilation Overhead
7 A case study with ThreadSanitizer
8 Improving Static Race Precision with SBPA
    8.1 Methodology
    8.2 An Example Case
    8.3 Results
9 Conclusion
10 Future Work
    10.1 Symbolic Array Partitioning
    10.2 Generalized SBPA and MTROM
        10.2.1 Extension of SBPA to Tree of Threads
        10.2.2 MTROM for Tree of Threads and Dynamic MTROM
    10.3 Hardware Implementation of MTROM
Bibliography

List of Figures

1.1 Two threads T1 and T2 execute four different code sections (CS1, CS2, CS3, CS4) separated by a barrier B. The barrier implicitly creates two thread sections (TS1, TS2) that cannot execute simultaneously. WX and RX represent write and read of some memory location X. Since X is not modified anywhere in TS2, RX does not need to be instrumented.
1.2 A typical threaded section of program.
3.1 A code snippet showing the call of a function f, that has any of the thread creation, join or synchronization directive. The resulting changes in RICFG are shown on the right.
3.2 A program comprised of 4 code sections in 3 thread sections.
3.3 A decomposition of threaded section into smaller, disjoint parallel segments.
3.4 Two threads executing 5 code sections in 3 disjoint thread sections.
6.1 SBPA compiler pass detected 63% of all memory accesses at run-time as non-conflicting.
    Excluding the improvements from 'Directives' yields 51% of accesses proven as non-conflicting.
6.2 SBPA identified 80% of the dynamic memory accesses executed in single threaded mode.
6.3 68% of the loads are detected as non-conflicting, with a few applications reaching 100%.
6.4 61% of the stores do not need to be tracked by tools like data-race detectors, which is much better than 38% with Base. For STMs that may have restarts, 52% of the stores can be proven as safe.
6.5 Compilation times of Base and SBPA normalized against compile time when no optimization is applied. On average, Base took 75% and SBPA took 46% of unoptimized compile time.
7.1 SBPA yields over 2 times speedup compared to Tsan, with some applications achieving over 30 times speedup.
7.2 ThreadSanitizer speed-ups in the different modes described earlier. SBPA, combined with Directives, speeds up ThreadSanitizer execution by a factor of 2.74.
8.1 Static race detection flow with SBPA. Static races detected are classified and further validated by dynamic race detection.
8.2 Percentages of total read and write lines identified as non-racy by Base and SBPA techniques.
8.3 Percentages of read lines identified as non-racy by Base and SBPA techniques.
8.4 Percentages of write lines identified as non-racy by Base and SBPA techniques.
8.5 Percentages of read and write instructions identified as non-racy by Base and SBPA techniques.
8.6 Percentages of read instructions identified as non-racy by Base and SBPA techniques.
8.7 Percentages of write instructions identified as non-racy by Base and SBPA techniques.
10.1 A hierarchy of threads in a program, represented as a tree.

List of Tables

5.1 Total access logging reduction improvement with LILM.
5.2 Load logging reduction with LILM.
6.1 Benchmarks used, abbreviation shown in parentheses.
7.1 Runtimes (in seconds) of executables compiled for ThreadSanitizer dynamic race detection with 2 threads.
8.1 Total racy read and write lines, and total program lines for the benchmarks studied.

Abstract

Section Based Program Analysis to Reduce Overhead of Detecting Unsynchronized Thread Communication
by Madan Mohan Das

Most systems that test and verify parallel programs, such as data race detectors and software transactional memory systems, require instrumenting loads and stores in an application. This can cause a very significant runtime and memory overhead compared to executing uninstrumented code. Multithreaded programming typically allows any thread to perform loads and stores at any location in the process's address space independently. Most of these unsynchronized memory accesses are non-conflicting in nature; that is, the values read from or written to memory are only used by a single thread.

We propose Section-Based Program Analysis (SBPA), a novel way to decompose the program into disjoint sections to identify non-conflicting loads and stores during program compilation. We combine SBPA with improved context sensitive alias analysis, loop specific optimizations and a few user directives to further increase the effectiveness of SBPA. We implemented SBPA for a deterministic execution runtime environment, and were able to eliminate 63% of dynamic memory access instrumentations. We also integrated SBPA with ThreadSanitizer, a state of the art dynamic race detector, and achieved a speed-up of 2.74 times on a geometric mean basis. Lastly, we show that SBPA is also effective in static race detection.

Acknowledgments

First, I sincerely thank my adviser Professor Jose Renau for his ideas, vision, constant support and feedback on my research over the last many years. Without his help I would not have been able to reach this milestone in my life. I would also like to thank Professor Cormac Flanagan and Professor Anujan Varma for agreeing to review this work and for providing valuable feedback. This thesis would not have been complete without their input.
I also thank Gabriel Southern, my fellow researcher and co-author in my publications, for his great effort and support for the work presented in this thesis, and for bringing positive thoughts during times of difficulty. I also sincerely thank the other MASC lab students for their valuable input and feedback during the tenure of my studies. Finally, I thank my wife, son and daughter for being supportive of my decision to pursue this degree and for understanding my inability at times to carry out family obligations.

Chapter 1

Introduction

Since the beginning of the computer era, researchers have explored several parallel execution models and architectures. While significant progress has been achieved in parallelizing the execution of multiple programs on the same system, and in distributed multi-processing where the outcome of the program depends on the solution of well-segregated sub-problems, parallel programming in the context of a single program with shared memory remains difficult for programmers to adopt. This is most conspicuous in the fact that, even today, most programmers first think of their programs as single-threaded applications and often develop them as such, and only then consider parallelizing the program as an afterthought.

There are multiple reasons for this behavior. Most prominent among them is that software programmers find it difficult to visualize the parallel execution of many different program fragments. Secondly, the potentially exponential number of possible interleavings of the program fragment executions makes it almost impossible to contemplate and debug the unpredictable bugs, or as many have said, the "heisen-bugs".
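
To give a sense of scale for the interleaving explosion mentioned above, the sketch below counts possible schedules under simplifying assumptions that are ours rather than the thesis's: each thread is a straight-line sequence of atomic steps, and every interleaving that preserves each thread's own program order can occur. Under those assumptions the count is the multinomial coefficient (n1 + ... + nk)! / (n1! ... nk!). The function name `interleavings` is hypothetical and used only for illustration.

```python
from math import factorial


def interleavings(*lengths: int) -> int:
    """Count the interleavings of straight-line threads.

    Each argument is the number of atomic steps in one thread
    (an illustrative model, not taken from the thesis). The result
    is the multinomial coefficient (n1+...+nk)! / (n1! * ... * nk!).
    """
    total = factorial(sum(lengths))
    for n in lengths:
        # Each partial quotient is itself an integer, so // is exact.
        total //= factorial(n)
    return total


if __name__ == "__main__":
    # Two threads of equal length: growth is already very steep.
    for steps in (2, 5, 10, 20):
        print(steps, interleavings(steps, steps))
```

Even in this idealized model, two threads of only ten steps each admit 184,756 distinct schedules, which suggests why exhaustively reasoning about, or reproducing, one particular failing schedule is impractical.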