Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters∗

Total Pages: 16

File Type: PDF, Size: 1020 KB

Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters∗

Marc Berndl, Benjamin Vitale, Mathew Zaleski and Angela Demke Brown
University of Toronto, Systems Research Lab
{berndl,matz,bv,demke}@cs.toronto.edu

Abstract

Direct-threaded interpreters use indirect branches to dispatch bytecodes, but deeply-pipelined architectures rely on branch prediction for performance. Due to the poor correlation between the virtual program's control flow and the hardware program counter, which we call the context problem, direct threading's indirect branches are poorly predicted by the hardware, limiting performance. Our dispatch technique, context threading, improves branch prediction and performance by aligning hardware and virtual machine state. Linear virtual instructions are dispatched with native calls and returns, aligning the hardware and virtual PC. Thus, sequential control flow is predicted by the hardware return stack. We convert virtual branching instructions to native branches, mobilizing the hardware's branch prediction resources. We evaluate the impact of context threading on both branch prediction and performance using interpreters for Java and OCaml on the Pentium and PowerPC architectures. On the Pentium IV, our technique reduces mean mispredicted branches by 95%. On the PowerPC, it reduces mean branch stall cycles by 75% for OCaml and 82% for Java. Due to reduced branch hazards, context threading reduces mean execution time by 25% for Java and by 19% and 37% for OCaml on the P4 and PPC970, respectively. We also combine context threading with a conservative inlining technique and find its performance comparable to that of selective inlining.

∗This research was supported by NSERC, IBM CAS and CITO.

1 Introduction

Interpretation is a powerful tool for implementing programming language systems. It facilitates interactive program development and debugging, compact and portable deployment, and simple language prototyping. This combination of features makes interpreted languages attractive in many settings, but their applicability is constrained by poor performance compared to native code. Consequently, many important systems, such as Sun's HotSpot [1] and IBM's production Java virtual machine [21], run in mixed mode, compiling and executing some parts of a program while interpreting others. Baseline interpreter performance thus continues to be relevant.

Recently, Ertl and Gregg observed that the performance of otherwise efficient direct-threaded interpretation is limited by pipeline stalls and flushes due to extremely poor indirect branch prediction [8]. Modern pipelined architectures, such as the Pentium IV (P4) and the PowerPC (PPC), must keep their pipelines full to perform well. Hardware branch predictors use the native PC to exploit the highly-biased branches found in typical (native code) CPU workloads [3,12]. Direct-threaded virtual machine (VM) interpreters, however, are not typical workloads. Their branches' targets are unbiased and therefore unpredictable [8]. For an interpreted program, it is the virtual program counter (or vPC) that is correlated with control flow. We therefore propose to organize the interpreter so that the native PC correlates with the vPC, exposing virtual control flow to the hardware.

We introduce a technique based on subroutine threading, once popular in early interpreters for languages like Forth. To leverage return address stack prediction, we implement each virtual instruction as a subroutine which ends in a native return instruction. Note, however, that these subroutines are not full-fledged functions in the sense of a higher-level programming language such as C (no register save/restore, stack frame creation, etc.). When the instructions of a virtual program are loaded by the interpreter, we translate them to a sequence of call instructions, one per virtual instruction, whose targets are these subroutines. Virtual instructions are then dispatched simply by natively executing this sequence of calls. The key to the effectiveness of this simple approach is that at dispatch time, the native PC is perfectly correlated with the virtual PC. Thus, for non-branching bytecodes, the return address stack in modern processors reliably predicts the address of the next bytecode to execute. Because the next dynamic instruction is not generally the next static instruction in the virtual program, branches pose a greater challenge. For these virtual instructions, we provide a limited form of specialized inlining, replacing indirect with relative branches, thus exposing virtual branches to the hardware's branch predictors.
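To make the loader's translation step concrete, here is a minimal sketch, not taken from the paper, of emitting a subroutine-threaded dispatch sequence on x86: one native call per virtual instruction, each body ending in a native return. The op_push11/op_push7/op_mul bodies, the emit_call_sequence helper, and the mmap-based code buffer are illustrative assumptions; operand fetching, vPC maintenance, and the virtual branch handling discussed above are omitted, and the sketch assumes the OS permits a writable and executable mapping with the bodies reachable by a 32-bit displacement.

    /* Hedged sketch: build a subroutine-threaded dispatch sequence by emitting
     * one native x86 "call rel32" per virtual instruction, followed by "ret". */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef void (*body_fn)(void);

    /* Stand-in opcode bodies.  In the paper's scheme the operands stay in the
     * DTT and the bodies advance the vPC; that bookkeeping is omitted here,
     * so the two pushes are specialized for brevity. */
    static int  vstack[8];
    static int  vsp = 0;
    static void op_push11(void) { vstack[vsp++] = 11; }
    static void op_push7(void)  { vstack[vsp++] = 7;  }
    static void op_mul(void)    { vsp--; vstack[vsp - 1] *= vstack[vsp]; }

    static body_fn emit_call_sequence(body_fn *bodies, int n)
    {
        size_t   size = 5u * (size_t)n + 1;       /* 5 bytes per call, 1 byte ret */
        uint8_t *buf  = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        uint8_t *p = buf;
        for (int i = 0; i < n; i++) {
            intptr_t disp = (intptr_t)(uintptr_t)bodies[i]
                          - (intptr_t)(uintptr_t)(p + 5);  /* rel32 is relative to the next insn */
            if (disp != (int32_t)disp) {
                munmap(buf, size);                /* body not reachable with rel32 */
                return NULL;
            }
            int32_t rel = (int32_t)disp;
            *p++ = 0xE8;                          /* opcode of "call rel32" */
            memcpy(p, &rel, 4);
            p += 4;
        }
        *p = 0xC3;                                /* ret: hand control back to the interpreter */
        return (body_fn)(void *)buf;
    }

    int main(void)
    {
        body_fn program[] = { op_push11, op_push7, op_mul };
        body_fn run = emit_call_sequence(program, 3);
        if (run != NULL) {
            run();                                /* each dispatch is a native call/return pair */
            printf("result: %d\n", vstack[0]);    /* prints 77 */
        }
        return 0;
    }

Because every static virtual instruction gets its own call site, the return after each body is predicted by the hardware return-address stack, which is exactly the alignment of native and virtual PC the paper exploits.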
We review techniques for virtual instruction dispatch in interpreters, describe their performance problems, and define the context problem in Section 2. Then, we discuss other work on improving dispatch performance in Section 3. We provide relevant details of our implementations in a Java virtual machine (SableVM) and the OCaml interpreter on the Pentium IV and PowerPC architectures in Section 4. Using an extensive suite of benchmarks for both Java and OCaml, we evaluate context threading in Section 5.

This paper makes the following contributions:

• We introduce a new dispatch technique for virtual machine interpreters that dramatically improves branch prediction, and demonstrate that our technique does not depend on a specific language or CPU architecture.

• We show context threading is effective. On both the P4 and the PPC970, it eliminates 25% of the mean elapsed time of Java benchmarks, with individual benchmarks running twice as fast as with direct threading. For OCaml, we achieve a 20% reduction in the mean execution time on the P4 and a 35% reduction on the PPC970, with some benchmarks achieving as much as 40% and 50%, respectively.

• We show that context threading is compatible with inlining, using a simple heuristic that we call Tiny Inlining. On OCaml, we achieve speedups relative to direct threading of at least 10% over context threading alone. On Java, we perform as well as or better than SableVM's implementation of selective inlining.

2 The Context Problem

An interpreter executes a virtual program by dispatching its virtual instructions in sequence. The current instruction is indicated by a virtual program counter, or vPC. The virtual program is a list of virtual instructions, each consisting of an opcode and zero or more operands. The exact representation depends on the dispatch technique used.

A typical switch-dispatched interpreter, implemented in C, is a for loop which fetches the opcode at vPC, and then executes a switch statement, each opcode being implemented by a separate case block (the opcode body, or body). However, switch dispatch is considerably slower than the state-of-the-art, direct-threaded dispatch [7].

As shown in Figure 1, a direct-threaded interpreter represents virtual program instructions as a list of addresses. Each address points to the opcode body. We refer to this list as the Direct Threading Table, or DTT. The operands are also stored in this list, immediately after the corresponding opcode address. The vPC points into the DTT to indicate the instruction to be executed.

[Figure 1. Direct Threaded VM Interpreter: the example program (push 11, push 7, mul, print, done) is translated by the loader into the DTT as 0: &INST_PUSH, 1: 11, 2: &INST_PUSH, 3: 7, 4: &INST_MUL, 5: &INST_PRINT, 6: &INST_DONE; each opcode implementation body (INST_PUSH, INST_MUL, INST_PRINT, INST_GOTO, INST_DONE) ends with goto *vPC++, and a virtual program stack sits alongside.]

The actual dispatch to the next instruction is accomplished by the "goto *vPC++" at the end of each opcode body, which is supported by GNU C's labels-as-values extension. In Figure 2, we show the assembly code corresponding to this dispatch statement for the Pentium IV and PowerPC architectures. The overhead of these instructions is comparable to the execution time of small bodies in some VMs.

    (a) Pentium IV assembly:
        mov eax = (r1)   ; r1 is vPC
        addl 4, r1
        jmp *eax

    (b) PowerPC assembly:
        lwz r2 = 0(r1)
        mtctr r2
        addi r1, r1, 4
        bctr

    Figure 2. Direct Threaded Dispatch
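Figure 1's organization can be written down directly in GNU C. The following self-contained sketch (an illustration, not the paper's interpreter code) builds the DTT for the example program out of label addresses and inline operands and dispatches with the goto *vPC++ idiom described above; it assumes a compiler that supports the labels-as-values extension.

    #include <stdio.h>

    int main(void)
    {
        long stack[16];
        int  sp = 0;

        /* Direct Threading Table: opcode body addresses with operands stored
         * inline, mirroring Figure 1 (push 11, push 7, mul, print, done). */
        void *dtt[] = {
            &&INST_PUSH, (void *)11L,
            &&INST_PUSH, (void *)7L,
            &&INST_MUL,
            &&INST_PRINT,
            &&INST_DONE,
        };
        void **vPC = dtt;

        goto **vPC++;                     /* dispatch the first virtual instruction */

    INST_PUSH:
        stack[sp++] = (long)*vPC++;       /* operand follows the opcode address in the DTT */
        goto **vPC++;                     /* the indirect dispatch branch discussed in the text */
    INST_MUL:
        sp--;
        stack[sp - 1] *= stack[sp];
        goto **vPC++;
    INST_PRINT:
        printf("%ld\n", stack[sp - 1]);   /* prints 77 for the example program */
        goto **vPC++;
    INST_DONE:
        return 0;
    }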
When executing the indirect branch in Figure 2(a), the Pentium IV will speculatively dispatch instructions using a predicted target address. The PowerPC uses a different strategy for indirect branches, as shown in Figure 2(b). First the target address is loaded into a register, and then a branch is executed to this register address. Rather than speculate, the PowerPC stalls until the target address is known, although other instructions may be scheduled between the load and the branch to reduce or eliminate these stalls.

Stalling and incorrect speculation are serious pipeline hazards. To perform at full speed, modern CPUs need to keep their pipelines full by correctly predicting branch targets. Indirect branch predictors assume that the branch destination is highly correlated with the address of the indirect branch instruction itself. As observed by Ertl [8,9], this assumption is usually wrong for direct-threaded interpreter workloads. In a direct-threaded implementation, there is only one jump instruction per virtual opcode implemented. For example, in Figure 1, there are two instances of INST_PUSH. In the context of vPC=0, the dispatch at the end of the INST_PUSH body results in a native indirect branch back to the start of the INST_PUSH body (since the next virtual instruction at vPC=2 is also an INST_PUSH). However, the target of the same native indirect branch in the context of vPC=2 is determined by the address stored …

… stack. On the other hand, Curley found subroutine threading faster on the 68000 [5]. On modern hardware the cost of the call and return is much lower, due to return branch prediction hardware, while the cost of direct threading has increased due to misprediction. In Section 5 we demonstrate this effect on several modern CPUs. Superinstructions reduce the number of dispatches.
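The context problem illustrated by the INST_PUSH example above can also be replayed in a small toy model (not taken from the paper): the dynamic dispatch trace of Figure 1's program is run against a last-target predictor attached to each body's single indirect branch, and compared with subroutine threading, where straight-line dispatches are returns that the hardware return-address stack predicts correctly. The opcode names and the one-entry-per-branch predictor are illustrative assumptions.

    #include <stdio.h>
    #include <string.h>

    #define MAXOPS 8

    static const char *btb_key[MAXOPS];      /* one entry per body's dispatch branch */
    static const char *btb_target[MAXOPS];
    static int btb_entries = 0;

    /* Return the target last taken by the dispatch branch at the end of the
     * given body (NULL if that branch has not executed yet), then record the
     * new target, mimicking a last-target indirect branch predictor. */
    static const char *predict_and_update(const char *body, const char *actual)
    {
        for (int i = 0; i < btb_entries; i++) {
            if (strcmp(btb_key[i], body) == 0) {
                const char *predicted = btb_target[i];
                btb_target[i] = actual;
                return predicted;
            }
        }
        btb_key[btb_entries] = body;
        btb_target[btb_entries] = actual;
        btb_entries++;
        return NULL;
    }

    int main(void)
    {
        /* Dynamic opcode trace of the program in Figure 1. */
        const char *trace[] = { "PUSH", "PUSH", "MUL", "PRINT", "DONE" };
        int n = 5, direct_misses = 0;

        /* Direct threading: each body ends in its own indirect branch whose
         * predictor entry remembers only the last observed target. */
        for (int i = 1; i < n; i++) {
            const char *predicted = predict_and_update(trace[i - 1], trace[i]);
            if (predicted == NULL || strcmp(predicted, trace[i]) != 0)
                direct_misses++;
        }

        /* Subroutine threading: sequential dispatches are returns predicted by
         * the return-address stack, so this straight-line trace sees none. */
        int subroutine_misses = 0;

        printf("direct threading:     %d of %d dispatches mispredicted\n", direct_misses, n - 1);
        printf("subroutine threading: %d of %d dispatches mispredicted\n", subroutine_misses, n - 1);
        return 0;
    }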
Recommended publications
  • Java in Embedded Linux Systems
Java in Embedded Linux Systems. Thomas Petazzoni / Michael Opdenacker, Free Electrons, http://free-electrons.com/. Created with OpenOffice.org 2.x. © Copyright 2004–2008, Free Electrons, [email protected]. Document sources, updates and translations: http://free-electrons.com/articles/java. Corrections, suggestions, contributions and translations are welcome! Rights to copy: Attribution – ShareAlike 2.5. You are free to copy, distribute, display, and perform the work, to make derivative works, and to make commercial use of the work, under the following conditions: Attribution (you must give the original author credit) and Share Alike (if you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one). For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the copyright holder. Your fair use and other rights are in no way affected by the above. License text: http://creativecommons.org/licenses/by-sa/2.5/legalcode. Best viewed with a recent PDF reader or with OpenOffice.org itself! Take advantage of internal …
  • An Architectural Trail to Threaded-Code Systems
Some programmers question certain claims made for threaded-code systems. The author defends these claims by tracing such a system from its origin in simple concepts.

An Architectural Trail to Threaded-Code Systems. Peter M. Kogge, IBM Federal Systems Division.

Interest in software systems based on threaded-code concepts has grown remarkably in recent years. Advocates of such systems regularly make claims for them that many classical programmers regard as bordering on the impossible. Typically, proponents claim that it is possible to construct, in 5K to 10K bytes of code, a software package that
* is conversational like APL, Lisp, or Basic;
* includes a compile facility with many high-order language, structured-programming constructs;
* exhibits performance very close to that of machine-coded program equivalents;
* is written largely in itself and consequently is largely portable;
* places no barriers among combinations of system, compiler, or application code;
* can include an integrated, user-controlled virtual …

… * the concepts and syntax of a high-order language, or HOL; and
* a set of commands for a conversational monitor.

The ISA (such as it is) is for an abstract machine that is very efficiently implemented by short code segments in almost any current machine architecture. The HOL concepts include hierarchical construction of programs, block structures, GOTOless programming, and user-definable data structures. The commands for the conversational monitor permit direct interaction with objects defined by the programmer in terms he defines, not simply in the terms of the underlying machine (i.e., core dumps). Further, the command set is directly executable in a conversational mode and includes virtually all commands available in the system; when so executed, they function just as they would if found in an executed program.
  • Basic Threads Programming: Standards and Strategy
Basic Threads Programming: Standards and Strategy. Mike Dahlin, [email protected], February 13, 2007. 1 Motivation. Some people rebel against coding standards. I don't understand the logic. For concurrent programming in particular, there are a few good solutions that have stood the test of time (and many unhappy people who have departed from these solutions). For concurrent programming, debugging won't work. You must rely on (a) writing correct code and (b) writing code that you and others can read and understand. Following the rules below will help you write correct, readable code. Rules 2 through 6 below are required coding standards for CS372. Answers to homework and exam problems and project code that do not follow these standards are by definition incorrect. Section 5 discusses additional restrictions for project code (or exam pseudo-code) in Java. Again, answers that deviate from these required coding standards are, by definition, incorrect for this class. If you believe you have a better way to write concurrent programs, that's great! Bring it to us (before you use it on an assignment!) We will examine your approach the same way we hope that a civil engineering manager would examine a proposal by a bright young civil engineer who has a proposal for a better configuration of rebar for reinforcing a new bridge: we will hope you have found a better way, but the burden of proof for explaining the superiority of your approach and proving that there are no hidden pitfalls is on you. 2 Coding standards for concurrent programs: 1.
  • Effective Inline-Threaded Interpretation of Java Bytecode
Effective Inline-Threaded Interpretation of Java Bytecode Using Preparation Sequences. Etienne Gagnon (1) and Laurie Hendren (2). (1) Sable Research Group, Université du Québec à Montréal, [email protected]. (2) McGill University, Montreal, Canada, [email protected]. Abstract. Inline-threaded interpretation is a recent technique that improves performance by eliminating dispatch overhead within basic blocks for interpreters written in C [11]. The dynamic class loading, lazy class initialization, and multi-threading features of Java reduce the effectiveness of a straight-forward implementation of this technique within Java interpreters. In this paper, we introduce preparation sequences, a new technique that solves the particular challenge of effectively inline-threading Java. We have implemented our technique in the SableVM Java virtual machine, and our experimental results show that using our technique, inline-threaded interpretation of Java, on a set of benchmarks, achieves a speedup ranging from 1.20 to 2.41 over switch-based interpretation, and a speedup ranging from 1.15 to 2.14 over direct-threaded interpretation. 1 Introduction. One of the main advantages of interpreters written in high-level languages is their simplicity and portability, when compared to static and dynamic compiler-based systems. One of their main drawbacks is poor performance, due to a high cost for dispatching interpreted instructions. In [11], Piumarta and Riccardi introduced a technique called inlined-threading which reduces this overhead by dynamically inlining instruction sequences within basic blocks, leaving a single instruction dispatch at the end of each sequence. To our knowledge, inlined-threading has not been applied to Java interpreters before.
  • Android Cours 1 : Introduction à Android / Android Studio
Android, Lecture 1: Introduction to Android / Android Studio. Damien MASSON, [email protected], http://www.esiee.fr/~massond, 21 February 2017. References: https://developer.android.com (essential!); https://openclassrooms.com/courses/creez-des-applications-pour-android/ (a fairly complete and reasonably up-to-date tutorial in French). What is Android? An American SME, Android Incorporated, founded in 2003 and acquired by Google in 2005; the OS was launched in 2007; in 2015, Android was the most widely used mobile operating system in the world (>80%). What is Android? Five distinct layers: 1. the Linux kernel with the drivers; 2. software libraries such as WebKit/Blink, OpenGL ES, SQLite or FreeType; 3. a runtime environment and libraries for executing programs written for the Java platform; 4. a framework, i.e. an application development kit. Android and the Java platform: up to version 4.4, Android includes a virtual machine named Dalvik. Dalvik bytecode is different from that of Oracle's Java virtual machine (JVM), so the build process of an application is different: Java code (.java) → Java bytecode (.class/.jar) → Dalvik bytecode (.dex) → interpreted. Android's standard library as a whole resembles J2SE (Java Standard Edition) of the Java platform; the main difference is that the AWT and Swing GUI libraries are replaced by Android's own libraries. Android Runtime (ART): starting with version 5.0 (2014), the ART (Android RunTime) execution environment replaces the Dalvik virtual machine.
  • Compiler Analyses for Improved Return Value Prediction
Compiler Analyses for Improved Return Value Prediction. Christopher J.F. Pickett, Clark Verbrugge, {cpicke,clump}@sable.mcgill.ca, School of Computer Science, McGill University, Montréal, Québec, Canada H3A 2A7. Overview: Introduction and Related Work; Contributions; Framework; Parameter Dependence Analysis; Return Value Use Analysis; Conclusions; Future Work. Introduction and Related Work: Speculative method-level parallelism (SMLP) allows for dynamic parallelisation of single-threaded programs; speculative threads are forked at callsites; suitable for Java virtual machines. Perfect return value prediction can double performance of SMLP (Hu et al., 2003). Implemented accurate return value prediction in SableVM, our group's JVM (Pickett et al., 2004). Current goals: reduce memory requirements; achieve higher accuracy. Speculative Method-Level Parallelism:

    // execute foo non-speculatively
    r = foo (a, b, c);
    // execute past return point
    // speculatively in parallel with foo()
    if (r > 10) { ... } else { ... }

Impact of Return Value Prediction:

    RVP strategy    return value    SMLP speedup
    none            arbitrary       1.52
    best            predicted       1.92
    perfect         correct         2.76

26% speedup over no RVP with Hu's best predictor; 82% speedup over no RVP with perfect prediction. Improved hybrid accuracy is highly desirable. S. Hu, R. Bhargava, and L. K. John. The role of return value prediction in exploiting speculative method-level parallelism. Journal of Instruction-Level Parallelism, 5:1–21, Nov. 2003. Return Value Prediction in SableVM: Implemented all of Hu et al.'s predictors in SableVM; introduced a new memoization predictor into the hybrid. C.J.F. Pickett and C. Verbrugge. Return value prediction in a Java virtual machine.
  • Threaded Code Is Not in Fact Implemented in the Computer's Hardware
Programming Techniques, R. Morris, Editor. Threaded Code. James R. Bell, Digital Equipment Corporation. The concept of "threaded code" is presented as an alternative to machine language code. Hardware and software realizations of it are given. In software it is realized as interpretive code not needing an interpreter. Extensions and optimizations are mentioned. Key Words and Phrases: interpreter, machine code, time tradeoff, space tradeoff, compiled code, subroutine calls, threaded code.

… desired generality in others. In summary, when writing hard code the user needs relatively many instructions, each of which executes relatively rapidly. An interpreter, by contrast, is a vehicle by which the user can choose his own instruction set to correspond to his specific problem. Obviously such freedom allows a much shorter program for that problem to be written. The penalty is that the instruction set of the interpreter is not in fact implemented in the computer's hardware. Instead the interpreter must itself be a computer program which simulates the action of the interpretive instruction set in terms of the actual instruction set. This can be a time-consuming proposition. Thus interpretive code tends to be shorter but slower than hard code. It is instructive to look at the relation between the host hardware and the alternatives discussed. In the case of hard code an instruction directs the flow of processing by its actual execution from the IR, or instruction register, of the machine. In the case of an interpreter, an "instruction" is in fact merely a datum from the interpreting program. Thus it directs the flow of processing from an accumulator or the equivalent.
  • Csc 453 Interpreters & Interpretation
CSc 453: Interpreters & Interpretation. Saumya Debray, The University of Arizona, Tucson. Interpreters: An interpreter is a program that executes another program. An interpreter implements a virtual machine, which may be different from the underlying hardware platform. Examples: Java Virtual Machine; VMs for high-level languages such as Scheme, Prolog, Icon, Smalltalk, Perl, Tcl. The virtual machine is often at a higher level of semantic abstraction than the native hardware. This can help portability, but affects performance. Interpretation vs. Compilation. Interpreter Operation:

    ip = start of program;
    while ( ¬ done ) {
        op = current operation at ip;
        execute code for op on current operands;
        advance ip to next operation;
    }

Interpreter Design Issues. Encoding the operation, i.e., getting from the opcode to the code for that operation ("dispatch"): byte code (e.g., JVM), indirect threaded code, direct threaded code. Representing operands: register machines, where operations are performed on a fixed finite set of global locations ("registers") (e.g.: SPIM); stack machines, where operations are performed on the top of a stack of operands (e.g.: JVM). Byte Code: operations encoded as small integers (~1 byte). The interpreter uses the opcode to index into a table of code addresses:

    while ( TRUE ) {
        byte op = *ip;
        switch ( op ) {
            case ADD: /* perform addition */ break;
            case SUB: /* perform subtraction */ break;
            ...
        }
    }

Advantages: simple, portable. Disadvantages: inefficient. Direct Threaded Code: indexing through a jump table (as with byte code) is expensive. Idea: use the address of the code for an operation as the opcode for that operation.
  • Static Analysis of Lockless Microcontroller C Programs
Static Analysis of Lockless Microcontroller C Programs. Eva Beckschulze, Sebastian Biallas, Stefan Kowalewski, Embedded Software Laboratory, RWTH Aachen University, Germany, {lastname}@embedded.rwth-aachen.de. Concurrently accessing shared data without locking is usually subject to race conditions, resulting in inconsistent or corrupted data. However, there are programs operating correctly without locking by exploiting the atomicity of certain operations on a specific hardware. In this paper, we describe how to precisely analyze lockless microcontroller C programs with interrupts by taking the hardware architecture into account. We evaluate this technique in an octagon-based value range analysis using access-based localization to increase efficiency. 1 Introduction. Static analysis based on abstract interpretation [7] is a formal method that found its way into practice by several commercial code analysis tools. Proving the absence of run-time errors in microcontroller programs is of particular importance as microcontrollers are often deployed in safety-critical systems. However, C code analyzers usually do not cope with C extensions and hardware-specific control prevalent in microcontroller programs. This control is not only necessary for data input/output but also needed to implement interrupt service routines (ISRs), which allows some form of concurrency and can be used for asynchronous hardware communication and periodic tasks. Since the control functions of the hardware are often exposed through normal memory accesses, a sound analysis of microcontroller programs has to reflect these registers in its memory model. On the Atmega16 [2], for example, it is possible to enable/disable interrupts by a write to the SREG register which is located at the memory address 0x5F.
  • A Case Study of Multi-Threading in the Embedded Space
A Case Study of Multi-Threading in the Embedded Space. Greg Hoover, Forrest Brewer, Timothy Sherwood; University of California, Santa Barbara, Engineering I, Santa Barbara, California 93106; [email protected], [email protected], [email protected].

ABSTRACT: The continuing miniaturization of technology coupled with wireless networks has made it feasible to physically embed sensor network systems into the environment. Sensor net processors are tasked with the job of handling a disparate set of interrupt driven activity, from networks to timers to the sensors themselves. In this paper, we demonstrate the advantages of a tiny multi-threaded microcontroller design which targets embedded applications that need to respond to events at high speed. While multi-threading is typically used to improve resource utilization, in the embedded space it can provide zero-cycle context switching and interrupt service threads (IST), enabling complex programmable control in latency constrained environments.

… environment. Such systems open the possibility of early detection of structural failure, tracking microclimates in forest canopies, and observing migratory patterns of any number of species [16, 11, 29]. In an ideal world, these tiny digital systems would be operating a wireless network, handling sensor readings, controlling electro-statically-activated devices, processing software updates, and performing distributed computations. To handle all of these functions at the required throughput, we argue for the use of dynamic multi-threading at all levels of the microcontroller design. At first this concept may seem counterintuitive as most multi-threaded architectures were developed to better-exploit an abundance of resources [25], something that these systems …
  • The Flow Programming Language: an Implicitly-Parallelizing, Programmer-Safe Language
"Write Once, Parallelize Anywhere". The Flow Programming Language: An implicitly-parallelizing, programmer-safe language. Luke Hutchison. Moore's Law ● The number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. Moore's Law ● Computing is work; transistors do work; more transistors working in parallel should yield faster computers ● => Moore's Law has (so far) also applied to CPU speed ● BUT now we're hitting multiple walls: ● Transistor size – tunneling; lithography feature size vs. wavelength – no more 2D density increases ● Thermal envelope – function of frequency and feature size => we have hit a clockspeed wall ● Data width – 128-bit CPUs? Need parallel control flow instead. But it's worse than that: it's not just the end of Moore's Law... ...it's the beginning of an era of buggier software.
  • GNU Classpath
GNU Classpath: Core Classes for a Diversity of Java Virtual Machines. February 2004. ■ Sascha Brawer, [email protected], http://www.dandelis.ch/people/brawer/ ■ Mark Wielaard, [email protected], http://www.klomp.org/mark/. Overview ■ Overview – Motivation – Anatomy of a Java-like system – Documentation, Quality assurance, Releases ■ Demo ■ Details – Current state – VMs using GNU Classpath ■ How you can help. Motivation ■ Currently, most free software is written in C. The Good: close to hardware; fast execution, even with bad compilers. The Bad: easy to make bugs; lacks "modern" language concepts ("modern" = 1980'ies); difficult to learn; hard to write portable code. The Ugly: libraries not well integrated; large code base, many libraries; ubiquitous. Motivation ■ Java is a good foundation for many projects – Type-safety ● Avoids many typical bugs, crashes, security problems ● More opportunities for optimizing compilers – Modular, object-oriented, concurrent, dynamic – Easier to write structured, portable code – Very rich library, reasonably well designed (mostly) – Large developer base ("COBOL of the 1990'ies") ● There exist lots of other good (even better?) languages: Oberon, Eiffel, O'Caml, Haskell, Erlang, TOM, Cedar, ... ● But: they have not many free applications/libraries. Motivation ■ But: Java does not solve every problem – Over-hyped as a solution to everything – Bad reputation for wasting resources (CPU, RAM) ● It is easy to write inefficient programs ● Early implementations were very slow ● Type-safety and memory management have their cost – Often over-estimated: array bounds checking (→ no buffer overflows) costs ~2% execution time, avoids ~50% security-related bugs – Syntax similar to C++ → not very easy to learn. Motivation ■ Java is a compromise – Not necessarily ideal, but reasonable. "Why GNU Will Be Compatible with UNIX": Unix is not my ideal system, but it is not too bad.