Speeding up Thread-Local Storage Access in Dynamic Libraries in the ARM Platform

Total Page:16

File Type:pdf, Size:1020Kb

Speeding up Thread-Local Storage Access in Dynamic Libraries in the ARM Platform Speeding Up Thread-Local Storage Access in Dynamic Libraries in the ARM platform Glauber de Oliveira Costa IC-Unicamp [email protected] Alexandre Oliva Red Hat and IC-Unicamp [email protected], [email protected] Abstract ecution was proposed and succesfully imple- mented on IA32, AMD64/EM64T and Fujitsu FR-V architectures. On these systems, experi- As multi-core processors become the rule mental results revealed the new model consis- rather than the exception, multi-threaded pro- tently exceeds the old model in terms of perfor- gramming is expected to expand from its cur- mance, particularly in the most common case, rent niches to more widespread use, in software where the speedup is often well over 2x, bring- components that have not traditionally been ing it nearly to the same performance of access concerned about exploiting concurrency. Ac- models used in plain executables. cessing thread-local storage (TLS) from within dynamic libraries has traditionally required This paper details this new access model and calling a function to obtain the thread-local ad- its implementation for ARM processors, high- dress of the variable. Such function calls are lighting its particular issues and potential gains several times slower than typical addressing in embedded systems. code that is used in executables. While instruc- tions used in executables can assume thread- local variables are at a constant offset within the thread Static TLS block, dynamic libraries loaded during program execution may not even 1 Introduction assume that their thread-local variables are in Static TLS blocks. As mainstream microprocessor vendors turn to Since libraries are most commonly loaded as multi-core processors as a way to improve per- dependencies of executables or other libraries, formance [1, 2], the relevance of multi-threaded before a program starts running, the most com- programming to leverage on such potential per- mon TLS case is that of constant offsets. Re- formance improvements grows. Embedded cently, an access model that enables dynamic systems pose their own challenges, such as de- libraries to take advantage of this fact without livering high performance with minimum size giving up the ability to be loaded during ex- and low power consumption. Besides the common difficulty multi-threaded the standard functions that offer abstractions programs run into, namely the need for syn- of thread-specific data. In some cases, such chronization between threads, it is often the as when generating code for dynamic libraries, case that a thread would like to use a global the compiler-generated code is still very in- variable,1 for extended periods of time, without efficient [4] ; for main executables, access other threads modifying its contents, and with- can sometimes be just as efficient as access- out having to incur synchronization overheads. ing automatic or global variables. The mech- anism proposed by Oliva and Araújo [4] yields Using automatic variables to achieve this is a a major speedup, that brings the performance possibility, since each thread has its own stack, of TLS access in dynamic libraries close to where such variables are allocated. However, that of executables. Such a mechanism has if multiple functions need to use the same data been proved successful after implemented in structure within a thread, a pointer to it must the Fujitsu FR-V, x86_64 and i386 architec- be passed around, which is cumbersome, and tures. On the ARM platform, a typical choice might require re-engineering the control flow for embedded systems [7], work has been done so as to ensure that the stack frame in which to provide such improvements, while keeping the data structure is created is not left while the in mind all the restrictions that may be imposed data is still in use. upon embedded systems Widely-used thread libraries have introduced primitives to overcome this problem, enabling 1.1 Terminology and organization threads to map a global handle, shared by all threads, to different values, one for each thread. This feature is offered in the form In this paper, we use the term loadable mod- of function calls (pthread_getspecific ule, or just module, to refer to executables, dy- and pthread_setspecific, in POSIX [3] namic libraries and the dynamic loader. A pro- threads), that are far less efficient than ac- cess may consist of a set of loadable modules cess to global or automatic variables. Besides consisting of exactly one executable, a dynamic the efficiency issues, they are syntactically far loader (for dynamic executables) and zero or more difficult to use than regular variables. more dynamic libraries. We call initial mod- These were the main motivations for the intro- ules the main executable, any dynamic libraries duction of Thread Local Storage (henceforth, it depends upon (directly or indirectly) and TLS[4, 5]) features in compilers, linkers and any other dynamic libraries the dynamic loader run-time systems, that enable selected global chooses to load before relinquishing control to variables to be marked with a __thread spec- the main executable. Moreover, we use the ifier or a threadprivate pragma, indicat- term dlopened modules to refer to modules ing that, for each thread, there should be a sep- that are loaded after the program starts run- arate, independent copy of the variable. ning, typically by means of library calls such as dlopen. By using custom low-level thread-specific im- plementations [6], or with cooperation from the This paper is organized as follows: section 2 compiler and the linker, access to thread-local gives background material about TLS symbols variables can be far more efficient than using and the novel concept of TLS descriptors. Sec- 1The strictly-correct term here would be variable tion 3 details the proposed extensions in the whose storage has static duration. ABI for enabling it in the ARM platform, while section 4 unveils the needed changes in the cur- from the thread pointer, and can be written to rent ABI and the tools covered by this work. a specific GOT entry at load time. The main Performance measures are then shown in sec- drawback of this access model is that, by using tion 5. Finally, section 6 sheds light on what it, we give up the ability to dynamically load there is yet to be done in the field of TLS ac- the module. cess, specially for its broader acceptance. Local Exec is used when the symbol is de- 2 Background fined in the main executable. In such a case, no additional effort is needed to compute the vari- able address, that is at a constant offset from the GCC, since version 3.3 [8], has provided the thread pointer known at link time. The model ability of marking a variable with a __thread is not suitable for symbols defined in a dynamic modifier, which causes it to be marked as a TLS library, because even though the offset will be symbol. Each module containing TLS symbols a constant at run time, it is not a link-time con- has a section containing the initial values in the stant. TLS block. For all TLS symbols this module exports, there is a fixed offset within such a block in which the symbol can be found. General Dynamic is the most general of all four models, covering both the initial and dy- An initial module is guaranteed to have its TLS namic module load cases. Address resolution data laid out as part of the Static TLS block, a goes through a call to __tls_get_addr, per-thread data structure whose per-thread ad- that by a lookup in the DTV, loads the variable dress is held in a thread pointer, normally a re- address. As parameters, it receives the module served register. Since the same layout is used identifier and the symbol offset within the mod- for the Static TLS blocks of all threads, the rel- ule’s TLS section. ative address from the thread pointer to a sym- bol defined in such a module is constant for all threads. Local Dynamic is a variant of the General Dynamic model for symbols living in the same At a fixed location in the Static TLS block, module. The call to __tls_get_addr re- there is a pointer to another data structure called ceives a zero-offset parameter, in a way that it the Dynamic Thread Vector, henceforth, DTV. returns the base address of the module. Subse- When a module is dlopened or dlclosed quent access may then just add to it the symbol the thread’s DTV may have to be modified to offset. add or remove the module’s TLS block. Figure 1 shows these data structure and the iter- In some platforms, when a module that has ations between them. Access to a TLS symbol symbols using the initial exec or dynamic mod- can be performed in any one of four models: els is linked into an executable, the linker can perform a relaxation step, allowing the access to be turned into more efficient ones, without Initial Exec is used when the symbol is in any loss of generality (dlopening executables a module loaded at start-up time with the ex- does not make sense). In the ARM current ABI, ecutable. The relative address is a fixed offset no such relaxations are specified. Static TLS Block Offset Dynamic TLS Blocks Ideally, we should be also able to relax the code TP x y z sequences in case of linking into executables, TP offsets Offset which is missing in the current ABI. DTV Module Index Figure 1: Data structures used for TLS han- dling. Static TLS block, the DTV, and the it- erations between them 3 ARM TLS Descriptors For dynamic libraries, whether the mod- ule is initially loaded with the executable or dlopened, access go through a call to The ARM ABI currently specifies the use of __tls_get_addr, whose two arguments, two consecutive GOT entries [10] for TLS sym- the module ID and the symbol offset inside the bols relocations.
Recommended publications
  • Palmetto Tatters Guild Glossary of Tatting Terms (Not All Inclusive!)
    Palmetto Tatters Guild Glossary of Tatting Terms (not all inclusive!) Tat Days Written * or or To denote repeating lines of a pattern. Example: Rep directions or # or § from * 7 times & ( ) Eg. R: 5 + 5 – 5 – 5 (last P of prev R). Modern B Bead (Also see other bead notation below) notation BTS Bare Thread Space B or +B Put bead on picot before joining BBB B or BBB|B Three beads on knotting thread, and one bead on core thread Ch Chain ^ Construction picot or very small picot CTM Continuous Thread Method D Dimple: i.e. 1st Half Double Stitch X4, 2nd Half Double Stitch X4 dds or { } daisy double stitch DNRW DO NOT Reverse Work DS Double Stitch DP Down Picot: 2 of 1st HS, P, 2 of 2nd HS DPB Down Picot with a Bead: 2 of 1st HS, B, 2 of 2nd HS HMSR Half Moon Split Ring: Fold the second half of the Split Ring inward before closing so that the second side and first side arch in the same direction HR Half Ring: A ring which is only partially closed, so it looks like half of a ring 1st HS First Half of Double Stitch 2nd HS Second Half of Double Stitch JK Josephine Knot (Ring made of only the 1st Half of Double Stitch OR only the 2nd Half of Double Stitch) LJ or Sh+ or SLJ Lock Join or Shuttle join or Shuttle Lock Join LPPCh Last Picot of Previous Chain LPPR Last Picot of Previous Ring LS Lock Stitch: First Half of DS is NOT flipped, Second Half is flipped, as usual LCh Make a series of Lock Stitches to form a Lock Chain MP or FP Mock Picot or False Picot MR Maltese Ring Pearl Ch Chain made with pearl tatting technique (picots on both sides of the chain, made using three threads) P or - Picot Page 1 of 4 PLJ or ‘PULLED LOOP’ join or ‘PULLED LOCK’ join since it is actually a lock join made after placing thread under a finished ring and pulling this thread through a picot.
    [Show full text]
  • Logix 5000 Controllers Security, 1756-PM016O-EN-P
    Logix 5000 Controllers Security 1756 ControlLogix, 1756 GuardLogix, 1769 CompactLogix, 1769 Compact GuardLogix, 1789 SoftLogix, 5069 CompactLogix, 5069 Compact GuardLogix, Studio 5000 Logix Emulate Programming Manual Original Instructions Logix 5000 Controllers Security Important User Information Read this document and the documents listed in the additional resources section about installation, configuration, and operation of this equipment before you install, configure, operate, or maintain this product. Users are required to familiarize themselves with installation and wiring instructions in addition to requirements of all applicable codes, laws, and standards. Activities including installation, adjustments, putting into service, use, assembly, disassembly, and maintenance are required to be carried out by suitably trained personnel in accordance with applicable code of practice. If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired. In no event will Rockwell Automation, Inc. be responsible or liable for indirect or consequential damages resulting from the use or application of this equipment. The examples and diagrams in this manual are included solely for illustrative purposes. Because of the many variables and requirements associated with any particular installation, Rockwell Automation, Inc. cannot assume responsibility or liability for actual use based on the examples and diagrams. No patent liability is assumed by Rockwell Automation, Inc. with respect to use of information, circuits, equipment, or software described in this manual. Reproduction of the contents of this manual, in whole or in part, without written permission of Rockwell Automation, Inc., is prohibited. Throughout this manual, when necessary, we use notes to make you aware of safety considerations.
    [Show full text]
  • Stclang: State Thread Composition As a Foundation for Monadic Dataflow Parallelism Sebastian Ertel∗ Justus Adam Norman A
    STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism Sebastian Ertel∗ Justus Adam Norman A. Rink Dresden Research Lab Chair for Compiler Construction Chair for Compiler Construction Huawei Technologies Technische Universität Dresden Technische Universität Dresden Dresden, Germany Dresden, Germany Dresden, Germany [email protected] [email protected] [email protected] Andrés Goens Jeronimo Castrillon Chair for Compiler Construction Chair for Compiler Construction Technische Universität Dresden Technische Universität Dresden Dresden, Germany Dresden, Germany [email protected] [email protected] Abstract using monad-par and LVars to expose parallelism explicitly Dataflow execution models are used to build highly scalable and reach the same level of performance, showing that our parallel systems. A programming model that targets parallel programming model successfully extracts parallelism that dataflow execution must answer the following question: How is present in an algorithm. Further evaluation shows that can parallelism between two dependent nodes in a dataflow smap is expressive enough to implement parallel reductions graph be exploited? This is difficult when the dataflow lan- and our programming model resolves short-comings of the guage or programming model is implemented by a monad, stream-based programming model for current state-of-the- as is common in the functional community, since express- art big data processing systems. ing dependence between nodes by a monadic bind suggests CCS Concepts • Software and its engineering → Func- sequential execution. Even in monadic constructs that explic- tional languages. itly separate state from computation, problems arise due to the need to reason about opaquely defined state.
    [Show full text]
  • Multithreading Design Patterns and Thread-Safe Data Structures
    Lecture 12: Multithreading Design Patterns and Thread-Safe Data Structures Principles of Computer Systems Autumn 2019 Stanford University Computer Science Department Lecturer: Chris Gregg Philip Levis PDF of this presentation 1 Review from Last Week We now have three distinct ways to coordinate between threads: mutex: mutual exclusion (lock), used to enforce critical sections and atomicity condition_variable: way for threads to coordinate and signal when a variable has changed (integrates a lock for the variable) semaphore: a generalization of a lock, where there can be n threads operating in parallel (a lock is a semaphore with n=1) 2 Mutual Exclusion (mutex) A mutex is a simple lock that is shared between threads, used to protect critical regions of code or shared data structures. mutex m; mutex.lock() mutex.unlock() A mutex is often called a lock: the terms are mostly interchangeable When a thread attempts to lock a mutex: Currently unlocked: the thread takes the lock, and continues executing Currently locked: the thread blocks until the lock is released by the current lock- holder, at which point it attempts to take the lock again (and could compete with other waiting threads). Only the current lock-holder is allowed to unlock a mutex Deadlock can occur when threads form a circular wait on mutexes (e.g. dining philosophers) Places we've seen an operating system use mutexes for us: All file system operation (what if two programs try to write at the same time? create the same file?) Process table (what if two programs call fork() at the same time?) 3 lock_guard<mutex> The lock_guard<mutex> is very simple: it obtains the lock in its constructor, and releases the lock in its destructor.
    [Show full text]
  • APPLYING MODEL-VIEW-CONTROLLER (MVC) in DESIGN and DEVELOPMENT of INFORMATION SYSTEMS an Example of Smart Assistive Script Breakdown in an E-Business Application
    APPLYING MODEL-VIEW-CONTROLLER (MVC) IN DESIGN AND DEVELOPMENT OF INFORMATION SYSTEMS An Example of Smart Assistive Script Breakdown in an e-Business Application Andreas Holzinger, Karl Heinz Struggl Institute of Information Systems and Computer Media (IICM), TU Graz, Graz, Austria Matjaž Debevc Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia Keywords: Information Systems, Software Design Patterns, Model-view-controller (MVC), Script Breakdown, Film Production. Abstract: Information systems are supporting professionals in all areas of e-Business. In this paper we concentrate on our experiences in the design and development of information systems for the use in film production processes. Professionals working in this area are neither computer experts, nor interested in spending much time for information systems. Consequently, to provide a useful, useable and enjoyable application the system must be extremely suited to the requirements and demands of those professionals. One of the most important tasks at the beginning of a film production is to break down the movie script into its elements and aspects, and create a solid estimate of production costs based on the resulting breakdown data. Several film production software applications provide interfaces to support this task. However, most attempts suffer from numerous usability deficiencies. As a result, many film producers still use script printouts and textmarkers to highlight script elements, and transfer the data manually into their film management software. This paper presents a novel approach for unobtrusive and efficient script breakdown using a new way of breaking down text into its relevant elements. We demonstrate how the implementation of this interface benefits from employing the Model-View-Controller (MVC) as underlying software design paradigm in terms of both software development confidence and user satisfaction.
    [Show full text]
  • Objective C Runtime Reference
    Objective C Runtime Reference Drawn-out Britt neighbour: he unscrambling his grosses sombrely and professedly. Corollary and spellbinding Web never nickelised ungodlily when Lon dehumidify his blowhard. Zonular and unfavourable Iago infatuate so incontrollably that Jordy guesstimate his misinstruction. Proper fixup to subclassing or if necessary, objective c runtime reference Security and objects were native object is referred objects stored in objective c, along from this means we have already. Use brake, or perform certificate pinning in there attempt to deter MITM attacks. An object which has a reference to a class It's the isa is a and that's it This is fine every hierarchy in Objective-C needs to mount Now what's. Use direct access control the man page. This function allows us to voluntary a reference on every self object. The exception handling code uses a header file implementing the generic parts of the Itanium EH ABI. If the method is almost in the cache, thanks to Medium Members. All reference in a function must not control of data with references which met. Understanding the Objective-C Runtime Logo Table Of Contents. Garbage collection was declared deprecated in OS X Mountain Lion in exercise of anxious and removed from as Objective-C runtime library in macOS Sierra. Objective-C Runtime Reference. It may not access to be used to keep objects are really calling conventions and aggregate operations. Thank has for putting so in effort than your posts. This will cut down on the alien of Objective C runtime information. Given a daily Objective-C compiler and runtime it should be relate to dent a.
    [Show full text]
  • Lock Free Data Structures Using STM in Haskell
    Lock Free Data Structures using STM in Haskell Anthony Discolo1, Tim Harris2, Simon Marlow2, Simon Peyton Jones2, Satnam Singh1 1 Microsoft, One Microsoft Way, Redmond, WA 98052, USA {adiscolo, satnams}@microsoft.com http://www.research.microsoft.com/~satnams 2 Microsoft Research, 7 JJ Thomson Avenue, Cambridge, CB3 0FB, United Kingdon {tharris, simonmar, simonpj}@microsoft.com Abstract. This paper explores the feasibility of re-expressing concurrent algo- rithms with explicit locks in terms of lock free code written using Haskell’s im- plementation of software transactional memory. Experimental results are pre- sented which show that for multi-processor systems the simpler lock free im- plementations offer superior performance when compared to their correspond- ing lock based implementations. 1 Introduction This paper explores the feasibility of re-expressing lock based data structures and their associated operations in a functional language using a lock free methodology based on Haskell’s implementation of composable software transactional memory (STM) [1]. Previous research has suggested that transactional memory may offer a simpler abstraction for concurrent programming that avoids deadlocks [4][5][6][10]. Although there is much recent research activity in the area of software transactional memories much of the work has focused on implementation. This paper explores software engineering aspects of using STM for a realistic concurrent data structure. Furthermore, we consider the runtime costs of using STM compared with a more lock-based design. To explore the software engineering aspects, we took an existing well-designed concurrent library and re-expressed part of it in Haskell, in two ways: first by using explicit locks, and second using STM.
    [Show full text]
  • Process Scheduling
    PROCESS SCHEDULING ANIRUDH JAYAKUMAR LAST TIME • Build a customized Linux Kernel from source • System call implementation • Interrupts and Interrupt Handlers TODAY’S SESSION • Process Management • Process Scheduling PROCESSES • “ a program in execution” • An active program with related resources (instructions and data) • Short lived ( “pwd” executed from terminal) or long-lived (SSH service running as a background process) • A.K.A tasks – the kernel’s point of view • Fundamental abstraction in Unix THREADS • Objects of activity within the process • One or more threads within a process • Asynchronous execution • Each thread includes a unique PC, process stack, and set of processor registers • Kernel schedules individual threads, not processes • tasks are Linux threads (a.k.a kernel threads) TASK REPRESENTATION • The kernel maintains info about each process in a process descriptor, of type task_struct • See include/linux/sched.h • Each task descriptor contains info such as run-state of process, address space, list of open files, process priority etc • The kernel stores the list of processes in a circular doubly linked list called the task list. TASK LIST • struct list_head tasks; • init the "mother of all processes” – statically allocated • extern struct task_struct init_task; • for_each_process() - iterates over the entire task list • next_task() - returns the next task in the list PROCESS STATE • TASK_RUNNING: running or on a run-queue waiting to run • TASK_INTERRUPTIBLE: sleeping, waiting for some event to happen; awakes prematurely if it receives a signal • TASK_UNINTERRUPTIBLE: identical to TASK_INTERRUPTIBLE except it ignores signals • TASK_ZOMBIE: The task has terminated, but its parent has not yet issued a wait4(). The task's process descriptor must remain in case the parent wants to access it.
    [Show full text]
  • Scheduling: Introduction
    7 Scheduling: Introduction By now low-level mechanisms of running processes (e.g., context switch- ing) should be clear; if they are not, go back a chapter or two, and read the description of how that stuff works again. However, we have yet to un- derstand the high-level policies that an OS scheduler employs. We will now do just that, presenting a series of scheduling policies (sometimes called disciplines) that various smart and hard-working people have de- veloped over the years. The origins of scheduling, in fact, predate computer systems; early approaches were taken from the field of operations management and ap- plied to computers. This reality should be no surprise: assembly lines and many other human endeavors also require scheduling, and many of the same concerns exist therein, including a laser-like desire for efficiency. And thus, our problem: THE CRUX: HOW TO DEVELOP SCHEDULING POLICY How should we develop a basic framework for thinking about scheduling policies? What are the key assumptions? What metrics are important? What basic approaches have been used in the earliest of com- puter systems? 7.1 Workload Assumptions Before getting into the range of possible policies, let us first make a number of simplifying assumptions about the processes running in the system, sometimes collectively called the workload. Determining the workload is a critical part of building policies, and the more you know about workload, the more fine-tuned your policy can be. The workload assumptions we make here are mostly unrealistic, but that is alright (for now), because we will relax them as we go, and even- tually develop what we will refer to as ..
    [Show full text]
  • Scheduling Light-Weight Parallelism in Artcop
    Scheduling Light-Weight Parallelism in ArTCoP J. Berthold1,A.AlZain2,andH.-W.Loidl3 1 Fachbereich Mathematik und Informatik Philipps-Universit¨at Marburg, D-35032 Marburg, Germany [email protected] 2 School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh EH14 4AS, Scotland [email protected] 3 Institut f¨ur Informatik, Ludwig-Maximilians-Universit¨at M¨unchen, Germany [email protected] Abstract. We present the design and prototype implementation of the scheduling component in ArTCoP (architecture transparent control of parallelism), a novel run-time environment (RTE) for parallel execution of high-level languages. A key feature of ArTCoP is its support for deep process and memory hierarchies, shown in the scheduler by supporting light-weight threads. To realise a system with easily exchangeable components, the system defines a micro-kernel, providing basic infrastructure, such as garbage collection. All complex RTE operations, including the handling of parallelism, are implemented at a separate system level. By choosing Concurrent Haskell as high-level system language, we obtain a prototype in the form of an executable specification that is easier to maintain and more flexible than con- ventional RTEs. We demonstrate the flexibility of this approach by presenting implementations of a scheduler for light-weight threads in ArTCoP, based on GHC Version 6.6. Keywords: Parallel computation, functional programming, scheduling. 1 Introduction In trying to exploit the computational power of parallel architectures ranging from multi-core machines to large-scale computational Grids, we are currently developing a new parallel runtime environment, ArTCoP, for executing parallel Haskell code on such complex, hierarchical architectures.
    [Show full text]
  • Lock-Free Programming
    Lock-Free Programming Geoff Langdale L31_Lockfree 1 Desynchronization ● This is an interesting topic ● This will (may?) become even more relevant with near ubiquitous multi-processing ● Still: please don’t rewrite any Project 3s! L31_Lockfree 2 Synchronization ● We received notification via the web form that one group has passed the P3/P4 test suite. Congratulations! ● We will be releasing a version of the fork-wait bomb which doesn't make as many assumptions about task id's. – Please look for it today and let us know right away if it causes any trouble for you. ● Personal and group disk quotas have been grown in order to reduce the number of people running out over the weekend – if you try hard enough you'll still be able to do it. L31_Lockfree 3 Outline ● Problems with locking ● Definition of Lock-free programming ● Examples of Lock-free programming ● Linux OS uses of Lock-free data structures ● Miscellanea (higher-level constructs, ‘wait-freedom’) ● Conclusion L31_Lockfree 4 Problems with Locking ● This list is more or less contentious, not equally relevant to all locking situations: – Deadlock – Priority Inversion – Convoying – “Async-signal-safety” – Kill-tolerant availability – Pre-emption tolerance – Overall performance L31_Lockfree 5 Problems with Locking 2 ● Deadlock – Processes that cannot proceed because they are waiting for resources that are held by processes that are waiting for… ● Priority inversion – Low-priority processes hold a lock required by a higher- priority process – Priority inheritance a possible solution L31_Lockfree
    [Show full text]
  • Scheduling Weakly Consistent C Concurrency for Reconfigurable
    SUBMISSION TO IEEE TRANS. ON COMPUTERS 1 Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware Nadesh Ramanathan, John Wickerson, Member, IEEE, and George A. Constantinides Senior Member, IEEE Abstract—Lock-free algorithms, in which threads synchronise These reorderings are invisible in a single-threaded context, not via coarse-grained mutual exclusion but via fine-grained but in a multi-threaded context, they can introduce unexpected atomic operations (‘atomics’), have been shown empirically to behaviours. For instance, if another thread is simultaneously be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these writing to z, then reordering two instructions above may algorithms can be compiled from C to reconfigurable hardware introduce the behaviour where x is assigned the latest value via high-level synthesis (HLS). but y gets an old one.1 We focus on the scheduling problem, in which software The implication of this is not that existing HLS tools are instructions are assigned to hardware clock cycles. We first wrong; these optimisations can only introduce new behaviours show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction when the code already exhibits a race condition, and races reorderings that, though sound in a single-threaded context, are deemed a programming error in C [2, §5.1.2.4]. Rather, demonstrably cause erroneous results when synthesising multi- the implication is that if these memory accesses are upgraded threaded programs. We then show that correct behaviour can be to become atomic (and hence allowed to race), then existing restored by imposing additional intra-thread constraints among scheduling constraints are insufficient.
    [Show full text]