Speeding up Thread-Local Storage Access in Dynamic Libraries in the ARM Platform

Speeding Up Thread-Local Storage Access in Dynamic Libraries in the ARM platform Glauber de Oliveira Costa IC-Unicamp [email protected] Alexandre Oliva Red Hat and IC-Unicamp [email protected], [email protected] Abstract ecution was proposed and succesfully implemented on IA32, AMD64/EM64T and Fujitsu FR-V architectures. On these systems, experi- As multi-core processors become the rule mental results revealed the new model consis- rather than the exception, multi-threaded pro- tently exceeds the old model in terms of perfor- gramming is expected to expand from its cur- mance, particularly in the most common case, rent niches to more widespread use, in software where the speedup is often well over 2x, bring- components that have not traditionally been ing it nearly to the same performance of access concerned about exploiting concurrency. Ac- models used in plain executables. cessing thread-local storage (TLS) from within dynamic libraries has traditionally required This paper details this new access model and calling a function to obtain the thread-local ad- its implementation for ARM processors, high- dress of the variable. Such function calls are lighting its particular issues and potential gains several times slower than typical addressing in embedded systems. code that is used in executables. While instruc- tions used in executables can assume thread- local variables are at a constant offset within the thread Static TLS block, dynamic libraries loaded during program execution may not even 1 Introduction assume that their thread-local variables are in Static TLS blocks. As mainstream microprocessor vendors turn to Since libraries are most commonly loaded as multi-core processors as a way to improve per- dependencies of executables or other libraries, formance [1, 2], the relevance of multi-threaded before a program starts running, the most com- programming to leverage on such potential per- mon TLS case is that of constant offsets. Re- formance improvements grows. Embedded cently, an access model that enables dynamic systems pose their own challenges, such as de- libraries to take advantage of this fact without livering high performance with minimum size giving up the ability to be loaded during ex- and low power consumption. Besides the common difficulty multi-threaded the standard functions that offer abstractions programs run into, namely the need for syn- of thread-specific data. In some cases, such chronization between threads, it is often the as when generating code for dynamic libraries, case that a thread would like to use a global the compiler-generated code is still very in- variable,1 for extended periods of time, without efficient [4] ; for main executables, access other threads modifying its contents, and with- can sometimes be just as efficient as access- out having to incur synchronization overheads. ing automatic or global variables. The mechanism proposed by Oliva and Araújo [4] yields Using automatic variables to achieve this is a a major speedup, that brings the performance possibility, since each thread has its own stack, of TLS access in dynamic libraries close to where such variables are allocated. However, that of executables. Such a mechanism has if multiple functions need to use the same data been proved successful after implemented in structure within a thread, a pointer to it must the Fujitsu FR-V, x86_64 and i386 architec- be passed around, which is cumbersome, and tures. On the ARM platform, a typical choice might require re-engineering the control flow for embedded systems [7], work has been done so as to ensure that the stack frame in which to provide such improvements, while keeping the data structure is created is not left while the in mind all the restrictions that may be imposed data is still in use. upon embedded systems Widely-used thread libraries have introduced primitives to overcome this problem, enabling 1.1 Terminology and organization threads to map a global handle, shared by all threads, to different values, one for each thread. This feature is offered in the form In this paper, we use the term loadable mod- of function calls (pthread_getspecific ule, or just module, to refer to executables, dy- and pthread_setspecific, in POSIX [3] namic libraries and the dynamic loader. A pro- threads), that are far less efficient than access may consist of a set of loadable modules cess to global or automatic variables. Besides consisting of exactly one executable, a dynamic the efficiency issues, they are syntactically far loader (for dynamic executables) and zero or more difficult to use than regular variables. more dynamic libraries. We call initial mod- These were the main motivations for the intro- ules the main executable, any dynamic libraries duction of Thread Local Storage (henceforth, it depends upon (directly or indirectly) and TLS[4, 5]) features in compilers, linkers and any other dynamic libraries the dynamic loader run-time systems, that enable selected global chooses to load before relinquishing control to variables to be marked with a __thread spec- the main executable. Moreover, we use the ifier or a threadprivate pragma, indicat- term dlopened modules to refer to modules ing that, for each thread, there should be a sep- that are loaded after the program starts run- arate, independent copy of the variable. ning, typically by means of library calls such as dlopen. By using custom low-level thread-specific im- plementations [6], or with cooperation from the This paper is organized as follows: section 2 compiler and the linker, access to thread-local gives background material about TLS symbols variables can be far more efficient than using and the novel concept of TLS descriptors. Sec- 1The strictly-correct term here would be variable tion 3 details the proposed extensions in the whose storage has static duration. ABI for enabling it in the ARM platform, while section 4 unveils the needed changes in the cur- from the thread pointer, and can be written to rent ABI and the tools covered by this work. a specific GOT entry at load time. The main Performance measures are then shown in sec- drawback of this access model is that, by using tion 5. Finally, section 6 sheds light on what it, we give up the ability to dynamically load there is yet to be done in the field of TLS ac- the module. cess, specially for its broader acceptance. Local Exec is used when the symbol is de- 2 Background fined in the main executable. In such a case, no additional effort is needed to compute the variable address, that is at a constant offset from the GCC, since version 3.3 [8], has provided the thread pointer known at link time. The model ability of marking a variable with a __thread is not suitable for symbols defined in a dynamic modifier, which causes it to be marked as a TLS library, because even though the offset will be symbol. Each module containing TLS symbols a constant at run time, it is not a link-time con- has a section containing the initial values in the stant. TLS block. For all TLS symbols this module exports, there is a fixed offset within such a block in which the symbol can be found. General Dynamic is the most general of all four models, covering both the initial and dy- An initial module is guaranteed to have its TLS namic module load cases. Address resolution data laid out as part of the Static TLS block, a goes through a call to __tls_get_addr, per-thread data structure whose per-thread ad- that by a lookup in the DTV, loads the variable dress is held in a thread pointer, normally a re- address. As parameters, it receives the module served register. Since the same layout is used identifier and the symbol offset within the mod- for the Static TLS blocks of all threads, the rel- ule’s TLS section. ative address from the thread pointer to a symbol defined in such a module is constant for all threads. Local Dynamic is a variant of the General Dynamic model for symbols living in the same At a fixed location in the Static TLS block, module. The call to __tls_get_addr re- there is a pointer to another data structure called ceives a zero-offset parameter, in a way that it the Dynamic Thread Vector, henceforth, DTV. returns the base address of the module. Subse- When a module is dlopened or dlclosed quent access may then just add to it the symbol the thread’s DTV may have to be modified to offset. add or remove the module’s TLS block. Figure 1 shows these data structure and the iter- In some platforms, when a module that has ations between them. Access to a TLS symbol symbols using the initial exec or dynamic mod- can be performed in any one of four models: els is linked into an executable, the linker can perform a relaxation step, allowing the access to be turned into more efficient ones, without Initial Exec is used when the symbol is in any loss of generality (dlopening executables a module loaded at start-up time with the ex- does not make sense). In the ARM current ABI, ecutable. The relative address is a fixed offset no such relaxations are specified. Static TLS Block Offset Dynamic TLS Blocks Ideally, we should be also able to relax the code TP x y z sequences in case of linking into executables, TP offsets Offset which is missing in the current ABI. DTV Module Index Figure 1: Data structures used for TLS han- dling. Static TLS block, the DTV, and the it- erations between them 3 ARM TLS Descriptors For dynamic libraries, whether the module is initially loaded with the executable or dlopened, access go through a call to The ARM ABI currently specifies the use of __tls_get_addr, whose two arguments, two consecutive GOT entries [10] for TLS sym- the module ID and the symbol offset inside the bols relocations.

Load more