OS Paradigms Adaptation to Fit New Architectures

Xavier Joglar, Judit Planas, Marisa Gil
Universitat Politècnica de Catalunya
Jordi Girona 1-3, Barcelona, Spain
[email protected] [email protected] [email protected]

ABSTRACT
Future architectures and computer systems will be heterogeneous multi-core models, which will improve their performance, resource utilization and energy consumption. Differences between cores mean different binary formats and specific concerns when dealing with applications. The OS also needs to manage the appropriate information to schedule resources to achieve the optimal performance. In this paper we present a first approach to allow the application to give information to the OS in order to perform the best resource scheduling for the code characteristics (where it has to run). Based on the continuation model of the Mach microkernel and the device drivers of the Unix-Linux system, the kernel can continue the execution flow from one core to another (i.e. from PPE to SPE in the Cell BE case). In this way, the OS can anticipate costly actions (for example, loading code or data) or reserve resources depending on task needs.

To reach our target, we adapt the loader as well as modify the application binary to divide its code parts depending on their characteristics and where they have to run. Experimental work has been done for x86 with the MMX extension ISA as well as for PPC and Cell BE.

Keywords
Heterogeneous multi-core, OS loader.

1. INTRODUCTION
Current technology trends move to new processors with specialized cores, like the Cell BE [6] or the Intel Exoskeleton [11] with specific processor elements such as those which are GPU-based [7], or other accelerator elements, including DSP and FPGA units. In this scenario, programs must be redesigned in order to fully exploit this heterogeneity and improve their performance. In this way, they can be built from components that work better in specific engines in a more energy-saving way. These specific cores are also suitable for executing specific parts of a program to improve its overall throughput.

From this preliminary point, some aspects must be analyzed and modified. In a first stage, programming and compiling parallel applications should be heterogeneity-aware, deciding and building the appropriate executing stream. Then, a specific management task must be done by the linker and the loader to compose and load the final binary, interpreting its new information and performing the appropriate actions. Finally, the operating system must manage, at runtime, the physical resources.

To take advantage of these specialized units, the OS must be able to schedule the code parts onto the most appropriate unit depending on the code requirements as well as the available execution units [5]. In this work, programmers can help the OS to manage the system resources more efficiently by giving some hints about the application behaviour. For example, if the programmer knows that a function or a loop has special characteristics, he can mark the code properly. That is why it is necessary to add some additional information in the binary and make the operating system able to manage and understand the given information.

Some requirements to achieve this OS improvement are:

- Having the application code portions in the binary (or bit stream map) corresponding to the platform where they have to run.
- Having a "fat binary" executable file containing the different binary parts of code [11].
- Allowing threads to run sequentially cross-ISA (i.e. from one unit to one of a different type) [9].
- Providing the OS with the requisite information to schedule a thread in the appropriate engine (depending on the ISA, processor affinity or system load).

This work is supported by the European Commission in the context of the SARC Integrated Project (EU contract 27648-FP6), the HiPEAC European Network of Excellence and the Ministry of Science and Technology of Spain and the European Union (FEDER) under contract TIN2007-60625.

The main contributions included in this paper are:

- An extensible runtime fully compatible with both current and new architectures.
- A new loader to manage the new binary format.
- New OS objects to hold application heterogeneity information.

The rest of the document is organised in the following way: in the second section, we describe the environment and introduce a brief idea of our target. In the third section we explain the modifications at application level as well as the API implemented to make the interaction between the user and operating system level easier. In section four we present the OS adaptations to be able to manage, store and use the new information. In section five we discuss our results and in section six, the conclusions and future work. Finally, in section seven we give the references where readers can find more information related to our work.

2. OUR APPROACH
As mentioned above, hardware performance can be limited by its software resource management. So as hardware evolves, software has to evolve too, and vice versa. We thought that adapting the executable format to distribute application execution into different processor units would be an exciting chance to improve system performance.

Our proposal is to include additional information into the binary file so that the operating system can understand it and manage system resources as efficiently as possible [8]. We looked for a simple and portable format across different architectures. We considered that the best option would be to modify the ELF format [10] to make the transformation of conventional ELF files easier.

We call our new extension Heterogeneous ELF (HELF) and we use the ELF sections to store the information that the operating system needs. The HELF extension philosophy is based on the idea of embedding different binaries compiled for different ISAs in a single binary file.

The CESOF extension used for Cell BE programs also allows different-ISA binaries in a single ELF-format file (more information related to CESOF, such as diagrams or code examples, can be found in [2]). CESOF embeds compiled code for both the PPE and the SPE in a single file. It uses several ELF sections to include the embedded SPE executable image with additional PPE symbol information.

The two main differences between HELF and CESOF are:

- The type of the new sections created for each piece of code compiled for a different ISA: CESOF creates read-only data sections and we create text sections (for now, we are just considering self-contained functions, so we do not concern ourselves with data).
- CESOF is focused on the Cell BE; we want HELF to be adaptable to as many heterogeneous platforms as possible.

Programming models using the ELF format for heterogeneous multi-core are:

- The Cell Superscalar (CellSs) [1]: provides a simple, flexible programming model that can take advantage of the performance benefits of the Cell. CellSs is built from two principal components: the CellSs compiler and a runtime library. The compiler is a source-to-source translator that takes in an annotated C source file and produces a pair of source files, one for each Cell processor type (PPU and SPU). The runtime constructs a data dependency graph to schedule independent annotated functions to execute on different SPUs in parallel. CellSs uses CESOF.
- The EXO-CHI for the Intel Exoskeleton [11]: uses accelerator elements in a heterogeneous multi-threaded execution model and provides the tools needed to exploit the system performance and reduce power consumption.

The loader must be able to identify HELF files and classify the different parts of code depending on the information added by the programmer or, automatically, by the compiler to the ELF sections. In this way, the operating system will schedule them in the appropriate processor.

To achieve this, we have to consider three aspects:

1. How to adapt binary code in order to split it into different chunks of execution code (this is a programming and compiler concern, so we will not explain the details of this adaptation in this document).
2. How to modify the Linux loader in order to understand and load heterogeneous binaries.
3. Which new data structures have to be added to the kernel to store the new information needed.

For the moment, we are not focusing on data sharing, so some changes might be necessary to adapt our proposal to non shared-memory architectures.

When a code has to continue its execution in a specific core, for reasons such as being compiled for a special ISA or to obtain better performance in an accelerator, the only information that has to be provided is where it continues (an address) and which input data it needs (the parameters). This is the same as what the explicit continuation model does [3].

We have implemented new library calls to specify the entry point and the unit identifier. Based on this information, the system is able to get the required data and prepare the loading and running environment in the new unit.

In the next sections we explain the application and the operating system adaptation to allow that. We will use as an example a matrix multiplication implemented for the Cell BE. The structure of the original code is shown in Figure 1. The PPE creates and initializes the matrices, sends them to the SPEs and waits for the results. The SPEs receive the matrices, perform the computation and send back the results and performance to the PPE. Finally, the PPE verifies the results and prints out the performance reached. The matrix sizes and the number of SPEs to use are configurable by command line parameters.

3. THE APPLICATION SIDE
In this section we present our proposal at the application level. In particular, we explain the way to modify the source code of an application in order to reflect its heterogeneity, and the way its characteristics are reflected in the binary. We also show an example application to compare the original and the adapted version.

From the application point of view, we need to adapt binary files to mark the different code personalities. The point of HELF executables is that the code has been classified according to its characteristics. The method we use to differentiate between these divisions is to create one section for each different piece of code, so that code and data inside a section are homogeneous and with a specific engine profile (although it is also possible to have FPGA byte stream sections). In this way, the operating system will detect these new sections and take the appropriate actions to exploit the heterogeneity of the hardware resources and so improve the application performance.

The execution flow of a HELF file is the same as that of an ELF, until the magic number is checked. At this point, because we have provided a new magic number for our files, if the file is a HELF executable, our loader is called instead of the ELF version.

    int main(int argc, char *argv[]) {
        evaluate_args (argc, argv);
        allocate_and_init_matrixes ();
        for all num_SPEs_used {
            create_new_execution_context ();
            load_spe_matrix_multiplication_code ();
            create_thread_and_start_execution_on_SPE ();
        }
        send_data_to_all_threads ();
        wait_for_all_threads ();
        receive_performance_from_all_threads ();
        print_performance ();
        return 0;
    }

    #pragma spe
    double spe_matrix_multiplication () {
        receive_data_from_ppe ();
        calculate_matrix_multiplication ();
        send_results_to_ppe ();
        return performance;
    }

Figure 1: Structure of the original matrix multiplication code. The top box corresponds to the PPE code and the bottom box, to the SPE code.
In the next subsections we explain the different steps that an application follows from its compilation until it is executed. Also, we introduce the user library we have implemented in order to take advantage of the new OS capabilities we have added.

3.1 Source code
It could be very useful for programmers to use some kind of library or compiler directive in order to decide exactly which chunks of source code they want to divide and where they want to execute each section (for example, specifying the kind of engine if it is possible). One possibility is the use of #pragma [1] and [4].

3.2 HELF Library
The basic function of the HELF Library is to allow the programmer to manage the application execution and its behaviour, using the new system calls we have implemented, so that the thread that calls our library will itself execute the given function in a different core.

The library is fully compatible with other libraries such as Pthreads or OpenMP, so it offers parallelism support as well as heterogeneity.

When an executing thread needs to jump from one execution unit to another, some data and code management must be done in order to prepare the execution environment in the second unit (code and data must be loaded into the unit's local memory). The helf_execute() function manages these tasks, following the execution sequentially across the different ISAs. Figure 2 shows the modifications in the matrix multiplication example to use our library and execute the matrix calculation in the SPE units.

    int main(int argc, char *argv[]) {
        evaluate_args (argc, argv);
        allocate_and_init_matrixes ();
        for all num_SPEs_used {
            helf_execute (&spe_matrix_multiplication, NULL, &performance, SPE);
        }
        send_data_to_all_threads ();
        wait_for_all_threads ();
        receive_performance_from_all_threads ();
        print_performance ();
        return 0;
    }

Figure 2: HELF library call adaptation of the matrix multiplication code to calculate the multiplication in the SPE units.

3.3 Compiler and linker
The compiler must be adapted to be able to separate the code specified by the programmer inside the #pragma directive into the different sections. In the case of using a library, this compiler work would be avoided.

The linker must group all the object files generated by the compiler into a single binary file (there can be more than one object file, depending on the number of source code files and the way they are compiled) and resolve code symbols.

The SPE function must call the helf_exit() library function at the end, to return the execution, as explained in Table 1. Similarly, some OS management must be done to return from the executed function. This is done by the helf_exit() function. Table 1 shows the specification of the new functions and their description.

Function name: helf_execute
Parameters:
- Pointer to the function to be executed.
- Pointer to a structure containing the parameters of the function.
- Pointer to the address where the value returned by the function must be stored.
- Integer representing the architecture where the function must be executed.
Description: This function must be called when a thread wants to jump from one execution unit to another which does not understand the ISA of the first.

Function name: helf_exit
Parameters:
- Pointer to the address where the return value is stored.
- Integer representing the architecture where the function has been executed.
Description: This function must be called when the thread wants to return to the point where helf_execute() was invoked. It is usually called at the end of the function that has been executed by helf_execute().

Table 1: Description of the helf_execute() and helf_exit() functions.

Following the matrix multiplication example, the HELF structure would look like Figure 3: the functions are divided into two different sections: .text and .text.spe. The names of the text sections created start with ".text" followed by a string which is extracted from the #pragma indications (see Figure 1), with a "." between them. We use this name to specify the ISA of the compiled code. The main section (.text) will contain the main function and all other functions that are not marked with the #pragma directive.

Figure 3: HELF structure of the example (ELF Header, Program Header Table, .text, .text.spe, .rodata, …, .data, …, Section Header Table).

If we executed this binary in a conventional operating system we would obtain the same behaviour and would gain nothing from this division, because the ELF loader treats all text sections equally: we need a loader able to manage this kind of binary, and this is explained in section 4.1.

We assume that these first three steps (source code marks, compiler and linker) are already done and this is our starting point. After that, some operating system modifications are necessary in order to:

- Allow the introduction of information in the operating system per-process data structures.
- Maintain information consistency among all the system processes.

4. THE OS SIDE
In this section, we will explain the modifications at operating system level to identify and treat the additional information kept in this format.

4.1 HELF Loader
When the file we want to execute is loaded, the part of the HELF header which contains the magic number is compared with the new magic number. If they are different (i.e. the file is not a HELF), the operation finishes its execution, returning an error. If it is a HELF, it continues with the binary loading.

The task_struct structure now has three new fields in order to manage and store the additional information related to HELF binaries (shown in Figure 4):

- helf_process indicates whether the process represented by this task_struct is a HELF process or not.
- stream_code_array contains an array of nodes. Each node structure represents a different section of the HELF binary and contains some useful information, such as the kind of section it represents, its size, its starting address and pointers to operations for managing it. The array size should be sufficient to meet all system needs.
- hthread_info contains all the information necessary to maintain the state of the thread before jumping to another execution unit (basically, CPU registers).

Values in a task_struct corresponding to a non-HELF process would be zero for the helf_process field and invalid for each element of the array and the hthread_info.

Two important fields in the node are the pointers to the operations that perform the jump between execution units. These operations are architecture-specific. Depending on the ISA section, the loader initializes these pointers to the specific ISA functions.

Figure 4: New task_struct to represent a heterogeneous process, with detail of a node structure (task_struct fields: pid, …, helf_process, stream_code_array, hthread_info; node fields: state, section_type, size, address, open, close, back).

Following the matrix multiplication example presented previously, the structures created would be filled with the state information contained in the HELF binary. The helf_process field would have the value 1 (this is a HELF process), and the first position in the stream_code_array would contain pointers to operations for managing the section .text.spe.

4.2 Execution
In the previous section we explained the interface offered by the HELF user library, and how applications must be adapted to fit our model. In this subsection we focus on the interaction between the library and the operating system.

The two library functions presented invoke two system calls. These system calls will search for the node that represents the architecture where we want to jump to or return to. If this node exists, the open or close functions stored in the node fields will respectively be invoked. This method is similar to that of the Linux device drivers, where a structure stores all the operations that manage each device.

Figure 5 shows the different steps followed by the operating system when an application calls the HELF Library, and its effects from the application point of view. Each colour represents the execution unit of a different ISA.

Figure 5: Cross-ISA execution flow of an application using the HELF Library to jump from one core to another (application side: helf_execute(&func, …) on core type 1, func() on core type 2, then helf_exit(); operating system side: save execution context, load and prepare function parameters, load function code, activate core 2 and transfer control; then restore execution context, save return value, activate core 1 and transfer control).

OS control functions are executed in the general purpose elements, so, from the point of view of the execution element (or thread), we can say that heterogeneity always performs in a sequential way: the main processor forks a new thread (by means of OpenMP, Pthreads or any other thread library) and this new thread is responsible for loading and executing code in a different engine.

The system provides some information related to every HELF process that is being executed. In particular, information about HELF sections in memory for each process can be read through the /proc file system. Figure 6 shows the information extracted from /proc for the program in Figure 1.

    $> cat /proc/helf
    Information about HELF processes:
    PID: 22266
    HELF Sections: 1
    Section #1:
        type: ppe
        size: 0x73d B
        address: 0x08048a9c

Figure 6: Information read from /proc/helf when an application is running on the system.

5. MEASUREMENTS AND RESULTS
In this section we explain the tests we have run. At first, our project was only implemented for the x86 architecture. But as the project went on, we started migrating our implementation to the Cell BEA. Currently, our proposal can run on both architectures, so we have measured it on these test platforms:

- Intel Core 2 Duo processor at 1.86 GHz, with 2 GB of memory.
- PlayStation 3 Cell BE.

Both platforms use our modified kernel, based on version 2.6.24.3. The operating system is Fedora Core 8.

We have performed four different tests to evaluate different aspects of our proposal:

- Measured the binary loading time to evaluate the overhead introduced at kernel level.
- Counted the increment of source lines of code (SLOC) in the kernel, to be able to judge its portability to other platforms.
- Tested the Scimark2 benchmark in the Intel Core 2 Duo and the PowerPC (the PPE part of the Cell BE) architectures.
- Tested an implementation of a matrix multiplication in the Cell BE architecture, using both the PPE and the SPEs.

5.1 Measuring the kernel overhead
The first part of the tests consists of measuring the overhead introduced at kernel level. We generate three versions of the same application:

- ELF: The application with no changes (standard compilation and execution).
- HELF 0: The binary's magic number is changed, so the operating system treats it as a different kind of binary, but the information contained and its structure are the same as the ELF version. We call it HELF 0 because it has no information about heterogeneity.
- HELF 6: This version has six new sections, representing different ISAs. The loader will find extra information and will have to save it in the process' task_struct.

Taking the ELF version as the base, there is no overhead introduced in the HELF 0 version tests, and a few microseconds in the HELF 6 version tests, which is negligible.

5.2 Source lines of code
The second thing we want to measure is the number of source lines of code (SLOC) we have added to the Linux kernel. The approximate number of source lines of code added is detailed below:

- HELF Loader: 750 lines.
- Specific architecture functions: 100 lines.
- Auxiliary functions invoked in different kernel functions: 50 lines.
- New system calls: 60 lines.
- Data structure and constants definitions: 50 lines.

As we can see, most of the new lines belong to the new loader (75%). Approximately, the 2.6 Linux kernel has 5.2 million SLOC, so our proposal represents an increment of 0.02%.

5.3 Scimark2
In this section we present the tests we have made, based on the Scimark2 benchmark (which can be found at http://math.nist.gov/scimark2/) for two different homogeneous architectures: x86 and PowerPC (the Cell BE PPE element).
Firstly, we compared the size (in bytes) of the files between the original benchmark and the adapted version. In particular, we compare the object files generated at compilation time, as well as the executable file. The results are shown in Table 2.

File            | PowerPC Original | PowerPC HELF | %   | x86 Original | x86 HELF | %
FFT.o           | 4020  | 4020  | 0   | 2568  | 2568  | 0
LU.o            | 2332  | 2332  | 0   | 1564  | 1564  | 0
Montecarlo.o    | 1624  | 1624  | 0   | 1192  | 1192  | 0
Random.o        | 3628  | 3628  | 0   | 2296  | 2296  | 0
SOR.o           | 1764  | 1764  | 0   | 1188  | 1188  | 0
SparseCompRow.o | 1440  | 1440  | 0   | 943   | 943   | 0
Stopwatch.o     | 2136  | 2136  | 0   | 1504  | 1504  | 0
array.o         | 1928  | 1928  | 0   | 1452  | 1452  | 0
kernel.o        | 5008  | 5844  | +17 | 3956  | 4372  | +11
scimark2.o      | 4444  | 5492  | +23 | 3704  | 4240  | +14
scimark2        | 26799 | 28000 | +5  | 16369 | 17189 | +5

Table 2: Comparison of the file sizes between the original and the adapted compilation for both architectures: PowerPC and x86.

As we can see in the table, the size of the executable file (scimark2) is increased by only 5%, which we consider acceptable, while the size of its object file (scimark2.o) is increased considerably. The size of the kernel.o object is also increased slightly, and the size of the other files is unchanged, as there is no need to modify them in order to adapt the application to our project.

Secondly, we measured the application performance. The application itself gives us information about its throughput (MFlops). The benchmark provides two different options: small cache-contained data structures and large data structures (which typically do not fit in cache). In these tests we were not concerned about the runtime memory usage or management; that is why we have chosen the small cache-contained version. We test a sequential and a parallel execution (using the Pthread library).

The throughput results for the sequential execution are shown in Table 3, and the execution times are shown in Chart 1.

Function        | PowerPC Original | PowerPC HELF | x86 Original | x86 HELF
FFT             | 28,86  | 29,07  | 138,87 | 138,43
SOR             | 108,11 | 108,21 | 474,62 | 465,45
MonteCarlo      | 9,28   | 9,33   | 46,60  | 46,74
Sparse matmult  | 34,47  | 34,92  | 190,18 | 190,62
LU              | 47,49  | 47,44  | 249,52 | 250,98

Table 3: Throughput results for the sequential execution obtained with the original and the adapted benchmark for both architectures: PowerPC and x86.

Although the overall throughputs are higher in the HELF execution, the differences are generally insignificant, except in the case of the SOR function.

Chart 1: Sequential execution time (in seconds) obtained with the original and the adapted benchmark for both architectures: PowerPC (Cell PPE unit) and x86.

The execution in the x86 architecture is slightly faster (this may be due to the different amount of memory), and both HELF benchmarks run a few seconds faster than the original versions.

The throughput results for parallel execution are shown in Table 4, and the execution times are shown in Chart 2.

Function        | PowerPC Original | PowerPC HELF | x86 Original | x86 HELF
FFT             | 4,60  | 4,10  | 38,08 | 30,36
SOR             | 20,61 | 22,51 | 99,09 | 108,86
MonteCarlo      | 1,42  | 1,62  | 12,12 | 14,12
Sparse matmult  | 7,47  | 6,86  | 37,83 | 56,99
LU              | 7,62  | 7,83  | 73,01 | 68,44

Table 4: Throughput results for the parallel execution obtained with the original and the adapted benchmark for both architectures: PowerPC and x86.

As with the sequential execution, the overall throughput obtained is greater in the case of the HELF benchmark in both architectures, except in some particular cases where the original throughput is a little higher. In this case, the throughput differences are greater than those obtained in the sequential execution.

Chart 2: Parallel execution time (in seconds) obtained with the original and the adapted benchmark for both architectures: PowerPC and x86.

In the parallel execution case, the execution times are very similar and much lower than in the sequential execution (as happens in many applications when they are parallelised). The execution time of the HELF version in the Cell architecture is lower than that of the original version, and the opposite happens in the x86 architecture. Even so, the time difference between the HELF and original benchmarks is greater in the case of the Cell architecture.

5.4 Matrix multiplication on the Cell BE
This last test is performed on the Cell BE platform, using both the PPE and SPE ISAs. We measured the performance (this matrix multiplication calculates its throughput in GFlops) and time consumption of the original application (a simple matrix multiplication, compiled as an ELF binary) and an adapted version using the HELF library (compiled as a HELF binary). The matrices are 3200x3200 floats, and the application was tested using one SPE and four SPEs.

- The throughput results for the execution using one SPE are exactly the same in the original version and in the HELF version (25,37 GFlops of a 25,60 GFlops theoretical peak).
- Like the previous test, the throughput using four SPEs is also the same in the original and in the HELF version (100,97 GFlops of a 102,40 GFlops theoretical peak).

The execution times for both tests are shown in Chart 3. As we can see in the chart, the execution time of the HELF versions is slightly longer than that of the original versions (a few milliseconds), because the operating system side is not completely implemented and we use parts of the libspe library combined with our implementation.

Chart 3: Execution time (in seconds) obtained with the original and the adapted matrix multiplication using one and four SPEs.

6. CONCLUSIONS AND FUTURE WORK
In this paper, we present a new method for managing heterogeneous binaries. We have based our proposal on the continuation model of the Mach microkernel and the device drivers of the Unix-Linux system.

At application level, we modify the application's source code by inserting new information about the heterogeneity of its different code parts. We have implemented a library that helps the programmer to manage the application execution, and allows a thread to continue the execution of different functions in different cores, jumping from one to another.

Heterogeneity is orthogonal to parallelism, so our library is fully compatible with other libraries such as Pthreads, OpenMP or libspe.

At operating system level, we add the ability to recognize new heterogeneous binary extensions and fill each process' task_struct with the related information in order to schedule it in the appropriate engine to achieve the optimal performance. We have implemented the architecture-specific operations which perform the thread jump between different cores.

We have tested our project at kernel level as well as at user level with optimistic results. The tests have been performed on general purpose processors (the implementation is just a first approach, and in this way the tests are simpler). The most important result is that the overhead introduced is negligible, so we might expect better performance when executing the application in the appropriate kind of accelerators.

As future work, more profiling information is needed and different resource management policies should be tested. Finally, a deeper study and evaluation of using different heterogeneous platforms such as GPGPUs or FPGAs is also part of the ongoing work.

7. ACKNOWLEDGMENTS
We sincerely thank Lluc Alvarez for helping us in adapting our proposal to the Cell architecture, and the anonymous reviewers for their comments and suggestions. We gratefully acknowledge the support of the BSC (Barcelona Supercomputing Centre).

8. REFERENCES
[1] Bellens, P., Perez, J. M., Badia, R. M. and Labarta, J. 2006. CellSs: a Programming Model for the Cell BE Architecture. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.
[2] Chow, A. 2006. Programming the Cell Broadband Engine. In Embedded Systems Design magazine.
[3] Draves, R. P., Bershad, B. N., Rashid, R. F. and Dean, R. W. 1991. Using Continuations to Implement Thread Management and Communication in Operating Systems. Technical Report CMU-CS-91-115, Carnegie Mellon University. Also appears in Proceedings of the Thirteenth Symposium on Operating Systems Principles (SOSP).
[4] Eichenberger, A. E., O'Brien, J. K., O'Brien, K. M., Wu, P., Chen, T., Oden, P. H., Prener, D. A., Shepherd, J. C., So, B., Sura, Z., Wang, A., Zhang, T., Zhao, P., Gschwind, M. K., Archambault, R., Gao, Y. and Koo, R. 2006. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Systems Journal, vol. 45, no. 1.
[5] Gil, M., Alvarez, L., Joglar, X., Planas, J. and Martorell, X. 2007. Operating System Support for Heterogeneous Multicore Architectures. In UPC-DAC-RR-CAP-2007-40.
[6] Gschwind, M., Erb, D., Manning, S. and Nutter, M. 2007. An Open Source Environment for Cell Broadband Engine System Software. In Computer, vol. 40, no. 6, pp. 37-47.
[7] Hester, P. 2007. Multicore and Beyond: Evolving the x86 Architecture. In Symposium on High Performance Chips.
[8] Hunt, G. C., Larus, J. R., Tarditi, D. and Wobber, T. 2005. Broad New OS Research: Challenges and Opportunities. In Proceedings of the Tenth Workshop on Hot Topics in Operating Systems, Santa Fe, N.M.
[9] Sondag, T., Krishnamurthy, V. and Rajan, H. 2007. Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor. In Fourth Workshop on Programming Languages and Operating Systems (PLOS 2007).
[10] Tool Interface Standards (TIS) Committee. 1995. Executable and Linking Format (ELF) Specification, version 1.2.
[11] Wang, P., Collins, J., Chinya, G., Jiang, H., Tian, X., Girkar, M., Pearce, L., Lueh, G., Yakoushkin, S. and Wang, H. 2007. Accelerator exoskeleton. Intel Technology Journal, vol. 11, issue 03.