Time-Predictable Chip-Multiprocessor Design

View metadata,Downloaded citation and from similar orbit.dtu.dk papers on:at core.ac.uk Dec 17, 2017 brought to you by CORE provided by Online Research Database In Technology Time-predictable Chip-Multiprocessor Design Schoeberl, Martin Published in: Asilomar Conference on Signals, Systems, and Computers Publication date: 2010 Document Version Early version, also known as pre-print Link back to DTU Orbit Citation (APA): Schoeberl, M. (2010). Time-predictable Chip-Multiprocessor Design. In Asilomar Conference on Signals, Systems, and Computers General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Time-predictable Chip-Multiprocessor Design Martin Schoeberl Department of Informatics and Mathematical Modeling Technical University of Denmark Email: [email protected] Abstract—Real-time systems need time-predictable platforms average case performance does not matter at all, only the to enable static worst-case execution time (WCET) analysis. Im- WCET is important. proving the processor performance with superscalar techniques As a result such a processor will be slower in the average makes static WCET analysis practically impossible. However, most real-time systems are multi-threaded applications and case than a standard processor. To reconcile performance with performance can be improved by using several processor cores predictability we argue to build chip-multiprocessor (CMP) on a single chip. In this paper we present a time-predictable chip- systems. As some of the resource hungry features, which are multiprocessor system that aims to improve system performance hard to analyze, are dropped from the processor design, the while still enabling WCET analysis. transistors are better spent by replicating simple pipelines on The proposed chip-multiprocessor (CMP) uses a shared memory with a time-division multiple access (TDMA) based memory a single chip. The remaining issue is to build a CMP system access scheduling. The static TDMA schedule can be integrated where the access to the shared resource main memory can be into the WCET analysis. Experiments with a JOP based CMP performed time-predictable. The execution time of the threads showed that the memory access starts to dominate the execution running on the different cores shall not depend on each other. time when using more than 4 processor cores. To provide a Therefore, the access to main memory has to be scheduled better scalability, more local memories have to be used. We add a processor local scratchpad memory and split data caches, which statically with a time-division multiple access (TDMA) based are still time-predictable, to the processor cores. arbitration scheme. Embedded applications need to control and interact with I. INTRODUCTION the real world, a task that is inherently parallel. Therefore, those applications are already written in a multithreaded style Research on time-predictable processor architectures is to interface sensors and actuators and execute control low gaining momentum. The research is driven by the issues code at different periods. These multithreaded applications of worst-case execution time (WCET) analysis for current are good candidates for CMP systems. With future many- processors. Standard processors are optimized for average core systems it will even be possible that each thread has its case performance and many of the average case performance own core to execute. In that case, CPU scheduling, with the enhancing features are hard to integrate into static WCET scheduling overhead and the predictability issues due to task analysis. The most important features, such as caches and preemption, disappears and only the memory accesses need to dynamic branch predictors, contain a lot of state information be (statically) scheduled. that depends on the execution history. While exactly this state increases the performance, it is not feasible to track the II. RELATED WORK concrete state in static program analysis. The state needs to be The basis of a time-predictable CMP system is a time- abstracted and due to the information loss in the abstract state, predictable processor. In the following section several ap- the analysis has to make conservative assumptions, such as proaches to processors designed for real-time systems are predicting a cache miss when the cache content is not known. presented. Time-predictable chip-multiprocessing is a very However, with out-of-order processors even this assumption recent research topic. Therefore, there are not yet so many is not safe. Due to timing anomalies [12] this local worst publications available. case does not need to trigger the global WCET. It has been Edwards and Lee argue: “It is time for a new era of shown that a cache hit can actually lead to a higher execution processors whose temporal behavior is as easily controlled time than a cache miss. Those architectures are not timing as their logical function" [5]. A first simulation of their compositional and lead to a state space explosion for static precision timed (PRET) architecture is presented in [10]. WCET analysis. PRET implements a RISC pipeline and performs chip-level A time-predictable processor is designed to enable and multithreading for six threads to eliminate data forwarding simplify WCET analysis [21]. Only analyzable performance and branch prediction. Scratchpad memories are used instead enhancing features shall be implemented. For example, the of instruction and data caches. The shared main memory is pipeline and cache have to be organized to be timing com- accessed via a TDMA scheme, called the memory wheel. A positional to enable independent pipeline and cache analysis. resent version of PRET [4] defines time-predictable access The optimization of the processor is on the WCET instead to DRAM by assigning each thread a dedicated bank in of average case performance. For hard real-time systems the the memory chips. The access to the individual banks is pipelined and the access time fixed. As the memory banks are memory access pattern. The latter property makes memory not shared between threads, thread communication has to be access time-predictable which is relevant for hard real-time performed via the shared scratchpad memory. Although the systems and WCET analysis. [31] introduces the scratchpad PRET architecture is not a CMP system, the concepts used memory management unit (SMMU) as an enhancement to in the multi-threaded pipeline can also be applied to a CMP scratchpad technology. This proposed solution does not require system. We intend to evaluate the PRET memory controller whole-program pointer analysis and makes load and store with our CMP system. Each bank will be assigned to a set of operation time-predictable. CPUs. Within this set we will perform a TDMA based memory The scope of the Connective Autonomic Real-time System- arbitration. on-Chip (CAR-SoC) [15] project is to integrate the CarCore Our design can be seen as an instance of the PRET in a SoC that supports autonomic computing principles for the architecture [11]. However, we leave the name PRET to the use in real-time capable autonomic systems. The multithreaded original Berkeley design. The main difference between our embedded processor executes the time critical application proposal and PRET is that we focus on time predictability and concurrently to a number of helper threads that monitor the PRET on repeatable timing. In our opinion a time-predictable system and provide autonomic or organic computing features architecture does not need to provide repeatable timing as long such as self-configuration, self-healing, self-optimization and as WCET analysis can deliver tight WCET bounds. self-protection. Heckmann et al. provide examples of problematic processor Multi-Core Execution of Hard Real-Time Applications Sup- features in [7]. The most problematic features found are the porting Analysability (MERASA) is a European Union project replacement strategies for set-associative caches. A pseudo- that aims for multicore processor designs in hard real-time round-robin replacement strategy effectively renders the asso- embedded systems. How a single threaded in-order superscalar ciativity useless for WCET analysis. In conclusion Heckmann processor can be enhanced to provide hardware multithreading et al. suggest the following restrictions for time-predictable is described in [13]. By strictly prioritizing multithreading processors: (1) separate data and instruction caches; (2) locally capabilities and completely isolating threads it is possible deterministic update strategies for caches; (3) static branch to reach a deterministic behavior for tight WCET analysis. prediction; and (4) limited out-of-order execution. The authors The CarCore, a multithreaded embedded processor, supports argue for restriction of processor features of actual processors one hard real-time thread to be executed while several non (of the time) for embedded systems. In contrast to that real-time threads run concurrently in the background. The proposal, we also provide additional features, such as the split MERASA processor is a CMP system where a simplified cache, for a time-predictable processor. version of the CarCore processor is used. The shared memory Whitham argues that the execution time of a basic block is accessed via a shared bus. has to be independent of the execution history [29]. As a con- Not that many papers are available on the design of time- sequence his MCGREP architecture reduces pipelining to two predictable CMP systems.

Time-Predictable Chip-Multiprocessor Design

Analyzing the Benefits of Scratchpad Memories for Scientific Matrix

Memory Architectures 13 Preeti Ranjan Panda

Gpgpus: Overview Pedagogically Precursor Concepts UNIVERSITY of ILLINOIS at URBANA-CHAMPAIGN

Hardware and Software Optimizations for GPU Resource Management

Improving Memory Subsystem Performance Using Viva: Virtual Vector Architecture

A Dynamic Scratchpad Memory Unit for Predictable Real-Time Embedded Systems

Architectural Exploration of Heterogeneous Memory Systems Marcos Horro, Gabriel Rodr´Iguez, Juan Tourino,˜ Mahmut T

Designing Scratchpad Memory Architecture with Emerging STT-RAM Memory Technologies

5. Embedded Systems Dilemma of Chip Memory Diversity By

Application-Specific Memory Management in Embedded Systems Using Software-Controlled Caches

FPGA Implementation of a Configurable Cache/Scratchpad

Compiler-Directed Scratchpad Memory Management Via Graph Coloring