Mapping hard real-time applications on many-core processors

Presented at the 24th International Conference on Real-Time and Network Systems (RTNS 2016), 19-21 October 2016, Brest, France, pp. 235-244. HAL Id: hal-01692702.

Quentin Perret (Airbus Operations S.A.S., Toulouse, France), Pascal Maurère (Airbus Operations S.A.S., Toulouse, France), Éric Noulard (DTIM, UFT-MiP, ONERA, Toulouse, France), Claire Pagetti (DTIM, UFT-MiP, ONERA, Toulouse, France), Pascal Sainrat (IRIT, UFT-MiP, UPS, Toulouse, France), Benoît Triquet (Airbus Operations S.A.S., Toulouse, France)

ABSTRACT

Many-core processors are interesting candidates for the design of modern avionics computers. Indeed, the computational power offered by such platforms opens new horizons to design more demanding systems and to integrate more applications on a single target. However, they also bring challenging research topics because of their lack of predictability and their programming complexity. In this paper, we focus on the problem of mapping large applications on a complex platform such as the Kalray MPPA-256 while maintaining a strong temporal isolation from co-running applications. We propose a constraint programming formulation of the mapping problem that enables an efficient parallelization and we demonstrate the ability of our approach to deal with large problems using a real world case study.

1. INTRODUCTION

Many-core processors are a potential technology for the aerospace industry as long as there is a way to certify them according to aeronautics standards, in particular with:

• DO 178 B/C [25] which imposes that the WCET of any application can be computed;

• DO 297 [24] which states that mixed-critical applications can execute on the same resource as long as they are strongly segregated, in the sense that an application cannot interfere with others even in the presence of failures;

• CAST-32 [5] and the FAA white paper [13] which detail how temporal interferences affect the safety and why mitigation means against those interferences are mandatory, when executing on multi-core COTS processors.

Because of the architecture of many-core chips, computing WCET requires to be able to bound safely the interference delays on shared resources [26], meanwhile ensuring segregation imposes that execution times must not vary whatever the co-running applications do.

1.1 Temporal isolation-based framework

The overall framework workflow is shown in Figure 1. When implementing several applications on the same target (single or multi/many-core chips), two stakeholders interact: on the one hand, the application designers are in charge of developing their application and, on the other hand, the platform integrator is in charge of integrating the different applications on the shared platform.

Figure 1: Workflow

In [19], we proposed an approach to implement real-time safety-critical applications on a many-core target in such a way that strict temporal guarantees are ensured whatever the behaviour of other applications sharing the resources is. The approach is based on a four-rules execution model to be followed by the developers (application designers and platform integrator) to avoid any unpredictable or non-segregated behaviours. Some of the rules are enforced by an off-line mapping and computation, while others are ensured by a bare-metal hypervisor that we developed. That initial work concerned mostly the activities of the platform integrator. More precisely, given an abstract view of applications in the form of a set of application budgets, the integrator maps the budgets off-line on the target during the Allocate phase, while respecting the execution model. The output is both the computed mapping and the associated code (memory allocation, boot code of the cores and schedule of budgets). Given the application internal execution in the budget, the applications can then run on the target.

1.2 Contribution

In this paper, we focus on the application designers' activity, which is modified due to the use of a many-core platform. The purpose is to help them provide their budget. More specifically, each application (App) is defined as a set of multi-periodic communicating tasks and is wrapped in a single partition. A partition also contains an application budget (Bud) which is an abstraction of the application that represents its needs in terms of resources (memory, CPU, communication and IOs). The budget is the interface format between the application designer and the platform integrator. Indeed, the latter does not need a complete knowledge of each application to share the platform.

Before explaining how to obtain a budget, we first recall the platform model, the execution model and the application definition in Section 2. We then detail some basic formulas to derive a minimal budget (see Section 3). However, it is up to the application designer to define its budget. Then, we propose a constraint programming approach to validate a given budget (see Section 4). A budget is considered as valid when it enables a correct implementation of the application for both functional and non-functional requirements. During the Validate phase, we compute a complete schedule of the application in its budget. This schedule can be distributed over several cores and includes not only the mapping of tasks on cores but also the tight management of the communication over the Network on Chip (NoC). Once a correct schedule has been found, the user can then decide to keep it or to optimize it by changing some parameters (e.g. the number of cores, the length of the time-slices, ...). A second output is the internal budget code. In Section 5, we illustrate the applicability of the constraint programming approach. We then present the related work and conclude.

2. SYSTEM MODEL

2.1 Hardware platform

2.1.1 Overview

The Kalray MPPA-256 (Bostan version) integrates 288 cores distributed over 16 compute clusters and 4 I/O clusters interconnected by a dual Network on Chip (or NoC), as shown in Figure 2. The 16 compute clusters are dedicated to run user applications while the I/O clusters serve as interfaces with the off-chip components such as the DDR-SDRAM. All cores are identical 32-bit VLIW processors integrating complex components such as a private MMU (Memory Management Unit) enabling both virtual addressing and memory protection.

Figure 2: MPPA-256 architecture overview

2.1.2 Compute and I/O Clusters

Each compute cluster is composed of 16 cores (named PEs), 1 specific core used for resource management (RM), 2 MiB of Static Random Access Memory (or SRAM) exclusively accessible by the elements of the cluster and 1 DMA engine to transfer data between clusters over the NoC. The local SRAM is organised in 16 memory banks that can be accessed in parallel without inter-core interferences.

Each I/O cluster integrates 4 RMs, 4 DMAs (capable of sending data through 4 different NoC access points) and several additional peripherals such as a DDR-SDRAM controller or an Ethernet controller.

2.1.3 Network on Chip

Two 2-D torus NoCs interconnect the compute and I/O clusters. The D-NoC can be used for large data transfers while the C-NoC enables inter-cluster barriers thanks to short control messages. In both cases, the route of each packet must be defined explicitly by software.

2.2 Execution model

In [19], we have defined a four-rules execution model in order to ensure a strict temporal isolation between applications. The application designers must follow those rules to avoid any unpredictable or non-segregated behaviours. We give a brief reminder of the four rules of the execution model:

Rule 1: Spatial partitioning inside compute clusters. Any PE (resp. any local SRAM bank) inside any compute cluster can be reserved by at most 1 partition. This rule ensures that a partition will never suffer from local interferences caused by other partitions.

Rule 2: Isolated accesses on the NoC. Any communication on a route on the NoC does not encounter any conflict. Given two communications, either they take two non-overlapping routes or they access their overlapping routes at different times. More precisely, communications are scheduled in a TDM (Time-Division Multiplexing) manner and strictly periodic slots are defined off-line for each communication.

Rule 3: Off-line pre-defined static buffers. The memory areas to be sent over the NoC must be defined off-line. This allows to both reduce the combinatorial explosion of WCET estimation through static analysis and ease the measurement-based validations.

Rule 4: Isolated accesses to DDR-SDRAM banks. Any bank of the external DDR-SDRAM memory can be shared by two or more partitions as long as they never access it simultaneously. This allows to greatly mitigate the inter-partition interferences at the external bank level.

Some of the rules are enforced by an off-line mapping and computation, while others are ensured by a bare-metal hypervisor that we developed. Further low-level details on the implementation of this hypervisor are provided in [19].

2.3 Application model

An application is a tuple ⟨τ, δ⟩ where:

1. τ = {τ1, ..., τn} is a finite set of periodic tasks. A task τi is defined as τi = ⟨Si, Pi, Ti⟩:

• Si = {τi^1, ..., τi^ni} is the set of sub-tasks in τi, with ni the number of sub-tasks composing τi. All the sub-tasks in Si are assumed to be activated simultaneously at the activation of τi and to have the same implicit deadline equal to Ti.

Additionally, the sub-tasks are defined as τi^j = ⟨Ci^j, Mi^j, Ii^j, Oi^j⟩ with: Ci^j the WCET of the sub-task; Mi^j the memory footprint of the sub-task (size of code, static and read-only data); Ii^j and Oi^j respectively the input and output buffers read and written by τi^j.

• Pi ⊂ Si × Si is the finite set of precedence relations constraining the sub-tasks of τi. More precisely, (τi^x, τi^y) ∈ Pi means that τi^x must be completed before τi^y can start. This kind of precedence relations is assumed to exist only between sub-jobs exchanging some data. Moreover, the precedence constraints imposed by Pi have no cycles and can be seen as a Directed Acyclic Graph (or DAG).

• Ti is the period of the task.

2. δ = {δ1, ..., δm} is the set of data, with δk = ⟨mk, prod, cons⟩ where mk is the size of the data; prod : δ → S provides the sub-task producing the data (with S = ∪i Si); and cons : δ → 2^S provides the set of sub-tasks consuming the data.

The model does not allow precedence constraints between sub-tasks having different parent tasks. However, no such constraints are imposed for the data exchanges. The only constraint for data is to be consumed over the freshest-value semantics in a deterministic manner².

Example 1. We consider an application composed of 2 tasks τ1 and τ2 respectively having 4 and 7 sub-tasks. The 11 sub-tasks exchange 15 data {δa, ..., δo}, read from 1 input buffer i1 and write into 1 output buffer o1¹. Tasks τ1 and τ2 respectively have periods T1 = 24 and T2 = 48. The precedence constraints are depicted in Fig. 3a as two DAGs (one for each task) and the production and consumption of data by tasks and the I/O buffers are depicted in the form of a multi-digraph in Fig. 3b. The parameters of the sub-tasks are provided in Table 1 (where ck stands for clock cycle).

¹ The I/O buffers only represent the interactions with out-of-chip components such as the reception or emission of Ethernet frames.
² Here deterministic refers to the data production and consumption ordering. We consider that two executions of the same schedule must always ensure the same order of production and consumption of all data.

Sub-task   Ci^j (in ck)   Mi^j (in KiB)   Ii^j   Oi^j
τ1^1       500            519             i1     ∅
τ1^2       500            397             ∅      ∅
τ1^3       700            642             ∅      ∅
τ1^4       400            262             ∅      ∅
τ2^1       1,000          287             ∅      ∅
τ2^2       400            799             ∅      ∅
τ2^3       500            542             ∅      ∅
τ2^4       800            764             ∅      ∅
τ2^5       900            490             ∅      ∅
τ2^6       500            399             ∅      o1
τ2^7       600            12              ∅      ∅

Table 1: Example of sub-tasks parameters

Figure 3: Example of sub-tasks dependencies. (a) Precedence constraints; (b) Data exchanges.

2.4 Definition of application's budget

In [19], we have defined the notion of application budget, that is the interface between application designers and the integrator. The application designer must detail to the integrator the amount of resources needed by its software. We remind those definitions below for the paper to be self-contained.

Definition 1. An application budget is defined as a tuple ⟨P, B, I, C⟩ where:

1. P = {PN1, ..., PNm} is a finite set of PNs. A Partition Node (or PN) is a pair ⟨Nc, Nb⟩ which represents the need of processing resource (Nc is a number of cores) and memory (Nb is a number of local memory banks) allocated to an application inside one cluster;

2. B represents a number of external DDRx-SDRAM banks;

3. I = {I/ON1, ..., I/ONj} is a finite set of I/ONs. An I/O Node (or I/ON) represents an access point to the external memory. It is materialized on an I/O cluster;

4. C = {PC1, ..., PCp} is a finite set of PCs. A Partition Communication (or PC) represents a need of communication between two PNs located on two different clusters (or a PN and an I/ON) in the form of a directed strictly periodic NoC access slot. Moreover, a PC is defined as a tuple ⟨src, dst, ⟨T, C, O⟩⟩ where:

• src ∈ P ∪ I (resp. dst ∈ P ∪ I) is the source (resp. destination) of the communication. We assume that there exists at most one PC linking src and dst (no two different ways to communicate);

• ⟨T, C, O⟩ is the strictly periodic slot with period T, duration C and offset O.
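For readers who want to manipulate these objects programmatically, the following Python sketch (our own illustration, not code from the paper or its toolchain; every class and field name is hypothetical) mirrors the application model of Section 2.3 and the budget of Definition 1 as plain data structures.

```python
# Illustrative only: hypothetical data structures mirroring Sections 2.3 and 2.4.
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

@dataclass(frozen=True)
class SubTask:
    name: str                                # e.g. "tau1_1"
    wcet: int                                # C_i^j, in clock cycles
    footprint: int                           # M_i^j, in KiB
    inputs: FrozenSet[str] = frozenset()     # I_i^j, I/O buffers read
    outputs: FrozenSet[str] = frozenset()    # O_i^j, I/O buffers written

@dataclass
class Task:
    period: int                              # T_i
    subtasks: List[SubTask]                  # S_i
    precedences: List[Tuple[str, str]]       # P_i, pairs of sub-task names (a DAG)

@dataclass
class Data:
    size: int                                # m_k, in bytes
    producer: str                            # name of the producing sub-task
    consumers: FrozenSet[str]                # names of the consuming sub-tasks

@dataclass
class Application:                           # the tuple <tau, delta>
    tasks: List[Task]
    data: List[Data]

@dataclass
class PN:                                    # Partition Node <Nc, Nb>
    cores: int
    banks: int

@dataclass
class PC:                                    # Partition Communication <src, dst, <T, C, O>>
    src: str
    dst: str
    period: int
    duration: int
    offset: int

@dataclass
class Budget:                                # the tuple <P, B, I, C>
    pns: List[PN]
    ddr_banks: int
    ions: List[str]
    pcs: List[PC]
```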
3. BUDGETING

The budget of an application is a set of PNs storing and executing all the tasks, a number of DDR banks, a set of I/ONs, and a set of PCs interconnecting the PNs and I/ONs.

3.1 Assumptions

3.1.1 No code fetch

On the Kalray MPPA-256, the PEs cannot directly access the external DDR-SDRAM, they must always use the DMA engines instead. In particular, the loading of code from the external DDR-SDRAM (in case an application does not fit into the local SRAM) must be achieved explicitly by software. However, we will assume in the rest of the paper that no code fetch from the external DDR-SDRAM ever occurs. Several works already considered code fetching, such as [3, 12]; we prefer to focus on the multi-cluster parallelization and the efficient use of the NoC instead.

3.1.2 Execution model and hypervisor

One local SRAM bank is reserved by the hypervisor in each cluster (see [19] for further details). Any of the 15 remaining local memory banks can be used by applications. A global tick (denoted as the Systick) of period Tsys activates periodically the hypervisors of all clusters simultaneously. Thus, the duration and the period of any PC must be multiples of the Systick.

3.1.3 Valid budget

A budget is valid if the application can execute safely with the resources offered by the budget. More precisely:

Definition 2 (Valid budget). For an application A = ⟨τ, δ⟩, a budget ⟨P, B, I, C⟩ is valid if there exist:

• a mapping of all sub-tasks τi^j ∈ S to P. More precisely, each sub-task is associated to a PN = ⟨Nc, Nb⟩, executes on one of the Nc cores and its memory footprint is stored in the Nb banks;

• a local schedule on each core so that the sub-tasks respect their deadlines and their precedence constraints;

• a mapping of each sub-task's I/O to the B DDR banks;

• a mapping for each data on a PC from the producer's PN to the consumers' PNs.

Example 2 (Valid budget and associated schedule). We consider the task set defined in Example 1. As shown on Fig. 4, a valid budget for this task set is composed of three PNs p0, p1 and p2; five PCs c10, c21, c01, c12 and c20 linking the PNs, where cij is the PC going from pi to pj, and one I/ON accessible from two PCs cio2 and c1io. All PNs have a similar configuration with Nc = 1 and Nb = 15. All the PCs linking the PNs also have an identical configuration with the same duration Lc and the same period Tc. The PCs giving access to the I/ON have a period twice as long.

3.2 Minimal budget

In our approach, the definition of the application's budget is left to the application designer and can be chosen for any arbitrary reason. To help the user in its budget estimation, we list below several necessary conditions that any budget should meet to have a chance of being valid.

3.2.1 Memory limitations

Since we do not accept any code fetching, the application code and data must be statically allocated in the PNs' local memory. Thus, we can deduce a lower bound on the number of PNs (that is |P|) required to completely store an application. Assuming that the maximum available memory space in each compute cluster is Ma, the number of PNs is bounded by:

|P| ≥ ⌈ ( Σ_{τi^j ∈ S} Mi^j ) / Ma ⌉

3.2.2 Processing power limitations

Another bound on P can be deduced from the utilization ratio of the application, defined as:

U = Σ_{τi^j ∈ S} Ci^j / Ti

Since the number of cores on which the application is scheduled must be greater than U, we can deduce:

Σ_{pn ∈ P} Nc(pn) ≥ ⌈U⌉

On the Kalray MPPA-256, 1 ≤ Nc ≤ 16 for all PNs because there are 16 PEs in each cluster. Thus, the absolute minimal number of PNs required is ⌈U/16⌉.

Example 3. The memory footprints of the tasks defined in Example 1 are provided in Table 1. We assume the memory space reserved for storing the data to be exchanged to weigh Md = 15 KiB. So, the maximum amount of memory available for storing the tasks in each cluster is Ma = 15 ∗ 128 − Md = 1905 KiB (16 banks, 1 reserved for the hypervisor, 128 KiB in each bank). So, we can deduce from the memory limitation that |P| ≥ 3. The utilization ratio of the task set of Example 1 is 1.85. From the processing power limitation we can deduce:

Σ_{pn ∈ P} Nc(pn) ≥ 2
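Both necessary conditions are simple arithmetic. The sketch below (our own illustration, not part of the paper's toolchain) reproduces the numbers of Example 3; the only added assumption is the time base, taking the periods of Example 1 as T1 = 2400 and T2 = 4800 clock cycles so that the utilization evaluates to the 1.85 reported above.

```python
import math

# Table 1: WCETs (clock cycles) and memory footprints (KiB) per sub-task.
tau1_wcet = [500, 500, 700, 400]
tau2_wcet = [1000, 400, 500, 800, 900, 500, 600]
footprints = [519, 397, 642, 262, 287, 799, 542, 764, 490, 399, 12]

# Assumed time base: periods expressed in clock cycles (T1 = 2400, T2 = 4800).
T1, T2 = 2400, 4800

# Memory limitation (Section 3.2.1): 16 banks of 128 KiB, 1 bank reserved for
# the hypervisor, Md = 15 KiB kept for the exchanged data.
Md = 15
Ma = 15 * 128 - Md                          # 1905 KiB available per cluster
min_pn_memory = math.ceil(sum(footprints) / Ma)

# Processing power limitation (Section 3.2.2): utilization ratio U.
U = sum(tau1_wcet) / T1 + sum(tau2_wcet) / T2
min_cores = math.ceil(U)
min_pn_cores = math.ceil(U / 16)            # at most 16 PEs per cluster

print(f"U = {U:.2f}, total cores >= {min_cores}")                      # 1.85, 2
print(f"|P| >= {min_pn_memory} (memory), >= {min_pn_cores} (cores)")   # 3, 1
```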

4. BUDGET VALIDATION

Given a budget, we propose to evaluate its validity by finding a valid schedule of the considered application in the budget with a constraint programming approach. Such a solution is quite standard to compute off-line mappings and schedules. It also allows to take into account other constraints coming from the specificities of a many-core platform (i.e. the local SRAM limitations or the non-negligible NoC-related delays) and the limitations imposed by the execution model (i.e. only strictly periodic communications between clusters). To the best of our knowledge, there is no scheduling technique capable of managing all those constraints simultaneously in the literature.


Figure 4: Schedule of the application given in Example 1 on the budget of Example 2. The communication slots used are: B20^2 = {δb,1}; B20^4 = {δd,1, δe,1}; B20^7 = {δb,2}; B20^8 = {δd,2, δe,2}; B21^4 = {δd,1}; B21^8 = {δd,2}; B01^3 = {δg,1, δh,1}; B01^7 = {δm,1, δn,1}; B10^5 = {δi,1}; B10^6 = {δj,1}; B12^7 = {δl,1}.

4.1 Modelling framework

Many approaches in the literature have been proposed to map dependent task sets on multi/many-core architectures using an ILP formulation of the problem [3, 11, 20]. However, those approaches often face major scalability issues when applied to large applications, making them unusable in an industrial context without considering sub-optimal heuristics. In this paper, we use a different modelling framework that has proven to be practically useful in order to overcome the same scalability issues that we faced when we developed the preliminary ILP-based formulation of our industrial case study. In the following sections, we will formulate the scheduling problem using the notion of Conditional Time-Intervals that has been introduced into IBM ILOG CP Optimizer since version 2.0. In order to ease the reading, we provide a short introduction to some of the concepts associated with this scheduling framework. Further motivation behind this approach and the formal semantics of the operations on interval variables can be found in [15] and [16].

An interval variable i represents an activity of finite duration l(i) whose start date s(i) must be computed by the solver. Each interval i is defined in a pre-defined time window [a(i), e(i)] where a(i) and e(i) respectively are the activation and the deadline of i. This means that a correct schedule must always ensure that s(i) ≥ a(i) and s(i) + l(i) ≤ e(i). i is said to be optional if its presence in the final solution is not mandatory. Among all the interval-related constraints implemented within IBM ILOG CP Optimizer, we use:

• start constraints: the start date s(i) of the interval i can be constrained;

• presence constraints: for an optional interval i, p(i) ∈ [0, 1] denotes whether 1) i is present (i.e. p(i) = 1), in which case i is part of the final solution and all constraints on it must be met, or 2) absent (i.e. p(i) = 0), meaning that i is ignored and not considered in any constraint in which it was originally involved;

• precedence constraints: an interval i1 can be required to occur before a second interval i2:

i1 → i2 ⇔ s(i1) + l(i1) ≤ s(i2)

• alternative constraints: for a set of optional intervals I = {i1, ..., in}, it is possible to express that only one interval is present and all the others are absent:

⊕(I) ⇔ Σ_{i ∈ I} p(i) = 1

• cumulative function constraints: a cumulative function represents the usage of a resource by different activities over time as the sum of the individual contributions of these activities. Cumulative functions can be used to constrain the usage of a resource to fit into a specific envelope representing the resource capabilities. In particular, the pulse function ⊓(i, h) increments the usage of the resource by h at the beginning of an interval i and decrements it at the end of i if i is present.

Example 4 (Avoid overlapping with cumulative functions). Assuming a set of intervals to be scheduled I = {i1, ..., in} that all use a single resource that is accessible by only one activity at a time, we can enforce a non-overlapping constraint between these intervals by using cumulative functions with Σ_{i ∈ I} ⊓(i, 1) ≤ 1.

Overall, this formulation with conditional intervals allows a natural modelling of problems in which the activities to be scheduled can be processed by different resources. In such problems, each activity can be associated with several optional intervals (one for each resource) and constrained so that only one of the intervals is present for each activity in the final solution.

4.2 Problem Formulation

4.2.1 Problem inputs

We consider an application A = ⟨τ, δ⟩, a budget ⟨P, B, I, C⟩ and a specific platform. Since Conditional Time-Intervals apply on jobs and since tasks can have different periods, we flatten the tasks as jobs on the hyperperiod and the schedule will follow a repeating pattern:

TH = lcm_{τi ∈ τ} (Ti)

We define the job τi,k of task τi as the k-th activation of τi in one hyperperiod and τi,k^j as the k-th activation of the sub-task τi^j. By doing so, we transform the initial multi-periodic problem into a job scheduling problem. S will represent the set of sub-jobs.

The precedence constraints are assumed to exist only between sub-tasks of the same parent task. Therefore, we duplicate each constraint on sub-tasks into as many constraints on sub-jobs.

Since each data can have at most one producing sub-task, it is assumed to be over-written by each sub-job of this sub-task. When the data must be read by a sub-task allocated to a different cluster, it is sent from the producing cluster to the consuming cluster during a communication slice following its production. The k-th data δx produced by τi,k^j is denoted as δx,k.

4.2.2 Decision variables

There are two decision variables:

j: Each sub-job τi,k^j ∈ S is associated with |P| optional interval variables. ∀pn ∈ P, ∀τi,k^j ∈ S, j(τi,k^j, pn) is present if τi,k^j is allocated on the PN pn. All the sub-job intervals have a fixed length equal to the WCET of the sub-job and are defined in a fixed time window [(k − 1) × Ti; k × Ti].

d: Each data δx,k is associated with |C| optional interval variables. ∀pc ∈ C, ∀δx,k ∈ δ, d(δx,k, pc) is present if the data δx,k is sent over the PC pc. The duration of each data interval is fixed and equal to the duration of its corresponding PC. The time window in which a data interval is defined is equal to the one of its producing sub-job.

Remark 1 (Duration of data intervals). Note that the duration of data intervals does not represent the actual time required to send the data over the NoC. The link between the communication time and the number of data allocated to the same PC is made in Section 4.2.7. Note moreover that during a hyperperiod, there may be several slots for a specific PC (exactly ⌈TH / T⌉). Thus the start time of an interval d indicates in which slot the data is sent.

4.2.3 PN utilization constraints

Each sub-job is executed only once and in only one PN:

∀τi,k^j ∈ S, ⊕({j(τi,k^j, pn) | pn ∈ P})   (1)

If different sub-jobs of the same sub-task could execute on different clusters, it would imply 1) to duplicate the sub-task's code and data on all of these clusters; 2) to enforce coherency between the static data of all duplicates thanks to NoC communications. So, in order to reduce the pressure on the NoC and the local SRAM, and for simplicity reasons, we impose that all sub-jobs of a sub-task execute on the same PN:

∀τi,k^j ∈ S, ∀pn ∈ P, p(j(τi,k^j, pn)) = p(j(τi,k+1^j, pn))   (2)

The sub-jobs assigned to a common PN may not overlap. This can be modelled by constraining the utilization, expressed with cumulative functions, to be less than the number of cores Nc:

∀pn ∈ P, Σ_{τi,k^j ∈ S} ⊓(j(τi,k^j, pn), 1) ≤ Nc(pn)   (3)

Constraint 3 is a necessary and sufficient condition for being able to execute all sub-tasks allocated to a PN non-preemptively on Nc(pn) cores since the interval graph associated to each PN is chordal and can thus be colored using Nc(pn) colors [9].

4.2.4 Local memory constraints

Each sub-task τi^j has a memory footprint Mi^j that represents the amount of local SRAM used by τi^j in the cluster to which it is attached. Then, we ensure that the local memory available in each PN, Ma, is sufficient to store the sub-tasks mapped on this PN:

∀pn ∈ P, Σ_{τi^j ∈ S} p(j(τi,0^j, pn)) × Mi^j ≤ Ma   (4)
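Continuing with the same docplex.cp conventions as the previous sketch, the fragment below shows how the j variables and constraints (1)-(3) could be encoded; the PNs, sub-jobs and WCETs are made-up placeholders, not data from the case study.

```python
# Sketch of the j decision variables and constraints (1)-(3); illustrative only.
import functools, operator
from docplex.cp.model import CpoModel, interval_var, presence_of, pulse

mdl = CpoModel()

PNS = {"pn0": 1, "pn1": 1}   # PN name -> Nc(pn), the number of reserved cores
# Sub-jobs: (sub-task name, activation index k, WCET, release, deadline).
SUBJOBS = [("t1_1", 0, 500, 0, 2400), ("t1_1", 1, 500, 2400, 4800),
           ("t2_1", 0, 1000, 0, 4800)]

# j(sub-job, pn): one optional, fixed-length interval per (sub-job, PN) pair.
j = {(s, k, pn): interval_var(size=c, start=(r, dl), end=(r, dl),
                              optional=True, name=f"j_{s}_{k}_{pn}")
     for (s, k, c, r, dl) in SUBJOBS for pn in PNS}

# (1) Each sub-job is executed exactly once, on exactly one PN.
for (s, k, c, r, dl) in SUBJOBS:
    presences = [presence_of(j[(s, k, pn)]) for pn in PNS]
    mdl.add(functools.reduce(operator.add, presences) == 1)

# (2) All sub-jobs of a sub-task execute on the same PN.
for pn in PNS:
    mdl.add(presence_of(j[("t1_1", 0, pn)]) == presence_of(j[("t1_1", 1, pn)]))

# (3) At most Nc(pn) sub-jobs run in parallel on each PN (cumulative function).
for pn, nc in PNS.items():
    usage = functools.reduce(operator.add,
                             (pulse(j[(s, k, pn)], 1) for (s, k, c, r, dl) in SUBJOBS))
    mdl.add(usage <= nc)
```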
4.2.5 Precedence constraints

The precedence constraints imposed by the application model can be separated into two categories. A precedence constraint (τi^j, τi^l) is said to have:

• forward data ⇔ ∃δx ∈ δ, (τi^j = prod(δx)) ∧ (τi^l ∈ cons(δx)),

• backward data ⇔ ∃δx ∈ δ, (τi^j ∈ cons(δx)) ∧ (τi^l = prod(δx)).

With forward data, the freshest value semantics imposes that τi^l cannot start before it receives the data produced by τi^j. If both sub-tasks are executed on the same cluster, the data produced by τi^j will be visible to τi^l immediately after its production since the address space is shared. In this case, the precedence constraint can be met simply by starting τi^l after the completion of τi^j:

∀(τi,k^j, τi,k^l) ∈ Pk, ∀pn ∈ P, j(τi,k^j, pn) → j(τi,k^l, pn)   (5)

However, if the two sub-tasks are not executed on the same cluster, the data will need to be sent through the NoC after its production to become visible by τi^l. Thus, first, the data must be assigned to a PC slot after the end of the producing sub-job:

∀pc ∈ C, ∀δx,k ∈ δ, j(prod(δx,k), src(pc)) → d(δx,k, pc)   (6)

And then the consumer τi^l cannot start before the completion of the communication slot during which the data is sent:

∀(τi,k^j, τi,k^l) ∈ Pk, ∀pc ∈ C, ∀δx,k ∈ δ,
(τi,k^j = prod(δx,k)) ∧ (τi,k^l ∈ cons(δx,k)) ⟹ d(δx,k, pc) → j(τi,k^l, dst(pc))   (7)

With backward data, the precedence constraint imposes that the data consumed by τi,k^j is always the one that was produced by τi,k−1^l. To enforce the respect of this constraint, if the two sub-tasks are on the same cluster we reuse constraint 6; otherwise we impose that τi,k^j completes before the beginning of the communication slice during which the data produced by τi,k^l is sent, so that τi,k^j could never consume a too fresh data:

∀(τi,k^j, τi,k^l) ∈ Pk, ∀pc ∈ C, ∀δx,k ∈ δ,
(τi,k^j ∈ cons(δx,k)) ∧ (τi,k^l = prod(δx,k)) ⟹ j(τi,k^j, dst(pc)) → d(δx,k, pc)   (8)
4.2.6 Data mapping constraints

Any data produced by a sub-job and consumed by any other sub-job hosted on a different cluster is sent during a communication slice after the completion of the producing sub-job. To do so, we constrain the data produced by the first sub-job of a sub-task to be present on a PC if at least one consuming sub-task is present on the destination PN:

∀pc ∈ C, ∀δx ∈ δ,
p(j(prod(δx,0), src(pc))) ∧ ( Σ_{τi,0^j ∈ cons(δx,0)} p(j(τi,0^j, dst(pc))) ≥ 1 ) = p(d(δx,0, pc))   (9)

Additionally, we impose that all sub-data be placed on the same PC:

∀pc ∈ C, ∀δx,k ∈ δ, p(d(δx,k, pc)) = p(d(δx,k+1, pc))   (10)

And finally, we enforce all the data sent to be aligned in time with the activation of their PC, assuming that T(pc) and O(pc) respectively are the period and the offset of the PC:

∀pc ∈ C, ∀δx,k ∈ δ, s(d(δx,k, pc)) mod T(pc) = O(pc)   (11)

4.2.7 PC utilization constraints

The amount of data sent during each communication slice must be restricted in order to be compatible with the hardware capabilities and to be implementable on the real target. Since our execution model and hypervisor provide a property of interference avoidance on the NoC [19], we can derive the maximum number of flits that can be sent during one communication slice simply by assuming that the latency of a NoC link is ∆L, that the latency of one router is ∆R and that the maximum number of routers on a NoC route is nR^max. Indeed, based on the models we introduced in [18], we can deduce that the maximum time required for Nflit flits to completely cross a NoC route is (nR^max + 1)∆L + nR^max ∆R + Nflit. So, we can formulate the maximum number of flits that can be transferred during one communication slice of length Lc as:

Nflit^Lc = Lc − (nR^max + 1) × ∆L − nR^max × ∆R

Secondly, in order to compute the number of flits required to send a data δx, we define nbytes^δx and nbytes^flit respectively as the number of bytes in δx and the number of bytes in one NoC flit. So, the number of payload flits required to transfer the data is nflit^δx = ⌈nbytes^δx / nbytes^flit⌉. Assuming nflit^pkt to be the maximum number of payload flits in one NoC packet, we can derive the number of packets required to send δx as npkt^δx = ⌈nflit^δx / nflit^pkt⌉. Then, with nflit^head the number of flits in one NoC packet header and nflit^bbl the number of empty flits lost in the bubble separating two consecutive packets, we can derive the total number of flits Nflit^δx required to send δx as:

Nflit^δx = nflit^δx + npkt^δx × nflit^head + (npkt^δx − 1) × nflit^bbl

Finally, in order to simplify the upper-bounding of the amount of data that can be sent during one communication slice, we take the conservative assumption that all data are placed into non-contiguous memory areas. So, assuming that Nflit^gap is the number of empty flits lost in the gap incurred when the DMA jumps to a non-contiguous memory address, we can constrain the amount of data sent into a single communication slice with constraint 12:

∀pc ∈ C, Σ_{δx ∈ δ} ⊓(d(δx, pc), Nflit^δx + Nflit^gap) ≤ Nflit^Lc   (12)

Remark 2 (NoC-aware optimization of the memory mapping). It would be preferable, in order to achieve the best performance, to include the positioning of data with respect to each other into the scheduling problem. Indeed, an optimized memory mapping could enable to group several data into bigger memory chunks that could be sent over the NoC more efficiently. However, such an approach would greatly increase the complexity of the scheduling problem. We will investigate the possibilities of optimal, or at least heuristic, NoC-aware optimization of the memory mapping in future work.

In order to fulfill implementability constraints, we must also take into account the fact that our DMA micro-code is able to autonomously send only a finite number of non-contiguous memory areas, denoted Nbufs^DMA:

∀pc ∈ C, Σ_{δx ∈ δ} ⊓(d(δx, pc), 1) ≤ Nbufs^DMA   (13)
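The flit accounting of Section 4.2.7 is plain integer arithmetic; the helper below illustrates it with made-up parameter values (the paper does not list the MPPA's actual flit, header, bubble or gap sizes here, so all defaults are assumptions).

```python
import math

def flits_for_data(data_bytes: int, bytes_per_flit: int = 4,
                   payload_flits_per_pkt: int = 64, header_flits: int = 2,
                   bubble_flits: int = 1) -> int:
    """Total flits N_flit^dx needed to send one data item (Section 4.2.7).

    All default parameter values are illustrative assumptions, not figures
    taken from the Kalray MPPA-256 documentation.
    """
    payload = math.ceil(data_bytes / bytes_per_flit)       # n_flit^dx
    packets = math.ceil(payload / payload_flits_per_pkt)   # n_pkt^dx
    return payload + packets * header_flits + (packets - 1) * bubble_flits

def slice_capacity(slice_len: int, max_routers: int,
                   link_latency: int, router_latency: int) -> int:
    """Maximum flits N_flit^Lc that fit in one communication slice of length Lc."""
    return slice_len - (max_routers + 1) * link_latency - max_routers * router_latency

# Example: a 1 KiB data item and a 2,000-cycle slice over at most 8 routers.
n = flits_for_data(1024)
cap = slice_capacity(2000, max_routers=8, link_latency=1, router_latency=3)
print(n, cap, n <= cap)
```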
4.2.8 Determinism constraints

Two sub-jobs are allowed to exchange data without being constrained by a precedence relation. In this case, if a data is produced or received during the execution of a consuming sub-job, the order of consumption of this data could be non-deterministic and even change over time depending on the real execution times of the sub-jobs. In this paper, we consider this non-determinism in the data production-consumption ordering as not acceptable since the semantics of such an execution is not clear and it would break the repeatability of executions (even with the same set of input values) that is required for test purposes. We avoid this non-determinism by forbidding the overlapping between the data interval and its consuming sub-jobs:

∀δx,k ∈ δ, ∀τi,l^j ∈ cons(δx,k), ∀pc ∈ C,
(prod(δx,k), τi,l^j) ∉ Pk ∧ (τi,l^j, prod(δx,k)) ∉ Pk ⟹ ⊓(d(δx,k, pc), 1) + ⊓(j(τi,l^j, dst(pc)), 1) ≤ 1   (14)

Similarly, if the producing and the consuming sub-jobs are present on the same PN, they must not overlap:

∀δx,k ∈ δ, ∀τi,l^j ∈ cons(δx,k), ∀pn ∈ P,
(prod(δx,k), τi,l^j) ∉ Pk ∧ (τi,l^j, prod(δx,k)) ∉ Pk ⟹ ⊓(j(prod(δx,k), pn), 1) + ⊓(j(τi,l^j, pn), 1) ≤ 1   (15)

4.2.9 Managing IOs

In our approach, the communications with out-of-chip components are handled as read and write requests into the external DDR-SDRAM. Indeed, this can simulate the reception and emission of AFDX frames for example. Since all transactions with the external memory are handled by the compute clusters' DMAs on the Kalray MPPA-256, they can be taken into account exactly like the application's data. Thus, we assume all the IO buffers to be represented as conditional intervals together with the data. The only difference resides in the PCs on which IO intervals can be placed, which are the PCs linking the PNs and the I/ONs. By doing so, all the constraints for data intervals on PCs can be applied directly for managing the IO intervals. In future work, we will integrate the fine-grained modelling of the DDR access time that we developed in [18] into the formulation of constraint 12 when applied on PCs targeting I/ONs.
5. EXPERIMENTAL RESULTS

As a case study, we used a large industrial application from Airbus. This application has several tasks with harmonic periods. The total number of sub-jobs and associated data is close to 100,000, leading to a large optimization problem with several million decision variables and constraints.

5.1 Implementation choices

5.1.1 One PE per PN

In our experiments, we limited the number of cores in each PN to 1 for the following reasons: 1) we limit the potential interferences suffered by the computing core to the local memory accesses generated by the DMA in order to simplify and improve the computation of tight WCETs of sub-tasks³; 2) we focus on the mostly unexplored problem of inter-cluster parallelization requiring a tight management of the NoC to exhibit the application's speed-up opportunities through parallelization despite the non-negligible NoC-related delays.

³ The problem of intra-cluster parallelization has already been addressed in the literature in [3] and the Capacites project [2] and will be further explored in the Assume project [1].

5.1.2 Prompt and symmetric PCs

In the budgets considered during the experiments, we always assume symmetric and prompt PCs connecting all the PNs of the budget. By symmetric, we mean that each PN has one outgoing PC to send data to any other PN and all PCs have the same duration Lc and the same period Tc. By prompt, we mean that the value of Lc typically is in the order of magnitude of the Systick and that Tc is equal to its minimal value of (n − 1) × Lc with n the number of PNs. The symmetry between the PCs is chosen to simplify the expression of the budget and we argue that the promptness of the PCs is well suited for the targeted case study. Indeed, our industrial application basically has many data of small size. By choosing fast PCs we hope to shorten the delays involved by precedences with forward data. The evaluation of fundamentally different budgets with asymmetric and/or slow PCs will be done in future work.

5.2 Goal

Overall, the goal of these experiments is twofold. Firstly, we aim at finding a valid budget with a minimal footprint in order to both maximize the space left to other potential co-running applications and increase the utilization ratio of the reserved resources. And secondly, we want to explore the possibilities of speeding-up the application when parallelized over several clusters despite the non-negligible latencies involved by the NoC.

5.2.1 Narrowing the budget

To minimize the footprint of the budget, we compute a minimal number of PNs denoted as npn^min using the rules of Section 3.2. Based on this, we define and evaluate 3 potential configurations with |P| equal to npn^min + i for i ∈ [0; 2]. Secondly, we investigate the impact of the NoC configuration with 3 PC setups where the length of each PC is equal to Lc^min + j × Systick, with j ∈ [0; 2] being the PC configuration number.

5.2.2 Increasing speed-up

To evaluate the speed-up possibilities, we artificially reduce the temporal horizon (i.e. the hyperperiod) on which tasks are scheduled. In this case, the maximum speed-up and minimal makespan is obtained when no valid budget can be found for any further reduced temporal horizon. We define a division parameter kdiv to compute schedules with reduced periods defined as T̃i = ((256 − kdiv)/256) × Ti and we evaluate 256 instances of the application for 0 ≤ kdiv ≤ 255. All other parameters, including the WCETs of sub-tasks, are left unchanged.

5.3 Results

Figure 5 shows the time (in seconds) required by the solver to compute a schedule meeting all the constraints as a function of the kdiv parameter in 9 different budget configurations. We can see in Figure 5a that the maximum kdiv (representing the maximum processing load) is counter-intuitively achieved with the budget configuration offering the smallest number of PNs. However, we can also observe on all three figures that the maximum kdiv are invariably obtained with the shortest PCs. This seems to demonstrate that the bottlenecking parameter to achieve the best performance with this case study is the latency of data exchanged through the NoC and not the processing power. Figure 6 depicts an example of PC utilization from the experiment with the maximum kdiv over a complete hyperperiod. Figure 6a shows the number of non-contiguous memory areas that are sent during each PC activation and Figure 6b shows the number of flits that are sent in each PC slot. Both figures are annotated with the maximum values imposed by constraints 13 and 12. We observe that the limitation from constraint 13 (Fig. 6a) seems to be a lot more bottlenecking than the limitation from constraint 12. This means that the limiting parameter in this case study is not the actual NoC bandwidth but rather the capabilities of the DMAs to send many small non-contiguous data. As mentioned in Remark 2, a better optimized memory mapping could probably help send bigger chunks of data and thus balance the curves of Figure 6 to achieve the best performances.

[Figure 5 panels: (a) |P| = npn^min; (b) |P| = npn^min + 1; (c) |P| = npn^min + 2. Each panel plots the solving duration in seconds (up to 12,000) as a function of kdiv, with one curve per PC length (Lc^min, Lc^min + 1, Lc^min + 2) and time-outs marked.]

Figure 5: Comparison of computation times as a function of the load increase with different budgets.

[Figure 6 panels: (a) PC utilization regarding the number of DMA buffers, bounded by Nbufs^DMA; (b) PC utilization regarding the amount of data sent, bounded by Nflit^Lc. x-axis: communication slices over one hyperperiod, from 0 to TH.]

Figure 6: Example of PC utilization under maximum processing load.

6. RELATED WORK

In this paper, we presented a mapping procedure enabling the efficient parallelization of one application into its budget. However, since our execution model allows the concurrent execution of several partitions, our mapping technique can be re-used on several applications needing to be mapped into their corresponding budgets. To the best of our knowledge, no other approaches enabling the execution of multiple parallelized applications on the Kalray MPPA-256 have been proposed in the literature yet.

Generic DAG scheduling. The theoretical problem of scheduling real-time DAG parallel tasks has already been addressed in the literature for both mono-core [6] and more recently multi-core [17, 22, 23] processors with (G)EDF. However, the classical DAG model used in these papers does not include information on the tasks' communication latencies or on memory limitations and is thus not directly applicable to the Kalray MPPA-256.

Similarly, the static scheduling of DAG tasks on multiprocessor systems with a makespan-optimization objective has been widely studied, mostly using heuristics based on list algorithms [14]. Nevertheless, to the best of our knowledge, there is no published algorithm able to manage simultaneously the precedence constraints together with communication latencies, memory limitations and other constraints coming from an execution model.

Mapping on the Kalray MPPA. The problem of mapping real-time applications on the Kalray MPPA-256 has already been addressed several times in the literature [3, 8, 10, 12]. In [3], the authors propose a framework for mapping automotive applications inside a single compute cluster accessing the external DDR-SDRAM but never address the problem of multi-clustered applications. On the other hand, the NoC is taken into account in [8] and [10]. However, in both cases, the guarantees on the NoC are provided with a competition-aware approach based on Network Calculus. Our approach relies on the competition-free property of the NoC TDM schedule provided by our execution model. We argue that such an approach enables shorter guaranteed NoC latencies that are better suited to benefit from a multi-cluster parallelization [21].

Mapping on other many-core processors or distributed systems. Puffitsch et al. [20] proposed an execution model and a framework to map dependent task sets on several multi/many-core architectures. However, the message passing latencies are derived from a measured WCTT which occurs in a situation of maximum contention while our approach relies on the contention-avoidance property provided by our execution model. In [4], the authors proposed an efficient heuristic to map Synchronous Dataflow models onto a time-triggered architecture but the assumptions on the system model differ from our approach. Indeed, no support is offered to either account for multi-rate applications or handle constraints coming from an execution model. In [7], Craciunas et al. proposed both an SMT and a MIP formulation of a co-scheduling problem handling simultaneously the generation of core and network schedules. Since the authors assume a TTEthernet network and since other constraints such as the memory limitations are taken into account, the goal of the mapping algorithm is very close to ours. The main differences reside in the system model as they assume preemptive scheduling at the tt-task level while we assume non-preemptive scheduling at the sub-task level. Moreover, all the tt-tasks are assumed to be pre-assigned to an end-system while our approach includes the placement of sub-tasks to cores together with the core and NoC scheduling.

7. CONCLUSION

In this paper, we have proposed an approach enabling the automatic parallelization of large applications onto the MPPA architecture. This approach meets the requirements imposed by the execution model providing the property of strong temporal isolation between co-running applications thanks to spatial and temporal partitioning. We provided a formulation of the mapping problem as a CSP using the notion of time-intervals and we demonstrated the ability of the approach to deal with large problems with an industrial case study. Overall, the mapping procedure that we described in this paper together with the real world implementation of our execution model that we presented in [19] form a complete work-flow enabling an end-to-end integration of real-time applications on the Kalray MPPA-256.

In the future, we will propose additional guidelines to help the application designers in the process of defining their budgets and we will investigate the optimization possibilities of the memory mappings in order to improve the NoC utilization and the overall performances.

8. REFERENCES

[1] Affordable Safe And Secure Mobility Evolution (ASSUME). http://www.assume-project.eu/.
[2] Parallel Computing for Time and Safety Critical Applications (CAPACITES). http://capacites.minalogic.net/en/.
[3] M. Becker, D. Dasari, B. Nicolic, B. Åkesson, V. Nélis, and T. Nolte. Contention-Free Execution of Automotive Applications on a Clustered Many-Core Platform. In 28th Euromicro Conference on Real-Time Systems (ECRTS'16), July 2016.
[4] T. Carle, M. Djemal, D. Potop-Butucaru, and R. De Simone. Static mapping of real-time applications onto massively parallel processor arrays. In 14th International Conference on Application of Concurrency to System Design (ACSD'14), pages 112–121, 2014.
[5] CAST (Certification Authorities Software Team). Position Paper on Multi-core Processors - CAST-32, 2014.
[6] H. Chetto, M. Silly, and T. Bouchentouf. Dynamic scheduling of real-time tasks under precedence constraints. Real-Time Systems, 2(3):181–194, 1990.
[7] S. S. Craciunas and R. S. Oliver. Combined task- and network-level scheduling for distributed time-triggered systems. Real-Time Systems, 52(2):161–200, 2016.
[8] B. D. de Dinechin, Y. Durand, D. van Amstel, and A. Ghiti. Guaranteed Services of the NoC of a Manycore Processor. In 2014 International Workshop on Network on Chip Architectures (NoCArc'14), pages 11–16, 2014.
[9] F. Gavril. Algorithms for Minimum Coloring, Maximum Clique, Minimum Covering by Cliques, and Maximum Independent Set of a Chordal Graph. SIAM Journal on Computing, 1(2):180–187, 1972.
[10] G. Giannopoulou, N. Stoimenov, P. Huang, L. Thiele, and B. D. de Dinechin. Mixed-criticality scheduling on cluster-based manycores with shared communication and storage resources. Real-Time Systems, pages 1–51, 2015.
[11] R. Gorcitz, E. Kofman, T. Carle, D. Potop-Butucaru, and R. De Simone. On the Scalability of Constraint Solving for Static/Off-Line Real-Time Scheduling. In 13th International Conference on Formal Modeling and Analysis of Timed Systems (FORMATS'15), pages 108–123, 2015.
[12] A. Graillat. Code Generation of Time Critical Synchronous Programs on the Kalray MPPA Many-Core Architecture. In 7th Workshop on Analysis Tools and Methodologies for Embedded and Real-Time Systems (WATERS'16), 2016.
[13] X. Jean, L. Mutuel, D. Regis, H. Misson, G. Berthon, and M. Fumey. White Paper on Issues Associated with Interference Applied to Multicore Processors, 2016.
[14] Y.-K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Comput. Surv., 31(4):406–471, Dec. 1999.
[15] P. Laborie and J. Rogerie. Reasoning with Conditional Time-intervals. In 21st International Florida Artificial Intelligence Research Society Conference (FLAIRS'08), 2008.
[16] P. Laborie, J. Rogerie, P. Shaw, and P. Vilím. Reasoning with Conditional Time-Intervals. Part II: An Algebraical Model for Resources. In 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS'09), 2009.
[17] J. Li, K. Agrawal, C. Lu, and C. Gill. Analysis of Global EDF for Parallel Tasks. In 25th Euromicro Conference on Real-Time Systems (ECRTS'13), 2013.
[18] Q. Perret, P. Maurère, É. Noulard, C. Pagetti, P. Sainrat, and B. Triquet. Predictable composition of memory accesses on many-core processors. In 8th Conference on Embedded Real Time Software and Systems (ERTS'16), 2016.
[19] Q. Perret, P. Maurère, É. Noulard, C. Pagetti, P. Sainrat, and B. Triquet. Temporal isolation of hard real-time applications on many-core processors. In 22nd IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'16), 2016.
[20] W. Puffitsch, É. Noulard, and C. Pagetti. Off-line mapping of multi-rate dependent task sets to many-core platforms. Real-Time Systems, 51(5):526–565, 2015.
[21] W. Puffitsch, R. B. Sørensen, and M. Schoeberl. Time-division Multiplexing vs Network Calculus: A Comparison. In 23rd International Conference on Real Time and Networks Systems (RTNS'15), pages 289–296, 2015.
[22] M. Qamhieh, F. Fauberteau, L. George, and S. Midonnet. Global EDF Scheduling of Directed Acyclic Graphs on Multiprocessor Systems. In 21st International Conference on Real-Time Networks and Systems (RTNS'13), pages 287–296, 2013.
[23] M. Qamhieh, L. George, and S. Midonnet. A Stretching Algorithm for Parallel Real-time DAG Tasks on Multiprocessor Systems. In 22nd International Conference on Real-Time Networks and Systems (RTNS'14), pages 13–22, Oct. 2014.
[24] Radio Technical Commission for Aeronautics (RTCA) and EURopean Organisation for Civil Aviation Equipment (EUROCAE). DO-297: Software, electronic, integrated modular avionics (IMA) development guidance and certification considerations.
[25] Radio Technical Commission for Aeronautics (RTCA) and EURopean Organisation for Civil Aviation Equipment (EUROCAE). DO-178C: Software considerations in airborne systems and equipment certification, 2011.
[26] R. Wilhelm and J. Reineke. Embedded systems: Many cores - many problems. In 7th IEEE International Symposium on Industrial Embedded Systems (SIES'12), pages 176–180, 2012.