Magistère Informatique 2016
THURSDAY, 1 SEPTEMBER

9h      Steve Roques: Optimisation of application-specific memory accesses through page prefetching in manycores
9h40    Thomas Lavocat: Gotaktuk, a versatile parallel launcher
10h25   Myriam Clouet: Refinement of temporal logic properties
11h05   Lenaïc Terrier: A language for the home. Domain-Specific Language design for End-User Programming in Smart Homes
14h     Mathieu Baille: A methodology for analysing multimedia RSS feed corpora
14h40   Jules Lefrère: Composition in Coq of silent distributed self-stabilizing algorithms
15h20   Rodolphe Bertolini: Distributed Approach of Cross-Layer Allocator in Wireless Sensor Networks
16h15   Claude Goubet: 3D kidney motion characterization from 2D+T MR Images

FRIDAY, 2 SEPTEMBER

8h30    Thomas Baumela: Integrating Split Drivers in Linux
9h10    Rémy Boutonnet: Relational Procedure Summaries for Interprocedural Analysis
10h     Quentin Ricard: Intrusion detection in industrial environments
10h40   Christopher Ferreira: Scat: Function prototypes retrieval inside stripped binaries
11h30   Lina Marsso: Formal proof for an activation model of real-time systems
14h     Antoine Delise: Is factorization helpful for current SAT solvers?

The Magistère is an option for students from L3 to M2 level that allows them to gain experience in research. The defences, organised as a conference, are open to the public.

UFR IM2AG. Venue: the main lecture theatre (grand amphithéâtre) of the IMAG building.

Contents

1. Steve Roques, Prefetching memory pages in manycores (p. 1)
2. Thomas Lavocat, Gotaktuk, a versatile parallel launcher (p. 8)
3. Myriam Clouet, Refinement of requirements for critical systems (p. 15)
4. Lenaïc Terrier, A language for the home (p. 24)
5. Mathieu Baille, A methodology for analysing multimedia RSS feed corpora (p. 34)
6. Jules Lefrère, Composition of Self-Stabilizing Algorithms in Coq (p. 40)
7. Rodolphe Bertolini, Distributed Approach of Cross-Layer Resource Allocator in Wireless Sensor Networks (p. 46)
8. Claude Goubet, 3D kidney motion characterization from 2D+T MR Images (p. 54)
9. Thomas Baumela, Integrating Split Drivers in Linux (p. 63)
10. Rémy Boutonnet, Relational Procedure Summaries for Interprocedural Analysis (p. 74)
11. Quentin Ricard, Intrusion Detection for SCADA Systems Using Process Discovery (p. 85)
12. Christopher Ferreira, Scat: Function prototypes retrieval inside stripped binaries (p. 95)
13. Lina Marsso, Formal proofs for an activation model of real-time systems (p. 97)
14. Antoine Delise, Factorization for SAT solving (p. 110)
15. Boyan Milanov, Magistère Internship Report (p. 114)

Prefetching memory pages in manycores

Steve Roques
Supervised by: Frédéric Pétrot
[email protected], [email protected]

Abstract. Computer systems contain an increasing number of processors. Unfortunately, increasing the computation capability is not sufficient to obtain the expected speed-up, because memory accesses then become a bottleneck. To limit this issue, most multi/manycore architectures are clustered and embed local memory, but what, when and how the content of these local memories is managed then becomes a crucial problem called prefetching. In this article, we perform a study of the state of the art around memory page prefetching and propose a simple model to evaluate the performance of such systems. This ultimately leads us to conclude with a description of the future work to achieve.

1 Introduction

The memory hierarchy in modern embedded massively multi-processor CPUs is usually structured in the form of L1 caches (coherent or not) close to the processor, a local memory (LM) dedicated to a small set of processors (8 or 16, called a cluster or a tile), which can be regarded as an L2 but whose content is the responsibility of the software, and an external global memory (GM), which can be seen as an L3.

The local memories are usually small, for technological integration reasons, and have good performance (latency and throughput). But this implies that any access to the LM that fails requires an access to the GM, which causes an important slowdown.

Some operating systems attempt to prefetch pages on the fly, based for example on previous page-fault occurrences. This approach is general but makes assumptions, such as the desired pages being on disk, and thus suffers from extremely high latencies.

Let us take the example of Figure 1, in which one can see one CPU, one direct memory access engine (DMA) and two memories. The time t1, corresponding to the time to load between Mem2 and the DMA, is shorter than the time t2, corresponding to the time to load between Mem1 and the DMA. It is therefore better to use Mem2 more often and to repatriate data from Mem1 into Mem2, which reduces the execution time. We will see several methods to do this.

[Figure 1: Prefetch between two memories. A CPU and a DMA engine are connected to Mem1 and Mem2; the access time t1 to Mem2 is shorter than the access time t2 to Mem1.]

The objective of this paper is to examine the state of the art in prefetching memory pages (typically 4 KB) to determine whether knowledge of the application can be exploited to define an offline approach able to limit the number of page faults. We start with a quick tour of existing work on memory page prefetching. Then the evaluation system that we used for the experiments comparing these approaches is explained. Finally, we present future work and a conclusion.
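To make the stakes concrete, the situation of Figure 1 can be captured by a back-of-the-envelope timing model. The sketch below is purely illustrative and is not the evaluation model proposed by this paper: the latencies, the computation time overlapped with each transfer, and the assumption that one DMA page copy costs t2 + t1 (a read from Mem1 plus a write to Mem2) are all made-up parameters.

    /* Illustrative model of Figure 1: cost of n page accesses with and
     * without DMA prefetching from the slow Mem1 into the fast Mem2.  */
    #include <stdio.h>

    #define N_ACCESSES 1000 /* page accesses performed by the CPU           */
    #define T1            5 /* cycles: latency between Mem2 and the DMA/CPU */
    #define T2          100 /* cycles: latency between Mem1 and the DMA/CPU */
    #define T_COMPUTE    50 /* cycles of CPU work between two accesses      */

    int main(void)
    {
        /* No prefetch: every access is served from the slow Mem1. */
        long no_prefetch = (long)N_ACCESSES * (T2 + T_COMPUTE);

        /* Prefetch: the DMA copies the next page from Mem1 to Mem2 while
         * the CPU computes; the copy (T2 read + T1 write) only stalls the
         * CPU when it is longer than the computation it overlaps with.   */
        long copy  = T2 + T1;
        long stall = copy > T_COMPUTE ? copy - T_COMPUTE : 0;
        long with_prefetch = (long)N_ACCESSES * (T1 + T_COMPUTE + stall);

        printf("no prefetch:   %ld cycles\n", no_prefetch);
        printf("with prefetch: %ld cycles\n", with_prefetch);
        return 0;
    }

With these illustrative numbers, overlapping the DMA copies with computation hides most of the Mem1 latency, which is precisely what page prefetching aims at.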
2 State of the art

2.1 Comparison of several algorithms

Dini et al. [1] compare many caching and prefetching algorithms in order to explain why prefetching improves efficiency. They begin by explaining the two basic algorithms they use for comparison: Least Recently Used (LRU) and Most Recently Used (MRU). LRU is defined by this rule: every page replacement evicts the buffered page whose previous reference is farthest in the past. MRU is defined by this rule: every page replacement evicts the buffered page whose previous reference is nearest in the past. These two algorithms arguably represent the upper and lower bounds on program response time in the absence of page prefetching.

Krueger et al. [2] trace the memory accesses produced by the successive over-relaxation (SOR) target application. They simulate SOR under an LRU page replacement policy and show that LRU is inadequate in those circumstances, because it generates one page fault per iteration of the loop. The objective of prefetching, however, is to reduce access time not only by bringing the data closer, but also by reducing the number of page faults. Krueger et al. [2] thus show that the appropriate page replacement algorithm depends directly on the target application.
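The pathology reported by Krueger et al. is easy to reproduce. The sketch below is a minimal illustration, not their SOR simulation: a loop sweeps cyclically over one page more than the LRU buffer can hold, so LRU always evicts exactly the page that will be needed next and every access faults. The page count, buffer size and iteration count are arbitrary.

    /* Cyclic sweeps over NPAGES pages with an LRU buffer holding one
     * page fewer: LRU evicts the page needed next, so every access
     * results in a page fault.                                       */
    #include <stdio.h>
    #include <string.h>

    #define NPAGES   8            /* pages touched by one sweep of the loop */
    #define BUFSLOTS (NPAGES - 1) /* LRU buffer is one page too small       */

    int main(void)
    {
        int buf[BUFSLOTS];  /* buf[0] = most recently used, buf[used-1] = LRU */
        int used = 0, faults = 0;

        for (int it = 0; it < 100; it++)        /* 100 loop iterations       */
            for (int p = 0; p < NPAGES; p++) {  /* one sweep: touch every page */
                int pos = -1;
                for (int i = 0; i < used; i++)  /* look the page up          */
                    if (buf[i] == p) { pos = i; break; }
                if (pos < 0) {                  /* miss: take or evict the LRU slot */
                    faults++;
                    if (used < BUFSLOTS) used++;
                    pos = used - 1;
                }
                /* move page p to the front (most recently used) */
                memmove(&buf[1], &buf[0], pos * sizeof buf[0]);
                buf[0] = p;
            }

        printf("%d faults for %d accesses\n", faults, 100 * NPAGES);
        return 0;
    }

On this access pattern an MRU policy would instead fault only about once per sweep, which illustrates the point above: the right replacement policy depends on the application's access pattern.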
Unlike those algorithms, Dini et al. [1] present two prefetch algorithms, EP (early prefetch) and LP (late prefetch), that characterise different degrees of secondary memory system (SMS) activity.

EP includes several phases, one of which prefetches early, whenever the SMS becomes idle, until the fair replacement rule is satisfied. The fair replacement rule is: every page replacement evicts the buffered page whose next reference is farthest in the future, provided this page has been referenced since its most recent load from secondary memory into the buffer. Finally, the next reference rule is: every page fetch loads the non-buffered page whose next reference is nearest in the future.

LP is defined by the next reference and fair replacement rules plus a third rule, late prefetch: when the SMS becomes idle and a page is missing from the buffer, the fetch of this page is scheduled so that it completes just before the next reference to this page occurs.

[...] to occur and when data is no longer needed. The operating system then manages I/O so as to accelerate performance and minimise prefetching.

Caragea et al. [4] aim at evaluating algorithms for compiler-directed prefetching on manycores. Manycores, or massively multi-core processors, are multi-core architectures with a high number of cores. They evaluate prefetching algorithms on the explicit multi-threading (XMT) framework, whose goal is to improve single-task performance through parallelism. Furthermore, they present the Resource-Aware Prefetching (RAP) algorithm, an improvement over Mowry's algorithm.

Mowry's looping prefetch is a three-phase algorithm: a first phase determines the dynamic accesses that are likely to suffer cache misses and should therefore be prefetched; a second phase isolates the predicted dynamic miss instances using loop-splitting techniques such as peeling, unrolling and strip-mining; finally, a scheduler prefetches the appropriate page the proper amount of time in advance using software pipelining [4] (see the code sketch below).

RAP improves Mowry's algorithm by lowering the prefetch distance, in loop iterations, to what the available resources allow; it thereby limits resource requirements while using those resources to hide as much latency as possible. Experiments with this algorithm show up to 40% better performance than Mowry's algorithm.

2.3 Shared Memory Systems

Paudel et al. [5] and Speight and Burtscher [6] present how prefetching improves performance in shared memory systems. They show that employing coherence protocols optimised for targeted patterns of shared-variable accesses improves application performance. Paudel et al. [5] used different access patterns to test and optimise these accesses:

Read-mostly: variables are initialised once and subsequently [...]
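For concreteness, the code sketch referred to above shows the loop structure that a Mowry-style software-pipelined prefetching pass typically produces. It is illustrative only, not the XMT or RAP implementation: __builtin_prefetch is the GCC/Clang prefetch intrinsic and merely stands in for whatever mechanism the target offers (for instance, programming a DMA transfer of the page into local memory), and the prefetch distance DIST is an assumed value.

    /* Mowry-style software-pipelined prefetching: the loop is split
     * into a prologue that only issues prefetches, a steady state that
     * works on a[i] while fetching a[i + DIST], and an epilogue that
     * drains the last iterations without fetching anything.          */
    #define DIST 4  /* assumed prefetch distance, in loop iterations  */

    void sum_with_prefetch(const int *a, int n, long *out)
    {
        long sum = 0;
        int i;

        /* prologue: start the first DIST fetches, no work yet */
        for (i = 0; i < n && i < DIST; i++)
            __builtin_prefetch(&a[i]);

        /* steady state: overlap the fetch of a[i + DIST] with the
         * work on a[i], which is already resident                 */
        for (i = 0; i + DIST < n; i++) {
            __builtin_prefetch(&a[i + DIST]);
            sum += a[i];
        }

        /* epilogue: consume the last DIST elements, all fetched */
        for (; i < n; i++)
            sum += a[i];

        *out = sum;
    }

A production implementation would additionally strip-mine the loop so that one prefetch is issued per cache line or page rather than per array element, which is exactly the kind of resource accounting RAP refines.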