Master’s Thesis

Czech Technical University in Prague

Faculty of Electrical Engineering F3 Department of Control Engineering

Support for PREM on contemporary multicore COTS systems

Jan Doležal Open Informatics, Computer Engineering [email protected]

January 2019 Supervisor: Ing. Joel Matějka

Acknowledgement / Declaration

My gratitude for support during the preparations of this work belongs to my supervisor Joel Matějka. I thank him especially for his guidance and also for his easy-going attitude that shifted the fun of creating this work to a completely different level. I am very grateful for the support provided by people from the CIIRC's Industrial Informatics Research Center. Namely, I want to thank Michal Sojka for all the valuable advice.

I hereby declare that the submitted thesis is exclusively my own work and that I have listed all used information sources in accordance with the Methodological Guideline on Ethical Principles for College Final Work Preparation.

In Prague, 8. 1. 2019

Abstract

This work brings support for high-performance real-time applications that run on generic contemporary multicore systems referred to as Commercial Off-The-Shelf (COTS) platforms. The main goal is to summarize the methods improving the predictability of program execution time and to test the tools and hardware utilizing these methods. The methods discussed in this work address issues of competition for shared resources in multicore COTS platforms.

Memory contention is resolved by the PRedictable Execution Model (PREM), which schedules memory accesses in the system. To improve the reliability of PREM, we describe and employ bandwidth-limiting capabilities applied to the memory clients. We evaluate throttling of memory controller clients and the MemGuard tool, which allows throttling of CPU cores, and profile the implementation to show its overhead. Tests are performed to see the effects of throttling on the memory controller level. The two presented methods to throttle the bandwidth of memory clients are tested on the low-power embedded platform NVIDIA TX2.

Further, we address shared cache contention using methods for partitioning the shared cache (Intel RDT, page coloring). We show the effects of hardware partitioning using RDT on an Intel Xeon series processor.

The results have shown that throttling at the memory controller level is not usable since we cannot separate particular clients properly. Further, MemGuard was successfully tested on multiple use cases. The tested hardware cache partitioning improves the system performance as expected. Moreover, we provide details about the fine-tuning of the tools used by PREM.

Keywords: Real-time systems; Worst-Case Execution Time; Predictable Execution Model; throttling memory clients bandwidth; Partitioned Cache Utilization.

Contents

1 Introduction
2 Background
  2.1 Real-time embedded systems
    2.1.1 Worst-case execution time (WCET)
    2.1.2 WCET and contemporary hardware
    2.1.3 Timing constraints in industry
  2.2 Multicore systems and real-time applications
  2.3 Methods increasing predictability of execution time within multicore system
  2.4 Memory hierarchy
    2.4.1 Data request propagation
    2.4.2 Cache organization
    2.4.3 Writing data back to main memory
    2.4.4 Cache inclusion policy
    2.4.5 Cache coherency
    2.4.6 Caches incorporated in virtual address model
    2.4.7 Caches in contemporary hardware
    2.4.8 Quality of service in relation to cache
  2.5 Performance Monitor Unit (PMU)
  2.6 Hypervisor
    2.6.1 Hardware partitioning
  2.7 PRedictable Execution Model (PREM)
    2.7.1 WCET-aware compiler
    2.7.2 Decoupled Access Execute (DAE)
  2.8 MemGuard: limiting memory bandwidth of CPU cores
  2.9 Page coloring as a software solution addressing cache contention
3 Methodology
  3.1 NVIDIA X2
    3.1.1 System organization
    3.1.2 Memory subsystem
    3.1.3 Throttling memory controller clients
    3.1.4 Processing units
    3.1.5 Performance Monitor Unit (PMU)
  3.2 Intel Xeon W
    3.2.1 Memory controller
    3.2.2 Cache organization
    3.2.3 Resource Director Technology (RDT)
    3.2.4 Performance Monitor Unit (PMU)
  3.3 Measuring performance events under Linux
  3.4 Jailhouse: HV partitions HW
    3.4.1 Setup
  3.5 Benchmarks
  3.6 Design of experiments
    3.6.1 Throttling memory clients
    3.6.2 Profiling MemGuard implementation
    3.6.3 Hardware cache partitioning
4 Evaluation
  4.1 Throttling memory clients
    4.1.1 Measurement results
  4.2 Profiling MemGuard implementation
    4.2.1 Profiling basic MG parameters
    4.2.2 Reliability of MemGuard
    4.2.3 MemGuard as a profiler
    4.2.4 MemGuard upon real-world application: KCF tracker
  4.3 Hardware cache partitioning
5 Conclusion
References
A Specification
  A.1 Specification in English
B Abbreviations

Tables / Figures

Tables:
2.1. MOESI cache coherency protocol states
2.2. Cache parameters of contemporary hardware platforms
3.1. Parameters of the Tegra X2 system on module
3.2. Registers used to throttle groups of clients
3.3. Cache parameters of TX2 CPU clusters
3.4. Xeon W-2133 cache parameters
3.5. RDT support on Xeon processor versions
3.6. Intel PMU: selected performance events
3.7. Complexity of the used PolyBench benchmarks
4.1. Percentage overhead of the MemGuard for various replenish time periods

Figures:
2.1. Execution time distribution
2.2. CPU vs. DRAM performance in time
2.3. Multi-level cache
2.4. Parts of memory address driving cache element choice
2.5. x86 arch: paging with 4kB long pages
2.6. Principle of Translation Lookaside Buffer and Page Table miss
2.7. Hypervisor types comparison
2.8. Comparison of virtualization types
2.9. Predictable interval with constant execution time
2.10. Compiler outline with support for multiple targets
2.11. MemGuard system architecture
2.12. MemGuard budget illustrative example
2.13. Example of page coloring used to partition cache
3.1. TX2: Block diagram of a Parker series processor
3.2. TX1: Memory Controller: Path between DRAM modules and MC clients
3.3. TX1: Memory Controller: Arbitration among MC clients
3.4. RDT: Cache Allocation Technology
3.5. Jailhouse: hardware static partitioning
3.6. Jailhouse: loading process phases
3.7. KCF tracker illustration
3.8. Cache partitioning experiment
4.1. TX2 vs. Xeon W: cache related relative comparison of performance
4.2. TX2: Throttle MC Clients: Throttle on ring 2. No clients added
4.3. TX2: Throttle MC Clients: Throttle on ring 0. No clients added
4.4. TX2: Throttle MC Clients: Throttle on NISO group. No clients added
4.5. TX2: Throttle MC Clients: Throttle GPU within NISO group
4.6. MemGuard: Depicting application's performance response with varying MG parameters
4.7. MemGuard: focus on memory budget behavior
4.8. MemGuard: focus on replenish time period behavior
4.9. MemGuard: profiling timer ISR overhead
4.10. MemGuard: scatter plot of million measurements that involve task reading ten thousand cache lines
4.11. MemGuard: scatter plot of million measurements that involve task reading hundred thousand cache lines
4.12. MemGuard: CDF: cache misses plot of million measurements that involve task reading hundred thousand cache lines
4.13. MemGuard: CDF: execution time plot of million measurements that involve task reading hundred thousand cache lines
4.14. MemGuard: compare real-world and synthetic applications
4.15. Shared vs. partitioned cache: cache misses and times averages
4.16. Shared vs. partitioned cache: scatter plot

Chapter 1 Introduction

The requirements of real-time embedded systems on their computational power increase as they are expected to solve new feature-rich challenges of ever higher complexity. There are ongoing efforts to simplify everyday tasks; an example of such effort is research in the area of autonomous transportation. To cover the market, the devices that carry out such tasks need to be distributed in high numbers. Therefore, the price of the underlying devices used in such solutions becomes increasingly important. Custom hardware designed for one particular real-time application cannot provide the low price, so the attention shifted to the use of generic multi-core computation platforms referred to as Commercial Off-The-Shelf (COTS).

COTS platforms bring high performance and low price. However, their properties are not optimized for critical real-time applications. Programs driving critical real-time systems have to provide their result within the specified deadline; otherwise, an event threatening human life or property may occur. Real-time systems and their parts are therefore subjected to worst-case execution time (WCET) analyses to ensure that no deadline within the system is exceeded. The hardware of multicore COTS platforms targets primarily non-real-time market segments, and it is optimized for high average performance (i.e., low average execution time) by utilizing complex superscalar CPUs and multi-level caches. Sparse random occurrences of significantly lower performance are not considered to be an issue. Therefore, the use of such systems in real-time applications leads to pessimistic estimations of worst-case execution times and to significant underutilization of the designed system.

In multicore COTS platforms, the states that lead to occurrences of unusually high execution times are caused by shared resources. Shared resources are contention points such as the shared last-level cache and the shared main memory bus. Competition of CPU cores running in parallel for these shared resources can introduce significant timing delays and increase the execution times of all contenders.

The focus of this work is on researching methods that decrease WCET by increasing the predictability of program execution time, and on evaluating some of these methods on COTS platforms. These methods aim to moderate the quality of service provided by shared resources so that it corresponds to the system's requirements.

The PRedictable Execution Model is a mechanism addressing competition for shared main memory. PREM coschedules all active components (e.g., CPU cores, I/O peripherals) in the system. Each component has a time slot in a static schedule in which it can access the memory. The scheduler notifies components about the beginning and end of their memory phase. If the task executed by a certain component is aware of PREM, it is called a predictable task; if not, it is called a legacy task. A predictable task is prepared so that it contains memory phases and execute phases that are in line with the static schedule. In the memory phase, the data necessary for the execute phase are loaded into the component-local memory (cache in the case of a CPU). During the execute phase, the component performs no memory accesses, leaving the memory bus free to be used by other components.

It is not possible to PREMize every task because many of them have complex memory access patterns. Such legacy tasks could unpredictably interfere with the memory phases of predictable tasks. A memory phase of a predictable task suffering from negative interference might not be completed within the static schedule, and the expected execution time of such a task could be exceeded. This could lead to serious consequences if the affected task were a real-time task. Moreover, it could start a chain reaction breaking down the static schedule and, in the end, crashing the whole system. To ensure that a memory phase finishes within its deadline, a certain level of bus bandwidth has to be guaranteed for the statically scheduled memory phases. Such a guarantee is provided when the bandwidth of all legacy components is limited. This way the memory bus bandwidth is partitioned among predictable and legacy components, which motivated the exploration of methods to limit the bandwidth of legacy tasks. There also exist methods to partition the shared cache among contenders, leading to improved predictability of their execution time. This work addresses the following methods or employs the following tools:
. PRedictable Execution Model (PREM), which addresses competition for main memory
. MemGuard, a tool to limit the memory bandwidth of CPU cores utilizing Performance Monitor Counters
. the memory controller as a means to limit the memory bandwidth of its clients
. cache partitioning as a method addressing contention for the space in a cache shared among CPU cores
  . in software with page coloring
  . in hardware with Intel Resource Director Technology
. the Jailhouse hardware partitioning hypervisor
  . a version with the implemented MemGuard support
  . a version with the support for RDT

There are hardware and software solutions that we tested. The two tested software solutions are related to throttling the bandwidth of memory clients. First, we test throttling on the memory controller level. Next, the MemGuard tool is profiled. Both aforementioned solutions are evaluated using the low-power embedded platform NVIDIA Tegra X2. Lastly, we tested the behavior of hardware cache partitioning with RDT. This feature is available on some of the Intel Xeon processors.

The memory controller and its throttling mechanism are complex devices with a hierarchical organization. Moreover, NVIDIA does not provide a complete description of the controller in its manuals. Setting up a test incorporating the memory controller is, therefore, a challenging task.

The work partly originated as a contribution to the European project HERCULES 1).

1) https://hercules2020.eu/

Chapter 2 Background

Section 2.1 states the main objectives of embedded systems and real-time systems along with their analysis. Section 2.2 briefly introduces multicore systems and the problems of their usage in the area of real-time applications. The methods increasing predictability within multicore systems are briefly introduced in section 2.3. In section 2.4 the chapter continues with the memory hierarchy description, highlighting especially the caching mechanism. The memory hierarchy is followed by section 2.5, which describes the Performance Monitor Unit allowing to profile, among others, characteristics of the memory subsystem. Section 2.6 discusses hardware partitioning with the utilization of a hypervisor. Next follows section 2.7 describing the PRedictable Execution Model. Section 2.8 describes MemGuard – a tool to reserve memory bandwidth for an execution unit. Finally, there is an example of software-based cache partitioning in section 2.9.

2.1 Real-time embedded systems

In our daily lives, people use computer systems that have a dedicated purpose and that control some larger system. These are called embedded systems. Several examples follow: ticket vending machine, cash register, digital watch, mobile phone, microwave oven, washing machine, medical imaging, GPS navigation, anti-lock braking system or infotainment system. There are several highly important objectives for current embedded systems [1]:
. low development costs
. short time-to-market
. dependability
. temperature efficiency
. energy efficiency
. average-case performance
. worst-case performance

To ensure safety, the devices are required to respond to an event within a certain time constraint. A system having this property is referred to as a real-time system. My favorite example is the airbag responding to a collision. If the system responds to the collision too late, the airbag inflation might not help and it could even cause additional injuries to the passenger. This example shows a hard real-time system, that is, a system where missing the time constraint (the so-called deadline) can lead to damage of property or loss of lives. There exist also soft real-time systems. If the deadline is missed in these systems, the result of the computation invoked by the event can still be utilized, but the usefulness of such a result degrades as time advances further beyond the deadline. A very frequent example of soft real-time systems are live audio-video systems, where late data delivery/processing leads to degraded quality of the output. [2]


Figure 2.1. Execution time distribution. Best/Worst-Case Execution Times. [1]

2.1.1 Worst-case execution time (WCET)

To ensure that the time constraint is fulfilled whenever the code is invoked, it is necessary to know the longest possible time interval during which the given code can run from its invocation. Such a period is referred to as the worst-case execution time (WCET). To find the maximum execution time it is not feasible to exhaustively explore all the possible execution times since software for contemporary architectures exhibits a too large state space [1]. Thus the measured WCET, WCET_MEAS, is not a sufficient parameter for the verification of hard real-time systems. However, it can be utilized for more relaxed real-time systems, where a safety margin is added to WCET_MEAS. A more rigorous approach includes the usage of formal methods to reason about WCET. With this approach, the code is analyzed statically to determine the properties of its temporal behavior [1]. Nevertheless, WCET cannot be derived for every program due to undecidability. Analyzed programs must meet constraints such as guaranteed termination, bounded recursion depths and loop iteration counts.
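The measurement-based WCET_MEAS mentioned above is typically obtained by timing many runs of the analyzed code and keeping the maximum. The sketch below is the author's illustration (not part of the thesis tooling); the workload and the number of runs are arbitrary placeholders, and the result only bounds the observed runs – it is not a safe upper bound.

#include <stdio.h>
#include <time.h>

/* Placeholder for the real-time task under analysis. */
static void task_under_analysis(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000; i++)
        x += i;
}

int main(void)
{
    long wcet_meas_ns = 0;
    for (int run = 0; run < 10000; run++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        task_under_analysis();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        if (ns > wcet_meas_ns)      /* keep the worst observed case */
            wcet_meas_ns = ns;
    }
    printf("WCET_MEAS = %ld ns over 10000 runs\n", wcet_meas_ns);
    return 0;
}

A safety margin is then added on top of the printed value before it is compared against the deadline, as described in section 2.1.3.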

2.1.2 WCET and contemporary hardware

Since embedded systems are getting more and more complex, there is a need to use high-performance commercial off-the-shelf (COTS) hardware [3]. The architectural features of such hardware have to be accounted for. Common architectural features such as superscalar pipelines and caches target the reduction of the average-case execution time (ACET). It is, however, not possible to precisely analyze these features using static methods [1]. Furthermore, a sound approximation of the actual WCET demands an abstraction of possible inputs and initial states of the system, which introduces another source of imprecision [1]. As a consequence, the determination of the actual WCET has to be relaxed to the derivation of an upper bound on the execution time of the task. This bound represents the estimated WCET_EST. The process of estimating the WCET is called timing analysis. [1] The relation among the mentioned worst-case execution times is as follows:

WCET_MEAS < WCET_Actual < WCET_EST

An example that helps to understand the execution time distribution is in figure 2.1. To ensure safeness, WCET_Actual ≤ WCET_EST must hold. The estimated WCET gets more precise as WCET_EST − WCET_Actual → 0. This is referred to as tightness.


2.1.3 Timing constraints in industry

The availability of timing-aware software development tools is insufficient. As a result, a trial-and-error approach is often used in practice: the software is tuned until WCET_MEAS + safety margin is less than the timing constraints. [1] If the optimizations of WCET create enough time reserve before the deadline, either cheaper hardware may be utilized, or the software may be modified to do more work.

2.2 Multicore systems and real-time applications

Historically, the performance of silicon systems increased with the usage of higher clock frequencies. A higher frequency allows more operations to be performed within the same time. However, a higher frequency also produces more heat, and if the frequency is increased too much, a limit may be reached where the heat can damage the system. Since performance could not be increased by scaling frequency anymore, another way to increase performance came into broad use: multiple computational cores, i.e., multiple CPUs in one system.

Within the real-time domain, one of the main issues with multicore systems is the main memory shared among the cores. When multiple cores share the main memory, it leads to contention over access to the memory. In turn, this increases the timing unpredictability. Existing WCET analyses are extremely pessimistic for these configurations. Since industry relies on those analyses, the usage of multi-core systems in these areas is limited unless new approaches are utilized. [4]

In this work, the approach to address the memory contention described in the previous paragraph is the PRedictable Execution Model (PREM). Further details on PREM can be found in section 2.7. Memory organization is thoroughly described in section 2.4.

2.3 Methods increasing predictability of execution time within multicore system

In the area of our interest – real-time embedded systems – there is an aim to utilize simple ideas so that the whole system is deeply understood. This way people can more easily reason about the guarantees within the system and potentially provide a certified system.

With the PRedictable Execution Model (PREM), the main memory accesses are essentially scheduled among the clients of the memory. This way the timing of memory accesses is more predictable, and the theoretical WCET, bloated by possible clashes of concurrent accesses to the memory, can be decreased. PREM is described in detail in section 2.7. With PREM, each task is divided into memory phases to retrieve data and computation phases to work on those data. In the memory phase, the data are prefetched to the local memory of an execution unit. Tasks that may be divided into these phases are called predictable intervals. PREM employs static scheduling of intervals. To achieve predictability it is necessary to schedule the memory phases so that they follow up on each other, which is easily done utilizing static scheduling. In practice this means that there is one unit accessing memory while all other execution units are either in their compute phase or stalled. This way the WCET of execution phases can be decreased, since each data access is faster.


Not all tasks may be easily separated into memory and computation phases due to their complexity. These undividable tasks are called compatible or legacy intervals. A compatible interval may seriously interfere with a predictable interval when they run concurrently. To regulate the level of interference, a certain level of memory bandwidth is required for the predictable intervals. This allows to guarantee the WCET.

The following methods limit the bandwidth of a computation unit. Some platforms utilize memory controllers that have their clients attached in a way that their bandwidth can be limited directly. For example, when the GPU, as a computation unit, is one of the MC's clients, its memory accesses can be limited on the MC level to ensure it does not interfere with the CPU. The CPU cores could be handled the same way only if each CPU core were a separate client of the memory controller; however, the whole CPU cluster is typically attached as one MC client. To limit the memory bandwidth of a certain CPU core, the Performance Monitor Unit (PMU) may be utilized. The PMU captures various statistics related to the performance of the CPU; in particular, it can count memory accesses. The memory bandwidth can then be limited in the following way: in a given period of time, a certain amount of memory accesses is allowed to be executed, and the execution of a core that exceeds the limit is suspended until the period times out. The method using Performance Monitoring Counters was introduced by MemGuard [5].

Any task requires certain resources, i.e., hardware, to achieve its goal. This includes a certain portion of memory, an execution unit, and it may require further devices such as timers, DMAs and many others. Hardware often needs to be shared among tasks as the number of its instances is limited. In general, however, sharing resources may lead to timing unpredictabilities originating in the manager of such shares. Consider the hardware partitioned into certain bundles, with a specific task running on top of each bundle. With such an organization a simple model is obtained. It is possible to reason about such a model more easily and to ensure that mutual negative interference between the bundles is mitigated. An example follows: the reserved resources for a certain task could include a few regions in the physical memory address space, CPU cores 4 & 5 and a memory-mapped PCI device. No other task would get access to these resources. More on this in section 3.4.

CPU clusters typically have some very fast per-core memory as well as fast per-cluster memory. These memories are meant to capture part of the data backed by main memory and to provide an execution unit that reaches for the same data repeatedly with fast access to such data. The described on-chip memories are called caches and they are hidden from direct access by the programmer. Since the per-cluster memory is shared among cores, it may occur that one core thrashes the data of another core in such a cache. This behavior is undesirable since it is a source of unpredictability. Hence some manufacturers equipped the shared caches with the ability to be partitioned. When partitioned, a certain part of the cache belongs exclusively to certain cores. Besides the trend of manufacturing processors with this effective cache management [6], there also exists a software method to partition the cache called page coloring [7].

2.4 Memory hierarchy

Nowadays, computer systems utilize fast processors with high frequencies and fairly slower main memories. Figure 2.2 compares the evolution in speeds of the processor and the main memory at the end of the last century. When the processor executes code, it first needs to load the executed code itself and then to load the data that are referenced by the code. Between the main memory and the processor there are devices called caches.


A cache is a much faster memory compared to the main memory. If caches were not present in the system, it would take hundreds of processor cycles for the main memory to load a data item, and most of the processor time would be wasted. A cache is basically a small piece of memory, but much faster than the main memory. Different technologies are used for the main memory (Dynamic RAM: capacitor) and for caches (Static RAM: flip-flop). The price is the main reason why the main memory is not made with the same fast technology as the one used in caches. Another reason is the space requirement of the SRAM technology. For storing one bit, DRAM requires only one transistor along with a capacitor, whereas SRAM requires as many as six transistors. Furthermore, wire capacitances charge faster with larger transistors; since the speed depends on the time needed to charge these wire capacitances, speed is traded for die area. I.e., of two caches with the same byte capacity, the one taking up more space on the die (larger transistors) is faster. Further, a high amount of memory requires more complex controllers, and typically when the complexity rises, the speed goes down.

When writing code, the programmer thinks in terms of the virtual memory model and, in general, does not interact with caches directly. Detailed information can be found in a series about memory at Linux Weekly News [8].

Figure 2.2. Processor and memory performance comparison throughout the years.
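The speed gap between caches and DRAM can be observed directly. The following sketch is the author's illustration (not thesis code): it reads buffers of growing size with a 64 B stride and prints the average time per access; the value jumps once the working set stops fitting into a cache level. Hardware prefetching may smooth the steps – a randomized access pattern makes them sharper.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64  /* touch one cache line per access (64 B lines assumed) */

static double ns_per_access(volatile const char *buf, size_t size, long iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        (void)buf[(i * STRIDE) % size];   /* volatile read, one line per step */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    for (size_t size = 16 * 1024; size <= 64 * 1024 * 1024; size *= 2) {
        char *buf = calloc(size, 1);
        if (!buf)
            return 1;
        printf("%8zu kB: %.2f ns/access\n", size / 1024,
               ns_per_access(buf, size, 10 * 1000 * 1000L));
        free(buf);
    }
    return 0;
}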

2.4.1 Data request propagation

In the memory hierarchy, there is typically more than one cache level, where the cache level closer to the CPU is smaller and faster than the level farther from the CPU. A smaller cache has simpler control circuits, hence it is faster (this relates to propagation times). When the CPU accesses a data item, it starts by searching for the item in the first-level cache. The cache either returns the item, in case of a cache hit, or informs the CPU about a cache miss, in case the item is not present in the cache. In the event of a miss, the CPU tries to find the data item in the next cache level. Eventually, if the item is not found in any cache level, it is retrieved from the main memory. An example of the cache hierarchy in Intel's Sandy Bridge architecture is in figure 2.3. A cache miss typically introduces a time miss penalty, because the cache first checks whether it contains the item and only then advances to the next level in the hierarchy. The time to retrieve an uncached data item is the sum of the time miss penalties of each cache level and the time of the memory access. Due to this, some synthetic programs could perform better with caches turned off completely.


Figure 2.3. Multi-level caching on Intel Sandy Bridge. The level 1 cache is the closest to the execution unit; it has the shortest access time, but it is the smallest cache. Caches farther from the execution units are slower, but they can hold more data.

The potential increase in data access time is a risk factor for real-time systems. The power of caches lies in the locality of reference. Temporal locality describes the phenomenon where the same values or memory locations are frequently and repeatedly accessed. Spatial locality assumes that when one memory location is referenced, it is likely that nearby locations will soon be referenced as well. To exploit spatial locality, a bigger chunk of memory is copied into the cache than what was requested. The chunk size closely relates to the cache organization, which is described in the next section.

2.4.2 Cache organization

The cache's purpose is to capture the useful part of the main memory at the given moment. Memories are organized as sequences of data elements – bytes – and one byte (1 B = 8 b; B: byte, b: bit) is the basic addressable unit of memory. In the following, an address with the size of 32 bits is considered. The cache needs to know the memory address of the captured element. When the CPU asks for a certain address, the cache compares it with the addresses associated with the stored elements. In this paragraph a simplistic example is considered, where one cache element stores 1 B. Such an element needs 40 bits (5 B) – 4 B of information to store the 32 b address and 1 B of data. This would be an inexcusable waste of transistors. Instead, for a given address, multiple data bytes of the memory surrounding such an address are stored. Multiple bytes in a cache line is a setup that exploits spatial locality.

The cache line is the term referring to the basic copiable unit between levels of the hierarchy. There are several information fields associated with each cache line, such as flags including line validity and bits determining line usage frequency; further, there is a tag as a part of a memory address allowing to refer to the cache line. Consider a cache line with the size of 64 B. To refer to a specific byte in the cache line, 6 addressing bits are needed (indeed 2^6 = 64). These 6 bits are cut from the end of the address, and they are called the offset. The remaining 26 bits of the address are used to reference the cache line, and they are called the tag. This is an example of a fully associative cache. The cache stores each tag that is not yet in it to the next empty cache line along with its data.


Figure 2.4. Address parts driving cache data element choice. [8]

Retrieving a cache line requires a comparison of the input tag to each cache line tag, so a comparator is needed for each cache line.

Besides the offset, another portion of bits can be cut from the address and used to address cache lines. Such a portion of bits is called the index or cache set. If each set addresses exactly one cache line, then the cache is called a direct-mapped cache. For example, if the offset is 6 bits long, yielding a 64 B long cache line, and the set is 8 bits long, yielding 256 cache lines, the size of the direct-mapped cache is 16 kB. Considering 32-bit addresses, the tag is 18 bits long, and only one comparator is necessary to compare the tag of the selected cache line with the input address tag. Every two addresses having the same bits in the set field end up in the same cache line and replace each other's data. To address this problem, the associativity of the cache must be increased as described in the next paragraph.

By combining the two previous approaches the set-associative cache is constructed. The set field in the address refers to multiple cache lines. The cache lines in the given set are called ways. According to the number of ways this type of cache is also called an n-way associative cache. If there are 2 ways in the cache, where for a certain set one way already holds a valid cache line, and there is a need to store another cache line to the same set, it is stored in the second, empty way. Now if a third request comes that has a different tag but the same set, the cache line in one of the two ways has to be replaced. If the cache had more than 2 ways, the replacement would not be necessary; the cache line would be stored in any empty/invalid way. There are n comparators used in this cache type, corresponding to the number of ways. The modified example from the previous paragraph (64 B long cache line and 256 cache lines) with 2 ways would have a cache size equal to 32 kB. Note that none of the address part (offset, set, tag) sizes have changed.

Caches which have associativity greater than one use various replacement strategies to select the cache line to be sacrificed when the lines in all ways are valid. Among the strategies mentioned in the literature belong random replacement, Least Recently Used (LRU), Least Frequently Used (LFU), Adaptive Replacement Cache (ARC) combining the previous two, or Not Recently Used (NRU). These strategies were developed in the era of single-core processors. In multi-core processors, it seems that for per-core private caches the best replacement policy is LRU [9]. There is ongoing research targeting replacement strategies for caches shared by multiple cores in multi-core processors [10–11].
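The address decomposition described above can be made concrete with a few lines of code. This is the author's illustration (not thesis code), using the numbers from the 2-way example: 64 B lines give 6 offset bits and 256 sets give 8 index bits, leaving an 18-bit tag for 32-bit addresses.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 6   /* 2^6 = 64 B cache line */
#define SET_BITS    8   /* 2^8 = 256 sets        */

int main(void)
{
    uint32_t addr = 0x12345678u;   /* arbitrary example address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    /* Two addresses contend for the same set iff their 'set' fields match;
       with 2 ways, a third distinct tag mapping to that set evicts a line. */
    printf("addr=0x%08x -> tag=0x%05x set=%u offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
    return 0;
}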

2.4.3 Writing data back to main memory

When the processor writes data to a given address, it might not end up in the main memory right away. Data typically stay in the cache for the moment and get to the main memory later.
. write through: the policy writes data to the cache and to the write buffer. From the write buffer the data are moved to the main memory asynchronously.
. write back: the policy writes data to the cache line, marking it with the Dirty flag. When the cache line is evicted, the data are moved to the main memory.


2.4.4 Cache inclusion policy

The inclusion policy characterizes the relation between adjacent cache levels. It specifies requirements on the presence of a cache line in the higher level of the hierarchy when it is present in the lower hierarchy level. There are three possible policies:
. inclusive: the higher-level cache is inclusive of the lower-level cache if a cache line present in the lower-level cache has to be present also in the higher-level cache
. exclusive: the higher-level cache is exclusive of the lower-level cache if a cache line present in the lower-level cache must not be present in the higher-level cache
. non-inclusive non-exclusive: neither of the above two is required

2.4.5 Cache coherency

Since there are per-core private caches in multi-core systems, with parallel programming it might occur that the same data are loaded into two or more caches associated with different cores. If one core modifies the data in its cache, it is necessary to inform the other cores' caches about the change. Otherwise the incoherency makes the whole system fairly unusable.

If the private caches are interconnected with the higher cache level using a common bus, invalidation may be utilized. When one core writes to a given address, this address is announced as invalidated on the bus. Cores monitor events on the bus. If a core is not the originator of the announcement, it checks whether its cache contains a cache line with the announced address, and in case it does, the cache line is invalidated. Another approach is to send the updated information. Cores read update requests on the bus, and if the request contains common data, the core updates them in its cache. Further, more sophisticated protocols such as MOESI and its variants may be utilized. Each cache line carries information about its coherency state. The possible states in the MOESI protocol are briefly described in table 2.1. The protocol then assures that the caches are coherent given the set of allowed states for any pair of caches in the system.

state       brief description
Modified    This cache has the only valid copy of the cache line.
Owned       There are several caches with the valid copy, but this one has exclusive write permission.
Exclusive   This cache has the only copy.
Shared      This cache line can be only read.
Invalid     This cache line is invalid.

Table 2.1. MOESI cache coherency protocol states. The validity relates to the given cache line.

2.4.6 Caches incorporated in virtual address model

Up until now, caches were described as a means to mediate faster access to a portion of memory. Caches are also utilized to speed up some other mechanisms used in current systems.

Nowadays, memory virtualization is widely present in contemporary systems. With this mechanism it seems to an application as if it has the whole memory just for itself. This is achieved by providing each application with its own virtual memory (VM). Only a small, currently used portion of the program and its data located in VM is actually mapped to the physical memory.


The currently unused portions of the program and its data are stored on the secondary device (hard drive). The mechanism simplifies the management of concurrently running applications as their address spaces do not collide with each other.

Virtual memory is typically implemented using paging, which is managed by the OS. With paging, the physical address space (main memory) is divided into regions of equal size called frames. Virtual memory is also divided into regions of equal size, but these are called pages. The sizes of a frame and a page are equal. The page size in current systems ranges from the traditional 4 kB all the way up to 16 GB. It is always a power of 2, as this choice allows dividing the address space equally.

A simple synthetic example follows to clarify the behavior of virtual memory. Consider that our virtual space consists of 100 pages and the physical memory is divided into 100 frames. Now consider two applications – A1 utilizing 60 pages within its 100-page VM and A2 utilizing 70 pages within its 100-page VM. The OS starts A1, which uses 60 pages that are mapped (loaded) to arbitrary 60 frames. While A1 is running, it needs only 20 pages, because the other 40 pages possibly contained only initialization code. Now the OS starts A2. A2 needs 70 pages to overcome initialization and to get to its regular operation mode. There are still 40 empty frames in physical memory. Further, there are 40 frames loaded with 40 pages belonging to A1 that have not been used for a long time. 30 of these pages may be stored to the hard drive so that the frames become free for use by A2. Continuing the previous example, the sizes of the virtual and physical spaces may vary. By lowering the amount of physical memory to 10 frames, the mechanism would still work, but it would be considerably slower due to the necessity to move the pages between frames in main memory and the secondary device. On the other hand, if there were 1000 frames available, basically 10 applications could be loaded without the need to use the secondary device, given that the applications fully utilize their 100-page virtual memories.

Virtual memory would not bring such improvement without support for the management structures in hardware. Page tables are needed to allow translation from a virtual to a physical address. Each process has its own page table structure. When translating a virtual memory address to a physical memory address, part of the virtual address is used as an index to select an entry within the page table. The remaining part of the virtual address is an offset within the page/frame itself. Among others, the page table entry may contain the following information: the frame number where the page is loaded, the presence of the page (in memory or swapped out to the secondary device), some privilege information (writable, executable), and some usage statistics. When pages have relatively small sizes, the page table would be very big. To prevent this, the page table is split into a hierarchy of page tables. Figure 2.5 shows a possible example of hierarchical paging on the x86 platform.

As can be seen from the x86 example, to read from or write to a physical address, three accesses to the memory are necessary. First it is necessary to read the page directory entry, after that the page table entry, and finally the physical address is obtained and accessed. Memory accesses take a long time, therefore the translations from virtual page numbers to physical frame numbers are cached in the so-called Translation Lookaside Buffer (TLB), which may be implemented as a CAM. On inserting a virtual address it returns the physical address or reports a cache miss. Figure 2.6 depicts the translation of a virtual address to a physical address utilizing the TLB. The TLB may be divided into an instruction cache – ITLB and a data cache – DTLB, which increases the hit rate [13]. Caches of contemporary systems are compared in table 2.2. Both virtual and physical addresses are used to select the L1 cache item on Intel platforms. This way the cache set selection may get ready during the time the physical address is retrieved from the TLB.


Figure 2.5. Paging with 4 kB long pages on the x86 architecture in 32-bit protected mode. [12] The OS fills the CR3 register with the base address of the application's page directory. The linear address is the virtual address after applying the x86 segmentation mechanism; segmentation nowadays typically maps addresses one to one and we won't elaborate on it further. The first 10 bits of the linear address point to an entry within the page directory, which effectively selects the page table – the base address of the page table is stored in the entry. The next 10 bits of the linear address are used as an index that selects an entry within the page table. The PT entry contains the frame number within main memory where the program's page is loaded. The remaining 12 bits of the linear address point to the byte within the 4 kB long page. Notice that the PDs and PTs are also 4 kB long, therefore they fit in the frames. PDs and PTs may also be swapped out to the secondary device.
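The 10/10/12 split from figure 2.5 can be written down explicitly. The snippet below is the author's illustration (not thesis code) of how a 32-bit linear address decomposes into the page-directory index, the page-table index and the offset inside the 4 kB page.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t linear = 0xC0ABC123u;   /* arbitrary example linear address */

    uint32_t pd_index = (linear >> 22) & 0x3FF;  /* top 10 bits         */
    uint32_t pt_index = (linear >> 12) & 0x3FF;  /* next 10 bits        */
    uint32_t offset   =  linear        & 0xFFF;  /* 12 bits within page */

    /* A real page walk reads the page-directory entry selected by pd_index
       (base in CR3), follows it to a page table, reads the entry selected
       by pt_index to get the frame number, and adds the offset. The TLB
       caches the resulting translation. */
    printf("linear=0x%08x -> PD index=%u, PT index=%u, offset=0x%03x\n",
           (unsigned)linear, (unsigned)pd_index, (unsigned)pt_index,
           (unsigned)offset);
    return 0;
}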

2.4.7 Caches in contemporary hardware

The comparison of currently used caches is best made based on their parameters, which can be seen in table 2.2. 1) 2) 3) 4) An interesting Intel feature is the usage of an adaptive cache allocation mechanism for their shared cache, with the marketing name Intel Smart Cache: if only one core is active, it can utilize the whole shared cache.

2.4.8 Quality of service in relation to cache

In general, the possibility to monitor the cache utilization on a per-thread basis allows the operating system (OS) or hypervisor (HV) / virtual machine monitor (VMM) to reschedule tasks in a way that favors a high-priority task. An incremental step is allocating a portion of the cache for such a high-priority task.

1) http://www.7-cpu.com/
2) https://stackoverflow.com/questions/33974193/does-x86-64-cpu-use-the-same-chache-lines-for-communicate-between-2-processes-vi
3) http://www.realworldtech.com/sandy-bridge/7/
4) https://superuser.com/questions/745008/whats-the-difference-between-physical-and-virtual-cache



Figure 2.6. Principle of the TLB and illustration of a page table miss. [12] When the TLB has the virtual address cached, it responds with the physical address. If the physical address cannot be obtained from the TLB, a page walk is needed. If either the page table or the target page is not present in memory, the OS loads them from the hard drive into the main memory. Then the physical address is obtained using the paging mechanism, a record is created in the TLB (i.e., the VA is cached) and the physical address is passed to the next stage.

An implementation of cache QoS is available for example in the Intel Resource Director Technology (RDT). Details on RDT are available in section 3.2.3. Another example utilizing QoS in relation to cache could be Qualcomm's Centriq 2400 1).

1) https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/2


               Intel        Intel        IBM          ARM          AMD
               Haswell      Skylake      POWER8       Cortex-A1    Jaguar
L1 data        32 KB        32 KB        64 KB        32 KB        32 KB
               64 B/line    64 B/line    128 B/line   64 B/line    64 B/line
               8-WAY        8-WAY        8-WAY        2-WAY        8-WAY
  - indexed    virtually    virtually
  - tagged     physically   physically
L1 instr.      32 KB        32 KB        32 KB        —            32 KB
               64 B/line    64 B/line                 —            64 B/line
               8-WAY        8-WAY        8-WAY                     2-WAY
  - indexed    virtually    virtually
  - tagged     physically   physically
L2 cache       256 KB       256 KB       512 KB       1 MB         2 MB
               64 B/line    64 B/line    128 B/line   64 B/line    64 B/line
               8-WAY        4-WAY        8-WAY        16-WAY       16-WAY
               NINE         NINE
  - indexed    physically   physically
  - tagged     physically   physically
L3 cache       8 MB         8 MB         8 MB         —            —
               64 B/line    64 B/line    128 B/line   —
               16-WAY       8-WAY
               inclusive    inclusive
  - indexed    physically   physically
  - tagged     physically   physically

Table 2.2. Cache parameters of contemporary hardware platforms. Note: some specific model parameters may vary.

2.5 Performance Monitor Unit (PMU)

The PMU is a logical component of the processor. It serves the purpose of capturing and reporting summaries of various performance events. With this component it is possible to profile applications and to measure characteristics such as cache misses, cache accesses, memory accesses, and others.

The value of the measured event is counted in a performance monitor counter. There is a limited amount of counters available on the platforms, and each counter has a configuration register. To start counting an event, its number needs to be set in the configuration register of the selected counter. When reading a core-specific value such as an L1 cache characteristic, on a typical platform it is necessary to read the register using the given core.
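Under Linux, these counters are commonly accessed through the perf_event_open interface; the kernel programs the counter configuration registers on our behalf. The minimal sketch below counts last-level cache misses around a code section. It is the author's illustration, not the measurement tooling used later in this thesis.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Count for the calling process on any CPU. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("LLC misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}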

2.6 Hypervisor

A hypervisor (HV) is a virtual machine manager and monitor. Typically, the hypervisor manages multiple instances of virtual machines running on one physical computer. Such a computer needs to support hardware virtualization (otherwise the virtualization would be significantly slower). The virtual machines, called guest machines, run operating systems, and the guest OSes in turn run the target applications.

The hypervisor running on the host machine manages the hardware and provides guests with access to the hardware. Since there are cases of hardware existing in a single instance (for example, there is typically only one instance of the Ethernet interface), the HV provides guests with a virtualization of such hardware so that each can access such a scarce resource. A typical HV may also schedule accesses to the shared resources. When such a resource is currently assigned to one VM, another VM has to wait until the scarce resource becomes available.

Historically, hypervisors were classified as Type-1 for native hypervisors running on bare metal and Type-2 for hypervisors running as an application under a host operating system [14]. Figure 2.7 highlights the difference.


Figure 2.7. Hypervisor types comparison. [15]

2.6.1 Hardware partitioning

With a generic hypervisor, hardware resources are shared among VMs. The scheduling of access to the hardware is a contention point and potentially a source of timing unpredictability. A solution to overcome such unpredictabilities is partitioning the hardware under the supervision of the hypervisor. The contention point is removed by dedicating certain hardware to a certain VM. With this setup there can be a set of real-time tasks, each in its own VM with the preallocated hardware resources necessary for the task's operation. The tasks are separated from each other and the risk of one task interfering with another is lowered significantly. To perform a useful task, the machine, in our case a virtual machine, needs at least some memory for data and code, an execution unit, and an output device. If supported by the HV (and by the HW), the following resources may be partitioned and assigned to virtual machines:
. processor cores
. cache
. memory
. input/output ports (on x86)
. memory-mapped input/output devices
. PCI devices
. UARTs
. VGA text console
. IRQ chips


Figure 2.8. Comparison of virtualization types. [15]

Hardware partitioning may also be called static allocation of resources. Compare the full hypervisor with the static partitioning hypervisor in figure 2.8. [16]
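To make the idea of static allocation concrete, the structure below sketches what a partition (cell) description could contain. This is a hypothetical, simplified format invented for illustration – it is not the Jailhouse configuration layout – and it mirrors the earlier example of a task owning CPU cores 4 and 5, a private memory region and one memory-mapped device.

#include <stdint.h>

struct mem_region {
    uint64_t phys_start;          /* start of the region in physical memory */
    uint64_t size;
    uint32_t flags;               /* e.g. read/write/execute bits            */
};

struct partition_desc {
    const char       *name;
    uint64_t          cpu_mask;   /* bit i set => CPU core i assigned        */
    struct mem_region mem[4];     /* private RAM and MMIO windows            */
    uint32_t          irqs[8];    /* interrupt lines routed to the partition */
};

/* A real-time partition owning cores 4 and 5, 64 MB of private RAM and a
   16 kB MMIO window of a PCI device; addresses and flags are made up. */
static const struct partition_desc rt_cell = {
    .name     = "rt-task",
    .cpu_mask = (1ull << 4) | (1ull << 5),
    .mem = {
        { .phys_start = 0x80000000ull, .size = 64ull << 20, .flags = 0x7 },
        { .phys_start = 0xf0000000ull, .size = 16ull << 10, .flags = 0x3 },
    },
    .irqs = { 42 },
};

A hypervisor consuming such a description would map only the listed resources into the partition, so no other task can even address them.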

2.7 PRedictable Execution Model (PREM)

This section builds on the knowledge of real-time systems described in section 2.1 and of contemporary memory systems described in section 2.4.

In a typical computer system, there are several execution units – processor cores and other subsystems – that share the main memory and contend for the main memory bandwidth. If two or more of them request access to the main memory at the same time, one of them gets the access while the others are stalled. Considering the real-time requirement for predictability, this behavior is undesirable. An execution unit might miss its deadline while waiting for the data. To address this problem, time slots could be allocated for each device in which it can access memory, so that the access is dedicated and schedulable. That is exactly the main idea behind the PRedictable Execution Model (PREM) – the ability to coschedule all active components (e.g., CPU cores, I/O peripherals) in the system at a high level. [17]

Each device has its time slot where it can access the data needed for further processing, but obviously the device itself must have some local storage to bridge the time interval when it has no access to the main memory. There are two local storage options discussed in the literature that can be used for this purpose in CPUs – scratchpad memories (SPMs) and caches.


Scratchpad memory is a fast and small on-device memory (SRAM) addressable directly from the device itself. Unlike a cache, an SPM is not transparent to the programmer, thus either the programmer or the compiler must issue instructions to load data into the scratchpad memory before processing them. An SPM may be mapped either to the global or to a separate address space, where the latter is more likely [18]. SPMs are not part of mainstream hardware, so we will not discuss them further.

Figure 2.9. Model of a task in a task-based execution system. There is a predictable time interval with constant execution time. In the memory phase the CPU loads all data it needs and processes them in the execution phase. While the CPU core is processing data, the memory is free to be accessed by peripherals (e.g., GPU) or other cores. e_{i,j}^{mem} is the time duration of the memory phase and e_{i,j}^{exec} is the time of the execution phase, where index i marks the task number and, since each task is divided into scheduling intervals, those are numbered with index j. [17]

Pellizzoni et al. [17] describe how to schedule tasks with I/O operations so that the whole system execution is predictable while using COTS platforms. Two classes of intervals are scheduled. Compatible intervals are compiled and executed in no particular manner. Within compatible intervals I/O data flows are not allowed; only I/O interrupts of peripherals associated with the given task are allowed. Predictable intervals have two phases as depicted in figure 2.9 – during the memory phase data are exchanged between the main memory and the cache of the corresponding CPU, and during the execution phase the data loaded into the caches are processed. The I/O operations accessing main memory are also scheduled in the execution phase.

A PREM real-time C compiler was created [17]. Macros are used by the programmer to mark parts of a predictable interval function. With the help of these annotated intervals, compiler passes transform the interval to be predictable. The PREM compiler source code is not publicly available.
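Since the PREM compiler is not public, the following hand-written sketch only illustrates the shape of a predictable interval: a memory phase that prefetches the working set into the core-local cache, followed by an execute phase that should generate no main-memory traffic. The function names and the use of the GCC __builtin_prefetch hint are the author's illustration, not the PREM compiler's actual annotations or output.

#include <stddef.h>

#define CACHE_LINE 64

/* Memory phase: touch every cache line of the working set so that the
   execute phase can run from the core-local cache only. */
static void memory_phase(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += CACHE_LINE)
        __builtin_prefetch(&buf[i], 0 /* read */, 3 /* high locality */);
}

/* A predictable interval: the scheduler grants this core its memory slot,
   the memory phase runs, and the execute phase follows without accessing
   the main memory (assuming the data fit into the cache and stay there). */
void predictable_interval(int *data, size_t n)
{
    memory_phase((const char *)data, n * sizeof(*data));  /* memory phase  */

    for (size_t i = 0; i < n; i++)                        /* execute phase */
        data[i] = data[i] * 2 + 1;
}

A prefetch is only a hint; the real compiler passes additionally have to keep the working set below the cache size and avoid code that could evict it during the execute phase.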

2.7.1 WCET-aware compiler

A compiler is a program that translates source code in one language into code in another language. Traditional compilers have two parts – a frontend and a backend. In this setup, the frontend translates source code into an intermediate representation (IR), the IR is optimized in the case of an optimizing compiler, and finally the backend translates the IR into machine code.

With the IR in the middle, the compiler becomes more easily extensible. To add support for another language, only a frontend needs to be added. Similarly, to add support for a different target machine, a new backend needs to be added.


Figure 2.10. Multi-target compiler. On the left, there are frontends for different languages that get transformed to the compiler intermediate representation (IR). The IR is optimized and transformed to the target machine code by the appropriate backend.

This approach makes the IR an important part of the compiler. An insightfully designed IR can be optimized well, and that leads to better output programs. The goal of optimizations is to make the code better in some way. Programs can be optimized for speed when running, for the size of the resulting code, for energy efficiency, and for other purposes. The IR is not the only stage that can be optimized. The resulting machine code is another stage where the optimization opportunity occurs. However, these optimizations require more effort since they need to be reimplemented for each backend.

2.7.2 Decoupled Access Execute (DAE)

An approach that divides the execution into phases called the access phase and the execute phase is described in the work of Koukos et al. [19]. These phases are analogous to the memory phase and execution phase in PREM [17]. The authors of DAE use the separated phases to optimize the energy efficiency of general-purpose programs. They utilize the hardware capability of Dynamic Voltage and Frequency Scaling (DVFS) to decrease the frequency during the access phase, when the data are preloaded from the main memory to the cache. The slowdown in this phase is negligible since the processor waits for the data to be fetched anyway. The execute phase runs at a high frequency. The results show that for memory-bound applications there are improvements not only in energy efficiency, but the applications even run faster.

2.8 MemGuard: limiting memory bandwidth of CPU cores
MemGuard is a tool for memory performance isolation. It serves as an efficient memory bandwidth reservation system. “MemGuard distinguishes memory bandwidth as two parts: guaranteed and best effort.” [5] “It improves system throughput by exploiting best effort bandwidth after each core satisfies its guaranteed bandwidth.” [5] As MemGuard operates with available resources, it needs to be implemented within the OS or a VMM. Memory performance isolation means “that the average memory access latency is no larger than when running on a dedicated memory system” [5]. Potential delay in the DRAM controller can be minimized by regulating the per-client aggregated requests to the DRAM controller. This comes from the fact that, thanks to MemGuard, the sum of bandwidth from all the clients is less than or equal to the available memory bandwidth, so the memory requests are likely to be processed immediately.



Figure 2.11. MemGuard system architecture. [5]

“The per-core regulator is responsible for monitoring and enforcing its corresponding core memory bandwidth usage. It reads the hardware PMC to account the memory access usage. When the memory usage reaches a predefined threshold, it generates an overflow interrupt so that the specified memory bandwidth usage is maintained. Each regulator has a history based memory usage predictor. Based on the predicted usage, the regulator can donate its budget so that cores can start reclaiming once they used up their given budget. The reclaim manager maintains a global shared reservation for receiving and re-distributing the budget for all regulators in the system.” [5]

MemGuard consists of two main components: the per-core regulator and the reclaim manager. Its design is described in [5] along with the design outline in figure 2.11. The budget assignment example and its description cited from [5] in figure 2.12 help to understand how MemGuard operates.



Figure 2.12. MemGuard budget illustrative example with two cores. [5]

“Each core has assigned static budget 3 (i.e., Q0 = Q1 = 3). The regulation period is 10 time units and the arrows at the top of the figure represent the period activation times. The figure demonstrates the global budget together with these two cores. When the system starts, each core starts with the assigned budget 3. At time 10, the prediction for each core is 1 as it only used budget 1 within the period [0,10], hence, the instant budget becomes 1 and the global budget G becomes 4 (each core donates 2). At time 12, Core 1 depletes its instant budget. Since its assigned budget is 3, Core 1 tries to reclaim 2 from G and G becomes 2. At time 15, Core 1 depletes its budget again. This time Core 1 already used its assigned budget, only a fixed amount of extra budget (Qmin) 1 is reclaimed from G and G becomes 1. At time 16, Core 0 depletes its budget. Since G is 1 at this point, Core 0 only reclaims 1 and G drops to 0. At time 17, Core 1 depletes its budget again and then it dequeues all the tasks as it cannot reclaim additional budget from G. When the third period starts at time 20, the predicted usage of Core 1 is larger than Q1. Therefore, Core 1 gets the full amount of assigned budget 3, while Core 0 only gets 1, and donates 2 to G. At time 25, after Core 1 depletes its budget, Core 1 reclaims an additional budget Qmin from G.” [5]
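The budget logic of the example can be summarized in a few lines of C. This is only a simplified sketch, not the actual MemGuard code from [5]: the "last value" predictor, the names, and the throttle_core_until_next_period() hook are illustrative, and the real reclaim manager also re-initializes the global reservation G every period.

/* Simplified sketch of the per-core regulator logic described above. */
#include <stdint.h>

#define QMIN 1u                     /* fixed extra budget reclaimed after the
                                       assigned budget is already used up    */

struct regulator {
    uint32_t assigned;              /* static budget Q_i                     */
    uint32_t budget;                /* instant budget of the current period  */
    uint32_t used;                  /* accesses used so far (read from PMC)  */
};

static uint32_t G;                  /* global shared reservation             */

void throttle_core_until_next_period(void);   /* platform-specific stub      */

/* timer interrupt: a new regulation period starts */
void on_period_start(struct regulator *r)
{
    uint32_t predicted = r->used;                /* last-value predictor     */
    r->used = 0;
    r->budget = predicted < r->assigned ? predicted : r->assigned;
    G += r->assigned - r->budget;                /* donate the unused part   */
}

/* PMC overflow interrupt: the instant budget of this core is depleted */
void on_budget_depleted(struct regulator *r)
{
    uint32_t want = (r->used < r->assigned) ? (r->assigned - r->used) : QMIN;
    uint32_t got  = (want <= G) ? want : G;      /* reclaim what is left     */
    G -= got;
    r->budget += got;
    if (got == 0)
        throttle_core_until_next_period();       /* stall until replenishment */
}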

2.9 Page coloring as a software solution addressing cache contention
One cache contention solution was already addressed in section 3.2.3, dedicated to cache partitioning supported in hardware by Intel RDT. Another, purely software, method to partition the cache that I want to mention is Page Coloring [7, 13]. An example explaining the principle is in figure 2.13. This software-based cache partitioning may be enabled when there is support in the OS or the hypervisor.



Figure 2.13. Example of page coloring used to partition a cache. [7]

Consider a system with the following parameters. The physical address is 48 bits long. Pages of virtual memory are 4 kB in size. The cache is indexed using the physical address; it has 128 kB per way, its block size is 64 B, and it has 2048 sets. All the frames that have the low 5 bits of the frame number in common are cached in a specific portion of 64 cache sets. In this example there may be a total of 2^5 = 32 partitions. The bits that determine the partition number are referred to as the page color [7]. The virtual memory manager has to ensure that a client of the partitioning has its virtual pages mapped into frames with numbers following the formula i * 2^n + p, where i ∈ N0, n = log2(number of cache sets) + log2(cache line size) − log2(page size), and p ∈ [0; 2^n) is the partition chosen per client. Notice that the page size has to be smaller than the cache size of one way for the described partitioning to be possible.
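A minimal sketch of the color computation for exactly the parameters above (4 kB pages, 64 B lines, 2048 sets, hence n = 11 + 6 − 12 = 5 color bits) follows; the frame values printed are only an illustration.

/* Compute the page color for the cache/page parameters used above. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                                   /* 4 kB pages       */
#define LINE_SHIFT  6u                                   /* 64 B cache lines */
#define SET_BITS   11u                                   /* 2048 sets        */
#define COLOR_BITS (SET_BITS + LINE_SHIFT - PAGE_SHIFT)  /* = 5              */

static unsigned page_color(uint64_t phys_addr)
{
    /* the color is the low COLOR_BITS of the physical frame number */
    return (phys_addr >> PAGE_SHIFT) & ((1u << COLOR_BITS) - 1);
}

int main(void)
{
    /* frames 0x05, 0x25, 0x45, ... all map to color 5, i.e. to the same
     * 1/32 slice of the cache sets */
    for (uint64_t frame = 0x05; frame < 0x100; frame += (1u << COLOR_BITS))
        printf("frame %#06llx -> color %u\n",
               (unsigned long long)frame, page_color(frame << PAGE_SHIFT));
    return 0;
}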

Chapter 3 Methodology

The chapter describes:
. the used hardware platforms
. the evaluation hypervisor Jailhouse
. the design of experiments
The description of evaluation platforms includes the NVIDIA TX2 and the Intel Xeon W. The system on chip Tegra X2 made by NVIDIA is described in section 3.1. The platform used to evaluate cache partitioning effects is the Xeon W processor from Intel, described in section 3.2.

3.1 NVIDIA Tegra X2
The primary platform used to evaluate methods increasing time predictability is the Jetson Tegra X2 made by NVIDIA. At the time, it was marketed as “the fastest, most power-efficient embedded AI computing device” 1). A summary of the parameters of the TX2 SoM is given in table 3.1. A variant of this platform is used in applications such as Tesla Motors’ self-driving capability 2) or the infotainment system of Mercedes-Benz 3) and others.

Parameter                Value
Power consumption        < 15 Watts
Main memory              8 GB LPDDR4, 128-bit bus
Peak memory bandwidth    59.7 GB/s
CPU                      NVIDIA Denver2 (dual-core) + ARM Cortex-A57 (quad-core)
GPU                      Pascal (256-core)
SW support               Ubuntu 16.04, Kernel 4.4

Table 3.1. Selected parameters of the Tegra X2 system on module.

The TX2 platform is used in this work to see the effects of memory bandwidth throttling at the level of the memory controller. We want to throttle especially the GPU to ensure a certain level of memory bandwidth is available for the CPUs. Throttling of the GPU at the memory controller level was previously done and used on the TX1 platform. Further, we evaluate the concept of MemGuard on this platform. The MemGuard implementation depends especially on the CPU’s Performance Monitor Unit (PMU). Therefore

1) https://developer.nvidia.com 2) https://electrek.co/2017/05/22/tesla-nvidia-supercomputer-self-driving-autopilot/ 3) https://blogs.nvidia.com/blog/2018/01/09/mercedes-ces-2018/


further sections focus especially on the description of the memory controller of the TX2 platform and the PMUs used.
Note: The documentation [20] of the TX2’s memory controller is rather sparse, whereas the TX1’s memory controller is documented substantially better in its manual [21]. A comparison of the register descriptions in both documents shows that the MC of the TX2 is heavily based on the one of the TX1. Therefore we use the TX2 documentation as incremental documentation on top of the TX1 docs. Some MC registers are completely left out of the TX2 documentation; as an example of the omitted registers, those related to PTSA can be mentioned. Another source allowing to check which registers are available on the platform is the Linux kernel headers provided by NVIDIA. Because of the overlapping MC design of the TX1 and TX2 discussed in the previous paragraph, sources from both platforms’ documentations are combined to describe the memory controller in the coming sections.
3.1.1 System organization
The TX2 platform is based on the Parker series processor. Its block diagram may be seen in figure 3.1. The module uses memory shared by all data consumers, i.e., the CPUs and the GPU access the same physical memory, meaning there is no dedicated memory for the GPU.


Figure 3.1. Block diagram of a Parker series processor. [20]

3.1.2 Memory subsystem
Several technical means are utilized when moving data along the path from the main memory (DRAM) modules to the consumers of the data and back. The components involved are described in the following paragraphs. The scheme of the memory controller components is in figure 3.2. The following terms are taken from [20].



Figure 3.2. Data path from the memory modules to the memory controller clients in Tegra X1. [21]

. External Memory Controller (EMC) is a module that interfaces with external DDR devices.
. Memory Controller (MC) is a module that handles requests from internal clients and arbitrates among them to allocate memory bandwidth.
. Write Content Addressable Memory (WCAM) presents a point of global visibility for writes within a single channel, but read-after-read ordering is not guaranteed by WCAM.
. Row Sorter reschedules reads and writes. It is a pending request buffer that sorts requests by the DRAM row they refer to. Write rescheduling is hidden behind WCAM.
. System Memory Management Unit (SMMU) is a block within the memory controller used to map from a virtual address space to physical addresses for device DMA.
. Ring arbiter is a type of round-robin arbiter. Memory client requests are arbitrated through a sequence of ring arbiters.
. Priority Tier Snap Arbiter (PTSA) is a rate control mechanism above each ring.
. Memory Controller Client InterFace (MCCIF) is the standard interface block between the memory controller sub-system fabric and the client device.


3.1.3 Throttling memory controller clients
Our main focus in relation to the memory controller is the exploration of the ability to throttle the bandwidth of memory controller clients. First the throttle infrastructure is described, and then the MC-related registers that allow throttling of clients. Figure 3.3 below is a view of memory client request datapaths through the PTSA rate control mechanism in the Tegra X1. Memory client requests are arbitrated through a sequence of ring arbiters which perform a type of round-robin arbitration. There are three ring arbiters referred to as ring 0, ring 1, and ring 2. Each ring has a rate control mechanism referred to as Priority Tier Snap Arbiter (PTSA). The client’s bandwidth guarantee is specified by the PTSA rate. [21]


Figure 3.3. Snap arbiter tree of the memory controller clients in Tegra X1. [21]

Ring 0 serves the following clients:
. CPU reads (mpcorer)
. L2 cache (ftop) – CPU writes through L2
. Page table cache (ptc)
. Raw output from ring 1
. Output from ring 1 translated through the SMMU
Ring 1 has the following clients:
. Ring 2
. Display
. Camera
Ring 2 serves every other client, including e.g. two graphics card client instances. The clients connected here are of non-isochronous nature. Since requests of this ring go through both lower rings, higher time jitter is to be expected between processed requests.
The following knobs affect the fairness algorithm in snap-arbitration [20]:


. Client PTSA (DDA) settings
. RING1_THROTTLE, MAX_OUTSTANDING, RING0_THROTTLE_MASK
. NISO_THROTTLE, MAX_OUTSTANDING_NISO, and NISO_THROTTLE_MASK
. BLOCK_LP_CPU_RD_IF_SMMU_INP_HP
It is possible to select which clients get throttled above ring 0 and ring 2. The MASK registers listed in table 3.2 serve this purpose. On top of ring 2, the group of clients to be throttled is called the non-isochronous (NISO) meta-client group. The THROTTLE registers listed in the same table allow setting the number of stall cycles inserted at the input to the ring after every request from any of the clients included in the meta-client group in question. The number of requests pending in the row sorter can affect whether the ring 1 or ring 2 arbiters are throttled (slowed down), based on thresholds (OUTSTANDING_REQ registers). There are two values settable in a THROTTLE register: one for the case when the number of requests is below the OUTSTANDING_REQ threshold and another for the case above the threshold.

Ring       Grouped Clients                            Wait Cycles      Trigger Requests
2 input    NISO_THROTTLE_MASK, NISO_THROTTLE_MASK_1   NISO_THROTTLE    OUTSTANDING_REQ_NISO
2 output   –                                          RING3_THROTTLE   OUTSTANDING_REQ_RING3
0 input    RING0_THROTTLE_MASK                        RING1_THROTTLE   OUTSTANDING_REQ

Table 3.2. Registers used to throttle groups of clients. Note: registers named RING3 are not a mistake, it is more of a naming legacy.

PTSA registers are used to set the bandwidth of clients. Client snap levels are expected to be programmed statically based on the client type and its needs. Boot code programs and locks the override control register for certain clients. The registers may be configured dynamically to provide bandwidth guarantees to ISO clients. The key point of this section is the possibility to throttle a group of MC clients by inserting a certain number of stall cycles after each memory request. During the stall cycles no request is served; that is how the bandwidth should get throttled.
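To make the configuration step concrete, a hedged sketch of writing such a throttle value from user space is shown below. The MC physical base address and the register offsets are placeholders only; the real values have to be taken from the TX1/TX2 TRM [20, 21] and the NVIDIA kernel headers, and the register also has more fields (below/above threshold) than shown here.

/* Sketch: map the MC register block via /dev/mem and write a throttle value.
 * MC_BASE_PHYS and the *_OFF offsets are placeholders, not documented values. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MC_BASE_PHYS            0x02c10000UL   /* placeholder base address    */
#define MC_REG_SPACE            0x10000UL
#define NISO_THROTTLE_OFF       0x0000UL       /* placeholder register offset */
#define NISO_THROTTLE_MASK_OFF  0x0004UL       /* placeholder register offset */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, MC_REG_SPACE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, MC_BASE_PHYS);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *mc = (volatile uint32_t *)map;

    /* insert 16 stall cycles after every request of the NISO meta-client
     * group and select the GPU clients (the bit positions are placeholders) */
    mc[NISO_THROTTLE_OFF / 4]      = 16;
    mc[NISO_THROTTLE_MASK_OFF / 4] = (1u << 0) | (1u << 1);

    munmap(map, MC_REG_SPACE);
    close(fd);
    return 0;
}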

3.1.4 Processing units
The TX2 platform contains two CPU clusters and one GPU. There is a quad-core Cortex-A57 cluster, which is an implementation of ARM intellectual property, along with a dual-core Denver2 cluster created by NVIDIA. Denver uses a combination of a simple hardware decoder with software-based dynamic binary recompilation, storing the optimized code in a portion of main memory. From the programmer’s perspective the Denver core implements the ARMv8 instruction set architecture following the ARMv8 programming model, and it should effectively behave as an A57. Denver offers slightly better performance than the Cortex-A57. Both clusters have per-core L1 data and L1 instruction caches and a unified per-cluster L2 cache. The cache line size is 64 B. Details of the TX2 CPU cache parameters are listed in table 3.3.


Cluster   Level  Type         Size      Ways  Sets  Write Policy
A57       L1     Data         4x 32K    2     256   Write Back
          L1     Instruction  4x 48K    3     256
          L2     Unified      2M        16    2048  Write Back
Denver2   L1     Data         2x 64K    4     256   Write Through
          L1     Instruction  2x 128K   4     512
          L2     Unified      2M        16    2048  Write Through

Table 3.3. Cache parameters of TX2 CPU clusters [20]

Each core may operate at a certain exception level (EL). There are four ELs in ARMv8, marked EL0 to EL3. Higher values of EL indicate increased software execution privilege. Execution at EL0 is called unprivileged execution. EL1 is the level meant for execution of the operating system kernel. EL2 provides support for virtualization, and lastly EL3 provides support for switching to the Secure state and back to the Non-secure state. When EL3 is implemented, there exists a portion of the physical address space available only from the Secure state.
The GPU is based on the Pascal microarchitecture. There are two streaming multiprocessors, each consisting of 128 CUDA cores. The GPU is attached to the PTSA of the memory controller through two interfaces. These are present to favor GPU data retrieval over the other clients also attached to ring 2. Duplicating MC interfaces is a simple mechanism to favor the GPU, which is expected to have much higher data consumption than the other clients.

3.1.5 Performance Monitor Unit (PMU)

Up to 31 event counters PMEVCNTR may be implemented, each 32 bits long. Each of these registers may be configured to count a certain event type with the PMEVTYPER_EL0 register of the same number. These counting registers overflow when they wrap. [22] To give an example, a few performance events follow:
. Software increment – architecturally executed instruction
. Level 1 instruction cache refill
. Level 1 data cache refill
. Level 1 data cache access
. Level 2 data cache refill
. Level 2 data cache access
. Bus access
. Data memory access
. ...
The PMU allows raising an interrupt to notify software when a counter overflows. If the software wants to count a specific amount of performance event occurrences, it has to preset the counter register with the maximum value the counter register can hold decreased by that specific amount. The PMU can filter events by combinations of Exception level and Security state. Filtering is settable in the PMEVTYPER_EL0 registers. Compare with Intel’s PMU in section 3.2.4.
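The preset arithmetic is sketched below; it is only the arithmetic, not the privileged, platform-specific register programming, and the exact off-by-one convention depends on when the given PMU raises the overflow interrupt.

/* Preset a 32-bit PMU counter so that it wraps (and raises the overflow
 * interrupt) after approximately n_events more counted events. */
#include <stdint.h>

static inline uint32_t pmu_preset_for(uint32_t n_events)
{
    return (uint32_t)(0u - n_events);   /* 2^32 - n_events */
}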


3.2 Intel Xeon W
This section briefly describes the Intel Xeon W-2133. The processor is employed in this work to see the effects of cache partitioning technology. The cache partitioning is available in Resource Director Technology under the term Cache Allocation Technology (more in section 3.2.3). The Xeon labeled with W is a successor to the E5 series. The Intel Xeon W-2133 is made in 14 nm lithography technology and belongs to the Skylake microarchitecture. It integrates 6 cores and can handle 12 threads at a time. The operating frequency ranges from 3.6 to 3.9 GHz. The processor’s thermal design power is 140 W.
3.2.1 Memory controller
Parameters of the memory controller integrated in the Xeon W-2133 1):
. Maximum memory size: 512 GB
. Maximum number of memory channels: 4
. Memory type: DDR4
. Maximum memory frequency: 2666 MHz
. Maximum memory bandwidth: 85.3 GB/s

3.2.2 Cache organization
The cache line size is 64 B. Table 3.4 gives an overview of the CPU’s cache organization.

Level  Type         Size       Ways  Sets   Write Policy
L1     Data         6x 32K     8     64     Write Back
L1     Instruction  6x 32K     8     64
L2     Unified      6x 1024K   16    1024   Write Back
L3     Unified      8448K      11    12288  Write Back

Table 3.4. Cache parameters of the Xeon W-2133 CPU [Linux sysfs records]

3.2.3 Resource Director Technology (RDT)
Resource Director Technology brought the possibility to monitor and allocate cache. RDT was first introduced with the Xeon E5-2600 v4 and it operated on the LLC. Since then it is available on certain Xeon series, e.g., the Intel Scalable Processors family. The technology is a means to assure Quality of Service; it targets shared resource contention between co-running applications. [23] Intel RDT consists of the following technologies:
. Cache Monitoring Technology (CMT)
. Cache Allocation Technology (CAT)
. Memory Bandwidth Monitoring (MBM)
. Code and Data Prioritization (CDP)
. Memory Bandwidth Allocation (MBA)
RDT above the L3 cache gradually developed with Xeon versions as shown in table 3.5. The technology or its parts might not be available on all Xeon processors.

1) https://www.intel.com/content/www/us/en/products/processors/xeon/w-processors/w-2133.html


Xeon Version  Microarchitecture          RDT increment
v2            Ivy Bridge, 22nm Tick      N/A
v3            Haswell, 22nm Tock         CMT
v4            Broadwell, 14nm Tick       MBM, CAT, CDP
v5            Skylake, 14nm Tock         MBA
v6            Kaby Lake, 14nm optimized

Table 3.5. RDT support that appeared on Xeon processors of the given versions. [20]

To use RDT, the Intel VT-x feature must be enabled in the BIOS setup. A description of the individual technologies associated under RDT follows. 1) [24]
Cache Monitoring Technology allows the OS or VMM to recognize whether some thread is negatively interfering with another high-priority thread, application, or virtual machine. A misbehaving thread may then be migrated or postponed.
Cache Allocation Technology is an incremental step over CMT. “CAT is a way-based hardware cache-partitioning mechanism for enforcing quality-of-service with respect to LLC occupancy.” [25] A certain portion of the total cache size can be allocated for a thread, application, or VM. The effect of CAT is depicted in figure 3.4.

Figure 3.4. Intel RDT Cache Allocation Technology. [24]
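The way-based masks and the “PQR” association register discussed in the following paragraphs can be illustrated with the sketch below. It is an assumption-laden sketch only: the MSR addresses (IA32_L3_QOS_MASK_n at 0xC90+n, IA32_PQR_ASSOC at 0xC8F) are taken from the Intel SDM and should be verified for the particular CPU, and in practice the Linux resctrl interface is the preferred way to configure CAT.

/* Sketch: program two disjoint CAT classes of service for an 11-way LLC
 * through the Linux msr driver (modprobe msr); core 0 only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_L3_QOS_MASK_0  0xC90
#define IA32_L3_QOS_MASK_1  0xC91
#define IA32_PQR_ASSOC      0xC8F

static int wrmsr(int fd, uint32_t msr, uint64_t val)
{
    /* the msr device interprets the file offset as the MSR address */
    return pwrite(fd, &val, sizeof(val), msr) == sizeof(val) ? 0 : -1;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_WRONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    /* COS0: ways 0-2, COS1: ways 3-10 -- disjoint, contiguous bitmasks */
    wrmsr(fd, IA32_L3_QOS_MASK_0, 0x007);        /* 000 0000 0111 */
    wrmsr(fd, IA32_L3_QOS_MASK_1, 0x7F8);        /* 111 1111 1000 */

    /* associate whatever runs on core 0 with COS1 (COS id in bits 63:32) */
    wrmsr(fd, IA32_PQR_ASSOC, (uint64_t)1 << 32);

    close(fd);
    return 0;
}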

Another technology from the bundle, Memory Bandwidth Monitoring, allows the per-core or per-thread memory bandwidth to be monitored. Based on the information from monitoring, the OS or VMM can adjust scheduling accordingly. Code and Data Prioritization is an extension to CAT. CDP allows optimizing the LLC layout for the given workload characteristics as it can prioritize data and instructions separately. Lastly, the Memory Bandwidth Allocation technology helps enforce a limit on the memory bandwidth each thread or VM can use. MBA uses a credit-based throttling mechanism.
CMT, MBM and CAT are configured using Model Specific Registers (MSRs). MSRs are part of the many control registers implemented on the x86 platform. In this work we are most interested in CAT. There is a certain number of classes of service (COS) defined. In relation to the cache, a COS is basically a bitmask where each bit corresponds to a portion of the cache. It is possible to define COSes with shared regions or completely isolated regions. To prevent cache sharing among clients, the

1) https://wiki.xenproject.org/wiki/Intel_Platform_QoS_Technologies


COSes are set in such a way that when the bit at position i in the mask of one COS is set to one, then in any other COS the bit at the same position i must be set to zero (a bit value of one means the corresponding portion of the cache is included in the COS, and a value of zero means the cache portion is not included). The bitmask-based COS provides an abstraction independent of the partitioning scheme. When the COSes are constructed, they can be assigned to threads, applications, or VMs. There is a hardware register known as “PQR” into which the scheduler of the OS or VMM writes the required COS of the unit being scheduled. [26] Since CAT is way-based partitioning and our Xeon W has an 11-way associative LLC, the COS has 11 bits.
3.2.4 Performance Monitor Unit (PMU)
The principle is the same as for the TX2’s PMU described in 3.1.5. The Xeon W-2133 identifies itself with a CPUID signature DisplayFamily_DisplayModel of 06_55H. It corresponds to the Intel® Xeon® Processor Scalable Family based on the Skylake microarchitecture. The appropriate section describing this family’s performance monitoring is available in the Intel software developer’s manual combined volumes [27]. “Performance monitoring was introduced in the Pentium processor with a set of model-specific performance-monitoring counter MSRs. These counters permit selection of processor performance parameters to be monitored and measured. The information obtained from these counters can be used for tuning system and compiler performance.” Intel processors newer than the Intel Core Duo generation support enhanced architectural performance events and non-architectural performance events. [27]
There is a finite number of performance event select IA32_PERFEVTSELx MSRs and a corresponding number of performance monitoring counter IA32_PMCx MSRs to capture selected events. These performance-related MSRs are available per core. Besides selecting the performance event to be measured, the PERFEVTSEL register allows enabling/disabling event counting and interrupts on overflow, and it contains some further settings. Table 3.6 lists a few of the events that can be measured. The Event Number and Unit Mask Value choose the event and they are filled into the specific IA32_PERFEVTSELx to measure the chosen event.

Event Name                   Event Num.  Unit Mask Value
Instructions Retired         0xC0        0x00
LLC Reference                0x2E        0x4F
LLC Misses                   0x2E        0x41
Branch Instruction Retired   0xC4        0x00
Branch Misses Retired        0xC5        0x00
...

Table 3.6. Selected Performance Events measurable using the PMU on Intel platform. [27]

3.3 Measuring performance events under Linux
The Linux kernel has support for measuring performance events utilizing the PMU. The option CONFIG_PERF_EVENTS needs to be enabled. On Intel we also need to include support for Intel uncore performance events provided as a kernel module.


CONFIG_PERF_EVENTS=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=m
Further, there is a system call perf_event_open as an interface to set up performance monitoring. The call returns a file descriptor created above an event that was passed as an argument to perf_event_open. The event may be restricted to certain processes or processors, and it may also be grouped with other events under one fd. To control the event, ioctl operations are called on the corresponding fd. The details may be obtained with the Linux command man perf_event_open. There is a security measure that, when set to a certain value, limits the ability of processes to access the perf events. 1)2) To allow use of all events we need to set the following to -1:
echo -1 > /proc/sys/kernel/perf_event_paranoid
perf_event_open receives as its first argument a description of the event to be measured. Among other things, its type may be set to PERF_TYPE_RAW to select an arbitrary performance event to be measured. There are though some events predefined with types PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE, PERF_TYPE_HW_CACHE, and others. These are specified per architecture. For Intel, the predefined values are in the file arch/x86/events/intel/core.c and contain e.g. the following:
PERF_COUNT_HW_CACHE_MISSES = 0x412e
PERF_COUNT_HW_CACHE_REFERENCES = 0x4f2e
Compare with table 3.6.
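A minimal, self-contained example of counting the raw LLC-miss event from table 3.6 (event 0x2E, umask 0x41, i.e. raw config 0x412e as in the kernel encoding above) for the calling thread might look as follows:

/* Count LLC misses of the calling thread using perf_event_open. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size     = sizeof(attr);
    attr.type     = PERF_TYPE_RAW;
    attr.config   = 0x412e;            /* umask 0x41, event 0x2E              */
    attr.disabled = 1;
    attr.exclude_kernel = 0;           /* may require perf_event_paranoid=-1  */

    /* measure the calling thread (pid 0) on any CPU (-1) */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("LLC misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}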

3.4 Jailhouse: HV partitions HW
Jailhouse is a static partitioning hypervisor. It aims to be a minimal hypervisor and is basically only a monitor. This HV allocates hardware exclusively for its guests. Thanks to this exclusivity, there is no need for scheduling (which means less overhead in the HV), and it also means no contention for access to the partitioned hardware (optimally, guests do not interfere with each other). Both features are interesting from the point of view of real-time applications. It only virtualizes in SW those resources that are essential for the platform and cannot be partitioned in HW. The minimal code base of Jailhouse is reached thanks to exploiting Linux in essence as the Jailhouse bootloader. Before the JH is started, Linux configures the hardware. This approach greatly simplifies the operations that have to be performed by the hypervisor. Figure 3.6 shows the loading process of Jailhouse. [16]
Jailhouse management relies strongly on the Linux infrastructure. The hypervisor provides a Linux kernel module which is the main component to control the hypervisor through the ‘/dev/jailhouse’ device. Jailhouse also comes with the ‘jailhouse’ utility that allows easily issuing commands to control the hypervisor. Command completion is also included.
Special names are used in Jailhouse. A cell is a configuration that describes partitioned hardware for a certain VM. An inmate is the code loaded into the cell. When enabling Jailhouse, a configuration containing an extensive system resources description has to be provided. The cell that starts with all the resources assigned is called the root-cell.

1) https://www.kernel.org/doc/Documentation/sysctl/kernel.txt – section perf event paranoid 2) https://lwn.net/Articles/696216/



Figure 3.5. Statically partitioned hardware with Jailhouse. [28]


Figure 3.6. Phases in the Jailhouse loading process. [28]

The HV takes control over these hardware resources and assigns them immediately back to the Linux root-cell. The first step to create a guest (to run an inmate) within Jailhouse is the creation of the cell configuration with a description of the hardware which the inmate will utilize. Creating a new cell cuts away resources from the root-cell. When the cell is ready, we may bring in the inmate. In this step the code of the guest is loaded to the proper virtual memory locations. After that the cell is started, which effectively begins the execution of the loaded code. The following listing is an example of starting Jailhouse with one demo non-root cell. Cf. figure 3.6.

# insert the jailhouse kernel module
insmod driver/jailhouse.ko
# enable the partitioning layer with the available hardware description
jailhouse enable configs/root.cell
# allocate some hardware for the new cell (including memory)
jailhouse cell create configs/non-root-demo.cell
# load code into the allocated partitioned memory
jailhouse cell load non-root-demo inmates/demos/demo.bin -a 0xf0000
# start the execution of the new machine
jailhouse cell start non-root-demo


3.4.1 Setup
Root Linux settings. Jailhouse has a few requirements on the Linux that is used for the JH bootstrap. Jailhouse needs some space in memory where it will load itself; we allocate this space for JH with the Linux kernel option ‘memmap’. On the x86 platform we need to turn off the VT-d IOMMU with the option ‘intel_iommu’. After disabling this for Linux, the IOMMU can be fully in the hands of Jailhouse when it is enabled. In our setup we also disabled kernel address space layout randomization with the option ‘nokaslr’. ASLR is a security measure that loads sections of a binary at random addresses within memory. With it turned off, we get a stable root-cell description of the system. The options are passed to the Linux kernel at boot time:
vmlinuz ... nokaslr intel_iommu=off memmap=82M$0x3a000000
Jailhouse contains a command to generate the root-cell description. For this, the ‘intel_iommu’ option needs to be turned on so that Linux loads the right modules to enumerate all the devices. Different root-cell configurations are used for the TX2 and the Xeon, and HyperThreading was disabled (see section 3.6.3).

3.5 Benchmarks
The benchmarks we used in the experiments can be categorized into synthetic benchmarks and semi-real-world benchmarks. The synthetic benchmarks are convenient from the perspective of results evaluation. They are simple, so their behavior is predictable, which makes it easier to interpret the results. We used synthetic benchmarks with the aim of maximally utilizing the memory bandwidth by the computation unit that performs such a benchmark. In our tests the synthetic benchmarks included sequential reads and sequential writes of an array. Since the accent of our tests is on memory bandwidth behavior, the stride when accessing arrays is set to the cache line size. That allows a CPU core to generate the maximum number of memory requests. As a synthetic benchmark for the GPU we employed saxpy – “Single-Precision A·X Plus Y”. This GPU kernel performs simple operations over arrays in parallel, which leads to very high memory bandwidth utilization.
There is often a need to get closer to real-world applications, but also to retain the ability to derive the meaning of the results with ease. We used the Polyhedral Benchmark suite to design tests that model real-world applications more closely. “PolyBench is a collection of benchmarks containing static control parts. The purpose is to uniformize the execution and monitoring of kernels, typically used in past and current publications.” 1) Table 3.7 shows the complexity of the kernels used in the experiments.

Benchmark  Operations     Memory           O(Ops)  O(Mem)
gemm       3n^3 + n^2     3n^2             O(n^3)  O(n^2)
doitgen    2n^4           n^3 + n^2 + n    O(n^4)  O(n^3)

Table 3.7. Complexity of the used PolyBench benchmarks. [29]

1) http://web.cs.ucla.edu/~pouchet/software/polybench/


The complexity of the benchmark helps to understand the results of the experiments. Since we can understand why the benchmarks behave in certain ways, we use them to understand the aggregated behavior of a real-world application. The real-world application that we compared to a synthetic benchmark is the KCF tracker, which is a tool for tracking objects visually.

3.6 Design of experiments

3.6.1 Throttling memory clients

We are especially interested in the effects of the throttling mechanism when there is high utilization of the memory bandwidth. That is the essence of our test scenario – we use threads executing on the CPU and GPU that highly utilize the memory bandwidth. As the main memory is shared across the system, we should observe contention over the bandwidth. Thanks to the throttling mechanism we should be able to moderate some clients and prefer others. We will look here at the CPU and GPU tasks utilizing memory bandwidth.
The memory-intensive task on the CPU is mere array traversing. An array is prepared where each element has the same size as a cache line. This means that with each access to an element of the array we transfer 64 B (the cache line size) between the cache and the memory. Each element of the array contains a pointer with the address of the next element of the array; the pointer of the last element points back to the first element. The benchmark then simply follows the next pointers to perform sequential memory accesses and counts them. We can easily measure the bandwidth: the number of elements accessed multiplied by the cache line size (item size) yields the total number of bytes transferred, and the utilized bandwidth is obtained as the total number of transferred bytes divided by the time it took to perform those transfers. The described benchmark is referred to as pointer chasing [8]; a sketch in C follows at the end of this section. The size of the array is chosen to be bigger than the LLC, so that the array cannot be cached and memory accesses are indeed performed.
The core of the task for the GPU is the CUDA kernel saxpy – “Single-Precision A·X Plus Y”. It is an operation over arrays defined by the formula z[i] = a * x[i] + y[i], where x, y, and z are arrays of floats, each of length N. Computation of each z[i] requires 3 memory accesses: 2 reads (x[i], y[i]) and 1 write (z[i]). Even though the GPU does not have dedicated memory, it is necessary for the GPU to have knowledge about the data on which it should operate. One of the ways to do that is by utilizing cudaMemcpy, which creates a copy of the data to be used solely by the GPU. To compute the bandwidth we use the following formula: bandwidth [GB/s] = 3 * N * sizeof(float) / time_in_ns. A batch of these kernels is actually performed to increase the precision of the measurements. Part of the codebase used in the work of Houdek et al. [30] was used as a starting point for these tests.
The memory controller configuration was discussed in section 3.1.3. We just note here that the THROTTLE_MASK registers serve to select the clients to be throttled and the THROTTLE registers serve to set the number of cycles to be inserted after each performed memory operation. THROTTLE registers are associated with an arbitration point, and their values may be set on ring 0 and on the NISO group to 0 – 63 cycles and on ring 3 to 0 – 31 cycles.
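The following is a hedged sketch of the pointer-chasing benchmark described above; the sizes and iteration counts are illustrative. Each element occupies one 64 B cache line, the elements are linked into a sequential cycle, and the 64 MiB array is far larger than the 2 MB LLC of the TX2.

/* Pointer-chasing bandwidth sketch: bandwidth = accesses * 64 B / elapsed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE_SIZE  64
#define N_ELEMS    (1L << 20)              /* 2^20 * 64 B = 64 MiB            */
#define N_ACCESSES (10L * 1000 * 1000)

struct elem { struct elem *next; char pad[LINE_SIZE - sizeof(struct elem *)]; };

int main(void)
{
    struct elem *a = aligned_alloc(LINE_SIZE, sizeof(struct elem) * N_ELEMS);
    if (!a) return 1;
    for (long i = 0; i < N_ELEMS; i++)
        a[i].next = &a[(i + 1) % N_ELEMS];          /* sequential cycle        */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct elem *p = a;
    for (long i = 0; i < N_ACCESSES; i++)
        p = p->next;                                /* one cache line / access */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* print p so the traversal cannot be optimized away */
    printf("last=%p  bandwidth=%.2f GB/s\n", (void *)p,
           (double)N_ACCESSES * LINE_SIZE / secs / 1e9);
    free(a);
    return 0;
}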


3.6.2 Profiling MemGuard implementation
The concept of MemGuard was introduced in section 2.8. The typical length of the memory phase in PREM on this platform is in the range of 100–500 us, and the length of the compute phase ranges from 100 us to 10 ms [31]. Compared to the memory controller level, where various clients including the GPU can be throttled, MemGuard can throttle specific CPU cores. On the other hand, the memory controller allows throttling a whole CPU cluster and not single cores. We evaluate the MemGuard implementation in the Jailhouse hypervisor. MemGuard’s integration into Jailhouse was done as a part of the efforts within the HERCULES project, which is an innovation action under the EU’s Horizon 2020 programme. The evaluation platform employed is NVIDIA’s Tegra X2.
The MemGuard variant implemented utilizes the per-core memory bandwidth limiter only. It is available to applications through a hypercall interface. For the purposes of profiling there is a patch for the Linux kernel exposing the MG limiter’s interface as a syscall. The tests were performed using this Linux interface. The limiter allows setting up a memory budget for the calling CPU core plus the budget’s replenish time period. The memory budget is the amount of memory accesses that may be depleted during the time period. The time period is implemented using a hardware timer, and memory budget monitoring is implemented using the PMU. Whenever the timer expires, the memory budget is replenished. If the budget is depleted before the timer expires, a PMU interrupt is triggered; within the corresponding ISR the core is then blocked until the timer expires and unblocks the core by replenishing the budget. The wait-for-event (wfe) instruction is employed during blocking to activate the low-power standby mode of the core.
Multiple performance monitor events that seemed suitable for measuring the amount of memory accesses were tested; ‘HW_EVENT_L2_CACHE_REFILL’ seems most suitable. This event increments the performance counter any time a cache line is loaded from the main memory to the LLC, i.e., a memory access is performed.
The interface further allows masking low-priority interrupts. With interrupts masked, the profiling is shielded from distortion of the results, as certain outcome values are expected from the profiled tasks. Another option the interface provides is a periodic or non-periodic mode. The periodic mode was described as the standard mode of operation. In non-periodic mode no limiting of the memory bandwidth occurs; this is used for profiling.
MemGuard measures certain statistics. When its interface is called, it reports the amounts measured since the last call. The statistics include:
. number of cache misses measured (performance events that occurred)
. total time that passed
. memory budget overrun
. time budget overrun
A memory budget overrun in effect means that the ISR handling the overflow of the performance monitor counter was invoked at least once. Similarly, a time budget overrun means the ISR handling the replenish time period was invoked at least once.
In the HERCULES project there was a need to see the behavior of the KCF tracker. It is a tool to track objects using visual data. The aim of the measured KCF implementation is to comply with PREM. 1) An illustrative example is in figure 3.7.

1) https://github.com/CTU-IIG/kcf


Figure 3.7. Visualization of the profiled KCF tracker object tracking. The red square is the reference value; the green square is the tracker’s outcome.

In this work the tool was used as a black box. We compare the performance of this application with a synthetic task performing sequential array reads. The performed measurements utilize memory bandwidth limiting with MemGuard.
3.6.3 Hardware cache partitioning
The experiment described in this section aims to show the behavior of hardware last-level cache partitioning. The LLC is typically shared among CPU cores. Partitioning of the LLC is available in some of the Intel Xeon series processors. The Jailhouse hypervisor includes software support for partitioning the Xeon’s LLC and assigns cache partitions to cells according to their configuration.
As a benchmark we use kernels from the PolyBench suite with various dataset sizes. The performance of the benchmark is measured under Linux within the root cell. The perf tool, which allows reporting task statistics, was utilized; it uses the PMU’s counters internally. We utilize the Jailhouse hypervisor to create cells with bare-metal interfering tasks. The interfering tasks perform sequential array reads with a stride of the cache line size.
The experiment compares shared and partitioned cache setups. For a single benchmark, multiple test cases are measured in both cache setups. The test cases are depicted in figure 3.8. The total number of utilized cache ways is set to be equal to the number of cells in the test case.

[Figure 3.8 layout: in the partitioned setup the measured task and each interfering task (Interfering1, Interfering2, ...) use one dedicated cache way each, the remaining ways being unused; in the shared setup the measured and interfering tasks share the same group of ways, the rest again unused.]

Figure 3.8. Cache partitioning experiment.

The root-cell with Linux has one cache way assigned. Every interfering cell also has one cache way assigned. Unused cache ways and CPU cores are assigned to a dummy cell. The dummy cell does not run any code and its CPU cores are off.
For this experiment the HyperThreading (HT) feature of the Intel Xeon processor was disabled. HT is hardware-level virtualization that allows interacting with one physical


CPU core as if two cores were present. We experienced crashes of Jailhouse due to misconfigured cell access to the cores, so HT was disabled in the BIOS setup. I believe both virtual cores belonging to one real core would have to be assigned to a single cell for proper functioning with HT.

Chapter 4 Evaluation

Two platforms are used to perform the tests. I find it interesting to see a relative performance comparison of the employed platforms in relation to caches. The comparison is plotted in figure 4.1. The PolyBench GEMM benchmark is employed with the same dataset sizes on both platforms. For the test we limited the Xeon LLC size to 3 MB so that we get closer to the TX2’s 2 MB LLC. The test for the TX2 uses MemGuard to measure cache misses and execution time; the measured and interfering tasks all run under Linux. The test for the Xeon uses the perf utility to measure cache misses and execution time; the measured task runs in the Jailhouse root-cell and the interfering tasks run in non-root cells. In the figure, at the point where there is no interfering task, the number of cache misses on the Xeon is ten times lower than on the TX2. The main reason is that the Xeon has a three-level cache whereas the TX2 has only a two-level cache; the Xeon’s L2 cache has a size of 1 MB.


Figure 4.1. Relative comparison of benchmark performance on the TX2 and Xeon W in relation to caches. PolyBench GEMM is employed with the same dataset size on both platforms.


4.1 Throttling memory clients
4.1.1 Measurement results
Our measurements are performed over combinations of up to two CPU threads and a GPU thread. First, a CPU thread alone and a GPU thread alone are measured. Later, the measured GPU thread is combined with one or two interfering CPU threads. An interfering thread starts before the beginning of the measured thread and finishes after the measured thread execution is done; this way the measured thread should yield more precise results. The number of cycles used to throttle clients is chosen approximately as powers of two.
The setup that follows depicts the effects of setting different amounts of throttle cycles applied to different rings (i.e., to the THROTTLE registers), but without enabling any client within the mask registers at all. When examining ring 2 we see in figure 4.2, as we would expect, that the memory bandwidth of our threads is not influenced.

[Figures 4.2–4.5 plot the bandwidth in GB/s against the number of throttle cycles (0, 1, 2, 4, 8, 16, 31) for four cases: 1) Denver2 measured alone, 2) Pascal measured alone, 3) Pascal measured with one interfering Denver2 core, 4) Pascal measured with two interfering Denver2 cores.]

Figure 4.2. Examining behavior of the THROTTLE register above ring 2.

That is however not the case for ring 0. In figure 4.3 it seems that the clients used by our threads get throttled anyway, even though they are not selected to be throttled. I suspect that ring 0 ignores its mask register. Finally, the NISO client group is portrayed in figure 4.4. It seems the CPU client is selectively influenced from as few as 3 throttle cycles (in these cases our memory-intensive CPU task [pointer chasing] reads from a cache-line-sized cell at a time and also writes into it). This looks like the CPU reads are throttled even though all clients in the NISO meta-group are disabled (i.e., a zeroed mask register).



Figure 4.3. Examining behavior of the THROTTLE register above ring 0.


Figure 4.4. Examining behavior of the THROTTLE register above NISO client group.


We show only the results for the Denver2 core, but analogous results are yielded by the A57 core; the trends are less striking with the A57. As a specific test case we throttle the GPU memory client (GK and GK2 in the mask registers). The resulting bandwidth is depicted in figure 4.5.


Figure 4.5. Throttling GPU within NISO client group.

The GPU client memory accesses are obviously throttled well there, but unfortunately the CPU accesses get throttled as well. I wasn’t able to find any further (register) settings that would allow the CPU bandwidth to stay intact. In this form, GPU throttling at the TX2’s MC level is unusable for increasing the time predictability of the system.
Further, we measured the delay between the time new values are set in the throttle-related registers and the time the same values are read back. The delay was in the range from 20 ms to 100 ms. Such high values would also make throttling at the MC level unusable for the purpose of increasing the system time predictability, since the task running on the CPU that should be protected from GPU interference would finish its execution by the time this throttling became operational.

4.2 Profiling MemGuard implementation

4.2.1 Profiling basic MG parameters
The synthetic application used to profile MemGuard performs sequential reads or writes over an array. The array is accessed with a stride of the cache line size. It should be noted that when the application is memory bound, decreasing the memory bandwidth of the app results in an increased execution time and vice versa. The application employed is designed to be memory bound.


[Figure 4.6 axes: memory budget and replenish time period (horizontal, logarithmic) versus app cache misses and app execution time (vertical, logarithmic); workload read(10000 lines).]

Figure 4.6. Application performance with varying MemGuard parameters. The arrows indicate the direction of increase of the quantities associated with the axes. The scale of the axes is logarithmic. Points having the same application execution time correspond more or less to the same bandwidth.

Figure 4.6 depicts the behavior of the synthetic sequential read application under MemGuard. We observe the app’s response to various combinations of memory budgets and replenish time periods. For each combination, the total execution time and the total amount of cache misses the app needed are recorded and plotted. We can see the number of cache misses stays more or less constant. This is expected, as the application should have a more or less constant number of memory accesses in each run. The execution time however changes as we moderate the memory bandwidth the app can utilize.
Figure 4.7 is a two-dimensional projection of figure 4.6. It highlights the application’s response to changes in the memory budget. When the memory budget is lower than the amount that would saturate the task’s bandwidth needs, more replenishes are needed to finish the task. That results in a longer total execution time of the task. MemGuard stops limiting the core’s bandwidth at the point where the task cannot deplete all the memory budget given for the current time period. In figure 4.7 this happens at around 10000 cache misses, and it is directly related to our task reading 10000 cache lines, because one cache-line miss means one access to the main memory. At the same time, if we decrease the replenish time period enough (128 us and less in the chart), the bandwidth of the app is increased. This results in a decrease of the total execution time.
We can also project figure 4.6 so that the application’s response to changes in the replenish time period is highlighted. It is shown in figure 4.8. The description of behavior from the previous paragraph holds. The interesting part of the chart is on the far left.
4.2.2 Reliability of MemGuard
In figure 4.8 we see the total cache miss counts slightly decrease on the very left. After reading the ARMv8 documentation thoroughly, it is revealed that the PMU architecture does not require event filtering to be accurate. For most events, it is


[Figure 4.7 plots the total cache misses and the total execution time against the memory budget, one curve per replenish time period (1 us to 16384 us); workload read(10000 lines), MGF_MASK_INT=1, MGF_PERIODIC=1, 10 samples per point; memory and time budget overruns are also marked.]

Figure 4.7. MemGuard: focus on memory budget behavior.

[Figure 4.8 plots the total cache misses and the total execution time against the replenish time period, one curve per memory budget (125 to 16000 cache misses); workload read(10000 lines), MGF_MASK_INT=1, MGF_PERIODIC=1, 5 samples per point; memory and time budget overruns are also marked.]

Figure 4.8. MemGuard: focus on replenish time period behavior.

acceptable that, during a transition between states, events generated by instructions

acceptable that, during a transition between states, events generated by instructions executed in one state are counted in the other state. [22] Since we filter the PMU events only by exception levels 0 and 1, we miss those that are reported at EL2 (hypervisor). That is the cause of the slight drop in cache misses.
In the left part of the chart there is also a slight increase in the total execution time. When testing the behavior of various performance events, we discovered that this increase is connected to an increase in TLB cache misses. The number of TLB cache misses was around 4 times higher with a 1 us period compared to, for example, a 64 us period. EL2 events were included when performing the TLB measurement. This is one of the factors contributing to the MemGuard overhead in the system. A real-world application probably would not thrash the TLB that much, so our results are close to the worst-case scenario.

[Figure 4.9 plots the total cache misses and the total execution time (with standard deviation and min-max range) against the replenish time period; workload write(10000 lines), constant memory budget of 200000 cache misses, MGF_MASK_INT=1, MGF_PERIODIC=1, 100 samples per point.]

Figure 4.9. Profiling MemGuard’s timer ISR overhead. One CPU enabled at measure time. Events counted only from EL0 and EL1.

The experiment with its outcomes in figure 4.9 was constructed to measure the overhead of MemGuard. Since the memory budget is higher than the task could ever need, we know the performance event ISR won’t be executed. The experiment then depicts only the overhead of the timer ISR that handles replenishes of the budget. We see that low replenish time periods result in higher execution times. With this chart we can decide how high an overhead would be acceptable. Unfortunately there is a relatively high deviation in the execution time. If we could partition the TLB and reserve some records for MemGuard’s usage, the experiment’s execution time would presumably get more stable.
The percentage overhead of MemGuard’s timer ISR is listed in table 4.1. It compares overheads for various replenish time periods. To compute the overhead we use the following formula: Overhead = (exec_time / 150.36 − 1) * 100 [%], where exec_time is the


total execution time yielded for the given replenish time period with MemGuard active (i.e., a time budget overrun occurred; periods 1 to 32 in the chart). The value 150.36 in the denominator was computed as an average of the total execution times with MemGuard inactive (i.e., a time budget overrun did not occur; periods 256 to 16384 in the chart).
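As a worked instance of the formula (note: the 180.6 us figure below is back-computed from the 20.10 % entry in the table, it is not a separately reported measurement):

\mathrm{Overhead}_{1\,\mu\mathrm{s}} = \left(\frac{180.6}{150.36} - 1\right)\cdot 100 \approx 20.1\,\%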

Replenish time period [us]   Timer overhead [%]
1                            20.10
2                            10.04
4                            5.32
8                            4.10
16                           2.74
32                           2.03
64                           1.79
128                          1.48

Table 4.1. Percentage overhead of MemGuard for various replenish time periods.

4.2.3 MemGuard as a profiler
When the memory budget of MemGuard is set substantially higher than the app needs, and also when the replenish time period is higher than the app execution time, we may use the reports from MemGuard to profile the app. We use statistics reported by MG to profile a synthetic application that reads an array with a cache line stride. With this we experimentally explore the behavior of the app on the TX2 platform. To get significant results we performed the measurement a million times.

Figure 4.10. Scatter plot of one million measurements of a task reading ten thousand cache lines.


Figure 4.11. Scatter plot of one million measurements of a task reading one hundred thousand cache lines.

Figure 4.10 shows a scatter plot for ten thousand array reads and figure 4.11 for one hundred thousand array reads. In the scatter plots, the more points fall on the same spot, the more red that spot becomes. The cache misses form a single cluster; their full range is about 100 misses in both measurements. The total execution time also forms a single cluster, but its variation depends on the number of operations performed by the measured task. In addition, there are outliers with significantly higher times. Since these experiments run in the Linux root cell, high-priority interrupts may occur; future work could repeat this or a similar test in a bare-metal Jailhouse cell to see whether the time outliers disappear in such a setup. The cumulative distribution function of the cache misses computed from the results in figure 4.11 is shown in figure 4.12, and the cumulative distribution function of the execution time in figure 4.13. Based on these profiling results we may allocate the memory bandwidth for the task. The initial profiling results also revealed issues in MemGuard's statistics reporting, which we subsequently fixed in the tool.
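The CDFs in figures 4.12 and 4.13 are empirical distribution functions of the raw samples; one straightforward way to produce the plotted data (a sketch, not the exact post-processing pipeline used here):

#include <stdio.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Print the empirical CDF of n samples as "value probability" pairs,
 * e.g. cache-miss counts or execution times reported by MemGuard. */
static void print_ecdf(unsigned long long *samples, size_t n)
{
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    for (size_t i = 0; i < n; i++)
        printf("%llu %f\n", samples[i], (double)(i + 1) / (double)n);
}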


[Figure 4.12 plot: CDF for read(100000); memory budget = 300000 misses; time period = 50000 us; 1000000 samples. Cache misses: mean 100262.16, min 100219, max 100318, std. dev. 11.42. Time [us]: mean 1441.55, min 1399, max 2667, std. dev. 11.61. Axes: probability vs. cache misses [count]; cumulated overflows of the memory budget samples are marked.]

Figure 4.12. CDF of cache misses over one million measurements of a task reading one hundred thousand cache lines.

[Figure 4.13 plot: same measurement and statistics as figure 4.12; axes: probability vs. total time [us].]

Figure 4.13. CDF of execution time over one million measurements of a task reading one hundred thousand cache lines.


4.2.4 MemGuard on a real-world application: the KCF tracker

We compare the performance of the KCF tracker with a synthetic task executing sequential array reads. Each task is measured with several memory-bandwidth budgets preset with MemGuard. Figure 4.14 depicts the slowdown of the applications with decreasing bandwidth.
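MemGuard budgets are specified as cache misses per replenish time period, so the bandwidth values on the x-axis of figure 4.14 correspond to per-period budgets roughly as follows (a sketch assuming that every counted miss transfers one 64-byte cache line; the helper name is ours):

#include <stdint.h>

#define CACHE_LINE_BYTES 64ull  /* line size assumed for the TX2 */

/* Convert a bandwidth target [B/s] and a replenish time period [us] into a
 * MemGuard memory budget [cache misses per period]. */
static uint64_t bw_to_budget(uint64_t bytes_per_s, uint64_t period_us)
{
    return bytes_per_s * period_us / 1000000ull / CACHE_LINE_BYTES;
}

Under this assumption, the 100 MB/s point mentioned in the conclusion combined with a 16000 us period corresponds to bw_to_budget(100000000, 16000) = 25000 misses per period.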

[Figure 4.14 plot: throttling memory bandwidth with MemGuard, comparing the KCF tracker to a synthetic cache-line read; abbreviation: Replenish Time Period (RTP). Axes: slowdown [-] vs. bandwidth [B/s], both log-scaled. Series: synthetic read(1000000) with RTP 125 us and 16000 us; KCF(car1) opencvfft-st with RTP 125 us and 16000 us; KCF(car1) fftw with RTP 125 us and 16000 us.]

Figure 4.14. MemGuard used to compare real-world and synthetic applications. car1 is the name of a test based on a dataset from the VOT2016 1) datasets. The KCF codebase comes with multiple FFT implementations; we evaluate the single-threaded OpenCV FFT and the FFTW library implementations.

4.3 Hardware cache partitioning

The design of the experiment is described in section 3.6.3. We measured setups with various dataset sizes using the GEMM and DOITGEN benchmarks from the PolyBench suite. Only one setup with the GEMM kernel is presented, as all results show similar behavior. Figure 4.15 compares the characteristics of a setup with a partitioned cache and a setup with a non-partitioned (shared) cache while the number of interfering tasks varies. Each point is an average of either the cache-miss count or the total execution time of the measured benchmark. With the partitioned setup, the total number of cache misses suffered by the benchmark is constant. In the shared-cache setup, the cache misses rise as the cache lines belonging to the benchmark are evicted by the accesses of the interfering tasks. With one interfering task, the execution time is better for the partitioned-cache setup than for the shared-cache setup.


With more than one interfering task, however, the partitioned-cache setup performs worse than the shared-cache setup. This was not expected; the partitioned-cache case should outperform the shared-cache case. I believe the reason for this behavior is that the experiment was not designed optimally. I suppose an LRU replacement policy is employed, as mentioned in Intel manuals for earlier processor generations [32]. The measured benchmark keeps its cache lines hot, leaving the multiple interfering tasks to compete among each other, so in the shared-cache setup the benchmark is limited significantly less than we expected. The problem could also lie in the small size of the array used in the interfering application. Since Jailhouse runs inmates with the EL2-level MMU enabled and maps only the first 2 MB for an inmate, we used just a 1.75 MB array in the interfering tasks to fit into the available mapped memory. If this was the issue, the mapped memory could be increased explicitly by calling the Jailhouse function map range. For the reasons stated above, only the test cases with zero and one interfering task can be considered significant.
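In this experiment the partitions are tied to Jailhouse cells; for reference, on a plain Linux system the same Intel CAT capability (part of RDT [23]) can be exercised through the kernel's resctrl filesystem. A minimal sketch, assuming resctrl is already mounted at /sys/fs/resctrl; the group name and the capacity bitmask are illustrative, and for strict isolation the default group's mask would also have to be shrunk so that the masks do not overlap:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Prerequisite: mount -t resctrl resctrl /sys/fs/resctrl (CAT-capable CPU). */
int main(void)
{
    /* Create a resource group and give it four L3 ways on cache id 0. */
    mkdir("/sys/fs/resctrl/bench", 0755);

    FILE *f = fopen("/sys/fs/resctrl/bench/schemata", "w");
    if (!f) { perror("schemata"); return 1; }
    fprintf(f, "L3:0=00f\n");   /* contiguous capacity bitmask: 4 ways */
    fclose(f);

    /* Assign this process to the group; its LLC fills are then
     * constrained to the selected ways. */
    f = fopen("/sys/fs/resctrl/bench/tasks", "w");
    if (!f) { perror("tasks"); return 1; }
    fprintf(f, "%d\n", (int)getpid());
    fclose(f);

    /* ... run the measured benchmark from here ... */
    return 0;
}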

[Figure 4.15 plot: measuring PolyBench GEMM on the Xeon in the Jailhouse root cell; cache size = i/11 * 8448 kB, where i is the number of cells; cache divided equally in the partitioned scenario; interfering cells each perform sequential reads of a 1.75 MB array; 100 samples. Axes: cache misses [count] and total time [s] vs. the number of interfering cells (0 to 3). Series: time and cache misses, each for the partitioned and the shared cache.]

Figure 4.15. Benchmarking shared and partitioned cache: averages of cache-miss counts and total times.

For completeness, figure 4.16 depicts a scatter plot of the data points measured in the presented experiment. The outliers are expected, as we run the task under Linux without protection from possible system interference. I would like to point out that the evaluation platform arrived late, so there was no time to design better experiments.


[Figure 4.16 plot: comparison of partitioned and shared cache whose size grows linearly with the number of cells; various numbers of interfering cells; benchmark PolyBench GEMM; 100 samples per case. Axes: total cache misses [count] vs. total time [s]. Series: 0, 1, and 2 interfering cells, each with partitioned and with shared cache.]

Figure 4.16. Benchmarking shared and partitioned cache: scatter plot of measurements.

Chapter 5
Conclusion

The work presents methods that improve the predictability of program execution time and describes the results of the experiments. We evaluated methods to limit the bandwidth of memory clients, which are necessary for the operation of the PRedictable Execution Model.

We wanted to use the memory controller of the TX2 platform to throttle the GPU memory bandwidth, as was previously done for the TX1 platform. We showed that on the TX2 the CPUs are throttled along with the GPU even though the CPUs are not configured to be throttled. This makes throttling the GPU at the level of the TX2's memory controller unusable.

We verified that the behavior of MemGuard is correct. We found a small issue in its statistics reporting and submitted a patch that fixed it. We showed that MemGuard's overhead is about 2 % when the replenish time period is set to 32 us, which is an acceptable overhead. The results of profiling the KCF tracker show that the application uses approximately 1.8 GB/s when it is not throttled. When we set the tracker's bandwidth to 100 MB/s, the tracker slows down by a factor of 3.5.

We showed that hardware cache partitioning improves the performance of the system; however, the test was not designed ideally. Details may be found in section 4.3.

References

[1] Paul Lokuciejewski, and Peter Marwedel. Worst-Case Execution Time Aware Compilation Techniques for Real-Time Systems. Springer Netherlands, 2011. ISBN 978-90-481-9928-0.
[2] Albert M. K. Cheng. Real-Time Systems: Scheduling, Analysis, and Verification. John Wiley & Sons, Inc., 2003. ISBN 9780471224624. http://dx.doi.org/10.1002/0471224626.
[3] Renato Mancuso, Roman Dudko, and Marco Caccamo. Light-PREM: Automated software refactoring for predictable execution on COTS embedded systems. In: 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications [online]. IEEE, 2014. pp. 1-10. ISBN 978-1-4799-3953-4. http://ieeexplore.ieee.org/document/6910515/.
[4] P. Burgio, A. Marongiu, P. Valente, and M. Bertogna. A memory-centric approach to enable timing-predictability within embedded many-core accelerators. In: 2015 CSI Symposium on Real-Time and Embedded Systems and Technologies (RTEST). IEEE, 2015. pp. 1-8. https://doi.org/10.1109/RTEST.2015.7369851.
[5] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. MemGuard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms. In: 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS). 2013. pp. 55-64.
[6] Sparsh Mittal. A Survey of Techniques for Cache Partitioning in Multicore Processors. ACM Comput. Surv. 2017, 50 (2), 27:1-27:39. DOI 10.1145/3062394.
[7] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In: 2008 IEEE 14th International Symposium on High Performance Computer Architecture. 2008. pp. 367-378. https://doi.org/10.1109/HPCA.2008.4658653.
[8] Ulrich Drepper. What every programmer should know about memory [online: accessed 11.2.2017]. 2007. https://lwn.net/Articles/250967/.
[9] Matthew Lentz, and Manoj Franklin. Performance of Private Cache Replacement Policies For Multicore Processors [online]. 2014. http://doi.org/10.5121/csit.2014.4708.
[10] Yuejian Xie, and Gabriel H. Loh. PIPP: Promotion/Insertion Pseudo-partitioning of Multi-core Shared Caches. SIGARCH Comput. Archit. News. 2009, 37 (3), pp. 174-183. DOI 10.1145/1555815.1555778.
[11] M. Geanta, L. Ghica, and N. Tapus. Leverage cache replacement policy in multicore processors. In: 2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing (ICCP) [online]. 2016. pp. 417-424. ISBN 978-1-5090-3899-2. https://doi.org/10.1109/ICCP.2016.7737182.
[12] Wikimedia Commons. https://commons.wikimedia.org.
[13] George Taylor, Peter Davies, and Michael Farmwald. The TLB Slice - a Low-cost High-speed Address Translation Mechanism. In: Proceedings of the 17th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM, 1990. pp. 355-363. ISBN 0-89791-366-3. http://doi.acm.org/10.1145/325164.325161.
[14] Gerald J. Popek, and Robert P. Goldberg. Formal Requirements for Virtualizable Third Generation Architectures. Commun. ACM. 1974, 17 (7), 412-421. DOI 10.1145/361011.361073.
[15] Ellen Kou, Texas Instruments. Virtualization for embedded industrial systems. http://www.tij.co.jp/jp/lit/wp/spry317a/spry317a.pdf.
[16] Jailhouse, project home [online]. https://github.com/siemens/jailhouse.
[17] Rodolfo Pellizzoni, Emiliano Betti, Stanley Bak, Gang Yao, John Criswell, Marco Caccamo, and Russell Kegley. A Predictable Execution Model for COTS-Based Embedded Systems. In: 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium [online]. IEEE, 2011. pp. 269-279. ISBN 978-1-61284-326-1. http://ieeexplore.ieee.org/document/5767117/.
[18] Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve. Stash: Have your scratchpad and cache it too. In: 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA) [online]. ACM Press, 2015. pp. 707-719. ISBN 978-1-4503-3402-0. http://dx.doi.org/10.1145/2749469.2750374.
[19] Konstantinos Koukos, Per Ekemark, Georgios Zacharopoulos, Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean. Multiversioned Decoupled Access-execute: The Key to Energy-efficient Compilation of General-purpose Programs. In: Proceedings of the 25th International Conference on Compiler Construction (CC 2016) [online]. New York, NY, USA: ACM Press, 2016. pp. 121-131. ISBN 978-1-4503-4241-4. http://doi.acm.org/10.1145/2892208.2892209.
[20] NVIDIA. NVIDIA Parker Series SoC, Technical Reference Manual [online]. 2017. https://developer.nvidia.com/embedded/downloads#?search=X2.
[21] NVIDIA. NVIDIA Tegra X1 Mobile Processor, Technical Reference Manual [online]. 2016. https://developer.nvidia.com/embedded/downloads#?search=X1.
[22] ARM Limited. ARM Architecture Reference Manual [online]. 2018. https://static.docs.arm.com/ddi0487/da/DDI0487D_a_armv8_arm.pdf.
[23] Intel Corporation. Intel RDT in Linux [online]. https://01.org/intel-rdt-linux/blogs/fyu1/2017/resource-allocation-intel%C2%AE-resource-director-technology.
[24] Intel Corporation. Intel Xeon Processor E5-2600 V4 Product Family Technical Overview. https://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v4-product-family-technical-overview.
[25] F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee. CATalyst: Defeating last-level cache side channel attacks in cloud computing. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 2016. pp. 406-418.
[26] A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer. Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 2016. pp. 657-668.
[27] Intel Corporation. Intel® 64 and IA-32 architectures software developer's manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4 [online]. https://software.intel.com/en-us/articles/intel-sdm.
[28] Jan Kiszka, Siemens Corporate Technology. Bootstrapping the Partitioning Hypervisor Jailhouse. 2016.
[29] Tomofumi Yuki, and Louis-Noël Pouchet. PolyBench 4.0. 2015. http://www.cs.colostate.edu/AlphaZsvn/Development/trunk/mde/edu.csu.melange.alphaz.polybench/polybench-alpha-4.0/polybench.pdf.
[30] Přemysl Houdek, Michal Sojka, and Zdeněk Hanzálek. Towards predictable execution model on ARM-based heterogeneous platforms. In: 2017 26th IEEE International Symposium on Industrial Electronics (ISIE) [online]. IEEE, 2017. pp. 1297-1302.
[31] Joel Matějka, Björn Forsberg, Michal Sojka, Zdeněk Hanzálek, Luca Benini, and Andrea Marongiu. Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution. In: Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores [online]. ACM, 2018. pp. 11-20.
[32] Intel Corporation. Improving Real-Time Performance by Utilizing Cache Allocation Technology. 2015.

Appendix A
Specification

ZADÁNÍ DIPLOMOVÉ PRÁCE

I. OSOBNÍ A STUDIJNÍ ÚDAJE

Příjmení: Doležal   Jméno: Jan   Osobní číslo: 393077
Fakulta/ústav: Fakulta elektrotechnická
Zadávající katedra/ústav: Katedra řídicí techniky
Studijní program: Otevřená informatika
Studijní obor: Počítačové inženýrství

II. ÚDAJE K DIPLOMOVÉ PRÁCI

Název diplomové práce: Podpora PREM na současných multicore COTS systémech

Název diplomové práce anglicky: Support for PREM on contemporary multicore COTS systems

Pokyny pro vypracování:
1. Seznamte se s metodami pro zvýšení předvídatelnosti doby běhu programu, především modelem PREM (PRedictable Execution Model) navrženým pro výkonné real-time aplikace.
2. Seznamte se s platformami (NVIDIA Tegra X2, Intel Xeon E5 v4) a technologiemi (Intel RDT) pro evaluaci experimentální části.
3. Navrhněte experimenty, které ověří využitelnost technologií různých platforem pro model PREM. Využijte stávajících implementací.
4. Proveďte experimenty, výsledky porovnejte se state-of-the-art.
5. Vše důkladně zdokumentujte a zhodnoťte dosažené výsledky.

Seznam doporučené literatury:
[1] PELLIZZONI, Rodolfo et al. A Predictable Execution Model for COTS-Based Embedded Systems. In: 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium. IEEE, 2011, s. 269-279. DOI: 10.1109/RTAS.2011.33.
[2] HOUDEK, Přemysl & SOJKA, Michal & HANZÁLEK, Zdeněk. Towards predictable execution model on ARM-based heterogeneous platforms. In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE). IEEE, 2017, s. 1297-1302. DOI: 10.1109/ISIE.2017.8001432.
[3] MATĚJKA, Joel et al. Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution. In: Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, 2018, s. 11-20. DOI: 10.1145/3178442.3178444.

Jméno a pracoviště vedoucí(ho) diplomové práce: Ing. Joel Matějka, katedra řídicí techniky FEL

Jméno a pracoviště druhé(ho) vedoucí(ho) nebo konzultanta(ky) diplomové práce:

Datum zadání diplomové práce: 09.10.2018   Termín odevzdání diplomové práce: 08.01.2019

Platnost zadání diplomové práce: 20.09.2020

Ing. Joel Matějka    prof. Ing. Michael Šebek, DrSc.    prof. Ing. Pavel Ripka, CSc.
podpis vedoucí(ho) práce    podpis vedoucí(ho) ústavu/katedry    podpis děkana(ky)

III. PŘEVZETÍ ZADÁNÍ

Diplomant bere na vědomí, že je povinen vypracovat diplomovou práci samostatně, bez cizí pomoci, s výjimkou poskytnutých konzultací. Seznam použité literatury, jiných pramenů a jmen konzultantů je třeba uvést v diplomové práci.

Datum převzetí zadání    Podpis studenta


A.1 Specification in English

Master's Thesis Topic: Support for PREM on contemporary multicore COTS systems

Instructions for elaboration:
1. Familiarize yourself with the methods for increasing the predictability of program execution time, especially with the PREM (PRedictable Execution Model) designed for high-performance real-time applications.
2. Familiarize yourself with the platforms (NVIDIA Tegra X2, Intel Xeon E5 v4) and technologies (Intel RDT) used for the evaluation in the experimental part.
3. Design experiments which verify the usability of the different platforms' technologies for the PREM model. Use existing implementations.
4. Carry out the experiments and compare the results with the state of the art.
5. Document everything thoroughly and evaluate the achieved results.

Appendix B
Abbreviations

ACET . Average Case Execution Time
ADT . Abstract Data Type
ARC . Adaptive Replacement Cache
ASLR . Address Space Layout Randomization
AST . Abstract Syntax Tree
BCET . Best Case Execution Time
BIOS . Basic Input Output System
CAM . Content Addressable Memory
CAT . Cache Allocation Technology
CCPLEX . CPU Complex
CDP . Code and Data Prioritization
CFG . Control Flow Graph; Context Free Grammar
CMT . Cache Monitoring Technology
COS . Class of Service
COTS . Commercial Off-The-Shelf
CPU . Central Processing Unit
CUDA . Compute Unified Device Architecture
DAE . Decoupled Access Execute
DDA . Digital Differential Analyzer
DDR . Double Data Rate
DMA . Direct Memory Access
DRAM . Dynamic Random Access Memory
EL . Exception Level
EMC . External Memory Controller
fd . file descriptor
FFT . Fast Fourier Transform
GCC . GNU Compiler Collection
GNU . GNU's Not Unix!
GPS . Global Positioning System
GPU . Graphics Processing Unit
HERCULES . High-Performance Real-time Architectures for Low-Power Embedded Systems
HT . Hyper Threading
HV . Hypervisor
HW . Hardware
I/O . Input/Output
IR . Intermediate Representation
IRQ . Interrupt Request
ISR . Interrupt Service Routine
JH . Jailhouse
JIT . Just-In-Time compiler


KCF . Kernelized Correlation Filter
LFU . Least Frequently Used
LLC . Last Level Cache
LPDDR . Low Power DDR
LRU . Least Recently Used
LTO . Link Time Optimization
LWN . Linux Weekly News
L4T . Linux for Tegra
MBA . Memory Bandwidth Allocation
MBM . Memory Bandwidth Monitoring
MC . Memory Controller
MCCIF . Memory Controller Client InterFace
MMU . Memory Management Unit
MSR . Model Specific Register
MSS . Memory Subsystem
NRU . Not Recently Used
OS . Operating System
PCI . Peripheral Component Interconnect
PM . Physical Memory
PMU . Performance Monitor Unit
PREM . PRedictable Execution Model
PTC . Page Table Cache
PTSA . Priority Tier Snap Arbiter
QoS . Quality of Service
RAM . Random Access Memory
RDT . Resource Director Technology
saxpy . Single-Precision A.X Plus Y
SMMU . System Memory Management Unit
SoC . System on Chip
SoM . System on Module
SPM . Scratchpad Memory
SRAM . Static Random Access Memory
SSA . Static Single Assignment
SW . Software
TLB . Translation Lookaside Buffer
UART . Universal Asynchronous Receiver/Transmitter
VGA . Video Graphics Array
VM . Virtual Machine; Virtual Memory
VMM . Virtual Machine Monitor
VOT . Visual Object Tracking
VT-d . Intel Virtualization Technology for Directed I/O
WCET . Worst Case Execution Time
