UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
INSTITUTO DE INFORMÁTICA
PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO

EDUARDO HENRIQUE MOLINA DA CRUZ

Online Thread and Data Mapping Using the Memory Management Unit

Thesis presented in partial fulfillment of the requirements for the degree of Doctor of Computer Science

Advisor: Prof. Dr. Philippe O. A. Navaux

Porto Alegre
March 2016

CIP – CATALOGING-IN-PUBLICATION

Cruz, Eduardo Henrique Molina da

Online Thread and Data Mapping Using the Memory Management Unit / Eduardo Henrique Molina da Cruz. – Porto Alegre: PPGC da UFRGS, 2016.

154 f.: il.

Thesis (Ph.D.) – Universidade Federal do Rio Grande do Sul. Programa de Pós-Graduação em Computação, Porto Alegre, BR–RS, 2016. Advisor: Philippe O. A. Navaux.

1. Data movement. 2. Thread and data mapping. 3. Cache memory. 4. NUMA. I. Navaux, Philippe O. A. II. Título.

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
Reitor: Prof. Carlos Alexandre Netto
Vice-Reitor: Prof. Rui Vicente Oppermann
Pró-Reitor de Pós-Graduação: Prof. Vladimir Pinheiro do Nascimento
Diretor do Instituto de Informática: Prof. Luis da Cunha Lamb
Coordenador do PPGC: Prof. Luigi Carro
Bibliotecária-chefe do Instituto de Informática: Beatriz Regina Bastos Haro

“Logic will get you from A to B. Imagination will take you everywhere.”
—ALBERT EINSTEIN

AGRADECIMENTOS

Primeiramente, agradeço a minha mãe e meu pai por todo suporte. Agradeço também a meu primo Gustavo e minha tia Lourdes por todas as visitas e momentos descontraídos. Ao meu tio Luiz, obrigado por toda ajuda disponibilizada. Também não posso esquecer dos amigos Guilherme e Ramon, que também considero como da minha família. Em relação aos amigos aqui do GPPD, um obrigado inicial ao professor Navaux, pela orientação e compreensão. Em segundo lugar, nada mais nada menos que o grande Matthias, grande parceiro de pesquisa. Não posso esquecer do Laércio, que também muito contribuiu nesta pesquisa. Um abraço também para os outros membros do GPPD, em especial Marco, Francis, Emmanuell e Roloff. Aos familiares e amigos aqui não mencionados, obrigado por tudo.

CONTENTS

LIST OF ABBREVIATIONS AND ACRONYMS ...... 11

LIST OF FIGURES ...... 13

LIST OF TABLES ...... 17

ABSTRACT ...... 19

RESUMO ...... 21

1 INTRODUCTION ...... 23 1.1 Improving Memory Locality With Sharing-Aware Mapping ...... 24 1.2 Monitoring Memory Accesses for Sharing-Aware Mapping ...... 27 1.3 Using the Memory Management Unit to Gather Information on Memory Accesses ...... 29 1.4 Contributions of this Thesis ...... 31 1.5 Organization of the Text ...... 32

2 SHARING-AWARE MAPPING: OVERVIEW AND RELATED WORK . . . . 35 2.1 Improving Memory Locality With Sharing-Aware Mapping ...... 35 2.1.1 Memory Locality in Different Architectures...... 35 2.1.2 Parallel Applications and Sharing-Aware Thread Mapping...... 42 2.1.3 Parallel Applications and Sharing-Aware Data Mapping...... 48 2.1.4 Evaluation with a Microbenchmark...... 55 2.2 Related Work on Sharing-Aware Static Mapping ...... 57 2.2.1 Static Thread Mapping...... 57 2.2.2 Static Data Mapping...... 58 2.2.3 Combined Static Thread and Data Mapping...... 59 2.3 Related Work on Sharing-Aware Online Mapping ...... 59 2.3.1 Online Thread Mapping...... 59 2.3.2 Online Data Mapping...... 60 2.3.3 Combined Online Thread and Data Mapping...... 62 2.4 Discussion on Sharing-Aware Mapping and Related Work ...... 62

3 LAPT – LOCALITY-AWARE PAGE TABLE ...... 65 3.1 Structures Introduced by LAPT ...... 65 3.2 Data Sharing Detection ...... 66 3.3 Thread Mapping Computation ...... 66 3.4 Data Mapping Computation ...... 67 3.5 Example of the Operation of LAPT ...... 68 3.6 Overhead ...... 69 3.6.1 Memory Storage Overhead...... 69 3.6.2 Circuit Area Overhead...... 69 3.6.3 Runtime Overhead...... 70

4 SAMMU – SHARING-AWARE MEMORY MANAGEMENT UNIT ...... 71 4.1 Overview of SAMMU ...... 71 4.2 Structures Introduced by SAMMU ...... 72 4.3 Gathering Information about Memory Accesses ...... 72 4.4 Detecting the Sharing Pattern for Thread Mapping ...... 77 4.5 Detecting the Page Usage Pattern for Data Mapping ...... 78 4.6 Example of the Operation of SAMMU ...... 79 4.7 Implementation Details ...... 80 4.7.1 Hardware Implementation...... 80 4.7.2 Notifying the Operating System About Page Migrations...... 81 4.7.3 Performing the Thread Mapping...... 81 4.7.4 Increasing the Supported Number of Threads...... 81 4.7.5 Handling Context Switches and TLB Shootdowns...... 82 4.8 Overhead ...... 82 4.8.1 Memory Storage Overhead...... 82 4.8.2 Circuit Area Overhead...... 82 4.8.3 Execution Time Overhead...... 83

5 IPM – INTENSE PAGES MAPPING ...... 85 5.1 Detecting Memory Access Patterns via TLB Time ...... 85 5.2 Overview of IPM ...... 86 5.3 Structures Introduced by IPM ...... 87 5.4 Detailed Description of IPM ...... 88 5.4.1 Calculating the TLB Time...... 88 5.4.2 Gathering Information for Thread Mapping...... 88 5.4.3 Gathering Information for Data Mapping...... 89 5.5 Overhead ...... 90 5.5.1 Memory Storage Overhead...... 90 5.5.2 Circuit Area Overhead...... 91 5.5.3 Execution Time Overhead...... 91

6 EVALUATION OF OUR PROPOSALS ...... 93 6.1 Methodology ...... 93 6.1.1 Algorithm to Map Threads to Cores...... 93 6.1.2 Evaluation Inside a Full System Simulator...... 93 6.1.3 Evaluation Inside Real Machines with Software-Managed TLBs...... 95 6.1.4 Trace-Based Evaluation Inside Real Machines with Hardware-Managed TLBs. 96 6.1.5 Presentation of the Results...... 97 6.1.6 Workloads...... 97 6.2 Comparison Between our Proposed Mechanisms ...... 98 6.3 Accuracy of the Proposed Mechanisms ...... 99 6.3.1 Accuracy of the Sharing Patterns for Thread Mapping...... 99 6.3.2 Accuracy of the Data Mapping...... 100 6.4 Performance Experiments ...... 104 6.4.1 Performance Results in Simics/GEMS...... 104 6.4.2 Performance Results in Itanium...... 106 6.4.3 Performance Results in Xeon32 and Xeon64...... 108 6.4.4 Summary of Performance Results...... 109 6.5 Energy Consumption Experiments ...... 110 6.6 Overhead ...... 110 6.7 Design Space Exploration ...... 113 6.7.1 Varying the Cache Memory Size (SAMMU)...... 113 6.7.2 Varying the Memory Page Size (SAMMU)...... 114 6.7.3 Varying the Interchip Interconnection Latency (SAMMU)...... 115 6.7.4 Varying the Multiplication Threshold (LAPT)...... 116

6.7.5 Analyzing the Impact of Aging (CRaging in IPM)...... 117

6.7.6 Analyzing the Impact of the Migration Threshold (CRmig in IPM)...... 118

7 CONCLUSIONS AND FUTURE WORK ...... 119 7.1 Conclusions ...... 119 7.2 Future Work ...... 120 7.3 Publications ...... 121

APPENDIX A THREAD MAPPING ALGORITHM ...... 123 A.1 Related Work ...... 124 A.2 EagerMap – Greedy Hierarchical Mapping ...... 124 A.2.1 Description of the EagerMap Algorithm...... 126 A.2.2 Mapping Example in a Multi-level Hierarchy...... 130 A.2.3 Complexity of the Algorithm...... 132 A.3 Evaluation of EagerMap ...... 133 A.3.1 Methodology of the Mapping Comparison...... 133 A.3.2 Results...... 135 A.4 Conclusions ...... 138

APPENDIX B SUMMARY IN PORTUGUESE ...... 141 B.1 Introdução ...... 141 B.2 Visão Geral Sobre Mapeamento Baseado em Compartilhamento ...... 142 B.3 Mecanismos Propostos para Mapeamento Baseado em Compartilhamento .. 144 B.3.1 LAPT – Locality-Aware Page Table...... 144 B.3.2 SAMMU – Sharing-Aware Memory Management Unit...... 144 B.3.3 IPM – Intense Pages Mapping...... 145 B.4 Avaliação dos Mecanismos Propostos ...... 146 B.5 Conclusão e Trabalhos Futuros ...... 147

REFERENCES ...... 149

LIST OF ABBREVIATIONS AND ACRONYMS

AC Access Counter

AT Access Threshold

DBA Dynamic Binary Analysis

FSB Front Side Bus

IBS Instruction-Based Sampling

ID Identifier

ILP Instruction Level Parallelism

IPC Instructions per Cycle

ISA Instruction Set Architecture

LRU Least Recently Used

MRU Most Recently Used

NUMA Non-Uniform Memory Access

NV NUMA Vector

PHT Page History Table

PMU Performance Monitoring Unit

PU Processing Unit

QPI QuickPath Interconnect

SM Sharing Matrix

SMP Symmetric Multiprocessing

SMT Simultaneous Multithreading

SV Sharers Vector

TAT TLB Access Table

TLB Translation Lookaside Buffer

TLP Thread Level Parallelism

TSC Time Stamp Counter

UMA Uniform Memory Access

LIST OF FIGURES

1.1 Sandy Bridge architecture with two processors. Each processor consists of eight 2-way SMT cores and three cache levels...... 24 1.2 Analysis of parallel applications, adapted from Diener et al.(2015b).... 25 1.3 Results obtained with sharing-aware mapping, adapted from Diener et al. (2014)...... 27 1.4 Results obtained with different metrics for sharing-aware mapping..... 28 1.5 General behavior of the MMU with a hardware-managed TLB...... 30 1.6 General behavior of the MMU with a software-managed TLB...... 30

2.1 System architecture with two processors. Each processor consists of eight 2-way SMT cores and three cache levels...... 36 2.2 The relation between protocols and sharing-aware thread mapping...... 37 2.3 How a good mapping can reduce the amount of invalidation misses in the caches. In this example, core 0 reads the data, then core 1 writes, and finally core 0 reads the data again...... 37 2.4 Impact of sharing-aware mapping on replication misses...... 38 2.5 System architecture with two processors. Each processor consists of eight 2-way SMT cores and three cache levels...... 39 2.6 Harpertown architecture...... 40 2.7 AMD Abu Dhabi architecture...... 40 2.8 Intel Montecito/SGI NUMAlink architecture...... 41 2.9 How the granularity of the sharing pattern affects the spatial problem...... 43 2.10 Temporal false sharing problem...... 44 2.11 Sharing patterns of parallel applications from the OpenMP NAS Parallel Benchmarks and the PARSEC Benchmark Suite. Axes represent thread IDs. Cells show the amount of accesses to shared data for each pair of threads. Darker cells indicate more accesses. The sharing patterns were generated using a memory block size of 4 KBytes, and storing 2 sharers per memory block...... 45 2.12 Sharing patterns of CG varying the memory block size...... 47 2.13 Sharing patterns of Facesim varying the memory block size...... 47 2.14 Sharing patterns of SP varying the memory block size...... 47 2.15 Sharing patterns of LU varying the memory block size...... 47 2.16 Sharing patterns of Fluidanimate varying the memory block size...... 47 2.17 Sharing patterns of Ferret varying the memory block size...... 47 2.18 Sharing patterns of BT varying the number of sharers...... 49 2.19 Sharing patterns of LU varying the number of sharers...... 49 2.20 Sharing patterns of Fluidanimate varying the number of sharers...... 49 2.21 Sharing patterns of Ferret varying the number of sharers...... 49 2.22 Sharing patterns of Swaptions varying the number of sharers...... 49 2.23 Sharing patterns of Blackscholes varying the number of sharers...... 49 2.24 Influence of the page size on data mapping...... 50 2.25 Influence of thread mapping on data mapping. In this example, threads 0 and 1 access page P ...... 51 2.26 Page exclusivity of several applications in relation to the total number of memory accesses...... 52 2.27 Exclusivity of several applications...... 53 2.28 Page exclusivity of several applications in relation to the total number of memory accesses varying the page size...... 54 2.29 Page exclusivity of several applications in relation to the total number of memory accesses varying the thread mapping...... 54 2.30 Producer-consumer microbenchmark results...... 56

3.1 Overview of the MMU and LAPT...... 65 3.2 Detailed description of the LAPT mechanism. In this example, thread 3 generates a TLB miss in page P ...... 68

4.1 Overview of the MMU and SAMMU...... 71

4.2 Value of ATnew relative to ATold. For Equation 4.42, we consider ATold as 10M...... 76

4.3 Behavior of Equations 4.23 and 4.42 varying ATold from 100K to 10M... 76 4.4 Operation of SAMMU. Structures that were added to the normal operation of the hardware and operating system are marked in bold. The system consists of 4 NUMA nodes with 2 cores each. The NUMA Threshold is 4. Page P is initially located on NUMA node 0. In this example, thread 3 (core 3, NUMA node 1) saturates the AC of page P ...... 78

5.1 Accuracy results obtained with different metrics for sharing-aware mapping. 85 5.2 Overview of the MMU and IPM...... 87

6.1 Sharing patterns of BT...... 101 6.2 Sharing patterns of CG...... 101 6.3 Sharing patterns of DC...... 101 6.4 Sharing patterns of EP...... 101 6.5 Sharing patterns of FT...... 101 6.6 Sharing patterns of IS...... 101 6.7 Sharing patterns of LU...... 101 6.8 Sharing patterns of MG...... 101 6.9 Sharing patterns of SP...... 101 6.10 Sharing patterns of UA...... 101 6.11 Sharing patterns of Blackscholes...... 101 6.12 Sharing patterns of Bodytrack...... 101 6.13 Sharing patterns of Canneal...... 102 6.14 Sharing patterns of Dedup...... 102 6.15 Sharing patterns of Facesim...... 102 6.16 Sharing patterns of Ferret...... 102 6.17 Sharing patterns of Fluidanimate...... 102 6.18 Sharing patterns of Freqmine...... 102 6.19 Sharing patterns of Raytrace...... 102 6.20 Sharing patterns of Swaptions...... 102 6.21 Sharing patterns of Streamcluster...... 102 6.22 Sharing patterns of Vips...... 102 6.23 Sharing patterns of x264...... 102 6.24 Accuracy of the data mappings using our proposed mechanisms and the first-touch policy (default policy of ). The gray line over the bars represents the exclusivity level of the applications (Section 2.1.3.2)..... 103 6.25 Results in Simics/GEMS normalized to the default mapping (Equation 6.1). 105 6.26 Execution time in Itanium normalized to the operating system mapping (Equation 6.1). Lower results are better...... 107 6.27 Performance results in Xeon32 normalized to the operating system map- ping (Equation 6.1). Lower results are better...... 109 6.28 Performance results in Xeon64 normalized to the operating system map- ping (Equation 6.1). Lower results are better...... 110 6.29 Energy consumption in Xeon32 normalized to the operating system map- ping (Equation 6.1). Lower results are better...... 111 6.30 Performance overhead on the hardware and software levels. Lower results are better...... 112 6.31 Varying L2 cache memory sizes in Simics/GEMS with SAMMU. Results are normalized to the default mapping with the corresponding cache size. Lower results are better...... 114 6.32 Varying memory page sizes in Simics/GEMS with SAMMU. Results of Figures 6.32a and 6.32b are normalized to the default mapping with the corresponding page size. In Figures 6.32a and 6.32b, lower results are better. 115 6.33 Varying interchip interconnection latency in Simics/GEMS with SAMMU. Lower results are better...... 116 6.34 Varying the multiplication threshold. Figure 6.34a is normalized to the default mapping, where lower results are better...... 117 6.35 Normalized execution time varying the aging in IPM running in Itanium.

The higher CRaging, the less aging. Lower results are better...... 117 6.36 Normalized execution time varying the migration threshold in IPM running

in Itanium. The higher CRmig, the fewer migrations. Lower results are better... 118

A.1 Example communication matrices for parallel applications consisting of 64 tasks. Axes represent task IDs. Cells show the amount of communication between tasks. Darker cells indicate more communication...... 125 A.2 Mapping 16 tasks in an architecture with 8 PUs, L3 caches shared by 2 PUs, 2 processors per machine, and 2 machines...... 128 A.3 Communication matrices used in our example...... 131 A.4 Example of the GenerateGroup algorithm, when generating the first group in our example...... 131 A.5 Hardware architecture used in our experiments...... 134 A.6 Execution time (in ms) of the mapping , for different numbers of tasks to be mapped. Lower results are better...... 135 A.7 Comparison of the mapping quality, normalized to the random mapping. Higher values are better...... 136 A.8 Comparison of the mapping stability. Number of migrations depending on the percentile change of the communication matrix for an application consisting of 64 tasks. Less migrations are better...... 137 A.9 Performance improvement compared to the OS mapping of the NAS-MPI and NAS-OMP benchmarks. Higher results are better...... 138 A.10 Execution time of HPCC using online mapping. Lower results are better.. 138

B.1 Visão geral do comportamento da MMU e do LAPT...... 144 B.2 Visão geral da MMU e do SAMMU...... 145 B.3 Visão geral da MMU e do IPM...... 146

LIST OF TABLES

2.1 Summary of related work...... 63

6.1 Configuration of the experiments...... 94

ABSTRACT

As thread-level parallelism increases in modern architectures due to larger numbers of cores per chip and chips per system, the complexity of their memory hierarchies also increases. Such memory hierarchies include several private or shared cache levels, and Non-Uniform Memory Access nodes with different access times. One important challenge for these architectures is the data movement between cores, caches, and main memory banks, which occurs when a core performs a memory transaction. In this context, the reduction of data movement is an important goal for future architectures to keep performance scaling and to decrease energy consumption. One of the solutions to reduce data movement is to improve memory access locality through sharing-aware thread and data mapping. State-of-the-art mapping mechanisms try to increase locality by keeping threads that share a high volume of data close together in the memory hierarchy (sharing-aware thread mapping), and by mapping data close to where its accessing threads reside (sharing-aware data mapping). Many approaches focus on either thread mapping or data mapping, but perform them only separately, losing opportunities to improve performance. Some mechanisms rely on execution traces to perform a static mapping, which has a high overhead and cannot be used if the behavior of the application changes between executions. Other approaches use sampling or indirect information about the memory access pattern, resulting in imprecise memory access information.

In this thesis, we propose novel solutions to identify an optimized sharing-aware mapping that make use of the memory management unit of processors to monitor the memory accesses. Our solutions work online in parallel to the execution of the application and detect the memory access pattern for both thread and data mappings. With this information, the operating system can perform sharing-aware thread and data mapping during the execution of the application, without any prior knowledge of its behavior. Since they work directly in the memory management unit, our solutions are able to track most memory accesses performed by the parallel application, with a very low overhead. They can be implemented in architectures with hardware-managed TLBs with little additional hardware, and some can be implemented in architectures with software-managed TLBs without any hardware changes. Our solutions have a higher accuracy than previous mechanisms because they have access to more accurate information about the memory access behavior.

To demonstrate the benefits of our proposed solutions, we evaluate them with a wide variety of applications using a full system simulator, a real machine with software-managed TLBs, and a trace-driven evaluation in two real machines with hardware-managed TLBs. In the experimental evaluation, our proposals were able to reduce execution time by up to 39%. The improvements happened due to a substantial reduction in cache misses and interchip interconnection traffic.

Keywords: Data movement. Thread and data mapping. Cache memory. NUMA.

Mapeamento Dinâmico de Threads e Dados Usando a Unidade de Gerência de Memória

RESUMO

Conforme o paralelismo a nível de threads aumenta nas arquiteturas modernas devido ao aumento do número de núcleos por processador e processadores por sistema, a complexidade da hierarquia de memória também aumenta. Tais hierarquias incluem diversos níveis de caches privadas ou compartilhadas e tempo de acesso não uniforme à memória. Um desafio importante em tais arquiteturas é a movimentação de dados entre os núcleos, caches e bancos de memória primária, que ocorre quando um núcleo realiza uma transação de memória. Neste contexto, a redução da movimentação de dados é um dos pilares para futuras arquiteturas para manter o aumento de desempenho e diminuir o consumo de energia. Uma das soluções adotadas para reduzir a movimentação de dados é aumentar a localidade dos acessos à memória através do mapeamento de threads e dados.

Mecanismos de mapeamento do estado-da-arte aumentam a localidade de memória mapeando threads que compartilham um grande volume de dados em núcleos próximos na hierarquia de memória (mapeamento de threads), e mapeando os dados em bancos de memória próximos das threads que os acessam (mapeamento de dados). Muitas propostas focam em mapeamento de threads ou dados separadamente, perdendo oportunidades de ganhar desempenho. Outras propostas dependem de traços de execução para realizar um mapeamento estático, que podem impor uma sobrecarga alta e não podem ser usados em aplicações cujos comportamentos de acesso à memória mudam em diferentes execuções. Há ainda propostas que usam amostragem ou informações indiretas sobre o padrão de acesso à memória, resultando em informação imprecisa sobre o acesso à memória.

Nesta tese de doutorado, são propostas soluções inovadoras para identificar um mapeamento que otimize o acesso à memória fazendo uso da unidade de gerência de memória para monitorar os acessos à memória. As soluções funcionam dinamicamente em paralelo com a execução da aplicação, detectando informações para o mapeamento de threads e dados. Com tais informações, o sistema operacional pode realizar o mapeamento durante a execução das aplicações, não necessitando de conhecimento prévio sobre o comportamento da aplicação. Como as soluções funcionam diretamente na unidade de gerência de memória, elas podem monitorar a maioria dos acessos à memória com uma baixa sobrecarga. Em arquiteturas com TLB gerida por hardware, as soluções podem ser implementadas com pouco hardware adicional. Em arquiteturas com TLB gerida por software, algumas das soluções podem ser implementadas sem hardware adicional. As soluções aqui propostas possuem maior precisão que outros mecanismos porque possuem acesso a mais informações sobre o acesso à memória.

Para demonstrar os benefícios das soluções propostas, elas são avaliadas com uma variedade de aplicações usando um simulador de sistema completo, uma máquina real com TLB gerida por software, e duas máquinas reais com TLB gerida por hardware. Na avaliação experimental, as soluções reduziram o tempo de execução em até 39%. O ganho de desempenho se deu por uma redução substancial da quantidade de faltas na cache, e redução do tráfego entre processadores.

Palavras-chave: Movimentação de dados, Mapeamento de threads e dados, Memória cache, NUMA.

1 INTRODUCTION

Since the beginning of the information era, the demand for computing power has been unstoppable. Whenever the technology advances enough to fulfill the needs of a time, new and more complex problems arise, such that the technology is again insufficient to solve them. In the past, the increase of performance happened mainly due to instruction level parallelism (ILP), with the introduction of several pipeline stages, out-of-order and speculative execution. The increase of the clock frequency was also an important way to improve performance. However, the available ILP exploited by compilers and architectures is reaching its limits (CABEZAS; STANLEY-MARBELL, 2011). The increase of clock frequency is also reaching its limits because it raises the energy consumption, which is an important issue for current and future architectures (TOLENTINO; CAMERON, 2012).

To keep performance increasing, processor architectures are becoming more dependent on thread level parallelism (TLP), employing several cores to compute in parallel. These parallel architectures put more pressure on the memory subsystem, since more bandwidth to move data between the cores and the main memory is required. To handle the additional bandwidth, current architectures introduce complex memory hierarchies, formed by multiple cache levels, some composed by multiple banks connected to a memory controller. The memory controller interfaces a Uniform or Non-Uniform Memory Access (UMA or NUMA) system. However, with the upcoming increase of the number of cores, a demand for an even higher memory bandwidth is expected (COTEUS et al., 2011).

In this context, the reduction of data movement is an important goal for future architectures to keep performance scaling and to decrease energy consumption (BORKAR; CHIEN, 2011). Most of the data movement in current architectures occurs due to memory accesses and communication between the threads. The communication itself in shared memory environments is performed through accesses to blocks of memory shared between the threads. One of the solutions to reduce data movement consists of improving the memory locality (TORRELLAS, 2009). In this work, we improve memory locality by performing a global scheduling (CASAVANT; KUHL, 1988) of threads and data of parallel applications considering their memory access behavior. We perform the global scheduling by setting a mapping function in the scheduler. In the thread mapping, threads that share data are mapped to cores that are close to each other. In the data mapping, the data is mapped close to the cores that are accessing it. This type of thread and data mapping is called sharing-aware mapping.
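To make the two mapping operations concrete, the sketch below shows how a given thread and data mapping could be applied on Linux from user space: sched_setaffinity pins the calling thread to a core (thread mapping), and move_pages migrates a page to a NUMA node (data mapping). The core and node numbers are placeholders, and this is only an illustration of the underlying primitives; the mechanisms proposed in this thesis make such decisions inside the operating system rather than in the application.

```c
/* Minimal sketch: applying a thread and a data mapping on Linux.
 * Core and NUMA node numbers are illustrative only. Link with -lnuma. */
#define _GNU_SOURCE
#include <sched.h>      /* sched_setaffinity, CPU_SET */
#include <numaif.h>     /* move_pages, MPOL_MF_MOVE   */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Thread mapping: pin the calling thread to a given core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
}

/* Data mapping: migrate the page containing addr to a given NUMA node. */
static int migrate_page(void *addr, int node)
{
    void *pages[1] = { addr };
    int nodes[1] = { node };
    int status[1];
    return (int)move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page_size, page_size);
    if (buf == NULL)
        return 1;
    buf[0] = 42; /* touch the page so it is physically allocated */

    if (pin_to_core(0) != 0)
        perror("sched_setaffinity");
    if (migrate_page(buf, 0) != 0)
        perror("move_pages");

    free(buf);
    return 0;
}
```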

In the rest of this chapter, we first describe how sharing-aware mapping improves memory locality and thereby the performance and energy efficiency. Afterwards, we explain the challenges of detecting the necessary information to perform sharing-aware mapping and the benefits of a hardware based approach. Then, we introduce the contributions of this thesis. Finally, we show how we organized the text.

1.1 Improving Memory Locality With Sharing-Aware Mapping

During the execution of a multi-threaded application, the mapping of its threads and their data can have a great impact on the memory hierarchy, both in performance and energy consumption (FELIU et al., 2012). The potential for improvements depends on the architecture of the machines, as well as on the memory access behavior of the parallel application. Considering the architecture, each processor family uses a different organization of the cache hierarchy and the main memory system, such that the mapping impact varies among different systems. In general, the cache hierarchy is formed by multiple levels, where the levels closer to the processor cores tend to be private, followed by caches shared by multiple cores. For NUMA systems, besides the cache hierarchy, the main memory is also clustered between cores or processors. These complex memory hierarchies introduce differences in the memory access latencies and bandwidths and thereby in the memory locality, which vary depending on the core that requested the memory operation, the target memory bank and, if the data is cached, which cache resolved the operation. An example of an architecture that is affected by sharing-aware thread and data mapping is the Intel Sandy Bridge architecture (INTEL, 2012a), illustrated in Figure 1.1. Sandy Bridge is a NUMA architecture, as each processor is connected to a local memory bank. An access to a remote memory bank, a remote access, has an extra performance penalty because of the interchip interconnection latency. Virtual cores within a core can share data using the L1 and L2 cache levels, and all cores within a processor can share data with the L3 cache. Data that is accessed by cores from different caches requires the cache coherence protocol to keep the consistency of the data due to write operations.

Regarding the parallel applications, the main challenge of sharing-aware mapping in shared memory based parallel applications is to detect data sharing. This happens because, in these applications, the source code does not express explicitly which memory accesses happen to a block of memory accessed by more than one thread. This data sharing is then considered as an implicit communication between the threads, and occurs whenever a thread accesses a block of memory that is also accessed by another thread.

Figure 1.1: Sandy Bridge architecture with two processors. Each processor consists of eight 2-way SMT cores and three cache levels.

(a) Average amount of memory accesses to a given page performed from a single NUMA node, per application (including FT, BT, CG and Fluidanimate) and on average. The higher the value, the more potential for data mapping. (b) Example sharing patterns of (1) SP and (2) Vips. Axes represent thread IDs. Cells show the amount of accesses to shared pages for each pair of threads. Darker cells indicate more accesses.

Figure 1.2: Analysis of parallel applications, adapted from Diener et al. (2015b).

Programming environments include Pthreads and OpenMP (OPENMP, 2013). On the other hand, there are parallel applications that are based on message passing, where threads send messages to each other using specific routines provided by a message passing library, such as MPI (GABRIEL et al., 2004). In such applications, the communication is explicit and can be monitored by tracking the messages sent. Due to the implicit communication, parallel applications using shared memory present more challenges for mapping and are the focus of this thesis.

In the context of shared memory based parallel applications, the memory access behavior influences sharing-aware mapping because threads can share data among themselves differently for each application. In some applications, each thread has its own data set, and very little memory is shared between threads. On the other hand, threads can share most of their data, imposing a higher overhead on cache coherence protocols to keep consistency among the caches. In applications that have a lot of shared memory, the way the memory is shared is also important to be considered. Threads can share a similar amount of data with all the other threads, or can share more data within a subgroup of threads. In general, sharing-aware data mapping is able to improve performance in all applications except when all data of the application is equally shared between the threads, while sharing-aware thread mapping improves performance in applications whose threads share more data within a subgroup of threads.

In Diener et al. (2015b), it is shown that, on average, 84.9% of the memory accesses to a given page are performed from a single NUMA node, considering a 4 NUMA node machine and 4 KBytes memory pages. Results regarding this potential for data mapping are shown in Figure 1.2a. This result indicates that most applications have a high potential for sharing-aware data mapping, since, on average, 84.9% of all memory accesses could be improved by having a more efficient mapping of pages to NUMA nodes. Nevertheless, the potential differs between applications, as we can observe in Figure 1.2a, where the values range from 76.6% in CG to 97.4% in BT. These different behaviors between applications impose a challenge to online mapping mechanisms, as they need to adapt to the behavior automatically during runtime.

Regarding the mapping of threads to cores, Diener et al. (2015b) also shows that parallel applications can present a wide variety of data sharing patterns. Several patterns that can be exploited by sharing-aware thread mapping were found in the applications. For instance, in Figure 1.2b (1), neighbor threads share data, which is common in applications parallelized using domain decomposition. Other patterns suitable for mapping include patterns in which distant threads share data, or even patterns in which threads share data in clusters due to a pipeline parallelization model. On the other hand, in applications such as Vips, whose sharing pattern is shown in Figure 1.2b (2), each thread shares a similar amount of data with all other threads, such that no thread mapping can optimize the communication. Nevertheless, in these applications, data mapping can still improve performance by optimizing the memory accesses to the data private to each thread. This analysis reveals that most parallel applications can benefit from sharing-aware mapping.
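As an illustration of how a sharing pattern such as the ones in Figure 1.2b can be computed, the sketch below keeps, for each memory block, the last threads that accessed it and credits sharing between the current thread and those previous sharers on every monitored access. The block count and the choice of remembering two sharers per block are assumptions made for this example (a similar scheme is discussed in Chapter 2); it is not the exact procedure used by Diener et al. (2015b).

```c
/* Illustrative sketch: building a thread sharing matrix from a stream of
 * (thread, memory block) accesses. Sizes below are assumptions. */
#include <stdio.h>

#define NTHREADS  64
#define NBLOCKS   1024   /* number of tracked memory blocks */
#define NSHARERS  2      /* sharers remembered per block    */

static int sharers[NBLOCKS][NSHARERS];            /* last threads per block */
static unsigned long sharing[NTHREADS][NTHREADS]; /* the sharing matrix     */

static void init(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        for (int s = 0; s < NSHARERS; s++)
            sharers[b][s] = -1;  /* -1 = no sharer recorded yet */
}

/* Called for every monitored access of thread t to block b. */
static void record_access(int t, int b)
{
    /* Credit sharing between t and the previously recorded sharers. */
    for (int s = 0; s < NSHARERS; s++) {
        int other = sharers[b][s];
        if (other >= 0 && other != t) {
            sharing[t][other]++;
            sharing[other][t]++;
        }
    }
    /* Move t to the front of the block's sharer list (most recent first). */
    if (sharers[b][0] != t) {
        for (int s = NSHARERS - 1; s > 0; s--)
            sharers[b][s] = sharers[b][s - 1];
        sharers[b][0] = t;
    }
}

int main(void)
{
    init();
    /* Toy trace: threads 0 and 1 repeatedly touch block 7. */
    for (int i = 0; i < 100; i++) {
        record_access(0, 7);
        record_access(1, 7);
    }
    printf("sharing[0][1] = %lu\n", sharing[0][1]);
    return 0;
}
```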
Sharing-aware thread and data mapping improves performance and energy efficiency of parallel applications by optimizing memory accesses (DIENER et al., 2014). Improvements happen for three main reasons. First, cache misses are reduced by decreasing the number of invalidations that happen when write operations are performed on shared data (MARTIN; HILL; SORIN, 2012). For read operations, the effective cache size is increased by reducing the replication of cache lines on multiple caches (CHISHTI; POWELL; VIJAYKUMAR, 2005). Second, the locality of memory accesses is increased by mapping data to the NUMA node where it is most accessed. Third, the usage of interconnections in the system is improved by reducing the traffic on slow and power-hungry interchip interconnections, using more efficient intrachip interconnections instead. Although there are several benefits of using sharing-aware thread and data mapping, the wide variety of architectures, memory hierarchies and memory access behavior of parallel applications restricts its usage in current systems.

The performance and energy consumption improvements that can be obtained by sharing-aware mapping are evaluated in Diener et al. (2014). The results using an Oracle mapping, which generates a mapping considering all memory accesses performed by the application, are shown in Figure 1.3a. Experiments were performed in a machine with 4 NUMA nodes capable of running 64 threads simultaneously. The average execution time reduction reported was 15.3%. The reduction of execution time was possible due to a reduction of 12.8% of cache misses, and 19.8% of interchip interconnection traffic. Results were highly dependent on the characteristics of the parallel application. In applications that exhibit a high potential for sharing-aware mapping, execution time was reduced by up to 39.5%. Energy consumption was also reduced, by 11.1% on average. Another interesting result is that thread and data mapping should be performed together to achieve higher improvements (DIENER et al., 2014).

(a) Average reduction of execution time, L3 cache misses, interchip interconnection traffic and energy consumption when using a combined thread and data mapping. (b) Average execution time reduction provided by a combined thread and data mapping, and by using thread and data mapping separately.

Figure 1.3: Results obtained with sharing-aware mapping, adapted from Diener et al. (2014).

Figure 1.3b shows the average execution time reduction of using thread and data mapping together, and using them separately. The execution time reduction of using a combined thread and data mapping was 15.3%. On the other hand, the reduction of using thread mapping alone was 6.4%, and using data mapping alone 2.0%. It is important to note that the execution time reduction of a combined mapping is higher than the sum of the reductions of using the mappings separately. This happens because each mapping has a positive effect on the other mapping, which is discussed in more detail later in this thesis.

1.2 Monitoring Memory Accesses for Sharing-Aware Mapping

The main information required to perform sharing-aware mapping is the memory addresses accessed by each thread. This can be done statically or online. When memory addresses are monitored statically, the application is usually executed in controlled environments such as simulators, where all memory addresses can be tracked (BARROW-WILLIAMS; FENSCH; MOORE, 2009). These techniques are not able to handle applications whose sharing pattern changes between executions. Also, the amount of different memory hierarchies present in current and future architectures limits the applicability of static mapping techniques. In online mapping, the sharing information must be detected while running the application. As online mapping mechanisms require memory access information to infer memory access patterns and make decisions, different information collection approaches have been employed with varying degrees of accuracy and overhead. Although capturing all memory accesses from an application would provide the best information for mapping algorithms, the overhead would surpass the benefits from better task and data mappings. For this reason, this is only done for static mapping mechanisms.

Amount of samples (% of all memory accesses): 10^-5%, 10^-4%, 10^-3%, 10^-2%, 10^-1%, 1%, 10%, and 100%. Benchmarks: BT, FT, SP, and Facesim. (a) Accuracy of the final mapping (higher is better). (b) Execution time normalized to 10^-5% of samples (lower is better).

Figure 1.4: Results obtained with different metrics for sharing-aware mapping.

In order to achieve a smaller overhead, most traditional methods for collecting memory access information are based on sampling. Memory access patterns can be estimated by tracking page faults (DIENER et al., 2014; DIENER et al., 2015a; LAROWE; HOLLIDAY; ELLIS, 1992; CORBET, 2012a; CORBET, 2012b), cache misses (AZIMI et al., 2009) or TLB misses (MARATHE; THAKKAR; MUELLER, 2010; VERGHESE et al., 1996), or by using hardware performance counters (DASHTI et al., 2013), among others. Still, these sampling based mechanisms present an accuracy lower than intended, as we show in the next paragraphs. This is due to the small number of memory accesses captured (and their representativeness) in relation to all of the memory accesses. For instance, a thread may wrongly appear to access a page less than another thread because its memory accesses were undersampled. In another scenario, a thread may have few accesses to a page, while having cache or TLB misses in most of these accesses, leading to bad mappings in mechanisms based on cache or TLB misses.

Following the hypothesis that a more accurate estimation of the memory access pattern of an application can result in a better mapping and performance improvements, we performed experiments varying the amount of memory samples used to calculate the mapping. Experiments were run using the NAS parallel benchmarks (JIN; FRUMKIN; YAN, 1999) and the PARSEC benchmark suite (BIENIA et al., 2008) on a two NUMA node machine (more information in Section 6.1.3). Memory accesses were sampled using Pin (BACH et al., 2010). We vary the number of samples from 10^-5% to using all memory accesses. A realistic sampling based mechanism (DIENER et al., 2015a) uses at most 10^-3% of samples to avoid harming performance. Another sampling based mechanism (DASHTI et al., 2013) specifies that it samples once every 130,000 cycles.

The accuracy and execution time obtained with the different methods for benchmarks BT, FT, SP, and Facesim are presented in Figure 1.4. To calculate the accuracy, we compare if the NUMA node selected for each page of the applications is equal to the NUMA node that performed most accesses to the page (the higher the percentage, the better). The execution time is calculated as the reduction compared to using 10^-5% of samples (the bigger the reduction, the better).

The accuracy results, in Figure 1.4a, show that accuracy increases with the number of memory access samples used, as expected. It shows that some applications require more samples than others. For instance, BT and FT require at least 10% of samples to achieve a good accuracy, while SP and Facesim achieve a good accuracy using 10^-2% of samples.

The execution time results, in Figure 1.4b, indicate a clear trend: a higher accuracy in estimating memory access patterns translates to a shorter execution time. However, most sampling mechanisms are limited to a small number of samples due to the high overhead of the techniques used for capturing memory accesses (AZIMI et al., 2009; DIENER et al., 2014), harming accuracy and thereby performance. Given these considerations, the goal of this thesis is to propose mechanisms for online sharing-aware mapping that have access to a high amount of memory addresses to maximize the accuracy of the detected pattern, while keeping the overhead low. The mechanisms should detect the information without the need for modifying any source code, and provide it to the operating system to perform the mapping, optimizing performance and energy consumption. To achieve these goals, we include hardware support in the architectures.
In this hybrid approach, the hardware must provide information about the memory accesses, and the software uses the information to perform sharing-aware mapping. The hardware support must impose a low overhead in terms of hardware cost and performance. The memory management unit (MMU) present in most architectures can be used to analyze the memory access pattern of parallel applications, which is the subject of the next section.
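The accuracy metric used in the experiment above can be summarized by the short sketch below: for every page, the NUMA node chosen by a mechanism is compared against the node that performed most accesses to that page. The access-count table and the selected mapping are hypothetical inputs; in practice they come from the memory trace and from the mapping mechanism under evaluation.

```c
/* Sketch of the accuracy metric: a page counts as correctly mapped when the
 * chosen NUMA node matches the node with most accesses to that page. */
#include <stdio.h>

#define NPAGES 4
#define NNODES 4

/* accesses[p][n] = accesses from node n to page p (hypothetical input) */
static const unsigned long accesses[NPAGES][NNODES] = {
    { 900,  50,  30,  20 },
    {  10, 700,  10,  10 },
    { 100, 100, 600, 100 },
    {  40,  40,  40, 500 },
};

static int best_node(int page)
{
    int best = 0;
    for (int n = 1; n < NNODES; n++)
        if (accesses[page][n] > accesses[page][best])
            best = n;
    return best;
}

/* selected[p] = node chosen by the mapping mechanism for page p */
static double accuracy(const int *selected)
{
    int correct = 0;
    for (int p = 0; p < NPAGES; p++)
        if (selected[p] == best_node(p))
            correct++;
    return 100.0 * correct / NPAGES;
}

int main(void)
{
    int selected[NPAGES] = { 0, 1, 2, 0 }; /* hypothetical mapping decision */
    printf("accuracy: %.1f%%\n", accuracy(selected));
    return 0;
}
```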

1.3 Using the Memory Management Unit to Gather Information on Memory Accesses

Computer systems that support virtual memory use an MMU to translate virtual to physical addresses. Virtual memory provides security, preventing an application from accessing memory addresses of other applications, and allows an application to access more memory than the amount of physical memory available in the machine. It also provides a transparent management of the entire memory space, removing this burden from the application. To perform the address translation, the operating system stores page tables in the main memory, which contain the physical address and metadata of each memory page. The metadata is composed mostly of protection information, such as whether the page is writable or contains binary code to be executed. In most operating systems, threads of a parallel application share the same page table, since all threads access the same virtual address space. A special cache memory, the Translation Lookaside Buffer (TLB), is used to speed up the address translation. Depending on the architecture, the TLB can be managed by the hardware or by the software (TANENBAUM, 2007).

A high-level overview of the operation of the MMU using a hardware-managed TLB is illustrated in Figure 1.5.

Figure 1.5: General behavior of the MMU with a hardware-managed TLB.

On every memory access, the MMU checks if the corresponding page has a valid entry in the TLB. If it does, the virtual address is translated to a physical address and the memory access is performed. If the entry is not in the TLB, the MMU reads the entry from the page table in the main memory. This procedure is called page table walk. Most architectures have multi-level page tables to reduce the storage space wasted when an application has low memory usage, or due to memory fragmentation. In multi-level page tables, the page table walk requires several memory accesses, increasing the importance of the TLB. After performing the page table walk, the MMU caches the entry in the TLB and proceeds with the address translation and memory access. The main benefit of using an MMU with a hardware-managed TLB is its high performance.

The behavior of a software-managed TLB is a little different, and it is shown in Figure 1.6. When the application requests a memory access, the MMU checks if the corresponding page has its entry cached in the TLB. If it does, the behavior is the same as with the hardware-managed TLB: the hardware translates the virtual address to a physical address and then performs the memory access. However, in case there is no valid entry in the TLB, there is no hardware page table walker available, so the hardware generates an interrupt to the operating system. Upon receiving the interrupt, the operating system, in its TLB miss handler routine, walks through the page table in software, and inserts the entry in the TLB using instructions provided by the instruction set architecture (ISA). Afterwards, the system returns from the interrupt, the memory address is translated and the application continues its execution.
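The translation flow of Figure 1.5 can be summarized by the simplified software model below: the TLB is probed first and, on a miss, a page table walk fetches the translation and refills the TLB. The two-level layout, field widths and TLB size are assumptions chosen for the example, not the parameters of any specific architecture.

```c
/* Simplified software model of address translation: TLB lookup, and a
 * two-level page table walk on a miss. All sizes are example assumptions. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   12
#define LEVEL_BITS   10
#define LEVEL_MASK   ((1u << LEVEL_BITS) - 1)
#define TLB_ENTRIES  16

typedef struct { uint32_t frame; int valid; } pte_t;
typedef struct { uint32_t vpn; uint32_t frame; int valid; } tlb_entry_t;

static pte_t *root[1 << LEVEL_BITS];   /* first-level table          */
static tlb_entry_t tlb[TLB_ENTRIES];   /* tiny direct-mapped TLB     */

static int tlb_lookup(uint32_t vpn, uint32_t *frame)
{
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) { *frame = e->frame; return 1; }
    return 0;
}

static int page_table_walk(uint32_t vpn, uint32_t *frame)
{
    pte_t *second = root[(vpn >> LEVEL_BITS) & LEVEL_MASK];
    if (!second) return 0;
    pte_t *pte = &second[vpn & LEVEL_MASK];
    if (!pte->valid) return 0;
    *frame = pte->frame;
    return 1;
}

/* Translate a virtual address; returns 0 on a page fault. */
static int translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT, frame;
    if (!tlb_lookup(vpn, &frame)) {               /* TLB miss            */
        if (!page_table_walk(vpn, &frame))        /* walk the page table */
            return 0;                             /* page fault          */
        tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES]; /* evict + refill      */
        e->vpn = vpn; e->frame = frame; e->valid = 1;
    }
    *paddr = (frame << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
    return 1;
}

int main(void)
{
    static pte_t second_level[1 << LEVEL_BITS];
    root[0] = second_level;
    second_level[3] = (pte_t){ .frame = 42, .valid = 1 }; /* map page 3 */

    uint32_t paddr;
    if (translate((3u << PAGE_SHIFT) | 0x10, &paddr))
        printf("physical address: 0x%x\n", paddr);
    return 0;
}
```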

Figure 1.6: General behavior of the MMU with a software-managed TLB.

Usually, the TLB miss handler routine is written directly in Assembly code for maximum performance. Depending on the architecture, the operating system has access only to the TLB miss address, such as in Itanium (INTEL, 2010a), while in others it also has access to the contents of the TLB, such as in MIPS (MIPS, 1996). The main benefits of using a software-managed TLB are its simpler hardware, and that it allows the operating system to decide how to organize the page table. Some architectures, such as Itanium, have a hybrid MMU, which can be configured to behave as a hardware-managed or software-managed TLB using control registers.

As we can observe, all memory accesses are handled by the MMU, whether they result in a TLB hit or miss. The MMU already handles memory addresses at a page level granularity, which is ideal to achieve our goal, since we focus on NUMA architectures. Also, each core has its own MMU and, in multi-threaded cores, each virtual core has its own virtual MMU. In this way, it is easy for the MMU to identify which thread of the parallel application is performing the memory accesses. Therefore, the MMU already provides most of the infrastructure required to analyze the memory access pattern, which is also corroborated by some preliminary results (CRUZ; DIENER; NAVAUX, 2012; CRUZ; DIENER; NAVAUX, 2015). Due to these reasons, we chose to implement our proposals to provide support for sharing-aware mapping in the MMU. In the case of a hardware-managed TLB, we add extra hardware to the MMU. In the case of a software-managed TLB, we can implement most of our proposals in the TLB miss handler of the operating system.
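Since the TLB miss handler of a software-managed TLB is ordinary operating system code, access information can be recorded there with no hardware changes, which is what makes this class of architectures attractive for our proposals. The sketch below is only schematic: the structures and helper functions are hypothetical stand-ins for real kernel code (which is normally written in Assembly), and the added bookkeeping line marks where a mechanism such as the ones proposed in this thesis could hook in.

```c
/* Schematic sketch of a software-managed TLB miss handler extended with
 * access monitoring. All names here are hypothetical stand-ins. */
#include <stdint.h>
#include <stdio.h>

#define NPAGES  64
#define NNODES  4

typedef struct {
    uint32_t frame;
    int      valid;
    uint32_t node_accesses[NNODES];  /* extra bookkeeping for data mapping */
} pte_t;

static pte_t page_table[NPAGES];     /* flat table, stand-in for the real one */

static uint32_t walk_page_table(uint32_t vpn) { return page_table[vpn].frame; }

static void insert_tlb_entry(uint32_t vpn, uint32_t frame)
{
    (void)vpn; (void)frame;  /* on real hardware: privileged TLB insert */
}

/* Invoked by the TLB miss interrupt with the faulting page and the NUMA
 * node of the core that caused the miss. */
static void tlb_miss_handler(uint32_t vpn, int node)
{
    uint32_t frame = walk_page_table(vpn);   /* normal handler work     */
    insert_tlb_entry(vpn, frame);

    page_table[vpn].node_accesses[node]++;   /* added: record the miss  */
    /* The operating system can later use node_accesses[] to decide on a
     * page migration, and per-thread structures to compute thread affinity. */
}

int main(void)
{
    page_table[5] = (pte_t){ .frame = 123, .valid = 1 };
    tlb_miss_handler(5, 1);
    tlb_miss_handler(5, 1);
    printf("misses to page 5 from node 1: %u\n", page_table[5].node_accesses[1]);
    return 0;
}
```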

1.4 Contributions of this Thesis

In this work, we propose three mechanisms to gather information on sharing-aware mapping. Our mechanisms focus on online mapping of shared memory based applications. The sharing detection is implemented directly in the memory management unit (MMU) of current processors, making use of their virtual memory implementation. We chose to implement it directly in the MMU because the MMU has access to all memory accesses. The thread and data mappings are then performed in software by the operating system using the information detected by the hardware. The proposed mechanisms to be implemented in the MMU are:

Locality-Aware Page Table (LAPT): LAPT extends the page table with special fields that allow the operating system to perform thread and data mapping in parallel applications. It keeps track of all Translation Lookaside Buffer (TLB) misses, detecting the sharing between the threads and registering a list of the latest threads that accessed each page in the corresponding page table entry (CRUZ et al., 2014b).

Sharing-Aware Memory Management Unit (SAMMU): Similarly to LAPT, SAMMU extends the page table with information for thread and data mapping. However, SAMMU is able to consider all memory accesses to each page by adding an access counter to each TLB entry. Also, SAMMU already detects in hardware the most suitable NUMA node for each page, notifying the operating system whenever a page migration should occur.

Intense Pages Mapping (IPM): IPM, like SAMMU, detects information for thread and data mappings directly in hardware. However, IPM has several advantages over SAMMU. The main advantage is that it has a much lower implementation cost, while keeping a similar accuracy. Also, it can be implemented natively in architectures with software-managed TLBs. A simplified sketch of the kind of per-page metadata these mechanisms maintain is shown after this list.
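The sketch below illustrates, under stated assumptions, the kind of per-page metadata such mechanisms attach to a page table entry: a short list of recent sharers for thread mapping, a per-node access vector for data mapping, and an access counter. The field names and sizes are illustrative only; the exact formats of LAPT, SAMMU and IPM are described in Chapters 3, 4 and 5.

```c
/* Illustrative per-page metadata for sharing-aware mapping.
 * Field names and sizes are assumptions made for this sketch. */
#include <stdint.h>
#include <stdio.h>

#define MAX_SHARERS 4        /* latest threads that accessed the page */
#define MAX_NODES   8        /* NUMA nodes tracked per page           */

struct mapping_info {
    uint16_t sharers[MAX_SHARERS];      /* thread IDs, most recent first   */
    uint32_t node_accesses[MAX_NODES];  /* a NUMA vector for data mapping  */
    uint32_t access_counter;            /* e.g., a saturating counter      */
};

/* A conventional page table entry, extended with the mapping information. */
struct extended_pte {
    uint64_t frame;                     /* physical frame number           */
    uint8_t  present, writable, accessed;
    struct mapping_info info;           /* fields added for mapping        */
};

int main(void)
{
    printf("per-page metadata: %zu bytes (of which %zu for mapping info)\n",
           sizeof(struct extended_pte), sizeof(struct mapping_info));
    return 0;
}
```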

Our proposals have the following features:

• They operate during the execution of the application, allowing the mapping to be performed online, without needing previous information about the application.
• By detecting the locality in hardware, we have access to many more memory access samples than previous mechanisms, generating more accurate information. More specifically, LAPT can track all memory accesses that result in TLB misses, while SAMMU and IPM are able to indirectly consider all memory accesses.
• Any shared memory based parallel application is able to benefit from the optimized mapping without modification to its source code.
• They do not depend on any particular parallelization API, since the mechanisms are implemented in the hardware and operating system.

To the best of our knowledge, our proposals are the first mechanisms that use the MMU to monitor the memory accesses and perform both online thread and data mapping. Also, SAMMU and IPM are the first mechanisms that detect the memory access pattern for thread and data mapping completely on the hardware level, considering all memory accesses to achieve maximum accuracy.

In this work, we also propose, as a secondary contribution, a thread mapping algorithm called EagerMap (CRUZ et al., 2015). EagerMap is an efficient algorithm to generate sharing-aware task mappings. It is designed to work with symmetric hierarchies, focusing on shared memory environments. It coarsens the application graph by grouping threads that share a lot of data. This coarsening follows the topology of the architecture hierarchy. Based on observations of application behavior, we propose an efficient greedy strategy to generate each group. It achieves a high accuracy and is faster than other current approaches. After the coarsening, we map the group graph to the architecture graph.
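A compressed sketch of the greedy grouping idea is shown below: the pair of still-ungrouped tasks with the largest mutual communication is repeatedly placed in the same group, here with a fixed group size of two (e.g., two threads per SMT core). This is a simplification written for illustration, with an invented communication matrix; the complete EagerMap algorithm, which follows the whole architecture hierarchy, is presented in Appendix A.

```c
/* Greedy grouping sketch: repeatedly pair the ungrouped tasks with the
 * largest mutual communication. The matrix values are illustrative. */
#include <stdio.h>

#define NTASKS 8

static const unsigned comm[NTASKS][NTASKS] = {
    { 0, 9, 1, 0, 0, 0, 2, 0 },
    { 9, 0, 0, 1, 0, 0, 0, 0 },
    { 1, 0, 0, 8, 0, 1, 0, 0 },
    { 0, 1, 8, 0, 1, 0, 0, 0 },
    { 0, 0, 0, 1, 0, 7, 0, 1 },
    { 0, 0, 1, 0, 7, 0, 1, 0 },
    { 2, 0, 0, 0, 0, 1, 0, 6 },
    { 0, 0, 0, 0, 1, 0, 6, 0 },
};

int main(void)
{
    int group[NTASKS];
    for (int i = 0; i < NTASKS; i++) group[i] = -1;  /* -1 = ungrouped */

    for (int g = 0; g < NTASKS / 2; g++) {
        int best_i = -1, best_j = -1;
        unsigned best = 0;
        /* Find the ungrouped pair with the highest communication. */
        for (int i = 0; i < NTASKS; i++)
            for (int j = i + 1; j < NTASKS; j++)
                if (group[i] < 0 && group[j] < 0 && comm[i][j] >= best) {
                    best = comm[i][j]; best_i = i; best_j = j;
                }
        group[best_i] = group[best_j] = g;
        printf("group %d: tasks %d and %d (communication %u)\n",
               g, best_i, best_j, best);
    }
    return 0;
}
```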

1.5 Organization of the Text

This thesis is organized as follows. Chapter 2 explains how sharing-aware mapping affects the performance and presents the related work. Chapter 3 describes the LAPT mechanism. Chapter 4 describes the SAMMU mechanism. Chapter 5 describes the IPM mechanism. Chapter 6 shows the methodology we adopted to evaluate our proposals and the results. Chapter 7 draws our conclusions and future work. Appendix A presents the EagerMap algorithm. Appendix B presents a summary of this thesis in the Portuguese language, as required by the PPGC Graduate Program in Computing.

2 SHARING-AWARE MAPPING: OVERVIEW AND RELATED WORK

In this chapter, we present an overview of sharing-aware mapping and then describe related work. In the related work, we begin with a discussion on static mapping mechanisms, in which the mapping of an application is determined before its execution and does not change during the execution. Afterwards, we evaluate online mapping mechanisms, where the detection of the information required to map an application, as well as the mapping itself, must be performed during its execution. Both thread and data mapping mechanisms are covered. Finally, we present a summary of current mapping strategies, contextualizing our proposals in the state-of-the-art.

2.1 Improving Memory Locality With Sharing-Aware Mapping

The memory locality present in parallel applications depends on their memory access behavior. More specifically, it depends on how the data is shared between the threads. A block of data is private if it is only accessed by one thread; otherwise, it is shared. The amount of memory accesses to data private to a thread, or to data that is shared between several threads, influences both thread and data sharing-aware mappings. The potential for improvements using sharing-aware mapping depends on the architecture of the machines, as well as on the memory access behavior of the parallel application. We first explain how sharing-aware mapping influences memory locality in different types of architectures. Afterwards, we show how parallel applications are affected by it. Finally, we present a simple parallel application, a producer-consumer microbenchmark, and perform some experiments with it in a machine, demonstrating the potential of sharing-aware mapping.
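The private/shared classification can be stated in a few lines of code, as in the sketch below: a block is private when all of its recorded accesses come from a single thread, and shared otherwise, and the fraction of accesses that go to shared blocks indicates how much the application can be affected by thread mapping. The per-block access counts are hypothetical inputs used only for illustration.

```c
/* Classify memory blocks as private or shared from per-thread access counts.
 * The counts below are hypothetical inputs. */
#include <stdio.h>

#define NBLOCKS  4
#define NTHREADS 3

/* accesses[b][t] = accesses of thread t to block b (illustrative values) */
static const unsigned accesses[NBLOCKS][NTHREADS] = {
    { 100,   0,   0 },   /* private to thread 0 */
    {  10,  90,   0 },   /* shared              */
    {   0,   0,  50 },   /* private to thread 2 */
    {  20,  20,  20 },   /* shared              */
};

int main(void)
{
    unsigned shared_acc = 0, total_acc = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        int threads = 0;
        unsigned block_total = 0;
        for (int t = 0; t < NTHREADS; t++) {
            if (accesses[b][t] > 0) threads++;
            block_total += accesses[b][t];
        }
        total_acc += block_total;
        if (threads > 1) shared_acc += block_total;
        printf("block %d: %s\n", b, threads > 1 ? "shared" : "private");
    }
    printf("accesses to shared blocks: %.1f%%\n",
           100.0 * shared_acc / total_acc);
    return 0;
}
```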

2.1.1 Memory Locality in Different Architectures

The difference in the memory locality between the cores affects the data sharing performance. As parallel applications need to access shared data, complex memory hierarchies present challenges for mapping threads to cores, and data to NUMA nodes (WANG et al., 2012). Threads that access a large amount of shared data should be mapped to cores that are close to each other in the memory hierarchy, while data should be mapped to the NUMA node executing the threads that access them (RIBEIRO et al., 2009). In this way, the locality of the memory accesses is improved, which leads to an increase of performance and energy efficiency. For optimal performance improvements, data and thread mapping should be performed together (TERBOVEN et al., 2008). In this section, we first analyze the theoretical benefits that an improved memory locality provides to shared memory architectures. Afterwards, we present examples of current architectures, and how memory locality impacts their performance.

Figure 2.1: System architecture with two processors. Each processor consists of eight 2-way SMT cores and three cache levels. Labels A, B and C mark the three data sharing possibilities discussed in the text.

Finally, we briefly describe how locality affects the performance in network clusters and grids.

2.1.1.1 Understanding Memory Locality in Shared Memory Architectures

A sharing-aware thread and data mapping is able to improve the memory locality in shared memory architectures, and thereby the performance and energy efficiency, for the following main reasons (which are closely related to each other):

Lower Latency when Sharing Data: To illustrate how the memory hierarchy influences memory locality, Figure 2.1 shows an example of an architecture where there are three possibilities for sharing between threads. Threads running on the same core (Figure 2.1 A) can share data through the fast L1 or L2 caches and have the highest sharing performance. Threads that run on different cores (Figure 2.1 B) have to share data through the slower L3 cache, but can still benefit from the fast intra-chip interconnection. When threads share data across physical processors in different NUMA nodes (Figure 2.1 C), they need to use the slow inter-chip interconnection. Hence, the sharing performance in case C is the slowest in this architecture.

Reducing the impact of cache coherence protocols: Cache coherence protocols are responsible for keeping data integrity in shared memory parallel architectures. They keep track of the state of all cache lines, identifying which cache lines are exclusive to a cache, shared between caches, and if the cache line has been written. Most coherence protocols employed in current architectures derive from the MOESI protocol. Coherence protocols are optimized by thread mapping because data that would be replicated in several caches in a bad mapping, thereby considered as shared, can be allocated in a cache shared by the threads that access the data, such that the data is considered as private. This is depicted in Figure 2.2, where two threads that share data are running on separate cores. When using a bad thread mapping, as in Figure 2.2a, the threads are running on cores that do not share any cache, and due to that the shared data is replicated in both caches, and the coherence protocol needs to send messages between the caches to keep data integrity.

On the other hand, when the threads use a good mapping, as in Figure 2.2b, the shared data is stored in the same cache, and the coherence protocol does not need to send any messages.

Reduction of cache misses: Since thread mapping influences the shared data stored in the cache memories, it also affects the cache misses. We identified three main types of cache misses that are reduced by an efficient thread mapping (CRUZ; DIENER; NAVAUX, 2012):

Capacity misses: In shared-memory applications, threads that access a lot of memory and share a cache evict cache lines from each other and cause capacity misses (ZHOU; CHEN; ZHENG, 2009). When mapping threads that share data to cores sharing a cache, we reduce the competition for cache lines between cores that share a cache.

Invalidation misses: Cache coherence protocols send invalidation messages every time a write is performed on shared cache lines. In Figure 2.3a, we illustrate what happens

Figure 2.2: The relation between cache coherence protocols and sharing-aware thread mapping. (a) Bad thread mapping: the cache coherence protocol needs to send messages to keep the integrity of data replicated in the caches, degrading the performance and energy efficiency. (b) Good thread mapping: the data shared between the threads is stored in the same shared cache, such that the cache coherence protocol does not need to operate on it.

Figure 2.3: How a good mapping can reduce the amount of invalidation misses in the caches. In this example, core 0 reads the data, then core 1 writes, and finally core 0 reads the data again. (a) Bad thread mapping: the cache coherence protocol needs to invalidate the data in the write operation, and then performs a cache-to-cache transfer for the read operation. (b) Good thread mapping: the cache coherence protocol does not need to send any messages.

Figure 2.4: Impact of sharing-aware mapping on replication misses. (a) Bad thread mapping: the replicated data leaves almost no free space in the cache memories. (b) Good thread mapping: since the data is not replicated in the caches, there is more free space to store other data.

In Figure 2.3a, we illustrate what happens when two threads that access the same cache line, and have a bad thread mapping (they do not share a cache memory), perform the following actions: thread 0 reads the cache line, then thread 1 writes to the cache line, and finally thread 0 reads the cache line again. Since the threads do not share a cache, when thread 1 writes, it misses the cache and thereby needs first to retrieve the cache line to its cache, and then invalidate the copy in the cache of thread 0. When thread 0 reads the data again, it generates another cache miss, requiring a cache-to-cache transfer from the cache of thread 1. On the other hand, when the threads share a common cache, as shown in Figure 2.3b, the write operation of thread 1 and the second read operation of thread 0 do not generate any cache miss. Summarizing, we reduce the number of invalidations of cache lines accessed by several threads, since these lines would be considered as private by cache coherence protocols (MARTIN; HILL; SORIN, 2012).

Replication misses: A cache line is said to be replicated when it is present in more than one cache. As stated in (CHISHTI; POWELL; VIJAYKUMAR, 2005), uncontrolled replication leads to a virtual reduction of the size of the caches, as some of their space would be used to store the same cache lines. By mapping applications that share large amounts of data to cores that share a cache, the space wasted with replicated cache lines is minimized, leading to a reduction of cache misses. In Figure 2.4, we illustrate how thread mapping affects cache line replication, in an example where two pairs of threads share data within each pair. In case of a bad thread mapping, in Figure 2.4a, the threads that share data are mapped to cores that do not share a cache, such that the data needs to be replicated in the caches, leaving very little free cache space to store other data. In case of a good thread mapping, in Figure 2.4b, the threads that share data are mapped to cores that share a common cache. In this way, the data is not replicated and there is much more free space to store other data in the cache.

Reduction of memory accesses to remote NUMA nodes: In NUMA systems, the time to access the main memory depends on the core that requested the memory access and the NUMA node that contains the destination memory bank (RIBEIRO et al., 2009).

Figure 2.5: System architecture with two processors. Each processor consists of eight 2-way SMT cores and three cache levels.

We show in Figure 2.5 an example of a NUMA architecture. If the core and the destination memory bank belong to the same node, we have a local memory access, as in Figure 2.5 A. On the other hand, if the core and the destination memory bank belong to different NUMA nodes, we have a remote memory access, as in Figure 2.5 B. Local memory accesses are faster than remote memory accesses. By mapping the threads and data of the application in such a way that we increase the number of local memory accesses over remote memory accesses, the average latency of the main memory is reduced.

Better usage of interconnections: The objective is to make better use of the available interconnections in the processors. We can map the threads and data of the application in such a way that we reduce inter-chip traffic and use intra-chip interconnections instead, which have a higher bandwidth and lower latency. In order to reach this objective, the cache coherence related traffic, such as cache line invalidations and cache-to-cache transfers, has to be reduced. The reduction of remote memory accesses also improves the usage of interconnections.

2.1.1.2 Example of Shared Memory Architectures Affected by Memory Locality

Most architectures are affected by the locality of memory accesses. In this section, we present examples of such architectures, and how memory locality impacts their performance.

Intel Harpertown The Harpertown architecture (INTEL, 2008) is a Uniform Memory Access (UMA) architecture, where all cores have the same memory access latency to any main memory bank. We illustrate the hierarchy of Harpertown in Figure 2.6. Each processor has 4 cores, where every core has private L1 instruction and data caches, and each L2 cache is shared by 2 cores. Threads running on cores that share an L2 cache (Figure 2.6 A) have the highest data sharing performance, followed by threads that share data through the intrachip interconnection (Figure 2.6 B). Threads running on cores that make use of the Front Side Bus (FSB) (Figure 2.6 C) have the lowest data sharing performance.

To access the main memory, all cores also need to access the FSB, which generates a bottleneck.

Intel Nehalem/Sandy Bridge Intel Nehalem (INTEL, 2010b) and Sandy Bridge (INTEL, 2012a), as well as their current successors, follow the same memory hierarchy. They are NUMA architectures, in which each processor in the system has one integrated memory controller. Therefore, each processor forms a NUMA node. Each core is 2-way SMT, using a technology called Hyper-Threading. The interchip interconnection is called QuickPath Interconnect (QPI). The memory hierarchy is the same as the ones illustrated in Figures 2.1 and 2.5.

AMD Abu Dhabi The AMD Abu Dhabi (AMD, 2012) is also a NUMA architecture. The memory hierarchy is depicted in Figure 2.7, using two AMD Opteron 6386 processors. Each processor forms 2 NUMA nodes, as each one has 2 memory controllers. There are 4 possibilities for data sharing between the cores. Every pair of cores shares an L2 cache (Figure 2.7 A). The L3 cache is shared by 8 cores (Figure 2.7 B). Cores within a processor can also share data using the intrachip interconnection (Figure 2.7 C). Cores from different processors need to use the interchip interconnection, called HyperTransport (Figure 2.7 D).

Figure 2.6: Harpertown architecture.

Figure 2.7: AMD Abu Dhabi architecture.

Figure 2.8: Intel Montecito/SGI NUMAlink architecture.

Regarding the NUMA nodes, cores can access the main memory with 3 different latencies. A core may access its local node (Figure 2.7 E), using its local memory controller. It can access the remote NUMA node using the other memory controller within the same processor (Figure 2.7 F). Finally, a core can access a main memory bank connected to a memory controller located in a different processor (Figure 2.7 G).

Intel Montecito/SGI NUMAlink It is a NUMA architecture, but with characteristics different from the previously mentioned ones. We illustrate the memory hierarchy in Figure 2.8, considering that each processor is a dual-core Itanium 9030 (INTEL, 2010a). There are no shared cache memories, and each NUMA node contains 2 processors. Regarding data sharing, cores can exchange data using the intrachip interconnection (Figure 2.8 A), the intranode interconnection (Figure 2.8 B), or the SGI NUMAlink interconnection (Figure 2.8 C). Regarding the NUMA nodes, cores can access their local memory bank (Figure 2.8 D), or the remote memory banks (Figure 2.8 E).

2.1.1.3 Locality in the Context of Network Clusters and Grids

Until this point, we analyzed how different memory hierarchies influence memory locality, and how improving memory locality benefits performance and energy efficiency. Although it is not the focus of this thesis, it is important to contextualize the concept of locality in network clusters and grids, which use message passing instead of shared memory. In such architectures, each machine represents a node, and the nodes are connected by routers, switches and network links with different latencies, bandwidths and topologies. Each machine is actually a shared memory architecture, as described in the previous sections, and can benefit from increasing memory locality.

Applications running in network clusters or grids are organized in processes instead of threads. Processes, contrary to threads, do not share memory among themselves by default. In order to communicate, the processes send and receive messages to each other. This parallel programming paradigm is called message passing. Discovering the communication pattern of message passing based applications is straightforward compared to discovering the memory access pattern of shared memory based applications. This happens because the messages contain fields that explicitly identify the source and destination, making the communication explicit, such that it can be detected by monitoring the messages. On the other hand, in shared memory based applications, the focus of this thesis, the communication is implicit and happens when different threads access the same data. Implicit communication imposes more challenges than explicit communication.

2.1.2 Parallel Applications and Sharing-Aware Thread Mapping

To analyze memory locality in the context of sharing-aware thread mapping, we must investigate how the threads of parallel applications share data, which we call the sharing pattern (CRUZ et al., 2014a). For thread mapping, the most important information is the amount of memory accesses to shared data, and with which threads the data is shared. The memory addresses of the data are not relevant. In this section, we first expose some considerations about factors that influence thread mapping. Then, we analyze the data sharing patterns of parallel applications. Finally, we vary some parameters to verify how they impact the detected sharing patterns.

2.1.2.1 Considerations About the Sharing Pattern

We identified the following items that influence the detection of the data sharing pattern between the threads:

Dimension of the sharing pattern The sharing pattern can be analyzed by grouping different numbers of threads. To calculate the amount of data sharing between groups of threads of any size, the time and space complexity rises exponentially, which discourages the use of thread mapping for large groups of threads. Therefore, we evaluate the sharing pattern only between pairs of threads, generating a sharing matrix, which has a quadratic complexity.

Granularity of the sharing pattern (memory block size) To track which threads access each data, we need to divide the entire memory address space into blocks. A thread that accesses a memory block is called a sharer of the block. The size of the memory block directly influences the spatial false sharing problem (CRUZ; DIENER; NAVAUX, 2012), depicted in Figure 2.9. Using a large block, as shown in Figure 2.9a, threads accessing the same memory block, but different parts of it, will be considered sharers of the block. Therefore, the usage of large blocks increases the spatial false sharing. Using smaller blocks, as in Figure 2.9b, decreases the spatial false sharing, since it increases the possibility that accesses to different memory positions are performed in different blocks. In UMA architectures, the memory block size usually used is the cache line size, as the most important resource optimized by thread mapping is the cache. In NUMA architectures, the memory block is usually set to the size of the memory page, since the main memory banks are spread over the NUMA nodes using a page level granularity.

Data structure used to store the thread sharers of a memory block We store the threads that access each memory block in a vector we call the sharers vector. There are 2 main ways to organize the sharers vector. In the first way, we reserve one element for each thread of the parallel application, each one containing a time stamp of the last time the corresponding thread accessed the memory block. In this way, at a certain time T, a thread is considered a sharer of a memory block if the difference between T and the time stored in the element of the corresponding thread is lower than a certain threshold. In other words, a thread is a sharer of a block if it has accessed the block recently. The second way to organize the structure is simpler: we store the IDs of the last threads that accessed the block. The most recent thread is stored in the first element of the vector, and the oldest thread in the last element of the vector. In this organization, a thread is considered a sharer of the memory block if its ID is stored in the vector. The first method has the advantage of storing information about all threads of the parallel application, as well as being more flexible to adjust when a thread is considered a sharer. The second method has the advantage of being much simpler to implement. As the goal of this thesis is designing mechanisms that can be implemented in hardware, we focus our analysis of the sharing pattern on the second method (a sketch of this organization is given at the end of this list).

Figure 2.9: How the granularity of the sharing pattern affects the spatial false sharing problem. (a) Low granularity: the data accessed by the threads belong to the same memory block. (b) High granularity: the data accessed by the threads belong to different memory blocks.

Figure 2.10: Temporal false sharing problem. (a) False temporal sharing. (b) True temporal sharing.

History of the sharing pattern Another important aspect of the sharing pattern is how much of the past events we should consider for each memory block. This aspect influences the temporal false sharing problem (CRUZ; DIENER; NAVAUX, 2012), illustrated in Figure 2.10. Temporal false sharing happens when two threads access the same memory block, but at very distant times during the execution, as shown in Figure 2.10a. To actually be considered as data sharing, the time difference between the memory accesses to the same block should not be long, as in Figure 2.10b. Using the first method to organize the sharers of each memory block (explained in the previous item), the temporal sharing can be controlled by adjusting the threshold that defines whether a memory access is recent enough for a thread to be considered a sharer of a block. Using the second implementation, we can control the temporal sharing by setting how many thread IDs can be stored in the sharers vector. The more elements the sharers vector has, the more thread IDs can be tracked, but the temporal false sharing also increases.
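To make the second organization concrete, the following sketch builds a pairwise sharing matrix from a memory access trace, keeping a small, most-recent-first sharers vector per memory block. It is only an illustration of the data structures discussed above, written in Python for brevity: the trace format (a list of (thread ID, address) pairs), the function name and the default parameters are assumptions made for this example, not the structures used by the mechanisms proposed later in this thesis.

    # Sketch of the sharers-vector approach, assuming a trace of
    # (thread_id, address) pairs. block_size controls the granularity and
    # max_sharers controls how much history is kept per block.
    def sharing_matrix(trace, num_threads, block_size=4096, max_sharers=2):
        shift = block_size.bit_length() - 1      # log2(block_size), power of two assumed
        sharers = {}                             # block id -> thread IDs, most recent first
        matrix = [[0] * num_threads for _ in range(num_threads)]
        for tid, address in trace:
            block = address >> shift
            vec = sharers.setdefault(block, [])
            for other in vec:                    # one sharing event with each recent sharer
                if other != tid:
                    matrix[tid][other] += 1
                    matrix[other][tid] += 1
            if tid in vec:                       # move tid to the front of the sharers vector
                vec.remove(tid)
            vec.insert(0, tid)
            del vec[max_sharers:]                # keep only the most recent sharers
        return matrix

Running a procedure of this kind over a full memory trace, such as one collected with a dynamic binary instrumentation tool, produces matrices like the ones shown in Figure 2.11; the block size and the number of sharers play the roles evaluated in the following subsections.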

2.1.2.2 Sharing Pattern of Parallel Applications

The sharing matrices of some applications are illustrated in Figure 2.11, where the axes represent thread IDs and the cells show the amount of accesses to shared data for each pair of threads. Darker cells indicate more accesses. The sharing patterns were generated using a memory block size of 4 KBytes, and storing 2 sharers per memory block. Since the absolute values of the contents of the sharing matrices vary significantly between the applications, we normalized each sharing matrix to its own maximum value and color the cells according to the amount of data sharing. The data used to generate the sharing matrices was collected by instrumenting the applications using Pin (BACH et al., 2010), a dynamic binary instrumentation tool. The instrumentation code monitors all memory accesses, keeping track of which threads access each memory block. We analyze applications from the OpenMP NAS Parallel Benchmarks (JIN; FRUMKIN; YAN, 1999) and the PARSEC Benchmark Suite (BIENIA et al., 2008).

We were able to identify several sharing patterns in the applications. One pattern that is present in several applications is an initialization or reduction pattern, in which one of the threads, in this case thread 0, shares data with all other threads, whether due to the initialization of the data or due to an aggregation of all data to generate a result. This pattern is present in BT, DC, EP, FT, IS, LU, SP, UA, Blackscholes, Bodytrack, Facesim, Fluidanimate, Swaptions and Vips. The reduction pattern cannot be optimized by a sharing-aware thread mapping, since there is no way to distribute the threads over the cores such that we increase memory locality. Another pattern that we identified consists of data sharing between neighbor threads, which is very common in applications that use a domain decomposition parallelization model, such as IS, LU, MG, SP, UA and x264. In this pattern, data sharing can be optimized by mapping neighbor threads to cores close in the memory hierarchy. Some applications present both the initialization/reduction pattern and the neighbor data sharing pattern, such as BT, IS, LU, SP and UA. Another pattern present in some applications is the sharing between distant threads. We can observe this pattern in MG, Fluidanimate and x264. Some applications present both the neighbor and distant sharing patterns, which represents a challenge for mapping algorithms. This happens because, if we map neighbor threads close in the memory hierarchy, we lose the opportunity to optimize data sharing between distant threads, and vice versa.

(a) BT (b) CG (c) DC (d) EP (e) FT (f) IS (g) LU (h) MG (i) SP (j) UA (k) Blackscholes (l) Bodytrack (m) Canneal (n) Dedup (o) Facesim (p) Ferret (q) Fluidanimate (r) Freqmine (s) Raytrace (t) Streamcluster (u) Swaptions (v) Vips (w) x264

Figure 2.11: Sharing patterns of parallel applications from the OpenMP NAS Parallel Benchmarks and the PARSEC Benchmark Suite. Axes represent thread IDs. Cells show the amount of accesses to shared data for each pair of threads. Darker cells indicate more accesses. The sharing patterns were generated using a memory block size of 4 KBytes, and storing 2 sharers per memory block.

Some applications present a pipeline data sharing model, such as Dedup, Ferret and Streamcluster. We can identify this pattern by observing that threads that share data in the pipeline parallelization model form sharing clusters. In this pattern, data sharing can be optimized by mapping the threads that share data in the pipeline, which are the threads belonging to the same cluster, to cores close in the memory hierarchy. The pattern of Swaptions also resembles sharing clusters. Finally, the last patterns that we observe are the all-to-all and irregular patterns. In the all-to-all pattern, all threads share a similar amount of data with all other threads, as in CG, DC, EP, FT, Blackscholes, Bodytrack, Canneal, Facesim and Raytrace. In the irregular pattern, we cannot clearly identify any of the previous characteristics, such as in Freqmine. In the all-to-all pattern, as well as in the initialization or reduction patterns, sharing-aware thread mapping is not able to optimize data sharing, since there is no mapping that is able to improve locality. In all other patterns (neighbor, distant, pipeline and irregular), sharing-aware thread mapping improves memory locality.

2.1.2.3 Varying the Granularity of the Sharing Pattern

To better understand how the memory block size affects the sharing pattern, we varied the block size from 64 Bytes to 2 MBytes (the previous analysis was performed using a 4 KBytes block size). We kept the number of thread sharers at 4 per memory block. Figures 2.12 to 2.17 show the sharing patterns of some applications using different block sizes. In CG (Figure 2.12), the sharing pattern shows an all-to-all pattern until a memory block size of 4 KBytes. With 32 KBytes blocks, the pattern begins to resemble a neighbor pattern, which is very evident using a 2 MBytes block size. Facesim (Figure 2.13) has a behavior similar to the one of CG. However, one interesting point of Facesim is that the number of data sharing neighbors is higher with 256 KBytes blocks than with 2 MBytes blocks, forming sharing clusters similar to the ones of a pipeline parallelization model. Normally, we would expect the number of neighbors to increase with larger blocks, since it increases the probability that there will be more threads accessing each block. SP (Figure 2.14) is an application in which we can observe this expected behavior. Its sharing pattern is very stable until the usage of 32 KBytes blocks. However, with 256 KBytes blocks and, especially, 2 MBytes blocks, the number of neighbor threads sharing data increases significantly. Similarly, in LU (Figure 2.15), the number of neighbor threads sharing data is higher with larger block sizes. However, it is interesting to note that LU, with 64 Bytes and 512 Bytes blocks, also presents data sharing with distant threads. Fluidanimate (Figure 2.16) has a similar behavior: the sharing pattern when using larger block sizes presents more data sharing between distant threads, which becomes very evident when using 32 KBytes blocks. The last application in which we analyze the impact of the block size is Ferret (Figure 2.17). Until the usage of 256 KBytes blocks, the pipeline parallelization model is clearly seen due to the sharing clusters.

Figure 2.12: Sharing patterns of CG varying the memory block size (64 Bytes, 512 Bytes, 4 KBytes, 32 KBytes, 256 KBytes and 2 MBytes).
Figure 2.13: Sharing patterns of Facesim varying the memory block size.
Figure 2.14: Sharing patterns of SP varying the memory block size.
Figure 2.15: Sharing patterns of LU varying the memory block size.
Figure 2.16: Sharing patterns of Fluidanimate varying the memory block size.

Figure 2.17: Sharing patterns of Ferret varying the memory block size.

However, using 2 MBytes blocks, the sharing pattern changes to an all-to-all pattern. This usually happens because the block size increases so much that most threads end up accessing most blocks. It is important to mention that, although there are significant changes in the sharing pattern depending on the memory block size, this does not mean that some of these patterns are correct and others are wrong. It depends on the execution environment and on the resource the developer aims to optimize. For instance, if the goal is to optimize cache usage, the sharing pattern detected using a cache line granularity is ideal. If the goal is to reduce remote memory accesses in a NUMA architecture, the detection using a page size granularity is better.

2.1.2.4 Varying the Number of Sharers of the Sharing Pattern

In the previous experiments, we analyzed the sharing pattern storing 4 sharer threads per memory block. To verify the impact of storing a different amount of sharers, we also performed experiments storing from 1 to 32 sharers per memory block. Figures 2.18 to 2.23 illustrate the sharing patterns of some applications varying the number of sharers. In BT (Figure 2.18), we can observe that, up to 2 sharers, the initialization or reduction pattern is not present; it only shows up from 4 sharers onwards. The reason is that most data sharing occurs between neighbor threads, such that thread 0 gets removed from the sharers vector of a block when using less than 4 sharers. This same behavior happens in LU, Fluidanimate, Swaptions and Blackscholes (Figures 2.19, 2.20, 2.22 and 2.23). In LU (Figure 2.19), up to 4 sharers, we observe the data sharing between neighbors. From 8 sharers onwards, the data sharing between distant threads is also present. In Fluidanimate (Figure 2.20), we observe the opposite behavior: up to 2 sharers, there is a predominance of data sharing between distant threads, which decreases with more stored sharers. In Ferret (Figure 2.21), the sharing clusters are evident even with only 1 sharer, but with at least 4 sharers we can also observe sharing among different clusters. Swaptions (Figure 2.22) shows an all-to-all pattern up to 2 sharers, and with more sharers it presents sharing clusters. The sharing pattern of Blackscholes (Figure 2.23) is an all-to-all pattern for all numbers of sharers evaluated. However, we can observe that, with only one sharer, there is a lot of noise in the sharing pattern. This same characteristic was manifested in the other applications. For this reason, we discourage the usage of only 1 sharer per memory block. Developers must choose the amount of sharers depending on how many past events they want to consider, as well as on how much overhead is tolerated, since the usage of more sharers increases the overhead.

2.1.3 Parallel Applications and Sharing-Aware Data Mapping

For the data mapping, we need to know the amount of memory accesses from each thread (or NUMA node) to each page. In this way, each page can be mapped to the NUMA node that performs most memory accesses to it. The goal is to reduce the amount of remote memory accesses, which are more expensive than local memory accesses.

Figure 2.18: Sharing patterns of BT varying the number of sharers (1, 2, 4, 8, 16 and 32 sharers).
Figure 2.19: Sharing patterns of LU varying the number of sharers.
Figure 2.20: Sharing patterns of Fluidanimate varying the number of sharers.
Figure 2.21: Sharing patterns of Ferret varying the number of sharers.
Figure 2.22: Sharing patterns of Swaptions varying the number of sharers.

Figure 2.23: Sharing patterns of Blackscholes varying the number of sharers.

In this section, we first describe parameters that influence sharing-aware data mapping. Then, we present metrics to measure the impact of data mapping on an application, and use them to analyze parallel applications. Finally, we vary parameters to evaluate their impact on the data mapping.
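As a minimal illustration of the policy just described (mapping each page to the NUMA node that performs the most accesses to it), the sketch below selects a preferred node for each page. The input format, a dictionary from page number to a per-node access count list, is an assumption made for this example.

    # Sketch: pick a preferred NUMA node for each page.
    # page_accesses maps a page number to a list with the number of accesses
    # performed by each NUMA node, e.g. {42: [120, 3, 0, 1]}.
    def preferred_nodes(page_accesses):
        mapping = {}
        for page, per_node in page_accesses.items():
            # Map the page to the node with the largest access count.
            mapping[page] = max(range(len(per_node)), key=lambda n: per_node[n])
        return mapping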

2.1.3.1 Parameters that Influence Sharing-Aware Data Mapping

We identified the following parameters that influence sharing-aware data mapping:

Influence of the memory page size The memory page size is one of the parameters that most influence the potential for performance improvements from sharing-aware data mapping. This happens because, using larger page sizes, the probability of more threads sharing a page is increased. We can observe this behavior in Figure 2.24, where threads 0 and 1 access data that is private to each thread. Using a larger page size, as in Figure 2.24a, although the threads only access their private data, there is no way to map the pages to the NUMA nodes such that no remote memory accesses are performed. This happens because the private data of different threads are located in the same page. On the other hand, using a smaller page size, as shown in Figure 2.24b, the data accessed by each thread are located in different pages. Thereby, the data can be mapped to the NUMA node of the thread that accesses them, requiring no remote memory accesses.

Influence of thread mapping Thread mapping influences sharing-aware data mapping in pages accessed by 2 or more threads. We illustrate this influence in Figure 2.25, where 2 threads access page P. When the threads are mapped to cores from different NUMA nodes, as in Figure 2.25a, one thread is able to access page P with local memory accesses, but the other thread has to perform remote memory accesses.

Figure 2.24: Influence of the page size on data mapping. (a) Larger page size: regardless of the data mapping, there will be remote memory accesses. (b) Smaller page size: the data can be mapped in such a way that there are no remote memory accesses.

Figure 2.25: Influence of thread mapping on data mapping. In this example, threads 0 and 1 access page P. (a) Bad thread mapping: thread 1 requires remote memory accesses to access page P. (b) Good thread mapping: no remote memory access is necessary.

However, when both threads are mapped to cores of the same NUMA node, as in Figure 2.25b, all memory accesses to page P are local, maximizing performance.

2.1.3.2 Analyzing the Data Mapping Potential of Parallel Applications

In sharing-aware data mapping, we need to analyze how the pages are individually shared. That is, for each page, we need to know which threads access it. To analyze the amount of accesses to private and shared data, we introduce a metric we call exclusivity level (DIENER et al., 2014). The exclusivity level of an application is higher when the amount of accesses to private data is higher than the amount of accesses to shared data. The exclusivity level of an application depends on the granularity in which we analyze the memory blocks. We perform our analysis at a page level granularity, using 4 KBytes pages, which is the default page size in the x86-64 architecture. The exclusivity level of a page is calculated using Equation 2.1, where Access(p, t) is a function that returns the amount of memory accesses of a thread t to page p, Access(p) is a function that returns a set with the amount of memory accesses from each thread to page p, NT is the total number of threads, and max is a function that returns the maximum element of its input set. The lowest exclusivity a page can have is when it has the same amount of accesses from all threads. In this case, the page exclusivity would be 1/NT. The highest exclusivity of a page is when a page is only accessed by one thread, that is, when the page is private. The exclusivity level of a private page is 1.

\[ PageExclusivity(p) = 100 \cdot \frac{\max(Access(p))}{\sum_{t=1}^{NT} Access(p, t)} \qquad (2.1) \]

In Figure 2.26, we show the amount of pages according to their exclusivity level in several applications. Applications with the highest amounts of pages with high exclusivity are the ones with the highest potential for performance improvements with sharing-aware data mapping. For instance, FT is the application with the highest amount of private pages (95.9% of private pages). Therefore, by mapping these pages to the NUMA node running the threads that access them, we are able to reduce the amount of remote memory accesses to these pages by up to 100%. Not only the private pages can be improved. For instance, for a page whose exclusivity level is 80%, the number of remote memory accesses can also be reduced by up to 80%. Only the pages with the lowest exclusivity level (1/NT) are not affected by the mapping. However, as we can observe in Figure 2.26, most pages have a high exclusivity level. Although Figure 2.26 shows a detailed view of the pages, it can be difficult to have an overview of the entire application. To calculate the exclusivity level of the entire application, we can use Equation 2.2 or Equation 2.3. Equation 2.2 shows the average exclusivity level of all pages of the application. Equation 2.3 also shows the average exclusivity level of all pages, but weighted by the amount of memory accesses to each page. Pages with more memory accesses have a higher influence on the application exclusivity level. Equation 2.3 should provide a better overview of the application, because the amount of memory accesses to each page can vary significantly.

\[ AppExclusivityPages = \frac{\sum_{p=1}^{NP} PageExclusivity(p)}{NP} \qquad (2.2) \]

\[ AppExclusivityAccesses = \frac{\sum_{p=1}^{NP} \sum_{t=1}^{NT} Access(p, t) \cdot PageExclusivity(p)}{\sum_{p=1}^{NP} \sum_{t=1}^{NT} Access(p, t)} \qquad (2.3) \]
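The equations above translate directly into code. The sketch below is one possible implementation, assuming that accesses[p][t] holds the number of memory accesses of thread t to page p; this input format is an assumption made for the example, and replacing thread IDs by NUMA node IDs yields the node-level variants introduced in Section 2.1.3.4.

    # Sketch: Equation 2.1 for a single page and Equations 2.2 and 2.3 for the
    # whole application. accesses[p][t] is the number of accesses of thread t
    # to page p (an assumed input format for this example).
    def page_exclusivity(row):
        total = sum(row)
        return 100.0 * max(row) / total if total > 0 else 0.0

    def app_exclusivity(accesses):
        excl = [page_exclusivity(row) for row in accesses]
        by_pages = sum(excl) / len(excl)                # Equation 2.2
        weights = [sum(row) for row in accesses]        # accesses per page
        by_accesses = sum(w * e for w, e in zip(weights, excl)) / sum(weights)  # Equation 2.3
        return by_pages, by_accesses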

The exclusivity level of a wide variety of applications is shown in Figure 2.27. We show the exclusivity level using both Equations 2.2 and 2.3. Applications with high exclusivity levels tend to benefit more from sharing-aware data mapping.

Figure 2.26: Page exclusivity of several applications in relation to the total number of memory accesses. Pages are grouped by exclusivity level: <20%, 20-40%, 40-60%, 60-80%, 80-<100%, and 100% (private).

Figure 2.27: Exclusivity of several applications, measured by the amount of pages (Equation 2.2) and by the amount of memory accesses (Equation 2.3).

This happens because, if a page with high exclusivity is mapped to the NUMA node of the threads that most access the page, the amount of remote memory accesses is reduced. On the other hand, if a page has a low exclusivity level, the amount of remote memory accesses would be similar regardless of the NUMA node used to store the page. In our experiments, the average exclusivity level of the applications, considering Equation 2.3, was 77.9%. This means that, on average, approximately 77.9% of the memory accesses can be improved by sharing-aware data mapping.

2.1.3.3 Influence of the Page Size on Sharing-Aware Data Mapping

As we previously explained, the page size influences sharing-aware data mapping. The larger the page size, the higher the probability of a page being accessed by more threads. When this happens, the value of max(Access(p)) in Equation 2.1 is lower in relation to the total amount of accesses to the corresponding page. In this way, the exclusivity level tends to decrease with larger pages. The previous experiments were based on a 4 KBytes page size. In Figure 2.28, we show the exclusivity level of several applications, using Equation 2.3, with page sizes up to 2 MBytes. We can clearly see the expected behavior: the exclusivity level decreases with larger page sizes. We can conclude from this experiment that the usage of smaller page sizes is better for sharing-aware mapping.

2.1.3.4 Influence of Thread Mapping on Sharing-Aware Data Mapping

In the previous experiments, we analyzed the exclusivity level in relation to the amount of memory accesses from threads. In this way, the exclusivity level is independent from the thread mapping. However, if we consider a real world scenario, what influences the performance is the amount of memory accesses from each NUMA node, not from each thread. Therefore, we can change Equations 2.1 and 2.3 into Equations 2.4 and 2.5.

Figure 2.28: Page exclusivity of several applications in relation to the total number of memory accesses varying the page size (4 KBytes, 32 KBytes, 256 KBytes and 2 MBytes).

In these modified equations, Access(p, n) is a function that returns the amount of memory accesses of NUMA node n to page p, Access(p) is a function that returns a set with the amount of memory accesses from each node to page p, Nnodes is the total number of NUMA nodes, and max is a function that returns the maximum element of its input set. Using these equations, the lowest page exclusivity would be 1/Nnodes, and the highest exclusivity of a page is 1.

\[ PageExclusivityNode(p) = 100 \cdot \frac{\max(Access(p))}{\sum_{n=1}^{Nnodes} Access(p, n)} \qquad (2.4) \]

\[ AppExclusivityNodeAccesses = \frac{\sum_{p=1}^{NP} \sum_{n=1}^{Nnodes} Access(p, n) \cdot PageExclusivityNode(p)}{\sum_{p=1}^{NP} \sum_{n=1}^{Nnodes} Access(p, n)} \qquad (2.5) \]
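The node-level access counts used in Equations 2.4 and 2.5 can be obtained by folding the thread-level counts according to a given thread-to-node assignment, as sketched below; the argument names are assumptions for this example, and the result can be fed to the same exclusivity computation shown earlier.

    # Sketch: fold per-thread access counts into per-node counts.
    # thread_node[t] gives the NUMA node that executes thread t under a given
    # thread mapping; accesses[p][t] is defined as in the previous sketch.
    def node_accesses(accesses, thread_node, num_nodes):
        folded = []
        for row in accesses:
            per_node = [0] * num_nodes
            for t, count in enumerate(row):
                per_node[thread_node[t]] += count
            folded.append(per_node)
        return folded

A thread mapping that places the threads sharing a page on the same node concentrates that page's accesses on fewer nodes, which is why a good thread mapping raises the node-level exclusivity, as shown next.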

Figure 2.29: Page exclusivity of several applications in relation to the total number of memory accesses varying the thread mapping (good and bad thread mappings).

In Figure 2.29, we show the exclusivity level of the applications considering the NUMA nodes (4 nodes), with two configurations. In the first configuration, threads that share large amounts of data are mapped to cores close to each other in the memory hierarchy. To generate this thread mapping, we apply a thread mapping algorithm (Appendix A) to the sharing matrices of the applications (Section 2.1.2.2). In the second configuration, we map the threads to cores distant in the memory hierarchy, by applying a modified version of the thread mapping algorithm used in the first configuration. We can observe in Figure 2.29 the expected result: a better thread mapping results in a higher exclusivity level, and thereby in a higher potential for performance improvements from sharing-aware data mapping.

2.1.4 Evaluation with a Microbenchmark

To illustrate how sharing-aware thread mapping affects the memory locality, consider a producer-consumer situation in shared memory programs, in which one thread writes to an area of memory and another thread reads from the same area. The pseudo-code of this producer-consumer microbenchmark is illustrated in Algorithm 2.1. If the cache coherence protocol is based on invalidation, such as MESI or MOESI (STALLINGS, 2006), and the consumer and producer threads do not share a cache, an invalidation message is sent to the cache of the consumer every time the producer writes the data. As a result, after the invalidation, the consumer always receives a cache miss when reading, thereby requiring more traffic on the interconnections, since the cache of the consumer has to retrieve the data from the cache of the producer on every access. However, if the consumer and producer threads share a cache, the number of cache misses and the interconnection traffic are reduced, since both the producer and the consumer access the data in the same cache, eliminating the need for invalidation messages and cache-to-cache transfers.

Algorithm 2.1: Producer-Consumer.
begin Producer
    for s from 1 to NumberOfSteps do
        for i from 1 to VectorSize do
            produce Vector[i];
        end
        Notify consumer;
        Wait for consumer signal;
    end
end
begin Consumer
    for s from 1 to NumberOfSteps do
        Wait for producer signal;
        for i from 1 to VectorSize do
            consume Vector[i];
        end
        Notify producer;
    end
end

To understand how the memory locality is affected by sharing-aware data mapping, consider the same producer-consumer scenario, but where the amount of shared data is much larger, such that all memory accesses result in cache misses and must be resolved by the main memory. Also consider that the producer and consumer threads are running on cores from the same NUMA node. If the shared data was allocated in a NUMA node different from the node where the threads are running, all memory accesses would be remote. On the other hand, if the shared data was allocated in the same NUMA node where the threads are running, all memory accesses would be local.

We performed experiments with this producer-consumer microbenchmark in a machine with the memory hierarchy depicted in Figure 2.1. Details on the architecture are shown in Section 6.1.4. We measured execution time, L2 and L3 cache misses per thousand instructions (MPKI), and interchip interconnection traffic. The results are shown in Figure 2.30. We performed three types of experiments: L2, L3 and NUMA. We executed each test 10 times, and show a 95% confidence interval calculated with the Student's t-distribution. In the L2 and L3 experiments, the size of the vector used by the producer-consumer threads was set to fit in the L2 and L3 caches, respectively. In the NUMA experiment, the vector size is much larger than the L3 cache. For each experiment, we executed with the best and worst possible mappings. Each bar corresponds to the result of the best mapping normalized to the worst mapping. In the best mapping, the threads are mapped to virtual cores sharing the same L2 and L3 caches, and the vector is mapped to the NUMA node of the corresponding cores. In the worst mapping, no resources are shared.
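Mappings like the best and worst cases above are typically enforced by pinning each thread to a specific core. The sketch below illustrates one way to do such pinning on Linux using os.sched_setaffinity; it is not the exact setup used in this experiment, the core IDs are placeholders that depend on the machine topology, and placing the vector on a specific NUMA node would additionally require the first-touch policy or an external tool such as numactl.

    # Sketch: run the producer and the consumer pinned to chosen cores (Linux).
    # Core IDs are placeholders; for a "best" mapping they should be cores
    # sharing a cache, and for a "worst" mapping cores in different processors.
    import os
    import threading

    def pinned_thread(core, work):
        def run():
            os.sched_setaffinity(0, {core})   # 0 refers to the calling thread on Linux
            work()
        return threading.Thread(target=run)

    def producer():
        pass                                  # produce the vector and signal the consumer

    def consumer():
        pass                                  # wait for the signal and consume the vector

    threads = [pinned_thread(0, producer), pinned_thread(1, consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()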

The result of the L2 experiment shows that, when the producer-consumer threads share a common L2 cache, the amount of L2 cache misses is reduced by almost 100%. This is the expected result, since, in the best mapping, the threads always get a cache hit when reading and writing the vector. In the L3 experiment, L3 cache misses were reduced by 86.6%. The reduction was lower than in the L2 experiment because there are several other cores sharing the same L3 cache, interfering with the results. The amount of interchip interconnection traffic was reduced by almost 100% in the NUMA experiment. For this particular application, the shared resource that provided the highest improvements when comparing the best and worst mappings was the L3 cache. Different applications may be more influenced by other resources. These results show that sharing-aware mapping is able to improve the performance of parallel applications.

Figure 2.30: Producer-consumer microbenchmark results. (a) Execution time. (b) L2 cache MPKI. (c) L3 cache MPKI. (d) Interchip traffic.

2.2 Related Work on Sharing-Aware Static Mapping

To statically map an application, the application must be executed inside controlled environments to gather information on the sharing pattern. There are three main types of controlled environments used to monitor the memory accesses of applications: emulation, simulation and dynamic binary analysis (DBA). Emulators, such as Qemu (BELLARD, 2005) and Bochs (SHWARTSMAN; MIHOCKA, 2008), only guarantee that the program output for a given input will be the same as on the real machine. Simulators focus on implementing the virtual machine so that its behavior will be similar to the real machine. Simulators can be capable of running full systems and native operating systems, such as Simics (MAGNUSSON et al., 2002), MARSS (PATEL et al., 2011) and GEM5 (BINKERT et al., 2011), or can be capable of simulating only user level binaries, such as Multi2sim (SWAMY; UBAL, 2014). Dynamic binary analysis tools enable the instrumentation of executable files so that their behavior can be analyzed while running the programs. Some examples of dynamic binary analysis tools are Valgrind (NETHERCOTE; SEWARD, 2007) and Pin (BACH et al., 2010). These controlled environments can be used to monitor the behavior of parallel applications. However, they demand a lot of computational resources, drastically increasing the overhead of mapping approaches. In the rest of this section, we present related work on sharing-aware mapping that uses these controlled environments to gather information to map the threads and data of parallel applications. We describe mechanisms that perform thread and data mapping separately, and then mechanisms that perform both mappings together.

2.2.1 Static Thread Mapping

In Barrow-Williams, Fensch and Moore (2009), a technique to collect the sharing pattern between the threads of parallel applications based on shared memory is evaluated. Their method consists of two main steps: the generation of memory access traces and the analysis of the traces. To perform the first step, they instrumented Simics to register all memory accesses in files. In the second step, the memory traces are analyzed to determine the sharing pattern of the applications. When analyzing the memory trace, they do not consider consecutive accesses to the same address, to ignore memory accesses that occurred due to register number constraints. The authors categorize the memory accesses considering the amount of read and write operations performed. In Bienia, Kumar and Li (2008), the authors use Pin to analyze parallel applications, evaluating the amount of memory shared by the threads and the size of the working set (TANENBAUM, 2007). As the main goal of the work was just to characterize the sharing pattern, the authors did not run any performance tests using the collected data. The works described in Diener et al. (2010) and Cruz, Alves and Navaux (2010) use Simics to discover the sharing pattern of parallel applications. The authors used the collected information to map the threads to cores according to the amount of sharing. These methods can provide a high accuracy, as they have access to information of the entire execution of the application. However, they impose a high overhead, since they require the generation and analysis of memory traces. By depending on memory traces, they are limited to applications whose behavior does not change between or during executions, which would otherwise invalidate the previously generated memory trace. Also, these thread mapping mechanisms are not able to reduce the number of remote memory accesses in NUMA architectures.

2.2.2 Static Data Mapping

The work of Ribeiro et al. (2009) describes a user level framework named Memory Affinity Interface (MAi), which allows data mapping control on Linux based NUMA platforms. MAi simplifies data mapping management issues, since it offers a wide set of memory allocation policies to control data distribution. MAi can only be used in applications written in the C or C++ languages. Parallel applications written in other programming languages are not supported. Minas (RIBEIRO et al., 2011) is a framework for data mapping that uses MAi to control data mapping, and implements a preprocessor that analyzes the source code of an application to determine a suitable data mapping for a target architecture. After the analysis, instrumentation code to perform data mapping is automatically inserted into the application.

A hardware based profiling mechanism is described in Marathe, Thakkar and Mueller (2010). The authors use features from the performance monitoring unit of the Itanium-2 processor that provide samples of memory addresses accessed by the running application. The samples come from long latency load operations and misses in the TLB. At the software level, they capture the memory address samples provided by the hardware and associate them with the thread that performed the memory access, generating a memory trace. Afterwards, the memory trace is analyzed in order to select the best mapping of memory pages to the NUMA nodes of the architecture. In future executions of the same application, its pages are mapped according to the detected mapping. It is not clear in the paper how the authors handle the mapping of dynamically allocated memory.

These mechanisms that perform only data mapping are not capable of reducing the number of cache misses, since they do not consider the memory sharing between the threads. They also fail to reduce the number of remote memory accesses to shared pages. If a page is accessed by only one thread, it should be mapped to the NUMA node of the thread that is accessing it. On the other hand, consider a scenario in which a given page is accessed by more than one thread, thereby being shared. If the threads are being executed on different NUMA nodes, a data mapping only mechanism is not able to reduce the amount of remote memory accesses. However, if a mechanism takes into account the sharing pattern and maps the threads that share pages to the same NUMA node, the amount of remote accesses is reduced.

2.2.3 Combined Static Thread and Data Mapping

In Cruz et al. (2011), the Minas data mapping framework (RIBEIRO et al., 2011) was extended with a sharing-aware thread mapping mechanism. Information on the sharing pattern was gathered with the help of the Simics simulator to generate memory traces. Two metrics were evaluated when generating the sharing patterns: the amount of shared memory and the amount of accesses to shared memory, which resulted in different patterns for most of the evaluated applications. As previously stated, the requirement of memory traces represents a high overhead. Dynamically allocated memory also imposes a challenge to static data mapping, since the memory addresses and the threads that perform memory allocation may change in future executions, which would make the previously detected information useless.

2.3 Related Work on Sharing-Aware Online Mapping

In online mapping mechanisms, the runtime environment, whether it is located in the operating system or in a user level library, must be capable of detecting sharing information and performing the mapping during the execution of the application. The information can be provided by the hardware, for instance from hardware performance counters, or by software instrumentation code. Several libraries, such as Perfmon (http://perfmon2.sourceforge.net/) and Papi (JOHNSON et al., 2012), provide easy access to hardware performance counters that can be used by mapping mechanisms. As in the previous section, we describe mechanisms that perform thread and data mapping separately, and then mechanisms that perform both mappings together.

2.3.1 Online Thread Mapping

The work described in Azimi et al. (2009) schedules threads by making use of hardware counters present in the Power5 processor. The hardware counter used by them stores the memory addresses of memory accesses resolved by remote cache memories. Other memory accesses are not taken into account. Traps to the operating system are required to capture each sample of memory address. To reduce the overhead generated by the traps, they only enable the mapping mechanism for an application when the core stall time due to memory accesses exceeds a certain threshold, which decreases the accuracy due to the lower number of samples. A list of accessed memory addresses is kept for each thread of the parallel application, and their contents are compared to generate the sharing pattern, which is also an expensive procedure. Another mechanism that samples memory addresses is described in Diener, Cruz and Navaux (2013), called SPCD, where page faults are artificially added during the execution, allowing the operating system to capture memory addresses accessed by each thread. Page faults were introduced by clearing the page present bit of several pages in the page table.

Besides the traps to the operating system due to the added page faults, TLB shootdowns are required every time a page present bit is cleared, which can harm performance (VILLAVIEJA et al., 2011). Only a small number of page faults is added to avoid a high overhead. CDSM (DIENER et al., 2015a) is an extension of SPCD that supports applications with multiple processes. These sampling-based mechanisms do not have a high accuracy, since most memory operations are not considered, which results in incomplete sharing patterns.

In Cruz, Diener and Navaux (2012) and Cruz, Diener and Navaux (2015), the sharing pattern is detected by comparing the entries of all TLBs. The mechanism can be implemented in current architectures with software-managed TLBs, but requires the addition of one instruction to access the TLB entries in hardware-managed TLBs. The comparison of TLBs can demand a lot of time if there are too many cores in the system. To reduce the overhead, the comparison of TLB entries can be done less frequently, which reduces the accuracy. Hardware changes are also necessary in Cruz et al. (2014a), in which a sharing matrix is kept by the hardware by monitoring the source and destination of the invalidation messages of the cache coherence protocol. One disadvantage of both mechanisms is that their sharing history is limited to the size of the TLBs and cache memories, respectively. Therefore, their accuracy is reduced when an application uses too much memory.

The usage of the instructions per cycle (IPC) metric to guide thread mapping is evaluated in Autopin (KLUG et al., 2008). Autopin itself does not detect the sharing pattern; it only verifies the IPC of several mappings fed to it and executes the application with the thread mapping that presented the highest IPC. In Radojković et al. (2013), the authors propose BlackBox, a scheduler that, similarly to Autopin, selects the best mapping by measuring the performance that each mapping obtained. When the number of threads is low, all possible thread mappings are evaluated. When the number of threads makes it unfeasible to evaluate all possibilities, the authors execute the application with 1000 random mappings to select the best one. These mechanisms that rely on statistics from hardware counters take too much time to converge to an optimal mapping, since they first need to check the statistics of each mapping. Convergence is usually not possible because the number of possible mappings is exponential in the number of threads. Also, these statistics do not accurately represent sharing patterns. As mentioned in Section 2.2.1, mechanisms that only perform thread mapping are not capable of decreasing the number of remote memory accesses in NUMA architectures.

2.3.2 Online Data Mapping

Traditional data mapping strategies, such as first-touch and next-touch (LÖF; HOLMGREN, 2005), have been used by operating systems to allocate memory on NUMA machines. In the case of first-touch, a memory page is mapped to the node of the thread that causes its first access. This strategy affects the performance of other threads that access the page, as pages are not migrated during execution. When using next-touch, pages are migrated between nodes according to the memory accesses to them. However, if the same page is accessed from different nodes, next-touch can lead to excessive data migrations. Problems also arise in the case of thread migrations between nodes (RIBEIRO et al., 2009). The NUMA Balancing policy (CORBET, 2012b) was included in more recent versions of the Linux kernel. In this policy, the kernel introduces page faults during the execution of the application to perform lazy page migrations, reducing the amount of remote memory accesses. However, NUMA Balancing does not detect the sharing pattern between the threads. A previous proposal with similar goals was AutoNUMA (CORBET, 2012a).

A page migration mechanism that uses queuing delays and row-buffer hit rates from memory controllers is proposed in Awasthi et al. (2010). Two page migration mechanisms were developed. The first, called Adaptive First-Touch, consists of gathering statistics from the memory controller to map the data of the application in future executions. However, this mechanism fails if the behavior changes among the different phases of the application. The second mechanism uses the same information, but allows online page migration during the execution of the application. They select the destination NUMA node considering the difference in access latency between the source and destination NUMA nodes, as well as the row-buffer hit rate and the queuing delays of the memory controller. The major problem of this work is that the information about which data each thread accesses is not considered. The page to be migrated is chosen randomly, which may lead to an increase in the number of remote accesses.

The work described in Marathe and Mueller (2006) makes use of the PMU of the Itanium-2 to generate samples of the memory addresses accessed by each thread. Their profiling mechanism imposes a high overhead, as it requires traps to the operating system on every high-latency memory load operation or TLB miss. Hence, they enable the profiling mechanism only during the beginning of each application, losing the opportunity to handle changes during the execution of the application. They also suggest that the programmer should instrument the source code of the application to inform the operating system which areas of the application should have their memory accesses monitored. Similarly, Tikir and Hollingsworth (2008) use UltraSPARC III hardware monitors to guide data mapping. However, their proposal is limited to architectures with software-managed TLBs. As explained in Section 2.2.2, mechanisms that do not perform thread mapping are not able to reduce cache misses or remote memory accesses to pages accessed by several threads.

The Carrefour mechanism (DASHTI et al., 2013) uses Instruction-Based Sampling (IBS), available on AMD architectures, to detect the memory access behavior, and keeps a history of memory accesses to limit unnecessary migrations. Additionally, it allows the replication of pages that are mostly read to different NUMA nodes. Due to the runtime overhead, Carrefour has to limit the number of samples it collects and the number of pages it characterizes, tracking at most 30,000 pages.

2.3.3 Combined Online Thread and Data Mapping

A library called ForestGOMP is introduced in Broquedis et al. (2010a). This library integrates into the OpenMP runtime environment and gathers information about the different parallel sections of the applications. The threads of parallel applications are scheduled such that threads created by the same parallel region, which usually operate on the same data set, execute on cores nearby in the memory hierarchy, decreasing the number of cache misses. To map data, the library depends on hints provided by the programmer. The authors also claim to use statistics from hardware counters, although they do not explain this in the paper. ForestGOMP only works for OpenMP-based applications. Also, the need for hints to perform the mapping may be a burden to the programmer.

The kernel Memory Affinity Framework (kMAF) (DIENER et al., 2014) uses page faults to detect the memory access behavior. Whenever a page fault happens, kMAF verifies the ID of the thread that generated the fault, as well as the memory address of the fault. To increase its accuracy, kMAF introduces additional page faults. These additional page faults impose an overhead on the system, since each fault causes an interrupt to the operating system. Like Carrefour, kMAF limits the number of samples of memory accesses to control the overhead, harming the accuracy of the detected memory access behavior.

2.4 Discussion on Sharing-Aware Mapping and Related Work

In this chapter, we analyzed the relation between sharing-aware mapping, computer architectures and parallel applications. We observed that mapping is able to improve performance and energy efficiency in current architectures due to an improved memory locality, reducing cache misses and inter-chip interconnection usage. In the context of parallel applications, thread mapping affects performance only in applications whose threads present different amounts of data sharing among themselves, such that performance can be improved by mapping threads that share data to cores nearby in the memory hierarchy. Regarding data mapping, parallel applications whose memory pages present a high amount of accesses from a single thread (or a single NUMA node) benefit from an improved mapping of pages to NUMA nodes. The related work shows a wide variety of solutions with very different characteristics, which we summarize in Table 2.1. Analyzing the impact of mapping on the architectures and applications, as well as the related work, we identified the following characteristics that would be desirable for a mechanism that performs sharing-aware mapping:

Perform both thread and data mapping As we observed, thread and data mapping influence the performance in different ways, such that they can be combined for maximum performance. Most of the related work performs only thread or data mapping alone.

Support dynamic environments Several of the related works require a previous analysis of the parallel application, not supporting applications that change their behavior between different executions. Also, such mechanisms may have issues supporting the wide variety of parallel architectures, since an analysis considering one architecture may not be usable for another architecture.

Support dynamic memory allocation Thread mapping mechanisms do not apply to this characteristic. Regarding data mapping, several mechanisms perform the analysis using memory traces or with the compiler, which may not be enough to detect information for memory that is dynamically allocated. Also, dynamically allocated memory can change addresses between executions, which can invalidate previously generated information.

Has access to memory addresses In order to achieve a high accuracy, mechanisms should have access to the memory addresses accessed by the threads. Mechanisms that analyze the behavior solely using statistics such as cache misses or instructions per cycle have a low accuracy.

Low overhead Several mechanisms require memory traces, which are expensive to generate and analyze.

Low trade-off between accuracy and overhead In some mechanisms, increasing the accuracy drastically increases the overhead. Such mechanisms are usually based on sampling of memory addresses, such that the overhead is proportional to the amount of samples used.

No manual source code modification Mechanisms should not depend on source code modification, which could impose a burden on programmers. The wide variety of architectures aggravates this burden.

Do not require sampling Mechanisms that sample memory addresses can have a lower accuracy, since the samples used may not characterize the behavior of the application correctly. Also, such mechanisms usually depend on sampling because their tracking method imposes a high overhead. The usage of sampling can also reduce the responsiveness of the mechanism.

Table 2.1: Summary of related work (a comparison of the mechanisms discussed in this chapter against the criteria above: thread mapping; data mapping; support for dynamic environments; support for dynamic memory allocation; access to memory addresses; low overhead; low trade-off between accuracy and overhead; no manual source code modification; no sampling required).

As a general comparison, we can observe that most of the related work performs either thread or data mapping, but not both together. Thread mapping mechanisms are not able to reduce the amount of remote memory accesses in NUMA architectures. On the other hand, data mapping mechanisms are not able to reduce cache misses or correctly handle the mapping of shared pages. The mechanisms we described that perform both mappings together have several disadvantages. Cruz et al. (2011) is a static mechanism that depends on information from previous executions, thereby being limited to applications whose behavior does not change between or during executions. ForestGOMP (BROQUEDIS et al., 2010a) is online, but requires hints from the programmer to work properly. kMAF (DIENER et al., 2014) uses sampling and has a high overhead when the amount of samples is increased to achieve a higher accuracy. Other mechanisms rely on indirect statistics obtained from hardware counters, which do not accurately represent the memory access behavior of parallel applications. Several proposals require specific architectures or programming languages, limiting their applicability.

With the analysis of the related work, we can conclude that there is currently no online mechanism that can be applied to any shared memory based application and whose accuracy can be increased without drastically increasing its overhead. Our proposals fill this gap. We perform both thread and data mapping to achieve improvements in terms of cache misses, remote memory accesses and interconnection usage. We implement our proposals directly in the MMU of the architecture, which allows us to keep track of many more memory accesses than the related work in a non-intrusive way. In this way, we achieve a much higher accuracy in the detected patterns, while keeping a low overhead. The next chapters describe our proposals in detail.

3 LAPT – LOCALITY-AWARE PAGE TABLE

Our first proposal is a mechanism called Locality-Aware Page Table (LAPT) (CRUZ et al., 2014b). As explained in Section 1.3, the MMU of current processors that support virtual memory translates virtual memory addresses to physical memory addresses for all memory accesses. In order to enable operating systems to perform thread and data mapping, LAPT leverages the virtual memory implementation of current architectures. LAPT introduces changes to the virtual memory subsystem at both the hardware and software levels. At the hardware level, LAPT keeps track of all TLB misses, as shown in Figure 3.1, detecting the data sharing between threads and registering a list of the latest threads that accessed each page in fields introduced in the corresponding page table entry. At the software level, the operating system maps threads to cores based on the detected sharing pattern, and maps pages to NUMA nodes by checking which threads accessed each page. A detailed representation of how LAPT works can be found in Figure 3.2. We detail LAPT in this chapter. We first describe the data structures introduced by LAPT. Then, we explain how the data sharing is detected in hardware. Afterwards, we describe how the sharing information is employed to map threads and data. Lastly, we discuss the overhead of LAPT.

3.1 Structures Introduced by LAPT

LAPT adds new control registers to the architecture, and stores control data in the main memory. The following structures were added to the main memory:

Page Table In the page table, for each entry of the last-level page table, we add a Sharers Vector (SV) and a NUMA Vector (NV). The Sharers Vector holds the identifiers of the last threads that accessed the page, and is kept by the hardware. The NUMA Vector holds the affinity of each NUMA node to the page, and is kept by the operating system.

Figure 3.1: Overview of the MMU and LAPT. [Flowchart: on every memory access, the MMU checks the TLB; on a TLB hit, the address is translated and execution continues; on a TLB miss, an old entry is evicted, the new entry is fetched by walking the page table, and LAPT updates the page table entry and the Sharing Matrix.]

Sharing Matrix Stores the affinity between each pair of threads. The size of the Sharing Matrix is nt × nt, where nt is the number of threads per parallel application supported by LAPT. This structure is updated by the hardware and used by the operating system to calculate the thread mapping.
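To make the layout of these structures concrete, the sketch below shows how they could be represented in software. The field widths follow the configuration discussed in Section 3.6.1 (4 Sharers Vector slots and 4 NUMA counters of 1 Byte each); all type and function names are illustrative, not part of LAPT's actual hardware interface.

#include <stdint.h>
#include <stdlib.h>

#define SV_SLOTS   4   /* thread IDs remembered per page */
#define NUMA_NODES 4   /* NUMA Vector counters per page  */

/* Extension added to each last-level page table entry (8 extra Bytes here). */
typedef struct {
    uint8_t sv[SV_SLOTS];    /* Sharers Vector: last thread IDs, MRU first */
    uint8_t nv[NUMA_NODES];  /* NUMA Vector: per-node access estimation    */
} lapt_pte_ext_t;

/* Sharing Matrix: nt x nt counters, one row per thread of the application. */
typedef struct {
    unsigned nt;      /* number of threads supported */
    uint32_t *cell;   /* nt * nt elements, row-major */
} sharing_matrix_t;

sharing_matrix_t *sm_alloc(unsigned nt)
{
    sharing_matrix_t *sm = malloc(sizeof *sm);
    if (sm == NULL)
        return NULL;
    sm->nt = nt;
    sm->cell = calloc((size_t)nt * nt, sizeof *sm->cell);
    if (sm->cell == NULL) {
        free(sm);
        return NULL;
    }
    return sm;
}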

3.2 Data Sharing Detection

LAPT requires the addition of a field in the page table, called Sharers Vector (SV), to identify the threads that access a page. Each SV stores the IDs of the last threads that accessed the corresponding page. Whenever the SV gets full, an old thread ID is removed to make room for the new one, keeping temporal locality. LAPT also introduces a Sharing Matrix (SM) in main memory for each parallel application. It stores an estimation of the amount of data shared between each pair of threads. Special registers containing the memory address and dimensions of SM, as well as the ID of the thread being executed, must be added to the architecture and updated by the operating system. The behavior of LAPT is as follows. When a thread T tries to access a page P but its entry is not present in the TLB, the processor performs a page table walk. Besides fetching the page table entry of P, the processor also fetches the corresponding Sharers Vector SVP. LAPT then increments the Sharing Matrix SM in row T, for all the columns that correspond to a thread in SVP:

SM[T][SVP[i]] ← SM[T][SVP[i]] + 1, where 0 ≤ i < |SVP|    (3.1)

After updating SM, LAPT inserts thread T into SVP:

SVP[i] ← SVP[i − 1], where 0 < i < |SVP|    (3.2)

SVP[0] ← T    (3.3)
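A minimal software sketch of this update is shown below, reusing the structures sketched in Section 3.1; in LAPT this logic is performed by the MMU hardware during the page table walk, and the function names are only illustrative.

#include <stdint.h>

#define SV_SLOTS 4

/* Equation 3.1: credit thread t with sharing against every thread in SVP. */
static void sm_increment(uint32_t *sm, unsigned nt, unsigned t,
                         const uint8_t sv[SV_SLOTS])
{
    for (unsigned i = 0; i < SV_SLOTS; i++)
        sm[t * nt + sv[i]] += 1;
}

/* Equations 3.2 and 3.3: shift SVP towards the LRU position and insert t. */
static void sv_insert(uint8_t sv[SV_SLOTS], unsigned t)
{
    for (unsigned i = SV_SLOTS - 1; i > 0; i--)
        sv[i] = sv[i - 1];
    sv[0] = (uint8_t)t;
}

/* Performed when thread t misses the TLB on a page whose Sharers Vector is
 * sv_p and whose Sharing Matrix (row-major, nt x nt) is sm. */
void lapt_on_tlb_miss(uint32_t *sm, unsigned nt, unsigned t,
                      uint8_t sv_p[SV_SLOTS])
{
    sm_increment(sm, nt, t, sv_p);
    sv_insert(sv_p, t);
}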

3.3 Thread Mapping Computation

The information provided by the data sharing detection is used to calculate an optimized mapping of threads to cores during the execution of the parallel application. As the mapping problem is NP-hard (BOKHARI, 1981), the use of efficient heuristics is required to calculate the mapping. Online mapping requires algorithms with a short execution time, since their overhead directly affects the executing application. To calculate the thread mapping, the operating system applies a mapping algorithm to the Sharing Matrix. In this work, we used the EagerMap (CRUZ et al., 2015) mapping algorithm, which receives the Sharing Matrix and a graph representing the memory hierarchy as input, and outputs which core will execute each thread. This information is then used to migrate threads to their assigned cores. Other libraries and algorithms, such as METIS (KARYPIS; KUMAR, 1998), Zoltan (DEVINE et al., 2006), Scotch (PELLEGRINI, 1994) and TreeMatch (JEANNOT; MERCIER; TESSIER, 2014), could be used. EagerMap was chosen because it is more appropriate for online mapping.

The operating system chooses the frequency at which the thread mapping routine is called. To prevent unnecessary migrations and to reduce the overhead, we adjust this frequency dynamically. If the computed mapping does not differ from the previous mapping, we increase the mapping interval by 50ms, because the mapping is stable. If the mappings differ, we halve the interval. The interval is kept between 50ms and 500ms to limit the overhead while still being able to react quickly to changes in the data sharing behavior. The initial time interval was set to 300ms in the experiments.
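The interval adjustment can be summarized by the small sketch below; the helper name and the way the operating system drives it are assumptions, but the constants follow the values given above.

#include <stdbool.h>

enum { INTERVAL_MIN_MS = 50, INTERVAL_MAX_MS = 500, INTERVAL_STEP_MS = 50 };

/* Returns the interval until the next call of the thread mapping routine. */
static int next_interval_ms(int current_ms, bool mapping_changed)
{
    if (mapping_changed)
        current_ms /= 2;                 /* react quickly to behavior changes */
    else
        current_ms += INTERVAL_STEP_MS;  /* mapping is stable, slow down      */

    if (current_ms < INTERVAL_MIN_MS) current_ms = INTERVAL_MIN_MS;
    if (current_ms > INTERVAL_MAX_MS) current_ms = INTERVAL_MAX_MS;
    return current_ms;
}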

3.4 Data Mapping Computation

To calculate the data mapping, the operating system verifies the contents of the Sharers Vector SV of each page in the page table. We also introduce a field in the page table to store an estimation of the amount of memory accesses from each NUMA node, which we call NUMA Vector (NV). This field is updated by the operating system and has one counter per NUMA node. When the data mapping routine is called, for every page P of the application, the counters of the NUMA nodes of the threads in SVP are incremented in the NUMA Vector of this page (NVP). This is illustrated in Equation 3.4, where node(x) is a function that returns the NUMA node on which thread x is running.

NVP[node(SVP[i])] ← NVP[node(SVP[i])] + 1, where 0 ≤ i < |SVP|    (3.4)

After the contents of NVP are updated, we use Equations 3.5 and 3.6 to determine whether P, which currently resides on node M, should be migrated to the NUMA node returned by the function Dmap(P), which returns the node that performs the most accesses to the page.

Dmap(P) = {N | NVP[N] = max(NVP)}    (3.5)

Migrate(P) = { true, if NVP[Dmap(P)] ≥ 2 · avg(NVP) and Dmap(P) ≠ M;  false, otherwise }    (3.6)

A page migration will happen only if the value of the counter of the node returned by Dmap is greater than or equal to twice the average value of the counters. In this way, we reduce the possibility of a ping-pong effect of page migrations between NUMA nodes. Naturally, a migration happens only if the node returned by the Dmap function is different from the NUMA node on which page P currently resides. Also, every time a page is migrated and the average value of NVP is at least 1, we apply an aging technique in which all elements of NV are halved, making it possible to adapt to changes in the access behavior. The operating system calls the data mapping routine in two situations: (I) whenever the thread mapping routine is called and causes a change in the mapping, in order to move the pages used by the threads when they migrate; and (II) whenever a long time has passed since the last call, since the data mapping may change even if the thread mapping stays the same. In our experiments, the maximum interval between calls to the data mapping routine was set to 500ms.
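The sketch below restates the data mapping decision in software, assuming a node_of() helper provided by the operating system; the comparison is written in an integer form equivalent to Equation 3.6, and the aging step is the halving described above. Names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define SV_SLOTS   4
#define NUMA_NODES 4

extern unsigned node_of(unsigned tid);   /* NUMA node currently running tid */

/* Equation 3.4: charge the nodes of the threads recorded in SVP. */
static void nv_update(uint8_t nv[NUMA_NODES], const uint8_t sv[SV_SLOTS])
{
    for (unsigned i = 0; i < SV_SLOTS; i++)
        nv[node_of(sv[i])] += 1;
}

/* Equations 3.5 and 3.6: pick the node with most accesses (Dmap) and decide
 * whether page P, currently on node `current`, should migrate there. The
 * test NUMA_NODES * NV[best] >= 2 * sum is equivalent to
 * NV[best] >= 2 * avg(NV) without rounding. */
static bool should_migrate(const uint8_t nv[NUMA_NODES], unsigned current,
                           unsigned *dest)
{
    unsigned best = 0, sum = 0;
    for (unsigned n = 0; n < NUMA_NODES; n++) {
        sum += nv[n];
        if (nv[n] > nv[best])
            best = n;
    }
    *dest = best;
    return (NUMA_NODES * (unsigned)nv[best] >= 2 * sum) && best != current;
}

/* Aging applied after a migration when avg(NVP) is at least 1. */
static void nv_age(uint8_t nv[NUMA_NODES])
{
    for (unsigned n = 0; n < NUMA_NODES; n++)
        nv[n] /= 2;
}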

3.5 Example of the Operation of LAPT

To illustrate how LAPT works, consider the example shown in Figure 3.2, where an application with 6 threads is executing on a NUMA machine with 4 nodes, and the Sharers Vector supports up to 4 thread IDs. Thread 3 tries to access page P, but it does not have an entry in the TLB. The processor then performs a page table walk to read the corresponding page table entry. Besides reading the physical address and page protection information, it reads the Sharers Vector, which contains thread IDs 0, 2, 1, and 4, in order from the MRU to the LRU position. The core running thread 3 continues its execution. In parallel to that, LAPT increments the Sharing Matrix in cells (3, 0), (3, 2), (3, 1) and (3, 4). After that, LAPT updates the Sharers Vector, moving thread 3 to the MRU position and shifting all the other threads in SV towards the LRU position, removing thread 4. During run time, the operating system evaluates the Sharing Matrix to update the thread mapping. It changes the mapping of the threads to execute thread 0 on NUMA node 0, and migrates the other threads to node 1. Then, it evaluates the NUMA Vector of P for data mapping.

Figure 3.2: Detailed description of the LAPT mechanism. In this example, thread 3 generates a TLB miss in page P. [Diagram: the hardware fetches the Sharing Matrix and the page table entry of P (SV = 0, 2, 1, 4; NV = 0, 0, 0, 0), updates the Sharing Matrix and the page table entry (SV becomes 3, 0, 2, 1), while the operating system updates the NUMA Vector (NV becomes 1, 3, 0, 0) and performs thread and data mapping.]

The initial value of all the elements in NVP is 0. Since only thread 0 is executing on node 0, the corresponding entry in NVP will be incremented by 1. Likewise, the other three threads that are in SVP are running on node 1, whose value will be incremented by 3. The node with the highest value in NVP is node 1 (value 3) and the average of the values of NVP is 1. Therefore, page P is migrated to node 1 following Equation 3.6.

3.6 Overhead

LAPT’s overhead consists of storage space in main memory, circuit area to implement LAPT in the processor, and a run time overhead.

3.6.1 Memory Storage Overhead

The Sharing Matrix requires 4 · N² bytes, where N is the number of threads, considering an element size of 4 Bytes. For 256 threads, the Sharing Matrix requires 256 KByte. If the parallel application creates more threads than the maximum supported by the Sharing Matrix during run time, the operating system can allocate a larger matrix and copy the values from the old matrix. In the last-level page table entries, we store the Sharers Vector (SV) and the NUMA Vector (NV). A common size for each page table entry in current architectures is 8 Bytes, as in x86-64 (INTEL, 2012a). We extend the size of each entry to 16 Bytes. We reserve 1 Byte to store each thread ID of the SV, and track 4 threads per page, supporting up to 256 threads per parallel application. We also reserve 1 Byte to store each counter of the NV, supporting 4 NUMA nodes. The space used in the page table represents a 0.2% memory space overhead compared to the original system when using a standard 4 KByte page size. To support larger systems, we would just need to use more memory.
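The figures above follow directly from the chosen field sizes; the short LaTeX derivation below merely restates the arithmetic under those assumptions (N = 256 threads, 4-Byte Sharing Matrix elements, 8 extra Bytes per page table entry, 4 KByte pages).

% Worked restatement of the storage figures quoted above.
\begin{align*}
\text{Sharing Matrix} &= 4\,\text{B} \times N^2 = 4\,\text{B} \times 256^2 = 256\,\text{KByte},\\
\text{page table extension} &= 4\,\text{B (SV)} + 4\,\text{B (NV)} = 8\,\text{B per entry},\\
\text{page table overhead} &= \frac{8\,\text{B}}{4096\,\text{B}} \approx 0.2\%.
\end{align*}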

3.6.2 Circuit Area Overhead

LAPT requires the addition of some registers, adders, and multiplexers to each core. More specifically, 64-bit registers are used to store the position in memory of SM, intermediate values used to compute the memory position of an entry of SM, and the displacement to read the SV of a page. 16-bit registers store the ID of the currently running thread, the IDs stored in the SV, and one of these IDs to compute the address of an entry of SM. One 32-bit register is required to store the value of SM[T][SVP[i]]. Two 64-bit carry look-ahead adders are employed to compute the addresses of SVP and SM[T][SVP[i]], while a 32-bit carry look-ahead adder is used to increment the value of SM[T][SVP[i]]. Finally, multiplexers define which values are summed. In total, LAPT requires less than 25,000 additional transistors per core, which represents an increase of less than 0.009% in the number of transistors of a current processor.

3.6.3 Runtime Overhead

The additional hardware introduced by LAPT is not in the critical path, since it operates in parallel to the normal operation of the processor. At the hardware level, the time overhead introduced by LAPT consists of the additional memory accesses to update the page table entries and the Sharing Matrix. The amount of additional memory accesses depends on the TLB miss rate, which differs for each application. At the software level, the time overhead includes the calculation of the thread and data mappings, and the respective migrations. The scalability of LAPT is affected very little by the number of NUMA nodes or cores. If, for any reason, the number of nodes or cores increases to a point that affects LAPT's overhead, the operating system could compute the mapping less frequently.

4 SAMMU – SHARING-AWARE MEMORY MANAGEMENT UNIT

Our second proposal is a mechanism called Sharing-Aware Memory Management Unit (SAMMU). SAMMU adds sharing-awareness to the MMU to optimize the memory accesses of parallel applications. It works in the same way for multicore and multithreaded architectures, so we will only refer to cores in the text. We begin this chapter with an overview of the concepts of SAMMU and the data structures introduced by it. Afterwards, we explain how SAMMU gathers information on memory accesses and detects the sharing pattern between the threads and the page usage pattern, and present an example of its operation. Finally, we discuss implementation details and the overhead.

4.1 Overview of SAMMU

A high-level overview of the operation of the MMU, TLB and SAMMU is illustrated in Figure 4.1. On every memory access, the MMU checks whether the page has a valid entry in the TLB. If it does, the virtual address is translated to a physical address and the memory access is performed. If the entry is not in the TLB, the MMU performs a page table walk and caches the entry in the TLB before proceeding with the address translation and memory access. The operation of the MMU is extended in two ways, both happening in parallel to the normal operation of the MMU, without stalling application execution:

1. SAMMU counts the number of times that each TLB entry is accessed from the local core. This enables the collection of information about the pages accessed by each thread. We store one counter per TLB entry, which we call Access Counter (AC), in a table that we call TLB Access Table, which is stored in the MMU.

Figure 4.1: Overview of the MMU and SAMMU. [Flowchart: on every memory access, the MMU checks the TLB; on a hit, the address is translated and execution continues; on a miss, an old entry is evicted and the new entry is fetched by walking the page table, while SAMMU updates the memory access count (AC) in the TLB Access Table and, when an AC saturates or an entry is evicted, updates the Page History Table and the Sharing Matrix.]

2. On every TLB eviction or when an Access Counter saturates, SAMMU analyzes statistics about the page and stores them in the main memory in two separate structures. The first structure is the Sharing Matrix (SM), which estimates the amount of sharing between the threads. The second structure is the Page History Table, which contains one entry per physical page in the system, with information about the threads and NUMA nodes that accessed it, and is indexed by the physical page address. Each entry of the Page History Table has three fields: (i) the Access Threshold (AT), which defines the minimum number of memory accesses required to modify the statistics; (ii) the Sharers Vector (SV), which contains the IDs of the last threads that accessed the page; (iii) the NUMA Vector (NV), which estimates the number of accesses from each NUMA node.

4.2 Structures Introduced by SAMMU

SAMMU adds new control registers to the architecture, and stores control data in the main memory. The following structures were added to the main memory:

Page History Table (PHT) Contains memory access information related to each page. It stores 3 fields for each page: the Sharers Vector (SV), the NUMA Vector (NV) and the Access Threshold (AT). The Sharers Vector holds the identifiers of the last threads that accessed the page. The NUMA Vector holds the affinity of each NUMA node to the page. The Access Threshold specifies the minimum amount of memory accesses a thread must perform to a page in order to update the Sharers Vector and the NUMA Vector. Sharing Matrix (SM) Stores the affinity between each pair of threads. The size of the Sharing Matrix is nt × nt, where nt is the number of threads per parallel application supported by SAMMU. This structure is used by the operating system to calculate the thread mapping.

SAMMU also adds a structure to the MMU, the TLB Access Table (TAT). For each TLB entry cached in all TLBs, it stores an Access Counter (AC), which counts the number of memory accesses performed to the page. The TLB Access Table also stores the ID of the thread that generated the TLB miss, kept by the operating system in a control register CRtid.

4.3 Gathering Information about Memory Accesses

SAMMU gathers memory access information by counting the number of memory accesses to each page in the TLB of each core. We count the number of accesses to the TLB entry of a page by adding a saturating Access Counter (AC) to the TLB Access Table. When an AC saturates or a TLB entry gets evicted, SAMMU collects the information and updates the Page History Table entry of the page.

To filter out threads that perform only a few accesses to a page, we use an Access Threshold (AT) in the Page History Table. AT specifies the minimum number of memory accesses required to update the mapping-related information of a page. A number of memory accesses smaller than AT means that a thread does not use a page enough to influence its mapping. SAMMU updates the mapping-related statistics of a page only when its AC saturates, or if the page is evicted from a TLB and the number of memory accesses registered in the AC of this TLB entry is greater than or equal to the AT of the page. A detailed example of the operation of SAMMU can be found in Figure 4.4. The initial value of AT is 0. The Access Threshold is kept per page because the number of memory accesses can vary from page to page. SAMMU automatically adjusts the Access Threshold of a given page, separating this procedure into two cases (Figure 4.4-B), described below and summarized in the code sketch that follows them:

Case 1 (AC saturates or AC ≥ AT): When AC saturates or, during a TLB eviction, AC is greater than or equal to the Access Threshold AT (Figure 4.4-B, D), AT is updated with the average of AC and AT, as illustrated in Equation 4.1. Also, the mapping statistics are updated, as explained in Sections 4.4 and 4.5. It is important to note that, since we use the same number of bits to store AC and AT, when AC saturates it will be greater than or equal to AT.

ATnew = (ATold + AC) / 2, where AC ≥ ATold    (4.1)

Case 2 (AC < AT): In the second case, when the number of memory accesses registered by AC during a TLB eviction is lower than the Access Threshold (Figure 4.4-B, C), we update AT in such a way that NUMA nodes with a small number of accesses to the page have a lower influence on the threshold. Therefore, we need to find a function f(AC)

that subtracts a value from ATold:

ATnew = ATold − f(AC)    (4.2)

The function f(AC) must have three properties (we consider ATold as a constant):

f(0) = 0, which means that if the amount of accesses AC were 0, the Access Threshold would not be changed. The reason is that, if a thread performs no memory accesses to a page, we do not want this thread to influence the statistics of that page.

f(ATold) = 0, which means that if the thread performed the same amount of accesses as the value of the Access Threshold, we also do not want to change its value.

f(AC) = k, where 0 < AC < ATold and 0 < k < ATold. This means that, for all values of AC between 0 and ATold, we want to reduce the value of the Access Threshold, but keep it higher than 0.

No linear function can provide the properties required of f(AC). We therefore choose a quadratic function. We could use other polynomials, but since we need to implement this in hardware, we want to use the simplest function possible.

The quadratic function must be concave down. Therefore, we define f(AC) as:

f(AC) = (−1) × AC × (AC − ATold) × c    (4.3)
      = AC × ATold × c − AC² × c    (4.4)

The roots of the function are 0 and ATold. We multiply by −1 so that the function is concave down. We also multiply by a constant c to help us control the concavity of the function and keep f(AC) < ATold. To find the possible values of c, we need to differentiate f(AC) to find its critical point.

f′(AC) = ATold × c − 2 × c × AC    (4.5)

To find the critical point, we set the derivative equal to 0.

f′(AC) = 0    (4.6)

ATold × c − 2 × c × AC = 0    (4.7)

2 × c × AC = ATold × c    (4.8)

AC = (ATold × c) / (2 × c)    (4.9)

AC = ATold / 2    (4.10)

Now we know that, regardless of the values of c and ATold, the maximum value of f(AC) happens when AC = ATold/2, that is, when the Access Counter is half of the current Access Threshold. We can use this value of AC in Equation 4.4 to get the maximum value that f(AC) can subtract from ATold in Equation 4.2.

f(ATold/2) = (ATold/2) × ATold × c − (ATold/2)² × c    (4.11)
           = (ATold²/2) × c − (ATold²/4) × c    (4.12)
           = (2/4) × ATold² × c − (ATold²/4) × c    (4.13)
           = c × ATold² × (2 − 1) / 4    (4.14)
           = c × ATold² / 4    (4.15)

Equation 4.15 specifies the maximum value of f(AC). We now need to find the possible values of c such that f(AC) < ATold.

c × ATold² / 4 < ATold    (4.16)

c < 4 × ATold / ATold²    (4.17)

c < 4 / ATold    (4.18)

Since we want to reduce the impact of threads that access the page few times as much as possible, we chose c = 1/ATold. Therefore, the equation used to update the Access Threshold in Case 2 is given by Equation 4.23, found by substituting c = 1/ATold in Equations 4.2 and 4.4.

ATnew = ATold − f(AC)    (4.19)
      = ATold − (AC × ATold × c − AC² × c)    (4.20)
      = ATold − (AC × ATold × (1/ATold) − AC² × (1/ATold))    (4.21)
      = ATold − AC × (ATold − AC) / ATold    (4.22)

ATnew = ATold − (ATold − AC) / (ATold / AC), where AC < ATold    (4.23)

Equation 4.23 guarantees that AT will never be decreased by more than 25% at each update. In this case, the mapping statistics are not updated. Further details on how we implement Equation 4.23 in hardware are given in Section 4.7.1.
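The following is a compact software restatement of both cases, using the exact arithmetic of Equations 4.1 and 4.23 rather than the hardware approximation of Section 4.7.1; all names are illustrative.

#include <stdbool.h>
#include <stdint.h>

/* Returns the new Access Threshold and tells the caller whether the mapping
 * statistics (Sharers Vector, NUMA Vector, Sharing Matrix) may be updated. */
static uint32_t at_update(uint32_t at_old, uint32_t ac, bool *update_stats)
{
    if (ac >= at_old) {                       /* Case 1 (Equation 4.1)  */
        *update_stats = true;
        return (at_old + ac) / 2;
    }
    *update_stats = false;                    /* Case 2 (Equation 4.23) */
    /* at_new = at_old - ac * (at_old - ac) / at_old, which decreases AT
     * by at most 25%, the maximum decrease happening at ac = at_old / 2. */
    return at_old - (uint32_t)(((uint64_t)ac * (at_old - ac)) / at_old);
}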

The behavior of the equations that control the updates of the Access Threshold is illustrated in Figures 4.2 and 4.3. In the following equations, we prove that AT will never be decreased by more than 25% at each update. For that, the value of ATnew/ATold must be greater than or equal to 0.75.

ATnew/ATold = (ATold − (ATold − AC) / (ATold / AC)) / ATold    (4.24)
            = 1 − (ATold − AC) / (ATold² / AC)    (4.25)
            = 1 − (AC × ATold − AC²) / ATold²    (4.26)

At this point, we can make a change of variables, considering AC = α × ATold. Since in Case 2 AC < AT, α must satisfy 0 ≤ α < 1.

ATnew/ATold = 1 − (α × ATold × ATold − (α × ATold)²) / ATold²    (4.27)
            = 1 − (α × ATold² − α² × ATold²) / ATold²    (4.28)
            = 1 − (α − α²)    (4.29)
            = α² − α + 1    (4.30)

Therefore, we know that ATnew/ATold is a quadratic function and is concave up, which means it has a global minimum. To get the minimum value of ATnew relative to ATold, we first need to differentiate Equation 4.30.

Figure 4.2: Value of ATnew relative to ATold as a function of AC/ATold, for Equations 4.1, 4.23 and 4.42. For Equation 4.42, we consider ATold as 10M.

Figure 4.3: Behavior of Equations 4.23 and 4.42 when varying ATold from 100K to 10M. (a) Equation 4.23. (b) Equation 4.42.

d(α² − α + 1) / dα = 2α − 1    (4.31)

Now, we need to find the value of α for which the derivative is equal to 0.

2α − 1 = 0    (4.32)

α = 1/2    (4.33)

Then, we know that the minimum value of ATnew relative to ATold happens when α = 1/2, in other words, when the value of the Access Counter AC is half of the Access Threshold AT. To find the minimum value, we insert α = 1/2 into Equation 4.30.

ATnew/ATold = α² − α + 1    (4.34)
            = (1/2)² − 1/2 + 1    (4.35)
            = 3/4 = 0.75    (4.36)

With this, we demonstrated that, for Case 2, the minimum value of ATnew is 0.75 × ATold, a 25% decrease, which happens when AC is half of AT.

4.4 Detecting the Sharing Pattern for Thread Mapping

To detect the sharing pattern, SAMMU identifies the last threads that accessed a memory page. To obtain that information, SAMMU adds a small Sharers Vector (SV ) to each Page History Table entry. Each SV stores the IDs of the last threads to access its page. This has the advantage of maintaining temporal locality when detecting which threads share each page. Old entries will be overwritten and not considered as sharers. SAMMU also keeps a Sharing Matrix (SM) in main memory for each parallel application to estimate the number of accesses to pages that are shared between each pair of threads. In the TLB Access Table, SAMMU stores the ID of the thread that accessed each TLB entry. Control registers containing the memory address and dimensions of the Sharing Matrix, and the ID of the thread being executed must be added to the architecture and updated by the operating system.

Figure 4.4: Operation of SAMMU. Structures that were added to the normal operation of the hardware and operating system are marked in bold. The system consists of 4 NUMA nodes with 2 cores each. The NUMA Threshold is 4. Page P is initially located on NUMA node 0. In this example, thread 3 (core 3, NUMA node 1) saturates the AC of page P. [Diagram with steps A–H: (A) the AC of page P saturates or its TLB entry is evicted; (B) AC is compared with the Access Threshold; (C) if AC < AT, the Access Threshold is updated using Equation 4.23; (D) otherwise, the Sharing Matrix and Sharers Vector are updated; (E) the Access Threshold (Equation 4.1) and the NUMA counters are updated; (F) the difference between the NUMA counters is compared with the NUMA Threshold; (G) the operating system is notified to migrate page P; (H) the operating system handles data and thread mapping.]

When SAMMU is triggered for a certain page by thread T (Figure 4.4-A), it accesses the SV of the corresponding Page History Table entry. If the Access Counter is greater than or equal to the Access Threshold (Figure 4.4-B), SAMMU increments the Sharing Matrix in row T, for all the columns that correspond to an entry in the SV (Figure 4.4-D):

SM[T][SV[i]] = SM[T][SV[i]] + 1    (4.37)

Each line of SM is accessed by its corresponding thread only, minimizing the impact of coher- ence protocols. Finally, SAMMU inserts thread T into the SV of the evicted page by shifting its elements, such that the oldest entry is removed.

SV[i] ← SV[i − 1], i ≥ 1    (4.38)

SV[0] ← T    (4.39)

4.5 Detecting the Page Usage Pattern for Data Mapping

To identify where a memory page should be mapped, SAMMU requires the addition of a vector to each Page History Table entry. The vector, which we call NUMA Vector (NV), has N counters for a system with N NUMA nodes. NV employs saturating counters to count a relative number of accesses from the different NUMA nodes. The initial value of each NV is 0.

When a TLB entry from a core in NUMA node n is selected for eviction or its AC reaches its maximum value (Figure 4.4-A), SAMMU reads the corresponding Page History Table entry. If the number of memory accesses stored in AC is greater than or equal to the threshold AT (Figure 4.4-B), SAMMU increments the NUMA Vector counter of node n by Vadd and decrements all other counters by 1 (Figure 4.4-E), as in Equation 4.40. Vadd is a control register configured by the operating system, and its value must be at least 2. Since the NUMA Vector counters saturate, they do not overflow or underflow.

NV[x] = { NV[x] + Vadd, if x = n;  NV[x] − 1, otherwise }    (4.40)

After updating the values of NV, SAMMU checks whether the corresponding page is stored in NUMA node n. If the page is currently mapped to another NUMA node m, SAMMU evaluates whether the difference between the NUMA Vector counters of n and m is greater than or equal to a global NUMA Threshold (NT) (Figure 4.4-F), as in Equation 4.41. If that is the case, SAMMU notifies the operating system of the page and its destination node n (Figure 4.4-G). The NUMA Threshold may be configured by a control register.

NotifyOS = { true, if NV[n] − NV[m] ≥ NT;  false, otherwise }    (4.41)

The operating system then chooses how it will handle the migration of the page (Figure 4.4-H). The higher the NUMA Threshold NT, the lower the number of page migrations.

To demonstrate why Vadd must be at least 2, consider a situation where nodes n and m present a similar access pattern to a page P, the value of Vadd is 1, and all NV counters are set to 0. Nodes other than n and m do not access page P. Whenever a core from node n evicts page P (or the AC of page P saturates), the value of NV[n] would be incremented to 1 and NV[m] would be decremented to 0. Likewise, whenever a core from node m evicts the same page P, the value of NV[m] would be incremented to 1 and NV[n] would be decremented to 0. In this scenario, the mechanism would not be able to detect that nodes n and m present more accesses than the others. However, if the value of Vadd is 2, NV[n] and NV[m] would present higher values after multiple accesses. Therefore, by incrementing with a higher number, SAMMU is able to handle this access behavior. If more than two NUMA nodes access a page in similar amounts, a value higher than 2 is required.
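The sketch below restates the NUMA Vector update and the migration check of Equations 4.40 and 4.41 in software; the counter width, node count and helper names are assumptions, and in SAMMU this logic is performed by the MMU.

#include <stdbool.h>
#include <stdint.h>

#define NUMA_NODES 4
#define NV_MAX     255u   /* saturation limit of each NUMA Vector counter */

/* Equation 4.40: node n accessed the page "enough" (AC >= AT). */
static void nv_account(uint8_t nv[NUMA_NODES], unsigned n, unsigned v_add)
{
    for (unsigned x = 0; x < NUMA_NODES; x++) {
        if (x == n)
            nv[x] = (nv[x] + v_add > NV_MAX) ? NV_MAX : (uint8_t)(nv[x] + v_add);
        else if (nv[x] > 0)
            nv[x] -= 1;
    }
}

/* Equation 4.41: should the OS be notified to migrate the page from its
 * current node m to node n? */
static bool notify_os(const uint8_t nv[NUMA_NODES], unsigned n, unsigned m,
                      unsigned numa_threshold)
{
    return n != m && nv[n] >= nv[m] &&
           (unsigned)(nv[n] - nv[m]) >= numa_threshold;
}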

4.6 Example of the Operation of SAMMU

In the situation illustrated in Figure 4.4, there are 8 cores and 4 NUMA nodes, the NUMA Threshold NT is 4, the Sharers Vector SV has 2 positions, and Vadd is 2. AC and AT support values up to 32M. SAMMU was configured to support up to 8 threads per parallel application. Consider that page P is initially located on NUMA node 0. If thread 3, which is being executed on core 3 (NUMA node 1), saturates the AC of page P (Figure 4.4-A), SAMMU accesses the Page History Table entry of page P. Since page P was accessed 32M times and its AC saturated, AT (16M) will be updated to 24M following Equation 4.1 (Figure 4.4-B, E). For the sharing pattern, SAMMU checks the Sharers Vector SV and finds the IDs of threads 4 and 7. It will then increment cells (3, 4) and (3, 7) of the Sharing Matrix by 1, and store thread 3 in SV (Figure 4.4-D). This will result in the removal of thread 7 from SV. Regarding the page usage pattern (Figure 4.4-E), the NUMA Vector counter of node 1 will be incremented by 2, and all others will be decremented by 1. The counter of node 3 remains at 0 because the counter is saturated. Since the difference between NV[1] (node of core 3) and NV[0] (the node that currently stores page P) is greater than or equal to the NUMA Threshold (4) (Figure 4.4-F), SAMMU notifies the operating system to migrate page P to node 1 (Figure 4.4-G, H). Supposing that page P had been accessed 4M times instead of 32M and was selected for a TLB eviction, no updates to the Sharing Matrix and NUMA Vector would be performed, since AC would be lower than the Access Threshold (16M). In this case, the Access Threshold would be updated to 13M following Equation 4.23 (Figure 4.4-C).

4.7 Implementation Details

We discuss several aspects of implementing SAMMU.

4.7.1 Hardware Implementation

SAMMU can be implemented in several ways. We implemented it in a way that handles only one event at a time, to keep the hardware simpler. If a TLB eviction happens or an AC saturates while SAMMU is already handling another event, we discard this new event. Since these events are very frequent, we do not expect a significant impact on accuracy. We evaluate this in Section 6.6. To count the number of accesses to each TLB entry, we added the TLB Access Table to the MMU. In order to implement the TLB Access Table as a scratchpad, without the need for tag comparison, we added a static unique identifier to each TLB entry. When a TLB entry is accessed, its identifier is returned along with the page information. SAMMU then uses this identifier as an index into the TLB Access Table. Regarding the updates of the AT in the Page History Table, since Equation 4.23 requires division operations, we update the Access Threshold using the approximation shown in Equation 4.42 instead, where ≫ represents the bit shift operation. The H function returns the position of the highest bit set to 1 in its parameter. The behavior of this equation is also shown in Figures 4.2 and 4.3b.

ATnew = AT − ((AT − AC) ≫ (H(AT) − H(AC)))    (4.42)
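This approximation can be restated in software as the sketch below; highest_bit() stands for the H function, and all names are illustrative only.

#include <stdint.h>

static unsigned highest_bit(uint32_t x)      /* H(x); undefined for x == 0 */
{
    unsigned pos = 0;
    while (x >>= 1)
        pos++;
    return pos;
}

/* Shift-based approximation of Equation 4.23 (Equation 4.42). The right
 * shift by H(AT) - H(AC) approximates the division by AT / AC. */
static uint32_t at_update_hw(uint32_t at, uint32_t ac)
{
    if (ac == 0 || at == 0 || ac >= at)
        return at;            /* Case 2 only applies when 0 < AC < AT */
    unsigned shift = highest_bit(at) - highest_bit(ac);
    return at - ((at - ac) >> shift);
}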

4.7.2 Notifying the Operating System About Page Migrations

To notify the operating system when a page needs to be migrated, SAMMU can either raise an interrupt or save in memory a list of pages to be migrated along with their destination NUMA nodes. To avoid interrupting the operating system too frequently, we chose the latter. The operating system periodically checks this list and performs the migrations.

4.7.3 Performing the Thread Mapping

How often the thread mapping is performed depends on the operating system, as SAMMU is responsible only for detecting the patterns. A simple implementation consists of analyzing the Sharing Matrix and mapping threads from time to time. To calculate the thread mapping, the operating system applies a mapping algorithm to the Sharing Matrix. As in LAPT, we used the EagerMap (CRUZ et al., 2015) mapping algorithm, which receives the Sharing Matrix and a graph representing the memory hierarchy as input, and outputs which core will execute each thread. We check for thread migrations every 100ms. To reduce the influence of old values, we apply an aging technique to the Sharing Matrix every time the mapping mechanism is called, multiplying all of its elements by 0.75. We evaluated values between 0.6 and 0.95, but the results were not sensitive to this value.
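A minimal sketch of this periodic step is shown below; the eager_map() signature is a simplification (the real EagerMap also receives the memory hierarchy), and all names are assumptions.

#include <stdint.h>

#define AGING_NUM 3   /* multiply every element by 0.75 = 3/4 */
#define AGING_DEN 4

extern void eager_map(const uint32_t *sm, unsigned nt, unsigned *thread_to_core);

/* Called every 100 ms by the operating system. */
static void mapping_tick(uint32_t *sm, unsigned nt, unsigned *thread_to_core)
{
    for (unsigned i = 0; i < nt * nt; i++)
        sm[i] = sm[i] * AGING_NUM / AGING_DEN;   /* aging of old sharing info */
    eager_map(sm, nt, thread_to_core);           /* compute new thread->core  */
    /* The OS would then migrate the threads whose assigned core changed. */
}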

4.7.4 Increasing the Supported Number of Threads

The operating system starts an application by configuring SAMMU to support a certain number of threads using a control register. If the parallel application creates more threads than the maximum supported, the operating system can change this during execution. To do that, it must allocate a new Sharing Matrix SM and copy the values from the old SM. It must also update the contents of the Sharers Vector SV of all pages to use the new number of bits per entry. Since this is an expensive procedure, we recommend avoiding it by configuring SAMMU to support a large number of threads from the beginning. For all systems and applications evaluated, configuring SAMMU to support 256 threads was enough to avoid this procedure.

4.7.5 Handling Context Switches and TLB Shootdowns

In context switches, if the context of a core changes to another process and thereby to another memory address space, the TLB may be flushed, depending on the architecture and operating system. In case the TLB gets flushed, SAMMU needs to flush the TLB Access Table. In case the TLB is not flushed, SAMMU still detects the sharing correctly, since it stores the thread ID in the TLB Access Table. The content stored in the Page History Table is not affected. When individual TLB entries are flushed by TLB shootdowns due to changes in their page table entries, the flushed TLB entries can still be tracked by SAMMU. Both context switches and changes to page table entries are much less frequent than TLB evictions, and have no significant impact on the accuracy of SAMMU.

4.8 Overhead

The overhead consists of storage space in the main memory, circuit area to implement SAMMU, and execution time.

4.8.1 Memory Storage Overhead

The Page History Table and Sharing Matrix are stored in the main memory. For the configuration shown in Table 6.1, each entry of the Page History Table would require 8 Bytes. The Page History Table space overhead would be 0.2% relative to the total main memory. The Sharing Matrix would require 256 KByte, with each element occupying 4 Bytes. In this configuration, SAMMU can track up to 256 threads per parallel application. To support larger systems, we would only need to use more space per Page History Table entry, allocating more NUMA Vector counters and Sharers Vector entries, and a larger Sharing Matrix.

4.8.2 Circuit Area Overhead

The logic of SAMMU is implemented in the MMU of each core. In total, it requires 8 adders, 2 , 5 shifters, and 7 multiplexers of various sizes (from 4 to 32 bits) per core. SAMMU also uses a TLB Access Table that stores in SRAM one Access Counter and one thread ID per TLB entry. The number of bits of each Access Counter must be enough to store the maximum possible value of the Access Threshold. In this scenario, the additional hardware required by SAMMU represents 143,000 transistors per core, which results in an increase in transistors of less than 0.05% in a modern processor.

4.8.3 Execution Time Overhead

The additional hardware of SAMMU is not in the critical path, since it operates in parallel to the MMU, such that application execution is not stalled while SAMMU is operating. Therefore, the time overhead introduced by SAMMU consists of the additional memory accesses to update the Page History Table and Sharing Matrix, which depend on the TLB miss rate. To keep the overhead of these memory accesses low, SAMMU does not lock any structure before updating it. The Sharing Matrix does not need to be locked, since each row is updated by only one thread. For a Page History Table entry, a race condition can happen if several threads evict the same page (or the corresponding ACs saturate) at the same time. This race condition would not cause the application to fail, just a slight reduction in accuracy. Since the time SAMMU takes to update a Page History Table entry is small, this race condition is a rare event. On the software level, the operating system introduces overhead when calculating the thread mapping, and when migrating threads and pages. The measured execution time overhead from both hardware and software is shown in Section 6.6.

5 IPM – INTENSE PAGES MAPPING

Our third proposal is a mechanism called Intense Pages Mapping (IPM). In architectures with hardware-managed TLBs, IPM is implemented as an addition to the memory management unit (MMU) hardware, because the MMU has direct access to the TLB and to all memory accesses. In architectures with software-managed TLBs, IPM can be implemented natively in the TLB miss handler of the operating system. This chapter is organized as follows. It first analyzes the usage of the TLB time as a metric for sharing-aware mapping. Then, it presents an overview of IPM, followed by a discussion of the additional data structures required to capture memory access patterns. A detailed description of IPM is given next, and an account of its overhead completes the chapter.

5.1 Detecting Memory Access Patterns via TLB Time

We intend to improve the accuracy of sharing-aware mapping with a novel method to detect memory access patterns and data sharing based on the time a page table entry stays cached in the TLB. We refer to it as the TLB time. We show in this section that this metric estimates the whole memory access behavior of an application with higher accuracy than other metrics. This happens because intensely accessed pages tend to stay for longer times in the TLBs of the cores that perform these accesses. Additionally, this metric has a smaller hardware implementation cost than using the number of memory accesses itself, as the latter requires the addition of counters to every TLB entry (TIKIR; HOLLINGSWORTH, 2008), as in SAMMU. Following the hypothesis that a more accurate estimation of the memory access pattern of an application can result in a better mapping and performance improvements, we performed experiments using the TLB time, memory access sampling, and TLB misses as metrics to guide thread and data mapping. Experiments were run using the NAS parallel benchmarks (JIN; FRUMKIN; YAN, 1999) and the PARSEC benchmark suite (BIENIA et al., 2008) on a machine with two NUMA nodes and software-managed TLBs (more information in Section 6.1.3). This machine was used because it provides highly accurate information about TLB misses.

Figure 5.1: Accuracy results obtained with different metrics for sharing-aware mapping (memory access sampling with 10⁻⁵% to 10⁻²% of samples, TLB misses, and TLB time) for the BT, FT, SP and Facesim benchmarks.

TLB time and TLB misses were captured directly from the TLB miss handler in the operating system, while memory accesses were sampled using Pin (BACH et al., 2010). We use all memory accesses as a baseline for accuracy, and we vary the number of samples from 10⁻⁵% to 10⁻²% of the total for comparison. As stated in Section 1.2, a realistic sampling-based mechanism (DIENER et al., 2015a) uses at most 10⁻³% of samples to avoid harming performance, and another sampling-based mechanism (DASHTI et al., 2013) specifies that it samples once every 130,000 cycles.

The accuracy obtained with the different methods for the benchmarks BT, FT, SP, and Facesim is presented in Figure 5.1. To calculate the accuracy, we check whether the NUMA node selected for each page of the application is the NUMA node that performed the most accesses to the page (the higher the percentage, the better). The execution time is calculated as the reduction compared to using 10⁻⁵% of samples (the bigger the reduction, the better). The accuracy results show that the TLB time provides estimations that are more accurate than all other tested methods. The differences are most noticeable with the BT and FT benchmarks, where the TLB time shows over 85% accuracy, while the other methods are all under 60%. One may also notice that accuracy increases with the number of memory access samples used, as expected. Nevertheless, most sampling mechanisms are limited to a small number of samples due to the high overhead of the techniques used for capturing memory accesses (AZIMI et al., 2009; DIENER et al., 2014), harming accuracy.

The results indicate that accurate information about the memory access behavior is important for a good mapping. Furthermore, as we showed in Section 1.2, a good mapping leads to better performance improvements. Our proposed metric provided the best accuracy, and has almost the same implementation complexity as the TLB misses metric. Given the benefits of using the TLB time over other methods, the next sections present a mechanism for capturing the TLB time and employing it for online data and thread mapping.

5.2 Overview of IPM

A high-level overview of the operation of the MMU, TLB and IPM is illustrated in Figure 5.2. On every memory access, the MMU checks if the page has a valid entry in the TLB. If it does, the virtual address is translated to a physical address and the memory access is performed. If the entry is not in the TLB, the MMU performs a page table walk and caches the entry in the TLB before proceeding with the address translation and memory access. IPM modifies the behavior of the MMU during a TLB miss. On every TLB miss, IPM stores a time stamp in the main memory in a structure called TLB Access Table. If a TLB entry must be evicted to store the new entry, IPM loads from the TLB Access Table the time stamp corresponding to the evicted TLB entry. By subtracting the loaded time stamp from the current time stamp, it knows how much time the evicted entry stayed cached in the TLB. IPM then uses this time difference to update memory sharing information in two structures: the Page History Table and the Sharing Matrix. Finally, IPM notifies the operating system if a page migration could improve performance.

5.3 Structures Introduced by IPM

IPM adds new control registers to the architecture, and stores control data in the main memory. The following structures were added to the main memory:

Page History Table (PHT) Contains memory access information related to each page. It stores 2 fields for each page: the Sharers Vector (SV) and the NUMA Vector (NV). The Sharers Vector holds the identifiers of the last threads that accessed the page. The NUMA Vector holds the affinity of each NUMA node to the page.
TLB Access Table (TAT) For each TLB entry cached in all TLBs, it stores the time stamp of the moment when the corresponding TLB entry was fetched by a page table walk. It is used to compute the TLB time used for mapping. The TLB Access Table also stores the ID of the thread that generated the TLB miss, kept by the operating system in a control register CRtid.
Sharing Matrix (SM) Stores the affinity between each pair of threads. The size of the Sharing Matrix is nt × nt, where nt is the number of threads per parallel application supported by IPM. This structure is used by the operating system to calculate the thread mapping.

Figure 5.2: Overview of the MMU and IPM. The flowchart shows the MMU handling a memory access (on a TLB hit, the address is translated and execution continues; on a TLB miss, an old entry is evicted and the new entry is fetched by walking the page table) and, in parallel, IPM loading the evicted entry from the TLB Access Table, saving the time stamp of the fetched entry in the TLB Access Table, calculating the time the evicted entry stayed in the TLB, updating the Sharing Matrix and Sharers Vector for thread mapping, updating the NUMA Vector for data mapping, and warning the operating system if a page migration is required.

Since each TLB entry has a corresponding entry in the TLB Access Table, IPM adds a unique static identifier to each TLB entry. This identifier is hardwired, not requiring any SRAM. The identifier is returned along with the physical address when the TLB is accessed, and is used as an index into the TLB Access Table (further details in Section 5.4.1).
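For illustration, the following C sketch shows one possible memory layout for these structures, assuming the configuration used later in our experiments (256 threads, 4 NUMA nodes, 2 Sharers Vector entries per page); the field names and the exact packing are illustrative, not a definitive hardware specification.

#include <stdint.h>

#define MAX_THREADS 256   /* threads tracked per parallel application */
#define NUMA_NODES    4   /* entries of the NUMA Vector */
#define SV_ENTRIES    2   /* last sharers kept per page */

/* TLB Access Table (TAT): one entry per TLB entry of every core,
 * indexed by the static TLB entry identifier. */
struct tat_entry {
    uint64_t ts  : 56;    /* time stamp (TSC shifted by CRshift) of the page table walk */
    uint64_t tid : 8;     /* thread that caused the TLB miss (from CRtid) */
};

/* Page History Table (PHT): one entry per memory page. */
struct pht_entry {
    uint8_t  sv[SV_ENTRIES];   /* Sharers Vector: last threads that accessed the page */
    uint16_t nv[NUMA_NODES];   /* NUMA Vector: affinity of the page to each NUMA node */
    uint8_t  unused[6];        /* padding to a power-of-two entry size (16 bytes) */
};

/* Sharing Matrix (SM): affinity between each pair of threads. */
static uint32_t sharing_matrix[MAX_THREADS][MAX_THREADS];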

5.4 Detailed Description of IPM

IPM is responsible for three tasks: computing the TLB time for different page entries and threads; detecting the page sharing for thread mapping; and discovering the memory affinity of pages to NUMA nodes for data mapping.

5.4.1 Calculating the TLB Time

The time spent in the TLB is the main information required by IPM to analyze the memory access pattern. To this end, the architecture must provide a counter that is incremented at a constant rate over time. Most architectures already provide such a counter, such as the x86-64 architecture with its time stamp counter (TSC), which can be read using the rdtsc instruction. Whenever a TLB miss happens and it requires the eviction of an older entry, IPM fetches from the TLB Access Table (TAT) the entry corresponding to the evicted page, and saves in registers the elapsed time and thread ID. This is shown in Equations 5.1 and 5.2, where TLBEntryID is the ID of the TLB entry introduced in Section 5.3.

Elapsed ← (TSC ≫ CRshift) − TAT[TLBEntryID].ts    (5.1)

Tid ← TAT[TLBEntryID].tid (5.2)

Since the time stamp has clock cycle precision, we can discard the lowest bits of the elapsed time with a negligible accuracy loss. The operating system can configure the number of bits to be discarded using the CRshift control register. After that, since the TLB Access Table entry of the evicted page is already stored in registers, IPM overwrites the TLB Access Table entry with the current time stamp, also shifted by CRshift, and with the ID of the thread currently running on the core, indicated by the CRtid control register.
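As a concrete illustration, the procedure of Equations 5.1 and 5.2 on a TLB eviction can be written as the following C sketch (a software model of the proposed hardware, reusing the structure layout sketched in Section 5.3; the control registers appear as plain variables, the table size is illustrative, and __rdtsc() stands for reading the time stamp counter):

#include <stdint.h>
#include <x86intrin.h>                     /* __rdtsc() on x86-64 */

struct tat_entry { uint64_t ts : 56, tid : 8; };
static struct tat_entry tat[512];          /* TLB Access Table (one slot per TLB entry) */
static uint64_t cr_shift = 13;             /* CRshift: low-order bits of the TSC to discard */
static uint64_t cr_tid;                    /* CRtid: thread currently running on the core */

/* Called when the TLB entry identified by tlb_entry_id is evicted to make
 * room for a newly fetched entry. */
static void on_tlb_eviction(unsigned tlb_entry_id,
                            uint64_t *elapsed, uint64_t *tid)
{
    uint64_t now = __rdtsc() >> cr_shift;

    *elapsed = now - tat[tlb_entry_id].ts;  /* Equation 5.1: shifted TLB time */
    *tid     = tat[tlb_entry_id].tid;       /* Equation 5.2: thread that fetched the entry */

    /* The slot is reused for the entry fetched by the page table walk
     * that triggered this eviction. */
    tat[tlb_entry_id].ts  = now;
    tat[tlb_entry_id].tid = cr_tid;
}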

5.4.2 Gathering Information for Thread Mapping

To perform thread mapping, we need to know the affinity between the threads. Threads with higher affinity should be mapped to cores close together in the memory hierarchy. To detect the affinity, IPM keeps the last threads that accessed each memory page in the Sharers Vector (SV) of the Page History Table. Whenever a TLB entry of page P is evicted, the corresponding Page History Table entry is read. After that, the Sharing Matrix (SM) is incremented by the elapsed time (Equation 5.1) in row Tid (Equation 5.2), for all the columns that correspond to an entry in SVP, as shown in Equation 5.3.

SM[Tid][SVP[i]] ← SM[Tid][SVP[i]] + Elapsed    (5.3)

Each line of the Sharing Matrix is accessed by its corresponding thread only, minimizing the impact of coherence protocols. Finally, IPM inserts thread Tid (Equation 5.2) into the SV of the evicted page P by shifting its elements, removing its oldest entry.

SVP[i] ← SVP[i − 1],  i ≥ 1    (5.4)

SVP[0] ← Tid    (5.5)

To calculate the thread mapping, the operating system applies a mapping algorithm to the Sharing Matrix. As in our other proposals, we used the EagerMap (CRUZ et al., 2015) mapping algorithm, which receives the Sharing Matrix and a graph representing the memory hierarchy as input, and outputs which core will execute each thread. In our experiments, this is done every 200 ms. To reduce the influence of old values in the Sharing Matrix, we apply an aging technique every time the mapping mechanism is called, multiplying all of its elements by 0.75. Values between 0.6 and 0.95 were also evaluated, but the results showed no sensitivity to this parameter.
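For clarity, the thread-mapping update of Equations 5.3 to 5.5 performed on each eviction is sketched below in C (again a software model; the structures follow the layout sketched in Section 5.3, and p, tid and elapsed come from the eviction handling of Section 5.4.1):

#define SV_ENTRIES  2
#define MAX_THREADS 256

struct pht_entry { unsigned char sv[SV_ENTRIES]; unsigned short nv[4]; };
extern struct pht_entry pht[];            /* Page History Table, indexed by page number */
static unsigned int sharing_matrix[MAX_THREADS][MAX_THREADS];

/* Update thread-mapping information for page p, whose TLB entry was fetched
 * by thread tid and stayed 'elapsed' (shifted) cycles in the TLB. */
static void update_thread_mapping(unsigned long p, unsigned tid,
                                  unsigned long elapsed)
{
    /* Equation 5.3: credit the TLB time to the affinity between tid and
     * every thread currently recorded in the Sharers Vector of page p. */
    for (int i = 0; i < SV_ENTRIES; i++)
        sharing_matrix[tid][pht[p].sv[i]] += elapsed;

    /* Equations 5.4 and 5.5: shift the Sharers Vector, dropping the oldest
     * sharer, and insert tid as the most recent one. */
    for (int i = SV_ENTRIES - 1; i >= 1; i--)
        pht[p].sv[i] = pht[p].sv[i - 1];
    pht[p].sv[0] = (unsigned char)tid;
}

The operating system then periodically reads the Sharing Matrix, applies EagerMap, and multiplies all of its elements by 0.75 to age old values, as described above.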

5.4.3 Gathering Information for Data Mapping

To perform data mapping, the operating system needs to know the affinity of each memory page to each NUMA node. In this way, the operating system can map pages to the nodes that most access them. To achieve this, IPM introduces the NUMA Vector (NV) in each entry of the Page History Table. The NUMA Vector has one counter per NUMA node, and each counter estimates the affinity of the corresponding page to that node. When a TLB entry corresponding to a page P is evicted and its Page History Table entry is loaded, IPM performs aging on each counter of the NUMA Vector, as in Equation 5.6. Aging is important to detect changes in the access pattern to a page. The higher the value of the CRaging control register, the less aging is performed.

NVP[i] ← NVP[i] − (NVP[i] ≫ CRaging)    (5.6)

After aging, the NUMA Vector element corresponding to the NUMA node where the thread is running is incremented by the elapsed time (Equation 5.1) of the evicted TLB entry, as shown in Equation 5.7. In this equation, n represents the NUMA node where the thread that generated the TLB miss is running. It is important to note that the counters of the NUMA Vector are saturating counters, such that they keep their maximum possible value in case of overflow.

NVP[n] ← NVP[n] + Elapsed    (5.7)

After incrementing the counters, IPM evaluates whether the evicted page is intensely used by a specific node and should be migrated. This is shown in Equation 5.8, where n is the node of the core executing the thread, m is the node that currently stores page P, and CRmig is a control register that helps decide whether a migration is necessary. We left shift NVP[m] by CRmig to avoid migrating pages that are accessed by several nodes. The higher the value of CRmig, the more difficult it is to migrate a page. If the condition of Equation 5.8 is true, then the page is intensely used by a node and IPM notifies the operating system to migrate page P to node n.

NotifyOS = true if NVP[n] > (NVP[m] ≪ CRmig), and false otherwise    (5.8)

It is important to mention that the initial value of each counter of the NUMA Vector should not be zero. A zero value would result in too many migrations early during the execution. The initial value was defined as in Equation 5.9, which is the maximum value not affected by aging, and is set by the operating system when initializing the page table entry.

NVP[i] = (1 ≪ CRaging) − 1    (5.9)
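The data-mapping side of the eviction handling (Equations 5.6 to 5.9) can be sketched in C in the same style; notify_os() is a placeholder for the notification sent to the operating system, and the control register values are the ones used in our experiments:

#define NUMA_NODES 4
#define NV_MAX 0xFFFFu                    /* NUMA Vector counters saturate at 16 bits */

struct pht_entry { unsigned char sv[2]; unsigned short nv[NUMA_NODES]; };
extern struct pht_entry pht[];            /* Page History Table, indexed by page number */

static unsigned cr_aging = 7;             /* CRaging */
static unsigned cr_mig   = 2;             /* CRmig   */

extern void notify_os(unsigned long page, int target_node);   /* placeholder */

/* Update data-mapping information for page p on a TLB eviction. 'node' is
 * the NUMA node of the core that fetched the entry, 'home' is the node that
 * currently stores page p. */
static void update_data_mapping(unsigned long p, int node, int home,
                                unsigned long elapsed)
{
    /* Equation 5.6: age every counter of the NUMA Vector. */
    for (int i = 0; i < NUMA_NODES; i++)
        pht[p].nv[i] -= pht[p].nv[i] >> cr_aging;

    /* Equation 5.7: credit the TLB time to the accessing node, saturating
     * instead of overflowing. */
    unsigned long v = pht[p].nv[node] + elapsed;
    pht[p].nv[node] = (v > NV_MAX) ? NV_MAX : (unsigned short)v;

    /* Equation 5.8: suggest a migration when the accessing node clearly
     * dominates the node that currently stores the page. */
    if (pht[p].nv[node] > ((unsigned long)pht[p].nv[home] << cr_mig))
        notify_os(p, node);
}

/* Equation 5.9: initial value of each NUMA Vector counter, the largest
 * value that is still unaffected by the aging of Equation 5.6. */
static unsigned short nv_initial(void) { return (1u << cr_aging) - 1; }

In the actual proposal these operations are performed by the MMU hardware in parallel to the application; the C form is only meant to make the equations concrete.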

5.5 Overhead

We implemented IPM in a way that handles only one TLB eviction at a time, to keep the hardware modifications limited. If a TLB eviction happens while IPM is already handling another eviction, the new eviction is discarded. Since TLB evictions are very frequent, we do not expect a significant impact on accuracy even when discarding a fraction of them, which is evaluated in Section 6.6. The overhead of IPM consists of storage space in the main memory, circuit area, and a performance overhead, since it operates during application execution. We calculated the overhead with the configuration shown in Table 6.1.

5.5.1 Memory Storage Overhead

The TLB Access Table, Page History Table and Sharing Matrix are stored in the main memory. Each entry of the TLB Access Table requires 8 bytes: 7 bytes to store the shifted time stamp counter, and 1 byte to store the thread ID. Each entry of the Page History Table requires 16 Bytes (we round up the size of the Page History Table entry to a power of two). The Page History Table space overhead is 0.4% relative to the total main memory using a 4 KByte page size. The Sharing Matrix requires 256 KByte, with each of its elements using 4 Bytes. In this configuration, IPM can track up to 256 threads per parallel application, supports 4 NUMA nodes, and there are 4 unused bytes per Page History Table entry. To support larger systems, we would only need to use these unused bytes, or use more space per Page History Table entry, allocating more NUMA Vector and Sharers Vector entries and a larger Sharing Matrix.
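As a sanity check of these numbers, under the configuration of Table 6.1 (4 KByte pages, 16-byte Page History Table entries, 256 threads, 4-byte Sharing Matrix elements):

Page History Table: 16 Byte / 4096 Byte ≈ 0.39% ≈ 0.4% of the main memory;
Sharing Matrix: 256 × 256 × 4 Byte = 256 KByte;
TLB Access Table: 8 Byte per TLB entry, e.g., (64 + 512) entries × 8 Byte = 4.5 KByte for the two TLB levels simulated in Pin.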

5.5.2 Circuit Area Overhead

In architectures with hardware-managed TLBs, IPM is implemented in the MMU of each core. It requires the addition of some registers, carry-lookahead adders and subtractors, shifters, and multiplexers. 64-bit registers are used to store the memory positions of the Sharing Matrix, Page History Table and TLB Access Table, while the control registers and the remaining registers store values using 16 bits or less. An implementation of IPM optimized for area and power (by reusing some of the circuits for the different equations described previously) requires less than 30,000 additional transistors per core, which represents an increase of less than 0.02% in a current processor. Additionally, a performance-oriented implementation of IPM (with replicated resources) can be built using less than double the number of transistors of the area-optimized implementation.

5.5.3 Execution Time Overhead

The additional hardware of IPM is not in the critical path, since it operates in parallel to the MMU, such that application execution is not stalled while IPM is operating. Therefore, the time overhead introduced by IPM consists of the additional memory accesses to update the TLB Access Table, Page History Table and Sharing Matrix, which depend on the TLB miss rate. To keep the overhead of these memory accesses low, IPM does not lock any structure before updating it. Any possible race condition would not cause the application to fail, only a slight reduction in accuracy. At the software level, the operating system introduces overhead when calculating the thread mapping and when migrating threads and pages. The measured execution time overhead from both hardware and software is shown in Section 6.6.

6 EVALUATION OF OUR PROPOSALS

In this chapter, we analyze and perform several experiments with our proposed mechanisms. We describe the methodology adopted, the accuracy of our detection mechanisms and how our mechanisms improve the performance and energy consumption. We also analyze the overhead of our proposals, and how they perform when varying several configuration parameters.

6.1 Methodology

In this section, we describe the experiments we performed to evaluate our proposals, including the environments and workloads employed. We used 3 types of machines to analyze our proposals in different scenarios. A full system simulator is used to evaluate the operation in machines with hardware-managed TLBs. We used a real machine with a software-managed TLB to prove that IPM (and our other proposals) can work not only in a simulator, but also in real machines. We performed a trace-based evaluation using real machines with hardware-managed TLBs to gain precise information regarding how our proposals affect the performance, cache misses, interchip traffic, and energy consumption in a real architecture, with more reliable results than in the simulator. Table 6.1 summarizes the parameters used.

6.1.1 Algorithm to Map Threads to Cores

In all of our proposals, we used the EagerMap (CRUZ et al., 2015) mapping algorithm to generate the thread mapping. It receives the Sharing Matrix and a graph representing the memory hierarchy as input, and outputs which core will execute each thread. This information is then used to migrate threads to their assigned cores. Other libraries and algorithms, such as METIS (KARYPIS; KUMAR, 1998), Zoltan (DEVINE et al., 2006), Scotch (PELLEGRINI, 1994) and TreeMatch (JEANNOT; MERCIER; TESSIER, 2014), could be used. As previously stated, EagerMap is more appropriate for online mapping. A detailed description of EagerMap, as well as a comparison to other mapping algorithms, is given in Appendix A.

6.1.2 Evaluation Inside a Full System Simulator

We implemented our proposals in the Simics simulator (MAGNUSSON et al., 2002), extended with the GEMS-Ruby memory model (MARTIN et al., 2005) and the Garnet interconnection model (AGARWAL et al., 2009). The simulated machine runs Linux and has 4 processors, each with 2 cores, with private L1 caches and an L2 cache shared by the cores of each processor. Each processor is on a different NUMA node. Cache latencies were calculated using CACTI (THOZIYOOR et al., 2008), and the memory timings were taken from JEDEC (JEDEC, 2012). Both the intrachip and interchip interconnection topologies are bidirectional rings.

Table 6.1: Configuration of the experiments.

SAMMU — Structure sizes: AC, AT: 32 bits; SV: 2x 8 bits; NV: 4x 4 bits. Sharing matrix: 256 threads, 4 Byte element size. Control registers: support up to 256 threads, Vadd = 2, NT = 10.
IPM — Structure sizes: SV: 2x 8 bits; NV: 4x 16 bits. Sharing matrix: 256 threads, 4 Byte element size. Control registers: CRshift: 13, CRaging: 7, CRmig: 2.
Pin — L1 TLB: 64 entries, 4-way, shared between 2 SMT cores. L2 TLB: 512 entries, 4-way, shared between 2 SMT cores.
Itanium — Processors: 4x Intel Itanium 9030 (Montecito), 2 cores. Caches/proc.: 2x 16 KByte L1, 2x 256 KByte L2, 2x 4 MByte L3. Main memory: 2 NUMA nodes, 16 GByte DDR-400, 16 KByte page size. Environment: Linux kernel 2.6.32, GCC 4.4.
Xeon32 — Processors: 2x Xeon E5-2650 (SandyBridge), 8 cores, 2-SMT. Caches/proc.: 8x 32 KByte L1, 8x 256 KByte L2, 20 MByte L3. Main memory: 2 NUMA nodes, 32 GByte DDR3-1600, 4 KByte page size. Environment: Linux kernel 3.8, GCC 4.6.
Xeon64 — Processors: 4x Xeon X7550 (Nehalem), 8 cores, 2-SMT. Caches/proc.: 8x 32 KByte L1, 8x 256 KByte L2, 18 MByte L3. Main memory: 4 NUMA nodes, 128 GByte DDR3-1333, 4 KByte page size. Environment: Linux kernel 3.8, GCC 4.6.
Simics — Processors: 4x 2 cores, 2.0 GHz, 32 nm. L1 cache/proc.: 2x 16 KByte, 4-way, 1 bank, 2 cycles latency. L2 cache/proc.: 1 MByte, 8-way, 2 banks, 5 cycles latency. TLB/proc.: 2x TLBs (64 entries), 4-way, 1 TLB per core. Cache coherency: directory-based MOESI protocol, 64 Byte lines. Main memory: 8 GByte DDR3-1333 9-9-9, 4 KByte page size. Interconnection: 1/40 cycles latency (intra/interchip), 64/16 Byte bandwidth (intra/interchip). Environment: Linux kernel 2.6.15, GCC 4.3.

We simulate the benchmarks with small input sizes due to simulation time constraints. To compensate for the small input sizes, we reduced the size of the cache memories and TLBs accordingly, as done in (CUESTA et al., 2013).

In Simics, we execute each test only once due to the simulation time. We compare our proposals to the following mapping policies:

Default (baseline) The default thread mapping policy is the Linux scheduler. The default data mapping policy is controlled by GEMS, which is interleaving.
Oracle The oracle mapping is generated by tracking all memory accesses in the simulator to keep a Sharing Matrix and a NUMA Vector. Every 100 ms, threads are migrated using the mapping provided by EagerMap, and pages are migrated to the nodes that most access them.

6.1.3 Evaluation Inside Real Machines with Software-Managed TLBs

Although our proposals are extensions to the current hardware of MMUs, it is possible to implement the behavior of LAPT and IPM in software in architectures that have a software-managed TLB. In these architectures, whenever a TLB miss happens, there is no hardware page table walker to fetch the page table entry from the main memory. Instead, the architecture generates a TLB miss interrupt that is handled by the operating system, which walks through the page table in software. The Itanium architecture (INTEL, 2010a) has a hybrid mechanism to handle TLB misses, where we can choose between a hardware-based manager called the Virtual Hash Page Table (VHPT) walker and handling the TLB misses in software. We implemented IPM in the Linux kernel version 2.6.32 for the Itanium architecture, configured to use a software-managed TLB. We refer to this machine as Itanium. The architecture also provides access to a cycle-accurate time stamp by reading the ar.itc register. The code that performs the same procedure as IPM was added to the data TLB miss handler of the kernel, which is implemented directly in Assembly. In this handler, we keep the Sharing Matrix and a list of memory pages that are candidates for migration, along with their destination nodes. In a kernel module, we create a kernel thread that verifies these structures every 100 ms and performs the migrations (a sketch of this kernel module is shown after the list of mapping policies below). In Itanium, we execute each test 10 times, and show a 95% confidence interval calculated with the Student's t-distribution. We use this machine to compare our proposals (although we implemented only IPM in Itanium) to the related work, because it is a real machine and we can run every mechanism, except the oracle mapping, natively. We compare our proposals to the following mapping policies:

Operating system (baseline) The thread mapping policy is the default Linux scheduler, with a first-touch data mapping.
Interleave We interleave the memory pages across the NUMA nodes. The thread mapping policy is the default Linux scheduler.
Oracle The oracle mapping is generated by tracking all memory accesses. It is similar to the oracle used in Simics, but uses Pin (BACH et al., 2010) to monitor all memory accesses, since the real machine does not provide such information. Then, when running the application, the oracle mapping is fed to a kernel module we developed that maps the threads and data.
NUMA Balancing The NUMA Balancing policy (CORBET, 2012b) was included in more recent versions of the Linux kernel. In this policy, the kernel introduces page faults during the execution of the application, and pages are migrated to the node that generated the page fault. The thread mapping policy is the default Linux scheduler. NUMA Balancing is not implemented in the kernel running in Itanium, such that we had to port its code to be able to evaluate its performance.
kMAF kMAF (DIENER et al., 2014) also works using page faults, but additionally performs sharing-aware thread mapping and keeps a history of page migrations to reduce ping-ponging of pages between nodes.
TLB misses Similarly to NUMA Balancing, we migrate pages to the NUMA node that generated a TLB miss on that page.
TLB misses – IPM Performs the same procedure as IPM, keeping a Sharing Matrix, Sharers Vector and a NUMA Vector, but using TLB misses instead of TLB time.
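The kernel module mentioned above can be sketched as follows, assuming the in-kernel kthread API; ipm_next_migration_candidate() and ipm_migrate_page() are hypothetical helpers that stand for, respectively, the list of candidate pages filled by the data TLB miss handler and the actual page migration code.

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/types.h>

/* Hypothetical helpers (not real kernel APIs): the first pops a candidate
 * (page frame, target node) recorded by the data TLB miss handler, the
 * second migrates that page to the target node. */
extern bool ipm_next_migration_candidate(unsigned long *pfn, int *node);
extern void ipm_migrate_page(unsigned long pfn, int node);

static struct task_struct *ipm_thread;

static int ipm_thread_fn(void *unused)
{
    unsigned long pfn;
    int node;

    while (!kthread_should_stop()) {
        /* Drain the candidate list filled by the TLB miss handler. */
        while (ipm_next_migration_candidate(&pfn, &node))
            ipm_migrate_page(pfn, node);
        msleep(100);    /* verify the structures every 100 ms */
    }
    return 0;
}

static int __init ipm_module_init(void)
{
    ipm_thread = kthread_run(ipm_thread_fn, NULL, "ipm_migrate");
    return IS_ERR(ipm_thread) ? PTR_ERR(ipm_thread) : 0;
}

static void __exit ipm_module_exit(void)
{
    kthread_stop(ipm_thread);
}

module_init(ipm_module_init);
module_exit(ipm_module_exit);
MODULE_LICENSE("GPL");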

6.1.4 Trace-Based Evaluation Inside Real Machines with Hardware-Managed TLBs

The experiments were performed using two different real machines. The first machine, Xeon32, consists of two NUMA nodes with one Intel Xeon E5-2650 processor per node, with a total of 32 virtual cores. The second machine, Xeon64, consists of four NUMA nodes with one Intel Xeon X7550 processor per node, with a total of 64 virtual cores. The machines run version 3.8 of the Linux kernel. Information about the hardware topology is gathered using Hwloc (BROQUEDIS et al., 2010b). Since our proposals are extensions to the current MMU hardware, we simulate their behavior with Pin (BACH et al., 2010). Pin is a platform to develop instrumentation tools through callback routines. We used Pin for this analysis because it is faster than a full system simulator. It analyzes the binary code and identifies the places where the analysis code needs to be inserted, which is done by a just-in-time compiler. The analysis code is provided by the programmer of the instrumentation tool. In our instrumentation tool that simulates our proposals, we instrumented Pin to monitor all memory accesses and the creation and destruction of threads. In the tool, we simulate the behavior of our proposals and the basic functions of the TLB. We do not simulate more hardware components in order to keep the simulation speed higher than that of Simics. The simulated TLB in Pin uses the same TLB topology as the real machines, with two TLB levels, where the L2 TLB is inclusive. To make it possible to evaluate the mappings in the real machines, the information about the thread and data mappings generated in Pin is fed into the mapping mechanism at runtime. To keep the execution inside the simulator synchronized with the execution in the real machine, we keep track of the barriers of the applications in the simulator. This is possible in applications that always execute the same number of barriers, in the same order, across different executions. Therefore, when the execution in the simulator and in the real machine reach the same barrier, their executions are in the same state. In the real machines, besides the performance, we measured L2 and L3 cache misses per thousand instructions (MPKI) and interchip interconnection (QuickPath Interconnect) traffic. In Xeon64, these measurements were made with the Intel PCM tool (INTEL, 2012b). In Xeon32, cache misses were measured using Intel PCM, and interchip interconnection traffic was estimated using Intel VTune by measuring the amount of cache-to-cache transfers and remote memory accesses (Intel PCM presented issues when monitoring interchip traffic in Xeon32).

We also measured both processor and DIMM energy consumption and calculated the amount of energy per instruction as an indicator of energy efficiency. Energy consumption was measured with Intel PCM using the Running Average Power Limit (RAPL) hardware counters (INTEL, 2012a) introduced in the Intel SandyBridge architecture. We only show energy results for Xeon32 because Xeon64 does not support RAPL. All experiments in the real machines were executed 30 times. We show average values as well as a 95% confidence interval calculated with the Student's t-distribution. We compare the results of our proposals to:

Operating system (baseline) The thread mapping policy is the default Linux scheduler, with a first-touch data mapping.
Random For the random mapping, we randomly generated a thread and data mapping for each execution.
Oracle The oracle mapping is generated by tracking all memory accesses. It is similar to the oracle used in Simics, but uses Pin (BACH et al., 2010) to monitor all memory accesses, since the real machine does not provide such information. Then, when running the application, the oracle mapping is fed to a kernel module we developed that maps the threads and data.

6.1.5 Presentation of the Results

In the figures containing results that compare our proposals to the related work, the results are normalized using Equation 6.1. We also show a 95% confidence interval calculated with the Student's t-distribution.

Result(x) = (Value(x) − Value(baseline)) / Value(baseline)    (6.1)
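For example, with hypothetical numbers, if the baseline executes in 100 seconds and one of the mechanisms executes in 88 seconds, Result = (88 − 100) / 100 = −12%, that is, a 12% reduction in execution time.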

6.1.6 Workloads

As workloads, we used the OpenMP implementation of the NAS parallel benchmarks (NPB) (JIN; FRUMKIN; YAN, 1999), v3.3.1, and the PARSEC benchmark suite (BIENIA et al., 2008), v3.0. We configured the benchmarks to run with one thread per virtual core, although some applications of PARSEC execute with multiple threads per virtual core. For the evaluation in the real machines, the evaluated applications must present the same sharing and page usage patterns across different executions, as well as keep the same memory address space, since the trace generated in Pin is used to guide the mapping decisions. For this reason, only the NAS applications (except DC) were executed on the real machines. DC and the applications from PARSEC usually do not keep the same memory address space due to dynamic memory allocation. This makes the information generated in Pin unreliable to guide mapping in future executions. Input sizes were chosen to provide similar total execution times and feasible simulation times. Regarding NAS, benchmarks BT, LU, SP and UA were executed using input size A in Pin, Xeon32 and Xeon64, and input size W in Simics. Benchmarks CG, EP, FT, IS and MG were executed using input size B in Pin, Xeon32 and Xeon64, and input size A in Simics. DC was executed with input size W in Simics. Regarding PARSEC in Simics, the input size used for Canneal was simmedium, and for all others simlarge. In the Itanium machine, DC was executed with input size A, all other NAS benchmarks with input size B, and the PARSEC benchmarks with the native input size. We also experiment with a real-world scientific application, Ondes3D (DUPROS et al., 2008). Ondes3D simulates the propagation of seismic waves using a finite-differences numerical method. The oracle mapping of Ondes3D was not generated by analyzing memory traces because the execution time of Ondes3D is too large to enable the generation of a trace. Therefore, the oracle mapping of Ondes3D is performed by manually adding the mapping routines to its source code. By exploiting the regular memory access pattern of finite-differences applications, we guarantee that the memory accessed by each thread is mapped nearby (DUPROS et al., 2010).

6.2 Comparison Between our Proposed Mechanisms

In Section 2.4, we defined several characteristics that are desirable for a mechanism that performs sharing-aware mapping. In this context, all of our proposals present all of these characteristics:

• Perform both thread and data mapping
• Support dynamic environments
• Support dynamic memory allocation
• Have access to memory addresses
• Low overhead
• Low trade-off between accuracy and overhead
• No manual source code modification
• Do not require sampling

Nevertheless, there are differences between our proposals. The main differences are:

Hardware/Software responsibilities In LAPT, the hardware is responsible only for analyzing which threads access each page, while the software must verify the NUMA nodes running the threads to check for page migrations. On the other hand, in SAMMU and IPM, the hardware already determines the best NUMA node for each page, so the software only has to perform the migrations.
Accuracy LAPT has the lowest accuracy among our proposals, since it uses information only regarding TLB misses. On the other hand, SAMMU has information about the amount of memory accesses, stored in the Access Counter of the TLB Access Table, and IPM has access to the time each page has its entry cached in the TLB, stored in the TLB Access Table.
Overhead LAPT presents the highest overhead because the operating system needs to iterate over the entire page table to verify if any page needs to be migrated. This is not required in SAMMU and IPM.
Flexibility LAPT is more flexible because the policy to check for page migrations can be easily changed by the operating system. In SAMMU and IPM, the operating system can only configure the amount of migrations by setting the control registers.
Implementation cost LAPT has the lowest implementation cost, as most of its analysis is performed at the software level. Between SAMMU and IPM, SAMMU has a higher cost because it requires one Access Counter for each entry in the TLB Access Table.

6.3 Accuracy of the Proposed Mechanisms

We analyze the accuracy of the sharing matrices and the detected data mappings.

6.3.1 Accuracy of the Sharing Patterns for Thread Mapping

The first results we show are the sharing matrices detected by our proposals, as well as the accuracy of the data mappings. We generated them using Pin, configuring the applications to create 32 threads. Regarding NAS, benchmarks BT, DC, LU, SP and UA were executed using input size A, and benchmarks CG, EP, FT, IS and MG were executed using input size B. Regarding PARSEC, the input size used was simlarge. The sharing matrices are shown in Figures 6.1 to 6.23. Axes represent thread IDs. Cells show the amount of accesses to shared data for each pair of threads. Darker cells indicate more accesses. We normalized the contents of each matrix to its own maximum value. In BT (Figure 6.1), IS (Figure 6.6), SP (Figure 6.9) and UA (Figure 6.10), our proposals were able to correctly detect the data sharing between neighboring threads. The data sharing between thread 0 and all other threads, which consists of an initialization or reduction pattern, as shown in our baseline (Figure 2.11), was not detected by our proposals. However, this does not represent an issue, because the detected neighboring data sharing is more relevant for thread mapping. As shown in our baseline using 1 and 2 sharers (Figure 2.18), the predominant pattern is the neighboring data sharing. Also, thread mapping considering neighboring data sharing has potential for performance improvements, while the initialization and reduction patterns cannot be improved. In MG (Figure 6.8), we were able to detect data sharing only between neighboring threads, while our baseline also indicates data sharing between distant threads (Figure 2.11h).

In LU (Figure 6.7), LAPT detected only the neighboring data sharing pattern, while SAMMU and IPM also detected data sharing between distant threads. Both patterns are correct: in our baseline, LU presented neighboring data sharing with up to 4 thread sharers per memory block (Figures 2.19a to 2.19c), and presents data sharing between both neighboring and distant threads with more sharers (Figures 2.19d to 2.19f). In Fluidanimate (Figure 6.17) and x264 (Figure 6.23), data sharing between near and distant threads was detected, in accordance with the baseline (Figure 2.11). In the applications following a pipeline parallelization model, Ferret (Figure 6.16) and Streamcluster (Figure 6.21), the detected pattern is very similar to the baseline. Dedup (Figure 6.14) also follows a pipeline model, but the detected patterns were slightly different compared to the baseline, although the sharing clusters can be clearly seen.

Blackscholes (Figure 6.11) and Facesim (Figure 6.15) presented data sharing between neighboring threads with our proposals, which was not present in the baseline (Figure 2.11), although Facesim showed some traces of it in our baseline with 32 KByte pages (Figure 2.13d). In this case, when the baseline presents a sharing pattern unsuitable for thread mapping, but our proposals detect another pattern, there is no problem in terms of performance. The reason for this is simple: if the performance of the application is not influenced by thread mapping, then any thread mapping we detect should not have any impact on performance. Due to this, we would only consider a detected pattern as wrong if the baseline pattern is considered good for thread mapping. Following this rationale, although the baseline of CG (Figure 2.11b) shows a very homogeneous all-to-all pattern, and our proposals show an all-to-all pattern with a considerable amount of noise, we do not consider this a problem. This reasoning can be extended to several other applications, more specifically DC, EP, FT, Bodytrack, Canneal, Freqmine, Raytrace, Swaptions and Vips.

6.3.2 Accuracy of the Data Mapping

Regarding data mapping, we need to analyze the accuracy of our proposals, verifying if the pages of the parallel applications were mapped to the correct NUMA node. By correct NUMA node, we mean the NUMA node that performed the most memory accesses to the corresponding page. To calculate the accuracy, we used Equations 6.2 and 6.3. In these equations, NP is the total number of pages, Access(p, n) is a function that returns the amount of memory accesses of NUMA node n to page p, Access(p) is a function that returns the total amount of memory accesses from each node to page p, and NodeMap(p) returns the node that stores page p. Pages that have more memory accesses have a higher influence on the accuracy.

Figures 6.1 to 6.23: Sharing patterns of BT, CG, DC, EP, FT, IS, LU, MG, SP, UA, Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Freqmine, Raytrace, Swaptions, Streamcluster, Vips and x264, respectively. Each figure shows the sharing matrices detected by (a) LAPT, (b) SAMMU and (c) IPM; axes represent thread IDs.

Figure 6.24 shows the accuracy of our proposed mechanisms and of the first-touch policy (the default policy of Linux), as well as the exclusivity level of each application (Section 2.1.3.2).

Correct(p, map) = 1 if map = node such that Access(p, node) = max(Access(p)), and 0 otherwise    (6.2)

Accuracy = [ Σ(p=1 to NP) Access(p) · Correct(p, NodeMap(p)) ] / [ Σ(p=1 to NP) Access(p) ]    (6.3)

On average, the accuracy of LAPT, SAMMU and IPM was 75.7%, 79.7% and 78.7%, respectively. The average accuracy of the first-touch policy was 63.1%. We can observe a clear relation between the exclusivity level and the accuracy of our proposals. Our proposals present a higher accuracy in applications that have a higher exclusivity level. The reason is that, in applications with higher exclusivity, the pages are mostly accessed by a single thread, decreasing the influence of the other threads. Thereby, finding the most appropriate NUMA node is easier in applications with higher exclusivity levels. It is also important to mention that a lower accuracy in applications with low exclusivity is not a problem, since these applications do not benefit from data mapping anyway. Comparing the average accuracy, the difference between our proposals and the first-touch policy ranges from 12.6% to 16.6%. Although the average difference to the first-touch is not very high, the difference in some applications is very significant. For instance, in BT, LAPT, SAMMU and IPM have an accuracy of 82.5%, 92.5% and 96.2%, respectively, while the accuracy of the first-touch policy was only 37.7%. This happens because the accuracy of the first-touch policy depends on whether the programmer of the parallel application initialized the data accordingly. In this context, it is important to highlight that several applications are programmed considering that the operating system uses the first-touch policy to map the data (JIN; FRUMKIN; YAN, 1999), which increases the accuracy of the first-touch.

Figure 6.24: Accuracy of the data mappings using our proposed mechanisms and the first-touch policy (default policy of Linux). The gray line over the bars represents the exclusivity level of the applications (Section 2.1.3.2).

However, we believe that this is not an appropriate solution for three reasons:

• It imposes the burden of the data mapping on the programmer, since the programmer has to modify the code in order to make each thread initialize its own data.
• In some applications, the optimal data mapping can change during execution, which is not supported by first-touch.
• If the operating system does not use first-touch, the data mapping will not be optimized.
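To make Equations 6.2 and 6.3 concrete, the following C sketch computes the data mapping accuracy for a toy per-page access matrix; the numbers are made up for illustration, while the real evaluation uses the per-page access counts collected with Pin.

#include <stdio.h>

#define NP    4      /* number of pages (toy example) */
#define NODES 2      /* number of NUMA nodes */

/* access_cnt[p][n]: memory accesses from node n to page p (illustrative values) */
static const long access_cnt[NP][NODES] = { {90, 10}, {5, 95}, {40, 60}, {70, 30} };
/* node_map[p]: node that stores page p after mapping (illustrative values) */
static const int node_map[NP] = { 0, 1, 0, 0 };

int main(void)
{
    double num = 0.0, den = 0.0;
    for (int p = 0; p < NP; p++) {
        long total = 0, max = 0;
        int best = 0;
        for (int n = 0; n < NODES; n++) {
            total += access_cnt[p][n];
            if (access_cnt[p][n] > max) { max = access_cnt[p][n]; best = n; }
        }
        int correct = (node_map[p] == best);   /* Equation 6.2 */
        num += (double)total * correct;        /* Equation 6.3, numerator */
        den += (double)total;                  /* Equation 6.3, denominator */
    }
    printf("Accuracy = %.1f%%\n", 100.0 * num / den);   /* prints 75.0% for these values */
    return 0;
}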

6.4 Performance Experiments

We evaluate the performance of our mechanisms in the different platforms separately.

6.4.1 Performance Results in Simics/GEMS

The performance results obtained in Simics are shown in Figure 6.25. We show the execution time in Figure 6.25a, the amount of interchip interconnection traffic in Figure 6.25b, and the amount of L2 cache misses per kilo instructions (MPKI) in Figure 6.25c. On average, execution time was reduced by 10.1%, 10.9% and 11.4% for LAPT, SAMMU and IPM, respectively. In the same order, interchip interconnection traffic was reduced, on average, by 66.6%, 66.8% and 57.1%, and the L2 misses were reduced by 12.1%, 12.5% and 8.9%. We can observe that the NAS applications presented better improvements than the PARSEC ones. The reason is that PARSEC applications are more influenced than NAS applications by factors other than memory locality, such as load balancing, which are not in the scope of this thesis. The application in which LAPT and SAMMU presented the best improvements was MG, reducing execution time by 31.0% and 29.0%. MG, as we can see in its sharing pattern (Figure 6.8), can benefit from both thread and data mapping. Due to that, we improve interchip interconnection traffic by 68.0% and 66.9%, respectively for LAPT and SAMMU, as well as L2 misses, by 11.8% and 8.8%. Regarding IPM, the best improvements happened in CG, where execution time was reduced by 33.9%, due to a substantial reduction of interchip traffic of 67.5% and of L2 misses of 25.5%. Although the sharing pattern of CG does not indicate that it is suitable for thread mapping, IPM reduced the amount of L2 misses because it performed fewer thread migrations than the other mechanisms. Every time a thread migrates to another core, the hardware transfers the data on demand to the cache memory of the new core, increasing the amount of cache misses. It is interesting to note that the application that had the highest reductions of both interchip traffic and L2 misses was DC, but DC was not the application with the highest execution time reduction.

Figure 6.25: Results in Simics/GEMS normalized to the default mapping (Equation 6.1): (a) execution time, (b) interchip traffic, (c) L2 cache MPKI. Lower is better.

For instance, with LAPT, the interchip traffic and L2 misses of DC were reduced by 99.8% and 89.4%, while its execution time was reduced by 30.2%, 0.8% less than in the best case (MG). This clearly shows that, besides memory locality, there are other very important parameters influencing the performance of the applications, although we change only the mapping mechanism between the different configurations. Due to this, in some cases the improvements do not necessarily correlate with the previously evaluated sharing patterns or exclusivity levels of the applications. One example is SP, in which, although its sharing pattern indicates that it is suitable for thread mapping (Figure 6.9), our proposals actually increase the amount of L2 cache misses due to the additional cache misses introduced by thread migrations. In some applications, no performance improvements are expected from either thread or data mapping. Swaptions is an example of such an application. Although its sharing pattern is similar to the one of CG, its memory usage is much smaller, as almost all of its data fits in the caches. Therefore, although IPM decreases interchip traffic by 10.0% in Swaptions, the absolute reduction is very small (about 9.5% of CG's), and performance is not affected by data mapping. The behavior of Swaptions with the other mechanisms is similar.

6.4.2 Performance Results in Itanium

The execution times in Itanium can be found in Figure 6.26. As previously explained, we only implemented IPM in Itanium. CG was the application with the best reduction in execution time (39.0%). Since the memory hierarchies are very different between the platforms, some applications present different results, such as DC and Ferret. The lack of hardware counters to monitor performance in Itanium makes it difficult to better understand some results, but one of the main reasons for the differences is the absence of shared caches in Itanium. Nevertheless, most results are very similar to the ones from Simics, where NAS applications such as CG, MG, SP and UA presented the best improvements. The results generated in Itanium show that IPM reduced execution time by an average of 13.7%, while the oracle mapping reduced execution time by 14.2% on average. The experiments with the Ondes3D application showed an execution time reduction of 30.8% using IPM, while the oracle mapping reduced execution time by 30.5%. It is important to emphasize that IPM was developed as a hardware extension and that this machine is emulating IPM's behavior with its software-managed TLB. This emulation ends up stalling the execution of the thread that generated the TLB miss, as IPM is executed on its core. This would not be an issue with the hardware implementation of IPM, as the MMU would operate in parallel to the application. We compare IPM on Itanium to several previously mentioned techniques: Interleave, NUMA Balancing (CORBET, 2012b), the kMAF affinity framework (DIENER et al., 2014), and mapping using TLB misses as a source of information. The improvements of kMAF are lower than those of IPM due to its sampling mechanism, as kMAF needs more time to detect the memory access behavior, losing opportunities for improvements. NUMA Balancing performed a little worse than kMAF mainly because of unnecessary page migrations, since NUMA Balancing keeps no history regarding page migrations, in contrast to kMAF. The overhead of adding more page faults in kMAF and NUMA Balancing to generate a better profile is high compared to the usage of TLB time in IPM. The usage of TLB misses alone with no history, similarly to NUMA Balancing, resulted in too many page migrations, harming performance in several applications. The amount of page migrations using this policy is much higher than in NUMA Balancing, since the amount of TLB misses is also much higher than the amount of page faults introduced by NUMA Balancing. On the other hand, using TLB misses as a source of information for the same procedures performed by IPM results in a very good performance, since it reduces the amount of unnecessary page migrations. Nevertheless, the usage of TLB time provided a clear advantage over TLB misses in some applications, such as BT, FT, SP and Facesim. As the implementation complexity of TLB time is almost the same as that of TLB misses, we believe the usage of TLB time is recommended.

Figure 6.26: Execution time in Itanium normalized to the operating system mapping (Equation 6.1), comparing Interleave, Oracle, NUMA Balancing, kMAF, TLB misses, TLB misses – IPM, and IPM. Lower results are better.

The comparison to the related work shows that mechanisms that perform both thread and data mapping are able to achieve better improvements than mechanisms that perform these mappings separately. It also shows that simple policies, such as the first-touch data mapping, do not guarantee a good performance for all applications, since they depend too much on the characteristics of the application and how the programmer initialized the data. The interleaved mapping is also a simple policy, and provided a high variance in the results. Furthermore, the results show that IPM, with its TLB time based metric, is able to achieve a higher accuracy and better performance than the other mechanisms.

Finally, the results obtained in Itanium demonstrate that IPM works not only in a simulator, but also in a real machine. Furthermore, these results indicate that IPM is able to improve performance not only for traditional benchmarks, but also for real applications (Ondes3D). Also, as explained in Section 6.1.6, the oracle mapping was generated by inserting mapping routines directly in the source code, which shows that IPM achieves improvements as good as manually optimized code. These statements can also be extended to LAPT and SAMMU, since they operate very similarly to IPM.

6.4.3 Performance Results in Xeon32 and Xeon64

The performance results in Xeon32 and Xeon64 can be found in Figures 6.27 and 6.28, respectively. For both machines, we show the execution time, the amount of interchip interconnection traffic, and the L2 and L3 cache MPKI. In Xeon32, CG was the application with the highest improvements, reducing execution time by 16.5%, 16.9% and 17.0% with LAPT, SAMMU and IPM, respectively. CG, as explained before, has a sharing pattern in which thread mapping is not able to improve performance, which is the reason why we only reduce interchip interconnection traffic (cache misses were actually increased). In Xeon64, SP was the application with the highest improvements, with a reduction of execution time of up to 35.9% with IPM. Due to its sharing pattern, both L3 cache misses and interchip traffic were reduced, by 61.2% and 64.1%, respectively.

We can observe how thread mapping affects data mapping by analyzing MG. The sharing pattern of MG is similar to that of SP, so thread mapping affects its performance. However, the memory usage of MG is much higher than that of SP, such that the amount of data shared by its threads is much larger than the L3 cache size, resulting in no reduction in L3 cache misses. Thread mapping improves data mapping in cases where pages are shared by multiple threads: a more appropriate thread mapping puts threads that share pages on the same NUMA node, thus reducing interchip traffic. Therefore, we are able to observe MG's potential for thread mapping by looking at interchip traffic, and not at cache misses.

Comparing the different metrics analyzed, we observe that the highest reduction occurred in the interchip interconnection traffic, with an average reduction of 37.2% to 40.5% in Xeon64, and 58.9% to 60.5% in Xeon32. Meanwhile, execution times were reduced by an average of 12.8% to 13.3% in Xeon64, and 5.8% to 6.4% in Xeon32. The effects in total execution time are smaller because they are influenced by several factors, as opposed to interchip traffic and cache misses, which are directly influenced by better mappings.

6.4.4 Summary of Performance Results

The results obtained in Simics, Itanium and Xeon64 are better than the results obtained in Xeon32. As Simics and Xeon64 have 4 NUMA nodes (while Xeon32 has 2), the probability of finding the correct node for a page without any knowledge of the memory access pattern is only 25% on them (while it is 50% on Xeon32). Regarding Itanium, although it also has only 2 NUMA nodes, a better mapping has a bigger impact because the latency of a remote memory access is much higher than in Xeon32. Most applications are more sensitive to data mapping than to thread mapping, which can be observed in the results by the fact that the interchip traffic presented a higher reduction than the cache misses. This happens because, even if an application does not share much data among its threads, each thread still needs to access its own private data, which can only be improved by data mapping. It is important to note that this does not mean that data mapping is more important than thread mapping, because the effectiveness of data mapping depends on thread mapping for shared pages. Our proposals presented results similar to the oracle mapping, demonstrating their effectiveness. They also performed significantly better than the random mapping in most cases. This shows that the gains over the operating system are not due to the unnecessary migrations introduced by the operating system, but due to a more efficient usage of resources.

Figure 6.27: Performance results in Xeon32 normalized to the operating system mapping (Equation 6.1): (a) execution time, (b) interchip traffic, (c) L2 cache MPKI, (d) L3 cache MPKI. Lower results are better.

Compared to other state-of-the-art mapping policies, our proposals presented better results for most applications.

6.5 Energy Consumption Experiments

Results of the energy consumption in Xeon32 are shown in Figure 6.29. The total energy consumption was reduced by an average of 5.6% to 6.1%, and by up to 14.3% in SP with IPM. Energy per instruction was improved by an average of 3.7% to 4.4%, and by up to 12.2% in SP with IPM. This shows that our mechanisms not only save energy by reducing the execution time, but also by providing a more efficient execution, which is an important goal for future Exascale architectures (TORRELLAS, 2009). We also measured the DIMM and processor energy separately. These measurements show a higher reduction of DIMM energy than of processor energy, of 9.6% and 5.2% on average, respectively, with IPM. This happens because a sharing-aware mapping has more influence on the memory than on the processor.

6.6 Overhead

Our proposals cause an overhead on the execution of the applications at the hardware and software levels. The overhead at the hardware level is due to the additional memory accesses to update the mapping information. At the software level, the overhead happens due to the calculation of the mapping and the migrations.

Figure 6.28: Performance results in Xeon64 normalized to the operating system mapping (Equation 6.1): (a) execution time, (b) interchip traffic, (c) L2 cache MPKI, (d) L3 cache MPKI. Lower results are better.

Figure 6.29: Energy consumption in Xeon32 normalized to the operating system mapping (Equation 6.1): (a) total energy consumption, (b) total energy per instruction, (c) package energy, (d) DIMM energy. Lower results are better.

We evaluate the hardware overhead by running our proposals without performing any migration, and compare the execution time to the baseline without our proposals. For the software overhead, we measure the time spent to calculate the mapping and perform the migrations. We only show the performance overhead in Simics and Itanium because the Xeon machines do not implement our proposals natively. The overhead is shown in Figure 6.30. In Simics, the average performance overhead caused by the hardware was 0.36%, 0.27% and 0.31% for LAPT, SAMMU and IPM, respectively. This overhead is due to the introduction of memory accesses: our proposals introduced, on average, up to 1.43% additional memory transactions (in SAMMU). Regarding the number of cycles required to handle each event (TLB miss, eviction, or AC saturation in the case of SAMMU), LAPT, SAMMU and IPM required 101, 109 and 255 cycles on average in Simics. IPM is the mechanism that takes longest to handle each event because it is the one that performs more memory accesses per event. This does not translate into a higher overhead because, as explained in the descriptions of the mechanisms, we implemented them to handle only one event per core at a time to keep the hardware simple, such that if a new event happens while another event is being handled, the new event is dropped. In the same mechanism order, they were able to handle 89.7%, 85.5% and 67.6% of all events. As we can see, IPM handles fewer events because it takes more time to handle each one.

Hardware Software

(a) Overhead of LAPT in Simics. (b) Overhead of SAMMU in Simics. (c) Overhead of IPM in Simics. (d) Overhead of IPM in Itanium.

Figure 6.30: Performance overhead on the hardware and software levels. Lower results are better.

overhead of the hardware part was 0.75%, and each event required 102 cycles on average.

In Simics, application execution is not stalled while our proposals handle an event. In Itanium, on the other hand, since IPM’s behavior was implemented in software in the data TLB miss handler of the kernel, application execution is stalled. The overhead at the software level was 0.40%, 0.29% and 0.32% for LAPT, SAMMU and IPM, respectively, on average in Simics. The average software overhead in Itanium was 1.19% for IPM. Some applications have a higher software overhead due to their high number of migrations, which could easily be controlled by limiting the number of migrations. These results show that our proposals present only a minor overhead.

6.7 Design Space Exploration

We analyzed how our proposals behave in environments with configurations different from the previous experiments. For SAMMU, we varied the cache memory size, the memory page size and the interchip interconnection latency. We only present the results varying these environment parameters for SAMMU because the other mechanisms showed the same tendency. We did not vary the internal parameters of SAMMU, such as the number of bits used to encode each data structure, because they are constrained by hardware limitations. Regarding LAPT, we varied the multiplication threshold used to determine if a page will be migrated, used in Equation 3.6 in Section 3.4. The analyses of SAMMU and LAPT were performed in Simics/GEMS.

In IPM, the size of each element of the NUMA Vector, CRshift and CRaging depend on each other. We decided to fix the size of each NUMA Vector element to 2 bytes because it has a large range, and at the same time does not impose a high memory usage in the Page History Table.

Regarding CRshift, a low value would make the elapsed time (Equation 5.1 in Section 5.4.1) too large, such that it would not fit in the NUMA Vector. On the other hand, a high value for

CRshift would make the elapsed time lose too much precision. Based on that, we empirically determined that a good value for CRshift is 13. Similarly to the parameters of SAMMU, these

fixed values have several constraints. After fixing the size of the NUMA Vector and CRshift, we analyzed how the aging (CRaging) and the migration threshold (CRmig) affect performance. The analysis of IPM was performed in Itanium. In the next sections, we show results varying these parameters. Only a subset of the applications was used in these experiments, since the tendency of the results is the same for most applications.
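To illustrate the trade-off around CRshift, the sketch below assumes, purely for illustration, that the elapsed time of Equation 5.1 is obtained by right-shifting a raw cycle difference by CRshift and saturating it to the 2-byte NUMA Vector element; the actual definition is given in Section 5.4.1 and may differ.

    #include <stdint.h>

    #define CR_SHIFT 13  /* empirically chosen value discussed in this section */

    /* Hypothetical helper: compress a raw cycle difference so that it fits in a
       16-bit NUMA Vector element. A small CR_SHIFT overflows the element for long
       TLB residency times; a large CR_SHIFT discards too many low-order bits and
       loses precision. */
    static inline uint16_t elapsed_time(uint64_t insert_cycles, uint64_t evict_cycles)
    {
        uint64_t shifted = (evict_cycles - insert_cycles) >> CR_SHIFT;
        return shifted > UINT16_MAX ? UINT16_MAX : (uint16_t)shifted;  /* saturate */
    }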

6.7.1 Varying the Cache Memory Size (SAMMU)

The cache memory size influences the performance improvements obtained with sharing-aware mapping because it affects the amount of cached shared data and the number of accesses to main memory. The previous experiments were performed with L2 cache memories with 1 MByte of capacity and a 5-cycle latency. We now show results for caches with 512 KBytes, 2 MBytes, and 4 MBytes, and latencies of 4, 6, and 6 cycles, respectively, as calculated using CACTI (THOZIYOOR et al., 2008). The results of varying the cache size are shown in Figure 6.31. The results follow the same pattern as the previous experiments: the reduction of execution time depends on the reduction of interchip interconnection traffic. We can observe that SAMMU was able to improve performance for all applications and all cache sizes. Our other proposals had similar results. This shows that our proposals can handle architectures regardless of their cache sizes.

512KB 1MB 2MB 4MB

(a) Execution time. (b) L2 MPKI. (c) Interchip traffic.

Figure 6.31: Varying L2 cache memory sizes in Simics/GEMS with SAMMU. Results are normalized to the default mapping with the corresponding cache size. Lower results are better.

6.7.2 Varying the Memory Page Size (SAMMU)

Another important parameter for SAMMU is the page size, since it influences the number of TLB misses and, thereby, evictions. In this context, the normalized execution time and interchip interconnection traffic measured with different page sizes are shown in Figures 6.32a and 6.32b (the previous experiments in Simics were performed with a 4 KBytes page size). We also show the exclusivity level (more details in Section 2.1.3.2) calculated in the simulations, in Figure 6.32d. The higher the exclusivity level of an application, the higher its potential for sharing-aware data mapping. The TLB miss rate is also shown, in Figure 6.32c. The TLB miss and eviction rates drop considerably when the page size increases, decreasing the number of updates to the Sharing Matrix and Page History Table. Since SAMMU detects the sharing pattern at the granularity of a page, increasing the page size can also lead to the detection of different patterns. The analysis of the exclusivity level shows that, for all applications except Ferret, the interference of accesses at different offsets is low when the page size is 1 KByte or 4 KBytes, but increases noticeably for larger page sizes. In relation to data mapping, the lower exclusivity level with larger page sizes tends to limit performance improvements, since there are more threads executing on cores from different NUMA nodes when the page size is larger, which is reflected in the lower reduction of interchip traffic and, thereby, of performance. In some cases, the best performance improvements happened with an intermediate page size, such as for MG. This happened because pages were migrated earlier during the execution, such that the benefits of the improved data mapping also took effect earlier. It is important to note that the memory footprint of the applications used in the simulations is very small due to the small input size required by simulation time constraints. Applications with

Memory page size: 1KB 4KB 16KB 64KB 256KB 1024KB 4096KB

(a) Execution time. (b) Interchip traffic. (c) TLB miss rate. (d) Exclusivity level.

Figure 6.32: Varying memory page sizes in Simics/GEMS with SAMMU. Results of Figures 6.32a and 6.32b are normalized to the default mapping with the corresponding page size. In Figures 6.32a and 6.32b, lower results are better.

higher memory footprints keep similar exclusivity levels with larger page sizes (DIENER et al., 2014). Therefore, SAMMU would maintain similar performance improvements with larger page sizes for applications that use large amounts of memory. Our other proposals presented results similar to SAMMU, showing that they can handle a wide range of page sizes.

6.7.3 Varying the Interchip Interconnection Latency (SAMMU)

The interchip interconnection latency influences the time it takes to send cache coherence messages, such as cache line invalidations, as well as cache line transfers between caches, affecting thread mapping. Regarding data mapping, the interchip interconnection latency influences the time it takes to perform remote memory accesses. Local memory accesses are not affected.

Interchip interconnection latency [cycles]: 10 25 40 55 70

(a) Absolute CPI of OS. (b) Absolute CPI of SAMMU.

Figure 6.33: Varying interchip interconnection latency in Simics/GEMS with SAMMU. Lower results are better.

In this section, we briefly evaluate how SAMMU behaves with latencies between 10 and 70 cycles (previous experiments used a latency of 40 cycles). The results obtained with the interchip interconnection latency variation are shown in Figure 6.33. For the absolute values, we chose to show cycles per instruction (CPI) instead of execution time due to the high difference in execution time between the applications. We can observe that the CPI measured using the operating system increases significantly with the increase of the interchip latency for most applications. On the other hand, the CPI obtained by SAMMU suffers a much lower impact from the latency increase. SAMMU is less influenced by this latency because it is able to reduce the number of coherence messages, cache-to-cache transfers and remote memory accesses. Our other proposals presented similar results.

6.7.4 Varying the Multiplication Threshold (LAPT)

In the previous experiments, we used Equation 3.6 to determine if a page should migrate to another node. The equation determines that a page migration happens only if the value of the counter of the node with the highest NVP is greater than or equal to twice the average value. In this section, we analyze the sensitivity of the mechanism to multiplication thresholds other than 2. We vary the threshold from 1 to 6 and evaluate the impact on execution time and on the number of page migrations. The execution time is shown in Figure 6.34a, and the number of migrations per page is shown in Figure 6.34b. We can observe the tendency that, when increasing the multiplication factor, the number of page migrations decreases. This is the expected behavior, since the higher the multiplication factor in Equation 3.6, the more accesses from a node are required to trigger a page migration. The applications most affected by the number of page migrations are CG and Ferret, where LAPT performs better with fewer page migrations.

Multiplication threshold: 1 2 3 4 5 6

(a) Reduction of execution time. (b) Number of page migrations per page.

Figure 6.34: Varying the multiplication threshold. Figure 6.34a is normalized to the default mapping, where lower results are better.

This happened not only due to the lower overhead of copying the pages across nodes, but also due to lowering its side effects, such as the cache misses and pollution (the address of the page changes when moving to another node) and the interconnection traffic of transferring the page. In CG, the best improvements happen when the threshold is set to 5. When the threshold is set to 6, the performance decreases for two reasons: (i) some important page migrations did not happen; (ii) pages take more time to migrate. In the other applications, the influence of the multiplication factor is very low.
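The migration test of Equation 3.6, generalized to an arbitrary multiplication threshold, can be sketched as follows; the per-node counter array and the function name are illustrative and do not reproduce LAPT's exact bookkeeping.

    /* Returns the destination node if the page should migrate, or -1 otherwise.
       A page migrates only when the node with the highest counter has at least
       `threshold` times the average counter value (threshold = 2 in the default
       configuration; this section varies it from 1 to 6). */
    static int migration_target(const unsigned counters[], int num_nodes, int threshold)
    {
        unsigned sum = 0, max = 0;
        int best = -1;
        for (int n = 0; n < num_nodes; n++) {
            sum += counters[n];
            if (counters[n] > max) { max = counters[n]; best = n; }
        }
        /* max >= threshold * (sum / num_nodes), rewritten without the division:
           max * num_nodes >= threshold * sum */
        return (best >= 0 && max * num_nodes >= (unsigned)threshold * sum) ? best : -1;
    }

A larger threshold makes the condition harder to satisfy, which is why the number of migrations per page in Figure 6.34b decreases as the threshold grows.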

6.7.5 Analyzing the Impact of Aging (CRaging in IPM)

The execution time varying the aging value is shown in Figure 6.35. The results are normalized to the results with CRaging set to 7. The higher the value of CRaging, the less aging is performed (Equation 5.6 in Section 5.4.3).

CG MG SP Facesim

Aging level (value of CRaging)

Figure 6.35: Normalized execution time varying the aging in IPM running in Itanium. The higher the CRaging, the less aging. Lower results are better.

In our experiments, we observed that an aggressive aging

(low CRaging) did not have a negative impact on performance, because the number of migrations did not increase significantly, as, in these applications, most memory pages are accessed by a single NUMA node (DIENER et al., 2014). On the other hand, we observed a negative impact on performance when using a conservative aging (high CRaging), because it delayed the start of data migrations too much, since the initial values of the NUMA Vector are set according to the aging (Equation 5.9 in Section 5.4.3).

6.7.6 Analyzing the Impact of the Migration Threshold (CRmig in IPM)

The execution time varying the migration threshold value is shown in Figure 6.36. The results are normalized to the results with CRmig set to 2. The higher the value of CRmig, the more difficult it is to migrate a page (Equation 5.8 in Section 5.4.3). We observed that an aggressive threshold (low CRmig) did not lead to too many migrations and therefore did not harm performance. The reason is the same as in the evaluation of the aging: in these applications, most memory pages are accessed by a single NUMA node (DIENER et al., 2014). However, conservative values of the threshold (high CRmig) made it too difficult to migrate the pages, losing opportunities to improve performance.

CG MG SP Facesim

Migration threshold (value of CRmig)

Figure 6.36: Normalized execution time varying the migration threshold in IPM running in Itanium. The higher the CRmig, the fewer migrations. Lower results are better.

7 CONCLUSIONS AND FUTURE WORK

We begin this chapter with the conclusions of this thesis, followed by a discussion of future work. Finally, we list the publications related to this thesis.

7.1 Conclusions

Memory locality is one of the most important aspects to be considered when designing a new architecture. In the past, when single-core architectures were common, memory locality was increased by adding a simple cache memory hierarchy. With the introduction of multicore architectures, the memory hierarchy had to evolve to be able to provide the necessary bandwidth to several cores operating in parallel. With this evolution, memory hierarchies started to present several caches at the same level, some levels shared by multiple cores and others private to a core. Another important step was the incorporation of the memory controller into the processor, with which multiprocessor systems started to present NUMA characteristics. Due to the introduction of such technologies, the performance of memory hierarchies, and of the systems as a whole, became even more dependent on memory locality. In this context, techniques such as sharing-aware thread and data mapping are able to increase memory locality.

Related work in the area of sharing-aware mapping has been proposed with a wide variety of characteristics and features. The majority of the proposals perform only static mapping, which can handle only applications whose memory access behavior remains the same across different executions. Most work also handles only thread or data mapping alone, not both together. The related work that was able to handle both thread and data mapping and to operate online, during the execution of the application, had a high trade-off between accuracy and overhead: to achieve a higher accuracy, they have to increase the overhead of their memory access behavior detection as well. With this analysis, we identified a gap in the state-of-the-art: there was no online mechanism that could achieve a high accuracy.

In this thesis, we proposed novel solutions to fill this gap. Our solutions make use of the MMU present in current processors, as the MMU already has access to information regarding all memory accesses performed by each core. The simplest mechanism we proposed is LAPT, which tracks the threads that access each page and estimates the number of pages shared by the threads. We also proposed SAMMU, which, besides what LAPT does, also estimates the number of accesses performed to each page from all NUMA nodes. Finally, we proposed IPM, which provides information similar to SAMMU, but has a simpler procedure and uses information regarding the time each page has its entry cached in the TLB instead of the number of memory accesses. With the proposed hardware support, our mechanisms are able to offer a high accuracy with a low performance overhead.

We have evaluated our mechanisms in three different environments: a full system simulator, a real machine with a software-managed TLB, and a trace-driven evaluation in two real machines with hardware-managed TLBs. Our mechanisms provided substantial performance improvements in all environments. In the full system simulator, LAPT, SAMMU and IPM reduced execution time by an average of 10.1%, 10.9% and 11.4%, respectively. In the same mechanism order, execution time was reduced by 5.8%, 6.3% and 6.4% in a real machine with two NUMA nodes and a hardware-managed TLB, and by 12.8%, 13.2% and 13.3% in a real machine with 4 NUMA nodes. As expected, we obtained better results with more NUMA nodes, since the memory hierarchy is more complex. In the real machine with a software-managed TLB, execution time was reduced by 13.7% on average.
Although this machine has only two NUMA nodes, its internode interconnection is very slow, such that a better mapping translates into more impressive gains.

Besides the execution time, we analyzed the reasons why our proposals achieved such an execution time reduction. The two aspects measured for this analysis were cache misses and interchip interconnection traffic. In the real machine with 2 NUMA nodes, L3 cache misses were reduced by an average of 23.6%, 27.1% and 26.8% for LAPT, SAMMU and IPM, respectively. In the real machine with 4 NUMA nodes, L3 cache misses were reduced by 27.6%, 33.4% and 33.0% on average. Cache misses were reduced due to the better usage of shared cache memories, storing data shared by several cores in the same cache. In the real machine with 2 NUMA nodes, interchip interconnection traffic was reduced by an average of 58.9%, 60.5% and 60.4% for LAPT, SAMMU and IPM, respectively. In the real machine with 4 NUMA nodes, interchip interconnection traffic was reduced by 37.2%, 40.5% and 40.2% on average. Interchip traffic was reduced due to fewer cache coherence transactions and fewer remote memory accesses. Another important aspect that was improved was energy consumption efficiency, especially in the current context of pursuing exascale computing. Energy consumption efficiency was improved by 3.7%, 4.3% and 4.4% on average. We compared our mechanisms to related work, where we showed that our mechanisms, which achieve a higher accuracy with a low overhead, are able to provide better improvements.

The improvements were tightly dependent on the characteristics of the parallel applications and architectures. In applications whose threads have a very uniform data sharing among them, or whose pages have a similar amount of accesses from all NUMA nodes (low exclusivity level), our proposals provided low or no improvements. This is expected, as no mapping in such applications is able to optimize any resource usage. On the other hand, in applications whose threads share lots of data within a subgroup of threads, or whose pages have most of their accesses from one NUMA node, our proposals presented high improvements. In such applications, our proposals reduced execution time by up to 39.0% (IPM running the CG application in the real machine that has a software-managed TLB).

7.2 Future Work

As future work, we intend to evaluate and propose solutions to improve memory locality targeting more recent systems, such as the Xeon Phi, which also feature different data access latencies between the cores. Also, we want to expand the mapping to consider other aspects of the architectures, such as the usage of functional units and load balancing, and to evaluate the mapping when running multiple parallel applications simultaneously. Another possibility is to consider mapping in heterogeneous architectures, in order to identify the characteristics of a parallel application and map its threads to the cores that best suit their characteristics.

7.3 Publications

The following papers were published during the Ph.D.:

1. Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, Laércio L. Pilla, Philippe O. A. Navaux. LAPT: A Locality-Aware Page Table for thread and data mapping. Accepted for publication in the Journal of Parallel Computing (PARCO), 2016.
2. Emmanuell D. Carreño, Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux. Automatic Communication Optimization of Parallel Applications in Public Clouds. Accepted for publication in the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2016.
3. Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Philippe O. A. Navaux. Communication in Shared Memory: Concepts, Definitions, and Efficient Detection. Accepted for publication in the Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016.
4. Eduardo H. M. Cruz, Matthias Diener, Philippe O. A. Navaux. Communication-Aware Thread Mapping Using the Translation Lookaside Buffer. Concurrency and Computation: Practice and Experience, v. 22, n. 6, 2015.
5. Eduardo H. M. Cruz, Matthias Diener, Laércio L. Pilla, Philippe O. A. Navaux. An Efficient Algorithm for Communication-Based Task Mapping. Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2015. (Best paper award)
6. Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Mohammad S. Alhakeem, Philippe O. A. Navaux, Hans-Ulrich Heiss. Locality and Balance for Communication-Aware Thread Mapping in Multicore Systems. Euro-Par, 2015.
7. Matthias Diener, Eduardo H. M. Cruz, Laércio L. Pilla, Fabrice Dupros, Philippe O. A. Navaux. Characterizing Communication and Page Usage of Parallel Applications for Thread and Data Mapping. Performance Evaluation, 2015.
8. Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux, Anselm Busse, Hans-Ulrich Heiss. Communication-Aware Process and Thread Mapping Using Online Communication Detection. Journal of Parallel Computing (PARCO), March 2015.
9. Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux. Locality vs. Balance: Exploring Data Mapping Policies on NUMA Systems. Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), March 2015.
10. Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, Philippe O. A. Navaux. Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols. Journal of Parallel and Distributed Computing (JPDC), v. 74, n. 3, March 2014.
11. Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, Laércio L. Pilla, Philippe O. A. Navaux. Optimizing Memory Locality Using a Locality-Aware Page Table. International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2014.
12. Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux, Anselm Busse, Hans-Ulrich Heiss. kMAF: Automatic Kernel-Level Management of Thread and Data Affinity. International Conference on Parallel Architectures and Compilation Techniques (PACT), August 2014.
13. Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux. Communication-Based Mapping using Shared Pages. IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2013.

The following papers are currently submitted and under review:

1. Eduardo H. M. Cruz, Matthias Diener, Laércio L. Pilla, Philippe O. A. Navaux. Hardware-Assisted Thread and Data Mapping in Hierarchical Multi-Core Architectures. ACM Transactions on Architecture and Code Optimization (TACO).
2. Eduardo H. M. Cruz, Matthias Diener, Laércio L. Pilla, Philippe O. A. Navaux. A Sharing-Aware Memory Management Unit for Online Mapping in Multi-Core Architectures. Euro-Par, 2016.

APPENDIX A THREAD MAPPING ALGORITHM

Parallel architectures introduce a complex hierarchy to allow efficient memory accesses and communication among tasks. Within a machine, the hierarchy consists of several cache memory levels (private or shared), as well as Non-Uniform Memory Access (NUMA) behavior, in which memory banks are divided into NUMA nodes. In clusters and grids, the hierarchy introduces routers, switches and network links with different latencies, bandwidths and topologies.

The different levels of the hierarchy influence the communication performance between the tasks of a parallel application (WANG et al., 2012). Communication through a shared cache memory or intra-chip interconnection is faster than communication between processors due to the slower inter-chip interconnections (CRUZ et al., 2014a). Likewise, communication within a machine is faster than the communication between nodes in a cluster or grid. In this context, the mapping of tasks to processing units (PUs) plays a key role in the performance of parallel applications (ZHAI; SHENG; HE, 2011). Tasks that communicate intensely should be mapped to PUs close together in the hierarchy.

The communication-based task mapping problem can be defined as follows (CRUZ et al., 2014a). Consider two graphs, one representing the parallel application, and one representing the parallel architecture. In the application graph, vertices represent tasks and edges represent the amount of communication between them. In the architecture graph, vertices represent the machine components, including the PUs, cache memories, NUMA nodes, network routers, switches, and others organized hierarchically, while edges represent the links’ bandwidth and latency. The task mapping problem consists of finding a mapping of the tasks in the application graph to the PUs in the architecture graph, such that the total communication cost is minimized.
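To make the problem statement concrete, the sketch below shows one possible representation of the two graphs and the cost that a mapping tries to minimize; the structures and the flattened PU-to-PU cost matrix are illustrative simplifications, not EagerMap's internal representation.

    /* Application graph: tasks are vertices, comm[i][j] is the edge weight
       (amount of communication between tasks i and j). */
    typedef struct {
        int       num_tasks;
        unsigned **comm;        /* num_tasks x num_tasks, symmetric */
    } app_graph_t;

    /* Architecture graph, flattened to a cost matrix: cost[a][b] models the
       latency/bandwidth penalty of communication between PUs a and b. */
    typedef struct {
        int       num_pus;
        unsigned **cost;        /* num_pus x num_pus */
    } arch_graph_t;

    /* Total communication cost of a mapping map[task] = PU; the task mapping
       problem asks for the map[] that minimizes this value. */
    static unsigned long long mapping_cost(const app_graph_t *app,
                                           const arch_graph_t *arch,
                                           const int map[])
    {
        unsigned long long total = 0;
        for (int i = 0; i < app->num_tasks; i++)
            for (int j = i + 1; j < app->num_tasks; j++)
                total += (unsigned long long)app->comm[i][j] * arch->cost[map[i]][map[j]];
        return total;
    }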

The complexity of finding an optimal mapping is NP-Hard (BOKHARI, 1981). Due to the high number of tasks and PUs, finding an optimal mapping for an application is infeasible. Heuristics are therefore employed to compute an approximation of the optimal mapping. However, current mapping algorithms still present a high execution time, since they were developed focusing on static mappings and are mostly based on complex analyses of the graphs. This reduces their applicability, especially for online mapping, since their overhead may harm performance.

In this appendix, we propose EagerMap (CRUZ et al., 2015), an efficient algorithm to generate communication-based task mappings. EagerMap is designed to work with symmetric tree hierarchies. It coarsens the application graph by grouping tasks that communicate intensely. This coarsening follows the topology of the architecture hierarchy. Based on observations of application behavior, we propose an efficient greedy strategy to generate each group. It achieves a high accuracy and is faster than other current approaches. After the coarsening, we map the group graph to the architecture graph.

A.1 Related Work

Previous studies evaluate the impact of task mapping considering the communication (WANG et al., 2012), showing that it can influence several hardware resources. In shared memory environments, communication-based task mapping reduces execution time, cache misses and interconnection traffic (CRUZ et al., 2014a). In the context of cluster and grid environments, mapping tasks that communicate to the same computing node reduces network traffic and execution time (BRANDFASS; ALRUTZ; GERHOLD, 2012; ITO; GOTO; ONO, 2013). Communication cost can also be minimized in virtualized environments (SONNEK et al., 2010), demonstrating its importance for cloud computing. Several mapping algorithms have been proposed to optimize communication. Most traditional algorithms are based on graph partitioning, such as Zoltan (DEVINE et al., 2006) and Scotch (PELLEGRINI, 1994). They use generic graphs to represent the topology, and have a complexity of O(N³) (HOEFLER; SNIR, 2011). Tree representations are used in TreeMatch (JEANNOT; MERCIER; TESSIER, 2014), which can lead to more optimized algorithms. However, the algorithm used in TreeMatch to group tasks has an exponential complexity because it generates all possible groups of tasks for each level of the memory hierarchy. Furthermore, TreeMatch does not support mapping multiple threads to the same PU. MPIPP (CHEN et al., 2006) is a framework to find optimized mappings for MPI-based applications. MPIPP initially maps each task to a random PU. At each iteration, MPIPP selects pairs of tasks to exchange PUs to reduce the communication cost as much as possible. The accuracy of MPIPP depends on the initial random mapping. Our previous work (CRUZ et al., 2014a) uses Edmonds' graph matching algorithm to calculate mappings, but is limited to environments where the number of tasks and PUs is a power of two.

A.2 EagerMap – Greedy Hierarchical Mapping

EagerMap receives two pieces of input, a communication matrix containing the amount of communication between each pair of tasks as well as a description of the architecture hierarchy, and outputs which PU executes each task. To represent the architecture hierarchy, we use a tree, in which the vertices represent objects such as PUs and cache memories, and the edges represent the links between them. Our task grouping is performed with an efficient greedy strategy that is based on an analysis of the communication pattern of parallel applications. Figure A.1 depicts the different communication patterns of several parallel benchmark suites (BAILEY et al., 1991; JIN; FRUMKIN; YAN, 1999; LUSZCZEK et al., 2005; BIENIA et al., 2008), which we obtained with the methodology described in Section A.3.1.2. We observe three essential characteristics in the communication behavior of the applications that need to be considered for an efficient mapping strategy:

1. There are two types of communication behavior: structured and unstructured communication. In applications with structured communication, each task communicates more with a subgroup of tasks, such that mapping these subgroups to PUs nearby in the hierarchy can improve performance. In Figure A.1, all applications except Vips show structured communication patterns. Our mapping algorithm is designed to handle structured communication patterns, because in applications with unstructured communication, there may not be a task mapping that can improve performance.

2. In applications with structured communication patterns, the size of the subgroups with intense internal communication is usually small when compared to the total number of tasks in the parallel application. For instance, in the communication pattern of CG-MPI (Figure A.1b), subgroups of 8 tasks communicate intensely, out of 64 tasks in total.

(a) BT-MPI. (b) CG-MPI. (c) LU-MPI. (d) MG-MPI. (e) LU-OMP. (f) MG-OMP. (g) SP-OMP. (h) UA-OMP. (i) Ferret. (j) Streamcluster. (k) Vips. (l) HPCC (Ph. 7).

Figure A.1: Example communication matrices for parallel applications consisting of 64 tasks. Axes represent task IDs. Cells show the amount of communication between tasks. Darker cells indicate more communication.

3. The amount of communication within each subgroup is much higher than the amount of communication between different subgroups.

In this section, we describe EagerMap in detail, give an example of its operation and discuss its complexity.

A.2.1 Description of the EagerMap Algorithm

The algorithm requires two variables to be initialized previously: nLevels and execElInLevel. nLevels is the number of shared levels of the architecture hierarchy plus two. This addition is required to create a level to represent the application tasks (level 0) and another level to represent the processing units (level 1). execElInLevel is a vector with nLevels positions. execElInLevel[0] is not used. execElInLevel[1] contains the number of processing units. For positions i, such that 1 < i < nLevels, the value is the number of hardware objects on the respective architecture hierarchy level. Hardware objects are cores, caches, processors and NUMA nodes, among others. For instance, execElInLevel[2] can be the number of cores, execElInLevel[3] the number of last level caches, and execElInLevel[nLevels − 1] the number of NUMA nodes. Since private levels of the architecture hierarchy are not important for our mapping strategy, we only consider the shared levels when preparing both nLevels and execElInLevel.
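As an illustration, for the machine used as the example in Section A.2.2 (2 machines, 4 L3 caches and 8 cores), the two variables would be set up roughly as follows; the initialization code is only a sketch, with the variable names taken from the pseudocode.

    enum { N_LEVELS = 4 };          /* level 0: tasks, 1: PUs, 2: L3 caches, 3: machines */

    static int nLevels = N_LEVELS;
    static int execElInLevel[N_LEVELS];

    static void init_hierarchy(void)
    {
        execElInLevel[0] = 0;       /* level 0 holds the tasks; value unused   */
        execElInLevel[1] = 8;       /* processing units (cores)                */
        execElInLevel[2] = 4;       /* last level (L3) caches                  */
        execElInLevel[3] = 2;       /* machines (the outermost shared level)   */
    }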

A.2.1.1 Top Level Algorithm

The top level mapping algorithm is shown in Algorithm A.1. The algorithm calculates the mapping for each level of the architecture hierarchy. The groups variable represents the groups of elements for the level being processed. The previousGroups variable represents the groups of elements of the previous level. First, the algorithm initializes groups with the application tasks (loop in line 2). Afterwards, it iterates over all levels of the architecture hierarchy, in line 8. After generating the groups of tasks for a level (line 10) (explained in Section A.2.1.2), the algorithm generates a new communication matrix (line 12, discussed in Section A.2.1.3). This step is necessary since we consider each group of tasks as the base element for mapping on the next hierarchy level. The groups variable implicitly generates a tree of groups. Level 0 represents the tasks. Level 1 represents groups of tasks. Level 2 represents groups of groups of tasks. In other words, on each level a new application graph is generated by coarsening the previous level. After the loop in line 8 finishes, the groups variable represents the hierarchy level nLevels − 1 and contains nElements elements. We set up rootGroup to point to these elements of the highest level (for loop in line 17). Finally, the algorithm maps the tree that represents the tasks, rootGroup, to the tree that represents the architecture topology, archTopologyRoot. This procedure is explained in Section A.2.1.4.

Algorithm A.1: MapAlgorithm: The top level algorithm of EagerMap. Input: commMatrixInit[][], nTasks Output: map[] LocalData: nElements, i, nGroups, rootGroup, commMatrix[][], groups[], previousGroups[] GlobalData: nLevels, execElInLevel[], hardwareTopologyRoot 1 begin 2 for i←0 ; i

A.2.1.2 Generating the Groups for a Level of the Architecture Hierarchy

The GenerateGroupsForLevel algorithm, described in Algorithm A.2, handles the creation of all groups for a given level of the architecture hierarchy. It expects the levels of the hierarchy up to the previously processed level to be already grouped, as defined in previousGroups. The maximum number of groups is the number of hardware objects of that level. The selection of which elements belong to each group is performed by GenerateGroup.

The GenerateGroup algorithm, in Algorithm A.3, groups elements that present a large amount of communication among themselves. For the grouping, the algorithm adopts a greedy strategy, which works as follows. Each iteration of the loop in line 2 adds one element to the group. The added element, expressed by the winner variable, is the one that presents the largest amount of communication relative to the elements already in the group. The chosen variable is used to avoid selecting the same element more than once. GenerateGroup can be parallelized in the loop of line 4, where each thread would compute its local winner in parallel. After the loop, the master thread would select the winner among the ones calculated by each thread.
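The greedy selection can be rendered in C as in the sketch below, which follows the strategy just described (repeatedly pick the unchosen element with the largest communication to the elements already in the group); it is an illustration, not a literal transcription of Algorithm A.3.

    /* Greedily pick `group_size` elements out of `total` elements, maximizing the
       communication to the elements already placed in the group. `comm` is the
       (total x total) communication matrix of the current hierarchy level and
       `chosen` marks elements already assigned to earlier groups. */
    static void generate_group(unsigned **comm, int total, int group_size,
                               char chosen[], int out_group[])
    {
        for (int i = 0; i < group_size; i++) {
            long long w_max = -1;
            int winner = -1;
            for (int j = 0; j < total; j++) {       /* candidate elements */
                if (chosen[j])
                    continue;
                long long w = 0;
                for (int k = 0; k < i; k++)         /* elements already in the group */
                    w += comm[j][out_group[k]];
                if (w > w_max) { w_max = w; winner = j; }
            }
            chosen[winner] = 1;
            out_group[i] = winner;
        }
    }

With the communication matrix of Figure A.3a, the first call selects tasks 0 and 1, matching the walkthrough of Figure A.4.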

Algorithm A.2: GenerateGroupsF orLevel: Generates the groups for a level of the ar- chitecture hierarchy. Input: commMatrix[][], nElements, level, previousGroups[], avlGroups Output: nGroups, groups[] LocalData: chosen[], elPerGroup, leftover, gi, inGroup, i, newGroup 1 begin 2 if nElements > avlGroups then 3 nGroups ← avlGroups; 4 else 5 nGroups ← nElements; 6 end 7 elPerGroup ← nElements / nGroups; 8 leftover ← nElements % nGroups; 9 for i←0 ; i 0 then 16 inGroup ← inGroup + 1; 17 leftover ← leftover - 1; 18 end 19 newGroup ← GenerateGroup(commMatrix, nElements, inGroup, chosen, previousGroups); 20 newGroup.nElements ← inGroup; 21 newGroup.id ← gi; 22 groups[gi] ← newGroup; 23 gi ← gi + 1; 24 end 25 return [nGroups, groups]; 26 end

A.2.1.3 Computing the Communication Matrix for the next Level of the Architecture Hierarchy

RecreateMatrix, described in Algorithm A.4, regenerates the communication matrix to be used for the next level of the architecture hierarchy. The new communication matrix has an order of nGroups and contains the amount of communication between the groups. It is calculated by summing up the amount of communication between the elements of different groups.

(a) Task group tree: tasks 0–15 grouped pairwise ({0,1}, {2,3}, ..., {14,15}), then into {0–3}, {4–7}, {8–11} and {12–15}, then into {0–7} and {8–15} under the root. (b) Architecture hierarchy tree: root, Machine 0 and Machine 1, L3 Cache 0–3, and PU0–PU7.

Figure A.2: Mapping 16 tasks in an architecture with 8 PUs, L3 caches shared by 2 PUs, 2 processors per machine, and 2 machines.

Algorithm A.3: GenerateGroup: Generates one group of elements that communicate. Input: commMatrix[][], totalElements, groupElements, chosen[], previousGroups[] Output: group LocalData: i, j, w, wMax, winners[], winner 1 begin 2 for i←0 ; i wMax then 11 wMax ← w; 12 winner ← j; 13 end 14 end 15 end 16 chosen[winner] ← 1; 17 winners[i] ← winner; 18 group.elements[i] ← previousGroups[winner]; 19 end 20 return group 21 end

Algorithm A.4: RecreateMatrix: Calculates the communication matrix for the next level. Input: commMatrix, groups[], nGroups Output: newCommMatrix[][] LocalData: i, j, k, z, w 1 begin 2 for i←0 ; i

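A sketch of this coarsening step is shown below; it assumes that each group stores the indices of its elements in the previous level, mirroring the role of group.elements in the pseudocode.

    /* Build the communication matrix of the next level: entry [a][b] is the sum
       of the communication between every element of group a and every element of
       group b in the current matrix. The diagonal (intra-group communication) is
       left at zero, since only communication between different groups matters at
       the next level. */
    static void recreate_matrix(unsigned **comm, int **groups, const int *group_len,
                                int n_groups, unsigned **new_comm)
    {
        for (int a = 0; a < n_groups; a++)
            for (int b = 0; b < n_groups; b++) {
                new_comm[a][b] = 0;
                if (a == b)
                    continue;
                for (int i = 0; i < group_len[a]; i++)
                    for (int j = 0; j < group_len[b]; j++)
                        new_comm[a][b] += comm[groups[a][i]][groups[b][j]];
            }
    }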

Algorithm A.5: MapGroupsT oT opology: Maps the group tree to the hardware topology tree. Input: hardwareObj, group, map[] LocalData:i 1 begin 2 if hardwareObj.type = ProcessingUnit then 3 for i←0 ; i 1 then 7 for i←0; i

A.2.1.4 Mapping the Group Tree to the Architecture Topology Tree

The algorithm that maps the group tree to the architecture topology tree is MapGroupsToTopology, detailed in Algorithm A.5. It is a recursive algorithm that performs one recursion step for each level of the architecture hierarchy. The stop condition of the recursion is reached at the lowest level of the architecture topology, the processing unit (line 2). If the stop condition is not fulfilled, the algorithm already knows that the maximum number of groups per level never exceeds the number of hardware objects of that level, as explained in GenerateGroupsForLevel. Therefore, if the level of the architecture hierarchy in the recursion is shared (line 6), the algorithm simply assigns one hardware object of the following level to each group and recursively calls itself for each element of the following level. Otherwise, the level is private and is not considered for mapping (line 10).
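The recursion can be sketched as follows; the two node types are hypothetical simplifications of the group tree and architecture tree, and private levels are assumed to have been pruned from the architecture tree beforehand.

    /* Hypothetical node types; the thesis builds these trees through the groups
       variable and the hardware topology description. */
    typedef struct group {
        int            id;          /* at level 0 this is the application task id */
        int            n_elements;
        struct group **elements;    /* sub-groups of the previous level           */
    } group_t;

    typedef struct hw_obj {
        int             is_pu;      /* non-zero for a processing unit             */
        int             pu_id;
        int             n_children;
        struct hw_obj **children;   /* hardware objects of the next shared level  */
    } hw_obj_t;

    /* Sketch of the recursive mapping: when a PU is reached, every task of the
       group is assigned to it; otherwise each sub-group receives one hardware
       object of the next level. */
    static void map_groups_to_topology(const hw_obj_t *hw, const group_t *grp, int map[])
    {
        if (hw->is_pu) {
            for (int i = 0; i < grp->n_elements; i++)
                map[grp->elements[i]->id] = hw->pu_id;
            return;
        }
        for (int i = 0; i < grp->n_elements && i < hw->n_children; i++)
            map_groups_to_topology(hw->children[i], grp->elements[i], map);
    }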

A.2.2 Mapping Example in a Multi-level Hierarchy

For a better understanding of how our mapping algorithm works, we present an example of mapping 16 tasks in a cluster of 2 machines, each consisting of 8 cores, with L3 caches shared by 2 cores and 2 processors per machine. Figure A.2b illustrates the hierarchy of the machine. Private levels, with arity 1, such as the L1 and L2 caches, as well as the processor, are not relevant. The global variable nLevels has the value 4, since there are 4 levels in the hierarchy. Therefore, execElInLevel has 4 positions. Position 0 represents the tasks, and its value is not used. Position 1 represents the processing units (the cores in this case), and its value is 8. Position 2 represents the L3 caches, and its value is 4. Finally, position 3 represents the machines, and its value is 2. Regarding Algorithm A.1 (MapAlgorithm), the loop in line 8 is executed 3 times. In the

(a) First Level. (b) Second Level. (c) Third Level.

Figure A.3: Communication matrices used in our example.

i | winners | winner | wMax | w0 | w1 | w2 | w3 | w4 | w5–15
0 | –       | 0      | 0    | 0  | 0  | 0  | 0  | 0  | 0
1 | 0       | 1      | 8    | –  | 8  | 4  | 2  | 1  | 0
2 | 0, 1    | –      | –    | –  | –  | –  | –  | –  | –

Figure A.4: Example of the GenerateGroup algorithm, when generating the first group in our example.

first iteration, it generates the groups of tasks that execute in each core. Since the number of tasks is 16 and the number of cores is 8, it generates 8 groups of 2 tasks each. The communication matrix used in the first iteration is shown in Figure A.3a. We demonstrate how the first group is generated in Figure A.4 (GenerateGroup, Algorithm A.3). When i is 0, the winners vector is empty and task 0 is selected as the first element of the group. When i is 1, the task that presents the largest amount of communication with the tasks in the winners vector is task 1. Therefore, task 1 is selected as winner and enters the group. Algorithm GenerateGroup then returns a group formed by tasks [0, 1], implicitly stored in the group.elements vector. GenerateGroupsForLevel (Algorithm A.2) repeats the same procedure until all groups of the level are formed. Then, the communication matrix for the next level, shown in Figure A.3b, is calculated by RecreateMatrix (Algorithm A.4). The second iteration of the loop in line 8 of MapAlgorithm behaves similarly, but selects which groups of tasks from the previous iteration share each L3 cache. Since the number of groups was 8 and there are 4 L3 caches, it generates 4 groups of groups of tasks with 2 elements in each group.

In the third iteration of the loop in line 8 of MapAlgorithm, it selects which groups of groups of tasks share each machine. The communication matrix illustrated in Figure A.3c is used. Since the number of groups generated in the previous iteration was 4 and there are 2 machines, it generates 2 groups of groups of groups of tasks, with 2 elements in each group. To complete the group tree, the rootGroup variable points to these 2 groups of groups of groups of tasks. Afterwards, in line 19 of MapAlgorithm, the task group tree shown in Figure A.2a is complete. MapGroupsToTopology (Algorithm A.5) maps the task group vertices to vertices of the architecture hierarchy. The mapping begins from the root node. Tasks [0−7] are mapped to Machine 0. Tasks [0−3] are mapped to L3 cache 0. Tasks [0−1] are mapped to PU 0 (core 0). The recursion continues until all groups of tasks are mapped to a PU (core). In this example, tasks x and x + 1, where x is even, are mapped to core x/2.

A.2.3 Complexity of the Algorithm

For the analysis of the complexity of the algorithm, we first introduce the variables used in the equations. E is the number of elements to be mapped in the current level of the architecture hierarchy (tasks or groups of tasks), which is different for each level. G is the number of elements per group. P is the number of processing units. N is the number of tasks to be mapped. L is the number of levels on the architecture hierarchy. To make it easier to calculate the complexity, we consider that P ≤ N. The complexity of GenerateGroup is:

\[ \sum_{i=1}^{G} \sum_{j=1}^{E} i \;=\; E \cdot \sum_{i=1}^{G} i \;\le\; E \cdot G^{2} \quad \text{(A.1)} \]

The complexity of GenerateGroupsForLevel is shown in Equation A.2. The number of groups is E/G. Also, E is an upper bound for G.

\[ \sum_{i=1}^{E/G} \mathit{GenerateGroup} \;\le\; \sum_{i=1}^{E/G} E \cdot G^{2} \;\le\; \frac{E}{G} \cdot E \cdot G^{2} \;\le\; E^{3} \quad \text{(A.2)} \]

The complexity of RecreateMatrix is O(E²). The complexity of MapGroupsToTopology is the same as performing a depth-first search, O(V + C), where V is the number of vertices and C is the number of edges. Since our groups variable implicitly forms a tree, we know that C = V − 1. The last level of this tree has N vertices, and the penultimate level has P vertices. The number of vertices of the other levels is the number of the previous level, divided by the number of sharers. The number of sharers is greater than 1 since we only keep track of shared levels. Therefore, V ≤ N + P + 2 · P = O(N). With this, the complexity of MapGroupsToTopology is O(N).

The complexity of the top level algorithm, MapAlgorithm, depends on all previous algorithms, as shown in Equation A.3. To calculate the complexity, we have to take into account that the value of E changes on each level of the architecture hierarchy.

\[ \sum_{i=0}^{L} \Big( \mathit{GenerateGroupsForLevel} + \mathit{RecreateMatrix} + \mathit{MapGroupsToTopology} \Big) \quad \text{(A.3)} \]

\[ = \sum_{i=0}^{L} \Big( O\!\left(E_i^{3}\right) + O\!\left(E_i^{2}\right) + O(N) \Big) \quad \text{(A.4)} \]

We can rewrite Equation A.4 as Equation A.5 by considering that the algorithm iterates L + 1 times (L levels of the architecture topology plus one level for the tasks) and that only shared topology levels are represented, which means that, in the worst-case scenario, the topology will be a binary tree with P leaves and a maximum number of vertices equal to 2P − 1.

\[ \le \sum_{i=1}^{L} \left[ O\!\left(\left(\frac{P}{2^{i-1}}\right)^{3}\right) + O\!\left(\left(\frac{P}{2^{i-1}}\right)^{2}\right) \right] + O(N^{3}) + O(N^{2}) + O(N) \quad \text{(A.5)} \]

\[ \le 2\,O(P^{3}) + O(P^{2}) + O(N^{3}) + O(N^{2}) + O(N) = 3\,O(N^{3}) + O(N^{2}) + O(N) = O(N^{3}) \quad \text{(A.6)} \]

With the simplifications shown in Equation A.6, we find that EagerMap has a polynomial complexity of O(N³).

A.3 Evaluation of EagerMap

In this section, we first show how we evaluate our proposal. Afterwards, we present the results obtained in its empirical evaluation.

A.3.1 Methodology of the Mapping Comparison

We discuss the benchmarks and architecture used in our evaluation, how we obtain the communication matrices, and other mapping strategies to which we compare EagerMap.

A.3.1.1 Benchmarks

For the evaluation of the mapping algorithms, we use applications from the MPI implementation of the NAS parallel benchmarks (NPB) (BAILEY et al., 1991), the OpenMP NPB implementation (JIN; FRUMKIN; YAN, 1999), the High Performance Computing Challenge benchmark (LUSZCZEK et al., 2005) and the PARSEC benchmark suite (BIENIA et al., 2008). The NAS benchmarks were executed with the B input size, HPCC with an input matrix of 4000² elements, and PARSEC with the native input size.

A.3.1.2 Generating the Communication Matrices

For the MPI-based benchmarks, we used the eztrace framework (TRAHAY et al., 2011) to trace all MPI messages sent by the tasks and built a communication matrix based on the number of messages sent between tasks. For the benchmarks that use shared memory for implicit communication through memory accesses, we built a memory tracer based on the Pin binary analysis tool (LUK et al., 2005), similarly to (BARROW-WILLIAMS; FENSCH; MOORE, 2009). This tool traces all memory accesses of the tasks. We build the communication matrix by comparing memory accesses to the same memory address from different tasks, incrementing the matrix on every match.
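The matrix update can be pictured with the sketch below, which keeps, for each hashed cache line, the last task that touched it and increments the pair's entry on a match; the hash table, the cache-line granularity and the fixed 64 tasks are simplifications for illustration, not the exact implementation used with Pin.

    #include <stdint.h>

    #define TABLE_BITS 20
    #define TABLE_SIZE (1u << TABLE_BITS)

    /* last_task[h] stores (task id + 1) of the last task that touched the address
       hashed to h; 0 means no access has been seen yet. */
    static int last_task[TABLE_SIZE];
    static unsigned long long comm[64][64];     /* communication matrix, 64 tasks */

    static void record_access(uint64_t addr, int task)
    {
        uint32_t h = (uint32_t)((addr >> 6) & (TABLE_SIZE - 1)); /* cache-line granularity */
        int prev = last_task[h] - 1;
        if (prev >= 0 && prev != task) {        /* two tasks touched the same data */
            comm[task][prev]++;
            comm[prev][task]++;
        }
        last_task[h] = task + 1;
    }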

A.3.1.3 Hardware Architecture

The hardware architecture used in our experiments is illustrated in Figure A.5, with 4 Intel Xeon X7550 processors for a total of 64 PUs. In this architecture, there are three possibilities for communication between tasks. Tasks running on the same core (case A) can communicate through the fast L1 or L2 caches and have the highest communication performance. Tasks that run on different cores (case B) have to communicate through the slower L3 cache, but can still benefit from the fast intra-chip interconnection. When tasks communicate across physical processors (case C), they need to use the slow inter-chip interconnection. Hence, the communication performance in case C is the slowest in this architecture.

Figure A.5: Hardware architecture used in our experiments.

A.3.1.4 Comparison

We compare EagerMap to (i) a Random mapping, which is an average result of 30 different random mappings; (ii) TreeMatch (JEANNOT; MERCIER; TESSIER, 2014) version 0.2.3; (iii) Scotch (PELLEGRINI, 1994) version 6.0; and (iv) MPIPP (CHEN et al., 2006).

A.3.2 Results

We evaluate the performance and quality of our algorithm, its stability, as well as the performance improvements obtained when mapping the tasks of the applications.

A.3.2.1 Performance of the Mapping Algorithms

Figure A.6 shows the execution time of the four mapping algorithms for all benchmarks. For 128 tasks, EagerMap is about 10 times faster than Scotch, 1,000 times faster than TreeMatch, and 100,000 times faster than MPIPP. TreeMatch has an exponential complexity, so a much higher execution time is expected for it. Due to this, the difference in execution time between TreeMatch and EagerMap increases with the number of tasks. MPIPP is much slower than the other mechanisms because it needs to perform several iterations to refine its initial random mapping. It did not finish executing when the number of tasks was higher than 128.

A.3.2.2 Quality of the Mapping

The quality of the calculated mapping determines the benefits that can be achieved. Quality is measured by the amount of locality achieved. We use Equation A.7 to calculate the quality, which is provided in the source code of TreeMatch. In this equation, n is the number of tasks, M[i][j] is the amount of communication between tasks i and j, map[x] is the processing unit

EagerMap TreeMatch Scotch MPIPP

Figure A.6: Execution time (in ms) of the mapping algorithms, for different numbers of tasks to be mapped. Lower results are better.

(PU) to which task x is mapped, and lat[a][b] is the latency between PUs a and b in the hierarchy. We calculated the latencies using LMbench (MCVOY; STAELIN, 1996).

\[ \mathit{MappingQuality} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{M[i][j]}{lat[map[i]][map[j]]} \quad \text{(A.7)} \]
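Equation A.7 maps directly to code; in the sketch below, lat is assumed to be a symmetric PU-by-PU latency matrix (measured with LMbench in our setup), and M and map follow the definitions above.

    /* Mapping quality of Equation A.7: communication divided by the latency of
       the PUs the two tasks were mapped to, summed over every pair of tasks.
       Higher values mean that heavily communicating tasks ended up on close PUs. */
    static double mapping_quality(unsigned **M, double **lat, const int *map, int n)
    {
        double quality = 0.0;
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++)
                quality += (double)M[i][j] / lat[map[i]][map[j]];
        return quality;
    }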

Figure A.7 presents the mapping quality results for the communication matrices previously illustrated in Figure A.1. Applications with more structured communication patterns presented the highest improvements, as expected. CG-MPI, LU-MPI and HPCC (Phase 7) presented the highest improvements in accuracy because their communication can be easily optimized by mapping neighboring tasks together. In BT-MPI and MG-MPI, both near and distant tasks communicate, presenting more challenges for the mapping algorithm. This happens because if neighboring tasks are mapped together, the communication between distant tasks does not improve. Likewise, mapping distant tasks together does not improve the communication between neighboring tasks. In applications with less structured patterns, such as Streamcluster, a lower improvement in accuracy is expected because the ratio between the communication of tasks that communicate more and tasks that communicate less is lower. Vips is the only application with unstructured communication, such that no task mapping was able to optimize its communication. The quality obtained with MPIPP is similar to the random mapping, since MPIPP is based on refining an initial random mapping. Although MPIPP can work with few tasks, when the number of tasks increases, the number of possible permutations grows enormously, making it more difficult to find new combinations that improve the initial solution. EagerMap, TreeMatch and Scotch presented similar results for all applications. This result demonstrates that EagerMap is able to achieve results as good as more complex algorithms due to the characteristics of the communication patterns we observed in Section A.2.

EagerMap TreeMatch Scotch MPIPP

Figure A.7: Comparison of the mapping quality, normalized to the random mapping. Higher values are better.

EagerMap TreeMatch Scotch MPIPP

Figure A.8: Comparison of the mapping stability. Number of migrations depending on the percentage change of the communication matrix for an application consisting of 64 tasks. Fewer migrations are better.

A.3.2.3 Mapping Stability

Another important metric to consider for mapping algorithms is the number of migrations performed due to changes in the communication matrix, which we call stability. A high stability is important for online mapping techniques. In algorithms with low stability, small changes in the communication lead to unnecessary task migrations. When the stability of the algorithm is higher, larger differences in the communication matrix are required to migrate tasks. The stability results are shown in Figure A.8 for the BT-MPI benchmark running with 64 tasks. The x-axis corresponds to the percentage difference of a matrix relative to a base matrix. A value of k% indicates that all values of the matrix were changed randomly up to a limit of k% of the maximum of the matrix. Since the communication pattern itself does not change, ideally there should be no migrations. The y-axis shows the number of task migrations resulting from the changes. Lower numbers of migrations are better. The results show that our proposed algorithm has a higher stability than previous mechanisms. In this way, EagerMap introduces less overhead than the other algorithms, since they perform unnecessary task migrations.

A.3.2.4 Performance Improvements

We executed the applications using the mappings obtained with the algorithms. The execution time results for the MPI and OpenMP NAS benchmarks are shown in Figure A.9. For these applications, since their communication pattern is stable, we calculated the mapping statically. As a case study for online mapping, we used the HPCC benchmark, shown in Figure A.10, since it contains 16 phases with different communication behaviors. For each phase, we call the mapping algorithm and migrate tasks accordingly. Execution time results correspond to the average of 30 executions, and are normalized to the execution time of the original policy of the operating system. The confidence interval was less than 1% for all algorithms.

EagerMap TreeMatch Scotch MPIPP

Figure A.9: Performance improvement compared to the OS mapping of the NAS-MPI and NAS-OMP benchmarks. Higher results are better.

Execution time Mapping Overhead

OS EagerMap TreeMatch Scotch MPIPP

Figure A.10: Execution time of HPCC using online mapping. Lower results are better.

In general, applications with more structured communication patterns present higher performance improvements. For instance, CG-MPI's performance was significantly improved when compared to the OS mapping, since its communication pattern shows a high potential for mapping, as discussed in Section A.3.2.2. On average, EagerMap improved performance by 10.3%, while TreeMatch, Scotch and MPIPP showed improvements of 9.6%, 8.7% and 8.4%, respectively. The results of the HPCC benchmark (Figure A.10) show a larger improvement by EagerMap than by the related work. Also, the lower overhead of our algorithm does not harm the application performance as much as TreeMatch and MPIPP do.

A.4 Conclusions

In parallel applications, the mapping of tasks to cores influences the communication performance. By mapping tasks that communicate to cores nearby in the memory hierarchy, or to the same node in clusters or grids, performance is improved by making use of faster interconnections. The mapping algorithm selects which core will execute each task and plays a key role in this type of mapping.

In this thesis, we proposed a new mapping algorithm, EagerMap. In contrast to previous work, it adopts a more efficient method to select which tasks should be mapped together, based on an analysis of the communication pattern of parallel applications. We performed experiments with a large set of benchmarks with different communication characteristics. Results show that EagerMap calculated better task mappings than the state-of-the-art, with a drastically lower overhead and better scaling. Furthermore, its high stability makes EagerMap more suitable for online mapping. For the future, we will extend EagerMap to support arbitrary hardware hierarchies, not only those based on trees. EagerMap is licensed under the GPL and is available at .

APPENDIX B SUMMARY IN PORTUGUESE

In this chapter, we present a summary of this thesis in the Portuguese language, as required by the PPGC Graduate Program in Computing. Neste capítulo, é apresentado um resumo desta tese na língua portuguesa, como requerido pelo Programa de Pós-Graduação em Computação.

B.1 Introdução

O paralelismo a nível de threads tem aumentado em arquiteturas modernas devido ao maior número de núcleos por processador e de processadores por sistema. Devido a isto, a complexidade das hierarquias de memória também aumenta. Tais hierarquias de memória podem incluir vários níveis de cache, privadas ou compartilhadas, bem como tempos de acesso não uniformes a bancos de memória (arquiteturas NUMA). Um desafio importante para essas arquiteturas é a movimentação de dados entre os núcleos, caches e bancos de memória, que ocorre quando um núcleo realiza uma operação de memória. Neste contexto, a redução da movimentação de dados é uma meta importante para arquiteturas futuras, para manter o escalonamento do desempenho e diminuir o consumo de energia (BORKAR; CHIEN, 2011; COTEUS et al., 2011). Uma das soluções para reduzir a movimentação de dados é melhorar a localidade dos acessos à memória através do mapeamento de threads e dados (TORRELLAS, 2009; FELIU et al., 2012).

Os mecanismos de mapeamento presentes no estado-da-arte tentam aumentar a localidade mantendo threads que compartilham um grande volume de dados próximas na hierarquia de memória (mapeamento de threads) e mantendo os dados que as threads acessam próximos destas threads (mapeamento de dados). O desempenho e a eficiência energética são aumentados por três razões principais. Em primeiro lugar, as faltas de cache são reduzidas, diminuindo o número de invalidações de linhas de cache de dados compartilhados (MARTIN; HILL; SORIN, 2012) e reduzindo a duplicação de linhas de cache em várias caches (CHISHTI; POWELL; VIJAYKUMAR, 2005). Em segundo lugar, a localidade dos acessos à memória é aumentada através do mapeamento dos dados para o nó NUMA onde são mais acessados (DIENER et al., 2014; RIBEIRO et al., 2009). Em terceiro lugar, a utilização das interconexões do sistema é melhorada pela redução do tráfego nas interconexões lentas entre os processadores, utilizando as interconexões dentro dos processadores, que são mais eficientes. A falha em identificar o padrão de acesso à memória de uma aplicação pode levar a mapeamentos ruins que reduzem o desempenho e a eficiência energética.

Nesta tese de doutorado, são propostos mecanismos para realizar o mapeamento de threads e dados a fim de melhorar a localidade dos acessos à memória. Os mecanismos utilizam a infraestrutura da unidade de gerência de memória dos processadores (MMU), já que todos os acessos à memória são realizados por ela. Os mecanismos permitem que o padrão de acesso à memória seja identificado dinamicamente (durante a execução da aplicação). O sistema operacional, também dinamicamente, utiliza as informações geradas pelos mecanismos e realiza o mapeamento de threads e dados. A precisão dos mecanismos aqui propostos é maior do que a de técnicas de trabalhos relacionados, além de apresentarem uma menor sobrecarga. Isto acontece porque os mecanismos aqui propostos são híbridos, tendo um suporte em hardware que permite o monitoramento dos acessos à memória de forma transparente. Este capítulo está organizado da seguinte forma. A Seção B.2 analisa o mapeamento e o estado-da-arte em maiores detalhes. A Seção B.3 explica os mecanismos propostos nesta tese. A Seção B.4 descreve os experimentos. A Seção B.5 expõe as conclusões e os trabalhos futuros.

B.2 Visão Geral Sobre Mapeamento Baseado em Compartilhamento

O mapeamento de threads é uma técnica que possibilita o aumento de desempenho em aplicações paralelas através de uma alocação mais eficiente dos recursos, neste caso, da hierarquia de memória. É necessário coletar informações sobre o padrão de compartilhamento de dados das aplicações, o que tem uma dificuldade variável dependendo do paradigma de programação paralela adotado. Quando são utilizados paradigmas de programação orientados a passagem de mensagens, a descoberta do padrão de troca de dados é menos desafiadora, já que a troca é explícita, sendo feita através de chamadas a funções. Entretanto, com o uso de programação paralela para memória compartilhada em ambientes multiprocessados, a tarefa de descoberta do padrão de troca de dados se torna mais difícil, pois a troca é implícita e ocorre através do acesso a regiões de memória compartilhadas por diferentes threads. O foco dos mecanismos propostos nesta tese são aplicações baseadas no paradigma de memória compartilhada.

Escalonar as threads que compartilham memória em núcleos que compartilham uma mesma memória cache propicia um desempenho melhor do que quando nenhuma memória cache é compartilhada. Além disso, o tempo para dois núcleos dentro de um mesmo processador se comunicarem é menor do que quando as threads se encontram em processadores separados. Essas diferenças de desempenho se devem ao fato de que, além de um melhor aproveitamento de espaço nas memórias cache, um menor número de invalidações deverá ocorrer a cada modificação dos dados. Dessa forma, é reduzida a sobrecarga imposta pelos protocolos de coerência de cache.

Em relação a arquiteturas com características de acesso não uniforme à memória (NUMA), além do mapeamento de threads, também é importante mapear os dados. O mapeamento de dados é necessário em máquinas NUMA porque as latências de acesso aos bancos de memória são diferentes entre os núcleos. Os núcleos são divididos em grupos, em que cada grupo é um nó NUMA. Cada nó NUMA tem seus próprios bancos de memória. Quando um acesso é realizado a um banco de memória localizado no mesmo nó NUMA, este tipo de acesso é denominado acesso local. Quando um acesso é realizado a um banco de memória de outro nó NUMA, o acesso é denominado remoto.

Um ponto importante a ser levado em consideração é a usabilidade dos modelos propostos. Algumas técnicas exigem etapas custosas de profiling, desencorajando o seu uso. Outras necessitam que os programadores das aplicações insiram anotações no código-fonte, aumentando a complexidade da programação. Tais técnicas baseadas em anotações no código-fonte podem diminuir sua portabilidade, já que frequentemente as anotações são dependentes da arquitetura alvo. Além disso, depender de anotações no código-fonte diminui a confiabilidade do sistema, pois programadores inexperientes podem inserir anotações equivocadas.

O mapeamento pode ser estático ou dinâmico. No mapeamento estático, análises prévias podem ser feitas com a aplicação. Etapas custosas, como simulação, são utilizadas para monitorar os acessos à memória e descobrir o padrão de compartilhamento. No mapeamento dinâmico, o padrão de compartilhamento deve ser descoberto durante a execução da aplicação. Ele deve apresentar baixa sobrecarga e não pode interferir no comportamento da aplicação.

Fazendo uma análise do estado-da-arte, a maioria dos trabalhos realiza apenas o mapeamento de threads ou o de dados, mas não os dois juntos. No caso dos mecanismos de mapeamento de threads, eles não são capazes de reduzir a quantidade de acessos remotos em arquiteturas NUMA. Por outro lado, os mecanismos de mapeamento de dados não são capazes de reduzir faltas de cache nem de lidar corretamente com o mapeamento de páginas compartilhadas por múltiplas threads. A maioria dos mecanismos que realizam ambos os mapeamentos tem várias desvantagens. Cruz et al. (2011) é um mecanismo estático que depende de informações de execuções anteriores, sendo limitado a aplicações cujo comportamento não muda entre execuções ou durante a execução. ForestGOMP (BROQUEDIS et al., 2010a) é dinâmico, mas exige anotações no código-fonte por parte do programador para funcionar corretamente. kMAF (DIENER et al., 2014) usa amostragem e tem uma alta sobrecarga quando a quantidade de amostras é aumentada para alcançar uma maior precisão. Outros mecanismos dependem indiretamente de estatísticas obtidas por contadores de hardware, que não representam com precisão o comportamento de acesso à memória de aplicações paralelas. Várias propostas exigem arquiteturas, APIs ou linguagens de programação específicas para funcionar, o que limita a sua aplicabilidade.

Com esta análise do estado-da-arte, podemos concluir que, atualmente, não existe um mecanismo dinâmico que possa ser aplicado a qualquer aplicação baseada em memória compartilhada e em que o aumento da precisão não aumente drasticamente a sobrecarga. As propostas desta tese tentam preencher esta lacuna. Elas realizam o mapeamento tanto de threads como de dados, otimizando o uso das caches e diminuindo os acessos remotos e o uso das interconexões entre processadores. As propostas são implementadas diretamente na MMU da arquitetura, o que permite monitorar muito mais acessos à memória do que os trabalhos relacionados, de forma não intrusiva. Desta forma, pode-se obter uma precisão muito maior nos padrões detectados, mantendo ao mesmo tempo uma baixa sobrecarga.

B.3 Mecanismos Propostos para Mapeamento Baseado em Compartilhamento

Todas as propostas desta tese são implementadas na MMU. A MMU dos processadores atuais que suportam memória virtual traduz endereços de memória virtuais para endereços de memória físicos em todos os acessos à memória. Em cada acesso à memória, a MMU verifica se a página tem uma entrada válida na TLB. Se isso acontecer, o endereço virtual é traduzido para um endereço físico e o acesso à memória é executado. Se a entrada não estiver na TLB, a MMU remove uma entrada antiga, busca a entrada necessária na tabela de páginas e a armazena na TLB antes de prosseguir com a tradução de endereços e o acesso à memória. No resto desta seção, são apresentados os mecanismos propostos para realizar o mapeamento.
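
Apenas para ilustrar o fluxo descrito acima, o esboço a seguir modela esse caminho de tradução de forma bastante simplificada. Trata-se de um exemplo hipotético, com nomes, tamanhos e políticas inventados para este resumo (não corresponde a uma implementação real de MMU); os mecanismos das subseções seguintes estendem justamente o passo da falta na TLB.

/* Esboço ilustrativo (hipotético) do fluxo de tradução de endereços:
 * consulta à TLB, remoção de uma entrada antiga em caso de falta e
 * busca da nova entrada na tabela de páginas. */
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRADAS 64
#define TAM_PAGINA   4096u

typedef struct {
    bool     valida;
    uint64_t pagina_virtual;
    uint64_t quadro_fisico;
} EntradaTLB;

static EntradaTLB tlb[TLB_ENTRADAS];
static int proxima_vitima;   /* política simplificada de substituição (round-robin) */

/* Busca na tabela de páginas; aqui, apenas um mapeamento fictício
 * (identidade) para manter o exemplo autocontido. */
static uint64_t buscar_na_tabela_de_paginas(uint64_t pagina_virtual)
{
    return pagina_virtual;
}

/* Traduz um endereço virtual em um endereço físico. */
uint64_t traduzir(uint64_t endereco_virtual)
{
    uint64_t pagina       = endereco_virtual / TAM_PAGINA;
    uint64_t deslocamento = endereco_virtual % TAM_PAGINA;

    /* 1. A página tem uma entrada válida na TLB? */
    for (int i = 0; i < TLB_ENTRADAS; i++)
        if (tlb[i].valida && tlb[i].pagina_virtual == pagina)
            return tlb[i].quadro_fisico * TAM_PAGINA + deslocamento;

    /* 2. Falta na TLB: remove uma entrada antiga e busca a nova
     *    entrada na tabela de páginas. */
    int v = proxima_vitima;
    proxima_vitima = (proxima_vitima + 1) % TLB_ENTRADAS;
    tlb[v].valida         = true;
    tlb[v].pagina_virtual = pagina;
    tlb[v].quadro_fisico  = buscar_na_tabela_de_paginas(pagina);

    /* 3. Prossegue com a tradução e o acesso à memória. */
    return tlb[v].quadro_fisico * TAM_PAGINA + deslocamento;
}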

B.3.1 LAPT – Locality-Aware Page Table

A primeira proposta é um mecanismo chamado LAPT. O LAPT implementa mudanças no subsistema de memória virtual, tanto no nível de hardware como no de software. No nível de hardware, o LAPT mantém o controle de todas as faltas na TLB, como mostrado na Figura B.1, detectando o compartilhamento de dados entre threads em uma matriz chamada Matriz de Compartilhamento e registrando uma lista com as últimas threads que acessaram cada página. No nível de software, o sistema operacional mapeia as threads para os núcleos com base na Matriz de Compartilhamento e mapeia as páginas de memória para os nós NUMA verificando quais threads acessaram cada página.
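
A título de ilustração do nível de hardware do LAPT, o esboço abaixo mostra, de forma simplificada e com nomes e tamanhos hipotéticos, como a Matriz de Compartilhamento e a lista das últimas threads de cada página poderiam ser atualizadas a cada falta na TLB:

/* Esboço ilustrativo (hipotético) da atualização feita pelo LAPT a cada
 * falta na TLB: detecta compartilhamento com as últimas threads que
 * acessaram a página e registra a thread atual. */
#include <stdint.h>

#define MAX_THREADS 64
#define MAX_PAGINAS (1 << 16)   /* número de páginas (apenas para o exemplo) */
#define ULTIMAS     4           /* quantas threads recentes guardar por página */

/* Matriz de Compartilhamento: estima o volume de compartilhamento
 * entre cada par de threads. */
static uint32_t matriz_compartilhamento[MAX_THREADS][MAX_THREADS];

/* Lista circular com as últimas threads que acessaram cada página
 * (0 significa posição vazia; armazenamos thread + 1). */
static uint16_t ultimas_threads[MAX_PAGINAS][ULTIMAS];
static uint8_t  proxima_posicao[MAX_PAGINAS];

/* Chamada, conceitualmente pelo hardware, a cada falta na TLB. */
void lapt_falta_na_tlb(int thread, uint64_t pagina)
{
    /* 1. Cada thread registrada recentemente para esta página
     *    compartilha dados com a thread atual. */
    for (int i = 0; i < ULTIMAS; i++) {
        int outra = (int)ultimas_threads[pagina][i] - 1;
        if (outra >= 0 && outra != thread) {
            matriz_compartilhamento[thread][outra]++;
            matriz_compartilhamento[outra][thread]++;
        }
    }

    /* 2. Registra a thread atual como uma das últimas a acessar a página. */
    ultimas_threads[pagina][proxima_posicao[pagina]] = (uint16_t)(thread + 1);
    proxima_posicao[pagina] = (uint8_t)((proxima_posicao[pagina] + 1) % ULTIMAS);
}

Nesse esboço, o sistema operacional leria periodicamente essas estruturas para decidir o mapeamento de threads (a partir da Matriz de Compartilhamento) e o de dados (a partir da lista de threads que acessaram cada página).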

B.3.2 SAMMU – Sharing-Aware Memory Management Unit

A segunda proposta é um mecanismo chamado SAMMU. Uma visão geral de alto nível do funcionamento da MMU, da TLB e do SAMMU é ilustrada na Figura B.2.

Figure B.1: Visão geral do comportamento da MMU e do LAPT.

Figure B.2: Visão geral da MMU e do SAMMU.

A operação da MMU é estendida de duas maneiras, ocorrendo em paralelo com a operação normal da MMU e sem parar a execução da aplicação:

1. O SAMMU conta o número de vezes que cada entrada da TLB é acessada a partir do núcleo local. Isso permite coletar informações sobre as páginas acessadas por cada thread. Para tal, é armazenado um contador de acessos (CA) por entrada da TLB, em uma tabela que chamamos de Tabela de Acesso à TLB, armazenada na MMU.

2. A cada entrada removida da TLB, ou quando um contador CA satura, o SAMMU processa as estatísticas sobre a página e as armazena na memória primária em duas estruturas separadas. A primeira estrutura é a Matriz de Compartilhamento, que estima a quantidade de compartilhamento entre as threads. A segunda estrutura é a Tabela de Histórico de Páginas, que contém uma entrada por página física do sistema, com informações sobre as threads e os nós NUMA que a acessaram, e é indexada pelo endereço da página física.
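
As atualizações descritas nos dois itens acima poderiam ser organizadas, de forma simplificada, como no esboço a seguir. Os nomes, tamanhos e a política de atualização são hipotéticos e servem apenas para ilustrar a ideia; a atualização da Matriz de Compartilhamento é omitida por seguir a mesma lógica do esboço do LAPT.

/* Esboço ilustrativo (hipotético) do SAMMU: contadores de acessos por
 * entrada da TLB e atualização das estruturas na memória primária quando
 * uma entrada é removida ou quando o contador satura. */
#include <stdint.h>

#define TLB_ENTRADAS 64
#define MAX_NOS_NUMA 4
#define MAX_PAGINAS  (1 << 16)
#define CA_MAXIMO    255        /* contador de acessos (CA) de 8 bits */

/* Tabela de Acesso à TLB (na MMU): um contador de acessos e a página
 * associados a cada entrada da TLB. */
static uint8_t  contador_acessos[TLB_ENTRADAS];
static uint64_t pagina_da_entrada[TLB_ENTRADAS];

/* Estrutura na memória primária, lida pelo sistema operacional:
 * acessos realizados a cada página a partir de cada nó NUMA. */
static uint32_t historico_paginas[MAX_PAGINAS][MAX_NOS_NUMA];

/* Processa as estatísticas de uma entrada (remoção da TLB ou CA saturado). */
static void sammu_processar(int entrada, int no_numa)
{
    uint64_t pagina = pagina_da_entrada[entrada];

    /* Acumula, na Tabela de Histórico de Páginas, os acessos realizados
     * a esta página a partir deste nó NUMA. */
    historico_paginas[pagina][no_numa] += contador_acessos[entrada];
    contador_acessos[entrada] = 0;

    /* A atualização da Matriz de Compartilhamento (omitida aqui) usaria,
     * como no esboço do LAPT, as threads que acessaram a página. */
}

/* Chamado a cada acesso que encontra a página na TLB. */
void sammu_acerto_na_tlb(int entrada, int no_numa)
{
    if (++contador_acessos[entrada] == CA_MAXIMO)
        sammu_processar(entrada, no_numa);   /* o contador CA saturou */
}

/* Chamado quando a entrada é removida da TLB para dar lugar a outra página. */
void sammu_remocao_da_tlb(int entrada, int no_numa, uint64_t nova_pagina)
{
    sammu_processar(entrada, no_numa);
    pagina_da_entrada[entrada] = nova_pagina;
}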

B.3.3 IPM – Intense Pages Mapping

Uma visão geral de alto nível do funcionamento da MMU, da TLB e do IPM é ilustrada na Figura B.3. Em cada falta na TLB, o IPM armazena o time stamp na memória principal, em uma estrutura chamada Tabela de Acesso à TLB. Se uma entrada da TLB deve ser removida para armazenar a nova entrada, o IPM carrega da Tabela de Acesso à TLB o time stamp correspondente à entrada removida. Subtraindo o time stamp carregado do time stamp atual, o IPM sabe quanto tempo a entrada removida ficou armazenada na TLB. O IPM, em seguida, usa essa diferença de tempo para atualizar as informações contidas na Matriz de Compartilhamento e na Tabela de Histórico de Páginas. Por último, o IPM notifica o sistema operacional se uma migração de página pode melhorar o desempenho.
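
De forma análoga aos esboços anteriores, o trecho abaixo ilustra o cálculo do tempo que cada entrada permaneceu na TLB. Os nomes são hipotéticos e a política de atualização (acumular diretamente a diferença de tempo por nó NUMA) é uma simplificação adotada apenas para este exemplo; o critério de notificação ao sistema operacional é omitido.

/* Esboço ilustrativo (hipotético) do IPM: registra o time stamp de cada
 * entrada da TLB e, na remoção, usa a diferença de tempo para atualizar
 * as estruturas usadas no mapeamento. */
#include <stdint.h>
#include <time.h>

#define TLB_ENTRADAS 64
#define MAX_NOS_NUMA 4
#define MAX_PAGINAS  (1 << 16)

/* Fonte de tempo usada apenas para manter o exemplo autocontido;
 * na proposta, seria um contador de tempo do próprio hardware. */
static uint64_t tempo_atual(void)
{
    return (uint64_t)clock();
}

/* Tabela de Acesso à TLB (na memória principal). */
static uint64_t momento_insercao[TLB_ENTRADAS];
static uint64_t pagina_da_entrada[TLB_ENTRADAS];

/* Tabela de Histórico de Páginas: estimativa de uso por nó NUMA. */
static uint64_t historico_paginas[MAX_PAGINAS][MAX_NOS_NUMA];

/* Chamado a cada falta na TLB, quando a entrada 'vitima' será reutilizada
 * para armazenar 'nova_pagina'. */
void ipm_falta_na_tlb(int vitima, int no_numa, uint64_t nova_pagina)
{
    uint64_t agora = tempo_atual();

    /* 1. Quanto tempo a entrada removida ficou armazenada na TLB? */
    uint64_t pagina_antiga = pagina_da_entrada[vitima];
    uint64_t tempo_na_tlb  = agora - momento_insercao[vitima];

    /* 2. Usa a diferença de tempo para atualizar a Tabela de Histórico
     *    de Páginas (a Matriz de Compartilhamento seria atualizada de
     *    forma semelhante). */
    historico_paginas[pagina_antiga][no_numa] += tempo_na_tlb;

    /* 3. Salva o time stamp e a página da nova entrada da TLB. */
    momento_insercao[vitima]  = agora;
    pagina_da_entrada[vitima] = nova_pagina;
}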

Figure B.3: Visão geral da MMU e do IPM.

B.4 Avaliação dos Mecanismos Propostos

Foram realizados experimentos em um simulador de sistema completo, em uma máquina real com TLB gerida via software e em duas máquinas reais com TLB gerida via hardware. O simulador de sistema completo utilizado foi o Simics (MAGNUSSON et al., 2002), estendido com o modelo de memória GEMS-Ruby (MARTIN et al., 2005) e com o modelo de interconexão Garnet (AGARWAL et al., 2009), sendo simulada uma máquina com 8 núcleos organizados em 4 nós NUMA. A máquina real com TLB gerida via software é da arquitetura Itanium/Montecito e possui 8 núcleos distribuídos em 2 nós NUMA. As máquinas reais com TLB gerida via hardware são das arquiteturas Nehalem e SandyBridge e possuem 32/64 núcleos virtuais distribuídos em 2/4 nós NUMA, respectivamente. Tais máquinas serão referenciadas por Xeon32 e Xeon64. Como as propostas desta tese são extensões ao hardware atual, nas máquinas reais com TLB gerida via hardware usamos o Pin (BACH et al., 2010), uma ferramenta de instrumentação de binários, para gerar as informações para os mapeamentos de threads e de dados. Como cargas de trabalho, utilizou-se a implementação OpenMP dos benchmarks paralelos do NAS (JIN; FRUMKIN; YAN, 1999) v3.3.1 e o conjunto de benchmarks do PARSEC (BIENIA et al., 2008) v3.0. Também utilizou-se uma aplicação real de simulação sísmica chamada Ondes3D (DUPROS et al., 2008).

Os resultados obtidos no Simics, no Itanium e na Xeon64 são melhores do que os resultados obtidos na Xeon32. Como o Simics e a Xeon64 têm 4 nós NUMA (enquanto a Xeon32 tem 2), a probabilidade de encontrar o nó correto para uma página sem qualquer conhecimento do padrão de acesso à memória é de apenas 25% neles (enquanto é de 50% na Xeon32). Em relação ao Itanium, embora também tenha apenas 2 nós NUMA, um mapeamento melhor tem um impacto maior, porque a latência de um acesso à memória remota é muito maior do que na Xeon32.

A maioria das aplicações é mais sensível ao mapeamento de dados do que ao mapeamento de threads, o que pode ser observado nos resultados pelo fato de que o tráfego interchip apresentou uma redução maior do que as faltas de cache. Isso acontece porque, mesmo que uma aplicação não compartilhe muitos dados entre suas threads, cada thread ainda terá de acessar seus próprios dados privados, o que só pode ser melhorado através do mapeamento de dados. É importante notar que isto não significa que o mapeamento de dados seja mais importante do que o mapeamento de threads, porque a eficácia do mapeamento de dados depende do mapeamento de threads para as páginas compartilhadas.

As propostas desta tese apresentaram resultados semelhantes aos do mapeamento oracle, demonstrando a sua eficácia. Elas também tiveram um desempenho significativamente melhor do que o mapeamento aleatório na maioria dos casos. Isto mostra que os ganhos em relação ao sistema operacional não são devidos às migrações desnecessárias introduzidas pelo sistema operacional, mas a uma utilização mais eficiente dos recursos. Em comparação a outras políticas de mapeamento do estado-da-arte, as nossas propostas apresentaram melhores resultados para a maioria das aplicações.

B.5 Conclusão e Trabalhos Futuros

Nesta tese de doutorado, propusemos novas soluções para aumentar o desempenho utilizando o mapeamento de threads e dados. As soluções suprem uma lacuna do estado-da-arte: são capazes de atingir alta precisão com uma sobrecarga baixíssima. Para tal, as soluções fazem uso da MMU presente nos processadores atuais, já que a MMU já tem acesso a informações sobre todos os acessos à memória realizados por cada núcleo. O mecanismo mais simples proposto, o LAPT, rastreia as threads que acessam cada página e calcula a quantidade de páginas compartilhadas pelas threads. Também é proposto o SAMMU que, além do que o LAPT faz, estima a quantidade de acessos realizados a cada página a partir de todos os nós NUMA. Finalmente, é proposto o IPM, que fornece informações similares às do SAMMU, mas tem um procedimento mais simples e utiliza informações sobre o tempo que cada página mantém sua entrada na TLB, em vez da quantidade de acessos à memória. Com o suporte de hardware proposto, os mecanismos aqui apresentados são capazes de oferecer uma alta precisão com uma baixa sobrecarga de desempenho.

Os mecanismos foram avaliados em três ambientes diferentes: um simulador de sistema completo, uma máquina real com TLB gerida por software e duas máquinas reais com TLB gerida por hardware. Os mecanismos proveram melhorias substanciais em todos os ambientes. Para exemplificar, no simulador de sistema completo, LAPT, SAMMU e IPM reduziram o tempo de execução em uma média de 10,1%, 10,9% e 11,4%, respectivamente. O ganho de desempenho ocorreu principalmente devido a uma redução substancial do número de faltas de cache e do uso das interconexões entre processadores. Os mecanismos desta tese foram comparados com trabalhos relacionados, mostrando que os mecanismos aqui apresentados, por atingirem uma maior precisão com baixa sobrecarga, são capazes de fornecer um melhor desempenho.

Como trabalho futuro, pretende-se avaliar e propor soluções para melhorar a localidade de memória visando sistemas mais recentes, como o Xeon Phi, que também apresentam diferentes latências de acesso a dados entre os núcleos. Além disso, queremos expandir o mapeamento para considerar outros aspectos das arquiteturas, tais como o uso de unidades funcionais e o balanceamento de carga. Outra possibilidade é considerar o mapeamento em computação heterogênea, a fim de identificar as características de uma aplicação paralela e mapear as threads para os núcleos que melhor se adaptem às suas características.

REFERENCES

AGARWAL, N. et al. GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). [S.l.: s.n.], 2009. p. 33–42.

AMD. AMD Opteron™ 6300 Series processor Quick Reference Guide. [S.l.], 2012.

AWASTHI, M. et al. Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers. In: Parallel Architectures and Compilation Techniques (PACT). [S.l.: s.n.], 2010. p. 319–330.

AZIMI, R. et al. Enhancing Operating System Support for Multicore Processors by Using Hardware Performance Monitoring. ACM SIGOPS Operating Systems Review, v. 43, n. 2, p. 56–65, abr. 2009.

BACH, M. et al. Analyzing Parallel Programs with Pin. IEEE Computer, IEEE, v. 43, n. 3, p. 34–41, 2010.

BAILEY, D. H. et al. The NAS Parallel Benchmarks. International Journal of High Performance Computing Applications, v. 5, n. 3, p. 66–73, 1991.

BARROW-WILLIAMS, N.; FENSCH, C.; MOORE, S. A Communication Characterisation of Splash-2 and Parsec. In: IEEE International Symposium on Workload Characterization (IISWC). [S.l.: s.n.], 2009. p. 86–97.

BELLARD, F. Qemu, a fast and portable dynamic translator. In: USENIX Annual Technical Conference (ATEC). Berkeley, CA, USA: USENIX Association, 2005. p. 41–41.

BIENIA, C.; KUMAR, S.; LI, K. PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on Chip-Multiprocessors. In: IEEE International Symposium on Workload Characterization (IISWC). [S.l.: s.n.], 2008. p. 47–56.

BIENIA, C. et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). [S.l.: s.n.], 2008. p. 72–81.

BINKERT, N. et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, ACM, v. 39, n. 2, p. 1–7, 2011.

BOKHARI, S. On the Mapping Problem. IEEE Transactions on Computers, C-30, n. 3, p. 207–214, 1981.

BORKAR, S.; CHIEN, A. A. The Future of Microprocessors. Communications of the ACM, v. 54, n. 5, p. 67–77, 2011.

BRANDFASS, B.; ALRUTZ, T.; GERHOLD, T. Rank reordering for MPI communication optimization. Computers & Fluids, Elsevier Ltd, p. 372–380, jan. 2012.

BROQUEDIS, F. et al. Structuring the execution of OpenMP applications for multicore architectures. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS). [S.l.: s.n.], 2010. p. 1–10.

BROQUEDIS, F. et al. hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications. In: Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP). [S.l.: s.n.], 2010. p. 180–186.

CABEZAS, V. C.; STANLEY-MARBELL, P. Parallelism and data movement characterization of contemporary application classes. In: ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). [S.l.: s.n.], 2011.

CASAVANT, T. L.; KUHL, J. G. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, IEEE Press, v. 14, n. 2, p. 141–154, feb. 1988.

CHEN, H. et al. MPIPP: An Automatic Profile-guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters. In: International Conference on Supercomputing (SC). [S.l.: s.n.], 2006. p. 353–360.

CHISHTI, Z.; POWELL, M. D.; VIJAYKUMAR, T. N. Optimizing Replication, Communication, and Capacity Allocation in CMPs. ACM SIGARCH Computer Architecture News, v. 33, n. 2, p. 357–368, may 2005.

CORBET, J. AutoNUMA: the other approach to NUMA scheduling. 2012. Available from Internet: .

CORBET, J. Toward better NUMA scheduling. 2012. Available from Internet: .

COTEUS, P. W. et al. Technologies for exascale systems. IBM Journal of Research and Development, v. 55, n. 5, p. 14:1–14:12, sep. 2011.

CRUZ, E. et al. Using memory access traces to map threads and data on hierarchical multi-core platforms. In: IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW). [S.l.: s.n.], 2011.

CRUZ, E. H. M.; ALVES, M. A. Z.; NAVAUX, P. O. A. Process Mapping Based on Memory Access Traces. In: Symposium on Computing Systems (WSCAD-SCC). [S.l.: s.n.], 2010. p. 72–79.

CRUZ, E. H. M. et al. Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols. Journal of Parallel and Distributed Computing (JPDC), v. 74, n. 3, p. 2215–2228, mar. 2014.

CRUZ, E. H. M. et al. Optimizing Memory Locality Using a Locality-Aware Page Table. In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). [S.l.: s.n.], 2014. p. 198–205.

CRUZ, E. H. M.; DIENER, M.; NAVAUX, P. O. A. Using the Translation Lookaside Buffer to Map Threads in Parallel Applications Based on Shared Memory. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS). [S.l.: s.n.], 2012. p. 532–543.

CRUZ, E. H. M.; DIENER, M.; NAVAUX, P. O. A. Communication-Aware Thread Mapping Using the Translation Lookaside Buffer. Concurrency and Computation: Practice and Experience, v. 22, n. 6, p. 685–701, 2015.

CRUZ, E. H. M. et al. An Efficient Algorithm for Communication-Based Task Mapping. In: International Conference on Parallel, Distributed, and Network-Based Processing (PDP). [S.l.: s.n.], 2015. p. 207–214.

CUESTA, B. et al. Increasing the Effectiveness of Directory Caches by Avoiding the Tracking of Non-Coherent Memory Blocks. IEEE Transactions on Computers, v. 62, n. 3, p. 482–495, 2013.

DASHTI, M. et al. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In: Architectural Support for Programming Languages and Operating Systems (ASPLOS). [S.l.: s.n.], 2013. p. 381–393.

DEVINE, K. D. et al. Parallel hypergraph partitioning for scientific computing. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS). [S.l.: s.n.], 2006. p. 124–133.

DIENER, M.; CRUZ, E. H. M.; NAVAUX, P. O. A. Communication-Based Mapping Using Shared Pages. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS). [S.l.: s.n.], 2013. p. 700–711.

DIENER, M. et al. kMAF: Automatic Kernel-Level Management of Thread and Data Affinity. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). [S.l.: s.n.], 2014. p. 277–288.

DIENER, M. et al. Communication-Aware Process and Thread Mapping Using Online Communication Detection. Parallel Computing, v. 43, n. March, p. 43–63, 2015.

DIENER, M. et al. Characterizing Communication and Page Usage of Parallel Applications for Thread and Data Mapping. Performance Evaluation, v. 88-89, n. June, p. 18–36, 2015.

DIENER, M. et al. Evaluating Thread Placement Based on Memory Access Patterns for Multi-core Processors. In: IEEE International Conference on High Performance Computing and Communications (HPCC). [S.l.: s.n.], 2010. p. 491–496.

DUPROS, F. et al. Exploiting Intensive Multithreading for the Efficient Simulation of 3D Seismic Wave Propagation. In: IEEE International Conference on Computational Science and Engineering (CSE). [S.l.: s.n.], 2008. p. 253–260.

DUPROS, F. et al. Parallel simulations of seismic wave propagation on NUMA architectures. In: Parallel Computing: From Multicores and GPU’s to Petascale. [S.l.: s.n.], 2010. p. 67–74.

FELIU, J. et al. Understanding Cache Hierarchy Contention in CMPs to Improve Job Scheduling. In: International Parallel and Distributed Processing Symposium (IPDPS). [S.l.: s.n.], 2012.

GABRIEL, E. et al. Open MPI: Goals, concept, and design of a next generation MPI implementation. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface. [S.l.: s.n.], 2004.

HOEFLER, T.; SNIR, M. Generic Topology Mapping Strategies for Large-scale Parallel Architectures. In: International Conference on Supercomputing (ICS). [S.l.: s.n.], 2011. p. 75–85.

INTEL. Quad-Core Intel® Xeon® Processor 5400 Series Datasheet. [S.l.], 2008. Available from Internet: .

INTEL. Intel® Itanium® Architecture Software Developer's Manual. [S.l.], 2010.

INTEL. Intel® Xeon® Processor 7500 Series. [S.l.], 2010.

INTEL. 2nd Generation Intel Core Processor Family. [S.l.], 2012. v. 1, n. September.

INTEL. Intel Performance Counter Monitor - A better way to measure CPU utilization. 2012. Available from Internet: .

ITO, S.; GOTO, K.; ONO, K. Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments. Computers & Fluids, Elsevier Ltd, v. 80, p. 88–93, jul. 2013.

JEANNOT, E.; MERCIER, G.; TESSIER, F. Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques. IEEE Transactions on Parallel and Distributed Systems, v. 25, n. 4, p. 993–1002, abr. 2014.

JEDEC. DDR3 SDRAM Standard. 2012.

JIN, H.; FRUMKIN, M.; YAN, J. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. [S.l.], 1999.

JOHNSON, M. et al. PAPI-V: Performance Monitoring for Virtual Machines. In: International Conference on Parallel Processing Workshops (ICPPW). [S.l.: s.n.], 2012. p. 194–199.

KARYPIS, G.; KUMAR, V. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, v. 20, n. 1, p. 359–392, jan. 1998.

KLUG, T. et al. autopin – Automated Optimization of Thread-to-Core Pinning on Multicore Systems. High Performance Embedded Architectures and Compilers (HiPEAC), v. 3, n. 4, p. 219–235, 2008.

LAROWE, R. P.; HOLLIDAY, M. A.; ELLIS, C. S. An Analysis of Dynamic Page Placement on a NUMA Multiprocessor. ACM SIGMETRICS Performance Evaluation Review, v. 20, n. 1, p. 23–34, 1992.

LÖF, H.; HOLMGREN, S. affinity-on-next-touch: Increasing the Performance of an Industrial PDE Solver on a cc-NUMA System. In: International Conference on Supercomputing (SC). [S.l.: s.n.], 2005. p. 387–392.

LUK, C. et al. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). [S.l.: s.n.], 2005. p. 190–200.

LUSZCZEK, P. et al. Introduction to the HPC Challenge Benchmark Suite. [S.l.], 2005.

MAGNUSSON, P. et al. Simics: A Full System Simulation Platform. IEEE Computer, IEEE Computer Society, v. 35, n. 2, p. 50–58, 2002.

MARATHE, J.; MUELLER, F. Hardware Profile-guided Automatic Page Placement for ccNUMA Systems. In: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). [S.l.: s.n.], 2006. p. 90–99.

MARATHE, J.; THAKKAR, V.; MUELLER, F. Feedback-Directed Page Placement for ccNUMA via Hardware-generated Memory Traces. Journal of Parallel and Distributed Computing (JPDC), v. 70, n. 12, p. 1204–1219, 2010.

MARTIN, M. M. K.; HILL, M. D.; SORIN, D. J. Why On-Chip Cache Coherence is Here to Stay. Communications of the ACM, v. 55, n. 7, p. 78, jul. 2012.

MARTIN, M. M. K. et al. Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Computer Architecture News, ACM, v. 33, n. 4, p. 92–99, 2005.

MCVOY, L.; STAELIN, C. Lmbench: Portable Tools for Performance Analysis. In: USENIX Annual Technical Conference (ATC). [S.l.: s.n.], 1996. p. 23–38.

MIPS. MIPS R10000 User’s Manual, Version 2.0. [S.l.: s.n.], 1996.

NETHERCOTE, N.; SEWARD, J. Valgrind: A framework for heavyweight dynamic binary instrumentation. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). [S.l.: s.n.], 2007.

OPENMP. OpenMP Application Program Interface. [S.l.], 2013.

PATEL, A. et al. MARSSx86: A Full System Simulator for x86 CPUs. In: Design Automation Conference 2011 (DAC’11). [S.l.: s.n.], 2011.

PELLEGRINI, F. Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs. In: Scalable High-Performance Computing Conference (SHPCC). [S.l.: s.n.], 1994. p. 486–493.

RADOJKOVIĆ, P. et al. Thread Assignment of Multithreaded Network Applications in Multicore/Multithreaded Processors. IEEE Transactions on Parallel and Distributed Systems (TPDS), v. 24, n. 12, p. 2513–2525, 2013.

RIBEIRO, C. P. et al. Improving memory affinity of geophysics applications on numa platforms using minas. In: International Conference on High Performance Computing for Computational Science (VECPAR). [S.l.: s.n.], 2011.

RIBEIRO, C. P. et al. Memory Affinity for Hierarchical Shared Memory Multiprocessors. In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). [S.l.: s.n.], 2009. p. 59–66.

SHWARTSMAN, S.; MIHOCKA, D. Virtualization without direct execution or jitting: Designing a portable virtual machine infrastructure. In: International Symposium on Computer Architecture (ISCA). Beijing, China: [s.n.], 2008.

SONNEK, J. et al. Starling: Minimizing Communication Overhead in Virtualized Computing Platforms Using Decentralized Affinity-Aware Migration. In: International Conference on Parallel Processing (ICPP). [S.l.: s.n.], 2010. p. 228–237.

STALLINGS, W. Computer Organization and Architecture: Designing for Performance. [S.l.]: Prentice Hall, 2006.

SWAMY, T.; UBAL, R. Multi2sim 4.2 – a compilation and simulation framework for heterogeneous computing. In: International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). [S.l.: s.n.], 2014.

TANENBAUM, A. S. Modern Operating Systems. 3rd. ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2007.

TERBOVEN, C. et al. Data and Thread Affinity in OpenMP Programs. In: Workshop on Memory Access on Future Processors: A Solved Problem? (MAW). [S.l.: s.n.], 2008. p. 377–384.

THOZIYOOR, S. et al. Cacti 5.1. [S.l.], 2008.

TIKIR, M. M.; HOLLINGSWORTH, J. K. Hardware monitors for dynamic page migration. Journal of Parallel and Distributed Computing (JPDC), v. 68, n. 9, p. 1186–1200, sep. 2008.

TOLENTINO, M.; CAMERON, K. The optimist, the pessimist, and the global race to exascale in 20 megawatts. IEEE Computer, v. 45, n. 1, 2012.

TORRELLAS, J. Architectures for extreme-scale computing. IEEE Computer, v. 42, n. 11, p. 28–35, 2009.

TRAHAY, F. et al. EZTrace: a generic framework for performance analysis. In: International Symposium on Cluster, Cloud and Grid Computing (CCGrid). [S.l.: s.n.], 2011. p. 618–619.

VERGHESE, B. et al. OS Support for Improving Data Locality on CC-NUMA Compute Servers. [S.l.], 1996.

VILLAVIEJA, C. et al. DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). [S.l.: s.n.], 2011. p. 340–349.

WANG, W. et al. Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources. In: IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS). [S.l.: s.n.], 2012.

ZHAI, J.; SHENG, T.; HE, J. Efficiently Acquiring Communication Traces for Large-Scale Parallel Applications. IEEE Transactions on Parallel and Distributed Systems (TPDS), v. 22, n. 11, p. 1862–1870, 2011.

ZHOU, X.; CHEN, W.; ZHENG, W. Cache Sharing Management for Performance Fairness in Chip Multiprocessors. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). [S.l.: s.n.], 2009. p. 384–393.