Appendix a Sift Optimizations in Van Emde Boas-Based Heap
Total Page:16
File Type:pdf, Size:1020Kb
Nuno Filipe Monteiro Faria Locality optimizations on irregular algorithms and data structures Dissertação de Mestrado Engenharia Informática Trabalho efectuado sob a orientação do Professor Doutor João Luís Sobral Outubro de 2011 Acknowledgments This dissertation would not have been possible without the aid of my parents and my sister, which have always endured and supported me through my life. Second, I thank my advisor, Professor Jo~aoLu´ısSobral, for giving me the opportunity to join him and his team in research, for fostering my critical thinking, guiding me and sharing his experience and wisdom with me. I would also like to thank the professors I found along my master course for introducing me to the highly motivating theme of high performance comput- ing, Professor Jo~aoLu´ısSobral, Professor Alberto Proen¸ca,Professor Rui Ralha and Professor Ant´onioPina. Last but not least, thank you to all my friends and lab colleagues in university, Roberto Ribeiro, Rui Silva, Rui Gon¸calves, Diogo Mendes, Nuno Silva, for all the brainstorm-ish discus- sions and opinions that allowed me to improve and push myself further - a few themes presented in this dissertation were discussed in some of those brainstorms. To my colleagues/room-mates through my graduate years with whom I shared unforgettable experiences, Nuno Silva, Pedro Silva, Emanuel Gon¸calves, Andr´eF´elix,Samuel Moreira, Roberto Ribeiro and Pedro Miranda - I believe I have made friends for life. Nuno Filipe Monteiro Faria The project that served as ground for this dissertation was funded under the agreement of the protocol between FCT and UTAustin Parallel Programming Refinements for Irregular Applications (UTAustin/CA/0056/2008) ii Otimiza¸c~oesde localidade em algoritmos e estruturas de dados irregulares Resumo Estruturas de grafos baseadas em apontadores t^emsido amplamente discutidas por v´ariasco- munidades cient´ıficas que tencionam usar este tipo de estruturas. Muitas destas ´areast^em a necessidade de usar e implementar algoritmos complexos e sofisticados, como por exemplo, encontrar a ´arvore de expans~aode custo m´ınimode um grafo. Estes algoritmos t^emum com- portamento t´ıpicamente irregular no que toca aos padr~oesde acesso `amem´oria. Como tal, o objectivo ´eotimizar estas implementa¸c~oesde estruturas de grafos visando aspetos que melhorem a localidade dos acessos `amem´oria. O impacto da lat^enciade mem´oriatem sido continua- mente atenuado atrav´esdo uso de mem´orias cache nas arquitecturas de computadores atuais - as t´ecnicasde otimiza¸c~aopara estruturas regulares (ex., matrizes) s~aobem conhecidas neste ^ambito, contudo estruturas irregulares inserem-se numa classe de problemas para os quais ainda n~aoexistem consolidadas representa¸c~oeseficientes. Nesta disserta¸c~ao´eefetuado um estudo sobre as optimiza¸c~oesposs´ıveis em algoritmos como m´etodos de ordena¸c~ao heap-sort: s~aoapresentadas uma implementa¸c~aopadr~aode um algoritmo heap-sort e uma vers~aoamig´avel da cache, originalmente apresentada por Emde Boas. Junta- mente com os problemas de esfor¸code mem´oria,existem tamb´emos problemas de algoritmos intensivos em linguagens de alto n´ıvel. Frameworks orientadas a objectos s~aonormalmente baseadas em conceitos, linguagens e mecanismos abstratos de alto-n´ıvel, o que pode criar incon- sist^enciasquando combinadas com aspetos habituais da computa¸c~aoavan¸cada(contiguidade de elementos em mem´oria,localidade, eliminar a redund^anciado c´odigo-fonte, etc.). Aspetos como a gest~aode objetos em mem´oria(introduzidos por pol´ıticascomo type-erasure e auto-boxing) e conceitos abstratos relativamente ao encapsulamento (uso ineficiente de APIs, em conjunto com mecanismos de tipos gen´ericos)s~aoidentificados como sendo problem´aticosna resolu¸c~aode sobrecarga e introdu¸c~aode refinamentos nas implementa¸c~oes. E´ notada uma clara incompatibil- idade entre aplica¸c~oesirregulares intensivas e metodologias object-oriented, nomeadamente em Java. Otimiza¸c~oese altera¸c~oesao c´odigo-fonte s~aopropostas em aplica¸c~oesque sofrem destas limita¸c~oesde desempenho, com o intuito de diminuir a sobrecarga que a abstra¸c~aointroduz nas implementa¸c~oes. E´ feito um conjunto de experi^enciascom aplica¸c~oesintensivas ao ordernar el- ementos com o heap-sort e com o algoritmo sobre grafos para encontrar a ´arvore de expans~ao de custo m´ınimo,para que se possam introduzir otimiza¸c~oescomo novas distribui¸c~oesde dados nas estruturas para melhorar os padr~oesde acesso `amem´oria,mais amig´aveis da cache. Padr~oes eficientes de acesso `amem´oria´eo principal interesse tendo em considera¸c~aoaspetos sobre a localidade, quer seja utilizando primitivas eficientes de distribui¸c~oesde dados em mem´oria,ou diminuindo a complexidade do c´odigo-fonte dos algoritmos e estruturas. As otimiza¸c~oespro- postas melhoram o comportamento de cache verificado em vers~oesiniciais, assim como o tempo de execu¸c~ao.Aspetos baixo-n´ıvel como misses nas caches L1 e L2 salientam o car´acteramig´avel da cache das distribui¸c~oesde dados e as melhorias no n´umerode misses; os menores misses na TLB mostram as melhorias na complexidade das implementa¸c~oes. iii Locality optimizations on irregular algorithms and data structures Abstract Pointer-based graph structures has been discussed by the scientific community that aims to use such structures on several areas. Many of these areas have the need to implement complex and, sometimes, extremely sophisticated algorithms, like finding the minimum spanning tree of a graph. These algorithms are well known for being hard to efficiently execute on current multi- core machines due to irregular patterns of memory accesses. Thus, the objective is optimizing the implementation of graph structures aiming for better memory locality. Memory latency problems have been attenuated by using cache memories in nowadays computer architectures - the memory optimization techniques for regular data-structures (e.g., matrices) are well known, as for irregular data-structures insert themselves in areas where efficient representations are not yet well known. Optimization algorithms like sorting problems are studied through the use of heap-sort meth- ods: we present a standard implementation and an optimized version originally presented by van Emde Boas. Along with the problems of memory straining we refer also to the problems of in- tensive algorithms in modern high-level languages. Object-oriented frameworks usually operate with high-level concepts and abstract mechanisms that may not combine well with core features of high performance computing (element contiguity, locality, eliminating code redundancy, etc.). We identify aspects like inefficient object-memory management (introduced by type-erasure and auto-boxing) abstract concepts regarding encapsulation (inefficient API usage alongside with generic mechanisms) as being problematic issues in the resolution of overheads of data-intensive algorithms. A mismatch is clearly noticed between irregular data-intensive applications and object-oriented in Java. We propose a series of optimizations and changes to the source code of applications that suffer from bottlenecks related to these, in order to demean the setbacks imposed by abstractions. We perform tests on heap sorting and Prim's minimal spanning tree algorithm in order to introduce the improvements made in data layouts optimizing irregular memory accesses. Efficient memory access patterns are the main concern in this thesis and good cache locality on modern memory architectures, whether by using efficient sorting techniques or improving pointer-chasing complexity in algorithms and data structures, are the main goals. The optimizations proposed are able to decrease the cache misses of applications and, most im- portant, execution time. We analyse the low-level instruction counts in order to accurately show that instruction complexity decreases; L1 and L2 cache miss lower counts prove the efficiency of cache-friendly layouts and show the miss behaviour improvement; TLB low miss counts verify the improvement in address and memory management. iv Contents Acknowledgments ii Resumo iii Abstract iv List of Figures vii List of Tables viii 1 Introduction 9 1.1 Data intensive and Irregular applications ....................... 10 1.2 Locality of reference .................................. 11 1.3 Optimizing data layouts ................................ 12 1.3.1 Object-oriented and Memory management . 12 1.3.2 Parallelism motivations ............................ 13 1.4 Main goals, Scope and Contributions ......................... 13 1.5 Organization of the Dissertation ........................... 14 2 Background 15 2.1 Cache-aware and cache-oblivious ........................... 15 2.2 Memory models ..................................... 15 2.2.1 RAM model ................................... 16 2.2.2 External memory model ............................ 16 2.2.3 Hierarchical memory model .......................... 17 2.2.4 Ideal-cache model ............................... 17 2.2.5 Caches in multi-processor environments ................... 19 2.3 Cache-aware/oblivious sorting ............................. 19 2.3.1 Efficient heap implementations ........................ 19 2.3.2 Cache-efficient heap and priority-queues ................... 21 2.4 Locality optimizations ................................. 23 2.4.1 Algorithmic locality optimizations ...................... 23 2.4.2 Data layout locality optimizations ...................... 25 2.4.3 JVM level optimizations ............................ 27 2.5 Graphs