TECHNOLOGIES FOR MAIN MEMORY DATA ANALYSIS
DISSERTATION FOR THE AWARD OF THE DOCTORAL DIPLOMA ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS
2020
Marios D. Fragkoulis Department of Management Science and Technology Athens University of Economics and Business ii
Department of Management Science and Technology Athens University of Economics and Business Email: [email protected]
Copyright 2017 Marios Fragkoulis This work is licensed under a Crea ve Commons A ribu on-ShareAlike 4.0 Interna onal License. iii
Supervised by Professor Diomidis Spinellis iv
To my parents
Contents
1 Introduc on 1 1.1 Context ...... 1 1.2 Problem statement ...... 2 1.3 Proposed solu on and contribu ons ...... 2 1.4 Research methodology ...... 3 1.5 Thesis outline ...... 3
2 Related Work 6 2.1 Orthogonal persistence ...... 7 2.1.1 Background ...... 7 2.1.2 Scope ...... 8 2.1.3 Surveys on orthogonal persistence ...... 9 2.1.4 Dimensions of persistent systems ...... 11 2.1.4.1 Persistence implementa on ...... 12 2.1.4.2 Data de ni on and manipula on language ...... 12 2.1.4.3 Stability and resilience ...... 12 2.1.4.4 Protec on ...... 12 2.1.5 Persistent systems performing transac ons ...... 15 2.1.5.1 Persistent opera ng systems ...... 15 2.1.5.2 Persistent programming systems ...... 16 2.1.5.3 Persistent stores ...... 17 2.1.6 Persistent systems performing distributed processing ...... 18 2.1.6.1 Persistent opera ng systems ...... 19 2.1.6.2 Persistent programming systems ...... 19 2.1.6.3 Persistent stores ...... 20 2.1.7 Performance Op misa ons ...... 22 2.2 Programming language query support ...... 23 2.2.1 Query languages and interac ve interfaces for object-oriented data ..... 24 2.2.1.1 Database query languages for object-oriented data ...... 24 2.2.1.2 Query comprehensions to programming language collec ons ... 25 2.2.1.3 Ad-hoc queries to main memory D(B)MSs ...... 25 2.2.1.4 Ad-hoc queries to object collec ons ...... 26 2.2.2 Rela onal representa ons for objects ...... 26 2.2.2.1 Object-rela onal mapping ...... 26 2.2.2.2 Rela onships and associa ons as rst class programming language constructs ...... 27 2.2.2.3 Rela onal interface to programming language data structures ... 27 2.2.3 N1NF rela onal data model ...... 28 2.3 Diagnos c tools ...... 29
v vi
2.3.1 Opera ng system rela onal interfaces ...... 29 2.3.2 Kernel diagnos c tools that use built-in performance counters ...... 30 2.3.3 Kernel diagnos c tools that use program instrumenta on techniques .... 30 2.3.4 Applica on diagnos c tools ...... 31
3 Query interface design 33 3.1 Rela onal representa on of objects ...... 33 3.1.1 Mapping has-a associa ons ...... 34 3.1.2 Mapping many-to-many associa ons ...... 35 3.1.3 Mapping is-a associa ons ...... 36 3.2 A domain speci c language for de ning rela onal representa ons of data structures 37 3.2.1 Struct view de ni on ...... 37 3.2.2 Virtual table de ni on ...... 41 3.3 Mapping a rela onal query evalua on to the underlying object-oriented environment 42 3.4 Rela onal algebra extension for virtual table instan a ons ...... 45 3.5 So ware architecture ...... 45
4 Query interface implementa on 49 4.1 Genera ve programming ...... 49 4.2 Virtual table implementa on ...... 50 4.3 SQL support ...... 50 4.4 Query op miza ons ...... 51 4.5 Query interface ...... 51 4.6 Embedding i in applica ons ...... 52 4.7 Loadable module implementa on for the Linux kernel ...... 53 4.7.1 Query interfaces ...... 54 4.7.2 Security ...... 54 4.7.3 Synchronized access to kernel data structures ...... 55 4.7.3.1 De ni on of consistency ...... 55 4.7.3.2 Synchroniza on in i ...... 55 4.7.3.3 Limita ons ...... 55 4.7.4 Reliability ...... 56 4.7.5 Deployment and maintenance ...... 56 4.7.6 Portability ...... 57 4.8 Implementa on for the Valgrind framework ...... 57
5 Empirical Valida on 58 5.1 Data analysis of C++ applica ons ...... 58 5.1.1 Method ...... 58 5.1.2 Use cases ...... 59 5.1.3 Presenta on of measurements ...... 62 5.1.3.1 LOC measurements ...... 63 5.1.3.2 CPU execu on me measurements ...... 63 5.1.3.3 Query memory use measurements ...... 63 5.1.4 Results ...... 64 5.2 Diagnos cs in the Linux kernel ...... 65 5.2.1 Method ...... 65 5.2.2 Use cases ...... 65 5.2.2.1 Opera on integrity ...... 65 5.2.2.2 Security audits ...... 68 vii
5.2.2.3 Performance ...... 70 5.2.3 Presenta on of measurements ...... 70 5.2.3.1 Query execu on e ciency ...... 71 5.2.3.2 Impact on system performance ...... 71 5.2.4 Results ...... 72 5.2.4.1 Query execu on e ciency ...... 72 5.2.4.2 Impact on system performance ...... 73 5.3 Analysis of memory pro les in the Valgrind instrumenta on framework ...... 75 5.3.1 Method ...... 75 5.3.2 Use cases ...... 75 5.3.2.1 Memcheck ...... 76 5.3.2.2 Cachegrind ...... 80 5.3.2.3 Callgrind ...... 81 5.3.3 Presenta on of measurements ...... 82 5.3.4 Results ...... 83 5.4 User study ...... 86 5.4.1 So ware typology ...... 87 5.4.2 Mo va on and ques ons ...... 87 5.4.3 Experimental design ...... 89 5.4.3.1 Users ...... 89 5.4.3.2 Se ng ...... 89 5.4.3.3 Tasks ...... 89 5.4.4 Experimental evalua on ...... 90 5.4.5 Experimental results ...... 91 5.4.6 User interface preferences ...... 91 5.4.7 Threats to validity ...... 93 5.4.8 Limita ons ...... 93
6 Conclusions and Future Work 94 6.1 Summary of results ...... 94 6.2 Overall contribu on ...... 95 6.3 Future work ...... 96 6.4 Conclusions ...... 96
A Appendix 98 A.1 C++ and SQL code for the queries used in the evaluated applica ons ...... 98 A.1.1 Stellarium ...... 98 A.1.2 QLandKarte ...... 101 A.1.3 CScout ...... 106 A.2 Iden fy possible privilege escala on a acks with AWK ...... 107 A.3 Dynamic diagnos c tasks via Systemtap scripts ...... 108 A.4 Diagnos c tasks with i , Systemtap, and DTrace ...... 110
Bibliography 112 List of Figures
1.1 The research methodology followed ...... 4
2.1 Map of research work related to this thesis ...... 6
3.1 has-a associa on ...... 34 3.2 Has-one and has-many associa ons between data structures are normalized or de- normalized in the virtual rela onal schema...... 34 3.3 Many-to-many associa on ...... 36 3.4 Inheritance and subtype polymorphism support ...... 37 3.5 Full support of polymorphic containers ...... 37 3.6 syntax in nota on ...... 38 3.7 The C++ i under the library’s namespace ...... 47 3.8 So ware architecture ...... 48
4.1 Web-based query interface ...... 52 4.2 Steps for plugging i in an applica on...... 53
5.1 Class diagram and virtual table schema for the Stellarium, QLandkarte, and CScout applica ons ...... 60 5.2 Linux kernel models ...... 65 5.3 Fixed query cost depending on number of s...... 73 5.4 Impact on system performance ...... 74 5.5 Work ow for embedding i to Valgrind ...... 75 5.6 Memcheck’s data structure model and rela onal representa on ...... 77 5.7 Cachegrind’s data structure model and rela onal representa on ...... 84 5.8 Callgrind’s data structure model and rela onal representa on ...... 85 5.9 i ’s scalability ...... 87 5.10 User preferences of interface characteris cs ...... 92
viii List of Tables
2.1 Types of persistent systems examined and ques ons applicable to them ...... 12 2.2 Systems that adhere to orthogonal persistence ...... 14 2.3 Query languages and interfaces for querying OO data: table with columns signify- ing data loca on and rows signifying the so ware data model (objects or rela ons) against which queries are executed...... 23
5.1 Projects used in evalua on ...... 59 5.2 Use case (UC) iden ers and descrip ons and respec ve i queries for each project 61 5.2 Use case (UC) iden ers and descrip ons and respec ve i queries for each project 62 5.3 Query evalua on measurements ...... 64 5.4 Use case (UC) iden ers and descrip ons and respec ve i queries ...... 66 5.4 Use case (UC) iden ers and descrip ons and respec ve i queries ...... 67 5.5 Present query execu on cost for 10 diverse queries...... 70 5.6 Present query overhead on system performance...... 72 5.7 Data structure associa on mapping according to our rela onal representa on’s rules 76 5.8 Query performance measurements ...... 85 5.9 Expensive jump calls between basic blocks ...... 86 5.10 Query characteris cs, expressiveness, and performance by level of di culty for all tasks 90 5.11 Average student scores and p-value per evalua on criterion ...... 90
6.1 Usage metrics of the i so ware library ...... 96
A.1 Analysis tasks using i , SystemTap and DTrace ...... 111
ix Acknowledgements
Throughout my PhD, I have been blessed to be surrounded by wonderful, kind, and, most of all helpful people that contributed signi cantly in my progress. First and foremost, I can’t thank Diomidis Spinellis enough. He put his faith on me to embark on the PhD and invested great amounts of me and energy in my supervision. He has been there for me, in the good mes and the di cult mes. A true mentor always willing to devote his precious me and share his talent and profound experience. His support has been invaluable in many aspects, research, coding, funding, teaching. He is a rolemodel, as a scien st, as a hacker, as a person. If we could just have a few copies of him, I think Greece would be in be er shape. So if by any chance a prospec ve PhD student is reading this, here is my advice to you: if you are hard-working, eager to learn, and like challenges, I strongly encourage you to go for it. Panos Louridas was unfortunate enough to teach me the rst clues about academic wri ng. But this encounter was very fortunate for me. I frequently turn to him for advice because he owns a knowledge base right there in his head. His breadth and depth of knowledge and crea ve thinking are astounding. Although very busy, Panos has always been available to help me improve my work with his me culus reviews. I admire that he is so well grounded and ac ve both on industry and research. We are very lucky to have him around; he is a precious asset to the lab. If we came to observe the characters of the SENSE crew (besides my crappy self) I think we would observe some trademark characteris cs: hard-working, caring, and ne quality of persons. I bet Tushar couldn’t nd a single smell among them :-) . The SENSE group of PhD students, which consists of Maria Kechagia, Tushar Sharma, Stefanos Georgiou, and Antonis Gkortzis, has a new tradi on: watching StarWars movies with junk food! Maria is one of the purest persons I have ever met. Armed with kindness and a smile, even when things don’t go as planned, she brings grace to our group. She is friendly and easy to talk to and also great to collaborate and exchange thoughts on any subject. Maria is also a fantas c hostess; we all remember the fantas c whole-day trip to Salamina for swimming, lunch, and co ee. I remember when I went to the airport to pick up Tushar. A though ul, gentle, and, genuine guy that came to Greece from India and managed to adapt and pick up speed in no me. Tushar has helped me become more succinct in my research. I have enjoyed our paper collabora ons very much, not to men on the amazing Indian food and the good company. He is very communica ve and a great story teller. Stefanos, the benign gym guy, brought energy in our lab and a steady lunch schedule :-) . The conversa ons and the happy short breaks for a joke he provides make my day. That said, what I like most about Stefanos is his frequent proposals about doing sta together as a group. Even though I haven’t followed up as much as I wanted, his enthusias c sugges ons keep us close. Antonis owes us a boat ride! And I’m not afraid to ask for it (please be careful :-) . Besides that SENSE has its own skipper now, Antonis has a good word for everybody and the inten on to go the extra mile to help. Not to men on that he always nds an interes ng underground story to share about his days in Thessaloniki or Groningen. And there is also the old school SENSE: Dimitris Mitropoulos, Vassilis Karakoidas, Giorgos Gousios,
x xi
Stefanos Androutselis, and Kostas Stroggylos. Dimitris helped me improve my perspec ve on things through our conversa ons. Although a person with a skill set in diverse aspects he remains humble and that’s a good example to follow. He is much fun to hang out with and a good company to discuss serious ma ers. I have enjoyed our bike rides back in the days. Vassilis has his own fun way of expression. I enjoy how he nds analogies and examples to back up his statements. He is a deep thinker and always has an answer as to why it is so. Vassilis has helped me look for the right ques ons and have the answers to those ques ons. Georgios is the rst I met from the SENSE group. I will never forget how pa ently he endured my ques ons when I was beginning my PhD. When I was dubious on a wri ng approach for my rst research papers, he pointed me at the right direc on. He has also kindly reviewed my papers many mes. Now our paths meet again at the examina on of my PhD. Stefanos is known for his wri ng prowess and crea ve logo designs among other things. The few mes that we collaborated, I no ced how his work lends to art. With Kostas we didn’t overlap signi cantly, but I do remember his sharp technical advice and expert biking takes. Furthermore, I want to express my gra tude to the rest of the Professors of my PhD commi ee for their help in improving my work: Damianos Chatziantoniou, Angelos Bilas, Yannis Smaragdakis, Nektarios Koziris, and Georgios Gousios. Their par cipa on is an honor for me. Damianos Chatziantoniou is a cool business-oriented Professor in databases to whom I turn to for the quirks of database research. I am fortunate to have him as advisor to my PhD. He is always keen to answer my ques ons and explore new ideas. Angelos Bilas gave me invaluable advice at the dawn of my PhD and helped me improve signif- icantly my rst research paper. Through our communica on and my one-day visit to his group in Heraklion I learnt to think and write more pragma cally. Yannis Smaragdakis has been kind to review my papers and answer my ques ons (even during lunch break at the PL seminar :-). His advice on my demo presenta on for a paper saved me from a lot of mistakes. I didn’t have the pleasure to collaborate with Nektarios Koziris so far, but I respect his work and I am grateful to have him in my PhD commi ee. I am thankful to Prof. Cris an Cadar who provided me with insigh ul hints during the poster session of a paper where I was fortunate to meet him. I am also thankful for his quick responses to my ques ons. I am grateful to Prof. Adam Vrehopoulos for helping me prepare the user study that I carried out for the valida on of my research work. I would also like to thank two Professors of the DMST department, Christos Taran lis for his sharp advice regarding the PhD grind and Nancy Pouloudi for her support when I assisted her with one of the undergraduate courses she taught. I con nue with the sta of the DMST department’s computer labs: Christos Lazaris, Stavros Grig- orakakis, Vaso Tangalaki, and Giorgos Zouganelis (former employee). Because they are based next to our o ces, we stalk them with requests from me to me. Christos has helped me with network di- agnos cs and data rescues numerous mes. Stavros is always quick to help me overcome a system’s malfunc on and provide expert advice. Vaso takes care to provide us with access to various systems of the department. Giorgos has provided technical support in the past. Switching to a more personal tone, Panos, Vasilis, Dimitris, and Aleksandros have been my closest friends. Thank you guys for your huge support in the course of my life and for all the wonderful me we have together. Chris na is my virtual sister since we grew up together. Although we haven’t spent loads of me recently, we share a bond, and she is there when I need her. xii
My brother, Le eris, is a super cool person whose experience in life can ll numerous pages. He is famous for this sort of life-drawn advice, which he couples with amazing stories. My cousin and godmather Anna is the sponsor of my humble home where I have spent countless hours of undistracted research work. Thank you Anna! My aunt, Tasoula, and my cousin, So a, have also contributed in this context. My parents, Dimitris and Vaso, are the kind that do everything for their children even at their own expense. They have supported me nancially, morally, with prayers, by cooking, and the list goes on. They are the unseen heroes whose only desired reward is seeing their e ort bloom. Thus, this whole e ort is devoted to them as a minor acknowledgment of their service to me for the past thirty plus years. Special references come last. So this is for you Maria, my love and my smile. Thank you for your encouragement and support that made the dark days sunnier. Thank you for bringing joy and laughter in my life. Thank you for being by my side. Summary
The digital data has become a key resource for solving scien c and business problems and achieving compe ve advantage. With this purpose the scien c and business communi es worldwide are trying to extract knowledge from data available to them. The mely use of data signi cantly a ects scien c progress, quality of life, and economic ac vity. In the digital age the e cient processing and e ec ve data analysis are important challenges. The processing of data in main memory can boost processing e ciency especially if it is combined with new so ware system architectures. At the same me useful and usable tools are required for analysing main memory data to sa sfy important use cases not met by database and programming language technologies. The uni ed management of the memory hierarchy can improve the processing of data in main memory. In this architecture the communica on between the di erent parts of the memory hierarchy is transparent to the applica ons and op miza on techniques are applied holis cally. The data ow in the memory hierarchy so that the ones that will be processed shortly are closest to the s and programming languages treat temporary and permanent data of any type uniformly. As a result, new data analysis systems can be developed that take advantage of faster main memory data structures over disk-based ones for processing the data leaving the memory hierarchy to care for the availability of data. The absence of suitable analy cal tools hinders knowledge extrac on in cases of so ware appli- ca ons that do not need the support of a database system. Some examples are applica ons whose data have a complex structure and are o en stored in les, eg scien c applica ons in areas such as biology, and applica ons that do not maintain permanent data, such as data visualiza on applica ons and diagnos c tools. Databases o er widely used and recognized query interfaces, but applica ons that do not need the services of a database should not resort to this solu on only to sa sfy the need to analyze their data. Programming languages on the other hand rarely provide expressive and usable query interfaces. These can be used internally in an applica on, but usually they do not o er interac ve ad-hoc queries at run me. Therefore the data analysis scenarios they can support are standard and any addi ons or modi ca ons to the queries entail recompiling and rerunning the applica on. In addi on to solving problems modeled by so ware applica ons, data analysis techniques are useful for solving problems that occur in the applica ons themselves. This is possible by analyzing the metadata that applica ons keep in main memory during their opera on. This prac ce can be applied to any kind of system so ware, such as an opera ng system. This thesis studies the methods and technologies for suppor ng queries on main memory data and how the widespread architecture of so ware systems currently a ects technologies. Based on the ndings from the literature we develop a method and a technology to perform interac ve queries on data that reside in main memory. Our approach is based on the criteria of usefulness and usability. A er an overview of the programming languages that t the data analysis we choose , the standard data manipula on language for decades. The method we develop represents programming data structures in rela onal terms as requires
xiii xiv