Toward timely, predictable and cost-effective data analytics

THÈSE NO 7140 (2016)

PRÉSENTÉE LE 3 NOVEMBRE 2016
À LA FACULTÉ INFORMATIQUE ET COMMUNICATIONS
LABORATOIRE DE SYSTÈMES ET APPLICATIONS DE TRAITEMENT DE DONNÉES MASSIVES
PROGRAMME DOCTORAL EN INFORMATIQUE ET COMMUNICATIONS

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES

PAR

Renata BOROVICA-GAJIC

acceptée sur proposition du jury:

Prof. C. Koch, président du jury
Prof. A. Ailamaki, directrice de thèse
Dr S. Chaudhuri, rapporteur
Dr G. Graefe, rapporteur
Prof. K. Aberer, rapporteur

Suisse 2016

"Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning." —Albert Einstein

To Ivan, my husband, my friend, my soul mate, for encouraging me to pursue my dreams, and never letting me give up.

To Mia, our beautiful daughter, for lighting up my days and making me look once again at life as a miracle.

To Tamara, my "big" sister and partner in crime, for being there whenever I needed her, and to my parents, Dijana and Petar, for letting me fly.

I dedicate this to them...

Acknowledgements

Although this thesis has only one name written under the title as the author, nothing could be further from the truth. This thesis would not have been possible without the help and support of my advisors, my collaborators, my friends, and my family, all of whom helped me throughout this journey in their own ways.

First and foremost, I would like to thank my advisor, Anastasia Ailamaki, for being an advisor in the real sense of the word: for teaching me the ins and outs of systems, for teaching me how to plan and manage my time, and for teaching me about life, namely to be present in each moment and always do my best. She is the best role model one could have.

Even though the name of Stratos Idreos does not appear as a "co-advisor", he was practically my second advisor, with whom I had the luck to collaborate. Stratos helped me define my own research line, and encouraged me to jump out of the safe zone of well-defined research. Many decisions made in this long journey would have been impossible without his input.

I am also indebted to Goetz Graefe, for organizing a seminar on "Robust Query Processing" and getting me interested in the topic of robustness in query processing. Goetz was also kind enough to agree to serve on my thesis committee, and moreover to follow and help with my research for several years.

On that note, I am also thankful to Harumi Kuno for inviting me to the seminar. If it were not for her, I would probably never have become obsessed with the topic of robustness in DBMS.

I sincerely thank Surajit Chaudhuri, for giving me a chance to do an internship with Microsoft Research and spend a lovely summer in Redmond. Moreover, despite his busy schedule, Surajit accepted to be on my committee and helped my research excel.

I would also like to thank my other mentors from the Data Management, Exploration and Mining group, Vivek Narasayya and Christian König, for their patience and guidance.

Karl Aberer is another committee member to whom I am extremely grateful for devoting his time to follow my research progress, despite having more than a dozen of his own students.


I thank Christoph Koch for accepting to be the thesis jury president and a member of my candidacy exam committee.

I thank my collaborators for our fruitful discussions that resulted in the chapters of this thesis, in chronological order: Ioannis Alagiannis, Miguel Branco, Stratos Idreos, Thomas Heinis, Farhan Tauheed, Marcin Zukowski, Campbell Fraser and Raja Appuswamy. Ioannis and Stratos were the driving force behind the NoDB system. Miguel practically forced us to build an amazing demo of the NoDB system. He was also kind enough to listen to and provide feedback on all my rambling about ideas while they were still in a premature phase. Marcin and Campbell, together with Martina-Cezara Albutiu, helped crystallize the ideas about Smooth Scan during the seminar on Robust Query Processing. Raja initiated the work on cold storage and has been an amazing collaborator since then.

I would also like to thank my previous manager Maja Domazetovic, Professor Miroslav Hajdukovic and Professor Ivan Lukovic for sparking my interest in this field and for being supportive when I decided to leave Serbia.

I would like to thank the members of the DIAS lab for creating a wonderful working environment which made me jump out of my bed in the morning with a smile on my face. The DIAS ladies: I thank Danica Porobic, Pinar Tozun, Mirjana Pavlovic, Eleni Tzirita Zacharatou, Erietta Liarou and Stella Giannakopoulou for the numerous lunches, coffees and discussions that made me feel at home. Special thanks to Danica, who was my "partner in crime" from the first day at EPFL and who proofread and provided feedback on countless paper submissions. Now the DIAS gentlemen: Thomas Heinis, Farhan Tauheed, Adrian Popescu, Ioannis Alagiannis, Manos Athanassoulis, Radu Stoica, Ippokratis Pandis, Ryan Johnson, Debabrata Dash, Miguel Branco, Manos Karpathiotakis, Matt Olma, Iraklis Psaroudakis, Utku Sirin, Darius Sidlauskas, Raja Appuswamy, Satya Valluri, Cesar Torcato, Ben Gaidioz, Angelos Anadiotis, Odysseas Papapetrou, Lionel Sambuc, Georgios Psaropoulos, Xuesong Lu, and Tahir Azim all contributed to this thesis in their own way, by asking hard questions, proofreading, polishing my talks or simply discussing things over a glass of wine, lunch or a cup of coffee. I thank Ioannis Alagiannis, who was a perfect office mate and my first "guide" on the PhD journey. I thank Angelos Anadiotis, my second office mate, for his patience, and for all the good advice he shared with me, probably when I needed it the most, in the last year of my PhD. Thanks to Alejandro Javier Rivas for trying out numerous of my ideas as his semester projects. I am also grateful to Erika Raetz and Dimitra Tsaoussis-Melissargos for their indispensable help with administrative intricacies, document translations and proofreading. Special thanks to Ben for translating the abstract into French.

I would like to thank my Serbian friends, who made the transition to Switzerland super smooth - Bilja, Laza, Natasa, Zlatko, Milos, Mica, Danica, Mira, Baki, Mara, Pedja, Mima, Drazen, Sonja, Mirko, Jasmina, Maja, Petar, Milica, Aleksandar, Vojin, Mića, Teodora, and probably many more whom I have forgotten to mention - thank you all!

I thank my parents, Dijana and Petar, for everything they gave me and for all their love and support. I especially thank them for letting me go wherever I wanted, despite their hearts breaking every time I moved to a new place. My sister Tamara has probably had the biggest influence on my life. She was my big sister, my role model, my partner in crime, and most importantly she kept me sane throughout my life's crises. I am blessed to have such a person in my life.

There are no words to express my gratitude to my husband Ivan, for being by my side for as long as I can remember. I would not be on this path if it were not for him. I would not be who I am if it were not for him. And I would not aspire to achieve greater things and go beyond my comfort zone if he were not by my side. Thank you!

Finally, I would like to thank our baby girl Mia, for making my life pure joy since she joined us. I thank her for all her smiles, hugs, funny faces and the nights during which she let me sleep.

This research has been supported by grants from the School of Computer and Communication Sciences, EPFL, Swiss National Science Foundation: EURYI Award, “Efficient Data Management for Scientific Applications" (PE002-117125/1), Office of Naval Research Global, UK: “From Rats to Humans: Exascale Data Management for Brain Simulations" (SDN: N6290912MD07010), and the European Union Seventh Framework Programme (ERC-2013-CoG), under grant agreement no 617508 (ViDa).

Lausanne, September 2016 Renata Borovica-Gajic


Abstract

Modern industrial, government, and academic organizations are collecting massive amounts of data at an unprecedented scale and pace. The ability to perform timely, predictable and cost-effective analytical processing of such large data sets in order to extract deep insights is now a key ingredient for success. These insights can drive automated processes for advertisement placement and decision making, or lead to major scientific breakthroughs in what is known as the fourth paradigm of data-driven scientific discovery.

Traditional database systems (DBMS) are, however, not the first choice for servicing these modern applications, despite 40 years of database research. This is due to the fact that modern applications exhibit behavior different from that assumed by DBMS: a) timely data exploration as a new trend is characterized by ad-hoc queries and a short user interaction period, leaving the DBMS little time for performance tuning, b) accurate statistics representing relevant summary information about the distributions of ever-increasing data are frequently missing, resulting in suboptimal plan decisions and consequently poor and unpredictable query execution performance, and c) cloud service providers - a major winner in the data analytics game due to the low cost of (shared) storage - have shifted the control over data storage from the DBMS to the cloud providers, making it harder for the DBMS to optimize data access.

This thesis demonstrates that database systems can still provide timely, predictable and cost-effective analytical processing, if they use an agile and adaptive approach. In particular, DBMS need to adapt at three levels (to workload, data and hardware characteristics) in order to stabilize and optimize performance and cost when faced with the requirements posed by modern data analytics applications. Workload-driven data ingestion is introduced as a means to enable efficient data exploration and reduce the data-to-insight time (i.e., the time to load the data and tune the system) by performing these steps lazily and incrementally as a side-effect of posed queries rather than as mandatory first steps. Using the workload as a driving force for data ingestion and performance tuning presents an autonomous way to decouple user interests from the overall data growth. Data-driven runtime access path decision making alleviates suboptimal query execution, postponing the decision on access paths from query optimization, where statistics are heavily exploited, to query execution, where the system can obtain more details about data distributions. Hardware-driven query execution enables the usage of cold storage devices (CSD) as a cost-effective solution for storing the ever-increasing customer data, as such a model hides the non-uniform access latency of CSD and thereby opens the door to a class of inexpensive database-as-a-service-over-cold-storage offerings.

In particular, in this thesis we first present the design and implementation of a novel paradigm called NoDB that skips data loading and provides query processing capabilities over raw data files as a principled way of servicing data exploration queries and reducing the data-to-insight time. NoDB builds auxiliary design structures (positional maps, caches and incremental statistics) that are progressively refined and tailored to hide the overhead of raw data access. We then show that access path morphing from one physical alternative to another to fit the observed data distributions, introduced with Smooth Scan, provides near-optimal performance over the entire selectivity range, removing the need for access path decisions altogether and hence substantially improving the predictability of DBMS. Lastly, to benefit from cold storage devices that offer substantial storage cost savings, we present Skipper, an end-to-end database-as-a-service architecture built on top of CSD. Skipper uses an out-of-order CSD-driven query execution model based on multi-way joins coupled with efficient cache and I/O scheduling policies to hide the non-uniform access latencies of CSD.

This thesis advocates runtime adaptivity as the key to dealing with the rising uncertainty about workload characteristics that modern data analytics applications exhibit. Overall, the techniques introduced in this thesis through the three levels of adaptivity (workload-, data- and hardware-driven adaptivity) increase the usability of database systems and user satisfaction in the case of big data exploration, making low-cost data analytics a reality.

Keywords: Database management systems, data analytics, data exploration, raw data access, robust query execution, predictable query performance, adaptive query processing, hardware-software codesign, cold storage devices

Résumé

De nos jours, que ce soit dans l'industrie, au niveau des gouvernements ou dans le milieu académique, s'est développée une activité de collecte de données à des niveaux d'échelle (quantité et cadence) sans précédent. Dans ce contexte, parvenir à bâtir pour un coût raisonnable des outils efficaces d'analyse à la demande de ces masses de données est un atout clé, car on peut alors en extraire des informations ouvrant la voie à l'automatisation de stratégies marketing, assistant la prise de décision, ou permettant à des équipes de recherche de progresser vers des découvertes scientifiques majeures.

Il s'avère que malgré 40 ans de recherche derrière eux, les systèmes de gestion de base de données (SGBD) traditionnels ne sont pas en tête des choix quant à la mise en oeuvre de ces nouveaux outils. Cela vient du fait que les SGBD sont inadaptés au profil bien particulier de ces applications : a) l'exploration à la demande des données se caractérise en général par son besoin d'interactivité important ainsi que par la variété des requêtes ad-hoc typiques de l'exploration, donc offre peu de place à l'optimisation des requêtes par le SGBD ; b) les données, en quantité croissante, ne sont en général jamais accompagnées d'informations statistiques pourtant nécessaires à l'optimisation des plans d'exécution, provoquant du coup l'exécution des requêtes selon des plans sous-optimaux, donc aux performances mauvaises et de plus difficiles à prédire ; c) le développement des services de cloud, grands gagnants dans ce contexte grâce à leur offre d'espace de stockage en ligne peu onéreux, a naturellement déplacé la problématique de la gestion du stockage des fichiers vers ces mêmes clouds, enlevant de ce fait aux SGBD la possibilité d'optimiser l'accès aux données.

Cette thèse démontre que les systèmes de gestion de base de données peuvent en fait bien offrir un support efficace et rentable à l'analyse de ces immenses masses de données, à la condition qu'ils optent pour une exécution adaptative à trois niveaux : selon les requêtes, selon les données et selon la technologie matérielle (hardware), ceci afin de stabiliser et optimiser non seulement leur performance mais aussi leur coût, et ainsi faire face aux besoins des nouvelles applications d'analyse. Le chargement des données selon les requêtes réduit l'intervalle de temps entre la production d'une donnée et son utilisation par l'analyste. Nous proposons d'effectuer cette étape par un mécanisme incrémental exécuté de manière paresseuse au cours de l'exécution d'une requête, plutôt qu'en amont du travail d'analyse. Cette stratégie d'adaptation aux requêtes permet de servir l'intérêt de l'analyste indépendamment des problématiques découlant de la taille grandissante des données. Puis, plutôt que de réaliser la sélection du chemin d'accès lors de l'étape d'optimisation selon des propriétés statistiques qui doivent donc être connues à l'avance, nous proposons d'effectuer cette sélection dynamiquement pendant l'exécution, au cours de laquelle les propriétés des distributions peuvent être analysées de manière plus fine. Ceci minimise le risque d'exécution d'un plan sous-optimal. Enfin, le modèle consistant à adapter l'exécution des requêtes au matériel sous-jacent permet désormais de considérer les systèmes de cold storage (CSD) comme une solution rentable au stockage des données, car il permet d'occulter la non-uniformité de la latence d'accès inhérente aux CSD. Ceci ouvre la voie à une nouvelle classe de bases de données en ligne peu onéreuses utilisant les CSD.

Plus précisément, dans cette thèse nous commençons par présenter le design et l'implémentation d'un nouveau paradigme appelé NoDB qui n'inclut pas d'étape de chargement des données car il effectue les requêtes directement sur les fichiers bruts, ceci afin de minimiser le temps d'accès aux données après la production des fichiers. NoDB compose des structures de données (index de position, caches et statistiques incrémentales) qui s'affinent au fur et à mesure de l'exécution des requêtes, permettant de compenser le surcoût dû à la lecture des fichiers bruts. Nous décrivons ensuite un système dynamique de choix du chemin d'accès (le Smooth Scan), qui sélectionne l'un ou l'autre chemin en fonction de ce que Smooth Scan constate de la distribution des données. Smooth Scan obtient des performances proches des meilleurs systèmes quelles que soient les propriétés de sélectivité, déchargeant ainsi la base de données de la sélection statique du chemin d'accès en amont de l'exécution, et permettant d'améliorer du même coup la prédictibilité du système. Enfin, pour profiter des systèmes de cold storage qui permettent de réduire significativement les coûts de stockage, nous présentons Skipper, un service de base de données de bout en bout où l'exécution des requêtes est basée sur des multiway joins et un système efficace de cache et d'ordonnancement des entrées/sorties qui occulte la latence non-uniforme de ces supports de stockage.

Cette thèse préconise l'adaptation en cours d'exécution comme élément clé face au caractère incertain de la charge de calcul de ces nouveaux outils d'analyse. D'une manière générale, les techniques que nous présentons à travers les trois niveaux d'adaptation mentionnés (requêtes, données et matériel) améliorent la performance et le ressenti par l'utilisateur d'une base de données dans un cadre de big data, pour faire de l'analyse de données à un coût raisonnable une réalité.

Mots-clés : systèmes de gestion de base de données, analyse de données, exploration de don- nées, accès aux données brutes, exécution robuste de requêtes, performance prédictible des requêtes, traitement adaptatif de requêtes, codesign logiciel/matériel, systèmes cold storage.

Contents

Acknowledgements i

Abstract (English/Français) v

Table of Contents ix

List of Figures xiii

List of Tables xvii

1 Introduction 1
1.1 From data to valuable insights ...... 1
1.2 Data analytics ...... 2
1.2.1 Timely data analytics ...... 3
1.2.2 Predictable data analytics ...... 3
1.2.3 Cost-effective data analytics ...... 4
1.3 Why not Database Management Systems ...... 4
1.3.1 Predefined vs. ad-hoc workload ...... 5
1.3.2 The implications of big data volume & velocity ...... 5
1.3.3 Storage as a separate entity ...... 5
1.4 Thesis Statement and Contributions ...... 6
1.5 Thesis Impact ...... 7
1.6 Thesis Organization ...... 8

I Background 11

2 A Look Inside the DBMS 13
2.1 The relational model ...... 13
2.2 Components of a DBMS ...... 14

3 Query Processing 17
3.1 Metadata information in the system catalog ...... 17
3.2 Query optimization ...... 18
3.2.1 Plan enumeration ...... 18
3.2.2 Plan costing ...... 20
3.2.3 Result size estimation ...... 20
3.2.4 Improving statistics with histograms ...... 22
3.3 Query execution ...... 23
3.3.1 Data organization on external storage ...... 23
3.3.2 Execution operators ...... 24
3.4 Perils of query processing ...... 27

4 Physical Database Design 29
4.1 Automated physical designers ...... 30
4.2 Perils of physical design ...... 31
4.2.1 Query optimizer as a single source of truth ...... 31
4.2.2 Testing physical database designers' predictability ...... 33
4.2.3 The need for lightweight tuning ...... 41

5 Improving Query Performance through Corrective Actions 43
5.1 Runtime statistics refinement ...... 43
5.2 Dynamic plan change ...... 44
5.3 Robust plan selection ...... 45
5.4 Adaptive operators ...... 45

6 Database Storage 47
6.1 Database storage tiering ...... 47
6.2 Proliferation of cold data ...... 49
6.2.1 Cold data in the archival tier ...... 50
6.2.2 Application-hardware mismatch ...... 50
6.3 Cold storage devices ...... 51

II Quest for timely, predictable and cost-effective data analytics 55

7 Timely and Interactive Data Analytics 57
7.1 Introduction ...... 58
7.2 Reducing data-to-insight time by querying raw data files ...... 60
7.2.1 Raw query processing ...... 60
7.2.2 The NoDB paradigm ...... 61
7.3 PostgresRaw: From the NoDB idea to practice ...... 62
7.3.1 Minimizing data transformation overhead ...... 63
7.3.2 Reducing the cost of data roundtrips ...... 64
7.3.3 Avoiding raw data access ...... 67
7.3.4 Improving quality of plans with incremental statistics ...... 68
7.3.5 Putting it all together ...... 69
7.4 Experimental Evaluation ...... 69
7.4.1 Micro-benchmarks ...... 71
7.4.2 TPC-H Workload ...... 79
7.4.3 Querying scientific data ...... 80
7.4.4 Statistics in PostgresRaw ...... 81
7.5 Trade-offs and opportunities of raw query processing ...... 82
7.6 Related work ...... 83
7.7 Conclusions ...... 86

8 Predictable Data Analytics 87
8.1 Introduction ...... 88
8.2 Background ...... 92
8.3 Intra-operator adaptivity with Smooth Scan ...... 93
8.3.1 Morphing Mechanism ...... 94
8.3.2 Morphing Policies ...... 96
8.3.3 Morphing Triggering Point ...... 96
8.4 Introducing Smooth Scan into PostgreSQL ...... 97
8.4.1 Design Details ...... 97
8.4.2 Interaction with Query Processing Stack ...... 99
8.5 Modeling Smooth Scan ...... 100
8.6 Competitive Analysis ...... 104
8.6.1 Greedy Policy ...... 105
8.6.2 Selectivity Increase Driven Policy ...... 108
8.6.3 Elastic Policy ...... 109
8.7 Experimental Evaluation ...... 112
8.7.1 Experimental Setup ...... 112
8.7.2 TPC-H analysis ...... 112
8.7.3 Fine-grained analysis over the entire selectivity range ...... 114
8.7.4 Sensitivity analysis of Smooth Scan ...... 116
8.7.5 Smooth Scan on SSD ...... 121
8.7.6 Cost model analysis ...... 122
8.7.7 The benefit of mid-operator runtime adaptivity ...... 124
8.7.8 Statistics collection as opposed to intra-operator adaptivity ...... 125
8.8 Related work ...... 126
8.9 Concluding remarks ...... 128

9 Data Analytics for a Penny 131
9.1 Introduction ...... 132
9.2 Scope and Background ...... 134
9.3 The case for cold storage tier ...... 135
9.3.1 Price implications of CST ...... 135
9.3.2 Performance implications of CST ...... 136
9.3.3 A case for CSD-driven query execution ...... 139
9.4 Skipper: Query Processing on CSD ...... 140
9.4.1 Skipper Architectural Simulator ...... 142
9.4.2 CSD-driven, cache state-aware MJoin ...... 143
9.4.3 Cache management ...... 146
9.4.4 Client proxy ...... 147
9.4.5 Scheduling disk group switches ...... 148
9.5 Experimental Evaluation ...... 153
9.5.1 Experimental Setup ...... 153
9.5.2 Experimental Results ...... 154
9.6 Related work ...... 162
9.7 Outlook and conclusions ...... 163

10 Concluding Remarks and Future Ahead 165
10.1 Thesis contributions and lessons learned ...... 165
10.2 Thesis impact ...... 166
10.3 Looking ahead ...... 167

A An appendix 169
A.1 TPC-H Query Plans ...... 169
A.1.1 Q1 ...... 169
A.1.2 Q4 ...... 170
A.1.3 Q6 ...... 170
A.1.4 Q7 ...... 171
A.1.5 Q14 ...... 173

Bibliography 202

Curriculum Vitae

List of Figures

2.1 Components of a DBMS ...... 15

3.1 Clustered Index ...... 24
3.2 Unclustered Index ...... 25

4.1 Impact of physical design: Performance improvement after creating a set of indexes, normalized by the original execution time ...... 30
4.2 The interaction between the physical designer and optimizer ...... 31
4.3 Negative impact of physical design: Performance degradation after creating a set of indexes, normalized by the original execution time ...... 32
4.4 The cause of performance degradation of Q9 ...... 32
4.5 REE with TPC-H SF100 ...... 37
4.6 System-A: Relative estimation error in databases of different size ...... 38
4.7 System-A: Actual improvement when increasing the space budget ...... 38

6.1 Storage tiering for enterprise databases ...... 48
6.2 Cost benefits of storage tiering ...... 49
6.3 CSD in the storage tiering hierarchy ...... 52
6.4 Pelican rack schematic ...... 53

7.1 Reducing data-to-insight time and improving user interaction with NoDB ...... 62
7.2 Overheads inherent to query processing over raw data files ...... 62
7.3 An example of indexing a raw file with a positional map ...... 65
7.4 Mapping between the attributes and their positions in the positional map ...... 66
7.5 Increasing the number of pointers in PM ...... 71
7.6 Scalability of PM ...... 71
7.7 The effect of the positional map and caching ...... 72
7.8 Adapting to changes in the workload ...... 74
7.9 Comparing the performance of PostgresRaw with other DBMS ...... 75
7.10 PostgresRaw performance compared to other DBMS as a function of selectivity decrease: a) selectivity 100-1%, b) selectivity 1% ...... 76
7.11 PostgresRaw performance compared to other DBMS as a function of projectivity decrease: a) projectivity 100-1%, b) projectivity 1% ...... 77
7.12 PostgresRaw performance compared to other DBMS as a function of: a) selectivity and b) projectivity increase ...... 78
7.13 PostgreSQL vs. PostgresRaw when running two TPC-H queries that access most tables ...... 79
7.14 Performance comparison between PostgreSQL and PostgresRaw when running TPC-H queries ...... 79
7.15 PostgresRaw vs PostgreSQL over a scientific data set ...... 80
7.16 Execution time in PostgresRaw with statistics collection support ...... 82

8.1 Non-robust performance due to selectivity misestimates in a state-of-the-art commercial DBMS when running TPC-H ...... 89
8.2 Access path selection in DBMS ...... 90
8.3 Runtime reoptimization ...... 90
8.4 Access Paths in a DBMS ...... 92
8.5 Targeted behavior of Smooth Scan ...... 94
8.6 Smooth Scan access pattern ...... 95
8.7 The Worst Case Result Distributions for Smooth Scan alternatives ...... 106
8.8 The competitive analysis of the Greedy policy for the highest page miss rate ...... 106
8.9 Competitive analysis of Selectivity Increase Smooth Scan when compared against Oracle ...... 109
8.10 Competitive analysis of Elastic Smooth Scan for the worst case for the Selectivity Increase driven policy ...... 110
8.11 Improving performance of TPC-H queries with Smooth Scan ...... 113
8.12 Smooth Scan vs. alternative access paths for a query with and without an order-by clause ...... 115
8.13 Sensitivity analysis of Smooth Scan modes ...... 117
8.14 Maximum morphing region size (num. of pages) ...... 117
8.15 Morphing policies ...... 118
8.16 Triggering choices ...... 118
8.17 Handling skew ...... 119
8.18 Analysis of auxiliary data structures ...... 120
8.19 Memory sensitivity of Result Cache ...... 121
8.20 Smooth Scan on SSD ...... 122
8.21 Comparing the analytical model with actual execution ...... 123
8.22 Switch Scan performance cliff and overall benefit ...... 124
8.23 Statistics collection alternatives in DBMS-X as an alternative to run-time adaptivity: a) basic statistics, b) single-column histograms, c) joint-distribution histograms ...... 125

9.1 Cost savings of CSD as a replacement for the HDD-based capacity tier ...... 136
9.2 PostgreSQL over CSD vs HDD ...... 138
9.3 Latency sensitivity ...... 138
9.4 Event timeline ...... 139
9.5 Skipper architecture ...... 141
9.6 CSD-driven vs. traditional execution ...... 145
9.7 Cache management policies ...... 145
9.8 Scheduling policies ...... 151
9.9 L2-norm ...... 151
9.10 Cum. workload execution time ...... 151
9.11 Simulator results (L2-Norm, Max. Stretch and Cumulative workload time): K variation as a function of number of issued queries ...... 152
9.12 Average exec. time comparison ...... 155
9.13 Cumulative exec. time of mixed workload ...... 155
9.14 Avg. exec. time breakdown for 5 clients ...... 156
9.15 Sensitivity to CSD group switch latency ...... 156
9.16 Local vs remote execution ...... 157
9.17 MJoin sensitivity to: a) layout b) cache size c) data set size ...... 158
9.18 Fairness vs. efficiency: a) L2-Norm b) Cumulative workload time ...... 160
9.19 K parameter variation ...... 161


List of Tables

3.1 Cost model parameters ...... 25
3.2 Access path selection formulas ...... 26
3.3 Join implementations formulas ...... 26

4.1 Metrics descriptions ...... 34
4.2 Predictability when using the TPC-H 10GB ...... 36
4.3 Predictability when increasing workload size ...... 39

6.1 Acquisition cost in $/GB and fraction of data stored in each device type for various tiering configurations as reported by [230]. The tier corresponding to each device is shown in parentheses, with P standing for performance, C for capacity, and A for archival ...... 49

8.1 Smooth Scan: Cost model parameters ...... 101
8.2 I/O Analysis ...... 113

9.1 Data layout and Execution subplans ...... 140
9.2 Scheduling policies implemented by the simulator ...... 151
9.3 Execution breakdown of PostgreSQL and MJoin ...... 157


1 Introduction

1.1 From data to valuable insights

The evolution of computing power, followed by the decreasing costs of computation and storage infrastructure, has revolutionized all scientific fields and enterprises, placing the ability to collect unprecedented amounts of data at its core [127, 130, 131, 213, 228]. To illustrate, according to [130], the amounts of machine-generated data are growing exponentially and are estimated to reach 44 ZB (1 ZB = 10²¹ bytes) by 2020. Real progress, however, depends on how efficiently we can 'extract value from chaos', i.e., transform data into useful information. New insights in the sciences and ground-breaking advances in industry now depend upon our ability to analyze massive and complex data sets, in what is called the fourth paradigm of scientific discovery through data-driven computing [125].

The newly coined term big data is associated with data management challenges related to the storage, organization and analysis of large-scale data. In particular, big data is a concept that refers to the inability of traditional data architectures to efficiently cope with the characteristics of new data sets [235]. The five V's (namely volume, velocity, variety, veracity and value) are typically used to define the properties of large-scale data [60, 128, 165, 261].

• Volume refers to the sheer size of generated data in every domain. According to [130], 32 billion devices will be plugged in and generating data by 2020. The number of devices - also referred to as "things" (e.g., tablets, smart phones, cars, telescopes, and anything with a sensor) - which are able to connect to the Internet is already approaching 200 billion. 14 billion of those devices are actually connected to each other, creating the "Internet of Things" (IoT). Furthermore, as reported in [220], data amounts are increasing by 60% annually. To illustrate, enterprise data is doubling every 3 years, Twitter processes 7 TB of data every day, Facebook processes 10 TB each day, while the Large Hadron Collider at CERN generates 40 TB every second [47].

• Velocity refers to the speed requirement for collecting, processing, and exploiting the data, usually generated by sensors. Many analytical algorithms can process vast quantities of information if there is time for the job to run overnight. But for real-time tasks such as equipment reliability monitoring, outage prevention or security monitoring, an overnight period is too long, motivating companies to build an infrastructure for streaming data and relatively large-volume data movement.

• Variety refers to the different types of data collected from various devices. Unlike in the past when data analysis mostly focused on structured data organized in relational tables, nowadays there is a strong need to harness different types of data including messages, social media conversations, photos, sensor data, video or voice recordings, and bring them together with more traditional, structured data.

• Veracity refers to the trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable. Big data and analytics technology, however, has to be able to work with such data. Luckily, the data volumes often make up for the lack of quality or accuracy.

• Value refers to the ability to turn collected data into valuable information that gives insights and drives decision-making processes. Data can deliver value in almost any area of business or society. It helps companies better understand and serve customers; examples include the recommendations made by Amazon, eBay, etc. It can improve health care by making it possible to predict disease outbreaks, or to estimate the risk of developing a disease (and potentially prevent it through proactive actions). Similarly, predictive analytics can enable power companies to make a wide range of forecasts, such as the availability of energy at a certain point in time, its potential price, or the likelihood of power failures [127].

1.2 Data analytics

The examples of applications which utilize large-scale data to bring significant business value are endless. Recognizing an opportunity, industries nowadays see data as a market differentiator, as information has become their biggest asset [139, 163, 247]. This trend is prevalent in sectors such as telecommunications, internet search and marketing, which see data as a key driver of monetisation and growth, placing more emphasis than ever on efficient analytical processing of data.

On-line Analytical Processing (OLAP) applications have been crucial to businesses since the 1980s. The recent trends related to the pace of data generation and the consequent needs for storing and analyzing data have created a new set of requirements for OLAP. As enterprises are a) using new types of data sources, b) adopting new ways to analyze data, and c) applying new methodologies for turning data analysis into business revenue, they have started moving from predefined to exploratory and real-time analytics¹. Moreover, enterprises are also increasingly putting an emphasis on self-service business intelligence and analytics, giving executives and other non-database-savvy users easy-to-use software tools for data discovery and timely decision-making [130]. Therefore, in order to be competitive on the market, new data analytics services have to be timely, predictable, and cost-effective. As a matter of fact, these requirements form the core of the big data problem, because the characteristics of large-scale data sets (the V's) make them impractical to process and analyze with traditional database technologies; big data demands innovative time- and cost-effective forms of processing for enhanced insight and decision making [94, 139, 239].

¹ In this thesis, we refer to such applications as modern data analytics applications.

In this thesis, we set a path toward enabling timely, predictable and cost-effective data analytics, while focusing on the self-manageability aspects of such services.

1.2.1 Timely data analytics

The temporal dimension is extremely important when it comes to extracting useful information from the data. For example, smart grid and smart meter systems generate vast amounts of data from various sensors and require near-real-time responses [127]. Similarly, Amazon needs to propose "similar items" in real time, otherwise the user may lose interest. Data exploration is a new class of timely data analytics applications that emphasizes the interaction between users and data. Users do not know in advance what they are searching for, and decide on future requests based on previous discoveries. Since the user is at the center of the system, this interaction has to happen in real time [8, 111, 162].

Data warehouses (DW) are designed based on a predefined information demand, meaning that the analytical queries they support are predetermined and do not go beyond queries that the schema of the data warehouse supports. Unlike in data warehousing and business intelligence (BI), in data exploration the information demand is unknown a priori, and the goal of discovery is to reveal things that the users of the system may not even have thought about. A system design tailored for data exploration thus needs to plan for the unknown, while giving the user a fully interactive experience [132, 138].

1.2.2 Predictable data analytics

Nowadays, business analytics systems deliver unpredictable performance [27]. When an analyst submits a query, (s)he does not know whether to wait for the response, perform some other activity while waiting, or even check for the response the next day, since the query might run for anywhere from mere seconds to hours. Even more worrisome is the fact that a query might complete in seconds one day and take hours the next, while the only change to the query might be a different choice of parameters [170].

Stability and predictability, which imply that similar query inputs should have similar execution performance, are major goals for industrial vendors toward respecting service level agreements (SLAs) [106, 174]. This is exemplified in cloud environments, which offer functionality that is paid for as a service and governed by SLAs, where SLA violations result in severe penalties for service providers. In such settings, the system's ability to operate efficiently in the face of unexpected and especially adverse run-time conditions becomes more important than achieving extremely fast execution for one query while possibly suffering from severe degradation for another [106, 108, 109].

1.2.3 Cost-effective data analytics

As the world is realizing the business potential of exploiting the value of data, the storage assets of most companies are growing quickly. Moreover, many organizations have to manage some form of long-term archiving. Enterprises have regulatory and business requirements to retain everything from email to customers’ transactions, hospitals create archives of all digital assets related to patients, governments have to provide long-term (even infinite) access to important information [229].

Big data can be a powerful way to identify business opportunities; however, as the volumes of data collected are vast, traditional storage methods are becoming prohibitively expensive and do not scale effectively [58, 126, 156]. Businesses have already realized that purchasing dedicated hardware for peak workload performance, and simultaneously over-provisioning to accommodate future growth, is not cost-effective [156]. For instance, according to a recent report [70], the storage cost dominates the overall cost of a data warehouse built for a large industrial diesel engine sensor processing system, contributing more than 75% of the total system cost. To minimize cost, more and more businesses are thus moving their data into the cloud, where customers pay storage and processing costs at the right-provisioning level, i.e., only for the amount of resources they actually need [74, 250]. In addition to being cost-effective (since the hardware is shared among multiple customers) and removing the need to buy over-provisioned hardware, the cloud offers reliability and virtually infinite elasticity, making it an attractive platform for data analytics services [71, 185].

1.3 Why not Database Management Systems

Database management systems (DBMS) have been the predominant tool for extracting useful knowledge from data, thanks to 40 years of research into efficient algorithms for this task. Modern applications, however, have shifted the trend toward custom-made systems, each tailored for a specific use case, leaving database management systems out of the loop when it comes to analytical processing and exploration over big data [177]. This is attributed to the fact that modern applications exhibit behavior that deviates from the traditional assumptions employed by database systems, making DBMS an inefficient tool for the given task. The rest of this section discusses the most important discrepancies.


1.3.1 Predefined vs. ad-hoc workload

Traditional database systems provide fast query execution performance at the expense of a relatively long initialization step. Before being able to process queries, DBMS need to load the data and then often require a tuning step, a procedure that relies on the given workload characteristics to create a set of statistics representing data distributions, as well as physical design structures (e.g., indexes, materialized views) which are subsequently used to optimize data accesses. Modern applications, and data exploration in particular, are characterized by ad-hoc workloads, where the query sequence is driven by previous actions rather than being predefined. In such a setting, an optimal physical design becomes a moving target rather than a one-time investment, making the tuning procedure an expensive and not necessarily useful step [103, 132, 133, 134, 135, 137]. Without proper tuning, however, DBMS exhibit poor performance [33].

1.3.2 The implications of big data volume & velocity

As the Internet of Things and similar data lakes continue to grow in size, it is clear that the big data trend is here to stay. What this implies for the traditional DBMS workflow is that data is becoming too big for the DBMS to know all of its relevant properties. Gathering all possible statistics to capture data distributions is becoming an expensive procedure, especially since with ad-hoc queries it is not known which portions of the data will actually be useful. Without accurate and up-to-date statistics that capture data characteristics (e.g., value domains, min/max values, histograms), DBMS are likely to suffer from poor performance, simply because query optimizers use these statistics when deciding on the execution strategy to access the data [20]. Without proper statistics, the DBMS will choose suboptimal plans (i.e., suboptimal strategies to access the data), resulting in poor performance [24, 31, 83, 108, 148, 149, 150, 155, 168, 170, 175, 186].
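To make this dependence concrete, the following toy Python sketch (the histogram contents, cost constants and cost model are all made-up illustrations, not taken from any real optimizer) shows how a histogram-based selectivity estimate feeds an access path decision, and how a stale histogram can flip that decision toward a plan that is far more expensive under the data's true distribution.

```python
# Sketch of how stale statistics can flip an access path decision.
# All numbers and the cost model are illustrative, not from any real system.

def estimate_selectivity(histogram, value):
    """Estimate the fraction of rows matching `col = value` from a simple
    equi-width histogram given as {(low, high): row_count}."""
    total_rows = sum(histogram.values())
    for (low, high), count in histogram.items():
        if low <= value < high:
            # Assume values are uniformly distributed within the bucket.
            return (count / (high - low)) / total_rows
    return 0.0

def pick_access_path(selectivity, n_pages, rows_per_page, random_io_penalty=4.0):
    """Toy cost model: a full scan touches every page sequentially, while an
    index scan pays a (more expensive) random I/O per qualifying row."""
    full_scan_cost = n_pages
    index_scan_cost = selectivity * n_pages * rows_per_page * random_io_penalty
    return "index scan" if index_scan_cost < full_scan_cost else "full scan"

# Histogram gathered when the column was sparse and spread over 0..1000 ...
stale_hist = {(0, 1000): 10_000}
# ... versus the current, skewed data: most rows now cluster below 50.
fresh_hist = {(0, 50): 900_000, (50, 1000): 100_000}

for label, hist in [("stale stats", stale_hist), ("fresh stats", fresh_hist)]:
    sel = estimate_selectivity(hist, value=42)
    choice = pick_access_path(sel, n_pages=50_000, rows_per_page=100)
    print(f"{label}: estimated selectivity {sel:.4f} -> {choice}")
```

With the stale histogram the toy optimizer commits to an index scan; under the data's actual distribution the same predicate qualifies many more rows and a full scan would have been the cheaper choice.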

1.3.3 Storage as a separate entity

Traditionally, DBMS have assumed that they have full control over how and where the data is stored, and that the storage subsystem is not shared with other applications. While these assumptions are still valid for in-house database solutions, both of them are invalidated in cloud-hosted database applications and virtualized enterprise data centers [216]. In the latter case, customers (tenants) share storage and computation resources [71, 221]. Databases offered as a service usually run in virtual machines over shared storage [86, 185]. To ensure proper load balancing and reliability, the storage manager (a separate entity) now has full control over data placement. Without knowing where and how data is laid out, it is challenging for a DBMS to optimize data access.


1.4 Thesis Statement and Contributions

Considering the requirements of modern applications, and the inability of DBMS to adapt to a fast-changing environment in which query, data and hardware characteristics frequently change and/or are not known prior to query execution, DBMS have lost the race against custom-designed solutions tailored for modern data applications. This thesis helps bridge the gap between traditional DBMS technology and the requirements of modern data analytics applications.

Thesis Statement

As traditional DBMS rely on predefined assumptions about the workload, the data and the storage, changes cause loss of performance and unpredictability. Query execution must adapt at three levels (to workload, data and hardware) to stabilize and optimize performance and cost.

The focus of this thesis is to minimize the cost of data analytics while maximizing predictability. With this work, we show that DBMS can still provide timely, predictable, and cost-effective data analytics capabilities, if they adapt their query execution strategies. To achieve this goal, DBMS have to respond to the following three technological challenges:

1. To enable timely data exploration, DBMS have to reduce the data-to-insight time (i.e., the time to load the data and tune the system).

2. To enable predictable data analytics, DBMS have to alleviate suboptimal query execution as a result of incomplete statistics.

3. To reduce the storage cost, DBMS have to open up to shared-storage solutions and consider new (low-cost) hardware.

This thesis addresses all three technological challenges, building on the following intellectual insights:

• Runtime adaptivity is key when dealing with rising uncertainty about data, workload or hardware characteristics. In particular, query execution adaptation is needed at three levels (workload, data and hardware) in order to stabilize and optimize performance and cost when faced with the requirements posed by modern data analytics applications.

• Workload-driven data ingestion provides a means to enable efficient data exploration and reduce the data-to-insight time by doing these steps lazily and incrementally as a side-effect of posed queries and only for the data the user cares about.


• Data-driven runtime access path decision making alleviates suboptimal query execution, postponing the decision on access paths from query optimization, where statistics are heavily exploited, to query execution, when the system can obtain more details about data distributions and hence can better fit the access path choice.

• Hardware-driven query execution hides the non-uniform access latency of CSD, enabling the usage of CSD as a cost-efficient solution for storing the ever-increasing customer data.

Based on these insights, this thesis makes the following technological contributions (an illustrative sketch of each mechanism follows the list):

• We present the design and implementation of a novel paradigm called NoDB that completely avoids data loading and provides query processing capabilities over raw data files with performance competitive with that of traditional DBMS. NoDB builds auxiliary design structures (positional maps, caches, and incremental statistics) that are refined progressively and are tailored to hide the overhead of raw data access.

• We present the design and implementation of a hybrid access path operator called Smooth Scan that uses the knowledge of data distributions obtained at runtime to morph between access path alternatives, i.e., between an index access and a sequential scan, thereby providing near-optimal performance throughout the entire range of possible operator selectivities. In this way, the access path decisions depend only on the actual query and its selectivity, and not on the accuracy of the statistics present in the system, which significantly increases the repeatability across multiple query invocations and improves the predictability of the system.

• To benefit from cold storage devices that offer substantial storage cost savings over traditional HDD-based storage, we present Skipper, an end-to-end database-as-a-service architecture built on top of CSD. Skipper employs an out-of-order CSD-driven query execution model based on multi-way joins coupled with efficient cache and I/O scheduling policies to hide the non-uniform access latencies introduced with CSD.
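The positional map behind NoDB can be pictured with a small sketch. The Python fragment below is a toy illustration only, not the PostgresRaw implementation (the class, file name and interface are hypothetical): while answering a query over a raw CSV file it records the byte offsets of the fields it has to tokenize, so that later queries touching the same or neighboring attributes can jump straight to a field instead of re-splitting every line.

```python
# Illustrative sketch of a NoDB-style positional map over a raw CSV file.
# Toy code: it records field offsets as a side-effect of answering queries,
# so later queries avoid re-tokenizing lines they have already touched.

class PositionalMap:
    def __init__(self, path, delimiter=b","):
        self.path = path
        self.delimiter = delimiter
        # {line_start_offset: {column_index: field_start_offset_within_line}}
        self.positions = {}

    def scan_column(self, col):
        """Return all values of column `col`, refining the map as we parse."""
        values = []
        with open(self.path, "rb") as f:
            line_offset = 0
            for line in f:
                fields = self.positions.setdefault(line_offset, {})
                if col in fields:
                    # The field position is already known from an earlier query.
                    start = fields[col]
                    end = line.find(self.delimiter, start)
                    end = len(line.rstrip(b"\r\n")) if end == -1 else end
                    values.append(line[start:end].decode())
                else:
                    # First time here: tokenize once and remember every offset.
                    start = 0
                    for i, field in enumerate(line.rstrip(b"\r\n").split(self.delimiter)):
                        fields.setdefault(i, start)
                        if i == col:
                            values.append(field.decode())
                        start += len(field) + len(self.delimiter)
                line_offset += len(line)
        return values

# Hypothetical usage over a raw file that was never loaded into the DBMS:
#   pm = PositionalMap("lineitem.csv")
#   qty   = pm.scan_column(4)   # first touch: parses lines, builds the map
#   price = pm.scan_column(5)   # reuses the offsets recorded as a side-effect
```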
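The morphing idea behind Smooth Scan can likewise be conveyed with a deliberately simplified sketch; the actual policies, triggering conditions and PostgreSQL integration are discussed in Chapter 8, and the function, thresholds and growth factor below are assumptions made purely for illustration. The operator starts as an index scan and, as the selectivity it observes grows, widens the region of adjacent pages it fetches around every index hit, degenerating gracefully toward a sequential scan.

```python
# Simplified illustration of access path morphing (not the PostgreSQL patch
# described in Chapter 8). The operator begins as an index scan and enlarges
# the region of adjacent pages fetched per index hit as selectivity grows.

def smooth_scan(index_hits, total_pages, region=1, growth=2, trigger=0.02):
    fetched = set()     # pages already read (stands in for the page-ID cache)
    order = []          # order in which pages are produced
    hits_seen = 0
    for page_id in index_hits:            # page IDs returned by the index probe
        hits_seen += 1
        if hits_seen / total_pages > trigger:
            # Morph: enlarge the region fetched per hit to amortize random I/O.
            region = min(region * growth, total_pages)
        start = (page_id // region) * region
        for p in range(start, min(start + region, total_pages)):
            if p not in fetched:          # never fetch the same page twice
                fetched.add(p)
                order.append(p)
    return order, len(fetched)

# Hypothetical run: the optimizer expected a handful of matches, but the index
# keeps producing hits; the operator widens its regions instead of thrashing.
hits = list(range(0, 10_000, 2))
_, pages_read = smooth_scan(hits, total_pages=10_000)
print(f"{pages_read} pages fetched mostly sequentially, versus "
      f"{len(hits)} random probes for a plain index scan")
```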
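Finally, the out-of-order flavor of Skipper's execution can be hinted at with a toy scheduler. On a cold storage device only one disk group is spun up at a time and a group switch takes seconds, so rather than requesting data in the order the plan dictates, execution consumes whatever the active group can serve. The sketch below covers only this scheduling aspect (not the multi-way join or cache management), and the latencies, group sizes and the greedy policy are hypothetical assumptions, not Skipper's actual design.

```python
# Toy model of CSD-driven, out-of-order data access (not the Skipper code).
from collections import defaultdict

GROUP_SWITCH_COST = 10.0    # seconds per spin-up, hypothetical
READ_COST = 0.1             # seconds per requested block, hypothetical

def csd_driven(requests):
    """requests: list of (query_id, disk_group, block_id).
    Serve them batched by disk group, amortizing every group switch."""
    by_group = defaultdict(list)
    for r in requests:
        by_group[r[1]].append(r)
    time, served, active = 0.0, [], None
    while by_group:
        group = max(by_group, key=lambda g: len(by_group[g]))  # biggest backlog first
        if group != active:
            time += GROUP_SWITCH_COST
            active = group
        for r in by_group.pop(group):
            time += READ_COST
            served.append(r)
    return served, time

def plan_order(requests):
    """Baseline: serve requests in plan order, switching groups on demand."""
    time, active = 0.0, None
    for _, group, _ in requests:
        if group != active:
            time += GROUP_SWITCH_COST
            active = group
        time += READ_COST
    return time

# Hypothetical workload: three concurrent queries whose blocks alternate
# between two disk groups, the worst case for strictly in-order execution.
reqs = [(q, g, b) for b in range(50) for q, g in [(1, "A"), (2, "B"), (3, "A")]]
print(f"plan order: {plan_order(reqs):.0f}s, CSD-driven: {csd_driven(reqs)[1]:.0f}s")
```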

1.5 Thesis Impact

The long-term impact of this thesis reaches beyond relational DBMS (RDBMS), as the ideas introduced in the thesis can be exploited to address the perils of big data in general:

• Decoupling users' interest from the data growth by workload monitoring is a natural way of dealing with data proliferation. Therefore, adaptive indexing and caching, where (partial) indexes and caches are progressively built and refined, could equally improve the performance of NoSQL solutions.

• Data monitoring and continuous data access decision making help in dealing with the uncertainty about data properties that large-scale data sets exhibit. Access path morphing is a non-invasive strategy that allows shifts in access path choices without penalties on performance.

• Hardware-driven query execution is a promising path to hiding non-uniform hardware access latencies, which opens up DBMS (and other data-intensive applications) to a broader spectrum of (low-cost) storage solutions.

• The cost and performance analysis of Skipper over CSD demonstrates that CSD can flatten the storage tiering hierarchy of enterprise databases by introducing the Cold Storage Tier as a replacement for both the high-cost capacity tier and the archival tier, substantially reducing the cost of the storage hierarchy for enterprise databases.

• Skipper’s architecture could benefit cloud service providers as they can increase revenue by offering inexpensive data analytics services over CSD.

1.6 Thesis Organization

The remainder of this thesis is organized as follows:

Part I (Chapters 2 to 6) gives the necessary background on the topics discussed in this thesis. Chapter 2 provides a quick look at the internals of a DBMS. Chapter 3 discusses the traditional query processing life-cycle from query optimization to query execution, illustrating the problems of the current execution model with respect to the characteristics of modern applications discussed in Section 1.3. We then describe the traditional database tuning procedure in Chapter 4, both in offline and online settings, and examine the current state of affairs in physical design. As in the previous case, we discuss the problems of the current tuning procedure with respect to modern data applications. Chapter 5 gives background on the existing state of the art in adaptive query processing, which is the processing model advocated in this thesis. Lastly, Chapter 6 discusses storage techniques in enterprise data centers, with special attention given to cold storage devices, which are the storage devices we analyze in Chapter 9.

Chapter 7 introduces our efforts in enabling interactive and timely data exploration. We demonstrate that data loading is a major bottleneck toward enabling timely data exploration and present NoDB, a paradigm for building database systems that completely skips data loading, leaving data files in their original raw format. To be competitive with traditional DBMS, NoDB progressively builds auxiliary design structures (e.g., positional indexes, caches and incremental statistics) as a side-effect of the user queries run on the system.

Chapter 8 focuses on the predictability aspects of query processing. We first isolate suboptimal access path decisions proposed by the optimizer as a major bottleneck toward achieving predictable performance. We then introduce Smooth Scan, a new paradigm in building access path operators that employs continuous adaptation and morphing at runtime by transforming from one physical alternative to another (i.e., from an index access to a sequential scan) based on the observed data distributions (selectivity factors). We show that a system with Smooth Scan requires neither access path decisions taken by the optimizer up front nor accurate statistics to provide near-optimal performance.

Chapter 9 focuses on the monetary aspects of data analytics. To decrease the cost of data analytics services, we use cold storage devices as primary storage for enterprise and cloud-hosted databases. This chapter describes Skipper, a new CSD-targeted query execution framework that employs an out-of-order CSD-driven query execution model based on multi-way joins in combination with efficient cache management and I/O scheduling strategies to hide the non-uniform access latencies of cold storage devices. As a result, Skipper offers performance comparable to that of systems storing data on HDD, at a significantly lower cost.

Chapter 10 offers conclusions, summarizes the thesis contributions and its impact, and presents possible future steps toward building fully adaptive hands-free database systems which will be able to service modern data applications.


Part I
Background


2 A Look Inside the DBMS

A database is defined as a collection of data [203], usually describing the activities of different organizations. Various companies generate data in various formats, all of which could be considered a database in some form. A database management system (DBMS) is software designed to assist in managing these large collections of data. Considering the value of the insights gained from data, DBMS are becoming increasingly popular in businesses. From the user's perspective, the use of a DBMS has several important advantages over writing application-specific code to manage a database. These are primarily:

• Data Independence: The DBMS provides an abstract view of the data (in the form of relations) that hides the details of the actual data format and storage.

• Efficient Data Access: DBMS use sophisticated techniques developed in the past 40 years to efficiently store and access the data.

• Reduced Application Development Time: DBMS support a set of functions common to many applications accessing the data. In addition, they offer a high-level interface to the data (SQL), facilitating fast development.

• Concurrent Access and Data Integrity: A DBMS gives users the impression that they are alone in the system, while it seamlessly schedules concurrent accesses to the data. Moreover, since the data can be accessed only through the DBMS, it is easy to impose different integrity constraints on the data.

Given all the advantages, it is clear why DBMS are an indispensable tool in many business domains.

2.1 The relational model

DBMS owe their success and wide adoption to the declarative nature of expressing requests, which is very much akin to the way humans think of information. A database user typically expresses what information he or she wants, while the DBMS decides how to extract this information. DBMS owe this flexibility to Edgar Codd, who introduced the Relational Model in 1970 [65] and thereby shaped the database arena to this day. The Relational Model, which is based on first-order logic, represents data items using tuples (rows), which, when grouped together, form relations (tables¹). Each tuple contains multiple attributes (columns) and has a unique key that helps distinguish between different tuples. A relation is described by a heading, which is an unordered set of attributes corresponding to the columns of the relation. The number of tuples in a relation defines its cardinality, and the number of attributes defines its degree. For example, a students relation in the database of a university would contain one entry for each student of the university. Each tuple would then contain specific information, expressed through a number of attributes, for a given student, e.g., name, address, telephone number, year of study, etc.

In the Relational Model, the logical organization of the data (in the form of tuples and relations) is decoupled from its physical organization (the way data is laid out on disk or any other storage device). Together with declarative languages such as SEQUEL [48], the Relational Model allowed database users to make declarative requests for accessing data; the system was then responsible for choosing how the data is physically accessed. The flexibility coming from decoupling the physical and logical layers freed applications, for the first time, from caring about physical storage, which opened the door to impressive performance.
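To make the vocabulary concrete, the following small Python sketch models the students relation mentioned above; the attribute names and data are made up for illustration. Each named tuple is a row, the heading is the set of attributes, the cardinality is the number of tuples and the degree the number of attributes, and the final request is declarative in spirit: it states which tuples are wanted, not how to locate them.

```python
# Toy illustration of the relational vocabulary (relation, tuple, attribute,
# cardinality, degree) using the students example; the data is made up.
from collections import namedtuple

Student = namedtuple("Student", ["student_id", "name", "address", "year_of_study"])

students = [                      # the relation: tuples sharing one heading
    Student(1, "Ada",   "Lausanne", 3),
    Student(2, "Edgar", "Geneva",   1),
    Student(3, "Grace", "Zurich",   3),
]

heading     = Student._fields     # the attributes (columns) of the relation
cardinality = len(students)       # number of tuples (rows)
degree      = len(heading)        # number of attributes

# A declarative-style request: "names of third-year students".
# The user states the condition; how the tuples are found is up to the system.
third_year = [s.name for s in students if s.year_of_study == 3]
print(heading, cardinality, degree, third_year)
```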

The first system that adopted Codd's Relational Model was an IBM research prototype known as the Peterlee Relational Test Vehicle (PRTV) [238]. IBM later produced System R [20], the first commercially viable relational database system (RDBMS²), whose design is the basis for many modern DBMS. System R featured concurrency control, supported efficient updates, had a query optimizer (a component in charge of optimizing data access), and was the first system to implement the SQL language [48], a declarative language which has since been adopted by all modern DBMS.

2.2 Components of a DBMS

Although the low-level implementation details of different DBMS may vary, from System R [20] until this day nearly every DBMS has comprised the same high-level components. The standardization of the internal database architecture contributed to the fast and successful development of DBMS, since it allowed researchers to port promising techniques from one system to another over the years and significantly facilitated the transfer of research insights into commercial products.

The main building blocks of a typical DBMS, as shown in Figure 2.1, are the query processing module, storage module, concurrency control module and crash recovery module. Each of the modules comprises several components whose roles we discuss next.

1 In this thesis, as well as in the database literature, the terms relation and table are used interchangeably. 2 In this thesis, whenever we refer to DBMS we mean relational DBMS.


[Figure: the modules of a DBMS — the query processing module (parser, optimizer, executor), the storage module (file and access methods manager, buffer pool manager, disk space manager), the concurrency control module (transaction manager, lock manager), the crash recovery module (log manager), and the database files (heap files and index files).]

Figure 2.1: Components of a DBMS

The query processing module, which is in charge of processing queries issued by the users, consists of a parser, a query optimizer and a query executor3. When a database user issues a query, the parser, as the first component in the query flow, is responsible for syntactically analyzing the query, making sure it is consistent with the database schema4. The parsed query is then sent to the query optimizer, which uses the properties of relational algebra and information about how the data is stored to transform the query into an efficient query execution plan (usually referred to as an optimal plan5) for evaluating the query. A query execution plan is represented as a tree of relational operators. Upon receiving this plan, the query execution engine activates the chosen algorithms (the implementations of relational operators) to execute the query. The relational operators may require data from the storage module6.

The storage module comprises the file and access methods layer, buffer pool manager, and disk space manager. The file and access methods layer works with files, which in a DBMS are collections of pages. Two types of files are supported: heap files, or files of unordered pages, and

3 Many DBMS have an additional component between the query parser and the optimizer, called the query rewriter, which is in charge of simplifying the query (e.g., nested subqueries are unnested, query predicates are simplified, etc.). Nonetheless, the database literature often considers this component as part of the optimizer [142], hence we chose not to show it separately. 4 A database schema is a set of relations with their mutual relationships described through foreign keys. 5 Although called an optimal plan, there are many cases when query optimizers actually do not manage to find the optimal plan. This is because optimizers employ different heuristics to prune the space of possible choices, thereby missing potentially optimal plans. 6 Usually access path operators are the ones that communicate with the storage module.

indexes that assume some form of ordering among pages. The buffer pool manager brings the pages in from disk to main memory as needed in response to page read requests, maintaining the illusion that a particular page resides in memory. The lowest layer of the DBMS software, the disk space manager, deals with management of space on disk, where the data is stored. Higher layers allocate, deallocate, read, and write pages through routines provided by this layer, while this layer works directly with blocks on disk.

The DBMS provides concurrency control mechanisms by carefully scheduling and coordinating user requests. DBMS components associated with concurrency control are the transaction manager and the lock manager. The transaction manager schedules the execution of transactions and ensures that transactions request and release locks according to a suitable locking protocol, while the lock manager keeps track of requests for locks and grants locks on database objects when they become available.

Lastly, the crash recovery module is responsible for maintaining a log of all database changes and restoring the system to a consistent state after a crash.

Since the work presented in this thesis is related mostly to the query processing module, we discuss its components in more detail in the following chapter.

3 Query Processing

Once entered into the query processing module, SQL queries are represented as trees of relational operators. The relational operators of a tree can be transformed in many ways (e.g. by applying commutativity or associativity on the operator inputs), and furthermore each relational operator has multiple algorithms that implement its logic. We discuss some of the operator implementations in Section 3.3. The distinction between the logic of a relational operator (i.e., what it does) and its actual implementation (i.e., how it does it) is usually made through the notions of a logical operator and its corresponding physical operator. With this in mind, a query execution plan1 is defined as a tree of physical operators. The procedure of finding an efficient query execution plan for a given query is called query optimization.

3.1 Metadata information in the system catalog

In order to find the most efficient query execution plan, the query optimizer uses metadata information (i.e., statistics) describing data characteristics to quantify the cost of different operator orderings and their physical implementations. The metadata information is stored in the system catalog of the DBMS, discussed next.

A DBMS stores information about every table, index and view it contains in a special set of tables called catalog tables that together form the system catalog. Although the specific pieces of information kept by different systems may vary, overall there is a minimum set of information kept by every DBMS [203]. For each table, a DBMS stores: a) its table name, the file name in which data is stored and the file structure (e.g., whether it is a heap file or indexed), b) for each attribute of a table, its name and type, c) for each index on the table, its name, d) integrity constraints such as primary and foreign key constraints. For each index, its name, structure (e.g., B+ tree or hash) and the search key attributes are stored. Similarly, for each view a DBMS stores its name and definition. This metadata information is used during query parsing to check whether the query syntax is valid (e.g., whether all tables and attributes of the query exist in the database).

1 In the database literature, terms query execution plan and query evaluation plan are used interchangeably.


Additionally, a DBMS maintains statistics representing data distributions of a table or an index. Statistics are a crucial component heavily exploited by the query optimizer, since the optimizer uses them directly during query compilation2 to quantify alternative choices. The following information is commonly stored: a) table cardinality, defined as the number of tuples in a table, b) table size, defined as the number of pages the table occupies, c) index cardinality, defined as the number of distinct key values of the table, d) index size, defined as the number of pages of an index, e) index height, defined as the number of non-leaf levels3 for each tree index, f) index range, defined by the minimum and maximum key values present.
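To make the shape of this metadata concrete, the following minimal sketch (in Python) models the per-table and per-index statistics listed above; the structure and field names are illustrative assumptions and do not correspond to the catalog layout of any particular system.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TableStats:
        name: str
        cardinality: int                   # number of tuples in the table
        pages: int                         # number of pages the table occupies

    @dataclass
    class IndexStats:
        name: str
        table: str
        n_keys: int                        # number of distinct key values
        pages: int                         # number of pages of the index
        height: int                        # number of non-leaf levels (tree indexes)
        low_key: Optional[float] = None    # minimum key value present
        high_key: Optional[float] = None   # maximum key value present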

3.2 Query optimization

Query optimization is a crucial process in the query workflow, mostly because the performance of a given query greatly depends on the quality of the query optimizer; the difference in cost between the best and worst plan could be several orders of magnitude [142]. Query optimization is a two-step procedure [49] consisting of: a) plan enumeration, in which alternative plans are created through transformations on the query tree, b) plan costing, in which the cost of each alternative is derived4. The final output of query optimization is a plan with the smallest cost. In the following we discuss each phase in more detail.

3.2.1 Plan enumeration

Plan enumeration considers alternative plans that are all equivalent, i.e., they produce the same output for all possible inputs. Equivalence is maintained by respecting the rules of relational algebra. For instance, the join operation is commutative and associative, meaning that a join between tables A and B, defined as AxB, always produces the same output as BxA. Similarly, associativity states that a three-table join (AxB)xC produces the same output as Ax(BxC). Therefore, any two of the plans (AxB)xC, (BxA)xC, (BxC)xA produce the same result, allowing the optimizer to consider all of them (and many more) when performing plan costing. Additionally, for each join operation, a number of physical join implementations are considered. Similarly, all available access paths are considered for each of the input relations. We discuss physical operators in Section 3.3.

The space of possible plans generated for a query is called the search space. To narrow the search space, which is exponential in the number of relations [191], plan enumeration algorithms usually rely on Bellman's Principle of Optimality [29], i.e., they generate an optimal tree for a

2 Query compilation time is the time during which query optimization happens. Similarly, the database literature defines query execution time as the time during which the query is actually executed, i.e., query execution time follows query compilation time. 3 Non-leaf levels contain only search key values, while leaf levels of an index additionally contain pointers to actual tuples stored in the heap file. 4 Some DBMS interleave these two steps, i.e., they explore parts of the search space driven by the costing of the (sub)plans considered so far.

set of relations by considering optimal sub-trees only. Depending on how algorithms explore the search space, they can roughly be divided into three groups5:

• Top-down approaches produce plans by transforming the original tree into a new tree (i.e., plan). Top-down plan enumeration comes in two flavors: a) enumeration-based [77, 88, 89], and b) transformation-based algorithms [98, 101, 104, 195, 218]. In the former case, plan enumeration is carefully driven by the shape of the query tree (e.g. clique, star, chain, cyclic query etc.); (sub)plans are explored in the order decided by their children, i.e., subplans will be considered once their children are explored. In the transformation-based approaches, plan exploration is achieved through a set of transformations on the original plan. New plans come from applying commutativity and associativity over the existing plans. Top-down approaches are attractive since they produce an initial plan fast (at each point in time a valid plan is considered and then refined) and are highly amenable to pruning (e.g., branch-and-bound pruning [77, 90]). Their drawback is a high memory consumption attributed to the memoization procedure that keeps the memo structure containing all explored alternatives (including the sub-optimal ones) throughout the plan exploration lifetime. Furthermore, while easier to implement, transformation-based approaches are susceptible to generating duplicates [195].

• Bottom-up approaches [115, 171, 179, 180, 215, 245], originally proposed in System R [215], employ dynamic programming to build plans incrementally, i.e., they optimize single relations first, then using this information they optimize two-relation joins, then three-relation joins, until they produce an optimal tree containing all relations of the query (a minimal sketch of this procedure is given after this list). Since at each point in time only optimal children of a plan are considered, no duplicates are generated. However, the time to produce a full plan is greater than the time to produce a full (but maybe not optimal) plan with a top-down procedure. Moreover, the pruning amenability of bottom-up approaches is much lower than that of top-down algorithms, since with top-down approaches entire subtrees can be pruned out [77, 90]. Due to algorithm complexity, this approach is viable for joins of up to roughly ten relations. For a greater number of tables, common wisdom is to use heuristics or randomized approaches.

• Randomized algorithms consider plans as points in the search space; the points are connected through edges that are defined by a set of moves [224]. The randomization is introduced in the way the search space is explored. Iterative improvement [144, 233] begins with a random point in the search space, after which it moves through the points reachable by a single move. A plan with a lower cost than the starting point is chosen as a new start. Simulated annealing [145] is a variant of iterative improvement that avoids being trapped in a local minimum by allowing the plan sampling to proceed even if the plan cost is higher than the plan cost of the starting plan. Random sampling

5 The first two groups are deterministic algorithms, while the third one employs randomization.


techniques [92] rely on the assumption that a truly random sample of the search space contains the same fraction of good solutions as the entire space6. Genetic algorithms are designed to simulate the natural evolution process, in which the fittest members of a population propagate features through selection, crossover and mutation to their offspring [96].
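To make the bottom-up strategy above more concrete, the following minimal sketch (in Python) enumerates join orders in the System R style: it optimizes single relations first and then ever larger subsets, keeping only the cheapest plan per subset in line with Bellman's principle. The cost and output-size formulas are placeholders chosen purely for illustration, not the cost model of any actual optimizer.

    from itertools import combinations

    def best_join_order(cardinalities):
        """cardinalities: dict mapping relation name -> estimated tuple count.
        Builds plans bottom-up, keeping only the best plan per subset of relations."""
        rels = sorted(cardinalities)
        # best[subset] = (estimated cost, plan description, estimated output size)
        best = {frozenset([r]): (0.0, r, cardinalities[r]) for r in rels}
        for size in range(2, len(rels) + 1):
            for subset in combinations(rels, size):
                s = frozenset(subset)
                candidates = []
                for k in range(1, size):
                    for left in combinations(subset, k):
                        l, r = frozenset(left), s - frozenset(left)
                        lcost, lplan, lout = best[l]
                        rcost, rplan, rout = best[r]
                        # placeholder cost: children costs plus the work of combining them
                        cost = lcost + rcost + lout + rout
                        # placeholder output estimate (a real optimizer would apply join selectivity)
                        out = max(lout, rout)
                        candidates.append((cost, "(" + lplan + " JOIN " + rplan + ")", out))
                best[s] = min(candidates)
        cost, plan, _ = best[frozenset(rels)]
        return cost, plan

    # Example: best_join_order({"A": 1000, "B": 10, "C": 500})

The number of subsets grows exponentially with the number of relations, which is why this exact procedure is viable only up to roughly ten relations, as noted above.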

3.2.2 Plan costing

Once plans for a given query are enumerated, the query optimizer costs them and chooses the cheapest one; this is the plan that is going to be executed. The cost of a plan is the sum of the costs of all operators belonging to the plan. To calculate the cost of an individual relational operator, the optimizer uses statistics describing data distributions stored in the system catalog described in Section 3.1. In particular, there are two aspects to calculating the cost of each operator: a) estimating the amount of work involved per input item, and b) estimating the result size of each operator, which translates to the number of items per input7. To estimate the amount of work per item, the query optimizer employs analytical cost formulas that describe the behavior of each operator. This aspect of plan costing is usually performed well; it has been shown that with proper calibration, a cost model deviates from the actual cost by less than 40% [257]. In contrast, estimating the result size appears to be inherently hard, with deviations from reality frequently reaching several orders of magnitude [143, 170, 175, 225].

Although the cost formulas usually encompass both CPU and I/O cost, the operator cost is dominated by its I/O cost [99]8. When considering the I/O cost of a plan, the total cost can be broken down into: a) the cost of reading the input tables, b) the cost of writing intermediate results, c) the cost of outputting the final result. The last component is constant for all the plans of a given query, hence it is usually ignored during plan costing. DBMS generally try to minimize materialization of intermediate results by favoring a pipelined execution model (where tuples flow between operators without being saved in between). Therefore, the cost of pipelined plans is usually dominated by the cost of reading the input tables. We explain the cost formulas for reading the inputs in more detail in Section 3.3, while this section delves into the details of result size estimation.
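As a small illustration of how these two estimates combine, the sketch below (in Python, with a made-up plan representation) computes the cost of a plan as the sum of per-operator costs, each obtained as an estimated input size multiplied by a per-item work factor; any error in the size estimates therefore translates directly into the plan cost.

    def plan_cost(op):
        """op: dict with keys 'per_item_cost', 'est_input_size' and optional 'children'.
        The plan cost is the sum of the costs of all operators in the tree."""
        own = op["per_item_cost"] * op["est_input_size"]
        return own + sum(plan_cost(child) for child in op.get("children", []))

    # Hypothetical two-operator pipelined plan: a scan feeding a filter.
    plan = {"per_item_cost": 0.1, "est_input_size": 10_000,
            "children": [{"per_item_cost": 1.0, "est_input_size": 10_000}]}
    # plan_cost(plan) == 11000.0 under these made-up numbers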

3.2.3 Result size estimation

Given a query with N relations in the FROM clause, the maximum result size is equal to the product of cardinalities of these N relations. The product of all cardinalities is a firm upper bound on the result size; however, using this number is not very practical since in reality the actual result size is much lower due to the filtering predicates given in the WHERE clause. To

6 Choosing a truly random sample still remains an open research problem. 7 The result size of a child operator is an input to the parent operator. 8 A single I/O operation corresponds to a million CPU cycles.

quantify the degree of elimination due to the WHERE clause, DBMS use predicate selectivity9. Selectivity measures the ratio of the expected result size to the input size, considering only the tuples that pass the selection represented by the given predicate. Put as a formula, the selectivity of each predicate is represented as:

selectivity = num_qualifying_tuples / num_total_tuples

where num_qualifying_tuples represents the expected number of tuples that pass the predicate selection10, while num_total_tuples represents the table cardinality.

To compute the selectivity of each predicate, DBMS use statistics from the system catalog. Depending on the type of the predicate, its selectivity is calculated as follows11:

• for attribute = value:

selectivity = 1 / n_keys

where n_keys is the number of distinct key values. This formula relies on a uniformity assumption employed by DBMS, which assumes that tuples are uniformly distributed across the value range. If n_keys is not known (e.g. if there is no index on this attribute), the default value 1/10 is used.

• for attribute > value:

selectivity = (high_key_value − value) / (high_key_value − low_key_value)

If the value domain (i.e., maximum and minimum key values) is not known, or if the attribute is not of arithmetic type, the default value 1/3 is used.

• for attribute BETWEEN value_1 AND value_2:

selectivity = (value_2 − value_1) / (high_key_value − low_key_value)

If the value domain is not known, or if the attribute is not of arithmetic type, the default value 1/4 is used.

• for attribute_1 = attribute_2:

selectivity = 1 / MAX(n_keys_att1, n_keys_att2)

9 Selectivity is also known by the terms selectivity factor and reduction factor. 10 Such tuples are known by the term qualifying tuples. 11 The following formulas were first proposed in System R [215], and have not changed much to this day.


This formula assumes that each key value of the smaller index has a matching value in the other index. In case the number of distinct keys is known only for one attribute, the value 1/n_keys_att_i is used (again assuming uniformity). If neither n_keys_att1 nor n_keys_att2 is known, the default value 1/10 is used.

• for attribute IN (value_1, value_2, .. value_n):

selectivity = num_items_in_list × selectivity_for_each(attribute = value_i)

The selectivity of the IN clause is, however, not allowed to be more than 1/2.

The expected result size of the entire query is then calculated as the maximum result size times the product of the selectivities of all the predicates in the WHERE clause12. The product of selectivities reflects an (unrealistic) assumption employed by the majority of DBMS, called attribute value independence (AVI), according to which all the predicates are statistically independent. As we will see in Section 3.4, this assumption is in reality frequently violated, resulting in severe result size misestimates.
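The rules above can be condensed into a short sketch (in Python); the default constants mirror the values listed above, while the function names and the handling of unknown statistics are illustrative assumptions rather than the code of any specific DBMS. The last function multiplies the individual selectivities, which is precisely the attribute value independence assumption.

    def eq_selectivity(n_keys=None):
        """attribute = value: 1/n_keys, or the default 1/10 when n_keys is unknown."""
        return 1.0 / n_keys if n_keys else 1.0 / 10

    def gt_selectivity(value, low_key=None, high_key=None):
        """attribute > value: interpolation over the key range, default 1/3."""
        if low_key is None or high_key is None or high_key <= low_key:
            return 1.0 / 3
        return (high_key - value) / (high_key - low_key)

    def between_selectivity(lo, hi, low_key=None, high_key=None):
        """attribute BETWEEN lo AND hi: range width over the key range, default 1/4."""
        if low_key is None or high_key is None or high_key <= low_key:
            return 1.0 / 4
        return (hi - lo) / (high_key - low_key)

    def expected_result_size(cardinalities, selectivities):
        """Upper bound (product of input cardinalities) scaled down by the product
        of predicate selectivities -- the attribute value independence assumption."""
        size = 1.0
        for c in cardinalities:
            size *= c
        for s in selectivities:
            size *= s
        return size

    # Example: one table of 1M tuples, an equality predicate over 100 distinct keys,
    # and a range predicate over the known domain [0, 1000]:
    # expected_result_size([1_000_000], [eq_selectivity(100), gt_selectivity(250, 0, 1000)])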

3.2.4 Improving statistics with histograms

Selectivity estimates are approximations, mostly because DBMS employ simplifying assumptions such as attribute value independence or uniform distributions that rarely hold in practice. To improve the estimates, commercial DBMS have adopted more sophisticated techniques such as histograms that contain statistics more detailed than high_key_value and low_key_value.

A histogram is a data structure that approximates the distribution of values by dividing the entire value range into subranges called buckets; for each bucket the number of tuples is counted and stored. The smaller the bucket, the more precise the histogram is13. Histograms generally come in two flavors: equiwidth and equidepth. In the case of equiwidth histograms, the entire value range is divided into subranges of equal size (i.e., all buckets cover value ranges of equal width). In contrast, equidepth histograms choose the value ranges such that the total number of tuples in each range is approximately equal. While the former are easier to build, the latter are more precise, because buckets with more frequently occurring values will have smaller ranges.
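As an illustration, the sketch below (in Python) builds an equidepth histogram and uses it to estimate the selectivity of a range predicate; the choice of ten buckets and the linear interpolation within a bucket are assumptions made for the example, not the exact method of any particular system.

    def build_equidepth_histogram(values, num_buckets=10):
        """Each bucket covers roughly the same number of tuples.
        Returns a list of (low, high, count) triples."""
        ordered = sorted(values)
        per_bucket = max(1, len(ordered) // num_buckets)
        buckets = []
        for start in range(0, len(ordered), per_bucket):
            chunk = ordered[start:start + per_bucket]
            buckets.append((chunk[0], chunk[-1], len(chunk)))
        return buckets

    def estimate_le_selectivity(buckets, value):
        """Estimate the selectivity of 'attribute <= value', interpolating
        linearly within the bucket that contains the value."""
        total = sum(count for _, _, count in buckets)
        qualifying = 0.0
        for low, high, count in buckets:
            if value >= high:
                qualifying += count
            elif value >= low:
                fraction = (value - low) / (high - low) if high > low else 1.0
                qualifying += count * fraction
        return qualifying / total

    # Example: estimate_le_selectivity(build_equidepth_histogram(range(1000)), 250) is roughly 0.25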

Despite being more accurate than simple statistics, histograms, like any other form of statistics, come with the cost of storing and, especially, maintaining them. To decrease this cost,

12 The product of selectivities is employed under the assumption that the query is transformed into conjunctive normal form (CNF), where predicates are connected with an AND clause. This usually holds, since before reaching the query optimizer, the query rewriter transforms a query into CNF. 13 There is of course a trade-off in keeping such precise histograms in memory, and in collecting this information.

rather than being updated every time the data changes, histograms are updated periodically14. Therefore, there is a risk that at times histograms will contain stale information.

3.3 Query execution

Once all plans have been quantified and the best one has been chosen, the plan is presented to the query execution module that, by following the plan as a blueprint, accesses and processes data. At the lowest level of the tree-shaped plan are access path operators (i.e., scan operators) that instruct how to access data which is placed on external storage. On top of access path operators are usually joins and aggregations - the operators that instruct how data from different inputs can be combined. In the following we discuss access path operators and join implementation variations used in this thesis.

3.3.1 Data organization on external storage

Access path operator alternatives directly depend on the way data is organized on external storage. In general, each relation is stored in a separate file called a heap file; each file can store records either in an arbitrary way, or with respect to a particular order (we say such records are sorted).

A heap file typically stores tuples in the order of their arrival, i.e., looking at tuples with respect to their attribute values, they have no particular order. Tuples are organized in pages (blocks on disk) that are allocated one by one, usually sequentially on the storage medium15. As there is no order among tuples, a search for a particular tuple involves going through all the blocks of the relation.

Since searching for a particular tuple is very inefficient in a heap structure, DBMS build indexes on top of data pages to retrieve tuples in much shorter time. An index is a data structure used to efficiently locate tuples satisfying search conditions on the search keys of the index16. Internally, entries of an index can be hashed on the search key(s) (in which case we refer to hash indexes17), or organized in a tree-like structure used to direct the search for particular keys (in which case we refer to tree-based indexes, in particular B+ trees18). Hash indexing is only amenable to predicates of the type attribute = value, while tree indexing can be efficiently used for both equality and range expressions on search keys. Since the usability of B+ trees is higher than that of hash-based indexes, we focus on them in this thesis.

14 Some DBMS feature special commands for explicit statistics collection during which histograms will be updated. 15 Due to fragmentation of the storage medium, data blocks, however, can be arbitrarily allocated. 16 Search keys of the index are attributes based on which the entries are organized. 17 A hash index uses a hash function to group tuples in buckets. 18 A B+ tree is a data structure in which all paths from the root (the top-most layer) to the leaves (the lowest layer) have the same length, i.e., the tree is balanced in height. When implementing tree-based indexing, DBMS mostly implement B+ trees.


[Figure: a tree index — root, internal nodes, and leaf nodes in the index file — whose leaf entries point to tuples stored in the pages of the heap file.]

Figure 3.1: Clustered Index

At the top of the tree-based index there is a root node; the lower levels, called internal nodes, contain index entries used to navigate the search. At the lowest level of the index are leaf nodes that contain data entries, which in addition to search keys have pointers to actual tuples stored in data pages in the heap file.

Indexes can be clustered (primary) as depicted in Figure 3.1 or unclustered (secondary) as depicted in Figure 3.2. In the case of clustered indexes, the tuples in a data file are organized (sorted) based on the search keys of the index, i.e., the ordering of the tuples in the heap file is the same as the ordering of data entries in the index file. For unclustered indexes, such a correlation does not exist, i.e., two adjacent data entries from the index file can have pointers to two distant locations in the heap file. From an efficiency point of view, accessing tuples through a clustered index is cheaper than through an unclustered index, since the former invokes sequential reads of pages stored in the heap, while the latter involves random page reads19. Nonetheless, since every file can be organized in only a single way, for each relation only one clustered index can be created; the remaining indexes (and data analytics applications have a multitude of them) have to be unclustered.

3.3.2 Execution operators

Access path operators. Depending on the storage organization of relations, access path operators can access data in several ways: a) by using index only access, when all attributes of interest of a query (payload) are covered with the search keys of the index, b) by using single index access, if only one index is used to locate tuples of interest that are then fetched from the heap file, c) by using multi-index access, when multiple indexes, used to match different predicates, are intersected to get a final set of tuples, or d) by using full sequential access (full table scan), when no indexes are used, and the entire relation is read sequentially from the heap.

19 Random I/O operations are more expensive than sequential ones.


[Figure: a tree index — root, internal nodes, and leaf nodes in the index file — whose leaf entries point to tuples in the heap file; here the heap file ordering does not follow the index.]

Figure 3.2: Unclustered Index

Table 3.1: Cost model parameters

Parameter    Description
|R|          Number of pages the relation occupies.
||R||        Number of tuples in the relation.
TP           Number of tuples per page.
sel          Selectivity of the query predicate(s) (%).
randio       Cost of a random I/O access (per page).
seqio        Cost of a sequential I/O access (per page).
indexcost    Number of I/Os needed to find the first qualifying tuple in the index; typically 2-4 in the case of a B+ tree, 1.2 in the case of a hash index.

The task of the query optimizer is to choose the most selective path, i.e., the path that will be the cheapest in terms of I/O operations. Table 3.2 summarizes the analytical formulas used to compute the cost of different access paths. The meaning of the parameters of the cost model is shown in Table 3.1. As can be seen from Table 3.2, the cost of a full table scan is constant regardless of the result size and is equal to the cost of reading all pages of the relation with sequential access. The cost of unclustered index access is equal to the cost of traversing the tree or hash to find the first tuple, plus the cost of reading the remaining qualifying tuples, where each tuple could potentially incur one random I/O operation. The cost of clustered index access is equal to the cost of traversing the tree or hash to find the first tuple, plus the cost of sequentially reading the remaining data pages containing matches from the heap.
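A minimal sketch of this access path choice (in Python) follows; it encodes the formulas of Table 3.2 and assumes, purely for illustration, that all three alternatives are available and that a random I/O is ten times more expensive than a sequential one.

    def full_table_scan_cost(pages, seq_io):
        return pages * seq_io

    def unclustered_index_cost(index_cost, tuples, sel, rand_io):
        # potentially one random I/O per qualifying tuple
        return index_cost + tuples * sel * rand_io

    def clustered_index_cost(index_cost, tuples, sel, tuples_per_page, seq_io):
        # qualifying tuples are read sequentially from contiguous heap pages
        return index_cost + (tuples * sel / tuples_per_page) * seq_io

    def cheapest_access_path(pages, tuples, sel, tuples_per_page,
                             seq_io=1.0, rand_io=10.0, index_cost=3.0):
        costs = {
            "full table scan": full_table_scan_cost(pages, seq_io),
            "unclustered index": unclustered_index_cost(index_cost, tuples, sel, rand_io),
            "clustered index": clustered_index_cost(index_cost, tuples, sel, tuples_per_page, seq_io),
        }
        return min(costs.items(), key=lambda kv: kv[1])

    # Example: 10,000 pages, 1M tuples, 1% selectivity, 100 tuples per page
    # cheapest_access_path(10_000, 1_000_000, 0.01, 100)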

Joins. Join operators implement ways of combining two (or more) relations. Many DBMS implement three types of join operations: nested loops join, sort-merge join and hash join20.

20 These types have several subtleties, e.g. nested loops could be blocked nested loops or index nested loops; hash join could be implemented as grace hash join (partitioned hash join) or hybrid hash join; sort-merge could be in-memory or external sort-merge.


Table 3.2: Access path selection formulas

Access path alternative      Cost formula
Full Table Scan              |R| × seqio
Unclustered Index Access     indexcost + ||R|| × sel × randio
Clustered Index Access       indexcost + (||R|| × sel / TP) × seqio

Table 3.3: Join implementations formulas

Join implementation    Cost formula
Nested Loops Join      |R| + |R| × |S|
Sort-Merge Join        |R| × log(R) + |S| × log(S) + |R| + |S|
Hash Join              |R| + |S|

For a single tuple of the left input (usually referred to as the outer input), nested loops accesses all tuples of the right input (the inner input). This access is obviously inefficient. Nonetheless, if there is an index on the inner input, for each tuple of the outer input, matching tuples from the inner input are found with the help of the index, thereby reducing the unnecessary accesses of the inner input. Sort-merge join first sorts both inputs on the join key, after which it performs (coordinated) sequential access over both inputs, producing matches directly. Hash join builds a hash table on the smaller input, and then it (reads sequentially and) probes all tuples from the bigger input to find the matches. The analytical formulas of the join operator implementations are summarized in Table 3.3. The meaning of the cost model parameters is given in Table 3.1; relation R is considered to be the outer input, and relation S the inner input. These are basic cost formulas, assuming that both inputs fit in memory. Once the input size is bigger than the memory size, the formulas change depending on the join implementation. For instance, hash join introduces an additional partitioning step, in which case both inputs are partitioned prior to joining; sort-merge performs external sorting that usually involves two passes over the data.

When estimating the execution performance of join algorithms, various factors need to be considered. For instance, if an index exists and the query optimizer estimated that the join will not have too many matches, index nested loops is a clear choice, since the inner input does not have to be fully traversed. If there are no indexes, and one table is much smaller than the other (meaning that it fits in the DBMS cache), hash join will outperform other solutions, because its cost is equal to the cost of sequentially reading both inputs. Nonetheless, if the hash table cannot fit in memory or the data distribution of the join attribute is highly skewed (i.e., some keys appear frequently), hash join might not be the optimal solution (due to the hash table and buckets overflowing)21. Sort-merge is usually a winner when tables are similar in size, or if the query requests tuples to be served in a particular order (i.e., the order of the join predicate). Nonetheless, if both inputs have multiple duplicates, sort-merge join might become inefficient. Therefore, when deciding on the optimal join algorithm, the query

21 In case the hash table does not fit in memory, the hash join formula degrades into 3 × |R| + 3 × |S|, where each table is partitioned (i.e., read and written) prior to joining.

optimizer uses statistics estimates about data distributions coupled with runtime parameters, such as the amount of memory allotted for the join.
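Analogously to the access path sketch above, the in-memory formulas of Table 3.3 translate into the following sketch (in Python); the base-2 logarithm and the example page counts are assumptions, and the spill-to-disk variants and runtime factors just discussed are deliberately ignored.

    import math

    def nested_loops_cost(r_pages, s_pages):
        return r_pages + r_pages * s_pages

    def sort_merge_cost(r_pages, s_pages):
        return (r_pages * math.log2(r_pages) + s_pages * math.log2(s_pages)
                + r_pages + s_pages)

    def hash_join_cost(r_pages, s_pages):
        # assumes the hash table on the smaller input fits in memory
        return r_pages + s_pages

    def cheapest_join(r_pages, s_pages):
        costs = {
            "nested loops join": nested_loops_cost(r_pages, s_pages),
            "sort-merge join": sort_merge_cost(r_pages, s_pages),
            "hash join": hash_join_cost(r_pages, s_pages),
        }
        return min(costs.items(), key=lambda kv: kv[1])

    # Example: cheapest_join(1_000, 200) picks the hash join under these assumptions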

3.4 Perils of query processing

Query optimization is a complex procedure with a high impact on query execution performance. Unfortunately, as stated in [170], result size estimation represents the Achilles heel of query optimization, since the entire plan costing is predicated upon the result size (i.e., cardinality) estimation. While the cost model22 estimates may introduce errors of up to 40% [166, 257], the result size misestimates can easily reach 7 orders of magnitude [166, 170]. The detrimental impact of result size misestimates is illustrated in [83, 148, 149, 150, 155, 168, 175]. There are several issues contributing to the severity of errors in result size estimation:

• Statistics are oftentimes missing or stale. This especially holds for federated systems, where it might be impossible to obtain statistics representing data distributions.

• Simplifying assumptions employed by the optimizer, such as attribute value independence and uniformity, do not correspond to reality. For instance, correlation among attributes frequently appears in real-world workloads.

• Queries with parameter markers of the type "value BETWEEN :v1 and :v2" invoked by applications running on top of DBMS are especially hard to estimate, since the actual values of the parameters are not known at compilation time, forcing the optimizer to guess the values.

• Data, runtime and workload characteristics can change between compile time and execution time, making the currently executed plan sub-optimal.

• Estimation errors propagate from the lower layers (access paths) through the plan, growing exponentially at higher layers [143].

• Different subsets of data can have very different statistical properties, thus one execution strategy may not be optimal throughout the lifetime of the query [31, 186].

Since the quality of plans depends on the accuracy of data statistics, a plethora of work has studied techniques to improve the statistics accuracy in DBMS. Approaches in [196, 199] use histograms to improve estimates of the size of single-column predicates. [117, 143, 234] focus on estimating join sizes. To fight column correlations, more recent advancements include multidimensional histograms [176, 198], and statistics on views [39, 93]. Modern approaches analyze the workload to choose a set of statistics to maintain [40, 54] and monitor execution to exploit this information in future query compilations [2, 44, 56, 225]. Orthogonal techniques focus on modeling the uncertainty about the estimates during query optimization [22, 24] or

22 Cost formulas represent the cost model of the DBMS.

even executing subplans to remove uncertainty [140, 188]. To address the issue of parameter markers unknown at compile time, [68, 105] propose keeping different plans for different ranges of parameter values; once the value of the parameter is known at runtime, the proper plan is executed. However, with multiple parameter markers and a large number of possible values, this scheme becomes impractical to store and choose from, especially since it has been shown that a region in which a single plan is optimal is not convex and moreover not even connected, making plan grouping hard if not impossible to achieve [205].

Although a large body of work has focused on different aspects of collecting and maintaining statistics to improve result size estimates, there are settings in which these techniques run into limitations. For instance, if the query workload is highly diverse and unpredictable, as might be the case with data exploration applications, then subsequent queries may have little overlap; in such cases maintaining histograms for all possible data distributions is not scalable, and may not even be beneficial. In another dimension, the data itself can change quite frequently; for instance, considering data produced by different devices (e.g., smart meters [127], data from Facebook, streaming applications, etc.), the risk of having incomplete statistics still remains high. In these cases, we would like to immediately react to observed changes and adapt the current query plan. In Chapter 8, we focus on such an intra-query adaptive query processing technique that adapts the execution of the currently running query in order to improve its response time.

4 Physical Database Design

By decoupling the physical organization of the data from its logical organization, Codd's relational model not only increased the usability of DBMS by enabling users to focus on what information they are interested in as opposed to how to obtain this piece of information, but also opened the door to impressive performance gains. Physical data independence allowed for the creation of physical design structures such as indexes and materialized views that can boost the performance of analytical queries. The choice of indexes or materialized views and their change over the database lifetime does not affect the query results; it only affects the efficiency of the executed query. Therefore, together with the query optimizer (and the query executor), physical design structures determine the execution performance of a query. Figure 4.1 illustrates the importance of physical design structures when running the TPC-H benchmark [240] scale factor (SF) 101 on a commercial system referred to as DBMS X2. The figure shows the execution time after tuning3 (i.e., after creating a set of indexes estimated to boost the workload execution performance) normalized by the original execution time (i.e., where only indexes on primary keys exist). Values below one denote performance improvement. Overall, there are 14 indexes created that together occupy 23GB of disk space. These indexes significantly reduce the execution times of all queries, bringing a total workload improvement of a factor of 4.

The choice of physical design structures such as indexes [53, 73, 244], materialized views [5], partitioning strategies or layouts [4, 7, 193] for a given set of queries is the goal of physical database design. In the past, this task was traditionally performed by a database administrator (DBA). Nonetheless, since the design structures are not independent (e.g. the benefit of one index could diminish if there is another index covering similar columns), and moreover, since decision support analytical queries are becoming rather complex (e.g. involving tens or hundreds of tables), choosing a good set of design structures for a given workload has become a tall order for a DBA. The choice of proper design structures is, however, crucial for efficient query execution. Therefore, the database research community has put significant effort into automating this process.

1 Scale factor 10 corresponds to the database size of 10GB. 2 This chapter uses material from [33]. 3 Tuning is a process of proposing and creating physical design structures for a target workload.



Figure 4.1: Impact of physical design: Performance improvement after creating a set of indexes, normalized by the original execution time

4.1 Automated physical designers

In recent years, the field of physical database design has become extremely popular and most commercial DBMS vendors nowadays offer physical designers in their products (e.g., Microsoft Database Engine Tuning Advisor (DTA) [6], DB2 Design Advisor [263], Oracle SQL Access Advisor [72]). Physical designers are tools used to significantly facilitate and automate the tuning process and are an integral part of a broader effort toward fully automated database management systems, which aims to: a) decrease the database administration cost and thus the total cost of database ownership [262], b) help non-experts use database systems, and c) enable databases to move to different environments, such as the cloud, where database instances are offered as a service.

A typical physical designer tries to solve the following problem:

Given a workload W and a set of constraints C (e.g. a storage budget, a time budget), find a set of physical structures or a configuration P that minimizes the execution cost for W and satisfies C.

The output of a physical designer is a recommendation of design structures (e.g. indexes) selected to boost performance of the given workload, and the estimation of the expected performance improvement. The DBA typically examines the output of the physical design process, verifies the usefulness of the proposed configuration and decides what structures to create inside the database.

During the tuning process, physical designers rely heavily on the query optimizer and its cost model. A typical interaction is presented in Figure 4.2. Physical designers invoke the query optimizer using what-if interfaces [51, 91] to simulate the presence of different design structures without materializing them. Different configurations are then costed, and the one with the lowest cost (satisfying the given constraints) is chosen as the output of the tuning procedure.


Figure 4.2: The interaction between the physical designer and optimizer

Such an approach has several advantages: a) low overhead, since the examined physical structures are not actually created; b) recommended structures, if implemented, are guaranteed to be used by the optimizer; c) enhancing the optimizer's cost model improves query optimization, from which physical designers further benefit. A recent approach focuses on creating an even tighter connection between the physical designer and the optimizer, where candidate design structures (indexes in this case) are proposed by intercepting the optimization procedure and identifying indexes that would result in optimal (sub)plans [41].
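The loop below (in Python) is a rough sketch of this interaction: what_if_cost stands in for the optimizer's what-if costing of a query under a hypothetical configuration, index_size for the estimated size of a candidate structure, and the greedy selection under a storage budget is only one plausible search strategy, not the algorithm of any specific commercial tool.

    def tune(workload, candidate_indexes, storage_budget, what_if_cost, index_size):
        """Greedily pick the indexes that most reduce the optimizer-estimated workload
        cost, without ever materializing them, until the storage budget is spent."""
        chosen = []
        current_cost = sum(what_if_cost(q, chosen) for q in workload)
        remaining_budget = storage_budget
        improved = True
        while improved:
            improved = False
            best_candidate, best_cost = None, current_cost
            for idx in candidate_indexes:
                if idx in chosen or index_size(idx) > remaining_budget:
                    continue
                # ask the optimizer to cost the workload as if 'idx' existed
                cost = sum(what_if_cost(q, chosen + [idx]) for q in workload)
                if cost < best_cost:
                    best_candidate, best_cost = idx, cost
            if best_candidate is not None:
                chosen.append(best_candidate)
                remaining_budget -= index_size(best_candidate)
                current_cost = best_cost
                improved = True
        return chosen, current_cost

Because every candidate is evaluated through what-if calls to the optimizer, the chosen configuration is, by construction, one the optimizer would actually use.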

4.2 Perils of physical design

Physical database design greatly improves the performance of long-running analytical queries. Nonetheless, there are issues pertaining to the tuning process that impact the quality of proposed physical designs.

4.2.1 Query optimizer as a single source of truth

The drawback of using the query optimizer is the reliance on a single source of truth, the predictions of the optimizer. Optimizers are known to be error-prone due to a multitude of factors [63]. They rely on data statistics to estimate the number of output tuples of every operator in a query plan. Such statistics might frequently be unavailable or inaccurate. For instance, the result size of a query that involves predicates on multiple attributes depends on the joint data distribution of the attributes, i.e., the frequencies of all combinations of attribute values. Due to the large multidimensional nature of joint distributions and the high number of possible combinations physical designers consider during each run, statistics on joint data distributions are oftentimes missing. When multidimensional statistics are missing, commercial systems assume attribute independence [22]. In practice, this assumption is often violated, which causes significant result size underestimates [83, 148, 149, 150, 155, 168, 175]. Consequently, the optimizer's misestimates generate gross errors in query runtime predictions, which leads to the choice of suboptimal plans that favor the usage of indexes that are not beneficial. As a result, the design proposals culminate in hurting performance instead of improving it [95].

Figure 4.3 illustrates the impact of the optimizer's misestimates on the quality of the proposed physical design when running the TPC-H benchmark SF10 on a commercial system referred to as DBMS Y. For the experiment, the physical designer of DBMS Y was given unbounded time and a disk space budget of 15GB to propose a set of indexes for the TPC-H workload consisting of 18 queries4.



Figure 4.3: Negative impact of physical design: Performance degradation after creating a set of indexes, normalized by the original execution time

[Figure: the Q9 plan fragment — an index nested loops join between PART (accessed with a full table scan) and LINEITEM (accessed with an index scan followed by a table lookup), annotated with the optimizer's estimated vs. actual cardinalities.]

Figure 4.4: The cause of performance degradation of Q9

The designer proposed 33 indexes that together occupied 13GB of disk space. Figure 4.3 shows the query execution times after tuning, normalized by the original query execution times. Although the execution times of the majority of the queries were reduced, there were cases when the proposed indexes led to a longer execution time. In particular, Q9 ran 75 times longer, and Q16 5 times longer. The performance degradation of these two queries resulted in an overall workload degradation of a factor of 8 (i.e., the workload now ran in 4 hours as opposed to less than 30 minutes).

The degradation of Q9 was attributed to a result size underestimate that favored, as the first join in the pipeline, an index nested loops join between tables PART and LINEITEM, as illustrated in Figure 4.4. The query optimizer estimated that 8K tuples would pass a filtering predicate on the PART table, and hence decided to use an index on the join column (L_PARTKEY) to fetch tuples from LINEITEM. However, in reality, instead of 8K tuples, 108K tuples passed the filtering

4 Some queries were not tested due to their original long running times.

predicate, which resulted in roughly a factor of 17 increase in the number of random I/O operations to access LINEITEM tuples5. Moreover, this error propagated throughout the rest of the query plan, underestimating the result sizes of all subsequent joins, which consequently led to a query performance degradation of a factor of 75.

The filtering predicate on the PART table was in this case a LIKE clause, whose selectivity is known to be hard to predict. Nonetheless, this type of predicate is not the only one that causes result size misestimates. In a thorough study we performed [33], we presented several examples where the optimizer's misestimates led to workload performance degradation after invoking the tuning procedure, showcasing the sensitivity of physical designers to the quality of the optimizer's estimates.

4.2.2 Testing physical database designers’ predictability

Changing the physical design is a heavyweight operation (e.g., the creation of 33 indexes proposed by the physical designer in the case described in Section 4.2.1 took 3 hours), thus an inaccurate estimation regarding the configuration usefulness may lead to ineffective resource allocation; for instance, building an index that occupies large storage space but provides marginal performance improvement. From the DBA’s perspective, not getting the anticipated performance improvement has a negative impact on the user experience and may cause loss of trust toward the effectiveness of the tool.

Since the only insight regarding the usefulness of the proposed design configuration is the expected improvement presented by the tool, we examine the output of physical designers, i.e., whether what we see as a result of the tuning (i.e., the estimation of the improvement) corresponds to the improvement we gain after applying the proposed configuration (i.e., the actual improvement). An accurate estimation implies that the recommendation can be adopted with a high degree of confidence, while an inaccurate estimation raises questions about the trustworthiness of physical designers. Therefore, we study the predictability of physical designers in terms of how accurately they estimate the effectiveness of their proposed configurations. We evaluate three commercial physical designers by varying their input parameters on real and synthetic data sets. Due to legal restrictions the names of the used database systems are not disclosed and they will be referred to as System-A, System-B, and System-C.

Experimental methodology

The predictability of physical designers is calculated as the difference between the expected improvement they deliver and the actual improvement databases achieve after applying the proposed physical designs. We present the difference as a percentage over the whole workload.

Table 4.1 summarizes the metrics used. T_O denotes the workload execution time before the tuning phase, while T_AT denotes the workload execution time after the proposed design has been adopted.

5 Secondary indexes invoke random I/O operations when accessing data pages stored in the heap.


Table 4.1: Metrics descriptions

Metric name                          Description
Original time (T_O)                  Workload execution time before the tuning phase.
Estimated tuned time (T_ET)          Physical designer's estimated execution time (with the applied design).
Actual tuned time (T_AT)             Actual workload execution time (with the applied design).
Estimated improvement (I_E)          Physical designer's estimated improvement for the proposed design (%).
Actual improvement (I_A)             The actual improvement with the applied design (%).
Relative estimation error (REE)      The relative error between T_ET and T_AT (%).
Absolute estimation error (AEE)      The difference between T_ET and T_AT.

We use I_E to express the improvement estimated by the physical designer and I_A to show the actually achieved improvement. We use the metrics to calculate the following formulas:

I_A = 100 − (T_AT / T_O) × 100        T_ET = T_O − (I_E × T_O) / 100

REE = (|T_ET − T_AT| / T_AT) × 100        AEE = |T_ET − T_AT|

The relative estimation error (REE) demonstrates the predictability of a system, meaning the most accurate system is the one having the lowest estimation error.
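As a small worked example of these formulas (in Python, with made-up numbers rather than measured ones): if a workload originally runs in T_O = 100 minutes, the designer estimates an improvement of I_E = 40%, and the tuned workload actually runs in T_AT = 75 minutes, then:

    T_O, I_E, T_AT = 100.0, 40.0, 75.0     # made-up numbers for illustration
    I_A = 100 - (T_AT / T_O) * 100         # actual improvement: 25%
    T_ET = T_O - (I_E * T_O) / 100         # estimated tuned time: 60 minutes
    REE = abs(T_ET - T_AT) / T_AT * 100    # relative estimation error: 20%
    AEE = abs(T_ET - T_AT)                 # absolute estimation error: 15 minutes

Here the designer predicted a tuned time 15 minutes shorter than the one actually measured, which the REE expresses as a 20% relative error.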

In the experiments we study two workloads with different characteristics. The first one contains 18 queries from the TPC-H decision support benchmark [240] and we report results for scale factors (SF) 10 and 100 (data set sizes of 10GB and 100GB). Originally, the TPC-H benchmark consists of 22 queries; we exclude Q17 and Q20 to Q22 due to their long execution times in some of the DBMS. Additionally, to narrow the vast search space in the experiments performed on this benchmark, we restrict the designers to proposing only indexes.

The second workload contains exploratory queries on the NREF database [255]. The NREF database provides a collection of protein sequence data from several genome sequencing projects. It contains 6 tables that together occupy 13GB. The query set consists of 400 combined SELECT and UPDATE statements, of which the select-only workload contains 200 statements. We choose this benchmark to evaluate physical designers' behavior when we submit a larger and more complicated training workload.

All experiments are conducted following the same algorithm. First, we execute the queries in the original database (before any tuning) and measure the workload execution time, which we use as the baseline of our evaluation. Then, we call the physical designers to suggest a new physical design. We call all physical designers with the same input for every series of our experiments. In order to obtain more accurate results from the query optimizer, we manually update statistics after loading the data and applying referential integrity constraints, as well as

after applying the designer's recommendations. Once the proposed designs are built in the DBMS, all the queries are re-run and the improvement and the predictability are calculated.

Experimental results

We conduct several experiments to evaluate the predictability of designers and identify to what extent different parameters affect the estimates. In the experiments, we vary the space budget for recommendations, the size of the input database, and the number of queries in the workload (considering the effect of updates as well).

The designers' running time. In the set of experiments performed using the TPC-H benchmark, the tuning time is not limited since our goal is to achieve the best possible results. System-A and System-B run for less than 3 min in all the experiments. System-C has been the fastest in returning a response from the designer but also, as we will see in the following subsections, the least accurate. Despite the fact that the tuning time is not limited, it takes only 3 sec to complete.

In the experiments performed using the NREF data set, we set the time budget to 30 min, since the workload comprises 400 statements. The designers of System-A and System-B exploit all the given time, while the one of System-C again returns results in seconds. In this set of experiments, we notice that System-C exceeds the available space budget (i.e., 20GB is the allowed space budget, while System-C uses up to 32GB of disk space), leading us to believe that it might skip the merge phase in which designers merge design structures together to fulfill space constraints. Typically, physical designers stop when the solution cannot be further improved or if they exhaust the time budget. In the latter case, it would be useful if physical designers reported the distance between the proposed and the optimal solution, thereby providing the DBA with feedback on the quality of the delivered solutions [73].

The impact of space budget. To examine the impact of the space budget on the predictability of physical designers, we use a TPC-H database of size 10GB and vary the space budget over 5GB, 15GB, and unlimited space.

We observe that both System-A and System-B exploit almost the whole available space for recommendations. On the other hand, System-C uses only 3.9GB and returns the same proposal regardless of the space budget; thus, we report results only for this proposal. With the unlimited space budget, System-A exploits 23GB, while System-B uses 17GB.

Table 4.2 shows the results for the three systems. We do not observe any correlation between the space budget and the predictability of improvement the designers deliver. System-A returns the most accurate estimations of the improvement. For the space budget of 5GB, it estimates an improvement of 46%, while the database achieves an improvement of 57%. The

REE is 26% over the whole workload. By increasing the space budget, System-A becomes more accurate, with an REE of only 1.7% with 15GB and 3.7% with unlimited space. On the other hand, System-B exhibits completely unstable behavior: an acceptable REE for the space budget of 5GB, a completely wrong estimation of the improvement for the space budget of 15GB that caused a performance degradation of 776% with an REE of 72%, and an REE of 67% with unlimited space.


Table 4.2: Predictability when using the TPC-H 10GB

Metrics      5GB space    15GB space    Unbounded space
System-A
I_E (%)      45.94        63.46         73.32
I_A (%)      57.13        64.09         74.27
REE (%)      26.09        1.74          3.7
System-B
I_E (%)      21.16        37.29         39.9
I_A (%)      30.96        -776.3*       64.1
REE (%)      14.2         92.84         67.37
System-C
I_E (%)      10.62        10.62         10.62
I_A (%)      -219.23      -219.23       -219.23
REE (%)      72           72            72
* An error in the optimizer's estimates results in a 75 times longer execution time for Q9. For the rest of the workload, I_A is 60%, which gives an REE of 57%.


After applying the proposed designs, System-B with the 15GB space budget and System-C with the 5GB space budget encounter severe performance degradation. For System-B, Q9 and Q16 run 75 and 5 times slower, while for System-C Q14 and Q19 are 44 and 12 times slower. For System-B, we observe that all other queries in the workload benefit from the new physical design. Nevertheless, the longer execution times of Q9 and Q16 prolong the overall execution 8 times (from half an hour to 4 hours). The actual query execution times after tuning, normalized by the original query execution times, are shown in Figure 4.3.

If we exclude the mentioned extreme cases, we notice that the designers’ predictions are quite conservative in comparison with the actual improvement databases achieve. One might claim that this is not a problem, as long as new designs improve performance. We do not agree, since an inaccurate estimation regarding the usefulness of a proposed design might mislead the DBA and discourage them completely from implementing such a design.

The impact of database size. To examine the influence of the database size on the predictability of physical designers, we conduct additional experiments in which we increase the size of the TPC-H data set from 10GB to 100GB. Proportionally, we vary the space budget over 50GB, 150GB, and unlimited space.

Figure 4.5 shows how the REE changes as we increase the recommendation space budget. Similar to the previous experiment, we do not observe any trend in predicting improvement.



Figure 4.5: REE with TPC-H SF100

System-A starts with an REE of 46% and further improves its predictions with an REE of 23% in the second and 20% in the third experiment. System-A is the only system that actually achieves a performance improvement after implementing the proposed design; an improvement of 61% in the first, 71% in the second and 75% in the third experiment. System-B, due to several long-running queries in each experiment, ends up with 15% worse performance in the first case, and 25% and 6% in the second and third case. The reason for the performance degradation again lies in the optimizer's cardinality errors that favor index usage over full table scans. For the same reason, System-C finishes its execution in twice the time compared with the baseline in all three experiments, causing an REE of 68%.

Surprisingly, we observe that with a larger database the designers become less accurate.

Figure 4.6 shows a comparison of the REE for System-A for SF-10 and SF-100. The X-axis shows the ratio between the recommendation space size and the database size, while the Y-axis shows the REE for the given ratio. The ratio of 0.5 corresponds to the 50GB recommendation space for the 100GB database and the 5GB recommendation space for the 10GB database. For the 0.5 ratio, the designer of the 10GB database estimates an improvement of 46%, while it actually achieves an improvement of 57%, resulting in an REE of 26%. On the other hand, the designer of the 100GB database for the same setting estimates an improvement of 44% while it actually gains an improvement of 62%, introducing an REE of 46%. Similarly, in the cases of the 1.5 and unbounded ratios, the designer of the 10GB database delivers more accurate estimates, with 20% lower relative estimation errors. From the graph, it can be seen that in both cases the relative estimation error is reduced by extending the recommendation space. Nevertheless, with increasing database size the designer becomes less predictable.

Figure 4.6: System-A: Relative estimation error in databases of different size
Figure 4.7: System-A: Actual improvement when increasing the space budget

We additionally note that, despite the fact that physical designers can substantially boost performance, after some point they can further improve performance only to a marginal extent compared with the space they need or the time it takes to create all the proposed structures6. Figure 4.7 shows the percentage of improvement System-A achieves when increasing the space budget for the TPC-H workload. With the initial budget, the designer achieves an improvement of 57% and 61% for SF 10 and SF 100, respectively. When an additional space budget equal to the database size is added, the designer proposes recommendations that further improve performance by another 7% in the former and 10% in the latter case. To improve the design by another 10% in the case of SF 10 and 5% in the case of SF 100, the designer needs an additional 8GB and 68GB, respectively. Clearly, there is a threshold after which the trade-off between achieving further improvement and the much larger space required is no longer worthwhile. Thus, feedback on how far the current solution is from the optimum, or information about how many resources the designer needs to achieve additional improvement, are features that commercial system vendors should consider including in their tools.

6 It takes nearly 16 hours to create the configuration proposed by System-B for SF 100 when the space budget is unlimited.

The impact of workload size. In this experiment, we examine the physical designers’ behavior when we increase the size of the workload. We use a select-only synthetic workload, based on exploratory queries on the NREF data set [255]. The workload comprises a set of two, three and four-table joins, in addition to simple range queries that read data from just one table and filter it by several predicates. Throughout the experiments we progressively increase the workload size using 20, 50, 100 and 200 queries between different runs. The time budget for designers is set to 30 minutes, and the space budget to 20GB. In the round of experiments conducted on the NREF data set, we do not set any restrictions on the possible physical design structures, i.e., in addition to indexes, partitioning and views may also be considered.

Table 4.3 summarizes the results for the current and the following section. System-A improves performance after applying the proposed designs, with a relative error between 20% and 45% throughout the experiments. For the same setting, the proposed designs bring improvement to System-C, with a relative error between 42% and 87%. System-B degrades performance in the majority of the experiments, again because it proposes indexes whose usefulness is overstated.

Table 4.3: Predictability when increasing the workload size

          Metrics      20       50       100      200      400*
System-A  IE (%)      94.11    90.29    92.3     81.62    58.62
          IA (%)      91.16    87.26    85.7     77.16   -18.3
          REE (%)     33.35    23.75    46.15    19.53    65.02
System-B  IE (%)      73.66    50.55    37.39    35.75     0
          IA (%)      18.15   -69.6    -60.69   -91.02     0
          REE (%)     67.82    70.83    61.03    66.36     0
System-C  IE (%)      95.75    90.12    92.35    68.8      2.23
          IA (%)      64.75    53.36    66.62    45.28    -8.13
          REE (%)     87.93    78.81    77.1     42.98     9.58

* The column headers denote the number of statements in the workload; 400 represents the update-intensive workload.

Unlike the experiment described in 'The impact of space budget', where the tools were rather conservative in their estimations, in this experiment we notice exactly the opposite behavior: the tools are too optimistic, estimating higher improvements than they actually achieve. Thus, in addition to being inaccurate in their estimations, the tools are also inconsistent across different runs, making it harder for users to anticipate the potential performance improvement they may gain.

The impact of updates. In this experiment we augment the select-only workload with a set of update statements on protein and neighboring_seq tables. Our goal is to exercise the cost model of design tools, since now they have to consider the trade-off between proposing indexes that improve performance and maintaining these structures. This can be considered as the hardest case for physical designers.

The results are presented in Table 4.3 (column 400). System-B is not able to find a design that would improve performance; therefore its error is 0%. System-A and System-C propose sets of structures with an estimated improvement of 58% in the first and 2% in the second case. Nevertheless, both systems actually deteriorate performance after applying the designs. The reason lies in the long-running update operations, since the systems now have to reorganize the indexes created on both aforementioned tables whenever an update operation occurs. System-A is the least accurate in this experiment, with an REE of 65%. System-C is more careful and proposes fewer indexes, concentrating instead on other design structures (e.g., views),


which results in an REE of 9%. Nevertheless, we notice that System-C does not actually respect the space constraints. In this experiment it uses 32GB of space, while the limit is set to 20GB.

From this and the set of experiments performed using the TPC-H benchmark, we notice that indexes can have both a positive and a negative impact on performance. They can boost performance (up to 75% improvement in our experiments), but they can also significantly degrade performance if the overhead of maintaining them is not modeled accurately, or if the optimizer underestimates the size of intermediate results and hence decides to use indexes for queries that are not selective, incurring the substantial overheads of random I/O accesses.

The impact of statistics. The harmful effect of the attribute value independence assumption on the quality of the proposed designs has already been mentioned. The assumption prolongs the execution time of the majority of the experiments conducted with the select-only NREF workload on System-B. A typical query from the workload is shown below:

SELECT p_name
FROM protein
WHERE seq_length BETWEEN 121 AND 3932
  AND protein.last_updated BETWEEN '12/21/2001' AND '02/11/2002';

Even with this simple query we can see the detrimental effect of the attribute value independence assumption. The estimated cardinality of this query is 20,595 tuples, while the actual cardinality is 179,763 tuples, an order of magnitude more. The wrong estimate misled the optimizer into believing that an index seek, followed by a lookup of the remaining columns (not covered by the index) by RowID, is the cheapest solution, while in reality a full table scan would be much faster. The same situation appears in Q19 of the TPC-H benchmark executed on System-B, where the query optimizer, for the same reason, makes a cardinality error and underestimates the size of intermediate results by three orders of magnitude. As a consequence, it decides to use a nested-loop join between the LINEITEM and PART tables with three orders of magnitude more tuples than estimated, which ultimately results in a 75 times longer execution time.
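
To make the effect concrete, the following minimal sketch contrasts an independence-based estimate with the actual cardinality when the two predicates are correlated. The table size and selectivities are invented for illustration; they are not the actual NREF statistics.

#include <stdio.h>

/* Hypothetical illustration of the attribute value independence assumption.
 * All numbers are made up for illustration; they are not the NREF statistics. */
int main(void) {
    double table_rows  = 2000000.0;  /* assumed size of the protein table          */
    double sel_length  = 0.10;       /* assumed selectivity of the seq_length range */
    double sel_updated = 0.10;       /* assumed selectivity of the last_updated range */

    /* Under independence, the combined selectivity is simply the product. */
    double est_cardinality = table_rows * sel_length * sel_updated;   /* 20,000 */

    /* If the predicates are correlated (e.g., recently updated proteins mostly
     * fall in this length range), the true combined selectivity can be close to
     * the smaller of the two, an order of magnitude higher than the estimate. */
    double true_combined_sel = 0.09;
    double act_cardinality = table_rows * true_combined_sel;          /* 180,000 */

    printf("estimated: %.0f  actual: %.0f\n", est_cardinality, act_cardinality);
    printf("an index seek pays off only for the estimate, not for the actual size\n");
    return 0;
}

The point of the sketch is only the arithmetic: multiplying per-column selectivities makes the optimizer expect an order of magnitude fewer tuples than the query actually returns, which in turn makes index-based plans look artificially cheap.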

Another surprising observation is that, for System-B and System-C, it is in some cases better performance-wise not to have statistical information at all than to have it combined with the independence assumption. Without statistics on indexes, the optimizer chooses the safe option, which is a full table scan, and hence it chooses more efficient execution plans. In addition, we notice that the reason System-A does not fall into this trap is that it proposes the creation of statistics on all joint columns from the workload. We tried to perform the same task manually on System-B, unfortunately without success, since the cardinality errors remained. System-C, to our knowledge, does not support such a command.

From everything mentioned so far, it can be concluded that statistics have a major impact on the quality of execution plans and indirectly on the predictability of physical designers.


Discussion

Since databases are usually part of larger systems, predictability of their behavior is an important feature. Changing the physical design is a heavyweight operation, so some level of guarantee is certainly needed. Our results show that the systems are not only inaccurate in their estimates, but also inconsistent, and hence even more unpredictable across different experiments, raising questions about their trustworthiness. Promising an improvement that is eventually not obtained may frustrate users and ultimately discourage them from using the tools. Therefore, designers have to deliver solutions with a high level of certainty, making them trustworthy enough to implement.

4.2.3 The need for lightweight tuning

Traditional physical design techniques based on query optimization are ineffective in applications where statistics representing the base data are unavailable, or where data characteristics are skewed and change dynamically [52]. In Chapter 1 we gave examples of several such applications. For instance, scientific domains such as astronomy, biology, and neuroscience typically expand their base data on a daily basis, leading to enormous data sets which scientists explore by looking for arbitrary patterns in the data. In these dynamic environments, the problem of physical design is aggravated for several reasons: a) deciding when to invoke a tuning process under frequently changing workloads is not straightforward; b) deciding what constitutes a representative workload to give to the design tool is also an issue, especially in data exploration where the notion of a representative workload hardly exists; c) finding sufficient time when the system is offline to create the physical design structures proposed by the design tool can also pose a problem.

Several research groups have recognized the problem and have offered lightweight solutions to physical design tuning [42, 43, 210, 214]. These efforts track the workload and make online decisions when to make physical design changes. Despite being much more flexible, with dynamic physical design tuning the system has to be offline long enough to allow the creation of proposed design structures. To remind the reader, for 18 TPC-H queries of the previous experiment it took 3 hours to create the proposed physical design. Considering that real-life workloads often comprise hundreds or thousands of queries, we might expect both longer tuning time and longer creation time of physical design structures.

New approaches for online physical design tuning employed the idea of database cracking [133, 134, 137] and adaptive indexing [103] to lower the creation cost of indexes and distribute it over time by piggybacking on query execution to refine indexes. With these approaches, physical design tuning is not anymore a task of an external tool, but is an integral part of the database system, i.e., each operator exhibits a self-organizing behavior as a side-effect of query execution. In Chapter 7 we strive to achieve a similar goal, and go even further by proposing adjustable auxiliary design structures tailored for raw data access.


Ideally, a truly hands-free database system will need zero human input and will be able to automatically adapt to changing environments. Every operator we consider in this thesis is therefore redesigned to provide a self-managing behavior, based on workload (in Chapter 7), data (in Chapter 8), and hardware characteristics (in Chapter 9).

5 Improving Query Performance through Corrective Actions

The volatility of query optimizers not only affects the quality of physical database designers, but can also significantly degrade the overall user experience. Anecdotal evidence from industry leaders states that the angriest support calls come from customers unsatisfied with their query performance [108, 170]. With queries becoming increasingly complex, statistics being less available and more expensive to gather, and data even being stored remotely, it is clear that the traditional optimize-then-execute query paradigm is becoming insufficient [31, 83, 148, 149, 150, 155, 168, 175, 186]. This has led to the need for adaptive query processing techniques, where runtime feedback is used to monitor the current query execution strategy with the purpose of correcting that choice and providing better query response times [23, 78, 123, 259].

Adaptive query processing is an active area of database research that comes in several flavors. Since the primary cause of suboptimal plan choices is coming from insufficient or non-existing statistical information about data distribution, a plethora of efforts focused on collecting statistical information at run-time and exploiting it to correct the currently executing or future queries. Realizing the detrimental effect of executing suboptimal plans due to wrong estimates about the system’s state at run-time, a large body of work has focused on improving plans once the suboptimality has been detected. Orthogonal efforts focused on proposing plans more resilient to the quality of estimates, also referred to as robust plans. Finally, since the statistical properties of the data might dynamically change over time (which particularly holds for data streaming environments), a single plan might not be optimal throughout the query lifetime, therefore another group of approaches favors running multiple plans over different data partitions. In the following we discuss each group in more detail.

5.1 Runtime statistics refinement

Missing or imprecise statistical information could be obtained at runtime usually with low overhead, if the statistical collection procedure gets piggybacked on the query execution. Learned statistical information then can be injected [50] back in the planning procedure and

exploited by the current or future queries [2, 40, 44, 54, 56, 59, 223, 225]. A step further is to explicitly trigger subplans to collect statistical information for particular parts of the plan search space (i.e., sensitive query fragments) [1, 124, 140, 188] in order to remove uncertainty. In such cases, execution and statistics collection are usually interleaved, and the newly gathered knowledge helps propose better plans.
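
A minimal sketch of the piggybacking idea is shown below: operators record the cardinalities they actually observe, and the optimizer consults this feedback before falling back to its model-based estimate. The structure and all names are ours and only illustrate the mechanism shared by the cited approaches; they are not the API of any specific system.

#include <stdio.h>
#include <string.h>

/* Minimal sketch of piggybacked cardinality feedback (illustrative only). */
#define MAX_FEEDBACK 128

typedef struct {
    char predicate[64];   /* normalized predicate text used as a lookup key */
    double actual_rows;   /* cardinality observed during execution          */
} FeedbackEntry;

static FeedbackEntry cache[MAX_FEEDBACK];
static int cache_size = 0;

/* Called by an operator when it finishes: record what really happened. */
void record_feedback(const char *predicate, double actual_rows) {
    if (cache_size < MAX_FEEDBACK) {
        snprintf(cache[cache_size].predicate, sizeof(cache[cache_size].predicate),
                 "%s", predicate);
        cache[cache_size].actual_rows = actual_rows;
        cache_size++;
    }
}

/* Called by the optimizer: prefer observed cardinalities over model estimates. */
double estimate_rows(const char *predicate, double model_estimate) {
    for (int i = 0; i < cache_size; i++)
        if (strcmp(cache[i].predicate, predicate) == 0)
            return cache[i].actual_rows;    /* inject runtime knowledge */
    return model_estimate;                  /* fall back to the cost model */
}

int main(void) {
    record_feedback("seq_length BETWEEN 121 AND 3932", 179763);
    printf("%.0f\n", estimate_rows("seq_length BETWEEN 121 AND 3932", 20595));
    return 0;
}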

5.2 Dynamic plan change

Despite improving the quality of plans, there are environments where statistical information cannot be fully gathered (e.g. remote data sources, frequent data ingest, streaming, etc.), hence a dynamic plan change at runtime is needed. Past work employed subplan shuffling, complete runtime reoptimization or the choice of multiple plans as opposed to a single one.

Change through subplan shuffling. Subplan shuffling is employed to deal with unexpected data arrival delays, usually caused by network transfers from remote sources, which are characteristic of data integration systems. The employed techniques minimize the idle time during query processing by rescheduling the order of the subplans of the original plan, which ultimately results in join reordering of the original plan. Such an approach has been employed in the Tukwila system [149], in Query Scrambling [15, 243], and in Corrective Plans [148]. A step further is to fully interleave the scheduling and execution phases and trigger scheduling every time a data item becomes unavailable or a subplan finishes [37]. The highest level of adaptivity is achieved in Ingres [227] and with Eddies [21, 204], where the order among the existing (predetermined) operators is reassessed and changed based on data arrival and the observed selectivities of the operators in the query plan.

Change through reoptimization. Unlike shuffling, reoptimization performs full query optimization, usually upon detecting a cardinality estimate violation [24, 87, 155, 168, 175]. Nonetheless, there are also approaches tailored to improving resource allocation (e.g., the degree of parallelism [45] or CPU and memory allocation [119]). When performing reoptimization, special attention has to be paid to the work already done, i.e., to the treatment of intermediate results, which can be fully exploited or discarded [260]. Similarly, it is important to know when it is safe to perform reoptimization so that the correctness of results is preserved (e.g., changing the table order of an index nested loops join (INLJ) in the middle of processing the inner table would create only partial results for one outer tuple, while the remaining results for that outer tuple might never be produced after switching to the new order [87, 168]).
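
The following sketch illustrates the typical trigger for such mid-query reoptimization: at a checkpoint (e.g., a materialization point), the observed cardinality is compared against a validity range around the optimizer's estimate, and execution is suspended for reoptimization if the range is violated. The range factors and numbers are hypothetical, not the exact test of any cited system.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative cardinality-violation check performed at a plan checkpoint. */
typedef struct {
    double estimated_rows;
    double low_factor;    /* validity range: [estimate/low_factor, estimate*high_factor] */
    double high_factor;
} Checkpoint;

bool should_reoptimize(const Checkpoint *cp, double observed_rows) {
    double low  = cp->estimated_rows / cp->low_factor;
    double high = cp->estimated_rows * cp->high_factor;
    return observed_rows < low || observed_rows > high;
}

int main(void) {
    Checkpoint cp = { .estimated_rows = 1000.0, .low_factor = 5.0, .high_factor = 5.0 };
    double observed = 850000.0;   /* hypothetical: three orders of magnitude off */
    if (should_reoptimize(&cp, observed))
        printf("suspend execution, keep the materialized input, re-invoke the optimizer\n");
    return 0;
}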

Multiple plan choices. Monolithic approaches to query processing are usually not viable in data streaming environments. Therefore, multi-plan techniques have been explored in the database community for the past decade. Multi-plan approaches choose a set of possible plans and execute them either in parallel [16, 17] or each one on a disjoint subset of the data [31, 46, 150, 186, 241, 260]. Special cases of multi-plan choices are parametric [146] and dynamic plans [68, 105], where, from a set of plans determined at compile time, a specific plan or operator implementation is chosen based on the values of parameter

markers obtained at run-time. Similarly, Plan Bouquets [83] choose, from a discretized space of parametric optimal plans, a subset based on the selectivity observed at runtime, while Proactive Reoptimization proposes a set of switchable plans that can be safely interchanged without losing already processed work [24].

5.3 Robust plan selection

Unlike the previously described approaches, which are mostly reactive to a detected suboptimality, robust plans take into account the uncertainty of the optimization process and choose plans that are more resilient to cardinality misestimates [22, 64]. For instance, instead of returning a single value when estimating the selectivity of a particular operator, Robust Cardinality Estimation [22] returns a probability density function and chooses the final plan based on a user-defined confidence threshold.
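
As a rough illustration (not the exact formulation of [22]), the sketch below costs two hypothetical access paths at a user-defined quantile of a sampled selectivity distribution instead of at a single point estimate, and picks the plan that is cheaper at that confidence level. Cost models, samples, and the 90% threshold are all assumptions made for the example.

#include <stdio.h>
#include <stdlib.h>

/* Sketch of plan selection at a user-defined confidence level (illustrative). */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Take the q-quantile of sampled selectivities instead of a single point estimate. */
double selectivity_at_confidence(double *samples, int n, double q) {
    qsort(samples, n, sizeof(double), cmp_double);
    int idx = (int)(q * (n - 1));
    return samples[idx];
}

/* Hypothetical cost models: the index plan degrades quickly as selectivity grows. */
double cost_index_scan(double sel) { return 10.0 + 100000.0 * sel; }
double cost_full_scan (double sel) { (void)sel; return 5000.0; }

int main(void) {
    double samples[] = { 0.001, 0.002, 0.004, 0.01, 0.05, 0.08, 0.12, 0.20 };
    double sel = selectivity_at_confidence(samples, 8, 0.9);   /* be 90% sure */
    const char *choice =
        cost_index_scan(sel) < cost_full_scan(sel) ? "index scan" : "full scan";
    printf("selectivity at 90%% confidence: %.3f -> pick %s\n", sel, choice);
    return 0;
}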

Orthogonal efforts have focused on pruning the plan search space (visualized with a plan diagram [205]), leaving only a subset of plans that are more resilient to optimizer misestimates. To prune the space, plan diagram reduction lets plans be swallowed by their neighbors if the neighbors are near-optimal over a larger selectivity interval [81, 82].

5.4 Adaptive operators

All of the mentioned approaches that perform dynamic plan changes are examples of inter-operator adaptivity, where the adaptation mechanism is employed between operators, i.e., it mostly pertains to the operator order. Adaptive operators, on the other hand, are more fine-grained, as they encapsulate the adaptation mechanism within their own algorithms. Adaptive operators offer greater flexibility when it comes to scheduling the order of the tuples flowing through them. This flexibility enables an operator to make progress even when data arrival from one source blocks, at the expense of increased memory consumption (e.g., Symmetric Hash Join [252], Multi-way Join [246], XJoin [242], Ripple Joins [116]). Unpredictable data arrival is also a motivation for the work introduced in Chapter 9, where, in order to optimize for the expensive data access on cold storage devices, hardware-driven query execution is introduced (i.e., the hardware decides the order in which data is sent).
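
To make the intra-operator flexibility concrete, the following minimal sketch (ours, not the implementation of any cited system) shows the core idea behind a symmetric hash join: each arriving tuple is inserted into a hash table for its own input and immediately probes the hash table of the other input, so the join produces results regardless of which source is currently delivering data. Keys, table sizes, and the arrival pattern are hypothetical.

#include <stdio.h>

/* Minimal sketch of a symmetric hash join over two tuple streams
 * (integer join keys only, tiny chained tables; purely illustrative). */
#define BUCKETS 8
#define CAP 16

typedef struct { int keys[CAP]; int count; } Bucket;

static Bucket left_table[BUCKETS], right_table[BUCKETS];

static void insert(Bucket *table, int key) {
    Bucket *b = &table[key % BUCKETS];
    if (b->count < CAP) b->keys[b->count++] = key;
}

static void probe(Bucket *table, int key, const char *side) {
    Bucket *b = &table[key % BUCKETS];
    for (int i = 0; i < b->count; i++)
        if (b->keys[i] == key)
            printf("match on key %d (probed %s table)\n", key, side);
}

int main(void) {
    /* Interleaved arrival: source 0 = left stream, source 1 = right stream.
     * Each arriving tuple is inserted into its own table and immediately
     * probes the other table, so the join makes progress no matter which
     * source is currently delivering data. */
    int source[] = { 0, 1, 1, 0, 1, 0 };
    int key[]    = { 3, 7, 3, 7, 5, 5 };
    for (int i = 0; i < 6; i++) {
        if (source[i] == 0) { insert(left_table,  key[i]); probe(right_table, key[i], "right"); }
        else                { insert(right_table, key[i]); probe(left_table,  key[i], "left");  }
    }
    return 0;
}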

Robustness and adaptation to data characteristics at the intra-operator level are considered in [16, 17, 28, 61, 100, 183]. Despite considerable effort on fixing suboptimal decisions, little attention has been paid to the access path selection problem. Nonetheless, a suboptimal decision at the level of access paths has a highly detrimental effect on overall query performance, since access paths touch most of the data before any filtering has been applied. This detrimental effect has already been shown in Figure 4.3, where suboptimality at the access path level resulted in a 75× increase in query execution time. We fill the need for adaptation at the access path level in Chapter 8 by introducing a hybrid adaptive access path called Smooth Scan. Smooth Scan guarantees near-optimal performance throughout the entire

range of possible selectivities, thereby preventing poor execution as a consequence of suboptimal decisions. Unlike [16, 17], however, Smooth Scan does not waste resources by doing work twice, nor does it require a substantial change to the database architecture. Moreover, since the high risk of incomplete statistics over ever-increasing data sets remains, Smooth Scan is completely statistics-oblivious.

6 Database Storage

This chapter discusses the storage aspects of databases. As enterprise databases traditionally use storage tiering, where data waterfalls from the high-cost, low-latency tiers into the low-cost, high-latency tiers, we discuss the implications of the tiering hierarchy on database cost and performance. We further discuss newly emerged hardware, named Cold Storage, as a promising avenue for restructuring the traditional storage tiering hierarchy.

6.1 Database storage tiering

Enterprise databases have long used storage tiering for reducing capital and operational expenses. Traditionally, databases used a two-tier storage hierarchy. An online tier based on enterprise HDD provided low-latency random access (ms) to data. The backup tier, in contrast, was based on offline tape cartridges or optical drives, and provided low-cost, high-latency (hours) storage for storing backups to be restored only during rare failures.

As databases grew in popularity, the necessity to reduce recovery time after failure became important. Further, as regulatory compliance requirements forced enterprises to maintain long-term data archives, the offline nature of the backup tier proved too slow for both storing and retrieving infrequently accessed archival data. Thus, a new tier dubbed archival tier became popular. Archival tier was implemented using nearline storage devices, like robotic tape libraries (VTL) or optical jukeboxes, that could store and retrieve data automatically without human intervention in minutes.

Hierarchical Storage Managers (HSM) were developed to automatically manage migration of data between online and archival tiers, implement different backup schedules for each tier, and integrate with offline storage for disaster recovery [251]. Databases used HSM to implement multitier storage hierarchies by associating policies with different data items (archived redo logs, backups of data files, etc). For example, Oracle uses a HSM called Sun Storage Archive Manager (SAM) to automatically move data between a disk-based online tier and a tape-based archival tier [231].



Figure 6.1: Storage tiering for enterprise databases

Over the past decade, much attention has been paid to the online tier, for three main reasons: 1) the emergence of flash-based solid-state storage, 2) the declining price of DRAM, and 3) the demand for low-latency, real-time data analytics. As a result, the traditional online tier has been decomposed into a low-latency, SSD- or DRAM-based performance tier and a high-density, HDD-based capacity tier. Databases classify data as hot or cold depending on access patterns, store it in the appropriate tier, and enable queries over data stored in both tiers. For instance, SAP's Business Warehouse (BW) product uses SAP HANA to manage a DRAM-based performance tier and Sybase IQ to manage an HDD-based capacity tier [69]. Oracle uses SAM QFS to seamlessly manage flash, disk, and tape tiers [231]. Thus, as shown in Figure 6.1, all modern enterprise databases nowadays use a four-tier storage hierarchy, where the performance, capacity, archival, and backup tiers are implemented using three storage types (online, nearline, and offline).

Enterprise databases have enforced a strict separation of functionality across storage tiers predominantly dictated by the access latencies of corresponding storage devices. Given the demand for real-time analytics, the performance tier is used to satisfy latency-sensitive real- time queries and the capacity tier for latency-insensitive batch queries. Unlike these online storage devices, any access to data in nearline storage must be mediated by the HSM, as it must be located and transferred from a nearline device to an online device before it can be used. Given the prohibitively high access latencies (minutes) associated with nearline storage, databases store all data that must be accessible by the query execution engine (for both batch and interactive queries) in either performance or capacity tiers. Thus, the nearline tier is never used directly during query execution, but only to retrieve archived data during compliance verification, or backup during media failure.


Table 6.1: Acquisition cost in $/GB and fraction of data stored in each device type for various tiering configurations, as reported by [230]. The tier corresponding to each device is shown in parentheses, with P standing for performance, C for capacity, and A for archival.

          SSD (P)   15k-HDD (P)   7.2k-HDD (C)   Tape (A)
Cost/GB     $75        $13.5          $4.5         $0.2
2-tier       -          35%            65%           -
3-tier       -          15%           32.5%        52.5%
4-tier       2%         13%           32.5%        52.5%


Figure 6.2: Cost benefits of storage tiering

6.2 Proliferation of cold data

The past few years have seen an unprecedented growth in the amount of infrequently accessed data, also referred to as cold data, stored in both private and public clouds [129, 141, 230].

Driven by the desire to extract insights from data, businesses have started aggregating vast amounts of data from heterogeneous sources, including social media, web logs, etc. Emerging application domains, like the Internet of Things, are expected to exacerbate this trend further [130]. As these data lakes continue to grow in size, it is inevitable that a significant fraction of the data will be infrequently accessed [141]. Recent analyst reports claim that only 10-20% of the data stored is actively accessed, with the rest being cold. For instance, Facebook has reported that only 8% of its user data is actively accessed [129]. In addition, cold data has been identified as the fastest growing storage segment, with a 60% cumulative annual growth rate [129, 141, 230].


6.2.1 Cold data in the archival tier

As the amount of cold data increases, enterprises are increasingly looking for more cost-efficient ways to store it. A recent report from IDC emphasized the need for such low-cost storage by stating that only 0.5% of potential Big Data is currently analyzed, and that, in order to extract its unrealized value, infrastructure support is needed for storing large volumes of data, over long time periods, at extremely low cost [129].

Given that databases already have a tiered storage infrastructure in place, an obvious low-cost solution to deal with the profusion of cold data is to store it in either the capacity tier or the archival tier. Table 6.1 and Figure 6.2 show the cost of building a 100-TB database using various tiered storage strategies as reported by a recent analyst study [230]. The one-tier storage strategy uses only a single storage device for housing all data. The two-tier strategy uses 15k-RPM SCSI disks as the performance tier, and 7,200-RPM SATA disks as the capacity tier with no archival tier. The three-tier strategy spreads data across the two HDD tiers and a tape-based archival tier. Finally, the four-tier strategy uses an SSD to hold the hottest data items in addition to the remaining tiers.

Clearly, any strategy that uses the tape-based archival tier for storing cold data provides a substantial reduction in cost. Storing all data on tape is unsurprisingly the cheapest option, providing a 20× reduction in cost compared to the All-SATA strategy that uses the capacity tier exclusively. Similarly, the disk-tape three-tier strategy that uses the archival tier provides a 2× cost reduction compared to the disk-only two-tier strategy and a 1.25× reduction compared to the All-SATA strategy. Note that these savings add up quickly as the database size increases further, motivating the need to store as much cold data as possible in the archival tier.
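
These savings can be reproduced with a simple back-of-the-envelope calculation from Table 6.1 (assuming 1 TB = 1024 GB). The sketch below recomputes the acquisition cost of the 100-TB database for the tiered configurations; the totals closely match those reported in the study, although the analyst figures may include additional cost components.

#include <stdio.h>

/* Back-of-the-envelope acquisition cost for a 100-TB database using the
 * per-GB prices and data placement fractions of Table 6.1 (illustrative). */
int main(void) {
    double db_gb = 100.0 * 1024.0;                 /* 100 TB expressed in GB        */
    double price[] = { 75.0, 13.5, 4.5, 0.2 };     /* SSD, 15k HDD, 7.2k HDD, tape  */

    double two_tier[]   = { 0.00, 0.35, 0.650, 0.000 };
    double three_tier[] = { 0.00, 0.15, 0.325, 0.525 };
    double four_tier[]  = { 0.02, 0.13, 0.325, 0.525 };

    double *configs[] = { two_tier, three_tier, four_tier };
    const char *names[] = { "2-tier", "3-tier", "4-tier" };

    for (int c = 0; c < 3; c++) {
        double cost = 0.0;
        for (int d = 0; d < 4; d++)
            cost += db_gb * configs[c][d] * price[d];   /* GB stored on device * $/GB */
        printf("%s: ~$%.0fk\n", names[c], cost / 1000.0);
    }
    printf("all-tape: ~$%.0fk, all-SATA: ~$%.0fk\n",
           db_gb * 0.2 / 1000.0, db_gb * 4.5 / 1000.0);
    return 0;
}

Running this yields roughly $783k for two tiers, $368k for three tiers, $494k for four tiers, $20k for all-tape, and $461k for all-SATA, which is where the 20×, 2×, and 1.25× ratios quoted above come from.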

6.2.2 Application-hardware mismatch

Tapes have been the medium of choice for the archival tier, as they are cheap, more energy-efficient, and offer much higher capacities than any other storage medium. Despite the cost benefits, storing cold data in the archival tier is not a viable option, due to a mismatch between application demands and hardware capabilities. Thus far, the archival tier has been used to store only rarely accessed compliance and backup data. As the expected workload was predominantly sequential writes with rare reads, the high data access latency of tape drives was not a limiting factor. Using the archival tier to store cold data changes the application workload, as analytical queries might now be issued over cold data to extract insights. As a nearline storage device with access latencies at least four to five orders of magnitude higher than those of the slowest online storage device (HDD), tape cannot handle this workload.

As enterprises need to be able to run batch analytics over cold data to derive insights [130, 206], the minute-long access latency of tape libraries makes the archival tier unsuitable as a storage medium for housing cold data. Given the prohibitively high access latencies associated with nearline storage, databases have little option but to store all data that must be accessible by the

query execution engine in either the performance or the capacity tier. Thus, today, the performance tier is used to satisfy latency-sensitive real-time queries, while the capacity tier is used to satisfy latency-insensitive batch queries. The archival tier is never used directly during query execution, but only during compliance verification or upon media failure.

6.3 Cold storage devices

Over the past few years, storage hardware vendors and researchers have become cognizant of the gap between the HDD-based capacity tier and the tape-based archival tier. This has led to the emergence of a new class of nearline storage devices explicitly targeted at cold data workloads [25, 202, 222, 232, 253]. These devices, also referred to as Cold Storage Devices (CSD), have three salient properties that distinguish them from the tape-based archival tier.

First, they use archival-grade, high-density, Shingled Magnetic Recording-based (SMR) HDD as the storage media instead of tapes. Second, they explicitly trade off performance for power consumption by organizing hundreds of disks in a Massive Array of Idle Disks (MAID) configuration [67]. MAID arrays are similar to RAID arrays in their use of HDD as the primary storage medium. However, in contrast to the high performance 15K RPM HDD used in RAID arrays, MAID arrays use high-density SMR-based SATA HDD to increase storage capacity. In addition, while RAID arrays maintain all disks active and spinning at all times to reduce latency, MAID arrays spin down most disk drives and keep only a fraction of disks on at any given time. By doing so, they are able to densely pack hundreds or even thousands of disks in a single storage rack while staying within a limited power and cooling budget. This leads to the last point – CSD right-provision hardware resources, like in-rack cooling and power management, to cater to only the subset of disks that are spun up.

Energy efficiency is quite important for enterprise data centers, as power and related costs dominate the overall enterprise infrastructure cost and are a major impediment to scalability in data centers, since data growth largely outpaces improvements in the energy footprint [156]. According to recent reports from James Hamilton, a leading authority on enterprise data center efficiency, nearly half of all the costs of an enterprise data center go to power and cooling [120]. Cost reductions in this domain will therefore substantially reduce the operational expenses of large-scale enterprise data centers. By vertically integrating hardware, software, cooling, and power management into a single converged design, CSD manage to win on all of these aspects and substantially reduce overall operational expenses, offering a storage cost per GB (and capacity) comparable to that of traditional offline tape archives. For instance, Spectra's ArcticBlue CSD is reported to reduce storage cost to $0.1/GB [169], while Storiant claims a total cost of ownership (TCO) as low as $0.01/GB per month [189]. Due to the use of HDD instead of tape, they nevertheless reduce the worst-case access latency from minutes to mere seconds, the spin-up time of disk drives. Thus, CSD form a perfect middle ground between the HDD-based capacity tier and the tape-based archival tier, as shown in Figure 6.3.



Figure 6.3: CSD in the storage tiering hierarchy

Although CSD differ in terms of cost, capacity, and performance characteristics, they are identical from a behavioral standpoint: each CSD is a MAID array in which only a small subset of disks is spun up and active at any given time. For instance, Pelican [25] packs 1,152 SMR disks in a 52U rack for a total capacity of 5 PB. The rack is internally made up of six 8U chassis. Each chassis is composed of twelve 4U trays, and each tray contains 16 disks. Thus, the disks in Pelican can be conceptually visualized as being arranged in a 6 × 16 × 12 cuboid, as shown in Figure 6.4. A shared backplane powers the trays but is configured to be sufficient to power only one active disk drive, as depicted by the gray row in Figure 6.4. Similarly, in-rack cooling is performed using multiple air flow channels, where each channel is shared by 12 disk drives situated across multiple trays; with such an arrangement, each channel can cool only one active disk drive, as shown by the gray column in Figure 6.4. Due to these cooling and power management design choices, the Pelican hardware enforces strict restrictions on the set of disks that can be active simultaneously at any given time, highlighted as the gray diagonal (in the case of Pelican, only 8% of the disks can be spun up simultaneously). Similarly, each OpenVault Knox [202] CSD server stores 30 SMR HDD in a 2U chassis, out of which only one can be spun up, to minimize the sensitivity of the disks to vibration.

The net effect of these limitations is that CSD enforce strict restrictions on how many and which disks can be active simultaneously (referred to as a disk group). All disks within a group can be spun up or down in parallel. Access to data in any of the disks in the currently spun up storage group can be done with latency and bandwidth comparable to that of the traditional HDD-based capacity tier. For instance, Pelican, OpenVault Knox, and ArticBlue are all capable of saturating a 10-Gb Ethernet link as they provide between 1-2 GB/s of throughput for reading data from spun-up disks [25, 202, 222].

Figure 6.4: Pelican rack schematic (16 disks per tray, 12 trays per chassis)

However, accessing data on a disk outside the currently active group requires spinning down the active disks and loading the next group by spinning up the new set of disks. We refer to this operation as a group switch. For instance, Pelican takes eight seconds to perform a group switch. Thus, the best-case access latency of a CSD is identical to that of the traditional capacity tier (on the order of milliseconds), while the worst-case access latency is two orders of magnitude higher (on the order of seconds).
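
A rough model of this latency asymmetry, using the eight-second group switch and millisecond-level disk access mentioned above, is sketched below. The request trace and the batching policy are hypothetical; the sketch only illustrates why CSD-aware scheduling (batching requests per disk group) matters.

#include <stdio.h>

/* Illustrative model of CSD access latency. The group-switch and per-access
 * figures come from the Pelican description above; the trace is made up. */
#define GROUP_SWITCH_SEC 8.0
#define DISK_ACCESS_SEC  0.01    /* ~10 ms per request once the group is spun up */

int main(void) {
    int requests[] = { 3, 1, 3, 2, 1, 3, 2, 3 };   /* disk group of each request */
    int n = 8, active_group = -1;
    double naive = 0.0;

    /* Serving requests in arrival order: every group change costs a switch. */
    for (int i = 0; i < n; i++) {
        if (requests[i] != active_group) { naive += GROUP_SWITCH_SEC; active_group = requests[i]; }
        naive += DISK_ACCESS_SEC;
    }

    /* Batching requests per group (three distinct groups here) pays each switch once. */
    double batched = 3 * GROUP_SWITCH_SEC + n * DISK_ACCESS_SEC;

    printf("arrival order: %.2f s, batched per group: %.2f s\n", naive, batched);
    return 0;
}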

Driven by the price/performance aspects of CSD, in Chapter 9 we examine how CSD should be integrated into the database storage tiering hierarchy. While doing so, we make a case for using CSD as a replacement for both the HDD-based capacity and archival tiers of enterprise databases.


Part II: Quest for timely, predictable and cost-effective data analytics


7 Timely and Interactive Data Analytics

As data volumes become larger, data initialization (i.e., data loading and tuning), a prerequisite to efficient data exploration, turns into a major bottleneck. More data means more time to prepare and load the data into the database before being able to pose the desired queries. Many applications, for example scientific data analysis and social networks, already avoid using database systems due to the long data-to-insight time, that is, the time between getting the data and retrieving its first useful results. The situation will only worsen in the future, when we expect to have far more data than we can move or store, let alone analyze.

Motivated by the need for timely and interactive data analytics, where users aspire to quick interaction with their freshly acquired data, this chapter presents the design of a new paradigm for database systems, called NoDB. To reduce the data-to-insight time, NoDB systems skip the data initialization step, i.e., they do not require data loading and enable query processing directly over raw data files. Through the design and the lessons learned by implementing the NoDB paradigm over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to processing over raw files, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, this chapter introduces an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure and incremental statistics collection, which further improve query execution performance.

The NoDB implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of PostgreSQL and even outperforming it in many cases. The analysis shows that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect on database usability: they enable scientists to benefit from database technology while removing from them the burden of deciding how to prepare and tune the system. 1

1 This chapter uses material from [9, 10, 12].


7.1 Introduction

The unprecedented amounts of generated data that outgrow the capabilities of query processing technology are creating a new era for database technology, an era of data deluge [111]. Many emerging applications, such as social networks, the Internet of Things, and data exploration in scientific experiments, are representative examples of this deluge [130]. Scientific analysis, such as astronomy, is soon expected to collect multiple terabytes (TB) of data on a daily basis, while web-based businesses such as social networks or web log analysis are already confronted with a growing stream of large data inputs [220]. These requirements show a clear need for efficient big data processing to enable the evolution of businesses and sciences in the new era of data deluge.

Need for a change in database technology. Although Database Management Systems (DBMS) remain the predominant data analysis technology overall, they are rarely used for emerging applications such as scientific analysis and social networks. This is largely due to the complexity involved: there is a significant initialization cost in loading data and preparing the database system for queries, which substantially prolongs the time to first insight, i.e., the moment the user is able to extract useful knowledge from the data. For example, a scientist needs to quickly examine a few TB of new data, received from a device such as a telescope or a sensor, in search of certain properties. Even though only a few attributes might be relevant for the task, or, even worse, nothing is relevant for the given task, the entire data set must first be loaded into the database. For large amounts of data, this means a few hours of delay, even with parallel loading across multiple machines. Besides being a significant time investment, such an approach requires extra computing resources for a full load and increases energy consumption, further affecting economic sustainability. Furthermore, this overhead can prove to be useless, as the user may decide to discard the loaded data shortly after realizing that it does not contain any interesting information.

Instead of using database systems, emerging applications rely on custom solutions that usually miss important database features. For instance, declarative queries, schema evolution, and complete isolation from the internal representation of data are rarely present. The situation today is in many ways similar to the past, before the first relational systems were introduced: there is a wide variety of competing approaches, but users remain exposed to many low-level details and must work close to the physical level to obtain adequate performance and scalability.

The lessons learned in the past four decades indicate that in order to efficiently cope with the data proliferation in the long run, we will need to rely on the fundamental principles adopted by database management technology. That is, we will need to build extensible systems with declarative query processing but focus on self-managing optimization techniques tailored for the data deluge. A growing part of the database community recognizes this need for significant and fundamental changes to database design, ranging from low-level architectural redesigns to changes in the way users interact with the system [8, 111, 136, 152, 162, 184, 228].


NoDB. We recognize this new need, which is a direct consequence of the data deluge and of data exploration as a new use case, and describe the roadmap towards NoDB, a new database design paradigm that we believe will affect the design of future database systems. The goal of NoDB is to make database systems more accessible to the user. NoDB enables timely and interactive data exploration by eliminating major bottlenecks of current state-of-the-art technology that increase the data-to-insight time. The data-to-insight time is of critical importance as it defines the moment when a database system becomes usable and thus useful. The NoDB paradigm changes the way a user interacts with a database system by eliminating data loading and reducing the initialization time to zero. Instead, NoDB advocates querying directly over raw data (instantaneously as data arrives) and extends traditional query processing architectures to work over raw data.

Querying raw files directly, i.e., without loading, has long been a feature of database systems. For instance, Oracle calls this feature external tables. Unfortunately, such features are hardly sufficient to satisfy the demands of the data deluge, since they repeatedly scan entire files for every query. Instead, we propose to redesign the query processing layer of database systems to incrementally and adaptively query raw data files, while automatically creating and refining auxiliary structures to speed up future queries. Using a mature and complete implementation over a modern DBMS (PostgreSQL), we identify and overcome the fundamental limitations of NoDB systems. We show how to make raw files first-class citizens without sacrificing query processing performance, by introducing several innovative techniques such as selective parsing, adaptive indexing that operates over raw files, caching techniques, and incremental statistics collection over raw files. Overall, we describe how to exploit a traditional DBMS to conform to the NoDB philosophy, identifying limitations and opportunities along the way.

The contributions of this chapter are as follows:

• This chapter presents the necessary steps to convert a traditional DBMS (PostgreSQL) into a NoDB system (PostgresRaw). To address the overheads of repeated file access and parsing, which are the main bottlenecks to efficient processing, we design an innovative adaptive indexing mechanism that makes the trip back to the data files efficient.

• We demonstrate that the query response time of a NoDB system can be competitive with a traditional DBMS which loads data a priori, if we use the workload as a driver to build adaptive indexes, caches and statistics that accelerate future queries.

• We show that NoDB systems provide quick access to the data under a variety of work- loads and micro-benchmarks. PostgresRaw query performance improves gradually as it processes additional queries and it quickly matches or outperforms traditional DBMS, including MySQL and PostgreSQL.

• We describe challenges coming from raw query processing and discuss opportunities that the NoDB philosophy brings as an enabler to interactive data exploration.


7.2 Reducing data-to-insight time by querying raw data files

In a typical storage organization, a row-store DBMS organizes data in the form of tuples, stored sequentially one tuple after the other in the form of slotted pages. Each page contains a collection of tuples as well as additional metadata information to help in-page navigation. These pages are created during the loading process. Before being able to submit queries, the data must first be loaded, which transforms it from the raw format to the database (binary) page format. During query processing the DBMS brings pages into memory and processes the tuples. In order to create proper query plans, i.e., to decide on the operators and their order of execution, an optimizer is used, which exploits previously collected statistics about the data. A query plan is a tree where each node is a relational operator and each leaf corresponds to a data access method. The access methods define how the system accesses the tuples. Each tuple is then passed one-by-one through the operators of a query plan. More details on the query processing workflow are presented in Chapter 3.

7.2.1 Raw query processing

Any strategy that implies raw data access and hence avoids a priori loading has to be integrated with the aforementioned design for efficient query execution. There are two ways in which this integration can be done. The first approach is to simply run the loading procedure whenever a relevant query arrives: when a query referring to table R arrives, only then load table R, and immediately evaluate the query over the loaded data. Data may be loaded into temporary tables that are immediately discarded after processing the query, or it may be loaded into persistent tables stored on disk. These approaches however, significantly penalize the first query, since creating the complete table before evaluating the query implies that the same data needs to be accessed twice, once for loading and once for query evaluation.

A better approach is to tightly integrate the raw file access with query execution. This is accomplished by enriching the leaf operators of the query plans, e.g., the scan operator, with the ability to access raw data files. Therefore, the scan operator tokenizes and parses a raw file on-the-fly, creates the tuples and passes them to the rest of the query plan. The key difference is that data parsing and processing occur in a pipelined fashion, i.e., the raw file is read from disk in chunks and once a tuple or a group of tuples is produced, the scan immediately passes those tuples upstream. From an engineering point of view, this calls for an integration of the loading code with the scan code.
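
A minimal sketch of such a pipelined leaf operator is shown below. The function and file names are ours, and the CSV handling is deliberately simplistic (no quoting or escaping), so this only illustrates the push-tuples-upstream structure, not PostgresRaw's actual code.

#include <stdio.h>
#include <string.h>

/* Minimal sketch of a pipelined raw-file scan operator: parse one CSV line,
 * hand the tuple to the parent operator immediately (no a priori load). */
#define MAX_ATTRS 16

typedef void (*consume_fn)(char **attrs, int nattrs);   /* parent of the scan node */

void raw_csv_scan(const char *path, consume_fn consume) {
    FILE *f = fopen(path, "r");
    if (!f) return;

    char line[4096];
    while (fgets(line, sizeof(line), f)) {               /* read the file in chunks */
        char *attrs[MAX_ATTRS];
        int nattrs = 0;
        for (char *tok = strtok(line, ",\n"); tok && nattrs < MAX_ATTRS;
             tok = strtok(NULL, ",\n"))
            attrs[nattrs++] = tok;                        /* tokenize the current tuple */
        consume(attrs, nattrs);                           /* push it upstream right away */
    }
    fclose(f);
}

static void print_tuple(char **attrs, int nattrs) {
    for (int i = 0; i < nattrs; i++) printf(i ? ",%s" : "%s", attrs[i]);
    printf("\n");
}

int main(void) {
    raw_csv_scan("table_r.csv", print_tuple);             /* hypothetical file name */
    return 0;
}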

Both techniques require that the proper schema be known a priori; the user needs to declare the schema and mark all tables for raw access. In this chapter we maintain this assumption, as automated schema discovery is a well-studied problem orthogonal to the work presented here. Other than that, both techniques represent a straightforward implementation of a raw query processing engine, as they do not require significant new technology beyond a careful integration of the existing loading procedure with query processing.


Limitations of straightforward approaches. The approaches discussed above are essentially similar to the external files functionality offered by traditional database systems such as Oracle and MySQL. Such solutions are however not viable for extensive and repeated query processing. For example, if data is kept in temporary tables, then every future query needs to perform loading from scratch, which is a major overhead. This is the default setting and usage of external files. Materializing loaded data into persistent tables however, forces a single query to incur all loading costs.

Neither of the mentioned techniques allows the implementation of important DBMS functionality. In particular, given that data is not loaded, there is no mechanism to exploit indexing; traditional DBMS do not support indexes on raw data. Without index support, query plans rely only on full scans, incurring a significant performance degradation compared to a DBMS with loaded data and indexes. In addition, the optimizer cannot exploit any statistics, since statistics in a traditional DBMS are created only after data is loaded. Again, without statistics the query plans are poor, with suboptimal choices of operator order or of the algorithms to use (see Section 3.4). The lack of statistics and indexing means that straw-man techniques do not provide query processing performance comparable to that of a DBMS, and any time gained by skipping data loading is lost after only a few queries.

Even though raw querying features such as external files are important for the users, current implementations are far from the NoDB vision of providing an instant gateway to the data, without losing the performance advantages achieved by DBMS.

7.2.2 The NoDB paradigm

The NoDB vision is to completely shed the loading cost, while achieving the query processing performance of a traditional DBMS. Only then will NoDB systems be useful in practice. Such performance characteristics make the DBMS usable and flexible and enable efficient data exploration: a user may only think about the kind of queries to pose and not about setting up the system in advance and going through all the initialization steps that are necessary today, which is particularly appealing to "non-DBA-savvy" users such as scientists from other domains.

The design we propose in the rest of this chapter takes steps towards identifying and eliminating, or greatly minimizing, the initialization and query processing costs that are unique to raw query processing systems. The target behavior of a NoDB system is visualized in Figure 7.1. It illustrates an important aspect of the NoDB paradigm: even though individual queries may take longer to respond than in a traditional system, the data-to-insight time is reduced by eliminating the initialization step. In addition, performance improves gradually as a function of the number of queries processed, making such a system a viable alternative to traditional DBMS.


Figure 7.1: Reducing data-to-insight time and improving user interaction with NoDB
Figure 7.2: Overheads inherent to query processing over raw data files

Challenges for NoDB systems. The main bottleneck of raw query processing is the access to raw data. Our experiments demonstrate that the costs involved in raw data access severely deteriorate query performance. Figure 7.2 shows the cost breakdown of the straw-man solution described in Section 7.2.1 when querying a 9.2GB CSV file containing 10^7 tuples with 100 integer attributes, corresponding to a single table. As one can see from the graph, only 32% of the total response time goes to actually processing binary data in the database engine, while almost 60% of the total time goes to the overheads inherent to raw query processing, namely parsing, tokenizing, and converting data into a binary form understandable by the database system. Since a NoDB system can only be useful and attractive in practice if it achieves performance levels comparable to those of a modern DBMS, the main challenge for a NoDB system is to minimize the cost of accessing raw data.

7.3 PostgresRaw: From the NoDB idea to practice

From a high level point of view, there are two paths one could follow to minimize the cost of accessing raw data files. The first approach aims at minimizing the cost of raw data access through the careful design of data structures that can speed-up such accesses, while the second one aims at eliminating the need for raw data access altogether by carefully caching previously accessed data.

In this section, we follow both paths and discuss the design of a NoDB prototype, called PostgresRaw, implemented by modifying PostgreSQL, an open source DBMS. We show that the overhead of parsing and tokenizing within a DBMS engine can be minimized via selective and adaptive parsing actions that minimize this overhead by performing parsing only over absolutely necessary fields. In addition, we present a novel raw file indexing structure that adaptively maintains positional information about previously accessed tuples to speed-up future accesses on raw files. Finally, we present caching and exploitation of statistics in


PostgresRaw that further boost query execution performance. The ideas described in this section can be used as guidelines for turning traditional DBMS into NoDB systems.

In the rest of this chapter we assume that raw data is stored in comma-separated value (CSV) files. CSV files are a very common data source, presenting an ideal use case for PostgresRaw. Since data in a CSV file is stored in a character-based encoding such as ASCII, conversion is expensive and fields are variable length, making it a particularly challenging choice for a raw query processing engine. Handling CSV files thus requires a wider combination of techniques than handling e.g. well-defined binary files, which could be similar to database pages.

7.3.1 Minimizing data transformation overhead

When a query submitted to PostgresRaw references relational tables that are not yet loaded, PostgresRaw needs to access the respective raw file(s). PostgresRaw overrides the scan operator with the ability to access raw data files directly, while the remaining query plan, generated by the optimizer, works unchanged.

Every time a query needs to access raw data, PostgresRaw has to perform parsing and tokenization. In a typical CSV structure, each CSV file represents a relational table, each row in the CSV file represents a tuple of a table and each entry in a row represents an attribute value of the tuple. During parsing, PostgresRaw first needs to identify each tuple, or row, in the raw file. This requires scanning each character, one by one, until the end-of-line delimiter is found. Once all tuples have been identified, PostgresRaw must search for the delimiter separating different values in a row (usually a comma for CSV files). Finally, those characters are transformed into their proper binary values depending on the respective attribute type. Having the binary values at hand, PostgresRaw feeds those values into the query plan. Overall, the extra parsing and tokenizing actions represent a significant overhead that is unique to raw query processing (e.g., corresponding to 60% of the total response time as shown in Figure 7.2). A typical DBMS, on the other hand, performs all these steps at loading time and directly reads binary database pages during query processing.

Selective tokenizing. One way to reduce the tokenizing costs is to abort tokenizing tuples as soon as the required attributes for a query have been found. This occurs on a per-tuple basis. For example, if a query needs the 4th and 8th attribute of a given table, PostgresRaw needs to tokenize each tuple of the file only up to the 8th attribute. Given that CSV files are organized on a row-by-row basis, selective tokenizing does not bring any I/O benefits; nonetheless, it significantly reduces the CPU processing cost.
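To make the idea concrete, the following C sketch illustrates selective tokenizing over a single comma-delimited line held in memory; the function and variable names are illustrative assumptions rather than the actual PostgresRaw code. Tokenizing stops as soon as the highest attribute required by the query has been located.

/* A minimal sketch of selective tokenizing, assuming a comma-delimited
 * line already held in memory; names are illustrative, not PostgresRaw code. */
#include <stdio.h>
#include <stddef.h>

/* Record the byte offset at which each field starts, up to and including
 * field number max_attr (0-based). Returns the number of offsets found. */
static size_t tokenize_until(const char *line, char delim,
                             size_t max_attr, size_t *offsets)
{
    size_t field = 0, pos = 0;

    offsets[field++] = 0;                 /* first field starts at offset 0 */
    while (line[pos] != '\0' && line[pos] != '\n' && field <= max_attr) {
        if (line[pos] == delim)
            offsets[field++] = pos + 1;   /* next field starts after the delimiter */
        pos++;
    }
    return field;
}

int main(void)
{
    const char *tuple = "11,22,33,44,55,66,77,88,99,100";
    size_t offsets[8];

    /* A query asking for the 4th and 8th attribute only needs the first
     * 8 fields; the rest of the line is never scanned. */
    size_t n = tokenize_until(tuple, ',', 7, offsets);
    printf("tokenized %zu fields; 8th field starts at offset %zu\n",
           n, offsets[7]);
    return 0;
}

In this setup, a query asking for the 4th and 8th attribute calls tokenize_until with max_attr set to 7 and never scans past the start of the 8th field.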

Selective parsing. In addition to selective tokenizing, PostgresRaw employs selective parsing to further reduce raw file access costs. PostgresRaw needs only to transform to binary the values required for the remaining query plan. Consider again the example of the query requesting the 4th and 8th attribute of a given table. If the query contains a selection on the 4th attribute, PostgresRaw must convert all values of the 4th attribute to binary. However, PostgresRaw with selective parsing delays the binary transformation of the 8th attribute on a per-tuple basis, until it knows that the given tuple qualifies. The latter is important since, as we demonstrate in Figure 7.2, the transformation to binary is a major cost component in PostgresRaw.

Selective tuple formation. To fully capitalize on selective parsing and tokenizing, PostgresRaw also applies selective tuple formation. Tuples are therefore not fully composed but contain only the attributes required for a given query (early projection). In PostgresRaw, tuples are formed only after the select operator, i.e., after knowing which tuples qualify. This also requires careful mapping of the current tuple format to the final expected tuple format.
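Continuing the same illustrative example, the sketch below combines selective parsing with selective tuple formation: the filter attribute (the 4th) is converted for every tuple, the projected attribute (the 8th) is converted only for tuples that qualify, and the output tuple carries nothing but these two attributes. The helper names and the hard-coded offsets are assumptions made for the example, not PostgresRaw internals.

/* A minimal sketch of selective parsing and selective tuple formation. */
#include <stdio.h>
#include <stdlib.h>

struct out_tuple { long attr4; long attr8; };   /* early projection: 2 attrs only */

/* Convert a single field (given its start offset) to binary. */
static long field_to_long(const char *line, size_t offset)
{
    return strtol(line + offset, NULL, 10);
}

/* Returns 1 and fills 'out' if the tuple qualifies (attr4 > threshold). */
static int process_tuple(const char *line, const size_t *offsets,
                         long threshold, struct out_tuple *out)
{
    long a4 = field_to_long(line, offsets[3]);    /* always converted: WHERE attr */
    if (a4 <= threshold)
        return 0;                                 /* attr8 is never converted */
    out->attr4 = a4;
    out->attr8 = field_to_long(line, offsets[7]); /* converted only if qualified */
    return 1;
}

int main(void)
{
    const char *tuple = "11,22,33,44,55,66,77,88,99,100";
    /* offsets of the first 8 fields, as produced by the tokenizer sketch above */
    size_t offsets[8] = {0, 3, 6, 9, 12, 15, 18, 21};
    struct out_tuple t;

    if (process_tuple(tuple, offsets, 40, &t))
        printf("qualified: attr4=%ld attr8=%ld\n", t.attr4, t.attr8);
    return 0;
}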

Overall, selective tokenizing, parsing and tuple formation minimize processing costs, since PostgresRaw parses only the data necessary to compose query results.

7.3.2 Reducing the cost of data roundtrips

Even with selective tokenizing, parsing and tuple formation, the cost of accessing raw data may still be significant. This section introduces an auxiliary structure called a positional map (PM), which forms a core component of PostgresRaw and enables it to compete with a DBMS operating over previously loaded data.

The adaptive positional map is introduced to further reduce parsing and tokenizing costs. The PM maintains low-level metadata information on the structure of the raw data file, which is used to navigate and retrieve raw data faster. This metadata information refers to the positions of attributes in the raw file. For example, if a query needs an attribute X that is not loaded, then PostgresRaw can exploit metadata information (if it exists) that describes the position of X in the raw file and jump directly to the correct position without having to perform expensive tokenizing steps to find X. An example of a positional map is presented in Figure 7.3. After executing a query that accesses attributes a4 and a7, the positional map stores the relative offsets from the beginning of the tuple for each attribute (p4 and p7). Upon accessing new attributes, the positional map is extended to store the offsets of the new attributes (a2 and a5).

Map population. The PM is created on-the-fly during query processing, and it continuously adjusts to queries. Initially, the positional map is empty. As queries arrive, PostgresRaw continuously augments the positional map. The attributes do not necessarily appear in the map in the same order as in the raw file; rather, their order corresponds to the access pattern of previous queries (i.e., the attributes of a single query are stored together). The map is populated during the tokenizing phase, i.e., while tokenizing the raw file for the current query, PostgresRaw adds information to the map. PostgresRaw learns as much information as possible during each query. For instance, it maps not only the attributes requested by the query, but also attributes tokenized along the way; e.g., if a query requires attributes in positions 10 and 15, all positions from 1 to 15 may be kept. To do so, PostgresRaw chooses between an aggressive policy (i.e., store positions for all tokenized attributes) and a conservative policy (i.e., store positions only for the attributes of interest). To minimize storage requirements, PostgresRaw uses the conservative policy as the default.

Figure 7.3: An example of indexing a raw file with a positional map

In general, we expect variable-length attributes in raw format, i.e., the same attribute X appears in different positions for different tuples. The requirement to support variable-length attributes demands the positional map to store positions for every tuple in a table. To minimize the storage requirements, PostgresRaw uses run-length encoding to store the relative positions of attributes (i.e., byte offsets) from the beginning of the tuple. Holding relative positions reduces storage requirements per position and is compatible with how row-store databases process data, i.e., one tuple at a time.

The dynamic nature of the positional map requires a physical organization that is easy to update, but it must also incur a low reading cost, since it is heavily exploited during query execution. To achieve both efficient reads and writes, the PostgresRaw positional map is implemented as a collection of chunks, where data is partitioned vertically (based on tuples) and horizontally (based on attributes). Each chunk is a byte array. It logically consists of multiple rows, and each row stores offsets for a set of accessed attributes (see Figure 7.3). The size of a chunk is predefined to 1 MB, so each chunk fits comfortably in the CPU cache, allowing PostgresRaw to efficiently acquire all information regarding several attributes and tuples within a single access. The map can be extended by adding more chunks either vertically (i.e., by adding positional information about more tuples of already partially indexed attributes) or horizontally (i.e., by adding positional information about currently non-indexed attributes). For example, when new attributes are indexed, a new horizontal chunk is created to store the positions of the new attributes.
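A minimal C sketch of such a chunked layout is given below, assuming fixed-width 16-bit relative offsets and the 1 MB chunk budget mentioned above; all type and field names are illustrative and do not correspond to the actual PostgresRaw definitions.

/* Illustrative sketch of a chunked positional map. */
#include <stdint.h>
#include <stdlib.h>

#define PM_CHUNK_BYTES (1 << 20)            /* 1 MB: fits comfortably in cache */

typedef uint16_t pm_offset_t;               /* assumed: relative offset within a tuple */

struct pm_chunk {
    int first_tuple;                        /* first tuple covered by this chunk */
    int num_tuples;                         /* rows in this chunk */
    int num_attrs;                          /* attributes (columns) in this chunk */
    pm_offset_t *offsets;                   /* num_tuples * num_attrs entries */
};

/* Look up the relative offset of the a-th indexed attribute of tuple t. */
static pm_offset_t pm_get(const struct pm_chunk *c, int t, int a)
{
    return c->offsets[(size_t)(t - c->first_tuple) * c->num_attrs + a];
}

/* Allocate a chunk sized so that it stays within the 1 MB budget. */
static struct pm_chunk *pm_chunk_alloc(int first_tuple, int num_attrs)
{
    struct pm_chunk *c = malloc(sizeof(*c));
    c->first_tuple = first_tuple;
    c->num_attrs   = num_attrs;
    c->num_tuples  = PM_CHUNK_BYTES / (num_attrs * (int)sizeof(pm_offset_t));
    c->offsets     = malloc((size_t)c->num_tuples * num_attrs * sizeof(pm_offset_t));
    return c;
}

int main(void)
{
    struct pm_chunk *c = pm_chunk_alloc(0, 2);   /* e.g., offsets of a4 and a7 */
    c->offsets[0] = 12;                          /* tuple 0: a4 starts 12 bytes in */
    c->offsets[1] = 30;                          /* tuple 0: a7 starts 30 bytes in */
    pm_offset_t p = pm_get(c, 0, 1);
    free(c->offsets);
    free(c);
    return (int)p == 30 ? 0 : 1;
}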



Figure 7.4: Mapping between the attributes and their positions in the positional map

The PM does not mirror the raw file. Instead, it adapts to the workload, keeping in the same chunk attributes that are accessed together during query processing. To navigate the PM during query processing, an additional data structure is kept: a mapper that maps the order of the attributes in a raw file to the corresponding chunks in the map. Figure 7.4 shows the mapper after executing Q1 and Q2 from Figure 7.3 for an example relation with 10 attributes. After Q1 and Q2, attributes a2, a4, a5 and a7 have positions stored in the PM in the corresponding order (3, 1, 4, 2), where 1 means that attribute a4 is stored in the first column of the PM, a2 in the third, etc. A value of −1 in the mapper denotes that the corresponding attribute is not indexed.
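The following sketch reproduces the mapper lookup for the Q1/Q2 example above; the array contents follow the text, while the code itself is illustrative rather than the actual implementation.

/* Illustrative sketch of the mapper from Figure 7.4, assuming 10 attributes. */
#include <stdio.h>

#define NUM_ATTRS 10

int main(void)
{
    /* After Q1 (a4, a7) and Q2 (a2, a5): a4 -> column 1, a7 -> 2, a2 -> 3, a5 -> 4;
     * -1 marks attributes that are not indexed. */
    int mapper[NUM_ATTRS] = { -1, 3, -1, 1, 4, -1, 2, -1, -1, -1 };

    int attr = 5;                                 /* look up attribute a5 (1-based) */
    int pm_column = mapper[attr - 1];

    if (pm_column == -1)
        printf("a%d is not indexed; fall back to tokenizing\n", attr);
    else
        printf("a%d is stored in PM column %d\n", attr, pm_column);
    return 0;
}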

Using the positional map during query execution. The information contained in the positional map can be used to jump to the exact position in the file, or as close to it as possible. For example, if attribute a4 is the 4th attribute of the raw file and the map contains positional information for the 4th and the 5th attribute (as in Figure 7.3), then PostgresRaw does not need to tokenize the 4th attribute; it knows that, for each tuple, attribute a4 consists of the characters that appear between two positions contained in the map. If a query is looking for the 9th attribute of a raw file, while the map contains information for the 7th attribute, PostgresRaw can still use the positional map to jump to the 7th attribute and parse forward until it finds the 9th attribute. This incremental parsing can occur in both directions, so that a query requesting the 10th attribute with a positional map containing the 2nd and the 12th attributes jumps initially to the position of the 12th attribute and tokenizes backwards. To navigate to the attribute of the PM closest to the attribute of interest, PostgresRaw uses the mapper explained in the previous paragraph.

When exploiting the positional map, PostgresRaw first determines all required positions instead of interleaving parsing with search and computation. Pre-fetching and pre-computing all relevant positional information allow a query to optimize its accesses on the map; it brings the benefit of temporal and spatial locality when reading the map, while not disturbing the parsing and tokenizing phases with map accesses and positional computation. All pre-fetched and pre-computed positions are stored in a temporary map with the same structure as the positional map; the difference is that the temporary map contains only the positional information required by the current query, that all positional information has been precomputed (i.e., the temporary map contains absolute as opposed to relative positions), and that it is pre-ordered in the access order of the raw file. For example, a query asking for attributes a4, a5 and a7 will have absolute positions into the raw file stored in that particular order, even though in the positional map a7 is stored before a5. The temporary map is dropped once the current query finishes its parsing and tokenizing phase.
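As an illustration of this pre-computation step, the sketch below turns the relative offsets of one tuple's requested attributes into absolute file positions and re-orders them in file order, which is what the temporary map provides to the parsing phase; the concrete offsets and names are assumptions made for the example.

/* Illustrative sketch of building the temporary map for one tuple. */
#include <stdio.h>
#include <stdlib.h>

struct abs_pos { int attr; long position; };

static int by_position(const void *a, const void *b)
{
    const struct abs_pos *x = a, *y = b;
    return (x->position > y->position) - (x->position < y->position);
}

int main(void)
{
    long tuple_start = 4096;                 /* absolute offset of the tuple */
    /* relative offsets of the requested attributes, in PM (access) order:
     * the query asks for a4, a5 and a7, but a7 was indexed before a5 */
    struct abs_pos tmp[3] = { {4, 12}, {7, 30}, {5, 21} };

    for (int i = 0; i < 3; i++)
        tmp[i].position += tuple_start;      /* relative -> absolute */
    qsort(tmp, 3, sizeof(tmp[0]), by_position);   /* re-order in file order */

    for (int i = 0; i < 3; i++)
        printf("a%d at byte %ld\n", tmp[i].attr, tmp[i].position);
    return 0;
}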


Maintaining the positional map. The positional map is an auxiliary structure that may be dropped fully or partly at any time without any loss of critical information; the next query simply starts re-building the map from scratch. PostgresRaw assigns a storage threshold for the size of the positional map such that the map fits comfortably in memory. Once the storage threshold is reached, PostgresRaw drops parts of the map to ensure it always stays within the threshold limits. We use an LRU (Least Recently Used) policy to maintain the map, i.e., the chunks of the attributes that were used least recently are dropped, and new ones containing the new attributes of interest are created.

Instead of dropping parts of the positional map, PostgresRaw could equally offload parts of the positional map from memory to disk. Positional information that is about to be evicted from the map can be stored on disk using its original storage format. Thus, we can still regulate the size of the positional map while maintaining useful positional information that is still relevant but not currently used by the queries. Accessing parts of the positional map from disk increases the I/O cost, yet it helps to avoid repeating parsing and tokenizing steps for workload patterns we have already examined. As we did not notice substantial improvement in our experiments, we do not follow this policy, and instead employ the LRU strategy for dropping existing chunks from memory.
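The eviction decision itself can be sketched as follows, assuming each chunk records the sequence number of the last query that touched it; the structure and names are illustrative, not the PostgresRaw code.

/* Illustrative sketch of LRU eviction of positional-map chunks. */
#include <stdio.h>
#include <stddef.h>

#define MAX_CHUNKS 4

struct chunk_meta {
    int    in_use;
    size_t bytes;
    long   last_used;        /* query sequence number of last access */
};

static void evict_lru(struct chunk_meta *chunks, size_t *total, size_t budget)
{
    while (*total > budget) {
        int victim = -1;
        for (int i = 0; i < MAX_CHUNKS; i++)
            if (chunks[i].in_use &&
                (victim == -1 || chunks[i].last_used < chunks[victim].last_used))
                victim = i;
        if (victim == -1)
            break;
        *total -= chunks[victim].bytes;       /* drop the LRU chunk */
        chunks[victim].in_use = 0;
    }
}

int main(void)
{
    struct chunk_meta chunks[MAX_CHUNKS] = {
        {1, 1 << 20, 3}, {1, 1 << 20, 9}, {1, 1 << 20, 1}, {1, 1 << 20, 7}
    };
    size_t total = 4u << 20;

    evict_lru(chunks, &total, 2u << 20);      /* storage threshold: 2 MB */
    for (int i = 0; i < MAX_CHUNKS; i++)
        printf("chunk %d: %s\n", i, chunks[i].in_use ? "kept" : "dropped");
    return 0;
}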

The positional map is an adaptive data structure that continuously indexes positions based on the most recent queries. This includes requested attributes as well as patterns, or combinations, in which those attributes are used. As the workload evolves, some attributes may no longer be relevant and are dropped by the LRU policy. Similarly, combinations of attributes used in the same query, which are also stored together, may be dropped to give space for storing new combinations. Populating the map with new combinations is decided during pre-fetching, depending on where the requested attributes are located in the current map. The distance that triggers indexing of a new attribute combination is a PostgresRaw parameter. In our implementation, the default setting is that if all requested attributes for a query belong to different chunks, then the new combination is indexed containing all the attributes.

7.3.3 Avoiding raw data access

The PM alleviates the parsing and tokenizing overhead of accessing a raw data file. Nevertheless, the data conversion (from ASCII to binary) remains the dominating cost (see Figure 7.2). An alternative (complementary) direction targeted at the conversion cost is to avoid raw file access altogether, through a cache that holds previously accessed data.

The cache is a temporary data structure that stores previously accessed attributes. If an attribute is requested by future queries, PostgresRaw reads it directly from the cache. The cache holds binary data and is populated on-the-fly during query processing. Once a disk block of the raw file has been processed during a scan, PostgresRaw immediately caches the needed binary data. To minimize the conversion costs and to maintain the adaptive behavior of PostgresRaw, caching does not force additional data to be converted, i.e., only the attributes requested by the current query are transformed to binary.

The cache follows the format of the positional map, but instead of pointers to attributes of interest, it stores actual binary data. This allows the query processing engine to seamlessly exploit both the cache and the positional map in the same query plan. In practice, the cache stores data as Datum (a universal data type used in PostgreSQL). PostgresRaw processes data using the PostgreSQL query engine. Thus, storing attributes using the same internal binary representation that is used by the query engine allows for a simple integration.

The size of the cache is a parameter that can be tuned depending on the available resources. PostgresRaw follows the LRU policy to drop and populate the cache. Nevertheless, NoDB systems should differentiate between string and other attribute types depending on the character-encoding scheme. For instance, for ASCII data, numerical attributes are significantly more expensive to convert to binary. Thus, the PostgresRaw cache always gives priority to attributes that are more costly to convert, which in our case are numeric attributes. Overall, the PostgresRaw cache can be seen as a placeholder for adaptively loaded data. Compared to caching in traditional systems, the PostgresRaw cache is mainly used to minimize conversion costs. Other forms of caching, such as query plan caching or query result caching, can be used on top of PostgresRaw, as they are orthogonal to the features and operation of PostgresRaw.
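One possible way to combine the LRU policy with the conversion-cost priority described above is sketched below: among cached columns, string columns (cheap to re-convert) are evicted before numeric ones, with LRU breaking ties. This is an assumption about how the two criteria could be combined, shown for illustration only.

/* Illustrative sketch of a conversion-cost-aware eviction decision. */
#include <stdio.h>

struct cached_column {
    int  attr;           /* attribute number in the raw file */
    int  is_numeric;     /* 1: costly to re-convert, keep if possible */
    long last_used;      /* query sequence number of last access */
};

/* Pick an eviction victim: prefer string columns, break ties by LRU. */
static int pick_victim(const struct cached_column *cols, int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++) {
        if (cols[i].is_numeric != cols[victim].is_numeric) {
            if (!cols[i].is_numeric)          /* strings are cheaper to rebuild */
                victim = i;
        } else if (cols[i].last_used < cols[victim].last_used) {
            victim = i;
        }
    }
    return victim;
}

int main(void)
{
    struct cached_column cols[3] = {
        {2, 1, 5},   /* numeric, recently used */
        {4, 0, 6},   /* string, recently used  */
        {7, 1, 1},   /* numeric, least recently used */
    };
    printf("evict attribute a%d\n", cols[pick_victim(cols, 3)].attr);
    return 0;
}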

7.3.4 Improving quality of plans with incremental statistics

As discussed in Section 3.4, optimizers rely on statistics to create good query plans. Creating statistics in traditional DBMS, however, is only possible after the data is loaded. Since the data is not loaded in PostgresRaw, statistics do not exist prior to query execution. This can be problematic, since the benefit that PostgresRaw achieves from skipping data loading could be overshadowed by suboptimal plan execution.

To tackle the problem of suboptimal plans, we extend the PostgresRaw scan operator to create statistics on-the-fly. PostgresRaw invokes the native statistics routines of PostgreSQL, providing it with a sample of data as input to the random sampling procedure [55]. Statistics are then stored and are exploited in the same way as in traditional DBMS. In order to minimize the overhead of creating statistics during query processing, PostgresRaw creates statistics only on requested attributes, i.e., only on attributes that PostgresRaw needs to read and which are required by at least the current query. As with other features in PostgresRaw, statistics are generated in an adaptive way; as queries request more attributes of a raw file, statistics are incrementally augmented to represent bigger subsets of the data.
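For illustration, the sketch below gathers a fixed-size random sample of one attribute's values while the scan proceeds, using reservoir sampling (Vitter's Algorithm R); PostgresRaw hands such a sample to PostgreSQL's native statistics routines, but the sampling code shown here is an illustrative stand-in rather than the actual procedure from [55].

/* Illustrative sketch of sampling one attribute during a raw-file scan. */
#include <stdio.h>
#include <stdlib.h>

#define SAMPLE_SIZE 100

struct sample {
    long values[SAMPLE_SIZE];
    long seen;                       /* tuples seen so far */
};

/* Offer one converted value to the sample (reservoir sampling). */
static void sample_offer(struct sample *s, long value)
{
    if (s->seen < SAMPLE_SIZE) {
        s->values[s->seen] = value;
    } else {
        long j = rand() % (s->seen + 1);
        if (j < SAMPLE_SIZE)
            s->values[j] = value;    /* replace a random slot */
    }
    s->seen++;
}

int main(void)
{
    struct sample s = { {0}, 0 };
    for (long v = 0; v < 100000; v++)   /* stand-in for values read during a scan */
        sample_offer(&s, v);
    printf("sampled %d of %ld values; first slot = %ld\n",
           SAMPLE_SIZE, s.seen, s.values[0]);
    return 0;
}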

On-the-fly creation of statistics brings a small overhead to the PostgresRaw scan operator, while allowing PostgresRaw to implement high-quality query execution plans.


7.3.5 Putting it all together

After discussing each individual data structure that improves NoDB performance, this section presents the overall query execution workflow. As already stated, PostgresRaw overloads the scan operator of PostgreSQL, adding raw query processing capabilities. Therefore, the major changes in the query execution workflow happen at the scan level. Upon receiving an SQL query, PostgresRaw parses and optimizes it the same way PostgreSQL does, giving the physical plan as input to the executor. The executor instantiates the operators of the plan, which has scan access paths at its leaves. For each relation (i.e., for each scan), PostgresRaw consults catalog metadata, which in addition to the relation schema stores links to the raw data files, in order to direct each scan to the proper raw data file.

The pseudocode of the scan operator is presented in Algorithm 1. The scan operator first decides on the attributes of interest for the given query, making a clear distinction between the attributes from the WHERE clause and the remaining attributes of interest (i.e., the projected attributes). This distinction is necessary to benefit from selective conversion, which lazily transforms to binary only the tuples that pass the qualifiers. For each attribute of interest, PostgresRaw first checks whether it is stored in the cache, and returns it directly upon a hit. Upon a miss, PostgresRaw checks whether the positional map has the necessary information for the attribute of interest and, should that be the case, exploits it to access the raw data file. If the given attribute is not in the positional map, PostgresRaw uses the indexed attribute closest to it and parses from that position either forward or backward. If the positional map is empty, PostgresRaw accesses the raw data file directly.

Upon retrieving new attributes of interest, PostgresRaw stores them in the cache and keeps their positions in the positional map. Since both data structures have limited space, if the space is exhausted, PostgresRaw uses the LRU mechanism to discard the least recently used attributes and replace them with the new ones. Furthermore, if statistics collection is enabled, PostgresRaw invokes the statistics collection routines over the attributes of interest given as input.

7.4 Experimental Evaluation

In this section, we present an experimental analysis of PostgresRaw. Since PostgresRaw is implemented on top of PostgreSQL, a direct comparison between the two systems is particularly important to understand the trade-offs of raw query processing. We study the performance using both fine-tuned micro-benchmarks and real-world benchmarks. Throughout the experiments, PostgresRaw demonstrates a clear self-organizing behavior; by exploiting caching, indexing and on-the-fly statistics, it outperforms existing raw query processing proposals while at the same time providing per-query performance comparable to that of traditional database systems where all loading costs were paid up front.


Algorithm 1: PostgresRaw scan operator
Input: Q: query
Output: R: result tuples

// Initialization
interesting_attrs = getInterestingAttributes(Q)
qual_attrs = getFilterAttributes(Q)
while !EOF do
    if qual_attrs == NULL then                  // no WHERE clause
        for attr in interesting_attrs do
            cache = getCache(attr)
            if cache != NULL then               // data can be serviced from the cache directly
                bin_val = getFromCache(attr)
            else
                position = getPM(attr)
                if position != NULL then        // PM used to navigate through the file
                    val = getWithPM(attr, position)
                else
                    val = getFromFile(attr)
                bin_val = convertToBinary(val)
            values.append(bin_val)
            // update the cache and PM (the LRU eviction logic is encapsulated inside)
            addAttrToCache(attr)
            addAttrToPM(attr)
    else                                        // selective tokenizing
        // process qualifiers first,
        // then fetch the remaining interesting attributes for qualified tuples as before
        ...
collectStatistics(interesting_attrs)
return R

All experiments are conducted on a Sun X4140 server with 2x Quad-Core AMD Opteron processors (64 bit), 2.7 GHz, 512 KB L1 cache, 2 MB L2 cache and 6 MB L3 cache, 32 GB RAM, 4 x 250 GB 10000 RPM SATA disks (RAID-0), running Ubuntu 9.04. Throughout the experiments, PostgresRaw and the other DBMS use full table scans as access paths, i.e., we do not create any additional physical design structures prior to executing the queries. We choose this setting since we focus on data exploration use cases in which there is no a priori knowledge about the workload. Without workload knowledge, proper physical design tuning is inherently hard.


Figure 7.5: Increasing the number of pointers in PM
Figure 7.6: Scalability of PM

7.4.1 Micro-benchmarks

In the first part of the experimental analysis, we study the behavior of PostgresRaw in isolation, i.e., we study the effect of the design choices, the positional map and caching techniques. For this part, we use micro-benchmarks in order to perform a proper sensitivity analysis of parameters that affect performance. The experiments presented in this section use a raw data file of 11 GB, containing 7.5 × 10^6 tuples. Each tuple comprises 150 attributes with integers distributed randomly in the range [0, 10^9).

The impact of the positional map

The first experiment investigates the impact of the positional map. In particular, we investigate how the behavior of PostgresRaw is affected as the map is populated dynamically with positional information based on the workload.

The setup of the experiment is as follows. We create a random set of 20 simple select-project queries. We refer to the queries as random because they may ask for any attribute of the raw file. Each query asks for 10 random attributes of the raw file. Selectivity is 100% as there is no WHERE clause. We measure the average time PostgresRaw needs to process all queries with a varying storage capacity for the positional map, from 14.3 MB up to 2.1 GB.

The results are shown in Figure 7.5. The impact of the positional map is significant, as it eventually improves response times by more than a factor of 2. In addition, performance improves rapidly without requiring the maximum capacity. With a little less than 1/4 of the pointers (260 million positions) collected, execution time is already within 15% of the fully indexed case. After 3/4 of the pointers are collected, response time remains constant even though the workload is random. Therefore, PostgresRaw does not need to maintain positional information for the entire raw file, thereby saving significant storage and access costs without compromising performance.



Figure 7.7: The effect of the positional map and caching

The next experiment investigates the scalability of PostgresRaw when exploiting the positional map, shown in Figure 7.6. The setup is the same as in the previous experiment with the difference that this time the file size is increased gradually from 2 GB to 92 GB. We use two ways to increase the file size: first, by appending more rows to the file, and second, by adding more attributes to the file. In the first case, queries remain the same as before. In the second case, we incrementally add more projection attributes to queries as we increase the file size. We ensure that for every case we compare, queries perform similar I/O and computation actions. For both cases we observe linear scalability; PostgresRaw exploits the positional map to scale nicely as raw files grow both vertically and horizontally. For this experiment, we set unlimited storage space for the positional map, and do not store positions for every tuple in the file but only for positions accessed by the queries (i.e., PostgresRaw applies the conservative policy). In this experiment, the size of the positional map varies from 350 MB to 13.9 GB.

Positional map and caching

The following experiment investigates the behavior of PostgresRaw when exploiting both the positional map and caching, or only one of them. The setup is as follows. We create 50 queries, where each query randomly projects 5 columns of the raw file among the first 50 columns of the file. As in previous experiments, there is no WHERE clause; selectivity is 100%. We study four variations of PostgresRaw. The first variation, called Baseline, does not use positional maps or caching, representing the behavior of PostgresRaw as if it were a straw-man external files implementation. The second variation, called PostgresRaw PM, uses only the positional map. The third variation, called PostgresRaw C, uses only the cache and an additional minimal map maintaining positional information only about the ends of lines in the raw file. The final version, called PostgresRaw PM+C, combines all previous techniques. Again, in this experiment, we do not set a limit on the storage space for the positional map and the cache; however, their combined size always remains below 1.4 GB.


The response time for each query in the sequence is plotted in Figure 7.7. The first query is the most expensive for all PostgresRaw variations. Given that there is no a priori knowledge to exploit, all PostgresRaw variations need to touch the raw file to extract the needed data; they all show similar performance. Performance improves drastically from the second query onward. When the cache and the positional map are enabled, the second query is 82 − 88% faster than the first. The Baseline variation improves slightly from the second query onward, mainly due to file system caching, and from there on it provides constant performance, which is not competitive with the other variations, as every query needs to scan the raw file without any assistance from indexing and caching.

When only the positional map is used, the first few queries collect metadata information, improving future attribute retrievals by minimizing the parsing and tokenizing costs. The rest of the queries benefit from this information, demonstrating improved and stable performance. The positional map allows PostgresRaw to navigate as close as possible to the required attributes, which is important particularly when only a small subset of the attributes is required in a tuple.

When only caching is used, there is a noticeable difference in performance. Caching achieves optimal performance only when all the requested attributes happen to be cached. Nevertheless, if some attributes are missing from the cache, PostgresRaw needs to parse the raw file, which significantly increases the overall execution time (3 − 5 times). Figure 7.7 shows that the combined effects of the positional map and caching achieve the best performance; PostgresRaw PM+C substantially outperforms all other approaches across the whole query sequence (e.g., by a factor of 7 compared to the Baseline).

Adapting to workload changes

In this experiment, we demonstrate that PostgresRaw progressively and transparently adapts to changes in the workload. The setup of the experiment is as follows. We use the same raw file as in the previous experiments, but the query sequence is expanded to 250 queries. As before, queries are select-project queries. Each query refers to 5 random attributes of the file and there is no WHERE clause. The query sequence is divided into 5 epochs, and in each epoch we execute 50 different queries. All queries within the same epoch focus on a given part of the raw file. The maximum size of the cache is limited to 2.8 GB, while the positional map does not exceed 715 MB.

Figure 7.8 depicts the results, separating each epoch with vertical lines at positions 50, 100, ..., 200. The graph plots both the response time for each query in the sequence and how the size of the PostgresRaw cache evolves as queries are evaluated.

During the first epoch, queries refer only to columns 1 − 50. The cache is initially empty and so is the positional map. After executing 32 queries, all data in this part of the file is cached; the cache does not increase any more and performance remains stable.



Figure 7.8: Adapting to changes in the workload

In the second epoch, queries retrieve data between columns 51 − 100. The size of the cache increases in order to accommodate the new columns. Performance fluctuates, as some queries can fully exploit the cache and have faster response times, while others need to go back to the raw file and pay the extra cost. After the second epoch, the cache is full and all queries enjoy good performance. During the third epoch, we launch random sets of queries requesting columns in the set 1 − 100, i.e., in the same regions used in the previous two epochs. Since PostgresRaw has built a complete cache of this region, no I/O or parsing is required and the system achieves optimal performance. In the fourth epoch, queries ask for columns 75 − 125, i.e., half of the queries hit previously explored areas and half of the queries hit new regions. PostgresRaw implements an LRU replacement policy in its cache and drops previously cached data to accommodate the new requests. During the last epoch, the workload again slightly shifts, to the region of columns 85 − 135. The effect is that PostgresRaw again needs to replace parts of its cache, while parts of the requested data have to be retrieved from the raw file by exploiting the positional map.

Overall, we observe that PostgresRaw gracefully adapts to the changes of the workload. In every epoch, PostgresRaw quickly adapts, adjusting and populating its cache and the positional map, automatically stabilizing at good performance levels. Additionally, the maintenance of the cache and the positional map does not add significant overhead to query execution.

PostgresRaw vs other DBMS

a) Cumulative workload time:

In our next experiment we demonstrate the behavior of PostgresRaw against state-of-the-art DBMS. We compare MySQL (5.5.13), DBMS X (a commercial system) and PostgreSQL against PostgresRaw with positional map and caching enabled. MySQL and DBMS X offer “external files” functionality, which enables directly querying raw files as if they were database tables.



Figure 7.9: Comparing the performance of PostgresRaw with other DBMS

Therefore, for MySQL and DBMS X we include two sets of performance results: (a) using the external files functionality, and (b) using previously loaded data. For queries over loaded data we also report the time required to load the data, as our goal is to show the overall data-to-insight time.

For the first experiment, we study the cumulative time needed to run a sequence of 20 queries representing a scenario of data exploration. Each query accesses 10 attributes chosen at random. Selectivity of each query is 100%, i.e., there is no WHERE clause. For half of the queries PostgresRaw has to access the file to extract at least one attribute, while for the rest of the queries it benefits from the cache.

Figure 7.9 shows the results. PostgresRaw has the shortest cumulative workload time, i.e. its time-to-insight is the fastest compared to other systems. It is competitive with DBMS X and PostgreSQL for this sequence of queries. External files in MySQL (CSV Engine) and DBMS X are significantly slower than querying over loaded data or PostgresRaw, since each query repeatedly scans the entire file. Conventional wisdom indicates that the overhead inherent to raw data querying is problematic. This is indeed the case for straightforward techniques such as external files.

Our results show, however, that raw data access does not have to be a bottleneck if we apply more advanced techniques to amortize the overhead across a sequence of queries. Compared to PostgreSQL, PostgresRaw shows a significant advantage (i.e., 47.21% savings compared to PostgreSQL when considering the total end-to-end time). This implies that the time-to-insight is cut in half for the workload comprising 20 queries. Moreover, PostgresRaw has already answered all 20 queries while the other database systems are still loading the data.



Figure 7.10: PostgresRaw performance compared to other DBMS as a function of selectivity decrease: a) selectivity 100-1%, b) selectivity 1%

PostgresRaw matches or even improves per-query performance compared to PostgreSQL when the needed attributes are already in the cache, while it suffers a performance hit when it has to access the file, resulting in an overall 39% longer execution time compared to PostgreSQL when considering queries only. Per-query performance of PostgresRaw is within a 2× range of PostgreSQL, being up to 2× slower when PostgresRaw accesses the raw data file and up to 2× faster when exploiting the cache.

b) Per-query performance comparison:

In addition to demonstrating the cumulative workload time, in the following experiments we report individual query response times as we vary the selectivity and projectivity. All queries have only one predicate in the WHERE clause and then project and run aggregations over the remaining attributes. We do not include external files in this comparison, as their respective response times are over an order of magnitude slower. For MySQL, DBMS X and PostgreSQL, queries are submitted over previously loaded data, but the loading time is not taken into account here; buffer caches are cold, however. Selectivity and projectivity are incrementally decreased during the query sequence from 100% to 1%. This sequence simulates a typical data exploration use case, where the user first asks a generic query to see whether the file contains any area of interest, upon which (s)he issues more specific queries, narrowing down the area of interest with each subsequent query.

Figure 7.10a shows the results for the selectivity decrease from 100% to 1% with projectivity constant at 100% (i.e., max aggregations are over all attributes). For clarity, we show the results for 1% selectivity for the fastest DBMS separately in Figure 7.10b. Similarly, Figure 7.11a depicts the performance with constant selectivity (100%) while projectivity decreases from 100% to 1%, with the 1% case shown separately in Figure 7.11b. For PostgresRaw we show two bars: one employing the positional map only (PostgresRaw PM), and one with both the positional map and the cache enabled (PostgresRaw PM+C).



Figure 7.11: PostgresRaw performance compared to other DBMS as a function of projectivity decrease: a) projectivity 100-1%, b) projectivity 1%

The first query is similar in both graphs; selectivity is 100% and projectivity is 100%. This is the worst possible query for PostgresRaw; with an empty map and cache, it forces PostgresRaw to parse and tokenize the complete raw file. PostgresRaw with PM+C is, however, merely 70% slower than PostgreSQL in the first query, and matches the performance of PostgreSQL for the remaining queries. PostgresRaw with PM only, on the other hand, suffers from raw data access, being up to 2× slower than PostgreSQL in all of the experiments.

Performance of PostgreSQL and DBMS X is comparable in all the experiments, hence we exclude DBMS X from the remaining experiments. Similarly, since performance of MySQL is significantly worse compared to PostgreSQL, we use only PostgreSQL for comparison purposes throughout the remainder of this section.

For all systems, as selectivity and projectivity decrease, performance improves since less computation is needed. PostgresRaw with PM+C benefits from cache exploitation, since the required data sits in the cache. This is clearly shown in Figure 7.10b and Figure 7.11b, where PostgresRaw showcases a clear benefit over PostgreSQL. PostgresRaw with PM, on the other hand, improves performance when selectivity and projectivity decrease because, in addition to computation costs, it also decreases parsing and tokenizing costs via selective parsing and tokenizing actions. Low selectivity and projectivity drastically reduce the query execution time in PostgresRaw, making it competitive with state-of-the-art DBMS without requiring data loading.

This use case is, however, amenable to the NoDB style of processing, since data locality is heavily exploited. We chose this sequence because in real-life use cases we notice the same access pattern, in which scientists first ask vague queries over entire files to learn whether the files contain any useful information prior to doing a more thorough analysis. This step is often needed because scientists are flooded with machine-generated data sets containing noise or outliers coming from telescopes, sensors, etc. Hence, doing a quick "quality assurance check" is often a prerequisite for a more thorough analysis in which returned answers provide guidance for query refinement [162]. For instance, a scientist receiving a new file from the Large Synoptic Survey Telescope (LSST) [173] first wants to quickly examine whether the telescope caught anything unusual or interesting, e.g., the existence of an interesting quasar galaxy in the file. Should that be the case, (s)he further examines the properties of the quasar, searches for stars in this region, energy emission levels, etc., delving deeper and deeper into the data set with each subsequent query.

Figure 7.12: PostgresRaw performance compared to other DBMS as a function of: a) selectivity and b) projectivity increase

The access pattern of a workload, however, does not have to be amenable to a NoDB system, in which case the auxiliary structures cannot amortize the overhead of raw data access. Such a scenario is depicted in Figure 7.12. For this use case, we run exactly the opposite query sequence, in which only 1% selectivity (projectivity) is accessed in the first query, then 20%, 40%, 60%, 80% and finally 100% in the subsequent queries. We would like to point out that this use case is rather odd in real life, since it mimics behavior where a very specific query is posed first, expanding the region of interest with each subsequent query. For this use case, PostgresRaw can never fully benefit from the cache and PM, since it needs to parse more attributes (tuples) to produce a full answer for each subsequent query.

The biggest discrepancy between PostgresRaw and PostgreSQL is seen in the first query (Q6), for which PostgresRaw is penalized with a 5× slowdown when projectivity is 100% and selectivity 1% (see Figure 7.12a), since PostgresRaw needs to parse the entire file in search of tuples that qualify. Hence this query pays the biggest penalty. After Q6, however, PostgresRaw with PM+C matches the PostgreSQL performance.

The performance penalty is further demonstrated in Figure 7.12b. Since the default PostgresRaw policy is conservative, meaning that only attributes of interest are stored, PostgresRaw never fully matches the performance of PostgreSQL (except for the last query), since it performs raw access for each query in search of additional missing attributes. In Figure 7.12a this is not the case, since the positional map can be exploited to jump directly to the attributes of interest.


Figure 7.13: PostgreSQL vs. PostgresRaw when running two TPC-H queries that access most tables
Figure 7.14: Performance comparison between PostgreSQL and PostgresRaw when running TPC-H queries

Although this use case is not particularly amenable to PostgresRaw, one can still notice the performance improvement as a function of the number of issued queries, i.e., the more queries the user asks, the closer performance gets to that of a traditional DBMS (over loaded data).

7.4.2 TPC-H Workload

In the following experiment, we compare the behavior of PostgresRaw against PostgreSQL, using the TPC-H decision support benchmark [240] with scale factor 10 (corresponding to 10 GB of raw data). Similar to the previous experiments, we use two variations of PostgresRaw. The first one has the positional map enabled but caching disabled (PostgresRaw PM), while the second version has both the positional map and caching enabled (PostgresRaw PM+C). In this experiment, we allow unlimited storage space for the positional map and the cache. In TPC-H, tuples have few attributes and each attribute has a narrow width, which limits the effectiveness of the positional map. PostgreSQL and PostgresRaw use the same query plans (full table scans without index support). Unlike PostgreSQL, which accesses data from the heap file(s), PostgresRaw accesses the raw data files using the positional map and the caching structure.

Figure 7.13 shows the execution time for Queries 10 and 14 of TPC-H. Query 10 has a join over 4 tables (Customer, Orders, Lineitem, Nation), which requires reading data from four separate files, while Query 14 touches two tables (Orders and Lineitem). Tables Orders and Lineitem are the largest tables in TPC-H. In all cases, the systems are cold. For PostgreSQL, data must be loaded before queries can be submitted. PostgresRaw does not require any a priori loading, so queries can be submitted directly. PostgresRaw PM+C is slightly slower than PostgresRaw PM due to the overhead of creating and populating the cache. On the other hand, investing time in populating the cache can help when accessing the same attributes multiple times (which was not the case for this sequence). The cache is quite beneficial especially for data types such as dates and numeric types that come with a high conversion cost.



Figure 7.15: PostgresRaw vs PostgreSQL over a scientific data set

Now that PostgreSQL and PostgresRaw are "warm", we submit a larger subset of the TPC-H queries.² Figure 7.14 shows the results. PostgresRaw with PM is always slower than PostgreSQL: 3× slower when running Q6 and approximately 25% slower in Q1. When the cache is enabled, however, PostgresRaw PM+C is faster than PostgreSQL in most of the queries, even though PostgreSQL initially spent 726 seconds loading data.

7.4.3 Querying scientific data

In the following experiment, we compare the behavior of PostgresRaw against PostgreSQL when simulating a data exploration use case over a scientific data set. For the experiment we use the NREF benchmark [255], comprising 6 tables that together occupy 13 GB. The NREF database provides a collection of protein sequence data from several genome sequencing projects. This database is used to assist functional identification of proteins, ontology development of protein names and detection of annotation errors. Since it comprises source attribution, it is frequently the database of choice for sequence analysis tasks. Each table comprises a combination of numeric and string attributes, making it an ideal use case for testing PostgresRaw. The workload consists of 22 queries, ranging from single-table accesses to four-table joins.

Figure 7.15a shows the cumulative workload time of PostgresRaw against PostgreSQL with the loading time included. The total end-to-end workload response time of PostgreSQL is 38% longer compared to PostgresRaw (1417 sec vs. 1028 sec). Furthermore, PostgresRaw has already answered 20 queries while PostgreSQL is still loading the data. When considering queries only, the total response time of PostgresRaw is 35% higher compared to PostgreSQL.

² The remaining queries are not shown because their performance is either very poor in conventional PostgreSQL, or they rely on functionality not implemented in the PostgresRaw prototype, such as views.


The execution time of individual queries is reported in Figure 7.15b. The first query of PostgresRaw is 45% slower compared to PostgreSQL due to the overhead of building the positional map and creating the cache. In the second and the third query, PostgresRaw exploits the positional map only, being 3.8× slower in the first case and 2× slower in the second case. From Q4 to Q10, PostgresRaw enjoys good performance as it benefits from the cache, being up to 2.6× faster compared with PostgreSQL. Starting from Q11, the workload shifts and includes joins. Q11 is a two-table join, with one previously accessed table and one completely new table. PostgresRaw is 2× slower than PostgreSQL for this query. After this query, PostgresRaw benefits from the positional map for Q12. Q13 is again a change in access pattern, adding another table to the workload. PostgresRaw is 3× slower. After this query, PostgresRaw enjoys performance comparable to that of PostgreSQL, until the last query (Q22), which includes the last untouched table. For this query, the performance hit of PostgresRaw is nearly a factor of 4. In addition to the positional map and cache population, the degradation in this case is further attributed to a suboptimal plan order chosen by the optimizer.

Summary. Overall, PostgresRaw clearly demonstrates its adaptivity. It learns information about newly accessed regions of interest, building the positional map and caches from which future queries benefit. From the experiments above one can see that the sweet spot between raw query processing and a priori data loading followed by subsequent querying is not fixed, as it varies depending on the data characteristics and the workload sequence. Despite that, PostgresRaw demonstrates that it is feasible to amortize the overheads inherent to raw data querying over a sequence of queries, making raw query processing a viable approach for data exploration scenarios.

7.4.4 Statistics in PostgresRaw

In our final experiment, we demonstrate the benefit of on-the-fly statistics collection for PostgresRaw. The exact setup is as follows. We use 4 instances of TPC-H Query 1 (SF10), generated by the TPC-H query generator. We compare two versions of PostgresRaw. The first one generates statistics on-the-fly in an adaptive way, while the second one does not generate or exploit statistics at all.

Figure 7.16 shows the response times when running all 4 queries. The first query uses the same plan in both versions of PostgresRaw and is used to initialize the auxiliary structures (i.e., build the positional map and the cache, and collect statistics). Collecting statistics adds an overhead of 12 seconds (which translates to 1.1% overhead) to the execution time of the first query. PostgresRaw analyzes and creates statistics only for the attributes required by the current query. In the PostgresRaw version with statistics support, queries run 4× faster compared to the version without statistics.

By examining the query plans, we notice that the optimizer selects a different set of operators in PostgresRaw with statistics, which explains the improvement in performance.



Figure 7.16: Execution time in PostgresRaw with statistics collection support

In PostgresRaw without statistics, the optimizer opts for a full table scan over the Lineitem table, followed by a sort that pre-sorts tuples in the order required by the group-by operation, upon which a groupaggregate performs the final aggregation, producing 4 tuples as output. PostgresRaw with statistics, on the other hand, opts for a hashaggregate directly after the full table scan, performing an early aggregation, and then only sorts the 4 resulting tuples, unlike PostgresRaw without statistics, which performs a full sort over 58M tuples.

Overall, we see that generating statistics on-the-fly adds only a small overhead, while it can significantly improve query plan selection.

7.5 Trade-offs and opportunities of raw query processing

Raw data querying, although desirable in theory, is thought to be prohibitive in practice. Executing queries directly over raw data files incurs a significant overhead on the execution path when compared to query execution over tables with previously loaded data. Nonetheless, the work presented in this chapter demonstrates that auxiliary structures can reduce the time to access raw data files and amortize the overhead across a sequence of queries. The result is a raw query processing engine that is competitive with a DBMS under certain workloads, but without requiring data to be loaded in advance, hence providing an instant gateway to the data with zero preparation overhead.

Current DBMS are best suited to manage data that is loaded only once, or rarely in an incremental fashion, with well-known and rarely changing workloads. DBMS require physical design steps for best performance, such as creating indexes or partitions, which are time-consuming tasks, as seen in Chapter 4. Raw query processing engines, however, are more suited for users that need to quickly explore data without having to load entire data sets. Such users should be willing to pay an extra penalty during the early queries, as long as they do not need to create data loading scripts and tune the system.

Raw query processing engines are also useful for large datasets where users need to analyze small fractions of the data. Such scenarios are increasingly common in scientific disciplines, where users are overwhelmed with machine-generated data and spend days preparing data only to discard it shortly after, upon realizing that it does not contain anything useful or is a product of "noisy" activity from a device [8, 111, 162, 193]. The work presented in this chapter is particularly attractive for scientific domains, in which a lack of proper tools for such analysis is evident [111, 162]. The ideas presented in this chapter could thus be thought of as a step toward, as Jim Gray stated [111], putting the scientist back in control of his data, by unlocking the data and facilitating its analysis.

Furthermore, loading data into a DBMS creates a second copy of the data, which for the PBs and TBs of data routinely gathered in scientific domains increases the storage cost [47, 130, 173, 220]. This copy, however, can be stored in an optimized manner depending on the database schema: e.g., integers stored in a database page (in binary) likely take less space than in ASCII. Nonetheless, there are cases where a second copy does not imply less data. For instance, variable-sized data stored in fixed-size fields usually takes more space in a database page than in its raw form, hence more than doubling the data set size. Moreover, DBMS store data in database pages using proprietary and vendor-specific formats. The DBMS has complete ownership over the data, which is a cause of concern for some users. The NoDB paradigm, however, achieves database format independence, since the raw data files remain intact as the main data repository.

In addition, DBMS are designed to be the main repository for the data, which makes the integration of DBMS data with external tools inherently hard. Techniques such as ODBC, stored procedures and user-defined functions aim to facilitate the interaction with data stored in the DBMS. Nonetheless, none of these techniques is fully satisfactory and in fact, this is a common complaint of scientific users, who have large repositories of legacy code that operates against raw data files. Migrating and reimplementing these tools in a DBMS would be difficult and likely require vendor-specific hooks. NoDB significantly facilitates such data integration, since users may continue to rely on their legacy code in parallel to systems such as PostgresRaw.

Another major opportunity coming with the NoDB vision is the potential to query multiple different data sources and formats. NoDB systems can adopt format-specific plugins to handle different raw data file formats [158]. Implementing these plugins in a reusable manner requires applying data integration techniques, but may also require the development of new techniques, so that commonalities between formats are determined and reused. Additionally, supporting different file formats also requires the development of hybrid query processing techniques, or even adding support for multiple data models (e.g., for hierarchical data) [159].

7.6 Related work

The ideas presented in this chapter draw inspiration from several decades of research into database technology and are related to a plethora of research topics. In this section, we discuss some topics closely related to the work presented in this chapter.


Automated physical design. The NoDB paradigm advocates minimizing the data-to-insight time, which is also the goal of automated physical design and auto-tuning tools (automated physical designers). Every major database vendor offers offline indexing features, where an auto-tuning tool performs offline analysis to determine the proper physical design, including the sets of indexes, statistics and views to use for a specific workload [5, 6, 7, 41, 53, 72, 73, 193, 244, 263]. More recently, these ideas have been extended to support online indexing [42, 214], hence removing the need to know the workload in advance. The workload is discovered on-the-fly, with periodic reevaluations of the physical design. In this chapter, we have considered the hard case of zero a priori idle time or workload knowledge, enabling instantaneous querying while using each query as advice on how to tune the system.

Adaptive indexing. NoDB brings new opportunities toward achieving fully autonomous database systems, i.e., systems that require zero initialization and administration. Recent efforts in database cracking and adaptive indexing [102, 103, 107, 133, 134, 135, 137] demonstrate the potential for incrementally building and refining indexes without requiring an administrator to tune the system, or knowing the workload in advance. Still, though, all data has to be loaded up front, breaking the adaptation properties and forcing a significant delay in the data-to-insight time. We envision that adaptive indexing can be exploited and enhanced for NoDB systems. A NoDB-like system with adaptive indexing can avoid both index creation and loading costs, while providing full-featured database functionality. The major challenge is the design of adaptive indexing techniques directly on raw files.

External files. Some DBMS offer the ability to query raw data files directly with SQL, i.e., without loading data, similarly to our approach. External files, however, can only access raw data with no support for advanced database features such as indexes or statistics. Therefore, external files require every query to access the entire raw data file, as if no other query did so in the past. In fact, this functionality is provided mainly to facilitate data loading tasks and not for regular querying. NoDB systems, however, provide on-the-fly index creation and incremental data loading through caching to assist future queries and improve performance.

Raw query processing. Several researchers have already identified the need to reduce data analysis time for very large data processing tasks [8, 66, 111, 136, 162, 172, 228]. Multiple systems following the MapReduce paradigm [75] are in fact used nowadays to perform data analysis on raw data files stored in the Hadoop Distributed File System (HDFS) [118]. Hadoop [118] provides capabilities to query raw data; however, it is better suited to batch analytics than to interactive data exploration due to its long job initialization process. Approaches such as Pig [190] and Hive [236] expose a declarative query language similar to SQL to launch queries that are then transformed internally into MapReduce jobs. Similar to NoDB, they have zero initialization overhead. Nonetheless, their performance remains flat (similar to external table approaches), while NoDB adapts to the query characteristics and improves performance over time. Hence, the techniques presented in this chapter could complement the MapReduce solutions to save on parsing and tokenizing time.


Information extraction. Information extraction techniques have been extended to provide direct access to raw text data [153], similarly to external files. The difference from external files is that raw data access relies on information extraction techniques instead of directly parsing raw data files. These efforts are motivated by the need to bridge multiple different data formats and make them accessible via SQL, usually by relying on wrappers [207].

Data management of raw files. DataLinks [30] is a tool that provides an integration between the DBMS and the file system, by enabling a DBMS to manage files stored in the file system. This project, however, focuses more on providing management capabilities (e.g., consistency, integrity) over files, unlike PostgresRaw, which focuses on efficient query processing capabilities to enable efficient exploration. Data Vault [147] also shares the same motivation as the NoDB project, providing an instant gateway to the data. Nevertheless, a major role of a data vault is to locate the files of interest for the given task, and then use existing external tools (should they exist) or load the data just-in-time into a DBMS. The DBMS is employed with a similar role in [157], where metadata information (i.e., semantic chunks) is used to find files of interest. Hence, the techniques proposed in this chapter could be applied to provide efficient query processing over files of interest once they are located with the above-mentioned approaches.

Data loading. Loading has recently gained more attention from the database community, which has realized that loading is becoming a bottleneck for modern data applications characterized by the frequent arrival of fresh data. Existing efforts toward reducing this overhead can be categorized into approaches that: a) amortize the loading cost by performing it lazily and incrementally, and b) accelerate the loading procedure itself.

Adaptive and lazy loading was first presented in [136] as an alternative to full data loading. In this approach, loading happens incrementally during query processing and is driven by the workload, similarly to the way PostgresRaw builds its caches. Invisible loading [3] further applied this idea in the context of MapReduce jobs, using the parsing and tuple extraction operations of MapReduce to incrementally build tuples and store them in MonetDB, a modern column store [181].

An orthogonal approach, instant loading [182] in Hyper [161], focuses on improving the performance of the loading procedure itself by using vectorization primitives (SIMD instructions) and exploiting modern hardware. Modern hardware is exploited in speculative loading as well [62], where a new operator, SCANRAW, extends the external tables functionality with a parallel implementation that takes advantage of modern multi-cores to improve performance. Instant and speculative loading are orthogonal to the techniques presented in this chapter and could be used to enhance the performance of adaptive loading (i.e., caching) in PostgresRaw.


7.7 Conclusions

Very large data processing is becoming a necessity for modern applications in businesses and sciences. For state-of-the-art database systems, the incoming data deluge is a problem. Even more troublesome than the amount of generated data is a new use case of interactive data exploration for which DBMS are not tailored. In such a setting, the user’s workload is not predefined but is rather driven by previous actions, and previously gathered results. The use case of interactive data exploration very much resembles search on the web. From vaguely phrased queries through successive refinement by chasing individual links or adjusting the search terms, the user reaches the pages of interest. In interactive data exploration, the previously collected results aid the user in understanding the content of his files and guide the user to continue the exploration journey. A scientist then delves deeper and deeper into his data, and stops when the result reaches his satisfaction point. Since the user is at the center of the system, providing answers in a timely fashion is of paramount importance.

In this chapter, we introduce a database design paradigm that turns the data deluge and the data exploration use case into a tremendous opportunity for database systems. We propose techniques that decouple the unprecedented data growth from the size of the "interesting data", the latter typically being substantially smaller. The system monitors the user’s actions, discovering the regions of interest, and tunes itself gradually by seamlessly building the auxiliary structures (positional indexes, caches and statistics) needed to accelerate future accesses. NoDB systems permanently remove the data loading overhead by enabling raw data querying, thereby reducing the initialization overhead to zero and providing instantaneous query processing capabilities.

In-depth experiments on PostgresRaw, our implementation of the NoDB concepts over PostgreSQL, demonstrate competitive performance with traditional DBMS, both on micro-benchmarks and on real-life workloads. PostgresRaw, however, does not require any previous assumptions about which data to load, how to load it, or which physical design steps to perform before querying the data. Instead, it accesses the raw data files adaptively and incrementally and only as required, allowing users to explore new data quickly, hence greatly improving the usability of database systems.

8 Predictable Data Analytics

The performance of today’s business analytics queries is unpredictable. When analysts submit a query, they want to know whether the answer will be returned in seconds, minutes, or hours, in order to plan their further actions. Furthermore, they do not expect big variations in query execution times across multiple query invocations. Unfortunately, in reality even a small change of a single parameter value may drastically change the query execution time, because the query optimizer opted for a different query plan.

Query optimizers rely on statistics representing data distributions to create efficient query plans. When statistics are incomplete or outdated, which for ever bigger data sets is increasingly the case, the optimizer is likely to choose suboptimal plans (i.e., suboptimal strategies to access the data), resulting in poor performance. The main problem is that any plan decision, once made by the optimizer based on (incomplete) statistics, is fixed throughout the execution of a query.

This chapter makes a case for continuous adaptation and morphing of physical operators throughout their lifetime, by adjusting their behavior in accordance with the observed statistical properties of the data at run-time. We demonstrate the benefits of the new paradigm by designing and implementing an adaptive access path operator called Smooth Scan, which morphs continuously within the space of traditional index access and full table scan. Smooth Scan behaves similarly to an index scan for low selectivity; if selectivity increases, however, Smooth Scan progressively morphs its behavior toward a sequential scan. As a result, a system with Smooth Scan requires no optimization decisions on the access paths up front. Smooth Scan, implemented in PostgreSQL, demonstrates robust, near-optimal performance on micro-benchmarks and real-life workloads, while being statistics-oblivious at the same time.

The work presented in this chapter significantly improves the predictability and robustness of analytical queries, by reducing variability in execution times, which is caused by a (suboptimal) change in access path choices. Furthermore, by depending only on the result distribution and being oblivious to statistics present in the system, Smooth Scan facilitates repeatability and testing across multiple deployments (e.g. test, development and production). 1

1 This chapter uses material from [34, 35].


8.1 Introduction

The declarative query languages of database management systems are used extensively in a wide range of disciplines, from the banking industry to scientific domains. Declarative query languages enable users to specify what information they are interested in, while the database system decides how to obtain the requested data efficiently. A fundamental step in this process is the query optimization phase, which determines the fastest query execution plan (i.e., an algorithm for obtaining the answer to the query), chosen from a large space of alternatives. The optimal query plan is the one that the system can execute with the shortest possible response time, with respect to the available resources. Every node in the plan is a physical operator that filters its input data and produces output that is in turn consumed by another operator.

Query execution performance of database systems depends heavily on query optimization decisions; deciding which (physical) operators to use and in which order to place them in a plan is of critical importance and can affect response times by several orders of magnitude [142]. To find the best possible plan, query optimizers typically employ a cost model to estimate the performance of viable alternatives (see Chapter 3). In turn, cost models rely on statistics about the data. With the growing complexity of business analytics systems (e.g. templatized queries, UDFs) and the advent of dynamic web applications, however, the optimizer’s grasp of reality becomes increasingly loose and it becomes more difficult to produce an optimal plan [108]. For instance, to defy complexity and make up for the lack of statistics, commercial database management systems often assume uniform data distributions and attribute value independence, which is hardly the case in reality [63]. As a result, database systems are increasingly confronted with suboptimal plans and consequently poor query execution performance [24, 78, 83, 155, 175, 225].

Motivating Example. To illustrate the severe impact of imprecise statistics and the consequent suboptimal access path choices, we use a state-of-the-art commercial system, referred to as DBMS-X, and run the TPC-H benchmark [240]. The exact set-up is discussed in Section 8.7.2. Figure 8.1 demonstrates the impact of suboptimal access path choices after tuning DBMS-X (i.e., building a set of indexes) for TPC-H. The graph shows normalized execution times of tuned over non-tuned performance. Despite using the official tuning tool of DBMS-X in the experiment, for several queries performance degrades significantly after tuning (e.g., up to a factor of 400 for Q12). The only change compared to the original plan of Q12 is the type of access path operator. This decision, however, prolonged the execution time from one minute to 11 hours!

When considering access paths, an index scan is significantly cheaper when only a small fraction of the data qualifies; nonetheless, its performance suffers if more data is selected. The optimizer needs accurate statistics to estimate the tipping point between a full scan and an index scan to make the proper choice (see Figure 8.2). Figure 8.1 outlines the importance of proper access path choices, since a suboptimal decision at that level severely hurts overall query performance. This is expected, since the access paths touch all the data needed to answer the query before any filtering has been applied.

Figure 8.1: Non-robust performance due to selectivity misestimates in a state-of-the-art commercial DBMS when running TPC-H (normalized execution time, log scale, of the tuned over the original configuration for queries Q1–Q22).

The performance degradation shown in Figure 8.1 is attributed to suboptimal access path choices, where the optimizer favored index usage over full table scans in cases where it underestimated the cardinalities. The core of the problem lies in the fact that even a small cardinality estimation error may lead to a drastically different result in terms of performance. For instance, a one-tuple difference in cardinality estimation can swing the decision between an index scan and a full scan (the tipping point in Figure 8.2), possibly causing a significant performance drop. When considering access path selection, in a typical scenario, the optimizer makes a choice between an index scan and a full scan. For the estimated selectivity point (shown as EST in Figure 8.2), derived by exploiting statistics, the optimizer opts for the cheaper alternative, which is the index scan for the given estimate. Nonetheless, if the actual selectivity turns out to be higher than estimated (e.g. ACT in Figure 8.2), the optimizer has made a suboptimal decision, potentially severely hurting performance.

Overall, this sensitivity to the quality of the optimizer’s cardinality estimates results in unpredictable performance, thereby affecting the robustness of the system. In addition, the overall behavior is driven by the version of statistics used by the system, which means that two different deployments over the same data might exhibit different performance if their statistical summaries representing the data distributions differ. The latter hampers testing repeatability across different servers or even across multiple invocations of the same query.

Stability and predictability, which imply that similar query inputs should have similar execution performance, are of paramount importance for industrial vendors as a path toward respecting service level agreements (SLA) [174]. This is exemplified, nowadays, in cloud environments offering paid-as-a-service functionality governed by SLA, which are much more ad-hoc than traditional closed systems [71]. In these cases, the system’s ability to efficiently operate in the face of unexpected and especially adverse run-time conditions (e.g. receiving more tuples from an operator than estimated) becomes more important than yielding great performance for one query input while potentially suffering from severe degradation for another [106, 109]. We define robustness in the context of query processing as the ability of a system to efficiently cope with unexpected and adverse conditions with respect to its input, and deliver near-optimal performance for all query inputs.

Figure 8.2: Access path selection in DBMS (execution time of an index scan and a full scan as a function of selectivity; EST and ACT mark the estimated and actual selectivity, and the tipping point marks where the full scan becomes cheaper).

Figure 8.3: Runtime reoptimization (switching from an index scan to a full scan at run-time; the RISK region marks selectivities for which the switch cannot be amortized).

Run-time reoptimization. Past efforts on robustness focus primarily on dealing with the problem at the optimizer level [22, 24, 81, 82, 83]. Nonetheless, in dynamic environments with constantly changing workloads and data characteristics, judicious query optimization performed up front could bring only partial benefits as the environment keeps changing even after optimization. Orthogonal approaches on run-time adaptivity [15, 21, 87, 150, 155, 175], although promising, lack the flexibility at the level of access paths. They are limited in their scope, either by completely ignoring intra-operator adaptivity within the access path operator, or by performing binary switching decisions that introduce risks and result in unpredictable performance.

To illustrate the latter, let us consider the access path selection problem introduced in Figure 8.2. A simple solution to recover from a suboptimal access path choice would be to switch the strategy from an index scan to a full table scan upon detecting a selectivity estimation violation, as illustrated in Figure 8.3. The switch could be performed by monitoring the result cardinality and triggering the switch once the observed cardinality exceeds the estimate. Despite being useful in that it bounds the worst case performance and prevents the substantial performance degradation that could have happened with continuation of the suboptimal access path, such an approach is not robust. The main problem with reoptimization is that it is based on a binary decision and switches completely when a certain cardinality threshold is violated. This means that even a single extra result tuple can bring a drastically different performance result if we switch access paths. One tuple difference can substantially prolong query execution, if the switch occurs. Such behavior can discourage business analysts who repeat the same query they ran yesterday and observe different query performance, while the only change in the database was, for instance, the addition of a single record to a table.


The performance drop introduced with reoptimization is illustrated with the ’Reoptimization’ line in Figure 8.3. We refer to the effect of a sudden increase in execution time as a performance cliff. The performance hit, together with the uncertainty whether the overhead incurred at the time of the change will be amortized over the remaining query time, render this approach volatile and hence non-robust. For instance, if the actual operator selectivity lies anywhere in the gray box shown as ’Risk’ in Figure 8.3, a better decision would be to continue with the original choice (the index scan), since the reoptimization overhead cannot be amortized over the rest of the query lifetime. Additionally, since the violation of the optimizer’s estimates usually triggers reoptimization, this approach remains sensitive to the current version of statistics, which complicates testing across different deployments.

To reduce the variability and performance drops due to suboptimal decisions, we need an access path operator whose performance stays between the index and the full scan, i.e., at the lower cost boundary of the two, throughout the entire selectivity interval. Such an operator would be able to provide robust, near-optimal performance for all query inputs. With one such operator, the risk of proposing and executing suboptimal paths would be removed, since the operator would perform well for all data distributions and all operator selectivities.

Smooth Scan. In this chapter, we respond to the quest for robust execution that reduces variability in query performance at the access path level by introducing a novel class of access path operators designed with the goal of providing robust performance for every query input, regardless of the severity of cardinality estimation errors. Since understanding the data distributions is a continuous process that develops throughout the execution of a query, we propose a new class of morphable operators that continuously and seamlessly adjust their execution strategy as this understanding evolves. We introduce Smooth Scan, an operator that morphs between an index look-up and a full table scan, achieving near-optimal performance regardless of the operator’s selectivity and oblivious to the existing data statistics. Smooth Scan aims to degrade gracefully as selectivity increases and to stay as close as possible to the performance that could have been achieved if all necessary statistics were available. In addition, morphing relieves the optimizer from choosing an optimal access path a priori, since the execution engine has the ability to adjust its behavior at run-time as a response to the observed operator selectivity.

The contributions of this chapter are summarized as follows:

• We propose a new paradigm of building smooth and morphable access path operators that adjust their behavior and transform from one operator implementation to another according to the statistical properties of the data observed at run-time.
• We design and implement a statistics-oblivious Smooth Scan operator that morphs between a non-clustered index access and a full scan as selectivity knowledge evolves at run-time.
• Using both synthetic benchmarks and TPC-H, we show that Smooth Scan, implemented in PostgreSQL, is a viable option for achieving near-optimal performance, by approximating the performance of the optimal access path choice throughout the entire range of possible selectivities.

Figure 8.4: Access Paths in a DBMS: (a) a full table scan over the heap pages of a table, and (b) an index scan following the leaf pointers of a B+-tree into the heap. The numbers next to the tuples indicate the order in which they are accessed.

8.2 Background

In order to fully understand the advantages and the mechanisms of the Smooth Scan operator, this section provides a brief background on the traditional access path operators.

Full Table Scan is employed when there are no alternative access paths, or when the selectivity of the access operator is estimated to be high (above 1-10%, depending on the system parameters). The execution engine starts by fetching the first tuple from the first page of a table stored in a heap, and continues accessing tuples sequentially inside the page. It then accesses the adjacent pages until it reaches the last page. Figure 8.4a depicts an example of a full scan over a set of pages in the heap; the number placed on the left-hand side of each tuple indicates the order in which the page is accessed. Even if the number of qualifying tuples is small, a full table scan is bound to fetch and scan all pages of a table, since there is no information on where tuples of interest might be. Despite its rigorousness, the sequential access pattern employed by the full table scan is one to two orders of magnitude faster than the random access pattern of an index scan.

Index Scan. As discussed in Section 3.3.1, secondary indexes are built on top of the data pages stored in the heap. Indexes are usually implemented as B+-trees containing pointers to tuples.


Figure 8.4b depicts a B+-tree index built on top of the same table we used in Figure 8.4a, with the leaves of the tree pointing to the heap data pages. A query with a range predicate needs to traverse the tree once in order to find the pointer to the first tuple that qualifies, and then it continues following adjacent leaf pointers until it finds the first tuple that does not qualify. As before, the number placed on the left-hand side of each tuple indicates the order in which it is accessed. The upside of this approach, compared to the full scan, is that only tuples that are needed are actually accessed. The downside is the random access pattern when following pointers from the leaf page(s) to the heap (shown as lines with arrows). Since the random access pattern is much slower than the sequential one, performance deteriorates quickly if many tuples need to be selected. Moreover, as the number of tuples that qualify grows, so does the chance that the index scan visits the same page more than once.

Sort Scan (Bitmap Scan) represents a middle ground between the previous two approaches. Sort Scan still exploits the secondary index to obtain the tuple identifiers (IDs) of all tuples that qualify, but prior to accessing the heap, the qualifying tuple IDs are sorted in increasing heap page order. In this way the poor performance of the random access pattern gets translated into a (nearly) sequential pattern, which is easily detected by disk prefetchers. This can decrease execution time even when the selectivity of the operator grows significantly. However, it has a dramatic influence on the execution model. The index access that traditionally followed the pipelined execution model now gets transformed into a blocking operator, which can be harmful, especially when the index is used to provide an interesting ordering [20]. One advantage of B-tree indexes stems from the fact that tuples are accessed in the sorted order of the attributes on which the index is built. Sorting tuple IDs based on their page placement breaks the natural index ordering, which then has to be restored by introducing a sorting operator above the index access (or higher up in the tree). In addition, the introduction of a blocking operator so early in the execution plan may stall the rest of the operators; if they require a sorted input, their execution can start only after the second sort finishes.
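To make the page-ordering step concrete, the following minimal C sketch sorts qualifying tuple identifiers by heap page number before the heap is touched, which is the essence of the Sort Scan; the TupleId representation is a simplification and does not correspond to PostgreSQL's actual TID or bitmap heap scan structures.

/* Minimal sketch of the Sort Scan idea: sort qualifying tuple IDs by heap
 * page before accessing the heap, so that page accesses become (nearly)
 * sequential.  Illustrative only; not PostgreSQL's actual structures. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { unsigned page; unsigned offset; } TupleId;

static int by_page(const void *a, const void *b)
{
    const TupleId *x = a, *y = b;
    if (x->page != y->page) return (x->page < y->page) ? -1 : 1;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

int main(void)
{
    /* Tuple IDs as they come out of the secondary index (index-key order). */
    TupleId tids[] = { {9, 1}, {2, 4}, {9, 3}, {1, 7}, {2, 1}, {7, 2} };
    size_t n = sizeof(tids) / sizeof(tids[0]);

    qsort(tids, n, sizeof(TupleId), by_page);   /* restore heap page order */

    /* The heap is now visited page by page; repeated fetches of a page collapse. */
    for (size_t i = 0; i < n; i++)
        printf("fetch page %u, offset %u\n", tids[i].page, tids[i].offset);
    return 0;
}

Note that after this step the tuple IDs no longer follow the index key order, which is exactly why a sorting operator has to be reintroduced above the access when an interesting ordering is required.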

8.3 Intra-operator adaptivity with Smooth Scan

Having discussed in Section 8.1 reasons why performance cliffs are undesirable, this section introduces Smooth Scan which avoids such cliffs while still providing robust query execution performance within given cost boundaries. Instead of making binary decisions like the one introduced with reoptimization, Smooth Scan gradually and adaptively shifts its behavior between access path patterns to fit the data distributions, thereby avoiding performance drops.

The core idea behind Smooth Scan is to gradually transform between two strategies, i.e., an index look-up and a full table scan, maintaining the advantages of both worlds. The main objective is to provide smooth behavior, so that at no point during execution can an extra tuple in the result cause a performance cliff. Smooth Scan morphs its behavior incrementally and continuously, causing only gradual changes in performance as it goes through the data and as its estimate of the result cardinality evolves.

Figure 8.5: Targeted behavior of Smooth Scan (execution time as a function of selectivity, compared to an index scan, a full scan, and runtime reoptimization).

Figure 8.5 shows the target performance behavior of Smooth Scan as a function of the result selectivity increase. As we produce more result tuples, the behavior of Smooth Scan keeps adjusting continuously, eventually approaching the behavior of the full scan if almost all tuples qualify from the select operator. This continuous adaptation is the key element, which provides near-optimal performance regardless of the severity of result cardinality estimation errors.

A critical advantage of Smooth Scan over runtime reoptimization with binary switching is that with Smooth Scan there is no need to choose a single point of adaptation (i.e. the switch point). As a result, a single point of failure is removed and replaced with incremental refinement actions and decisions. Moreover, we release the optimizer from the burden of choosing an optimal path a priori, which solves both common problems with access paths: a) choosing an index when selectivity is underestimated due to attribute correlation, which usually results in performance degradation; b) choosing a full table scan when selectivity is overestimated due to anti-correlation, which could be seen as a missed opportunity that results in waste of memory and disk bandwidth.

The next section describes how Smooth Scan achieves this gradual adaptation.

8.3.1 Morphing Mechanism

During its lifetime, the Smooth Scan operator can be in one of three modes while morphing between an index scan and a full scan. In each mode the operator performs a gradually increasing amount of work as a result of the selectivity increase.

Mode 1: Entire Page Probe. A core problem of the index scan is that a particular disk page can be referenced multiple times, causing repeated (random) I/O accesses (e.g., 3 times for pages 3 and 4 in Figure 8.4b). The classical index scan retrieves solely the searched record driven by the index probe, while ignoring the remaining records from the page, some of which may belong to the result. This potentially forces a return to the same page later on, if more tuples from that page qualify. To avoid the repeated page accesses from which the index scan suffers, in this mode Smooth Scan analyzes all records from each heap page it loads to find the qualifying tuples, trading CPU cost for an I/O cost reduction. Since the cost of an I/O operation translates to an order of a million CPU instructions [99], Smooth Scan invests CPU cycles for reading additional tuples from each page with minimal overhead. Figure 8.6 depicts the access pattern of Smooth Scan in this mode. As in Figure 8.4, the number at the left-hand side of each tuple indicates the order in which the access path touches this tuple. Within each page, Smooth Scan accesses tuples sequentially.

Figure 8.6: Smooth Scan access pattern (Entire Page Probe followed by Flattening Access over a morphing region; an X marks a leaf pointer that is skipped because its heap page has already been processed).

Mode 2: Flattening Access. When the result cardinality grows, Smooth Scan amortizes the random I/O cost by flattening the random pattern and replacing it with a sequential one. Flattening happens by reading additional adjacent pages from the heap, i.e., for each page it has to read, Smooth Scan prefetches a few more adjacent pages (read sequentially). An example of a morphing region is depicted in Figure 8.6 as the gray rectangle over the heap pages.

Mode 2+: Flattening Expansion. Flattening Access Mode is in fact an ever-expanding mode. When it first enters Flattening Access Mode, Smooth Scan starts by fetching one extra page for each page it needs to access. However, when it notices a result selectivity increase, Smooth Scan progressively increases the number of pages it prefetches by multiplying it by a factor of 2. The reason is that, as selectivity increases, the I/O cost of fetching more, potentially unnecessary, pages could be masked by the CPU processing cost of the tuples that qualify. In this way, as the result cardinality increases, Smooth Scan keeps expanding and conceptually morphs more aggressively into a full table scan.


8.3.2 Morphing Policies

There are several ways in which Smooth Scan can morph between modes.

Greedy Policy. Assuming a worst case scenario, i.e., a very high result selectivity, Smooth Scan can perform a morphing region expansion after each index probe. In this way, the morphing expansion greedily follows the selectivity increase. The upside of this approach is that, due to its fast convergence, its worst case performance resembles the performance of a full scan. The downside is that, in the case of low selectivity, Smooth Scan introduces the overhead of reading unnecessary pages, which cannot be masked by useful work.

Selectivity Increase Driven Policy. Blindly morphing between the modes may introduce too much overhead if the I/O cost cannot be overlapped with useful work. With this policy, Smooth Scan continuously monitors selectivity at run-time and expands the morphing region when it notices a selectivity increase. In particular, Smooth Scan computes the result selectivity over the last morphing region (the heap pages triggered by the previous index access) and increases the morphing region size each time the local selectivity over the last morphing region (calculated in Eq. (8.1)) is greater than the global selectivity over the pages seen so far (calculated in Eq. (8.2)). The meaning of the parameters can be found in Table 8.1. If selectivity does not increase, Smooth Scan keeps the previous morphing region size.

sel_local = #P_res_region / #P_seen_region   (8.1)

sel_global = #P_res / #P_seen   (8.2)

Elastic Policy. When considering big data sets, it is unlikely that a single execution strategy will be optimal during the entire scan over a big table; dense and sparse regions with respect to the tuple distribution on disk frequently appear in such a context due to skewed data distributions. To benefit from this density discrepancy and use skew as an opportunity, Smooth Scan uses the Elastic Policy to morph in both directions; it increases the morphing region size over a dense region, and decreases it once it passes the dense region. More precisely, if the local selectivity over the last morphing region is higher than the global selectivity over all tuples seen so far, this implies a denser region, hence Smooth Scan doubles the morphing region size. In the opposite case, Smooth Scan halves the morphing region size for the next heap access. This way, morphing is performed at a pace which is purely driven by the data and the query at hand.
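The following sketch illustrates the region-sizing decision shared by the Selectivity Increase Driven and Elastic policies, using the local and global selectivities of Eqs. (8.1) and (8.2). It is a simplified illustration in C; the structure, field names, and the exact point at which the global counters are updated are assumptions, not the PostgreSQL implementation.

/* Sketch of the morphing-region sizing used by the Selectivity Increase
 * Driven and Elastic policies (Eqs. 8.1 and 8.2).  Illustrative only. */
#include <stdio.h>

typedef struct {
    long pages_seen;          /* #P_seen: heap pages seen so far            */
    long pages_with_results;  /* #P_res: pages that contained result tuples */
    long region_size;         /* current morphing region size, in pages     */
} ScanState;

/* Decide the size of the next morphing region after finishing one region. */
static long next_region_size(ScanState *s,
                             long region_pages,        /* #P_seen_region */
                             long region_result_pages, /* #P_res_region  */
                             int  elastic)             /* 0: SI-driven, 1: Elastic */
{
    double sel_local  = (double) region_result_pages / region_pages;    /* Eq. 8.1 */
    double sel_global = (double) s->pages_with_results / s->pages_seen; /* Eq. 8.2 */

    if (sel_local > sel_global)
        s->region_size *= 2;              /* denser region: expand            */
    else if (elastic && s->region_size > 1)
        s->region_size /= 2;              /* sparser region: shrink (Elastic) */
    /* The SI-driven policy keeps the previous size when selectivity does not grow. */

    s->pages_seen         += region_pages;
    s->pages_with_results += region_result_pages;
    return s->region_size;
}

int main(void)
{
    ScanState s = { .pages_seen = 64, .pages_with_results = 8, .region_size = 4 };
    printf("dense region  -> next region of %ld pages\n", next_region_size(&s, 4, 3, 1));
    printf("sparse region -> next region of %ld pages\n", next_region_size(&s, 8, 0, 1));
    return 0;
}

Run on the two sample regions, the region doubles over the dense region and, under the Elastic policy, halves again over the sparse one.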

8.3.3 Morphing Triggering Point

We now present triggers for Smooth Scan to start morphing.


Optimizer Driven. Smooth Scan can be introduced to the existing query stack as a reaction to unfavorable conditions, i.e., as a robustness patch. With this strategy, Smooth Scan initiates morphing once the result cardinality exceeds the optimizer’s estimate. A cardinality violation is an indication that the optimizer’s estimate is inaccurate and that the chosen access path might be suboptimal. After triggering, Smooth Scan can morph with any of the policies described in Section 8.3.2.

SLA Driven. Another option is to take action only when there is a danger of violating a performance threshold, i.e., a service level agreement (SLA). For example, let us assume a given time T as an upper bound (SLA) for the operator execution. In this case, Smooth Scan continuously monitors execution and keeps a running estimate of the expected total cost (based on the cost model discussed in Section 8.5). The moment Smooth Scan detects that it will not be able to guarantee the SLA target behavior unless it switches to a more conservative behavior, it triggers morphing.

Eager Approach. An alternative approach, favored in this work, is to completely replace the access paths with Smooth Scan. With this strategy, Smooth Scan eagerly starts morphing from the very first tuple. In this way, Smooth Scan guarantees that the total number of page accesses is, in the worst case, equal to the total number of heap pages. Moreover, with this strategy there is no need to record the tuples produced before morphing has started (to prevent result duplication), which provides an additional benefit and decreases the bookkeeping information.

Since the bookkeeping overhead of the Eager strategy is minimized, in the experiments performed throughout this chapter, Eager is the default strategy unless stated otherwise. We study other strategies in detail in the experimental section.
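As a small illustration of the SLA Driven trigger described above, the sketch below keeps a running worst-case estimate of the remaining operator cost and starts morphing once the projected total can no longer stay within the SLA budget. The structure and the cost numbers are stand-ins for the cost model of Section 8.5 and are purely hypothetical.

/* Toy illustration of the SLA-driven morphing trigger.  The cost values
 * stand in for the model of Section 8.5; they are not real estimates. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double spent;            /* cost consumed so far                        */
    double worst_case_rest;  /* worst-case estimate for the remaining work  */
    double sla_budget;       /* total cost allowed by the SLA               */
    bool   morphing;
} SlaTrigger;

static void after_each_probe(SlaTrigger *t, double probe_cost, double new_rest)
{
    t->spent += probe_cost;
    t->worst_case_rest = new_rest;
    if (!t->morphing && t->spent + t->worst_case_rest > t->sla_budget) {
        t->morphing = true;   /* switch to more conservative behavior */
        printf("SLA at risk (%.0f + %.0f > %.0f): start morphing\n",
               t->spent, t->worst_case_rest, t->sla_budget);
    }
}

int main(void)
{
    SlaTrigger t = { 0, 0, 1000, false };
    after_each_probe(&t, 50, 800);   /* 50 + 800 <= 1000: keep the index scan */
    after_each_probe(&t, 50, 950);   /* 100 + 950 > 1000: trigger morphing    */
    return 0;
}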

8.4 Introducing Smooth Scan into PostgreSQL

In this section, we discuss the design details of Smooth Scan and its interaction with the remaining query processing stack. We implement Smooth Scan in PostgreSQL 9.2.1 DBMS as a classical physical operator existing side by side with the traditional access path operators. During query execution, the access path choice is replaced by the choice of Smooth Scan, whereas the upper layers of query plans generated by the optimizer remain intact. Unlike the dynamic reoptimization approaches proposed in [17, 21], our proposal requires minimal changes to the existing database architecture.

8.4.1 Design Details

To make the Smooth Scan operator work efficiently, several critical issues need to be addressed.

Page ID Cache. To avoid processing the same heap page twice (since multiple leaf pointers of the index can point to the same page), Smooth Scan keeps track of the pages it has read and records them in a Page ID Cache. The Page ID Cache is a bitmap structure with one bit per page. Once a page is processed, its bit is set to 1. When traversing the leaf pointers from the index, a bit check precedes each heap page access. Smooth Scan accesses the heap page only if that page has not been accessed before. Otherwise, Smooth Scan skips the leaf pointer (X in Figure 8.6) and continues the leaf traversal.
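A minimal sketch of such a bitmap, consulted before each leaf pointer is followed, is shown below; the structure and function names are hypothetical and not taken from the PostgreSQL implementation.

/* Minimal sketch of the Page ID Cache: one bit per heap page, set once the
 * page has been processed and checked before following a leaf pointer. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned char *bits;
    size_t         npages;
} PageIdCache;

static PageIdCache *page_cache_create(size_t npages)
{
    PageIdCache *c = malloc(sizeof(*c));
    c->npages = npages;
    c->bits = calloc((npages + 7) / 8, 1);
    return c;
}

static bool page_seen(const PageIdCache *c, size_t page)
{
    return (c->bits[page / 8] >> (page % 8)) & 1;
}

static void page_mark(PageIdCache *c, size_t page)
{
    c->bits[page / 8] |= (unsigned char) (1u << (page % 8));
}

int main(void)
{
    PageIdCache *c = page_cache_create(1 << 20);    /* ~128 KB for one million pages */
    size_t leaf_pointers[] = { 3, 4, 3, 9, 4 };     /* heap pages referenced by the index */

    for (size_t i = 0; i < 5; i++) {
        size_t page = leaf_pointers[i];
        if (page_seen(c, page)) {                   /* skip the leaf pointer */
            printf("page %zu already processed, skipping\n", page);
            continue;
        }
        page_mark(c, page);
        printf("processing all tuples of page %zu\n", page);
    }
    free(c->bits);
    free(c);
    return 0;
}

With one bit per page, a table of a million pages needs roughly 128 KB of cache, which is in line with the memory footprint discussed under Memory management below.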

Tuple ID Cache. If Smooth Scan switches from the traditional index scan following the Optimizer or SLA Driven strategy, it has to ensure that result tuples are not duplicated. This could happen if a result tuple is produced by following the traditional index, and later on the same page is fetched by Smooth Scan. To address this issue, Smooth Scan keeps a cache of the tuple IDs produced with the traditional access in a bitmap-like structure. Later, while producing tuples, Smooth Scan performs a bit check on whether the tuple has already been produced. The overhead of the Tuple ID Cache, while relatively low, can be avoided if a DBMS maintains a strict (indexkey, TID) ordering in the secondary index. Then it is sufficient to remember the last tuple reached with the traditional index, and ignore tuples with an (indexkey, TID) lower than that last tuple.

Result Cache. If an index is chosen to support an interesting order (e.g., in a query with an ORDER BY clause), then the tuple order has to be respected. This means that a query plan with Smooth Scan cannot consume all tuples the moment it produces them. To address this constraint, the additional qualifying tuples found (i.e., all but the one specifically pointed to by the given index look-up) are kept in the Result Cache. The Result Cache is a hash-based data structure that stores qualifying tuples. In this setting, an index probe is preceded by a hash probe of the Result Cache for each tuple identifier obtained from the leaf pages of the index. If the tuple is found in the Result Cache, it is immediately returned (and can be deleted); otherwise, Smooth Scan fetches it from disk following the current execution mode. Cache deletion is done in a bulk fashion. The Result Cache is partitioned into a number of smaller caches that can be deleted once all tuples from an instance are produced. By grouping the caches per key value, Smooth Scan can remove all items from one cache as soon as the key value of the cache is traversed.

Memory management. Both the Page ID and the Tuple ID Cache are bitmap structures, meaning that their size is significantly smaller than the data set size (they easily fit in memory). To illustrate, their size is usually a couple of KB to MB for hundreds of GB of data. In the Tuple ID Cache we keep only the IDs of the tuples acquired with the traditional index, which is in practice significantly lower than the overall number of tuples.

The Result Cache is an auxiliary structure whose size depends on the access order of tuples, the number of attributes in the payload, and the overall operator selectivity. In the worst case, if the cache grows above the allowed memory size, Smooth Scan performs partitioning and overflows partitions into temporary files on disk. Partition ranges are created by looking at the root page of the index to decide on the number of partitions (and their ranges). The reasons are two-fold. First, the root page of an index is typically stored in memory, hence accessing it will not invoke an unnecessary I/O. Second, the root page of an index contains information about the distribution of key values, since, assuming the B-tree is balanced, data skew will be reflected in the way the keys are distributed. For instance, a big gap between two consecutive keys in the root implies a sparse region with respect to the distribution of data with values in that range. Similarly, a small gap implies a dense region. Smooth Scan uses the root keys to decide on the number of partitions and their ranges given the allotted memory size (and the table size). If the number of partitions is higher than the total number of keys in the root page, meaning that consecutive keys would create too large partitions, Smooth Scan accesses the second-level index pages to refine the partition ranges.

During run-time, Smooth Scan pipelines tuples for the current key immediately as it finds them, while the remaining qualifying tuples (for other partition ranges) are stored in their corresponding partitions. If memory becomes scarce, Smooth Scan employs overflow resolution and writes the partition with the highest key values into a temporary file on disk first. Once a particular key (or a partition range) is completely consumed, Smooth Scan can freely discard the partition it belongs to. This shrinking reduces memory pressure during run-time. Once Smooth Scan needs to service data belonging to another partition, it simply accesses the partition (i.e., the temporary file) and returns all tuples belonging to it, enjoying the benefit of spatial locality. Hence, even if the table is much larger than the allotted memory, Smooth Scan still benefits from reducing repeated and random accesses compared to the index scan, at the expense of additional sequential accesses to temporary files.

8.4.2 Interaction with Query Processing Stack

Smooth Scan is an access path targeted primarily at preventing severe performance degradation due to unexpected selectivity increases, which is a common complaint received by the customer support of major industrial vendors, as it is a major source of unpredictability in query performance [108, 109, 170]. Nonetheless, its impact goes well beyond that.

Simplified query optimization. Smooth Scan simplifies the query optimization process. Effectively, when choosing the access path for a select operator the optimizer can always choose a Smooth Scan. The Smooth Scan will then make all decisions on-the-fly during query execution.

Interaction with other operators. Smooth Scan is able to completely replace the functionality of both the Index and the Full Scan. The output of Smooth Scan is an input to other operators in a query plan. Depending on the next operator in the query tree, a different variation of Smooth Scan may be used. For example, if a Merge Join follows Smooth Scan, implying an imposed order among tuples, then the variant of Smooth Scan with result caching is used. Since tuples obtained out of order cannot be consumed immediately, they are stored in the Result Cache until their key value arrives. If an Index Nested Loops Join (INLJ) is the operator on top of the scan and Smooth Scan is employed as the outer input to the join, Smooth Scan has no constraints on the order of the tuples produced, hence it can emit tuples the moment it finds them. If an ordering requirement is, however, placed by some of the operators higher up in the tree, we still employ the first option. If Smooth Scan serves as an inner input (a parameterized path) to a join, the results per requested join key can be produced in an arbitrary order by calling Smooth Scan with that particular key value as a filtering predicate. As a result, repeated I/O accesses are avoided and random ones are significantly reduced per join key value, which helps in the case of multiple key matches (e.g., a PK-FK relationship).

8.5 Modeling Smooth Scan

This section provides an analytical model of the operators in order to better grasp the behavior of the different access path alternatives, and to answer the critical questions of which policy and mode Smooth Scan should employ and when. Smooth Scan trades CPU for an I/O cost reduction, thus the proposed model includes the cost of the operators both in terms of the number of disk I/O accesses and in terms of CPU cost. Since one I/O operation corresponds to an order of a million CPU cycles [99], the overall cost is expected to be dominated by the I/O component; nonetheless, CPU costs are also considered for completeness. Moreover, we make a distinction between the cost of a sequential and a random access, since the nature of the accesses drives the overall query performance.

#TP = ⌊PS / TS⌋   (8.3)
#P = ⌈#T / #TP⌉   (8.4)
fanout = ⌊PS / (1.2 × KS)⌋   (8.5)
#leaves = ⌈#T / fanout⌉   (8.6)
height = ⌈log_fanout(#leaves)⌉ + 1   (8.7)
card = sel × #T   (8.8)
#leaves_res = ⌈card / fanout⌉   (8.9)
OP_cost = OP_io_cost + OP_cpu_cost   (8.10)

Table 8.1 contains the parameters of the cost model. Formulas calculating the cost of the non-clustered index scan and the full scan are presented for comparison purposes (similar cost model formulas are found in database text books [203]). Indexes are implemented as B+-trees, with k as the tree fanout. Equations (8.3) to (8.9) are base formulas used for all access path operators. The final cost of every operator is the sum of its I/O and CPU costs. We simplify the calculations by assuming every page is filled completely (100%) and that heap pages and index pages are of the same size (PS). Lastly, we assume that TS already includes the per-tuple overhead (usually padding and a tuple header). In Eq. (8.5), we calculate the fanout of the B+-tree by adding 20% of space per key for a pointer to a lower level. For Eq. (8.6) and (8.9), we assume that every tuple stored in a heap page has a pointer to it in a leaf page of the index.

Table 8.1: Smooth Scan: Cost model parameters

Parameter        Description
TS               Tuple size (bytes).
#T               Number of tuples in the relation.
PS               Page size (bytes).
#TP              Number of tuples per page.
#P               Number of pages the relation occupies.
KS               Size of the indexing key (bytes).
sel              Selectivity of the query predicate(s) (%).
card             Number of result tuples.
card_mX          Number of tuples obtained with Mode X.
m0check          0 or 1. Was a traditional index employed first?
rand_cost        Cost of a random I/O access (per page).
seq_cost         Cost of a sequential I/O access (per page).
cpu_cost         Cost of a CPU operation (per tuple).
#P_res           Number of pages containing result tuples.
#P_res_region    Number of pages with results in the current region.
#P_seen          Number of pages seen so far.
#P_seen_region   Number of pages in the current region.
#rand_io         Number of random accesses.
#seq_io          Number of sequential accesses.
Derived values
fanout           B+-tree fanout.
#leaves          Number of leaf pages in the B+-tree.
#leaves_res      Number of leaf pages with pointers to results.
height           Height of the B+-tree.
OP_io_cost       Cost of an operator in terms of I/O.
OP_cpu_cost      Cost of an operator in terms of CPU.
CR               Competitive ratio.

Full Table Scan. The cost of the full scan does not depend on the number of tuples that qualify for the given predicate(s). Thus, regardless of the selectivity of the operator, its cost remains constant. As shown in Eq. (8.11), the I/O cost is the cost of fetching all pages of the relation sequentially (as we expect each table to be stored sequentially on disk). Once the full scan fetches a page, it performs a tuple comparison for all tuples on the page to find the ones that qualify. Assuming that each comparison invokes one CPU operation, the overall CPU cost is given by Eq. (8.12).


FS_io_cost = #P × seq_cost   (8.11)

FS_cpu_cost = #T × cpu_cost   (8.12)

Index Scan. To fetch the tuples, the (non-clustered) index scan traverses the tree once to find the first tuple that qualifies (height in Eq. (8.13)). For the remaining tuples, it continues traversing the leaf pages of the index (#leaves_res × seq_cost) and uses the tuple IDs found there to access the heap pages, potentially triggering a random I/O operation per look-up (card in Eq. (8.13)). While traversing the tree, within every index page the index scan performs a binary search in order to find the pointer of interest to the next level (height × log2(fanout) in Eq. (8.14)). For each tuple obtained by following the pointers from the leaves, it then performs a tuple comparison to see whether the tuple qualifies (the second part of Eq. (8.14)).

IS_io_cost = (height + card) × rand_cost + #leaves_res × seq_cost   (8.13)

IS_cpu_cost = (height × log2(fanout) + card) × cpu_cost   (8.14)
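As a quick numerical illustration of Eqs. (8.3)–(8.14), the sketch below evaluates the full scan and index scan costs for a sample configuration; the tuple size, page size, table size, and cost constants are assumptions chosen purely for illustration.

/* Back-of-the-envelope evaluation of the full scan and index scan cost
 * formulas (Eqs. 8.3-8.14).  All parameter values are illustrative. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double TS = 200, PS = 8192, KS = 8;    /* tuple, page, and key sizes (bytes) */
    double nT = 10e6;                      /* #T: tuples in the relation         */
    double rand_cost = 10, seq_cost = 1, cpu_cost = 1e-5;

    double nTP    = floor(PS / TS);                         /* Eq. 8.3 */
    double nP     = ceil(nT / nTP);                         /* Eq. 8.4 */
    double fanout = floor(PS / (1.2 * KS));                 /* Eq. 8.5 */
    double leaves = ceil(nT / fanout);                      /* Eq. 8.6 */
    double height = ceil(log(leaves) / log(fanout)) + 1;    /* Eq. 8.7 */

    for (int e = -3; e <= -1; e++) {
        double sel        = pow(10.0, e);
        double card       = sel * nT;                       /* Eq. 8.8 */
        double leaves_res = ceil(card / fanout);            /* Eq. 8.9 */

        double fs = nP * seq_cost + nT * cpu_cost;                        /* Eqs. 8.11-8.12 */
        double is = (height + card) * rand_cost + leaves_res * seq_cost
                  + (height * log2(fanout) + card) * cpu_cost;            /* Eqs. 8.13-8.14 */

        printf("sel=%5.3f  full scan=%10.0f  index scan=%12.0f\n", sel, fs, is);
    }
    return 0;
}

The selectivity at which the index scan cost overtakes the full scan cost corresponds to the tipping point of Figure 8.2.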

Smooth Scan. Having defined the cost of the basic access path operators, we move on to defining the cost of Smooth Scan. We calculate the cost of Smooth Scan for each mode separately. Overall result cardinality is split between the modes (Eq. (8.15)). Like the index scan, the cost of the smooth scan access is driven by selectivity. Assuming uniform distribution of the result tuples (the worst case for Smooth Scan), the number of pages containing the result is calculated in Eq. (8.16).

card = card_m0 + card_m1 + card_m2   (8.15)

#P_res = min(card, #P)   (8.16)

Mode 0: Index Scan. If the traditional index is employed prior to morphing, the I/O cost to obtain the first card_m0 tuples is identical to the cost of the index scan for the same number of tuples, hence we omit the formula. A slight difference lies in calculating the CPU cost of the operator in Mode 0 (the multiplier 2 in Eq. (8.17)), which accounts for adding tuple IDs to the Tuple ID Cache.

SS_cpu_cost_m0 = (height × log2(fanout) + card_m0 × 2) × cpu_cost   (8.17)

Mode 1: Entire Page Probe. The number of tuples for which Mode 1 is going to be employed is calculated in Eq. (8.18) (again the worst case). Every page is assumed to be fetched with a random access (Eq. (8.19)). Once Smooth Scan obtains a page, it performs a tuple comparison checking all tuples from the page (the first part of Eq. (8.20)). Before fetching the page Smooth Scan checks whether the page is already processed, and upon its processing Smooth Scan adds it to the Page Cache (the second part of Eq. (8.20)). Finally, if Smooth Scan started with the traditional index, for each tuple Smooth Scan has to perform a check whether the tuple has already been produced in Mode 0 (the third part of Eq. (8.20)). In case Smooth Scan needs to support an interesting order, the Result Cache will be used as a replacement for the Tuple ID cache functionality. In that case Smooth Scan only marks the tuple ID as a key in the cache, without copying the actual tuple as a hash value; the probe match without the actual result thus signifies that the tuple has already been produced. Thus, the CPU cost remains (roughly) the same in both cases.

#P_m1 = min(card_m1, #P)   (8.18)

SS_io_cost_m1 = #P_m1 × rand_cost   (8.19)

SS_cpu_cost_m1 = (#P_m1 × #TP + #P_m1 × 2 + #P_m1 × #TP × m0check) × cpu_cost   (8.20)

Mode 2: Flattening Access. We calculate the maximum number of pages to fetch with Mode 2 in Eq. (8.21). Notice that pages processed in Mode 1 are skipped in Mode 2. The nature of the morphing expansion in Mode 2 of Smooth Scan is described with Eq. (8.22). The solution of the recurrence equation is shown in Eq. (8.23). In this case, n is the number of times Smooth Scan expands the morphing region size (i.e., the number of times Smooth Scan performs a random I/O access) and f (n) translates to the number of pages to fetch with Mode 2 (#Pm2). The minimum number of random accesses (jumps) to fetch all pages containing the results is given by Eq. (8.25). This number is the best case scenario, when the access pattern is such that all pages are fetched with the flattening pattern without repeated accesses. The worst case scenario number of random accesses is shown in Eq. (8.26). When selectivity is low, the number of random I/O accesses is at worst equal to the number of pages that contain the results. Nonetheless, there is an upper bound to it, equal to the logarithm of the number of pages in total, since after this number all pages will be accessed.

Since both Eq. (8.25) and Eq. (8.26) converge to the same value, equal to log2(#P + 1), we use this value in the remainder of the section. The I/O cost of Mode 2 of Smooth Scan is shown in Eq. (8.27), and is equal to the cost of the number of jumps performed with a random access pattern, plus the cost of fetching the remaining pages with a sequential pattern. The CPU cost per page in Mode 2 is identical to the cost per page in Mode 1 (Eq. (8.28)).

#P_m2 = min(card_m2, #P − #P_m1)   (8.21)

f(i + 1) = 2 × f(i), i = 0..n   (8.22)

f(0) = 0, f(n) = 2^n, n >= 0   (8.23)

#P_m2 = Σ_{i=0}^{#rand_io(m2_min)} 2^i   (8.24)

#rand_io(m2_min) = ⌈log2(#P_m2 + 1)⌉   (8.25)

#rand_io(m2_max) = min(#P_m2, ⌈log2(#P + 1)⌉)   (8.26)

SS_io_cost_m2 = #rand_io(m2) × rand_cost + (#P_m2 − #rand_io(m2)) × seq_cost   (8.27)

SS_cpu_cost_m2 = (#P_m2 × #TP + #P_m2 × 2 + #P_m2 × #TP × m0check) × cpu_cost   (8.28)

Finally, the overall cost is the sum of the operator CPU and I/O costs for all employed modes (Eq. (8.29)).

SS_io_cost = SS_io_cost_m0 + SS_io_cost_m1 + SS_io_cost_m2

SS_cpu_cost = SS_cpu_cost_m0 + SS_cpu_cost_m1 + SS_cpu_cost_m2

SS_cost = SS_io_cost + SS_cpu_cost   (8.29)

The cost model enables estimating the cost of the Smooth Scan policies, in order to decide when it is time to trigger a mode change. For instance, for the SLA Driven strategy the overall operator cost is defined by an SLA. Based on that cost, Eq. (8.29) yields the cardinality, i.e., the triggering point at which Smooth Scan must start morphing, calculated for the worst case scenario (selectivity 100%).
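To make the use of the model concrete, the sketch below evaluates a simplified instance of Eqs. (8.16)–(8.29) for the Eager strategy, attributing all work to the flattening Mode 2 (card_m0 = 0 and, for simplicity, card_m1 = 0 as well); both this mode split and the parameter values are assumptions made purely for illustration.

/* Simplified evaluation of the Smooth Scan cost model (Eqs. 8.15-8.29)
 * under the Eager strategy, with all work attributed to Mode 2.
 * Parameter values and the mode split are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double nT = 10e6, nTP = 40;              /* #T and #TP                 */
    double nP = ceil(nT / nTP);              /* #P                         */
    double rand_cost = 10, seq_cost = 1, cpu_cost = 1e-5;

    for (int e = -4; e <= 0; e++) {
        double sel     = pow(10.0, e);
        double card    = sel * nT;                               /* Eq. 8.8  */
        double p_res   = fmin(card, nP);                         /* Eq. 8.16 */
        double p_m2    = p_res;                                  /* Eq. 8.21 with card_m1 = 0 */
        double rand_io = fmin(p_m2, ceil(log2(nP + 1)));         /* Eq. 8.26 */

        double io  = rand_io * rand_cost + (p_m2 - rand_io) * seq_cost;  /* Eq. 8.27 */
        double cpu = (p_m2 * nTP + p_m2 * 2) * cpu_cost;                 /* Eq. 8.28, m0check = 0 */
        printf("sel=%7.4f  SS_cost=%12.0f\n", sel, io + cpu);            /* Eq. 8.29 */
    }
    return 0;
}

Comparing these costs against the SLA budget for the worst-case selectivity (100%) gives the cardinality at which morphing must be triggered under the SLA Driven strategy.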

8.6 Competitive Analysis

This section presents a competitive analysis comparing the Smooth Scan operator against the optimal decisions throughout the entire selectivity interval. The maximum ratio between the cost of Smooth Scan and the optimal solution over the entire selectivity interval, given by Eq. (8.30), denotes the Competitive Ratio (CR). This number shows the maximum discrepancy from the optimal solution. The competitive ratio is a viable metric when considering robustness, since it captures the worst case suboptimality. We first consider Smooth Scan with the Greedy Policy, according to which the operator increases the morphing region size after each index access. Then, we consider the Selectivity Increase (SI) Driven policy, which increases the morphing region size as a response to an observed local selectivity increase. Lastly, we consider the refinement introduced with the Elastic Policy, according to which the morphing region expansion is performed only in the dense regions of the data set, while Smooth Scan decreases the morphing region size in sparse regions.

CR = max( SS_cost(sel) / OP_optimal(sel) ), sel ∈ [0, 100%]
   = max( SS_cost / min(IS_cost, FS_cost, Oracle) )   (8.30)

8.6.1 Greedy Policy

The worst case scenario for the Greedy policy is when everything Smooth Scan obtains with the flattening access pattern is useless, i.e., it does not contain any tuple contributing to the result set. In this case, the number of fault pages (pages which do not contain result tuples) is maximized. This can happen when the next result tuple is always one page ahead of the current morphing region. The order does not have to be such that the page is strictly ahead, but without loss of generality we assume this scenario; moreover, to cover the most adversarial behavior, we consider the index accesses between morphing regions to be entirely random.

Figure 8.7a depicts this scenario. Squares with striped lines denote pages containing results, while empty squares denote fault pages (i.e., page misses). Below each figure describing the result distribution pattern, we show the number of page hits (numerator) per morphing region size (denominator). Greedy Smooth Scan is least effective when the number of page hits is equal to the maximum number of (random) jumps distributed over the entire table (depicted in Figure 8.7a). As selectivity increases above this number, Smooth Scan’s number of I/O accesses remains constant since all pages of the table have already been accessed, and thus Smooth Scan only benefits from a further selectivity increase. Therefore, the worst case performance of Smooth Scan occurs when the cardinality is equal to the number of random jumps (Eq. (8.31)). Eq. (8.32) shows the cost of Smooth Scan for this scenario.

card = #randio = log2(#P + 1) (8.31)

SScost = #randio × randcost + (#P − #randio) × seqcost (8.32)

CR = SScost / min(#randio × randcost, #P × seqcost) (8.33)


Figure 8.7: The Worst Case Result Distributions for Smooth Scan alternatives: (a) the worst case distribution for Greedy Smooth Scan (page hit ratios 1/2, 1/4, 1/8, 1/16, 1/32, ...); (b) the worst case distribution for Selectivity Increase Smooth Scan (1/2, 3/4, 6/8, then 1/16 per region); (c) the highest page miss rate for Elastic Smooth Scan (1/2, 3/4, 6/8, 1/16, 1/8, 1/4, 1/2); (d) the worst case distribution for Elastic Smooth Scan (1/2 per region).

Figure 8.8: The competitive analysis of the Greedy policy for the highest page miss rate: (a) CR against Index Scan, (b) CR against Full Scan, (c) Full vs. Index Scan (cost), (d) CR against the theoretical bound, all as a function of the number of pages.

To calculate the competitive ratio we first consider two alternatives: A) Index Scan is the optimal solution, and B) Full Scan is the optimal solution, both as a function of the table size (the number of pages in the table). Then, we compare Smooth Scan against a theoretical bound: an Oracle that fetches only the pages containing results, with a sequential access pattern. This is a purely theoretical bound that gives the best possible performance.²

A) Index Scan optimal solution. Assuming Index Scan is the optimal solution, Eq. (8.33) becomes:

CR = 1 + ((#P − #randio) × seqcost) / (#randio × randcost) (8.34)

2 The Oracle mimics the behavior of Sort Scan, while ignoring the sorting overhead.


In this case, the CR is a monotonically increasing sublinear function that, for randcost = 10 and seqcost = 1 (which corresponds to the characteristics of contemporary HDD) and #P >= 500, starts from a degradation of a factor of 5 and increases to a factor of 72 for #P = 10^4. The CR as a function of the table size is depicted in Figure 8.8a.

B) Full Scan optimal solution. Assuming Full Scan is the optimal solution, Eq. (8.33) becomes:

CR = 1 + (#randio × (randcost − seqcost)) / (#P × seqcost) (8.35)

This is a monotonically decreasing function that, for randcost = 10 and seqcost = 1 and #P >= 500, starts from a CR of 1.16 (i.e., 16% overhead compared to the optimal solution) and reaches 4% for #P = 10^4 (shown in Figure 8.8b). This is corroborated in our experiments, which show that Smooth Scan adds an overhead of at most 20% when compared to Full Scan. For Solid State Drives (SSD) (randcost = 2 and seqcost = 1), this value decreases even further (due to the lower discrepancy between random and sequential I/O), starting with an overhead of only 7%.

When is A < B. To find the cheaper alternative, Figure 8.8c shows the costs of the full scan and the index scan for the case when the number of qualifying tuples is equal to the number of random jumps (the worst case scenario depicted in Figure 8.7a):

#randio × randcost < #P × seqcost (8.36)

log2(#P + 1) × randcost < #P × seqcost (8.37)

For #P >= 60, randcost = 10 and seqcost = 1 the inequality above holds, which means that for tables larger than 60 pages, Index Scan is the optimal solution for this scenario; this unfortunately puts a high soft bound on the worst case performance.

A similar sublinear function (with higher degradation) appears when comparing Smooth Scan against the optimal Oracle solution that fetches only the needed pages with a sequential pattern (see Figure 8.8d). The CR of Smooth Scan compared to the Oracle starts with a factor of 64 for 500 pages and reaches 760 for #P = 10^4.
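The three ratios above can be evaluated directly from Eqs. (8.32)-(8.35). The sketch below does so under the simplification #randio = log2(#P + 1); the function name is illustrative, and the constants quoted in the text are read off the thesis plots, so the printed values will not match them exactly.

import math

RAND, SEQ = 10.0, 1.0   # HDD-like unit costs used throughout this section

def greedy_worst_case_cr(num_pages):
    """Worst-case CR of Greedy Smooth Scan against Index Scan, Full Scan, and the Oracle."""
    rand_io = math.log2(num_pages + 1)                   # Eq. (8.31): card = #randio
    ss = rand_io * RAND + (num_pages - rand_io) * SEQ    # Eq. (8.32)
    cr_index = ss / (rand_io * RAND)                     # Eq. (8.34): Index Scan optimal
    cr_full = ss / (num_pages * SEQ)                     # Eq. (8.35): Full Scan optimal
    cr_oracle = ss / (rand_io * SEQ)                     # Oracle reads only the result pages
    return cr_index, cr_full, cr_oracle

for pages in (500, 10_000):
    print(pages, [round(cr, 2) for cr in greedy_worst_case_cr(pages)])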

Discussion. From the competitive analysis we see that Greedy Smooth Scan is not a viable option for low selectivity since it can introduce significant overhead due to the high number of fault pages that this policy might fetch unnecessarily.


8.6.2 Selectivity Increase Driven Policy

The Selectivity Increase Driven Policy uses selectivity to drive morphing, i.e., every time a local selectivity increase is noticed, the size of the morphing region is increased. Figure 8.7b depicts the worst case distribution for this policy. With the selectivity increase driven policy, an initial high selectivity can mislead Smooth Scan into keeping a large region size (e.g., in Figure 8.7b a morphing region of size 16 is kept throughout the rest of the operator’s lifetime).

To increase the morphing region size, SI Smooth Scan has to observe a selectivity over the last morphing region that is higher than the selectivity seen so far (calculated in Eq. (8.1) and Eq. (8.2)). A minimal selectivity sequence that triggers the morphing region size increase is 1/2, 3/4, 6/8, 12/16, ..., 3 × 2^(i−2)/2^i, where the denominator denotes the size of the current morphing region and the numerator denotes the number of pages containing results in this region. Eq. (8.38) calculates the number of pages containing results needed to trigger such behavior. After performing the morphing region expansion x times, to maximize the number of fault pages the remaining y morphing regions each contain a single match. The total number of accessed pages is shown in Eq. (8.39). In the following equations we replace y with Eq. (8.40) (derived from Eq. (8.39)). Since the total cost of Smooth Scan depends on both x and y, and since we can express y as f(#P, x), in Figure 8.9 we plot the Competitive Ratio against the Oracle as a function of x and #P. Similar to the previous results, we plot the CR for the characteristics of HDD (randcost = 10 and seqcost = 1).

For this scenario the CR is a monotonically increasing sublinear function that reaches a value of 100 for 100K pages at the peak value x = 8, i.e., for 8 morphing increase steps. We have experimented with higher page counts, for which we noticed a higher absolute value of the CR with the x peak shifted to the right. This is expected, since with more pages the morphing region size can grow to a higher value, which requires more steps. Nonetheless, the overall trend is similar. The CR is a monotonically increasing sublinear function, which puts a soft upper bound on the worst case performance of SI Smooth Scan. The same trend is observed in the case of SSD; the only difference is that the equidistant contours are a bit thinner.

Discussion. Similar to the Greedy Policy, there are cases when the selectivity increase driven policy cannot provide robust, near-optimal performance, since as seen from the analysis above the discrepancy from the optimal solution can be high.


Figure 8.9: Competitive analysis of Selectivity Increase Smooth Scan when compared against the Oracle: (a) CR against the Oracle, (b) equidistant contours.

#Pres = 1 + Σ_{i=0..x−2} 3 × 2^i + Σ_{i=1..y} 1 = 1 + 3 × (2^(x−1) − 1) + y (8.38)

#P = Σ_{i=1..x} 2^i + 2^x × y = 2 × (2^x − 1) + 2^x × y = 2^x × (2 + y) − 2 (8.39)

y = (#P + 2) / 2^x − 2 (8.40)

#randio = x + y

SScost = #randio × randcost + (#P − #randio) × seqcost (8.41)

CR = SScost / (#Pres × seqcost) (8.42)
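A small sketch of how the CR surface in Figure 8.9 follows from Eqs. (8.38)-(8.42); the variable names are illustrative and the cost constants are the HDD values from the text.

RAND, SEQ = 10.0, 1.0   # HDD-like unit costs

def si_worst_case_cr(num_pages, x):
    """CR of SI Smooth Scan against the Oracle after x morphing expansions (Eqs. (8.38)-(8.42))."""
    y = (num_pages + 2) / 2**x - 2                      # Eq. (8.40): remaining single-match regions
    if y < 0:
        return None                                     # x too large for this table size
    pages_res = 1 + 3 * (2**(x - 1) - 1) + y            # Eq. (8.38): pages containing results
    rand_io = x + y                                     # one random jump per region
    ss = rand_io * RAND + (num_pages - rand_io) * SEQ   # Eq. (8.41)
    return ss / (pages_res * SEQ)                       # Eq. (8.42): Oracle reads only result pages

# For a 100K-page table the ratio peaks at x = 8 expansion steps, as reported above.
peak_x = max(range(1, 16), key=lambda x: si_worst_case_cr(100_000, x) or 0)
print(peak_x)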

8.6.3 Elastic Policy

Elastic Policy follows the selectivity pattern of the access, i.e., it increases the morphing region size in the dense regions, and decreases it back in the sparse regions.

High page miss rate of Elastic Smooth Scan. In order to increase the morphing region size, Smooth Scan has to notice the same selectivity increase pattern as the one described in Eq. (8.38).


Figure 8.10: Competitive analysis of Elastic Smooth Scan for the worst case of the Selectivity Increase driven policy: (a) CR against the Oracle, (b) equidistant contours.

The behavior of Smooth Scan, however, differs in this case: after noticing the selectivity drop, Elastic Smooth Scan progressively decreases the morphing size until it reaches the value of 1 page. Therefore, Elastic Smooth Scan performs the region morphing increase x times and the region morphing decrease x times, after which it continues with a morphing region size of 1 for the (y − x) remaining tuples (assuming no local selectivity increase is noticed again). Eq. (8.43) calculates the total number of pages accessed.

#P = Σ_{i=1..x} 2^i + Σ_{i=0..x} 2^i + (y − x) = 2 × (2^x − 1) + 2^(x+1) − 1 + y − x = 2^(x+2) − 3 + y − x (8.43)

#randio = x + y

SScost = #randio × randcost + (#P − #randio) × seqcost (8.44)

CR = SScost / (#Pres × seqcost) (8.45)

Figure 8.10 shows the CR against the Oracle for the scenario from which the Selectivity Increase driven policy suffers, for the characteristics of HDD, shown as a function of x and y (#P can be derived from Eq. (8.43)). The CR is a monotonically decreasing function that, from an initial value of 10 for one random access, converges to a factor of 2 for x > 10 (which will be the case in practice). From this experiment we see that Elastic Smooth Scan has an expected CR of 2 for the scenario in which SI Smooth Scan only has a soft upper bound, hence it is a better alternative.


The highest number of page misses occurs when the distribution is such that the number of result pages in each morphing region over one half of the table is just enough to trigger the expansion; after visiting this half, the selectivity drops sharply, with only one result page per remaining (shrinking) region. Figure 8.7c depicts such a distribution. We calculate the CR for this scenario. Our analysis shows a theoretical bound against the Oracle of 2.45 for 100 pages that decreases to 2.0001 for 3M pages, which corroborates our previous analysis.

Worst case CR for Elastic Smooth Scan. The previous experiment showed the worst scenario with respect to the number of fault page reads. Nonetheless, this is not the scenario with the worst CR. The worst case for Elastic Smooth Scan appears when the number of random I/O accesses is maximized. This happens when every second page has a result match (illustrated in Figure 8.7d). In this case, Elastic Smooth Scan keeps a morphing size of 2, since it never detects a local selectivity increase compared to the selectivity over the pages seen so far. Therefore, Smooth Scan performs #P/2 random accesses, and the same number of sequential accesses (to fetch the adjacent pages).

#randio = #P / 2 (8.46)

SScost = #randio × randcost + (#P − #randio) × seqcost (8.47)

CR = SScost / min(#P/2 × randcost, #P × seqcost) (8.48)
   = (#P/2 × (randcost + seqcost)) / min(#P/2 × randcost, #P × seqcost)
   = (randcost + seqcost) / min(randcost, 2 × seqcost)
   = 11 / 2 = 5.5

The CR is calculated in Eq. (8.48). For the characteristics of HDD, with randcost = 10 and seqcost = 1, the competitive ratio reaches 5.5 when compared to Full Scan. The same ratio decreases in the case of SSD (randcost = 2 and seqcost = 1), reaching a factor of 3. The theoretical bound in this case is 11 for HDD and 6 for SSD, and is purely driven by the ratio between the random and sequential access cost, i.e., it is constant regardless of the table size.

Discussion. Overall, Elastic Smooth Scan proves to be the most robust solution. This policy provides a firm upper bound on the suboptimality with the maximum theoretical CR of a factor of 11 in the case of HDD and a factor of 6 in the case of SSD regardless of the table size, hence we choose it as a default policy in our experiments.

A morphing increase factor higher than 2 leads to a higher Competitive Ratio. For instance, for the previous analysis, a morphing increase factor of 10 on HDD gives a competitive ratio of 19. Therefore, we use a factor of 2 as the morphing increase factor in the Smooth Scan implementation.
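The constants above can be reproduced from the per-region cost ratios, under the reading that with a morphing increase factor k the worst case alternates one random jump with k − 1 sequential page reads per region, each region holding a single result page. This generalization is an assumption made for illustration, but it is consistent with the k = 2 and k = 10 HDD numbers quoted above.

RAND, SEQ = 10.0, 1.0   # HDD-like unit costs

def elastic_worst_case_cr(k=2):
    """Worst-case ratios for a morphing increase factor k (k = 2 reproduces Eq. (8.48))."""
    ss_per_region = RAND + (k - 1) * SEQ                 # one random jump + k-1 adjacent pages
    vs_best_scan = ss_per_region / min(RAND, k * SEQ)    # best of Index Scan and Full Scan
    vs_oracle = ss_per_region / SEQ                      # Oracle reads one page per region
    return vs_best_scan, vs_oracle

print(elastic_worst_case_cr(2))    # (5.5, 11.0) on HDD, matching the bounds above
print(elastic_worst_case_cr(10))   # Oracle bound of 19, matching the factor-10 remark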

8.7 Experimental Evaluation

We now present a detailed experimental analysis of Smooth Scan. We demonstrate that Smooth Scan achieves robust performance in a range of synthetic and real workloads without the need for accurate statistics, while existing approaches fail to do so. Furthermore, Smooth Scan proves to be competitive with existing access paths throughout the entire selectivity range, making it a viable replacement option.

8.7.1 Experimental Setup

Software. Smooth Scan is implemented inside PostgreSQL 9.2.1 DBMS. To demonstrate the problem of robustness presented in Section 8.1 we use a state-of-the-art commercial DBMS we refer to as DBMS-X.

Benchmarks. We use two sets of benchmarks to showcase algorithm characteristics: a) for stress testing we use a micro-benchmark, and b) to understand the behavior of the operators in a realistic setting we use the TPC-H benchmark SF 10 [240].

Hardware. All experiments are conducted on servers equipped with 2 x Intel Xeon X5660 Processors, @2.8 GHz (with L1 32KB, L2 256KB, L3 12MB caches), with 48 GB RAM, and 2 x 300 GB 15000 RPM SAS disks (RAID-0 configuration) with an average I/O transfer rate of 130 MB/s, running Ubuntu 12.04.1. In all experiments we report cold runs; we clear database buffer caches as well as OS file system caches before each query execution.

8.7.2 TPC-H analysis

TPC-H in DBMS-X. In Figure 8.1 in Section 8.1, we demonstrated the severe impact of suboptimal index choices on the overall TPC-H workload. For this experiment, we used the tuning tool provided as part of DBMS-X, with a 5GB space allowance (1/2 of the data set size), to propose a set of indexes estimated to boost the performance of the TPC-H workload. In queries Q12 and Q19, the presence of indexes favors a nested loop join when the number of qualifying tuples in the outer table is significantly underestimated; this leads to a significant increase in random I/O to access tuples from the index (“table look-up”), which in turn results in severe performance degradation (factors of 400 and 20 respectively). In both cases the access path operator choice is the only change compared to the original plan, i.e., the join ordering stays the same. Smaller degradations as a result of a suboptimal index choice followed by join reordering occur in several other queries (Q3, Q18, Q21), resulting in an overall workload performance degradation by a factor of 22.


Figure 8.11: Improving performance of TPC-H queries with Smooth Scan (execution time, split into CPU and I/O, of plain PostgreSQL vs. PostgreSQL with Smooth Scan for Q1 (98%), Q4 (65%), Q6 (2%), Q7 (30%), and Q14 (1%)).

Table 8.2: I/O Analysis

                    Q1           Q4           Q6           Q7           Q14
                 pSql  Sm.S   pSql  Sm.S   pSql  Sm.S   pSql  Sm.S   pSql  Sm.S
#I/O Req. (K)     71    77    225   235    566    95    745   124    416    87
Read data (GB)    8.9  10.2   10.9  12.1   8.7    8.8   11.6  11.6   6.8    8.9

Improving performance with Smooth Scan. We now demonstrate the significant benefit that Smooth Scan brings to PostgreSQL compared to the optimizer’s chosen alternatives when running TPC-H queries. Since PostgreSQL does not have a tuning tool, we create the set of indexes proposed by the commercial system from the previous experiment (on the same workload). Figure 8.11 shows the results for 5 interesting TPC-H queries that cover selectivities from both ends of the spectrum. These queries represent “choke points” for testing data access locality [32]. The query execution plans are given in Appendix A.1. The percentage in brackets next to each query ID shows the selectivity of the query. Q1 and Q6 are single-table selection queries, with selectivities of 98% and 2% respectively. Q4 and Q14 are two-table join queries with two selectivity extremes (65% and 1% respectively) when considering the LINEITEM table. The performance greatly depends on the selectivity of this table, since it is the largest. Lastly, we run Q7, a 6-table join. Since Smooth Scan trades CPU utilization for I/O cost reduction, we show the execution breakdown through CPU utilization and I/O wait time (i.e., the blocking I/O in the critical path of execution). Similarly, in Table 8.2 we show the number of I/O requests issued by the operators, together with the amount of data transferred from disk.

Figure 8.11 shows that PostgreSQL with Smooth Scan avoids extreme degradation and achieves good performance for all queries. For instance, while plain PostgreSQL suffers in Q6 due to a suboptimal choice of an index scan, PostgreSQL with Smooth Scan maintains good performance, preventing a degradation of a factor of 10.

Q6 selects 2% of the data, which in the case of the index scan causes 566K random I/O accesses over the LINEITEM table (shown in Table 8.2). By flattening (i.e., accessing adjacent pages) and avoiding repeated accesses, Smooth Scan reduces this number to 95K, which results in much better performance.

On the other hand, in query Q1 with selectivity of 98% the plain PostgreSQL chooses Sort Scan (also called Bitmap Heap Scan), which is an optimal path. However, even in this case Smooth Scan introduces only a marginal overhead; it quickly realizes that the result selectivity is high and adjusts the execution by forcing sequential accesses. As a result, Smooth Scan adds an overhead of only 14% over the optimal behavior. This overhead is due to periodical random I/O accesses when following pointers from the index, which increased the number of I/O requests for disk pages from 71K to 77K.

In Q4, the selectivity of the LINEITEM table is 65%, and PostgreSQL chooses a full scan as the outer input of a nested loop join with a primary key look-up as the inner input. Although Smooth Scan starts with an index look-up on the outer table, it quickly adjusts its access pattern, morphing toward a sequential scan, and adds less than 1% overhead over the optimal solution.

In contrast, the selectivity of the LINEITEM table in Q14 is around 1%. Both plain PostgreSQL and our implementation start with an index scan as the outer input, joined with ORDERS using an index nested loop join (a primary key look-up). Unlike the index scan, which issues 416K I/O requests, Smooth Scan issues only 87K requests, which translates to a performance improvement of a factor of 8. In both join queries, Smooth Scan does not perform any additional page fetching over the inner tables, since for each probe there is a single match; thus there is no need for additional adjustments, which Smooth Scan correctly detects.

Lastly, an index choice for plain PostgreSQL over the LINEITEM table for a 6-way join in Q7 hurts performance by a factor of 7 compared to the performance of Smooth Scan.

Discussion. The memory structures of Smooth Scan span a couple of MB in these experiments. For illustration, the Page ID cache for LINEITEM occupies 140KB (for 1M (10^6) pages). Although Smooth Scan can transfer larger amounts of data from disk than the original access path (see Table 8.2), its benefit comes from exploiting access locality and issuing fewer I/O requests. Overall, Smooth Scan provides robust behavior without requiring accurate statistics. It brings significant gains when the original system makes a suboptimal decision, and only marginal overheads over optimal decisions.

8.7.3 Fine-grained analysis over the entire selectivity range

This section provides a performance comparison of Smooth Scan against Full Scan, Index Scan and Sort Scan. In order to demonstrate the robust behavior of Smooth Scan, a micro-benchmark is used to stress test the various access paths. All experiments are run on top of our extension of PostgreSQL, thus Full Scan, Index Scan and Sort Scan are the original PostgreSQL access paths.


Figure 8.12: Smooth Scan vs. alternative access paths for a query (a) with and (b) without an order by clause (execution time vs. result selectivity for Full Scan, Index Scan, Sort Scan, and Smooth Scan).

The micro-benchmark consists of a table with 10 integer columns randomly populated with values from the interval 0 − 10^5. The first column is the primary key identifier, and is equal to the tuple order number. The table contains 400M (4 × 10^8) tuples and occupies 25GB of disk space across 3M (3 × 10^6) pages of 8KB each (PostgreSQL’s default page size). In addition to the primary key, a non-clustered index is created on the second column (c2). We run the following query:

Q1: select * from relation where c2 >= 0 and c2 <= X; (the value of X controls the result selectivity)

Supporting an interesting order. In this experiment, we show that Smooth Scan maintains tuple ordering and hence outperforms the other alternatives for queries (or sub-plans) that require tuples in a particular order. Figure 8.12a shows the performance of all alternative access paths for a query with an order by clause. The performance of Index Scan degrades quickly due to repeated and random I/O accesses. For selectivity 0.1% its execution time is already 10 times higher than that of Full Scan, reaching a factor of more than 100 for 100% selectivity. Sort Scan solves the problem of repeated and random accesses while fetching only the heap pages that contain results; therefore, it is the best alternative for selectivity below 1%. Nonetheless, its sorting overhead to restore the proper ordering grows, and for selectivity above 2.5% it is no longer beneficial. Smooth Scan sits between the alternatives when selectivity is below 2.5%, and achieves the best performance for selectivity above this level. This is due to avoiding the overhead of a posterior sort of the tuples to produce results in the interesting order, from which Full Scan and Sort Scan suffer.


Without an interesting order. Figure 8.12b shows the performance of the access paths when executing Q1 without the order by clause. For selectivity between 0 and 2.5% the behavior of the operators is the same as in the previous experiment. For higher selectivity, however, Full Scan is the best alternative, since it performs a purely sequential access. Both Sort Scan and Smooth Scan, however, manage to maintain good performance. The overhead of Sort Scan is attributed to the pre-sort phase over the tuples obtained from the index; after that, the access is nearly sequential, as page IDs are monotonically increasing. Smooth Scan does not suffer from the sorting overhead, but it does suffer from periodic random I/O accesses driven by the index probes, adding less than 20% overhead compared to Full Scan for 100% selectivity. A different behavior is observed when the experiment is run on an SSD (shown in Figure 8.20), where Smooth Scan benefits much more than Sort Scan (by a factor of 3).

Discussion. Smooth Scan bridges the gap between existing access paths. Its performance does not degrade when selectivity increases, as in the case of Index Scan. This is particularly important in real-life scenarios, where a degradation of Index Scan causes performance drops of several orders of magnitude [108]. At the same time, Smooth Scan does not pay the cost of Full Scan to select just a few tuples, which is important for point queries for which Full Scan is impractical. When no order is imposed, the absolute performance of Smooth Scan is comparable to that of Sort Scan; nonetheless, the benefit of Smooth Scan becomes visible when considering its placement in the query plan. Unlike Sort Scan, Smooth Scan adheres to the pipelining model, which is important since access path operators are executed first and can stall the rest of the stack. In the experiments, Smooth Scan’s Competitive Ratio reaches a maximum value of 2 over the optimal solution, for the case when selectivity is below 0.01%. To put the absolute numbers in perspective, in our experiment a maximal overhead of 60 seconds is paid to prevent a worst case performance degradation of 11 hours. In decision support systems characterized by long running queries, this overhead is likely to be tolerated as a robustness guarantee against the severe performance drops that frequently happen due to data correlations and skew.

8.7.4 Sensitivity analysis of Smooth Scan

We now study the parameters that affect the performance of Smooth Scan such as the impact of its morphing modes, policies, and strategies. We show the bookkeeping overhead and study the Smooth Scan effect on HDD versus SSD. For all experiments in this section, unless stated otherwise, we use Q1 from the micro-benchmark without an order by clause.

Impact of the entire page probe mode. The pointer chasing of non-clustered indexes when performing tuple look-ups generally hurts performance as selectivity increases. Figure 8.13 depicts the improvement that Smooth Scan achieves by removing repeated accesses when executing query Q1 from the micro-benchmark. The Smooth Scan curve denoted ’Entire Page Probe’ morphs only up to Mode 1. Smooth Scan improves performance by a factor of 10 compared to Index Scan for selectivity 100%.


Figure 8.13: Sensitivity analysis of Smooth Scan modes (execution time vs. result selectivity for Full Scan, Index Scan, Smooth Scan (Entire Page Probe), and Smooth Scan (Flattening Access)).

Figure 8.14: Maximum morphing region size (num. of pages): execution time for maximum region sizes of 1000 to 5000 pages.

The performance of Smooth Scan degrades with selectivity increase up to 1%; this is the point when approximately all pages have been read. With 120 tuples per page (64-byte tuples in 8KB pages) and a uniform distribution, we expect one tuple from each page to qualify. After that point the execution time stays nearly flat, with an increase of 20% at 100% selectivity, showing that the overhead of reading the remaining tuples from a page is dominated by the time needed to fetch the page from disk. The execution time of Smooth Scan when morphing only up to Mode 1 is, however, still significantly higher (by a factor of 14) than Full Scan at 100% selectivity. This is due to the discrepancy between random and sequential page accesses, the former being an order of magnitude slower in the case of HDD.

Impact of the flattening access mode. To alleviate the random access problem, Smooth Scan employs Mode 2+ (shown in Figure 8.13 as the ’Flattening Access’ curve). By fetching adjacent pages Smooth Scan amortizes access costs at the expense of extra CPU cost to go through all the fetched data. Smooth Scan with Flattening Access is not only much better than Index Scan (by a factor of 115) but also nearly approaches the behavior of Full Scan; in the worst case of selectivity 100% Smooth Scan is only 20% slower than Full Scan.

Maximum morphing region size. We perform a sensitivity analysis on the maximum number of adjacent pages up to which Smooth Scan expands the morphing region. The experiment varies this number from 1000 up to 5000 pages; Figure 8.14 shows the query execution times for 3 cases, when selectivity is 1%, 10% and 100%. We performed a fine-grained analysis over the entire selectivity range, and the results followed the same trend, hence for clarity we show only these 3 selectivity points. The experiments show that 2000 pages is optimal, which translates to a morphing region size of 16MB. Thus, we keep 2000 as the maximum morphing region size for the rest of the experiments.

Impact of policy choices. We plot the impact of policy choices in Figure 8.15. The Greedy policy morphs with each index probe, and hence converges to the full scan faster than the other policies.


For lower selectivity, the Selectivity Increase and Elastic policies introduce less overhead than Greedy since they fetch fewer adjacent pages, i.e., more pages need to be seen for the morphing region size to increase. This particularly holds for the Elastic policy, which adjusts the morphing size depending on the selectivity of the fetched regions. Since it is the most adaptive to changes in the operator selectivity, we favor it in the rest of the experiments.

Figure 8.15: Morphing policies (execution time vs. result selectivity for the Greedy, Selectivity Increase, and Elastic policies).

Figure 8.16: Triggering choices (execution time vs. result selectivity for the Eager, Optimizer Driven, and SLA Driven strategies).

Impact of trigger choices. Figure 8.16 plots the impact of the triggering strategy choices. The Eager strategy starts immediately with Smooth Scan; in this case we plot Elastic Smooth Scan. The Optimizer Driven strategy starts with the traditional index and changes to Smooth Scan after 15K tuples (the optimizer’s estimated cardinality), causing the increase in execution time at selectivity 0.005%. After the shift to Smooth Scan, for this experiment we continue with the Selectivity Increase Driven policy. The overhead of the Optimizer Driven strategy increases for higher selectivity compared to the Eager strategy, and is attributed to a check performed for each tuple produced by Smooth Scan and to additional repeated accesses of the pages visited before the Smooth Scan behavior is triggered. On the other hand, the initial execution time is lower than with the Eager strategy due to fewer page accesses. Similar behavior is observed with the SLA Driven triggering strategy, with a sharper cliff at the 0.009% point, since with this strategy we switch immediately to Greedy. For this experiment we set an upper performance bound equal to the cost of 2 full scans as the SLA constraint; the calculated bound is shown as the dashed line in Figure 8.16. According to the model, the morphing triggering point is 32K tuples, which guarantees an execution time just slightly below the SLA bound for 100% selectivity.

Overall, the Eager strategy strikes a balance in terms of overall performance, hence we favor it as the strategy of choice in the remaining experiments.


However, in an environment where respecting an SLA is the main priority, or where Smooth Scan serves as a means of fixing suboptimal decisions, the SLA or Optimizer Driven strategies are viable alternatives. We can easily turn a strategy knob depending on the application requirements.

Figure 8.17: Handling skew: (a) execution time (1% selectivity), (b) number of read pages, for Full Scan, Index Scan, SI Smooth Scan, and Elastic Smooth Scan.

Adjusting to skewed distributions. Smooth Scan has demonstrated the ability to prevent execution time blow-up due to selectivity increase when tested on uniform distributions of result tuples. Many modern applications, however, exhibit non-uniform data distributions (e.g., stock markets, internet networks [38, 122]). For these applications a single execution strategy is not likely to serve the entire table optimally. We show that Smooth Scan adapts well to a skewed distribution of values across pages. We use the Elastic policy and compare it against the Selectivity Increase (SI) policy.

We use a table with 1.5B tuples and 10 integer columns (random values from [0, 10^5]) that occupies 100GB, and create a secondary index on the second column (c2). The first 15M tuples have c2 = 0; afterwards, another 0.001% of randomly chosen tuples have value 0. The result selectivity is slightly above 1%, with most of the tuples coming from the pages placed at the beginning of the relation heap, i.e., we read all tuples where c2 = 0.

Figure 8.17a plots the execution time of Index Scan, Full Scan, Selectivity Increase and Elastic Smooth Scan; Figure 8.17b plots the number of distinct pages fetched to answer the query. From Figure 8.17b one can see that Selectivity Increase Smooth Scan fetches 56 times more pages than Elastic Smooth Scan, and is 5 times slower. The large number of pages is due to the initial skew; Selectivity Increase Smooth Scan notices the high selectivity increase at the beginning and, in order to reduce the potential degradation, continues fetching big chunks of sequentially placed pages, ultimately fetching 8.8M out of 12.5M pages. On the contrary, after the dense region, Elastic Smooth Scan decreases the morphing step, quickly converging back to the access of a single page per probe, and ends up fetching only 150K pages. This number is close to the number of pages accessed by Index Scan, which fetched 140K pages. The severe impact of random I/O is not seen for Index Scan, since in this experiment the index key follows the page placement on disk.


Figure 8.18: Analysis of auxiliary data structures: (a) Result Cache analysis (cache overhead and cache hit rate vs. result selectivity), (b) morphing accuracy vs. result selectivity.

From the experiment, one can observe that Elastic Smooth Scan continues to provide near-optimal performance despite the significant initial skew. This is particularly important for long-running queries over big data, where data distributions tend to be non-uniform [150]. Approaches that employ a single execution strategy, or that briefly run multiple alternatives and stop all but the winning one, are likely to make a mistake and not be able to benefit from this density discrepancy. Elastic Smooth Scan, however, seamlessly adjusts its behavior to fit the data distribution.

The overhead of auxiliary data structures. To avoid repeated page accesses, Smooth Scan in PostgreSQL uses the data structures described in Section 8.4. We now show the bookkeeping overhead of these structures and their usability rate, demonstrated on Q1 from the micro-benchmark with an ORDER BY clause.

Figure 8.18a shows that the Result Cache adds a maximal overhead of 14% when storing all result matches in the cache (shown as blue bars). At the same time, the Result Cache hit rate, calculated as the ratio between the number of tuple requests served from the cache and the total number of tuple requests, reaches 100% at 1% selectivity. Figure 8.18b shows that the morphing accuracy, calculated as the ratio between the number of pages containing result matches and the total number of pages checked during Smooth Scan morphing, improves after 1%, reaching 100% at 2.5% selectivity. The overhead of Page ID checks remains significantly below 1% in all our experiments, hence we do not show it separately.

Memory sensitivity of Result Cache. Since the Result Cache is the largest data structure, we perform a sensitivity analysis of its memory footprint. We run Q1 from the micro-benchmark, varying the Result Cache size from 2.5% to 100% of the table size. The table occupies 25GB and stores 400M tuples, and the query has selectivity 100% throughout the entire experiment.


Figure 8.19: Memory sensitivity of Result Cache (normalized execution time and number of partitions as a function of the memory size, expressed as a percentage of the table size).

To see the overhead when partitions are spilled to disk, Figure 8.19 plots the execution time normalized with respect to the execution time when the Result Cache completely resides in memory, i.e., when no spilling occurs. As one can see from the graph, Smooth Scan is quite resilient to the memory size, adding only 37% overhead when the memory size is 2.5% of the table size (i.e., only 625MB) compared to the case when all data stays in memory (i.e., no partitioning occurs). For a memory size of 2.5%, Smooth Scan builds 50 partitions in total (shown by the black line in Figure 8.19). Moreover, Smooth Scan is quite resilient to the number of partitions created. For instance, Smooth Scan adds only 3% additional overhead for creating 50 partitions compared to 14 partitions for the case when the memory size is 10% of the table size. The biggest overhead increase, of 20%, is between 100% and 80%, and is attributed to disk access. Once data resides on disk (as it does in all other cases), the performance remains steady across different numbers of partitions, since Smooth Scan benefits from sequential access when fetching partitions.

8.7.5 Smooth Scan on SSD

Given the different access costs of solid state disks (SSD), their better random access performance, and the forecasts of their potential replacement of HDD [110], we now stress test Smooth Scan on SSD. We use a solid state disk OCZ Deneva 2C Series SATA 3.0 with an advertised read performance of 550MB/s (offering 80kIO/s of random reads). We use query Q1 from the micro-benchmark without an order by clause and compare Smooth Scan against the existing access operators.

Figure 8.20 demonstrates that Smooth Scan benefits even more from solid state technology than from hard disks (shown in Figure 8.12). SSD are well known for removing the mechanical limitations of disks, which enables them to achieve better random I/O performance. Our analysis for the hardware used in this work shows that random I/O accesses are two times slower than sequential accesses on SSD, while this discrepancy reaches a factor of 10 in the case of HDD.


Figure 8.20: Smooth Scan on SSD (execution time vs. result selectivity for Full Scan, Index Scan, Sort Scan, and Smooth Scan).

This difference makes Index Scan (and Smooth Scan) more beneficial on SSD than on HDD. In our experiments, Index Scan on HDD is beneficial only for selectivity below 0.01%, while on SSD this range extends to 0.1%. For higher selectivity, Index Scan on SSD still loses against the other alternatives, since it suffers from repeated accesses and cannot benefit from the flattening pattern. Consequently, Index Scan is slower than Smooth Scan by a factor of 30 at 100% selectivity. Interestingly, Sort Scan loses against Smooth Scan for selectivity above 0.1% (even without an imposed order), since the pre-sort overhead to obtain page IDs cannot be fully hidden given the faster I/O.

Discussion. Smooth Scan favors SSD over HDD, since the occasional random jumps when following index pointers do not hurt performance as much as the overhead Sort Scan pays to presort tuples. Smooth Scan is faster than Full Scan for selectivity below 20%, and is only 10% slower at 100% selectivity. The smaller gap between random and sequential I/O and the decreased SSD latency thus make Smooth Scan a promising solution for the future.

8.7.6 Cost model analysis

The cost model of Smooth Scan allows us to predict the performance of different policy choices or to trigger mode shifts to meet SLA requirements. In this experiment we show that the estimates of the analytical model we derived are corroborated by the actual measured performance.

Figure 8.21a and Figure 8.21b show the cost behavior of Full Scan, Index Scan, and Smooth Scan based on the analytical cost model derived in Section 8.5, shown as a function of increasing selectivity. The y-axis shows the cost in I/O units (i.e., unit 1 corresponds to one sequential I/O). We model the costs for a table with 400M tuples from the micro-benchmarks. For the page size we take the value of 8KB; for the tuple size we assume 64 bytes (40 bytes of data plus the overhead for the tuple header), and for the key size we use 16 bytes.

We assume a uniform distribution of result tuples, and approximate the number of random I/O accesses for Mode 2 with log2(#P + 1). Finally, for seqcost we use 1, for randcost we use 10, and for cpucost we use 10^−6 (one I/O translates to 1M CPU cycles). For clarity, we separately show the behavior of the operators when selectivity is between 0 and 1%, since with increasing selectivity both Full Scan and Smooth Scan converge to the same value and hence overlap.
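The sketch below mimics this setup; the Full Scan and Index Scan formulas are simplified stand-ins for the Section 8.5 cost functions (a pure sequential scan, respectively one random page access per qualifying tuple), so the absolute numbers are only indicative of the trends in Figure 8.21.

import math

TUPLES, PAGES = 400_000_000, 3_000_000     # micro-benchmark table
RAND, SEQ = 10.0, 1.0                      # I/O units, as above

def full_scan_cost(sel):
    return PAGES * SEQ                      # simplified: always read every page sequentially

def index_scan_cost(sel):
    return sel * TUPLES * RAND              # simplified: one random page access per result tuple

def smooth_scan_cost(sel):
    pages = min(PAGES, sel * TUPLES)        # pages touched under a uniform distribution
    rand_io = min(pages, math.log2(PAGES + 1))   # Mode 2 random jumps, approximated as in the text
    return rand_io * RAND + (pages - rand_io) * SEQ

for sel in (0.0001, 0.001, 0.01, 0.1, 0.5, 1.0):
    print(f"{sel:7.2%}  FS={full_scan_cost(sel):.2e}  "
          f"IS={index_scan_cost(sel):.2e}  SS={smooth_scan_cost(sel):.2e}")

As selectivity grows, the Smooth Scan estimate converges to the Full Scan cost, matching Figures 8.21b and 8.21d; the very low-selectivity end is optimistic here because the Mode 0/1 random accesses are ignored.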

Figure 8.21: Comparing the analytical model with actual execution for Full Scan, Index Scan, and Smooth Scan: (a) model, selectivity 0-1%; (b) model, selectivity 1-100%; (c) measured, selectivity 0-1%; (d) measured, selectivity 1-100%.

The model suggests that for lower selectivity Smooth Scan behaves like Index Scan, while for higher selectivity it converges to the performance of Full Scan. This is corroborated by the experiments presented in Figure 8.21c and Figure 8.21d, which show the real execution times using the actual data that the model assumed. In both graphs Smooth Scan converges to Full Scan as predicted. The only discrepancy from the model we observe is that Smooth Scan converges to Full Scan faster than estimated. This effect is partly due to the disk controller grouping many sequential I/O requests from its queue into one in the case of Full Scan, which puts the performance bar of Full Scan a bit lower than expected. Similar behavior is not observed in the case of Smooth Scan, which issues requests for sequential sub-arrays with random jumps in between. Although the same grouping of sequential sub-arrays could happen and equally improve performance, the disk controller did not possess the logic to do so. Nonetheless, we see that, overall, the actual execution corroborates the estimates of the model.


Figure 8.22: Switch Scan performance cliff and overall benefit (execution time vs. result selectivity for Full Scan, Switch Scan, and Smooth Scan).

8.7.7 The benefit of mid-operator runtime adaptivity

In this section we study the benefit of mid-operator reoptimization as an alternative approach to preventing performance degradation. We demonstrate that, although such a simple solution can help in some cases (such as fulfilling SLA constraints), binary decisions have consequences, such as performance cliffs or the inability to return once the decision has been made.

Figure 8.22 shows the benefit of mid-operator reoptimization implemented through an operator we refer to as Switch Scan. Switch Scan is implemented in PostgreSQL, existing side by side with the remaining access path operators. Switch Scan starts by following an index scan. During runtime it monitors the operator’s selectivity and, upon detecting a violation of the selectivity estimate, switches the access path strategy to a full scan in order to prevent further degradation. Although rather simplistic, Switch Scan bounds the worst case execution time to the time of obtaining X tuples with the index look-up plus the time to perform the full table scan, which can still be significantly lower than the time to fetch all tuples with the index look-up.
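A minimal cost-level sketch of this binary decision (illustrative parameter names, not the PostgreSQL implementation) makes the cliff explicit:

def switch_scan_cost(actual_tuples, estimated_tuples, index_cost_per_tuple, full_scan_cost):
    """Worst-case Switch Scan cost: index look-ups until the optimizer's estimate X is
    violated, then a single switch that pays the full table scan on top."""
    if actual_tuples <= estimated_tuples:
        return actual_tuples * index_cost_per_tuple            # estimate held: pure index scan
    return estimated_tuples * index_cost_per_tuple + full_scan_cost

# Example with an estimate of X = 32K tuples: the cliff appears at tuple 32,001.
print(switch_scan_cost(32_000, 32_000, 10.0, 3_000_000))       # index scan only
print(switch_scan_cost(32_001, 32_000, 10.0, 3_000_000))       # index scan + full scan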

We report the results of executing query Q1 from the micro-benchmark. In the case of Switch Scan, one can observe a performance cliff at 0.009% selectivity, due to the strategy switch. In this example, the optimizer’s cardinality estimate is 32K tuples, for which it decided to employ an index scan. While monitoring the actual cardinality, Switch Scan observes more than 32K tuples and performs the switch before producing the next result tuple. The execution time to produce 32001 tuples now becomes the execution time of the index seek for 32K tuples plus the execution time of the full table scan. After the switch, Switch Scan performs just like a full scan, avoiding a degradation of more than an order of magnitude when selectivity is 100%. Nonetheless, the moment Switch Scan opts for the switch, the execution time increases by the time of the full scan, which might not be amortized over the rest of the query’s lifetime.

3 X is the optimizer’s cardinality estimation.


Figure 8.23: Statistics collection alternatives in DBMS-X as an alternative to run-time adaptivity: (a) basic statistics, (b) single-column histograms, (c) joint-distribution histograms (execution time vs. result selectivity for Full Scan, Index Scan, the optimizer’s decision, and the average statistics collection time).

The performance hit, together with the uncertainty about whether the overhead incurred at the time of the change will actually be amortized over the remaining query time, is perceived as a lack of robustness. In this example, if only 32001 tuples were to qualify (which is unknown at switch time), Switch Scan would pay an overhead that could not be amortized over the rest of the query’s lifetime, and hence is unjustified. Moreover, since the decision depends on the accuracy of the statistics, this approach is highly volatile. Smooth Scan, on the other hand, manages to approach near-optimal performance, as shown in Figure 8.22, while being statistics-oblivious.

8.7.8 Statistics collection as opposed to intra-operator adaptivity

An alternative to correcting suboptimal plans with intra-operator adaptivity presented in Section 8.3 would be to avoid suboptimal paths in the first place. One could argue this can be achieved by having perfectly accurate statistics representing underlying data; we show, however, that repeatedly collecting statistics is prohibitively expensive, since this effort usually involves a full table access.

For this experiment we use a table with 40M tuples from the micro-benchmark, with a non-clustered index built on columns (c2, c3). Throughout the experiment we employ the following query:

Q2: select * from relation where c2 = X and c3 = X;


We perform constant updates of the data, introducing skew between columns c2 and c3 (we update both columns to value X). With this setting we want to simulate a sensor processing environment where data is ingested constantly 24/7, causing frequent changes of the data statistics. Completely accurate statistics representing the underlying data are rarely present in such a system.

Figure 8.23 shows the statistics collection times on the table, comparing them against the execution time of query Q2 run on DBMS-X. We have measured statistics collection time on a commercial system, since this system supports a wider spectrum of possibilities than PostgreSQL. We compare the performance of the index scan, full scan and the optimizer’s choice against the time to collect statistics. The three graphs demonstrate the three levels of database statistics, namely: a) base statistics (the table size, tuple size, number of tuples, etc.); b) single column distribution statistics (histograms on each column separately); c) joint-data distributions (a histogram on the group of columns from the query (c2, c3)).

Despite being the cheapest alternative, basic statistics can still lead to the choice of suboptimal plans, as shown in Figure 8.23a, since they can accurately detect neither skew nor the presence of column correlations. With basic statistics present, the optimizer kept the original access path choice (i.e., index scan) throughout the entire selectivity range. On the other hand, one can observe that obtaining histograms on all columns introduces a higher cost, as shown in Figure 8.23b. Having histograms on all columns could solve the problem of suboptimal decisions in the case of skewed data. Nevertheless, it will still not detect the correlation between different columns (notice the suboptimal decision at selectivity 40% in Figure 8.23b). Therefore, whenever a query contains multiple filtering predicates over different columns, joint-data distributions are required. Figure 8.23c shows the statistics collection time for the group of two columns from the query. Performing this collection once could be tolerated. Calculating all possible joint distributions for a workload consisting of many queries, however, is an unattainable goal, especially since applications today have hundreds of columns in each table [225].

Query Q2 is a simple query that showcases the problem with existing DBMS. Assuming no accurate statistics exist on the table, the optimizer falls into the trap of using the non-clustered index regardless of the actual result cardinality. This happens because the uniformity assumption estimates the selectivity of each predicate to be 10^−5 (1/100K), while the independence assumption further estimates the overall selectivity to be 10^−10 (10^−5 × 10^−5) [63]. Therefore, the optimizer would always opt for the non-clustered index look-up, severely hurting performance in the case of higher selectivity.
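For concreteness, the following sketch reproduces this estimate for the 40M-tuple table (variable names are illustrative; the 100K-value domain comes from the micro-benchmark):

num_tuples = 40_000_000
domain_size = 100_000                    # each column holds values drawn from 0 to 10^5

sel_per_predicate = 1 / domain_size      # uniformity: 10^-5 per equality predicate
sel_combined = sel_per_predicate ** 2    # independence: 10^-10 for c2 = X and c3 = X

estimated_rows = num_tuples * sel_combined
print(estimated_rows)                    # about 0.004 rows expected, so the index look-up seems free
# With the correlated update (c2 = c3 = X), the actual cardinality can approach num_tuples.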

8.8 Related work

Smooth Scan draws inspiration from a large body of work on adaptive and robust query processing. We briefly discuss the work most closely related to our approach; for a detailed summary the interested reader may refer to [78] or Chapter 5.


Statistics collection. Since the quality of plans directly depends on the accuracy of data statistics, a plethora of work has studied techniques to improve the statistics accuracy in DBMS. As discussed in Section 3.4, modern approaches employ the idea of monitoring execution to exploit this information in future query compilations [2, 56, 225]. In dynamically changing environments, however, statistical information rarely stays unchanged between executions; consider the data ingested from different devices (e.g., smart meters [127], data from Facebook, etc.). Orthogonal techniques focus on modeling the uncertainty about the estimates during query optimization [22, 24]. Overall, considering the two-dimensional change in workload characteristics (frequent data ingest and ad-hoc queries) in modern applications, and the high price of having up-to-date statistics for all cases in the exponential search space [56, 57], the risk of having incomplete or stale statistics remains high.

Single-plan adaptive approaches. From the early prototypes to most modern database systems, query optimizers determine a single best execution plan for a given query [20]. To cope with environment changes in such systems, some of the early work on adaptive query processing employed reoptimization techniques in the middle of query execution [150, 155, 175]. These approaches provide inter-operator adaptivity, mostly by monitoring intermediate result cardinalities at blocking operators in the plan tree and re-optimizing the rest of the plan (up the tree) while exploiting the knowledge of the actual cardinalities collected up to that point. More details on runtime adaptivity can be found in Chapter 5. Since the re-optimization step can introduce overheads in query execution, an alternative technique proposed in the literature is to choose a set of plans at compile time and then opt for a specific plan based on the actual parameter values at run-time [105, 146]. A middle ground between re-optimization and dynamic evaluation is proposed in [24, 83], where a subset of more robust plans is chosen for given cardinality boundaries. Regardless of the strategy for when to adjust behavior, reoptimization approaches suffer from the same kind of binary decisions that we have seen with Switch Scan; once reoptimization is employed, the strategy switch will almost certainly trigger a performance cliff.

Multi-plan adaptive approaches. Some of the early multi-plan techniques employed competition to decide between alternative plans [17, 123]. Multiple access paths for a given table are executed simultaneously for a short time, and the winner is used for the rest of the query plan. With competing plans, all the work done for every access method except the winning one is discarded; in contrast, Smooth Scan does not perform any work that is thrown away later.

Adaptive and robust operators. With workloads being less steady and predictable, coarse-grained index tuning techniques are becoming less useful, with the optimal physical design being a moving target. In such environments, adaptive indexing techniques emerged, with index tuning being a continuous effort instead of a one-time procedure. Partial indexing [217, 226, 256] broke the paradigm of building indexes over the full data set by partitioning data into interesting and uninteresting tuples, and indexing only the former.

Similarly, but more adaptively, using the workload as a driving force, database cracking and adaptive merging techniques [102, 133, 137] lower the creation cost of indexes and distribute it over time by piggybacking on queries to refine indexes. Lastly, SMIX indexes were introduced as a way to combine index accesses with full table scans, by building covered value trees (CVT) on tuples of interest [248]. Despite bringing adaptivity to index tuning, none of these techniques addresses index accesses from the query processing perspective, and hence they remain susceptible to the optimizer’s mistakes. The closest to our motivation of achieving robustness in query processing is G-join [100], an operator that combines the strong points of join alternatives into one join operator; we, however, consider access path operators and adapt and morph from one operator alternative to another as knowledge about the data evolves.

Improving I/O access. Index look-ups cause poor disk performance due to random-access latency. Asynchronous I/O with prefetching [85, 151] improves the performance of such a pattern, but still suffers from repeated page reads and small access granularity. Partial sorting of tuples [79, 85] can improve access locality and size but, unless the entire input is sorted, repeated page reads are still possible.

In this chapter, we combine the advantages of the above-mentioned approaches. Intra-operator adaptivity is certainly required in dynamically changing environments. When correcting previously chosen suboptimal decisions, however, the robust behavior of operators has to be taken into account, since unexpected performance fluctuations may deter users from using such adaptive systems.

8.9 Concluding remarks

With the increase in complexity of modern workloads and the technology shift towards cloud environments, robustness in query processing is gaining momentum. With current systems remaining sensitive to the quality of statistics, however, the run-time performance of queries may fluctuate severely even after marginal changes in the underlying data. For a productive user experience, the performance for every query must be robust, i.e., close to the expected performance, even with missing, stale, or insufficient statistics.

This chapter introduces Smooth Scan, a statistics-oblivious access path operator that continuously morphs between the two access path extremes: an index look-up and a full table scan. As Smooth Scan processes data during query execution, it learns the properties of the data and morphs its behavior toward the preferred access path. We implement Smooth Scan in PostgreSQL and, through both synthetic benchmarks and TPC-H, show that it achieves near-optimal performance throughout the entire range of possible selectivities.

We believe that the impact of the techniques presented in this chapter could reach far beyond traditional (relational) DBMS, as similar access patterns with the same trade-off between random and sequential I/O are observed in NoSQL database solutions [211, 212]. As both key-value and document stores organize data internally into a form of hash tables or (partitioned) B-trees, they perform random I/O when traversing the structure to locate the queried key-value pair(s) [212]. This access pattern closely resembles index scans in relational DBMS [211]; hence, reducing random and repeated I/O accesses by exploiting spatial locality is likely to improve the performance of these systems as well.


9 Data Analytics for a Penny

Enterprise databases use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier when it is "hot" (frequently accessed), to a disk-based capacity tier, and finally to a tape-based archival tier when it is "cold" (rarely accessed). The unprecedented growth in the amount of cold data has motivated hardware vendors to introduce new devices named Cold Storage Devices (CSD), explicitly targeted at cold data workloads. With access latencies in tens of seconds and a cost as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (ms), high-cost, HDD-based capacity tier and the high-latency (min to hour), low-cost, tape-based archival tier.

Driven by the price/performance aspects of CSD, this chapter makes a case for using CSD as a replacement for both the capacity and archival tiers of enterprise databases in order to reduce the storage cost of enterprise data centers. Although CSD offer substantial cost savings compared to HDD-based infrastructures, we show that current database systems can suffer a severe performance drop when CSD are used as a replacement for HDD, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of the CSD. We then build a CSD-driven query execution framework, called Skipper, that modifies both the database execution engine and the CSD scheduling algorithms to be aware of each other and operate towards a common goal: masking the high access latency of CSD. In particular, this chapter presents the design of: 1) an adaptive, cache-driven multi-join algorithm that enables out-of-order execution, 2) an efficient, progress-based cache-replacement algorithm that minimizes the number of roundtrips to the CSD, and 3) a query-aware, rank-based scheduling algorithm for CSD that balances throughput and fairness. Using results from our implementation of the architecture based on PostgreSQL and OpenStack Swift, we show that Skipper is capable of completely masking the high latency overhead of CSD, thereby opening up CSD for wider adoption as a storage tier for cheap data analytics over cold data.1

1 This chapter uses material from: [36].


9.1 Introduction

Driven by the desire to extract insights out of data, businesses have started aggregating vast amounts of data from diverse data sources. Emerging application domains, like the Internet-of-Things, are expected to exacerbate this trend further [130]. As data stored in analytical databases continues to grow in size, it is inevitable that a significant fraction of this data will be infrequently accessed. Recent analyst reports claim that only 10-20% of the data stored is actively accessed, with the remaining 80% being cold. In addition, cold data has been identified as the fastest growing storage segment, with a 60% cumulative annual growth rate [129, 141, 230].

As the amount of cold data increases, enterprises are increasingly looking for more cost-efficient ways to store data. A recent report from IDC emphasized the need for such low-cost storage by stating that only 0.5% of potential Big Data is being analyzed, and that, in order to benefit from this unrealized value, infrastructure support is needed to store large volumes of data, over long time durations, at extremely low cost [129].

To reduce capital and operational expenses when storing large amounts of data, enterprise databases have for decades used storage tiering techniques. A typical three-tier storage hierarchy uses SSD/DRAM to build a low-latency performance tier, SATA HDD to build a high-density capacity tier, and tape libraries to build a low-cost archival tier [230]. More details about enterprise storage tiering can be found in Chapter 6.

Considering the cold data proliferation, an obvious approach for saving cost is to store it in the archival tier. Despite the cost savings, this is infeasible because the tape-based archival tier is several orders of magnitude slower than even the HDD-based capacity tier. As enterprises need to be able to run batch analytics over cold data to derive insights [130], the minute-long access latency of tape libraries makes the archival tier unsuitable as a storage medium for housing cold data.

In the light of the limitations faced by the archival tier, storage hardware vendors and researchers have started explicitly designing and developing storage devices targeted at cold data workloads [25, 202, 222, 232, 253]. These devices, also referred to as Cold Storage Devices (CSD), pack thousands of cheap, archival-grade, high-density HDD in a single storage rack to achieve very high capacities (5-10PB per rack). The disks are organized in a Massive-Array-of-Idle-Disks (MAID) configuration that keeps only a fraction of the HDD powered up at any given time [67].

CSD form a perfect middle ground between the HDD-based capacity tier and the tape-based archival tier. Due to the use of cheap, commodity HDD and the power reduction provided by the MAID technology, they are touted to offer cost/GB comparable to the traditional tape-based archival tier. For instance, Spectra's ArcticBlue CSD is reported to reduce storage cost to $0.1/GB [222], while Storiant claims a total cost of ownership (TCO) as low as $0.01/GB per month [189]. Due to the use of HDD instead of tape, they reduce the worst-case access latency from minutes to mere seconds, the spin-up time of disk drives. Thus, performance-wise,


CSD are closer to the HDD-based capacity tier than the archival tier (but still slower than the HDD-based capacity tier).

Motivated by the price aspects of CSD, in this chapter we examine how CSD should be integrated into the database tiering hierarchy. In answering this question, we first make the case for modifying the traditional storage tiering hierarchy by adding an entirely new tier, referred to as the Cold Storage Tier (CST). We show that enterprises can save hundreds of thousands to millions of dollars by using the CST to replace both the HDD-based capacity and tape-based archival tiers.

We then present an investigation of the performance implications of using CSD as a replacement for the traditional capacity tier of enterprise databases. We first show that current database systems can suffer a severe performance penalty when CSD are used as a replacement for the capacity tier, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of the CSD. Then, we introduce Skipper, a new CSD-targeted query execution framework that modifies both the database execution engine and the CSD scheduling algorithm to be aware of each other and operate toward a common goal: masking the high access latency of CSD. In particular, Skipper employs an adaptive, CSD-driven query execution model based on multi-way joins that are tailored for out-of-order data arrival. A detailed evaluation shows that this execution model, coupled with efficient cache management and CSD I/O scheduling policies, can mask the high latency overhead of CSD and provide substantially better performance and scalability compared to the traditional database architecture. Moreover, by approaching the performance of the more expensive HDD-based capacity tier, Skipper opens the door to a new category of low-price data analytics.

This chapter presents the following contributions:

• A cost and performance analysis which demonstrates that the CSD-based Cold Storage Tier can be a substitute for both the capacity and the archival tier in enterprise databases.

• A CSD-targeted, cache-controlled, multijoin algorithm and associated cache eviction policy that enables adaptive, push-based query execution under out-of-order data arrival at low cache capacities.

• A query-aware, ranking-based I/O scheduling algorithm for CSD that maximizes effi- ciency and maintains fairness.

• A simulation-based study of the effectiveness of several cache replacement algorithms and CSD scheduling algorithms, and of their sensitivity to data layout on the CSD.

• A full-system implementation and evaluation of the Skipper framework based on PostgreSQL and OpenStack Swift, showing that Skipper on CSD approximates the performance of a classical query engine running on the HDD-based capacity tier within 20% on average.


9.2 Scope and Background

Context. In this work, we frame our problem in the context of a modern, multitenant, virtualized enterprise data center. In such a scenario, tenants deploy databases in virtual machines (VM) which run on virtualized compute servers. Each VM is backed by a virtual hard disk (VHD) that provides storage for both the guest OS image and database files. The VHD itself is stored as a file or a logical volume on a shared storage service that runs on a cluster of storage servers. For instance, enterprise datacenters that use OpenStack, a popular open-source cloud computing framework, deploy VM on a set of compute servers using OpenStack's Nova service. VM hosted in Nova servers use Cinder, a scale-out block storage service, for storing their virtual hard disks. Similarly, Amazon hosts database (RDS) VM in EC2 and provides block storage for these VM using Elastic Block Store (EBS).

CSD 101. Although CSD differ in terms of cost, capacity, and performance characteristics, they are identical from a behavioral standpoint: each CSD is a MAID array in which only a small subset of disks is spun up and active at any given time. For instance, Pelican packs 1,152 Shingled Magnetic Recording-based (SMR) disks in a 52U rack for a total capacity of 5 PB. However, only 8% of the disks are spun up at any given time due to the restrictions enforced by in-rack cooling (which can cool only one disk per vertical column of disks) and a power budget (enough power to keep only one disk in each tray of disks spinning). Similarly, each OpenVault Knox [202] CSD server stores 30 SMR HDD in a 2U chassis, of which only one can be spun up to minimize the sensitivity of disks to vibration. The net effect of these limitations is that CSD enforce strict restrictions on how many and which disks can be active simultaneously. As discussed in Chapter 6, this set of disks comprises a disk group, and all disks within a group can be spun up or down in parallel. Access to data on any of the disks in the currently spun-up storage group can be done with latency and bandwidth comparable to that of the traditional capacity tier. In contrast, accessing data on a disk that is not in the currently active group, i.e., performing a group switch, incurs an access latency two orders of magnitude higher.

CSD integration. Given such high access latencies, CSD, similar to commodity SATA HDD, tape drives, and other nearline storage devices, cannot be used as primary data stores for performance-critical workloads. Thus, cloud computing platforms integrate these devices using a separate storage service that exposes storage as an object-based blob store rather than a block-based VHD store. For instance, OpenStack provides Swift, an object-storage service that can be used to store and retrieve objects over a RESTful HTTP interface; Oracle's StorageTek tape libraries have been extended to expose storage via the Swift interface [192]. Similarly, Spectra's ArcticBlue CSD exposes storage using Amazon's S3 object-storage interface [14], and the Pelican software provides a key-value interface for storing GB-sized data blobs [25].

The setup described above is typical of modern enterprise datacenters, where the latency-critical performance tier is typically implemented using SSD/HDD managed by block storage services. Latency-insensitive capacity, archival, and backup tiers, in contrast, are implemented

using nearline storage provided by the object storage services. Both block and object storage services are shared across several tenants. Thus, a single CSD will store data corresponding to many database VM.

Scope. As we mentioned in Section 9.1, the CSD-enabled Cold Storage Tier can replace the capacity tier only if database queries can be executed directly over data stored in CSD. Thus, the scope of this work involves investigating query execution strategies and CSD disk group switching algorithms that can mask the high CSD access latency caused by disk spin-ups.

Given the high worst-case access latencies associated with CSD, we do not believe that CSD will replace the performance tier (see Figure 6.3). We believe that database installations will continue to use a separate performance tier managed by a different low-latency data analytics engine. For instance, SAP's data warehousing product uses SAP HANA as the in-memory analytics engine that manages a DRAM-based performance tier, and Sybase IQ as a nearline storage engine that manages a HDD-based capacity tier [69]. As CSD will not be able to service workloads that have strict latency requirements or require fine-grained read/write access to random disk blocks, they are unsuitable as a storage medium for OLTP installations; we believe that a worst-case access latency of several seconds is too high even for an anti-cache [76]. Thus, our target application domain is long-running batch analytics with write-once-read-many workloads.

9.3 The case for cold storage tier

The price/performance characteristics of CSD raise an interesting question: How should CSD be integrated into the database tiering hierarchy? Although an obvious approach involves using CSD as a faster archival tier, enterprise databases could achieve further cost reduction by using CSD to build a new storage tier that subsumes the roles of both the capacity and the archival tier. We refer to this new storage tier as the cold storage tier (CST). With such an approach, the three-tier hierarchy that included performance, capacity, and archival tiers would be reduced to a two-tier hierarchy with 15k RPM disks in the performance tier and CSD in the cold-storage tier. Similarly, the four-tier hierarchy would be reduced to a three-tier hierarchy with SSD and 15k RPM disks in the performance tier and CSD in the cold storage tier.

9.3.1 Price implications of CST

Figure 9.1 shows the cost reduction achievable by doing this replacement for a 100TB database and the three- and four-tier storage hierarchy as reported by [230] (see Chapter 6 for more details about storage tiering costs). For performance and capacity tiers the same pricing as listed in Table 6.1 has been used, while for CSD we use three cost/GB values, namely $0.1/GB (ArcticBlue CSD pricing [169]), $0.2/GB (assuming CSD cost the same as tape), and $1/GB (hypothetical worst-case pricing).


[Plot: cost (x1000 $) as a function of CSD $/GB for CSD-4-Tier vs. Trad-4-Tier and CSD-3-Tier vs. Trad-3-Tier configurations]

Figure 9.1: Cost savings of CSD as a replacement for the HDD-based capacity tier

As can be seen, the cost reductions are substantial. At $0.1/GB, using a single CST instead of separate capacity and archival tiers reduces cost by a factor of 1.70×/1.44× for three/four-tier installations. At $0.2/GB, the CST provides a cost saving of 1.63×/1.40×. Even in the worst case ($1/GB), the CST provides 1.24×/1.17× cost reduction. In terms of absolute savings, these values translate to hundreds of thousands of dollars for a 100TB database, and tens of millions of dollars for larger PB-sized databases.
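To make this arithmetic concrete, the sketch below reproduces the style of calculation behind Figure 9.1 for a hypothetical 100TB installation; all per-GB prices and the split of data across tiers are illustrative placeholders, not the values from Table 6.1 or Figure 9.1.

# Illustrative three-tier cost comparison for a 100TB database; prices and the
# split of data across tiers are assumptions, not the figures used in Figure 9.1.
DB_SIZE_GB = 100 * 1024
SPLIT = {"performance": 0.1, "capacity": 0.5, "archival": 0.4}  # assumed fractions
PRICE_PER_GB = {"performance_ssd": 10.0, "capacity_hdd": 3.0,
                "archival_tape": 0.2, "cold_storage": 0.1}

def traditional_cost():
    """Performance tier plus HDD-based capacity and tape-based archival tiers."""
    return (DB_SIZE_GB * SPLIT["performance"] * PRICE_PER_GB["performance_ssd"]
            + DB_SIZE_GB * SPLIT["capacity"] * PRICE_PER_GB["capacity_hdd"]
            + DB_SIZE_GB * SPLIT["archival"] * PRICE_PER_GB["archival_tape"])

def cst_cost():
    """Performance tier plus a single CSD-based cold storage tier (CST)."""
    return (DB_SIZE_GB * SPLIT["performance"] * PRICE_PER_GB["performance_ssd"]
            + DB_SIZE_GB * (SPLIT["capacity"] + SPLIT["archival"])
              * PRICE_PER_GB["cold_storage"])

trad, cst = traditional_cost(), cst_cost()
print(f"traditional: ${trad:,.0f}, with CST: ${cst:,.0f}, savings: {trad / cst:.2f}x")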

9.3.2 Performance implications of CST

Despite its potential, CSD and the CST they enable will be useful only if databases can run their workloads directly on data stored in CSD. Otherwise, CSD are no better than tape libraries and will be relegated to the role of a fast archival tier. To understand the implications of using a CSD-based CST as a substitute for the capacity tier, one needs to quantify the impact of group switch latency on the database performance and scalability.

Perils of analytics on CSD. In the best case, read requests from the database are always serviced from a HDD that is spun up. In such a case, there would be no performance difference between using a CSD and the traditional capacity tier. However, in the pathological case, every data access request issued by the database would incur a group switch delay and cripple performance.

Unfortunately, the average case is more likely to resemble the pathological case due to two assumptions made by traditional databases: 1) the storage subsystem has exclusive control over data allocation, and 2) the underlying storage media support random accesses with uniform access latency. In a virtualized data center that uses CSD as a shared service, both of these assumptions are invalidated, leading to suboptimal query execution.


In a virtualized datacenter, a CSD usually stores data corresponding to several hosted databases by virtualizing the available storage behind an object interface. Thus, each individual database instance has no control over the data layout on the CSD. The CSD might store data pages corresponding to different relations (or even a single relation) in different disk groups for several reasons. For instance, a CSD might decide to spread out data across different disk groups for load balancing. A set of disks could fail in a group, causing the CSD to temporarily stop allocating data in that group until recovery completes, or data could arrive in increments, which could lead to different increments being stored in different disk groups. The lack of control over data layout implies that the latency to access a set of relations depends on the way they are laid out across groups.

Moreover, the CSD services requests from multiple databases simultaneously. Thus, even if all data corresponding to a single database is located within a single group, the execution time of a single query is not guaranteed to be free of group switches. This is due to the fact that the access latency of any database request depends on the currently loaded group, which further depends on the set of requests from other databases being serviced at any given time.

Benchmarking CSD. To quantify the overhead of group switches, we set up an experimental testbed, fully described in Section 9.5.1, that emulates a virtual enterprise data center. Five servers in our testbed act as compute servers. Each server hosts an independent PostgreSQL database instance (referred to as a client) running within a VM. We have chosen a one-DBMS-per-VM configuration to isolate the performance of each client and avoid any possible resource contention across clients. OpenStack Swift, an object store deployed as a RESTful web service and extended with a custom plug-in, runs on a sixth server acting as our emulated, shared CSD.

For our benchmark, each PostgreSQL instance services TPC-H queries on a 50GB TPC-H dataset [240]. Only the database catalog files are stored in the VM’s VHD. The actual binary data is stored in Swift as objects, where each object corresponds to a 1GB data segment, and is fetched on demand during execution time. Each client is allotted its own disk group, and all data from a client is stored within its allotted group. Thus, if only one client were using the CSD, it could retrieve objects without any group switches.

Our objective is to measure: 1) the performance impact of running many PostgreSQL clients on a shared CSD, and 2) the performance sensitivity to the CSD access latency. To this end, we run two experiments. In both experiments, we use Q12 from the TPC-H benchmark as our workload. This is a two-table join over lineitem and orders, the two largest tables.

For our first experiment, we issue Q12 to all PostgreSQL instances simultaneously (each client with its own data) and measure the observed execution time of each instance. We repeat the experiment 5 times, each time increasing the number of clients by one. Thus, Swift services GET requests from one PostgreSQL instance during the first run, two instances during the second run, and so on. For our second experiment, we fix the number of clients at 5 and perform 5 iterations of Q12, each time varying the group switch time in Swift from 0 to 20 seconds in 5-second increments.

Figure 9.2: PostgreSQL over CSD vs HDD
Figure 9.3: Latency sensitivity

Results. Figure 9.2 shows the average query execution time as we increase the number of clients with a 10-second group-switch latency when the storage target is either a CSD or a HDD-based capacity tier. The results for the HDD-based case were obtained by configuring the Swift middleware’s metadata to map the data of all clients to a single group, thereby eliminating group switches completely. As can be seen, PostgreSQL-on-CSD exhibits poor scalability as the average execution time increases proportional to the number of clients.

In order to understand the performance drop, we show a timeline of events in Figure 9.4.The left part of the timeline shows the requests being submitted by PostgreSQL instances and the right part shows the actions taken by Swift. At t0, each PostgreSQL instance submits a request for the first segment of the first relation (in a two table join). As each client’s data is on a different group, Swift processes the GET requests by switching to each group one by one and returning back data on that group. Thus, at t1, the first client gets back the first segment. Following this, it submits a request for the second segment at time t2. However, before this request can be processed, Swift has to process the already pending requests from other groups. Thus, each request ends up waiting for group switches caused by requests from other clients, i.e., two consecutive requests from any PostgreSQL client are separated by five group switches leading to a significant increase in overall execution time.

In fact, if we have C clients, each processing a query involving D data segments stored on a CSD with a group switch time of S, the total execution time of the query is S × C × D. An increase in any of the three parameters results in a proportional increase in execution time. This also explains why PostgreSQL suffers from extremely high sensitivity to the group switch latency, as shown by the 6× increase in execution time at a group switch latency of 20 seconds (Figure 9.3).

Figure 9.4: Event timeline of requests from the five PostgreSQL instances (PG-1 to PG-5) and the group switches performed by Swift

Furthermore, the 6× increase in execution time we report is only an optimistic estimate of the performance impact of running queries on CSD. Back-of-the-envelope calculations indicate that PostgreSQL would suffer from a 10-100× increase in execution time compared to the traditional case that employs HDD in the capacity tier, as the CSD group switch latency, the number of segments, or the number of clients increases. To illustrate, increasing the group switch latency to 30 seconds, which corresponds to the access latency of Blu-ray-based CSD, results in a 100× increase in execution time. Given such performance implications, it is unclear whether CSD can be used to store even cold data, let alone replace the HDD-based capacity tier. Thus, the only way CSD could be integrated into the enterprise database tiering hierarchy would be as a replacement for the archival tier. Unfortunately, such an integration misses out on the cost-saving opportunities provided by CSD.
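A minimal sketch of this back-of-the-envelope model follows; the parameter values are illustrative examples, not measurements from our testbed.

def naive_csd_runtime(switch_latency_s, num_clients, num_segments):
    """Worst-case runtime of one query when every segment request waits for a
    full round of group switches across all clients (S x C x D)."""
    return switch_latency_s * num_clients * num_segments

# Example: 5 clients, 15 segments per query, at 10s and 30s group switch latency.
for latency_s in (10, 30):
    runtime = naive_csd_runtime(latency_s, num_clients=5, num_segments=15)
    print(f"switch latency {latency_s}s -> estimated runtime {runtime}s")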

9.3.3 A case for CSD-driven query execution

Clearly, exploiting the cost benefits of CSD while minimizing the associated performance trade-off requires eliminating unnecessary group switches. For instance, consider the example layout shown in Table 9.1, where three relations A, B, and C, each containing two objects (data segments), are stored across three groups denoted g1, g2, and g3 (e.g., A.2 denotes an object of relation A stored in group 2). In the optimal case, all three tables can be retrieved from the CSD with just two group switches. However, as the database has no control over data layout or I/O scheduling, the only way of achieving such optimal data access is to have the database issue requests for all necessary data upfront, so that the CSD can batch requests and return data in an order that minimizes the number of group switches. Thus, the order in which the database receives data, and hence the order in which query execution happens, should be determined by the CSD in order to minimize the performance impact of group switching.

Group   Table objects
g1      A.1, B.1, C.1
g2      A.2, B.2
g3      C.3

ID   Subplan
1    A.1, B.1, C.1
2    A.1, B.2, C.1
3    A.2, B.1, C.1
4    A.2, B.2, C.1
5    A.1, B.1, C.3
6    A.1, B.2, C.3
7    A.2, B.1, C.3
8    A.2, B.2, C.3

Table 9.1: Data layout and execution subplans

Unfortunately, current databases are not designed to work with such a CSD-driven query execution approach. Traditionally, databases have used a strict optimize-then-execute model to evaluate queries. The database query optimizer uses cost models and statistics gathered to determine the optimal query plan prior to executing the plan [20]. Once the query plan has been generated, the execution engine then invokes various relational operators strictly based on the generated query plan with no runtime decision making. This results in pull-based query execution where the database explicitly requests segments (i.e., it pulls segments) in an order determined by the query plan. For instance, continuing the previous example, PostgreSQL might request all objects of table C first, followed by B, and finally A. This pull-based execution approach is incompatible with the CSD-driven approach, as the optimal order chosen by the CSD for minimizing group switches is different from the ordering specified by the query optimizer. Even more, as pull-based execution is oblivious to data layout, it will invariably cause many more group switches leading to poor performance when CSD is used as the capacity tier. For example, fetching relations C, B, A, in that order leads to 5 switches instead of 2.
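To illustrate the difference, the following sketch counts group switches for the layout of Table 9.1 under a pull-based fetch order and a layout-aware batched order; the helper function is ours, not part of PostgreSQL or Swift.

# Data layout from Table 9.1: object -> storage group.
LAYOUT = {"A.1": "g1", "B.1": "g1", "C.1": "g1",
          "A.2": "g2", "B.2": "g2",
          "C.3": "g3"}

def group_switches(fetch_order):
    """Count group switches when objects are served strictly in the given order,
    assuming the group of the first object is already loaded."""
    switches, current = 0, LAYOUT[fetch_order[0]]
    for obj in fetch_order[1:]:
        group = LAYOUT[obj]
        if group != current:
            switches += 1
            current = group
    return switches

# Pull-based order (all of C, then B, then A) vs. a layout-aware batched order.
pull_order = ["C.1", "C.3", "B.1", "B.2", "A.1", "A.2"]
batched_order = ["A.1", "B.1", "C.1", "A.2", "B.2", "C.3"]
print(group_switches(pull_order))     # 5 switches
print(group_switches(batched_order))  # 2 switches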

9.4 Skipper: Query Processing on CSD

Having described the shortcomings of the naive approach to the DB-CSD integration, we now present Skipper, a query execution framework that enables efficient SQL analytics directly on data stored in CSD. Skipper makes this possible by changing both the database execution engine and the CSD I/O scheduler to work in concert to minimize the number of group switches. Figure 9.5 shows the components that constitute the Skipper architecture. As before (Section 9.3.2), each PostgreSQL instance runs within a VM, is allotted a fixed amount of memory, stores only catalog information in the VHD, and uses the CSD as the storage tier.

Figure 9.5: Skipper architecture

To address the pull-based execution problem, each database instance now uses a CSD-driven, cache-aware, multi-way join algorithm (MJoin) to perform efficient out-of-order query execution. We focus on join queries, since scans can naturally be serviced in an out-of-order fashion. When placing data received from the CSD, the cache manager uses a progress-aware eviction policy that attempts to cache blocks such that MJoin can make maximum progress.

The second component, OpenStack Swift, our CSD, is shared across all the tenants, and uses an I/O scheduler that coordinates accesses to data stored in different storage groups. Each database instance tags each request with a query identifier to make the CSD workload aware. The CSD uses this information to implement a novel rank-based, query-aware scheduling algorithm that balances fairness and efficiency across tenants.

In addition to the database and CSD, Skipper introduces a third component, referred to as the client proxy. It is a daemon process that is collocated with each PostgreSQL instance in all VM and coordinates communication between MJoin and Swift.

Having described a high-level overview of the Skipper architecture, in the rest of this section, we focus on the design of three important aspects of Skipper, namely, out-of-order execution, cache management, and I/O scheduling. While designing algorithms to optimize each of these aspects, we realized that the design space was large. Implementing and evaluating all possible alternatives for caching and I/O scheduling in a real framework would take a formidable amount of time. Furthermore, in many cases, like cache management, there is no optimal algorithm as the problem itself is NP-hard. Thus, in order to systematically explore the design space and identify heuristics that work well in practice, we developed an event-driven simulator to perform an extensive study of the effectiveness of various alternatives.

The rest of this section is organized as follows. We first describe the simulator, following which we consider each of the three aforementioned aspects one by one. We explain the design and implementation of various algorithms that could be used for each of these aspects, and use the simulator to show design-space exploration results that substantiate our final choice of algorithms for the real Skipper implementation. The results obtained from a full-system evaluation comparing the real Skipper implementation with PostgreSQL are then presented in Section 9.5.

9.4.1 Skipper Architectural Simulator

The simulator is a standalone application that mimics a real implementation by modeling all aspects of the architecture. Each database is modeled as a client object with a given cache capacity and a set of queries. The CSD is modeled as a server object configured with a layout policy that maps client objects to storage groups. The database client implements out-of-order query execution by submitting requests for all objects corresponding to a query and keeping track of completed/pending subplans as it receives notifications of data availability from the server. The server implements various CSD scheduling policies, supports various object-storage group mappings (layouts), and notifies clients of data availability each time it switches to a new group.

As it is intended to be used only for design space exploration, the simulator does not model non-algorithmic aspects. Thus, the client does not actually cache data or perform an actual join operation. It requests data from the server, maintains metadata to track cached data and pending subplans, and implements cache management algorithms based on the managed metadata. Similarly, the server does not actually perform group switches or read data from the disk. Instead, it keeps track of group switches using a local variable without actually delaying execution. As a result, the simulator is capable of executing hundreds of queries on large datasets in a matter of seconds where a real implementation would take several days.

Simulator Setup. All the experimental results we report in this section are based on our simulator. Unless otherwise specified, we used the following configuration for the simulator. There are a total of 10 clients, where each client’s data set consists of 20 tables whose sizes vary randomly between 1 and 5GB, resulting in a total dataset size of 1TB. Each client simulates a join query by picking a random value N (between two and five), and requesting data belonging to N randomly chosen tables. The server is configured to use 12 storage groups with a storage capacity of 100GB per group and a group switch time of 10 seconds.
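For reference, the default configuration can be summarized as a small parameter set; the structure below is purely illustrative and does not reflect the simulator's actual interface.

import random

# Default simulator configuration (values as described above; structure is illustrative).
SIM_CONFIG = {
    "num_clients": 10,
    "tables_per_client": 20,
    "table_size_gb": (1, 5),      # table sizes drawn uniformly at random
    "num_groups": 12,
    "group_capacity_gb": 100,
    "group_switch_s": 10,
    "tables_per_query": (2, 5),   # each query joins a random number of tables
}

def random_query(tables):
    """Simulate a join query by picking between two and five tables at random."""
    n = random.randint(*SIM_CONFIG["tables_per_query"])
    return random.sample(tables, n)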

We used four simulated object-group mappings to test the sensitivity of various algorithms to data layout. Our first layout ('Layout1'), which we refer to as the one-client-per-group layout, stores all data from a single database instance in a single storage group but spreads out clients across different groups. Our second layout ('Layout2'), which we refer to as the locality-aware layout, isolates and packs hot data and cold data from all clients in separate sets of groups. Our third and fourth layouts ('Layout3' and 'Layout4'), referred to as the random-per-table layout and the random-per-object layout, mix data from different clients across groups by positioning each table or each object in a randomly chosen group.


Evaluation metric. Throughout this section we use the L2-norm [26] of stretch as our performance metric, which is defined as follows:

The L2-norm of stretch for a workload consisting of queries $q_i$, $i = 1 \ldots n$, with stretches $s_i$, $i = 1 \ldots n$, is equal to $\sqrt{\sum_{i=1}^{n} s_i^2}$.

In scheduling theory, stretch of a job is defined as the ratio of the observed execution time to the ideal execution time, where the ideal execution time is the time taken to execute the job alone on the platform. Stretch shows the deviation from the optimal case due to negative impact of interaction between jobs. L2-norm then aggregates stretch values across several clients enabling us to use a single metric for comparing the pros and cons of different experimental configurations (by encapsulating maximum and average values of a stretch within a single metric). In our case, the ideal execution time of a query is the single-client execution time, as all requests from a single PostgreSQL instance can be serviced by the CSD without any group switches. The stretch for cases with more than one client is obtained by normalizing the observed execution time by the single-client execution time.
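A small sketch of how this metric is computed, mirroring the definition above (the helper functions are our own):

import math

def stretch(observed_s, ideal_s):
    """Stretch of one query: observed runtime over the ideal (single-client) runtime."""
    return observed_s / ideal_s

def l2_norm_stretch(stretches):
    """L2-norm of the per-query stretches: sqrt of the sum of squared stretches."""
    return math.sqrt(sum(s * s for s in stretches))

# Example: three queries whose runtimes are 2x, 3x, and 1.5x their ideal runtimes.
print(l2_norm_stretch([stretch(200, 100), stretch(300, 100), stretch(150, 100)]))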

9.4.2 CSD-driven, cache state-aware MJoin

Skipper performs a CSD-driven out-of-order execution of queries by building on recent work done in Adaptive Query Processing (AQP) [78]. AQP techniques were designed to deal with streaming data sources in the Internet domain that pose three issues to the traditional database architecture: 1) unpredictable variations in data access latency, 2) non-repeatable access to data, and 3) lack of statistics or stale statistics about data. In order to overcome these issues, AQP techniques abandon the traditional pull-based, optimize-then-execute model in favor of out-of-order execution [19, 121] and runtime adaptation [15, 21, 246].

Multiway-Join [246] is one such AQP technique that enables out-of-order execution by using an n-ary join and symmetric hashing to probe tuples as data arrives, instead of using blocking binary joins whose ordering is predetermined by the query optimizer. Under out-of-order data arrival, the incremental nature of symmetric hashing requires MJoin to buffer all input data, as tuples that are yet to arrive could join with any of the already-received tuples. Thus, in the worst case scenario of a query that involves all relations in the dataset, the MJoin buffer cache must be large enough to hold the entire data set. This requirement makes traditional MJoin inappropriate for our use case as having a buffer cache as large as the entire data set defeats the purpose of storing data in the CSD.

Skipper solves this problem by redesigning MJoin to be cache aware. Our cache-aware MJoin implementation splits the traditional monolithic MJoin operator into two parts, namely, the state manager, and the join operator.

Algorithm 2 presents the pseudo-code of the MJoin state manager. At the beginning of execution, the state manager retrieves information about all objects (segments) across all tables that are necessary for evaluating a query. This information is typically stored as part of the DBMS catalog. The state manager uses this information to track query progress by building subplans. Subplans are disjoint parts of query execution that can proceed independently and produce query results. For instance, Table 9.1 shows the set of subplans that would be generated for a query that joins tables A, B, and C, each of which has two segments. Each combination of segments, one from each relation, forms a subplan. The state manager creates all such subplans and tags them as pending execution.

Algorithm 2: MJoin state manager algorithm
Input: Q: query, cache_size: cache size
Output: R: result tuples

// Initialization
cache = alloc(cache_size)
issue_queue = readObjectsFromCatalog(Q)
pending_spl = makeSubplans(issue_queue)
while issue_queue != NULL and pending_spl != NULL do
    SwiftGetObjects(issue_queue)
    issue_queue = NULL
    while swift_obj = ReceiveObjectFromSwift() do
        if inPending(swift_obj, pending_spl) then
            if cacheIsFull(cache, swift_obj) then
                dropped = cacheEvict(cache, swift_obj)
                if inPending(dropped, pending_spl) then
                    issue_queue = addToQueue(dropped)
            addObjectToCache(swift_obj)
            runnable = readySubplans(cache, pending_spl)
            if runnable != NULL then
                R += joinExecute(runnable)
                movePendingToExecuted(runnable)
return R
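The subplan construction performed by makeSubplans can be sketched as a cross product over the segments of each joined relation; the code below is our own illustration, not the PostgreSQL implementation.

from itertools import product

def make_subplans(segments_per_table):
    """One subplan per combination of segments, one segment from each relation."""
    tables = sorted(segments_per_table)
    return [dict(zip(tables, combo))
            for combo in product(*(segments_per_table[t] for t in tables))]

# Tables A, B, and C with two segments each yield the eight subplans of Table 9.1.
subplans = make_subplans({"A": ["A.1", "A.2"],
                          "B": ["B.1", "B.2"],
                          "C": ["C.1", "C.3"]})
print(len(subplans))   # 8
print(subplans[0])     # {'A': 'A.1', 'B': 'B.1', 'C': 'C.1'}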

After generating the subplans, the state manager issues requests for all objects needed for executing the query and waits to receive any of the requested objects. Upon the arrival of an object, the state manager checks whether enough cache capacity is available to buffer the new object. If so, the state manager builds the appropriate hash tables based on the join conditions and projection clauses in the query, and populates the hash table using tuples from the new object. If the cache is full, the state manager uses the cache eviction algorithm described in Section 9.4.3 to pick a target object and frees space by dropping its hash table. If the evicted object takes part in any pending subplan, a request to reissue the object is appended to the queue of requests that will be reissued in the next cycle (i.e., upon completing all pending requests issued previously).


Figure 9.6: CSD-driven vs. traditional execution
Figure 9.7: Cache management policies

Next, the state manager checks if there are any subplans that can proceed with execution based on the current cache state. Subplans are moved into runnable state when all objects comprising them are present in the database cache. Should that be the case, those subplans are executed by triggering join execution. The n-ary join operator in itself is stateless. The state manager instructs the join operator to probe the set of hashtables corresponding to the objects being joined. Once the actual join operation is completed, the subplan moves into executed state.

This process repeats until all requested objects have been received. At that point, if there are pending subplans left, the state manager reissues requests only for those objects necessary to execute the pending subplans and continues execution until no further subplans are left.

Benefit of CSD-driven execution

We now show results from the simulator that justify our effort to achieve out-of-order query execution. Figure 9.6 shows the impact of CSD-driven execution vs. a traditional database-driven execution over the layouts described in Section 9.4.1. 'Traditional execution' depicts the execution where data chunk requests are issued one by one following the query syntax. The server strictly follows this order and returns chunks back in this particular order. On the client side, the typical LRU strategy used by database systems is employed for cache eviction. The experiment shows the results for a cache size equal to 20% of the total query size. 'CSD-driven execution' uses the out-of-order execution model, coupled with the best cache management and scheduler policies described next. Overall, regardless of the layout, 'CSD-driven execution' brings a significant improvement to the workload execution performance, showing the importance of out-of-order execution for databases stored on CSD.


9.4.3 Cache management

When operating under limited cache capacity, MJoin might need to evict previously fetched objects in order to accommodate new arrivals. If such an evicted object is needed by a pending subplan, it will be refetched again from the CSD. As repeated refetching can deteriorate performance, we need a cache replacement algorithm that minimizes the number of reissues. It is well-known that the offline problem of determining an optimal order for fetching/evicting disk pages for performing a two-table join, given limited main memory, with the goal of minimizing the number of disk I/O, is NP-complete [178]. In our case, the order in which data arrives is controlled by the CSD, and can change dynamically depending on data layout and concurrent requests from other clients. Thus, designing an optimal online eviction algorithm is impossible given that the order of data arrival is non-deterministic even across two different executions of the same query. In designing our caching algorithm, we opted for greedy heuristics that could exploit the fact that the MJoin state manager has full visibility of both cache contents and pending subplans.

Maximal number of pending subplans. Our first algorithm was based on the intuition that it is beneficial to prioritize objects that participate in a large number of pending subplans over less popular ones. We illustrate this policy with an example. Consider the example configuration shown in Table 9.1. Let us assume that we have (A.1, B.1, A.2, C.3) stored in our cache of capacity 4, and that we have already processed subplans (A.1, B.1, C.3) and (A.2, B.1, C.3). When the next object arrives, we need to decide whether it should be cached, and if so, pick an eviction candidate for replacement.

Assuming C.1 arrives next, if we count the total number of pending subplans per object, we get 4 for C.1, 3 for A.1 and A.2, and 2 for each of B.1 and C.3. Thus the algorithm would consider B.1 and C.3 as viable eviction candidates. If the algorithm picks C.3 as the eviction candidate, C.1 is accepted and MJoin can make progress by executing new subplans. However, should the algorithm pick B.1 for eviction, MJoin would be unable to proceed with any of the subplans, as there would be no objects belonging to table B in the cache.

Maximal progress. Our preliminary results highlighted this problem and provided us with the insight that evicting objects based solely on the total number of pending subplans is impractical, especially at low cache capacities. Thus, we designed a new progress-based cache management algorithm that picks as the eviction target the object that participates in the least number of executable subplans given the current cache state and the newly arriving object.

Continuing the previous example, the number of executable subplans given the cache state (A.1, B.1, A.2, C.3) and the new object C.1 is 1 for each of A.1 and A.2, and 2 for B.1, as they would trigger the execution of subplans (A.1, B.1, C.1) and (A.2, B.1, C.1), but 0 for C.3. Thus, this policy would pick C.3 as the eviction candidate, since it has the lowest number of executable subplans. If the algorithm finds more than one object with the same number of executable subplans, it uses the number of pending subplans to break ties. A beneficial side effect of our maximal progress algorithm is that it automatically prioritizes small tables over large ones, as objects belonging to a small table participate in many more subplans. As typical data warehousing workloads follow a star or snowflake schema, where a large central fact table is joined with many small dimension tables, this caching policy automatically reduces the number of reissues by keeping small tables pinned in the cache.
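A sketch of this eviction choice is shown below; it is our simplification operating on plain object identifiers, whereas the real state manager drops and rebuilds hash tables.

def executable_count(obj, cache, new_obj, pending_subplans):
    """Pending subplans that contain `obj` and become executable once `new_obj`
    is admitted (i.e., all of their objects are present in the cache)."""
    available = set(cache) | {new_obj}
    return sum(1 for sp in pending_subplans if obj in sp and set(sp) <= available)

def pick_eviction_victim(cache, new_obj, pending_subplans):
    """Evict the cached object contributing to the fewest executable subplans;
    break ties by the total number of pending subplans it participates in."""
    def key(obj):
        pending = sum(1 for sp in pending_subplans if obj in sp)
        return (executable_count(obj, cache, new_obj, pending_subplans), pending)
    return min(cache, key=key)

# Cache (A.1, B.1, A.2, C.3) with C.1 arriving: C.3 enables no executable subplan.
pending = [{"A.1", "B.1", "C.1"}, {"A.1", "B.2", "C.1"}, {"A.2", "B.1", "C.1"},
           {"A.2", "B.2", "C.1"}, {"A.1", "B.2", "C.3"}, {"A.2", "B.2", "C.3"}]
print(pick_eviction_victim(["A.1", "B.1", "A.2", "C.3"], "C.1", pending))  # C.3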

Evaluation of cache management policies

In Figure 9.7, we show the impact of caching for the two aforementioned policies and the standard LRU policy, under the one-client-per-group layout ('Layout1'), as a function of cache capacity. We omit the results for the other layouts, as they exhibit a similar trend. As shown in Figure 9.7, neither 'LRU' nor our first policy ('Pending subplans') provides good performance at small cache capacities, as they often evict segments belonging to executable subplans. Evicting such segments forces MJoin to refetch them at a later time from the CSD, thereby incurring more group switch delays. Our final policy, referred to as 'Maximal progress', outperforms the rest by a huge margin (up to 7×), as it minimizes the number of reissue requests by ensuring continuous progress.

9.4.4 Client proxy

The client proxy is a mediator between MJoin and the CSD. MJoin maps each relation to a list of objects that need to be fetched from Swift, serializes the list of object names into a JSON string, and passes the list over a shared message queue to the client proxy, which, in turn, submits HTTP GET requests to fetch these objects from Swift. In this way, the client proxy offers an interface-independent mechanism for connecting PostgreSQL with a CSD.
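A minimal sketch of this interaction is shown below; the endpoint, container, header name, and message format are illustrative placeholders, and the real proxy is a separate daemon communicating with MJoin over a shared message queue.

import json
import requests  # generic HTTP client; the real proxy talks to Swift's REST API

SWIFT_URL = "http://swift.example.com/v1/AUTH_tenant/tpch"  # hypothetical endpoint

def fetch_objects(request_json, auth_token):
    """Parse an MJoin request and issue one tagged GET per requested object."""
    request = json.loads(request_json)  # e.g. {"query_id": "q12-1", "objects": [...]}
    headers = {"X-Auth-Token": auth_token,
               "X-Query-Id": request["query_id"]}  # query tag for the CSD scheduler (illustrative name)
    for obj in request["objects"]:
        resp = requests.get(f"{SWIFT_URL}/{obj}", headers=headers)
        yield obj, resp.content  # notify MJoin that the object is available

# Example message passed from MJoin over the shared queue (names are illustrative).
msg = json.dumps({"query_id": "q12-1", "objects": ["lineitem.seg1", "orders.seg1"]})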

Furthermore, the client proxy shares semantic information with Swift, i.e., it generates a query identifier for each set of requests from PostgreSQL and tags each Swift GET request with this identifier. This enables the scheduler to identify all objects being requested as a part of a single query, which allows it to implement semantically-smart scheduling (as described in Section 9.4.5).

Once MJoin submits requests to the client proxy, it blocks until it is notified of data availability. As each GET request completes, the client proxy notifies MJoin of the availability of a data object. Although we could have modified the scan operator or the storage backend of PostgreSQL to communicate with Swift, we chose the MJoin-client proxy route for an additional reason. By blocking the execution at the MJoin operator, we make the whole out-of-order execution mechanism data-format and scan-type independent. For instance, while we use binary data files and default PostgreSQL scan operators for the purpose of this chapter, we have used the same framework to query Swift-resident, raw data files directly using the scan operator provided by File Foreign Data Wrapper in PostgreSQL without changing any Skipper component.


9.4.5 Scheduling disk group switches

At any given point in time, the CSD receives a number of requests for different objects from various database clients. As these objects could potentially be spread across different storage groups, the CSD has to make three decisions: 1) which group should be the target of the next group switch, 2) when should a group switch be performed, and 3) what ordering should be used for returning objects within a currently loaded group. When choosing a proper scheduling strategy, our goal is to identify a scheduling algorithm that balances efficiency and fairness in answering these questions.

Which group to switch to? The CSD group scheduling problem can be reduced to the single-head tape scheduling problem in traditional tertiary storage systems, where it has been shown that an algorithm that picks the tape with the largest number of pending requests as the target to be loaded next performs within 2% of the theoretically optimal algorithm [201]. If efficiency were our only goal, we could use an algorithm that always chooses the disk group housing data for the maximum number of pending requests as the target group to switch to. We refer to this as the Max-Queries algorithm.

However, we need a mechanism to provide fairness, in the absence of which a continuous stream of requests for a few popular storage groups can starve out requests for less popular ones under the Max-Queries algorithm. Current CSD solve this problem by scheduling object requests in First-Come-First-Served (FCFS) order to provide fairness, with some parameterized slack that occasionally violates the strict FCFS ordering by reordering and grouping requests on the same disk group to improve performance [25]. Although such an approach might be sufficient when dealing with archival/backup workloads, it fails to provide optimal performance for our use case, since a single query requests many objects, which a query-agnostic CSD treats as independent requests. Thus, enforcing FCFS at the level of objects would produce many unwarranted group switches in an attempt to enforce fairness, and would prevent the request reordering optimizations we describe later in this section.

As mentioned earlier, the Skipper client proxy tags each GET request with a query identifier making the Skipper scheduler workload aware as it knows which object requests correspond to which queries. Thus, one option to provide fairness would be to use a query-based FCFS algorithm rather than an object-based one. Such an algorithm, however, would be inefficient as it fails to exploit request merging across queries (servicing all requests in a group before switching to the next one), and produces many more group switches than Max-Queries (as shown later).

Rank-based, query-aware scheduling. Our new scheduling algorithm strikes a balance between the query-centric FCFS algorithm and the group-centric Max-Queries algorithm by integrating fairness into the group switching logic. In our new scheduling algorithm, each group is associated with a rank R, and the scheduler always picks the group with the highest rank as the target of a group switch. The rank of a group g, denoted as R(g), is given as:


\[ R(g) = N_g + K \cdot \sum_{q=1}^{N_g} W_q(g) \qquad (9.1) \]

where Ng is the number of queries having data on group g, K is a constant whose value we derive shortly, and Wq (g) is the waiting time of a query that has data on group g, defined as the number of group switches since the query was last serviced. Thus, any query which is serviced by the current group will have 0 waiting time.

In order to understand the intuition behind this algorithm, let us consider the two parts of the equation separately. The first part, Ng , when used alone to determine the rank gives us the Max-Queries algorithm we described earlier. The second part provides fairness by increasing the rank of groups that have data belonging to queries which have not been serviced recently.

Each time the scheduler switches to a new group g, a set of queries Sg become serviceable, and the remaining queries non-serviceable. As the non-serviceable queries have to wait for one (or more) group switches, their waiting time increases. By directly using their waiting time to determine the rank, the algorithm ensures that groups whose queries have long waiting times have a higher probability of being scheduled next.

The scaling factor K determines the tipping point between efficiency and fairness, and we will now derive its value. Consider two sets of queries Q1 and Q2 requesting data on groups g1 and g2, respectively, such that set Q2 arrives s group switches after set Q1. Let t be the time of arrival of Q2, and let R(g1) and R(g2) be the ranks of the two groups at t. If the scheduler followed a strict FCFS policy, it would schedule Q1 before Q2 at time t, irrespective of the number of requests to each group (N_{g1}, N_{g2}). Thus, if R(g1) were greater than R(g2), the scheduler's behavior would be similar to the FCFS policy. This naturally leads to the following implications:

\[
\begin{aligned}
& N_{g_1} + K \cdot W_{g_1} > N_{g_2} + K \cdot W_{g_2}, \quad \text{where } W_{g_i} = \sum_{q=1}^{N_{g_i}} W_q(g_i) \\
& \Rightarrow \; K > (N_{g_2} - N_{g_1}) / (W_{g_1} - W_{g_2}) \\
& \Rightarrow \; K > (N_{g_2} - N_{g_1}) / s \quad \text{as } Q_2 \text{ arrives } s \text{ switches after } Q_1
\end{aligned}
\]

Thus, we need to pick a value of K in the range $(0, (N_{g_2} - N_{g_1})/s)$ to balance fairness and efficiency. To maximize efficiency, the scheduler should switch to group $g_2$ at time $t$ for all $N_{g_2} > N_{g_1}$. Thus, the scheduler should ensure that the following holds:

\[
\begin{aligned}
& R(g_2) > R(g_1) \quad \forall \, N_{g_2}, N_{g_1} \text{ where } N_{g_2} > N_{g_1} \\
& \Rightarrow \; N_{g_2} + K \cdot W_{g_2} > N_{g_1} + K \cdot W_{g_1} \\
& \Rightarrow \; K < (N_{g_2} - N_{g_1}) / (W_{g_1} - W_{g_2}) \\
& \Rightarrow \; K < (N_{g_2} - N_{g_1}) / s \quad \forall \, N_{g_2}, N_{g_1} \text{ where } N_{g_2} > N_{g_1}
\end{aligned}
\]


Thus, if we choose K < 1/s, we are guaranteed that the scheduler will switch to group $g_2$ and service the set Q2 when $N_{g_2} > N_{g_1}$. Recall that s is the number of group switches between the arrivals of sets Q1 and Q2. As $s \to \infty$, $K \to 0$, and the algorithm tends to favor efficiency over fairness, as the rank degenerates to $N_g$. For the minimum value of s = 1, which translates to K = 1, the algorithm's fairness is maximized. Therefore, we set the value of K to 1.
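A sketch of the rank computation and group selection with K = 1 is shown below; it is our simplification of the scheduler's bookkeeping, not the Swift middleware code.

K = 1  # maximum fairness; K -> 0 degenerates to the Max-Queries policy

def rank(group, queries_per_group, waiting_switches):
    """R(g) = N_g + K * sum of waiting times of the queries with data on g."""
    pending = queries_per_group[group]
    return len(pending) + K * sum(waiting_switches[q] for q in pending)

def next_group(queries_per_group, waiting_switches):
    """Pick the group with the highest rank as the target of the next switch."""
    return max(queries_per_group,
               key=lambda g: rank(g, queries_per_group, waiting_switches))

# Two queries have waited 3 switches on g1; three fresh queries wait on g2.
queries = {"g1": ["q1", "q2"], "g2": ["q3", "q4", "q5"]}
waiting = {"q1": 3, "q2": 3, "q3": 0, "q4": 0, "q5": 0}
print(next_group(queries, waiting))  # g1: rank 2 + 6 = 8 beats g2's rank 3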

When to switch? Given a set of active requests for objects in a currently loaded group, the scheduler has to decide whether to service all requests or only a subset before switching to the next group. We favor the approach of avoiding preemption, since our problem is similar to the tertiary I/O scheduling problem, where it has been shown that preemption leads to suboptimal scheduling [201]. Thus, once we switch to a group, we satisfy all object requests on that group before switching to the next group.

Such "request merging" could lead to starvation in the pathological case, because a continuous stream of requests to the currently loaded group could prevent the scheduler from switching to other groups. To avoid such a scenario, the scheduler maintains two lists, namely, an active list and a waiting list. When a request arrives, it is initially added to the waiting list even if it is for objects stored in the currently active disk group. After the scheduler has finished processing requests from the current disk group, it merges the two lists to generate a new active list. It then uses the new active list to assign ranks to groups, determines the target of the next switch, and performs the group switch. Thus, all requests that arrive between two group switches are queued in the waiting list and will be "admitted" for service the next time the scheduler merges the two lists.

What ordering within a group? Although MJoin can handle out-of-order data delivery, the order in which objects are returned plays an important role in determining execution time. Consider a query over three tables A, B and C, where each table has three objects, all of which are stored in the same group. Let us assume that the database can cache only three objects. If the scheduler first returns all objects of A, then all objects of B, and finally all objects of C, the MJoin implementation will be forced to reissue requests for several objects repeatedly even with our efficient cache management algorithm, as the MJoin cannot make progress with objects belonging to the same table. On the other hand, if the I/O scheduler returns back objects in a semantically-smart fashion, satisfying object requests evenly across all the relations (e.g. A.1, B.1, C.1, then A.2, B.2, C.2, etc), the number of reissues will be much smaller as the MJoin implementation can execute many subplans. Thus, our scheduler implements semantically-smart ordering within a loaded storage group.
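A sketch of such semantically-smart intra-group ordering is shown below, interleaving object requests across relations instead of returning one relation at a time; the helper is ours and ignores object sizes and on-disk placement.

from itertools import zip_longest

def interleave_by_relation(requests_by_relation):
    """Return objects round-robin across relations: A.1, B.1, C.1, A.2, B.2, ..."""
    order = []
    for wave in zip_longest(*requests_by_relation.values()):
        order.extend(obj for obj in wave if obj is not None)
    return order

pending = {"A": ["A.1", "A.2", "A.3"],
           "B": ["B.1", "B.2", "B.3"],
           "C": ["C.1", "C.2", "C.3"]}
print(interleave_by_relation(pending))
# ['A.1', 'B.1', 'C.1', 'A.2', 'B.2', 'C.2', 'A.3', 'B.3', 'C.3']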

Evaluation of scheduling policies

Efficient scheduling. Figure 9.8 shows the L2-norm reported by the simulator for various scheduling algorithms under the random-per-object layout (i.e., 'Layout4') and the 'Maximal progress' caching policy. In addition to the policies we described in this section


Policy name            | Group chosen                                                       | Optimization goal
Max-Queries            | One with the maximum number of queries                             | Mimic tertiary I/O scheduling heuristic
Max-Chunks             | One with the max. number of objects requested                      | Similar to Max-Queries, but on an object level
Min-Chunks             | One with the minimum number of objects requested                   | Cache small data and avoid switches to their groups
Shortest-Query-First   | One containing data for the query with the least group switches    | Our version of Shortest Job First
Shortest-Subplan-First | One containing data for the subplan with the least group switches  | Ensure MJoin progress
FCFS                   | One containing data for the next query in chronological order      | Optimum fairness
Round-Robin            | Next logical group with non-zero requests                          | Yet another way to achieve fairness

Table 9.2: Scheduling policies implemented by the simulator

Figure 9.8: Scheduling policies (L2-norm vs. cache size)
Figure 9.9: L2-norm vs. number of queries
Figure 9.10: Cumulative workload execution time (x1000 sec) vs. number of queries

(Max-Queries and Ranking), we also implemented several other policies whose behavior is summarized in Table 9.2.

There are two observations to be made. First, the right choice of scheduling policy can provide a substantial improvement in performance (e.g., a 4× improvement in Figure 9.8), highlighting the importance of scheduling for overall efficiency. Second, the ’Max-Queries’ policy outperforms or matches the best scheduling policy at all cache sizes. We also evaluated these policies under the other layout settings. Although there was no single clear winner in all cases, we found that the ’Max-Queries’ policy consistently performed within 20% of the best policy, which motivated us to use it for maximizing the efficiency of our rank-based scheduling algorithm.
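
As a concrete illustration, the ’Max-Queries’ choice reduces to picking the group whose data is requested by the largest number of distinct queries (an illustrative Python sketch; the request attributes group and query are assumptions):

from collections import defaultdict

def max_queries_group(pending):
    """'Max-Queries' policy from Table 9.2: service next the group with the
    maximum number of distinct queries requesting its objects."""
    if not pending:
        return None
    queries_per_group = defaultdict(set)
    for r in pending:
        queries_per_group[r.group].add(r.query)
    return max(queries_per_group, key=lambda g: len(queries_per_group[g]))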

Balancing efficiency and fairness. To compare the fairness and efficiency of our ranking algorithm, we modified the micro-benchmark described in Section 9.4.1 by using a skewed layout generated based on a power-law distribution, where 80% of the clients' data is stored in 20% of the groups. In addition, instead of issuing just a single query, each simulated client repeats


the cycle of picking a query, issuing requests for the corresponding objects to the server, and receiving notifications back until the query is complete, many times. We run the simulation study multiple times, each time setting the cycle count of each client to 10, 50, or 100.

Figure 9.11: Simulator results (L2-norm, max. stretch, and cumulative workload time): K variation as a function of the number of issued queries

Figure 9.9 and Figure 9.10 show the L2-norm of stretch and the total cumulative workload execution time (the sum of the execution times of all queries across all clients) under various scheduling algorithms. There are three observations to be made. First, as expected, ’FCFS’, the algorithm that provides optimal fairness, has the lowest L2-norm and maximum stretch 2 among all algorithms. However, ’FCFS’ also has the highest cumulative execution time, as it forces many unnecessary group switches. Second, although ’Max-Queries’, the most efficient scheduling algorithm, has the lowest execution time, it suffers from an extremely high L2-norm and maximum stretch, as it starves queries for data stored in less popular groups. Third, our rank-based scheduling algorithm bridges the two worlds, with an L2-norm closer to ’FCFS’ and a cumulative execution time comparable with that of ’Max-Queries’.

K parameter variation. The parameter K takes values in the range [0, 1]. At K = 0, the scheduling algorithm behaves exactly like the ’Max-Queries’ algorithm and completely favors efficiency over fairness. Values strictly between 0 and 1 give results between ’Efficiency’ (the ’Max-Queries’ algorithm, i.e., K = 0) and ’Ranking’ (our algorithm of choice, i.e., K = 1).

Figure 9.11 shows the L2-norm, maximum stretch, and cumulative workload execution time reported by the simulator for the use case described in the previous experiment. We show the results for five K values in the interval [0, 1]: 0, 1/#groups = 0.08, 0.5, 0.75, and 1. At K = 0.5, as expected, the scheduling algorithm is more efficient (it has a lower execution time) but less fair (it has a higher stretch) than the Ranking algorithm. Overall, one can observe that values of K between 0 and 1 give performance between the ’Efficiency’ and ’Ranking’ algorithms. As our goal is to maximize fairness, we set K to 1 in the remainder of this chapter.

2 Maximum stretch, not shown here, follows the same trend as the L2-norm, with a bigger gap, e.g., 202 for ’Max-Queries’ vs. 10 for ’Ranking’.


9.5 Experimental Evaluation

We now present a detailed experimental analysis of the Skipper framework, demonstrating the effectiveness of the various algorithms used in Skipper by comparing its performance with that of vanilla PostgreSQL.

9.5.1 Experimental Setup

Hardware. In all our experiments, we used five servers as our compute servers, each equipped with two six-core Intel Xeon X5660 processors clocked at 2.8GHz, 48GB of RAM, and two 300GB 15,000-RPM SAS disks set up in a RAID-0 configuration.

Our shared CSD service is hosted on a DELL PowerEdge R720 server running RHEL 6.5, equipped with two eight-core Intel E5-2640 processors clocked at 2GHz, 250GB of DDR3 DRAM, and a hardware RAID-0 array of seven 250GB SATA disks with a peak throughput of 1.2GB/s. All servers are connected by a 10Gb switch.

Software. Each compute server runs Ubuntu 12.04.1 and hosts a virtual machine that is allocated four processing cores and a 100GB VHD. The VM runs the Ubuntu 14.04.2 LTS cloud image as the guest OS and PostgreSQL 9.2.1 as the database system 3. There are two versions of PostgreSQL installed in each VM, one extended to support MJoin and the other being the default one. Henceforth, we refer to the MJoin-enabled version of PostgreSQL as Skipper. We limit the amount of memory allocated to the VM to 1GB over the cache size allocated to PostgreSQL to ensure that: 1) the guest OS has enough buffer space to prevent swapping, and 2) our MJoin code works as expected with a limited amount of memory. Vanilla PostgreSQL is always configured with an “effective cache size” of 30GB, and “shared buffers” and “work memory” of 16GB.
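
For reference, a sketch of the corresponding postgresql.conf settings (illustrative values matching the description above; the cache sizes used for the Skipper configurations vary per experiment):

# postgresql.conf excerpt for the vanilla PostgreSQL configuration
effective_cache_size = 30GB   # planner's assumption about available OS/page cache
shared_buffers = 16GB         # PostgreSQL's own buffer cache
work_mem = 16GB               # per-sort/per-hash memory budget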

Compute Servers. Only the database catalog files are stored in each VM's VHD. The actual binary data is stored in Swift as objects and fetched on demand at execution time: each relation is stored as a set of files, where each file represents a 1GB segment (the default PostgreSQL segment size) of the relation. Each relation has a corresponding Swift container, and each segment is stored as an object within the container. Containers and objects are named based on the filenode identifiers used internally by PostgreSQL to map relations and their segments to files. For experiments with multiple clients, we configured Swift with a separate storage account for each client and stored the entire hierarchy of containers and objects separately for each client.

In order to connect PostgreSQL with Swift, we wrote a FUSE file system that intercepts local file accesses from PostgreSQL and translates them into Swift HTTP-GET calls. For instance,

3 In this work, we consider PostgreSQL as a representative mature open-source DBMS. Nonetheless, regardless of the database choice, similar conclusions can be drawn, since group switches are the major obstacle toward an efficient integration of CSD and DBMS. Thus, the design decisions proposed in this chapter remain applicable to other DBMS.

when PostgreSQL scans through a relation, it accesses the backing files one segment at a time. On receiving the first read call for a segment, the FUSE file system uses the segment's file name (the same as the filenode number) to derive the container/object names and issues an HTTP GET call to fetch the corresponding object from Swift.
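
The core of this translation can be sketched as follows (an illustrative Python sketch using the fusepy and requests libraries; the Swift endpoint, the naming helper, and the absence of local caching are assumptions, not our exact implementation):

import requests
from fuse import FUSE, Operations   # fusepy

SWIFT = "http://swift-proxy:8080/v1/AUTH_client1"   # assumed per-client endpoint

class SwiftBackedFS(Operations):
    """Serves PostgreSQL segment files by fetching Swift objects on demand."""

    def _segment_to_object(self, path):
        # e.g. ".../24601.3" -> container "24601" (filenode), object "3" (segment)
        name = path.rsplit("/", 1)[-1]
        relfilenode, _, segno = name.partition(".")
        return relfilenode, segno or "0"

    def read(self, path, size, offset, fh):
        container, obj = self._segment_to_object(path)
        # Fetch the whole 1GB segment object; a real implementation would cache
        # it locally and serve subsequent reads of the segment from that copy.
        resp = requests.get(f"{SWIFT}/{container}/{obj}")
        resp.raise_for_status()
        return resp.content[offset:offset + size]

# Mounting it over PostgreSQL's data directory makes the redirection transparent:
#   FUSE(SwiftBackedFS(), "/mnt/pgdata", foreground=True)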

Shared CSD. We built an emulated CSD by extending Swift using a Python middleware that provides MAID-like functionality. We used OpenStack Swift v2.4.1 in the Single-All-In-One Swift setup to get all Swift processes (Proxy, Account, Container, Object servers) running in our CSD server. The middleware groups disks into disk groups based on a configuration file and maintains persistent metadata to map each object to its disk group. If the middleware receives a GET request for an object on the currently active group, it services it immediately by forwarding it to the Swift backend. However, if it gets a request for an object in a different group, it emulates a group switch by artificially adding delays to the request processing path instead of actually spinning up/down disk drives. In addition to maintaining disk-to-disk group mapping and object-to-disk group assignment metadata, the middleware plugin also implements the I/O scheduling algorithms (see Section 9.4.5).
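
The switch emulation itself amounts to a small piece of state plus an artificial delay in the GET path, roughly as in the following WSGI-style sketch (illustrative Python; the names GROUP_OF and SWITCH_LATENCY and the way the object is identified from the request path are assumptions):

import time

SWITCH_LATENCY = 10.0   # seconds; configurable group switch penalty
GROUP_OF = {}           # persistent object -> disk-group map, loaded from the configuration file

class ColdStorageMiddleware:
    """Delays GET requests that target a disk group other than the active one."""

    def __init__(self, app):
        self.app = app            # downstream Swift proxy application
        self.active_group = None

    def __call__(self, environ, start_response):
        if environ["REQUEST_METHOD"] == "GET":
            obj = environ["PATH_INFO"]          # /v1/<account>/<container>/<object>
            group = GROUP_OF.get(obj)
            if group is not None and group != self.active_group:
                time.sleep(SWITCH_LATENCY)      # emulate spinning the target group up
                self.active_group = group
        return self.app(environ, start_response)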

Benchmarks. We used four benchmarks, namely, TPC-H [240] with SF-50, the Star-Schema Benchmark (SSB) [187] with SF-50, a popular data analytics benchmark [194] over a 20GB database, and a genome-sequencing benchmark over a 13GB NREF database [255].

9.5.2 Experimental Results

We present the results in the following order. First, we show the benefit of out-of-order execution by comparing Skipper to vanilla PostgreSQL. Then, we present a sensitivity analysis of Skipper's algorithms with respect to the group switch latency, layout, cache size and data set size. Last, we present a comparative evaluation of our scheduling algorithms to demonstrate the benefit of rank-based scheduling.

Benefit of out-of-order execution

Figure 9.12 shows the average query execution time of Skipper on CSD, PostgreSQL on CSD (marked ’PostgreSQL’) and PostgreSQL on HDD configurations (marked ’Ideal’) under TPC-H Q12. Similar to Figure 9.2, we scale the number of clients from 1 to 5. We configured both PostgreSQL and Skipper implementation to use a cache size of 30GB (half the dataset size), and the Swift scheduler to use a one-group-per-client data layout, where all data corresponding to a single client is stored together in one group, while data from different clients lies in different groups. As can be seen, Skipper scales much better than PostgreSQL as we increase the number of clients. At five clients, Skipper outperforms PostgreSQL by a factor of 3 when CSD is used as the storage backend. In addition, Skipper is only 35% slower than the ideal HDD-based configuration despite the 10-second CSD group switch time.


Figure 9.12: Average execution time comparison (x1000 sec) vs. number of clients
Figure 9.13: Cumulative execution time (x1000 sec) of the mixed workload (TPC-H, MR-Bench, NREF, SSB)

Figure 9.13 shows the cumulative execution time of a batch of queries under a mixed workload. For this experiment, each client runs a different workload (TPC-H Q12, JoinTask from the analytical benchmark, Q1 from SSB, and a 4-table join that counts protein sequences matching a specific criterion from NREF), repeating the workload query 5 times. The results are similar to Figure 9.12, as Skipper provides a 2-3× reduction in execution time in all cases.

The scalability of Skipper in these results can be attributed entirely to the ability of MJoin to perform out-of-order query execution. Under both PostgreSQL and Skipper, Swift switches to each group one by one and services the HTTP GET requests. But in contrast to vanilla PostgreSQL, the MJoin-enabled PostgreSQL in Skipper can handle out-of-order data arrival. Thus, it submits requests for all necessary data blocks upfront to Swift, enabling Swift to service the GET requests for all objects within a group before switching to the next group. As a result, the total waiting time for any client C is (C − 1) × (D/B + S), where D is the total number of objects in the dataset, B is the rate at which Swift can push objects out to the client, and S is the group switch latency. Vanilla PostgreSQL, on the other hand, would have a total execution time of C × S × D, as explained in Section 9.3.2.
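
To make the formulas concrete with assumed (not measured) values: for D = 60 objects, B = 1 object/s, and S = 10 s, the fifth client (C = 5) waits (5 − 1) × (60/1 + 10) = 280 s under Skipper, whereas the corresponding vanilla PostgreSQL term is 5 × 10 × 60 = 3,000 s, more than an order of magnitude higher.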

Figure 9.14 shows the average execution time breakdown per client for the 5-client case, each client running TPC-H Q12 (as in Figure 9.12). ’Switch time’ and ’Transfer time’ together constitute the waiting time of each DB instance to receive its data, while ’Processing’ accounts for actually processing the query. As can be seen, 98% of the total execution time in the case of PostgreSQL is spent on waiting, out of which 65% is the CSD switch time. On the contrary, Skipper optimizes the switch time, reducing it to a mere 2%, while 41% of Skipper's total execution time goes to useful work. In the case of Skipper, the biggest stall comes from receiving data from Swift (shown as ’Transfer time’).

To understand the overhead of each component in the system, we run three experiments with TPC-H Q12: 1) all data is stored locally and accessed directly using the native


Figure 9.14: Average execution time breakdown (Transfer time, Switch time, Processing) for 5 clients
Figure 9.15: Sensitivity to CSD group switch latency (average execution time vs. group switch latency)

file system, 2) all data is stored locally but accessed using our FUSE file system (this applies only to vanilla PostgreSQL, as MJoin does not use the FUSE file system), and 3) all data is stored remotely on Swift within a single disk group (i.e., there are no group switches). The execution times of both PostgreSQL and Skipper are presented in Figure 9.16. We use the results of this experiment to break down the execution time of PostgreSQL and Skipper into: 1) query execution, 2) FUSE file system, and 3) network access, as shown in Table 9.3.

Comparing query execution times under PostgreSQL and Skipper, we see that in the absence of group switches, Skipper takes 25 seconds more than PostgreSQL, which translates to a 6% overhead, showing that out-of-order query execution in Skipper has marginal overhead. The FUSE file system itself adds very little overhead (1.6%) to PostgreSQL's execution. Last, storing data remotely in Swift doubles the execution time under both PostgreSQL and Skipper.

Our current Swift middleware explicitly serializes GET requests and services them one at a time to simplify the implementation and ensure the correctness of the emulation (for instance, ensuring that the Swift backend has no pending requests for the current group before emulating a group switch). As a result, it does not overlap disk I/O with network I/O and substantially increases the end-to-end transfer latency. We verified this by running PostgreSQL on default Swift without the Skipper middleware and found that the execution time was only 25% higher than the local run. However, as the overhead induced by the middleware is common to both PostgreSQL and Skipper, reducing it would provide a proportional improvement in execution time in both systems. Thus, the relative performance of the two systems and the insights we derive in this chapter will not change under an optimized plugin implementation.

There are two important conclusions we would like to draw here. First, based on the above equations, it is clear that if D/B ≫ S, Skipper makes the database clients insensitive to access latency. As we target analytics over large data sets stored in CSD, this will be the


common case. PostgreSQL, on the other hand, will always suffer from even a minor increase in C, S, or D. Second, we are neither saturating the storage I/O throughput (1.2GB/s) nor the network bandwidth (10Gb/s) with our current Swift middleware. Thus, by parallelizing the servicing of requests within a group, we can reduce the transfer time substantially. With such improvements, Skipper would outperform PostgreSQL by an even bigger margin and offer performance comparable to conventional disk-based storage services.

Component        | PostgreSQL    | Skipper
Query execution  | 407s (41.9%)  | 433s (43%)
Fuse file system | 15.75s (1.6%) | /
Network access   | 550s (56.5%)  | 574s (57%)

Table 9.3: Execution breakdown of PostgreSQL and Skipper

Figure 9.16: Local vs. remote execution

Sensitivity to the group switch latency

Figure 9.15 shows the average execution time of Skipper and PostgreSQL with five clients, under TPC-H Q12, as we increase the group switch latency from 10 to 40 seconds. Comparing Figure 9.15 and Figure 9.3, one can see that Skipper is tolerant of changes in group switch latency. These results validate our previous claim that Skipper makes database clients insensitive to access latency. This, again, is due to the fact that the I/O scheduler minimizes the number of group switches by serving all requests in a single group before switching to the next group. Thus, unlike the PostgreSQL case, where there were a total of 57 group switches (one per segment accessed), the MJoin case has only five group switches. Because of this tolerance, Skipper can work even with CSD with much higher group switch latencies.

Sensitivity to the layout choice

Figure 9.17a shows the average query execution time of both systems as we vary the layout in the CSD. We obtain these results by fixing the number of clients to 4 and varying the layout in the Swift scheduler. We use 4 layouts for these experiments, namely, all-on-one (Allin1), two-clients-per-group (2perG), one-client-per-group (1perG), and incremental (Increm.). The


Figure 9.17: MJoin sensitivity to: a) layout, b) cache size, c) data set size

first three layouts gradually spread the clients out across one, two, and four groups. The last layout partitions each client's data into two parts and stores each half on a separate group, such that group G1 stores C1.1 and C4.2, G2 stores C1.2 and C2.1, G3 stores C2.2 and C3.1, and G4 stores C3.2 and C4.1.

There are two important observations to be made. First, notice that both Skipper and default PostgreSQL have similar execution times under the all-in-one case, as there are no group switches. However, in all other cases, Skipper provides a 2× to 3× improvement over vanilla PostgreSQL. Second, the execution time under PostgreSQL increases progressively as we spread the data across groups, from the all-in-one case to the one-client-per-group case. This shows the impact that layout has on default PostgreSQL. Under Skipper, execution time increases between the all-in-one case and the two-clients-per-group case due to the data transfer delays mentioned earlier. However, it remains constant as we fan out from two clients per group to one, demonstrating Skipper's low sensitivity to variations in layout.

Sensitivity to the cache size

We now present results quantifying the impact of cache size on our MJoin implementation. For this experiment, we fix the number of clients to five, configure Swift to use the one-client-per-group layout, and use TPC-H Q5 as our benchmark. We choose Q5 since it is a complex six-table join whose input covers almost the whole TPC-H data set and produces many more subplan combinations than Q12.

Figure 9.17b shows the average execution time of MJoin at various cache sizes. The average query execution time under vanilla PostgreSQL was 3,710 seconds (not shown). Thus, in the worst case (10GB), Skipper is 2.2× slower than PostgreSQL. It matches PostgreSQL's performance at 15GB (20% of the data set) and provides a 1.37× to 1.59× improvement at higher cache sizes.


To test the scalability of MJoin at low cache capacities, we repeat the same experiment with a larger data set (TPC-H SF-100). As before, we run Q5, which now reads 127 objects out of 140 in total, varying the cache size from 14 objects (10% of the data set size) to 42 objects (30% of the data set size) in 5% increments. There are 14,630 subplans in total. The results are presented in Figure 9.17c.

Comparing Figure 9.17b and Figure 9.17c, we can see that Skipper's execution time increases substantially as we reduce the cache size. Under SF-50, Skipper's execution time increases 3.6× as we shrink the cache size from 30GB (40%) to 10GB (14%). Under SF-100, Skipper's execution time increases 4.8× as we scale the cache size down further from 42GB (30%) to 14GB (10%). The performance drop is a consequence of the increased number of request reissues, as shown by the black line in Figure 9.17b and Figure 9.17c. Under SF-50, the total number of Swift objects requested by MJoin increases from 64 to 388 as we reduce the cache size. SF-100 pushes this further, as MJoin requests 212 objects at 42GB and 1,787 objects at 14GB cache capacity, respectively.

These graphs show an important trade-off between the cache capacity and the performance of MJoin. Given R relations, each of size S objects, the traditional hash join has a time complexity of O(S × R), as each relation is fetched only once and used to either build or probe a hash table at each hash join stage. This requires the cache capacity C to be large enough to hold all (but one) relations, i.e., it requires a cache capacity of S × (R − 1). In the best case, MJoin is able to buffer R − 1 input relations in memory entirely and avoid request reissues completely. Thus, similar to hash join, its best-case time complexity is O(S × R) for a cache capacity of S × (R − 1). However, unlike hash join, MJoin can proceed even with limited cache capacity at the expense of performance.

Let us consider a cache of capacity C ≪ (R − 1) × S. Given the small cache capacity, MJoin will proceed in several cycles. In each cycle, MJoin will request all objects belonging to pending subplans and execute the subplans as objects arrive. Let us consider one such cycle. As our query-aware scheduler returns data corresponding to relations in a round-robin fashion, the cache is evenly divided among the R relations, with C/R objects of each relation being buffered. To simplify the analysis, let us assume that join execution happens after C objects have arrived in the cache. Thus, we get the first batch of C objects. Given C/R objects of each of the R tables in the cache, (C/R)^R subplans are evaluated. Then, we get the next batch of C objects and perform join execution. Given the relation size of S, this process repeats (S × R)/C times in each cycle. Thus, a total of (C/R)^R × (S × R)/C subplans will be evaluated in each cycle. Given that the total number of subplans is S^R, the number of cycles (reissues) that will happen is then S^R / ((C/R)^R × (S × R)/C) = ((R × S)/C)^(R−1). MJoin needs C to be large enough to hold at least R objects so that at least one subplan can make progress. Thus, in the worst case, with a cache capacity of R, the time complexity of MJoin is O(S^R). Comparing this with the best case (O(S × R)), we can see the trade-off between the cache capacity and the performance of MJoin.
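
The analysis above can be condensed into a few lines (an illustrative Python model under the stated simplifying assumptions; it is not the simulator used in our experiments):

def mjoin_reissue_model(R, S, C):
    """Fetch cycles (reissues) and objects transferred for an R-way join over
    relations of S objects each, with cache capacity C objects, assuming
    round-robin delivery and the simplified execution model above."""
    assert C >= R, "at least one object per relation must fit in the cache"
    subplans_per_cycle = (C / R) ** R * (S * R / C)
    total_subplans = S ** R
    cycles = total_subplans / subplans_per_cycle    # equals ((R * S) / C) ** (R - 1)
    objects_fetched = cycles * R * S                # the whole data set is re-fetched each cycle
    return cycles, objects_fetched

# Example: a 3-way join over relations of 10 objects each.
print(mjoin_reissue_model(3, 10, 20))   # (2.25, 67.5): ample cache, few reissues
print(mjoin_reissue_model(3, 10, 3))    # (100.0, 3000.0): minimal cache, heavy reissuing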


Figure 9.18: Fairness vs. efficiency: a) L2-norm and maximum stretch, b) Cumulative workload time

Despite the fact that the request reissue overhead can be high, there are techniques to decrease it. In the case of the TPC-H queries we tested, the request reissue overhead was high because each table object contains tuples contributing to the end result. Thus, the same object is refetched and rescanned multiple times. Should the distribution of result tuples differ, such that interesting tuples are clustered rather than uniformly distributed across all objects, Skipper would substantially reduce the request reissue overhead. In such a case, Skipper marks the objects that do not contain any result tuples and omits requests for them in the future, pruning out the subplans in which they take part. For instance, let us consider a 4-table join with 10 objects in each table. The total number of possible subplans is 10^4. Nonetheless, if even a single object has no data that contributes to the result, Skipper can safely prune 10^3 subplans (as all subplans with that object are guaranteed not to produce any result). This subplan pruning, combined with the fact that Skipper automatically prioritizes caching of small tables over large ones, ensures that the performance drop due to reissues is not dramatic even for big data sets. In the case of the TPC-H queries we tested, however, such subplan pruning did not occur; thus, request reissues dominated execution.
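
The pruning arithmetic in this example can be checked with a tiny helper (illustrative code, not Skipper's implementation; it assumes the empty objects all belong to a single table):

def prunable_subplans(num_tables, objects_per_table, empty_objects):
    """Subplans that can be skipped once `empty_objects` objects of one table
    are known to contain no result tuples."""
    total = objects_per_table ** num_tables
    # Only subplans built exclusively from the remaining objects of that table survive.
    surviving = (objects_per_table - empty_objects) * objects_per_table ** (num_tables - 1)
    return total - surviving

print(prunable_subplans(4, 10, 1))   # 1000 of the 10000 possible subplans are pruned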

Balancing efficiency and fairness

Our final result shows the effectiveness of our ranking-based scheduling algorithm in balancing fairness and efficiency. For this experiment, we use 5 clients, each issuing TPC-H Q12 ten times. We configure Swift to use a skewed data layout such that two groups store data corresponding to two clients each, while the last group stores the fifth client's data. We compare three scheduling algorithms, namely, FCFS (’fairness’), Max-Queries (’maxquery’), and our rank-based algorithm (’ranking’), all of which were explained in Section 9.4. In addition to query execution time, we also report the L2-norm described in Section 9.4.1.

Figure 9.18a shows both L2-norm and maximum stretch across the three scheduling policies, while Figure 9.18b shows the cumulative execution time across all clients. As expected, the


Figure 9.19: K parameter variation

’Max-Queries’ algorithm has the lowest execution time but significantly increases the maximum stretch, as queries on the group with just one client end up starving. The ’FCFS’ algorithm, in contrast, trades efficiency for fairness, as evidenced by the reduction in maximum stretch but a proportional increase in overall completion time. Our rank-based scheduling algorithm strikes a middle ground. Initially, the algorithm sticks to the two groups with two pending queries each, thus maximizing efficiency. However, each time Skipper switches to one of these two groups, it also increases the rank of the group with just a single client by one. Once every four group switches, the group with a single client outranks the rest, resulting in the corresponding client being serviced. Thus, the rank-based scheduling algorithm preserves efficiency while avoiding starvation.

K parameter variation. Figure 9.19 shows the L2-norm, maximum and average stretch for values of K = {0, 0.5, 1}. For this experiment, we use the same setup as in the previous experiment. Similar to the simulation experiment, in Figure 9.19b we report the cumulative workload execution times, corroborating that K values between 0 and 1 produce performance between ’Efficiency’ and ’Ranking’. As expected, at K = 0.5 the scheduling algorithm is more efficient (it has a lower execution time) but less fair (it has a higher stretch) than the Ranking algorithm. As our goal is to maximize fairness, we set K to 1. These results corroborate the simulator results from Section 9.4.5.

Summary. From all the experiments, we can clearly see that the Skipper architecture scales better and tolerates higher group switch latencies than the traditional architecture when CSD is used as a primary storage. All three aspects of Skipper (out-of-order execution, efficient caching, and rank-based scheduling) substantially contribute toward masking the CSD group switch latency.


9.6 Related work

The integration of DBMS and CSD considered in this chapter naturally draws inspiration from a large body of work coming from the database and storage systems worlds. In the following, we discuss the avenues most closely related to the architecture of Skipper.

Tertiary databases. The integration of databases and CSD most resembles the work on tertiary memory databases [208, 209]. However, unlike tapes, which are exclusively accessed and controlled by the database system, a CSD is shared among multiple tenants, making tight pull-based control between caching and scheduling impossible. Similarly, while in tertiary databases the notion of tenant fairness is nonexistent, in enterprise data centers fairness should be a primary concern.

Adaptive query processing. Adaptive query processing emerged in the past decade as a way to deal with environmental changes, either to fix suboptimal query optimization decisions taken a priori during compilation [24, 150, 155, 175], or to deal with changes in data arrival characteristics that often appear in streaming environments [15, 21, 116, 242, 246, 252]. Neither group of techniques is, however, fully applicable to the DBMS-CSD integration. While the former approaches address the orthogonal issue of cardinality misestimates in query optimization, the latter achieve adaptivity at the cost of high memory consumption. To operate under limited cache capacity, the execution strategy has to be tightly coupled with cache management. In this chapter, we thus propose a cache-controlled MJoin algorithm that efficiently supports out-of-order data arrival even at low cache capacities.

Scheduling theory. The problem of scheduling can be considered at several levels of granularity, ranging from low-level block-based I/O scheduling, through tape scheduling, to job or task scheduling at higher granularity levels.

Considering the first, proportional share schedulers [154, 219] allocate throughput (IOPS) to each application in proportion to user-specified weights. Further efforts extend them with reservations and limits to provide flexible bounds on resource allocation for virtualized storage [112], and with techniques for balancing fairness and throughput [113]. In contrast, [200, 249] use time multiplexing instead of fair queuing to provide strict performance isolation under interference from multiple workloads. All these approaches assume that I/O requests are directed at a set of disks that are spun up. Our work, in contrast, focuses on the orthogonal problem of scheduling object requests at a higher level, i.e., across both spun-up and powered-down disks, with the goal of minimizing the total number of spin-ups while balancing fairness. Thus, our scheduling algorithm can potentially be extended with these complementary approaches to perform proportional sharing within a disk group.

Tape scheduling algorithms have so far focused only on reducing the number of switches while ignoring fairness. Recent research has shown that an optimal algorithm for scheduling a given set of I/O requests over a set of tapes is one that minimizes the number of switches, and that an algorithm that picks as the next device the one with the largest number of pending requests

consistently performs within 2% of the optimal algorithm [201]. We adopt this algorithm, i.e., the policy that chooses to service next the group with the maximal number of queries, and extend it with a notion of fairness. Our rank-based algorithm was inspired by the rFEED [114] task scheduling algorithm. The problem of task scheduling has been studied extensively in the past [197]. However, while task scheduling algorithms assume that task execution times are independent, query execution time in our case depends on which group is active at the time of query execution, which, in turn, depends on the order of query execution. Thus, task scheduling algorithms are not directly applicable to our context.

Hot and cold data classification and migration. There is a large body of research on data classification in the context of main-memory or multitiered databases that can be used to identify hot and cold data, e.g., [76, 84, 167]. Enterprise databases have long used such algorithms to improve performance by caching hot data in low-latency storage devices. Similarly, databases have also used Hierarchical Storage Managers (HSM) to automatically manage the migration of data between online, nearline, and offline storage tiers [164]. We do not consider the orthogonal problems of data classification or automatic data migration in our work. Rather, we focus on query execution over “cold data at rest” in the CSD, after classification and migration have taken place.

9.7 Outlook and conclusions

In this chapter, we demonstrate that the usage of cold storage devices enables a new tier in the enterprise storage tiering hierarchy, named the Cold Storage Tier. We show that the cold storage tier is able to replace both the capacity and archival tiers in their functionality, thereby offering major cost savings for enterprise data centers. Furthermore, we show that data analytics can be run on such a platform through a judicious hardware-software codesign in which both the database query execution engine and the CSD work in concert toward a common goal: masking the high access latency of CSD group switches.

The implications and benefits of using CSD reach far beyond enterprise data centers and are equally applicable to cloud providers. For instance, Cloud Service Providers (CSP) have already started deploying custom-built, rack-scale CSD explicitly targeted at cold data workloads [97, 202, 254], and have already reported substantial cost savings from doing so. According to a recent report from Facebook, the Open Vault cold storage system reduced their expenses by a third compared to conventional online storage, while their Blu-ray-based cold storage system reduced power consumption by 80% over Open Vault [254]. Recognizing the potential of CSD, CSP have started offering hosted, low-cost cold storage services based on CSD, and such cold-storage-as-a-service offerings are quickly gaining popularity, offering cloud customers a chance to benefit from inexpensive storage [13, 97].

We believe that the CSD benefit for cloud providers could go beyond offering just storage-as-a-service. By following the design and architecture of Skipper, CSP could offer cloud-hosted data analytics services over CSD. Such a design would benefit both customers and providers of

cloud-hosted data analytics services, as providers could increase revenue by offering cheap analytics services on data stored on CSD, and customers could reduce the total cost of ownership by running latency-insensitive analytics workloads on cold data stored on CSD.

10 Concluding Remarks and Future Ahead

This thesis contributes to the quest of bridging the gap between traditional DBMS technology and the requirements of modern data analytics applications. With this work, we show that DBMS can still join the race for servicing new applications, as long as they employ an agile and adaptive approach. We believe that the insights and techniques presented in this thesis could pave the way for building fully adaptive DBMS able to service new, modern, data analytics applications.

In particular, lazy, workload-driven adaptivity is a promising path toward reducing the data-to-insight time for data exploration applications, while autonomous adaptation enables non-DBMS-savvy users from other domains (e.g., businesses, science, etc.) to benefit from decades of research into database technology. Runtime data-driven adaptation of query operators is a promising way of tackling a serious impediment to predictable query performance that has plagued DBMS for more than 40 years: suboptimal query plans proposed by the optimizer at compile time. Finally, we show how DBMS could benefit from new hardware offerings and substantially reduce the storage cost of enterprise data analytics solutions if they employ hardware-driven adaptation, in which the storage system optimizes access to the data rather than the DBMS having full control over it.

10.1 Thesis contributions and lessons learned

Looking at the advancements of current technology, this thesis presents the following technological and intellectual contributions:

• From the technological aspect, Chapter 7 presents the design and implementation of novel auxiliary design structures tailored to hiding the overhead of processing raw data files. Positional maps, caches, and selective parsing and tokenizing all contribute significantly toward masking the overhead of raw data access. From the intellectual aspect, this work demonstrates that workload-driven adaptivity with zero preparation overhead presents a promising path toward servicing interactive data exploration applications.


Moreover, the usage of workload as a driving force for performance tuning presents an autonomous way to decouple the users’ interest from the data growth, since ultimately not all data is interesting and needed to gain useful insights.

• Chapter 8 makes a technological advancement in DBMS access paths by introducing a novel hybrid access path operator, called Smooth Scan, able to replace the existing access path operators, as it approximates the performance of the optimal choice throughout the entire selectivity interval. Intellectually, with Smooth Scan we have demonstrated that pushing decisions from query optimization to query execution, coupled with continuous learning and morphing, can alleviate the suboptimal access path decisions that hurt the performance of long-running analytical queries. Looking long-term, data-driven adaptation presents a promising path for dealing with the high increase in data volume and velocity, where the lack of statistics for a data set will be common rather than an exception.

• In Chapter 9, with Skipper, a new CSD-targeted query execution framework, we have shown that CSD present a promising path toward reducing the cost of data stored in both private and public clouds. More importantly, we have learned and demonstrated that judicious hardware-software codesign is needed in order to fully exploit the advantages of new hardware technology and to be able to service new data analytics applications.

Overall, the work done in the context of this thesis showcases that runtime adaptivity is key to dealing with the rising uncertainty about workload characteristics that modern data analytics applications exhibit.

10.2 Thesis impact

Although the research paths of this thesis are pursued in the context of traditional relational DBMS, their impact reaches far beyond RDBMS. In the following, we showcase some examples of applications that could benefit from the techniques introduced in this thesis.

Exploiting spatial locality and access path morphing for NoSQL solutions. Similar to RDBMS, access patterns with the same trade-off between random and sequential I/O exploited in Chapter 8 are observed in NoSQL database solutions [211, 212]. Both key-value and document stores organize data internally into a form of hash tables or (partitioned) B-trees. Therefore, when traversing the data structure to locate data, their access pattern highly resembles index scans in relational DBMS [211], because they perform random I/O [212]. Using the insights of Smooth Scan to exploit spatial locality would reduce the random and repeated I/O accesses of document and key-value stores as well, thus improving their performance.

Similarly, access path morphing between sequential scans and indexes could be applied in the context of the Map Reduce paradigm [75] and Hadoop++ [80], an extension of the Hadoop framework [118]. Hadoop++ builds unclustered and clustered indexes, referred to as Trojan

indexes. Despite improving the performance of the Hadoop framework, Hadoop++ is still left with the choice between fully scanning a node and accessing it through the index; the decision depends on the job's selectivity and could be removed altogether by applying, at runtime, the region snooping employed by Smooth Scan.

Improving performance of raw data access. NoDB, presented in Chapter 7, is not the only technique that accesses raw data files. The Map Reduce framework [75] is known for offering this capability. While bringing more flexibility when it comes to fault tolerance and data distribution, Map Reduce platforms such as Hadoop [118], Hive [236], Pig [190], Shark [258], SparkSQL [18], etc., could equally benefit from positional maps and continuous learning and tuning to improve raw data access, as demonstrated with DiNoDB [237].

Inexpensive database-as-a-service cloud offerings. The implications and benefits of using CSD to reduce the storage cost of enterprise data centers introduced in Chapter 9 are equally applicable to cloud providers. As a matter of fact, recognizing the potential of CSD, cloud service providers (CSP) have started offering hosted, low-cost cold storage services based on CSD, and such cold-storage-as-a-service offerings are quickly gaining popularity, offering cloud customers a chance to benefit from inexpensive storage [13, 97, 264].

We believe that the CSD benefit for cloud providers could go beyond offering storage-as-a-service. By following the design and architecture of Skipper presented in Chapter 9, CSP could offer cloud-hosted data analytics services over CSD. Such a design would benefit both customers and providers of cloud-hosted data analytics services, as providers could increase revenue by offering inexpensive analytics services on data stored on CSD, and customers could benefit from the infinite elasticity of the cloud while at the same time reducing the total cost of ownership by running latency-insensitive analytics workloads on cold data stored in CSD.

10.3 Looking ahead

The work presented in this thesis is a step toward achieving the vision of a completely hands-free, self-organizing DBMS able to seamlessly adjust to arbitrary workloads. That being said, there are multiple avenues one still has to explore in order to come closer to this vision.

Adaptive query plans. Smooth Scan alleviates the problem of suboptimal plans at the access path level, by relieving the optimizer of the burden of choosing an optimal access path a priori. Similar to suboptimal access paths, suboptimal decisions could be made when choosing the physical join operator implementation as well as join ordering, all of which can negatively affect the performance and predictability of query processing. Therefore, a truly adaptive query processing engine has to incorporate decisions and adjustments at all levels, starting from access paths as demonstrated with Smooth Scan, to adaptive join operators that can morph between different physical operator implementations (such as from an index nested loop into a hash join), and finally progressively refining join ordering. When it comes to join ordering, the internal state that operators carry (i.e., intermediate results) affects the

reordering flexibility; hence, operators such as MJoin, presented in Chapter 9, that naturally support this change are worth considering.

Building fully adaptive query engines. When considering raw data query processing, the adaptation could be further refined to incorporate heterogeneity across file formats. Hence, different file formats could have different access path plug-ins [158] and even different operator implementations [159, 160], specifically tailored to each file format and each workload and generated at runtime. Similarly, adaptive data caches such as the one exploited in Chapter 7 could adjust their layout to the workload as well, since the decision among row-oriented, column-oriented, or hybrid caches could further improve query execution performance [11]. In such a context, the data analyst becomes the creator of his own database as a side-effect of launching queries, instead of building databases in order to launch queries.

Elastic distributed platforms for CSD. As discussed in Chapter 9, CSD offer major cost savings for enterprise data centers, and could equally bring benefit to both cloud customers and providers. Skipper, a query processing framework presented in this thesis, is an example of a holistic hardware-software design that enables efficient data analytics over data stored in CSD. The work done in the context of Skipper is by no means exhaustive: the characteristics of CSD with respect to the non-uniform access latency raise an interesting question pertaining to the choice of CSD-friendly layouts. Similarly, when considering data stored in the cloud, crucial questions regarding multitenant consolidation and resource allocation have to be addressed.

A An appendix

A.1 TPC-H Query Plans

This section shows the TPC-H query execution plans for the experiment presented in Section 8.7.2. For each query, we show the original PostgreSQL query execution plan and the plan after introducing Smooth Scan into PostgreSQL. The plans are obtained with the "explain analyze" command of PostgreSQL. For clarity, we omit details such as the optimizer's cost and time information, while we keep the cardinality information. The first bracket at the operator level denotes the optimizer's estimated cardinality, while the second bracket with the prefix "actual" contains the actual cardinality measured at run time. As expected, both plans, the original one and the one where every access path decision is replaced by Smooth Scan, return the same number of records.

A.1.1 Q1

PostgreSQL:
Sort (rows=13334) (actual rows=4 loops=1)
  Sort Key: l_returnflag, l_linestatus
  Sort Method: quicksort  Memory: 25kB
  -> HashAggregate (rows=13334) (actual rows=4 loops=1)
     -> Seq Scan on lineitem (rows=19995351) (actual rows=59047103 loops=1)
        Filter: (l_shipdate <= '1998-08-28'::timestamp)
        Rows Removed by Filter: 938949

PostgreSQL with Smooth Scan:
Sort (rows=13334) (actual rows=4 loops=1)
  Sort Key: l_returnflag, l_linestatus
  Sort Method: quicksort  Memory: 25kB
  -> HashAggregate (rows=13334) (actual rows=4 loops=1)
     -> Index Smooth Scan using idx1202121036090 on lineitem (rows=19995351) (actual rows=59047103 loops=1)
        Index Cond: (l_shipdate <= '1998-08-28'::timestamp)
        Rows Removed by Filter: 938949


A.1.2 Q4

PostgreSQL:
Sort (rows=1) (actual rows=5 loops=1)
  Sort Key: orders.o_orderpriority
  Sort Method: quicksort  Memory: 25kB
  -> HashAggregate (rows=1) (actual rows=5 loops=1)
     -> Nested Loop (rows=37500) (actual rows=525621 loops=1)
        -> HashAggregate (rows=67) (actual rows=13753474 loops=1)
           -> Seq Scan on lineitem (rows=19995351) (actual rows=37929348 loops=1)
              Filter: (l_commitdate < l_receiptdate)
              Rows Removed by Filter: 22056704
        -> Index Scan using sql120209155202560 on orders (rows=1) (actual rows=0 loops=13753474)
           Index Cond: (o_orderkey = lineitem.l_orderkey)
           Filter: ((o_orderdate >= '1994-07-01'::date) AND (o_orderdate < '1994-10-01'))
           Rows Removed by Filter: 1

PostgreSQL with Smooth Scan:
Sort (rows=1) (actual rows=5 loops=1)
  Sort Key: orders.o_orderpriority
  Sort Method: quicksort  Memory: 25kB
  -> HashAggregate (rows=1) (actual rows=5 loops=1)
     -> Nested Loop (rows=37500) (actual rows=525621 loops=1)
        -> HashAggregate (rows=67) (actual rows=13753474 loops=1)
           -> Index Smooth Scan using sql120209154437510 on lineitem (rows=19995351) (actual rows=37929348 loops=1)
              Filter: (l_commitdate < l_receiptdate)
              Rows Removed by Filter: 22056704
        -> Index Smooth Scan using sql120209155202560 on orders (rows=1) (actual rows=0 loops=13753474)
           Index Cond: (o_orderkey = lineitem.l_orderkey)
           Filter: ((o_orderdate >= '1994-07-01'::date) AND (o_orderdate < '1994-10-01'))
           Rows Removed by Filter: 1

A.1.3 Q6

PostgreSQL:
Aggregate (rows=1) (actual rows=1 loops=1)
  -> Index Scan using idx1202121036090 on lineitem (rows=500) (actual rows=1195577 loops=1)
     Index Cond: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate < '1997-01-01') AND (l_discount >= 0.02) AND (l_discount <= 0.04))
     Filter: (l_quantity < 25::numeric)
     Rows Removed by Filter: 1293437

PostgreSQL with Smooth Scan:
Aggregate (rows=1) (actual rows=1 loops=1)
  -> Index Smooth Scan using idx1202121036090 on lineitem (rows=500) (actual rows=1195577 loops=1)
     Index Cond: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate < '1997-01-01') AND (l_discount >= 0.02) AND (l_discount <= 0.04))
     Filter: (l_quantity < 25::numeric)
     Rows Removed by Filter: 1293437

A.1.4 Q7

PostgreSQL:
Sort (rows=4) (actual rows=4 loops=1)
  Sort Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem.l_shipdate)::timestamp without time zone))
  Sort Method: quicksort  Memory: 25kB
  -> HashAggregate (rows=4) (actual rows=4 loops=1)
     -> Nested Loop (rows=16) (actual rows=58258 loops=1)
        Join Filter: ((customer.c_nationkey = n2.n_nationkey) AND (((n1.n_name = 'EGYPT'::bpchar) AND (n2.n_name = 'CHINA'::bpchar)) OR ((n1.n_name = 'CHINA'::bpchar) AND (n2.n_name = 'EGYPT'::bpchar))))
        Rows Removed by Join Filter: 2846110
        -> Nested Loop (rows=2999) (actual rows=1452184 loops=1)
           -> Nested Loop (rows=2999) (actual rows=1452184 loops=1)
              -> Hash Join (rows=2999) (actual rows=1452184 loops=1)
                 Hash Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                 -> Index Scan using idx1202121036090 on lineitem (rows=299930) (actual rows=18230325 loops=1)
                    Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))
                 -> Hash (rows=1000) (actual rows=7969 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 483kB
                    -> Hash Join (rows=1000) (actual rows=7969 loops=1)
                       Hash Cond: (supplier.s_nationkey = n1.n_nationkey)
                       -> Seq Scan on supplier (rows=100000) (actual rows=100000 loops=1)
                       -> Hash (rows=2) (actual rows=2 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 1kB
                          -> Seq Scan on nation n1 (rows=2) (actual rows=2 loops=1)
                             Filter: ((n_name = 'EGYPT'::bpchar) OR (n_name = 'CHINA'::bpchar))
                             Rows Removed by Filter: 23
              -> Index Scan using sql120209155202560 on orders (rows=1) (actual rows=1 loops=1452184)
                 Index Cond: (o_orderkey = lineitem.l_orderkey)
           -> Index Only Scan using idx1202121037110 on customer (rows=1) (actual rows=1 loops=1452184)
              Index Cond: (c_custkey = orders.o_custkey)
              Heap Fetches: 1452184
        -> Materialize (rows=2) (actual rows=2 loops=1452184)
           -> Seq Scan on nation n2 (rows=2) (actual rows=2 loops=1)
              Filter: ((n_name = 'CHINA'::bpchar) OR (n_name = 'EGYPT'::bpchar))
              Rows Removed by Filter: 23

PostgreSQL with Smooth Scan:
Sort (rows=4) (actual rows=4 loops=1)
  Sort Key: n1.n_name, n2.n_name, (date_part('year'::text, (lineitem.l_shipdate)::timestamp without time zone))
  Sort Method: quicksort  Memory: 25kB
  -> HashAggregate (rows=4) (actual rows=4 loops=1)
     -> Nested Loop (rows=16) (actual rows=58258 loops=1)
        Join Filter: ((customer.c_nationkey = n2.n_nationkey) AND (((n1.n_name = 'EGYPT'::bpchar) AND (n2.n_name = 'CHINA'::bpchar)) OR ((n1.n_name = 'CHINA'::bpchar) AND (n2.n_name = 'EGYPT'::bpchar))))
        Rows Removed by Join Filter: 2846110
        -> Nested Loop (rows=2999) (actual rows=1452184 loops=1)
           -> Nested Loop (rows=2999) (actual rows=1452184 loops=1)
              -> Hash Join (rows=2999) (actual rows=1452184 loops=1)
                 Hash Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                 -> Index Smooth Scan using idx1202121036090 on lineitem (rows=299930) (actual rows=18230325 loops=1)
                    Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))
                 -> Hash (rows=1000) (actual rows=7969 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 483kB
                    -> Nested Loop (rows=1000) (actual rows=7969 loops=1)
                       -> Index Smooth Scan using sql120209154306430 on nation n1 (rows=2) (actual rows=2 loops=1)
                          Filter: ((n_name = 'EGYPT'::bpchar) OR (n_name = 'CHINA'::bpchar))
                          Rows Removed by Filter: 23
                       -> Index Only Scan using idx1202121034380 on supplier (rows=500) (actual rows=3984 loops=2)
                          Index Cond: (s_nationkey = n1.n_nationkey)
                          Heap Fetches: 7969
              -> Index Smooth Scan using sql120209155202560 on orders (rows=1) (actual rows=1 loops=1452184)
                 Index Cond: (o_orderkey = lineitem.l_orderkey)
           -> Index Only Scan using idx1202121037110 on customer (rows=1) (actual rows=1 loops=1452184)
              Index Cond: (c_custkey = orders.o_custkey)
              Heap Fetches: 1452184
        -> Materialize (rows=2) (actual rows=2 loops=1452184)
           -> Index Smooth Scan using sql120209154306430 on nation n2 (rows=2) (actual rows=2 loops=1)
              Filter: ((n_name = 'CHINA'::bpchar) OR (n_name = 'EGYPT'::bpchar))
              Rows Removed by Filter: 23

A.1.5 Q14

PostgreSQL:
Aggregate (rows=1) (actual rows=1 loops=1)
  -> Hash Join (rows=299930) (actual rows=747437 loops=1)
     Hash Cond: (lineitem.l_partkey = part.p_partkey)
     -> Index Scan using idx1202121036090 on lineitem (rows=299930) (actual rows=747437 loops=1)
        Index Cond: ((l_shipdate >= '1996-04-01'::date) AND (l_shipdate < '1996-05-01'::timestamp))
     -> Hash (rows=2000000) (actual rows=2000000 loops=1)
        Buckets: 262144  Batches: 1  Memory Usage: 125000kB
        -> Seq Scan on part (rows=2000000) (actual rows=2000000 loops=1)

PostgreSQL with Smooth Scan:
Aggregate (rows=1) (actual rows=1 loops=1)
  -> Hash Join (rows=299930) (actual rows=747437 loops=1)
     Hash Cond: (lineitem.l_partkey = part.p_partkey)
     -> Index Smooth Scan using idx1202121036090 on lineitem (rows=299930) (actual rows=747437 loops=1)
        Index Cond: ((l_shipdate >= '1996-04-01'::date) AND (l_shipdate < '1996-05-01'::timestamp))
     -> Hash (rows=2000000) (actual rows=2000000 loops=1)
        Buckets: 262144  Batches: 1  Memory Usage: 125000kB
        -> Index Smooth Scan using sql120209154306520 on part (rows=2000000) (actual rows=2000000 loops=1)


Bibliography

[1] Riham Abdel Kader, Peter Boncz, Stefan Manegold, and Maurice van Keulen. Rox: Run-time optimization of xqueries. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 615–626, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-551-2. doi: 10.1145/1559845.1559910. URL http://doi.acm.org/10.1145/1559845.1559910. 5.1

[2] Ashraf Aboulnaga and Surajit Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA, pages 181–192, 1999. doi: 10.1145/304182.304198. URL http://doi.acm.org/10.1145/304182.304198. 3.4, 5.1, 8.8

[3] Azza Abouzied, Daniel J. Abadi, and Avi Silberschatz. Invisible loading: access-driven data transfer from raw files into database systems. In Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pages 1–10, 2013. doi: 10.1145/2452376.2452377. URL http://doi.acm.org/10.1145/2452376.2452377. 7.6

[4] Rakesh Agrawal, Surajit Chaudhuri, Abhinandan Das, and Vivek R. Narasayya. Automating layout of relational databases. In Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, pages 607–618, 2003. doi: 10.1109/ICDE.2003.1260825. URL http://dx.doi.org/10.1109/ICDE.2003.1260825. 4

[5] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 496–505, 2000. URL http://www.vldb.org/conf/2000/P496.pdf. 4, 7.6

[6] Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollár, Arunprasad P. Marathe, Vivek R. Narasayya, and Manoj Syamala. Database tuning advisor for Microsoft SQL Server 2005. In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 1110–1121, 2004. 4.1, 7.6

[7] Sanjay Agrawal, Vivek R. Narasayya, and Beverly Yang. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France,


June 13-18, 2004, pages 359–370, 2004. doi: 10.1145/1007568.1007609. URL http://doi.acm.org/10.1145/1007568.1007609. 4, 7.6

[8] Anastasia Ailamaki, Verena Kantere, and Debabrata Dash. Managing scientific data. Commun. ACM, 53(6):68–78, 2010. doi: 10.1145/1743546.1743568. URL http://doi. acm.org/10.1145/1743546.1743568. 1.2.1, 7.1, 7.5, 7.6

[9] Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, and Anastasia Ailamaki. Nodb: efficient query execution on raw data files. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 241–252, 2012. doi: 10.1145/2213836.2213864. URL http://doi.acm.org/10.1145/2213836.2213864. 1

[10] Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, and Anastasia Ailamaki. Nodb in action: Adaptive query processing on raw data. PVLDB, 5(12):1942–1945, 2012. URL http://vldb.org/pvldb/vol5/p1942_ioannisalagiannis_vldb2012.pdf. 1

[11] Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. H2O: a hands-free adaptive store. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 1103–1114, 2014. doi: 10.1145/2588555.2610502. URL http://doi.acm.org/10.1145/2588555.2610502. 10.3

[12] Ioannis Alagiannis, Renata Borovica-Gajic, Miguel Branco, Stratos Idreos, and Anastasia Ailamaki. Nodb: efficient query execution on raw data files. Commun. ACM, 58(12): 112–121, 2015. doi: 10.1145/2830508. URL http://doi.acm.org/10.1145/2830508. 1

[13] Amazon. Amazon glacier. http://aws.amazon.com/glacier/. 9.7, 10.2

[14] Amazon. Amazon s3. http://aws.amazon.com/s3. 9.2

[15] Laurent Amsaleg, Michael J. Franklin, Anthony Tomasic, and Tolga Urhan. Scrambling query plans to cope with unexpected delays. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, December 18-20, 1996, Miami Beach, Florida, USA, pages 208–219, 1996. doi: 10.1109/PDIS.1996.568681. URL http://dx.doi.org/10.1109/PDIS.1996.568681. 5.2, 8.1, 9.4.2, 9.6

[16] Gennady Antoshenkov. Dynamic query optimization in rdb/vms. In Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993, Vienna, Austria, pages 538–547, 1993. doi: 10.1109/ICDE.1993.344026. URL http://dx.doi.org/10. 1109/ICDE.1993.344026. 5.2, 5.4

[17] Gennady Antoshenkov and Mohamed Ziauddin. Query processing and optimization in oracle rdb. VLDB J., 5(4):229–237, 1996. doi: 10.1007/s007780050026. URL http: //dx.doi.org/10.1007/s007780050026. 5.2, 5.4, 8.4, 8.8


[18] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1383–1394, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-2758-9. doi: 10.1145/2723372.2742797. URL http://doi.acm.org/10.1145/2723372.2742797. 10.2

[19] Subi Arumugam, Alin Dobra, Christopher M. Jermaine, Niketan Pansare, and Luis Leopoldo Perez. The datapath system: a data-centric analytic processing engine for large data warehouses. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 519–530, 2010. doi: 10.1145/1807167.1807224. URL http://doi.acm.org/10. 1145/1807167.1807224. 9.4.2

[20] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. System r: Relational approach to database management. ACM Trans. Database Syst., 1(2):97–137, June 1976. ISSN 0362-5915. doi: 10.1145/320455.320457. URL http://doi.acm.org/10.1145/320455.320457. 1.3.2, 2.1, 2.2, 8.2, 8.8, 9.3.3

[21] Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA., pages 261–272, 2000. doi: 10.1145/342009. 335420. URL http://doi.acm.org/10.1145/342009.335420. 5.2, 8.1, 8.4, 9.4.2, 9.6

[22] Brian Babcock and Surajit Chaudhuri. Towards a robust query optimizer: A principled and practical approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 119–130, 2005. doi: 10.1145/1066157.1066172. URL http://doi.acm.org/10.1145/1066157. 1066172. 3.4, 4.2.1, 5.3, 8.1, 8.8

[23] Shivnath Babu and Pedro Bizarro. Adaptive query processing in the looking glass. In CIDR, pages 238–249, 2005. URL http://www.cidrdb.org/cidr2005/papers/P20. pdf. 5

[24] Shivnath Babu, Pedro Bizarro, and David J. DeWitt. Proactive re-optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 107–118, 2005. doi: 10.1145/1066157. 1066171. URL http://doi.acm.org/10.1145/1066157.1066171. 1.3.2, 3.4, 5.2, 8.1, 8.1, 8.8, 9.6

[25] Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, David Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, and Antony I. T. Rowstron. Pelican: A building block for exascale cold data storage. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’14, Broomfield, CO, USA, October 6-8, 2014, pages 351–365, 2014. URL https://www.usenix.org/conference/osdi14/technical-sessions/presentation/balakrishnan. 6.3, 6.3, 9.1, 9.2, 9.4.5

[26] Nikhil Bansal and Kirk Pruhs. Server scheduling in the lp norm: a rising tide lifts all boat. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA, pages 242–250, 2003. doi: 10.1145/780542.780580. URL http://doi.acm.org/10.1145/780542.780580. 9.4.1

[27] Ronald Barber, Peter Bendel, Marco Czech, Oliver Draese, Frederick Ho, Namik Hrle, Stratos Idreos, Min-Soo Kim, Oliver Koeth, Jae-Gil Lee, Tianchao Tim Li, Guy M. Lohman, Konstantinos Morfonios, René Müller, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Sandor Szabo, Richard Sidle, and Knut Stolze. Blink: Not your father’s database! In Enabling Real-Time Business Intelligence - 5th International Work- shop, BIRTE 2011, Held at the 37th International Conference on Very Large Databases, VLDB 2011, Seattle, WA, USA, September 2, 2011, Revised Selected Papers, pages 1– 22, 2011. doi: 10.1007/978-3-642-33500-6_1. URL http://dx.doi.org/10.1007/ 978-3-642-33500-6_1. 1.2.2

[28] Srikanth Bellamkonda, Hua-Gang Li, Unmesh Jagtap, Yali Zhu, Vince Liang, and Thierry Cruanes. Adaptive and big data scale parallel execution in oracle. Proc. VLDB Endow.,6 (11):1102–1113, August 2013. ISSN 2150-8097. doi: 10.14778/2536222.2536235. URL http://dx.doi.org/10.14778/2536222.2536235. 5.4

[29] Richard Bellman. Dynamic Programming. Princeton University Press, 1 edition, 1957. 3.2.1

[30] Suparna Bhattacharya, C. Mohan, Karen W. Brannon, Inderpal Narang, Hui-I Hsiao, and Mahadevan Subramanian. Coordinating backup/recovery and data consistency between database and file systems. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD ’02, pages 500–511, New York, NY, USA, 2002. ACM. ISBN 1-58113-497-5. doi: 10.1145/564691.564749. URL http://doi.acm.org/10.1145/564691.564749. 7.6

[31] Pedro Bizarro, Shivnath Babu, David J. DeWitt, and Jennifer Widom. Content-based routing: Different plans for different data. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pages 757–768, 2005. URL http://www.vldb2005.org/program/paper/thu/ p757-bizarro.pdf. 1.3.2, 3.4, 5, 5.2

[32] Peter A. Boncz, Thomas Neumann, and Orri Erling. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In Performance Characterization and Benchmarking - 5th TPC Technology Conference, TPCTC 2013, Trento, Italy, August 26, 2013, Revised Selected Papers, pages 61–76, 2013. doi: 10.1007/978-3-319-04936-6_5. URL http://dx.doi.org/10.1007/978-3-319-04936-6_5. 8.7.2


[33] Renata Borovica, Ioannis Alagiannis, and Anastasia Ailamaki. Automated physical designers: what you see is (not) what you get. In Proceedings of the Fifth International Workshop on Testing Database Systems, DBTest 2012, Scottsdale, AZ, USA, May 21, 2012, page 9, 2012. doi: 10.1145/2304510.2304522. URL http://doi.acm.org/10.1145/ 2304510.2304522. 1.3.1, 2, 4.2.1

[34] Renata Borovica-Gajic. Smooth Scan: Robust Query Execution with a Statistics-oblivious Access Operator. Technical Report NO. 200188, 2014. http://infoscience.epfl.ch/record/200188/files/SmoothScan.pdf. 1

[35] Renata Borovica-Gajic, Stratos Idreos, Anastasia Ailamaki, Marcin Zukowski, and Campbell Fraser. Smooth scan: Statistics-oblivious access paths. In 31st IEEE In- ternational Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 315–326, 2015. doi: 10.1109/ICDE.2015.7113294. URL http: //dx.doi.org/10.1109/ICDE.2015.7113294. 1

[36] Renata Borovica-Gajic, Raja Appuswamy, and Anastasia Ailamaki. Cheap data analytics using cold storage devices. PVLDB, 9(12):1029–1040, 2016. URL http://www.vldb. org/pvldb/vol9/p1029-borovica.pdf. 1

[37] Luc Bouganim, Françoise Fabret, C. Mohan, and Patrick Valduriez. Dynamic query scheduling in data integration systems. In ICDE, pages 425–434, 2000. doi: 10.1109/ ICDE.2000.839442. URL http://dx.doi.org/10.1109/ICDE.2000.839442. 5.2

[38] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and zipf-like distributions: Evidence and implications. In Proceedings IEEE INFOCOM ’99, The Conference on Computer Communications, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, The Future Is Now, New York, NY, USA, March 21-25, 1999, pages 126–134, 1999. 8.7.4

[39] Nicolas Bruno and Surajit Chaudhuri. Exploiting statistics on query expressions for optimization. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD ’02, pages 263–274, New York, NY, USA, 2002. ACM. ISBN 1-58113-497-5. doi: 10.1145/564691.564722. URL http://doi.acm.org/10. 1145/564691.564722. 3.4

[40] Nicolas Bruno and Surajit Chaudhuri. Efficient creation of statistics over query expres- sions. In Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, pages 201–212, 2003. doi: 10.1109/ICDE.2003.1260793. URL http://dx.doi.org/10.1109/ICDE.2003.1260793. 3.4, 5.1

[41] Nicolas Bruno and Surajit Chaudhuri. Automatic physical database tuning: A relaxation- based approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 227–238, 2005. doi: 10.1145/1066157.1066184. URL http://doi.acm.org/10.1145/1066157. 1066184. 4.1, 7.6


[42] Nicolas Bruno and Surajit Chaudhuri. To Tune or Not to Tune?: A Lightweight Physical Design Alerter. In Proceedings of the 32Nd International Conference on Very Large Data Bases, VLDB ’06, pages 499–510. VLDB Endowment, 2006. URL http://dl.acm.org/ citation.cfm?id=1182635.1164171. 4.2.3, 7.6

[43] Nicolas Bruno and Surajit Chaudhuri. An online approach to physical design tuning. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 826–835, 2007. doi: 10.1109/ICDE.2007.367928. URL http://dx.doi.org/10.1109/ICDE.2007.367928. 4.2.3

[44] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. Stholes: A multidimensional workload-aware histogram. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pages 211–222, 2001. doi: 10.1145/375663.375686. URL http://doi.acm.org/10.1145/ 375663.375686. 3.4, 5.1

[45] Nicolas Bruno, Sapna Jain, and Jingren Zhou. Continuous cloud-scale query opti- mization and processing. Proc. VLDB Endow., 6(11):961–972, August 2013. ISSN 2150- 8097. doi: 10.14778/2536222.2536223. URL http://dx.doi.org/10.14778/2536222. 2536223. 5.2

[46] Lei Cao and Elke A. Rundensteiner. High performance stream query processing with correlation-aware partitioning. PVLDB, 7(4):265–276, 2013. URL http://www.vldb. org/pvldb/vol7/p265-cao.pdf. 5.2

[47] CERN. The large hadron collider. URL http://home.web.cern.ch/about/ accelerators/large-hadron-collider. 1.1, 7.5

[48] Donald D. Chamberlin and Raymond F. Boyce. Sequel: A structured english query language. In Proceedings of the 1974 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control, SIGFIDET ’74, pages 249–264, New York, NY, USA, 1974. ACM. doi: 10.1145/800296.811515. URL http://doi.acm.org/10.1145/ 800296.811515. 2.1

[49] Surajit Chaudhuri. An overview of query optimization in relational systems. In Pro- ceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98, pages 34–43, New York, NY, USA, 1998. ACM. ISBN 0-89791-996-3. doi: 10.1145/275487.275492. URL http://doi.acm.org/10.1145/ 275487.275492. 3.2

[50] Surajit Chaudhuri. Query optimizers: Time to rethink the contract? In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 961–968, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-551-2. doi: 10.1145/1559845.1559955. URL http://doi.acm.org/10.1145/1559845.1559955. 5.1


[51] Surajit Chaudhuri and Vivek Narasayya. Autoadmin "what-if" index analysis utility. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD ’98, pages 367–378, New York, NY, USA, 1998. ACM. ISBN 0-89791-995-5. doi: 10.1145/276304.276337. URL http://doi.acm.org/10.1145/276304.276337. 4.1

[52] Surajit Chaudhuri and Vivek Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 3–14. VLDB Endowment, 2007. ISBN 978-1-59593-649-3. URL http: //dl.acm.org/citation.cfm?id=1325851.1325856. 4.2.3

[53] Surajit Chaudhuri and Vivek R. Narasayya. An efficient cost-driven index selection tool for microsoft SQL server. In VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 146–155, 1997. URL http://www.vldb.org/conf/1997/P146.PDF. 4, 7.6

[54] Surajit Chaudhuri and Vivek R. Narasayya. Automating statistics management for query optimizers. In ICDE, pages 339–348, 2000. doi: 10.1109/ICDE.2000.839433. URL http://dx.doi.org/10.1109/ICDE.2000.839433. 3.4, 5.1

[55] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Random sampling for his- togram construction: How much is enough? In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD ’98, pages 436–447, New York, NY, USA, 1998. ACM. ISBN 0-89791-995-5. doi: 10.1145/276304.276343. URL http://doi.acm.org/10.1145/276304.276343. 7.3.4

[56] Surajit Chaudhuri, Vivek R. Narasayya, and Ravishankar Ramamurthy. A pay-as-you-go framework for query execution feedback. PVLDB, 1(1):1141–1152, 2008. URL http: //www.vldb.org/pvldb/1/1453977.pdf. 3.4, 5.1, 8.8

[57] Surajit Chaudhuri, Vivek R. Narasayya, and Ravishankar Ramamurthy. Exact cardinality query optimization for optimizer testing. PVLDB, 2(1):994–1005, 2009. URL http: //www.vldb.org/pvldb/2/vldb09-294.pdf. 8.8

[58] C. L. Philip Chen and Chun-Yang Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Inf. Sci., 275:314–347, 2014. doi: 10.1016/j.ins.2014.01.015. URL http://dx.doi.org/10.1016/j.ins.2014.01.015. 1.2.3

[59] Chungmin Melvin Chen and Nick Roussopoulos. Adaptive selectivity estimation using query feedback. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, SIGMOD ’94, pages 161–172, New York, NY, USA, 1994. ACM. ISBN 0-89791-639-5. doi: 10.1145/191839.191874. URL http://doi.acm.org/10. 1145/191839.191874. 5.1

[60] Min Chen, Shiwen Mao, and Yunhao Liu. Big data: A survey. Mob. Netw. Appl., 19(2): 171–209, April 2014. ISSN 1383-469X. doi: 10.1007/s11036-013-0489-0. URL http: //dx.doi.org/10.1007/s11036-013-0489-0. 1.1


[61] Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry. Inspector joins. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pages 817–828, 2005. URL http: //www.vldb2005.org/program/paper/thu/p817-chen.pdf. 5.4

[62] Yu Cheng and Florin Rusu. Scanraw: A database meta-operator for parallel in-situ processing and loading. ACM Trans. Database Syst., 40(3):19:1–19:45, October 2015. ISSN 0362-5915. doi: 10.1145/2818181. URL http://doi.acm.org/10.1145/2818181. 7.6

[63] Stavros Christodoulakis. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst., 9(2):163–186, 1984. doi: 10.1145/329.318578. URL http://doi.acm.org/10.1145/329.318578. 4.2.1, 8.1, 8.7.8

[64] Francis C. Chu, Joseph Y. Halpern, and Johannes Gehrke. Least expected cost query optimization: What can we expect? In Proceedings of the Twenty-first ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 293–302, 2002. doi: 10.1145/543613.543651. URL http://doi. acm.org/10.1145/543613.543651. 5.3

[65] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970. doi: 10.1145/362384.362685. URL http://doi.acm.org/10.1145/362384.362685. 2.1

[66] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. MAD skills: New analysis practices for big data. PVLDB, 2(2):1481–1492, 2009. URL http://www.vldb.org/pvldb/2/vldb09-219.pdf. 7.6

[67] Dennis Colarelli and Dirk Grunwald. Massive arrays of idle disks for storage archives. In Proceedings of the 2002 ACM/IEEE conference on Supercomputing, Baltimore, Maryland, USA, November 16-22, 2002, CD-ROM, pages 56:1–56:11, 2002. doi: 10.1109/SC.2002. 10058. URL http://dx.doi.org/10.1109/SC.2002.10058. 6.3, 9.1

[68] Richard L. Cole and Goetz Graefe. Optimization of dynamic query evaluation plans. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 24-27, 1994., pages 150–160, 1994. doi: 10.1145/191839. 191872. URL http://doi.acm.org/10.1145/191839.191872. 3.4, 5.2

[69] Hahne Consulting. Sap bw near-line storage (nls). http://www.hahneonline.de/ paper/DSAG%20TT%202013%20Hahne%20Consulting%20V12.pdf. 6.1, 9.2

[70] Winter Corporation. Big Data - What Does It Really Cost?, 2013. 1.2.3

[71] Carlo Curino, Evan P.C. Jones, Raluca A. Popa, Nirmesh Malviya, Eugene Wu, Samuel Madden, Hari Balakrishnan, and Nickolai Zeldovich. Relational cloud: a database service for the cloud. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 235–240, 2011. URL http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper33.pdf. 1.2.3, 1.3.3, 8.1

[72] Benoît Dageville, Dinesh Das, Karl Dias, Khaled Yagoub, Mohamed Zaït, and Mohamed Ziauddin. Automatic SQL tuning in oracle 10g. In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 1098–1109, 2004. URL http://www.vldb.org/conf/2004/IND4P2.PDF. 4.1, 7.6

[73] Debabrata Dash, Neoklis Polyzotis, and Anastasia Ailamaki. Cophy: A scalable, portable, and interactive index advisor for large workloads. PVLDB, 4(6):362–372, 2011. URL http://www.vldb.org/pvldb/vol4/p362-dash.pdf. 4, 4.2.2, 7.6

[74] Marcos Dias de Assunção, Rodrigo N. Calheiros, Silvia Bianchi, Marco A. S. Netto, and Rajkumar Buyya. Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79-80:3–15, 2015. doi: 10.1016/j.jpdc.2014.08.003. URL http://dx.doi.org/10.1016/j.jpdc.2014.08.003. 1.2.3

[75] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008. ISSN 0001-0782. doi: 10.1145/ 1327452.1327492. URL http://doi.acm.org/10.1145/1327452.1327492. 7.6, 10.2

[76] Justin DeBrabant, Andrew Pavlo, Stephen Tu, Michael Stonebraker, and Stanley B. Zdonik. Anti-caching: A new approach to database management system architec- ture. PVLDB, 6(14):1942–1953, 2013. URL http://www.vldb.org/pvldb/vol6/ p1942-debrabant.pdf. 9.2, 9.6

[77] David DeHaan and Frank Wm. Tompa. Optimal top-down join enumeration. In Pro- ceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, pages 785–796, 2007. doi: 10.1145/1247480.1247567. URL http://doi.acm.org/10.1145/1247480.1247567. 3.2.1

[78] Amol Deshpande, Zachary G. Ives, and Vijayshankar Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007. doi: 10.1561/1900000001. URL http://dx.doi.org/10.1561/1900000001. 5, 8.1, 8.8, 9.4.2

[79] David J. DeWitt, Jeffrey F. Naughton, and Joseph Burger. Nested loops revisited. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (PDIS 1993), Issues, Architectures, and Algorithms, San Diego, CA, USA, January 20-23, 1993, pages 230–242, 1993. doi: 10.1109/PDIS.1993.253088. URL http://dx. doi.org/10.1109/PDIS.1993.253088. 8.8

[80] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jörg Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow., 3(1-2):515–529, September 2010. ISSN 2150-8097. doi: 10.14778/1920841.1920908. URL http://dx.doi.org/10.14778/1920841.1920908. 10.2

[81] Harish Doraiswamy, Pooja N. Darera, and Jayant R. Haritsa. On the production of anorexic plan diagrams. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 1081–1092, 2007. URL http://www.vldb.org/conf/2007/papers/research/p1081-d.pdf. 5.3, 8.1

[82] Harish Doraiswamy, Pooja N. Darera, and Jayant R. Haritsa. Identifying robust plans through plan diagram reduction. PVLDB, 1(1):1124–1140, 2008. URL http://www.vldb. org/pvldb/1/1453976.pdf. 5.3, 8.1

[83] Anshuman Dutt and Jayant R. Haritsa. Plan bouquets: query processing without selectivity estimation. In International Conference on Management of Data, SIG- MOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 1039–1050, 2014. doi: 10.1145/2588555.2588566. URL http://doi.acm.org/10.1145/2588555.2588566. 1.3.2, 3.4, 4.2.1, 5, 5.2, 8.1, 8.1, 8.8

[84] Ahmed Eldawy, Justin J. Levandoski, and Per-Åke Larson. Trekking through siberia: Managing cold data in a memory-optimized database. PVLDB, 7(11):931–942, 2014. URL http://www.vldb.org/pvldb/vol7/p931-eldawy.pdf. 9.6

[85] Mostafa Elhemali, César A. Galindo-Legaria, Torsten Grabs, and Milind Joshi. Execution strategies for SQL subqueries. In Proceedings of the ACM SIGMOD International Confer- ence on Management of Data, Beijing, China, June 12-14, 2007, pages 993–1004, 2007. doi: 10.1145/1247480.1247598. URL http://doi.acm.org/10.1145/1247480.1247598. 8.8

[86] Aaron J. Elmore, Carlo Curino, Divyakant Agrawal, and Amr El Abbadi. Towards database virtualization for database as a service. PVLDB, 6(11):1194–1195, 2013. URL http: //www.vldb.org/pvldb/vol6/p1194-elmore.pdf. 1.3.3

[87] Kwanchai Eurviriyanukul, Norman W. Paton, Alvaro A. A. Fernandes, and Steven J. Lynden. Adaptive join processing in pipelined plans. In EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, pages 183–194, 2010. doi: 10.1145/1739041.1739066. URL http: //doi.acm.org/10.1145/1739041.1739066. 5.2, 8.1

[88] Pit Fender and Guido Moerkotte. A new, highly efficient, and easy to implement top- down join enumeration algorithm. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 864–875, 2011. doi: 10.1109/ICDE.2011.5767901. URL http://dx.doi.org/10.1109/ICDE. 2011.5767901. 3.2.1


[89] Pit Fender and Guido Moerkotte. Top down plan generation: From theory to practice. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 1105–1116, 2013. doi: 10.1109/ICDE.2013.6544901. URL http: //dx.doi.org/10.1109/ICDE.2013.6544901. 3.2.1

[90] Pit Fender, Guido Moerkotte, Thomas Neumann, and Viktor Leis. Effective and robust pruning for top-down join enumeration algorithms. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 414–425, 2012. doi: 10.1109/ICDE.2012.27. URL http://dx.doi.org/10.1109/ICDE.2012.27. 3.2.1

[91] S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Trans. Database Syst., 13(1):91–128, March 1988. ISSN 0362-5915. doi: 10.1145/42201.42205. URL http://doi.acm.org/10.1145/42201.42205. 4.1

[92] César A. Galindo-Legaria, Arjan Pellenkoft, and Martin L. Kersten. Fast, randomized join- order selection - why use transformations? In VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, pages 85–95, 1994. URL http://www.vldb.org/conf/1994/P085.PDF. 3.2.1

[93] César A. Galindo-Legaria, Milind Joshi, Florian Waas, and Ming-Chuan Wu. Statistics on views. In VLDB, pages 952–962, 2003. URL http://www.vldb.org/conf/2003/ papers/S28P03.pdf. 3.4

[94] Gartner. Big data - it glossary. URL http://www.gartner.com/it-glossary/ big-data/. 1.2

[95] Kareem El Gebaly and Ashraf Aboulnaga. Robustness in automatic physical database design. In EDBT 2008, 11th International Conference on Extending Database Technology, Nantes, France, March 25-29, 2008, Proceedings, pages 145–156, 2008. doi: 10.1145/ 1353343.1353365. URL http://doi.acm.org/10.1145/1353343.1353365. 4.2.1

[96] David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1989. ISBN 0201157675. 3.2.1

[97] Google. Google cloud storage nearline. White paper, https://cloud.google.com/ files/GoogleCloudStorageNearline.pdf. 9.7, 10.2

[98] Goetz Graefe. The cascades framework for query optimization. IEEE Data Eng. Bull.,18 (3):19–29, 1995. URL http://sites.computer.org/debull/95SEP-CD.pdf. 3.2.1

[99] Goetz Graefe. Modern b-tree techniques. Foundations and Trends in Databases, 3(4):203– 402, 2011. doi: 10.1561/1900000028. URL http://dx.doi.org/10.1561/1900000028. 3.2.2, 8.3.1, 8.5


[100] Goetz Graefe. New algorithms for join and grouping operations. Computer Science - R&D, 27(1):3–27, 2012. doi: 10.1007/s00450-011-0186-9. URL http://dx.doi.org/10.1007/s00450-011-0186-9. 5.4, 8.8

[101] Goetz Graefe and David J. DeWitt. The EXODUS optimizer generator. In Proceedings of the Association for Computing Machinery Special Interest Group on Management of Data 1987 Annual Conference, San Francisco, California, May 27-29, 1987, pages 160–172, 1987. doi: 10.1145/38713.38734. URL http://doi.acm.org/10.1145/38713.38734. 3.2.1

[102] Goetz Graefe and Harumi A. Kuno. Adaptive indexing for relational keys. In Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, pages 69–74, 2010. doi: 10.1109/ICDEW.2010. 5452743. URL http://dx.doi.org/10.1109/ICDEW.2010.5452743. 7.6, 8.8

[103] Goetz Graefe and Harumi A. Kuno. Self-selecting, self-tuning, incrementally optimized indexes. In EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, pages 371–381, 2010. doi: 10.1145/1739041.1739087. URL http://doi.acm.org/10.1145/1739041.1739087. 1.3.1, 4.2.3, 7.6

[104] Goetz Graefe and William J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In Proceedings of the Ninth International Conference on Data Engineering, pages 209–218, Washington, DC, USA, 1993. IEEE Computer Society. ISBN 0-8186-3570-3. URL http://dl.acm.org/citation.cfm?id=645478.757691. 3.2.1

[105] Goetz Graefe and Karen Ward. Dynamic query evaluation plans. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, Portland, Oregon, May 31 - June 2, 1989., pages 358–366, 1989. doi: 10.1145/67544.66960. URL http://doi.acm.org/10.1145/67544.66960. 3.4, 5.2, 8.8

[106] Goetz Graefe, Harumi A. Kuno, and Janet L. Wiener. Visualizing the robustness of query execution. In CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009, Online Proceedings, 2009. URL http: //www-db.cs.wisc.edu/cidr/cidr2009/Paper_82.pdf. 1.2.2, 8.1

[107] Goetz Graefe, Stratos Idreos, Harumi A. Kuno, and Stefan Manegold. Benchmarking adaptive indexing. In Performance Evaluation, Measurement and Characterization of Complex Systems - Second TPC Technology Conference, TPCTC 2010, Singapore, September 13-17, 2010. Revised Selected Papers, pages 169–184, 2010. doi: 10.1007/978-3-642-18206-8_13. URL http://dx.doi.org/10.1007/978-3-642-18206-8_13. 7.6

[108] Goetz Graefe, Arnd Christian König, Harumi Anne Kuno, Volker Markl, and Kai-Uwe Sattler, editors. Robust Query Processing, 19.09. - 24.09.2010, volume 10381 of Dagstuhl Seminar Proceedings, 2010. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Germany. URL http://drops.dagstuhl.de/portals/10381/. 1.2.2, 1.3.2, 5, 8.1, 8.4.2, 8.7.3

[109] Goetz Graefe, Wey Guy, Harumi A. Kuno, and Glenn N. Paulley. Robust query processing (dagstuhl seminar 12321). Dagstuhl Reports, 2(8):1–15, 2012. doi: 10.4230/DagRep.2.8.1. URL http://dx.doi.org/10.4230/DagRep.2.8.1. 1.2.2, 8.1, 8.4.2

[110] Jim Gray. Tape is dead, disk is tape, flash is disk, ram locality is king. Presented at CIDR, 2007. 8.7.5

[111] Jim Gray, David T. Liu, María A. Nieto-Santisteban, Alexander S. Szalay, David J. DeWitt, and Gerd Heber. Scientific data management in the coming decade. SIGMOD Record,34 (4):34–41, 2005. doi: 10.1145/1107499.1107503. URL http://doi.acm.org/10.1145/ 1107499.1107503. 1.2.1, 7.1, 7.5, 7.6

[112] Ajay Gulati, Arif Merchant, and Peter J. Varman. mclock: Handling throughput variability for hypervisor io scheduling. In Proceedings of the 9th USENIX Conference on Operat- ing Systems Design and Implementation, OSDI’10, pages 437–450, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1924943. 1924974. 9.6

[113] Ajay Gulati, Arif Merchant, Mustafa Uysal, Pradeep Padala, and Peter J. Varman. Work- load dependent IO scheduling for fairness and efficiency in shared storage systems. In 19th International Conference on High Performance Computing, HiPC 2012, Pune, India, December 18-22, 2012, pages 1–10, 2012. doi: 10.1109/HiPC.2012.6507480. URL http://dx.doi.org/10.1109/HiPC.2012.6507480. 9.6

[114] Chetan Gupta, Abhay Mehta, Song Wang, and Umeshwar Dayal. Fair, effective, efficient and differentiated scheduling in an enterprise data warehouse. In EDBT 2009, 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, March 24-26, 2009, Proceedings, pages 696–707, 2009. doi: 10.1145/1516360.1516441. URL http://doi.acm.org/10.1145/1516360.1516441. 9.6

[115] L. M. Haas, J. C. Freytag, G. M. Lohman, and H. Pirahesh. Extensible Query Processing in Starburst. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, SIGMOD ’89, pages 377–388, New York, NY, USA, 1989. ACM. ISBN 0-89791-317-5. doi: 10.1145/67544.66962. URL http://doi.acm.org/10.1145/ 67544.66962. 3.2.1

[116] Peter J. Haas and Joseph M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA., pages 287–298, 1999. doi: 10.1145/304182. 304208. URL http://doi.acm.org/10.1145/304182.304208. 5.4, 9.6


[117] Peter J. Haas and Arun N. Swami. Sampling-based selectivity estimation for joins using augmented frequent value statistics. In Proceedings of the Eleventh International Confer- ence on Data Engineering, March 6-10, 1995, Taipei, Taiwan, pages 522–531, 1995. doi: 10.1109/ICDE.1995.380361. URL http://dx.doi.org/10.1109/ICDE.1995.380361. 3.4

[118] Apache Hadoop. Apache hadoop. URL https://hadoop.apache.org/. 7.6, 10.2

[119] Abdelkader Hameurlain and Franck Morvan. Cpu and incremental memory allocation in dynamic parallelization of sql queries. Parallel Comput., 28(4):525–556, April 2002. ISSN 0167-8191. doi: 10.1016/S0167-8191(02)00074-1. URL http://dx.doi.org/10. 1016/S0167-8191(02)00074-1. 5.2

[120] James R. Hamilton. Keynote: Where does the power go in high-scale data centers? In USENIX, 2009. 6.3

[121] Stavros Harizopoulos, Vladislav Shkapenyuk, and Anastassia Ailamaki. Qpipe: A simultaneously pipelined relational query engine. In Proceedings of the ACM SIG- MOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 383–394, 2005. doi: 10.1145/1066157.1066201. URL http: //doi.acm.org/10.1145/1066157.1066201. 9.4.2

[122] L. Harris. Stock price clustering and discreteness. Review of Financial Studies, 4(3):389–415, 1991. 8.7.4

[123] Joseph M. Hellerstein, Michael J. Franklin, Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Samuel Madden, Vijayshankar Raman, and Mehul A. Shah. Adaptive query processing: Technology in evolution. IEEE Data Eng. Bull., 23(2):7–18, 2000. URL http://sites.computer.org/debull/A00JUN-CD.pdf. 5, 8.8

[124] Herodotos Herodotou and Shivnath Babu. Xplus: A sql-tuning-aware query optimizer. Proc. VLDB Endow., 3(1-2):1149–1160, September 2010. ISSN 2150-8097. doi: 10.14778/ 1920841.1920984. URL http://dx.doi.org/10.14778/1920841.1920984. 5.1

[125] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009. ISBN 978-0982544204. URL http://research.microsoft.com/en-us/collaboration/fourthparadigm/. 1.1

[126] Martin Hilbert and Priscila López. The world’s technological capacity to store, communicate, and compute information. Science, 332(6025):60–65, 2011. doi: 10.1126/science.1200970. URL http://www.sciencemag.org/content/332/6025/60.abstract. 1.2.3

[127] IBM. Managing big data for smart grids and smart meters. White Paper, 2012. URL http://www.ibmbigdatahub.com/whitepaper/ managing-big-data-smart-grids-and-smart-meters. 1.1, 1.2.1, 3.4, 8.8


[128] IBM, Paul Zikopoulos, and Chris Eaton. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, 1st edition, 2011. ISBN 0071790535, 9780071790536. 1.1

[129] IDC. Technology assessment: Cold storage is hot again. http://www.storiant.com/ resources/Cold-Storage-Is-Hot-Again.pdf. 6.2, 6.2.1, 9.1

[130] IDC. The digital universe of opportunities: Rich data and the increasing value of the in- ternet of things, 2014. URL http://www.emc.com/leadership/digital-universe/ 2014iview/executive-summary.htm. 1.1, 1.2, 6.2, 6.2.2, 7.1, 7.5, 9.1

[131] IDC iView. The Digital Universe Decade - Are You Ready?, 2010. 1.1

[132] Stratos Idreos. Big Data Exploration. Taylor and Francis, 2013. 1.2.1, 1.3.1

[133] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Database cracking. In CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings, pages 68–78, 2007. URL http://www.cidrdb. org/cidr2007/papers/cidr07p07.pdf. 1.3.1, 4.2.3, 7.6, 8.8

[134] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Updating a cracked database. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, pages 413–424, 2007. doi: 10.1145/1247480.1247527. URL http://doi.acm.org/10.1145/1247480.1247527. 1.3.1, 4.2.3, 7.6

[135] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Self-organizing tuple re- construction in column-stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 297–308, 2009. doi: 10.1145/1559845.1559878. URL http://doi.acm.org/10.1145/1559845.1559878. 1.3.1, 7.6

[136] Stratos Idreos, Ioannis Alagiannis, Ryan Johnson, and Anastasia Ailamaki. Here are my data files. here are my queries. where are my results? In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 57–68, 2011. URL http://www.cidrdb.org/cidr2011/ Papers/CIDR11_Paper7.pdf. 7.1, 7.6

[137] Stratos Idreos, Stefan Manegold, Harumi A. Kuno, and Goetz Graefe. Merging what’s cracked, cracking what’s merged: Adaptive indexing in main-memory column-stores. PVLDB, 4(9):585–597, 2011. URL http://www.vldb.org/pvldb/vol4/p586-idreos. pdf. 1.3.1, 4.2.3, 7.6, 8.8

[138] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. Overview of data ex- ploration techniques. In Proceedings of the 2015 ACM SIGMOD International Con- ference on Management of Data, SIGMOD ’15, pages 277–281, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-2758-9. doi: 10.1145/2723372.2731084. URL http: //doi.acm.org/10.1145/2723372.2731084. 1.2.1


[139] ILNAS/ANEC. White paper big data, 2016. URL https://portail-qualite.public. lu/fr/publications/normes-normalisation/information-sensibilisation/ white-paper-big-data/WP_BigData_v1.pdf. 1.2

[140] Ihab F. Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf Aboulnaga. Cords: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pages 647–658, New York, NY, USA, 2004. ACM. ISBN 1-58113-859-8. doi: 10.1145/ 1007568.1007641. URL http://doi.acm.org/10.1145/1007568.1007641. 3.4, 5.1

[141] Intel. Cold Storage in the Cloud: Trends, Challenges, and Solutions. http: //www.intel.com/content/dam/www/public/us/en/documents/white-papers/ cold-storage-atom-xeon-paper.pdf. 6.2, 9.1

[142] Yannis E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, 1996. doi: 10.1145/234313.234367. URL http://doi.acm.org/10.1145/234313.234367. 3, 3.2, 8.1

[143] Yannis E. Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD, pages 268–277, New York, NY, USA, 1991. ACM. ISBN 0-89791-425-2. doi: 10.1145/115790.115835. URL http://doi.acm.org/10.1145/ 115790.115835. 3.2.2, 3.4

[144] Yannis E. Ioannidis and Younkyung Cha Kang. Randomized algorithms for optimizing large join queries. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990., pages 312–321, 1990. doi: 10.1145/93597.98740. URL http://doi.acm.org/10.1145/93597.98740. 3.2.1

[145] Yannis E. Ioannidis and Eugene Wong. Query optimization by simulated annealing. In Proceedings of the Association for Computing Machinery Special Interest Group on Management of Data 1987 Annual Conference, San Francisco, California, May 27-29, 1987, pages 9–22, 1987. doi: 10.1145/38713.38722. URL http://doi.acm.org/10. 1145/38713.38722. 3.2.1

[146] Yannis E. Ioannidis, Raymond T. Ng, Kyuseok Shim, and Timos K. Sellis. Parametric query optimization. VLDB J., 6(2):132–151, 1997. doi: 10.1007/s007780050037. URL http://dx.doi.org/10.1007/s007780050037. 5.2, 8.8

[147] Milena Ivanova, Martin L. Kersten, Stefan Manegold, and Yagiz Kargin. Data vaults: Database technology for scientific file repositories. Computing in Science and Engi- neering, 15(3):32–42, 2013. doi: 10.1109/MCSE.2013.17. URL http://dx.doi.org/10. 1109/MCSE.2013.17. 7.6

[148] Zachary G. Ives. Efficient Query Processing for Data Integration, 2002. 1.3.2, 3.4, 4.2.1, 5, 5.2


[149] Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Y. Levy, and Daniel S. Weld. An adaptive query execution system for data integration. In SIGMOD 1999, Proceed- ings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA., pages 299–310, 1999. doi: 10.1145/304182.304209. URL http://doi.acm.org/10.1145/304182.304209. 1.3.2, 3.4, 4.2.1, 5, 5.2

[150] Zachary G. Ives, Alon Y. Halevy, and Daniel S. Weld. Adapting to source properties in processing data integration queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 395–406, 2004. doi: 10.1145/1007568.1007613. URL http://doi.acm.org/10.1145/1007568. 1007613. 1.3.2, 3.4, 4.2.1, 5, 5.2, 8.1, 8.7.4, 8.8, 9.6

[151] Suresh Iyengar, S. Sudarshan, Santosh Kumar, and Raja Agrawal. Exploiting asynchronous IO using the asynchronous iterator model. In Proceedings of the 14th International Conference on Management of Data, December 17-19, 2008, IIT Bombay, Mumbai, India, pages 127–138, 2008. URL http://www.cse.iitb.ac.in/~comad/2008/PDFs/64-aio.pdf. 8.8

[152] H. V. Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi, and Cong Yu. Making database systems usable. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, pages 13–24, 2007. doi: 10.1145/1247480.1247483. URL http://doi.acm.org/ 10.1145/1247480.1247483. 7.1

[153] Alpa Jain, AnHai Doan, and Luis Gravano. Optimizing SQL queries over text databases. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, México, pages 636–645, 2008. doi: 10.1109/ICDE.2008.4497472. URL http://dx.doi.org/10.1109/ICDE.2008.4497472. 7.6

[154] Wei Jin, Jeffrey S. Chase, and Jasleen Kaur. Interposed proportional sharing for a storage service utility. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’04/Performance ’04, pages 37–48, New York, NY, USA, 2004. ACM. ISBN 1-58113-873-3. doi: 10.1145/1005686.1005694. URL http://doi.acm.org/10.1145/1005686.1005694. 9.6

[155] Navin Kabra and David J. DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA., pages 106–117, 1998. doi: 10.1145/276304.276315. URL http://doi.acm.org/10.1145/ 276304.276315. 1.3.2, 3.4, 4.2.1, 5, 5.2, 8.1, 8.1, 8.8, 9.6

[156] Karthik Kambatla, Giorgos Kollias, Vipin Kumar, and Ananth Grama. Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7):2561 – 2573, 2014. ISSN 0743-7315. doi: http://dx.doi.org/10.1016/j.jpdc.2014.01.003. URL http://www. sciencedirect.com/science/article/pii/S0743731514000057. Special Issue on Perspectives on Parallel and Distributed Processing. 1.2.3, 6.3


[157] Yagiz Kargin, Martin L. Kersten, Stefan Manegold, and Holger Pirk. The DBMS - your big data sommelier. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 1119–1130, 2015. doi: 10.1109/ICDE.2015. 7113361. URL http://dx.doi.org/10.1109/ICDE.2015.7113361. 7.6

[158] Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, and Anastasia Ailamaki. Adaptive query processing on raw data. Proc. VLDB Endow., 7(12):1119–1130, August 2014. ISSN 2150-8097. doi: 10.14778/2732977.2732986. URL http://dx.doi.org/10. 14778/2732977.2732986. 7.5, 10.3

[159] Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, and Anastasia Ailamaki. Just-in-time data virtualization: Lightweight data management with vida. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015. URL http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper8.pdf. 7.5, 10.3

[160] Manos Karpathiotakis, Ioannis Alagiannis, and Anastasia Ailamaki. Fast queries over heterogeneous data through engine customization. PVLDB, 9(12):972–983, 2016. URL http://www.vldb.org/pvldb/vol9/p972-karpathiotakis.pdf. 10.3

[161] Alfons Kemper and Thomas Neumann. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, pages 195–206, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-4244-8959-6. doi: 10. 1109/ICDE.2011.5767867. URL http://dx.doi.org/10.1109/ICDE.2011.5767867. 7.6

[162] Martin L. Kersten, Stratos Idreos, Stefan Manegold, and Erietta Liarou. The researcher’s guide to the data deluge: Querying a scientific database in just a few seconds. PVLDB, 4(12):1474–1477, 2011. URL http://www.vldb.org/pvldb/vol4/p1474-kersten. pdf. 1.2.1, 7.1, 7.4.1, 7.5, 7.6

[163] Ohbyung Kwon, Nam Yeon Lee, and Bongsik Shin. Data quality management, data usage experience and acquisition intention of big data analytics. Int J. Information Management, 34(3):387–394, 2014. doi: 10.1016/j.ijinfomgt.2014.02.002. URL http: //dx.doi.org/10.1016/j.ijinfomgt.2014.02.002. 1.2

[164] B. Laliberte. Automate and optimize a tiered storage environment fast! ESG White Paper. 9.6

[165] Douglas Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, 2001. 1.1

[166] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. How good are query optimizers, really? PVLDB, 9(3):204–215, 2015. URL http://www.vldb.org/pvldb/vol9/p204-leis.pdf. 3.4


[167] Justin J. Levandoski, Per-Åke Larson, and Radu Stoica. Identifying hot and cold data in main-memory databases. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 26–37, 2013. doi: 10.1109/ICDE. 2013.6544811. URL http://dx.doi.org/10.1109/ICDE.2013.6544811. 9.6

[168] Quanzhong Li, Minglong Shao, Volker Markl, Kevin S. Beyer, Latha S. Colby, and Guy M. Lohman. Adaptively reordering joins during query execution. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 26–35, 2007. doi: 10.1109/ICDE.2007.367848. URL http://dx.doi.org/10.1109/ICDE.2007.367848. 1.3.2, 3.4, 4.2.1, 5, 5.2

[169] Spectra Logic. Arctic blue pricing calculator. https://www.spectralogic.com/ arcticblue-pricing-calculator/. 6.3, 9.3.1

[170] Guy Lohman. Is Query Optimization a "Solved" Problem? ACM SIGMOD Blog, April 2014. http://wp.sigmod.org/?author=20. 1.2.2, 1.3.2, 3.2.2, 3.4, 5, 8.4.2

[171] Guy M. Lohman. Grammar-like Functional Rules for Representing Query Optimization Alternatives. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, SIGMOD ’88, pages 18–27, New York, NY, USA, 1988. ACM. ISBN 0- 89791-268-3. doi: 10.1145/50202.50204. URL http://doi.acm.org/10.1145/50202. 50204. 3.2.1

[172] Konrad Lorincz, Kevin Redwine, and Jesse Tov. Grep versus FlatSQL versus MySQL: Queries using tools vs. a DBMS, 2003. 7.6

[173] LSST. The large synoptic survey telescope. URL http://www.lsst.org/. 7.4.1, 7.5

[174] Lothar F. Mackert and Guy M. Lohman. R* optimizer validation and performance evaluation for local queries. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 28-30, 1986., pages 84–95, 1986. doi: 10.1145/16894.16863. URL http://doi.acm.org/10.1145/16894.16863. 1.2.2, 8.1

[175] Volker Markl, Vijayshankar Raman, David E. Simmen, Guy M. Lohman, and Hamid Pirahesh. Robust query processing through progressive optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 659–670, 2004. doi: 10.1145/1007568.1007642. URL http: //doi.acm.org/10.1145/1007568.1007642. 1.3.2, 3.2.2, 3.4, 4.2.1, 5, 5.2, 8.1, 8.1, 8.8, 9.6

[176] Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter J. Haas, and Utkarsh Srivastava. Consistently estimating the selectivity of conjuncts of predicates. In Klemens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson, and Beng Chin Ooi, editors, VLDB, pages 373–384. ACM, 2005. ISBN 1-59593-177-5. 3.4


[177] Matt Turck, Jim Hao, and FirstMark Capital. Big Data Landscape 2016 (Version 3.0), 2016. 1.3

[178] T. H. Merrett, Yahiko Kambayashi, and H. Yasuura. Scheduling of page-fetches in join operations. In Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings, pages 488–498, 1981. 9.4.3

[179] Guido Moerkotte and Thomas Neumann. Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 930–941, 2006. URL http://dl.acm.org/ citation.cfm?id=1164207. 3.2.1

[180] Guido Moerkotte and Thomas Neumann. Dynamic programming strikes back. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 539–552, 2008. doi: 10.1145/1376616.1376672. URL http://doi.acm.org/10.1145/1376616.1376672. 3.2.1

[181] MonetDB. Monetdb. https://www.monetdb.org/. 7.6

[182] Tobias Mühlbauer, Wolf Rödiger, Robert Seilbeck, Angelika Reiser, Alfons Kemper, and Thomas Neumann. Instant loading for main memory databases. Proc. VLDB Endow.,6 (14):1702–1713, September 2013. ISSN 2150-8097. doi: 10.14778/2556549.2556555. URL http://dx.doi.org/10.14778/2556549.2556555. 7.6

[183] Ingo Müller, Peter Sanders, Arnaud Lacurie, Wolfgang Lehner, and Franz Färber. Cache- efficient aggregation: Hashing is sorting. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1123–1136, 2015. doi: 10.1145/2723372.2747644. URL http: //doi.acm.org/10.1145/2723372.2747644. 5.4

[184] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm. PVLDB, 4(12):1466–1469, 2011. URL http://www.vldb.org/pvldb/vol4/ p1466-nandi.pdf. 7.1

[185] Vivek R. Narasayya, Sudipto Das, Manoj Syamala, Badrish Chandramouli, and Surajit Chaudhuri. Sqlvm: Performance isolation in multi-tenant relational database-as-a- service. In CIDR. www.cidrdb.org, 2013. URL http://dblp.uni-trier.de/db/conf/ cidr/cidr2013.html#NarasayyaDSCC13. 1.2.3, 1.3.3

[186] Rimma V. Nehme, Elke A. Rundensteiner, and Elisa Bertino. Self-tuning query mesh for adaptive multi-route query processing. In EDBT 2009, 12th International Con- ference on Extending Database Technology, Saint Petersburg, Russia, March 24-26, 2009, Proceedings, pages 803–814, 2009. doi: 10.1145/1516360.1516452. URL http: //doi.acm.org/10.1145/1516360.1516452. 1.3.2, 3.4, 5, 5.2


[187] Pat O'Neil, Betty O'Neil, and Xuedong Chen. Star Schema Benchmark. 2009. 9.5.1

[188] Thomas Neumann and César A. Galindo-Legaria. Taking the edge off cardinality estimation errors using incremental execution. In Datenbanksysteme für Business, Technologie und Web (BTW), 15. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 11.-15.3.2013 in Magdeburg, Germany. Proceedings, pages 73–92, 2013. 3.4, 5.1

[189] Storage Newsletter. Costs as barrier to realizing value big data can deliver. http://www.storagenewsletter.com/rubriques/market-reportsresearch/ 37-of-cios-storing-between-500tb-and-1pb-storiantresearch-now/. 6.3, 9.1

[190] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376726. URL http://doi.acm.org/10.1145/1376616.1376726. 7.6, 10.2

[191] Kiyoshi Ono and Guy M. Lohman. Measuring the complexity of join enumeration in query optimization. In 16th International Conference on Very Large Data Bases, August 13-16, 1990, Brisbane, Queensland, Australia, Proceedings., pages 314–325, 1990. URL http://www.vldb.org/conf/1990/P314.PDF. 3.2.1

[192] Oracle. Openstack swift interface for oracle hierarchical storage manager. White Paper, http://www.oracle.com/us/products/servers-storage/storage/ storage-software/solution-brief-sam-swift-2321869.pdf. 9.2

[193] Stratos Papadomanolakis and Anastassia Ailamaki. Autopart: Automating schema design for large scientific databases using data partitioning. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), 21-23 June 2004, Santorini Island, Greece, pages 383–392, 2004. doi: 10.1109/SSDBM.2004.19. URL http://doi.ieeecomputersociety.org/10.1109/SSDBM.2004.19. 4, 7.5, 7.6

[194] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Man- agement of Data, SIGMOD ’09, pages 165–178, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-551-2. doi: 10.1145/1559845.1559865. URL http://doi.acm.org/10. 1145/1559845.1559865. 9.5.1

[195] Arjan Pellenkoft, César A. Galindo-Legaria, and Martin L. Kersten. The complexity of transformation-based join enumeration. In VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 306–315, 1997. URL http://www.vldb.org/conf/1997/P306.PDF. 3.2.1

[196] Gregory Piatetsky-Shapiro and Charles Connell. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, SIGMOD ’84, pages 256–276, New York, NY, USA, 1984. ACM. ISBN 0-89791-128-8. doi: 10.1145/602259.602294. URL http://doi.acm. org/10.1145/602259.602294. 3.4

[197] Michael L. Pinedo. Scheduling: Theory, Algorithms, and Systems. Springer Publishing Company, Incorporated, 3rd edition, 2008. 9.6

[198] Viswanath Poosala and Yannis E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB ’97, pages 486–495, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. ISBN 1-55860-470-7. URL http://dl.acm.org/citation. cfm?id=645923.673638. 3.4

[199] Viswanath Poosala, Peter J. Haas, Yannis E. Ioannidis, and Eugene J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, pages 294–305, New York, NY, USA, 1996. ACM. ISBN 0-89791-794-4. doi: 10.1145/233269.233342. URL http://doi.acm.org/10.1145/233269.233342. 3.4

[200] Anna Povzner, Tim Kaldewey, Scott A. Brandt, Richard A. Golding, Theodore M. Wong, and Carlos Maltzahn. Efficient guaranteed disk request scheduling with fahrrad. In Proceedings of the 2008 EuroSys Conference, Glasgow, Scotland, UK, April 1-4, 2008, pages 13–25, 2008. doi: 10.1145/1352592.1352595. URL http://doi.acm.org/10. 1145/1352592.1352595. 9.6

[201] Sunil Prabhakar, Divyakant Agrawal, and Amr El Abbadi. Optimal scheduling algorithms for tertiary storage. Distributed and Parallel Databases, 14(3):255–282, 2003. doi: 10. 1023/A:1025589332623. URL http://dx.doi.org/10.1023/A:1025589332623. 9.4.5, 9.4.5, 9.6

[202] Open Compute Project. Cold storage hardware v0.5. http://www.opencompute. org/wp/wp-content/uploads/2013/01/Open_Compute_Project_Cold_Storage_ Specification_v0.5.pdf. 6.3, 6.3, 9.1, 9.2, 9.7

[203] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. 3 edition, 2003. 2, 3.1, 8.5

[204] Vijayshankar Raman, Amol Deshpande, and Joseph M. Hellerstein. Using state modules for adaptive query processing. In Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, pages 353–364, 2003. doi: 10.1109/ ICDE.2003.1260805. URL http://dx.doi.org/10.1109/ICDE.2003.1260805. 5.2


[205] Naveen Reddy and Jayant R. Haritsa. Analyzing plan diagrams of database query optimizers. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, pages 1228–1239. VLDB Endowment, 2005. ISBN 1-59593-154-6. URL http://dl.acm.org/citation.cfm?id=1083592.1083735. 3.4, 5.3

[206] GigaOM Research. Rethinking the enterprise data archive for big data analytics and regulatory compliance. Report, http://rainstor.com/2013_new/wp-content/uploads/2014/11/WP_Gigaom_Rethinking_the_Enterprise_Data-Archive.pdf, 2014. 6.2.2

[207] Mary Tork Roth and Peter M. Schwarz. Don’t scrap it, wrap it! A wrapper architecture for legacy data sources. In VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 266–275, 1997. URL http://www.vldb.org/conf/1997/P266.PDF. 7.6

[208] Sunita Sarawagi. Query processing in tertiary memory databases. In VLDB’95, Proceedings of 21st International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland, pages 585–596, 1995. URL http://www.vldb.org/conf/1995/P585.PDF. 9.6

[209] Sunita Sarawagi and Michael Stonebraker. Reordering query execution in tertiary memory databases. In VLDB’96, Proceedings of 22nd International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 156–167, 1996. URL http://www.vldb.org/conf/1996/P156.PDF. 9.6

[210] Kai-Uwe Sattler, Ingolf Geist, and Eike Schallehn. Quiet: Continuous query-driven index tuning. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 1129–1132. VLDB Endowment, 2003. ISBN 0-12-722442-4. URL http://dl.acm.org/citation.cfm?id=1315451.1315566. 4.2.3

[211] Jiri Schindler. I/o characteristics of nosql databases. Proc. VLDB Endow., 5(12):2020–2021, August 2012. ISSN 2150-8097. doi: 10.14778/2367502.2367565. URL http://dx.doi.org/10.14778/2367502.2367565. 8.9, 10.2

[212] Jiri Schindler. Profiling and analyzing the i/o performance of nosql dbs. In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’13, pages 389–390, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1900-3. doi: 10.1145/2465529.2479782. URL http://doi.acm.org/10.1145/2465529.2479782. 8.9, 10.2

[213] Eric Schmidt. Google: "Every 2 days we create as much data as we did from the dawn of humanity to 2003". Presented at Techonomy in Lake Tahoe, CA, 2010. 1.1

[214] Karl Schnaitter, Serge Abiteboul, Tova Milo, and Neoklis Polyzotis. COLT: continuous on-line tuning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, pages 793–795, 2006. doi: 10.1145/1142473.1142592. URL http://doi.acm.org/10.1145/1142473.1142592. 4.2.3, 7.6

[215] Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, May 30 - June 1., pages 23–34, 1979. doi: 10.1145/582095.582099. URL http://doi.acm.org/10.1145/582095.582099. 3.2.1, 11

[216] IBM Global Technology Services. Data center operational efficiency best practices, 2012. 1.3.3

[217] Praveen Seshadri and Arun N. Swami. Generalized partial indexes. In Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, pages 420–427, 1995. doi: 10.1109/ICDE.1995.380355. URL http://dx.doi.org/10.1109/ICDE.1995.380355. 8.8

[218] Anil Shanbhag and S. Sudarshan. Optimizing join enumeration in transformation-based query optimizers. PVLDB, 7(12):1243–1254, 2014. URL http://www.vldb.org/pvldb/vol7/p1243-shanbhag.pdf. 3.2.1

[219] Prashant J. Shenoy and Harrick M. Vin. Cello: A disk scheduling framework for next generation operating systems. Real-Time Systems, 22(1-2):9–48, 2002. doi: 10.1023/A:1013437003242. URL http://dx.doi.org/10.1023/A:1013437003242. 9.6

[220] Rod Smith. Big data, 2010. URL http://www-01.ibm.com/software/ebusiness/jstart/downloads/RaleighInternetSummit2010.pdf. 1.1, 7.1, 7.5

[221] Borja Sotomayor, Rubén S. Montero, Ignacio M. Llorente, and Ian Foster. Virtual infrastructure management in private and hybrid clouds. IEEE Internet Computing, 13(5):14–22, September 2009. ISSN 1089-7801. doi: 10.1109/MIC.2009.119. URL http://dx.doi.org/10.1109/MIC.2009.119. 1.3.3

[222] Spectra. Arcticblue deep storage disk - a breakthrough in nearline storage. Product, https://www.spectralogic.com/2015/11/03/introducing-arcticblue-deep-storage-disk-a-breakthrough-in-nearline-storage/. 6.3, 6.3, 9.1

[223] U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, pages 39–, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2570-9. doi: 10.1109/ICDE.2006.84. URL http://dx.doi.org/10.1109/ICDE.2006.84. 5.1

[224] Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper. Heuristic and randomized optimization for the join ordering problem. VLDB J., 6(3):191–208, 1997. doi: 10.1007/s007780050040. URL http://dx.doi.org/10.1007/s007780050040. 3.2.1


[225] Michael Stillger, Guy M. Lohman, Volker Markl, and Mokhtar Kandil. LEO - db2’s learning optimizer. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 19–28, 2001. URL http://www.vldb.org/conf/2001/P019.pdf. 3.2.2, 3.4, 5.1, 8.1, 8.7.8, 8.8

[226] Michael Stonebraker. The case for partial indexes. SIGMOD Record, 18(4):4–11, 1989. doi: 10.1145/74120.74121. URL http://doi.acm.org/10.1145/74120.74121. 8.8

[227] Michael Stonebraker, Gerald Held, Eugene Wong, and Peter Kreps. The design and implementation of ingres. ACM Trans. Database Syst., 1(3):189–222, September 1976. ISSN 0362-5915. doi: 10.1145/320473.320476. URL http://doi.acm.org/10.1145/320473.320476. 5.2

[228] Michael Stonebraker, Jacek Becla, David J. DeWitt, Kian-Tat Lim, David Maier, Oliver Ratzesberger, and Stanley B. Zdonik. Requirements for science data bases and scidb. In CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009, Online Proceedings, 2009. URL http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf. 1.1, 7.1, 7.6

[229] Horison Information Strategies. The data archive challenge - what’s your game plan? Report, http://www.activearchive.com/common/pdf/TheArchiveChallengeHorison_WP.PDF. 1.2.3

[230] Horison Information Strategies. Tiered storage takes center stage. Report, http://horison.com/publications/tiered-storage-takes-center-stage/. (document), 6.1, 6.2, 6.2.1, 9.1, 9.3.1

[231] Sun Microsystems, Inc. Sun sam-fs and sun sam-qfs storage and archive management guide. White Paper, https://docs.oracle.com/cd/E19598-01/816-2544-10/816-2544-10.pdf. 6.1

[232] SuperMicro. Superstorage server. http://www.supermicro.nl/products/system/1U/5018/SSG-5018A-AR12L.cfm. 6.3, 9.1

[233] Arun N. Swami and Anoop Gupta. Optimization of large join queries. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, June 1-3, 1988., pages 8–17, 1988. doi: 10.1145/50202.50203. URL http://doi.acm.org/10.1145/50202.50203. 3.2.1

[234] Arun N. Swami and K. Bernhard Schiefer. On the estimation of join result sizes. In Matthias Jarke, Janis A. Bubenko Jr., and Keith G. Jeffery, editors, Advances in Database Technology - EDBT’94. 4th International Conference on Extending Database Technology, Cambridge, United Kingdom, March 28-31, 1994, Proceedings, volume 779 of Lecture Notes in Computer Science, pages 287–300. Springer, 1994. ISBN 3-540-57818-8. 3.4

[235] ISO/IEC JTC 1 Information technology. Big data report, 2014. URL http://www.iso.org/iso/big_data_report-jtc1.pdf. 1.1


[236] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626–1629, August 2009. ISSN 2150-8097. doi: 10.14778/1687553.1687609. URL http://dx.doi.org/10.14778/1687553.1687609. 7.6, 10.2

[237] Yongchao Tian, Ioannis Alagiannis, Erietta Liarou, Anastasia Ailamaki, Pietro Michiardi, and Marko Vukolić. Dinodb: Efficient large-scale raw data analytics. In Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014), Data4U ’14, pages 1:1–1:6, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3186-9. doi: 10.1145/2658840.2658841. URL http://doi.acm.org/10.1145/2658840.2658841. 10.2

[238] Stephen Todd. The peterlee relational test vehicle - A system overview. IBM Systems Journal, 15(4):285–308, 1976. doi: 10.1147/sj.154.0285. URL http://dx.doi.org/10.1147/sj.154.0285. 2.1

[239] Steve Todd. Big data at UC Berkeley. URL http://stevetodd.typepad.com/my_weblog/2011/08/amped-at-uc-berkeley.html. 1.2

[240] TPC. Tpc-h benchmark. http://www.tpc.org/tpch/. 4, 4.2.2, 7.4.2, 8.1, 8.7.1, 9.3.2, 9.5.1

[241] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. Sharing-aware horizontal partitioning for exploiting correlations during query processing. Proc. VLDB Endow., 3(1-2):542–553, September 2010. ISSN 2150-8097. doi: 10.14778/1920841.1920911. URL http://dx.doi.org/10.14778/1920841.1920911. 5.2

[242] Tolga Urhan and Michael J. Franklin. Xjoin: A reactively-scheduled pipelined join operator. IEEE Data Eng. Bull., 23(2):27–33, 2000. URL http://sites.computer.org/debull/A00JUN-CD.pdf. 5.4, 9.6

[243] Tolga Urhan, Michael J. Franklin, and Laurent Amsaleg. Cost based query scrambling for initial delays. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA., pages 130–141, 1998. doi: 10.1145/276304.276317. URL http://doi.acm.org/10.1145/276304.276317. 5.2

[244] Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy M. Lohman, and Alan Skelley. DB2 advisor: An optimizer smart enough to recommend its own indexes. In ICDE, pages 101–110, 2000. doi: 10.1109/ICDE.2000.839397. URL http://dx.doi.org/10.1109/ICDE.2000.839397. 4, 7.6

[245] Bennet Vance and David Maier. Rapid bushy join-order optimization with cartesian products. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996., pages 35–46, 1996. doi: 10.1145/233269.233317. URL http://doi.acm.org/10.1145/233269.233317. 3.2.1


[246] Stratis Viglas, Jeffrey F. Naughton, and Josef Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In VLDB, pages 285–296, 2003. URL http://www.vldb.org/conf/2003/papers/S10P01.pdf. 5.4, 9.4.2, 9.6

[247] Sofia Berto Villas-Boas. Big data in firms and economic research. Applied Economics and Finance, 1(1), 2014. ISSN 2332-7308. URL http://redfame.com/journal/index.php/aef/article/view/375. 1.2

[248] Hannes Voigt, Thomas Kissinger, and Wolfgang Lehner. SMIX: self-managing indexes for dynamic workloads. In Conference on Scientific and Statistical Database Management, SSDBM ’13, Baltimore, MD, USA, July 29 - 31, 2013, pages 24:1–24:12, 2013. doi: 10.1145/2484838.2484862. URL http://doi.acm.org/10.1145/2484838.2484862. 8.8

[249] Matthew Wachs, Michael Abd-El-Malek, Eno Thereska, and Gregory R. Ganger. Argon: Performance insulation for shared storage servers. In Proceedings of the 5th USENIX Conference on File and Storage Technologies, FAST ’07, pages 5–5, Berkeley, CA, USA, 2007. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267903.1267908. 9.6

[250] Aaron Weiss. Computing in the clouds. netWorker, 11(4):16–25, December 2007. ISSN 1091-3556. doi: 10.1145/1327512.1327513. URL http://doi.acm.org/10.1145/1327512.1327513. 1.2.3

[251] Ron Weiss and Paul Tsien. Oracle and nearline storage: Strategies for interoperability. White Paper, http://www.oraclefans.cn/download/doc/whitepaper/oracle%20and%20hierarchical%20storage.pdf. 6.1

[252] Annita N. Wilschut and Peter M. G. Apers. Dataflow query execution in a parallel main-memory environment. In Proceedings of the First International Conference on Parallel and Distributed Information Systems (PDIS 1991), Fontainebleu Hilton Resort, Miami Beach, Florida, December 4-6, 1991, pages 68–77, 1991. doi: 10.1109/PDIS.1991.183069. URL http://dx.doi.org/10.1109/PDIS.1991.183069. 5.4, 9.6

[253] WiWynn. Wiwynn cold storage. http://www.wiwynn.com/english/product/type/details/11?ptype=7. 6.3, 9.1

[254] Network World. Facebook puts 10,000 blu-ray discs in low-power storage system. http://www.networkworld.com/article/2173853/data-center/facebook-puts-10-000-blu-ray-discs-in-low-power-storage-system.html. 9.7

[255] Cathy H. Wu, Hongzhan Huang, Leslie Arminski, Jorge Castro-Alvear, Yongxing Chen, Zhang-Zhi Hu, Robert S. Ledley, Kali C. Lewis, Hans-Werner Mewes, Bruce C. Orcutt, Baris E. Suzek, Akira Tsugita, C. R. Vinayaka, Lai-Su L. Yeh, Jian Zhang, and Winona C. Barker. The protein information resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Research, 30(1):35–37, 2002. doi: 10.1093/nar/30.1.35. URL http://dx.doi.org/10.1093/nar/30.1.35. 4.2.2, 4.2.2, 7.4.3, 9.5.1


[256] Eugene Wu and Samuel Madden. Partitioning techniques for fine-grained indexing. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 1127–1138, 2011. doi: 10.1109/ICDE.2011.5767830. URL http://dx.doi.org/10.1109/ICDE.2011.5767830. 8.8

[257] Wentao Wu, Yun Chi, Shenghuo Zhu, Jun’ichi Tatemura, Hakan Hacigümüs, and Jeffrey F. Naughton. Predicting query execution time: Are optimizer cost models really unusable? In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 1081–1092, 2013. doi: 10.1109/ICDE.2013.6544899. URL http://dx.doi.org/10.1109/ICDE.2013.6544899. 3.2.2, 3.4

[258] Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 13–24, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2037-5. doi: 10.1145/2463676.2465288. URL http://doi.acm.org/10.1145/2463676.2465288. 10.2

[259] Shaoyi Yin, Abdelkader Hameurlain, and Franck Morvan. Robust query optimization methods with respect to estimation errors: A survey. SIGMOD Rec., 44(3):25–36, December 2015. ISSN 0163-5808. doi: 10.1145/2854006.2854012. URL http://doi.acm.org/10.1145/2854006.2854012. 5

[260] Yali Zhu, Elke A. Rundensteiner, and George T. Heineman. Dynamic plan migration for continuous queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 431–442, 2004. doi: 10.1145/1007568.1007617. URL http://doi.acm.org/10.1145/1007568.1007617. 5.2

[261] P. Zikopoulos, D. deRoos, K. Parasuraman, T. Deutsch, J. Giles, and D. Corrigan. Harness the Power of Big Data – The IBM Big Data Platform. McGraw-Hill, 2012. ISBN 9780071808187. 1.1

[262] Daniel Zilio, Sam Lightstone, Kelly Lyons, and Guy Lohman. Self-managing technology in ibm db2 universal database. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM ’01, pages 541–543, New York, NY, USA, 2001. ACM. ISBN 1-58113-436-3. doi: 10.1145/502585.502682. URL http://doi.acm.org/10.1145/502585.502682. 4.1

[263] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam J. Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 design advisor: Integrated automatic physical database design. In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 1087–1097, 2004. URL http://www.vldb.org/conf/2004/IND4P1.PDF. 4.1, 7.6

[264] Zoolz. Zoolz cold storage (tm). http://www.zoolz.co.uk/cold-storage/. 10.2

Renata Borovica-Gajic
Curriculum Vitae

Data Intensive Applications and Systems Laboratory
School of Computer and Communications Sciences
École Polytechnique Fédérale de Lausanne
E-mail: [email protected]
Website: www.renata.borovica-gajic.com
LinkedIn: http://ch.linkedin.com/in/renataborovica

RESEARCH INTERESTS:

Database management systems, Big data analytics, Data exploration, Autonomous hands-free data management, Robust query execution, Adaptive query processing, Query optimization, Physical database design, Hardware-software co-design, Cold storage, Scientific data management

EDUCATION

2010 – 2016  École Polytechnique Fédérale de Lausanne, Switzerland
PhD in Computer Science
Ph.D. Thesis: Toward timely, predictable and cost-effective data analytics
Advisor: Professor Anastasia Ailamaki

2003 – 2008  Faculty of Technical Sciences, University of Novi Sad, Serbia
Master in Electrical and Computer Engineering
M.Sc. Thesis: Historical Server Query Performance Improvement Based on Grid Processing of Dynamic Data
Advisor: Professor Miroslav Hajdukovic
Award for the best master thesis in the field of mathematical and computer sciences

WORK EXPERIENCE

September 2010 – present
Institution: École Polytechnique Fédérale de Lausanne (Lausanne, Switzerland)
Occupation: Research assistant in the Data-Intensive Applications and Systems Laboratory (DIAS), working under the supervision of Professor Anastasia Ailamaki
Responsibilities: Developed algorithms incorporated in PostgreSQL to increase the predictability of DBMS with respect to changing workload, data, and hardware characteristics. (C, Java)

June 2013 – September 2013
Institution: Microsoft Research (Redmond, WA, USA)
Occupation: Research Intern in the Data Management, Exploration and Mining Group, supervised by Dr. Surajit Chaudhuri, Dr. Vivek Narasayya, and Dr. Christian König
Responsibilities: Developed an algorithm to propose a set of robust plans for parameterized queries. (C#, C++)

June 2005 – August 2010
Institution: DMS Group Ltd (Novi Sad, Serbia)
Occupation: Senior Software Engineer (Database Team)
Responsibilities: Developed a compression algorithm for load flow curves in Historian (C#, TSQL). Developed a web application for collecting information about installed products and patches that enables history playback. (EJB, Oracle ADF)

March 2008 – March 2009 (Contractor)
Institution: SIEMENS Energy Inc. (Minneapolis, Minnesota, USA)
Occupation: Software Developer (Database Team)
Responsibilities: Developed a Historical Information System (HIS) v2. Implemented a Grid Processor, Storage Processor, and Export–Import module. (PL/SQL, C, Python)

PUBLICATIONS

- Toward Timely, Predictable and Cost-effective Data Analytics. R. Borovica-Gajic. PhD Thesis, EPFL, 2016.
- Cheap Data Analytics Using Cold Storage Devices. R. Borovica-Gajic, R. Appuswamy, and A. Ailamaki. The 42nd International Conference on Very Large Data Bases (VLDB), 2016.
- Smooth Scan: Statistics-oblivious Access Paths. R. Borovica-Gajic, S. Idreos, A. Ailamaki, M. Zukowski and C. Fraser. The 31st International Conference on Data Engineering (ICDE), 2015.
- NoDB: Efficient Query Execution on Raw Data Files. I. Alagiannis, R. Borovica-Gajic, M. Branco, S. Idreos, and A. Ailamaki. Communications of the ACM, Research Highlights, 2015.
- NoDB in Action: Adaptive Query Processing on Raw Data (demo). I. Alagiannis, R. Borovica, M. Branco, S. Idreos and A. Ailamaki. The 38th International Conference on Very Large Data Bases (VLDB), 2012.
- Automated Physical Designers: What You See is (Not) What You Get. R. Borovica, I. Alagiannis and A. Ailamaki. The 5th International Workshop on Testing Database Systems (DBTest), 2012.
- NoDB: Efficient Query Execution on Raw Data Files. I. Alagiannis, R. Borovica, M. Branco, S. Idreos and A. Ailamaki. ACM International Conference on Management of Data (SIGMOD), 2012.
- Challenges and Opportunities in Self-Managing Scientific Databases. T. Heinis, M. Branco, I. Alagiannis, R. Borovica, F. Tauheed and A. Ailamaki. IEEE Data Engineering Bulletin, vol. 34(4), p. 44-52, (DEB), 2011.

PATENTS

- Query Management System and Engine Allowing for Efficient Query Execution on Raw Data Files. A. Ailamaki, S. Idreos, I. Alagiannis, R. Borovica, and M. Branco (US 9298754, 03/29/2016)

TEACHING ASSISTANTSHIP

- Java: with Prof. Jamila Sam and Prof. Thomas Lochmatter, EPFL, Spring 2011, Fall 2011, Fall 2012
- C++: with Prof. Jamila Sam, EPFL, Spring 2012, Spring 2013, Fall 2013, Fall 2014
- C: with Prof. Jean-Luc Desbiolles, EPFL, Spring 2014

SELECTED PROJECTS (EPFL)

Cost-effective data analysis with Skipper
- To decrease the cost of data analytics services, this project uses cold storage devices as the primary storage tier for enterprise and cloud-hosted databases. Skipper is an end-to-end query execution framework that employs a hardware-driven query execution model based on multi-way joins, combined with efficient cache management and I/O scheduling strategies, to hide the non-uniform access latencies of cold storage. As a result, Skipper offers performance comparable to that of systems storing data on HDD, at a substantially lower cost.
- Implemented in PostgreSQL (C) and with OpenStack Swift.

Predictable query performance with Smooth Scan
- To prevent performance degradation caused by suboptimal plan decisions of the query optimizer, Smooth Scan continuously adapts and morphs at runtime, transforming one physical alternative into another (e.g., an index access into a sequential scan). With Smooth Scan, an access path strategy is progressively transformed into an optimal form based on the operator selectivity and result distribution observed during query execution. As a result, a system with Smooth Scan requires neither up-front access path decisions nor accurate statistics to provide stable performance.
- Implemented in PostgreSQL (C).

Fast data exploration with NoDB
- NoDB enables fast data exploration by removing data loading as a prerequisite for querying. Instead, NoDB queries raw data files directly and uses user queries as a driver for partial and incremental data loading and performance tuning. To mask the overhead of raw file access, NoDB builds and progressively refines auxiliary design structures (positional indexes, data caches, and statistics) during query execution. As a result, NoDB matches the performance of traditional database systems that process already loaded data.
- Implemented in PostgreSQL (C).

TECHNICAL SKILLS

Programming languages: C, C++, Java, C#, PL/SQL, TSQL
DBMS: PostgreSQL, Oracle 10g, MS SQL Server, DB2, MySQL
Scripting languages: Bash
Platforms: Windows, Linux, Mac
Development environments: Eclipse, IntelliJ, Visual Studio, SQL Developer, Power Designer
Version control systems: CVS, SVN, Git

AWARDS

- EPFL I&C School achievement award for 2015.
- TCDE 2015 Travel award for attending the International Conference on Data Engineering (ICDE).
- Best poster runner-up award at ICDE 2015.
- Serbian national award “Mileva Marić Einstein” for the best master thesis in the field of mathematical and computer sciences in 2008.
- Student of the promotion at the Faculty of Technical Sciences in 2008.
- Proclaimed one of the best 100 students in Serbia by the Government of Serbia in 2007.
- Awards given by the University of Novi Sad for excellent results achieved at the Faculty of Technical Sciences in the 2007/08, 2006/07, 2005/06, 2004/05 and 2003/04 academic years.

FELLOWSHIPS

- Dositeja Fellowship from the Serbian Government for students studying abroad, 2012-2013.
- EPFL I&C School fellowship student, 2010-2011.
- DMS Group Ltd. fellowship student (mentoring program), 2005-2008.

LANGUAGES

- Serbian (Native)
- English (Full professional proficiency)
- French (B1)

REFERENCES

Professor Anastasia Ailamaki, Full Professor, École Polytechnique Fédérale de Lausanne
Dr. Surajit Chaudhuri, Deputy Managing Director / Distinguished Scientist, Microsoft Research
Professor Stratos Idreos, Assistant Professor, Harvard University
Dr. Goetz Graefe, Principal Scientist, Google
Dr. Vivek Narasayya, Principal Researcher, Microsoft Research
