
University of Torino

Doctoral School on Science and High Technology

Computer Science Department

Doctoral Thesis

PiCo: A Domain-Specific Language for Data Analytics Pipelines

Author: Claudia Misale
Supervisor: Prof. Marco Aldinucci
Co-Supervisor: Prof. Guy Tremblay
Cycle XVIII

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

“Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.”

Edsger W. Dijkstra, “On the nature of Computing Science”, Marktoberdorf, 10 August 1984

UNIVERSITY OF TORINO

Abstract

Computer Science Department

Doctor of Philosophy

PiCo: A Domain-Specific Language for Data Analytics Pipelines

by Claudia Misale

In the world of Big Data analytics, there is a series of tools aiming at simplifying the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models — for which only informal (and often confusing) semantics is generally provided — all share a common underlying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects of Big Data analytics tools from a high-level perspective. This analysis can be considered a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or designer to avoid mixing low-level with high-level aspects, as often happens in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clear and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language that sits on top of a stack of layers forming a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such a layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user a unique interface for both stream and batch processing, completely hiding data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.

Acknowledgements

These four years went by really fast; I hardly noticed them passing. I learned a lot, there are still many things I have not figured out how to do (for instance, I am not able to review papers in a polite and elegant way, I did not learn to speak in public, I cannot summarize meetings...), and I still make my supervisor, Professor Marco Aldinucci, very angry. Simply saying “thank you” to him is not enough, especially for the patience he has had with me, from every point of view.

I arrived in Turin without knowing much about Prof. Aldinucci, and I asked him to be my Ph.D. supervisor. I could not have made a better choice. Everything I know I owe to him. The hours spent listening to him explain the thousand aspects of this research area (even if, from his point of view, there are always only two aspects) opened my mind and shaped me enormously, even though I still have everything to learn: four years are not enough. He always pushed me to do better and to look inside things, and I will try to carry his teachings with me wherever life takes me. Moreover, his good humor and his way of doing things have been excellent company during these years. Thank you, theboss. From the heart.

There are a few people I want to thank, because they gave me so much and I will always be in their debt.

I would like to start with my co-supervisor, Professor Guy Tremblay. Thank you for guiding me and helping me so much in writing this thesis and in understanding topics that were new to both of us: your great experience has been enlightening, and you have been a very important guide. Thank you even more for all the advice you gave me about the presentation; I tried to follow it as best I could and I hope I lived up to it.

I would like to thank the reviewers for the great job they did, their impressive knowledge and experience, and the time they dedicated to my work: their comments were precious and very detailed. They allowed me to improve the manuscript and, above all, to reconsider many aspects that inexperience still prevents me from seeing.

A deep thank-you goes to my colleagues: a crew of unique people with whom I shared a lot and who gave me so much. Our “let's not stay out late” evenings that ended at 4 or 5 in the morning, the dinners, the outings, the games, the endless conversations, the Gossip Girlz... Thank you for everything you gave me.

Thanks to Caterina, my lifelong friend. You know everything about me, what else can I say. We grew up together during the five years of university at UniCal, where we got up to a lot, and you have stood by me during these years: you too are a cornerstone of this Ph.D. and of my life.

I spent almost every lunch break of my Ph.D. at the gym “beating people up”, as my colleagues say. The kick-boxing/muay-thai classes were absolutely addictive, first of all because of the people I met there. Thanks to Maestro Beppe and the guys, I learned to appreciate this magnificent sport made of discipline, respect and self-control, which pushed me beyond what I thought were my physical and mental limits. Thank you, Maestro, for letting me discover a side of myself I did not know, and thank you for the person and friend you are. Thanks to all my “fighting friends”: thank you for the hits you gave me, for the perfect company you are both inside and outside the gym, and for helping me release the anxiety accumulated every day. Thanks to Roberta, my number-one fighting friend. We shared a lot at the gym and grew together; you made me grow and taught me so much. Above all, you listened to and put up with all my outbursts and frustrations as a Ph.D. student. I will never thank you enough.

And finally, the big ones: my family, to whom I owe everything. Above all, thank you for the patience you have had with me: I know it has been hard, and I apologize for that. Thank you for always supporting me in all my choices and for allowing me to reach this goal. I would like to always make you proud, and I hope this Ph.D. helps me toward that goal, because you are my example of integrity and strength.

Maurizio. I met you during our Ph.D., and you became part of my life from the first moment. We shared this journey together: the good moments, the somewhat critical ones, the mental and physical fatigue. Your intelligence and your intuitions have always fascinated and inspired me. I hope you will be proud of me, at least a little, at the end of this story. You know.

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Results and Contributions
  1.2 Limitations and Future Work
  1.3 Plan of the Thesis
  1.4 List of Papers
  1.5 Funding

2 Technical Background
  2.1 Parallel Computing
  2.2 Platforms
    2.2.1 SIMD computers
      Symmetric shared-memory multiprocessors
      Memory consistency model
      Cache coherence and false sharing
    2.2.2 Manycore processors
    2.2.3 Distributed Systems, Clusters and Clouds
  2.3 Parallel Programming Models
    2.3.1 Types of parallelism
      Data parallelism
      Map
      Reduce
    2.3.2 The Dataflow Model
      Actors
      Input channels
      Output channels
      Stateful actors
    2.3.3 Low-level approaches
    2.3.4 High-level approaches
    2.3.5 Skeleton-based approaches
      Literature review of skeleton-based approaches
    2.3.6 Skeletons for stream parallelism
  2.4 Programming multicore clusters
    2.4.1 FastFlow
      Distributed FastFlow
  2.5 Summary

3 Overview of Big Data Analytics Tools
  3.1 A Definition for Big Data
  3.2 Big Data Management
  3.3 Tools for Big Data Analytics
    3.3.1 Google MapReduce
      The five steps of a MapReduce job
    3.3.2 Microsoft Dryad
      A Dryad application DAG
    3.3.3 Microsoft Naiad
      Timely Dataflow and Naiad programming model
    3.3.4 Apache Spark
      Resilient Distributed Datasets
      Spark Streaming
    3.3.5 Apache Flink
      Flink Programming and Execution Model
    3.3.6 Apache Storm
      Tasks and Grouping
    3.3.7 FlumeJava
      Data Model and Transformations
    3.3.8 Google Dataflow
      Data Model and Transformations
    3.3.9 Thrill
      Distributed Immutable Arrays
    3.3.10 Kafka
      Producer-Consumer Distributed Coordination
    3.3.11 Google TensorFlow
      A TensorFlow application
    3.3.12 Machine Learning and Deep Learning Frameworks
  3.4 Fault Tolerance
  3.5 Summary

4 High-Level Model for Big Data Frameworks
  4.1 The Dataflow Layered Model
    4.1.1 The Dataflow Stack
  4.2 Programming Models
    4.2.1 Declarative Data Processing
    4.2.2 Topological Data Processing
    4.2.3 State, Windowing and Iterative Computations
  4.3 Program Semantics Dataflow
    4.3.1 Semantic Dataflow Graphs
    4.3.2 Tokens and Actors Semantics
    4.3.3 Semantics of State, Windowing and Iterations
  4.4 Parallel Execution Dataflow
  4.5 Execution Models
    4.5.1 Thread-based Execution
    4.5.2 Process-based Execution
  4.6 Limitations of the Dataflow Model
  4.7 Summary

5 PiCo Programming Model
  5.1 Syntax
    5.1.1 Pipelines
    5.1.2 Operators
      Data-Parallel Operators
      Pairing
      Sources and Sinks
      Windowing
      Partitioning
    5.1.3 Running Example: The word-count Pipeline
  5.2 Type System
    5.2.1 Collection Types
    5.2.2 Operator Types
    5.2.3 Pipeline Types
  5.3 Semantics
    5.3.1 Semantic Collections
      Partitioned Collections
      Windowed Collections
    5.3.2 Semantic Operators
      Semantic Core Operators
      Semantic Decomposition
      Unbounded Operators
      Semantic Sources and Sinks
    5.3.3 Semantic Pipelines
  5.4 Programming Model Expressiveness
    5.4.1 Use Cases: Stock Market
  5.5 Summary

6 PiCo Parallel Execution Graph
  6.1 Compilation
    6.1.1 Operators
      Fine-grained PE graphs
      Batching PE graphs
      Compilation environments
    6.1.2 Pipelines
      Merging Pipelines
      Connecting Pipelines
  6.2 Compilation optimizations
    6.2.1 Composition and Shuffle
    6.2.2 Common patterns
    6.2.3 Operators compositions
      Composition of map and flatmap
      Composition of map and reduce
      Composition of map and p-reduce
      Composition of map and w-reduce
      Composition of map and w-p-reduce
  6.3 Stream Processing
  6.4 Summary

7 PiCo API and Implementation
  7.1 C++ API
    7.1.1 Pipe
    7.1.2 Operators
      The map family
      The combine family
      Sources and Sinks
    7.1.3 Polymorphism
    7.1.4 Running Example: Word Count in C++ PiCo
  7.2 Runtime Implementation
  7.3 Anatomy of a PiCo Application
    7.3.1 User level
    7.3.2 Semantics dataflow
    7.3.3 Parallel execution dataflow
    7.3.4 FastFlow network execution
  7.4 Summary

8 Case Studies and Experiments
  8.1 Use Cases
    8.1.1 Word Count
    8.1.2 Stock Market
  8.2 Experiments
    8.2.1 Batch Applications
      PiCo
      Comparison with other frameworks
    8.2.2 Stream Applications
      PiCo
      Comparison with other tools
  8.3 Summary

9 Conclusions

A Source Code

List of Figures

2.1 Structure of an SMP
2.2 Map pattern
2.3 Reduce pattern
2.4 MapReduce pattern
2.5 A general Pipeline pattern with three stages
2.6 Farm pattern
2.7 Pipeline of Farm
2.8 Layered FastFlow design

3.1 An example of a MapReduce Dataflow
3.2 A TensorFlow application graph

4.1 Layered model representing the levels of abstraction provided by the frameworks that were analyzed
4.2 Functional Map and Reduce dataflow expressing data dependencies
4.3 A Flink JobGraph (4.3a). Spark DAG of the WordCount application (4.3b)
4.4 Example of a Google Dataflow Pipeline merging two PCollections
4.5 MapReduce execution dataflow with maximum level of parallelism reached by eight map instances
4.6 Spark and Flink execution DAGs
4.7 Master-Workers structure of the Spark runtime (a) and Worker hierarchy example in Storm (b)

5.1 Graphical representation of PiCo Pipelines
5.2 Unbounded extension provided by windowing
5.3 Pipeline typing

6.1 Grammar of PE graphs
6.2 Operators compilation in the target language
6.3 Compilation of a merge Pipeline
6.4 Compilation of to Pipelines
6.5 Forwarding Simplification
6.6 Compilation of map-to-map and flatmap-to-flatmap composition
6.7 Compilation of map-reduce composition
6.8 Compilation of map-to-p-reduce composition
6.9 Compilation of map-to-w-reduce composition. Centralization in w-F1 for data reordering before w-reduce workers
6.10 Compilation of map-to-w-p-reduce composition

7.1 Inheritance class diagram for map and flatmap
7.2 Inheritance class diagram for reduce and p-reduce
7.3 Inheritance class diagram for ReadFromFile and ReadFromSocket
7.4 Inheritance class diagram for WriteToDisk and WriteToStdOut
7.5 Semantics DAG resulting from merging three Pipes
7.6 Semantics DAG resulting from the application in listing 7.3
7.7 Semantics DAG resulting from connecting multiple Pipes
7.8 Parallel execution DAG resulting from the application in listing 7.3

8.1 Scalability and execution times for the Word Count application in PiCo
8.2 Scalability and execution times for the Stock Pricing application in PiCo
8.3 Comparison of best execution times for Word Count and Stock Pricing reached by Spark, Flink and PiCo
8.4 Scalability and execution times for the Stream Stock Pricing application in PiCo

List of Tables

2.1 Collective communication patterns among ff_dnodes

4.1 Batch processing
4.2 Stream processing comparison between Google Dataflow, Storm, Spark and Flink

5.1 Pipelines
5.2 Core operator families
5.3 Operator types

7.1 The Pipe class API
7.2 Operator constructors

8.1 Decomposition of execution times and scalability highlighting the bottleneck on the ReadFromFile operator in the Word Count benchmark
8.2 Decomposition of execution times and scalability highlighting the bottleneck on the ReadFromFile operator in the Stock Pricing benchmark
8.3 Execution configurations for tested tools
8.4 Average, standard deviation and coefficient of variation on 20 runs for each benchmark. Best execution times are highlighted
8.5 User's percentage usage of all CPUs and RAM used in MB, referred to best execution times
8.6 Decomposition of execution times and scalability highlighting the bottleneck on the ReadFromSocket operator
8.7 Flink, Spark and PiCo best average execution times, showing also the scalability with respect to the average execution time with one thread
8.8 Average, standard deviation and coefficient of variation on 20 runs of the stream Stock Pricing benchmark. Best execution times are highlighted
8.9 User's percentage usage of all CPUs and RAM used in MB, referred to best execution times
8.10 Stream Stock Pricing: throughput values computed as the number of input stock options with respect to the best execution time


List of Abbreviations

ACID Atomicity, Consistency, Isolation, Durability
API Application Programming Interface
AVX Advanced Vector Extensions
CSP Communicating Sequential Processes
DSL Domain Specific Language
DSM Distributed Shared Memory
DSP Digital Signal Processor
EOS End Of Stream
FPGA Field-Programmable Gate Array
GPU Graphics Processing Unit
JIT Just In Time
JVM Java Virtual Machine
MIC Many Integrated Core
MPI Message Passing Interface
NGS Next Generation Sequencing
PGAS Partitioned Global Address Space
PiCo Pipeline Composition
RDD Resilient Distributed Datasets
RDMA Remote Direct Memory Access
RMI Remote Method Invocation
SPMD Single Program Multiple Data
STL Standard Template Library
TBB Threading Building Blocks


To my family


Chapter 1

Introduction

Big Data is becoming one of the most (ab)used buzzwords of our times. In companies, industries and academia, interest is dramatically increasing and everyone wants to “do Big Data”, even though its definition or role in analytics is not completely clear. Big Data has been defined through the “3Vs” model, an informal definition proposed by Beyer and Laney [32, 90] that has been widely accepted by the community:

“Big data is high-Volume, high-Velocity and/or high-Variety information assets that demand cost-effective, innovative forms of processing that enable enhanced insight, decision making, and process automation.”

In this definition, Volume refers to the amount of data that is generated and stored; the size of the data determines whether it can be considered Big Data or not. Velocity is about the speed at which the data is generated and processed. Finally, Variety pertains to the type and nature of the data being collected and processed (typically unstructured data). The “3Vs” model has been further extended by adding two more V-features: Variability, since data are not only unstructured, but can also be of different types (e.g., text, images) or even inconsistent (i.e., corrupted), and Veracity, in the sense that the quality and accuracy of the data may vary. From a high-level perspective, Big Data is about extracting knowledge from both structured and unstructured data. This is a useful process for big companies such as banks, insurance and telecommunication companies, and public institutions, as well as for business in general. Extracting knowledge from Big Data requires tools satisfying strong requirements with respect to programmability — that is, making it easy to write programs and to analyze data — and performance, ensuring scalability when running analyses and queries on multicores or clusters of multicore nodes. Furthermore, such tools need to cope with input data in different formats, e.g., batches from data marts or live streams from the Internet or very high-frequency sources.

1.1 Results and Contributions

By studying and analyzing a large number of Big Data analytics tools, we identified the most representative ones: Spark [131], Storm [97], Flink [67] and Google Dataflow [5], which we survey in Chapter 3. The three major contributions of this study are the following:

A unifying semantics for Big Data analytics frameworks. Our focus is on understanding the expressiveness of their programming and execution models, trying to identify a set of common aspects that constitute the main ingredients of existing tools for Big Data analytics. In Chapter 4, these aspects are systematically arranged in a single Dataflow-based formalism organized as a stack of layers. This serves a number of objectives, such as defining the semantics of existing Big Data frameworks, comparing them precisely in terms of expressiveness, and eventually designing new frameworks. As we shall see, this approach also makes it possible to uniformly capture different data access mechanisms, such as batch and stream, that are typically claimed as distinguishing features, thus making it possible to directly compare data-processing applications written in all the mainstream Big Data frameworks, including Spark, Flink, Storm and Google Dataflow. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level. As far as we know, no previous attempt has been made to compare different Big Data processing tools, at multiple levels of abstraction, under a common formalism.

A new data-model-agnostic DSL for Big Data. We advocate a new Domain Specific Language (DSL), called Pipeline Composition (PiCo), designed over the presented layered Dataflow conceptual framework (see Chapter 4). The PiCo programming model aims at easing the programming and enhancing the performance of Big Data applications by two design routes: 1) unifying the data access model, and 2) decoupling processing from data layout.

Simplified programming. Both design routes pursue the same goal, which is raising the level of abstraction of the programming and execution model with respect to mainstream approaches in tools for Big Data analytics, which typically force the specialization of the algorithm to match the data access and layout. Specifically, data transformation functions (called operators in PiCo) exhibit different functional types when accessing data in different ways. For this reason, the source code has to be revised when switching from one data model to another. This happens in all the above-mentioned frameworks and also in abstract Big Data architectures, such as the Lambda [86] and Kappa [84] architectures. Some frameworks, such as Spark, provide the runtime with a module to convert streams into micro-batches (Spark Streaming, a library running on top of the Spark core), but different code must still be written at the user level. The Kappa architecture advocates the opposite approach, i.e., to “streamize” batch processing, but the streamizing proxy has to be coded. The Lambda architecture requires the implementation of both a batch-oriented and a stream-oriented algorithm, which means coding and maintaining two codebases per algorithm. PiCo fully decouples algorithm design from data model and layout. Code is designed in a fully functional style by composing stateless operators (i.e., transformations in Spark terminology). As discussed in Chapter 5, all PiCo operators are polymorphic with respect to data types. This makes it possible to 1) re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.); 2) reuse the same operators in different contexts; and 3) update operators without affecting the calling context, i.e., the previous and following stages in the pipeline. Notice that in other mainstream frameworks, such as Spark, updating a pipeline by replacing one transformation with another is not necessarily trivial, since it may require the development of an input and output proxy to adapt the new transformation to the calling context.

Enhanced performance. PiCo has the ambition of exploring a new direction in the relationship between processing and data. Two successful programming paradigms are the object-oriented and task-based approaches: they have been mainstream in sequential and parallel computing, respectively, and they both tightly couple data and computing. This coupling supports a number of features at both the language and runtime level, e.g., data locality and encapsulation. At the runtime-support level, these can be turned into crucial features for performance, such as increasing the likelihood of accessing the closest and fastest memory, and balancing the load onto multiple executors. The quest for methods to distribute data structures onto multiple executors is long standing. In this regard, the MapReduce paradigm [61] introduced some freshness into the research area by exploiting the idea of processing data partitions on the machine where they are stored, together with a relaxed data-parallel execution model.
Data partitions are initially randomized across a distributed filesystem for load balancing, computed (Map phase), then shuffled to form other partitions with respect to a key and computed again (Reduce phase). The underlying assumption is that the computing platform is a cluster, and all data is managed by way of distributed objects. As mentioned, PiCo aims at fully abstracting data layout from computing. Beyond programmability, we believe the approach could also help in enhancing performance, for two reasons. First, the principal overhead in modern computing is data movement, and the described exploitation of locality is really effective for embarrassingly parallel computations only — being an instance of the Owner-Computes Rule (in the case of output data decomposition, the Owner-Computes Rule implies that the output is computed by the process to which the output data is assigned). This argument, however, falls short in supporting locality for streaming, even when streaming is arranged in sliding windows or micro-batches and processed in a data-parallel fashion. The problem is that the Owner-Computes Rule does not help in minimizing data transfers, since data anyway flows from one or more sources to one or more sinks. For this reason, a scalable runtime support for stream and batch computing cannot simply be organized as a monolithic master-worker network, as happens in mainstream Big Data frameworks. As described in Chapter 6, the network of processes and threads implementing the runtime support of PiCo is generated from the application by way of structural induction on operators, exploiting a set of basic, compositional runtime support patterns, namely FastFlow patterns (see Sect. 2.4.1). FastFlow [56] is an open source, structured parallel programming framework supporting highly efficient stream parallel computation while targeting shared-memory multicores. Its efficiency mainly comes from the optimized implementation of the base communication mechanisms and from its layered design. It provides a set of ready-to-use, parametric algorithmic skeletons modeling the most common parallelism exploitation patterns, which may be freely nested to model more complex ones. It is realized as a C++ pattern-based parallel programming framework aimed at simplifying the development of applications for (shared-memory) multicore and GPGPU platforms. Moreover, PiCo relies on the FastFlow programming model, based on the decoupling of data from synchronization, where only pointers to data are effectively moved through the network. The very same runtime can be used on a shared-memory model (as in the current PiCo implementation) as well as on distributed memory, Partitioned Global Address Space (PGAS) or Distributed Shared Memory (DSM) systems, thus allowing high portability at the cost of just managing communication among actors, which is left to the runtime.

A fluent C++14 DSL. To the best of our knowledge, the only other ongoing attempt to address Big Data analytics in C++ is Thrill (see Sect. 3.3), which is basically a C++ Spark clone with a high number of primitives, in contrast with PiCo, in which we follow a RISC-like approach. The Grappa framework [98] is a platform for data-intensive applications implemented in C++ and providing a Distributed Shared Memory (DSM) as the underlying memory model, which offers fine-grained access to data anywhere in the system with strong consistency guarantees; it is no longer under active development. PiCo is specifically designed to exhibit a functional style over the C++14 standard by defining a library of purely functional data transformation operators exhibiting 1) a well-defined functional and parallel semantics, and 2) a fluent interface based on method chaining to relay the instruction context to a subsequent call [70]. In this way, the calling context is defined through the return value of a called method and is self-referential: the new context is equivalent to the last one. In PiCo, the chaining is terminated by a special method, namely run(), which effectively executes the pipeline. The fluent interface style is inspired by C++'s iostream >> operator for passing data across a chain of operators. Many of the mainstream frameworks for Big Data analytics are designed on top of Scala/Java, which simplifies the distribution and execution of remote objects, and thus the mobility of code in favor of data locality. This choice has its own drawbacks, for instance a lower stability caused by Java's dynamic nature. By using C++, it is possible to directly manage memory as well as data movement, which represents one of the major performance issues because of data copying. In PiCo, no unnecessary copy is made and, thanks to the FastFlow runtime and the FastFlow allocator, it is possible to optimize communications (i.e., only pointers are moved) as well as manage allocations in order to reuse as many memory slots as possible. Moreover, the FastFlow allocator can also be used in a distributed-memory scenario, thus becoming fundamental for streaming applications. A last consideration about the advantage of using C++ instead of Java/Scala is that C++ allows a direct offloading of user-defined kernels to GPUs, thus also leaning towards specialized computing with DSPs, reconfigurable computing with FPGAs, and specialized networking co-processors, such as Infiniband RDMA verbs (which is a C/C++ API). Furthermore, kernel offloading to GPUs is a feature already present in FastFlow and can be done without the user having to care about data management from/to the device [12].
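To make the fluent, method-chaining style concrete, the following self-contained C++14 toy sketches the idea: every stage-registering call returns the object itself so that calls can be chained, and nothing executes until the terminal run() is invoked. The ToyPipe class and its map/run methods are illustrative assumptions for this example only; the actual PiCo Pipe API and its operators are presented in Chapter 7.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Toy fluent "pipeline" over an in-memory collection: stages are registered
// lazily and the chain is terminated (and executed) by run().
class ToyPipe {
public:
    explicit ToyPipe(std::vector<std::string> input) : data_(std::move(input)) {}

    // Register a per-element transformation without executing it yet.
    ToyPipe& map(std::function<std::string(const std::string&)> f) {
        stages_.push_back(std::move(f));
        return *this;  // self-referential return value enables method chaining
    }

    // Terminal method: apply all registered stages, then print the result.
    void run() {
        for (auto& s : data_)
            for (auto& f : stages_) s = f(s);
        for (const auto& s : data_) std::cout << s << '\n';
    }

private:
    std::vector<std::string> data_;
    std::vector<std::function<std::string(const std::string&)>> stages_;
};

int main() {
    ToyPipe(std::vector<std::string>{"foo", "bar"})
        .map([](const std::string& s) { return s + "!"; })
        .map([](const std::string& s) { return "<" + s + ">"; })
        .run();  // prints "<foo!>" and "<bar!>"
    return 0;
}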

1.2 Limitations and Future Work

PiCo has been designed to be implemented on cache-coherent multicores, GPUs, and distributed-memory platforms. Since it is still in a prototype phase, only a shared-memory implementation is provided but, thanks to the FastFlow runtime, it will be easy to 1) port it to run in distributed environments and 2) make PiCo exploit GPUs, since both features are already supported by FastFlow. PiCo is also still missing binary operator implementations, which are a work in progress. From the performance viewpoint, we aim to solve the dynamic allocation contention problem we are facing in input generation nodes which, as shown in Chapter 8, limits PiCo's scalability. In PiCo, we rely on the stability of a lightweight C++ runtime, in contrast to Java. We measured RAM and CPU utilization with the sar tool, which confirmed a lower memory consumption by PiCo with respect to the other frameworks when compared on batch applications (Word Count and Stock Pricing) and on a stream application (streaming Stock Pricing). As future work, we will also provide PiCo with fault tolerance capabilities for automatic recovery in case of failures. Another improvement for the PiCo implementation on distributed systems would be to exploit the very same runtime on PGAS or DSM systems, in order to still be able to use FastFlow's characteristic of moving pointers instead of data, thus allowing high portability at the cost of just managing communication among actors in a different memory model, which is left to the runtime.

1.3 Plan of the Thesis

The thesis will proceed as follows:

• Chapter 2 provides technical background by reviewing the most common parallel computing platforms and then introducing the problem of effectively programming such platforms by exploiting high-level skeleton-based approaches on multicore and multicore cluster platforms; in particular, we present FastFlow, which we use in this work to implement the PiCo runtime.
• Chapter 3 defines the Big Data terminology and then surveys the state of the art of Big Data analytics tools.
• Chapter 4 analyzes some well-known tools — Spark, Storm, Flink and Google Dataflow — by providing a common structure based on the Dataflow model describing all levels of abstraction, from the user-level API to the execution model. This Dataflow model is proposed as a stack of layers where each layer represents a dataflow graph/model with a different meaning, describing a program from what is exposed to the programmer down to the underlying execution model layer.
• Chapter 5 proposes a new programming model based on Pipelines and operators, which are the building blocks of PiCo programs, first defining the syntax of programs, then providing a formalization of the type system and semantics. The chapter also presents a set of use cases. The aim is to show the expressiveness of the proposed model without using the current concrete API, in order to demonstrate that the model is independent of the implementation.
• Chapter 6 discusses how a PiCo program is compiled into a directed acyclic graph representing the parallelization of a given application, namely the parallel execution dataflow. This chapter shows how each operator is compiled into its corresponding parallel version, providing a set of optimizations applicable when composing parallel operators.
• Chapter 7 provides a comprehensive description of the actual PiCo implementation, both at the user API level and at the runtime level. A complete source code example (Word Count) is used to describe how a PiCo program is compiled and executed.
• Chapter 8 provides a set of experiments based on the examples defined in Section 5.4. We compare PiCo to Spark and Flink, focusing on the expressiveness of the programming model and on performance in shared memory.
• Chapter 9 concludes the thesis.

1.4 List of Papers

This section provides the complete list of my publications, in reverse chronological order. Papers (i)–(vi) are directly related to this thesis. Papers (i) and (ii) introduce the Dataflow Layered Model presented in Chapter 4. Papers (iii) and (iv) result from part of the work I did as an intern in the High Performance System Software research group of the Data Centric Systems Department at the IBM T.J. Watson Research Center. During this period, my task was to modify the Spark network layer in order to use the RDMA facilities provided by the IBM JVM. These papers refer to the optimization of the Spark shuffle strategy, proposing a novel shuffle data transfer strategy that dynamically adapts the prefetching to the computation.

Paper (v) introduces the Loop-of-stencil-reduce paradigm, which generalizes the iterative application of a Map-Reduce pattern, and discusses its GPU implementation in FastFlow. As already done in that work, it will be possible to apply the same techniques to the offloading of some PiCo operators to GPUs (see Sect. 1.2). Paper (vi) presents a novel approach for functional-style programming of distributed-memory clusters with C++11, targeting data-centric applications. The proposed programming model explores the usage of MPI as the communication library for building the runtime support of a non-SPMD programming model, as PiCo is. The paper itself is a preliminary study for the PiCo DSL. Papers (vii)–(xiv) are not directly related to this thesis. Most of them are related to the design and optimization of parallel algorithms for Next Generation Sequencing (NGS) and Systems Biology, which is an application domain of growing interest for parallel computing. Most of the algorithms in this domain are memory-bound, as are many problems in Big Data analytics; moreover, they make massive use of pipelines to process NGS data, an aspect that is also reflected in Big Data analytics. Even if not directly related to this thesis, working on the parallel implementation of NGS algorithms has definitely been an excellent motivation for this work. The list of papers is divided into journal and conference papers.

Journal Papers

(i) Claudia Misale, M. Drocco, M. Aldinucci, and G. Tremblay. A comparison of big data frameworks on a layered dataflow model. Parallel Processing Letters, 27(01):1740003, 2017.
(ii) B. Nicolae, C. H. A. Costa, Claudia Misale, K. Katrinis, and Y. Park. Leveraging adaptative I/O to optimize collective data shuffling patterns for big data analytics. IEEE Transactions on Parallel and Distributed Systems, PP(99), 2016.
(iii) M. Aldinucci, M. Danelutto, M. Drocco, P. Kilpatrick, Claudia Misale, G. Peretti Pezzi, and M. Torquati. A parallel pattern for iterative stencil + reduce. Journal of Supercomputing, pages 1–16, 2016.
(iv) F. Tordini, M. Drocco, Claudia Misale, L. Milanesi, P. Liò, I. Merelli, M. Torquati, and M. Aldinucci. NuChart-II: the road to a fast and scalable tool for Hi-C data analysis. International Journal of High Performance Computing Applications (IJHPCA), 2016.
(v) Claudia Misale, G. Ferrero, M. Torquati, and M. Aldinucci. Sequence alignment tools: one parallel pattern to rule them all? BioMed Research International, 2014.
(vi) M. Aldinucci, M. Torquati, C. Spampinato, M. Drocco, Claudia Misale, C. Calcagno, and M. Coppo. Parallel stochastic systems biology in the cloud. Briefings in Bioinformatics, 15(5):798–813, 2014.

Conference Papers

(i) B. Nicolae, C. H. A. Costa, Claudia Misale, K. Katrinis, and Y. Park. Towards memory-optimized data shuffling patterns for big data analytics. In IEEE/ACM 16th Intl. Symposium on Cluster, Cloud and Grid Computing (CCGrid 2016), Cartagena, Colombia, 2016. IEEE.
(ii) M. Drocco, Claudia Misale, and M. Aldinucci. A cluster-as-accelerator approach for SPMD-free data parallelism. In Proc. of Intl. Euromicro PDP 2016: Parallel, Distributed and Network-based Processing, pages 350–353, Crete, Greece, 2016. IEEE.

(iii) F. Tordini, M. Drocco, Claudia Misale, L. Milanesi, P. Liò, I. Merelli, and M. Aldinucci. Parallel exploration of the nuclear chromosome conformation with NuChart-II. In Proc. of Intl. Euromicro PDP 2015: Parallel, Distributed and Network-based Processing. IEEE, Mar. 2015.
(iv) M. Drocco, Claudia Misale, G. Peretti Pezzi, F. Tordini, and M. Aldinucci. Memory-optimised parallel processing of Hi-C data. In Proc. of Intl. Euromicro PDP 2015: Parallel, Distributed and Network-based Processing, pages 1–8. IEEE, Mar. 2015.
(v) M. Aldinucci, M. Drocco, G. Peretti Pezzi, Claudia Misale, F. Tordini, and M. Torquati. Exercising high-level parallel programming on streams: a systems biology use case. In Proc. of the 2014 IEEE 34th Intl. Conference on Distributed Computing Systems Workshops (ICDCS), Madrid, Spain, 2014. IEEE.
(vi) Claudia Misale. Accelerating bowtie2 with a lock-less concurrency approach and memory affinity. In Proc. of Intl. Euromicro PDP 2014: Parallel, Distributed and Network-based Processing, Torino, Italy, 2014. IEEE. (Best paper award)
(vii) Claudia Misale, M. Aldinucci, and M. Torquati. Memory affinity in multi-threading: the bowtie2 case study. In Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) – Poster Abstracts, Fiuggi, Italy, 2013. HiPEAC.

1.5 Funding

This work has been partially supported by the Italian Ministry of Education and Research (MIUR), by the EU-H2020 RIA project “Toreador” (no. 688797), the EU-H2020 RIA project “Rephrase” (no. 644235), the EU-FP7 STREP project “REPARA” (no. 609666), the EU-FP7 STREP project “Paraphrase” (no. 288570), and the 2015–2016 IBM Ph.D. Scholarship program.


Chapter 2

Technical Background

In this chapter, we provide a technical background to help the reader go through the topics of this thesis. We first review the most common parallel computing platforms (Sect. 2.2); then we introduce the problem of effective programming of such platforms exploiting high-level skeleton-based approaches on multicore and multicore cluster platforms; in particular we present FastFlow, which we use in this work to implement the PiCo runtime.

2.1 Parallel Computing

Computing hardware has evolved to sustain the demand for high-end performance along two basic directions. On the one hand, the increase of clock frequency and the exploitation of instruction-level parallelism boosted the computing power of the single processor. On the other hand, many processors have been arranged into multiprocessors, multicomputers, and networks of geographically distributed machines. Nowadays, after years of continual improvement of single-core chips trying to increase instruction-level parallelism, hardware manufacturers have realized that the effort required for further improvements is no longer worth the benefits eventually achieved. Microprocessor vendors have shifted their attention to thread-level parallelism by designing chips with multiple internal cores, known as multicores (or chip multiprocessors). More generally, parallelism at multiple levels is now the driving force of computer design across all classes of computers, from small desktop workstations to large warehouse-scale computers.

2.2 Platforms

We briefly recap the review of existing parallel computing platforms from Hennessy and Patterson [73]. Following Flynn’s taxonomy [69], we can define two main classes of architectures supporting parallel computing:

• Single Instruction stream, Multiple Data streams (SIMD): the same instruction is executed by multiple processors on different data streams. SIMD computers support data-level parallelism by applying the same operations to multiple items of data in parallel.
• Multiple Instruction streams, Multiple Data streams (MIMD): each processor fetches its own instructions and operates on its own data, and it targets task-level parallelism. In general, MIMD is more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than SIMD.

We can further subdivide MIMD into two subclasses:
  – Tightly coupled MIMD architectures, which exploit thread-level parallelism since multiple cooperating threads operate in parallel on the same execution context;
  – Loosely coupled MIMD architectures, which exploit request-level parallelism, where many independent tasks can proceed in parallel “naturally” with little need for communication or synchronization.

Although it is a very common classification, this model is becoming more and more coarse, as many processors are nowadays “hybrids” of the classes above (e.g., GPU).

2.2.1 SIMD computers

The first use of SIMD instructions was in the 1970s with vector supercomputers such as the CDC Star-100 and the Texas Instruments ASC. Vector-processing architectures are now considered separate from SIMD machines: vector machines processed vectors one word at a time exploiting pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of a vector simultaneously [107]. A simple example of SIMD support is the Intel Streaming SIMD Extensions (SSE) [76] for the x86 architecture. Processors implementing SSE (with a dedicated unit) can perform simultaneous operations on multiple operands in a single register. For example, SSE instructions can simultaneously perform eight 16-bit operations on 128-bit registers. AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture (ISA), proposed by Intel in July 2013 and supported by Intel's Xeon Phi x200 (Knights Landing) processor [77]. Programs can pack eight double-precision or sixteen single-precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction, and four times that of SSE. Advantages of such an approach are almost negligible overhead and little hardware cost; however, these instructions are difficult to integrate into existing code, as this requires writing at the assembly level or with compiler intrinsics. One of the most popular platforms specifically targeting data-level parallelism is the use of Graphics Processing Units (GPUs) for general-purpose computing, known as General-Purpose computing on Graphics Processing Units (GPGPU). Moreover, using specific programming languages and frameworks (e.g., NVIDIA CUDA [102], OpenCL [85]) partially reduces the gap between high computational power and ease of programming, though it still remains a low-level parallel programming approach, since the user has to deal with close-to-metal aspects like memory allocation and data movement between the GPU and the host platform. Typical GPU architectures are not strictly SIMD architectures. For example, NVIDIA CUDA-enabled GPUs are based on multithreading, thus they support all types of parallelism; however, control hardware in these platforms is limited (e.g., no global thread synchronization), making GPGPU more suitable for data-level parallelism.
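To make the SSE example above concrete, the following minimal C++ sketch uses the SSE2 compiler intrinsics (thin wrappers over the corresponding instructions) to perform eight 16-bit additions with a single instruction; it assumes an x86 machine with SSE2 and a suitable compiler flag (e.g., -msse2). The array values are arbitrary.

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstdio>

int main() {
    // Two vectors of eight 16-bit integers, packed into 128-bit registers.
    alignas(16) int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(16) int16_t c[8];

    __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi16(va, vb);  // eight 16-bit additions at once
    _mm_store_si128(reinterpret_cast<__m128i*>(c), vc);

    for (int i = 0; i < 8; ++i) std::printf("%d ", c[i]);  // 11 22 33 ... 88
    std::printf("\n");
    return 0;
}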

Symmetric shared-memory multiprocessors

Thread-level parallelism implies the existence of multiple program counters, hence it is exploited primarily through MIMDs. Threads can also be used to support data-level parallelism, but some overhead is introduced at least by thread communication and synchronization. This overhead means that the grain size (i.e., the ratio of computation to the amount of communication) must be large enough to exploit thread-level parallelism efficiently. The most common MIMD computers are multiprocessors, defined as computers consisting of tightly coupled processors that share memory through a shared address space. Single-chip systems with multiple cores are known as multicores. Symmetric (shared-memory) multiprocessors (SMPs) typically feature small numbers of cores (nowadays from 12 to 24), where processors share a single centralized memory to which they all have equal access (Fig. 2.1). In multicore chips, the memory is effectively centralized, and all existing multicores are SMPs. SMP architectures are also sometimes called uniform memory access (UMA) multiprocessors, since all processors have a uniform latency from memory, even if the memory is organized into multiple banks. The alternative design approach consists of multiprocessors with physically distributed memory, called Distributed-Memory Systems (DSM). To support larger numbers of processors, memory must be distributed rather than centralized; otherwise, the memory system would not be able to support the bandwidth demands of processors without incurring excessively long access latency. Such architectures are known as nonuniform memory access (NUMA) architectures, since the access time depends on the location of a data word in memory.

Figure 2.1: Structure of an SMP.

Memory consistency model

The memory model, or memory consistency model, specifies the values that a read of a shared variable in a multithreaded program is allowed to return. The memory model affects programmability, performance and portability by constraining the transformations that any part of the system may perform. Moreover, any part of the system (hardware or software) that transforms the program must specify a memory model, and the models at the different system levels must be compatible [37, 115]. The most basic model is Sequential Consistency, in which all instructions executed by threads in a multithreaded program appear in a total order that is consistent with the program order on each thread [88]. Some programming languages offer sequential consistency in a multiprocessor environment: in C++11, all shared variables can be declared as atomic types with default memory ordering constraints; in Java, all shared variables can be marked as volatile.

For performance gains, modern CPUs often execute instructions out of order to fully utilize resources. Since the hardware enforces the integrity of the instruction semantics, this reordering cannot be noticed in a single-threaded execution; in a multithreaded execution, however, reordering may lead to unpredictable behaviors. Different CPU families have different memory models, and thus different rules concerning memory ordering; moreover, the compiler itself optimizes the code by reordering instructions. To guarantee Sequential Consistency, memory reordering must therefore be prevented, e.g., by means of lightweight synchronizations, fences (up to full fences), or acquire/release semantics. Sequential Consistency restricts many common compiler and hardware optimizations; to overcome the performance limitations of this memory model, hardware vendors and researchers have proposed several relaxed memory models, reported in [4]. In lock-free programming for multicore (or any symmetric multiprocessor), sequential consistency must be ensured by preventing memory reordering. Herlihy and Shavit [74] provide the following definition of lock-free programs: “in an infinite execution, infinitely often some method call finishes”. In other words, as long as the program is able to keep calling lock-free operations, the number of completed calls keeps increasing, and it is algorithmically impossible for the system to lock up during those operations. C/C++ and Java provide portable ways of writing lock-free programs with sequential consistency or even weaker consistency models. For the purpose of writing sequentially consistent multithreaded programs, C++ introduced the atomics standard library, which provides fine-grained atomic operations [37] that guarantee Sequential Consistency for data-race-free programs, hence also allowing lockless concurrent programming. Each atomic operation is indivisible with regard to any other atomic operation that involves the same object, and atomic objects are free of data races.
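The classic store-buffering litmus test below sketches what the default, sequentially consistent ordering of C++11 atomics guarantees: with std::memory_order_seq_cst (the default), the outcome in which neither thread observes the other's store is forbidden, whereas relaxed orderings would allow it. This is a minimal illustrative example, not code from the thesis.

#include <atomic>
#include <cassert>
#include <thread>

// With the default memory order (std::memory_order_seq_cst), all atomic
// operations appear in one total order consistent with each thread's
// program order, so the outcome r1 == 0 && r2 == 0 is impossible here.
std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() { x.store(1); r1 = y.load(); }
void t2() { y.store(1); r2 = x.load(); }

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // Under sequential consistency, at least one thread sees the other's store.
    assert(r1 == 1 || r2 == 1);
    return 0;
}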

Cache coherence and false sharing

SMP machines usually support the caching of both shared and private data, reducing the average access time as well as the required memory bandwidth. Unfortunately, caching shared data introduces a new problem: since each processor views memory through its individual cache, two processors could end up seeing two different values for the same location. This problem is generally referred to as the cache coherence problem, and several protocols have been designed to guarantee cache coherence. In cache-coherent SMP machines, false sharing is a subtle source of cache misses, which arises from the use of an invalidation-based coherence algorithm. False sharing occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. In a false sharing miss, the block is shared, but no word in the cache is actually shared, and the miss would not occur if the block size were a single word.
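A minimal way to observe false sharing is to let two threads update two independent counters that happen to share a cache line, and then pad them onto separate lines; the sketch below (assuming a typical 64-byte line size) usually shows a large timing gap between the two layouts.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two independent counters updated by two different threads. In SharedLine they
// fall into the same cache line, so every update invalidates the other core's
// copy (false sharing); in PaddedLines each counter occupies its own line.
struct SharedLine  { std::atomic<long> a{0}; std::atomic<long> b{0}; };
struct PaddedLines { alignas(64) std::atomic<long> a{0}; alignas(64) std::atomic<long> b{0}; };

template <typename Counters>
void bench(const char* label) {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&c] { for (long i = 0; i < 50000000L; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&c] { for (long i = 0; i < 50000000L; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%-34s %lld ms\n", label, static_cast<long long>(ms));
}

int main() {
    bench<SharedLine>("same cache line (false sharing):");
    bench<PaddedLines>("one cache line per counter:");
    return 0;
}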

2.2.2 Manycore processors

Manycore processors are specialized multicore processors designed to exploit a high degree of parallelism, containing a large number of simpler, independent processor cores; they are often used as hardware accelerators. A manycore processor contains at least tens of cores and usually distributed memory, which are connected (but physically separated) by an interconnect that has a communication latency of multiple clock cycles [110]. A multicore architecture equipped with hardware accelerators is a form of heterogeneous architecture. Comparing multicore to manycore processors, we can characterize them as follows:

• Multicore: a symmetric multiprocessing (SMP) architecture containing tightly coupled identical cores that (usually) share all memory and whose caches are kept coherent in hardware.
• Manycore: specialized multicore processors designed for a high degree of parallel processing, containing a large number of simpler, independent processor cores (e.g., tens up to thousands) with reduced cache coherency to increase performance. A manycore architecture may run a specific OS, as for the Intel MIC coprocessor Xeon Phi.

As the core count increases, hardware cache coherency is unlikely to be sustained [45]. Accelerators share some features: they are typically slower than multicore CPUs, their high performance is obtained through a high level of parallelism, data transfer from host to device is slower than the memory bandwidth of host processors, and data locality is needed to obtain good performance. In the following, we provide a description of some of these accelerators, regularly used in High-Performance Computing (HPC).

• GPUs: Graphics Processing Units include a large number of small processing cores (from hundreds to thousands) in an architecture optimized for highly parallel workloads, paired with dedicated high-performance memory. They are accelerators, used from a general-purpose CPU, that can deliver high performance for some classes of algorithms. Programming GPUs requires using the NVIDIA CUDA programming model or the OpenCL cross-platform programming model.
• Intel Xeon Phi: Xeon Phi is a brand name given to a series of massively parallel multicore co-processors designed and manufactured by Intel, targeted at HPC. An important component of the Intel Xeon Phi coprocessor's core is its Vector Processing Unit (VPU). In Knights Landing, the second-generation MIC architecture product from Intel, each core has two 512-bit vector units and supports AVX-512 SIMD instructions, specifically the Intel AVX-512 Foundational Instructions (AVX-512F) with Intel AVX-512 Conflict Detection Instructions (AVX-512CD), Intel AVX-512 Exponential and Reciprocal Instructions (AVX-512ER), and Intel AVX-512 Prefetch Instructions (AVX-512PF) [77]. The Xeon Phi co-processor family can be programmed with OpenMP, OpenCL, MPI and TBB using offloading directives, with Cilk/Cilk Plus, and with specialised versions of Intel's Fortran, C++ and math libraries. Differently from GPUs, which exist in different models with typically thousands of cores, the co-processor comprises up to sixty-one Intel cores that can run at most 4 hyper-threads per core and are connected by a high-performance on-die bidirectional interconnect [80]. The Intel Knights Landing co-processor belongs to the Xeon family and is Xeon Phi's successor, increasing the total number of cores from 61 to 64, still exploiting 4 hyper-threads per core. Knights Landing is made up of 36 tiles; each tile contains two CPU cores and two VPUs (vector processing units) per core (a total of four per tile). Unlike GPUs and previous Xeon Phi manycore cards, which functioned solely as co-processors, the new Knights Landing is more than just an accelerator: it is designed to self-boot and can run an operating system as the native processor.
• FPGA: Field Programmable Gate Arrays (FPGAs) are semiconductor devices based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. FPGAs can be reprogrammed to the desired application or functionality requirements after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC).

2.2.3 Distributed Systems, Clusters and Clouds

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. It can be composed of various hardware and software components, thus exploiting homogeneous and heterogeneous architectures. An important aspect of distributed computing architectures is the method used to communicate and coordinate work among concurrent processes. Through various message-passing protocols, processes may communicate directly with one another, typically in a master-slave relationship, operating to fulfill the same objective. This master-slave relationship reflects the typical execution model of Big Data analytics tools on distributed systems: the master process coordinates the execution of the slaves (typically called workers). Slaves may execute the same program or a part of the computation, and they communicate with each other and with the master by message passing. Master and slaves may also exchange data via serialization. From the implementation viewpoint, these architectures are typically programmed by exploiting the Message Passing Interface (MPI) [105], a language-independent communication protocol used for programming parallel computers, as well as a message-passing API that supports point-to-point and collective communication by means of directly callable routines (a minimal master-worker sketch in MPI is given at the end of this subsection). Many general-purpose programming languages have bindings to MPI's functionalities, among which: C, C++, Fortran, Java and Python. Moving to the Big Data world, tools are often implemented in Java in order to easily exploit facilities such as the Java RMI API [103], which performs remote method invocation, supporting direct transfer of serialized Java classes and distributed garbage collection. These models for communication will be further investigated in Section 2.3.3.
In contrast with shared-memory architectures, clusters look like individual computers connected by a network. Since each processor has its own address space, the memory of one processor cannot be accessed by another processor without the assistance of software protocols running on both processors. In such a design, message-passing protocols are used to communicate data among processors. Clusters are examples of loosely coupled MIMD machines. These large-scale systems are typically used with a model that assumes either massive numbers of independent requests or highly parallel, compute-intensive tasks. There are two classes of large-scale distributed systems: 1) private clouds, in particular multicore clusters, which are inter-networked (possibly heterogeneous) multicore devices; 2) public clouds, which are (physical or virtual) infrastructures offered by providers in the form of inter-networked clusters. In the most basic public cloud model, providers of IaaS (Infrastructure-as-a-Service) offer computers, physical or (more often) virtual machines, and other resources on-demand. Public IaaS clouds can be regarded as virtual multicore clusters. The public cloud model encompasses a pay-per-use business model: end users are not required to take care of hardware, power consumption, reliability, robustness, security, and the problems related to the deployment of a physical computing infrastructure.
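As a minimal illustration of the master-worker message-passing style just described, the following C++ sketch uses plain MPI point-to-point calls; the tags, the ranks and the trivial per-task computation are purely illustrative.

// Minimal master-worker sketch (illustrative): rank 0 sends one integer
// task to each worker and collects one integer result back.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      // master: dispatch tasks, gather results
        for (int w = 1; w < size; ++w)
            MPI_Send(&w, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        for (int w = 1; w < size; ++w) {
            int res;
            MPI_Recv(&res, 1, MPI_INT, w, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("result from worker %d: %d\n", w, res);
        }
    } else {                              // worker: receive, compute, reply
        int task;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int res = task * task;            // placeholder computation
        MPI_Send(&res, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}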

2.3 Parallel Programming Models

Shifting from sequential to parallel computing, a trend largely motivated by the advent of multicore platforms, does not always translate into greater CPU performance: multicores are small-scale but full-fledged parallel machines and they retain many of their usage problems. In particular, sequential code will get no performance benefit from them: a workstation equipped with a quad-core CPU but running sequential code is wasting 3/4 of its computational power. Developers are then facing the challenge of achieving a trade-off between performance and human productivity (total cost and time to solution) in developing and porting applications to multicore and parallel platforms in general. Therefore, effective parallel programming happens to be a key factor for efficient parallel computing, but efficiency is not the only issue faced by parallel programmers: writing parallel code that is portable across different platforms and maintainable is a task that programming models should also address.

2.3.1 Types of parallelism

Types of parallelism can be categorized into four main classes:

• Task Parallelism consists of running the same or different code (tasks) on different executors (cores, processors, etc.). Tasks are concretely performed by threads or processes, which may communicate with one another as they execute. Communication usually takes place to pass data from one thread to the next as part of a graph. Task parallelism does not necessarily concern stream parallelism, but there might be cases in which the computation of each single item in an input stream embeds an independent (thus parallel) task that can be efficiently exploited to speed up the application. The farm pattern is a typical representative of this class of patterns, as we will describe in the next sections.
• Data Parallelism is a method for parallelizing a single task by processing independent data elements in parallel. The flexibility of the technique relies upon stateless processing routines, implying that the data elements must be fully independent. Data Parallelism also supports Loop-level Parallelism, where successive iterations of a loop working on independent or read-only data are parallelized in different flows-of-control (according to the co-begin/co-end model) and concurrently executed.
• Stream Parallelism is a method for parallelizing the execution (aka filtering) of a stream of tasks by segmenting each task into a series of sequential1 or parallel stages. This method can also be applied when there exists a total or partial order, respectively, in a computation, preventing the use of data or task parallelism. This might also come from the successive availability of input data along time (e.g., data flowing from a device). By processing data elements in order, local state may be either maintained in each stage or distributed (replicated, scattered, etc.) along streams. Parallelism is achieved by running each stage simultaneously on subsequent or independent data elements.
• DataFlow Parallelism is a method for modeling a (parallel) program as a directed graph where operations are represented by nodes and edges model data dependencies. Nodes in this graph represent functional units of computation over the data items (tokens) flowing on edges. In contrast with the procedural (imperative) programming model, a DataFlow program is described as a set of operations and connections among them, defined as a set of input and output edges. Operations execute as soon as all their input edges have an incoming token available, and operations without a direct dependence can run in parallel. This model, formalized by Kahn [83], is one of the main building blocks of the model proposed in this thesis. A further description is provided in subsequent sections.

Since Data Parallelism and Dataflow Parallelism are important aspects in this work, we explore them further in the next two paragraphs.

1In the case of total sequential stages, the method is also known as Pipeline Parallelism.


Figure 2.2: A general representation of a Map data parallel pattern. Input is partitioned according to the number of workers. Business logic function is replicated among each worker.

Data parallelism

A data-parallel computation performs the same operation on different items of a given data structure at the same time. Opposite to task parallelism, which emphasizes the parallel nature of the computation, data parallelism stresses the parallel nature of the data2. Formally, a data-parallel computation is characterized by data-structure partitioning and function replication: a common partitioning approach divides the input data between the available processors. Depending on the algorithm, there might be cases in which data dependencies exist among the partitioned subtasks: many data-parallel programs may suffer from bad performance and poor scalability because of a high number of data dependencies or a low amount of inherent parallelism. Here we present two widely used instances of data-parallel patterns, namely the map and the reduce patterns, which are also the patterns most often used in Big Data scenarios. Other data-parallel patterns exist, which basically permit applying higher-order functions to all the elements of a data structure. Among them we can mention the fold pattern and the scan (or prefix sum) pattern. There is also the stencil pattern, which is a generalization of the map pattern: under a functional perspective the two patterns are similar, but the stencil encompasses all those situations that require data exchange among workers [12].

Map

The map pattern is a straightforward case of the data-parallel paradigm: given a function f that expresses an application's behavior and a data collection X of known size (e.g., a vector with n elements), a map pattern applies the function f to all the elements xi ∈ X:

yi = f(xi) ∀i = 0, . . . , n − 1 .

Each element yi of the resulting output vector Y is the result of a parallel computation. A pictorial representation of the map pattern is presented in Fig. 2.2. Notable examples that naturally fit this paradigm include some vector operations (scalar-vector multiplication, vector sum, etc.), matrix-matrix multiplication (in which the result matrix is partitioned and the input matrices are replicated), the Mandelbrot set calculation, and many others.
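To make the partitioning/replication idea of Fig. 2.2 concrete, here is a minimal C++ sketch (not an optimized implementation) that splits the input vector into contiguous chunks and lets a few threads apply the same user function f, so that y[i] = f(x[i]) for every index; the helper name par_map and the use of std::thread are illustrative assumptions.

#include <algorithm>
#include <thread>
#include <vector>

// Applies f element-wise: y[i] = f(x[i]), with y pre-sized to x.size().
// Each of the nw workers owns one contiguous chunk of the index space.
template <typename F>
void par_map(const std::vector<double>& x, std::vector<double>& y, F f, unsigned nw) {
    std::vector<std::thread> workers;
    std::size_t chunk = (x.size() + nw - 1) / nw;
    for (unsigned w = 0; w < nw; ++w) {
        std::size_t lo = w * chunk;
        std::size_t hi = std::min(x.size(), lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) y[i] = f(x[i]);
        });
    }
    for (auto& t : workers) t.join();
}

For instance, par_map(x, y, [](double v) { return v * v; }, 4) squares every element of x using four workers.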

2We denote this difference by using a zero-based numbering when indexing data structures in data parallel patterns.


Figure 2.3: A general Reduce pattern: the reduction is performed in parallel by organizing workers in a tree-like structure: each worker computes the function ⊕ on the results communicated by its child workers. The root worker delivers the reduce result.

Reduce

A reduce pattern applies an associative binary function (⊕) to all the elements of a data structure. Given a vector X of length n, the reduce pattern computes the following result:

x0 ⊕ x1 ⊕ ... ⊕ xn−1 .

In Figure 2.3, we report an example of a possible implementation of a reduce pattern, in which the reduce operator is applied over the assigned partition of the vector in parallel, and each partial result is then used to compute a global reduce. In a tree-like organization, leaf nodes compute their local reduce and then propagate results to parent nodes; the root node delivers the reduce result. While the reduce pattern is straightforward for associative and commutative operators (e.g., addition and multiplication of real numbers), this is no longer true for floating-point operators. Floating-point operations, as defined in the IEEE-754 standard, are not associative [1]. Studies demonstrated how, on massively multi-threaded systems, the non-deterministic order in which machine floating-point operations are interleaved, combined with the fact that intermediate values have to be rounded or truncated to fit in the available precision, leads to non-deterministic numerical error propagation [129].
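The effect is easy to observe even on three constants; under default IEEE-754 double-precision semantics, the following tiny C++ check reports two different sums depending on how the additions are associated (a toy illustration, not a reduction implementation).

#include <cstdio>

int main() {
    double a = 0.1, b = 0.2, c = 0.3;
    double left  = (a + b) + c;   // typically 0.6000000000000001
    double right = a + (b + c);   // typically 0.6
    std::printf("%.17g vs %.17g -> %s\n", left, right,
                left == right ? "equal" : "different");
    return 0;
}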

Figure 2.4: A Map+Reduce pattern.
The composition of a map step and a reduce step generates the Map+Reduce pattern, in which a function is first applied, in parallel, to all elements of a data structure, and then the results of the map phase are merged using some reduction function (see Figure 2.4). This is an important example of the composability allowed by parallel patterns, which permits the definition of a whole class of algorithms for data-parallel computation. The functional composition of these two parallel patterns is the basis of Google's MapReduce distributed programming model and framework [61], which exploits key-value pairs to compute problems that can be parallelized by mapping a function over a given dataset and then combining the results. It is likely the largest pattern framework in use, and it has spun off different open-source implementations, such as Hadoop [130] and Phoenix [109]. Both the Map and Reduce patterns are exploited in PiCo and they are further described in Chapter 5.

2.3.2 The Dataflow Model

Dataflow Process Networks are a special case of Kahn Process Networks, a model of computation that describes a program as a set of concurrent processes communicating with each other via FIFO channels, where reads are blocking and writes are non-blocking [83]. In a Dataflow process network, a set of firing rules is associated with each process, called an actor. Processing then consists of “repeated firings of actors”, where an actor represents a functional unit of computation over tokens. For an actor, to be functional means that firings have no side effects (thus actors are stateless) and that the output tokens are pure functions of the input tokens. The model can also be extended to allow stateful actors. A Dataflow network can be executed mainly according to two classes of execution models, namely process-based and scheduling-based; other models are flavors of these two. The process-based model is straightforward: each actor is represented by a process that communicates via FIFO channels. In the scheduling-based model, also known as dynamic scheduling, a scheduler tracks the availability of tokens in input to actors and schedules enabled actors for execution; the atomic unit being scheduled is referred to as a task and represents the computation performed by an actor over a single set of input tokens.

Actors

A Dataflow actor consumes input tokens when it fires and then produces output tokens; thus it repeatedly fires on tokens arriving from one or more streams. The function mapping input to output tokens is called the kernel of an actor.3 A firing rule defines when an actor can fire. Each rule defines which tokens have to be available for the actor to fire. Multiple rules can be combined to program arbitrarily complex firing logics (e.g., the If node).

Input channels

The kernel function takes as input one or more tokens from one or more input channels when a firing rule is activated. The basic model can be extended to allow for testing input channels for emptiness, to express arbitrary stream consuming policies (e.g., gathering from any channel).

Output channels

The kernel function places one or more tokens into one or more output channels when a firing rule is activated. Each output token produced by a firing can be replicated and placed onto each output channel (i.e., broadcasting) or sent to specific channels, in order to model arbitrary producing policies (e.g., switch, scatter).

Stateful actors

Actors with state can be considered as objects (instead of functions) with methods used to modify the object's internal state. Stateful actors are an extension that allows side effects over local (i.e., internal to each actor) states. It was shown by Lee and Parks [91] that stateful actors can be emulated in the stateless Dataflow model by adding an extra feedback channel, carrying the value of the state to the next execution of the kernel function on the next element of the stream, and by defining appropriate firing rules. In the next sections, we provide a description of various other low-level and high-level programming models and techniques.
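As a purely didactic illustration of the firing mechanism (not of a full process network), the C++ sketch below models one two-input actor: it fires, i.e., applies its kernel, only when a token is available on both input channels, producing one output token per firing; all names are invented for the example.

#include <functional>
#include <iostream>
#include <queue>

// A two-input, one-output actor: the firing rule requires one token per input.
struct Actor {
    std::queue<int> in1, in2, out;          // FIFO channels
    std::function<int(int, int)> kernel;    // pure function of the input tokens

    // Try to fire once; returns true if the firing rule was satisfied.
    bool try_fire() {
        if (in1.empty() || in2.empty()) return false;   // rule not met
        int a = in1.front(); in1.pop();
        int b = in2.front(); in2.pop();
        out.push(kernel(a, b));                         // emit the output token
        return true;
    }
};

int main() {
    Actor add;
    add.kernel = [](int a, int b) { return a + b; };
    add.in1.push(1);
    add.in2.push(2);
    while (add.try_fire()) {}                // trivial scheduler: fire while enabled
    std::cout << add.out.front() << "\n";    // prints 3
    return 0;
}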

2.3.3 Low-level approaches

Typically, low-level approaches provide the programmer only with primitives for flow-of-control management (creation, destruction), synchronization, and data sharing, the latter usually accomplished in critical regions accessed in mutual exclusion (mutex). As an example, the POSIX thread library can be used for this purpose. Parallel programming languages are usually extensions of well-established sequential languages, such as C/C++, Java or Fortran, where the coordination of multiple execution flows is either obtained by means of external libraries linked at compile time to the application source code (e.g., Pthreads, OpenMP, MPI), or enriched with specific constructs useful to orchestrate the parallel computation; consider, for instance, the Java Concurrency API and the C++11 standard published in 2011, which introduced multithreaded programming. The well-known report authored by H. Boehm [36] provides specific arguments that a pure library approach, in which the compiler is designed independently of threading issues, cannot guarantee correctness of the resulting code. It illustrates that there are very simple cases (concurrent modification, adjacent data rewriting and register promotion) in which a pure library-based approach seems incapable of expressing an efficient parallel algorithm.

3The Dataflow Process Network model also seamlessly comprehends the Macro Dataflow parallel execution model, in which each process executes arbitrary code. Conversely, an actor's code in a classical Dataflow architecture model is typically a single machine instruction. In the following, we consider Dataflow and Macro Dataflow to be equivalent models.
Programming complex parallel applications in this way is certainly hard; tuning them for performance is often even harder due to the non-trivial effects induced by memory fences (used to implement mutexes) on data replicated in the cores' caches. Indeed, memory fences are one of the key sources of performance degradation in communication-intensive (e.g., streaming) parallel applications. Avoiding memory fences means not only avoiding locks but also any kind of atomic operation in memory (e.g., Compare-And-Swap, Fetch-and-Add). While several assessed fence-free solutions exist for asynchronous symmetric communications,4 these results cannot be easily extended to asynchronous asymmetric communications5, which are necessary to support arbitrary streaming networks. This ability is one of the core features of FastFlow (Section 2.4.1). A first way to ease the programmer's task and improve program efficiency consists in raising the level of abstraction of concurrency management primitives. As an example, threads might be abstracted out into higher-level entities that can be pooled and scheduled in user space, possibly according to specific strategies to minimize cache flushing or maximize load balancing across cores. Synchronization primitives can also be abstracted out and associated with semantically meaningful points of the code, such as function calls and returns, loops, etc. Intel Threading Building Blocks (TBB) [78], OpenMP [106], and Cilk [47] all provide this kind of abstraction, each in its own way. This kind of abstraction significantly simplifies the hand-coding of applications, but it is still too low-level to effectively (semi-)automatize the optimization of the parallel code. In the following paragraphs, we describe in a bit more detail a list of low-level parallel programming frameworks and libraries.
POSIX Threads (or Pthreads) [40] are one of the most famous low-level parallel programming APIs for shared-memory environments, defined by the standard POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995). They are present in every Unix-like operating system (Linux, Solaris, Mac OS X, etc.) and other POSIX systems, giving access to OS-level threads and synchronization mechanisms. Since Pthreads is a C library, it can be used in C++ programs as well, though many improvements in the C++11 standard have been aimed at facilitating shared-memory parallel programming, from low-level to high-level constructs. We remark that ISO C++ is completely independent from POSIX and is provided also on non-POSIX platforms.
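For reference, the level of abstraction offered by Pthreads is visible in the minimal fragment below, where thread creation, argument passing and joining are all explicit; the placeholder computation is only illustrative. It must be compiled and linked with the POSIX thread library (e.g., with -pthread).

#include <pthread.h>
#include <cstdio>

static void* worker(void* arg) {
    int id = *static_cast<int*>(arg);
    std::printf("hello from thread %d\n", id);   // placeholder work
    return nullptr;
}

int main() {
    const int NT = 4;
    pthread_t tid[NT];
    int ids[NT];
    for (int i = 0; i < NT; ++i) {
        ids[i] = i;
        pthread_create(&tid[i], nullptr, worker, &ids[i]);  // explicit creation
    }
    for (int i = 0; i < NT; ++i)
        pthread_join(tid[i], nullptr);                      // explicit join
    return 0;
}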
Mainly targeted at distributed architectures, MPI offers specific implementations for almost any high-performance interconnection network. At the same time, shared-memory implementations exist that allow the use of MPI even on NUMA and multi-processor systems. MPI makes it possible to manage synchronization and communication among a set of processes, and provides mechanisms to deploy a virtual topology of the system upon which the program is executing. These features, supported by a rich set of capabilities and functions, clearly require high programming and networking skills. MPI has long been the lingua franca of HPC, supporting the majority of all supercomputing work scientists and engineers have relied upon for the past two decades. Nonetheless, MPI is at a low level of abstraction for application writers: using MPI means programming at the transport layer, where every exchange of data has to be implemented through sends and receives, data structures must be manually decomposed across processors, and every update of the data structure needs to be recast into a flurry of messages, synchronizations, and data exchanges.

4Single-Producer-Single-Consumer (SPSC) queues [89].
5Multiple-Producer-Multiple-Consumer (MPMC) queues.
OpenMP [106] is considered by many the de facto standard API for shared-memory parallel programming. OpenMP is an extension that can be supported by C, C++ and Fortran compilers and defines an “accelerator-style” programming model, where the main program runs sequentially while specific portions of code are accelerated, i.e., run in parallel, using special preprocessor instructions known as pragmas. Compilers that do not support these pragmas can ignore them, making an OpenMP program compilable and runnable on every system with a generic sequential compiler (a minimal usage sketch is shown at the end of this subsection). While Pthreads are low-level and require the programmer to specify every detail of the behavior of each thread, OpenMP allows the programmer to simply state which blocks of code should be executed in parallel, leaving to the compiler and runtime system the responsibility of determining the details of the thread behavior. This feature makes OpenMP programs simpler to code, with a risk of unpredictable performance that strongly depends on compiler implementations and optimizations. For instance, OpenMP limitations have been demonstrated in [111], such as its inability to perform reductions on complex data types, as well as scheduling issues on applications with irregular computations (such as the Mandelbrot benchmark).
OpenCL [85] is an API designed to write parallel programs that execute across heterogeneous architectures, and it allows users to exploit GPUs for general-purpose tasks that can be parallelized. It is implemented by different hardware vendors such as Intel, AMD and NVIDIA, making it highly portable and allowing OpenCL code to run on different hardware accelerators. OpenCL also allows the implementation of applications on FPGAs, allowing software programmers to write hardware-accelerated kernel functions in OpenCL C, an ANSI C-based language with additional OpenCL constructs. OpenCL represents an extension to C/C++ but must be considered a low-level language, focusing on low-level feature management rather than on high-level parallelism exploitation patterns. It has the capability to revert to the CPU for execution when there is no GPU in the system, and its portability makes it suitable for hybrid (CPU/GPU) or cloud-based environments.
Java provides multi-threading and RPC support (the Java RMI API) that can be used to write parallel applications for both shared-memory and distributed-memory architectures [103]. The original implementation depends on Java Virtual Machine (JVM) class representation mechanisms and only supports making calls from one JVM to another. However, while Java is a high-level sequential language, its parallel support is provided in a low-level fashion, possibly lower than OpenMP.
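The sketch announced above shows the OpenMP pragma-based style: the sequential loop is left untouched and a single directive asks the compiler and runtime to parallelize it and combine the per-thread partial sums; without OpenMP support, the pragma is simply ignored and the loop runs sequentially.

#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;

    // Parallelize the loop and combine the per-thread partial sums.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 0.5 * i;

    std::printf("sum = %f\n", sum);
    return 0;
}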

2.3.4 High-level approaches

Parallel programming has always been related to HPC environments, where programmers write parallel code by means of low-level libraries that give complete control over the parallel application, allowing them to manually optimize the code in order to best exploit the parallel architecture. This programming methodology has become unsuitable with the fast move to heterogeneous architectures, which encompass hardware accelerators, distributed shared-memory systems and cloud infrastructures, highlighting the need for proper tools to easily implement parallel applications. It is widely acknowledged that the main problem to be addressed by a parallel programming model is portability: not only the ability to compile and execute the same code on different architectures and obtain the same top performance [58], but also, and even more complex, the challenge of performance portability, that is, implementing applications that scale on different architectures. A high-level approach to parallel programming is a better way to go if we want to address this problem, so that programmers can build parallel applications and be sure that they will perform reasonably well on the wide choice of parallel architectures available today [114]. Attempts to raise the level of abstraction and reduce the programming effort date back at least three decades. Notable results have been achieved by the algorithmic skeleton approach [48] (aka pattern-based parallel programming), which has gained popularity after being revamped by several successful parallel programming frameworks. Despite some criticisms, mostly related to the limited set of patterns, which might not be sufficient to allow a decent parallelization of most algorithms, the success of algorithmic skeletons has been determined by the numerous advantages they offer compared to traditional parallel programming frameworks.

2.3.5 Skeleton-based approaches

Algorithmic skeletons were initially proposed by Cole [49] to provide predefined parallel computation and communication patterns, hiding parallelism management from the user. Algorithmic skeletons capture common parallel programming paradigms (e.g., Map+Reduce, ForAll, Divide&Conquer, etc.) and make them available to the programmer as high-level programming constructs equipped with well-defined functional and extra-functional semantics [11]. Ideally, algorithmic skeletons address the difficulties of parallel programming (i.e., concurrency exploitation, orchestration, mapping, tuning) by moving them from the application design to the development tools, by capturing and abstracting common paradigms of parallel programming and providing them with efficient implementations. This idea can be considered the core of structured parallel programming: expressing parallel code as a composition of simple “building blocks”.

Literature review of skeleton-based approaches

Many skeletons have been proposed in the literature in the last two decades, covering many different usage schemas of the three classes of parallelism, on top of both the message-passing [50, 57, 113, 9, 16, 35, 108, 13, 61, 10] and shared-memory [7, 78] programming models. Here, we discuss some skeleton libraries that target C/C++ as their execution language and mostly focus on parallel programming for multi-core architectures. For a broader survey of algorithmic skeletons, we refer to González-Vélez and Leyton's survey [71].
P3L is one of the earliest proposals for pattern-based parallel programming [53]. P3L is a skeleton-based coordination language that manages the parallel or sequential execution of C code. It comes with a proper compiler for the language and uses implementation templates to compile the code onto a target architecture. P3L provides patterns for both stream parallelism and data parallelism.
SKELib [54] builds upon the contributions of P3L by inheriting, among other features, the template system. It differs from P3L because a coordination language is no longer used, and skeletons are provided as a C library. It only offers stream-based skeletons (namely the farm and pipe patterns).
SkeTo [95] is a C++ library based on MPI that provides skeletons for distributed data structures, such as arrays, matrices, and trees. The current version is based on C++ expression templates, used to represent part of an expression where the template represents the operation and the parameters represent the operands to which the operation applies.

SkePU [66] is an open-source skeleton programming framework for multi-core CPUs and multi-GPU systems. It is a C++ template library with data-parallel and task-parallel skeletons (map, reduce, map-reduce, farm) that also provides generic container types and support for execution on multi-GPU systems, both with CUDA and OpenCL.
SkelCL [116] is a skeleton library targeting OpenCL. It allows the declaration of skeleton-based applications hiding all the low-level details of OpenCL. The set of skeletons is currently limited to data-parallel patterns: map, zip, reduce and scan, and it is unclear whether skeleton nesting is allowed. A limitation might come from the library's target, which is restricted to the OpenCL language: it likely benefits from the possibility to run OpenCL code both on multi-core and on many-core architectures, but the window for tunings and optimizations is thus quite restricted.
Muesli, the Muenster Skeleton Library [46], is a C++ template library that supports shared-memory multi-processor and distributed architectures using MPI and OpenMP as underlying parallel engines. It provides data-parallel patterns such as map, fold (i.e., reduce), scan (i.e., prefix sum), and distributed data structures like distributed arrays, matrices and sparse matrices. Skeleton functions are passed to distributed objects as pointers, since each distributed object has skeleton functions as internal members of the class itself. The programmer must explicitly indicate whether GPUs, if available, are to be used for data-parallel skeletons.
Intel Threading Building Blocks (TBB) [78] defines a set of high-level parallel patterns that permit exploiting parallelism independently from the underlying platform details and threading mechanisms. It targets shared-memory multi-core architectures, and exposes parallel patterns for exploiting both stream parallelism and data parallelism. Among them, the parallel_for and parallel_for_each methods may be used to parallelize independent invocations of the function body of a for loop, whose number of iterations is known in advance. C++11 lambda expressions can be used as arguments to these calls, so that the loop body function can be described as part of the call, rather than being separately declared. The parallel_for uses a divide-and-conquer approach, where a range [0, num_iter) is split into sub-ranges and each sub-range r can be processed as a separate task using a serial for loop (a usage sketch is given at the end of this review).
GrPPI [62] is a generic high-level pattern interface for stream-based C++ applications. Thanks to its high-level C++ API, this interface allows users to easily expose parallelism in sequential applications using already existing parallel frameworks, such as C++ threads, OpenMP and Intel TBB. It is implemented using C++ template meta-programming techniques to provide interfaces for a generic, reusable set of parallel patterns without incurring runtime overheads. GrPPI targets the following stream-parallel processing patterns: Pipeline, Farm, Filter and Stream-Reduce. The proposed interfaces are implemented by leveraging C++11 threads, OpenMP, and the pattern-based parallel framework Intel TBB. These patterns are parametrized by user-defined lambdas and they are nestable.
FastFlow [15] is a parallel programming framework originally designed to support streaming applications on cache-coherent multicore platforms. It will be described in more detail in Section 2.4.1, since it is part of the PiCo runtime.
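As announced in the TBB paragraph above, a parallel_for over a blocked_range with a C++11 lambda typically looks like the fragment below; header names follow common TBB usage and the loop body is a placeholder, so this should be read as a sketch rather than as a complete tutorial.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> v(1000, 1.0);

    // The range [0, v.size()) is recursively split into sub-ranges;
    // each sub-range is processed by a serial loop inside a separate task.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, v.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= 2.0;
        });
    return 0;
}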

2.3.6 Skeletons for stream parallelism

A stream-parallel program can be naturally represented as a graph of independent stages (kernels or filters) which communicate over data channels. Conceptually, a streaming computation represents a sequence of transformations on the data streams in the program. Parallelism is achieved by running each stage of the graph simultaneously on subsequent or independent data elements. Several skeletons exist that support stream-parallel programming.
The Pipeline skeleton is one of the most widely known patterns (see Fig. 2.5). Parallelism is achieved by running each stage simultaneously on subsequent data elements, with the pipeline's throughput being limited by the throughput of the slowest stage. It is typically used to model computations expressed in stages, and in the general case a pipeline has at least two stages. Given a sequence x1, . . . , xk of input tasks and the simplest form of a pipeline with three stages, the computation on each single task xi is expressed as the composition of three functions f, z and g, where the second stage (function z) executes on the result of the first stage, z(f(xi)), and the third stage applies the function g to the output of the second stage: g(z(f(xi))).

pi = f(xi)    qi = z(pi)    ri = g(qi)

Figure 2.5: A general Pipeline pattern with three stages.

The parallelization is obtained by concurrently executing all the stages on different consecutive items of the input stream. In its general form, a pipeline with stages s1, . . . , sm computes the output stream

sm(sm−1(. . . s2(s1(xk)) ...)), . . . , sm(sm−1(. . . s2(s1(x1)) ...)) .

The sequential code has to be described in terms of function composition, where the output of each stage is sent to the next one, respecting the function ordering. The semantics of the pipeline paradigm ensures that all stages execute in parallel.
The Farm skeleton models functional replication and consists in running multiple independent stages in parallel, each operating on different tasks of the input stream; thus it is often used to model embarrassingly parallel computations. The Farm skeleton can be better understood as a three-stage pipeline (emitter, workers, collector; see Fig. 2.6). The emitter dispatches stream items to a set of workers, which independently elaborate different items. The output of the workers is then gathered by the collector into a single stream. More complex combinations of both patterns are possible, such as a pipeline of farms, where each stage of the pipeline is, in fact, a farm (as in Figure 2.7).

Figure 2.6: A simple Farm pattern with an optional collector stage, whose stroke is dashed. Background light-gray boxes show that a farm pattern can be embedded into a three-stage pipeline.

Figure 2.7: A combination of pipeline and farm patterns, where each stage of a two-stage pipeline is a farm. In this figure, subscripts in the farm components are used to distinguish the two farms.

The Loop skeleton (also known as feedback) provides a way to generate cycles in a stream graph. By using a feedback channel, it is possible to route results back in order to implement iterative computations. This skeleton is typically used together with the farm skeleton to model recursive and Divide&Conquer computations, in which an input task is recursively sub-divided into sub-tasks until a condition is met; then each sub-task is executed and the results are merged. For instance, in a Farm pattern as in Fig. 2.6 we can connect the Collector to the Emitter with a feedback channel. In general, it is possible to connect Farm, Pipeline and feedback channels to build more complex networks.
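To give an idea of what these stream skeletons save the programmer from, the following plain C++ sketch wires a two-stage pipeline by hand: each stage is a thread consuming from an input channel and producing to an output channel, with a sentinel value marking the end of the stream. The tiny blocking queue and the termination protocol are deliberate simplifications of what a Pipeline skeleton provides out of the box.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A very small blocking queue used as the channel between stages.
template <typename T>
class Channel {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !q.empty(); });
        T v = q.front(); q.pop();
        return v;
    }
};

int main() {
    Channel<int> c1, c2;
    const int EOS = -1;                       // end-of-stream marker

    std::thread s1([&] {                      // stage 1: f(x) = x * x
        for (int x; (x = c1.pop()) != EOS; ) c2.push(x * x);
        c2.push(EOS);
    });
    std::thread s2([&] {                      // stage 2: g(y) = print y
        for (int y; (y = c2.pop()) != EOS; ) std::cout << y << "\n";
    });

    for (int i = 0; i < 5; ++i) c1.push(i);   // produce the input stream
    c1.push(EOS);
    s1.join(); s2.join();
    return 0;
}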

2.4 Programming multicore clusters

Programming tools and frameworks are needed to efficiently target architectures hosting inter-networked, possibly heterogeneous, multicore devices, which appear to be the reference architecture ferrying programmers from the mainly sequential to the mainly parallel programming era [29]. Shared-memory multicores and clusters/networks of processing elements require quite different techniques and tools to support efficient parallelism exploitation. The de facto standard tools in both cases are OpenMP [106] and MPI [105], used either alone or in conjunction. Despite being efficient on some classes of applications, OpenMP and MPI share a common set of problems: poor separation of concerns between application and system aspects, a rather low level of abstraction presented to the application programmer, and poor support for really fine-grained applications. At the moment, it is not clear whether the mixed MPI/OpenMP programming model always offers the most effective mechanisms for programming clusters of SMP systems [41]. Furthermore, when directly using communication libraries such as MPI, the abstraction level is rather low and the programmer has to think about decomposing the problem, integrating the partial solutions, and bothering with communication problems such as deadlocks and starvation. Therefore, we advocate the use of high-level parallel programming frameworks targeting hybrid multicore and distributed platforms as a vehicle for building efficient parallel software, possibly derived semi-automatically from sequential code, featuring performance portability over heterogeneous parallel platforms.

2.4.1 FastFlow

FastFlow [56] is an open-source, structured parallel programming framework supporting highly efficient stream-parallel computation while targeting shared-memory multicores. Its efficiency mainly comes from the optimized implementation of the base communication mechanisms and from its layered design. It provides a set of ready-to-use, parametric algorithmic skeletons modeling the most common parallelism exploitation patterns, which may be freely nested to model more complex parallelism exploitation patterns. It is realized as a C++ pattern-based parallel programming framework aimed at simplifying the development of applications for (shared-memory) multi-core and GPGPU platforms. The key vision of FastFlow is that ease of development and runtime efficiency can both be achieved by raising the abstraction level of the design phase. FastFlow provides a set of algorithmic skeletons addressing both stream parallelism (with the farm and pipeline patterns) and data parallelism (with stencil, map and reduce patterns), and their arbitrary nesting and composition [19]. Map, reduce and stencil patterns can be run both on multi-cores and offloaded onto GPUs. In the latter case, the user code can include GPU-specific statements (i.e., CUDA or OpenCL statements). Leveraging the farm skeleton, FastFlow exposes a ParallelFor pattern [55], where chunks of the iterations of a loop having the form for(idx=start; idx<stop; idx+=step) can be computed in parallel by the farm workers. The lowest layer of the FastFlow software stack (see Fig. 2.8) provides two basic concepts:

• process-component, i.e., a control flow realized with POSIX threads and processes, for multicore and distributed platforms respectively.
• 1-1 channel, i.e., a communication channel between two components, realized with wait-free single-producer/single-consumer queues (FF-SPSC) [14] or zero-copy ZeroMQ channels [134], for multicore and distributed platforms, respectively.

[Figure 2.8 depicts the FastFlow software stack, from top to bottom: efficient and portable parallel applications; high-level patterns (parallel_for, parallel_forReduce, ...); core patterns (pipeline, farm, feedback); building blocks (queues, ff_node, ...); supported transports and back-ends (TCP/IP, CUDA, OpenCL, IB/OFED); target platforms (multicore and many-core platforms, clusters of multicore + many-core).]
Figure 2.8: Layered FastFlow design.

Both realizations of the 1-1 channel are at the state of the art in their respective classes in terms of latency and bandwidth. As an example, FF-SPSC exhibits a latency down to 10 nanoseconds per message on a standard Intel Xeon @2.0GHz [14].

class ff_node {
protected:
    virtual bool push(void* data) { return qout->push(data); }
    virtual bool pop(void** data) { return qin->pop(data); }
public:
    virtual void* svc(void* task) = 0;
    virtual int svc_init() { return 0; }
    virtual void svc_end() {}
    ...
private:
    SPSC* qin;
    SPSC* qout;
};
Listing 2.1: FastFlow ff_node class schema.

Above this mechanism, the second layer — called arbitrary streaming networks — further generalizes the two concepts, providing:

• FastFlow node, i.e., the basic unit of parallelism, typically identified with a node in a streaming network. It is used to encapsulate sequential portions of code implementing functions, as well as high-level parallel patterns such as pipelines and farms. From the implementation viewpoint, the ff_node C++ class realizes a node in the shared-memory scenario and the ff_dnode class extends it in the distributed-memory setting (see Listing 2.1).
• Collective channel, i.e., a collective communication channel, either among ff_nodes or among many ff_dnodes.

Finally, the third layer provides the farm, pipeline and other parallel patterns as C++ classes. Each ff_node is used to run a concurrent activity in a component, and it has two associated channels: one used to receive input data (pointers) to be processed and one to deliver the (pointers to the) computed results. Representing communication and synchronization as a channel ensures that synchronization is tied to communication and allows layers of abstraction at higher levels to compose parallel programs where synchronization is implicit. The svc method encapsulates the computation to be performed on each input datum to obtain the output result, while svc_init and svc_end are executed when the application is started and before it is terminated, respectively. The programmer must provide the svc method in order to instantiate an ff_node, and may override svc_init and svc_end when needed.
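Based solely on the ff_node schema of Listing 2.1, a sequential stage could be written as in the sketch below. The exact headers and the classes used to compose such nodes into pipelines or farms vary across FastFlow versions, so this should be read as an illustration of the svc/svc_init/svc_end protocol rather than as canonical FastFlow code.

// Illustrative stage that doubles the integer it receives (cf. Listing 2.1).
struct Doubler : ff_node {
    int svc_init() override {
        // called once, before the first task is processed
        return 0;
    }
    void* svc(void* task) override {
        int* v = static_cast<int*>(task);   // tasks travel as pointers
        *v *= 2;
        return v;                           // forwarded to the next stage
    }
    void svc_end() override {
        // called once, after the stream is exhausted
    }
};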

Distributed FastFlow

FastFlow also provides an implementation running on distributed systems, based on the ZeroMQ library. Briefly, ZeroMQ is an LGPL open-source communication library [134] providing the user with a socket layer that carries whole messages across various transports: inter-thread communication, inter-process communication, TCP/IP, and multicast sockets. ZeroMQ offers an asynchronous communication model, which allows a quick construction of complex asynchronous message-passing networks, with reasonable performance. An ff_dnode (distributed ff_node) provides an external channel specialized to provide different patterns of communication. The set of communication collectives allows the exchange of messages among a set of distributed nodes, using well-known predefined patterns. The semantics of each communication pattern currently implemented is summarized in Table 2.1. The current version (see Fig. 2.8), which supports distributed platforms, allows many graphs of ff_nodes to be connected by way of ff_dnodes (which support network collective channels), thus providing a useful abstraction for effective programming of hybrid multicore and distributed platforms.

Pattern      Description
unicast      unidirectional point-to-point communication between two peers
broadcast    sends the same input data to all connected peers
scatter      sends different parts of the input data, typically partitions, to all connected peers
onDemand     the input data is sent to one of the connected peers, the choice of which is taken at runtime on the basis of the actual workload
fromAll      aka all-gather; collects different parts of the data from all connected peers, combining them in a single data item
fromAny      collects one data item from one of the connected peers

Table 2.1: Collective communication patterns among ff_dnodes.

The FastFlow programming model is based on the streaming of pointers, which are used as synchronization tokens. This abstraction is kept also in the distributed version (i.e., across network channels) by way of two auxiliary methods provided by ff_dnode for data marshalling and unmarshalling according to the ZeroMQ specifics. These (virtual) methods provide the programmer with the tools to serialize and de-serialize data flowing across ff_dnodes. The hand-made serialization slightly increases the coding complexity (e.g., with respect to Java's automatic serialization) but makes it possible to build efficient network channels. While ZeroMQ provides an abstraction to implement a queue-based distributed network, different libraries can be used for the data serialization step, e.g., Boost.Serialization [93] or Google Protobuf [94]. FastFlow has also been provided with a minimal message-passing layer implemented on top of InfiniBand RDMA features [112]. The proposed RDMA-based communication channel implementation achieves performance comparable with highly optimized MPI/InfiniBand implementations. The results obtained demonstrate that the RDMA-based library can improve application performance by more than 30% with a consistent reduction of CPU time utilization with respect to the original TCP/IP implementation.

2.5 Summary

In this chapter we provided a review of the most common parallel computing platforms and of programming models for multicores and clusters of multicores. From a hardware perspective, we presented multicore and manycore processors, such as accelerators, and distributed systems. From a high-level programming model perspective, we listed the four main classes of parallelism (task, data, stream and dataflow parallelism), followed by some insights about the Dataflow model, a model of computation describing a program as a set of concurrent processes, which is extensively used in this thesis. We also provided a review of low-level approaches to parallel programming, such as the POSIX threads programming model, MPI, TBB and OpenMP. Finally, we reviewed programming models for the described platforms exploiting high-level skeleton-based approaches, focusing on the FastFlow library, which we use in this work to implement the PiCo runtime.


Chapter 3

Overview of Big Data Analytics Tools

Big Data is a term used to identify data sets that are so large and/or complex (i.e., unstructured) that traditional data processing applications are inadequate to process them. In this chapter we will describe what Big Data is, starting from its "formal" definition. We then continue with a survey of the state of the art of Big Data analytics tools.

3.1 A Definition for Big Data

Big Data has been characterized by the “3Vs” model, an informal definition proposed by Beyer and Laney [32, 90] that has been widely accepted by the community:

“Big data is high-Volume, high-Velocity and/or high-Variety informa- tion assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

More in detail:

• Volume: the amount of data that is generated and stored. The size of the data determines whether it can be considered Big Data or not.
• Velocity: the speed at which the data is generated and processed.
• Variety: the type and nature of the data, which, when collected, are typically unstructured.

The “3Vs” model can be extended by adding two more V-features: 1) Variability, since data not only are unstructured, but can also be of different types (e.g., text, images) or even inconsistent, and 2) Veracity, in the sense that the quality and accuracy of the data may vary. In 2013, IBM tried to quantify the amount of data being produced: “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” [75] There is no clear-cut definition of how “big” data should be to be considered Big Data. Data is said to be Big if it becomes difficult to store, analyze and search using traditional systems, since it is large, unstructured and hard to organize. Consider, for instance, large organizations like Facebook: having more than 950 million users, it pulls in 500 Terabytes per day into a 100 Petabytes warehouse, and runs 70 000 queries per day on this data as of 2012.1 Here are some of the statistics provided by the company:

• 2.5 billion content items shared per day
• 2.7 billion Likes per day
• 300 million photos uploaded per day
• 100+ Petabytes of disk space in one of their largest Hadoop (HDFS) clusters
• 105 Terabytes of data scanned every 30 minutes
• 70 000 queries executed on these data per day
• 500+ Terabytes of new data stored into the databases every day

Of course, Facebook is not the only entity producing Big Data. There is a large set of use cases in which Big Data analysis can produce knowledge. For instance, the IoT (Internet of Things) is the interconnection of uniquely identifiable devices connected (also among each other) to the Internet. Those devices are referred to as “connected” or “smart” devices. Big Data and IoT are strictly connected, since billions of devices produce massive amounts of data, and companies can benefit from analyzing all that information in order to produce new knowledge. Another use case comes from healthcare and genomics, where the amount of data being produced by sequencing, mapping, and analyzing genomes takes those areas into the realm of Big Data. Big Data can improve operational efficiencies, help predict and plan responses to disease epidemics, improve the quality of monitoring of clinical trials, and optimize healthcare spending at all levels, from patients to hospital systems to governments. Another key area is genome sequencing, which is expected to be the future of healthcare [33]. Big Data can also be not that big: companies not as big as Facebook or Twitter may be interested in analyzing their data, as well as institutions, public offices, banks, etc. Extracting knowledge from Big Data is also related to programmability, that is, how to easily write programs and algorithms to analyze data, and to performance issues, such as scalability when running analyses and queries on multicores or clusters of multicores. For this reason, starting from the Google MapReduce paper [61], a constantly increasing number of frameworks for Big Data processing have been implemented. In the next section, we provide an overview of such frameworks, starting with a summary of data management.

3.2 Big Data Management

Big Data management is more complex than simple database management. It can be seen as the process of capturing, delivering, operating, protecting, enhancing, and disposing of the data cost-effectively, which needs the ongoing reinforcement of plans, policies, programs, and practices [92].

HDFS The Hadoop Distributed File System (HDFS) [23] is a distributed file system designed to run on commodity hardware. It is highly fault-tolerant and designed to be deployed on low-cost hardware. It is part of the Hadoop Ecosystem [22], consisting of: 1) the high-throughput Hadoop Distributed File System HDFS [23], 2) the MapReduce engine for data processing and, 3) the Hadoop YARN [26] software for resource management and job scheduling and monitoring. In recent years, Hadoop has been decoupled from the MapReduce processing engine, so that it became possible to use the HDFS+YARN layer as the base ecosystem for other frameworks, such as Apache Spark, Apache Storm, etc.

1https://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. Being a software distributed file system, HDFS runs on top of the local file systems but appears to the user as a single namespace accessible via an HDFS URI. An HDFS instance typically runs on hundreds or thousands of commodity components that may fail, implying that, at a given time, some components may not be working or may be unable to recover from their failure. Therefore, the system should support constant monitoring, error detection and fault tolerance, in order to automatically recover from failures. HDFS is tuned to support a modest number (tens of millions) of large files, which are typically Gigabytes to Terabytes in size. In its first implementation, HDFS assumed a write-once-read-many access model for files, an assumption that simplified the data coherency problem and enabled high-throughput data access. The append operation was added later (single appender only).2 HDFS is mainly designed for batch processing, since it is specialized for high throughput of data access rather than low latency.

NoSQL The Not Only SQL model [104] refers to distributed databases whose data model is not strictly relational. The data structures used by NoSQL databases (e.g., key-value, plain document, graph) are different from those used by default in relational databases, making some operations faster in NoSQL. These databases have to abide by the CAP theorem (Consistency-Availability-Partition tolerance), proposed by Brewer in [39]. This theorem asserts that a distributed system cannot provide all of the following properties simultaneously:

1. Consistency: all nodes see the same data at the same moment (single up-to-date copy of the data);
2. Availability: all data should always be available to all requesting entities; even if a node is not responsive, every data request must terminate;
3. Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

By playing with the CAP theorem, many NoSQL stores compromise consistency in favor of availability and partition tolerance, which also brings design simplicity. A distributed database system does not have to drop consistency and can be ACID compliant but, on the other hand, such systems are much more complex and slower than NoSQL ones. Frameworks using the NoSQL paradigm are, for instance, ZooKeeper [27], HBase [25], Riak [31] and Cassandra [24].

3.3 Tools for Big Data Analytics

In this section, we provide the background related to Big Data analytics tools, starting from the Google MapReduce model to the most modern frameworks.

3.3.1 Google MapReduce

Programming for Big Data poses many challenges, first of all ease of programming and high performance. Google can be considered the pioneer of Big Data processing,

2https://issues.apache.org/jira/browse/HDFS-265 34 Chapter 3. Overview of Big Data Analytics Tools as the publication of the MapReduce framework paper [61] made this model “popu- lar” — almost “mainstream”. Inspired by the map and reduce functions commonly used in , a MapReduce program is composed of a map and a reduce procedures: more specifically, the user-defined map function processes a key-value pair to generate a set of intermediate key-value pairs, and the reduce function aggregates all intermediate values associated with the same intermediate key. The contribution of the MapReduce programming model is not the composition of map and reduce functions, but in the utilization of a key-value model in this process and the repartitioning step, known as shuffle. One aspect often stressed underlying this model is the data locality exploitation during the map phase, following the idea of moving the computation to the data. That is, the runtime exploits the natural data partitioning on a distributed file systems by forcing operations to be computed using only local data — of course, during the shuffle phase, some data will be copied to other nodes of the cluster. We now explain in more detail how a MapReduce computation is characterized.

The five steps of a MapReduce job

From a high-level perspective, a MapReduce job can be divided into three macro steps: 1) a Map step, in which each worker node applies the map function to the local data, transforming each datum v into a pair ⟨k, v⟩, and writes the output to temporary storage; 2) a Shuffle step, where worker nodes redistribute data based on the value of the keys, such that all data belonging to one key is located on the same worker node; and 3) a Reduce step, where each worker computes the reduce function on local (key-partitioned) data. Users write MapReduce applications by defining the sequential kernel functions (map, reduce), which are automatically parallelized by the runtime and executed on a large cluster. During the execution, map invocations are distributed across multiple machines, provided each map operation can be executed independently on input splits residing on different machines. The same applies to the reduce invocations, which are distributed by partitioning the intermediate key space into R sets, provided that all outputs of the map operation with the same key are processed by the same reducer at the same time, and that the reduction function is associative and commutative. Besides partitioning the input data and running the various tasks in parallel, the runtime also manages all communications and data transfers, load balancing, and fault tolerance. Looking more deeply into the MapReduce model, it can be described as a 5-step parallel and distributed computation:

1. Input preparation: the runtime designates Map processors by assigning the input key value k1 that each processor will work on, and provides that processor with all the input data associated with that key value.

2. User-defined map execution: map is run exactly once for each k1 key value, generating output organized by key values k2. The intermediate key-value pairs produced by the map function are buffered in memory. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (called partitioner), which is given the key and the number of reducers R and returns the index of the desired reducer processor (a sketch of such a partitioner is shown below).

3. Shuffle map output: Reduce processors are designated by the runtime, by assigning each k2 key value to a predefined processor. A reducer remotely reads the buffered data from the local disks of the Map workers. When a Reduce worker has finished reading its data, it sorts the data by the intermediate keys. Typically many different keys map to the same reduce task.

4. User-defined reduce execution: reduce is run exactly once for each k2 key value produced by the map step and assigned to that Reduce worker.

5. Produce output: the MapReduce system collects all the reduce output for each key to produce the final outcome.

These five steps can be logically thought of as running in sequence — each step starts only after the previous step is completed — although in practice they can be interleaved as long as the final result is not affected. Out of these five steps, sorting and partitioning are performed natively by the runtime.
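The partitioning function mentioned in step 2 is typically a simple hash-based policy. The sketch below reflects the default behavior of Hadoop's HashPartitioner (shown here for illustration; the class name is ours, and a custom partitioner can replace this policy):

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // mask the sign bit, then map the key's hash to one of the R reducers
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}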

Figure 3.1: An example of a MapReduce Dataflow

Figure 3.1 shows the dataflow of operations resulting from a MapReduce execution. In more detail, it shows the pipeline (input pre-processing – map – sort – shuffle – reduce – output processing) in all of its inner steps. It can be noted that, while only two nodes are depicted, the same pipeline is replicated across all nodes, thus making all the worker nodes aware of the whole computation. MapReduce tasks must be written as acyclic dataflow programs, i.e., a stateless mapper followed by a stateless reducer, which limits performance when considering iterative computations such as machine learning algorithms [99]. This is especially due to the repeated writes to disk, which represent a bottleneck when a single working set must be revisited multiple times [132]. In Listing 3.1, we show a source code extract of a MapReduce application, in which only the map and reduce functions are presented.

3 Figure taken from https://developer.yahoo.com/hadoop/tutorial/module4.html

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

Listing 3.1: The map and reduce functions for a Java Word Count class example in MapReduce.

3.3.2 Microsoft Dryad

Dryad [79] is a Microsoft research framework for distributed processing of coarse-grain data parallel applications. A Dryad application is a Dataflow graph where computational vertexes are connected among each other with communication channels. Dryad runs the application by executing the graph’s vertexes on a set of available computers, communicating through files, TCP pipes, and shared-memory FIFO queues. Kernel functions executed by the vertexes are sequential functions (with no thread creation or data contention) provided by the user. Concurrency arises from scheduling vertexes to run simultaneously on shared memory or on distributed systems. A Dryad application developer can specify an arbitrary directed acyclic graph to describe the application’s communication patterns, and express the data transport mechanisms (files, TCP pipes, and shared memory FIFO queues) between the computation vertexes. Dryad is notable for allowing graph vertexes to use an arbitrary number of inputs and outputs, while MapReduce restricts all computations to take a single input set and generate a single output set. Although Dryad provides a nice alternative to MapReduce, Microsoft terminated development on Dryad in 2011.

A Dryad application DAG

In Dryad, graphs are created by combining subgraphs or by using simple vertexes. A graph is formally defined as a tuple G = ⟨VG, EG, IG, OG⟩, where VG and EG are the sets of vertexes and edges, while IG ⊆ VG and OG ⊆ VG are input and output vertexes, respectively. No graph can contain edges entering an input vertex, nor edges exiting an output vertex. Edges are created in two ways: 1) by applying a composition of two existing graphs (a single vertex is considered a graph), and 2) by merging graphs through the union of their vertexes and edges. Edges represent channels. By default, channels are implemented as temporary files that the producer vertex writes to and the consumer vertex reads from. Users can also choose channels implemented as shared memory FIFO queues or TCP channels. Once created, a Dryad DAG is optimized using various heuristics. The Dryad runtime parallelizes the application graph by distributing the computational vertexes across various execution engines (in shared memory or shared nothing). Note that replicating the whole application graph is also a valid distribution, to exploit data parallelism. Scheduling of the computational vertexes on the available hardware is handled by the runtime.

3.3.3 Microsoft Naiad

Naiad is an investigation of data-parallel dataflow computation, successor of Dryad, focusing on low-latency streaming and cyclic computations [96]. Naiad’s authors introduced a new computational model called Timely Dataflow, which enriches dataflow computation with timestamps that represent logical points in the computation. This model provides the user with the possibility of implementing iterative computations, thanks to feedback loops in the dataflow. Stateful nodes in the dataflow can manage a global state for coordination: they asynchronously receive messages and notifications of global progress, which are also used in loop management.

Timely Dataflow and Naiad programming model

Timely Dataflow is a model of data-parallel computation that extends traditional dataflow by associating each communication event with a virtual time that does not need to be ordered. As in the Time Warp mechanism [81], virtual times serve to differentiate between data in different phases or aspects of a computation, for example, data associated with different batches of inputs and different loop iterations. A formal definition of Timely Dataflow can be found in [3]. In this model, each node may request different notifications, for instance about loop completions, thus allowing asynchronous processing. Timely Dataflow graphs are directed graphs where vertexes are organized into possibly nested loop contexts. Edges entering a loop context must pass through an ingress vertex and edges leaving a loop context must pass through an egress vertex. Additionally, every cycle in the graph must be contained entirely within some loop context, and must include at least one feedback vertex that is not nested within any inner loop context. Every token flowing on edges carries a timestamp of the form Timestamp := (e ∈ ℕ, ⟨c1, . . . , ck⟩ ∈ ℕ^k). The value e represents an epoch, an integer timestamp representing the id of the item within the stream, while k ≥ 0 is the depth of nesting, i.e., the number of loop counters in the timestamp. For example, a token in the second epoch and third iteration of a single loop carries the timestamp (2, ⟨3⟩). The programming model is quite straightforward: a Naiad program is implemented by means of Dataflow nodes created by extending the Vertex interface. Such a vertex is a possibly stateful object that sends and receives messages, and requests and receives notifications. Message exchange and notification delivery are asynchronous. Vertexes can modify arbitrary state in their callbacks, and send messages on any of their outgoing edges. Parallelism is obtained by creating multiple instances of some vertexes, thus exploiting data parallelism in operators, inter-operator parallelism and loop parallelism. A Naiad application may be run on shared memory and on distributed systems. The smallest unit of computation is the Worker, namely a single thread, called a shard. Processes are larger units of computation, which may contain one or more Workers (thus executing one or more vertexes). A single machine may host one or more processes. Listing 3.2 presents a code fragment implementing a MapReduce style computation.

//1a. Define input stages for the dataflow.
var input = controller.NewInput<string>();

//1b. Define the timely dataflow graph.
var result = input.SelectMany(y => map(y))
                  .GroupBy(y => key(y), (k, vs) => reduce(k, vs));

//1c. Define output callbacks for each epoch
result.Subscribe(result => { ... });

// Step 2. Supply input data to the query.
input.OnNext(/*1st epoch data*/);
input.OnNext(/*2nd epoch data*/);
input.OnNext(/*3rd epoch data*/);
input.OnCompleted();

Listing 3.2: A LINQ MapReduce computation example in Naiad (taken from [96]).

Step 1a defines the source of data, and Step 1c defines what to do with output data when produced. Step 1b constructs a Timely Dataflow graph using SelectMany and GroupBy library calls, which are pre-defined vertexes. SelectMany applies its argument function to each message (thus it is a map function), and GroupBy groups results by a key function before applying its reduction function reduce. Once the graph is built, in step 2 OnNext supplies the computation with epochs of input data. The Subscribe stage applies its callback to each completed epoch of data it observes. Finally, OnCompleted indicates that no further epochs of input data exist, so the runtime can drain messages and shut down the computation.

3.3.4 Apache Spark

As mentioned in previous sections, MapReduce is not suitable for iterative algorithms or interactive analytics, not because it is not possible to implement iterative jobs, but because data have to be repeatedly stored and loaded at each iteration. Furthermore, data can be replicated on the distributed file system between successive jobs. Apache Spark [133, 131, 132] is designed to solve this problem by keeping the working dataset in memory and reusing it across iterations. For this reason, Spark represents a landmark in Big Data tools history, enjoying strong success in the community. The overall framework and parallel computing model of Spark is similar to MapReduce, while the innovation is in the data model, represented by the Resilient Distributed Datasets (RDDs), which are immutable multisets. In this Section, we only give an overview of Spark, which will be further investigated in Chapter 4. A Spark program can be characterized by the two kinds of operations applicable to RDDs: transformations and actions. Those transformations and actions compose the directed acyclic graph (DAG) representing the application. Transformations are the functional-style operations applicable to collections, such as map, reduce, and flatmap, that are uniformly applied to whole RDDs [131]. Actions return a value to the user after running a computation on the dataset, thus they effectively start the program execution. Transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just record the transformations applied to some base dataset, and they are computed when an action requires a result to be returned to the driver program, that is, the main program written by the programmer.

Listing 3.3 shows the source code for a simple Word Count application in the Java Spark API.

JavaRDD<String> textFile = sc.textFile("hdfs://...");

JavaRDD<String> words =
    textFile.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

JavaPairRDD<String, Integer> pairs =
    words.mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

JavaPairRDD<String, Integer> counts =
    pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) {
        return a + b;
      }
    });
counts.saveAsTextFile("hdfs://...");

Listing 3.3: A Java Word Count example in Spark.

For stream processing, Spark provides an extension through the Spark Streaming module, described in more detail below. Spark’s execution model relies on the Master-Worker model: a cluster manager (e.g., YARN) manages resources and supervises the execution of the program. It manages application scheduling to worker nodes, which execute the application logic (the DAG) that has been serialized and sent by the master.

Resilient Distributed Datasets

An RDD is a read-only collection of objects partitioned across a cluster of computers that can be operated on in parallel. A Spark application consists of a driver program that creates RDDs from, for instance, HDFS files, an existing Scala collection, or queries to databases. The driver program may transform an RDD in parallel by invoking supported operations with user-defined functions, which return another RDD. Since RDDs are read-only collections, each transformation on such collections creates a new RDD containing the result of the transformation. This new collection is not materialized, nor kept in memory: it is the result of an expression rather than a value, and it is computed each time the transformation is called. The driver can also persist an RDD in memory, allowing it to be reused efficiently across parallel operations without recomputing it. The semantics of RDDs have the following properties:

• Abstract: elements of an RDD do not have to exist in physical memory. In this sense, an element of an RDD is an expression rather than a value. The value can be computed by evaluating the expression when necessary (i.e., when executing an action).

• Lazy and Ephemeral: RDDs can be created from a file or by transforming an existing RDD with operations such as map, filter, groupByKey, reduceByKey, join, etc. However, RDDs are materialized on demand when they are used in some operation, and are discarded from memory after use, which amounts to a form of lazy evaluation.

• Caching and Persistence: a dataset can be cached in memory across operations, which allows future actions to be much faster since they do not have to be recomputed. Caching is a key tool for iterative algorithms and fast interactive use cases, and it is actually one special case of persistence, which allows different storage levels, e.g., persisting the dataset on disk or in memory but as serialized Java objects (to save space), replicating it across nodes, or storing it off the JVM heap (a minimal sketch is shown after this list).

• Fault Tolerance: if any partition of an RDD is lost, the lost block will automatically be recomputed using only the transformations that originally created it.
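The following is a minimal sketch of caching, assuming an already created JavaSparkContext sc; the path and the filtering predicates are illustrative, and imports are omitted as in the other listings. Both actions reuse the in-memory copy of the filtered RDD instead of re-reading and re-filtering the input:

JavaRDD<String> lines = sc.textFile("hdfs://.../data.txt");
JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();

// first action: materializes the RDD and caches its partitions
long total = errors.count();
// second action: served from the cached partitions
long ioErrors = errors.filter(l -> l.contains("IO")).count();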

Operations on RDDs take user-defined functions as arguments, which act as closures, in the functional programming style provided by the Scala language used to implement the Spark runtime. A closure can refer to non-local variables from the scope where it was created; these captured variables are copied to the workers when Spark runs the closure. We recall that, exploiting the JVM and serialization features, it is possible to serialize closures and send them to workers. Spark also offers two kinds of shared variables:

• Broadcast variables, which are copied to the workers once, and are appropriate when large read-only data is used in multiple operations.

• Accumulators, which are states local to workers that are only “updated” through an associative operation and can therefore be efficiently supported in parallel. Accumulators can thus be used to implement counters or sums. Only the driver program can read the accumulator’s value. Spark natively supports accumulators of numeric types. Both kinds of shared variables are sketched below.
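A minimal sketch of both shared variables, assuming a JavaSparkContext sc and an existing JavaRDD<String> words (all names are illustrative; imports omitted):

Broadcast<Set<String>> stopWords =
    sc.broadcast(new HashSet<>(Arrays.asList("the", "a", "of")));
Accumulator<Integer> dropped = sc.accumulator(0);

words.foreach(w -> {
    // the broadcast copy is read-only on the workers
    if (stopWords.value().contains(w)) dropped.add(1);
});
System.out.println(dropped.value()); // only the driver reads the value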

By reusing cached data in RDDs, Spark offers great performance improvement over Hadoop MapReduce [131], thus making it suitable for iterative machine learning algorithms. Similar to MapReduce, Spark is independent of the underlying storage system. It is the application developer’s duty to organize data on distributed nodes — i.e., by partitioning and collocating related datasets, etc. — if a distributed file system is not available. These are critical for interactive analytics, since just caching is insufficient and not effective for extremely large data.

Spark Streaming

For stream processing, Spark implements an extension through the Spark Streaming module, providing a high-level abstraction called discretized stream or DStream [133]. Such streams represent results in continuous sequences of RDDs of the same type, called micro-batch. Operations over DStreams are “forwarded” to each RDD in the DStream, thus the semantics of operations over streams is defined in terms of batch processing according to the simple translation op(a) = [op(a1), op(a2),...], where [·] refers to a possibly unbounded ordered sequence, a = [a1, a2,...] is a DStream, and each item ai is a micro-batch of type RDD. All RDDs in a DStream are processed in order, whereas data items inside an RDD are processed in parallel without any ordering guarantees.
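As a hedged sketch of this model, the following fragment (Spark 2.x Streaming Java API; host, port and batch interval are illustrative, imports omitted) builds a DStream word count where each transformation is applied, micro-batch by micro-batch, to the underlying RDDs:

SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
JavaStreamingContext jssc =
    new JavaStreamingContext(conf, Durations.seconds(1)); // 1-second micro-batches
JavaReceiverInputDStream<String> lines =
    jssc.socketTextStream("localhost", 9999);

JavaPairDStream<String, Integer> counts =
    lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator())
         .mapToPair(w -> new Tuple2<>(w, 1))
         .reduceByKey((a, b) -> a + b); // applied to each RDD of the DStream

counts.print();
jssc.start();
jssc.awaitTermination();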

3.3.5 Apache Flink

Formerly known as Stratosphere [21], Apache Flink [42] focuses on stream programming. The abstraction used is the DataStream, which is a representation of a stream as a single object. Operations are composed (i.e., pipelined) by calling operators on DataStream objects. Flink also provides the DataSet type for batch applications, which identifies a single immutable multiset—a stream of one element. A Flink program, either for stream or batch processing, is a term from an algebra of operators over DataStreams or DataSets, respectively. Flink, differently from Spark, is a stream processing framework, meaning that both batch and stream processing are based on a streaming runtime. It can be considered one of the more advanced stream processors, as many of its core features were already considered in the initial design [42]. Strong attention is put on providing an exactly-once processing guarantee, achieved by implementing an algorithm based on the Chandy-Lamport algorithm for distributed snapshots [44]. In this algorithm, watermark items are periodically injected into the data stream and trigger any receiving component to create a checkpoint of its local state. On success, all local checkpoints for a given watermark comprise a distributed global system checkpoint. In a failure scenario, all components are reset to the last valid global checkpoint and data is replayed from the corresponding watermark. Since data items may never overtake watermark items, acknowledgment does not happen on a per-item basis. Other aspects of Flink will be further investigated in Chapter 4.
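From the user's perspective, enabling this snapshot-based mechanism amounts to turning on periodic checkpointing on the streaming execution environment, as in the following hedged sketch (the 5000 ms interval is illustrative; imports omitted):

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// a checkpoint marker is injected into the stream every 5 seconds;
// on failure, operator state is reset to the last completed global snapshot
env.enableCheckpointing(5000);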

Flink Programming and Execution Model

Flink adopts a programming model similar to Spark, in which operations are applied to datasets and streams in an object-oriented fashion. The basic building blocks are streams and transformations (note that a DataSet is internally also a stream). A stream is an intermediate result, and a transformation is an operation that takes one or more streams as input, and computes one or more result streams. Listing 3.4 shows Flink’s source code for the simple Word Count application.

public class WordCountExample {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env =
        ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = env.fromElements("Text...");
    DataSet<Tuple2<String, Integer>> wordCounts =
        text.flatMap(new LineSplitter())
            .groupBy(0)
            .sum(1);

    wordCounts.print();
  }

  public static class LineSplitter
      implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line,
                        Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<>(word, 1));
      }
    }
  }
}

Listing 3.4: A Java Word Count example in Flink.

Flink programs are mapped to streaming dataflows, consisting of streams and transformation operators that build arbitrary directed acyclic graphs (DAGs). Each dataflow starts with one or more sources and ends in one or more sinks. Special forms of cycles are permitted via iteration constructs that are discussed in Section 4.2.3. Flink’s execution model relies on the Master-Worker model: a deployment has at least one job manager process that coordinates checkpointing and recovery, and that receives Flink jobs. The job manager also schedules work across the task manager processes (i.e., workers), which usually reside on separate machines and in turn execute the code. For distributed execution, Flink optimizes the DAG by chaining operator subtasks together into tasks: this reduces the overhead of thread-to-thread handover and buffering, and increases overall throughput while decreasing latency. Stateful stream operators are discussed in Section 4.2.3.

3.3.6 Apache Storm

Apache Storm [97, 117, 126] is a framework targeting only stream processing. It is perhaps the first widely used large-scale stream processing framework in the open source world. Storm’s programming model is based on three key notions: Spouts, Bolts, and Topologies. A Spout is a source of a stream, which is (typically) connected to a data source or can generate its own stream. A Bolt is a processing element, which processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into Bolts, such as functions, filters, streaming joins or streaming aggregations. A Topology is the composition of Spouts and Bolts resulting in a network. Storm uses tuples as its data model, that is, named lists of values of arbitrary type. Hence, Bolts are parametrized with per-tuple kernel code. Each time a tuple is available from some input stream, the kernel code gets activated to work on that input tuple. Bolts and Spouts are locally stateful, as we discuss in Section 4.2.3, while no global consistent state is supported. Globally stateful computations can be implemented, since the kernel code of Spouts and Bolts is arbitrary. However, any global state management would be the sole responsibility of the user, who has to be aware of the underlying execution model in order to ensure coordination among Spouts and Bolts. It is also possible to define cyclic graphs by way of feedback channels connecting Bolts. While Storm targets single-tuple granularity in its base interface, the Trident API is an abstraction that provides declarative stream processing on top of Storm. Namely, Trident processes streams as a series of micro-batches belonging to a stream considered as a single object. Listing 3.5 shows Storm’s source code for the simple Word Count application.

public class WordCountTopology {
  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
      super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
      return null;
    }
  }

  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null) count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    conf.setNumWorkers(3);
    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
  }
}

Listing 3.5: A Java Word Count example in Storm.

Tasks and Grouping

The execution of each and every Spout and Bolt by Storm is called a Task, which is the smallest logical unit of a Topology. In simple words, a Task is either the execution of a Spout or of a Bolt. At a given time, each Spout and Bolt can have multiple instances running in multiple separate threads. The execution model in Storm is based on a Master-Worker paradigm, in which Workers receive Tasks from the runtime. Apache Storm has two types of nodes, Nimbus (master node) and Supervisor (worker node). Nimbus is the central component of Apache Storm, running the Storm Topology. Nimbus analyzes the Topology and gathers the tasks to be executed. Then, it distributes each task to an available Supervisor. A Supervisor has one or more worker processes to which it delegates Tasks. Worker processes spawn as many executors as needed and run the tasks. Apache Storm uses an internal distributed messaging system for the communication between Nimbus and Supervisors.

Stream grouping controls how the tuples are routed in the Topology. There are four main built-in groupings, exemplified in the sketch after the following list:

• Shuffle Grouping: tuples are randomly distributed across the Bolt’s tasks, so that each Bolt task is guaranteed to get an equal number of tuples.

• Fields Grouping: this performs a tuple partitioning, in which tuples with the same field values are forwarded to the same worker executing the Bolt.

• Global Grouping: all the streams can be grouped and forwarded to one Bolt. This grouping sends tuples generated by all instances of the source to a single target instance.

• All Grouping: sends a single copy of each tuple to all instances of the receiving Bolt. This kind of grouping is used to send signals to Bolts (i.e., broadcast).
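A hedged sketch of how these groupings are declared when wiring a Topology; RandomSentenceSpout, SplitSentence and WordCount are the components of Listing 3.5, while SignalSpout and ReportBolt are illustrative:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new RandomSentenceSpout(), 2);
builder.setSpout("signals", new SignalSpout(), 1);

builder.setBolt("split", new SplitSentence(), 4)
       .shuffleGrouping("sentences");               // random load balancing

builder.setBolt("count", new WordCount(), 4)
       .fieldsGrouping("split", new Fields("word")) // partition by the "word" field
       .allGrouping("signals");                     // broadcast control tuples to every task

builder.setBolt("report", new ReportBolt(), 1)
       .globalGrouping("count");                    // all tuples funnel into a single task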

3.3.7 FlumeJava

FlumeJava [43] is a Java library developed by Google for data-parallel pipelines. It can be considered the ancestor of Google Dataflow. Its core is centered on parallel operators over parallel collections: internally, it implements parallel operations using deferred evaluation, thus the invocation of a parallel operation is stored with its arguments in an internal execution plan graph structure. Once the execution plan for the whole computation has been constructed, FlumeJava optimizes the execution plan by, for example, fusing chains of parallel operations together into a small number of MapReduce operations. FlumeJava supports only batch execution, with a bulk-synchronous model: the runtime executor traverses the operations in the execution plan in forward topological order, executing each one in turn. The same program can execute completely locally when run on small inputs, and using many parallel machines when run on large inputs.

Data Model and Transformations

In FlumeJava, there are several classes representing different kinds of data. The core abstraction for representing data is the PCollection, containing ordered as well as unordered data. The second core abstraction is the PTable, representing data in the form of key-value pairs, where the value is always a PCollection. PCollections are transformed via parallel operations, such as parallelDo (i.e., map) or join, for instance. Those parallel operations are executed lazily. Each PCollection object is represented internally with a deferred (not yet computed) or materialized (computed) state. A deferred PCollection holds a pointer to the deferred operation that computes it, which in turn holds references to the PCollections that are its arguments (which may themselves be deferred or materialized) and to the deferred PCollections that are its results. When a FlumeJava operation like parallelDo() is called, it creates a ParallelDo deferred operation object and returns a new deferred PCollection that points to it. The result of executing a series of FlumeJava operations is thus a directed acyclic graph of deferred PCollections and operations, called the execution plan. The execution plan is visited by the internal optimizer with the goal of producing the most efficient execution plan. One of the most important operations is the MapShuffleCombineReduce (MSCR), by which the FlumeJava optimizer transforms combinations of ParallelDo, GroupByKey, CombineValues, and Flatten operations into a single MapReduce step. It extends MapReduce by allowing multiple reducers and combiners. Furthermore, it allows each reducer to produce multiple outputs, by removing the constraint that a reducer must produce outputs with the same key as the reducer input, and by allowing pass-through outputs, thereby making it a better target for the optimizer, which improves communication patterns.

The code shown in Listing 3.6 presents a word count example in Apache Crunch [52], the Java library for creating FlumeJava pipelines.

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Create an object to coordinate pipeline creation and execution.
    Pipeline pipeline = new MRPipeline(WordCount.class);

    // Reference a given text file as a collection of Strings.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Define a function that splits each line in a PCollection of Strings
    // into a PCollection made up of the individual words in the file.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        },
        Writables.strings() // Indicates the serialization format
    );

    // The count method applies a series of Crunch primitives and returns a
    // map of the top 20 unique words in the input PCollection to their counts.
    // We then read the results of the MapReduce jobs that performed the
    // computations into the client and write them to stdout.
    for (Pair<String, Long> wordCount : words.count().top(20).materialize()) {
      System.out.println(wordCount);
    }
  }
}

Listing 3.6: A Complete Java Word Count example in Crunch API for FlumeJava.

3.3.8 Google Dataflow

Google Dataflow SDK [6] is part of the Google Cloud Platform [72]. Here, “Dataflow” refers to the “Dataflow model” used to describe the processing and programming model of the Cloud Platform. This framework aims to provide a unified model for stream, batch and micro-batch processing. It is built on four major concepts:

• Pipelines • PCollections • Transforms • I/O Sources and Sinks

The base entity is the Pipeline, representing a data processing job consisting of a set of operations that can read a source of input data, transform that data, and write out the resulting output. The data and transforms in a pipeline are unique to, and owned by, that Pipeline, and multiple Pipelines cannot share data or transforms. A Pipeline can be linear, but it can also branch and merge, making a Pipeline a directed graph of steps; its construction can create this directed graph by using conditionals, loops, and other common programming structures. It is possible to implement iterative algorithms only if a fixed and known number of iterations is provided, while it may not be easy to implement algorithms where the Pipeline’s execution graph depends on the data itself. This is because the computation graph is built in an intermediate language that is then optimized before being executed. With iterative algorithms, the complete graph structure cannot be known beforehand, so it is necessary to know the number of iterations in advance in order to replicate the subset of Pipeline stages involved in the iterative step. The DAG resulting from a Pipeline is optimized and compiled into a language-agnostic representation before execution. The execution of a Dataflow program is typically launched on the Google Cloud Platform. When run, the Dataflow runtime enters the Graph Construction Time phase, in which it creates an execution graph on the basis of the Pipeline, including all the Transforms and processing functions. The execution graph is translated into JSON format and transmitted to the Dataflow service endpoint. Once validated, this graph is translated into a job on the Dataflow service, which automatically parallelizes and distributes the processing logic to the workers allocated by the user to perform the job.

Data Model and Transformations

A PCollection represents a potentially large, immutable bag of elements, that can be either bounded or unbounded. Elements in a PCollection can be of any type, but the type must be consistent. However, the user must provide the runtime with the encoding of the data type as a byte string in order to serialize and send data among distributed processing units. Dataflow’s Transforms use PCollections, that is, PCollection is the only data type accepted by operations. Each PCollection is local to a specific Pipeline object and has the following properties: 1) it is immutable, so each transformation instantiates a new PCollection; 2) it does not support random access to individual elements; and 3) it is private to a Pipeline. The bounded (or unbounded) nature of a PCollection affects how Dataflow processes the data. Bounded PCollections can be processed using batch jobs, which might read the entire data set once and perform processing in a finite job. Unbounded PCollections must be processed using streaming jobs, as the entire collection can never be available for processing at any one time; they can be grouped by using windowing to create logical windows of finite size. A Transform represents an operation on PCollections, that is, it is a stage of the Pipeline. It accepts one (or multiple) PCollection(s) as input, performs an operation on the elements in the input PCollection(s), and produces one (or multiple) new PCollection(s) as output. Transforms are applied to an input PCollection by calling the apply method on that collection. The output PCollection is the value returned from PCollection.apply.

For example, Listing 3.7 shows how to create a Word Count Pipeline.

public static void main(String[] args) {
  // Create a pipeline parameterized by command-line flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  // The Count transform below returns a new PCollection of key/value
  // pairs, where each key represents a unique word in the text.

  p.apply(TextIO.Read.from("gs://...")) // Read input.
   .apply(ParDo.named("ExtractWords").of(new DoFn<String, String>() {
     @Override
     public void processElement(ProcessContext c) {
       for (String word : c.element().split("[^a-zA-Z']+")) {
         if (!word.isEmpty()) {
           c.output(word);
         }
       }
     }
   }))
   .apply(Count.<String>perElement())
   .apply("FormatResults",
       MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
         @Override
         public String apply(KV<String, Long> input) {
           return input.getKey() + ": " + input.getValue();
         }
       }))
   .apply(TextIO.Write.to("gs://...")); // Write output.

  // Run the pipeline.
  p.run();
}

Listing 3.7: Java code fragment for a Word Count example in Google Dataflow.

The ParDo is the core element-wise transform in Google Dataflow, invoking a user-specified function on each of the elements of the input PCollection to produce zero or more output elements (flatmap semantics) collected into an output PCollection. Elements are processed independently, possibly in parallel across distributed cloud resources. When executing a ParDo transform, the elements of the input PCollection are first divided up into a certain number of “bundles” (which can be considered like micro-batches). These are farmed off to distributed worker machines (or run locally). For each bundle of input elements, processing proceeds as follows:

1. A fresh instance of the argument DoFn (namely the class argument of the ParDo transform) is created on a worker (through deserialization or other means).

2. The DoFn’s startBundle method is called to initialize the bundle, if overridden.

3. The DoFn’s processElement method is called on each of the input elements in the bundle.

4. The DoFn’s finishBundle method is called to complete its work, if overridden (a skeleton following this lifecycle is sketched below).
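The following hedged skeleton reflects this lifecycle, using the override-based style of the Dataflow SDK 1.x Java API shown in Listing 3.7; the class name and the per-bundle counter are illustrative:

class ExtractWordsFn extends DoFn<String, String> {
  private transient int elementsInBundle; // illustrative per-bundle state

  @Override
  public void startBundle(Context c) {    // step 2: initialize the bundle
    elementsInBundle = 0;
  }

  @Override
  public void processElement(ProcessContext c) { // step 3: one call per element
    elementsInBundle++;
    for (String word : c.element().split("[^a-zA-Z']+")) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }

  @Override
  public void finishBundle(Context c) {   // step 4: e.g., flush buffers or update counters
  }
}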

The ParDo can also take additional “side inputs”, that is, state local to the ParDo instance or a broadcast value. Google Dataflow will be further investigated in Chapter 4.

3.3.9 Thrill

Thrill [34] is a prototype of a general-purpose big data batch processing framework with a data-flow style programming interface, implemented in C++ and exploiting template meta-programming. Thrill uses arrays rather than multisets as its primary data structure to enable operations like sorting, prefix sums, window scans, or combining corresponding fields of several arrays (zipping). Thrill programs run in a collective bulk-synchronous manner similar to most MPI programs, focusing on fast in-memory computation, but transparently using external memory when needed. The functional programming style used by Thrill enables easy parallelization, which also works well for shared memory parallelism. A consequence of using C++ is that memory management has to be done explicitly. However, memory management in modern C++11 has been considerably simplified, and Thrill uses reference counting extensively. Thrill’s authors wanted to emphasize that, with the advent of C++11 lambda expressions, it has become easier to use C++ for big data processing with an API comparable to currently popular frameworks like Spark or Flink. Thrill’s execution model is similar to that of MPI programs, where the same program is run on different machines, i.e., SPMD (Single Program, Multiple Data). The binary is executed simultaneously on all machines, and communication supports TCP sockets and MPI. Each machine is called a host and each thread on a host is called a worker. There is no master or driver host, as all communication is done collectively.

Distributed Immutable Arrays

Thrill’s data model is based on distributed immutable array (DIA). A DIA is an array of items distributed over the cluster, to which no direct access is permitted, that is, it is only possible to apply operations to the array as a whole. In a Thrill program, these operations are used to lazily construct a DIA data-flow graph in C++, as shown in Listing 3.8. The data-flow graph is only executed when an action operation is encountered. How DIA items are actually stored and in what way the operations are executed on the distributed system remains transparent to the user. In the current Thrill prototype, the array is usually distributed evenly among workers, in order. DIAs can contain any C++ data type, provided serialization methods are available. Thrill provides built-in serialization methods for all primitive types and most STL types; only custom non-trivial classes require additional methods. Each DIA operation is implemented as a C++ template class.

void WordCount(thrill::Context& ctx, std::string input, std::string output) {
  using Pair = std::pair<std::string, size_t>;
  auto word_pairs = ReadLines(ctx, input).template FlatMap<Pair>(
      // flatmap lambda: split and emit each word
      [](const std::string& line, auto emit) {
        Split(line, ' ', [&](std::string_view sv) {
          emit(Pair(sv.to_string(), 1));
        });
      });
  word_pairs.ReduceByKey(
      // key extractor: the word string
      [](const Pair& p) { return p.first; },
      // commutative reduction: add counters
      [](const Pair& a, const Pair& b) {
        return Pair(a.first, a.second + b.second);
      })
    .Map([](const Pair& p) { return p.first + ": " + std::to_string(p.second); })
    .WriteLines(output);
}

Listing 3.8: A C++ complete Word Count example in Thrill.

The immutability of a DIA enables functional-style dataflow programming. As DIA operations can depend on other DIAs as inputs, these form a directed acyclic graph (DAG), which is called the DIA dataflow graph, where vertexes represent operations and directed edges represent data dependencies.

A DIA remains purely an abstract dataflow between two concrete DIA operations, which allows optimizations such as pipelining or chaining to be applied, combining the logic of one or more functions into a single one (called a pipeline). All independently parallelizable local operations (e.g., FlatMap, Map) and the first local computation steps of the next distributed DIA operation are packed into one block of optimized binary code. Via this chaining, it is possible to reduce the data movement overhead as well as the total number of operations that need to store intermediate arrays.

3.3.10 Kafka

Apache Kafka [87] is an open-source stream processing platform implemented in Scala and Java, aiming to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka is based on a producer-consumer model exchanging streams of messages. Each stream of messages of a particular type is called a topic: a producer can publish messages to a topic, which are then stored at a set of servers called brokers, while a consumer can subscribe to one or more topics from the brokers, and consume the subscribed messages by pulling data from the brokers. A feature of Kafka is that each message stream provides an iterator interface over the stream produced. The consumer can iterate over every message in the stream and, unlike traditional iterators, the message stream iterator never terminates and exposes a blocking behavior until new messages are published. Since Kafka is distributed, a Kafka cluster typically consists of multiple brokers, hence a topic is divided into multiple partitions and each broker stores one or more of those partitions. A consumer consumes messages from a particular partition sequentially.
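A hedged sketch of this publish/subscribe pattern with the Kafka Java client follows; the topic name, broker address and group id are illustrative, and imports are omitted as in the other listings:

// producer side: publish messages to the "clicks" topic
Properties pp = new Properties();
pp.put("bootstrap.servers", "broker1:9092");
pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(pp);
producer.send(new ProducerRecord<>("clicks", "user42", "page:/home"));
producer.close();

// consumer side: subscribe and pull, behaving like a never-ending iterator
Properties cp = new Properties();
cp.put("bootstrap.servers", "broker1:9092");
cp.put("group.id", "analytics");
cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
Consumer<String, String> consumer = new KafkaConsumer<>(cp);
consumer.subscribe(Collections.singletonList("clicks"));
while (true) {
    for (ConsumerRecord<String, String> r : consumer.poll(100))
        System.out.println(r.key() + " -> " + r.value());
}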

Producer-Consumer Distributed Coordination

Each producer can publish a message to either a randomly selected partition or a partition semantically determined by a partitioning key and a partitioning function. Kafka implements the concept of consumer groups, consisting of one or more consumers that have subscribed to a set of topics. Different consumer groups independently consume the full set of subscribed messages without communication among groups, since they can be in different processes or machines. The consumer starts a thread to pull data from each owned partition, starting the iterator over the stream from the offset stored in the offset registry. As messages get pulled from a partition, the consumer periodically updates the latest consumed offset in the offset registry. When there are multiple consumers within a group, each will be notified of a broker or consumer change. It is possible for a consumer to try to take ownership of a partition still owned by another consumer. In this case, the first consumer releases all the partitions that it currently owns and retries the re-balance process after a certain timeout. In this architecture, there is no master node, since consumers coordinate among themselves in a decentralized manner.

3.3.11 Google TensorFlow

Google TensorFlow [2] is a framework specifically designed for machine learning applications, where the data model consists of multidimensional arrays called tensors, and a program is a composition of operators processing tensors. A TensorFlow application is built as a functional-style expression, where each sub-expression can be given an explicit name. The TensorFlow programming model includes control flow operations and, notably, synchronization primitives (e.g., MutexAcquire and MutexRelease for critical sections). This latter observation implies that TensorFlow exposes the underlying (parallel) execution model to the user, who has to program any needed coordination of operators concurring over some global state. TensorFlow allows expressing conditionals and loops by means of specific control flow operators, such as the For operator. Each node has firing rules that depend on the kind of incoming tokens. For example, control dependency edges can carry synchronization tokens: the target node of such edges cannot execute until all appropriate synchronization signals have been received. TensorFlow replicates actors implementing certain operators (e.g., tensor multiplication) on tensors (input tokens). Hence, each actor is a data-parallel actor operating on intra-task independent input elements—here, multi-dimensional arrays (tensors). Moreover, iterative actors/hierarchical actors (in case of cycles on a subgraph) are implemented with tags similar to the MIT Tagged-Token dataflow machine [28], where the iteration state is identified by a tag and independent iterations are executed in parallel. In iterative computations, it is worthwhile to underline that in TensorFlow an input can enter a loop iteration whenever it becomes available (as in Naiad), instead of imposing barriers after each iteration, as done, for instance, in Flink.

A TensorFlow application

Listing 3.9 presents a code snippet of a TensorFlow application generating the dataflow graph in Figure 3.2. The example computes the rectified linear unit function for neural network activations.

import tensorflow as tf
b = tf.Variable(tf.zeros([100]))                    # 100-d vector, init to zeroes
W = tf.Variable(tf.random_uniform([784,100],-1,1))  # 784x100 matrix w/ rnd vals
x = tf.placeholder(name="x")                        # Placeholder for input
relu = tf.nn.relu(tf.matmul(W, x) + b)              # Relu(Wx+b)
C = [...]                                           # Cost as a function of Relu
s = tf.Session()
for step in xrange(0, 10):
    input = ...construct 100-D input array ...      # Create 100-d vector for input
    result = s.run(C, feed_dict={x: input})         # Fetch cost, feeding x=input
    print step, result

Listing 3.9: Example of a Python TensorFlow code fragment in [2].

Figure 3.2: A TensorFlow application graph

Figure 3.2 shows a TensorFlow application graph example from [2]. A node of the graph represents a tensor operation, which can also be a data generation one (nodes W, b, x): operators are mapped to actors that take as input single tokens representing Tensors (multi-dimensional arrays), and are activated once, except for iterative computations.

3.3.12 Machine Learning and Deep Learning Frameworks

Although we described only TensorFlow in our overview, several other Machine and Deep Learning frameworks are present in the community. In the following, we summarize common characteristics, from both the data and the programming model perspective, of some well-known frameworks such as Caffe, Torch and Theano. We do not provide a comprehensive list, since a focus on Machine and Deep Learning is out of the scope of this overview. A comparative study of such frameworks can be found in the work of Bahrampour et al. [30]. Caffe [82] is a framework for Deep Learning providing the user with state-of-the-art algorithms and a collection of reference models. It is implemented in C++ with Python and MATLAB interfaces. Furthermore, it provides support for offloading computations onto GPUs. Caffe was first designed for vision, but it has been adopted and improved by users in speech recognition, robotics, neuroscience, and astronomy. Its data model relies on the Blob, a multi-dimensional array used as a wrapper over the actual data being processed, which also provides synchronization capability between the CPU and the GPU. From an implementation perspective, a Blob is an N-dimensional array stored in a C-contiguous fashion. Blobs provide a unified memory interface holding data, e.g., batches of images, model parameters, and derivatives for optimization. At a higher level, Caffe considers applications as Dataflow graphs. Each program implements a Deep Network (or simply net): compositional models that are naturally represented as a collection of inter-connected layers that work on chunks of data. Caffe implements Networks through the concept of a net, that is, a set of layers connected in a directed acyclic graph (DAG). Caffe does all the bookkeeping for any DAG of layers to ensure correctness of the forward and backward passes. A typical net begins with a data layer that loads from disk and ends with a loss layer that computes the objective for a task such as classification or reconstruction. Theano [123] is a Python library that allows defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays, called tensors, which are the basic data model for neural network and deep learning programming. Like other frameworks, it provides support for kernel offloading on GPUs. Its programming model is similar to TensorFlow and, at a higher level, applications are represented by directed acyclic graphs (DAGs): these graphs are composed of interconnected Apply, Variable and Op nodes, the latter representing the application of an Op to some Variable. Torch [51] is a machine learning framework implemented in Lua that runs on the LuaJIT compiler. Torch also implements CUDA and CPU backends. The core data model is represented by Tensors, which extend the basic set of types in the Lua programming language to provide an efficient multi-dimensional array type. This Tensor library provides classic operations including linear algebra operations, implemented in C, leveraging SSE instructions on Intel platforms. Optionally, it can bind linear algebra operations to existing efficient BLAS/Lapack implementations (like Intel MKL). Torch provides a module for building neural networks using a Dataflow-based API.

3.4 Fault Tolerance

Fault tolerance is an important aspect to consider when implementing frameworks for analytics. The main benefit of implementing fault tolerance is to guarantee recovery from faults: consider, for instance, when multiple instances of an application are running and one server crashes for some reason. In this case, the system must be able to recover autonomously and consistently from the failure. Of course, enabling fault tolerance decreases application performance. In Big Data analytics frameworks, fault tolerance includes both system failure and data recovery with checkpointing techniques, but also software-level fault tolerance. The latter is implemented within the framework itself and also requires application recovery from system failures, e.g., a remote worker node crashes and part of the computation must be restarted. These guarantees must also hold when running streaming applications. Real-time stream processing systems must be always operational, which requires them to recover from all kinds of failures in the system, including the need to store and recover a certain amount of data already processed by the streaming application. Consider, for instance, Spark Streaming’s management of fault tolerance. In a Spark Streaming application, data received from sources like Kafka and Flume is buffered in the executor’s memory until its processing has completed. If the driver fails, all workers and executors fail as well, and data cannot be recovered even if the driver is restarted. To avoid this data loss, Write Ahead Logs have been introduced. Write Ahead Logs (also known as journals) are used in database and file systems to ensure the durability of any data operation. The intention of the operation is first stored into a durable log, and then the operation is applied to the data. If the system fails in the middle of applying the operation, it can be recovered and re-executed by reading the log. It is worth noting that such operations increase the application execution time and decrease performance and resource utilization, but when running on distributed systems fault tolerance has to be guaranteed.
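As a hedged illustration of the Spark Streaming case, the receiver-side Write Ahead Log is enabled through a configuration flag, together with a checkpoint directory on a fault-tolerant file system (application name, path and batch interval below are illustrative; imports omitted):

SparkConf conf = new SparkConf().setAppName("FaultTolerantApp")
    // persist received blocks to the write ahead log before acknowledging them
    .set("spark.streaming.receiver.writeAheadLog.enable", "true");
JavaStreamingContext jssc =
    new JavaStreamingContext(conf, Durations.seconds(1));
jssc.checkpoint("hdfs://.../checkpoints"); // metadata and data checkpointing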

3.5 Summary

In this Chapter we provided a definition of Big Data and a description of the most common tools for analytics and data management. Of course this list does not cover all tools, since it would be a very hard task to collect all of them in a single place and it goes beyond the scope of this work. Nevertheless, we believe that the descriptions we provided can help the reader get a good overview of the topic and of the results of the effort of computer scientists in the community.

Chapter 4

High-Level Model for Big Data Frameworks

In this chapter we analyze some well-known tools—Spark, Storm, Flink and Google Dataflow—and provide a common structure underlying all of them, based on the Dataflow model [91], which we identified as the common model that best describes all levels of abstraction, from the user-level API to the execution model. We instantiate the Dataflow model into a stack of layers where each layer represents a dataflow graph/model with a different meaning, describing a program from what is exposed to the programmer down to the underlying execution model layer.

4.1 The Dataflow Layered Model

With the increasing number of Big Data analytics tools, we witness a continuous fight among implementors/vendors in demonstrating how their tools are better than others in terms of performance and expressiveness. In this hype, for a user approaching Big Data analytics (even an educated computer scientist), it might be difficult to have a clear picture of the programming model underneath these tools and of the expressiveness they provide to solve some user-defined problem. With this in mind, we wanted to understand the features those tools provide to the user in terms of API and how they are related to parallel computing paradigms. We use the Dataflow model to describe these tools since it is expressive enough to describe the batch, micro-batch and streaming models that are implemented in most tools for Big Data processing. Being all realized under the same common idea, we show how various Big Data analytics tools share almost the same base concepts, differing mostly in their implementation choices. Furthermore, we draw attention to a problem arising from the high abstraction provided by the model, which is reflected in the examined tools. Especially when considering stream processing and state management, non-determinism may arise when processing one or more streams in one node of the graph, which is a well-known problem in parallel and distributed computing. In Section 4.1.1, we outline an architecture that can describe all these models at different levels of abstraction (Fig. 4.1), from the (top) user-level API to the (bottom-level) actual network of processes. In particular, we show how the Dataflow model is general enough to subsume many different levels only by changing the semantics of actors and channels.

Figure 4.1: Layered model representing the levels of abstraction provided by the frameworks that were analyzed. From top to bottom: Framework API (user-level API); Program Semantics Dataflow (semantics of the application in terms of dataflow graphs); Parallel Execution Dataflow (instantiation of the semantic dataflow that explicitly expresses parallelism); Process Network Dataflow (runtime execution model, e.g., Master-Workers); Platform (runtime language or platform, e.g., JVM).

4.1.1 The Dataflow Stack

The layered model shown in Fig. 4.1 presents five layers, where the three intermediate layers are Dataflow models with different semantics, as described in the paragraphs below. Underneath these three layers is the Platform level, that is, the runtime system or language used to implement a given framework (e.g., Java and Scala in Spark), a level which is beyond the scope of this work. On top is the Framework API level, which describes the user API on top of the Dataflow graph and will be detailed in Section 4.2. The three Dataflow models in between are as follows.

• Program Semantics Dataflow: We claim the API exposed by any of the considered frameworks can be translated into a Dataflow graph. The top level of our layered model captures this translation: programs at this level represent the semantics of data-processing applications in terms of Dataflow graphs. Programs at this level do not explicitly express any form of parallelism: they only express data dependencies (i.e., edges) among program components (i.e., actors). This aspect is covered in Section 4.3.

• Parallel Execution Dataflow: This level, covered in Section 4.4, represents an instantiation of the semantic dataflows in terms of processing elements (i.e., actors) connected by data channels (i.e., edges). Independent units—not connected by a channel—may execute in parallel. For example, a semantic actor can be replicated to express data parallelism, so that the given function can be applied to independent input data.

• Process Network Dataflow: This level, covered in Section 4.5, describes how the program is effectively deployed and executed onto the underlying platform. Actors are concrete computing entities (e.g., processes) and edges are communication channels. The most common approach is for the actual network to be a Master-Workers task executor.

4.2 Programming Models

Data-processing applications are generally divided into batch vs. stream processing. Batch programs process one or more finite datasets to produce a resulting finite output dataset, whereas stream programs process possibly unbounded sequences of data, called streams, doing so in an incremental manner. Operations over streams may also have to respect a total data ordering—for instance, to represent time ordering. Orthogonally, we divide the frameworks' user APIs into two categories: declarative and topological. Spark, Flink and Google Dataflow belong to the first category—they provide batch or stream processing in the form of operators over collections or streams—whereas Storm, Naiad and Dryad belong to the second one—they provide an API explicitly based on building graphs.

4.2.1 Declarative Data Processing

This model provides building blocks as data collections and operations on those collections. The data model follows domain-specific operators, for instance, operators that work on data structured with the key-value model. Declarative batch processing applications are expressed as methods on objects representing collections (e.g., Spark, Google Dataflow and Flink) or as functions on values (e.g., PCollections in Google Dataflow): these are algebras on finite datasets, whose data can be ordered or not. APIs with such operations expose a functional-like style. Here are three examples of operations with their (multiset-based) semantics:1

groupByKey(a) = {(k, {v : (k, v) ∈ a})}    (4.1)

join(a, b) = {(k, (va, vb)) : (k, va) ∈ a ∧ (k, vb) ∈ b}    (4.2)

map⟨f⟩(a) = {f(v) : v ∈ a}    (4.3)

1 Here, {·} denotes multisets rather than sets.

The groupByKey unary operation groups tuples sharing the same key (i.e., the first field of the tuple); thus it maps multisets of type (K × V)∗ to multisets of type (K × V∗)∗. The binary join operation merges two multisets by coupling values sharing the same key. Finally, the unary higher-order map operation applies the kernel function f to each element in the input multiset.
Declarative stream processing programs are expressed in terms of an algebra on possibly unbounded data (i.e., the stream as a whole) where data ordering may matter. Data is usually organized in tuples having a key field, used for example to express the position of each stream item with respect to a global order (a global timestamp) or to partition streams into substreams. For instance, this allows expressing relational algebra operators and data grouping. In a stream processing scenario, we also have to consider two important aspects, state and windowing; those are discussed in Section 4.2.3.
Apache Spark implements batch programming with a set of operators, called transformations, that are uniformly applied to whole datasets called Resilient Distributed Datasets (RDDs) [131], which are immutable multisets. For stream processing, Spark provides an extension through the Spark Streaming module, offering a high-level abstraction called discretized stream or DStream [133]. Such streams represent results as continuous sequences of RDDs of the same type, called micro-batches. Operations over DStreams are "forwarded" to each RDD in the DStream, thus the semantics of operations over streams is defined in terms of batch processing according to the simple translation op(a) = [op(a1), op(a2), ...], where [·] refers to a possibly unbounded ordered sequence, a = [a1, a2, ...] is a DStream, and each item ai is a micro-batch of type RDD.
Apache Flink's main focus is on stream programming. The abstraction used is the DataStream, which is a representation of a stream as a single object. Operations are composed (i.e., pipelined) by calling operators on DataStream objects. Flink also provides the DataSet type for batch applications, which identifies a single immutable multiset—a stream of one element. A Flink program, either for stream or batch processing, is a term from an algebra of operators over DataStreams or DataSets, respectively. Stateful stream operators and iterative batch processing are discussed in Section 4.2.3.
Google Dataflow [6] provides a unified programming and execution model for stream, batch and micro-batch processing. The base entity is the Pipeline, representing a data processing job consisting of a set of operations that can read a source of input data, transform that data, and write out the resulting output. Its programming model is based on three main entities: Pipeline, PCollection and Transformation. Transformations are basically Pipeline stages, operating on PCollections, which are private to each Pipeline. A PCollection represents a potentially large, immutable bag of elements, which can be either bounded or unbounded. Elements in a PCollection can be of any type, but the type must be consistent. One of the main Transformations in Google Dataflow is the ParDo operation, used for generic parallel processing. The argument provided to ParDo must be a subclass of a specific type provided by the Dataflow SDK, called DoFn. ParDo takes each element in an input PCollection, performs some user-defined processing function on that element, and then emits zero, one, or multiple elements to an output PCollection. The function provided is invoked independently, and in parallel, on multiple worker instances in a Dataflow job.
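To make the multiset semantics of Equations 4.1–4.3 more concrete, the following minimal C++ sketch (purely illustrative, not code from any of the frameworks above; Multiset, groupByKey and map_kernel are names chosen here) models a multiset as an unordered vector and implements groupByKey and map accordingly.

```cpp
#include <map>
#include <utility>
#include <vector>

// A multiset of key-value pairs, modeled as an unordered sequence of pairs.
template <typename K, typename V>
using Multiset = std::vector<std::pair<K, V>>;

// groupByKey (Eq. 4.1): maps a multiset of type (K x V)* to one of type (K x V*)*.
template <typename K, typename V>
std::vector<std::pair<K, std::vector<V>>> groupByKey(const Multiset<K, V>& a) {
  std::map<K, std::vector<V>> groups;
  for (const auto& kv : a) groups[kv.first].push_back(kv.second);
  return std::vector<std::pair<K, std::vector<V>>>(groups.begin(), groups.end());
}

// map<f> (Eq. 4.3): applies the kernel function f to each element of the multiset.
template <typename T, typename F>
auto map_kernel(const std::vector<T>& a, F f) -> std::vector<decltype(f(a[0]))> {
  std::vector<decltype(f(a[0]))> out;
  out.reserve(a.size());
  for (const auto& v : a) out.push_back(f(v));
  return out;
}
```

The join of Equation 4.2 could be sketched along the same lines as a nested loop over the two multisets, pairing values that share the same key.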

4.2.2 Topological Data Processing

Topological programs are expressed as graphs, built by explicitly connecting processing nodes and specifying the code executed by the nodes. Apache Storm is a framework that only targets stream processing. Storm's programming model is based on three key notions: Spouts, Bolts, and Topologies. A Spout is a source of a stream, that is (typically) connected to a data source or that can generate its own stream. A Bolt is a processing element, so it processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into Bolts, such as functions, filters, streaming joins or streaming aggregations. A Topology is the composition of Spouts and Bolts resulting in a network. Storm uses tuples as its data model, that is, named lists of values of arbitrary type. Hence, Bolts are parametrized with per-tuple kernel code. Each time a tuple is available from some input stream, the kernel code gets activated to work on that input tuple. Bolts and Spouts are locally stateful, as we discuss in Section 4.2.3, while no globally consistent state is supported. Yet, globally stateful computations can be implemented since the kernel code of Spouts and Bolts is arbitrary. However, any global state management would be the sole responsibility of the user, who has to be aware of the underlying execution model in order to ensure program coordination among Spouts and Bolts. It is also possible to define cyclic graphs by way of feedback channels connecting Bolts. While Storm targets single-tuple granularity in its base interface, the Trident API is an abstraction that provides declarative stream processing on top of Storm. Namely, Trident processes streams as a series of micro-batches belonging to a stream considered as a single object.
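The contrast with the declarative style can be made explicit with a small C++ sketch (illustrative only; this is not the Storm API, which is Java-based, and Node, connect and process are names chosen here): nodes are created explicitly and then wired into a graph, which is exactly what the topological APIs ask of the programmer.

```cpp
#include <functional>
#include <string>
#include <vector>

// A processing node: its kernel consumes one input tuple and may emit
// any number of output tuples, which are forwarded to the downstream nodes.
struct Node {
  std::string name;
  std::function<std::vector<std::string>(const std::string&)> kernel;
  std::vector<Node*> downstream;   // edges are wired explicitly by the user
};

// Explicit wiring, as opposed to the implicit chaining of declarative APIs.
void connect(Node& from, Node& to) { from.downstream.push_back(&to); }

// Push one tuple through the graph (single-threaded, depth-first sketch).
void process(Node& n, const std::string& tuple) {
  std::vector<std::string> outputs = n.kernel(tuple);
  for (const std::string& out : outputs)
    for (Node* next : n.downstream) process(*next, out);
}
```

In this reading, a Spout-like node would ignore its input and generate tuples from an external feed, while Bolt-like nodes carry the per-tuple processing kernels.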

4.2.3 State, Windowing and Iterative Computations

Frameworks providing stateful stream processing make it possible to express modifications (i.e., side-effects) to the system state that will be visible at some future point. If the state of the system is global, then it can be accessed by all system components. Another example of global state is the one managed in Naiad via tagged tokens in its Timely Dataflow model. On the other hand, local state can be accessed only by a single system component. For example, the mapWithState functional in the Spark Streaming API realizes a form of local state, in which successive executions of the functional see the modifications to the state made by previous ones. Furthermore, state can be partitioned by shaping it as a tuple space, following, for instance, the aforementioned key-value paradigm. Google Dataflow provides a form of state in the form of side-inputs, provided as additional inputs to a ParDo transform. A side input is a local state accessible by the DoFn object each time it processes an element in the input PCollection. Once specified, this side input can be read from within the ParDo transform's DoFn while processing each element.
Windowing is another concept provided by many stream processing frameworks. A window is informally defined as an ordered subset of items extracted from the stream. The most common form of windowing is referred to as a sliding window and it is characterized by its size (how many elements fall within the window) and its sliding policy (how items enter and exit the window). Spark provides the simplest abstraction for defining windows, since they are just micro-batches over the DStream abstraction where it is possible to define only the window length and the sliding policy. Storm and Flink allow more arbitrary kinds of grouping, producing windows of Tuples and WindowedStreams, respectively. Notice this does not break the declarative or topological nature of the considered frameworks, since it just changes the type of the processed data. Notice also that windowing can be expressed in terms of stateful processing, by considering window-typed state. Google Dataflow provides three kinds of windowing patterns: fixed windows, defined by a static window size (e.g., per hour, per second); sliding windows, defined by a window size and a sliding period (the period may be smaller than the size, to allow overlapping windows); and session windows, which capture some period of activity over a subset of data partitioned by a key and are typically defined by a timeout gap. When used with partitioning (e.g., grouping by key), Grouping transforms consider each PCollection on a per-window basis: GroupByKey, for example, implicitly groups the elements of a PCollection by key first and then by window. Windowing can also be used with bounded PCollections; in this case it considers only the implicit timestamps attached to each element of a PCollection, and data sources that create fixed data sets assign the same timestamp to every element, so that all elements are by default part of a single, global window, causing execution in classic MapReduce batch style.
Finally, we consider another common concept in batch processing, namely iterative processing. In Flink, iterations are expressed as the composition of arbitrary DataSet values by iterative operators, resulting in a so-called IterativeDataSet. Component DataSets represent, for example, step functions—executed in each iteration—or termination conditions—evaluated to decide whether the iteration has to be terminated.
Spark's iteration model is radically simpler, since no specific construct is provided to implement iterative processing. Instead, an RDD (endowed with transformations) can be embedded into a plain sequential loop. Google Dataflow's iteration model is similar to Spark's, but it is more limited: it is possible to implement iterative algorithms only if a fixed and known number of iterations is provided, while it may not be easy to implement algorithms where the pipeline's execution graph depends on the data itself. This happens because of the computation graph built in the intermediate language, which is optimized before being executed. With iterative algorithms, the complete graph structure cannot be known in advance, so it is necessary to know the number of iterations in advance in order to replicate the subset of stages in the pipeline involved in the iterative step.
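As a sketch of this loop-based iteration style (illustrative C++, not the Spark or Google Dataflow API; step and iterate are hypothetical names), an immutable collection is simply rebuilt at every round of a plain host-language loop driven by a known iteration count:

```cpp
#include <vector>

// One iteration step: builds a new (immutable) collection from the previous one,
// mimicking a transformation applied to an RDD-like dataset.
std::vector<double> step(const std::vector<double>& data) {
  std::vector<double> next;
  next.reserve(data.size());
  for (double x : data) next.push_back(x * 0.5);  // placeholder kernel
  return next;
}

std::vector<double> iterate(std::vector<double> data, int n_iterations) {
  // The driver decides the number of iterations; the framework only sees
  // a sequence of independent transformations, one per loop iteration.
  for (int i = 0; i < n_iterations; ++i) data = step(data);
  return data;
}
```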

Figure 4.2: Functional Map and Reduce dataflow expressing data dependencies (actors m and r; tokens A, m(A) and b).

4.3 Program Semantics Dataflow

This level of our layered model provides a Dataflow representation of the program semantics. Such a model describes the application using operators and data dependencies among them, thus creating a topological view common to all frameworks. This level does not explicitly express parallelism: instead, parallelism is implicit through the data dependencies among actors (i.e., among operators), so that operators which have no direct or indirect dependencies can be executed concurrently.

4.3.1 Semantic Dataflow Graphs

A semantic Dataflow graph is a pair G = ⟨V, E⟩ where actors V represent operators, channels E represent data dependencies among operators and tokens represent data to be processed. For instance, consider a map function m followed by a reduce function r on a collection A and its result b, represented as the functional composition b = r(m(A)). This is represented by the graph in Fig. 4.2, which shows the semantic dataflow of a simple map-reduce program. Notice that the translation of the user program into the semantic dataflow can be subject to further optimization. For instance, two or more non-intensive kernels can be mapped onto the same actor to reduce resource usage.
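The following small C++ sketch (illustrative only, no framework code; SemanticDataflow and its members are names chosen here) encodes such a semantic dataflow graph as actors plus data-dependency edges, and builds the two-actor map-reduce graph of Fig. 4.2:

```cpp
#include <string>
#include <utility>
#include <vector>

// A semantic dataflow graph G = <V, E>: actors are operators, edges are
// data dependencies; tokens (the collections) are not represented here.
struct SemanticDataflow {
  std::vector<std::string> actors;         // V
  std::vector<std::pair<int, int>> edges;  // E, as (from, to) actor indices

  int add_actor(const std::string& op) {
    actors.push_back(op);
    return static_cast<int>(actors.size()) - 1;
  }
  void add_edge(int from, int to) { edges.emplace_back(from, to); }
};

// b = r(m(A)): a map actor followed by a reduce actor (Fig. 4.2).
SemanticDataflow map_reduce_graph() {
  SemanticDataflow g;
  int m = g.add_actor("map");
  int r = g.add_actor("reduce");
  g.add_edge(m, r);  // the output of map is the input of reduce
  return g;
}
```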

Figure 4.3: A Flink JobGraph (4.3a); Spark DAG of the WordCount application (4.3b).

Notably, the Dataflow representation we propose is adopted by the considered frameworks as a pictorial representation of applications. Fig. 4.3b shows the semantic dataflow—called application DAG in Spark—related to the WordCount application, having as operations (in order): 1. reading from a text file; 2. a flatMap operator splitting the file into words; 3. a map operator that maps each word into a key-value pair (w, 1); 4. a reduceByKey operator that counts occurrences of each word in the input file. Note that the DAG is grouped into stages (namely, Stages 0 and 1), which divide the map and reduce phases. This distinction is related to the underlying parallel execution model and will be covered in Section 4.4. Flink also provides a semantic representation—called JobGraph or condensed view—of the application, consisting of operators (JobVertex) and intermediate results (IntermediateDataSet, representing data dependencies among operators). Fig. 4.3a presents a small example of a JobGraph.

Figure 4.4: Example of a Google Dataflow Pipeline merging two PCollections.

Fig. 4.4 reports the semantic dataflow representing an application where, after branching into two PCollections, namely A and B, the pipeline merges the two together into a single PCollection. This PCollection contains all names that begin with either A or B. The two PCollections are merged into a final PCollection after the Flatten operator has merged PCollections of the same type. We can highlight two aspects in this example. The first one is the branching node: this node's semantics is to broadcast elements of the PCollection to the A-extractor and B-extractor nodes, thus duplicating the input collection. The second aspect is the merging node, namely the Flatten node: it is a transform in the Google Dataflow SDK that merges multiple PCollections, having a from-all input policy and producing a single output value.

4.3.2 Tokens and Actors Semantics

Although the frameworks provide a similar semantic expressiveness, some differences are visible regarding the meaning of tokens flowing across channels and how many times actors are activated. When mapping a Spark program, tokens represent RDDs and DStreams for batch and stream processing, respectively. Actors are operators—either transformations or actions in Spark nomenclature—that transform data or return values (in-memory collections or files). Actors are activated only once in both batch and stream processing, since each collection (either RDD or DStream) is represented by a single token. In Flink the approach is similar: actors are activated only once in all scenarios except in iterative algorithms, as we discuss in Section 4.3.3. Tokens represent DataSets and DataStreams that identify whole datasets and streams, respectively. Google Dataflow follows an approach similar to Spark, in which tokens represent PCollections in both batch and stream processing. Actors are represented by Transforms accepting PCollections as input and producing PCollections as output. Storm is different since tokens represent a single item (called Tuple) of the stream. Consequently, actors, representing (macro) dataflow operators, are activated each time a new token is available. From the discussion above, we can note that Storm's actors follow a from-any policy for consuming input tokens, while the other frameworks follow a from-all policy as in the basic Dataflow model. In all the considered frameworks, output tokens are broadcast onto all channels going out of a node.

4.3.3 Semantics of State, Windowing and Iterations

In Section 4.2.3 we introduced stateful, windowing and iterative processing as convenient tools provided by the considered frameworks. From a Dataflow perspective, stateful actors represent an extension to the basic model—as we sketched in Section 2.3.2—only in the case of global state. In particular, globally-stateful processing breaks the functional nature of the basic Dataflow model, preventing, for instance, reasoning in pure functional terms about program semantics (cf. Section 4.6). Conversely, locally-stateful processing can be emulated in terms of the pure Dataflow model, as discussed in [91]. As a direct consequence, windowing is not a proper extension since windows can be stored within each actor's local state [60]. However, the considered frameworks treat windowing as a primitive concept. This can be easily mapped to the Dataflow domain by just considering tokens of proper types. Finally, iterations can be modeled by inserting loops in semantic dataflows. In this case, each actor involved in an iteration is activated each time a new token is available and the termination condition is not met. This implementation of iterative computations is similar to the hierarchical actors of Lee & Parks [91], used to encapsulate subgraphs modeling iterative algorithms.

4.4 Parallel Execution Dataflow

This level represents parallel implementations of semantic dataflows. As in the previous section, we start by introducing the approach and then we describe how the various frameworks instantiate it and what consequences this brings to the runtime. The most straightforward source of parallelism comes directly from the Dataflow model, namely, independent actors can run in parallel. Furthermore, some actors can be replicated to increase parallelism by making replicas work over a partition of the input data—that is, by exploiting full data parallelism. This is the case, for instance, of the map operator defined in Section 4.2.1. Both the above schemas are referred to as embarrassingly parallel processing, since there are no dependencies among actors. Note that introducing data parallelism requires partitioning input tokens into sub-tokens, distributing those to the various worker replicas, and then aggregating the resulting sub-tokens into an appropriate result token—much like scatter/gather operations in message passing programs. Finally, in the case of dependent actors that are activated multiple times, parallelism can still be exploited by letting tokens "flow" as soon as each activation is completed. This well-known schema is referred to as stream/pipeline parallelism. Fig. 4.5 shows a parallel execution dataflow for the MapReduce semantic dataflow in Fig. 4.2. In this example, the dataset A is divided in 8 independent partitions and the map function m is executed by 8 actor replicas; the reduce phase is then executed in parallel by actors enabled by the incoming tokens (namely, the results) from their "producer" actors.

Figure 4.5: MapReduce execution dataflow with the maximum level of parallelism reached by eight map instances, whose outputs are combined by a tree of reduce actors.
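The scatter/gather behaviour described above can be sketched with plain C++11 threads (illustrative only; this is not the runtime of any of the considered frameworks, and parallel_map_reduce is a name chosen here): the input token is partitioned into sub-tokens, one map replica processes each partition, and the partial results are then reduced.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Scatter/gather sketch: the input is partitioned among n_replicas map actors,
// each replica computes a partial result, and a final reduce combines them.
long parallel_map_reduce(const std::vector<long>& input, int n_replicas) {
  std::vector<long> partial(n_replicas, 0);
  std::vector<std::thread> workers;
  const std::size_t chunk = (input.size() + n_replicas - 1) / n_replicas;

  for (int r = 0; r < n_replicas; ++r) {
    workers.emplace_back([&, r] {
      const std::size_t begin = r * chunk;
      const std::size_t end = std::min(input.size(), begin + chunk);
      for (std::size_t i = begin; i < end; ++i)
        partial[r] += input[i] * input[i];   // map kernel m, folded locally
    });
  }
  for (auto& w : workers) w.join();          // gather: wait for all map replicas
  // Reduce phase r over the partial results (here, a plain sum).
  return std::accumulate(partial.begin(), partial.end(), 0L);
}
```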

Figure 4.6: Spark execution DAG (a) and Flink Execution Graph (b).

Spark identifies its parallel execution dataflow by a DAG such as the one shown in Fig. 4.6a, which is the input of the DAG Scheduler entity. This graph illustrates two main aspects: first, the fact that many parallel instances of actors are created for each function and, second, that the actors are grouped into so-called Stages, which are executed in parallel if and only if there is no dependency among them. Stages can be considered as the hierarchical actors in [91]. The grouping of actors into stages brings another strong consequence, derived from the implementation of the Spark runtime: each stage that depends on one or more previous stages has to wait for their completion before execution. The depicted behavior is analogous to the one encountered in the Bulk Synchronous Parallelism (BSP) paradigm [128]. In a BSP algorithm, as well as in a Spark application, a computation proceeds in a series of global supersteps consisting of: 1) concurrent computation, in which each actor executes its business code on its own partition of data; 2) communication, where actors exchange data between themselves if necessary (the so-called shuffle phase); 3) barrier synchronization, where actors wait until all other actors have reached the same barrier.
Flink transforms the JobGraph (Fig. 4.3a) into the Execution Graph [42] (Fig. 4.6b), in which the JobVertex (a hierarchical actor) is an abstract vertex containing a certain number of ExecutionVertexes (actors), one per parallel sub-task. A key difference compared to the Spark execution graph is that a dependency does not represent a barrier among actors or hierarchical actors: instead, tokens are effectively pipelined and actors can be fired concurrently. This is a natural implementation for stream processing, but in this case, since the runtime is the same, it applies to batch processing applications as well. Conversely, iterative processing is implemented according to the BSP approach: one evaluation of the step function on all parallel instances forms a superstep (again a hierarchical actor), which is also the granularity of synchronization; all parallel tasks of an iteration need to complete the superstep before the next one is initiated, thus behaving like a barrier between iterations.
Storm creates an environment for the execution dataflow similar to the other frameworks. Each actor is replicated to increase the inter-actor parallelism and each group of replicas is identified by the name of the Bolt/Spout of the semantic dataflow they originally belong to, thus instantiating a hierarchical actor. Each of these actors (actor groups) represents data-parallel tasks without dependencies. Since Storm is a stream processing framework, pipeline parallelism is exploited. Hence, while an actor is processing a token (tuple), an upstream actor can process the next token concurrently, increasing both data parallelism within each actor group and task parallelism among groups.
In Google Dataflow the execution of a Dataflow program is typically launched on the Google Cloud Platform. When run, the Dataflow runtime enters the Graph Construction Time phase, in which it creates an execution graph on the basis of the Pipeline, including all the Transforms and processing functions. The parallel execution dataflow is similar to the one in Spark and Flink, and parallelism is expressed in terms of data parallelism in Transforms (e.g., the ParDo function) and inter-actor parallelism on independent Transforms. In Google Dataflow nomenclature, this graph is called the Execution Graph. Similarly to Flink, pipeline parallelism is exploited among successive actors.

Table 4.1: Batch processing comparison between Spark, Flink and Google Dataflow.
Graph specification. Spark: implicit, OO-style chaining of transformations; Flink: idem; Google Dataflow: idem.
DAG. Spark: join operation; Flink: idem; Google Dataflow: join and merge operations.
Tokens. Spark: RDD; Flink: DataSet; Google Dataflow: PCollection.
Nodes. Spark: transformations from RDD to RDD; Flink: transformations from DataSet to DataSet; Google Dataflow: transformations from PCollection to PCollection.
Parallelism. Spark: data parallelism in transformations + inter-actor task parallelism, limited by per-stage BSP; Flink: data parallelism in transformations + inter-actor task parallelism; Google Dataflow: idem.
Iteration. Spark: using repetitive sequential executions of the graph; Flink: using iterate & iterateDelta; Google Dataflow: using repetitive sequential executions of the graph.

Table 4.1: Batch processing. on all parallel instances forms a superstep (again a hierarchical actor), which is also the granularity of synchronization; all parallel tasks of an iteration need to complete the superstep before the next one is initiated, thus behaving like a barrier between iterations. Storm creates an environment for the execution dataflow similar to the other frame- works. Each actor is replicated to increase the inter-actor parallelism and each group of replicas is identified by the name of the Bolt/Spout of the semantics dataflow they originally belong to, thus instantiating a hierarchical actor. Each of these actors (actors group) represents data parallel tasks without dependencies. Since Storm is a stream processing framework, pipeline parallelism is exploited. Hence, while an actor is processing a token (tuple), an upstream actor can process the next token concurrently, increasing both data parallelism within each actors group and task parallelism among groups. In Google Dataflow the execution of a Dataflow program is typically launched on the Google Cloud Platform. When run, Dataflow runtime enters the Graph Con- struction Time phase in which it creates an execution graph on the basis of the Pipeline, including all the Transforms and processing functions. The parallel exe- cution dataflow is similar to the one in Spark and Flink, and parallelism is expressed in terms of data parallelism in Transforms (e.g., ParDo function) and inter-actor parallelism on independent Transforms. In Google Dataflow nomenclature, this graph is called Execution Graph. Similarly to Flink, pipeline parallelism is ex- ploited among successive actors. 4.4. Parallel Execution Dataflow 63 Implicit, OO-styletransformations chaining of Transformations from DataStream to DataStream Analogous to Flinklelism+ Batch paral- Stream parallelism between stream items Implicit, OO-styletransformations chaining of Join operationTransformations from DStream to DStream Analogous to Sparklelism Batch paral- Join operation Explicit,bolts Connections between nections Stateful with “arbitrary” emission of output tuples Data parallelism between different bolt instances + Stream parallelism between stream items by bolts Stream processing comparison between Google Dataflow, Storm, Spark and Flink. Table 4.2: Google Dataflow Storm Spark Flink Implicit, OO-styletransformations chaining of Join and Merge operationPCollectionTransformations from PCollection to Multiple PCollection incoming/outgoing con- Data parallelismtions in + transforma- Inter-actor task parallelism Tuple (fine-grain) DStream DataStream Graph specification DAG Tokens Nodes Parallelism 64 Chapter 4. High-Level Model for Big Data Frameworks

Summarizing, in Sections 4.3 and 4.4 we showed how the considered frameworks can be compared under the very same model, from both a semantic and a parallel implementation perspective. The comparison is summarized in Table 4.1 for batch processing and Table 4.2 for stream processing.

4.5 Execution Models

This layer shows how the program is effectively executed, following the process and scheduling-based categorization described in Section 2.3.2.

4.5.1 Scheduling-based Execution

In Spark, Flink and Storm, the resulting process network dataflow follows the Master-Workers pattern, where actors from previous layers are transformed into tasks. Fig. 4.7a shows a representation of the Spark Master-Workers runtime. We will use this structure also to examine Storm and Flink, since the pattern is similar for them: they differ only in how tasks are distributed among workers and how the inter/intra-communication between actors is managed.

The Master has total control over program execution, job scheduling, communications, failure management, resource allocation, etc. The master also relies on a cluster manager, an external service for acquiring resources on the cluster (like Mesos, YARN or Zookeeper). The master is the entity that knows the semantic dataflow representing the current application, while workers are completely agnostic about the whole dataflow: they only obtain the tasks to execute, which represent actors of the execution dataflow the master is running. It is only when the execution is effectively launched that the semantic dataflow is built and eventually optimized to obtain the best execution plan (Flink, Google Dataflow). With this postponed evaluation, the master creates the parallel execution dataflow to be executed. Each framework has its own instance of the master entity: in Spark it is called SparkContext, in Flink it is the JobManager, in Storm it is called Nimbus, and in Google Dataflow it is the Cloud Dataflow Service. In Storm and Flink, data distribution is managed in a decentralized manner, that is, it is delegated to each executor, since they use pipelined data transfers and forward tokens as soon as they are produced. For efficiency, in Flink tuples are collected in a buffer which is sent over the network once it is full or reaches a certain time threshold. In Spark batch processing, data can possibly be dispatched by the master, but typically each worker gets data from a DFS. In Spark Streaming, the master is responsible for data distribution: it discretizes the stream into micro-batches that are buffered into the workers' memory. The master generally keeps track of distributed tasks, decides when to schedule the next tasks, reacts to finished vs. failed tasks, keeps track of the semantic dataflow progress, and orchestrates collective communications and data exchange among workers. This last aspect is crucial when executing the so-called shuffle operation, which implies a data exchange among executors. Since workers do not have any information about each other, to exchange data they have to request information from the master and, moreover, specify that they are ready to send/receive data. In Google Dataflow the Master is represented by the Cloud Dataflow managed service, which deploys and executes the DAG representing the application, built during Graph Construction Time (see Sect. 4.4). Once on the Dataflow service, the DAG becomes a Dataflow Job. The Cloud Dataflow managed service automatically partitions data and distributes the Transforms code to Compute Engine instances (Workers) for parallel processing.

Figure 4.7: Master-Workers structure of the Spark runtime (a) and Worker hierarchy example in Storm (b).

Workers are nodes executing the actor logic, namely, a worker node is a process in the cluster. Within a worker, a certain number of parallel executors are instantiated to execute tasks related to the given application. Workers have no information about the dataflow at any level, since they are scheduled by the master. Nevertheless, the different frameworks use different nomenclatures: in Spark, Storm and Flink cluster nodes are decomposed into Workers, Executors and Tasks. A Worker is a node of the cluster, i.e., a Spark worker instance. A node may host multiple Worker instances. An Executor is a (parallel) process that is spawned in a Worker process and it executes Tasks, which are the actual kernels of actors of the dataflow. Fig. 4.7b illustrates this structure in Storm, an example that would also be valid for Spark and Flink. In Google Dataflow, workers are called Google Compute Engine instances, and are occasionally referred to as Workers or VMs. The Dataflow managed service deploys Compute Engine virtual machines associated with Dataflow jobs using Managed Instance Groups. A Managed Instance Group creates multiple Compute Engine instances from a common template and allows the user to control and manage them as a group. The Compute Engine instances execute both serial and parallel code (e.g., ParDo parallel code) related to a job (parallel execution DAG).
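The Master-Workers structure described in this section can be sketched in a few lines of C++11 (illustrative only, not the actual runtime of Spark, Flink or Storm; MasterWorkers is a hypothetical name): the master owns the task queue and the dataflow knowledge, while workers are agnostic and simply pull the next available task, an on-demand policy whose nondeterminism is discussed in Section 4.6.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// The master holds the tasks; workers know nothing about the dataflow and
// simply pull the next available task (on-demand scheduling).
class MasterWorkers {
 public:
  explicit MasterWorkers(int n_workers) {
    for (int i = 0; i < n_workers; ++i)
      workers_.emplace_back([this] { worker_loop(); });
  }
  void submit(std::function<void()> task) {
    { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
    cv_.notify_one();
  }
  void shutdown() {
    { std::lock_guard<std::mutex> lk(m_); done_ = true; }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

 private:
  void worker_loop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;    // done_ and no work left
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();                          // execute the actor kernel
    }
  }
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool done_ = false;
};
```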

4.5.2 Process-based Execution

In TensorFlow, actors are effectively mapped to threads and possibly distributed on different nodes. The cardinality of the semantic dataflow is preserved, as each actor node is instantiated into one node, and the allocation is decided using a placement algorithm based on cost model optimization. This model is statically estimated based on heuristics or on previous dataflow executions of the same application. The dataflow is distributed on cluster nodes and each node/Worker may host one or more dataflow actors/Tasks, which internally implement data parallelism with a pool of threads/Executors working on Tensors. Communication among actors is done using the send/receive paradigm, allowing workers to manage their own data movement or to receive data without involving the master node, thus decentralizing the logic and the execution of the application. As we have seen in Sec. 3.3, a sort of mixed model is proposed by Naiad. Nodes represent data-parallel computations. Each computer or thread, called a shard, executes the entire dataflow graph. It keeps its fraction of the state of all nodes resident in local memory throughout, as in a scheduling-based execution. Execution occurs in a coordinated fashion, with all shards processing the same node at any time, and graph edges are implemented by channels that route records between shards as required. There is no Master entity directing the execution: each shard is autonomous also in fault tolerance and recovery.

4.6 Limitations of the Dataflow Model

Reasoning about programs using the Dataflow model is attractive since it makes the program semantics independent from the underlying execution model. In particular, it abstracts away any form of parallelism due to its pure functional nature. The most relevant consequence, as discussed in many theoretical works about Kahn Process Networks and similar models—such as Dataflow—is the fact that all computations are deterministic. Conversely, many parallel runtime systems exploit nondeterministic behaviors to provide efficient implementations. For example, consider the Master-Workers pattern discussed in Section 4.5. A naive implementation of the Master node distributes tasks to N Workers according to a round-robin policy—task i goes to worker i (mod N)—which leads to a deterministic process. An alternative policy, generally referred to as on-demand, distributes tasks by considering the load level of each worker, for example, to implement a form of load balancing. The resulting processes are clearly nondeterministic, since the mapping from tasks to workers depends on the relative service times. Non-determinism can be encountered at all levels of our layered model in Fig. 4.1. For example, actors in Storm's topologies consume tokens from incoming streams according to a from-any policy—process a token from any non-empty input channel—thus no assumption can be made about the order in which stream tokens are processed. More generally, the semantics of stateful streaming programs depends on the order in which stream items are processed, which is not specified by the semantics of the semantic dataflow actors in Section 4.3. As a consequence, this prevents reasoning in purely Dataflow—i.e., functional—terms about programs in which actor nodes include arbitrary code in some imperative language (e.g., shared variables).

4.7 Summary

In this chapter we analyzed Spark, Storm, Flink and Google Dataflow by showing the common structure underlying all of them, based on the Dataflow model. We provided a stack of layers that can help the reader understand how the same ideas, even if implemented differently, are common to these frameworks. The proposed stack is composed of the following layers: the Program Semantics Dataflow, representing the semantics of data-processing applications in terms of Dataflow graphs, where no form of parallelism is expressed; the Parallel Execution Dataflow, representing an instantiation of the semantic dataflows in terms of processing elements (for example, a semantic actor can be replicated to express data parallelism, so that the given function can be applied to independent input data); and finally the Process Network Dataflow, which describes how the program is effectively deployed and executed onto the underlying platform.


Chapter 5

PiCo Programming Model

In this Chapter, we propose a new programming model based on Pipelines and operators, which are the building blocks of PiCo programs [65]. In the model we propose, we use the term Pipeline to denote a workflow that processes data collections—rather than a computational process—as is common in the data processing community [72]. The novelty with respect to other frameworks is that all PiCo operators are polymorphic with respect to data types. This makes it possible to 1) re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.); 2) reuse the same operators in different contexts; and 3) update operators without affecting the calling context, i.e., the previous and following stages in the pipeline. Notice that in other mainstream frameworks, such as Spark, updating a pipeline by replacing one transformation with another is not necessarily trivial, since it may require the development of an input and output proxy to adapt the new transformation to the calling context. In the same line, we provide a formal framework (i.e., typing and semantics) that characterizes programs from the perspective of how they transform the data structures they process—rather than the computational processes they represent. This approach allows reasoning about programs at an abstract level, without taking into account any aspect of the underlying execution model or implementation. This chapter proceeds as follows. We formally define the syntax of a program, which is based on Pipelines and operators and hides the data structures produced and generated by the program. We define the Program Semantics layer of the Dataflow stack as it has been defined in [120]. Then we provide the formalization of a minimal type system defining legal compositions of operators into Pipelines. Finally, we provide a semantic interpretation that maps any PiCo program to a functional Dataflow graph, representing the transformation flow followed by the processed collections.

5.1 Syntax

We propose a programming model for processing data collections, based on the Dataflow model. The building blocks of a PiCo program are Pipelines and Operators, which we investigate in this section. Conversely, Collections are not included in the syntax; they are introduced in Section 5.2.1, since they contribute to defining the type system and the semantic interpretation of PiCo programs.

Figure 5.1: Graphical representation of PiCo Pipelines: (a) source, (b) sink, (c) processing, (d) linear to, (e) non-linear to, (f) pair, (g) merge.

Table 5.1: Pipelines.
new op. Behavior: data is processed by operator op (i.e., unary Pipeline). Structural properties: none.
to p p1 . . . pn. Behavior: data from Pipeline p is sent to all Pipelines pi (i.e., broadcast). Structural properties: associativity for linear Pipelines: to (to pA pB) pC ≡ to pA (to pB pC) ≡ pA | pB | pC; destination commutativity: to p p1 . . . pn ≡ to p pπ(1) . . . pπ(n) for any permutation π of 1..n.
pair p1 p2 op. Behavior: data from Pipelines p1 and p2 are pair-wise processed by operator op. Structural properties: none.
merge p1 p2. Behavior: data from Pipelines p1 and p2 are merged, respecting the ordering in case of ordered collections. Structural properties: associativity: merge (merge p1 p2) p3 ≡ merge p1 (merge p2 p3) ≡ p1 + p2 + p3; commutativity: merge p1 p2 ≡ merge p2 p1.

5.1.1 Pipelines

The cornerstone concept in the Programming Model is the Pipeline, basically a DAG-composition of processing operators. Pipelines are built according to the following grammar1:

⟨Pipeline⟩ ::= new ⟨unary-operator⟩
  | to ⟨Pipeline⟩ ⟨Pipeline⟩ . . . ⟨Pipeline⟩
  | pair ⟨Pipeline⟩ ⟨Pipeline⟩ ⟨binary-operator⟩
  | merge ⟨Pipeline⟩ ⟨Pipeline⟩

We categorize Pipelines according to the number of collections they take as input and output:

• A source Pipeline takes no input and produces one output collection

• A sink Pipeline consumes one input collection and produces no output

• A processing Pipeline consumes one input collection and produces one output collection

A pictorial representation of Pipelines is reported in Figure 5.1. We refer to Figs. 5.1a, 5.1b and 5.1c as unary Pipelines, since they are composed of a single operator. Figs. 5.1d and 5.1e represent, respectively, linear (i.e., one-to-one) and branching (i.e., one-to-n) to composition. Figs. 5.1f and 5.1g represent composition of Pipelines by, respectively, pairing and merging. A dotted line means the respective path may be void (e.g., a source Pipeline has a void input path). Moreover, as we show in Section 5.2, Pipelines are not allowed to consume more than one input collection, thus both pair and merge Pipelines must have at least one void input path. The meaning of each Pipeline is summarized in Table 5.1.
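The grammar above translates almost directly into a small abstract-syntax tree; the following C++ sketch (illustrative, not the actual PiCo implementation; all names are chosen here) encodes the four Pipeline constructors:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Abstract syntax of Pipelines: new, to, pair, merge (cf. the grammar above).
struct Pipeline {
  enum class Kind { New, To, Pair, Merge };
  Kind kind;
  std::string op;                                   // operator name (New, Pair)
  std::vector<std::shared_ptr<Pipeline>> children;  // composed sub-Pipelines
};
using P = std::shared_ptr<Pipeline>;

P make(Pipeline::Kind k, std::string op, std::vector<P> cs) {
  return std::make_shared<Pipeline>(Pipeline{k, std::move(op), std::move(cs)});
}

// The four constructors of the grammar.
P new_op(std::string unary_op) { return make(Pipeline::Kind::New, std::move(unary_op), {}); }
P to(P p, std::vector<P> dests) {
  dests.insert(dests.begin(), p);                   // first child is the source Pipeline
  return make(Pipeline::Kind::To, "", std::move(dests));
}
P pair(P p1, P p2, std::string binary_op) { return make(Pipeline::Kind::Pair, std::move(binary_op), {p1, p2}); }
P merge(P p1, P p2) { return make(Pipeline::Kind::Merge, "", {p1, p2}); }
```

For instance, merge(new_op("a"), new_op("b")) builds the merge composition of two unary Pipelines under this encoding.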

5.1.2 Operators

Operators are the building blocks composing a Pipeline. They are categorized according to the following grammar of core operator families:

⟨core-operator⟩ ::= ⟨core-unary-operator⟩ | ⟨core-binary-operator⟩
⟨core-unary-operator⟩ ::= ⟨map⟩ | ⟨combine⟩ | ⟨emit⟩ | ⟨collect⟩
⟨core-binary-operator⟩ ::= ⟨b-map⟩ | ⟨b-combine⟩

The intuitive meanings of the core operators are summarized in Table 5.2.

1 For simplicity, here we introduce the non-terminal unary-operator (resp. binary-operator) that includes core and partitioning unary (resp. binary) operators.

Table 5.2: Core operator families.
map (unary, element-wise; decomposition: no): applies a user function to each element in the input collection.
combine (unary, collective; decomposition: yes): synthesizes all the elements in the input collection into an atomic value, according to a user-defined policy.
b-map (binary, pair-wise; decomposition: yes): the binary counterpart of map: applies a (binary) user function to each pair generated by pairing (i.e., zipping/joining) two input collections.
b-combine (binary, collective; decomposition: yes): the binary counterpart of combine: synthesizes all pairs generated by pairing (i.e., zipping/joining) two input collections.
emit (produce-only; decomposition: no): reads data from a source, e.g., a regular collection, text file, tweet feed, etc.
collect (consume-only; decomposition: no): writes data to some destination, e.g., a regular collection, text file, screen, etc.

In addition to core operators, generalized operators can decompose their input collections by:

• partitioning the input collection according to a user-defined grouping policy (e.g., group by key)

• windowing the ordered input collection according to a user-defined windowing policy (e.g., sliding windows)

The complete grammar of operators follows:

⟨operator⟩ ::= ⟨core-operator⟩ | ⟨w-operator⟩ | ⟨p-operator⟩ | ⟨w-p-operator⟩

where w- and p- denote decomposition by windowing and partitioning, respectively. For those operators op not supporting decomposition (cf. Table 5.2), the following structural equivalence holds: op ≡ w-op ≡ p-op ≡ w-p-op.

Data-Parallel Operators

Operators in the map family are defined according to the following grammar:

⟨map⟩ ::= map f | flatmap f

where f is a user-defined function (i.e., the kernel function) from a host language.2 The former produces exactly one output element from each input element (one-to-one user function), whereas the latter produces a (possibly empty) bounded sequence of output elements for each input element (one-to-many user function), and the output collection is the merging of the output sequences. Operators in the combine family synthesize all the elements from an input collection into a single value, according to a user-defined kernel. They are defined according to the following grammar:

⟨combine⟩ ::= reduce ⊕ | fold+reduce ⊕1 z ⊕2

The former corresponds to the classical reduction, whereas the latter is a two- phase aggregation that consists in the reduction of partial accumulative states (i.e., partitioned folding with explicit initial value). The parameters for the fold+reduce operator specify the initial value for each partial accumulator (z ∈ S, the initial value for the folding), how each input item affects the aggregative state (⊕1 : S × T → S, the folding function) and how aggregative states are combined into a final accumulator (⊕2 : S × S → S, the reduce function).
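One possible reading of the fold+reduce semantics is sketched below in plain C++ (illustrative; the fixed splitting into n_parts partitions stands in for whatever partitioning a runtime would perform, and fold_reduce is a name chosen here): each partition is folded starting from z with ⊕1, and the partial states are then combined with ⊕2, assumed here to admit z as a neutral element.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// fold+reduce over a bounded collection, split into n_parts partitions.
// fold1 : S x T -> S (folding function), z : initial value,
// reduce2 : S x S -> S (combines partial accumulative states).
template <typename T, typename S, typename Fold, typename Reduce>
S fold_reduce(const std::vector<T>& in, S z, Fold fold1, Reduce reduce2, int n_parts) {
  std::vector<S> partials;
  const std::size_t chunk = (in.size() + n_parts - 1) / n_parts;
  for (std::size_t begin = 0; begin < in.size(); begin += chunk) {
    S acc = z;                                       // per-partition initial value
    const std::size_t end = std::min(in.size(), begin + chunk);
    for (std::size_t i = begin; i < end; ++i) acc = fold1(acc, in[i]);
    partials.push_back(acc);
  }
  S result = z;                                      // combine partial states
  for (const S& p : partials) result = reduce2(result, p);
  return result;
}
```

For example, a word-count style keyed sum could use z = 0, fold1 = addition of counts and reduce2 = addition of partial sums.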

Pairing

Operators in the b-map family are intended to be the binary counterparts of map operators:

⟨b-map⟩ ::= zip-map f | join-map f | zip-flatmap f | join-flatmap f

The binary user function f takes as input pairs of elements, one from each of the input collections. The zip- and join- variants correspond to the following pairing policies, respectively:

• zipping of ordered collections produces the pairs of elements with the same position within the order of the respective collections

• joining of bounded collections produces the Cartesian product of the input collections

Analogously, operators in the b-combine family are the binary counterparts of combine operators.

Sources and Sinks

Operators in the emit and collect families model data collection sources and sinks, respectively:

⟨emit⟩ ::= from-file file | from-socket socket | . . .
⟨collect⟩ ::= to-file file | to-socket socket | . . .

2 Note that we treat kernels as terminal symbols, thus we do not define the language in which kernel functions are defined; we rather delegate this aspect to a specific implementation of the model.

Windowing

Windowing is a well-known approach for overcoming the difficulties stemming from the unbounded nature of stream processing. The basic idea is to process parts of some recent stream history upon the arrival of new stream items, rather than store and process the whole stream each time. A windowing operator takes an ordered collection, produces a collection (with the same structure type as the input one) of windows (i.e., lists), and applies the subsequent operation to each window. Windowing operators are defined according to the following grammar, where ω is the windowing policy:

⟨w-operator⟩ ::= w-⟨core-operator⟩ ω

Among the various definitions from the literature, for the sake of simplicity we only consider policies producing sliding windows, characterized by two parameters, namely, a window size |W|—specifying which elements fall into a window—and a sliding factor δ—specifying how the window slides over the stream items. Both parameters can be expressed either in time units (i.e., time-based windowing) or in number of items (i.e., count-based windowing). In this setting, a windowing policy ω is a term (|W|, δ, b) where b is either time or count. A typical case is when |W| = δ, referred to as a tumbling policy. The meaning of the supported windowing policies will be detailed in semantic terms (Section 5.3.1). Although the PiCo syntax only supports a limited class of windowing policies, the semantics we provide is general enough to express other policies such as session windows [72]. As we will show in Section 5.2, we rely on tumbling windowing to extend bounded operators3 and have them deal with unbounded collections; for instance, combine operators are bounded and require windowing to extend them to unbounded collections.

Partitioning

Logically, partitioning operators take a collection, produce a set (one per group) of sub-collections (with the same type as the input one) and apply the subsequent operation to each sub-collection. Partitioning operators are defined according to the following grammar, where π is a user-defined partitioning policy that maps each item to the respective sub-collection:

⟨p-operator⟩ ::= p-⟨core-operator⟩ π

Operators in the combine, b-map and b-combine families support partitioning; for instance, a p-combine produces a bag of values, each being the synthesis of one group; also the natural join operator from relational algebra is a particular case of per-group joining. The decomposition by both partitioning and windowing considers the former as the external decomposition, thus it logically produces a set (one per group) of collections of windows:

⟨w-p-operator⟩ ::= w-p-⟨core-operator⟩ π ω

Algorithm 1 A word-count Pipeline

f = λl. list-map (λw. (w, 1)) (split l)
tokenize = flatmap f

⊕ = λx y. (π1(x), π2(x) + π2(y))
keyed-sum = p-(reduce ⊕) π1

file-read = from-file input-file
file-write = to-file output-file

word-count = new tokenize | new keyed-sum
file-word-count = new file-read | word-count | new file-write

5.1.3 Running Example: The word-count Pipeline

We illustrate a simple word-count Pipeline in Algorithm 1. We assume a hypothetical PiCo implementation where the host language provides some common functions over basic types—such as strings and lists—and a syntax for defining and naming functional transformations. In this setting, the functions f and ⊕ in the example are user-defined kernels (i.e., functional transformations) and:

• split is a host function mapping a text line (i.e., a string) into the list of words occurring in the line

• list-map is a classical host map over lists

• π1 is the left-projection partitioning policy (cf. example below, Section 5.3.1, Definition 3)

The operators have the following meaning:

• tokenize is a flatmap operator that receives lines l of text and produces, for each word w in each line, a pair (w, 1);

• keyed-sum is a p-reduce operator that partitions the pairs based on w (obtained with π1, using group-by-word) and then sums up each group to (w, nw), where w occurs nw times in the input text;

• file-read is an emit operator that reads from a text file and generates a list of lines;

• file-write is a collect operator that writes a bag of pairs (w, nw) to a text file.
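For concreteness, the following self-contained C++ sketch (hypothetical; it mirrors the semantics of Algorithm 1 over an in-memory collection of lines, not the actual PiCo API) shows the effect of the tokenize and keyed-sum stages.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// tokenize: flatmap each line into (word, 1) pairs.
std::vector<std::pair<std::string, int>> tokenize(const std::vector<std::string>& lines) {
  std::vector<std::pair<std::string, int>> out;
  for (const auto& line : lines) {
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.emplace_back(w, 1);      // f = λw.(w, 1) after split
  }
  return out;
}

// keyed-sum: partition by word (π1) and reduce each group with ⊕.
std::map<std::string, int> keyed_sum(const std::vector<std::pair<std::string, int>>& pairs) {
  std::map<std::string, int> counts;
  for (const auto& p : pairs) counts[p.first] += p.second;
  return counts;
}
```

The emit and collect stages (file-read and file-write) would simply read the lines from, and write the resulting (w, nw) pairs to, text files.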

5.2 Type System

Legal Pipelines are defined according to typing rules, described below. We denote the typing relation as a : τ, if and only if there exists a legal inference assigning type τ to the term a.

5.2.1 Collection Types

We mentioned earlier (Section 5.1) that collections are implicit entities that flow across Pipelines through the DAG edges. A collection is either bounded or unbounded; moreover, it is also either ordered or unordered.

3 We say an operator is bounded if it can only deal with bounded collections.

Table 5.3: Operator types.
Unary:
map: Tσ → Uσ, ∀σ ∈ Σ
combine, p-combine: Tσ → Uσ, ∀σ ∈ Σb
w-combine, w-p-combine: Tσ → Uσ, ∀σ ∈ Σo
emit: ∅ → Uσ
collect: Tσ → ∅
Binary:
b-map, p-b-map: Tσ × T′σ → Uσ, ∀σ ∈ Σb
w-b-map, w-p-b-map: Tσ × T′σ → Uσ, ∀σ ∈ Σo

Figure 5.2: Unbounded extension provided by windowing. The rule w- states that, from op : Tσ → Uσ with σ ∈ Σo, we can derive w-op ω : Tσ′ → Uσ′ for any σ′ ∈ Σo.

A combination of the mentioned characteristics defines the structure type of a collection. We refer to each structure type with a mnemonic name:

• a bounded, ordered collection is a list

• a bounded, unordered collection is a (bounded) bag

• an unbounded, ordered collection is a stream

A collection type is characterized by its structure type and its data type, namely, the type of the collection elements. Formally, a collection type has the form Tσ, where σ ∈ Σ is the structure type and T is the data type, and where Σ = {bag, list, stream} is the set of all structure types. We also partition Σ into Σb and Σu, defined as the sets of bounded and unbounded structure types, respectively. Moreover, we define Σo as the set of ordered structure types, thus Σb ∩ Σo = {list} and Σu ∩ Σo = {stream}. Finally, we allow the void type ∅.

5.2.2 Operator Types

Operator types are defined in terms of input/output signatures. The typing of operators is reported in Table 5.3. We do not show the type inference rules since they are straightforward. From the type specification, we say each operator is characterized by its input and output degrees (i.e., the cardinality of the left- and right-hand side of the → symbol, respectively). All operators but collect have output degree 1, while collect has output degree 0. All binary operators have input degree 2, emit has input degree 0 and all the other operators have input degree 1. All operators are polymorphic with respect to data types. Moreover, all operators but emit and collect are polymorphic with respect to structure types. Conversely, each emit and collect operator deals with one specific structure type.4 As we mentioned in Section 5.2.1, a windowing operator may behave as the unbounded extension of the respective bounded operator. This is formalized by the inference rule w- reported in Figure 5.2: given an operator op dealing with ordered structure types (bounded or unbounded), its windowing counterpart w-op can operate on any ordered structure type, including stream. The analogous principle underlies the inference rules for all the w- operators.

4 For example, an emitter for a finite text file would generate a bounded collection of strings, whereas an emitter for a stream of tweets would generate an unbounded collection of tweet objects.

(new)    if op : τ, then new op : τ

(to)     if p : Tσ◦ → Uσ, pi : Uσ → (Vσ◦)i, and ∃i : (Vσ◦)i = Vσ, then to p p1 . . . pn : Tσ◦ → Vσ

(to∅)    if p : Tσ◦ → Uσ and pi : Uσ → ∅ for all i, then to p p1 . . . pn : Tσ◦ → ∅

(pair)   if p : Tσ◦ → Uσ, p′ : ∅ → U′σ, and a : Uσ × U′σ → Vσ◦, then pair p p′ a : Tσ◦ → Vσ◦

(pair′)  if p : ∅ → Uσ, p′ : Tσ◦ → U′σ, and a : Uσ × U′σ → Vσ◦, then pair p p′ a : Tσ◦ → Vσ◦

(merge)  if p : Tσ◦ → Uσ and p′ : ∅ → Uσ, then merge p p′ : Tσ◦ → Uσ

Figure 5.3: Pipeline typing.

5.2.3 Pipeline Types

Pipeline types are defined according to the inference rules in Figure 5.3. For simplicity, we use the meta-variable Tσ◦, which can be rewritten as either Tσ or ∅, to represent the optional collection type5. The awkward rule to covers the case in which, in a to Pipeline, at least one destination Pipeline pi has a non-void output type Vσ; in such a case, all the destination Pipelines with non-void output type must have the same output type Vσ, which is also the output type of the resulting Pipeline. Finally, we define the notion of top-level Pipelines, representing Pipelines that may be executed.

Definition 1. A top-level Pipeline is a non-empty Pipeline of type ∅ → ∅.

Running Example: Typing of word-count

We present the types of the word-count components, defined in Section 5.1. We omit full type derivations since they are straightforward applications of the typing rules. The operators are all unary and have the following types:

tokenize : Stringσ → (String × N)σ, ∀σ ∈ Σ
keyed-sum : (String × N)σ → (String × N)σ, ∀σ ∈ Σ
file-read : ∅ → Stringbag
file-write : (String × N)bag → ∅

5 We remark the optional collection type is a mere syntactic rewriting, thus it does not represent any additional feature of the typing system.

Pipelines have the following types:

word-count : Stringσ → (String × N)σ, ∀σ ∈ Σ
file-word-count : ∅ → ∅

We remark that word-count is polymorphic whereas file-word-count is a top-level Pipeline.

5.3 Semantics

We propose an interpretation of Pipelines in terms of semantic Dataflow graphs, as defined in [120]. Namely, we propose the following mapping:

• Collections ⇒ Dataflow tokens

• Operators ⇒ Dataflow vertexes

• Pipelines ⇒ Dataflow graphs

Note that collections in semantic Dataflow graphs are treated as a whole, thus they are mapped to single Dataflow tokens that flow through the graph of transforma- tions. In this setting, semantic operators (i.e., Dataflow vertexes) map an input collection to the respective output collection upon a single firing.

5.3.1 Semantic Collections

Dataflow tokens are data collections of T -typed elements, where T is the data type of the collection. Unordered collections are semantically mapped to multi-sets, whereas ordered collections are mapped to sequences. We denote an unordered data collection of data type T with the following, “{ ... }” being interpreted as a multi-set (i.e., unordered collection with possible multiple occurrences of elements):  m = m0, m1, . . . , m|m|−1 (5.1)

A sequence (i.e., a semantic ordered collection) associates a numeric timestamp to each item, representing its temporal coordinate, in time units, with respect to time zero. Therefore, we denote the generic item of a sequence having data type T as (t_i, s_i), where i ∈ N is the position of the item in the sequence, t_i ∈ N is the timestamp and s_i ∈ T is the item value. We denote an ordered data collection of data type T with the following, where $\stackrel{(b)}{=}$ holds only for bounded sequences (i.e., lists):

$$\begin{aligned}
s &= [(t_0, s_0), (t_1, s_1), (t_2, s_2), \ldots \bullet\ t_i \in \mathbb{N}, s_i \in T] \\
  &= [(t_0, s_0)] \,{+}{+}\, [(t_1, s_1), (t_2, s_2), \ldots] \\
  &= (t_0, s_0) :: [(t_1, s_1), (t_2, s_2), \ldots] \\
  &\stackrel{(b)}{=} [(t_0, s_0), (t_1, s_1), \ldots, (t_{|s|-1}, s_{|s|-1})]
\end{aligned} \qquad (5.2)$$

The symbol ++ represents the concatenation of the sequence [(t_0, s_0)] (head sequence) with the sequence [(t_1, s_1), (t_2, s_2), ...] (tail sequence). The symbol :: represents the concatenation of the element (t_0, s_0) (head element) with the sequence [(t_1, s_1), (t_2, s_2), ...] (tail sequence). We define the notion of time-ordered sequences.

Definition 2. A sequence s = [(t0, s0), (t1, s1), (t2, s2),...] is time-ordered when the following condition is satisfied for any i, j ∈ N:

i ≤ j ⇒ ti ≤ tj

We denote as $\overrightarrow{s}$ any time-ordered permutation of s. The ability to deal with non-time-ordered sequences, which is provided by PiCo, is sometimes referred to as out-of-order data processing [72]. Before proceeding to semantic operators and Pipelines, we define some preliminary notions about the effect of partitioning and windowing over semantic collections.
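As a small illustrative instance (not one of the thesis examples), consider a sequence whose items arrive out of order:

$$s = [(3, a), (1, b), (2, c)] \qquad \overrightarrow{s} = [(1, b), (2, c), (3, a)]$$

Here s is not time-ordered, since $t_0 = 3 > t_1 = 1$, while $\overrightarrow{s}$ is one of its time-ordered permutations (the only one, in this case, since all timestamps are distinct).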

Partitioned Collections

In Section 5.1.2, we introduced partitioning policies. In semantic terms, a partitioning policy π defines how to group collection elements.
Definition 3. Given a multi-set m of data type T, a function π : T → K and a key k ∈ K, we define the k-selection $\sigma_k^\pi(m)$ as follows:

$$\sigma_k^\pi(m) = \{m_i \bullet m_i \in m \land \pi(m_i) = k\} \qquad (5.3)$$

Similarly, the k-selection $\sigma_k^\pi(s)$ of a sequence s is the sub-sequence of s such that the following holds:

$$\forall (t_i, s_i) \in s,\ (t_i, s_i) \in \sigma_k^\pi(s) \iff \pi(s_i) = k \qquad (5.4)$$

We define the partitioned collection as the set of all groups generated according to a partitioning policy.
Definition 4. Given a collection c and a partitioning policy π, the partitioned collection of c according to π, denoted $c^{(\pi)}$, is defined as follows:

$$c^{(\pi)} = \{\sigma_k^\pi(c) \bullet k \in K \land |\sigma_k^\pi(c)| > 0\} \qquad (5.5)$$

We remark that partitioning has no effect with respect to time-ordering.

Example: The group-by-key decomposition, with π1 being the left projection,6 uses a special case of selection where (a concrete instance is given after the list):

• the collection has data type K × V

• π = π1
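As a concrete (made-up) instance of the above, take the key-value multi-set

$$m = \{(a, 1), (b, 2), (a, 3)\}, \qquad \pi = \pi_1$$

Then $\sigma_a^\pi(m) = \{(a, 1), (a, 3)\}$ and $\sigma_b^\pi(m) = \{(b, 2)\}$, so the partitioned collection is $m^{(\pi)} = \{\{(a, 1), (a, 3)\}, \{(b, 2)\}\}$.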

Windowed Collections

Before proceeding further, we provide the preliminary notion of sequence splitting. A splitting function f defines how to split a sequence into two possibly overlapping sub-sequences, namely the head and the tail.
Definition 5. Given a sequence s and a splitting function f, the splitting of s according to f is:

$$f(s) = (h(s), t(s)) \qquad (5.6)$$

where h(s) is a bounded prefix of s, t(s) is a proper suffix of s, and there is a prefix p of h(s) and a suffix u of t(s) such that s = p ++ u.

6 π1(x, y) = x

In Section 5.1.2, we introduced windowing policies. In semantic terms, a windowing policy ω identifies a splitting function $f_\omega$. Considering a split sequence $f_\omega(s)$, the head $h_\omega(s)$ represents the elements falling into the window, whereas the tail $t_\omega(s)$ represents the remainder of the sequence. We define the windowed sequence as the result of repeated applications of windowing with time-reordering of the heads.
Definition 6. Given a sequence s and a windowing policy ω, the windowed view of s according to ω is:

$$s^{(\omega)} = [\overrightarrow{s_0}, \overrightarrow{s_1}, \ldots, \overrightarrow{s_i}, \ldots] \qquad (5.7)$$

where

$$s_i = h_\omega(\underbrace{t_\omega(t_\omega(\ldots t_\omega}_{i}(s)\ldots)))$$

Example: The count-based policy ω = (5, 2, count) extracts the first 5 items from the sequence at hand and discards the first 2 items of the sequence upon sliding, whereas the tumbling policy ω = (5, 5, count) yields non-overlapping contiguous windows spanning 5 items.
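To make the definition concrete, here is a small made-up instance, assuming the input sequence is already time-ordered. Under the sliding policy ω = (5, 2, count), a sequence s = [(t_0, x_0), (t_1, x_1), ..., (t_9, x_9), ...] yields the windowed view

$$s^{(\omega)} = [\,[(t_0, x_0), \ldots, (t_4, x_4)],\ [(t_2, x_2), \ldots, (t_6, x_6)],\ [(t_4, x_4), \ldots, (t_8, x_8)],\ \ldots\,]$$

that is, each window contains 5 items and consecutive windows overlap on 3 of them, since 2 items are discarded upon sliding. With the tumbling policy ω = (5, 5, count) the windows would instead be [(t_0, x_0), ..., (t_4, x_4)], [(t_5, x_5), ..., (t_9, x_9)], and so on.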

5.3.2 Semantic Operators

We define the semantics of each operator in terms of its behavior with respect to token processing, by following the structure of Table 5.3. We start from bounded operators and then we show how they can be extended to their unbounded counterparts by considering windowed streams. Dataflow vertexes with one input edge and one output edge (i.e., unary operators with both input and output degrees equal to 1) take as input a token (i.e., a data collection), apply a transformation, and emit the resulting transformed token. Vertexes with no input edges (i.e., emit) or no output edges (i.e., collect) execute a routine to produce or consume an output/input token, respectively.

Semantic Core Operators

The bounded map operator has the following semantics:

$$\begin{aligned}
\text{map}\ f\ m &= \{f(m_i) \bullet m_i \in m\} \\
\text{map}\ f\ s &= [(t_0, f(s_0)), \ldots, (t_{|s|-1}, f(s_{|s|-1}))]
\end{aligned} \qquad (5.8)$$

where m and s are input tokens (multi-set and list, respectively), whereas the right-hand side terms are output tokens. In the ordered case, we refer to the above definition as the strict semantic map, since it respects the global time-ordering of the input collection. The bounded flatmap operator has the following semantics:

$$\begin{aligned}
\text{flatmap}\ f\ m &= \bigcup \{f(m_i) \bullet m_i \in m\} \\
\text{flatmap}\ f\ s &= [(t_0, f(s_0)_0), (t_0, f(s_0)_1), \ldots, (t_0, f(s_0)_{n_0})] \,{+}{+}\, \\
&\phantom{=}\ [(t_1, f(s_1)_0), \ldots, (t_1, f(s_1)_{n_1})] \,{+}{+}\, \ldots \,{+}{+}\, \\
&\phantom{=}\ [(t_{|s|-1}, f(s_{|s|-1})_0), \ldots, (t_{|s|-1}, f(s_{|s|-1})_{n_{|s|-1}})]
\end{aligned} \qquad (5.9)$$

where $f(s_i)_j$ is the j-th item of the list $f(s_i)$, that is, the output of the kernel function f over the input $s_i$. Notice that the timestamp of each output item is the same as that of the respective input item.

The bounded reduce operator has the following semantics, where ⊕ is both associative and commutative and, in the ordered variant, $t' = \max_{(t_i, s_i) \in s} t_i$:

$$\begin{aligned}
\text{reduce}\ \oplus\ m &= \{\textstyle\bigoplus \{m_i \in m\}\} \\
\text{reduce}\ \oplus\ s &= [(t', (\ldots(s_0 \oplus s_1) \oplus \ldots) \oplus s_{|s|-1})] \\
&\stackrel{(a)}{=} [(t', s_0 \oplus s_1 \oplus \ldots \oplus s_{|s|-1})] \\
&\stackrel{(c)}{=} [(t', \textstyle\bigoplus \Pi_2(s))]
\end{aligned} \qquad (5.10)$$

meaning that, in the ordered variant, the timestamp of the resulting value is the same as that of the input item having the maximum timestamp. Equation $\stackrel{(a)}{=}$ holds since ⊕ is associative and equation $\stackrel{(c)}{=}$ holds since it is commutative. The fold+reduce operator has a more complex semantics, defined with respect to an arbitrary partitioning of the input data. Informally, given a partition P of the input collection, each subset $P_i \in P$ is mapped to a local accumulator $a_i$, initialized with value z; then:

1. Each subset Pi is folded into its local accumulator ai, using ⊕1;

2. The local accumulators ai are combined using ⊕2, producing a reduced value r;

The formal definition, which we omit for the sake of simplicity, is similar to the semantics of reduce, with the same distinction between ordered and unordered processing and similar considerations about associativity and commutativity of the user functions. We assume, without loss of generality, that the user parameters z and ⊕1 are always defined such that the resulting fold+reduce operator is partition-independent, meaning that the result is independent of the choice of the partition P.
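As a small worked instance (not taken from the thesis examples), let ⊕ = + over the multi-set m = {3, 1, 2}: then reduce + m = {6}, and in the ordered case reduce + [(1, 3), (4, 1), (2, 2)] = [(4, 6)], since 4 is the maximum timestamp. For fold+reduce, consider the count-and-sum accumulator used later in Algorithm 3: z = (0, 0), ⊕1 = λax. (π1(a) + 1, π2(a) + x) and ⊕2 = λa1a2. (π1(a1) + π1(a2), π2(a1) + π2(a2)). Over the partition {{3, 1}, {2}} of m, the local accumulators are (2, 4) and (1, 2), combined into r = (3, 6); any other partition of m yields the same r, as required by partition-independence.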

Semantic Decomposition

Given a bounded combine operator op and a selection function π : T → K, the partitioning operator p-op has the following semantics over a generic collection c:

$$\text{p-op}\ \pi\ c = \{\text{op}\ c' \bullet c' \in c^{(\pi)}\}$$

For instance, the group-by-key processing is obtained by using the by-key partitioning policy (cf. the example below Definition 3). Similarly, given a bounded combine operator op and a windowing policy ω, the windowing operator w-op has the following semantics:

$$\text{w-op}\ \omega\ s = \text{op}\ s_0^{(\omega)} \,{+}{+}\, \ldots \,{+}{+}\, \text{op}\ s_{|s^{(\omega)}|-1}^{(\omega)} \qquad (5.11)$$

where $s_i^{(\omega)}$ is the i-th list in $s^{(\omega)}$ (cf. Definition 6). As for the combination of the two partitioning mechanisms, w-p-op, it has the following semantics:

$$\text{w-p-op}\ \pi\ \omega\ s = \{\text{w-op}\ \omega\ s' \bullet s' \in s^{(\pi)}\}$$

Thus, as mentioned in Section 5.1.2, partitioning first performs the decomposition, and then each group is processed on a per-window basis.
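For instance (a made-up instance), take the keyed sequence s = [(0, (a, 1)), (1, (b, 2)), (2, (a, 3)), (3, (a, 4)), (4, (a, 6)), (5, (b, 5))], the by-key policy π = π1 and the tumbling policy ω = (2, 2, count). Then $s^{(\pi)}$ contains the two sub-sequences s_a = [(0, (a, 1)), (2, (a, 3)), (3, (a, 4)), (4, (a, 6))] and s_b = [(1, (b, 2)), (5, (b, 5))], and w-p-op π ω s applies op to the windows [(0, (a, 1)), (2, (a, 3))] and [(3, (a, 4)), (4, (a, 6))] of s_a and to the single window [(1, (b, 2)), (5, (b, 5))] of s_b.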

Unbounded Operators

We remark that none of the semantic operators defined so far can deal with unbounded collections. As mentioned in Section 5.1.2, we rely on windowing for extending them to the unbounded case. Given a (bounded) windowing combine operator op, the semantics of its unbounded variant is a trivial extension of the bounded case:

$$\text{w-op}\ \omega\ s = \text{op}\ s_0^{(\omega)} \,{+}{+}\, \ldots \,{+}{+}\, \text{op}\ s_i^{(\omega)} \,{+}{+}\, \ldots \qquad (5.12)$$

The above incidentally also defines the semantics of unbounded windowing and partitioning combine operators. We rely on the analogous approach to define the semantics of unbounded operators in the map family, but in this case the windowing policy is introduced at the semantic rather than the syntactic level, since map operators do not support decomposition. Moreover, the windowing policy is forced to be batching (cf. the Example below Definition 5). We illustrate this concept on map operators, but the same holds for flatmap ones. Given a bounded map operator, the semantics of its unbounded extension is as follows, where ω is a tumbling windowing policy:

$$\llbracket \text{map}\ f\ s \rrbracket_\omega = \text{map}\ f\ s_0^{(\omega)} \,{+}{+}\, \ldots \,{+}{+}\, \text{map}\ f\ s_i^{(\omega)} \,{+}{+}\, \ldots \qquad (5.13)$$

We refer to the above definition as the weak semantic map (cf. the strict semantic map in Equation 5.8), since the time-ordering of the input collection is partially dropped. In the following chapters, we provide a PiCo implementation based on weak semantic operators for both bounded and unbounded processing.

Semantic Sources and Sinks

Finally, emit/collect operators do not have a functional semantics, since they produce/consume collections by interacting with the system state (e.g., read/write from/to a text file, read/write from/to a network socket). From the semantic perspective, we consider each emit/collect operator as a Dataflow node able to produce/consume as output/input a collection of a given type, as shown in Table 5.3. Moreover, emit operators of ordered type have the responsibility of tagging each emitted item with a timestamp.

5.3.3 Semantic Pipelines

The semantics of a Pipeline maps it to a semantic Dataflow graph. We define such a mapping by induction on the Pipeline grammar defined in Section 5.1. The following definitions are basically a formalization of the pictorial representation in Figure 5.1. We also define the notion of input, resp. output, vertex of a Dataflow graph G, denoted as v_I(G) and v_O(G), respectively. Conceptually, an input node represents a Pipeline source, whereas an output node represents a Pipeline sink. The following formalization provides the semantics of any PiCo program.

• (new op) is mapped to the graph G = ({op}, ∅); moreover, one of the following three cases holds:

– op is an emit operator, then v_O(G) = op, while v_I(G) is undefined

– op is a collect operator, then v_I(G) = op, while v_O(G) is undefined

– op is a unary operator with both input and output degree equal to 1, then v_I(G) = v_O(G) = op

• (to p p1 . . . pn) is mapped to the graph G = (V,E) with:

$$\begin{aligned}
V &= V(G_p) \cup V(G_{p_1}) \cup \ldots \cup V(G_{p_n}) \cup \{\mu\} \\
E &= E(G_p) \cup \bigcup_{i=1}^{n} E(G_{p_i}) \cup \bigcup_{i=1}^{n} \{(v_O(G_p), v_I(G_{p_i}))\} \cup \bigcup_{i=1}^{|G'|} \{(v_O(G'_i), \mu)\}
\end{aligned}$$

where µ is a non-determinate merging node as defined in [91] and G′ = {G_{p_i} • d_O(G_{p_i}) = 1}; moreover, v_I(G) = v_I(G_p) if d_I(G_p) = 1 and undefined otherwise, while v_O(G) = µ if |G′| > 0 and undefined otherwise.

• (pair p p′ op) is mapped to the graph G = (V, E) with:

$$\begin{aligned}
V &= V(G_p) \cup V(G_{p'}) \cup \{op\} \\
E &= E(G_p) \cup E(G_{p'}) \cup \{(v_O(G_p), op), (v_O(G_{p'}), op)\}
\end{aligned}$$

moreover, vO(G) = op, while one of the following cases holds:

– v_I(G) = v_I(G_p) if the input degree of p is 1
– v_I(G) = v_I(G_{p′}) if the input degree of p′ is 1
– v_I(G) is undefined if both p and p′ have input degree equal to 0

• (merge p p′) is mapped to the graph G = (V, E) with:

$$\begin{aligned}
V &= V(G_p) \cup V(G_{p'}) \cup \{\mu\} \\
E &= E(G_p) \cup E(G_{p'}) \cup \{(v_O(G_p), \mu), (v_O(G_{p'}), \mu)\}
\end{aligned}$$

where µ is a non-determinate merging node; moreover, vO(G) = µ, while one of the following cases holds:

– v_I(G) = v_I(G_p) if the input degree of p is 1
– v_I(G) = v_I(G_{p′}) if the input degree of p′ is 1
– v_I(G) is undefined if both p and p′ have input degree equal to 0

Running Example: Semantics of word-count

The tokens (i.e., data collections) flowing through the semantic Dataflow graph resulting from the word-count Pipeline are bags of strings (e.g., lines produced by file-read and consumed by tokenize) or bags of string-N pairs (e.g., counts produced by tokenize and consumed by keyed-sum). In this example, as usual, string-N pairs are treated as key-value pairs, where keys are strings (i.e., words) and values are numbers (i.e., counts). By applying the semantics of flatmap, reduce and p-(reduce ⊕) to Algorithm 1, the token emitted by the combine operator turns out to be a bag of pairs (w, n_w), one for each word w in the input token of the flatmap operator. The Dataflow graph resulting from the semantic interpretation of the word-count Pipeline defined in Section 5.1 is G = (V, E), where:

V = {tokenize, keyed-sum}
E = {(tokenize, keyed-sum)}

Finally, the file-word-count Pipeline results in the graph G = (V,E) where:

V = {file-read, tokenize, keyed-sum, file-write}
E = {(file-read, tokenize), (tokenize, keyed-sum), (keyed-sum, file-write)}

5.4 Programming Model Expressiveness

In this section, we provide a set of use cases adapted from examples in Flink's user guide [68]. Although they are very simple examples, they exploit grouping, partitioning, windowing and Pipeline merging. We aim to show the expressiveness of our model without using any concrete API, to demonstrate that the model is independent of its implementation.

5.4.1 Use Cases: Stock Market

The first use case is about analyzing stock market data streams. In this use case, we:

1. read and merge two stock market data streams from two sockets (Algorithm 2)
2. compute statistics on this market data stream, like rolling aggregations per stock (Algorithm 3)
3. emit price warning alerts when the prices change (Algorithm 4)
4. compute correlations between the market data streams and a Twitter stream with stock mentions (Algorithm 5)

Algorithm 2 The read-prices Pipeline

read-prices = new from-socket s1 + new from-socket s2

Read from multiple sources. Algorithm 2 shows the read-prices Pipeline, which reads and merges two stock market data streams from sockets s1 and s2. Assuming StockName and Price are types representing stock names and prices, respectively, the type of each emit operator is the following (since emit operators are polymorphic with respect to data type):

∅ → (StockName × Price){stream}

Therefore it is also the type of read-prices since it is a merge of two emit operators of such type.

Algorithm 3 The stock-stats Pipeline
min = reduce (λxy. min(x, y))
max = reduce (λxy. max(x, y))
sum-count = fold+reduce (λax. (π1(a) + 1, π2(a) + x)) (0, 0) (λa1a2. (π1(a1) + π1(a2), π2(a1) + π2(a2)))
normalize = map (λx. π2(x)/π1(x))
ω = (10, 5, count)

stock-stats = to read-prices
    (new w-p-(min) π1 ω)
    (new w-p-(max) π1 ω)
    (new w-p-(sum-count) π1 ω | new normalize)

Statistics on market data stream. Algorithm 3 shows the stock-stats Pipeline, which computes three different statistics (minimum, maximum and mean) for each stock name, over the prices coming from the read-prices Pipeline. These statistics are windowing-based, since the data processed belongs to a possibly unbounded stream. The specified window policy ω = (10, 5, count) creates windows of 10 elements with sliding factor 5.

The type of stock-stats is ∅ → (StockName × Price){stream}, the same as read-prices.

Algorithm 4 The price-warnings Pipeline
collect = fold+reduce (λsx. s ∪ {x}) ∅ (λs1s2. s1 ∪ s2)
fluctuation = map (λs. set-fluctuation(s))
high-pass = flatmap (λδ. if δ ≥ 0.05 then yield δ)
ω = (10, 5, count)

price-warnings = read-prices | new w-p-(collect) π1 ω | new fluctuation | new high-pass

Generate price fluctuation warnings. Algorithm 4 shows the price-warnings Pipeline, which generates a warning each time the stock market data within a window exhibits a high price fluctuation for a certain stock name; yield is a host-language method that produces an element. In the example, the fold+reduce operator collect just builds the sets, one per window, of all items falling within the window, whereas the downstream map fluctuation computes the fluctuation over each set. This is a generic pattern that allows combining collection items by re-using available user functions defined over collective data structures.

The type of price-warnings is again ∅ → (StockName × Price){stream}.

Algorithm 5 The correlate-stocks-tweets Pipeline
read-tweets = new from-twitter | new tokenize-tweets
ω = (10, 10, count)

correlate-stocks-tweets = pair price-warnings read-tweets w-p-(correlate) π1 ω

Correlate warnings with tweets. Algorithm 5 shows correlate-stocks-tweets, a Pipeline that correlates the warnings generated by price-warnings with the tweets coming from a Twitter feed. The read-tweets Pipeline generates a stream of (StockName × String) items, representing tweets each mentioning a stock name. Stocks and tweets are paired according to a join-by-key policy (cf. Definition 3), where the key is the stock name. In the example, correlate is a join-fold+reduce operator that computes the correlation between the two joined collections. As we mentioned in Section 5.1.2, we rely on windowing to apply the (bounded) join-fold+reduce operator to unbounded streams. In the example, we use a simple tumbling policy ω = (10, 10, count) in order to correlate items from the two collections in a 10-by-10 fashion.

5.5 Summary

In this chapter we proposed a new programming model based on Pipelines and operators, which are the building blocks of PiCo programs, first defining the syntax of programs, then providing a formalization of the type system and semantics. The contribution of PiCo with respect to the state of the art also lies in the definition and formalization of a programming model that is independent of the effective API and runtime implementation. In state-of-the-art tools for analytics, this aspect is typically not considered and the user is in some cases left to their own interpretation of the documentation. This happens particularly when the implementation of operators in state-of-the-art tools is conditioned, in part or totally, by the runtime implementation itself.

Chapter 6

PiCo Parallel Execution Graph

In this chapter, we show how a PiCo program is compiled into a graph of parallel processing nodes. The compilation step takes as input the directed acyclic Dataflow graph (DAG) resulting from a PiCo program (the Semantic DAG) and transforms it, with a set of rules, into a graph that we call the Parallel Execution (PE) Graph, representing a possible parallelization of the Semantic DAG. The resulting graph is a classical macro-Dataflow network [91], in which tokens represent portions of data collections and nodes are persistent processing nodes mapping input to output tokens, according to a pure functional behavior.
Dataflow networks naturally express some basic forms of parallelism. For instance, non-connected nodes (i.e., independent nodes) may execute independently from each other, exploiting embarrassing parallelism. Moreover, connected nodes (i.e., data-dependent nodes) may process different tokens independently, exploiting pipeline or task parallelism. Finally, each PiCo operator is compiled into a Dataflow (sub-)graph of nodes, each processing different portions of the data collection at hand, exploiting data parallelism. We also provide a set of rewriting rules for optimizing the compiled graphs, similarly to what is done by an optimizing compiler over intermediate representations.
In this chapter, we define the Parallel Execution layer of the Dataflow stack as it has been defined in [120]. We remark, as stressed in the aforementioned work, that the parallel execution model discussed in this chapter is abstract with respect to any actual implementation. For instance, it may be implemented in shared memory or through a distributed runtime. Moreover, a compiled (and optimized) Dataflow graph may be directly mapped to an actual network of computing units (e.g., communicating threads or processes) or executed by a macro-Dataflow interpreter.
This chapter proceeds as follows. We first define the target language and show the compilation of each single operator with respect to different compilation environments. Then we show the compilation of the Pipeline constructors new, merge and to. Finally, we show the optimization phase in which operators, while creating Pipelines, are simplified into more compact and efficient target objects.

6.1 Compilation

In this section we describe how PiCo operators and Pipelines are compiled into parallel execution (PE) graphs [120]. The target language is composed of Dataflow graphs and is inspired by the FastFlow library architecture. The schema is the following: given a PiCo program, it is mapped to an intermediate representation (IR) graph that expresses the available parallelism; then an IR graph optimization is performed, resulting in the actual parallel execution (PE) graph. Both the IR and the PE graphs are expressed in terms of the target language.

Figure 6.1: Grammar of PE graphs: (a) PE pipe; (b) PE farm; (c) PE processing node; (d) PE operator-farm.

Top-level terms (cf. Definition 1) are compiled into executable PE graphs. We only consider PiCo terms in the respective normal form, that is, the form induced by applying all the structural equivalences reported in Table 5.1. For the sake of simplicity, we limit the discussion to the PiCo subset not including binary operators, since they can be treated in an analogous way to unary operators. We represent the compilation of a PiCo term p as C⟦p⟧ρ, where ρ is a compilation environment. We include such environments at this level to reflect the fact that operators and Pipelines with the same syntax can be mapped to different PE graphs. We consider simple compilation environments composed only of a set of structure types, which allows compilations to depend on the collection structure types processed (PiCo terms are polymorphic, cf. Section 5.2). Thus, a Pipeline p can be compiled into two different PE graphs C⟦p⟧δ1 and C⟦p⟧δ2, where δ1 ⊆ Σ and δ2 ⊆ Σ. The selection of the actual compilation is unique when it comes to (sub-terms of) top-level Pipelines, since the environment is propagated top-down by the compilation process (i.e., through inherited attributes). Moreover, we omit the compilation environment if it can be easily inferred from the context. Figure 6.1 graphically represents the target language's grammar. A PE graph is one of the following symbols:

• PE operators (Figures 6.1c and 6.1d), representing PiCo operators and singleton Pipelines (i.e., new)
• PE farms (Figure 6.1b), representing branching to and merge Pipelines
• PE pipes (Figure 6.1a), representing linear to Pipelines

Figure 6.2: Operators compilation in the target language: (a) C⟦from-file f⟧{bag} and C⟦to-file f⟧{bag}, compiled into In and Out processing nodes; (b) C⟦from-socket s⟧{stream} and C⟦to-socket s⟧{stream}; (c) C⟦map f⟧Σ, C⟦flatmap f⟧Σ, C⟦reduce ⊕⟧Σb (and decomposing variants) and C⟦fold+reduce ⊕1 z ⊕2⟧Σb (and decomposing variants), each compiled into a farm with an Emitter E, workers w1 ... wn and a Collector C.

A PE operator can be either a processing node (Figure 6.1c) or a farm-composition of n processing nodes and additional mediator nodes (Figure 6.1d). Processing nodes are atomic processing units, whereas mediator nodes are generic entities responsible for inter-node communication. For instance, a mediator can be a tree composition of processing units.

6.1.1 Operators

Each operator defined in Section 5.1.2 is compiled into a single PE operator. Any compiled PE node has at least one input and one output port, since it exchanges synchronization tokens in addition to data. Figure 6.2 schematizes the compilation of PiCo operators into the target language. All the PE operators have a single entry point and a single exit point, in order to be connected to other operators. The compilation results are composed of the following entities:

• Processing nodes execute user code. In Figure 6.2c, they are the workers in the PE graphs for map, flatmap and reduce; moreover, the In and Out nodes in Figures 6.2a and 6.2b are processing nodes that execute some host-language routine (possibly taking some user code as parameter) to process input and output collections.
• Emitters are mediator nodes, identified by E, dispatching collection elements (possibly coming from upstream nodes) to processing nodes. The dispatching may follow different policies with respect to windowing and/or partitioning.
• Collectors are mediator nodes, identified by C, collecting items from Workers and possibly forwarding results to the downstream nodes. In the case of ordered collections, Collector nodes may have to reorder items before forwarding.

PE nodes come in several flavors since they play several roles. For instance, a PE node can be stateless or stateful depending on the role it plays. We provide some exemplified specializations of the various PE nodes, with respect to the respective compilation subjects. We consider two different compilation approaches: fine-grained and batching. The former approach is the most naive, in which data collections are decomposed down to their atomic components with respect to a specific operator; in the latter approach, collections are decomposed into non-atomic portions (sometimes referred to as batches or micro-batches) to enable more efficient implementations.

Fine-grained PE graphs

In the naive fine-grained compilation approach, the tokens flowing through the PE graph represent the smallest portions of the data collection at hand. For instance, a multi-set is decomposed down to its single items (a token for each item) when processed by a PE operator resulting from the compilation of a map operator; in the case of windowing operators, a token represents an entire window $s_i^{(\omega)}$ (cf. Equation 5.11).

• C⟦map f⟧Σ: the Emitter distributes tokens to the Workers, which execute the function f over each input token (i.e., a single item from the input collection). Workers and Emitter are stateless. In the simplest case of unordered processing, the Collector is also stateless and simply gathers and forwards the tokens coming from the workers. In the case of ordered processing, the Collector is deputed to item reordering and may rely on buffering, in which case it is stateful. As we discuss later, this approach poses relevant challenges in the case of stream processing.

• C⟦reduce ⊕⟧Σb: the Emitter is analogous to the previous case. The Workers calculate partial results and send them to the Collector, which computes the final result. All computations apply the ⊕ function on the collected data. Therefore, Workers are stateful since they need to store the collected data locally. Notice that, from the semantic reduction in Equation 5.10, time-ordering is irrelevant in this case.

• C⟦p-(reduce ⊕) π⟧Σb: the partitioning policy π is encoded into the Emitter, which distributes items to Workers by mapping a given key k to a specific Worker. This is not required in principle, but it simplifies the implementation since only the Emitter needs to be aware of the partitioning policy; moreover, since all the processing for a given key k is localized in the same Worker, the Collector is a simple forwarding node and kernels relying on partitioned state are supported without any need for coordination control.1 Workers apply the ⊕ function to compute the reduce of each group and emit the results to the Collector, which simply gathers and forwards them. Also in this case, time-ordering can be safely ignored.

• C⟦w-(reduce ⊕) ω⟧Σo: as discussed in [60], windowing stream processing can be regarded as a form of stateful processing, since each computation depends on some recent stream history. In the aforementioned work, several patterns for sliding-window stream processing are proposed in terms of Dataflow farms. In the Window Farming (WF) pattern, each window can be processed independently by any worker. We consider the PE operator at hand as an instance of the WF pattern, since each window can be reduced independently. The main issue with this scenario is that, since windows may overlap, each stream item may be needed by more than one worker. In the proposed approach, windows are statically mapped to Workers, therefore the (stateless) Emitter simply computes the set of windows that will contain a certain item and sends each item to all the workers that need it. Each Worker maintains an internal state to produce the windows and applies the (sequential) reduction over each window once it is complete. The Collector relies on buffering to reconstruct the time-ordering between windows. We remark that only parallelism among different windows is exploited, whereas each Worker processes each input window sequentially.

1 Although the current PiCo formulation does not support this feature.

• C⟦w-p-(reduce ⊕) π ω⟧Σo: in the Key Partitioning (KP) pattern [60], the stream is logically partitioned into disjoint sub-streams and the time-ordering must be respected only within each sub-stream. Thus it is natural to consider the PE operator at hand as an instance of the KP pattern, where sub-streams are constructed according to the partitioning policy π. The resulting PE operator-farm is analogous to the previous case, except for the Collector, which in this case is simpler since no time-reordering is required.

As we show in Figure 6.2, all the PE graphs resulting from the compilation of data-parallel operators (cf. Section 5.1.2) are structurally identical.

Batching PE graphs

The first problem with the fine-grained compilation approach is related to the computational granularity, which is a well-known issue in the domain of parallel processing performance. Setting the Workers to process data at the finest granularity induces a high communication traffic between the computational units of any underlying implementation. Therefore, the fine-grained approach increases the ratio of communication over computation, which is one of the main factors of inefficiency when it comes to parallel processing. In particular, the discussed issue would have a relevant impact in any distributed implementation, in which the communication between processing nodes is expensive.
The second problem with the fine-grained approach is more subtle and is related to the semantics of PiCo operators as defined in Section 5.3.2. Let us consider the compilation of a map operator over lists (i.e., bounded sequences) with strict semantics (cf. Equation 5.8). In the PE operator resulting from its compilation, the Collector has to restore the time-ordering among all the collection items, thus in the worst case it has to buffer all the results from the Workers before starting to emit any token. Moreover, it is not possible to implement an unbounded strict semantic map: in the fine-grained compilation setting, for instance, this would require infinite buffering by the Collector. Conversely, if we consider the weak semantics (cf. Equation 5.13), the relative service times of the Workers for each item determine the order in which the Collector receives the items to be reordered. Therefore, it is impossible to implement the weak semantic map without passing all the information about the original time-ordering from the Emitter to the Collector.
The aspects discussed above make an eventual implementation of the fine-grained approach cumbersome and inherently inefficient. To overcome such difficulties, we propose to use instead a batching compilation approach. The idea is simple: the stream is sliced according to a tumbling windowing policy and the processing is carried out on a per-slice basis. Therefore all the data is processed on a per-window basis, such that the tokens flowing through PE graphs represent portions of data collections rather than single items. We already introduced per-window processing in the fine-grained compilation of the windowing operators w-(reduce ⊕) and w-p-(reduce ⊕), which we retain in the batching approach.
The batching compilation of a reduce operator is a simplified version of the compilation of a w-reduce operator (i.e., an instance of the WF pattern). The simplification comes from the fact that the windowing policy is a tumbling one (cf. the Example above Definition 5), thus each stream item falls into exactly one window. Moreover, there is no need for time-reordering by the Collector, due to the semantics of the reduce operator. The Workers perform the reduction at two levels: over the elements of an input window (i.e., an input token) and over such reduced values, one per window. Partial results are sent to the Collector, which performs the final reduction. Along the same line, the batching compilation of a p-reduce operator is analogous to the compilation of a w-p-reduce operator (i.e., an instance of the KP pattern).
The batching compilation of a map operator is based on the weak semantic map. It follows the same schema as the batching reduce, but the Emitter keeps an internal buffer to build the time-ordered permutations $s_i^{(\omega)}$ (cf. Equation 5.11) prior to sending the items to the Workers. This enforces the time-ordering within each window, while the time-ordering between different windows is guaranteed by the reordering Collector. Finally, the batching compilation of a flatmap operator is analogous to the map case, where the token emitted by each Worker includes all the items produced while processing the respective input window, in order to respect the weak semantics in Equation 5.13.
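To make the batching idea concrete, the following is a minimal sequential sketch in standard C++; it is not the PiCo/FastFlow runtime, and all names in it are invented for illustration. The input is sliced into tumbling windows, time-ordering is restored inside each window only, and the map kernel is applied per item, mirroring the weak semantics of Equation 5.13.

// Illustrative sketch only (not the PiCo/FastFlow runtime): a sequential,
// standard-C++ rendering of the batching idea. All names are invented here.
// Assumes width > 0.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
using Item = std::pair<std::size_t, T>;   // (timestamp, value)

template <typename In, typename Out, typename Kernel>
std::vector<Item<Out>> batching_map(const std::vector<Item<In>>& input,
                                    std::size_t width, Kernel f) {
    std::vector<Item<Out>> output;
    for (std::size_t base = 0; base < input.size(); base += width) {
        // Emitter role: build the next tumbling window (batch).
        std::size_t end = std::min(base + width, input.size());
        std::vector<Item<In>> window(input.begin() + base, input.begin() + end);
        // Enforce time-ordering within the window only (weak semantics).
        std::sort(window.begin(), window.end(),
                  [](const Item<In>& a, const Item<In>& b) { return a.first < b.first; });
        // Worker role: apply the kernel to each item, preserving its timestamp.
        for (const auto& it : window)
            output.emplace_back(it.first, f(it.second));
    }
    return output;
}

With width = 1 and a time-ordered input this degenerates to item-by-item processing, while larger widths trade per-item latency for coarser-grained communication.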

Compilation environments

We introduced compilation environments ρ in order to allow a parametric compilation of executable PiCo terms into executable PE graphs, depending for instance on the structure type accepted by the subject term. As a use case for compilation environments, we combine parametric batching compilation with the distinction between so-called batch and stream processing.
We define the compilation of a map operator over unbounded streams (i.e., ρ = {stream}) to result in a PE farm, in which the Emitter iteratively waits for an incoming data token (namely, a window, since we are considering batching compilation) and dispatches it to the proper farm Worker. This schema is commonly referred to as stream processing. The idea underlying stream processing is that processing nodes are somehow reactive, in order to process incoming data and synchronization tokens and minimize latency.
Conversely, the compilation of a map operator over bounded collections (e.g., ρ = {bag}) could result in a farm in which the Emitter waits for a collective data token (i.e., a whole data bag), distributes bag portions to the farm Workers (i.e., scattering) and then suspends itself to avoid consuming computational resources; it may eventually be woken up again to distribute another bag. Both the non-free waking mechanism and the data buffering induced by such a batch processing schema introduce some delay, but in the meantime the emitter avoids consuming computational resources. We provide more details about stream processing in Section 6.3.

6.1.2 Pipelines

In this section we show the compilation of PiCo Pipelines into PE graphs. The new constructor creates a new Pipeline starting from a single unary operator, thus its compilation coincides with the operator compilations previously defined in Section 6.1.1; therefore, we only consider the merge and to constructors.

Merging Pipelines

The merge operator unifies n Pipelines producing a single output collection that is the union of the inputs. This operator is both associative and commutative as reported in Table 5.1.

Figure 6.3: Compilation of a merge Pipeline: an Emitter E feeding the compiled Pipelines C⟦p1⟧ ... C⟦pn⟧, whose outputs are gathered by a Collector C.

Figure 6.3 shows the compilation of C⟦p1 + ... + pn⟧, a Pipeline that merges n Pipelines. In this case, the Emitter simply manages synchronization messages and is used as a connection point if the resulting pipe is to be composed with another one. The Collector performs the actual merging, reading input items according to a from-any policy and producing a single output. Namely, the Collector node is a classical non-determinate merge, as defined in [91]. Notice that a merge-farm can never be nested into another merge-farm, since this is precluded by the associativity of merge.

Connecting Pipelines

The to constructor connects a Pipeline to one or more different Pipelines by broadcasting its output values. This operator is both associative and commutative, as reported in Table 5.1.

Figure 6.4: Compilation of to Pipelines: (a) C⟦to p p1 ... pn⟧, where C⟦p⟧ feeds an Emitter E broadcasting to C⟦p1⟧ ... C⟦pn⟧, whose outputs are gathered by a Collector C; (b) C⟦p1 | ... | pn⟧ = C⟦p1⟧ ◦ ... ◦ C⟦pn⟧.

Figure 6.4 shows the compilation of a branching one-to-n Pipeline (Fig. 6.4a) and a sequence of n linear one-to-one Pipelines (Fig. 6.4b). In the case shown in Fig. 6.4a, the Emitter node broadcasts its input to all the Pipelines, and the results produced by the destination Pipelines are merged by the Collector. Notice that a linear pipe can never be nested into another linear pipe, since this is precluded by the associativity of to.

6.2 Compilation optimizations

We now provide a set of optimizations that demonstrate how compositions of operators can be reduced to a more compact and efficient form by removing redundancies and centralization points. We refer to an optimization from a PE graph g1 to another PE graph g2 with the following notation, meaning that g1 can be rewritten into the (optimized) PE graph g2:

g1 ⇒ g2

In the following, we use the standard notation ⇒∗ to indicate the application of a number of optimizations. We remark that none of the proposed optimizations breaks the correctness of the compilation with respect to the semantics of the subject program. Although we do not provide formal proofs, it can be shown that each of the proposed rewritings corresponds to a semantic equivalence.

6.2.1 Composition and Shuffle

We identify two kinds of PE operator compositions, that is, those that generate a shuffle and those that do not. A shuffle is a phase of the running application in which data need to be repartitioned among processing nodes following some criteria (e.g. key-based shuffle induced by key-based partitioning). This schema is generally implemented by letting nodes establish a point-to-point communication among each other to exchange data needed to proceed with the computation. In general, it is possible to say that PE graphs that generate a shuffle are those that include a data partitioning or re-partitioning, such as p-combine and w-p-combine.

6.2.2 Common patterns

Figure 6.5: Forwarding Simplification: a Collector C followed by an Emitter E is rewritten into a single forwarder node F (C · E ⇒ F).

The first pattern of reduction we propose is the Forwarding Simplification, in which two consecutive nodes that simply forward tokens are fused into a single node, as shown in Figure 6.5. For instance, Figure 6.6a shows the starting configuration for a forwarding simplification, in which C1 and E2 are collapsed into F1, acting as a forwarder node (Fig. 6.6b).
The Worker-to-Worker reduction removes the intermediate Emitter or Collector between two sets of Workers of two consecutive farms. Given two pools of n workers u and v, the Worker-to-Worker optimization directly connects the Workers with the same index (u_i and v_i) into n independent 2-stage Pipelines, thus creating a farm of pipes. This optimization can be applied if and only if the node in between the two pools is only forwarding data or synchronization tokens. For instance, a Worker-to-Worker optimization is applied in the optimization of Figure 6.6b into Figure 6.6c.
It is also possible to apply a further optimization to the Worker-to-Worker scenario, which we call Workers Fusion. It consists in fusing two connected workers (as the result of a Worker-to-Worker optimization) into a single node, thus eliminating any intermediate communication. However, for simplicity we do not show the Workers Fusion in the following sections, even where it is applicable.
The All-to-All optimization is valid when a shuffle is required. The starting configuration of nodes is the same as for the Worker-to-Worker optimization. The centralization point is defined by an emitting node partitioning data following a π policy. This centralization can be dropped by connecting all worker nodes with an all-to-all pattern and moving the partitioning logic to each upstream worker u_i. For instance, this optimization is applied in Figure 6.8: when optimizing Figure 6.8b into Figure 6.8c, the partitioning policy is moved to each map node, which is then responsible for sending data to the correct destination Worker. If the Emitter node is also preparing data for windowing (as in Figure 6.10), this role has to be moved to the downstream workers. Since each downstream worker receives data from any upstream peer, some protocol is needed in order to guarantee that the time-ordering is preserved within each partition.

6.2.3 Operators compositions

In the following, we show some applications of the optimization patterns defined above.

Composition of map and flatmap

Figure 6.6: Compilation of map-to-map and flatmap-to-flatmap composition: (a) C⟦map f⟧ ◦ C⟦map g⟧ and C⟦flatmap f⟧ ◦ C⟦flatmap g⟧, i.e., two farms (E1, workers u1 ... un, C1) and (E2, workers v1 ... vn, C2) in sequence; (b) result of the Forwarding Simplification optimization, with C1 and E2 collapsed into the forwarder F1; (c) final network resulting from the Worker-to-Worker optimization.

In Figure 6.6 we describe the compilation of the composition of two consecutive map or flatmap operators; compiling the to composition of two Pipelines having map or flatmap operators as stages produces the same outcome. In this example, both the Worker-to-Worker and the Forwarding Simplification optimizations are applied.

Composition of map and reduce

Figure 6.7: Compilation of map-reduce composition: (a) C⟦map f⟧ ◦ C⟦reduce ⊕⟧, i.e., two farms (E1, u1 ... un, C1) and (E2, v1 ... vn, C2) in sequence; (b) final network resulting from the Worker-to-Worker and Forwarding Simplification optimizations.

The optimization of a map-reduce composition, shown in Figure 6.7, is analogous to the map-map case. This is possible since the flat reduce (i.e., with neither windowing nor partitioning) poses no requirement on the ordering of data items.

Composition of map and p-reduce

Figure 6.8: Compilation of map-to-p-reduce composition: (a) C⟦map f⟧ ◦ C⟦p-(reduce ⊕) π⟧, with the partitioning Emitter p-E2 between the two worker pools; (b) result of a Forwarding Simplification optimization, with C1 and p-E2 collapsed into p-F1; (c) optimized shuffle obtained by the All-to-All optimization, with the partitioning logic moved into the map workers p-u1 ... p-un.

Figure 6.8 shows the compilation of a map-to-p-reduce composition. This case introduces the concept of shuffle: between the map and p-reduce operators, the data is said to be shuffled (sometimes referred to as parallel-sorted), meaning that data is moved from the map workers (i.e., the producers) to the reduce workers (i.e., the consumers), in which the data will be reduced following a partitioning criterion. By shuffling data, it is possible to assign to each worker the data belonging to a given partition, and the reduce operator produces a single value for each partition. In general, data shuffling produces an all-to-all communication pattern among map and reduce workers. The all-to-all shuffle is highlighted by the dotted box in Fig. 6.8c. As a further optimization, it is possible to move part of the reducing phase into the map workers, so that each reduce worker computes the final result for each key by combining partial results coming from the map workers.

Composition of map and w-reduce

Figure 6.9: Compilation of map-to-w-reduce composition: the map workers u1 ... un feed the centralization node w-F1, which reorders data before the w-reduce workers w-v1 ... w-vn; results are gathered by the Collector C2.

Figure 6.9 shows the compilation of a map-to-w-reduce composition. The first optimization step is the same as for the map-to-p-reduce case (Fig. 6.8a), namely the reduction of forwarding nodes, thus it is not repeated here for simplicity. The centralization point w-F1 is required to reorder the items of the stream before windowing. The final reordering of the reduced values, one for each window, is guaranteed by the C2 Collector.

Composition of map and w-p-reduce

Figure 6.10: Compilation of map-to-w-p-reduce composition: (a) centralization in w-p-F1 for data reordering before the w-p-reduce workers w-v1 ... w-vn; (b) final network resulting from the All-to-All simplification, where partitioning is done by the map workers (p-u1 ... p-un) and windowing (for each partition) is done by the reduce workers.

Figure 6.10 shows the compilation of a map-to-w-p-reduce composition. Again, we omit the Forwarding Simplification optimization for simplicity. As a result of the first Forwarding Simplification optimization, a centralization point w-p-F1 is needed to reorder stream items. By applying the All-to-All optimization, it is possible to make the map operator partition data among the downstream nodes. Hence the reducers reorder incoming data according to time-ordering, apply windowing to each partition and produce a reduced value for each window. By comparing the optimized graphs in Figures 6.9 and 6.10, it can be noticed that adding a partitioning nature to windowed processing enables further optimization. This is not surprising since, as discussed in [60], the KP pattern is inherently simpler than WF: the former can be implemented in such a way that each stream item is statically mapped to a single Worker, thus reducing both data traffic and coordination between processing units.

6.3 Stream Processing

In this section, we discuss some aspects of the proposed execution model from a stream processing perspective. In our setting, streams are defined as unbounded ordered sequences (cf. Section 5.3.1), thus we consider Dataflow graphs in which nodes process infinite sequences of tokens.
Historically, all the data-parallel operators we included in the PiCo syntax (e.g., map, reduce) are defined only for bounded data structures. Although at the syntactic level we allow the application of data-parallel operators to streams (e.g., by composing an unbounded emit with a map operator), this ambiguity is resolved at the semantic level, since the semantics of each unbounded operator is defined in terms of its bounded counterpart. For instance, the operators in the combine family are defined for unbounded collections only if they are endowed with a windowing policy. The semantics of the resulting windowing operator processes each window (i.e., a bounded list) by means of the respective bounded semantics (cf. Equation 5.12).
Let us consider the semantic map over streams. As we showed in Section 6.1.1, it is not possible to implement a strict unbounded semantic map, since this would require reordering the whole (infinite) collection at hand. For this reason, we introduced the weak semantic map, which can be easily extended to the unbounded case. The resulting semantics is defined in terms of its bounded counterpart and exploits a batching mechanism (cf. Equation 5.13). This approach immediately exposes a trade-off: the width of the batching windows is directly proportional to the amount of time-ordering enforced, but it is also directly proportional to the latency for processing a single item. If a stream is time-ordered (cf. Definition 2), there is no need for reordering, thus the minimum batching (i.e., width = 1) can be exploited to achieve the lowest latency. Conversely, if the stream is highly time-disordered and we want to restore the ordering as much as possible, we have to pay a price in terms of latency.
We remark that all the frameworks for data processing we consider for comparison expose similar trade-offs in terms of operational parameters, such as window triggering policies [72, 67]. We propose a symmetric approach, in which this aspect is embedded into the abstract semantics of PiCo programs, rather than in the operational semantics of the underlying runtime.

6.4 Summary

In this chapter we discussed how PiCo Pipelines and operators are compiled into a directed acyclic graph representing the parallelization of a given application, namely the parallel execution dataflow. We showed how each operator is compiled into its corresponding parallel version, providing a set of optimizations applicable when composing parallel operators. DAG optimization is present in all the analytics frameworks presented previously in this thesis, and it is done at runtime: once the application is executed, the execution DAG can be optimized by applying heuristics that create the best execution plan, or by pipelining subsequent operators (i.e., stages in Spark), a strategy similar to the optimizations proposed in PiCo. The main strength of the PiCo approach is that all the proposed optimizations are statically predefined and can be provided as pre-built DAGs implemented directly in the host language. Furthermore, the optimizations do not involve only operators and Pipelines: they are also specific with respect to the structure type processed by the Pipeline.


Chapter 7

PiCo API and Implementation

In this chapter, we provide a comprehensive description of the actual PiCo implementation, both at the user API level and at the runtime level. We also present a complete source code example (Word Count), which is used to describe how a PiCo program is compiled and executed.

7.1 C++ API

In this section, we present a C++ API for writing programs that can be statically mapped to PiCo Pipelines, as defined in Sect. 5.1. By construction, this approach provides a C++ framework for data processing applications, each endowed with a well-defined functional semantics (cf. Sect. 5.3) and a clear parallel execution model (cf. Chap. 6). We opted for a fluent interface, exploiting method cascading (aka method chaining) to relay the instruction context to a subsequent call.
The PiCo design exposed in the previous chapters makes it independent of the choice of the implementation language. We decided to implement the PiCo runtime and API entirely in C++, so that we can exploit explicit memory management and a more efficient runtime in terms of resource utilization, and so that it is possible to take advantage of compile-time optimizations. Furthermore, with C++ it is possible to easily exploit hardware accelerators, making it more suitable for high-performance applications. Although this choice can affect portability, we think that the advantages carried by a C++ runtime can outweigh the advantages of the compile-once-run-everywhere paradigm provided by the JVM. On the other hand, C++ poses some limitations in terms of compatibility with well-established software for data management used in analytics stacks (Kafka, Flume, Hadoop HDFS), which in some cases provide a limited C++ frontend or none at all. Consider for instance access to HDFS: although it exposes a C++ library for read and write operations, this library is very limited and provides a strongly reduced set of operations with respect to the Java API.
In the remainder of this section, we provide a full description of all the entities in a C++ PiCo program, also providing some source code extracts. The current implementation targets shared-memory applications only. In the future, we plan an implementation of the FastFlow runtime for distributed execution by means of a Global Asynchronous Memory (GAM) system model. A GAM system consists of a network of executors (i.e., FastFlow workers as well as PiCo operators) accessing a global dynamic memory with weak sharing semantics. With this model, the programming model, the semantic DAG and its compilation into the parallel execution DAG of PiCo will not change.

7.1.1 Pipe

A C++ PiCo program is a set of operator objects composed into a Pipeline object, processing bounded or unbounded data. A Pipeline can be:

• created as the empty Pipeline, as in the first constructor
• created as a Pipeline consisting of a single operator, as in the second and third constructors
• modified by adding an operator, as in the add member function
• modified by appending other Pipelines, as in the to member functions
• merged with another Pipeline, as in the merge member function
• paired with another Pipeline by means of a binary operator, as in the pair member function

Pipe API

Pipe()
    Create an empty Pipe
template Pipe(const T& op)
    Create a Pipe from an initial operator
template Pipe(T&& op)
    Create a Pipe from an initial operator (move)
template Pipe& add(const T& op)
    Add a new stage to the Pipe
template Pipe& add(T&& op)
    Add a new stage to the Pipe (move)
Pipe& to(const Pipe& pipe)
    Append a Pipe to the current one
Pipe& to(std::vector<Pipe> pipes)
    Append a set of independent Pipes taking input from the current one
template Pipe& pair(const BinaryOperator<...>& a, const Pipe& pipe)
    Pair the current Pipe with a second Pipe by a BinaryOperator that combines the two input items (a pair) with the function specified by the user
Pipe& merge(const Pipe& pipe)
    Merge data coming from the current Pipe and the one passed as argument; the resulting collection is the union of the collections of the two Pipes
void print_DAG()
    Print the DAG as an adjacency list and by a BFS visit
void run()
    Execute the Pipe
void to_dotfile(std::string filename)
    Encode the DAG into a dot file
void pipe_time()
    Return the execution time in milliseconds

Table 7.1: the Pipe class API.

Table 7.1 summarizes the Pipe class API.

In addition to the member functions for creating, modifying and combining Pipeline objects, the last four member functions may only be called on executable Pipelines, namely those representing top-level PiCo Pipelines (cf. Definition 1). For a Pipeline object, being executable is a property that can be inferred from its type. We discuss the typing of Pipeline objects in Sect. 7.1.3.

7.1.2 Operators

The second part of the C++ PiCo API represents the PiCo operators. Following the grammar in Sect. 5.1.2, we organize the API in a hierarchical structure of unary and binary operator classes. The design of the operators API is based on inheritance, in order to follow the grammar describing all operators in a straightforward way; nevertheless, we recognize that the use of template programming without inheritance would improve runtime performance. The implementation makes use of dynamic polymorphism when building the semantic DAG, where virtual member functions are invoked to determine the kind of operator currently processed.
UnaryOperator is the base class representing PiCo unary operators, those with no more than one input and/or output collection. For instance, a Map object takes a C++ callable value (i.e., the kernel) as parameter and represents a PiCo operator map, which processes a collection by applying the kernel to each item. Also ReadFromFile is a sub-class of UnaryOperator, and it represents those PiCo operators that produce a (bounded) unordered collection of text lines, read from an input file.
BinaryOperator is the base class representing operators with two input collections and one output collection. For instance, a BinaryMap object represents a PiCo operator b-map, which processes pairs of elements coming from two different input collections and produces a single output for each pair. A BinaryMap object is passed as parameter to Pipeline objects built by calling the pair member function (cf. Table 7.1).

Operator constructors

template Map(std::function mapf)
    Map constructor by defining its kernel function mapf : In → Out
template FlatMap(std::function flatmapf)
    FlatMap constructor by defining its kernel function flatmapf : ⟨In, FlatMapCollector⟨Out⟩⟩ → void
template Reduce(std::function reducef)
    Reduce constructor by defining its kernel function reducef : ⟨In, In⟩ → In
template PReduce(std::function preducef)
    PReduce constructor by defining its kernel function preducef : ⟨In, In⟩ → In on partitioned input (i.e., reduce by key)
template BinaryMap(std::function bmapf)
    BinaryMap constructor by defining its kernel function bmapf : ⟨In1, In2⟩ → Out
ReadFromFile()
    ReadFromFile constructor to read input data from a file; each line is returned to the user as std::string
template ReadFromSocket(char delimiter)
    ReadFromSocket constructor to read input data from a socket; lines are separated by a user-defined delimiter and returned to the user as std::string
template WriteToDisk(std::function func)
    WriteToDisk constructor to write to the specified text file by defining its kernel function: In → void
template WriteToStdOut(std::function func)
    WriteToStdOut constructor to write to the standard output by defining its kernel function: In → void

Table 7.2: Operator constructors.

Table 7.2 summarizes the constructors of the C++ PiCo operators. In the following sections, we describe all the implemented classes, also showing their inheritance class diagrams.

The map family

Map is a UnaryOperator. As we anticipated above, it represents the PiCo operator map, which processes an input collection by applying a user-defined kernel (a C++ callable value) to each item. Similarly, FlatMap represents the flatmap operator, which produces zero, one or more elements upon processing each item from the input collection. A FlatMap object takes as input a FlatMapCollector object, representing the storage for the items produced by each execution of the kernel. An example of the resulting interface is reported in Listing 7.1.

Figure 7.1: Inheritance class diagram for map and flatmap (Operator → UnaryOperator → {Map, FlatMap}).

Figure 7.2: Inheritance class diagram for reduce and p-reduce (Operator → UnaryOperator → {Reduce, PReduce}).

In the case of ordered collections, we opted for implementing the above classes in their weak semantics version (defined in Sect. 5.3.2, Eq. 5.13 for the map operator) for both bounded and unbounded processing. This allows us to design a simpler and more efficient runtime, as we will discuss later. Fig. 7.1 reports the inheritance class diagram for the Map and FlatMap classes.

The combine family

Reduce is the UnaryOperator representing the PiCo operator reduce. It synthesizes all the elements in the input collection into an atomic value, according to a user-defined binary kernel (a C++ callable value). PReduce represents the PiCo operator p-reduce, thus it produces one synthesized value for each partition extracted from the input collection by the partitioning policy. Fig. 7.2 reports the inheritance class diagram for the Reduce and PReduce classes. A windowing policy can be added to Reduce and PReduce objects by invoking the window member function. A windowing Reduce (or PReduce) logically splits the input collections into windows, according to the user-provided policy, and produces one synthesized value for each window. Since we only provide a prototypical implementation, we only support the common by-key partitioning policy, thus only supporting reduction of key-value collections. Moreover, we only provide count-based tumbling windows. Listing 7.1 illustrates a Pipeline using the partitioning and windowing variant of Reduce. In the example, the windowing policy is tumbling and count-based with width 8.

// define batch windowing
size_t size = 8;

// define a generic word-count pipeline
Pipe countWords;
countWords
    .add(FlatMap(tokenizer))
    .add(Map([&](std::string in)
        {return KV(in,1);}))
    .add(PReduce([&](KV v1, KV v2)
        {return v1+v2;}).window(size));

Listing 7.1: Creating a simple Pipeline in the C++ PiCo API.

Sources and Sinks

As we will discuss in Sect. 7.1.3, source and sink objects play the crucial role of specifying the type of the collections processed by the Pipeline they belong to. ReadFromFile and ReadFromSocket are sub-classes of InputOperator, a class representing PiCo sources, which produce the collections to be processed by the downstream Pipeline objects. The data populating the collections is read from either a text file, as for ReadFromFile, or a TCP/IP socket, as for ReadFromSocket.

Figure 7.3: Inheritance class diagram for ReadFromFile and ReadFromSocket (Operator → UnaryOperator → InputOperator → {ReadFromFile, ReadFromSocket}).

Figure 7.3 reports the inheritance class diagram for the InputOperator classes. WriteToDisk and WriteToStdOut are sub-classes of OutputOperator, a class representing PiCo sinks, which consume the collections produced by the upstream Pipeline objects. The data is written to either a text file, as for WriteToDisk, or the standard output, as for WriteToStdOut. Moreover, a user-defined kernel (a C++ callable value) is used to process the data prior to writing them to the proper destination.

Figure 7.4: Inheritance class diagram for WriteToDisk and WriteToStdOut (Operator → UnaryOperator → OutputOperator → {WriteToDisk, WriteToStdOut}).

Figure 7.4 reports the inheritance class diagram for the OutputOperator classes. The input file, output file, server name and port are passed to the application as arguments of the executable: -i and -o for the input and output file respectively, -s to define the server address and -p to define the listening port. We remark that the presented API is a prototypical implementation, therefore only a limited set of source and sink objects is provided, for exemplification purposes.

7.1.3 Polymorphism

One distinguishing feature of the PiCo programming model, differently from state-of-the-art frameworks, is that the same syntactic object (e.g., a Pipeline or an operator) can be used to process data collections having different types. For example, the word-count Pipeline in Algorithm 1 can be used to process bags, streams and lists once combined with proper sources and sinks. This feature is enabled by the type system we defined in Sect. 5.2, which allows polymorphic Pipelines and operators. More precisely, Pipelines and operators are polymorphic with respect to both the data type (the type of the collection items) and the structure type (the "shape" of the collection, such as bag or stream). In the proposed C++ API, the data type polymorphism is expressed at compile time by implementing operators as template classes. As reported in Table 7.2, each operator takes a template parameter representing the data type of the collections processed by the operator. Moreover, each Pipeline object is decorated by the set of the supported structure types. We recall that all the polymorphism is dropped in top-level Pipelines. Therefore, executable Pipeline objects have a unique type. By construction, InputOperators play the important role of specifying the unique structure type processed by the executable Pipeline they belong to. For instance, a Pipeline starting with an operator of ReadFromFile type will only process multi-sets, whereas a Pipeline starting with an operator of ReadFromSocket type will only process streams. In the following paragraphs we provide some insight about the type checking and inference processes, which are an implementation of the type system discussed in Sect. 5.2. Although Pipeline types are not fully exposed by the C++ API, they are exploited by the runtime to ensure only legal PiCo Pipelines are built, thus they are part of the API specification. When a member function is called on a Pipeline object, the runtime performs the following two actions on the subject Pipeline:

1. it checks the legality of the call by inspecting the type of both the subject Pipeline and the call arguments (type checking);
2. it updates the type of the subject Pipeline (type inference).

We recall that each PiCo Pipeline has an input and output cardinality, either 0 or 1. For instance, top-level Pipelines have both input and output cardinality equal to zero. We say a Pipeline with non-zero input cardinality is a consumer, since it consumes a collection and thus needs to be prefixed by a source operator in order to be executed. Similarly, we say a Pipeline with non-zero output cardinality is a producer. For instance, the simple Pipe in Listing 7.1 is a producer. When calling a member function on a Pipeline object p causing the addition of an operator a (i.e., add or pair), the invocation fails if any of the following conditions holds (i.e., type checking fails):

1. p is neither empty nor a producer, that is, it already has a sink operator (e.g., a WriteToDisk);
2. the data type compatibility check fails, for instance because the output data type of p (i.e., the output data type of p's last operator) differs from the input data type of a;
3. the structure types are incompatible, for instance a is a windowing operator and p only processes bags.

Furthermore, after adding the operator a, the type of p is updated as follows:

1. p takes a's input or output degree if a is an input or output operator;
2. the new p's structure type is defined as the intersection of the structure types of p and a.

When modifying p by appending another Pipeline q, the to member function fails if any of the following conditions holds:

1. p is neither empty nor a producer;
2. p is not empty and q is not a consumer;
3. the data type or structure type compatibility check fails.

When appending multiple Pipelines (resulting in a branching Pipeline), the previous constraints are still checked for each Pipeline to be added. In case of merging or pairing, in addition to the above conditions, it is also checked that at least one of the two Pipelines to be combined is a non-producer. This way it is guaranteed that any Pipeline has 0 or 1 entry and exit points and can therefore be attached to other Pipelines and operators.
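As an illustration of the structure-type polymorphism and of the composition rules above, the fragment below is a sketch only: it reuses the windowed countWords Pipe of Listing 7.1 and the writer sink of Listing 7.2 with two different sources (constructor signatures follow Table 7.2; the pipe names are illustrative).

// A sketch: the same countWords Pipe attached to two different sources.
// The file-based Pipe processes a bounded multi-set, the socket-based one
// processes a stream; compositions with incompatible data or structure
// types would be rejected by the type checking described above.
Pipe batchWordCount;
batchWordCount
    .add(ReadFromFile())          // structure type: bounded, unordered (bag)
    .to(countWords)               // generic word-count Pipe from Listing 7.1
    .add(writer);                 // WriteToDisk sink as in Listing 7.2

Pipe streamWordCount;
streamWordCount
    .add(ReadFromSocket('\n'))    // structure type: unbounded (stream)
    .to(countWords)               // the very same Pipe, reused unchanged
    .add(writer);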

7.1.4 Running Example: Word Count in C++ PiCo

Listing 7.2 shows a complete example of the Word Count benchmark.

typedef KeyValue KV;

static auto tokenizer = [](std::string& in, FlatMapCollector& collector) {
    std::istringstream f(in);
    std::string s;
    while (std::getline(f, s, ' ')) {
        collector.add(KV(s, 1));
    }
};

int main(int argc, char** argv) {
    // parse command line
    parse_PiCo_args(argc, argv);

    /* define a generic word-count pipeline */
    Pipe countWords;
    countWords
        .add(FlatMap(tokenizer))
        .add(Map([&](std::string in)
            {return KV(in,1);}))
        .add(PReduce([&](KV v1, KV v2)
            {return v1+v2;}));

    /* define i/o operators from/to file */
    ReadFromFile reader;
    WriteToDisk writer([&](KV in) {
        return in.to_string();
    });

    /* compose the pipeline */
    Pipe p2;
    p2                    // the empty pipeline
        .add(reader)      // add single operator
        .to(countWords)   // append a pipeline
        .add(writer);     // add single operator

    /* execute the pipeline */
    p2.run();

    return 0;
}

Listing 7.2: Word Count example in PiCo.

7.2 Runtime Implementation

The runtime of PiCo is implemented on top of the FastFlow library, so that we use ff node, ff pipeline and ff farm as building blocks for the representation of the parallel execution graph (as described in Chap. 6). In this section, we describe how each component of PiCo is effectively mapped to a FastFlow node or pattern. We recall that, by exploiting the FastFlow runtime, it is possible to move only pointers to data among ff nodes, using these pointers as capabilities for synchronization (see Sect. 2.4.1).

Pipe The Pipe class in the user API represents the main entity in a PiCo application. When the run() member function is invoked on a Pipe, the empty ff pipeline corresponding to the Pipe is created. After that, the semantics DAG (Sect. 5) representing the application is visited and, recursively, ff farms and ff nodes are added as stages to the ff pipeline, depending on the currently visited node. We recall that, in FastFlow, each ff node is a thread.

Map, Flatmap, Reduce and PReduce operators are implemented as ff farms. A parallelism parameter defines how many workers are created in each farm. The total number of threads per ff farm is one for the Emitter node, one for the Collector node, plus parallelism threads for the workers. Worker nodes execute the kernel code specified at the API level on each item received. The Collector node receives data from the workers with a from-any policy and forwards them to the next stage of the Pipe. A different implementation of the Emitter and Collector is provided for the p-reduce farm. In this case, the Emitter node has to take care of partitioning items on a per-group basis (i.e., by key). More precisely, the Emitter partitions the key space among the workers, forwarding elements belonging to the same group always to the same worker (of course, it is possible for one worker to receive more than one key to process). The Collector starts gathering results only when the workers are done with the partitioned reduce.
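The thesis does not spell out the exact partitioning function used by the p-reduce Emitter; a common choice, shown here only as a hedged sketch (the helper name is hypothetical), is to hash the key and take it modulo the number of workers, which guarantees that all items sharing a key reach the same worker.

#include <cstddef>
#include <functional>
#include <string>

// Sketch of a by-key partitioning policy for a p-reduce Emitter (assumed,
// not taken from the PiCo sources): items with the same key always map to
// the same worker index in [0, nworkers).
template <typename Key>
std::size_t worker_for_key(const Key& key, std::size_t nworkers) {
    return std::hash<Key>{}(key) % nworkers;
}

// Example: with 4 workers, every occurrence of "foo" goes to the same worker.
// std::size_t dest = worker_for_key(std::string("foo"), 4);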

Input and output operators are implemented as single ff nodes. These nodes have specialized code performing input and output operations. Moreover, input nodes are instructed to send an End Of Stream (EOS) when the input generation completes. The EOS is implemented as two distinct tokens: the FF EOS is internal to the FastFlow runtime and instructs ff nodes to terminate after its reception; the PICO EOS is internal to PiCo and is processed by all ff nodes, and it does not cause the termination of ff nodes. It is used, for instance, by the p-reduce Emitter node to instruct the workers that the end of the stream/file has been reached and to forward their reduce results to the Collector before terminating with the FF EOS.

To and Merge are Pipe member functions for appending and merging Pipes, respectively. In this paragraph, we consider the merge member function having signature Pipe& merge(const Pipe& pipe). The merging invocation results in the instantiation of a ff farm where the two Pipes to be merged are added as workers, as new ff pipelines; this is possible thanks to the FastFlow composability property (see Sect. 2.4.1). Figure 7.5 shows an example of the DAG resulting from merging three pipelines. EntryPointMERGE is the ff farm Emitter node, while the Merge node is its Collector, receiving all results to be forwarded to the subsequent stages (in this case, a write-to-disk ff node writing to the void.txt file). The EntryPointMERGE broadcasts a PICO SYNC token to the Pipes to be merged in order to start generating input items (read from the nopunct.txt file). Further information about the PICO SYNC token is reported in Section 7.3.4. We recall that the order of items is respected only locally to each Pipe, while there exists an interleaving order in the output from the Collector.

Figure 7.5: Semantics DAG resulting from merging three Pipes.

When merging pipelines, the Emitter is always the first stage of the PiCo pipeline.

7.3 Anatomy of a PiCo Application

In this section, we provide a description of the PiCo execution model. For the sake of simplicity and completeness, we start from the source code of a PiCo application and we explain each step taken to reach the result.

7.3.1 User level

As a concrete example, shown in Listing 7.3, we use an application in which two Pipes are merged. These two Pipes take as input data from two files and map each line into a pair. Then, each pair is written to a file.

typedef KeyValue KV;
parse_PiCo_args(argc, argv);

/* define the operators */
auto map1 = Map([](std::string in)
    {return KV(in, 1);});

auto map2 = Map([](std::string in)
    {return KV(in, 2);});

auto reader = ReadFromFile();

auto wtd = WriteToDisk([](KV in) {
    std::string value = "<";
    value.append(in.Key())
         .append(",")
         .append(std::to_string(in.Value()))
         .append(">");
    return value;
});

/* p1 reads from file and processes it by map1 */
Pipe p1(reader);
p1.add(map1);

/* p2 reads from file and processes it by map2 */
Pipe p2(reader);
p2.add(map2);

/* now merge p1 with p2 and write to file */
p1.merge(p2);
p1.add(wtd);

/* execute the pipeline */
p1.run();

Listing 7.3: Merging example in PiCo.

The input reader operator reads lines from a text file and simply returns each line read unmodified. Since data is read from disk, the structure type determined by the input operator is a bag (bounded, unordered). The output wtd operator takes all pairs of type KeyValue KV and processes them with the user-defined lambda expression. Operators map1 and map2 take as input a line from their reader operator and produce a KeyValue KV pair: the input line is the key, and the C++ lambda auto map1 = Map([](std::string in){return KV(in, 1);}) produces key-value pairs having the input line as key and 1 as value, whereas map2 gives value 2 to each pair. A copy of reader is used to compose the second pipe Pipe p2(reader), which is then merged to p1 by p1.merge(p2). We recall that the compatibility check on structure types and the type check between the last operator's output type and the new operator's input type are performed if the Pipe is not empty (Sect. 7.1.1).

7.3.2 Semantics dataflow

While composing the Pipe, the semantics dataflow graph is also created, that is, the dataflow representing the semantics of the (sequential) application in the form of a directed acyclic graph (see Chapter 4), where vertices represent operators and edges represent data dependencies. This step is done while calling Pipe modifiers, thus no further Pipe preprocessing is needed to build the semantics DAG, which is represented as an adjacency list and implemented by a std::map. It is possible to visualize the DAG by invoking the to dotfile() member function on the Pipe. The resulting DAG of the example application is shown in Figure 7.6.

Figure 7.6: Semantics DAG resulting from the application in listing 7.3.

The class representing each node holds the information about its role in the seman- tics DAG. The role determines if the node is a processing node (i.e., an operator) or an auxiliary node that does not process data (i.e., the EntryPointMERGE node). The role can have three different values: 1) Processing, representing an operator with user-defined code; 2) EntryPoint/ExitPoint, as the EntryPointMERGE and Merge nodes in Fig. 7.6 when merging Pipes (as well as Merge node in Fig. 7.7); 3) BCast node representing the starting point resulting from the invocation of the to(std::vector pipes) member function as shown in Figure 7.7.

Figure 7.7: Semantics DAG resulting from connecting multiple Pipes.
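Before moving to the parallel execution dataflow, the fragment below gives a rough sketch of how a semantics DAG with node roles might be stored, following the adjacency-list/std::map representation and the "adjacency list and BFS visit" behavior of print DAG() mentioned above; the concrete PiCo data structures are not shown in the thesis, so every name and layout choice here is an illustrative assumption.

#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Illustrative node descriptor: a label plus the role discussed above.
enum class Role { Processing, EntryPoint, ExitPoint, BCast };
struct DagNode { std::string label; Role role; };

// Semantics DAG kept in std::map containers (integer ids are assumptions).
using Nodes = std::map<int, DagNode>;
using Adj   = std::map<int, std::vector<int>>;   // adjacency list

// BFS visit printing each reachable node's label.
void bfs_print(const Nodes& nodes, const Adj& adj, int root) {
    std::queue<int> q;
    std::set<int> seen{root};
    q.push(root);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        std::cout << nodes.at(v).label << '\n';
        auto it = adj.find(v);
        if (it == adj.end()) continue;
        for (int w : it->second)
            if (seen.insert(w).second) q.push(w);
    }
}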

7.3.3 Parallel execution dataflow

When the run() member function is called on p1, the semantics dataflow is processed to create the parallel execution dataflow. This graph represents the application in terms of processing elements (i.e., actors) connected by data channels (i.e., edges), where operators can be replicated to express data parallelism. This can be an intermediate representation of the possible parallelism exploitation, as shown in Sect. 6.1, where we provided the compilation of each operator in terms of patterns (e.g., farm). We implemented this intermediate representation directly in FastFlow by using the node, farm and pipeline patterns.

The creation of the parallel execution dataflow is straightforward. Starting from an empty ff pipeline picoDAG that will be executed, we visit the first node of the semantics dataflow, which can be an input or an entry-point node. On the basis of its role, a new ff node or ff farm is instantiated and added to picoDAG. The semantics DAG is recursively visited and the following operations are performed:

1. a single ff node is added in case of input/output operators;
2. the corresponding ff farm is added in case of operators different from I/O operators;
3. if an entry point is encountered, a new ff farm is created and added to picoDAG;
   (a) a new ff pipeline is created for each entry point's adjacent node;
   (b) these ff pipelines are built with new ff nodes created by recursively visiting the input Pipe's graph, until reaching the last node of each Pipe visited.

At the end, the resulting picoDAG is always a composition of ff pipelines and ff farms. Figure 7.8 shows the FastFlow network resulting from the example application with no optimization having been performed. Each circle is a ff node. Blue circles represent ff nodes executing user-defined code: read from file (Rff), the map workers (w1, . . . , wn) and write to disk (Wtd). Green circles represent Emitter and Collector ff nodes: Emap and Cmap respectively for the map1 and map2 ff farms, and EPm and Cmerge respectively for the merge ff farm.

Figure 7.8: Parallel execution DAG resulting from the application in Listing 7.3.

The figure represents the corresponding FastFlow network without any of the optimizations proposed in Section 6.2. In this case, for instance, it would be possible to apply a forwarding simplification to unify Cmerge and the Cmap nodes, as well as to unify the Emap and Rff ff nodes in each branch of the EPm ff node.

7.3.4 FastFlow network execution

In this section, we provide the description of the execution of the picoDAG pipeline, starting from a brief summary of the FastFlow runtime.

From the orchestration viewpoint, the process model employed is a Communicating Sequential Processes (CSP)/Actor hybrid model, where processes (ff nodes) are named and the data paths between processes are explicitly identified. (The CSP model describes a system in terms of component processes that operate independently and interact with each other through message-passing communication.) The abstract units of communication and synchronization are known as channels and represent a stream of data exchanged between two processes. Each ff node is a C++ class entering an infinite loop through its svc() member function where (a minimal sketch of such a node is given after the list):

1. it gets a task from the input channel (i.e., a pointer);
2. it executes the business code on the task;
3. it puts a task into the output channel (i.e., a pointer).
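The following is a minimal, self-contained sketch of this svc() loop written against the classic FastFlow interface; exact signatures may differ across FastFlow versions, the stage names are illustrative, and the three-stage pipeline is only an example, not a PiCo component.

#include <iostream>
#include <ff/pipeline.hpp>   // ff_node, ff_pipeline (classic FastFlow API)

using namespace ff;

// Source stage: svc() is called once, emits a few tasks, then returns EOS.
struct Source : ff_node {
    void* svc(void*) override {
        for (long i = 1; i <= 5; ++i)
            ff_send_out(new long(i));   // pointers act as capabilities
        return EOS;                     // FastFlow end-of-stream token
    }
};

// Worker stage: gets a task (a pointer), runs the business code, forwards it.
struct Square : ff_node {
    void* svc(void* t) override {
        long* v = static_cast<long*>(t);
        *v = (*v) * (*v);
        return v;                       // put the task into the output channel
    }
};

// Sink stage: consumes tasks and frees them.
struct Sink : ff_node {
    void* svc(void* t) override {
        long* v = static_cast<long*>(t);
        std::cout << *v << '\n';
        delete v;
        return GO_ON;                   // keep waiting for more input
    }
};

int main() {
    ff_pipeline pipe;
    pipe.add_stage(new Source);
    pipe.add_stage(new Square);
    pipe.add_stage(new Sink);
    return pipe.run_and_wait_end();     // each stage runs as a thread
}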

Representing communication and synchronization by a channel ensures that synchronization is tied to communication and allows layers of abstraction at higher levels to compose parallel programs where synchronization is implicit. For a more complete review of FastFlow, please refer to Section 2.4.1. Patterns to build a graph of ff nodes, such as farms, are defined in the core patterns level of the FastFlow stack reported in Figure 2.8. Since the graph of ff nodes is a streaming network, any FastFlow graph is built using two streaming patterns (farm and pipeline) and one pattern-modifier (loopback, to build cyclic networks). These patterns can be nested to build arbitrarily large and complex graphs. Each ff node in a ff pipeline or ff farm corresponds to a thread.

In a PiCo application, as the one in Fig. 7.8, a compiled DAG can start with an input node or with an entry point obtained by the compilation of a merge operator. In the case of an entry point (such as the first stage of the ff pipeline in the example), this node is instructed to send to all its neighbor nodes a token (specific to the PiCo runtime, called PICO SYNC), using a round-robin policy. This token makes the computation start on the input nodes Rff shown in Figure 7.8. When the input token generation ends, the ff node corresponding to the Rff operator sends a PICO EOS token and exits from the svc() member function (we recall that the svc() member function of a ff node executes the kernel code of the operator). On exit, it terminates its life-cycle by emitting the FastFlow FF EOS token. Differently from all other operator ff nodes, which are executed as many times as the number of input data they process (i.e., on a stream of microbatches), an input node's svc() is executed only once and stops on input termination. In the running example, both Rff nodes read lines from a file that are then forwarded to the following node of the pipeline. Tokens are sent out at microbatch granularity (in this case, a microbatch is a fixed-size array of lines read from the input file).

Following our example, the next stages of both Rffs are the Emitters of the map farms. Since we implemented PiCo's parallel operators following the weak semantics of map (defined in Sect. 5.3.2, Eq. 5.13 for the map operator) for both bounded and unbounded processing, the map ff farms process microbatches instead of single tokens. Each worker of the map ff farm processes the received microbatch by applying the user-defined function; it then allocates a new microbatch to store the result of the user-defined function and deletes the received microbatch. The new microbatch is forwarded to the next node. The general behavior of a worker during its svc() call is that it deletes each input microbatch (allocated by the Emitter) after it has been processed, and the results of the kernel function (applied to all elements of the microbatch) are stored into a new microbatch. When received, PICO EOS and PICO SYNC tokens are forwarded as well. When the Collector receives PICO EOS tokens from all workers, it forwards the PiCo end-of-stream token to the next stage, namely the Cmerge node. The Cmerge ff node implements a from-any policy to collect results. It then forwards results to the next stage, in this case the output operator Wtd. This last node is a single sequential ff node (we recall that input and output processing nodes are always sequential), writing the received data to a specified file. When Wtd receives PICO EOS, the file is closed and the computation terminates. At this point, the FastFlow runtime manages ff node destruction and runtime cleanup.

7.4 Summary

In this Chapter we provided a description of the actual PiCo implementation, both at the user API level and at the runtime level. The PiCo API implementation follows a fluent interface, exploiting method cascading (aka method chaining) to relay the instruction context to a subsequent call. It is implemented entirely in C++, so that we can exploit explicit memory management and a more efficient runtime in terms of resource utilization, and it is possible to take advantage of compile-time optimizations. The design of the operators API is based on inheritance in order to follow in an easy way the grammar describing all operators; nevertheless, we recognize that the use of template programming without inheritance would improve runtime performance. The implementation makes use of dynamic polymorphism when building the semantics DAG, where virtual member functions are invoked to determine the kind of operator currently processed. We described the anatomy of the PiCo runtime starting from a complete source code example (Word Count). The Pipeline described in the source code is first provided as the semantics DAG; then we went through the compilation step, in which the Pipeline is compiled into the parallel execution graph (namely, the parallel execution DAG). The parallel execution graph, which is agnostic with respect to the implementation, is converted into a FastFlow network of patterns that results from the composition of ff pipelines and ff farms. Finally, the effective execution is described, which is available in shared memory only. In the future, we plan an implementation of the FastFlow runtime for distributed execution by means of a Global Asynchronous Memory (GAM) system model. A GAM system consists of a network of executors (i.e., FastFlow workers as well as PiCo operators) accessing a global dynamic memory with weak sharing semantics.


Chapter 8

Case Studies and Experiments

This chapter provides a set of experiments based on examples defined in Sect. 5.4. We compare PiCo to Flink and Spark, focusing on the expressiveness of the programming model and on performance in shared memory.

8.1 Use Cases

We tested PiCo with both batch and stream applications. A first set of experiments comprises the word count and stock market analysis use cases. We now describe each use case, also providing source code snapshots reporting the classes related to the core of each application. All listings are reported in Appendix A.

8.1.1 Word Count

Considered as the “Hello, World!” of Big Data analytics, a word count application typically belongs to batch processing. The input is a text file, which is split into lines. Then, each line is tokenized into words: this operation is implemented as a flatmap, which outputs a number of items depending on the input line. Each of these items, namely the words of the input file, is processed by a map operator, which produces a key-value pair < w, 1 > for each word w. After each word has been processed, all pairs are grouped by key and the values of each pair are reduced by a sum. The final result is a single pair for each word, where the value represents the number of occurrences of that word in the text. A PiCo word count application can be found in Listing 7.2. The source code for the Flink and Spark Word Count are shown in Listings A.3 and A.6, respectively.

8.1.2 Stock Market

The following examples implement the use cases reported in Sect. 5.4. The first use case implements the “Stock Pricing” program that computes a price for each option read from a text file. Each line is parsed to extract a stock name followed by stock option data. A map operator then computes prices by means of the Black & Scholes algorithm for each option and, finally, a reducer extracts the maximum price for each stock name. In the following sections, we provide source code extracts for the implementations in PiCo (Listing A.1), Flink (Listing A.4) and Spark (Listing A.7).
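For reference, the Black & Scholes price of a European call option, which is the kind of closed-form computation the map kernel performs for each parsed option (the formula below is the standard one and is not taken from the thesis listings), is

    C = S N(d1) − K e^(−rT) N(d2),   with   d1 = (ln(S/K) + (r + σ²/2) T) / (σ √T)   and   d2 = d1 − σ √T,

where S is the spot price, K the strike price, r the risk-free rate, σ the volatility, T the time to maturity, and N the standard normal cumulative distribution function.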

8.2 Experiments

The architecture used for experiments is the Occam Supercomputer (Open Com- puting Cluster for Advanced data Manipulation) [127,8], designed and managed by the University of Torino and the National Institute for Nuclear Physics. Occam has the following technical characteristics:

• 2 Management Nodes: CPU - 2x Intel Xeon E5-2640 v3, 8 core, 2.6 GHz; RAM - 64 GB/2133 MHz; Disk - 2x HDD 1 TB Raid0; Net - IB 56 Gb + 2x10 Gb + 4x1 Gb; Form factor - 1U
• 4 Fat Nodes: CPU - 4x Intel Xeon E7-4830 v3, 12 core, 2.1 GHz; RAM - 768 GB/1666 MHz (48x16 GB) DDR4; Disk - 1 SSD 800 GB + 1 HDD 2 TB 7200 rpm; Net - IB 56 Gb + 2x10 Gb
• 32 Light Nodes: CPU - 2x Intel Xeon E5-2680 v3, 12 core, 2.5 GHz; RAM - 128 GB/2133 (8x16 GB); Disk - SSD 400 GB SATA 1.8 inch; Net - IB 56 Gb + 2x10 Gb; Form factor - high density (4 nodes per RU)
• 4 GPU Nodes: CPU - 2x Intel Xeon E5-2680 v3, 12 core, 2.1 GHz; RAM - 128 GB/2133 (8x16 GB) DDR4; Disk - 1x SSD 800 GB SAS 6 Gbps 2.5 inch; Net - IB 56 Gb + 2x10 Gb; GPU - 2x NVIDIA K40 on PCI-E Gen3 x16
• Scratch Disk: Disk - HDD 4 TB SAS 7200 rpm; Capacity - 320 TB raw, 256 TB usable; Net - 2x IB 56 Gb FDR + 2x10 Gb; File system - Lustre parallel filesystem
• Storage: Disk - 180x 6 TB 7200 rpm SAS 6 Gbps; Capacity - 1080 TB raw, 768 TB usable; Net - 2x IB 56 Gb + 4x10 GbE; File system - NFS export; Fault tolerance - RAID 6 equivalent with Dynamic Disk Pools
• Network: IB layer - 56 Gbps; ETH10G layer - 10 Gbps; IB topology - FAT-TREE; ETH10G topology - FLAT

We performed tests on scalability and best execution time comparing PiCo to Spark and Flink on batch and stream applications. The current experimentation is on shared memory only. In the future, we plan an implementation of the FastFlow runtime for distributed execution by means of a Global Asynchronous Memory (GAM) system model, so that we will be able to have a performance evaluation also in a distributed memory model.

8.2.1 Batch Applications

In the first set of experiments, we run the Word Count and Stock Pricing examples reported in the previous section. First, we show the scalability obtained by PiCo with respect to the number of parallel threads used for parallel operators. We compute the scalability factor as the ratio between the execution time with parallelism 1 (i.e., sequential operators) and the execution time obtained by exploiting more parallelism. It is defined as T_1/T_i ∀ i ∈ {1, . . . , n_c}, with n_c representing the number of physical cores of the machine. The maximum number of physical cores is 48 on the Fat Nodes (4x Intel Xeon E7-4830 v3, 12 core, 2.1 GHz).
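As a concrete instance of this definition (using the PiCo Word Count figures reported later in Table 8.4), the scalability with 16 workers is

    T_1 / T_16 = 21938.15 ms / 1612.42 ms ≈ 13.6,

which matches the scalability factor of 13.60 reported for that configuration.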


PiCo

By default, PiCo allocates a single worker thread in each farm corresponding to an operator, and allocates microbatches with size 8 to collect computed data (see Sect. 6.1.1 for batch semantics). We tested PiCo with different microbatch sizes to measure its scalability also with respect to the number of microbatch allocations. The best microbatch size is 512 elements, which naturally leads to a reduction of dynamic memory allocations. To increase performance, we also used two different thread pinning strategies. The default pinning strategy in Occam is interleaved on physical cores first, so that each consecutive operator of the pipeline is mapped to a different socket of the node. In this scenario, a linear mapping using physical cores first helped in reducing the memory access time to data, since allocations are done mainly in the memory near the thread that is accessing that data. Figure 8.1 shows scalability and execution times for the Word Count application: each value represents the average of 20 runs for each number of workers, the microbatch size is 512, and the thread pinning strategy is physical cores first. The size of the input file is 600 MB; it is a text file of random words taken from a dictionary of 1K words. In the Word Count pipeline, PiCo instantiates a total of 5 fixed threads (corresponding to sequential operators), plus the main thread, plus a user-defined number of workers for the flatmap operator. To exploit at most 48 physical cores, we can run at most 42 worker threads.

Figure 8.1: Scalability and execution times for the Word Count application in PiCo.

With 16 and 32 workers for the map operator, mapped on physical cores, PiCo obtains similar average execution times: the best average execution time obtained is 1.61 seconds with 16 workers, with a scalability factor of 13.60. Figure 8.2 shows scalability and execution times for the Stock Pricing application: each value represents the average of 20 runs for each number of workers, the microbatch size is 512, and the thread pinning strategy is linear on physical cores first. In the Stock Pricing pipeline, PiCo instantiates a total of 4 fixed threads (corresponding to sequential operators), plus the main thread, plus a user-defined number of workers for the map + p-reduce operator.

Figure 8.2: Scalability and execution times for the Stock Pricing application in PiCo.

With 8 workers for the map + p-reduce operator, mapped on physical cores, PiCo obtains the best average execution time of 3.26 seconds and a scalability factor of 6.68.

Limitations on scalability. The applications tested in PiCo showed a common behavior on the first node of each Pipeline, the Read From File (RFF) operator. As the number of workers increases, the execution time of the read from file also increases, making it the application's bottleneck. The RFF's kernel code is the following: each line of the input file is read into a std::string and stored into a fixed-size microbatch. Once the microbatch is full, it is forwarded to the next operator and a new microbatch is allocated. This is repeated until the end of file is reached. As the microbatch size increases, the number of allocations decreases, but RFF still represents the bottleneck. In Tables 8.1 and 8.2, we report execution times of a single run of Word Count and Stock Pricing with two different microbatch sizes (8 and 512), showing how the scalability of the workers is still increasing despite the bottleneck on RFF, giving room for improvement on the total execution time.

Microbatch size 8 (execution time in ms)
workers   exec. time   read from file   worker      worker scalability
1         21874.50     901.98           21685.10    1.00
2         10807.80     915.61           10786.20    2.01
8         2916.80      1016.87          2753.39     7.87
16        1718.20      1711.02          1535.38     14.12
32        1755.85      1743.04          924.14      23.46

Microbatch size 512 (execution time in ms)
workers   exec. time   read from file   worker      worker scalability
1         22091.60     869.07           22087.20    1.00
2         11030.70     861.86           11026.50    2.00
8         2841.45      932.27           2799.31     7.89
16        1635.55      1595.28          1607.74     13.74
32        1603.78      1589.13          917.44      24.07

Table 8.1: Decomposition of execution times and scalability highlighting the bottleneck on the ReadFromFile operator in the Word Count benchmark.

Microbatch size 8 (execution time in ms)
workers   exec. time   read from file   worker      worker scalability
1         22286.70     2391.62          22048.00    1.00
2         11549.10     2524.72          11430.20    1.93
8         3588.49      3586.57          3131.49     7.04
16        5135.71      5133.08          1730.37     12.74
32        6096.26      6092.02          934.54      23.59

Microbatch size 512 (execution time in ms)
workers   exec. time   read from file   worker      worker scalability
1         21775.10     2123.27          21766.70    1
2         11289.20     2201.70          10946.40    1.99
8         3328.80      3326.84          2967.67     7.33
16        3370.84      3768.64          2510.13     8.67
32        4707.89      4704.69          868.11      25.07

Table 8.2: Decomposition of execution times and scalability highlighting the bottleneck on the ReadFromFile operator in the Stock Pricing benchmark.

Tables 8.1 and 8.2 show that, despite the high scalability reached by the workers, the total execution time is limited by the execution time of the read from file. Times for the read from file in Table 8.2 are higher than in Table 8.1, and this is due to the number of lines read in the Stock Pricing benchmark (10M lines with respect to 2M in Word Count): this translates into a 5x higher number of allocations. Worker scalability is still not affected by the bottleneck in the RFF operator. We believe that this behavior comes from allocation contention, which could be relaxed by using an appropriate allocator. For instance, FastFlow provides a specialized allocator that allocates only large chunks of memory, slicing them up into little chunks all with the same size that can be reused once freed. Only one thread can perform allocation operations, while any number of threads may perform deallocations using the allocator. An extension of the FastFlow allocator might be used by any number of threads to dynamically allocate/deallocate memory. Both are based on the idea of the Slab Allocator [38].

Comparison with other frameworks

We compared PiCo to Flink and Spark on both the Word Count and Stock Pricing batch applications. In this section, we provide a comparison on the minimum execution time obtained by each tool as the average of 20 runs for each application, a study on execution time variability, and metrics on resource utilization. Table 8.3 reports the configurations used to run the various tools.

Tool    Parallelism   Microbatch Size / Task Slots   Other Conf
PiCo    1-48          Microbatch Size 512            -
Flink   1-32          Task Slots 48                  Default
Spark   1-48          -                              Default

Table 8.3: Execution configurations for the tested tools.

In Flink, each process has one or more task slots, each of which runs one pipeline of parallel tasks, namely, multiple successive tasks such as, for instance, the n-th parallel instance of a map operator. It is suggested to set this value to the number of physical cores of the machine. In Flink programs, the parallelism parameter determines how operations are split into individual tasks, which are assigned to task slots; that is, it defines parallelism degree for data parallel operators. By setting the parallelism to N, Flink tries to divide an operation into N parallel tasks computed concurrently using the available task slots. The number of task slots should be equal to the parallelism to ensure that all tasks can be computed in a task slot concurrently but, unfortunately, by setting parallelism to a value greater than 32 the program crashes because of insufficient internal resource availability. It is also possible to run Spark on a single node in parallel, by defining the number of threads exploited by data parallel operators.

Figure 8.3: Comparison of best execution times for Word Count and Stock Pricing reached by Spark, Flink and PiCo.

Figure 8.3 shows that PiCo reaches in both cases the best execution time. Table 8.4 also shows the average and the variability of execution times obtained by all the executions, showing that all the frameworks suffer from limited scalability after 32 cores. We remark that "parallelism" has a different meaning for each tool, as described above in this section. PiCo shows a lower coefficient of variation with respect to Flink and Spark in almost all cases. To determine whether any of the differences between the means of all collected execution times are statistically significant, we ran the ANOVA test on the execution times obtained with the configuration used to obtain the best execution time. The test reported a very low p-value even if the coefficients of variation and standard deviations reported by Flink are very high. Therefore, according to the ANOVA test, the differences between the means are statistically significant. Table 8.4 shows that Flink is generally less stable than Spark, reaching a peak of 18.25% for the coefficient of variation. This behavior can be attributed to different task scheduling to different executors at runtime, causing unbalanced workload (we recall that, in Flink, a task is a pipeline containing a subset of consecutive operators of the application graph). In Spark, the whole graph representing an application is replicated on all executors, so the probability of causing an unbalance is mitigated.
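For clarity, the coefficient of variation (cv) reported in the tables is the standard deviation normalized by the mean and expressed as a percentage:

    cv = (sd / avg) × 100.

For instance, for PiCo Word Count at parallelism 1, 57.51 / 21938.15 × 100 ≈ 0.26%, matching the value in Table 8.4.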

Spark - Word Count
Parallelism   1          2          4         8         16        32         48
avg (ms)      20983.24   12769.16   8846.12   5231.34   4159.26   3617.80*   4211.80
sd (s)        0.35       0.16       0.22      0.42      0.35      0.13       0.22
cv (%)        1.66       1.29       2.49      8.04      8.34      3.59       5.33
avg Scal.     1          1.64       2.37      4.01      5.04      5.80       4.98

Spark - Stock Pricing
Parallelism   1          2          4         8         16        32         48
avg (ms)      32970.27   20836.11   14719.28  6970.84   5158.77   4773.17*   4983.87
sd (s)        0.21       0.30       0.25      0.14      0.11      0.14       0.19
cv (%)        0.63       1.43       1.70      2.00      2.23      2.87       3.79
avg Scal.     1          1.58       2.24      4.73      6.39      6.91       6.61

Flink - Word Count
Parallelism   1          2          4         8         16        32          48
avg (ms)      80095.40   44129.10   26043.35  16765.25  13622.20  12179.00*   -
sd (s)        571.17     1368.31    1718.55   1939.74   1702.42   2222.3      -
cv (%)        0.71       3.10       6.60      11.57     12.50     18.25       -
avg Scal.     1          1.81       3.07      4.78      5.88      6.58        -

Flink - Stock Pricing
Parallelism   1          2          4         8         16        32         48
avg (ms)      36513.6    19638.85   10807.9   6846.6    4787.2    4048.35*   -
sd (s)        450.67     317.02     179.88    108.35    153.46    154.05     -
cv (%)        1.23       1.61       1.66      1.58      3.20      3.80       -
avg Scal.     1          1.86       3.38      5.33      7.63      9.02       -

PiCo - Word Count
Parallelism   1          2          4         8         16         32        48
avg (ms)      21938.15   10991.91   5510.19   2838.84   1612.42*   1676.11   2135.24
sd (s)        57.51      45.73      24.09     23.95     20.74      31.82     60.51
cv (%)        0.26       0.42       0.44      0.84      1.29       1.90      2.83
avg Scal.     1          1.20       3.98      7.73      13.60      13.09     10.27

PiCo - Stock Pricing
Parallelism   1          2          4         8          16        32        48
avg (ms)      21807.41   11166.00   5673.48   3261.60*   3865.59   4852.38   6305.70
sd (s)        176.56     335.43     126.50    66.29      44.62     63.53     55.83
cv (%)        0.81       3.00       2.23      2.03       1.15      1.30      0.88
avg Scal.     1          1.95       3.84      6.69       5.64      4.49      3.46

Table 8.4: Average, standard deviation and coefficient of variation on 20 runs for each benchmark. Best execution times are marked with an asterisk.

To examine resource consumption, we measured CPU and memory utilization using the sar tool, which collects and displays all system activity statistics. The sar tool is part of the sysstat system performance analysis package on Unix systems. We executed 10 runs for each configuration of the examined tools: the results obtained show a low variability, so we do not report average and variance of the collected results. Table 8.5 shows CPU utilization percentages throughout the total execution time only for the best execution times.

Word Count (execution time in ms)
        Best exec. time   Parallelism   CPU (%)   RAM (MB)
Flink   12179.00          32            21.39%    3538.94
Spark   3617.80           32            12.94%    1494.22
PiCo    1612.42           16            19.38%    157.29

Stock Pricing (execution time in ms)
        Best exec. time   Parallelism   CPU (%)   RAM (MB)
Flink   4048.35           32            22.06%    3460.30
Spark   4773.17           32            12.83%    1494.22
PiCo    3261.61           8             13.59%    78.64

Table 8.5: User's percentage usage of all CPUs and RAM used in MB, referred to best execution times.

Interesting results concern the RAM memory footprint, in which PiCo outperforms the other tools. Table 8.5 shows the RAM megabytes used by each tool for each application. This confirms that both Spark and Flink maintain a constant amount of allocated resources in memory. This is due to the fact that there is a resource preallocation managed by an internal allocator, which is in charge of reducing the overhead induced by the Garbage Collector. Hence, independently from the input size (about 600 MB and 650 MB in Word Count and Stock Pricing, respectively), it is not possible to evaluate a correlation between input size and global memory footprint. For instance, Spark's Project Tungsten [59] introduces an explicit memory manager based on the sun.misc.Unsafe package, exposing C-style memory access (i.e., explicit allocation, deallocation, pointer arithmetic, etc.) in order to bypass the JVM garbage collection. As for Flink, it implements a Memory Manager pool, that is, a large collection of buffers (allocated at the beginning of the application) that are used by all runtime algorithms in which records are stored in serialized form. The Memory Manager allocates these buffers at startup and gives access to them to entities requesting memory. Once released, the memory is given back to the Memory Manager. PiCo does not yet rely on any allocator, so there is still room for improvement in its memory footprint.

8.2.2 Stream Applications

In this set of experiments, we compare PiCo to Flink and Spark when executing stream applications. Spark implements its stream processing runtime over the batch processing one, thus exploiting the BSP runtime on stream microbatches, without providing a concrete form of pipelining and reducing the real-time processing feature. Flink and PiCo implement the same runtime for batch and streaming. The application we test is the Stock Pricing (the same as for the batch experiments), to which we added two more option pricing algorithms: Binomial Tree and Explicit Finite Difference. The Binomial Tree pricing model traces the evolution of the option's key underlying variables by means of a binomial lattice (tree), for a number of time steps between the valuation and expiration dates. The Explicit Finite Difference pricing model is used to price options by approximating the continuous-time differential equation describing how an option price evolves over time with a set of (discrete-time) difference equations. The final result of the Stock Pricing use case is, for each stock, the maximum price variance obtained by the three algorithms (Black & Scholes, Binomial Tree, and Explicit Finite Difference). The input stream consists of 10M stock options: each item is composed of a stock name and a fixed number of option values.

In Appendix A, we present the partial source code for the streaming Stock Pricing in Flink (Listing A.5), Spark (Listing A.8) and PiCo (Listing A.2), so that it is also possible to show the difficulties of moving from a batch to a stream processing program in each case.

PiCo

Figure 8.4 shows scalability and execution times for the Stock Pricing streaming application: each value represents the average of 20 runs for each number of workers, and the thread pinning strategy is linear on physical cores first. In the Stock Pricing pipeline, PiCo first instantiates 6 threads corresponding to sequential operators, such as read from socket and write to standard output, plus the Emitter and Collector threads for the map and p-reduce operators. Then, there is the main thread, plus k (a user-defined number) workers for the map and k for the w-p-reduce operators.

Figure 8.4: Scalability and execution times for the Stream Stock Pricing application in PiCo.

With 16 workers for the map and 16 workers for the w-p-reduce operator, mapped on physical cores, PiCo obtains the best average execution time of 7.348 seconds and a scalability factor of 14.87.

Limitations on scalability. This use case also shows the slowdown on the first node of the Pipeline, the ReadFromSocket (RFS) operator. As the number of workers increases, the execution time of the read from socket also increases, making it the application's bottleneck, as was already shown in the batch use cases. The RFS's kernel code is the following: each line read from the socket is stored into a std::string when the delimiter is reached (in this example, "\n") and stored into a fixed-size microbatch. Once the microbatch is full, it is forwarded to the next operator and a new microbatch is allocated. This is repeated until the end of stream is reached. As the microbatch size increases, the number of allocations decreases, but RFS still represents the bottleneck. In Table 8.6, we report the execution times of a single run of Stock Pricing with microbatch size 512, since previous tests showed that it represents the best granularity for good performance. We then compare the execution times obtained with the two pinning strategies (interleaved and linear). The table shows how the scalability of the workers is still increasing despite the bottleneck on RFS and how the two different pinning strategies help in reducing this effect, giving room for improvement on the total execution time. We show execution times for a number of workers up to a maximum of 21 for the map and w-p-reduce each, in order to exploit all the physical cores of the node (48).

Interleaved Pinning Strategy (execution time in ms)
workers   exec. time   read from socket   map worker   map scalability
1         100966.00    3882.59            100966.00    1.00
2         50870.60     4315.00            50870.00     1.98
4         28093.00     4576.83            28023.50     3.60
16        8855.81      8847.47            8848.53      11.41
21        8753.42      8745.43            8750.95      11.54

Linear Pinning Strategy (execution time in ms)
workers   exec. time   read from socket   map worker   map scalability
1         109184.00    3962.74            109184.00    1.00
2         54747.90     3979.72            54747.50     1.99
4         27426.10     4064.94            27423.40     3.98
8         13927.80     4245.09            13926.20     7.84
16        7386.69      6341.60            7277.45      15.00
21        7548.15      7539.83            7545.49      14.47

Table 8.6: Decomposition of execution times and scalability highlighting the bottleneck on the ReadFromSocket operator.

Comparison with other tools

We compared PiCo to Flink and Spark on the Stock Pricing streaming application. In this section, we provide a comparison on the minimum execution time obtained by each tool as the average of 20 runs for each application, a study on execution time variability, and metrics on resource utilization. The configurations used for all tools are those reported in Table 8.3. We executed PiCo with a maximum of 21 worker threads for each data parallel operator (map and w-p-reduce), as previously defined, in order to exploit the maximum number of physical threads of the node. The window is count-based (or tumbling) and has size 8 in Flink and PiCo. In Spark, it is not possible to create count-based windows since only time-based ones are available, so we created tumbling windows with a duration of 10 seconds, with which the execution time with one thread is similar to PiCo's one. We have to remark that the comparison with Spark is not completely fair, since the windowing is not performed in a count-based fashion. To the best of our knowledge, it is not possible to implement such a windowing policy in Spark. We briefly recall some aspects of Flink's and Spark's runtime. In Flink, each process has one or more task slots, each of which runs one pipeline of parallel tasks (i.e., the n-th instance of a map operator on the n-th partition of the input). The parallelism parameter determines how operations are split into individual tasks, which are assigned to task slots; that is, it defines the parallelism degree for data parallel operators. By setting the parallelism to N, Flink tries to divide an operation into N parallel tasks computed concurrently using the available task slots. The number of task slots should be equal to the parallelism to ensure that all tasks can be computed in a task slot concurrently but, unfortunately, by setting the parallelism to a value greater than 32, the program crashes because of insufficient internal resource availability. Due to this limitation, we run Flink with up to 21 worker threads, in order to be aligned with PiCo. We used the same maximum number of threads also in Spark, even though it would be possible to exploit more parallelism: since Spark's scalability in this use case is limited, it is not unfair to be aligned with PiCo's parallelism exploitation. For stream processing, Spark implements an extension through the Spark Streaming module, providing a high-level abstraction called discretized stream or DStream. Such streams represent results as continuous sequences of RDDs of the same type, called micro-batches. Operations over DStreams are "forwarded" to each RDD in the DStream, thus the semantics of operations over streams is defined in terms of batch processing. All RDDs in a DStream are processed in order, whereas data items inside an RDD are processed in parallel without any ordering guarantees. Hence, Spark implements its stream processing runtime over the batch processing one, thus exploiting the BSP runtime on stream microbatches, without providing a concrete form of pipelining and reducing the real-time processing feature.

Stream Stock Pricing Best Execution Time (ms)
        Best exec. time   Parallelism   Scalability
Flink   24784.60          16            9.21
Spark   42216.21          16            2.24
PiCo    7348.58           16            14.87

Table 8.7: Flink, Spark and PiCo best average execution times, showing also the scalability with respect to the average execution time with one thread.

Table 8.7 summarizes the best execution times obtained by each tool, extracted from Table 8.8. The table shows that PiCo reaches the best execution time with a higher scalability with respect to the other tools: 14.87 for PiCo versus 9.21 for Flink and 2.24 for Spark. Table 8.8 also shows the average and the variability of the execution times obtained over all the executions. We remark that "parallelism" has a different meaning for each tool, as described in the paragraph above. The table shows that, also in this use case, Flink is slightly less predictable than PiCo (we speak of predictability of execution times since the execution happened to fail, as reported above in this section), reaching a peak of 2.06% for the coefficient of variation. A strong variability is reported for Spark. We have seen, from the web user interface provided by Spark, that the input rate from sockets has a great variability, introducing latencies that lead to a reported average of 5 seconds of task scheduling delay. Furthermore, it can be noticed that the best average execution time in Spark has a coefficient of variation of 38.90%, with a minimum execution time of 24 seconds and a maximum of 74 seconds (values not reported) and that, in all cases, Spark and Flink suffer from scalability issues. PiCo outperforms both Spark and Flink in terms of standard deviation, and outperforms Spark in terms of coefficient of variation. We again ran the ANOVA test, which confirmed that the differences between the means are statistically significant.

Flink - Stream Stock Pricing
Parallelism   1           2           4          8          16          21
avg (ms)      228383.40   119669.30   65031.35   36786.20   24784.60*   27000.85
sd (s)        1721.24     975.84      525.27     273.85     270.51      556.52
cv (%)        0.75        0.81        0.81       0.74       1.09        2.06
avg Scal.     1.00        1.91        3.51       6.21       9.21        8.46

Spark - Stream Stock Pricing
Parallelism   1           2           4          8          16          21
avg (ms)      94400.00    74059.67    50886.67   52929.17   42216.21*   85479.30
sd (s)        8961.03     20207.12    10037.38   5664.81    16424.54    77601.40
cv (%)        9.49        27.28       19.72      10.70      38.90       90.78
avg Scal.     1           1.27        1.85       1.78       2.24        1.10

PiCo - Stream Stock Pricing
Parallelism   1           2           4          8          16          21
avg (ms)      109273.35   54789.40    27532.85   13914.29   7348.58*    7536.46
sd (s)        54.96       49.16       147.78     31.68      74.67       45.89
cv (%)        0.05        0.09        0.54       0.23       1.02        0.61
avg Scal.     1.00        1.99        3.97       7.85       14.87       14.50

Table 8.8: Average, standard deviation and coefficient of variation on 20 runs of the stream Stock Pricing benchmark. Best execution times are marked with an asterisk.

We again measured CPU and memory utilization using the sar tool. We executed 10 runs for each configuration: the results show low variability, so we do not report the average and variance of the collected values. Table 8.9 shows CPU utilization percentages only for the best average execution times reported in Table 8.8.

Stream Stock Pricing

            Best exec. time (ms)   Parallelism   CPU (%)   RAM (MB)
Flink             24784.60             16         14.31     4875.88
Spark             42216.21             16         10.23     3169.32
PiCo               7348.58             16         38.85      314.57

Table 8.9: User-level CPU utilization (percentage over all CPUs) and RAM usage in MB, for the best execution times.

Flink and Spark show a memory utilization of about 4 GB and 3 GB respectively, that is, one order of magnitude greater than PiCo, which uses only about 314 MB, thus outperforming the other tools in resource utilization.

Throughput Values for 10M Stock Options

            Best exec. time (s)   Parallelism   Stocks per second
Flink             24.78               16             403476.35
Spark             42.22               16             236875.81
PiCo               7.35               16            1360806.94

Table 8.10: Stream Stock Pricing: throughput values computed as the number of input stock options divided by the best execution time.

Finally, we provide throughput values computed with respect to the number of stock options in the input stream, in the best execution time scenario reported in Table 8.8. We remark again that the comparison with Spark is not completely fair, since windowing is not performed in a count-based fashion. Table 8.10 shows that PiCo processes more than 1.3M stock options per second, outperforming Flink and Spark, which process about 400K and 237K stock options per second, respectively.
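Explicitly, the values in Table 8.10 are obtained as

\[
  \text{throughput} = \frac{\#\text{input stock options}}{T_{\text{best}}},
  \qquad
  \text{e.g., for PiCo: } \frac{10^{7}}{7.35\ \text{s}} \approx 1.36 \times 10^{6} \text{ stock options per second.}
\]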

8.3 Summary

In this chapter we provided a set of experiments based on the examples defined in Sect. 5.4, comprising both batch and stream applications. We compared PiCo to Flink and Spark, focusing on the expressiveness of the programming model and on performance in shared memory; the current experiments were run on shared memory only. By comparing execution times in both batch and stream applications, we obtained the best execution time with respect to the state-of-the-art frameworks Spark and Flink. Nevertheless, the results showed high dynamic allocation contention in input generation nodes, which limits PiCo scalability. An extension of PiCo using the FastFlow allocator might be used by any number of threads to dynamically allocate/deallocate memory.

We also measured RAM and CPU utilization with the sar tool, which confirmed a lower memory consumption by PiCo with respect to the other frameworks, on both the batch applications (Word Count and Stock Pricing) and the stream application (streaming Stock Pricing): these results rely on the stability of a lightweight C++ runtime, in contrast to Java.

What we reported in this chapter is a preliminary experimental phase. We are working on providing more relevant benchmarks, such as Sort/TeraSort, for which HDFS support is needed.

Chapter 9

Conclusions

In this thesis, we presented PiCo, a new C++ DSL for data analytics pipelines. We started by studying and analyzing a large number of Big Data analytics tools, among which we identified the most representative ones: Spark [131], Storm [97], Flink [67] and Google Dataflow [5]. By analyzing these frameworks in depth, we identified the Dataflow model as the common model that best describes all levels of abstraction, from the user-level API to the execution model. Being all realized under the same common idea, we showed how various Big Data analytics tools share almost the same base concepts, differing mostly in their implementation choices.

We then instantiated the Dataflow model into a stack of layers where each layer represents a dataflow graph/model with a different meaning, describing a program from what the programmer sees down to the underlying, lower-level, execution model layer (Chapter 4).

This study led to the specification and formalization of the minimum kernel of operations needed to create a pipeline for data analytics. As one of the strengths of PiCo, we implemented a framework in which the data model is also hidden from the programmer, thus allowing the creation of a model that is polymorphic with respect to both the data model and the processing model (i.e., stream or batch processing). This makes it possible to 1) re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.); 2) reuse the same operators in different contexts; and 3) update operators without affecting the calling context. These aspects are fundamental for PiCo, since they differentiate it from all other frameworks, which expose different data types to be used in the same application, forcing the user to re-think the whole application when moving from one operation to another. Furthermore, another goal we reached, which is missing in other frameworks (which usually provide only an API description), is that of formally defining the syntax of a program based on Pipelines and operators, hiding the data structures produced and generated by the program, as well as providing a semantic interpretation that maps any PiCo program to a functional Dataflow graph, i.e., the graph that represents the transformation flow followed by the processed collections (Chapter 5). This is in complete contrast with the unclear approach used by implementors of commercial-oriented Big Data analytics tools.

The formalization step concludes by showing how a PiCo program is compiled into a graph of parallel processing nodes. The compilation step takes as input the directed acyclic dataflow graph (DAG) resulting from a PiCo program (the Semantic DAG) and transforms it, using a set of rules, into a graph that we call the Parallel Execution (PE) Graph, representing a possible parallelization of the Semantic DAG. We provided this compilation step in a way that is abstract with respect to any actual implementation: for instance, it may be implemented in shared memory or through a distributed runtime. Moreover, a compiled (and optimized) Dataflow graph may be directly mapped to an actual network of computing units (e.g., communicating threads or processes) or executed by a macro-Dataflow interpreter (Chapter 6).

We succeeded in implementing our model as a C++14-compliant DSL that focuses on ease of programming, with a clear and simple API exposing to the user a set of operator objects composed into a Pipeline object, processing bounded or unbounded data.
This API was designed to exhibit a functional style on top of the C++14 standard, by defining a library of purely functional data transformation operators exhibiting 1) a well-defined functional and parallel semantics, thanks to our formalization, and 2) a fluent interface based on method chaining to improve code writing and readability (Chapter 7). Moreover, our API is data-model agnostic, that is, a PiCo program can address both batch and stream processing with a unique API and a unique runtime, simply by having the user specify the data source as the first node of the Pipeline in a PiCo application. The type system we described and implemented checks whether the Pipeline being built is compliant with the kind of data being processed.

The current version of PiCo runs on shared memory only, but it will be possible to easily exploit distributed-memory platforms thanks to its runtime level, implemented on top of FastFlow, which already supports execution on distributed systems. Furthermore, we remark that choosing FastFlow as the runtime for PiCo gives us almost for free the capability to realize a distributed-memory implementation while maintaining the very same implementation, only by changing the communication among processes. To achieve this goal, we will provide an implementation of the FastFlow runtime for distributed execution by means of a Global Asynchronous Memory (GAM) system model. The GAM system consists of a network of executors (i.e., FastFlow workers as well as PiCo operators) accessing a global dynamic memory with weak sharing semantics, allowing operators to communicate with each other through predefined communicators where they can exchange C++ smart pointers. It is also possible to envision exploiting an underlying memory model such as PGAS or DSM: since FastFlow moves pointers and not data in the communication channels among nodes, the same approach can be used to avoid data movement in a distributed scenario with an ad hoc relaxed consistency model.

Another goal reachable by PiCo is the easy offloading of kernels to external devices such as GPUs, FPGAs, etc. This is possible for two reasons: the first is the C++ language itself, which naturally targets libraries and APIs for heterogeneous programming; the second is the FastFlow runtime and API, which provides support for easily offloading tasks to GPUs. This is clearly a strength of the choice of implementing PiCo in C++ instead of Java/Scala: the latter also provide libraries for GPU offloading, but these are basically wrappers or extensions of OpenCL or CUDA, based on auto-generated low-level bindings, and no such library is yet officially recognized as a Java extension.

From the performance viewpoint, we aim to solve the dynamic allocation contention problems we are facing in input generation nodes, as shown in Chapter 8, which limit PiCo scalability. As future work, we could provide PiCo with the FastFlow allocator: this is a specialized allocator that allocates only large chunks of memory, slicing them up into smaller chunks of the same size that can be reused once freed. The FastFlow allocator relies on malloc and free, which makes it unfeasible to use with C++ Standard Containers and with modern C++ in general. An extension of the FastFlow allocator in this direction (see the sketch below) might be used by any number of threads to dynamically allocate/deallocate memory. In PiCo, we rely on the stability of a lightweight C++ runtime, in contrast to Java.
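As an illustration of what such an extension could look like, the following is a minimal sketch (not the actual FastFlow allocator API; all names are hypothetical) of a thread-safe pool of fixed-size chunks wrapped behind the standard C++ Allocator interface, so that Standard Containers could draw their memory from the pool instead of calling malloc/free on every allocation.

#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <vector>

/* Hypothetical fixed-size chunk pool; chunks are recycled instead of freed. */
class ChunkPool {
    std::size_t chunk_size;
    std::vector<void *> free_list;
    std::mutex mtx;
public:
    explicit ChunkPool(std::size_t size) : chunk_size(size) {}
    ~ChunkPool() { for (void *p : free_list) std::free(p); }
    void *get(std::size_t bytes) {
        if (bytes > chunk_size) return std::malloc(bytes);  /* fall back for large requests */
        std::lock_guard<std::mutex> lk(mtx);
        if (free_list.empty()) return std::malloc(chunk_size);
        void *p = free_list.back();
        free_list.pop_back();
        return p;
    }
    void put(void *p, std::size_t bytes) {
        if (bytes > chunk_size) { std::free(p); return; }
        std::lock_guard<std::mutex> lk(mtx);
        free_list.push_back(p);   /* recycle the chunk */
    }
};

/* Standard-conforming allocator adapter: lets STL containers use the pool. */
template <typename T>
struct PoolAllocator {
    using value_type = T;
    ChunkPool *pool;
    explicit PoolAllocator(ChunkPool *p) : pool(p) {}
    template <typename U>
    PoolAllocator(const PoolAllocator<U> &other) : pool(other.pool) {}
    T *allocate(std::size_t n) {
        void *p = pool->get(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T *>(p);
    }
    void deallocate(T *p, std::size_t n) { pool->put(p, n * sizeof(T)); }
};
template <typename T, typename U>
bool operator==(const PoolAllocator<T> &a, const PoolAllocator<U> &b) { return a.pool == b.pool; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T> &a, const PoolAllocator<U> &b) { return !(a == b); }

/* Usage example: a vector whose storage comes from the pool.
 *   ChunkPool pool(4096);
 *   std::vector<int, PoolAllocator<int>> v{PoolAllocator<int>(&pool)};
 */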
We measured RAM and CPU utilization with the sar tool, which confirmed a lower memory consumption by PiCo with respect to the other frameworks, on both the batch applications (Word Count and Stock Pricing) and the stream application (streaming Stock Pricing). As another future work, we will provide PiCo with fault tolerance capabilities for automatic recovery in case of failures. Another improvement for the PiCo implementation on distributed systems would be to exploit the very same runtime on PGAS or DSMs, in order to still be able to use FastFlow's characteristic of moving pointers instead of data, thus allowing high portability at the cost of just managing communication among actors in a different memory model, which is left to the runtime.

In the experiments chapter, we showed that we achieved the goal of having a lightweight runtime able to better exploit resources (outperforming the other frameworks in memory utilization) and obtaining better execution times on the benchmarks we tested with respect to Flink and Spark, which are two of the most used tools nowadays. PiCo obtained good results even though it is still in a prototype phase, which means it will be possible to further improve performance by providing specialized allocators to reduce dynamic memory allocation contention, currently the main performance issue in PiCo (Chapter 8).

With this thesis we aimed at bringing some order to the confused world of Big Data analytics, and we hope the readers will agree that we have reached our goal, at least in part. We believe that PiCo makes a step forward by giving C++ a chance of entering a world that is almost completely Java-dominated, Java often being considered more user-friendly. Starting from our preliminary work, we can envision a complete C++ framework that will provide all the expected features of a fully equipped environment and that can be easily used by the community of data analytics scientists.


Appendix A

Source Code

In this Appendix we provide source code snippets reporting the classes related to the core operations of each application.

PiCo

Stock Pricing

Pipe stockPricing((ReadFromFile()));
stockPricing
    .to(blackScholes)
    .add(PReduce([](StockAndPrice p1, StockAndPrice p2) {
        return std::max(p1, p2);
    }))
    .add(WriteToDisk([](StockAndPrice kv) {
        return kv.to_string();
    }));

/* execute the pipeline */
stockPricing.run();

Listing A.1: Batch Stock Pricing C++ pipeline in PiCo.

size_t window_size = 8;
Pipe stockPricing(ReadFromSocket('\n'));
stockPricing
    .to(varianceMap)
    .add(PReduce([](StockAndPrice p1, StockAndPrice p2) {
        return std::max(p1, p2);
    }).window(window_size))
    .add(WriteToStdOut([](StockAndPrice kv) {
        return kv.to_string();
    }));

/* execute the pipeline */
stockPricing.run();

Listing A.2: Stream Stock Pricing C++ pipeline in PiCo.

Flink

Word Count

public class WordCount {
    // set up the execution environment
    final ExecutionEnvironment env =
        ExecutionEnvironment.getExecutionEnvironment();
    // get input data
    DataSet<String> text = env.readTextFile(params.get("input"));

    DataSet<Tuple2<String, Integer>> counts =
        // split up the lines in pairs (2-tuples) containing: (word, 1)
        text.flatMap(new Tokenizer())
        // group by the tuple field "0" and sum up tuple field "1"
        .groupBy(0)
        .sum(1);

    // emit result
    if (params.has("output")) {
        counts.writeAsCsv(params.get("output"), "\n", " ");
        env.execute("WordCount Example");
    } else {
        System.out.println("Printing result to stdout. Use --output to specify output path.");
        counts.print();
    }

    /**
     * Implements the string tokenizer that splits sentences into words as a
     * user-defined FlatMapFunction. The function takes a line (String) and
     * splits it into multiple pairs in the form of "(word, 1)"
     * ({@code Tuple2<String, Integer>}).
     */
    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // normalize and split the line
            String[] tokens = value.toLowerCase().split("\\W+");
            // emit the pairs
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}

Listing A.3: Word Count Java class in Flink.

Stock Pricing

public class StockPricing {
    // set up the execution environment
    final ExecutionEnvironment env =
        ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = env.readTextFile(params.get("input"));
    DataSet<Tuple2<String, Double>> max_prices =
        // parse the lines in pairs containing: (stock, option)
        text.map(new OptionParser())
        // compute the price of each stock option
        .map(new BlackScholes())
        // group by the tuple field "0" and extract the max on tuple field "1"
        .groupBy(0).max(1);

    // emit result
    if (params.has("output")) {
        max_prices.writeAsCsv(params.get("output"), "\n", " ");
        env.execute("StockPricing Example");
    } else {
        System.out.println("Printing result to stdout. Use --output to specify output path.");
        max_prices.print();
    }
}

Listing A.4: Batch Stock Pricing Java class in Flink.

// get the execution environment
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
env.setBufferTimeout(20);

// get input data by connecting to the socket
DataStream<String> text = env.socketTextStream("localhost", port, "\n");

DataStream<Tuple2<String, Double>> max_prices =
    // parse the lines in pairs containing: (stock, option)
    text.map(new OptionParser())
    // compute the price of each stock option
    .map(new varianceMap())
    // group by the tuple field "0" and extract the max on tuple field "1"
    .keyBy(0).countWindow(8).max(1);

// print the results with a single thread, rather than in parallel
max_prices.print().setParallelism(1);

Listing A.5: Stream Stock Pricing Java class in Flink.

Spark

Word Count

public final class WordCount {
    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    JavaPairRDD<String, Integer> words =
        lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
            public Iterator<Tuple2<String, Integer>> call(String s) {
                List<Tuple2<String, Integer>> tokens =
                    new ArrayList<Tuple2<String, Integer>>();
                for (String t : SPACE.split(s)) {
                    tokens.add(new Tuple2<String, Integer>(t, 1));
                }
                return tokens.iterator();
            }
        });

    JavaPairRDD<String, Integer> counts =
        words.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }});

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
        System.out.println(tuple._1() + ": " + tuple._2());
    }
}

Listing A.6: Word Count Java class in Spark.

Stock Pricing

JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();

JavaPairRDD<String, OptionData> stock_options =
    lines.mapToPair(new PairFunction<String, String, OptionData>() {
        public Tuple2<String, OptionData> call(String value) {
            // tokenize the line
            String[] tokens = value.split("[\t]");
            int i = 0;
            // parse stock name
            String name = tokens[i++];
            // parse option data
            OptionData opt = new OptionData();
            // parsing options..
            return new Tuple2<String, OptionData>(name, opt);
        }
    });

JavaPairRDD<String, Double> stock_prices = stock_options.mapToPair(new BlackScholes());

JavaPairRDD<String, Double> counts =
    stock_prices.reduceByKey(new Function2<Double, Double, Double>() {
        public Double call(Double i1, Double i2) {
            return Math.max(i1, i2);
        }
    });

List<Tuple2<String, Double>> output = counts.collect();
for (Tuple2<String, Double> tuple : output) {
    System.out.println(tuple._1() + ": " + tuple._2());
}

Listing A.7: Batch Stock Pricing Java class in Spark.

// Create a DStream that will connect to hostname:port
JavaReceiverInputDStream<String> lines =
    jssc.socketTextStream("localhost", 4000, StorageLevels.MEMORY_AND_DISK_SER);

@SuppressWarnings("serial")
JavaPairDStream<String, OptionData> stock_options = lines
    .mapToPair(new PairFunction<String, String, OptionData>() {
        public Tuple2<String, OptionData> call(String value) {
            // tokenize the line
            String[] tokens = value.split("[\t]");
            int i = 0;

            // parse stock name
            String name = tokens[i++];

            OptionData opt = new OptionData();
            // parse option data ...
            return new Tuple2<String, OptionData>(name, opt);
        }
    });

JavaPairDStream<String, Double> stock_prices = stock_options
    .mapToPair(new varianceMap());

JavaPairDStream<String, Double> counts =
    stock_prices.reduceByKey(new Function2<Double, Double, Double>() {
        public Double call(Double i1, Double i2) {
            return Math.max(i1, i2);
        }
    });

Listing A.8: Stream Stock Pricing Java class in Spark.


Bibliography

[1] IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pages 1-70, Aug. 2008.
[2] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems. http://tensorflow.org/, 2015.
[3] M. Abadi and M. Isard. Timely dataflow: A model. In FORTE, pages 131-145, 2015.
[4] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Computer, 29(12):66-76, Dec. 1996.
[5] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow., 8(12):1792-1803, Aug. 2015.
[6] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8:1792-1803, 2015.
[7] M. Aldinucci. eskimo: experimenting with skeletons in the shared address model. Parallel Processing Letters, 13(3):449-460, Sept. 2003.
[8] M. Aldinucci, S. Bagnasco, S. Lusso, P. Pasteris, and S. Rabellino. The Open Computing Cluster for Advanced data Manipulation (occam). In The 22nd International Conference on Computing in High Energy and Nuclear Physics (CHEP), San Francisco, USA, Oct. 2016.
[9] M. Aldinucci, S. Campa, P. Ciullo, M. Coppola, S. Magini, P. Pesciullesi, L. Potiti, R. Ravazzolo, M. Torquati, M. Vanneschi, and C. Zoccolo. The implementation of ASSIST, an environment for parallel and distributed programming. In H. Kosch, L. Böszörményi, and H. Hellwagner, editors, Proc. of 9th Intl. Euro-Par 2003 Parallel Processing, volume 2790 of LNCS, pages 712-721, Klagenfurt, Austria, Aug. 2003. Springer.
[10] M. Aldinucci, M. Coppola, M. Danelutto, M. Vanneschi, and C. Zoccolo. ASSIST as a research framework for high-performance grid programming environments. In J. C. Cunha and O. F. Rana, editors, Grid Computing: Software Environments and Tools, chapter 10, pages 230-256. Springer, Jan. 2006.
[11] M. Aldinucci and M. Danelutto. Skeleton based parallel programming: functional and parallel semantic in a single shot. Computer Languages, Systems and Structures, 33(3-4):179-192, Oct. 2007.
[12] M. Aldinucci, M. Danelutto, M. Drocco, P. Kilpatrick, Claudia Misale, G. Peretti Pezzi, and M. Torquati. A parallel pattern for iterative stencil + reduce. Journal of Supercomputing, pages 1-16, 2016.
[13] M. Aldinucci, M. Danelutto, and P. Kilpatrick. Towards hierarchical management of autonomic components: a case study. In D. E. Baz, T. Gross, and F. Spies, editors, Proc. of Intl. Euromicro PDP 2009: Parallel Distributed and network-based Processing, pages 3-10, Weimar, Germany, Feb. 2009. IEEE.
[14] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. An efficient unbounded lock-free queue for multi-core systems. In Proc. of 18th Intl. Euro-Par 2012 Parallel Processing, volume 7484 of LNCS, pages 662-673, Rhodes Island, Greece, Aug. 2012. Springer.
[15] M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati. FastFlow: high-level and efficient streaming on multi-core. In S. Pllana and F. Xhafa, editors, Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chapter 13. Wiley, Oct. 2014.
[16] M. Aldinucci, M. Danelutto, and P. Teti. An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems, 19(5):611-626, July 2003.
[17] M. Aldinucci, M. Drocco, G. Peretti Pezzi, Claudia Misale, F. Tordini, and M. Torquati. Exercising high-level parallel programming on streams: a systems biology use case. In Proc. of the 2014 IEEE 34th Intl. Conference on Distributed Computing Systems Workshops (ICDCS), Madrid, Spain, 2014. IEEE.
[18] M. Aldinucci, M. Meneghin, and M. Torquati. Efficient Smith-Waterman on multi-core with FastFlow. In M. Danelutto, T. Gross, and J. Bourgeois, editors, Proc. of Intl. Euromicro PDP 2010: Parallel Distributed and network-based Processing, pages 195-199, Pisa, Italy, Feb. 2010. IEEE.
[19] M. Aldinucci and M. Torquati. FastFlow website, 2009. http://mc-fastflow.sourceforge.net/.
[20] M. Aldinucci, M. Torquati, C. Spampinato, M. Drocco, Claudia Misale, C. Calcagno, and M. Coppo. Parallel stochastic systems biology in the cloud. Briefings in Bioinformatics, 15(5):798-813, 2014.
[21] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The Stratosphere platform for big data analytics. The VLDB Journal, 23(6):939-964, Dec. 2014.
[22] Apache Software Foundation. Hadoop, 2013. http://hadoop.apache.org/.
[23] Apache Software Foundation. HDFS, 2013. http://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html.
[24] Apache Software Foundation. Cassandra, 2016. http://cassandra.apache.org.
[25] Apache Software Foundation. HBase, 2016. http://hbase.apache.org/.
[26] Apache Software Foundation. Yarn, 2016. https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html.
[27] Apache Software Foundation. ZooKeeper, 2016. http://zookeeper.apache.org/.
[28] K. Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput., 39(3):300-318, Mar. 1990.
[29] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Communications of the ACM, 52(10):56-67, 2009.
[30] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah. Comparative study of Caffe, Neon, Theano, and Torch for deep learning. CoRR, abs/1511.06435, 2015.
[31] Basho. Riak, 2016. http://basho.com/riak/.
[32] M. A. Beyer and D. Laney. The importance of big data: A definition. Technical report, Stamford, CT: Gartner, June 2012.
[33] R. Bhardwaj, A. Sethi, and R. Nambiar. Big data in genomics: An overview. In 2014 IEEE International Conference on Big Data (Big Data), pages 45-49, Oct. 2014.
[34] T. Bingmann, M. Axtmann, E. Jöbstl, S. Lamm, H. C. Nguyen, A. Noe, S. Schlag, M. Stumpp, T. Sturm, and P. Sanders. Thrill: High-performance algorithmic distributed batch data processing with C++. CoRR, abs/1608.05634, 2016.
[35] H. Bischof, S. Gorlatch, and R. Leshchinskiy. DatTel: A data-parallel C++ template library. Parallel Processing Letters, 13(3):461-472, 2003.
[36] H.-J. Boehm. Threads cannot be implemented as a library. SIGPLAN Not., 40(6):261-268, June 2005.
[37] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 68-78, New York, NY, USA, 2008. ACM.
[38] J. Bonwick and Sun Microsystems. The slab allocator: An object-caching kernel memory allocator. In USENIX Summer, pages 87-98, 1994.
[39] E. A. Brewer. Towards robust distributed systems (abstract). In Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing, PODC '00, pages 7-, New York, NY, USA, 2000. ACM.
[40] D. R. Butenhof. Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.
[41] F. Cappello and D. Etiemble. MPI versus MPI+OpenMP on IBM SP for the NAS benchmarks. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), Supercomputing '00. IEEE Computer Society, 2000.
[42] P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas. Lightweight asynchronous snapshots for distributed dataflows. CoRR, abs/1506.08603, 2015.
[43] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 363-375, New York, NY, USA, 2010. ACM.
[44] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63-75, Feb. 1985.
[45] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou. DeNovo: Rethinking the memory hierarchy for disciplined parallelism. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT '11, pages 155-166, Washington, DC, USA, 2011. IEEE Computer Society.
[46] P. Ciechanowicz, M. Poldner, and H. Kuchen. The Münster skeleton library Muesli: a comprehensive overview. In ERCIS Working Paper, number 7. ERCIS - European Research Center for Information Systems, 2009.
[47] C. Cole and M. Herlihy. Snapshots and software transactional memory. Sci. Comput. Program., 58(3):310-324, 2005.
[48] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1991.
[49] M. Cole. Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallel programming. Parallel Computing, 30(3):389-406, 2004.
[50] M. Cole. Skeletal Parallelism home page, 2009. http://homepages.inf.ed.ac.uk/mic/Skeletons/.
[51] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[52] Crunch. Apache Crunch website. http://crunch.apache.org/.
[53] M. Danelutto, R. D. Meglio, S. Orlando, S. Pelagatti, and M. Vanneschi. A methodology for the development and the support of massively parallel programs. Future Generation Computer Systems, 8(1-3):205-220, 1992.
[54] M. Danelutto and M. Stigliani. SKElib: parallel programming with skeletons in C. In A. Bode, T. Ludwig, W. Karl, and R. Wismüller, editors, Proc. of 6th Intl. Euro-Par 2000 Parallel Processing, volume 1900 of LNCS, pages 1175-1184, Munich, Germany, Aug. 2000. Springer.
[55] M. Danelutto and M. Torquati. Loop parallelism: a new skeleton perspective on data parallel patterns. In M. Aldinucci, D. D'Agostino, and P. Kilpatrick, editors, Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. IEEE.
[56] M. Danelutto and M. Torquati. Structured parallel programming with "core" FastFlow. In V. Zsók, Z. Horváth, and L. Csató, editors, Central European Functional Programming School, volume 8606 of LNCS, pages 29-75. Springer, 2015.
[57] J. Darlington, A. J. Field, P. Harrison, P. H. J. Kelly, D. W. N. Sharp, R. L. While, and Q. Wu. Parallel programming using skeleton functions. In Proc. of Parallel Architectures and Languages Europe (PARLE'93), volume 694 of LNCS, pages 146-160, Munich, Germany, June 1993. Springer.
[58] J. Darlington, Y.-k. Guo, H. W. To, and J. Yang. Parallel skeletons for structured composition. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '95, pages 19-28, New York, NY, USA, 1995. ACM.
[59] Databricks. Spark Tungsten website. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.
[60] T. De Matteis and G. Mencagli. Parallel patterns for window-based stateful operators on data streams: an algorithmic skeleton approach. International Journal of Parallel Programming, pages 1-20, 2016.
[61] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Usenix OSDI '04, pages 137-150, Dec. 2004.
[62] D. del Rio Astorga, M. F. Dolz, L. M. Sánchez, J. G. Blas, and J. D. García. A C++ generic parallel pattern interface for stream processing. In Algorithms and Architectures for Parallel Processing - 16th International Conference, ICA3PP 2016, Granada, Spain, December 14-16, 2016, Proceedings, pages 74-87, 2016.
[63] M. Drocco, Claudia Misale, and M. Aldinucci. A cluster-as-accelerator approach for SPMD-free data parallelism. In Proc. of Intl. Euromicro PDP 2016: Parallel Distributed and network-based Processing, pages 350-353, Crete, Greece, 2016. IEEE.
[64] M. Drocco, Claudia Misale, G. Peretti Pezzi, F. Tordini, and M. Aldinucci. Memory-optimised parallel processing of Hi-C data. In Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing, pages 1-8. IEEE, Mar. 2015.
[65] M. Drocco, Claudia Misale, G. Tremblay, and M. Aldinucci. A formal semantics for data analytics pipelines. https://arxiv.org/abs/1705.01629, May 2017.
[66] J. Enmyren and C. W. Kessler. SkePU: A multi-backend skeleton programming library for multi-GPU systems. In Proceedings of the Fourth International Workshop on High-level Parallel Programming and Applications, HLPP '10, pages 5-14, New York, NY, USA, 2010. ACM.
[67] Flink. Apache Flink website. https://flink.apache.org/.
[68] Flink. Flink streaming examples, 2015. [Online; accessed 16-November-2016].
[69] M. J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, 54(12):1901-1909, 1966.
[70] M. Fowler. FluentInterface website. https://www.martinfowler.com/bliki/FluentInterface.html.
[71] H. González-Vélez and M. Leyton. A survey of algorithmic skeleton frameworks: High-level structured parallel programming enablers. Software: Practice and Experience, 40(12):1135-1160, Nov. 2010.
[72] Google. Google Cloud Dataflow, 2015. https://cloud.google.com/dataflow/.
[73] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, fifth edition, 2011.
[74] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
[75] IBM. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html, 2013.
[76] Intel. Intel® C++ Intrinsics Reference, 2010.
[77] Intel. Intel® AVX-512 instructions, 2013. https://software.intel.com/en-us/blogs/2013/avx-512-instructions.
[78] Intel Corp. Threading Building Blocks, 2011.
[79] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59-72, New York, NY, USA, 2007. ACM.
[80] J. Jeffers and J. Reinders. Front-matter. In J. Jeffers and J. Reinders, editors, Intel Xeon Phi Coprocessor High Performance Programming, pages i-iii. Morgan Kaufmann, Boston, 2013.
[81] D. R. Jefferson. Virtual time. ACM Trans. Program. Lang. Syst., 7(3):404-425, July 1985.
[82] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[83] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Information Processing, pages 471-475, Stockholm, Sweden, 1974. North Holland, Amsterdam.
[84] Kappa-Architecture. Kappa-Architecture website. http://milinda.pathirage.org/kappa-architecture.com/.
[85] Khronos Compute Working Group. OpenCL, Nov. 2009. http://www.khronos.org/opencl/.
[86] M. Kiran, P. Murphy, I. Monga, J. Dugan, and S. S. Baveja. Lambda architecture for cost-effective batch and speed big data processing. In 2015 IEEE International Conference on Big Data (Big Data), pages 2785-2792, Oct. 2015.
[87] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In Proceedings of the 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011.
[88] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690-691, Sept. 1979.
[89] L. Lamport. Specifying concurrent program modules. ACM Trans. Program. Lang. Syst., 5(2):190-222, 1983.
[90] D. Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, February 2001.
[91] E. A. Lee and T. M. Parks. Dataflow process networks. Proc. of the IEEE, 83(5):773-801, 1995.
[92] H. Li. Introduction to Big Data. http://haifengl.github.io/bigdata/, 2016.
[93] Boost C++ Libraries. Boost Serialization documentation webpage. http://www.boost.org/doc/libs/1_63_0/libs/serialization/doc/serialization.html.
[94] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135-146, New York, NY, USA, 2010. ACM.
[95] K. Matsuzaki, H. Iwasaki, K. Emoto, and Z. Hu. A library of constructive skeletons for sequential style of parallel programming. In Proc. of the 1st Intl. Conference on Scalable Information Systems, InfoScale '06, New York, NY, USA, 2006. ACM.
[96] D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP). ACM, November 2013.
[97] M. A. U. Nasir, G. D. F. Morales, D. García-Soriano, N. Kourtellis, and M. Serafini. The power of both choices: Practical load balancing for distributed stream processing engines. CoRR, abs/1504.00788, 2015.
[98] J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. Latency-tolerant software distributed shared memory. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '15, pages 291-305, Berkeley, CA, USA, 2015. USENIX Association.
[99] A. Y. Ng, G. Bradski, C.-T. Chu, K. Olukotun, S. K. Kim, Y.-A. Lin, and Y. Yu. Map-Reduce for machine learning on multicore. In NIPS, Dec. 2006.
[100] B. Nicolae, C. H. A. Costa, Claudia Misale, K. Katrinis, and Y. Park. Leveraging adaptative I/O to optimize collective data shuffling patterns for big data analytics. IEEE Transactions on Parallel and Distributed Systems, PP(99), 2016.
[101] B. Nicolae, C. H. A. Costa, Claudia Misale, K. Katrinis, and Y. Park. Towards memory-optimized data shuffling patterns for big data analytics. In IEEE/ACM 16th Intl. Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016, Cartagena, Colombia, 2016. IEEE.
[102] NVIDIA Corp. CUDA website, June 2013 (last accessed). http://www.nvidia.com/object/cuda_home_new.html.
[103] S. Oaks and H. Wong. Java Threads. Nutshell Handbooks. O'Reilly Media, 2004.
[104] Oracle. NoSQL, 2016. http://nosql-database.org/.
[105] P. S. Pacheco. Parallel Programming with MPI. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996.
[106] I. Park, M. J. Voss, S. W. Kim, and R. Eigenmann. Parallel programming environment for OpenMP. Scientific Programming, 9:143-161, 2001.
[107] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2007.
[108] M. Poldner and H. Kuchen. Scalable farms. In Proc. of Intl. PARCO 2005: Parallel Computing, Malaga, Spain, Sept. 2005.
[109] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA '07, pages 13-24, Washington, DC, USA, 2007. IEEE Computer Society.
[110] J. H. Rutgers. Programming models for many-core architectures: a co-design approach. 2014.
[111] L. M. Sanchez, J. Fernandez, R. Sotomayor, and J. D. Garcia. A comparative evaluation of parallel programming models for shared-memory architectures. In Proceedings of the 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, ISPA '12, pages 363-370, Washington, DC, USA, 2012. IEEE Computer Society.
[112] A. Secco, I. Uddin, G. Peretti Pezzi, and M. Torquati. Message passing on InfiniBand RDMA for parallel run-time supports. In M. Aldinucci, D. D'Agostino, and P. Kilpatrick, editors, Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. IEEE.
[113] J. Serot. Tagged-token data-flow for skeletons. Parallel Processing Letters, 11(4):377-392, 2001.
[114] D. B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Comput. Surv., 30(2):123-169, June 1998.
[115] D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool Publishers, 1st edition, 2011.
[116] M. Steuwer and S. Gorlatch. SkelCL: Enhancing OpenCL for high-level programming of multi-GPU systems. In Proceedings of the 12th International Conference on Parallel Computing Technologies, pages 258-272, St. Petersburg, Russia, Oct. 2013.
[117] Storm. Apache Storm website. http://storm.apache.org/.
[118] Claudia Misale. Accelerating bowtie2 with a lock-less concurrency approach and memory affinity. In Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. IEEE. (Best paper award).
[119] Claudia Misale, M. Aldinucci, and M. Torquati. Memory affinity in multi-threading: the bowtie2 case study. In Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) - Poster Abstracts, Fiuggi, Italy, 2013. HiPEAC.
[120] Claudia Misale, M. Drocco, M. Aldinucci, and G. Tremblay. A comparison of big data frameworks on a layered dataflow model. In Proc. of HLPP2016: Intl. Workshop on High-Level Parallel Programming, pages 1-19, Muenster, Germany, July 2016. arXiv.org.
[121] Claudia Misale, M. Drocco, M. Aldinucci, and G. Tremblay. A comparison of big data frameworks on a layered dataflow model. Parallel Processing Letters, 27(01):1740003, 2017.
[122] Claudia Misale, G. Ferrero, M. Torquati, and M. Aldinucci. Sequence alignment tools: one parallel pattern to rule them all? BioMed Research International, 2014.
[123] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[124] F. Tordini, M. Drocco, Claudia Misale, L. Milanesi, P. Liò, I. Merelli, and M. Aldinucci. Parallel exploration of the nuclear chromosome conformation with NuChart-II. In Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing. IEEE, Mar. 2015.
[125] F. Tordini, M. Drocco, Claudia Misale, L. Milanesi, P. Liò, I. Merelli, M. Torquati, and M. Aldinucci. NuChart-II: the road to a fast and scalable tool for Hi-C data analysis. International Journal of High Performance Computing Applications (IJHPCA), 2016.
[126] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147-156, New York, NY, USA, 2014. ACM.
[127] UniTo-INFN. Occam Supercomputer website. http://c3s.unito.it/index.php/super-computer.
[128] L. G. Valiant. A bridging model for parallel computation. CACM, 33(8):103-111, Aug. 1990.
[129] O. Villa, V. Gurumoorthi, A. Márquez, and S. Krishnamoorthy. Effects of floating-point non-associativity on numerical computations on massively multithreaded systems. In Cray User Group Meeting, CUG '09, 2009.
[130] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
[131] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI '12, Berkeley, CA, USA, 2012. USENIX.
[132] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud '10, pages 10-10, Berkeley, CA, USA, 2010. USENIX Association.
[133] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proc. of the 24th ACM Symposium on Operating Systems Principles, SOSP, pages 423-438, New York, NY, USA, 2013. ACM.
[134] ZeroMQ. ZeroMQ website, 2012. http://www.zeromq.org/.