Web Workflows for Data Mining in the Cloud
Total Page:16
File Type:pdf, Size:1020Kb
WEB WORKFLOWS FOR DATA MINING IN THE CLOUD Janez Kranjc Doctoral Dissertation Jožef Stefan International Postgraduate School Ljubljana, Slovenia Supervisor: Prof. Dr. Nada Lavrač, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia Co-Supervisor: Assoc. Prof. Dr. Marko Robnik Šikonja, Faculty of Computer and Infor- mation Science, University of Ljubljana, Slovenia Evaluation Board: Assist. Prof. Dr. Martin Žnidaršič, Chair, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia Assoc. Prof. Dr. Igor Mozetič, Member, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia Prof. Dr. Hendrik Blockeel, Member, Kaotolieke Universiteit Leuven, Leuven, Belgium Janez Kranjc WEB WORKFLOWS FOR DATA MINING IN THE CLOUD Doctoral Dissertation SPLETNI DELOTOKI ZA RUDARJENJE PODATKOV V OBLAKU Doktorska disertacija Supervisor: Prof. Dr. Nada Lavrač Co-Supervisor: Assoc. Prof. Dr. Marko Robnik Šikonja Ljubljana, Slovenia, March 2017 v Acknowledgments This thesis would have never been finished without the help of a number of people. Their support and help throughout my studies are greatly appreciated. First of all I would like to thank my supervisor Nada Lavrač for her infinite amount of patience, enthusiasm, support and her always helpful and insightful research ideas and comments. I am also very grateful to my co-supervisor Marko Robnik Šikonja for the copious amount of time spent reviewing my work and his greatly appreciated comments and ideas. I would also like to thank the members of my evaluation board Martin Žnidaršič, Igor Mozetič and Hendrik Blockeel for reading my thesis and providing comments for improvement. I also wish to thank the funding bodies that financially supported my research: the Department of Knowledge Technologies at the Jožef Stefan Institute, the Jožef Stefan International Postgraduate School, and the European Commission for funding the research projects e-LICO, BISON, FIRST, ConCreTe, MUSE, and HBP in which I was involved. Special thanks goes to Vid Podpečan for his work that helped spark the ideas that eventually became this dissertation and his contributions to my work and all his helpful insights and ideas. I would also like to thank everyone I had the pleasure of cooperating and co-authoring papers with, especially: Matic Perovšek for the work we did together on TextFlows; Jasmina Smailović for our work on sentiment analysis and active learning; Roman Orač for implementing the DiscoMLL library that contributed to big data process- ing in ClowdFlows; and Anže Vavpetič for his numerous contributions and creating the Relational Data Mining package. I am thankful to all the work colleagues at the Department of Knowledge Technologies for creating a great working environment. I would like to thank Darko Aleksovski for his numerous contributions to the open-source projects developed during the time of my studies. Thanks also to Jan Kralj, Borut Sluban and Miha Grčar for their valuable time for productive and insightful discussions that greatly motivated my research work. Last but not least, I must thank my family and friends for their encouragement and support. Finally, I owe sincere and earnest thankfulness to my wife Nina for her love, understanding, support and patience throughout my study. vii Abstract The thesis addresses the development of an innovative data mining platform ClowdFlows and novel knowledge discovery scenarios implemented therein as executable data mining workflows. The ClowdFlows platform is implemented as a cloud-based web application with a graphical user interface which supports the construction, execution and sharing of data mining workflows. Big data analytics is supported by several algorithms, including novel ensemble techniques implemented using the MapReduce programming model, and a special stream mining module for real-time analysis using continuous parallel workflow execution. The adaptability of ClowdFlows is demonstrated through descriptions of two platforms that expand the usage of the web-based workflow environment to fields other than data mining. The ConCreTeFlows platform is a novel platform for computational creativity, while TextFlows is focused mainly on text mining and natural language processing. The ConCreTeFlows platform is demonstrated with a conceptual blending use case, while fea- tures of TextFlows are demonstrated on three use cases: comparison of document classifiers and different part-of-speech taggers on a text categorization problem, and outlier detection in document corpora. We present novel use cases in Inductive Logic Programming (ILP) and Relational Data Mining (RDM). The main novelty is a propositionalization technique called wordification which can be seen as a transformation of a relational database into a corpus of text docu- ments. The wordification methodology and the evaluation procedure are implemented as executable workflows in ClowdFlows. The implemented workflows include several other ILP and RDM algorithms, as well as the utility components that enable access to these techniques to a wider research audience. The real-time analysis and data stream mining capabilites are demonstrated on a novel active learning scenario for dynamic adaptive sentiment analysis, which is able to handle changes in data streams and adapt its behavior over time. Established stream mining techniques are transferred to the visual programming paradigm and demonstrated through the sentiment analysis use case, using active learning of microblogging data streams with a linear Support Vector Machine. The thesis contributes to open-source scientific software. The ClowdFlows platform, its adaptations ConCreTeFlows and TextFlows, and all the described use cases are publicly available. This enables experiment reproducibility, as well as workflow adaptations and enhancements. ix Povzetek Disertacija obravnava razvoj inovativne platforme za podatkovno rudarjenje, imenovane ClowdFlows, ter nove načine odkrivanja znanja, implementirane v platformi v obliki izvr- šljivih delotokov. Platforma ClowdFlows je izdelana kot spletna aplikacija v oblaku, njen grafični uporabniški vmesnik pa omogoča izdelavo, izvedbo in objavo delotokov za podat- kovno rudarjenje. Analiza velikih podatkov je mogoča s pomočjo več algoritmov, med drugim z inovativnimi ansambelskimi tehnikami, ki uporabljajo paradigmo MapReduce, in z modulom rudarjenja podatkovnih tokov v realnem času, ki uporablja neprekinjeno vzporedno izvajanje delotokov. Prilagodljivost platforme ClowdFlows je prikazana z opisom dveh izpeljanih platform, ki širita uporabnost delotokov spletne platforme na druga področja. Platforma ConCreTe- Flows je namenjena računalniški ustvarjalnosti, medtem ko se platforma TextFlows osredo- toča na tekstovno rudarjenje in procesiranje naravnega jezika. Platforma ConCreTeFlowa je prikazana s primerom uporabe konceptualnega povezovanja, medtem ko so posebnosti platforme TextFlows prikazane na treh primerih uporabe: s primerjavo klasifikatorjev do- kumentov in različnih označevalcev besednih vrst ter z odkrivanjem osamelcev v zbirkah besedil. V disertaciji so predstavljeni novi primeri uporabe na področjih induktivnega logičnega programiranja in relacijskega podatkovnega rudarjenja. Novost je metoda transformacije relacijskih podatkovnih baz v zbirke besedil, imenovana wordification (besedizacija). Eval- vacija razvite metode je vključena v platformo ClowdFlows kot izvršljiv delotok. Poleg opisane evalvacije so kot delotoki dostopni tudi drugi algoritmi za relacijsko podatkovno rudarjenje in induktivno logično programiranje, prav tako pa tudi pomožne komponente, ki omogočajo dostopnost teh tehnologij širši javnosti. Možnosti analiz v realnem času in rudarjenje podatkovnih tokov je prikazano na pri- meru uporabe aktivnega učenja za dinamično prilagajanje analize sentimenta v podat- kovnih tokovih. Uveljavljene tehnike za rudarjenje podatkovnih tokov smo uporabili po načelu vizualnega programiranja in prikazali na primeru uporabe analize sentimenta na podatkovnih tokovih kratkih spletnih sporočil. Disertacija prispeva tudi k odprtokodni programski opremi. Vsi delotoki in izvorna koda opisanih primerov uporabe, platform ClowdFlows, ConCreTeFlows in TextFlows so javno dostopni. S tem omogočimo ponovljivost eksperimentov in prilagoditve ter izboljšave delotokov. xi Contents Abbreviations xiii 1 Introduction1 1.1 Problem Definition................................ 1 1.2 Purpose of the Dissertation ........................... 2 1.3 Goals of the Dissertation............................. 3 1.4 Hypotheses .................................... 4 1.5 Scientific Contributions ............................. 4 1.6 Dissemination of the Developed Software.................... 6 1.7 Organization of the Thesis............................ 7 2 The ClowdFlows Platform9 2.1 Data Mining Platforms.............................. 9 2.2 Big Data Mining ................................. 10 2.2.1 Batch data processing .......................... 11 2.2.2 Data stream mining ........................... 12 2.3 Development of the ClowdFlows Platform................... 12 2.4 Related Publication ............................... 13 3 Adaptations of ClowdFlows 37 3.1 Creating an Adaptation of the ClowdFlows Platform............. 37 3.1.1 Setting up a development version.................... 37 3.1.2 Creating ClowdFlows widgets and packages.............. 38 3.1.3 ClowdFlows Unistra ..........................