Quality of Analytics Management of Data Pipelines for Retail Forecasting

Krists Kreics

School of Science

Thesis submitted for examination for the degree of Master of Science in Technology.
Espoo 29.07.2019

Thesis supervisor: Prof. Hong-Linh Truong
Thesis advisors: Dr.Sci. (Tech.) Mikko Ervasti, M.Sc. Teppo Luukkonen

Aalto University, School of Science
Abstract of the Master's Thesis

Language: English
Number of pages: 54+3
Degree programme: Master's Programme in ICT Innovation
Major: Data Science
Code: SCI3095

This thesis presents a framework for managing quality of analytics in data pipelines. The main research question of this thesis is how to manage the trade-off between cost, time and data quality in retail forecasting. In data analytics this trade-off is generally referred to as quality of analytics. The challenge is addressed by introducing a proof-of-concept framework that collects real-time metrics about data quality, resource consumption and other relevant aspects from tasks within a data pipeline. The data pipelines within the framework are developed using Apache Airflow, which orchestrates Dockerized tasks. Different metrics of each task are monitored and stored in Elasticsearch. Cross-task communication is enabled by an event-driven architecture that uses RabbitMQ as the message queue and custom consumer images written in Python. With the help of these consumers the system can control the result with respect to quality of analytics. Empirical testing of the final system with retail datasets showed that this approach can help data science teams provide better services on demand with bounded resources, especially when dealing with big data.
Keywords: machine learning, offline learning, data pipelines, quality of analytics, apache airflow

Acknowledgements

I would like to thank my wife Kamilla for supporting me in the long hours that went into this work and allowing me to fully focus on writing. I really cannot believe in the way we flow.

I would also like to thank my professor Hong-Linh Truong for providing interesting discussion points and feedback.

Finally I would like to thank my advisors Teppo Luukkonen and Mikko Ervasti and all the wonderful people at Sellforte who provided me with a fun and challenging working environment that truly was one of a kind.

Otaniemi, 29.07.2019
Krists Kreics

Contents

Abstract
Acknowledgements
Contents
Abbreviations and Acronyms
1 Introduction
  1.1 Contributions
  1.2 Structure of the thesis
2 Background
  2.1 Case company
  2.2 Problem statement
  2.3 System requirements
  2.4 Test case
3 Literature review
  3.1 Machine learning pipeline management
  3.2 Managing quality of analytics
4 Overview of existing frameworks
  4.1 Comparison attributes
  4.2 Machine learning frameworks
    4.2.1 AWS Sagemaker
    4.2.2 Azure ML Service
    4.2.3 Google ML Engine
    4.2.4 Pachyderm
    4.2.5 Apache PredictionIO
    4.2.6 Valohai
    4.2.7 Kubeflow
  4.3 Evaluation of machine learning frameworks
  4.4 Experiment management frameworks
    4.4.1 DVC
    4.4.2 Polyaxon
    4.4.3 Summary
5 Overview of related technologies
  5.1 Task orchestration
    5.1.1 AWS Step Functions
    5.1.2 Luigi
    5.1.3 Netflix Conductor
    5.1.4 Apache Airflow
  5.2 Task organization
  5.3 Data storage
  5.4 Model serving
  5.5 Resource monitoring
    5.5.1 Cadvisor
    5.5.2 Docker Stats API
    5.5.3 Psutil
6 Technical solution
  6.1 Architecture overview
    6.1.1 Task runner
    6.1.2 Data storage
    6.1.3 Model serving
    6.1.4 Message consumers
    6.1.5 Task design
  6.2 Using the framework
    6.2.1 Setup
    6.2.2 Setting a custom cost function
    6.2.3 Pushing custom metrics
    6.2.4 Setting custom control rules
    6.2.5 Changing the storage
  6.3 Summary
7 Framework evaluation
  7.1 Description of data
  7.2 Description of QoA management strategies
  7.3 Description of metrics
  7.4 Description of pipelines
  7.5 Adjustment action strategy evaluation
  7.6 Resource control strategy evaluation
  7.7 Auxiliary evaluation results
  7.8 Summary
8 Discussion
  8.1 Meeting the requirements
  8.2 Future work
9 Conclusion

Abbreviations and Acronyms

API    Application programming interface
AWS    Amazon Web Services
DAG    Directed acyclic graph
ECS    Amazon Elastic Container Service
EC2    Amazon Elastic Compute Cloud
ML     Machine Learning
QoA    Quality of Analytics
SDK    Software development kit
S3     Amazon Simple Storage Service
vCPU   Virtual CPU

1 Introduction

With the rapid increase in computing power and the growth of available data, machine learning has become a mature topic in the last decade [1]. Nowadays enterprises of different sizes use some kind of machine learning solution to either improve their business or enhance their product. The development and deployment of such solutions bring new overhead to these enterprises. A typical machine learning solution does not only consist of the program code that is used for training models; it also has to be able to connect to a training data source, store the trained models and make these models available for use in different systems. Typically, machine learning systems are composed as data pipelines. A data pipeline orchestrates different data processing tasks. Usually, tasks are executed consecutively and the result of the pipeline is a function or model that is used to make predictions [2].

Offline learning is a subset of machine learning workflows [3]. Such workflows do not change the approximation of their target function once the initial training phase is done. This setting is usually used when data updates are infrequent and when the whole dataset can be used for training the model or target function.

The first focal point of any machine learning solution is to develop an algorithm that can approximate the given data with good enough accuracy. After the algorithm is constructed, it is evaluated. If it is good enough, an optimization process can be started. For example, in startups, one might want to have control over different aspects at the same time.
This is because requirements can change quickly and resources, for example time and money, have to be utilized carefully. This topic has not been studied extensively and requires deeper examination and more practical applications. Since data pipeline management is still a young topic, the available literature and online resources are more limited than for more general software engineering topics. This is why the goal of this thesis is to assess the current state-of-the-art practices and technologies and then, based on that knowledge, architect and develop a solution that provides trade-off management and quality assessment capabilities in an industrial setting.

1.1 Contributions

The main contribution of this thesis is a framework that gives control over trade-off management and provides task-level monitoring capabilities. The proof of concept is open-sourced and available at https://github.com/kristsellforte/qoa_framework. Currently, only a limited number of practical examples of such pipelines exist. The framework was tested with retail data in collaboration with Sellforte [4], a Finnish startup specialized in marketing and promotion analytics. A secondary contribution of the thesis is a thorough comparison of existing machine learning frameworks and common workflows.

1.2 Structure of the thesis

This thesis is structured in nine sections. The first section introduces the thesis and highlights contributions. The second section provides a deeper background and sets out clear goals for the thesis. The third section features a literature overview that highlights current challenges in model management and approaches for managing different trade-offs in data pipelines. The fourth section features an overview of the current state-of-the-art machine learning frameworks to
