DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

DataOps: Towards Understanding and Defining Data Analytics Approach

KIRAN MAINALI

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

DataOps: Towards Understanding and Defining Data Analytics Approach

Kiran Mainali December 17, 2020

Master’s Thesis

Examiner: Prof. Mihhail Matskin, KTH, Stockholm, Sweden

Industrial Supervisor: DI Lisa Ehrlinger, Software Competence Centre Hagenberg (SCCH), Austria

DI Johannes Himmelbauer, Software Competence Centre Hagenberg (SCCH), Austria

KTH Royal Institute of Technology School of Electrical Engineering and Computer Science (EECS) Department of Computer Science and Engineering SE-100 44 Stockholm, Sweden

Abstract

Data collection and analysis approaches have changed drastically in the past few years. The reason behind adopting a different approach is improved data availability and continuous change in analysis requirements. Data have always been there, but data management is vital nowadays due to the rapid generation and availability of data in various formats. Big data has opened the possibility of dealing with potentially infinite amounts of data in numerous formats in a short time. Data analytics is becoming complex due to data characteristics, sophisticated tools and technologies, changing business needs, varied interests among stakeholders, and the lack of a standardized process. DataOps is an emerging approach advocated by data practitioners to cater to the challenges in data analytics projects. Data analytics projects differ from software engineering in many aspects. DevOps has proven to be an efficient and practical approach to delivering projects in the software industry. However, DataOps is still in its infancy, being recognized as an independent and essential task in data analytics. In this thesis, we uncover DataOps as a methodology to implement data pipelines by conducting a systematic search of research papers. As a result, we define DataOps and outline its ambiguities and challenges. We also explore the coverage of DataOps across the different stages of the data lifecycle. We created comparison matrices of different tools and technologies, categorizing them into functional groups to demonstrate their usage in data lifecycle management. We followed DataOps implementation guidelines to implement a data pipeline using Apache Airflow as workflow orchestrator inside Docker and compared it with simple manual execution of a data analytics project. As per the evaluation, the data pipeline with DataOps provided automation in task execution, orchestration of the execution environment, testing and monitoring, and communication and collaboration, and it reduced the end-to-end product delivery cycle time along with the pipeline execution time.

Keywords: DataOps, Data lifecycle, Data analytics, DataOps pipeline, Data pipeline, DataOps tools and technologies


Sammanfattning Datainsamling och analysmetoder har förändrats drastiskt under de senaste åren. Anledningen till ett annat tillvägagångssätt är förbättrad datatillgänglighet och kontinuerlig förändring av analyskraven. Data har alltid funnits, men datahantering är viktig idag på grund av snabb generering och tillgänglighet av olika format. Big data har öppnat möjligheten att hantera potentiellt oändliga mängder data med många format på kort tid. Dataanalysen blir komplex på grund av dataegenskaper, sofistikerade verktyg och teknologier, förändrade affärsbehov, olika intressen bland intressenter och brist på en standardiserad process. DataOps är en framväxande strategi som förespråkas av datautövare för att tillgodose utmaningarna i dataanalysprojekt. Dataanalysprojekt skiljer sig från programvaruteknik i många aspekter. DevOps har visat sig vara ett effektivt och praktiskt tillvägagångssätt för att leverera projektet i mjukvaruindustrin. DataOps är dock fortfarande i sin linda och erkänns som en oberoende och viktig uppgiftsanalys. I detta examensarbete avslöjar vi DataOps som en metod för att implementera datarörledningar genom att göra en systematisk sökning av forskningspapper. Som ett resultat definierar vi DataOps som beskriver tvetydigheter och utmaningar. Vi undersöker också täckningen av DataOps till olika stadier av datalivscykeln. Vi skapade jämförelsesmatriser med olika verktyg och teknologier som kategoriserade dem i olika funktionella grupper för att visa hur de används i datalivscykelhantering. Vi följde riktlinjerna för implementering av DataOps för att implementera datapipeline med Apache Airflow som arbetsflödesorkestrator i Docker och jämfört med enkel manuell körning av ett dataanalysprojekt. Enligt utvärderingen tillhandahöll datapipelinen med DataOps automatisering i uppgiftskörning, orkestrering i exekveringsmiljö, testning och övervakning, kommunikation och samarbete, och minskad leveranscykeltid från slut till produkt tillsammans med minskningen av tid för rörledningskörning.

Nyckelord: DataOps, Data lifecycle, Data analytics, DataOps pipeline, Data pipeline, DataOps tools and technology


Acknowledgements

There are several people and organizations whom I would like to thank for their help and support in completing this project. First, I would like to express my gratitude to my examiner Prof. Mihhail Matskin for his guidance, assistance, and encouragement throughout the project. His ideas and insights were vital for me to be able to finish the project. I would also like to thank Lisa Ehrlinger and Johannes Himmelbauer at the Software Competence Center Hagenberg for providing valuable assistance. I offer my sincere appreciation for the learning opportunity provided. The research reported in this paper has been funded by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK), the Federal Ministry for Digital and Economic Affairs (BMDW), and the Province of Upper Austria in the frame of the COMET -- Competence Centers for Excellent Technologies Programme managed by the Austrian Research Promotion Agency FFG. I would also like to thank SINTEF AS and the TBFY project members for providing the necessary data and resources needed for the project. Last but not least, I would like to thank my loving family and friends for their continuous motivation and support. Completion of this thesis could not have been accomplished without their support.


Contents

Chapter 1. Introduction ...... 1
1.1. Problem ...... 2
1.2. Research Questions ...... 4
1.3. Purpose ...... 4
1.4. Goals ...... 5
1.5. Benefits, Ethics, and Sustainability ...... 5
1.6. Methodology ...... 5
1.7. Delimitations ...... 6
1.8. Outline ...... 7
Chapter 2. Theoretical Background ...... 8
2.1. Data Lifecycle ...... 8
2.1.1. CRUD Lifecycle ...... 8
2.1.2. IBM Lifecycle ...... 9
2.1.3. USGS Lifecycle ...... 10
2.1.4. DataOne Lifecycle ...... 12
2.1.5. Discussion ...... 13
2.2. Data Pipeline ...... 15
2.3. Data Pipeline in Data Lifecycle ...... 15
2.4. DataOps ...... 16
2.4.1. DataOps Evolution ...... 17
2.4.2. DataOps Principle ...... 20
2.4.3. DataOps Implementation ...... 21
2.4.4. Discussion ...... 26
2.5. Data Governance ...... 27
2.5.1. Data Governance with DataOps ...... 27
2.6. Data Provenance and Lineage ...... 28


2.6.1. Data Provenance and Lineage with DataOps ...... 29
2.7. From Data Pipeline to DataOps Pipeline ...... 29
2.8. Analysis of Related work ...... 31
2.8.1. Commercial Contribution to DataOps ...... 31
2.8.2. Scientific Contribution to DataOps ...... 34
Chapter 3. Method ...... 35
3.1. Exploring DataOps as Data Analytics Methodology ...... 35
3.1.1. Research Process ...... 35
3.2. Project Implementation Using DataOps Methodology ...... 37
3.2.1. Objective ...... 37
3.2.2. Project Introduction ...... 37
3.2.3. Design ...... 38
3.2.4. Implementation ...... 39
Chapter 4. Result and Evaluation ...... 47
4.1. DataOps as Data Analytics Methodology ...... 47
4.1.1. DataOps Definition ...... 47
4.1.2. DataOps in the Data Lifecycle ...... 52
4.1.3. Evaluation of DataOps Tools and Technologies ...... 54
4.1.4. Discussion ...... 75
4.2. Experiment Evaluation ...... 78
4.2.1. Evaluation of Pipeline using DataOps Implementation Guidelines ...... 78
4.2.2. Runtime Evaluation ...... 80
4.2.3. Discussion ...... 82
Chapter 5. Conclusions and Future Work ...... 83
5.1. Conclusions ...... 83
5.2. Future Work ...... 87
Bibliography ...... 88
Appendix ...... 98

List of Figures

Figure 1: CRUD Data lifecycle ...... 9
Figure 2: Where management task fall in data lifecycle ...... 10
Figure 3: USGS Science Data Lifecycle Model ...... 12
Figure 4: DataOne data lifecycle ...... 13
Figure 5: Data lifecycle ...... 14
Figure 6: Simple data pipeline example ...... 15
Figure 7: Data pipeline in the data lifecycle ...... 16
Figure 8: DataOps has evolved from lean manufacturing and software methodologies ...... 20
Figure 9: DataOps pipeline in general ...... 30
Figure 10: Illustration of the research process ...... 37
Figure 11: ETL process to publish knowledge graph from JSON ...... 38
Figure 12: Running ETL pipeline steps manually ...... 40
Figure 13: Illustration of KG ETL pipeline implementation using Apache Airflow in Docker container ...... 41
Figure 14: Apache Airflow architecture ...... 42
Figure 15: DataOps pipeline ...... 50
Figure 16: DataOps in the data lifecycle ...... 54
Figure 17: DataOps ecosystem ...... 75
Figure 18: Data analytics tasks execution time without following DataOps implementation guidelines ...... 81
Figure 19: Data pipeline execution time following DataOps implementation guidelines ...... 82


List of Tables

Table 1: Research approach ...... 6
Table 2: Summary of DataOps contribution from companies ...... 33
Table 3: Literature review overview ...... 36
Table 4: Scripts with input and output in KG ETL pipeline ...... 40
Table 5: DataOps definition’s analysis from literature ...... 48
Table 6: Workflow orchestration tools ...... 59
Table 7: Testing and monitoring tools ...... 62
Table 8: Deployment automation tools ...... 64
Table 9: Data governance tools ...... 66
Table 10: Code, artifact, and data versioning tools ...... 68
Table 11: Analytics and visualization tools ...... 69
Table 12: Collaboration and communication tools ...... 70
Table 13: Other tools and technologies for DataOps ...... 71
Table 14: Summary of the covered DataOps implementation guideline by experiments ...... 80

List of Listings

Listing 1: DAG on Apache Airflow of KG ETL pipeline ...... 43
Listing 2: DataOps definition ...... 49


List of Acronyms and Abbreviations

API Application Programming Interface
AWS Amazon Web Service
BI Business Intelligence
CI/CD Continuous Integration/Continuous Deployment
CRD Custom Resource Definition
CRUD Create Read Update Delete
CPU Central Processing Unit
DAG Directed Acyclic Graph
DSL Domain Specific Language
DGI Data Governance Institute
ETL Extraction, Transformation, and Loading
ELT Extraction, Loading, and Transformation
ECS Amazon Elastic Container Service
GKE Google Kubernetes Engine
GDPR General Data Protection Regulation
HTTP Hypertext Transfer Protocol
HDFS Hadoop Distributed File System
IT Information Technology
IDE Integrated Development Environment
JSON JavaScript Object Notation
KG Knowledge Graphs
LXC Linux Containers
OCDS Open Contracting Data Standards
OS Operating System
RDF Resource Description Framework
RDD Resilient Distributed Dataset
RML RDF Mapping Language
RAM Random Access Memory
REST Representational State Transfer
SPC Statistical Process Control
SDLC Software Development Lifecycle
TM Trademark
USGS United States Geological Survey
UI User Interface
YAML YAML Ain't Markup Language


Chapter 1. Introduction

The way data are collected and analyzed has changed drastically over the past few years. One of the main reasons for this is improved data availability. Data have always been there, but data management has become vital nowadays due to the rapid generation and availability of data in various formats. We refer to the volume, velocity, and variety of data together as big data [1–4]. The concept of big data has opened the possibility of dealing with potentially infinite amounts of data in numerous formats within milliseconds or less. Data create new capital for any enterprise that relies on data to make informed decisions. The growth of data has increased both the challenges and the opportunities. Large investments are being poured into information extraction to enhance the decision-making process. Companies are trying hard to extract information faster than before. However, the data analytics process is becoming complex due to data characteristics, sophisticated tools and technologies, changing business needs, varied interests among stakeholders, and a lack of a standardized process [5,6]. Data management is one of the challenges most companies are facing. Handling data has been costly and inefficient as the data utility sector expands. Data analytics with conventional methods is almost obsolete if we consider the effectiveness of the results it provides. Enormous resource requirements and repetitive tasks make the process costly and time-consuming. Moreover, in this fast-paced environment, companies need prompt results, which traditional data analytics processes can never deliver. Data need to be analyzed to generate reliable results instantly and with minimal operational cost [7]. Several approaches to handling data have been introduced in recent years, and many tools and technologies have become available to meet the needs of data analytics projects. Nevertheless, the challenges remain the same because of the mismatch between different tools and sophisticated business requirements. There is a need for an approach that can serve a vast array of industries with a collective process for combining tools and technologies irrespective of business requirements. DataOps is an emerging approach advocated by data practitioners to cater to the challenges faced during the data analytics process. DataOps is described as a method to automatically manage the entire data lifecycle, from data identification, cleaning, and integration to analysis and reporting [8]. It borrows proven practices from DevOps in the software development lifecycle. DevOps is a set of practices that integrates the software development process and IT teams to shorten the software development lifecycle while providing continuous delivery and software quality


[9,10]. While DevOps is a mature field in software development, DataOps is still in its infancy. DataOps requires extensive industry adoption and evidence of improved outcomes to be recognized as a proven methodology for data analytics projects. DataOps vows to remove the pain of data management for reporting and analytics. Data travel through a tortuous route from source to value. Behind the scenes, data professionals go through gyrations to transform data before releasing it to the business community. These “data pipelines” are error-prone and inefficient: data hop across multiple systems and are processed by various software programs [11]. Humans intervene to apply manual workarounds to fix recalcitrant transactions before the data are combined, aggregated, and analyzed by knowledge workers. Reuse and automation are scarce. Business users wait months for reports and insights. The hidden costs of data operations are monumental. DataOps pledges to smooth the process of building, changing, and managing data pipelines [12]. Its primary goal is to maximize the business value of data and improve customer satisfaction. DataOps does this by speeding up data delivery and analytic output while simultaneously reducing data defects, essentially fulfilling the mantra “better, faster, cheaper” [11]. DataOps emphasizes collaboration, reuse, and automation, along with a heavy dose of testing and monitoring [8]. Team-based development tools are used for creating, deploying, and managing data pipelines.

1.1. Problem

Data analytics is a broad term. Any analysis done to generate information from given data is considered data analytics. However, there are always some typical steps in performing analysis over data. Typically, a data analysis task starts with defining the goals and identifying what to get from a dataset. Then the analysis team retrieves data from sources and transforms it into an analyzable format. The data analysis starts with data ingestion. We need to determine the required technique for a given situation, know the dataset contents, know how to implement the chosen analysis techniques, and understand the process of result generation [13]. After the analysis task, the generated results are distributed to end-users in presentable and understandable formats. Data analysis is not a stepwise process in which tasks are performed one after another; instead, we need to always keep in mind that each step has several iterative processes which might take a number of iterations to complete [14]. DevOps has proven to be an efficient and practical approach to delivering projects in the software industry [15,16]. DevOps is an approach to deliver applications and services at a higher pace using a combination of cultural philosophies (combining


development and operation teams), practices (Continuous Integration, Continuous Delivery, microservices, Infrastructure as Code, monitoring and logging, communication and collaboration) and tools (coding, building, testing, packaging, configuring and monitoring) [17,18]. Traditional software development and infrastructure management models for production and service have been outperformed by DevOps’s speed in the field [15]. DevOps practices are being implemented in the data analytics lifecycle [19]. While collecting data, DevOps can reduce the effort and time needed for data retrieval [20]. Once this process is automated, there is a continuously growing body of data ready for analysis. The automation techniques used for data collection depend on how we are going to collect data. We might have to write custom code to automate the functionality of data retrieval tools. In this step, one DevOps approach could be to store the data retrieval script in a central repository. By doing this, the process becomes transparent and ensures uniformity in the datasets collected by any responsible user. In the data transformation step, DevOps can maintain and update transformation scripts effectively and collaboratively using a Continuous Integration (CI) server and a central code repository. The CI server also performs automated testing of these scripts and ensures that the transformation scripts have the intended functionality. After data are transformed into an analyzable format, analysis of the data begins. The analysis process involves multiple analysis techniques and uses sets of tools depending on the type of data and the results required. Automating this process improves efficiency by allowing several iterations and repetitions. Besides, we can feed additional datasets to the same process without worrying about readjusting and rerunning the tools. A DevOps solution for the analysis stage contains several tasks and tools arranged to perform the analysis process effectively and iteratively. For instance, a central code repository allows the whole analysis team to manage the necessary scripts. Automatic provisioning of virtual environments to deploy and run the analysis tools can be done for the entire team. Additionally, the CI server can regularly manage the testing and deployment of scripts and analysis tools. Data analytics initiatives more closely resemble systems integration and business analysis than a general software project. The significant difference lies in creating an analytics pipeline that copies the operational data from the business, makes business-rule-based data transformations, and populates a datastore from which analysts can extract business information.
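To make the idea of CI-tested transformation scripts concrete, the following is a minimal sketch, assuming a pytest-style test suite run by the CI server on every commit; the function name, fields, and values are illustrative assumptions and are not part of the thesis project.

# transform.py -- a hypothetical transformation script kept in a central repository.
# A CI server could run the tests below on every commit to verify that the script
# still has its intended functionality before it is used in the pipeline.

def normalize_amounts(records):
    """Convert raw 'amount' strings such as '1,200.50' into floats."""
    cleaned = []
    for row in records:
        value = float(str(row["amount"]).replace(",", ""))
        cleaned.append({**row, "amount": value})
    return cleaned


# test_transform.py -- executed automatically by the CI server (pytest style).
def test_normalize_amounts_parses_thousand_separators():
    raw = [{"id": 1, "amount": "1,200.50"}, {"id": 2, "amount": "99"}]
    result = normalize_amounts(raw)
    assert [r["amount"] for r in result] == [1200.50, 99.0]


def test_normalize_amounts_keeps_other_fields():
    raw = [{"id": 7, "amount": "3"}]
    assert normalize_amounts(raw)[0]["id"] == 7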


Data analytics projects differ from software engineering in many aspects (the data and analytics pipeline, the stateful data store, and process control). We can see differences in how DevOps and DataOps deliver value and assure quality. DataOps aims to improve the quality and reduce the cycle time of data analytics initiatives [21]. The difference between DataOps and DevOps is based on the unique nature of dealing with data and delivering data insights to users [21]. Data analytics lags far behind software engineering in delivering products rigorously. However, the business intelligence field has been revolutionized by new technologies and by data professionals with varied backgrounds. The challenges of data analytics projects cannot be solved by simply exploiting DevOps practices due to the heterogeneous nature of data analytics projects. DataOps draws a line in the sand, declaring that data analytics projects can significantly improve capability and generate more business value in the same way that DevOps and software engineering have. However, very little work has been done in the past to establish DataOps as a methodology.

1.2. Research Questions

The research questions defined for the thesis are:
1. What is currently understood by DataOps? What are the ambiguities in the understanding of the concept?
2. Which available tools implement DataOps, and what functionalities do they offer?
3. Which parts of the data lifecycle are currently supported well, and where is there a need for potential enhancement?

1.3. Purpose

This thesis aims to uncover DataOps as a methodology to implement data pipelines by conducting a systematic search of research in the field of DataOps. Through this thesis work, DataOps practices, principles, data lifecycle support, and tools and technologies are studied to define DataOps and provide a baseline for further exploration and experiments with this methodology in data analytics projects.


1.4. Goals

The goals of the thesis are to:
1. Provide a definition for DataOps and point out the ambiguities in the field of DataOps.
2. Propose a generic DataOps pipeline that supports the data lifecycle.
3. Prepare a comparative feature analysis of the tools and technologies that support DataOps.
4. Explore which parts of the data lifecycle are currently supported and where potential enhancement is needed.
5. Implement a data pipeline and evaluate it using the DataOps implementation process.

1.5. Benefits, Ethics, and Sustainability

The thesis outcome will provide additional support and authority on approaches and practices to reduce the cost and time needed to handle complex data. For computation- and resource-intensive data analytics projects, cost is crucial and must be kept relatively low compared to the project’s benefit [22]. It costs time and resources to apply trial and error when selecting tools for analytics projects. Furthermore, companies have to bear huge expenses before starting the project. In addition, much of the data professionals’ effort is wasted on modifying and adjusting code during the selection process of tools and technologies. The thesis outcome will help companies and data professionals reduce the cost of trials and the time spent on project adjustment, significantly impacting the overall project cost. This work does not involve using or processing sensitive personal or organizational data at any stage of the work. All the resources used to carry out this thesis project are mentioned with proper citation. Data and code used for the project were used after obtaining consent from the related parties.

1.6. Methodology

To achieve the goals listed in Section 1.4, we followed qualitative, exploratory, and empirical research approaches. A comprehensive literature study of the data lifecycle, DataOps, DevOps, Agile, lean manufacturing, data governance, data lineage and provenance, and the data pipeline reflects the qualitative aspect of the research process. Also, the feature-based comparison of tools and technologies involves studying product features using credible online resources. Overall, the whole thesis follows exploratory research. By using exploratory research, we create the basis for general findings [23] by obtaining as much information as possible about DataOps to define the


term and provide a base for future work in the field. Exploratory research is carried out through a literature review and supported by empirical observation of the experiment conducted. Empirical research relies on experience and observation, focusing on the real situation [23]. We developed two implementation situations and differentiated the level of adherence to the DataOps implementation guidelines through experience and observation. Table 1 below summarizes the research approach we followed in conducting the tasks in this thesis.

Table 1: Research approach.

Task | Research methodology
Establish a theoretical background | Qualitative + explorative
Develop data lifecycle | Qualitative
DataOps implementation guideline formation | Qualitative + explorative
DataOps definition and point out ambiguities | Qualitative + explorative
DataOps in the data lifecycle | Qualitative + explorative
Evaluation of DataOps tools and technology | Qualitative
Implement data pipeline and compare implementation approach | Empirical

1.7. Delimitations

The thesis work is based on secondary data and research papers. The content presented and the conclusions drawn rely on the information we gathered through our literature review process. Therefore, the DataOps definition we provide here may not completely cover the specific data analytics scenarios found across the wide array of organizations and data analytics projects. The tools and technologies presented in the feature-based comparison tables do not cover all the available tools on the market. Furthermore, the tools and technologies listed in one category can have features that are supported in other categories. We created the feature comparison tables to provide general information about the products and give the user a starting point for researching tools and technologies to consider while implementing the DataOps pipeline. The comparison matrix does not provide complete and in-depth technical specifications of products. The implementation of a data pipeline example demonstrates possible options to implement the same data analytics project. The project implemented here does not reflect an actual project goal; rather, it is used only to demonstrate a data pipeline implementation example.


1.8. Outline

The rest of the thesis is structured as follows: Section 2 presents the concepts of the data lifecycle, data pipeline, DataOps, data governance, data lineage, and provenance, and analyzes related work in similar fields under the theoretical background. Section 3 presents the detailed methods used to explore DataOps and to implement a data analytics project as part of the thesis work, outlining the objective, implementation process, data pipeline design, and technology stack. Section 4 presents the results of the thesis work. Finally, Section 5 concludes the thesis with potential future work that can be carried out based on this thesis.


Chapter 2. Theoretical Background

In this section, we discuss the theoretical background and related work around DataOps. We first discuss different approaches to data lifecycle management, present the data pipeline in general, and establish its relationship with data lifecycle management. Afterwards, we investigate DataOps: its evolution, principles, and implementation. Then we explore the concepts of data governance, provenance, and lineage and their importance in DataOps. Furthermore, we discuss the fundamental differences between implementing a data pipeline and a data pipeline with DataOps before presenting related work.

2.1. Data Lifecycle

The data lifecycle is the sequence of stages data go through from their initial generation or creation to their eventual archival and deletion at the end of their useful life. The transformation of data from extraction to ultimate deletion is considered the data lifecycle. USGS in [24] outlines that, after deciding on the collection or use of data, data must be handled and managed until they fall out of use. The phases of the data lifecycle depend on business requirements, context, and types of data. A data lifecycle adds value to raw data by managing the many problems that arise during the process of converting raw data into sensible and valuable data [25]. The data lifecycle gives a high-level overview of the management and security stages needed to use and reuse data. Several research contributions [25] define the data lifecycle with various attributes in practice across different applicable sectors.

2.1.1. CRUD Lifecycle

In [26], Yu, Xiaojun, and Wen proposed the CRUD (Create, Read, Update, and Delete) model of the data lifecycle, which focuses more on data security while highlighting the classical phases of the lifecycle. Their research paper proposes six stages of the data lifecycle:
• Create: Data created from scratch as raw data without value.
• Store: Data stored in the cloud or a local environment, with several replicas in different nodes if needed, for use and publication.
• Use and Share: Data accessed for extraction and exploitation from the storage. The owner has the right to use the data solely or share them with others for use.
• Archive: Data stored in cheaper storage for long-term deposit.

• Destruct: Data deleted permanently after the purpose is fulfilled.
Create, Store, and Destruct are mandatory phases in the data lifecycle, whereas Use/Share and Archive are optional in the CRUD lifecycle. However, project cost and efficiency are determined by these two optional phases of the CRUD lifecycle.

Figure 1: CRUD Data lifecycle (Source: [3])

2.1.2. IBM Lifecycle

IBM in [28] proposes management tasks as part of the data lifecycle. Thus, IBM extends the traditional data lifecycle by adding management and management policy layers. IBM considers three essential components while managing the lifecycle during different phases of data existence.
• Test data management: While developing new data sources, test technicians must simulate and automate realistic data sources of equivalent size to reflect the crucial behaviors of existing production databases. The focus should be on creating a subset of the actual production data to pinpoint errors, problems, and defects as early as possible.
• Data masking and privacy: To protect production data and comply with privacy requirements, certain data features are masked from groups of users.
• Archiving: Archiving does not mean storing data unnecessarily for eternity. It needs intelligent policies based on specific parameters derived from business rules and the age of the data. Intelligent archiving helps to improve data

warehouse performance by automating the archiving process based on a strategy. The entire data lifecycle benefits from good governance, but management capabilities that specialize in the use, share, and archive steps have wide-ranging benefits for cost reduction and efficiency gains.

Figure 2: Where management task fall in data lifecycle (Source: IBM [28])

2.1.3. USGS Lifecycle

The USGS data lifecycle model in [29] states that data are corporate assets with long-term value and should be preserved beyond immediate needs throughout the entire data lifecycle. Issues related to documentation, storage, quality assurance, and ownership should be resolved at each stage.
• Plan: The project team should consider all available approaches, required resources, and expected outputs for each lifecycle stage. By the end of the planning phase, a data management plan is developed to handle the entire data lifecycle.
• Acquire: At this phase, new or existing data are collected, generated, or evaluated for reuse through sets of activities. Data acquisition techniques should align with the project purpose.


• Process: Various actions and measurements are used to verify, organize, transform, extract, and integrate data into a suitable format. After this stage, data are ready for the integration and analysis process.
• Analyze: Exploration and interpretation of the processed data are performed for hypothesis testing, making discoveries, and drawing conclusions. Activities like summarization, graphical interpretation, statistical and spatial analysis, and modelling produce results from the input data. This stage yields interpretations or new datasets, which often get published in written reports or machine-readable formats.
• Preserve: Storing data for long-term use with controlled accessibility is considered at this stage. Data archiving and storing, or submitting data to a reference repository, are considered.
• Publish/Share: Preparation to publish data and results with concerned stakeholders is done with planned and controlled accessibility over the data and publications to ensure data and project integrity.
• Describe (metadata and documentation): Throughout the data lifecycle, the respective documents (metadata and other project documents) must be updated to show the actions and measurements taken on the data for the project purpose.
• Manage Quality: A quality control mechanism should be used in each stage of the lifecycle to ensure the actions taken on data are in accordance with the project proceedings and objectives.
• Backup and Secure: To reduce the physical risk of data loss and damage, and to make sure data are accessible when needed, this action should be performed at each stage of the data lifecycle. This element reminds scientists that routine backups are critical to prevent the physical loss of data due to hardware or software failure, natural disasters, or human error before the final preservation of the data. Loss-prevention measures apply to the raw and processed data, the original project plan, the data management plan, the data acquisition strategy, processing procedures, versioning, analysis methods, published products, and associated metadata. This element also encourages project members to plan secure data sharing services, mainly when team members work at multiple facilities.


Figure 3: USGS Science Data Lifecycle Model (source: [29])

2.1.4. DataOne Lifecycle

The DataOne lifecycle proposed in [30] focuses on data useful for scientific research. The data lifecycle is useful in identifying data flow with a manageable work process for scientific experiments. DataOne has adopted a data lifecycle emphasizing data movement through eight unique stages. These steps begin with making the research plan for data collection, quality assurance, and control. A data description with metadata is deposited in a trusted repository along with the data for preservation. Tools and services can then support data discovery, integration, and analysis, including visualization.
• Plan: The lifecycle starts with the plan for conducting scientific research. Data management and access protocols for the entire lifecycle are described along with a description of the data.
• Collect: Data are collected using various methods like observations, sensors, or other data collection endpoints and are stored in digital format.
• Assure: The quality of the data is checked through control and inspection.
• Describe: Data are described correctly using metadata standards.
• Preserve: Data are stored in repositories for future purposes, where they are discoverable and usable by others through consultation.
• Discover: Archived or preserved data can be accessed for reuse and further use.
• Integrate: A homogeneous set of data is created by combining different sources of data.
• Analyze: Exploitation and analysis of data are performed to obtain results.


Figure 4: DataOne data lifecycle (Source: DataOne [30])

2.1.5. Discussion

We have discussed various approaches to the data lifecycle. The primary motivation for considering the data lifecycle is to find similarities between the various works and establish a common ground for understanding the data lifecycle. With a clear definition and understanding of the data lifecycle, it becomes easier to implement data governance and data lineage. We briefly covered four data lifecycle approaches prevalent in scientific research projects. The review of these lifecycles provides a clear picture that all these models have typical phases, although their terminology differs in some cases. Some approaches are designed for a specific purpose. For example, the DataOne lifecycle and the USGS lifecycle are more related to scientific studies, whereas the IBM and CRUD lifecycles are more focused on enterprise data projects. The data lifecycles mentioned above start with careful planning of projects and data collection methods that satisfy the project requirements. However, in CRUD and IBM, data lifecycle planning is not explicitly defined. In the CRUD model, planning is done in the creation phase, whereas in the IBM lifecycle, planning comes under test data management at the management level. Storing and data archiving are useful for reuse and distribution in each stage of the data lifecycle. IBM is more concerned with data privacy and data masking. At the same time, USGS and DataOne are more inclined towards data availability with proper project and metadata descriptions to make future use and distribution easier. Quality control is a challenging task


to manage in the data lifecycle. IBM, CRUD, and USGS have well-defined stages for quality and assurance throughout the lifecycle. In Figure 5, a general data lifecycle is illustrated. Planning of the project is the first step. In this stage, the purpose and procedures for data collection, transformation, analysis, publishing, and storage are defined. Also, the data lifecycle governance and data quality management plans are formulated. Here we are not considering planning as part of the data lifecycle. The data lifecycle goes through the collection/creation, processing, analysis, and publishing stages. In each stage, data are archived/stored in persistent storage while simultaneously being stored in temporary memory for easy access by other stages. Temporary memory data are deleted immediately after use in the next stage. So, to access the same data in the future, each stage must check persistent storage. Data in persistent storage will eventually be destroyed after the purpose of the data is fulfilled. The data governance policy governs the lifecycle and handles the flow by implementing rule-based decisions for storage (temporary and persistent), archival, access, and disposal of data. Furthermore, data quality management checks the quality of the data (input and output) and the processes associated with the data throughout the data lifecycle. The data lifecycle presented below is a cyclic process. Published data can be collected from storage for processing and analysis to deliver different data analytics results. So, the data lifecycle will continue, with each cycle producing new data as a result.
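As a concrete illustration of the storage behaviour just described, the following is a minimal Python sketch, assuming a simple in-memory store; the class and method names are illustrative and are not artefacts of the thesis project.

# Minimal sketch of the lifecycle storage behaviour described above:
# every stage writes its output both to persistent storage (the archive)
# and to a temporary cache that the next stage consumes and then clears.

class LifecycleStore:
    def __init__(self):
        self.persistent = {}   # survives until governance rules dispose of it
        self.temporary = {}    # cleared as soon as the next stage has read it

    def put(self, stage, data):
        self.persistent[stage] = data
        self.temporary[stage] = data

    def take(self, stage):
        # Prefer the temporary copy; fall back to persistent storage,
        # as later stages must do once the cache has been cleared.
        data = self.temporary.pop(stage, None)
        return data if data is not None else self.persistent.get(stage)


store = LifecycleStore()
store.put("collect", ["raw record"])
processed = [r.upper() for r in store.take("collect")]   # process stage
store.put("process", processed)
print(store.take("process"))        # analysis stage reads from the cache
print(store.persistent["collect"])  # archived copy is still available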

Figure 5: Data lifecycle stages


2.2. Data Pipeline

A data pipeline, or data analytics pipeline, is a series of steps for data processing in which data move from source to destination. A pipeline is an engine for data transformation with a series of automated and manual interconnected operations or tasks. A data pipeline is the foundation of analytics, reporting, and machine learning, with processes to move and transform data from sources to a destination, generating new values [31]. A data pipeline task can be individual or part of a collection with an accepted order of execution [32]. In some pipelines, the order is essential, whereas other pipelines can have an interchangeable task order. Each data analytics project has its own purpose, and a data analytics pipeline is created to meet that purpose. Depending on the project goal, the tools we use and the tasks we create will differ.
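The following is a minimal Python sketch of a pipeline as an ordered series of tasks, each consuming the previous task's output (the arrangement shown in Figure 6); the task functions are illustrative placeholders, not the scripts used in the thesis project.

# Minimal sketch of a data pipeline as an ordered series of tasks:
# each task takes the previous task's output as its input.

def extract(_):
    return ["  alice , 10 ", "bob,20"]

def clean(rows):
    return [[field.strip() for field in row.split(",")] for row in rows]

def load(rows):
    return {name: int(value) for name, value in rows}

def run_pipeline(tasks, data=None):
    """Run tasks in their accepted order of execution."""
    for task in tasks:
        data = task(data)
    return data

print(run_pipeline([extract, clean, load]))   # {'alice': 10, 'bob': 20}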


Figure 6: Simple data pipeline example

In general, a data pipeline is built for efficiency, to minimize or eliminate manual processes. There are several commercial and open-source tools available for use in a data analytics pipeline, and many comparative studies can be found [33,34]. However, the selection of tools and technologies for a data analytics pipeline depends on factors related to the organization, people, technology, data governance policy, and data management policy [35].

2.3. Data Pipeline in Data Lifecycle

Data pipelines are a means of sequencing data processes throughout the data lifecycle [32]. In general, the data pipeline often starts from data collection or creation, as in Figure 7, where pipelines P1, P2, and P6 use data directly from the source, whereas the rest of the pipelines access data from storage (temporary or persistent). Data pipelines can cover the whole data lifecycle or can be built for a specific stage of it. The decision about data pipeline coverage over the stages of the data lifecycle is made at the planning phase with the project scope in mind. A data pipeline is not required to send data to a storage service; it can route data to other pipelines or applications [36]. In Figure 7, we assume that pipeline P3 can take input directly from P2 and route output directly to P4 without storing data in a storage service.


In Figure 7, we illustrate how data pipelines are used in different stages of the lifecycle. We outline the different possible data pipelines with dotted ovals. Several possible data pipelines can be made as per the data analytics project requirements. The figure shows data movement from one data lifecycle stage to another, transforming data through a series of tasks inside the pipeline. The task at the top of the figure represents a pipeline, which has one or more interconnected tasks. Moving data from one stage to another should be done by a system of interconnected tools, technologies, and processing steps assembled inside the data pipeline. So, a data pipeline is a means of data transportation with transformation from one stage to another.

Figure 7: Data pipeline in the data lifecycle

2.4. DataOps

Lenny Liebmann first used the term DataOps in his blog post titled “3 reasons why DataOps is essential for big data success” on the IBM Data and Analytics Hub, where he shines a light on the importance of executing data analytics tasks rapidly, with ease of collaboration and an assured quality outcome, in diverse big data workloads and cloud computing environments [37]. However, the term DataOps gained its popularity after Andy Palmer’s contribution in 2015, where he describes DataOps as an enabler of communication, collaboration, integration, and automation, practiced with cooperation between data engineers, data scientists, and other stakeholders [38]. The definition given by


Gartner’s Glossary [39] describes DataOps as a collaborative approach to the collection and distribution of data, with automation and controlled access for data users to preserve data privacy and integrity. DataOps is a consequence of three emerging trends: process automation, the pressure of digital-native companies on traditional industry, and the importance of data visualization and representation of results [40]. Data analytics is a demanding process in which several tools and technologies are combined to obtain results. It is not just about the tools; a lot of expertise is required for the tools to work collectively. DataOps is a unified way of delivering a data analytics solution that uses automation, testing, orchestration, collaborative development, containerization, and continuous monitoring to regularly produce output with improved quality [36]. The DataOps goal is to take data from the source and deliver it to the person, application, or system where it produces business value [41]. Several other definitions describe DataOps as an “analytic process which spans from data collection to delivery of information after data processing” [42], a way to “develop and deliver data analytics projects in a better way” [43], a “combination of value and innovation pipeline” [44], or a “data management approach to improve communication and integration between previously inefficient teams, system and data” [45]. From these definitions, we can say that DataOps is a process of generating value from data in an efficient way using appropriate tools and technology with the collaboration of teams. After analyzing existing articles from different authors and experts in the field, we found that different perspectives inspired the DataOps definitions. Some definitions are more goal oriented [27,28,29], while some are activity oriented [30,28], and some are process and team oriented [19,31,32]. From the activity perspective, DataOps fills the gap between data handlers and data, enabling a continuous flow of data through the pipeline developed by the data handlers to generate the desired outcome, with the ability to create, test, and deploy changes throughout the process. From a goal-oriented approach, DataOps is viewed as a process to eliminate errors and inefficiency in data management, reducing the risk of data quality degradation and exposure of sensitive data using interconnected and secure data analytics models. From a process and team-oriented perspective, DataOps is a way of managing the activities of the data lifecycle with a high level of data governance, connecting data creators and consumers using digital innovations.

2.4.1. DataOps Evolution

DataOps is a set of practices in the data analytics field that takes proven practices from other industries [52]. It is a combination of proven methodologies that helped other industries grow: DevOps and the Agile methodology from the software industry

and lean manufacturing from the automotive/manufacturing industry. The speed of delivery of results provided by Agile and DevOps and the quality control provided by lean manufacturing can be used in data analytics under DataOps [8].

From Agile Methodology

The Agile methodology gained popularity as an alternative to traditional software development methodologies when the product development lifecycle had to be reduced and delivery of a product with higher quality was expected even with frequent changes in requirements. According to the Agile Alliance [53], “Agile methodology in software development is an umbrella term for a set of frameworks and practices based on the values and principles expressed in the ‘Manifesto for Agile Software Development’ [54] and the ‘Twelve Principles’ [55] behind it”. The Agile team, tools, and processes are organized to reduce the publishing cycle to a minimum [56]. Agile uses a collaborative approach between people to perform their tasks in self-organizing, cross-functional teams. Agile is best suited for a non-sequential development cycle where requirements are constantly changing, which is similar to data analytics projects, where every new analysis and report opens new requests for additional results and queries [57].

From DevOps Practice

DevOps [39,40] is an approach to software development and system operation that uses the best practices from both domains to deliver a quality product in a short period with reduced cost. DevOps practice in software development projects tries to fill the gap between the development and operation cycles while delivering optimized, quality products [18]. The success of the DevOps approach can be seen in its widespread usage in the software industry and the demand for DevOps engineers [60]. DataOps is an extension of DevOps into data analytics. In the software industry, continuous collaboration between developers, quality assurance, and operations teams is assured by DevOps [61]. In the same manner, DataOps provides a similar platform for data analytics team members, from data scientists, data engineers, and other data professionals to customers. DataOps is a means to optimize the analytics and storage process in data analytics projects in a similar way to what DevOps does in application development projects. Both DevOps and DataOps rely on Continuous Integration and Delivery (CI/CD), cross-team collaboration, and large-scale provisioning and consumption [62].


From Lean Manufacturing

Lean manufacturing is a production method derived from Toyota’s operational model, “The Toyota Way”, in the 1930s [63]. The model focuses on improving quality and reducing non-value-added activities. The Agile and DevOps methodologies, proven to be efficient in the software industry, can be considered lean manufacturing approaches applied to software development [64]. Theoretically, manufacturing and data analytics are similar if we consider them from the pipeline process perspective. In manufacturing, raw materials are passed to workstations from the stock room, and a series of transformations are carried out using humans and machines to produce finished goods. Similarly, in data analytics, data are passed through several transformation and analytics nodes to generate reports and insights. In both cases, each step takes input from the previous step and creates output for the upcoming step. Both are identical in the way they use a set of operations to produce high-quality, consistent output. In addition to the Agile and DevOps methodologies inspired by lean manufacturing, another lean manufacturing tool, Statistical Process Control (SPC), is useful in DataOps process improvement. SPC manages the consistency and the input and output quality of the pipeline process in the manufacturing industry. To monitor and control the quality of the manufacturing process, SPC uses real-time product and process measurements. The functionality of the process and the delivery of a quality product are guaranteed within specific measurement limits. In data analytics, SPC can improve quality and efficiency by applying tests to data or models at the input or output of each step in the data analytics pipeline [8].

Summary

DataOps is a combination of statistical process control with Agile development and DevOps. Agile helps to deliver analytics results faster, DevOps automates the analysis process, and SPC from lean manufacturing tests and monitors the quality of the data flow in the entire data analytics lifecycle. DataOps combines the speed and flexibility of Agile and DevOps with the quality control of SPC. Manufacturing quality control principles combined with methodologies and tools from software engineering provide high-quality data analytics capability in a minimized process cycle for data and external users.
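To make the SPC idea concrete, the following is a minimal Python sketch of a control-limit check applied to the output of a pipeline step; the monitored statistic (a daily row count), the historical values, and the three-sigma limits are assumptions made for illustration, not values taken from the cited SPC literature.

# Minimal sketch of an SPC-style check at the output of a pipeline step:
# a monitored statistic must stay within control limits derived from history.

import statistics

def control_limits(history, k=3.0):
    """Upper/lower control limits as mean +/- k standard deviations."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - k * sd, mean + k * sd

def check_output(row_count, history):
    lower, upper = control_limits(history)
    if not (lower <= row_count <= upper):
        # In a DataOps pipeline this would raise an alert and stop promotion
        # of the data to the next step instead of just printing a message.
        print(f"Out of control: {row_count} outside [{lower:.0f}, {upper:.0f}]")
        return False
    return True

historical_counts = [1020, 998, 1005, 1012, 990, 1003, 1008]
check_output(1001, historical_counts)   # within limits
check_output(1450, historical_counts)   # flagged as out of control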


Figure 8: DataOps has evolved from lean manufacturing and software methodologies (Source: DataKitchen [8])

2.4.2. DataOps Principle

A total of 18 principles listed in the DataOps Manifesto [43] capture the best practices created by the people and organizations supporting DataOps. The DataOps principles summarize the best practices, goals, philosophies, mission, and values of DataOps practitioners. Below is a summary of the DataOps principles presented in the DataOps Manifesto [43].
1. Continually satisfy the customer: DataOps should prioritize customer satisfaction and fast delivery of results by putting them first in each stage of the data analytics process.
2. Value working analytics: Data analytics should provide insightful analytics results in the end. To deliver that, DataOps should incorporate accurate data with a suitable analysis framework and system.
3. Embrace changes: Changes are inevitable and should be welcomed, whether they come from the customer or from internal process and system changes.
4. It is a team sport: An analytics project is a team effort with people from different backgrounds, roles, skills, and favored tools. Teams should create an innovative environment by accepting diversity and opportunities to grow.
5. Daily interaction: Constant communication between customers, analytics team members, and operations should take place every day throughout the project.
6. Self-organize: Self-organizing teams and team members always foster the best algorithms, designs, requirements, insights, and architectures.


7. Reduce heroism: The team should rely on each other’s competence rather than waiting for individual heroism to lift the entire project. From the beginning, the people and processes should be sustainable and scalable.
8. Reflect: Self-reflection and proper feedback mechanisms are used to check and balance performance.
9. Analytics is code: Analytics teams use different tools for specific tasks that generate code and configuration describing the data manipulation process.
10. Orchestrate: Everything in the data pipeline (data, code, tools, and environment) should be orchestrated in an end-to-end manner.
11. Make it reproducible: Versioning of data, hardware and software configuration, and code is essential to reproduce the same result.
12. Disposable environments: For minimal project cost, disposable environments should be used so that no extra cost occurs when a project completes.
13. Simplicity: Focus on essential work only. Extra and unnecessary work should be avoided as much as possible.
14. Analytics is manufacturing: Constant production of insight is critical, with efficiency and quality control.
15. Quality is paramount: The analytic pipeline should have a built-in mechanism for quality control with continuous feedback on data, processes, and systems.
16. Monitor quality and performance: Variations in quality and performance should be continuously monitored and managed.
17. Reuse: Avoid redoing the same task.
18. Improve cycle time: Always try to reduce the delivery cycle by optimizing the production process.
From the principles listed above, we can say that the manifesto puts team communication over tools and processes. Experimentation, iteration, and feedback are more important than designing and developing the whole pipeline upfront. A sense of responsibility and cross-functional collaboration increase project efficiency, reducing individual siloed responsibilities and heroism. Customer collaboration always takes priority over contract negotiation.

2.4.3. DataOps Implementation

When it comes to the implementation of DataOps in the data analytics pipeline, there is no well-defined approach. However, different organizations and DataOps platform providers [47,48] have proposed their own implementation methods. In this


section, we discuss and summarize the implementation practices proposed by iCEDQ1 and DataKitchen2. According to the DataKitchen whitepaper, a data analytics team can implement DataOps in seven simple steps [8]. In the iCEDQ whitepaper [67], the implementation process has three sections: people culture, process practice, and tools. DataKitchen focuses more on technical implementation, whereas the iCEDQ proposal is more of a holistic approach to shifting organizational culture to smooth the technical aspects of DataOps implementation. In the next section, we provide a summary of both the iCEDQ and DataKitchen DataOps implementation processes.

A. DataKitchen Implementation (Source: DataKitchen [49]).
a. Add data and logic tests
Adding tests at each step of the data analytics pipeline ensures the integrity of the output by verifying whether intermediate results meet expectations. Data, models, and logic should be tested at each step before going to production delivery. If a change to an artifact is required, proper testing is required before deploying it to production.
b. Use a version control system
Versioning of data, code, and configuration is essential to track the project’s data lineage and process authenticity. It will be easier to reproduce and reuse the project in the future with version control.
c. Branch and merge
Branching plays a vital role in boosting productivity and the freedom to experiment. It gives team members the freedom to work independently without affecting the performance of others. They can set up their own experiments, make changes, run tests and, if satisfied, integrate into the central development environment before deploying to production.
d. Use multiple environments
All team members working in one production environment may lead to conflicts. So, there should be multiple environments, at least for testing, for each team member. Version control, branching and merging, and multiple

1 https://icedq.com/
2 https://datakitchen.io/


environments work together to boost performance and reduce conflicts among team members.
e. Reuse and containerize
It is messy to implement a whole data analytics project as one pipeline, and it is hard to reuse the pipeline’s steps. While implementing the data analytics pipeline, try to break the pipeline into the smallest units and containerize them using container/orchestration technology for easy reuse and access.
f. Parameterize the processing
The data analytics pipeline should be flexible enough to handle different runtime conditions, such as which version of raw data to use, which data should be sent to production and which to testing, data validation according to business requirements, and which pipeline steps should be used for a particular set of data. In short, the data pipeline should be flexible enough to handle the possible alternatives that can occur throughout the project (a sketch combining this step with the tests of step a appears at the end of this subsection).
g. Work Without Fear™
With the right set of tools and technology, data professionals can be confident in avoiding overcommitment, heroism, and silly errors. DataOps assures quality through two critical workflows, the value pipeline and the innovation pipeline, to reduce deployment errors and eliminate data professionals’ fear.
B. iCEDQ Implementation (Source: iCEDQ [66]).
a. Identify the people and their culture
DataOps is about breaking the barriers between development and business teams. An organization should implement a cultural shift to remove the boundaries between the stakeholders of data analytics projects. If cultural barriers did not exist, the project could run in parallel, with each team member contributing their part simultaneously in different stages of the lifecycle.
b. Get the automation tools for DataOps
Without proper automation tools, DataOps is impossible to implement. Organizations must make all the necessary tools ready to implement DataOps. Software related to versioning, code repositories, task management, data repositories, CI/CD, and production monitoring are some categories that are very useful in DataOps implementation.

c. Define the DataOps practice: After ensuring that the organizational culture is aligned with DataOps practice and acquiring the essential software and hardware tools, the next step is to define the DataOps practice, in which requirements, development, testing, the production process, and management tasks are defined.
i. Develop and integrate with a code repository: It is essential to develop against a code repository so that code can be versioned and changes tracked in each step of the data pipeline.
ii. Implement a CI/CD pipeline: Another crucial step in DataOps is to implement a CI/CD pipeline so that development data and code can be easily deployed to production after testing. The following aspects should be considered while implementing a CI/CD pipeline in DataOps.
a) Continuous integration: With code stored in repositories with branching and versioning, it is easy to select a version according to the release plan and integrate it with CI/CD tools.
b) Continuous deployment: The integrated code is pulled by the CI/CD tool and deployed from development to the test environment.
c) Initialization test: Once the data and code are fetched, the CI/CD tool executes the test process to validate code, data, results, and process against predefined quality requirements, which are integrated and deployed along with the code and data.
d) ETL/Report execution: After the tests complete, the next step is to process data in the production environment to generate distributable reports and results.
e) ETL/Report testing: Another crucial step is to test the accuracy of the ETL process and the report quality against the test environment results to cross-verify the entire DataOps pipeline.
f) Production monitoring: Once the whole pipeline is deployed in the production environment, the production environment must be monitored continuously by setting up


testing rules. If there is a change in code, data, tools, or steps, the entire CI/CD cycle will be repeated.
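Both vendors emphasize automated data and logic tests at every pipeline step. The following is a minimal sketch of such a step-level check in Python; the function name, expected fields, and threshold are illustrative assumptions and are not taken from either vendor's tooling.

import json
from pathlib import Path


def check_step_output(output_dir, min_files=1):
    """Fail fast if an intermediate pipeline step produced no or malformed output."""
    files = list(Path(output_dir).rglob("*.json"))
    # Assumption: the step is expected to write at least `min_files` JSON files.
    assert len(files) >= min_files, f"expected at least {min_files} JSON files in {output_dir}"
    for f in files:
        with open(f, encoding="utf-8") as fh:
            record = json.load(fh)  # raises an error if the file is not valid JSON
        # Assumption: every record carries an identifier field; adapt to the real schema.
        assert "id" in record or "releases" in record, f"unexpected schema in {f.name}"


if __name__ == "__main__":
    check_step_output("output_data/")

A check like this can run as its own task in the pipeline, so a failing test stops the deployment before faulty data reach production.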

2.4.3.1. Summary
From the two DataOps implementation approaches discussed above, we establish our own DataOps implementation guidelines as follows.
1. Set a DataOps culture
Start DataOps by identifying the people and culture in an organization. Establish management control, a communication process, and project management that align with the processes and tools that will be used inside the organization.

2. Automate and orchestrate
Use automation and orchestration tools to reduce manual work. When team members collaborate to execute data analytics projects, automate as many tasks as possible. With orchestration, tools and technologies can be integrated more easily to automate data analytics projects.
3. Use version control
Versioning is essential for tracking data, documents, and code. Data governance, data provenance, and data lineage can rely on version control tools to some extent. Furthermore, with version control, team members can create their own versions of the work, which are later merged for implementation.

4. Reuse and containerize
Do not waste time redoing the same thing if it can be reused. Furthermore, containerizing applications and pipelines helps reduce the risk of failure due to external circumstances.

5. Set up multiple environments
Setting up separate environments for production and development gives the flexibility of innovation and change management without risking ongoing pipeline execution. Within the development environment, each data worker should have their own work environment so that everyone can work independently without affecting the performance of others.

6. Test and test
Without testing, confidence in pipeline quality cannot be guaranteed. Create test cases to cover every possible corner of the pipeline (data, code, system,


and output). Test extensively before releasing the data pipeline, or changes to it, into the production environment.

7. Continuous integration and deployment
To assemble the work of various data workers and put it into a test environment, use continuous integration; after all test cases pass, use continuous deployment to release the work into the production environment.

8. Continuous monitoring
Monitor the production and development environments regularly to track overall data pipeline performance, the quality of the pipeline's inputs and outputs, and the performance of the tools and technologies used in the pipeline. Always cross-check the results of the two environments. With continuous monitoring, statistics on system performance and quality can be recorded, and analysis of the test results always reveals scope for further improvement.

9. Communicate and collaborate
Continuously communicate with customers, stakeholders, and team members. Keep the communication loop as short as possible so that messages travel faster. Where required, create collaborative workspaces between people and tools, and between tools, so that tasks produce better results.

2.4.4. Discussion
Since the term was first established, significant contributions to defining and practicing DataOps have been made. DataOps enthusiasts collaborate to create common principles that offer uniformity in applying the methodology in heterogeneous data operation environments. Despite all these efforts, certain ambiguities about applicability remain due to the diverse nature of its field of application. Data analytics itself is a broad field where numerous tools, approaches, and technologies can lead to the same result. Nevertheless, DataOps advocates collaboration, quality control, and faster delivery of the data analytics pipeline by extending the proven DevOps methodology from the SDLC and combining it with Agile and Lean Manufacturing's SPC. With these three methodologies as a reference point, DataOps has been continuously evolving into an efficient and reliable methodology for data lifecycle management. Companies like DataKitchen and iCEDQ are actively involved in developing data analytics tools supporting DataOps principles and contribute to DataOps research and development by publishing their work publicly. For instance, the DataOps implementation guidelines they have presented (summarized in Section 2.4.3) are derived from their project experience. Although they designed the

guidelines with their own tools in mind, the guidelines still hold for implementing DataOps with other tools and technologies. Both implementation approaches satisfy the DataOps principles, which shows the uniformity in DataOps practice. Drawing on the study of the two implementation approaches and the DataOps principles, we created our own implementation guidelines in Section 2.4.3.1, which are derived from both practices. The implementation guidelines we created fulfil the DataOps principles manifesto.

2.5. Data Governance
According to the Data Governance Institute (DGI), data governance is a system for information-related processes in an organization that provides accountabilities and decision rights according to an agreed-upon information management model [68]. Here, the information management model describes when a particular action is taken, by whom, in which circumstances, and using which method. Data governance is an orchestration of people, processes, and technology to establish data and information as organizational assets [69]. An organization must therefore protect data as valuable assets by establishing an organizational culture of proper data handling. With data governance, organizational data become more manageable, usable, accessible, reliable, and consistent. Data governance manages information strategically and tactically, involving the organization's managerial and technical branches to implement control policies for quality and accessibility throughout the business process [70]. It is a continuous and iterative process, constantly improved and modified according to changing conditions [71].

2.5.1. Data Governance with DataOps
In the section above, we defined data governance from an organizational perspective. Looking at it from the perspective of the data analytics process, data move through several stages to extract valuable information. As data travel through the lifecycle stages with a series of modifications, it is easy to lose track of their quality, consistency, and accessibility. Data governance is the set of principles and practices that ensures the quality of the data operation process throughout the data lifecycle. With the DataOps principles [43] related to quality assurance and monitoring, data governance practice is implemented in the data analytics pipeline, focusing on all three aspects: data, process, and technology. Data governance has two stages [72]: first, policy-level execution, where top-level people discover and design the overall policy for

handling data; the second stage is about implementing, automating, monitoring, and measuring the policy. With DataOps in action and suitable tools and technology, it becomes easier to execute the data governance policy in the data analytics process [55,33].

2.6. Data Provenance and Lineage
Data provenance identifies the data and the transformation processes used to generate a given data instance [74]. With data provenance, the historical record of data and its origins can be associated with output results by tracking the inputs, systems, entities, and processes used to produce the desired output [75]. Data provenance is useful in the data analytics pipeline for tracing the dataflow back to its origin for debugging and quality assurance. In each step of the analysis, the inputs and operators associated with the generated output must be recorded. There are several forms of data provenance, such as copy-provenance and how-provenance [58,59], but in data analytics the most common approach to tracking the data transformation history is why-provenance, or lineage, as introduced in [78]. Data lineage captures the origin of data, the sequence of operations applied to it, and its state after each operation [79]. With data lineage, tracing errors back to their source becomes simpler in the data analytics process [80]. Tracking data movement is no longer a voluntary task that organizations perform at their convenience and for internal benefit. With the increased manipulation and misuse of data in recent years, binding rules and regulations are emerging to make companies more transparent and responsible when handling data. The United Nations, the United States of America, and the European Union are taking initiatives to bind organizations through data protection and privacy law. GDPR3 is a regulation under European Union law on data protection and privacy inside the European Union and the European Economic Area. GDPR Article 30 [81] states that organizations need to maintain a record of data processing activities, listing who processes the data and detailing the processor. Article 17 [82] and Article 20 [83] cover the right to erasure and the right to data access and portability, respectively. To follow the regulation, organizations should establish measures for tracking data movement, the purpose of the movement, the systems and people involved, the versioning of data after each transformation, storage information and transparency about the data storage period, and the purpose of collecting the

3 https://gdpr.eu/

data. Data lineage is a good way to ensure data quality for regulatory as well as operational purposes.
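To make this concrete, a minimal provenance record can be captured with very little code. The sketch below appends one entry per transformation step to a log file; the field names and file layout are our own assumptions, not a standard.

import json
import getpass
from datetime import datetime, timezone


def record_lineage(step, script, inputs, outputs, log_file="lineage_log.jsonl"):
    """Append one provenance entry describing who ran which step on which data."""
    entry = {
        "step": step,
        "script": script,
        "inputs": inputs,
        "outputs": outputs,
        "executed_by": getpass.getuser(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    record_lineage("transform", "transform_script.py", ["raw_data/"], ["clean_data/"])

A log of this kind, kept next to versioned data and code, already answers the basic regulatory questions of who processed which data, when, and with which script.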

2.6.1. Data Provenance and Lineage with DataOps
Data provenance and lineage in DataOps start with defining the data governance principles. Making the organizational culture transparent in every step of data handling is the first step in implementing lineage features in data analytics projects. Since DataOps builds data analytics projects by tracking the movement of data through the lifecycle, involving people, process, and technology, data lineage is the part of DataOps that manages the history of data operations in the lifecycle, using tools and technologies such as metadata management [84], data versioning [78], blockchain [85,86], and data lineage software.

2.7. From Data Pipeline to DataOps Pipeline
DataOps is a rising set of Agile and DevOps practices, technologies, and processes to construct and elevate data pipelines with quality results for better business performance [87]. Data analytics deals with massive amounts of data from various sources, and managing large-scale data sources is challenging. Managing data storage and tracking data transformations is challenging as well. The larger the amount of data, the harder it is to understand, and at some point it becomes painful to comprehend what the data contain and what is happening to them inside the data pipeline. Documentation and tracking become tedious and time-consuming, especially when different parties need to be satisfied. Traditionally, data pipelines deal with this issue in an unstructured, manual way. With DataOps in data analytics, the process can be structured and automated to make it less tedious. Data and analytics pipeline development is still a skill-based job with a minimally reused, non-repeatable process, handled by individuals in isolation using different tools and practices [12]. Data pipelines, in general, suffer from poor collaboration and synchronization across the project. The traditional approach to building a data pipeline focuses on the result of the project, so the primary effort goes into designing and developing the ETL (extract, transform, load) or ELT (extract, load, transform) algorithm, the analysis algorithm, and assembling the core tools to implement the project. Moreover, the majority of tasks are performed by individuals working independently; jobs become repetitive, and by the end of the project there are different replicas of the same project with different users, each with their own modifications. Consistency is one major issue DataOps eliminates through collaboration and role-based job distribution with synchronization inside and


outside the organization during and after data analytics pipeline deployment, and consistency not only in the data pipeline but also in its end products. Data and data analytics requirements tend to change rapidly, and previously built data pipelines become obsolete for new business insights, so the data pipeline operation may need to change to obtain a new insight. Starting from scratch is time-consuming and difficult. Without DataOps, the development and production environments are never separated, and changed tasks must be tested in the production environment itself, which puts the entire project at risk. With DataOps, a separate development environment gives the freedom of innovation and modification without affecting the running project. Moreover, it is not only about addressing change: with a separate development environment, data professionals can continuously improve the existing pipeline for a better outcome. The two environments are connected through a CI/CD pipeline, so that work is deployed to production after tests pass in the development environment. To obtain the DataOps pipeline, Agile and DevOps speed and SPC's quality control are applied to the data pipeline [8]. As a result, quick, elastic, and robust analytics can cope with ever-changing requirements and data. With the DataOps approach, the data pipeline is treated as a joint function of the involved people, processes, and technology, optimizing the performance and quality of the entire project rather than focusing on the productivity of any single entity. With DataOps in analytics, project priority is on communication, collaboration, integration, automation, measurement, and cooperation between data scientists, analysts, data/ETL engineers, information technology (IT), and quality assurance/governance [42]. This boosts the involvement of the whole analytics team in the project, with rapid results and continuous performance and operational improvement.

Figure 9: DataOps pipeline in general

Figure 9 illustrates how the DataOps pipeline advances the regular analytics pipeline from Figure 6 by adding the functionalities of proven methodologies (Agile, DevOps, and Lean) and including all related parties (people, process, and technology) for each designated task. To achieve this, automated platforms and tools should be incorporated into viable tasks, and an organization-wide DataOps culture should be created [8]. The goal of using DataOps in data analytics is to deliver more and better analyses rapidly, economically, and with high quality.

2.8. Analysis of Related Work
In this section, related work is presented, and the work that inspired this project is discussed. Since the term was first coined in 2015, there have been several efforts to define DataOps. Even though there are few works on DataOps in the scientific research field, various resources in the commercial domain have helped in understanding it. This section therefore divides the related work into two fields and summarizes the works separately.

2.8.1. Commercial Contribution to DataOps
Some companies strongly advocate DataOps and deliver products that support the DataOps principles. DataKitchen, iCEDQ, IBM4, and Eckerson5 are representative organizations involved in practicing and exploring DataOps. These companies were chosen based on the research and contributions they have made in the field of DataOps.
DataKitchen
DataKitchen is one of the leading solution providers for DataOps. It is continuously working on establishing DataOps as a methodology for data analytics projects, regularly contributing to the development of DataOps products and conducting industry-focused research in the field. DataKitchen's most significant contributions are the DataOps manifesto and generic implementation guidelines. In addition, they are continually expanding their focus to cover every aspect of the data analytics lifecycle by introducing tools and services for generic DataOps pipelines and industry-specific solutions. Publications such as "The DataOps Cookbook" are among the pioneering resources for understanding DataOps, containing vital information about the origin of the domain and a practice guide with industry use cases. On top

4 https://www.ibm.com/analytics/dataops 5 https://www.eckerson.com/
of their publications, they provide various tools supporting the DataOps methodology for different stages of the data lifecycle, such as orchestration, testing and monitoring, analytics, and CI/CD deployment.
iCEDQ
iCEDQ provides an ETL testing and data monitoring platform that helps companies accelerate the development and testing of data projects. Their products are based on the DataOps methodology to provide agile service with quality data analytics projects. Products such as ETL testing, data warehouse testing, data migration testing, BI report testing, and production data monitoring are used in the insurance, healthcare, finance, and retail sectors. The whitepapers and blogs published by the company are useful for understanding how DataOps is applied in data projects.
IBM
IBM is a pioneer of the term DataOps. They have developed several products using DataOps principles, targeted at different stages of the lifecycle, and market them under their cloud service with in-house add-on features for data analytics projects. Several IBM products [88] are available for different lifecycle stages and purposes, such as collecting, analyzing, storing, organizing, and publishing data. Besides product development, IBM's research on DataOps has helped create the foundation of the concept, exploring different approaches to data analytics projects and optimizing the data analysis process.
Eckerson
Eckerson is a global research and consulting firm for data analytics. The Eckerson Group has published several research reports defining DataOps [12], exploring the use of DataOps [11], giving selection criteria for DataOps tools [89], and presenting industry insights that promote DataOps in data analytics [90]. The work done by the company is a baseline for other researchers exploring the applicability of DataOps in various data analytics projects.


Table 2: Summary of DataOps contributions from companies

DataKitchen (DataOps product: Yes)
Focus areas in DataOps: establishing principles; implementation guide; establishing a definition; webinars and podcasts on DataOps; orchestration, CI/CD pipelines, testing, and monitoring.
Contribution resources in DataOps: 1. The DataOps Cookbook [8]. 2. The DataOps Manifesto [42]. 3. DataOps blogs6. 5. DataOps whitepapers7. 6. DataOps case studies8.

iCEDQ (DataOps product: Yes)
Focus areas in DataOps: implementation guide; quality assurance with DataOps; data integration; ETL testing; data migration testing.
Contribution resources in DataOps: 1. DataOps Implementation Guide [66]. 2. Data Management Implication [91]. 3. QA Challenges in data integration projects [92].

IBM (DataOps product: Yes)
Focus areas in DataOps: study of the DataOps implementation process; establishing DataOps objectives; data lifecycle management.
Contribution resources in DataOps: 1. Deliver business ready data fast with DataOps [93]. 2. Wrangling big data: Fundamentals of data lifecycle management [28]. 3. Blogs [94–96].

Eckerson (DataOps product: No)
Focus areas in DataOps: survey of DataOps market practice; understanding of DataOps; establishing a DataOps framework.
Contribution resources in DataOps: 1. Best Practices in DataOps: How to Create Robust, Automated Data Pipelines [11]. 2. Trends in DataOps [90]. 3. The Ultimate Guide to DataOps Product Evaluation and Selection Criteria [89]. 4. DataOps: Industrializing Data and Analytics Strategies for Streamlining the Delivery of Insights [12].

6 https://datakitchen.io/blog/ 7 https://datakitchen.io/dataops-white-papers.html 8 https://datakitchen.io/dataops-case-studies.html

2.8.2. Scientific Contribution to DataOps
To the author's best knowledge, works to define and establish DataOps have only recently been emerging, and there are not yet many published research papers. Some of the academic and research contributions that have influenced this thesis project are discussed below. In 2020, Raj et al. published "From Ad-Hoc Data Analytics to DataOps" [97], in which they define DataOps with a general scope of usability based on an extensive literature review. The paper presents a case study of a large telecommunication company showing how its infrastructure and processes evolved over time to support DataOps, and shows the different stages of the evolution of the data analytics process with respect to the DataOps approach. In 2018, "DataOps – Towards a Definition" [13] by Julian Ereth contributed to the academic elaboration of DataOps as a new discipline. The paper presents a body of knowledge and a working definition of DataOps with an initial research framework for the field based on interviews with industry experts. The paper concludes that DataOps should be elaborated as a discipline by researching process and governance, related technologies and tools, and by investigating the value proposition it brings to business.

Chapter 3. Method
This section presents the approaches we followed in carrying out the whole project. We divide the work into two parts and, for each part, separately describe the process of conducting the research. The first part explores DataOps through explorative qualitative research using a literature review and online research. The outcome is a definition of DataOps with a discussion of the ambiguities present in the field, the challenges that need to be addressed while implementing DataOps, and a list of tools and technologies used to implement DataOps with the features they provide in different stages of the data lifecycle. The second part is an example implementation of an existing data analytics project using some practical tools for DataOps. We present these research methods separately in this section.

3.1. Exploring DataOps as a Data Analytics Methodology
In this section, we present the detailed process of the work we have done to define DataOps.

3.1.1. Research Process
The research process for exploring DataOps started by collecting literature from various sources. We used journals, articles, books, whitepapers, reports, theses, and online resources. Google9, Google Scholar10, ResearchGate11, IEEE12, the KTH Library13, KTH DiVA14, and Semantic Scholar15 were used, along with companies' websites to access company resources.

9 https://www.google.com/ 10 https://scholar.google.com/ 11 https://www.researchgate.net/ 12 https://ieeexplore.ieee.org/ 13 https://www.kth.se/en/biblioteket 14 https://kth.diva-portal.org/ 15 https://www.semanticscholar.org/

Table 3: Literature review overview
Keywords: DataOps, DataOps platform, DataOps pipeline, Data lifecycle, Data analytics, Data analytics automation, Data pipeline, Data analytics pipeline, DevOps, Agile, Agile methodology, Lean manufacturing, Statistical Process Control, Big data, Big data pipeline, Data governance, Data provenance, Data lineage, ETL pipeline, ELT pipeline.
Number of literature articles accessed: 157
Number of articles used: 71
Number of online resources accessed: 112
Number of online resources used: 39

We establish our theoretical background with the literature to fulfil research goals 1, 2, 3, and 4. The whole literature-based research process is illustrated in Figure 10. As shown in the figure, we explore DataOps (Section 2.4), the data pipeline (Section 2.2), data governance (Section 2.5), and data lineage and provenance (Section 2.6). Then, we explore the importance of data governance, lineage, and provenance in data lifecycle management and DataOps. As a result of the literature study of different data lifecycles (Section 2.1) and data governance, we establish a general data lifecycle with the data governance policies involved (see Figure 5 and Section 2.1.5). With the established data lifecycle and data pipeline, we explore the data pipeline's role in data lifecycle management (Section 2.3). We also show how the DataOps approach differs from the traditional data pipeline in data analytics projects (Section 2.7). Building on the literature review work in the theoretical background section, we present our research outcome in the result section. The result has three sections: first, based on the work in Section 2.4, we define DataOps; second, we establish the relation of DataOps to the data lifecycle; and third, we explore the tools and technologies available to support DataOps.


Figure 10: Illustration of the research process

3.2. Project Implementation Using the DataOps Methodology
To implement the project using DataOps, data from TheyBuyForYou16 (TBFY) were used, and the steps to extract RDF (Resource Description Framework) data from JSON (JavaScript Object Notation) data were followed. Two separate implementations were done for comparison purposes. The first implementation does not use orchestration and scheduling tools and runs in a plain Linux environment. The second implementation uses Docker containers with Apache Airflow17 as a job scheduler. Execution time is measured for both experiments. We also evaluate both implementations with respect to the DataOps principles to show to what extent each implementation follows the DataOps approach when executing the whole project.

3.2.1. Objective
The objective of implementing a data analytics project as part of this thesis work is to show how the DataOps project implementation approach differs from the existing implementation and to describe the advantages of following the DataOps implementation guidelines.

3.2.2. Project Introduction
TheyBuyForYou is a project to make the procurement process in the EU efficient, competitive, fair, and accountable by leveraging a large amount of data and advanced analytics capabilities [98]. The project developed an ontology network by

16 https://theybuyforyou.eu 17 https://airflow.apache.org/
joining and presenting tender data from OpenOpps18 and company data from OpenCorporates19 in a knowledge graph [99]. The project provides anomaly detection and cross-lingual document search [100]. In this thesis project, we implement the ETL pipeline that produces a knowledge graph from the JSON data provided by TheyBuyForYou on Zenodo20. The JSON files consist of combined and reconciled data from the OpenOpps and OpenCorporates APIs. We use partial data from the available data sets (the first 30 days, 1.08 GB) to implement the project, since computing the whole 16.9 GB (6 months of data) requires enormous computational resources. We used code and instructions from TBFY's GitHub21 repository, following the reuse concept whenever possible (a DataOps principle).

3.2.3. Design
We designed our ETL pipeline based on the TBFY ETL pipeline [100] to generate the knowledge graph. We start our pipeline by enriching the JSON data, because the OpenOpps and OpenCorporates APIs were not available to inject data directly and reconcile data from both sources into a JSON dataset. Since the JSON data are publicly available for reuse, we used the reconciled data and followed the remaining ETL process. Figure 11 shows the pipeline steps of the project.


Figure 11: ETL process to publish Knowledge Graph from JSON

The pipeline is composed of the following steps:
1. Enrich JSON data: In this step, we enrich the JSON files provided by TBFY by adding new properties and fixing missing identifiers.
2. Convert JSON to XML: After the JSON data are enriched, we convert them to XML to better support the mapping process that converts the data to RDF (a rough sketch of this conversion is given after the step list).

18 https://openopps.com 19 https://opencorporates.com 20 https://zenodo.org/record/3638068#.X7RdrmhKgiM 21 https://github.com/TBFY/knowledge-graph
3. Map XML data to RDF: The XML data files are converted to RDF by running RMLMapper to produce N-Triples files.
4. Publish RDF to database: In this step, the RDF (N-Triples) files are published to Apache Jena Fuseki and Apache Jena TDB.
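As a rough illustration of step 2 only (the real json2xml.py script handles many cases we do not show here), a JSON document can be wrapped in a root element and serialized as XML with the xmltodict library that the pipeline already depends on; the root tag name is an assumption.

import json
import xmltodict  # third-party library, listed among the pipeline dependencies


def json_file_to_xml(json_path, xml_path, root_tag="release"):
    """Wrap the parsed JSON object in a single root element and write it out as XML."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)  # assumed to be a JSON object, not a bare list
    xml_string = xmltodict.unparse({root_tag: data}, pretty=True)
    with open(xml_path, "w", encoding="utf-8") as fh:
        fh.write(xml_string)


if __name__ == "__main__":
    json_file_to_xml("example_release.json", "example_release.xml")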

3.2.4. Implementation
Two separate implementations were done for comparison purposes. Both implementations run on Linux (2 GHz dual-core processor, 8 GiB RAM, and 40 GB hard-drive space) using the services and tools described in the project's GitHub repository. The first implementation is performed without any automation tools: each task is run manually, without scheduling scripts or scheduling tools, and all tasks are performed directly in Linux. The second implementation uses Apache Airflow as workflow orchestrator and scheduler on top of Docker; we combined all steps into a single pipeline and orchestrated the whole process. Additionally, we separated the test and production environments for the second implementation; the reason for creating a test environment is to follow the DataOps implementation principles. Before implementing both approaches, we created and ran the Apache Fuseki service using the Dockerfile available in the TBFY GitHub repository, created a knowledge-graph dataset, and loaded the nace.ttl and opencorporates_identifier_system.ttl files into Fuseki.

1. KG pipeline steps, manual execution
After satisfying all the requirements described in the project implementation instructions in the TBFY GitHub repository, we ran the command-line scripts from step 1 to step 4 manually and sequentially, as shown in Table 4.

Table 4: Scripts with input and output in the KG ETL pipeline

Step 1, Enrich JSON (input: JSON; output: enriched JSON):
python enrich_json.py -s '2019-01-01' -e '2019-01-30' -i '/tbfy/2_JSON_OpenCorporates' -o '/tbfy/3_JSON_Enriched'

Step 2, Convert JSON to XML (input: enriched JSON; output: XML):
python json2xml.py -s '2019-01-01' -e '2019-01-30' -i '/tbfy/3_JSON_Enriched' -o '/tbfy/4_XML_Enriched'

Step 3, Map XML to RDF (input: XML and RML mappings22; output: RDF):
python xml2rdf.py -s '2019-01-01' -e '2019-01-30' -r '/tbfy/rml_mappings' -i '/tbfy/4_XML_Enriched' -o '/tbfy/5_XML_to_RDF'

Step 4, Publish RDF (input: RDF; output: RDF data published to Apache Jena Fuseki):
python publish_rdf.py -s '2019-01-01' -e '2019-01-30' -i '/tbfy/5_XML_to_RDF'

Table 4 lists the Python scripts run in the Bash shell, the input data generated between the start and end dates with the input file location, and the output file location. In the first three steps, the data are stored on the hard drive under the output file name specified in the script, whereas in the final step the data are published to Fuseki in the form of N-Triples. The detailed implementation procedure and all dependencies required to run each step can be found in the TBFY GitHub repository.

Figure 12: Running ETL pipeline steps manually

22 https://github.com/TBFY/knowledge-graph/tree/master/rml-mappings

2. KG ETL pipeline with orchestration and workflow management tools
In this implementation approach, we align the implementation process with one or more DataOps principles wherever applicable. We designed and developed separate development (testing) and production environments. In the development environment, we used five days of data for testing, and in the production environment, we used 30 days of data. To set up the development environment, we installed a virtual machine (2 GiB RAM and 15 GB hard drive) on the Linux workspace. Inside the virtual machine, we installed Docker, and inside Docker, we installed a Docker image of Apache Airflow from the puckel project23 and Apache Jena Fuseki. We used a single-node Apache Airflow setup with the BashOperator to execute the ETL pipeline.
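Since the test environment processes five days of data and production thirty, the date range is a natural candidate for parameterization (the "parameterize the processing" guideline). One possible DAG fragment is sketched below; it assumes an Airflow Variable named tbfy_date_range is defined per environment, which is our own convention and not part of the TBFY scripts.

# DAG fragment: parameterize the processed date range per environment
from airflow.models import Variable

# e.g. "2019-01-01:2019-01-05" in the test environment and
#      "2019-01-01:2019-01-30" in production
date_range = Variable.get("tbfy_date_range", default_var="2019-01-01:2019-01-05")
start_date, end_date = date_range.split(":")

enrich_command = (
    "python /python-scripts/enrich_json.py "
    f"-s '{start_date}' -e '{end_date}' "
    "-i '/tbfy/2_JSON_OpenCorporates' -o '/tbfy/3_JSON_Enriched'"
)

The resulting command string would then be passed to the BashOperator shown in Listing 1 instead of hard-coded dates, so the same DAG file can run unchanged in both environments.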

For the production environment, we installed Docker containers of Apache Airflow and Apache Jena Fuseki inside Docker. All code and configurations are the same in both environments. The detailed architecture of the KG ETL pipeline is presented in Figure 13.

Figure 13: Illustration of KG ETL pipeline implementation using Apache Airflow in Docker container

I. Tools and Technologies Used in the KG ETL Pipeline
Below we list and briefly describe the tools and technologies used to implement both pipelines.

23 https://github.com/puckel/docker-airflow

1. Apache Airflow
Apache Airflow is a Python-based open-source workflow automation and orchestration tool for setting up and maintaining data pipelines. It helps to manage, structure, and organize data pipelines using Directed Acyclic Graphs (DAGs). In Airflow, a DAG is a collection of all the tasks that need to be run, and it reflects their relationships and dependencies. Airflow is a metadata-based queuing system: the database stores the status of tasks, and the scheduler uses the status information from the metadata to schedule the tasks according to their priority and execution order. The executor determines the worker processes that execute each scheduled task. There are different executors in Airflow [101], and the workers execute the logic of the tasks based on the executor used.

Figure 14: Apache Airflow architecture

The Apache Airflow database comes with an SQLite24 backend, but we use Postgres25 for metadata storage because the puckel project makes it easy to install; Airflow also supports other external databases. The webserver handles and displays tasks through the UI and a REST interface connected to the metadata storage. Through the UI, task dependencies, execution time, success, failure, retry status, and logs can be monitored, and DAGs and tasks can be deleted, rerun, stopped, paused, and started directly. Figure 14 shows a simple architecture of Apache Airflow, in which all component information is stored in the metadata storage.

24 https://www.sqlite.org/ 25 https://www.postgresql.org/
Listing 1 is the DAG example for the KG ETL pipeline. In Apache Airflow, DAGs are created in Python and stored as Python files in the DAG folder. When Apache Airflow starts, it automatically detects the DAG files stored in the folder and executes the tasks according to the script instructions. The tasks are defined as t1, t2, t3, and t4, respectively, from start to end, and each task is executed in sequential order using the BashOperator. Each task calls its own Python script with input and output data folders (except t4, which publishes data directly to Apache Fuseki) and a date range (start and end date). On task completion or failure, a notification is sent to ease tracking of the pipeline status.

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 7, 16),
    "email": ["[email protected]"],
    "email_on_failure": True,
    "email_on_retry": True,
    "email_on_success": True,
    "retries": 1,
    "retry_delay": timedelta(minutes=3),
}

dag = DAG("tbfy-test", default_args=default_args, schedule_interval=timedelta(1))

# t1, t2, t3 and t4 are the ETL pipeline tasks, created by instantiating BashOperators
t1 = BashOperator(
    task_id="enrich-json",
    bash_command="python /python-scripts/enrich_json.py -s '2019-01-01' -e '2019-01-30' "
                 "-i '/tbfy/2_JSON_OpenCorporates' -o '/tbfy/3_JSON_Enriched'",
    dag=dag)
t2 = BashOperator(
    task_id="xml-json",
    bash_command="python /python-scripts/json2xml.py -s '2019-01-01' -e '2019-01-30' "
                 "-i '/tbfy/3_JSON_Enriched' -o '/tbfy/4_XML_Enriched'",
    dag=dag)
t3 = BashOperator(
    task_id="xml-rdf",
    bash_command="python /python-scripts/xml2rdf.py -s '2019-01-01' -e '2019-01-30' "
                 "-r '/tbfy/rml_mappings' -i '/tbfy/4_XML_Enriched' -o '/tbfy/5_XML_to_RDF'",
    dag=dag)
t4 = BashOperator(
    task_id="publish_rdf",
    bash_command="python /python-scripts/publish_rdf.py -s '2019-01-01' -e '2019-01-30' "
                 "-i '/tbfy/5_XML_to_RDF'",
    dag=dag)

# run the tasks sequentially
t1 >> t2 >> t3 >> t4

Listing 1: DAG on Apache Airflow of KG ETL pipeline

2. Apache Jena Fuseki26
Fuseki is a SPARQL server that supports query and update. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. Fuseki is tightly integrated with TDB to provide persistent and robust storage for RDF. In this project, Apache Jena Fuseki is used to provide a SPARQL endpoint for the knowledge graph, and TDB is used to store the RDF data set published in step 4 of the knowledge graph pipeline.
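For illustration, an N-Triples file can be published to Fuseki over its HTTP interface with the requests library; the sketch below assumes a local Fuseki instance with a dataset named kg, and the actual publish_rdf.py script may use a different endpoint or protocol.

import requests


def publish_ntriples(nt_path, data_endpoint="http://localhost:3030/kg/data"):
    """POST the file content to the dataset's data endpoint as N-Triples."""
    with open(nt_path, "rb") as fh:
        response = requests.post(
            data_endpoint,
            data=fh,
            headers={"Content-Type": "application/n-triples"},
        )
    response.raise_for_status()  # fail loudly if Fuseki rejects the upload


if __name__ == "__main__":
    publish_ntriples("example_output.nt")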

3. RMLMapper27
RMLMapper executes RML rules to generate high-quality linked data (RDF) from structured or semi-structured data sources. In our pipeline, we use two RML mapping files: openoppsmapping.ttl provides rules for transforming XML data from OpenOpps into RDF based on the OCDS ontology28, and opencorporatesmapping.ttl provides rules for transforming XML data from OpenCorporates into RDF based on the euBusinessGraph ontology29. This task is performed in step 3 of the pipeline, where one XML file is given as input and RMLMapper produces output in N-Triples (.nt) format.
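Since RMLMapper is distributed as a Java command-line tool, step 3 can invoke it as a subprocess. The sketch below is a hedged illustration: the jar location is an assumption, and the -m/-o flags should be verified against the rmlmapper-java release in use.

import subprocess


def run_rmlmapper(mapping_file, output_file, jar_path="rmlmapper.jar"):
    """Execute the RML mapping rules and write the resulting triples to output_file."""
    subprocess.run(
        ["java", "-jar", jar_path, "-m", mapping_file, "-o", output_file],
        check=True,  # raise CalledProcessError if the mapping run fails
    )


if __name__ == "__main__":
    run_rmlmapper("openoppsmapping.ttl", "openopps_output.nt")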

4. BashOperator30
In Apache Airflow, we used the BashOperator to execute the pipeline tasks. The BashOperator executes a Bash script, command, or set of commands in the Bash31 shell.
5. Docker32
Docker is an open-source platform for developing, shipping, and running applications that separates applications from infrastructure to deliver software products quickly [102]. In Docker, infrastructure is managed in the same way as applications, using Docker's methodologies for shipping, testing, and deploying code quickly. Since it is a container-based platform, Docker containers are highly portable and run irrespective of the underlying infrastructure. A Docker container packs an application with all its runtime dependencies and libraries and publishes it as a package that runs on any other (Docker-supported) machine without customization of the infrastructure or application. In our pipeline implementation, we used Docker to host Apache Airflow

26 https://jena.apache.org/ 27 https://github.com/RMLio/rmlmapper-java/releases 28 https://github.com/TBFY/ocds-ontology/blob/master/model/ocds.ttl 29 https://github.com/euBusinessGraph/eubg-data/blob/master/model/ebg-ontology.ttl 30 https://airflow.readthedocs.io/en/latest/_api/airflow/operators/bash/index.html 31 https://www.gnu.org/software/bash/ 32 https://www.docker.com/
and Apache Jena Fuseki. With the Docker images available for both applications, we do not have to worry about modifications and changes.
6. RDF and N-Triples
RDF33: The Resource Description Framework (RDF) is an infrastructure that enables the encoding, exchange, and reuse of structured metadata by providing a mechanism that supports common conventions of semantics, syntax, and structure to associate properties with resources [103]. In our ETL pipeline, we create RDF data by combining information from the two data sources into unified metadata that can be encoded, exchanged, and reused further.
N-Triples34: N-Triples is a line-based, plain-text format for transmitting and storing RDF data. N-Triples content is generally stored in '.nt' files, and the character encoding is 7-bit US-ASCII. In step 3 of the pipeline, the XML data are converted to RDF using RMLMapper, and the RDF data are stored in N-Triples format.

II. List of Dependencies and Libraries Required to Run the Pipeline
To run the KG ETL pipeline in a Docker container using Apache Airflow, or directly on a Linux server, the following libraries, modules, environments, and packages are necessary.
1. Python modules: The ETL pipeline Python scripts use the following modules:
• requests35: to easily send HTTP/1.1 requests to the Fuseki server while publishing RDF.
• dpath36: this Python library allows accessing files inside folders by searching dictionaries via '/slashed/paths ala xpath'. It is used to access and create folders and to add JSON/XML/.nt files inside folders for the data input and output of the pipeline steps.
• xmltodict37: this library simplifies working with XML and eases the transformation.
• python-dotenv38: it reads key-value pairs from the '.env' file and adds them as environment variables. The library is helpful for authenticating

33 https://www.w3.org/TR/PR-rdf-syntax/ 34 https://www.w3.org/2001/sw/RDFCore/ntriples/ 35 https://pypi.org/project/requests/ 36 https://pypi.org/project/dpath/ 37 https://pypi.org/project/xmltodict/ 38 https://pypi.org/project/python-dotenv/
with Fuseki for RDF storage and with the Postgres database for Airflow metadata storage.
2. Java: RMLMapper runs on Java, so Java needs to be installed on the host machine/environment (Linux, Apache Airflow).
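As a small usage sketch of the last point, credentials can be loaded from a .env file and passed to requests when talking to Fuseki; the variable names FUSEKI_USER and FUSEKI_PASSWORD and the ping endpoint are our own placeholders, not taken from the TBFY scripts.

import os
import requests
from dotenv import load_dotenv

load_dotenv()  # reads key-value pairs from a local .env file into the environment
user = os.getenv("FUSEKI_USER")
password = os.getenv("FUSEKI_PASSWORD")

# Assumption: a local Fuseki instance with its admin ping endpoint enabled.
response = requests.get("http://localhost:3030/$/ping", auth=(user, password))
print(response.status_code)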

Chapter 4. Result and Evaluation
This section presents the results of our literature review work to define DataOps, explores the different tools used in different stages of the data lifecycle, and evaluates our project implementation. We start by defining DataOps: we list definitions from various scholars and then give our own definition based on the understanding drawn from examining the literature. Second, we make a feature-based comparison of various tools that support DataOps, categorizing them into functional areas based on their usage. Finally, we compare our data pipeline implementation approaches based on the DataOps principles.
4.1. DataOps as a Data Analytics Methodology
To establish DataOps as a data analytics methodology, we first define DataOps, outlining the current ambiguities and listing the challenges that must be addressed when implementing it. We present DataOps as a better approach to handling the data lifecycle by outlining the benefits of the approach for data lifecycle management. Furthermore, we provide a list of tools with the features they offer for accomplishing tasks in data analytics projects.
4.1.1. DataOps Definition
To define DataOps, we used the literature presented in Section 2.4. Most authors present their definition in line with the DataOps components and a set of goals. Based on our analysis of the literature, we divide DataOps definitions into three categories. First, goal-oriented definitions focus on the DataOps goal or result. Second, task-oriented definitions emphasize the tasks performed during data analysis projects to facilitate a continuous flow of data. Third, process- and team-oriented definitions emphasize the tools and technologies used for collaboration, monitoring, testing, and product delivery. A summary of the definitions is given in Table 5 below.

Table 5: Analysis of DataOps definitions from the literature

Goal-oriented:
- DataOps takes data from the source and delivers it to the person, application, or system that produces business value [41].
- DataOps delivers predictable outcomes and better change management of data, technology, and process [48].
- Data pipelines and applications are reused, automated, and reproduced easily with DataOps [47].
- DataOps eliminates the friction in data consumption with the right technology in use [46].
- DataOps gives the ability to utilize large amounts of heterogeneous data with low risk and high velocity [46].
Task-oriented:
- An analytics process that spans from data collection to delivery of information [42].
- Develop and deliver data analytics projects in a better way [43].
- Creates a dynamic data analytics team that continuously evolves to meet new requirements [48].
- DataOps is an enabler of success rather than just a product [46].
Process- and team-oriented:
- A combination of the value and innovation pipelines connected through continuous integration and deployment tools [44].
- A data management approach that improves communication and integration between previously inefficient teams, systems, and data [45].
- DataOps is not just DevOps for data analytics [31].
- A collaborative approach to the collection and distribution of data with automation and controlled access for data users to preserve data privacy and integrity [39].
- Leverages technology for data delivery automation with security and quality as top priorities [48].

From Table 5 it is noticeable that all definitions share some common elements. Most of the terms used to define DataOps are related in some way to Agile, DevOps, and Lean Manufacturing principles. Even though each definition has a particular focus area, all of them serve collaboration, orchestration, monitoring, and automation to generate quality outcomes. After careful analysis of the different definitions from different sources, we formulate our own DataOps

definition, considering people's involvement, process, and technology, to make the data analytics process manageable, trackable, and efficient with quality results.

DataOps Definition: DataOps can be defined as a data pipeline development and execution methodology that assembles people and technology to deliver better results in a shorter time. With DataOps, people, processes, and technology are orchestrated with a degree of automation to streamline the flow of data from one stage of the data lifecycle to another. Using the best practices of Agile, DevOps, and SPC in combination with technologies and processes, DataOps promotes data governance, continuous testing and monitoring, optimization of the analysis process, communication, collaboration, and continuous improvement.
Listing 2: DataOps definition

From the establishment of the concept, DataOps has focused on delivering robust, rapid, collaborative, and quality-driven data analytics projects by integrating technology and people and redesigning processes. DataOps is an advancement of the existing isolated data pipeline that provides better monitoring and better results. DataOps is not an absolute method with scientifically proven, predefined rules and steps. Instead, it is a progressive approach to create a better work environment for data workers, deliver faster data analysis results to stakeholders, track data movement with the ability to associate cause and effect, and reduce cost and effort without compromising results. The necessity of DataOps lies in the fact that data are assets, and the value of assets depends on how well an organization uses them in its operation. The term 'DataOps' itself is a combination of 'Data' and 'Operations', which conveys the idea of operating on data/assets to deliver business goals in an organization. If data are assets, then reports, insights, and information are products, and data analytics is the process of utilizing and converting the assets into products. A readily available, high-quality product at low cost generates better business value, and to deliver quality products faster at low cost, innovation in the business process is a must. In data analytics, DataOps provides the innovation needed to deliver data products better: innovation in terms of redesigning the organizational culture, creating a collaborative workspace, managing and monitoring processes and results, reducing delivery cycle time, and improving the product itself. By combining people, process, and technology with Agile, DevOps, and SPC's methods of delivering and monitoring products, DataOps makes data operations manageable. As presented in Section 2.4.1, DataOps is an association of practices derived from Agile, DevOps, and Lean Manufacturing. Reducing the data analysis cycle

time, automating analysis tasks and processes, and managing the quality of data and data-derived products are the core contributions of Agile, DevOps, and Lean Manufacturing, respectively. From the contributions of these three methodologies, DataOps creates the foundation for managing the data lifecycle. Agile and DevOps cultivate the essence of collaboration and automation to reduce product delivery time, while SPC enforces quality through continuous monitoring of the analytics process and products. On top of the processes derived from other methodologies, DataOps has its own approaches to tackle the challenges that arise from the heterogeneity of data analysis projects. Separating the production environment from development gives data workers room to experiment with changes and removes the fear of failure altogether. With two different environments, product quality can be assured by continuous testing and cross-environment monitoring. Including customers and other stakeholders in the data analytics project keeps the communication and feedback loop to a minimum number of iterations. With this, changes and improvements in the pipeline can deliver

faster results without affecting current pipeline production. Also, role-based task distribution fosters individual responsibility while maintaining the cohesion of the team effort.

Figure 15: DataOps pipeline

The DataOps pipeline, as illustrated in Figure 15, starts with gathering data and business requirements. The active involvement of managers, data providers, and analysts creates the baseline for pipeline development. Once the business requirements and data are finalized, development of the data pipeline starts. The developed data pipeline is orchestrated by orchestration tools and tested (code, data, pipeline architecture, performance, and output) before being deployed to the production environment. Data

engineers, scientists, architects, and developers collaborate, each using their area of expertise. There can be multiple development environments, one for each involved data worker. However, deployment only happens after all individual work is assembled into a pipeline that fulfils all test requirements. Testing and orchestration of the data pipeline are supported by CI tools to keep the whole pipeline development in a synchronized view for all involved parties. Deployment is done through CD tools, which automate the deployment tasks. Automating deployment reduces the workload of reconfiguring and reworking the pipeline in another environment. With CI and CD combined, the data pipeline moves swiftly from the innovation stage to the production stage. In the production phase, the pipeline runs in an orchestrated environment, just as in the development environment. Continuous monitoring tracks the pipeline's input, performance, and output and cross-validates the monitoring outcomes against the test results from the development environment and the business requirements. The production and monitoring teams are responsible for carrying out the tasks in the production environment; the teams are composed of people with different areas of expertise and interest to deliver quality performance. Finally, the results are shared with customers and stakeholders with the expectation of feedback and comments. Several tools and technologies are used to develop the DataOps pipeline. Each tool and technology has its own purpose, but placed together they provide an analytics environment with transparency in the process, quality in the results, efficiency in performance, and reliability in collaboration. The various tools and technologies applicable in the DataOps pipeline are discussed in Section 4.1.3.

Ambiguities in DataOps Practices
DataOps is an emerging concept in the field of data analytics. In recent years, information collection and contributions to DataOps have been progressing through the involvement of DataOps practitioners and enthusiasts. Nevertheless, some misconceptions about DataOps are prevalent; they are listed and explained below.

1. DataOps is just DevOps applied to data analytics.
DataOps is not DevOps for data. It takes best practices from the DevOps and Agile methodologies and combines them with Lean Manufacturing's SPC and data analytics specific tasks to streamline the data lifecycle and provide quality results. Data analytics projects and software development projects differ greatly.

2. DataOps is all about using tools and technology in the data pipeline.

DataOps is not about automating everything with tools and technologies and keeping humans out of the loop. DataOps advocates a balanced involvement of people alongside tools and technology. Communication and collaboration are strongly emphasized in DataOps to turn data into value for all involved parties.

3. DataOps is an expensive methodology.
Acquiring and running different tools always comes with a price, but data analytics projects will cost an organization whether they follow DataOps or not. One should compare the investment with the value to be received in the near future. Furthermore, proper research on tools and technology before implementing them in the data pipeline helps make informed decisions that keep the cost to a minimum.

4. With DataOps, there is no need for coding.
Without writing code, data pipeline tasks cannot be built at all, so coding is always the baseline of data analytics projects. With DataOps, however, the amount of coding can be reduced by reusing and versioning code, algorithms, and configuration scripts, while IDEs and source code editors make writing and debugging code easier.

5. DataOps can only be used for data analysis tasks.
DataOps is not just about generating reports and delivering fancy charts, templates, bars, and figures using visualization tools. It covers the whole data lifecycle from data collection to disposal. Moreover, it is not just about covering the data lifecycle; it is also about creating a data-driven organizational culture that emphasizes collaboration, communication, transparency, and quality in organizational tasks.
6. DataOps and the data pipeline are two different ways of propagating data analytics tasks.
DataOps is an approach to implementing a data pipeline: we apply DataOps principles and practices while developing and executing data pipelines, and a data pipeline built with DataOps methodologies is also called a DataOps pipeline. DataOps is not an entirely new way of performing data analytics tasks; rather, it redesigns the data pipeline to deliver quality results in a short time with minimal cost and effort.
4.1.2. DataOps in the Data Lifecycle
DataOps' goal is to minimize the analytics cycle time by covering every stage of data analysis, from requirements collection to distribution of results [44]. The data lifecycle relies on people and tools [8], and DataOps brings people and tools together to manage the data lifecycle better.

A data analytics pipeline alters data through a series of tasks. Whether it is an ETL/ELT pipeline or an analysis pipeline, the output will always differ from the input. One of the challenging tasks in data pipelines is tracking the data as they go through a series of transformations from one stage to another. In DataOps, data lifecycle management is unavoidable because of the need to monitor the quality of processes and products. Data governance and data lineage are part of DataOps to assure process and product quality, and quality assurance and the DataOps principles of reproducibility and reuse depend heavily on managing and maintaining data lifecycle change events. Data governance and data lineage are not easy to address: they start with managerial-level planning and flourish with the tools and approaches used to implement those plans. In DataOps, transparency in data lifecycle management is always a priority. DataOps applies to the entire data lifecycle [104]: from data collection to publishing the results, all data preparation and analysis stages can use the DataOps methodology for executing the job. It provides the significant advantage of easier data lifecycle management by applying an intrinsic approach to handling data throughout the analytics lifecycle. As mentioned in Section 2.3, a data pipeline serves as the means of transporting data from one stage of the lifecycle to another. Furthermore, in Section 2.7 we showed how DataOps restructures the traditional data pipeline, takes it out of the black box, and makes it measurable and maintainable through collaboration, communication, integration, and automation. As a result of this restructuring, data lifecycle management becomes more straightforward. DataOps supports all stages of the data lifecycle; with the right people and technology in use, data flow seamlessly from one stage to another. With DataOps, a published analysis result can be tracked back to the raw data source, decomposing each transformation performed on it. DataOps acknowledges the interconnected nature of data engineering, data integration, data quality, and data security [38] and combines these aspects of data analytics to form an interspace of data movement between data lifecycle stages. Figure 16 is a simple visual representation of the DataOps pipeline's coverage of the data lifecycle. The dotted line represents the DataOps pipeline, which covers the entire data lifecycle, including planning. Data governance and data quality management are well implemented in DataOps throughout the lifecycle of the data. Moreover, in DataOps there is no need to create separate pipelines for different stages, unlike traditional data pipelines (see Section 2.3). Instead, DataOps utilizes the technical

53

modularity of orchestration, workflow management, and automation tools to provide

flexible and customized transformation process when needed.
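Since tracing a published result back to its raw sources is central to this traceability argument, the following minimal Python sketch shows one way a pipeline step could record lineage entries as it runs. The function names, file paths, and log format are illustrative assumptions, not part of any specific DataOps tool discussed in this thesis.

    import hashlib
    import json
    from datetime import datetime, timezone

    def file_fingerprint(path: str) -> str:
        """Return a SHA-256 digest of a file so each data version is identifiable."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_lineage(step: str, inputs: list[str], outputs: list[str],
                       log_path: str = "lineage_log.jsonl") -> None:
        """Append one lineage entry (step name, input/output fingerprints, timestamp)."""
        entry = {
            "step": step,
            "inputs": {p: file_fingerprint(p) for p in inputs},
            "outputs": {p: file_fingerprint(p) for p in outputs},
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")

    # Example: called after a hypothetical cleaning step in the pipeline.
    # record_lineage("clean_sales_data", ["raw/sales.csv"], ["clean/sales.parquet"])

With such entries, every transformation between lifecycle stages leaves a record that can later be decomposed step by step, which is the behaviour described above.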

[Figure omitted: the data lifecycle stages Plan, Create/Collect, Process, Analyze, Publish, Store, and Dispose, connected by archive, access, and temporary-storage links and enclosed by the dotted DataOps pipeline.]

Figure 16: DataOps in the data lifecycle

4.1.3. Evaluation of DataOps Tools and Technologies

In this section, we evaluate the tools and technologies used in DataOps. We first divide the tools into different categories (Section 4.1.3.1), then introduce comparison criteria for the tools (Section 4.1.3.2), and finally present a feature-based comparison of DataOps tools in Section 4.1.3.3. The selection of tools for comparison is based on their popularity and is intended to provide a baseline for further research on selecting tools for DataOps tasks. Since numerous tools offer the same features and functionality, it is hard to cover every tool in detail. We have picked some of the popular tools and compared them categorically based on the comparison criteria presented. Selecting tools and technology in DataOps is a rigorous process and needs detailed research and planning before a particular tool is chosen for a designated task [89]. Tools were selected based on their large user base, features relevant to the given functionality, and the popularity of the product in data analytics project execution. The tools presented in the feature-based comparison tables were picked after extensive online research, by listing and comparing tools from the same functional categories. Additional tools studied during this selection process (but not presented in the comparison tables) are listed in the Appendix with a short description and product link.

4.1.3.1. Categorization of DataOps Tools and Technologies

Tools used in DataOps are categorized into the following functional categories based on the tools' purpose in the data pipeline. Some tools described in Section 4.1.3.3 are uncategorized and kept under 'Other tools and technologies' because they do not fall under the first seven categories listed below.

1. Workflow orchestration tools
Workflow orchestration or pipeline orchestration defines the logical flow of tasks from start to end in the data pipeline. In DataOps, orchestration tools create a logical flow of data analytics tasks and assemble other tools and technologies, infrastructures, and people to accomplish the job. Several orchestration tools are available that share similar design principles but target various users and use cases, and choosing among them for pipeline workflow management is a thorough job. Orchestration tools cover resource provisioning, data movement, data provenance, workflow scheduling, fault tolerance, data storage, and platform integration in the data pipeline [105]. However, not all orchestration tools have every feature built in to support every task in a data pipeline, so choosing the right orchestration tool is essential for managing tasks in the data pipeline. There has also been a practice of developing custom-built workflow orchestration tools for a specific project [106–108]. The research paper [106] by Yared et al., based on his master's thesis [109], suggests a big data workflow approach built on software container technology and message-oriented middleware, with a detailed feature-based comparison of DSL-based workflow management against existing workflow orchestration tools. In Table 6, we compare some of the existing pipeline orchestration tools based on the comparison criteria presented in Section 4.1.3.2.
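To make the notion of a logical task flow concrete, the following minimal sketch defines an Airflow DAG with two dependent tasks. The DAG id, callables, and schedule are hypothetical, and the import paths follow the Airflow 2.x layout; they may differ in other Airflow versions.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder for a data collection step (e.g., pulling a file from a source system).
        print("extracting raw data")

    def transform():
        # Placeholder for a cleaning/transformation step.
        print("transforming data")

    with DAG(
        dag_id="example_dataops_pipeline",   # hypothetical pipeline name
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",          # run the pipeline once per day
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)

        # The logical flow: transform runs only after extract succeeds.
        extract_task >> transform_task

The orchestrator then schedules, retries, and visualizes this dependency graph, which is exactly the kind of task management the tools in Table 6 are compared on.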

2. Testing and monitoring tools
Continuous testing and monitoring are a principal mission of DataOps. With these, DataOps ensures pipeline performance, the quality of inputs and results, and code and toolchain performance throughout the data analytics process. Testing and monitoring are done at every stage of the data lifecycle. In DataOps, testing and monitoring start at the top management level by setting the criteria for project quality, and test cases are developed according to the proposed criteria. After the development of test cases and monitoring criteria, suitable existing tools or a custom-built testing and monitoring framework can be integrated into the data pipeline. Some of the testing and monitoring tools are explained in Table 7.
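As an illustration of the kind of automated data test such tools run inside a pipeline, the short sketch below checks a few basic quality criteria on a batch before it is promoted to the next stage. The column names, file path, and thresholds are hypothetical.

    import pandas as pd

    def check_quality(df: pd.DataFrame) -> list[str]:
        """Return a list of failed checks; an empty list means the batch may proceed."""
        failures = []
        if df.empty:
            failures.append("dataset is empty")
        if df["order_id"].duplicated().any():       # hypothetical key column
            failures.append("duplicate order_id values found")
        if df["amount"].isna().mean() > 0.01:       # allow at most 1% missing amounts
            failures.append("too many missing values in 'amount'")
        return failures

    if __name__ == "__main__":
        batch = pd.read_csv("staging/orders.csv")   # hypothetical staging file
        problems = check_quality(batch)
        if problems:
            # In a DataOps pipeline this would raise an alert and stop downstream tasks.
            raise ValueError("data quality checks failed: " + "; ".join(problems))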

3. Deployment automation tools
DataOps continuously moves code and configurations from the development environment to the production environment once test cases are satisfied. Deployment automation is applied through the process of Continuous Integration and Continuous Deployment. Tools used in deployment automation are presented in Table 8.

4. Data governance tools
The importance of data governance, data lineage, and data provenance in DataOps is described in Sections 2.5 and 2.6. Testing and monitoring also help keep records in line with the principles of data governance. Whereas testing and monitoring focus more on tracking performance measures of the whole DataOps pipeline, data governance is concerned with data change management and data lineage tracking. Tools used in data governance are presented in Table 9.

5. Code, artifact, and data versioning tools
Code, artifact, and data versioning tools provide a platform to store different versions of code, data sets, Docker images, and other related documents such as logs, user manuals, system manuals, and configurations. With the right tool, accessing and reusing different versions of stored artifacts becomes easier. Tools helpful for keeping track of different versions of artifacts are described in Table 10.
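As one concrete illustration, DVC (compared in Table 10) exposes a small Python API for reading a dataset as it existed at a given Git revision. The sketch below assumes DVC's dvc.api.read helper; the repository URL, file path, and tag are placeholders.

    import dvc.api

    # Read a specific revision of a dataset tracked by DVC in a Git repository.
    data = dvc.api.read(
        "data/training.csv",                       # hypothetical tracked file
        repo="https://github.com/example/data-repo",
        rev="v1.0",                                # Git tag or commit pinning the data version
    )
    print(len(data), "characters loaded from the v1.0 snapshot")

Pinning data versions this way is what makes an analysis reproducible: the pipeline can always be rerun against exactly the dataset that produced an earlier result.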

6. Analytics and visualization tools
The importance of visual presentation is always high when demonstrating results. As customers and non-data workers expect to receive quality results in an understandable form, data visualization and analytics tools play a big part in presenting the quality of the DataOps pipeline. With the support of the analytics and visualization tools presented in Table 11, data workers can better communicate their results to other stakeholders.

7. Collaboration and communication tools
Communication and collaboration tools are necessary to coordinate better among team members. They range from simple email applications to advanced communication tools with features that automate and record most routine tasks. Table 12 lists some of the communication and collaboration tools that help establish better communication and project monitoring during DataOps pipeline execution.
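As a small illustration of wiring a communication tool into the pipeline itself rather than using it only for chat, the sketch below posts a run-status message to a Slack channel through an incoming webhook. The webhook URL is a placeholder that would be generated in Slack's app settings.

    import json
    import urllib.request

    def notify_slack(message: str, webhook_url: str) -> None:
        """Send a plain-text notification to Slack via an incoming webhook."""
        payload = json.dumps({"text": message}).encode("utf-8")
        request = urllib.request.Request(
            webhook_url,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()  # Slack replies with "ok" on success

    # Example call at the end of a pipeline run (placeholder URL).
    # notify_slack("Nightly DataOps pipeline finished: 0 failed tests",
    #              "https://hooks.slack.com/services/T000/B000/XXXX")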

8. Other tools and technologies
In the other tools and technologies section, we describe container technology, resource managers, data storage services, IDEs and source code editors, cloud servers, and big data processing and analytics frameworks. In Table 13, popular tools and services under the listed categories are presented along with their general use in DataOps.

4.1.3.2. Evaluation Criteria

The general criteria for DataOps tools evaluation and comparison are based on ease of installation, simplicity of operation, integration support for other technologies, and general applicability of the tools. The schema of the comparison tables is described below.

1. Complexity
Measures how complex the installation and implementation process of the presented tool is. Evaluation is based on the code complexity and the dependencies that need to be set up.
• HIGH: Needs a high level of coding and configuration to install the product.
• MEDIUM: Moderate level of coding and configuration required.
• LOW: Easy to install with no or only a few lines of code.

2. Usability
Measures how simple the tool is to use after installation, especially for non-technical data workers.
• HIGH: Easy to use with little or no technical, coding, or system-related knowledge.
• MEDIUM: Moderate knowledge of the system, code architecture, or technical details is required.
• LOW: A high level of technical expertise and/or coding knowledge is required.

3. Compatibility
Measures the integration capacity of the tool with different operating environments, other tools, databases, data types, and/or programming languages.
• HIGH: Supports a wide range of tools, operating environments, databases, data types, and programming languages.
• MEDIUM: Has some level of support, either explicit (a number of specific tools, languages, databases, data types, and/or programming languages declared officially) or implicit partial support provided through unofficial projects.
• LOW: Little or no support available.

4. Application
Provides information on the tool's applicability to arrays of projects, data analysis use cases, and industries.
• GENERIC: Can be used in a variety of projects based on the nature of the tool.
• SPECIFIC: Industry- or project-specific usage.

5. Lifecycle
Lists the data lifecycle stages, as described in Section 2.1.5, in which the tool is mostly used.

6. License
Describes whether the tool is commercial, opensource, freemium, free + commercial, or another pricing form.

7. Other schemas
• Description: Presents a general overview of the tool.
• Features: Lists major features of the tool.
• References: Links to the tool's documentation and product information for further and detailed study.

Using the attributes presented above, we have created the comparison tables for DataOps-supported tools and technologies in Section 4.1.3.3.

4.1.3.3. DataOps Tools and Technology Comparison

Table 6: Workflow orchestration tools

Airflow
Description: Airflow programmatically creates, monitors, and schedules workflows. A workflow is created as a Directed Acyclic Graph (DAG). The Airflow scheduler executes the tasks of a workflow using the defined dependencies and executors.
Features: Define data jobs in source control. Web application to monitor and explore DAGs, progress, metadata, and logs. Metadata repository to keep track of job status and other persistent runtime information. Job scheduler for task instances. Supports several executors such as Sequential, Local, Celery, Kubernetes, and Docker. Scalable, with resource pooling capabilities to execute parallel tasks. Python-based.
Lifecycle: Creation/collection, Process, Analyze
Complexity: HIGH. Installation steps are straightforward with easy documentation and additional runtime support, but workflow creation and task scheduling need programming skills. Optimization and integration with different applications and platforms require in-depth knowledge.
Usability: MEDIUM. DAG-based visualization of a workflow (created using Python code); tasks can be monitored using the web interface, but changes in tasks and workflows are made through coding. Only visualization and monitoring are available in the web application.
Compatibility: HIGH. Supports different runtime environments and has different executors for runtime-based tasks such as Docker, Kubernetes, and Celery.
Application: GENERIC. Can be used with different types of projects and industries.
License: Opensource
Reference: https://airflow.apache.org/

Apache Oozie
Description: Apache Oozie is a workflow scheduler used for managing Apache Hadoop (https://hadoop.apache.org/) jobs.
Features: Execute and monitor workflows in Hadoop. Periodic scheduling of workflows. Trigger execution through data.
Lifecycle: Collection/creation, Process, Analyze
Complexity: HIGH. Configuration and installation steps are complicated with lots of dependencies. Must define the beginning and end of the workflow and the mechanism to control workflow execution at control-flow nodes. Only supports Java.
Usability: MEDIUM. HTTP, CLI, and web interface for task monitoring.
Compatibility: LOW. Applies to Hadoop-related jobs only.
Application: GENERIC. Can only be used in Hadoop projects.
License: Opensource
Reference: https://oozie.apache.org/

Reflow
Description: System for incremental data processing in the cloud. Allows data scientists and engineers to write straightforward programs and execute them in the cloud environment.
Features: Defines a Domain Specific Language. Comes with an in-built cluster manager to increase and decrease compute resources, in sync with the AWS cluster manager to create and delete resources when needed.
Lifecycle: Process, Analyze
Complexity: HIGH. It is distributed with the EC2 cluster manager and its memoization cache is based on S3; AWS configuration needs to be done beforehand, and AWS credentials must be provided to Reflow for setup.
Usability: LOW. CLI commands only. Complex process to create and run the workflow file (called a reference file). Workflow monitoring and modification can only be done through programming.
Compatibility: LOW. Only available in the EC2 environment.
Application: SPECIFIC. Designed for bio-informatics projects and focused on the bioinformatics industry, though it can be implemented in other applications.
License: Opensource
Reference: https://github.com/grailbio/reflow

DataKitchen
Description: Platform to reduce analytics cycle time by monitoring data quality and providing automated support for data deployment and new analytics.
Features: Supports the orchestration of all the heterogeneous data centers, tools, infrastructure, and workflows used across teams. Can orchestrate another orchestrated pipeline. API access to computing resources. Test orchestration in the pipeline.
Lifecycle: Process, Analyze
Complexity: LOW. Commercial support available. Easy implementation guidelines.
Usability: HIGH. Kitchen creation (resource add), recipe creation (code), and orders (jobs) can be created using the web interface; however, for recipe creation the code needs to be ready to upload. Also supports CLI commands.
Compatibility: HIGH. On-premises and cloud platform.
Application: GENERIC. Hosted on their server; API and web app access.
License: Commercial
Reference: https://usecase.datakitchen.io/meta-orchestration-2/

BMC Control-M
Description: Automation solution to simplify and automate different batch workloads. Automates event-driven workflows with failure prevention.
Features: Streamlines the orchestration of business applications by embedding workflow orchestration into the CI/CD pipeline. Extends development and operations collaboration with a jobs-as-code approach. Provides intelligent file transfer movement.
Lifecycle: Process, Analyze
Complexity: MEDIUM. Support is provided for installation. Jobs-as-code to build and test workflows in the CI/CD pipeline.
Usability: MEDIUM. Web interface, iOS and Android mobile app for monitoring. Task scheduling through jobs-as-code.
Compatibility: HIGH. On-premises, hybrid, and multi-cloud environments.
Application: GENERIC. Used in business intelligence, ETL, file transfer, big data, and Hadoop jobs.
License: Commercial
Reference: https://www.bmc.com/it-solutions/control-m.html

Argo Workflows
Description: Workflow engine for orchestrating parallel jobs on Kubernetes. Container-native tool implemented as a K8s CRD.
Features: Each step in an Argo workflow is defined as a container. Stepwise task execution or DAG-based task dependencies. Runs on top of a Kubernetes cluster, with each task as a K8s pod. Useful for parallel analytics job execution in Kubernetes. Workflow definition and automation are created through custom-designed YAML templates called Argo DSL (ADSL).
Lifecycle: Process, Analyze
Complexity: HIGH. Kubernetes knowledge is required, along with high programming skill to deploy on K8s. Installed through kubectl commands.
Usability: LOW. No user interface to design workflows; workflows must be created through ADSL.
Compatibility: LOW. Kubernetes-native application; however, a variant for Docker (Docker-in-Docker) is also available.
Application: GENERIC. Can be used for all types of projects in machine learning, data processing, and automation using Argo Workflows on Kubernetes.
License: Opensource
Reference: https://argoproj.github.io/projects/argo

Apache NiFi
Description: Apache NiFi allows automating the flow of data between systems. Uses a flow-based programming model to build scalable data workflows.
Features: Web-based user interface. Dataflows can be modified during runtime. Tracks data movement from the beginning to the end of the project. Provides testing and rapid development.
Lifecycle: Collection/creation, Process, Analyze, Publish
Complexity: MEDIUM. Moderate installation process with directory creation. Requires a JVM to run and execute tasks.
Usability: MEDIUM. User-friendly interface with drag-and-drop features to design workflows; however, it requires knowledge of languages like SQL to use the NiFi interface.
Compatibility: MEDIUM. Application with inbuilt features like testing and monitoring, data provenance, and resource management.
Application: SPECIFIC. Mostly used as a data ingestion tool with inbuilt data transformation, enrichment, and data monitoring. Works with big data processing platforms and standalone data projects as a data ingestion tool.
License: Opensource
References: https://nifi.apache.org/ and https://nifi.apache.org/docs/nifi-docs/html/overview.html

Table 7: Testing and monitoring tools

iCEDQ
Description: Software for ETL, data warehouse, and data migration testing.
Features: User security for system-level and database access based on role-based responsibilities. CI/CD pipeline integration. Custom reporting and dashboard. Inbuilt scheduler for offline jobs and tasks. Alerts and notifications. Single-server, multiple-server, and cluster options to execute tasks.
Lifecycle: Creation/collection, Storage, Analyze
Complexity: LOW. Configuration of data sources through the web app with simple steps to follow. Has inbuilt database connectors for big data, cloud storage, flat files, and reports.
Usability: HIGH. Web-based, Android, and iOS mobile apps for monitoring. Self-hosted, connected through APIs.
Compatibility: HIGH. Supports a wide range of data structures and data sources.
Application: GENERIC. Can be used for any type of data analytics project with supported data types.
License: Commercial
Reference: https://icedq.com/overview

Data Band
Description: An open-source framework for building and tracking data pipelines through metadata.
Features: Generates custom metrics for the pipeline. Tracks workflow input and output and lineage of data. Generates data profiling and statistics on data files. Data caching and data versioning. Dynamic pipelines with features to support sub-pipelines inside the pipeline.
Lifecycle: Process
Complexity: HIGH. CLI-based installation and configuration.
Usability: LOW. Knowledge of Python is necessary to run and create tasks.
Compatibility: MEDIUM. Supports cloud and local environments and can be used with Airflow, Google Cloud Composer, Apache Spark, and Kubernetes.
Application: GENERIC. Gain insight into a pipeline running on tools like Airflow, Jenkins, and Apache Spark.
License: Opensource + commercial
Reference: https://databand.ai/

RightData
Description: RightData is an intuitive, flexible, efficient, and scalable application for data quality assurance, data integrity monitoring, and quality control using automated testing capabilities.
Features: Dataset analysis and validation through dataset comparison. Data validation and reconciliation to detect anomalies using a rule-based data validation engine. Supports a wide variety of data sources, from RDBMS and flat files to SAP and big data sources. DevOps integrations. Cloud technology connections.
Lifecycle: Storage, Analyze, Process
Complexity: MEDIUM. Web-based installation with commercial support.
Usability: MEDIUM. Can be used through a web application for monitoring, but project configuration requires some level of technical expertise.
Compatibility: HIGH. External systems can connect using REST APIs, and different file systems are supported.
Application: GENERIC. Can be installed in cloud and local environments and used with different data analytics projects.
License: Commercial
Reference: https://www.getrightdata.com/

Naveego
Description: Cloud-based application to clean and monitor data quality and track data changes using a dashboard.
Features: Can connect to any data source. Can define a custom data model. Data merge and data cleansing. Data source tracking and unification into a single file. User collaboration. Data quality notification with manual and automated data quality checks and cross-system comparison.
Lifecycle: Collection/creation, Process, Storage
Complexity: HIGH. Cloud-hosted system with a web interface. Commercial support available. Can be installed on-premises using the inbuilt packages for Windows, Linux, and macOS.
Usability: HIGH. Testing and monitoring jobs can be performed using a web application. The system has an inbuilt mechanism to track and monitor data changes for supported data sources. No coding is necessary.
Compatibility: LOW. Connected through plugins and currently supports specific data sources; cannot connect to other data sources without a plugin.
Application: SPECIFIC. Can only be used with data sources that can be connected using the plugins developed by the company.
License: Commercial
Reference: https://www.naveego.com/

DataKitchen
Description: Helps to improve data quality by providing lean manufacturing controls for testing and monitoring.
Features: Provides customizable alerts to applications like Slack, email, and Microsoft Teams. Technical users can create data flow tests, while business stakeholders can apply business logic tests, among many others. Code validation tests are promoted in the development phase and pushed to production to verify data operations. Each error creates more test cases in the development phase to eliminate production errors.
Lifecycle: Collection/creation, Process, Storage
Complexity: HIGH. Commercial support for installation and implementation.
Usability: MEDIUM. For dataflow tests, some coding knowledge is needed, but for business logic tests no coding experience is necessary. Has both a CLI and a web app for monitoring and task creation.
Compatibility: HIGH. On-premises and cloud platforms.
Application: GENERIC. Can be used with different types of projects and databases connected through APIs and monitored using the web app.
License: Commercial
Reference: https://usecase.datakitchen.io/automated-data-testing-and-monitoring/

Enterprise Data Foundation deltaTool
Description: Non-profit organization for an enterprise data toolkit, focused on data management with the deltaTool product. deltaTest (https://enterprise-data.org/delta-test/), deltaRefresh (https://enterprise-data.org/delta-refresh/), and deltaDeploy (https://enterprise-data.org/delta-deploy/) deliver robust, test-driven development and automated regression testing.
Features: PowerShell-based unit testing framework. Helps to write transformation code faster and reduce defects in the production environment. deltaTest tests inputs, process, and output if it can be invoked from the Windows command line.
Lifecycle: Storage, Process, Analysis
Complexity: HIGH. PowerShell-based application. Must run PowerShell scripts to set up the environment and clone the GitHub repository. Though the process is well explained in the documentation, some scripts are hard to understand.
Usability: LOW. Every testing task must be performed through PowerShell scripts with defined parameters. Results are published in a JUnit file with no visualization of test results.
Compatibility: LOW. Developed for enterprise data management; applicable to text files or SQL Server data and databases.
Application: SPECIFIC. Can only be used through Windows PowerShell. Used on static data and code.
License: Free (non-profit)
Reference: https://enterprise-data.org/delta-tools/

Table 8: Deployment automation tools

Jenkins
Description: Opensource automation server to deploy a project from testing to production.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Easy installation on different operating systems. Simple, easy-to-use, and customizable user interface. Popular and easily extensible with 400+ third-party plugins. Distributed system with master-slave architecture to reduce the load on the CI server. Notifications enabled.
Complexity: MEDIUM. Easy to install and set up.
Usability: HIGH. After installation and configuration, most tasks can be managed using the Jenkins app or web interface.
Compatibility: HIGH. Can be used within any kind of data analysis environment.
Application: GENERIC. A highly used deployment automation tool for data analytics and software development projects.
License: Opensource
Reference: https://www.jenkins.io/

DataKitchen
Description: DataOps platform supporting data analytics code and configuration deployment.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Automates the manual work of deployment from the development to the production environment. Keeps a track record of pre-configured tools, datasets, hardware, and tests, and deploys to production by remapping different target toolchains. Team coordination.
Complexity: HIGH. Commercial support for installation and implementation.
Usability: MEDIUM. Deployment is straightforward and easily usable after the configuration of the production environment.
Compatibility: HIGH. On-premises and cloud platform.
Application: GENERIC. Can be used with different types of projects and databases connected through APIs and monitored using a web app.
License: Commercial
Reference: https://usecase.datakitchen.io/deployment-cicd-for-data-projects/

Circle CI
Description: CircleCI is a modern continuous integration and continuous delivery (CI/CD) platform. The CircleCI Enterprise solution is installable inside a private cloud or data center and is free to try for a limited time. CircleCI automates build, test, and deployment of software with job configuration.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Integrates GitHub, GitHub Enterprise, and Bitbucket. Runs pipelines in a clean container for testing. After passing tests, the pipeline is automatically deployed to a different environment. CircleCI may be configured to deploy code to various environments, including AWS CodeDeploy, AWS EC2 Container Service (ECS), AWS S3, Google Kubernetes Engine (GKE), Microsoft Azure, and Heroku. Other cloud service deployments are easily scripted using SSH or by installing the API client of the service.
Complexity: MEDIUM. Easy to set up and install. Must connect to either a GitHub or Bitbucket account. Simple configuration steps.
Usability: MEDIUM. Scheduling, deployment, and testing jobs are done by editing the config.yaml file, which requires some level of programming knowledge. Monitoring can be done through the web UI.
Compatibility: MEDIUM. Works with code repositories from GitHub or Bitbucket and cannot access development code directly. It can be deployed on-premises or in the cloud.
Application: GENERIC. Docker, Windows, and Linux jobs. Applicable to any kind of project whose code can be stored in GitLab and Bitbucket.
License: Free + Commercial
Reference: https://circleci.com/

GitLab
Description: Cloud-based CI platform for development teams handling a diverse toolchain.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Multi-platform and multi-language, with parallel execution of jobs across multiple machines for faster execution. Can test locally and deploy to the production environment. Docker and Kubernetes support. Autoscaling of resources according to computational complexity. Integrated with other GitLab products, which help carry out tasks from planning to deployment.
Complexity: MEDIUM. Easy to install and set up; setting up a self-hosted server needs technical knowledge.
Usability: MEDIUM. Requires basic knowledge of versioning and an understanding of git.
Compatibility: HIGH. Can run on any platform where Go binaries are built.
Application: GENERIC. Supports virtually many programming languages and data analytics project types.
License: Opensource + Commercial
Reference: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/

Travis CI
Description: Travis CI helps to build and test projects hosted in Git repositories through the process of continuous integration.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Free for opensource applications. Available for macOS, Linux, and Windows. Supports 30 different programming languages. Provides a build matrix feature to accelerate project execution. Configured by adding a .travis.yml (YAML text file) to the application. Good documentation and an elegant user interface. Supports parallel testing.
Complexity: MEDIUM. Needs GitHub, GitLab, Assembla, or Bitbucket integration. Needs programming knowledge and an understanding of git.
Usability: HIGH. Simple user interface that makes tasks easy to perform.
Compatibility: HIGH. Can be run on any platform (cloud and native).
Application: GENERIC. Can be used with a variety of projects within the supported programming languages.
License: Free + Commercial
Reference: https://travis-ci.com/

Atlassian Bamboo
Description: Atlassian Bamboo is a CI/CD server that assists project development teams in automating build and test code status.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Pre-built software functionality for CI/CD; no plugins or additional setup required to use the tools. Built-in Git branching workflows and deployment projects. Test automation. Enterprise support and resources. Tidy and intuitive user interface.
Complexity: LOW. Easy to install, with enterprise support.
Usability: HIGH. Intuitive, guided user interface.
Compatibility: HIGH. Can be used within any kind of data analysis environment.
Application: GENERIC. Can be used in any kind of data analytics project.
License: Commercial
Reference: https://www.atlassian.com/software/bamboo

Table 9: Data governance tools

Apache Atlas
Description: Atlas is a scalable and extensible set of core foundational governance services in Hadoop, allowing integration with the whole enterprise ecosystem. Provides open metadata management and governance capabilities for cataloging data assets to classify and govern, with collaboration capabilities for data scientists, analysts, and data governance teams.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Pre-defined metadata types and instances for Hadoop and non-Hadoop. Can define new metadata types with inheritance from different types of metadata. REST APIs for easy integration with different types and instances of metadata. Intuitive UI for lineage tracking, with a REST API to update lineage. Easy search and discovery with the UI and a rich REST API and SQL-like query. Security and data masking.
Complexity: HIGH. Installing Apache Atlas involves lots of dependencies and configuration. HBase, Hive, Storm, or Kafka are used for metadata storage and management, and a separate index store is required. Metadata ingestion is done either as Kafka topics or through the API in Atlas.
Usability: MEDIUM. Can monitor using an intuitive UI, but setting up metadata types and instances needs knowledge of REST APIs and programming.
Compatibility: MEDIUM. Different systems and services can be accessed through the API, and metadata hosted in Hadoop can be managed. Nevertheless, Atlas requires the Hadoop engine to run the system, with a set of interdependent services.
Application: GENERIC. Business metadata and data lineage management for Hadoop- and non-Hadoop-hosted applications.
License: Opensource
Reference: https://atlas.apache.org/

Talend
Description: With Talend, data discovery, sharing, and federation are easier. Uses simple tools to automate data processes, team collaboration, and data quality management. Talend simplifies data quality and security with built-in functionality to ensure insights are trusted, governed, and actionable.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Indexes and tracks data input and output in a cloud environment with detailed histories of datasets. Data collection in the cloud for a central and automated data hub. Cleansing and standardizing data to improve data quality. Data catalog management.
Complexity: MEDIUM. The opensource model can be installed in an IDE with easy installation, and the commercial license provides full support.
Usability: MEDIUM. Non-technical users can use the system through the UI for monitoring and reporting, but the governance policy setup needs technical expertise.
Compatibility: MEDIUM. Supports many data storage services and data operation platforms through connectors.
Application: SPECIFIC. Financial services, telecommunications and retail, and the consumer industry.
License: Opensource + Commercial
Reference: https://www.talend.com/products/data-integrity-governance/

Collibra
Description: Collibra governance accelerates digital transformation by promising all data citizens the ability to understand and find meaning in data. Enables trust in data while growing with a solid governance foundation.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Creates a shared language for datasets to ensure consistency. Automates governance and stewardship tasks with provision for scalability. Collaboration with different data citizens. Centralized data documentation. End-to-end visualization for data lineage tracking. API access for integration with data storage.
Complexity: LOW. Commercial support for installation and implementation.
Usability: LOW. Tasks can be performed through an interactive UI.
Compatibility: LOW. No information regarding compatible supported data types, platforms, and projects.
Application: SPECIFIC. Used for business intelligence purposes; no information about industry support and generic use cases.
License: Commercial
Reference: https://www.collibra.com/data-governance

IBM
Description: Data governance solutions from IBM are vital to DataOps to ensure the data pipeline is ready to help organizations catalog, protect, and govern sensitive data, trace data lineage, and manage data lakes. A data governance platform with an integrated data catalog can help enterprises find, curate, analyze, prepare, and share data while keeping it governed and protected against misuse.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Multiple entry points to adapt and change the data governance strategy according to changed business objectives. Machine-learning-powered data catalogs for automating metadata creation and knowledge sharing. Compliance with security and privacy law to reduce data access risk.
Complexity: MEDIUM. It can only be used on the IBM platform with the inbuilt installation procedure.
Usability: HIGH. Easy for non-technical users to create and run data governance tasks through the UI.
Compatibility: MEDIUM. Can be integrated with several IBM products through APIs; however, it performs best on the IBM DataOps platform.
Application: GENERIC. Generic solution for all data analytics projects.
License: Commercial
Reference: https://www.ibm.com/se-en/analytics/data-governance

OvalEdge
Description: OvalEdge is a data governance and data catalog toolset, widely used for data discovery, data governance, and compliance management.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Business glossary. Workflow for data access. Collaboration with team members. Automated data lineage. Self-service data analytics. Windows, Unix, cloud, on-premises, web, and SaaS-based product.
Complexity: LOW. Commercial support for installation and implementation.
Usability: HIGH. Machine-learning-based advanced algorithms to organize and manage data, with a simple user interface to execute tasks. Alerts and collaboration with customers.
Compatibility: HIGH. Supports all data management platforms, from relational databases, data warehouses, object storage, and cloud platforms to non-relational databases and Hadoop distributions.
Application: GENERIC. Applicable to all types of data pipelines.
License: Commercial
Reference: https://ovaledge.com/

Table 10: Code, artifact, and data versioning tools

GitLab
Description: GitLab is a single application and DevOps platform for the entire software development cycle. Source code management in GitLab helps the development team collaborate and increase productivity.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Code versioning
Features: Enables code reviews, version control, feedback, and branching with a Git-based repository. Has built-in CI/CD to streamline testing and delivery. Automatically checks code quality and security with every commit. Review, track, and approve changes in code before merging. The Web IDE helps deploy on any platform.
License: Free + Commercial
Reference: https://about.gitlab.com/stages-devops-lifecycle/source-code-management/

GitHub
Description: GitHub is a Git repository hosting service with a graphical web-based UI that provides access control, collaboration, and task management services on top of versioning.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Code versioning
Features: Collaborative coding with features and tools like Codespaces, pull requests, code review, notifications, team reviewers, team discussions, and code owners. Security with automatic tracking of code changes, code scanning, a dependency graph, and mandatory review before merging. GitHub Desktop, mobile, and CLI for different devices and platforms. Project management functions like milestones, issue tracking, contribution graphs, wikis, and repository insights, taking code as the center of projects. Team administration by simplifying access and permissions across projects and teams.
License: Free + Commercial
Reference: https://github.com/

DVC
Description: Data Version Control (DVC) is an open-source tool for data science and machine learning projects.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Data versioning
Features: DVC runs on top of a Git repository and is compatible with popular Git servers and service providers. DVC offers distributed version control for data with features like local branching and local versioning. Flexible with cloud-based, network-attached, and disc data storage. Guarantees reproducibility by maintaining the relation between input data, configurations, and code for an experiment. Not dependent on language or framework.
License: Opensource
References: https://dvc.org/ and https://github.com/iterative/dvc

DockerHub
Description: Docker Hub is a hosted repository service by Docker for sharing and finding container images.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Artifact versioning (Docker images)
Features: Provides image repositories with versioning to pull and find images for containers. Integration with GitHub and Bitbucket with automated builds to publish an image in Docker Hub. Private and public repositories (paid version). Webhooks to trigger Docker Hub integration actions with other services.
License: Free + Commercial
Reference: https://hub.docker.com/

Additional information: The tools described above are used for all types of projects, so their application is GENERIC. To install and use the software, the necessary knowledge of version control taxonomy and Docker scripts is required, so Complexity is MEDIUM and Usability is MEDIUM. They can be integrated with any tools and environments, so Compatibility is HIGH.

Table 11: Analytics and visualization tools

Tableau
Description: Tableau is a business intelligence and data analytics tool for report generation and data visualization.
Features: Informative and wholesome dashboards. Easy collaboration with other users and sharing of data, visualizations, and dashboards in real time. Share and retrieve data from various sources such as on-premises, hybrid, and cloud. Supports several kinds of data connectors. Can work with live and in-memory data without connectivity issues. Wide range of visualization options available.
Lifecycle: Analyze
Complexity: LOW. Easy to install and integrate with data sources.
Usability: MEDIUM. Needs training and knowledge of the product to run visualizations.
Compatibility: HIGH. Supports many data sources and data types, and can be integrated with live data as well as stored data sets.
License: Commercial
Reference: https://www.tableau.com/

Power BI
Description: Power BI provides a cloud-based as well as a desktop-based data visualization service. It provides interactive visualization and business intelligence over data through an interface for end users to create reports and dashboards.
Features: Data query through natural language. Report sharing and collaboration for report generation among team members. Access and unify data from different data sources. Real-time business intelligence analysis. API access.
Lifecycle: Analyze
Complexity: LOW. The desktop application is easy to install and set up; the cloud-based service is easy to set up.
Usability: MEDIUM. Needs domain-specific knowledge and training for proper use of the software.
Compatibility: MEDIUM. No direct connection to databases, but can fetch data from different data sources like Salesforce, Excel, and many more through API calls. No direct installation on Linux or other operating systems.
License: Commercial
Reference: https://powerbi.microsoft.com/

QlikView
Description: QlikView is analytics software for developing interactive guided analytics applications and dashboards. QlikSense is an improved version of QlikView that can collect data from multiple sources and combine them for visualization.
Features: Data association represented through color rather than lines and objects. Uses in-memory technology for instant associative search and real-time analysis. Data relations are generated automatically using AI-based utilities, so business users or application users do not have to create relationship rules manually.
Lifecycle: Analyze
Complexity: LOW. Easy to install the desktop application on Windows servers and OS; no support on other systems.
Usability: MEDIUM. Needs training and knowledge of the product for better use.
Compatibility: LOW. Does not support installation other than on Windows OS.
License: Commercial
References: https://www.qlik.com/us/products/qlik-sense and https://www.qlik.com/us/products/qlikview

Additional information: The analytics and visualization tools presented above are general-purpose software that can be used in any project and industry, so they can be considered GENERIC applications.

Table 12: Collaboration and communication tools

Slack
Description: Business communication platform which uses a channel-based messaging service. With Slack, people can work together and connect to software tools and services.
Features: Task management. Collaboration through posts. Easy integration with other business tools like Google Drive, Office 365, Trello, Jira, and many more. Audio and video calls. Slack bots for different kinds of utility tools to choose from, according to requirements.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
License: Freemium + user-based pricing
Reference: https://slack.com/

Jira
Description: Jira is a project planning, tracking, and releasing platform which follows the Agile methodology.
Features: Project management with rule-based automation and task relationships. Views and reports on task progress and work portfolios. Task comments and project conversations with a notification service.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
License: Freemium + user-based pricing
Reference: https://www.atlassian.com/software/jira

Trello
Description: Web-based Kanban-style list-making application for task management and tracking.
Features: Ticket-based tasks under projects with rule-based triggers. Calendar commands for deadlines and notifications. Easy integration with other productivity tools like Slack, Evernote, Dropbox, and many more. File attachment feature. Drag and drop to move lists of tasks across different cards.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
License: Freemium + user-based pricing
Reference: https://trello.com/

Additional information: Compatibility, Usability, Complexity, and Application type are not included for the collaboration tools above, since these tools do not need to be installed or incorporated inside the DataOps pipeline. All the tools mentioned above are web-based, easy to use with a good user interface, and can be used in any type of organization for communication and collaboration purposes.

Table 13: Other tools and technologies for DataOps

Containers
Description: A container is a standard unit of software that packages up code and its dependencies to run applications irrespective of the computing environment. Containers virtualize at the operating system level instead of virtualizing hardware stacks. Containers are lightweight applications that provide a consistent environment and the flexibility to run anywhere by providing isolation at the OS level.
Products:
1. Docker: Docker is an open-source platform for developing, shipping, and running applications by enabling the separation of applications from infrastructure to deliver software products quickly. Docker containers are restricted to a single application. Docker is based on the LXC project to build single-application containers.
2. LXC (Linux Containers, https://linuxcontainers.org/): LXC is OS-level virtualization technology for creating and running multiple isolated Linux virtual environments on a single host. It allows isolating applications as well as the entire OS. The goal of LXC is to create an environment as close as possible to a standard Linux installation but without the need for a separate kernel [111].
Use in DataOps: Containers help to automate the processes used by analytics teams to improve quality and reduce analytics cycle time [110]. The data analytics team can deploy multiple analytics tasks in containers in a repeatable manner. In DataOps, when analytics tasks run in containers, they are easier to maintain and share. For example, if a particular source code is needed frequently, it is easier to put the logic in a container that hosts the source code and call the container every time the source code is required. Containers provide an isolated environment, reusable hosting of tools and technology, and reproducible tasks and processes for DataOps projects.

Resource managers
Description: A resource manager in cloud computing or data analytics deals with the management and procurement of resources [112]. With resource manager applications and virtualization techniques, flexible and on-demand resource provisioning can be achieved [113].
Products:
1. Apache Mesos (http://mesos.apache.org/): An open-source project to manage computer clusters by providing resource management and scheduling services to applications like Hadoop, Spark, and Kafka through an API. Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run quickly [114].
2. YARN (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html): YARN is a distributed and large-scale operating system for big data, initially designed for cluster management in Hadoop. YARN decouples Hadoop MapReduce's resource management and scheduling capabilities from the data processing component.
3. Kubernetes (https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/): Kubernetes provides a framework to run distributed systems resiliently with automatic resource scaling, load balancing, bin packing, and self-healing [115]. Kubernetes automates Linux container operations in a distributed manner.
Use in DataOps: By using resource manager tools, DataOps can achieve automatic scaling and resource provisioning. Especially for big data analytics projects, scaling resources up and down is very important. Using Kubernetes, containers can be deployed in multiple clusters as per project need. Similarly, YARN and Mesos provide resource provisioning and scheduling to prevent manual allocation of resources.

Data storage services
Description: Data storage is the collection and retention of digital information [116]. Data storage includes storing digital data in temporary and permanent storage. In data analytics, data storage services provide storage of large quantities of data, transport data from one location to another, and secure data and prevent its loss. There are several types of data storage services: software-defined storage (SDS), cloud storage, and network-attached storage for object storage, file storage, and block storage [116]. Here we list some cloud data storage service providers.
Products:
1. AWS cloud storage services (https://aws.amazon.com/products/storage/): AWS provides Amazon S3 (https://aws.amazon.com/s3/), Amazon Glacier (https://aws.amazon.com/glacier/), Amazon EBS (https://aws.amazon.com/ebs/), and Amazon EFS (https://aws.amazon.com/efs/) for object storage, block storage, and simple shared file storage at a variety of price ranges.
2. Google cloud storage (https://cloud.google.com/storage): Google cloud storage provides object storage services with various features and prices. Data storage prices are charged based on usage and duration.
3. IBM cloud storage (https://www.ibm.com/cloud/storage): With IBM cloud storage, data workers can provision and deploy object, block, and file storage with flexible pricing and high data security.
4. Microsoft Azure cloud storage (https://azure.microsoft.com/storage/): Azure cloud storage provides managed disk storage, blob storage (https://azure.microsoft.com/en-us/services/storage/blobs/), and file storage services.
Use in DataOps: In DataOps, data storage is essential to move, store, and archive data for the future. Buying physical storage devices and maintaining and scaling them for a big data analytics project is time and resource consuming. Furthermore, there is always a risk of losing data in the local environment due to system failures and catastrophic incidents. Cloud storage service providers offer flexibility, security, and ready-to-deploy infrastructure for data, with add-on features and technical support.

IDEs and source code editors
Description: An IDE (integrated development environment) is a software application for developing, debugging, compiling, and running code. IDEs provide an integrated service of testing and debugging without deploying code in the environment. An IDE will at least have a source code editor, build automation tools, and a debugger; nowadays, IDEs offer advanced features that help programmers ease their tasks. Code editors do not have the full functionality of an IDE but provide a handful of features for editing code; they can be standalone applications or integrated with an IDE. We have listed some alternatives of code editors and IDEs without description. We have put Jupyter Notebook in a separate group from code editors and IDEs because it is a web-based application with sharing and collaboration features that traditional IDEs and code editors lack.
Products:
IDEs: 1. Microsoft Visual Studio (https://visualstudio.microsoft.com/) 2. IntelliJ IDEA (https://www.jetbrains.com/idea/) 3. XCode (https://developer.apple.com/xcode/) 4. Eclipse (https://www.eclipse.org/)
Code editors: 1. Sublime Text (https://www.sublimetext.com/) 2. Notepad++ (https://notepad-plus-plus.org/) 3. Visual Studio Code (https://code.visualstudio.com/)
Jupyter Notebook (https://jupyter.org/): An open-source web application for the creation and sharing of documents that contain live code, equations, visualizations, and narrative text. Nowadays, Jupyter Notebook is widely used in data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. Especially for data professionals, Jupyter Notebook is becoming handy to use.
Use in DataOps: Converting data analytics algorithms into code and making it executable is a challenging task, and there will be several trial-and-error processes to test code functionality and performance. Using IDEs and code editors, coders and data professionals can minimize errors while writing code through inbuilt syntax error tracking and suggestions, and with an IDE they can debug the code instantly before sending it for integration or deployment.

Cloud servers
Description: A cloud server is a virtual server running in a cloud computing environment that can function and run as an independent unit like the local environment. It can be accessed remotely via an internet connection. Nowadays, cloud servers are gaining popularity due to the level of support they provide for elastic computation requirements. Big cloud service providers offer inbuilt application support as well as the flexibility to host custom applications.
Products:
1. AWS (https://aws.amazon.com/): Amazon Web Services (AWS) offers reliable, scalable, and inexpensive cloud computing services ranging from data analytics and blockchain to edge computing and many more. AWS covers a wide array of products and custom-built ready-to-deploy applications, along with the facility to deploy customized applications.
2. Google Cloud (https://cloud.google.com/): Google Cloud offers Compute Engine, cloud storage, Google Kubernetes Engine, Cloud SQL, BigQuery, and Dataflow, which are suitable for data analytics projects.
3. Microsoft Azure (https://azure.microsoft.com/): The Azure cloud platform offers more than 200 products and cloud services for different requirements. Products like Azure Data Factory, Azure Databricks, Azure DevOps, and Power BI are some of the cloud services that are helpful in data analytics projects.
Use in DataOps: Cloud services provide the flexibility and reliability for DataOps practitioners to run their data pipelines. With the help of the cloud service provider's services, the risky cycle of creating services ends in exchange for a subscription payment. For example, running a data analytics project to extract information from 1000 TB of data needs a powerful server and lots of expertise. Nevertheless, if we choose to deploy our analytics project in the cloud, we can easily do so without worrying about maintaining the server. Plus, there are thousands of inbuilt services from different cloud providers to reduce workload.

Big data processing and analytics frameworks
Description: Big data requires tools that store, organize, access, analyze, and process it. Big data processing and analytics frameworks provide a solution to store, process, and analyze big data on a larger scale. There are several big data processing frameworks with different features and different underlying architectures, but the primary purpose is to support big data processing at their core. Several review papers are available that compare popular big data processing frameworks in terms of features and performance [117–119]. Big data frameworks are primarily categorized into three main categories based on the state of data they are designed to handle: batch, stream, and hybrid [120].
Products:
1. Apache Hadoop (https://hadoop.apache.org/): An open-source software programming framework for distributed processing of big data in batches across a set of multiple clusters using MapReduce. The core of Hadoop consists of HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) and Ozone (https://hadoop.apache.org/ozone/) for storage; the processing part is MapReduce and the resource management part is YARN, along with Hadoop Common, which contains libraries and utilities needed by other Hadoop modules.
2. Apache Spark (https://spark.apache.org/): Apache Spark is a batch processing framework with the capability to support stream processing of data. It uses in-memory computation to store and process intermediate results. Spark operation is based on distributed data structures called Resilient Distributed Datasets (RDD). Applications in Spark can be written in Scala, Python, R, and SQL. Spark provides libraries like SQL and DataFrames, MLlib (machine learning), GraphX, and Spark Streaming for different tasks. These libraries can be used in the same application for robust data processing tasks. Spark can run on Hadoop, Mesos, Kubernetes, and as a standalone application in the cloud and locally.
3. Apache Flink (https://flink.apache.org/): Flink is an open-source framework for distributed and batch processing of data. With a streaming API for Java and Scala, a static data API for Java, Scala, and Python, and an SQL-like query API for embedding in Scala and Java code, Flink supports rich languages for data analysis and processing in addition to its own machine learning and graph processing libraries.
Use in DataOps: Big data processing and analytics frameworks provide scalability, fault tolerance, and distributed computation services along with the core functionalities of processing and analyzing data. In DataOps, the essence of these frameworks comes in when dealing with a large volume of data and performing different processing tasks together. These frameworks provide a holistic solution from storing data to processing and analyzing it with memory and CPU management.

Additional Information:
• Table 13 above presents some essential features of popular products in the given categories. However, there are several alternatives and niche products, so it is recommended to do a thorough analysis of requirements and product features before using them in DataOps projects.
• We have described the general usability of the product categories in DataOps, and it is up to DataOps teams whether to use a product or not based on their requirements and goals.

4.1.4. Discussion

DataOps takes advantage of existing tools and technologies to accomplish tasks. There are hundreds of tools available on the market with similar features and functionalities, and choosing the right ones from this bucket requires informed decisions by project owners in consultation with team members. Team members need to streamline their tasks and the tools they require through research, weighing the pros and cons of each product and why it serves them best. DataOps with suitable tools can cover all stages of the data lifecycle. From data collection, through processing and analysis, to publishing, all stages of data movement are covered by combinations of tools and technologies working together to serve the tasks inside the data pipeline. In Section 4.1.3.3, we divided tools and technologies into several categories and listed some popular tools with their features and the data lifecycle stages they support. Most of the tools listed above are used in all stages of the data lifecycle; it is not about individual tools, but about using the functionality of tools across the lifecycle. For example, workflow orchestration tools can orchestrate the whole data pipeline from collection to deletion of the data. Furthermore, tools for testing and monitoring, deployment automation, collaboration and communication, data storage, and IDEs are required at every step of the lifecycle. The usage of tools directly reflects the level of automation we want to achieve in data pipeline operation. With the right tools, manual tasks are eliminated and automated. Tools and technologies are core components of the data pipeline, but without people's involvement in analysis tasks, technology cannot accomplish the fundamental principle of DataOps.

Figure 17: DataOps ecosystem

Figure 17 illustrates the DataOps ecosystem, where various categories of tools are aligned with people to match the process of converting input from a source into insights as output through a series of data lifecycle movements in between. Depending on the project goal and the level of automation, tools and technologies from the stacks are chosen. It is not always necessary to apply all the tool categories listed above. DataOps' primary objective is to deliver quality results in improved time and at low cost. If that can be fulfilled by using one or a few tools from the list above, then we can deliver the project with those tools. DataOps is also about continuous improvement, so people working on the project should never give up on experimenting with new technologies and delivering better project results.

Challenges in DataOps Implementation

DataOps brings benefits as well as challenges. For an organization to succeed with DataOps, it needs to consider potential issues and be prepared to overcome them. Some of the issues that need to be taken care of while implementing DataOps practices are listed below.

1. Changing the organization's culture
DataOps is all about delivering analytics results faster, and the only way to make that happen is to encourage communication and collaboration across all departments. Data scientists, data engineers, managers, data analysts, system architects, system developers, customers, and other data stakeholders all need to be willing to come together to break the status quo. DataOps can be a significant change, and for its success everyone needs to be on board, including top executives, IT and business managers, data workers, and everyone else involved in the data analytics project, to identify and use the right tools.

2. Innovation with low risk
DataOps advocates continuous improvement in the product and the cycle time, which means less time for development, testing, and deployment to a production environment. Teams need to move quickly without compromising quality. Not only quality but also compliance with company policies and standards is required without affecting the cycle time. Automation gives extra room by reducing the manual work of testing, monitoring, and deployment. With automation in the deployment cycle, there is little time for reviews, which increases the risk of missing details and pieces of information. So, initially, it will take time to gain total confidence in ensuring data and process quality.

3. Cost of DataOps
The initial cost of introducing new tools and technology, training employees, and moving away from the old system can be substantial, and it is easy to get discouraged at the beginning when there are no immediate benefits to realize. Nevertheless, in the long run, DataOps will pay off by reducing cycle time and standardizing the analytics product and process quality.

4. Transition from expertise-based teams to cross-functional teams
DataOps succeeds with cross-functional team collaboration and communication. Creating integrated data analytics teams brings employees together from different departments and with varied expertise to solve a specific problem. Nevertheless, the challenge of structural change is enormous. One should think of a way to include all related and required members in the team with proper authority and responsibilities. There should always be a trust-based environment among team members and between analytics teams, management, and customers.

5. Managing multiple environments
DataOps with multiple environments provides freedom for innovation and improvement but also creates the necessity of properly managing those environments. Without an appropriate system management plan, it can quickly get out of hand and strain performance and cost instead of delivering benefits.

6. Sharing knowledge
Tribal knowledge creates a big problem, and DataOps can make it even worse: new tools and technologies, changes in processes, and the execution of data analytics projects on different platforms than before. Without useful documentation or the creation of a knowledge base, teamwork can be a challenging task to accomplish.

7. Tools and technology diversity
In DataOps, several tools and technologies are used to accomplish the required tasks. This brings the challenge of maintaining and matching the performance of tools both individually and collectively. One tool should not impact or restrict the performance of others, so careful selection of tools is always emphasized.

8. Security and quality
With multiple environments and team players in a project, security and quality are crucial to maintain. Data privacy, system security, the quality of data, code, and insights, and the authority of data workers and stakeholders should be well described and implemented in DataOps from the beginning. Otherwise, they will be hard to enforce once things get out of hand.

4.2. Experiment Evaluation

This section evaluates our experiment in terms of the execution time of data pipelines and the implementation approach we followed from the DataOps implementation guidelines.

4.2.1. Evaluation of Pipeline using DataOps Implementation Guidelines

We executed the same data analytics project with two approaches. Experiment 1 is the manual execution of the data pipeline tasks, whereas Experiment 2 automates the data pipeline using a workflow orchestration tool. We compare our implementation approach with the DataOps implementation guidelines presented in Section 2.4.3.1.

Set up DataOps culture: The comparison is not applicable, since the experiments we performed are not part of an organization or team; a single person is responsible for handling the overall project.

Automate and orchestrate: In Experiment 1, there are no automation or orchestration tools; all four steps of the data analytics pipeline are executed manually. Experiment 2, in contrast, converted all steps into a DAG on Apache Airflow, and all tasks inside the DAG are executed automatically.

Use of version control: Both experiments make use of the code provided in the TBFY GitHub account. To set up a local project environment, some code changes were made by forking the original GitHub repository. For data, we did not use data versioning tools, but a separate output folder was created to store each task's output.

Use multiple environments: In Experiment 1, no multiple environments were used; the project was performed without a test environment, and all tasks were executed in a single production environment. For Experiment 2, separate testing and production environments were used. The data pipeline was run in the test environment using partial data before running the full pipeline in the production environment.

Reuse and containerize: In Experiment 2, all tools and technologies are hosted as Docker containers, and the data pipeline itself also runs inside Docker using Apache Airflow. In Experiment 1, only Apache Jena Fuseki is hosted inside Docker, and the rest of the required services are installed in the Linux environment. Both experiments reuse the pipeline code and the Docker images of the Fuseki services from GitHub. For Experiment 2, we use the Apache Airflow Docker image from the Puckel project.

Test and test: Experiment 2 was executed in the production environment after being successfully tested in the test environment. Tests were performed for the number of input files, the number of output files, the required output file format, and the availability of N-Triples in the Fuseki server corresponding to the .ntt files available in the local environment. To track the number of files, we used the Airflow inbuilt log generator and a script from the pipeline code to count the number of output and N-Triples files. In Experiment 1, no tests were performed. We directly ran the bash scripts in the Linux environment and checked whether a task had completed by looking at the output file 'statement.txt' after completion of tasks 1, 2, and 3. For the 4th task, we used the statement.txt of the third task and the statement_publish.txt created during the execution of the fourth task. statement.txt counts the number of files processed, with the total execution time for each data folder (recorded daily), and statement_publish.txt counts the number of N-Triples published in RDF format after the execution of the 4th task inside the daily folder.

Continuous integration and deployment: Even though Experiment 2 does not use any specific continuous integration and deployment tools, we set up a common data and code repository folder for the testing and production environments. With this, we eliminated the need to move code and data to the production environment manually after testing is accomplished. In Experiment 1, however, there is no concept of continuous integration and deployment, since there are no multiple environments.

Continuous monitor: Experiment 1 has no support for continuous monitoring; however, with the help of the statement.txt and statement_publish.txt files, task execution success was examined at the end. Moreover, no runtime monitoring of the tasks was done, and there was no system monitoring during task execution; we used the inbuilt Linux task monitoring service to track CPU and memory usage. In contrast, in Experiment 2, Airflow provides a useful web interface to see which task is currently being performed, with detailed logs. Also, an email alert was created using Apache Airflow's inbuilt email service for the completion and failure of tasks (a small configuration sketch is given below).

Communicate and collaborate: Automatic emails on task completion and failure provided up-to-date data pipeline status. Tools like Apache Airflow, the email service, Docker, Postgres, and Apache Jena Fuseki were used together to execute the ETL tasks in a single pipeline in Experiment 2. In Experiment 1, by contrast, one has to stay in front of the screen to check the pipeline status; there are no alerts on failure or completion, so one must estimate the completion time and regularly check by eye to monitor the status of the ETL tasks.
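The sketch below illustrates how the email alerting and the output-file count check described above could be wired in Airflow; the alert address, output folder, and expected file count are assumptions for this example and not the exact configuration of Experiment 2.

    # Sketch of the alerting and file-count check described above (illustrative
    # values): Airflow's built-in email handling reports task failures, and a
    # small Python task fails loudly when expected output files are missing.
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import

    default_args = {
        "owner": "dataops",
        "start_date": datetime(2020, 1, 1),
        "email": ["pipeline-alerts@example.org"],  # assumed alert address
        "email_on_failure": True,                  # send a mail when a task fails
        "email_on_retry": False,
    }

    def check_output_count(output_dir="/data/output", expected=10):
        """Raise (and thereby trigger a failure alert) if output files are missing."""
        produced = len(os.listdir(output_dir))
        if produced < expected:
            raise ValueError("Expected %d output files, found %d" % (expected, produced))

    with DAG(
        dag_id="pipeline_quality_checks",
        default_args=default_args,
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(task_id="check_output_count", python_callable=check_output_count)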


Table 14: Summary of the DataOps implementation guidelines covered by the experiments

Implementation steps | Experiment 1 | Experiment 2
Set DataOps culture | Not applicable | Not applicable
Automate and Orchestrate | Not satisfied | Satisfied
Use version control | Satisfied | Satisfied
Use multiple environments | Not satisfied | Satisfied
Reuse and containerize | Partially satisfied | Satisfied
Test and test | Not satisfied | Satisfied
Continuous integration and deployment | Not satisfied | Partially satisfied
Continuous monitor | Partially satisfied | Satisfied
Communicate and collaborate | Not satisfied | Satisfied

In Table 14, a summary of the evaluation of the experiments with respect to the DataOps implementation guidelines is presented. Experiment 2 satisfies seven implementation guidelines and partially satisfies one, whereas Experiment 1 stays below a satisfactory level in terms of following the DataOps implementation guidelines. From this evaluation, it can be concluded that Experiment 2 follows the DataOps principles to execute the data pipeline.

4.2.2. Runtime Evaluation

Our experiment compares the two implementation approaches in terms of project execution time and end-to-end delivery time. It is beneficial to investigate the performance of the DataOps implementation approach in data analytics projects against the manual execution of tasks. Details of the project implementation are described in Section 3.2. We designed the experiment to track each step's execution time and calculated the total execution time of each experiment. The total execution time does not include the buffer time between step executions in Experiment 1, whereas in Experiment 2 there is no buffer time since we used the DAG-based workflow orchestrator to automate execution. In the stepwise execution, Experiment 1 executed the first two tasks (Enrich JSON data and Convert JSON to XML) slightly faster than Experiment 2 because, in Experiment 1, the data are stored in the same system (Linux) where the service is running, so data retrieval and storage were faster to perform. In Experiment 2, data are stored outside the Docker container (in Linux), and reading and writing data from a Docker container has comparatively higher latency and lower throughput. The third (Map XML to RDF) and fourth (Publish RDF to Database) tasks of Experiment 2 outperform Experiment 1. In Experiment 2, the execution time of task 3 is more than 2 hours shorter than in Experiment 1. This is due to the containerized application: containers run with isolated resources allocated to the containers and the application, so the application inside the Docker containers uses optimal resources to perform the task. In Experiment 1, however, the performance of a task is affected by resources shared with other services and applications in the Linux environment. Task 4 depends on network quality and database (Apache Jena Fuseki) performance in both experiments; N-Triples RDF data are published to the database using the HTTP protocol. In Experiment 2, data are published 8 minutes faster than in Experiment 1. Overall, Experiment 2 delivers its result 2 hours faster than Experiment 1. In terms of end-to-end delivery time, however, Experiment 1 lags far behind Experiment 2. If we compare Figure 18 and Figure 19, in Figure 18 we can see the gap between the end point of the previous task and the start point of the next task. This gap represents the buffer time before a task is manually run after completion of the previous one. If we add the buffer time to the execution time, then the project delivery time increases by the total buffer time. In Experiment 1, the buffer time can range from minutes to several hours or days, so the delivery cycle goes well beyond the tasks' execution time. In Experiment 2, due to task execution automation, there is zero buffer time, which means the overall project delivery time is equal to the project execution time.
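As a simple illustration of this relationship (the notation below is introduced here and does not appear elsewhere in the thesis), let t_i be the execution time of task i and b_i the manual buffer time before it; the end-to-end delivery time is then

    T_{\text{delivery}} = \sum_{i=1}^{n} t_i + \sum_{i=1}^{n} b_i .

In Experiment 2 every b_i is zero, so the delivery time collapses to the pipeline execution time, whereas in Experiment 1 the b_i terms can dominate once buffers stretch to hours or days.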

Figure 18: Data analytics tasks execution time without following DataOps implementation guidelines


Figure 19: Data pipeline execution time following DataOps implementation guidelines

4.2.3. Discussion

With the DataOps implementation guidelines, we could reduce the product delivery cycle time and the pipeline execution time by several hours. Using the DataOps approach, automation of the data pipeline, testing and monitoring, data change tracking, system performance tracking, and communication and collaboration were easier. Through observation of the experiments and the experience of accomplishing the tasks, DataOps gave a significant advantage in analysis task execution by reusing, automating, orchestrating, and containerizing tools, technology, and people for the data lifecycle management process. In addition to the technical benefits, with DataOps, people's involvement in unproductive tasks such as manually monitoring system status, running tasks manually, and moving data and code from one system to another has been reduced to a minimum.

Chapter 5. Conclusions and Future Work

The primary motivation of this thesis work is to carry out a systematic search of research on DataOps. The main tasks include defining DataOps, exploring the data lifecycle supported by DataOps, creating a feature-based comparison matrix of DataOps-supported tools, and executing a data pipeline using the DataOps implementation guide. A comparison against a data pipeline executed without the DataOps implementation guide is made using two experiments. The rest of this chapter describes the findings and conclusions of the thesis work, and the final part presents future work.

5.1. Conclusions

In this thesis work, we performed a systematic literature search to establish DataOps as a methodology to implement data pipelines. We studied different data lifecycle definition approaches and created a data lifecycle by adopting best practices from the studied data lifecycle references. We also investigated the principles and practices of DataOps, the role of data governance and lineage in DataOps, and the implementation approaches proposed by DataOps practitioners. Based on this investigation, we developed an implementation guide for executing an existing data analytics project for comparison. We analyzed the features of different tools for DataOps practice and explored their support in different stages of the lifecycle. The summary of the results and conclusions of the thesis work in terms of the goals and research questions stated in Section 1 is as follows:

Goal 1: Provide a definition for DataOps and point out the ambiguities in the field of DataOps.
To define DataOps, we analyzed existing definitions and found that researchers' definitions can be categorized into three main approaches: task-oriented, goal-oriented, and process- and team-oriented. Furthermore, we came up with a new definition in Section 4.1.1, which combines all three approaches. The definition we provide considers the collaborative involvement of people, processes, and technology to automate, manage, and track data and data analytics projects for better outcomes. Since DataOps is an evolving field, there are some ambiguities and misconceptions that need to be addressed. Misconceptions related to the cost of practicing DataOps, the application area of DataOps, the difference between DataOps and the data pipeline, and its relation to DevOps are clarified in Section 4.1.1.

Goal 2: Propose a generic DataOps pipeline that supports the data lifecycle.
After defining DataOps and addressing ambiguities, we proposed a generic DataOps pipeline to support the data lifecycle by differentiating it from the traditional data pipeline and showing the advantage it can have over traditional data pipelines in terms of data lifecycle management. We restructured the existing data pipeline using DevOps, Agile, and SPC methodologies, together with additions to the skillset of people, a reform of organizational culture, and a degree of task automation, to name it the DataOps pipeline. The DataOps pipeline we created is based on the DataOps principles (see Section 2.4.2), follows the DataOps implementation guide (see Section 2.4.3.1) for the development and execution of analytics projects, and supports all stages of the data lifecycle presented in Section 2.1.5. While restructuring an existing pipeline or developing a new one, there are challenges, mentioned in Section 4.1.4, which need to be addressed and taken into consideration before starting a DataOps implementation.

Goal 3: Prepare a comparative feature analysis of the tools and technologies that support DataOps.
We prepared comparison tables of DataOps-supported tools in Section 4.1.3.3 based on the criteria presented in Section 4.1.3.2. For ease of comparison of the tools and their functionalities, we divided them into several categories (see Section 4.1.3.1). The tools presented in the comparison matrix were selected based on their popularity during the literature review. Tools are evaluated to show their level of support for DataOps, their ease of installation and operation, their compatibility with other tools to execute a whole analytics project, and the stages of the data lifecycle they are usually used for. However, in some categories, we have not used all comparison criteria due to the nature of the tools and the similar traits they share with respect to the discarded criteria.

Goal 4: Explore which parts of the data lifecycle are currently supported and where we need potential enhancement.
With the help of the comparison tables, we evaluated the tools' applicability in different stages of the lifecycle and derived the current support status of each category of tools in the different stages. Tools that do not have a separate column indicating which stage of the data lifecycle they support are applicable in all stages of the data lifecycle. Our findings regarding tool support show that every stage of the data lifecycle is well supported by a set of tools in different categories, but there is a need for careful selection of tools based on the task we need to perform.

Goal 5: Implement a data pipeline and evaluate it using the DataOps implementation process.
We implemented an existing ETL pipeline using two different approaches to evaluate the advantage of the DataOps approach over manual ETL task execution. Experiment 1 was performed without following any DataOps principles or implementation guide. In contrast, Experiment 2 was designed to reflect the DataOps principles and implementation guide wherever applicable. The observation of the experiments and the results of the execution showed the level of improvement DataOps provided in the project's efficiency in terms of pipeline execution and product delivery time.

Based on the above analysis, it can be concluded that we have achieved all our research goals with good results.

RQ 1: What is currently understood by DataOps? What are the ambiguities in the understanding of the concept?

"DataOps can be defined as a data pipeline development and execution methodology that assembles people and technology to deliver better results in a shorter time. With DataOps, people, processes, and technology are orchestrated with a degree of automation to streamline data flow from one stage of the data lifecycle to another. With Agile, DevOps, and SPC's best practices, technologies, and processes, DataOps promotes data governance, continuous testing and monitoring, optimization of the analysis process, communication, collaboration, and continuous improvement."

The definition presented here reflects the current understanding of DataOps. DataOps ambiguities lie in the method, scope, and cost of implementation. Regarding the method, the misunderstanding prevails that DataOps is all about automating using tools and technology, which is not valid, since every data pipeline project starts and ends with people. There is always human involvement in tasks such as requirement gathering, coding, researching tools, project planning, operating tools, monitoring, and many more. With DataOps, human involvement shifts towards innovation and development rather than repetitive manual production tasks (which can be automated using DataOps tools). The definition we presented in Listing 1 also clarifies that DataOps is not just DevOps for data. In terms of scope, people treat DataOps as separate from the data pipeline and take up DataOps practices for data analytics projects only. In fact, DataOps is a restructured version of the data pipeline with the motive to deliver better and faster results, which are hard to achieve using traditional pipelines. Moreover, DataOps is not just about the analytics job; it covers the whole organization and its customers in a communication and collaboration loop to accomplish its goals. Regarding the cost of implementation, organizations consider DataOps practices expensive to perform. Although acquiring new technology and replacing the old is an extra cost, if we consider the value created by the new practices, the cost is worth bearing. After all, data analytics projects will cost the organization whether they follow DataOps or not. Furthermore, in DataOps, cost control is a priority task along with the quality of output. Moreover, while considering which tools to use, there are many open-source and free tools available, which is always worth taking into account.

RQ 2: Which tools implement DataOps, and what functionalities do they offer?

The tools presented in the comparison tables are some representative tools that support DataOps. There are hundreds of tools with the same functionalities to choose from. We have presented the comparison matrix as a starting point for research when choosing a tool for a given functionality. We categorized the tools as workflow orchestration tools, testing and monitoring tools, deployment automation tools, data governance tools, code, artifact, and data versioning tools, analytics and visualization tools, and communication and collaboration tools, along with containers, resource managers, data storage, IDEs and source code editors, cloud servers, and big data processing and analytics frameworks. These categorizations are based on the functionalities the tools offer in DataOps to complete big and small data analytics projects. It is not necessary to use all the functionalities in a data pipeline; it purely depends on the scope and nature of the project. Careful planning and thorough research are required before deciding to use a specific tool to obtain a certain functionality.

RQ 3: Which parts of the data lifecycle are currently supported well, and where is the need for potential enhancement?

DataOps tools and technologies can be used in data lifecycle stages based on the functionalities they provide. Some tools are particular to certain lifecycle stages: analysis and visualization tools are used in the analysis stage to provide insights and publish results, and cloud storage is used in the storage and data collection stages of the lifecycle. In comparison, some tools, like communication and collaboration tools, are independent of the data lifecycle. Deployment automation tools, data governance tools, code, artifact, and data versioning tools, containers, resource managers, IDEs and source code editors, and cloud servers are used in all stages (collection/creation, processing, analysis, publishing, and storage) of the data lifecycle. The lifecycle stage support of workflow orchestration tools and testing and monitoring tools depends on the tools' features: some can provide support across the whole pipeline process, whereas others are specific to certain tasks and frameworks. Furthermore, big data processing and analysis frameworks provide a complete solution even though their focus is more on data processing and analysis. The potential enhancement needed is in the design and development aspects of the pipeline, with consideration of using tools to increase performance and quality while reducing the cost of operation during pipeline execution. That is the reason why DataOps advocates separating the production environment from development and using multiple development environments. We can test the performance of different tools and technologies in separate and multiple environments and take the best to the production environment. Besides, tools should support future changes (changes in data pipeline steps, data and data structures, code, other tools and their configurations, and the execution environment) and have scope for scalability.

5.2. Future Work

DataOps is an emerging concept, and there is little experience with it and numerous challenges. We have done this thesis work to explore the existing concepts in DataOps and create a starting point for future work. The next stage could be experimenting with the DataOps approach in different data analytics projects and validating the performance of DataOps and DataOps-supported tools. We have created a feature-wise comparison of different tools based on their functionalities. Another step would be to implement different alternative tools with the same functionalities and test their performance in different industry use cases. Furthermore, a compatibility rating (based on combined performance when used together in a data analytics task) of one tool from one functional group against tools from other functional groups would help DataOps practitioners make informed decisions. Big data pipelines differ from data pipelines in terms of computing in heterogeneous distributed environments with large data processing requirements. Most of the literature and research projects on DataOps concern specific and medium-scale data analytics projects. An important future direction is therefore to evaluate DataOps as a scalable and reliable solution for big data processing pipelines.

Bibliography

[1] A. Katal, M. Wazid, R.H. Goudar, Big data: Issues, challenges, tools and Good practices, 2013 6th Int. Conf. Contemp. Comput. IC3 2013. (2013) 404–409. https://doi.org/10.1109/IC3.2013.6612229. [2] I.A.T. Hashem, I. Yaqoob, N.B. Anuar, S. Mokhtar, A. Gani, S. Ullah Khan, The rise of “big data” on cloud computing: Review and open research issues, Inf. Syst. 47 (2015) 98–115. https://doi.org/10.1016/j.is.2014.07.006. [3] S. Sagiroglu, D. Sinanc, Big data: A review, in: Proc. 2013 Int. Conf. Collab. Technol. Syst. CTS 2013, 2013: pp. 42–47. https://doi.org/10.1109/CTS.2013.6567202. [4] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics, Int. J. Inf. Manage. 35 (2015) 137–144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007. [5] H. Baars, J. Ereth, From data warehouses to analytical atoms - The internet of things as a centrifugal force in business intelligence and analytics, in: 24th Eur. Conf. Inf. Syst. ECIS 2016, 2016. https://aisel.aisnet.org/ecis2016_rp/3 (accessed July 25, 2020). [6] S. LaValle, E. Lesser, R. Shockley, M.S. Hopkins, N. Kruschwitz, Big Data, Analytics and the Path from Insights to Value, MIT Sloan Manag. Rev. 52 (2011) 1–18. https://sloanreview.mit.edu/article/big-data- analytics-and-the-path-from-insights-to-value/ (accessed July 25, 2020). [7] K. Singh, R. Wajgi, Data analysis and visualization of sales data, in: IEEE WCTFTR 2016 - Proc. 2016 World Conf. Futur. Trends Res. Innov. Soc. Welf., Institute of Electrical and Electronics Engineers Inc., 2016. https://doi.org/10.1109/STARTUP.2016.7583967. [8] C. Bergh, G. Benghiat, S. Eran, The DataOps Cookbook, second, 2019. [9] M. Loukides, What is DevOps?, O’Reilly Media, Inc, 2012. [10] D.J. Mala, Integrating the Internet of Things into Software Engineering Practices, IGI Global, 2019. [11] W. Eckerson, Best Practices in DataOps How to Create Robust, Automated Data Pipelines, (2019). www.eckerson.com (accessed November 8, 2020). [12] J. Ereth, W. Eckerson, DataOps: Industrializing Data and Analytics Strategies for Streamlining the Delivery of Insights, 2018. www.eckerson.com (accessed October 21, 2020). [13] J. Ereth, DataOps – Towards a Definition, in: LWDA, 2018: pp. 104–112. http://ceur-ws.org/Vol-2191/paper13.pdf. [14] A.K. Gupta, S. Singhal, R.R. Garg, Challenges and issues in data analytics, in: Proc. - 2018 8th Int. Conf. Commun. Syst. Netw. Technol.

88 CSNT 2018, Institute of Electrical and Electronics Engineers Inc., 2018: pp. 144–150. https://doi.org/10.1109/CSNT.2018.8820251. [15] Dr. Nicole Forsgren, Dr. Dustin Smith, Jez Humble, Jessie Frazelle, Accelerate State of DevOps , 2019. https://services.google.com/fh/files/misc/state-of-devops-2019.pdf (accessed November 25, 2020). [16] L.E. Lwakatare, T. Kilamo, T. Karvonen, T. Sauvola, V. Heikkilä, J. Itkonen, P. Kuvaja, T. Mikkonen, M. Oivo, C. Lassenius, DevOps in practice: A multiple case study of five companies, Inf. Softw. Technol. 114 (2019) 217–230. https://doi.org/10.1016/j.infsof.2019.06.010. [17] M. Hüttermann, M. Hüttermann, Beginning DevOps for Developers, in: DevOps Dev., Apress, 2012: pp. 3–13. https://doi.org/10.1007/978-1- 4302-4570-4_1. [18] M. Artac, T. Borovssak, E. Di Nitto, M. Guerriero, D.A. Tamburri, DevOps: Introducing infrastructure-as-code, in: Proc. - 2017 IEEE/ACM 39th Int. Conf. Softw. Eng. Companion, ICSE-C 2017, Institute of Electrical and Electronics Engineers Inc., 2017: pp. 497–498. https://doi.org/10.1109/ICSE-C.2017.162. [19] Z. Zhang, DevOps for Data Science System, KTH Royal Institute of Technology, 2020. https://www.diva- portal.org/smash/record.jsf?pid=diva2%3A1424394&dswid=mainwindow (accessed August 20, 2020). [20] K. Kontostathis, Collecting Data, The DevOps Way, (2017). https://insights.sei.cmu.edu/devops/2017/11/collecting-data-the-devops- way.html (accessed October 5, 2020). [21] S. Ward-Riggs, The Difference Between DevOps and DataOps – Altis Consulting, (n.d.). https://altis.com.au/the-difference-between-devops- and-dataops/ (accessed July 30, 2020). [22] D. Grande, J. Machado, B. Petzold, M. Roth, Reducing data costs without jeopardizing growth, 2020. https://www.mckinsey.com/business- functions/mckinsey-digital/our-insights/reducing-data-costs-without- jeopardizing-growth (accessed September 25, 2020). [23] A. Håkansson, Portal of Research Methods and Methodologies for Research Projects and Degree Projects, in: Proc. Int. Conf. Front. Educ. Comput. Sci. Comput. Eng. FECS’13, CSREA Press U.S.A, 2013: pp. 67– 73. http://www.world-academy-of-science.org/worldcomp13/ws. [24] M.L. Langseth, M.Y. Chang, J. Carlino, J.R. Bellmore, D.D. Birch, J. Bradley, R.S. Bristol, D.D. Buscombe, J.J. Duda, A.L. Everette, T.A. Graves, M.M. Greenwood, D.L. Govoni, H.S. Henkel, V.B. Hutchison, B.K. Jones, T. Kern, J. Lacey, R.M. Lamb, F.L. Lightsom, J.L. Long, R.A. Saleh, S.W. Smith, C.E. Soulard, R.J. Viger, J.A. Warrick, K.E.

89 Wesenberg, D.J. Wieferich, L.A. Winslow, Community for Data Integration 2015 annual report, 2016. https://doi.org/10.3133/ofr20161165. [25] M. El Arass, I. Tikito, N. Souissi, M. El, M. El Arass, I. Tikito, Data lifecycles analysis: towards intelligent cycle Data lifecycles analysis: towards intelligent cycle, n.d. https://hal.archives-ouvertes.fr/hal- 01593851 (accessed October 30, 2020). [26] X. Yu, Q. Wen, A view about cloud data security from data life cycle, in: 2010 Int. Conf. Comput. Intell. Softw. Eng. CiSE 2010, 2010. https://doi.org/10.1109/CISE.2010.5676895. [27] A. Wahaballa, O. Wahballa, M. Abdellatief, H. Xiong, Z. Qin, Toward unified DevOps model, in: Proc. IEEE Int. Conf. Softw. Eng. Serv. Sci. ICSESS, IEEE Computer Society, 2015: pp. 211–214. https://doi.org/10.1109/ICSESS.2015.7339039. [28] IBM, Wrangling big data: Fundamentals of data lifecycle management, IBM Manag. Data Lifecycle. (2013). [29] J.L. Faundeen, T.E. Burley, J.A. Carlino, D.L. Govoni, H.S. Henkel, S.L. Holl, V.B. Hutchison, E. Martín, C.C.L. Ellyn T. Montgomery, S. Tessler, L.S. Zolly, The United States Geological Survey Science Data Lifecycle Model: U.S. Geological Survey Open-File Report 2013–1265, 2013. https://doi.org/http://dx.doi.org/10.3133/ofr20131265. [30] S. Allard, DataONE: Facilitating eScience through Collaboration, J. EScience Librariansh. 1 (2012) 4–17. https://doi.org/10.7191/jeslib.2012.1004. [31] J. Densmore, Data Pipelines Pocket Reference, First, 2020. [32] B. Plale, I. Kouper, The Centrality of Data: Data Lifecycle and Data Pipelines, in: Data Anal. Intell. Transp. Syst., Elsevier Inc., 2017: pp. 91– 111. https://doi.org/10.1016/B978-0-12-809715-1.00004-3. [33] A. Bansal, S. Srivastava, Tools Used in Data Analysis: A Comparative Study, 2018. https://www.analyticsvidhya.com/blog/2014/03/sas-vs- (accessed August 13, 2020). [34] H. Khalajzadeh, M. Abdelrazek, J. Grundy, J. Hosking, Q. He, A Survey of Current End-User Data Analytics Tool Support, in: Proc. - 2018 IEEE Int. Congr. Big Data, BigData Congr. 2018 - Part 2018 IEEE World Congr. Serv., Institute of Electrical and Electronics Engineers Inc., 2018: pp. 41–48. https://doi.org/10.1109/BigDataCongress.2018.00013. [35] Z.A. Al-Sai, R. Abdullah, M.H. Husin, Critical Success Factors for Big Data: A Systematic Literature Review, IEEE Access. 8 (2020) 118940– 118956. https://doi.org/10.1109/ACCESS.2020.3005461. [36] G. Alley, What is a Data Pipeline? | Alooma, (2018). https://www.alooma.com/blog/what-is-a-data-pipeline (accessed May 13,

90 2020). [37] Lenny Liebmann, 3 reasons why DataOps is essential for big data success | IBM Big Data & Analytics Hub, IBM Big Data Anal. HUb. (2014). https://www.ibmbigdatahub.com/blog/3-reasons-why-dataops-essential- big-data-success (accessed June 20, 2020). [38] A. Palmer, From DevOps to DataOps - DataOps Tools Transformation | Tamr, (2015). https://www.tamr.com/blog/from-devops-to-dataops-by- andy-palmer/ (accessed April 3, 2020). [39] Gartner Inc., Definition of DataOps - Gartner Information Technology Glossary, Gartner.Com. (2019). https://www.gartner.com/en/information-technology/glossary/data-ops (accessed May 20, 2020). [40] M. Stonebraker, N. Bates-Haus, L. Cleary, L. Simmons, Getting Data Operations Right, First, O’Reilly Media, Inc., 2018. [41] E. Jarah, What is DataOps? | Platform for the Machine Learning Age | Nexla, Nexla. (n.d.). https://www.nexla.com/define-dataops/ (accessed September 3, 2020). [42] DataOps and the DataOps Manifesto | by ODSC - Open Data Science | Medium, Open Data Sci. (2019). https://medium.com/@ODSC/dataops- and-the-dataops-manifesto-fc6169c02398 (accessed April 4, 2020). [43] The DataOps Manifesto, The DataOps Manifesto, (n.d.). https://www.dataopsmanifesto.org/ (accessed April 4, 2020). [44] DataKitchen, DataOps is NOT just DevOps for data, (2018). https://medium.com/data-ops/dataops-is-not-just-devops-for-data- 6e03083157b7 (accessed August 4, 2020). [45] G. Anadiotis, DataOps: Changing the world one organization at a time | ZDNet, (2017). https://www.zdnet.com/article/dataops-changing-the- world-one-organization-at-a-time/ (accessed April 20, 2020). [46] H. Atwal, Practical DataOps: Delivering Agile Data Science at Scale, Apress, 2020. https://doi.org/10.1007/978-1-4842-5104-1. [47] W. Eckerson, Diving into DataOps: The Underbelly of Modern Data Pipelines, (2018). https://www.eckerson.com/articles/diving-into-dataops- the-underbelly-of-modern-data-pipelines (accessed April 4, 2020). [48] J. Zaino, Get Ready for DataOps - DATAVERSITY, (2019). https://www.dataversity.net/get-ready-for-dataops/ (accessed April 5, 2020). [49] DataKitchen, DataOps in Seven Steps, Medium. (2017). https://medium.com/data-ops/dataops-in-7-steps-f72ff2b37812 (accessed November 4, 2020). [50] D. Wells, DataOps: More Than DevOps for Data Pipelines, Eckerson Gr. (2019). https://www.eckerson.com/articles/dataops-more-than-devops-

91 for-data-pipelines (accessed July 30, 2020). [51] What is DataOps?, DataOpsZone. (n.d.). https://www.dataopszone.com/what-is-dataops/ (accessed October 20, 2020). [52] S. Gibson, Exploring DataOps in the Brave New World of Agile and Cloud Delivery, 2020. [53] Agile Alliance, What is Agile Software Development? | Agile Alliance, Agil. Alliance. (2019). https://www.agilealliance.org/agile101/ (accessed February 6, 2020). [54] Agile Alliance, Agile Manifesto for Software Development | Agile Alliance, (2001). https://www.agilealliance.org/agile101/the-agile- manifesto/ (accessed June 11, 2020). [55] Agile Alliance, 12 Principles Behind the Agile Manifesto - Agile Alliance, Agilealliance.Org. (n.d.). https://www.agilealliance.org/agile101/12- principles-behind-the-agile-manifesto/ (accessed April 24, 2020). [56] DataKitchen, How Software Teams Accelerated Average Release Frequency from 12 Months to Three Weeks, Medium. (2017). https://medium.com/data-ops/how-software-teams-accelerated-average- release-frequency-from-12-months-to-three-weeks-5cb86c2b551e (accessed May 25, 2020). [57] DataKitchen, High-Velocity Data Analytics with DataOps, 2017. www.datakitchen.io (accessed April 4, 2020). [58] M. Kersten, A cambrian explosion of DevOps tools, IEEE Softw. 35 (2018) 14–17. https://doi.org/10.1109/MS.2018.1661330. [59] L. Bass, I. Weber, L. Zhu, DevOps: A Software Architect’s Perspective, 1st ed., Addison-Wesley Professional, 2015. [60] R. Chan, DevOps engineer is the most recruited job on LinkedIn, Bus. Insid. (2018). https://www.businessinsider.com/devops-engineer-most- recruited-job-on-linkedin-2018-11?r=US&IR=T (accessed April 14, 2020). [61] L. Zhu, L. Bass, G. Champlin-Scharff, DevOps and Its Practices, IEEE Softw. 33 (2016) 32–34. https://doi.org/10.1109/MS.2016.81. [62] F. Erfan, DataOps: The New DevOps of Analytics, InsideBIGDATA. (2019). https://insidebigdata.com/2019/03/29/dataops-the-new-devops- of-analytics/ (accessed July 30, 2020). [63] J.K. Liker, Toyota way: 14 management principles from the world’s greatest manufacturer, McGraw-Hill, New York, 2013. [64] DataKitchen, Lean Manufacturing Secrets that You Can Apply to Data Analytics, Medium. (2017). https://medium.com/data-ops/lean- manufacturing-secrets-that-you-can-apply-to-data-analytics- 31d1a319cbf0#.7db9fza6b (accessed November 4, 2020). [65] DataKitchen, The Seven Steps to Implement DataOps, 2017.

92 www.datakitchen.io (accessed April 14, 2020). [66] S. Gawande, Complete DataOps Implementation Guide, ICEDQ. (2019). https://icedq.com/dataops/dataops-implementation-guide (accessed April 5, 2020). [67] S. Gawande, DataOps Implementation Guide, 2019. [68] G. Thomas, The DGI Data Governance Framework, (2006) 1–20. [69] MDM Institute, What is Data Governance?, (2015). http://www.tcdii.com/whatIsDataGovernance.html (accessed July 6, 2020). [70] B.W. Gow, Getting started with data governance, (2008). [71] M. Aisyah, Y. Ruldeviyani, Designing data governance structure based on data management body of knowledge (DMBOK) Framework: A case study on Indonesia deposit insurance corporation (IDIC), in: 2018 Int. Conf. Adv. Comput. Sci. Inf. Syst. ICACSIS 2018, Institute of Electrical and Electronics Engineers Inc., 2019: pp. 307–312. https://doi.org/10.1109/ICACSIS.2018.8618151. [72] R. Karel, The Process Stages of Data Governance, Inform. Corp. (2014). https://blogs.informatica.com/2014/01/02/the-process-stages-of-data- governance/ (accessed July 6, 2020). [73] E. Strod, Continuous Governance with DataGovOps, DataKitchen. (2020). https://blog.datakitchen.io/blog/continuous-governance-with- datagovops (accessed July 6, 2020). [74] K. Belhajjame, P. Missier, C. Goble, Data Provenance in Scientific Workflows, in: Handb. Res. Comput. Grid Technol. Life Sci. Biomed. Healthc., IGI Global, 2009: pp. 46–59. https://doi.org/10.4018/978-1- 60566-374-6.ch003. [75] A. Lohachab, Bootstrapping Urban Planning: Addressing Big Data Issues in Smart Cities, in: R.C. Joshi, B. Gupta (Eds.), Secur. Privacy, Forensics Issues Big Data, IGI Global, 2020: pp. 217–246. https://doi.org/10.4018/978-1-5225-9742-1.ch009. [76] R. Ikeda, J. Widom, Data Lineage: A Survey, 2009. http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf (accessed July 6, 2020). [77] B. Glavic, K. Dittrich, Data provenance: A categorization of existing approaches, Datenbanksysteme Business, Technol. Und Web, BTW 2007 - 12th Fachtagung Des GI-Fachbereichs “Datenbanken Und Informationssysteme” (DBIS), Proc. (2007) 227–241. https://doi.org/10.5167/uzh-24450. [78] Y. Cui, J. Widom, Lineage tracing for general data warehouse transformations, in: VLDB 2001 - Proc. 27th Int. Conf. Very Large Data Bases, 2001: pp. 471–480.

93 [79] What is Data Lineage? - Definition from Techopedia, Techopedia Inc. (n.d.). https://www.techopedia.com/definition/28040/data-lineage (accessed July 6, 2020). [80] Trifacta Data Wrangling for Hadoop: Accelerating Business Adoption While Ensuring Security & Governance, n.d. www.trifacta.com (accessed July 6, 2020). [81] Art. 30 GDPR - Records of processing activities - GDPR.eu, (n.d.). https://gdpr.eu/article-30-records-of-processing-activities/ (accessed November 8, 2020). [82] Art. 17 GDPR - Right to erasure ('right to be forgotten’) - GDPR.eu, (n.d.). https://gdpr.eu/article-17-right-to-be-forgotten/ (accessed November 8, 2020). [83] Art. 20 GDPR - Right to data portability - GDPR.eu, (n.d.). https://gdpr.eu/article-20-right-to-data-portability/ (accessed November 8, 2020). [84] R. Bose, J. Frew, Composing lineage metadata with XML for custom satellite-derived data products, in: Proceedings. 16th Int. Conf. Sci. Stat. Database Manag., 2004: pp. 275–284. https://doi.org/10.1109/SSDM.2004.1311219. [85] S.M.K. Sigurjonsson, Blockchain Use for Data Provenance in Scientific Workflow, KTH Royal Institute of Technology, 2018. http://kth.diva- portal.org/smash/record.jsf?pid=diva2%3A1235451&dswid=-1674. [86] X. Liang, S. Shetty, D. Tosh, C. Kamhoua, K. Kwiat, L. Njilla, ProvChain: A Blockchain-Based Data Provenance Architecture in Cloud Environment with Enhanced Privacy and Availability, in: Proc. - 2017 17th IEEE/ACM Int. Symp. Clust. Cloud Grid Comput. CCGRID 2017, 2017: pp. 468–477. https://doi.org/10.1109/CCGRID.2017.8. [87] D. Potter, DataOps: The Antidote for Congested Data Pipelines, RTInsights. (2019). https://www.rtinsights.com/dataops-the-antidote-for- congested-data-pipelines/ (accessed July 14, 2020). [88] Analytics Products - Sweden | IBM, IBM Big Data Anal. Hub. (n.d.). https://www.ibm.com/se-en/analytics/products (accessed July 8, 2020). [89] W.W. Eckerson, The Ultimate Guide to DataOps: Product Evaluation and Selection Criteria, 2019. www.eckerson.com (accessed July 8, 2020). [90] W.W. Eckerson, Trends in DataOps, 2019. [91] H. Crocket, Fundamental Review of the Trading Book: Data Management Implications, 2018. https://icedq.com/resources/whitepapers/fundamental-review-of-the- trading-book. [92] M. Chisholm, Effective Quality Assurance and Testing in Data-Centric Projects, Stamford, Connecticut, 2014.

94 [93] K. Madera, D. Paredes Aguilera, Deliver business-ready data fast with DataOps, 2020. [94] K. Madera, The difference between DataOps and DevOps and other emerging technology practices, IBM Big Data Anal. Hub. (2019). https://www.ibmbigdatahub.com/blog/difference-between-dataops-and- devops-and-other-emerging-technology-practices (accessed July 30, 2020). [95] R. Gupta, Components of the DataOps toolchain and best practices to make it successful - Journey to AI Blog, IBM Big Data Anal. Hub. (2019). https://www.ibm.com/blogs/journey-to-ai/2019/12/components- of-the-dataops-toolchain-and-best-practices-to-make-it-successful/ (accessed July 8, 2020). [96] S. Quoma, What is DataOps? - Journey to AI Blog, IBM Big Data Anal. Hub. (2019). https://www.ibm.com/blogs/journey-to-ai/2019/12/what-is- dataops/ (accessed July 8, 2020). [97] A. Raj, D.I. Mattos, J. Bosch, H.H. Olsson, A. Dakkak, From Ad-Hoc data analytics to DataOps, in: Proc. - 2020 IEEE/ACM Int. Conf. Softw. Syst. Process. ICSSP, 2020: pp. 165–174. https://doi.org/10.1145/3379177.3388909. [98] About - They Buy For You, (n.d.). https://theybuyforyou.eu/about/ (accessed November 18, 2020). [99] A. Soylu, B. Elvesaeter, P. Turk, D. Roman, O. Corcho, E. Simperl, I. Makgill, C. Taggart, M. Grobelnik, T.C. Lech, An Overview of the TBFY Knowledge Graph for Public Procurement, n.d. https://ec.europa.eu/growth/single-market/public-procurement_en (accessed November 8, 2020). [100] A. Soylu, O. Corcho, B. Elvesaeter, C. Badenes-Olmedo, F. Martínez, M. Kovacic, M. Posinkovic, I. Makgill, C. Taggart, E. Simperl, T. Lech, D. Roman, Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph, in: 2020. [101] airflow.executors — Airflow Documentation, Apache Airflow. (2019). https://airflow.apache.org/docs/stable/_api/airflow/executors/index.htm l (accessed October 18, 2020). [102] Docker overview | Docker Documentation, (n.d.). https://docs.docker.com/get-started/overview/ (accessed November 19, 2020). [103] E. Miller, An Introduction to the Resource Description Framework, D-Lib Mag. (1998). http://www.dlib.org/dlib/may98/miller/05miller.html (accessed November 19, 2020). [104] Margaret Rouse, What is DataOps (data operations)? - Definition from WhatIs.com, TechTarget. (2019). https://searchdatamanagement.techtarget.com/definition/DataOps

95 (accessed July 4, 2020). [105] M. Barika, S. Garg, A.Y. Zomaya, L. Wang, A.V.A.N. Moorsel, R. Ranjan, S. Garg, L. Wang, A. Van Moorsel, Orchestrating big data analysis workflows in the cloud: Research challenges, survey, and future directions, ACM Comput. Surv. 52 (2019). https://doi.org/10.1145/3332301. [106] Y. Dessalk, N. Nikolov, M. Matskin, A. Soylu, D. Roman, Scalable Execution of Big Data Workflows using Software Containers, in: 2020. [107] H. Chen, J. Wen, W. Pedrycz, G. Wu, Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds, IEEE Trans. Big Data. 6 (2018) 131–144. https://doi.org/10.1109/tbdata.2018.2874469. [108] H. Hu, Y. Wen, T.S. Chua, X. Li, Toward scalable systems for big data analytics: A technology tutorial, IEEE Access. 2 (2014) 652–687. https://doi.org/10.1109/ACCESS.2014.2332453. [109] Y.D. Dessalk, Big Data Workflows: DSL-based Specification and Software Containers for Scalable Executions, KTH Royal Institute of Technology, 2020. [110] H. Yoshida, The DataOps Advantage of Containers and Converged Infrastructure, (2020). https://community.hitachivantara.com/s/article/The-DataOps- Advantage-of-Containers-and-Converged-Infrastructure (accessed November 26, 2020). [111] Linux Containers - LXC - Introduction, (2019). https://linuxcontainers.org/lxc/introduction/ (accessed July 29, 2020). [112] E. Arianyan, H. Taheri, S. Sharifian, Novel energy and SLA efficient resource management heuristics for consolidation of virtual machines in cloud data centers, Comput. Electr. Eng. 47 (2015) 222–240. https://doi.org/10.1016/j.compeleceng.2015.05.006. [113] P. Mell, T. Grance, The NIST definition of cloud computing: Recommendations of the National Institute of Standards and Technology, Nova Science Publishers Inc., 2012. [114] Apache Mesos Telegraf Plugin | InfluxData, InfluxData Inc. (n.d.). https://www.influxdata.com/integration/apache-mesos/ (accessed September 29, 2020). [115] Kubernetes & Kubernete Architecture | Infrastructure Basics, PSSC Labs. (2020). https://pssclabs.com/article/infrastructure-considerations- for-containers-and-kubernetes/ (accessed May 29, 2020). [116] Understanding data storage, Red Hat Inc. (n.d.). https://www.redhat.com/en/topics/data-storage (accessed May 29, 2020). [117] S. Alkatheri, S. Abbas, M. Siddiqui, A Comparative Study of Big Data

96 Frameworks, Int. J. Comput. Sci. Inf. Secur. (2019) 8. https://www.researchgate.net/publication/331318859_A_Comparative_ Study_of_Big_Data_Frameworks. [118] D. García-Gil, S. Ramírez-Gallego, S. García, F. Herrera, A comparison on scalability for batch big data processing on Apache Spark and Apache Flink, Big Data Anal. 2 (2017) 1. https://doi.org/10.1186/s41044-016- 0020-2. [119] W. Inoubli, S. Aridhi, H. Mezni, A. Jung, Big Data Frameworks: A Comparative Study, 2016. https://members.loria.fr/SAridhi/files/software/bigdata/. [120] C. Cheng, S. Li, H. Ke, Analysis on the Status of Big Data Processing Framework, in: Proc. 2018 Int. Comput. Signals Syst. Conf. ICOMSSC 2018, Institute of Electrical and Electronics Engineers Inc., 2018: pp. 794– 799. https://doi.org/10.1109/ICOMSSC45026.2018.8941875.

Appendix

A. Additional tools and technologies list

1. Workflow orchestration tools

Astronomer.io (https://www.astronomer.io/): An enterprise framework for Apache Airflow; helps to deploy and manage Apache Airflow workflows for data analytics projects.
Piperr.io (https://www.piperr.io/): Provides pre-built data algorithms from Piperr; applications range from IT to analytics, IoT, and data science.
Pachyderm (https://www.pachyderm.com/): Docker- and Kubernetes-based data workflow and input/output management software for data analytics projects; provides advanced features like data versioning, provenance, and incremental processing.
Luigi (https://github.com/spotify/luigi): A Python module to build complex pipelines of batch jobs; handles dependency resolution, data visualization, and workflow management, and comes with built-in Hadoop support (a minimal sketch is given below).
Conductor (https://netflix.github.io/conductor/): Built by Netflix for workflow orchestration using microservices-based process and business workflows.
Nextflow (https://www.nextflow.io/): A software-container-based, scalable, and reproducible scientific workflow manager.
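As a minimal sketch of Luigi's Python-module style referred to in the list above (file names and task bodies are illustrative assumptions, not taken from the thesis):

    # Minimal Luigi sketch (illustrative paths and logic): two tasks where Luigi
    # resolves the dependency and skips any task whose output already exists.
    import luigi

    class CollectData(luigi.Task):
        def output(self):
            return luigi.LocalTarget("raw.csv")  # assumed output path

        def run(self):
            with self.output().open("w") as f:
                f.write("id,value\n1,42\n")  # placeholder for a real collection step

    class ProcessData(luigi.Task):
        def requires(self):
            return CollectData()  # Luigi runs CollectData first if its output is missing

        def output(self):
            return luigi.LocalTarget("processed.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())  # placeholder transformation

    if __name__ == "__main__":
        luigi.build([ProcessData()], local_scheduler=True)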

2. Testing and monitoring tools

FirstEigen (http://firsteigen.com/): Provides automatic data quality rule discovery and continuous data monitoring.
Bigeye (ToroData) (https://docs.bigeye.com/): Provides anomaly and data quality problem detection services automatically with a no-code interface.
Great Expectations (https://github.com/great-expectations/great_expectations): Provides data testing, documentation, and data profiling services (a minimal sketch is given below).
AccelData (https://www.acceldata.io/): Observes, optimizes, and scales the data analytics pipeline through instant troubleshooting, performance monitoring, data flow monitoring, and automated data quality management.
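As a minimal sketch of the kind of declarative data test Great Expectations supports (using its pandas-based API; the file and column names are assumptions for this example):

    # Minimal Great Expectations sketch (illustrative file and column names):
    # declare expectations against a pandas-backed dataset and fail on bad data.
    import great_expectations as ge

    df = ge.read_csv("output/orders.csv")  # assumed output file of a pipeline step

    df.expect_column_values_to_not_be_null("id")
    df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

    results = df.validate()   # run every expectation declared above
    if not results.success:   # a failing check should stop the pipeline step
        raise ValueError("Data quality checks failed: %s" % results)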


3. Deployment automation tools

TeamCity (https://www.jetbrains.com/teamcity/): Known as an "intelligent CI server" due to its ease of integration and use; offers installation packages per operating system.
Spinnaker (https://spinnaker.io/): An open-source continuous delivery platform developed by Netflix; supports major cloud platforms and hosting technologies like Docker and Kubernetes.
Buddy (https://buddy.works/): CI/CD software with an interactive user interface and an on-premises solution; helps to build, test, and deploy applications quickly.
Shippable (https://www.shippable.com/): Makes CI and CD setup faster by providing ready-to-use build images with machine-level isolation to secure the workflow.
CodeShip (https://codeship.com/): Hosted CI/CD platform with fast feedback and customized environments to build applications; provides comprehensive integration support and scalability when necessary.

4. Data governance tools

TrueDat (https://www.truedat.io/): An open-source data governance tool for business enterprises with data-driven and cloud adoption capabilities.
Xplenty (https://www.xplenty.com/is/): Data integration, ETL, and ELT platform with data governance product offerings for data pipeline visualization and data management.
Informatica (https://www.informatica.com/): On-premises or cloud-based enterprise data governance and compliance solution; provides functionalities for managing GDPR data risks, sensitive data protection, and data accuracy tracking.
Erwin (https://erwin.com/): Data governance and data management solution provider with functionalities such as data modeling, business process modeling, and enterprise architecture modeling.
Agility (https://www.agilitymultichannel.com/governance/): Data governance solution for establishing and implementing high-level policies and procedures; specific solutions for data modeling and rules, workflow management rules, and security compliance rules.


5. Code, artifact, and data versioning tools

Apache Subversion (https://subversion.apache.org/): Open-source centralized version control system under the Apache License.
Mercurial SCM (https://www.mercurial-scm.org/): Free and distributed source control management tool; a platform-independent, fast, and extensible service to manage source code and artifacts.
Fossil (https://fossil-scm.org/home/doc/trunk/www/index.wiki): A distributed software configuration management system with inbuilt bug tracking and a web interface; commonly used for managing source code, documents, and configuration files.
Bazaar (https://bazaar.canonical.com/en/): A version control system for tracking project history and collaboration among team members.

6. Analytics and visualization tools

SAS Visual Analytics (SAS BI) (https://www.sas.com/en_us/solutions/business-intelligence.html): Software for visual data exploration to generate insightful and easy analytics results with an interactive, self-service dashboard.
Sisense (https://www.sisense.com/): Complete end-to-end business intelligence platform for data engineers, developers, and analysts to create interactive analytic results.
Looker (https://looker.com/): Business intelligence software supporting multiple data sources and deployment methods with transparency, security, and privacy.
Infogram (https://infogram.com/): Web-based infographics and data visualization platform; allows sharing charts, infographics, and maps with quality publishing through information extraction from uploaded data.
GoodData (https://www.gooddata.com/): Agile, powerful, and secure data and analytics platform to deliver actionable results; responsive, user-friendly UI with drag-and-drop filters and metrics.

7. Collaboration and communication tools

Microsoft Teams (https://www.microsoft.com/en-us/microsoft-365/microsoft-teams/group-chat-software): Hub for team collaboration in Microsoft 365 that integrates people, content, and tools to provide better productivity, communication, collaboration, and task management.
Mattermost (https://mattermost.com/): An open-source, self-hostable online platform for chat, file sharing, integrations, communication, and collaboration.
Ryver (https://ryver.com/): All-in-one application for task management, voice and video calls, and file sharing; integrates with other tools and offers unlimited group messaging.
Monday.com (https://monday.com/): Customizable and scalable agile software for business process management, team management, and project and task management.
Wrike (https://www.wrike.com/): Cloud-based collaboration software for planning, sharing, and streamlining workflows.
SpiraTeam (http://www.inflectra.com/SpiraTeam/): Integrated lifecycle management software to manage a project's requirements, releases, test cases, issues, and tasks in one single system, supporting Agile, Kanban, Scrum, waterfall, and hybrid methodologies.

TRITA-EECS-EX-2020:895

www.kth.se