DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

DataOps: Towards Understanding and Defining Data Analytics Approach

KIRAN MAINALI

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

DataOps: Towards Understanding and Defining Data Analytics Approach

Kiran Mainali December 17, 2020

Master’s Thesis

Examiner: Prof. Mihhail Matskin, KTH, Stockholm, Sweden

Industrial Supervisor: DI Lisa Ehrlinger, Software Competence Centre Hagenberg (SCCH), Austria

DI Johannes Himmelbauer, Software Competence Centre Hagenberg (SCCH), Austria

KTH Royal Institute of Technology School of Electrical Engineering and Computer Science (EECS) Department of Computer Science and Engineering SE-100 44 Stockholm, Sweden

Abstract

Data collection and analysis approaches have changed drastically in the past few years. The reason behind adopting a different approach is improved data availability and continuous change in analysis requirements. Data have always been there, but data management is vital nowadays due to the rapid generation and availability of data in various formats. Big data has opened the possibility of dealing with potentially infinite amounts of data in numerous formats in a short time. Data analytics is becoming complex due to data characteristics, sophisticated tools and technologies, changing business needs, varied interests among stakeholders, and the lack of a standardized process. DataOps is an emerging approach advocated by data practitioners to cater to the challenges in data analytics projects. Data analytics projects differ from software engineering in many aspects. DevOps has proven to be an efficient and practical approach to delivering projects in the software industry. However, DataOps is still in its infancy, being recognized as an independent and essential task in data analytics. In this thesis, we uncover DataOps as a methodology to implement data pipelines by conducting a systematic search of research papers. As a result, we define DataOps and outline its ambiguities and challenges. We also explore the coverage of DataOps across the different stages of the data lifecycle. We created comparison matrices of different tools and technologies, categorizing them into functional groups to demonstrate their usage in data lifecycle management. We followed DataOps implementation guidelines to implement a data pipeline using Apache Airflow as workflow orchestrator inside Docker and compared it with simple manual execution of a data analytics project. As per the evaluation, the data pipeline with DataOps provided automation in task execution, orchestration of the execution environment, testing and monitoring, and communication and collaboration, and it reduced the end-to-end product delivery cycle time along with the pipeline execution time.

Keywords: DataOps, Data lifecycle, Data analytics, DataOps pipeline, Data pipeline, DataOps tools and technologies


Sammanfattning Datainsamling och analysmetoder har förändrats drastiskt under de senaste åren. Anledningen till ett annat tillvägagångssätt är förbättrad datatillgänglighet och kontinuerlig förändring av analyskraven. Data har alltid funnits, men datahantering är viktig idag på grund av snabb generering och tillgänglighet av olika format. Big data har öppnat möjligheten att hantera potentiellt oändliga mängder data med många format på kort tid. Dataanalysen blir komplex på grund av dataegenskaper, sofistikerade verktyg och teknologier, förändrade affärsbehov, olika intressen bland intressenter och brist på en standardiserad process. DataOps är en framväxande strategi som förespråkas av datautövare för att tillgodose utmaningarna i dataanalysprojekt. Dataanalysprojekt skiljer sig från programvaruteknik i många aspekter. DevOps har visat sig vara ett effektivt och praktiskt tillvägagångssätt för att leverera projektet i mjukvaruindustrin. DataOps är dock fortfarande i sin linda och erkänns som en oberoende och viktig uppgiftsanalys. I detta examensarbete avslöjar vi DataOps som en metod för att implementera datarörledningar genom att göra en systematisk sökning av forskningspapper. Som ett resultat definierar vi DataOps som beskriver tvetydigheter och utmaningar. Vi undersöker också täckningen av DataOps till olika stadier av datalivscykeln. Vi skapade jämförelsesmatriser med olika verktyg och teknologier som kategoriserade dem i olika funktionella grupper för att visa hur de används i datalivscykelhantering. Vi följde riktlinjerna för implementering av DataOps för att implementera datapipeline med Apache Airflow som arbetsflödesorkestrator i Docker och jämfört med enkel manuell körning av ett dataanalysprojekt. Enligt utvärderingen tillhandahöll datapipelinen med DataOps automatisering i uppgiftskörning, orkestrering i exekveringsmiljö, testning och övervakning, kommunikation och samarbete, och minskad leveranscykeltid från slut till produkt tillsammans med minskningen av tid för rörledningskörning.

Nyckelord: DataOps, Data lifecycle, Data analytics, DataOps pipeline, Data pipeline, DataOps tools and technology


Acknowledgements

There are several people and organizations whom I would like to thank for their help and support in completing this project. First, I would like to express my gratitude to my examiner Prof. Mihhail Matskin for his guidance, assistance, and encouragement throughout the project. His ideas and insights were vital for me to be able to finish the project. I would also like to thank Lisa Ehrlinger and Johannes Himmelbauer at the Software Competence Center Hagenberg for providing valuable assistance. I offer my sincere appreciation for the learning opportunity provided. The research reported in this paper has been funded by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK), the Federal Ministry for Digital and Economic Affairs (BMDW), and the Province of Upper Austria in the frame of the COMET -- Competence Centers for Excellent Technologies Programme managed by the Austrian Research Promotion Agency FFG. I would also like to thank SINTEF AS and the TBFY project members for providing the necessary data and resources needed for the project. Last but not least, I would like to thank my loving family and friends for their continuous motivation and support. Completion of this thesis could not have been accomplished without their support.


Contents

Chapter 1. Introduction ...... 1
1.1. Problem ...... 2
1.2. Research Questions ...... 4
1.3. Purpose ...... 4
1.4. Goals ...... 5
1.5. Benefits, Ethics, and Sustainability ...... 5
1.6. Methodology ...... 5
1.7. Delimitations ...... 6
1.8. Outline ...... 7
Chapter 2. Theoretical Background ...... 8
2.1. Data Lifecycle ...... 8
2.1.1. CRUD Lifecycle ...... 8
2.1.2. IBM Lifecycle ...... 9
2.1.3. USGS Lifecycle ...... 10
2.1.4. DataOne Lifecycle ...... 12
2.1.5. Discussion ...... 13
2.2. Data Pipeline ...... 15
2.3. Data Pipeline in Data Lifecycle ...... 15
2.4. DataOps ...... 16
2.4.1. DataOps Evolution ...... 17
2.4.2. DataOps Principle ...... 20
2.4.3. DataOps Implementation ...... 21
2.4.4. Discussion ...... 26
2.5. Data Governance ...... 27
2.5.1. Data Governance with DataOps ...... 27
2.6. Data Provenance and Lineage ...... 28


2.6.1. Data Provenance and Lineage with DataOps ...... 29
2.7. From Data Pipeline to DataOps Pipeline ...... 29
2.8. Analysis of Related work ...... 31
2.8.1. Commercial Contribution to DataOps ...... 31
2.8.2. Scientific Contribution to DataOps ...... 34
Chapter 3. Method ...... 35
3.1. Exploring DataOps as Data Analytics Methodology ...... 35
3.1.1. Research Process ...... 35
3.2. Project Implementation Using DataOps Methodology ...... 37
3.2.1. Objective ...... 37
3.2.2. Project Introduction ...... 37
3.2.3. Design ...... 38
3.2.4. Implementation ...... 39
Chapter 4. Result and Evaluation ...... 47
4.1. DataOps as Data Analytics Methodology ...... 47
4.1.1. DataOps Definition ...... 47
4.1.2. DataOps in the Data Lifecycle ...... 52
4.1.3. Evaluation of DataOps Tools and Technologies ...... 54
4.1.4. Discussion ...... 75
4.2. Experiment Evaluation ...... 78
4.2.1. Evaluation of Pipeline using DataOps Implementation Guidelines ...... 78
4.2.2. Runtime Evaluation ...... 80
4.2.3. Discussion ...... 82
Chapter 5. Conclusions and Future Work ...... 83
5.1. Conclusions ...... 83
5.2. Future Work ...... 87
Bibliography ...... 88
Appendix ...... 98

List of Figures

Figure 1: CRUD Data lifecycle ...... 9
Figure 2: Where management task fall in data lifecycle ...... 10
Figure 3: USGS Science Data Lifecycle Model ...... 12
Figure 4: DataOne data lifecycle ...... 13
Figure 5: Data lifecycle ...... 14
Figure 6: Simple data pipeline example ...... 15
Figure 7: Data pipeline in the data lifecycle ...... 16
Figure 8: DataOps has evolved from lean manufacturing and software methodologies ...... 20
Figure 9: DataOps pipeline in general ...... 30
Figure 10: Illustration of the research process ...... 37
Figure 11: ETL process to publish knowledge graph from JSON ...... 38
Figure 12: Running ETL pipeline steps manually ...... 40
Figure 13: Illustration of KG ETL pipeline implementation using Apache Airflow in Docker container ...... 41
Figure 14: Apache Airflow architecture ...... 42
Figure 15: DataOps pipeline ...... 50
Figure 16: DataOps in the data lifecycle ...... 54
Figure 17: DataOps ecosystem ...... 75
Figure 18: Data analytics tasks execution time without following DataOps implementation guidelines ...... 81
Figure 19: Data pipeline execution time following DataOps implementation guidelines ...... 82


List of Tables

Table 1: Research approach ...... 6
Table 2: Summary of DataOps contribution from companies ...... 33
Table 3: Literature review overview ...... 36
Table 4: Scripts with input and output in KG ETL pipeline ...... 40
Table 5: DataOps definition’s analysis from literature ...... 48
Table 6: Workflow orchestration tools ...... 59
Table 7: Testing and monitoring tools ...... 62
Table 8: Deployment automation tools ...... 64
Table 9: Data governance tools ...... 66
Table 10: Code, artifact, and data versioning tools ...... 68
Table 11: Analytics and visualization tools ...... 69
Table 12: Collaboration and communication tools ...... 70
Table 13: Other tools and technologies for DataOps ...... 71
Table 14: Summary of the covered DataOps implementation guideline by experiments ...... 80

List of Listings

Listing 1: DAG on Apache Airflow of KG ETL pipeline ...... 43
Listing 2: DataOps definition ...... 49


List of Acronyms and Abbreviations

API Application Programming Interface
AWS Amazon Web Service
BI Business Intelligence
CI/CD Continuous Integration/Continuous Deployment
CRD Custom Resource Definition
CRUD Create Read Update Delete
CPU Central Processing Unit
DAG Directed Acyclic Graph
DSL Domain Specific Language
DGI Data Governance Institute
ETL Extraction, Transformation, and Loading
ELT Extraction, Loading, and Transformation
ECS Amazon Elastic Container Service
GKE Google Kubernetes Engine
GDPR General Data Protection Regulation
HTTP Hypertext Transfer Protocol
HDFS Hadoop Distributed File System
IT Information Technology
IDE Integrated Development Environment
JSON JavaScript Object Notation
KG Knowledge Graphs
LXC Linux Containers
OCDS Open Contracting Data Standards
OS Operating System
RDF Resource Description Framework
RDD Resilient Distributed Dataset
RML RDF Mapping Language
RAM Random Access Memory
REST Representational State Transfer
SPC Statistical Process Control
SDLC Software Development Lifecycle
TM Trademark
USGS United States Geological Survey
UI User Interface
YAML YAML Ain't Markup Language


Chapter 1. Introduction

The way data are collected and analyzed has changed drastically over the past few years. One of the main reasons for this is improved data availability. Data have always been there, but data management has become vital nowadays due to the rapid generation and availability of data in various formats. We refer to the volume, velocity, and variety of data together as big data [1–4]. The concept of big data has opened the possibility of dealing with potentially infinite amounts of data in numerous formats within milliseconds or less. Data create new capital for any enterprise that relies on data to make informed decisions. The growth of data has increased both the challenges and the opportunities. Large investments are being poured into information extraction to enhance the decision-making process. Companies are trying hard to extract information faster than before. However, the data analytics process is becoming complex due to data characteristics, sophisticated tools and technologies, changing business needs, varied interests among stakeholders, and a lack of a standardized process [5,6]. Data management is one of the challenges most companies are facing. Handling data has been costly and inefficient as the data utility sector expands. Data analytics with conventional methods is almost obsolete if we consider the effectiveness of the results it provides. Enormous resource requirements and repetitive tasks make the process costly and time-consuming. Moreover, in this fast-paced environment, companies need prompt results, which traditional data analytics processes can never deliver. Data need to be analyzed to generate reliable results instantly and with minimal operational cost [7]. Several approaches to handling data have been introduced in recent years, and many tools and technologies have become available to meet the needs of data analytics projects. Nevertheless, the challenges remain the same because of the mismatch between different tools and sophisticated business requirements. There is a need for an approach that can serve a vast array of industries with a collective process for combining tools and technologies irrespective of business requirements. DataOps is an emerging approach advocated by data practitioners to cater to the challenges faced during the data analytics process. DataOps is described as a method to automatically manage the entire data lifecycle, from data identification, cleaning, and integration to analysis and reporting [8]. It borrows proven practices from DevOps in the software development lifecycle. DevOps is a set of practices that integrates the software development process and IT teams to shorten the software development lifecycle while providing continuous delivery and software quality


[9,10]. While DevOps is a mature field in software development, DataOps is still in its infancy. DataOps requires extensive industry adoption and evidence of improved outcomes to be recognized as a proven methodology for data analytics projects. DataOps vows to remove the pain of data management for reporting and analytics. Data travel through a tortuous route from source to value. Behind the scenes, data professionals go through gyrations to transform data before releasing it to the business community. These “data pipelines” are error-prone and inefficient: data hop across multiple systems and are processed by various software programs [11]. Humans intervene to apply manual workarounds to fix recalcitrant transactions before the data are combined, aggregated, and analyzed by knowledge workers. Reuse and automation are scarce. Business users wait months for reports and insights. The hidden costs of data operations are monumental. DataOps pledges to smooth the process of building, changing, and managing data pipelines [12]. Its primary goal is to maximize the business value of data and improve customer satisfaction. DataOps does this by speeding up data delivery and analytic output while simultaneously reducing data defects, essentially fulfilling the mantra “better, faster, cheaper” [11]. DataOps emphasizes collaboration, reuse, and automation, along with a heavy dose of testing and monitoring [8]. Team-based development tools are used for creating, deploying, and managing data pipelines.

1.1. Problem

Data analytics is a broad term. Any analysis done to generate information from given data is considered data analytics. However, there are always some typical steps in performing analysis over data. Typically, a data analysis task starts with defining the goals and identifying what to get from a dataset. Then the analysis team retrieves data from sources and transforms it into an analyzable format. The data analysis starts with data ingestion. We need to determine the required technique for a given situation, know the dataset contents, know how to implement the chosen analysis techniques, and understand the process of result generation [13]. After the analysis task, the generated results are distributed to end-users in presentable and understandable formats. Data analysis is not a stepwise process in which tasks are performed one after another; instead, we need to always keep in mind that each step has several iterative processes which might take a number of iterations to complete [14]. DevOps has proven to be an efficient and practical approach to delivering projects in the software industry [15,16]. DevOps is an approach to deliver applications and services at a higher pace using a combination of cultural philosophies (combining


development and operation teams), practices (Continuous Integration, Continuous Delivery, microservices, Infrastructure as Code, monitoring and logging, communication and collaboration) and tools (coding, building, testing, packaging, configuring and monitoring) [17,18]. Traditional software development and infrastructure management models for production and service have been outperformed by DevOps’s speed in the field [15]. DevOps practices are being implemented in the data analytics lifecycle [19]. While collecting data, DevOps can reduce the effort and time needed for data retrieval [20]. Once this process is automated, there is a continuously growing body of data ready for analysis. The automation techniques used for data collection depend on how we are going to collect data. We might have to write custom code to automate the functionality of data retrieval tools. In this step, one DevOps approach could be to store the data retrieval script in a central repository. By doing this, the process becomes transparent and ensures uniformity in the datasets collected by any responsible user. In the data transformation step, DevOps can maintain and update transformation scripts effectively and collaboratively using a Continuous Integration (CI) server and a central code repository. The CI server also performs automated testing of these scripts and ensures that the transformation scripts have the intended functionality. After data are transformed into an analyzable format, analysis of the data begins. The analysis process involves multiple analysis techniques and uses sets of tools depending on the type of data and the results required. Automating this process improves efficiency by allowing several iterations and repetitions. Besides, we can feed additional datasets to the same process without worrying about readjusting and rerunning the tools. A DevOps solution for the analysis stage contains several tasks and tools arranged to perform the analysis process effectively and iteratively. For instance, a central code repository allows the whole analysis team to manage the necessary scripts. Automatic provisioning of virtual environments to deploy and run the analysis tools can be done for the entire team. Additionally, the CI server can regularly manage the testing and deployment of scripts and analysis tools. Data analytics initiatives more closely resemble systems integration and business analysis than a general software project. The significant difference lies in creating an analytics pipeline that copies the operational data from the business, makes business-rule-based data transformations, and populates a datastore from which analysts can extract business information.
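To make the idea of CI-tested transformation scripts concrete, the following is a minimal sketch, assuming a pytest-style test suite run by the CI server on every commit; the function name, fields, and values are illustrative assumptions and are not part of the thesis project.

# transform.py -- a hypothetical transformation script kept in a central repository.
# A CI server could run the tests below on every commit to verify that the script
# still has its intended functionality before it is used in the pipeline.

def normalize_amounts(records):
    """Convert raw 'amount' strings such as '1,200.50' into floats."""
    cleaned = []
    for row in records:
        value = float(str(row["amount"]).replace(",", ""))
        cleaned.append({**row, "amount": value})
    return cleaned


# test_transform.py -- executed automatically by the CI server (pytest style).
def test_normalize_amounts_parses_thousand_separators():
    raw = [{"id": 1, "amount": "1,200.50"}, {"id": 2, "amount": "99"}]
    result = normalize_amounts(raw)
    assert [r["amount"] for r in result] == [1200.50, 99.0]


def test_normalize_amounts_keeps_other_fields():
    raw = [{"id": 7, "amount": "3"}]
    assert normalize_amounts(raw)[0]["id"] == 7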


Data analytics projects differ from software engineering in many aspects (the data and analytics pipeline, the stateful data store, and process control). We can see differences in how DevOps and DataOps deliver value and assure quality. DataOps aims to improve the quality and reduce the cycle time of data analytics initiatives [21]. The difference between DataOps and DevOps is based on the unique nature of dealing with data and delivering data insights to users [21]. Data analytics lags far behind software engineering in delivering products rigorously. However, the business intelligence field has been revolutionized by new technologies and by data professionals with varied backgrounds. The challenges of data analytics projects cannot be solved by simply exploiting DevOps practices due to the heterogeneous nature of data analytics projects. DataOps draws a line in the sand, declaring that data analytics projects can significantly improve capability and generate more business value in the same way that DevOps and software engineering have. However, very little work has been done in the past to establish DataOps as a methodology.

1.2. Research Questions

The research questions defined for the thesis are:
1. What is currently understood by DataOps? What are the ambiguities in the understanding of the concept?
2. Which available tools implement DataOps, and what functionalities do they offer?
3. Which parts of the data lifecycle are currently supported well, and where is there a need for potential enhancement?

1.3. Purpose

This thesis aims to uncover DataOps as a methodology to implement data pipelines by conducting a systematic search of research in the field of DataOps. Through this thesis work, DataOps practices, principles, data lifecycle support, and tools and technologies are studied to define DataOps and provide a baseline for further exploration and experiments with this methodology in data analytics projects.


1.4. Goals

The goals of the thesis are to:
1. Provide a definition for DataOps and point out the ambiguities in the field of DataOps.
2. Propose a generic DataOps pipeline that supports the data lifecycle.
3. Prepare a comparative feature analysis of the tools and technologies that support DataOps.
4. Explore which parts of the data lifecycle are currently supported and where potential enhancement is needed.
5. Implement a data pipeline and evaluate it using the DataOps implementation process.

1.5. Benefits, Ethics, and Sustainability

The thesis outcome will provide additional support and authority on approaches and practices to reduce the cost and time needed to handle complex data. For computation- and resource-intensive data analytics projects, cost is crucial and must be kept relatively low compared to the project’s benefit [22]. It costs time and resources to apply trial and error when selecting tools for analytics projects. Furthermore, companies have to bear huge expenses before starting the project. In addition, much of the data professionals’ effort is wasted on modifying and adjusting code during the selection process of tools and technologies. The thesis outcome will help companies and data professionals reduce the cost of trials and the time spent on project adjustment, significantly impacting the overall project cost. This work does not involve using or processing sensitive personal or organizational data at any stage of the work. All the resources used to carry out this thesis project are mentioned with proper citation. Data and code used for the project were used after obtaining consent from the related parties.

1.6. Methodology

To achieve the goals listed in Section 1.4, we followed qualitative, exploratory, and empirical research approaches. A comprehensive literature study of the data lifecycle, DataOps, DevOps, Agile, lean manufacturing, data governance, data lineage and provenance, and the data pipeline reflects the qualitative aspect of the research process. Also, the feature-based comparison of tools and technologies involves studying product features using credible online resources. Overall, the whole thesis follows exploratory research. By using exploratory research, we create the basis for general findings [23] by obtaining as much information as possible about DataOps to define the


term and provide a base for future work in the field. Exploratory research is carried out through a literature review and supported by empirical observation of the experiment conducted. Empirical research relies on experience and observation, focusing on the real situation [23]. We developed two implementation situations and differentiated the level of adherence to the DataOps implementation guidelines through experience and observation. Table 1 below summarizes the research approach we followed in conducting the tasks in this thesis.

Table 1: Research approach.

Task | Research methodology
Establish a theoretical background | Qualitative + explorative
Develop data lifecycle | Qualitative
DataOps implementation guideline formation | Qualitative + explorative
DataOps definition and point out ambiguities | Qualitative + explorative
DataOps in the data lifecycle | Qualitative + explorative
Evaluation of DataOps tools and technology | Qualitative
Implement data pipeline and compare implementation approach | Empirical

1.7. Delimitations

The thesis work is based on secondary data and research papers. The content presented and the conclusions drawn rely on the information we gathered through our literature review process. Therefore, the DataOps definition we provide here may not completely cover the specific data analytics scenarios found across the wide array of organizations and data analytics projects. The tools and technologies presented in the feature-based comparison tables do not cover all the available tools on the market. Furthermore, the tools and technologies listed in one category can have features that are supported in other categories. We created the feature comparison tables to provide general information about the products and give the user a starting point for researching tools and technologies to consider while implementing the DataOps pipeline. The comparison matrix does not provide complete and in-depth technical specifications of products. The implementation of a data pipeline example demonstrates possible options to implement the same data analytics project. The project implemented here does not reflect an actual project goal; rather, it is used only to demonstrate a data pipeline implementation example.


1.8. Outline

The rest of the thesis is structured as follows: Section 2 presents the concepts of the data lifecycle, data pipeline, DataOps, data governance, data lineage, and provenance, and analyzes related work in similar fields under the theoretical background. Section 3 presents the detailed methods used to explore DataOps and to implement a data analytics project as part of the thesis work, outlining the objective, implementation process, data pipeline design, and technology stack. Section 4 presents the results of the thesis work. Finally, Section 5 concludes the thesis with potential future work that can be carried out based on this thesis.


Chapter 2. Theoretical Background

In this section, we discuss the theoretical background and related work around DataOps. We first discuss different approaches to data lifecycle management, present the data pipeline in general, and establish its relationship with data lifecycle management. Afterwards, we investigate DataOps: its evolution, principles, and implementation. Then we explore the concepts of data governance, provenance, and lineage and their importance in DataOps. Furthermore, we discuss the fundamental differences between implementing a data pipeline and a data pipeline with DataOps before presenting related work.

2.1. Data Lifecycle

The data lifecycle is the sequence of stages data go through from their initial generation or creation to their eventual archival and deletion at the end of their useful life. The transformation of data from extraction to ultimate deletion is considered the data lifecycle. USGS in [24] outlines that, after deciding on the collection or use of data, data must be handled and managed until they fall out of use. The phases of the data lifecycle depend on business requirements, context, and types of data. A data lifecycle adds value to raw data by managing the many problems that arise during the process of converting raw data into sensible and valuable data [25]. The data lifecycle gives a high-level overview of the management and security stages needed to use and reuse data. Several research contributions [25] define the data lifecycle with various attributes in practice across different applicable sectors.

2.1.1. CRUD Lifecycle

In [26], Yu, Xiaojun, and Wen proposed the CRUD (Create, Read, Update, and Delete) model of the data lifecycle, which focuses more on data security while highlighting the classical phases of the lifecycle. Their research paper proposes six stages of the data lifecycle:
• Create: Data created from scratch as raw data without value.
• Store: Data stored in the cloud or a local environment, with several replicas in different nodes if needed, for use and publication.
• Use and Share: Data accessed for extraction and exploitation from the storage. The owner has the right to use the data solely or share them with others for use.
• Archive: Data stored in cheaper storage for long-term deposit.

• Destruct: Data deleted permanently after the purpose is fulfilled.
Create, Store, and Destruct are mandatory phases in the data lifecycle, whereas Use/Share and Archive are optional in the CRUD lifecycle. However, project cost and efficiency are determined by these two optional phases of the CRUD lifecycle.

Figure 1: CRUD Data lifecycle (Source: [3])

2.1.2. IBM Lifecycle

IBM in [28] proposes management tasks as part of the data lifecycle. Thus, IBM extends the traditional data lifecycle by adding management and management policy layers. IBM considers three essential components while managing the lifecycle during different phases of data existence.
• Test data management: While developing new data sources, test technicians must simulate and automate realistic data sources of equivalent size to reflect the crucial behaviors of existing production databases. The focus should be on creating a subset of the actual production data to pinpoint errors, problems, and defects as early as possible.
• Data masking and privacy: To protect production data and comply with privacy requirements, certain data features are masked from groups of users.
• Archiving: Archiving does not mean storing data unnecessarily for eternity. It needs intelligent policies based on specific parameters derived from business rules and the age of the data. Intelligent archiving helps to improve data

warehouse performance by automating the archiving process based on a strategy. The entire data lifecycle benefits from good governance, but management capabilities that specialize in the use, share, and archive steps have wide-ranging benefits for cost reduction and efficiency gains.

Figure 2: Where management task fall in data lifecycle (Source: IBM [28])

2.1.3. USGS Lifecycle

The USGS data lifecycle model in [29] states that data are corporate assets with long-term value and should be preserved beyond immediate needs throughout the entire data lifecycle. Issues related to documentation, storage, quality assurance, and ownership should be resolved at each stage.
• Plan: The project team should consider all available approaches, required resources, and expected outputs for each lifecycle stage. By the end of the planning phase, a data management plan is developed to handle the entire data lifecycle.
• Acquire: At this phase, new or existing data are collected, generated, or evaluated for reuse through sets of activities. Data acquisition techniques should align with the project purpose.


• Process: Various actions and measurements are used to verify, organize, transform, extract, and integrate data into a suitable format. After this stage, data are ready for the integration and analysis process.
• Analyze: Exploration and interpretation of the processed data are performed for hypothesis testing, making discoveries, and drawing conclusions. Activities like summarization, graphical interpretation, statistical and spatial analysis, and modelling produce results from the input data. This stage yields interpretations or new datasets, which often get published in written reports or machine-readable formats.
• Preserve: Storing data for long-term use with controlled accessibility is considered at this stage. Data archiving and storing, or submitting data to a reference repository, are considered.
• Publish/Share: Preparation to publish data and results with concerned stakeholders is done with planned and controlled accessibility over the data and publications to ensure data and project integrity.
• Describe (metadata and documentation): Throughout the data lifecycle, the respective documents (metadata and other project documents) must be updated to show the actions and measurements taken on the data for the project purpose.
• Manage Quality: A quality control mechanism should be used in each stage of the lifecycle to ensure the actions taken on data are in accordance with the project proceedings and objectives.
• Backup and Secure: To reduce the physical risk of data loss and damage, and to make sure data are accessible when needed, this action should be performed at each stage of the data lifecycle. This element reminds scientists that routine backups are critical to prevent the physical loss of data due to hardware or software failure, natural disasters, or human error before the final preservation of the data. Loss-prevention measures apply to the raw and processed data, the original project plan, the data management plan, the data acquisition strategy, processing procedures, versioning, analysis methods, published products, and associated metadata. This element also encourages project members to plan secure data sharing services, mainly when team members work at multiple facilities.


Figure 3: USGS Science Data Lifecycle Model (source: [29])

2.1.4. DataOne Lifecycle

The DataOne lifecycle proposed in [30] focuses on data useful for scientific research. The data lifecycle is useful in identifying data flow with a manageable work process for scientific experiments. DataOne has adopted a data lifecycle emphasizing data movement through eight unique stages. These steps begin with making the research plan for data collection, quality assurance, and control. A data description with metadata is deposited in a trusted repository along with the data for preservation. Tools and services can then support data discovery, integration, and analysis, including visualization.
• Plan: The lifecycle starts with the plan for conducting scientific research. Data management and access protocols for the entire lifecycle are described along with a description of the data.
• Collect: Data are collected using various methods like observations, sensors, or other data collection endpoints and are stored in digital format.
• Assure: The quality of the data is checked through control and inspection.
• Describe: Data are described correctly using metadata standards.
• Preserve: Data are stored in repositories for future purposes, where they are discoverable and usable by others through consultation.
• Discover: Archived or preserved data can be accessed for reuse and further use.
• Integrate: A homogeneous set of data is created by combining different sources of data.
• Analyze: Exploitation and analysis of data are performed to obtain results.


Figure 4: DataOne data lifecycle (Source: DataOne [30])

2.1.5. Discussion

We have discussed various approaches to the data lifecycle. The primary motivation for considering the data lifecycle is to find similarities between the various works and establish a common ground for understanding the data lifecycle. With a clear definition and understanding of the data lifecycle, it becomes easier to implement data governance and data lineage. We briefly covered four data lifecycle approaches prevalent in scientific research projects. The review of these lifecycles provides a clear picture that all these models have typical phases, although their terminology differs in some cases. Some approaches are designed for a specific purpose. For example, the DataOne lifecycle and the USGS lifecycle are more related to scientific studies, whereas the IBM and CRUD lifecycles are more focused on enterprise data projects. The data lifecycles mentioned above start with careful planning of projects and data collection methods that satisfy the project requirements. However, in CRUD and IBM, data lifecycle planning is not explicitly defined. In the CRUD model, planning is done in the creation phase, whereas in the IBM lifecycle, planning comes under test data management at the management level. Storing and data archiving are useful for reuse and distribution in each stage of the data lifecycle. IBM is more concerned with data privacy and data masking. At the same time, USGS and DataOne are more inclined towards data availability with proper project and metadata descriptions to make future use and distribution easier. Quality control is a challenging task


to manage in the data lifecycle. IBM, CRUD, and USGS have well-defined stages for quality and assurance throughout the lifecycle. In Figure 5, a general data lifecycle is illustrated. Planning of the project is the first step. In this stage, the purpose and procedures for data collection, transformation, analysis, publishing, and storage are defined. Also, the data lifecycle governance and data quality management plans are formulated. Here we are not considering planning as part of the data lifecycle. The data lifecycle goes through the collection/creation, processing, analysis, and publishing stages. In each stage, data are archived/stored in persistent storage while simultaneously being stored in temporary memory for easy access by other stages. Temporary memory data are deleted immediately after use in the next stage. So, to access the same data in the future, each stage must check persistent storage. Data in persistent storage will eventually be destroyed after the purpose of the data is fulfilled. The data governance policy governs the lifecycle and handles the flow by implementing rule-based decisions for storage (temporary and persistent), archival, access, and disposal of data. Furthermore, data quality management checks the quality of the data (input and output) and the processes associated with the data throughout the data lifecycle. The data lifecycle presented below is a cyclic process. Published data can be collected from storage for processing and analysis to deliver different data analytics results. So, the data lifecycle will continue, with each cycle producing new data as a result.
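As a concrete illustration of the storage behaviour just described, the following is a minimal Python sketch, assuming a simple in-memory store; the class and method names are illustrative and are not artefacts of the thesis project.

# Minimal sketch of the lifecycle storage behaviour described above:
# every stage writes its output both to persistent storage (the archive)
# and to a temporary cache that the next stage consumes and then clears.

class LifecycleStore:
    def __init__(self):
        self.persistent = {}   # survives until governance rules dispose of it
        self.temporary = {}    # cleared as soon as the next stage has read it

    def put(self, stage, data):
        self.persistent[stage] = data
        self.temporary[stage] = data

    def take(self, stage):
        # Prefer the temporary copy; fall back to persistent storage,
        # as later stages must do once the cache has been cleared.
        data = self.temporary.pop(stage, None)
        return data if data is not None else self.persistent.get(stage)


store = LifecycleStore()
store.put("collect", ["raw record"])
processed = [r.upper() for r in store.take("collect")]   # process stage
store.put("process", processed)
print(store.take("process"))        # analysis stage reads from the cache
print(store.persistent["collect"])  # archived copy is still available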

Figure 5: Data lifecycle stages


2.2. Data Pipeline

A data pipeline, or data analytics pipeline, is a series of steps for data processing in which data move from source to destination. A pipeline is an engine for data transformation with a series of automated and manual interconnected operations or tasks. A data pipeline is the foundation of analytics, reporting, and machine learning, with processes to move and transform data from sources to a destination, generating new values [31]. A data pipeline task can be individual or part of a collection with an accepted order of execution [32]. In some pipelines, the order is essential, whereas other pipelines can have an interchangeable task order. Each data analytics project has its own purpose, and a data analytics pipeline is created to meet that purpose. Depending on the project goal, the tools we use and the tasks we create will differ.
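The following is a minimal Python sketch of a pipeline as an ordered series of tasks, each consuming the previous task's output (the arrangement shown in Figure 6); the task functions are illustrative placeholders, not the scripts used in the thesis project.

# Minimal sketch of a data pipeline as an ordered series of tasks:
# each task takes the previous task's output as its input.

def extract(_):
    return ["  alice , 10 ", "bob,20"]

def clean(rows):
    return [[field.strip() for field in row.split(",")] for row in rows]

def load(rows):
    return {name: int(value) for name, value in rows}

def run_pipeline(tasks, data=None):
    """Run tasks in their accepted order of execution."""
    for task in tasks:
        data = task(data)
    return data

print(run_pipeline([extract, clean, load]))   # {'alice': 10, 'bob': 20}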


Figure 6: Simple data pipeline example

In general, a data pipeline is built for efficiency, to minimize or eliminate manual processes. There are several commercial and open-source tools available for use in a data analytics pipeline, and many comparative studies can be found [33,34]. However, the selection of tools and technologies for a data analytics pipeline depends on factors related to the organization, people, technology, data governance policy, and data management policy [35].

2.3. Data Pipeline in Data Lifecycle

Data pipelines are a means of sequencing data processes throughout the data lifecycle [32]. In general, the data pipeline often starts from data collection or creation, as in Figure 7, where pipelines P1, P2, and P6 use data directly from the source, whereas the rest of the pipelines access data from storage (temporary or persistent). Data pipelines can cover the whole data lifecycle or can be built for a specific stage of it. The decision about data pipeline coverage over the stages of the data lifecycle is made at the planning phase with the project scope in mind. A data pipeline is not required to send data to a storage service; it can route data to other pipelines or applications [36]. In Figure 7, we assume that pipeline P3 can take input directly from P2 and route output directly to P4 without storing data in a storage service.


In Figure 7, we illustrate how data pipelines are used in different stages of the lifecycle. We outline the different possible data pipelines with dotted ovals. Several possible data pipelines can be made as per the data analytics project requirements. The figure shows data movement from one data lifecycle stage to another, transforming data through a series of tasks inside the pipeline. The task at the top of the figure represents a pipeline, which has one or more interconnected tasks. Moving data from one stage to another should be done by a system of interconnected tools, technologies, and processing steps assembled inside the data pipeline. So, a data pipeline is a means of data transportation with transformation from one stage to another.

Figure 7: Data pipeline in the data lifecycle

2.4. DataOps

Lenny Liebmann first used the term DataOps in his blog post titled “3 reasons why DataOps is essential for big data success” on the IBM Data and Analytics Hub, where he shines a light on the importance of executing data analytics tasks rapidly, with ease of collaboration and an assured quality outcome, in diverse big data workloads and cloud computing environments [37]. However, the term DataOps gained its popularity after Andy Palmer’s contribution in 2015, where he describes DataOps as an enabler of communication, collaboration, integration, and automation, practiced with cooperation between data engineers, data scientists, and other stakeholders [38]. The definition given by


Gartner’s Glossary [39] describes DataOps as a collaborative approach to the collection and distribution of data, with automation and controlled access for data users to preserve data privacy and integrity. DataOps is a consequence of three emerging trends: process automation, the pressure of digital-native companies on traditional industry, and the importance of data visualization and representation of results [40]. Data analytics is a demanding process in which several tools and technologies are combined to obtain results. It is not just about the tools; a lot of expertise is required for the tools to work collectively. DataOps is a unified way of delivering a data analytics solution that uses automation, testing, orchestration, collaborative development, containerization, and continuous monitoring to regularly produce output with improved quality [36]. The DataOps goal is to take data from the source and deliver it to the person, application, or system where it produces business value [41]. Several other definitions describe DataOps as an “analytic process which spans from data collection to delivery of information after data processing” [42], a way to “develop and deliver data analytics projects in a better way” [43], a “combination of value and innovation pipeline” [44], or a “data management approach to improve communication and integration between previously inefficient teams, system and data” [45]. From these definitions, we can say that DataOps is a process of generating value from data in an efficient way using appropriate tools and technology with the collaboration of teams. After analyzing existing articles from different authors and experts in the field, we found that different perspectives inspired the DataOps definitions. Some definitions are more goal oriented [27,28,29], while some are activity oriented [30,28], and some are process and team oriented [19,31,32]. From the activity perspective, DataOps fills the gap between data handlers and data, enabling a continuous flow of data through the pipeline developed by the data handlers to generate the desired outcome, with the ability to create, test, and deploy changes throughout the process. From a goal-oriented approach, DataOps is viewed as a process to eliminate errors and inefficiency in data management, reducing the risk of data quality degradation and exposure of sensitive data using interconnected and secure data analytics models. From a process and team-oriented perspective, DataOps is a way of managing the activities of the data lifecycle with a high level of data governance, connecting data creators and consumers using digital innovations.

2.4.1. DataOps Evolution

DataOps is a set of practices in the data analytics field that takes proven practices from other industries [52]. It is a combination of proven methodologies that helped other industries grow: DevOps and the Agile methodology from the software industry

and lean manufacturing from the automotive/manufacturing industry. The speed of delivery of results provided by Agile and DevOps and the quality control provided by lean manufacturing can be used in data analytics under DataOps [8].

From Agile Methodology

The Agile methodology gained popularity as an alternative to traditional software development methodologies when the product development lifecycle had to be reduced and delivery of a product with higher quality was expected even with frequent changes in requirements. According to the Agile Alliance [53], “Agile methodology in software development is an umbrella term for a set of frameworks and practices based on the values and principles expressed in the ‘Manifesto for Agile Software Development’ [54] and the ‘Twelve Principles’ [55] behind it”. The Agile team, tools, and processes are organized to reduce the publishing cycle to a minimum [56]. Agile uses a collaborative approach between people to perform their tasks in self-organizing, cross-functional teams. Agile is best suited for a non-sequential development cycle where requirements are constantly changing, which is similar to data analytics projects, where every new analysis and report opens new requests for additional results and queries [57].

From DevOps Practice

DevOps [39,40] is an approach to software development and system operation that uses the best practices from both domains to deliver a quality product in a short period with reduced cost. DevOps practice in software development projects tries to fill the gap between the development and operation cycles while delivering optimized, quality products [18]. The success of the DevOps approach can be seen in its widespread usage in the software industry and the demand for DevOps engineers [60]. DataOps is an extension of DevOps into data analytics. In the software industry, continuous collaboration between developers, quality assurance, and operations teams is assured by DevOps [61]. In the same manner, DataOps provides a similar platform for data analytics team members, from data scientists, data engineers, and other data professionals to customers. DataOps is a means to optimize the analytics and storage process in data analytics projects in a similar way to what DevOps does in application development projects. Both DevOps and DataOps rely on Continuous Integration and Delivery (CI/CD), cross-team collaboration, and large-scale provisioning and consumption [62].


From Lean Manufacturing

Lean manufacturing is a production method derived from Toyota’s operational model, “The Toyota Way”, in the 1930s [63]. The model focuses on improving quality and reducing non-value-added activities. The Agile and DevOps methodologies, proven to be efficient in the software industry, can be considered lean manufacturing approaches applied to software development [64]. Theoretically, manufacturing and data analytics are similar if we consider them from the pipeline process perspective. In manufacturing, raw materials are passed to workstations from the stock room, and a series of transformations are carried out using humans and machines to produce finished goods. Similarly, in data analytics, data are passed through several transformation and analytics nodes to generate reports and insights. In both cases, each step takes input from the previous step and creates output for the upcoming step. Both are identical in the way they use a set of operations to produce high-quality, consistent output. In addition to the Agile and DevOps methodologies inspired by lean manufacturing, another lean manufacturing tool, Statistical Process Control (SPC), is useful in DataOps process improvement. SPC manages the consistency and the input and output quality of the pipeline process in the manufacturing industry. To monitor and control the quality of the manufacturing process, SPC uses real-time product and process measurements. The functionality of the process and the delivery of a quality product are guaranteed within specific measurement limits. In data analytics, SPC can improve quality and efficiency by applying tests to data or models at the input or output of each step in the data analytics pipeline [8].

Summary

DataOps is a combination of statistical process control with Agile development and DevOps. Agile helps to deliver analytics results faster, DevOps automates the analysis process, and SPC from lean manufacturing tests and monitors the quality of the data flow in the entire data analytics lifecycle. DataOps combines the speed and flexibility of Agile and DevOps with the quality control of SPC. Manufacturing quality control principles combined with methodologies and tools from software engineering provide high-quality data analytics capability in a minimized process cycle for data and external users.
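To make the SPC idea concrete, the following is a minimal Python sketch of a control-limit check applied to the output of a pipeline step; the monitored statistic (a daily row count), the historical values, and the three-sigma limits are assumptions made for illustration, not values taken from the cited SPC literature.

# Minimal sketch of an SPC-style check at the output of a pipeline step:
# a monitored statistic must stay within control limits derived from history.

import statistics

def control_limits(history, k=3.0):
    """Upper/lower control limits as mean +/- k standard deviations."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - k * sd, mean + k * sd

def check_output(row_count, history):
    lower, upper = control_limits(history)
    if not (lower <= row_count <= upper):
        # In a DataOps pipeline this would raise an alert and stop promotion
        # of the data to the next step instead of just printing a message.
        print(f"Out of control: {row_count} outside [{lower:.0f}, {upper:.0f}]")
        return False
    return True

historical_counts = [1020, 998, 1005, 1012, 990, 1003, 1008]
check_output(1001, historical_counts)   # within limits
check_output(1450, historical_counts)   # flagged as out of control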


Figure 8: DataOps has evolved from lean manufacturing and software methodologies (Source: DataKitchen [8])

2.4.2. DataOps Principle

A total of 18 principles listed in the DataOps Manifesto [43] capture the best practices created by the people and organizations supporting DataOps. The DataOps principles summarize the best practices, goals, philosophies, mission, and values of DataOps practitioners. Below is a summary of the DataOps principles presented in the DataOps Manifesto [43].
1. Continually satisfy the customer: DataOps should prioritize customer satisfaction and fast delivery of results by putting them first in each stage of the data analytics process.
2. Value working analytics: Data analytics should provide insightful analytics results in the end. To deliver that, DataOps should incorporate accurate data with a suitable analysis framework and system.
3. Embrace changes: Changes are inevitable and should be welcomed, whether they come from the customer or from internal process and system changes.
4. It is a team sport: An analytics project is a team effort with people from different backgrounds, roles, skills, and favored tools. Teams should create an innovative environment by accepting diversity and opportunities to grow.
5. Daily interaction: Constant communication between customers, analytics team members, and operations should take place every day throughout the project.
6. Self-organize: Self-organizing teams and team members always foster the best algorithms, designs, requirements, insights, and architectures.


7. Reduce heroism: The team should rely on each other’s competence rather than waiting for individual heroism to lift the entire project. From the beginning, the people and processes should be sustainable and scalable.
8. Reflect: Self-reflection and proper feedback mechanisms are used to check and balance performance.
9. Analytics is code: Analytics teams use different tools for specific tasks that generate code and configuration describing the data manipulation process.
10. Orchestrate: Everything in the data pipeline (data, code, tools, and environment) should be orchestrated in an end-to-end manner.
11. Make it reproducible: Versioning of data, hardware and software configuration, and code is essential to reproduce the same result.
12. Disposable environments: For minimal project cost, disposable environments should be used so that no extra cost occurs when a project completes.
13. Simplicity: Focus on essential work only. Extra and unnecessary work should be avoided as much as possible.
14. Analytics is manufacturing: Constant production of insight is critical, with efficiency and quality control.
15. Quality is paramount: The analytic pipeline should have a built-in mechanism for quality control with continuous feedback on data, processes, and systems.
16. Monitor quality and performance: Variations in quality and performance should be continuously monitored and managed.
17. Reuse: Avoid redoing the same task.
18. Improve cycle time: Always try to reduce the delivery cycle by optimizing the production process.
From the principles listed above, we can say that the manifesto puts team communication over tools and processes. Experimentation, iteration, and feedback are more important than designing and developing the whole pipeline upfront. A sense of responsibility and cross-functional collaboration increase project efficiency, reducing individual siloed responsibilities and heroism. Customer collaboration always takes priority over contract negotiation.

2.4.3. DataOps Implementation

When it comes to the implementation of DataOps in the data analytics pipeline, there is no well-defined approach. However, different organizations and DataOps platform providers [47,48] have proposed their own implementation methods. In this


section, we discuss and summarize the implementation practices proposed by iCEDQ1 and DataKitchen2. According to the DataKitchen whitepaper, a data analytics team can implement DataOps in seven simple steps [8]. In the iCEDQ whitepaper [67], the implementation process has three sections: people culture, process practice, and tools. DataKitchen focuses more on technical implementation, whereas the iCEDQ proposal is more of a holistic approach to shifting organizational culture to smooth the technical aspects of DataOps implementation. In the next section, we provide a summary of both the iCEDQ and DataKitchen DataOps implementation processes.

A. DataKitchen Implementation (Source: DataKitchen [49]).
a. Add data and logic tests
Adding tests at each step of the data analytics pipeline ensures the integrity of the output by verifying whether intermediate results meet expectations. Data, models, and logic should be tested at each step before going to production delivery. If a change to an artifact is required, proper testing is required before deploying it to production.
b. Use a version control system
Versioning of data, code, and configuration is essential to track the project’s data lineage and process authenticity. It will be easier to reproduce and reuse the project in the future with version control.
c. Branch and merge
Branching plays a vital role in boosting productivity and the freedom to experiment. It gives team members the freedom to work independently without affecting the performance of others. They can set up their own experiments, make changes, run tests and, if satisfied, integrate into the central development environment before deploying to production.
d. Use multiple environments
All team members working in one production environment may lead to conflicts. So, there should be multiple environments, at least for testing, for each team member. Version control, branching and merging, and multiple

1 https://icedq.com/
2 https://datakitchen.io/


environments work together to boost performance and reduce conflicts among team members.
e. Reuse and containerize
It is messy to implement a whole data analytics project as one pipeline, and it is hard to reuse the pipeline’s steps. While implementing the data analytics pipeline, try to break the pipeline into the smallest units and containerize them using container/orchestration technology for easy reuse and access.
f. Parameterize the processing
The data analytics pipeline should be flexible enough to handle different runtime conditions, such as which version of raw data to use, which data should be sent to production and which to testing, data validation according to business requirements, and which pipeline steps should be used for a particular set of data. In short, the data pipeline should be flexible enough to handle the possible alternatives that can occur throughout the project (a sketch combining this step with the tests of step a appears at the end of this subsection).
g. Work Without Fear™
With the right set of tools and technology, data professionals can be confident in avoiding overcommitment, heroism, and silly errors. DataOps assures quality through two critical workflows, the value pipeline and the innovation pipeline, to reduce deployment errors and eliminate data professionals’ fear.
B. iCEDQ Implementation (Source: iCEDQ [66]).
a. Identify the people and their culture
DataOps is about breaking the barriers between development and business teams. An organization should implement a cultural shift to remove the boundaries between the stakeholders of data analytics projects. If cultural barriers did not exist, the project could run in parallel, with each team member contributing their part simultaneously in different stages of the lifecycle.
b. Get the automation tools for DataOps
Without proper automation tools, DataOps is impossible to implement. Organizations must make all the necessary tools ready to implement DataOps. Software related to versioning, code repositories, task management, data repositories, CI/CD, and production monitoring are some categories that are very useful in DataOps implementation.

c. Define the DataOps practice: After ensuring that the organizational culture is aligned with DataOps practice and acquiring the essential software and hardware tools, the next step is to define the DataOps practice, in which requirements, development, testing, the production process, and management tasks are defined.
i. Develop and integrate with a code repository: It is essential to develop against a code repository so that code can be versioned and changes tracked in each step of the data pipeline.
ii. Implement a CI/CD pipeline: Another crucial step in DataOps is to implement a CI/CD pipeline so that development data and code can be easily deployed to production after testing. The following aspects should be considered while implementing a CI/CD pipeline in DataOps.
a) Continuous integration: With code stored in repositories with branching and versioning, it is easy to select a version according to the release plan and integrate it with CI/CD tools.
b) Continuous deployment: The integrated code is pulled by the CI/CD tool and deployed from development to the test environment.
c) Initialization test: Once the data and code are fetched, the CI/CD tool executes the test process to validate code, data, results, and process against predefined quality requirements, which are integrated and deployed along with the code and data.
d) ETL/Report execution: After the tests complete, the next step is to process data in the production environment to generate distributable reports and results.
e) ETL/Report testing: Another crucial step is to test the accuracy of the ETL process and the report quality against the test environment results to cross-verify the entire DataOps pipeline.
f) Production monitoring: Once the whole pipeline is deployed in the production environment, the production environment must be monitored continuously by setting up


testing rules. If there is a change in code, data, tools, or steps, the entire CI/CD cycle will be repeated.
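Both vendors emphasize automated data and logic tests at every pipeline step. The following is a minimal sketch of such a step-level check in Python; the function name, expected fields, and threshold are illustrative assumptions and are not taken from either vendor's tooling.

import json
from pathlib import Path


def check_step_output(output_dir, min_files=1):
    """Fail fast if an intermediate pipeline step produced no or malformed output."""
    files = list(Path(output_dir).rglob("*.json"))
    # Assumption: the step is expected to write at least `min_files` JSON files.
    assert len(files) >= min_files, f"expected at least {min_files} JSON files in {output_dir}"
    for f in files:
        with open(f, encoding="utf-8") as fh:
            record = json.load(fh)  # raises an error if the file is not valid JSON
        # Assumption: every record carries an identifier field; adapt to the real schema.
        assert "id" in record or "releases" in record, f"unexpected schema in {f.name}"


if __name__ == "__main__":
    check_step_output("output_data/")

A check like this can run as its own task in the pipeline, so a failing test stops the deployment before faulty data reach production.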

2.4.3.1. Summary
From the two DataOps implementation approaches discussed above, we establish our own DataOps implementation guidelines as follows.
1. Set a DataOps culture
Start DataOps by identifying the people and culture in an organization. Establish management control, a communication process, and project management that align with the processes and tools that will be used inside the organization.

2. Automate and orchestrate
Use automation and orchestration tools to reduce manual work. When team members collaborate to execute data analytics projects, automate as many tasks as possible. With orchestration, tools and technologies can be integrated more easily to automate data analytics projects.
3. Use version control
Versioning is essential for tracking data, documents, and code. Data governance, data provenance, and data lineage can rely on version control tools to some extent. Furthermore, with version control, team members can create their own versions of the work, which are later merged for implementation.

4. Reuse and containerize
Do not waste time redoing the same thing if it can be reused. Furthermore, containerizing applications and pipelines helps reduce the risk of failure due to external circumstances.

5. Set up multiple environments
Setting up separate environments for production and development gives the flexibility of innovation and change management without risking ongoing pipeline execution. Within the development environment, each data worker should have their own work environment so that everyone can work independently without affecting the performance of others.

6. Test and test
Without testing, confidence in pipeline quality cannot be guaranteed. Create test cases to cover every possible corner of the pipeline (data, code, system,


and output). Test extensively before releasing the data pipeline, or changes to it, into the production environment.

7. Continuous integration and deployment
To assemble the work of various data workers and put it into a test environment, use continuous integration; after all test cases pass, use continuous deployment to release the work into the production environment.

8. Continuous monitoring
Monitor the production and development environments regularly to track overall data pipeline performance, the quality of the pipeline's inputs and outputs, and the performance of the tools and technologies used in the pipeline. Always cross-check the results of the two environments. With continuous monitoring, statistics on system performance and quality can be recorded, and analysis of the test results always reveals scope for further improvement.

9. Communicate and collaborate
Continuously communicate with customers, stakeholders, and team members. Keep the communication loop as short as possible so that messages travel faster. Where required, create collaborative workspaces between people and tools, and between tools, so that tasks produce better results.

2.4.4. Discussion
Since the term was first established, significant contributions to defining and practicing DataOps have been made. DataOps enthusiasts collaborate to create common principles that offer uniformity in applying the methodology in heterogeneous data operation environments. Despite all these efforts, certain ambiguities about applicability remain due to the diverse nature of its field of application. Data analytics itself is a broad field where numerous tools, approaches, and technologies can lead to the same result. Nevertheless, DataOps advocates collaboration, quality control, and faster delivery of the data analytics pipeline by extending the proven DevOps methodology from the SDLC and combining it with Agile and Lean Manufacturing's SPC. With these three methodologies as a reference point, DataOps has been continuously evolving into an efficient and reliable methodology for data lifecycle management. Companies like DataKitchen and iCEDQ are actively involved in developing data analytics tools supporting DataOps principles and contribute to DataOps research and development by publishing their work publicly. For instance, the DataOps implementation guidelines they have presented (summarized in Section 2.4.3) are derived from their project experience. Although they designed the

guidelines with their own tools in mind, the guidelines still hold for implementing DataOps with other tools and technologies. Both implementation approaches satisfy the DataOps principles, which shows the uniformity in DataOps practice. Drawing on the study of the two implementation approaches and the DataOps principles, we created our own implementation guidelines in Section 2.4.3.1, which are derived from both practices. The implementation guidelines we created fulfil the DataOps principles manifesto.

2.5. Data Governance
According to the Data Governance Institute (DGI), data governance is a system for information-related processes in an organization that provides accountabilities and decision rights according to an agreed-upon information management model [68]. Here, the information management model describes when a particular action is taken, by whom, in which circumstances, and using which method. Data governance is an orchestration of people, processes, and technology to establish data and information as organizational assets [69]. An organization must therefore protect data as valuable assets by establishing an organizational culture of proper data handling. With data governance, organizational data become more manageable, usable, accessible, reliable, and consistent. Data governance manages information strategically and tactically, involving the organization's managerial and technical branches to implement control policies for quality and accessibility throughout the business process [70]. It is a continuous and iterative process, constantly improved and modified according to changing conditions [71].

2.5.1. Data Governance with DataOps
In the section above, we defined data governance from an organizational perspective. Looking at it from the perspective of the data analytics process, data move through several stages to extract valuable information. As data travel through the lifecycle stages with a series of modifications, it is easy to lose track of their quality, consistency, and accessibility. Data governance is the set of principles and practices that ensures the quality of the data operation process throughout the data lifecycle. With the DataOps principles [43] related to quality assurance and monitoring, data governance practice is implemented in the data analytics pipeline, focusing on all three aspects: data, process, and technology. Data governance has two stages [72]: first, policy-level execution, where top-level people discover and design the overall policy for

handling data; the second stage is about implementing, automating, monitoring, and measuring the policy. With DataOps in action and suitable tools and technology, it becomes easier to execute the data governance policy in the data analytics process [55,33].

2.6. Data Provenance and Lineage
Data provenance identifies the data and the transformation processes used to generate a given data instance [74]. With data provenance, the historical record of data and its origins can be associated with output results by tracking the inputs, systems, entities, and processes used to produce the desired output [75]. Data provenance is useful in the data analytics pipeline for tracing the dataflow back to its origin for debugging and quality assurance. In each step of the analysis, the inputs and operators associated with the generated output must be recorded. There are several forms of data provenance, such as copy-provenance and how-provenance [58,59], but in data analytics the most common approach to tracking the data transformation history is why-provenance, or lineage, as introduced in [78]. Data lineage captures the origin of data, the sequence of operations applied to it, and its state after each operation [79]. With data lineage, tracing errors back to their source becomes simpler in the data analytics process [80]. Tracking data movement is no longer a voluntary task that organizations perform at their convenience and for internal benefit. With the increased manipulation and misuse of data in recent years, binding rules and regulations are emerging to make companies more transparent and responsible when handling data. The United Nations, the United States of America, and the European Union are taking initiatives to bind organizations through data protection and privacy law. GDPR3 is a regulation under European Union law on data protection and privacy inside the European Union and the European Economic Area. GDPR Article 30 [81] states that organizations need to maintain a record of data processing activities, listing who processes the data and detailing the processor. Article 17 [82] and Article 20 [83] cover the right to erasure and the right to data access and portability, respectively. To follow the regulation, organizations should establish measures for tracking data movement, the purpose of the movement, the systems and people involved, the versioning of data after each transformation, storage information and transparency about the data storage period, and the purpose of collecting the

3 https://gdpr.eu/

data. Data lineage is a good way to ensure data quality for regulatory as well as operational purposes.
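To make this concrete, a minimal provenance record can be captured with very little code. The sketch below appends one entry per transformation step to a log file; the field names and file layout are our own assumptions, not a standard.

import json
import getpass
from datetime import datetime, timezone


def record_lineage(step, script, inputs, outputs, log_file="lineage_log.jsonl"):
    """Append one provenance entry describing who ran which step on which data."""
    entry = {
        "step": step,
        "script": script,
        "inputs": inputs,
        "outputs": outputs,
        "executed_by": getpass.getuser(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    record_lineage("transform", "transform_script.py", ["raw_data/"], ["clean_data/"])

A log of this kind, kept next to versioned data and code, already answers the basic regulatory questions of who processed which data, when, and with which script.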

2.6.1. Data Provenance and Lineage with DataOps
Data provenance and lineage in DataOps start with defining the data governance principles. Making the organizational culture transparent in every step of data handling is the first step in implementing lineage features in data analytics projects. Since DataOps builds data analytics projects by tracking the movement of data through the lifecycle, involving people, process, and technology, data lineage is the part of DataOps that manages the history of data operations in the lifecycle, using tools and technologies such as metadata management [84], data versioning [78], blockchain [85,86], and data lineage software.

2.7. From Data Pipeline to DataOps Pipeline
DataOps is a rising set of Agile and DevOps practices, technologies, and processes to construct and elevate data pipelines with quality results for better business performance [87]. Data analytics deals with massive amounts of data from various sources, and managing large-scale data sources is challenging. Managing data storage and tracking data transformations is challenging as well. The larger the amount of data, the harder it is to understand, and at some point it becomes painful to comprehend what the data contain and what is happening to them inside the data pipeline. Documentation and tracking become tedious and time-consuming, especially when different parties need to be satisfied. Traditionally, data pipelines deal with this issue in an unstructured, manual way. With DataOps in data analytics, the process can be structured and automated to make it less tedious. Data and analytics pipeline development is still a skill-based job with a minimally reused, non-repeatable process, handled by individuals in isolation using different tools and practices [12]. Data pipelines, in general, suffer from poor collaboration and synchronization across the project. The traditional approach to building a data pipeline focuses on the result of the project, so the primary effort goes into designing and developing the ETL (extract, transform, load) or ELT (extract, load, transform) algorithm, the analysis algorithm, and assembling the core tools to implement the project. Moreover, the majority of tasks are performed by individuals working independently; jobs become repetitive, and by the end of the project there are different replicas of the same project with different users, each with their own modifications. Consistency is one major issue DataOps eliminates through collaboration and role-based job distribution with synchronization inside and


outside the organization during and after data analytics pipeline deployment, and consistency not only in the data pipeline but also in its end products. Data and data analytics requirements tend to change rapidly, and previously built data pipelines become obsolete for new business insights, so the data pipeline operation may need to change to obtain a new insight. Starting from scratch is time-consuming and difficult. Without DataOps, the development and production environments are never separated, and changed tasks must be tested in the production environment itself, which puts the entire project at risk. With DataOps, a separate development environment gives the freedom of innovation and modification without affecting the running project. Moreover, it is not only about addressing change: with a separate development environment, data professionals can continuously improve the existing pipeline for a better outcome. The two environments are connected through a CI/CD pipeline, so that work is deployed to production after tests pass in the development environment. To obtain the DataOps pipeline, Agile and DevOps speed and SPC's quality control are applied to the data pipeline [8]. As a result, quick, elastic, and robust analytics can cope with ever-changing requirements and data. With the DataOps approach, the data pipeline is treated as a joint function of the involved people, processes, and technology, optimizing the performance and quality of the entire project rather than focusing on the productivity of any single entity. With DataOps in analytics, project priority is on communication, collaboration, integration, automation, measurement, and cooperation between data scientists, analysts, data/ETL engineers, information technology (IT), and quality assurance/governance [42]. This boosts the involvement of the whole analytics team in the project, with rapid results and continuous performance and operational improvement.

Figure 9: DataOps pipeline in general

Figure 9 illustrates how the DataOps pipeline advances the regular analytics pipeline from Figure 6 by adding the functionalities of proven methodologies (Agile, DevOps, and Lean) and including all related parties (people, process, and technology) for each designated task. To achieve this, automated platforms and tools should be incorporated into viable tasks, and an organization-wide DataOps culture should be created [8]. The goal of using DataOps in data analytics is to deliver more and better analyses rapidly, economically, and with high quality.

2.8. Analysis of Related Work
In this section, related work is presented, and the work that inspired this project is discussed. Since the term was first coined in 2015, there have been several efforts to define DataOps. Even though there are few works on DataOps in the scientific research field, various resources in the commercial domain have helped in understanding it. This section therefore divides the related work into two fields and summarizes the works separately.

2.8.1. Commercial Contribution to DataOps
Some companies strongly advocate DataOps and deliver products that support the DataOps principles. DataKitchen, iCEDQ, IBM4, and Eckerson5 are representative organizations involved in practicing and exploring DataOps. These companies were chosen based on the research and contributions they have made in the field of DataOps.
DataKitchen
DataKitchen is one of the leading solution providers for DataOps. It is continuously working on establishing DataOps as a methodology for data analytics projects, regularly contributing to the development of DataOps products and conducting industry-focused research in the field. DataKitchen's most significant contributions are the DataOps manifesto and generic implementation guidelines. In addition, they are continually expanding their focus to cover every aspect of the data analytics lifecycle by introducing tools and services for generic DataOps pipelines and industry-specific solutions. Publications such as "The DataOps Cookbook" are among the pioneering resources for understanding DataOps, containing vital information about the origin of the domain and a practice guide with industry use cases. On top

4 https://www.ibm.com/analytics/dataops 5 https://www.eckerson.com/
of their publications, they provide various tools supporting the DataOps methodology for different stages of the data lifecycle, such as orchestration, testing and monitoring, analytics, and CI/CD deployment.
iCEDQ
iCEDQ provides an ETL testing and data monitoring platform that helps companies accelerate the development and testing of data projects. Their products are based on the DataOps methodology to provide agile service with quality data analytics projects. Products such as ETL testing, data warehouse testing, data migration testing, BI report testing, and production data monitoring are used in the insurance, healthcare, finance, and retail sectors. The whitepapers and blogs published by the company are useful for understanding how DataOps is applied in data projects.
IBM
IBM is a pioneer of the term DataOps. They have developed several products using DataOps principles, targeted at different stages of the lifecycle, and market them under their cloud service with in-house add-on features for data analytics projects. Several IBM products [88] are available for different lifecycle stages and purposes, such as collecting, analyzing, storing, organizing, and publishing data. Besides product development, IBM's research on DataOps has helped create the foundation of the concept, exploring different approaches to data analytics projects and optimizing the data analysis process.
Eckerson
Eckerson is a global research and consulting firm for data analytics. The Eckerson Group has published several research reports defining DataOps [12], exploring the use of DataOps [11], giving selection criteria for DataOps tools [89], and presenting industry insights that promote DataOps in data analytics [90]. The work done by the company is a baseline for other researchers exploring the applicability of DataOps in various data analytics projects.


Table 2: Summary of DataOps contributions from companies

DataKitchen (DataOps product: Yes)
Focus areas in DataOps: establishing principles; implementation guide; establishing a definition; webinars and podcasts on DataOps; orchestration, CI/CD pipelines, testing, and monitoring.
Contribution resources in DataOps: 1. The DataOps Cookbook [8]. 2. The DataOps Manifesto [42]. 3. DataOps blogs6. 5. DataOps whitepapers7. 6. DataOps case studies8.

iCEDQ (DataOps product: Yes)
Focus areas in DataOps: implementation guide; quality assurance with DataOps; data integration; ETL testing; data migration testing.
Contribution resources in DataOps: 1. DataOps Implementation Guide [66]. 2. Data Management Implication [91]. 3. QA Challenges in data integration projects [92].

IBM (DataOps product: Yes)
Focus areas in DataOps: study of the DataOps implementation process; establishing DataOps objectives; data lifecycle management.
Contribution resources in DataOps: 1. Deliver business ready data fast with DataOps [93]. 2. Wrangling big data: Fundamentals of data lifecycle management [28]. 3. Blogs [94–96].

Eckerson (DataOps product: No)
Focus areas in DataOps: survey of DataOps market practice; understanding of DataOps; establishing a DataOps framework.
Contribution resources in DataOps: 1. Best Practices in DataOps: How to Create Robust, Automated Data Pipelines [11]. 2. Trends in DataOps [90]. 3. The Ultimate Guide to DataOps Product Evaluation and Selection Criteria [89]. 4. DataOps: Industrializing Data and Analytics Strategies for Streamlining the Delivery of Insights [12].

6 https://datakitchen.io/blog/ 7 https://datakitchen.io/dataops-white-papers.html 8 https://datakitchen.io/dataops-case-studies.html

2.8.2. Scientific Contribution to DataOps
To the author's best knowledge, works to define and establish DataOps have only recently been emerging, and there are not yet many published research papers. Some of the academic and research contributions that have influenced this thesis project are discussed below. In 2020, Raj et al. published "From Ad-Hoc Data Analytics to DataOps" [97], in which they define DataOps with a general scope of usability based on an extensive literature review. The paper presents a case study of a large telecommunication company showing how its infrastructure and processes evolved over time to support DataOps, and shows the different stages of the evolution of the data analytics process with respect to the DataOps approach. In 2018, "DataOps – Towards a Definition" [13] by Julian Ereth contributed to the academic elaboration of DataOps as a new discipline. The paper presents a body of knowledge and a working definition of DataOps with an initial research framework for the field based on interviews with industry experts. The paper concludes that DataOps should be elaborated as a discipline by researching process and governance, related technologies and tools, and by investigating the value proposition it brings to business.

Chapter 3. Method
This section presents the approaches we followed in carrying out the whole project. We divide the work into two parts and, for each part, separately describe the process of conducting the research. The first part explores DataOps through explorative qualitative research using a literature review and online research. The outcome is a definition of DataOps with a discussion of the ambiguities present in the field, the challenges that need to be addressed while implementing DataOps, and a list of tools and technologies used to implement DataOps with the features they provide in different stages of the data lifecycle. The second part is an example implementation of an existing data analytics project using some practical tools for DataOps. We present these research methods separately in this section.

3.1. Exploring DataOps as a Data Analytics Methodology
In this section, we present the detailed process of the work we have done to define DataOps.

3.1.1. Research Process
The research process for exploring DataOps started by collecting literature from various sources. We used journals, articles, books, whitepapers, reports, theses, and online resources. Google9, Google Scholar10, ResearchGate11, IEEE12, the KTH Library13, KTH DiVA14, and Semantic Scholar15 were used, along with companies' websites to access company resources.

9 https://www.google.com/ 10 https://scholar.google.com/ 11 https://www.researchgate.net/ 12 https://ieeexplore.ieee.org/ 13 https://www.kth.se/en/biblioteket 14 https://kth.diva-portal.org/ 15 https://www.semanticscholar.org/

Table 3: Literature review overview
Keywords: DataOps, DataOps platform, DataOps pipeline, Data lifecycle, Data analytics, Data analytics automation, Data pipeline, Data analytics pipeline, DevOps, Agile, Agile methodology, Lean manufacturing, Statistical Process Control, Big data, Big data pipeline, Data governance, Data provenance, Data lineage, ETL pipeline, ELT pipeline.
Number of literature articles accessed: 157
Number of articles used: 71
Number of online resources accessed: 112
Number of online resources used: 39

We establish our theoretical background with the literature to fulfil research goals 1, 2, 3, and 4. The whole literature-based research process is illustrated in Figure 10. As shown in the figure, we explore DataOps (Section 2.4), the data pipeline (Section 2.2), data governance (Section 2.5), and data lineage and provenance (Section 2.6). Then, we explore the importance of data governance, lineage, and provenance in data lifecycle management and DataOps. As a result of the literature study of different data lifecycles (Section 2.1) and data governance, we establish a general data lifecycle with the data governance policies involved (see Figure 5 and Section 2.1.5). With the established data lifecycle and data pipeline, we explore the data pipeline's role in data lifecycle management (Section 2.3). We also show how the DataOps approach differs from the traditional data pipeline in data analytics projects (Section 2.7). Building on the literature review work in the theoretical background section, we present our research outcome in the result section. The result has three sections: first, based on the work in Section 2.4, we define DataOps; second, we establish the relation of DataOps to the data lifecycle; and third, we explore the tools and technologies available to support DataOps.


Figure 10: Illustration of the research process

3.2. Project Implementation Using the DataOps Methodology
To implement the project using DataOps, data from TheyBuyForYou16 (TBFY) were used, and the steps to extract RDF (Resource Description Framework) data from JSON (JavaScript Object Notation) data were followed. Two separate implementations were done for comparison purposes. The first implementation does not use orchestration and scheduling tools and runs in a plain Linux environment. The second implementation uses Docker containers with Apache Airflow17 as a job scheduler. Execution time is measured for both experiments. We also evaluate both implementations with respect to the DataOps principles to show to what extent each implementation follows the DataOps approach when executing the whole project.

3.2.1. Objective
The objective of implementing a data analytics project as part of this thesis work is to show how the DataOps project implementation approach differs from the existing implementation and to describe the advantages of following the DataOps implementation guidelines.

3.2.2. Project Introduction
TheyBuyForYou is a project to make the procurement process in the EU efficient, competitive, fair, and accountable by leveraging a large amount of data and advanced analytics capabilities [98]. The project developed an ontology network by

16 https://theybuyforyou.eu 17 https://airflow.apache.org/
joining and presenting tender data from OpenOpps18 and company data from OpenCorporates19 in a knowledge graph [99]. The project provides anomaly detection and cross-lingual document search [100]. In this thesis project, we implement the ETL pipeline that produces a knowledge graph from the JSON data provided by TheyBuyForYou on Zenodo20. The JSON files consist of combined and reconciled data from the OpenOpps and OpenCorporates APIs. We use partial data from the available data sets (the first 30 days, 1.08 GB) to implement the project, since computing the whole 16.9 GB (6 months of data) requires enormous computational resources. We used code and instructions from TBFY's GitHub21 repository, following the reuse concept whenever possible (a DataOps principle).

3.2.3. Design
We designed our ETL pipeline based on the TBFY ETL pipeline [100] to generate the knowledge graph. We start our pipeline by enriching the JSON data, because the OpenOpps and OpenCorporates APIs were not available to inject data directly and reconcile data from both sources into a JSON dataset. Since the JSON data are publicly available for reuse, we used the reconciled data and followed the remaining ETL process. Figure 11 shows the pipeline steps of the project.


Figure 11: ETL process to publish Knowledge Graph from JSON

The pipeline is composed of the following steps:
1. Enrich JSON data: In this step, we enrich the JSON files provided by TBFY by adding new properties and fixing missing identifiers.
2. Convert JSON to XML: After the JSON data are enriched, we convert them to XML to better support the mapping process that converts the data to RDF (a rough sketch of this conversion is given after the step list).

18 https://openopps.com 19 https://opencorporates.com 20 https://zenodo.org/record/3638068#.X7RdrmhKgiM 21 https://github.com/TBFY/knowledge-graph
3. Map XML data to RDF: The XML data files are converted to RDF by running RMLMapper to produce N-Triples files.
4. Publish RDF to database: In this step, the RDF (N-Triples) files are published to Apache Jena Fuseki and Apache Jena TDB.
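As a rough illustration of step 2 only (the real json2xml.py script handles many cases we do not show here), a JSON document can be wrapped in a root element and serialized as XML with the xmltodict library that the pipeline already depends on; the root tag name is an assumption.

import json
import xmltodict  # third-party library, listed among the pipeline dependencies


def json_file_to_xml(json_path, xml_path, root_tag="release"):
    """Wrap the parsed JSON object in a single root element and write it out as XML."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)  # assumed to be a JSON object, not a bare list
    xml_string = xmltodict.unparse({root_tag: data}, pretty=True)
    with open(xml_path, "w", encoding="utf-8") as fh:
        fh.write(xml_string)


if __name__ == "__main__":
    json_file_to_xml("example_release.json", "example_release.xml")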

3.2.4. Implementation
Two separate implementations were done for comparison purposes. Both implementations run on Linux (2 GHz dual-core processor, 8 GiB RAM, and 40 GB hard-drive space) using the services and tools described in the project's GitHub repository. The first implementation is performed without any automation tools: each task is run manually, without scheduling scripts or scheduling tools, and all tasks are performed directly in Linux. The second implementation uses Apache Airflow as workflow orchestrator and scheduler on top of Docker; we combined all steps into a single pipeline and orchestrated the whole process. Additionally, we separated the test and production environments for the second implementation; the reason for creating a test environment is to follow the DataOps implementation principles. Before implementing both approaches, we created and ran the Apache Fuseki service using the Dockerfile available in the TBFY GitHub repository, created a knowledge-graph dataset, and loaded the nace.ttl and opencorporates_identifier_system.ttl files into Fuseki.

1. KG pipeline steps, manual execution
After satisfying all the requirements described in the project implementation instructions in the TBFY GitHub repository, we ran the command-line scripts from step 1 to step 4 manually and sequentially, as shown in Table 4.

Table 4: Scripts with input and output in the KG ETL pipeline

Step 1, Enrich JSON (input: JSON; output: enriched JSON):
python enrich_json.py -s '2019-01-01' -e '2019-01-30' -i '/tbfy/2_JSON_OpenCorporates' -o '/tbfy/3_JSON_Enriched'

Step 2, Convert JSON to XML (input: enriched JSON; output: XML):
python json2xml.py -s '2019-01-01' -e '2019-01-30' -i '/tbfy/3_JSON_Enriched' -o '/tbfy/4_XML_Enriched'

Step 3, Map XML to RDF (input: XML and RML mappings22; output: RDF):
python xml2rdf.py -s '2019-01-01' -e '2019-01-30' -r '/tbfy/rml_mappings' -i '/tbfy/4_XML_Enriched' -o '/tbfy/5_XML_to_RDF'

Step 4, Publish RDF (input: RDF; output: RDF data published to Apache Jena Fuseki):
python publish_rdf.py -s '2019-01-01' -e '2019-01-30' -i '/tbfy/5_XML_to_RDF'

Table 4 lists the Python scripts run in the Bash shell, the input data generated between the start and end dates with the input file location, and the output file location. In the first three steps, the data are stored on the hard drive under the output file name specified in the script, whereas in the final step the data are published to Fuseki in the form of N-Triples. The detailed implementation procedure and all dependencies required to run each step can be found in the TBFY GitHub repository.

Figure 12: Running ETL pipeline steps manually

22 https://github.com/TBFY/knowledge-graph/tree/master/rml-mappings

2. KG ETL pipeline with orchestration and workflow management tools
In this implementation approach, we align the implementation process with one or more DataOps principles wherever applicable. We designed and developed separate development (testing) and production environments. In the development environment, we used five days of data for testing, and in the production environment, we used 30 days of data. To set up the development environment, we installed a virtual machine (2 GiB RAM and 15 GB hard drive) on the Linux workspace. Inside the virtual machine, we installed Docker, and inside Docker, we installed a Docker image of Apache Airflow from the puckel project23 and Apache Jena Fuseki. We used a single-node Apache Airflow setup with the BashOperator to execute the ETL pipeline.
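Since the test environment processes five days of data and production thirty, the date range is a natural candidate for parameterization (the "parameterize the processing" guideline). One possible DAG fragment is sketched below; it assumes an Airflow Variable named tbfy_date_range is defined per environment, which is our own convention and not part of the TBFY scripts.

# DAG fragment: parameterize the processed date range per environment
from airflow.models import Variable

# e.g. "2019-01-01:2019-01-05" in the test environment and
#      "2019-01-01:2019-01-30" in production
date_range = Variable.get("tbfy_date_range", default_var="2019-01-01:2019-01-05")
start_date, end_date = date_range.split(":")

enrich_command = (
    "python /python-scripts/enrich_json.py "
    f"-s '{start_date}' -e '{end_date}' "
    "-i '/tbfy/2_JSON_OpenCorporates' -o '/tbfy/3_JSON_Enriched'"
)

The resulting command string would then be passed to the BashOperator shown in Listing 1 instead of hard-coded dates, so the same DAG file can run unchanged in both environments.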

For the production environment, we installed Docker containers of Apache Airflow and Apache Jena Fuseki inside Docker. All code and configurations are the same in both environments. The detailed architecture of the KG ETL pipeline is presented in Figure 13.

Figure 13: Illustration of KG ETL pipeline implementation using Apache Airflow in Docker container

I. Tools and Technologies Used in the KG ETL Pipeline
Below we list and briefly describe the tools and technologies used to implement both pipelines.

23 https://github.com/puckel/docker-airflow

1. Apache Airflow
Apache Airflow is a Python-based open-source workflow automation and orchestration tool for setting up and maintaining data pipelines. It helps to manage, structure, and organize data pipelines using Directed Acyclic Graphs (DAGs). In Airflow, a DAG is a collection of all the tasks that need to be run, and it reflects their relationships and dependencies. Airflow is a metadata-based queuing system: the database stores the status of tasks, and the scheduler uses the status information from the metadata to schedule the tasks according to their priority and execution order. The executor determines the worker processes that execute each scheduled task. There are different executors in Airflow [101], and the workers execute the logic of the tasks based on the executor used.

Figure 14: Apache Airflow architecture

The Apache Airflow database comes with an SQLite24 backend, but we use Postgres25 for metadata storage because the puckel project makes it easy to install; Airflow also supports other external databases. The webserver handles and displays tasks through the UI and a REST interface connected to the metadata storage. Through the UI, task dependencies, execution time, success, failure, retry status, and logs can be monitored, and DAGs and tasks can be deleted, rerun, stopped, paused, and started directly. Figure 14 shows a simple architecture of Apache Airflow, in which all component information is stored in the metadata storage.

24 https://www.sqlite.org/ 25 https://www.postgresql.org/
Listing 1 is the DAG example for the KG ETL pipeline. In Apache Airflow, DAGs are created in Python and stored as Python files in the DAG folder. When Apache Airflow starts, it automatically detects the DAG files stored in the folder and executes the tasks according to the script instructions. The tasks are defined as t1, t2, t3, and t4, respectively, from start to end, and each task is executed in sequential order using the BashOperator. Each task calls its own Python script with input and output data folders (except t4, which publishes data directly to Apache Fuseki) and a date range (start and end date). On task completion or failure, a notification is sent to ease tracking of the pipeline status.

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 7, 16),
    "email": ["[email protected]"],
    "email_on_failure": True,
    "email_on_retry": True,
    "email_on_success": True,
    "retries": 1,
    "retry_delay": timedelta(minutes=3),
}

dag = DAG("tbfy-test", default_args=default_args, schedule_interval=timedelta(1))

# t1, t2, t3 and t4 are the ETL pipeline tasks, created by instantiating BashOperators
t1 = BashOperator(
    task_id="enrich-json",
    bash_command="python /python-scripts/enrich_json.py -s '2019-01-01' -e '2019-01-30' "
                 "-i '/tbfy/2_JSON_OpenCorporates' -o '/tbfy/3_JSON_Enriched'",
    dag=dag)
t2 = BashOperator(
    task_id="xml-json",
    bash_command="python /python-scripts/json2xml.py -s '2019-01-01' -e '2019-01-30' "
                 "-i '/tbfy/3_JSON_Enriched' -o '/tbfy/4_XML_Enriched'",
    dag=dag)
t3 = BashOperator(
    task_id="xml-rdf",
    bash_command="python /python-scripts/xml2rdf.py -s '2019-01-01' -e '2019-01-30' "
                 "-r '/tbfy/rml_mappings' -i '/tbfy/4_XML_Enriched' -o '/tbfy/5_XML_to_RDF'",
    dag=dag)
t4 = BashOperator(
    task_id="publish_rdf",
    bash_command="python /python-scripts/publish_rdf.py -s '2019-01-01' -e '2019-01-30' "
                 "-i '/tbfy/5_XML_to_RDF'",
    dag=dag)

# run the tasks sequentially
t1 >> t2 >> t3 >> t4

Listing 1: DAG on Apache Airflow of KG ETL pipeline

2. Apache Jena Fuseki26
Fuseki is a SPARQL server that supports query and update. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. Fuseki is tightly integrated with TDB to provide persistent and robust storage for RDF. In this project, Apache Jena Fuseki is used to provide a SPARQL endpoint for the knowledge graph, and TDB is used to store the RDF data set published in step 4 of the knowledge graph pipeline.
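For illustration, an N-Triples file can be published to Fuseki over its HTTP interface with the requests library; the sketch below assumes a local Fuseki instance with a dataset named kg, and the actual publish_rdf.py script may use a different endpoint or protocol.

import requests


def publish_ntriples(nt_path, data_endpoint="http://localhost:3030/kg/data"):
    """POST the file content to the dataset's data endpoint as N-Triples."""
    with open(nt_path, "rb") as fh:
        response = requests.post(
            data_endpoint,
            data=fh,
            headers={"Content-Type": "application/n-triples"},
        )
    response.raise_for_status()  # fail loudly if Fuseki rejects the upload


if __name__ == "__main__":
    publish_ntriples("example_output.nt")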

3. RMLMapper27
RMLMapper executes RML rules to generate high-quality linked data (RDF) from structured or semi-structured data sources. In our pipeline, we use two RML mapping files: openoppsmapping.ttl provides rules for transforming XML data from OpenOpps into RDF based on the OCDS ontology28, and opencorporatesmapping.ttl provides rules for transforming XML data from OpenCorporates into RDF based on the euBusinessGraph ontology29. This task is performed in step 3 of the pipeline, where one XML file is given as input and RMLMapper produces output in N-Triples (.nt) format.
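Since RMLMapper is distributed as a Java command-line tool, step 3 can invoke it as a subprocess. The sketch below is a hedged illustration: the jar location is an assumption, and the -m/-o flags should be verified against the rmlmapper-java release in use.

import subprocess


def run_rmlmapper(mapping_file, output_file, jar_path="rmlmapper.jar"):
    """Execute the RML mapping rules and write the resulting triples to output_file."""
    subprocess.run(
        ["java", "-jar", jar_path, "-m", mapping_file, "-o", output_file],
        check=True,  # raise CalledProcessError if the mapping run fails
    )


if __name__ == "__main__":
    run_rmlmapper("openoppsmapping.ttl", "openopps_output.nt")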

4. BashOperator30
In Apache Airflow, we used the BashOperator to execute the pipeline tasks. The BashOperator executes a Bash script, command, or set of commands in the Bash31 shell.
5. Docker32
Docker is an open-source platform for developing, shipping, and running applications that separates applications from infrastructure to deliver software products quickly [102]. In Docker, infrastructure is managed in the same way as applications, using Docker's methodologies for shipping, testing, and deploying code quickly. Since it is a container-based platform, Docker containers are highly portable and run irrespective of the underlying infrastructure. A Docker container packs an application with all its runtime dependencies and libraries and publishes it as a package that runs on any other (Docker-supported) machine without customization of the infrastructure or application. In our pipeline implementation, we used Docker to host Apache Airflow

26 https://jena.apache.org/ 27 https://github.com/RMLio/rmlmapper-java/releases 28 https://github.com/TBFY/ocds-ontology/blob/master/model/ocds.ttl 29 https://github.com/euBusinessGraph/eubg-data/blob/master/model/ebg-ontology.ttl 30 https://airflow.readthedocs.io/en/latest/_api/airflow/operators/bash/index.html 31 https://www.gnu.org/software/bash/ 32 https://www.docker.com/
and Apache Jena Fuseki. With the Docker images available for both applications, we do not have to worry about modifications and changes.
6. RDF and N-Triples
RDF33: The Resource Description Framework (RDF) is an infrastructure that enables the encoding, exchange, and reuse of structured metadata by providing a mechanism that supports common conventions of semantics, syntax, and structure to associate properties with resources [103]. In our ETL pipeline, we create RDF data by combining information from the two data sources into unified metadata that can be encoded, exchanged, and reused further.
N-Triples34: N-Triples is a line-based, plain-text format for transmitting and storing RDF data. N-Triples content is generally stored in '.nt' files, and the character encoding is 7-bit US-ASCII. In step 3 of the pipeline, the XML data are converted to RDF using RMLMapper, and the RDF data are stored in N-Triples format.

II. List of Dependencies and Libraries Required to Run the Pipeline
To run the KG ETL pipeline in a Docker container using Apache Airflow, or directly on a Linux server, the following libraries, modules, environments, and packages are necessary.
1. Python modules: The ETL pipeline Python scripts use the following modules:
• requests35: to easily send HTTP/1.1 requests to the Fuseki server while publishing RDF.
• dpath36: this Python library allows accessing files inside folders by searching dictionaries via '/slashed/paths ala xpath'. It is used to access and create folders and to add JSON/XML/.nt files inside folders for the data input and output of the pipeline steps.
• xmltodict37: this library simplifies working with XML and eases the transformation.
• python-dotenv38: it reads key-value pairs from the '.env' file and adds them as environment variables. The library is helpful for authenticating

33 https://www.w3.org/TR/PR-rdf-syntax/ 34 https://www.w3.org/2001/sw/RDFCore/ntriples/ 35 https://pypi.org/project/requests/ 36 https://pypi.org/project/dpath/ 37 https://pypi.org/project/xmltodict/ 38 https://pypi.org/project/python-dotenv/
with Fuseki for RDF storage and with the Postgres database for Airflow metadata storage.
2. Java: RMLMapper runs on Java, so Java needs to be installed on the host machine/environment (Linux, Apache Airflow).
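As a small usage sketch of the last point, credentials can be loaded from a .env file and passed to requests when talking to Fuseki; the variable names FUSEKI_USER and FUSEKI_PASSWORD and the ping endpoint are our own placeholders, not taken from the TBFY scripts.

import os
import requests
from dotenv import load_dotenv

load_dotenv()  # reads key-value pairs from a local .env file into the environment
user = os.getenv("FUSEKI_USER")
password = os.getenv("FUSEKI_PASSWORD")

# Assumption: a local Fuseki instance with its admin ping endpoint enabled.
response = requests.get("http://localhost:3030/$/ping", auth=(user, password))
print(response.status_code)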

Chapter 4. Result and Evaluation
This section presents the results of our literature review work to define DataOps, explores the different tools used in different stages of the data lifecycle, and evaluates our project implementation. We start by defining DataOps: we list definitions from various scholars and then give our own definition based on the understanding drawn from examining the literature. Second, we make a feature-based comparison of various tools that support DataOps, categorizing them into functional areas based on their usage. Finally, we compare our data pipeline implementation approaches based on the DataOps principles.
4.1. DataOps as a Data Analytics Methodology
To establish DataOps as a data analytics methodology, we first define DataOps, outlining the current ambiguities and listing the challenges that must be addressed when implementing it. We present DataOps as a better approach to handling the data lifecycle by outlining the benefits of the approach for data lifecycle management. Furthermore, we provide a list of tools with the features they offer for accomplishing tasks in data analytics projects.
4.1.1. DataOps Definition
To define DataOps, we used the literature presented in Section 2.4. Most authors present their definition in line with the DataOps components and a set of goals. Based on our analysis of the literature, we divide DataOps definitions into three categories. First, goal-oriented definitions focus on the DataOps goal or result. Second, task-oriented definitions emphasize the tasks performed during data analysis projects to facilitate a continuous flow of data. Third, process- and team-oriented definitions emphasize the tools and technologies used for collaboration, monitoring, testing, and product delivery. A summary of the definitions is given in Table 5 below.

Table 5: Analysis of DataOps definitions from the literature

Goal-oriented:
- DataOps takes data from the source and delivers it to the person, application, or system that produces business value [41].
- DataOps delivers predictable outcomes and better change management of data, technology, and process [48].
- Data pipelines and applications are reused, automated, and reproduced easily with DataOps [47].
- DataOps eliminates the friction in data consumption with the right technology in use [46].
- DataOps gives the ability to utilize large amounts of heterogeneous data with low risk and high velocity [46].
Task-oriented:
- An analytics process that spans from data collection to delivery of information [42].
- Develop and deliver data analytics projects in a better way [43].
- Creates a dynamic data analytics team that continuously evolves to meet new requirements [48].
- DataOps is an enabler of success rather than just a product [46].
Process- and team-oriented:
- A combination of the value and innovation pipelines connected through continuous integration and deployment tools [44].
- A data management approach that improves communication and integration between previously inefficient teams, systems, and data [45].
- DataOps is not just DevOps for data analytics [31].
- A collaborative approach to the collection and distribution of data with automation and controlled access for data users to preserve data privacy and integrity [39].
- Leverages technology for data delivery automation with security and quality as top priorities [48].

From Table 5 it is noticeable that all definitions share some common elements. Most of the terms used to define DataOps are related in some way to Agile, DevOps, and Lean Manufacturing principles. Even though each definition has a particular focus area, all of them serve collaboration, orchestration, monitoring, and automation to generate quality outcomes. After careful analysis of the different definitions from different sources, we formulate our own DataOps

definition, considering people's involvement, process, and technology, to make the data analytics process manageable, trackable, and efficient with quality results.

DataOps Definition: DataOps can be defined as a data pipeline development and execution methodology that assembles people and technology to deliver better results in a shorter time. With DataOps, people, processes, and technology are orchestrated with a degree of automation to streamline the flow of data from one stage of the data lifecycle to another. Using the best practices of Agile, DevOps, and SPC in combination with technologies and processes, DataOps promotes data governance, continuous testing and monitoring, optimization of the analysis process, communication, collaboration, and continuous improvement.
Listing 2: DataOps definition

From the establishment of the concept, DataOps has focused on delivering robust, rapid, collaborative, and quality-driven data analytics projects by integrating technology and people and redesigning processes. DataOps is an advancement of the existing isolated data pipeline that provides better monitoring and better results. DataOps is not an absolute method with scientifically proven, predefined rules and steps. Instead, it is a progressive approach to create a better work environment for data workers, deliver faster data analysis results to stakeholders, track data movement with the ability to associate cause and effect, and reduce cost and effort without compromising results. The necessity of DataOps lies in the fact that data are assets, and the value of assets depends on how well an organization uses them in its operation. The term 'DataOps' itself is a combination of 'Data' and 'Operations', which conveys the idea of operating on data/assets to deliver business goals in an organization. If data are assets, then reports, insights, and information are products, and data analytics is the process of utilizing and converting the assets into products. A readily available, high-quality product at low cost generates better business value, and to deliver quality products faster at low cost, innovation in the business process is a must. In data analytics, DataOps provides the innovation needed to deliver data products better: innovation in terms of redesigning the organizational culture, creating a collaborative workspace, managing and monitoring processes and results, reducing delivery cycle time, and improving the product itself. By combining people, process, and technology with Agile, DevOps, and SPC's methods of delivering and monitoring products, DataOps makes data operations manageable. As presented in Section 2.4.1, DataOps is an association of practices derived from Agile, DevOps, and Lean Manufacturing. Reducing the data analysis cycle

time, automating analysis tasks and processes, and managing the quality of data and data-derived products are the core contributions of Agile, DevOps, and Lean Manufacturing, respectively. From the contributions of these three methodologies, DataOps creates the foundation for managing the data lifecycle. Agile and DevOps cultivate the essence of collaboration and automation to reduce product delivery time, while SPC enforces quality through continuous monitoring of the analytics process and products. On top of the processes derived from other methodologies, DataOps has its own approaches to tackle the challenges that arise from the heterogeneity of data analysis projects. Separating the production environment from development gives data workers room to experiment with changes and removes the fear of failure altogether. With two different environments, product quality can be assured by continuous testing and cross-environment monitoring. Including customers and other stakeholders in the data analytics project keeps the communication and feedback loop to a minimum number of iterations. With this, changes and improvements in the pipeline can deliver

faster results without affecting current pipeline production. Also, role-based task distribution fosters individual responsibility while maintaining the cohesion of the team effort.

Figure 15: DataOps pipeline

The DataOps pipeline, as illustrated in Figure 15, starts with gathering data and business requirements. The active involvement of managers, data providers, and analysts creates the baseline for pipeline development. Once the business requirements and data are finalized, development of the data pipeline starts. The developed data pipeline is orchestrated by orchestration tools and tested (code, data, pipeline architecture, performance, and output) before being deployed to the production environment. Data

engineers, scientists, architects, and developers collaborate, each using their area of expertise. There can be multiple development environments, one for each involved data worker. However, deployment only happens after all individual work is assembled into a pipeline that fulfils all test requirements. Testing and orchestration of the data pipeline are supported by CI tools to keep the whole pipeline development in a synchronized view for all involved parties. Deployment is done through CD tools, which automate the deployment tasks. Automating deployment reduces the workload of reconfiguring and reworking the pipeline in another environment. With CI and CD combined, the data pipeline moves swiftly from the innovation stage to the production stage. In the production phase, the pipeline runs in an orchestrated environment, just as in the development environment. Continuous monitoring tracks the pipeline's input, performance, and output and cross-validates the monitoring outcomes against the test results from the development environment and the business requirements. The production and monitoring teams are responsible for carrying out the tasks in the production environment; the teams are composed of people with different areas of expertise and interest to deliver quality performance. Finally, the results are shared with customers and stakeholders with the expectation of feedback and comments. Several tools and technologies are used to develop the DataOps pipeline. Each tool and technology has its own purpose, but placed together they provide an analytics environment with transparency in the process, quality in the results, efficiency in performance, and reliability in collaboration. The various tools and technologies applicable in the DataOps pipeline are discussed in Section 4.1.3.

Ambiguities in DataOps Practices
DataOps is an emerging concept in the field of data analytics. In recent years, information collection and contributions to DataOps have been progressing through the involvement of DataOps practitioners and enthusiasts. Nevertheless, some misconceptions about DataOps are prevalent; they are listed and explained below.

1. DataOps is just DevOps applied to data analytics.
DataOps is not DevOps for data. It takes best practices from the DevOps and Agile methodologies and combines them with Lean Manufacturing's SPC and data analytics specific tasks to streamline the data lifecycle and provide quality results. Data analytics projects and software development projects differ greatly.

2. DataOps is all about using tools and technology in the data pipeline.

DataOps is not about automating everything with tools and technologies and keeping humans out of the loop. DataOps advocates a balanced involvement of people alongside tools and technology. Communication and collaboration are strongly emphasized in DataOps to turn data into value for all involved parties.

3. DataOps is an expensive methodology.
Acquiring and running different tools always comes with a price, but data analytics projects will cost an organization whether they follow DataOps or not. One should compare the investment with the value to be received in the near future. Furthermore, proper research on tools and technology before implementing them in the data pipeline helps make informed decisions that keep the cost to a minimum.

4. With DataOps, there is no need for coding.
Without writing code, data pipeline tasks cannot be built at all, so coding is always the baseline of data analytics projects. With DataOps, however, the amount of coding can be reduced by reusing and versioning code, algorithms, and configuration scripts, while IDEs and source code editors make writing and debugging code easier.

5. DataOps can only be used for data analysis tasks.
DataOps is not just about generating reports and delivering fancy charts, templates, bars, and figures using visualization tools. It covers the whole data lifecycle from data collection to disposal. Moreover, it is not just about covering the data lifecycle; it is also about creating a data-driven organizational culture that emphasizes collaboration, communication, transparency, and quality in organizational tasks.
6. DataOps and the data pipeline are two different ways of propagating data analytics tasks.
DataOps is an approach to implementing a data pipeline: we apply DataOps principles and practices while developing and executing data pipelines, and a data pipeline built with DataOps methodologies is also called a DataOps pipeline. DataOps is not an entirely new way of performing data analytics tasks; rather, it redesigns the data pipeline to deliver quality results in a short time with minimal cost and effort.
4.1.2. DataOps in the Data Lifecycle
DataOps' goal is to minimize the analytics cycle time by covering every stage of data analysis, from requirements collection to distribution of results [44]. The data lifecycle relies on people and tools [8], and DataOps brings people and tools together to manage the data lifecycle better.

A data analytics pipeline alters data through a series of tasks. Whether it is an ETL/ELT pipeline or an analysis pipeline, the output will always differ from the input. One of the challenging tasks in data pipelines is tracking the data as they go through a series of transformations from one stage to another. In DataOps, data lifecycle management is unavoidable because of the need to monitor the quality of processes and products. Data governance and data lineage are part of DataOps to assure process and product quality, and quality assurance and the DataOps principles of reproducibility and reuse depend heavily on managing and maintaining data lifecycle change events. Data governance and data lineage are not easy to address: they start with managerial-level planning and flourish with the tools and approaches used to implement those plans. In DataOps, transparency in data lifecycle management is always a priority. DataOps applies to the entire data lifecycle [104]: from data collection to publishing the results, all data preparation and analysis stages can use the DataOps methodology for executing the job. It provides the significant advantage of easier data lifecycle management by applying an intrinsic approach to handling data throughout the analytics lifecycle. As mentioned in Section 2.3, a data pipeline serves as the means of transporting data from one stage of the lifecycle to another. Furthermore, in Section 2.7 we showed how DataOps restructures the traditional data pipeline, takes it out of the black box, and makes it measurable and maintainable through collaboration, communication, integration, and automation. As a result of this restructuring, data lifecycle management becomes more straightforward. DataOps supports all stages of the data lifecycle; with the right people and technology in use, data flow seamlessly from one stage to another. With DataOps, a published analysis result can be tracked back to the raw data source, decomposing each transformation performed on it. DataOps acknowledges the interconnected nature of data engineering, data integration, data quality, and data security [38] and combines these aspects of data analytics to form an interspace of data movement between data lifecycle stages. Figure 16 is a simple visual representation of the DataOps pipeline's coverage of the data lifecycle. The dotted line represents the DataOps pipeline, which covers the entire data lifecycle, including planning. Data governance and data quality management are well implemented in DataOps throughout the lifecycle of the data. Moreover, in DataOps there is no need to create separate pipelines for different stages, unlike traditional data pipelines (see Section 2.3). Instead, DataOps utilizes the technical

53

modularity of orchestration, workflow management, and automation tools to provide

flexible and customized transformation process when needed.
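Since tracing a published result back to its raw sources is central to this traceability argument, the following minimal Python sketch shows one way a pipeline step could record lineage entries as it runs. The function names, file paths, and log format are illustrative assumptions, not part of any specific DataOps tool discussed in this thesis.

    import hashlib
    import json
    from datetime import datetime, timezone

    def file_fingerprint(path: str) -> str:
        """Return a SHA-256 digest of a file so each data version is identifiable."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_lineage(step: str, inputs: list[str], outputs: list[str],
                       log_path: str = "lineage_log.jsonl") -> None:
        """Append one lineage entry (step name, input/output fingerprints, timestamp)."""
        entry = {
            "step": step,
            "inputs": {p: file_fingerprint(p) for p in inputs},
            "outputs": {p: file_fingerprint(p) for p in outputs},
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")

    # Example: called after a hypothetical cleaning step in the pipeline.
    # record_lineage("clean_sales_data", ["raw/sales.csv"], ["clean/sales.parquet"])

With such entries, every transformation between lifecycle stages leaves a record that can later be decomposed step by step, which is the behaviour described above.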

[Figure omitted: the data lifecycle stages Plan, Create/Collect, Process, Analyze, Publish, Store, and Dispose, connected by archive, access, and temporary-storage links and enclosed by the dotted DataOps pipeline.]

Figure 16: DataOps in the data lifecycle

4.1.3. Evaluation of DataOps Tools and Technologies

In this section, we evaluate the tools and technologies used in DataOps. We first divide the tools into different categories (Section 4.1.3.1), then introduce comparison criteria for the tools (Section 4.1.3.2), and finally present a feature-based comparison of DataOps tools in Section 4.1.3.3. The selection of tools for comparison is based on their popularity and is intended to provide a baseline for further research on selecting tools for DataOps tasks. Since numerous tools offer the same features and functionality, it is hard to cover every tool in detail. We have picked some of the popular tools and compared them categorically based on the comparison criteria presented. Selecting tools and technology in DataOps is a rigorous process and needs detailed research and planning before a particular tool is chosen for a designated task [89]. Tools were selected based on their large user base, features relevant to the given functionality, and the popularity of the product in data analytics project execution. The tools presented in the feature-based comparison tables were picked after extensive online research, by listing and comparing tools from the same functional categories. Additional tools studied during this selection process (but not presented in the comparison tables) are listed in the Appendix with a short description and product link.

4.1.3.1. Categorization of DataOps Tools and Technologies

Tools used in DataOps are categorized into the following functional categories based on the tools' purpose in the data pipeline. Some tools described in Section 4.1.3.3 are uncategorized and kept under 'Other tools and technologies' because they do not fall under the first seven categories listed below.

1. Workflow orchestration tools
Workflow orchestration or pipeline orchestration defines the logical flow of tasks from start to end in the data pipeline. In DataOps, orchestration tools create a logical flow of data analytics tasks and assemble other tools and technologies, infrastructures, and people to accomplish the job. Several orchestration tools are available that share similar design principles but target various users and use cases, and choosing among them for pipeline workflow management is a thorough job. Orchestration tools cover resource provisioning, data movement, data provenance, workflow scheduling, fault tolerance, data storage, and platform integration in the data pipeline [105]. However, not all orchestration tools have every feature built in to support every task in a data pipeline, so choosing the right orchestration tool is essential for managing tasks in the data pipeline. There has also been a practice of developing custom-built workflow orchestration tools for a specific project [106–108]. The research paper [106] by Yared et al., based on his master's thesis [109], suggests a big data workflow approach built on software container technology and message-oriented middleware, with a detailed feature-based comparison of DSL-based workflow management against existing workflow orchestration tools. In Table 6, we compare some of the existing pipeline orchestration tools based on the comparison criteria presented in Section 4.1.3.2.
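To make the notion of a logical task flow concrete, the following minimal sketch defines an Airflow DAG with two dependent tasks. The DAG id, callables, and schedule are hypothetical, and the import paths follow the Airflow 2.x layout; they may differ in other Airflow versions.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder for a data collection step (e.g., pulling a file from a source system).
        print("extracting raw data")

    def transform():
        # Placeholder for a cleaning/transformation step.
        print("transforming data")

    with DAG(
        dag_id="example_dataops_pipeline",   # hypothetical pipeline name
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",          # run the pipeline once per day
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)

        # The logical flow: transform runs only after extract succeeds.
        extract_task >> transform_task

The orchestrator then schedules, retries, and visualizes this dependency graph, which is exactly the kind of task management the tools in Table 6 are compared on.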

2. Testing and monitoring tools
Continuous testing and monitoring are a principal mission of DataOps. With these, DataOps ensures pipeline performance, the quality of inputs and results, and code and toolchain performance throughout the data analytics process. Testing and monitoring are done at every stage of the data lifecycle. In DataOps, testing and monitoring start at the top management level by setting the criteria for project quality, and test cases are developed according to the proposed criteria. After the development of test cases and monitoring criteria, suitable existing tools or a custom-built testing and monitoring framework can be integrated into the data pipeline. Some of the testing and monitoring tools are explained in Table 7.
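As an illustration of the kind of automated data test such tools run inside a pipeline, the short sketch below checks a few basic quality criteria on a batch before it is promoted to the next stage. The column names, file path, and thresholds are hypothetical.

    import pandas as pd

    def check_quality(df: pd.DataFrame) -> list[str]:
        """Return a list of failed checks; an empty list means the batch may proceed."""
        failures = []
        if df.empty:
            failures.append("dataset is empty")
        if df["order_id"].duplicated().any():       # hypothetical key column
            failures.append("duplicate order_id values found")
        if df["amount"].isna().mean() > 0.01:       # allow at most 1% missing amounts
            failures.append("too many missing values in 'amount'")
        return failures

    if __name__ == "__main__":
        batch = pd.read_csv("staging/orders.csv")   # hypothetical staging file
        problems = check_quality(batch)
        if problems:
            # In a DataOps pipeline this would raise an alert and stop downstream tasks.
            raise ValueError("data quality checks failed: " + "; ".join(problems))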

3. Deployment automation tools
DataOps continuously moves code and configurations from the development environment to the production environment once test cases are satisfied. Deployment automation is applied through the process of Continuous Integration and Continuous Deployment. Tools used in deployment automation are presented in Table 8.

4. Data governance tools
The importance of data governance, data lineage, and data provenance in DataOps is described in Sections 2.5 and 2.6. Testing and monitoring also help keep records in line with the principles of data governance. Whereas testing and monitoring focus more on tracking performance measures of the whole DataOps pipeline, data governance is concerned with data change management and data lineage tracking. Tools used in data governance are presented in Table 9.

5. Code, artifact, and data versioning tools
Code, artifact, and data versioning tools provide a platform to store different versions of code, data sets, Docker images, and other related documents such as logs, user manuals, system manuals, and configurations. With the right tool, accessing and reusing different versions of stored artifacts becomes easier. Tools helpful for keeping track of different versions of artifacts are described in Table 10.
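As one concrete illustration, DVC (compared in Table 10) exposes a small Python API for reading a dataset as it existed at a given Git revision. The sketch below assumes DVC's dvc.api.read helper; the repository URL, file path, and tag are placeholders.

    import dvc.api

    # Read a specific revision of a dataset tracked by DVC in a Git repository.
    data = dvc.api.read(
        "data/training.csv",                       # hypothetical tracked file
        repo="https://github.com/example/data-repo",
        rev="v1.0",                                # Git tag or commit pinning the data version
    )
    print(len(data), "characters loaded from the v1.0 snapshot")

Pinning data versions this way is what makes an analysis reproducible: the pipeline can always be rerun against exactly the dataset that produced an earlier result.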

6. Analytics and visualization tools
The importance of visual presentation is always high when demonstrating results. As customers and non-data workers expect to receive quality results in an understandable form, data visualization and analytics tools play a big part in presenting the quality of the DataOps pipeline. With the support of the analytics and visualization tools presented in Table 11, data workers can better communicate their results to other stakeholders.

7. Collaboration and communication tools
Communication and collaboration tools are necessary to coordinate better among team members. They range from simple email applications to advanced communication tools with features that automate and record most routine tasks. Table 12 lists some of the communication and collaboration tools that help establish better communication and project monitoring during DataOps pipeline execution.
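As a small illustration of wiring a communication tool into the pipeline itself rather than using it only for chat, the sketch below posts a run-status message to a Slack channel through an incoming webhook. The webhook URL is a placeholder that would be generated in Slack's app settings.

    import json
    import urllib.request

    def notify_slack(message: str, webhook_url: str) -> None:
        """Send a plain-text notification to Slack via an incoming webhook."""
        payload = json.dumps({"text": message}).encode("utf-8")
        request = urllib.request.Request(
            webhook_url,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()  # Slack replies with "ok" on success

    # Example call at the end of a pipeline run (placeholder URL).
    # notify_slack("Nightly DataOps pipeline finished: 0 failed tests",
    #              "https://hooks.slack.com/services/T000/B000/XXXX")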

8. Other tools and technologies
In the other tools and technologies section, we describe container technology, resource managers, data storage services, IDEs and source code editors, cloud servers, and big data processing and analytics frameworks. In Table 13, popular tools and services under the listed categories are presented along with their general use in DataOps.

4.1.3.2. Evaluation Criteria

The general criteria for DataOps tools evaluation and comparison are based on ease of installation, simplicity of operation, integration support for other technologies, and general applicability of the tools. The schema of the comparison tables is described below.

1. Complexity
Measures how complex the installation and implementation process of the presented tool is. Evaluation is based on the code complexity and the dependencies that need to be set up.
• HIGH: Needs a high level of coding and configuration to install the product.
• MEDIUM: Moderate level of coding and configuration required.
• LOW: Easy to install with no or only a few lines of code.

2. Usability
Measures how simple the tool is to use after installation, especially for non-technical data workers.
• HIGH: Easy to use with little or no technical, coding, or system-related knowledge.
• MEDIUM: Moderate knowledge of the system, code architecture, or technical details is required.
• LOW: A high level of technical expertise and/or coding knowledge is required.

3. Compatibility
Measures the integration capacity of the tool with different operating environments, other tools, databases, data types, and/or programming languages.
• HIGH: Supports a wide range of tools, operating environments, databases, data types, and programming languages.
• MEDIUM: Has some level of support, either explicit (a number of specific tools, languages, databases, data types, and/or programming languages declared officially) or implicit partial support provided through unofficial projects.
• LOW: Little or no support available.

4. Application
Provides information on the tool's applicability to arrays of projects, data analysis use cases, and industries.
• GENERIC: Can be used in a variety of projects based on the nature of the tool.
• SPECIFIC: Industry- or project-specific usage.

5. Lifecycle
Lists the data lifecycle stages, as described in Section 2.1.5, in which the tool is mostly used.

6. License
Describes whether the tool is commercial, opensource, freemium, free + commercial, or another pricing form.

7. Other schemas
• Description: Presents a general overview of the tool.
• Features: Lists major features of the tool.
• References: Links to the tool's documentation and product information for further and detailed study.

Using the attributes presented above, we have created the comparison tables for DataOps-supported tools and technologies in Section 4.1.3.3.

4.1.3.3. DataOps Tools and Technology Comparison

Table 6: Workflow orchestration tools

Airflow
Description: Airflow programmatically creates, monitors, and schedules workflows. A workflow is created as a Directed Acyclic Graph (DAG). The Airflow scheduler executes the tasks of a workflow using the defined dependencies and executors.
Features: Define data jobs in source control. Web application to monitor and explore DAGs, progress, metadata, and logs. Metadata repository to keep track of job status and other persistent runtime information. Job scheduler for task instances. Supports several executors such as Sequential, Local, Celery, Kubernetes, and Docker. Scalable, with resource pooling capabilities to execute parallel tasks. Python-based.
Lifecycle: Creation/collection, Process, Analyze
Complexity: HIGH. Installation steps are straightforward with easy documentation and additional runtime support, but workflow creation and task scheduling need programming skills. Optimization and integration with different applications and platforms require in-depth knowledge.
Usability: MEDIUM. DAG-based visualization of a workflow (created using Python code); tasks can be monitored using the web interface, but changes in tasks and workflows are made through coding. Only visualization and monitoring are available in the web application.
Compatibility: HIGH. Supports different runtime environments and has different executors for runtime-based tasks such as Docker, Kubernetes, and Celery.
Application: GENERIC. Can be used with different types of projects and industries.
License: Opensource
Reference: https://airflow.apache.org/

Apache Oozie
Description: Apache Oozie is a workflow scheduler used for managing Apache Hadoop (https://hadoop.apache.org/) jobs.
Features: Execute and monitor workflows in Hadoop. Periodic scheduling of workflows. Trigger execution through data.
Lifecycle: Collection/creation, Process, Analyze
Complexity: HIGH. Configuration and installation steps are complicated with lots of dependencies. Must define the beginning and end of the workflow and the mechanism to control workflow execution at control-flow nodes. Only supports Java.
Usability: MEDIUM. HTTP, CLI, and web interface for task monitoring.
Compatibility: LOW. Applies to Hadoop-related jobs only.
Application: GENERIC. Can only be used in Hadoop projects.
License: Opensource
Reference: https://oozie.apache.org/

Reflow
Description: System for incremental data processing in the cloud. Allows data scientists and engineers to write straightforward programs and execute them in the cloud environment.
Features: Defines a Domain Specific Language. Comes with an in-built cluster manager to increase and decrease compute resources, in sync with the AWS cluster manager to create and delete resources when needed.
Lifecycle: Process, Analyze
Complexity: HIGH. It is distributed with the EC2 cluster manager and its memoization cache is based on S3; AWS configuration needs to be done beforehand, and AWS credentials must be provided to Reflow for setup.
Usability: LOW. CLI commands only. Complex process to create and run the workflow file (called a reference file). Workflow monitoring and modification can only be done through programming.
Compatibility: LOW. Only available in the EC2 environment.
Application: SPECIFIC. Designed for bio-informatics projects and focused on the bioinformatics industry, though it can be implemented in other applications.
License: Opensource
Reference: https://github.com/grailbio/reflow

DataKitchen
Description: Platform to reduce analytics cycle time by monitoring data quality and providing automated support for data deployment and new analytics.
Features: Supports the orchestration of all the heterogeneous data centers, tools, infrastructure, and workflows used across teams. Can orchestrate another orchestrated pipeline. API access to computing resources. Test orchestration in the pipeline.
Lifecycle: Process, Analyze
Complexity: LOW. Commercial support available. Easy implementation guidelines.
Usability: HIGH. Kitchen creation (resource add), recipe creation (code), and orders (jobs) can be created using the web interface; however, for recipe creation the code needs to be ready to upload. Also supports CLI commands.
Compatibility: HIGH. On-premises and cloud platform.
Application: GENERIC. Hosted on their server; API and web app access.
License: Commercial
Reference: https://usecase.datakitchen.io/meta-orchestration-2/

BMC Control-M
Description: Automation solution to simplify and automate different batch workloads. Automates event-driven workflows with failure prevention.
Features: Streamlines the orchestration of business applications by embedding workflow orchestration into the CI/CD pipeline. Extends development and operations collaboration with a jobs-as-code approach. Provides intelligent file transfer movement.
Lifecycle: Process, Analyze
Complexity: MEDIUM. Support is provided for installation. Jobs-as-code to build and test workflows in the CI/CD pipeline.
Usability: MEDIUM. Web interface, iOS and Android mobile app for monitoring. Task scheduling through jobs-as-code.
Compatibility: HIGH. On-premises, hybrid, and multi-cloud environments.
Application: GENERIC. Used in business intelligence, ETL, file transfer, big data, and Hadoop jobs.
License: Commercial
Reference: https://www.bmc.com/it-solutions/control-m.html

Argo Workflows
Description: Workflow engine for orchestrating parallel jobs on Kubernetes. Container-native tool implemented as a K8s CRD.
Features: Each step in an Argo workflow is defined as a container. Stepwise task execution or DAG-based task dependencies. Runs on top of a Kubernetes cluster, with each task as a K8s pod. Useful for parallel analytics job execution in Kubernetes. Workflow definition and automation are created through custom-designed YAML templates called Argo DSL (ADSL).
Lifecycle: Process, Analyze
Complexity: HIGH. Kubernetes knowledge is required, along with high programming skill to deploy on K8s. Installed through kubectl commands.
Usability: LOW. No user interface to design workflows; workflows must be created through ADSL.
Compatibility: LOW. Kubernetes-native application; however, a variant for Docker (Docker-in-Docker) is also available.
Application: GENERIC. Can be used for all types of projects in machine learning, data processing, and automation using Argo Workflows on Kubernetes.
License: Opensource
Reference: https://argoproj.github.io/projects/argo

Apache NiFi
Description: Apache NiFi allows automating the flow of data between systems. Uses a flow-based programming model to build scalable data workflows.
Features: Web-based user interface. Dataflows can be modified during runtime. Tracks data movement from the beginning to the end of the project. Provides testing and rapid development.
Lifecycle: Collection/creation, Process, Analyze, Publish
Complexity: MEDIUM. Moderate installation process with directory creation. Requires a JVM to run and execute tasks.
Usability: MEDIUM. User-friendly interface with drag-and-drop features to design workflows; however, it requires knowledge of languages like SQL to use the NiFi interface.
Compatibility: MEDIUM. Application with inbuilt features like testing and monitoring, data provenance, and resource management.
Application: SPECIFIC. Mostly used as a data ingestion tool with inbuilt data transformation, enrichment, and data monitoring. Works with big data processing platforms and standalone data projects as a data ingestion tool.
License: Opensource
References: https://nifi.apache.org/ and https://nifi.apache.org/docs/nifi-docs/html/overview.html

Table 7: Testing and monitoring tools

iCEDQ
Description: Software for ETL, data warehouse, and data migration testing.
Features: User security for system-level and database access based on role-based responsibilities. CI/CD pipeline integration. Custom reporting and dashboard. Inbuilt scheduler for offline jobs and tasks. Alerts and notifications. Single-server, multiple-server, and cluster options to execute tasks.
Lifecycle: Creation/collection, Storage, Analyze
Complexity: LOW. Configuration of data sources through the web app with simple steps to follow. Has inbuilt database connectors for big data, cloud storage, flat files, and reports.
Usability: HIGH. Web-based, Android, and iOS mobile apps for monitoring. Self-hosted, connected through APIs.
Compatibility: HIGH. Supports a wide range of data structures and data sources.
Application: GENERIC. Can be used for any type of data analytics project with supported data types.
License: Commercial
Reference: https://icedq.com/overview

Data Band
Description: An open-source framework for building and tracking data pipelines through metadata.
Features: Generates custom metrics for the pipeline. Tracks workflow input and output and lineage of data. Generates data profiling and statistics on data files. Data caching and data versioning. Dynamic pipelines with features to support sub-pipelines inside the pipeline.
Lifecycle: Process
Complexity: HIGH. CLI-based installation and configuration.
Usability: LOW. Knowledge of Python is necessary to run and create tasks.
Compatibility: MEDIUM. Supports cloud and local environments and can be used with Airflow, Google Cloud Composer, Apache Spark, and Kubernetes.
Application: GENERIC. Gain insight into a pipeline running on tools like Airflow, Jenkins, and Apache Spark.
License: Opensource + commercial
Reference: https://databand.ai/

RightData
Description: RightData is an intuitive, flexible, efficient, and scalable application for data quality assurance, data integrity monitoring, and quality control using automated testing capabilities.
Features: Dataset analysis and validation through dataset comparison. Data validation and reconciliation to detect anomalies using a rule-based data validation engine. Supports a wide variety of data sources, from RDBMS and flat files to SAP and big data sources. DevOps integrations. Cloud technology connections.
Lifecycle: Storage, Analyze, Process
Complexity: MEDIUM. Web-based installation with commercial support.
Usability: MEDIUM. Can be used through a web application for monitoring, but project configuration requires some level of technical expertise.
Compatibility: HIGH. External systems can connect using REST APIs, and different file systems are supported.
Application: GENERIC. Can be installed in cloud and local environments and used with different data analytics projects.
License: Commercial
Reference: https://www.getrightdata.com/

Naveego
Description: Cloud-based application to clean and monitor data quality and track data changes using a dashboard.
Features: Can connect to any data source. Can define a custom data model. Data merge and data cleansing. Data source tracking and unification into a single file. User collaboration. Data quality notification with manual and automated data quality checks and cross-system comparison.
Lifecycle: Collection/creation, Process, Storage
Complexity: HIGH. Cloud-hosted system with a web interface. Commercial support available. Can be installed on-premises using the inbuilt packages for Windows, Linux, and macOS.
Usability: HIGH. Testing and monitoring jobs can be performed using a web application. The system has an inbuilt mechanism to track and monitor data changes for supported data sources. No coding is necessary.
Compatibility: LOW. Connected through plugins and currently supports specific data sources; cannot connect to other data sources without a plugin.
Application: SPECIFIC. Can only be used with data sources that can be connected using the plugins developed by the company.
License: Commercial
Reference: https://www.naveego.com/

DataKitchen
Description: Helps to improve data quality by providing lean manufacturing controls for testing and monitoring.
Features: Provides customizable alerts to applications like Slack, email, and Microsoft Teams. Technical users can create data flow tests, while business stakeholders can apply business logic tests, among many others. Code validation tests are promoted in the development phase and pushed to production to verify data operations. Each error creates more test cases in the development phase to eliminate production errors.
Lifecycle: Collection/creation, Process, Storage
Complexity: HIGH. Commercial support for installation and implementation.
Usability: MEDIUM. For dataflow tests, some coding knowledge is needed, but for business logic tests no coding experience is necessary. Has both a CLI and a web app for monitoring and task creation.
Compatibility: HIGH. On-premises and cloud platforms.
Application: GENERIC. Can be used with different types of projects and databases connected through APIs and monitored using the web app.
License: Commercial
Reference: https://usecase.datakitchen.io/automated-data-testing-and-monitoring/

Enterprise Data Foundation deltaTool
Description: Non-profit organization for an enterprise data toolkit, focused on data management with the deltaTool product. deltaTest (https://enterprise-data.org/delta-test/), deltaRefresh (https://enterprise-data.org/delta-refresh/), and deltaDeploy (https://enterprise-data.org/delta-deploy/) deliver robust, test-driven development and automated regression testing.
Features: PowerShell-based unit testing framework. Helps to write transformation code faster and reduce defects in the production environment. deltaTest tests inputs, process, and output if it can be invoked from the Windows command line.
Lifecycle: Storage, Process, Analysis
Complexity: HIGH. PowerShell-based application. Must run PowerShell scripts to set up the environment and clone the GitHub repository. Though the process is well explained in the documentation, some scripts are hard to understand.
Usability: LOW. Every testing task must be performed through PowerShell scripts with defined parameters. Results are published in a JUnit file with no visualization of test results.
Compatibility: LOW. Developed for enterprise data management; applicable to text files or SQL Server data and databases.
Application: SPECIFIC. Can only be used through Windows PowerShell. Used on static data and code.
License: Free (non-profit)
Reference: https://enterprise-data.org/delta-tools/

Table 8: Deployment automation tools

Jenkins
Description: Opensource automation server to deploy a project from testing to production.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Easy installation on different operating systems. Simple, easy-to-use, and customizable user interface. Popular and easily extensible with 400+ third-party plugins. Distributed system with master-slave architecture to reduce the load on the CI server. Notifications enabled.
Complexity: MEDIUM. Easy to install and set up.
Usability: HIGH. After installation and configuration, most tasks can be managed using the Jenkins app or web interface.
Compatibility: HIGH. Can be used within any kind of data analysis environment.
Application: GENERIC. A highly used deployment automation tool for data analytics and software development projects.
License: Opensource
Reference: https://www.jenkins.io/

DataKitchen
Description: DataOps platform supporting data analytics code and configuration deployment.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Automates the manual work of deployment from the development to the production environment. Keeps a track record of pre-configured tools, datasets, hardware, and tests, and deploys to production by remapping different target toolchains. Team coordination.
Complexity: HIGH. Commercial support for installation and implementation.
Usability: MEDIUM. Deployment is straightforward and easily usable after the configuration of the production environment.
Compatibility: HIGH. On-premises and cloud platform.
Application: GENERIC. Can be used with different types of projects and databases connected through APIs and monitored using a web app.
License: Commercial
Reference: https://usecase.datakitchen.io/deployment-cicd-for-data-projects/

Circle CI
Description: CircleCI is a modern continuous integration and continuous delivery (CI/CD) platform. The CircleCI Enterprise solution is installable inside a private cloud or data center and is free to try for a limited time. CircleCI automates build, test, and deployment of software with job configuration.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Integrates GitHub, GitHub Enterprise, and Bitbucket. Runs pipelines in a clean container for testing. After passing tests, the pipeline is automatically deployed to a different environment. CircleCI may be configured to deploy code to various environments, including AWS CodeDeploy, AWS EC2 Container Service (ECS), AWS S3, Google Kubernetes Engine (GKE), Microsoft Azure, and Heroku. Other cloud service deployments are easily scripted using SSH or by installing the API client of the service.
Complexity: MEDIUM. Easy to set up and install. Must connect to either a GitHub or Bitbucket account. Simple configuration steps.
Usability: MEDIUM. Scheduling, deployment, and testing jobs are done by editing the config.yaml file, which requires some level of programming knowledge. Monitoring can be done through the web UI.
Compatibility: MEDIUM. Works with code repositories from GitHub or Bitbucket and cannot access development code directly. It can be deployed on-premises or in the cloud.
Application: GENERIC. Docker, Windows, and Linux jobs. Applicable to any kind of project whose code can be stored in GitLab and Bitbucket.
License: Free + Commercial
Reference: https://circleci.com/

GitLab
Description: Cloud-based CI platform for development teams handling a diverse toolchain.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Multi-platform and multi-language, with parallel execution of jobs across multiple machines for faster execution. Can test locally and deploy to the production environment. Docker and Kubernetes support. Autoscaling of resources according to computational complexity. Integrated with other GitLab products, which help carry out tasks from planning to deployment.
Complexity: MEDIUM. Easy to install and set up; setting up a self-hosted server needs technical knowledge.
Usability: MEDIUM. Requires basic knowledge of versioning and an understanding of git.
Compatibility: HIGH. Can run on any platform where Go binaries are built.
Application: GENERIC. Supports virtually many programming languages and data analytics project types.
License: Opensource + Commercial
Reference: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/

Travis CI
Description: Travis CI helps to build and test projects hosted in Git repositories through the process of continuous integration.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Free for opensource applications. Available for macOS, Linux, and Windows. Supports 30 different programming languages. Provides a build matrix feature to accelerate project execution. Configured by adding a .travis.yml (YAML text file) to the application. Good documentation and an elegant user interface. Supports parallel testing.
Complexity: MEDIUM. Needs GitHub, GitLab, Assembla, or Bitbucket integration. Needs programming knowledge and an understanding of git.
Usability: HIGH. Simple user interface that makes tasks easy to perform.
Compatibility: HIGH. Can be run on any platform (cloud and native).
Application: GENERIC. Can be used with a variety of projects within the supported programming languages.
License: Free + Commercial
Reference: https://travis-ci.com/

Atlassian Bamboo
Description: Atlassian Bamboo is a CI/CD server that assists project development teams in automating build and test code status.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Pre-built software functionality for CI/CD; no plugins or additional setup required to use the tools. Built-in Git branching workflows and deployment projects. Test automation. Enterprise support and resources. Tidy and intuitive user interface.
Complexity: LOW. Easy to install, with enterprise support.
Usability: HIGH. Intuitive, guided user interface.
Compatibility: HIGH. Can be used within any kind of data analysis environment.
Application: GENERIC. Can be used in any kind of data analytics project.
License: Commercial
Reference: https://www.atlassian.com/software/bamboo

Table 9: Data governance tools

Apache Atlas
Description: Atlas is a scalable and extensible set of core foundational governance services in Hadoop, allowing integration with the whole enterprise ecosystem. Provides open metadata management and governance capabilities for cataloging data assets to classify and govern, with collaboration capabilities for data scientists, analysts, and data governance teams.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Pre-defined metadata types and instances for Hadoop and non-Hadoop. Can define new metadata types with inheritance from different types of metadata. REST APIs for easy integration with different types and instances of metadata. Intuitive UI for lineage tracking, with a REST API to update lineage. Easy search and discovery with the UI and a rich REST API and SQL-like query. Security and data masking.
Complexity: HIGH. Installing Apache Atlas involves lots of dependencies and configuration. HBase, Hive, Storm, or Kafka are used for metadata storage and management, and a separate index store is required. Metadata ingestion is done either as Kafka topics or through the API in Atlas.
Usability: MEDIUM. Can monitor using an intuitive UI, but setting up metadata types and instances needs knowledge of REST APIs and programming.
Compatibility: MEDIUM. Different systems and services can be accessed through the API, and metadata hosted in Hadoop can be managed. Nevertheless, Atlas requires the Hadoop engine to run the system, with a set of interdependent services.
Application: GENERIC. Business metadata and data lineage management for Hadoop- and non-Hadoop-hosted applications.
License: Opensource
Reference: https://atlas.apache.org/

Talend
Description: With Talend, data discovery, sharing, and federation are easier. Uses simple tools to automate data processes, team collaboration, and data quality management. Talend simplifies data quality and security with built-in functionality to ensure insights are trusted, governed, and actionable.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Indexes and tracks data input and output in a cloud environment with detailed histories of datasets. Data collection in the cloud for a central and automated data hub. Cleansing and standardizing data to improve data quality. Data catalog management.
Complexity: MEDIUM. The opensource model can be installed in an IDE with easy installation, and the commercial license provides full support.
Usability: MEDIUM. Non-technical users can use the system through the UI for monitoring and reporting, but the governance policy setup needs technical expertise.
Compatibility: MEDIUM. Supports many data storage services and data operation platforms through connectors.
Application: SPECIFIC. Financial services, telecommunications and retail, and the consumer industry.
License: Opensource + Commercial
Reference: https://www.talend.com/products/data-integrity-governance/

Collibra
Description: Collibra governance accelerates digital transformation by promising all data citizens the ability to understand and find meaning in data. Enables trust in data while growing with a solid governance foundation.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Creates a shared language for datasets to ensure consistency. Automates governance and stewardship tasks with provision for scalability. Collaboration with different data citizens. Centralized data documentation. End-to-end visualization for data lineage tracking. API access for integration with data storage.
Complexity: LOW. Commercial support for installation and implementation.
Usability: LOW. Tasks can be performed through an interactive UI.
Compatibility: LOW. No information regarding compatible supported data types, platforms, and projects.
Application: SPECIFIC. Used for business intelligence purposes; no information about industry support and generic use cases.
License: Commercial
Reference: https://www.collibra.com/data-governance

IBM
Description: Data governance solutions from IBM are vital to DataOps to ensure the data pipeline is ready to help organizations catalog, protect, and govern sensitive data, trace data lineage, and manage data lakes. A data governance platform with an integrated data catalog can help enterprises find, curate, analyze, prepare, and share data while keeping it governed and protected against misuse.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Multiple entry points to adapt and change the data governance strategy according to changed business objectives. Machine-learning-powered data catalogs for automating metadata creation and knowledge sharing. Compliance with security and privacy law to reduce data access risk.
Complexity: MEDIUM. It can only be used on the IBM platform with the inbuilt installation procedure.
Usability: HIGH. Easy for non-technical users to create and run data governance tasks through the UI.
Compatibility: MEDIUM. Can be integrated with several IBM products through APIs; however, it performs best on the IBM DataOps platform.
Application: GENERIC. Generic solution for all data analytics projects.
License: Commercial
Reference: https://www.ibm.com/se-en/analytics/data-governance

OvalEdge
Description: OvalEdge is a data governance and data catalog toolset, widely used for data discovery, data governance, and compliance management.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Features: Business glossary. Workflow for data access. Collaboration with team members. Automated data lineage. Self-service data analytics. Windows, Unix, cloud, on-premises, web, and SaaS-based product.
Complexity: LOW. Commercial support for installation and implementation.
Usability: HIGH. Machine-learning-based advanced algorithms to organize and manage data, with a simple user interface to execute tasks. Alerts and collaboration with customers.
Compatibility: HIGH. Supports all data management platforms, from relational databases, data warehouses, object storage, and cloud platforms to non-relational databases and Hadoop distributions.
Application: GENERIC. Applicable to all types of data pipelines.
License: Commercial
Reference: https://ovaledge.com/

Table 10: Code, artifact, and data versioning tools

GitLab
Description: GitLab is a single application and DevOps platform for the entire software development cycle. Source code management in GitLab helps the development team collaborate and increase productivity.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Code versioning
Features: Enables code reviews, version control, feedback, and branching with a Git-based repository. Has built-in CI/CD to streamline testing and delivery. Automatically checks code quality and security with every commit. Review, track, and approve changes in code before merging. The Web IDE helps deploy on any platform.
License: Free + Commercial
Reference: https://about.gitlab.com/stages-devops-lifecycle/source-code-management/

GitHub
Description: GitHub is a Git repository hosting service with a graphical web-based UI that provides access control, collaboration, and task management services on top of versioning.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Code versioning
Features: Collaborative coding with features and tools like Codespaces, pull requests, code review, notifications, team reviewers, team discussions, and code owners. Security with automatic tracking of code changes, code scanning, a dependency graph, and mandatory review before merging. GitHub Desktop, mobile, and CLI for different devices and platforms. Project management functions like milestones, issue tracking, contribution graphs, wikis, and repository insights, taking code as the center of projects. Team administration by simplifying access and permissions across projects and teams.
License: Free + Commercial
Reference: https://github.com/

DVC
Description: Data Version Control (DVC) is an open-source tool for data science and machine learning projects.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Data versioning
Features: DVC runs on top of a Git repository and is compatible with popular Git servers and service providers. DVC offers distributed version control for data with features like local branching and local versioning. Flexible with cloud-based, network-attached, and disc data storage. Guarantees reproducibility by maintaining the relation between input data, configurations, and code for an experiment. Not dependent on language or framework.
License: Opensource
References: https://dvc.org/ and https://github.com/iterative/dvc

DockerHub
Description: Docker Hub is a hosted repository service by Docker for sharing and finding container images.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
Purpose: Artifact versioning (Docker images)
Features: Provides image repositories with versioning to pull and find images for containers. Integration with GitHub and Bitbucket with automated builds to publish an image in Docker Hub. Private and public repositories (paid version). Webhooks to trigger Docker Hub integration actions with other services.
License: Free + Commercial
Reference: https://hub.docker.com/

Additional information: The tools described above are used for all types of projects, so their application is GENERIC. To install and use the software, the necessary knowledge of version control taxonomy and Docker scripts is required, so Complexity is MEDIUM and Usability is MEDIUM. They can be integrated with any tools and environments, so Compatibility is HIGH.

Table 11: Analytics and visualization tools

Tableau
Description: Tableau is a business intelligence and data analytics tool for report generation and data visualization.
Features: Informative and wholesome dashboards. Easy collaboration with other users and sharing of data, visualizations, and dashboards in real time. Share and retrieve data from various sources such as on-premises, hybrid, and cloud. Supports several kinds of data connectors. Can work with live and in-memory data without connectivity issues. Wide range of visualization options available.
Lifecycle: Analyze
Complexity: LOW. Easy to install and integrate with data sources.
Usability: MEDIUM. Needs training and knowledge of the product to run visualizations.
Compatibility: HIGH. Supports many data sources and data types, and can be integrated with live data as well as stored data sets.
License: Commercial
Reference: https://www.tableau.com/

Power BI
Description: Power BI provides a cloud-based as well as a desktop-based data visualization service. It provides interactive visualization and business intelligence over data through an interface for end users to create reports and dashboards.
Features: Data query through natural language. Report sharing and collaboration for report generation among team members. Access and unify data from different data sources. Real-time business intelligence analysis. API access.
Lifecycle: Analyze
Complexity: LOW. The desktop application is easy to install and set up; the cloud-based service is easy to set up.
Usability: MEDIUM. Needs domain-specific knowledge and training for proper use of the software.
Compatibility: MEDIUM. No direct connection to databases, but can fetch data from different data sources like Salesforce, Excel, and many more through API calls. No direct installation on Linux or other operating systems.
License: Commercial
Reference: https://powerbi.microsoft.com/

QlikView
Description: QlikView is analytics software for developing interactive guided analytics applications and dashboards. QlikSense is an improved version of QlikView that can collect data from multiple sources and combine them for visualization.
Features: Data association represented through color rather than lines and objects. Uses in-memory technology for instant associative search and real-time analysis. Data relations are generated automatically using AI-based utilities, so business users or application users do not have to create relationship rules manually.
Lifecycle: Analyze
Complexity: LOW. Easy to install the desktop application on Windows servers and OS; no support on other systems.
Usability: MEDIUM. Needs training and knowledge of the product for better use.
Compatibility: LOW. Does not support installation other than on Windows OS.
License: Commercial
References: https://www.qlik.com/us/products/qlik-sense and https://www.qlik.com/us/products/qlikview

Additional information: The analytics and visualization tools presented above are general-purpose software that can be used in any project and industry, so they can be considered GENERIC applications.

Table 12: Collaboration and communication tools

Slack
Description: Business communication platform which uses a channel-based messaging service. With Slack, people can work together and connect to software tools and services.
Features: Task management. Collaboration through posts. Easy integration with other business tools like Google Drive, Office 365, Trello, Jira, and many more. Audio and video calls. Slack bots for different kinds of utility tools to choose from, according to requirements.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
License: Freemium + user-based pricing
Reference: https://slack.com/

Jira
Description: Jira is a project planning, tracking, and releasing platform which follows the Agile methodology.
Features: Project management with rule-based automation and task relationships. Views and reports on task progress and work portfolios. Task comments and project conversations with a notification service.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
License: Freemium + user-based pricing
Reference: https://www.atlassian.com/software/jira

Trello
Description: Web-based Kanban-style list-making application for task management and tracking.
Features: Ticket-based tasks under projects with rule-based triggers. Calendar commands for deadlines and notifications. Easy integration with other productivity tools like Slack, Evernote, Dropbox, and many more. File attachment feature. Drag and drop to move lists of tasks across different cards.
Lifecycle: Collection/creation, Process, Analyze, Publish, Storage
License: Freemium + user-based pricing
Reference: https://trello.com/

Additional information: Compatibility, Usability, Complexity, and Application type are not included for the collaboration tools above, since these tools do not need to be installed or incorporated inside the DataOps pipeline. All the tools mentioned above are web-based, easy to use with a good user interface, and can be used in any type of organization for communication and collaboration purposes.

Table 13: Other tools and technologies for DataOps

Containers
Description: A container is a standard unit of software that packages up code and its dependencies to run applications irrespective of the computing environment. Containers virtualize at the operating system level instead of virtualizing hardware stacks. Containers are lightweight applications that provide a consistent environment and the flexibility to run anywhere by providing isolation at the OS level.
Products:
1. Docker: Docker is an open-source platform for developing, shipping, and running applications by enabling the separation of applications from infrastructure to deliver software products quickly. Docker containers are restricted to a single application. Docker is based on the LXC project to build single-application containers.
2. LXC (Linux Containers, https://linuxcontainers.org/): LXC is OS-level virtualization technology for creating and running multiple isolated Linux virtual environments on a single host. It allows isolating applications as well as the entire OS. The goal of LXC is to create an environment as close as possible to a standard Linux installation but without the need for a separate kernel [111].
Use in DataOps: Containers help to automate the processes used by analytics teams to improve quality and reduce analytics cycle time [110]. The data analytics team can deploy multiple analytics tasks in containers in a repeatable manner. In DataOps, when analytics tasks run in containers, they are easier to maintain and share. For example, if a particular source code is needed frequently, it is easier to put the logic in a container that hosts the source code and call the container every time the source code is required. Containers provide an isolated environment, reusable hosting of tools and technology, and reproducible tasks and processes for DataOps projects.

Resource managers
Description: A resource manager in cloud computing or data analytics deals with the management and procurement of resources [112]. With resource manager applications and virtualization techniques, flexible and on-demand resource provisioning can be achieved [113].
Products:
1. Apache Mesos (http://mesos.apache.org/): An open-source project to manage computer clusters by providing resource management and scheduling services to applications like Hadoop, Spark, and Kafka through an API. Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run quickly [114].
2. YARN (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html): YARN is a distributed and large-scale operating system for big data, initially designed for cluster management in Hadoop. YARN decouples Hadoop MapReduce's resource management and scheduling capabilities from the data processing component.
3. Kubernetes (https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/): Kubernetes provides a framework to run distributed systems resiliently with automatic resource scaling, load balancing, bin packing, and self-healing [115]. Kubernetes automates Linux container operations in a distributed manner.
Use in DataOps: By using resource manager tools, DataOps can achieve automatic scaling and resource provisioning. Especially for big data analytics projects, scaling resources up and down is very important. Using Kubernetes, containers can be deployed in multiple clusters as per project need. Similarly, YARN and Mesos provide resource provisioning and scheduling to prevent manual allocation of resources.

Data storage services
Description: Data storage is the collection and retention of digital information [116]. Data storage includes storing digital data in temporary and permanent storage. In data analytics, data storage services provide storage of large quantities of data, transport data from one location to another, and secure data and prevent its loss. There are several types of data storage services: software-defined storage (SDS), cloud storage, and network-attached storage for object storage, file storage, and block storage [116]. Here we list some cloud data storage service providers.
Products:
1. AWS cloud storage services (https://aws.amazon.com/products/storage/): AWS provides Amazon S3 (https://aws.amazon.com/s3/), Amazon Glacier (https://aws.amazon.com/glacier/), Amazon EBS (https://aws.amazon.com/ebs/), and Amazon EFS (https://aws.amazon.com/efs/) for object storage, block storage, and simple shared file storage at a variety of price ranges.
2. Google cloud storage (https://cloud.google.com/storage): Google cloud storage provides object storage services with various features and prices. Data storage prices are charged based on usage and duration.
3. IBM cloud storage (https://www.ibm.com/cloud/storage): With IBM cloud storage, data workers can provision and deploy object, block, and file storage with flexible pricing and high data security.
4. Microsoft Azure cloud storage (https://azure.microsoft.com/storage/): Azure cloud storage provides managed disk storage, blob storage (https://azure.microsoft.com/en-us/services/storage/blobs/), and file storage services.
Use in DataOps: In DataOps, data storage is essential to move, store, and archive data for the future. Buying physical storage devices and maintaining and scaling them for a big data analytics project is time and resource consuming. Furthermore, there is always a risk of losing data in the local environment due to system failures and catastrophic incidents. Cloud storage service providers offer flexibility, security, and ready-to-deploy infrastructure for data, with add-on features and technical support.

IDEs and source code editors
Description: An IDE (integrated development environment) is a software application for developing, debugging, compiling, and running code. IDEs provide an integrated service of testing and debugging without deploying code in the environment. An IDE will at least have a source code editor, build automation tools, and a debugger; nowadays, IDEs offer advanced features that help programmers ease their tasks. Code editors do not have the full functionality of an IDE but provide a handful of features for editing code; they can be standalone applications or integrated with an IDE. We have listed some alternatives of code editors and IDEs without description. We have put Jupyter Notebook in a separate group from code editors and IDEs because it is a web-based application with sharing and collaboration features that traditional IDEs and code editors lack.
Products:
IDEs: 1. Microsoft Visual Studio (https://visualstudio.microsoft.com/) 2. IntelliJ IDEA (https://www.jetbrains.com/idea/) 3. XCode (https://developer.apple.com/xcode/) 4. Eclipse (https://www.eclipse.org/)
Code editors: 1. Sublime Text (https://www.sublimetext.com/) 2. Notepad++ (https://notepad-plus-plus.org/) 3. Visual Studio Code (https://code.visualstudio.com/)
Jupyter Notebook (https://jupyter.org/): An open-source web application for the creation and sharing of documents that contain live code, equations, visualizations, and narrative text. Nowadays, Jupyter Notebook is widely used in data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. Especially for data professionals, Jupyter Notebook is becoming handy to use.
Use in DataOps: Converting data analytics algorithms into code and making it executable is a challenging task, and there will be several trial-and-error processes to test code functionality and performance. Using IDEs and code editors, coders and data professionals can minimize errors while writing code through inbuilt syntax error tracking and suggestions, and with an IDE they can debug the code instantly before sending it for integration or deployment.

Cloud servers
Description: A cloud server is a virtual server running in a cloud computing environment that can function and run as an independent unit like the local environment. It can be accessed remotely via an internet connection. Nowadays, cloud servers are gaining popularity due to the level of support they provide for elastic computation requirements. Big cloud service providers offer inbuilt application support as well as the flexibility to host custom applications.
Products:
1. AWS (https://aws.amazon.com/): Amazon Web Services (AWS) offers reliable, scalable, and inexpensive cloud computing services ranging from data analytics and blockchain to edge computing and many more. AWS covers a wide array of products and custom-built ready-to-deploy applications, along with the facility to deploy customized applications.
2. Google Cloud (https://cloud.google.com/): Google Cloud offers Compute Engine, cloud storage, Google Kubernetes Engine, Cloud SQL, BigQuery, and Dataflow, which are suitable for data analytics projects.
3. Microsoft Azure (https://azure.microsoft.com/): The Azure cloud platform offers more than 200 products and cloud services for different requirements. Products like Azure Data Factory, Azure Databricks, Azure DevOps, and Power BI are some of the cloud services that are helpful in data analytics projects.
Use in DataOps: Cloud services provide the flexibility and reliability for DataOps practitioners to run their data pipelines. With the help of the cloud service provider's services, the risky cycle of creating services ends in exchange for a subscription payment. For example, running a data analytics project to extract information from 1000 TB of data needs a powerful server and lots of expertise. Nevertheless, if we choose to deploy our analytics project in the cloud, we can easily do so without worrying about maintaining the server. Plus, there are thousands of inbuilt services from different cloud providers to reduce workload.

Big data processing and analytics frameworks
Description: Big data requires tools that store, organize, access, analyze, and process it. Big data processing and analytics frameworks provide a solution to store, process, and analyze big data on a larger scale. There are several big data processing frameworks with different features and different underlying architectures, but the primary purpose is to support big data processing at their core. Several review papers are available that compare popular big data processing frameworks in terms of features and performance [117–119]. Big data frameworks are primarily categorized into three main categories based on the state of data they are designed to handle: batch, stream, and hybrid [120].
Products:
1. Apache Hadoop (https://hadoop.apache.org/): An open-source software programming framework for distributed processing of big data in batches across a set of multiple clusters using MapReduce. The core of Hadoop consists of HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) and Ozone (https://hadoop.apache.org/ozone/) for storage; the processing part is MapReduce and the resource management part is YARN, along with Hadoop Common, which contains libraries and utilities needed by other Hadoop modules.
2. Apache Spark (https://spark.apache.org/): Apache Spark is a batch processing framework with the capability to support stream processing of data. It uses in-memory computation to store and process intermediate results. Spark operation is based on distributed data structures called Resilient Distributed Datasets (RDD). Applications in Spark can be written in Scala, Python, R, and SQL. Spark provides libraries like SQL and DataFrames, MLlib (machine learning), GraphX, and Spark Streaming for different tasks. These libraries can be used in the same application for robust data processing tasks. Spark can run on Hadoop, Mesos, Kubernetes, and as a standalone application in the cloud and locally.
3. Apache Flink (https://flink.apache.org/): Flink is an open-source framework for distributed and batch processing of data. With a streaming API for Java and Scala, a static data API for Java, Scala, and Python, and an SQL-like query API for embedding in Scala and Java code, Flink supports rich languages for data analysis and processing in addition to its own machine learning and graph processing libraries.
Use in DataOps: Big data processing and analytics frameworks provide scalability, fault tolerance, and distributed computation services along with the core functionalities of processing and analyzing data. In DataOps, the essence of these frameworks comes in when dealing with a large volume of data and performing different processing tasks together. These frameworks provide a holistic solution from storing data to processing and analyzing it with memory and CPU management.

Additional Information:
• Table 13 above presents some essential features of popular products in the given categories. However, there are several alternatives and niche products, so it is recommended to do a thorough analysis of requirements and product features before using them in DataOps projects.
• We have described the general usability of the product categories in DataOps, and it is up to DataOps teams whether to use a product or not based on their requirements and goals.

4.1.4. Discussion

DataOps takes advantage of existing tools and technologies to accomplish tasks. There are hundreds of tools available on the market with similar features and functionalities, and choosing the right ones from this bucket requires informed decisions by project owners in consultation with team members. Team members need to streamline their tasks and the tools they require through research, weighing the pros and cons of each product and why it serves them best. DataOps with suitable tools can cover all stages of the data lifecycle. From data collection, through processing and analysis, to publishing, all stages of data movement are covered by combinations of tools and technologies working together to serve the tasks inside the data pipeline. In Section 4.1.3.3, we divided tools and technologies into several categories and listed some popular tools with their features and the data lifecycle stages they support. Most of the tools listed above are used in all stages of the data lifecycle; it is not about individual tools, but about using the functionality of tools across the lifecycle. For example, workflow orchestration tools can orchestrate the whole data pipeline from collection to deletion of the data. Furthermore, tools for testing and monitoring, deployment automation, collaboration and communication, data storage, and IDEs are required at every step of the lifecycle. The usage of tools directly reflects the level of automation we want to achieve in data pipeline operation. With the right tools, manual tasks are eliminated and automated. Tools and technologies are core components of the data pipeline, but without people's involvement in analysis tasks, technology cannot accomplish the fundamental principle of DataOps.

Figure 17: DataOps ecosystem

Figure 17 illustrates the DataOps ecosystem, where various categories of tools are aligned with people to match the process of converting input from a source into insights as output through a series of data lifecycle movements in between. Depending on the project goal and the level of automation, tools and technologies from the stacks are chosen. It is not always necessary to apply all the tool categories listed above. DataOps' primary objective is to deliver quality results in improved time and at low cost. If that can be fulfilled by using one or a few tools from the list above, then we can deliver the project with those tools. DataOps is also about continuous improvement, so people working on the project should never give up on experimenting with new technologies and delivering better project results.

Challenges in DataOps Implementation

DataOps brings benefits as well as challenges. For an organization to succeed with DataOps, it needs to consider potential issues and be prepared to overcome them. Some of the issues that need to be taken care of while implementing DataOps practices are listed below.

1. Changing the organization's culture
DataOps is all about delivering analytics results faster, and the only way to make that happen is to encourage communication and collaboration across all departments. Data scientists, data engineers, managers, data analysts, system architects, system developers, customers, and other data stakeholders all need to be willing to come together to break the status quo. DataOps can be a significant change, and for its success everyone needs to be on board, including top executives, IT and business managers, data workers, and everyone else involved in the data analytics project, to identify and use the right tools.

2. Innovation with low risk
DataOps advocates continuous improvement in the product and the cycle time, which means less time for development, testing, and deployment to a production environment. Teams need to move quickly without compromising quality. Not only quality but also compliance with company policies and standards is required without affecting the cycle time. Automation gives extra room by reducing the manual work of testing, monitoring, and deployment. With automation in the deployment cycle, there is little time for reviews, which increases the risk of missing details and pieces of information. So, initially, it will take time to gain total confidence in ensuring data and process quality.

3. Cost of DataOps
The initial cost of introducing new tools and technology, training employees, and moving away from the old system can be substantial, and it is easy to get discouraged at the beginning when there are no immediate benefits to realize. Nevertheless, in the long run, DataOps will pay off by reducing cycle time and standardizing the analytics product and process quality.

4. Transition from expertise-based teams to cross-functional teams
DataOps succeeds with cross-functional team collaboration and communication. Creating integrated data analytics teams brings employees together from different departments and with varied expertise to solve a specific problem. Nevertheless, the challenge of structural change is enormous. One should think of a way to include all related and required members in the team with proper authority and responsibilities. There should always be a trust-based environment among team members and between analytics teams, management, and customers.

5. Managing multiple environments
DataOps with multiple environments provides freedom for innovation and improvement but also creates the necessity of properly managing those environments. Without an appropriate system management plan, it can quickly get out of hand and strain performance and cost instead of delivering benefits.

6. Sharing knowledge
Tribal knowledge creates a big problem, and DataOps can make it even worse: new tools and technologies, changes in processes, and the execution of data analytics projects on different platforms than before. Without useful documentation or the creation of a knowledge base, teamwork can be a challenging task to accomplish.

7. Tools and technology diversity
In DataOps, several tools and technologies are used to accomplish the required tasks. This brings the challenge of maintaining and matching the performance of tools both individually and collectively. One tool should not impact or restrict the performance of others, so careful selection of tools is always emphasized.

8. Security and quality
With multiple environments and team players in a project, security and quality are crucial to maintain. Data privacy, system security, the quality of data, code, and insights, and the authority of data workers and stakeholders should be well described and implemented in DataOps from the beginning. Otherwise, they will be hard to enforce once things get out of hand.

4.2. Experiment Evaluation

This section evaluates our experiment in terms of the execution time of data pipelines and the implementation approach we followed from the DataOps implementation guidelines.

4.2.1. Evaluation of Pipeline using DataOps Implementation Guidelines

We executed the same data analytics project with two approaches. Experiment 1 is the manual execution of the data pipeline tasks, whereas Experiment 2 automates the data pipeline using a workflow orchestration tool. We compare our implementation approach with the DataOps implementation guidelines presented in Section 2.4.3.1.

Set up DataOps culture: The comparison is not applicable, since the experiments we performed are not part of an organization or team; a single person is responsible for handling the overall project.

Automate and orchestrate: In Experiment 1, there are no automation or orchestration tools; all four steps of the data analytics pipeline are executed manually. Experiment 2, in contrast, converted all steps into a DAG on Apache Airflow, and all tasks inside the DAG are executed automatically.

Use of version control: Both experiments make use of the code provided in the TBFY GitHub account. To set up a local project environment, some code changes were made by forking the original GitHub repository. For data, we did not use data versioning tools, but a separate output folder was created to store each task's output.

Use multiple environments: In Experiment 1, no multiple environments were used; the project was performed without a test environment, and all tasks were executed in a single production environment. For Experiment 2, separate testing and production environments were used. The data pipeline was run in the test environment using partial data before running the full pipeline in the production environment.

Reuse and containerize: In Experiment 2, all tools and technologies are hosted as Docker containers, and the data pipeline itself also runs inside Docker using Apache Airflow. In Experiment 1, only Apache Jena Fuseki is hosted inside Docker, and the rest of the required services are installed in the Linux environment. Both experiments reuse the pipeline code and the Docker images of the Fuseki services from GitHub. For Experiment 2, we use the Apache Airflow Docker image from the Puckel project.

Test and test: Experiment 2 was executed in the production environment after being successfully tested in the test environment. Tests were performed for the number of input files, the number of output files, the required output file format, and the availability of N-Triples in the Fuseki server corresponding to the .ntt files available in the local environment. To track the number of files, we used the Airflow inbuilt log generator and a script from the pipeline code to count the number of output and N-Triples files. In Experiment 1, no tests were performed. We directly ran the bash scripts in the Linux environment and checked whether a task had completed by looking at the output file 'statement.txt' after completion of tasks 1, 2, and 3. For the 4th task, we used the statement.txt of the third task and the statement_publish.txt created during the execution of the fourth task. statement.txt counts the number of files processed, with the total execution time for each data folder (recorded daily), and statement_publish.txt counts the number of N-Triples published in RDF format after the execution of the 4th task inside the daily folder.

Continuous integration and deployment: Even though Experiment 2 does not use any specific continuous integration and deployment tools, we set up a common data and code repository folder for the testing and production environments. With this, we eliminated the need to move code and data to the production environment manually after testing is accomplished. In Experiment 1, however, there is no concept of continuous integration and deployment, since there are no multiple environments.

Continuous monitor: Experiment 1 has no support for continuous monitoring; however, with the help of the statement.txt and statement_publish.txt files, task execution success was examined at the end. Moreover, no runtime monitoring of the tasks was done, and there was no system monitoring during task execution; we used the inbuilt Linux task monitoring service to track CPU and memory usage. In contrast, in Experiment 2, Airflow provides a useful web interface to see which task is currently being performed, with detailed logs. Also, an email alert was created using Apache Airflow's inbuilt email service for the completion and failure of tasks (a small configuration sketch is given below).

Communicate and collaborate: Automatic emails on task completion and failure provided up-to-date data pipeline status. Tools like Apache Airflow, the email service, Docker, Postgres, and Apache Jena Fuseki were used together to execute the ETL tasks in a single pipeline in Experiment 2. In Experiment 1, by contrast, one has to stay in front of the screen to check the pipeline status; there are no alerts on failure or completion, so one must estimate the completion time and regularly check by eye to monitor the status of the ETL tasks.
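The sketch below illustrates how the email alerting and the output-file count check described above could be wired in Airflow; the alert address, output folder, and expected file count are assumptions for this example and not the exact configuration of Experiment 2.

    # Sketch of the alerting and file-count check described above (illustrative
    # values): Airflow's built-in email handling reports task failures, and a
    # small Python task fails loudly when expected output files are missing.
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import

    default_args = {
        "owner": "dataops",
        "start_date": datetime(2020, 1, 1),
        "email": ["pipeline-alerts@example.org"],  # assumed alert address
        "email_on_failure": True,                  # send a mail when a task fails
        "email_on_retry": False,
    }

    def check_output_count(output_dir="/data/output", expected=10):
        """Raise (and thereby trigger a failure alert) if output files are missing."""
        produced = len(os.listdir(output_dir))
        if produced < expected:
            raise ValueError("Expected %d output files, found %d" % (expected, produced))

    with DAG(
        dag_id="pipeline_quality_checks",
        default_args=default_args,
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(task_id="check_output_count", python_callable=check_output_count)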


Table 14: Summary of the DataOps implementation guidelines covered by the experiments

Implementation steps | Experiment 1 | Experiment 2
Set DataOps culture | Not applicable | Not applicable
Automate and Orchestrate | Not satisfied | Satisfied
Use version control | Satisfied | Satisfied
Use multiple environments | Not satisfied | Satisfied
Reuse and containerize | Partially satisfied | Satisfied
Test and test | Not satisfied | Satisfied
Continuous integration and deployment | Not satisfied | Partially satisfied
Continuous monitor | Partially satisfied | Satisfied
Communicate and collaborate | Not satisfied | Satisfied

In Table 14, a summary of the evaluation of the experiments with respect to the DataOps implementation guidelines is presented. Experiment 2 satisfies seven implementation guidelines and partially satisfies one, whereas Experiment 1 stays below a satisfactory level in terms of following the DataOps implementation guidelines. From this evaluation, it can be concluded that Experiment 2 follows the DataOps principles to execute the data pipeline.

4.2.2. Runtime Evaluation

Our experiment compares the two implementation approaches in terms of project execution time and end-to-end delivery time. It is beneficial to investigate the performance of the DataOps implementation approach in data analytics projects against the manual execution of tasks. Details of the project implementation are described in Section 3.2. We designed the experiment to track each step's execution time and calculated the total execution time of each experiment. The total execution time does not include the buffer time between step executions in Experiment 1, whereas in Experiment 2 there is no buffer time since we used the DAG-based workflow orchestrator to automate execution. In the stepwise execution, Experiment 1 executed the first two tasks (Enrich JSON data and Convert JSON to XML) slightly faster than Experiment 2 because, in Experiment 1, the data are stored in the same system (Linux) where the service is running, so data retrieval and storage were faster to perform. In Experiment 2, data are stored outside the Docker container (in Linux), and reading and writing data from a Docker container has comparatively higher latency and lower throughput. The third (Map XML to RDF) and fourth (Publish RDF to Database) tasks of Experiment 2 outperform Experiment 1. In Experiment 2, the execution time of task 3 is more than 2 hours shorter than in Experiment 1. This is due to the containerized application: containers run with isolated resources allocated to the containers and the application, so the application inside the Docker containers uses optimal resources to perform the task. In Experiment 1, however, the performance of a task is affected by resources shared with other services and applications in the Linux environment. Task 4 depends on network quality and database (Apache Jena Fuseki) performance in both experiments; N-Triples RDF data are published to the database using the HTTP protocol. In Experiment 2, data are published 8 minutes faster than in Experiment 1. Overall, Experiment 2 delivers its result 2 hours faster than Experiment 1. In terms of end-to-end delivery time, however, Experiment 1 lags far behind Experiment 2. If we compare Figure 18 and Figure 19, in Figure 18 we can see the gap between the end point of the previous task and the start point of the next task. This gap represents the buffer time before a task is manually run after completion of the previous one. If we add the buffer time to the execution time, then the project delivery time increases by the total buffer time. In Experiment 1, the buffer time can range from minutes to several hours or days, so the delivery cycle goes well beyond the tasks' execution time. In Experiment 2, due to task execution automation, there is zero buffer time, which means the overall project delivery time is equal to the project execution time.
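As a simple illustration of this relationship (the notation below is introduced here and does not appear elsewhere in the thesis), let t_i be the execution time of task i and b_i the manual buffer time before it; the end-to-end delivery time is then

    T_{\text{delivery}} = \sum_{i=1}^{n} t_i + \sum_{i=1}^{n} b_i .

In Experiment 2 every b_i is zero, so the delivery time collapses to the pipeline execution time, whereas in Experiment 1 the b_i terms can dominate once buffers stretch to hours or days.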

Figure 18: Data analytics tasks execution time without following DataOps implementation guidelines


Figure 19: Data pipeline execution time following DataOps implementation guidelines

4.2.3. Discussion

With the DataOps implementation guidelines, we could reduce the product delivery cycle time and the pipeline execution time by several hours. Using the DataOps approach, automation of the data pipeline, testing and monitoring, data change tracking, system performance tracking, and communication and collaboration were easier. Through observation of the experiments and the experience of accomplishing the tasks, DataOps gave a significant advantage in analysis task execution by reusing, automating, orchestrating, and containerizing tools, technology, and people for the data lifecycle management process. In addition to the technical benefits, with DataOps, people's involvement in unproductive tasks such as manually monitoring system status, running tasks manually, and moving data and code from one system to another has been reduced to a minimum.

Chapter 5. Conclusions and Future Work

The primary motivation of this thesis work is to carry out a systematic search of research on DataOps. The main tasks include defining DataOps, exploring the data lifecycle supported by DataOps, creating a feature-based comparison matrix of DataOps-supported tools, and executing a data pipeline using the DataOps implementation guide. A comparison against a data pipeline executed without the DataOps implementation guide is made using two experiments. The rest of this chapter describes the findings and conclusions of the thesis work, and the final part presents future work.

5.1. Conclusions

In this thesis work, we performed a systematic literature search to establish DataOps as a methodology to implement data pipelines. We studied different data lifecycle definition approaches and created a data lifecycle by adopting best practices from the studied data lifecycle references. We also investigated the principles and practices of DataOps, the role of data governance and lineage in DataOps, and the implementation approaches proposed by DataOps practitioners. Based on this investigation, we developed an implementation guide for executing an existing data analytics project for comparison. We analyzed the features of different tools for DataOps practice and explored their support in different stages of the lifecycle. The summary of the results and conclusions of the thesis work in terms of the goals and research questions stated in Section 1 is as follows:

Goal 1: Provide a definition for DataOps and point out the ambiguities in the field of DataOps.
To define DataOps, we analyzed existing definitions and found that researchers' definitions can be categorized into three main approaches: task-oriented, goal-oriented, and process- and team-oriented. Furthermore, we came up with a new definition in Section 4.1.1, which combines all three approaches. The definition we provide considers the collaborative involvement of people, processes, and technology to automate, manage, and track data and data analytics projects for better outcomes. Since DataOps is an evolving field, there are some ambiguities and misconceptions that need to be addressed. Misconceptions related to the cost of practicing DataOps, the application area of DataOps, the difference between DataOps and the data pipeline, and its relation to DevOps are clarified in Section 4.1.1.

Goal 2: Propose a generic DataOps pipeline that supports the data lifecycle.
After defining DataOps and addressing ambiguities, we proposed a generic DataOps pipeline to support the data lifecycle by differentiating it from the traditional data pipeline and showing the advantage it can have over traditional data pipelines in terms of data lifecycle management. We restructured the existing data pipeline using DevOps, Agile, and SPC methodologies, together with additions to the skillset of people, a reform of organizational culture, and a degree of task automation, to name it the DataOps pipeline. The DataOps pipeline we created is based on the DataOps principles (see Section 2.4.2), follows the DataOps implementation guide (see Section 2.4.3.1) for the development and execution of analytics projects, and supports all stages of the data lifecycle presented in Section 2.1.5. While restructuring an existing pipeline or developing a new one, there are challenges, mentioned in Section 4.1.4, which need to be addressed and taken into consideration before starting a DataOps implementation.

Goal 3: Prepare a comparative feature analysis of the tools and technologies that support DataOps.
We prepared comparison tables of DataOps-supported tools in Section 4.1.3.3 based on the criteria presented in Section 4.1.3.2. For ease of comparison of the tools and their functionalities, we divided them into several categories (see Section 4.1.3.1). The tools presented in the comparison matrix were selected based on their popularity during the literature review. Tools are evaluated to show their level of support for DataOps, their ease of installation and operation, their compatibility with other tools to execute a whole analytics project, and the stages of the data lifecycle they are usually used for. However, in some categories, we have not used all comparison criteria due to the nature of the tools and the similar traits they share with respect to the discarded criteria.

Goal 4: Explore which parts of the data lifecycle are currently supported and where we need potential enhancement.
With the help of the comparison tables, we evaluated the tools' applicability in different stages of the lifecycle and derived the current support status of each category of tools in the different stages. Tools that do not have a separate column indicating which stage of the data lifecycle they support are applicable in all stages of the data lifecycle. Our findings regarding tool support show that every stage of the data lifecycle is well supported by a set of tools in different categories, but there is a need for careful selection of tools based on the task we need to perform.

Goal 5: Implement a data pipeline and evaluate it using the DataOps implementation process.
We implemented an existing ETL pipeline using two different approaches to evaluate the advantage of the DataOps approach over manual ETL task execution. Experiment 1 was performed without following any DataOps principles or implementation guide. In contrast, Experiment 2 was designed to reflect the DataOps principles and implementation guide wherever applicable. The observation of the experiments and the results of the execution showed the level of improvement DataOps provided in the project's efficiency in terms of pipeline execution and product delivery time.

Based on the above analysis, it can be concluded that we have achieved all our research goals with good results.

RQ 1: What is currently understood by DataOps? What are the ambiguities in the understanding of the concept?

"DataOps can be defined as a data pipeline development and execution methodology that assembles people and technology to deliver better results in a shorter time. With DataOps, people, processes, and technology are orchestrated with a degree of automation to streamline data flow from one stage of the data lifecycle to another. With Agile, DevOps, and SPC's best practices, technologies, and processes, DataOps promotes data governance, continuous testing and monitoring, optimization of the analysis process, communication, collaboration, and continuous improvement."

The definition presented here reflects the current understanding of DataOps. DataOps ambiguities lie in the method, scope, and cost of implementation. Regarding the method, the misunderstanding prevails that DataOps is all about automating using tools and technology, which is not valid, since every data pipeline project starts and ends with people. There is always human involvement in tasks such as requirement gathering, coding, researching tools, project planning, operating tools, monitoring, and many more. With DataOps, human involvement shifts towards innovation and development rather than repetitive manual production tasks (which can be automated using DataOps tools). The definition we presented in Listing 1 also clarifies that DataOps is not just DevOps for data. In terms of scope, people treat DataOps as separate from the data pipeline and take up DataOps practices for data analytics projects only. In fact, DataOps is a restructured version of the data pipeline with the motive to deliver better and faster results, which are hard to achieve using traditional pipelines. Moreover, DataOps is not just about the analytics job; it covers the whole organization and its customers in a communication and collaboration loop to accomplish its goals. Regarding the cost of implementation, organizations consider DataOps practices expensive to perform. Although acquiring new technology and replacing the old is an extra cost, if we consider the value created by the new practices, the cost is worth bearing. After all, data analytics projects will cost the organization whether they follow DataOps or not. Furthermore, in DataOps, cost control is a priority task along with the quality of output. Moreover, while considering which tools to use, there are many open-source and free tools available, which is always worth taking into account.

RQ 2: Which tools implement DataOps, and what functionalities do they offer?

The tools presented in the comparison tables are some representative tools that support DataOps. There are hundreds of tools with the same functionalities to choose from. We have presented the comparison matrix as a starting point for research when choosing a tool for a given functionality. We categorized the tools as workflow orchestration tools, testing and monitoring tools, deployment automation tools, data governance tools, code, artifact, and data versioning tools, analytics and visualization tools, and communication and collaboration tools, along with containers, resource managers, data storage, IDEs and source code editors, cloud servers, and big data processing and analytics frameworks. These categorizations are based on the functionalities the tools offer in DataOps to complete big and small data analytics projects. It is not necessary to use all the functionalities in a data pipeline; it purely depends on the scope and nature of the project. Careful planning and thorough research are required before deciding to use a specific tool to obtain a certain functionality.

RQ 3: Which parts of the data lifecycle are currently supported well, and where is the need for potential enhancement?

DataOps tools and technologies can be used in data lifecycle stages based on the functionalities they provide. Some tools are particular to certain lifecycle stages: analysis and visualization tools are used in the analysis stage to provide insights and publish results, and cloud storage is used in the storage and data collection stages of the lifecycle. In comparison, some tools, like communication and collaboration tools, are independent of the data lifecycle. Deployment automation tools, data governance tools, code, artifact, and data versioning tools, containers, resource managers, IDEs and source code editors, and cloud servers are used in all stages (collection/creation, processing, analysis, publishing, and storage) of the data lifecycle. The lifecycle stage support of workflow orchestration tools and testing and monitoring tools depends on the tools' features: some can provide support across the whole pipeline process, whereas others are specific to certain tasks and frameworks. Furthermore, big data processing and analysis frameworks provide a complete solution even though their focus is more on data processing and analysis. The potential enhancement needed is in the design and development aspects of the pipeline, with consideration of using tools to increase performance and quality while reducing the cost of operation during pipeline execution. That is the reason why DataOps advocates separating the production environment from development and using multiple development environments. We can test the performance of different tools and technologies in separate and multiple environments and take the best to the production environment. Besides, tools should support future changes (changes in data pipeline steps, data and data structures, code, other tools and their configurations, and the execution environment) and have scope for scalability.

5.2. Future Work

DataOps is an emerging concept, and there is little experience with it and numerous challenges. We have done this thesis work to explore the existing concepts in DataOps and create a starting point for future work. The next stage could be experimenting with the DataOps approach in different data analytics projects and validating the performance of DataOps and DataOps-supported tools. We have created a feature-wise comparison of different tools based on their functionalities. Another step would be to implement different alternative tools with the same functionalities and test their performance in different industry use cases. Furthermore, a compatibility rating (based on combined performance when used together in a data analytics task) of one tool from one functional group against tools from other functional groups would help DataOps practitioners make informed decisions. Big data pipelines differ from data pipelines in terms of computing in heterogeneous distributed environments with large data processing requirements. Most of the literature and research projects on DataOps concern specific and medium-scale data analytics projects. An important future direction is therefore to evaluate DataOps as a scalable and reliable solution for big data processing pipelines.

Bibliography

[1] A. Katal, M. Wazid, R.H. Goudar, Big data: Issues, challenges, tools and Good practices, 2013 6th Int. Conf. Contemp. Comput. IC3 2013. (2013) 404–409. https://doi.org/10.1109/IC3.2013.6612229. [2] I.A.T. Hashem, I. Yaqoob, N.B. Anuar, S. Mokhtar, A. Gani, S. Ullah Khan, The rise of “big data” on cloud computing: Review and open research issues, Inf. Syst. 47 (2015) 98–115. https://doi.org/10.1016/j.is.2014.07.006. [3] S. Sagiroglu, D. Sinanc, Big data: A review, in: Proc. 2013 Int. Conf. Collab. Technol. Syst. CTS 2013, 2013: pp. 42–47. https://doi.org/10.1109/CTS.2013.6567202. [4] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics, Int. J. Inf. Manage. 35 (2015) 137–144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007. [5] H. Baars, J. Ereth, From data warehouses to analytical atoms - The internet of things as a centrifugal force in business intelligence and analytics, in: 24th Eur. Conf. Inf. Syst. ECIS 2016, 2016. https://aisel.aisnet.org/ecis2016_rp/3 (accessed July 25, 2020). [6] S. LaValle, E. Lesser, R. Shockley, M.S. Hopkins, N. Kruschwitz, Big Data, Analytics and the Path from Insights to Value, MIT Sloan Manag. Rev. 52 (2011) 1–18. https://sloanreview.mit.edu/article/big-data- analytics-and-the-path-from-insights-to-value/ (accessed July 25, 2020). [7] K. Singh, R. Wajgi, Data analysis and visualization of sales data, in: IEEE WCTFTR 2016 - Proc. 2016 World Conf. Futur. Trends Res. Innov. Soc. Welf., Institute of Electrical and Electronics Engineers Inc., 2016. https://doi.org/10.1109/STARTUP.2016.7583967. [8] C. Bergh, G. Benghiat, S. Eran, The DataOps Cookbook, second, 2019. [9] M. Loukides, What is DevOps?, O’Reilly Media, Inc, 2012. [10] D.J. Mala, Integrating the Internet of Things into Software Engineering Practices, IGI Global, 2019. [11] W. Eckerson, Best Practices in DataOps How to Create Robust, Automated Data Pipelines, (2019). www.eckerson.com (accessed November 8, 2020). [12] J. Ereth, W. Eckerson, DataOps: Industrializing Data and Analytics Strategies for Streamlining the Delivery of Insights, 2018. www.eckerson.com (accessed October 21, 2020). [13] J. Ereth, DataOps – Towards a Definition, in: LWDA, 2018: pp. 104–112. http://ceur-ws.org/Vol-2191/paper13.pdf. [14] A.K. Gupta, S. Singhal, R.R. Garg, Challenges and issues in data analytics, in: Proc. - 2018 8th Int. Conf. Commun. Syst. Netw. Technol.

88 CSNT 2018, Institute of Electrical and Electronics Engineers Inc., 2018: pp. 144–150. https://doi.org/10.1109/CSNT.2018.8820251. [15] Dr. Nicole Forsgren, Dr. Dustin Smith, Jez Humble, Jessie Frazelle, Accelerate State of DevOps , 2019. https://services.google.com/fh/files/misc/state-of-devops-2019.pdf (accessed November 25, 2020). [16] L.E. Lwakatare, T. Kilamo, T. Karvonen, T. Sauvola, V. Heikkilä, J. Itkonen, P. Kuvaja, T. Mikkonen, M. Oivo, C. Lassenius, DevOps in practice: A multiple case study of five companies, Inf. Softw. Technol. 114 (2019) 217–230. https://doi.org/10.1016/j.infsof.2019.06.010. [17] M. Hüttermann, M. Hüttermann, Beginning DevOps for Developers, in: DevOps Dev., Apress, 2012: pp. 3–13. https://doi.org/10.1007/978-1- 4302-4570-4_1. [18] M. Artac, T. Borovssak, E. Di Nitto, M. Guerriero, D.A. Tamburri, DevOps: Introducing infrastructure-as-code, in: Proc. - 2017 IEEE/ACM 39th Int. Conf. Softw. Eng. Companion, ICSE-C 2017, Institute of Electrical and Electronics Engineers Inc., 2017: pp. 497–498. https://doi.org/10.1109/ICSE-C.2017.162. [19] Z. Zhang, DevOps for Data Science System, KTH Royal Institute of Technology, 2020. https://www.diva- portal.org/smash/record.jsf?pid=diva2%3A1424394&dswid=mainwindow (accessed August 20, 2020). [20] K. Kontostathis, Collecting Data, The DevOps Way, (2017). https://insights.sei.cmu.edu/devops/2017/11/collecting-data-the-devops- way.html (accessed October 5, 2020). [21] S. Ward-Riggs, The Difference Between DevOps and DataOps – Altis Consulting, (n.d.). https://altis.com.au/the-difference-between-devops- and-dataops/ (accessed July 30, 2020). [22] D. Grande, J. Machado, B. Petzold, M. Roth, Reducing data costs without jeopardizing growth, 2020. https://www.mckinsey.com/business- functions/mckinsey-digital/our-insights/reducing-data-costs-without- jeopardizing-growth (accessed September 25, 2020). [23] A. Håkansson, Portal of Research Methods and Methodologies for Research Projects and Degree Projects, in: Proc. Int. Conf. Front. Educ. Comput. Sci. Comput. Eng. FECS’13, CSREA Press U.S.A, 2013: pp. 67– 73. http://www.world-academy-of-science.org/worldcomp13/ws. [24] M.L. Langseth, M.Y. Chang, J. Carlino, J.R. Bellmore, D.D. Birch, J. Bradley, R.S. Bristol, D.D. Buscombe, J.J. Duda, A.L. Everette, T.A. Graves, M.M. Greenwood, D.L. Govoni, H.S. Henkel, V.B. Hutchison, B.K. Jones, T. Kern, J. Lacey, R.M. Lamb, F.L. Lightsom, J.L. Long, R.A. Saleh, S.W. Smith, C.E. Soulard, R.J. Viger, J.A. Warrick, K.E.

89 Wesenberg, D.J. Wieferich, L.A. Winslow, Community for Data Integration 2015 annual report, 2016. https://doi.org/10.3133/ofr20161165. [25] M. El Arass, I. Tikito, N. Souissi, M. El, M. El Arass, I. Tikito, Data lifecycles analysis: towards intelligent cycle Data lifecycles analysis: towards intelligent cycle, n.d. https://hal.archives-ouvertes.fr/hal- 01593851 (accessed October 30, 2020). [26] X. Yu, Q. Wen, A view about cloud data security from data life cycle, in: 2010 Int. Conf. Comput. Intell. Softw. Eng. CiSE 2010, 2010. https://doi.org/10.1109/CISE.2010.5676895. [27] A. Wahaballa, O. Wahballa, M. Abdellatief, H. Xiong, Z. Qin, Toward unified DevOps model, in: Proc. IEEE Int. Conf. Softw. Eng. Serv. Sci. ICSESS, IEEE Computer Society, 2015: pp. 211–214. https://doi.org/10.1109/ICSESS.2015.7339039. [28] IBM, Wrangling big data: Fundamentals of data lifecycle management, IBM Manag. Data Lifecycle. (2013). [29] J.L. Faundeen, T.E. Burley, J.A. Carlino, D.L. Govoni, H.S. Henkel, S.L. Holl, V.B. Hutchison, E. Martín, C.C.L. Ellyn T. Montgomery, S. Tessler, L.S. Zolly, The United States Geological Survey Science Data Lifecycle Model: U.S. Geological Survey Open-File Report 2013–1265, 2013. https://doi.org/http://dx.doi.org/10.3133/ofr20131265. [30] S. Allard, DataONE: Facilitating eScience through Collaboration, J. EScience Librariansh. 1 (2012) 4–17. https://doi.org/10.7191/jeslib.2012.1004. [31] J. Densmore, Data Pipelines Pocket Reference, First, 2020. [32] B. Plale, I. Kouper, The Centrality of Data: Data Lifecycle and Data Pipelines, in: Data Anal. Intell. Transp. Syst., Elsevier Inc., 2017: pp. 91– 111. https://doi.org/10.1016/B978-0-12-809715-1.00004-3. [33] A. Bansal, S. Srivastava, Tools Used in Data Analysis: A Comparative Study, 2018. https://www.analyticsvidhya.com/blog/2014/03/sas-vs- (accessed August 13, 2020). [34] H. Khalajzadeh, M. Abdelrazek, J. Grundy, J. Hosking, Q. He, A Survey of Current End-User Data Analytics Tool Support, in: Proc. - 2018 IEEE Int. Congr. Big Data, BigData Congr. 2018 - Part 2018 IEEE World Congr. Serv., Institute of Electrical and Electronics Engineers Inc., 2018: pp. 41–48. https://doi.org/10.1109/BigDataCongress.2018.00013. [35] Z.A. Al-Sai, R. Abdullah, M.H. Husin, Critical Success Factors for Big Data: A Systematic Literature Review, IEEE Access. 8 (2020) 118940– 118956. https://doi.org/10.1109/ACCESS.2020.3005461. [36] G. Alley, What is a Data Pipeline? | Alooma, (2018). https://www.alooma.com/blog/what-is-a-data-pipeline (accessed May 13,

90 2020). [37] Lenny Liebmann, 3 reasons why DataOps is essential for big data success | IBM Big Data & Analytics Hub, IBM Big Data Anal. HUb. (2014). https://www.ibmbigdatahub.com/blog/3-reasons-why-dataops-essential- big-data-success (accessed June 20, 2020). [38] A. Palmer, From DevOps to DataOps - DataOps Tools Transformation | Tamr, (2015). https://www.tamr.com/blog/from-devops-to-dataops-by- andy-palmer/ (accessed April 3, 2020). [39] Gartner Inc., Definition of DataOps - Gartner Information Technology Glossary, Gartner.Com. (2019). https://www.gartner.com/en/information-technology/glossary/data-ops (accessed May 20, 2020). [40] M. Stonebraker, N. Bates-Haus, L. Cleary, L. Simmons, Getting Data Operations Right, First, O’Reilly Media, Inc., 2018. [41] E. Jarah, What is DataOps? | Platform for the Machine Learning Age | Nexla, Nexla. (n.d.). https://www.nexla.com/define-dataops/ (accessed September 3, 2020). [42] DataOps and the DataOps Manifesto | by ODSC - Open Data Science | Medium, Open Data Sci. (2019). https://medium.com/@ODSC/dataops- and-the-dataops-manifesto-fc6169c02398 (accessed April 4, 2020). [43] The DataOps Manifesto, The DataOps Manifesto, (n.d.). https://www.dataopsmanifesto.org/ (accessed April 4, 2020). [44] DataKitchen, DataOps is NOT just DevOps for data, (2018). https://medium.com/data-ops/dataops-is-not-just-devops-for-data- 6e03083157b7 (accessed August 4, 2020). [45] G. Anadiotis, DataOps: Changing the world one organization at a time | ZDNet, (2017). https://www.zdnet.com/article/dataops-changing-the- world-one-organization-at-a-time/ (accessed April 20, 2020). [46] H. Atwal, Practical DataOps: Delivering Agile Data Science at Scale, Apress, 2020. https://doi.org/10.1007/978-1-4842-5104-1. [47] W. Eckerson, Diving into DataOps: The Underbelly of Modern Data Pipelines, (2018). https://www.eckerson.com/articles/diving-into-dataops- the-underbelly-of-modern-data-pipelines (accessed April 4, 2020). [48] J. Zaino, Get Ready for DataOps - DATAVERSITY, (2019). https://www.dataversity.net/get-ready-for-dataops/ (accessed April 5, 2020). [49] DataKitchen, DataOps in Seven Steps, Medium. (2017). https://medium.com/data-ops/dataops-in-7-steps-f72ff2b37812 (accessed November 4, 2020). [50] D. Wells, DataOps: More Than DevOps for Data Pipelines, Eckerson Gr. (2019). https://www.eckerson.com/articles/dataops-more-than-devops-

91 for-data-pipelines (accessed July 30, 2020). [51] What is DataOps?, DataOpsZone. (n.d.). https://www.dataopszone.com/what-is-dataops/ (accessed October 20, 2020). [52] S. Gibson, Exploring DataOps in the Brave New World of Agile and Cloud Delivery, 2020. [53] Agile Alliance, What is Agile Software Development? | Agile Alliance, Agil. Alliance. (2019). https://www.agilealliance.org/agile101/ (accessed February 6, 2020). [54] Agile Alliance, Agile Manifesto for Software Development | Agile Alliance, (2001). https://www.agilealliance.org/agile101/the-agile- manifesto/ (accessed June 11, 2020). [55] Agile Alliance, 12 Principles Behind the Agile Manifesto - Agile Alliance, Agilealliance.Org. (n.d.). https://www.agilealliance.org/agile101/12- principles-behind-the-agile-manifesto/ (accessed April 24, 2020). [56] DataKitchen, How Software Teams Accelerated Average Release Frequency from 12 Months to Three Weeks, Medium. (2017). https://medium.com/data-ops/how-software-teams-accelerated-average- release-frequency-from-12-months-to-three-weeks-5cb86c2b551e (accessed May 25, 2020). [57] DataKitchen, High-Velocity Data Analytics with DataOps, 2017. www.datakitchen.io (accessed April 4, 2020). [58] M. Kersten, A cambrian explosion of DevOps tools, IEEE Softw. 35 (2018) 14–17. https://doi.org/10.1109/MS.2018.1661330. [59] L. Bass, I. Weber, L. Zhu, DevOps: A Software Architect’s Perspective, 1st ed., Addison-Wesley Professional, 2015. [60] R. Chan, DevOps engineer is the most recruited job on LinkedIn, Bus. Insid. (2018). https://www.businessinsider.com/devops-engineer-most- recruited-job-on-linkedin-2018-11?r=US&IR=T (accessed April 14, 2020). [61] L. Zhu, L. Bass, G. Champlin-Scharff, DevOps and Its Practices, IEEE Softw. 33 (2016) 32–34. https://doi.org/10.1109/MS.2016.81. [62] F. Erfan, DataOps: The New DevOps of Analytics, InsideBIGDATA. (2019). https://insidebigdata.com/2019/03/29/dataops-the-new-devops- of-analytics/ (accessed July 30, 2020). [63] J.K. Liker, Toyota way: 14 management principles from the world’s greatest manufacturer, McGraw-Hill, New York, 2013. [64] DataKitchen, Lean Manufacturing Secrets that You Can Apply to Data Analytics, Medium. (2017). https://medium.com/data-ops/lean- manufacturing-secrets-that-you-can-apply-to-data-analytics- 31d1a319cbf0#.7db9fza6b (accessed November 4, 2020). [65] DataKitchen, The Seven Steps to Implement DataOps, 2017.

92 www.datakitchen.io (accessed April 14, 2020). [66] S. Gawande, Complete DataOps Implementation Guide, ICEDQ. (2019). https://icedq.com/dataops/dataops-implementation-guide (accessed April 5, 2020). [67] S. Gawande, DataOps Implementation Guide, 2019. [68] G. Thomas, The DGI Data Governance Framework, (2006) 1–20. [69] MDM Institute, What is Data Governance?, (2015). http://www.tcdii.com/whatIsDataGovernance.html (accessed July 6, 2020). [70] B.W. Gow, Getting started with data governance, (2008). [71] M. Aisyah, Y. Ruldeviyani, Designing data governance structure based on data management body of knowledge (DMBOK) Framework: A case study on Indonesia deposit insurance corporation (IDIC), in: 2018 Int. Conf. Adv. Comput. Sci. Inf. Syst. ICACSIS 2018, Institute of Electrical and Electronics Engineers Inc., 2019: pp. 307–312. https://doi.org/10.1109/ICACSIS.2018.8618151. [72] R. Karel, The Process Stages of Data Governance, Inform. Corp. (2014). https://blogs.informatica.com/2014/01/02/the-process-stages-of-data- governance/ (accessed July 6, 2020). [73] E. Strod, Continuous Governance with DataGovOps, DataKitchen. (2020). https://blog.datakitchen.io/blog/continuous-governance-with- datagovops (accessed July 6, 2020). [74] K. Belhajjame, P. Missier, C. Goble, Data Provenance in Scientific Workflows, in: Handb. Res. Comput. Grid Technol. Life Sci. Biomed. Healthc., IGI Global, 2009: pp. 46–59. https://doi.org/10.4018/978-1- 60566-374-6.ch003. [75] A. Lohachab, Bootstrapping Urban Planning: Addressing Big Data Issues in Smart Cities, in: R.C. Joshi, B. Gupta (Eds.), Secur. Privacy, Forensics Issues Big Data, IGI Global, 2020: pp. 217–246. https://doi.org/10.4018/978-1-5225-9742-1.ch009. [76] R. Ikeda, J. Widom, Data Lineage: A Survey, 2009. http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf (accessed July 6, 2020). [77] B. Glavic, K. Dittrich, Data provenance: A categorization of existing approaches, Datenbanksysteme Business, Technol. Und Web, BTW 2007 - 12th Fachtagung Des GI-Fachbereichs “Datenbanken Und Informationssysteme” (DBIS), Proc. (2007) 227–241. https://doi.org/10.5167/uzh-24450. [78] Y. Cui, J. Widom, Lineage tracing for general data warehouse transformations, in: VLDB 2001 - Proc. 27th Int. Conf. Very Large Data Bases, 2001: pp. 471–480.

93 [79] What is Data Lineage? - Definition from Techopedia, Techopedia Inc. (n.d.). https://www.techopedia.com/definition/28040/data-lineage (accessed July 6, 2020). [80] Trifacta Data Wrangling for Hadoop: Accelerating Business Adoption While Ensuring Security & Governance, n.d. www.trifacta.com (accessed July 6, 2020). [81] Art. 30 GDPR - Records of processing activities - GDPR.eu, (n.d.). https://gdpr.eu/article-30-records-of-processing-activities/ (accessed November 8, 2020). [82] Art. 17 GDPR - Right to erasure ('right to be forgotten’) - GDPR.eu, (n.d.). https://gdpr.eu/article-17-right-to-be-forgotten/ (accessed November 8, 2020). [83] Art. 20 GDPR - Right to data portability - GDPR.eu, (n.d.). https://gdpr.eu/article-20-right-to-data-portability/ (accessed November 8, 2020). [84] R. Bose, J. Frew, Composing lineage metadata with XML for custom satellite-derived data products, in: Proceedings. 16th Int. Conf. Sci. Stat. Database Manag., 2004: pp. 275–284. https://doi.org/10.1109/SSDM.2004.1311219. [85] S.M.K. Sigurjonsson, Blockchain Use for Data Provenance in Scientific Workflow, KTH Royal Institute of Technology, 2018. http://kth.diva- portal.org/smash/record.jsf?pid=diva2%3A1235451&dswid=-1674. [86] X. Liang, S. Shetty, D. Tosh, C. Kamhoua, K. Kwiat, L. Njilla, ProvChain: A Blockchain-Based Data Provenance Architecture in Cloud Environment with Enhanced Privacy and Availability, in: Proc. - 2017 17th IEEE/ACM Int. Symp. Clust. Cloud Grid Comput. CCGRID 2017, 2017: pp. 468–477. https://doi.org/10.1109/CCGRID.2017.8. [87] D. Potter, DataOps: The Antidote for Congested Data Pipelines, RTInsights. (2019). https://www.rtinsights.com/dataops-the-antidote-for- congested-data-pipelines/ (accessed July 14, 2020). [88] Analytics Products - Sweden | IBM, IBM Big Data Anal. Hub. (n.d.). https://www.ibm.com/se-en/analytics/products (accessed July 8, 2020). [89] W.W. Eckerson, The Ultimate Guide to DataOps: Product Evaluation and Selection Criteria, 2019. www.eckerson.com (accessed July 8, 2020). [90] W.W. Eckerson, Trends in DataOps, 2019. [91] H. Crocket, Fundamental Review of the Trading Book: Data Management Implications, 2018. https://icedq.com/resources/whitepapers/fundamental-review-of-the- trading-book. [92] M. Chisholm, Effective Quality Assurance and Testing in Data-Centric Projects, Stamford, Connecticut, 2014.

94 [93] K. Madera, D. Paredes Aguilera, Deliver business-ready data fast with DataOps, 2020. [94] K. Madera, The difference between DataOps and DevOps and other emerging technology practices, IBM Big Data Anal. Hub. (2019). https://www.ibmbigdatahub.com/blog/difference-between-dataops-and- devops-and-other-emerging-technology-practices (accessed July 30, 2020). [95] R. Gupta, Components of the DataOps toolchain and best practices to make it successful - Journey to AI Blog, IBM Big Data Anal. Hub. (2019). https://www.ibm.com/blogs/journey-to-ai/2019/12/components- of-the-dataops-toolchain-and-best-practices-to-make-it-successful/ (accessed July 8, 2020). [96] S. Quoma, What is DataOps? - Journey to AI Blog, IBM Big Data Anal. Hub. (2019). https://www.ibm.com/blogs/journey-to-ai/2019/12/what-is- dataops/ (accessed July 8, 2020). [97] A. Raj, D.I. Mattos, J. Bosch, H.H. Olsson, A. Dakkak, From Ad-Hoc data analytics to DataOps, in: Proc. - 2020 IEEE/ACM Int. Conf. Softw. Syst. Process. ICSSP, 2020: pp. 165–174. https://doi.org/10.1145/3379177.3388909. [98] About - They Buy For You, (n.d.). https://theybuyforyou.eu/about/ (accessed November 18, 2020). [99] A. Soylu, B. Elvesaeter, P. Turk, D. Roman, O. Corcho, E. Simperl, I. Makgill, C. Taggart, M. Grobelnik, T.C. Lech, An Overview of the TBFY Knowledge Graph for Public Procurement, n.d. https://ec.europa.eu/growth/single-market/public-procurement_en (accessed November 8, 2020). [100] A. Soylu, O. Corcho, B. Elvesaeter, C. Badenes-Olmedo, F. Martínez, M. Kovacic, M. Posinkovic, I. Makgill, C. Taggart, E. Simperl, T. Lech, D. Roman, Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph, in: 2020. [101] airflow.executors — Airflow Documentation, Apache Airflow. (2019). https://airflow.apache.org/docs/stable/_api/airflow/executors/index.htm l (accessed October 18, 2020). [102] Docker overview | Docker Documentation, (n.d.). https://docs.docker.com/get-started/overview/ (accessed November 19, 2020). [103] E. Miller, An Introduction to the Resource Description Framework, D-Lib Mag. (1998). http://www.dlib.org/dlib/may98/miller/05miller.html (accessed November 19, 2020). [104] Margaret Rouse, What is DataOps (data operations)? - Definition from WhatIs.com, TechTarget. (2019). https://searchdatamanagement.techtarget.com/definition/DataOps

95 (accessed July 4, 2020). [105] M. Barika, S. Garg, A.Y. Zomaya, L. Wang, A.V.A.N. Moorsel, R. Ranjan, S. Garg, L. Wang, A. Van Moorsel, Orchestrating big data analysis workflows in the cloud: Research challenges, survey, and future directions, ACM Comput. Surv. 52 (2019). https://doi.org/10.1145/3332301. [106] Y. Dessalk, N. Nikolov, M. Matskin, A. Soylu, D. Roman, Scalable Execution of Big Data Workflows using Software Containers, in: 2020. [107] H. Chen, J. Wen, W. Pedrycz, G. Wu, Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds, IEEE Trans. Big Data. 6 (2018) 131–144. https://doi.org/10.1109/tbdata.2018.2874469. [108] H. Hu, Y. Wen, T.S. Chua, X. Li, Toward scalable systems for big data analytics: A technology tutorial, IEEE Access. 2 (2014) 652–687. https://doi.org/10.1109/ACCESS.2014.2332453. [109] Y.D. Dessalk, Big Data Workflows: DSL-based Specification and Software Containers for Scalable Executions, KTH Royal Institute of Technology, 2020. [110] H. Yoshida, The DataOps Advantage of Containers and Converged Infrastructure, (2020). https://community.hitachivantara.com/s/article/The-DataOps- Advantage-of-Containers-and-Converged-Infrastructure (accessed November 26, 2020). [111] Linux Containers - LXC - Introduction, (2019). https://linuxcontainers.org/lxc/introduction/ (accessed July 29, 2020). [112] E. Arianyan, H. Taheri, S. Sharifian, Novel energy and SLA efficient resource management heuristics for consolidation of virtual machines in cloud data centers, Comput. Electr. Eng. 47 (2015) 222–240. https://doi.org/10.1016/j.compeleceng.2015.05.006. [113] P. Mell, T. Grance, The NIST definition of cloud computing: Recommendations of the National Institute of Standards and Technology, Nova Science Publishers Inc., 2012. [114] Apache Mesos Telegraf Plugin | InfluxData, InfluxData Inc. (n.d.). https://www.influxdata.com/integration/apache-mesos/ (accessed September 29, 2020). [115] Kubernetes & Kubernete Architecture | Infrastructure Basics, PSSC Labs. (2020). https://pssclabs.com/article/infrastructure-considerations- for-containers-and-kubernetes/ (accessed May 29, 2020). [116] Understanding data storage, Red Hat Inc. (n.d.). https://www.redhat.com/en/topics/data-storage (accessed May 29, 2020). [117] S. Alkatheri, S. Abbas, M. Siddiqui, A Comparative Study of Big Data

96 Frameworks, Int. J. Comput. Sci. Inf. Secur. (2019) 8. https://www.researchgate.net/publication/331318859_A_Comparative_ Study_of_Big_Data_Frameworks. [118] D. García-Gil, S. Ramírez-Gallego, S. García, F. Herrera, A comparison on scalability for batch big data processing on Apache Spark and Apache Flink, Big Data Anal. 2 (2017) 1. https://doi.org/10.1186/s41044-016- 0020-2. [119] W. Inoubli, S. Aridhi, H. Mezni, A. Jung, Big Data Frameworks: A Comparative Study, 2016. https://members.loria.fr/SAridhi/files/software/bigdata/. [120] C. Cheng, S. Li, H. Ke, Analysis on the Status of Big Data Processing Framework, in: Proc. 2018 Int. Comput. Signals Syst. Conf. ICOMSSC 2018, Institute of Electrical and Electronics Engineers Inc., 2018: pp. 794– 799. https://doi.org/10.1109/ICOMSSC45026.2018.8941875.

Appendix

A. Additional tools and technologies list

1. Workflow orchestration tools

Astronomer.io (https://www.astronomer.io/): An enterprise framework for Apache Airflow; helps to deploy and manage Apache Airflow workflows for data analytics projects.
Piperr.io (https://www.piperr.io/): Provides pre-built data algorithms from Piperr; applications range from IT to analytics, IoT, and data science.
Pachyderm (https://www.pachyderm.com/): Docker- and Kubernetes-based data workflow and input/output management software for data analytics projects; provides advanced features like data versioning, provenance, and incremental processing.
Luigi (https://github.com/spotify/luigi): A Python module to build complex pipelines of batch jobs; handles dependency resolution, data visualization, and workflow management, and comes with built-in Hadoop support (a minimal sketch is given below).
Conductor (https://netflix.github.io/conductor/): Built by Netflix for workflow orchestration using microservices-based process and business workflows.
Nextflow (https://www.nextflow.io/): A software-container-based, scalable, and reproducible scientific workflow manager.
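As a minimal sketch of Luigi's Python-module style referred to in the list above (file names and task bodies are illustrative assumptions, not taken from the thesis):

    # Minimal Luigi sketch (illustrative paths and logic): two tasks where Luigi
    # resolves the dependency and skips any task whose output already exists.
    import luigi

    class CollectData(luigi.Task):
        def output(self):
            return luigi.LocalTarget("raw.csv")  # assumed output path

        def run(self):
            with self.output().open("w") as f:
                f.write("id,value\n1,42\n")  # placeholder for a real collection step

    class ProcessData(luigi.Task):
        def requires(self):
            return CollectData()  # Luigi runs CollectData first if its output is missing

        def output(self):
            return luigi.LocalTarget("processed.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())  # placeholder transformation

    if __name__ == "__main__":
        luigi.build([ProcessData()], local_scheduler=True)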

2. Testing and monitoring tools

FirstEigen (http://firsteigen.com/): Provides automatic data quality rule discovery and continuous data monitoring.
Bigeye (ToroData) (https://docs.bigeye.com/): Provides anomaly and data quality problem detection services automatically with a no-code interface.
Great Expectations (https://github.com/great-expectations/great_expectations): Provides data testing, documentation, and data profiling services (a minimal sketch is given below).
AccelData (https://www.acceldata.io/): Observes, optimizes, and scales the data analytics pipeline through instant troubleshooting, performance monitoring, data flow monitoring, and automated data quality management.
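As a minimal sketch of the kind of declarative data test Great Expectations supports (using its pandas-based API; the file and column names are assumptions for this example):

    # Minimal Great Expectations sketch (illustrative file and column names):
    # declare expectations against a pandas-backed dataset and fail on bad data.
    import great_expectations as ge

    df = ge.read_csv("output/orders.csv")  # assumed output file of a pipeline step

    df.expect_column_values_to_not_be_null("id")
    df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

    results = df.validate()   # run every expectation declared above
    if not results.success:   # a failing check should stop the pipeline step
        raise ValueError("Data quality checks failed: %s" % results)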


3. Deployment automation tools

TeamCity (https://www.jetbrains.com/teamcity/): Known as an "intelligent CI server" due to its ease of integration and use; offers installation packages per operating system.
Spinnaker (https://spinnaker.io/): An open-source continuous delivery platform developed by Netflix; supports major cloud platforms and hosting technologies like Docker and Kubernetes.
Buddy (https://buddy.works/): CI/CD software with an interactive user interface and an on-premises solution; helps to build, test, and deploy applications quickly.
Shippable (https://www.shippable.com/): Makes CI and CD setup faster by providing ready-to-use build images with machine-level isolation to secure the workflow.
CodeShip (https://codeship.com/): Hosted CI/CD platform with fast feedback and customized environments to build applications; provides comprehensive integration support and scalability when necessary.

4. Data governance tools

TrueDat (https://www.truedat.io/): An open-source data governance tool for business enterprises with data-driven and cloud adoption capabilities.
Xplenty (https://www.xplenty.com/is/): Data integration, ETL, and ELT platform with data governance product offerings for data pipeline visualization and data management.
Informatica (https://www.informatica.com/): On-premises or cloud-based enterprise data governance and compliance solution; provides functionalities for managing GDPR data risks, sensitive data protection, and data accuracy tracking.
Erwin (https://erwin.com/): Data governance and data management solution provider with functionalities such as data modeling, business process modeling, and enterprise architecture modeling.
Agility (https://www.agilitymultichannel.com/governance/): Data governance solution for establishing and implementing high-level policies and procedures; specific solutions for data modeling and rules, workflow management rules, and security compliance rules.


5. Code, artifact, and data versioning tools

Apache Subversion (https://subversion.apache.org/): Open-source centralized version control system under the Apache License.
Mercurial SCM (https://www.mercurial-scm.org/): Free and distributed source control management tool; a platform-independent, fast, and extensible service to manage source code and artifacts.
Fossil (https://fossil-scm.org/home/doc/trunk/www/index.wiki): A distributed software configuration management system with inbuilt bug tracking and a web interface; commonly used for managing source code, documents, and configuration files.
Bazaar (https://bazaar.canonical.com/en/): A version control system for tracking project history and collaboration among team members.

6. Analytics and visualization tools

SAS Visual Analytics (SAS BI) (https://www.sas.com/en_us/solutions/business-intelligence.html): Software for visual data exploration to generate insightful and easy analytics results with an interactive, self-service dashboard.
Sisense (https://www.sisense.com/): Complete end-to-end business intelligence platform for data engineers, developers, and analysts to create interactive analytic results.
Looker (https://looker.com/): Business intelligence software supporting multiple data sources and deployment methods with transparency, security, and privacy.
Infogram (https://infogram.com/): Web-based infographics and data visualization platform; allows sharing charts, infographics, and maps with quality publishing through information extraction from uploaded data.
GoodData (https://www.gooddata.com/): Agile, powerful, and secure data and analytics platform to deliver actionable results; responsive, user-friendly UI with drag-and-drop filters and metrics.

7. Collaboration and communication tools

Microsoft Teams (https://www.microsoft.com/en-us/microsoft-365/microsoft-teams/group-chat-software): Hub for team collaboration in Microsoft 365 that integrates people, content, and tools to provide better productivity, communication, collaboration, and task management.
Mattermost (https://mattermost.com/): An open-source, self-hostable online platform for chat, file sharing, integrations, communication, and collaboration.
Ryver (https://ryver.com/): All-in-one application for task management, voice and video calls, and file sharing; integrates with other tools and offers unlimited group messaging.
Monday.com (https://monday.com/): Customizable and scalable agile software for business process management, team management, and project and task management.
Wrike (https://www.wrike.com/): Cloud-based collaboration software for planning, sharing, and streamlining workflows.
SpiraTeam (http://www.inflectra.com/SpiraTeam/): Integrated lifecycle management software to manage a project's requirements, releases, test cases, issues, and tasks in one single system, supporting Agile, Kanban, Scrum, waterfall, and hybrid methodologies.

TRITA-EECS-EX-2020:895

www.kth.se