ISTE

Master Nr. 3101854

Analysis and representation of dataflows and performance influences in NoSQL systems

Alexander Szilagyi

Course of Study: Software Engineering

Examiner: Prof. Dr. rer. nat. Stefan Wagner

Supervisor: Dr. Daniel Graziotin, Christian Heß

Commenced: 2017-03-03

Completed: 2017-09-20

CR-Classification: I.7.2

Affirmation

I hereby declare that I wrote this thesis independently and that I used only the resources denoted in this work. Material adopted from other works such as books, papers or web sources is indicated by references.

Stuttgart, September 22, 2017 Alexander Szilagyi

Foreword

This thesis was written in cooperation with a well-known company in the automotive sector and the University of Stuttgart. Three different departments within the company were involved in providing the relevant information, the access to the needed systems, the technical support and help with additional questions. I would therefore like to thank all of them for their support. Special thanks go to my advisor Christian Heß, who works at the company, and to Dr. Daniel Graziotin, advisor at the ISTE institute of the University of Stuttgart. Both gave me important tips and advice, which helped me a lot during the conduction of this thesis.

Abstract

Digitalization is gaining more and more attention in industry. Trends like Industry 4.0 result in an enormous growth of data. But with this abundance of data, new opportunities become possible that could lead to a market advantage over competitors. In order to deal with this huge pool of information, also referred to as "Big Data", companies are facing new challenges. One challenge lies in the fact that ordinary relational database solutions are unable to cope with the enormous amount of data; hence a system must be provided that is able to store and process the data in an acceptable time.

Because of that, a well-known company in the automotive sector established a distributed NoSQL system based on Hadoop to deal with this challenge. But the processes and interactions behind such a Hadoop system are more complex than with a relational database solution, and wrong configurations lead to ineffective operation. This thesis therefore deals with the questions of how the interactions between the applications are realized and how possible performance shortcomings can be prevented. This work also presents a new approach to dataflow modeling, because an ordinary dataflow meta description does not have enough syntax elements to model the entire NoSQL system in detail. For different use cases, the dataflow models present interactions within the Hadoop system, which provide first assessments of performance dependencies. Characteristics like file format, compression codec and execution engine could be identified and afterwards combined into different configuration sets. With these sets, randomized tests were executed to measure their influence on the execution time.

The results show that a combination of the ORC file format with Zlib compression leads to the best performance regarding execution time and disk utilization on the system implemented at the company. But the results also show that suitable configurations can depend strongly on the actual use case.

Besides the performance measurements, the thesis provides additional insights. For example, it could be observed that an increase in file size leads to a growth of the variance. The study conducted in this work therefore suggests that performance tests should be investigated repeatedly.

Contents

1 Introduction 1
1.1 Context ...... 1
1.2 Problem Statement and Motivation ...... 1

2 Background 3
2.1 Big Data ...... 3
2.2 Data Lake ...... 3
2.3 NoSQL ...... 4
2.4 Virtualization ...... 5
2.5 Hadoop Environment ...... 6
2.6 Qlik Sense - Data Visualization Software ...... 14
2.7 Data types ...... 15
2.8 File formats ...... 16
2.9 Compression ...... 20

3 Related work 22

4 The challenge of dataflow modelling in a distributed NoSQL system 25
4.1 Approaches in dataflow ...... 25
4.2 Dataflow in Hadoop Environment ...... 27

5 Evaluation of performance influences 35
5.1 Possible sources of performance influence factors ...... 35
5.2 Evaluation of performance indicators ...... 40

6 Methodologies 46
6.1 Methodology: Testing Strategy ...... 46
6.2 Methodology: Testing Preparation ...... 49

7 Results 55

8 Discussion 73

9 Conclusion 77

10 Appendix 78
10.1 Additional Hadoop applications ...... 78
10.2 Additional file formats ...... 79
10.3 Additional compression codecs ...... 80

Bibliography 81

Acronyms

ACID - Atomicity, Consistency, Isolation and Durability

API - Application programming interface

BPMN - Business process model and notation

CEO - Center of excellence

CPU - Central processing unit

CSV - Comma separated values

DAG - Directed acyclic graph

DB - Data base

DBMS - Database management system

DTD - Document type definition

HAR - Hadoop archives file

HDFS - Hadoop distributed file system

HDP - Hortonworks Data Platform

IS - Intelligente Schnittstelle (smart interface)

ITO - IT-operations

JSON - JavaScript object notation

JVM - Java virtual machine

KPI - Key performance indicator

LLAP - Live long and process

MSA - Measurement system analysis

MSB - Manufacturing service bus

MSR - Manufacturing reporting system

ODBC - Open database connectivity

OEP - Oozie Eclipse Plugin

OS - Operating system

OSV - Operating system virtualization

ORC - Optimized row columnar

QFD - Quality function deployment

RAM - Random access memory

RD - Research and development

SQL - Structured query language

TA - Transaction

TEZ - Hindi for 'speed'

XML - Extensible markup language

YARN - Yet another resource negotiator

List of Figures

2.1 Five V's ...... 3
2.2 Data Lake (source: http://www.autosarena.com/wp-content/uploads/2016/04/Auto-apps.jpg) ...... 4
2.3 Concept of Hypervisor - Type 2 ...... 5
2.4 Node types ...... 6
2.5 Hadoop V1.0 ...... 8
2.6 Architecture of HDFS - source: https://www.linkedin.com/pulse/hadoop-jumbo-vishnu-saxena ...... 8
2.7 Hadoop V2.0 ...... 10
2.8 Architecture of flume ...... 12
2.9 Nifi process flow - source: https://blogs.apache.org/nifi/ ...... 12
2.10 Oozie workflow - source: https://i.stack.imgur.com/7CtlR.png ...... 13
2.11 Example of Qlik - Sense Dashboard ...... 14

4.1 Classic dataflow graph ...... 25
4.2 Syntax table ...... 26
4.3 HDFS Filesystem ...... 27
4.4 IS-Server Dataflow ...... 28
4.5 Push data on edge node ...... 29
4.6 Store data redundant on HDFS ...... 30
4.7 MRS-Server Dataflow ...... 31
4.8 Route data from edge node ...... 32
4.9 Store Hive table persistent in HDFS ...... 32
4.10 Processing dataflow ...... 33
4.11 Create and send HIVE query ...... 34
4.12 Execute MapReduce Algorithm ...... 34

5.1 QFD - Matrix: Performance influence evaluation ...... 42

6.1 Measurement system analysis type 1 of the Hadoop test system ...... 48
6.2 File format - size ...... 50
6.3 CPU utilization ...... 53
6.4 RAM utilization ...... 54

7.1 MapReduce total timing boxplot ...... 56
7.2 TEZ total timing boxplot ...... 57
7.3 Standard deviation total timings - unadjusted ...... 57
7.4 Total timings MapReduce ...... 58
7.5 Total timings TEZ ...... 58
7.6 Total timings MapReduce with 10% lower quantile ...... 59
7.7 Total timings MapReduce with 10% higher quantile ...... 60
7.8 Total timings TEZ with 10% higher quantile ...... 60
7.9 Total timings TEZ with 10% higher quantile ...... 61
7.10 Histogram MapReduce Count ...... 63
7.11 Histogram TEZ Join ...... 63
7.12 Performance values, MapReduce - Count ...... 64
7.13 Performance values, MapReduce - Join ...... 64

7.14 Performance values, MapReduce - Math ...... 65
7.15 Performance values, TEZ - Count ...... 65
7.16 Performance values, TEZ - Join ...... 66
7.17 Performance values, TEZ - Math ...... 66
7.18 Total performance values - MapReduce ...... 67
7.19 Total performance values - TEZ ...... 68
7.20 Standard deviation total timings - adjusted ...... 68
7.21 File size of AVRO tables ...... 69
7.22 Boxplot AVRO with MapReduce ...... 70
7.23 Standard deviation AVRO timings - adjusted ...... 70
7.24 Boxplot AVRO with TEZ ...... 71
7.25 Processing of data files ...... 71

8.1 Morphologic box: Use case examples ...... 73
8.2 Trend: TEZ execution time dependent on data volume ...... 74
8.3 Trend: Reduced file size ...... 75

List of Tables

2.1 Chart: Compression codecs ...... 21

5.1 Evaluation of QFD - Matrix, Part 1 ...... 44
5.2 Evaluation of QFD - Matrix, Part 2 ...... 45

6.1 File formats with compression ...... 46
6.2 Testset ...... 47
6.3 Factor of file size reduction ...... 51

Chapter 1 Introduction

1.1 Context

This thesis was written within a department of a well-known car manufacturer. The core function of this department is to plan the mounting process for new car models, as well as to maintain and provide test benches for vehicles after their introduction. The vehicle tests include wheel alignment on chassis dynamometers and calibrations for driver assistance systems. Because of that, the department works closely together with other departments and with the vehicle production lines of all factories all over the world.

1.2 Problem Statement and Motivation

These days, the flow of information grows bigger and bigger. On the one hand, there is the amount of data produced daily on the internet, in which social media plays an important role. On the other hand, a lot of data is also created in modern companies [68]. In order to manage this huge amount of data, the term big data was established, whereby the term unites two different problems: to provide a system that is able to handle the enormous amount of data, and to create benefits from the information offered [52]. Thus, companies are dealing with the question of which economic advantages can be achieved by combining and analyzing different kinds of data.

Based on that deliberation, the department started a project with the goal to collect and store data from different sources, in order to improve its business segment and to find concepts for future applications. For that purpose, it established a distributed system based on Hadoop (see chapter 2, section 2.5 on page 6) to deal with the big data challenges. Because of the background of the department, most of the data originates from factories, which store their information in relational databases. In the past, the software solutions in use performed their data analyses directly on the same machines the databases were connected to. As the data kept growing, the analyses became slower, resulting in bad user experiences and disturbing the operational process. Because of that, the company established the Hadoop distributed system.

It provides an infrastructure that is able to deal with the enormous amount of data in the big data context (see section 2.1). In order to avoid performance shortcomings, this work deals with the questions of how such big data systems actually work and how the data can be stored and processed to provide a good overall performance. With the help of a dataflow model, the functionality of the system shall be examined. Since the Hadoop solution is based on a distributed cluster system (see section 2.5 on page 6), the given dataflow syntax makes it difficult to model everything in detail; hence a new solution is needed (see chapter 4 on page 25). If bottlenecks can be identified, it is to be checked whether configuration changes are possible and whether these changes can lead to a performance increase. This will be empirically validated by conducting randomized performance tests. Based on the findings, advice is given to the company on how the system could be operated in the best possible way.

Chapter 2 Background

The storage, retrieval and processing of data in a distributed system is a complex interaction of different applications with specific tasks, running in a virtual environment on different physical nodes. The solution used at the company is based on a Hadoop system provided by Hortonworks in version 2.5. The following sections give a brief overview of all relevant terms and applications for a basic understanding.

2.1 Big Data

The term Big Data is a buzzword and not yet clearly defined [30]. However, Big Data can be described as the handling of very large data sets, where conventional systems struggle to process the data in acceptable time. The characteristics of Big Data can be described by the five V's (Figure 2.1).

Figure 2.1: Five V's

"Volume" stands for the actual amount of data. "Value" refers to the accuracy and completeness of the data; for example, a date can be denoted in the format 'yyyy' or as 'dd/mm/yyyy', whereby the second is more precise. "Veracity" refers to the plausibility and objectivity of the data, hence where the data comes from and who its author is. "Variety" deals with data representation: data sources show a wide range of heterogeneity, like relational schemas, lists and tables, and data formats like CSV or XML, to name a few. The challenge lies in an effective data integration and cleaning process to get one single source of truth. The last V is "Velocity", which stands for the timeliness of data. Here, the challenge lies in an automated process to keep the data up to date.

2.2 Data Lake

A data lake is an abstract structure for data storage that comes with the Hadoop system [104]. The purpose of this architecture is to keep the data in its original form, so that the data can be used quickly for all kinds of analyses [65]. This is an essential difference in comparison to a data warehouse system, which puts the incoming data directly into a specific schema; thus, the data in a data warehouse system is already prepared for a particular use case. The architecture of a data lake consists of three domains [3]. The first is named the "Landing Zone", which also describes the main purpose of this domain: all data that is put on the Hadoop system is first stored here. This makes it possible to determine

whether the data should be transformed into another data format or schema, and where it should be stored permanently in the data lake. Another use case comes with log files. Normally, the file size of a single log file is small. Because of that, log files can be stored and merged in the landing zone to create a bigger data file.

Figure 2.2: Data Lake

* http://www.autosarena.com/wp-content/uploads/2016/04/Auto-apps.jpg

The next domain is named the "Data Lake". There, the data is stored permanently. Once the data has been shifted into this domain, it is not intended to be changed anymore. The last domain is called the "Reservoir". This domain adopts classical data warehouse operations, meaning that the data located in this area is already integrated and processed for particular use cases. Figure 2.2 denotes a typical process of how data is stored and used in a Hadoop system.

2.3 NoSQL

The term NoSQL implies "non SQL", "not only SQL" or "non-relational" and refers to the characteristics of a relational SQL database, which is consistent and follows the ACID principle [106]. ACID stands for atomicity, consistency, isolation and durability. Atomicity means that every single transaction (TA) on a database is the smallest unit; thus a TA has to be fully executed (commit) or it will be aborted (rollback). Consistency denotes that after each TA the database is in a consistent state. Inconsistency can occur if a TA deletes a data set in one table without deleting the key reference in another table. Isolation is given if only one TA at a time can operate on a specific data set. The last term, durability, means that a committed data set is stored permanently and can only be changed by another TA.

The term NoSQL was originally used by Carlo Strozzi in 1998 to describe a relational database that used a query language other than SQL. Today, the term NoSQL represents a type of database that is non-relational, distributed over many nodes and easily horizontally scalable by adding new nodes [100]. These schema-free database types are able to store semi-structured and unstructured data, hence data files. NoSQL databases in this modern sense emerged around 2009 to store and process huge amounts of data. In general, NoSQL databases can be broken down into four categories: "Key Value Stores", "Big Tables", "Document Databases" and "Graph Databases" [100].

2.4 Virtualization

Although it would be possible to install and run Hadoop directly on the physical nodes of the distributed system, such a solution would be very cost intensive, since a flexible resource management for the single applications running on different nodes would not be possible [37] (more details in section 2.5 on the next page). Virtualization wraps a physical hardware system into a virtual layer [78], using the hypervisor technology (see figure 2.3), provided for instance by VMware. Thus, on each node an operating system (OS) is installed. On top of the OS, the virtualization layer is installed, which emulates the physical hardware of a physical node [48]. Regarding the cluster nodes, this approach enables the creation of as many virtual nodes within the virtual layer as needed.

Figure 2.3: Concept of Hypervisor - Type 2

A major aspect of virtualization is dynamic scaling. Dynamic scaling enables the addition and removal of processing resources like CPU, RAM, disk space and so on, to virtual nodes or even to single applications [48]. Additionally, horizontal scaling can be used, which implies the addition of new hardware nodes without interrupting the running system. A disadvantage of this concept is a slightly decreased performance, because the virtualization and the management of all virtual nodes need additional hardware resources, required by the OS to execute the virtual layer. The virtualization technology applied for the test system is based on a SUSE Linux private virtualized platform.

Besides virtualization via a hypervisor, another virtualization approach exists, called "operating system virtualization" (OSV). The difference between OSV, also called "container virtualization", and a hypervisor is that the given hardware is not emulated. Instead, for each virtual environment an abstract container is created, in which the application is running. A similar approach is used in the resource management system YARN (see subsection 5.1.8 on page 38). In OSV, all virtual machines run on the same kernel, hence it is not possible to switch between different operating systems. The Hortonworks sandbox is also a good example of container virtualization: the sandbox makes it possible to install a complete Hadoop system virtually on a local machine, to get familiar with all components and for test purposes.

2.5 Hadoop Environment

The Hadoop environment is a cluster of physical nodes that are connected to each other. On each physical node, a Java Virtual Machine (JVM) is operating. On top of the JVM layer, several applications are installed, which are all elements of the Hadoop environment. In general, the physical nodes can be divided into three types (see figure 2.4): the "master nodes", on which typically virtual management instances are installed; the "slave nodes", on which data is stored and processed; and the "edge nodes", which are used as gateways to external networks. For instance, applications running on the edge nodes route data to different applications, hence to different physical nodes. The Hadoop environment is currently in version 2.7.3, provided by the Hortonworks Data Platform 2.5 (HDP). Hortonworks, founded in 2011, is a company with the goal to provide infrastructure solutions based on the Hadoop open source architecture of the Apache foundation [18]. Hortonworks works closely with the Apache Hadoop project and is also the founder of the HIVE execution engine TEZ (explained in section 2.5.5 on page 10). The graphical interface Ambari is a development of Hortonworks and is also included in HDP. Ambari is a web based management tool that enables the tracking and managing of applications within the Hadoop environment. However, besides Ambari, the Linux command shell can be used to perform all tasks as well.

Figure 2.4: Node types

2.5.1 Apache Hadoop

Hadoop is an open source project of the Apache software foundation, created and established by Doug Cutting in 2008 [12], and based on Java. Hadoop was designed to store and process huge data files on a distributed system, and it was one of the first programs that used the MapReduce algorithm from Google Inc. [33], which is explained below. To run a Hadoop ecosystem, at least three major components are needed: a distributed server cluster to store the data physically, a file system to manage and store the data virtually, which is typically the Hadoop Distributed File System (HDFS, see subsection 2.5.2 on the following page), as well as a query tool for data analysis [4], for example Apache Hive. The MapReduce algorithm, which is an essential element of Hadoop, consists of the three major tasks "map", "combine" and "reduce" and was designed for parallel processing [28]. Typical use cases of this algorithm are building an index for efficient search, document ranking, or finding words and links in documents [28]. As input, the user provides data stored in files. The algorithm splits a file into key-value pairs, called chunks, whereby the object ID given by the name node in HDFS, which links to the data file on the data node, is the key, and the subset of the partitioned file is the value. As an output, MapReduce returns key-value pairs as well [88]. When a MapReduce task is executed, the job tracker, denoted in figure 2.5, is responsible for scheduling all incoming MapReduce tasks and allocating the needed resources [34]. The task trackers, located on every execution node, start the processing and report periodically to the job tracker [34]. The job tracker and the task trackers can be on separate nodes within the cluster [98], but usually the job tracker runs on the name node, and the task trackers are allocated on the data nodes that hold the needed data, to prevent unnecessary network traffic [28].

Meanwhile, some weaknesses have been found in the original approach. Because of that, there are modified approaches like TEZ (see 2.5.5), which additionally uses directed acyclic graphs (DAG). Whereas MapReduce is limited to executing serialized or parallelized jobs, a DAG is able to order dependent jobs in between. For example, if A and B are independent jobs, but C needs the results of both to be executed successfully, the DAG manages the ordering, so that A and B are executed first [79]. Another modification for Hadoop is YARN (see section 5.1.8), which changed the complete resource management [34] (denoted in figure 2.7). Thus, the classic job tracker and task trackers (figure 2.5) are replaced by the "resource manager", the "application master" and the "node manager".
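To make the relation between a query and the map and reduce phases more tangible, the following sketch expresses the word count use case mentioned above as a HiveQL query over a hypothetical table docs with a single column line; Hive compiles such a statement into a MapReduce job in which the map phase tokenizes the lines into key-value pairs and the reduce phase sums the counts per word:

-- hypothetical input table: docs(line STRING)
SELECT word, COUNT(*) AS occurrences
FROM docs
LATERAL VIEW explode(split(line, ' ')) tokens AS word
GROUP BY word;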

Figure 2.5: Hadoop V1.0

2.5.2 Apache HDFS

The Hadoop Distributed File System (HDFS) is the file management component of Hadoop, which manages and distributes the files on different nodes. The architecture consists of a node cluster that can be scaled flexibly. The "name node" controls the "data nodes" and manages the file structure as well as how files are broken down into blocks (see figure 2.6). The data nodes, located on the slave nodes, are responsible for the data storage; they hold the data files and provide the read/write functions [33]. Data nodes periodically send "heartbeats" to the name node, with additional information like the number of stored blocks [57]. If the name node does not receive a heartbeat from a data node after a certain period of time, the data node is assumed to be broken. If the name node itself does not respond, the complete HDFS system is down.

Figure 2.6: Architecture of HDFS - source: https://www.linkedin.com/pulse/hadoop-jumbo-vishnu-saxena

Because a name node stores and manages the metadata of the entire file system, it needs a large amount of random access memory (RAM); otherwise the metadata has to be stored on a slow hard disk. In order to keep the performance high while running the cluster, the name node should be the node with the best possible hardware environment [79]. As mentioned, the data files stored in the HDFS file system can be stored redundantly in blocks, whereby the block size is fixed. It is possible to adjust both the block size and the degree of redundancy with a configuration file, called "site.xml" [88]. As a best practice, two name nodes per cluster should be available, typically on two different physical nodes: one as the operating name node and the second as a backup solution. HDFS is able to store all kinds of data, regardless of which file format is in use.

2.5.3 Apache YARN

In Hadoop 1.0, MapReduce tasks were managed and scheduled by the job tracker and processed by the task trackers (denoted in figure 2.5). With the introduction of Hadoop 2.0, this concept was replaced by the new job scheduling and monitoring tool YARN [34]. YARN is the abbreviation for "Yet Another Resource Negotiator" (denoted in figure 2.7) and consists of three components: the Resource Manager (RM), running on the physical master node, as well as the Node Managers (NM) and Application Masters (AM), both running on the physical slave nodes [107]. It is important not to mix up YARN with MapReduce, because YARN can be used by any other application; since Hadoop 2.0, MapReduce has been called MapReduce 2 [13]. The basic idea of YARN is to split the main operations formerly done by the job tracker [71]. These jobs are now executed by the Resource Manager and the Application Master, whereas the Resource Manager is only responsible for job scheduling [71] [15]. It handles the scheduling of resource allocation by granting "resource containers". A container is an abstract reservoir containing physical resources like CPU, disk space, RAM and so on. Since modern Hadoop systems run on virtual machines, it is possible to assign computing power directly to single applications. The Resource Manager does not perform any monitoring or tracking task; thus, it cannot guarantee that a failed task will be restarted [71]. A Node Manager and an Application Master are both located on the same physical slave node and work together. A Node Manager manages the resources of the physical slave node and provides the resources needed for a container. When a job of an application has to be executed, the client contacts the Resource Manager, which selects a Node Manager to start an Application Master. After that, the Application Master negotiates with the Resource Manager for containers. Per Application Master, a varying number of containers can be managed. If more resources are needed than a single physical node can provide, the Application Master launches additional containers on other nodes to process the job [71] [15].
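To make the container concept more concrete, the following hedged sketch shows how an application like Hive can request container sizes from YARN through per-session configuration properties; the property names are the common Hadoop/Hive ones and the values are purely illustrative, not recommendations for the cluster described in this work:

SET mapreduce.map.memory.mb=2048;     -- container size requested for each map task
SET mapreduce.reduce.memory.mb=4096;  -- container size requested for each reduce task
SET hive.tez.container.size=2048;     -- container size used when Hive runs on TEZ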

Figure 2.7: Hadoop V2.0

2.5.4 MapReduce 2

With the introduction of YARN, the MapReduce algorithm was rewritten to run on top of YARN and is called MapReduce 2. If a MapReduce task occurs, HDFS provides the data file, divided into blocks and stored on the data nodes. Due to the virtualization, a node manager is located on the same physical node as the data node. The resource manager determines a node manager to establish an application master, which within MapReduce 2 is called the "MapReduceApplicationMaster". Then, like in a typical YARN task, the MapReduceApplicationMaster negotiates with the resource manager to get a resource container for a single application instance. Thus, by the use of YARN, a suitable resource container for a map or reduce task can be provided [71] [15].

2.5.5 Apache TEZ

Apache TEZ is a framework on top of YARN within the Hadoop 2.0 environment. It was originally developed by Hortonworks and is now an Apache top level project [9]. Key features of TEZ are the use of directed acyclic graph tasks to process data, query optimizations like pruning or data partitioning, and other techniques like resource reuse and fault tolerance [9]. Because of the architectural design of the algorithm, it can easily be used in HIVE or PIG by rewriting parts of the engine of the target application and by integrating the TEZ library; the core elements of both applications remain untouched [9]. HIVE is designed to be very flexible, so it is possible to switch between MapReduce and TEZ for executing queries [25]. According to the developers, the processing with TEZ is up to 10 times faster than with MapReduce [9]. This is because TEZ is able to execute several MapReduce-style tasks within one job. Furthermore, after each ordinary MapReduce task, the intermediate results are written back into HDFS, which causes additional read/write operations on the hard disk. With TEZ, these intermediate writes are not necessary anymore, because the results are kept in the fast memory [105].
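The switch between the two execution engines is done per Hive session. A minimal sketch (the query and the table name are placeholders) could look like this:

-- run the same query once per engine and compare the execution times
SET hive.execution.engine=mr;   -- classic MapReduce
SELECT COUNT(*) FROM measurements;

SET hive.execution.engine=tez;  -- TEZ on YARN
SELECT COUNT(*) FROM measurements;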

2.5.6 Apache HIVE

Apache HIVE is a data warehouse infrastructure for querying, writing and analyzing large data sets stored in a Hadoop cluster, respectively in HDFS. It was provided for Hadoop in 2008 and enables the user to perform tasks over an SQL-like interface [61]. These SQL-like statements are called HiveQL [88]. If a user wants to query data, the query compiler of HIVE transforms a HiveQL statement into a task for the selected execution engine, which actually performs the data analysis [88], for example into a MapReduce task. HIVE is not restricted to a certain file format; it is able to analyze data stored in different file formats like CSV, ORC, Parquet and so on [61]. Besides MapReduce, HIVE also supports query execution with Apache TEZ or Spark, the latter called HiveOnSpark [61]. When HIVE queries data from data files, it organizes the content as tables. HIVE distinguishes between two types of tables: external tables, for which HIVE does not manage the storage of the data file, and internal tables, for which it does. That means that if an externally declared table is dropped, only the definition of the table is deleted. In contrast, if an internally declared table is dropped, the underlying data file in HDFS is erased as well [45]. It is also possible to create a new, empty table and add data directly. If a new table shall be stored in HDFS, it has to be defined in which file format the HIVE table is stored. Thus, each table in HIVE is actually a reference to a file in HDFS, interpreted as a table. Because of that, HIVE is also able to create and store data in HDFS [45].
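The difference between the two table types, and the choice of the storage format, can be sketched as follows; the paths, table and column names are hypothetical, and the statements assume that a CSV file already lies in HDFS:

-- external table: Hive only stores the table definition, the file stays untouched in HDFS
CREATE EXTERNAL TABLE testbench_raw (vin STRING, station STRING, duration_s DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landingzone/testbench/';

-- managed (internal) table: dropping it also deletes the underlying data file in HDFS
CREATE TABLE testbench_orc STORED AS ORC
AS SELECT * FROM testbench_raw;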

2.5.7 ODBC HIVE Client

The ODBC Hive client is an interface for direct communication between HIVE and an external application outside of the Hadoop environment [44]. ODBC is an abbreviation for open database connectivity and was originally designed as an application programming interface (API) to access database management systems (DBMS). ODBC was created by the SQL Access Group and has been available since 1993 for Microsoft Windows, with the goal to provide a DBMS-independent interface between a database and a target application [22]. With the API, SQL queries can be sent, and if necessary, ODBC translates the request into the right syntax for the corresponding DBMS [89]. The ODBC Hive client can be installed as a service on every node within Hadoop, but if the cluster communicates with an application on an external network, it is recommended to install this service on an edge node, since this kind of node is used as a gateway [44].

2.5.8 Apache Flume

Apache Flume is a service to collect and merge streaming data from multiple sources and store it in HDFS. Flume is designed to run on distributed systems [47]. It was developed and published by Cloudera in 2011 and became an Apache project in 2012 [42]. Streaming data like log files, or data continuously produced by machines such as robots in a factory, are typical use cases. The basic architectural concept of Flume relies on three different logical node types, which can also be located within the same physical node: the Flume Master for controlling, the Flume Agents that produce the data streams, and the Collectors, which aggregate the data and send it to a storage tier (see figure 2.8).

Figure 2.8: Architecture of flume

HDFS is usually used as the storage tier, but it is also possible to use other applications, like HIVE, to create tables from the data streams [14]. A Flume Agent consists of three elements: the source, one or more channels, and the sink. All input data is first collected by the source. This input data, seen as key-value pairs, is buffered within a channel and then processed and merged by the sink processor [42]. If Flume stores data in HDFS as the storage tier, it uses sequence files as the default configuration [47].

2.5.9 Apache NiFi

Apache NiFi is a graphical dataflow automation tool that allows the user to create a program for grabbing data from a certain point (mostly the edge node), transforming and sorting it, as well as routing it to another node [103].

Figure 2.9: Nifi process flow - source: https://blogs.apache.org/nifi/

Some key aspects of this tool are a user-friendly interface based on a graphical representation of the process steps, and the possibility to pick up and route nearly all kinds of data [103]. If a user wants to create a new automated dataflow, he simply uses the elements given in the toolbox to create a new process model.

NiFi is also highly flexible in scaling up and down depending on the occurring tasks. It runs on all nodes within the cluster to process tasks simultaneously, hence it provides a higher throughput. Within such a distributed NiFi cluster, one of the nodes has to be the coordinator, which is automatically elected via Apache ZooKeeper (more information about ZooKeeper can be found in appendix section 10.1.5 on page 79). If the coordinator is unavailable, ZooKeeper elects a new coordinator automatically [102].

2.5.10 Apache Oozie

Apache Oozie is a workflow scheduler system for managing jobs that occur within the Hadoop ecosystem and are executed on different nodes. For example, if data shall be stored via HIVE, Oozie can manage those jobs by triggering them either at a given time or when new data is available [75]. The workflows specified in Oozie are stored in XML files and can be represented as directed acyclic graphs. In order to simplify the process of defining an Oozie workflow, different editors are available, like the Oozie Eclipse Plugin (OEP), which also enables users to model the workflow graphically with symbols [75].

Figure 2.10: Oozie workflow - source: https://i.stack.imgur.com/7CtlR.png

2.6 Qlik Sense - Data Visualization Software

Qlik Sense is a business intelligence program, accessible over a web browser and developed by Qlik. The major purpose of this software is the visualization of data, which makes it a good choice for creating dashboards. A similar product of this company is named Qlik View. The major difference between both products lies in their configurability. While Qlik View offers the possibility to create different kinds of charts to analyze data, Qlik Sense additionally offers the possibility to create a complete template. Thus, with Qlik Sense, companies are able to develop dashboards that also meet the requirements of their corporate style guide [80].

Figure 2.11: Example of Qlik - Sense Dashboard

However, with both products the developer is able to create use case dependent dashboards that depict specific information based on relational data. Qlik Sense is used in the company to present the data stored in the Hadoop cluster. A specific use case is the "Prüfzeiten" (test times) dashboard, which presents data from different chassis dynamometer test benches (see figure 2.11).

2.7 Data types

Data, respectively information, can be characterized in different ways. While in a classic relational database, data sets are presented and maintained in tables in a very structured way, a NoSQL database is designed to store huge data files, for which a structure is not necessarily mandatory. This section explains the characteristic differences between structured, semi-structured and unstructured data, as well as which kinds of data a NoSQL system can deal with.

2.7.1 Structured data

The term "structured data" comes from relational databases and means the storage of data in tables with columns and rows. These tables are normalized so that anomalies like data redundancy cannot occur. It is assumed that of all data sources, including the web, only 5 to 10% are structured, most of them in companies [97] [1]. Regarding the company, all data that has to be integrated into the Hadoop system originates from relational databases; thus data accuracy and completeness should be given.

2.7.2 Semi-structured data

Semi-structured data cannot be stored directly in relational databases, but it has some kind of organization, which makes analyses simpler. Good examples of semi-structured data are file formats like CSV, XML or JSON. Semi-structured data can be stored in NoSQL databases, since the data is stored in files [83]. Because the data retrieved from the databases has to be stored as files in HDFS, the user actually deals with semi-structured data in a NoSQL system.

2.7.3 Unstructured data

Unstructured data, especially in the World Wide Web, represents with up to 80% the majority of occurring data [86]. This data includes text elements like emails and documents, multimedia content like photos, videos and audio files, or entire web pages. The term unstructured data is used because this kind of data cannot be stored in a relational database and its structure also cannot be represented by a schema [97] [1]. Like semi-structured data, unstructured data can also be stored in NoSQL systems.

2.8 File formats

Since the data in a NoSQL environment is stored permanently in semi-structured or unstructured data files, it is necessary to understand the structure of the file formats, to get an idea of what influence a file format can have on performance. The relation between file format and performance lies in the space required to store the data, as well as in how fast the data can be read from and written to a data file.

2.8.1 CSV

The "comma-separated values" (CSV) file is a simple data storage format to transfer data between two or more applications or organizational units. Data is stored as plain text, one record per line, with the values forming the columns. It is not possible to store nested data in a single CSV file, but with linked CSV files, a nested kind of solution is possible. A CSV file consists of values and a separator symbol. The separator symbol itself is not strictly specified; usually a comma, semicolon, tab or space is used. The CSV file format does not support a schema, but it is possible to specify the column names in the first row [92].

Example:

ID,Lastname,Firstname,Sex,Age
1,Mueller,Eugen,male,53
2,Meyer,Franz,male,44
3,Siskou,Chrysa,female,32
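Transferred to the Hadoop context of this thesis, such a file could be exposed to Hive as an external table; this is a sketch with a hypothetical HDFS path, and it assumes that the table property for skipping the header line is supported by the installed Hive version:

CREATE EXTERNAL TABLE persons (id INT, lastname STRING, firstname STRING, sex STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landingzone/persons/'
TBLPROPERTIES ("skip.header.line.count"="1");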

2.8.2 XML

The "Extensible Markup Language" (XML) is a simple text format whose major purpose is to transmit semi-structured data over the web in a consistent schema between two or more organizational units. XML, derived from SGML (ISO 8879), was released on February 10, 1998. The latest revision was rolled out in November 2008 as the fifth edition. However, XML is often deemed to be outdated because of its overhead and its verbosity [60]. The main components of this format are the XML declaration followed by its elements, which are nested depending on the content.

Example (tag and attribute names are illustrative; only the values stem from the original):

<?xml version="1.0" encoding="UTF-8"?>
<bank>
  <customer id="c1">
    <lastname>Smith</lastname>
    <firstname>John</firstname>
    <age>32</age>
  </customer>
  <account id="a1" ref="c1">
    <type>giro</type>
    <balance>2540.00€</balance>
  </account>
</bank>

The example shows a simple XML file. It consists of an XML declaration followed by opening and closing tags, which hold the actual information. Additionally, an element can also hold attributes. In this case, the "id" attribute is used to identify an element uniquely within the entire document, and the "ref" attribute to assign related information to it. A "ref" attribute can hold more than one value [49].

2.8.3 JSON

The "JavaScript Object Notation" (JSON) is a compact text format whose major purpose is to transmit semi-structured data between one or more applications or organizational units, like XML. Each JSON file is a valid JavaScript (JS) file. JSON was originally specified by Douglas Crockford in November 1999. Currently, two standards exist: RFC 7159 [10] and ECMA-404, released in October 2013 [50].

Example:

{
  "Title": "A Helmet for My Pillow",
  "Pub Date": "February 2010",
  "ISBN": "978-0-553-59331-0",
  "Price": "$19.95",
  "Author": {
    "Last name": "Leckie",
    "First name": "Robert",
    "Age": 42,
    "Sex": "male"
  }
}

A JSON file consists of name and value pairs (N/V), separated by commas, and its data can be nested arbitrarily. A single N/V entry starts with the name, followed by a colon and the value. These N/V pairs are clustered in an unordered set of objects. An object starts with an opening brace and ends with a closing brace. For data transactions, JSON is more suitable for a rigid interface and is thus less flexible than XML [54].
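In the Hadoop context, JSON files stored in HDFS can be queried with Hive through a JSON SerDe. The following sketch assumes that the hive-hcatalog JSON SerDe is available on the cluster and that each file contains one JSON object per line; path and column names are hypothetical:

CREATE EXTERNAL TABLE books_json (title STRING, pub_date STRING, isbn STRING, price STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/landingzone/books_json/';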

2.8.4 AVRO

AVRO is a binary data format for storing persistent data, whose schema is defined with JSON. Like all data handled by NoSQL applications, AVRO contains semi-structured data. As a feature, AVRO stores the used schema together with the data when the data is written, thus the schema is always present. This makes it possible to write data without per-value overhead, which leads to a faster and smaller serialization. As a second advantage, the embedded schema makes the data self-descriptive and dynamic, so the file can be processed by any program [24]. In a direct comparison, XML needs a separate schema file like a document type definition (DTD) or an XML schema, which produces an additional overhead. AVRO can be interpreted by HIVE or Pig and is a suitable file format for persistent storage in HDFS.

Example:

Create Schema:

{"namespace": "example.avro",
 "type": "record",
 "name": "Book",
 "fields": [
   {"name": "Title", "type": "string"},
   {"name": "PubDate", "type": "string"},
   {"name": "ISBN", "type": "string"},
   {"name": "Price", "type": "string"},
   {"name": "Author", "type": {"type": "array", "items": "example.avro.Author"}}
 ]
}

{"namespace": "example.avro",
 "type": "record",
 "name": "Author",
 "fields": [
   {"name": "LastName", "type": "string"},
   {"name": "FirstName", "type": "string"},
   {"name": "Age", "type": ["int", "null"]},
   {"name": "Sex", "type": "string"}
 ]
}

Create Data:

// Alternate constructor
Author author1 = new Author("Leckie", "Robert", 42, "male");
Book book1 = new Book("A Helmet for My Pillow", "February 2010",
    "978-0-553-59331-0", "$19.95", Arrays.asList(author1));

// Construct via builder
Author author1 = Author.newBuilder()
    .setLastName("Leckie")
    .setFirstName("Robert")
    .setAge(42)
    .setSex("male")
    .build();

Book book1 = Book.newBuilder()
    .setTitle("A Helmet for My Pillow")
    .setPubDate("February 2010")
    .setISBN("978-0-553-59331-0")
    .setPrice("$19.95")
    .setAuthor(Arrays.asList(author1))
    .build();
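On the system considered in this work, such AVRO files would typically be accessed through Hive. A minimal sketch, assuming that the installed Hive version supports the AVRO storage format and using a hypothetical path, could look like this; the columns mirror the schema above:

-- Hive generates the AVRO schema from the column definitions
CREATE EXTERNAL TABLE books_avro (title STRING, pubdate STRING, isbn STRING, price STRING)
STORED AS AVRO
LOCATION '/data/datalake/books_avro/';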

2.8.5 PARQUET

The Apache Parquet file format is based on a columnar storage structure and was first released in July 2013. The strength of Parquet lies in providing a data format whose data and schema compression, as well as encoding, is very efficient. In detail, a Parquet file consists of three parts: a magic number, which is used as the file header; the data itself, which is stored column by column in chunks, so that data can be stored in a table-like way; and the footer, which stores additional information such as the file version or schema data. Hence, Parquet is also self-descriptive like AVRO [77]. Parquet can be used with Impala, Hive, Pig, HBase or MapReduce.

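As an illustration of how Parquet is used from Hive, the following sketch creates a Parquet-backed table and fills it from an existing table; all table and column names are hypothetical:

CREATE TABLE author_parquet (lastname STRING, firstname STRING, age INT, sex STRING)
STORED AS PARQUET;

-- copy data from a hypothetical text-based source table into the Parquet table
INSERT INTO TABLE author_parquet
SELECT lastname, firstname, age, sex FROM author_text;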

2.8.6 ORC

Apache ORC is a columnar storage format for Hadoop, which also supports ACID transactions; ACID stands for atomicity, consistency, isolation and durability, the properties a relational database must fulfill. ORC was introduced in January 2013 to speed up Apache Hive and to store data efficiently in HDFS. An Apache ORC file is divided into stripes. A stripe consists of three layers: index data, row data and the stripe footer. The index layer holds the row data positions, as well as the minimum and maximum values of each column. Within the row data layer, the actual data is stored. The stripe footer holds a register of stream locations. Like AVRO, this file format is also self-descriptive, which means that the schema is included in each file; this schema is stored in the file footer. Because of the columnar format, only the values that are needed are read, decompressed and processed. This is possible because the stripes have a default length of 64MB. By predicate pushdown, the set of stripes to read can be narrowed by the index, which speeds up the search [76].

Example:

create table Author (
  lastname string,
  firstname string,
  age int,
  sex string)
stored as orc tblproperties ("orc.compress"="NONE");
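Because ORC is the file format for which Hive offers its ACID transactions, a transactional table additionally has to be bucketed and flagged as transactional. The following sketch assumes that Hive transactions are enabled on the cluster; the bucket count and all names are illustrative:

CREATE TABLE author_acid (lastname STRING, firstname STRING, age INT, sex STRING)
CLUSTERED BY (lastname) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");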

2.9 Compression

Data compression, with its ability to reduce the size of a data file, ideally without any loss of information, has a considerable influence on performance. This will be further discussed in detail in paragraph 5.1.5 on page 37. This section gives a brief overview of the differences between the available data compression codecs which could be used to compress a data file within a Hadoop environment. Table 2.1 below summarizes the characteristics of each compression codec.

2.9.1 Difference between lossy and lossless data compression

The main task of a compression algorithm is to transform the input data, more specifically its symbols, into binary code in order to reduce the required storage space. Compression algorithms can be divided into two categories: lossy and lossless algorithms. A lossless compression algorithm is able to encode and decode data without any loss of information. A lossy algorithm removes information, respectively data, to reduce the amount of space. For example, when compressing images or audio files, modern lossy compression algorithms remove information which the human brain cannot perceive.

2.9.2 ZLIB

ZLIB is a freely licensed compression library for general purposes, released in 1996. It belongs to the family of lossless data compression algorithms and can be used on every computer system. The compression algorithm was created by Jean-loup Gailly and Mark Adler and is written in C. ZLIB is related to gzip, of which both were also active authors; hence, the core algorithm is the same [31]. The core algorithm is based on LZ77, which searches for duplicate symbols in the input data. Therefore, the algorithm consists of two elements: a buffer window and an input window. In every iteration, the input window reads the next symbol and compares it with the buffer. If a symbol is not in the buffer, the algorithm writes it into the buffer. If a symbol occurs a second time, it is replaced by a pointer to the previous occurrence, expressed as an index-length pair [2] [113]. ZLIB is used to get a high compression ratio; thus it is not a good choice for fast processing.

2.9.3 Snappy

Snappy, also known as Zippy, is a lossless compression algorithm written in C++ and created by Google. The basic core algorithm is based on LZ77 (Lempel and Ziv, 1977). Since 2011, Snappy has been freely licensed. The major goal of Snappy is not to reach a high compression ratio, but instead to achieve a fast compression/decompression process, because the algorithm is not CPU intensive; hence it does not use many resources for the actual processing task [110]. Google indicates a data processing rate of 250MB/s for compression and 500MB/s for decompression using a single core of an Intel Core i7 processor. However, it is also stated that the average compression ratio is 20% to 100% worse in comparison to other compression algorithms [62]. Snappy is used if a high compression/decompression speed is needed.

2.9.4 GZIP

The GZIP compression is a lossless compression algorithm based on the combination of LZ77 and Huffman coding. As described in section 2.9.2, LZ77 replaces repeatedly occurring symbols with a reference. Huffman coding maps the input symbols to variable-length binary codes [11]. GZIP is adjustable in 9 levels, whereby level 1 is the fastest; level 6 is set as default. Although GZIP does not have a high compression ratio, it has a good balance between compression and decompression tasks. Thus, GZIP can be a good choice if speed is more important.

2.9.5 Deflate

The Deflate compression is a lossless compression algorithm that splits data into blocks, whereby the block size is variable. Similar to GZIP or ZLIB, Deflate compresses the data within the blocks by the use of LZ77, which replaces repeating symbols by references, and Huffman coding, which transforms the data into a binary representation. The first specification of Deflate was released in 1996 and is currently available in version 1.3 [23]. Deflate uses the same algorithm as GZIP and ZLIB, but without a checksum, a header and a footer. Thus, it is supposed to be faster than, for example, GZIP [5].

Codec     Compression   Speed
ZLIB      High          Slow
Snappy    Low           Fast
Deflate   Medium        Fast
Gzip      Medium        Medium

Table 2.1: Chart: Compression codecs
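How a codec is actually selected depends on the storage layer. The following sketch shows the two typical places, namely the table property of an ORC table and the session settings for compressing job output; the property names are the common Hadoop/Hive ones and the table is hypothetical:

-- codec chosen per table for ORC storage
CREATE TABLE timings_orc_zlib (testbench STRING, duration_s DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");   -- alternatives: "SNAPPY", "NONE"

-- codec chosen per session for compressed job output
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;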

Chapter 3 Related work

From the first concepts in 2003 until the first technical solution in 2006, Hadoop can look back on a history of many inventions, supported by the Apache community [112]. Currently, there are two major vendors on the market that offer Hadoop solutions, named "Cloudera" and "Hortonworks". Therefore, most support and advice regarding performance is released through blog entries on their websites.

However, in recent years, scientific studies have been published that give good insights into different ways of improving the performance of Hadoop. As a first contribution, the paper "Apache Hadoop Performance-Tuning Methodologies and Best Practices" by Shrinivas B. Joshi is worth mentioning [53]. In this paper, Joshi describes different performance influences, tested on two cluster systems with different hardware settings. Based on the TeraSort benchmark tool, the focus is on analyzing the MapReduce execution time while different configurations of hardware and software components are applied. For instance, one test case refers to the performance influence of using more than one hard drive per node; as a result, the best performance on their system could be achieved with 5 hard drives per node. Other tests conducted in this paper include the tuning of the Java virtual machine as well as BIOS configurations of the physical nodes. A very interesting part of his work is the testing of several configurations of MapReduce, which were done via the Linux command shell. Because Joshi was dealing with a MapReduce version lower than 2.0, the resource management was done by the architecture consisting of the job tracker and the task trackers (see figure 2.5). In order to change settings of the algorithm, configurations are made in the site.xml file. Because read and write operations on the hard drive result in a high expenditure of time, the author tried to avoid map-side spills by rewriting some entries in the site.xml. Finally, the influence of the LZO compression codec on performance was also examined.

As a second study, the paper "Hadoop Performance Tuning – A Pragmatic & Iterative Approach" by Dominique Heger is also worth mentioning [39]. The author conducted performance tests which are also based on TeraSort. In order to monitor the workload, the two measurement tools "Ganglia" and "Nagios" were chosen. These third-party software applications provide statistics on hardware utilization like CPU or network bandwidth. Like Joshi, Heger analyzed different configuration settings to increase the processing performance of MapReduce. Changes in the hardware layout were made by adding more hard drives per node; in this case, the number of hard drives was scaled up to 6. Another test scenario analyzed the overall processing time of MapReduce when changing between the compression codecs Snappy, LZO and Gzip. Furthermore, changes in the Java virtual machine, adjustments of the HDFS block size and additional MapReduce configurations to avoid map-side spills were also made.

The third contribution is the study by Bhavin J. Mathiya and Vinodkumar L. Desai with the title "Hadoop Map Reduce Performance Evaluation and Improvement Using Compression Algorithms on Single Cluster", published in 2014 [8]. The focus of this work lies on how the execution time of the MapReduce algorithm is influenced by the use of the compression codecs Deflate, Gzip and LZO. As a specific characteristic, a Hadoop version was chosen that already uses the new Apache YARN resource management (see figure 2.7). The test scenario was based on a word count task, whereby the LZO codec achieved the best results.

Another study regarding performance was done by Nandan Mirajkar, Sandeep Bhujbal and Aaradhana Deshmukh. The paper with the title "Perform word count Map-Reduce Job in Single Node Apache Hadoop cluster and compress data using Lempel-Ziv-Oberhumer (LZO) algorithm" was published in 2013 and deals with the question of what influence the LZO compression codec has on the performance of MapReduce [72]. As the test scenario, they also used the word count example, which seems to be a common choice for conducting a MapReduce job. The resource management of the Hadoop system used was done by job and task trackers. LZO is supposed to be fast in compression and decompression tasks (see appendix, section 10.3.1), but in return it does not have a high compression ratio. In order to test the performance of LZO, the GZIP compression was used to compare the execution times. As a result, compression with LZO was twice as fast as with GZIP.

The paper "The Performance of MapReduce: An In-depth Study", published in 2011 by the National University of Singapore, is a respectable work, since five different influences could be identified which have an impact on the execution time of the MapReduce algorithm [19]. Particularly interesting is the impact of block scheduling. All data files that are stored in HDFS are partitioned into blocks (see chapter 2, subsection 2.5.2). If a data file has to be processed by MapReduce, the algorithm has to schedule the blocks before they can be processed. If the block size is small, more blocks have to be scheduled and processed. In this case, a 10GB file was used, which was first stored in 5GB blocks and then stepwise reduced to a 64MB block size. The observation shows that the choice of the block size changed the performance by up to 30% in their test environment. Another performance influence could be detected by analyzing the phases of the MapReduce algorithm: the intermediate results after a map phase are grouped by an inefficient algorithm before the reduce phase starts. Thus, an optimization lies in using other grouping algorithms to improve the speed of MapReduce.

Based on the insights regarding the block scheduling problem, the study with the title "The Hadoop Distributed File System: Balancing Portability and Performance" by Jeffrey Shafer, Scott Rixner and Alan L. Cox is worth mentioning. It examines the question of which bottlenecks can occur in the Hadoop file system (HDFS). The authors detected that the slow read and write operations of the hard disk, as well as a poor performance of HDFS itself, have a great influence on speed. They also identified tasks which caused a high number of seeks on the hard disk. For example, redundant storage has a negative influence on performance, as does the block size. As mentioned above, a small block size provokes more read and write operations, since each block has to be found in the index tree of the name node and retrieved from the data node (also see chapter 2, section 2.5.2). Thus, the file size has to be taken into account to reduce I/O scheduling. The paper also identified that if more than one user executes a write operation on the same disk, HDFS has a problem with fragmentation. High fragmentation means that when a file is written to disk, the data is not written block after block in a chain; instead, the blocks are widely spread over the disk.

In order to reduce these bottlenecks, the authors also advise adding more than one hard disk per node. They further suggest reducing the fragmentation and the cache overhead (holding unnecessary data in the cache) in order to improve the performance of HDFS [51].

Next, the work of Rupinder Singh and Puneet Jai Kaur with the title "Analyzing performance of Apache Tez and MapReduce with Hadoop Multi node Cluster on Amazon cloud" is to be mentioned [84]. In this work, they conducted a study to compare the performance of the processing engines MapReduce and Apache TEZ, which are both selectable in Apache Hive. Using the scripting language Pig, they wrote automated test sets to measure which of the two processing engines performs better. As a result, TEZ provides a better performance, because after each map phase the algorithm keeps intermediate results in fast memory instead of writing them to the hard disk like MapReduce does.

As a last study, the paper "A Hadoop Performance Model for Multi-Rack Clusters" by Jungkyu Han, Masakuni Ishii and Hiroyuki Makino gives a good insight into the bottlenecks that may occur when scaling a Hadoop cluster horizontally, that is, when adding new nodes. The advantage of a distributed system lies in the flexibility of adding new server units if there is a lack of resources. But appending more nodes within the same network domain can negate the performance gain, since the data throughput of a network is limited. In case of a processing task, the traffic on the network can be very high. Therefore a predictive model is introduced that gives advice on whether a new node can be added to the same rack, or whether the node should be added to a different rack in the same cluster [55].

At the current time, the literature about the performance of Hadoop systems is not very extensive. One reason could be that the technology behind the system is changing very rapidly. For example, most of the studies mentioned in this section examine how MapReduce can be accelerated. Thereby the concept of job and task trackers was mostly in focus, which is already outdated these days (see section 5.1.8 on page 38). Because of that, much of the information in this work is also based on blogs or whitepapers. However, a lot of the performance tuning ideas are derived from the papers of this chapter.

Chapter 4 The challenge of dataflow modelling in a distributed NoSQL system

4.1 Approaches in dataflow

The purpose of modelling the dataflow of a distributed system is to evaluate how data is processed and stored within the environment. Visualizing the process with standardized tokens leads to a better understanding. Additionally, the representation can facilitate the localization of possible performance bottlenecks. Because of that, the dataflow models depict the entire process flow, starting from the data source. The original dataflow syntax (depicted in figure 4.1) consists of four elements. Input or output data are denoted as a rectangle. A function or transformation, which processes data, is denoted as a circle. In order to model a data source like a database, two parallel lines are used, denoting an element for persistent storage. An arrow which points to the next element within the model is the actual dataflow [95].

Figure 4.1: Classic dataflow graph

The syntax explained above allows users to model complex systems, especially if dataflow charts of relational databases shall be modelled. But it becomes difficult to model a dataflow in a distributed NoSQL environment, because the syntax does not have enough elements to map every component in detail. For example, it is not possible to distinguish between a dataflow task and a communication task in which no data is transmitted. For instance, the HDFS name node manages the dataflow of a data node; without a message flow symbol, this cannot be depicted. This problem was also recognized and described by Paul T. Ward in the paper "The Transformation Schema: An Extension of the dataflow Diagram to Represent Control and Timing", published in 1986 [108]. There, control nodes, depicted as dotted circles, and signals, depicted as dotted arrows, were introduced. With this extended syntax, the communication flow between the different applications can be depicted. However, in a distributed system like Hadoop, the extended dataflow syntax also reaches its limit, for example when depicting a processing task. Since Apache Hadoop processes data with algorithms like MapReduce, which are based on parallel tasks, it is difficult to model such processes. Of course, it would be possible to simply add two or more arrows outgoing from the source element to depict parallelism, but it is not possible to recognize whether all of these flows are synchronized. A similar problem occurs when depicting the redundant storage of data files in HDFS, which is also executed in a parallel way. For such challenges, there are solutions in other modeling languages, like the Business Process Model and Notation (BPMN). BPMN was created to model business tasks with more than one participant. With this notation, merge or split tasks can be denoted by a diamond symbol that implies parallelism. It also provides decision symbols and considers timing. It uses black boxes in case a model is too complex, so it is possible to hide elements and depict them in another layer [63]. Thus, BPMN is a great reference that can also be useful for modelling a distributed system. Based on the stated problem, additional elements are necessary to represent the dataflow in a distributed system. Because of that, part of this thesis is to create a new dataflow meta description.

Figure 4.2: Syntax table

Table 4.2 shows new symbols that are suitable to represent the interaction of all components of the distributed system in detail. The first symbol, named "cluster environment" and depicted as a dotted circle, is used to isolate the Hadoop system from the external environment. Thus, all components of the Hadoop ecosystem are modelled within this loop. The next symbol is the "physical environment", depicted as a solid loop. Since a Hadoop environment consists of different physical nodes on which different applications run virtually, it is necessary to isolate those applications to give an idea of how the communication between single applications is done in a distributed system. The next symbol is called "data store" and represents the actual data storage. A "data coordinator" is an application, or part of an application, which is responsible for managing or scheduling. The diamond symbol denotes a merge or a split task; by the use of this symbol, parallel processes can be modeled. The "dataflow" symbol, depicted as a solid arrow, shows the actual dataflow. The dotted arrow is called "message flow" and depicts the communication between two or more applications. The square symbol is used to denote an external application, which for example presents the results of a data query. Data processing is denoted as a circle with a star inside; this symbol is used whenever any kind of computation occurs, for example a MapReduce task. The symbol "data container", depicted as a black circle, is added at the beginning of a dataflow arrow if data is transferred in a file container like ORC or CSV. The last two symbols, "start", denoted by a circle with an arrow inside, and "stop", denoted as a circle with an arrow and a vertical line, mark where the dataflow begins and where it ends.

4.2 Dataflow in Hadoop Environment

The newly created dataflow syntax depicted in table 4.2 meets the requirements to create a dataflow model of a Hadoop environment. To give an idea of how the Hadoop environment at the company looks, the architecture is illustrated by different use cases. In fact, the company is already using these dataflow models to explain its processes. Note that the number of nodes depicted within the models does not correspond to the actual number of nodes. As a reminder, the "Data Lake" is split into three domains, "Landing Zone", "Data Lake" and "Reservoir", which are realized in the HDFS file system (see figure 4.3) as three different folder groups. Every new data file that is going to be stored in HDFS is first shifted into the landing zone folder.

Figure 4.3: HDFS Filesystem

There, if necessary, it is merged with other data files, for example small log files. Otherwise, it is shifted directly into the data lake folder for persistent storage. The purpose of the reservoir is to store preprocessed data files for specific applications. A good example is a Hive table that is used to represent specific data in a dashboard.

The model depicted in figure 4.4 represents the process of how new data shall be stored in the data lake, whereby the data source is a relational database called IS-Server. The abbreviation IS stands for "Intelligente Schnittstelle", which means smart interface. The IS-Server is a DB2 database which stores all kinds of data occurring on test benches after the production process in the factory. The model is divided into two different environments: the internal Hadoop environment, denoted by the dotted loop, and the external environment where the source data is created. Within the Hadoop environment, the three different node types edge node, master node and slave node are visible, on which different applications are running.

Figure 4.4: IS-Server Dataflow

As depicted in figure 4.5 an interface called QSYS creates an XML file from the source data, stored in the DB, and sends it via the "Manufacturing Service Bus" (MSB) to NIFI located on the edge node.

The MSB is another server that distributes data to different systems within the company’s network. Because the XML file is routed from an external network to the Hadoop cluster, NIFI runs on the edge node. The edge node represents the gateway between the external network and the Hadoop environment (see figure 4.5).

The current configuration presupposes that the data coming from the IS-Server has to be transformed into the ORC format before it can be stored in HDFS. This is because the department that administrates the cluster proposes that ORC works very efficiently together with Hive. Because Flume is not able to parse and transform an XML file directly into the ORC format, the XML file has to be converted first into the JSON-like AVRO file format by NIFI. After that, it can be transformed into the columnar file format ORC by Flume (see figure 4.5).

Figure 4.5: Push data on edge node

For persistent storage (figure 4.6), the ORC file has to be stored on a data node via HDFS before it can be referenced as a Hive table. In order to do that, the Flume connector first contacts the name node. The name node assigns different data nodes depending on the degree of redundancy.

The degree of redundancy determines on how many nodes the file should be saved in order to guarantee robustness regarding the availability of the nodes. After the assignment is successful, the name node reports back to Flume. Then, Flume is able to communicate directly with the data nodes. By default, the name node assigns three data nodes (figure 4.6), thus the Flume connector saves the ORC file onto three different data nodes.

Figure 4.6: Store data redundant on HDFS

After the data is stored on a data node in HDFS, Hive is able to reference the ORC file as a Hive table. Once this task is done, users or applications can query data from this table via the SQL-like HiveQL queries. In case of update or delete operations, the rows that keep the relevant information are not changed. Instead, they are marked as deleted or updated, and the updated dataset is appended at the end of the ORC data file. Because of redundancy, this job is also done for the other redundant data files in HDFS.

Figure 4.7 shows another example of how data can be stored in the Hadoop ecosystem. This time, a different data source is in use, named MRS-Server. MRS is an abbreviation for manufacturing reporting system; the server stores data for analyzing and visualizing car-related data from the production line.

Figure 4.7: MRS-Server Dataflow

In order to retrieve data from the MRS-Server, a third-party software application called Axway is in use (figure 4.8). Axway collects data from the MRS database and transforms it into a CSV file. After this task is complete, Axway pushes this file directly onto the edge node. There, the data file is forwarded to Hive by the scheduler application "Oozie", which is also located on the edge node (figure 4.8). Hive is able to read different kinds of data files. Because of that, it is possible to send this file directly to Hive, where the data of the CSV file is extracted and held in a temporary text file table. Thereby, the schema of the CSV file can be adopted or changed if needed. After the data is stored in the temporary table, another Hive table has to be created for persistent storage. Thereby the file format has to be assigned in order to store the data in HDFS. At this step, it can be determined which file format shall be used. Additionally, one of the available compression codecs can be selected.

Figure 4.8: Route data from edge node

While executing the task described above, Hive contacts the name node to get the assignment of the data nodes. After the acknowledgement, the file is written into HDFS, whereby the ORC format is again in use.

Figure 4.9: Store Hive table persistent in HDFS

Figure 4.10 shows a processing task conducted by the external application Qlik Sense (see chapter 2, subsection 2.6 on page 14). Thereby the program wants to query data from a Hive table. The SQL query from the application is sent to a Hive ODBC client located on the edge node (see figure 4.11). There, the query is routed to Hive. In order to process a Hive table, the processing engine MapReduce is selected. Alternatively, the processing engine TEZ can also be used.

Figure 4.10: Processing dataflow

Since the introduction of HDP 2.0 (the Hortonworks Data Platform was available in version 2.5 for this work), the entire resource management within the Hadoop system is done by YARN (see subsection 5.1.8 on page 38). The resource manager itself acts as a scheduler that, in a first step, contacts a randomly chosen node manager to create the application master, which is responsible for the coordination and execution of all MapReduce tasks. A node manager is located on each slave node (see figure 4.11).

Figure 4.11: Create and send HIVE query

The resource manager also contacts the HDFS name node to locate the needed data files, which are stored on different data nodes. This information is provided to the application master, hence the master can choose an additional node manager on the same physical node on which the data node is holding the data. After the application master is selected, it contacts the other node managers to establish the additional applications.

Figure 4.12: Execute MapReduce Algorithm

Next, the application master negotiates with the resource manager for resource containers in which an application processes the actual task. After a map task is finished (see figure 4.12), the intermediate results are written back to the data node. Only when all map tasks are executed, the results are sent to another node manager to conduct the reduce phase as the last step of the MapReduce algorithm. The final result is sent back to Hive and from there, it is provided to Qlik.

Chapter 5 Evaluation of performance influences

5.1 Possible sources of performance influence factors

In chapter 4, the dataflow model was presented to explain the processes in Hadoop related to use cases like adding and retrieving data. The models sketch an overview of how data is routed through the system until it is available for a user. With the help of the models, first clues are given to identify possible bottlenecks. In this section, individual components within the model will be analyzed theoretically to identify elements that could have a positive or negative influence on performance.

5.1.1 Hardware

The first analysis regards the hardware, because the performance of all tasks is highly dependent on the given resources. The hardware resources can be divided into CPU, RAM, storage as well as the number of disks that are in use, and the network bandwidth. In case of a lack of performance, the given resources can be increased by vertical scaling [6], which means adding new CPUs, RAM or additional hard disks to an existing node. But during the maintenance task, the corresponding node would not be available. At this point, the advantage of a distributed system becomes visible: if there is a lack of hardware resources, a new node can simply be added to the cluster without any interruption of the running system [6]. This process is called horizontal scaling. However, adding new hardware is always cost intensive. Because the entire Hadoop system runs on virtual nodes, shifting resources for certain tasks is also an option.

5.1.2 Java virtual machine

MapReduce is written in Java, thus there should also be potential options to increase performance here. By activating the option "compressed ordinary object pointers" (CompressedOops), the memory consumption of 64-bit JVMs can be reduced [53]. Another possibility is to utilize the JVM reuse feature, which decreases the startup and shutdown overhead and therefore improves the performance [53]. Since the introduction of the new resource management system YARN, however, this configuration is not supported anymore.

5.1.3 HDFS block size

Hadoop is designed to store huge semi- or unstructured data files, which is done by HDFS. The minimum block size is 64MB, which is also the default configuration. The adjustment of the block size has a direct influence on the performance of processing tasks in Hadoop. This is because of the architecture of how data is stored in HDFS. The data of each file that has to be stored in HDFS is physically put into the above mentioned blocks. If an input file is bigger than a block, the data of this file is split into multiple blocks. For example, if the block size is set to 128MB, a 1GB file is divided into 8 blocks while being stored on a data node. This process is managed by the name node. For each assigned block on a data node, a new object with a unique ID is written into the index tree of the name node, which is a reference to the block. The index is always held in fast memory [109]. Thus, while adjusting the block size, the size of the input files has to be considered. If not, the index of the name node will soon be full with new entries if users put very large data files into HDFS. As a side effect, the search for a specific data file becomes more and more time consuming, since the name node has to analyze the growing index tree. This shows that the configuration of the block size is highly dependent on the input file size, hence it should be carefully planned. Even processing engines like MapReduce are affected by the block size: with larger blocks there is a smaller number of blocks to compute, hence there are fewer communication tasks between name node and data nodes [58].
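The relationship between file size, block size and the number of name node index entries can be sketched with a few lines of Python; the sizes are the ones from the example above, and the function name is chosen only for illustration.

Block count sketch (Python):
def blocks_needed(file_size_mb, block_size_mb):
    # Each HDFS block of a file creates one new object in the name node index tree.
    return -(-file_size_mb // block_size_mb)   # ceiling division

print(blocks_needed(1024, 128))   # a 1GB file with 128MB blocks -> 8 index objects
print(blocks_needed(1024, 64))    # the same file with 64MB blocks -> 16 index objects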

5.1.3.1 Small file problem

The small file problem describes the situation that the data files which have to be stored in HDFS are smaller than the pre-configured block size. As described in subsection 5.1.3, HDFS physically saves the data into blocks with a fixed length. If a data file is smaller than 128MB, for example 64MB, the remaining 64MB cannot be used, because it is only possible to store one file per block. The remaining 64MB is not physically lost, but this space is not addressable anymore, hence a new block has to be created for the next file. Each created block has an object ID stored in the index tree of the name node. Thus, if a lot of small files are stored, the index tree grows unreasonably fast, which slows down the entire system [109]. In order to reduce the amount of small files, it would be necessary to collect these files and merge them. Hadoop Archive files (HAR) do something like that. A HAR file is some kind of file system: it bundles small files and assigns an index entry to each item. However, this approach is not very efficient, since the actual amount of files is not reduced. Additionally, the HAR file index has to be scanned, which also results in extra effort [109]. Sequence files are proposed as a better solution, because new data is appended as key and value pairs (see appendix, section 10.2.1 on page 79). Besides that, sequence files are easily splittable, which is suitable for MapReduce tasks. As a disadvantage, it is not possible to list the keys within a sequence file, thus if a certain key and value pair has to be found, in the worst case a scan over the entire file could be necessary. Because of that, MapFiles were introduced, which use a sorted key list as an additional index [109]. HBASE uses MapFiles as persistent data storage, called HFile (see appendix, section 10.1.2 on page 78).
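The effect of merging small files can be estimated roughly as follows. The figure of about 150 bytes of name node memory per namespace object (file or block) is only a commonly cited rule of thumb and not a measured value of the system at hand; the file counts are invented for illustration.

Small file estimate (Python):
BYTES_PER_OBJECT = 150   # rough rule of thumb for name node memory per file or block object

def namenode_objects(num_files, blocks_per_file=1):
    # one object for the file itself plus one per block
    return num_files * (1 + blocks_per_file)

unmerged = namenode_objects(100_000)                  # 100,000 unmerged 5MB log files
merged_blocks = -(-(100_000 * 5) // 128)              # ceiling of total MB / 128MB block size
merged = namenode_objects(1, blocks_per_file=merged_blocks)
print(unmerged * BYTES_PER_OBJECT)   # roughly 30 MB of index memory
print(merged * BYTES_PER_OBJECT)     # well below 1 MB of index memory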

5.1.4 HDFS degree of redundancy

With respect to data reliability, it is recommended to store data redundantly in case one or more data nodes in the cluster are temporarily not available or damaged. If such a scenario happens, the data can still be retrieved from another data node [35]. HDFS offers the possibility to set the degree of redundancy. This configuration should be carefully planned, since the degree of redundancy has a major impact on the provided storage space. In fact, the disk space is not available indefinitely on the system and is thus a cost factor. Hence, it is important to find a balance between data reliability and storage space for cost efficiency [88]. As a default replica placement, HDFS typically creates one replica on another data node in the same rack, one on a data node in a different (remote) rack, and one replica on a different data node in the same remote rack [70]. A rack is a cluster of 30-40 data nodes that are physically located close together and are connected via the same network switch. In case of a network switch failure, the complete rack would not be available anymore. Because of that, it is recommended to create at least one replica remotely [79]. The degree of redundancy has no influence on the read and processing performance when querying data, but it has an impact on the network bandwidth during a write operation. In fact, the more replicas of a file exist, the more network traffic occurs while writing data [70].
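The trade-off between reliability, disk space and write traffic can be illustrated with a small calculation; the assumption that the first replica is written on a local data node is a simplification for illustration.

Replication footprint sketch (Python):
def replicated_footprint(file_size_gb, replication_factor=3):
    # Disk space consumed by all replicas and data shipped over the network for one write,
    # assuming the first replica is written on the local data node.
    disk = file_size_gb * replication_factor
    network = file_size_gb * (replication_factor - 1)
    return disk, network

print(replicated_footprint(16.5))   # (49.5, 33.0) GB for a 16.5 GB data set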

5.1.5 Data compression

The major purpose of data compression is to reduce storage and network utilization. When storing data into a database, the performance of read and write operations increases, because data whose file size could be reduced can be shifted faster [29]. Besides storage and data shifting operations, the speed of query processing can also be improved. This is because there are several query tasks for which no decompression of the data is needed, thus no additional decoding time occurs. But this only holds if the same compression algorithm is in use. For example, an exact-match comparison of a postal code like "79771" can be done on compressed data, because the encoded representation, for example "001011", can be recognized [67]. This also applies to projection and duplicate elimination tasks. Regarding the compression codecs available in Hadoop (see chapter 2, section 2.9 on page 20) and the degree of compression, different configurations could lead to faster processing or better file compression. Because of that, it is important to evaluate how the configuration impacts the overall performance.
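The idea of comparing values without decompressing them can be sketched with a toy dictionary encoding; the codes below are made up for illustration.

Compressed comparison sketch (Python):
# Equal plain values map to equal codes, so an exact-match filter such as
# postal_code = "79771" can be evaluated directly on the encoded column.
dictionary = {"79771": 0b001011, "70173": 0b001100, "80331": 0b001101}
encoded_column = [0b001011, 0b001100, 0b001011, 0b001101]

predicate = dictionary["79771"]          # encode the search literal once
matches = [i for i, code in enumerate(encoded_column) if code == predicate]
print(matches)                           # rows 0 and 2 match without any decoding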

5.1.6 File Format

Besides the amount of data, it is also essential to take a look at the file format in which the data can be stored. Within a Hadoop system, it is possible to store data in well known file formats like XML or CSV. But storing XML is not very efficient, since it produces a big overhead because of its meta description (see chapter 2, subsection 2.8.2). Instead, Hadoop comes with newer file formats like ORC, AVRO or PARQUET, which additionally hold the schema within the file. This makes a data transfer more flexible. Besides that, some of the new data formats use an index that accelerates the data search within a file (see chapter 2, section 2.8 on page 16). Thus it is also important to find out which file format leads to the best performance result.

5.1.7 Data Preprocessing

Depending on the demands of a data evaluation, the performance varies with the volume of the source data. In order to reduce the amount of data that has to be processed during a query task, preprocessing the source data can lead to a reduction of the volume, depending on the use case. As described in chapter 2, section 2.2 on page 3, the reservoir, which is part of the data lake, is designated for such tasks. Thus, data engineers can consider for which use cases they want to design data tables, and at which point they want to create a second table for the same use case in order to keep the volume small. The use case of splitting the data into cold and hot data should also be taken into account.

5.1.8 Configuration of YARN

Since the introduction of YARN, the efficiency of the task management, which was originally done by job tracker and task trackers, could be significantly increased (see chapter 2). However, whenever a new Hadoop cluster is established, YARN should be configured according to the given hardware resources, because the application has many options which, at the right setting, can lead to an additional performance increase. In the worst case, a bad configuration can also lead to constraints while processing a task. For example, depending on the given hardware resources, the maximum amount of containers per node can be determined [16]. A general recommendation is between 1-2 containers per disk and core [34].

5.1.9 Configuration of MapReduce2

MapReduce2 is based on the MapReduce algorithm and uses the resource allocation from YARN for its map and reduce tasks (see chapter 2, subsection 2.5.4 on page 10). There is also a possibility to increase the performance of MapReduce2 by changing the configuration in the mapred-site.xml file. In order to configure MapReduce2, the available physical memory, the JVM heap size as well as the amount of virtual memory should be taken into account [1]. For each map or reduce task, the total amount of memory usage per task can be defined. Under consideration of the minimum container memory allocation in YARN, the memory usage in MapReduce can be equal or higher. For example, if there is a minimum container memory allocation of 2GB, it is possible to set the maximum memory usage of a map task to 4GB and of a reduce task to 6GB, which can result in additional performance [1]. The following XML snippet denotes a possible configuration within the mapred-site.xml file:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>6144</value>
</property>

Every container is running in a JVM. Because of that, it is proposed that the maximum heap size should be lower than the allocated maximum map and reduce memory size [1]. This configuration can also be done in the mapred-site.xml:

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx4096m</value>
</property>

The virtual memory, here denoted as VM_Max, can also be configured. The virtual memory consists of physical and paged memory. The default virtual memory ratio, here denoted as r, is 2.1. So if the given physical memory is 16384MB RAM, the maximum virtual memory is about 34406MB [1].

VM_Max = PhysicalMemory · r    (5.1)

This configuration can be done in yarn-site.xml:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
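Formula (5.1) can be verified with a one-line calculation; the numbers are the ones from the example above.

Virtual memory sketch (Python):
def max_virtual_memory_mb(physical_mb, vmem_pmem_ratio=2.1):
    # yarn.nodemanager.vmem-pmem-ratio scales the physical memory of a container.
    return physical_mb * vmem_pmem_ratio

print(max_virtual_memory_mb(16384))   # about 34406 MB, as stated above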

5.1.10 Execution Engine

In chapter 2 (subsections 2.5.4 and 2.5.5 on page 10), two processing engines were introduced. MapReduce, as the first available processing engine for Hadoop, is a bit long-standing. With the introduction of YARN, MapReduce became faster because of a more efficient resource management. However, former studies showed that MapReduce is slower than TEZ (see chapter 2, subsection 2.5.5 on page 10). The main reason lies in the storage of intermediate results on the hard drive after the map phase, which TEZ avoids [85]. Apache TEZ is reported to be up to ten times faster than MapReduce in some cases [9], which explains the tendency to use TEZ for querying data in Hive. Apache Spark is another engine within the Hadoop environment (see appendix, section 10.1.1 on page 78). Similar to TEZ, it also avoids the storage of intermediate results on the hard drive. Therefore Spark should also be better than TEZ in extensive data mining operations, and it should be very fast in executing deep learning algorithms (see appendix, section 10.1.1 on page 78). For executing Spark with Hive, the extension SparkOnHive must be available [41]. Unfortunately, SparkOnHive is not provided on the operating system. However, there is a need to benchmark all three engines regarding the available data on the Hadoop cluster.

5.1.11 Configuration of HIVE

Hive is the data warehouse application in Hadoop. Before queries are executed, Hive can be configured regarding the file format, the compression codec or the processing engine. A well elaborated combination of all three components can lead to an optimal performance result while querying data.

5.1.12 Parallel access

Another considerable influence on performance lies in additional processing tasks arising from different users or applications. More tasks result in additional workload or additional task scheduling and thus in a worse performance.

5.1.13 Network bandwidth

The network bandwidth can also become a critical bottleneck. In the area of big data, huge data files have to be transferred over the network. At the company, the network has to be divided into the internal network of the Hadoop cluster and the external network of the company. While the workload of the internal network can be reduced by several configurations, like the HDFS block size or the degree of redundancy to name a few, the external network carries the additional workload of all other participants. Thus, providing enough bandwidth on the external network is more complex.

5.2 Evaluation of performance indicators

In the section above, possible performance influences were discussed theoretically. The purpose of this section is to express the performance by particular key values and to determine how these key performance indicators (KPIs) are influenced by certain factors. In order to identify the possible impact on performance, the technique of a QFD matrix (depicted in figure 5.1) is used. QFD is an abbreviation for "quality function deployment" and is a method that was originally designed to transfer customer requirements into the engineering process of a product. QFD was created in 1966 in Japan by Prof. Yoji Akao [40]. The matrix is very useful because it enables analyzing the impact of different influence factors on the KPIs by assigning points. In cooperation with the department "center of excellence" (CoE), responsible for maintaining and providing the Hadoop system, and "IT-operations" (ITO), responsible for data integration, the following key performance indicators were defined:

• Execution time

• Storage utilization

• RAM utilization

• CPU utilization

• Network utilization

• Data quality

• Amount of data

• Accessibility of data

The execution time has a significant impact on the user experience, while disk storage refers to the persistence of the system. But disk storage also has a direct influence on the performance, depending on how many hard drives per node are available: more hard drives can either result in faster read and write operations or in more space. The utilization of RAM, CPU and network has a major impact on the processing time. On the one hand, the utilization is a good indicator of how much effort the system has to spend for processing a task; on the other hand, it also shows how many resources are still available to execute additional tasks, since there are other users or applications that operate on the system at the same time. This leads to the assumption that if a single task performs very well but needs unreasonably high hardware resources, more parallel tasks could soon lead to a maximum workload and hence to a negative performance for each task.

The data quality is not to be neglected, because the performance is also dependent on a carefully planned preprocessing of the needed data. Thereby it should be taken into account at which accuracy level the data should be provided. For example, is it really necessary to store sensor data every 10 milliseconds? If not, the data volume can be dramatically reduced. Since Hive is used to query data, the amount of data can be interpreted as the number of columns and rows of a table. The availability of different file formats and compression algorithms (see section 2.7 on page 15) results in different file sizes of a referenced Hive table. Thus the amount of data has an indirect influence on the performance, because the bigger a data file is, the more time will probably be needed to process the entire file.

The accessibility of data describes the degree of redundancy. A reduction of the degree increases the probability that data is not available, but it also reduces the number of write operations. Thus a good overall performance should be guaranteed if all key performance values are in a good balance.

These KPIs are listed on the y-axis of the QFD matrix and get an initial priority, which influences the score denoted as "Performance influence" at the bottom of the matrix. The row "Priority" indicates which system characteristic, listed on the x-axis of the matrix, should be analyzed closely in order to increase the performance. The system characteristics are derived from the elements discussed in section 5.1 on page 35. In order to get a meaningful result, the given points are "1" for a slight influence, "3" for a medium impact and "9" for a big influence.

Figure 5.1: QFD - Matrix: Performance influence evaluation

The characteristics on the x-axis are clustered by the following criteria:

• Hardware

• Interfaces

• Software-configuration

• Data input

Regarding the hardware, factors like the amount of CPUs, the size of the RAM, the amount of hard drives as well as the amount of physical nodes are taken into account. In the software configuration section of the QFD matrix, the focus is put on the software applications that are running on the system. There, a configuration like the block size or the degree of redundancy in HDFS has a big impact on the performance, as does the configuration of the resource management in YARN. Even the frequency of pushing data into the data lake is taken into account. Data input refers to the raw data and how it will be stored, hence it focuses on the data volume and the overhead of the data file container. To get a first impression of which of the listed characteristics could be interesting, each element is compared with the KPIs, hence it can be checked what influence they probably have. For example, the system characteristic "CPU cores and GHZ" has a direct effect on the execution time, hence it gets 9 points. As explained in section 5.1.3 on page 35, the "HDFS block size" also has an impact on the execution time, because the more blocks have to be processed, the longer it takes, hence it also gets 9 points. The "HDFS block size" compared with the "CPU utilization" gets 9 points, because more blocks result in more parallel tasks; compared with the "RAM utilization", 3 points are given, because regardless of how many blocks are used, the amount of data stays the same. An example of how such a column score is obtained is sketched below.
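The column score ("performance influence") of a system characteristic is the sum of its relationship points, each weighted with the priority of the corresponding KPI. The priorities and points below are hypothetical values for illustration only; the actual ratings are contained in figure 5.1.

QFD column score sketch (Python):
# Hypothetical priorities and relationship points; the real values are in figure 5.1.
kpi_priority = {"execution time": 5, "storage utilization": 3, "CPU utilization": 4}
points_for_block_size = {"execution time": 9, "storage utilization": 1, "CPU utilization": 9}

score = sum(kpi_priority[kpi] * points_for_block_size[kpi] for kpi in kpi_priority)
print(score)   # 5*9 + 3*1 + 4*9 = 84 for this hypothetical column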

All findings of the QFD matrix are denoted in table 5.1 on the following page and table 5.2 on page 45. As can be seen, the presumably highest impact on performance comes from the data volume as well as from the number of available nodes. Because the performance tests shall be conducted on the given operating system, particular performance influences cannot be analyzed further. These are the hardware configuration, the streaming frequency, as well as the configuration of the resource management application YARN. Especially the configuration of YARN would be very interesting, but the danger of interrupting the operating system is too high, thus access is restricted by the company. The major focus will be put on the following compounds:

• Data compression

• File format

• Amount of data

• Processing engine

Compound | Performance influence | Priority | Explanation
Data: Volume | 207 | 10 | The amount of data has a huge impact on processing time or retrieving data
HDFS: Redundancy | 174 | 8 | Has a major influence on storage usage, as well as on the availability of the data
Amount of physical nodes | 171 | 8 | More physical hardware resources mean more distributed processing power
File container: Data compression | 167 | 8 | A good file compression reduces the size of data but could reduce speed
File container: Structure and overhead | 167 | 8 | A well organized file container with an internal index results in a faster retrieval of data and saves storage space
Processing: MapReduce | 147 | 7 | An algorithm has a major impact on processing time
Processing: TEZ | 147 | 7 | An algorithm has a major impact on processing time
Virtual resource management (YARN) | 140 | 6 | The resource management has a major impact on the processing time
Data: Degree of preprocessing | 140 | 6 | The degree of preprocessing the data results in a faster retrieval of the data and saves disk space
Streaming: Filesize | 135 | 7 | The streaming size has an impact on the amount of data that has to be processed
Streaming: Frequency | 135 | 7 | The frequency of sending new data to Hadoop has a major impact on network bandwidth utilization

Table 5.1: Evaluation of QFD - Matrix, Part 1

Compound | Performance influence | Priority | Explanation
RAM | 96 | 4 | Overall impact on all kinds of processing
HDFS: Size of index | 95 | 4 | The smaller the index, the faster the files can be found in HDFS
HDFS: Blocksize | 78 | 3 | Block size has a major impact on processing the data and the reaction time in finding the data files
Network bandwidth | 64 | 3 | The bigger the bandwidth, the faster the data transfer and reaction time
CPU cores and GHZ | 54 | 2 | Has a major impact on all kinds of operations
Communication latency | 54 | 2 | Could result in a major increase in reaction time
Scheduler latency | 45 | 2 | Could result in a major increase in reaction time
Size of hard drive | 45 | 2 | Has a major impact on the amount of data that can be stored
Amount of hard drives | 32 | 1 | Has an impact on retrieving data and the amount of data that can be stored

Table 5.2: Evaluation of QFD - Matrix, Part 2

Chapter 6 Methodologies

The purpose of this chapter is to describe the process of conducting a study, based on the specific system components mentioned in the chapter above, to analyze their influence on the performance of the given Hadoop system at the company. The first part of this chapter illustrates the strategy behind the study. The second part explains the technical approach to conduct measurements on the system. Additionally, an overview of the cluster workload is given, because the Hadoop solution at the company is already in use by different departments, which excludes a system behavior like under laboratory conditions.

6.1 Methodology: Testing Strategy

In chapter 5, several factors that have an influence on the performance of a Hadoop system were identified. In section 5.2, with the help of the QFD matrix, a pre-selection was made that shows some interesting elements which will be closely examined. Because of restrictions made by the company, some of the elements cannot be modified. For instance, it is not possible to change settings in YARN.

6.1.1 Test systems

In order to distinguish between the different configuration settings, this section defines the test systems which are taken to perform the several test cases. As denoted in table 6.1, a test system consists of a file format in combination with a compression codec. The table also shows that not all compression codecs are compatible with each file format. However, for each file format, three compression settings are available that are supported by the Hadoop system, hence a total number of nine test systems is available. For simplification, the label of a test system consists of the name of the file format and of the compression codec. For example, ORC_ZLIB is the label of a test system based on an ORC file format with a Zlib compression codec.

Compression / File format | ORC | AVRO | PARQUET
None | X | X | X
ZLIB | X | - | -
Snappy | X | X | X
Deflate | - | X | -
Gzip | - | - | X

Table 6.1: File formats with compression

6.1.2 Test cases

As a next step, the test cases have to be defined. Because real data from the factories is already available, and with Qlik Sense (see chapter 2, section 2.6) a particular big data application is in place, the decision was made to derive the test cases directly from the dashboard, in order to get results which are close to the practical work at the company. The purpose of the Qlik Sense application is to provide a dashboard which denotes test times from different test benches within the production facilities. A closer look at the dashboard indicates that most charts are based on sums or statistical values like median, minimum and maximum. Also a table join of two different tables is used in the dashboard. Because of that, the test scenarios are split into the three categories "count", "math" and "join" (see table 6.2). The "count" operation is a HiveQL query which goes through the entire input table and counts all rows of the selected column that fulfill a specific where clause. The math operation consists of computations of the above mentioned statistical values mean, median, minimum and maximum. The join operation is the last test scenario; like in a relational database, the join query is one of the most time consuming operations. All test scenarios will be written for Hive in the SQL-like HiveQL query language.

Test case / Processing engine | MapReduce | TEZ
COUNT | X | X
MATH | X | X
JOIN | X | X

Table 6.2: Testset

Each of the three test scenarios will be executed 30 times per execution engine, in randomized order, to make the measurement results more robust. The selected execution engines are the MapReduce algorithm and TEZ. The results will be kept separately, because it is expected that TEZ will always be faster. This approach makes it possible to compare the execution engines independently from the given test systems, hence future execution engines could be added at a later time.

6.1.3 Determination of the measurement system capability

All tasks that are performed on the Hadoop system can be guided via the Linux command shell. While a job is executed, the progress of a single task is presented in the shell window. If a Hive task is completed, the result is also displayed in the same window. Additionally, the system reports the execution time in milliseconds after each task. With the shell script, the console output can also be redirected into separate log files. In order to distinguish between the log files, the label of each file holds the name of the test case and the execution engine that was in use. Every log file records the execution time of each test system, hence the data will be automatically extracted by a Python script and added to a CSV file, so that a statistical evaluation can be done afterwards.

Since the execution times are recorded from the Linux command shell, it is important to verify the accuracy of the system itself to ensure that the measurements are not falsified by Hadoop. Therefore, a Measurement System Analysis (MSA) type 1 has to be conducted [69]. The goal of an MSA analysis is to find out whether a measurement system produces valid values, which is denoted by the two capability indices cg and cgk. If one of the calculated values is lower than 1.33, the measurement system is considered not capable [96]. The cg value indicates the precision of repeatability, hence the deviation of the recorded input values. The cgk value indicates how close the values are to a defined measurement standard [21]. A standard could be, for example, a manufacturer reporting an engine with a petrol consumption of 6 liters per 100 km. Before the execution can start, the following assumptions have to be made: at the company, a spread of 4 standard deviations (4 sigma) is used. Next, the resolution of the time measurement is set to one second, because it is expected that latencies in the millisecond range will occur through the network communication from the local machine to the Hadoop cluster and back. Thus, the tolerance in the MSA analysis is 1 second. The number of test values is n = 40, recorded from the Hadoop system via a Hive query over the command shell. The Hive query is based on a simple "select *" statement with a limited output of the first 100 entries per run, made on the same data set which will later be used for the actual test executions (see section 6.2). Because there is no measurement standard available, the median is considered to be a good reference value in this case, as it represents the middle of all recorded values. The MSA analysis itself was done with the help of the software tool Minitab, provided by the company.

Figure 6.1: Measurement system analysis type 1 of the Hadoop test system

As can be seen from figure 6.1, the value cg = 1.51 indicates a good repeatability precision of the measurement values. But with cgk = 1.31, the value indicates that the test system is only of limited capability, since the threshold of 1.33 is just barely missed. However, despite the fact that deviations in the millisecond range can occur, a processing run with huge input data can take minutes, hence it is safe to assume that the precision of the system is accurate enough to determine which performance influence has an impact on speed.
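The two capability indices can also be computed directly from the recorded values. The following sketch uses one common formulation of the type 1 study (20% of the tolerance and a spread of 4 standard deviations); the runtimes in the example are invented, since the actual analysis was done in Minitab.

MSA type 1 sketch (Python):
import statistics

def type1_capability(values, tolerance, reference, k=0.2, spread=4.0):
    # cg compares a share of the tolerance with the spread of the repeated measurements,
    # cgk additionally penalizes the bias towards the reference value.
    s = statistics.stdev(values)
    bias = abs(statistics.mean(values) - reference)
    cg = (k * tolerance) / (spread * s)
    cgk = (k / 2 * tolerance - bias) / (spread / 2 * s)
    return cg, cgk

runs = [12.1, 12.3, 12.2, 12.4, 12.2, 12.3, 12.1, 12.2]   # invented runtimes in seconds
print(type1_capability(runs, tolerance=1.0, reference=statistics.median(runs)))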

6.2 Methodology: Testing Preparation

After the design phase of the test strategy, preparations have to be done before the randomized tests can be performed. Because of serious safety precautions specified by the company, it is not possible to deploy or update jar files on the running system. A jar file enables a user to write own programs that can be executed on the system; a typical task is writing a MapReduce job [36]. Because of that, all interactions with the system are done via the Linux command shell. In order to achieve a test automatization, a combination of a Linux shell script with .hql scripts is used. A single .hql script contains HiveQL statements that can be read and processed by Hive. Thus, several .hql scripts will be created, whereby each script contains a single test case, hence a test randomization is possible. For measuring and monitoring the workload of the Hadoop cluster, Ambari Metrics is used. Ambari Metrics comes with the HDP package, which is accessible over the Ambari graphical web interface.

At this point, it is important to mention again that Ambari Metrics only provides an overall resource tracking. Hence, it is not possible to track the resource usage of a single process. Therefore, only the general resource usage can be observed while conducting the test runs. Consequently, the utilization of CPU or memory cannot be used as a performance value.

6.2.1 Test systems

For the actual tests, data is required to perform the HiveQL queries. Therefore, data from the chassis dynamometer is provided, which is loaded daily in a 15 minute cycle from the IS-Server onto the data lake in HDFS. The size of the source data is around 16.5GB uncompressed as a text file. This data is already stored in a Hive table, since it is already used for the Qlik dashboard application. Thus, the schema of this table can be adopted to create the new test tables. These new tables are a combination of the listed file formats and the corresponding compression codecs. For example, to create an ORC table with a Snappy compression codec, the following HiveQL code is necessary:

Create ORC Table:
CREATE TABLE isserver_orc_snappy
STORED AS ORC
TBLPROPERTIES ('orc.compress'='snappy')
AS SELECT * FROM /pathToSourceTable;

The command "STORED AS" refers to the file format type. As a characteristic, the compression codec of an ORC table is defined within the HiveQL query by the command "TBLPROPERTIES". This code snippet is also used to set the compression of Parquet files, but in a slightly different way. When using the AVRO file format, the configuration of the codec has to be set before the actual HiveQL query.

Create AVRO Table:
SET hive.exec.compress.output = true;
SET hive.exec.compress.intermediate = true;
SET avro.output.codec = deflate;

CREATE TABLE isserver_avro_deflate
STORED AS AVRO
AS SELECT * FROM /pathToSourceTable;

As seen above, the statement "SET hive.exec.compress.output" activates the compression of the output. Additionally, the property "hive.exec.compress.intermediate" activates the compression of intermediate results; such intermediate results occur, for instance, after the map phase of a MapReduce job. Finally, with "SET avro.output.codec" the actual compression codec is set. On the given cluster, both "hive.exec.compress.output" and "hive.exec.compress.intermediate" are already set to true by default, thus for AVRO it is only required to set the compression codec. As a best practice, it is nevertheless recommended to check the other two settings.

Create Parquet Table:

CREATE TABLE isserver_parquet_gzip
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='gzip')
AS SELECT * FROM /pathToSourceTable;

The last example denotes how a Parquet table with a Gzip compression codec is created. The slight difference between Parquet and ORC is the wording "compression" instead of "compress". Because join queries are used within the test set, two different source tables are needed. Thus, a total number of 18 tables has to be created.

Figure 6.2: File format - size

Figure 6.2 shows the file size of each source table, on which each of the three test cases will be conducted. As a first impression, it can be noted that an ORC file has a very good storage utilization: uncompressed, it achieves a size reduction by a factor of 7, and with ZLIB compression the factor goes up to 24. In case of an AVRO file format, the uncompressed file size is about 1GB bigger than the source file, which is probably due to the file overhead of AVRO. Good results can also be achieved by the use of a Parquet file format in combination with a Snappy compression codec, where a data size reduction by a factor of 8 can be reached. The file reduction ratio of every test system is denoted in table 6.3.

Compression / File format | ORC | AVRO | PARQUET
None | 7.1 | 0.9 | 2.0
ZLIB | 23.9 | - | -
Snappy | 17.4 | 3.1 | 7.8
Deflate | - | 4.3 | -
Gzip | - | - | 4.6

Table 6.3: Factor of file size reduction

6.2.2 Create test cases

As a second step, the needed HiveQL queries have to be designed. The first query type, named ”COUNT” is a simple count query with a ”where – clause” to make the query a bit more complex.

Select count query:
SELECT COUNT(name) AS Fahrzeuge
FROM table_orc
WHERE place = 'Factory xy';

As mentioned in subsection 6.1.2, this query type is often used for the test time dashboard. The second query type, named "MATH", performs mathematical operations on the data set. As a first value, the average execution time of a chassis dynamometer tester shall be determined. Next, based on the duration values, the median is calculated, followed by the minimum and maximum of all durations. This query is interesting because a lot of reducer tasks are necessary for the calculations.

Select math query: SELECT AVG(CAST(duration AS INT)), PERCENTILE (CAST(duration AS BIGINT),0.5), MAX(CAST(duration AS INT)), MIN(CAST(duration AS INT)) FROM table_orc;

The third query type, named "JOIN", is a join query, which is one of the most time consuming queries, because with a nested loop, each entry of the left table has to be compared with all entries of the right table. But the file formats Parquet and ORC use an index, hence they should be faster than Avro, since indexing speeds up the lookup of entries.

Select join query:
SELECT *
FROM table_orc
LEFT JOIN table2_orc ON table_orc.testerid = table2_orc.poiid
WHERE item = 'Z_Gesamt'
  AND place IS NOT NULL
  AND duration < 2000
  AND cdata >= '2017-05-01';

As a last step, the desired processing engine has to be selected. Available engines on the development cluster are MapReduce and TEZ. In order to set one of these engines, the following command has to be set:

Select processing engine:
SET hive.execution.engine = mr;   -- mr = MapReduce; tez = TEZ

As mentioned above, the configuration tasks in Hive, as well as the HiveQL queries are both written into the .hql scripts. Thus, eventual changes of the code can be done very quickly.

6.2.3 Shell script

Before the execution of the test sets can be started, a shell script has to be programmed which automatically performs all commands in Hadoop. The script reads all .hql files and stores their names within an array. This array is shuffled randomly to ensure a better test coverage. Next, all the test cases in the .hql files which are referenced by the array are executed in a loop. While the shell script is running, it writes the output of the shell window into log files (see subsection 6.1.3). The .hql files and the shell script are put on the edge node, from where the shell script is executed. In order to get robust test results, it is necessary to repeat each test combination at least n = 30 times:

iterations = n · (file format/codec combinations) · engines · test cases = 30 · 9 · 2 · 3

In total, there are 1620 time measurement values.
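The randomization and the collection of the execution times were realized with a Linux shell script; the following Python sketch only illustrates the same idea. The directory name, the "hive -f" call and the "Time taken" pattern of the Hive command line output are assumptions for illustration.

Test automation sketch (Python):
import csv, glob, random, re, subprocess

hql_scripts = glob.glob("testcases/*.hql")   # one .hql file per test case and test system
random.shuffle(hql_scripts)                  # randomized execution order

with open("timings.csv", "a", newline="") as out:
    writer = csv.writer(out)
    for script in hql_scripts:
        result = subprocess.run(["hive", "-f", script], capture_output=True, text=True)
        # The Hive CLI reports the runtime of a query as "Time taken: <x> seconds".
        for match in re.finditer(r"Time taken: ([\d.]+) seconds", result.stdout + result.stderr):
            writer.writerow([script, match.group(1)])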

6.2.4 Cluster workload

As described above, it is not possible to track the resource utilization of a single task with Ambari Metrics. However, this section gives a general overview of the resource usage on the given test system. The test measurements were done within a month, whereby two runs per day were conducted; the first run was always started in the morning and the second in the evening. Figure 6.3 shows records of the CPU utilization from all test cases while MapReduce was in use. The bold red line within a single chart shows the average CPU utilization. It indicates that the entire Hadoop cluster had a relatively high system utilization of around 80% to 90% during the test executions. With TEZ, the CPU utilization looks similar, but it seems that with MapReduce the CPU usage is more constant than with TEZ (see figure 6.3). However, since there are many other users running their tasks on the same system, it is very difficult to prove whether these fluctuations occur because of the conducted tests or just by accident.

Figure 6.3: CPU utilization

Figure 6.4 shows a snapshot of the memory usage, which was also recorded during the test set execution. Again, the bold red line is the average memory usage, calculated from all recorded memory utilizations. Similar results can also be observed during the recordings with TEZ. The overall memory usage is around 50%, and sudden peaks cannot be detected. Summarized, the resource utilization seems to be relatively high. The memory consumption looks stable and a buffer is still available, hence the test results are probably not strongly affected by memory shortages. But as indicated in figure 6.3, the overall CPU usage is mostly between 80% and 90%. This leads to the assumption that some results can be negatively affected by CPU shortages.

Figure 6.4: RAM utilization

However, the resource tracking of the entire Hadoop system gives only a small insight into how different tasks are actually influenced by the resource utilization of the system, because Ambari Metrics does not deliver information about a single node. Thus, it is not possible to determine which node is performing a specific task and what the resource usage of this node actually is.

Chapter 7 Results

In chapter 6 (section 6.2) it is described that for each test system, the number of measurement values per test case is n = 30. A test system is defined as a combination of the selected file format with the corresponding compression codec, for example ORC_NONE. To get an overview of the performance of all test systems independent of the test cases, the results of the test cases per run have to be combined by calculating the mean value. In order to achieve a more robust result, the total number of measurement values over all test cases is mathematically increased by a matrix shuffle method, similar to a Cartesian product, but with the difference that the new tuples do not result in a product. Instead, each tuple results in a mean value, obtained by dividing the sum of its elements by the number of test cases, defined as t. In this case, the total number is t = 3, derived from the test cases "count", "math" and "join".

\[
\begin{matrix}
\text{Count} & \text{Math} & \text{Join} \\
c_1 & m_1 & j_1 \\
\vdots & \vdots & \vdots \\
c_n & m_n & j_n
\end{matrix}
\]

Based on the given values from the lists, new combinations of ordered triples are created:

\[
C \times M \times J = \{(c, m, j) \mid c \in C \wedge m \in M \wedge j \in J\} \tag{7.1}
\]

For each new combination, the symbol V is used. Thus, each vector results in a combination like V1 = {c1, m2, j1} or V2 = {c1, m1, j1}. The formula to calculate each new mean value is defined as:

\[
a_n = \frac{c + m + j}{t} \tag{7.2}
\]

Thus, the total number of measurement values can be increased by the factor n^t, which results in 30^3 = 27,000 values per test system.
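The combination step can be sketched as follows; the timing lists are shortened, purely hypothetical placeholders, and `itertools.product` stands in for the matrix shuffle described above.

```python
from itertools import product

# Hypothetical per-run execution times in seconds for one test system;
# only three of the n = 30 values per test case are shown for brevity.
count_times = [12.1, 11.8, 12.4]   # test case "count"
math_times  = [15.3, 14.9, 15.6]   # test case "math"
join_times  = [33.0, 31.7, 32.5]   # test case "join"

t = 3  # number of test cases that are combined (formula 7.2)

# Build every combination (c, m, j) and average it, which inflates the
# n values per test case to n**t combined mean values per test system.
combined_means = [(c + m + j) / t
                  for c, m, j in product(count_times, math_times, join_times)]

print(len(combined_means))  # 3**3 = 27 here; 30**3 = 27,000 with the full data
```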

As a first observation, the total execution times of each test system are examined. The results are separated by the execution engines MapReduce and TEZ to get an idea of the distribution of each system. To depict all relevant information, a box plot is used, since it shows several values like median, minimum and maximum, as well as the main distribution of all measurement values. The y-axis covers the time in seconds, while the single test systems are listed on the x-axis.

Figure 7.1: MapReduce total timing boxplot

Figure 7.1, where the MapReduce algorithm was in use, indicates that each test system has a different variance. This leads to the assumption that the execution time of each system is not constant. Especially for AVRO and PARQUET, the spread between the quantiles around the median seems to be higher than for ORC. With the use of TEZ (see figure 7.2), the spread of the quantiles seems to be smaller, which indicates that the results are more stable. Nevertheless, AVRO still has a bigger variance.

In order to empirically validate whether there is a difference between the test systems, a one-way ANOVA was performed, based on the test case "count" executed by TEZ, whereby the population per system was n = 30 and each µn represents a test system. The H0 hypothesis was denoted as:

H0 : µ0 = µ1 = µ2 = µ3 = µ4 = µ5 = µ6 = µ7 = µ8

The analysis resulted in F = 41.367 and p = 0.000. If the probability p were greater than α = 0.05, the null hypothesis of equal means could not be rejected [27]. In this case, however, there is a significant difference between the single test systems, hence H0 is rejected. Thus, the choice of test system has an influence on the execution time.
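Such a one-way ANOVA can be reproduced, for example, with SciPy; the nine sample arrays below are randomly generated placeholders for the n = 30 measured "count" timings per test system and do not reproduce the reported F and p values.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(seed=0)

# Placeholder samples: one array of n = 30 "count" timings per test system.
# In the actual analysis these are the measured execution times under TEZ.
systems = [rng.normal(loc=10 + i, scale=2.0, size=30) for i in range(9)]

f_stat, p_value = f_oneway(*systems)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# H0 (all system means are equal) is rejected if p < alpha = 0.05.
if p_value < 0.05:
    print("Reject H0: the test systems differ significantly.")
```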

Additionally, charts 7.1 and 7.2 show that there are a lot of outliers, with MapReduce producing more outliers than TEZ.

Figure 7.2: TEZ total timing boxplot

Because of the given situation, it is not recommended to use the mean as a representative reference value, because it is changed dramatically by the outliers, which would lead to wrong assumptions [99][101]. As an example: if nine of ten persons drink one cup of coffee per day, but the tenth person drinks twenty, the mean value would be 2.9 cups of coffee per person. Using the median, the value would be 1, which corresponds more closely to reality.
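The coffee example in code, as a minimal illustration of why the median is preferred here:

```python
from statistics import mean, median

# Nine persons drink one cup of coffee per day, the tenth drinks twenty.
cups = [1] * 9 + [20]

print(mean(cups))    # 2.9 cups, pulled up heavily by the single outlier
print(median(cups))  # 1.0, which corresponds more closely to the typical person
```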

Figure 7.3 denotes the standard deviation of each system. It shows that the values are large with both engines, but higher with MapReduce than with TEZ. This also strengthens the decision that the median is more suitable to denote the total execution time.

Figure 7.3: Standard deviation total timings - unadjusted

Figure 7.4 gives a survey of the system timings when the MapReduce algorithm was in use. The test systems based on the ORC or Parquet format achieve the best results, with the ORC systems being slightly faster. For example, an ORC file format with a Snappy compression is around 1% faster than a Parquet file format with Snappy compression. All systems with the AVRO file format have the lowest performance. For instance, AVRO_DEFLATE is up to 60% slower than ORC_SNAPPY.

Figure 7.4: Total timings MapReduce

Beside the timings, the chart also indicates that each test system that uses a compression codec needs more time to execute a job than without one.

Figure 7.5: Total timings TEZ

This indicates that with MapReduce, the compression and decompression tasks extend the overall processing time.

With the TEZ algorithm, a different result can be observed (figure 7.5). The plot shows a reverse trend regarding the use of the compression codecs. Thus, it seems that TEZ can handle compression and decompression tasks more effectively than MapReduce does. Additionally, TEZ can also decrease the overall processing time depending on the size of the data table (see figure 6.2 on page 50). This indicates that the smaller the file size of the input file, the faster TEZ executes a job. On average, the execution times of all systems are 3 times faster than with MapReduce.

With TEZ, the best result could be achieved with a PARQUET file format in combination with a GZIP compression codec. Compared with the system ORC_ZLIB, it is up to 24% faster, and compared with AVRO_SNAPPY, it is even 65% faster.

Since there is a big variance in the measured time values, it is also reasonable to take a look at the 10% upper and lower quantiles to give a prospect of what can be expected in worst and best case scenarios.

Figure 7.6: Total timings MapReduce with 10% lower quantile

Starting with the best case, with the MapReduce algorithm (figure 7.6), the overall processing times are 6% faster than the median timings denoted in figure 7.4. But in the worst case, the overall times are up to 24% higher (figure 7.7). Even with Parquet, there is an increase of up to 34%.

Figure 7.7: Total timings MapReduce with 10% higher quantile

Regarding the TEZ algorithm, in the best case (figure 7.8) the overall timings are 20% faster than the median (denoted in figure 7.5 on page 58), whereby the ORC systems perform around 24% better.

Figure 7.8: Total timings TEZ with 10% lower quantile

In the worst case (figure 7.9), an increase of around 21% in the timings can be expected. For example, with the system PARQUET_GZIP, the distance between the lowest and highest quantile is around 42%.

Figure 7.9: Total timings TEZ with 10% higher quantile

So far, all results relate to the execution time only. In order to determine the best configuration in Hive, a performance key value has to be established that covers all the selected KPIs besides the execution time. As already defined in chapter 5 (section 5.2 on page 40), the performance values are:

• Execution time

• Storage utilization

• RAM utilization

• CPU utilization

• Network utilization

• Data quality

• Amount of data

• Accessibility of data

As documented in section 5.2, it is not possible to use CPU, memory and network utilization as key performance values, since the resource usage per task cannot be tracked with Ambari Metrics. The accessibility of the data can be expressed by the degree of redundancy, which is 3 by default. However, because of additional restrictions, changes in HDFS are not allowed, since the danger of a negative influence on the running development system would be too high, hence the file redundancy could not be changed. Thus, for this study, only the "execution time", the "data volume" and the "amount of data" can be considered. Since the data is stored in files and linked to Hive tables, it is possible to assess the degree of information by multiplying the number of rows with the number of columns. The file size depends on the two dimensions file format and compression codec.

The performance value composed of the amount of data and the storage size is defined by:

\[
P_{\text{size}} = \frac{\text{rows} \cdot \text{columns}}{\text{file size}} \tag{7.3}
\]

Regarding the execution times, the smaller the value, the better the performance. Because of that, the timings have to be transformed:

\[
P_{\text{time}} = \frac{1}{\text{execution time}} \tag{7.4}
\]

Before the values of both dimensions, "file size" and "processing time", can be used to calculate the performance value, they have to be normalized:

\[
x_{\text{new}} = \frac{x_i - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \tag{7.5}
\]

With the above formula, the values of both dimensions are scaled into a range from 0 to 1. After the alignment of both key values, the performance value per system can be calculated with the following formula:

\[
P_{\text{system}} = (P_{\text{time}} \cdot w_{\text{time}}) + (P_{\text{size}} \cdot w_{\text{size}}) \tag{7.6}
\]

According to the formula, additional weights per performance type are added. This allows determining which of the two dimensions is more relevant at the moment. Thus, it is possible to make the performance key value dependent on various use cases. In this evaluation, the use case is related to the "testing time" dashboard, realized with Qlik.

The weights are wtime = 0.4 (40%) and wsize = 0.6 (60%). This decision was made based on the fact that the monthly billing of the Hadoop cluster depends on the used data volume. On the other hand, if the new dashboard data is stored on the data lake, the velocity is around 15 minutes. Thus, the maximum processing time to provide new data on the reservoir is also set to 15 minutes, hence there is still a buffer regarding the current processing time.

Since the results of the time measurements have a lot of outliers (figure 7.1 on page 56, figure 7.2 on page 57), it is not reasonable to use their actual minimum and maximum, because this would lead to very small numerical values.

Figure 7.10: Histogram MapReduce Count

For example, the minimal time value in TEZ is ∼2 seconds (see figure 7.11) and the maximal time value in MapReduce is ∼150 seconds (see figure 7.10).

Figure 7.11: Histogram TEZ Join

Most of the values are between 10 and 70 seconds, hence using the lower 10% quantile as the minimal value and the upper 10% quantile as the maximum value is more suitable. Otherwise, the influence of time on the performance value would be too small. Regarding the file size, it is not necessary to take the quantiles, because the values are more balanced.
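Putting formulas 7.3 to 7.6 together, the performance value per test system can be computed as sketched below. The AVRO figures and the table dimensions are hypothetical placeholders (the ORC and Parquet file sizes follow the values mentioned in the text), and a simple min-max normalization across the systems is shown instead of the quantile-based bounds used for the time dimension.

```python
W_TIME, W_SIZE = 0.4, 0.6  # weights chosen for the dashboard use case

# Raw figures per test system: median execution time in seconds and the file
# size in MB of the same table (rows x columns).
rows, columns = 50_000_000, 40
systems = {
    "ORC_ZLIB":     {"time": 15.0, "size":  650},
    "PARQUET_GZIP": {"time": 13.5, "size": 3400},
    "AVRO_SNAPPY":  {"time": 38.0, "size": 9000},
}

# Formulas 7.3 and 7.4: transform both dimensions so that larger means better.
p_time_raw = {name: 1.0 / s["time"] for name, s in systems.items()}
p_size_raw = {name: rows * columns / s["size"] for name, s in systems.items()}

def normalize(raw):
    """Formula 7.5: scale the raw values of one dimension into [0, 1]."""
    vmin, vmax = min(raw.values()), max(raw.values())
    return {name: (v - vmin) / (vmax - vmin) for name, v in raw.items()}

p_time, p_size = normalize(p_time_raw), normalize(p_size_raw)

# Formula 7.6: weighted performance value per test system. In the study the
# bounds for the time dimension are the 10% and 90% quantiles, not min/max.
for name in systems:
    p_system = p_time[name] * W_TIME + p_size[name] * W_SIZE
    print(name, round(p_system, 2))
```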

Figure 7.12: Performance values, MapReduce - Count

Figure 7.12 shows the performance results for each test system, based on the MapReduce algorithm and the test case "Count". Again, each system is listed on the x-axis. The y-axis denotes the performance value achieved per system. The higher the value, the better the performance of a system. In this case, ORC_ZLIB achieved the best result, as it did in the other two disciplines "Join" (figure 7.13) and "Math" (figure 7.14).

Figure 7.13: Performance values, MapReduce - Join

This is due to the overall good execution time and the excellent data usage of ORC, which results in a file size of just ∼650MB (figure 6.2 on page 50). Although the system PARQUET_GZIP is faster (see figure 7.12), it achieves a lower performance since its file size is ∼3400MB.

Figure 7.14: Performance values, MapReduce - Math

The system ORC_SNAPPY also performed very well, but as can be seen, ORC_ZLIB is clearly the best in every discipline. The AVRO systems did not perform so well, since their overall execution time was always higher than that of ORC or PARQUET (figure 7.4 on page 58), and their file size is also larger than with the other systems (see figure 6.2 on page 50).

Figure 7.15: Performance values, TEZ - Count

With the TEZ algorithm, the results are comparable to those from MapReduce. Only in the discipline Join does the system ORC_SNAPPY perform slightly better than ORC_ZLIB (see figure 7.16). It can also be observed that with TEZ, compression and decompression tasks are processed much faster than with MapReduce.

Figure 7.16: Performance values, TEZ - Join

However, with the TEZ algorithm, higher values are achieved in every discipline. For example, in the discipline Math, TEZ reached 10 points more than MapReduce (see figure 7.17).

Figure 7.17: Performance values, TEZ - Math

In a final step, the total performance value has to be determined. For that, the following formula is used:

\[
P_{\text{total}} = \frac{(P_{\text{count}} \cdot w_{\text{count}}) + (P_{\text{math}} \cdot w_{\text{math}}) + (P_{\text{join}} \cdot w_{\text{join}})}{3} \tag{7.7}
\]

This time, the weights are based on the operations that are mostly performed in the test timing dashboard from Qlik: wcount = 0.6 (60%), wmath = 0.35 (35%), wjoin = 0.05 (5%). The total performance value is also separated into the MapReduce and the TEZ algorithm.
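A minimal sketch of formula 7.7 with the stated weights; the per-discipline performance values passed in are hypothetical.

```python
# Weights per discipline for the Qlik test-timing dashboard use case.
W_COUNT, W_MATH, W_JOIN = 0.60, 0.35, 0.05

def total_performance(p_count, p_math, p_join):
    """Formula 7.7: weighted total performance value of one test system."""
    return (p_count * W_COUNT + p_math * W_MATH + p_join * W_JOIN) / 3

# Hypothetical per-discipline performance values for one test system.
print(round(total_performance(p_count=0.92, p_math=0.88, p_join=0.85), 3))
```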

Figure 7.18: Total performance values - MapReduce

As denoted in figure 7.18 and figure 7.19, the distribution of the total performance values validates the results already drawn from the single-system performance results. With both MapReduce and TEZ, the test system ORC_ZLIB achieved the highest score, hence on the given system it is the best choice for the current dashboard use case at the company. It is also highly recommended to use the TEZ algorithm instead of MapReduce, since with TEZ, the execution time of each task is much faster. This results in an average increase of ∼3 points and, as denoted in figure 7.2, in fewer fluctuations.

Figure 7.19: Total performance values - TEZ

As indicated in figure 7.1, it seems that the distribution of the measured values increases with the size of the data file (figure 6.2). A look at the standard deviation of each system (see figure 7.20) makes this clearer. The standard deviation is calculated from the measurement values without any outliers to get a better impression of the actual distribution. As can be seen, the bigger the file size, the higher the standard deviation (see file sizes in figure 6.2 on page 50). The figure also shows that the distribution is much larger if MapReduce is in use. A reason for this behaviour could be the fact that MapReduce writes the intermediate results to the hard disk, which probably generates additional latencies.

Figure 7.20: Standard deviation total timings - adjusted
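A possible way to compute such an "adjusted" standard deviation is sketched below; the thesis only states that outliers were excluded, so the 1.5·IQR rule and the sample values are assumptions.

```python
import numpy as np

def adjusted_std(times):
    """Standard deviation after removing outliers. The thesis only states that
    outliers were excluded; the common 1.5*IQR rule is assumed here."""
    times = np.asarray(times, dtype=float)
    q1, q3 = np.percentile(times, [25, 75])
    iqr = q3 - q1
    kept = times[(times >= q1 - 1.5 * iqr) & (times <= q3 + 1.5 * iqr)]
    return float(np.std(kept, ddof=1))

# Hypothetical timings with one strong outlier.
print(round(adjusted_std([30.1, 31.4, 29.8, 30.9, 95.0]), 2))
```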

To investigate this behaviour more closely, a second randomized cluster test was conducted, this time with half of the original amount of data and only with the AVRO test systems (see figure 7.21), since they showed the strongest outliers (see figure 7.2) and the biggest file size. Again, for each test system, the three test types "count", "math" and "join" were used with n = 30 iterations.

Figure 7.21: File size of AVRO tables

The boxplots (figures 7.22 and 7.24) indicate that the variance of the measured values has shrunk, especially if TEZ is in use. A one-way ANOVA test based on "tez_count" compares the variance of both AVRO_SNAPPY runs, which resulted in F = 92.31 and p = 0.000 < α. Hence, a difference could also be detected.

In detail, the standard deviation (see figure 7.23) indicates an overall reduction of up to 45%. But with MapReduce, there was only an overall reduction of around 2%. Another ANOVA test, based on the same conditions but with the MapReduce algorithm and AVRO_DEFLATE, resulted in F = 4.008 and p = 0.05 ≤ 0.05. This indicates that there is still an influence of the file size, but with MapReduce it is very small.

As mentioned above, the small reduction could probably come from the intermediate disk writing task after each map phase. So it seems that this process step provokes high latencies. However, the result shows that the amount of data that has to be processed correlates with the variance of the measurement values.

Figure 7.22: Boxplot AVRO with MapReduce

Figure 7.23: Standard deviation AVRO timings - adjusted

A possible explanation can be derived from the architecture of the system, as well as from the architecture of the algorithms. In chapter 2 (section 2.5.2 on page 8), it was explained that HDFS saves the data files in blocks. The bigger a data file is, the more blocks have to be created. Besides that, it cannot be guaranteed that all blocks of the same data file are located on the same physical node (figure 7.25). Additionally, since many users and applications perform additional tasks on the system, the resource utilization per node can differ.

Figure 7.24: Boxplot AVRO with TEZ

The processing algorithms, MapReduce and the optimized version TEZ, are based on parallel execution, where each data file, or more specifically each block that has to be processed, is divided into chunks. Each chunk has a specified size. Thus, the bigger the file, the more chunks have to be created. After a chunk is processed during the map phase, each parallel task has to wait until all the other tasks are finished. Then the intermediate results can be processed during the reduce phase (figure 7.25). Note: in some cases, TEZ does not have to wait for all intermediate results, which is why it is faster.

This leads to the assumption that the probability of higher execution times increases with the growth of the file size. The results of the second test run (see figures 7.22 and 7.24) support this assumption.
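This reasoning can be illustrated with a small simulation: one map wave finishes only when its slowest chunk is done, so with a heavy-tailed task time distribution both the expected wave time and its spread grow with the number of chunks. The lognormal task times below are purely hypothetical and are not derived from the measured data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def wave_times(n_chunks, n_runs=2000):
    """Simulate the completion time of one map wave: every chunk needs a random
    amount of time and the reduce phase has to wait for the slowest chunk."""
    chunk_times = rng.lognormal(mean=2.0, sigma=0.8, size=(n_runs, n_chunks))
    return chunk_times.max(axis=1)

# More chunks roughly correspond to a larger input file (fixed chunk size).
for n_chunks in (8, 64, 512):
    waves = wave_times(n_chunks)
    print(f"{n_chunks:4d} chunks: mean {waves.mean():6.1f}s, std {waves.std():5.1f}s")
```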

Figure 7.25: Processing of data files

Therefore, it should also be taken into account how large a referenced Hive table can possibly become before the performance gets unreasonably poor. Of course, such considerations are also use case dependent, because if there is no need to process the data within a certain time, the table size does not matter either.

Chapter 8 Discussion

Based on the data of the conducted performance study, the best combination for the dashboard use case on the given system is an ORC file format with a ZLIB compression codec (see figure 7.19 on page 68). This is due to the overall good processing times and the small amount of space needed for the data, which also makes it cost saving. However, in some cases the Parquet file format, especially with GZIP compression, was faster than ORC (see figure 7.5 on page 58). This opens the possibility for new configurations based on different use cases, depicted in the morphological box (see figure 8.1), especially if time is the critical factor.

Figure 8.1: Morphologic box: Use case examples

Let us take the dashboard application example again (see chapter 2, section 2.6 on page 14). If the velocity in loading new data onto the data lake were less than 15 minutes, and there were a need to process the data very quickly, a combination based on PARQUET_GZIP would be the better option. At a later time, the data stored in the Parquet format could be transformed and merged into ORC_ZLIB, which saves more disk space.

With TEZ as the processing engine, the best execution times could be achieved, regardless of which combination of file format and compression codec was used. In each test discipline, the MapReduce algorithm achieved timings that were significantly higher. Because of that, TEZ should currently be the first choice in Hive to process data on the Hadoop cluster.

Figure 8.2 shows a possible trend of the execution time with TEZ in use. The time values are extrapolated, hence it is just a possible prediction. Based on the 15-minute time frame from the dashboard example, it shows how big an input data file in the data lake can grow such that it can still be processed and shifted into the reservoir before the next data set arrives on the edge node.

Figure 8.2: Trend: TEZ execution time dependent on data volume

Most test systems would still be able to handle a data volume of around 312GB within the 15 minutes, while Avro would already start to struggle. With an input file size of around 450GB, the preferred system ORC_ZLIB would still be able to execute the process within 5 minutes, and Parquet_GZIP even in less than 4 minutes. Thus, with those systems, a good usability for the Qlik Sense dashboard would still be given.
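Such a trend line could, for example, be produced with a simple linear least-squares extrapolation as sketched below; the measured points are hypothetical placeholders and do not reproduce the values behind figure 8.2.

```python
import numpy as np

# Hypothetical measured points: input size in GB vs. execution time in seconds
# for one test system; the real trend in figure 8.2 is based on the study's data.
size_gb = np.array([50.0, 100.0, 200.0, 300.0])
time_s  = np.array([35.0, 68.0, 132.0, 198.0])

# Fit a first-degree polynomial (linear trend) and extrapolate to larger inputs.
slope, intercept = np.polyfit(size_gb, time_s, deg=1)

for gb in (312, 450):
    predicted = slope * gb + intercept
    print(f"{gb} GB -> ~{predicted / 60:.1f} minutes (linear extrapolation)")
```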

But as already known, the file size of each system varies (see figure 6.2 on page 50). The trend can be seen in figure 8.3, which shows the disk usage each test system would need to store the amount of data that has to be processed. It expresses how well the ORC systems handle the file size. In fact, an input file size of around 450GB results in less than 30GB with ORC_ZLIB. This also indicates that this system is very well suited for storing cold data. The file size of the system Parquet_GZIP, which could be the choice for hot data, is around 100GB. The AVRO systems have an overall low file reduction (also see table 6.3 on page 51), so they are not recommended for long-term purposes.

Figure 8.3: Trend: Reduced file size

However, it should be kept in mind that Apache Hadoop is an open source project with a huge active community that still develops new applications and optimizes existing solutions. For example, with LLAP (Low Latency Analytical Processing), an optimized version of TEZ is available that improves query processing and the communication with the data nodes [90]. At the beginning of this work, it was in an experimental stage in HDP 2.5 (Hortonworks Data Platform), which was installed on the provided system [93]. That is why LLAP was not usable yet, but with the introduction of HDP 2.6, released in April 2017, LLAP is officially available [94].

As an alternative to MapReduce, TEZ and LLAP, Spark on Hive is another option (see appendix, section 10.1.1 on page 78). Originally developed by Cloudera, a competitor of Hortonworks, it is not applicable by default, but with configuration changes in Hive and YARN [41], it should also be executable, because Spark is already included in HDP 2.5. It can be used with Apache Zeppelin to create and run SQL queries that are processed by the algorithm to drive data analysis [26]. It is proposed that Spark is strong in executing reducer tasks, hence it would be a good opportunity to compare the performance of this algorithm against TEZ using the test case "Math".

A lesson from this study is that there are many possibilities to optimize the Hadoop system. It could be ascertained that a correlation between the file size and the variance of the measured time values exists. This leads to the assumption that with a certain size of the data file, the probability of long latencies increases significantly (see chapter 7 on page 55).

A further study, based on varying the file size with the current system settings, could verify this assumption. Additionally, a predictive model could indicate the maximum file size for which the variance would still be acceptable.

Besides that, the compression codecs used in this study were set to their default configuration, but it is possible to change this setting as well [20]. Thus, it would be interesting to adjust the degree of compression for each codec to evaluate with which codec a better balance between data compression and speed can be reached. Finally, many opportunities should be given by changing settings in YARN. Since there were too many restrictions to make any configurations in the resource management system during this study, it would be a logical step to explore the opportunities of YARN. Cloudera offers a configuration spreadsheet to set a lot of parameters in YARN, hence a first indication is given [16].

Chapter 9 Conclusion

In conclusion, this thesis shows a possibility of how to model the dataflow of a NoSQL system, in particular the interaction of different applications in a Hadoop system distributed by Hortonworks as HDP 2.5. The models are based on three use cases, such as loading data into HDFS or processing data with MapReduce. While examining the existing dataflow approaches, it turned out that the given syntax does not have enough symbols to express the interactions in detail. Because of that, a new syntax was created, partly inspired by the Business Process Model and Notation 2.0. With the new symbols, it is possible to distinguish between different regions or physical environments on which different applications are executed. Processing tasks as well as merge and split symbols are available to denote where the data is actually processed and whether that processing was done sequentially or in parallel, which is an important characteristic of a distributed system. The dataflow models are already in use at the company to explain the complexity of the system to other departments. With their help, it can be shown how data in a Hadoop system is processed and which applications are involved.

With this information, further analyses regarding the performance are possible. Due to the architecture and the interaction of different applications, there is a wide range of opportunities to configure the system for better performance. The performance study conducted in this work included the elements file format, compression codec and execution engine. The usage of real data from the factories, as well as the acquired use cases from the dashboard, makes it possible to obtain results that are very close to the daily work. The result of this study shows that TEZ is highly recommended as the execution engine in Hive. Related to the dashboard use case, TEZ in combination with an ORC file format and a ZLIB compression codec leads to the best result regarding time and storage usage. The company is currently using a configuration of an ORC file format with a Snappy compression codec. If they chose the proposed configuration ORC_ZLIB on the current system, a performance increase of up to 30% could be achieved: the processing is up to 6% faster and the file volume can be decreased by up to 27%. But it also shows that the combination of certain adjustments depends on the use case itself. Regarding the differentiation between cold and hot data, other combinations could be more reasonable.

An additional finding of this study is the correlation between the variance of the execution time and the file size. Although in this case the correlation with TEZ was stronger, it could enable the creation of a predictive model that indicates the maximum file size of a Hive table. Finally, it is worthwhile to conduct further investigations, because there are many configurations that could not be evaluated on the operating system. A good example are the configurations of the resource management tool YARN. It is also possible to change the degree of compression for every compression codec. Besides that, there is huge support from the Apache community, which results in the development of new applications and algorithms. With the introduction of HDP 2.6, LLAP is available, and with Spark, there is still the possibility to enable Spark on Hive. Thus, tracking performance is a process that should be repeated permanently.

Chapter 10 Appendix

10.1 Additional Hadoop applications

This section describes additional applications which are also available in the Hortonworks HDP 2.5 package, but are not relevant for the thesis.

10.1.1 Apache Spark

Apache Spark, based on the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" published in 2011 [64], is another cluster computing framework besides MapReduce and TEZ. Resilient Distributed Datasets (RDDs) are a memory abstraction for distributed systems that makes it possible to compute tasks in-memory without the need to store data temporarily on hard drives, while keeping the fault tolerance feature. Contrary to MapReduce, this algorithm tries to avoid moving intermediate results into persistent storage. Apache Spark is proposed to be very efficient in performing deep learning algorithms, because of its iterative approach, and in data mining [64][85]. Apache Spark is also available for Hive, called Hive on Spark [41]. Because of the Spark SQL feature, it is easy to translate Hive queries. Regarding the performance on Hive, it is proposed that Spark could be faster than TEZ if the execution of certain queries includes a lot of reducer tasks [41].

10.1.2 Apache HBASE

Apache HBASE is another NoSQL database in the Hadoop ecosystem, based on wide-column table stores, which uses the data model from Google's Bigtable ("Bigtable: A Distributed Storage System for Structured Data" by Chang et al.) [38]. For persistent storage, HBASE uses HDFS. The logical data model of these tables consists of the column attribute Row Key, which contains a unique identifier per row and is sorted lexicographically. The next attribute in the model is a timestamp for versioning purposes. After the timestamp, multiple column families can be appended that contain the actual data [87]. This results in a big table with thousands of columns. In the physical data model, however, each column family is stored separately, thus each table also consists of a row key, a timestamp and one column family. A dataset of a column family consists of the column family name, a column qualifier and the value itself [38][87]. HBASE offers real-time read and write access to the data. This is possible because of the use of HFiles. An HFile basically consists of an index file and a data file that holds the data [7]. If a new data set is stored with HBASE, a new key/value pair is appended at the end of the last data entry. After N pairs, the keys of these pairs are written into the index file. If a certain data set is looked for, a simple index scan on the index file is necessary, which points to the correct block within the data file [7]. It is not possible to remove or modify a dataset within an HFile; only appending new datasets is possible [7]. The HBASE architecture consists of an HMaster and HRegionServers. The main purpose of the HMaster is to monitor the RegionServers and to save the metadata like tables, column families or regions. A RegionServer is responsible for managing its data (regions) and all occurring operations like read/write or split and merge of tables [87], hence the concept of HBASE is similar to that of HDFS. If a RegionServer wants to store an HFile permanently in HDFS, it

communicates with the name node first and pushes the file onto the data node, which is usually on the same physical node as the RegionServer itself [66].

10.1.3 Apache Kafka

Apache Kafka is an open source project and can be used as a real-time streaming pipeline to store data like log files. It is used in distributed systems, thus it is also horizontally scalable and runs on dedicated nodes within the Hadoop environment. Kafka has four application programming interfaces (APIs). The Producer API publishes streams for other Kafka instances. The Consumer API enables an application to register to one or more topics in order to process those streams; here, a topic means a category, as Kafka stores the records of a stream archived into different categories. The Stream API enables other applications to act like stream processors, whereby the stored records in topics are the actual input stream; it also delivers an output stream. Kafka additionally provides the Connector API, which connects a topic to third party applications. For example, Kafka can be connected to a relational database and log every change that happens in one of its tables [56].

10.1.4 KNOX

If a third party application like the data representation dashboard Qlik wants to query data from the Hadoop environment, the communication between both can, besides an ODBC interface, also be established via a web service. The web service uses a REST API, which runs on a KNOX Gateway. REST is the abbreviation for "representational state transfer" and provides an interface over the internet to enable the interaction between two systems. The data transfer is done, for example, via XML, JSON or HTML [82]. Knox is called a stateless reverse proxy framework [43] and can be organized as a KNOX cluster, running on different KNOX nodes, to route requests to several applications within the Hadoop environment via REST/HTTP calls [43]. In order to provide a secure communication between external applications and the Hadoop environment, KNOX uses several security features like LDAP authentication, federation/SSO or Kerberos, which is a client/server authentication network protocol [59]. By the use of KNOX, it is possible to establish direct communication to Ambari, WebHDFS (HDFS), Templeton (HCatalog), Stargate (HBase), Oozie, Hive/JDBC, Yarn RM and Storm [59].

10.1.5 Apache ZooKeeper

Apache ZooKeeper is a software project whose major purpose is to provide a distributed configuration manager for other applications. It typically runs on different nodes within a Hadoop cluster, for example on a master node or an edge node [81].

10.2 Additional file formats

This section gives a brief overview of additional file formats that can be used to store data in HDFS.

10.2.1 Sequence File

A sequence file belongs to the category of flat files, which means it contains unstructured data. Sequence files store data as binary key/value pairs and they are often used in NoSQL

systems for MapReduce processing, because they are easily splittable, which is essential in MapReduce, since the input data has to be split into chunks for parallel processing. A sequence file can be created in three different formats: first, as uncompressed key/value pairs; second, as record-compressed key/value pairs, whereby only the values are compressed; and as the last variant, as block-compressed key/value pairs, in which the key/value pairs are stored and compressed separately in blocks. The block size is also configurable [111]. Additionally, a sequence file keeps a header in which metadata is stored, such as the compression codec or the key/value class names [32].

10.3 Additional compression codecs

This section gives a brief overview of additional compression codecs that can be used with the available file formats in HDFS.

10.3.1 LZO

LZO is an abbreviation for Lempel-Ziv-Oberhumer, the names of the inventors of the algorithm. LZO is a lossless compression algorithm with a high decompression performance, which is achieved by focusing on speed instead of compression ratio [74]. LZO is a variation of the LZ77 approach. An implementation of LZO can be found in LZOP [46]. The LZO compression can be selected in Hadoop to increase the speed of Hive queries [73].

10.3.2 LZ4

LZ4 is a lossless compression algorithm based on LZ77 whose major purpose is high speed compression and decompression. LZ4 uses a fixed byte-oriented encoding and represents the compressed data as a stream of blocks with a fixed block size [17].

10.3.3 Bzip2

The compression algorithm Bzip2 is a lossless compression algorithm created by Julian Seward and is freely licensed. The algorithm consists of three steps. First, the input stream is permutated by a Burrows-Wheeler transformation, which is responsible for a better compression rate, followed by a Move-to-Front transformation. The Burrows-Wheeler transformation creates an alphabetically sorted permutation of the input data. Thus, the output has the same size, with an additional index to re-transform the permutation; the index is the line number of the original input value in the sorted permutation list. The Move-to-Front transformation moves all frequently reoccurring symbols to the beginning of the alphabet. This leads to small binary code assignments for these symbols during the actual compression. The result is compressed by Huffman coding, which maps the input symbols to a variable-length binary code [91].
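For completeness, a minimal example of using Bzip2 from Python's standard library; the compression level (1-9) is the adjustable parameter of the codec mentioned in chapter 8 regarding the degree of compression.

```python
import bz2

data = b"Hadoop stores large amounts of semi-structured sensor data. " * 1000

# compresslevel ranges from 1 (fastest) to 9 (best compression, the default).
compressed = bz2.compress(data, compresslevel=9)
restored = bz2.decompress(compressed)

assert restored == data
print(f"{len(data)} bytes -> {len(compressed)} bytes")
```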

Bibliography

[1] Serge Abiteboul. Querying semi-structured data. http://ilpubs.stanford.edu:8090/144/ 1/1996-19.pdf.

[2] Mark Adler. Compression algorithm. http://www.gzip.org/algorithm.txt, 2003.

[3] Shaun Ahmadian. A good data lake starts with a good blueprint. https://www.arcadiadata. com/blog/a-good-data-lake-starts-with-a-good-blueprint, 2016.

[4] Prof. Dr. Jens Albrecht. Processing big data with sql on hadoop. http://www.sigs.de/ download/tdwi_2015_muc/files/t4a-2_Albrecht.pdf.

[5] Jeff Atwood. Codinghorror. https://blog.codinghorror.com/ youre-reading-the-worlds-most-dangerous-programming-blog, 2008.

[6] David Beaumont. How to explain vertical and horizontal scaling in the cloud. https://www.ibm.com/blogs/cloud-computing/2014/04/ explain-vertical-horizontal-scaling-cloud/, 2014.

[7] Matteo Bertozzi. Apache hbase i/o – hfile. http://blog.cloudera.com/blog/2012/06/ hbase-io-hfile-input-output/, 2012.

[8] Dr. Vinodkumar L. Desai Bhavin J. Mathiya. Hadoop map reduce perfor- mance evaluation and improvement using compression algorithms on single cluster. http://www.academia.edu/9327238Hadoop_Map_Reduce_Performance_Evaluation_ and_Improvement_Using_Compression_Algorithms_on_Single_Cluster, 2014.

[9] Siddharth Sethh Bikas Sahah, Hitesh Shahh. Apache tez: A unifying framework for modeling and building data processing applications.

[10] T. Bray. The javascript object notation (json) data interchange format. https://tools. ietf.org/pdf/rfc7159.pdf, 2014.

[11] Anthony Catalano. How gzip compression works. http://blog.servergrove.com/2014/ 04/14/gzip-compression-works/, 2014.

[12] Konstantin Shvachko; Hairong Kuang; Sanjay Radia; Robert Chan. The hadoop distributed file system. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5496972, 2010.

[13] Harsh Chouraria. Mr2 and yarn briefly explained. http://blog.cloudera.com/blog/2012/ 10/mr2-and-yarn-briefly-explained/, 2012.

[14] Cloudera. Flume user guide. http://archive.cloudera.com/cdh/3/flume/UserGuide/ index.html, 2013.

[15] Cloudera. Introduction to yarn and mapreduce 2. https://www.slideshare.net/cloudera/ introduction-to-yarn-and-mapreduce-2, 2013.

[16] Cloudera. Tuning yarn. https://www.cloudera.com/documentation/enterprise/5-4-x/ topics/cdh_ig_yarn_tuning.html, 2017.

[17] Yann Collet. Lz4 block format description. https://lz4.github.io/lz4_Block_format. html, 2015.

[18] Shaun Connolly. Open source communities: The developer kingdom. https://de. hortonworks.com/blog/open-source-communities-developer-kingdom/.

[19] Lei Shi Dawei Jiang, Beng Chin Ooi. The performance of mapreduce: An in-depth study. http://www.vldb.org/pvldb/vldb2010/papers/E03.pdf, 2011.

[20] Dirk deRoos. Compressing data in hadoop. http://www.dummies.com/programming/big-data/hadoop/compressing-data-in-hadoop/.

[21] Dr. Edgar Dietrich. Measurement system capability" reference manual. http://www.q-das. com/uploads/tx_sbdownloader/Leitfaden_v21_me.pdf, 2002.

[22] Easysoft. What is odbc? http://hortonworks.com/wp-content/uploads/2015/10/ Hortonworks-Hive-ODBC-Driver-User-Guide.pdf.

[23] Aladdin Enterprises. Deflate compressed data format specification version 1.3. https://www. ietf.org/rfc/rfc1951.txt, 1996.

[24] Apache Foundation. Apache avro 1.8.1 documentation. https://avro.apache.org/docs/ current/\#schemas, 2016.

[25] Apache Software Foundation. Tez introduction. http://tez.apache.org/index.html, 2016.

[26] Apache Software Foundation. Apache zeppelin. https://zeppelin.apache.org/, 2017.

[27] Jim Frost. Understanding hypothesis tests: Significance levels (al- pha) and p values in statistics. http://blog.minitab.com/blog/ adventures-in-statistics-2/understanding-hypothesis-tests\ %3A-significance-levels-alpha-and-p-values-in-statistics, 2015.

[28] Jeffrey Dean; Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. http://dl.acm.org/citation.cfm?id=1327492, 2008.

[29] Leonard D. Shapiro Goetz Graefe. Data compression and database performance. https://cs. brown.edu/courses/cs227/archives/2008/Papers/Compression/GraefeShapiro.pdf, 1991.

[30] Michael Gottwald. Was ist big data? http://www.softselect.de/wissenspool/big-data, 2016.

[31] Mark Adler Greg Roelofs. A massively spiffy yet delicately unobtrusive compression library. http://www.zlib.net/, 2017.

[32] Jonathan Seidman Gwen Shapira. Hadoop application architectures. https: //www.safaribooksonline.com/library/view/hadoop-application-architectures/ 9781491910313/ch01.html, 2017.

[33] Apache Hadoop. What is apache hadoop. http://hadoop.apache.org/.

[34] Apache Hadoop. Apache hadoop yarn. https://hadoop.apache.org/docs/r2.7.2/ hadoop-yarn/hadoop-yarn-site/YARN.html, 2016.

[35] Apache Hadoop. Hdfs high availability. https://hadoop.apache.org/docs/r2.7.2/ hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html\#HDFS_ High_Availability, 2016.

[36] Apache Hadoop. Mapreduce tutorial. https://hadoop.apache.org/docs/r2.8.0/ hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial. html, 2017.

[37] Virtual Hadoop. Virtual hadoop. https://wiki.apache.org/hadoop/Virtual\%20Hadoop, 2013.

[38] Apache HBase. Data model. http://hbase.apache.org/book.html\#datamodel, 2017.

[39] Dominique Heger. Hadoop performance tuning - a pragmatic & iterative approach. https://en.wikipedia.org/wiki/Apache_Hadoop, 2013.

[40] Dr. Georg Herzwurm. Was ist qfd? http://old.qfd-id.de/wasistqfd/index.html, 2008.

[41] Szehon Ho. Hive on spark: Getting started. https://cwiki.apache.org/confluence/ display/Hive/Hive+on+Spark, 2017.

[42] Steve Hoffman. Apache flume: Distributed log collection for hadoop - second edition. http://javaarm.com/file/apache/flume/book/Apache.Flume-Distributed. Log.Collection.for.Hadoop_2nd.Edition_2015.pdf, 2015.

[43] Hortonworks. Apache knox gateway. https://de.hortonworks.com/apache/ knox-gateway/.

[44] Hortonworks. Hortonworks hive odbc driver with sql connector. http://hortonworks.com/ wp-content/uploads/2015/10/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf, 2015.

[45] Hortonworks. Moving data from hdfs to hive using an external table. https: //docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/ content/moving_data_from_hdfs_to_hive_external_table_method.html, 2015.

[46] Hortonworks. Configuring lzo compression. https://docs.hortonworks.com/ HDPDocuments/HDP2/HDP-2.4.2-Win/bk_HDP_Install_Win/content/LZOCompression. html, 2016.

[47] Hortonworks. How flume works. https://de.hortonworks.com/apache/flume/#section_ 2, 2017.

[48] Elsevier Inc. Virtual machines and virtualization of clusters and data centers. http://cdn. ttgtmedia.com/rms/pdf/Chapter_3.pdf, 2012.

[49] W3C Information and knowledge. Extensible markup language. https://www.w3.org/XML/, 2016.

[50] Ecma Internationa. The json data interchange format. https://www.ecma-international. org/publications/files/ECMA-ST/ECMA-404.pdf, 2013.

[51] Alan L. Cox Jeffrey Shafer, Scott Rixner. The hadoop distributed filesystem: Balanc- ing portability and performance. http://ieeexplore.ieee.org/document/5452045/\# full-text-section, 2010.

[52] Adam Barker Jonathan Stuart Ward. Undefined by data: A survey of big data definitions. https://arxiv.org/pdf/1309.5821.pdf, 2013.

[53] Shrinivas B. Joshi. Apache hadoop performance-tuning methodologies and best practices. https://en.wikipedia.org/wiki/Apache_Hadoop, 2012.

[54] json.org. Introducing json. http://www.json.org/json-de.html.

[55] Hiroyuki Makino Jungkyu Han, Masakuni Ishii. A hadoop performance model for multi-rack clusters. http://ieeexplore.ieee.org/abstract/document/6588791/, 2013.

[56] Apache Kafka. Apache kafka is a distributed streaming platform. https://kafka.apache. org/intro.html, 2016.

[57] Prof. Rick Kazman. Hadoop distributed file system (hdfs) architectural documentation. http: //itm-vm.shidler.hawaii.edu/HDFS/ArchDocCommunication.html.

[58] P. Khusumanegara. The effect analysis of block size in hdfs againts mapreduce speed process. http://www.academia.edu/14636656/The_Effect_Analysis_of_Block_Size_in_HDFS_Againts_MapReduce_Speed_Process.

[59] Apache Knox. Rest api gateway for the apache hadoop ecosystem. https://knox.apache. org/, 2017.

[60] Ramon Lawrence. The space efficiency of xml. http://www.sciencedirect.com/science/ article/pii/S095058490400028X, 2003.

[61] Lefty Leverenz. Apache hive. https://cwiki.apache.org/confluence/display/Hive/ Home, 2017.

[62] Jason Long. Snappy - a fast compressor/decompressor. https://google.github.io/ snappy/, 2015.

[63] Marcello La Rosa Marlon Dumas. Fundamentals of business process management. http: //link.springer.com/book/10.1007/978-3-642-33143-5, 2013.

[64] Tathagata Das Matei Zaharia, Mosharaf Chowdhury. Resilient distributed datasets: A fault- tolerant abstraction for in-memory cluster computing. https://pdfs.semanticscholar. org/abbc/77015f2b65a12b0485e3aa4a79700bb834f9.pdf, 2011.

[65] Stephan Reimann Matthias Reiss. Das data lake konzept: Der schatz im datensee. https://www.it-daily.net/it-management/big-data-analytics/ 11222-das-data-lake-konzept-der-schatz-im-datensee, 2015.

[66] Carol McDonald. Hbase architectural components. http://cloudepr.blogspot.de/2009/ 09/hfile-block-indexed-file-format-to.html, 2015.

[67] Hsu-Chun Yen Meng-Hang Ho. A dictionary-based compressed pattern matching al- gorithm. https://web.archive.org/web/20030313175423/http://cc.ee.ntu.edu.tw/ ~yen/papers/compsac2002.pdf.

[68] Nico Litzel Michael Radtke. Was ist big data? http://www.bigdata-insider.de/ was-ist-big-data-a-562440, 2016.

[69] MoreSteam. Measurement system analysis (msa). https://www.moresteam.com/toolbox/ measurement-system-analysis.cfm.

[70] Vaibhav Kohli Mr. Yogesh Pingle. Big data processing using apache hadoop in cloud sys- tem. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ HDFSHighAvailabilityWithNFS.html\#HDFS_High_Availability, 2016.

[71] Arun Murthy. Apache hadoop yarn – background and an overview. https://de. hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/, 2012.

[72] Aaradhana Deshmukh Nandan Mirajkar, Sandeep Bhujbal. Perform wordcount map-reduce job in single node apache hadoop cluster and compress data using lempel-ziv-oberhumer (lzo) algorithm. https://arxiv.org/abs/1307.1517, 2013.

[73] Markus F.X.J. Oberhumer. Lzop. https://www.lzop.org, 2010.

[74] Markus F.X.J. Oberhumer. Key facts. http://www.oberhumer.com/opensource/lzo/, 2017.

[75] Apache Oozie. Apache oozie workflow scheduler for hadoop. http://oozie.apache.org/, 2017.

[76] Apache Orc. the smallest, fastest columnar storage for hadoop workloads. https://orc. apache.org/, 2017.

[77] Parquet. Documentation. https://parquet.apache.org/documentation/latest/, 2014.

[78] United States Patent. Patent no.: Us 7,730,486 b2. https://www.google.com/patents/ US7730486f.

[79] Rakesh Porwad. Hadoop fundamentals. http://hadooptutorials.co.in/tutorials/ hadoop/hadoop-fundamentals.html, 2015.

[80] Qlik. Qlik sense self-service-visualisierung. http://www.qlik.com/de-de/products/ qlik-sense, 2017.

[81] Benjamin Reed. Zookeeper overview. https://cwiki.apache.org/confluence/display/ ZOOKEEPER/ProjectDescription, 2012.

[82] Alex Rodriguez. Transfer xml, json, or both. https://www.ibm.com/developerworks/ library/ws-restful/index.html, 2008.

[83] Jeremy Ronk. Structured, semi structured and unstruc- tured data. https://jeremyronk.wordpress.com/2014/09/01/ structured-semi-structured-and-unstructured-data/, 2017.

[84] Puneet Jai Kaur Rupinder Singh. Analyzing performance of apache tez and mapreduce with hadoop multinode cluster on amazon cloud. https://journalofbigdata.springeropen. com/articles/10.1186/s40537-016-0051-6, 2016.

[85] Rohan Arora Satish Gopalani. Comparing apache spark and map reduce with performance analysis using k-means. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1. 1.695.1639\&rep=rep1\&type=pdf, 2015.

[86] Christie Schneider. The biggest data challenges that you might not even know you have. https://www.ibm.com/blogs/watson/2016/05/ biggest-data-challenges-might-not-even-know/, 2016.

[87] Prof. Dr. Holger Schwarz. Advanced information management: Document wide graph.

[88] Prof. Dr. Holger Schwarz. Advanced information management: Key-value stores and key-value processing.

[89] SearchOracle. Open database connectivity (odbc). http://searchoracle.techtarget. com/definition/Open-Database-Connectivity, 2005.

[90] Andrew Sears. Llap. https://cwiki.apache.org/confluence/display/Hive/LLAP, 2017.

[91] Julian Seward. bzip2 and libbzip2, a program and library for data compression. http://www. bzip.org/1.0.5/bzip2-manual-1.0.5.pdf, 2007.

[92] Y. Shafranovich. Common format and mime type for comma-separated values (csv) files. https://tools.ietf.org/html/rfc4180, 2005.

[93] Carter Shanklin. Hive llap technical preview enables sub-second sql on hadoop and more. https://de.hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/, 2016.

[94] Carter Shanklin. Top 5 performance boosters with apache hive llap. https://de. hortonworks.com/blog/top-5-performance-boosters-with-apache-hive-llap/, 2017.

[95] Andrew Newton Simon Austin, Andrew Baldwin. A data flow model to plan and manage the building design process. http://www.tandfonline.com/doi/pdf/10.1080/ 09544829608907924?needAccess=true, 1996.

[96] sixsigmablackbelt. Msa messsystemanalyse messmittelfähigkeit. https://www.sixsigmablackbelt.de/msa-messsystemanalyse-messmittelfaehigkeit/#Messsystemanalyse%20Verfahren%201, 2016.

[97] Sherpa Software. Structured and unstructured data: What is that? http://sherpasoftware. com/blog/structured-and-unstructured-data-what-is-it/, 2017.

[98] Srinivas. Directed acyclic graph (dag) – strength of spark and tez , dag vs mr. http://www.srinivastata.com/tez/ directed-acyclic-graph-dag-strength-of-spark-and-tez/, 2015.

[99] Leard statistics. Measures of central tendency. https://statistics.laerd.com/ statistical-guides/measures-central-tendency-mean-mode-median.php, 2013.

[100] Carlo Strozzi. Nosql: A non-sql rdbms. http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_ US/NoSQL/.

[101] SurveyMethods. When is it generally better to use me- dian over mean? https://surveymethods.com/blog/ when-is-it-generally-better-to-use-median-over-mean/, 2011.

[102] Apache NiFi Team. Nifi system administrators guide. https://nifi.apache.org/docs/ nifi-docs/html/administration-guide.html, 2015.

[103] Nifi Team. What is apache nifi? https://nifi.apache.org/docs.html, 2015.

[104] TechTarget. Hadoop data lake. http://www.searchenterprisesoftware.de/definition/ Hadoop-Data-Lake, 2015.

[105] Nico Litzel Thomas Joos. So machen sie hadoop beine mit apache tez. http://www. bigdata-insider.de/so-machen-sie-hadoop-beine-mit-apache-tez-a-498791/, 2015.

[106] Unkown. Nosql: Your ultimate guide to the non-relational universe! http:// nosql-database.org, 2011.

[107] Vinod Kumar Vavilapalli. Apache hadoop yarn – resourcemanager. https://de. hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/, 2012.

[108] Paul T. Ward. The transformation schema: An extension of the data flow diagram to rep- resent control and timing. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber= 6312936, 1986.

[109] Tim White. The small files problem. http://blog.cloudera.com/blog/2009/02/ the-small-files-problem/, 2009.

[110] Tom White. Snappy and hadoop. http://blog.cloudera.com/blog/2011/09/ snappy-and-hadoop//, 2011.

[111] Hadoop Wiki. Sequence file. https://wiki.apache.org/hadoop/SequenceFile, 2009.

[112] the free encyclopedia Wikipedia. Apache hadoop. https://en.wikipedia.org/wiki/ Apache_Hadoop, 2017.

[113] Chiung-Ying Van Yiton Van. Zlib -a numerical library for differential algebra. http: //inspirehep.net/record/305216/files/sscl-300.pdf, 1990.
