Efficient Stream Data Management
From Big Data to Fast Data: Efficient Stream Data Management
Alexandru Costan

To cite this version: Alexandru Costan. From Big Data to Fast Data: Efficient Stream Data Management. Distributed, Parallel, and Cluster Computing [cs.DC]. ENS Rennes, 2019. tel-02059437v2

HAL Id: tel-02059437
https://hal.archives-ouvertes.fr/tel-02059437v2
Submitted on 14 Mar 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

École doctorale MathSTIC
HABILITATION À DIRIGER DES RECHERCHES
Discipline: Computer Science
Presented at the École Normale Supérieure de Rennes, under the seal of the Université Bretagne Loire, by Alexandru Costan
Prepared at IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires)
Defended in Bruz on 14 March 2019, before a jury composed of:

Rosa Badia / reviewer: Research Director, Barcelona Supercomputing Center, Spain
Luc Bougé / examiner: University Professor, ENS Rennes, France
Valentin Cristea / examiner: University Professor, Université Politehnica de Bucarest, Romania
Christian Pérez / reviewer: Research Director, Inria, France
Michael Schöttner / reviewer: University Professor, Université de Düsseldorf, Germany
Patrick Valduriez / examiner: Research Director, Inria, France

Abstract

This manuscript provides a synthetic overview of my research journey since my PhD defense.
The document does not claim to present my work in its entirety, but focuses on the contributions to data management in support of stream processing. These results address all stages of the stream processing pipeline: data collection and in-transit processing at the edge, transfer towards the cloud processing sites, ingestion and persistent storage.

I start by presenting the general context of stream data management in light of the recent transition from Big to Fast Data. After highlighting the challenges at the data level associated with batch and real-time analytics, I introduce a subjective overview of my proposals to address them. They bring solutions to the problems of in-transit stream storage and processing, fast data transfers, distributed metadata management, dynamic ingestion and transactional storage. The integration of these solutions into functional prototypes and the results of the large-scale experimental evaluations on clusters, clouds and supercomputers demonstrate their effectiveness for several real-life applications ranging from neuro-science to LHC nuclear physics. Finally, these contributions are put into the perspective of the High Performance Computing - Big Data convergence.

Keywords: Big Data; stream processing; storage; data management; data analytics; transactions; data transfers; metadata management; in-transit processing; workflow management; HPC.

Contents

Foreword

1 Introduction
  1.1 The need for real-time processing
    1.1.1 Motivating use-case: autonomous cars
    1.1.2 Solution: stream computing in real-time
  1.2 The challenge of data management for streams
  1.3 Mission statement
  1.4 Objectives

Part I - Context: Stream Processing in the Clouds

2 Big Data Processing: Batch-based Analytics of Historical Data
  2.1 Batch processing with MapReduce: the execution model for Big Data
    2.1.1 MapReduce extensions
  2.2 Big Data processing frameworks
    2.2.1 From Hadoop to Yarn
    2.2.2 Workflow management systems
  2.3 Big Data management
    2.3.1 Data storage
    2.3.2 Data transfer
  2.4 Discussion: challenges

3 The World Beyond Batch: Streaming Real-Time Fast Data
  3.1 Stream computing
    3.1.1 Unbounded streaming vs. bounded batch
    3.1.2 Windowing
    3.1.3 State management
    3.1.4 Correctness
  3.2 Fast Data processing frameworks
    3.2.1 Micro-batching with Apache Spark
    3.2.2 True streaming with Apache Flink
    3.2.3 Performance comparison of Spark and Flink
    3.2.4 Other frameworks
  3.3 Fast Data management
    3.3.1 Data ingestion
    3.3.2 Data storage
  3.4 Discussion: challenges

4 The Lambda Architecture: Unified Stream and Batch Processing
  4.1 Unified processing model
    4.1.1 The case for batch-processing
  4.2 Limitations of the Lambda architecture
    4.2.1 High complexity of two separate computing paths
    4.2.2 Lack of support for global transactions
  4.3 Research agenda

Part II - From Sensors to the Cloud: Stream Data Collection and Pre-processing

5 DataSteward: Using Dedicated Nodes for In-Transit Storage and Processing
  5.1 A storage service on dedicated compute nodes
    5.1.1 Design principles
    5.1.2 Architectural overview
    5.1.3 Zoom on the dedicated nodes selection in the cloud
  5.2 In-transit data processing
    5.2.1 Data services for scientific applications
  5.3 Evaluation and perspectives
    5.3.1 Data storage evaluation
    5.3.2 Gains of in-transit processing for scientific applications
    5.3.3 Going further

6 JetStream: Fast Stream Transfer
  6.1 Modelling the stream transfer in the context of clouds
    6.1.1 Zoom on the event delivery latency
    6.1.2 Multi-route streaming
  6.2 The JetStream transfer middleware
    6.2.1 Adaptive batching for stream transfers
    6.2.2 Architecture overview
  6.3 Experimental evaluation
    6.3.1 Individual vs. batch-based event transfers
    6.3.2 Adapting to context changes
    6.3.3 Benefits of multi-route streaming
    6.3.4 JetStream in support of a real-life LHC application
    6.3.5 Towards stream transfer "as a Service"

7 Small Files Metadata Support for Geo-Distributed Clouds
  7.1 Strategies for multi-site metadata management
    7.1.1 Centralized Metadata (Baseline)
    7.1.2 Replicated Metadata (on Each Site)
    7.1.3 Decentralized, Non-Replicated Metadata
    7.1.4 Decentralized Metadata with Local Replication
    7.1.5 Matching strategies to processing patterns
  7.2 One step further: managing workflow hot metadata
    7.2.1 Architecture
    7.2.2 Protocols for hot metadata
    7.2.3 Towards dynamic hot metadata
  7.3 Implementation and results
    7.3.1 Benefits of decentralized metadata management
    7.3.2 Separate handling of hot and cold metadata

Part III - Scalable Stream Ingestion and Storage

8 KerA: Scalable Data Ingestion for Stream Processing
  8.1 Impact of ingestion on stream processing
    8.1.1 Time domains
    8.1.2 Static vs. dynamic partitioning
    8.1.3 Record access
  8.2 KerA overview and architecture
    8.2.1 Models
    8.2.2 Favoring parallelism: consumer and producer protocols
    8.2.3 Architecture and implementation
    8.2.4 Fast crash recovery for low-latency continuous processing
  8.3 Experimental evaluation
    8.3.1 Setup and methodology
    8.3.2 Results
    8.3.3 Discussion

9 Týr: Transactional, Scalable Storage for Streams
  9.1 Blobs for stream storage
  9.2 Design principles and architecture
    9.2.1 Predictable data distribution
    9.2.2 Transparent multi-version concurrency control
    9.2.3 ACID transactional semantics
    9.2.4 Atomic transform operations
  9.3 Protocols and implementation
    9.3.1 Lightweight transaction protocol
    9.3.2 Handling reads: direct, multi-chunk and transactional protocols
    9.3.3 Handling writes: transactional protocol, atomic transforms
    9.3.4 Implementation details
  9.4 Real-time, transactional data aggregation in support of system monitoring
    9.4.1 Transactional read/write performance
    9.4.2 Horizontal scalability

Part IV - Perspectives

10 Stream Storage for HPC and Big Data Convergence
  10.1 HPC and BDA: divergent stacks, convergent storage needs
    10.1.1 Comparative overview of the HPC and BDA stacks
    10.1.2 HPC and BDA storage
    10.1.3 Challenges of storage convergence between HPC and BDA
  10.2 Blobs as a storage model for convergence
    10.2.1 General overview, intuition and methodology