Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems

Total Page:16

File Type:pdf, Size:1020Kb

Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Benjamin Heintz IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy Professor Abhishek Chandra December, 2016 c Benjamin Heintz 2016 ALL RIGHTS RESERVED Acknowledgements I would like to thank Chenyu Wang for developing and running the experiments in Chapter 3, and Ravali Kandur for her assistance deploying Apache Storm on PlanetLab. The work presented here was supported in part by NSF Grants CNS-0643505, CNS- 0519894, CNS-1413998, and IIS-0916425, by an IBM Faculty Award, and by a University of Minnesota Doctoral Dissertation Fellowship. About six years ago, I set my sights on an M.S. degree in Computer Science. As I was beginning to plan for what I thought would be a brief career as a graduate student, I met with Prof. Jaideep Srivastava to discuss possible summer research. I am incredibly thankful that, instead of talking about which summer projects I could work on, he encouraged me to aim higher and pursue a Ph.D. I owe a tremendous debt of gratitude to my collaborators, especially Prof. Jon Weiss- man, Prof. Ramesh K. Sitaraman, and my advisor, Prof. Abhishek Chandra. If I have developed even a small fraction of their ability to see the details without losing track of the big picture, to write well, and most of all to be persistent, then the last five years will have been worth it. I am particularly grateful for my advisor's endless patience and unshakable calm. Some of the most important results in my research have only come together after I have nearly given up on a deadline or called off a set of experiments. I am grateful that he never let me off the hook so easily. On a personal note, I cannot thank my parents Henry and Jayne Heintz enough for their support, especially through college and graduate school. I would not have access to a sliver of the opportunities that I do if it weren't for them. Finally, thanks to my wife Anne, for always being there for me, even when that means listening to me complain about how my experiments aren't working. i Dedication To my dad. ii Abstract Big Data touches every aspect of our lives, from the way we spend our free time to the way we make scientific discoveries. Netflix streamed more than 42 billion hours of video in 2015, and in the process recorded massive volumes of data to inform video recommendations and plan investments in new content. The CERN Large Hadron Collider produces enough data to fill more than one billion DVDs every week, and this data has led to the discovery of the Higgs boson particle. Such large scale computing is challenging because no one machine is capable of ingesting, storing, or processing all of the data. Instead, applications require distributed systems comprising many machines working in concert. Adding to the challenge, many data streams originate from geographically distributed sources. Scientific sensors such as LIGO span multiple sites and generate data too massive to process at any one location. The machines that analyze these data are also geo-distributed; for example Netflix and Facebook users span the globe, and so do the machines used to analyze their behavior. Many applications need to process geo-distributed data on geo-distributed systems with low latency. A key challenge in achieving this requirement is determining where to carry out the computation. For applications that process unbounded data streams, two performance metrics are critical: WAN traffic and staleness (i.e., delay in receiving results). To optimize these metrics, a system must determine when to communicate results from distributed resources to a central data warehouse. As an additional challenge, constrained WAN bandwidth often renders exact com- putation infeasible. Fortunately, many applications can tolerate inaccuracy, albeit with diverse preferences. To support diverse applications, systems must determine what par- tial results to communicate in order to achieve the desired staleness-error tradeoff. This thesis presents answers to these three questions|where to compute, when to communicate, and what partial results to communicate|in two contexts: batch com- puting, where the complete input data set is available prior to computation; and stream computing, where input data are continuously generated. We also explore the chal- lenges facing emerging programming models and execution engines that unify stream and batch computing. iii Contents Acknowledgements i Dedication ii Abstract iii List of Tables ix List of Figures x 1 Introduction 1 1.1 Challenges in Geo-Distributed Data-Intensive Computing . 2 1.1.1 Where to Place Computation . 3 1.1.2 When to Communicate Partial Results . 4 1.1.3 What Partial Results to Communicate . 5 1.2 Summary of Research Contributions . 5 1.2.1 Batch Processing . 5 1.2.2 Stream Processing . 7 1.2.3 Unified Stream & Batch Processing . 8 1.3 Outline . 9 2 Model-Driven Optimization of MapReduce Applications 10 2.1 Introduction . 10 2.1.1 The MapReduce Framework and an Optimization Example . 11 2.1.2 Research Contributions . 13 iv 2.2 Model and Optimization . 14 2.2.1 Model . 14 2.2.2 Model Validation . 20 2.2.3 Optimization Algorithm . 25 2.3 Model and Optimization Results . 27 2.3.1 Modeled Environments . 27 2.3.2 Heuristics vs. Model-Driven Optimization . 28 2.3.3 End-to-End vs. Myopic . 31 2.3.4 Single-Phase vs. Multi-Phase . 33 2.3.5 Barriers vs. Pipelining . 34 2.3.6 Compute Resource Distribution . 37 2.3.7 Input Data Distribution . 37 2.4 Comparison to Hadoop . 39 2.4.1 Experimental Setup . 40 2.4.2 Applications . 40 2.4.3 Experimental Results . 41 2.5 Related Work . 46 2.6 Concluding Remarks . 47 3 Cross-Phase Optimization in MapReduce 48 3.1 Introduction . 48 3.2 Map-Aware Push . 49 3.2.1 Overlapping Push and Map to Hide Latency . 50 3.2.2 Overlapping Push and Map to Improve Scheduling . 51 3.2.3 Map-Aware Push Scheduling . 51 3.2.4 Implementation in Hadoop . 52 3.2.5 Experimental Results . 53 3.3 Shuffle-Aware Map . 55 3.3.1 Shuffle-Aware Map Scheduling . 56 3.3.2 Implementation in Hadoop . 58 3.3.3 Experimental Results . 58 3.4 Putting it all Together . 60 v 3.4.1 Amazon EC2 . 60 3.4.2 PlanetLab . 62 3.5 Related Work . 63 3.6 Concluding Remarks . 65 4 Optimizing Timeliness and Cost in Geo-Distributed Streaming Ana- lytics 67 4.1 Introduction . 67 4.1.1 Challenges in Optimizing Windowed Grouped Aggregation . 69 4.1.2 Research Contributions . 70 4.2 Problem Formulation . 72 4.3 Dataset and Workload . 74 4.4 Minimizing WAN Traffic and Staleness . 76 4.5 Practical Online Algorithms . 80 4.5.1 Simulation Methodology . 81 4.5.2 Emulating the Eager Optimal Algorithm . 82 4.5.3 Emulating the Lazy Optimal Algorithm . 85 4.5.4 The Hybrid Algorithm . 88 4.6 Implementation . 90 4.6.1 Edge . 91 4.6.2 Center . 92 4.7 Experimental Evaluation . 93 4.7.1 Aggregation Using a Single Edge . 93 4.7.2 Scaling to Multiple Edges . 94 4.7.3 Effect of Laziness Parameter . 96 4.7.4 Impact of Network Capacity . 97 4.8 Discussion . 98 4.8.1 Compression . 98 4.8.2 Variable-Size Aggregates . 100 4.8.3 Applicability to Other Environments . 101 4.9 Related Work . 102 4.10 Concluding Remarks . 103 vi 5 Trading Timeliness and Accuracy in Geo-Distributed Streaming An- alytics 104 5.1 Introduction . 104 5.2 Problem Formulation . 106 5.2.1 Problem Statement . 106 5.2.2 The Staleness-Error Tradeoff . 107 5.3 Offline Algorithms . 109 5.3.1 Minimizing Staleness (Error-Bound) . 109 5.3.2 Minimizing Error (Staleness-Bound) . 110 5.3.3 High-Level Lessons . 115 5.4 Online Algorithms . 116 5.4.1 The Two-Part Cache Abstraction . 116 5.4.2 Error-Bound Algorithms . 118 5.4.3 Staleness-Bound Algorithms . 120 5.4.4 Implementation . 123 5.5 Evaluation . 123 5.5.1 Setup . 124 5.5.2 Comparison Algorithms . 124 5.5.3 Simulation Results: Minimizing Staleness (Error-Bound) . 126 5.5.4 Simulation Results: Minimizing Error (Staleness-Bound) . 128 5.5.5 Experimental Results . 129 5.6 Discussion . 138 5.6.1 Alternative Approximation Approaches . 138 5.6.2 Coordination Between Multiple Edges . 139 5.6.3 Incorporating WAN Latency . 140 5.7 Related Work . 140 5.8 Concluding Remarks . 141 6 Challenges and Opportunities in Unified Stream & Batch Analytics 142 6.1 Introduction . 142 6.2 Background . ..
Recommended publications
  • Big Data Stream Analysis: a Systematic Literature Review
    Kolajo et al. J Big Data (2019) 6:47 https://doi.org/10.1186/s40537-019-0210-7 SURVEY PAPER Open Access Big data stream analysis: a systematic literature review Taiwo Kolajo1,2* , Olawande Daramola3 and Ayodele Adebiyi1,4 *Correspondence: [email protected]; Abstract [email protected] Recently, big data streams have become ubiquitous due to the fact that a number of 1 Department of Computer and Information Sciences, applications generate a huge amount of data at a great velocity. This made it difcult Covenant University, Ota, for existing data mining tools, technologies, methods, and techniques to be applied Nigeria directly on big data streams due to the inherent dynamic characteristics of big data. In Full list of author information is available at the end of the this paper, a systematic review of big data streams analysis which employed a rigorous article and methodical approach to look at the trends of big data stream tools and technolo- gies as well as methods and techniques employed in analysing big data streams. It provides a global view of big data stream tools and technologies and its comparisons. Three major databases, Scopus, ScienceDirect and EBSCO, which indexes journals and conferences that are promoted by entities such as IEEE, ACM, SpringerLink, and Elsevier were explored as data sources. Out of the initial 2295 papers that resulted from the frst search string, 47 papers were found to be relevant to our research questions after implementing the inclusion and exclusion criteria. The study found that scalability, privacy and load balancing issues as well as empirical analysis of big data streams and technologies are still open for further research eforts.
    [Show full text]
  • Real-Time Analytics for Fast Evolving Social Graphs
    Real-time Analytics for Fast Evolving Social Graphs Charith Wickramaarachchi1, Alok Kumbhare1, Marc Frincu2, Charalampos Chelmis2, and Viktor K. Prasanna2 1Dept. of Computer Science, University of Southern California, Los Angeles, California 90089 2Dept. of Electrical Engineering, University of Southern California, Los Angeles, California 90089 Email: [email protected], [email protected], [email protected], [email protected], [email protected] Abstract—Existing Big Data streams coming from social second have been observed. In Facebook’s case we notice and other connected sensor networks exhibit intrinsic inter- an average of 41,000 posts per second or about 2.4Mb dependency enabling unique challenges to scalable graph ana- of data each second. Processing this huge amount of fast lytics. Data from these graphs is usually collected in different geographically located data servers making it suitable for streaming data to extract useful knowledge in real-time is distributed processing on clouds. While numerous solutions for challenging and requires besides efficient graph updates, large scale static graph analysis have been proposed, addressing scalable methods for performing incremental analytics in in real-time the dynamics of social interactions requires novel order to reduce the complexity of the data-driven algorithms. approaches that leverage incremental stream processing and Existing graph processing methods have focused either graph analytics on elastic clouds. We propose a scalable solution based on our stream pro- on large shared memory approaches where the graph is cessing engine, Floe, on top of which we perform real-time streamed and processed in-memory [6], or on batch pro- data processing and graph updates to enable low latency graph cessing techniques for distributed computing where periodic analytics on large evolving social networks.
    [Show full text]
  • Dzone-Guide-To-Big-Data.Pdf
    THE 2018 DZONE GUIDE TO Big Data STREAM PROCESSING, STATISTICS, & SCALABILITY VOLUME V BROUGHT TO YOU IN PARTNERSHIP WITH THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY Dear Reader, Table of Contents I first heard the term “Big Data” almost a decade ago. At that time, it Executive Summary looked like it was nothing new, and our databases would just be up- BY MATT WERNER_______________________________ 3 graded to handle some more data. No big deal. But soon, it became Key Research Findings clear that traditional databases were not designed to handle Big Data. BY G. RYAN SPAIN _______________________________ 4 The term “Big Data” has more dimensions than just “some more data.” It encompasses both structured and unstructured data, fast moving Take Big Data to the Next Level with Blockchain Networks BY ARJUNA CHALA ______________________________ 6 and historical data. Now, with these elements added to the data, some of the other problems such as data contextualization, data validity, Solving Data Integration at Stitch Fix noise, and abnormality in the data became more prominent. Since BY LIZ BENNETT _______________________________ 10 then, Big Data technologies has gone through several phases of devel- Checklist: Ten Tips for Ensuring Your Next Data Analytics opment and transformation, and they are gradually maturing. A term Project is a Success BY WOLF RUZICKA, ______________________________ that was considered as a fad and a technology ecosystem that was 13 considered a luxury are slowly establishing themselves as necessary Infographic: Big Data Realization with Sanitation ______ needs for today’s business activities. Big Data is the new competitive 14 advantage and it matters for our businesses.
    [Show full text]
  • Big Data Analytics Options on AWS
    Big Data Analytics Options on AWS December 2018 This paper has been archived. For the latest technical information, see https://docs.aws.amazon.com/whitepapers/latest/big-data- analytics-options/welcome.html Archived © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers. Archived Contents Introduction 5 The AWS Advantage in Big Data Analytics 5 Amazon Kinesis 7 AWS Lambda 11 Amazon EMR 14 AWS Glue 20 Amazon Machine Learning 22 Amazon DynamoDB 25 Amazon Redshift 29 Amazon Elasticsearch Service 33 Amazon QuickSight 37 Amazon EC2 40 Amazon Athena 42 Solving Big Data Problems on AWS 45 Example 1: Queries against an Amazon S3 Data Lake 47 Example 2: Capturing and Analyzing Sensor Data 49 Example 3: Sentiment Analysis of Social
    [Show full text]
  • Big Data Analytics Options on AWS AWS Whitepaper Big Data Analytics Options on AWS AWS Whitepaper
    Big Data Analytics Options on AWS AWS Whitepaper Big Data Analytics Options on AWS AWS Whitepaper Big Data Analytics Options on AWS: AWS Whitepaper Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon. Big Data Analytics Options on AWS AWS Whitepaper Table of Contents Abstract ............................................................................................................................................ 1 Introduction ...................................................................................................................................... 2 The AWS advantage in big data analytics .............................................................................................. 3 Amazon Kinesis .......................................................................................................................... 4 Ideal usage patterns ........................................................................................................... 5 Cost model ........................................................................................................................ 5 Performance .....................................................................................................................
    [Show full text]
  • S2CE: a Hybrid Cloud and Edge Orchestrator for Mining Exascale
    S2CE: A Hybrid Cloud and Edge Orchestrator for Mining Exascale Distributed Streams Nicolas Kourtellis Herodotos Herodotou Maciej Grzenda [email protected] [email protected] [email protected] Telefonica Research Cyprus University of Technology Warsaw University of Technology Barcelona, Spain Limassol, Cyprus Warsaw, Poland Piotr Wawrzyniak Albert Bifet [email protected] [email protected] Lodz University of Technology LTCI, Telecom Paris, IP-Paris Lodz, Poland Paris, France ABSTRACT 1 INTRODUCTION The explosive increase in volume, velocity, variety, and veracity In the future Internet era, with hundreds of billions of devices, of data generated by distributed and heterogeneous nodes such as principle factors dominating the continuous utility of the Internet IoT and other devices, continuously challenge the state of art in big will be: 1) the massive population of devices and their intelligent data processing platforms and mining techniques. Consequently, it agents, 2) the big, fast, and diverse data produced from them and reveals an urgent need to address the ever-growing gap between their users, and 3) the need for large-scale, adaptive infrastructures this expected exascale data generation and the extraction of insights to process and extract knowledge from these exascale data in or- from these data. To address this need, this paper proposes Stream der to help make critical, data-driven decisions. Intelligent agents to Cloud & Edge (S2CE), a first of its kind, optimized, multi-cloud already exist in different forms, and are well embedded in various and edge orchestrator, easily configurable, scalable, and extensible. ways in our everyday lives, either as passive data collectors, or S2CE will enable machine and deep learning over voluminous and active producers.
    [Show full text]
  • A Buyer's Guide to Streaming Data Integration for Google Cloud Platform
    A Buyer’s Guide to Streaming Data Integration For Google Cloud Platform A Buyer’s Guide to Streaming Data Integration For Google Cloud Platform Table of Contents CHAPTER 1: Setting the Stage for Digital Transformation . 2 The Digital Transformation Game-Changer.......................................................2 Why Streaming Data Integration is Necessary ....................................................4 What’s In This eBook . 4 CHAPTER 2: How SDI Fits Into A Modern Data Architecture......................................6 Cloud Equation ................................................................................7 The Change Data Capture Solution ..............................................................7 Designing for the Future .......................................................................8 CHAPTER 3: The Business Value of Streaming Data Integration.................................10 Game-Changing Technology . .11 Streaming’s Broad Business Impact ............................................................12 The Role of SDI ...............................................................................13 SDI in Practice . .14 SDI and the Cloud ............................................................................15 CHAPTER 4: 1 + 1 = 3: Streaming Data Integration and the Google Cloud Platform...............19 Multi-Cloud and Multi-Engine . .19 CHAPTER 5: Building the Business Case for SDI ................................................22 Start With the Low-Hanging Fruit ...............................................................22
    [Show full text]
  • Distributed Supervised Sentiment Analysis of Tweets
    Distributed Supervised Sentiment Analysis of Tweets: Integrating Machine Learning and Streaming Analytics for Big Data Challenges in Communication and Audience Research Análisis distribuido y supervisado de sentimientos en Twitter: Integrando aprendizaje automático y analítica en tiempo real para retos de dimensión big data en investigación de comunicación y audiencias Carlos Arcila Calderón University of Salamanca [email protected] (ESPAÑA) Félix Ortega Mohedano University of Salamanca [email protected] (ESPAÑA) Mateo Álvarez University Rey Juan Carlos [email protected] (ESPAÑA) Miguel Vicente Mariño University of Valladolid [email protected] (ESPAÑA) Recibido: 25.01 2018 Aceptado: 20.12.2018 EMPIRIA. Revista de Metodología de Ciencias Sociales. N.o 42 enero-abril, 2019, pp. 113-136. ISSN: 1139-5737, DOI/empiria.42.2019.23254 114 C.ARCILA, F.ORTEGA, M.ÁLVAREZ Y M. VICENTE ANÁLISIS DISTRIBUIDO... RESUMEN El análisis a gran escala de tweets en tiempo real utilizando el análisis de sentimiento supervisado representa una oportunidad única para la investigación de comunicación y audiencias. El poner juntos los enfoques de aprendizaje au- tomático y de analítica en tiempo real en un entorno distribuido puede ayudar a los investigadores a obtener datos valiosos de Twitter con el fin de clasificar de forma inmediata mensajes en función de su contexto, sin restricciones de tiempo o almacenamiento, mejorando los diseños transversales, longitudinales y experimentales con nuevas fuentes de datos. A pesar de que los investiga- dores de comunicación y audiencias ya han comenzado a utilizar los métodos computacionales en sus rutinas, la mayoría desconocen el uso de las tecnologías de computo distribuido para afrontar retos de dimensión big data.
    [Show full text]
  • Streaming Infrastructure and Natural Language Modeling with Application to Streaming Big Data Yuheng Du Clemson University, [email protected]
    Clemson University TigerPrints All Dissertations Dissertations May 2019 Streaming Infrastructure and Natural Language Modeling with Application to Streaming Big Data Yuheng Du Clemson University, [email protected] Follow this and additional works at: https://tigerprints.clemson.edu/all_dissertations Recommended Citation Du, Yuheng, "Streaming Infrastructure and Natural Language Modeling with Application to Streaming Big Data" (2019). All Dissertations. 2329. https://tigerprints.clemson.edu/all_dissertations/2329 This Dissertation is brought to you for free and open access by the Dissertations at TigerPrints. It has been accepted for inclusion in All Dissertations by an authorized administrator of TigerPrints. For more information, please contact [email protected]. Streaming Infrastructure and Natural Language Modeling with Application to Streaming Big Data A Dissertation Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Computer Science by Yuheng Du May 2019 Accepted by: Dr. Amy Apon, Committee Chair Dr. Alexander Herzog Dr. Mashrur Chowdhury Dr. Andre Luckow Dr. Ilya Safro Abstract Streaming data are produced in great velocity and diverse variety. The vision of this research is to build an end-to-end system that handles the collection, curation and analysis of streaming data. The streaming data used in this thesis contain both numeric type data and text type data. First, in the field of data collection, we design and evaluate a data delivery framework that handles the real- time nature of streaming data. In this component, we use streaming data in automotive domain since it is suitable for testing and evaluating our data delivery system.
    [Show full text]
  • Process Mining with Streaming Data
    Process mining with streaming data Citation for published version (APA): van Zelst, S. J. (2019). Process mining with streaming data. Technische Universiteit Eindhoven. Document status and date: Published: 14/03/2019 Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal. If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement: www.tue.nl/taverne Take down policy If you believe that this document breaches copyright please contact us at: [email protected] providing details and we will investigate your claim.
    [Show full text]
  • Efficient Stream Data Management
    From Big Data to Fast Data: Efficient Stream Data Management Alexandru Costan To cite this version: Alexandru Costan. From Big Data to Fast Data: Efficient Stream Data Management. Distributed, Parallel, and Cluster Computing [cs.DC]. ENS Rennes, 2019. tel-02059437v2 HAL Id: tel-02059437 https://hal.archives-ouvertes.fr/tel-02059437v2 Submitted on 14 Mar 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. École doctorale MathSTIC HABILITATION À DIRIGER DES RECHERCHES Discipline: INFORMATIQUE présentée devant l’École Normale Supérieure de Rennes sous le sceau de l’Université Bretagne Loire par Alexandru Costan préparée à IRISA Institut de Recherche en Informatique et Systèmes Aléatoires Soutenue à Bruz, le 14 mars 2019, devant le jury composé de: Rosa Badia / rapporteuse Directrice de recherche, Barcelona Supercomputing Center, Espagne From Big Data Luc Bougé / examinateur Professeur des universités, ENS Rennes, France to Fast Data: Valentin Cristea / examinateur Professeur des universités, Université Politehnica de Efficient Stream Bucarest, Roumanie Christian Pérez / rapporteur Data Management Directeur de recherche, Inria, France Michael Schöttner / rapporteur Professeur des universités, Université de Düsseldorf, Allemagne Patrick Valduriez / examinateur Directeur de recherche, Inria, France 3 Abstract This manuscript provides a synthetic overview of my research journey since my PhD de- fense.
    [Show full text]
  • Distributed Stream Processing of Twitter Data Using Apache Spark
    Distributed Stream Processing of Twitter Data using Apache Spark Thesis submitted in partial fulfilment of the requirements for the award of degree of Master of Engineering In Computer Science and Engineering Submitted By Shruti Arora (801632045) Under the supervision of: Dr. Rinkle Rani Associate Professor COMPUTER SCIENCE AND ENGINEERING DEPARTMENT THAPAR INSTITUTE OF ENGINEERING AND TECHNOLOGY PATIALA – 147004 JUNE 2018 1 Scanned by CamScanner Scanned by CamScanner ABSTRACT Data is continuously being generated from sources such as machines, network traffic, sensor networks, etc. Twitter is an online social networking service with more than 300 million users, generating a huge amount of information every day. Twitter’s most important characteristic is its ability for users to tweet about events, situations, feelings, opinions, or even something totally new, in real time. Currently there are different workflows offering realtime data analysis for Twitter, presenting general processing over streaming data. This study will attempt to develop an analytical framework with the ability of in-memory processing to extract and analyze structured and unstructured Twitter data. The proposed framework includes data ingestion and stream processing and data visualization components with the Apache Kafka and Apache Flume messaging system that is used to perform data ingestion task. Furthermore, Spark makes it possible to perform sophisticated data processing and machine learning algorithms in real time. We have conducted a case study on tweets and analysis on the time and origin of the tweets. We also worked on study of SparkML component to study the K-Means Clustering algorithm. iii TABLE OF CONTENTS CERTIFICATE ................................................................................................................ i ACKNOWLEDGEMENT ............................................................................................... ii ABSTRACT ....................................................................................................................
    [Show full text]