Large-scale data analysis at cloud scale
johnwilkes@google.com, 2016-09


Based on slides from Eric Schmidt, Greg DeMichillie, Frances Perry, Tyler Akidau, and Dan Halperin.

GCP Big Data reference architecture
Google Cloud Platform spans three areas: Data & Analytics, Application Development, and Infrastructure & Operations.

One possible flow
Application logs (Impression, Start stream, Add to playlist, Share, ...) → Cloud Pub/Sub (ingest and propagate) → Cloud Dataflow (processing and imperative analysis) → BigQuery (durable storage and interactive analysis).

Questions you might ask:
- Which application sections are receiving the most impressions?
- Who is the top artist by stream starts in the last minute?
- What is the average session length for users in Seattle?

10+ years of tackling Big Data problems
Google papers: GFS (2002), MapReduce (2004), BigTable (2005), Dremel (2006), PubSub (2008), FlumeJava (2010), Millwheel (2012), Dataflow (2014), TensorFlow (2015).

The Dremel paper appeared in Proc. 36th VLDB, Sept 2010.

Dremel in 2016
Dremel is mission-critical for Google: in production for 10+ years, in every Google datacenter. Internal usage every month:
● O(exabytes) analyzed
● O(quadrillions) of rows scanned
● 80% of Googlers use it

BigQuery ≃ Dremel + cloud
Our idea of highly available, managed analytics:
● no indexing, no resharding, no storage resizing
● just ... RUN QUERY

BigQuery is a fully managed data warehouse (General Availability in 2012), built by the same engineering team on the same code base as Dremel:
● fast, petabyte-scale, with a SQL interface
● encrypted, durable, and highly available
● near-unlimited resources; you only pay for what you use
● streaming ingest for unbounded data sets

But what have we done lately?

BigQuery: 5+ years of innovation
Code submits grew steadily from 2010 through 2016. Milestones: beta release at Google I/O (2010), public launch, big JOIN support, large query results, user-defined functions, 100k qps streaming, Dremel X, Dynamic Execution, faster shuffle, and Capacitor.

Dremel architecture (2006–2015)
A long-lived serving tree: Mixer 0 fans out to Mixer 1s, which fan out to leaves reading from distributed storage.
● partial reduction
● long-lived serving tree
● diskless data flow
● columnar storage

Dremel X architecture (2015–now)
A master dynamically dispatches work to a flat pool of shards over distributed storage.
● dynamic serving tree
● columnar storage

Storage engine: ColumnIO (2006–2015)
SELECT play_count FROM songs WHERE name CONTAINS "Sun";
ColumnIO decompresses every value and then evaluates the filter against every row: {Hey Jude, 5375}, {My Michelle, 2188}, and so on are all decompressed and tested, and only the matching row {Here Comes The Sun, 7383} emits its value, {7383}.

Storage engine: Capacitor (2016–now)
Same query, but Capacitor keeps each column dictionary-encoded. The filter runs once per dictionary entry (Hey Jude → false, My Michelle → false, Here Comes the Sun → true) rather than once per row; rows are then matched with a cheap dictionary-index lookup, and only matching rows emit a value.
Performance improvement:
● 2x faster on average over all queries
● 10–1000x faster for selective filters
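To make the dictionary-encoding idea concrete, here is a minimal toy sketch in Java of filtering a dictionary-encoded column. It illustrates the technique only and is not BigQuery's actual implementation; all names and the miniature data set are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy dictionary-encoded column: evaluate the predicate once per
 *  dictionary entry, then filter rows by integer lookup only. */
public class DictionaryFilterSketch {
    public static void main(String[] args) {
        // Dictionary: each distinct name stored once (hypothetical data).
        String[] dictionary = {"Hey Jude", "My Michelle", "Here Comes the Sun"};
        // Rows store only dictionary indices, plus the play_count column.
        int[] nameIds   = {0, 1, 2, 0, 1};
        int[] playCount = {5375, 2188, 7383, 9502, 3912};

        // Step 1: evaluate CONTAINS("Sun") once per dictionary entry.
        boolean[] matches = new boolean[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            matches[i] = dictionary[i].contains("Sun");
        }

        // Step 2: scan rows with a cheap index lookup -- no per-row
        // string work, no per-row decompression.
        List<Integer> emitted = new ArrayList<>();
        for (int row = 0; row < nameIds.length; row++) {
            if (matches[nameIds[row]]) {
                emitted.add(playCount[row]);
            }
        }
        System.out.println(emitted); // [7383]
    }
}
```

Because the string predicate runs once per distinct value rather than once per row, highly selective filters over repetitive columns become dramatically cheaper, which is consistent with the 10–1000x figure above.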
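And since BigQuery's interface is just SQL, one of the opening questions ("Who is the top artist by stream starts in the last minute?") might look something like the following hedged sketch in BigQuery standard SQL; the table and column names are assumptions, not a schema from the talk.

```sql
-- Hypothetical events table: one row per playback event, with the
-- event type, the artist, and an event timestamp.
SELECT
  artist,
  COUNT(*) AS stream_starts
FROM events
WHERE event_type = 'start_stream'
  AND event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE)
GROUP BY artist
ORDER BY stream_starts DESC
LIMIT 1;
```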
Cloud Dataflow

One possible flow (recap)
Application logs (Impression, Start stream, Add to playlist, Share, ...) → Cloud Pub/Sub (ingest and propagate) → Cloud Dataflow (processing and imperative analysis) → BigQuery (durable storage and interactive analysis).

Questions you might ask:
- Which application sections are receiving the most impressions?
- Who is the top artist by stream starts in the last minute?
- What is the average session length for users in Seattle?

Time-to-answers matters
● Sequence a human genome
● Process point-of-sale transaction logs
● Who is the best at collecting Poké Balls?
● Who/what is trending?

Data boundedness
● Bounded: finite data set; fixed schema; complete; typically at rest.
● Unbounded: infinite data set; potentially changing schema; never complete; typically not at rest.

Latency happens
● Transmit: network delay or unavailability; ingest delay (write latency); hardware failure.
● Ingest: throughput (read latency); backlog; ingest failure.
● Process: poor engine design; starved resources; internal backlog; confused heuristics.

The evolution of Apache Beam
Google-internal systems (MapReduce, Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Millwheel, Flume) fed into Google Cloud Dataflow, which in turn became Apache Beam.

Apache Beam (incubating) is a unified programming model designed to provide efficient and portable data processing pipelines, covering a spectrum of use cases:
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions
6. Sessions

Formalizing event-time skew
Watermarks describe event-time progress in the processing-time space: "No timestamp earlier than the watermark will be seen." They are often based on heuristics.
● Too slow? Results are delayed :-(
● Too fast? Some data is late :-(

What / Where / When / How
● What are you computing?
● Where in event time?
● When in processing time?
● How do refinements relate?

What are you computing?
Per-element transforms, aggregations, and compositions (subgraphs). Example: computing integer sums per key.

Where in event time?
Windowing divides data into event-time-based finite chunks: fixed windows, sliding windows, or per-key sessions. It is often required when doing aggregations over unbounded data. Example: fixed 2-minute windows.

When in processing time?
● Triggers control when results are emitted.
● Triggers are often relative to the watermark.
Examples: triggering at the watermark; adding early (speculative) and late firings.

How do refinements relate?
Q: If there are multiple panes per window, what should be emitted?
● discard – emit the delta over the pane
● accumulate – emit the running total
● accumulate + retract* – retract the last sum & emit the new running total
A: Driven by the needs of the downstream consumer. Example: add the newest pane, remove the previous one.
*Accumulating & retracting is not yet implemented in Apache Beam.

Customizing What / Where / When / How
Varying these four choices moves the same pipeline across the spectrum (a sketch combining them follows the list):
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions
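As one concrete combination, here is a minimal sketch in the Beam Java SDK that fixes all four knobs: per-key integer sums (What), fixed 2-minute event-time windows (Where), triggering at the watermark with early and late firings (When), and accumulating panes (How). The input collection, the firing delays, and the allowed lateness are assumptions chosen for illustration.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ScoreSums {
  /** input: an unbounded stream of (user, score) pairs. */
  static PCollection<KV<String, Integer>> sumPerUser(
      PCollection<KV<String, Integer>> input) {
    return input
        // Where: fixed 2-minute event-time windows.
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            // When: fire at the watermark, with speculative early
            // firings and one firing per late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(10)) // assumption
            // How: each pane emits the running total so far.
            .accumulatingFiredPanes())
        // What: per-key integer sums.
        .apply(Sum.integersPerKey());
  }
}
```

Changing only the windowing and triggering lines walks this same computation through the batch/streaming spectrum listed above.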
The straggler problem
Work is unevenly distributed across tasks:
● underlying data size
● processing differences
● runtime effects
Effects are cumulative per stage.

Beam readers enable dynamic adaptation
Beam readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.

APIs for how much work is pending:
● Bounded: double getFractionConsumed()
● Unbounded: long getBacklogBytes()
Work-stealing:
● Bounded: Source splitAtFraction(double), int getParallelismRemaining()
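To show the shape of these signals, here is a self-contained toy bounded reader in Java with a fraction-consumed signal and a splitAtFraction-style work-stealing operation. It deliberately mirrors the method names above but is not the real Beam Source API; everything else about it is an assumption for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy bounded reader exposing two signals from the slide:
 *  getFractionConsumed() and splitAtFraction(). */
class ToyBoundedReader {
  private final List<String> records;
  private int next = 0; // index of the next record to read
  private int end;      // exclusive end of this reader's range

  ToyBoundedReader(List<String> records, int end) {
    this.records = records;
    this.end = end;
  }

  boolean advance() {
    if (next >= end) return false;
    next++;
    return true;
  }

  String getCurrent() { return records.get(next - 1); }

  /** Progress signal: how much of the assigned range is consumed. */
  double getFractionConsumed() { return (double) next / end; }

  /** Work-stealing: give away the range beyond 'fraction' as a new
   *  reader, or return null if that work is already consumed. */
  ToyBoundedReader splitAtFraction(double fraction) {
    int splitPos = (int) (fraction * end);
    if (splitPos <= next || splitPos >= end) return null; // too late / no-op
    ToyBoundedReader residual = new ToyBoundedReader(records, end);
    residual.next = splitPos; // residual reads [splitPos, end)
    this.end = splitPos;      // this reader keeps [0, splitPos)
    return residual;
  }

  public static void main(String[] args) {
    ToyBoundedReader r = new ToyBoundedReader(
        Arrays.asList("a", "b", "c", "d", "e", "f"), 6);
    r.advance(); r.advance();                    // consume "a", "b"
    System.out.println(r.getFractionConsumed()); // 0.333...
    ToyBoundedReader stolen = r.splitAtFraction(0.5); // steal [3, 6)
    while (r.advance()) System.out.println("r: " + r.getCurrent());
    while (stolen.advance()) System.out.println("stolen: " + stolen.getCurrent());
  }
}
```

A runner polling getFractionConsumed() across workers can spot a predicted straggler and call splitAtFraction() to hand its remaining range to an idle worker, which is the mechanism behind the rebalancing described next.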
Dynamic work rebalancing
The runner continuously tracks each task's done work, active work, and predicted completion time; work predicted to finish late is split off and redistributed, pulling stragglers back toward the predicted mean completion time.

Dynamic work rebalancing: a real example
A Beam pipeline on the Google Cloud Dataflow runner: a 2-stage pipeline split "evenly" but uneven in practice, compared with the same pipeline run with dynamic work rebalancing, shows clear savings.

Dynamic work rebalancing + autoscaling
A Beam pipeline on the Google Cloud Dataflow runner: initially allocate ~80 workers based on input size; multiple rounds of upsizing are enabled by dynamic splitting; long-running tasks are aborted without causing stragglers; scale up to 1000 workers, with tasks staying well-balanced and without initial oversplitting.

Cloud Dataflow execution runner
Bounded and unbounded sources flow through compute and storage into sinks, with the service providing resource management, graph optimization, intelligent watermarking, work scheduling, auto-healing, resource auto-scaling, dynamic work rebalancing, monitoring, and log collection.

The Dataflow Model & Cloud Dataflow
● Dataflow Model & SDKs: a unified model for batch and stream processing.
● Google Cloud Dataflow: a no-ops, fully managed service.

The Beam Model & Cloud Dataflow
● Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes.
● Google Cloud Dataflow: a great place to run Beam.

What is in Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines – starting with Java
3. Runners for existing distributed processing backends:
● Apache Flink (thanks to data Artisans)
● Apache Spark (thanks to Cloudera)
● Google Cloud Dataflow (fully managed service)
● Local (in-process) runner for testing

Categorizing runner capabilities: http://beam.incubator.apache.org/capability-matrix/

Apache Beam roadmap
● 02/01/2016: enter Apache Incubator
● 02/25/2016: 1st commit to ASF repository
● 1H 2016: refactoring (slight chaos)
● 2H 2016: portability basics + multiple runners + community growth
● 1H 2017: stable APIs ⇒ towards a top-level project

Growing the Beam community
Collaborate – a community-driven effort across many organizations and contributors.
Grow – the Beam ecosystem and community with active, open involvement.

From the 1st wave (colocation, user managed) through the 2nd wave (virtualized datacenters, user configured and maintained) to now and next: intelligent services, auto everything.

Big Data programming with Google
● Focus on insights, not infrastructure
● From batch to real-time understanding
● Apache Beam offers a common model across execution engines: please contribute!

johnwilkes@google.com