“No One at Google Uses MapReduce Anymore”


“No one at Google uses MapReduce anymore”
Cloud Dataflow, the Dataflow model and parallel processing in 2016.
Martin Görner, #Dataflow, @martin_gorner

Before we start... (quick Google BigQuery demo)

Background: The Dataflow Model (2015), MillWheel (2013), FlumeJava (2010).

Streaming pipeline. Example: hash tag auto-complete.
The pipeline reads tweets ("#argentina scores", "my #art project", "watching #armenia vs #argentina"), windows them, extracts tags (#argentina, #art, #armenia, #argentina), counts them ((argentina, 5M), (art, 9M), (armenia, 2M)), expands prefixes (a->(argentina,5M), ar->(argentina,5M), arg->(argentina,5M), ar->(art,9M), ...), keeps the top 3 per prefix (a->[apple, art, argentina], ar->[art, argentina, armenia]) and writes the predictions back out:

    Pipeline p = Pipeline.create()
        .apply(PubSub.Read...)
        .apply(Window.into())
        .apply(ParDo.of(new ExtractTags()))
        .apply(Count.create())
        .apply(ParDo.of(new ExpandPrefixes()))
        .apply(Top.largestPerKey(3))
        .apply(PubSub.Write...);
    p.run()

MapReduce?
(Diagram: a stage of map workers, M, feeding a stage of reduce workers, R.)

Associative example: count.
(Diagram: because counting is associative, each worker sums its own 1s into partial counts in a combined map-and-reduce step, M+R, and the final reduce stage, R, only adds up the partial counts, giving SUM ⇒ 8 and SUM ⇒ 9 in the example.)

Dataflow programming model
    ParDo(DoFn)         in:  PCollection<T>
                        out: PCollection<S>
    GroupByKey          in:  PCollection<KV<K,V>>                  // multimap
                        out: PCollection<KV<K,Collection<V>>>      // unimap
    Combine(CombineFn)  in:  PCollection<KV<K,Collection<V>>>      // unimap
                        out: PCollection<KV<K,V>>                  // unimap
    Flatten             in:  PCollection<T>, PCollection<T>, ...
                        out: PCollection<T>

Reinventing Count
    Count               in:  PCollection<T>
                        out: PCollection<KV<T,Integer>>            // unimap
    ParDo:              T => KV<T, 1>
    GroupByKey:         PCollection<KV<T,1>>                       // multimap
                        => PCollection<KV<T,Collection<1>>>        // unimap
    Combine(Sum):       => PCollection<KV<T,Integer>>

Reinventing JOIN
    in:  two multimaps with the same key type: PCollection<KV<K,V>>, PCollection<KV<K,W>>
    out: a unimap PCollection<KV<K, Pair<Collection<V>,Collection<W>>>>
    ParDo:              KV<K,V> => KV<K, TaggedUnion<V,W>>
                        KV<K,W> => KV<K, TaggedUnion<V,W>>
    Flatten:            => PCollection<KV<K, TaggedUnion<V,W>>>    // multimap
    GroupByKey:         => PCollection<KV<K, Collection<TaggedUnion<V,W>>>>  // unimap
    ParDo (final type transform):
                        => PCollection<KV<K, Pair<Collection<V>, Collection<W>>>>
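The Count and JOIN decompositions above are given only as type signatures. As a minimal sketch of the Count composition, the snippet below spells it out in current Apache Beam Java syntax (which differs slightly from the 2016-era Dataflow SDK shown in the talk), assuming an existing PCollection<String> named words; it is illustrative, not the talk's code.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // ParDo: T => KV<T, 1>
    PCollection<KV<String, Long>> ones = words.apply("PairWithOne",
        ParDo.of(new DoFn<String, KV<String, Long>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(KV.of(c.element(), 1L));
          }
        }));

    // GroupByKey: multimap KV<T, 1> => unimap KV<T, Collection<1>>
    PCollection<KV<String, Iterable<Long>>> grouped =
        ones.apply(GroupByKey.<String, Long>create());

    // Combine(Sum), written here as a plain ParDo for clarity: add up the ones per key.
    PCollection<KV<String, Long>> counts = grouped.apply("SumPerKey",
        ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            long sum = 0;
            for (Long one : c.element().getValue()) {
              sum += one;
            }
            c.output(KV.of(c.element().getKey(), sum));
          }
        }));

In practice Beam already packages this pattern as the Count.perElement() composite, and the tagged-union join above is packaged as CoGroupByKey; the point of the slides is that both reduce to the same handful of primitives.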
Execution graph example: extracting shopping session data.
(In the graph diagrams that follow, a plain node is a ParallelDo, "+" is a Combine and "GBK" is a GroupByKey.)
Four inputs are parsed by ParallelDos: IN1 (checkout logs) by A, IN2 and IN3 (payment logs 1 and 2) by B and C, and IN4 (website logs) by E. A writes the formatted checkout log to OUT1; the two payment-log collections are flattened; a count computes page views per session; and a join by session id combines everything into OUT2, the shopping session log: ➔ session id ➔ cart contents ➔ payment OK? ➔ page views in session.

Graph optimiser: ParallelDo fusion.
Consecutive ParallelDos (consumer-producer) and ParallelDos that read the same input (siblings) are fused into a single ParallelDo, e.g. C and D become C+D.

Graph optimiser: sink Flattens.
A Flatten is pushed down past its consumers, e.g. (B, C) → flatten → D → GBK becomes B → D and C → D feeding the GBK directly; the duplicated D nodes then become candidates for fusion.

Graph optimiser: identify MapReduce-like blocks ("Map Shuffle Combine Reduce", MSCR).
A group of fused ParallelDos feeding a GroupByKey, an optional Combine and a consuming ParallelDo is equivalent to one MapReduce: M1, M2, M3 → GBK (+) → R1, R2 → OUT.

Applied to the shopping-session graph, the optimiser works step by step: the Count composite is expanded into its ParDo (Cnt), GroupByKey and Combine (+) parts; the join is expanded into a tagging ParDo (J1), a Flatten, a GroupByKey and an untagging ParDo (J2); the Flattens are sunk; ParallelDos are fused (A+J1, B+D+J1, C+D+J1, E+J1, J2+F); and the result is identified as two MapReduce-like blocks, MSCR1 around the join's GroupByKey and MSCR2 around the count's GroupByKey. The whole pipeline therefore runs as a small number of MapReduce-like stages, with each input read once by a single fused ParallelDo.
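The optimiser itself is internal to FlumeJava and Dataflow, but the idea behind producer-consumer fusion can be shown with ordinary function composition. The toy sketch below is only an analogy, not the optimiser's code, and the function names are made up for the example: two per-element steps that would appear as separate ParallelDo nodes in the graph are collapsed into one pass, so the intermediate collection between them is never materialised.

    import java.util.function.Function;

    public class FusionSketch {
      // Two per-element steps, as they would appear as separate ParallelDo nodes.
      static Function<String, String> extractTag = line -> line.trim().toLowerCase();
      static Function<String, Integer> tagLength = tag -> tag.length();

      public static void main(String[] args) {
        // Fused: a single function applied in one pass per element,
        // so no intermediate "tags" collection has to be written out.
        Function<String, Integer> fused = extractTag.andThen(tagLength);
        System.out.println(fused.apply("  #Argentina "));   // prints 10
      }
    }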
The Dataflow model asks four questions:
    What are you computing?
    Where in event time?
    When are results emitted?
    How do refinements relate?

Dataflow model: windowing (Where in event time?).
Windowing divides data into event-time-based finite chunks: fixed windows, sliding windows and per-key sessions. It is often required when doing aggregations over unbounded data.

Event-time windowing: out-of-order.
(Diagram: elements arrive spread across processing time, 10:00 to 15:00, but are assigned to output windows by their event time, 10:00 to 15:00, so the system must cope with out-of-order arrival.)

Watermark.
(Diagram: processing time plotted against event time; the watermark tracks the skew between the two.) The watermark gives the correct definition of "late data": data that arrives behind the watermark.

What, When, Where, How?

    PCollection<KV<String, Integer>> scores = p
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()           // default behaviour
                .withLateFirings(AtCount(1))
                .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.standardHours(1))
            .accumulatingFiredPanes() / .discardingFiredPanes())
        .apply(Sum.integersPerKey());

The Dataflow model, summarised:
    What are you computing?     ParDo, GroupByKey, Flatten, Combine
    Where in event time?        FixedWindows, SlidingWindows, Sessions
    When are results emitted?   triggering: AfterWatermark, AfterProcessingTime, AfterPane.elementCount...,
                                withAllowedLateness, withEarlyFirings, withLateFirings
    How do refinements relate?  accumulatingFiredPanes, discardingFiredPanes

Apache Beam (incubating).
Apache Beam, the "Dataflow model", is an open-source lingua franca for unified batch and streaming data processing. Run it in the cloud on Google Cloud Dataflow, or on premise with the Apache Flink and Apache Spark runners for Beam.

Demo time.
1 week of data, 3M taxi rides, one point every 2s on each ride, accelerated 8x, 20,000 events/s, streamed from Google Pub/Sub.

Demo: NYC Taxis.
Pub/Sub (event queue) → Dataflow → Pub/Sub (event queue) → visualisation (JavaScript); Dataflow also writes to BigQuery (data warehousing and interactive analysis). A hypothetical code sketch of a pipeline with this shape follows the credits below.

Thank you!
cloud.google.com/dataflow
beam.incubator.apache.org
Martin Görner, Google Developer Relations, @martin_gorner
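To make the demo architecture above concrete, here is a hypothetical sketch of a pipeline with the same shape: Pub/Sub in, windowed aggregation, BigQuery out. It is written against the current Apache Beam Java SDK rather than the 2016-era Dataflow SDK used in the talk, and the project, topic, dataset and field names are placeholders, not the ones used in the actual demo.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptor;
    import org.joda.time.Duration;

    public class TaxiDemoSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // BigQuery table layout for the aggregated output (placeholder field names).
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("ride_status").setType("STRING"),
            new TableFieldSchema().setName("events").setType("INTEGER")));

        p.apply("ReadEvents",
                PubsubIO.readStrings().fromTopic("projects/<project>/topics/<taxi-topic>"))
         // Assign each event to a one-minute event-time window.
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
         // Count events per distinct payload value within each window.
         .apply(Count.perElement())
         // Turn each (value, count) pair into a BigQuery row.
         .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
             .via((KV<String, Long> kv) ->
                 new TableRow().set("ride_status", kv.getKey()).set("events", kv.getValue())))
         .apply("WriteCounts", BigQueryIO.writeTableRows()
             .to("<project>:<dataset>.taxi_counts")
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }

Run on the Dataflow runner this reads the unbounded Pub/Sub stream continuously and appends windowed counts to BigQuery; the same code runs unchanged on other Beam runners, which is the portability point the Apache Beam slide makes.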