Modern ETL Tools for Cloud and Big Data
Ken Beutler, Principal Product Manager, Progress
Michael Rainey, Technical Advisor, Gluent Inc.

Agenda
▪ Landscape
▪ Cloud ETL Tools
▪ Big Data ETL Tools
▪ Best Practices
▪ Q&A

Landscape

Data Warehousing Has Transformed Enterprise BI and Analytics

ETL Tools Are At The Heart Of EDW

Rise Of The Modern EDW Solutions
▪ Cloud Data Warehouse
▪ Big Data Warehouse

Enterprises Are Increasingly Adopting Cloud and Big Data
[Bar charts: adoption among survey respondents of cloud platforms (Amazon Web Services, Microsoft Azure, VMware, Google Cloud, Salesforce, Oracle, IBM, and others), big data technologies (Hadoop Hive, Spark SQL, Hortonworks, Amazon EMR, Cloudera CDH, Apache Sqoop, and others), and NoSQL databases (MongoDB, Cassandra, Redis, HBase, DynamoDB, Couchbase, and others).]
Source: Progress DataDirect’s Data Connectivity Outlook Survey 2018

Cloud ETL Tools

Poll Question 1
▪ Which cloud provider are you currently using / plan to use?
▪ AWS
▪ Google Cloud Platform
▪ Azure
▪ Other
▪ None

AWS Glue
Serverless ETL

AWS Glue
▪ Components:
• Data Catalog
• Crawlers
• ETL jobs/scripts
• Job scheduler
▪ Useful for…
• …running serverless queries against S3 buckets and relational data
• …creating event-driven ETL pipelines
• …automatically discovering and cataloging your enterprise data assets

AWS Glue architecture
Source: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html

Data catalog - the central component
▪ Can act as a metadata repository for other Amazon services
▪ Tables are added to “databases” using the wizard or a crawler
▪ Data sources: Amazon S3, Redshift, Aurora, Oracle, PostgreSQL, MySQL, MariaDB, MS SQL Server, JDBC, DynamoDB
▪ Crawlers connect to one or more data stores, determine the data structures, and write tables into the Data Catalog
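To make the crawler-to-catalog flow above concrete, here is a minimal Python sketch using boto3 (the AWS SDK); it is not from the original deck, and the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Define a crawler that scans an S3 prefix and writes the discovered
    # table definitions into a Data Catalog database.
    glue.create_crawler(
        Name="sales-data-crawler",                              # hypothetical name
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
        DatabaseName="sales_db",                                # catalog database to populate
        Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    )

    # Run the crawler on demand; it can also run on a schedule or be
    # triggered as part of a workflow.
    glue.start_crawler(Name="sales-data-crawler")

    # After the crawl completes, the tables appear in the Data Catalog and
    # can be used by Athena, Redshift Spectrum, or Glue ETL jobs.
    for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
        print(table["Name"])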
Jobs
▪ PySpark or Scala scripts, generated by AWS Glue
▪ A visual dataflow can be generated, but it is not used for development
▪ Execute ETL using the job scheduler, events, or manual invocation
▪ Built-in transforms used to process data:
• ApplyMapping: maps source and target columns
• SelectFields: outputs selected fields to a new DynamicFrame
• Filter: loads a new DynamicFrame based on filtered records
• SplitRows: splits rows into two new DynamicFrames based on a predicate
• Join: joins two DynamicFrames
• SplitFields: splits fields into two new DynamicFrames

Azure Data Factory
Visual Cloud ETL

Azure Data Factory
▪ Build data pipelines using a visual ETL user interface
• Visual Studio Team Services (VSTS) Git integration for collaboration, source control, and versioning
▪ Drag, drop, and link activities
• Copy Data: Source to Target
• Transform: Spark, Hive, Pig, streaming on HDInsight, Stored Procedures, ML Batch Execution, etc.
• Control flow: If-then-else, For-each, Lookup, etc.
Source: https://azure.microsoft.com/en-us/blog/continuous-integration-and-deployment-using-data-factory/

Linked services
▪ Allow connection to many data sources
• Source:
– Azure DBs, Azure Blob Storage, Azure File Storage, Oracle, SQL Server, SAP HANA, Sybase, DB2, Impala, Drill, MySQL, Google BigQuery, Cassandra, Amazon S3, HDFS, Salesforce, etc.
• Target:
– Azure DBs, Azure Blob Storage, Azure File Storage, Oracle, SQL Server, SAP HANA, File System, Generic ODBC, Salesforce, etc.

Integration runtime environment
▪ Compute environment required for Azure Data Factory
• Fully managed in Azure cloud infrastructure
• Automatic scaling, high availability, etc.
• Allows execution of SSIS packages
▪ Can be hosted on a local private network
• Organizations can control and manage compute resources

Google Cloud Dataflow
Unified Programming Model for Data Processing

Google Cloud Dataflow
Source: https://cloud.google.com/dataflow/

Google Cloud Dataflow overview
▪ Unified programming model for batch (historical) and streaming (real-time) data pipelines, plus a distributed compute platform
• Reuse code across both batch and streaming pipelines
• Java or Python based
▪ The programming model is an open source project - Apache Beam (https://beam.apache.org/)
• Runs on multiple distributed processing back-ends: Spark, Flink, and Cloud Dataflow
▪ Fully managed service
• Automated resource management and scale-out
Source: https://cloud.google.com/dataflow/
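As a rough sketch of the unified Beam programming model that Cloud Dataflow executes, the short Python pipeline below reads text files, filters records, and writes results; the bucket paths and field positions are hypothetical, and swapping runner="DirectRunner" for "DataflowRunner" (plus project, region, and temp_location options) would run the same code as a managed Dataflow job.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # DirectRunner executes locally; DataflowRunner submits the same pipeline
    # to the fully managed Cloud Dataflow service.
    options = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadOrders" >> beam.io.ReadFromText("gs://example-bucket/input/orders-*.csv")  # hypothetical path
            | "ParseCsv" >> beam.Map(lambda line: line.split(","))
            | "KeepLargeOrders" >> beam.Filter(lambda f: float(f[2]) > 100.0)  # assumes an amount in the third column
            | "FormatOutput" >> beam.Map(lambda f: ",".join(f))
            | "WriteResults" >> beam.io.WriteToText("gs://example-bucket/output/large-orders")
        )

Element-wise transforms such as Map and Filter work unchanged against a streaming source like Pub/Sub, which is the point of the unified batch/streaming model.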
Dataflow I/O transforms
▪ Language: Java (Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems)
• File-based: FileIO, AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO
• Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT
• Database: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Solr, JDBC, ElasticSearch, Google BigQuery, Google Cloud BigTable, Google Cloud Datastore, Google Cloud Spanner, MongoDB, Redis
▪ Language: Python (Beam Python supports Google Cloud Storage and local filesystems)
• File-based: avroio, textio, tfrecordio, vcfio
• Messaging: Google Cloud Pub/Sub
• Database: Google BigQuery, Google Cloud Datastore
▪ Notes:
• JDBC allows custom connection to additional databases
• Many different filesystems and cloud object stores are supported
• Other I/O transforms are already under development

Cloud ETL tools compared
Feature                    | AWS Glue | Google Dataflow | Azure Data Factory
Batch ETL                  |    X     |        X        |         X
Streaming                  |    -     |        X        |         -
User Interface             |    -     |       - *       |         X
Compute data platform      |    -     |        X        |         -
Cross-platform support     |    X     |        X        |         X
Custom connector support   |    X     |        X        |         X
Metadata catalog           |    X     |        -        |         -
Monitoring tools available |    X     |        X        |         X
Fully managed              |    X     |        X        |         X
* Cloud Dataprep

Big Data ETL Tools

Poll Question 2
▪ Which of the following tools are you currently using / plan to use?
▪ Apache Sqoop
▪ Apache NiFi
▪ Apache Flume
▪ Other
▪ None

Apache Sqoop
Batch Data Transfer

Sqoop Overview
▪ Sqoop is a tool designed to transfer data between Hadoop and RDBMS or mainframes (bi-directional)
▪ Open source (Apache) command line tool
▪ Batch/bulk transfer of structured data
▪ Requires Hadoop
• Leverages MapReduce to provide parallel execution and data partitioning

Sqoop Details
Importing Data
▪ Bulk move data from RDBMS or mainframe to Hadoop
• Persist in HDFS, Hive, HBase or Accumulo
▪ Uses JDBC* for connectivity
▪ Able to perform column and row filtering or free-form queries
• Must use MapReduce, Pig or other scripting language
Exporting Data
▪ Exports a set of files in HDFS to RDBMS
▪ Uses JDBC* for connectivity
▪ Leverages either update or insert mode or incremental imports
▪ Limited transformation support – column filtering
▪ Constraints:
• Table must exist
• Operation is not atomic
* A very limited number of databases are supported, and support is version specific

Sqoop Details
▪ Sqoop jobs can be created and saved to be used as templates
▪ Sqoop jobs can be scheduled using Oozie
▪ sqoop-merge can be used to combine two datasets into one and overwrite values of an older dataset
▪ Plugin architecture
$ sqoop import --connect
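A representative sqoop import invocation might look like the following sketch; the JDBC URL, credentials, table name, and HDFS target directory are illustrative placeholders, not values from the original slide.

    $ sqoop import \
        --connect jdbc:mysql://dbhost:3306/sales \
        --username etl_user -P \
        --table orders \
        --target-dir /data/raw/orders \
        --num-mappers 4

The --num-mappers option sets how many parallel map tasks Sqoop launches, which is how it provides the parallel execution and data partitioning noted in the overview.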