KNIME Big Data Workshop
Björn Lohrmann, KNIME

Variety, Velocity, Volume
• Variety:
 – Integrating heterogeneous data…
 – … and tools
• Velocity:
 – Real-time scoring of millions of records/sec
 – Continuous data streams
 – Distributed computation
• Volume:
 – From small files…
 – … to distributed data repositories
 – Moving computation to the data

Variety

The KNIME Analytics Platform: Open for Every Data, Tool, and User
(Diagram: data scientists and business analysts working in the KNIME Analytics Platform, which combines native and external data access, analysis, visualization, and reporting; connectors to external and legacy tools; and distributed/cloud execution.)

Data Integration

Integrating R and Python

Modular Integrations

Other Programming/Scripting Integrations

Velocity
• High-demand scoring/prediction:
 – Scoring using KNIME Server
 – Scoring with the KNIME Managed Scoring Service
• Continuous data streams:
 – Streaming in KNIME

What is High-Demand Scoring?
• I have a workflow that takes data, applies an algorithm/model, and returns a prediction.
• I need to deploy that to hundreds or thousands of end users.
• I need to update the model/workflow periodically.

How to Make a Scoring Workflow
• Use JSON Input/Output nodes
• Put the workflow on KNIME Server -> REST endpoint for scoring (a minimal client sketch appears at the end of this part)

KNIME Managed Scoring Service
• A hosted service that allows provisioning and consuming scoring workflows via REST.
• No need to think about servers, hosting, building services, etc.
(Diagram: client applications call a load balancer in front of a pool of KNIME Scoring Agents; a scaling metric scales the number of agents up and down with demand. KNIME handles this infrastructure.)

High-demand scoring thus comes in two flavors: high-performance scoring using generic workflows, and high-performance scoring of predictive models. The second Velocity topic is continuous data streams:

Streaming in KNIME

Volume

Moving computation to the data

Volume topics:
• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connectors
• KNIME Extension for Apache Spark
• KNIME H2O Sparkling Water Integration
• KNIME Workflow Executor for Apache Spark

Database Extension
• Visually assemble complex SQL statements
• Connect to almost all JDBC-compliant databases
• Harness the power of your database within KNIME

In-Database Processing
• Operations are performed within the database

Tip
• SQL statements are logged in the KNIME log file
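To make in-database processing concrete, here is a small self-contained sketch using Python's built-in sqlite3 module and a made-up sales table as a stand-in for a real JDBC source. The nested SELECT mirrors the kind of statement the database nodes assemble step by step and that ends up in the KNIME log file.

```python
import sqlite3

# Hypothetical in-memory table standing in for a real JDBC source.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0)],
)

# In-database processing: filtering and aggregation run inside the
# database engine; only the (small) result is fetched into the client.
# Nested SELECTs like this resemble the statements the database nodes
# build up as you chain them together.
query = """
    SELECT region, COUNT(*) AS rows_per_group, SUM(amount) AS total
    FROM (SELECT * FROM sales WHERE amount > 50)
    GROUP BY region
"""
for row in con.execute(query):
    print(row)   # e.g. ('north', 2, 200.0)
```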
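And, circling back to the scoring workflow from the Velocity section: once deployed to KNIME Server, it is consumed as a plain REST endpoint. Below is a minimal Python client sketch; the server URL, workflow path, credentials, and payload layout are hypothetical placeholders, and the exact JSON structure depends on how the JSON Input/Output nodes are configured.

```python
import requests

# Hypothetical KNIME Server endpoint of a deployed scoring workflow;
# the real URL depends on the server and the workflow's repository path.
URL = ("https://knime-server.example.com/knime/rest/v4"
       "/repository/deployments/scoring_workflow:execution")

# Records to score, shaped like the table the JSON Input node expects.
payload = {"input": [{"age": 35, "income": 52000},
                     {"age": 29, "income": 48000}]}

resp = requests.post(URL, json=payload,
                     auth=("analyst", "secret"), timeout=60)
resp.raise_for_status()

# The JSON Output node's content comes back in the response body.
print(resp.json())
```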
Database Port Types
• Database Connection Port (dark red): carries connection information and an SQL statement
• Database JDBC Connection Port (light red): carries connection information only
• Database Connection Ports can be connected to Database JDBC Connection Ports, but not vice versa

Database Connectors
• Nodes to connect to specific databases:
 – Bundle the necessary JDBC drivers
 – Easy to use
 – DB-specific behavior/capabilities
• Hive and Impala connectors
• General Database Connector:
 – Can connect to any JDBC source
 – New JDBC drivers are registered via the preferences page

Register JDBC Driver
• Open KNIME and go to File -> Preferences
• Register single-jar-file JDBC drivers, or register a new JDBC driver with companion files
• Increase the connection timeout for long-running database operations

Query Nodes
• Filter rows and columns
• Join tables/queries
• Extract samples
• Bin numeric columns
• Sort your data
• Write your own query
• Aggregate your data

Database GroupBy – Manual Aggregation
• Returns the number of rows per group (the in-database sketch above shows what such a generated GROUP BY statement looks like)

Database GroupBy – DB-Specific Aggregation Methods
• SQLite: 7 aggregation functions
• PostgreSQL: 25 aggregation functions

Apache Hadoop
• Open-source framework for distributed storage and processing of large data sets
• Designed to scale up to thousands of machines
• Does not rely on hardware to provide high availability
 – Handles failures at the application layer instead
• First release in 2006
 – Rapid adoption, promoted to top-level Apache project in 2008
 – Inspired by the Google File System paper (2003)
• Spawned a diverse ecosystem of products

Hadoop Ecosystem
• Access: Hive
• Processing: MapReduce, Tez, Spark
• Resource management: YARN
• Storage: HDFS

HDFS
• Hadoop Distributed File System
• Stores large files across multiple machines: each (large) file is split into blocks (default: 64 MB) that are distributed across DataNodes

HDFS – Data Replication and File Size
• A file is stored as a sequence of blocks
• The blocks of a file are replicated for fault tolerance (usually 3 replicas)
 – Aims: improve data reliability, availability, and network bandwidth utilization
(Diagram: blocks B1–B3 of a file placed on DataNodes n1–n4 across racks 1–3, with block locations tracked by the NameNode. A small block-arithmetic sketch follows after the YARN overview below.)

YARN
• Cluster resource management system
• Two elements:
 – Resource Manager (one per cluster):
  • Knows where the worker nodes are located and how many resources they have
  • Scheduler: decides how to allocate resources to applications
 – Node Manager (many per cluster):
  • Launches application containers
  • Monitors resource usage and reports it to the Resource Manager
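As promised above, a small sketch of the HDFS block arithmetic: how many blocks a file occupies and how much raw storage its replicas consume. The file size is a made-up example; the 64 MB block size and 3 replicas are just the defaults mentioned on the slides.

```python
import math

BLOCK_SIZE = 64 * 1024**2   # default block size from the slides: 64 MB
REPLICATION = 3             # usual replication factor

def hdfs_footprint(file_size_bytes):
    """Blocks and raw storage used by one file under the settings above."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    raw_bytes = file_size_bytes * REPLICATION  # every block stored 3 times
    return blocks, blocks * REPLICATION, raw_bytes

# Example: a 1 GB file -> 16 blocks, 48 block replicas, 3 GB raw storage.
blocks, replicas, raw = hdfs_footprint(1024**3)
print(blocks, replicas, raw / 1024**3)
```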
MapReduce
• Data flows through the phases input splitting, mapping, shuffling, and reducing
• Map applies a function to each element; Reduce aggregates a list of values to one result
• Word-count example: for each word in its input split, a mapper emits the pair (word, 1); shuffling groups the pairs of equal words; a reducer then sums up the counts per word (a plain-Python sketch follows at the end of this section)

Hive
• SQL-like database on top of files in HDFS
• Provides data summarization, query, and analysis
• Interprets a set of files as a database table (schema information has to be provided)
• Translates SQL queries into MapReduce, Tez, or Spark jobs
• Supports various file formats:
 – Text/CSV
 – SequenceFile
 – Avro
 – ORC
 – Parquet

Hive SQL
(Diagram: a query like "select * from table" is compiled into MAP(…) and REDUCE(…) stages that run on the DataNodes and read the underlying files, e.g. table_1.csv, table_2.csv, table_3.csv.)

Spark
• Cluster computing framework for large-scale data processing
• Keeps large working datasets in memory between jobs
 – No need to always load data from disk -> much (!) faster than MapReduce
• Programmatic interface
 – Scala, Java, Python, R
 – Functional programming paradigm: map, flatmap, filter, reduce, fold, …
• Great for:
 – Iterative algorithms
 – Interactive analysis

Spark – Data Representation
• DataFrame:
 – Table-like: a collection of rows, organized in columns with names and types (e.g. Name/Surname/Age: John Doe, 35; Jane Roe, 29)
 – Immutable: data manipulation means creating a new DataFrame from an existing one by applying a function to it
 – Lazily evaluated: functions are not executed until an action is triggered that actually requests the row data
 – Distributed: each row belongs to exactly one partition, and each partition is held by a Spark Executor
• Note: earlier versions of KNIME and Spark used RDDs (resilient distributed datasets); with Spark 2 and later, KNIME always uses DataFrames

Spark – Lazy Evaluation
• Functions ("transformations") on DataFrames are not executed immediately
• Spark keeps a record of the transformations for each DataFrame
• The actual execution is only triggered once the data is needed, i.e. when an action triggers evaluation
• This offers the possibility to optimize the transformation steps (see the PySpark sketch at the end of this section)

Spark Context
• Main entry point for Spark functionality
• Represents the connection to a Spark cluster
• Allocates resources on the cluster
(Diagram, truncated: Spark Executors running tasks.)
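As promised in the MapReduce part above, here is a plain-Python sketch of the word-count data flow. It models the map, shuffle, and reduce phases in a single process; real MapReduce distributes each phase across the cluster, and the input strings are made-up examples.

```python
from collections import defaultdict

# Input splits, as the framework would hand them to different mappers.
splits = ["blue red orange", "yellow blue yellow", "blue red"]

# Map phase: for each word emit the pair (word, 1).
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle phase: group the pairs by key, i.e. collect equal words.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: for all equal words sum up the counts.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'blue': 3, 'red': 2, 'orange': 1, 'yellow': 2}
```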
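And a PySpark sketch of DataFrames and lazy evaluation, assuming a local Spark installation (pip install pyspark); inside KNIME, the connection to the cluster is managed by the Spark integration nodes instead. The filter and the derived column are only recorded until the action (collect) triggers execution.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point; in KNIME this Spark context is created by the Spark
# integration nodes rather than by hand.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("John", "Doe", 35), ("Jane", "Roe", 29)],
    ["name", "surname", "age"],
)

# Transformations: nothing runs yet, Spark only records the lineage.
adults = df.filter(F.col("age") > 30).withColumn("retired", F.lit(False))

# Action: triggers evaluation of the recorded transformations.
print(adults.collect())  # [Row(name='John', surname='Doe', age=35, retired=False)]

spark.stop()
```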