
Integrating High-Performance Machine Learning: H2O and KNIME

Mark Landry (H2O), Christian Dietz (KNIME)


H2O: in-memory machine learning platform designed for speed on distributed systems

High-Level Architecture

[Architecture diagram] Data sources (HDFS, S3, NFS, SQL, local files) are loaded into the H2O Compute Engine, which provides exploratory & descriptive analysis, feature engineering & selection, supervised & unsupervised modeling, model evaluation & selection, and prediction on top of distributed in-memory data & model storage with loss-less compression. Data prep and trained models can be exported as Plain Old Java Objects (POJOs) to a production scoring environment, or to wherever your imagination takes them.
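For orientation, a minimal sketch of this pipeline from the H2O Python client; the file path, split ratio, and seed are placeholders, and any HDFS/S3/NFS path works the same way:

    import h2o

    # Start, or connect to, an H2O cluster; on a laptop this launches a local
    # single-node cluster, in production it would point at a multi-node cluster.
    h2o.init()

    # Parse a file into a distributed, loss-lessly compressed in-memory H2OFrame.
    frame = h2o.import_file("data.csv")

    # Exploratory & descriptive analysis on the distributed frame.
    frame.describe()

    # Split into training and validation frames for the modeling steps.
    train, valid = frame.split_frame(ratios=[0.8], seed=42)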

Distributed Algorithms

Advantageous Foundation

• Foundation for in-memory distributed algorithm calculation: distributed data frames with columnar compression and parallel parse into distributed rows

• All algorithms are distributed in H2O: GBM, GLM, DRF, Deep Learning, and more, built on fine-grained map-reduce iterations

• The only enterprise-grade, open-source distributed algorithms on the market

User Benefits

• "Out-of-the-box" functionality for all algorithms (no more scripting) and a uniform interface across all languages: R, Python, Java (see the Python sketch below)

• Designed for all sizes of data sets, especially large data

• Highly optimized Java code for model exports

• In-house expertise for all algorithms

[Figure: fine-grained map-reduce illustration, showing scalable distributed histogram calculation for GBM]
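As referenced above, the same uniform interface is exposed to R, Python, and Java. A hedged Python sketch of training a distributed GBM and exporting it as a POJO; the CSV path and the response column "y" are hypothetical:

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()

    # Hypothetical training data with a response column named "y".
    frame = h2o.import_file("train.csv")
    train, valid = frame.split_frame(ratios=[0.8], seed=42)

    # Distributed GBM: split-finding histograms are computed with fine-grained
    # map-reduce tasks over the cluster's data chunks.
    gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1)
    gbm.train(y="y", training_frame=train, validation_frame=valid)
    print(gbm.auc(valid=True))  # assuming a binomial response

    # Export the trained model as a Plain Old Java Object for production scoring.
    h2o.download_pojo(gbm, path=".")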

Scientific Advisory Council

Machine Learning Benchmarks (https://github.com/szilard/benchm-ml)

[Benchmark plots: training time in seconds and AUC, each plotted against millions of training samples]

Gradient Boosting Machine benchmark (also available for GLM and Random Forest)

H2O Algorithms

Supervised Learning

• Statistical Analysis: Generalized Linear Models (Binomial, Gaussian, Gamma, Poisson, and Tweedie families); Naïve Bayes

• Ensembles: Distributed Random Forest for classification or regression models; Gradient Boosting Machine, which produces an ensemble of decision trees with increasingly refined approximations

• Deep Neural Networks: Deep Learning creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning

• Clustering: K-means partitions observations into k clusters/groups of the same spatial size and can automatically detect the optimal k

• Dimensionality Reduction: Principal Component Analysis linearly transforms correlated variables into independent components; Generalized Low Rank Models extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

• Anomaly Detection: Autoencoders find outliers using nonlinear dimensionality reduction with deep learning
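A short sketch of the unsupervised side in Python, again with a hypothetical data set: K-means for clustering, plus a deep-learning autoencoder whose reconstruction error flags potential outliers:

    import h2o
    from h2o.estimators.kmeans import H2OKMeansEstimator
    from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

    h2o.init()
    frame = h2o.import_file("data.csv")  # hypothetical numeric data set

    # K-means: partition the rows into k clusters.
    km = H2OKMeansEstimator(k=5, seed=42)
    km.train(training_frame=frame)

    # Autoencoder for anomaly detection: rows with a high per-row
    # reconstruction error are candidate outliers.
    ae = H2OAutoEncoderEstimator(hidden=[10, 2, 10], epochs=20, seed=42)
    ae.train(training_frame=frame)
    reconstruction_error = ae.anomaly(frame)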

H2O in KNIME

Live Demo

H2O in KNIME

• Offer our users high-performance machine learning algorithms from H2O in KNIME

• Allow mixing and matching with other KNIME functionality:
  – Data wrangling (KNIME Analytics Platform functionality)
  – KNIME Big-Data Connectors
  – Text Mining, Image Processing, Cheminformatics, …
  – and more!

H2O in KNIME

Live Demo

H2O in KNIME – Cross Validation
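On the H2O side, cross-validation is built into every algorithm via the nfolds parameter; a minimal Python sketch, with the data path and response column "y" as placeholders:

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()
    frame = h2o.import_file("train.csv")  # hypothetical data with response "y"

    # nfolds > 1 trains the cross-validation models alongside the main model.
    gbm = H2OGradientBoostingEstimator(ntrees=100, nfolds=5, seed=42)
    gbm.train(y="y", training_frame=frame)

    # Cross-validated metrics aggregated over the 5 folds.
    print(gbm.model_performance(xval=True))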

H2O in KNIME – Parameter Optimization
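On the H2O side, parameter optimization can be done with grid search over hyper-parameters; a hedged Python sketch, assuming a binomial response column "y" so that models can be ranked by cross-validated AUC:

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    h2o.init()
    frame = h2o.import_file("train.csv")  # hypothetical data with binomial response "y"

    # Grid of GBM hyper-parameters to explore.
    hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.01, 0.05, 0.1]}
    grid = H2OGridSearch(H2OGradientBoostingEstimator(ntrees=100, nfolds=5, seed=42),
                         hyper_params=hyper_params)
    grid.train(y="y", training_frame=frame)

    # Rank the trained models by AUC and keep the best one.
    best = grid.get_grid(sort_by="auc", decreasing=True).models[0]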

H2O in KNIME – Nodes in KNIME 3.4

H2O in KNIME – What’s cooking?

Thank you!
