Integrating High-Performance Machine Learning: H2O and KNIME
Total Page:16
File Type:pdf, Size:1020Kb
Integrating high-performance machine learning: H2O and KNIME Mark Landry (H2O), Christian Dietz (KNIME) © 2017 KNIME AG. All Rights Reserved. Speed Accuracy H2O: in-memory machine learning platform designed for speed on distributed systems 2 High Level Architecture HDFS H2O Compute Engine Exploratory & Supervised & S3 Load Data Descriptive Unsupervised Predict Analysis Modeling Distributed In-Memory Feature Model NFS Data & Model Engineering & Evaluation & Loss-less Storage Compression Selection Selection Local Data Prep Export: Model Export: Plain Old Java Plain Old Java Object Object SQL Production Scoring Environment Your Imagination 3 Distributed Algorithms Advantageous Foundation • Foundation for In-Memory Distributed Algorithm Calculation - Distributed Data Frames and columnar compression • All algorithms are distributed in H2O: GBM, GLM, DRF, Parallel Parse into Distributed Rows Deep Learning and more. Fine-grained map-reduce iterations. • Only enterprise-grade, open-source distributed algorithms in the market User Benefits • “Out-of-box” functionalities for all algorithms (NO MORE SCRIPTING) and uniform interface across all languages: R, Python, Java • Designed for all sizes of data sets, especially large data Foundation for Distributed Algorithms Distributed for Foundation • Highly optimized Java code for model exports Fine Grain Map Reduce Illustration: Scalable • In-house expertise for all algorithms Distributed Histogram Calculation for GBM 4 Scientific Advisory Council 5 Machine Learning Benchmarks (https://github.com/szilard/benchm-ml) Time (s) AUC X = million of samples X = million of samples Gradient Boosting Machine Benchmark (also available for GLM and Random Forest) 6 H2O Algorithms Supervised Learning Unsupervised Learning • Generalized Linear Models: Binomial, • K-means: Partitions observations into k Statistical Gaussian, Gamma, Poisson and Tweedie Clustering clusters/groups of the same spatial size. Analysis • Naïve Bayes Automatically detect optimal k • Distributed Random Forest: Classification • Principal Component Analysis: Linearly transforms or regression models Dimensionality correlated variables to independent components Ensembles • Gradient Boosting Machine: Produces an Reduction • Generalized Low Rank Models: extend the idea of ensemble of decision trees with increasing PCA to handle arbitrary data consisting of numerical, refined approximations Boolean, categorical, and missing data • Deep learning: Create multi-layer feed • Autoencoders: Find outliers using a Deep Neural forward neural networks starting with an Anomaly nonlinear dimensionality reduction using input layer followed by multiple layers of Networks Detection deep learning nonlinear transformations 7 H2O in KNIME Live Demo © 2017 KNIME AG. All Rights Reserved. 8 H2O in KNIME • Offer our users high-performance machine learning algorithms from H2O in KNIME • Allow to mix & match with other KNIME functionality – Data wrangling KNIME Analytics Platform functionality – KNIME Big-Data Connectors – Text Mining, Image Processing, Cheminformatics, … – and more! © 2017 KNIME AG. All Rights Reserved. 9 H2O in KNIME Live Demo © 2017 KNIME AG. All Rights Reserved. 10 H2O in KNIME – Cross Validation © 2017 KNIME AG. All Rights Reserved. 11 H2O in KNIME – Cross Validation © 2017 KNIME AG. All Rights Reserved. 12 H2O in KNIME – Cross Validation © 2017 KNIME AG. All Rights Reserved. 13 H2O in KNIME – Parameter Optimization © 2017 KNIME AG. All Rights Reserved. 14 H2O in KNIME – Parameter Optimization © 2017 KNIME AG. All Rights Reserved. 15 H2O in KNIME – Nodes in KNIME 3.4 © 2017 KNIME AG. All Rights Reserved. 16 H2O in KNIME – What’s cooking? © 2017 KNIME AG. All Rights Reserved. 17 H2O in KNIME – What’s cooking? © 2017 KNIME AG. All Rights Reserved. 18 Thank you! © 2017 KNIME AG. All Rights Reserved. 19.