Build an Event Driven Machine Learning Pipeline on Kubernetes

Total Page:16

File Type:pdf, Size:1020Kb

Build an Event Driven Machine Learning Pipeline on Kubernetes Assign Hyperparameters Initial Model and Train Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Monitor DeployedDeployed Validate and Deploy ModelModel Build an Event Driven Machine Learning Pipeline on Kubernetes Yasushi Osonoi Animesh Singh Developer Advocate IBM STSM, IBM kubeflow kfserving maintainer osonoi animeshsingh Center for Open Source Improving Enterprise AI lifecycle in Open Source Data and AI Technologies (CODAIT) Code – Build and improve practical frameworks to enable more developers to realize immediate value. Content – Showcase solutions for complex and real-world AI problems. Community – Bring developers and data scientists to engage with IBM • Team contributes to over 10 open source projects • 17 committers and many contributors in Apache projects • Over 1100 JIRAs and 66,000 lines of code committed to Apache Spark itself; over 65,000 LoC into SystemML • Over 25 product lines within IBM leveraging Apache Spark • Speakers at over 100 conferences, meetups, unconferences and more CODAIT codait.org 3 DEVELOPER ADVOCATE in TOKYO Tokyo Team is a part of Worldwide Developer Advocate Teams! Developer Advocate City Leader WW Developer Advocate WW Developer Advocate Client Developer Advocate AKIRA ONISHI NORIKO KATO KYOKO NISHITO YASUSHI OSONOI Program Manager WW Developer Advocate WW Developer Advocate Digital Developer Advocate TOSHIO YAMASHITA TAIJI HAGINO AYA TOKURA JUNKI SAGAWA @taiponrock https://developer.ibm.com/patterns/ https://developer.ibm.com/jp/ Please follow me @osonoi IBM’s history of actively fostering balanced communities 19 1998 “For more than 20 years, IBM and Red Hat have paved the way for open communities to power innovative IT solutions.” – Red Hat © 2019 IBM Corporation 8 AI and machine leaning IBM recently open sourced some key technologies for AI, including: •The AI Fairness 360 toolkit (AIF360), an open source software toolkit that can help detect and remove bias in machine learning models •The Adversarial Robustness Toolbox for rapid crafting and analysis of attack and defense methods for machine learning models •Fabric for Deep Learning (FfDL, pronounced fiddle), a deep learning platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes This space is as hot as they come, so look for more open source innovation from IBM in the months to come. Progress in Deep Learning IBM Watson IBM Deep Blue AlphaGo chess Jeopardy Apple’s Facebook’s Siri gets releases Siri face deep learning recognition 1997 AlexNet … 1997 2011 2012 2015 2016 2017 Introduced deep learning with GPUs 10 Progress in Deep Learning IBM Watson IBM Deep Blue AlphaGo chess Jeopardy Apple’s Facebook’s Siri gets releases Siri face deep learning recognition AlexNet 2011 … 1997 2011 2012 2015 2016 2017 Introduced deep learning with GPUs 11 Progress in Deep Learning IBM Watson IBM Deep Blue AlphaGo chess Jeopardy Apple’s Facebook’s Siri gets releases Siri face deep learning recognition AlexNet … 1997 2011 2012 2015 2016 20172017 Introduced deep learning with GPUs 12 IBM Cloud / Watson and Cloud Platform / © 2018 IBM Corporation 13 Progress in Deep Learning IBM Watson IBM Deep Blue AlphaGo chess Jeopardy Apple’s Facebook’s Siri gets releases Siri face deep learning recognition AlexNet … 1997 2011 2012 2015 2016 2017 Introduced deep learning with GPUs IBM Cloud / Watson and Cloud Platform / © 2018 IBM Corporation 14 Deep Learning = Training Artificial Neural Networks • 25 million “neurons” • 100 million connections (parameters) A human brain has: • 200 billion neurons • 32 trillion connections between them IBM Cloud / Watson and Cloud Platform / © 2018 IBM Corporation 15 Perception *Source: Hidden Technical Debt in Machine Learning Systems In reality … *Source: Hidden Technical Debt in Machine Learning Systems Neural Network Design Workflow domain data design Performance neural network HPO meets needs? • neural network structure • hyperparameters Start another experiment optimal hyperparameters NO Neural Network Design Workflow domain data design Performance neural network HPO meets needs? • neural network structure • hyperparameters Start another experiment optimal yes hyperparameters NO trained BAD Still model good! Cloud deploy evaluat e Let`s understand it from the context of an AI Lifecycle AI PLATFORM 20 We need a Cloud native AI Platform to build,AIOps train, deploy and monitor Models Assign Hyperparameters Initial Model and Train Create Model Prepared and AI Trained Analyzed Model Data PLATFORM Monitor Deployed Validate and Deploy Model Many tools available to build initial modelsAIOps Assign Hyperparameters Initial Model and Train Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Monitor DeployedDeployed Validate and Deploy ModelModel Neural Network Modeller within Watson Studio An intuitive drag-and-drop, no-code interface for designing neural network structure https://developer.ibm.com/tutorials/create-and-experiment-with-dl-models-using-nn-modeler/ https://developer.ibm.com/patterns/validate-deep-learning-models/ March 30 2018 / © 2018 IBM Corporation 24 Many tools to train machine learning andAIOps deep learning models Assign Hyperparameters Initial Model and Train Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Monitor DeployedDeployed Validate and Deploy ModelModel Training is accomplished. Model is readyAIOps – Can we trust it? Assign Hyperparameters Initial Model and Train Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Monitor DeployedDeployed Validate and Deploy Can the model be trusted? ModelModel What does it take to trust a decision made by a machine? (Other than that it is 99% accurate)? #21, #32, #93 #21, #32, #93 Is it easy to Did anyone Is it fair? understand? tamper with it? Is it accountable? Our vision for Trusted AI Pillars of trust, woven into the lifecycle of an AI application FAIRNESS EXPLAINABILITY ROBUSTNESS ASSURANCE supported by an instrumented platform AIOpenScale So let`s start with vulnerability detectionAIOps of Models? Assign Hyperparameters Initial Model and Train Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Monitor Is the model vulnerable to DeployedDeployed Validate and Deploy ModelModel adversarial attacKs? Enter: Adversarial Robustness Toolbox AIOps Assign Hyperparameters Initial Model and Train Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Monitor DeployedDeployed Validate and Deploy ModelModel ART The Adversarial Robustness Toolbox contains implementations of the following attacks: Deep Fool (Moosavi-Dezfooli et al., 2015) IBM Adversarial Robustness Fast Gradient Method (Goodfellow et al., 2014) Jacobian Saliency Map (Papernot et al., 2016) Universal Perturbation (Moosavi-Dezfooli et al., 2016) Virtual Adversarial Method (Moosavi-Dezfooli et al., Toolbox 2015) C&W Attack (Carlini and Wagner, 2016) NewtonFool (Jang et al., 2017) https://github.com/IBM/adversarial-robustness- toolbox The following defense methods are also supported: Feature squeezing (Xu et al., 2017) Spatial smoothing (Xu et al., 2017) Label smoothing (Warde-Farley and Goodfellow, 2016) ART Adversarial training (Szegedy et al., 2013) ART is a library dedicated to adversarial Virtual adversarial training (Miyato et al., 2017) machine learning. Its purpose is to allow rapid crafting and analysis of attack and defense methods for machine learning models. The Adversarial Robustness Toolbox provides an implementation for many state-of-the-art methods for attacking and defending classifiers. 31 Implementation for state-of-the-art methods for attacking and defending classifiers. Evasion attacks Evasion defenses Poisoning detection Robustness metrics • FGSM • Feature squeezing • Detection based on • CLEVER clustering activations • JSMA • Spatial smoothing • Empirical robustness • Proof of attack strategy • BIM • Label smoothing • Loss sensitivity • PGD • Adversarial training Evasion detection • Carlini & Wagner • Virtual adversarial Unified model API training • Detector based on • DeepFool inputs • Training • Thermometer encoding • NewtonFool • Detector based on • Prediction • Gaussian data activations • Universal perturbation augmentation • Access to loss and prediction gradients 32 ART Demo: https://art-demo.mybluemix.net/ 33 https://developer.ibm.com/patterns/integrate-adversarial-attacks-model-training-pipeline/ March 30 2018 / © 2018 IBM Corporation 34 https://developer.ibm.com/jp/patterns/integrate-adversarial-attacks-model-training-pipeline/ March 30 2018 / © 2018 IBM Corporation 35 Robustness check accomplished. How doAIOps we check for bias throughout lifecycle? Assign Hyperparameters Initial Model and Train Is the dataset biased? Create Model PreparedPrepared andand Trained AnalyzedAnalyzed Model DataData Are predictions Are model weights biased? Monitor biased? DeployedDeployed Validate and Deploy ModelModel Unwanted bias and algorithmic fairness Machine learning, by its very nature, is always a form of statistical discrimination Discrimination becomes objectionable when it places certain privileged groups at systematic advantage and certain unprivileged groups at systematic disadvantage Illegal in certain contexts Unwanted bias and algorithmic fairness Machine learning, by its very nature, is always a form of statistical discrimination Unwanted bias in training data yields models with unwanted bias that scale out Prejudice in labels Undersampling or oversampling 39 40 Supported bias mitigation algorithms Optimized Preprocessing
Recommended publications
  • Running ML/DL Workloads Using Red Hat Openshift Container Platform V3.11 Accelerate Your ML/DL Projects Platform Using Kubeflow and NVIDIA Gpus
    Running ML/DL Workloads Using Red Hat OpenShift Container Platform v3.11 Accelerate your ML/DL Projects Platform using Kubeflow and NVIDIA GPUs January 2021 H17913.2 White Paper Abstract This white paper describes how to deploy Kubeflow v0.5 on Red Hat OpenShift Container Platform and provides recommendations for achieving optimal performance for ML/DL workloads using the latest NVIDIA Tesla GPUs. Dell Technologies Solutions Copyright The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel, the Intel logo, the Intel Inside logo and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. Other trademarks may be trademarks of their respective owners. Published in the USA 01/21 White Paper H17913.2. Dell Inc. believes the information in this document is accurate as of its publication date. The information is subject to change without notice. 2 Running ML/DL Workloads Using Red Hat OpenShift Container Platform v3.11 Accelerate your ML/DL Projects Platform using Kubeflow and NVIDIA GPUs White Paper Contents Contents Executive summary ........................................................................................................................ 4 Solution architecture ...................................................................................................................... 6 GPU accelerated TensorFlow training using TFJobs .................................................................. 9 Installing the GPU device plug-in ...............................................................................................
    [Show full text]
  • WORLD FINTECH REPORT 2018 Contents 4 Preface
    in collaboration with WORLD FINTECH REPORT 2018 Contents 4 Preface 8 Executive Summary FinTechs Are Redefining the Financial Services Customer Journey 13 Competition and Rising Expectations Spur 14 Customer-Centricity Push – Identify gaps left by traditional Financial Services providers and explore changing customer expectations. Emerging Technologies Enable Customer 19 Journey Transformation – Data and insights are reshaping personalized experiences through automation and distributed ledger technology. Alignment with Customer Goals, Creation of 27 Trust, and Delivery of Digital, Agile, and Efficient Processes Are Catalysts for Success – Firms are driving innovation and operational excellence through agile and digital teams. World FinTech Report 2018 The Symbiotic Relationship between FinTechs and Traditional Financial Institutions 35 FinTech and Incumbent Firms’ Respective 36 Competitive Advantages and Shortcomings Make Collaboration a Logical Fit – A partnership ecosystem of FinTechs, incumbents, and other vendors fosters a win-win for stakeholders. Finding the Right Partners for Collaboration 44 Is Essential – Maintaining and accelerating scale is a common FinTech firm struggle, so the right collaboration partner is critical. Successful Collaboration Requires Commitment 49 and Agility from FinTechs and Incumbents – Selection of the appropriate engagement model boosts FinTech scale-up efforts. The Path Forward: An Impending Role for BigTechs? 60 – BigTechs could have massive impact on the Financial Services industry. Preface Once rather homogenous and somewhat staid, the financial services marketplace has transformed into a dynamic milieu of bar-raising specialists. These new-age professionals are devoted to meeting and exceeding the expectations of consumers who have become accustomed to personalized services from industries such as retail, travel, and electronics. Financial services customers no longer rely on one or two firms.
    [Show full text]
  • Postgres List All Tables in All Schema
    Postgres List All Tables In All Schema Coronal and louche Jonathan still cannibalise his goblin unheedingly. Motivated and marvelous Yance hard-wearing:kneecap her linchpin she bridling anesthetized showily andwhile balloting Darrell decimatedher girlhood. some aliyah intensely. Jeffry is Conditional by not in postgres schemas which are made free consultation with The list all schemas live rows of tables in schemas in our case insensitive names exist in a database host itself, and worse yet accurate counts are. Arm full stack exchange for postgres installed in southeast asia a postgres list all tables in schema. The live rows into your schema list views when you get! Very useful meaning that use one schema and other sites, postgres service for all tables! Sqlalchemy authors and foreign data separate privacy notice through either drop schemas are referenced by using restoro by revoking them. This approach we use for other kinds of varying levels of schema list in postgres database. Other views are currently looking at wellesley college studying media arts and all tables in postgres schema list of. Create or if there are retrieved either exactly the tables in postgres list all schema names with the database? True, render a FULL OUTER JOIN, type of an OUTER JOIN. Registry for storing, managing, and securing Docker images. University College London Computer Science Graduate. Subscribe you receive weekly cutting edge tips, strategies, and news when need to snap your web business. All occurences of postgres databases on a followup post, and users in postgres all schema list tables? You are commenting using your Twitter account.
    [Show full text]
  • Application Development with Azure
    Application Development with Azure Karim Vaes Specialist – Azure Application Development 0032 497 219577 @kvaes Agenda • Kubernetes Kubernetes Kubernetes momentum “By 2020, more than 50% of enterprises Larger companies will run mission-critical, containerized are leading the cloud-native applications in production.” adoption. 77% For the organizations running Kubernetes today, 77%1 of those with more than 1,000 developers are running it in production. 1Heptio: state of Kubernetes 2018 What’s behind the growth? Kubernetes: the leading orchestrator shaping the future app development and management It’s widely used It’s vendor-neutral It’s community-supported Kubernetes is in production for A variety of cloud providers There’s a huge community of active global companies across industries1 offer robust Kubernetes support contributors supporting Kubernetes3 24,000 1.1 million contributors contributions since 2016 since 2016 1Kubernetes.io. “Kubernetes User Case Studies.” 2CNCF. “Kubernetes Is First…” 3CNCF. Keynote address. Azure Kubernetes Service (AKS) Ship faster, operate easily, and scale confidently with managed Kubernetes on Azure Manage Kubernetes Accelerate Build on an Run anything, with ease containerized enterprise-grade, anywhere development secure foundation Top scenarios for Kubernetes on Azure Lift and shift Machine Microservices IoT Secure DevOps to containers learning Cost saving Agility Performance Portability Automation without refactoring Faster application Low latency Build once, Deliver code faster and your app development processing run anywhere securely at scale Azure Kubernetes momentum Trusted by thousands of customers 30x Azure Kubernetes Service usage grew 30x since it was made generally available in June 2018 Dated November 2018 How Kubernetes works Kubernetes control Worker node Internet kubelet kube-proxy 1.
    [Show full text]
  • Running Large-Scale Machine Learning Experiments in the Cloud
    O’REILLY Artificial Intelligence Conference, San Jose 2019 Running large-scale machine learning experiments in the cloud Shashank Prasanna, Amazon Web Services @shshnkp September 12, 2019 © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Machine Learning, experiments and the scientific method • Scaling challenges with machine learning setups • Containers technologies for large-scale machine learning • Demo 1: Hyperparameter optimization with Amazon SageMaker • Demo 2: Custom designed experiments with Amazon SageMaker • Demo 3: Hyperparameter optimization with Kubernetes, Kubeflow and Katib • Summary and additional resources © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Experimentation in machine learning Data acquisition Data preparation for Design and run curation and labeling training experiments Model optimization Distributed Deployment and validation training © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why do we run experiments? The scientific method Question Hypothesis empirical procedure to determine Experiment whether observations agree or conflict with our hypothesis Interpret Iterate © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why do we run machine learning experiments? Machine learning researcher may want to: Question • Identify factors that affect model performance Hypothesis • Explain the root causes of performance Experiment • Choose between alternate models Interpret • Study variability and robustness of
    [Show full text]
  • VGP Fact Sheet
    Final 2013 VGP Fact Sheet U.S. Environmental Protection Agency 2013 Final Issuance of National Pollutant Discharge Elimination System (NPDES) Vessel General Permit (VGP) for Discharges Incidental to the Normal Operation of Vessels Fact Sheet Agency: Environmental Protection Agency (EPA) Action: Notice of NPDES General Permit Page 1 of 198 Final 2013 VGP Fact Sheet TABLE OF CONTENTS 1. General Information ...................................................................................................................9 1.1. Does this Action Apply to Me? ........................................................................................9 1.2. Further Information ...........................................................................................................9 2. Background ................................................................................................................................9 2.1. The Clean Water Act ........................................................................................................9 2.2. Legal Challenges .............................................................................................................10 2.3. Congressional Legislation ...............................................................................................11 2.4. General Permits ...............................................................................................................12 2.5. Public Comment on EPA’s Proposed VGP ....................................................................13
    [Show full text]
  • Phase 2.1 Report
    Phase 2.1 Report DOE Award: DE-EE0002777 AltaRock Energy, Inc. March 10, 2014 Contributing Authors AltaRock Energy Trenton T. Cladouhos, Susan Petty, Yini Nordin, Geoff Garrison, Matt Uddenberg, Michael Swyer, Kyla Grasso Consultants and Sub-recipients Paul Stern (PLS Environmental) Eric Sonnenthal (LBNL) Dennise Templeton (LLNL) Pete Rose (EGI) Gillian Foulger and Bruce Julian (Foulger Consulting) Acknowledgment: This material is based upon work supported by the Department of Energy under Award Number DE-EE0002777. Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. Table of Contents Table of Figures ...........................................................................................................................................
    [Show full text]
  • Chapter 1 - Overview
    Reference architecture Chapter 1 - Overview This document provides a complete reference architecture guide for infrastructure required for machine learning use cases. This document specifically covers open source machine learning software that includes Kubeflow using Charmed Kubernetes solution on Dell EMC hardware delivered by Canonical. This includes Dell EMC PowerEdge servers for workloads and storage based on Intel® Xeon® Scalable Processors and Dell EMC Networking. This guide discusses the Dell EMC hardware specifications and the tools and services to set up both the hardware and software, including the foundation cluster and the Kubernetes cluster to run machine learning workloads. It also covers other tools used for the monitoring and management of the cluster in detail and how all these components work together in the system. The guide also provides the deployment steps and references to configuration and automation scripts developed by Dell EMC and Canonical for the deployment process. Document ID Revisions Revisions Date Description March 2020 Initial release Acknowledgements This paper was produced by the following: Author: Andrey Grebennikov Support: Other: The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
    [Show full text]
  • Kubernetes & AI with Run:AI, Red Hat & Excelero
    Kubernetes & AI with Run:AI, Red Hat & Excelero AI WEBINAR Date/Time: Tuesday, June 9 | 9 am PST What’s next in technology and innovation? Kubernetes & AI with Run:AI, Red Hat & Excelero AI WEBINAR Your Host: Presenter: Presenter: Presenter: Tom Leyden Omri Geller Gil Vitzinger William Benton VP Marketing CEO & Co-Founder Software Developer Engineering Manager Kubernetes for AI Workloads Omri Geller, CEO and co-founder, Run:AI A Bit of History Needed flexibility Reproducibility and and better utilization portability Bare Metal Virtual Machines Containers Containers scale easily, they’re lightweight and efficient, they can run any workload, are flexible and can be isolated …But they need orchestration 2 Enter Kubernetes Track, Create Efficient Execute Across Schedule and Cluster Different Operationalize Utilization Hardware 3 Today, 60% of Those Who Deploy Containers Use K8s for Orchestration* *CNCF 4 Now let’s talk about AI Computing Power Fuels Development of AI Deep Learning Classical Machine Learning Manual Engineering 6 Artificial Intelligence is a Completely Different Ballgame New Distributed Experimentation accelerators computing R&D 7 Data Science Workflows and Hardware Accelerators are Highly Coupled Data Hardware scientists accelerators Constant Workflow Under-utilized hassles Limitations GPUs 8 This Leads to Frustration on Both Sides Data Scientists are IT leaders are frustrated – speed and frustrated – GPU productivity are low utilization is low 9 AI Workloads are Also Built on Containers NGC – Nvidia pre-trained models for
    [Show full text]
  • Tensorflow 2.0 and Kubeflow for Scalable and Reproducable Enterprise Ai
    TENSORFLOW 2.0 AND KUBEFLOW FOR SCALABLE AND REPRODUCABLE ENTERPRISE AI Romeo Kienzler1, 2, Holger Kyas2, 3, 4 1IBM Center for Open Source Data and AI Technologies, 505 Howard St, San Francisco, CA, USA 2Berne University of Applied Sciences, Technology and Informatics, Wankdorffeldstrasse 102, 3014 Berne, Switzerland 3Open Group, 548 Market St #54820, San Francisco, CA 94104-5401 4Helvetia Insurance Switzerland, St. Alban-Anlage 26, 4002 Basel, Switzerland ABSTRACT Towards the End of 2015 Google released TensorFlow 1.0, which started out as just another numerical library, but has grown to become a de-facto standard in AI technologies. TensorFlow received a lot of hype as part of its initial release, in no small part because it was released by Google. Despite the hype, there have been complaints on usability as well. Especially, for example, the fact that debugging was only possible after construction of a static execution graph. In addition to that, neural networks needed to be expressed as a set of linear algebra operations which was considered as too low level by many practitioners. PyTorch and Keras addressed many of the flaws in TensorFlow and gained a lot of ground. TensorFlow 2.0 successfully addresses these complaints and promises to become the go-to framework for many AI problems. This paper introduces the most prominent changes in TensorFlow 2.0 targeted towards ease of use followed by introducing TensorFlow Extended Pipelines and KubeFlow in order to illustrate the latest TensorFlow and Kubernetes ecosystem movements towards simplification for large scale Enterprise AI adoption. KEYWORDS Artificial Intelligence, TensorFlow, Keras, Kubernetes, KubeFlow, TFX, TFX Pipelines 1.
    [Show full text]
  • Using Ipus from Docker Release 1.4.0
    Using IPUs from Docker Release 1.4.0 Graphcore Ltd Dec 08, 2020 CONTENTS 1 Introduction 1 2 Initial setup 2 3 Using gc-docker 3 3.1 Loading docker images.............................................3 3.2 Verifying IPU access from inside container..................................4 3.3 Mounting directories from the host......................................5 3.4 Setting environment variables.........................................5 4 Running a TensorFlow application on an IPU6 5 Extending the images 7 6 Further reading 8 7 Trademarks & copyright 9 i CHAPTER ONE INTRODUCTION This guide explains how you can run applications in Docker on a Linux machine with one or more physical IPU devices. Prerequisites: • A machine with IPU devices • Ubuntu 18.04 / CentOS 7.6 1 CHAPTER TWO INITIAL SETUP First check if your machine has the IPU device driver installed. You can check this is loaded and running with the following command: $ modinfo ipu_driver If the driver is are installed and running, you should see something similar to: $ modinfo ipu_driver filename: /lib/modules/4.15.0-55-generic/updates/dkms/ipu_driver.ko version: 1.0.39 description: IPU PCI Driver author: Graphcore Limited license: GPL srcversion: 49FFB7D8556EB58899AE41A alias: pci:v00001D95d00000003sv*sd*bc*sc*i* alias: pci:v00001D95d00000002sv*sd*bc*sc*i* alias: pci:v00001D95d00000001sv*sd*bc*sc*i* depends: retpoline: Y name: ipu_driver vermagic: 4.15.0-55-generic SMP mod_unload parm: memmap_start:array of ulong parm: memmap_size:array of ulong If so, proceed to the next section. If it returns an error along the lines of: $ modinfo ipu_driver modinfo: ERROR: Module ipu_driver not found. You will need to install the driver.
    [Show full text]
  • Reference Architecture for Kubeflow on Openshift Accelerate ML/DL Workloads Using Kubeflow and Poweredge Servers
    Solution brief Reference Architecture for Kubeflow on OpenShift Accelerate ML/DL workloads using Kubeflow and PowerEdge servers Artificial intelligence (AI) and its subsets, machine learning (ML) and deep learning (DL), Customer are gaining widespread adoption for use cases such as computer vision, speech recognition and natural language processing (NLP). Integrating production‑grade AI technologies Results in well‑defined platforms within the protection of the data center can facilitate wider adoption of advanced computing, extending investments by supporting AI use cases as well as augmenting the resources available to data science teams. 2 hours vs. But not every IT organization has the time and resources to research, integrate and test all the components required to deploy a customized system to run AI workloads. Building 9 months a production‑ready AI system involves combining various components, often from different to run analysis1 vendors, and integrating and managing these disparate pieces. As well, deployments are often tied to a specific cluster, so that moving models — for example, between laptops and cloud clusters or from DevOps to production and back — significantly increases 218% ROI complexity and the chance for human errors. over 3 years2 Kubeflow together with the Red Hat® OpenShift® Container Platform help address these challenges. Kubeflow is an open‑source Kubernetes®‑native platform designed 20 million to accelerate ML workloads. It’s a composable, scalable, portable stack that includes images used to train a components and automation features to integrate ML tools, so they work together to deep neural network3 create a cohesive pipeline that makes it easy to deploy ML applications at scale.
    [Show full text]