Hacking AI on the Cloud

Lisa Amini, PhD
Lab Director, IBM Research Cambridge

© 2018 IBM Corporation

We are in a period of rapid innovation and emerging adoption of data science and AI. Enterprise AI presents unique and challenging requirements.

Challenges of Building and Deploying AI Models Today

Training
• “Roll your own” home-brew environments
• Stateful, compute-intensive execution at odds with cloud-native design
• Stresses cloud networking, storage, and hardware
• Open source components evolving at different rates

Deployment
• Testing and debugging neural nets is an active research topic
• Model evolution based on feedback is unavailable
• Enterprise readiness for compliance and traceability is not well understood

Inference
• Must handle streaming data
• Near-real-time response required, though inferencing on large networks is compute intensive
• Must be able to run in the cloud and at the edge

Data Scientists and Business Owners

Data scientists want to:
• Use DL frameworks they love
• Train large models with big data
• Rapidly experiment with different models
• Get the best accuracy

Business owners want to:
• Provide cognitive capabilities to my users
• Rapidly innovate product features
• Focus on business logic
• Access cutting edge AI
• Be adaptive and flexible

Don’t want to:
• Configure machines with GPUs
• Manage the frameworks
• Handle failure and recovery
• Manage load balancing, storage, security and logging
• Manage distributed processing

3 approaches for building AI

1. Pre-trained AI: pre-trained model + app developer or SME
2. Transfer learning: pre-trained model + your domain data + app developer or SME
3. Custom AI: custom model + your domain data + data scientist

Services for each approach:
1. Pre-trained AI: Watson Visual Recognition, Natural Language Understanding, Watson Speech to Text, Watson Text to Speech, …
2. Transfer learning: Watson Visual Recognition, Natural Language Classifier, Watson Speech to Text, …
3. Custom AI: Watson Studio, Watson Machine Learning, Deep Learning as a Service, …

Watson Developer Cloud: Pretrained AI

Language
– * Watson
– Personality Insights
– Natural Language Classifier
– Tone Analyzer
– Language Translator
– Natural Language Understanding / AlchemyLanguage
  • Includes Emotion analysis

Data Insights
– * Discovery
– Discovery News

Speech
– Speech to Text
– Text to Speech

Vision
– Visual Recognition

* Higher level APIs that include tooling to simplify implementation

https://www.ibm.com/watson/products-services/
https://console.bluemix.net/catalog/?category=watson

For the latest view of Watson APIs available, go to: www/.com/watsondevelopercloud

Watson Studio, with Deep Learning

Create a Watson Studio account: https://www.ibm.com/cloud/watson-studio

Several ways to build custom AI

• “Non-coder/clicker” data scientists: interactive tools (SPSS Modeler, Data Refinery, Neural Network Modeler)
• “Coder” data scientists: programming frameworks, notebooks
• Researchers: vi, PyCharm, terminal

Deep Learning on the Cloud (DLaaS): Overview

DLaaS:

– A “serverless” user experience: focus on Neural Networks and data

– Adaptive deep learning as a service: cloud-native support for popular frameworks; elastic, resilient, and SLA driven

– Enabling quick time-to-value in a fast-evolving field by architecting for continuous experimentation.

DLaaS stack, from top to bottom, with the area of change each layer tracks:

• User experience and tools: AI Studio / Data Science Experience, Watson Services; Experiment Studio (visualization, multiple parallel jobs, autonn, model governance and management); API (WML): train/manage/watch
• Innovation in neural net design, advances in DL frameworks: DLaaS supports multiple popular DL frameworks
• Improvement in training techniques: high performance distribution
• Advances in the cloud stack: DLaaS container and resource management; GPU-enabled Kubernetes (Armada), Armada advanced scheduling (GPU enabled); infrastructure (SoftLayer)
• New hardware: CPUs, GPUs, new accelerators

https://dlaas-api.stage1.mybluemix.net
https://dlaas-guide.stage1.mybluemix.net

Deep Learning on the Cloud: Goal

Where data scientists run deep learning experiments

Run experiments, and provide model tuning and comparison tools to evaluate models across hundreds to thousands of hyperparameter configurations.

[Figure: experiment pipeline. Labels: data science teams; experiments (parameter permutation, model adjustments); model definition; parameter space (model hyperparameters, optimizer hyperparameters); Experimentation Studio; Parallel Model Tuner; DL runtime; training runs in containers; training artifacts (models, snapshots, losses, accuracies, logs); chosen model. Captions: "Model training distributed across containers" and "Multi-tenant container orchestration" (Kubernetes container orchestration on a server cluster with NVIDIA GPUs; dataset in Cloud Object Storage).]
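To make the "parameter permutation" idea concrete, here is a minimal Python sketch of how a parameter space expands into individual training runs and how a chosen model might be picked from the recorded accuracies and losses. It is an illustration only, not the DLaaS implementation: the search space, the function names (`enumerate_configs`, `choose_model`, `fake_training_run`) and the toy metrics are all hypothetical.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
MODEL_HPARAMS = {"num_layers": [2, 4, 8], "hidden_units": [128, 256]}
OPTIMIZER_HPARAMS = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [32, 64]}


def enumerate_configs(*spaces):
    """Yield one dict per permutation of all hyperparameter values."""
    merged = {}
    for space in spaces:
        merged.update(space)
    names = list(merged)
    for values in product(*(merged[name] for name in names)):
        yield dict(zip(names, values))


def fake_training_run(config):
    """Stand-in for a real training run; the metrics are made up."""
    # Toy objective: deeper nets and learning_rate near 0.01 score higher.
    accuracy = 0.5 + 0.04 * config["num_layers"] - abs(config["learning_rate"] - 0.01)
    return {"config": config, "accuracy": round(accuracy, 4), "loss": round(1 - accuracy, 4)}


def choose_model(results):
    """Pick the run with the highest accuracy, preferring lower loss on ties."""
    return max(results, key=lambda run: (run["accuracy"], -run["loss"]))


configs = list(enumerate_configs(MODEL_HPARAMS, OPTIMIZER_HPARAMS))
results = [fake_training_run(config) for config in configs]
best = choose_model(results)
print(len(configs), "configurations; chosen:", best["config"])
```

Even this small space of four hyperparameters produces 3 × 2 × 3 × 2 = 36 runs, which is why the runs are spread across containers rather than executed one after another.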

Key Concepts

Watson Machine Learning Service (WML)
• IBM WML is a comprehensive solution that is your gateway to various machine and deep learning technologies.
• Provides the following capabilities:
  - Automated model building
  - Model training and optimization
  - Model deployment
• To interact with WML, you need to create a WML instance.

Cloud Object Storage (COS)
• Object storage is the primary data resource used in the cloud, and it is also increasingly used for on-premise solutions.
• Inexpensive, scalable, self-healing storage for massive amounts of unstructured data.
• IBM Cloud Object Storage can support single objects as large as 10TB.
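As a rough illustration of what a 10TB object implies, the sketch below works out a part size for a multipart upload (the later Cloud Object Storage slide notes that the 10TB limit applies when using multipart uploads). The limits used here, at most 10,000 parts and at least 5 MiB per part, are the usual S3-style values and are an assumption; check the service documentation before relying on them. The function name `plan_multipart` is hypothetical, and 10TB is taken as 10 × 10^12 bytes.

```python
import math

MIB = 1024 ** 2

# Assumed S3-style limits, not taken from the slides.
MAX_PARTS = 10_000
MIN_PART_BYTES = 5 * MIB


def plan_multipart(object_bytes, max_parts=MAX_PARTS, min_part_bytes=MIN_PART_BYTES):
    """Return (part_size, part_count) for uploading object_bytes in parts."""
    if object_bytes <= 0:
        raise ValueError("object_bytes must be positive")
    # Smallest part size that keeps the upload within max_parts.
    part_size = max(min_part_bytes, math.ceil(object_bytes / max_parts))
    part_count = math.ceil(object_bytes / part_size)
    return part_size, part_count


ten_tb = 10 * 10 ** 12  # 10TB, decimal interpretation (assumption)
part_size, part_count = plan_multipart(ten_tb)
print(f"{part_count} parts of {part_size} bytes each")
```

Under these assumptions a 10TB object needs parts of about 1 GB each, so a large training dataset is better stored as many smaller objects when that is an option.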

Getting Started: Online Docs, Tutorials, Blogs, …

https://console.bluemix.net/docs/tutorials/index.html#tutorials

Questions?

Watson Studio: Supporting the end-to-end Enterprise AI workflow

1. Connect & Access Data: Connect and discover content from multiple data sources in the cloud or on premises. Bring structured and unstructured data to one toolkit.

2. Search and Find Relevant Data: Find data (structured, unstructured) and AI assets (e.g., ML/DL models, notebooks, Watson Data Kits) in the Knowledge Catalog with intelligent search, and give the right access to the right users.

3. Prepare Data for Analysis: Clean and prepare your data with Data Refinery, a tool to create data preparation pipelines visually. Use popular open source libraries to prepare unstructured data.

4. Build and Train ML/DL Models: Democratize the creation of ML and DL models. Design your AI models programmatically or visually with the most popular open source and IBM ML/DL frameworks, or leverage transfer learning on pre-trained models using Watson tools to adapt to your business domain. Train at scale on GPUs and distributed compute.

5. Deploy Models: Deploy your models easily and have them scale automatically for online, batch or streaming use cases.

6. Monitor, Analyze and Manage: Monitor the performance of the models in production and trigger automatic retraining and redeployment of models. Build Enterprise Trust with Bias Detection, Mitigation, Model Robustness and Testing Service, Model Security.

Cloud Object Storage

• Object storage is the primary data resource used in the cloud, and it is also increasingly used for on-premise solutions.
• Inexpensive, scalable, self-healing storage for massive amounts of unstructured data.
• IBM® Cloud Object Storage can support single objects as large as 10TB when using multipart uploads.

Compare Training Runs

DLaaS – Basic Flow

1. Create model in a DL framework supported by DLaaS (TensorFlow, …)
2. Store training data in object storage
3. Specify model metadata in manifest.yml: framework, …; resource requirements (GPUs, mem, num learners, …); pointer to training data (object store, s3, …)
4. Start a training job
5. Query training status and retrieve logs
6. Get trained model

DLaaS API: /v1/models
Kubernetes cluster runs training jobs

[Figure labels: Model Spec; Object storage; Training Request; Training logs, status; Trained Model]

How to access TensorBoard?

https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard

> python download_training_run_files.py TRAINING_RUN
> tensorboard --logdir=path/to/log-directory

View TensorBoard at http://localhost:6006

MIT-IBM Watson AI Lab

$240M, 10-year commitment to jointly create the future of AI

AI algorithms
• Learning and Reasoning
• Continuous, multi-task, small data, explanations, …

Physics of AI
• Analog AI
• AI & Quantum

Applications of AI to industries
• Healthcare, Life Sciences
• Cybersecurity

Advancing shared prosperity through AI
• Ethics of AI, fair, unbiased
• AI for Social Good

http://mitibmwatsonailab.mit.edu/