Machine Learning Logistics Book by Ted Dunning & Ellen Friedman © 2018 (O’Reilly)

Beyond the Algorithm: What Makes Machine Learning Work? Ellen Friedman, PhD 11 June 2018 Berlin Buzzwords #bbuzz © 2018 Ellen Friedman 1 Contact Information Ellen Friedman, PhD Principal Technologist, MapR Technologies Committer Apache Drill & Apache Mahout projects O’Reilly author Email [email protected] [email protected] Twitter @Ellen_Friedman #bbuzz © 2018 Ellen Friedman 2 What makes machine learning work? © 2018 Ellen Friedman 3 + = ? © 2018 Ellen Friedman 4 Data Engineer Ian Downard Had never tried machine learning, but he caught the bug… © 2018 Ellen Friedman 5 Image Recognition: Which Bird Is This? Rhode Island Red Buff Orpington Jay Image from Wikipedia & used under Creative Image from Wikipedia & used under Creative Image from Wikipedia & used under Creative Commons https://en.wikipedia.org/wiki/ Commons https://upload.wikimedia.org/wikipedia/ Commons https://en.wikipedia.org/wiki/ File:Rhode_Island_Red_cock,_cropped.jpg commons/7/74/ Aphelocoma#/media/ Barred_Plymouth_Rock_Rooster_001.jpg File:WesternScrubJay2.jpg © 2018 Ellen Friedman 6 More to the point… Chicken Chicken Not a Chicken Image from Wikipedia & used under Creative Image from Wikipedia & used under Creative Image from Wikipedia & used under Creative Commons https://en.wikipedia.org/wiki/ Commons https://upload.wikimedia.org/wikipedia/ Commons https://en.wikipedia.org/wiki/ File:Rhode_Island_Red_cock,_cropped.jpg commons/7/74/ Aphelocoma#/media/ Barred_Plymouth_Rock_Rooster_001.jpg File:WesternScrubJay2.jpg © 2018 Ellen Friedman 7 Domain Knowledge Matters Chicken Chicken Predator Image from Wikipedia & used under Creative Image from Wikipedia & used under Creative Image from Wikipedia & used under Creative Commons https://en.wikipedia.org/wiki/ Commons https://upload.wikimedia.org/wikipedia/ Commons https://en.wikipedia.org/wiki/ File:Rhode_Island_Red_cock,_cropped.jpg commons/7/74/ Aphelocoma#/media/ Barred_Plymouth_Rock_Rooster_001.jpg File:WesternScrubJay2.jpg © 2018 Ellen Friedman 8 Tensor Chicken Deep learning project using Gather Label Labeled training training image ﬁles Inception v3 model from data data TensorFlow (see blog + @tensorchicken) Run the Deploy Train model model model Update model © 2018 Ellen Friedman 9 Value from ML: what about SLA’s? How long does it take for image recognition model to classify image? • ~30 seconds because just running on a Raspberry Pi How long does it take for Scrub Jay to peck an egg? • < 30 seconds • Oops… © 2018 Ellen Friedman 10 Value from ML: what about action? When image classification indicates a jay in the henhouse, what would you do to chase away the predator? • Not sure yet. That’s a problem… © 2018 Ellen Friedman 11 What makes a difference for impact? Left: labelled as Buff Orpington. Right: This what a Buff Orpington really looks like. But…this error in domain knowledge (wrong name) did not matter for SLAs. Image from Wikipedia & used under Creative Image from Wikipedia used under Creative Commons Commons https://upload.wikimedia.org/wikipedia/ https://en.wikipedia.org/wiki/Orpington_chicken#/media/ commons/7/74/ File:Coq_orpington_fauve.JPG Barred_Plymouth_Rock_Rooster_001.jpg © 2018 Ellen Friedman 12 What lessons can we learn from this toy project? • Image recognition & deep learning are cool • In some cases, building or training a model is simple • Domain knowledge matters (really) • Pay attention to SLAs • For real business value, have a plan of action in response to machine learning insights (producing a report ≠ taking an action) • Software engineers have a role in machine learning (!) © 2018 Ellen Friedman 13 What about real world examples? © 2018 Ellen Friedman 14 Domain Knowledge Matters: Video Recommender • Use clicks as input data: recommender gives poor performance – Model is testing the wrong preferences: how well people liked titles • Use first 30 seconds of viewing as input data: recommender performance is good – Model now tests how well people liked the videos, not just the titles © 2018 Ellen Friedman 15 Domain Knowledge Matters: Detecting Security Attacks Security expert at a bank preserved headers for web site requests © 2018 Ellen Friedman 16 Spot the Important Difference? GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1GET /photo.jpg HTTP/1.1 Host: www.sometarget.com Host: lh4.googleusercontent.com User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0 Accept-Encoding: deflate Accept: image/png,image/*;q=0.8,*/*;q=0.5 Accept-Charset: UTF-8 Accept-Language: en-US,en;q=0.5 Accept-Language: fr Accept-Encoding: gzip, deflate, br Cache-Control: no-cache Referer: https://www.google.com Pragma: no-cache Connection: keep-alive Connection: Keep-Alive If-None-Match: "v9” Cache-Control: max-age=0 Attacker request Real request © 2018 Ellen Friedman 17 Spot the Important Difference? GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1GET /photo.jpg HTTP/1.1 Host: www.sometarget.com Host: lh4.googleusercontent.com User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0 Accept-Encoding: deflate Accept: image/png,image/*;q=0.8,*/*;q=0.5 Accept-Charset: UTF-8 Accept-Language: en-US,en;q=0.5 Accept-Language: fr Accept-Encoding: gzip, deflate, br Cache-Control: no-cache Referer: https://www.google.com Pragma: no-cache Connection: keep-alive Connection: Keep-Alive If-None-Match: "v9” Cache-Control: max-age=0 Attacker request Real request © 2018 Ellen Friedman 18 Another Example GET photo.jpg HTTP/1.1 GET cc/borken.json HTTP/1.1 Host: lh4.googleusercontent host: c.qrs.my User-agent: Mozilla/5.0 (Ma user-agent: Mozilla/4.0 (co Accept: image/png,image/* accept: application/json, t Accept-language: en-US,en accept-language: en-US,en Accept-encoding: gzip, defl accept-encoding: gzip, defl Referer: https://www.google referer: none Connection: keep-alive connection: keep-alive If-none-match: "v9” if-none-match: "v9” Cache-control: max-age=0 cache-control: max-age=0 Real request Attacker request © 2018 Ellen Friedman 19 Another Example GET photo.jpg HTTP/1.1 GET cc/borken.json HTTP/1.1 Host: lh4.googleusercontent host: c.qrs.my User-agent: Mozilla/5.0 (Ma user-agent: Mozilla/4.0 (co Accept: image/png,image/* accept: application/json, t Accept-language: en-US,en accept-language: en-US,en Accept-encoding: gzip, defl accept-encoding: gzip, defl Referer: https://www.google referer: none Connection: keep-alive connection: keep-alive If-none-match: "v9” if-none-match: "v9” Cache-control: max-age=0 cache-control: max-age=0 Real request Attacker request © 2018 Ellen Friedman 20 Domain Knowledge Matters: Detecting Security Attacks Security expert at a bank preserved headers for web site requests • Detected anomaly in headers for the attackers vs normal (real) requests • Pattern of behavior for attackers was allowable for headers and it was not predictable: but it was different © 2018 Ellen Friedman 21 Keep data: You don’t know what you’ll need to know later © 2018 Ellen Friedman 22 Big Industry, Big Data, Big Value . All other mages © E. Friedman Image courtesy Mtell used with permission ©WesAbrams © 2018 Ellen Friedman 23 Simple But Valuable: Accounting Audit Targeting Big industrial company with a lot of machinery • Tracking actions & parts – label as repairs, contracted services, delivery of supplies, etc. • Which are to be taxed, expensed or counted as revenues – Mislabelling can cost millions of dollars • Use machine learning to target potential mislabelling for audit review • Relatively simple models (pattern matching; exception detection) deliver a huge business value. © 2018 Ellen Friedman 24 Is it the algorithm? the model? the ML tool? https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/ © 2018 Ellen Friedman 25 90% of the effort in successful machine learning isn’t the algorithm or the model… It’s the logistics © 2018 Ellen Friedman 26 What Does Streaming Do for You? Surfer on standing wave, Munich Image © 2017 Ellen Friedman © 2018 Ellen Friedman 27 Stream transport supports microservices © 2018 Ellen Friedman 28 At the Heart: Message Transport B Patient Facilities EMR management With the right messaging tool at the heart of stream-1st Real-time analytics architecture you support other Medical test classes of use cases (B & C) results A Insurance Medical tests audit C © 2018 Ellen Friedman 29 Stream Transport that Decouples Producers & Consumers P C Kafka / P C MapR Streams P C Transport Processing Good stream transport is persistent, performant & pervasive! © 2018 Ellen Friedman 30 Streaming Microservices • “Streaming Microservices” by Ted Dunning & Ellen Friedman, chapter in Encyclopedia of Big Data Technologies, Sherif Sakr and Albert Zomaya, editors, © 2018 (Springer International Publishing) • Chapter 3 of Streaming Architecture by Ted Dunning & Ellen Friedman © 2016 (O’Reilly Media) https://mapr.com/ebooks/streaming-architecture/chapter-03-streaming-platform-for- microservices.html © 2018 Ellen Friedman 31 Get rid of the myth of the unitary model © 2018 Ellen Friedman 32 Streaming microservices provides flexibility & independence to manage many models © 2018 Ellen Friedman 33 Logistics for machine learning can be difficult • Just getting the training & input data is hard • Many models to manage • Model-to-model evaluation needs to be convenient & accurate • Respond as the world changes: Deploy to production with agile roll out & roll • There’s a need for good data

Machine Learning Logistics Book by Ted Dunning & Ellen Friedman © 2018 (O’Reilly)

Learning Apache Mahout Classification Table of Contents

Apache Mahout User Recommender

Workshop- Matrix Math at Scale with Apache Mahout and Spark

Hadoop Operations and Cluster Management Cookbook

Full-Graph-Limited-Mvn-Deps.Pdf

Apache Giraph 3

Front Matter Template

Licensing Information User Manual Release 4 (4.9) E87604-01

Tika/Tika Overview.Htm Copyright © Tutorialspoint.Com

Code Smell Prediction Employing Machine Learning Meets Emerging Java Language Constructs"

Apache Mahout Scalable Machine Learning and Data Mining

ISSN 2348 – 0319 International Journal of Innovative and Applied Research (2017) 12-35 Volume 5, Issue 9