AWS Certified Machine Learning - Specialty Exam
AIM308
Build accurate training datasets with Amazon SageMaker Ground Truth

Warren Barkley, GM, Augmented AI, Amazon Web Services
Vikram Madan, Product Lead, Amazon Web Services
Kevin Dela Rosa, ML Engineer, Perception, Snap

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Data labeling is tedious and difficult
• Massive scale: ML models need large, labeled datasets
• High accuracy: ML models depend on accurately labeled data
As a result, building the training dataset takes up to 80% of a data scientist’s time

Amazon SageMaker Ground Truth
Build highly accurate training datasets
Reduce data labeling costs by up to 70%
▪ Built-in features to improve label accuracy
▪ Automated data labeling capability
▪ Access to multiple workforces: Amazon Mechanical Turk and third-party vendors
▪ Option to bring your own workers and secure handling of data
▪ Tight integration with Amazon SageMaker pipeline

More accurate and efficient data labeling
• Active learning model is trained from human-labeled data
• Human-labeled data is then sent back to retrain and improve the machine learning model
• An accurate training dataset is ready for use

Built-in data labeling workflows
• Image classification
• Bounding boxes
• Semantic segmentation
• Text classification
• Named entity recognition
• Custom

Custom data labeling workflows
Learn more @ https://amzn.to/2OsREAk

Human workforce options

Labeling job creation

Scan is powered by AI
Machine learning & computer vision are at the core of the scan:
• Image classification
• Object detection
• Semantic segmentation
• Content-based information retrieval
• Ranking
Etc.
Machine learning – bird’s-eye view
Dataset collection → Model training → Evaluation → Deployment

“Data is the new oil”
Many large-scale public datasets are available; for example:
• ImageNet (1M images)
• Open Images (9M images)
• Places (10M images)
[Diagram: ImageNet data → CNN → ImageNet label; new task data → pre-trained CNN → new output layer for target]

Data is great …
Getting to state of the art

    { "label": "Hand", "score": 0.99794453 }

Your application’s data is different

    { "label": "Hand", "score": 0.9703855 }

New problem
We have relevant images we can learn from, but no labels

Solution
Amazon SageMaker Ground Truth
What we liked:
• In AWS
    • We’re already using some AWS solutions in our training workflows (Amazon SageMaker training, Amazon S3), so it’s easy to point to data
• Speed
    • We can get images labeled on-demand quickly
• Flexibility to leverage public or private workforces to label data
• Ability to kick off labeling jobs programmatically or via UI

Integrating with Amazon SageMaker Ground Truth
Data labeling pipeline (Kubeflow): Get target images → Pre-processing → Submit Ground Truth job → Extract labels

    // The config values, names, and role ARN are defined elsewhere (not shown on the slide).
    smInput := sagemaker.CreateLabelingJobInput{
        HumanTaskConfig:          &humanTaskConfig,
        InputConfig:              &inputConfig,
        LabelAttributeName:       &attrName,
        LabelCategoryConfigS3Uri: &job.CategoryUri,
        LabelingJobName:          &fullJobName,
        OutputConfig:             &outputConfig,
        RoleArn:                  roleArn,
    }
    // Submit the labeling job to Ground Truth.
    client.CreateLabelingJob(&smInput)

Integrating with Amazon SageMaker Ground Truth
Data labeling pipeline: Get target image list → Pre-processing → Submit Ground Truth job → Extract labels
Model training pipeline: Labeled images → Create TF record → SageMaker training → Extract model file
(The diagram also includes BigQuery.)
Learn more @ https://bit.ly/35GzfH4

Impact
• We can gather labels for 1000s of images in hours
• Incorporating new data improves our predictions
• Our models do the right thing now
No longer labeled as hand

    { "label": "Hand", "score": 0.06862099 }

Tips for getting high-quality labels
Provide clear instructions
• Show images of good and bad examples
• Bootstrap this by running a small test set and gathering common mistakes
Opt for the smallest label set possible
• If you have multiple potential labels, narrow down the field instead of exposing all possible labels
Use multiple workers per task
• Helps improve your overall accuracy

Data labeling best practices
• Evaluate and improve your labels
• Make labeling easier for your labelers
• Use multiplicity to improve accuracy
• Measure accuracy & throughput of labelers
• Label only what you need to

Evaluate and improve your labels
☐ Yes ☑ No
Learn more @ https://amzn.to/364LQFh

Make labeling easier for your labelers
Learn more @ https://amzn.to/33IbiyL
Auto-segment uses Deep Extreme Cut (DEXTR) algorithm

Use multiplicity to improve accuracy
https://amzn.to/2N9PrsD

Measure accuracy and throughput of labelers
Raw worker responses emitted to S3

    {
      "answers": [
        {
          "answerContent": {
            "crowd-classifier": {"label": "Athlete"}
          },
          "submissionTime": "2019-10-16T03:25:56.656Z",
          "workerId": "private.us-west-2.2fa5a9d73ef73ba0",
          "workerMetadata": {
            "identityData": {
              "identityProviderType": "Cognito",
              "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
              "sub": "c9a8f4a4-ed4a-4dad-a722-8532d0d6016e"
            }
          }
        },
        {
          "answerContent": {
            "crowd-classifier": {"label": "Animal"}
          },
          "submissionTime": "2019-10-16T03:27:31.048Z",
          "workerId": "private.us-west-2.7dcbcca1ce3117d8",
          "workerMetadata": {
            "identityData": {
              "identityProviderType": "Cognito",
              "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
              "sub": "7eb0d3bc-2da5-4244-b14f-d9ec6ffe2e17"
            }
          }
        }
      ]
    }

The first entry in "answers" is the response from worker 1; the second is the response from worker 2.

Measure accuracy and throughput of labelers
Amazon CloudWatch Logs & metrics for worker throughput

    {
      "worker_id": "cd449a289e129409",
      "cognito_user_pool_id": "us-east-2_IpicJXXXX",
      "cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",
      "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
      "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
      "task_returned_time": "",
      "workteam_arn": "arn:aws:sagemaker:us-east-2:############:workteam/private-crowd/Sample-labeling-team",
      "labeling_job_arn": "arn:aws:sagemaker:us-east-2:############:labeling-job/metrics-demo",
      "work_requester_account_id": "############",
      "job_reference_code": "############",
      "job_type": "Private",
      "event_type": "TasksSubmitted",
      "event_timestamp": "1565798464"
    }

Learn more @ https://amzn.to/34N2eZQ

Label only what you need to
“Not all data is created equal”
Check out the full blog @ https://amzn.to/2VZnBDv

Learn ML with AWS Training and Certification
The same training that our own developers use, now available on demand
• Role-based ML learning paths for developers, data scientists, data platform engineers, and business decision makers
• 70+ free digital ML courses from AWS experts let you learn from real-world challenges tackled at AWS
• Validate expertise with the AWS Certified Machine Learning - Specialty exam
Visit https://aws.training/machinelearning

Thank you!