AIM308: Build accurate training datasets with SageMaker Ground Truth

Warren Barkley, GM, Augmented AI, Amazon Web Services
Vikram Madan, Product Lead, Amazon Web Services
Kevin Dela Rosa, ML Engineer, Perception, Snap

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Data labeling is tedious and difficult

• Massive scale: ML models need large, labeled datasets

• High accuracy: ML models depend on accurately labeled data

As a result, building the training dataset takes up to 80% of a data scientist’s time.

Amazon SageMaker Ground Truth

Build highly accurate training datasets

Reduce data labeling costs by up to 70%

▪ Built-in features to improve label accuracy
▪ Automated data labeling capability
▪ Access to multiple workforces, including third-party vendors
▪ Option to bring your own workers, with secure handling of data
▪ Tight integration with the Amazon SageMaker pipeline

More accurate and efficient data labeling

An active learning model is trained from human-labeled data

An accurate training dataset is ready for use

Human-labeled data is then sent back to retrain and improve the model

Built-in data labeling workflows

• Image classification
• Bounding boxes
• Semantic segmentation
• Text classification
• Named entity recognition
• Custom

Custom data labeling workflows

Learn more @ https://amzn.to/2OsREAk

Human workforce options

Labeling job creation

Scan is powered by AI

Machine learning & computer vision are at the core of the scan:
• Image classification
• Object detection
• Semantic segmentation
• Content-based information retrieval
• Ranking
• Etc.

Machine learning – bird’s-eye view

Dataset collection → Model training → Evaluation → Deployment

“Data is the new oil”

Many large-scale public datasets are available; for example:
• ImageNet (1M images)
• Open Images (9M images)
• Places (10M images)

[Figure: transfer learning. A CNN trained on ImageNet data with ImageNet labels is reused as a pre-trained CNN on new task data, with a new output layer for the new target.]

Data is great …

Getting to state of the art

{ "label": "Hand", "score": 0.99794453 }

Your application’s data is different

{ "label": "Hand", "score": 0.9703855 }

New problem

We have relevant images we can learn from, but no labels

Solution

Amazon SageMaker Ground Truth

What we liked:
• In AWS
  ▪ We’re already using some AWS solutions in our training workflows (for example, Amazon SageMaker training), so it’s easy to point to data
• Speed
  ▪ We can get images labeled on demand, quickly
• Flexibility to leverage public or private workforces to label data
• Ability to kick off labeling jobs programmatically or via the UI

Integrating with Amazon SageMaker Ground Truth

Data labeling pipeline (Kubeflow): Get target images → Pre-processing → Submit Ground Truth job → Extract labels

    smInput := sagemaker.CreateLabelingJobInput{
        HumanTaskConfig:          &humanTaskConfig,
        InputConfig:              &inputConfig,
        LabelAttributeName:       &attrName,
        LabelCategoryConfigS3Uri: &job.CategoryUri,
        LabelingJobName:          &fullJobName,
        OutputConfig:             &outputConfig,
        RoleArn:                  roleArn,
    }

    client.CreateLabelingJob(&smInput)

Integrating with Amazon SageMaker Ground Truth

Data labeling pipeline: Get target image list → Pre-processing → Submit Ground Truth job → Extract labels

Model training pipeline: Labeled images → Create TF record → SageMaker training → Extract model file

Also shown in the diagram: BigQuery

Learn more @ https://bit.ly/35GzfH4

Impact

• We can gather labels for thousands of images in hours
• Incorporating new data improves our predictions
• Our models do the right thing now

No longer labeled as hand

{ "label": "Hand", "score": 0.06862099 }

Tips for getting high-quality labels

Provide clear instructions
• Show images of good and bad examples
• Bootstrap this by running a small test set and gathering common mistakes

Opt for the smallest label set possible
• If you have multiple potential labels, narrow down the field instead of exposing all possible labels

Use multiple workers per task
• Helps improve your overall accuracy

Data labeling best practices

• Evaluate and improve your labels

• Make labeling easier for your labelers

• Use multiplicity to improve accuracy

• Measure accuracy & throughput of labelers

• Label only what you need to

Evaluate and improve your labels

☐ Yes ☑ No

Learn more @ https://amzn.to/364LQFh

Make labeling easier for your labelers

Auto-segment uses the Deep Extreme Cut (DEXTR) algorithm

Learn more @ https://amzn.to/33IbiyL

Use multiplicity to improve accuracy

https://amzn.to/2N9PrsD

Measure accuracy and throughput of labelers

Raw worker responses emitted to S3 (the first answer is the response from worker 1, the second the response from worker 2):

    {
      "answers": [
        {
          "answerContent": { "crowd-classifier": { "label": "Athlete" } },
          "submissionTime": "2019-10-16T03:25:56.656Z",
          "workerId": "private.us-west-2.2fa5a9d73ef73ba0",
          "workerMetadata": {
            "identityData": {
              "identityProviderType": "Cognito",
              "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
              "sub": "c9a8f4a4-ed4a-4dad-a722-8532d0d6016e"
            }
          }
        },
        {
          "answerContent": { "crowd-classifier": { "label": "Animal" } },
          "submissionTime": "2019-10-16T03:27:31.048Z",
          "workerId": "private.us-west-2.7dcbcca1ce3117d8",
          "workerMetadata": {
            "identityData": {
              "identityProviderType": "Cognito",
              "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_K2Rl3SHuq",
              "sub": "7eb0d3bc-2da5-4244-b14f-d9ec6ffe2e17"
            }
          }
        }
      ]
    }

Amazon CloudWatch Logs & metrics for worker throughput

    {
      "worker_id": "cd449a289e129409",
      "cognito_user_pool_id": "us-east-2_IpicJXXXX",
      "cognito_sub_id": "d6947aeb-0650-447a-ab5d-894db61017fd",
      "task_accepted_time": "Wed Aug 14 16:00:59 UTC 2019",
      "task_submitted_time": "Wed Aug 14 16:01:04 UTC 2019",
      "task_returned_time": "",
      "workteam_arn": "arn:aws:sagemaker:us-east-2:############:workteam/private-crowd/Sample-labeling-team",
      "labeling_job_arn": "arn:aws:sagemaker:us-east-2:############:labeling-job/metrics-demo",
      "work_requester_account_id": "############",
      "job_reference_code": "############",
      "job_type": "Private",
      "event_type": "TasksSubmitted",
      "event_timestamp": "1565798464"
    }

Learn more @ https://amzn.to/34N2eZQ

Label only what you need to

“Not all data is created equal”

Check out the full blog @ https://amzn.to/2VZnBDv

Learn ML with AWS Training and Certification

The same training that our own developers use, now available on demand

Role-based ML learning paths for developers, data scientists, data platform engineers, and business decision makers

70+ free digital ML courses from AWS experts let you learn from real-world challenges tackled at AWS

Validate expertise with the AWS Certified Machine Learning - Specialty exam

Visit https://aws.training/machinelearning

Thank you!