CollabAR: Edge-assisted Collaborative Image Recognition for Mobile Augmented Reality

Zida Liu, Guohao Lan, Jovan Stojkovic *, Yunfan Zhang, Carlee Joe-Wong, Maria Gorlatova

Duke University, Durham, NC; Carnegie Mellon University, Moffett Field, CA; * University of Belgrade, Belgrade, Serbia

1/29 Outline

• Introduction
• CollabAR system design
• MVMDD dataset
• Evaluation

2/29 Augmented Reality (AR)

• Virtual content is laid out around the user in the same spatial coordinate system as the physical world.

• Image recognition enables a seamless, contextual AR experience.

3/29 Challenges

A large proportion of the images captured in real-world mobile AR scenarios are distorted.

[Figure: measurement study in real-world mobile AR scenarios with a Magic Leap One and a Nokia 7.1]

6/29 Challenges

Multiple distortion types cause dramatic performance degradation of well-trained DNNs for image recognition.

[Figure: impact of image distortions on MobileNetV2 trained with the Caltech-256 dataset]

7/29 Outline

• Introduction
• CollabAR system design
• MVMDD dataset
• Evaluation

8/29 Our method

Collaborative image recognition: leverage temporally and spatially correlated images to improve image recognition accuracy in heterogeneous scenarios.

[Architecture diagram: user devices connect to edge servers, which connect to the cloud]

9/29 System design: overview

[System diagram: the mobile client resizes images and performs pose estimation; the edge server hosts the distortion-tolerant image recognizer (distortion classifier plus pristine, motion blur, Gaussian blur, and Gaussian noise experts), the spatial-temporal image lookup (cloud anchors, timestamps, image IDs, anchor-time and results caches), and the auxiliary-assisted multi-view ensembler that returns the inference result to the client]
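To make the overview concrete, here is an illustrative sketch of how one uploaded image flows through the edge-server components; the function names, cache record fields, and the 2-second freshness default are placeholders for exposition, not the actual CollabAR code.

```python
# Illustrative end-to-end flow on the edge server for one uploaded image.
# classify_distortion, experts, and amel_ensemble stand in for the components
# described on the following slides.
def recognize_on_edge(image, anchor_ids, timestamp, cache, experts,
                      classify_distortion, amel_ensemble, t_thresh=2.0):
    # 1. Distortion-tolerant recognizer: detect the distortion type and run the
    #    matching expert to obtain this image's class-probability vector.
    distortion = classify_distortion(image)            # pristine / MBL / GBL / GN
    probs = experts[distortion].predict(image)

    # 2. Spatial-temporal image lookup: cached images that share a cloud anchor
    #    and were captured within the freshness window.
    correlated = [rec for rec in cache
                  if set(rec["anchor_ids"]) & set(anchor_ids)
                  and timestamp - rec["timestamp"] < t_thresh]

    # 3. Auxiliary-assisted multi-view ensemble over all correlated views.
    label = amel_ensemble([probs] + [rec["probs"] for rec in correlated])

    # Store this image so later, correlated requests can reuse its result.
    cache.append({"anchor_ids": anchor_ids, "timestamp": timestamp, "probs": probs})
    return label
```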

10/29 System design: distortion-tolerant image recognizer

1. Distortion classifier

Pipeline: RGB image → grayscale conversion (Rec. 601 coding) → DFT → frequency spectrum → distortion classifier → distortion type.
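A minimal sketch of this front end, assuming the input is an RGB float array in [0, 1]; the small spectrum CNN at the end is illustrative, not the exact classifier architecture from the paper.

```python
import numpy as np
import tensorflow as tf

def distortion_spectrum(rgb):
    """rgb: float32 array of shape (H, W, 3) in [0, 1]; returns the log-magnitude spectrum."""
    # Rec. 601 luma coefficients for the RGB -> grayscale conversion.
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    # Centered 2-D DFT; blur and noise leave distinct signatures in the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spectrum)).astype(np.float32)[..., np.newaxis]

def build_distortion_classifier(input_shape=(224, 224, 1), num_types=4):
    # Small CNN over the spectrum that outputs the distortion type
    # (pristine, motion blur, Gaussian blur, Gaussian noise).
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_types, activation="softmax"),
    ])
```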

11/29 System design: distortion-tolerant image recognizer

2. Recognition experts

• Step 1: Initialize the CNN by training it on the ImageNet dataset.
• Step 2: Fine-tune the CNN on pristine images to obtain the pristine expert, Expert_P.
• Step 3: Fine-tune Expert_P to obtain the distortion experts Expert_MBL, Expert_GBL, and Expert_GN, using training sets of 50% pristine / 50% motion blur, 50% pristine / 50% Gaussian blur, and 50% pristine / 50% Gaussian noise images, respectively (sketched below).
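A minimal Keras sketch of the three steps, assuming a MobileNetV2 backbone (the degradation study earlier uses MobileNetV2) and dummy data pipelines standing in for the real pristine and 50/50 pristine-plus-distorted training sets.

```python
import tensorflow as tf

def build_backbone(num_classes):
    # Step 1: start from ImageNet-pretrained weights.
    base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                             pooling="avg", input_shape=(224, 224, 3))
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    return tf.keras.Model(base.input, out)

def fine_tune(model, dataset, epochs=10):
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(dataset, epochs=epochs)
    return model

def dummy_dataset(num_classes, n=8):
    # Stand-in for the real pipelines: pristine images for Step 2,
    # 50% pristine / 50% distorted images for Step 3.
    x = tf.random.uniform((n, 224, 224, 3))
    y = tf.random.uniform((n,), maxval=num_classes, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(4)

def clone_and_tune(expert, mixed_ds):
    # Copy the pristine expert, then fine-tune the copy on a 50/50 mix.
    copy = tf.keras.models.clone_model(expert)
    copy.set_weights(expert.get_weights())
    return fine_tune(copy, mixed_ds)

num_classes = 257  # e.g., Caltech-256
expert_p = fine_tune(build_backbone(num_classes), dummy_dataset(num_classes))    # Step 2
expert_mbl = clone_and_tune(expert_p, dummy_dataset(num_classes))  # Step 3: motion blur mix
expert_gbl = clone_and_tune(expert_p, dummy_dataset(num_classes))  # Step 3: Gaussian blur mix
expert_gn = clone_and_tune(expert_p, dummy_dataset(num_classes))   # Step 3: Gaussian noise mix
```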

14/29 System design: spatial-temporal image lookup

1. Spatial correlation lookup: different images contain the same anchor:

{Anchors}_new ∩ {Anchors}_stored ≠ ∅

2. Temporal correlation lookup: different images are taken within a certain freshness window (see the sketch below):

Δt = t_new − t_stored, with Δt < t_thresh (the freshness threshold)
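A minimal sketch of the two lookup conditions, assuming each cached record stores the image ID, its cloud-anchor IDs, and a capture timestamp; the field names and the 2-second freshness default are illustrative.

```python
def is_correlated(new_anchors, new_time, stored_anchors, stored_time, t_thresh):
    # Spatial correlation: the two images share at least one cloud anchor.
    spatial = len(set(new_anchors) & set(stored_anchors)) > 0
    # Temporal correlation: the stored image is still fresh.
    temporal = (new_time - stored_time) < t_thresh
    return spatial and temporal

def lookup_correlated(new_anchors, new_time, cache, t_thresh=2.0):
    """Return the image IDs in the cache that are spatially and temporally correlated."""
    return [rec["image_id"] for rec in cache
            if is_correlated(new_anchors, new_time,
                             rec["anchor_ids"], rec["timestamp"], t_thresh)]
```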

15/29 System design: auxiliary-assisted multi-view ensemble learning (AMEL)

1. Normalized entropy:

η_k = − Σ_{i=1}^{C} p_i · log(p_i) / log(C)

• p_k = {p_1, …, p_C} – probability vector of image k
• η_k – weight of image k
• C – number of classes

2. Ensemble multi-user results based on normalized entropy (sketched below):

P = Σ_{j=1}^{J} (1 − η_j) · p_j, where J is the number of correlated views.
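A small NumPy sketch of the two formulas, where each input vector is the softmax output of one correlated view:

```python
import numpy as np

def normalized_entropy(p, eps=1e-12):
    """p: probability vector over C classes; returns eta in [0, 1]."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)) / np.log(len(p)))

def amel_ensemble(prob_vectors):
    """prob_vectors: per-view probability vectors of the correlated images."""
    combined = np.zeros_like(prob_vectors[0])
    for p in prob_vectors:
        combined += (1.0 - normalized_entropy(p)) * p   # low-entropy views weigh more
    return int(np.argmax(combined))                     # final class prediction
```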

16/29 Multi-view multi-distortion image dataset (MVMDD)

[Figures: examples of the pristine images collected in our MVMDD dataset; summary table of the collected MVMDD dataset]

17/29 Outline

• Introduction
• CollabAR system design
• MVMDD dataset
• Evaluation

18/29 Evaluation: experiment setup

1. Implementation:

• Client: Android smartphones with the Google ARCore SDK.
• Edge server: a desktop with an Intel i7-8700K CPU and an NVIDIA GTX 1080 GPU.
• Deep learning framework: Keras 2.3 on top of TensorFlow 2.0.
• Communication: Python Flask framework over HTTP (see the sketch below).
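A minimal sketch of the client-to-server path over Flask and HTTP; the /recognize route, form-field names, and response format are illustrative placeholders rather than the actual CollabAR interface.

```python
import io
import numpy as np
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

@app.route("/recognize", methods=["POST"])
def recognize():
    # The AR client uploads the resized image together with its cloud-anchor IDs
    # and capture timestamp as multipart form data.
    image = np.asarray(Image.open(io.BytesIO(request.files["image"].read())))
    anchor_ids = request.form.get("anchor_ids", "").split(",")
    timestamp = float(request.form.get("timestamp", "0"))
    # ... run the distortion-tolerant recognizer, lookup, and AMEL on `image` here ...
    return jsonify({"label": None, "num_anchors": len(anchor_ids), "timestamp": timestamp})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```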

19/29 Evaluation: experiment setup

2. Benchmark datasets (existing standard datasets and datasets collected by ourselves):

○ Single-view datasets:
   ■ Caltech-256 (existing): 257 categories, 80 instances per category.
   ■ MobileDistortion (ours): 4 distortion types, 300 image instances per distortion type.
○ Multi-view datasets:
   ■ MIRO (existing): 12 categories, 10 objects per category, 160 views per object, black background.
   ■ MVMDD (ours): 6 categories, 6 objects per category, 6 views per object, 3 different distances, 2 background complexity levels.
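The distorted variants used for expert fine-tuning and for the synthesized-distortion evaluation follow standard distortion models; a minimal sketch, assuming float images of shape (H, W, 3) in [0, 1] and arbitrary example parameter values rather than the distortion levels used in the paper:

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def motion_blur(img, length=9):
    # Convolve each channel with a horizontal line kernel of the given length.
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0 / length
    return np.stack([convolve(img[..., c], kernel) for c in range(3)], axis=-1)

def gaussian_blur(img, sigma=2.0):
    # Blur the spatial dimensions only, not the color channels.
    return gaussian_filter(img, sigma=(sigma, sigma, 0))

def gaussian_noise(img, std=0.05):
    # Additive zero-mean Gaussian noise, clipped back to the valid range.
    return np.clip(img + np.random.normal(0.0, std, img.shape), 0.0, 1.0)
```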

20/29 Evaluation: distortion classifier performance

[Figure: distortion classifier confusion matrices for real-world and synthesized distortions, on the MobileDistortion, Caltech-256, MIRO, and MVMDD datasets]

Overall accuracy: 92.92%

Confusion matrix (rows: output class, columns: target class):

            Pristine   MBL    GBL    GN
Pristine       172       4      0     0
MBL              0     143     39     0
GBL              0       0    176     0
GN               8       0      0   178

Per-target-class accuracy: Pristine 95.6%, MBL 97.3%, GBL 81.9%, GN 100.0%

21/29 Evaluation: image recognition accuracy

1. Performance of distortion expert.

22/29 Evaluation: image recognition accuracy

2. Performance of multi-view collaboration: multi-view single-distortion.

[Plots: recognition accuracy (%) vs. number of views (1–6) on (a) MIRO and (b) MVMDD, comparing w/o DC w/o Auxiliary, w/o DC w/ Auxiliary, w/ DC w/o Auxiliary, and w/ DC w/ Auxiliary]

23/29 Evaluation: image recognition accuracy

2. Performance of multi-view collaboration: multi-view multi-distortion.

[Plots: recognition accuracy (%) vs. number of views (1–6) on (a) MVMDD and (b) MIRO for the distortion combinations MBL+GN, GBL+GN, GBL+MBL, and GBL+MBL+GN]

24/29 Evaluation: image recognition accuracy

3. Advantages of AMEL

• Good view: the image is a pristine image.

• Bad view: the image contains multiple distortions with high distortion levels.

[Table: CollabAR accuracy on the MVMDD dataset with and without the auxiliary feature]

25/29 Evaluation: system profiling

1. Computational latency

2. Total end-to-end latency:

Nokia 7.1: 32.7 ms | Pixel 2 XL: 21.5 ms | Xiaomi 9: 17.8 ms

26/29 Open source

We have open-sourced the MVMDD dataset and the CollabAR code!

• MVMDD: https://github.com/CollabAR-Source/MVMDD
• CollabAR code: https://github.com/CollabAR-Source/CollabAR-Code

27/29 Acknowledgements

• Lord Foundation of North Carolina
• NSF awards CSR-1903136, CNS-1908051

28/29 Thank you & Questions?
