IMPERIAL COLLEGE LONDON

DEPARTMENT OF COMPUTING

AIGVA: AI Generated Video Annotation

MSc Individual Project Final Report

Author: Emil Soerensen

Project Supervisor: Dr. Bernhard Kainz

Second Marker: Dr. Wenjia Bai

Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London

September 2019

Abstract

Most researchers agree that the "best way to make a model generalize better is to train it on more data" (Goodfellow et al., 2016). However, gathering and labelling training data is a laborious, tedious, and costly task. Although researchers have made significant developments in improving machine learning architectures and developing custom hardware, limited focus has been on designing better ways to acquire and label more training data.

Worse still, the problem of labelling training data becomes more difficult when you transition into the video domain because of the substantial increase in data size. Currently available video annotation tools require you to label each frame in a video individually, which can be prohibitively expensive, often resulting in large amounts of either unused or unlabelled video data. This is unfortunate as video data is crucial for training algorithms in many of the most promising areas of machine learning research such as autonomous vehicle navigation, visual surveillance, and medical diagnostics.

The dissertation aims to address these problems by building a tool that can speed up the labelling process for video datasets. The outcome of the dissertation is the AI-Generated Video Annotation Tool (AIGVA), a web-based application for intelligently labelling video data. A set of novel labelling techniques called Cold Label and Blitz Label were developed to automatically predict frame-level labels for videos. A scalable and distributed software architecture was built to handle the heavy data processing related to the machine learning and video processing tasks. The first in-browser video player capable of frame-by-frame navigation for video labelling was also built as part of the project.

Quantitative and qualitative evaluations of the AIGVA application all show that the tool can significantly increase labelling efficiency. Quantitative testing using the CIFAR-10 dataset suggests that the tool can reduce labelling time by a factor of 14-19x. User interviews with machine learning researchers were positive, with interviewees claiming that they had "no doubt it could be used today in our research group". Furthermore, software testing of the web application and Core Data API demonstrates ample run-time performance.

Acknowledgments

I would like to thank the following people:

• My supervisor Dr. Bernhard Kainz and co-supervisor Dr. Wenjia Bai for their valuable support throughout the process

• Giacomo Tomo, Sam Budd, Anselm Au, Jeremy Tan from the Biomedical Image Analysis Group at Imperial College London for providing both technical and non-technical feedback to improve the usability of the application

• Jacqueline Matthew, Clinical/Research Sonographer at Guy’s and St Thomas’ NHS Foundation Trust, who volunteered to teach me about the problems of data labelling in a clinical setting

• My family, friends and fellow students, particularly Finn Bauer, who had to suffer through countless hours of discussions about web app development, data labelling and machine learning

Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Contributions
  1.4 Outline

2 Background
  2.1 Literature Review
    2.1.1 The Stagnation of Image Classification Performance
    2.1.2 Beyond ImageNet: Improving Image Classification Performance
    2.1.3 The Challenges of New Data Acquisition
    2.1.4 Methods to Increase Data Labelling Efficiency
    2.1.5 Sourcing Image Data From Videos
  2.2 Existing Labelling Tools
  2.3 The iFIND Project
  2.4 Ethical & Professional Considerations

3 Method
  3.1 Label Prediction
    3.1.1 Cold Label
    3.1.2 Blitz Label
    3.1.3 Custom Model
  3.2 User Stories
    3.2.1 User Story #1: Building a Pedestrian Classifier for an Autonomous Vehicle Startup
    3.2.2 User Story #2: Improving Standard Plane Detection in Fetal Ultrasound Classifiers
  3.3 Design Principles
  3.4 Requirements
    3.4.1 Project Management
    3.4.2 Videos
    3.4.3 Label Prediction

4 Implementation
  4.1 Software Arch. & Tech Stack
    4.1.1 Core Data API
    4.1.2 Web Application
    4.1.3 File Storage & Database
  4.2 Feature Impl. - Video Management
    4.2.1 Frame-by-frame Video Player
    4.2.2 Export Tool
  4.3 Feature Impl. - Label Prediction
    4.3.1 Cold Label
    4.3.2 Blitz Label
    4.3.3 Custom Model
    4.3.4 Label Prediction Engine
  4.4 DevOps
    4.4.1 Testing
    4.4.2 Deployment

5 Evaluation & Results
  5.1 Quantitative Evaluation
    5.1.1 Label Prediction - Cold Label
    5.1.2 Label Prediction - Blitz Label
  5.2 Performance Testing
    5.2.1 Web Application Performance
    5.2.2 Core Data API Performance
  5.3 User Interview & Feedback
    5.3.1 Machine Learning Researchers
    5.3.2 Clinical Sonographer

6 Conclusion & Future Work
  6.1 Summary of Achievements
  6.2 Future Work

Appendices

Appendix A User Guide
  A.1 Installation
  A.2 User Manual

Appendix B Full Surveys
  B.1 Machine Learning Researchers
  B.2 Clinical Sonographer

Appendix C Supporting Material
  C.1 Folder Structure
  C.2 Web Performance Testing
  C.3 Experimentation
    C.3.1 List of Videos For AIGVA Experiment
    C.3.2 Model Used For AIGVA Experiment
    C.3.3 Confusion Matrices for AIGVA Experimentation

Appendix D Supporting Code
  D.1 Web Application
    D.1.1 Video Frame Player
    D.1.2 Export Tool
  D.2 Label Prediction
    D.2.1 Cold Label
    D.2.2 Blitz Label
    D.2.3 Custom Model
    D.2.4 Prediction Engine
  D.3 DevOps
    D.3.1 Docker
  D.4 File Storage and Database
    D.4.1 Database Schema

Chapter 1

Introduction

"We're entering a new world in which data may be more important than software."

—Tim O’Reilly, founder, O’Reilly Media

1.1 Motivation

While recent advancements in machine learning have shown its potential to solve complex classification problems, there is still a way to go. For example, in March 2018 Elaine Herzberg was killed by one of Uber's self-driving cars because the vehicle failed to brake in time. The system misclassified the pedestrian Elaine first as an unknown object, then as a vehicle, and then as a bicycle with varying expectations of future travel path (NST, 2019). Examples like this highlight the need to build better machine learning models.

In applied machine learning, supervised algorithms still outperform other models for classification tasks. While researchers have found a plethora of novel methods to improve the generalizability of such algorithms, the size and quality of the training dataset is still a crucial factor for model performance. The process of acquiring and labelling data for training machine learning models is, however, a laborious, tedious, and expensive task. Consequently, most of the recent research focus has been on improving the machine learning models rather than on improving the data collection and labelling techniques.

The data labelling task becomes even more difficult when transitioning into the video domain, due to the substantial increase in data size. This difficulty increases the cost of data labelling, which often results in large amounts of either unused or unlabelled video data. As many of today's most promising areas of machine learning research, such as autonomous vehicle navigation, visual surveillance, and medical diagnostics, lie within the video domain, improving data labelling is crucial to progress the quality of the algorithms. Any efforts to speed up the labelling process could thus bring significant value to the future of research and real-world application.

While the challenge of labelling video data exists in many fields, this dissertation is particularly motivated by the challenges faced by a group of machine learning researchers developing algorithms using ultrasound data. Specifically, this dissertation is written in partnership with the intelligent Fetal Imaging and Diagnosis (iFIND) research project. Previously, researchers in this team have built algorithms to classify fetal ultrasound scans, a task normally performed by sonographers, medical professionals trained in identifying and interpreting ultrasound images. Classifying fetal ultrasound scans is a critical task to check for structural abnormalities and anomalies in the baby (NHS, 2017).

The iFIND researchers have overall achieved promising algorithmic results that are comparable with most sonographers, but the neural networks developed have struggled to classify certain challenging cardiac views. To solve this problem, more ultrasound data is needed on these specific regions. Fortunately, the research group is already in possession of many hours of ultrasound video. The main struggle remaining is to first label this data for it to be of any value. Although theoretically possible, it would be too time consuming and too expensive to request a sonographer, a profession currently in shortage within the UK, to sift through the massive amount of video. To help alleviate this problem, this dissertation thus aims to build a tool that can label such data in an intelligent and fast way. We hope such a tool can lead to further improvements within both diagnostics and more general domains.

1.2 Objectives

The goal of this dissertation is to develop an application that can significantly improve the speed of labelling video data to be used as training data in machine learning applications. This goal can be broken down into four key objectives:

• Develop a web-based solution to speed up the process of labelling videos

• Develop novel methods to predict image-level labels for all frames in a video

• Build a scalable software architecture to handle the heavy data processing of machine learning applications in a web application

• Develop an in-browser video player capable of frame-by-frame navigation

Additionally, this report is conducted in partnership with the intelligent Fetal Imaging and Diagnosis (iFIND) research project. This research project encompasses a multi-disciplinary team of medical imaging specialists, roboticists, machine learning engineers and clinicians. The goal of the project is two-fold: (1) to develop new fetal scanning technologies that can automatically move the ultrasound probe to the right place, and (2) to improve fetal ultrasound imaging by automating image processing. This dissertation aims to contribute to the latter goal by creating a tool that can speed up training data creation and therefore aid in the future development of automated image processing models.


1.3 Contributions

The main contributions of this dissertation are the development of the AI-Generated Video Annotation Tool, the creation of two novel methods of frame-level prediction in videos, the building of a scalable software architecture to handle machine learning training and inference, and the development of a frame-by-frame video player for the web.

The AI-Generated Video Annotation (AIGVA) tool

The AIGVA tool is a web-based video annotation tool designed to significantly improve the speed of labelling video data for use as training data in machine learning applications. It achieves this speed-up through a combination of effective design principles, web application innovations, and label prediction techniques. The tool consists of three components: (1) a Python-based Core Data API designed to handle the heavy data processing involved in machine learning, (2) a File Storage component used to store videos, machine learning models and key metadata information, and (3) a React-based web application that a user can interact with.

Two novel methods of frame-level label prediction in videos

Two novel methods of label prediction are developed: Cold Label and Blitz Label. "Cold Label" is a label prediction technique developed to address the cold start problem (i.e. when no labels exist). The "Cold Label" technique utilizes human-in-the-loop enabled active learning methods combined with transfer learning techniques to address this problem. The core idea behind this technique is to ask a user to label a few important samples from a set of videos and subsequently use those labelled samples to train a classifier that is then used to label all frames. "Blitz Label" is a label prediction technique that leverages pre-existing labels to predict new labels. The core idea behind the technique is that if a user already has labelled some videos, a machine learning algorithm can leverage that embedded information to annotate unlabelled videos. Both of these methods are empirically shown to significantly improve the labelling speed.

Scalable software architecture to handle machine learning training and inference using a web-based application

Core to the application are several data- and process-heavy tasks including: training machine learning models, extracting frames from videos, running inference on hundreds of thousands of frames, and processing videos for metadata extraction. As part of the Core Data API, a complex software architecture was built to handle these tasks using a network of worker-broker jobs. This was accomplished using a Redis message broker and a Celery task queue to handle concurrent execution across several cores. To further enhance scalability, the Core Data API uses a Gunicorn server to handle client requests using multiple worker processes. The result of this architecture is the ability to run several data-heavy tasks concurrently.
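To make the worker-broker idea concrete, the sketch below shows a minimal Celery setup with a Redis broker in the spirit of the architecture described above. It is illustrative only: the task names (extract_frames, run_inference), broker URL and concurrency settings are assumptions, not the actual AIGVA source code.

```python
# Minimal worker-broker sketch with Celery and Redis (illustrative assumptions).
from celery import Celery

app = Celery(
    "aigva_tasks",
    broker="redis://localhost:6379/0",    # message broker
    backend="redis://localhost:6379/1",   # result store
)

@app.task
def extract_frames(video_id: int) -> int:
    """Decode a video into frames on a worker process, off the request path."""
    # ... e.g. call ffmpeg / OpenCV here ...
    return video_id

@app.task
def run_inference(model_id: int, frame_ids: list) -> dict:
    """Run a trained model over a batch of frames and persist predicted labels."""
    # ... load model, predict, store results ...
    return {"model": model_id, "frames": len(frame_ids)}

# A Gunicorn-served request handler only enqueues work and returns immediately:
#   extract_frames.delay(video_id=42)
# Workers run separately, e.g.:
#   celery -A aigva_tasks worker --concurrency=4
```

Because the web server only enqueues tasks, long-running jobs such as frame extraction never block client requests, which is what allows several data-heavy tasks to run concurrently.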

A frame-by-frame video player for web applications

To handle frame-level labelling of videos via a web application, a video player that can navigate on a frame-by-frame level was built. No existing JavaScript-based video players offer this key functionality. Therefore I opted to build my own player from scratch based on the HTML5 Video Media element specification. This player was demonstrated to work efficiently both quantitatively, using browser performance tests, and qualitatively, through user interviews. Furthermore, this player was fitted with several custom-built features, based on user feedback, to further enhance labelling efficiency, such as label skipping and visual representations of labels throughout the video.

To showcase the AIGVA tool, I have included a series of pictures below showing how the tool can be used in action. Note, these images are computer generated and conceptual in nature.


Figure 1.1: AIGVA App Showcase: Training a label prediction model in-browser

Figure 1.2: AIGVA App Showcase: Collaborative review of labels


Figure 1.3: AIGVA App Showcase: Quickly labelling videos using keyboard shortcuts

Figure 1.4: AIGVA App Showcase: Manually labelling a few frames per video and training a Cold Label model on those to predict labels for all videos


1.4 Outline

Chapter 2 (Background) reviews the literature on image classification and demonstrates how more labelled training data is key to improving performance. An analysis of existing labelling tools is then conducted. Finally, the need for a better labelling tool is demonstrated by discussing challenges faced by the iFIND ultrasound research project. A brief discussion on ethics closes this chapter.

Chapter 3 (Method) develops the foundation for a new labelling tool called AIGVA. A set of novel label prediction techniques are presented and user stories are created to show how the tool can improve labelling efficiency. Then design principles are developed which are used to create the subsequent functional requirements of the web application.

Chapter 4 (Implementation) discusses the implementation of the AIGVA tool. First, the software architecture and technology stack are described. Then the implementations of two key features, video management and label prediction, are presented. The DevOps and deployment tasks are then discussed.

Chapter 5 (Evaluation and Results) presents the evaluation of the AIGVA tool. There are three parts to the evaluation: (1) an empirical evaluation of the label prediction methods, (2) a quantitative assessment of the performance of the application, and (3) a qualitative set of user interviews with machine learning researchers and a clinical sonographer.

Chapter 6 (Conclusion) concludes the dissertation and discusses future work.

Chapter 2

Background

2.1 Literature Review

To set the development of the AIGVA tool into a wider context, I will review the literature on image classification and its current limitations. I will then describe the current approaches used to address these problems.

2.1.1 The Stagnation of Image Classification Performance

Figure 2.1: Original AlexNet CNN Architecture

Image classification is the task of predicting a label or a set of labels for an image. The task of image classification is one of the central challenges in computer vision and has a myriad of practical uses including: facial recognition, autonomous vehicle locomotion, and medical image diagnostics. Since the introduction of AlexNet in 2012, deep convolutional neural networks (CNNs) have been the predominant architecture for image classification tasks. The innovation of AlexNet was using a deeper and wider architecture than previous CNNs combined with max pooling, fully connected and dropout layers (see Figure 2.1). The performance benchmark for image classifiers is the ImageNet dataset, for which AlexNet achieved a 62.5% top-1 classification rate (Krizhevsky et al., 2012). After AlexNet, many architecture modifications such as VGG (Simonyan and Zisserman, 2014), ResNet (He et al., 2015), DenseNet (Huang et al., 2016) and SqueezeNet (Iandola et al., 2016) have led to significant improvements in both classification rate and model size. The current state-of-the-art performance on ImageNet is EfficientNet-B7, which achieved 84.4% in May 2019 (Tan and Le, 2019).

Figure 2.2: State-of-the-art ImageNet Top-1 Accuracy Results

The performance of traditional image classification algorithms has slowly begun to stagnate. Figure 2.2 shows the development and performance of state-of-the-art methods for the ImageNet classification task since 2012, with AlexNet on the far left of the chart and EfficientNet-B7 on the far right. One can observe how performance has stagnated since mid 2017, with only minor improvements of around a single percentage point. The best performance to date, EfficientNet-B7, only offered a 0.1% total classification improvement over the previous state-of-the-art GPipe model. Instead of squeezing out additional model performance, much of the work done in the field has focused on "carefully balancing network depth, width, and resolution" to achieve near state-of-the-art accuracy with much lower model size and faster speeds (Tan and Le, 2019). Work has also been done to investigate alternative model training approaches, such as the AmoebaNet-A model, which was trained using an evolutionary tournament-style approach and obtained an 83.9% top-1 classification rate (Real et al., 2018).

2.1.2 Beyond ImageNet: Improving Image Classification Perfor- mance

To improve the stagnating image classification performance, researchers are looking at three areas: (1) developing better models with higher capacity, (2) increasing computational power, and (3) acquiring better datasets on which to train (Sun et al., 2017). First, in the area of developing better models, much work has focused on building novel architectures and models. One of the most promising areas within architecture modifications is the use of attention networks (Wang et al., 2017). The core idea behind attention networks is to mimic how humans classify images by focusing on specific substructures within images that have greater explanatory power. Attention networks have been used to produce state-of-the-art results in various tasks such as image-based question answering (Yang et al., 2015), image super-resolution (Zhang et al., 2018), and scan plane detection for fetal ultrasound screening (Schlemper et al., 2018a). Other forms of new architectures include incorporating weights from a pre-trained model through transfer learning (Shin et al., 2016; Zoph et al., 2017) and incorporating probabilistic priors to learn from experience (Ghahramani, 2015). Second, researchers have attempted to increase computational power through hardware and software modifications. As the core computations involved in CNNs are linear algebra manipulations, there is an opportunity to distribute and parallelize calculations. Custom hardware is being developed to leverage this parallelizability, such as Google's Tensor Processing Unit (Dean et al., 2018). Similarly, machine learning frameworks such as TensorFlow are being developed that abstract away many of the complexities that arise when dealing with distributed systems running on GPUs and TPUs (Research, 2016).

The final approach to improving image classification is to obtain better training data. This can be done in one of two ways: either by (1) enhancing existing labelled training data, or by (2) obtaining additional labelled training data. First, enhancing existing labelled training data is also called data augmentation, which is a common practice for most machine learning researchers. Augmenting training data by "cropping, rotating, and flipping input images" can significantly improve the generalizability of models (Perez and Wang, 2017). The key idea behind augmentation is that by manipulating images through various techniques you can still preserve the original label while increasing the variance of your training data. For example, suppose you had an image of a cat and flipped it horizontally. After the flip, the picture is still that of a cat, yet it is significantly different in terms of its pixel composition, which is what the convolutional neural network uses to train on. It is therefore possible to artificially generate more training data by performing these manipulations. A more advanced method of image augmentation that goes beyond simple flips and rotations is to artificially generate new images using Generative Adversarial Networks (GANs). These models work by having one neural network attempt to produce counterfeit examples and another neural network try to predict whether an example is counterfeit. GANs have outperformed traditional methods in augmenting training data in certain contexts (Perez and Wang, 2017).
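To make the flip/rotate/crop idea concrete, the following is a minimal sketch of such an augmentation pipeline using torchvision; the library choice and the parameter values are assumptions made for illustration and are not taken from the works cited above.

```python
# A minimal illustration of label-preserving data augmentation (assumed library: torchvision).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # a flipped cat is still a cat
    transforms.RandomRotation(degrees=10),                      # small rotations preserve the label
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # random crops add further variance
    transforms.ToTensor(),
])

# Applied on-the-fly during training, each epoch sees a slightly different
# version of every image, increasing the effective variance of the data:
#   augmented = augment(pil_image)
```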

The second approach to improving image classification by obtaining better data is to acquire additional labelled examples. This involves manually collecting new data and labelling it. Most researchers agree that the "best way to make a machine learning model generalize better is to train it on more data" (Goodfellow et al., 2016). Specifically, retraining a model with more data usually has the effect of reducing variance (Ng, 2015). One recent paper that examined the relationship between training data and image classification found "that the performance on vision tasks increases logarithmically based on volume of training data size" (Sun et al., 2017). Another paper found that reducing the amount of ImageNet training data used, unsurprisingly, reduced classification performance (Huh et al., 2016). Obtaining additional examples is highly effective, but comes with its own set of challenges, as we will examine in the next section.

2.1.3 The Challenges of New Data Acquisition

Having established the importance of acquiring new data for training machine learning algorithms, we now turn our focus to the challenges of such acquisition. There are two steps involved in acquiring new training data: (1) attaining unlabelled data and (2) then labelling the data. Although these two steps are often done simultaneously, it is useful to distinguish between the two to understand the key challenges. First, the importance of getting unlabelled data is growing as deeper machine learning models require more data to successfully converge. In a recent survey on the challenges in machine learning, the authors found that "data collection is becoming one of the critical bottlenecks" (Roh et al., 2018). In certain fields, acquiring more data can be difficult because of structural restrictions. For example, in medical imaging the data is limited, which makes acquiring sufficient training data to develop classifiers "difficult or even impossible" (Holzinger, 2016). Second, once the data has been acquired, the process of labelling poses its own set of challenges. Labelling certain data may require significant domain expertise, such as a sonographer annotating standard fetal scan planes in ultrasound scans. Furthermore, domain experts may not always agree on the labelling of certain challenging examples, which produces noise in the dataset. Manually labelling the amount of data needed may also be prohibitively expensive. In summary, to advance the image classification space, "there is a pressing need of accurate and scalable data collection techniques" (Roh et al., 2018).

Not all labels are created equal. To understand where the need for scalable data collection and labelling is greatest, I have created the framework seen in Figure 2.3. The 2-by-2 matrix plots the two aforementioned challenges of data collection on its two axes: (1) the cost of labelling, which is a proxy for both the time and expertise needed to label an example, and (2) the abundance of data, which represents the availability of unlabelled examples.

Figure 2.3: Label Cost vs. Data Abundance Framework

There are four categories on the low/high abundance and low/high label cost spectrum. The greatest need for labelling tools is within the two high label cost spaces, since that is where most savings can be found. However, it is important to differentiate between high and low abundance data within the two high label cost categories. High abundance, high label cost data, such as the millions of hours of recorded vehicle driving data used for developing autonomous vehicles, need a labelling solution that offers functionality to find examples of specific high-value events such as car crashes or pedestrian incidents. Low abundance, high label cost data, such as most medical data, need a labelling solution that makes it as efficient as possible to review all examples, thereby extracting as much meaningful data from the examples as possible.

2.1.4 Methods to Increase Data Labelling Efficiency

"The most accurate way to label examples is to do it manually" (Roh et al., 2018). Unfortunately, that is usually not feasible because of the costs involved. The often-used image classification dataset ImageNet consists of millions of images that were manually labelled by humans using Amazon Mechanical Turk (Deng et al., 2009). This process took years and cannot be reproduced by most machine learning researchers and practitioners because of cost constraints. Therefore researchers have proposed various ways to reduce the effort of annotating images. In this section I will discuss three such methods: human-in-the-loop techniques, active learning and semi-supervised learning.

Human-in-the-loop

Human-in-the-loop is a term used to describe the inclusion of human collaboration in a machine task. The key idea behind human-in-the-loop computing is that "integrating the knowledge of a domain expert can sometimes be indispensable, and the interaction of a domain expert with the data would greatly enhance the knowledge discovery process pipeline" (Holzinger, 2016). Using a human in the loop can be especially useful in certain domains such as medicine, where researchers are often confronted with insufficient data that can be both noisy and incomplete. Here a human may be able to provide valuable insight into the dataset that can improve classification performance. For example, Yu et al. demonstrated how using a human-in-the-loop approach can significantly improve the precision of labelling large image datasets (Yu et al., 2015). Additionally, in a 2015 CVPR paper, Russakovsky et al. developed a labelling pipeline using human-in-the-loop techniques and found that using "human verification to update detectors and reduce the search space leads to the rapid production of high-quality bounding-box annotations" (Russakovsky et al., 2015). Work done in the field has also demonstrated how human-in-the-loop annotations may aid in increasing the interpretability of models (Lage et al., 2018).

Active learning

Close to the idea of human-in-the-loop models is the concept of active learning. The core idea behind active learning is for a model to ask an oracle (i.e. a human) to label the most "interesting" example that the model can learn from. This rests on the idea that not all data is equally valuable to learn from. Therefore one of the core challenges of active learning is figuring out "which questions to pose to humans" and understanding the labelling cost trade-offs involved (Russakovsky et al., 2015). Several approaches exist for finding the most valuable examples. First, the method of uncertainty sampling has the model select examples for which it has the least confidence (Roh et al., 2018). Second, query-by-committee takes uncertainty sampling further by training a committee of models on the same labelled data (Roh et al., 2018). Active learning approaches have been used for image classification to reduce the labelling effort in various ways. Joshi et al. developed an active learning image classification framework using a variation of uncertainty sampling as their query selection algorithm. They empirically tested their framework on the UCI and Caltech-101 datasets and saw "large reductions in the number of training examples required over random selection to achieve similar classification accuracy" (Joshi et al., 2009). In separate work, Kovashka et al. developed a more complicated active learning framework and found that they could "successfully accelerate" object learning by using significantly fewer images (Kovashka et al., 2011). Human-in-the-loop learning and active learning are similar in many respects. The difference is that active learning is a special case of "semi-supervised learning" that makes use of a human in the loop (Roh et al., 2018). See Figure 2.4 for an overview of the key difference between active learning and human-in-the-loop training.

Figure 2.4: Active Learning and Human-in-the-loop Training
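A minimal sketch of least-confidence uncertainty sampling, as described above, is shown below: the unlabelled examples whose top predicted probability is lowest are the ones sent to the human oracle. The function name and array shapes are illustrative assumptions rather than part of any cited framework.

```python
# Least-confidence uncertainty sampling (illustrative sketch).
import numpy as np

def least_confident(probabilities: np.ndarray, k: int) -> np.ndarray:
    """probabilities: (n_samples, n_classes) softmax outputs of the current model.
    Returns the indices of the k samples the model is least confident about."""
    top_prob = probabilities.max(axis=1)   # confidence of each predicted class
    return np.argsort(top_prob)[:k]        # lowest confidence first

# Example: ask the oracle to label the 10 most "interesting" examples
#   query_indices = least_confident(model_probs, k=10)
```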


Semi-supervised learning

Semi-supervised learning makes use of existing labelled data to predict labels for unlabelled data. The core idea behind semi-supervised learning is to "exploit labels that already exist" and therefore "generate more labels by trusting one's own predictions" (Roh et al., 2018). With semi-supervised learning the labels may become "weak", meaning that there is some noise in the data. A simple semi-supervised training procedure may work in the following way (Yarowsky, 1995), sketched in code after the list:

1. Train a model on the labelled data

2. Use the model to predict labels for all the unlabelled data

3. Rank the predictions by their confidences

4. Add the most confident predictions to the set of labelled examples

5. Repeat steps 1-4 until all data is labelled
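The following is a minimal sketch of this procedure using scikit-learn; the choice of classifier, the NumPy feature representation and the confidence threshold are illustrative assumptions rather than a prescribed implementation.

```python
# Self-training loop: trust the most confident predictions (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Iteratively grow the labelled set by accepting confident predictions."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)                  # step 1: train on labelled data
        probs = model.predict_proba(X_unlab)     # step 2: predict the unlabelled pool
        conf = probs.max(axis=1)                 # step 3: rank by confidence
        keep = conf >= threshold                 # step 4: accept confident predictions
        if not keep.any():
            break
        new_labels = model.classes_[probs[keep].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~keep]                 # step 5: repeat on what remains
    return model
```

Note that the labels produced this way are "weak" in the sense used above: a confident prediction is not necessarily a correct one, which is why a high acceptance threshold is typically used.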

Numerous works have examined the performance of semi-supervised techniques on image classification and labelling with promising results (Guillaumin et al., 2010; Kingma et al., 2014; Gong et al., 2016). While human-in-the-loop and active learning techniques can be applied without any existing labels, semi-supervised learning requires a subset of the data to be labelled.

An interesting question within the semi-supervised learning field is just how many (or rather, how few) labels are required to obtain sufficient labelling performance. Early work in the field demonstrated that a handful of examples may suffice if they are of high quality, although this is highly domain dependent (Zhou et al., 2007). In recent years, the question of how to learn from limited information has formed a new subfield within machine learning called few-shot learning (FSL). FSL models attempt to classify effectively from a limited number of examples. Thus, FSL can significantly help to "reduce the data gathering effort" (Wang and Yao, 2019). Work done on a new few-shot learning architecture called matching networks has proven successful in achieving state-of-the-art results in the field (Vinyals et al., 2016). While FSL "neither requires the existing of unlabelled samples nor an oracle", it is still often closely related to semi-supervised learning (Wang and Yao, 2019).

2.1.5 Sourcing Image Data From Videos

Sourcing data from videos to be used in image classifiers is a valuable methodology. Although many of the tools used for video labelling are similar to those in image classification, I will briefly discuss related work in this field. The key challenge in handling videos is that in addition to the spatial dimension (single images/frames), videos also contain a temporal dimension (time). Although the temporal information provides an additional challenge, it can also be used to improve the accuracy of labelling entire videos (Karpathy et al., 2014). A key benefit of the temporal dimension is the natural data augmentation that is caused by the jittering of the frames. Researchers have found that even tiny variations in frames can be enough to confuse image classifiers (Zheng et al., 2016) - see Figure 2.5. This suggests that such temporal locality (i.e. similarity of adjacent frames) can be used as valuable training data to improve classifiers.

Figure 2.5: Confusing State-of-the-art Classifiers

One area where sourcing training data from videos can be helpful is when dealing with a lack of difficult classes to predict. A common problem faced when training classifiers is that of class imbalance due to limited training data for certain "hard examples". An oft-cited instance of this problem is in the autonomous vehicle space, where the performance of self-driving cars is poor in certain extreme weather conditions, partly caused by a lack of such hard examples in the training data (Zang et al., 2019). There is therefore strong motivation to use videos as a source of such hard examples, and recent works have shown that it is possible to "improve detector performance on the source domain by selecting hard examples from videos" (Jin et al., 2018). Closely related to the problem of limited class data is the previously discussed topic of few-shot learning. Early work has also successfully used few-shot learning techniques to classify events in videos with limited training data (Yan et al., 2015).


2.2 Existing Labelling Tools

In this section I will describe and evaluate the existing tools used for annotating video data in machine learning research. The six most popular video annotation tools are: LabelBox, RectLabel, LabelMe, VoTT, VATIC, and CVAT. These tools were examined to understand their strengths and weaknesses. See Figure 2.6 for a breakdown of the analysis. The main takeaway from the analysis is that all tools have their relative advantages for specific domains, but no tool has a combination of native video support, real-time video playback and the ability to auto-generate labels.

Figure 2.6: Comparison of Existing Video Annotation Tools

LabelBox

Figure 2.7: Annotation Tool: LabelBox

LabelBox is a commercial software platform that helps customers like Airbus, Bayer and Hitachi label data (Lab, 2019a). The platform can handle various label types including bounding boxes, polygons, pixelwise and image-level classification. However, there is no support for video data, which must be converted to images and labelled on a frame-by-frame basis. It is possible to use your own custom models to predict labels, but you must create and host your own API service to use this feature.

RectLabel

Figure 2.8: Annotation Tool: RectLabel

RectLabel is a commercial image annotation tool that can be used to label images for bounding box object detection and segmentation (Rec, 2019). Similar to LabelBox, it offers no native video support and users must therefore manually convert videos to images if they wish to label them. No label prediction or video playback solutions are offered.

LabelMe

Figure 2.9: Annotation Tool: LabelMe

LabelMe is an open-source annotation tool created by the MIT Computer Science and Artificial Intelligence Laboratory (Lab, 2019b). The tool supports instance segmentation, semantic segmentation, bounding box detection and image-level classification. However, like the previous tools, it does not offer video support and videos must be converted to images to be used.

VoTT

Figure 2.10: Annotation Tool: VoTT

Visual Object Tagging Tool (VoTT) is an open-source app from Microsoft designed for building end-to-end object detection models from images and videos (VoT, 2019). Unlike the previous tools, it offers the ability to upload videos without converting them to images. However, to label an entire video a user must label every single frame, which is time-consuming and expensive. There is no support for label prediction tools either.

VATIC

Figure 2.11: Annotation Tool: VATIC

Video Annotation Tool from Irvine, California (VATIC) is an interactive video annotation tool for computer vision research (VAT, 2019). The tool is specialized to crowd-source labelling tasks to the Amazon Mechanical Turk human intelligence platform. Similar to VoTT, it offers both native video support and frame-level playback. However, labelling an entire video is still slow and the tool offers no label prediction support.

CVAT

Figure 2.12: Annotation Tool: CVAT

Computer Vision Annotation Tool (CVAT) is an interactive video and image annotation tool for computer vision (CVA, 2019). CVAT is specialized for bounding box and segmentation applications. Unlike the previous tools, CVAT has the ability to integrate TensorFlow models to run label prediction. However, this feature is not going to be supported in the future and requires users to convert their models to OpenVINO IR format.


2.3 The iFIND Project: Ultrasound Fetal Scan Plane Detection

To motivate the need to source labelled images from unlabelled video data to improve classifiers in a real-world context, I will now discuss the challenges faced by the iFIND ultrasound research team.

The field of image classification in ultrasound imaging presents a series of challenges for researchers. First, the quality of images is limited by the high levels of noise present. Second, there is a limitation on the amount of data available, primarily caused by the difficulty of data acquisition driven by privacy issues in the medical field. Third, various image obstructions, such as the position of the fetus, add a level of idiosyncrasy across datasets. However, researchers have made substantial progress in classifying and segmenting ultrasound fetal scan planes (Baumgartner et al., 2016b; Schlemper et al., 2018b; Baumgartner et al., 2016a; Chen et al., 2015; Toussaint et al., 2018; Cai et al., 2018). As this research project is conducted as part of the iFIND project, we will examine the results of two impactful papers produced by members of the project: Baumgartner et al. (2016b) and Schlemper et al. (2018b).

Figure 2.13: SonoNet Architecture Reproduced from (Baumgartner et al., 2016b)

In 2016 Baumgartner et al. presented the SonoNet deep learning model, which was able to classify 2D fetal ultrasound standard planes. Specifically, they focused on real-time classification of the 13 scan planes recorded by a sonographer. Their network architecture was based on VGG16 (Baumgartner et al., 2016b), with the final convolutional neural network architecture shown in Figure 2.13. The researchers obtained a mean F1-score of 0.798 across the 13 classes. However, the network struggled to classify different cardiac chambers, leading to low recall values for such classes.

In 2018 Schlemper et al. extended the SonoNet architecture by adding attention gates to the network. This addition gave the network the ability to automatically learn to focus on certain substructures within the image. As a result, the researchers were able to improve the results of SonoNet significantly. The introduction of attention gates combined with other modifications increased mean F1-scores to above 0.90. The majority of the improvement came from higher precision scores driven by the reduction of false positives. The researchers suggest that this improvement happens "because the gating mechanism suppresses background noise and forces the network to make the prediction based on class-specific features" (Schlemper et al., 2018b). However, the network still struggled with the most challenging cardiac views. To improve such cardiac view classification, more data is needed. Fortunately, hundreds of hours of video data are available, although they remain to be labelled. This labelling process is currently extremely expensive and time-consuming, and therefore the team needs a tool to intelligently label this video data while wasting as little of the sonographers' time as possible.


2.4 Ethical & Professional Considerations

In this section I will briefly discuss the ethical and professional considerations of this dissertation. Out of the 29 ethical questions on the ethics checklist¹, I answered yes to only one question: "Does your project involve human participants?". I use human participants for the five user interviews conducted. I chose these participants from the BioMedIA research laboratory and through the iFIND ultrasound research project. They all consented to meet and discuss the project in informal meeting sessions. I answered no to the other 28 questions on the checklist.

¹Ethics Guideline: https://www.doc.ic.ac.uk/lab/msc-projects/ProjectsGuide.html#ethics

Figure 2.14: AIGVA Securing File Uploads

The other ethical issue relating to my project that I would like to highlight is data privacy. I should start by noting that I collect no personal data as part of the project. Here I use the European Union's GDPR definition of personal data, which defines it as any data that relates to an "identified or identifiable living individual" (GDP, 2019). However, the tool I have built does host data in the cloud. To ensure this data stays private, I have built an extensive system that verifies uploads by keeping track of user tokens. Specifically, when a user attempts to upload an item to the cloud, the web application requests the Core Data API to generate a pre-signed upload link. Once received by the web application, this link grants temporary write/read access to the client. The web application is then able to upload a file such as a video directly to S3 without needing to send any data to the web server, as seen in Figure 2.14. This significantly increases security by limiting data accessibility. This topic is discussed further in the implementation section 4.1.3.
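As an illustration of the pre-signed link mechanism, the sketch below shows how such a time-limited S3 upload URL could be issued with boto3; the bucket name, key and expiry time are hypothetical and do not reflect the actual AIGVA configuration.

```python
# Issuing a time-limited pre-signed S3 upload URL (illustrative sketch).
import boto3

s3 = boto3.client("s3")

def create_upload_link(key: str, expires_in: int = 900) -> str:
    """Return a URL the browser can PUT a file to directly, so the file
    never passes through the web server."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "aigva-uploads", "Key": key},  # hypothetical bucket/key
        ExpiresIn=expires_in,
    )

# The web application uploads with a plain HTTP PUT to the returned URL; after
# `expires_in` seconds the link stops working, which limits data exposure.
```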

Chapter 3

Method

In this chapter I develop several label prediction techniques that can speed up the labelling process. Then I show a set of potential user stories to highlight how these methods can be integrated into a web application. I then take these user stories and develop a set of design principles that will be used to build the application. Using these design principles I proceed to create the functional specification of the web application including visual mockups of all pages.


3.1 Label Prediction

Labelling is a tiresome and manual process. In this section I develop several ideas that aim to speed up the process of manually labelling videos by automatically predicting labels. These ideas rely on much of the literature discussed in the previous chapter. The three ideas developed are: Cold Label, Blitz Label, and Custom Model Label. The names and ideas require some explanation, but a summary of the approaches is shown in Figure 3.1.

Figure 3.1: Summary of the Label Prediction Approaches

3.1.1 Cold Label

"Cold Label" is a label prediction technique developed to address the cold start problem. The cold start problem is the common issue faced in machine learning when no labels exist. For example, the problem could arise when researchers start a new project in a new domain and receive a large set of unlabelled videos that need labelling. The "Cold Label" prediction technique attempts to solve this problem by asking a user to label a few samples from a set of videos and subsequently using those labelled samples to train a classifier. This classifier can then be used to predict all frames in entire videos. A summary of the method can be seen in Figure 3.2.


Figure 3.2: Summary of the Cold Label Prediction Technique

The hypothesis underlying the "Cold Label" idea is that it should work for several reasons. First, it should work because videos have inherent similarity within them. Specifically, across the temporal (time) dimension, individual video frames are similar to their neighbor frames. For example, in a sitcom, a character may be speaking for 5 seconds before the camera switches to another character. If the sitcom was playing at 25fps¹, then the character would be on screen for 125 frames. The idea is that there is enough information content in 1-2 of the 125 frames for a model to label them all. And, to continue with the example, should the character reappear on the screen in the same or another video, then a model could also generalize to label these frames.

¹25fps is the PAL (phase alternating line) British television frame-rate standard

Second, the "Cold Label" method should work because of advances in transfer learning and few-shot learning. Specifically, as discussed in Section 2.1.4, recent improvements have shown that it may now be possible to train a classifier on a small set of samples and have it generalize well. Third, the "Cold Label" method should work because we take the ideas from human-in-the-loop computing to query an oracle (i.e. a human) to label a subset of frames. By guaranteeing the "correctness" of the labels through human intervention, we can hopefully overcome the cold start problem. We can even incorporate the ideas of active learning by attempting to find the "best" frames to label with the highest information content possible. Although traditional active learning deals with querying the model for selecting the optimal sample to label, we can reverse engineer this by attempting to guess what the best sample may be. We explore this idea in the implementation section by developing a "Unique Image Extraction" algorithm (see 4.3.1).
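To illustrate the transfer-learning step that "Cold Label" relies on, the sketch below fine-tunes a pre-trained backbone on the handful of frames a user has labelled; the model choice (ResNet-18), the frozen layers and the hyperparameters are illustrative assumptions, not the AIGVA implementation described in Chapter 4.

```python
# Transfer learning from a few labelled frames (illustrative sketch).
import torch
import torch.nn as nn
from torchvision import models

def build_cold_label_model(num_classes: int) -> nn.Module:
    model = models.resnet18(pretrained=True)     # ImageNet weights as a starting point
    for param in model.parameters():             # freeze the backbone: only a few
        param.requires_grad = False              # labelled frames are available
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head
    return model

# Training then uses only the user-labelled frames, e.g.:
#   model = build_cold_label_model(num_classes=2)
#   optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# and the fitted model is run over all remaining frames to propose labels.
```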

3.1.2 Blitz Label

"Blitz Label" is a label prediction technique that leverages pre-existing labels to predict new labels. The core idea behind the technique is that if we already have labelled some videos, we can use that embedded information to annotate unlabelled videos. For example, suppose that as part of a project a researcher had to label 100 videos and had already labelled 50 of them. He could then train a classifier on the 50 labelled videos and use that classifier to predict labels for the remaining 50 unlabelled videos. That is the intuition driving the development of the "Blitz Label" technique. A summary of the method can be seen in Figure 3.3.

Figure 3.3: Summary of the Blitz Label Prediction Technique

The hypothesis underlying the "Blitz Label" technique is that it should work for reasons similar to why "Cold Label" should work. The core intuition is the same: videos have inherent similarity that could be exploited by utilizing developments in few-shot and transfer learning. The key difference is that no human labelling is required as part of the "Blitz Label" process, as long as the videos already have some labels.

3.1.3 Custom Model

The "Custom Model Label" is a label prediction technique that enables machine learning researchers to use their own models as part of the AIGVA platform. The core problem driving the development of this technique is that some domains, such as medical image analysis, require very complex classification models. Therefore, giving a user the option to integrate their own machine learning models can tackle this problem. While machine learning researchers could write their own inference tools for label prediction, they would not get the benefits of the AIGVA web application, which also provides a visual tool to examine predictions that can quickly be shared with domain experts. A summary of the method can be seen in Figure 3.4.

Figure 3.4: Summary of the Custom Model Label Prediction Technique

The hypothesis underlying the "Custom Model Label" idea is that it should provide researchers with the ability to use their own models. This will hopefully remove any domain-specific problems that arise when using either of the Cold or Blitz Label methods.
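As a purely hypothetical illustration, a user-supplied model could be wrapped behind a small, uniform prediction interface of the following kind; the class and method names here are assumptions made for exposition, and the actual Custom Model API is described in Section 4.3.3.

```python
# Hypothetical wrapper for a researcher-supplied model (illustrative only).
from typing import List
import numpy as np

class CustomModel:
    """Expose a researcher's own classifier behind a uniform predict interface."""

    def __init__(self, weights_path: str, labels: List[str]):
        self.labels = labels
        self.model = self._load(weights_path)    # e.g. torch.load / tf.keras.models.load_model

    def _load(self, weights_path: str):
        raise NotImplementedError("Researcher plugs in their own model-loading code here")

    def predict(self, frames: np.ndarray) -> List[str]:
        """frames: (n, H, W, C) array of video frames; returns one label per frame."""
        scores = self.model(frames)                          # researcher-defined forward pass
        return [self.labels[i] for i in np.argmax(scores, axis=1)]
```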


3.2 User Stories

To aid in developing the AIGVA tool, a set of user stories have been created. The idea is to illustrate how different users of varying technical backgrounds can benefit from using the tool.

3.2.1 User Story #1: Building a Pedestrian Classifier for an Autonomous Vehicle Startup

This user story has been written to show how a non-technical user can use the tool to significantly speed up the process of labelling videos.

User Story #1.1: Building a Pedestrian Classifier

Figure 3.5: User Story 1.1: Building a Pedestrian Classifier


Bob, a data analyst at an autonomous vehicle startup, has been given a labelling task by his boss. The company needs a pedestrian classifier in their driving software and therefore requires some training data so they can build these models. Bob has been given 10 hours of dashcam footage from cars and been asked to label all frames for instances of pedestrians.

1. Bob creates a new project with two labels: ”pedestrian” and ”no pedestrian”

2. Bob uploads the 10 hours of videos to the tool

3. Bob uses the "Cold Label" label prediction technique to manually label a few images

4. Bob trains a Cold Label classifier in-browser on the manually labelled images

5. Bob uses the Cold Label classifier to quickly predict labels for all 10 hours of video

6. Bob reviews a few of the videos to make sure the ”Cold Label” prediction tech- nique worked as intended

7. Bob uses the export tool to generate a .zip file with all the video frames located in folders named after their labels

8. Bob sends the labelled images in the .zip file to the ML team and they begin building their classifier


User Story #1.2: Extending the Pedestrian Classifier

After much testing of their new pedestrian classifier, the machine learning team unfortunately find that their model is not accurate enough to be deployed in their driving software yet. To improve the algorithm, they need more training data. Fortunately, their data collection team has acquired 20 additional hours of dashcam video data. Bob is given these 20 hours of unlabelled data and is again asked to label them for instances of pedestrians.

Figure 3.6: User Story 1.2: Extending the Pedestrian Classifier

1. Bob’s boss comes a few days later and says he has managed to get 20 additional hours of dashcam videos that need labelling

2. Bob uploads the 20 hours of new dashcam videos to his existing pedestrian classifier project


3. Since Bob already has 10 hours of labelled videos, he trains a ”Blitz Label” classifier on the previously labelled videos

4. Bob uses the Blitz Label classifier to quickly predict labels for all 20 hours of video

5. Bob reviews a few of the videos to make sure the ”Blitz Label” prediction tech- nique worked as intended

6. Bob exports the newly labelled videos

7. Bob sends the new data to the ML team, who can now build a much more accurate pedestrian classifier


3.2.2 User Story #2: Improving Standard Plane Detection in Fetal Ultrasound Classifiers

This user story has been written to show how a technical user can use the tool to significantly speed up the process of labelling videos.

Figure 3.7: User Story 2: Improving Standard Plane Detection in Fetal Ultrasound Clas- sifiers

Alice, a machine learning researcher at Imperial College London's bio-medical image analysis group, BioMedIA, is working on a project to develop an algorithm to detect standard scan planes in fetal ultrasound videos. Over the past few months she has built a classifier that works well on classifying the 13 standard scan planes. However, it is not quite accurate enough to be deployed in the real world. Fortunately, she has just been given access to 30 hours of new ultrasound videos which, if added to the training data, she believes can improve the accuracy of the model enough for it to be deployed. Unfortunately, this data is unlabelled and her sonographer colleague Charlie is very busy and does not have much time to label the data for her.

1. Alice creates a new project with 13 labels, one for each of the standard ultra- sound planes

2. Alice uploads the 30 hours of fetal ultrasound videos to the tool

3. Alice easily integrates her current ultrasound model with the ”Custom Model” API

4. Alice, having uploaded a model, uses the "Custom Model" label prediction technique to predict labels for all 30 hours of video

5. Alice, realizing her knowledge of ultrasound is poor, uses the "share" functionality on some of the newly labelled videos to send them to the sonographer Charlie

6. Charlie reviews the videos sent to him, corrects a few labels, and lets Alice know that the videos are now correctly labelled

7. Alice uses the export tool to generate a .zip file with all the video frames located in folders named after their label

8. Alice uses the new data to retrain her models and fortunately sees that her F1-scores have improved significantly.


3.3 Design Principles

User interface design is critical when building an application that needs to be used by both technical and non-technical audiences. Galitz has developed an extensive set of guidelines for designing great user interfaces (Galitz, 2007). I will use the relevant principles from Galitz's work on web development to ensure I cover all key design considerations when building the AIGVA tool. Below I will briefly discuss each of these principles and how I plan to include them in the AIGVA tool.

Select the Proper Interaction and Display Devices

I will limit the use of the AIGVA application to work on laptop and desktop devices only. Because of the smaller screens of mobile and tablet devices, designing for both large and small screens ”results in the need for different modes” of the application (Benyon, 2014). Creating several modes of the AIGVA tool would cause significant development overhead and other features would have to be sacrificed. Therefore I will limit the supported screen-size to be 1008px2, although the application may work on smaller sizes.

I will also design the tool as a web application rather than a desktop application. This is for two key reasons: (1) web applications do not have to be installed and new feature updates can constantly be pushed to them, and (2) web applications are easier to use for a non-technical audience because of the familiar surroundings of a web browser. Because Google Chrome is the most used browser, I will develop the tool to be used with this browser in mind.

Organize and Layout Windows and Pages

The usability of the layout is crucial to developing an easy-to-use and consistent interface. Since the tool is directed at both a technical and non-technical audience, the overarching goal of the layout is to be consistent with web application conventions so as not to confuse the audience.

2Screen sizes and breakpoints: https://bit.ly/2LkLlft


Figure 3.8: AIGVA Mockup: Application Layout

For that reason, I have opted for the classic sidebar/content pattern as seen in Figure 3.8. On the left side of the screen, the menu is located for easily accessible navigation. On the remaining part of the screen, the content is surrounded by a border. This border gives consistency to the user, as important information will always be displayed in the same location without the user needing to scroll. Research has found that ”information displayed with a border around it is easier to read, better in appearance, and preferable” (Galitz, 2007).

The AIGVA tool will also use breadcrumbs to display where the user is currently located. Breadcrumbs are page hierarchies that display the trail a user has taken to get to a certain page (i.e. Ultrasound Project > Videos > Export). Using breadcrumbs in web applications has been found to create ”more efficient navigation and/or improved user satisfaction” (Galitz, 2007).

Choose the Proper Screen-Based Controls

The controls and components are the building blocks of an application. Instead of building these from scratch I will utilize the Ant Design UI component library (Ant, 2019). This library is one of the most popular and well documented React extensions and comes with hundreds of pre-built components. See Figure 3.9 for examples of

these components, such as buttons, input fields, cards, statistics, breadcrumbs and links. Using this component library will help to ensure that the design is consistent and sensible throughout the AIGVA tool.

Figure 3.9: AIGVA UI Component Library

Another control consideration is the inclusion of keyboard shortcuts. In a study, ”keyboard shortcuts have been found to be significantly faster and more accurate than mouse clicks” (Galitz, 2007). Therefore I will include the option to use keyboard shortcuts on pages where significant user interaction is expected.

Provide Effective Feedback and Guidance and Assistance

The human learning loop relies on feedback. When users interact with an object or component and get no response, they become frustrated. To combat this, I will use Galitz’s hierarchy of feedback to guide design choices (Galitz, 2007):

• If the wait is less than a second: provide an immediate visual cue when the operation is completed within 100ms (i.e. button click changes color briefly)

• If the wait is more than a second but less than 10 seconds: provide an animated ”busy icon” until the operation is complete

• If the wait is greater than 10 seconds: provide a time estimate in the form of a status or progress bar


Create Meaningful Graphics, Icons, and Images

Where applicable, I will use visual cues in the form of icons and graphs to display information. Research has found that the use of such graphics can help ”hold the users attention, add interest to a screen, [and] support computer interaction” (Galitz, 2007).

Choose the Proper Colors

Figure 3.10: AIGVA Color Scheme

Color is a crucial element in designing strong visual hierarchies. A study on color in applications found that it ”has a stronger perceptual influence than proximity or grouping” (Galitz, 2007). To make the application accessible, I designed a simple blue-gray-white color scheme (see Figure 3.10). Using the Coolors.co tool I ensured that the color scheme was clearly visible to users with the most common forms of color blindness: protanomaly, deuteranomaly, and tritanomaly.


3.4 Requirements

In this section, I will describe the functional specification and requirements of the various components that make up the AIGVA tool. The purpose of this section is to guide development and inform the reader as to the scope of the project. I hope to aid the reader in understanding the specifications by providing visual mockups of key application pages. These mockups were also used to guide the development of the web application.

The tool consists of three functional groups: project management, video management, and label prediction.

3.4.1 Project Management

Project Dashboard

A user is presented with a dashboard as the homepage containing the following:

• A section showing the current project with information about the number of videos and labels associated with that project

• A graph showing the distribution of labels of the current project

• A table with information about all the projects created and the ability to switch to another project

• The ability to create a new project by clicking a button

• The ability to edit an existing project by clicking a button

Figure 3.11 shows a mockup of the project dashboard.

Project Add/Edit Page

A user is able to create and modify projects on the add/edit project page. A project must have the following attributes:


Figure 3.11: AIGVA Mockup: Project Dashboard Page

• A name defining the project

• A short description

• A set of labels that are non-overlapping

• A default label to be used when new unlabelled videos are uploaded

Figure 3.12 shows a mockup of the project add/edit page.

Figure 3.12: AIGVA Mockup: Project Add/Edit Page


3.4.2 Videos

Video Management Page

The AIGVA tool has a video management page with the following functionality:

• A table showing all the uploaded videos, and their associated metadata (i.e. frames per second, file location etc.). The table should be paginated, with maximum 10 videos per page

• A user can upload a video using a simple in-browser upload tool

• A user can delete a video by clicking a button, which also deletes labels associated with the video

• A video must be one of the following supported formats: .mp4, .mpg and .webm

Figure 3.13 shows a mockup of the video management page.

Figure 3.13: AIGVA Mockup: Video Management Page

Label Review Page

The label review page must contain a table with the following information and actions:


• Metadata about each label-set including: the video associated with the label-set, the number of frames in the label-set, reviewed status (true/false), the date the label-set was created, and the model (if any) used to generate the labels

• A ”review” button that takes a user to the in-browser video player

• A ”share” button that generates a shareable link to the in-browser video player for that label-set. When clicked, the tool should automatically copy the link to the user’s clipboard

Figure 3.14 shows a mockup of the label review page.

Figure 3.14: AIGVA Mockup: Label Review Page

Video Analysis Page

The video analysis page must have an in-browser video player, a label prediction section, and a label modification tool:

• A video player section with navigation and information presented to the user

– Video navigation should be possible in several ways:

∗ A play and a pause button triggering the video to play/pause

∗ Using a slider that can be used to scan through the movie (like YouTube)


∗ Frame-by-frame skipping using either buttons or arrow keys with several options available for the number of frames to skip (skip 1 frame, x/4 frames, x/2 frames, x frames, where x is the FPS for that particular video)

∗ Skip-to-next-label navigation using either buttons or arrow keys that automatically scans the set of labels for the ”next different” frame

– The following video summary statistics should be presented and updated in real-time: FPS, current frame, timestamp, and progress (as %)

– A label preview bar under the video player that matches the length of the slider should indicate to users the distribution of labels across the video

– Audio should be enabled

• A label prediction section that shows two statistics in real-time:

– The label prediction for the current frame

– The confidence of the label prediction for the current frame

• A label modification tool that enables a user to change labels:

– A user should be able to modify labels by inputting both a start and an end frame (which can be the same), then selecting a label from the list of available labels for that project, and ultimately clicking a ”Change Labels” button

– Changing labels will refresh the page with the updated set of labels and set their confidence to 100% since they were labelled by the user

Figure 3.15 shows a mockup of the video analysis page.

Label Export

The AIGVA application should have a page that makes it possible to export labels in two ways: (1) downloading the labels only or (2) downloading the labels and the corresponding images:


Figure 3.15: AIGVA Mockup: Video Analysis Page

• The first way of exporting labels is a ”JSON only” export that generates a JSON file containing frame-by-frame labels for all videos in a project

• The second way of exporting labels is a ”full download” that generates a .zip file containing both images and their labels

– The output of a ”full download” must be a zipped file with a folder for every label containing the individual video frames (in the .jpg format) belonging to that label. This is so they can directly be fed into an ML pipeline

– The user must be able to configure a ”full download” by selecting the number of frames to skip, if any (i.e. keeping only every 8th frame will reduce the total number of images by a factor of 8)

– The application should show a preview folder structure of a ”full download” by estimating the number of labels per class depending on the number of frames to skip, if any

Figure 3.16 shows a mockup of the export page.


Figure 3.16: AIGVA Mockup: Export Page

Miscellaneous

• A user must always see their relative location in the application by viewing a ”breadcrumb” at the top of every page (i.e. Project Name / Label Prediction / Blitz Label)

• The AIGVA tool should have a collapsible sidebar for navigation

• A settings page should exist where a user can customize various settings for the AIGVA application

• If a user tries to access an invalid subdomain on the AIGVA page, the user must be redirected to a blank 404 error page

3.4.3 Label Prediction

Real-Time Prediction Pages

A user must be able to predict labels using three methods: cold labelling (no labelled videos necessary), blitz labelling (some labelled videos necessary) and custom model

labelling (own custom machine learning model required).

• A user must be able to select one of the three types of machine learning models and then use it to predict labels for a video

• While a model is predicting labels for a video, the user must be kept informed as to the progress of the predictions

Figure 3.17 shows a mockup of the prediction page (which is near-identical for Cold, Blitz and Custom Model Label).

Figure 3.17: AIGVA Mockup: Label Prediction Page (Cold, Blitz and Custom Model Label)

Cold Labelling

”Cold Labelling” is when a user predicts labels by training a machine learning model on a few manually labelled frames per video. This works in three steps:

• First, a user should specify the number of frames they are willing to label per video and the AIGVA tool should then automatically extract that number of frames from each video uploaded


• Second, when a list of images has been extracted from all videos, a user should be able to manually label the frames using a simple graphical user interface (GUI) with the following features:

– For every image to label, the GUI tool should include buttons for all labels associated with that project and keyboard shortcuts to quickly label an image

– The GUI tool should provide a progress bar to let the user know how many images are left to manually label

– The GUI tool should have a ”back” button in case an item was wrongly labelled

• Third, after a user has manually labelled the training data, the user should be able to train an image classification machine learning model on the data and monitor its progress

Figures 3.18, 3.19, and 3.20 show mockups of the three Cold Label steps.

Blitz Labelling

”Blitz Labelling” is when a user predicts labels by training a machine learning model on previously reviewed labels in their project. It works in two steps:

• First, a user should be able to extract training data from reviewed videos:

– A user must have some reviewed videos that have been previously labelled either manually, by a ”Cold Label” model, by another ”Blitz Label” model, or by a ”Custom Model”. These videos must have been reviewed by a human to verify the labels as ”ground truth”

– A user should specify the number of frames they want to skip per video and the AIGVA tool should then automatically extract that number of frames from each video marked as ”reviewed”


Figure 3.18: AIGVA Mockup: Cold Label (Step 1: Extract)

Figure 3.19: AIGVA Mockup: Cold Label (Step 2: Label)

Figure 3.20: AIGVA Mockup: Cold Label (Step 3: Train)


• Second, after a user has extracted the training data, the user should be able to train an image classification machine learning model on the data and monitor its progress

Figures 3.21 and 3.22 show mockups of the two Blitz Label steps.

Figure 3.21: AIGVA Mockup: Blitz Label (Step 1: Extract)

Figure 3.22: AIGVA Mockup: Blitz Label (Step 2: Train)

Custom Model Labelling

”Custom Model Labelling” is when a user predicts labels by using their own customized machine learning model and writing a small amount of code to wrap their model into the AIGVA tool.


• If a user already has a machine learning model, they should be able to integrate it into the AIGVA tool by implementing a simple Python interface that can be used to predict frames
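To make this concrete, a wrapper of this kind might look roughly like the sketch below. The class name, method signature and the assumption that the user’s model is a PyTorch module saved as a whole object are illustrative only; they do not prescribe the tool’s actual ”Custom Model” interface.

# Illustrative sketch only: the names and the PyTorch loading convention are
# assumptions, not the AIGVA tool's actual "Custom Model" API.
import numpy as np
import torch


class CustomModelWrapper:
    def __init__(self, weights_path):
        # Assumes the user saved their full model object with torch.save(model, path)
        self.model = torch.load(weights_path, map_location="cpu")
        self.model.eval()

    def predict(self, frame: np.ndarray):
        """Return (label_index, confidence) for a single video frame."""
        # Simplified preprocessing; a real model would apply its own transforms
        tensor = torch.from_numpy(frame).float().unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(self.model(tensor), dim=1)
            confidence, label = probs.max(dim=1)
        return int(label.item()), float(confidence.item())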

Chapter 4

Implementation

4.1 Software Architecture & Technology Stack

In this section of the report, I will discuss the technologies used to build the application. I will describe the overall architecture of the tool and how individual components interact. Modern web development is complicated for many reasons. There is such an overwhelming number of tools available that it becomes hard to know when to use which tool. Furthermore, the interplay between frameworks is opaque and often breaks, and the tools themselves keep evolving from week to week. In this section I will argue why certain frameworks and tools were chosen over others in the context of building the AIGVA tool. I will try to illustrate these benefits by providing small code sections and diagrams where applicable.


Figure 4.1: AIGVA Architecture Diagram

The entire AIGVA application is divided into three separate components: (1) the Core Data API responsible for video processing, machine learning and other data related tasks, (2) the Web Application responsible for providing the UI and browser tool that a user can interact with, and (3) the File Storage and Database layer responsible for hosting all video data and application related data.

4.1.1 Core Data API

The Core Data API is the ”heavy lifter” of the application. It provides various API endpoints for the web application to use when any significant data processing is needed. This includes: video processing, image extraction, machine learning training and inference, export functionality and other features.


Backend Web Framework

A web framework is a piece of software designed to build APIs and web applications. For the Core Data API, choosing the appropriate framework is the most important architectural decision as it defines the speed of development. For that reason, I have defined several key requirements a framework must have. Key requirements:

• Speed:

– Rationale: The application will have to handle many requests/second when providing progress updates to data heavy tasks or serving many users simultaneously

– Ranking metric: Avg. latency (lower is better)

• Data throughput:

– Rationale: Certain application-specific data packets such as videos and labels are large and thus a framework needs to accommodate high data throughput

– Ranking metric: Data throughput per second (higher is better)

• Machine learning support:

– Rationale: Most modern ML is done using Python based frameworks, so having a Python based web-framework makes synchronization easier. The SonoNet ultrasound model is also written in PyTorch, a Python based ML framework.

– Ranking metric: Python-based

• Well documented and widely used:

– Rationale: A more well documented framework is easier to work with and will have a larger user-base, which makes resolving development problems faster


– Ranking metric: # GitHub stars (higher is better)

Having formulated the requirements, I analyzed four of the most popular web frameworks: Flask, Django, Ruby on Rails and Express.js. The table below shows the comparison.

Figure 4.2: AIGVA Web Framework Comparison

I selected Flask as the web framework for the project since it performed well on all four criteria. Only Flask and Django were Python based, which ruled out Ruby on Rails and Express.js. In terms of speed and data throughput, Flask outperformed Django on both parameters. Furthermore, Flask had more GitHub stars, which was used as a proxy for how well documented the frameworks were. The authors of Flask describe the tool as ”designed to make getting started quick and easy, with the ability to scale up to complex applications” (Git, 2019). Those design principles make it an ideal tool to use for developing the AIGVA tool.

API Framework

The Core Data API is designed to communicate with the web application in the most efficient way possible. The most common way of designing such communication patterns is using the Representational State Transfer (REST) model since it ”emphasizes scalability of component interactions” (Fielding, 2000). APIs adhering to such a pattern are often termed RESTful APIs. While new tools such as GraphQL promise greater scalability, I have chosen to design the API using REST because of the breadth of its documentation and ease of integration with Flask.


To implement the RESTful API I have opted to use the Flask-Restless framework since ”it provides simple generation of ReSTful APIs for database models defined using SQLAlchemy” (Fla, 2019). That is, Flask-Restless creates elegant APIs using the object-relational-mapping tool SQLAlchemy, which I have chosen to define my models with. To demonstrate the simplicity of exposing such an API, I will show an example of how the route is created for Projects below.

Listing 4.1: API (Located in: backend/app/app.py)

class Project(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.Unicode)
    desc = db.Column(db.Unicode)
    main_label = db.Column(db.Unicode)
    label_types = db.Column(db.JSON)
    date = db.Column(db.DateTime)
    videos = db.relationship("Video", backref=db.backref('project'))
    models = db.relationship("Model", backref=db.backref('project'))
    labels = db.relationship("Label", backref=db.backref('project'))

manager = flask_restless.APIManager(app, flask_sqlalchemy_db=db)
manager.create_api(Project, methods=['GET', 'POST', 'PUT', 'DELETE'])

In the code example above, there are two sections to note. First, the project model, which is written using the object-relational-mapper SQLAlchemy. This defines the schema of a project, which has different attributes such as a name, description, label types etc. Second, the entire RESTful API, which is written in two lines of code. The first line uses the APIManager to initialize the API endpoints by binding the Flask application to the database. The second line creates an endpoint for the project model by specifying which HTTP methods are allowed. When the Flask app is running, the API is live and it is possible to send requests to the /api/project slug. For example, to get a list of all projects the client simply needs to send a GET request to CoreDataAPI:host/api/project/ and the server will return a JSON

object containing all the projects.
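As an illustration of consuming the generated endpoint from Python, the snippet below issues such a request with the requests library; the host and port are placeholders.

# Illustration only: the host/port are placeholders for wherever the Core Data API runs.
import requests

response = requests.get("http://localhost:5000/api/project")
response.raise_for_status()

# Flask-Restless returns a paginated JSON document; the rows live under "objects"
for project in response.json()["objects"]:
    print(project["name"], project["desc"])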

Web Server

Although Flask is shipped with its own development server, it is not suitable for production. Therefore a webserver is highly recommended. A webserver sits between Flask and the outside world. I selected Gunicorn since it is a popular choice for Flask and Docker. Specifically, Gunicorn is a ”Python WSGI HTTP Server for UNIX” (Gun, 2019) and it works by having multiple worker processes handle requests instead of a single-threaded worker, which Flask’s development server uses.

Worker-Broker Multi-Processing

One of the key challenges faced when building the AIGVA application was how to handle concurrent data processing requests. Consider the complex task of predicting labels for a video. Completion of this task may take minutes as it depends on: the efficiency of the data pre-processing pipeline, the complexity of the machine learning model, and the length of the video. Now consider a user who wants to predict labels for 10 videos. The naive solution is to have the user select to predict labels for one video, wait for the server to finish prediction, then select to predict labels for the second video, and so on until they have completed all 10. This is time consuming and inefficient. Instead, it would be much more efficient to have the server handle 10 tasks concurrently.

To create a server architecture that can handle concurrent jobs, one needs to use a message broker and task queue. The task queue is responsible for receiving tasks from the Flask application and processing them asynchronously, leaving the Flask application free to handle other requests. When a job is sent to the task queue client, the client issues the task to a worker, which runs as a new process. This enables pseudo-concurrency. Critical to this implementation is a message broker, which enables the task queue client to talk to its workers and vice versa. This is an entirely separate server.


Figure 4.3: AIGVA Redis-Celery Architecture

To build such a system, I opted to use Celery as a task queue and Redis as the message broker. This is a common stack because of its efficiency and compatibility with Flask. Celery is ”a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system” (Cel, 2019). I also selected Celery because of its ability to provide status updates to the frontend client-side, which provides a more enriched user experience as it keeps the end-user informed as to the status of their tasks. Redis is ”an in-memory data structure store, used as a database, cache and message broker” (Red, 2019). Minimal configuration is required for it to work efficiently with Celery and Flask, which was the primary reason why it was chosen.
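The sketch below shows the shape of this setup, assuming Redis runs on its default local port; the task body is a stand-in for the actual prediction work rather than the real implementation.

# Minimal Celery/Redis sketch; the broker URL and the task body are illustrative.
from celery import Celery

celery = Celery("aigva",
                broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")


@celery.task(bind=True)
def predict_labels_for_video(self, video_id):
    """Long-running job executed by a worker process, not by the Flask app itself."""
    total_frames = 1000  # placeholder; in practice read from the database
    for i in range(total_frames):
        # ... run inference on frame i ...
        if i % 100 == 0:
            # Publish progress so the web application can poll the task state
            self.update_state(state="PROGRESS",
                              meta={"current": i, "total": total_frames})
    return {"status": "done"}

# The Flask route only enqueues the job and returns immediately, e.g.:
#   task = predict_labels_for_video.delay(video_id)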

Video Processing & Image Extraction

A key requirement of the Core Data API is the ability to process videos to perform various tasks such as video downloading, metadata acquisition, and seeking for image extraction. Several Python libraries offer such capabilities including Sci-kit Video, MoviePy, imageIO and OpenCV. Since the video processing functionality is another critical component of the application, I conducted a thorough analysis based on the following requirements. Key requirements:

• Frames per second extraction:

– Rationale: A method to quickly infer the frames-per-second of a video is


core to the application as it is required when uploading a new video to the database

– Ranking metric: Availability

• Frame-level image extraction:

– Rationale: The ability to efficiently iterate over frames in a video is key in several core application features such as predicting labels and exporting images

– Ranking metric: Availability

• Well documented and widely used:

– Rationale: A more well documented framework is easier to work with and will have a larger user-base, which makes resolving development problems faster

– Ranking metric: # GitHub stars (higher is better)

Figure 4.4: AIGVA Video Processing Comparison

Out of the four libraries examined, OpenCV was the clear winner and I have therefore chosen to use it (Ope, 2019). It was one of only two libraries to have both FPS and frame-level image extraction. The other library, scikit-video, only has 0.3K GitHub stars compared to OpenCV’s 37.1K stars. Furthermore, the ecosystem for deploying OpenCV on Docker is more developed, which makes deployment more straightforward. One of the strengths of the OpenCV codebase is its simplicity in extracting metadata. For example, when uploading a new video to the database it is


important to know both the FPS and the frame count. It is important to do this server-side, as doing so on the client-side would be too computationally expensive. With OpenCV, it is possible to infer these two statistics in just four lines, as seen in the code snippet from the backend route responsible for computing such statistics.

Listing 4.2: Video Metadata Extraction (Located in: backend/app/routes_videos.py)

cap = cv2.VideoCapture(file_location)
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()
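For context, a route of this kind can return the statistics as JSON roughly as sketched below. This is a simplified stand-in: the real route also generates default labels and hands heavier work to a Celery task.

# Simplified sketch of a metadata route; the actual route also creates default
# labels and delegates long-running work to Celery, which is omitted here.
import cv2
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/api/process_video")
def process_video():
    file_location = request.args.get("file_location")

    cap = cv2.VideoCapture(file_location)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()

    return jsonify({"fps": fps, "frame_count": frame_count})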

4.1.2 Web Application

The web application is the ”view” of the AIGVA tool. It functions as the only layer the user interacts with. Therefore usability, accessibility and performance are critical features. The web application is structured around a single-page web application and supported by numerous frontend libraries.

Web App Server & Client Side

Choosing the frameworks of the web application server and client side is the most important aspect of the frontend architecture as all other choices depend upon this. The most popular framework choices for building a scalable frontend are: React, Vue, Angular, and using plain JavaScript/HTML without any additional frontend libraries. A good framework must meet several criteria, the most important of which are listed below.

Key requirements:

• Modernity:

– Rationale: Using a popular and modern tech-stack means more tutorials and learning materials are going to be available

– Ranking metric: Downloads in last 6 months (higher is better)


• Performance:

– Rationale: Performance of the framework is largely determined by load times of the client-side, which is a direct result of the size of the framework

– Ranking metric: Framework size (lower is better)

• Well documented and widely used:

– Rationale: A more well documented framework is easier to work with and will have a larger user-base, which makes resolving development problems faster

– Ranking metric: # GitHub stars (higher is better)

• Personal experience:

– Rationale: Modern web frameworks have a steep learning curve. Although limited, I have some experience using a few web frameworks

– Ranking metric: Experience

I have chosen the React framework as it best fits the criteria specified (Rea, 2019b). Although plain JavaScript is more customizable than React, Vue and Angular, using plain JavaScript significantly slows development speed and I have therefore decided not to use it. In terms of performance and documentation, React and Vue are better choices than Angular. I have some personal experience using React and prefer its syntax to Vue, thus I have opted for React.

There are additional benefits to coding using a library like React. First, React integrates the model-view-controller pattern through its use of JSX syntax for templates. This enables object-oriented-like coding as it is easy to re-use components, which are the building blocks in the React library. Second, React supports using the Node Package Manager (NPM), which simplifies the installation of many of the dependencies used to build AIGVA. Third, React supports server-side rendering, which makes it


Figure 4.5: AIGVA Frontend JavaScript Framework Comparison

ideal to build single-page applications when coupled with certain packages.

To use React as the web application framework, I use the create-react-app development environment and a Node.js server. Rather than setting up the build and environment from scratch, the create-react-app tool ”offers a modern build setup with no configuration” (Cre, 2019). This further speeds up development as the tool takes care of complicated tasks such as transpiling JavaScript using Babel and enabling WebPack. Coding in React supports JavaScript ES6 features that are designed to make the code more re-usable and readable to other developers. An example of this is the use of modules, which makes it possible to write cleanly structured code, sorted by folders. For the AIGVA codebase, this means that every separate view/page has its own set of associated components, as seen in the folder directory below (see Appendix C.1 for a full overview).

Listing 4.3: Web Application Component Structure (Located in: frontend/src)

src
  components
    App
    Charts
    Dashboard
    Export
    LabelReview
    MLBlitzLabel
    MLColdLabel
    MLCustom
    Navigation
    Projects
    Settings
    VideoAnalysis
    VideoFramePlayer
    VideoManagement
  constants

The frontend needs to communicate with the backend using asynchronous API calls. Although JavaScript provides the fetch method to handle API calls, I will use the Axios package instead, which is a ”Promise based HTTP client for the browser” (Axi, 2019). Axios is preferable to the built-in fetch method for three reasons: (1) Axios automatically transforms response data to JSON, which is how a large portion of AIGVA data is structured on both the front- and backend, (2) Axios supports a wider range of browsers, which makes the tool more accessible to a greater audience, and (3) Axios makes it simple to chain API calls.

Listing 4.4: Web Application Video Upload Component (Located in: frontend/src/components/VideoManagement/UploadVideo.js)

onUploadFinish = e => {
  const file_name = e.signedUrl.split('?')[0]
  const pid = this.props.store.get('pid');
  const mid = this.props.store.get('mid');

  let labels = null;
  let frame_count = null;

  axios.get(API.BASE_API_URL + 'project/' + pid)
    .then(response => {
      const default_label = response.data.main_label
      return axios.get(API.BASE_API_URL + 'process_video', {
        params: {
          file_location: file_name,
          default_label: default_label,
        }
      })
    })
    .then(response => {
      labels = response.data.labels
      frame_count = response.data.frame_count
      return axios.post(API.BASE_API_URL + 'video', {
        name: this.state.file,
        location: file_name,
        size: this.state.size,
        fps: response.data.fps,
        pid: pid,
      })
    })
    // Additional API calls...
}

To show the benefit of Axios, I have included a code snippet from the frontend component responsible for uploading videos. The onUploadFinish method gets called when a video has been uploaded to the file storage site. The step-by-step API process is summarized in Figure 4.6. To summarize: (1) Axios gets the default label from the Core Data API and uses it to make an additional call to (2) the Core Data API using the process_video route, which runs a task in the background to process the video for metadata and label generation. (3) The labels and video metadata are then used to create a video object in the database using a separate API call. The separation of API calls, enabled by Axios, makes for less verbose code on both the front- and backend as each call aims to do only one thing.

Charting

For charting I use a combination of two libraries: BizCharts and ApexCharts, as shown in Figure 4.7. BizCharts is optimized for user interactions (Biz, 2019). For example, on the AIGVA dashboard users can view a summary of the total number of labels per class, which the user can hover over to get more detail. ApexCharts is designed to handle larger input data arrays (Ape, 2019). This is useful on the video analysis page, where a helpful bar is shown below the video to indicate which labels appear across the video. As videos may become long with frames changing often, the browser needs to draw a complex chart quickly, which is precisely what ApexCharts does well.


Figure 4.6: AIGVA API Example - onUploadFinish Method

File Management

Effective file management includes both successful uploading and downloading of files. These tasks are central to the AIGVA application. To create a successful architecture, I use two libraries to solve common issues with file management. First, I use Moment.js to ”parse, validate, manipulate, and display dates and times” (Mom, 2019). Without Moment.js, rendering user friendly date-time values is cumbersome and requires verbose code. With Moment.js, it is possible to write one-liners like moment(item.date).fromNow() to render the time since a video was uploaded or a model was created in a human readable format. Second, I use FileSaver.js to tackle the problem of downloading files. FileSaver.js is a ”solution to saving files on the client-side, and is perfect for web apps” (Fil, 2019). When a user wants to export data, they have to download a large .zip file sent over as the blob datatype (binary large object). Without FileSaver.js, a complex object creation/deletion process needs to happen on the client-side.


Figure 4.7: AIGVA Charting

Single-Page Application

To make AIGVA more user friendly and fast, I wanted it to be a single-page application. A single-page application is a website that dynamically re-renders a page rather than making a new server call and refreshing the browser. When coupled with a routing library, React is an ideal framework for building single-page applications. While there are several routing libraries available, I have chosen the most popular and well documented library available, the react-router library. Specifically, I use the react-router-dom version, which is adapted to be used in modern browsers (Rea, 2019a).

A consequence of designing the AIGVA tool as a single-page application is needing to keep track of certain global parameters. For example, we need to know which project has been selected across routes. Keeping track of global parameters means needing access to a global store. A common approach to creating a global store is to use a state management library like MobX or Redux. However, as the requirements for a global store are limited for the AIGVA tool, adding a global state management library would add unnecessary size and complexity to the application. Instead, I opted to take advantage of React’s higher order components feature. This allows you to wrap sub-components in other components using functions. The higher order components keep their own state, which can be modified using functions accessible to lower order components.


4.1.3 File Storage & Database

The file storage and database are the ”memory” of the AIGVA tool. I have decided to separate the file storage and the database to adhere to the best practice of modern web development. The file storage is an Amazon S3 Bucket responsible for hosting static files such as machine learning models, images and videos. The database is a PostgreSQL relational database responsible for hosting metadata specific to the application such as labels, video locations, project details and machine learning model information.

File Storage

Although it is possible to store static files like images and videos in a PostgreSQL database, it is inefficient to do so and ”not recommended at all” (Dyl and Przeorski, 2017). Instead, using a file system is a better idea since it is both faster and uses less disk space. Two options exist for implementing a file system: (1) hosting files locally, or (2) using a cloud service. Using a cloud service is favorable for a tool like AIGVA as storage is more scalable and easier to configure. The two key requirements for a cloud storage provider are listed below.

Key requirements:

• Price:

– Rationale: The AIGVA tool stores data-heavy files like machine learning models and videos, so the cheaper the provider the better

– Ranking metric: Monthly per-GB Prices (lower is better)

• React support:

– Rationale: Uploading directly from the front-end is the most efficient option and native React support is therefore favorable

– Ranking metric: Availability


The three most popular cloud service file storage providers are: AWS S3, Google Cloud and Microsoft Azure. I chose AWS S3 over Google Cloud and Microsoft Azure since it is comparable in price to the others and is the only one with a dedicated NPM React upload component. AWS S3 is a ”simple storage service for static assets such as images on Amazon’s servers” (Dyl and Przeorski, 2017).

Figure 4.8: AIGVA File Storage Comparison

To secure S3 uploads, I use the option to pre-sign the uploads by using the Core Data API to generate an upload link. Once received by the web application, this link grants temporary write/read access to the client. Now the web application is able to upload a file such as a video directly to S3 without needing to send any data to the web server, as seen in Figure 4.9. This significantly increases both speed and security.

Figure 4.9: AIGVA Securing File Uploads
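A sketch of how the Core Data API can generate such a temporary link with boto3 is shown below; the bucket name, object key and expiry time are placeholders rather than the application's actual configuration.

# Sketch of pre-signing an S3 upload with boto3; bucket, key and expiry are placeholders.
import boto3

s3 = boto3.client("s3")


def generate_upload_url(object_key, expires_in=3600):
    """Return a temporary URL the browser can use to PUT the file directly to S3."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "aigva-videos", "Key": object_key},
        ExpiresIn=expires_in,
    )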


Database

I opted for a relational database over a NoSQL database for the AIGVA tool. NoSQL is great for prototyping because of the lack of a fixed schema. However, with NoSQL it is more cumbersome to do certain aggregations and joins that are needed in the AIGVA tool. I have chosen the PostgreSQL implementation of SQL since it is one of the most popular and widely documented implementations (Pos, 2019). Furthermore, PostgreSQL supports the JSON datatype, which is used to store the labels in the tool.

Listing 4.5: API Object-Relational Mapping Example (Located in: backend/app/app.py)

# Get associated videos from DB
video_links = []
videos = Video.query.filter_by(pid=pid)
for video in videos:
    video_links.append(video.location)

To integrate the PostgreSQL database into the Core Data API architecture I employ two tools: (1) an object-relational mapper, and (2) a database migration tool. First, I use the SQLAlchemy object-relational mapper, which works as an abstraction layer. Specifically, SQLAlchemy provides ”transparent conversion of high-level object-oriented operations into low-level database instructions” (Grinberg, 2014). This makes working with databases much simpler from a development perspective as you can interact with them like objects. See the code example above for an example of how the SQL database can be queried for videos associated with a specific project, all without writing a line of SQL. This also increases safety as it prevents SQL injections. Second, I use the Flask-Migrate library to handle schema changes during development. As mentioned before, one of the disadvantages of working with relational databases is their inflexibility to schema changes, which are inevitable during development. To mitigate this issue, I use the Flask-Migrate database migration framework ”to track changes to a database schema, allowing incremental changes to be applied” (Grinberg, 2014). For a full overview of the database schema, see Appendix D.4.1.
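Wiring the migration tool into the application takes only a few lines, roughly as sketched below; the database URI is a placeholder and the app and db objects stand in for those defined in app.py.

# Sketch of binding Flask-Migrate to the application; the database URI is a placeholder.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flask_migrate import Migrate

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://localhost/aigva"
db = SQLAlchemy(app)
migrate = Migrate(app, db)

# Incremental schema changes are then generated and applied from the command line:
#   flask db migrate -m "add label table"
#   flask db upgrade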


4.2 Feature Implementation - Video Management

In this section I will discuss the challenges of implementing one of the core components of the tool, the video management component. This consists of the frame-by-frame video player and the export tool.

4.2.1 Frame-by-frame Video Player

One of the major challenges of the project was to create an in-browser video player that met the requirements of the application. The key feature that the video player needed to support was frame-by-frame navigation. The browser would need to constantly be aware of which frame was being played so that the UI displaying the label prediction and frame statistics could be updated (see Figure 4.10). None of the existing web video players such as React-Player and Video-React offered this support. Therefore I needed to write my own in-browser video player.

Figure 4.10: AIGVA Video Player and Analysis Page

After much experimentation, I was able to implement a video player that met all the aforementioned criteria. First, I attempted to display sequences of pictures instead

of videos, however that proved too slow. Then, I also attempted to modify existing libraries to include frame-level navigation, however that proved too time-consuming. Ultimately, the solution that worked was to build my own player from scratch based on the HTML5 Video Media element specification 1. The specification, part of the official HTML5 release, provides a set of attributes and events you can use to implement features from scratch. I utilized these low-level building blocks to develop the following features:

• Play/pause buttons

• Video range slider that allows users to navigate to any time in the video

• A ”frame change” event listener that triggers a UI update

• Frame seeking buttons that could skip X number of frames forwards or backwards

• Label skip feature that queries the label set for the next different label and updates the video

• Keyboard shortcuts to navigate between frames and labels

To illustrate the idea behind the implementation, I have included part of the code for the video player to show the event listener logic - see Listing 4.6. The video frame player exploits the React component lifecycle, which triggers the componentDidMount() function to run when a component is first rendered. This calls the startUpdatingFrames() method, which initializes an event listener that queries the state every interval. Every iteration of the event listener is a call to the updateFrames() method, which is responsible for two things: (1) extracting frame-level information from the HTML5 Media Element, and (2) triggering a global UI update which updates the state of information that depends on the current frame (all the items within the green boxes on Figure 4.10).

1https://developer.mozilla.org/en-US/docs/Web/HTML/Element/video


To determine the current frame, we need both the current time, which we obtain from the HTML5 ”currentTime” attribute, and the FPS, which we obtain from the component’s state. Unfortunately, we cannot acquire the FPS from within the browser - or at least we cannot do so without making the application painfully slow. To overcome this problem, a function was built in the Core Data API that automatically extracts this information on upload using OpenCV (see Appendix D.1.1). Then, when we load the video analysis page, this information is simply queried from the database, alongside the labels and other metadata needed. When the user exits the page, the componentWillUnmount() method gets invoked through React’s component lifecycle logic. This method kills all active event listeners, which is critical to prevent memory leaks from dangling event listeners. Further implementation details can be found in Appendix D.1.1, but have not been included here for the sake of brevity.

Listing 4.6: Custom-Built Video Frame Player Snippet (Located in: frontend/src/components/VideoFramePlayer/index.js)

class VideoFramePlayer extends Component {
  constructor(props) {
    super(props)

    this.vidRef = React.createRef();
    this.state = {...INITIAL_STATE}
  }

  startUpdatingFrames() {
    // Check that no other intervals exist to prevent multiple listeners
    if (!this.timerID) {
      this.timerID = setInterval(() => this.updateFrames(), INTERVAL_MS);
    }
  }

  stopUpdatingFrames() {
    clearInterval(this.timerID);
  }

  componentDidMount() {
    // Initialize event listener to update UI on frame change
    this.startUpdatingFrames();
    this.vidRef.current.addEventListener('canplay', this.handleCanPlayEvent)
  }

  componentWillUnmount() {
    // Kill event listener to prevent memory leaks on component dismount
    this.stopUpdatingFrames();
    this.vidRef.current.removeEventListener('canplay', this.handleCanPlayEvent)
  }

  handleCanPlayEvent = () => {
    this.setState({can_play: true})
  }

  updateFrames() {
    const current_frame = Math.floor(this.vidRef.current.currentTime.toFixed(5) * this.state.fps);
    const current_time = this.vidRef.current.currentTime
    const current_progress = current_time / this.vidRef.current.duration
    this.setState({
      frame: current_frame,
      time: current_time,
      progress: current_progress,
    })
    this.props.sendFrames(current_frame);
  }


4.2.2 Export Tool

Figure 4.11: AIGVA Export Page

Building an efficient export functionality also proved an interesting technical challenge. I implemented two versions of the export functionality: (1) a labels-only export, and (2) a full dataset export which includes both labels and images. The first ”labels only” functionality was designed to offer a fast way to download frame-by-frame labels in a JSON file. This was a relatively simple implementation as it only required a single API call to the Core Data API to extract labels and then leverages FileSaver.js to download the file on the client side. The second ”full dataset” export proved more challenging. This export was requested early on by a key stakeholder who wanted an ImageNet-style output that placed images in ”folders named by their label, so that they can be fed directly into any ML pipeline” 2. I will briefly explain its implementation as it highlights how the three key components of the software architecture (the web app, the Core Data API and the File Storage) interact.

2Email correspondence with Prof. Bernhard Kainz


Figure 4.12: AIGVA Export Implementation Details

A ”full dataset export” was implemented in five steps as shown in Figure 4.12. First, a user triggers an export in the browser. This hits the api/full_download route of the Core Data API. Then, within the Core Data API, a new Celery task is created to handle the request and a URL is immediately returned to the Web App, which the browser uses to display the progress bar and let the user know when the task is finished. The Celery task then begins executing, which creates a folder structure with a subfolder for each label (see Appendix D.1.2 for details). The task then iterates over all the videos in the project and executes the export_video_to_folder method - see Listing 4.7. This method uses the OpenCV library to iterate over the frames in a video and store them in a folder. One interesting hyperparameter to note here is the ”skip factor”. This is an optimization developed to keep only every Nth frame. The user can specify this factor in the web app as shown in Figure 4.11 in the ”Download Options”. By default, this option is set to 1, meaning no frames are skipped. But since this function operates in linear time, increasing the skip factor to N will reduce the time taken by a factor of N. Once all frames have been extracted, the Core Data API compresses the folder to a .zip file and uploads it to the video file storage. Ultimately, the link to this zip folder is sent to the web application and a download is triggered using the FileSaver.js framework.

Listing 4.7: Video Export Method (Located in: backend/app/helpers.py)

def export_video_to_folder(video_link, labels, skip_factor=1, compression_factor=40):
    """
    Export a video to a folder.

    Args:
        video_link [str]: link to video download
        labels []: array of labels for video
        skip_factor [int]: number of images to skip
        compression_factor [int]: image compression factor (0 is none, 100 is full)
    """
    cap = cv2.VideoCapture(video_link)

    while cap.isOpened():
        frameId = cap.get(1)
        ret, frame = cap.read()
        if ret != True:
            break
        if frameId % math.floor(skip_factor) == 0:
            filename = "./exports/%s/%s_frame%d.jpg" % (
                labels[int(frameId)]['Label'], video_name, int(frameId))
            cv2.imwrite(filename, frame, [
                int(cv2.IMWRITE_JPEG_QUALITY), compression_factor])

    cap.release()
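The final compress-and-upload step of the Celery task can be sketched as follows; the bucket name and folder layout are placeholders, and the real task additionally reports progress along the way.

# Sketch of the compress-and-upload step; the bucket name and paths are placeholders.
import shutil
import boto3


def compress_and_upload(export_dir, archive_name):
    """Zip the exported folder structure and push it to S3, returning the object key."""
    # shutil.make_archive returns the path of the created .zip file
    zip_path = shutil.make_archive(archive_name, "zip", root_dir=export_dir)

    s3 = boto3.client("s3")
    object_key = "exports/%s.zip" % archive_name
    s3.upload_file(zip_path, "aigva-videos", object_key)
    return object_key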


4.3 Feature Implementation - Label Prediction

In this section I will discuss one of the most challenging undertakings of the AIGVA project: developing the infrastructure to support label prediction and implementing the three labelling techniques: Cold, Blitz and Custom Model labelling. All three labelling techniques have one thing in common: they all result in a machine learning model that is used to predict labels. To reflect this commonality and to prevent code duplication, they all share the same prediction engine. Therefore I will first discuss how I implemented the three labelling techniques, up to the point of them producing a machine learning model, and then I will describe how the shared prediction engine works.

4.3.1 Cold Label

Figure 4.13: AIGVA Cold Label Prediction Technique Review

Recall the intuition behind the Cold Label technique: if no videos are labelled, manually label a few frames per video and train a classifier on those to predict labels. This process has three steps: (1) extract a sample of frames, (2) manually label the sample of frames, and (3) train a classifier on the labelled frames. I will discuss the implementation challenges of these three steps.


Figure 4.14: AIGVA Cold Label Implementation - Step 1

(1) Extracting Sample of Frames

The implementation of the overall image extraction architecture is straightforward. First, a user triggers a request to create a new Cold Label model. Then a Celery task is created to extract a sample of frames from the videos. There are two possible methods of frame/image extraction: uniform sampling or unique image extraction. These methods will be described in detail below. Once images have been extracted they are uploaded to the file storage and links to these are sent to the web application, which is now ready for step 2 - manually labelling a sample of frames.

Figure 4.15: AIGVA Cold Label Implementation Details


Image Extraction Method #1: Uniform Sampling

One simple implementation to extract images is to have the user specify the number of frames they wish to extract and then uniformly sample that number of frames from the video. For example, say a user had a 10 second long video running at 30 frames per second (300 total frames). Now suppose, to save time, the user only wanted to label 5 images. Then the uniform sampling method would extract: frame 0, frame 60, frame 120, frame 180 and frame 240. See Algorithm 1 below written as pseudocode (see Appendix D.2.1 for the full implementation). Note the calculation of the ”skip factor”, which is calculated as the total number of frames divided by the number of frames the user wants to label. As the video is read as a stream, a modulo check is implemented using the skip factor, which limits the number of images to what is desired by the user.

Algorithm 1 Uniform Image Sampling [Image Extraction Method 1]
Require: video, images_to_save
  capture ← openStream(video)
  total_frames ← capture.getFrames()
  skip_factor ← total_frames / images_to_save
  while capture.isOpened() do
      ret, frame ← capture.read()
      if ret ≠ true then
          break
      end if
      i ← capture.getFrameID()
      if (i mod skip_factor) == 0 then
          saveImage(frame)
      end if
  end while
  capture.closeStream()
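A Python rendering of Algorithm 1 with OpenCV is sketched below; the output naming is simplified compared with the full implementation referenced in Appendix D.2.1.

# Python sketch of Algorithm 1 (uniform sampling); output naming is simplified
# compared with the full implementation in Appendix D.2.1.
import cv2


def uniform_sample(video_path, images_to_save, out_prefix="sample"):
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    skip_factor = max(total_frames // images_to_save, 1)

    while cap.isOpened():
        frame_id = int(cap.get(cv2.CAP_PROP_POS_FRAMES))
        ret, frame = cap.read()
        if not ret:
            break
        if frame_id % skip_factor == 0:
            cv2.imwrite("%s_frame%d.jpg" % (out_prefix, frame_id), frame)

    cap.release()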


Image Extraction Method #2: Unique Image Extractor

The clear downside of the previous method is that it entirely ignores the content of the video. Suppose a video changes frames often; then it would make sense to label more than 5 frames for the same 10 second video (e.g. ultrasound). If only 5 frames are labelled, chances are that the classifier will perform poorly as it was trained on an insufficient training set. Now suppose instead the 10 second video was that of a non-moving picture, or rather, a still frame, as is often the case in the real world (e.g. CCTV footage). In the latter case it would not make much sense to extract 5 frames, since 1 frame should suffice. This idea requires a method to extract a set of the most ”unique” images from a video. This method is closely related to the concept of active learning. Recall how active learning methods attempt to find the most ”interesting” samples and send them to a human to label. While it is out of the scope of this project to build a full active learning algorithm, one can approximate such an approach by guessing what the most ”interesting” samples are. In the context of videos, it means finding the most meaningful/unique still frames of a video.

To build an algorithm to extract the most unique images from a video, a process was designed that compares adjacent frames for a significant difference. If the difference is above a certain threshold, meaning that a frame is likely ”unique”, then the algorithm extracts that image - see Algorithm 2. The main challenge was defining what a ”significant difference” between frames means. I experimented with several approaches to compare images, such as the use of cross correlation and feature extraction. However, these methods proved inconsistent and computationally expensive. Instead, I opted for the use of perceptual hashing. This is a method that computes a hash vector from an image matrix. Unlike cryptographic hashes, where tiny nudges in the input can drastically change the hash, perceptual hashes ”are close to one another if the features are similar” (pHa, 2019). I opted for OpenCV’s implementation of perceptual hashing, which outputs a 1x8 dimensional vector (Ope, 2019).


Algorithm 2 Unique Image Sampling [Image Extraction Method 2]
Require: video, skip_factor, hamming_threshold
  capture ← openStream(video)
  last_hash ← null
  while capture.isOpened() do
    ret, frame ← capture.read()
    if ret ≠ true then
      break
    end if
    i ← capture.getFrameID()
    if (i mod skip_factor) == 0 then
      current_hash ← perceptualHash(frame)
      if last_hash == null then
        saveImage(frame)
        last_hash ← current_hash
      else
        dist ← hammingDistance(current_hash, last_hash)
        if dist ≥ hamming_threshold then
          saveImage(frame)
          last_hash ← current_hash
        end if
      end if
    end if
  end while
  capture.closeStream()
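For illustration, a minimal Python sketch of Algorithm 2 is shown below. It assumes the opencv-contrib-python build (which provides the cv2.img_hash module); the function names and output directory are illustrative, and the choice of Hamming distance used here is motivated in the paragraph that follows.

    import cv2
    import numpy as np

    def hamming_distance(hash_a, hash_b):
        # Count the number of differing bits between two 1x8 uint8 perceptual hashes
        return int(np.count_nonzero(np.unpackbits(hash_a ^ hash_b)))

    def extract_unique_frames(video_path, skip_factor, hamming_threshold, out_dir):
        hasher = cv2.img_hash.PHash_create()
        capture = cv2.VideoCapture(video_path)
        last_hash = None
        while capture.isOpened():
            ret, frame = capture.read()
            if not ret:
                break
            frame_id = int(capture.get(cv2.CAP_PROP_POS_FRAMES)) - 1
            if frame_id % skip_factor != 0:
                continue
            current_hash = hasher.compute(frame)  # 1x8 uint8 vector
            # Save the frame if it is the first one, or sufficiently different from the last saved frame
            if last_hash is None or hamming_distance(current_hash, last_hash) >= hamming_threshold:
                cv2.imwrite(f"{out_dir}/frame_{frame_id:06d}.jpg", frame)
                last_hash = current_hash
        capture.release()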

To obtain a similarity score, a distance measure was required to compare the two perceptual hash vectors. I experimented with Euclidean distances but they performed poorly. After researching the topic, I found the poor performance of Euclidean distances documented in the literature, with researchers having found that Hamming distances are better suited for semantic comparisons (Norouzi et al., 2012). Results were much more consistent with a Hamming distance measure, and I therefore opted for this distance measure instead. The Hamming distance between vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is written as $D_{\mathrm{Hamming}}(x, y)$:

\[ D_{\mathrm{Hamming}}(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i) \tag{4.1} \]


\[ \delta(x_i, y_i) = \begin{cases} 0 & x_i = y_i \\ 1 & x_i \neq y_i \end{cases} \tag{4.2} \]

Figure 4.16: AIGVA Unique Image Extraction Method Using Perceptual Hashing

To demonstrate the use of the unique image extraction algorithm, an experiment was conducted with two examples. See Figure 4.16 for the results of the experiment. In the first example, a 17 second clip from the feature film Lord of the Rings was sampled at a frame rate of 1 FPS. The 17 frames were processed by the unique image extraction algorithm, which returned a set of four images. By visual inspection, we can observe that it correctly extracted four of the most unique images. In another, trickier example, autonomous vehicle data was used from the Berkeley Deep Drive dataset. This was also a 17 second clip sampled at 1 FPS, which produced 17 frames. This time the algorithm returned six frames. Note again how the output frames are all significantly different from each other. Note also how the algorithm correctly extracted only one frame from the last 11 frames, where the car is waiting at an empty intersection.

(2) Label Sample of Frames

Figure 4.17: AIGVA Cold Label Implementation - Step 2

Once a set of sample images has been extracted using either of the two aforementioned methods, the user can manually label them in the browser. To implement this functionality I built a label review tool that displays all the labels associated with a project alongside the image to be labelled. To speed this process up, keyboard shortcuts were made available. An example of a user labelling a set of extracted frames can be seen in Figure 4.17.


(3) Training a Model

Figure 4.18: AIGVA Cold Label Implementation - Step 3

After labelling the extracted images, the user trains a machine learning model which can later be used to predict labels for entire videos. This was one of the more challenging implementations, as it meant building the infrastructure to support training and monitoring a machine learning model from the browser. There were three parts to the implementation: selecting the right machine learning model, customizing the selected model, and training the model in the browser.

Model Training #1: Selecting The Model

Selecting a model requires balancing trade-offs in accuracy, training time and model size. Using the most accurate model will not necessarily create the best tool if it means training time is significantly longer. Therefore, to select the best model, an experiment was run to compare CNNs across these metrics.


Figure 4.19: AIGVA Label Prediction CNN Architecture Comparison Chart

The experiment used the CIFAR-10 dataset, which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class, split into 50000 training images and 10000 test images (Krizhevsky et al., 2009). I trained a set of popular CNN model architectures on this dataset, including ResNet, DenseNet, VGG, SqueezeNet and AlexNet. The experiment was run on a P4000 GPU, and all the networks were pre-trained on ImageNet and then fine-tuned on the CIFAR-10 dataset during training. The reason for pre-training on ImageNet is to leverage the benefits of transfer learning. Specifically, researchers have shown that CNNs pre-trained on natural images (i.e. ImageNet) train faster and are more robust than CNNs trained from scratch, while still retaining performance (Tajbakhsh et al., 2016). I used the fast.ai PyTorch fit_one_cycle method, which optimizes learning rate, momentum and weight decay, to set the hyperparameters³. The results of the experiment are shown in Figure 4.19 and Table 4.1.

ResNet34 was chosen over the other variants because of its balanced performance across all three metrics. In terms of accuracy, DenseNet performed the best, achieving 93.55% accuracy on the test set. However, DenseNet also took nearly 6 minutes to train. Comparatively, ResNet34 achieved nearly the same accuracy, 90.26%, in nearly half the time.

³ fit_one_cycle training optimization: https://docs.fast.ai/training.html


Model Name       Accuracy (%)   Time To Train (s)   Model Parameters
ResNet18         87.54          130                 11689512
ResNet34         90.26          199                 21797672
ResNet50         91.68          409                 25557032
SqueezeNet1_0    82.27          89                  1248424
SqueezeNet1_1    83.30          97                  1235496
AlexNet          76.95          95                  61100840
VGG16            86.54          373                 138365992
DenseNet         93.55          347                 7978856

Table 4.1: AIGVA Label Prediction CNN Architecture Comparison Table

Since the first prototype of the AIGVA tool is implemented on CPUs, training speed matters significantly more than if it were implemented on GPUs. Therefore I opted to use ResNet34 over DenseNet as the main model. The other variants of ResNet, 18 and 50, outperform ResNet34 in terms of training time and accuracy respectively. However, ResNet18 also comes with a 2.5% accuracy penalty, and ResNet50 takes more than twice as long to train compared to ResNet34. The other models, including SqueezeNet, VGG and AlexNet, all performed worse than ResNet34 in terms of accuracy, and their training speed gains were not large enough to justify switching from the ResNet34 model.
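As an indication of how each architecture was fine-tuned, the sketch below shows the general pattern using the fast.ai v1 API; the folder layout, batch size and number of cycles are assumptions for illustration, not the exact settings of the experiment.

    from fastai.vision import *

    # CIFAR-10 style data organised as cifar10/train/<class>/... and cifar10/test/<class>/...
    data = ImageDataBunch.from_folder('cifar10', train='train', valid='test',
                                      ds_tfms=get_transforms(), size=32, bs=256)

    # ImageNet pre-trained ResNet34, fine-tuned with the one-cycle policy
    learn = cnn_learner(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(3)

Swapping models.resnet34 for models.resnet18, models.densenet121, and so on, and timing each run, reproduces the kind of comparison summarised in Table 4.1.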

Model Training #2: Customizing The Model

One of the most challenging aspects of practical machine learning is selecting the hyperparameters. The goal behind the implementation of hyperparameter selection was to simplify it as much as possible for the user. For that reason, control was limited to three of the most important factors affecting the training of a machine learning model: the number of training cycles, the train/test split, and the option to enable early stopping. The number of training cycles affects the time taken for the model to train, and while a higher value will reduce the training error it may cause overfitting. To prevent overfitting, the option to enable early stopping was included. The implementation of both the number of training cycles and early stopping is shown in Listing 4.8.


Listing 4.8: Core Data API Model Training Code Snippet (Located in: Backend/app/app.py)

    for epoch in range(input_cycles):
        ...
        # Break if early stopping and test acc lower than prev
        if ((input_early_stopping) and
                (test_accuracy < prev_test_accuracy)):
            break

Model Training #3: Training The Model

Implementing model training controlled from the browser utilizes all the key innovations of the backend, as shown in Figure 4.20. First, a user triggers a model training request and the Core Data API spins up a new Celery task to handle training the model. Next, the Celery task containing all the model training code runs for several epochs, constantly reporting its status to a separate server. To keep the user informed throughout the training process, the web app queries this separate server for status information. The web app uses this information to display both an overall model training progress bar and a graph displaying the real-time test/train error. After model training is completed, the Core Data API uploads the model file to the S3 file store and creates a new model object in the PostgreSQL database with a link to the file store. Finally, the web app is notified of the training completion and is ready for prediction. See Appendix D.2.1 for full implementation details.

Figure 4.20: AIGVA Model Training Implementation
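A highly simplified sketch of this flow is shown below. All names are hypothetical (the real task is part of the Core Data API, see Appendix D.2.1), and the training step itself is replaced by a placeholder; the point is the pattern of a Celery task that reports its progress so the web app can poll it.

    from celery import Celery

    celery_app = Celery('aigva', broker='redis://redis:6379/0',
                        backend='redis://redis:6379/0')

    @celery_app.task(bind=True)
    def train_model_task(self, project_id, input_cycles):
        for epoch in range(input_cycles):
            # Placeholder for one cycle of the fast.ai training loop
            train_loss, test_accuracy = run_one_epoch(project_id)
            # Report progress so the web app can update the progress bar and error graph
            self.update_state(state='PROGRESS',
                              meta={'epoch': epoch + 1, 'total': input_cycles,
                                    'train_loss': train_loss,
                                    'test_accuracy': test_accuracy})
        upload_model_to_s3(project_id)    # placeholder: store the trained weights in S3
        create_model_record(project_id)   # placeholder: add a model row to PostgreSQL
        return {'status': 'complete'}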


4.3.2 Blitz Label

Figure 4.21: AIGVA Blitz Label Prediction Technique Review

Recall the intuition behind the Blitz Label technique: if some videos are already fully labelled, train a classifier on those frames and use it to predict labels for other videos. This process has two steps: (1) ensure you have pre-labelled videos, (2) train a classifier on the labelled frames. I will briefly discuss the implementation challenges of these two steps.

(1) Pre-Labelled Videos

Once a user has marked a video as "Reviewed", it gets added to the list of videos eligible to be included for "Blitz Label". To use these reviewed videos to create a dataset for a Blitz Label model, the user is first prompted to specify how many frames they wish to skip. The idea behind this implementation was to offer the option of not extracting every single frame, since near-identical adjacent frames may lead to overfitting. When the user clicks "Generate Training Data", a request is sent to the Core Data API to create a Celery task to handle this process. Upon completion of the task, the "Next" button becomes activated and a model can be trained.


Figure 4.22: AIGVA Blitz Label Implementation - Step 1

(2) Training a Model

Training a Blitz Label model works in the same way as training a Cold Label model. The only difference is that the input data is given in the form of a .zip file instead of links to individual images (see the full implementation in Appendix D.2.2 for details). Because the difference is only 5-10 lines of code, I refer the reader to the previous Section 4.3.1 for the sake of brevity.

4.3.3 Custom Model

Recognizing that there are limitations to the Cold Label and Blitz Label models, it was decided to add the option for machine learning researchers to use their own models. To ensure that this was done in the most effective way possible, I reached out to several stakeholders to discuss this topic before building it. It quickly became apparent that a simple Python interface that the researchers could adapt to their own


Figure 4.23: AIGVA Custom Model Label Prediction Technique Review

models was the simplest method. However, this also required the Dockerization of the application to ensure deployment would be effortless - see section 4.4.2 for a discussion on this.

Listing 4.9: Custom Model API (Located in: Backend/app/custommodel.py)

    def custom_model_predict(img_path):
        """
        Custom interface for using your own custom models.

        Args:
            img_path: filepath to the image that needs to be predicted.

        Returns:
            prediction[str]: label predicted by your model
            confidence[float]: confidence that the label is correct in range 0-1
        """

        # Write your code here

        return prediction, confidence

The Python interface to be filled in by machine learning researchers is shown in Listing 4.9.


The method is invoked from the label prediction engine and is passed a path to the image that is to be predicted. The engine expects the method to return two values: (1) the label predicted by the model, and (2) the confidence that the label is correct - see Appendix D.2.3 for an example implementation of a custom model. Once this interface has been implemented, the Custom Model option becomes available in the web application. Label prediction can then be executed just as it is when using a Cold or Blitz Label model.
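To make the contract concrete, the sketch below shows one possible way to fill in the interface with a fine-tuned torchvision ResNet34. The weight file, label list and preprocessing are illustrative assumptions; the actual worked example is the one in Appendix D.2.3.

    import torch
    from PIL import Image
    from torchvision import models, transforms

    LABELS = ['cat', 'airplane', 'automobile']          # illustrative class names

    model = models.resnet34(pretrained=False)
    model.fc = torch.nn.Linear(model.fc.in_features, len(LABELS))
    model.load_state_dict(torch.load('my_model.pth', map_location='cpu'))  # assumed weight file
    model.eval()

    preprocess = transforms.Compose([transforms.Resize((224, 224)),
                                     transforms.ToTensor()])

    def custom_model_predict(img_path):
        # Load and preprocess the image, then run a single forward pass
        img = preprocess(Image.open(img_path).convert('RGB')).unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(model(img), dim=1)[0]
        confidence, idx = torch.max(probs, dim=0)
        return LABELS[int(idx)], float(confidence)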


4.3.4 Label Prediction Engine

Figure 4.24: AIGVA Label Prediction Pages

The prediction engine is common to all three model types. When a user navigates to any of the label prediction pages, they are met with similar pages that give them access to models of that type (Cold, Blitz or Custom) and the ability to run these models to predict labels (see Figure 4.24).

Figure 4.25: AIGVA Label Prediction Implementation

The high level implementation of the label prediction engine is split into five steps


(see Figure 4.25). First, a user clicks the "Predict Labels" button on any of the three label prediction pages. The Core Data API receives this request and creates a new Celery task. One of the core requirements of the label prediction engine is the ability to run multiple predictions concurrently. This was the requirement that motivated the asynchronous message-broker architecture discussed in Section 4.1.1, and here we can observe how that architecture serves its purpose. Specifically, if the user selects multiple videos to be predicted, then multiple Celery tasks are created, and each task runs a label prediction job on its video. This is a complicated task that involves loading both the video and the model into memory, followed by iterating over frames and running inference - see Appendix D.2.4 for the full implementation of this Celery task. When inference is complete, the newly predicted labels are written to the PostgreSQL database and the web application is notified that the task is completed.
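A stripped-down sketch of such a prediction task is shown below, reusing the hypothetical celery_app object from the earlier training sketch. Every helper is a placeholder for functionality described above (loading the model from S3, resolving the video file, running ResNet34 inference and bulk-inserting labels into PostgreSQL), so this is an outline of the pattern rather than the actual implementation in Appendix D.2.4.

    import cv2

    @celery_app.task
    def predict_labels_task(video_id, model_id):
        model = load_model_from_s3(model_id)                  # placeholder
        capture = cv2.VideoCapture(video_path_for(video_id))  # placeholder
        labels = []
        frame_id = 0
        while True:
            ret, frame = capture.read()
            if not ret:
                break
            label, confidence = run_inference(model, frame)   # placeholder for CNN inference
            labels.append((frame_id, label, confidence))
            frame_id += 1
        capture.release()
        write_labels_to_db(video_id, labels)                  # placeholder: bulk insert into PostgreSQL
        notify_web_app(video_id)                              # placeholder: mark the task as complete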


4.4 DevOps

DevOps refers to the process of integrating software development with quality assurance and IT operations, with the single goal of successfully deploying the end product to users (Kim et al., 2016). Having already discussed the development activities of the project in depth, we now turn to the two final activities of DevOps: quality assurance through testing and IT operations through deployment of the application.

4.4.1 Testing

The two crucial aspects of the application to test are the web app and the Core Data API. To test the web app I conducted several rounds of user interviews. Such usability testing helped to "validate design decisions" and let users be "exposed to the capabilities of the system early" (Galitz, 2007). The results of these user interviews are discussed in the evaluation Section 5.3.

Figure 4.26: Core Data API Test Coverage and Results

To test the Core Data API I used the pytest.py and coverage.py libraries. All routes and helper functions were tested and the results are presented in Figure 4.26. The tests pass and cover a high percentage of the code across the Core Data API. Although an extensive continuous integration suite could have been set up, it was decided early on that the focus should be on developing an extensive prototype. Prototypes "should be capable of being rapidly changed" (Galitz, 2007). For that reason a CI pipeline would have slowed development significantly, since the architecture of the application evolved over the three months of development.
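As an example of the style of test used, the snippet below sketches a route test with pytest and Flask's built-in test client; the import path and the /api/video route are hypothetical stand-ins for the real names in the Core Data API.

    import pytest
    from app import app  # hypothetical import of the Flask application

    @pytest.fixture
    def client():
        # Flask's test client lets routes be exercised without running a server
        app.config['TESTING'] = True
        with app.test_client() as client:
            yield client

    def test_list_videos_route(client):
        # Hypothetical route used for illustration
        response = client.get('/api/video')
        assert response.status_code == 200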

4.4.2 Deployment

The objective of good deployment is to make it as simple as possible for users to install the AIGVA tool on their own system. For that reason, the AIGVA tool is deployed using the Docker container service. A container is a "special type of virtual machine that run on top of the kernel of the host" and because "virtualization stops at the kernel, containers are much more lightweight and efficient than virtual machines" (Grinberg, 2014). The core idea behind Docker is to make it possible for anyone to run an instance of your container without having to configure a plethora of packages and drivers.

Listing 4.10: Docker-compose.yml Configuration File (Located in: Backend)

    version: '3'
    services:
      flask:
        privileged: true
        build:
          context: .
          dockerfile: Dockerfile
        command: ./scripts/run_server.sh
        ports:
          - '5000:5000'
        links:
          - redis
          - celery
        volumes:
          - .:/app

      celery:
        privileged: true
        build:
          context: .
          dockerfile: Dockerfile
        command: ./scripts/run_celery.sh
        links:
          - redis
        volumes:
          - .:/app

      redis:
        image: redis:latest
        hostname: redis

To containerize the AIGVA tool, two separate Docker configuration files were created: a Dockerfile and a docker-compose file. First, the Dockerfile (see Appendix D.3.1) is responsible for building an Ubuntu-based image capable of running the AIGVA tool. This involves various setup steps such as installing the right Python versions, running the "pip installs" needed for the Flask server and Celery tools, compiling the C dependencies needed to run OpenCV, configuring user privileges using the chmod and add-user commands, and building the Redis server. Second, once the image was set up, a docker-compose file was needed to run all the servers required by the AIGVA tool (see Listing 4.10). Recall that the Core Data API consists of three key components: the Flask server, the Celery workers, and the Redis message broker. Each of these three components requires its own server to run on. For that reason, the docker-compose tool was used to start all three services and link them to each other using a set of dependencies.
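With these two files in place, a user who has Docker and docker-compose installed can typically start the whole stack from the Backend directory with a single command, docker-compose up --build, which builds the image and launches the Flask, Celery and Redis services together.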

Chapter 5

Evaluation & Results

In this section I will evaluate the AIGVA tool and assess the extent to which it achieved the objectives of the project. I will assess the tool both quantitatively and qualitatively in several ways. First, I empirically test the AIGVA tool to quantitatively show its impact and usefulness. Second, I conduct rigorous quantitative performance testing of the application to understand both the capabilities and limitations of the tool. Third, I conduct a user survey with key stakeholders and potential users of the project to qualitatively assess its usefulness.

5.1 Quantitative Evaluation

To quantitatively test the usefulness of the AIGVA tool I have conducted two experiments to investigate the performance and efficiency of several labelling techniques.

5.1.1 Label Prediction - Cold Label

The first experiment I conducted was on the Cold Label prediction technique. The core idea behind the experiment was to label frames of certain class instances in a set of videos and then subsequently use those extracted frames to train a classifier and test its performance. The hypothesis underlying the experiment was that more accurate labelling methods would produce better classifiers on unseen test data.


Methodology

Figure 5.1: Quantitative Test of AIGVA Cold Label - Methodology

The experiment consisted of five steps, as shown in Figure 5.1. First, I downloaded 12 YouTube videos which had instances of three categories from the CIFAR-10 dataset (airplanes, automobiles and cats) - see Appendix C.3.1 for the full list of the videos. Second, I used a variety of techniques to label the images, as seen in the list below. Each labelling technique was carefully timed from beginning to end and split into three buckets where applicable: (1) "labelling time" was time spent manually labelling any number of images, (2) "prediction time" was time spent running predictive algorithms to automatically label videos, (3) "correction time" was time spent correcting predicted labels if that was part of the experiment. Third, frames were extracted using the AIGVA export tool - here only a sixteenth of the images were extracted, so as not to overfit the training data, as it was made up of only 12 videos. Fourth, an image classification model was trained on the output of each labelling technique using the fast.ai PyTorch framework. The classifier used was a ResNet34 model pre-trained on ImageNet with the main layers frozen so it was possible to


fine-tune it to the training data (see Appendix C.3.2 for full model specifications). Finally, I tested each of the classification models on the CIFAR-10 test set data for each of the three classes. I obtained the F1-score, accuracy and confusion matrix for each of the labelling techniques. Furthermore, to keep the training consistent, I trained each model three times for the same number of epochs and averaged the accuracy and F1-scores to prevent any major outliers from affecting the results.

Labelling Techniques Used:

1. Randomized: randomly assigning each frame to a category to provide a baseline

2. Manual #1 (Frame-by-frame): hand-labelling every single frame

3. Manual #2 (Using AIGVA tool): hand-labelling using the AIGVA label review video player

4. Cold Label #1 (5 images / video): using the Cold Label tool to manually label 5 images for each of the 12 videos, then training a classifier on those images and using that classifier to predict all frames

5. Cold Label #2 (10 images / video): using the Cold Label tool to manually label 10 images for each of the 12 videos, then training a classifier on those images and using that classifier to predict all frames

6. Cold Label #3 (w/ Unique Image Extractor): using the Cold Label tool with the unique image extractor to label all the ”unique” frames per video, then training a classifier on those images and using that classifier to predict all frames

7. Cold Label #1 + Correction: using the previous method, but then spending time to correct for mistakes

8. Cold Label #2 + Correction: using the previous method, but then spending time to correct for mistakes


9. Cold Label #3 + Correction: using the previous method, but then spending time to correct for mistakes

Results & Discussion

Overall, the results show that the AIGVA tool can significantly speed up the process of frame-by-frame labelling by as much as 14-19x. The results of the experiment are shown in Table 5.1 and confusion matrices for each of the experiments can be found in Appendix C.3.3. The first observation to note is that the randomized label approach performs as expected, with an accuracy of 30%, meaning that it is just as good as random guessing. This provides a good baseline and sanity check for our test.

Labelling Technique                 Labelling   Predicting   Correcting   Total    Accuracy   F1-Score
                                    (mins)      (mins)       (mins)       (mins)
Randomized                          0           0            0            0        0.30       0.35
Manual #1 (Frame-by-frame)          303         0            0            303      0.67       0.66
Manual #2 (Using AIGVA tool)        28          0            0            28       0.67       0.66
Cold Label #1 (5 images / video)    1           11           0            12       0.63       0.61
Cold Label #2 (10 images / video)   2           11           0            13       0.64       0.65
Cold Label #3 (w/ Unique Img.)      5           11           0            16       0.65       0.65
Cold Label #1 + Correction          1           11           10           22       0.67       0.66
Cold Label #2 + Correction          2           11           8            21       0.67       0.66
Cold Label #3 + Correction          5           11           5            21       0.67       0.66

Table 5.1: AIGVA Cold Label Results

The second observation to note is the performance of the two manual labelling techniques, Manual #1 and Manual #2. We observe a significant performance improvement of the manually labelled techniques over the randomized approach, with the classification accuracy jumping to 67%. This confirms that there is explanatory power in our training data, despite the training data being sourced from just 12 YouTube videos. Comparing the two manual approaches, we observe a significant time saving of 11x, from the 303 mins¹ taken to label the 18179 frames using the Manual #1 technique down to the 28 mins taken using the Manual #2 technique.

¹ Manual #1 time spent is an estimate assuming it takes 1 second to label each of the 18179 frames. The labels produced by Manual #1 are the same as those produced by Manual #2, since both approaches are human labelled and will produce the same result.

This suggests that using the AIGVA tool, even without the label prediction features, can significantly speed up the labelling process.

The third observation to note is the impact of using the three Cold Label techniques (without corrections). We see the test accuracy drop from 67% using the manual techniques down to 63%-65% using the three Cold Label techniques. However, the Cold Label techniques are also faster than the manual techniques. For example, if you are willing to sacrifice a 2% performance loss (from 67% to 65%), total labelling time is reduced to only 16 mins. This is 19x faster than hand-labelling every frame. Furthermore, as expected, we see a positive relationship between the amount of time taken to label a subset of images and the subsequent performance of the classifiers. In the best performing Cold Label technique, Cold Label #3, the "Unique Image Extraction" algorithm extracts approximately 3 times as many images for the user to label compared to Cold Label #2². This suggests that the user of the AIGVA tool must carefully balance the trade-off between time spent labelling and the "correctness" of the labels required. However, as we will explore in the final observation, this trade-off becomes less important if the user decides to manually validate the labels after they have been predicted using the Cold Label techniques.

The final observation to note is the performance increase obtained when using any of the three Cold Label techniques and subsequently correcting the mistakes manually. It took between 5-10 mins to correct the labels for each of the Cold Label techniques, depending on the correctness of the predicted labels. Interestingly, once the corrections were factored in and all labels were correct, the total time taken did not vary by more than a minute between the Cold Label options (21-22 mins). Hence the "time spent labelling versus accuracy" trade-off discussed in the previous paragraph becomes less relevant. Using any of the Cold Label techniques to obtain initial "weak" labels and then manually adjusting them proved to be the fastest approach to obtaining a complete set of correct labels.

2Manually labelled 55 images for Cold Label #1, 110 images for Cold Label #2, 335 images for Cold Label #3

This is because a user spends much less time manually correcting a video once it has some labels, assuming the labels are mostly correct. Overall, using this combination of Cold Labelling and correcting cut the time taken to label all videos correctly by 14x compared to the manual frame-by-frame approach.

Limitations

It is worth noting the limitations of this experiment. Because of the laborious and time-consuming nature of the task, the experiment was constrained to only 12 YouTube videos as training data. These videos were relatively simplistic in nature, with relatively few class changes within each video. Furthermore, the image classification task was limited to only three classes of moderate visual difference: cats, airplanes and automobiles. Despite these choices to limit the experiment, the total time to conduct it was still a full day. All of these limitations mean that we have to be skeptical when generalizing these results to both larger datasets and more difficult domains. Nevertheless, in my opinion these results offer an indication of the usefulness of the tool.

There are also several limitations to the Cold Label technique overall. First, despite the advances in few-shot learning and transfer learning, there is still a minimum threshold of data required for a classifier to perform well. This threshold varies from domain to domain. For instance, only a few examples may be needed to distinguish between a cat and a dog, whereas more instances are needed to distinguish between different breeds of dogs. Second, the complexity of the model used may be insufficient to capture the difference between classes. Some domains, such as detecting scan planes in fetal ultrasounds, are hard and therefore likely too difficult for this type of label prediction. Third, if classes are very imbalanced, sampling a few images from each video may skip key frames of underrepresented classes.


5.1.2 Label Prediction - Blitz Label

To quantitatively test the Blitz Label technique I conducted an experiment to investigate the performance of several labelling approaches. The idea behind the experiment was to time several labelling approaches and then examine how many of the frames they predicted correctly.

Methodology

Figure 5.2: Quantitative Test of AIGVA Blitz Label - Methodology

The experiment consisted of four steps, as shown in Figure 5.2. First, I prepared two datasets: (1) a dataset consisting of clips from the TV show Game of Thrones with instances of the two main characters, and (2) the dataset of YouTube videos used in the previous experiment, which had instances of three categories from the CIFAR-10 dataset. I chose to conduct the experiment on these two datasets to examine the difference in performance across different video domains.


Datasets Used:

1. Dataset #1: 8 videos w/ instances of 2 main characters from Game of Thrones (Daenerys, Jon Snow) - 4881 total frames

2. Dataset #2: 12 videos w/ instances of 3 CIFAR-10 classes (Cats, Airplanes, Automobiles) - 18179 total frames

Second, I used a variety of techniques to label the images, as seen in the list below. Each labelling technique was carefully timed from beginning to end and split into three buckets where applicable: (1) "labelling time" was time spent manually labelling any number of images, (2) "prediction time" was time spent running predictive algorithms to automatically label videos, (3) "correction time" was time spent correcting predicted labels if that was part of the experiment. Third, all frames were extracted using the AIGVA export tool and downloaded locally. Fourth, the downloaded frames were processed using a Jupyter notebook. Here I compared the performance of each of the labelling techniques by determining the number of frames that were correctly and incorrectly labelled. I was able to calculate this because I had initially labelled all of the frames manually as part of the experiment.

Labelling Techniques Used:

1. Manual 1 (Frame-by-frame): hand-labelling every single frame

2. Manual 2 (Using AIGVA tool): hand-labelling using the AIGVA label review video player

3. Manual 75% / Blitz Label 25%: manually labelling 75% of the data, training a Blitz Label model on it, and then using the model to predict frames for the remaining 25%

4. Manual 50% / Blitz Label 50%: manually labelling 50% of the data, training a Blitz Label model on it, and then using the model to predict frames for the remaining 50%

5. Manual 25% / Blitz Label 75%: manually labelling 25% of the data, training a Blitz Label model on it, and then using the model to predict frames for the remaining 75%

6. Manual 75% / Blitz Label 25% + Correction: using the previous method, but then spending time to correct for mistakes

7. Manual 50% / Blitz Label 50% + Correction: using the previous method, but then spending time to correct for mistakes

8. Manual 25% / Blitz Label 75% + Correction: using the previous method, but then spending time to correct for mistakes

Results & Discussion

Overall, the results show that the AIGVA tool using Blitz Label can significantly speed up the process of frame-by-frame labelling by as much as 8x. The results of the experiment are shown in Table 5.2. The first observation to note is the performance of the two manual labelling techniques, Manual #1 and Manual #2. For dataset 1, using the AIGVA tool without the label prediction features reduced the labelling time from 81 min³ to 24 min - a 3.3x improvement. For dataset 2, using the AIGVA tool without the label prediction features reduced the labelling time from 303 min to 38 min - an 8.0x improvement. The difference between datasets can be explained by the difference in dataset complexity. Dataset 2 had larger groupings of neighbouring labels (i.e. fewer label changes per video), which meant that using the AIGVA tool, where you can quickly label adjacent frames, was more efficient for this dataset. Overall, the speedup of 3.3-8.0x shows that even the AIGVA tool without the label prediction features can significantly speed up the labelling process.

³ Manual #1 time spent is an estimate assuming it takes 1 second to label each of the 4881 or 18179 frames. The labels produced by Manual #1 are the same as those produced by Manual #2, since both approaches are human labelled and will produce the same result.


Labelling Technique                           Labelling   Train/Predict   Correction   Total    # Correct   # Incorrect   % Correct
                                              (mins)      (mins)          (mins)       (mins)

Dataset #1: Game of Thrones
Manual 100% #1 (Frame-by-frame)               81          0               0            81       4881        0             100.0%
Manual 100% #2 (Using AIGVA tool)             24          0               0            24       4881        0             100.0%
Manual 75% / Blitz Label 25%                  18          5               0            23       4624        257           94.7%
Manual 50% / Blitz Label 50%                  12          7               0            19       3921        960           80.3%
Manual 25% / Blitz Label 75%                  6           8               0            14       3293        1588          67.5%
Manual 75% / Blitz Label 25% + Correction     18          5               2            25       4881        0             100.0%
Manual 50% / Blitz Label 50% + Correction     12          7               3            22       4881        0             100.0%
Manual 25% / Blitz Label 75% + Correction     6           8               6            20       4881        0             100.0%

Dataset #2: YouTube CIFAR-10 (3 categories)
Manual 100% #1 (Frame-by-frame)               303         0               0            303      18179       0             100.0%
Manual 100% #2 (Using AIGVA tool)             38          0               0            38       18179       0             100.0%
Manual 75% / Blitz Label 25%                  29          12              0            41       17558       621           96.6%
Manual 50% / Blitz Label 50%                  19          17              0            36       16826       1353          92.6%
Manual 25% / Blitz Label 75%                  10          21              0            31       15971       2208          87.9%
Manual 75% / Blitz Label 25% + Correction     29          12              2            43       18179       0             100.0%
Manual 50% / Blitz Label 50% + Correction     19          17              5            41       18179       0             100.0%
Manual 25% / Blitz Label 75% + Correction     10          21              6            37       18179       0             100.0%

Table 5.2: AIGVA Blitz Label Results

The second observation to note is the impact of using Blitz Label (without corrections). For both datasets examined there is a clear positive linear relationship between the amount of data you label manually and the percentage of labels that are correct. This is unsurprising, as more training data generally produces more accurate models. For both datasets, the total time taken to label using any of the Blitz Label techniques is not much less than manually labelling all the data using the AIGVA tool - although this is still significantly faster than manually labelling videos frame-by-frame. For example, for dataset 1, using a 50/50 Manual/Blitz Label split reduces the time taken to label the data from 24 min to 19 min but then mislabels ca. 20% of the data. Similarly, for dataset 2, using a 50/50 Manual/Blitz Label split reduces the time taken to label the data from 38 min to 36 min but then mislabels ca. 7% of the data. Although these results do not appear too promising for the Blitz Label technique, it is worth noting that we have measured the total time taken, not the human time spent. For example, for dataset 2, using a 25/75 Manual/Blitz Label split reduces the human time taken to label the data from 38 min to 10 min but then mislabels ca. 12% of the data.

The third observation to note is the impact of using Blitz Label with corrections. This works by running the Blitz Label model and then manually correcting the misplaced labels. This brings the labelling accuracy up to 100% in all cases

because the labels were manually corrected. This method is the fastest across all tested approaches. For example, for dataset 2, using a 25/75 Manual/Blitz Label split reduces the time taken from 303 min using Manual #1 to only 37 min, of which only 16 min was human time. This represents an improvement of over 8.0x. These results are similar to those found in the previous Cold Label experiment, where using a combination of Cold Label and corrections proved to be the fastest method. There is also room for significant improvements in training/prediction speed by, for example, optimizing the Core Data API or running the server on a GPU.

Limitations

There are several limitations to this experiment. For similar reasons to the previous experiment, this experiment used limited toy datasets of either 8 or 12 videos. This is because of the laborious and time-consuming nature of the manual labelling tasks involved. Furthermore, the image classification task was constrained to simple classes of only moderate visual difference: 2x TV show characters in one dataset, and 3x CIFAR-10 classes (cats, airplanes and automobiles) in the other. Again, despite these limitations, running the experiment took more than a full day. For the aforementioned reasons, we have to be careful when generalizing these results to both larger datasets and more difficult domains.

The "Blitz Label" method itself may have several shortcomings. First, the method may be prone to overfitting, as the extracted frames have temporal locality, although this may also be a benefit as the jittering of the videos can provide natural data augmentation (as discussed in Section 2.1.5). Second, if there is meaningful variation between the labelled videos used to train the classifier and the unlabelled videos that the classifier is used on, then the method may not generalize well. Third, similar to the potential drawbacks of the "Cold Label" technique, the complexity of the model used may be insufficient to capture the difference between classes.


5.2 Performance Testing

5.2.1 Web Application Performance

To test the performance and accessibility of the AIGVA application I use the open-source Lighthouse tool, which runs various automated audits and computes performance metrics for web applications. Modern web development involves so many tools and frameworks that assessing the quality of the end result by hand is difficult, and in production all of these development choices impact the user experience. Some developers use a checklist of questions such as "how quickly does the first part of the DOM render?" and "what is the estimated input latency?". However, this approach is cumbersome. Instead, most web developers recommend using a tool to automatically test your site against a long list of pre-built best practice "questions" (Sheppard, 2017). That is exactly what Lighthouse does. The Lighthouse tool runs a website against 41 different metrics and ultimately produces two scores: a performance score and an accessibility score.

The performance score (Lig, 2019) compares the speed of your website against others. The accessibility score is a weighted average of numerous accessibility measures, such as the variation of CSS styles and the presence of "alt text" for images. Scores for both measurements are between 0 and 100, where 100 is the best possible score, corresponding to the 98th percentile of all websites, and a score of 50 represents the 75th percentile of all websites. See Appendix C.2 for a breakdown of the score weightings.

I ran the Lighthouse performance and accessibility tests twice for the 15 main pages of the AIGVA tool. The first time, I ran the tests over WiFi to simulate the performance experienced by the average user. The second time, I ran the tests over an emulated 3G network, which significantly throttled the internet speed. The purpose of the second test was to see if the AIGVA tool relied too heavily on the high network speeds it was developed on to deliver its contents. All the results are shown in Figure 5.3.


Figure 5.3: AIGVA Performance and Accessibility Testing (Using Lighthouse)

The results of the performance and accessibility tests are good overall. There is no evidence that any part of the web application performs poorly. On the contrary, all performance scores when the application is run on WiFi are a perfect 100%, meaning that all pages of the AIGVA tool load extremely fast when used over a good internet connection. This also suggests that there are no inefficient event listeners in the DOM or memory leaks in the application. Similarly, the performance scores for the 3G connections are also very high, with the exception of one 87% score for the export page. I examined the performance of the export page further to try to understand the root cause of the slightly slower page load time. Using the built-in performance stack trace DevTool in Google Chrome, I could see all the DOM events occurring until the entire export page had loaded (see Figure C.1). I found that the performance penalty came from the label preview folder structure, as it needed to make several API calls to estimate the number of labels per category. As

the performance penalty is still small enough to produce a score above 80% for the export page, I will not remove this feature. But one future extension could be to add the option to toggle the preview on/off in the settings page.

Figure 5.4: AIGVA Performance and Accessibility Testing - Export Page Analysis

The accessibility scores of the tests are equivalent for both WiFi and 3G connections. Overall, the accessibility scores are good, although there are a number of scores below 90%. These lower scores all stem from the same cause: missing "Accessible Rich Internet Applications" (ARIA) features on pages with any form of user input (such as dropdowns and input text fields). These ARIA features can "enhance the experience for users of assistive technology, like a screen reader" (Lig, 2019). However, since the usage scope of the AIGVA tool is limited to a standard web browser, I have chosen not to address these ARIA features.

5.2.2 Core Data API Performance

There are several data intensive processes that run as part of the Core Data API. These include exports, label predictions and machine learning model training. To test the performance of these tools I have conducted several experiments measuring the time to run these jobs. All tests were done on a MacBook Pro (13-inch, 2016) with 2.9 GHz Core i5.


Export Tool

The average export time depends on the number of frames needed to export. Therefore the export time can be measured in frames per second. The average export times for both methods are listed below:

• Export Option A (Labels Only): ∼9900 FPS

• Export Option B (Full Dataset Download): ∼55 FPS

Option A is clearly the fastest export method. This is by design, as it only downloads the labels, which are already kept in browser memory. Therefore the only action performed by the tool is to trigger the download. Option B downloads at 55 FPS. This method is slower by design, as it provides both labels and images organized in a zipped file. Recall that the full dataset download option triggers an API call to the Core Data API, which then iterates over the videos to be exported, extracts frames as images and places them in folders corresponding to their labels. The Core Data API then uploads this folder as a zipped file to S3 before sending the link to the user.

Label Prediction Performance

The label prediction framerate for a single video is approximately 8 FPS. However, the AIGVA tool allows for concurrent predictions, which run as separate processes on a CPU. This pseudo-concurrency can bring the effective framerate up to nearly 11 FPS when running 4 concurrent predictions, as shown in Table 5.3. Because of time constraints, I did not try deploying the Core Data API on a GPU. This could hypothetically lead to significant performance improvements. One researcher who also used a ResNet34 model saw an improvement of 91x when running inference, as the time of a forward pass was reduced from 1530.01 ms on a CPU to 16.71 ms on a GTX 1080 Ti GPU⁴. However, such claims would have to be tested, as other factors such as frame extraction speed could prove to be bottlenecks.

4Data from https://github.com/jcjohnson/cnn-benchmarks


# Concurrent Predictions    Effective FPS
1                           8.27
2                           9.97
3                           10.72
4                           10.86

Table 5.3: Relationship Between Number of Concurrent Label Predictions and Prediction Framerate

Note that Effective FPS in this context refers to:

\[ \text{Effective FPS} = \sum_{i=0}^{j} \frac{\text{frames}_i}{t} \tag{5.1} \]

where $j$ is the number of concurrent prediction jobs, $\text{frames}_i$ is the total number of frames predicted by job $i$, and $t$ is the length of the measurement window in seconds.
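As a rough illustration (the per-job figures are assumed; only the 10.86 FPS value comes from Table 5.3): if four concurrent jobs each predict roughly 163 frames over a 60 second window, the effective framerate is (4 × 163) / 60 ≈ 10.9 FPS, in line with the measured value for four concurrent predictions.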

Model Training Performance

The time taken for a Blitz Label or Cold Label model to train is a function of the number of training images used. To examine this relationship, I varied the number of input images (roughly doubling it each time) and measured the time taken to train a model. For this experiment I trained a ResNet model using Blitz Label (note that Cold Label uses exactly the same model architecture, so these results also apply to that technique). I trained the Blitz Label model for three cycles, which is usually sufficient to minimize the test set error. The results are shown in Table 5.4.

# Training Images    Model Training Time (secs)    Seconds / Image
111                  58                            0.52
222                  118                           0.53
445                  247                           0.56
888                  404                           0.45

Table 5.4: Relationship Between Training Images and Model Training Time (Using a ResNet Blitz Label Model Trained for 3 Cycles)

The model training performance scales almost linearly with the number of input images. The results of the experiment show that it takes 0.4-0.6 seconds per image to train a model for three cycles within the tested range of training images. There appears to be a convex relationship between the number of training images and the seconds per image needed to train a model; I suspect this is caused by other non-test processes disturbing the test results. Rather, the takeaway from this experiment is that training time appears to be reasonably stable in the range of 0.4-0.6 seconds per image.


5.3 User Interview & Feedback

To qualitatively assess the AIGVA tool, I conducted two rounds of interviews. First, I conducted four semi-structured interviews with machine learning researchers. Second, I conducted a single open-ended interview with a clinical sonographer who works as part of the iFIND ultrasound project.

5.3.1 Machine Learning Researchers

Figure 5.5: AIGVA ML Researchers User Interviews: Selected Quotes

After giving four machine learning researchers from Imperial College London a demo of the tool, I asked several questions as part of a semi-structured survey - see Appendix B.1 for full survey transcripts. The first question asks researchers to rate their agreement with the statement "The AIGVA tool is useful for someone having to label video data" from Highly Disagree to Highly Agree. Figure 5.6 shows the results. Overall, the results are positive and all researchers agree that the tool is useful. Additional comments from users, as seen in Figure 5.5, show that the reasons researchers find it useful include: the broad accessibility of the tool, the domain agnostic nature of the tool, and the ability to use it immediately in its current state.

The second question asks researchers to rate their agreement with the statement "AIGVA is simple to understand and easy to use" from Highly Disagree to Highly


Figure 5.6: AIGVA ML Researcher Survey - Question 1 (n=4)

Agree. Figure 5.7 shows the results. All researchers highly agreed with the statement, which suggests that the tool accomplished its goals of being simple and easy to use. Further comments suggest that the reason for the positive feedback is that the researchers liked the user interface and found it intuitive to use.

Figure 5.7: AIGVA ML Researcher Survey - Question 2 (n=4)

Additional questions in the survey dealt with any feedback, constructive criticism and feature requests the machine learning researchers had. As there were several commonalities across answers I have collated them in the following list of feature requests:


• Ability to do other types of labelling (i.e. segmentation, bounding boxes and pose estimation).

• Ability to use other types of data (i.e. image-level data, 3D volumetric data, and specific medical data such as fMRIs).

• Providing some form of label ownership to know which user or model produced the label. One researcher wanted this to discard poor labels in case he discovered that one of his labellers was performing poorly.

• GPU enabled prediction would be ideal to speed up labelling.

• A confidence bar for labels in the video analysis page that, similar to the label preview bar, would display a heat-map of the confidences across the time dimension. The idea being that you could quickly see regions where a model lacked confidence and focus review efforts there.

• A feature that could average/aggregate labels from several reviewers using the share feature. Then include some visual feature to capture label variability and display it.

5.3.2 Clinical Sonographer

Figure 5.8: AIGVA Clinical Sonographer Interview


I was offered the opportunity to interview a clinical sonographer who works as part of the iFIND ultrasound project. Because the tool was still in development at the time of the interview, I used the opportunity to better understand the problems faced with current tools and therefore opted for an open-ended interview. I also presented an early prototype of the AIGVA tool to receive feedback. See Appendix B.2 for a full transcript.

After demoing an early prototype of the tool, we discussed potential use cases for sonographers and identified three key ones. First, the AIGVA tool could be useful for improving current labelling algorithms that are performing poorly. The poor performance of some current algorithms that attempt to label data is caused by the domain shift that often happens when either (1) different sonographers label the data or (2) different ultrasound imaging systems are used than the ones the model was originally trained on. Second, the sonographer identified the opportunity to use the AIGVA tool not only to label ultrasound videos for training machine learning algorithms but also to catalogue ultrasound videos. Specifically, she explained how current videos are not saved, and suggested that videos could be stored in the AIGVA tool and automatically processed for standard planes. Such a catalogue could be very useful as an easily searchable database for both clinical and legal reasons. Third, the sonographer believed that the AIGVA tool could be useful for training sonographers. Newly trained sonographers could view different organs/regions across the length of the video using the label preview bar.

Chapter 6

Conclusion & Future Work

6.1 Summary of Achievements

This dissertation reviewed the literature on image classification and demonstrated that more labelled training data is an important factor in improving classification performance. The literature review found that videos could be an important source of such training data and subsequently examined the current landscape of video annotation tools available. The examination of existing tools found that all were lacking in label prediction and video playback features and that there was a clear opportunity to develop such a smart video annotation tool. To further motivate the development of a smart video annotation tool, the successes and challenges faced by the iFIND ultrasound research project were discussed.

I proceeded by building the foundation for a smart video annotation tool called AIGVA. I developed a set of novel label prediction techniques called Cold Label, Blitz Label and Custom Model Label, and presented a set of user stories to show how these techniques can be used as part of a web application to speed up the video annotation process. I then crafted a set of design principles and functional requirements to guide the development of this tool.

With the design principles and functional requirements defined, I presented the implementation of the AIGVA tool. I discussed the software architecture and technology stack that were selected and the trade-offs involved in the decision making process. The architecture was divided into three components: (1) the Core Data API, responsible for video processing, machine learning and other data related tasks, (2) the Web Application, responsible for providing the UI and browser tool that a user interacts with, and (3) the File Storage and Database layer, responsible for hosting all video data and application related data. I then discussed two of the most challenging feature implementations: the video management and label prediction features. For the video management tool, I discussed how I built both an export tool and a frame-by-frame video player from scratch to enable frame-level labelling of videos via a web application. For the label prediction feature, I presented how the label prediction engine was built to handle the heavy data processing created by the Cold, Blitz and Custom Model labelling techniques. I discussed the implementation of the three labelling techniques and the challenges of building them, such as the need to develop an algorithm to extract unique images from videos using perceptual hashing. I also demonstrated how the application was tested and then deployed using Docker.

Finally, I evaluated the tool using both quantitative and qualitative measures. I conducted an empirical study of the label prediction methods which suggested that the AIGVA tool could cut the time needed to label videos by a factor of nearly 14-19x. I assessed the real-time performance of the AIGVA web app and the Core Data API using a variety of techniques, all of which suggest that the tool performs well. Lastly, I qualitatively assessed the tool by conducting a set of user interviews with machine learning researchers and a clinical sonographer. Both rounds of interviews suggested that the tool could be useful in practical real-world settings with either minor or no modifications.


6.2 Future Work

There are multiple areas where the AIGVA tool could be extended to address its limitations. The application could be improved by adding user management features such as admin roles and reviewer roles. This would make it possible to develop label quality assessment tools where an admin can aggregate the labels from several reviewers. Furthermore, the tool is limited to image-level classification, meaning that only a single label can be associated with an image. Additional labelling techniques such as segmentation, object detection and pose estimation could also be integrated to extend the usability of the tool.

Another limitation is that the tool currently only supports video data. Therefore, additional data types such as image-only data or 3D volumetric data could be supported to extend use cases further. Extending the upload tool to handle multiple videos would improve the user experience, as currently the upload functionality can only handle one video at a time. Another area that could benefit from further development is label prediction, which could be improved by either: (1) using GPU inference instead of CPU inference to speed the process up, or (2) exploring new model types for Blitz and Cold Label that could improve the classification rates and therefore increase prediction accuracy.

Overall the development of the AIGVA tool has been a rewarding and exciting experience. I hope that both technical and non-technical users will benefit from its development, which I plan to continue in the future.

Bibliography

[1] (2019). Ant design - react ui library. https://ant.design/docs/react/ introduce. Accessed: 2019-08-28. pages 40

[2] (2019). Apexcharts - js charting library. https://apexcharts.com. Accessed: 2019-10-08. pages 67

[3] (2019). Axios - promise based http client for the browser and node.js. https: //github.com/axios/axios. Accessed: 2019-10-08. pages 66

[4] (2019). Bizcharts - js charting library. https://bizcharts.net. Accessed: 2019-10-08. pages 67

[5] (2019). Celery - python task queue. https://docs.celeryproject.org/en/ latest/index.html. Accessed: 2019-08-08. pages 61

[6] (2019). Createreactapp - officially supported way to create single-page react applications. https://facebook.github.io/create-react-app/docs/ getting-started. Accessed: 2019-10-08. pages 65

[7] (2019). Cvat - video annotation tool. https://github.com/opencv/cvat. Accessed: 2019-08-20. pages 23

[8] (2019). Eu gdpr - what is personal data? https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en. Accessed: 2019-08-28. pages 26

[9] (2019). Filesaverjs - an html5 saveas() filesaver. https://github.com/ eligrey/FileSaver.js/. Accessed: 2019-10-08. pages 68


[10] (2019). Flask restless. https://flask-restless.readthedocs.io. Accessed: 2019-08-08. pages 59

[11] (2019). Flask web framework. https://github.com/pallets/flask. Ac- cessed: 2019-08-08. pages 58

[12] (2019). Gunicorn - python wsgi http server for unix. https://gunicorn.org/. Accessed: 2019-08-08. pages 60

[13] (2019a). Labelbox - video annotation tool. https://github.com/Labelbox/Labelbox. Accessed: 2019-08-20. pages 20

[14] (2019b). Labelme - video annotation tool. http://labelme.csail.mit.edu/Release3.0/. Accessed: 2019-08-20. pages 21

[15] (2019). Lighthouse web performance measurements. https://developers. google.com/web/tools/lighthouse/. Accessed: 2019-12-08. pages 111, 113

[16] (2019). Moment.js - time management for js. https://momentjs.com/. Ac- cessed: 2019-10-08. pages 68

[17] (2019). Opencv - computer vision library. https://opencv.org. Accessed: 2019-08-08. pages 62, 83

[18] (2019). phash - the open source perceptual hash library. https://www.phash.org/. Accessed: 2019-08-20. pages 83

[19] (2019). Postgresql - the world’s most advanced open source relational database. https://www.postgresql.org/. Accessed: 2019-10-08. pages 72

[20] (2019). Preliminary report highway - NTSB. https://www.ntsb.gov/investigations/AccidentReports/Reports/HWY18MH010-prelim.pdf. Accessed: 2019-08-30. pages 1

[21] (2019a). React-router-dom - a dom binding of react router. https://www. npmjs.com/package/react-router-dom. Accessed: 2019-10-08. pages 69


[22] (2019b). Reactjs - a library for building user interfaces. https: //reactjs.org. Accessed: 2019-10-08. pages 64

[23] (2019). Rectlabel - video annotation tool. https://rectlabel.com/. Accessed: 2019-08-20. pages 21

[24] (2019). Redis - real-time message broker and database. https://redis.io/. Accessed: 2019-08-08. pages 61

[25] (2019). Vatic - video annotation tool. http://www.cs.columbia.edu/~vondrick/vatic/. Accessed: 2019-08-20. pages 22

[26] (2019). Vott - video annotation tool. https://github.com/microsoft/VoTT. Accessed: 2019-08-20. pages 22

[27] Baumgartner, C., Kamnitsas, K., Matthew, J., P. Fletcher, T., Smith, S., Koch, L., Kainz, B., and Rueckert, D. (2016a). Real-time detection and localisation of fetal standard scan planes in 2d freehand ultrasound. IEEE Transactions on Medical Imaging, 36. pages 24

[28] Baumgartner, C. F., Kamnitsas, K., Matthew, J., Fletcher, T. P., Smith, S., Koch, L. M., Kainz, B., and Rueckert, D. (2016b). Real-time detection and localisation of fetal standard scan planes in 2d freehand ultrasound. CoRR, abs/1612.05601. pages 24, 25

[29] Benyon, D. (2014). Designing interactive systems: A comprehensive guide to hci, ux and interaction design. pages 39

[30] Cai, Y., Sharma, H., Chatelain, P., and Noble, J. (2018). Multi-task SonoEyeNet: Detection of Fetal Standardized Planes Assisted by Generated Sonographer Attention Maps, volume 11070, pages 871–879. pages 24

[31] Chen, H., Ni, D., Qin, J., Li, S., Yang, X., Wang, T., and Ann Heng, P. (2015). Standard plane localization in fetal ultrasound via domain transferred deep neu- ral networks. IEEE journal of biomedical and health informatics, 19. pages 24


[32] Dean, J., Patterson, D., and Young, C. (2018). A new golden age in com- puter architecture: Empowering the machine-learning revolution. IEEE Micro, 38(2):21–29. pages 11

[33] Deng, J., Dong, W., Socher, R., Li, L., Kai Li, and Li Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. pages 14

[34] Dyl, T. and Przeorski, K. (2017). Mastering: Full-Stack React Web Development. Packt Publishing. pages 70, 71

[35] Fielding, R. T. (2000). Architectural Styles and the Design of Network-based Software Architectures. PhD thesis. AAI9980887. pages 58

[36] Galitz, W. O. (2007). The essential guide to user interface design: an introduction to GUI design principles and techniques. John Wiley & Sons. pages 39, 40, 41, 42, 97

[37] Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelli- gence. Nature, 521(7553):452. pages 11

[38] Gong, C., Tao, D., Maybank, S. J., Liu, W., Kang, G., and Yang, J. (2016). Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260. pages 17

[39] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org. pages i, 12

[40] Grinberg, M. (2014). Flask Web Development: Developing Web Applications with Python. O’Reilly Media, Inc., 1st edition. pages 72, 98

[41] Guillaumin, M., Verbeek, J., and Schmid, C. (2010). Multimodal semi- supervised learning for image classification. pages 902–909. pages 17

[42] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR, abs/1512.03385. pages 10


[43] Holzinger, A. (2016). Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131. pages 13, 15

[44] Huang, G., Liu, Z., and Weinberger, K. Q. (2016). Densely connected convolu- tional networks. CoRR, abs/1608.06993. pages 10

[45] Huh, M., Agrawal, P., and Efros, A. A. (2016). What makes imagenet good for transfer learning? CoRR, abs/1608.08614. pages 12

[46] Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. (2016). Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360. pages 10

[47] Jin, S., Roy Chowdhury, A., Jiang, H., Singh, A., Prasad, A., Chakraborty, D., and Learned-Miller, E. G. (2018). Unsupervised hard example mining from videos for improved object detection. CoRR, abs/1808.04285. pages 19

[48] Joshi, A., Porikli, F., and Papanikolopoulos, N. (2009). Multi-class active learn- ing for image classification. pages 2372–2379. pages 16

[49] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725– 1732. pages 18

[50] Kim, G., Debois, P., Willis, J., and Humble, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organiza- tions. IT Revolution Press. pages 97

[51] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi- supervised learning with deep generative models. In Advances in neural informa- tion processing systems, pages 3581–3589. pages 17


[52] Kovashka, A., Vijayanarasimhan, S., and Grauman, K. (2011). Actively select- ing annotations among objects and attributes. In 2011 International Conference on Computer Vision, pages 1403–1410. pages 16

[53] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer. pages 88

[54] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bot- tou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc. pages 10

[55] Lage, I., Ross, A., Gershman, S. J., Kim, B., and Doshi-Velez, F. (2018). Human- in-the-loop interpretability prior. In Bengio, S., Wallach, H., Larochelle, H., Grau- man, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Informa- tion Processing Systems 31, pages 10159–10168. Curran Associates, Inc. pages 15

[56] Ng, A. (2015). Advice for applying machine learning. https://see.stanford. edu/materials/aimlcs229/ML-advice.pdf [Accessed: 06-06-2019]. pages 12

[57] NHS (2017). Ultrasound scans in pregnancy. https://www.nhs.uk/conditions/pregnancy-and-baby/ ultrasound-anomaly-baby-scans-pregnant/ [Accessed: 03-06-2019]. pages 2

[58] Norouzi, M., Fleet, D. J., and Salakhutdinov, R. R. (2012). Hamming distance metric learning. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1061–1069. Curran Associates, Inc. pages 84

[59] Perez, L. and Wang, J. (2017). The effectiveness of data augmentation in image classification using deep learning. CoRR, abs/1712.04621. pages 12

[60] Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. (2018). Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548. pages 11


[61] Research, G. (2016). Tensorflow: Large-scale machine learning on heteroge- neous distributed systems. CoRR, abs/1603.04467. pages 11

[62] Roh, Y., Heo, G., and Whang, S. E. (2018). A survey on data collection for machine learning: a big data - AI integration perspective. CoRR, abs/1811.03402. pages 13, 14, 16, 17

[63] Russakovsky, O., Li, L., and Fei-Fei, L. (2015). Best of both worlds: Human- machine collaboration for object annotation. In 2015 IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 2121–2131. pages 15

[64] Schlemper, J., Oktay, O., Schaap, M., Heinrich, M. P., Kainz, B., Glocker, B., and Rueckert, D. (2018a). Attention gated networks: Learning to leverage salient regions in medical images. CoRR, abs/1808.08114. pages 11

[65] Schlemper, J., Oktay, O., Schaap, M., Heinrich, M. P., Kainz, B., Glocker, B., and Rueckert, D. (2018b). Attention gated networks: Learning to leverage salient regions in medical images. CoRR, abs/1808.08114. pages 24, 25

[66] Sheppard, D. (2017). Beginning Progressive Web App Development: Creating a Native App Experience on the Web. pages 111

[67] Shin, H., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. J., and Summers, R. M. (2016). Deep convolutional neural networks for computer- aided detection: CNN architectures, dataset characteristics and transfer learning. CoRR, abs/1602.03409. pages 11

[68] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556. pages 10

[69] Sun, C., Shrivastava, A., Singh, S., and Gupta, A. (2017). Revisiting unreason- able effectiveness of data in deep learning era. CoRR, abs/1707.02968. pages 11, 12

[70] Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T., Kendall, C. B., Gotway, M. B., and Liang, J. (2016). Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312. pages 88

[71] Tan, M. and Le, Q. V. (2019). Efficientnet: Rethinking model scaling for con- volutional neural networks. CoRR, abs/1905.11946. pages 10, 11

[72] Toussaint, N., Khanal, B., Sinclair, M., Gómez, A., Skelton, E., Matthew, J., and Schnabel, J. A. (2018). Weakly supervised localisation for fetal ultrasound images. CoRR, abs/1808.00793. pages 24

[73] Vinyals, O., Blundell, C., Lillicrap, T. P., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. CoRR, abs/1606.04080. pages 17

[74] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and Tang, X. (2017). Residual attention network for image classification. CoRR, abs/1704.06904. pages 11

[75] Wang, Y. and Yao, Q. (2019). Few-shot learning: A survey. CoRR, abs/1904.05046. pages 17, 18

[76] Yan, W., Yap, J., and Mori, G. (2015). Multi-task transfer methods to improve one-shot learning for multimedia event detection. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 37.1–37.13. pages 19

[77] Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. J. (2015). Stacked attention networks for image question answering. CoRR, abs/1511.02274. pages 11

[78] Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling su- pervised methods. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL ’95, pages 189–196, Stroudsburg, PA, USA. Asso- ciation for Computational Linguistics. pages 17


[79] Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. (2015). LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365. pages 15

[80] Zang, S., Ding, M., Smith, D., Tyler, P., Rakotoarivelo, T., and Kaafar, M. A. (2019). The impact of adverse weather conditions on autonomous vehicles: How rain, snow, fog, and hail affect the performance of a self-driving car. IEEE Vehicular Technology Magazine, 14(2):103–111. pages 19

[81] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y. (2018). Image super-resolution using very deep residual channel attention networks. CoRR, abs/1807.02758. pages 11

[82] Zheng, S., Song, Y., Leung, T., and Goodfellow, I. J. (2016). Improving the robustness of deep neural networks via stability training. CoRR, abs/1604.04326. pages 18

[83] Zhou, Z.-H., Zhan, D.-C., and Yang, Q. (2007). Semi-supervised learning with very few labeled training examples. In Proceedings of the 22Nd National Confer- ence on Artificial Intelligence - Volume 1, AAAI’07, pages 675–680. AAAI Press. pages 17

[84] Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. (2017). Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012. pages 11

Appendices

Appendix A

User Guide

A.1 Installation

To install and run the AIGVA app, start by cloning the repository.

git clone https://github.com/Emil-Sorensen/AI-Generated-Video-Anonotation

The AIGVA app has three components that have to be set up: (1) the file storage and database, (2) the Core Data API, and (3) the web application. A lot of work has been done to ensure that the total setup time is kept to a minimum (it should take no more than 5-10 minutes).

File Storage

The file storage and database are both set up as part of the backend. First, you need to create an AWS S3 bucket that you can use to store files in the cloud (i.e. videos, machine learning models, etc.). Creating a file storage bucket can be done on the AWS website (see https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html). Once you have obtained your credentials, paste them into the backend config file shown below.

Listing A.1: File Storage Config (Located in: backend/app/config.py)

S3_KEY =
S3_SECRET =
S3_BUCKET =
S3_HOST =

To set up the database, you can use any PostgreSQL distribution. You can either host this locally or in the cloud. For cloud providers, I recommend Heroku (https://heroku.com) for its quick setup. Once you have created a database, paste the credentials into the same file as before.

Listing A.2: Database Config (Located in: backend/app/config.py)

HOST =
DATABASE =
USER =
PORT =
PASSWORD =
URI =
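For reference, a hypothetical set of filled-in values might look like the sketch below; the host, database name, user and password are placeholders for your own PostgreSQL instance, and the URI simply follows the standard PostgreSQL connection string format.

# Hypothetical placeholder values; substitute your own PostgreSQL credentials.
HOST = "localhost"
DATABASE = "aigva"
USER = "aigva_user"
PORT = "5432"
PASSWORD = "changeme"
URI = "postgresql://" + USER + ":" + PASSWORD + "@" + HOST + ":" + PORT + "/" + DATABASE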

You have now set up all the config necessary to run the backend.

Core Data API

The Core Data API is Dockerized to make installation as simple as possible. Make sure you have Docker installed and running on your machine (see https://docs.docker.com/get-started/ for help installing Docker).

With Docker running on your machine, run the following commands to install and run the AIGVA Core Data API.

cd backend
docker-compose build
docker-compose up

The Core Data API should now be up and running at port 5000 on your localhost.
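If you want to verify that the API responds, a quick sanity check can be run from Python. The sketch below is illustrative rather than part of the official setup: it assumes a small test video (test.mp4, a hypothetical file) is readable by the backend, and it calls the /api/process_video/ endpoint shown in Listing D.2.

import json
import urllib.parse
import urllib.request

# Query the video processing endpoint for basic stats (fps, frame count).
params = urllib.parse.urlencode({
    "file_location": "test.mp4",    # hypothetical test video path
    "default_label": "unlabelled",  # any default label string
})
with urllib.request.urlopen("http://localhost:5000/api/process_video/?" + params) as resp:
    stats = json.loads(resp.read().decode())

print(stats["fps"], stats["frame_count"])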

Web Application

The web application runs on the React framework and was bootstrapped using create-react-app. Installing and running it on your localhost can be done in a few commands. First you need to install the required packages by running the following:

cd frontend
npm install

After installing all the required packages you can now start the application with the following command:

npm start

The AIGVA web application should now launch in your default browser, running on port 3000. Alternatively, you can access the application at http://localhost:3000/.

A.2 User Manual

Figure A.1: User Manuals - Available on YouTube

A series of user manuals have been created and uploaded to YouTube.

• User Manual #1: Video/Project Management + Custom-built Video Player: https://www.youtube.com/watch?v=xSu6PxUhp88&feature=youtu.be

• User Manual #2: Label Prediction Features (Cold + Blitz + Custom Model): https://www.youtube.com/watch?v=OwDyLwO6SZI&feature=youtu.be

Appendix B

Full Surveys

B.1 Machine Learning Researchers

User Interviews - ML Researcher 1 (29/07/2019)

1. Is it difficult to label video data / what are the problems?

• Manual labelling is tedious and tiresome

• Can do smart things with ResNet/VGG but they are generally bad as they are trained on natural images so they cannot domain shift

• Believes the goal is to create something that is agnostic to domains

• Currently would drag labels from folder to folder

2. ”The AIGVA tool is useful for someone having to label video data”? (Highly Disagree - Highly Agree)

• Highly Agree

• Calls it a ”powerful tool”

• Likes that it has a Python interface for custom models

• Likes that you can ”see” the labels on the video and modify them without dragging from folder to folder


• ”I really like that it is domain agnostic”

3. ”The AIGVA tool is simple to understand and easy to use”? (Highly Disagree - Highly Agree)

• Highly Agree

• Difficult to say without tinkering with it for longer

• Very ”straightforward”

4. What features would you like to see?

• GPU enabled prediction

• Tool between custom model and blitz learning

• Online learning - ability to feed already reviewed labels into models and train classifier on that

• Some kind of accuracy measure on a test set to understand performance

5. Any other feedback?

• Looks great! Very slick tool.

• Suggested running experiment to show usability. Get a dataset like Caltech 101, which is image classification and craft a narrative. Do 80/20 test train split

• Step 1: Train ResNet on training data and report classification scores

• Step 2: ”Muddle” i.e. mislabel half the labels of the training data and rescore it

• Step 3: Use AIGVA to relabel data and show performance compared to model 1 and 2

User Interviews - ML Researcher 2 (01/08/2019)

1. Is it difficult to label video data / what are the problems?


• Yeah. He’s talked to a fetal cardiologist who has built his own tool.

• The tool workflow was: manually label subset of data, then run inference, then displays 16 imgs and click the incorrect ones and then manually relabel

2. ”The AIGVA tool is useful for someone having to label video data”? (Highly Disagree - Highly Agree)

• Agree

• Don’t always have to label data, but if they did it would be very useful.

3. ”The AIGVA tool is simple to understand and easy to use”? (Highly Disagree - Highly Agree)

• Highly Agree

• ”Really nice user interface”

• ”AIGVA makes data labelling very accessible to a non-technical audience - you can do everything without writing a line of code”

• Great even for the general public; he worked at Siemens for a period where he knew they had their own in-house labelling team

4. What features would you like to see?

• Other annotation data: landmark and segmentation

5. Any other feedback?

• He really likes it and can see it be very very useful

• Custom model implementation using python is great. I showed him the inter- face with docstring and he verified that he liked it.

• Could also be useful for few-shot / meta-learning visualizing how it is doing. Has fast feedback loop so useful for research.


• Pretraining on ImageNet is good for natural images but may suffer from domain-specific issues

User Interviews - ML Researcher 3 (01/08/2019)

1. Is it difficult to label video data / what are the problems?

• Yeah it’s a pain. Especially bad in medicine since it’s expensive to get doctors to label + they don’t always agree.

2. ”The AIGVA tool is useful for someone having to label video data”? (Highly Disagree - Highly Agree)

• Highly Agree

• Very useful. Any way to speed up the process of preparing data is hugely useful. Normally people use first 2 weeks of project just manipulating and curating data.

3. ”The AIGVA tool is simple to understand and easy to use”? (Highly Disagree - Highly Agree)

• Highly Agree

• Very accessible and easy to use. ”Much better than half the stuff postgrads make.”

4. What features would you like to see?

• Other types of labelling: segmentation, bounding boxes, and maybe even pose estimation

• Other types of data: medical data such as segments of fMRIs.

• Worth experimenting with converting medical data into video then inputting into software

• Maybe a heatmap of the confidences similar to the label preview bar.


5. Any other feedback?

• Luca at BioMed has been working a lot with videos. He might be interesting to talk to.

• Really likes the share feature - would be cool if you could share it with many medical professionals and then average the results. It’s a big thing in ML at the moment as medical labels are not always 100% agreed upon

• Possibly big commercial opportunity in capturing label variability and displaying it

• Check out the company Cloud Factory that works on similar labelling stuff

• Best applications are safety critical -> doesn’t matter that Netflix gets a recommendation wrong, but self-driving cars and medical diagnosis need to be much more correct

• Babylon Health interesting company that has GP (even with 70% accuracy where avg. doctor is 60% the NHS didn’t approve them - holds them to higher standard so they went to Lagos instead)

User Interviews - ML Researcher 4 (01/08/2019)

1. Is it difficult to label video data / what are the problems?

• Worked on a project where they had to label cardiac ultrasound and segment it frame-by-frame. Had built a tool to guess what the next frame was but it performed quite poorly and had issues with prediction. Used a spline to smooth it.

2. ”The AIGVA tool is useful for someone having to label video data”? (Highly Disagree - Highly Agree)

• Highly Agree

• ”I have no doubt it could be used today in our research group.”


• It would probably be most useful to people doing scan plane detection. Could also be useful for people doing adult cardiac ultrasound - it’s hard to label these images because certain planes are difficult to find.

• Could also be very useful for training Sonographers.

3. ”The AIGVA tool is simple to understand and easy to use”? (Highly Disagree - Highly Agree)

• Highly Agree

• ”The interface is very slick”

• ”I don’t see many tools like this that look this good”

4. What features would you like to see?

• Other annotation types: landmark + curve would be useful

• Extension to 4D images: which are 3d images in time dimension

• Also suggested a confidence bar for labels (like label preview bar)

• Confidence for user generated labels should be 100%

5. Any other feedback?

• Label ownership: who authored the label. Was it the machine? Was it the user? If it was a user, who? He presented the example of having 4 students do the labelling. If 1 of the 4 was bad, then it would be nice to know which labels he did to be extra careful.

• Current approach to labelling is dense (i.e. taking into account every frame) - could use a video-software-like approach with key-frame labelling

• Really like the tool and could see it being incredibly useful - he loved the model training in browser

• Ideas to scale to public: rights/security/privacy issues regarding labels + time spent labelling

• Key idea: ”empower manual labelling”


B.2 Clinical Sonographer

Clinical Sonographer Interview - August 2019

1. Is it difficult to label video data / what are the problems?

• ”Yes. It takes ages.”

• They currently use a version of the SonoNet model to collect and label data. However, they recently switched machines from GE to Philips, which has caused problems since the algorithms used to classify the standard planes were trained on GE machines. Specifically, the algorithm gets a lot of false positives and even false negatives.

• ”There is a huge variation in quality of images” meaning not all standard planes are great.

• How it works today: a sonographer will take ca. 13 scans that are freeze-framed from a video. The video itself is not saved, only the 13 freeze frames.

• Only very few frames of an ultrasound are considered ”good”. You might have 300 frames of a kidney, but only 5-6 of them are worth saving and worthy of being a standard plane scan.

2. What are the use cases of a tool like AIGVA for sonographers?

• Use case 1: improving algorithms

• Currently the algorithms are not improving. This is a problem because of the domain shift that often happens: either from (1) different sonographers or (2) different imaging systems.

• ”It is definitely a clinical problem”

• Use case 2: cataloging videos

• Current videos are not saved. If they could be stored on the AIGVA tool and automatically processed for standard planes, that could be very useful to keep an easily searchable database.


• Could also be useful for legal reasons. Suppose a baby is born with a deficiency that the sonographer missed. Legal issues normally arise here, but since only the ca. 13 images are saved, it is normally hard to say whether or not a sonographer missed a plane. By saving the videos and making them easily searchable, you could go back and prove/disprove that sonographers did the right thing.

• Use case 3: training sonographers

• Could be very useful for new sonographers to view different organs/regions across the length of the video.

3. Any extensions?

• Extension 1: 3d volume data

• Requires less training and skill to perform a volumetric scan, but the quality of the scan is currently worse. In the future, if scan quality improves, it could shift the way ultrasound is done: less trained people could perform the scans, leaving more time for sonographers to just analyze scans, like CT scans today.

• Extension 2: quality control

• Quality control is hugely important but is often not performed around the world. Like any skill, there are good and bad sonographers. A once good sonographer may become out of touch with their skills/knowledge etc. over time.

• A tool that could automatically analyze videos of their ultrasounds and identify where they saved their scan planes and ”rate” them, would be extremely useful and would be ”adopted everywhere” according to Jacqui. Think of it like an audit of quality of service. Would be a great proactive tool able to detect poorly performing sonographers in advance.

Appendix C

Supporting Material


C.1 Folder Structure

Listing C.1: Web Application Component Structure (Located in: frontend/src)

components
    App
        index.js
    Charts
        LossChart.js
    Dashboard
        LabelChart.js
        index.js
    Export
        index.js
    LabelReview
        index.js
    MLBlitzLabel
        BlitzCreate.js
        BlitzTrain.js
        index.js
    MLColdLabel
        ColdCreate.js
        ColdReview.js
        ColdTrain.js
        index.js
    MLCustom
        index.js
    Navigation
        index.js
    Projects
        AddProject.js
        EditProject.js
    Settings
        index.js
    VideoAnalysis
        index.js
    VideoFramePlayer
        LabelPreviewBar.js
        index.js
    VideoManagement
        UploadVideo.js
        index.js
constants
    api.js
    routes.js
index.js


C.2 Web Performance Testing

Figure C.1: AIGVA Performance and Accessibility Testing - Lighthouse Score Breakdown


C.3 Experimentation

C.3.1 List of Videos For AIGVA Experiment

Link | Name | Frame Count
https://www.youtube.com/watch?v=RWiQm9Quz0s | Cats And Dogs React To Farts Compilation | 1752
https://www.youtube.com/watch?v=RN36RzSjWNw | Funny Cats And Kittens Meowing Compilation 2015 | 1305
https://www.youtube.com/watch?v=Bejo5TLEpg4 | Cute Kittens and Funny Cats Compilation — More Yawns, More smiles | 1784
https://www.youtube.com/watch?v=kMhw5MFYU0s | Dogs Who Fail At Being Dogs | 1397
https://www.youtube.com/watch?v=q5chhdaNQ88 | Funny Dogs Playing With Toys - Funny Dog Videos 2017 | 1227
https://www.youtube.com/watch?v=teQ299N-tj8 | Funny Dogs Compilation 2018 - Dog is Sad About Going to The Vet | 2010
https://www.youtube.com/watch?v=xk4LFFm1zAA | ULTIMATE Planes for children — jet planes, bi-planes, seaplanes, airplanes for kids | 1630
https://www.youtube.com/watch?v=RoKeSWzZAwA | The Wonderful World of Flying (HD) - Wolfe Air Reel by 3DF | 1563
https://www.youtube.com/watch?v=028eol-V74g | Airplanes Take Off and Landing Compilation Video (Airplanes Takeoffs) | 1431
https://www.youtube.com/watch?v=iK6bx2NMyTo | Top 10 Best Crossovers for the Money — Best Affordable CUVs | 1474
https://www.youtube.com/watch?v= BL1y6v974Y | The 7 Greatest Cars You Can Buy On A Seriously Tight Budget | 1856
https://www.youtube.com/watch?v=uJDmjjS0hyc | Best small family cars for 2019/2020 — carwow Top 10 | 1314

Table C.1: List of YouTube Videos Used

C.3.2 Model Used For AIGVA Experiment

An image classification model was trained on the output of each labelling technique using the fast.ai PyTorch framework. The classifier used was a ResNet34 model pre-trained on ImageNet with the main layers frozen so it was possible to fine-tune it to the training data. The code for running the classifier and obtaining results can be found below.

Listing C.2: ResNet34 Model Used For Experimentation

from fastai import *
from fastai.vision import *

classes = ['airplane', 'automobile', 'cat']
path = Path('empirical_study/')

data = ImageDataBunch.from_folder(path, train="aigva_output", valid="cifar10_3classes",
                                  ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=[accuracy, f1_score])

learn.fit_one_cycle(1)

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

C.3.3 Confusion Matrices for AIGVA Experimentation

Figure C.2: Confusion Matrix for Experiment 1


Figure C.3: Confusion Matrix for Experiment 2

Figure C.4: Confusion Matrix for Experiment 3


Figure C.5: Confusion Matrix for Experiment 4

Figure C.6: Confusion Matrix for Experiment 5


Figure C.7: Confusion Matrix for Experiment 6

Figure C.8: Confusion Matrix for Experiment 7


Figure C.9: Confusion Matrix for Experiment 8

Figure C.10: Confusion Matrix for Experiment 9

Appendix D

Supporting Code

D.1 Web Application

D.1.1 Video Frame Player

React Video Component

Listing D.1: Custom-Built Video Frame Player Full (Located in: frontend/src/components/VideoFramePlayer/index.js)

1 class VideoFramePlayer extends Component{

2 constructor(props){

3 super(props)

4

5 this.vidRef= React.createRef();

6 this.state={

7 video: this.props.source,

8 fps: this.props.fps,

9 duration: 0,

10 time: 0,

11 frame: 0,

12 progress: 0,

13 frames_to_skip: 1,

14 can_play: false,

15 }

16 }


17

18 startUpdatingFrames(){

19 // Check that no other intervals exist to prevent multiple listeners

20 if(!this.timerID){

21 this.timerID= setInterval(() => this.updateFrames(), INTERVAL_MS);

22 }

23 }

24

25 stopUpdatingFrames(){

26 clearInterval(this.timerID);

27 }

28

29 componentDidMount(){

30 // Initialize event listener to updateUI on frame

31 this.updateFrames();

32 this.vidRef.current.addEventListener(’canplay’, this. handleCanPlayEvent)

33 }

34

35 componentWillUnmount(){

36 // Kill event listener to prevent memory leaks on component dismount

37 this.stopUpdatingFrames();

38 this.vidRef.current.removeEventListener(’canplay’, this. handleCanPlayEvent)

39 }

40

41 handleCanPlayEvent = () =>{

42 this.setState({can_play: true})

43 }

44

45 updateFrames(){

46 const current_frame= Math.floor(this.vidRef.current. currentTime.toFixed(5) * this.state.fps);

47 const current_time= this.vidRef.current.currentTime


48 const current_progress= current_time/this.vidRef.current. duration

49 this.setState({

50 frame: current_frame,

51 time: current_time,

52 progress: current_progress,

53 })

54 this.props.sendFrames(current_frame);

55 }

56

57 onPlay = () =>{

58 this.startUpdatingFrames();

59 this.vidRef.current.play();

60 }

61

62 onPause = () =>{

63 this.vidRef.current.pause();

64 this.stopUpdatingFrames();

65 }

66

67 onSliderChange=(value) => {

68 this.setFrameToTimestamp(value)

69 this.updateFrames();

70 }

71

72 setFrameToTimestamp=(timestamp) => {

73 this.vidRef.current.currentTime= parseFloat(timestamp)

74 }

75

76 frameBackward=(frames) => {

77 const time_minus_a_frame= this.vidRef.current.currentTime- (1/this.state.fps)*frames

78 this.setFrameToTimestamp(time_minus_a_frame)

79 this.updateFrames();

80 }

81

82 frameForward=(frames) => {


83 const time_plus_a_frame= this.vidRef.current.currentTime+ (1/this.state.fps)*frames

84 this.setFrameToTimestamp(time_plus_a_frame)

85 this.updateFrames();

86 }

87

88 labelBackward = () =>{

89 const frame_of_prev_label= this.props.getFrameOfPrevLabel()

90 this.skipToFrameNumber(frame_of_prev_label)

91

92 }

93

94 labelForward = () =>{

95 const frame_of_next_label= this.props.getFrameOfNextLabel()

96 this.skipToFrameNumber(frame_of_next_label)

97 }

98

99 secondsToMinutesSecondsString=(seconds) => {

100 const mins= Math.floor(seconds/60);

101 let secs= seconds%60;

102 if(secs < 10){

103 secs=’0’+ secs

104 }

105 return mins+’:’+ secs;

106 }

107

108 skipToFrameNumber=(frame_number) => {

109 const new_time= frame_number/this.state.fps;

110 this.setFrameToTimestamp(new_time)

111 this.updateFrames()

112 }

113

114 handleFrameSkipChange=(frames) => {

115 this.setState({frames_to_skip: frames})

116 }

117

118 handleKeyPress= key =>{


119 if(key ===’right’){

120 this.frameForward(this.state.frames_to_skip)

121 }

122 if(key ===’left’){

123 this.frameBackward(this.state.frames_to_skip)

124 }

125 if(key ===’up’){

126 this.labelForward()

127 }

128 if(key ===’down’){

129 this.labelBackward()

130 }

131 }

132 render(){

133 //JSXTORENDERPLAYERGOESHERE

134 }

135 }

136

137 export default VideoFramePlayer;

Core Data API Video Processing

Listing D.2: Video Processing API (Located in: backend/app/routes_videos.py)

@videos_api.route("/api/process_video/", methods=["GET"])
def process_video():
    """
    Processes a video and returns json dump with stats.
    """

    file_location = request.args.get("file_location")
    default_label = request.args.get("default_label")

    cap = cv2.VideoCapture(file_location)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()

    frames = []
    for i in range(frame_count):
        frame = {
            "Confidence": "0.00",
            "Frame": 'frame' + str(i) + '.jpg',
            "Label": default_label
        }
        frames.append(frame)

    return json.dumps({
        'fps': fps,
        'frame_count': frame_count,
        'labels': frames,
    })


D.1.2 Export Tool

Listing D.3: Full Download API and Celery Task (Located in: backend/app/app.py)

1 @app.route("/api/full_download/", methods = ["GET"])

2 def full_download():

3 """

4 Fetches information from db and then initiatesa celery full download task

5 """

6

7 pid = request.args.get("pid")

8 skip_factor = request.args.get("frames_to_skip")

9

10 label_types = []

11 file_locations = []

12 label_jsons = []

13

14 # 1. Get label types fromDB

15 query = Project.query.filter_by(id=pid).first().label_types

16 for label_type in query:

17 label_types.append(label_type[’label’])

18 print(’label_types fetched:’+ str(label_types))

19

20 # 2. Get file locations and label_jsons fromDB

21 query = Label.query.filter_by(pid=pid)

22 for label in query:

23 file_locations.append(label.video.location)

24 label_jsons.append(label.json)

25 print(’file_names and labels fetched:’+ str(file_locations))

26

27 # 3. Compute and store in folders

28 task = full_download_to_zip.apply_async(args=[file_locations, label_jsons, label_types, skip_factor])

29

30 return jsonify({’Location’: url_for(’taskstatus’,task_id=task.id) })


31

32 @celery.task(bind=True)

33 def full_download_to_zip(self, video_links, label_jsons, classes, skip_factor):

34 """

35 Downloads all videos into folders seperated by their labels.

36

37 Args:

38 video_links[str array]: array of all links to videos

39 label_jsons[json array]: array of all json arrays with labels

40 classes[str array]: list of unique label classes

41 skip_factor[int]: number of frames to skip when exporting

42

43 Returns:

44 celery_status_obj[dict]: status of celery task

45 """

46 db.engine.dispose()

47

48 print(’[Full Download] Starting full download for:’+ str( video_links))

49 # 1: Remove any preexisting.zip files

50 zip_list = glob.glob(’./*.zip’)

51 for zip_path in zip_list:

52 os.remove(zip_path)

53

54 # 2: Make an export directory to store frames within

55 shutil.rmtree(’exports’, ignore_errors=True)

56 os. mkdir (’exports’)

57 print(’[Full Download] Making directories for the following classes:’+ str(classes))

58 for item in classes:

59 os. mkdir (’exports/’ + item)

60

61 fori in range(len(video_links)):

62 self.update_state(state=’PROGRESS’,

63 meta ={’current’: i + 1,’total’: len(


video_links) + 2,

64 ’status’:’Extracing frames from:’+ str(os.path.basename(video_links[i])) +’...’})

65 export_video_to_folder(video_links[i], label_jsons[i], int( skip_factor))

66

67 print(’[Full Download] Succesfully downloaded and process files!’ )

68

69 # 3: Save frame and upload to S3

70 timestr = time.strftime("%Y%m%d-%H%M%S")

71 output_name =’aigva_full_download_’ + timestr

72 shutil.make_archive(output_name,’zip’,’exports’)

73 shutil.rmtree(’exports’, ignore_errors=True)

74 filename = output_name +’.zip’

75 size = humanize(os.path.getsize(filename))

76 self.update_state(state=’PROGRESS’,

77 meta ={’current’: len(video_links) + 1,’ total’: len(video_links) + 2,

78 ’status’:’Preparing’+ str(size) + ’ to be downloaded...’})

79 s3_link = upload_file_to_s3(filename,’exports’)

80

81 return{’current’: len(video_links) + 2,’total’: len(video_links ) + 2,’status’:’Export completed!’,

82 ’result’: s3_link}


D.2 Label Prediction

D.2.1 Cold Label

Listing D.4: Cold Label Image Extraction API (Located in: Backend/app/app.py)

1 @celery.task(bind=True)

2 def cold_generate_training_data_worker(self, pid, images_per_video, smart_guess, HAMMING_THRESHOLD=8):

3 """

4 Extracts images_per_video number of images for all videos associated witha project

5

6 Args:

7 pid[int]: project id

8 images_per_video[int]: number of images to extract per video

9

10 Returns:

11 celery_status_obj[dict]: status of celery task

12 """

13

14 db.engine.dispose()

15

16 print(’cold_generate_training_data_worker’)

17 self.update_state(state=’PROGRESS’, meta={’current’: 0,’total’: 1,’status’:’Fetching videos from database’})

18

19 # 1: Get associated videos fromDB

20 video_links=[]

21 videos = Video.query.filter_by(pid=pid)

22 for video in videos:

23 video_links.append(video.location)

24

25 # 2: For each video extract images_per_video# of images and upload to S3

26 s3_links = []

27 dirname =’coldlabel’+ str(pid)


28 shutil.rmtree(dirname, ignore_errors=True)

29 os.mkdir(dirname)

30 count = 0

31 if smart_guess:

32 for video_link in video_links:

33 cap = cv2.VideoCapture(video_link)

34 video_name = video_link.split("/")[-1].split(".")[0]

35 frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

36 fps = int(cap.get(cv2.CAP_PROP_FPS))

37

38 last_hash = None

39 while(cap.isOpened()):

40 frameId = cap.get(1)

41 self.update_state(state=’PROGRESS’, meta={’current’: count + 1*(frameId/frame_count),’total’: len(video_links),’ status’:’Video’+ str(count + 1) +’/’+ str(len(video_links)) + ’: Downloading and extracing images from’ + video_name})

42 print("Hashing:"+ str(frameId) +’/’+ str( frame_count))

43 ret, frame = cap.read()

44 if (ret != True):

45 break

46 cap .set(1, frameId + fps)

47

48 #Perform hashing

49 current_hash = cv2.img_hash.pHash(frame)

50

51 # Base case

52 if(int(frameId) == 0):

53 last_img = frame

54 else:

55 hamming_distance = np.count_nonzero(current_hash !=last_hash)

56 if hamming_distance >= HAMMING_THRESHOLD:

57 last_hash = current_hash

58 loc = dirname +’/’+ str(video_name) +" _frame"+ str(int(frameId)) +".jpg"


59 print(’Saving image:’+ str(loc))

60 cv2.imwrite(loc, frame)

61 s3_link = upload_file_to_s3(loc,’ coldlabel_training_images’)

62 s3_links.append(s3_link)

63 print(s3_link)

64 os.remove(loc)

65 cap.release()

66 count += 1

67 else:

68 for video_link in video_links:

69 cap = cv2.VideoCapture(video_link)

70 video_name = video_link.split("/")[-1].split(".")[0]

71 frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

72 skip_factor = frame_count/(images_per_video-1)

73

74 while(cap.isOpened()):

75 frameId = cap.get(1)

76 self.update_state(state=’PROGRESS’, meta={’current’: count + 1*(frameId/frame_count),’total’: len(video_links),’ status’:’Video’+ str(count + 1) +’/’+ str(len(video_links)) + ’: Downloading and extracing images from’ + video_name})

77 print(str(frameId) +’/’+ str(frame_count))

78 ret, frame = cap.read()

79 if (ret != True):

80 break

81 cap .set(1, frameId + skip_factor)

82 loc = dirname +’/’+ str(video_name) +"_frame"+ str(int(frameId)) +".jpg"

83 cv2.imwrite(loc, frame)

84 s3_link = upload_file_to_s3(loc,’ coldlabel_training_images’)

85 s3_links.append(s3_link)

86 print(s3_link)

87 os.remove(loc)

88

89 cap.release()


90 count += 1

91

92 # 3: Return array of S3 links asa result and remove dir

93 shutil.rmtree(dirname, ignore_errors=True)

94

95 return{’current’: 100,’total’: 100,’status’:’Predicted!’,

96 ’result’: s3_links}

Listing D.5: Cold Label Model Training (Located in: Backend/app/app.py)

1 @celery.task(bind=True)

2 def cold_train_model_worker(self, pid, image_links, labels, input_cycles, input_split, input_early_stopping=False):

3 """

4 Trains model givena set of.zip file of images organized by their labels

5

6 Args:

7 pid[int]: project id

8 image_links[str array]: list of image links

9 labels[str array]: list of labels corresponding to image links

10 input_cycles[int]: number of cycles to train for

11 input_split[float]: train test split as pct test

12 input_early_stopping[bool]: enable/disable early stopping

13

14 Returns:

15 celery_status_obj[dict]: status of celery task

16 """

17 db.engine.dispose()

18

19 # 1: Download images and put in folders seperated by their labels

20 self.update_state(state=’PROGRESS’, meta={’current’: 0,’total’: 1,’status’:’Preparing folder structure’})

21 dirname =’coldlabel’+ str(pid)

22 shutil.rmtree(dirname, ignore_errors=True)

23 os.mkdir(dirname)

24 for label in set(labels):


25 os.mkdir(dirname +’/’ + label)

26

27 fori in range(len(image_links)):

28 filename = labels[i] +’_’+ str(i) +’.jpg’

29 path = dirname +’/’ + labels[i] +’/’ + filename

30 print(’Downloading’+ str(i + 1) +’/’+ str(len(image_links )) +’:’+ str(filename))

31 urllib.request.urlretrieve(image_links[i], path)

32 self.update_state(state=’PROGRESS’, meta={’current’: (i + 1) * 0.3 ,’total’: len(image_links),’status’:’Downloading’+ str(i + 1) +’/’+ str(len(image_links)) +’:’+ str(filename)})

33

34 num_images = sum([len(files) for r, d, files in os.walk(dirname) ])

35 print(’num_images’+ str(num_images))

36

37 # 2: Define model hyperparams

38 MODEL = models.resnet18

39 MODEL_NAME =’resnet18’

40 TRAIN_CYCLES = input_cycles

41 IMG_SIZE =128

42 VALID_PCT = input_split

43 BATCH_SIZE = 16

44

45 self.update_state(state=’PROGRESS’, meta={’current’: 0.06,’total ’: 1,’status’:’Initializing’ + MODEL_NAME})

46

47 # 3: Model hyperparam calculations

48 BATCH_SIZE = int((1 - VALID_PCT) * BATCH_SIZE)

49

50 #4: Initialize model

51 self.update_state(state=’PROGRESS’, meta={’current’: 0.08,’total ’: 1,’status’:’Normalizing data using ImageNet stats’})

52 data = ImageDataBunch.from_folder(dirname, train=".", valid_pct= VALID_PCT,

53 ds_tfms=get_transforms(), size=IMG_SIZE, num_workers=0, bs= BATCH_SIZE).normalize(imagenet_stats)


54 list_files(’./’)

55 #os.mkdir(’/nonexistent’)

56 learner = cnn_learner(data, MODEL, metrics=[error_rate])

57

58 #5: Train model

59 self.update_state(state=’PROGRESS’, meta={’current’: (0) * 0.9, ’total’:TRAIN_CYCLES,’status’:"Epoch:0"+’/’+ str( TRAIN_CYCLES) +"| Training..."})

60 test_accuracy = 0

61 test_accuracies = []

62 train_accuracy = 0

63 train_accuracies = []

64 for epoch in range(TRAIN_CYCLES):

65 learner.fit_one_cycle(1)

66 self.update_state(state=’PROGRESS’, meta={’current’: (epoch + 0.5 ) * 0.9 ,’total’:TRAIN_CYCLES,’status’:"Epoch:"+ str( epoch + 1) +’/’+ str(TRAIN_CYCLES) +"| Validating..."})

67 log_preds = learner.model.eval()

68 prob_test, actual_test = learner.get_preds(ds_type= DatasetType.Valid)

69 prob_train, actual_train = learner.get_preds(ds_type= DatasetType.Train)

70 prev_test_accuracy = test_accuracy

71 test_accuracy = round(1 - error_rate(prob_test, actual_test) . item () ,4)

72 train_accuracy = round(1 - error_rate(prob_train, actual_train).item(),4)

73 test_accuracies.append(test_accuracy)

74 train_accuracies.append(train_accuracy)

75 # Break if early stopping is enabled and test accuracy is lower than prev

76 if ((input_early_stopping==True) and (test_accuracy < prev_test_accuracy)):

77 print(’Early stopping’)

78 self.update_state(state=’PROGRESS’, meta={’current’: +( epoch + 1 ) * 0.9,’total’:TRAIN_CYCLES,’status’:"[EARLY STOPPINGACTIVATED] Epoch:"+ str(epoch + 1) +’/’+ str(


TRAIN_CYCLES) +"| Test Accuracy:"+ str(test_accuracy*100) +’ %’+"| Train Accuracy:"+ str(train_accuracy*100) +’%’,’ accuracies’: json.dumps({’test_accuracies’: test_accuracies,’ train_accuracies’: train_accuracies})})

79 break

80 print("Test Accuracy:"+ str(test_accuracy*100) +’%’)

81 self.update_state(state=’PROGRESS’, meta={’current’: +( epoch + 1 ) * 0.9,’total’:TRAIN_CYCLES,’status’:"Epoch:"+ str(epoch + 1) +’/’+ str(TRAIN_CYCLES) +"| Test Accuracy:"+ str(round((test_accuracy*100),2)) +’%’+"| Train Accuracy:"+ str(round((train_accuracy*100),2)) +’%’,’accuracies’: json. dumps ({’test_accuracies’: test_accuracies,’train_accuracies’: train_accuracies})})

82

83 # 6: Export model, upload to S3

84 model_name = str(MODEL_NAME) +’_’+ str(TRAIN_CYCLES) +’cycles_ ’+ str(IMG_SIZE) +’imgsize_’+ str(num_images) +’noimgs_’+ str (VALID_PCT) +’valpct’’.pkl’

85 self.update_state(state=’PROGRESS’, meta={’current’: 0.95,’total ’: 1,’status’:’Exporting trained model as’ + model_name,’ accuracies’: json.dumps({’test_accuracies’: test_accuracies,’ train_accuracies’: train_accuracies})})

86 learner.export(model_name)

87 print(’Uploading to S3’)

88 s3_link = upload_file_to_s3(dirname +’/’ + model_name,’ blitzlabel_models’)

89 print(’S3 Link:’+ str(s3_link))

90

91 #7: Add model to PostgresDB

92 model_obj = Model(date = datetime.datetime.now(), name = model_name, location = s3_link, accuracy=float(test_accuracy*100), num_images=num_images, type=’cold’ , pid = pid, file_size = humanize(os.path.getsize(dirname +’/’ + model_name)))

93 db.session.add(model_obj)

94 db.session.commit()

95

96 #8: Delete model


97 shutil.rmtree(dirname, ignore_errors=True)

98

99 return{’current’: 1,’total’: 1,’status’: model_name +’ ready for prediction’,’result’:’Done!’}


D.2.2 Blitz Label

Listing D.6: Blitz Label Model Training (Located in: Backend/app/app.py)

1 @celery.task(bind=True)

2 def blitz_train_model_worker(self, pid, zip_link, input_cycles, input_split, input_early_stopping=False):

3 """

4 Trains model givena set of.zip file of images organized by their labels

5

6 Args:

7 pid[int]: project id

8 zip_link[str]: link to zip file with images organized by labels in subfolders

9 input_cycles[int]: number of cycles to train for

10 input_split[float]: train test split as pct test

11 input_early_stopping[bool]: enable/disable early stopping

12

13 Returns:

14 celery_status_obj[dict]: status of celery task

15 """

16 db.engine.dispose()

17

18 # 1: Download images and put in folders seperated by their labels

19 self.update_state(state=’PROGRESS’, meta={’current’: 0,’total’: 1,’status’:’Preparing folder structure’})

20 dirname =’blitzlabel’+ str(pid)

21 shutil.rmtree(dirname, ignore_errors=True)

22 os.mkdir(dirname)

23 print(’Downloading’+ str(zip_link))

24 self.update_state(state=’PROGRESS’, meta={’current’: 0.04,’total ’: 1,’status’:’Computing hyperparameters’})

25

26 urllib.request.urlretrieve(zip_link, dirname +’/data.zip’)

27 shutil.unpack_archive(dirname +’/data.zip’, dirname)

28 num_images = sum([len(files) for r, d, files in os.walk(dirname)


])

29 print(’num_images’+ str(num_images))

30

31 # 2: Define model hyperparams

32 MODEL = models.resnet18

33 MODEL_NAME =’resnet18’

34 TRAIN_CYCLES = input_cycles

35 IMG_SIZE =128

36 VALID_PCT = input_split

37 BATCH_SIZE = 16

38

39 self.update_state(state=’PROGRESS’, meta={’current’: 0.06,’total ’: 1,’status’:’Initializing’ + MODEL_NAME})

40

41 # 3: Model hyperparam calculations

42 BATCH_SIZE = int((1 - VALID_PCT) * BATCH_SIZE)

43

44 #4: Initialize model

45 self.update_state(state=’PROGRESS’, meta={’current’: 0.08,’total ’: 1,’status’:’Normalizing data using ImageNet stats’})

46 data = ImageDataBunch.from_folder(dirname, train=".", valid_pct= VALID_PCT,

47 ds_tfms=get_transforms(), size=IMG_SIZE, num_workers=0, bs= BATCH_SIZE).normalize(imagenet_stats)

48 list_files(’./’)

49 #os.mkdir(’/nonexistent’)

50 learner = cnn_learner(data, MODEL, metrics=[error_rate])

51

52 #5: Train model

53 self.update_state(state=’PROGRESS’, meta={’current’: (0) * 0.9, ’total’:TRAIN_CYCLES,’status’:"Epoch:0"+’/’+ str( TRAIN_CYCLES) +"| Training..."})

54 test_accuracy = 0

55 test_accuracies = []

56 train_accuracy = 0

57 train_accuracies = []

58 for epoch in range(TRAIN_CYCLES):


59 learner.fit_one_cycle(1)

60 self.update_state(state=’PROGRESS’, meta={’current’: (epoch + 0.5 ) * 0.9 ,’total’:TRAIN_CYCLES,’status’:"Epoch:"+ str( epoch + 1) +’/’+ str(TRAIN_CYCLES) +"| Validating..."})

61 log_preds = learner.model.eval()

62 prob_test, actual_test = learner.get_preds(ds_type= DatasetType.Valid)

63 prob_train, actual_train = learner.get_preds(ds_type= DatasetType.Train)

64 prev_test_accuracy = test_accuracy

65 test_accuracy = round(1 - error_rate(prob_test, actual_test) . item () ,4)

66 train_accuracy = round(1 - error_rate(prob_train, actual_train).item(),4)

67 test_accuracies.append(test_accuracy)

68 train_accuracies.append(train_accuracy)

69 # Break if early stopping is enabled and test accuracy is lower than prev

70 if ((input_early_stopping==True) and (test_accuracy < prev_test_accuracy)):

71 print(’Early stopping’)

72 self.update_state(state=’PROGRESS’, meta={’current’: +( epoch + 1 ) * 0.9,’total’:TRAIN_CYCLES,’status’:"[EARLY STOPPINGACTIVATED] Epoch:"+ str(epoch + 1) +’/’+ str( TRAIN_CYCLES) +"| Test Accuracy:"+ str(test_accuracy*100) +’ %’+"| Train Accuracy:"+ str(train_accuracy*100) +’%’,’ accuracies’: json.dumps({’test_accuracies’: test_accuracies,’ train_accuracies’: train_accuracies})})

73 break

74 print("Test Accuracy:"+ str(test_accuracy*100) +’%’)

75 self.update_state(state=’PROGRESS’, meta={’current’: +( epoch + 1 ) * 0.9,’total’:TRAIN_CYCLES,’status’:"Epoch:"+ str(epoch + 1) +’/’+ str(TRAIN_CYCLES) +"| Test Accuracy:"+ str(round((test_accuracy*100),2)) +’%’+"| Train Accuracy:"+ str(round((train_accuracy*100),2)) +’%’,’accuracies’: json. dumps ({’test_accuracies’: test_accuracies,’train_accuracies’: train_accuracies})})


76

77 # 6: Export model, upload to S3

78 model_name = str(MODEL_NAME) +’_’+ str(TRAIN_CYCLES) +’cycles_ ’+ str(IMG_SIZE) +’imgsize_’+ str(num_images) +’noimgs_’+ str (VALID_PCT) +’valpct’’.pkl’

79 self.update_state(state=’PROGRESS’, meta={’current’: 0.95,’total ’: 1,’status’:’Exporting trained model as’ + model_name,’ accuracies’: json.dumps({’test_accuracies’: test_accuracies,’ train_accuracies’: train_accuracies})})

80 learner.export(model_name)

81 print(’Uploading to S3’)

82 s3_link = upload_file_to_s3(dirname +’/’ + model_name,’ blitzlabel_models’)

83 print(’S3 Link:’+ str(s3_link))

84

85 #7: Add model to PostgresDB

86 model_obj = Model(date = datetime.datetime.now(), name = model_name, location = s3_link, accuracy=float(test_accuracy*100), num_images=num_images, type=’blitz’ , pid = pid, file_size = humanize(os.path.getsize(dirname +’/’ + model_name)))

87 db.session.add(model_obj)

88 db.session.commit()

89

90 #8: Delete model

91 shutil.rmtree(dirname, ignore_errors=True)

92

93 return{’current’: 1,’total’: 1,’status’: model_name +’ ready for prediction’,’result’:’Done!’}


D.2.3 Custom Model

Listing D.7: Custom Model Template - Example (Located in: Backend/app/custommodel.py)

def custom_model_predict(img_path):
    """
    This is a custom interface for using your own custom models.

    Called from custom_model_predict_label_and_update_db in app.py

    Args:
        img_path: filepath to the image that needs to be predicted.

    Returns:
        prediction[str]: label predicted by your model
        confidence[float]: confidence that the label is correct in range 0-1
    """

    # 1: Load learner
    learner = load_learner('./', './models/ultrasound_model.pkl')

    # 2: Open image and predict
    img = open_image(img_path)
    pred = learner.predict(img)

    # 3: Parse predictions into correct output format
    prediction = str(pred[0])
    confidence = torch.max(pred[2]).item()

    return prediction, confidence
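The template above loads a fastai learner, but any framework can be plugged in as long as the function keeps the same signature and return format. The following sketch shows one possible alternative using a plain PyTorch model; the checkpoint path, class names and preprocessing are illustrative assumptions and not part of AIGVA:

import torch
from PIL import Image
from torchvision import transforms

# Illustrative values only: replace with your own checkpoint and label set
MODEL_PATH = './models/my_torchscript_model.pt'
CLASSES = ['head', 'abdomen', 'femur', 'background']

def custom_model_predict(img_path):
    """Same interface as the AIGVA template: returns (label, confidence)."""
    model = torch.jit.load(MODEL_PATH).eval()
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    img = preprocess(Image.open(img_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1)[0]
    confidence, idx = torch.max(probs, dim=0)
    return CLASSES[idx.item()], confidence.item()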


D.2.4 Prediction Engine

Listing D.8: Label Prediction API and Celery Task (Located in: Backend/app/app.py)

@celery.task(bind=True)
def downloaded_model_predict_label_and_update_db(self, lid, skip_factor, video_link, model, MIN_FRAMES=10):
    """
    Predicts all labels in a video using a downloaded model and updates the database.

    Args:
        lid[int]: label id
        skip_factor[int]: number of frames to skip when predicting
        video_link[str]: link to video
        model[str]: link to model
        MIN_FRAMES[int]: optimization param (min number of adjacent frames a label can have)

    Returns:
        celery_status_obj[dict]: status of celery task
    """
    db.engine.dispose()

    # 1: Download and open video stream
    self.update_state(state='PROGRESS', meta={'current': 0, 'total': 1, 'status': 'Downloading: ' + str(video_link)})
    video_name = video_link.split("/")[-1].split(".")[0]
    count = 0
    cap = cv2.VideoCapture(video_link)

    # 2: Download learner model
    self.update_state(state='PROGRESS', meta={'current': 0, 'total': 1, 'status': 'Preparing ML model: ' + str(model['name'])})
    downloaded_model = datetime.datetime.now().strftime("%Y%m%d-%H%M%S%f") + model['name']
    urllib.request.urlretrieve(model['location'], downloaded_model)
    learner = load_learner('./', downloaded_model)

    # 3: Prepare folder structure and get frame count
    shutil.rmtree(video_name, ignore_errors=True)
    os.mkdir(video_name)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    output = []

    # 4: Loop through video frame by frame
    while cap.isOpened():
        frameId = cap.get(1)
        ret, frame = cap.read()
        # 4.1: Exit when video has no more frames
        if not ret:
            break
        # 4.2: Only predict every skip_factor frames
        if frameId % math.floor(skip_factor) == 0:
            self.update_state(state='PROGRESS',
                              meta={'current': count, 'total': frame_count,
                                    'status': 'Predicting Frame ' + str(count) + '/' + str(frame_count)})
            # 4.2.1: Create image and open it
            loc = "./" + video_name + "/frame.jpg"
            cv2.imwrite(loc, frame)
            img = open_image(loc)

            # 4.2.2: Get prediction
            pred = learner.predict(img)

            # 4.2.3: Prepare write
            tmp = []
            filename = 'frame' + str(count) + '.jpg'
            prediction = str(pred[0])
            confidence = torch.max(pred[2]).item()
            tmp.append(confidence)
            tmp.append(filename)
            tmp.append(prediction)
            output.append(tmp)
            print(str(count) + '/' + str(frame_count))

            # 4.2.4: Remove image
            os.remove(loc)
            count += 1 * skip_factor

    cap.release()
    shutil.rmtree(video_name, ignore_errors=True)
    os.remove(downloaded_model)

    self.update_state(state='PROGRESS',
                      meta={'current': count, 'total': frame_count,
                            'status': 'Uploading ' + str(frame_count) + ' labels to database'})

    # 5: Run optimiser which only allows a label change if the new label is consistently predicted for MIN_FRAMES frames
    prev_label = output[0][2]
    for i in range(len(output)):
        current_label = output[i][2]
        if current_label != prev_label:
            next_change = -1
            try:
                for j in range(1, MIN_FRAMES + 1):
                    next_label = output[i + j][2]
                    if next_label != current_label:
                        next_change = j
                if next_change != -1:
                    output[i][2] = prev_label
            except:  # reached the end of the output list
                output[i][2] = prev_label
        prev_label = output[i][2]

    # 6: Create frames array
    frames = []
    for i in range(len(output)):
        frame = {
            "Confidence": output[i][0],
            "Frame": output[i][1],
            "Label": output[i][2]
        }
        frames.append(frame)

    # 7: Commit labels to database
    query = Label.query.filter_by(id=lid).first()
    query.json = frames
    query.mid = model['id']
    query.reviewed = False
    query.date = datetime.datetime.now()
    db.session.commit()

    return {'current': 100, 'total': 100, 'status': 'Predicted!',
            'result': 'Success!'}


@app.route("/api/predict_label/", methods=["GET"])
def predict_label():
    """
    Initiates a label prediction celery task
    """
    lid = request.args.get("lid")
    skip_factor = request.args.get("frames_to_skip")
    video_link = request.args.get("video_link")

    task = custom_model_predict_label_and_update_db.apply_async(args=[lid, int(skip_factor), video_link])

    return jsonify({'Location': url_for('taskstatus', task_id=task.id)})
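For reference, a client can trigger a prediction and then follow the returned status location. The sketch below uses the requests library; the host, port and query values are illustrative assumptions, and the shape of the status response depends on the taskstatus route defined elsewhere in app.py:

import time
import requests

BASE = 'http://localhost:5000'  # assumed local deployment of the backend

# Kick off a prediction task for label id 1, predicting every 5th frame (illustrative values)
resp = requests.get(BASE + '/api/predict_label/', params={
    'lid': 1,
    'frames_to_skip': 5,
    'video_link': 'https://example-bucket.s3.amazonaws.com/scan.mp4',
})
status_url = BASE + resp.json()['Location']

# Poll the task status endpoint a few times; the JSON mirrors the meta dict set via update_state()
for _ in range(10):
    print(requests.get(status_url).json())
    time.sleep(2)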


D.3 DevOps

D.3.1 Docker

Listing D.9: Dockerfile (Located in: Backend/)

FROM python:3.7

RUN apt-get update \
    && apt-get install -y \
    build-essential \
    cmake \
    git \
    wget \
    unzip \
    yasm \
    pkg-config \
    libswscale-dev \
    libtbb2 \
    libtbb-dev \
    libjpeg-dev \
    libpng-dev \
    libtiff-dev \
    libavformat-dev \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# numpy is required to build OpenCV's Python bindings below
RUN pip install numpy

WORKDIR /
ENV OPENCV_VERSION="4.1.0"
RUN wget https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip \
    && unzip ${OPENCV_VERSION}.zip \
    && mkdir /opencv-${OPENCV_VERSION}/cmake_binary \
    && cd /opencv-${OPENCV_VERSION}/cmake_binary \
    && cmake -DBUILD_TIFF=ON \
    -DBUILD_opencv_java=OFF \
    -DWITH_CUDA=OFF \
    -DWITH_OPENGL=ON \
    -DWITH_OPENCL=ON \
    -DWITH_IPP=ON \
    -DWITH_TBB=ON \
    -DWITH_EIGEN=ON \
    -DWITH_V4L=ON \
    -DWITH_FFMPEG=1 \
    -DBUILD_TESTS=OFF \
    -DBUILD_PERF_TESTS=OFF \
    -DCMAKE_BUILD_TYPE=RELEASE \
    -DCMAKE_INSTALL_PREFIX=$(python3.7 -c "import sys; print(sys.prefix)") \
    -DPYTHON_EXECUTABLE=$(which python3.7) \
    -DPYTHON_INCLUDE_DIR=$(python3.7 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
    -DPYTHON_PACKAGES_PATH=$(python3.7 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
    .. \
    && make install \
    && rm /${OPENCV_VERSION}.zip \
    && rm -r /opencv-${OPENCV_VERSION}
RUN ln -s \
    /usr/local/python/cv2/python-3.7/cv2.cpython-37m-x86_64-linux-gnu.so \
    /usr/local/lib/python3.7/site-packages/cv2.so

EXPOSE 5000

ADD requirements.txt /app/requirements.txt
WORKDIR /app/

RUN pip install -r requirements.txt --no-cache-dir
RUN pip install gunicorn

RUN adduser --disabled-password --gecos '' app
RUN chmod 777 /app/
RUN chmod -R 777 ../tmp/
RUN mkdir /nonexistent
RUN chmod -R 777 ../nonexistent/
RUN ls
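For completeness, a typical way to build and run this image might look as follows; the image tag and the gunicorn entry point are illustrative assumptions rather than commands taken from the repository:

# Build the backend image from the Backend/ directory
docker build -t aigva-backend .

# Run the container, exposing the API on port 5000 (the entry point shown is an assumed example)
docker run -p 5000:5000 aigva-backend gunicorn -b 0.0.0.0:5000 app:app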


D.4 File Storage and Database

D.4.1 Database Schema

Listing D.10: Database Model (Located in: Backend/app/app.py)

class Project(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.Unicode)
    desc = db.Column(db.Unicode)
    main_label = db.Column(db.Unicode)
    label_types = db.Column(db.JSON)
    date = db.Column(db.DateTime)
    videos = db.relationship("Video", backref=db.backref('project'))
    models = db.relationship("Model", backref=db.backref('project'))
    labels = db.relationship("Label", backref=db.backref('project'))

class Video(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.Unicode)
    location = db.Column(db.Unicode)
    fps = db.Column(db.Float)
    size = db.Column(db.Integer)
    pid = db.Column(db.Integer, db.ForeignKey('project.id'))
    labels = db.relationship("Label", backref=db.backref('video'))

class Model(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.Unicode)
    date = db.Column(db.DateTime)
    location = db.Column(db.Unicode)
    type = db.Column(db.Unicode)
    file_size = db.Column(db.Unicode, nullable=True)
    accuracy = db.Column(db.Float, nullable=True)
    num_images = db.Column(db.Integer, nullable=True)
    pid = db.Column(db.Integer, db.ForeignKey('project.id'))
    labels = db.relationship("Label", backref=db.backref('model'))

class Label(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    date = db.Column(db.DateTime)
    json = db.Column(db.JSON)  # JSON
    frame_count = db.Column(db.Integer, nullable=True)
    reviewed = db.Column(db.Boolean)
    vid = db.Column(db.Integer, db.ForeignKey('video.id'))
    mid = db.Column(db.Integer, db.ForeignKey('model.id'), nullable=True)
    pid = db.Column(db.Integer, db.ForeignKey('project.id'))
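To illustrate how these models relate, the sketch below creates a project with a single video and an empty label record through the Flask-SQLAlchemy session. It assumes an active application context, and all field values and label names are illustrative:

import datetime

with app.app_context():
    project = Project(name='Demo Project', desc='Example project for illustration',
                      main_label='head', label_types=['head', 'abdomen', 'background'],
                      date=datetime.datetime.now())
    video = Video(name='scan_001.mp4', location='s3://example-bucket/scan_001.mp4',
                  fps=25.0, size=104857600, project=project)
    label = Label(date=datetime.datetime.now(), json=[], reviewed=False,
                  video=video, project=project)

    db.session.add_all([project, video, label])
    db.session.commit()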


Figure D.1: AIGVA Database Model
