Efficient Semantic Retrieval on K-Segment Coresets of User Videos

by Pramod Kandel

S.B., C.S. M.I.T., 2014

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

Massachusetts Institute of Technology

August 2015

All rights reserved.

Author: Signature redacted. Department of Electrical Engineering and Computer Science, August 31, 2015

Certified by: Signature redacted. Prof. Daniela Rus, CSAIL Director, Andrew (1956) & Erna Viterbi Professor, Thesis Supervisor, August 31, 2015

Certified by: Signature redacted. Guy Rosman, Postdoctoral Associate, Thesis Co-Supervisor, August 31, 2015

Accepted by: Signature redacted. Prof. Albert Meyer, Chairman, Masters of Engineering Thesis Committee


Efficient Semantic Retrieval on K-Segment Coresets of User Videos

by Pramod Kandel

Submitted to the Department of Electrical Engineering and Computer Science on August 24, 2015, in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

Every day, we collect and store various kinds of data with our modern sensors, phones, cameras, and other gadgets. One of the richest kinds of data available is video. We take numerous hours of videos with our phones and cameras and store them on our computers or in the cloud. However, because recording videos produces large files, it is hard to search for and locate specific video segments within a video library. We might need the part where "Matt was playing guitar", or we might want to see "a glimpse of John's laptop" among the hours of video data that contain those pieces. The goal of this thesis is to create a system that efficiently retrieves the relevant segments (frames) of a video by allowing users to perform textual searches based on the objects in the video, such as "guitar" or "laptop".

A major challenge with videos is the huge space required to store them, which makes it difficult to retrieve and analyze them. This thesis presents an efficient compression method, which uses k-segment mean coresets to represent the video data using fewer frames while preserving the information content of the original data set. The system then uses a state-of-the-art object detector to analyze and detect objects in the reduced data. The objects and corresponding frames are stored and cross-linked to the original data to enable retrieval. The system allows users to pose text queries about objects in the videos. It is important that the retrieval of the stored objects be as efficient and meaningful as possible. This thesis presents a retrieval algorithm, also based on the k-segment mean coreset algorithm, which allows efficient any-time retrieval of the detected objects, retrieving the "more preferred" or "more important" frames earlier. The system presents the any-time results to the user incrementally.

This thesis describes the architecture and modules of the objects retrieval system for video data. The modules include the user interface for making the search query and displaying the results, the module for video compression with coresets, the object-detection module, and the retrieval module, as well as the data flow between them. This thesis describes an implementation of this system, the algorithms used, and a suite of experiments to validate and evaluate the algorithms. The results show that, using coresets, it is possible to identify, store, and efficiently retrieve video segments by specifying the objects in the video data.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my thesis supervisor Professor Daniela Rus and my thesis co-supervisor Guy Rosman for guiding me through my Master's program. Prof. Rus has a deep understanding of myriad topics in the field of robotics and beyond, and has a knack for giving clear and understandable guidelines and comments to her students. Despite her busy schedule, I met her every week, and every meeting with her positively impacted my understanding of the subject of my thesis and the direction I should be heading. She has shown me the way in the most difficult times.

Similarly, Guy has been almost my companion throughout this process, being there to guide and support me any time I needed him. He is one of the smartest people I have met, both in the theory and applications of various topics of computer vision, and more generally robotics and computer science as a whole. I am thankful for the opportunity to be supervised by such amazing and smart people.

Secondly, my family has always been an inspiration to me. My parents provided for me and sent me to good schools, even when their own circumstances were not great. Their sacrifices in difficult times are the primary reason I am here writing this thesis. Similarly, my loving brother Saroj has always been my best friend and has made me smile and laugh when I needed it.

Thirdly, I thank my close friends who have continuously offered assistance to make my life easier during the thesis write-up process, and have always been there to encourage me.

Finally, I am thankful to each and every person I have had the opportunity to know during my lifetime. I am confident that their positivity and their stories have inspired me to be where I am today, and will keep inspiring and guiding me in the days to come.

TABLE OF CONTENTS

Chapter 1 Introduction
1.1 Goal
1.2 Challenges
1.2.1 Large Data
1.2.2 Preserving Semantics on Compression
1.2.3 Object Detection on Videos
1.2.4 Efficient, Semantic, and Preferential Retrieval
1.3 Solution Approach
1.4 Contributions
1.4.1 System
1.4.2 Algorithms
1.4.3 Experiments
Chapter 2 Related Work
2.1 Life Logging Systems/Diaries
2.2 Coreset Algorithms
2.3 Semantic Retrieval for videos
Chapter 3 Video Summarization and Retrieval: Technical Approach
3.1 Objects Retrieval System Architecture: Introduction
3.2 Web Interface
3.3 Video Processing
3.3.1 Coreset Tree Creator
3.3.2 Object Detector
3.4 Storage
3.5 Retriever
Chapter 4 Video to Coreset Tree Creation
4.1 Coresets: Introduction
4.2 K-segment Mean Coreset Tree Construction (without key frames)
4.3 Summarization: Coreset Tree with key frames
4.4 Output/Populating information to DB
Chapter 5 Object Detection of Coreset Tree Leaves
5.1 Object Recognition and Detection
5.2 Caffe
5.3 RCNN (Regions with Convolutional Neural Network Features)
5.4 Choosing Key frames for Detection
5.5 Output/Populating DB
Chapter 6 Retrieval with Text-based Search
6.1 Inputs
6.1.1 Database
6.1.2 Search Query Input
6.2 Coreset Retrieval Algorithm with Preferential Sampling
6.2.1 Preferential Coreset Sampling Algorithm Description
6.2.2 Algorithm Correctness
6.2.3 Algorithm Complexity
6.3 Alternative Retrieval Algorithms
6.3.1 Direct Database Retrieval
6.3.2 Uniform Sampling of Coreset Leaves
6.4 Comparison of various retrieval algorithms
Chapter 7 System Implementation
7.1 Frontend Web UI
7.2 Django Server
7.2.1 Retrievers
7.3 Video Processor
7.4 PostgreSQL Database
Chapter 8 Experiments
8.1 End-to-end Experiment on a Synthetic Video with Known Ground Truth Segmentation
8.1.1 Coreset Tree Creation Module
8.1.2 Object Detector
8.1.3 Database Update
8.1.4 Retrieval
8.2 Timing Experiments
8.2.1 Timing for the Videos Taken in Natural Environments
8.2.2 Timing vs. Video Length
8.2.3 Timing vs. Video Quality
8.3 Retrieval Experiments
8.3.1 Retrieval Experiments on Real (Non-Synthetic) Data
8.3.2 Retrieval Experiments on Synthetic Data
8.4 Conclusion
Chapter 9 Future Extensions
9.1 Overall System
9.2 UI
9.3 Coreset Tree Creation
9.4 Object Detection
9.5 Retrieval

LIST OF FIGURES

Figure 3-1: System Architecture of the Objects Retrieval System. The front end is the web interface to upload, search, and view the results, and the server is where the video processing (i.e., compression of data and detection of objects) as well as the retrieval algorithms operate.

Figure 4-1: A simple example of a coreset tree, corresponding to a video which spans 100 frames. KF in the figure stands for the key frames of a node, and Range is the span covered by the leaf. All the leaves point to the database, meaning that the detections on the key frames of those leaves are passed to the database.

Figure 6-1: A simple example of a coreset tree corresponding to a video of 100 frames. Each leaf spans 25 frames. The key frames and frame span of each node are mentioned along with the node.

Figure 7-1: System diagram of the search-objects module with implementation details. The tools and languages used to implement each system component are mentioned on top of the component.

Figure 7-2: Web UI for the search-objects system to search and upload videos.

Figure 7-3: An example retrieval for the search query "guitar". Results appear as image thumbnails containing rectangles on the detected regions.

Figure 7-4: An example retrieval for the search query "computer". Two object categories match the query.

Figure 7-5: System diagram depicting two different retrieval systems -- coreset and database retrieval. Coreset retrieval loops around while the client continuously polls for the results, and the retriever incrementally provides the results. The DB retriever receives one client request and returns all results at once.

Figure 7-6: Process of going from retrieved regions to displaying on the UI.

Figure 7-7: Image saved in the file system for the detection of the object category "computer keyboard".

Figure 8-1: Eight still images used to make the small synthetic video for system evaluation experimentation. The video contains segments of multiple frames of each of these images.

Figure 8-2: a) Key frames and other descriptions for the first leaf node of the coreset tree, and b) for the root node of the tree. Coreset trees are shown towards the left in both a) and b), with the relevant node selected with a black circle.

Figure 8-3: A portion of the "regions" table in the database. It shows various columns in the table and example inputs.

Figure 8-4: Database retrieval of two queries: "guitar" (a) and "person" (b).

Figure 8-5: Uniform sampling retrieval of queries "guitar" (a) and "person" (b).

Figure 8-6: Preferential coreset sampling retrieval results for queries "guitar" (a) and "person" (b).

Figure 8-7: Line graph showing the trend of processing time vs. the length of the video for 4 different videos. The time taken increases with increasing length of the video.

Figure 8-8: Time taken per frame with increasing length of videos but the same quality. The figure shows that the time per frame is very similar (around 3 seconds) in all videos, implying that time per frame doesn't increase if quality is the same.

Figure 8-9: End-to-end time with increasing video resolution but fixed length.

Figure 8-10: Result from the retrieval experiment on real data when querying "car" in a large video taken in a natural environment. The x-axis is increasing time, while the y-axis is the portion of frames retrieved from the server. The red star is for the DB, the red line is for uniform sampling, and the blue line is for coreset sampling. We see that the DB returns all results at a certain point in time (11 secs), while the uniform and coreset sampling systems keep retrieving results in an incremental fashion. Uniform sampling retrieves more results at any given point than coreset sampling. It seems that preferential coreset sampling is the worst of all, but once we take into account the importance of frames, we will see its advantage (discussed later in Section 8.3.2).

Figure 8-11: Plots of cumulative importance of frames vs. time for a video with 10K frames. On the left are the normalized axes, showing the importance fraction against the fraction of time. This allows us to see which fraction of the time retrieved more important results. We can see that the [preferential] coreset graph is higher than the uniform graph in most of the earlier portion of the time, showing that more important frames are retrieved earlier in the results for preferential retrieval.

Figure 8-12: Plots of cumulative importance of frames vs. time for a video with 25K frames. Here, uniform and coreset intersect at around half the time, showing that uniform sampling retrieved a larger fraction of important results than preferential coreset retrieval in the later half of the results.

Figure 8-13: Plots of cumulative importance of frames vs. time for a video with 100K frames. With such a long video, even though the graph looks the same, half the retrieval time is about a minute. It is important that the user gets most of the important frames within a minute, because that is well beyond the tolerable wait time described in [3].

Figure 8-14: Portion of results returned from the synthetic image database, with various retrieval types. The images in ascending order of importance were: red, yellow, green, blue, purple. In a), we can see more purple and blue frames, which are the most and second-most important frames respectively. There are fewer red and yellow images retrieved. The DB in this case (b) happened to have more important frames earlier in the database. However, in uniform sampling (c), we can see all kinds of frames with no particular order or preference.

LIST OF TABLES

Table 3-1: Simplified schema of the "Regions" table in the database. Columns video_id and coreset_id link to the video and coreset tables respectively. The frame_num field indicates the frame where this detection was done, class_id is the id of the detected object class, (x1, y1) and (x2, y2) are the corners of the rectangular region of the detected object, and confidence is the score returned by the detection algorithm.

Table 3-2: Simplified schema of the "Videos" table. It has basic information about the uploaded videos such as path, width and height, and the number of frames each spans.

Table 3-3: Simplified schema of the "Coresets" table in the database. It is cross-linked with the video, and it has info about the coreset tree path in the file system.

Table 3-4: Simplified schema of the "Objects" table. It has the id and name of the object class.

Table 8-1: The expected objects in each of the 8 images in the video.

Table 8-2: False Positives (FP), False Negatives (FN), and True Positives (TP) marked by a human on 8 images of the small synthetic video, for confidence thresholds 0 and 0.5. True positive objects are marked in green, false positives in red, and false negatives in yellow. The FP, FN, and TP columns show the counts of false positives, false negatives, and true positives respectively.

Table 8-3: Precision and recall numbers for the detections in each image for confidence thresholds 0 and 0.5. Increasing the confidence threshold, we saw an average increase in precision but a decrease in recall.

Table 8-4: Time taken by various components of the system for the three videos. For each video and each component, the absolute time taken and the time taken per frame are shown.

Table 8-5: Absolute time taken to retrieve all results for the query "first synthetic object" on 3 videos, for three different retrieval systems.

Chapter 1 Introduction

1.1 Goal

The proliferation of devices such as phones, cameras, and various other sensors is enabling us to collect and store personal data in various forms: pictures, GPS data, videos, etc. The data is stored on our devices and in the cloud. Detailed data about our lives has the potential to answer questions such as "What was I doing last Monday at 2 pm?", "When did I play tennis?", "Get all videos of when I was cooking.", "Where did I leave my laptop?", and so on. We should be able to get answers about our activities, friends, events, and all other aspects of our lives that are captured by data.

Our objective is to develop an auto-generated, text-searchable personal "diary" that we could refer to for answering such questions. The challenge is searching through such a large amount of data in an efficient and effective manner. In this thesis we build on prior work on life logging as part of the iDiary [1] project and extend this system with the ability to collect, store, and search semantically over video data.

Objects Retrieval System of iDiary

This thesis implements a video storage and retrieval component within a life-logging system called iDiary [1]. The new module is called the "objects retrieval system". It allows users to answer questions about their videos based on the objects in the videos. Users can search with text to identify where in a video a certain object is, and the system then retrieves and displays the important frames of the video where the object is present, in preferential order.

Consider the following scenario. My friend Matt likes playing guitar. Once, he performed at his school during a "talent show" event. There was a video of the entire event, which lasted several hours, but he was struggling to find the 3-minute segment where he was playing guitar. With this new objects retrieval system, Matt would upload the video to the system. After some background processing, he would type "guitar" in the query window. The system would return the frames in the video where a guitar is present. Then, he would be able to navigate to that part of the video and watch himself play guitar.

However, since Matt doesn't like to wait, we wish for the system to return results right away, incrementally showing more video frames as they are identified. The results should be shown in preferential order, i.e., frames with higher "importance" or "preference" should be shown earlier. Which frames are considered more important may differ with the situation. For example, Matt might want bright and less blurry frames, or frames where no objects other than the guitar and himself are detected, or he might equally want frames where other band members and other instruments are visible as well. In whichever case, Matt wants the more important frames to be shown earlier in the results.

A primary goal in this thesis is to build a retrieval system that Matt would love. Ideally, he would get the frames he wanted as soon as possible with minimal wait.

1.2 Challenges

There are various technical and algorithmic challenges in creating such a system. Most of the challenges are related to segmenting, representing, and searching over video data.

1.2.1 Large Data

Video data comes in very large quantities, which makes efficient storage and retrieval from videos difficult. Say a user records a video of his activities continuously for the entire day, i.e., ~12 hours. A normal 720p video at 30 frames per second is on the order of 100 MB per minute. For 12 hours, the size is 100 x 60 x 12 = 72,000 MB = 72 GB. A week's worth of such video is 504 GB. Assuming a typical hard drive has a capacity of 500 GB, the user would need one hard drive to store every week of his video data. Similarly, it would be very difficult to retrieve a certain segment from that data; normally the user has to sift through all of it.

The challenge is to have an efficient compression method that reduces such a huge amount of data into a meaningful structure that preserves the information content of the original video. In this thesis, the challenge is to identify and extract the important and representative frames from the user video and store these key frames, and the objects in them, in an efficient form.

1.2.2 Preserving Semantics on Compression

"Semantics" refers to the meaning, or the information, contained in the data. As Sugaya describes in [2],

data we receive through modem sensors are of very large quantities, and not everything is of interest. An

important challenge is to map the data that comes from sensors such as video, GPS, or accelerometers to

information.

In this thesis, the challenge is to preserve such semantics in the video data while summarizing the video. The compression algorithm needs to identify semantically meaningful (key) frames in the video and store those in compressed form, although the video may not (and usually does not) have uniformly meaningful frames throughout. For example, the video frames contained in 3 hours of sitting at an office desk are quite similar to each other; however, the video frames in 3 hours of walking around Boston are likely very different from each other. There are fewer semantically important frames in the office-desk portion of the video, and more in the Boston-walk portion. Therefore, the compression algorithm needs to select more key frames from the Boston walk than from the office desk.

1.2.3 Object Detection on Videos

Object detection is a technique to locate and recognize objects in an image. State-of-the-art object detectors are focused on detecting objects in a single image and are not optimized for video, which is a collection of many images. Because videos have ~30 frames every second, running detection on every frame of a large video can be computationally expensive. Therefore, the challenge is to identify representative frames from the videos to run the detections on, such that those frames represent the entire video.

1.2.4 Efficient, Semantic, and Preferential Retrieval

Users do not want to wait too long for the system to load results, and prefer immediate results if possible. A research study [3] shows that the tolerable wait time for links without feedback is 5-8 seconds, while the tolerable wait time for web users in general peaks around 2 seconds. Therefore, a challenge is to retrieve results from potentially large video data as fast as possible for users to see. Similarly, we wish to show users the more preferred or important results earlier in the result set.

Tackling these challenges is important and interesting not only for everyday applications in which a user searches for objects in a video, but also for various other applications. A solution to these challenges would allow managing large quantities of data and efficiently retrieving semantically important aspects of it. Robots continuously collect large amounts of data, and need efficient semantic processing of the data for a wide variety of applications. As shown in [2], the capability of handling large amounts of data would be "valuable in commercial business, scientific research, government analyses, and many other applications."

1.3 Solution Approach

We use an efficient compression algorithm based on k-segment mean coresets [4] to compress the data and extract the semantically important (key) frames from the data. We then run a state-of-the-art detection system to detect the objects in those key frames and store them in the database. We then use an efficient retrieval algorithm that operates on the compressed data on an any-time basis, so that key frames with detections can be incrementally presented to the user with minimal wait time. The retrieval algorithm provides a probabilistic bias towards more representative key frames, so that they are retrieved earlier than less important ones. The solution also provides the tools for users to input their videos into the system, make search queries, and see the results retrieved by the system.

The solution is novel because this is the first system that allows users to make textual queries about the objects in their videos and shows them the detections in the videos. The process is also novel in that it is a unique application of the new coreset creation algorithm for data compression.

1.4 Contributions

1.4.1 System

* This thesis provides an implemented system to collect and store user videos, identify representative key frames that correspond to a semantic segmentation of the videos, detect objects in the representative frames, and efficiently retrieve and display results to the user in response to text queries about the objects in the videos.

* The thesis also implements a search query tool for users to search for objects in the videos.

1.4.2 Algorithms

* The algorithmic contributions include an efficient retrieval algorithm that takes compressed data as input and retrieves the key frames. The algorithm has a high probability of retrieving key frames in the order of their importance.

* The retrieval algorithm is any-time, returning relevant key frames incrementally.

* The thesis uses prior video segmentation and summarization algorithms that create a compressed structure called a coreset tree representing the entire video. These algorithms are described in detail in [4] and [5].

1.4.3 Experiments

* The thesis presents an end-to-end evaluation of the system on a synthetic video with known ground truth segmentation and object detections to evaluate the correctness of the system and its modules.

* The thesis shows the results of timing experiments on various modules as well as the entire system, and presents the performance data and its dependence on the size and quality of the videos.

* The final set of experiments evaluates the retrieval system. The thesis shows the pros and cons of various retrieval systems. The experiments demonstrate the advantage of the preferential retrieval system proposed in this thesis over other systems on large datasets where more preferred frames need to be retrieved earlier.

V Chapter 2 Related Work

In this chapter, we describe previous related work on life-logging systems, coreset algorithms for the k-segment mean problem and video summarization, and semantic retrieval systems for videos.

2.1 Life Logging Systems/Diaries

Life logging means storing the personal data and experiences of a person's life so that they can be referred to later. Life-logging systems/diaries have transformed in the past decade and a half from self-logging or self-creating methods such as text diaries and photo albums, to automatic logging of data through digital tools such as "recording computer and cellphone activity, mobile-contexts (e.g. GPS)", and automatic all-time recording with various wearable sensors and cameras [6]. All these collected data need to be processed to be useful to humans [6]. iDiary, described in [1], is an automatic life-logging system which infers user activities and meetings using the GPS information automatically logged by their phones. The term "lifelog" is nowadays used mostly for digital life-logging.

The idea of the lifelog was first envisioned by Vannevar Bush in 1945 [7]. He proposed a way to capture, collect, and store a person's personal information for an entire lifetime digitally, so that it could be retrieved later, whenever needed. In 2001, this notion was revived by Gordon Bell [8], when he scanned all his office paperwork, medical records, photographs, and similar other personal data and stored them digitally. Although it first started with desktops, the concept of lifelogging is currently advancing in the world of mobile smartphones [6]. [9] describes various kinds of life-logging systems according to the kind of data they collect - passive visual capture ("always-on" cameras), biometrics (wearable sensors to collect body information such as heart beat, temperature, etc.), mobile context (GPS, people connected to the same network, etc.), mobile activity (calls, SMS, social network sites, etc.), computer activity, and active capture (users adding photos, annotations, etc.). The GPS logging part of iDiary [1] works through mobile context and is automatic, while the objects retrieval system (this thesis) is active capture, i.e., users upload their videos into the system. However, both are designed to process and retrieve the data in a human-usable form, with humans making the text queries.

There are quite a few recent life-logging projects and products that use phones and/or cameras to log users' data and process it for users to interact with. Affective Diary [10] uses the data from the sensors in mobile phones, as well as the pictures that the user uploads to the system, to create an abstract colorful body shape. The purpose of the abstract body shape is to allow users to self-reflect and piece together their stories. Microsoft's SenseCam [11] is a passive (not requiring the user's active input) camera that users wear around their neck. The camera takes pictures periodically and automatically. The part that is similar to ours is the processing of these thousands of images taken per day. The pictures are segmented into various events, representative pictures of those events are extracted, and event-novelty is calculated for each event based on how unique it is. This is similar to the high-level approach we take, where we segment the video into various parts based on the semantic uniqueness of the scenes in the video, and we select the representative frames from those segments. In addition, we use object detection on the videos, and provide a text-based retrieval system where users can type queries.

There are several other life-logging devices and works such as Google Glass [12], MyLifeBits [13], FitBit [14], and more. All these tools process the users' data to provide them with useful information. To our knowledge, iDiary is the first project that aims to encompass such a wide variety of data, such as GPS and videos, and provide users a simple search tool to query the semantic content of their data.

2.2 Coreset Algorithms

In this thesis, we use the data-compression technique called coresets ([15], [16]) to conduct a fast, semantic (content-based) segmentation of the videos. "Informally, a coreset D is problem dependent compression of the original data P, such that running algorithm A on the coreset D yields a result A(D) that provably approximates the result A(P) of running the algorithm on the original data." [16] This thesis directly uses the coreset creation algorithms described in [5] and [4], which describe a new coreset algorithm for the k-segment mean problem [4] and build on this segmentation to create an online summarization tree from the video data stream, selecting representative frames (key frames) from each segment [5]. The coreset for the k-segment mean problem described in [4] is linear in the number of segments k, and independent of the dimension of the data and the number of points in the data. The processes of creating this coreset and the tree are described in more detail in Chapter 4.

There have been several works on summarizing high-dimensional data streams in various application domains. For example, [17] runs approximation clustering algorithms such as k-center to divide the data into k clusters. However, the clusters are not temporal and are therefore not the same as k segments. The computation of the clusters takes time exponential in both the dimension of the video (d) and k. The feature space they use for the video stream uses simple data structures. The algorithms in this thesis allow for complex features and on-line feature space updates using k-means clustering of the features seen so far.

Similarly, the k-segment mean problem has been tackled using various approaches. The problem can be solved exactly using dynamic programming [18], but this takes O(dn^2 k) time and O(dn^2) memory, which is impractical for large streaming data. However, many approximation algorithms have been introduced, including recent work such as [19] and [20]. [19] supports efficient streaming, but it is not parallel. Because it does not return a coreset but only the k-segmentation, it cannot solve other optimization problems with other priors or constraints. On the other hand, [20] returns a coreset, but the running time of the proposed algorithm is cubic in both d and k. [20] is the most recent research on solving the k-segment mean problem and its variations, and all the coreset creation solutions up to and including [20] take time and memory that are quadratic in d and cubic in k [4].

Our key tool in this thesis relies on video summarization based on coresets. Video summarization means extracting representative and meaningful frames from the video to produce a compact representation of the video itself. Analyzing ad-hoc videos and summarizing them for specific applications is a difficult task, as described and tackled before in [21], [22], [23]. The problem is very similar to action classification, scene classification, and object segmentation of videos [22]. Applications where life-long video stream analysis is crucial include mapping and navigation, medical/assistive interaction, and augmented-reality applications, among others [4]. Although we use the word "compression" in this thesis, this summarization is different in that compression can store semantically redundant content because it is geared towards preserving image quality for all frames, while we seek a summarization approach that allows us to represent the video content by a set of key segments, for a given feature space.

2.3 Semantic Retrieval for videos

With the recent growth of online videos, video search and semantic retrieval have been growing topics in computer vision, and various works have been done in this field. Some papers tackle semantic search in certain domains such as autonomous driving [24] and surveillance videos [25], while some are more general [26], [27]. Our work closely resembles [24] because it takes a natural text query as input and outputs a candidate image of the video segment with bounding boxes. However, in [24] the video segmentation and object labelling (putting bounding boxes on images) are done by humans. In our work, segmentation is done with efficient coreset algorithms, and detection is done automatically using an object detection algorithm. In [25], text input is allowed, but the application is focused on tracking vehicles in traffic surveillance videos, so it does not recognize other objects; it is mostly based on trajectory tracking of vehicles.

More general related work includes Video Google [26] and [27]. In these works, however, the input is not a text query about the objects or a description of the potential scenes. In [26], the user queries a region of a particular frame in the video, and the system searches for all the frames in the video that contain that region. In [27], an image is provided as input, and the system searches for a similar image match in the video. In [26], the key frames are simply the consecutive frames occurring every second, unlike in our work, where we segment the videos and then select meaningful key frames.

Ours is the first system we know of that uses k-segment mean coresets to smartly segment videos, identifies representative key frames from those segments, detects objects in those key frames, and allows users to type the names of objects in the videos to retrieve frames from the videos.

Chapter 3 Video Summarization and Retrieval: Technical Approach

Given multiple streams of videos, we wish to develop an efficient representation for storing them in a database and querying for them using textual semantic categories that describe the objects and places in the video streams. Our key insight is to segment each video stream into segments that have semantic coherence and to represent each segment using one frame (key frame) selected from that segment. This approach will greatly reduce the amount of data required to represent the video and speed up the search.

Because we wish to capture video summarization at multiple semantic levels, we capture an entire tree of semantic representations for the data. Intuitively, the leaf level of the tree contains the smallest video segments, close to individual video frames. At each level in the tree we represent increasingly larger segments that have semantic coherence according to some metric.

In order to enable semantic textual retrieval, we process each video frame in the database to extract the objects in it using Berkeley's object detection system RCNN (Regions with Convolutional Neural Network Features) [28], an extension of their neural-network-based object classification system, Caffe [29]. The output of RCNN is the list of objects that occur in each frame and their textual descriptions.

This solution pipeline requires (1) algorithms for video segmentation and summarization; (2) algorithms for object recognition and mapping objects to text; (3) a schema for storing the original video segments, their representative images, and their textual descriptions; and (4) a method for efficiently searching this data. We have developed a system that presents an end-to-end solution to searching efficiently over video data using text. The rest of the chapter describes the architecture of this system.

3.1 Objects Retrieval System Architecture: Introduction

The objects retrieval system is composed of several components - the web interface (frontend, i.e., user-facing), the coreset tree creator, the object detector, the retriever, and the storage (database (DB) and file system). These components can be broadly categorized into two parts - the front end, which includes the web interface, and the back end (server), as shown in Figure 3-1. The server broadly includes 3 modules - video processing, storage, and the retriever. The rest of this chapter briefly describes these components: their functions, inputs, and outputs, and the data flow between them.

Figure 3-1: System Architecture of the Objects Retrieval System. The front end is the web interface to upload, search, and view the results, and the server is where the video processing (i.e., compression of data and detection of objects) as well as the retrieval algorithms operate.

3.2 Web Interface

The web interface is the front end, or the user-facing portion of the system. This is the portal for users to upload their videos into the system, search for objects in the uploaded videos with text, and see the retrieved key frames (images) and the object detections.

In our user scenario from Chapter 1, our user Matt can upload the video of his school event using the upload interface, search for his performance on the basis of the objects in the video, i.e., by typing "guitar" in the search interface, and view the retrieved key frames with the relevant detections in them, i.e., a key frame containing a guitar, in the retrieved-results interface.

3.3 Video Processing

This component takes as input the raw video and processes the video to compress it, extract key frames, and detect objects in those key frames. As output, the database is populated with the processed information. This is a part of the server. The two modules of the video processing component are the coreset tree creator and the object detector.

3.3.1 Coreset Tree Creator

This component takes as input a raw video and scans the video frames in an online fashion, as they appear in the video, building up a compressed structure called "the coreset tree". The module uses the "k-segmentation coreset algorithm" [4] to divide the video into its semantic segments. Each segment can be represented by one frame extracted from the segment. By choosing different levels of granularity for segmenting the video data, the system creates an entire coreset tree of key frames, where each level in the tree corresponds to a different segmentation granularity. During the extraction of key frames, the segmentation algorithm preserves the overall information of the video and selects fewer key frames from the dormant parts of the video (fewer segments) and more from the active parts (more segments).

The output of this component, the coreset tree, contains the information about the key frames, which are much fewer in number than the original frames but represent the video well. We can then do object detection on these key frames.

3.3.2 Object Detector

Given the coreset tree representation, we wish to extract and index all the objects present in the key frames. This processing enables efficient retrieval using textual queries that refer to the extracted objects. The algorithm we use for object detection, RCNN (Regions with Convolutional Neural Network Features) [28], was developed as part of the Caffe system at Berkeley [29]. Caffe is a deep learning framework suitable for vision-based applications, and provides object classification models as well as feature extractors by default. RCNN is a "state-of-the-art visual object detection system" [28] based on Caffe models and feature extractors, and can detect objects in an image.

For each key frame, the object detector gives the exact regions of the detections in that frame, the detected object categories, and their confidence scores. For example, for a key frame containing Matt and his guitar, RCNN may provide the object category detected, e.g., "guitar", the coordinates of the rectangular region where the guitar is detected, and the confidence score of the detection.

All this information can then be stored in the database, cross-linked with the uploaded video, the created coreset tree, and the corresponding key frames of the coreset tree.

3.4 Storage

Storage includes mainly the database, but also the file system on the server. All the information about the detected regions in the key frames, cross-linked with the paths of the coreset trees and video files, is stored in the database. On the other hand, the actual coreset structures and the actual uploaded videos are stored in the file system of the server.

The database contains the following tables: 1) the Regions table, where information about the detected regions is stored; most other tables are linked from here; 2) the Videos table, where information about the uploaded videos is stored; 3) the Coresets table, where information about the paths of the coreset tree structures is stored and cross-linked with the Videos table; and 4) the Objects table, which has the object id and name. The following tables show the simplified schema of these tables, and an example row of data for each of them.

Table 3-1: Simplified schema of the "Regions" table in the database. Columns video_id and coreset_id link to the Videos and Coresets tables respectively. The frame_num field indicates the frame where this detection was done, class_id is the id of the detected object class, (x1, y1) and (x2, y2) are the corners of the rectangular region of the detected object, and confidence is the score returned by the detection algorithm.

region_id | video_id | coreset_id | frame_num | class_id | x1  | y1  | x2  | y2   | confidence
2322      | 201      | 302        | 232       | 79       | 250 | 900 | 600 | 1200 | 1.5

Table 3-2: Simplified schema of the "Videos" table. It has basic information about the uploaded videos such as path, width and height, and the number of frames each spans.

video_id | video_path              | width | height | num_frames
201      | /home/videos/video1.mp4 | 720   | 480    | 550

Table 3-3: Simplified schema of the "Coresets" table in the database. It is cross-linked with the video, and it has info about the coreset tree path in the file system.

coreset_id | video_id | coreset_tree_path
302        | 201      | /home/videos/coreset_video1.mat

Table 3-4: Simplified schema of the "Objects" table. It has the id and name of the object class.

class_id | class_name
79       | guitar

As shown in the tables above, the information about detected regions is linked to other tables in the database: video_id links to the Videos table, while coreset_id links to the Coresets table. Similarly, class_id links to the Objects table. Those tables contain file paths and other information about the videos and coresets respectively, and are linked between themselves as well. (x1, y1) and (x2, y2) are the corner coordinates of the rectangular detection region, and confidence is a score for the detection provided by the object detector.

Thus, when a user searches for "guitar", the class_id corresponding to that class name can be looked up in the Objects table. Then that class_id, because it is cross-referenced in the Regions table, can be used to find the relevant detected regions. Once we find those regions, we can determine which video they belong to through the video_id field that references the Videos table. Similarly, the relevant coreset tree can be found as well. In this way, we know exactly which videos and which frames contain the queried object, and exactly in which regions of those frames. Thus, we have enough information to return the results to the users.
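As a concrete illustration of this lookup chain, the sketch below recreates the simplified schema of Tables 3-1 to 3-4 in an in-memory SQLite database and resolves a text query by joining the Objects, Regions, and Videos tables. This is only a minimal sketch of the idea; the actual system uses PostgreSQL behind a Django server, and the helper function and sample values here are illustrative, not the thesis's code.

```python
# Minimal sketch (not the thesis code): the simplified schema from Tables 3-1..3-4
# recreated in an in-memory SQLite database, and a text query resolved by joining
# objects -> regions -> videos, as described above. Sample values are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE objects (class_id INTEGER PRIMARY KEY, class_name TEXT);
CREATE TABLE videos  (video_id INTEGER PRIMARY KEY, video_path TEXT,
                      width INTEGER, height INTEGER, num_frames INTEGER);
CREATE TABLE coresets(coreset_id INTEGER PRIMARY KEY, video_id INTEGER,
                      coreset_tree_path TEXT);
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, video_id INTEGER,
                      coreset_id INTEGER, frame_num INTEGER, class_id INTEGER,
                      x1 INTEGER, y1 INTEGER, x2 INTEGER, y2 INTEGER,
                      confidence REAL);
""")
db.execute("INSERT INTO objects VALUES (79, 'guitar')")
db.execute("INSERT INTO videos VALUES (201, '/home/videos/video1.mp4', 720, 480, 550)")
db.execute("INSERT INTO coresets VALUES (302, 201, '/home/videos/coreset_video1.mat')")
db.execute("INSERT INTO regions VALUES (2322, 201, 302, 232, 79, 250, 900, 600, 1200, 1.5)")

def find_regions(query: str):
    """Resolve a text query to (video_path, frame_num, bounding box, confidence)."""
    return db.execute("""
        SELECT v.video_path, r.frame_num, r.x1, r.y1, r.x2, r.y2, r.confidence
        FROM objects o
        JOIN regions r ON r.class_id = o.class_id
        JOIN videos  v ON v.video_id = r.video_id
        WHERE o.class_name = ?
        ORDER BY r.confidence DESC
    """, (query,)).fetchall()

print(find_regions("guitar"))
```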

3.5 Retriever

After the user inputs the search query in the query interface, the retriever component uses the database and the coreset tree structure to retrieve key frames containing the queried object from the video. The retriever uses a modified form of the sampling algorithm described in [5] to sample key frames from the coreset tree. The algorithm assigns a higher sampling probability to key frames that are more important than others. The importance function of a key frame can differ according to the application, and the retriever is independent of the nature of the importance function. The algorithm is an any-time algorithm, meaning that there are retrieved results no matter when the algorithm is stopped, and if the algorithm continues, it eventually returns all the results.

Making use of the any-time property of the algorithm, the retriever component takes the key frames retrieved at discrete time intervals, gets detections on those frames from the database, and returns the information to the web interface to be displayed to the user in an incremental manner.
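The sketch below illustrates only the any-time, importance-biased behavior described here; it is not the preferential coreset sampling algorithm of Chapter 6 and [5]. The importance function, batch size, and data layout are placeholder assumptions.

```python
# Illustrative sketch only: an "any-time" retrieval loop that repeatedly samples
# leaf key frames with probability proportional to an importance score and emits
# them in batches. The importance function and data layout are placeholders; the
# actual preferential coreset sampling algorithm is the one described in Chapter 6.
import random
from typing import Callable, Dict, Iterator, List

def anytime_retrieve(
    key_frames: List[int],                    # leaf-level key frame numbers
    importance: Callable[[int], float],       # application-defined importance score
    detections: Dict[int, list],              # frame_num -> detected regions (from the DB)
    batch_size: int = 5,
) -> Iterator[list]:
    """Yield batches of (frame, regions); more important frames tend to come first."""
    remaining = list(key_frames)
    while remaining:
        batch = []
        for _ in range(min(batch_size, len(remaining))):
            weights = [importance(f) for f in remaining]
            frame = random.choices(remaining, weights=weights, k=1)[0]
            remaining.remove(frame)
            batch.append((frame, detections.get(frame, [])))
        yield batch                            # client polls and renders incrementally

# Usage: stop whenever the user is satisfied; results returned so far remain valid.
frames = [2, 15, 30, 46, 66, 91]
dets = {30: ["guitar"], 66: ["guitar", "person"]}
for batch in anytime_retrieve(frames, importance=lambda f: 1.0 + len(dets.get(f, [])), detections=dets):
    print(batch)
```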

Chapter 4 Video to Coreset Tree Creation

As soon as the video is uploaded to the server, processing of the video starts for coreset tree creation. The constructed coreset tree is described in [5], on which most of this chapter is based.

4.1 Coresets: Introduction

In simple terms, coreset algorithms select a subset of data from a large data stream such that the result of running an algorithm on the reduced data set is approximately the same as the result produced by running the algorithm on the entire data set (which is usually intractable for large data). The coreset data structures approximate a larger dataset with a compressed structure that can be constructed in linear time using sublinear memory.

The coreset tree described in this chapter is also a compressed representation of the original data, based on a high-dimensional compression method that uses the k-segment mean coreset, described in [5]. The tree represents a compressed visual summarization of the input video data that can be constructed very efficiently.

4.2 K-segment Mean Coreset Tree Construction (without key frames)

Given a continuous video stream, we select a "representative over-segmentation of the video stream (along with a compact representation of each segment)" [5]. Each representative over-segment (containing smaller segments of the video) is a coreset because it is a compact representation of some portion/scene of the real video stream. The coreset provably approximates the real data, guaranteeing a "good trade-off between the size of the coreset and the approximation of each scene in the video stream." The output of the algorithm is called an ε-coreset, which is a set of roughly k/ε segments approximating the real data. The k-segment mean coreset algorithm used to construct the tree has 2 parts: first, the complexity of the data is determined using a bicriteria algorithm, and second, the data is partitioned into segments using a balanced partition algorithm, approximating each segment by the SVD [5], [4] of a matrix constructed to represent the segment (refer to the supplementary material of [4] for algorithm details). This algorithm produces a segmentation close to k segments with approximately optimal cost, and this set of roughly k segments is called an ε-coreset, or coreset in general. In our coreset tree (the final structure), each leaf is an ε-coreset, and the whole tree is an ε-coreset as well.

A coreset tree can be constructed from the coreset leaves by exploiting two important compositional properties of coresets: 1) the union of two ε-coresets is an ε-coreset, and 2) a δ-coreset computed from an ε-coreset is an (ε + δ + εδ)-coreset. Therefore, two coresets can be merged to form another coreset. (See the supplementary materials of [4] for detailed correctness proofs and complexity analyses.)

To create a tree, we first create a coreset for every block of consecutive points in the streaming data (these are frames in the case of video data) using the k-segment mean coreset algorithm. Once we create two such coresets, we can merge them using the first property, and recompress using the second property to avoid increasing the coreset size. Finally, a coreset streaming tree is created by recursively merging the coresets formed directly from the streaming data, i.e., the streaming leaves.
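To make the merge-and-recompress pattern concrete, here is a minimal sketch of a streaming coreset tree builder. It only mimics the structure of the construction: build_coreset and recompress are placeholders standing in for the k-segment mean coreset computation of [4] and [5], and a node's "coreset" is represented simply by a list of frame indices.

```python
# Sketch of the streaming merge-and-reduce pattern described above, not the actual
# k-segment mean coreset code from [4]/[5]: `build_coreset` is a stand-in for the
# segment-coreset construction, and "merge + recompress" keeps node sizes bounded.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    frames: List[int]          # frame indices this node spans (stand-in for the coreset)
    level: int                 # height in the streaming tree (leaves are level 0)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_coreset(block: List[int]) -> Node:
    # Placeholder: in the real system this runs the k-segment mean coreset algorithm.
    return Node(frames=block, level=0)

def recompress(a: Node, b: Node, max_size: int) -> Node:
    # Union of two coresets is a coreset; recompress so the size does not grow.
    merged = sorted(a.frames + b.frames)
    step = max(1, len(merged) // max_size)
    return Node(frames=merged[::step], level=a.level + 1, left=a, right=b)

def stream_tree(frames: List[int], block_size: int = 100, max_size: int = 25) -> Node:
    stack: List[Node] = []
    for start in range(0, len(frames), block_size):
        stack.append(build_coreset(frames[start:start + block_size]))
        # Recursively merge equal-level nodes, as in merge-and-reduce streaming.
        while len(stack) >= 2 and stack[-1].level == stack[-2].level:
            b, a = stack.pop(), stack.pop()
            stack.append(recompress(a, b, max_size))
    while len(stack) >= 2:     # fold any leftover nodes into a single root
        b, a = stack.pop(), stack.pop()
        stack.append(recompress(a, b, max_size))
    return stack[0]

root = stream_tree(list(range(1, 401)))
print(root.level, len(root.frames))
```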

4.3 Summarization: Coreset Tree with key frames

The goal of the summarization is to get representative frames from each segment of the video, so that they capture the entire structure of the video. We use the k-segment mean coreset tree constructed as above, and for each node of the tree, we select and store a set of images or frames from the video, called key frames. For each node, these key frames represent a summarization of the segment of the video that the node spans. The key frames are merged upwards in the tree according to image quality, their ability to represent other images in the video, and their representation of the video transitions, as captured by the segment merges in the k-segment mean coreset algorithm.

For each streaming leaf of the coreset streaming tree, we choose a set of K key frames. When two nodes merge, we select a new set of K key frames from the two child nodes by using a modified farthest-point-sampling (FPS) algorithm [30], [31]; the modified algorithm is described in [5]. To sample the next frame, the FPS algorithm chooses the frame whose feature vector x is farthest in distance from the feature vectors of the already sampled frames. The modified FPS algorithm also takes into account an "image relevance score", in addition to the feature vector distance. The image relevance score is a function of "image quality (sharpness and saliency), temporal information (time span and number of represented segments), and other quality and importance measures," described in more detail in [5].

Thus, the algorithm chooses the frame x_j according to the following rule:

x_j = argmax_x { d(x, S_{j-1}) + f*(x) }

where S_{j-1} is the set of previously chosen frames, so d(x, S_{j-1}) is the point-to-set distance between x and S_{j-1}, and f*(x) is the image relevance function, which depends on metrics such as the blur measure, video time, and the number of segments associated with the key frame. Therefore, the chosen frame is farthest from the already chosen frames in terms of feature vector distance and likely has better image quality. The algorithm chooses K such key frames at each merge action.
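A minimal sketch of this selection rule is shown below, assuming feature vectors and relevance scores are already available as arrays; it is not the implementation from [5], and the toy relevance values only stand in for the sharpness/saliency/temporal terms described above.

```python
# Sketch of the modified farthest-point-sampling rule above (not the code of [5]):
# each new key frame maximizes its distance to the already-selected set plus an
# image relevance term f*. The feature vectors and relevance scores are placeholders.
import numpy as np

def modified_fps(features: np.ndarray, relevance: np.ndarray, k: int) -> list:
    """Select k key-frame indices: argmax_x { d(x, S) + f*(x) } at every step."""
    selected = [int(np.argmax(relevance))]          # seed with the most relevant frame
    # d(x, S): distance from each frame to the nearest already-selected frame
    dist_to_set = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < k:
        score = dist_to_set + relevance             # combined objective
        score[selected] = -np.inf                   # never re-pick a chosen frame
        nxt = int(np.argmax(score))
        selected.append(nxt)
        dist_to_set = np.minimum(
            dist_to_set, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Toy usage: 9 key frames from 100 random feature vectors with random relevance.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
rel = rng.uniform(0.0, 0.5, size=100)               # e.g., sharpness/saliency proxy
print(modified_fps(feats, rel, k=9))
```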

4.4 Output/Populating information to DB

The output of the entire process is a coreset tree whose nodes contain key frames from the video segment they span. In our application, we chose each node to have 9 key frames, and the size of each coreset leaf to be 100 frames (except for very small videos), i.e., at the leaf level of the tree there are 9 key frames for each 100 frames in the original video. For a simple example, refer back to the coreset tree example from Chapter 2, shown again in Figure 4-1, where the leaf size was 25 and there were 3 key frames per node.

Figure 4-1: A simple example of a coreset tree corresponding to a video which spans 100 frames. KF in the figure stands for the key frames of a node, and Range is the span covered by the node. All the leaves point to the database, meaning that the detections on the key frames of those leaves are passed to the database.

This coreset tree structure is stored in the server file system, and its path is stored in the database, mapped to the path and the ID of the video that it corresponds to.

Chapter 5 Object Detection of Coreset Tree Leaves

This chapter describes how the coreset tree is used as input to select key frames and run object detection on them. It discusses the definition of object recognition, the tools we chose, and their outputs. We then describe what information is stored in the database after the detection of objects.

5.1 Object Recognition and Detection

Visual object recognition is a technique in computer vision for learning object categories and then identifying new instances of those categories [32]. There are two broad types of object recognition: 1) identifying instances of a particular object, e.g., 'Eiffel Tower' or 'Matt's acoustic guitar', and 2) recognizing various instances of a general object category or class, e.g., 'building' or 'guitar'. We use general object categories in this thesis.

Object recognition involves object classification and object detection. In object classification, an entire image is classified as one object category or no category. In object detection, the objects detected in the frame are localized within the image and classified into different object categories. Thus, in object detection, there is an additional step of localizing the regions of various object proposals within the image, and only then classifying each of those object proposals into an object category or no category. Caffe provides object classification models, while RCNN is an object detector that uses Caffe.

5.2 Caffe

Caffe [29] is a state-of-the-art framework for developing applications based on deep neural networks. It provides various deep learning algorithms, reference models, and object classifiers. At its core, it is a C++ library, but it has bindings to MATLAB and Python. Caffe provides capabilities, in all of these languages, to train, test, fine-tune, and deploy deep learning models and object classifiers, with examples of each. RCNN uses Caffe to build its detection models.

The aspects of Caffe most important to RCNN are its capability to extract deep features from images and to do pre-training and fine-tuning on large datasets. Caffe provides ways to train on huge datasets, such as those provided in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) [33] or PASCAL VOC (Pattern Analysis, Statistical Modeling and Computational Learning, Visual Object Classes) [34] challenges, which have millions of training images and standard object categories. RCNN uses the feature extractors and training modules for at least two challenges: the PASCAL VOC 2010 challenge (with 20 object categories) and the ILSVRC 2013 challenge (with 200 classes) [28].

5.3 RCNN (Regions with Convolutional Neural Network Features)

RCNN is an object detector that combines the region proposals with CNN features. It has three modules:

1) generate region proposals independent of object categories, 2) extract features from each region, and

3) classify each region with class-specific linear SVMs.

First, RCNN uses a standard selective search algorithm [35] to produce category-independent region proposals within an image. Then, it extracts features from each of those regions using a CNN implementation of a feature extractor in Caffe [36]. When a test image is presented, RCNN first extracts about 2000 region proposals using selective search, extracts features from each, and scores each of the extracted features using an SVM trained for that specific class. For training, RCNN first pre-trains for each object category using Caffe's CNN library, and fine-tunes it, again using Caffe, for the specific purpose of object detection. Then, using the CNN features for each class generated by Caffe's pre-training, class-specific SVMs are produced as binary classifiers, i.e. they decide whether a given region is a certain class or not, based on a score threshold. In summary, with RCNN, we can input an image, and it outputs the regions in the image, classifying each of those regions into one of a pre-defined list of object classes/categories, with a confidence score for each region.
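The sketch below illustrates this three-module flow. It is not the actual RCNN or Caffe API; the region proposer, the feature extractor, and the per-class SVM scorers are assumed to be supplied by the caller.

def detect_objects(image, propose_regions, extract_feature, class_svms, score_threshold=0.0):
    """Sketch of RCNN's three modules (all callables are provided by the caller):
    1) propose_regions(image)      -> category-independent boxes (selective search)
    2) extract_feature(image, box) -> fixed-length CNN feature for the region
    3) class_svms[name](feature)   -> SVM score; keep boxes above the threshold."""
    detections = []
    for box in propose_regions(image):
        feature = extract_feature(image, box)
        for class_name, score_fn in class_svms.items():
            score = score_fn(feature)
            if score > score_threshold:
                detections.append((box, class_name, score))
    return detections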

5.4 Choosing Key frames for Detection

The input to the object detection component is the coreset tree, which provides the key frames of the

video. We decided to use the key frames at the lowest level, i.e. the key frames of the leaf nodes. This choice trades off speed against video coverage: the leaf level has the largest number of key frames, so detection takes the longest there, but it also gives the highest video coverage, and those key frames are the best approximation of the original video.
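A minimal sketch of collecting the leaf-level key frames, assuming the hypothetical CoresetNode structure sketched at the end of Chapter 4:

def leaf_key_frames(node):
    # Gather the key frames of all leaves below this node; these are the frames
    # that are actually passed to the object detector.
    if node.is_leaf:
        return list(node.key_frames)
    frames = []
    for child in node.children:
        frames.extend(leaf_key_frames(child))
    return sorted(set(frames))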

5.5 Output/Populating DB

Each of the leaf-level key frames is passed as input to the detection system. For each detected region in the frame/image, we get the region coordinates in the image (the x1, y1, x2, and y2 coordinates), the object class assigned to the region, and the confidence score. After running detections on each key frame, we augment the "regions" table in the database, which stores the information about all the regions detected in all the key frames in all the videos.
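As an illustration, a detection could be written to the regions table along these lines. This is only a sketch: the column names follow the fields described above, but the exact schema in our database differs.

def store_detection(db, video_id, frame, box, label_id, confidence):
    # db is a DB-API style cursor; one row is inserted per detected region.
    x1, y1, x2, y2 = box
    db.execute(
        "INSERT INTO regions (scene_id, frame, x1, y1, x2, y2, label_id, confidence) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
        (video_id, frame, x1, y1, x2, y2, label_id, confidence),
    )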

Chapter 6 Retrieval with Text-based Search

In this chapter, we describe how retrieval is done using the coreset tree and the database populated with

the information about detected regions, after the server gets the user query. We discuss the coreset

retrieval algorithm with preferential sampling, its analysis, and compare it with other alternative

retrieval systems.

6.1 Inputs

6.1.1 Database

After the video processing is done, the database is populated with the information about the paths of the

uploaded videos, their corresponding coreset trees, and all the regions in all the key frames that were

detected by the object detection component. As previously mentioned, each region contains information

about its coordinates in the image, the object detected in that region, and the confidence score.

6.1.2 Search Query Input

From the search query, the inputs for the retrieval algorithms are the search term, for example "guitar", the video file name, e.g. "schoolevent.mp4", and the frame range, for example "35 - 280". The frame range is the user's area of interest within the video. If the frame range is not provided as input, we assume the retrieval is over the entire video. If the video filename is not provided, retrieval is done over all the videos in the database.
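A small sketch of how these inputs could be normalized with the defaults just described (function and field names are illustrative, not the actual parser used by the server):

def parse_search_query(term, filename=None, frame_range=None):
    if frame_range is None:
        frame_range = (0, float("inf"))      # default: the entire video
    else:
        start, end = frame_range.split("-")  # e.g. "35 - 280"
        frame_range = (int(start), int(end))
    # filename=None means: search across all videos in the database
    return {"term": term, "filename": filename, "range": frame_range}

# parse_search_query("guitar", "schoolevent.mp4", "35 - 280")
# -> {'term': 'guitar', 'filename': 'schoolevent.mp4', 'range': (35, 280)}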


Given the database information and the query terms, the system will perform text-based retrieval for the

objects in the videos.

6.2 Coreset Retrieval Algorithm with Preferential Sampling

The primary retrieval method proposed in this thesis is an algorithm that uses the coreset tree to sample

the leaves in an any-time fashion. The sampling prefers more important frames. The coreset creation algorithm (see Chapter 4) selects the key frames from the children nodes and gives higher weight to key frames that are less blurry and have higher image quality. Therefore, by default, the importance function is a function of the blur and quality measures of the image. However, this importance function can be modified as desired, and the retrieval algorithm is agnostic to the nature of the importance function.

As the leaves are sampled, the matching regions within the frame span of those leaves are extracted from the database and the results are presented as output in an incremental fashion.

Figure 6-1: A simple example of a coreset tree corresponding to a video of 100 frames. Each leaf spans 25 frames. The key frames and frame span of each node are noted alongside the node.

For example, consider the coreset tree example from Chapter 4, shown again in Figure 6-1. If the search

range is 1-74, the sampling algorithm samples leaf nodes 1, 2, and 4 in some order, because these nodes cover the query range. Say it sampled node 4 first. Then we search for matching regions in the database in node 4's range, i.e. [51, 75], and add them to our regions list. Next, we repeat until the algorithm samples the remaining leaves, i.e. nodes 1 and 2. At some point, the algorithm signals that all required leaves have been sampled, and we have a complete list of regions.

6.2.1 Preferential Coreset Sampling Algorithm description

The algorithm's input is the coreset tree, the query text, and the query range, and the output is the database regions. It can best be described using two functions. One function traverses the coreset tree, samples a new leaf in every loop, and returns the detected regions within the leaf span; let's call it sample_leaf_n_retrieve_regions. During this traversal, another algorithm called preference_sample_child is called. This algorithm samples one of the children of a node, weighting the sampling based on the importance of the child nodes.

SAMPLE_LEAF_N_RETRIEVE_REGIONS

The retrieval algorithm is a modified version of the sampling algorithm introduced in [5]. It runs on a loop until all the leaves in the provided frame range (or query range) are sampled, and all the matching detections or detected regions within corresponding leaf spans are returned. It guarantees that no leaf is sampled twice. With each sampling of a leaf, it "yields" any matching regions from the span of that leaf.

Yielding means returning the results, while still preserving the state of the algorithm, making sure the algorithm is any-time.

The pseudocode is shown below.

def sample_leaf_n_retrieve_regions(coreset_tree, q_text, q_range):
    done_nodes = []
    done_leaves = []
    init_node = get_last_leaf_in_range(coreset_tree, q_range)
    v = init_node
    num_leaves_in_range = get_num_leaves_in_range(coreset_tree, q_range)
    going_down = False
    while True:
        node_out_of_range = is_node_out_of_range(v, q_range)
        node_done = is_node_done(v) or node_out_of_range
        node_too_big = does_node_cover_q_range(v, q_range) or is_root(v)

        if node_done and v not in done_nodes:
            done_nodes.append(v)

        all_leaves_sampled = num_leaves_in_range == len(done_leaves)
        if (node_too_big and node_done) or all_leaves_sampled:
            yield None  # END CONDITION, all nodes finished
            return

        p = sample_uniform([0, 1])
        parent = get_parent(v)
        if (p <= alpha and not node_too_big and not going_down) or node_done:
            v = parent
        else:
            # sample a child in a preferential order
            child = preference_sample_child(v, done_nodes, q_range)
            if child == v:
                # leaf, so retrieve regions from the database
                regions = get_regions_from_db(v, q_range)
                done_leaves.append(v)
                done_nodes.append(v)
                v = init_node
                going_down = False
                yield regions
            else:
                going_down = True
                v = child

First, an initial node (leaf) is chosen for the algorithm. The initial node is the last leaf in the coreset tree that is within the query range. In our example at the beginning of Section 6.2, the init_node would be leaf number 4. From that initial node, the algorithm (almost) randomly determines whether to try to traverse up one level of the tree or to try to go down. In one complete traversal from the initial node, up the tree, and down to a leaf, a new leaf is sampled and returned.

The algorithm has the list variables "done_nodes" and "done_leaves". An intermediate node is considered done if all of its successors that fall in the query range are done as well. A leaf node is considered done if it has been sampled for retrieval. There is another variable named "num_leaves_in_range", which is the total number of leaves that are in the query range, and therefore the number of leaves that need to be sampled. These variables help the algorithm figure out when all the required nodes have been sampled so that the loop can be terminated.

To sample a leaf, the algorithm starts with init_node. At each iteration, it has a current node, denoted by 'v' in the pseudocode, and the initial value of v is init_node. During each iteration, the algorithm updates

its 'v', either going to the parent or a child node. At the beginning of every loop, the current node is

checked for whether it is out of range, or is done (meaning all its children are done), or is too big

(meaning the frame span of the node is larger than or equal to the query range), or if all required leaves

are already sampled. If the node is done or out of range, it is added to the "done_nodes" list. Notice that

the out-of-range node is considered done, because we do not want the algorithm to traverse that node.

Similarly, if the node is too big and is already done, it means that all required leaves have been sampled.

There is another function to check whether all leaves have been sampled, which calculates the number of leaves in range according to the query range and the coreset leaf size, and compares it with the length of done_leaves. If, at the start of any iteration, all required leaves have been sampled, the

algorithm yields None and returns.

Next, consider traversing up and down the tree. With probability α, the next v will be the current v's parent, i.e. the algorithm will traverse upwards in the tree, unless the current v is the root node, or it spans a range larger than or equal to the query range, or the algorithm had previously decided to traverse downwards instead, or the node is "done", or it doesn't fall within the query range. Otherwise, the algorithm goes down the tree.

To go down, the algorithm needs to sample one of the leaves. It calls the preferential sampling algorithm, preference_sample_child(), which either returns one of the children of the current node that has not been previously sampled, or, if the node is a leaf, returns the node itself. If the node itself is returned, it is a leaf, so the algorithm retrieves all the matching regions from the database that lie in the frame span of the leaf, adds the leaf to the "done_nodes" and "done_leaves" lists, and re-

initializes all the variables to start another upward traversal from the init_node. If not, the sampled child becomes the 'v' for the next iteration.

PREFERENCE_SAMPLE_CHILD

This algorithm samples one of the children that have not been sampled before, with a higher probability of sampling one of the more important children. In Chapter 4 we discussed the key-frame selection process during the merging of two nodes. During the selection, the variability of key frames was considered, with the importance function determined by the blurriness and the quality of the image. The key frames with a higher overall score were selected and propagated upwards to the merged node. Therefore, it was more likely that more important (in this case, less blurry and higher quality) images were propagated upwards. Now, looking at the tree top-down, we can say that a merged node contains on average more important frames than either of its children, and whichever child node contributed more key frames is the more important of the two. The preferential sampling algorithm is based on that fact. It gives a higher sampling probability to the child that contributed the largest number of key frames to the parent, or in other words, the child that has the largest key-frame intersection with the parent. The probability is proportional to the number of key frames a child has in common with the parent.
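A sketch of preference_sample_child under these assumptions; it reuses the node fields from the earlier tree sketch and the is_node_out_of_range helper from the pseudocode above, and the exact weighting in our implementation may differ in detail.

import random

def preference_sample_child(node, done_nodes, q_range):
    # A leaf returns itself; the caller then retrieves its regions.
    if node.is_leaf:
        return node
    # The caller only descends into nodes that are not done, so at least one
    # candidate child exists here.
    candidates = [c for c in node.children
                  if c not in done_nodes and not is_node_out_of_range(c, q_range)]
    # Weight = number of key frames the child shares with its parent.
    weights = [len(set(c.key_frames) & set(node.key_frames)) for c in candidates]
    # Guard against a child that contributed no key frames at all.
    weights = [w if w > 0 else 1 for w in weights]
    return random.choices(candidates, weights=weights, k=1)[0]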

6.2.2 Algorithm Correctness

The retrieval algorithm has two properties - first, it samples a new, previously not sampled, leaf every time it samples, and second, it terminates, or in other words, it samples all the leaves in range.

Let's tackle the first claim, i.e. that a new leaf is sampled on every traversal up and down the tree. Consider any current node during the algorithm. If the node is done, i.e. all its successors are done, the algorithm prunes that path, i.e. it does not go down towards that node. Thus, it considers going down only

if there are some children of the node that have not been done. This fact proves that the algorithm can

only reach a leaf that has not been sampled before, because if the leaf had been sampled before, the algorithm would have marked it under "done_nodes", and therefore the algorithm would only traverse

upward when this node is encountered again.

The second claim is that the algorithm eventually samples all the leaves in the range. First, let's prove

that it doesn't sample any leaf outside the range. At every iteration, the algorithm checks whether the

current node is out of the query range. In that case, the node is considered done, and as a result the

algorithm traverses back to the parent. Therefore, at no point is it possible to sample a leaf that is out of

range.

Now, we prove all leaves in range are eventually sampled. It can be proven that there is a finite

probability to reach any leaf (L) in range, starting from the last leaf in range (Lend). We can see that at

every node, if the node has not been marked "done" before, i.e. its children are not all done, there is a probability of 1 - α of deciding to traverse down to the children. With the algorithm preference_sample_child,

there is a finite probability for each child to be sampled, as long as the child is not already done. This is true for all the nodes during the traversal, until we reach the leaf. Because nothing in the algorithm prevents any "not done" node from being sampled, there is a finite probability of each leaf being sampled.

The longest possible traversal path from Lend to L is twice the distance from the leaf level to the earliest node that covers the entire query range, looking bottom-up starting from any leaf in the range. Let's call that node Vtop. If the algorithm reaches Vtop, it is prevented from going above it because Vtop covers the query range. Similarly, if there is a leaf that is still left to be sampled, it is guaranteed that Vtop is not in the "done_nodes" list, and therefore there is a path of nodes that are not done from Vtop to L. In this way, L is guaranteed to be reached.

Worst Case Probability Bound for Reachability of a Leaf

We calculate the smallest possible probability of that longest path, i.e. the worst-case scenario for a leaf to be reached. Let's say that the distance, measured in number of nodes, from one of the leaves in range to Vtop is Dv. Let's assume there are N leaves within the query range; therefore Dv is O(log N) because the tree is binary. The probability of the upward traversal reaching Vtop is therefore O(log N × α), because the probability of going up at each node is O(α). Now, let's say that at each level of child-node sampling, the nodes in that path each have a finite probability O(p) of being sampled. As discussed above, the probability of sampling a child at each node differs according to the children, but it is finite. Then, combining everything,

P(sample L) = O(α · log N · p · log N) = O(α · p · log² N)

6.2.3 Algorithm Complexity

Because a new leaf is sampled in every traversal, and every traversal, borrowing from the analysis above, takes O(log N), the complexity of the algorithm is O(N log N). Here, N denotes the number of leaves in the query range.

6.3 Alternative Retrieval Algorithms

6.3.1 Direct Database Retrieval

This is the usual method of retrieving regions from the database. We take the query range, and retrieve

all the regions from the database that are in that query range. All the results are passed to the user at

once.

6.3.2 Uniform Sampling of Coreset Leaves

This algorithm is similar to the algorithm proposed in this thesis, in that it is also an any-time algorithm.

This algorithm samples the leaves in the query range, one leaf at a time, with uniform probability. As the leaves are sampled, the regions within those leaves are retrieved from the database as in the preferential sampling algorithm.

6.4 Comparison of various retrieval algorithms

All retrieval algorithms have advantages and disadvantages. Direct database retrieval is ideal for smaller databases, where users may not have to wait much for all results. However, if the application requires retrieval of large datasets from large databases, the coreset sampling algorithms win. If we compare the total time to retrieve all the results though, database retrieval likely beats both others, because others have to do multiple database queries as opposed to one, which will likely take longer. Similarly, database retrieval likely retrieves the data in the order that is in the database, which may be important for some applications. In our case, the order may correspond to the key frame number ordering in a certain video, which may be important if we are trying to retrieve a key frames-summary of the video.

With coreset sampling, the results are not retrieved in the same order as in the database, therefore the key frames will not be shown in the order they appear in the video.

Between the two sampling methods, uniform sampling takes less time to complete sampling all the leaves, because the time complexity is, on average, linear with the number of leaves in the range, i.e.

O(N). On the other hand, as we analyzed before, the preferential sampling algorithm has an extra log N factor. The preferential retrieval algorithm, however, is the best choice if we want the important frames to be displayed to the users as fast as possible.



Chapter 7 System Implementation

In this chapter, we describe the implemented objects retrieval system. The system was implemented and tested on a machine with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz processor running Ubuntu 14.04. Figure 7-1 shows an architecture diagram of the objects retrieval system with labels for the implementation tools and languages used in each of the components. The web application was written using Django [37], and the video processing was done in MATLAB.

Figure 7-1: System diagram of the search-objects module with implementation details. The tools and languages used to implement each system component are noted above the component.

Django Framework

Because the iDiary interface, as implemented previously, was a Django-based web framework [2],

Django was used to implement the web portion of the search-objects module. It includes the user-facing interfaces as well as some portions of the server, including the uploader, search parser, and retriever.

Django is a Python-based web framework that provides an architecture to build web user interfaces (UI).

Django also connects with a Python server that can parse requests, communicate with a database, and send responses back to the UI. Django was used for the implementation of the frontend part and the

"search parser" and "retriever" modules in the backend.

7.1 Frontend Web UI

The front end of the system is a web UI that the user interacts with. With the UI, the user can upload videos, search using textual queries, and visualize the results of the search. Most of the user-facing UI uses basic web languages such as HTML and CSS. Bootstrap [38] is used for aesthetic forms and buttons. JavaScript, jQuery, and AJAX are used as client-side scripting languages to process forms and display results for preferential coreset retrieval.

7.1.1.1 Upload and Search Interfaces

Figure 7-2: Web UI for the search-objects system to search and upload videos. The search form has fields for the search word/object, the file name, the frame range, the confidence threshold, the skip frames, and the retrieval method (coreset tree or database); a separate form uploads a new video file.

The main page of the website is shown in Figure 7-2. The two major web interfaces, the query/search interface and the upload interface, are both on the same UI as shown in the figure. The user can upload videos to the system with the "upload" module. As soon as the user uploads a video, the server starts processing it, adding the detections to the database. Thus user searches can proceed right away.

To search, the user submits a search form where he can specify the query text (e.g. "guitar"), along with other parameters such as the full or partial name of the video file to search in, the frame range that the search should span (e.g. "200-650"), the confidence threshold of the returned results (e.g. 1.0), the skip-frames, i.e. the minimum frame gap between the returned video frames, and the type of retrieval (preferential coreset sampling, uniform sampling, or direct database retrieval). The confidence score is a number given by the object detection module and reflects how confident the detector is about a certain detection. For RCNN (the object detector), the scores usually span the range [-4.0, 4.0], although the actual range of the scores is not mentioned in the paper [28]. Any float or integer number is a valid input in this field.

7.1.1.2 Retrieved Results Interface

When the user submits the form, he can see the returned frames of the video in the "retrieved results" interface. The UI of this interface for the search query "guitar" in some sample videos is shown in

Figure 7-3, and the UI for the query "computer" is shown in Figure 7-4. There is just one object class matching the "guitar" query, namely "guitar". However, two object classes, namely "computer keyboard" and "computer mouse" match the query "computer". The results are shown as a thumbnail array for each object class, where each thumbnail contains a key frame (image), the green rectangles in the image denoting the regions of the detected object, and a brief description below each image. The description includes the frame number, the maximum confidence score among the regions of the object class detected in that image, and the file name of the video which contains that frame.

Figure 7-3: An example retrieval for the search query "guitar". Results appear as image thumbnails containing rectangles around the detected regions.

Figure 7-4: An example retrieval for the search query "computer". Two object categories match the query.

For direct database retrieval, the results are displayed on a new page all at once, while for coreset retrieval (both preferential and uniform sampling retrievals), the results are added to the same page without refreshing or redirecting to a new page, with the help of jQuery. This is done to seamlessly show the incremental results to the user without refreshing the page.

7.2 Django Server

The Django server contains three modules - uploader, search parser, and retriever. The uploader module was implemented as a standard Django file-upload system, similar to the one described in [39], and search parser is provided by Django. The most important component of the Django server is "retriever", which encapsulates different kinds of retrievers we have implemented.

7.2.1 Retrievers

When the user searches with a query text and the other information described above, the "search parser" parses the search and calls the "retriever" module to retrieve the results. We implemented three types of retrievers: 1) the "DB retriever", 2) the "coreset uniform retriever", and 3) the "coreset preferential retriever".

In Chapter 6, we discussed all these retrieval algorithms and saw that they return the regions from the database. The direct database retrieval returns all the regions at once, while both coreset retrieval algorithms return the regions in an any-time manner. However, there is more to the retrievers: these regions have to be passed to the front end for the UI to display the images to the user. For the sake of implementation design, both coreset tree-based retrievers can be put in one module, namely the "coreset retriever".


Figure 7-5: System diagram depicting the two different retrieval systems, coreset and database retrieval. In coreset retrieval, the client continuously polls for the results and the retriever incrementally provides them. The DB retriever receives one client request and returns all results at once.


As shown in Figure 7-5, the DB and coreset retrieval systems are different. The DB retriever receives one request from the client and sends one response back with all the retrieved results, while the coreset retriever receives continuous requests from the client every t milliseconds and sends incremental responses back to the client for each request. Therefore, on the client side, the DB search is implemented as a regular non-AJAX HTTP GET request, while the coreset search is implemented with jQuery and AJAX, whereby an AJAX request is sent to the server with the form details every few hundred milliseconds (t), so-called "polling". The client regularly polls the server for additional data and displays the results in the UI.

We discuss the details of these two kinds of retrieval systems next.

7.2.1.1 DB retrieval

First, the regions retrieved from the database are filtered according to the client's request. For example,

if skip-frames is 20, we make sure that no two adjacent frames for the same object class are less than

20 frames apart. Similarly, we ignore the regions that are of lower confidence than the threshold specified by the client.
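A sketch of these two filters is shown below; the dictionary keys are illustrative, not the exact field names used in our code.

def filter_regions(regions, confidence_threshold, skip_frames):
    # Keep only regions above the confidence threshold, and enforce the minimum
    # frame gap (skip_frames) separately for each object class.
    kept, last_frame_for_class = [], {}
    for r in sorted(regions, key=lambda r: r["frame"]):
        if r["confidence"] < confidence_threshold:
            continue
        last = last_frame_for_class.get(r["label"])
        if last is not None and r["frame"] - last < skip_frames:
            continue
        last_frame_for_class[r["label"]] = r["frame"]
        kept.append(r)
    return kept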

Figure 7-6: Process of going from the retrieved regions to displaying them on the UI.

Second, we save each of the final filtered frames, with the region rectangles drawn in them, in the server's file system. Each of the regions has the information about which video and frame number it belongs to, along with the corner coordinates of the region, the name of the object class, and the confidence score. This allows the system to extract the corresponding frame from the corresponding video, draw all rectangular regions for the same object class on top of that frame, with different color intensity according to the confidence of the detection, and save all of that as one image. Figure 7-7 shows an example of one such image saved in the file system. This image contains the regions for "computer keyboard" detections.

Figure 7-7: Image saved in the file system for the detection of the object category "computer keyboard".

Third, we build a data structure to send enough information to the UI so that it can display the images with descriptions. In that data structure, we store the paths of the saved images, along with the name of the object class, the maximum confidence among the regions in that image, and the name of the video file that the frame was extracted from.

Finally, the Django web server sends that data structure as a response to the client, which is then able to display the images saved in the server's file system. Figure 7-6 above shows this process.

7.2.1.2 Coreset Retrieval

The process of responding to the UI for coreset retrieval is similar to DB retrieval. The only difference is that the server goes through that same cycle multiple times. When a client sends a search request with "coreset" as the retrieval type, a thread starts in the background without affecting the UI. The thread does a similar process of taking regions, saving images, and creating a data structure to send to the client, but does that process for each of the sampled leaves one by one. Therefore, the data structure to send to the client keeps growing as time goes on, but there is always something to respond with. The client request is identified by Django's session ID. The data structure for each client is mapped to the session ID, so that two clients won't try to change the same data structure or interfere with one another's results. Because the client sends multiple requests in a certain time interval, we can identify the client with the session ID, and send the corresponding data structure back to the UI.

When the sampling algorithm is done, we also send the "done" flag to the client so that it stops sending requests. At the same time, we clear the data structure of that client in the server.
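The following is a minimal sketch of that polling contract, keyed by session ID. It is not the actual Django view, and it simplifies one thing: here each poll advances the sampler by one leaf, whereas our implementation runs the sampler in a background thread and each poll only reads the accumulated structure.

results_by_session = {}

def poll_coreset_results(session_id, sampler):
    """sampler is the generator returned by sample_leaf_n_retrieve_regions(...)."""
    batch = next(sampler, None)                 # one more sampled leaf, or None when finished
    store = results_by_session.setdefault(session_id, {"items": [], "done": False})
    if batch is None:
        store["done"] = True                    # tells the client to stop polling
    else:
        store["items"].extend(batch)
    response = {"items": list(store["items"]), "done": store["done"]}
    if store["done"]:
        results_by_session.pop(session_id, None)  # clear the client's data structure
    return response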

7.2.1.3 Comparison between DB and Coreset Retrieval

We discussed the comparisons among different retrieval algorithms in Section 6.4. Here we focus on additional differences as a result of the specifics of implementation.

Because of the nature of the implementation, the coreset tree-based algorithms may show different, and possibly more, results than DB retrieval. This is because the process of retrieving regions takes place only once with DB retrieval, but multiple times with coreset retrieval. Therefore, the entire region set is filtered with DB retrieval, but with coresets, smaller sets of regions are filtered at a time. This means that the filter does not really apply to all of the regions but only to a subset of the regions each time, which may cause more or different key frames to be retrieved. For example, if the skip-frames field in the search form is 15, and there are two frames with the same object closer to each other than 15 frames but in different leaf nodes of the coreset tree, both will be retrieved with coreset retrieval. However, with DB retrieval, this will not happen. In the same example, it is possible to get different key frames if one key frame is skipped in DB retrieval and another in coreset retrieval. For instance, for some query, DB retrieval may return the key frames 1, 17, 35, and 52. However, let's say the coreset leaves span 1-25, 26-50, 51-75, and so on.


Then, it is possible to get the following frames from coreset retrieval: 1 and 17 (from the first leaf), 28 and 48 (from the second), and 52 (from the third).

The experiments for the comparison are conducted and described in Chapter 8.

7.3 Video Processor

The video processor component takes the path of the uploaded video as an input, and produces - 1) a

coreset tree and 2) the database augmented with the information about the objects detected in the coreset

tree key frames, among other information. The video processor component is implemented in

MATLAB. This component contains two modules: coreset tree creator, and object detector.

7.3.1.1 Coreset Tree Creator

This component was previously implemented in MATLAB, and the system is described in [5]. The

implementation of the video segmentation and coreset tree creation was provided for use in this system.

7.3.1.2 Object Detector

The object detector we used was RCNN, which was based on the deep learning framework, Caffe.

RCNN is provided as an open source GitHub project [40]. There are two RCNN models provided,

PASCAL VOC 2010 and ILSVRC 2013, having 20 and 200 object categories respectively. We chose ILSVRC 2013 for our system because we wanted more categories to cover the various kinds of videos that the user may input into this system. RCNN is implemented in MATLAB, and because the coreset creation was implemented in MATLAB already, RCNN was readily integrated into the system.

7.4 PostgreSQL Database

The PostgreSQL database stores the information about the videos, coresets, object categories, and the regions detected as objects. PostgreSQL was used because it was used in the original iDiary system [2].

The database has one major table named "regions", which stores the information about the detected regions. It also has 3 other tables: 1) the table for object classes, 2) the table for video paths and general video properties, and 3) the table for coreset tree information. The regions table is linked to all the other tables.
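A simplified, hypothetical sketch of such a schema is shown below; the actual table and column names in our database differ (see Figure 8-3 for the real "regions" columns).

# Illustrative DDL only; in practice the tables were created via the Django/PostgreSQL setup.
SCHEMA_SKETCH = """
CREATE TABLE videos        (id SERIAL PRIMARY KEY, path TEXT, num_frames INTEGER);
CREATE TABLE coreset_trees (id SERIAL PRIMARY KEY, video_id INTEGER REFERENCES videos(id), tree_path TEXT);
CREATE TABLE labels        (id SERIAL PRIMARY KEY, name TEXT);
CREATE TABLE regions       (id SERIAL PRIMARY KEY,
                            scene_id   INTEGER REFERENCES videos(id),
                            frame      INTEGER,
                            x1 INTEGER, y1 INTEGER, x2 INTEGER, y2 INTEGER,
                            label_id   INTEGER REFERENCES labels(id),
                            confidence DOUBLE PRECISION);
"""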

Chapter 8 Experiments

In this chapter we describe the experimental evaluation of the objects retrieval system for video data.

We also describe the evaluations of the system components, and the retrieval algorithms.

To verify the correctness of the system, we create a synthetic video where the segmentation and retrieval ground-truth is built-in. We describe the results of each step in the end-to-end process. Then, we conduct experiments to measure the time required by each of the components of the system and describe how performance varies according to the size and the quality of the videos. Next we conduct experiments with different retrieval methods: direct database retrieval, preferential sampling coreset retrieval, and uniform sampling coreset retrieval. Through empirical results, we analyze the strengths and weaknesses of these methods, and show that the retrieval system proposed in this thesis is efficient and displays more important frames earlier than other systems. We use both real and synthetic videos to showcase the results.

All the experiments were conducted on a desktop machine with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz processor running Ubuntu 14.04. The CPU was used for coreset creation, and the GPU was used for the object detections during each of the experiments.

8.1 End-to-end Experiment on a Synthetic Video with Known Ground Truth Segmentation

For this experiment, we created an 8-second synthetic video using 8 still images, each spanning 1 second and 30 frames, with the entire video containing 240 frames. Figure 8-1 shows the 8 still images used for the video. The resolution of the video was 720x480 and the video was 5.7 MB. This video was synthesized so that the segmentation ground truth was known. The video was used for the baseline correctness evaluation of the system.

Figure 8-1: Eight still images used to make the small synthetic video for the system evaluation experiment. The video contains segments of multiple frames of each of these images.

8.1.1 Coreset Tree Creation Module

The synthetic video was provided as input to the coreset tree creator for video segmentation,

compression, and coreset tree creation, with the leaf size parameter of the tree as 40. Because there were

only 8 images in the video, we expected 8 segments of the video, with the root containing all 8 different

images. We expected the overall number of key frames in the coreset tree to be much smaller than the number of frames in the original

video. We expected 6 leaves, each containing two key frames/images, since each leaf should span 2

images.

Results

After the video is processed by the coreset-creation module, a coreset tree is created, as shown on the left side of each of the sub-figures in Figure 8-2. The leaf size for the coreset was set to 40, so there are 6 coreset leaves, each spanning 40 frames, i.e. [1-40], [41-80], ..., [201-240]. In the figure, for leaf 1, there are 3 key frames, namely 17, 34, and 38. Here, 34 and 38 are the same image. Because each key frame is a representative of a coreset segment, we can tell that leaf 1 had 3 segments. The root of the coreset tree, as shown in Figure 8-2 (a), contains all of the 8 images, although one of them is repeated.

Figure 8-2: a) Key frames and other descriptions for the first leaf node of the coreset tree, and b) for the root node of the tree. The coreset trees are shown towards the left in both a) and b), with the relevant node marked with a black circle.

Analyses

Contrary to our initial expectation, we saw in the results that we had more segments in the video than 8, the number of still images in the video. This is due to slight imperfections of the video-creation process (which may slightly change the pixels when the same image is appended again); as a result, the features extracted from those images are slightly different as well, creating more segments than expected. However, the most distinct (in feature space) images were propagated upwards during coreset creation because of FPS (Farthest Point Sampling), described in Chapter 4. Therefore the root has all 8 images (with one duplicate, which is expected because we set a node to have 9 key frames if possible).

Looking at compression, there are 20 key frames total in the leaf-level of the coreset tree, which is 12

times less than the number of frames in the original video. The information of the video is preserved,

because no image is lost during the upward propagation in the coreset tree. In the root, we can see in

Figure 8-2 (b) that all the images in the video are stored as the key frames.

8.1.2 Object Detector

Given the coreset tree, the object detector detects objects in each key frame of the coreset tree. We have

the ground truth on the detections as well. The following table shows all the objects we expect (objects

in full view in the image), out of the 200 object categories that the detector should recognize, for each

image.

Table 8-1: The expected objects in each of the 8 images in the video.

Images | Expected Object Detections
Image1 | Guitar, chair, table, laptop, lamp, tv or monitor, person, filing cabinet
Image2 | Chair, tie, person
Image3 | Computer keyboard, computer mouse, water bottle, table, tv or monitor
Image4 | Chair, computer keyboard, computer mouse, tv or monitor, table
Image5 | Hat with wide brim
Image6 | Car
Image7 | Person, bicycle
Image8 | Person, bicycle, sunglasses

Results

The following table shows the object detections with confidence scores for each image, returned by the object detector. We mention the detections above the confidence score of 0, because that is what the default RCNN implementation [40] deems useful. We also show results with the confidence score threshold of 0.5, because this is a score empirically determined by us to be more useful in our videos.

The detections that were expected (True Positives, aka TP) are marked green, the ones that were not expected (a mistake in either the object category or the position of the object in the image; False Positives, aka FP) are marked in red, and the ones that were expected but not detected (False Negatives, aka FN) are marked in yellow. These were marked by a human judge.

Table 8-2: False positives (FP), false negatives (FN), and true positives (TP), as marked by a human judge, for the 8 images of the small synthetic video at confidence thresholds 0 and 0.5. Each column shows the counts at thresholds (0, 0.5) respectively.

Images | FP (0, 0.5) | FN (0, 0.5) | TP (0, 0.5)
Image1 | 3, 0        | 4, 5        | 5, 3
Image2 | 1, 0        | 1, 1        | 2, 2
Image3 | 7, 1        | 3, 3        | 3, 3
Image4 | 4, 3        | 0, 3        | 5, 3
Image5 | 7, 3        | 0, 0        | 1, 1
Image6 | 8, 0        | 0, 0        | 1, 1
Image7 | 0, 0        | 0, 0        | 1, 1
Image8 | 1, 0        | 0, 1        | 3, 2

Building on the table above, the following are the precision and recall numbers for all images for the two confidence thresholds, 0 and 0.5. We calculate precision and recall with these standard formulae:

Precision = TP / (TP + FP),   Recall = TP / (TP + FN)

Table 8-3: Precision and recall numbers for the detections in each image for confidence thresholds 0 and 0.5. Increasing the confidence threshold, we saw an average increase in precision but a decrease in recall.

Images | Precision @ 0 | Precision @ 0.5 | Recall @ 0 | Recall @ 0.5
Image1 | 0.625    | 1      | 0.555556 | 0.5
Image2 | 0.666667 | 1      | 0.666667 | 0.666667
Image3 | 0.3      | 0.75   | 0.5      | 0.5
Image4 | 0.555556 | 0.5    | 1        | 0.625
Image5 | 0.125    | 0.25   | 1        | 1
Image6 | 0.111111 | 1      | 1        | 1
Image7 | 1        | 1      | 1        | 1
Image8 | 0.75     | 1      | 1        | 0.75
Avg.   | 0.516667 | 0.8125 | 0.840278 | 0.755208
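As a quick sanity check of the formulae above, the Image1 counts at threshold 0 from Table 8-2 (TP = 5, FP = 3, FN = 4) reproduce the first row of Table 8-3:

def precision_recall(tp, fp, fn):
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(5, 3, 4))   # -> (0.625, 0.5555...), matching Image1 at threshold 0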

Analyses

We can see that the detections are generally acceptable if we select a good confidence threshold. For a threshold of 0, we saw a lot of false positives in almost all images. When we increased the threshold to 0.5, many of the false positives were eliminated, but in some images the number of true positives declined. As a result, the precision score, which indicates the fraction of detected objects that were actually objects in the image, increased. However, the recall score, which indicates the fraction of the objects present in the image that were actually detected by the detector, decreased. Therefore, there is a trade-off when choosing the confidence threshold. For our application, we chose the threshold of zero for storing in the database, but allowed users to input the threshold themselves while searching for

the objects.

8.1.3 Database Update

After the entire backend processing is done, the database is populated with the information about the

detections and coresets. Figure 8-3 shows a part of the augmented "regions" table in the database with

the detected regions.

Figure 8-3: A portion of the "regions" table in the database, showing its columns (id, frame, x1, x2, y1, y2, features, label_version, label_id, scene_id, confidence) and example rows.

Description of the data

Let's look at a specific row (id = 23284) in the database. The first field, id, is the primary key of the row. The field "frame" (44) refers to the frame number in the video, and the "scene_id" (378) field in the second-to-last column is the id of the video. This field links to the table of information about the uploaded video. x1, y1, x2, and y2 refer to the coordinates of the detected object. The "confidence" field is self-explanatory, and the label id refers to the unique id of the object category.

This row corresponds to the "guitar" object from image 1.

8.1.4 Retrieval

Now that the database is populated, search queries can be performed, knowing exactly what results to

expect. We searched for "guitar" and "person" with all retrieval types, i.e. database retrieval, uniform sampling retrieval, and preferential sampling retrieval. We used 0.0 as the confidence threshold for "guitar" and 1.0 for "person", with skip-frames 15 for both. We expected one image to be retrieved for "guitar" (for all retrieval types), and 4 different images retrieved for "person". For "person", we expected random ordering for the coreset tree-based retrievals and ordered results for database retrieval. We also expected more results for "person" with the coreset tree-based retrievals.

Results

Figure 8-4 a) shows the image with the guitar retrieved, when "guitar" was searched, and figure b) shows the images when "person" was searched. Figure 8-5 and Figure 8-6 show the results for the same query with uniform sampling, and preferential sampling respectively.

Figure 8-4: Database retrieval of two queries: "guitar" (a) and "person" (b).

Figure 8-5: Uniform sampling retrieval of the queries "guitar" (a) and "person" (b).

Figure 8-6: Preferential coreset sampling retrieval results for the queries "guitar" (a) and "person" (b).

Observations/Analyses

As expected, for "guitar", one image was retrieved, with the detection and confidence (0.7) we expected, for all retrieval methods.

The query "person" also showed the expected results. The database retrieval has ordered frames (as in the database), while the two other retrievals have unordered frames, with more results than the database retrieval. However, we cannot yet see a difference in results between the preferential retrieval and the uniform sampling results. We will see differences in the further experiments.

8.2 Timing Experiments

We did three kinds of timing experiments, all on videos taken in natural environments. The first experiment used three videos filmed in our daily environment. The second experiment tested the timings of the system when increasing the size (length) of the videos, keeping the resolution of each video constant. The third experiment aimed to characterize timing when increasing the resolution of the video, while keeping the length of the video constant.

Metrics for Timing Experiments

We used two timing metrics for all the timing experiments, to analyze each component of the system as well as the entire end-to-end system. The first metric is time (in seconds), and the second metric is time per frame (also in seconds) calculated by dividing absolute time by the number of frames in the video.

Absolute time is useful for gauging how long the system takes on different kinds of videos, while time taken per frame lets us intuitively compare timing between videos that would be non-intuitive to compare with total time alone. For example, a small video might take a much shorter time to complete processing than a larger video (seconds vs. hours), but with the time-per-frame metric, we get comparable units. However, notice that this metric is not the average time taken for each key frame, but rather a normalized metric which indicates the time taken per frame of the original video. This may be misleading for the object detection module, where the detections are performed on key frames, not the original frames. The average time taken on a key frame would be ~10 times more than this metric, because there are 9 key frames chosen per 100 original frames.


8.2.1 Timing for the Videos Taken in Natural Environments

For this experiment, we took three videos of various sizes and qualities filmed in a natural environment.

We recorded the time taken by each component of the system for each of the videos.

8.2.1.1 Data

Video1 was a short HD-quality video taken inside a house. This video spanned 3 minutes and 21 seconds, with a frame rate of approximately 30 fps, totaling approximately 6000 frames. The resolution was 1920 x 1080, and the video was approximately 431 MB. This video was recorded using a Samsung

Galaxy Note 4.

Video2 was a medium-length, medium-quality video, which contained segments in a house environment, an outdoor environment (moving from the house to the office), and an office environment.

It spanned approximately 10 minutes, with a frame rate of approximately 30 fps, totaling approximately

18000 frames. This video was recorded using a Samsung Galaxy Note 4, but was down-sampled to the resolution of 1280 x 720. The size of the video was approximately 1 GB.

Video3 was a long, medium-quality video taken on a casual walk around the MIT campus, containing scenes inside different buildings and outside on the streets. It spanned approximately half an hour and contained approximately 60,000 frames. This video was recorded using a GoPro Hero 4 camera, had a resolution of 1280 x 720, and was approximately 2 GB.

8.2.1.2 Methodology

We provided the three videos as input to our system and logged the times taken by the following system components - video upload, coreset creation, object detection, database insertion, and end-to-end processing. For coreset creation, the leaf size parameter was 100 for all of these videos.

8.2.1.3 Results

Table 8-4 shows the absolute time (in secs) and time taken per frame for various kinds of videos and various components.

Table 8-4: Time taken by various components of the system for the three videos. For each video and each component, the absolute time taken (in seconds) and the time taken per frame (in seconds) are shown. Video1: short, 1080p, 431 MB, 6000 frames; Video2: medium, 720p, 1 GB, 18000 frames; Video3: large, 720p, 2 GB, 60000 frames.

Component          | Video1 total (s) | Video1 per frame (s) | Video2 total (s) | Video2 per frame (s) | Video3 total (s) | Video3 per frame (s)
Video upload       | 0.81   | 0.00014  | 1.75   | 9.7222E-05 | 4.48   | 7.46667E-05
Coreset creation   | 10220  | 1.70333  | 24602  | 1.36677778 | 109439 | 1.823983333
Object detection   | 38800  | 6.46667  | 33200  | 1.84444444 | 102000 | 1.7
Database insertion | 1350   | 0.225    | 2415   | 0.13416667 | 5240   | 0.087333333
End-to-end         | 50269  | 8.37817  | 60499  | 3.36105556 | 218346 | 3.6391

8.2.1.4 Discussion

As we can see in Table 8-4, the background processing of the videos takes a long time. Video upload and database insertion are done in a reasonable amount of time (the video upload took about 4 seconds for the largest video, while DB insertion generally took less than 1 second per key frame). On the other hand, the coreset creation and object detection are much slower. For detection, we see that the time taken per frame can be up to ~7 seconds, which seems to be the biggest bottleneck in the system. As mentioned previously, this means that detection on one key frame takes about 7 x 10 = 70 seconds, because the detections are only done on key frames, while the time-per-frame metric divides by the total number of frames of the video. However, the object detection system is an external, plugged-in tool, whose time can decrease with faster detectors. Similarly, given that the detections were done only on the key frames, this system is an order of magnitude faster than running detection on the original video.

We notice some interesting things. While the absolute times for each component are generally increasing with the size of the video, time per frame is not. Time per frame for object detection is the highest in the smallest video, because the resolution was the highest. We believe that it was because the object detection algorithm had to scan more pixels for finding the region proposals for potential objects.

Also, there were more region proposals per frame because the frame size was larger compared to the other videos, and therefore there were more detections. This increased the detection time and the database insertion time. In addition, these timings may change depending on the general CPU load. Because frames with larger resolution require more processing for object detection, this may also affect the insertion times.

Thus, we noticed different trends according to size and quality. Further experiments examine the effect of video size (length) and video quality on timing of the system.

8.2.2 Timing vs. Video Length

This experiment was run to measure how time taken by the system increases with the length of the

video, keeping the video quality constant.

8.2.2.1 Data and Methodology

In this experiment, we took 4 different videos whose resolution was fixed to be 1280 x 720. The other

quality measures such as fps (frames per second), bitrate, and so on were fixed as well. All videos were taken with a Samsung Galaxy Note 4 device, and down-sampled to 1280 x 720 with the same software.

The videos were of length 10 seconds, 1 minute, 5 minutes, and video2 above, which was ~10 minutes.

The videos of different lengths were run through the system and the end-to-end times were recorded.

8.2.2.2 Results and Discussion

Figure 8-7: Line graph showing the trend of processing time vs. the length of the video for 4 different videos. The time taken increases with the increasing length of the video.

Figure 8-8: Time taken per frame with increasing length of the videos but the same quality. The figure shows that the time per frame is very similar (around 3 seconds) for all videos, implying that time per frame does not increase if the quality is the same.

Figure 8-7 shows the trend of total end-to-end processing time in seconds (y-axis) with increasing length of the videos (x-axis). It shows, not surprisingly, that the end-to-end time increases linearly with the length of the video. Figure 8-8 shows the same graph, but the y-axis is time taken per frame. Here, we can see that all four videos have an end-to-end processing time per frame of ~3 seconds. This means that if the quality of the videos is the same, each frame takes a similar amount of processing time. We can see a slight increase in this graph because of overheads such as the merging of coreset nodes, which increases exponentially (not linearly) with the increasing number of leaves of the coreset tree. However, the overhead is too small to be significantly visible.

8.2.3 Timing vs. Video Quality

This experiment shows the trend of time with the increasing video quality. We keep the length of the video constant.

8.2.3.1 Data and Methodology

For this experiment, we took Video1 mentioned above. It is a video of length 3 minutes and 21 seconds, originally of resolution 1920x1080 and 30 fps. We created three more videos by down-sampling this video to each of the following three resolutions: 1280x720, 800x600, and 480x320. We then ran each of those videos through our system and recorded the end-to-end processing time.

8.2.3.2 Results and Discussion

Figure 8-9: End-to-end time with increasing video resolution but fixed length.

The trend in Figure 8-9 is greater than linear, which makes sense. When we go to a higher resolution, the number of pixels that the system has to process doesn't increase linearly but quadratically. Therefore, the processing time increases accordingly as well.

8.3 Retrieval Experiments

Retrieval experiments were conducted on two kinds of data: 1) data produced by processing the videos described in Sections 8.1 and 8.2, and 2) synthetically produced data. Here, synthetic data refers to the database and the coreset trees created with MATLAB scripts that generate synthetic frames and synthetic detections. The synthetic data was created in order to be able to generate data worth weeks or months of video, which is impractical with real videos. Also, in the real pipeline, we don't explicitly know the importance score of the frames, because the algorithm internally calculates the blur and the quality of the images, and it is hard to tell the quality apart with our eyes. With synthetic data, we can assign the importance score to each frame and verify whether more important frames are being sampled during the retrieval.
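A sketch of the idea behind this synthetic data generation is shown below; our actual scripts are in MATLAB, and the names, value ranges, and distributions here are illustrative only.

import random

def make_synthetic_detections(num_frames, labels, detections_per_frame=2, seed=0):
    # Assign each synthetic frame a known importance score, so that the retrieval
    # order can later be checked against this ground truth.
    rng = random.Random(seed)
    rows = []
    for frame in range(num_frames):
        importance = rng.random()                    # ground-truth importance of the frame
        for _ in range(detections_per_frame):
            rows.append({
                "frame": frame,
                "label": rng.choice(labels),
                "box": (rng.randint(0, 600), rng.randint(0, 400), 640, 480),
                "confidence": rng.uniform(-4.0, 4.0),  # RCNN-like score range
                "importance": importance,
            })
    return rows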

In these experiments, we compare preferential sampling coreset retrieval with two other alternatives: 1) database retrieval, and 2) retrieval with uniform sampling rather than with the sampling algorithm used for coreset retrieval.

8.3.1 Retrieval Experiments on Real (Non-Synthetic) Data

These experiments are based on the database created as a result of running all the different videos mentioned above in Sections 8.1 and 8.2, including the small synthetic video of 8 seconds. There are ~70,000 rows in the regions table, accumulated as a result of processing all the aforementioned videos. The regions table is the biggest table in the database, and it is where all the retrievals primarily come from.

8.3.1.1 Metrics for Comparison

The primary metric for comparing retrieval methods on the real data is the percentage of total relevant frames with detections (images) retrieved as time increases, until all relevant frames are retrieved.

This retrieval time is the time taken by the server, starting when it receives the query from the client and ending when it returns a set of results to the client. For database retrieval, all the detections are retrieved at once, hence we look at the total time taken. In the server, we record the number of images retrieved and the time taken to retrieve those images for every search request. Because the coreset retrieval and uniform-sampling retrieval receive multiple client requests at regular time intervals for the same search, the number of detections retrieved and the time taken for that retrieval are logged multiple times for one search, until all detections are returned. Here, we compare the percentage of total detections returned at each interval.

Ideally, we would weight the frames returned by their importance score, but because it is difficult to extract an importance score from the real frames, as mentioned above, we use that metric only for the experiments on the synthetically created data, described in Section 8.3.2.
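A minimal sketch of how such a percent-retrieved curve can be computed from the server-side logs is shown below; the log format here (one elapsed-time/cumulative-count pair per poll) is an assumption for illustration, not the actual logging schema of the system.

```python
# Hypothetical per-search log: (elapsed seconds, cumulative images retrieved),
# recorded each time the server answers a poll for the same search.
log = [(0.5, 12), (1.0, 25), (1.5, 33), (2.0, 41), (2.5, 47)]
total_relevant = 47  # total frames with detections for this query

# Percentage of all relevant frames retrieved at each logged time point.
curve = [(t, 100.0 * n / total_relevant) for t, n in log]
for t, pct in curve:
    print(f"{t:4.1f} s: {pct:5.1f}% of relevant frames retrieved")
```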

8.3.1.2 Methodology

The experiments were done on the search query "car". The "skip-frames" parameter was 15, which means the retrieved results were (mostly) more than 15 frames apart from each other. Mostly, because as discussed in Section 7.2.1, it is possible for the sampling algorithms to return results that are within 15 frames of each other in the video. The confidence threshold was 1.0. The retrieval was done on the largest video file (~35 minutes and ~60,000 frames), i.e. video 3 above. Then, during the retrieval, as described in the metrics section above, the percentage of retrieved images is recorded at discrete time intervals for each of the retrieval systems.
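As a concrete illustration of the "skip-frames" constraint, the check below looks at how far apart the returned frame indices are; the indices here are made up, and, as noted above, the sampling algorithms can occasionally violate the spacing.

```python
SKIP_FRAMES = 15  # minimum desired spacing between returned frames

# Hypothetical frame indices returned for the query "car", in retrieval order.
returned = [102, 260, 277, 430, 431, 598]

ordered = sorted(returned)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
violations = [g for g in gaps if g <= SKIP_FRAMES]
print(f"gaps between consecutive returned frames: {gaps}")
print(f"{len(violations)} pair(s) not more than {SKIP_FRAMES} frames apart")
```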

8.3.1.3 Results

Figure 8-10: Result from the retrieval experiment on real data when querying "car" in a large video taken in a natural environment. The x-axis is increasing time, while the y-axis is the portion of frames retrieved from the server. The red star is for DB retrieval, the red line is for uniform sampling, and the blue line is for coreset sampling. We see that DB retrieval returns all results at a certain point in time (11 seconds), while the uniform and coreset sampling systems keep retrieving results in an incremental fashion. Uniform sampling retrieves more results at any given point than coreset sampling. It seems that preferential coreset sampling is the worst of all, but once we take into account the importance of the frames, we will see its advantage (discussed later in Section 8.3.2).

Figure 8-10 shows the results of the experiment. The x-axis is time, with discrete time intervals, and the y-axis is the percentage of frames with detections retrieved at that time. Here, the '*' denotes the point for the database retrieval, because it returns all the images at a certain point in time, in this case 11 seconds. The red line is uniform sampling, and the blue line is coreset sampling.

8.3.1.4 Discussion

The figure shows that database retrieval beats the other two in time to retrieve 100% of the results. However, it takes multiple seconds (~11 seconds) to retrieve those results, during which time the user has to wait patiently. As previously mentioned, the tolerable wait time for web users peaks around 2 seconds. Therefore, database retrieval may not be the best idea. Both of the other retrieval methods start retrieving results within 2 seconds.

Between uniform sampling and preferential coreset sampling, we see that fewer results are retrieved at each time interval with coreset sampling, and the time taken to retrieve all the results is higher as well. This is because the importance of the frames is only implicit in the real data, so we do not see the advantage of preferential sampling here. In the following sections, we synthetically assign importance to the frames, and we see different results.

8.3.2 Retrieval Experiments on Synthetic Data

We carried out similar experiments for synthetic data as for the real data, but the metric was a bit different. After we queried for an object, we looked at the importance of frames retrieved as time goes on, rather than the absolute number of frames retrieved.

8.3.2.1 Synthetic Data Creation

To create synthetic data, we started out with a finite number (5) of synthetically created images (created with MATLAB), each assigned a distinct importance score uniformly in the range (0,1]. The assignments were: red (0.2), yellow (0.4), green (0.6), blue (0.8), and purple (1.0). We used these images to create a video of any length by taking the number of frames as an input parameter. Each segment consists of one image that spans multiple frames, and a different image is used for the next segment. Segment length is determined by sampling from a geometric distribution with parameter p = 0.025, so that on average the segment length is around 40, i.e. 1/p (the expectation of the geometric distribution). We sampled from a geometric distribution to make the video closer in nature to real videos, where the segment lengths are not uniform. There are 4 object categories, i.e. the first synthetic object, the second synthetic object, the third synthetic object, and the fourth synthetic object.

For each image, we create two synthetic detections of these objects at random regions and put the information into the database. This way, we can create large datasets, any kind of segmentation, and any kind of importance function. In the first set of retrieval experiments, we used real videos captured with cameras, so we were limited in size and could not control parameters of the video such as the segments, the images in the video, and the importance measure of the frames. Making synthetic videos helps us really understand the importance of the retrieval algorithm proposed in this thesis.
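A minimal sketch of this generation procedure is given below, using Python in place of the actual MATLAB scripts and database code. The importance scores and the geometric parameter follow the description above; the detection regions and the choice to attach two detections to every frame are illustrative simplifications.

```python
import random
import numpy as np

# Five synthetic images with their assigned importance scores.
IMAGES = {"red": 0.2, "yellow": 0.4, "green": 0.6, "blue": 0.8, "purple": 1.0}
P = 0.025  # geometric parameter; mean segment length = 1/P = 40 frames
CATEGORIES = ["first synthetic object", "second synthetic object",
              "third synthetic object", "fourth synthetic object"]

def generate_synthetic_video(n_frames):
    """Yield one record per frame: (frame index, image, importance, detections)."""
    frame = 0
    while frame < n_frames:
        image, importance = random.choice(list(IMAGES.items()))
        segment_len = int(np.random.geometric(P))  # ~40 frames on average
        for _ in range(min(segment_len, n_frames - frame)):
            # Two synthetic detections at random regions (x, y, width, height).
            detections = [
                (random.choice(CATEGORIES),
                 (random.randint(0, 600), random.randint(0, 400), 100, 100))
                for _ in range(2)
            ]
            yield frame, image, importance, detections
            frame += 1

# Example: a 10,000-frame synthetic video, as in the experiments below.
records = list(generate_synthetic_video(10_000))
```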

8.3.2.2 Metrics

The primary metric was the importance score of the frames retrieved. As new frames were retrieved, instead of considering the number of frames retrieved, we considered their importance. We cumulatively added the importance of the retrieved frames, so that as more frames were retrieved, the absolute importance score of the retrieved results increased as well. We also recorded the time taken for the retrieval in a similar manner to the previous retrieval experiments on real videos.
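A short sketch of the cumulative-importance metric is given below; the retrieval log format is again an illustrative assumption, not the system's actual logging schema.

```python
# Hypothetical retrieval log: (elapsed seconds, importance of the frame retrieved),
# in the order the frames came back from the server.
log = [(0.4, 1.0), (0.9, 0.8), (1.3, 0.8), (1.8, 0.4), (2.2, 0.6)]

cumulative = []
total = 0.0
for t, importance in log:
    total += importance
    cumulative.append((t, total))

# `cumulative` holds the points plotted as "cumulative importance vs. time".
print(cumulative)
```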

8.3.2.3 Experiment Methodology

We synthetically created videos of three different sizes - 10,000 frames, 25,000 frames, and 100,000 frames according to the creation method described above in section 8.3.2.1. Then, we queried each of those videos for the "first synthetic object". In the server, in addition to the information logged for the retrieval experiments on real videos, we also logged the importance of each frame retrieved. This way, we would be able to see which one of the retrieval systems retrieved the more important frames earlier.

Recall that preferential retrieval is slower than the other methods. Therefore, we normalize the time axis from 0 to 1. This is because if we look at absolute time, the other retrieval systems are expected to retrieve more results, and hence the sheer number will outweigh the importance of frames retrieved per unit time. For example, if system A retrieves 4 frames of 0.4 importance each in the first 1 ms, and system B retrieves 4 frames of 0.8 importance each in the first 4 ms, we consider system B more useful because, although it takes a bit more time, it returns more important frames earlier in the search. Therefore, we normalize both the time axis (x-axis) and the cumulative importance (y-axis) in the results. We also look at the non-normalized results for comparison.
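The normalization itself is simple; the sketch below rescales both axes of a cumulative-importance curve to [0, 1] so that retrieval systems with very different total runtimes can be compared on the same plot (the curve values are illustrative).

```python
# A cumulative-importance curve as (time in seconds, cumulative importance) points.
curve = [(0.4, 1.0), (0.9, 1.8), (1.3, 2.6), (1.8, 3.0), (2.2, 3.6)]

t_max = curve[-1][0]    # total retrieval time for this system
imp_max = curve[-1][1]  # total importance of everything it retrieved

# Rescale both axes to [0, 1] so curves from different systems are comparable.
normalized = [(t / t_max, imp / imp_max) for t, imp in curve]
print(normalized)
```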

8.3.2.4 Results

The following figures, Figure 8-11, Figure 8-12, and Figure 8-13, each show normalized results (left) and non-normalized results (right) for the videos with 10K, 25K, and 100K frames respectively. The x-axis is the time, and the y-axis is the cumulative importance as the frames are retrieved. Table 8-5 shows the absolute times taken by the three retrieval types on these videos, on the query "first synthetic object."

Figure 8-11: Plots of cumulative importance of frames vs. time for a video with 10K frames. On the left (a) are the normalized axes, showing the fraction of importance against the fraction of time; this allows us to see which fraction of the time retrieved more important results. We can see that the (preferential) coreset graph is higher than the uniform graph over most of the earlier portion of the time, showing that more important frames are retrieved earlier in the results for preferential retrieval. On the right (b) are the non-normalized results.

Figure 8-12: Plots of cumulative importance of frames vs. time for a video with 25K frames (a: normalized, b: non-normalized). Here, the uniform and coreset curves intersect at around half the time, showing that uniform sampling retrieved a larger fraction of the important results than preferential coreset retrieval in the later half of the results.

Figure 8-13: Plots of cumulative importance of frames vs. time for a video with 100K frames (a: normalized, b: non-normalized). With such a long video, even though the graph looks the same, half of the retrieval time is about a minute. It is important that the user gets most of the important frames within a minute, because that is already well beyond the tolerable wait time described in [3].

Table 8-5: Absolute time taken to retrieve all results for the query "first synthetic object" on 3 videos, for the three different retrieval systems.

                     DB Retrieval    Uniform Coreset            Preferential Coreset
                     Time (s)        Retrieval Total Time (s)   Retrieval Total Time (s)
10K Frames Video     3.9             21.6                       46.8
25K Frames Video     4.27            26.38                      63.74
100K Frames Video    16.17           107.66                     408.70

8.3.2.5 Analyses

In all the non-normalized results (right side of each figure), we can see that preferential coreset retrieval has less cumulative importance at each point. However, these plots show no information about the nature of the frames retrieved earlier in time vs. later in time. It is important to consider that, because even though frames may be retrieved in less time, the user may have to sift through hundreds of results to find the important frames, whereas it is easier for the user if the preferred frames are shown earlier in the results.

The normalized plots on the left of each figure show that, at least for the first half of the time, preferential retrieval returns more important frames, and uniform retrieval starts to catch up on cumulative importance in the later part of the time. In general, this shows that the results retrieved by preferential sampling are ordered roughly from more important frames to less important frames.

The importance of such ordering is more evident on large videos. While on the smaller videos (10K and 25K frames) the database might return all results "fast enough" (~4 seconds), on the larger video of 100K frames it took ~16 seconds. Because this happens with no feedback, it is well beyond the tolerable wait time of the user, as shown in [3]. In addition, the user has to look at all the frames to find the important ones, because there is no ordering by importance; the same is true for uniform sampling. In Figure 8-13 a), we can see that during at least the first half of the time, each point of the preferential retrieval curve is above the uniform sampling curve. The first half in this large video is about a minute, and it is important that users get most of the important frames by that time. The figure shows that the overall importance of the retrieved results per unit time was higher than that of uniform sampling for about a minute. This is important for users such as Matt, introduced in Chapter 1, because he would get the preferred frames of his guitar performance earlier on, without needing to sift through the results to find them.

8.3.2.6 Web UI results comparison for three retrieval types

Figure 8-14: Portion of results returned from the synthetic image database with the three retrieval types: a) preferential coreset sampling retrieval results, b) DB retrieval results, c) uniform coreset sampling retrieval results. The images in ascending order of importance were: red, yellow, green, blue, purple. In a), we can see more purple and blue frames, which are the most and second-most important frames respectively; fewer red and yellow images are retrieved. The DB in this case (b) happened to have more important frames earlier in the database. However, in uniform sampling (c), we can see all kinds of frames with no particular order or preference.

We can also see the important frames being returned visually in the web UI. The importance numbers were unique to the images so that the differences would be easier to see visually. The importance numbers were: 0.2 (red), 0.4 (yellow), 0.6 (green), 0.8 (blue), and 1.0 (purple). Figure 8-14 shows the UI retrieval results for "first synthetic object" in one of the synthetic videos, for all three retrieval methods. We can see that the preferential sampling algorithm retrieves a lot of purple and blue frames in the beginning, which are the two most important frames. Yellow and red frames, which are the least important, appear the least in the results. However, we do not see only purple and blue frames, because preferential coreset sampling cares not just about importance but also about the distinctness of the images retrieved. For DB retrieval, this instance of the database happened to have frames with higher importance earlier in the database, therefore we do not see many red or yellow frames. However, we can see that the DB results contain less important frames overall than the preferential coreset retrieval results. With the uniform sampling results, the visualization is much clearer: the uniform sampling results contain an almost equal number of each kind of frame and show no particular ordering. Thus, we can visually see the significance and advantage of the preferential sampling retrieval system.

8.4 Conclusion

In summary, we showed that the objects retrieval system allows users to search for objects in videos in various ways, and we showed the timings for each of the system modules. We discussed the pros and cons of the various retrieval systems. Finally, we demonstrated that the preferential retrieval system can be useful in scenarios where important frames need to be retrieved from large videos. Our hope is that, using the preferential object retrieval system, users such as Matt can find the preferred guitar frames early on in the results, so they can find what they are looking for as efficiently as possible.

Chapter 9 Future Extensions

There are many improvements that could be made to the current system. This chapter discusses them for the overall system as well as for each of its modules.

9.1 Overall System

- The system is completely independent of its parent system, iDiary. It needs to be incorporated into iDiary.

- Some modules of the system were slow, as we saw in the experiments, and measures can be taken to make them faster. For example, the slowest module is object detection: we saw that object detection amortizes to ~3 seconds per original video frame for a 720p video. Because we ran detections on data about 10 times smaller than the video, each frame that is actually processed takes ~30 seconds at that resolution. In the future, a faster detector could be added, or other measures could be taken to make it faster. Because the focus of this thesis was not to make the detection faster, this is left for the future.

9.2 UI

- There should be a way to see the uploaded and processed videos in the UI, and even better, the status of the processing of a video at any given moment, including how much time has passed and approximately how much remains.

- The returned results would be better if they were video clips rather than frames. Another option would be to play the original videos and skip to the parts with the desired frames.

9.3 Coreset Tree Creation

- For creating a coreset tree, there are certain default parameters to input, such as leaf size and number of segments. Right now, they are constants independent of the video. However, it would be better if these parameters, and all other relevant parameters, were decided automatically based on the input video. For instance, a video with a lot of long still scenes could use a larger leaf size than a video whose frames change faster.

9.4 Object Detection

- Our chosen object detector, RCNN, is too slow. We can apply measures to make it faster, or choose a faster detector, e.g. Fast R-CNN [41].

- Right now, we have implemented the system for 200 object categories. However, we could test with models that cover more object categories, such as LSDA [42], which has 7.5K categories. We have partially incorporated it into our system, but it needs more time for complete integration.

9.5 Retrieval

- The retrieval could benefit from low-level implementation improvements. Currently, for the coreset retrieval, each request spawns a thread in the server which updates results in a server-side variable. This was done as a "quick-and-dirty" trick to keep updating the results in the server and show the client results in an incremental way. However, the recommended way is to either store the information in the database or use some other form of storage. If possible, an alternative should also be explored for passing incremental results from the server to the client, instead of the polling method, which uses multiple HTTP requests and responses for the same original search request. (A minimal sketch of the current thread-plus-polling pattern is given below.)

REFERENCES

[1] D. Feldman, A. Sugaya, C. Sung and D. Rus, "iDiary: From GPS Signals to a Text-Searchable Diary," Cambridge, 2013.

[2] A. Sugaya, "iDiary: Compression, Analysis, and Visualization of GPS Data to Predict User Activities," Massachusetts Institute of Technology, Cambridge, 2012.

[3] D. F. Galletta, R. Henry, S. McCoy and P. Polak, "Web Site Delays: How Tolerant are Users?," Journal of the Association for Information Systems, vol. 5, no. 1, pp. 1-28, 2004.

[4] G. Rosman, M. Volkov, D. Feldman, J. W. Fisher III and D. Rus, "Coresets for k-Segmentation of Streaming Data," in Neural Information Processing Systems (NIPS), 2014.

[5] M. Volkov, G. Rosman, D. Feldman, J. W. Fisher III and D. Rus, "Coresets for visual summarization with applications to loop closure," in IEEE International Conference on Robotics and Automation (ICRA), Seattle, May 2015.

[6] J. Machajdik, A. Hanbury, A. Garz and R. Sablatnig, "Affective Computing for Wearable Diary and Lifelogging Systems: An Overview," in Machine Vision - Research for High Quality Processes and Products - 35th Workshop of the Austrian Association for Pattern Recognition, Graz, 2011.

[7] V. Bush, "As We May Think," The Atlantic Monthly, vol. 176, no. 1, pp. 101-108, 1945.

[8] G. Bell, "A Personal Digital Store," Communications of the ACM, vol. 44, pp. 86-91, 2001.

[9] D. Byrne, L. Kelly and G. J. Jones, "Multiple multimodal mobile devices: Lessons learned from engineering lifelog solutions," in Handbook of Research on Mobile Software Engineering: Design, Implementation and Emergent Applications, Engineering Science Reference (an imprint of IGI Global), 2012, pp. 706-724.

[10] M. Lindström, A. Ståhl, P. Sundström, K. Höök, J. Laaksolahti, M. Combetto, A. Taylor and R. Bresin, "Affective Diary - Designing for Bodily Expressiveness and Self-Reflection," in CHI '06 Extended Abstracts on Human Factors in Computing Systems, pp. 1037-1042, 2006.

[11] "Microsoft Research SenseCam," [Online]. Available: http://research.microsoft.com/en-us/um/cambridge/projects/sensecam/. [Accessed 27 August 2015].

[12] N. S. Pathkar, "Google Glass: Project Glass," International Journal of Application or Innovation in Engineering & Management (IJAIEM), vol. 3, no. 10, pp. 31-35, 2014.

[13] J. Gemmell, G. Bell and R. Lueder, "MyLifeBits: a personal database for everything," Communications of the ACM (CACM), vol. 49, no. 1, pp. 88-95, 2006.

[14] "Fitbit," [Online]. Available: https://www.fitbit.com/. [Accessed 27 August 2015].

[15] P. K. Agarwal, S. Har-Peled and K. R. Varadarajan, "Geometric Approximation via Coresets," in Combinatorial and Computational Geometry, vol. 52, MSRI Publications, 2005, pp. 1-30.

[16] D. Feldman and M. Langberg, "A Unified Framework for Approximating and Clustering Data," in STOC, 2010.

[17] R. Paul, D. Feldman, D. Rus and P. Newman, "Visual Precis Generation using Coresets," in ICRA, IEEE Press, 2014.

[18] R. Bellman, "On the approximation of curves by line segments using dynamic programming," Communications of the ACM, vol. 4, no. 6, p. 284, 1961.

[19] S. Guha, N. Koudas and K. Shim, "Approximation and streaming algorithms for histogram construction problems," ACM Transactions on Database Systems, vol. 31, no. 1, pp. 396-438, 2006.

[20] D. Feldman, C. Sung and D. Rus, "The single pixel GPS: learning big data signals from tiny coresets," in Proceedings of the 20th International Conference on Advances in Geographic Information Systems, 2012.

[21] W. Churchill and P. Newman, "Continually Improving Large Scale Long Term Visual Navigation of a Vehicle in Dynamic Urban Environments," in 15th International IEEE Conference on Intelligent Transportation Systems, Alaska, 2012.

[22] Z. Lu and K. Grauman, "Story-Driven Summarization for Egocentric Video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[23] S. Bandla and K. Grauman, "Active Learning of an Action Detector from Untrimmed Videos," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.

[24] D. Lin, S. Fidler, C. Kong and R. Urtasun, "Visual Semantic Search: Retrieving Videos via Complex Textual Queries," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[25] W. Hu, D. Xie, Z. Fu, W. Zeng and S. Maybank, "Semantic-Based Surveillance Video Retrieval," IEEE Transactions on Image Processing, vol. 16, no. 4, pp. 1168-1181, 2007.

[26] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos," in Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003), 2003.

[27] B. V. Patel and B. B. Meshram, "Content Based Video Retrieval," The International Journal of Multimedia & Its Applications (IJMA), vol. 4, no. 5, pp. 77-98, 2012.

[28] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," 3 December 2013. [Online]. Available: http://arxiv.org/abs/1311.2524. [Accessed 2 August 2015].

[29] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proceedings of the ACM International Conference on Multimedia, New York, ACM, 2014, pp. 675-678.

[30] T. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theoretical Computer Science, vol. 38, pp. 293-306, 1985.

[31] D. Hochbaum and D. Shmoys, "A best possible heuristic for the k-center problem," Mathematics of Operations Research, vol. 10, no. 2, pp. 180-184, 1985.

[32] K. Grauman and B. Leibe, Visual Object Recognition: Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool, 2011.

[33] "ImageNet," [Online]. Available: http://www.image-net.org/. [Accessed 2 August 2015].

[34] "The PASCAL Visual Object Classes Homepage," [Online]. Available: http://host.robots.ox.ac.uk/pascal/VOC/. [Accessed 2 August 2015].

[35] J. Uijlings, K. van de Sande, T. Gevers and A. Smeulders, "Selective Search for Object Recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.

[36] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, Lake Tahoe, 2012.

[37] "Django: The web framework for perfectionists with deadlines," [Online]. Available: https://www.djangoproject.com/. [Accessed 20 August 2015].

[38] "Bootstrap: The world's most popular mobile-first and responsive front-end framework," [Online]. Available: http://getbootstrap.com/. [Accessed 2 August 2015].

[39] "GitHub," [Online]. Available: https://github.com/axelpale/minimal-django-file-upload-example. [Accessed 2 August 2015].

[40] "rbgirshick/rcnn - GitHub," [Online]. Available: https://github.com/rbgirshick/rcnn. [Accessed 2 August 2015].

[41] R. Girshick, "Fast R-CNN," arXiv preprint arXiv:1504.08083, 2015.

[42] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell and K. Saenko, "LSDA: Large Scale Detection through Adaptation," in Neural Information Processing Systems (NIPS), 2014.

[43] "ImageNet Large Scale Visual Recognition Competition 2013 (ILSVRC2013)," [Online]. Available: http://www.image-net.org/challenges/LSVRC/2013/browse-det-synsets. [Accessed 2 August 2015].

[44] "The Psychology of Web Performance," Website Optimization, 30 May 2008. [Online]. Available: http://www.websiteoptimization.com/speed/tweak/psychology-web-performance/. [Accessed 23 August 2015].