Web-Scale Multimedia Search for Internet Video Content
CARNEGIE MELLON UNIVERSITY

Web-scale Multimedia Search for Internet Video Content

Lu Jiang
CMU-LTI-17-003

Language and Information Technologies
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Dr. Alex Hauptmann, Carnegie Mellon University
Dr. Teruko Mitamura, Carnegie Mellon University
Dr. Louis-Philippe Morency, Carnegie Mellon University
Dr. Tat-Seng Chua, National University of Singapore

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

© 2017 Lu Jiang

Abstract

The Internet has been witnessing an explosion of video content. According to a Cisco study, video accounted for 64% of all the world's internet traffic in 2014, and this share is estimated to reach 80% by 2019. Video data are becoming one of the most valuable sources for accessing information and knowledge. However, existing video search solutions are still based on text matching (text-to-text search), and can fail for the huge volume of videos that have little relevant metadata or no metadata at all. The need for large-scale, intelligent video search that bridges the gap between the user's information need and the video content is therefore urgent. In this thesis, we propose an accurate, efficient and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding, and allows for intelligent and flexible search paradigms over the video content (text-to-video and text&video-to-video search). It offers a new perspective on content-based video search, from finding a simple concept such as "puppy" to searching for a complex incident such as "a scene in an urban area where people are running away after an explosion". To achieve this ambitious goal, we propose several novel methods focusing on accuracy, efficiency and scalability in this new search paradigm.
First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concept detectors. Second, we propose a novel and scalable approach to indexing semantic concepts that significantly improves search efficiency with minimal accuracy loss. Third, we design a novel video reranking algorithm that boosts accuracy for video retrieval. Finally, we apply the proposed video engine to a text-and-visual question answering problem called MemexQA. Extensive experiments demonstrate that the proposed methods surpass state-of-the-art accuracy on multiple datasets. In addition, our method efficiently scales the search to hundreds of millions of videos: it takes only about 0.2 seconds to answer a semantic query on a collection of 100 million videos, and 1 second to process a hybrid query over 1 million videos. Based on the proposed methods, we implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) evaluations of 2013, 2014 and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search system capable of indexing and searching a collection of 100 million videos.

Acknowledgements

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. This work was also partially supported by the Yahoo InMind project. In addition, the author would like to thank the Flickr Computer Vision and Machine Learning group for their kind support.
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. It used the Blacklight system at the Pittsburgh Supercomputing Center (PSC).

Contents

Abstract  ii
Acknowledgements  iii
1 Introduction  1
  1.1 Research Challenges and Solutions  3
  1.2 Social Validity  5
  1.3 Thesis Overview  6
  1.4 Thesis Statement  8
  1.5 Key Contributions of the Thesis  8
2 Related Work  9
  2.1 Content-based Image Retrieval  9
  2.2 Copy Detection  10
  2.3 Semantic Concept Detection  11
  2.4 Multimedia Event Detection  12
  2.5 Content-based Video Semantic Search  13
3 Indexing Semantic Features  14
  3.1 Introduction  14
  3.2 Related Work  16
  3.3 Method Overview  16
  3.4 Concept Adjustment  18
    3.4.1 Distributional Consistency  19
    3.4.2 Logical Consistency  21
    3.4.3 Discussions  22
  3.5 Inverted Indexing & Search  24
    3.5.1 Video Search  25
  3.6 Experiments  26
    3.6.1 Setups  26
    3.6.2 Performance on MED  27
    3.6.3 Comparison to State-of-the-art on MED  30
    3.6.4 Comparison to Top-k Thresholding on MED  30
    3.6.5 Accuracy of Concept Adjustment  31
    3.6.6 Performance on YFCC100M  32
  3.7 Summary  33
4 Semantic Search  35
  4.1 Introduction  35
  4.2 Related Work  35
  4.3 Semantic Search  36
    4.3.1 Semantic Query Generation  36
    4.3.2 Retrieval Models  39
  4.4 Experiments  40
    4.4.1 Setups  40
    4.4.2 Semantic Matching in SQG  41
    4.4.3 Modality/Feature Contribution  42
    4.4.4 Comparison of Retrieval Methods  43
  4.5 Summary  44
5 Query Embedding and Hybrid Search  46
  5.1 Introduction  46
    5.1.1 Query Embedding  46
    5.1.2 Hybrid Search  47
  5.2 Related Work  47
  5.3 Query Embedding  48
    5.3.1 Problem  48
    5.3.2 Models  50
      5.3.2.1 Max-Pooled MLP  51
      5.3.2.2 Two-channel RNN  52
    5.3.3 Results  53
      5.3.3.1 Baseline Comparisons  54
      5.3.3.2 Model Parameters  55
  5.4 Hybrid Search  57
    5.4.1 Model  57
    5.4.2 Experimental Results  60
  5.5 Summary  62
6 Multimodal Reranking  63
  6.1 Introduction  63
  6.2 Related Work  65
  6.3 MMPRF  67
  6.4 SPaR  69
    6.4.1 Learning with Fixed Pseudo Labels and Weights  71
    6.4.2 Learning with Fixed Classification Parameters  73
    6.4.3 Convergence and Relation to Other Reranking Models  76
  6.5 Experiments  77
    6.5.1 Setups  77
    6.5.2 Comparison with Baseline Methods  78
    6.5.3 Impact of Pseudo Label Accuracy  80
    6.5.4 Comparison of Weighting Schemes  81
    6.5.5 Experiments on Web Query Dataset  82
    6.5.6 Runtime Comparison  83
  6.6 Summary  84
7 Learning Semantic Concepts  85
  7.1 Introduction  85
  7.2 Related Work  86
    7.2.1 Curriculum Learning  86
    7.2.2 Self-paced Learning  87
    7.2.3 Weakly-Labeled Data Learning  88
  7.3 Theory  89
    7.3.1 Curriculum  90
    7.3.2 Self-paced Functions  92
    7.3.3 Algorithm  93
  7.4 Implementation  93
    7.4.1 Curriculum Region Implementation  94
    7.4.2 Self-paced Function Implementation  95
    7.4.3 Self-paced Function with Diversity  97
    7.4.4 Self-paced Function with Dropout  100
  7.5 Discussions  101
    7.5.1 Theoretical Justification  101
    7.5.2 Limitations  103
  7.6 Experiments using Diversity Scheme  104
    7.6.1 Event Detection  104
    7.6.2 Action Recognition  106
  7.7 Experiments on Noisy Data  107
    7.7.1 Experimental Setup  107
    7.7.2 Experiments on FCVID  109
    7.7.3 Experiments on YFCC100M  110