Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos Lu Jiang1, Shoou-I Yu1, Deyu Meng2, Teruko Mitamura1, Alexander G. Hauptmann1 1 School of Computer Science, Carnegie Mellon University 2 School of Mathematics and Statistics, Xi’an Jiaotong University {lujiang, iyu, teruko, alex}@cs.cmu.edu,
[email protected] ABSTRACT representative content-based retrieval task, initiated by the Semantic search in video is a novel and challenging problem TRECVID community, is called Multimedia Event Detec- in information and multimedia retrieval. Existing solutions tion (MED) [32]. The task is to detect the occurrence of are mainly limited to text matching, in which the query a main event in a video clip without any textual metada- words are matched against the textual metadata generat- ta. The events of interest are mostly daily activities ranging ed by users. This paper presents a state-of-the-art system from “birthday party” to “changing a vehicle tire”. for event search without any textual metadata or example videos. The system relies on substantial video content un- derstanding and allows for semantic search over a large col- lection of videos. The novelty and practicality is demon- strated by the evaluation in NIST TRECVID 2014, where the proposed system achieves the best performance. We share our observations and lessons in building such a state- of-the-art system, which may be instrumental in guiding the design of the future system for semantic search in video. Categories and Subject Descriptors Figure 1: Comparison of text and semantic query for H.3.3 [Information Search and Retrieval]: Search pro- the event “birthday party”.