CITY UNIVERSITY OF HONG KONG
香港城市大學

Multimedia Content Understanding by Analyzing Perceived Signals: Face, Sentiment and Emotion
基於人臉、情感和情緒等感知信號的多媒體內容解析

Submitted to Department of Computer Science 電腦科學系
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy 哲學博士學位

by

Pang Lei 龐磊

November 2015 二零一五年十一月

ABSTRACT

Content understanding of unconstrained user-generated videos is a challenging task, due to the difficulty of mining semantics out of rich visual-audio signals in addition to sparse and subjective text descriptions. This thesis addresses the challenge by multi-modal analysis of three perceived signals: face, sentiment and emotion, for cross-media exploratory browsing. Specifically, we study approaches for naming faces for video browsing, predicting the sentiment underlying short video clips for question answering, and categorizing emotion for cross-modal retrieval. The results of analyzing the three signals are then integrated into a system for search navigation and exploration in a large video archive.

Face naming is challenging due to the large variations of face appearance in unconstrained videos. Instead of relying on accurate face labels for supervised learning, a rich set of relationships automatically derived from video content, together with knowledge from the image domain and social cues, is leveraged for unsupervised face labeling. The relationships refer to the appearances of faces under different spatio-temporal contexts and their visual similarities. The knowledge includes Web images weakly tagged with celebrity names and celebrity social networks. The relationships and knowledge are elegantly encoded using a conditional random field (CRF) for label inference. Two versions of face annotation are considered: within-video and between-video face labeling. The former addresses the problem of incomplete and noisy labels in metadata, where null assignment of names is allowed, a problem seldom considered in the literature. The latter further rectifies errors in metadata, specifically correcting false labels and annotating faces whose names are missing from the metadata of a video, by considering a group of socially connected videos for joint label inference.
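To make the shape of this formulation concrete, a minimal sketch of the CRF is given below, assuming the standard log-linear form; the actual unary and pairwise potentials, defined over the spatial, temporal and visual relationships, are specified in Sections 3.1.2 and 3.1.3, so the notation here is illustrative only:

P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i} \phi(y_i, \mathbf{x}) + \sum_{(i,j) \in \mathcal{E}} \psi(y_i, y_j, \mathbf{x}) \Big)

Here \mathbf{x} collects the faces detected in a video, each label y_i ranges over the candidate celebrity names plus a null name, the unary potential \phi scores the affinity between a face and a name (for instance, from weakly tagged Web images), and the pairwise potential \psi encodes the spatial, temporal and visual relationships between faces. Label inference then seeks the jointly most probable assignment \hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}).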
Popular Web videos, especially videos about hotly debated news events, often express sentiment. We exploit multi-modal sentiment analysis of videos for question answering (QA), given that a large portion of the questions in QA archives are opinion-oriented. Some of these questions are better answered by videos than by text, due to the vivid display of emotional signals visible through facial expressions and speaking tone. The difficulty of multimedia opinion question answering lies in two aspects. On one hand, a potential answer of 60 seconds' duration may be embedded in a video of 10 minutes, resulting in a degraded user experience compared to reading an answer in text only. On the other hand, a text-based opinion question may be short and vague, while the video answers can be verbal, less structured in grammar, and noisy because of errors in speech transcription. Direct matching of words or syntactic analysis of sentence structure, such as that adopted in factoid and how-to question answering, is unlikely to find the video answers. The first problem, answer localization, is addressed by audio-visual analysis of the emotional signals in videos to locate segments likely to express opinions. The second problem, matching questions to answers, is tackled by a deep architecture that nonlinearly matches the text words in questions to the speech in videos.

For emotion prediction, we explore the learning of the highly non-linear relationships that exist among low-level features across different modalities. A joint density model over the modalities is developed using a Deep Boltzmann Machine (DBM). The model is trained directly on user-generated videos without any labeling effort. While the model learns a joint representation over multimodal inputs, training samples with missing modalities can also be leveraged. More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example retrieving videos with the text query "crazy cat". The model does not restrict the types of input and output, and hence, in principle, emotion prediction and retrieval on any combination of media are feasible.
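As a rough sketch of the model just described, and assuming the standard Boltzmann-machine form (the modality-specific layers and the exact energy function E are defined in the thesis body, and the three views shown are only an illustrative choice), the DBM assigns a joint density to the multimodal inputs, say visual, audio and text features \mathbf{v}^{v}, \mathbf{v}^{a}, \mathbf{v}^{t}, by summing over all configurations of the hidden units \mathbf{h}:

P(\mathbf{v}^{v}, \mathbf{v}^{a}, \mathbf{v}^{t}; \theta) = \frac{1}{Z(\theta)} \sum_{\mathbf{h}} \exp\big( -E(\mathbf{v}^{v}, \mathbf{v}^{a}, \mathbf{v}^{t}, \mathbf{h}; \theta) \big)

Because the density is joint, any subset of modalities can be clamped while the rest are inferred through the shared representation; this is what allows training with samples that miss a modality, and it is also the mechanism behind cross-modal retrieval, e.g., inferring visual responses from the text query "crazy cat" via the conditional P(\mathbf{v}^{v} \mid \mathbf{v}^{t}).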
Finally, the three pieces of work are integrated into a system for exploration-based opinion QA. The system creates hyperlinks connecting video clips that share similar perceived signals, and is designed so that users can easily navigate the search results, exploring along the hyperlinks to different aspects of information based on topic, person or emotion. The system is evaluated on BBC videos totaling more than 3,000 hours.

CITY UNIVERSITY OF HONG KONG
Qualifying Panel and Examination Panel

Surname: PANG
First Name: Lei
Degree: Doctor of Philosophy
College/Department: Department of Computer Science

The Qualifying Panel of the above student is composed of:

Supervisor(s):
Prof. NGO Chong Wah, Department of Computer Science, City University of Hong Kong

Qualifying Panel Member(s):
Dr. CHAN Antoni Bert, Department of Computer Science, City University of Hong Kong
Dr. WONG Hau San, Department of Computer Science, City University of Hong Kong

This thesis has been examined and approved by the following examiners:
Prof. NGO Chong Wah, Department of Computer Science, City University of Hong Kong
Prof. ZHU Jian Hua Jonathan, Department of Media and Communication, City University of Hong Kong
Dr. HUET Benoit, Multimedia Information Processing Group, Eurecom
Prof. SATOH Shin'ichi, Digital Content and Media Sciences Research Division, National Institute of Informatics

ACKNOWLEDGEMENTS

Working as a Ph.D. student in VIREO was a magnificent as well as challenging experience for me. Over all these years, many people were instrumental, directly or indirectly, in shaping my academic career. It would hardly have been possible for me to thrive in my doctoral work without the precious support of these people. Here is a small tribute to all of them.

First of all, I wish to thank my supervisor, Prof. Chong-Wah Ngo, for introducing me to the world of multimedia search and computing. It was only due to his valuable guidance, cheerful enthusiasm and ever-friendly nature that I was able to complete my research work in a respectable manner. I also greatly appreciate the extra year that I was given to finish my Ph.D. study. The experience of working with such a true expert is valuable far beyond this Ph.D.

I would like to thank my examiners, Prof. Shin'ichi Satoh, Prof. Jonathan Zhu and Dr. Benoit Huet, for their precious time and valuable feedback. I am also grateful to my qualifying panel members, Dr. Hau San Wong and Dr. Antoni Bert Chan, for their constructive suggestions and comments on my research reports.

I am privileged to have had Xiao Wu, Xiao-Yong Wei, Yu-Gang Jiang, Hung-Khoon Tan, Wan-Lei Zhao, Shi-Ai Zhu, Song Tan, Ting Yao, Wei Zhang, Chun-Chet Tan, Zhi-Neng Chen, Hao Zhang, Yi-Jie Lu and Jing-Jing Chen as my colleagues, who have provided great company and inspiration during all these years in Hong Kong. I would especially like to thank Shi-Ai Zhu for his generous help and steadfast support, which is more than precious. Wei Zhang, Song Tan, Hung-Khoon Tan, Ting Yao and Yu-Gang Jiang also deserve my special thanks for their efforts and collaborations on my research papers.

Finally, and most importantly, I would like to thank my wife, Ya-Qun Dong. Her support, encouragement, quiet patience and unwavering love were undeniably the bedrock upon which the past nine years of my life have been built. She, as a gift from God, made my life colorful and happy. I greatly appreciate my parents for their faith in me and for allowing me to be as ambitious as I wanted. Deepest thanks for everything they did and every sacrifice they made on my behalf. I thank my sister, sister-in-law and nephew for taking care of my parents while I was pursuing the degree in Hong Kong. I thank my pet dogs, Xiaoqi and Ponyo, for their happy company over the last two years.

TABLE OF CONTENTS

Title Page ii
Abstract ii
Acknowledgements v
Table of Contents vii
List of Figures xi
List of Tables xv

1 Introduction 1
  1.1 Understanding Perceived Signals 1
    1.1.1 Celebrity Naming 2
    1.1.2 Sentiment Analysis for Question Answering 3
    1.1.3 Emotion Prediction 3
    1.1.4 Exploratory Search 4
  1.2 Overview of Approaches and Contributions 5
  1.3 Organization 8
  1.4 Publication 8

2 Literature Review 9
  2.1 Face Naming 9
    2.1.1 Model-based Approaches 9
    2.1.2 Search-based Approaches 10
    2.1.3 Constrained Clustering-based Approaches 11
  2.2 Multimedia Question Answering 12
    2.2.1 Factoid Question Answering 12
    2.2.2 How-to Question Answering 13
    2.2.3 Multimodal Question Answering 13
    2.2.4 Lexical and Stylistic Gap in QA 14
  2.3 Affective Computation 15
    2.3.1 Affection in Textual Documents 15
    2.3.2 Affection in Images 16
    2.3.3 Affection in Music 17
    2.3.4 Affection in Video 17
    2.3.5 Comparisons 18

3 Unsupervised Celebrity Face Naming in Web Videos 19
  3.1 Relationship Modeling 23
    3.1.1 Problem Definition and Notation 23
    3.1.2 Unary Potential 26
    3.1.3 Pairwise Potential 27
      3.1.3.1 Spatial Relationship 27
      3.1.3.2 Temporal Relationship 29
      3.1.3.3 Visual Relationship 30
    3.1.4 Complexity Analysis 30
  3.2 Leveraging Social Cues 32
    3.2.1 Social-driven Graph Splitting 33
    3.2.2 Name-to-name Social Relation 34
    3.2.3 Algorithm Implementation 35
  3.3 Experiments 37
    3.3.1 Dataset and Evaluation Metrics