Who, What, When, Where, Why and How in Video Analysis: An Application Centric View
Sadiye Guler, Jay Silverstein, Ian Pushee, Xiang Ma, Ashutosh Morde
intuVision, Inc.
10 Tower Office Park, Woburn, MA 01801
[email protected], {jay,ian,sean,ashu}@intuvisiontech.com

Abstract

This paper presents an end-user application centric view of surveillance video analysis and describes a flexible, extensible and modular approach to video content extraction. Various detection and extraction components, including tracking of moving objects, detection of text and faces, and face-based soft biometric classification of gender, age and ethnicity, are described within the general framework of the real-time and post-event analysis applications Panoptes and VideoRecall. Some end-user applications built on this framework are also discussed.

1. Introduction

Since their early deployment in the mid-1960s [25], video surveillance systems have held the promise of increased security and safety, and they have facilitated forensic evidence collection. The potential of video comes with challenges: large and ever-increasing volumes of data are collected and need to be analyzed automatically to understand the video content, detect objects and activities of interest, and provide relevant information to end users in a timely manner.

Over the last two decades, video surveillance systems, from cameras to compression and storage technologies, have undergone fundamental changes. Analog cameras have been replaced with digital and IP network cameras, and tape-based storage systems with no efficient way to find content have been replaced with digital and network video recorders that use video content analysis for efficient storage, indexing and retrieval. With the availability of ever more capable computing platforms, the large body of research and development in vision-based video surveillance has found its way into several surveillance applications in the form of video content analysis and extraction algorithms. The surveillance systems currently deployed in public places collect a large amount of video data, some of which is analyzed in real time.

In addition to traditional surveillance video, in recent years a new source of video data has surpassed all volume expectations: individuals recording video during public and private events with their cameras and smart phones and posting it on websites such as YouTube for public consumption. These unconstrained video clips (i.e., "other people's video", the term coined by [2]) may provide content and information that is not available in typical surveillance camera views; hence they have become part of any surveillance and investigation task. For example, surveillance cameras around a stadium typically view entry/exit halls, snack courts and the like, and may provide only partial coverage of some excited fans being destructive, with no means of getting a good description of the individuals; an inspection of other fan videos posted on public sites soon after the event, however, may make available crucial information that would otherwise be missed. Unconstrained video is thus gaining importance as a surveillance source, and automated video analysis systems have to utilize this emerging data source.

Detection and classification of objects in video, tracking of moving objects across multiple camera fields of view, understanding of scene context and people's activities, and extraction of some level of identification information on individuals have all been active areas of research [16]. Despite all the advances in the field and the availability of multiple real-time surveillance applications, the acceptance of these automated technologies by security and safety professionals has been slower than several initial market estimations predicted. This is partly because most automated systems are designed to provide a solution to only some of the areas mentioned above, which is effectively only a part of the solution end users require. End users are looking for tools that help answer the questions "What, Where, When, Who, Why and How (w5h)" using video and whatever other data is available. The answers to the w5h questions describe an event in all its detail: What happened? Where did it take place? When did it happen? Who was involved? Why did it happen? And how did it happen? In most cases the answers to most of these questions can be found in video data at multiple levels of detail, and the automated video surveillance applications of tomorrow need to be designed to answer as many of them as possible by analyzing both real-time and post-event video.

"What" describes the event, such as "entry from the exit door", detected from surveillance video by tracking a person entering an area through a restricted-direction doorway. "Where" is the location or locale where the event took place, such as "Building-A lobby" or "Terminal C, Logan Airport, Boston"; this is available information for surveillance cameras and may appear in user-created tags for web videos. "When" is the time of the event, which again is available for surveillance cameras; for web videos a date might be available in the video tags, or it may simply be the time the video was posted. "Who" identifies the actors in the event, such as the person or people who entered the wrong way. Depending on the camera view and resolution, the analysis may indicate a single person versus a group of people; if a good-quality face image is available it can be matched against a watch list, and gender, age and ethnicity may be detectable even from lower-quality face images using soft biometric classifiers.
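As an illustration of how such multi-layer content might be organized, the following minimal Python sketch collects one answer slot per w5h question. The field names and types are hypothetical; the paper does not prescribe a schema.

    # Hypothetical sketch of an event record with one field per w5h question.
    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class W5HEvent:
        what: str                      # detected event type, e.g. "wrong-way entry"
        where: str                     # camera locale or user-supplied tag
        when: datetime                 # camera timestamp, video tag date, or upload time
        who: List[str] = field(default_factory=list)  # actor descriptions or watch-list matches
        why: Optional[str] = None      # intent; usually inferred from surrounding events
        how: Optional[str] = None      # preceding/related events that explain the activity

    event = W5HEvent(
        what="entry from the exit door",
        where="Terminal C, Logan Airport, Boston",
        when=datetime(2013, 6, 1, 14, 32, 5),
        who=["single person, male, adult (soft biometrics)"],
    )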
"Why" describes the intent behind an event; it is not directly observable from video but may be revealed to the end user by the unfolding of subsequent events. "How" describes the way in which the activity happened and may be explained by event detection, such as detecting one person idling while another person exits: the activities preceding the wrong-way entry indicate that the person waited around to catch the door open as someone left.

Our approach to addressing all w5h questions is to exploit video content and metadata at multiple layers with a flexible, expandable and easily re-configurable video content extraction architecture. The layered architecture allows various content extraction tasks to be configured easily so as to answer as many of these questions as possible.

After this introduction, we describe our general video content extraction framework and its components in Section 2. We introduce our face-based soft video biometry classification in Section 3. In Section 4 we introduce a few applications built on this framework. We briefly describe the technology evaluations in which parts of this framework were assessed in Section 5, followed by the Concluding Remarks in Section 6.

2. Flexible video analytics framework

In this section we introduce our real-time and post-event content extraction architecture and explain the underlying algorithms for the major components.

2.1. Modular Video Analysis Architecture

Figure 1 depicts the intuVision video content extraction framework. At the heart of our framework lie multiple detectors that operate on the video either to detect and classify moving objects (for fixed-camera event detection) or to perform frame-based detection of faces, text, persons and vehicles (for analysis of moving-platform video or post-event analysis). Once objects are detected and tracked (faces of tracked people can be detected in real time), the Panoptes video event detection module looks for pre-specified or anomalous events, and the Soft Biometry module then classifies faces by gender, ethnicity and age. All extracted content information is fed into a database to support downstream analysis tasks, queries and retrieval. While Panoptes analyzes video content in real time, VideoRecall is intended for forensic analysis, identifying items of interest and providing a reporting utility to organize the findings and results. Panoptes and VideoRecall form the backbone of a generic video surveillance application and allow for quick development of more specific custom surveillance applications (as exemplified in Section 4). These components are discussed in further detail in the following sections.

Figure 1: intuVision Video Analysis architecture
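The following minimal Python sketch illustrates the modular flow of Figure 1: pluggable detectors feeding a tracker, event detection, soft biometry and a database. The class and method names are our own assumptions; the actual Panoptes and VideoRecall interfaces are not described at code level in this paper.

    # A hypothetical sketch of the modular pipeline in Figure 1.
    class Detector:
        """Base interface: each detector consumes a frame and emits detections."""
        def process(self, frame):
            raise NotImplementedError

    class MovingObjectDetector(Detector): ...   # fixed-camera event detection
    class FaceDetector(Detector): ...           # frame-based face detection
    class TextDetector(Detector): ...           # frame-based text detection

    class Pipeline:
        def __init__(self, detectors, tracker, event_module, soft_biometry, database):
            self.detectors = detectors          # pluggable per deployment
            self.tracker = tracker
            self.event_module = event_module    # pre-specified or anomalous events
            self.soft_biometry = soft_biometry  # gender/ethnicity/age from face chips
            self.database = database            # downstream queries and retrieval

        def process_frame(self, frame):
            detections = [d for det in self.detectors for d in det.process(frame)]
            tracks = self.tracker.update(detections)
            events = self.event_module.check(tracks)
            faces = [t.face for t in tracks if getattr(t, "face", None)]
            attributes = [self.soft_biometry.classify(f) for f in faces]
            self.database.store(tracks=tracks, events=events, attributes=attributes)
            return events

Keeping the detectors behind one interface is what makes the framework re-configurable: a fixed-camera deployment registers a moving-object detector, while a moving-platform or forensic deployment registers frame-based detectors instead.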
2.2. Real-Time analysis with Panoptes

Our approach to real-time video analysis is based on exploiting space-based and object-based analysis in separate, parallel but highly interactive processing layers that can be included in processing as required. This methodology has its roots in human cognitive vision [36]. Object-based visual attention, visual tracking and indexing have also been active fields of study in cognitive vision. Models of human visual attention use a saliency map to guide attention allocation [1]. Studies of the human visual cognition system point to a highly parallel but closely interacting functional and neural architecture. Humans easily focus their attention on video scenes exhibiting [...]

[...] that are dictated by traffic rules and lights; people walk along convenient paths or interact with points of interest such as a snack stand, or park their cars and walk towards a building. Learning the normal scene activity for each object type facilitates the detection of anomalous actions by people and vehicles. Similar to [28], we model the scene activity as a probability density function (pdf): kernel density estimation computes the model, and a Markov Chain Monte-Carlo framework [...]
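A minimal sketch of a kernel-density scene-activity model of the kind described above follows: normal activity per object type is modeled as a pdf over image position and velocity, and low-likelihood observations are flagged as anomalous. The feature choice, the density threshold and the use of SciPy's gaussian_kde are our assumptions, and the Markov Chain Monte-Carlo component is not reproduced here.

    # Sketch: per-object-type scene-activity pdf via kernel density estimation.
    import numpy as np
    from scipy.stats import gaussian_kde

    class SceneActivityModel:
        def __init__(self, object_type):
            self.object_type = object_type   # e.g. "person" or "vehicle"
            self.kde = None

        def fit(self, samples):
            # samples: N x 4 array of (x, y, dx, dy) from normal-period tracks.
            self.kde = gaussian_kde(np.asarray(samples).T)

        def is_anomalous(self, observation, threshold=1e-6):
            # Flag observations falling in low-density regions of the model.
            density = self.kde(np.asarray(observation).reshape(-1, 1))[0]
            return density < threshold

    # Usage: fit on tracked positions/velocities of vehicles during normal
    # traffic, then score new track points as they arrive.
    model = SceneActivityModel("vehicle")
    rng = np.random.default_rng(0)
    normal = rng.normal(loc=[100, 200, 5, 0], scale=[10, 10, 1, 1], size=(500, 4))
    model.fit(normal)
    print(model.is_anomalous([100, 200, 5, 0]))   # False: typical motion
    print(model.is_anomalous([400, 50, -5, 8]))   # True: far from learned activity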