Who, What, When, Where, Why and How in Video Analysis: An Application Centric View

Sadiye Guler, Jay Silverstein, Ian Pushee, Xiang Ma, Ashutosh Morde
intuVision, Inc., 10 Tower Office Park, Woburn, MA 01801
[email protected], {jay,ian,sean,ashu}@intuvisiontech.com

Abstract

This paper presents an end-user application centric view of video analysis and describes a flexible, extensible and modular approach to video content extraction. Various detection and extraction components, including tracking of moving objects and detection of text, faces, and face based soft biometrics for gender, age and ethnicity classification, are described within the general framework of the real-time and post-event analysis applications Panoptes and VideoRecall. Some end-user applications built on this framework are discussed.

1. Introduction

Since their early deployment in the mid-1960s [25], video surveillance systems have held the promise of increased security and safety and have facilitated forensic evidence collection. The potential of video comes with its challenges: the large and ever increasing volume of collected data needs to be analyzed automatically to understand the video content, detect objects and activities of interest, and provide relevant information to end users in a timely manner.

Over the last two decades, video surveillance systems, from cameras to compression and storage technologies, have undergone fundamental changes. Analog cameras have been replaced with digital and IP network cameras; tape based storage systems with no efficient way to find content have been replaced with Digital and Network Video Recorders with video content analysis for efficient storage, indexing and retrieval. With the availability of more and more capable computing platforms and the large body of research and development in the vision-based video surveillance field, video content analysis and extraction algorithms found their way into several surveillance applications. The surveillance systems currently deployed at public places collect a large amount of video data, some of which is analyzed in real-time. In addition to traditional surveillance video, in recent years a new source of video data has surpassed any kind of volume expectations: individuals recording video during public and private events with their cameras and smart phones and posting it on websites such as YouTube for public consumption. These unconstrained video clips (i.e., "other people's video", the term coined by [2]) may provide additional content and information that is not available in typical surveillance camera views; hence they have become part of many surveillance and investigation tasks. For example, surveillance video cameras around a stadium typically have views of entry/exit halls, snack courts etc., which may provide only partial coverage of some excited fans being destructive, with no means to get a good description of the individuals; but an inspection of other fan videos posted on public sites soon after the event may make available crucial information that would otherwise be missed. Unconstrained video is thus gaining more importance as a surveillance source, and automated video analysis systems have to utilize this emerging data source.

Detection and classification of objects in video, tracking of moving objects over multiple camera fields of view, understanding the scene context, people's activities, and some level of identification information on individuals have all been active areas of research [16]. Despite all the advances in the field and the availability of multiple real-time surveillance applications, the acceptance of these automated technologies by security and safety professionals has been slower than several initial market estimations. This is partly due to the fact that most automated systems are designed to provide a solution to only some of the areas mentioned above, which effectively is only a part of the solution the end-users require. End users are looking for tools that help answer the questions "What, Where, When, Who, Why and How (w5h)" using video and other data that is available. The answers to the w5h questions describe an event with all its details.

What happened? Where did it take place? When did it happen? Who was involved? Why did it happen? And how did it happen? In most cases the answers to most of these questions can be found in video data at multiple levels of detail, and the automated video surveillance applications of tomorrow need to be designed to answer as many of these questions as possible by analyzing both real-time and post event video.

"What" describes the event, such as "entry from the exit door", which will be detected from a surveillance camera video by tracking a person entering an area through a restricted direction doorway. "Where" is the location or locale where the event took place, such as "Building-A lobby" or "Terminal C, Logan Airport, Boston", which is available information for surveillance cameras and may be in the user created tags for web videos. "When" is the time of the event, which again is available information for surveillance cameras; a date might be available in video tags for web videos or it may just be the time the web video was posted. "Who" is/are the actors in the event, such as the person or people who entered the wrong way. Depending on the video view and the resolution, the analyses may indicate a single person versus a group of people; if a good quality face image is available it can be matched against a watch list, or gender/age/ethnicity may be detectable even from lower quality face images using soft biometric classifiers. "Why" describes the intent of an event, which is not directly observable from video but may be revealed to the end user by the unfolding of the events that follow. "How" describes the way in which the activity happened and may be explained by the event detection, such as one person idling while another person exits; the activities preceding the wrong way entry will indicate that the person waited around to catch the door open while someone exited. Our approach to addressing all w5h questions is to exploit video content and metadata at multiple layers with a flexible, expandable and easily re-configurable video content extraction architecture. The layered architecture allows for easily configuring various content extraction tasks to answer as many of these questions as possible.

After this introduction we describe our general video content extraction framework and its components in Section 2. We introduce our face based video soft biometry classification in Section 3. In Section 4 we introduce a few applications based on this framework. We briefly describe the technology evaluations where parts of this framework were evaluated in Section 5, followed by the Concluding Remarks in Section 6.

2. Flexible video analytics framework

In this section we introduce our real-time and post-event content extraction architecture and explain the underlying algorithms for the major components.

2.1. Modular Video Analysis Architecture

Figure 1 depicts the intuVision video content extraction framework. At the heart of our framework lie multiple detectors that operate on the video to detect moving objects and classify them (for fixed camera event detection), or perform frame based detection of faces, text, people and vehicles (for analysis of moving platform video or post event analysis). Once the objects are detected and tracked (faces of tracked people can be detected in real-time), the Panoptes video event detection module looks for pre-specified or anomalous events. The Soft Biometry module then classifies faces for gender/ethnicity and age. All extracted content information is fed into a database to support downstream analysis tasks, queries and retrieval. While Panoptes analyzes the video content in real-time, VideoRecall is intended for forensic analyses, identifying items of interest and providing a reporting utility to organize the findings and results. Panoptes and VideoRecall form the backbone of a generic video surveillance application and allow for quick development of more specific custom surveillance applications (as exemplified in Section 4). These components are discussed in further detail in the following sections.

Figure 1: intuVision Video Analysis architecture.

2.2. Real-Time analysis with Panoptes

Our approach to real-time video analysis is based on exploiting space and object-based analysis in separate parallel but highly interactive processing layers that can be included in processing as required.

This methodology has its roots in human cognitive vision [36]. Object based visual attention, visual tracking and indexing have also been active fields of study in cognitive vision. Models of human visual attention use a saliency map to guide the attention allocation [1]. Studies on the human visual cognition system point to a highly parallel but closely interacting functional and neural architecture. Humans easily focus their attention on video scenes exhibiting several objects and activities at once and keep track of object states and relationships. This indicates allocation of attention to different processes that deal with various dynamics of the scene, using correspondences and communication between these processes [21].

At the first level, the peripheral tracker is responsible for providing and maintaining a quick glance of the moving objects in a scene; it functions as an abstracted positional tracker to quickly establish the rough spatial area and the trajectories of the moving objects. The peripheral tracker primarily maintains motion based information for the objects. This coarse tracking information is all that is needed for some higher level processes, such as detecting instances of objects entering a specific zone or a high number of objects in the scene. Other detection tasks require more detailed analysis, such as classification of objects, or knowledge of the scene background to detect and track objects in more challenging environments. The tracking information from the peripheral tracker is made available to other layers to enable maintaining the scene background and context models, the stationary object model, and detailed object based analysis for classification of tracked objects. These layers, described below, facilitate comprehensive analysis of the objects and the scene at multiple levels of detail as needed.

Peripheral Tracking Layer performs a coarse, spatially-based detection and tracking of moving objects using a computationally light algorithm based on frame recency and produces rough connected regions that represent the moving objects. The detected moving object regions are tracked from frame to frame by building correspondences for objects based on their motion, position and condensed color information.

Scene Description Layer maintains models for general scene features to facilitate detailed analysis in other layers. Currently the Scene Description Layer includes an adaptive background model [26], a dynamic texture background model [20], a background edge map [11] and a scene activity model [28]. The Scene Description Layer or some of its components may not be necessary for all detection tasks; rather it is designed to support other tasks like stationary object or anomalous activity detection.

Scene Activity Model: Every scene has a normal activity flow associated with it; vehicles usually follow paths that are dictated by traffic rules and lights, people walk along convenient paths or interact with interest points such as a snack stand, park their cars and walk towards a building, etc. Learning the normal scene activity for each object type facilitates the detection of anomalous actions of people and vehicles. Similar to [28], we model the scene activity as a probability density function (pdf). Kernel Density Estimation computes the model and a Markov Chain Monte-Carlo framework generates the most likely paths through the scene. The model is trained with a set of normally observed tracks that consist of observed transition vectors defined by the destination location, the transition time and the bounding box width and height of the object of interest. The pdf of transition vectors at each pixel location in the background scene is modeled as a mixture of Gaussian distributions, and a Gaussian Mixture Model (GMM) is learned for each pdf at a given location using a modified EM algorithm. This framework circumvents the need for explicitly clustering the tracks. For a given new track the transition vectors are calculated and the probability of its being normal is estimated using the trained GMMs at every pixel location.

Object Layer provides detailed object-based analysis for classification. We use the Support Vector Machine (SVM) method for object classification. SVM is a supervised learning method that simultaneously minimizes the classification error while maximizing the geometric margin. Hence SVMs perform better than many other classifiers and have been used in a wide range of applications for object classification, object recognition [23,24] and action recognition [29,11]. We use the object's size, oriented edges [4], silhouette and motion based features for SVM object classification. Oriented edges have been shown to be robust to changes in illumination, scale and view angle [6].

In most surveillance applications, robustly classifying objects without having a large set of sample data to train the classifier is of particular interest. SVMs are well suited for such applications as they can be trained with few samples and generalize well to novel scenes. Figure 2 illustrates the Panoptes object classification training interface where a new object type is being trained for a person carrying a long object.

Figure 2: Panoptes object classification training interface.

In aerial or other moving camera views (such as panning and zooming of fixed cameras) the scene background is not consistent over multiple frames. To detect or track objects in such scenes, using trained detectors based on appearance features in each single frame works well. For these types of environments we employ the Haar-like feature based detection framework suggested by [19, 33] for detection of people and vehicles. Haar features are used to train a cascade of classifiers using the Adaboost framework. The trained classifier detects people and vehicles at real-time speeds and has a high recall rate. Some examples of person and vehicle detections from UAV videos are shown in Figure 3.
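As a rough, self-contained illustration of this frame-by-frame cascade detection step, the sketch below runs a pretrained Haar cascade over video frames with OpenCV. The cascade file (OpenCV's stock pedestrian cascade), the input clip name, and the scale/size thresholds are assumptions for illustration only, not the cascades or parameters trained for Panoptes.

```python
# Minimal sketch of frame-based Haar cascade detection (assumed parameters,
# not the actual Panoptes detector or its trained cascades).
import cv2

# OpenCV ships a generic pedestrian cascade; a vehicle cascade would be trained separately.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_fullbody.xml")

cap = cv2.VideoCapture("aerial_clip.mp4")   # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.equalizeHist(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    # Each detection is an (x, y, w, h) box; thresholds are illustrative only.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=4, minSize=(24, 48))
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()
```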

Face Detection: Our face detection module uses an Adaboost classifier trained on Haar-like features [19,31,32]. In the real-time detection tasks, once a person is detected and passes the criteria for including a face (frontal view and sufficient resolution), we further check for the presence of a face within the detected bounding box. To improve computational efficiency, we use only the top half of the bounding box as input to the face detector.

Text Detection: For in-scene video text detection, we employ the method proposed by [18] due to its efficiency and robustness in recognizing rigid, monochrome artificial text (such as closed captions or text overlays) in video. The input frame is first segmented based on color; the segmented regions are then merged to remove over-segmentation of characters. Contrast segmentation is performed to get characters in a binary image. Geometric analysis is used to remove false regions or pixels that do not qualify as a character.

Stationary Object Layer: Once moving objects become stationary the spatially-based peripheral tracker no longer tracks them; the detection of stationary objects is accomplished by introducing another parallel layer to the tracking algorithm. This layer uses the information produced by the Peripheral Tracker and the Scene Description Layer to create a Stationary Object Model as a collection of object regions that represent hypothesized stationary objects. Each region in the stationary object model has a pseudo probability value that measures the "endurance" of the region as a static object. The stationary object model starts with the hypothesis for the stationary objects as the regions occupied by the moving objects (identified by the Peripheral Tracker) in the previous frames.

Real-time Event Detection: Understanding of activities and events from video sequences has been studied widely, and approaches ranging from expert-system-like rule based methods to probabilistic models have been employed [9,13,14,15]. Our event detection algorithms use various methods appropriate for different event detection tasks [11]. We employ a rules based approach for pre-defined events, trained SVM models for person activities, and the scene activity model for detecting anomalous behaviors.
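To make the rules based idea concrete, the toy sketch below flags a "wrong-way entry" style event from a track of object centroids: it checks whether the motion inside a zone opposes an allowed direction. The zone, direction, and thresholds are illustrative assumptions, not the actual Panoptes rule engine or its event spies.

```python
# Toy rule-based event check in the spirit of a pre-defined "wrong-way entry"
# rule: flag a track whose motion inside a zone opposes the allowed direction.
# Zone, direction and thresholds are illustrative assumptions.
import numpy as np

ALLOWED_DIRECTION = np.array([0.0, 1.0])      # e.g., doorway traversed downward
ZONE = (200, 100, 400, 300)                   # x_min, y_min, x_max, y_max

def wrong_way(track_points, cos_threshold=-0.5):
    """track_points: list of (x, y) centroids for one tracked object."""
    pts = np.asarray(track_points, dtype=float)
    x_min, y_min, x_max, y_max = ZONE
    in_zone = (pts[:, 0] >= x_min) & (pts[:, 0] <= x_max) & \
              (pts[:, 1] >= y_min) & (pts[:, 1] <= y_max)
    if in_zone.sum() < 2:
        return False
    motion = pts[in_zone][-1] - pts[in_zone][0]
    norm = np.linalg.norm(motion)
    if norm < 1e-6:
        return False
    # Cosine well below zero means movement against the allowed direction.
    return float(motion @ ALLOWED_DIRECTION) / norm < cos_threshold

print(wrong_way([(300, 280), (300, 240), (300, 160)]))  # True: moving upward
```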

2.3. Forensic analysis with VideoRecall

The end-users who analyze video and multimedia data are being flooded with large volumes of data from a wide range of multimedia sources and need systems and tools to:
• Detect and distinguish an event of interest from many like - but unimportant - events,
• Query, retrieve, view and route selected video data and events of interest,
• Identify associations between seemingly unrelated content.
VideoRecall, designed to fill this need, provides a video content analysis and exploration environment to gain insights into the events and activities when investigating a situation of interest (Figure 3).

Typical VideoRecall processing starts with video shot segmentation to partition the video clips into short segments that are separated by camera operations (cuts, zooms and pans) or large scene changes (walking into a building). This is followed by scene classification to cluster similar and related shots together into coherent scenes. Video content such as text, vehicles, people, faces and their soft biometric classification for gender/age/ethnicity are extracted using the algorithms described in Sections 2 and 3. VideoRecall also facilitates a Panoptes plug-in to detect events.

Figure 3: VideoRecall Graphical User Interface.
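As a minimal sketch of the shot segmentation step, the snippet below flags candidate shot boundaries by thresholding the mean frame-to-frame intensity change. The downscale size, threshold, and input file name are assumptions; VideoRecall's actual segmentation, which also handles zooms and pans, is not described at this level of detail in the text.

```python
# Rough sketch of shot segmentation by thresholding mean frame-to-frame change
# (illustrative only; real cut/zoom/pan handling would be more involved).
import cv2
import numpy as np

def find_cuts(path, threshold=30.0):
    cap = cv2.VideoCapture(path)
    prev, cuts, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        small = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (160, 90))
        if prev is not None:
            # Mean absolute intensity change; a spike suggests a shot boundary.
            if np.mean(cv2.absdiff(small, prev)) > threshold:
                cuts.append(idx)
        prev, idx = small, idx + 1
    cap.release()
    return cuts

print(find_cuts("archive_clip.mp4"))  # hypothetical input file
```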

All this extracted content information can be queried to find items of interest. VideoRecall provides a "sandbox" for creating media rich reports utilizing the extracted objects. All content metadata created by VideoRecall is stored in an index database to facilitate further investigations. In addition, it supports KLV (Key Length Value) metadata [10]. A map interface provides geographical context for extracted objects, queries and events. Figure 3 illustrates the VideoRecall Graphical User Interface, showing extracted vehicles and a sandbox report example.

2.4. GPU Aware video analytics

General purpose Graphics Processing Units (GPUs) have recently attracted a lot of attention from the computer vision community for speeding up parallel operations in video analysis algorithms [34]. Pixel based parallel operations and frame pre-processing stages of video content extraction tasks are well suited to the GPU platform and provide reasonable speed-ups even with integrated GPU cards, which have become standard graphics hardware in newer computers. Our video analysis algorithms are adapted to NVIDIA's CUDA platform and take advantage of the available GPU cards seamlessly. We take advantage of the existing GPUs on the host platform to process HD resolution video in real-time. Some of the CPU intense operations such as background model generation, update and comparison are inherently parallel and can be done much faster on GPUs, resulting in about an 89% drop in processing time for background update and a 77% drop in processing time for background compare on average.

3. Video soft biometry

The goal of the soft biometry classification is to categorize faces by gender, age and ethnicity based on facial features and their geometry. Face images have been extensively used to identify gender, ethnicity and other soft biometric traits in several studies [17,22]. Previous literature uses still images, with datasets containing well aligned frontal face images taken under controlled lighting. Our algorithms for face based soft biometry classification perform well on good to low quality video images [7]. We use two algorithms: the first utilizes pixel intensity values while the other uses Biologically Inspired Model (BIM) features for soft biometry computation. Both of these methods use SVMs for classification.

Extracting robust soft biometric features from surveillance video is a challenging task due to the lower quality of data and uncontrolled environment factors, such as changes in illumination, resolution, camera view angle, occlusion, and shadows. Despite these challenges, one major advantage of video data is the availability of multiple samples of the face as people move through the camera field of view. Our solution takes advantage of this data redundancy to build a dynamic soft biometric feature vector. The algorithms were trained with over 12,000 face images from two genders and three ethnicity groups with various sizes, poses and illumination. Our algorithms for extracting face soft biometric features achieve 90% classification for good quality face images that are at least 60x60 pixels in size and around 70-75% for lower quality ones around 30x30 pixels (Figure 4).

Figure 4: Aligned, scaled and masked face image ROIs. (Images are shaded to protect the anonymity of the subjects.)

The first method uses pixel intensity-based features to implement real time gender and ethnicity detection. Detection of the intensity-based facial features requires the registration of the faces. The angle and the distance between the eye points are used for aligning the faces. Automated pupil detection is achieved utilizing Haar features. Then the face region is masked and the pixel features are collected from these images. We use a few hundred pixels as features with SVM classifiers.

Our second method uses the Biologically Inspired Model (BIM) to extract illumination, scale, and shift invariant features from face images. These features are also known as HMAX features [25,30]. The BIM features can handle partial occlusions, slight scale variation, and changes in illumination, which are common problems in video surveillance. BIM is an advanced method that computes features that exhibit a better trade-off between invariance and selectivity.

In training the BIM model, we first compute the intermediate representation (C1) for all the positive training images using Gabor filters. From this set of C1 images, template patches of different sizes are randomly selected. The set of these template patches creates a dictionary defining a positive class. In other words, an image containing a higher number of these template patches will more likely be an image from the positive class. Once the dictionary is computed, these template patches are used to compute features and train the SVM. Each template patch gives a feature; thus the number of features is equal to the number of template patches in the dictionary. Each template patch is correlated over an entire image and the maximum correlation value defines the feature value. Features computed in this manner for the positive and negative training datasets are used to train the SVM [31].

4. Example application

4.1. Smart sensing and video tracking

A prototype surveillance system is being built on the Panoptes and VideoRecall framework and uses smart sensor motes, intelligent video, and Sensor Web technologies to aid in large area monitoring operations and to enhance the security of borders [8]. The prototype system is envisioned as part of a remote sensor network that can incorporate a range of sensor types, as illustrated in Figure 5.

Figure 5: Sensor Web Framework with smart video and mote sensor based event and threat detection.

Numerous barriers limit the effectiveness of surveillance video in large area protection, such as the number of cameras needed to provide coverage, the large volumes of data to be processed and disseminated, the lack of smart sensors to detect potential threats, and the limited bandwidth to capture and distribute video data. We are addressing these obstacles by employing a Smart Video Node in a Sensor Web framework. A Smart Video Node (SVN) is an IP video camera with automated event detection capability. The SVNs are cued by inexpensive sensor motes that detect the existence of humans or vehicles. Based on the sensor motes' observations, cameras are slewed in to observe the activity, and automated video analysis detects potential threats to be disseminated as "alerts".

The Sensor Web framework enables quick and efficient identification of available sensors, collects data from disparate sensors, automatically tasks various sensors based on observations or events received from other sensors, and receives and disseminates alerts from multiple sensors. The basic components of our approach to designing and developing the collaborative Smart Camera and Mote Sensors in a Sensor Web framework are:
• Use of inexpensive ad hoc netted sensors (motes) to cue and task video cameras to collect and process video only when a mote field detects and verifies activity.
• Embedded video analytics processing with the camera (i.e., Smart Video Node) to push processing of video to the sensor edge for efficient use of bandwidth.
• A Sensor Web Framework integrating smart video and various inexpensive mote sensors, tasking sensors based on events, and receiving and disseminating alerts from multiple sensors.
Fortunately, over the past decade several technologies have emerged to support low-cost ad hoc networks and standardized worldwide data architectures, including cellular technology, distributed processing, and sensor web services [5, 12]. A mote is a wireless sensor device that represents a single node in a wireless sensor network. Each node consists of a small microcontroller, radio, and sensor board. When a mote is powered on, it forms an ad hoc mesh network with any neighboring nodes that are in range. Wireless sensor networking (WSN) technology allows for true ad hoc mesh networking of small sensor devices. WSN extends the internet to the physical world, allowing our system to interact with physical assets, perform automatic tasking of other sensors seamlessly, and provide remote sensor control (with verified authority). This framework allows disparate, geographically dispersed sensors to be universally described, discovered, accessed, and tasked over a network.

5. Technology Evaluations

Both the Panoptes and VideoRecall applications have been in technology evaluations, as briefly described below.

5.1. NIST-TRECVid 08 Video Event Detection

Event detection algorithms built on the Panoptes framework were evaluated in the NIST TRECVid 2008 Video Event Detection task with 50 hours of surveillance video from Gatwick airport. We developed algorithms for three event detection tasks, namely "Elevator No-Entry", "Opposing Flow", and "Picture Take", that performed well [35]. Our detection results were at the top for these three events.

For Elevator No-Entry event detection we used Haar person detection followed by histogram matching to find a person not entering an elevator. Histogram matching was shown to perform well under pose variations, as indicated by our results.
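The snippet below gives a minimal illustration of comparing a detected person's appearance across frames with color histogram matching in OpenCV. The color space, bin counts and similarity threshold are assumptions for illustration, not the parameters used in the TRECVid system.

```python
# Minimal appearance-matching sketch: compare HSV color histograms of two
# person crops (bin counts, color space and threshold are assumptions).
import cv2

def person_histogram(crop):
    hsv = cv2.cvtColor(crop, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    return hist

def same_person(crop_a, crop_b, threshold=0.6):
    # Correlation: 1.0 = identical histograms, lower = less similar.
    score = cv2.compareHist(person_histogram(crop_a),
                            person_histogram(crop_b),
                            cv2.HISTCMP_CORREL)
    return score > threshold

# crop_a and crop_b would be the bounding-box patches returned by the person
# detector in two different frames.
```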

In the Opposing Flow event the main challenge was to handle severe occlusions and overlap. We used Panoptes tracking and minimized occlusions and overlaps caused by other objects by considering only the top portion of a door region. Since heads and shoulders are smaller in area, the occlusion effects were not severe. Restricting the problem in this manner we were able to get good tracking, and we detected the tracks in the specified direction using our Panoptes "direction event spy". We incorporated a mechanism to reject outliers in the direction event spy to better handle tracking errors, which considerably improved the results.

Our Picture Take event algorithm was based on the characteristic change in illumination caused by a flash - a large increase in image intensity followed by an almost equal drop in intensity - for flash detection. Once a flash frame is detected we find the location of the flash and use matching over several frames to detect frames where the hands are steady.
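A minimal sketch of the flash cue is shown below: a frame whose mean intensity rises sharply above the previous frame and drops back in the next one is flagged as a candidate flash frame. The spike threshold and input file name are assumptions, and the subsequent flash localization and steady-hands matching are not sketched.

```python
# Sketch of the flash cue: a frame whose mean intensity jumps well above its
# neighbors and drops back is flagged (spike threshold is an assumption).
import cv2

def flash_frames(path, spike=25.0):
    cap = cv2.VideoCapture(path)
    means = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        means.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean())
    cap.release()

    flashes = []
    for i in range(1, len(means) - 1):
        rise = means[i] - means[i - 1]
        fall = means[i] - means[i + 1]
        # Large rise followed by a comparable fall marks a likely camera flash.
        if rise > spike and fall > spike:
            flashes.append(i)
    return flashes

print(flash_frames("surveillance_clip.mp4"))  # hypothetical input
```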

5.2. IEEE-VAST Challenge

The VideoRecall plug-in for the Analyst Notebook was used by a Penn State University (PSU) team for the IEEE VAST 09 (Visual Analytics Science and Technology) challenge. The PSU team solved the challenge by utilizing automatic forensic video content extraction. The challenge problem: a data breach took place at a foreign embassy, and at least one meeting associated with this case took place at locations captured by security cameras. The surveillance video was 10 hours long; hence watching the video to manually detect the instances of people meeting or vehicles involved would be an intractable task. Using VideoRecall's person and vehicle detection, the PSU team identified a clandestine meeting between the suspect and an accomplice, and the vehicle used, as shown in Figure 6.

Figure 6: i2 Analyst Notebook chart (top) with extracted objects, people and vehicles, using the VideoRecall plug-in from surveillance video cameras (bottom).

6. Concluding Remarks

We presented our extensible and flexible video content extraction framework with several content extraction tasks and a configurable architecture that allows quick development and deployment of end-user applications. Advanced features for these and other video content extraction tasks are currently under development at intuVision, Inc.

References

[1] R. Allen, P. McGeorge, D. Pearson, and A. B. Milne, "Attention and expertise in multiple target tracking," Applied Cognitive Psychology, 18 (3): 337-347, Apr 2004.
[2] D. Aldridge, Personal Communications at VACE-3 Closeout meeting, Nov 2008, Washington DC.
[3] Avidan, S., "Subset selection for efficient SVM tracking," Computer Vision and Pattern Recognition, Volume 1, 18-20, June 2003.
[4] Bileschi, S. and Wolf, L., "Image representations beyond histograms of gradients: The role of Gestalt descriptors," CVPR, 2007.
[5] Constantopoulos D., Johnston J., "2006 CCRTS The State Of The Art And The State Of The Practice - Data Schemas for Net-Centric Situational Awareness," Cognitive Domain Issues, Tech. Report, MITRE Corporation, Bedford, MA, 2006.
[6] Dalal, N. and Triggs, B., "Histograms of oriented gradients for human detection," CVPR, 2005.
[7] M. Demirkus, K. Garg, S. Guler, "Automated person categorization for video surveillance using soft biometrics," SPIE Defense and Security Conference, April 2010.
[8] S. Guler, T. Cole, J. Silverstein, I. Pushee, S. Fairgrieve, "Border Security and Surveillance System with Smart Cameras and Motes in a Sensor Web," SPIE Defense and Security Conference, April 2010.

[9] S. Guler and M. K. Farrow, "Abandoned Object Detection in Crowded Places," IEEE CVPR, New York City, NY, June 18-23, 2006.
[10] S. Guler, W.H. Liang, I. Pushee, "A Video Event Detection and Mining Framework," CVPR Workshop on Event Mining, June 2003, Madison, Wisconsin, USA.
[11] S. Guler, J.A. Silverstein, K. Garg, "Video Scene Assessment with an Unattended Sensor Network," SPIE Europe, Security and Defence, Florence, Italy, September 2007.
[12] He T., Krishnamurthy S., et al., "Energy-Efficient Surveillance System Using Wireless Sensor Networks," MobiSYS'04, June 6-9, 2004, Boston, Massachusetts, USA, 2004.
[13] S. Hongeng, R. Nevatia and F. Bremond, "Video-Based Event Recognition: Activity Representation and Probabilistic Recognition Methods," Computer Vision and Image Understanding, 96: pp. 129-162, 2004.
[14] Y. A. Ivanov and A.F. Bobick, "Recognition of Visual Activities and Interactions by Stochastic Parsing," IEEE Trans. PAMI, vol. 22, no. 8, pp. 852-872, Aug. 2000.
[15] S.W. Joo, R. Chellappa, "Attribute Grammar-Based Event Recognition and Anomaly Detection," IEEE International Workshop on Semantic Learning Applications in Multimedia, 2006.
[16] T. Ko, "A survey on behavior analysis in video surveillance for homeland security applications," Applied Imagery Pattern Recognition Workshop, 37th IEEE AIPR, pp. 1-8, Oct 2008.
[17] Lapedriza, A., Marin-Jimenez, M. and Vitria, J., "Gender Recognition in Non Controlled Environments," ICPR (3), 2006.
[18] R. Lienhart, W. Effelsberg, "Automatic Text Segmentation and Text Recognition for Video Indexing," ACM/Springer Multimedia Systems, Vol. 8, pp. 69-81, January 2000.
[19] Lienhart, R. and Maydt, J., "An Extended Set of Haar-like Features for Rapid Object Detection," ICIP, 2002.
[20] B.D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," In Proceedings IJCAI, pp. 674-679, Vancouver, Canada, 1981.
[21] D. Marr, Vision. Freeman, Cambridge, MA, 1982.
[22] Moghaddam, B. and Yang, M. H., "Learning Gender with Support Faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 707-711, 2002.
[23] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: an application to face detection," In Proc. Computer Vision and Pattern Recognition, pp. 130-136, June 1997.
[24] M. Pontil and A. Verri, "Support Vector Machines for 3D Object Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 6, pp. 637-646, June 1998.
[25] Riesenhuber, M. and Poggio, T., "Hierarchical models of object recognition in cortex," Nature Neuroscience, 2, 1999.
[26] J. Rittscher, J. Kato, S. Joga, and A. Blake, "A probabilistic background model for tracking," ECCV, 2: 336-350, 2000.
[27] Roberts, L. P., "The history of video surveillance," http://www.video-surveillance-guide.com.
[28] I. Saleemi, K. Shafique, and M. Shah, "Probabilistic Modeling of Scene Dynamics for Applications in Visual Surveillance," IEEE Trans. PAMI, Vol. 31, No. 8, pp. 1492-1506, August 2009.
[29] C. Schuldt, I. Laptev, B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Proc. 17th Int. Conf. on Pattern Recognition (ICPR'04), 2004.
[30] Serre, T., Wolf, L. and Poggio, T., "Object recognition with features inspired by visual cortex," CVPR, 2005.
[31] Shakhnarovich, G., Viola, P. and Moghaddam, B., "A Unified Learning Framework for Real Time Face Detection and Classification," Proc. of Int. Conf. on Automatic Face and Gesture Recognition, 2002.
[32] Viola, P. A. and Jones, M., "Rapid object detection using a boosted cascade of simple features," CVPR, 2001.
[33] Viola, P. A. and Jones, M. J., "Robust Real-Time Face Detection," Int. Journal of Computer Vision, 2004.
[34] Yang, R., Welch, G., "Fast image segmentation and smoothing using commodity graphics hardware," J. Graph. Tools, 7(4), pp. 91-100, 2002.
[35] Yarlagadda, P., Demirkus, M., Garg, K. and Guler, S., "Intuvision Event Detection Systems for TRECVID 2008," TREC Video Retrieval Evaluation Competition Conf., Nov. 17-18, 2008, Washington, DC.
[36] O. Yilmaz, S. Guler, H. Ogmen, "Inhibitory Surround and Grouping Effects in Human and Computational Multiple Object Tracking," SPIE Electronic Imaging Conference, Visualization and Perception, January 2008, San Jose, CA.