Visual Recognition of Multi-Agent Action by Stephen Sean Intille
Total Page:16
File Type:pdf, Size:1020Kb
Visual Recognition of Multi-Agent Action by Stephen Sean Intille B.S.E., University of Pennsylvania (1992) S.M., Massachusetts Institute of Technology (1994) Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial ful®llment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 1999 c Massachusetts Institute of Technology 1999. All rights reserved. Author ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: Program in Media Arts and Sciences August 6, 1999 Certi®ed by ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: Aaron F. Bobick Associate Professor of Computational Vision Massachusetts Institute of Technology, USA Thesis Supervisor Accepted by :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: Stephen A. Benton Chairman, Departmental Committee on Graduate Students Program in Media Arts and Sciences 2 3 Visual Recognition of Multi-Agent Action by Stephen Sean Intille Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on August 6, 1999, in partial ful®llment of the requirements for the degree of Doctor of Philosophy Abstract Developing computer vision sensing systems that work robustly in everyday environments will require that the systems can recognize structured interaction between people and objects in the world. This document presents a new theory for the representation and recognition of coordinated multi-agent action from noisy perceptual data. The thesis of this work is as follows: highly structured, multi-agent action can be recognized from noisy perceptual data using visually grounded goal-based primitives and low-order temporal relationships that are integrated in a probabilistic framework. The theory is developed and evaluated by examining general characteristics of multi-agent action, analyzing tradeoffs involved when selecting a representation for multi-agent action recognition, and constructing a system to recognize multi-agent action for a real task from noisy data. The representation, which is motivated by work in model-based object recognition and probabilistic plan recognition, makes four principal assumptions: (1) the goals of individual agents are natural atomic representational units for specifying the temporal relationships between agents engaged in group activities, (2) a high-level description of temporal structure of the action using a small set of low-order temporal and logical constraints is adequate for representing the relationships between the agent goals for highly structured, multi-agent action recognition, (3) Bayesian networks provide a suitable mechanism for integrating multiple sources of uncertain visual perceptual feature evidence, and (4) an automatically generated Bayesian network can be used to combine uncertain temporal information and compute the likelihood that a set of object trajectory data is a particular multi-agent action. The recognition algorithm is tested using a database of American football play descrip- tions. A system is described that can recognize single-agent and multi-agent actions in this domain given noisy trajectories of object movements. The strengths and limitations of the recognition system are discussed and compared with other multi-agent recognition algorithms. 4 Thesis Supervisor: Aaron F. Bobick Title: Associate Professor of Computational Vision Massachusetts Institute of Technology, USA Doctoral Committee Thesis Advisor :::::::::::::::::::::::::::::::::::::::::::::::::::::: Aaron F. Bobick Associate Professor of Computational Vision Massachusetts Institute of Technology, USA Thesis Reader ::::::::::::::::::::::::::::::::::::::::::::::::::::::: W. Eric L. Grimson Professor of Computer Science and Engineering Bernard Gordon Chair of Medical Engineering Massachusetts Institute of Technology, USA Thesis Reader ::::::::::::::::::::::::::::::::::::::::::::::::::::::: Hans-Hellmut Nagel Professor fÈur Informatik UniversitÈat Karlsruhe (T.H.), Germany 5 6 Contents 1 Introduction 19 1.1 Motivation : :: :: :: :: ::: :: :: :: :: ::: :: :: :: :: :: 21 1.2 The task: recognizing team action :: :: :: :: ::: :: :: :: :: :: 22 1.3 Contributions : :: :: :: ::: :: :: :: :: ::: :: :: :: :: :: 23 1.4 Outline ::: :: :: :: :: ::: :: :: :: :: ::: :: :: :: :: :: 25 2 Multi-agent action 27 2.1 Examples : :: :: :: :: ::: :: :: :: :: ::: :: :: :: :: :: 28 2.1.1 A traf®c intersection ::: :: :: :: :: ::: :: :: :: :: :: 28 2.1.2 An interactive environment : :: :: :: ::: :: :: :: :: :: 29 2.1.3 The p51curl football play :: :: :: :: ::: :: :: :: :: :: 29 2.1.4 The simpli®ed p51curl football play : :: ::: :: :: :: :: :: 31 2.2 Characteristics of multi-agent action : :: :: :: ::: :: :: :: :: :: 32 2.3 Characteristics of football team action :: :: :: ::: :: :: :: :: :: 37 2.4 Recognition of intentional action :: :: :: :: ::: :: :: :: :: :: 39 2.4.1 De®ning intentionality :: :: :: :: :: ::: :: :: :: :: :: 40 2.4.2 Using intentional models of activity for recognition :: :: :: :: 41 2.4.3 ªWhat things areº vs. ªWhat things look likeº :: :: :: :: :: 42 3 Data and data acquisition 45 3.1 The play database : :: :: ::: :: :: :: :: ::: :: :: :: :: :: 46 7 8 CONTENTS 3.2 Obtaining object trajectories ::: :: :: :: :: ::: :: :: :: :: :: 47 3.2.1 Automatic visual object tracking :: :: ::: :: :: :: :: :: 47 3.2.2 Tagged tracking systems : :: :: :: :: ::: :: :: :: :: :: 49 3.2.3 Manual data acquisition : :: :: :: :: ::: :: :: :: :: :: 49 3.2.4 Characterizing trajectory noise : :: :: ::: :: :: :: :: :: 51 4 A test system: labeling a multi-agent scene 55 4.1 Multi-agent scenes and context : :: :: :: :: ::: :: :: :: :: :: 56 4.2 A formation labeling test system : :: :: :: :: ::: :: :: :: :: :: 57 4.2.1 Searching for consistent interpretations : ::: :: :: :: :: :: 58 4.2.2 System performance ::: :: :: :: :: ::: :: :: :: :: :: 67 4.3 Evaluating the representation :: :: :: :: :: ::: :: :: :: :: :: 68 4.4 Lessons learned: desirable recognition properties ::: :: :: :: :: :: 73 5 Approaches for multi-agent action recognition 75 5.1 Object recognition : :: :: ::: :: :: :: :: ::: :: :: :: :: :: 76 5.1.1 Geometric correspondence and search trees :: :: :: :: :: :: 76 5.1.2 Insights from object recognition : :: :: ::: :: :: :: :: :: 80 5.2 Representing uncertainty :: ::: :: :: :: :: ::: :: :: :: :: :: 83 5.2.1 Options for representing uncertainty : :: ::: :: :: :: :: :: 83 5.2.2 Bayesian networks: a compromise : :: ::: :: :: :: :: :: 87 5.3 Bayesian networks for object recognition :: :: ::: :: :: :: :: :: 91 5.3.1 Part-of models : :: ::: :: :: :: :: ::: :: :: :: :: :: 91 5.3.2 Part relationships : ::: :: :: :: :: ::: :: :: :: :: :: 93 5.3.3 Constraints in nodes vs. constraints on links :: :: :: :: :: :: 93 5.3.4 Inverted networks : ::: :: :: :: :: ::: :: :: :: :: :: 94 5.3.5 Multiple networks : ::: :: :: :: :: ::: :: :: :: :: :: 95 5.4 Bayesian networks for action recognition :: :: ::: :: :: :: :: :: 96 5.4.1 Hidden Markov model networks : :: :: ::: :: :: :: :: :: 97 CONTENTS 9 5.4.2 HMM(j,k) : :: :: ::: :: :: :: :: ::: :: :: :: :: :: 98 5.4.3 Dynamic belief networks :: :: :: :: ::: :: :: :: :: :: 100 5.4.4 Rolling-up time :: ::: :: :: :: :: ::: :: :: :: :: :: 101 5.5 Action recognition : :: :: ::: :: :: :: :: ::: :: :: :: :: :: 102 5.5.1 Multi-agent plan generation : :: :: :: ::: :: :: :: :: :: 102 5.5.2 Multi-agent action and plan recognition : ::: :: :: :: :: :: 103 5.6 Summary: previous work : ::: :: :: :: :: ::: :: :: :: :: :: 108 6 The representation 109 6.1 Representational principles ::: :: :: :: :: ::: :: :: :: :: :: 110 6.2 Representational framework overview :: :: :: ::: :: :: :: :: :: 111 6.3 The temporal structure description :: :: :: :: ::: :: :: :: :: :: 113 6.3.1 Agent-based goals as atoms : :: :: :: ::: :: :: :: :: :: 113 6.3.2 Criteria for selecting goals :: :: :: :: ::: :: :: :: :: :: 114 6.3.3 Linking goals with constraints :: :: :: ::: :: :: :: :: :: 115 6.3.4 Example :: :: :: ::: :: :: :: :: ::: :: :: :: :: :: 116 6.3.5 Compound goals :: ::: :: :: :: :: ::: :: :: :: :: :: 119 6.4 Perceptual features : :: :: ::: :: :: :: :: ::: :: :: :: :: :: 120 6.5 Visual networks :: :: :: ::: :: :: :: :: ::: :: :: :: :: :: 122 6.5.1 Network structure and evaluation :: :: ::: :: :: :: :: :: 124 6.5.2 Approximations to the closed-world principle ::::::::::: 126 6.5.3 Evidence quantization :: :: :: :: :: ::: :: :: :: :: :: 128 6.6 Temporal analysis functions ::: :: :: :: :: ::: :: :: :: :: :: 128 6.7 Multi-agent networks : :: ::: :: :: :: :: ::: :: :: :: :: :: 128 6.7.1 Multi-agent network structure :: :: :: ::: :: :: :: :: :: 129 6.7.2 Generating networks automatically : :: ::: :: :: :: :: :: 132 6.8 Applying the representation ::: :: :: :: :: ::: :: :: :: :: :: 132 6.8.1 Trajectory to model consistency matching ::: :: :: :: :: :: 132 6.8.2 Network propagation performance :: :: ::: :: :: :: :: :: 133 10 CONTENTS 7 Recognition results and evaluation 135 7.1 System components and knowledge engineering : ::: :: :: :: :: :: 136 7.2 Recognition performance : ::: :: :: :: :: ::: :: :: :: :: :: 138 7.3 Recognition case study : :: ::: :: :: :: :: ::: :: :: :: :: :: 139 7.4 Evaluating the multi-agent network structure :: ::: :: :: :: :: :: 140 7.4.1 Capturing important attribute dependencies :: :: :: :: :: :: 141 7.4.2 Network computability : :: :: :: :: ::: :: :: :: :: :: 142 7.5 Representational strengths and weaknesses : :: ::: :: :: :: :: :: 144 7.5.1 Comparison to object recognition :: :: ::: :: :: ::