Stochastic And-Or Grammars: A Unified Framework and Logic Perspective
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Stochastic And-Or Grammars: A Unified Framework and Logic Perspective*

Kewei Tu
School of Information Science and Technology
ShanghaiTech University, Shanghai, China
[email protected]

Abstract

Stochastic And-Or grammars (AOG) extend traditional stochastic grammars of language to model other types of data such as images and events. In this paper we propose a representation framework of stochastic AOGs that is agnostic to the type of the data being modeled and thus unifies various domain-specific AOGs. Many existing grammar formalisms and probabilistic models in natural language processing, computer vision, and machine learning can be seen as special cases of this framework. We also propose a domain-independent inference algorithm of stochastic context-free AOGs and show its tractability under a reasonable assumption. Furthermore, we provide two interpretations of stochastic context-free AOGs as a subset of probabilistic logic, which connects stochastic AOGs to the field of statistical relational learning and clarifies their relation with a few existing statistical relational models.

1 Introduction

Formal grammars are a popular class of knowledge representation that is traditionally confined to the modeling of natural and computer languages. However, several extensions of grammars have been proposed over time to model other types of data such as images [Fu, 1982; Zhu and Mumford, 2006; Jin and Geman, 2006] and events [Ivanov and Bobick, 2000; Ryoo and Aggarwal, 2006; Pei et al., 2011]. One prominent type of extension is stochastic And-Or grammars (AOG) [Zhu and Mumford, 2006]. A stochastic AOG simultaneously models compositions (i.e., a large pattern is the composition of several small patterns arranged according to a certain configuration) and reconfigurations (i.e., a pattern may have several alternative configurations), and in this way it can compactly represent a probability distribution over a large number of patterns. Stochastic AOGs can be used to parse data samples into their compositional structures, which helps solve multiple tasks (such as classification, annotation, and segmentation of the data samples) in a unified manner. In this paper we will focus on the context-free subclass of stochastic AOGs, which serves as the skeleton in building more advanced stochastic AOGs.

Several variants of stochastic AOGs and their inference algorithms have been proposed in the literature to model different types of data and solve different problems, such as image scene parsing [Zhao and Zhu, 2011] and video event parsing [Pei et al., 2011]. Our first contribution in this paper is that we provide a unified representation framework of stochastic AOGs that is agnostic to the type of the data being modeled; in addition, based on this framework we propose a domain-independent inference algorithm that is tractable under a reasonable assumption. The benefits of a unified framework of stochastic AOGs include the following. First, such a framework can help us generalize and improve existing ad hoc approaches for modeling, inference and learning with stochastic AOGs. Second, it facilitates applications of stochastic AOGs to novel data types and problems and enables research on general-purpose inference and learning algorithms for stochastic AOGs. Further, a formal definition of stochastic AOGs as abstract probabilistic models makes it easier to theoretically examine their relation to other models such as constraint-based grammar formalisms [Shieber, 1992] and sum-product networks [Poon and Domingos, 2011]. In fact, we will show that many of these related models can be seen as special cases of stochastic AOGs.

Stochastic AOGs model compositional structures based on the relations between sub-patterns. Such probabilistic modeling of relational structures is traditionally studied in the field of statistical relational learning [Getoor and Taskar, 2007]. Our second contribution is that we provide probabilistic logic interpretations of the unified representation framework of stochastic AOGs and thus show that stochastic AOGs can be seen as a novel type of statistical relational model. The logic interpretations help clarify the relation between stochastic AOGs and a few existing statistical relational models and probabilistic logics that share certain features with stochastic AOGs (e.g., tractable Markov logic [Domingos and Webb, 2012] and stochastic logic programs [Muggleton, 1996]). It may also facilitate the incorporation of ideas from statistical relational learning into the study of stochastic AOGs and at the same time contribute to the research of novel (tractable) statistical relational models.

*This work was supported by the National Natural Science Foundation of China (61503248).

2 Stochastic And-Or Grammars

An AOG is an extension of a constituency grammar used in natural language parsing [Manning and Schütze, 1999]. Similar to a constituency grammar, an AOG defines a set of valid hierarchical compositions of atomic entities. However, an AOG differs from a constituency grammar in that it allows atomic entities other than words and compositional relations other than string concatenation. A stochastic AOG models the uncertainty in the composition by defining a probability distribution over the set of valid compositions.

Stochastic AOGs were first proposed to model images [Zhu and Mumford, 2006; Zhao and Zhu, 2011; Wang et al., 2013; Rothrock et al., 2013], in particular the spatial composition of objects and scenes from atomic visual words (e.g., Gabor bases). They were later extended to model events, in particular the temporal and causal composition of events from atomic actions [Pei et al., 2011] and fluents [Fire and Zhu, 2013]. More recently, these two types of AOGs were used jointly to model objects, scenes and events from the simultaneous input of video and text [Tu et al., 2014].

In each of the previous works using stochastic AOGs, a different type of data is modeled with domain-specific and problem-specific definitions of atomic entities and compositions. Tu et al. [2013] provided a first attempt towards a more unified definition of stochastic AOGs that is agnostic to the type of the data being modeled. We refine and extend their work by introducing parameterized patterns and relations in the unified definition, which allows us to reduce a wide range of related models to AOGs (as will be discussed in section 2.1). Based on the unified framework of stochastic AOGs, we also propose a domain-independent inference algorithm and study its tractability (section 2.2). Below we start with the definition of stochastic context-free AOGs, which are the most basic form of stochastic AOGs and are used as the skeleton in building more advanced stochastic AOGs.

A stochastic context-free AOG is defined as a 5-tuple ⟨Σ, N, S, θ, R⟩:

• Σ is a set of terminal nodes representing atomic patterns that are not decomposable;

• N is a set of nonterminal nodes representing high-level patterns, which is divided into two disjoint sets: And-nodes and Or-nodes;

• S ∈ N is a start symbol that represents a complete pattern;

• θ is a function that maps an instance of a terminal or nonterminal node x to a parameter θ_x (the parameter can take any form such as a vector or a complex data structure; denote the maximal parameter size by m_θ);

• R is a set of grammar rules, each of which is either an And-rule or an Or-rule:

• An And-rule represents the decomposition of a pattern into a configuration of non-overlapping sub-patterns. The And-rule specifies a production r : A → {x_1, x_2, ..., x_n} for some n ≥ 2, where A is an And-node and x_1, x_2, ..., x_n are a set of terminal or nonterminal nodes representing the sub-patterns. A relation between the parameters of the child nodes, t(θ_{x_1}, θ_{x_2}, ..., θ_{x_n}), specifies valid configurations of the sub-patterns. This so-called parameter relation is typically factorized into the conjunction of a set of binary relations. A parameter function f is also associated with the And-rule, specifying how the parameter of the And-node A is related to the parameters of the child nodes: θ_A = f(θ_{x_1}, θ_{x_2}, ..., θ_{x_n}). We require that both the parameter relation and the parameter function take time polynomial in n and m_θ to compute. There is exactly one And-rule headed by each And-node.

• An Or-rule, parameterized by an ordered pair ⟨r, p⟩, represents an alternative configuration of a pattern. The Or-rule specifies a production r : O → x, where O is an Or-node and x is either a terminal or a nonterminal node representing a possible configuration. A conditional probability p is associated with the Or-rule, specifying how likely the configuration represented by x is selected given the Or-node O. The only constraint in the Or-rule is that the parameters of O and x must be the same: θ_O = θ_x. There typically exist multiple Or-rules headed by the same Or-node, and together they can be written as O → x_1 | x_2 | ... | x_n.

Note that unlike in some previous work, in the definition above we assume deterministic And-rules for simplicity. In principle, any uncertainty in an And-rule can be equivalently represented by a set of Or-rules, each invoking a different copy of the And-rule.

Fig. 1(a) shows an example stochastic context-free AOG of line drawings. Each terminal or nonterminal node represents an image patch and its parameter is a 2D vector representing the position of the patch in the image. Each terminal node denotes a line segment of a specific orientation while each nonterminal node denotes a class of line drawing patterns. The start symbol S denotes a class of line drawing images (e.g., images of animal faces). In each And-rule, the parameter relation specifies the relative positions between the sub-patterns and the parameter function specifies the relative positions between the composite pattern and the sub-patterns.

With a stochastic context-free AOG, one can generate a compositional structure by starting from a data sample containing only the start symbol S and recursively applying the grammar rules in R to convert nonterminal nodes in the data sample until the data sample contains only terminal nodes.
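The generative process just described can be sketched in code. The following is a minimal illustration only, under simplifying assumptions: the toy "face" grammar and all identifiers (AND_RULES, OR_RULES, sample, etc.) are hypothetical and not from the paper; parameters θ are 2D positions, and each And-rule's parameter relation and function are realized as fixed child offsets relative to the parent.

```python
import random

# Hypothetical sketch of a stochastic context-free AOG, following the
# 5-tuple definition above. The toy grammar is an illustrative
# assumption, not an example from the paper.

AND_RULES = {  # And-node -> [(child, offset of child from parent)]
    "Face": [("EyePair", (0, 0)), ("Mouth", (0, 2))],
    "EyePair": [("Eye", (-1, 0)), ("Eye", (1, 0))],
}

OR_RULES = {  # Or-node -> [(child, conditional probability p)]
    "Eye": [("dot", 0.7), ("circle", 0.3)],
    "Mouth": [("h_segment", 0.6), ("v_segment", 0.4)],
}

TERMINALS = {"dot", "circle", "h_segment", "v_segment"}

def sample(node, pos=(0, 0), rng=random):
    """Recursively expand `node` at position `pos` (its parameter theta)
    until only terminal nodes remain.

    Returns (terminals, prob): the positioned terminal patterns and the
    probability of this derivation, i.e. the product of the Or-rule
    probabilities used (And-rules are deterministic)."""
    if node in TERMINALS:
        return [(node, pos)], 1.0
    if node in AND_RULES:  # And-node: compose all sub-patterns
        terminals, prob = [], 1.0
        for child, (dx, dy) in AND_RULES[node]:
            t, p = sample(child, (pos[0] + dx, pos[1] + dy), rng)
            terminals += t
            prob *= p
        return terminals, prob
    # Or-node: select one alternative configuration with probability p
    children, probs = zip(*OR_RULES[node])
    i = rng.choices(range(len(children)), weights=probs)[0]
    t, p = sample(children[i], pos, rng)
    return t, probs[i] * p

# Start from a data sample containing only the start symbol S ("Face").
terminals, prob = sample("Face")
```

Because the And-rules here place children at fixed offsets, every derivation yields three terminals at positions (-1, 0), (1, 0), and (0, 2); only the choice of terminal at each Or-node (and hence the derivation probability) varies.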