
Video-aided Unsupervised Grammar Induction

Songyang Zhang1,∗ Linfeng Song2, Lifeng Jin2, Kun Xu2, Dong Yu2 and Jiebo Luo1
1University of Rochester, Rochester, NY, USA
[email protected], [email protected]
2Tencent AI Lab, Bellevue, WA, USA
{lfsong,lifengjin,kxkunxu,dyu}@tencent.com

∗ This work was done when Songyang Zhang was an intern at Tencent AI Lab.

Abstract

We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. Existing methods of multi-modal grammar induction focus on learning syntactic grammars from text-image pairs, with promising results showing that the information from static images is useful in induction. However, videos provide even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases. In this paper, we explore rich features (e.g. action, object, scene, audio, face, OCR and speech) from videos, taking the recent Compound PCFG model (Kim et al., 2019) as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous state-of-the-art systems on three benchmarks, i.e. DiDeMo, YouCook2 and MSRVTT, confirming the effectiveness of leveraging video information for unsupervised grammar induction.

[Figure 1: Examples of video-aided unsupervised grammar induction. We aim to improve the constituency parser by leveraging aligned video-sentence pairs. (a) Sentence: "A squirrel jumps on stump." with accompanying bird sound. (b) Sentence: "Man starts to play the guitar fast." with accompanying guitar sound.]

1 Introduction

Constituency parsing is an important task in natural language processing, which aims to capture syntactic information in sentences in the form of constituency parse trees. Many conventional approaches learn a constituency parser from human-annotated datasets such as the Penn Treebank (Marcus et al., 1993). However, annotating syntactic trees by human language experts is expensive and time-consuming, and the supervised approaches are limited to several major languages. In addition, the treebanks for training these supervised parsers are small in size and restricted to the newswire domain, so their performance tends to be worse when applied to other domains (Fried et al., 2019). To address these issues, recent approaches (Shen et al., 2018b; Jin et al., 2018; Drozdov et al., 2019; Kim et al., 2019) design unsupervised constituency parsers and grammar inducers, since they can be trained on large-scale unlabeled data. In particular, there has been growing interest in exploiting visual information for unsupervised grammar induction, because visual information can capture important knowledge required for language learning that is ignored by text (Gleitman, 1990; Pinker and MacWhinney, 1987; Tomasello, 2003). This task aims to learn a constituency parser from raw unlabeled text aided by its visual context.

Previous methods (Shi et al., 2019; Kojima et al., 2020; Zhao and Titov, 2020; Jin and Schuler, 2020) learn to parse sentences by exploiting object information from images. However, images are static and cannot present the dynamic interactions among visual objects, which usually correspond to verb phrases that carry important information.

Therefore, images and their descriptions may not be fully representative of all linguistic phenomena encountered in learning, especially when action verbs are involved. For example, as shown in Figure 1(a), when parsing the sentence "A squirrel jumps on stump", a single image cannot present the verb phrase "jumps on stump" accurately. Moreover, as shown in Figure 1(b), the guitar sound and the moving fingers clearly indicate the speed of the music playing, which cannot be conveyed by a static image either. Therefore, it is difficult for previous methods to learn these constituents, as the static images they consider lack dynamic visual and audio information.

In this paper, we address this problem by leveraging video content to improve an unsupervised grammar induction model. In particular, we exploit current state-of-the-art techniques in both video and audio understanding, covering object, motion, scene, face, optical character, sound, and speech recognition. We extract features from the corresponding state-of-the-art models and analyze their usefulness with the VC-PCFG model (Zhao and Titov, 2020). Since different modalities may correlate with each other, independently modeling each of them may be sub-optimal. We therefore propose a novel model, Multi-Modal Compound Probabilistic Context-Free Grammars (MMC-PCFG), to better model the correlation among these modalities.

Experiments on three benchmarks show substantial improvements when using each modality of the video content. Moreover, our MMC-PCFG model, which integrates information from different modalities, further improves the overall performance. Our code is available at https://github.com/Sy-Zhang/MMC-PCFG.

The main contributions of this paper are:

• We are the first to address video-aided unsupervised grammar induction and demonstrate that verb-related features extracted from videos are beneficial to parsing.

• We perform a thorough analysis on different modalities of video content and propose a model to effectively integrate these important modalities to train better constituency parsers.

• Experimental results demonstrate the effectiveness of our model over the previous state-of-the-art methods.

2 Background and Motivation

Our model is motivated by C-PCFG (Kim et al., 2019) and its image-aided variant for unsupervised grammar induction, VC-PCFG (Zhao and Titov, 2020). We first review these two frameworks in Sections 2.1-2.2, and then discuss their limitations in Section 2.3.

2.1 Compound PCFGs

A probabilistic context-free grammar (PCFG) in Chomsky normal form can be defined as a 6-tuple (S, N, P, Σ, R, Π), where S is the start symbol, and N, P and Σ are the sets of nonterminals, preterminals and terminals, respectively. R is a set of production rules with their probabilities stored in Π, where the rules include binary nonterminal expansions and unary terminal expansions. Given a certain number of nonterminal and preterminal categories, a PCFG induction model tries to estimate rule probabilities. By imposing a sentence-specific prior on the distribution of possible PCFGs, the compound PCFG model (Kim et al., 2019) uses a mixture of PCFGs to model individual sentences, in contrast to previous models (Jin et al., 2018) where a corpus-level prior is used. Specifically, in the generative story, the rule probability π_r is estimated by the model g_r with a latent representation z for each sentence σ, which is in turn drawn from a prior p(z):

    \pi_r = g_r(z; \theta), \quad z \sim p(z).    (1)

The probabilities for the CFG initial expansion rules S → A, nonterminal expansion rules A → BC and preterminal expansion rules T → w can be estimated by calculating scores for each combination of a parent category on the left-hand side of a rule and all possible child categories on the right-hand side of a rule:

    \pi_{S \to A} = \frac{\exp(u_A^\top f_s([w_S; z]))}{\sum_{A' \in N} \exp(u_{A'}^\top f_s([w_S; z]))},
    \pi_{A \to BC} = \frac{\exp(u_{BC}^\top [w_A; z])}{\sum_{B',C' \in N \cup P} \exp(u_{B'C'}^\top [w_A; z])},
    \pi_{T \to w} = \frac{\exp(u_w^\top f_t([w_T; z]))}{\sum_{w' \in \Sigma} \exp(u_{w'}^\top f_t([w_T; z]))},    (2)

where A, B, C ∈ N, T ∈ P, w ∈ Σ, the w and u are vectorial representations of words and categories, and f_t and f_s are encoding functions such as neural networks.
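To make the parameterization of Equation (2) concrete, the following minimal NumPy sketch scores child categories against a parent embedding concatenated with the sentence latent z. The toy sizes, the tanh functions standing in for f_s and f_t, and all variable names are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

rng = np.random.default_rng(0)
NT, PT, V, d, dz = 5, 4, 100, 16, 8      # |N|, |P|, |Sigma|, symbol dim, latent dim (toy sizes)
w_S = rng.normal(size=d)                  # embedding of the start symbol S
w_N = rng.normal(size=(NT, d))            # embeddings w_A of nonterminals
w_P = rng.normal(size=(PT, d))            # embeddings w_T of preterminals
u_A = rng.normal(size=(NT, d + dz))       # score vectors u_A for rules S -> A
u_BC = rng.normal(size=((NT + PT) ** 2, d + dz))  # one score vector per child pair (B, C)
u_w = rng.normal(size=(V, d + dz))        # score vectors u_w for rules T -> w
z = rng.normal(size=dz)                   # sentence-specific latent, z ~ p(z)

f_s = f_t = np.tanh                       # stand-ins for the encoding networks f_s, f_t

pi_S = softmax(u_A @ f_s(np.concatenate([w_S, z])))      # pi_{S -> A}, sums to 1 over A
pi_A = softmax(u_BC @ np.concatenate([w_N[0], z]))       # pi_{A -> B C} for one parent A
pi_T = softmax(u_w @ f_t(np.concatenate([w_P[0], z])))   # pi_{T -> w} for one preterminal T
print(pi_S.shape, pi_A.shape, pi_T.shape)                # (5,) (81,) (100,)
```

Because z is shared across all rules of one sentence, sampling a different z yields a different PCFG, which is what makes the model a mixture of PCFGs.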
Optimization of the PCFG induction model usually involves maximizing the marginal likelihood of a training sentence p(σ) for all sentences in a corpus. In the case of compound PCFGs:

    \log p_\theta(\sigma) = \log \int_z \sum_{t \in \mathcal{T}_G(\sigma)} p_\theta(t \mid z)\, p(z)\, dz,    (3)

where t is a possible binary-branching parse tree of σ among all possible trees \mathcal{T}_G under a grammar G. Since computing the integral over z is intractable, log p_θ(σ) can be optimized by maximizing its evidence lower bound ELBO(σ; φ, θ):

    \mathrm{ELBO}(\sigma; \phi, \theta) = \mathbb{E}_{q_\phi(z \mid \sigma)}[\log p_\theta(\sigma \mid z)] - \mathrm{KL}[q_\phi(z \mid \sigma) \,\|\, p(z)],    (4)

where q_φ(z|σ) is a variational posterior, a neural network parameterized with φ. The sample log likelihood can be computed with the inside algorithm, while the KL term can be computed analytically when both the prior p(z) and the posterior approximation q_φ(z|σ) are Gaussian (Kingma and Welling, 2014).

2.2 Visually Grounded Compound PCFGs

The visually grounded compound PCFG (VC-PCFG) extends the compound PCFG model (C-PCFG) by including a matching model between images and text. The goal of the vision model is to match the representation of an image v to the representation of a span c in a parse tree t of a sentence σ. The word representation h_i for the i-th word is calculated by a BiLSTM network. Given a particular span c = [w_i, ..., w_j] (0 < i < j ≤ n), we then compute its representation c. We first compute the probabilities of its phrasal labels {p(k|c, σ) | 1 ≤ k ≤ K}, K = |N|, as described in Section 2.1. The representation c is the sum of all label-specific span representations weighted by the predicted probabilities:

    \mathbf{c} = \sum_{k=1}^{K} p(k \mid c, \sigma)\, f_k\Big(\frac{1}{j - i + 1} \sum_{l=i}^{j} \mathbf{h}_l\Big),    (5)

Finally, the matching loss between a sentence σ and an image representation v can be calculated as a sum over all matching losses between a span and the image representation, weighted by the marginal of a span from the parser:

    s_{\mathrm{img}}(v, \sigma) = \sum_{c \in \sigma} p(c \mid \sigma)\, h_{\mathrm{img}}(c, v),    (6)

where h_img(c, v) is a hinge loss between the distances from the image representation v to the matching and unmatching (i.e. sampled from a different sentence) spans c and c′, and the distances from the span c to the matching and unmatching (i.e. sampled from a different image) image representations v and v′:

    h_{\mathrm{img}}(c, v) = \mathbb{E}_{c'}[\cos(c', v) - \cos(c, v) + \epsilon]_{+} + \mathbb{E}_{v'}[\cos(c, v') - \cos(c, v) + \epsilon]_{+},    (7)

where ε is a positive margin, and the expectations are approximated with one sample drawn from the training data. During training, the ELBO and the image-text matching loss are jointly optimized.
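As a sanity check on Equations (5)-(7), here is a small NumPy sketch that builds a label-weighted span representation and scores it against an image vector with the two hinge terms. The toy span marginals, label probabilities and the linear maps standing in for f_k are made-up placeholders; in VC-PCFG these quantities come from the parser and everything is trained jointly.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n, K, d = 6, 5, 32               # sentence length, number of labels K = |N|, embedding size
h = rng.normal(size=(n, d))      # BiLSTM word representations h_1..h_n (assumed given)
f = rng.normal(size=(K, d, d))   # label-specific maps f_k, here plain linear layers
v = rng.normal(size=d)           # image representation
eps = 0.2                        # hinge margin epsilon

def span_repr(i, j, label_probs):
    """Equation (5): label-probability-weighted sum of label-specific span encodings."""
    avg = h[i:j + 1].mean(axis=0)
    return sum(label_probs[k] * (f[k] @ avg) for k in range(K))

def h_img(c, v, c_neg, v_neg):
    """Equation (7): one sampled negative span and one sampled negative image."""
    return (max(cos(c_neg, v) - cos(c, v) + eps, 0.0)
            + max(cos(c, v_neg) - cos(c, v) + eps, 0.0))

# Equation (6): sum over spans weighted by the parser's marginal span probabilities p(c|sigma).
spans, p_span = [(0, 2), (3, 5), (0, 5)], [0.6, 0.3, 0.1]
label_probs = np.full(K, 1.0 / K)                       # toy p(k|c, sigma)
c_neg, v_neg = rng.normal(size=d), rng.normal(size=d)   # negatives from other sentences/images
s_img = sum(p * h_img(span_repr(i, j, label_probs), v, c_neg, v_neg)
            for (i, j), p in zip(spans, p_span))
print(round(s_img, 3))
```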
2.3 Limitation

VC-PCFG improves over C-PCFG by leveraging the visual information from paired images. In their experiments (Zhao and Titov, 2020), compared to C-PCFG, the largest improvement comes from NPs (+11.9% recall), while the recall values of other frequent phrase types (VP, PP, SBAR, ADJP and ADVP) are fairly similar. The performance gain on NPs is also observed with another multi-modal induction model, VG-NSL (Shi et al., 2019; Kojima et al., 2020). Intuitively, image representations from image encoders trained on classification tasks very likely contain accurate information about objects in images, which is most relevant to identifying NPs.¹ However, they provide limited information for phrase types that mainly involve action and change, such as verb phrases. Representations of dynamic scenes may help the induction model identify verbs, and they also contain information about the argument structure of verbs and nouns based on features of actions and participants extracted from videos. Therefore, we propose a model that induces PCFGs from raw text aided by multi-modal information extracted from videos, and expect to see accuracy gains in such places in comparison to the baseline systems.

¹ Jin and Schuler (2020) report no improvement on English when incorporating visual information into a similar neural network-based PCFG induction model, which may be because Zhao and Titov (2020) remove punctuation from the training data, which removes a reliable source of phrasal boundary information. This loss is compensated by the induction model with image representations. We leave the study of evaluation configuration on induction results for future work.

3 Multi-Modal Compound PCFGs

In this section, we introduce the proposed multi-modal compound PCFG (MMC-PCFG). Instead of purely relying on object information from images, we generalize VC-PCFG to the video domain, where multi-modal video information is considered. We first introduce the video representation in Section 3.1. We then describe the procedure for matching the multi-modal video representation with each span in Section 3.2. After that, we introduce the training and inference details in Section 3.3.

3.1 Video Representation

A video contains a sequence of frames, denoted as V = {v_i}_{i=1}^{L_0}, where v_i represents a frame in the video and L_0 indicates the total number of frames. We extract video representations from M models trained on different tasks, which are called experts. Each expert focuses on extracting a sequence of features of one type. In order to project different expert features into the same dimension, their feature sequences are fed into linear layers (one per expert) with the same output dimension. We denote the outputs of the m-th expert after projection as F^m = {f_i^m}_{i=1}^{L_m}, where f_i^m and L_m represent the i-th feature and the total number of features of the m-th expert, respectively.

A simple method would average each feature sequence along the temporal dimension and then concatenate them together. However, this would ignore the relations among different modalities and the temporal ordering within each modality. In this paper, we instead use a multi-modal transformer to collect video representations (Gabeur et al., 2020; Lei et al., 2020).

The multi-modal transformer expects a sequence as input, hence we concatenate all feature sequences together in the form:

    \Psi^0 = [f^1_{\mathrm{avg}}, f^1_1, \dots, f^1_{L_1}, \dots, f^M_{\mathrm{avg}}, f^M_1, \dots, f^M_{L_M}],    (8)

where f^m_avg is the averaged feature of {f_i^m}_{i=1}^{L_m}. Each transformer layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN). Since this architecture is permutation-invariant, we supplement it with expert type embeddings E and positional encodings P that are added to the input of each attention layer. The expert type embeddings indicate the expert type of the input features and take the form:

    E = [e^1, e^1, \dots, e^1, \dots, e^M, e^M, \dots, e^M],    (9)

where e^m is a learned embedding for the m-th expert. The positional encodings indicate the location of each feature within the video and take the form:

    P = [p_0, p_1, \dots, p_{L_1}, \dots, p_0, p_1, \dots, p_{L_M}],    (10)

where fixed encodings are used (Vaswani et al., 2017). After that, we collect the transformer outputs that correspond to the averaged features as the final video representation, i.e., Ψ = {ψ^i_avg}_{i=1}^M. In this way, we can learn a more effective video representation by modeling the correlations of features from different modalities and different timestamps.
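The sketch below assembles the transformer input of Equations (8)-(10) in NumPy: each expert's sequence is prefixed with its average feature, and expert-type embeddings plus per-expert sinusoidal positions are added. The expert names, sequence lengths and the 512-dimensional size are assumptions for illustration; in the model the type embeddings are learned and x is fed through the multi-modal transformer.

```python
import numpy as np

def positional_encoding(length, d):
    """Fixed sinusoidal encodings (Vaswani et al., 2017)."""
    pos = np.arange(length)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
d = 512                                              # common dimension after per-expert linear layers
lengths = {"object": 8, "action": 8, "audio": 1}     # illustrative experts and sequence lengths
feats = {m: rng.normal(size=(L, d)) for m, L in lengths.items()}
type_emb = {m: rng.normal(size=d) for m in lengths}  # expert-type embeddings e^m (learned in practice)

tokens, types, positions = [], [], []
for m, F in feats.items():
    seq = np.vstack([F.mean(axis=0, keepdims=True), F])   # prepend f^m_avg (Equation 8)
    tokens.append(seq)
    types.append(np.tile(type_emb[m], (len(seq), 1)))     # expert-type embeddings E (Equation 9)
    positions.append(positional_encoding(len(seq), d))    # per-expert positions P (Equation 10)

x = np.vstack(tokens) + np.vstack(types) + np.vstack(positions)
# x goes through the transformer; the outputs at each expert's "avg" slot form Psi.
avg_slots = np.cumsum([0] + [L + 1 for L in lengths.values()])[:-1]
print(x.shape, avg_slots)    # (20, 512), avg positions [0, 9, 18]
```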
3.2 Video-Text Matching

To compute the similarity between a video V and a particular span c, a span representation c is obtained following Section 2.2 and projected to M separate expert embeddings via gated embedding modules (one per expert) (Miech et al., 2018):

    \xi^i_1 = W^i_1 \mathbf{c} + b^i_1, \quad
    \xi^i_2 = \xi^i_1 \circ \mathrm{sigmoid}(W^i_2 \xi^i_1 + b^i_2), \quad
    \xi^i = \frac{\xi^i_2}{\|\xi^i_2\|_2},    (11)

where i is the index of the expert, W^i_1, W^i_2, b^i_1, b^i_2 are learnable parameters, sigmoid is an element-wise sigmoid activation and ◦ is the element-wise multiplication. We denote the set of expert embeddings as Ξ = {ξ^i}_{i=1}^M. The video-span similarity is then computed as follows:

    \omega_i(c) = \frac{\exp(u_i^\top \mathbf{c})}{\sum_{j=1}^{M} \exp(u_j^\top \mathbf{c})}, \quad
    o(\Xi, \Psi) = \sum_{i=1}^{M} \omega_i(c)\, \cos(\xi^i, \psi^i),    (12)

where {u_i}_{i=1}^M are learned weights. Given Ξ′, the expert embeddings of an unmatched span, and Ψ′, an unmatched video representation, the hinge loss for video is given by:

    h_{\mathrm{vid}}(\Xi, \Psi) = \mathbb{E}_{\Xi'}[o(\Xi', \Psi) - o(\Xi, \Psi) + \epsilon]_{+} + \mathbb{E}_{\Psi'}[o(\Xi, \Psi') - o(\Xi, \Psi) + \epsilon]_{+},    (13)

where ε is a positive margin. Finally, the video-text matching loss is defined as:

    s_{\mathrm{vid}}(V, \sigma) = \sum_{c \in \sigma} p(c \mid \sigma)\, h_{\mathrm{vid}}(\Xi, \Psi).    (14)

Note that s_vid can be regarded as a generalized form of s_img in Equation 6, where features from different timestamps and modalities are considered.
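For Section 3.2, this NumPy sketch runs a span vector through one gated embedding module per expert (Equation 11), combines per-expert cosine similarities with softmax weights (Equation 12), and forms the video hinge loss of Equation (13) with a single sampled negative span and video. The weight scales and sizes are arbitrary; in the model W_1, W_2, b_1, b_2 and u_i are learned end-to-end.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
M, d = 3, 512                                    # number of experts, embedding size (assumed)
c = rng.normal(size=d)                           # span representation from Section 2.2
psi = rng.normal(size=(M, d))                    # video representation Psi = {psi^i_avg}
W1 = rng.normal(size=(M, d, d)) * 0.01; b1 = np.zeros((M, d))
W2 = rng.normal(size=(M, d, d)) * 0.01; b2 = np.zeros((M, d))
u = rng.normal(size=(M, d)) * 0.05               # mixture-weight vectors u_i
eps = 0.2

def gated_embeddings(c):
    """Equation (11): one gated embedding module per expert, followed by L2 normalization."""
    out = []
    for i in range(M):
        x1 = W1[i] @ c + b1[i]
        x2 = x1 * sigmoid(W2[i] @ x1 + b2[i])
        out.append(x2 / np.linalg.norm(x2))
    return np.stack(out)

def similarity(xi, psi, c):
    """Equation (12): softmax(u_i^T c)-weighted sum of per-expert cosine similarities."""
    s = u @ c
    w = np.exp(s - s.max()); w /= w.sum()
    return float(sum(w[i] * cos(xi[i], psi[i]) for i in range(M)))

xi = gated_embeddings(c)
o_pos = similarity(xi, psi, c)
c_neg, psi_neg = rng.normal(size=d), rng.normal(size=(M, d))   # unmatched span / unmatched video
h_vid = (max(similarity(gated_embeddings(c_neg), psi, c_neg) - o_pos + eps, 0.0)
         + max(similarity(xi, psi_neg, c) - o_pos + eps, 0.0))
print(round(o_pos, 3), round(h_vid, 3))
```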
3.3 Training and Inference

During training, our model is optimized with the ELBO and the video-text matching loss:

    \mathcal{L}(\phi, \theta) = \sum_{(V, \sigma) \in \Omega} \big[ -\mathrm{ELBO}(\sigma; \phi, \theta) + \alpha\, s_{\mathrm{vid}}(V, \sigma) \big],    (15)

where α is a hyper-parameter balancing the two loss terms and Ω is the set of video-sentence pairs.

During inference, we predict the most likely tree t* given a sentence σ without accessing videos. Since computing the integral over z is intractable, t* is estimated with the following approximation:

    t^* = \arg\max_t \int_z p_\theta(t \mid z)\, p_\theta(z \mid \sigma)\, dz \approx \arg\max_t p_\theta(t \mid \sigma, \mu_\phi(\sigma)),    (16)

where μ_φ(σ) is the mean vector of the variational posterior q_φ(z|σ), and t* can be obtained using the CYK algorithm (Cocke, 1969; Younger, 1967; Kasami, 1966).

4 Experiments

4.1 Datasets

DiDeMo (Hendricks et al., 2017) collects 10K unedited, personal videos from Flickr, with roughly 3-5 pairs of descriptions and distinct moments per video. There are 32,994, 4,180 and 4,021 video-sentence pairs in the training, validation and testing splits.

YouCook2 (Zhou et al., 2018) includes 2K long untrimmed videos from 89 cooking recipes. On average, each video has 6 procedure steps described by imperative sentences. There are 8,713, 969 and 3,310 video-sentence pairs in the training, validation and testing sets.

MSRVTT (Xu et al., 2016) contains 10K videos sourced from YouTube, which are accompanied by 200K descriptive captions. There are 130,260, 9,940 and 59,794 video-sentence pairs in the training, validation and testing sets.

4.2 Evaluation

Following the evaluation practice in Zhao and Titov (2020), we discard punctuation and ignore trivial single-word and sentence-level spans at test time. The gold parse trees are obtained by applying a state-of-the-art constituency parser, Benepar (Kitaev and Klein, 2018), to the testing set. All models are run 4 times for 10 epochs with different random seeds. We report both averaged corpus-level F1 (C-F1) and averaged sentence-level F1 (S-F1) numbers as well as their standard deviations.

4.3 Expert Features

In order to capture the rich content of videos, we extract features from state-of-the-art models for different tasks, including object, action, scene, sound, face, speech, and optical character recognition (OCR). For object and action recognition, we explore multiple models with different architectures and pre-training datasets. Details are as follows:

Object features are extracted by two models: ResNeXt-101 (Xie et al., 2017), pre-trained on Instagram hashtags (Mahajan et al., 2018) and fine-tuned on ImageNet (Krizhevsky et al., 2012), and SENet-154 (Hu et al., 2018), trained on ImageNet. These datasets include images of common objects, such as "cock", "kite" and "goose". We use the predicted logits as object features for both models, where the dimension is 1000.

Action features are extracted by three models: I3D trained on Kinetics-400 (Carreira and Zisserman, 2017), R2P1D (Tran et al., 2018) trained on IG-65M (Ghadiyaram et al., 2019), and S3DG (Miech et al., 2020) trained on HowTo100M (Miech et al., 2019). These datasets include videos of human actions, such as "playing guitar", "ski jumping" and "jogging". Following the same processing steps as in the original works, we extract the predicted logits as action features, where the dimension is 400 (I3D), 359 (R2P1D) and 512 (S3DG), respectively.

Scene features are extracted by DenseNet-161 (Huang et al., 2017) trained on Places365 (Zhou et al., 2017). Places365 contains images of different scenes, such as "library", "valley" and "rainforest". The predicted logits are used as scene features, where the feature dimension is 365.

Audio features are extracted by VGGish trained on YouTube-8M (Hershey et al., 2017), where the feature dimension is 128. YouTube-8M is a video dataset covering different types of sound, such as "piano", "drum" and "violin".

OCR features are extracted in two steps: characters are first recognized by combining the text detector PixelLink (Deng et al., 2018) and the text recognizer SSFL (Liu et al., 2018). The recognized text is then converted to word embeddings through word2vec (Mikolov et al., 2013) as the final OCR features, where the feature dimension is 300.

Face features are extracted by combining the face detector SSD (Liu et al., 2016) and the face recognizer ResNet50 (He et al., 2016). The feature dimension is 512.

Speech features are extracted in two steps: transcripts are first obtained via the Google Cloud Speech-to-Text API. The transcripts are then converted to word embeddings through word2vec (Mikolov et al., 2013) as the final speech features, where the dimension is 300.
4.4 Implementation Details

We keep sentences with fewer than 20 words in the training set due to computational limitations. After filtering, the training sets cover 99.4%, 98.5% and 97.1% of the samples in the original splits of DiDeMo, YouCook2 and MSRVTT, respectively.

We train the baseline models, C-PCFG and VC-PCFG, with the same hyper-parameters suggested in Kim et al. (2019) and Zhao and Titov (2020). Our MMC-PCFG is composed of a parsing model and a video-text matching model. The parsing model has the same parameters as VC-PCFG (please refer to their paper for details). For the video-text matching model, all extracted expert features are projected to 512-dimensional vectors. The transformer has 2 layers, a dropout probability of 10%, a hidden size of 512 and an intermediate size of 2048. We select the top-2000 most common words as the vocabulary for all datasets. All the baseline methods and our models are optimized using Adam (Kingma and Ba, 2015) with the learning rate set to 0.001, β1 = 0.75 and β2 = 0.999. All parameters are initialized with the Xavier uniform initializer (Glorot and Bengio, 2010). The batch size is set to 16.

Due to the long video durations, it is infeasible to feed all features into the multi-modal transformer. Therefore, each feature sequence from the object, motion and scene categories is partitioned into 8 chunks and then average-pooled within each chunk. For features from the other categories, global average pooling is applied. In this way, coarse-grained temporal information is preserved. Note that some videos do not have audio, and some videos do not have detected faces or text characters. For these missing features, we pad with zeros. All the aforementioned expert features are obtained from Albanie et al. (2020).
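A small NumPy sketch of the temporal pooling described above: dense sequences (object, motion, scene) are split into 8 chunks and averaged within each chunk, while missing modalities are replaced by zero vectors. The shapes are illustrative, not taken from the released code.

```python
import numpy as np

def chunk_pool(seq, n_chunks=8):
    """Average-pool a (length, dim) feature sequence into n_chunks temporal chunks."""
    splits = np.array_split(seq, n_chunks, axis=0)
    return np.stack([s.mean(axis=0) if len(s) else np.zeros(seq.shape[1]) for s in splits])

rng = np.random.default_rng(0)
obj = rng.normal(size=(57, 1000))     # e.g. per-frame object logits for one video
print(chunk_pool(obj).shape)          # (8, 1000)

# Missing modalities (no audio track, no detected faces or characters) are padded with zeros.
audio = np.zeros((1, 128))            # placeholder when a video has no audio
```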
4.5 Main Results

We evaluate the proposed MMC-PCFG approach on three datasets and compare it with the recently proposed state-of-the-art methods C-PCFG (Kim et al., 2019) and VC-PCFG (Zhao and Titov, 2020). The results are summarized in Table 1. The values highlighted by bold and italic fonts indicate the top-2 methods, respectively. All results are reported in percentage (%). LBranch, RBranch and Random represent left-branching trees, right-branching trees and random trees, respectively. Since VC-PCFG is originally designed for images, it is not directly comparable with our method. In order to allow VC-PCFG to accept videos as input, we average video features over the temporal dimension first and then feed them into the model. We evaluate VC-PCFG with 10, 7 and 10 expert features for DiDeMo, YouCook2 and MSRVTT, respectively. In addition, we also include the concatenated averaged features (Concat). Since the object and action categories involve more than one expert, we directly use the experts' names instead of their categories in Table 1.

Overall performance comparison. We first compare the overall performance, i.e., C-F1 and S-F1, among all models, as shown in Table 1. The right-branching model serves as a strong baseline, since English is a largely right-branching language. C-PCFG learns parsing purely based on text. Compared to C-PCFG, the better overall performance of VC-PCFG demonstrates the effectiveness of leveraging video information. Compared within VC-PCFG, concatenating all features together may not even outperform a model trained on a single expert (R2P1D vs. Concat in DiDeMo and MSRVTT). The reason is that each expert is learned independently, so their correlations are not considered. In contrast, our MMC-PCFG outperforms all baselines on C-F1 and S-F1 on all datasets. The superior performance indicates that our model can leverage the benefits of all the experts.² Moreover, the superior performance over Concat demonstrates the importance of modeling relations among different experts and different timestamps.

Performance comparison among different phrase types. We compare the models' recall on the top-3 most frequent phrase types (NP, VP and PP). These three types cover 77.4%, 80.1% and 82.4% of the spans in the gold trees on DiDeMo, YouCook2 and MSRVTT, respectively. In the following, we compare their performance on DiDeMo, as shown in Table 1. Comparing VC-PCFG trained with a single expert, we find that object features (ResNeXt and SENet) achieve the top-2 recalls on NPs, while action features (I3D, R2P1D and S3DG) achieve the top-3 recalls on VPs and PPs. This indicates that different experts help the parser learn syntactic structures from different aspects. Meanwhile, action features improve over C-PCFG³ on VPs and PPs by a large margin, which once again verifies the benefits of using video information. Comparing our MMC-PCFG with VC-PCFG, our model achieves top-2 recall with smaller variance on NP, VP and PP. This demonstrates that our model can take advantage of different experts and learn consistent grammar induction.

² The larger improvement on DiDeMo may be caused by the diversity of the video content. Videos in DiDeMo are more diverse in scenes, actions and objects, which provides a great opportunity for leveraging video information.

³ The low performance of C-PCFG on DiDeMo in terms of VP recall may be caused by it attaching a high-attaching PP to the rest of the sentence instead of to the rest of the verb phrase, which breaks the whole VP. For PPs, C-PCFG attaches prepositions to the word in front, which may be caused by confusion between prepositions in PPs and phrasal verbs.

Method      NP          VP          PP          C-F1        S-F1        C-F1       S-F1       C-F1       S-F1
            (DiDeMo)    (DiDeMo)    (DiDeMo)    (DiDeMo)    (DiDeMo)    (YouCook2) (YouCook2) (MSRVTT)   (MSRVTT)
LBranch     41.7        0.1         0.1         16.2        18.5        6.8        5.9        14.4       16.8
RBranch     32.8        91.5        66.5        53.6        57.5        35.0       41.6       54.2       58.6
Random      36.5±0.6    30.5±0.5    30.1±0.5    29.4±0.3    32.7±0.5    21.2±0.2   24.0±0.2   27.2±0.1   30.5±0.1
C-PCFG      72.9±5.5    16.5±6.2    23.4±16.9   38.2±5.0    40.4±4.1    37.8±6.7   41.4±6.6   50.7±3.2   55.0±3.2
VC-PCFG:
ResNeXt     64.4±21.4   25.7±17.7   34.6±25.0   40.0±13.7   41.8±14.0   38.2±8.3   42.8±8.4   50.7±1.7   54.9±2.2
SENet       70.5±15.3   25.7±15.9   36.5±24.6   42.6±10.4   44.0±10.4   39.9±8.7   44.9±8.3   52.2±1.2   56.0±1.6
I3D         57.9±13.5   45.7±14.1   45.8±17.2   45.1±6.0    49.2±6.0    40.6±3.6   45.7±3.2   54.5±1.6   59.1±1.7
R2P1D       61.2±8.5    38.1±5.4    62.1±4.1    48.1±4.4    50.7±4.2    39.4±8.1   44.4±8.3   54.0±2.5   58.0±2.3
S3DG        61.3±13.4   31.7±16.7   51.8±8.0    44.0±2.7    46.5±5.1    39.3±6.5   44.1±6.6   50.7±3.2   54.7±2.9
Scene       62.2±9.6    30.6±12.3   41.1±24.8   41.7±6.5    44.9±7.4    −          −          54.6±1.5   58.4±1.3
Audio       64.2±18.6   21.3±26.5   34.7±11.0   38.7±3.7    39.5±5.2    39.2±4.7   43.3±4.9   52.8±1.3   56.7±1.4
OCR         64.4±15.0   27.4±19.5   42.8±31.2   41.9±16.9   44.6±17.5   38.6±5.5   43.2±5.6   51.0±3.0   55.5±3.0
Face        60.8±16.0   31.5±17.0   52.8±9.8    43.9±4.5    46.3±5.5    −          −          50.5±2.6   54.5±2.6
Speech      61.8±12.8   26.6±17.6   43.8±34.5   40.9±16.0   43.1±16.1   −          −          51.7±2.6   56.2±2.5
Concat      68.6±8.6    24.9±19.9   39.7±19.5   42.2±12.3   43.2±14.2   42.3±5.7   47.0±5.6   49.8±4.1   54.2±4.0
MMC-PCFG    67.9±9.8    52.3±9.0    63.5±8.6    55.0±3.7    58.9±3.4    44.7±5.2   48.9±5.7   56.0±1.4   60.0±1.2

Table 1: Performance comparison on three benchmark datasets.

[Figure 2: sixteen panels (LBranch, RBranch, Random, C-PCFG, ResNeXt, SENet, I3D, R2P1D, S3DG, Scene, Audio, OCR, Face, Speech, Concat, MMC-PCFG) plotting recall (%) against constituent length (2 to 9).]

Figure 2: Recall comparison over constituent length on DiDeMo. Methods are differentiated with colors. For easy comparison, we additionally draw gray lines in each figure to indicate the average recall shown by the other figures.

Length    2     3     4     5     6     7     8     9    10    11    12    13    All
NP     26.5   9.1   3.6   2.1   1.9   0.8   0.4   0.2   0.1   0.1   0.1   0.0   45.0
VP      5.9   5.8   6.4   4.4   3.2   2.1   1.6   1.0   0.8   0.3   0.2   0.2   32.0
PP      6.9   9.3   3.8   1.2   0.8   0.5   0.1   0.1   0.0   0.0   0.0   0.0   22.8
All    39.3  24.2  13.9   7.8   5.8   3.4   2.2   1.4   0.9   0.4   0.3   0.2  100.0

Figure 3: Label distributions over constituent length on DiDeMo. "All" represents the frequencies of constituent lengths.

4.6 Ablation Study

In this section, we conduct several ablation studies on DiDeMo, shown in Figures 2-4. All results are reported in percentage (%).

Performance comparison over constituent length. We first examine model performance for constituents of different lengths in Figure 2. As the constituent length becomes longer, the recall of all models (except RBranch) decreases, as expected (Kim et al., 2019; Zhao and Titov, 2020). MMC-PCFG outperforms C-PCFG and VC-PCFG at all constituent lengths. We further illustrate the label distribution over constituent length in Figure 3. We find that approximately 98.1% of the constituents have fewer than 9 words, and most of them are NPs, VPs and PPs. This suggests that improvements on NPs, VPs and PPs can strongly affect the overall performance.

Model       NP          VP          PP          C-F1        S-F1
full        68.0±9.9    52.7±9.0    63.8±8.7    55.3±3.3    59.0±3.4
w/o audio   69.3±8.0    41.8±11.0   45.3±20.2   48.7±6.2    52.0±6.5
w/o text    68.5±13.7   38.8±16.9   57.0±20.4   49.6±10.4   52.0±11.1
w/o video   64.3±4.4    28.1±7.5    38.9±25.6   41.4±6.0    44.8±5.9

Table 2: Performance comparison over modalities of MMC-PCFG on DiDeMo.

            ResNeXt SENet  I3D   R2P1D S3DG  Scene Audio OCR   Face  Speech MMC-PCFG
ResNeXt     57.0    66.5   57.5  59.1  60.7  60.2  61.8  61.6  60.6  62.1   58.2
SENet               63.9   59.2  60.0  62.4  61.7  64.2  63.2  62.2  63.9   59.4
I3D                        58.3  61.6  60.6  61.2  55.0  57.7  60.4  58.1   64.0
R2P1D                            69.0  65.4  63.4  56.9  61.2  65.8  61.5   67.8
S3DG                                   59.7  62.1  59.7  61.2  65.6  61.6   64.6
Scene                                        59.2  58.9  60.2  61.7  61.5   62.8
Audio                                              57.9  60.4  59.5  60.9   55.9
OCR                                                      57.7  62.0  64.2   60.0
Face                                                           59.5  62.0   64.6
Speech                                                                58.1   59.6
MMC-PCFG                                                                    67.7

Figure 4: Consistency scores for different models on DiDeMo.

[Figure 5 shows four parse trees of the sentence "The man falls to the floor": (a) SENet, (b) I3D, (c) MMC-PCFG, (d) Gold Tree.]

Figure 5: Parse trees predicted by different models for the sentence "The man falls to the floor".

Consistency between different models. Next, we analyze the consistency of the different models. The consistency between two models is measured by averaging sentence-level F1 scores over all possible pairings of different runs⁴ (Williams et al., 2018). We plot the consistency for each pair of models in Figure 4 and call it the consistency matrix. Comparing the self F1 of all the models (the diagonal of the matrix), R2P1D has the highest score, suggesting that R2P1D is the most reliable feature for helping the parser converge to a specific grammar. Comparing the models trained with different single experts, ResNeXt vs. SENet reaches the highest non-self F1, since they are both object features trained on ImageNet and have similar effects on the parser. We also find that the lowest non-self F1 comes from Audio vs. I3D, since they are extracted from different modalities (video vs. sound). Compared with the other models, our model is most consistent with R2P1D, indicating that R2P1D contributes most to our final prediction.

⁴ Different runs represent models trained with different seeds.
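The consistency measure can be sketched as follows: unlabeled sentence-level F1 between the span sets of two predicted trees, averaged over sentences and over all pairings of runs (excluding a run paired with itself when computing self-consistency). The toy spans below are made up for illustration; the paper computes this over the four runs of each model on DiDeMo.

```python
import numpy as np

def sentence_f1(spans_a, spans_b):
    """Unlabeled F1 between the span sets of two predicted trees of one sentence."""
    a, b = set(map(tuple, spans_a)), set(map(tuple, spans_b))
    if not a and not b:
        return 1.0
    p = len(a & b) / len(a) if a else 0.0
    r = len(a & b) / len(b) if b else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def consistency(runs_a, runs_b, same_model=False):
    """Average sentence-level F1 over all pairings of runs from two models.
    runs_x[r][s] is the list of predicted spans of run r on sentence s."""
    pairs = [(i, j) for i in range(len(runs_a)) for j in range(len(runs_b))
             if not (same_model and i == j)]
    scores = [np.mean([sentence_f1(sa, sb) for sa, sb in zip(runs_a[i], runs_b[j])])
              for i, j in pairs]
    return float(np.mean(scores))

# Toy self-consistency of one model with two runs on a single sentence.
runs = [[[(0, 1), (0, 2), (3, 4)]], [[(0, 1), (2, 4), (3, 4)]]]
print(round(consistency(runs, runs, same_model=True), 3))   # 0.667
```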
Contribution of different modalities. We also evaluate how the different modalities contribute to the performance of MMC-PCFG. We divide the current experts into three groups: video (object, action, scene and face), audio (audio) and text (OCR and ASR). By ablating one group during training, we find that the model without the video experts has the largest performance drop (see Table 2). Therefore, video contributes most to the performance among all modalities.

4.7 Qualitative Analysis

In Figure 5, we visualize the parse trees predicted by the best runs of SENet-154, I3D and MMC-PCFG. We can observe that SENet identifies all NPs but fails at the VP. I3D correctly predicts the VP but fails at recognizing an NP, "the man". Our MMC-PCFG can take advantage of all experts and produce the correct prediction.

5 Related Work

Grammar Induction. Grammar induction and unsupervised parsing have been a long-standing problem in computational linguistics (Carroll and Charniak, 1992). Recent work utilizes neural networks to predict constituency structures with no supervision (Shen et al., 2018a; Drozdov et al., 2019; Shen et al., 2018b; Kim et al., 2019; Jin et al., 2019a) and has shown promising results. In addition to learning purely from text, there is a growing interest in using image information to improve the accuracy of induced constituency trees (Shi et al., 2019; Kojima et al., 2020; Zhao and Titov, 2020; Jin and Schuler, 2020). Different from previous work, our work improves the constituency parser by using videos, which contain richer information than images.

Video-Text Matching. Video-text matching has been widely studied in various tasks, such as video retrieval (Liu et al., 2019; Gabeur et al., 2020), moment localization with natural language (Zhang et al., 2019, 2020) and video question answering (Xu et al., 2017; Jin et al., 2019b). It aims to learn video-semantic representations in a joint embedding space. Recent works (Liu et al., 2019; Gabeur et al., 2020; Chen et al., 2020) focus on learning multi-modal video representations to match with text. In this work, we borrow this idea to match video and textual representations.

6 Conclusion

In this work, we have presented a new task referred to as video-aided unsupervised grammar induction. This task aims to improve grammar induction models by using aligned video-sentence pairs, as an effective way to address the limitation of current image-based methods, where only object information from static images is considered and important verb-related information from vision is missing. Moreover, we present Multi-Modal Compound Probabilistic Context-Free Grammars (MMC-PCFG) to effectively integrate video features extracted from different modalities to induce more accurate grammars. Experiments on three datasets demonstrate the effectiveness of our method.

7 Acknowledgement

We thank the support of NSF awards IIS-1704337, IIS-1722847, IIS-1813709, and the generous gift from our corporate sponsors.

References

Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, et al. 2020. The end-of-end-to-end: A video understanding pentathlon challenge (2020). arXiv preprint arXiv:2008.00744.

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.

Glenn Carroll and Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. Department of Computer Science, Brown University.

Shaoxiang Chen, Wenhao Jiang, Wei Liu, and Yu-Gang Jiang. 2020. Learning modality interaction for temporal sentence localization and event captioning in videos. In ECCV.

John Cocke. 1969. Programming languages and their compilers: Preliminary notes. New York University.

Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. 2018. PixelLink: Detecting scene text via instance segmentation. In AAAI.

Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In NAACL.

Daniel Fried, Nikita Kitaev, and Dan Klein. 2019. Cross-domain generalization of neural constituency parsers. In ACL.

Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal transformer for video retrieval. In ECCV.

Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. 2019. Large-scale weakly-supervised pre-training for video action recognition. In CVPR.

Lila Gleitman. 1990. The structural sources of verb meanings. Language Acquisition, 1(1):3-55.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In ECCV.

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In ICCV.

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In ICASSP.

Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In CVPR.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In CVPR.

Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018. Unsupervised grammar induction with depth-bounded PCFG. TACL.

Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Lane Schwartz, and William Schuler. 2019a. Unsupervised learning of PCFGs with normalizing flow. In ACL.

Lifeng Jin and William Schuler. 2020. Grounded PCFG induction with images. In AACL-IJCNLP.

Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, and Yueting Zhuang. 2019b. Multi-interaction network with object relation for video question answering. In ACM Multimedia.

Tadao Kasami. 1966. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257.

Yoon Kim, Chris Dyer, and Alexander M. Rush. 2019. Compound probabilistic context-free grammars for grammar induction. In ACL.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In ACL.

Noriyuki Kojima, Hadar Averbuch-Elor, Alexander M. Rush, and Yoav Artzi. 2020. What is learned in visually grounded neural syntax acquisition. In ACL.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, and Mohit Bansal. 2020. MART: Memory-augmented recurrent transformer for coherent video paragraph captioning. In ACL.

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In ECCV.

Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487.

Yang Liu, Zhaowen Wang, Hailin Jin, and Ian Wassell. 2018. Synthetically supervised feature learning for scene text recognition. In ECCV.

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In ECCV.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020. End-to-end learning of visual representations from uncurated instructional videos. In CVPR.

Antoine Miech, Ivan Laptev, and Josef Sivic. 2018. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Steven Pinker and B. MacWhinney. 1987. The bootstrapping problem in language acquisition. Mechanisms of Language Acquisition, pages 399-441.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In ICLR.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018b. Ordered neurons: Integrating tree structures into recurrent neural networks. In ICLR.

Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually grounded neural syntax acquisition. In ACL.

Michael Tomasello. 2003. Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press, Cambridge, MA, US.

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? TACL.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In CVPR.

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia.

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In CVPR.

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n3. Information and Control, 10(2):189-208.

Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. 2020. Learning 2D temporal adjacent networks for moment localization with natural language. In AAAI.

Songyang Zhang, Jinsong Su, and Jiebo Luo. 2019. Exploiting temporal relationships in video moment localization with natural language. In ACM Multimedia.

Yanpeng Zhao and Ivan Titov. 2020. Visually grounded compound PCFGs. In EMNLP.

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. TPAMI.

Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In AAAI.

A Performance Comparison - Full Tables

Method     NP          VP          PP          SBAR        ADJP        ADVP        C-F1        S-F1
LBranch    41.7        0.1         0.1         0.7         7.2         0.0         16.2        18.5
RBranch    32.8        91.5        66.5        88.2        36.9        63.6        53.6        57.5
Random     36.5±0.6    30.5±0.5    30.1±0.5    25.7±2.8    29.5±2.3    28.5±4.8    29.4±0.3    32.7±0.5
C-PCFG     72.9±5.5    16.5±6.2    23.4±16.9   26.6±15.9   25.0±11.6   14.7±12.8   38.2±5.0    40.4±4.1
VC-PCFG:
ResNeXt    64.4±21.4   25.7±17.7   34.6±25.0   40.5±26.3   16.7±9.5    28.4±21.3   40.0±13.7   41.8±14.0
SENet      70.5±15.3   25.7±15.9   36.5±24.6   36.8±25.9   21.2±12.5   23.6±16.8   42.6±10.4   44.0±10.4
I3D        57.9±13.5   45.7±14.1   45.8±17.2   38.2±14.8   28.4±9.2    22.0±9.3    45.1±6.0    49.2±6.0
R2P1D      61.2±8.5    38.1±5.4    62.1±4.1    61.5±5.1    21.4±11.4   40.8±7.3    48.1±4.4    50.7±4.2
S3DG       61.3±13.4   31.7±16.7   51.8±8.0    50.3±6.5    18.0±4.5    35.2±11.4   44.0±2.7    46.5±5.1
Scene      62.2±9.6    30.6±12.3   41.1±24.8   35.2±21.9   21.4±14.0   27.6±17.1   41.7±6.5    44.9±7.4
Audio      64.2±18.6   21.3±26.5   34.7±11.0   37.3±19.6   26.1±4.9    18.2±11.6   38.7±3.7    39.5±5.2
OCR        64.4±15.0   27.4±19.5   42.8±31.2   35.9±20.7   14.6±1.7    23.2±24.0   41.9±16.9   44.6±17.5
Face       60.8±16.0   31.5±17.0   52.8±9.8    49.3±5.6    12.6±3.3    32.9±14.6   43.9±4.5    46.3±5.5
Speech     61.8±12.8   26.6±17.6   43.8±34.5   34.2±20.6   14.4±4.8    12.9±9.6    40.9±16.0   43.1±16.1
Concat     68.6±8.6    24.9±19.9   39.7±19.5   39.3±19.8   10.8±2.8    18.3±18.1   42.2±12.3   43.2±14.2

MMC-PCFG   67.9±9.8    52.3±9.0    63.5±8.6    60.7±10.8   34.7±17.0   50.4±8.3    55.0±3.7    58.9±3.4

Table 3: Performance Comparison on DiDeMo.

Method     NP          VP          PP          SBAR        ADJP        ADVP        C-F1        S-F1
LBranch    1.7         42.8        0.4         8.1         1.5         0.0         6.8         5.9
RBranch    35.6        47.5        67.0        88.9        33.9        65.0        35.0        41.6
Random     27.2±0.3    27.1±1.4    29.9±0.5    31.3±5.2    26.9±7.7    26.2±11.9   21.2±0.2    24.0±0.2
C-PCFG     47.4±18.4   49.4±11.9   58.0±22.6   45.7±6.0    27.7±15.1   36.2±7.4    37.8±6.7    41.4±6.6
VC-PCFG:
ResNeXt    46.5±13.7   40.8±9.8    67.9±12.7   50.5±13.3   22.3±6.7    38.8±21.3   38.2±8.3    42.8±8.4
SENet      48.3±14.4   40.7±9.2    73.6±11.2   45.5±17.0   26.9±13.6   41.2±17.5   39.9±8.7    44.9±8.3
I3D        48.1±10.7   39.0±8.0    79.4±8.4    50.0±14.9   18.5±7.0    41.2±4.1    40.6±3.6    45.7±3.2
R2P1D      52.4±10.9   33.7±16.4   66.7±10.7   49.5±13.8   25.8±10.6   33.8±12.4   39.4±8.1    44.4±8.3
S3DG       50.4±13.1   32.6±16.3   71.7±7.5    33.3±5.9    30.8±17.5   40.0±7.1    39.3±6.5    44.1±6.6
Audio      51.2±3.1    42.0±7.2    61.5±18.0   51.0±14.8   23.5±16.8   48.8±8.2    39.2±4.7    43.3±4.9
OCR        48.6±8.1    41.5±4.1    65.5±17.4   39.9±4.4    18.5±6.6    53.8±14.7   38.6±5.5    43.2±5.6
Concat     50.3±10.3   42.3±2.9    81.6±8.7    40.1±3.9    17.7±8.2    52.5±5.6    42.3±5.7    47.0±5.6

MMC-PCFG 62.7±9.8 45.3±2.8 63.4±17.7 43.9±4.8 26.2±7.5 35.0±3.5 44.7±5.2 48.9±5.7

Table 4: Performance Comparison on YouCook2.

Method     NP          VP          PP          SBAR        ADJP        ADVP        C-F1        S-F1
LBranch    34.6        0.1         0.9         0.2         3.8         0.3         14.4        16.8
RBranch    34.6        90.9        67.5        94.8        25.4        54.8        54.2        58.6
Random     34.6±0.1    26.8±0.1    28.1±0.2    24.6±0.3    24.8±1.0    28.1±1.4    27.2±0.1    30.5±0.1
C-PCFG     46.6±3.2    61.1±3.3    72.5±8.3    63.7±4.0    33.1±7.1    67.1±4.7    50.7±3.2    55.0±3.2
VC-PCFG:
ResNeXt    48.6±3.0    59.0±6.0    72.0±3.6    62.1±5.2    32.6±2.5    70.4±6.4    50.7±1.7    54.9±2.2
SENet      49.0±4.4    63.5±6.4    71.7±4.8    60.9±10.6   34.0±6.4    74.1±1.9    52.2±1.2    56.0±1.6
I3D        53.9±10.5   63.2±9.1    73.7±2.9    65.3±9.1    35.0±6.8    73.8±4.1    54.5±1.6    59.1±1.7
R2P1D      52.8±3.6    63.3±4.6    73.1±10.1   66.9±2.0    34.0±2.2    72.5±4.2    54.0±2.5    58.0±2.3
S3DG       48.2±4.4    60.4±3.9    71.4±6.4    58.1±8.2    25.3±2.2    61.8±8.4    50.7±3.2    54.7±2.9
Scene      50.7±1.6    65.0±4.7    78.6±3.6    67.3±3.9    34.5±4.6    71.7±1.8    54.6±1.5    58.4±1.3
Audio      50.0±1.1    63.7±6.1    72.7±3.0    61.9±6.5    34.5±2.3    68.0±5.9    52.8±1.3    56.7±1.4
OCR        48.3±8.3    57.1±4.6    76.9±0.6    60.7±4.9    33.9±8.3    72.1±4.4    51.0±3.0    55.5±3.0
Face       46.5±6.8    61.3±3.6    71.5±7.1    60.8±11.0   30.9±3.4    68.4±6.0    50.5±2.6    54.5±2.6
Speech     48.5±7.6    60.7±3.5    74.5±5.7    62.6±6.2    27.3±1.8    74.0±3.1    51.7±2.6    56.2±2.5
Concat     43.6±6.0    64.7±3.0    68.5±8.0    63.8±3.8    32.0±5.5    70.4±5.9    49.8±4.1    54.2±4.0

MMC-PCFG   52.3±5.1    68.1±2.9    78.2±1.9    65.8±2.4    32.0±2.0    74.7±2.3    56.0±1.4    60.0±1.2

Table 5: Performance Comparison on MSRVTT.
