
PROBABILISTIC SEMANTIC FRAMES

Jiří Materna

Ph.D. Thesis

Natural Language Processing Centre
Faculty of Informatics
Masaryk University

April 2014

Jiří Materna: Probabilistic Semantic Frames, © April 2014

supervisor: doc. PhDr. Karel Pala, CSc.

ABSTRACT

During recent decades, semantic frames, sometimes called semantic or thematic grids, have become increasingly popular within the community of computational linguists. Semantic frames are conceptual structures capturing semantic roles valid for a set of lexical units, which can alternatively be seen as a formal representation of prototypical situations evoked by lexical units. Linguists use them for their ability to describe an interface between syntax and semantics. In practical natural language processing applications, they can be used, for instance, for the word sense disambiguation task, for knowledge representation, or in order to resolve ambiguities in the syntactic analysis of natural languages. Nowadays, lexicons of semantic frames are mainly created manually or semi-automatically by highly trained linguists. Manually created lexicons include, for example, the well-known lexicon of semantic frames FrameNet, a lexicon of verb valencies known as VerbNet, or the Czech valency lexicons Vallex and VerbaLex. These and other frame-based lexical resources have many promising applications, but suffer from several disadvantages. Most importantly, their creation requires manual work of trained linguists, which is very time-consuming and expensive. The coverage of the resources is then usually small or limited to specific domains. In order to avoid these problems, this thesis proposes a method of creating semantic frames automatically. The basic idea is to generate a set of semantic frames and roles by maximizing the posterior probability of a probabilistic model on a syntactically parsed training corpus. A semantic role in the model is represented as a probability distribution over all its realizations in the corpus, and a semantic frame as a tuple of semantic roles, each of them connected with some grammatical relation. For every lexical unit from the corpus, a probability distribution over all semantic frames is generated. The probability of a frame corresponds to the relative frequency of its usage in the corpus for a given lexical unit. The method of constructing semantic roles is similar to inferring latent topics in Latent Dirichlet Allocation; thus, the probabilistic semantic frames proposed in this work are called LDA-Frames. The most straightforward application is their use as a corpus-driven lexicon of semantic patterns that can be exploited by linguists for language analysis, or by lexicographers for creating lexicons. The appropriateness of using LDA-Frames for these purposes is shown in a comparison with a manually created verb pattern lexicon from the Corpus Pattern Analysis project. The second application,

which demonstrates the power of LDA-Frames in this thesis, is the measurement of semantic similarity between lexical units. It will be shown that the thesaurus built using the information from the probabilistic semantic frames outperforms a competing approach based on the Sketch Engine. This thesis is structured into two parts. The first part introduces basic concepts of lexical semantics and explores the state of the art in the field of semantic frames in computational linguistics. The second part is dedicated to describing all the necessary theoretical background, as well as the idea of LDA-Frames with its detailed description and evaluation. The basic model is provided along with several extensions, namely a non-parametric version and a parallelized computational model. The last sections of the thesis illustrate the usability of LDA-Frames in practical natural language processing applications.

PUBLICATIONS

Some ideas and figures from this thesis have previously appeared in the following publications:

• Jiří Materna. Parameter Estimation for LDA-Frames. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 482–486, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

• Jiří Materna. LDA-Frames: An Unsupervised Approach to Generating Semantic Frames. In Alexander Gelbukh, editor, Proceedings of the 13th International Conference CICLing 2012, Part I, volume 7181 of Lecture Notes in Computer Science, pages 376–387. Springer Berlin / Heidelberg, 2012.

• Jiří Materna. Building a Thesaurus Using LDA-Frames. In Pavel Rychlý, Aleš Horák, editors, Recent Advances in Slavonic Natural Language Processing, pages 97–103. Tribun EU, 2012.

• Jiří Materna and Karel Pala. Using Ontologies for Semi-automatic Linking VerbaLex with FrameNet. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010. (Contribution: 80 %)

• Jiří Materna. Linking VerbaLex with FrameNet: Case Study for the Indicate Verb Class. In Petr Sojka, Aleš Horák, editors, Recent Advances in Slavonic Natural Language Processing, pages 89–94. Tribun EU, 2010.

• Jiří Materna. Linking Czech Verb Valency Lexicon VerbaLex with FrameNet. In Proceedings of the 4th Language and Technology Conference, pages 215–219, Poznań, 2009.

• Jiří Materna. Czech Verbs in FrameNet Semantics. In Czech in Formal Grammar, pages 131–139, Lincom, München, 2009.

Other author's publications include:

• Jiří Materna. Determination of the topic consistency of documents. In Radim Jiroušek, Jiří Jelínek, editors, Znalosti 2011, pages 148–158. Fakulta elektrotechniky a informatiky, VŠB – Technická univerzita Ostrava, 2011.

• Jiří Materna and Juraj Hreško. A Bayesian Approach to Query Language Identification. In Pavel Rychlý, Aleš Horák, editors, Recent Advances in Slavonic Natural Language Processing, pages 111–116. Tribun EU, 2011. (Contribution: 80 %)

• Jiří Materna. Domain Collocation Identification. In Recent Advances in Slavonic Natural Language Processing, Faculty of Informatics, Masaryk University, Brno, 2009.

• Lubomír Popelínský and Jiří Materna. On key and key patterns in Shakespeare's plays. In Znalosti 2009, Nakladatelstvo STU, Bratislava, 2009. (Contribution 20 %)

• Jiří Materna. Automatic Web Page Classification. In Recent Advances in Slavonic Natural Language Processing, Faculty of Informatics, Masaryk University, Brno, 2008.

• Lubomír Popelínský and Jiří Materna. Keyness and Relational Learning. In International conference Keyness in Text, University of Siena, Siena, 2007. (Contribution 30 %)

• Jiří Materna. Keyness in Shakespeare's Plays. In Recent Advances in Slavonic Natural Language Processing, Faculty of Informatics, Masaryk University, Brno, 2007.

ACKNOWLEDGMENTS

First of all, I would like to thank my supervisor Karel Pala for his support and encouragement on the one hand, and his understanding of my duties at the Seznam.cz company on the other. Many thanks go to my employer Seznam.cz and all my colleagues for the opportunity to be a part of a mutually enriching environment, where research and applications together result in amazing technological products. I would also like to thank my colleagues from the NLP lab for valuable discussions of the issues of probabilistic semantic frames over the last years. I am very grateful to Tomáš Brázdil and Vladimír Kadlec, who provided me with numerous remarks on the draft of this thesis, to Steven Schwartz for his comments regarding my English, and to André Miede for his nice LaTeX template. Finally, I would like to thank my parents and my partner Eliška Mikmeková for their love and support.


CONTENTS

i semantic frames
1 introduction
  1.1 Lexical Semantics
    1.1.1 Homonymy and Polysemy
  1.2 Semantic Relations
    1.2.1 Synonymy
    1.2.2 Hyponymy/hypernymy
    1.2.3 Meronymy/holonymy
    1.2.4 Antonymy
  1.3 Ontologies and Frames
    1.3.1 Ontologies
    1.3.2 WordNet
    1.3.3 Frames
  1.4 Probabilistic Frames
    1.4.1 Practical Semantic Frames
    1.4.2 Related Work
  1.5 Structure of the Thesis
2 semantic frames for natural language processing
  2.1 Semantic Roles
  2.2 Connection with Syntax
  2.3 Verb Valencies and Semantic Frames
    2.3.1 Obligatory vs. Facultative Valencies
    2.3.2 Verb Classes
    2.3.3 Verb-specific vs. General Semantic Roles
    2.3.4 Relations Between Verb Valencies and Semantic Frames
  2.4 Motivation for Identifying Semantic Frames
3 existing frame-based lexical resources
  3.1 PropBank
  3.2 VerbNet
  3.3 The Berkeley FrameNet
    3.3.1 Semantic Frames
    3.3.2 Frame Elements
    3.3.3 FrameNet relations
    3.3.4 Semantic types
  3.4 FrameNets in other languages
    3.4.1 SALSA
    3.4.2 Spanish FrameNet
  3.5 BRIEF
  3.6 Vallex


  3.7 VerbaLex

ii probabilistic semantic frames
4 probabilistic graphical models
  4.1 Bayesian Networks
    4.1.1 Model Representation
  4.2 Generative vs. Discriminative models
  4.3 Statistical Inference in Bayesian Networks
    4.3.1 Variational Inference
    4.3.2 Sampling
5 selected probability distributions
  5.1 The Binomial and the Multinomial distributions
  5.2 The Beta and the Dirichlet distributions
  5.3 The Gamma Distribution
6 topic models
  6.1 Latent Semantic Analysis
  6.2 Probabilistic Latent Semantic Analysis
  6.3 Latent Dirichlet Allocation
7 lda-frames
  7.1 Model Description
  7.2 The Gibbs Sampler for LDA-Frames
    7.2.1 Sampling Frames
    7.2.2 Sampling Roles
8 non-parametric lda-frames
  8.1 Dirichlet Process
  8.2 Chinese restaurant process and Stick breaking process
  8.3 Estimating the number of roles
  8.4 Estimating the Number of Frames
    8.4.1 Hierarchical Dirichlet Process
    8.4.2 Fully Non-Parametric Model
    8.4.3 Sampling Unbounded Number of Frames
  8.5 Gibbs Sampler for Non-Parametric LDA-Frames
9 hyperparameter estimation
  9.1 Hyperparameter Estimation Method for Parametric LDA-Frames
  9.2 Hyperparameter Estimation Method for Non-Parametric LDA-Frames
10 distributed inference algorithms
  10.1 Parallel LDA-Frames
  10.2 Parallel Non-parametric LDA-Frames
11 evaluation
  11.1 Data Preparation
  11.2 Implementation
  11.3 Perplexity
    11.3.1 Convergence
  11.4 Reconstruction Abilities

  11.5 Scalability
  11.6 Comparison with CPA
    11.6.1 Case Study: verbs 'access', 'submit' and 'pour'
    11.6.2 Comparison on 300 most frequent lexical units
  11.7 Semantic Similarity of Lexical Units
12 conclusion
iii appendix
a compilation and usage instructions
  a.1 Introduction
  a.2 Dependencies
  a.3 Building binaries from the source codes on Ubuntu 13.10
  a.4 Example of the Usage
b lda-frames model – selected frames
bibliography
index

LIST OF FIGURES

Figure 1.1  Semantic frames evoked by lexical units consume and eat.
Figure 2.1  Example of the Prepositional Phrase Attachment problem.
Figure 3.1  PropBank example entry for lexical unit 'open'.
Figure 3.2  FrameNet frame example: Ingestion.
Figure 3.3  FrameNet relation example: Crime_scenario.
Figure 3.4  SALSA annotation example: Categorization.
Figure 3.5  Example of the Vallex frame for 'vařit'.
Figure 3.6  Example of the VerbaLex frame for 'vařit'.
Figure 4.1  Bayesian Network.
Figure 4.2  Plot of the univariate Gaussian showing the mean µ and the standard deviation σ.
Figure 4.3  Probabilistic graphical model for the distribution of people's heights in the population.
Figure 4.4  Generative and discriminative types of classifiers.
Figure 4.5  Logistic function.
Figure 5.1  The binomial distribution with p = 0.6 and n = 10.
Figure 5.2  The beta distribution for various values of the parameters.
Figure 5.3  Examples of three-dimensional Dirichlet distributions.
Figure 5.4  Probability density functions for various parameters of the gamma distribution.
Figure 6.1  Generative process for probabilistic topic models.
Figure 6.2  Graphical model for PLSA.
Figure 6.3  Graphical model of latent Dirichlet allocation.
Figure 7.1  Graphical model for LDA-frames.
Figure 8.1  Chinese restaurant process.
Figure 8.2  Estimating the number of roles in LDA-frames.
Figure 8.3  Hierarchical Dirichlet process.
Figure 8.4  Chinese restaurant franchise.
Figure 8.5  Graphical model of Non-parametric LDA-frames.
Figure 11.1  Example of the typed dependency parse for the sentence "I saw the man who loves you".
Figure 11.2  Directory tree structure.
Figure 11.3  A screenshot of the semantic frame representation on the LDA-Frames web site – two most likely frames for lexical unit 'eat'.
Figure 11.4  Convergence of the LDA-Frames algorithm.


Figure 11.5  Convergence and the number of components in the Non-Parametric LDA-Frames algorithm.
Figure 11.6  Perplexities for different values of F and R on the synthetic data set.
Figure 11.7  Convergence of the test perplexity for the parallelized LDA-Frames algorithm.
Figure 11.8  Speedup for 1 to 10 cores using the parallelized LDA-Frames algorithm.
Figure 11.9  Screenshot of the PDEV entry for verb 'access'.
Figure 11.10  Average number of words in the similarity sets.
Figure 11.11  Comparison of the Sketch Engine thesauri with the LDA-frames thesauri.

LIST OF TABLES

Table 6.1  The bag-of-words representation of text documents.
Table 6.2  Reduced 2-dimensional representation of the documents.
Table 7.1  Example of training data.
Table 7.2  Extended example of a training data set.
Table 7.3  Semantic role labels.
Table 11.1  Semantic roles along with their lexical realizations.
Table 11.2  Lexical units and their semantic frames.
Table 11.3  Evaluation data sets.
Table 11.4  Matching between Corpus Pattern Analysis (CPA) and LDA-Frames.
Table 11.5  Statistics of the matching between CPA and LDA-Frames.
Table 11.6  Quality of the automatic assignment of LDA-Frames.
Table 11.7  The correspondence of LDA-Frames and CPA patterns.
Table B.1  Top-ten realizations of semantic roles in LDA-Frames.

LIST OF ALGORITHMS

Algorithm 4.1  Rejection sampling algorithm for N iterations.
Algorithm 4.2  Metropolis sampling algorithm for N iterations.
Algorithm 4.3  Gibbs sampling algorithm for N iterations.
Algorithm 6.1  Generative process for PLSA.
Algorithm 6.2  Generative process for LDA.
Algorithm 7.1  Generative process for LDA-Frames.
Algorithm 7.2  Sampling from the categorical distribution over frames.
Algorithm 8.1  Algorithm for sampling a new frame.
Algorithm 8.2  Algorithm for sampling non-parametric LDA-Frames.
Algorithm 9.1  Algorithm for sampling the γ hyperparameter.
Algorithm 9.2  Algorithm for sampling the α hyperparameter.
Algorithm 10.1  Algorithm for distributed LDA-Frames.

ACRONYMS

BNC British National Corpus

CPA Corpus Pattern Analysis

CPU Central Processing Unit

CVF Complex Valency Frame

EM Expectation-Maximization

FE Frame Element

IR Information Retrieval

LDA Latent Dirichlet Allocation

LSA Latent Semantic Analysis

LSI Latent Semantic Indexing

MCMC Markov Chain Monte Carlo

MLE Maximum-Likelihood Estimation

NLP Natural Language Processing

PCFG Probabilistic Context-Free Grammar

PDEV Pattern Dictionary of English Verbs

PGM Probabilistic Graphical Model

PLSA Probabilistic Latent Semantic Analysis

PWN Princeton WordNet

SR Semantic Role

SVD Singular Value Decomposition

VPS Verb Pattern Sample

WSD Word Sense Disambiguation

WSI Word Sense Induction

Part I

SEMANTIC FRAMES

You shall know a word by the company it keeps. (Firth, 1957)

1 INTRODUCTION

Word sense ambiguity is a well-known phenomenon typical for all natural languages throughout the world, which must be resolved by speakers when communicating. The ability to differentiate word meanings and to choose the correct meaning based on the discourse or gesture context is an intrinsic capability of humans. It has been shown (Piantadosi et al., 2011) that word sense ambiguity is an essential part of all languages, allowing greater communicative efficiency; it could not be avoided even in artificial languages such as Esperanto (Zamenhof, 1991) if they were used massively. In fact, language ambiguity allows communication systems to be short, simple and efficient. In computational Natural Language Processing (NLP), by contrast, ambiguity at all levels, including word sense ambiguity, is a source of many problems. Many natural language processing applications like machine translation and information retrieval require techniques enabling, at least partially, unambiguous semantic parsing and understanding. The task of identifying the correct sense of a word based on the textual context is usually called word sense disambiguation. It assumes that the set of possible word senses is known and that it is required to select the most appropriate sense. The set of word senses, however, is usually defined manually, which is a time-consuming task for humans that typically leads to a subjectively biased selection. One possible way to avoid these problems is to find word senses automatically based on text corpora. The task of identifying the word senses of a word automatically is called word sense induction. An important issue of dealing with multiple word senses in computers is their formal representation. Conventional dictionaries for humans, which are essentially lists of lexical entries introduced and alphabetized by a head-word, represent different senses of the head-word by sense numbers accompanied with a definition described in a natural language. Such a form of representation is convenient for people. When it comes to representing word senses in computers, however, we need something more formal (the definitions themselves, written in a natural language, can be ambiguous and require semantic parsing). Nowadays, there is a trend to build large domain-independent electronic lexical databases containing as much semantic information as possible. Probably the best known semantic database is the Princeton WordNet (Fellbaum, 1998). It is a large lexical resource


of American English, developed under the direction of George A. Miller, where nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets). The synsets are interlinked by means of conceptual-semantic and lexical relations (e.g. synonymy, antonymy, hypero/hyponymy, meronymy/holonymy). Another complex electronic lexical resource for English is known as FrameNet (Ruppenhofer et al., 2006), which is based on the Frame Semantics proposed by Charles J. Fillmore (Fillmore, 1982). In Frame Semantics, a word meaning is described in relation to a semantic frame, which consists of a targeted lexical unit (a pairing of a word with a sense), frame elements (its semantic dependants) and relations between them. WordNet and FrameNet are examples of lexical resources that can be exploited for the formal representation of word senses, e.g. (Li et al., 1995; Banerjee and Pedersen, 2002; Kolte and Bhirud, 2008; Siemiński, 2011), but they suffer from some disadvantages stemming from the fact that they are mostly created manually or semi-automatically. These drawbacks, along with the proposed solution, which is the main topic of this thesis, will be described later. Nevertheless, before we start to explore the electronic lexical resources in depth, it is necessary to define the most important terms related to linguistic semantics.

1.1 lexical semantics

Linguistic semantics is generally defined as the study of meaning in languages. We should ask, however, what the meaning of 'meaning' is in terms of lexical semantics. First of all, it is necessary to distinguish between use and mention, a concept known from analytic philosophy (Borchert, 2005). For example, we can use the word 'semantics' in two sentences with completely different aspects:

(1.1) Semantics is an important part of the natural language processing.

(1.2) The word 'semantics' has nine letters.

In sentence (1.1), the word 'semantics' is interpreted normally, as a reference to a field of natural language processing, and is said to be used. On the other hand, in sentence (1.2), the word 'semantics' is interpreted as a chain of letters and is said to be mentioned. Following the notation from this example, mentioned words will be surrounded by quotation marks for the rest of this thesis. When speaking about words, it is worth describing the difference between types and tokens, and between words and word-forms (Lyons, 1977a). The distinction between type and token was introduced into semantics by the American philosopher Peirce (1931–1958). Types are generally abstract and unique sorts of things, in contrast to tokens, which are their particular instances. Let us describe the relationship by considering the following example

(1.3) to be or not to be

This famous opening phrase of a monologue from William Shakespeare's play Hamlet consists of 4 types (unique words) and 6 tokens. Briefly said, tokens are instantiations of their types. The distinction between words and word-forms is unclear for many people, and the term 'word' is usually used with two different meanings. To be accurate, we say that word-forms are different grammatical forms of the same word. For example, 'mouse' and 'mice' are different word-forms of the word 'mouse'. Having prepared the necessary terminology, we can proceed to the description of lexical semantics. Lexical semantics is a subfield of linguistic semantics dealing with the meaning of individual words. The units of meaning in lexical semantics are called lexical units. Unlike compositional semantics, which is concerned with the meanings of whole sentences, we can store all lexical units in a mental lexicon. No such mental lexicon is ever complete because we are continually learning new words and forgetting those that are not used very often. On the other hand, according to Chomsky (1969a), when producing a sentence, we create constructions that have never been used before by recursively applying a fixed set of grammatical and semantic rules. In order to be able to use lexical units in sentences correctly, we need to accompany them with some information about their prototypical behavior. One approach to storing such information in electronic lexical databases is to use a structure called the semantic frame. In this work, we will focus particularly on what individual lexical units mean, why they mean what they mean and how we can represent their meaning. The specific representation of lexical meaning will be discussed in detail.
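To make the type/token distinction from example (1.3) concrete, the following minimal Python sketch counts both; it is an illustrative snippet and not part of the thesis software:

phrase = "to be or not to be"
tokens = phrase.split()          # ['to', 'be', 'or', 'not', 'to', 'be'] -> 6 tokens
types = set(tokens)              # {'to', 'be', 'or', 'not'}             -> 4 types
print(len(tokens), len(types))   # prints: 6 4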

1.1.1 Homonymy and Polysemy

In the field of lexical semantics, linguists distinguish two types of ambiguity – homonymy and polysemy. The most important difference between homonymy and polysemy is defined in terms of word etymology. Let us begin to differentiate them by looking at an example. Linguists say that 'bank1' (bank of a river) and 'bank2' (financial institution) are distinct but homonymous words, but that 'mouth' is a single polysemous word with several different senses like "body organ" or "entrance of a cave". The criterion is based on the

knowledge of the historical derivation of words. The words 'bank1' and 'bank2' were derived from distinct words in an earlier stage of the language, but 'mouth' in all its meanings came from one word (Lyons, 1977b). In practice, the etymological criterion is not always distinctive and the situation is much more complicated. For the purposes of this work, however, we do not need to distinguish between homonymy and polysemy. Thus, we will use both terms interchangeably in the following text.

1.2 semantic relations

Semantic relations can simply be defined as relations between meanings or concepts. For instance, the concept car can be expressed by different words with the same or similar meaning like 'car', 'automobile', 'auto', etc. This kind of semantic relation is called synonymy. The set of all semantic relations may not be finite. According to Bean and Green (2001), the inventory of relations contains both a closed set of relationships, including mainly equivalence and hierarchical relationships, and an open set of relations (especially between verbs). Every time a new verb is introduced, a new type of relationship may arise. The verb 'love', for example, defines an asymmetric relationship between two entities, where one entity (typically a person) is in love with the other. Since an exact and deep exploration of semantic relations and their classification (which is still an open problem) is not necessary for understanding semantic frames, we will limit ourselves to the most important and general semantic relations.

1.2.1 Synonymy

Synonymy is one of the most important relationships between lexical units. According to the definition usually attributed to Leibniz, synonyms are different words which can replace one another in any context without changing the meaning of the proposition. This definition, however, is too strict and is fulfilled only by a small number of very similar words. That is why near synonymy is used instead of Leibniz's synonymy. In accordance with the definition of near synonymy, two different words are synonyms if one can be replaced by the other in the same context without changing the sense of the sentence. An example of such a synonymy relation is the pair of words 'eat' and 'consume'.

1.2.2 Hyponymy/hypernymy

Hyponymy and hypernymy are relationships between two lexical units, where one's semantic range is within/outside of the other's. In computer science, this type of relationship is sometimes called the IS-A, TYPE-OF, or lexical inheritance relation because all hyponyms must inherit all properties of their hypernyms. For example, the word 'animal' is a hypernym of 'dog' and 'car' is a hyponym of 'vehicle'.

1.2.3 Meronymy/holonymy

Meronymy and holonymy are relationships between a whole and its parts. For example, 'wheel' is a meronym of 'car' and 'house' is a holonym of 'door'. This relationship is, similarly to the hyponymy/hypernymy relation, transitive.

1.2.4 Antonymy

Lexical units with opposite meaning are interlinked by the antonymy relationship. Words may have several different antonyms, depending on their meaning. For instance, both 'long' and 'tall' can be antonyms of 'short'. The definition of antonymy may be a little bit confusing. For example, antonymy is not the same as negation: saying that someone is not rich does not mean that he or she is poor, although these two words are antonyms.

1.3 ontologies and frames

Assuming that the independent meanings of words are identified, we need to know how to connect the semantic information from the lexicon with other words and with background knowledge to determine the meaning of whole clauses or larger text sequences. It turns out that it is not easy to isolate the meaning of a word from its context and the outer world. When putting pieces of word senses together, it is necessary to investigate the role of contextual information and to combine linguistic knowledge with background knowledge of the world and the word's prototypical behavior. People, when listening to a speaker, have a very active role. Using what has been said together with known language patterns, they make inferences about what the speaker meant (Saeed, 2009). For many years, there have been attempts to formalize the information about contextual and world knowledge in electronic databases, which has led to the development of electronic ontologies and networks of frame structures such as semantic frames.

1.3.1 Ontologies

The term ontology has its origin in philosophy and represents the philosophical study of the nature of being as well as the basic categories of being and their relations. It was Aristotle who first constructed a well-defined ontology. In his Metaphysics (Sachs, 1999) he analyzed the simplest elements to which the mind reduces the real world. Moving back to computer science, an ontology is a formal representation of a set of concepts1 and relationships between them. In order to explain how ontologies are constructed and what they are good for, we can refer to logical systems. The existential quantifier in logic is a notation for asserting that something exists, but logic itself has no vocabulary for describing the things that exist. Ontology fills that gap. It is the study of the existence of all kinds of entities, abstract and concrete, that make up the world or a particular domain. Domain ontologies, sometimes called domain-specific ontologies, model a specific part of the world. Universal, non-domain ontologies are known as upper ontologies. An upper ontology is a model of the common world, which describes general concepts that are the same across all domains. There are several upper ontologies, including commercial ones, e.g. Cyc (Lenat, 1995), as well as freely available ones, e.g. SUMO (Niles and Pease, 2001) or Dolce (Gangemi et al., 2002).

1.3.2 WordNet

One of the most famous electronic lexical resources combining linguistics with the idea of ontologies is WordNet. The Princeton WordNet (PWN) (Fellbaum, 1998) is a large lexical database of the English language, developed under the direction of George A. Miller at Princeton University. Its design has been inspired by psycholinguistic research and computational theories of human lexical memory (Miller et al., 1993). Entries in PWN are made of nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. Different senses of polysemous words (literals) are distinguished by their sense IDs (positive integers). Synsets in PWN are interlinked by means of semantic relations, which form the net; the most important relations include synonymy, hyponymy/hypernymy, meronymy/holonymy and antonymy. While at the beginning of the project WordNet was intended to be a psycholinguistic study of the human mind, nowadays it is used in many branches of natural language processing as a general purpose reference source for the English language. It is freely available on

1 When speaking about concepts, we rather mean their labels, because concepts as such are not linguistic entities.

the web2, and furthermore, it can be used, copied, modified and distributed as a part of commercial applications. Since the Princeton WordNet covers only English, there has been an effort to create WordNets for other languages. It has resulted in the project called EuroWordNet (Vossen and Hirst, 1998), which is a multilingual database based on the same principles as PWN in terms of synsets with basic semantic relations between them. The project was completed in the summer of 1999 and currently comprises WordNets for seven other European languages (Czech, Dutch, Estonian, French, German, Italian and Spanish). Synsets in the WordNets of all listed languages are linked via the interlingual index coming from PWN. The languages are interconnected by means of this index, thus it is possible to traverse from the synsets in one language to similar synsets in any other language. The index also gives us access to a shared top-ontology of 63 semantic items. This top-ontology provides a common semantic framework for all the languages, while language-specific relations are maintained in the individual WordNets. The database can be used in both monolingual and cross-lingual NLP tasks, but in contrast to PWN, some parts of the WordNets are not freely available (for example the Dutch part). The EuroWordNet project has been followed by similar projects for other languages. In 2001, the BalkaNet project was developed, intended to extend the database of WordNets linked to PWN. It has included the development of Balkan WordNets (Greek, Turkish, Romanian, Bulgarian and Serbian) as well as the second phase of the development of the Czech WordNet (Pala and Smrž, 2004).
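As an illustration of how such a lexical database can be queried programmatically, the following sketch uses the NLTK interface to the Princeton WordNet; it assumes the NLTK package and the WordNet data are installed, and it is illustrative rather than part of this thesis:

# Requires the NLTK package and its WordNet data (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# Different senses of the polysemous word 'mouse' are stored as separate synsets.
for synset in wn.synsets('mouse'):
    print(synset.name(), '-', synset.definition())

# Semantic relations are links between synsets.
dog = wn.synset('dog.n.01')
print(dog.hypernyms())        # e.g. the synsets for 'canine' and 'domestic_animal'
print(dog.member_holonyms())  # e.g. the synset for the genus 'canis'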

1.3.3 Frames

Besides representing standalone concepts, linguists must be able to organize lexical units in high-level structures. In symbolic logic, the basic units are predicates, which are connected by operators to create formulas representing high-level structures. Another possible way of organizing concepts is to use structures such as frames. Concerning knowledge representation, the frame is a data structure introduced by Marvin Minsky (Minsky, 1975). It is intended to represent complex objects or stereotyped situations. We can think of a frame as a network of nodes and relations, where some nodes are fixed, representing objects that are always true about the frame. The rest of the nodes represent slots, which must be filled in with specific instances of data. Being inspired by Minsky, many researchers from different branches of science have followed him in the idea of frames. In linguistics, the best known frame-based theory is Charles J. Fillmore's Frame Semantics (Fillmore, 1982), which will be discussed later. It should

2 http://wordnet.princeton.edu/wordnet/download/

be stated that the roots of Fillmore's frames date back to his earlier work The Case for Case (Fillmore, 1968), in which he introduces deep (semantic) cases that are today understood as semantic roles. Before we proceed to an in-depth description of frames in later chapters, it is useful to acquire an intuitive notion of semantic frames in linguistics. Figure 1.1 depicts two semantic frames evoked by the lexical units consume and eat. In this specific interpretation of the concept of

Figure 1.1: Semantic frames evoked by lexical units consume and eat. [The figure shows the lexical units 'consume' and 'eat' linked to two semantic frames, PERSON (subj) + FOOD (obj) and ANIMAL (subj) + FOOD (obj), with labels marking a lexical unit, a semantic frame, a semantic role and a grammatical relation.]

semantic frames, a frame consists of a set of slots, where each slot represents a particular grammatical relation. Here, the slots represent grammatical subject and direct object of lexical units. The slots are filled in with semantic role labels that account for semantic categories of possible grammatical relation realizations. The frames in the figure distinguish two usage patterns of the words ‘consume’ and ‘eat’. The first frame captures a situation where the subject of the process of eating is a person, the second represents the case where the subject is an animal. Such semantic frames can be used, for example, to describe collocational behavior of lexical units, or to distinguish between different meanings of polysemous words.
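To fix the idea, the frames of Figure 1.1 could be encoded with a data structure along the following lines; this is an illustrative Python sketch whose class and field names are hypothetical and are not taken from any existing lexical resource:

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Frame:
    # one (grammatical relation, semantic role) pair per slot
    slots: Tuple[Tuple[str, str], ...]

# The two frames of Figure 1.1, shared by the lexical units 'consume' and 'eat'.
person_food = Frame(slots=(("subj", "PERSON"), ("obj", "FOOD")))
animal_food = Frame(slots=(("subj", "ANIMAL"), ("obj", "FOOD")))

lexicon: Dict[str, List[Frame]] = {
    "consume": [person_food, animal_food],
    "eat": [person_food, animal_food],
}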

1.4 probabilistic frames

In exact sciences such as mathematics and physics, the most important concepts are described by very simple formulas, e.g. E = mc². It is not because we are doomed to simplify a very complex world; it is because nature, at its roots, has simple rules that together make up a complex system. In the humanities, by contrast, when describing natural languages, for example, people have not invented any small and comprehensive set of rules yet. An informal and incomplete grammar of the English language is described in more than 1700 pages (Quirk et al., 1985). Perhaps it is because natural languages do not have the elegance of physics. Maybe there are some simple regularities that have not been discovered yet. Whatever the true reason is, we should perhaps stop searching for comprehensive,

extremely elegant theories, and make use of simple statistics and the massive data sets available for free in electronic form. This is the well-known disagreement between Noam Chomsky and Peter Norvig. "The notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term", wrote Chomsky (1969b). One of his main arguments is that the probability of a novel sentence must be zero, and since novel sentences are generated all the time, there is a contradiction. This argument, however, is overcome by current language models that can assign non-zero probabilities to novel sentences thanks to sophisticated smoothing methods (Manning and Schütze, 1999). Moreover, Norvig argues that in engineering applications of NLP such as search engines, speech recognition or machine translation, the state-of-the-art systems are mostly based on statistical models (Norvig et al., 2009). His complete argumentation can be found on his website3. When the Brown Corpus with one million English words was released in 1961, it marked the beginning of statistical natural language processing. A few years later, Kucera et al. (1967) published their classic work providing basic statistics of American English, and the impact of statistics in NLP began to rise. Since then we have seen several notable corpora, and nowadays we have access to web corpora a million times larger, for example at the Linguistic Data Consortium4. The world is not black and white, however. The very large corpora are extracted from the Web, thus they contain a lot of spelling and grammatical errors, incomplete sentences and all sorts of other errors. They are not manually annotated with part-of-speech tags nor with any kind of syntactic or semantic information. But their extreme size outweighs all these drawbacks. All we need to know is how to extract useful models from the data.
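To make the smoothing argument above concrete, here is a minimal illustrative sketch (toy data, not from the thesis) of an add-one smoothed bigram model that assigns a non-zero probability to a sentence it has never seen:

from collections import Counter
from itertools import chain

# Toy corpus of tokenized sentences (illustrative data only).
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
vocab_size = len(set(chain.from_iterable(corpus)))

unigram_counts = Counter(chain.from_iterable(corpus))
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def p_add_one(prev, word):
    # Add-one (Laplace) smoothed estimate of P(word | prev)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# "the dog sleeps" never occurs in the corpus, yet its probability is non-zero.
novel = ["the", "dog", "sleeps"]
prob = 1.0
for i in range(len(novel) - 1):
    prob *= p_add_one(novel[i], novel[i + 1])
print(prob)  # a small but strictly positive number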

1.4.1 Practical Semantic Frames

Databases of semantic frames can basically serve two different purposes – linguistic research and natural language processing. Most of the current semantic frame databases are based on formalisms proposed by linguists with the primary goal to describe natural languages. When we are focusing on practical application in NLP, however, we require quite different qualities from the semantic frames. Let us start by listing properties that are typical for current semantic frame databases and are not desirable in NLP. Afterwards, we will proceed to the proposal of the ideal form of semantic frames for natural language processing.

3 http://norvig.com/chomsky.html
4 http://www.ldc.upenn.edu/

The most limiting property of current lexicons of semantic frames is that they are created manually by highly trained linguists. The completely manual work has recently been supported by some sort of semi-automatic preprocessing, but it still requires non-trivial human effort. This results in several negative consequences:

• Their creation is very time-consuming and expensive. For example, the development of FrameNet started at the end of the 1990s and the work is still in progress. In fact, it is a never-ending story because new words that need to be assigned to proper semantic frames steadily arise. As far as I know, the expenses covering the annotators' salaries have not been published, but it must be a huge amount of money.

• Coverage of the resources is usually small or limited to some specific domain. According to the information provided on its website5, the largest database of semantic frames, FrameNet, currently covers 12,711 lexical units. In comparison, the Second Edition of the Oxford English Dictionary (Simpson and Weiner, 1989) contains approximately 153,000 entries for nouns, adjectives and verbs in current use, and more than 47,000 obsolete words. This means that FrameNet contains less than one twelfth of the currently used vocabulary of the English language.

• Most of the resources do not provide any information about the relative frequency of usage in corpora. When dealing with different meanings or frames of a lexical unit, it is very useful to know the importance of the meanings. For instance, both patterns [Person] acquire [Physical_object] and [Person] acquire [Disease] reflect a correct usage of the verb 'acquire', but the former is much more frequent in English. Similarly, it is worth knowing, for instance, that the semantic role Person is more frequently used than the semantic role Teacher in English semantic frames in general. This information can be used, for example, as a prior probability of semantic frames and roles in the task of semantic role labeling.

• The notion of semantic classes and frames is subjectively biased when the frames are created manually without corpus evidence. If two people are asked, for example, to create a frame for arriving, it is very likely that they will define different sets of semantic roles. Let us have a look at the frame Arrive in FrameNet. There are two core frame elements and seventeen non-core frame elements, including Frequency, defined as "How often the Theme arrives at the Goal". It is meaningful because we can count times of arriving. However, why is Frequency missing in other frames describing countable events,

5 https://framenet.icsi.berkeley.edu/fndrupal/current_status

such as Arrest, Control, Hiring, Travel, etc.? Human annotators make mistakes and have different opinions. It is impossible to avoid inconsistencies and errors even if a very accurate and well-written annotation manual is available.

All the drawbacks we have listed so far stem from the manual creation of semantic frames. If we want to avoid them and develop semantic frames that would be easily applicable in practical NLP, we need to propose a method for creating semantic frames automatically by using machine learning techniques. The biggest successes in natural language related machine learning applications in recent years have been achieved in statistical machine translation and statistical speech recognition (Norvig et al., 2009). The reason, however, is not that these tasks are easier than other tasks in NLP. In fact, they are much harder. The true reason is that we have big training data available. Every day the European Union clerks generate a huge amount of texts that are required to be translated to all EU languages. This data is collected in order to compile big parallel corpora. Similarly, speech transcriptions are routinely done for real human needs. On the other hand, we do not have large enough training data for traditional NLP tasks such as parsing or named-entity recognition. This data needs to be created on purpose by hired annotators. An important lesson we should learn from history is that the usage of large and naturally created data is better than hoping for better-fitting but manually annotated data sets, which are usually much smaller. Since no examples of semantic frames are generated naturally in large amounts, we are doomed to exploit unsupervised techniques on large text corpora that are not semantically annotated. When creating a database of semantic frames manually, the author has to decide how many semantic frames need to be defined. In other words, he or she needs to know what the intended granularity of senses is. The same problem also arises with the choice of the proper number of semantic roles for each semantic frame. It is obvious that the number of frames and roles increases with the growing size of the data which the frames are required to describe. Since we want to learn semantic frames automatically from corpora, we need to use a non-parametric method for discovering the best number of frames and roles based on the training corpus. Current lexicons of semantic frames usually do not have any weights of importance for the valid frames of a lexical unit. The weights are important for integrating the semantic frames into a statistical NLP system. That is why it is reasonable to develop probabilistic semantic frames so that there is a probability distribution over the set of semantic frames for every lexical unit and a probability distribution over valid realizations for every semantic role. The probabilistic

semantic frames make it easier to annotate a corpus with the frames automatically and to integrate them into a statistical system. The largest database of semantic frames, FrameNet, has two important properties that are not very useful for practical NLP. First, the set of semantic roles is frame-specific, and secondly, the roles are not explicitly connected with grammatical relations. The former property implies that we are unable to investigate relationships between semantic roles of different frames. For example, we do not know from FrameNet that the semantic role Self_mover in frame Self_motion, defined as "Self_mover is the living being which moves under its own power", is of a related type to Traveler in frame Travel, defined as "This is the living being which travels". Specifically, we do not have any explicit information about the semantic similarity of these two semantic roles other than the textual description. In order to be able to compare semantic roles from different semantic frames, it is convenient to share the set of semantic roles across all semantic frames. The probability distribution over the semantic role realizations connected with every semantic role then enables us to compute similarity between two different semantic roles. The latter mentioned property of FrameNet – that the semantic roles are not explicitly connected with grammatical relations – is double-edged. On one hand, as we will see in section 2.2, some semantic roles can be realized in different grammatical relations and the assignment would be ambiguous. It is most likely the reason why grammatical relations are not included in FrameNet frames. This problem, however, can be solved by introducing more frames for a lexical unit. On the other hand, if they were a component of semantic frames, the automatic annotation of texts with semantic frames would be much easier, which is crucial for practical NLP. That is why it is desirable to incorporate the information about grammatical relations into the proposed model of semantic frames. To conclude, let us list the most important properties of the newly proposed model of semantic frames (a minimal illustrative sketch of such a representation follows the list):

• semantic frames must be automatically generated from text corpora;
• the algorithm must be unsupervised;
• the model should be probabilistic;
• the model should provide a possibility to estimate its parameters automatically;
• the set of semantic roles should be shared between different semantic frames;
• information about valid grammatical relations should be included.
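The following toy sketch illustrates what such a probabilistic, shared-role representation could look like in practice; all names and numbers are hypothetical and are not taken from the LDA-Frames implementation described later in this thesis:

# Each semantic role is a probability distribution over its lexical realizations,
# and the set of roles is shared by all frames.
roles = {
    "PERSON": {"man": 0.40, "woman": 0.35, "child": 0.25},
    "ANIMAL": {"dog": 0.50, "cat": 0.30, "horse": 0.20},
    "FOOD":   {"pasta": 0.50, "banana": 0.30, "porridge": 0.20},
}

# Each frame assigns one shared role to every grammatical relation.
frames = {
    "f1": {"subj": "PERSON", "obj": "FOOD"},
    "f2": {"subj": "ANIMAL", "obj": "FOOD"},
}

# Each lexical unit has a probability distribution over frames, reflecting the
# relative frequency of usage of each frame in the training corpus.
lexical_units = {
    "eat":     {"f1": 0.7, "f2": 0.3},
    "consume": {"f1": 0.9, "f2": 0.1},
}

# Example query: probability that the subject of 'eat' is realized as 'dog'.
p = sum(
    p_frame * roles[frames[f]["subj"]].get("dog", 0.0)
    for f, p_frame in lexical_units["eat"].items()
)
print(p)  # 0.3 * 0.5 = 0.15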

In order to create semantic frames that fulfill our requirements, we need to solve three orthogonal problems – choosing a representational language, encoding a model in that language, and performing inference on the model. These challenges will be discussed in the second part of the thesis.

1.4.2 Related Work

Statistical unsupervised approaches to semantic parsing have recently attracted the attention of many researchers. Most of the approaches related to semantic frame induction focus on modeling selectional preferences and semantic role labeling. A pioneering approach in this area was the model proposed by Rooth et al. (1999), who cast the problem of selectional preference induction into a probabilistic latent variable framework. Each observed predicate argument is generated from a latent variable, and this latent variable is generated from the underlying distribution over variables. The discovery of the latent variables, representing semantic clusters of the predicate argument realizations, is carried out using the expectation-maximization procedure. Following Rooth's work, Durme and Gildea (2009) and Séaghdha (2010) proposed several models for selectional preference induction based on Latent Dirichlet Allocation. Their models separately learned selectional preferences for verb-object, noun-noun, and adjective-noun relations. The generative models sample both predicate and argument realizations together. In order to model multiple selectional preferences, Ritter et al. (2010) developed LDA-SP. It models the subject and object grammatical relations together. The model is based on an extension of Latent Dirichlet Allocation known as LinkLDA (Erosheva et al., 2004), which simultaneously models two sets of distributions over topics. These two sets represent the two arguments of the relations. A non-trivial amount of work has been done in the area of unsupervised semantic role annotation for PropBank (Swier and Stevenson, 2004; Grenager and Manning, 2006; Lang and Lapata, 2010, 2011a,b; Titov and Klementiev, 2012). These models are, however, PropBank-specific, and thus far away from the goal of designing probabilistic semantic frames without relying on manually created lexical resources. Recently, Modi et al. (2012) published the description of a framework modeling both semantic frame and role representations together, which is the closest approach to the goals stated in this thesis. They proposed a Bayesian non-parametric method, which uses the hierarchical Pitman-Yor process (Pitman, 2002) to model the statistical dependencies between predicate and argument clusters, as well as distributions over lexical realizations of each cluster. The primary

goal of Modi's work is to generate FrameNet-like semantic frames and roles. Similarly to FrameNet, however, the semantic roles are treated as frame-specific, and thus the model does not try to discover any correspondence between roles in different frames. To the best of my knowledge, the first unsupervised model that generates semantic frames with a shared set of semantic roles and fulfills all requirements stated in this thesis is called LDA-Frames. It has been published by myself in Materna (2012a, 2013) and will be described in detail in this thesis.

1.5 structure of the thesis

This Ph.D. thesis is divided into two logical parts. The first part, which started with this opening chapter, is dedicated to introducing the reader to the basic concepts and ideas behind semantic frames and related linguistic constructions. It also provided a motivation for constructing lexicons of probabilistic semantic frames. The following chapter goes deeper into the issues of semantic frames and verb valencies. It introduces semantic roles, the elementary constituents of the frames, and their role in lexical semantics. Within this section, the reader also explores how the semantic frames are connected with syntax and what motivation leads us to build lexicons of semantic frames. The first part closes with a list of several existing lexicons of semantic frames and verb valencies, along with their brief description. The core of this work is described in the second part. Since the suggested model for semantic frames is of a stochastic nature, the thesis provides the reader with a description of probabilistic graphical models and the probability distributions required for the models. An important component of the proposed probabilistic semantic frames are topic models. This is the reason why Latent Semantic Analysis, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation are presented afterwards. The probabilistic semantic frame model itself is called LDA-Frames and is described in detail along with its non-parametric extension called Non-Parametric LDA-Frames, a hyperparameter estimation method, and a parallelized algorithm which enables us to generate the probabilistic semantic frames from huge data sets in a distributed manner. The model is subsequently evaluated using several standard methods. At the end of the thesis, the model is tested on two applications. The first is the construction of a lexicon of semantic frames based on the learned model, which is compared with a manually created lexicon from the CPA project. The second is the usage of LDA-Frames for building a thesaurus.

2 SEMANTIC FRAMES FOR NATURAL LANGUAGE PROCESSING

Robust morphological analyzers and syntactic parsers are widely used in many natural language processing applications, and their quality has been steadily increasing in recent years. The analysis of morphology and syntax, however, covers only a small portion of the language information needed for complete understanding of the sentences that are parsed. Therefore, semantic parsing must be employed. It is apparent that some sort of semantic information is even required for both morphological and syntactic disambiguation. Take for instance the example below

(2.1) Charles ate pasta with his friend Rebecca.

This sentence can be interpreted in two different ways. In the first reading (figure 2.1a), the prepositional phrase 'with his friend Rebecca' is attached directly to the verb phrase. It means that Charles is likely sitting at the table with Rebecca and eating his pasta. In the second reading (figure 2.1b), however, the prepositional phrase is attached to the noun phrase representing the direct object of the verb 'ate'. This parsing may be interpreted as Charles eating his friend Rebecca with pasta. Both interpretations are grammatically correct, but the second is semantically highly implausible. This problem is known as prepositional phrase attachment (Merlo and Ferrer, 2006). In order for us to be able to analyze such sentences correctly, we need to identify the participating entities, their type, their relation to each other, and especially their relation to the verb predicate. In this particular example, there is one verb predicate 'ate' and three related entities – 'Charles', 'pasta' and 'friend Rebecca'. The construction of the sentence and the entity's relation to the predicate assigns it a specific role. 'Charles' is the entity responsible for carrying out the action, 'pasta' is the entity affected by the action of 'Charles' and 'friend Rebecca' is accompanying 'Charles' in the process. These roles have various names in the literature, e.g. semantic roles (Givón, 1990), deep semantic cases (Fillmore, 1968) and thematic roles (Dowty, 1986), which might be confusing. We will use the term semantic roles for the rest of this thesis. When analyzing the semantic roles and their connection to predicates, we are able to link them to an ontology. It can help us recognize that, given the verb 'eat', the affected entity should be food and that 'Rebecca' is not a typical instance of food. Using this kind


Figure 2.1: Example of the Prepositional Phrase Attachment problem.
(a) The prepositional phrase is attached to the verb phrase:
[S [NP Charles] [VP ate [NP pasta] [PP with [NP his friend Rebecca]]]]
(b) The prepositional phrase is attached to the noun phrase:
[S [NP Charles] [VP ate [NP [NP pasta] [PP with [NP his friend Rebecca]]]]]

of information, we can solve the problem of prepositional phrase attachment from example (2.1).
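A minimal illustrative sketch of this idea follows; the rule, the FOOD set and the function name are hypothetical and serve only to show how selectional preferences could guide the attachment decision:

# Hypothetical selectional-preference check for the verb 'eat'/'ate'.
FOOD = {"pasta", "banana", "porridge"}

def prefers_np_attachment(verb: str, pp_np_head: str) -> bool:
    """Attach the PP to the object NP only if the resulting object still
    satisfies the verb's selectional preference; otherwise prefer attaching
    the PP to the verb phrase."""
    if verb in {"eat", "ate"}:
        return pp_np_head.lower() in FOOD
    return True  # no preference known for this verb

print(prefers_np_attachment("ate", "Rebecca"))  # False -> attach the PP to the VP
print(prefers_np_attachment("ate", "banana"))   # True  -> NP attachment is plausible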

2.1 semantic roles

The roots of the concept of a Semantic Role (SR) go back to Fillmore's book on deep semantic cases, The Case for Case (Fillmore, 1968). In this book, Fillmore was the first to ask the question about evidence for the existence of semantic roles as semantically typed arguments of verbs. In his conception, every noun phrase associated with a verb is assigned a deep semantic case. The originally published list of deep cases included Agentive, Dative, Instrumental, Factitive, Locative and Objective, where the number and type of the deep cases are determined by the verb itself. In the English language, verbs are typically intransitive (requiring one argument), transitive (two arguments) or ditransitive (three arguments). For example, the verb 'love' requires Agentive and Objective, so it is transitive.

During many years of research, the deep semantic cases evolved to a new list of semantic roles, which is generally respected by linguists (Saeed, 2009). The most important semantic roles with their description and examples are listed below:

AGENT is the initiator of an action, capable of acting with volition. It is usually the grammatical subject of the verb in an active clause. For example

(2.2) David eats bananas and porridge for breakfast. (2.3) The government locked the unused reactors for 35 years.

PATIENT is the entity affected by some action and often undergoing some change of state. It is usually the grammatical object of the verb in an active clause. For example

(2.4) That child broke the window. (2.5) The barber cut his hair.

THEME is the entity affected by some action, which is moved or whose location is described. Its state, however, is not changed. It is usually the grammatical object of the verb in an active clause. For example

(2.6) I put the book on the shelf. (2.7) John has two children.

EXPERIENCER is the entity which perceives an action but which is not in control of the action or state. For example

(2.8) He listened to the radio. (2.9) The pupils saw their new teacher.

BENEFICIARY is the entity for whom a benefit action is performed. For example

(2.10) They helped the Japanese people after the earthquake. (2.11) The parents rented an apartment for their son.

INSTRUMENT is the means by which the action of an agent is performed. Grammatically, it is usually a prepositional phrase. For example

(2.12) Martin smashed the car with a big hammer.

(2.13) Two copies of the agreement have been signed with the gold pen.

LOCATION is the place where something is situated or takes place. For example

(2.14) The children played soccer on their playground. (2.15) He found the cell phone under the couch.

GOAL is the entity towards which the action is directed, in both the literal and metaphoric meaning. For example

(2.16) Allan gave all his money to the croupier. (2.17) He told the joke to his friends.

SOURCE is the entity from which the action is directed, in both the literal and metaphoric meaning. For example

(2.18) He came to Prague from the United States. (2.19) We heard the story from his friend.

STIMULUS is the entity causing an effect in the experiencer. For example

(2.20) She did not smell the smoke. (2.21) The noise frightened the people.

The difference between the semantic roles may not always be apparent. Thanks to Jackendoff (1974), distinguishing between some semantic roles is more straightforward. For example, to differentiate Agent from Patient, we can ask the question What did X do?, where ‘X’ is replaced with a potential Agent. If the question can be applied, the assumption is correct. Similarly, Patient can be tested using questions such as What happened to Y? or What did X do to Y?. The situation becomes more complicated when we try to distinguish between Patient and Theme. The general definition says that Patient undergoes a change, whereas Theme simply changes its location or is located. This seems straightforward, but some examples might be quite complicated. Let us have a look at the following sentence:

(2.22) The blizzard covered the road with snow.

Has the road undergone some change of state (and so it is classified as Patient) or not? Moreover, recognizing the Agent in sentence (2.22) is unclear as well. The blizzard is hardly acting intentionally, so it does not fulfill the definition of Agent. That is why Cruse (1986) divided Agents into several subtypes based on the intentionality and volition of the roles. There are many other examples of sentences where the role assignment is unclear or ambiguous, and the problem of identifying semantic roles is still not settled in a satisfactory way.

Besides unclear role assignment, there is another difficult question: whether a single entity can fulfill two or more semantic roles, and whether a situation where some noun phrases have no role assignment is correct. This issue has been studied by Chomsky (1993), who stated that there must be a one-to-one correspondence between noun phrases and semantic roles in a sentence. This is known as Chomsky's Theta-Criterion. By contrast, Jackendoff (1985) discovered some counter-examples that contradict the theory. These issues are still under investigation and are not settled.

2.2 connection with syntax

As one can see from the previous examples, in English some semantic roles tend to appear as realizations of specific grammatical relations. For example, Agents tend to be subjects, Patients and Themes tend to be direct objects, and Instruments or Locations tend to be prepositional phrases. Unfortunately, this is not always the case. There are generally two reasons for this (Saeed, 2009). The first arises when some roles are simply omitted and the grammatical relations shift in reaction. The second occurs when the speaker chooses to alter the usual matching between roles and grammatical relations. This is usually marked by an accompanying change of verbal voice. It is illustrated in the following sentences:

(2.23) John broke the window. (2.24) I felt tired after running ten kilometers. (2.25) The car crashed. (2.26) The key opened the lock.

In example (2.23), the subject is the Agent, in example (2.24) the subject is the Experiencer, in example (2.25) the subject is the Patient, and the last example (2.26) is a case where the subject is the Instrument. Even though the mapping between grammatical relations and semantic roles is not straightforward, syntactic parsing is an important part of the algorithms for semantic role labeling. Semantic role labeling is a task where semantic roles are assigned to words

or phrases in natural languages. A parse tree is required as an input for machine learning, for example, in the approaches to semantic role labeling by Gildea and Jurafsky (2002) or Pradhan et al. (2005). Moreover, there are some lexical resources where the grammatical relations play a key role and are stored together with semantic roles in the valency frame, as in the lexicon by Hlaváčková and Horák (2006). Grammatical relation identification is also required for generating the probabilistic semantic frames called LDA-frames, which are the main result of this thesis.

2.3 verb valencies and semantic frames

As we saw earlier, verbs usually have particular requirements for their semantic roles. Since the information about the prototypical behavior of verbs in sentences is a part of people's knowledge about a verb, it is reasonable to store such information in lexicons of verb valencies. The linguistic interpretation of valency is derived from the definition of valency in chemistry and was used for the first time by Lucien Tesnière in his work on the syntactic analysis of sentences (Tesnière, 1959). In the generative grammar literature, e.g. Gruber (1965), the frames of semantic roles are called thematic role grids, or theta grids for short. Nowadays, there are plenty of various models of verb valencies, each with its own specifics, but we can roughly say that valencies are verb arguments required by the verb in a particular language. For the rest of this thesis, we will call them valency frames or semantic frames. A simple example of a valency frame in English can be

(2.27) feed: ⟨Agent, Patient⟩

This frame tells us that the verb ’feed’ has two obligatory semantic role arguments (the verb is transitive), where one of them is Agent and the second is Patient.
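To make such a frame concrete as a data structure, the following minimal Python sketch (purely illustrative, not the representation used by any lexicon discussed in this thesis) stores a valency frame as a lexical unit together with a tuple of semantic roles:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ValencyFrame:
    """A valency frame: a lexical unit together with its semantic role arguments."""
    lexical_unit: str
    roles: Tuple[str, ...]

    def arity(self) -> int:
        """Number of role arguments (1 = intransitive, 2 = transitive, ...)."""
        return len(self.roles)

# The 'feed' frame from example (2.27): two obligatory arguments.
feed = ValencyFrame("feed", ("Agent", "Patient"))
print(feed.arity())  # -> 2, i.e. the verb is transitive
```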

2.3.1 Obligatory vs. Facultative Valencies

Some frames may capture both obligatory valencies, which are required in all circumstances, and facultative valencies, which are optional. The information about the type of valency, however, should be present in the frame. For instance, let us have a look at the following sentences:

(2.28) [Michel]Agent put [the book]Theme [in the library]Location.

(2.29) [Michel]Agent read [the book]Theme [in the library]Location.

The predicates in sentences (2.28) and (2.29) capture the same valency structure

(2.30) put, read: ⟨Agent, Theme, Location⟩.

However, Location in the former example is obligatory and in the latter facultative. That is why it would be better to distinguish the type of valency explicitly

(2.31) put: ⟨Agent, Theme, Location (obligatory)⟩

(2.32) read: ⟨Agent, Theme, Location (facultative)⟩.

A deeper insight into the issues of obligatoriness and facultativeness can be found in Radford (1988) or Haegeman (1994). In general, it is assumed that all verbs may co-occur with adverbials of time, manner and place; therefore, such arguments are usually not listed in lexicons of verb valencies.

2.3.2 Verb Classes

Identifying valency frames soon led to grouping verbs with the same frames into classes. For example, the English verbs ’give’, ’pay’, ’supply’, ’transfer’, etc. may form the class Transfer. All those verbs capture the situation where Agent transfers Theme to Recipient, which is encoded using the following valency frame:

(2.33) give, pay, supply, transfer: ⟨Agent, Theme, Recipient⟩.

An example of more elaborate verb classes is Beth Levin's system of verb classes and alternations (Levin, 1993). In her work, Levin recognizes approximately 80 types of alternations. She argues that syntactic variations are a direct reflection of the underlying semantics. An example of such an alternation is the causative/inchoative alternation. The causative (transitive) alternation is a change of state where the Agent is explicitly mentioned, and the inchoative (intransitive) describes the change without a reference to the causer. This alternation is illustrated in the following example, where sentence (2.34) represents the causative usage of the verb ‘crash’ and sentence (2.35) the inchoative usage:

(2.34) John crashed the car. (2.35) The car crashed.

The alternations which are (or are not) applicable to a given verb then determine its semantic class.
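A naive way to obtain such classes automatically is to group verbs whose role structures coincide. The sketch below (with an invented verb-to-frame mapping, not Levin's actual classification) groups verbs by the tuple of their semantic roles:

```python
from collections import defaultdict

# Hypothetical verb -> frame mapping; the role tuples follow example (2.33).
frames = {
    "give":     ("Agent", "Theme", "Recipient"),
    "pay":      ("Agent", "Theme", "Recipient"),
    "supply":   ("Agent", "Theme", "Recipient"),
    "transfer": ("Agent", "Theme", "Recipient"),
    "eat":      ("Agent", "Patient"),
    "devour":   ("Agent", "Patient"),
}

# Verbs sharing exactly the same frame fall into one candidate class.
classes = defaultdict(list)
for verb, frame in frames.items():
    classes[frame].append(verb)

for frame, verbs in sorted(classes.items()):
    print(frame, "->", sorted(verbs))
```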

2.3.3 Verb-specific vs. General Semantic Roles

One of the most problematic parts of creating verb valency lexicons is the identification of the set of semantic roles. We introduced a list of the most important semantic roles in section 2.1, but there is still an ongoing discussion among linguists about what the ideal set of semantic roles should look like. On the one hand, a small set of general semantic roles has the advantage of simplicity and universality. On the other hand, a larger set of semantic roles allows capturing fine distinctions between verb argument meanings. In section 2.1 we saw that Patient is defined as the entity affected by an action of the Agent. However, Dixon (1991), for example, remarked that there are many types of affectedness. The intensity of affectedness is illustrated in the following examples:

(2.36) Mary touched the window with her finger. (2.37) John pushed the door. (2.38) Charles squeezed the ball in his hands. (2.39) Peter smashed the wall with a hammer.

The intensity of affectedness varies from merely touching to smashing the instantiation of the Patient role. Regardless of the problems arising with an extremely small number of shared semantic roles on the one hand, or with verb-specific sets of semantic roles on the other, both approaches are used in practice, as shown in the frames below:

(2.40) attack (VerbNet):

(2.41) attack (FrameNet):

Example (2.40) illustrates the valency frame of the verb ‘attack’ in VerbNet, which uses a general set of semantic roles; example (2.41) shows the frame taken from FrameNet, which is based on lexical-unit-specific semantic roles.

2.3.4 Relations Between Verb Valencies and Semantic Frames

So far, we have discussed only valencies of verbs. A similar idea, however, can be applied to other parts of speech, specifically nouns, adjectives and adverbs. While verbs usually require some arguments obligatorily, nouns, adjectives and adverbs evoke prototypical situations where the semantic arguments are predominantly facultative. The prototypical situations can also be described by frame structures similar to the verb valency frames. Following Fillmore (1982), we generalize the idea of verb valencies and call them semantic frames. To illustrate how semantic frames may look, let us list some examples of semantic frames for lexical units of various parts of speech taken from FrameNet together with their semantic role description:

(2.42) reason:

(2.43) lucky:

(2.44) first:

The basic idea proposed by Fillmore (1982) is that a semantic frame captures typical, not necessarily obligatory, semantic participants in natural language sentences. The point is that a single word does not provide full semantic information about its meaning. When a

word has multiple meanings, every meaning should be associated with a different semantic frame. Thus, we can see semantic frames as a generalization of valency frames, which connects semantics with syntax and can be applied to lexical units of various parts of speech. More details about Fillmore’s semantic frames will be described in the subsequent chapter.

2.4 motivation for identifying semantic frames

Semantic frames have primarily been proposed to help linguists understand language (Fillmore, 1982); however, we can list several practical applications from the natural language processing field. The main purpose of identifying semantic frames and semantic roles is to link the syntactic behavior of lexical units with their semantics. Their usage for the task known as word sense disambiguation is very straightforward.

Word Sense Disambiguation (WSD) is a natural language processing task whose goal is to identify the correct word sense for a given word if the word has multiple meanings. In order to find the correct word sense, the context (such as a sentence or paragraph) in which the word is used has to be provided. The set of possible word senses is usually predefined. If we do not know the list of senses, the task changes to Word Sense Induction (WSI).

There are basically two approaches to WSD – a shallow approach and a deep approach. Shallow approaches just consider the surrounding words and do not try to understand the text. For example, have a look at the following sentences:

(2.45) A lot of fish were hidden near the bank of the river. (2.46) A bank account is a financial account between a bank customer and a financial institution.

There are two occurrences of the word ‘bank’ with different meanings, which can easily be distinguished based only on the surrounding words. In example (2.45), words like ‘fish’ and ‘river’ indicate that the correct meaning should be the border of a river. In sentence (2.46), the correct meaning of ‘bank’, which is a bank account, can be detected by the occurrence of the words ‘account’, ‘financial’, ‘customer’ and ‘institution’. This approach, while not very powerful in theory, works quite well in practice.

Deep approaches focus on the semantics of languages and on knowledge of the world. There are many sentences where the shallow approach fails and we need to incorporate deeper knowledge. For example, consider the sentence

(2.47) The company constructed the bank near the river.

The presence of the word ‘river’ in the sentence might indicate that the correct word sense is the border of a river, but world knowledge tells us that it is much more likely a building, rather than the border of a river, that is being constructed near the river. For a deep understanding of the meaning, we usually require some kind of lexical resource in a computer-readable format describing the semantics of ambiguous words. Semantic frames are examples of such lexical resources and are indeed widely used for WSD (Ide and Véronis, 1998; Dang, 2004; Navigli, 2009). They can be used as selectional restrictions disambiguating the correct sense based on the semantics of the predicate's arguments. It means that one semantic frame should not represent multiple meanings of one word. For instance, let us return to sentence (2.47). One of the possible semantic frames of the predicate ‘to construct’ is

(2.48) construct: ⟨Institution, Building⟩.

If we are able to map the semantic roles to their grammatical realizations in the sentence (Institution to ‘The company’ and Building to ‘bank’), we can ask whether the building sense of ‘bank’ is a more likely instance of the semantic role Building than the border-of-a-river sense, and select the best-fitting meaning of ‘bank’; a small sketch of this idea follows below. Word sense disambiguation is an important part of many complex tasks such as machine translation, information retrieval, information extraction, etc. That is why semantic frames may be useful in many fields of natural language processing.
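The following minimal Python sketch illustrates the selectional-restriction idea; the sense labels and typicality scores are invented for illustration and do not come from any real frame lexicon:

```python
# Hypothetical typicality of each sense as a filler of a semantic role.
# In practice such scores would come from a frame lexicon or corpus counts.
role_fillers = {
    "Building":    {"bank/financial_institution": 0.9, "bank/river_border": 0.05},
    "Institution": {"company": 0.8},
}

def disambiguate(word_senses, role):
    """Pick the sense that best fits the semantic role it realizes."""
    scores = role_fillers.get(role, {})
    return max(word_senses, key=lambda sense: scores.get(sense, 0.0))

senses_of_bank = ["bank/financial_institution", "bank/river_border"]
# In sentence (2.47), 'bank' realizes the role Building of the frame (2.48).
print(disambiguate(senses_of_bank, "Building"))  # -> bank/financial_institution
```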

Another NLP application of semantic frames, which is quite different from WSD, is identifying semantic classes of words or building thesauri. For example, both verbs ‘learn’ and ‘teach’ represent an action of transferring knowledge; however, in the case of ‘learn’ the agent is receiving knowledge, whereas in the case of ‘teach’ the agent is giving knowledge. These two directions of transferring knowledge may be considered to represent different semantic classes. This can easily be identified by looking at the semantic frames for ‘learn’ and ‘teach’:

(2.49) learn:

(2.50) teach:

Apart from identifying semantic classes, the comparison of semantic frames can be used for building a thesaurus. A thesaurus is generally a lexical resource grouping together words or phrases according to the similarity of their meanings. This similarity need not only be defined in terms of synonymy. Antonymy, hyponymy, hypernymy and weak synonymy are also valid semantic relationships for grouping words together in some types of thesauri.

The basic idea behind building thesauri using semantic frames is that words with similar meanings should share most of their valid semantic frames. An example of building a thesaurus based on semantic frames can be found in Materna (2012b).
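As a rough sketch of this idea (with invented frame inventories, not taken from any of the lexicons described below), the overlap between the frame sets evoked by two lexical units can be measured with a simple Jaccard coefficient:

```python
def frame_similarity(frames_a, frames_b):
    """Jaccard overlap of the sets of frames evoked by two lexical units."""
    a, b = set(frames_a), set(frames_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical frame identifiers for three verbs.
frames = {
    "teach": {"Transfer_of_knowledge", "Communication"},
    "learn": {"Transfer_of_knowledge", "Cognition"},
    "eat":   {"Ingestion"},
}

print(frame_similarity(frames["teach"], frames["learn"]))  # relatively high
print(frame_similarity(frames["teach"], frames["eat"]))    # 0.0
```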

EXISTING FRAME-BASED LEXICAL RESOURCES 3

Nowadays, several important lexical resources providing extensive databases of semantic frames or verb valencies for English exist. The set of the three largest and best-known databases is comprised of PropBank, VerbNet and FrameNet. Even though these lexical databases capture semantic frames or verb valencies as described in the previous chapter, they differ in many aspects and details. All these lexical databases, together with three representatives for the Czech language – BRIEF, Vallex and VerbaLex – and several examples of semantic frame databases for other languages, are described in this chapter.

3.1 propbank

The primary goal of developing the Propositional Bank¹ (or PropBank for short) is to annotate the Penn Treebank² phrase structures with predicate-argument relations (Kingsbury and Palmer, 2002; Kingsbury et al., 2002). For each verb, the correct word senses are identified. Also, for each sense, the argument structures are stored. The semantic role labels are chosen to be generic and neutral, so they are distinguished only by numbers (for example Arg0, Arg1, Arg2, etc.). It is important to note that the numbers may have different meanings with different verbs. One of the goals, however, is to provide consistent argument labels across different syntactic realizations of the same verb. For example

(3.1) [John]Arg0 broke [the window]Arg1.

(3.2) [The window]Arg1 broke.

The second task for the PropBank annotation involves assigning functional tags to all modifiers of the verb. The modifiers are denoted as ArgM and are further subclassified based on the type, such as manner (MNR), time (TMP), location (LOC), purpose (PRP), etc. Finally, the PropBank annotation involves finding antecedents for empty arguments of the verbs. A simplified example of annotations is shown in figure 3.1.

1 http://verbs.colorado.edu/~mpalmer/projects/ace.html 2 http://www.cis.upenn.edu/~treebank/


Texas Instruments Inc. opened a plant in South Korea to manufacture control devices.

Arg0: Texas Instruments Inc.
REL: opened
Arg1: a plant
ArgM-LOC: in South Korea
ArgM-PNC: to manufacture control devices

Figure 3.1: PropBank example entry for lexical unit ‘open’.
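For illustration only, the annotation in figure 3.1 can be represented as a simple mapping from argument labels to text spans; this is a sketch of the annotation structure, not the actual PropBank file format:

```python
# A PropBank-style annotation for the sentence in figure 3.1.
annotation = {
    "REL": "opened",
    "Arg0": "Texas Instruments Inc.",
    "Arg1": "a plant",
    "ArgM-LOC": "in South Korea",
    "ArgM-PNC": "to manufacture control devices",
}

# Numbered arguments and modifiers can be separated by their label prefix.
modifiers = {k: v for k, v in annotation.items() if k.startswith("ArgM")}
numbered = {k: v for k, v in annotation.items()
            if k.startswith("Arg") and not k.startswith("ArgM")}
print(numbered)
print(modifiers)
```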

3.2 verbnet

VerbNet³ is a lexicon of English verbs developed at the Department of Linguistics, University of Colorado. It is partially based on the PropBank project – it uses its syntactic frames; see for example Dang et al. (1998); Kipper et al. (2000); Schuler (2005); Kipper et al. (2008). The main difference in comparison to PropBank is that VerbNet enhances the frames with additional semantic information. The fundamental assumption is that the syntactic frames of a verb directly reflect the underlying semantics. VerbNet associates the semantics of a verb with its syntactic frames by combining them with traditional lexical semantic information such as thematic roles and semantic predicates. Each syntactic frame in VerbNet is assigned to a semantic class based on Beth Levin's verb classification (Levin, 1993). The alternations which are (or are not) applicable to a given verb then determine its semantic class. Each VerbNet class contains a set of syntactic descriptions, or syntactic frames, depicting the possible surface realizations of the argument structure. Unfortunately, the classes are problematic because they do not take corpus data into consideration. It can be observed that the individual classes contain verbs that considerably differ in their meanings, thus the classes are sometimes semantically inconsistent. The lexicon has recently been extended with an additional set of new classes (approximately 200 additional classes), and it is currently the most comprehensive, versatile and freely available resource of Levin-style verb classification for English.

3 http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

3.3 the berkeley framenet

The linguistic basis of the frame semantics and the FrameNet project can be found in the theory of case grammar, beginning with the work by Charles J. Fillmore (Fillmore, 1968). It was offered as a contribution to the transformational-generative grammar promoted

by Noam Chomsky and his followers. The main idea of the case grammar consists of the proposal that deep syntactic structures are best described as configurations of deep cases. These deep cases are expressed as a fixed set of general semantic role names such as Agent, Patient, Time, etc. Since the beginning, however, there have been questions about the correct set of semantic roles, and whether a small and fixed set of semantic roles can characterize all predicates of natural languages. Indeed, in his later work, Fillmore (1976, 1977b) showed that such a fixed set of semantic roles is not sufficient to characterize all language predicates, and proposed the theory of frame semantics (Fillmore, 1982). The central idea of frame semantics is that word meaning is described in relation to a semantic frame, which consists of a target lexical unit (the pairing of the word with a sense), frame elements (its semantic arguments) and relations between them.

Since the end of the 1990s, the computational project FrameNet (Ruppenhofer et al., 2006) has been under development at the International Computer Science Institute in Berkeley. The main goal of the project is to extract information about the linked semantic and syntactic properties of English words from large electronic text corpora using both manual and automatic procedures. The name “FrameNet” is inspired by WordNet, reflecting the fact that the project is based on the theory of frame semantics and that the semantic frames form a network. The information about words and their properties is stored in an electronic lexical database. Possible syntactic realizations of the semantic roles associated with frames are exemplified in the annotated FrameNet corpus. Currently, the Berkeley FrameNet consists of more than 10,000 lexical units across various parts of speech, associated with more than 800 frames exemplified in more than 135,000 annotated sentences.

3.3.1 Semantic Frames

The semantic frame is defined by the author as a coherent structure of related concepts such that, without knowledge of all of them, one does not have complete knowledge of any one of the others, and is in that sense a type of gestalt (Fillmore, 1977a). In other words, it is a script-like conceptual structure that describes a particular type of situation, object or event along with its participants and properties (Ruppenhofer et al., 2006). A lexical unit in frame semantics is defined as the pairing of a word with its meaning. Typically, each sense of a polysemous word belongs to a different semantic frame. For example, the Commerce_sell frame describes a situation in which a seller sells some goods to a buyer, and is evoked by lexical units such as ‘auction’, ‘retail’, ‘retailer’, ‘sale’, ‘sell’, etc.

3.3.2 Frame Elements

The semantic valencies of a lexical unit are expressed in terms of the kinds of entities that can participate in frames of the type evoked by the lexical unit. The valencies are called frame elements. The Frame Element (FE) bears some resemblance to the argument variables used in first-order predicate logic, but has important differences derived from the fact that frames are much more complex than logical predicates (Fillmore et al., 2003). In the example above, the frame elements include Seller, Goods, Buyer, etc.

FrameNet distinguishes three types of frame elements – core FEs (the presence of such an FE is necessary to satisfy a semantic valence of a given frame), peripheral FEs (they are not unique for a given frame and can usually occur in any frame, typically expressions of time, place, manner, purpose, attitude, etc.), and extra-thematic FEs (these FEs have no direct relation to the situation identified with the frame, but add new information, often showing how the event represented by one frame is part of an event involving another frame). An example of a FrameNet frame is shown in Figure 3.2.

Figure 3.2: FrameNet frame example: Ingestion.

3.3.3 FrameNet relations

In order to make FrameNet more comprehensive, several frame-to-frame relations have been introduced. Each frame relation in FrameNet is an asymmetric relation between two frames, where one frame (the more abstract) is called the parent frame and the other (the less abstract) is called the child frame. The set of the most important relations is comprised as follows:

• Inheritance – the strongest relation between frames, corresponding to the IS-A relation in ontologies. Each child frame must inherit all properties of its parent frame and share all FEs.

• Subframe – some frames are complex in the sense that they refer to sequences of states and transitions, where each of them can separately be described as an individual frame. The separate frames (child frames) are related to the complex frames via the Subframe relation.

• Perspective_on – this relation indicates the presence of at least two different points of view. For example, a neutral Commerce_goods_transfer frame has two points of view: Commerce_buy and Commerce_sell. The neutral frame is usually non-lexical (Ruppenhofer et al., 2006).

• Using – the parent frame constitutes the background for its child frame. Not all attributes of the parent frame must be inherited by the child frames. For example, Volubility uses the Communication frame, since Volubility describes a quantification of communication events.

An example of the relations between frames in FrameNet is shown in figure 3.3.

Figure 3.3: FrameNet relation example: Crime_scenario.

3.3.4 Semantic types

We can say that FrameNet comprises two independent ontologies. The first is a hierarchy of FrameNet frames arranged according to the frame-to-frame relations, especially using the inheritance relation. The second is a hierarchy of semantic types (e.g. Sentient for the Cognizer FE), which are connected with some general frame elements. The general use of semantic types in the FrameNet project is to record information that is not representable in frames, nor in the frame element hierarchies. Specifically, semantic types are mainly employed for the following functions: indicating the basic types of fillers of frame elements, marking non-lexical types of frames, and recording important semantic differences between lexical units belonging to the same frame. Each frame element may be connected with one or more semantic types. Unfortunately, only approximately 20 % of all frame elements (usually very general ones) are connected with any of them. Moreover, the hierarchy of semantic types does not correspond to any existing ontology (although it resembles the Top Ontology in WordNet).

3.4 framenets in other languages

As the popularity of the Berkeley FrameNet has grown, many research teams throughout the world have begun developing FrameNets for other languages. The largest projects are SALSA (The Saarbrücken Lexical Semantics Acquisition Project) (Burchardt et al., 2006) for German, Spanish FrameNet (Subirats and Petruck, 2003) and Japanese FrameNet (Hirose et al., 2004). Other languages for which a FrameNet is being built include Chinese, Italian, French, Romanian, Swedish, Brazilian Portuguese, etc.

3.4.1 SALSA

The aim of the SALSA project is to create a German lexical resource annotated with semantic information following the theoretical background of frame semantics, and to investigate methods for its utilization. The SALSA team has chosen a corpus-based approach, which means extending the existing German treebank TIGER (Brants et al., 2002) with flat trees representing the frame semantic information. An example of a simple annotation instance is shown in figure 3.4. The root node of a single FrameNet frame is labeled with a frame name, and the edges are labeled with frame element names. The verb ‘gelten’ introduces the frame Categorization, the noun phrase ‘IBM und Siemens’ is annotated as Item and ‘als Schimpfworte’ as Category. The annotation process in the SALSA project is fully manual. Two annotators annotate each predicate independently, and in case of disagreement, a third one resolves the conflict.

Figure 3.4: SALSA annotation example: Categorization.

Cross-lingual divergences are resolved both by adding new frames and by adding new frame elements to the frames. The current SALSA release 2.0 primarily annotates verbal and nominal predicates. The total size of the annotation is currently roughly 20,000 verbal instances and more than 17,000 nominal instances.

3.4.2 Spanish FrameNet

Spanish FrameNet is a research project creating an on-line lexical resource for Spanish, based on frame semantics and supported by corpus evidence. The aim of the project is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses through human-approved and automatically annotated example sentences. The annotated sentences are the building blocks of the database. The sentences are marked up in XML and form the basis of the lexical entries. This format supports searching by lemma, frame, frame element, and combinations of these. The Spanish FrameNet Corpus includes both New World and European Spanish, and is composed of texts of different genres, primarily newspapers, newswire texts, book reviews, and humanities essays.

The database acts both as a dictionary and a thesaurus. The dictionary features include definitions and tables showing how frame elements are syntactically expressed in sentences. From the thesaurus perspective, words are linked to the semantic frames in which they participate. The frames, in turn, are linked to word lists and to related frames. The syntactic and semantic annotation is carried out using the system developed at Berkeley within the FrameNet project. Each Spanish FrameNet entry provides links to other lexical resources, including Spanish WordNet synsets. The first release of the lexicon is available to the public and contains more than 1,000 lexical items (verbs, predicative nouns, adjectives and adverbs), representing a wide range of semantic domains connected to 325 frames in more than 10,000 sentences.

3.5 brief

An electronic lexicon of Czech verb valencies, BRIEF (Pala and Ševeček, 1997), was developed at the Faculty of Informatics, Masaryk University in 1997. It contains about 15,000 Czech verb lemmata and nearly 50,000 valency frames. The lexicon has been built from several printed dictionaries of Czech as well as from an electronic lexicon of Czech stems (Osolsobě and Pala, 1993). For each verb entry, BRIEF contains a list of frames separated by commas. A frame is described as a sequence of elements separated by dashes, where each element is represented as a sequence of attribute-value pairs. The attributes are denoted with lower-case letters, and the values either as capital letters or as strings delimited by braces. BRIEF was probably the first lexicon of verb valencies for Czech and served as the cornerstone of its successor, VerbaLex.
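Based only on the description above, an entry of this kind could be split into frames, elements and attribute-value pairs roughly as follows; the sample string and the exact attribute syntax are invented for illustration, since the real BRIEF notation is not reproduced here:

```python
import re

def parse_brief_entry(entry: str):
    """Split a BRIEF-style entry: frames by commas, elements by dashes,
    and each element into (lower-case attribute, value) pairs, where a value
    is either a run of capital letters or a string in braces."""
    frames = []
    for frame in entry.split(","):
        elements = []
        for element in frame.split("-"):
            elements.append(re.findall(r"([a-z])([A-Z]+|\{[^}]*\})", element))
        frames.append(elements)
    return frames

# Invented example entry with two frames of two elements each.
print(parse_brief_entry("aAGT-o{co},aAGT-o{komu}"))
```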

3.6 vallex

The valency lexicon of Czech verbs, Vallex (Žabokrtský, 2005), has been developed at the Institute of Formal and Applied Linguistics at Charles University in Prague since 2001. Vallex uses the functional generative description (Sgall et al., 1986) as its theoretical background and is closely related to the Prague Dependency Treebank (Hajič, 2005). The main goal of the project is to describe valencies of Czech verbs on both the syntactic and the semantic layer. The first version of the lexicon, Vallex 1.0, was published in 2003 and contained approximately 1,400 verb entries. The current version, Vallex 2.6, contains almost 4,300 entries.

Figure 3.5: Example of the Vallex frame for ‘vařit’.

The consistency of verbs in Vallex is improved by grouping them into verb classes of semantically similar verbs. The Vallex classes are based on the so-called alternation model (Lopatková et al., 2006) and currently consist of 22 classes. The key information on the valency structure of a verb is encoded in the form of valency frames. The valency frames are stored as a sequence of slots, where each slot represents one valency complementation and consists of its type, morphemic realization and obligatoriness. The position of the verb is not marked in the Vallex frame. An example of the frame for the verb ‘vařit’ (to boil or cook) is shown in Figure 3.5.

3.7 verbalex

VerbaLex (Hlaváčková and Horák, 2006) is an electronic database of verb valency frames in Czech, which has been developed in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University. Basic units (entries) in VerbaLex consist of verb lemmata grouped into synsets together with their sense numbers in the standard WordNet notation. Verb valencies are realized on two levels – the deep valency level, which corresponds to semantic roles (semantic values of the verb arguments), and the surface level, reflecting the information about syntactic and morphological valencies. The current version of VerbaLex contains more than 6,000 synsets, more than 21,000 verb senses, and approximately 10,500 verb lemmata in 19,500 valency frames.

A valency frame represents verb valencies on both the syntactic and the semantic level. In the centre of the frame, there is a mark representing the verb position, surrounded by left-hand and right-hand arguments in the canonical word order. The type of the valency relation for each constituent element is marked as obligatory or optional. Semantic information about the verbal complement is represented by two-level semantic roles. The first level is represented by main labels primarily based on the EuroWordNet (Vossen and Hirst, 1998) first-order and second-order top ontology entities arranged in a hierarchical structure. The inventory of these labels is closed and currently contains 33 items (concepts). On the second level, there is a collection of selected lexical units (literals) from the set of EuroWordNet and BalkaNet base concepts (Pala and Smrž, 2004) with their respective sense numbers. The list of second-level semantic roles is open and currently contains about 1,200 literals. The valency frames also contain other additional information about verbs. By combining all this information we obtain a so-called Complex Valency Frame (CVF). The additional information from a CVF includes:

• definition of verb meaning

• verb’s ability to create a passive form

• number of meanings for homonymous verbs

• semantic class which a verb belongs to

• aspect (perfective or imperfective)

• example of verb use

• types of reflexivity for reflexive verbs

An example of the VerbaLex frame for ‘vařit’ (to boil or cook) is shown in figure 3.6.

Figure 3.6: Example of the VerbaLex frame for ‘vařit’.

Part II

PROBABILISTIC SEMANTIC FRAMES

All models are wrong, but some are useful. (Box, 1987)

PROBABILISTIC GRAPHICAL MODELS 4

Probabilistic models are very powerful tools for describing real-world problems. Their representation using complex mathematical equations, however, is not convenient for readers. We can therefore find it very useful to augment the algebraic description with diagrammatic representations of probability distributions. These graphical representations of probabilistic models are usually called probabilistic graphical models (Koller and Friedman, 2009). Many people use them for their ability to provide a simple way to visualize the structure of a probabilistic model. Insights into the properties of the model, for example conditional independences or relations between random variables, can easily be obtained by inspection of the graph. Moreover, the computations required to perform inference and learning in such visualized models can be expressed in terms of simple graphical manipulations.

A Probabilistic Graphical Model (PGM) comprises nodes, which represent individual random variables, and edges, which express probabilistic relations between these variables. The graph captures the way in which the joint distribution over all the random variables can be decomposed into a product of factors. These factors then depend only on a subset of adjacent random variables. We can distinguish two basic types of PGMs – Bayesian networks, which use directed edges, and Markov random fields, in which the links between nodes do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs capture symmetric dependencies between random variables. Since the proposed model of probabilistic semantic frames is based on causal relationships, we will only focus on Bayesian networks.

4.1 bayesian networks

Bayesian networks are probabilistic graphical models that use a graph-based representation for compactly encoding a complex distribution over a high-dimensional space. In order to motivate the use of directed graphs to describe probability distributions, consider the example in figure 4.1. The figure shows a toy example of a Bayesian network describing relations in a subset of random variables that have a direct or indirect impact on the event of being at work on time. The choice of the vehicle as well as whether it rains depends directly on the current weather.


[Nodes of the network: weather, vehicle, rain, traffic jam, max. speed, on time.]

Figure 4.1: Bayesian Network.

Being stuck in a traffic jam depends on the chosen vehicle, the maximum speed depends on the chosen vehicle and on rain, and finally, being at work on time depends on the traffic jam and the maximum speed. The graph is, on the one hand, a compact representation of the set of conditional independences that hold in the distribution, and, on the other hand, a skeleton for factorizing the probability distribution. A random variable X is independent of Y given Z, denoted X ⊥ Y | Z, if

\[ P(X \mid Y, Z) = P(X \mid Z). \tag{4.1} \]

There are several conditional independences in figure 4.1, for example:

\[ \text{vehicle} \perp \text{rain} \mid \text{weather} \tag{4.2} \]

\[ \text{weather} \perp \text{max. speed} \mid \text{vehicle, rain}. \tag{4.3} \]

The other perspective is that the graph defines a skeleton for representing a high-dimensional distribution. Rather than encoding the probability for every possible assignment of values to all random variables in our model, we can break up the distribution into smaller factors. The overall joint distribution is then defined as the product of all the factors. If, for brevity, we label each random variable by its first letter, the joint distribution of the model in figure 4.1 is defined as

\[ P(W, V, R, T, M, O) = P(W)\,P(V \mid W)\,P(R \mid W)\,P(T \mid V)\,P(M \mid V, R)\,P(O \mid T, M). \tag{4.4} \]

This parametrization is significantly more compact and tractable. It turns out, moreover, that both these perspectives are, in a way, equivalent. The independence properties are what allows the distribution to be presented in a factorized form. Conversely, a particular factorization guarantees that certain independences hold.
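As a concrete sketch of equation (4.4), the joint probability of one complete assignment can be computed as a product of conditional probability table entries; the tables below are invented for illustration and only cover the values needed for the single query:

```python
# Minimal (made-up) conditional probability tables for the toy network in figure 4.1.
P_W = {"sunny": 0.7}
P_V_given_W = {("car", "sunny"): 0.4}
P_R_given_W = {("no_rain", "sunny"): 0.9}
P_T_given_V = {("no_jam", "car"): 0.6}
P_M_given_VR = {("fast", "car", "no_rain"): 0.9}
P_O_given_TM = {("on_time", "no_jam", "fast"): 0.95}

def joint(w, v, r, t, m, o):
    """P(W, V, R, T, M, O) as the product of factors from equation (4.4)."""
    return (P_W[w] * P_V_given_W[(v, w)] * P_R_given_W[(r, w)]
            * P_T_given_V[(t, v)] * P_M_given_VR[(m, v, r)]
            * P_O_given_TM[(o, t, m)])

print(joint("sunny", "car", "no_rain", "no_jam", "fast", "on_time"))
```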

4.1.1 Model Representation

The Bayesian network in figure 4.1 was an example of a very small network with discrete random variables (apart from max. speed, which can be continuous). Networks describing the real world, though, are much bigger, with a lot of continuous variables, and we thus need a compact notation to represent them. The solution is known as plate notation. As an illustration of the plate notation, we consider a model describing the distribution of people's heights in the population. It is well known that heights are distributed according to the normal (Gaussian) distribution, governed by two parameters – µ, called the mean, and σ², called the variance:

\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}. \tag{4.5} \]

The square root of the variance, denoted as σ, is called the standard deviation, and the reciprocal of the variance, denoted as β = 1/σ², is called the precision. Figure 4.2¹ shows the plot of the normal distribution. Given a sample of people's heights from the population


Figure 4.2: Plot of the univariate Gaussian showing the mean µ and the standard deviation σ.

1 The figure is adapted from Bishop (2007).

x = {x1, x2, ..., xn} distributed according to x ∼ N(µ, σ²), our goal might be to find the parameters α = (µ, σ²) such that

\[ \alpha = \arg\max_{\bar{\alpha}} P(x \mid \bar{\alpha}), \tag{4.6} \]

which is called the Maximum-Likelihood Estimation (MLE). Even though this probabilistic model is very simple, it might be useful to represent it using a Bayesian network. The probabilistic graphical model is shown in figure 4.3a. The people's individual heights are represented as random variables x1, x2, ..., xn, which depend on the mean and the variance of the normal distribution, represented together by one random variable α. This joint notation of the parameters is only used to simplify the figure.
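In this simple case, the maximum-likelihood estimates from equation (4.6) have a closed form: the sample mean and the (biased) sample variance. A short sketch with a synthetic sample:

```python
import random

# Synthetic sample of heights (in cm), drawn from a known Gaussian.
random.seed(0)
x = [random.gauss(175.0, 7.0) for _ in range(10_000)]

# Maximum-likelihood estimates of the Gaussian parameters.
n = len(x)
mu_hat = sum(x) / n
var_hat = sum((xi - mu_hat) ** 2 for xi in x) / n   # ML estimate uses 1/n, not 1/(n - 1)

print(mu_hat, var_hat)   # close to 175 and 49 for a large sample
```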


(a) Expanded notation. (b) Plate notation.

Figure 4.3: Probabilistic graphical model for the distribution of people’s heights in the population.

When we apply a graphical model to a problem in machine learning or pattern recognition, we typically set some of the random variables to specific observed values, while the other variables remain non-observed and are supposed to be learned. One can see that the nodes of the variables x1, x2, ..., xn are shaded to indicate that they are observed variables. It means that they have been set according to their values in the training data. Conversely, the random variable α without shading is a latent variable, also known as a hidden variable, whose value is under investigation. This is a standard notation that will be followed throughout the rest of this thesis.

When we deal with more complex graphical models with multiple random variables of the same type, it is inconvenient to write out all of them explicitly as in figure 4.3a. Instead of drawing each variable individually, a rectangle can be used to group them together into a subgraph. The plate notation for our example is shown in figure 4.3b. The number n, placed in the corner, represents the number of repetitions of the subgraph in the plate. Any edges that cross a plate boundary are replicated once for each subgraph repetition as well. In our example, the replicated variables are indexed with i.

Both the toy probabilistic graphical models in figures 4.1 and 4.3 satisfy an important property – there are no cycles. In fact, Bayesian networks do not allow any type of cycle in their structure under any circumstances. This is equivalent to the statement that there exists an ordering of the nodes such that there are no links that go from any node to any lower-numbered node. Such directed graphs without cycles are called directed acyclic graphs. Formally, a directed graph is a data structure G = (X, E) consisting of a set of nodes X and a set of edges E : X → X. We say that X1, X2, ..., Xk for Xi ∈ X is a path in the graph G = (X, E) if, for every i = 1, 2, ..., k − 1, we have (Xi → Xi+1) ∈ E. Next, we define a cycle in G as a path X1, X2, ..., Xk where X1 = Xk. A graph is acyclic if it contains no cycles. Finally, a Bayesian network is defined as a directed acyclic graph G = (O ∪ H, E), where O is a set of observed variables, H is a set of hidden variables and E : O ∪ H → O ∪ H is a set of oriented edges that satisfy the acyclicity condition.
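The acyclicity condition is easy to verify programmatically, for example with a depth-first search over the edge set; the sketch below assumes the graph is given as a mapping from a node to the list of its children:

```python
def is_acyclic(edges):
    """Return True iff the directed graph (node -> list of children) contains no cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY
        for child in edges.get(node, []):
            state = colour.get(child, WHITE)
            if state == GREY:                      # back edge found => cycle
                return False
            if state == WHITE and not visit(child):
                return False
        colour[node] = BLACK
        return True

    return all(visit(n) for n in edges if colour.get(n, WHITE) == WHITE)

# The toy network from figure 4.1 is acyclic.
toy = {"weather": ["vehicle", "rain"], "vehicle": ["traffic_jam", "max_speed"],
       "rain": ["max_speed"], "traffic_jam": ["on_time"], "max_speed": ["on_time"]}
print(is_acyclic(toy))  # True
```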

4.2 generative vs. discriminative models

The goal of exploiting probabilistic graphical models in machine learning applications is typically to predict the values of latent variables based on a model and training data. This process is called statistical inference or statistical induction. As we saw in our example in figure 4.3, the goal was to predict the mean and the variance of people's heights in the population. The process is usually broken into two separate stages: the inference stage, where we learn the probability distribution of a hidden variable z given training data x, and the decision stage, in which we use the probability distribution P(z|x) to make an optimal choice of the value of the hidden variable z. When the hidden variable z is discrete, the machine learning task is typically called classification; in the case of a continuous variable, it is called regression. Since the problems in natural language processing are mostly of a discrete nature, we will focus only on the classification task.

In general, we can identify two distinct models for estimating the posterior probability over the values of a hidden variable z – the generative model and the discriminative model. Let us summarize the main differences. The generative model estimates the probability distributions P(z) and P(x|z), and combines them using Bayes' theorem into the posterior probability P(z|x), whereas the discriminative model estimates the posterior probability directly:

• Generative models for classification tasks solve the inference problem by determining the probability densities for each class zc of the discrete random variable z separately. Then it is possible to use Bayes' theorem in the form

\[ P(z_c \mid x) = \frac{P(x \mid z_c)\,P(z_c)}{P(x)} \tag{4.7} \]

to find the posterior probabilities P(zc|x), where the denominator P(x) can be found as the sum of the quantities appearing in the numerator:

\[ P(z_c \mid x) = \frac{P(x \mid z_c)\,P(z_c)}{\sum_{c} P(x \mid z_c)\,P(z_c)}. \tag{4.8} \]

The posterior probabilities in generative models can equivalently be determined by modeling the joint distribution P(zc, x), which is equal to P(zc)P(x|zc). Similarly to the previous case, normalization is required:

\[ P(z_c \mid x) = \frac{P(z_c, x)}{\sum_{c} P(z_c, x)}. \tag{4.9} \]

Such models are called generative because they model the inputs as well as the outputs; thus, it is possible to generate synthetic data points in the input space.

• Discriminative models solve the inference problem by determining the posterior probability P(zc|x) directly. That is why it is impossible to generate synthetic data. They can only assign the most probable class to the input data.

Both approaches can easily be exemplified in the classification task, where the generative model is represented by the naïve Bayes classifier and the discriminative model by the logistic regression classifier. Their graphical models are shown in figure 4.4.

The naïve Bayes classifier, shown in figure 4.4a, is a simple probabilistic classifier with a strong independence assumption. The model assumes that the attributes xi1, xi2, ..., xik of data points x1, x2, ..., xn are conditionally independent of one another given the class variable zi. This significantly reduces the complexity of the model and the requirements on the size of the training data n. The algorithm solves the classification problem by applying Bayes' theorem

\[ P(z_c \mid x_i) = \frac{P(x_i \mid z_c)\,P(z_c)}{\sum_{c} P(x_i \mid z_c)\,P(z_c)}, \tag{4.10} \]

where P(xi|zc) is estimated using the conditional independence assumption

\[ P(x_i \mid z_c) = \prod_{j=1}^{k} P(x_{ij} \mid z_c). \tag{4.11} \]

(a) Naïve Bayes classifier. (b) Logistic regression classifier.

Figure 4.4: Generative and discriminative types of classifiers.

It can be seen that the naïve Bayes classifier model describes how to generate data points based on known classes. Contrarily, the logistic regression classifier (figure 4.4b) directly describes how to predict the classes based on the attributes of data points. In order to simplify the notation, we consider the case where z is a boolean variable (i.e. z = 0 or z = 1). The model assumes a parametric form for the distribution given by

\[ P(z = 0 \mid x_i) = \frac{1}{1 + \exp\left(w_0 + \sum_{j=1}^{k} w_j x_{ij}\right)} \tag{4.12} \]

and

\[ P(z = 1 \mid x_i) = \frac{\exp\left(w_0 + \sum_{j=1}^{k} w_j x_{ij}\right)}{1 + \exp\left(w_0 + \sum_{j=1}^{k} w_j x_{ij}\right)}, \tag{4.13} \]

where w0, w1, ..., wk are parameters of the model which are supposed to be learned. The model makes use of the logistic function, which transforms any real number into the interval ⟨0, 1⟩. The logistic function curve, defined as

\[ f(x) = \frac{1}{1 + e^{-x}}, \tag{4.14} \]

is shown in figure 4.5.


Figure 4.5: Logistic function.
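As a quick sketch, the squashing behaviour of equation (4.14) can be checked directly:

```python
import math

def logistic(x: float) -> float:
    """Logistic (sigmoid) function from equation (4.14)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))   # 0.5
print(logistic(6.0))   # close to 1
print(logistic(-6.0))  # close to 0
```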

The examples of generative and discriminative classifiers illustrate their most important properties. On the one hand, the generative approach is in general more demanding with respect to the size of the training data, because it is required to find the joint distribution over both the training data and the classes. That is why naïve Bayes is more popular than a “non-naïve” generative approach that does not require conditional independence between attributes. On the other hand, generative approaches are usually easier to design and implement, and most importantly, generative models usually do not require hand-labeled data. The requirement of a plentiful amount of labeled training data is an important disadvantage of discriminative techniques. Since the goal of this thesis is to design an unsupervised model trained on a huge amount of unlabeled data, which is easily acquired, generative models will be preferred and used in the proposed probabilistic semantic frame models.
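For concreteness, the generative side can be sketched as a tiny multinomial naïve Bayes classifier implementing equations (4.10) and (4.11); the toy training data are invented, and the sketch omits practicalities such as log-space arithmetic:

```python
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (list_of_tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    for tokens, label in documents:
        word_counts[label].update(tokens)
    priors = {c: n / len(documents) for c, n in class_counts.items()}
    return priors, word_counts

def classify(tokens, priors, word_counts, alpha=1.0):
    """Return the class maximizing P(z_c) * prod_j P(x_ij | z_c), with add-alpha smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = prior
        for w in tokens:
            score *= (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("fish swim in the river".split(), "nature"),
        ("the bank approved the account".split(), "finance")]
priors, counts = train_naive_bayes(docs)
print(classify("fish in the river".split(), priors, counts))  # -> nature
```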

4.3 statistical inference in bayesian networks

The process of computing posterior distributions over hidden variables and finding their most likely values in probabilistic models is usually called statistical inference. The inference in small networks with tens or hundreds of nodes can be done using exact inference techniques such as variable elimination or belief propagation (Kim and Pearl, 1983). For larger networks, however, exact inference is too slow, because it is an NP-hard problem (Cooper, 1990). Since our semantic frame models will be comprised of millions of nodes, we will only focus on fast, approximate inference techniques.

The first of the two approaches for approximate inference in Bayesian networks that will be discussed in this chapter is called variational inference. It is a deterministic scheme based on an analytical approximation to the posterior distribution. The second approach relies on stochastic approximations. We will only introduce a small subset of stochastic sampling algorithms known as Markov Chain Monte Carlo.

4.3.1 Variational Inference

Variational inference has its origin in the work on the calculus of variations, which is a field of mathematical analysis that deals with maximizing or minimizing functionals. A functional is a function f : V → R that takes a function from a space of functions V (or a vector from some vector space) as input and returns a scalar number. While in standard calculus a derivative describes how the output value varies as we make infinitesimal changes to the input variable, in the calculus of variations we analyze how the value of a functional changes in response to infinitesimal changes of the input function. For details and definitions, see e.g. Hazewinkel (2001). The idea of approximation using variational inference is that we restrict the range of functions describing the true densities of a hidden variable by considering only tractable functions. The optimization task lies in searching for functions that are the closest to the true posteriors.

Suppose we have a Bayesian network model where all latent variables as well as parameters are denoted as Z = {z1, z2, ..., zm}, and the set of all observed variables is X = {x1, x2, ..., xn}. The probabilistic model specifies the joint distribution P(Z, X), and our goal is to find an approximation for the posterior distribution P(Z|X), which is denoted Q(Z) and satisfies

\[ Q(Z) \approx P(Z \mid X). \tag{4.15} \]

The lack of similarity between the approximation Q(Z) and the true posterior P(Z|X) is measured by a dissimilarity function d(Q, P). The most common choice for the dissimilarity function is the Kullback-Leibler divergence (Kullback and Leibler, 1951), which is defined as

\[ \mathrm{KL}(Q \,\|\, P) = \sum_{Z} Q(Z) \ln \frac{Q(Z)}{P(Z \mid X)}. \tag{4.16} \]

It is a non-symmetric measure of the difference between two probability distributions Q and P. More specifically, KL(Q||P) measures the information loss when Q is used to approximate P, i.e., it measures the expected number of extra bits required to code samples from P when using a code based on Q, rather than using a code based on P (Burnham and Anderson, 2002). The KL-divergence can be decomposed as shown in the following formulas:

\[ \mathrm{KL}(Q \,\|\, P) = \sum_{Z} Q(Z) \ln \frac{Q(Z)}{P(Z, X)} + \ln P(X), \tag{4.17} \]

\[ \ln P(X) = \mathrm{KL}(Q \,\|\, P) + L(Q), \tag{4.18} \]

where the term L(Q) is known as the negative variational free energy and is defined as

\[ L(Q) = -\sum_{Z} Q(Z) \ln \frac{Q(Z)}{P(Z, X)}. \tag{4.19} \]

Since the KL-divergence is always non-negative and the log evidence ln P(X) is fixed with respect to Q, we can observe that maximizing L(Q) minimizes the dissimilarity KL(Q||P). If we allow any possible choice for Q(Z), then the maximum of the lower bound L(Q) occurs when the KL-divergence vanishes, which happens when Q(Z) equals the true posterior distribution P(Z|X). However, the true posterior is usually intractable, so we are doomed to use a simpler form of the distribution Q.

The most common type of approximation for variational inference is known as the Mean-Field approximation (Parisi, 1998). The method supposes a decomposition of the distribution over Z into disjoint groups that we denote Zi, where i = 1, 2, ..., F. We then assume that the distribution Q factorizes with respect to these groups:

\[ Q(Z) = \prod_{i=1}^{F} Q_i(Z_i), \tag{4.20} \]

where there is no restriction on the functional forms of the individual factors Qi(Zi).

Variational methods have their strengths and weaknesses – their strongest advantage is that they are deterministic, so they always find the same solution after a specific number of iterations. They are quite fast, but they do not guarantee perfect accuracy if we use a non-optimal approximation Q. Moreover, it is not always easy to design and implement an inference algorithm based on variational inference if the models are complex. These are the reasons why we will prefer sampling methods, which guarantee a specific accuracy after a sufficiently large number of iterations. They will be described in the subsequent section.
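For discrete distributions, the KL-divergence from equation (4.16) is a one-line sum; the sketch below uses two small made-up distributions:

```python
import math

def kl_divergence(q, p):
    """KL(Q || P) = sum_z Q(z) * ln(Q(z) / P(z)) for discrete distributions."""
    return sum(qz * math.log(qz / p[z]) for z, qz in q.items() if qz > 0)

q = {"a": 0.5, "b": 0.5}
p = {"a": 0.9, "b": 0.1}
print(kl_divergence(q, p))  # > 0; the divergence is zero only when Q equals P
print(kl_divergence(q, q))  # 0.0
```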

4.3.2 Sampling

The second large group of approximate inference techniques is based Monte Carlo on numerical sampling, and is known as the collection of Monte Carlo techniques. It was named after a well known casino in Monte Carlo, where an uncle of the inventor of these techniques, Stanisław Ulman, often gambled (Eckhardt, 1987). The heart of modern Monte Carlo simulations lies in selecting a statistical sample to approximate a hard combinatorial problem. Sampling and Monte Carlo methods have many applications in physics and applied mathematics (Newman and Barkema, 1999), but 4.3 statistical inference in bayesian networks 51 we will only focus on evaluating expectations over hidden variables in Bayesian networks in order to make predictions. A fundamental problem in statistical inference involves finding the expectation of a function f(z) with respect to the probability distribution P(z). Thus, in case of k discrete values of the hidden variable z, the goal is to find the expectation

k E[f] = f(zi)P(zi).(4.21) i=1 X We shall suppose that such expectations are too complex to be evaluated exactly using analytical techniques, so some sort of approx- imation is required. The general idea is to obtain a set of samples z(1), z(2), ... , z(L), drawn independently from the distribution P(z). This allows the expectation to be approximated by a finite sum

1 L fˆ = f(z(l)).(4.22) L l=1 X Assuming that it is possible to sample from the conditional dis- tribution at each node of the probabilistic graphical model, and that the graph contains no observed variables, sampling from the joint distribution of the whole model is straightforward. The joint distribution is defined as m

P(Z) = P(zi|parentsi),(4.23) i=1 Y where parentsi denotes the set of variables associated with the parents of node i. To obtain the sample from the joint distribution, we can pass through the graph in the direction of edges and sample from the conditional distribution P(zi|parentsi). It is always possible because at each step, all parent nodes will have been instantiated. After we pass through the whole graph, we will obtain a sample from the joint distribution. For the Bayesian networks with discrete hidden variables and some observed variables, we can extend the procedure above. At each step, when a sampled value is obtained for a variable zi whose value is observed, the sample is compared with the observed variable. If the value of the observed variable is equal to the sampled value, we can continue with sampling subsequent variables. However, if the observed value and the sampled value disagree, the values of previously sampled variables have to be discarded, and the process of sampling the joint distribution has to start from the first node in the graph. This algorithm samples correctly from the true joint posterior, however, the probability of accepting a sample from the posterior rapidly decreases with a growing number of observed variables 52 probabilistic graphical models

and the number of their states. Moreover, this approach is not applicable to models with continuous observed variables, and so it is rarely used in practice. There are several Monte Carlo methods that avoid discarding sampling as much as possible. The most common techniques are described below.

4.3.2.1 Rejection sampling

rejection sampling The idea of the rejection sampling, see Bishop( 2007) for instance, is based on transforming the task to sampling from a simpler distribution. Suppose we want to sample from a distribution P(z), which is not known. Instead, we can sample from another easy- proposal distribution to-sample distribution Q(z), called the proposal distribution that satisfies P(z) 6 kQ(z). Each step of the rejection sampler involves generating two ran- dom numbers. First, we generate a random number z(i) from the distribution Q(z). Next, we generate a number u from the uniform distribution over h0, 1i. If P(z(i)) u < (4.24) kQ(z(i))

then the sample is accepted, otherwise z(i) is rejected. The complete scenario is described in algorithm 4.1.

Algorithm 4.1 Rejection sampling algorithm for N iterations. 1: i ← 0 2: while i < N do (i) 3: Sample z ∼ Q(z) 4: Sample u ∼ Uniform(0, 1) P(z(i)) 5: if u < then kQ(z(i)) (i) 6: Accept z 7: i ← i + 1 8: else 9: Reject 10: end if 11: end while

It can easily be shown that the accepted samples z(i) are sampled with probability P(z) (Robert and Casella, 2005). The most limiting property of the rejection sampling is that it is not always possible to find a reasonable constant k. If k is too large, the acceptance probability is too small, which makes the method impractical.

4.3.2.2 Markov Chain Monte Carlo It is straightforward that the rejection sampling is inefficient for high- dimensional random spaces, because the rejections come too often. 4.3 statistical inference in bayesian networks 53

We therefore turn in this section to a very general and powerful framework called Markov Chain Monte Carlo (MCMC), which allows sampling from high-dimensional spaces easily. The exploration of the MCMC methods will start with the Metropolis-Hastings algorithm Metropolis-Hasting (Metropolis et al., 1953; Hastings, 1970), which is a very popular algorithm method for sampling. We will see that the most practical MCMC algorithm, Gibbs sampling, which will be used mostly in the rest of the thesis, can be interpreted as a special case of this algorithm. All MCMC algorithms are built on the idea of the Markov chain. The Markov chain Markov chain is a sequence of random variables x(1), x(2), ... , x(n), where the probability of the value of variable x(t) depends only on the value of variable x(t−1), i. e.:

(t+1) (1) (t) (t+1) (t) P(x = sj|x = sk, ... , x = si) = P(x = sj|x ) = si). (4.25)

This property is known as the Markov property. Markov property More precisely, we can define the Markov chain in terms of a graph of states Val(X) and a transition model T that defines, for every state si ∈ X a next-state distribution over X. The probability P(i → j) of going from state si to state sj in one step is referred to as the transition probability: transition probability (t+1) (t) P(i → j) = P(x = sj|x = si).(4.26)

This probability is applied whenever the chain is in state si. Let

(t) πj(t) = P(x = sj) (4.27) denote the probability that the chain is in state sj at time t, and let π(t) denote the vector of probabilities for all the states at time t. The probability that the chain is in state si at time t + 1 is given by the Chapman-Kolomogrov equation: Chapman- Kolomogrov (t+1) equation πi(t + 1) = P(x = si) = P(k → i)πk(t).(4.28) k X We can compactly write the Chapman-Kolomogrov equation for all random variables in so called transition matrix P, where the element transition matrix at position (i, j) is equal to P(i → j). The Chapman-Kolomogrov equation then becomes

π(t + 1) = π(t)P.(4.29)

Finally, a Markov chain is said to be irreducible if there exists a irreducibility positive probability of visiting all states in finite time, starting from any state of the chain. Likewise, a chain is said to be aperiodic if aperiodicity the greatest common divisor of the lengths of its cycles is equal to one. Being irreducible and aperiodic, every Markov chain can 54 probabilistic graphical models

stationary reach a stationary distribution π∗, where the vector of transition distribution probabilities is independent of the initial conditions: π∗ = π∗P.(4.30) A sufficient condition for a unique stationary distribution is the reversibility reversibility condition condition ∗ ∗ P(j → k)πj = P(k → j)πk,(4.31) because

(πP)j = πiP(i → j) = πjP(j → i) = πj P(j → i) = πj.(4.32) i i i X X X 4.3.2.3 Metropolis-Hastings sampling Let us first introduce the basic Metropolis algorithm (Metropolis and Ulam, 1949; Metropolis et al., 1953). As in the rejection sampling, we again sample from a proposal distribution. This time, however, we maintain a record of the last state x(t−1). The proposal distribution Q(x∗|x(t−1)) then depends on this state, so that the sequence forms a Markov chain. Suppose our goal is to draw samples from P(x), where

P(x) = kf(x),(4.33) and the normalizing constant k is not known or very difficult to com- pute. The proposal distribution is chosen to be sufficiently simple so that it is straightforward to draw samples from it directly. Moreover, it must be a symmetric distribution. It means that Q(x(1)|x(2)) = Q(x(2)|x(1)) for all values of x(1) and x(2). The algorithm starts with the initial value x(0) satisfying f(x(0)) > 0. Using the current value x(i) it samples a candidate value x∗ from the proposal distribution Q(x∗|x(i)). And finally, given the candidate x∗, it calculates the ratio of the density at the candidate x∗ and the current state x(i): P(x∗) kf(x∗) α = = .(4.34) P(x(i)) kf(x(i)) Notice that because we are considering the ratio under two different values, the normalizing constant k cancels out. If the ratio is greater or equal to one (α > 1), we accept the candidate point. If the ratio is lower than one (α < 1), we accept it with probability α, or else reject it. The procedure is summarized in algorithm 4.2. Hastings( 1970) generalized the Metropolis algorithm by using an arbitrary proposal distribution Q(x∗|x(i)). It can be achieved by replacing the acceptance probability defined at line 6 of the algorithm 4.2 with a new definition given by  f(x∗)Q(x∗|x(i))  u < min 1, .(4.35) f(x(i))Q(x(i)|x∗) 4.3 statistical inference in bayesian networks 55

Algorithm 4.2 Metropolis sampling algorithm for N iterations. (0) 1: Initialize x 2: i ← 0 3: while i < N do (i) 4: Sample x∗ ∼ Q(x∗|x ) 5: Sample u ∼ Uniform(0, 1)  f(x∗)  6: if u < min 1, then f(x(i)) (i) 7: Accept x 8: i ← i + 1 9: else 10: Reject 11: end if 12: end while

To show that the Metropolis-Hastings algorithm generates samples from the expected density P(x), it is sufficient to show that the generated Markov chain satisfies the reversibility condition from equation 4.31 (Walsh, 2004). In the algorithm, we sample from Q(x∗|x(i)) = P(x(i) → x∗|Q) = Q(x(i) → x∗) and then accept with probability

 f(x∗)Q(x∗|x(i))  α(x(i) → x∗) = min 1, ,(4.36) f(x(i))Q(x(i)|x∗) so the transition probability is P(x(i) → x∗) = Q(x∗|x(i))α(x(i) → x∗). Thus if the Markov chain satisfies

Q(x → y)α(x → y)P(x) = Q(y → x)α(y → x)P(y) (4.37) for all x, y, then the stationary distribution corresponds to draws from the target distribution. To show that the balance equation is satisfied, consider three possible cases for the values of x, y:

1) If Q(x → y)P(x) = Q(y → x)P(y)

then α(x → y) = α(y → x) = 1, which implies P(x → y)P(x) = Q(x → y)P(x), P(y → x)P(y) = Q(y → x)P(y) and hence P(x → y)P(x) = P(y → x)P(x).

2) If Q(x → y)P(x) > Q(y → x)P(y)

then P(y)Q(y → x) α(x → y) = , α(y → x) = 1 P(x)Q(x → y) 56 probabilistic graphical models

and hence

P(x → y)P(x) = Q(x → y)α(x → y)P(x) P(y)Q(y → x) = Q(x → y) P(x) P(x)Q(x → y) = Q(y → x)P(y) = Q(y → x)α(y → x)P(y) = P(y → x)P(y).

3) If Q(x → y)P(x) < Q(y → x)P(y)

then P(x)Q(x → y) α(y → x) = , α(x → y) = 1 P(y)Q(y → x) and hence

P(y → x)P(y) = Q(y → x)α(y → x)P(y) P(x)Q(x → y) = Q(y → x) P(y) P(y)Q(y → x) = Q(x → y)P(x) = Q(x → y)α(x → y)P(x) = P(x → y)P(x).

An important issue with MCMC samplers, including the Metropolis- burn-in period Hastings algorithm, is known as the burn-in period. It is the phase at the beginning of sampling until the chain approaches the stationarity. Samples from this phase are usually thrown out. Typically the burn- in period lasts for the first several hundreds of iterations, and then one of the various convergence tests, e.g. (Geyer, 1992; Gelman and Rubin, 1992; Geweke, 1992; Raftery and Lewis, 1992), is used to assess whether the stationarity has indeed been reached. A poor choice of starting values, however, can greatly increase the number of required iterations. A general suggestion is to start the chain as close to the center of the distribution as possible if the center can be estimated. Moreover, if the distribution P(x) is multimodal, the Markov chain can trap close to one of the modes. This can be partially solved by accepting only every kth sample after the burn- in period. Another solution is to start multiple chains with different simulated annealing initial values or use simulated annealing (Kirkpatrick et al., 1983; Geman and Geman, 1984; Laarhoven and Aarts, 1987) on a single chain. 4.3 statistical inference in bayesian networks 57

4.3.2.4 Gibbs Sampling

Gibbs sampling (Geman and Geman, 1984) is a simple, efficient, Gibbs sampler and widely used sampling method based on the Metropolis-Hastings algorithm. Suppose we want to sample from a vector of random variables x = (x1, x2, ... , xm), and that the expressions for the full conditional distributions P(xj|x−j) = P(xj|x1, ... , xj−1, xj+1, ... , xm) are known. Each step of the Gibbs sampling involves replacing the value of one of the random variables by a value drawn from the dis- tribution of that variable conditioned on the values of the remaining variables. More specifically, we replace the value of variable xj by a value drawn from distribution P(xj|x−j). This algorithm is repeated either by cycling through all random variables in a particular order or by choosing the order at random. This procedure is summarized in algorithm 4.3.

Algorithm 4.3 Gibbs sampling algorithm for N iterations. (0) 1: Initialize xj : j ∈ 1, 2, ... , m 2: i ← 0 3: while i < N do (i+1) (i) 4: Sample x1 ∼ P(x1|x−1) (i+1) (i) 5: Sample x2 ∼ P(x2|x−2) 6: . 7: . 8: . (i+1) (i) 9: Sample xm ∼ P(xm|x−m) 10: i ← i + 1 11: end while

To show that the Gibbs sampler generates samples from the distribution P(x), we will show that the Gibbs sampler is a special case of the Metropolis-Hastings algorithm. Consider a Metropolis- Hastings sampling, where the proposal distribution is defined as

∗ (i) ∗ (i) Qj(x |x ) = P(xj |x−j ).(4.38)

∗ (i) Notice that x−j = x−j , because the components are unchanged by the sampling step, and also

(i) (i) (i) (i) (i) (i) P(x ) = P(xj , x−j ) = P(xj |x−j )P(x−j ).(4.39) 58 probabilistic graphical models

From these equations we can derive that the acceptance probability α(x(i) → x∗) is always one: ! P(x∗)Q (x∗|x(i)) α(x(i) → x∗) = 1 j min , (i) (i) ∗ P(x )Qj(x |x )

 ∗ ∗ ∗ (i) ∗  P(xj |x−j)P(x−j)P(xj |x−j) = min 1,  (4.40) (i) (i) (i) ∗ (i) P(xj |x−j )P(x−j )P(xj |x−j ) = min (1, 1) = 1,

thus the Metropolis-Hastings steps are always accepted. The practical applicability of the Gibbs sampler depends on the ease with which we are able to sample from the conditional dis- tributions P(xi|x−i). If the direct sampling from the conditional distributions is too hard or even impossible, we can draw samples using the Metropolis-Hastings algorithm embedded in the Gibbs sampler. An easy and widely used technique for speeding up the Gibbs sampler when our interest is only to sample from a subset of hidden collapsed Gibbs variables is to use the collapsed Gibbs sampler (Liu, 1994). In the sampler collapsed version of the Gibbs sampler we integrate out the unwanted variables when sampling from the other random variables. For example, imagine that we have three hidden variables x1, x2, x3 and we only want to sample from x1 and x2. When generating a sample for x1, we can draw the sample from the marginal distribution P(x1|x2) with variable x3 being integrated out. Likewise, the value of variable x2 can be drawn from P(x2|x1) with variable x3 being integrated out, and not sample from x3 at all. This is generally conjugate prior tractable when x3 is the conjugate prior for x1 and x2. More details about conjugate priors will be discussed in chapter 5. As we will see later, the collapsed Gibbs sampler is a good choice for the inference algorithm of the probabilistic semantic frames model proposed in this thesis. SELECTEDPROBABILITYDISTRIBUTIONS 5

In the previous chapter, we discussed probabilistic graphical models that will serve as a skeleton of the probabilistic frames models. Now, we turn to an exploration of some probability distributions that form building blocks of PGMs. They describe probability densities over random variables presented in the probabilistic graphical models. The list of distributions introduced in this chapter is limited to the basic distributions necessary for understanding the models described in further chapters. The issue of choosing an appropriate probability distribution is a key problem in probability modeling and usually depends on the knowledge and intuition of the model designer. In this chapter, we will specifically discuss parametric distributions. They are governed parametric by a small number of parameters that control the shape of a distribution distribution under some restrictions given by the type of the distri- bution. These parameters are determined by an estimation from the data points. An alternative approach is known as a non-parametric non-parametric model. The functional form of a non-parametric distribution depends distribution moreover on the size of the data set. Non-parametric distributions will be discussed later concerning the context of the non-parametric model for probabilistic semantic frames. An interesting concept that plays an important role in probabilistic models is known as conjugate prior (Raïffa and Schlaifer, 2000). conjugate prior In Bayesian probability theory, conjugate prior is a prior that is of the same family (functional form) as the corresponding posterior distribution. For example, the Gaussian distribution is the conjugate prior for its mean (it is so called self-conjugate), the conjugate self-conjugate prior prior for parameters of the multinomial distribution is the Dirichlet distribution, and the conjugate prior for the Poisson distribution is the Gamma distribution. The motivation for using a conjugate prior instead of a non-conjugate prior is a greatly simplified Bayesian analysis, as we shall see later. In this chapter, we will explore binomial and multinomial distribu- tions, representing the family of discrete distributions, together with their continuous conjugate priors, the beta and Dirichlet distributions. The examples from the family of continuous distributions shall be extended with the gamma distribution.

59 60 selected probability distributions

5.1 the binomial and the multinomial distributions

We begin to explore the family of discrete distributions by consider- ing a binary random variable x ∈ {0, 1}. This variable, for example, can model the outcome of flipping a coin. Let us denote the probability of x = 1 by a constant p, where 0 6 p 6 1. The probability distribution Bernoulli over x is known as the Bernoulli distribution and defined as distribution Bern(x|p) = px(1 − p)1−x.(5.1)

We can now draw n independent samples {x1, x2, ... , xn} from Bern(x|p) and ask how many times has happened that xi = 1, which is denoted by m ∈ {0, 1, ... , n}. The distribution over the values of m binomial is called the binomial distribution and is defined as distribution n Bin(m|p, n) = pm(1 − p)n−m,(5.2) m

where n n! = (5.3) m (n − m)!m! binomial coefficient is the binomial coefficient expressing the number of ways of choos- ing m objects from a set of n objects. An example of the histogram of the binomial distribution for n = 10 and p = 0.6 is plotted in figure 5.1.

0,3

0,25 )

0 0,2 1 = n

, 6 . 0,15 0 = p | m ( 0,1 P

0,05

0 0 1 2 3 4 5 6 7 8 9 10 m

Figure 5.1: The binomial distribution with p = 0.6 and n = 10.

Notice that when n = 1, the binomial distribution is equal to the Bernoulli distribution. Both discrete distributions that we have discussed heretofore are defined for binary variables. Now, let us generalize the concepts to k dimensions. We will replace the analogy of tossing a coin with tossing 5.2 the beta and the dirichlet distributions 61 a k-sided die. Let x be a k-dimensional random variable in which only one element is set to 1 and all the remaining elements are set to 0. For example, a valid value of a four-dimensional random variable x may be x = (0, 1, 0, 0). The probability distribution over all values of x is given by the categorical distribution defined as categorical distribution k xi P(x|p) = pi ,(5.4) i=1 Y where p = (p1, p2, ... , pk) is a real-valued vector satisfying pi > 0 k and i=1 pi = 1, which defines the probabilities of the outcomes of random variable x. The categorical distribution can be seen as P a generalization of the Bernoulli distribution to more than two outcomes. Finally, we are able to generalize the binomial distribution to k dimensions. Let m = (m1, m2, ... , mk) is a vector of the numbers of observations of all outcomes from sampling the random variable x. The distribution over m takes the form

k k ( i=1 mi)! mi P(m|p) = pi ,(5.5) m1!m2! . . . mk! i=1 P Y which is known as the multinomial distribution. multinomial distribution 5.2 the beta and the dirichlet distributions

All of the previously defined distributions have a parameter p or p = (p1, p2, ... , pk) corresponding to probabilities of the outcomes of the random variable x. In order to develop a fully Bayesian analysis of the variable x, we need to introduce a prior over the parameter p. In other words, we need to define a probability distribution P(p) for the Bernoulli and the binomial distributions and P(p1, p2, ... , pk) for the categorical and the multinomial distribution with dimension k. Because of practical reasons, it is convenient to consider a form of the prior that has a simple interpretation as well as useful analytical properties. Those requirements are satisfied by the conjugate priors that have the same form as the likelihood function, thus the product of the prior and the likelihood will have the same form as well. This makes the Bayesian analysis much easier. The conjugate prior for the Bernoulli distribution and the binomial distribution is called the beta distribution, and is given by beta distribution

Γ(a + b) Beta(p|a, b) = pa−1(1 − p)b−1,(5.6) Γ(a)Γ(b) where a > 0 and b > 0 are real-valued parameters controlling the shape of the distribution, and Γ(x) is the gamma function defined as gamma function 62 selected probability distributions

Γ(x) = ux−1e−udu.(5.7) Z0∞ Using integration by parts, it can be proven that Γ(x + 1) = xΓ(x). This implies that Γ(1) = 1 and Γ(x + 1) = x! when x is a positive integer (Bishop, 2007). That is why the gamma function can be seen as a real-valued extension of the factorial function x! with the argument x shifted down by 1. The gamma function ensures that the beta distribution is nor- malized. How the parameters a and b control the shape of the distribution is shown in figure 5.2.

4 4

a = 0.1, b=0.1 a = 4, b = 10 a = 1, b = 1 3 3 ) b

, a | p

( 2 2 a t e B

1 1

0 0 0 0.2 0.4 0.6 0.8 1 p

Figure 5.2: The beta distribution for various values of the parameters.

Notice that while the distribution has one peak when the parame- ters are greater than one, when the parameters are equal to 0.1, the density has the highest values around both extreme values p = 0 and p = 1. The analogy with tossing a coin is that there is a high prior probability that the coin is unfair, but we do not know which side of the coin is preferred. This is a very important property with the beta distribution, which has a big impact on probabilistic topic models as we will see later on. The posterior density over the probability p can now be obtained by multiplying the beta prior and the binomial likelihood and normalizing. When we skip the factors that do not depend on p, the posterior density is proportional to

P(p|m, n, a, b) ∝ pa−1(1 − p)b−1pm(1 − p)n−m (5.8) = pm+a−1(1 − p)n−m+b−1, 5.2 the beta and the dirichlet distributions 63

which is simply another beta distribution, so the normalization coefficient can be obtained similarly to 5.6:

P(p|m, n, a, b) = Γ(m + a + n − m + b) (5.9) pm+a−1(1 − p)n−m+b−1. Γ(m + a)Γ(n − m + b)

We can see from 5.9 that the parameters a and b can intuitively be interpreted as pseudo-counts of x = 1 and x = 0, respectively. Now, we can introduce an extension of the beta distribution, which models a prior probability over the parameters of the multinomial distribution p = (p1, p2, . . . pk). The conjugate prior for the multino- mial distribution is known as the Dirichlet distribution, and is given Dirichlet by distribution

k Γ(α1 + α2 + ··· + αk) αi−1 Dir(p|α) = pi ,(5.10) Γ(α1)Γ(α2) ...Γ(αk) i=1 Y where α = (α1, α2, ... , αk) are parameters of the distribution. Since k 0 6 pi 6 1 and i=1 pi = 1 must hold, the distribution over p lies on a simplex of a dimensionality k − 1. Four examples of a three- simplex P dimensional Dirichlet distribution are depicted in figure 5.31.

Figure 5.3: Examples of three-dimensional Dirichlet distributions.

1 The figure is adopted from http://en.wikipedia.org/wiki/Dirichlet_distribution. 64 selected probability distributions

By multiplying the Dirichlet distribution (prior) with the multino- mial distribution (likelihood), we get the posterior distribution over parameters p = (p1, p2, ... , pk):

k Γ(α1 + ··· + αk + m1 + ··· + mk) αi+mi−1 P(p|m, α) = pi ,(5.11) Γ(α1 + m1) ...Γ(αk + mk) i=1 Y which again takes the form of a Dirichlet distribution.

5.3 the gamma distribution gamma distribution The gamma distribution is a continuous probability distribution with two real-number parameters a > 0 and b > 0, which is defined as: 1 Gamma(x|a, b) = baxa−1e−bx,(5.12) Γ(a)

shape parameter where a is usually called the shape parameter and b is called the rate rate parameter parameter. Its density function for various choices of parameters a and b is depicted in figure 5.4. The gamma distribution simply stems

2

a = 1, b = 2 a = 2, b = 2 a = 2, b = 1 a = 8, b = 0.8 1.5 ) b

, a | x (

a 1 m m a G

0.5

0 0 1 2 3 4 x

Figure 5.4: Probability density functions for various parameters of the gamma distribution.

from the gamma function, defined in 5.7. It is derived by normalizing the gamma function to have its integral equal to one:

Γ(a) xa−1e−x = dx,(5.13) Γ(a) Γ(a) Z0∞ 5.3 the gamma distribution 65 and hence 1 Gamma(x|a, 1) = xa−1e−x,(5.14) Γ(a) which can be viewed as a special case of scaled gamma distribution defined in 5.12. The gamma distribution is also closely related to the exponential exponential distribution, defined as distribution

Exponential(x|λ) = λe−λx,(5.15) which simply is a special case of the gamma distribution with a = 1 and b = λ. Thanks to its form, the gamma distribution is often used as a prior distribution for many likelihood functions, namely the Poisson, exponential or the normal distribution with known mean.

TOPICMODELS 6

The core idea behind LDA-Frames is inspired by topic models, originally proposed to understand the content of a text document. The topics can be seen as abstract themes that occur in documents, and can be used to summarize the said documents. For example soccer, sport, Europe and news can be topics describing a web document reporting the latest results of soccer in Europe. The first reference to topic models, which comes from Information Retrieval (IR), is dated back to the nineteen-eighties, and is known as Latent Semantic Indexing (Dumais et al., 1988; Furnas et al., 1988). Latent Semantic Most approaches for retrieving textual documents before Latent Indexing Semantic Indexing (LSI) were based on lexical matching between words in user’s queries and a collection of documents. Because there are many ways of expressing a user’s query, and a lot of words are ambiguous, the literal matching may not find relevant documents or even may find irrelevant documents. A better way to retrieve documents seems to be matching conceptual topics that represent the meaning of the query and the documents. That is what LSI does. More specifically, LSI tries to overcome those problems by modeling a high-dimensional space of words using a low-dimensional space of topics. Both documents and queries in LSI are represented with a subset of topics taken from a shared collection of general topics, where each topic has its own weight of membership for a given document. Similarly, the topics are represented as subsets of the vocabulary with the weights assigned to selected words. In the original LSI, the weights may have negative values, which indicate negative relevance. For example

(6.1) soccer = 1.8*‘soccer’ + 0.4*‘ball’ + 0.2*‘FIFA’ - 0.4*‘tennis’ may be a representation of the soccer topic, and

(6.2) doc = 2.3*soccer + 1.8*sport + 0.9*Europe + 0.8*news a representation of the document doc describing the web document from the previous example. The topic and document representations are derived from a word- document co-occurrence matrix using a dimensionality reduction technique. It is important that the process of topic identification is

67 68 topic models

completely unsupervised, and the topic labels (e.g. ‘Soccer’ or ‘News’) are unknown. In most IR applications, however, there is usually no need for labels, thus they can be replaced with integers. Topics generated with LSI techniques can be considered as semantic classes identified from the vocabulary. This fact has been an inspira- tion for using topics as the representation of semantic roles in our model for probabilistic semantic frames. Before we describe the LDA- Frames model, we will introduce three approaches for topic modeling. Latent Semantic The first one is called Latent Semantic Analysis, which is often used Analysis as a for LSI. It is based on linear algebra techniques. The other two models are probabilistic, and are much closer to our model Probabilistic Latent for semantic roles. Their names are Probabilistic Latent Semantic Semantic Analysis Analysis and Latent Dirichlet Allocation. Latent Dirichlet Allocation 6.1 latent semantic analysis

Latent Semantic Analysis (Dumais et al., 1988; Furnas et al., 1988), also known as Latent Semantic Indexing, is a technique that projects documents into a vector space with latent semantic dimensions, which is typically much lower than the original space. Thus, we can look at Latent Semantic Analysis (LSA) as a dimensionality reduction technique. If two words often co-occur in the same documents, they are likely semantically similar. That is the co-occurrence on the first level. Moreover, if two words often co-occur with a third word in the same documents, they can also be considered to be semantically related. That is the co-occurrence on the second level. We can define the co-occurrence on the third level, fourth level, and so on. The notion of semantic similarity is used to group words into topics. LSA is an application of a method from linear algebra, called Singular Value Singular Value Decomposition, to a word-by-document matrix. The Decomposition projection into a latent semantic space is used such that the 2-norm between the original co-occurrence matrix and the low-dimensional latent space matrix is minimized. Let us start with a sample collection of three one-sentenced documents:

(6.3) Doc1: Machine learning helps people to understand data.

(6.4) Doc2: Data can be understood using machine learning.

(6.5) Doc3: People can use machine learning for data understanding.

We can extract the vocabulary as an alphabetically ordered list of all lower-cased words used in the documents

V = (‘be’, ‘can’, ‘data’, ‘for’, ‘helps’, ‘learning’, ‘machine’, ‘people’, ‘to’, ‘understand’, ‘understanding’, ‘understood’, ‘use’, ‘using’), 6.1 latent semantic analysis 69

and transform the documents into a word-by-document matrix, also known as the bag-of-words representation, which stores the bag-of-words word-count frequencies for every word from the vocabulary in the representation documents. The bag-of-words representation for our three documents is illustrated in table 6.1.

Doc1 Doc2 Doc3 be 0 1 0 can 0 1 1 data 1 1 1 for 0 0 1 helps 1 0 0 learning 1 1 1 machine 1 1 1 people 1 0 1 to 1 0 0 understand 1 0 0 understanding 0 0 1 understood 0 1 0 use 0 0 1 using 0 1 0

Table 6.1: The bag-of-words representation of text documents.

Let us denote the word-by-document matrix M. We decompose this matrix using the Singular Value Decomposition (SVD) to obtain a product of three matrices

M = UΣVT ,(6.1) where U is a column matrix of the left singular vectors for words, VT is the transpose of a column matrix of the right singular vectors for documents, and Σ is a diagonal matrix containing the singular values. The difference between the original matrix M and the decomposition UΣVT is that the decomposition comprises orthogonal eigen vectors with their eigen values, thus it removes redundancy. When we remove the singular triplet (a pair of vectors from U and VT with associated singular value from Σ) with the lowest singular value, it defines the best lower-rank approximation M0 of the original matrix M such that

0 ∆ = kM − M k2 (6.2) is minimal (Eckart and Young, 1936). 70 topic models

The LSA method is based on discarding |V| − k singular triplets with the lowest singular values (k =∼ 300 for big corpora, typically) in the SVD decomposition of the word-by-document matrix, which results in a representation of the original documents in k-dimensional latent topic space. Specifically, the decomposition of matrix M is given by matrices U, Σ, V as follows.

  −1.4109e − 01 3.5355e − 01 2.5185e − 01   −3 0845e − 01 3 5355e − 01 −1 7280e − 01  . . .    −4.4954e − 01 −1.0126e − 16 7.9044e − 02      −1.6735e − 01 4.2962e − 16 −4.2465e − 01   −1.4109e − 01 −3.5355e − 01 2.5185e − 01      −4.4954e − 01 −1.0126e − 16 7.9044e − 02    −4 4954e − 01 −1 0126e − 16 7 9044e − 02   . . .  U =   (6.3) −3.0845e − 01 −3.5355e − 01 −1.7280e − 01     −1.4109e − 01 −3.5355e − 01 2.5185e − 01    −1 4109e − 01 −3 5355e − 01 2 5185e − 01   . . .    −1.6735e − 01 4.2962e − 16 −4.2465e − 01     −1.4109e − 01 3.5355e − 01 2.5185e − 01    −1.6735e − 01 4.2962e − 16 −4.2465e − 01   −1.4109e − 01 3.5355e − 01 2.5185e − 01

  3.8399 0 0   Σ =  0 2.0000 0  (6.4)   0 0 1.8043

  −5.4177e − 01 −7.0711e − 01 4.5440e − 01 T   V = −5.4177e − 01 7.0711e − 01 4.5440e − 01  (6.5)   −6.4262e − 01 8.6820e − 16 −7.6618e − 01

A corpus of three documents cannot produce more than three singular triplets, so both the documents and the words are repre- sented as three-dimensional vectors. We can reduce the space, for example into two dimensions, by removing the singular triplet with the singular value 1.804. After doing that, we get the representation of the documents in a two-dimensional space of latent topics given by table 6.2. To summarize, LSA creates a latent semantic space in which documents and words can be compared. The dimensionality of the semantic space is typically much lower than the number of words and 6.2 probabilistic latent semantic analysis 71

Doc1 Doc2 Doc3

topic1 -5.4177e-01 -5.4177e-01 -6.4262e-01

topic2 -7.0711e-01 7.0711e-01 8.6820e-16

Table 6.2: Reduced 2-dimensional representation of the documents. documents, so it can be seen as a dimensionality reduction method. The method is unsupervised and the latent topics have no explicit labels derived using the LSA procedure.

6.2 probabilistic latent semantic analysis

Although LSA has been applied with remarkable success in different domains, it has a number of limitations mainly stemming from its non-adequate statistical foundation. In order to overcome the deficits, Hofmann( 1999) developed a fully statistical approach called Probabilistic Latent Semantic Analysis (PLSA), also known as the Aspect Model. The PLSA approach models each word in a document Aspect Model as a sample from a mixture model. The mixture components are multinomial random variables representing latent topics, such that each word is generated from a single topic, and a single document can be associated with multiple topics. A document can thus be viewed as a probability distribution on a fixed set of topics. The idea of probabilistic topic models is illustrated in figure 6.1, which is adopted from Steyvers and Griffiths( 2007).

m y 1 1 1 o ne DOC1: money bank loan ney 1.0 l 1 1 1 moo bank money money a 1 1 n bank loan

bank k

n

k

a

n lo b l a oa

n n a

b .5 1 1 TOPIC 1 DOC2: money bank 2 2 1 2 .5 bank river loan stream bank1 money1

r k ive an r b r i ver 2 2

stream DOC3: river bank

k 2 2 2 2

n stream bank river river

r ba

ve m 1.0 2 2

ri a

e stream bank r st

TOPIC 2

Figure 6.1: Generative process for probabilistic topic models. 72 topic models

The generative probabilistic model in the figure describes a way of generating documents from known topics and probability dis- tributions over them. Topics 1 and 2 are thematically related to money and rivers, respectively, and are illustrated as a bag of words, where multiple occurrences of words simulate different probabilities. Particular documents can be produced by picking words depending on the probability given to the topic. For example, documents 1 and 3 were generated from topics 1 and 2, respectively, whereas document 2 was generated from both topics 1 and 2 with an equal probability. The superscript numbers associated with the generated words indicate topics that were used to sample those words. This allows probabilistic topic models to treat with polysemy where the same words have multiple meanings. Formally speaking, PLSA model associates a latent topic variable z ∈ {z1, z2, . . . zK} with each occurrence of a word w ∈ {w1, w2, ... , wM} in a document d ∈ {d1, d2, ... , dN}, which can be described in terms of the generative process defined in algorithm 6.1.

Algorithm 6.1 Generative process for PLSA. 1: for i ∈ {1, 2, ... , N} do 2: for j ∈ {1, 2, ... , M} do 3: Choose a latent topic zij with probability P(zij|di) 4: Choose a word wij with probability P(wij|zij) 5: end for 6: end for

Notice that the number of words M in document di is the same for each i ∈ {1, 2, ... , N}. This is only a simplification used to ease the notation and can be easily overcome. It is worth mentioning that samples of topics and words in a document are independent of each other. The generative process can also be represented using a graphical probabilistic model, as is shown in figure 6.2.

z wijij

N

di

M

Figure 6.2: Graphical model for PLSA. 6.2 probabilistic latent semantic analysis 73

The crucial issue with modeling documents using PLSA is the inference of latent topic variables z. We start with expressing the joint distribution P(d, w), where the observation pairs (d, w) are supposed to be generated independently, and words w conditioned on a latent variable z are conditionally independent of the specific document d. This results in the following expressions:

P(d, w) = P(d)P(w|d),(6.6) where K P(w|d) = P(w|zk)P(zk|d).(6.7) k=1 X After the application of the Bayes’ rule on P(zk|d) we get

K P(d, w) = P(zk)P(w|zk)P(d|zk).(6.8) k=1 X The log likelihood of the data is then given by

N M L = logP(di, wij).(6.9) i=1 j=1 X X The standard procedure to estimate the maximum likelihood is to use the Gibbs sampling or the variational inference. In this case, the variational inference can be replaced by its simpler variant called the Expectation-Maximization (EM) algorithm, which alternates between performing an expectation (E) step in order to create a function for the expected values of the latent variables z using the current estimates of the parameters, and a maximization (M) step, which computes the parameters by maximizing the expected log likelihood from the E step. Following the original work of Hofmann( 1999), the expectation over the variable z in the E step is P(z)P(d|z)P(w|z) P(z|d, w) = K ,(6.10) k=1 P(zk)P(d|zk)P(w|zk) and the M step equationsP are given by

N i=1 n(di, w)P(z|di, w) P(w|z) = N M ,(6.11) Pi=1 j=1 P(z|di, wij)) P P M j=1 P(z|d, wdj) P(d|z) = N M ,(6.12) iP=1 j=1 P(z|d, wij) P P N M i=1 j=1 P(z|di, wij) P(z) = N M ,(6.13) P i=1P j=1 n(di, wi,j) P P 74 topic models

where n(d, w) denotes the number of times word w occurred in document d. Alternating between the E and M steps creates a procedure that approaches a local maximum of the log likelihood function defined in equation 6.9. The PLSA model performs considerably better than LSA in many applications (Hofmann, 1999), but suffers from two problems. First, the topic mixtures P(z|d) are learned only for the training documents that have been seen during the learning phase, therefore there is no way to estimate those distributions for previously unseen documents. This property of PLSA limits its areas of applications. Secondly, there are a lot of parameters needed to be learned. Particularly, the number of parameters is given by kV + kN, where k is the number of topics, V is the size of vocabulary, and N is the number of documents. It especially means that the number of parameters grows linearly with the number of documents, which leads to overfitting for large datasets (Popescul et al., 2001). To overcome all these drawbacks Blei Latent Dirichlet et al.( 2003) proposed a model called Latent Dirichlet Allocation. Allocation 6.3 latent dirichlet allocation

We have seen that the PLSA model does not make any prior assump- tions about the mixture weights of topics in documents, nor about the distribution of words in topics. This makes it difficult to generalize the model to new documents. Being inspired by PLSA, Blei et al.( 2003) extended the model by introducing a Dirichlet prior on both topic and word distributions, and named the resulting generative model Latent Dirichlet Allocation (LDA). Let us remind the definition of the Dirichlet distribution:

K Γ(α1 + α2 + ··· + αK) αi−1 Dir(α) = pi (6.14) Γ(α1)Γ(α2) ...Γ(αK) i=1 Y It has been shown that the PLSA model is equivalent to the LDA model with uniform Dirichlet priors (α1 = α2 = ··· = αk = 1)(Girolami and Kabán, 2003), however, setting the α weights reasonably results in a more natural model of text documents. Specifically, for α1 = α2 = ··· = αk < 1, the modes of the Dirichlet distribution are located at the corners of the simplex. In this regime, there is a bias towards sparsity, and a pressure to pick topic distributions favoring only a few topics. The same regime is often used for the word distribution prior controlled with hyperparameters β. The particular values of α and β are usually set to α1 = α2 = . . . αk = β1 = β2 = ··· = βV = 0.1 in practice. A formal description of the generative process for LDA is written down in algorithm 6.2. The graphical model represented in figure 6.3 reveals a multilevel structure of the LDA assumptions. While the mixture proportions of topics are drawn uniquely for each document with a prior shared for 6.3 latent dirichlet allocation 75

Algorithm 6.2 Generative process for LDA. 1: for i ∈ {1, 2, ... , N} do 2: Choose θi from Dir(α) 3: for j ∈ {1, 2, ... , M} do 4: Choose a latent topic zij from Multinomial(θi)

5: Choose a word wij from Multinomial(ϕzij ) 6: with a Dirichlet prior Dir(β) 7: end for 8: end for the whole collection of documents, the mixture components (topic themselves) are shared across the collection with their priors shared as well.

α

θi

z i,j β

wi,j φk

M N K

Figure 6.3: Graphical model of latent Dirichlet allocation.

Apart from the hyperparameters α and β, the number of topics K also requires to be defined in advance within this basic LDA model. The mixing coefficients of topics for each document θ, and the word- topic distributions ϕ, are hidden, and are learned from the observed data using unsupervised learning techniques. For learning the hidden parameters of LDA, Blei et al.( 2003) introduced a variational inference algorithm. Subsequently, Griffiths and Steyvers( 2004) developed an 76 topic models

algorithm based on collapsed Gibbs sampling, which generates a sequence of samples from the joint probability distribution of the LDA model. Whereas the variational inference is faster (it only needs to see the observed data once), the collapsed Gibbs sampling gives more accurate results. Given the observed data w = {wij}, the task of the Bayesian inference is to compute the posterior distribution over the latent topics z = {zij}, where the posterior distribution of the whole model is given by

P(w, z, θ, ϕ|α, β) = K N M (6.15) Dir(ϕk|β) Dir(θi|α) Mult(zij|θi)Mult(wi,j|ϕzij ). k=1 i=1 j=1 Y Y Y Generative probabilistic models for topic modeling give great potential to help understand natural languages. These models make explicit assumptions about the process of generating a document, and enable the identification of latent structures that underlie the words in documents. In this thesis, their advantages are exploited to make the core of the probabilistic semantic frames named LDA-Frames. LDA-FRAMES 7

In this chapter, we will propose an unsupervised model for prob- abilistic semantic frames named LDA-Frames. It is based on my previous work published in Materna( 2012a). Instead of assuming the availability of annotated data with known frames and roles, we rely on automatically generated syntactic dependency information. More specifically, for every occurrence of a predicate (lexical unit) from the set of predefined predicates in the training corpus, we automatically identify realizations for a list of predefined grammatical relations. For example, suppose that the set of the predefined predicates is given by

PRED = {eat, drink, teach},(7.1) the list of grammatical relations is given by

REL = [subject, object],(7.2) and the corpus only consists of one sentence

(7.1) Thomas eat a whole pie and drink a pint of milk for breakfast.

If subject represents the subject of the predicate and object repre- sents the direct object of the predicate, the algorithm is supposed to generate training data depicted in table 7.1.

predicate subject object eat Thomas pie drink Thomas pint

Table 7.1: Example of training data.

Based on the training data, we treat the semantic frame induction problem as a clustering task. In the first stage, the goal is to assign each training example (a row in table 7.1) to a cluster, so that every cluster of training data consists of instances assigned to the same semantic frame. In the second stage, we assign each cluster a frame specification represented as a tuple of semantic roles, so that each tuple is unique across all clusters. Semantic roles as well as semantic frame clusters, can be represented with any unique identifiers, typically with positive integers.

77 78 lda-frames

predicate subject object semantic frame semantic roles eat Thomas pie 1 (1, 3) John food Mike pizza man cake dog meat 2 (2, 3) mouse cheese cat mouse drink Thomas pint 3 (1, 4) Jane coffee Mike tee teach teacher student 4 (1, 1) professor Mike Peter dog 5 (1, 2)

Table 7.2: Extended example of a training data set.

In order to exemplify the idea, let us extend the data set from table 7.1 with other training examples. The extended data set with some additional columns is shown in table 7.2. You can see that the training examples are clustered according to their semantic frame identifiers shown in the third column. Each frame cluster is then enhanced with a tuple of semantic roles, printed out in the fourth column. Although the semantic role identifiers may not necessarily be associated with explicit labels, one can easily recognize a straightforward assignment of semantic role labels for the example data, which is shown in table 7.3.

semantic role ID semantic role label 1 Person 2 Animal 3 Food 4 Drink

Table 7.3: Semantic role labels

The semantic frame and semantic role assignments naturally define probability distributions over semantic frames (SF) for a predicate, and probability distributions over semantic role realizations (SRR) for a semantic role (SR). For instance, 3 P(SF = 2 | predicate = eat) = ,(7.3) 7 7.1 model description 79

3 1 P(SRR = Mike | SR = 1) = = .(7.4) 12 4 These distributions can further be exploited in an automatic annotation of previously unseen texts with the inferred semantic frames and roles. The two phases of semantic frame and semantic role induction need not be performed separately, nor in any specific order. Their implementation, proposed in this thesis, will be clarified in the subsequent sections.

7.1 model description

The method for automatic identification of semantic frames is based on a probabilistic generative process. Training data for the algorithm consists of tuples of grammatical relation realizations acquired using a dependency parser from a training corpus for every predicate (lexical unit). Each grammatical relation realization is treated as generated from a given semantic frame according to the word distribution of the corresponding semantic role. Formally, let

g = (g1, g2, ... , gS) (7.5) be a list of selected grammatical relations,

lu = (lu1, lu2, ... , luU) (7.6) list of lexical units, and

w = {

{(w1,1,1, w1,1,2, ... , w1,1,S), ... , (w1,T,1, w1,T,2, ... , w1,T,S)},

{(w2,1,1, w2,1,2, ... , w2,1,S), ... , (w2,T,1, w2,T,2, ... , w2,T,S)}, . . .

{(wU,1,1, wU,1,2, ... , wU,1,S), ... , (wU,T,1, wU,T,2, ... , wU,T,S)} } (7.7) grammatical relation realizations of training data from the corpus. For simplicity, we assume that each lexical unit has T occurrences in the corpus. This simplification can be easily avoided by indexing T1, T2, ...TU for lexical units u ∈ {1, 2, ... , U}. 80 lda-frames

Supposing the number of frames is given by parameter F, the number of semantic roles by R, the number of slots (grammatical relations) by S and the size of vocabulary is V. The generative process is defined according to algorithm 7.1.

Algorithm 7.1 Generative process for LDA-Frames. 1: for f ∈ {1, 2, ... , F} do 2: do 3: for s ∈ {1, 2, ... , S} do 4: Choose rf,s from Uniform(1, R) 5: end for 6: while the frame is not unique 7: end for 8: for r ∈ {1, 2, ... , R} do 9: Choose θr from Dir(β) 10: end for 11: for u ∈ {1, 2, ... , U} do 12: Choose ϕu from Dir(α) 13: for t ∈ {1, 2, ... , T} do 14: Choose frame fu,t from Multinomial(ϕu), 15: where fu,t ∈ {1, 2, ... , F}. 16: for s ∈ {1, 2, ... , S} do 17: Choose wu t s from Multinomial(θr ) , , fu,t,s 18: end for 19: end for 20: end for

Lines 1 up to 7 of the algorithm are responsible for generating a set of semantic frames, where each frame is in the form of a tuple of S semantic roles represented with their integer identifiers. The roles are chosen from the uniform distribution over integers between 1 and R. If a generated frame is identical with another frame from a previous iteration, sampling is repeated. This results in a set of F distinct frames. In order to make it possible to generate a set of distinct frames, the number of frames, roles and slots must fulfill the following formula:

 log(F) R exp .(7.8) > S

Since it is not guaranteed that the do-while cycle terminates after a finite number of iterations, a better solution is to allow to generate only unique frames, which is a simple extension. The semantic role realization distributions for all semantic roles are generated at lines 8 up to 10. The probabilities are chosen from the Dirichlet distribution with parameter β. Although the Dirichlet 7.1 model description 81 distribution allows to define distinct weights for its dimensions, it is often convenient to use the symmetric Dirichlet distribution, where

β1 = β2 = ··· = βR.(7.9)

Finally, the corpus realizations are generated at lines 11 up to 20. The distribution ϕu over semantic frames for a lexical unit u is generated from the Dirichlet distribution with parameter α. Similarly to θ, it is also useful to use the symmetric Dirichlet distribution. Semantic role realizations (particular words from training data) are generated from the distribution θ, where the semantic role index is chosen based on the sampled frame fu,t, generated from the multinomial distribution with parameter ϕu. The graphical model for LDA-Frames is shown in figure 7.1. It is parametrized with hyperparameters of prior distributions α and β, usually set by hand to a value between 0.01 – 0.1.

α

φ u

fu,t β

wu,t,s θr

S R T U

rf,s

S

F

Figure 7.1: Graphical model for LDA-frames. 82 lda-frames

7.2 the gibbs sampler for lda-frames

With the knowledge of the observed grammatical relation realizations w, the task of the inference is to generate a set of semantic frames and compute the posterior distributions for frames and roles, ϕ and θ, respectively. It can be solved using the variational inference method or using a sampling algorithm. Based on the arguments from chapter 4, we will focus on explaining and implementing the collapsed Gibbs sampler, which seems to be the most suitable inference method for LDA-frames. For the implementation of the collapsed Gibbs sampler, we need to express conditional probabilities for hidden variables

P(fu,t|f−(u,t), r, w, α, β) (7.10)

and

P(rf,s|r−(f,s), f, w, α, β),(7.11)

where f−(u,t) denotes all random variables fi,j excluding such that i = u, j = t, and r−(f,s) denotes all random variables ri,j except of rf,s. By the definition of conditional probability

P(fu,t, f−(u,t), r, w|α, β) P(fu,t|f−(u,t), r, w, α, β) = (7.12) P(f−(u,t), r, w|α, β) and

P(rf,s, r−(f,s), f, w|α, β) P(rf,s|r−(f,s), f, w, α, β) = (7.13) P(r−(f,s), f, w|α, β) We can remove the denominator from 7.12, which does not depend on fu,t

P(fu,t, f−(u,t), r, w|α, β) ∝ P(fu,t, f−(u,t), r, w|α, β) = P(f−(u,t), r, w|α, β) (7.14)

= P(f, r, w|α, β),

and the denominator from 7.13, which does not depend on rf,s

P(rf,s, r−(f,s), f, w|α, β) ∝ P(rf,s, r−(f,s), f, w|α, β) = P(r−(f,s), f, w|α, β) (7.15)

= P(f, r, w|α, β).

Note that fu,t ∪ f−(u,t) = f and rf,s ∪ r−(f,s) = r. It is also worth mentioning that the conditional probabilities for fu,t and rf,s are proportional to the same formula so far. That is why we will continue 7.2 the gibbs sampler for lda-frames 83 with deriving them together. The derivation will be split into two parts when necessarily. Let us start with expressing the joint probability for all of the random variables in the LDA-Frames model except of ϕ and θ that are required to be integrated out in the collapsed Gibbs sampling

P(f, r, w|α, β) = P(f, r, w, ϕ, θ|α, β) dϕ dθ.(7.16) Zϕ Zθ Now we can expand the probability into the product of factors based on the graphical model from figure 7.1

P(ϕ|α)P(f|ϕ)P(θ|β)P(w|f, r, θ) dϕ dθ.(7.17) Zϕ Zθ Because x × y dx dy = ( x dx) × ( y dy), we can separate the integrals based on the terms being integrated RR R R

P(ϕ|α)P(f|ϕ) dϕ × P(θ|β)P(w|f, r, θ) dθ.(7.18) Zϕ Zθ Let us denote

P(ϕ|α)P(f|ϕ) dϕ ≡ A P(θ|β)P(w|f, r, θ) dθ ≡ B.(7.19) Zϕ Zθ

Next, let fcf,u be the number of times frame f is assigned to lexical unit u, wcv,r be the number of times word v is assigned to semantic role r in any frame, and wc∗,r be the value of wcv,r summed over all words from the vocabulary. Formally,

Tu fcx,u = I(fu,t = x) (7.20) t=1 X

U Tu S

wcv,x = I(rfu,t,s = x & v = wu,t,s) (7.21) u=1 t=1 s=1 X X X

U Tu S

wc∗,x = I(rfu,t,s = x).(7.22) u=1 t=1 s=1 X X X Further, we will derive the formula for A and B separately. Let us expand the A term into the products of individual lexical units

U A = P(ϕu|α) P(fu|ϕu) dϕ = ϕ u=1 Z Y (7.23) U = P(ϕu|α) P(fu|ϕu) dϕu = u=1 ϕu Y Z 84 lda-frames

and further expand out the Dirichlet prior and the multinomial distribution according to their definitions.

T U Γ( F α ) F u = i=1 i ϕαi−1 ϕ dϕ = F u,i u,fu,t u (7.24) u=1 ϕu i=1 Γ(αi) i=1 t=1 Y Z P Y Y Q U F F F Γ( i=1 αi) αi−1 fci,u = F ϕu,i ϕu,i dϕu = (7.25) u=1 ϕu i=1 Γ(αi) i=1 i=1 Y Z P Y Y Because the DirichletQ distribution is a conjugate prior for the multi- nomial distribution, we can now merge the products together

U F F Γ( i=1 αi) αi+fci,u−1 = F ϕu,i dϕu = (7.26) u=1 ϕu i=1 Γ(αi) i=1 Y Z P Y and multiply theQ formula by a constant fraction equal to one.

U Γ( F α ) F Γ(fc + α ) = i=1 i i=1 i,u i × F Γ(α ) Γ( F fc + α ) u=1 Pi=1 i Q i=1 i,u i Y (7.27) F F Γ( Qi=1 fci,u + αPi) αi+fci,u−1 F ϕu,i dϕu = ϕu i=1 Γ(fci,u + αi) i=1 Z P Y Note nowQ that the Dirichlet distribution under the integral is inte- grated over its entire support density, and thus evaluates to 1.

U F F Γ( i=1 αi) i=1 Γ(fci,u + αi) = F F ∝ (7.28) u=1 i=1 Γ(αi) Γ( i=1 fci,u + αi) Y P Q Finally, weQ can drop outP the remaining gamma function fraction, which only depends on constant hyperparameter α

U F i=1 Γ(fci,u + αi) ∝ F .(7.29) u=1 Γ( i=1 fci,u + αi) Y Q Now, we willP repeat the same procedure for formula B:

R U Tu S B = P(θi|β) P(wu t s|θr ) dθ = , , fu,t,s θ i=1 u=1 t=1 s=1 Z Y Y Y Y (7.30) R U Tu S = P(θi|β) P(wu t s|θr ) dθi = , , fu,t,s i=1 θi u=1 t=1 s=1 Y Z Y Y Y

R V V U Tu S Γ( v=1 βv) βv−1 = θ θr w dθi = (7.31) V i,v fu,t,s, u,t,s i=1 θi v=1 Γ(βv) v=1 u=1 t=1 s=1 Y Z P Y Y Y Y P 7.2 the gibbs sampler for lda-frames 85

R V V V Γ( v=1 βv) βv−1 wcv,i = V θi,v θi,v dθi = (7.32) i=1 θi v=1 Γ(βv) v=1 v=1 Y Z P Y Y P R V V Γ( v=1 βv) βv+wcv,i−1 = V θi,v dθi = (7.33) i=1 θi v=1 Γ(βv) v=1 Y Z P Y P R Γ( V β ) V Γ(wc + β ) = v=1 v v=1 v,i v × V Γ(β ) Γ( V wc + β ) i=1 Pv=1 v Q v=1 v,i v Y (7.34) V V Γ( Pv=1 wcv,i + βPv) βv+wcv,i−1 V θi,v dθi = θi v=1 Γ(wcv,i + βv) v=1 Z P Y Q R V V Γ( v=1 βv) v=1 Γ(wcv,i + βv) = V V ∝ (7.35) i=1 v=1 Γ(βv) Γ( v=1 wcv,i + βv) Y P Q P P R V v=1 Γ(wcv,i + βv) ∝ V .(7.36) i=1 Γ( v=1 wcv,i + βv) Y Q The joint probabilityP of the collapsed model is thus given by the product

$$A \times B = \prod_{u=1}^{U} \frac{\prod_{i=1}^{F} \Gamma(fc_{i,u} + \alpha_i)}{\Gamma(\sum_{i=1}^{F} fc_{i,u} + \alpha_i)} \times \prod_{i=1}^{R} \frac{\prod_{v=1}^{V} \Gamma(wc_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc_{v,i} + \beta_v)}. \tag{7.37}$$

The probabilities for sampling frames and roles will be derived separately.

7.2.1 Sampling Frames

Suppose we are sampling a frame at position $u = a$, $t = b$. We can separate the terms dependent on the current position $(a, b)$,

$$P(f_{a,b} \mid f_{-(a,b)}, r, w, \alpha, \beta) = \frac{\prod_{i=1}^{F} \Gamma(fc_{i,a} + \alpha_i)}{\Gamma(\sum_{i=1}^{F} fc_{i,a} + \alpha_i)} \times \prod_{u \neq a} \frac{\prod_{i=1}^{F} \Gamma(fc_{i,u} + \alpha_i)}{\Gamma(\sum_{i=1}^{F} fc_{i,u} + \alpha_i)} \times \prod_{i=1}^{R} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc_{v,i} + \beta_v) \times \prod_{v \neq w_{a,b,*}} \Gamma(wc_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc_{v,i} + \beta_v)} \propto \tag{7.38}$$

where $w_{a,b,*}$ denotes the words at any slot of the position $(a, b)$, and then drop out the terms that do not depend on $(a, b)$.

$$\propto \frac{\prod_{i=1}^{F} \Gamma(fc_{i,a} + \alpha_i)}{\Gamma(\sum_{i=1}^{F} fc_{i,a} + \alpha_i)} \times \prod_{i=1}^{R} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc_{v,i} + \beta_v)} \propto \tag{7.39}$$

Let $fc^{-(a,b)}_{x,u}$ be defined the same way as $fc_{x,u}$, but without the counts for the position $(a, b)$. It is straightforward that for counts that do not include the position $(a, b)$, $fc^{-(a,b)}_{x,u} = fc_{x,u}$, and for counts that do include the position $(a, b)$, $fc_{x,u} = fc^{-(a,b)}_{x,u} + 1$. Similarly, let $wc^{-(a,b,c)}_{v,x}$ be defined the same way as $wc_{v,x}$, but without the counts for the position $(a, b, c)$, where $c$ corresponds to a slot number. Then for counts that do not include the position $(a, b, c)$, $wc^{-(a,b,c)}_{v,x} = wc_{v,x}$, and for counts that do, $wc_{v,x} = wc^{-(a,b,c)}_{v,x} + 1$. The term $wc^{-(a,b,*)}_{v,i}$ denotes the same count as $wc^{-(a,b,c)}_{v,x}$, but it excludes the counts for all frame slots. The constant $C_{f_{a,b},i}$ denotes the number of repetitions of role $i$ in any slot of frame $f_{a,b}$.

$$\propto \frac{\Gamma(fc^{-(a,b)}_{f_{a,b},a} + \alpha_{f_{a,b}} + 1) \times \prod_{i \neq f_{a,b}} \Gamma(fc^{-(a,b)}_{i,a} + \alpha_i)}{\Gamma(1 + \sum_{i=1}^{F} fc^{-(a,b)}_{i,a} + \alpha_i)} \times \prod_{i = r_{f_{a,b},*}} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc^{-(a,b,*)}_{v,i} + \beta_v + C_{f_{a,b},i})}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b,*)}_{v,i} + \beta_v + C_{f_{a,b},i})} \times \prod_{i \neq r_{f_{a,b},*}} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc^{-(a,b,*)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b,*)}_{v,i} + \beta_v)} \tag{7.40}$$

Using the fact that $\Gamma(x + 1) = x \times \Gamma(x)$, we can expand the terms dependent on the current position

$$= \frac{fc^{-(a,b)}_{f_{a,b},a} + \alpha_{f_{a,b}}}{\sum_{i=1}^{F} fc^{-(a,b)}_{i,a} + \alpha_i} \times \frac{\Gamma(fc^{-(a,b)}_{f_{a,b},a} + \alpha_{f_{a,b}}) \times \prod_{i \neq f_{a,b}} \Gamma(fc^{-(a,b)}_{i,a} + \alpha_i)}{\Gamma(\sum_{i=1}^{F} fc^{-(a,b)}_{i,a} + \alpha_i)} \times \prod_{s=1}^{S} \frac{wc^{-(a,b,s)}_{w_{a,b,s},\, r_{f_{a,b},s}} + \beta_{w_{a,b,s}}}{\sum_{v=1}^{V} wc^{-(a,b,s)}_{v,\, r_{f_{a,b},s}} + \beta_v} \times \prod_{i = r_{f_{a,b},*}} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc^{-(a,b,*)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b,*)}_{v,i} + \beta_v)} \times \prod_{i \neq r_{f_{a,b},*}} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc^{-(a,b,*)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b,*)}_{v,i} + \beta_v)} \tag{7.41}$$

and then group back the products of gamma functions

$$= \frac{fc^{-(a,b)}_{f_{a,b},a} + \alpha_{f_{a,b}}}{\sum_{i=1}^{F} fc^{-(a,b)}_{i,a} + \alpha_i} \times \frac{\prod_{i=1}^{F} \Gamma(fc^{-(a,b)}_{i,a} + \alpha_i)}{\Gamma(\sum_{i=1}^{F} fc^{-(a,b)}_{i,a} + \alpha_i)} \times \prod_{s=1}^{S} \frac{wc^{-(a,b,s)}_{w_{a,b,s},\, r_{f_{a,b},s}} + \beta_{w_{a,b,s}}}{\sum_{v=1}^{V} wc^{-(a,b,s)}_{v,\, r_{f_{a,b},s}} + \beta_v} \times \prod_{i=1}^{R} \frac{\prod_{v = w_{a,b,*}} \Gamma(wc^{-(a,b,*)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b,*)}_{v,i} + \beta_v)} \propto \tag{7.42}$$

Now we can remove the terms that do not depend on the current position and get the final expression

$$\propto (fc^{-(a,b)}_{f_{a,b},a} + \alpha_{f_{a,b}}) \times \prod_{s=1}^{S} \frac{wc^{-(a,b,s)}_{w_{a,b,s},\, r_{f_{a,b},s}} + \beta_{w_{a,b,s}}}{\sum_{v=1}^{V} wc^{-(a,b,s)}_{v,\, r_{f_{a,b},s}} + \beta_v}. \tag{7.43}$$

7.2.2 Sampling Roles

Suppose we are sampling a role at position $f = a$, $s = b$. Let $wc^{-(a,b)}_{v,x}$ express the same value as $wc_{v,x}$, but without the counts for frame $a$ and slot $b$, and let $C_{f,s,v}$ denote the number of times word $v$ occurs as a realization of slot $s$ of frame $f$. The derivation of the formula for sampling roles is very similar to the derivation of the formula for sampling frames:

$$P(r_{a,b} \mid f, r_{-(a,b)}, w, \alpha, \beta) \propto \prod_{i=1}^{R} \frac{\prod_{v=1}^{V} \Gamma(wc_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc_{v,i} + \beta_v)} = \tag{7.44}$$

$$= \frac{\prod_{v=1}^{V} \Gamma(wc^{-(a,b)}_{v,r_{a,b}} + \beta_v + C_{a,b,v})}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b)}_{v,r_{a,b}} + \beta_v + C_{a,b,v})} \times \prod_{i \neq r_{a,b}} \frac{\prod_{v=1}^{V} \Gamma(wc^{-(a,b)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b)}_{v,i} + \beta_v)} \propto \tag{7.45}$$

$$\propto \prod_{v=1}^{V} \left( \frac{wc^{-(a,b)}_{v,r_{a,b}} + \beta_v}{\sum_{v=1}^{V} wc^{-(a,b)}_{v,r_{a,b}} + \beta_v} \right)^{C_{a,b,v}} \times \frac{\prod_{v=1}^{V} \Gamma(wc^{-(a,b)}_{v,r_{a,b}} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b)}_{v,r_{a,b}} + \beta_v)} \times \prod_{i \neq r_{a,b}} \frac{\prod_{v=1}^{V} \Gamma(wc^{-(a,b)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b)}_{v,i} + \beta_v)} = \tag{7.46}$$

$$= \prod_{v=1}^{V} \left( \frac{wc^{-(a,b)}_{v,r_{a,b}} + \beta_v}{\sum_{v=1}^{V} wc^{-(a,b)}_{v,r_{a,b}} + \beta_v} \right)^{C_{a,b,v}} \times \prod_{i=1}^{R} \frac{\prod_{v=1}^{V} \Gamma(wc^{-(a,b)}_{v,i} + \beta_v)}{\Gamma(\sum_{v=1}^{V} wc^{-(a,b)}_{v,i} + \beta_v)} \propto \tag{7.47}$$

$$\propto \prod_{v=1}^{V} \left( \frac{wc^{-(a,b)}_{v,r_{a,b}} + \beta_v}{\sum_{v=1}^{V} wc^{-(a,b)}_{v,r_{a,b}} + \beta_v} \right)^{C_{a,b,v}}. \tag{7.48}$$

If the hyperparameters $\alpha_i$ and $\beta_i$ all have the same value (which is sufficient for most use cases), i.e. $\alpha = \alpha_1 = \alpha_2 = \cdots = \alpha_F$ and $\beta = \beta_1 = \beta_2 = \cdots = \beta_V$, we can use a shorthand notation for the resulting equations 7.43 and 7.48:

$$P(f_{u,t} \mid f_{-(u,t)}, r, w, \alpha, \beta) \propto (fc^{-(u,t)}_{f_{u,t},u} + \alpha) \times \prod_{s=1}^{S} \frac{wc^{-(u,t,s)}_{w_{u,t,s},\, r_{f_{u,t},s}} + \beta}{wc^{-(u,t,s)}_{*,\, r_{f_{u,t},s}} + V\beta} \tag{7.49}$$

$$P(r_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta) \propto \prod_{v=1}^{V} \left( \frac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}}. \tag{7.50}$$

Because the expressions were derived only up to proportionality, all that remains is to normalize them in order to get probabilities:

$$P(f_{u,t} \mid f_{-(u,t)}, r, w, \alpha, \beta) = \frac{(fc^{-(u,t)}_{f_{u,t},u} + \alpha) \times \prod_{s=1}^{S} \dfrac{wc^{-(u,t,s)}_{w_{u,t,s},\, r_{f_{u,t},s}} + \beta}{wc^{-(u,t,s)}_{*,\, r_{f_{u,t},s}} + V\beta}}{\sum_{i=1}^{F} \left[ (fc^{-(u,t)}_{i,u} + \alpha) \times \prod_{s=1}^{S} \dfrac{wc^{-(u,t,s)}_{w_{u,t,s},\, r_{i,s}} + \beta}{wc^{-(u,t,s)}_{*,\, r_{i,s}} + V\beta} \right]} \tag{7.51}$$

$$P(r_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta) = \frac{\prod_{v=1}^{V} \left( \dfrac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}}}{\sum_{i=1}^{R} \prod_{v=1}^{V} \left( \dfrac{wc^{-(f,s)}_{v,i} + \beta}{wc^{-(f,s)}_{*,\, i} + V\beta} \right)^{C_{f,s,v}}} \tag{7.52}$$

Sampling from the categorical distribution over frames is now straightforward and is described in algorithm 7.2.

Algorithm 7.2 Sampling from the categorical distribution over frames.
1: Choose t from Uniform(0, 1)
2: s ← 0
3: for i ∈ {1, 2, ..., F} do
4:     s ← s + P(fu,t = i | f−(u,t), r, w, α, β)
5:     if t < s then
6:         return i
7:     end if
8: end for

The algorithm starts with sampling a real-valued threshold t between 0 and 1 from the uniform distribution. Then it iterates through all frame numbers and adds their probability to the variable s, previously initialized to 0. If the sum stored in s exceeds the threshold t, the algorithm returns the current frame number. The sampler for the categorical distribution over roles can be constructed analogously.
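The same inverse-CDF scheme can be sketched in a few lines of Python (a minimal illustration, not the thesis implementation; the probability vector is assumed to be already normalized as in equations 7.51 and 7.52):

```python
import random

def sample_categorical(probs):
    """Draw an index from a normalized probability vector by inverting the CDF."""
    t = random.random()      # threshold sampled from Uniform(0, 1)
    s = 0.0                  # running cumulative probability
    for i, p in enumerate(probs):
        s += p
        if t < s:
            return i
    return len(probs) - 1    # guard against floating-point rounding

# Example: frame probabilities for one (u, t) position
print(sample_categorical([0.7, 0.2, 0.1]))
```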

NON-PARAMETRIC LDA-FRAMES 8

As was stated in the previous chapter, the LDA-Frames model requires the number of frames and roles to be defined in advance. It is not clear, however, how to select the best values, which depend on several factors. First of all, the number of frames and roles usually increases with the growing size of the training corpus. If the training data is small and covers only a small proportion of lexical unit usage patterns, the number of semantic frames should be small as well. The parameters are also affected by the granularity of roles and frames. Apart from selecting the parameters based on an annotated data set (which we are trying to avoid), a good way of automatically estimating the parameters is to select those that maximize the posterior probability of the model given the training data. The LDA-Frames algorithm generates frames from the Dirichlet distribution, which requires a fixed number of components. Similarly, the latent variables rf,s are chosen from a fixed set of semantic roles. In order to be able to update the number of frames and roles during the inference process, we need to use stochastic processes that assign probability distributions to possibly infinite structures instead of using Dirichlet distributions over a finite set of components. Specifically, we will add a Chinese restaurant process prior for the rf,s variables, and replace the Dirichlet distribution from which the semantic frames are generated with the Dirichlet process. Most of the ideas presented in this chapter are based on my previous work published in Materna (2013).

8.1 dirichlet process

The Dirichlet distribution can be seen as a bag full of k-sided dice with various probabilities of getting the k states after rolling. When we sample from the Dirichlet distribution, we pick one die from the bag. Sampling from the corresponding multinomial distribution in this analogy is simulated by rolling the picked die. Now let us move on to a process that enables modeling a distribution over an infinite set of discrete components. The Dirichlet process (Ferguson, 1973) is a generalization of the Dirichlet distribution, which enables working with an infinite set of events. In the dice analogy, it means that the dice have an infinite number of sides. The sampling space, however, remains discrete. We will formally introduce the Dirichlet process as a generalization of the Dirichlet distribution. Let us have a look at the Dirichlet


distribution’s parameter α as a categorical distribution G0 over K categories scaled by a scalar parameter α0:

Dir(α) ≡ Dir(α0G0).(8.1)

The distribution G0 can be expressed using the Dirac delta function

$$G_0 = \sum_{k=1}^{K} \pi_{0k}\, \delta(x - k), \tag{8.2}$$

where $\pi_{0k}$ are non-negative real-valued weights such that

$$\sum_{k=1}^{K} \pi_{0k} = 1, \tag{8.3}$$

and $\delta(x)$ is the Dirac function, which is defined for all real numbers so that

$$\delta(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{else.} \end{cases} \tag{8.4}$$

The distribution $G_0$ is usually called the base distribution of the Dirichlet distribution $\mathrm{Dir}(\alpha_0 G_0)$. It can easily be seen that the base distribution has $K$ spikes at locations $1, 2, \ldots, K$ with heights $\pi_{01}, \pi_{02}, \ldots, \pi_{0K}$, and the samples from $\mathrm{Dir}(\alpha_0 G_0)$ have the same form as $G_0$

$$G \sim \mathrm{Dir}(\alpha_0 G_0), \qquad G = \sum_{k=1}^{K} \pi_k\, \delta(x - k), \tag{8.5}$$

where

$$\sum_{k=1}^{K} \pi_k = 1. \tag{8.6}$$

The Dirichlet process extends the Dirichlet distribution by considering any (with some restrictions) infinite distribution $H_0$ defined on a support set $S(H_0)$ instead of the categorical base distribution $G_0$:

$$H \sim \mathrm{DP}(\alpha_0, H_0), \qquad H = \sum_{k=1}^{\infty} \pi_k\, \delta(x - \phi_k), \tag{8.7}$$

where

$$\sum_{k=1}^{\infty} \pi_k = 1, \qquad \phi_k \in S(H_0). \tag{8.8}$$

The samples $H \sim \mathrm{DP}(\alpha_0, H_0)$ are still discrete, but instead of category labels $k \in \{1, 2, \ldots, K\}$, as is the case of the Dirichlet distribution, the point masses of the Dirichlet process lie somewhere on the support set $S(H_0)$. One of the nice things about the Dirichlet process is that we can aggregate its components and get a Dirichlet distribution. Specifically, any finite partition $S = \{S_1, S_2, \ldots, S_J\}$ of the support set of a Dirichlet process $\mathrm{DP}(\alpha_0, H_0)$ defines a Dirichlet distribution such that the sums of the weights $\pi_k$ in each of these $J$ subsets are distributed according to the Dirichlet distribution, parametrized by $\alpha_0$ and a base distribution whose weights are equal to the integrals of the base distribution $H_0$ over the subsets $S_1, S_2, \ldots, S_J$ (Frigyik et al., 2010). Let us denote

$$H(S_i) = \sum_{k;\ \phi_k \in S_i} \pi_k, \tag{8.9}$$

then

$$(H(S_1), H(S_2), \ldots, H(S_J)) \sim \mathrm{Dir}(\alpha_0 H_0(S_1), \alpha_0 H_0(S_2), \ldots, \alpha_0 H_0(S_J)). \tag{8.10}$$

8.2 chinese restaurant process and stick breaking process

An important and natural question about Dirichlet processes is how to generate samples from them. In this section, we will discuss two methods important for this work – the Chinese restaurant process and the Stick breaking process. The Chinese restaurant process (Aldous, 1985), sometimes called the Pólya Urn sequence, is a sampling scheme that uses a metaphor with a Chinese restaurant. Consider a restaurant with an infinite number of tables, each of them associated with a dish φk, and N customers choosing a table. The first customer sits at the first table. The ith customer sits at table k drawn from distribution

$$P(\theta_i = \phi_k \mid \theta_1, \theta_2, \ldots, \theta_{i-1}, \alpha_0) = \frac{m_k}{\alpha_0 + i - 1}, \qquad P(\theta_i = \phi_{K+1} \mid \theta_1, \theta_2, \ldots, \theta_{i-1}, \alpha_0) = \frac{\alpha_0}{\alpha_0 + i - 1} \tag{8.11}$$

where $K$ is the number of already occupied tables, $m_k$ is the number of customers sitting at table $k \in \{1, 2, \ldots, K\}$, $\alpha_0 > 0$ is the concentration parameter which controls how often a customer chooses a new table $K + 1$, and

$$\phi_j \sim H_0, \qquad j \in \{1, 2, \ldots, \infty\}. \tag{8.12}$$

The seating plan makes a partition of the customers, where each customer $i$ is associated with a parameter vector $\theta_i$ according to the table at which he sits. In other words, by seating the customers,

we generate samples from the Dirichlet process with concentration parameter α0 and the base distribution H0. The model is illustrated in figure 8.1. The circles along with parameter vectors φk represent tables, and variables θi, placed around tables, represent the dish choice of a customer i.

Figure 8.1: Chinese restaurant process.
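The seating dynamics in equation 8.11 can be illustrated with a short Python simulation (a toy sketch, not part of the thesis software; the dishes are ignored and only table counts are tracked):

```python
import random

def chinese_restaurant_process(n_customers, alpha0):
    """Simulate table assignments for n_customers under CRP(alpha0)."""
    tables = []                       # tables[k] = number of customers at table k
    seating = []                      # table index chosen by each customer
    for i in range(n_customers):
        # Existing table k is chosen with probability m_k / (alpha0 + i),
        # a new table with probability alpha0 / (alpha0 + i).
        weights = tables + [alpha0]
        u, k, acc = random.random() * sum(weights), 0, 0.0
        while acc + weights[k] <= u:
            acc += weights[k]
            k += 1
        if k == len(tables):
            tables.append(0)          # open a new table
        tables[k] += 1
        seating.append(k)
    return seating, tables

seating, tables = chinese_restaurant_process(20, alpha0=1.0)
print(tables)   # e.g. [11, 5, 3, 1]: a few large tables and a long tail
```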

The Pólya urn sampling scheme (Johnson and Kotz, 1977) for sampling from a Dirichlet process is a generalization of the Pólya urn for the Dirichlet distribution, which can also be described using a metaphor. Let an urn be filled with balls of unit size, each colored with one of K colors, where their quantities equal the weights in the scaled base distribution α0G0. For non-integer weights, we allow ball fractions. The sampling is realized by choosing a ball from the urn and observing its color. After that, the ball is returned back to the urn together with another unit ball of the same color. Then, the next sample can be taken. The Pólya urn sampling for the Dirichlet process is analogous, but, instead of a set of balls with K colors, the urn is filled with a liquid that has infinitely many different color mixtures, distributed according to the base distribution H0 and volume α0. The first sample in this generalized sampling scheme draws a volume of liquid and returns the liquid together with a unit ball of the same color back to the urn. In the subsequent turns, we either draw a ball or a sample of the liquid. A related method to the Chinese restaurant process and the Pólya urn is the stick-breaking construction. The stick breaking is an approach for generating a random vector with a Dir(α0) distribution. It involves iteratively breaking a stick of unit length into K pieces in such a way that the lengths of the pieces follow the Dirichlet distribution. When considering an unbounded variant of the stick-breaking process, the number of pieces K is equal to ∞.

The process is formally realized using independent draws from the beta distribution. Let the interval (0, 1) represent a unit-length stick. The process starts with drawing a value v1 from the Beta(a, b) distribution. Let π1 = v1 denote the first piece of the stick and 1 − π1 denotes the rest of the stick. The procedure continues recursively, producing new pieces π2 = v2(1 − π1), π3 = v3(1 − π1 − π2), etc. The length of any fragment πi of the stick is defined as

$$\pi_i = v_i \prod_{j=1}^{i-1} (1 - v_j). \tag{8.13}$$

Sethuraman (1994) showed that the resulting sequence satisfies

$$\sum_{i=1}^{\infty} \pi_i = 1. \tag{8.14}$$

An important special case of the stick-breaking process is constructed when we set $a = 1$. The one-parameter stochastic process for $a = 1$ and $\alpha_0 = b$ is known as the GEM distribution (Pitman, 2002), denoted

π ∼ GEM(α0).(8.15)
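A truncated stick-breaking draw from GEM(α0) can be sketched in Python (a toy illustration, under the assumption that truncating the infinite sequence at a finite number of pieces is an acceptable approximation):

```python
import random

def gem_weights(alpha0, truncation=100):
    """Draw stick-breaking weights pi ~ GEM(alpha0), truncated to `truncation` pieces."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = random.betavariate(1.0, alpha0)   # v_i ~ Beta(1, alpha0)
        weights.append(v * remaining)         # pi_i = v_i * prod_{j<i} (1 - v_j)
        remaining *= (1.0 - v)
    weights.append(remaining)                 # assign the leftover mass to the last piece
    return weights

pi = gem_weights(alpha0=1.0)
print(sum(pi))   # sums to 1.0 by construction
```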

The relation between the GEM distribution, the Chinese restaurant process, and the Dirichlet process is very close. Let z1, z2, ... , zN be a sequence of integer random variables sampled from the categorical distribution parametrized with π, i.e.

P(zj = i|π) = πi.(8.16)

The sampled values define a random partition on the integers 1, 2, ..., N such that two integers i, j belong to the same cluster if zi = zj. Pitman (2002) showed that this distribution on partitions is the same as the distribution on partitions induced by the Chinese restaurant process, where the GEM parameter α0 controls the partitioning in the same way as the parameter α0 in the Chinese restaurant process. Let {φi} be an infinite sequence of draws from a base distribution H0 defined on a support set S(H0)

φi ∈ S(H0),(8.17) then

$$H = \sum_{k=1}^{\infty} \pi_k\, \delta(x - \phi_k), \tag{8.18}$$

where $\delta$ is the Dirac function and $\pi \sim \mathrm{GEM}(\alpha_0)$, defines a Dirichlet process with the base distribution $H_0$ and the concentration parameter $\alpha_0$ (Sethuraman, 1994).

8.3 estimating the number of roles

The parameter estimators will be derived separately. Let us start with the number of roles, which is much easier to estimate than the number of frames. The LDA-Frames graphical model in figure 7.1 does not place any prior distribution on the role latent variables r. A prior is not essential there, because the variables are shared and their number is fixed. In the non-parametric model, however, it may be useful to place a prior on the role variables, which controls their values and allows new roles to be created if necessary.

Figure 8.2: Estimating the number of roles in LDA-Frames.

An appropriate prior with clustering behavior that can be used to model the creation of new roles is the Chinese restaurant process. Its incorporation into the LDA-Frames model is visualized in the graphical model in figure 8.2. The semantic roles r are there generated from the Chinese restaurant process with parameter γ, and the base distribution Dir(β), so that

$$\theta_r \sim \mathrm{Dir}(\beta). \tag{8.19}$$

Notice that the number of roles in the right-most plate changed from $R$ to $\infty$, which means that the number is theoretically unbounded. In deriving the equation expressing the conditional probabilities for sampling roles, we can start with the expression for the parametric LDA-Frames and extend it with another term, representing the Chinese restaurant process

$$P(r_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta, \gamma) \propto \mathrm{CRP}(\gamma) \times \prod_{v=1}^{V} \left( \frac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}}. \tag{8.20}$$

Now let

$$P(r^{old}_{f,s} \mid r_{-(f,s)}, f, w, \alpha, \beta, \gamma) \tag{8.21}$$

denote the conditional probability of sampling $r_{f,s} \in \{1, 2, \ldots, R\}$, where $R$ is the highest already sampled role identifier, and

$$P(r^{new}_{f,s} \mid r_{-(f,s)}, f, w, \alpha, \beta, \gamma) \tag{8.22}$$

denote the conditional probability of sampling $r_{f,s}$ with a new value. Using the expression from 8.11, the probability of sampling an existing role is given by

$$P(r^{old}_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta, \gamma) \propto \frac{rc^{-(f,s)}_{r_{f,s}}}{\gamma + \sum_{i=1}^{R} rc_i - 1} \prod_{v=1}^{V} \left( \frac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}}, \tag{8.23}$$

where $rc_i$ denotes the number of times role $i$ is used in any slot of any frame, and the expression $rc^{-(f,s)}_i$ is the same with the current position excluded. Since the denominator of the $\mathrm{CRP}(\gamma)$ term is constant for all values of $r^{old}_{f,s}$, we get

$$P(r^{old}_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta, \gamma) \propto rc^{-(f,s)}_{r_{f,s}} \prod_{v=1}^{V} \left( \frac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}}. \tag{8.24}$$

The probability of sampling a new role $r^{new}_{f,s}$ follows from 8.11 and 8.20:

$$P(r^{new}_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta, \gamma) \propto \frac{\gamma}{\gamma + \sum_{i=1}^{R} rc_i - 1} \prod_{v=1}^{V} \left( \frac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}}. \tag{8.25}$$

Again, we can remove the denominator, and moreover, we can also remove both occurrences of ‘wc’ counts, which are equal to 0. This gives us a simplified formula

$$P(r^{new}_{f,s} \mid f, r_{-(f,s)}, w, \alpha, \beta, \gamma) \propto \gamma \prod_{v=1}^{V} \frac{1}{V^{C_{f,s,v}}}. \tag{8.26}$$
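Putting equations 8.24 and 8.26 together, one step of the role sampler can be sketched in Python (an illustrative simplification; the count arrays `wc`, `wc_star` and `rc` are assumed to be maintained elsewhere with the current position already excluded, and long products would in practice be computed in log space):

```python
import numpy as np

def sample_role(wc, wc_star, rc, C_fs, beta, gamma):
    """Sample r_{f,s}. wc is (V, R), wc_star is (R,), rc is (R,), C_fs is (V,) counts of
    each word realized in slot s of frame f. Returns a role id in 0..R (R means a new role)."""
    V, R = wc.shape
    weights = np.empty(R + 1)
    for i in range(R):
        # existing role i: rc_i * prod_v ((wc_{v,i} + beta) / (wc_{*,i} + V*beta))^{C_{f,s,v}}
        lik = ((wc[:, i] + beta) / (wc_star[i] + V * beta)) ** C_fs
        weights[i] = rc[i] * lik.prod()
    # new role: gamma * prod_v (1 / V)^{C_{f,s,v}}
    weights[R] = gamma * (1.0 / V) ** C_fs.sum()
    weights /= weights.sum()
    return np.random.choice(R + 1, p=weights)
```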

8.4 estimating the number of frames

The estimation of the number of frames is more complex than the case of semantic roles. The idea is to use the Dirichlet process instead of the Dirichlet distribution as a prior for ϕu. The question, however, is how to make the sampled frames shared between different lexical units. If we simply replaced the Dirichlet distribution with the Dirichlet process, the model would create unique spikes in the parameter space for each lexical unit, and therefore, the lexical units would most likely have disjoint ϕ distributions, which would prevent them from sharing frames. The solution is to use the Hierarchical Dirichlet process (Teh et al., 2006).

8.4.1 Hierarchical Dirichlet Process

The idea behind the Hierarchical Dirichlet process is to create the base distribution of the Dirichlet process by sampling from another Dirichlet process. In such a hierarchy the samples are drawn from

Gj ∼ DP(α0, G0),(8.27)

where the global distribution G0 is distributed as

G0 ∼ DP(γ0, H),(8.28)

which is visualized in the graphical model in figure 8.3. The Hierarchical Dirichlet process is used to model the prior of a simple non-parametric mixture model with observed variables x, drawn from a discrete distribution θ. The model can naturally be extended to more than two levels, but it is not necessary for LDA-Frames. For the further derivations of the sampler for non-parametric LDA-Frames, it is convenient to describe how samples can be drawn from the Hierarchical Dirichlet process. We will start with the stick-breaking construction. Let the global distribution G0 be expressed as

$$G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta(x - \phi_k), \tag{8.29}$$

Figure 8.3: Hierarchical Dirichlet process.

where

$$\phi = (\phi_k)_{k=1}^{\infty} \sim H, \qquad \beta = (\beta_k)_{k=1}^{\infty} \sim \mathrm{GEM}(\gamma_0). \tag{8.30}$$

Then, since each $G_j$ has support on the same points as $G_0$, we can write

$$G_j = \sum_{k=1}^{\infty} \pi_{j,k}\, \delta(x - \phi_k). \tag{8.31}$$

Teh et al. (2006) showed that the weights $\pi_j$ can be computed as

$$\pi_{j,k} = v_{j,k} \prod_{l=1}^{k-1} (1 - v_{j,l}), \tag{8.32}$$

where

$$v_{j,k} \sim \mathrm{Beta}\left( \alpha_0 \beta_k,\ \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right). \tag{8.33}$$

A parallel with the Chinese restaurant process can be used to describe another method for constructing samples from the Hierarchical Dirichlet process. It is called the Chinese restaurant franchise. The metaphor extends the Chinese restaurant process to allow multiple franchise restaurants which share the same set of dishes. The story is as follows. We have restaurants with a shared menu. A customer comes to a restaurant, sits at the first table, and orders

a dish from the menu. The next customer in the same restaurant can either join the first customer at his table and order the same dish or choose another table and order a dish from the shared menu. Notice that multiple tables in multiple restaurants can serve the same dish. In this setup, factors $\theta_{j,i}$ correspond to the choice of customer $i$ in restaurant $j$, dishes in the shared menu correspond to $\{\phi_k\}_{k=1}^{\infty}$ drawn from $H$, and variables $\psi_{j,t}$ represent the table-specific choice of a dish at table $t$ in restaurant $j$. To be able to work with counts, we also introduce variables $m_{j,k}$, representing the number of tables serving dish $k$ in restaurant $j$, and $n_{j,t}$, representing the number of customers in restaurant $j$ sitting at table $t$, and their aggregate versions

$$m_{*,k} = \sum_j m_{j,k}, \qquad m_{j,*} = \sum_k m_{j,k}, \qquad m_{*,*} = \sum_j \sum_k m_{j,k}, \tag{8.34}$$

$$n_{*,t} = \sum_j n_{j,t}, \qquad n_{j,*} = \sum_t n_{j,t}, \qquad n_{*,*} = \sum_j \sum_t n_{j,t}. \tag{8.35}$$

Finally, the relationship between the random variables θ, φ, and ψ is stored in variables $t_{j,i}$ and $k_{j,t}$. In particular, $t_{j,i} \in \{1, 2, \ldots, m_{j,*}\}$ represents the identifier of the table in restaurant $j$ where customer $i$ sits, and $k_{j,t} \in \{1, 2, \ldots, K\}$ represents the menu item served at table $t$ in restaurant $j$. Now we can proceed to a formal description of the generative process. First consider the conditional distribution for $\theta_{j,i}$, where $G_j$ is integrated out

$$P(\theta_{j,i} = \psi_{j,t} \mid \theta_{j,1}, \theta_{j,2}, \ldots, \theta_{j,i-1}, \alpha_0) = \frac{n_{j,t}}{i + \alpha_0 - 1}, \tag{8.36}$$

$$P(\theta_{j,i} = \psi_{j,m_{j,*}+1} \mid \theta_{j,1}, \theta_{j,2}, \ldots, \theta_{j,i-1}, \alpha_0) = \frac{\alpha_0}{i + \alpha_0 - 1}. \tag{8.37}$$

If an occupied table is chosen, i.e. $\theta_{j,i} = \psi_{j,t}$, then $t_{j,i} = t$. If an empty table is chosen, we increment $m_{j,*}$ by one, set $\theta_{j,i} = \psi_{j,m_{j,*}}$, $t_{j,i} = m_{j,*}$, and sample $\psi_{j,m_{j,*}}$ according to

ψj,mj,∗ ∼ G0,(8.38)

where the draw from $G_0$, with $G_0$ integrated out, is realized by the top-level Chinese restaurant process:

$$P(\psi_{j,t} = \phi_k \mid \psi_{1,1}, \psi_{1,2}, \ldots, \psi_{j,t-1}, \gamma_0) = \frac{m_{*,k}}{m_{*,*} + \gamma_0 - 1}, \tag{8.39}$$

$$P(\psi_{j,t} = \phi_{K+1} \mid \psi_{1,1}, \psi_{1,2}, \ldots, \psi_{j,t-1}, \gamma_0) = \frac{\gamma_0}{m_{*,*} + \gamma_0 - 1}. \tag{8.40}$$

If an existing menu item is chosen, i.e. $\psi_{j,t} = \phi_k$, we set $k_{j,t} = k$ for the chosen $k$. If a new menu item is required, we increment $K$ by one, set $\psi_{j,t} = \phi_K$, $k_{j,t} = K$, and sample $\phi_K$ from the base distribution

φK ∼ H.(8.41)

The Chinese restaurant franchise concept is depicted in figure 8.4. The top rectangle represents a menu from which customers choose their dish. The bottom rectangle groups all restaurants that share the same menu.

Figure 8.4: Chinese restaurant franchise.

8.4.2 Fully Non-Parametric Model

Being equipped with basic knowledge of the Hierarchical Dirichlet process, we can now modify the LDA-Frames model to be able to estimate the number of frames. The point is that we will generate frame variables fu,t from a two-level hierarchy of Dirichlet processes simulated using the Chinese restaurant franchise. The spikes of frames shared among all lexical units (menu items in the Chinese restaurant franchise analogy) are generated from the Dirichlet process χ with concentration parameter δ. Its base distribution is supposed to generate new frames, and is realized by subsequently sampling roles for every slot of a new frame. The lexical-unit-specific Dirichlet processes ϕu use χ as the base distribution, and are governed by the concentration parameter α. The graphical model of the fully non-parametric model is visualized in figure 8.5.

Figure 8.5: Graphical model of non-parametric LDA-Frames.

8.4.3 Sampling Unbounded Number of Frames

It only remains to describe formally how to sample frames in the fully non-parametric model, and how to generate samples from the base distribution of χ. We will start with a straightforward Gibbs sampler, sampling variables for all levels of the hierarchy, based on the Chinese restaurant franchise. It will later be merged into a compact model, expressed in a form similar to the parametric version. According to the Chinese restaurant franchise, a frame identifier fj,i is equal to the menu item which is ordered by customer i in restaurant j,

$$f_{j,i} = k_{j,t_{j,i}}, \tag{8.42}$$

thus, we need to show how to sample $t$, $k$, and new frames for $f_{j,i} = K + 1$. Let $L(k)^{-(j,i)}$ denote the proportional likelihood of $f_{j,i} = k$, which from 7.43 is given by

$$L(k)^{-(j,i)} = \prod_{s=1}^{S} \frac{wc^{-(j,i,s)}_{w_{j,i,s},\, r_{k,s}} + \beta}{wc^{-(j,i,s)}_{*,\, r_{k,s}} + V\beta}, \tag{8.43}$$

and let $L(k^{new})^{-(j,i)}$ denote the proportional likelihood of a new frame, which can be derived from 8.43 by setting the word counts equal to 0:

$$L(k^{new})^{-(j,i)} = \prod_{s=1}^{S} \frac{0 + \beta}{0 + V\beta} = \frac{1}{V^S}. \tag{8.44}$$

The conditional distribution of tj,i = t, where t is a previously used table is then proportional to

$$P(t_{j,i} = t \mid t^{-(j,i)}) \propto n^{-(j,i)}_{j,t}\, L(k_{j,t})^{-(j,i)}, \tag{8.45}$$

and the case of an unoccupied table can be calculated by integrating out the possible values of $k_{j,m_{j,*}+1}$ as

$$P(t_{j,i} = m_{j,*} + 1 \mid t^{-(j,i)}, k) \propto \alpha \left( \sum_{k=1}^{K} \frac{m_{*,k}}{m_{*,*} + \delta - 1} L(k)^{-(j,i)} + \frac{\delta}{m_{*,*} + \delta - 1} L(k^{new})^{-(j,i)} \right). \tag{8.46}$$

If the value of $t_{j,i}$ is equal to $m_{j,*} + 1$, i.e. a new table is selected, we obtain a sample of $k_{j,m_{j,*}+1}$ by sampling from

$$P(k_{j,t} = k \mid t, k^{-(j,t)}) \propto m_{*,k}\, L(k)^{-(j,i)} \tag{8.47}$$

for previously used $k$, and

$$P(k_{j,t} = K + 1 \mid t, k^{-(j,t)}) \propto \delta\, L(k^{new})^{-(j,i)} \tag{8.48}$$

for a newly created menu item K + 1. Finally, if the table tj,i is already occupied, we must exclude all counts associated with this table. The samples from the conditional distribution over kj,t are then acquired by sampling from

$$P(k_{j,t} = k \mid t, k^{-(j,t)}) \propto m_{*,k}\, L(k)^{-(j,i,t)} \tag{8.49}$$

for previously used k, and

$$P(k_{j,t} = K + 1 \mid t, k^{-(j,t)}) \propto \delta\, L(k^{new})^{-(j,i,t)} \tag{8.50}$$

for a newly created menu item $K + 1$, where the likelihoods $L(k)^{-(j,i,t)}$ and $L(k^{new})^{-(j,i,t)}$ are defined the same way as 8.43 and 8.44, respectively, but excluding the counts for all customers sitting at table $t$. The last issue that must be clarified is how to create a new frame if a new menu item is selected. It corresponds to sampling from the base distribution χ. We can work on the assumption that the sampling expression for semantic roles is already known (equations 8.24 and 8.26), and design algorithm 8.1. The code describes a procedure for creating a new frame f at position u, t. For each slot of frame f, it samples a role from the discrete distribution defined at line 4. The terms describing the probabilities are adopted from equations 8.24 and 8.26 by restricting them to the words at the current position. If a new role is created, it is necessary to increment the global variable R, which stores the number of active roles.

Algorithm 8.1 Algorithm for sampling a new frame.
1: procedure CreateFrame(u, t, f)    //Create frame f at position (u, t)
2:     for s ∈ {1, 2, ..., S} do
3:         v ← w_{u,t,s}
4:         Choose role r_{f,s} from the discrete distribution given by
           $$r_{f,s} = i \sim \begin{cases} rc_i \dfrac{wc^{-(u,t)}_{v,i} + \beta}{wc^{-(u,t)}_{*,i} + V\beta} & \text{for } 1 \leqslant i \leqslant R \\[2ex] \gamma \dfrac{1}{V} & \text{for } i = R + 1 \end{cases}$$
5:         if r_{f,s} = R + 1 then
6:             R ← R + 1
7:         end if
8:     end for
9: end procedure

Since sampling the variables t and k separately makes the model more complex than necessary, in practice we will use a sampling method which estimates the values of $f_{j,i} = k_{j,t_{j,i}}$ directly. This can be achieved by grouping together the terms associated with sampling k and t from equations 8.43, 8.44, 8.45, 8.46, 8.47, 8.48 and 8.49, as shown in Teh et al. (2006):

$$P(f_{u,t} = i \mid f^{-(u,t)}, r, w, \alpha, \beta, \tau) \propto (fc^{-(u,t)}_{i,u} + \alpha\tau_i) \prod_{s=1}^{S} \frac{wc^{-(u,t,s)}_{w_{u,t,s},\, r_{i,s}} + \beta}{wc^{-(u,t,s)}_{*,\, r_{i,s}} + V\beta} \tag{8.51}$$

for $i \in \{1, 2, \ldots, F\}$, and

$$P(f_{u,t} = F + 1 \mid f^{-(u,t)}, r, w, \alpha, \beta, \tau) \propto \alpha\tau_0 \frac{1}{V^S} \tag{8.52}$$

for a new frame. The counts $m_{*,k}$ and $\delta$ are replaced with $\tau$, which is sampled as

(τ0, τ1, τ2, ... , τF) ∼ Dir(δ, m∗,1, m∗,2, ... , m∗,F).(8.53)

The procedure requires estimating the number of tables sharing the same frame, $m_{*,f}$. It can be obtained by summing the number of tables sharing the same frame in all lexical units:

$$m_{*,f} = \sum_{u=1}^{U} m_{u,f}. \tag{8.54}$$

Antoniak (1974) showed that the number of tables sharing the same frame in a particular lexical unit can be sampled from

$$P(m_{u,f} = l \mid f, m, \tau, \alpha) = \frac{\Gamma(\alpha\tau_f)}{\Gamma(\alpha\tau_f + fc_{f,u})}\, s(fc_{f,u}, l)\, (\alpha\tau_f)^l, \tag{8.55}$$

where $s(a, b)$ is the unsigned Stirling number of the first kind, defined as

$$\begin{aligned} s(0, 0) &= 1 \\ s(1, 1) &= 1 \\ s(a, 0) &= 0 \quad \text{for } a > 0 \\ s(a, b) &= 0 \quad \text{for } b > a \\ s(a + 1, b) &= s(a, b - 1) + a \times s(a, b) \quad \text{else.} \end{aligned} \tag{8.56}$$
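The recurrence can be turned into a small dynamic-programming table, which is handy when sampling $m_{u,f}$ (a minimal sketch; in practice the values are usually kept in log space because they grow quickly):

```python
def stirling_first_unsigned(n):
    """Table s[a][b] of unsigned Stirling numbers of the first kind for 0 <= a, b <= n."""
    s = [[0] * (n + 1) for _ in range(n + 1)]
    s[0][0] = 1
    for a in range(n):
        for b in range(n + 1):
            # s(a+1, b) = s(a, b-1) + a * s(a, b)
            s[a + 1][b] = (s[a][b - 1] if b > 0 else 0) + a * s[a][b]
    return s

table = stirling_first_unsigned(5)
print(table[4])   # [0, 6, 11, 6, 1]: coefficients of x(x+1)(x+2)(x+3)
```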

8.5 gibbs sampler for non-parametric lda-frames

At the end of this chapter, we put the pieces together and formulate a sampling algorithm in which both the number of roles and the number of frames are unbounded. A simplified sampling procedure is described using the pseudocode in algorithm 8.2. It begins with the initialization of all random variables and counts at line 1. Line 2 starts the main cycle, which repeats the code for as many iterations as required; their number is stored in ITERS.

Algorithm 8.2 Algorithm for sampling non-parametric LDA-Frames.
1: InitializeVariables(...)
2: for it ∈ {1, 2, ..., ITERS} do
3:     for u ∈ {1, 2, ..., U} do    //Sample frames
4:         for t ∈ {1, 2, ..., Tu} do
5:             Choose frame f_{u,t} from the discrete distribution
               $$f_{u,t} = i \sim \begin{cases} (fc^{-(u,t)}_{i,u} + \alpha\tau_i) \prod_{s=1}^{S} \dfrac{wc^{-(u,t,s)}_{w_{u,t,s},\, r_{i,s}} + \beta}{wc^{-(u,t,s)}_{*,\, r_{i,s}} + V\beta} & \text{for } 1 \leqslant i \leqslant F \\[2ex] \alpha\tau_0 \dfrac{1}{V^S} & \text{for } i = F + 1 \end{cases}$$
6:             if f_{u,t} = F + 1 then
7:                 F ← F + 1
8:                 CreateFrame(u, t, f_{u,t})
9:             end if
10:        end for
11:    end for
12:    for f ∈ {1, 2, ..., F} do    //Sample roles
13:        for s ∈ {1, 2, ..., S} do
14:            Choose role r_{f,s} from the discrete distribution
               $$r_{f,s} = i \sim \begin{cases} rc^{-(f,s)}_{r_{f,s}} \prod_{v=1}^{V} \left( \dfrac{wc^{-(f,s)}_{v,r_{f,s}} + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}} + V\beta} \right)^{C_{f,s,v}} & \text{for } 1 \leqslant i \leqslant R \\[2ex] \gamma \prod_{v=1}^{V} \dfrac{1}{V^{C_{f,s,v}}} & \text{for } i = R + 1 \end{cases}$$
15:            if r_{f,s} = R + 1 then
16:                R ← R + 1
17:            end if
18:        end for
19:    end for
20:    for u ∈ {1, 2, ..., U} do    //Sample m_{u,f}
21:        for f ∈ {1, 2, ..., F} do
22:            Choose the number of tables m_{u,f} from
               $$P(m_{u,f} = l \mid f, m, \tau, \alpha) = \frac{\Gamma(\alpha\tau_f)}{\Gamma(\alpha\tau_f + fc_{f,u})}\, s(fc_{f,u}, l)\, (\alpha\tau_f)^l$$
23:        end for
24:    end for
25:    Sample τ from
           $$(\tau_0, \tau_1, \tau_2, \ldots, \tau_F) \sim \mathrm{Dir}\left(\delta, \sum_u m_{u,1}, \sum_u m_{u,2}, \ldots, \sum_u m_{u,F}\right)$$
26: end for

The code at lines 3 up to 11 is responsible for sampling semantic frames. If a new frame is required, it is generated using algorithm 8.1. Semantic roles for each slot of all frames are sampled at lines 12 up to 19. Since the number of frames may change during the inference process, it is necessary to sample new values of the variables m_{u,f} at lines 20 up to 24. Consequently, once the new values of m_{u,f} are known, we sample new values of τ at line 25.

HYPERPARAMETER ESTIMATION 9

After introducing the non-parametric method for an automatic choice of the number of frames and roles in the previous chapter, what remains is to show how to estimate the hyperparameters α, β, γ and δ. The Dirichlet hyperparameters generally have a smoothing effect on the multinomial parameters. Reducing the smoothing effect by lowering the hyperparameters α, β in the LDA-Frames model below 1.0 makes the resulting distributions sparser. It means that the lower the hyperparameter α is, the fewer frames are used for the description of a particular lexical unit. Analogously, the lower the hyperparameter β is, the fewer words dominate in a particular semantic role. The effect of the concentration parameters in Dirichlet processes is similar. Hyperparameters α and δ in the non-parametric model control the prior probability of the number of frames. Hyperparameter γ has an effect on creating new roles. More specifically, the number of tables K in the Chinese restaurant representation of the Dirichlet process with concentration parameter α grows logarithmically with the number of customers n. Antoniak (1974) showed that the expected number of tables that are occupied after n customers have been seated is

$$E[K \mid \alpha, n] = \sum_{i=1}^{n} \frac{\alpha}{i - 1 + \alpha} \approx \alpha \log(n). \tag{9.1}$$

The relationship between hyperparameters and data is symmetric. On the one hand, if the model is used to synthesize data, we can control specific properties of the data by tuning the hyperparameters. On the other hand, if the model is supposed to fit unknown data, the hyperparameters should be adjusted to the character of the data. In the preliminary experiments with LDA-Frames, the hyperparameters have been heuristically set to α = β = 0.1 for the parametric version, and α = 5, β = 0.1, γ = 0.1, δ = 1 for the non-parametric version, with good performance. For more elaborate experiments, however, it is convenient to estimate the hyperparameters from data in order to minimize its perplexity. In the subsequent sections, we will show how to estimate α and β in parametric LDA-Frames as well as the hyperparameters in the non-parametric version.
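As a quick sanity check of equation 9.1, the exact sum and the logarithmic approximation can be compared numerically (a small illustrative snippet, not from the thesis):

```python
import math

def expected_tables(alpha, n):
    """Exact expected number of occupied tables after n customers, E[K | alpha, n]."""
    return sum(alpha / (i - 1 + alpha) for i in range(1, n + 1))

alpha, n = 1.0, 10000
print(expected_tables(alpha, n))    # ~9.79
print(alpha * math.log(n))          # ~9.21, the same order of growth
```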


9.1 hyperparameter estimation method for parametric lda-frames

Although several approaches to learning parameters of Dirichlet priors are known, no exact closed-form solution is available (Ronning, 1989; Wicker et al., 2008). Unfortunately, neither is there a conjugate prior for a straightforward Bayesian inference. Therefore, we use a simple and efficient iterative maximum likelihood approximation derived by Minka (2009). The proposed method makes use of the information already available in the collapsed Gibbs sampler, and transforms the problem into the parameter estimation problem for the Dirichlet-Multinomial distribution. The Dirichlet-Multinomial distribution is a compound distribution exploiting the conjugacy of the Dirichlet and multinomial distributions, where vector p is drawn from a Dirichlet distribution with parameter α, and then a sample of discrete outcomes x is drawn from the multinomial distribution with the probability vector p. Let us start with the estimation of parameter α, which governs the distribution of frames for lexical units. In general, α is an F-dimensional vector, where each dimension corresponds to one semantic frame. Because allowing distinct semantic frames to have different αf parameters may reduce the perplexity of the model, we will describe the general solution first. The solution for the simplified version where α1 = α2 = ··· = αF is straightforward and will be derived later. The corresponding Dirichlet-Multinomial distribution P(fu|α) is defined as

$$P(f_u \mid \alpha) = \int_p P(f_u \mid p)\, P(p \mid \alpha)\, dp = \frac{\Gamma(\sum_{f=1}^{F} \alpha_f)}{\prod_{f=1}^{F} \Gamma(\alpha_f)} \frac{\prod_{f=1}^{F} \Gamma(fc_{f,u} + \alpha_f)}{\Gamma(\sum_{f=1}^{F} (fc_{f,u} + \alpha_f))}. \tag{9.2}$$

The likelihood of the frames across all lexical units can now be expressed as

$$P(f \mid \alpha) = \prod_{u=1}^{U} P(f_u \mid \alpha) = \prod_{u=1}^{U} \left( \frac{\Gamma(\sum_{f=1}^{F} \alpha_f)}{\prod_{f=1}^{F} \Gamma(\alpha_f)} \frac{\prod_{f=1}^{F} \Gamma(fc_{f,u} + \alpha_f)}{\Gamma(\sum_{f=1}^{F} (fc_{f,u} + \alpha_f))} \right), \tag{9.3}$$

and the gradient of the log-likelihood as

$$g_f = \frac{d \log P(f \mid \alpha)}{d \alpha_f} = \sum_{u=1}^{U} \left[ \Psi\!\left(\sum_{f=1}^{F} \alpha_f\right) - \Psi\!\left(\sum_{f=1}^{F} fc_{f,u} + \alpha_f\right) + \Psi(fc_{f,u} + \alpha_f) - \Psi(\alpha_f) \right], \tag{9.4}$$

where Ψ is the digamma function, defined as the derivative of the logarithm of the gamma function

$$\Psi(x) = \frac{d \log \Gamma(x)}{d x}. \tag{9.5}$$

Since there is no analytical solution expressing the best estimate of $\alpha_f$, Minka (2009) proposed a fixed-point iteration technique. The idea is to guess an initial value of $\alpha_f$ and iteratively optimize it using a lower-bound estimate. This leads to an iteration step expressed as

$$\alpha_f^{new} \leftarrow \alpha_f \frac{\left[ \sum_{u=1}^{U} \Psi(fc_{f,u} + \alpha_f) \right] - U\Psi(\alpha_f)}{\sum_{u=1}^{U} \Psi\!\left(\sum_{f=1}^{F} (fc_{f,u} + \alpha_f)\right) - U\Psi\!\left(\sum_{f=1}^{F} \alpha_f\right)}. \tag{9.6}$$

For the LDA-Frames model with a symmetric parameter α, we can use the fact that the parameter is just the precision of the Dirichlet distribution ϕ divided by F (Minka, 2009), and come up with a symmetric version

$$\alpha^{new} \leftarrow \frac{\alpha \left[ \sum_{u=1}^{U} \sum_{f=1}^{F} \Psi(fc_{f,u} + \alpha) - UF\Psi(\alpha) \right]}{F \left[ \sum_{u=1}^{U} \Psi\!\left(\sum_{f=1}^{F} (fc_{f,u} + \alpha)\right) - U\Psi(F\alpha) \right]}. \tag{9.7}$$

The estimation technique for the hyperparameter β is analogous. We can express the Dirichlet-Multinomial distribution $P(w_r \mid \beta)$ as

$$P(w_r \mid \beta) = \int_p P(w_r \mid p)\, P(p \mid \beta)\, dp = \frac{\Gamma(\sum_{v=1}^{V} \beta_v)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \frac{\prod_{v=1}^{V} \Gamma(wc_{v,r} + \beta_v)}{\Gamma(\sum_{v=1}^{V} (wc_{v,r} + \beta_v))}, \tag{9.8}$$

the equation for the iterative update of the non-symmetric estimate of $\beta_v$ as

$$\beta_v^{new} \leftarrow \beta_v \frac{\left[ \sum_{r=1}^{R} \Psi(wc_{v,r} + \beta_v) \right] - R\Psi(\beta_v)}{\sum_{r=1}^{R} \Psi\!\left(\sum_{v=1}^{V} (wc_{v,r} + \beta_v)\right) - R\Psi\!\left(\sum_{v=1}^{V} \beta_v\right)}, \tag{9.9}$$

and finally the equation for the estimate of the symmetric version as

$$\beta^{new} \leftarrow \frac{\beta \left[ \sum_{r=1}^{R} \sum_{v=1}^{V} \Psi(wc_{v,r} + \beta) - RV\Psi(\beta) \right]}{V \left[ \sum_{r=1}^{R} \Psi\!\left(\sum_{v=1}^{V} (wc_{v,r} + \beta)\right) - R\Psi(V\beta) \right]}. \tag{9.10}$$

The hyperparameter estimators can easily be incorporated into the sampling algorithm. We initialize the hyperparameters with a heuristic estimate, e.g. α = β = 0.1, and update their values at each step of the Gibbs sampler by iterating the assignments from equations 9.6 and 9.9, or from 9.7 and 9.10, until convergence. According to my experience with the model, the convergence is reached after 20 iterations in real applications.
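The symmetric updates 9.7 and 9.10 are easy to implement; the sketch below uses SciPy's digamma (an illustrative snippet under the assumption that the count matrices `fc` of shape (F, U) and `wc` of shape (V, R) are available from the Gibbs sampler):

```python
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, fc):
    """One fixed-point step for the symmetric alpha; fc[f, u] = count of frame f for unit u."""
    F, U = fc.shape
    num = digamma(fc + alpha).sum() - U * F * digamma(alpha)
    den = F * (digamma(fc.sum(axis=0) + F * alpha).sum() - U * digamma(F * alpha))
    return alpha * num / den

def update_beta(beta, wc):
    """One fixed-point step for the symmetric beta; wc[v, r] = count of word v for role r."""
    V, R = wc.shape
    num = digamma(wc + beta).sum() - R * V * digamma(beta)
    den = V * (digamma(wc.sum(axis=0) + V * beta).sum() - R * digamma(V * beta))
    return beta * num / den

# Iterate the assignment until the value stabilizes.
alpha, fc = 0.1, np.random.randint(0, 50, size=(20, 100))
for _ in range(20):
    alpha = update_alpha(alpha, fc)
print(alpha)
```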

9.2 hyperparameter estimation method for non-parametric lda-frames

The non-parametric LDA-Frames model is governed by the hyperparameters α, β, γ and δ. While the hyperparameter β controls the prior distribution for a finite number of words in semantic roles and can be estimated using the same approach as in the case of the parametric model, the other hyperparameters require different methods. We start with the Chinese restaurant process, controlling the number of semantic roles, which is governed by a single concentration parameter. The hyperparameters for the hierarchical Dirichlet process controlling the number of frames will be derived later. The method for estimating γ is an adaptation of the procedure developed by Escobar and West (1995), which places a gamma prior on the values of γ, and incorporates it into the Gibbs sampling analysis. Let us begin by recalling the results of Antoniak (1974), who derived an equation expressing the prior distribution for the number of tables in the Chinese restaurant process

$$P(R \mid \gamma, n) = c_n(R)\, n!\, \gamma^R \frac{\Gamma(\gamma)}{\Gamma(\gamma + n)}, \tag{9.11}$$

where R stands for the number of roles (the number of tables in the Chinese restaurant analogy), $n = S \times F$ is the number of slots in all frames, corresponding to the number of customers in the restaurant, and $c_n(R) = P(R \mid \gamma = 1, n)$, which can be computed using formulas for Stirling numbers if required. Now we can omit the terms that do not depend on γ and use the Bayes rule to write

$$P(\gamma \mid R, n) \propto P(\gamma)\, P(R \mid \gamma, n) \propto P(\gamma)\, \gamma^R \frac{\Gamma(\gamma)}{\Gamma(\gamma + n)}. \tag{9.12}$$

For γ > 0, the fraction of gamma functions from equation 9.11 can be written as

$$\frac{\Gamma(\gamma)}{\Gamma(\gamma + n)} = \frac{(\gamma + n)\, \beta(\gamma + 1, n)}{\gamma\, \Gamma(n)}, \tag{9.13}$$

where $\beta(a, b)$ is the beta function defined as

$$\beta(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1}\, dx. \tag{9.14}$$

Hence

$$P(\gamma \mid R, n) \propto P(\gamma)\, \gamma^{R-1} (\gamma + n)\, \beta(\gamma + 1, n) \propto P(\gamma)\, \gamma^{R-1} (\gamma + n) \int_0^1 x^{\gamma} (1 - x)^{n-1}\, dx. \tag{9.15}$$

This implies that P(γ|R, n) can be viewed as the marginal distribution derived from a joint distribution of γ and a continuous quantity η so that

$$P(\gamma, \eta \mid R, n) \propto P(\gamma)\, \gamma^{R-1} (\gamma + n)\, \eta^{\gamma} (1 - \eta)^{n-1}. \tag{9.16}$$

Let γ be drawn from the gamma distribution prior with two hyperparameters, shape $a > 0$ and rate $b > 0$, such that

$$\gamma \sim \mathrm{Gamma}(a, b) = \frac{b^a x^{a-1} e^{-bx}}{\Gamma(a)}. \tag{9.17}$$

Hence, we have two conditional posterior distributions P(η|γ, n), P(γ|η, n, R, a, b) defined as

$$P(\eta \mid \gamma, n) \propto \eta^{\gamma} (1 - \eta)^{n-1} \tag{9.18}$$

and

$$P(\gamma \mid \eta, n, R, a, b) \propto \gamma^{a-1} e^{-b\gamma}\, \gamma^{R-1} (\gamma + n)\, \eta^{\gamma} \propto \gamma^{a+R-2} (\gamma + n)\, e^{-\gamma b + \gamma \log(\eta)} \propto \gamma^{a+R-1} e^{-\gamma(b - \log(\eta))} + n\, \gamma^{a+R-2} e^{-\gamma(b - \log(\eta))}. \tag{9.19}$$

Finally, the value of η can be drawn from the beta distribution

P(η|γ, n) ∼ Beta(γ + 1, n),(9.20) and the density P(γ|η, n, R, a, b) reduces to a mixture of two gamma distributions

$$P(\gamma \mid \eta, n, R, a, b) \sim \pi_\eta\, \mathrm{Gamma}(a + R, b - \log(\eta)) + (1 - \pi_\eta)\, \mathrm{Gamma}(a + R - 1, b - \log(\eta)) \tag{9.21}$$

with the weight $\pi_\eta$ defined as

$$\frac{\pi_\eta}{1 - \pi_\eta} = \frac{a + R - 1}{n(b - \log(\eta))} \quad \Rightarrow \quad \pi_\eta = \frac{a + R - 1}{a + R - 1 + n(b - \log(\eta))}. \tag{9.22}$$

Since direct sampling from a weighted sum of two distributions is not convenient, we will first sample

c ∼ Bern(πη),(9.23) and then sample γ based on the outcome of c:

$$P(\gamma \mid \eta, n, R, a, b) \sim \begin{cases} \mathrm{Gamma}(a + R, b - \log(\eta)) & \text{if } c = 1 \\ \mathrm{Gamma}(a + R - 1, b - \log(\eta)) & \text{else.} \end{cases} \tag{9.24}$$

The complete sampling scheme for the hyperparameter γ is described in algorithm 9.1. The algorithm performs successive Gibbs samples

Algorithm 9.1 Algorithm for sampling the γ hyperparameter.
1: procedure SampleGamma(n, R, a, b, ITERS)
2:     SUM ← 0
3:     for i ∈ {1, 2, ..., ITERS} do
4:         Sample η ∼ Beta(γ + 1, n)
5:         πη ← (a + R − 1) / (a + R − 1 + n(b − log(η)))
6:         Choose c ← Bern(πη)
7:         if c = 1 then
8:             SUM ← SUM + Gamma(a + R, b − log(η))
9:         else
10:            SUM ← SUM + Gamma(a + R − 1, b − log(η))
11:        end if
12:    end for
13:    γ ← SUM / ITERS
14: end procedure

and averages their outcomes. It requires two hyper-prior parameters. According to the experiments, a good choice is a = b = 1; however, their values have little effect on the resulting number of roles. The same approach as in the case of the γ hyperparameter can be used for sampling the δ hyperparameter, which serves as the concentration parameter of a single Dirichlet process as well. The sampling scheme for the α hyperparameter governing the second-level Dirichlet process, however, requires a slight modification. We will follow the method proposed by Teh et al. (2006), which is a straightforward extension of the single Dirichlet process procedure. Again, let J be the number of Chinese restaurants, with $n_{j,*}$ customers and $m_{j,*}$ tables in the $j$th restaurant. The concentration parameter α governs the number of tables in the restaurants in the following way:

$$P(m_{1,*}, m_{2,*}, \ldots, m_{J,*} \mid \alpha, n_{1,*}, n_{2,*}, \ldots, n_{J,*}) = \prod_{j=1}^{J} s(n_{j,*}, m_{j,*})\, \alpha^{m_{j,*}} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n_{j,*})}. \tag{9.25}$$

We assume that the prior for α is again the gamma distribution with hyper-prior parameters a, b. Moreover, we define auxiliary variables $w = (w_1, w_2, \ldots, w_J)$ and $s = (s_1, s_2, \ldots, s_J)$. For each $j$ we can write

$$\frac{\Gamma(\alpha)}{\Gamma(\alpha + n_{j,*})} = \frac{1}{\Gamma(n_{j,*})} \int_0^1 w_j^{\alpha} (1 - w_j)^{n_{j,*}-1} \left( 1 + \frac{n_{j,*}}{\alpha} \right) dw_j, \tag{9.26}$$

and hence the joint distribution of α, w and s is

$$P(\alpha, w, s \mid m, n) \propto \alpha^{a-1+m_{*,*}}\, e^{-\alpha b} \prod_{j=1}^{J} w_j^{\alpha} (1 - w_j)^{n_{j,*}-1} \left( \frac{n_{j,*}}{\alpha} \right)^{s_j}. \tag{9.27}$$

We can now normalize the joint distribution to get the conditional distribution P(α|w, s, m, n)

$$P(\alpha \mid w, s, m, n) \propto \alpha^{a-1+m_{*,*} - \sum_{j=1}^{J} s_j}\, e^{-\alpha\left(b - \sum_{j=1}^{J} \log(w_j)\right)}, \tag{9.28}$$

which can be expressed as a gamma distribution

$$P(\alpha \mid w, s, m, n) \propto \mathrm{Gamma}\!\left(a + m_{*,*} - \sum_{j=1}^{J} s_j,\ b - \sum_{j=1}^{J} \log(w_j)\right). \tag{9.29}$$

Given α, the w and s variables are conditionally independent, and we can write

$$P(w_j \mid \alpha, n) \propto w_j^{\alpha} (1 - w_j)^{n_{j,*}-1} \propto \mathrm{Beta}(\alpha + 1, n_{j,*}), \tag{9.30}$$

and

$$P(s_j \mid \alpha, n) \propto \left( \frac{n_{j,*}}{\alpha} \right)^{s_j} \quad \Rightarrow \quad s_j \sim \mathrm{Bern}\!\left( \frac{n_{j,*}}{n_{j,*} + \alpha} \right). \tag{9.31}$$

The complete sampling procedure for the hyperparameter α is described in algorithm 9.2.

Algorithm 9.2 Algorithm for sampling the α hyperparameter.

1: procedure SampleAlpha(U, T1, ..., TU, m∗,∗, a, b, ITERS)
2:     SUM ← 0
3:     for i ∈ {1, 2, ..., ITERS} do
4:         w ← 0
5:         s ← 0
6:         for u ∈ {1, 2, ..., U} do
7:             w ← w + log(Beta(α + 1, Tu))
8:             s ← s + Bern(Tu / (Tu + α))
9:         end for
10:        SUM ← SUM + Gamma(a + m∗,∗ − s, b − w)
11:    end for
12:    α ← SUM / ITERS
13: end procedure
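Both auxiliary-variable schemes are short when written out in NumPy. The sketch below follows the derivations above (equations 9.21–9.24 and 9.29–9.31); it is an illustrative implementation, not the thesis code, and the hyper-prior values a = b = 1 are the heuristic defaults mentioned in the text:

```python
import numpy as np

def sample_gamma_concentration(n, R, a=1.0, b=1.0, gamma=1.0, iters=20):
    """Resample the CRP concentration gamma given n slot assignments and R roles."""
    total = 0.0
    for _ in range(iters):
        eta = np.random.beta(gamma + 1.0, n)
        pi = (a + R - 1.0) / (a + R - 1.0 + n * (b - np.log(eta)))
        shape = a + R if np.random.rand() < pi else a + R - 1.0
        gamma = np.random.gamma(shape, 1.0 / (b - np.log(eta)))   # scale = 1 / rate
        total += gamma
    return total / iters

def sample_alpha_concentration(T, m_total, a=1.0, b=1.0, alpha=1.0, iters=20):
    """Resample the HDP second-level concentration alpha; T is an array of data sizes per lexical unit."""
    total = 0.0
    for _ in range(iters):
        w = np.log(np.random.beta(alpha + 1.0, T)).sum()
        s = np.random.binomial(1, T / (T + alpha)).sum()
        alpha = np.random.gamma(a + m_total - s, 1.0 / (b - w))
        total += alpha
    return total / iters
```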

DISTRIBUTED INFERENCE ALGORITHMS 10

With the increasing availability of big data, we need efficient methods to process it. Many promising and sophisticated algorithms in artificial intelligence and natural language processing are beaten by simple algorithms just because the simple algorithms are usually very efficient and can be run on much larger data sets in reasonable time. Problems that seemed to be solved years ago reappear as new challenges when the amount of input data rises by orders of magnitude. Fortunately, as the amount of data grows, the availability of computational resources grows as well. The solution, however, is not straightforward. While the overall amount of Central Processing Unit (CPU) power for a unit price increases, the maximum clock rate of a single CPU based on current technologies reaches limits imposed by physical laws. This opens up the problem of redistributing the computation among several processors, and then merging the results into one solution. This chapter is devoted to studying the problem of distributing the inference algorithm for LDA-Frames and non-parametric LDA-Frames among a multitude of processors on a single computer. Unlike distribution over a cluster of different computers, a parallel algorithm on a single computer can make use of a shared memory quickly accessible by any CPU. This makes the implementation easier; nevertheless, the described principles are applicable to clusters of computers as well.

10.1 parallel lda-frames

The easiest and the most straightforward way to distribute the inference process is to parallelize the non-collapsed Gibbs sampler. Since the samples of fu,t given r, w, ϕ, θ, as well as the samples of rf,s given f, w, θ are conditionally independent, they can be sampled concurrently. Specifically, we would concurrently sample

$$\begin{aligned} f_{u,t} &\mid r, w, \varphi, \theta \\ r_{f,s} &\mid f, w, \theta \\ \varphi_u &\mid f \\ \theta_r &\mid r, w, f. \end{aligned} \tag{10.1}$$

The problem, however, is that the non-collapsed samplers converge very slowly. In general, the more variables are collapsed, the faster convergence of the inference process can be expected (Liu et al.,


1994; Newman et al., 2009). The slow convergence motivates us to distribute the fully collapsed sampler, which has the advantage of efficiency. A difficulty of distributing the fully collapsed sampler, however, lies in the fact that neither the samples of frames nor the samples of roles are conditionally independent. This results in an approximate solution. Fortunately, as will be shown in chapter 11, under some conditions the approximation has little effect on the quality of the resulting model. We begin by dividing the inference problem into two parts – the inference of the latent variables $f_{u,t}$ and the inference of the variables $r_{f,s}$. The simplest partitioning of the $f$ variables is to randomly distribute the $U$ lexical units over $P$ processors, with approximately $U/P$ lexical units per processor, such that

$$f = \{f_{1,1} \ldots f_{l,T_l}\}_1 \cup \{f_{l+1,1} \ldots f_{m,T_m}\}_2 \cup \cdots \cup \{f_{n,1} \ldots f_{U,T_U}\}_P = pu_1 \cup pu_2 \cup \cdots \cup pu_P \tag{10.2}$$

Analogously, the set of latent variables r is distributed based on the frame identifiers, i.e.

$$r = \{r_{1,1} \ldots r_{l,S}\}_1 \cup \{r_{l+1,1} \ldots r_{m,S}\}_2 \cup \cdots \cup \{r_{n,1} \ldots r_{F,S}\}_P = pf_1 \cup pf_2 \cup \cdots \cup pf_P. \tag{10.3}$$

We can now distinguish two approaches to parallelized sampling:

• synchronous, where the processors have independent memory spaces and the synchronization is done at the end of each iteration;

• asynchronous, where the processors share the memory and alter the counts simultaneously.

In order to be able to describe the synchronous approach, let us introduce processor-specific count variables wcv,x(p) and wc∗,x(p) which are defined the same way as 7.21 and 7.22, respectively, but store only a local copy of data for processor p. The parallelized procedure is described in algorithm 10.1. The algorithm starts with sampling all frames in a distributed manner, synchronizes all counts, and continues with sampling roles. The cycles at lines 3 up to 10 and 12 up to 18 represent a parallelized execution of the code. Notice that at the beginning of each parallel section, the global counts wc are copied to the local counts wc(p), which remain private until the end of the parallel blocks. As soon as all blocks are finished, the local counts are synchronized into the global counts wc.

Algorithm 10.1 Algorithm for distributed LDA-Frames.
1: InitializeVariables(...)
2: for it ∈ {1, 2, ..., ITERS} do
3:     for p ∈ P do in parallel
4:         Copy global counts wc(p) ← wc
5:         for u ∈ pu_p do    //Sample frames
6:             for t ∈ {1, 2, ..., Tu} do
7:                 Choose frame f_{u,t} from the discrete distribution
                   $$f_{u,t} = i \sim (fc^{-(u,t)}_{i,u} + \alpha) \prod_{s=1}^{S} \frac{wc^{-(u,t,s)}_{w_{u,t,s},\, r_{i,s}}(p) + \beta}{wc^{-(u,t,s)}_{*,\, r_{i,s}}(p) + V\beta}$$
8:             end for
9:         end for
10:    end for
11:    Synchronize global counts wc ← wc + Σ_{p=1}^{P} (wc(p) − wc)
12:    for p ∈ P do in parallel
13:        Copy global counts wc(p) ← wc
14:        for f ∈ pf_p do    //Sample roles
15:            for s ∈ {1, 2, ..., S} do
16:                Choose role r_{f,s} from the discrete distribution
                   $$r_{f,s} = i \sim \prod_{v=1}^{V} \left( \frac{wc^{-(f,s)}_{v,r_{f,s}}(p) + \beta}{wc^{-(f,s)}_{*,\, r_{f,s}}(p) + V\beta} \right)^{C_{f,s,v}}$$
17:            end for
18:        end for
19:    end for
20:    Synchronize global counts wc ← wc + Σ_{p=1}^{P} (wc(p) − wc)
21: end for

Because Gibbs sampling is a strictly sequential process, the updates should not, in order to sample properly from the posterior distribution, be performed concurrently with the counts synchronized only after each iteration. However, because the number of lexical units and semantic roles is typically large in comparison to the number of processors, the dependence between the latent variables f and r is weak and the approximation is almost equal to the exact algorithm. Notice that the synchronous algorithm can also simply be used for distributed computation on a cluster of different computers. It only requires synchronization of the counts between all nodes after each iteration.
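The synchronization step at lines 11 and 20 only has to add up the per-processor count deltas; a minimal NumPy sketch of that merge (illustrative only, with hypothetical worker results) could look as follows:

```python
import numpy as np

def synchronize_counts(wc_global, wc_locals):
    """Merge local copies back into the global counts: wc <- wc + sum_p (wc(p) - wc)."""
    delta = sum(wc_local - wc_global for wc_local in wc_locals)
    return wc_global + delta

# Hypothetical example: V = 4 words, R = 3 roles, two workers.
wc = np.zeros((4, 3), dtype=int)
worker_results = [wc.copy(), wc.copy()]
worker_results[0][1, 2] += 5       # counts changed independently by each worker
worker_results[1][0, 0] += 3
print(synchronize_counts(wc, worker_results))
```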

Asynchronous distributed computation of LDA-Frames has an even simpler implementation than the synchronous algorithm. It uses a shared memory for storing all counts and updates them simultaneously; the updates only need to be properly guarded by mutual exclusion. Despite the fact that the mutual exclusions can block the execution of the code and cause a longer running time of the algorithm, the immediate update of counts results in a better approximation and is preferred. Unfortunately, the asynchronous approach is hardly applicable in distributed computation on a cluster of different computers, because synchronizing all computers after every change of a count variable is too expensive.

10.2 parallel non-parametric lda-frames

The design of a parallel algorithm for the non-parametric LDA-Frames is quite complicated. Unlike the parametric case, which uses a fixed set of frames and roles, individual processors in the non-parametric case may instantiate both new frames and new roles during each iteration. This leads to several complications that need to be settled:

• merging the roles – because letting the newly created roles from each processor be distinct would lead to unnecessary growth in the total number of roles, it is useful to merge them during the synchronization phase to produce a more compact model. The question, however, is how to link the roles from different processors. A simple way is to merge new roles based on their integer identifiers. For example, suppose that three processors produced four, three and six new roles, respectively. Then, during the synchronization, the roles would be aligned by their label, which would result in six global roles. Although this solution is computationally simple and efficient, it seems to be very suboptimal. To improve the merging process, Newman et al. (2009), in their work on distributed algorithms for topic models, proposed aligning the topics of the Hierarchical Dirichlet process using a similarity function such as the Kullback-Leibler divergence. Finding the optimal match, though, is NP-hard for cases where the number of processors is greater than two (Burkard and Çela, 1999).

• merging the frames – sampling frames on multiple processors faces the same problem as in the case of roles. The alignment, however, should not be based on frame identifiers. Rather, it requires exact matching of their semantic roles. When two processors produce identical frames, it is necessary to merge them in order to avoid global identical frames with distinct frame identifiers.

• asynchronous algorithm – while the asynchronous parallel algorithm for the parametric LDA-Frames is very straightforward and efficient, the design of an efficient asynchronous algorithm for the non-parametric LDA-Frames is almost impossible. Because the shared set of frames and roles needs to be iterated as well as updated on each processor simultaneously, the synchronization and the mutual exclusion mechanism would be so complex that the benefit of the parallelization would be very small.

Since the parallelization of the non-parametric LDA-Frames brings many complications that require designing a new algorithm from scratch, its practical implementation is left for future work.

EVALUATION 11

A key part of the development of all models, regardless of their application, is an evaluation. The evaluation is a systematic determination of the model's merit using a set of criteria. A natural way to evaluate probabilistic models is either estimating the probability of unseen held-out data given a trained model, or measuring the performance on some secondary task. In this chapter, we will first focus on evaluating the models using perplexity (which is strongly related to probability), and using their reconstruction abilities. Several applications of LDA-Frames along with their evaluation will be discussed in the subsequent sections.

11.1 data preparation

Although the LDA-Frames method for semantic frame induction is language independent in general, we will mainly focus on English, which is among the most studied of the world's languages. It enables us to compare the LDA-Frames against other English lexical resources. Although there are many large web-based text corpora of English currently available on-line for free1, we will start with a much smaller, but very well studied and described corpus known as the British National Corpus (BNC), XML edition released in 2007. It is a 100 million word collection of texts of both written and spoken British English, mainly from the end of the 20th century. The size is large enough for training meaningful semantic frames, and small enough, at the same time, for efficient and quick experiments. An important property of the corpus is its thematic balance, covering many different styles and topics. The most challenging part of the training data acquisition process is the syntactic parsing phase. Particularly for English, we use the dependency parses provided by the Stanford Parser (Klein and Manning, 2003). The parser is written in the Java programming language, and implements a lexicalized Probabilistic Context-Free Grammar (PCFG) parser. The dependency parses are automatically extracted from phrase structure parses with a reported accuracy of 80.3 % using the method described in de Marneffe et al. (2006).

1 For example the Common Crawl corpus, released by the Common Crawl Foundation, http://commoncrawl.org/.


While the phrase structure parse represents nesting of multi word constituents, the dependency parse represents dependencies between individual words. The typed dependency parsing, which is required for acquiring the training data, additionally labels dependencies with grammatical relations, such as subject, direct object, indirect object, modifier, etc. An example of the dependency parse for the sentence “I saw the man who loves you.” provided by the system is shown in figure 11.1.

Figure 11.1: Example of the typed dependency parse for the sentence "I saw the man who loves you".

The current Stanford parser supports approximately 50 grammatical relations (the actual number depends on some options). The dependencies are binary relations between a governor, also known as head, and a dependent. The complete list of supported grammatical relations can be found in the Stanford dependency manual (de Marneffe and Manning, 2008). The most important relations useful for generating the training data are defined and exemplified below in alphabetical order. Bolded words in example sentences with subscripts ‘H’ and ‘D’ denote the head and the dependent, respectively:

• advmod: adverbial modifier An adverbial modifier of a word is a (non-clausal) adverb or adverbial phrase (ADVP) that serves to modify the meaning of the word.

(11.1) GeneticallyD modifiedH food.

• agent: agent An agent is the complement of a passive verb which is introduced by the preposition ‘by’ and does the action. It does not appear in basic dependencies output.

(11.2) The man has been killedH by the policeD.

• amod: adjectival modifier An adjectival modifier of an NP is any adjectival phrase that serves to modify the meaning of the NP.

(11.3) Sam eats redD meatH.

• dobj: direct object A direct object of a VP is the noun phrase which is the (accusative) object of the verb.

(11.4) They winH the lotteryD.

• iobj: indirect object An indirect object of a VP is the noun phrase which is the (dative) object of the verb.

(11.5) She gaveH meD a raise.

• nsubj: nominal subject A nominal subject is a noun phrase which is the syntactic subject of a clause. The governor of this relation might not always be a verb: when the verb is a copular verb, the root of the clause is the complement of the copular verb, which can be an adjective or noun.

(11.6) ClintonD defeatedH Dole.

• nsubjpass: passive nominal subject A passive nominal subject is a noun phrase which is the syntactic subject of a passive clause.

(11.7) DoleD was defeatedH by Clinton.
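To make the extraction step below concrete, the following sketch shows how subject and object training triples could be collected from such typed dependencies. It is only an illustration: the input representation (one list of (relation, governor, dependent) triples per sentence) and the function name are hypothetical and do not correspond to the actual preprocessing scripts.

from collections import defaultdict

def extract_triples(sentences):
    """Collect (verb, subject, object) triples from typed dependency parses.

    Each sentence is assumed to be a list of (relation, governor, dependent)
    triples, e.g. ("nsubj", "saw", "I"), as produced by a dependency parser.
    """
    triples = []
    for deps in sentences:
        subjects = defaultdict(list)
        objects = defaultdict(list)
        for relation, governor, dependent in deps:
            if relation == "nsubj":
                subjects[governor].append(dependent)
            elif relation == "dobj":
                objects[governor].append(dependent)
        # keep only predicates that have both a subject and a direct object
        for verb in subjects:
            for subj in subjects[verb]:
                for obj in objects.get(verb, []):
                    triples.append((verb, subj, obj))
    return triples

# toy input corresponding to "I saw the man who loves you."
deps = [("nsubj", "saw", "I"), ("dobj", "saw", "man"),
        ("det", "man", "the"), ("rcmod", "man", "loves"),
        ("nsubj", "loves", "who"), ("dobj", "loves", "you")]
print(extract_triples([deps]))  # [('saw', 'I', 'man'), ('loves', 'who', 'you')]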

Although the LDA-Frames algorithm is designed to work with predicates of any part of speech, we will mainly focus on verbs, which are considered to be the most important constituents of sentences in natural languages, and are described in many valency lexicons that can be used for comparison purposes. For the testing and evaluation of the algorithm, we will use three data sets, named BNC-big, BNC-small, and synthetic. To create the BNC-big data set, we extracted all occurrences of transitive verbs from BNC where both the nsubj and dobj relations were present. This resulted in a list of triples, where the first component of the triple represents a verb predicate, the second component represents a realization of the subject, and the last component represents a direct object. The positions where the subject, direct object or both were missing were skipped. The list of transitive verbs consists of 4576 predicates, and was obtained from the Corpus Pattern Analysis project (Hanks and Pustejovsky, 2005). In order to

remove the noise and misspellings, we further filtered out all tuples where the subject or direct object realization occurred fewer than three times in the corpus. This procedure resulted in a list of 1,447,984 data tuples. Because some of the further experiments require a fast computation, we have created a subset of BNC-big called BNC-small, which consists of 240 predicates in 285,598 data tuples. The tuples have been selected so that the predicate frequency had to be greater than 1000 and smaller than 8000, and the frequency of role realizations had to be greater than 500. The upper limit for the predicate frequency is used to filter out too general and frequent verbs that usually do not evoke any remarkable frames. Both the BNC-big and BNC-small data sets are further split into a training part and a testing part, where the training part accounts for 90 % of the tuples and the test part for 10 % of the tuples. The data set called synthetic is created artificially from a known set of semantic frames and roles in order for us to be able to evaluate whether the algorithm is able to reconstruct them. The core of the data is built from a set of four semantic roles, each of them with several lexical realizations. The set of roles is given by table 11.1. Notice that some of the realizations (for instance ‘fish’ or ‘jenny’) are assigned to more than one semantic role. This makes the semantic roles more realistic and the realization–role assignment ambiguous.

Semantic role    Realizations
Animal           dog, cat, mouse, fish, chicken, jenny, rabbit
Food             cake, fish, lunch, chicken, dinner, meat, bread, rabbit
Institution      school, state, university, police, company
Person           people, man, woman, john, jenny

Table 11.1: Semantic roles along with their lexical realizations.

The set of lexical units consists of sixteen verbs, each of them connected with several semantic frames corresponding to the subject and object grammatical relations. These frames are only composed of the roles defined in table 11.1, and therefore have no ambition to be complete (in the sense that there are no more frames for a lexical unit). The selected lexical units with the corresponding frames are listed in table 11.2. It is worth mentioning that two of the frames have the sign ‘-’ instead of a semantic role. This stands for an empty slot without an obligatory semantic role at the corresponding grammatical relation position. Such slots are excluded from the inference process.

Lexical unit    Semantic frames
eat             (Person, Food), (Animal, Food)
cook            (Person, Food)
chew            (Animal, Food), (Person, Food)
bite            (Animal, Food), (Animal, -)
paint           (Person, Person), (Person, -)
teach           (Person, Person), (Institution, Person), (Person, Animal)
love            (Person, Person)
hire            (Institution, Person)
pay             (Institution, Person)
produce         (Institution, Food)
buy             (Institution, Food), (Person, Food), (Person, Animal)
feed            (Person, Animal), (Person, Person)
train           (Person, Animal)
look_after      (Person, Animal), (Person, Person)
work_for        (Person, Institution)
sue             (Person, Institution), (Institution, Institution)

Table 11.2: Lexical units and their semantic frames.

The data set of frame realization tuples has been generated using the semantic roles and frames from tables 11.1 and 11.2 according to the following procedure. For every lexical unit u:

• Choose a number of corpus realizations Nu ∈ {5, ... , 25} from the uniform distribution.

• For each realization nu ∈ {1, ... , Nu}, choose a semantic frame fnu from the uniform distribution over all permitted frames for lexical unit u.

• For each frame fnu, generate realizations for all its roles according to table 11.1 using the uniform distribution.

The procedure produced a data set consisting of 160 subject–object realization tuples. An important quality of this data set is that we know which frame is responsible for generating a particular realization. However, this information is not present in the generated data.
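The whole generation procedure is simple enough to be sketched in a few lines of Python. The snippet below is a minimal illustration only (it is not the data-generation script shipped with the thesis); the role and frame tables are abbreviated and the sign ‘-’ is mapped to an empty realization.

import random

ROLES = {
    "Animal": ["dog", "cat", "mouse", "fish", "chicken", "jenny", "rabbit"],
    "Food": ["cake", "fish", "lunch", "chicken", "dinner", "meat", "bread", "rabbit"],
    "Institution": ["school", "state", "university", "police", "company"],
    "Person": ["people", "man", "woman", "john", "jenny"],
}

FRAMES = {                      # abbreviated version of table 11.2
    "eat": [("Person", "Food"), ("Animal", "Food")],
    "cook": [("Person", "Food")],
    "bite": [("Animal", "Food"), ("Animal", "-")],
    "love": [("Person", "Person")],
}

def generate(lexical_units, seed=0):
    random.seed(seed)
    data = []
    for unit, frames in lexical_units.items():
        n_u = random.randint(5, 25)           # number of corpus realizations N_u
        for _ in range(n_u):
            frame = random.choice(frames)     # uniform choice among permitted frames
            realization = tuple(random.choice(ROLES[role]) if role != "-" else None
                                for role in frame)
            data.append((unit, realization))
    return data

for unit, realization in generate(FRAMES)[:5]:
    print(unit, realization)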

To conclude, the data sets used for the evaluation purposes in this chapter are summarized in table 11.3.

Data set           Size         Lexical units    Vocabulary size
BNC-big            1,447,984    4,053            28,990
BNC-big-train      1,303,186    4,053            28,911
BNC-big-test       144,798      4,053            18,870
BNC-small          285,599      240              867
BNC-small-train    257,040      240              867
BNC-small-test     28,559       240              867
synthetic          160          16               14

Table 11.3: Evaluation data sets.

The column named “size” represents the number of all tuples in the data, the column “lexical units” represents the number of unique predicates, and the last column “vocabulary size” summarizes the number of unique semantic role realizations regardless of the slot a realization is associated with.

11.2 implementation

The implementation of LDA-Frames and the related models can be divided into two parts. One part is responsible for the data preparation and is implemented in Python, the second part implements the sampling algorithms described in chapters 7, 8, 9, 10 and is written in the C++ programming language. For the syntactic parsing of BNC, a third-party implementation of the Stanford parser written in Java2 has been used. The directory tree structure of the code, as provided on the attached CD as well as in the Google code repository3, is depicted in figure 11.2. The directory 3rdparty contains binaries of the Stanford CoreNLP package, version 1.3.44, implementing the Stanford parser along with other important tools for processing English texts. The third-party binaries are not provided in the Google code repository. The data for evaluation is provided in directory data. Scripts for manipulating and accessing the generated semantic frames are stored in libldaf. All tools intended for preprocessing data are included in preprocessing. It also contains a wrapper for the Stanford parser written in Python. The core of the project is located in the sampler folder. It implements the LDA-Frames and Non-parametric samplers written in C++. Finally, a script for the evaluation of frames created from the synthetic data, and a script for generating random data, are placed in the utils directory.

2 http://nlp.stanford.edu/software/lex-parser.shtml
3 http://code.google.com/p/lda-frames/
4 http://nlp.stanford.edu/software/corenlp.shtml

root
    3rdparty
        stanford-corenlp
    data
        BNC-big
        BNC-small
        synthetic
    libldaf
    preprocessing
        stanford-parser
    sampler
    utils

Figure 11.2: Directory tree structure.

The input data for the sampler must have the form of a text file where each line corresponds to one lexical unit. Each line consists of tabulator-separated corpus realizations, where each realization is composed of space-separated positive integers that represent word identifiers for all slots. The number 0 is reserved for an empty slot. An example of the input file for 3 lexical units with two-slot realizations is shown below.

1 2    3 4    2 3
4 2    1 3    3 2
3 2    4 4    3 4    4 0

In order to be able to interpret the results, a dictionary that translates line numbers to lexical units and word identifiers to words is required. The sampler input file along with the dictionary can be generated using the generate_samplerinput.py script stored in the preprocessing folder. As its input, it takes a text file where each line consists of a lexical unit and tabulator-separated realizations from the corpus. The first line must start with ’;’ and must consist of tabulator-separated names of the used grammatical relations. For example:

;SUBJECT    OBJECT
eat    people    cake
eat    man    bread
cook    woman    cake
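The conversion from such a word-form file to the integer format expected by the sampler can be sketched as follows. The snippet is a simplified illustration of the idea, not the code of generate_samplerinput.py; the function name and data layout are hypothetical.

def build_sampler_input(realizations):
    """Map word forms to positive integer identifiers; 0 is kept for empty slots.

    `realizations` maps a lexical unit to a list of realization tuples, where a
    missing slot is represented by None.
    """
    word_ids = {}                 # word -> identifier (1-based)
    lines, units = [], []
    for unit, tuples in realizations.items():
        units.append(unit)
        encoded = []
        for slots in tuples:
            ids = [str(word_ids.setdefault(w, len(word_ids) + 1)) if w is not None else "0"
                   for w in slots]
            encoded.append(" ".join(ids))
        lines.append("\t".join(encoded))
    dictionary = {i: w for w, i in word_ids.items()}
    return "\n".join(lines), units, dictionary

text, units, dictionary = build_sampler_input({
    "eat": [("people", "cake"), ("man", "bread")],
    "cook": [("woman", "cake")],
})
print(text)                       # "1 2\t3 4" on the first line, "5 2" on the second
print(units, dictionary)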

The result produced by the sampler comprises two files: frames.smpl, which contains frame identifiers corresponding to the realization positions in the input file, and roles.smpl, which contains

as many rows as the number of frames, and as many columns as the number of slots. The numbers there represent semantic role identifiers. Instructions for compiling the source code and examples of its usage are provided in appendix A. A demo application of the LDA-Frames data inferred from the BNC-big set is operated on the LDA-Frames home page5. The data has been computed for 1200 semantic frames and 400 semantic roles. Semantic frames, sorted by their probability scores in descending order, are represented by their semantic role numbers and their top ten realizations. A screenshot of an example for the lexical unit “eat” taken from the LDA-Frames web site is shown in figure 11.3.

Figure 11.3: A screenshot of the semantic frame representation on the LDA- Frames web site – two most likely frames for lexical unit ‘eat’.

5 http://nlp.fi.muni.cz/projekty/lda-frames

11.3 perplexity

The term perplexity comes from information theory and refers to the measurement of how well a probability model predicts a sample of data. The sample is usually taken from unseen data sets, and thus the perplexity can be used to compare probability models. In the context of LDA-Frames models, the perplexity is used to evaluate the convergence as well as the estimation of parameters. Let x = {x1, x2, ... , xN} denote a test data sample and m a probabilistic model. The perplexity of data x given model m is defined as

Perplexity(x, m) = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i \mid m) }.    (11.1)

As one can see from the equation, better models assign a higher probability to the test data, which results in lower perplexity. Moreover, the exponent of the equation can be regarded as the average number of bits required to represent a test event x_i in an optimal code based on m. If the model m was a uniform distribution over k discrete events, the perplexity would be k. In order to be able to compute the perplexity of the LDA-Frames model, we need to express the probability of a word from a test set given a model inferred on a training set. Let w' denote words from the test set. The perplexity can be derived as follows:

Perplexity(w', m) = 2^{ -\frac{1}{|w'|} \sum_{u=1}^{U} \sum_{t=1}^{T_u} \sum_{s=1}^{S} \log_2 P(w'_{u,t,s}) },    (11.2)

P(w'_{u,t,s}) = \sum_{f=1}^{F} P(f \mid u)\, P(w'_{u,t,s} \mid f)
             = \sum_{f=1}^{F} P(f \mid u)\, P(w'_{u,t,s} \mid \theta_{r_{f,s}})
             = \sum_{f=1}^{F} \frac{fc_{f,u} + \alpha_f}{\sum_{f'=1}^{F} (fc_{f',u} + \alpha_{f'})} \cdot \frac{wc_{w'_{u,t,s},\, r_{f,s}} + \beta_{w'_{u,t,s}}}{\sum_{v=1}^{V} (wc_{v,\, r_{f,s}} + \beta_v)},    (11.3)

where m stands for the LDA-Frames model computed using a training set w.
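Equation (11.1) translates directly into code. The following sketch assumes that the per-event probabilities P(x_i | m) have already been computed (for LDA-Frames, by equation (11.3)); the function is illustrative only.

import math

def perplexity(probabilities):
    """Perplexity of a sample given per-event probabilities P(x_i | m), eq. (11.1)."""
    n = len(probabilities)
    exponent = -sum(math.log2(p) for p in probabilities) / n
    return 2.0 ** exponent

# a uniform model over k = 8 discrete events has perplexity 8
print(perplexity([1.0 / 8] * 8))  # 8.0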

11.3.1 Convergence

In the first type of the evaluation, we will check the convergence properties of LDA-Frames. The model is always trained on the training subset only, using the single-core algorithm with hyperparameter estimation. The perplexity is measured on both the train and test subsets in order to be able to reveal overfitting.

Figure 11.4: Convergence of the LDA-Frames algorithm. [Train and test perplexity over 100 iterations on BNC-small (top) and BNC-big (bottom) for F=50/R=20, F=100/R=50 and F=200/R=100.]

The convergence of the parametric LDA-Frames algorithm is illustrated in figure 11.4. It has been tested on both the BNC-small and BNC-big data sets for three different settings. As one can see from the graphs, the perplexity steeply decreases during the first 50 iterations and then continues to decrease very slowly. After approximately 100 iterations, it reaches its minimum. Although the perplexity of the training subset is always lower than that of the test subset, the perplexity of the test set is monotonously decreasing during the whole learning process. It confirms that the model is not prone to overfitting.

Figure 11.5: Convergence and the number of components in the Non-Parametric LDA-Frames algorithm. [Top: train and test perplexity on BNC-small over 4000 iterations; bottom: the number of frames and roles over the iterations.]

The convergence of the non-parametric LDA-Frames with hyperparameter estimation is depicted in the top graph in figure 11.5. The figure shows that the convergence is much slower than in the parametric case, but the perplexity continues to decrease well below the lowest value reached by the parametric models. Both the slow convergence and the better values of perplexity are caused by the reestimation of the number of frames and roles during the inference process, which starts at F = R = 1. The bottom graph in figure 11.5 shows how the number of frames and roles evolves during the inference process. One can see again that the model is not overfitted.

11.4 reconstruction abilities

An important question, however, is whether the estimated values of the number of frames and roles are optimal. In order to test the hypothesis, an experiment with the synthetic data set has been carried out. Since the data is created using a procedure with known frames and roles, we can compare the inferred values against the original ones. First, the parametric LDA-Frames algorithm with hyperparameter estimation and the number of frames and roles parameters set to R ∈ {1 . . . 10}, F ∈ {1 . . . 20} has been run on the synthetic data set. This experiment produced 200 different models. Their perplexities, reached after 1000 iterations of the inference algorithm, are depicted in the graph in figure 11.6.

Figure 11.6: Perplexities for different values of F and R on the synthetic data set. [The minimum is reached at F = 9, R = 4.]

You can see that the lowest perplexity has been reached using the model with 9 semantic frames and 4 semantic roles. These are the same values as in the original data from which the synthetic data set has been generated. The second part of the convergence test on the synthetic data set has been performed using the non-parametric algorithm. It has been verified that the non-parametric algorithm converged to the same minimum perplexity and values of parameters in 43 iterations on average (10 independent runs of the algorithm). Apart from validating the right number of frames and roles, what remains is to verify whether the assignment of semantic roles to corresponding semantic role realizations in the corpus is correct. Because the semantic role identifiers are not inferred in a deterministic manner, it is necessary to find the correct assignment of the semantic role labels from the original data to the inferred semantic role identifiers. Formally, it is the weighted maximum bipartite matching problem, see e.g. Diestel (2010), where the weight of the matching between semantic role label l ∈ {l1, l2, ... , lR} and semantic role identifier i ∈ {1, 2, ... , R} is defined as the number of semantic role realizations they have in common. Finally, based on the mapping, we are able to measure the accuracy of the inferred semantic frames and their roles. The accuracy is defined as the fraction of the number of correctly assigned semantic roles and the number of all semantic role realizations in the corpus. The evaluation algorithms can be found in the utils folder, specifically in the Python script called evaluateRandom.py. The experiments showed that the perfect result, i.e. accuracy equal to 1.0, had been reached after 29 iterations of the parametric LDA-Frames algorithm with hyperparameter estimation, and after 46 iterations of the non-parametric LDA-Frames algorithm with hyperparameter estimation. Both numbers are averages of 10 independent runs of the sampler.
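The matching and the accuracy can be computed, for instance, with the Hungarian algorithm on the realization-overlap matrix. The sketch below uses scipy.optimize.linear_sum_assignment and is an illustration of the idea rather than the evaluateRandom.py script itself.

import numpy as np
from scipy.optimize import linear_sum_assignment

def role_accuracy(gold, inferred, n_roles):
    """Map inferred role identifiers to gold role labels and compute the accuracy.

    `gold` and `inferred` are equally long sequences (one item per semantic role
    realization in the corpus), both with values in {0, ..., n_roles - 1}.
    """
    overlap = np.zeros((n_roles, n_roles), dtype=int)
    for g, i in zip(gold, inferred):
        overlap[g, i] += 1
    # maximum weighted bipartite matching = minimum cost assignment on -overlap
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(cols, rows))            # inferred identifier -> gold label
    correct = sum(1 for g, i in zip(gold, inferred) if mapping.get(i) == g)
    return correct / len(gold)

gold     = [0, 0, 1, 1, 2, 2, 3, 3]
inferred = [2, 2, 0, 0, 3, 3, 1, 1]            # the same clustering, permuted labels
print(role_accuracy(gold, inferred, 4))        # 1.0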

11.5 scalability

The purpose of the experiments in this section is to investigate how well the parallelized implementation of the LDA-Frames algorithm performs in comparison to the sequential version. There are two aspects of the evaluation. First, it is necessary to evaluate the quality of the learned model, measured by the perplexity of a test data set given the model. The second aspect is the scalability itself, i.e. how the learning time decreases with an increasing number of available processors. The primary data set for carrying out these experiments is BNC-small. This data set has been chosen to allow us to perform a large number of experiments in a reasonable time. Figure 11.7 depicts the convergence of the parallelized LDA-Frames algorithm with hyperparameter estimation for 100 frames and 50 roles. The experiment has been carried out on a dual-processor server with two Intel Xeon X5675 CPUs, each with 6 physical cores running at a clock rate of 3.06 GHz. The sampler has been compiled using the g++ 4.8.1 compiler and run on the Ubuntu 13.10 operating system using 1, 5 and 10 cores. Although the parallelized algorithm is an approximation of the Collapsed Gibbs sampler for the LDA-Frames model, the figure shows that the parallelization has little effect on the convergence and resulting quality. The small differences between the three curves are most likely caused by the stochastic nature of the algorithm.

Figure 11.7: Convergence of the test perplexity for the parallelized LDA-Frames algorithm. [BNC-small-test perplexity over 100 iterations for c = 1, 5 and 10 cores.]

Figure 11.8: Speedup for 1 to 10 cores using the parallelized LDA-Frames algorithm. [Speedup for F=50/R=20, F=100/R=50 and F=200/R=100, compared to the perfect linear speedup.]

The scalability of the parallel algorithm has been tested for the same parameter settings as in figure 11.4. For each configuration, 10 runs of the algorithm with the number of cores equal to c ∈ {1, 2, ... , 10} have been performed. The running time has been measured using the standard Linux program time. The speedup is defined as the fraction of the sequential algorithm running time and the running time of the parallel version. Based on the results from figure 11.4, the number of iterations has been set to 100 for all experiments. The results are shown in figure 11.8. You can see from the figure that the scalability of the parallel algorithm is slightly better for higher numbers of frames and roles. This behavior is caused by the fact that mutual exclusions in the program are less likely when the numbers of frames and roles are high.

11.6 comparison with cpa

The previous experiments showed how well the LDA-Frames algorithm performs on small synthetic data. The goal of this section is to compare the probabilistic semantic frames inferred from the BNC-big data set with manually created frames developed within the Corpus Pattern Analysis project. Corpus Pattern Analysis (Hanks and Pustejovsky, 2005) is a project started by Patrick W. Hanks, which aims at creating a mapping between verb meaning and its realizations in texts. The verb meanings are represented as verb patterns closely resembling semantic frames. The patterns are based on the Theory of Norms and Exploitations (Hanks, 2004), which is complementary to Frame Semantics: FrameNet provides an in-depth analysis of semantic frames, while CPA offers a systematic analysis of the patterns of meaning and use of each verb. The theory is currently being used to build the Pattern Dictionary of English Verbs (PDEV)6, which is intended to be a lexical database of verb usage patterns useful for students and teachers of English as well as a resource for practical NLP applications. One of PDEV's strengths is that the dictionary is being built on the basis of corpus evidence from BNC, which makes the comparison with LDA-Frames learned from BNC-big reasonable. The frames in PDEV (called patterns in CPA literature) are composed of clause roles described in the systemic grammar (Halliday, 1961). These roles assign a semantic interpretation to grammatical relations (typically subject and object) of a verb predicate, and can be seen as an equivalent of the semantic roles defined in chapter 2.1. The meaning of the patterns in CPA is expressed as a set of implicatures. The words filling the subject and object relations can moreover be grouped into so-called lexical sets (Jezek and Hanks, 2010).

6 http://deb.fi.muni.cz/cpa/

For example, one of the patterns for verb ‘access’ is

(11.2) [[Human]] access [[Location]],

where ‘access’ is the verb predicate and ‘Human’ and ‘Location’ are semantic roles (called semantic types in CPA literature). The corresponding implicature is defined as

(11.3) [[Human]] is able to enter and leave [[Location]].

A screenshot of the complete entry for the verb ‘access’ taken from the PDEV web site is shown in figure 11.9. Apart from the pattern definitions and implicatures, the web site provides the user with access to corpus exemplifications of particular word meanings identified by pattern numbers.

Figure 11.9: Screenshot of the PDEV entry for verb ‘access’.

The work of P. Hanks on PDEV has been followed by a project called Semantic Pattern Recognition7, which has resulted in the Verb Pattern Sample (VPS) lexical resource (Cinková et al., 2012). Its main goal is to build a sample of the cleanest CPA data available for experiments, and to explore its potential for automatic lexical disambiguation. It comprises 30 English verbs from PDEV, which have been revised and cleaned. For each verb, there is a set of 300 randomly selected concordance samples. A subset of 50 samples has been analyzed by multiple annotators who assigned corresponding frame pattern identifiers to the concordances according to the similarity of implicatures. After the annotations had been finished, an inter-annotator agreement study measured using Fleiss' kappa (Artstein and

7 http://ufal.mff.cuni.cz/project/spr/

Poesio, 2008) has been carried out. It provides us with valuable information about the complexity and ambiguity of the annotation process.

11.6.1 Case Study: verbs ‘access’, ‘submit’ and ‘pour’

The evaluation of semantic frames automatically generated using the LDA-Frames algorithm against a CPA-based lexical resource is a twofold problem. First, it is required to show how the generated semantic frames correspond with the verb patterns from CPA. The second task is an automatic assignment of semantic frames to corpus realizations. It can be seen as automatic word sense disambiguation, where the assigned frame represents a particular word sense or one of the usage patterns of a single word sense. Let us start with the former task. The goal is to show that the LDA-Frames algorithm is able to discover structures similar to the patterns from CPA. However, semantic types in CPA are represented by their labels, while semantic roles in LDA-Frames by probability distributions over words. Therefore, it is unclear how to link them automatically. That is the reason why we have selected three verbs and analyzed them manually. The verbs for this case study are ‘access’, ‘submit’ and ‘pour’. These verbs have complex structures, are present in both PDEV and VPS, and therefore have high quality annotations and are accompanied with information about the inter-annotator agreement. In the beginning, it is necessary to transform CPA patterns to a format comparable with the generated LDA-Frames data. Since we use the data inferred from BNC strictly for the subject and object grammatical relations, it is necessary (1) to omit intransitive patterns with no direct object, and (2) to merge CPA patterns that differ only in semantic types other than those corresponding to the subject or object. For example, pattern

(11.4) [[Liquid]] pour [[NO OBJ]] [Adv[Direction]] of the verb ‘pour’ disallows an object, and thus must be excluded from the analysis. Pattern merging, for instance, must be carried out for patterns

(11.5) [[Human | Institution]] pour [[Money | Resource]] (out | forth) (into [[Activity]] | into [[Entity]]) and

(11.6) [[Human | Institution]] pour [[Money | Resource]] down drain into black hole

where the semantic types for subject and object are identical. It is worth noticing that these limitations are not intrinsic to the LDA-Frames algorithm itself; it is able to handle both issues (empty objects and more grammatical relations than direct object and subject). The reason for using only the subject–object structures is that experimentation with the same data as in previous sections is desirable. Intransitive verb patterns, for example, are exemplified in the synthetic data set. After the preprocessing of the CPA patterns, we get the following sets of frames (they are listed along with their implicatures):

access (1) [[Human = Computer User | Computer 1]] access [[Information | Computer 2]] [[Human = Computer User | Computer 1]] obtains or retrieves [[Information]], typically from a database held on [[Computer 2]]

(2) [[Human]] access [[Information = Neurological]] [[Human]] is able to obtain and retrieve [[Information = Neurologi- cal]], typically by thinking about it and remembering details

(3) [[Human]] access [[Location]] [[Human]] is able to enter and leave [[Location]]

(4) [[Human | Institution]] access [[Money]] [[Human | Institution]] is able to obtain and spend [[Money]] from a particular source requiring authorization

submit (1) [[Human 1 | Institution 1]] submit [[Plan | Document]] [[Human 1 | Institution 1]] presents [[Plan | Document]] to [[Human 2 | Institution 2]] for approval

(2) [[Human]] submit {claim | dispute} [[Human]] presents {claim | dispute} for a decision or approval by an authority

(3) [[Human]] submit {[that-CLAUSE | QUOTE]} [[Human]] formally argues {[that-CLAUSE | QUOTE]} is the case

(4) [[Human 1]] submit ([[Self]]) [[Human 1]] voluntarily undergoes or suffers [[Eventuality = Unpleasant]]

pour (1) [[Human]] pour [[Liquid]] [[Human]] causes [[Liquid]] to flow from [[Physical Object = Container]] in a steady stream (into a container or onto a surface)

(2) [[Human | Institution]] pour [[Money | Resource]] [[Human]] presents {claim | dispute} for a decision or approval by an authority

(3) [[Human]] pour [[Emotion]] {out} [[Human]] expresses [[Emotion]] in an unrestrained way

(4) [[Entity]] pour [[Stuff = Product]] {out} [[Entity]] produces [[Stuff = Product]] in large amounts

(5) [Human] pour [{tear|tears}] [Human] weeps, [Human] sheds tears

For comparison purposes, we have used the model inferred on BNC-big using the parallel LDA-Frames algorithm with hyperparameter estimation for a fixed number of frames and roles. The number of frames has been set to 1200 and the number of roles to 400. The model with this setting is operated on the project web site. The matching between CPA and LDA-Frames has been carried out based on semantic interpretations of the semantic type labels and the implicatures on the CPA side, and based on the top ten realizations of semantic roles on the LDA-Frames side. Since there are possibly 1200 distinct semantic frames for each lexical unit in the LDA-Frames model, the set of candidates for matching has been limited to semantic frames with prior probability higher than 0.01. The matching is presented in table 11.4. The utilized semantic frames from the LDA-Frames model along with their top ten realizations are listed in appendix B.

Lexical unit    CPA pattern    Frame ID    Prior prob.    Subject role    Object role
ACCESS          1              969         0.552632       245             245
ACCESS          1              1133        0.151316       87              245
ACCESS          2              1060        0.138158       219             53
ACCESS          3              397         0.019737       193             20
ACCESS          4              1051        0.013158       87              277
SUBMIT          1              670         0.440147       14              53
SUBMIT          1              453         0.230203       344             372
SUBMIT          2              819         0.018416       71              106
SUBMIT          4              918         0.022099
POUR            1              876         0.607226       219             120
POUR            2              952         0.040793       193             277
POUR            5              251         0.038462       225             225

Table 11.4: Matching between CPA and LDA-Frames.

The different granularity of CPA and LDA-Frames causes some CPA patterns to be linked to multiple frames. For example, pattern (1) of ‘access’ is linked to frames 969 and 1133. They differ in the semantic roles for the subject. While the subject slot in the CPA pattern is filled with both the Human and Computer semantic types at the same time, these roles are treated separately in LDA-Frames. On the other hand, multiple semantic types in CPA are sometimes so specific that they have a single equivalent in LDA-Frames. For example, both ‘tear’ and ‘eye’ are represented by one semantic role 225. Some idiomatic and rarely used patterns, for example pattern (3) of ‘pour’, have no equivalent in LDA-Frames with a prior probability higher than 0.01. Such patterns remained without a link to LDA-Frames. Statistics of the matching are summarized in table 11.5.

Lexical unit    Patterns in PDEV    Frames in LDA-Frames    Patterns after merging    Linked frames
access          4                   7                       4                         5
submit          6                   12                      4                         4
pour            20                  10                      5                         3

Table 11.5: Statistics of the matching between CPA and LDA-Frames.

The second and the third columns represent the number of patterns in PDEV and the number of semantic frames in LDA-Frames with prior probability higher than 0.01, respectively. The column named ‘patterns after merging’ summarizes the number of patterns in PDEV after applying the merging and omitting rules described above. The last column represents the number of frames in LDA-Frames that were linked to PDEV patterns. Having prepared the matching between CPA and LDA-Frames, we can now proceed to an automatic assignment of semantic frames to verbs in texts. The automatic annotation of syntactically parsed texts with LDA-Frames is very straightforward. The probabilistic character of LDA-Frames provides us with a natural way of choosing the most probable semantic frame for a lexical unit based on the semantic role realizations. Let u be the investigated lexical unit and t = t1, t2, ... , tS the semantic role realizations for slots {1, 2, ... , S}. Given the LDA-Frames model m, the probability of frame f is, following 7.51, defined as

P(f \mid u, t, m) = \frac{ (fc_{f,u} + \alpha) \prod_{s=1}^{S} \frac{wc_{t_s,\, r_{f,s}} + \beta}{wc_{*,\, r_{f,s}} + V\beta} }{ \sum_{i=1}^{F} \left[ (fc_{i,u} + \alpha) \prod_{s=1}^{S} \frac{wc_{t_s,\, r_{i,s}} + \beta}{wc_{*,\, r_{i,s}} + V\beta} \right] }.    (11.4)

The most probable semantic frame \hat{f} is then chosen as

\hat{f} = \arg\max_{f'} P(f' \mid u, t, m).    (11.5)

The evaluation of the automatic semantic frame assignment procedure has been carried out on manually adjudicated annotations from VPS. Unfortunately, the annotations do not include annotations of semantic role realizations, which are required by the algorithm. This information has been automatically extracted using the Stanford parser and then manually verified by an annotator. The quality of the automatic annotations is measured using three standard techniques – precision, recall and accuracy, defined as

precision = \frac{correct}{correct + wrong},    (11.6)

recall = \frac{correct}{correct + skipped},    (11.7)

accuracy = \frac{correct}{correct + wrong + skipped},    (11.8)

where correct is the number of correctly classified lexical units, wrong is the number of incorrectly classified lexical units, and skipped is the number of lexical units with a known CPA pattern for which there is no equivalent in LDA-Frames. The results from the comparison of CPA and LDA-Frames are presented in table 11.6.
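Before turning to the results, equations (11.4) and (11.5) can be written down directly as a small procedure. The sketch below assumes that the count arrays fc and wc and the frame–role matrix are available from the sampler output; all names and the toy numbers are illustrative.

import numpy as np

def most_probable_frame(u, t, fc, wc, roles, alpha, beta):
    """Pick the most probable frame for lexical unit u and realizations t, eqs. (11.4)-(11.5).

    fc[f, u]    ... how many times frame f has been used with lexical unit u
    wc[v, r]    ... how many times word v has been generated by semantic role r
    roles[f, s] ... semantic role assigned to slot s of frame f
    """
    V = wc.shape[0]
    scores = np.empty(fc.shape[0])
    for f in range(fc.shape[0]):
        score = fc[f, u] + alpha
        for s, word in enumerate(t):
            r = roles[f, s]
            score *= (wc[word, r] + beta) / (wc[:, r].sum() + V * beta)
        scores[f] = score
    return int(np.argmax(scores)), scores / scores.sum()

# toy model: 2 frames, 2 roles, 3 words, 2 slots, a single lexical unit u = 0
fc = np.array([[10.0], [2.0]])
wc = np.array([[8.0, 0.0], [1.0, 5.0], [0.0, 4.0]])
roles = np.array([[0, 1], [1, 0]])
frame, posterior = most_probable_frame(0, (0, 2), fc, wc, roles, alpha=0.1, beta=0.1)
print(frame, posterior)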

Lexical unit    IAA      Training examples    Precision    Recall    Accuracy
access          0.600    152                  0.783        1.000     0.783
submit          0.764    550                  0.939        0.838     0.795
pour            0.652    858                  0.870        0.741     0.667

Table 11.6: Quality of the automatic assignment of LDA-Frames.

The second column represents the Fleiss' kappa values of the inter-annotator agreement on the data. It is worth emphasizing that the set of patterns for which the agreement was originally measured is slightly different from the set of merged patterns. The values, however, should correlate with the difficulty of annotations. The third column represents the number of realizations of particular lexical units in the training data. The last three columns summarize precision, recall and accuracy, respectively. This case study for only three verbs is of course hardly representative. A more extensive study with equally detailed analysis, however, would require huge amounts of human annotator resources. To evaluate the results on a larger set of lexical units, we have performed a simplified analysis described in the following subsection.

11.6.2 Comparison on 300 most frequent lexical units

In order to test the quality of the generated frames on a larger set of lexical units, we have compared the generated LDA-Frames and CPA patterns for the 300 most frequent lexical units from CPA. Because a corpus with annotated semantic type realizations for all 300 verbs is not available (the CPA corpus contains just frame identifiers without an annotation of corresponding arguments), we have only measured the matching between the learned LDA-Frames data and the CPA patterns. It corresponds to the first task in the previous subsection. This simplified study is based on the experiment presented in Materna (2012a). In contrast to the previous experiment, where the frames have been linked manually for each lexical unit, here we have linked the integer representations of semantic roles in LDA-Frames to semantic types in CPA. It is carried out without looking at particular frames or lexical units, and thus it requires less manual work (there are only 400 semantic roles in the LDA-Frames model). A disadvantage of this approach, however, is that the evaluation is not so accurate and produces slightly worse results. The mapping has been carried out manually so that at least one CPA role had to be assigned to a semantic role in LDA-Frames. Some of the roles have been assigned to more than one semantic role in CPA. The correspondence has been measured as follows. First, we have measured the percentage of LDA-Frames that, after translating semantic role numbers to CPA semantic types, are present in the set of CPA patterns for each lexical unit. Secondly, we have measured the percentage of lexical units where the most frequent frame is the same in CPA and LDA-Frames. The results are shown in table 11.7.

Method                   Correspondence
All frames               52.8 %
Most frequent frames     61.8 %

Table 11.7: The correspondence of LDA-Frames and CPA patterns.

The results show that although the LDA-Frames patterns are created by a different method, LDA-Frames is able to generate frame structures very similar to verb pattern lexicons automatically. Moreover, it is able to automatically annotate syntactically parsed texts for the verbs from the case study.

11.7 semantic similarity of lexical units

The semantic frames generated by the LDA-Frames algorithm are an interesting source of information about selectional preferences, but they can even be used for grouping semantically related lexical units. Individual semantic frames can hardly capture the whole semantic information about a lexical unit; nevertheless, the LDA-Frames approach provides information about the relatedness to every semantic frame we have inferred. After the inference process, each lexical unit u is connected with a probability distribution over semantic frames ϕu. Therefore, we can group lexical units with similar probability distributions together to make semantic clusters. We will call these clusters similarity sets. The ideas follow my recent work published in Materna (2012b). There are several methods for the comparison of probability distributions. We use the Hellinger distance (Hazewinkel, 2001), which measures the divergence of two probability distributions and, unlike the Kullback-Leibler divergence, is symmetric. For two probability distributions ϕa, ϕb, where P(f|x) is the probability of frame f given lexical unit x, the Hellinger distance is defined as follows:

H(a, b) = \sqrt{ \frac{1}{2} \sum_{f=1}^{F} \left( \sqrt{P(f \mid a)} - \sqrt{P(f \mid b)} \right)^2 }.    (11.9)

Using the Hellinger distance, we can generate a ranked list of semantically similar words for every lexical unit u for which the semantic frames have been computed. The similarity set is then chosen by selecting the n best candidates or by selecting all candidates c1, c2, ... , cn where H(u, ci) < τ for some threshold τ. For the experimental purposes, we have selected all transitive verbs with a frequency in BNC greater than 100. The transitivity has been determined by looking into PDEV and selecting such verbs that have both a subject valency and a direct object valency present. The constraints have been fulfilled by 4053 English verbs. These verbs have been enhanced with frame distributions acquired from the

LDA-Frames model trained on BNC-big with the hyperparameter estimation and the number of frames and roles set to 1200 and 400, respectively. Using the frame distributions, we have created a list of similar verbs for each lexical unit, sorted in ascending order based on the distance measure described above. The verbs with distance 1.0 were omitted. We have created 20 different thesauri from the sorted lists. For 1 ≤ n ≤ 20, the thesaurus has been built as the set of at most the first n items from the similarity lists for each lexical unit. The generated similarity sets have been compared with a competitive statistical thesaurus created using the Sketch Engine framework (Kilgarriff et al., 2004) from so-called word sketches. Word sketches are automatic, corpus-based summaries of a word's grammatical and collocational behavior, which take as input a corpus of any language and corresponding grammar patterns. The resulting summaries are produced in the form of a ranked list of common word realizations for a grammatical relation of a given word. These statistics are compared in order to create lists of semantically similar lexical units. The Sketch Engine has been used to build 20 thesauri, as in the case of LDA-Frames, from the same data. Because verbs with distance 1.0 have been omitted, not all lexical units have exactly n verbs in their similarity sets. Figure 11.10 shows the average number of verbs in the similarity sets for every n we have considered.
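The construction of the similarity sets can be sketched as follows; the frame distributions are assumed to be available as vectors, one per lexical unit, and the names as well as the toy numbers are illustrative.

import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, eq. (11.9)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def similarity_set(unit, distributions, n=10, tau=1.0):
    """Return up to n nearest lexical units with a Hellinger distance below tau."""
    ranked = sorted(((hellinger(distributions[unit], dist), other)
                     for other, dist in distributions.items() if other != unit),
                    key=lambda pair: pair[0])
    return [other for dist, other in ranked if dist < tau][:n]

# toy frame distributions over three frames
dists = {
    "eat":  np.array([0.7, 0.2, 0.1]),
    "cook": np.array([0.6, 0.3, 0.1]),
    "sue":  np.array([0.0, 0.1, 0.9]),
}
print(similarity_set("eat", dists, n=2))  # ['cook', 'sue']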

Figure 11.10: Average number of words in the similarity sets. [Average similarity set size of the LDA-Frames and Sketch Engine thesauri as a function of n, compared with the maximum n.]

The results from the figure can be interpreted such that the Sketch Engine thesauri are stricter than the LDA-Frames thesauri and produce smaller similarity sets.

In order to evaluate the quality of the generated thesauri, their similarity sets have been compared with synsets from the English WordNet 3.0. Since there are verbs that are included in the thesauri but are not included in WordNet (and vice versa), we have selected verbs contained in their intersection. It has resulted in a set of 2880 verbs. The quality has been measured as the average number of verbs from a similarity set contained in the corresponding WordNet synset, normalized by the size of the similarity set. Formally, let V be the number of investigated verbs, T(v) the similarity set for verb v in thesaurus T and W(v) the synset for verb v in WordNet. The score is then computed as

Score(T) = \frac{1}{V} \sum_{v=1}^{V} \frac{|T(v) \cap W(v)|}{|T(v)|}.    (11.10)

The resulting scores for similarity set sizes 1 ≤ n ≤ 20 are shown in figure 11.11.
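Equation (11.10) is equally simple to compute once the similarity sets and the WordNet synsets are represented as plain sets; the snippet below is a minimal illustration with made-up data.

def thesaurus_score(similarity_sets, synsets):
    """Average normalized overlap between similarity sets and WordNet synsets, eq. (11.10)."""
    verbs = [v for v in similarity_sets if similarity_sets[v]]
    return sum(len(similarity_sets[v] & synsets.get(v, set())) / len(similarity_sets[v])
               for v in verbs) / len(verbs)

similarity_sets = {"eat": {"consume", "cook"}, "buy": {"purchase", "sell"}}
synsets = {"eat": {"consume", "ingest"}, "buy": {"purchase"}}
print(thesaurus_score(similarity_sets, synsets))  # (1/2 + 1/2) / 2 = 0.5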

Figure 11.11: Comparison of the Sketch Engine thesauri with the LDA-Frames thesauri. [WordNet-based score as a function of the similarity set size n.]

It can be seen that the LDA-Frames thesauri outperform the Sketch Engine thesauri for all choices of n. The difference is most noticeable when n = 1. This special case measures whether the most similar verb is present in the corresponding WordNet synset. This condition is satisfied for approximately 9.5 % of the verbs for LDA-Frames and 6.5 % for the Sketch Engine. The scores may seem small, but it is worth mentioning that many verbs from the synsets are excluded from the evaluation (they were outside the intersection of the lexical units),

thus the theoretical maximum is well below 100 %. Moreover, only the subject and object grammatical relations have been taken into consideration when computing the similarity of lexical units. This means, for instance, that the English verbs ‘eat’ and ‘cook’ have very high similarity scores, because both are used in the same contexts, and thus have completely identical semantic frames. It is apparent that the algorithm would achieve better results if more than two grammatical relations were used. Specifically, the verbs ‘eat’ and ‘cook’ could be differentiated, for example, by adding a grammatical relation corresponding to the instrument of eating or cooking. For further experiments, the resulting LDA-Frames thesaurus is available on the LDA-Frames project web site.

CONCLUSION 12

The objective of this thesis was to explore the state of the art in the field of electronic valency lexicons and semantic frames, and to propose a method for creating such electronic resources automatically. As was shown at the beginning, there are many frame-based electronic lexicons, but they are mostly created manually, which requires huge amounts of human manual work. Moreover, the lexicons rarely provide statistics about their entries supported by corpus evidence. Such statistics are very useful and desired in automatic natural language processing tasks. As a reaction to the current situation, this thesis offers an automatic method for creating probabilistic semantic frames called LDA-Frames. The method requires a syntactically parsed corpus of any language as an input, and for each lexical unit from the corpus, it produces a probability distribution over the set of generated semantic frames. The semantic frames consist of semantic roles represented as probability distributions over their realizations. The whole model is described as a generative probabilistic graphical model. The inference algorithm for the model is based on collapsed Gibbs sampling, and generates the result in a completely unsupervised manner. Apart from the training corpus, the algorithm requires two input parameters – the number of roles and the number of frames – and several hyperparameters controlling the shapes of the distributions. Since these parameters depend on the provided corpus and their choice may not be obvious, this thesis describes a non-parametric version of LDA-Frames, which is an extension of the original model that estimates the parameters automatically. This thesis also proposes an automatic procedure for estimating the hyperparameters. The quality of the inferred semantic frames depends on the size of the training data. The bigger the data is, the more information is available to build a high quality lexical resource. Large data sets, however, require a long running time of the algorithm. To overcome this difficulty, we have proposed a distributed algorithm that enables the inference procedure to run in a parallel manner. It has been shown that the parallelization has almost no impact on the quality of the frames and that the scalability is pretty good for numbers of physical processors ranging from 1 to 10. A standard method of evaluating probabilistic models is to measure the probability of unseen data given the model. We have created training and testing data sets from the British National Corpus, automatically parsed using the Stanford parser, and shown that


the inference algorithm successfully converges to optimal values of perplexity in both the parametric and non-parametric models. Moreover, no overfitting has been observed. Because the convergence itself does not guarantee the meaningfulness of the model, we have shown on a small data set that it is able to perfectly reconstruct synthetically prepared semantic frames. The range of possible utilizations of the proposed models in NLP is wide. The most straightforward applications are word sense induction and word sense disambiguation. The applicability of LDA-Frames to these tasks has been shown in a case study comparing LDA-Frames with the CPA project. The study has demonstrated that the induced semantic frames are very similar to manually created patterns from CPA, and that the probabilistic nature of LDA-Frames enables semantic annotation of unseen text with high precision. Another application of LDA-Frames is its utilization for the creation of thesauri. It has been presented that the thesaurus built from an LDA-Frames model reaches a higher compliance with manually compiled WordNet synsets than a competitive approach based on the Sketch Engine. To make experimenting with LDA-Frames comfortable, the probabilistic semantic frames and the thesaurus computed on the British National Corpus are available on-line on the project's home page: http://nlp.fi.muni.cz/projekty/lda-frames. Source codes of the inference algorithm and other auxiliary scripts are accessible from the home page as well. The LDA-Frames approach is even applicable outside the natural language processing field. Its statistical nature enables a discovery of hidden patterns in any type of data that can be parsed into tuples of objects. Such applications, however, remain for future work.

Part III

APPENDIX

COMPILATION AND USAGE INSTRUCTIONS A

a.1 introduction

LDA-frames is an approach to identifying semantic frames from semantically unlabeled text corpora. There are many frame formalisms, but most of them suffer from the problem that all the frames must be created manually and the set of semantic roles must be predefined. The LDA-Frames approach, based on the Latent Dirichlet Allocation, avoids both of these problems by employing statistics on a syntactically annotated corpus. The only information that must be given is the number of semantic frames and the number of semantic roles. This limitation, however, can be avoided by an automatic estimation of both these parameters. The model which is able to estimate the number of frames and roles automatically is called Non-parametric LDA-Frames.

a.2 dependencies

• Python >= 2.5

• Numpy >= 1.6

• NLTK >= 2.0

• Scipy >= 0.9

• Java 6 >= JDK 1.6 (Only when using the Stanford parser for generating the syntactic dependencies.)

• g++ >= 4.6

• Boost program options development package >= 1.46

• GNU Scientific Library (GSL) development package >= 1.15

• OpenMP (Only when using the multi-threaded sampler)

a.3 building binaries from the source codes on ubuntu 13.10

Install necessary dependencies.

$ sudo apt-get install g++ libboost-program-options-dev \
      libgsl0-dev python-numpy python-nltk \
      python-scipy



Move into the working directory of the sampler.

$ cd sampler

Build the binary file.

$ make

a.4 example of the usage

Generate input data for the sampler along with a dictionary.

$ preprocessing/generate_samplerinput.py \
      -l data/synthetic/synthetic.rel data/synthetic/

Run the Non-Parametric LDA-Frames sampler with 100 iterations.

$ sampler/ldaframes-sampler --seed=1 \
      -I 100 data/synthetic/train.dat data/synthetic/

Evaluate the produced frames against the original data.

$ utils/evaluateRandom.py data/synthetic/synthetic.rel.chck \
      data/synthetic/

Generate frame distributions for lexical units and word distributions for semantic roles.

$ preprocessing/generate_distributions.py data/synthetic/

Generate the similarity matrix for all lexical units (based on the comparison of their frame distributions).

$ preprocessing/generate_similarities.py data/synthetic/

Show generated semantic frames for lexical unit "eat".

$ libldaf/showFrames.py -r 4 data/synthetic/ eat

LDA-FRAMES MODEL – SELECTED FRAMES B

Frame    Subject realizations / Object realizations
969      subject: system user application datum program information software computer version module
         object:  system user application datum program information software computer version module
1133     subject: person people patient one child woman country company government who
         object:  system user application datum program information software computer version module
1060     subject: person people man who one child woman family father mother
         object:  information report letter copy message number detail statement list document
397      subject: person company government firm bank authority group people who council
         object:  service system work range area program course market one scheme
1051     subject: person people patient one child woman country company government who
         object:  cost price money rate $ million amount number share %
670      subject: person government group student company people member party who authority
         object:  information report letter copy message number detail statement list document
453      subject: person government council party committee authority minister board parliament commission
         object:  plan agreement policy proposal bill contract resolution decision program law
819      subject: person man herself friend mother people story father other one
         object:  claim offer appeal charge case invitation argument suggestion idea report
918      subject: person one man people who someone woman mother nobody father
         object:  himself herself myself yourself ourselves body feeling hellip yourselve time
876      subject: person people man who one child woman family father mother
         object:  coffee tea water wine glass drink cup beer sigh whisky
952      subject: person company government firm bank authority group people who council
         object:  cost price money rate $ million amount number share %
251      subject: face light smile eye tear shadow sun cheek look blood
         object:  face light smile eye tear shadow sun cheek look blood

Table B.1: Top-ten realizations of semantic roles in LDA-Frames.

BIBLIOGRAPHY

David J. Aldous. Exchangeability and Related Topics. École d’Été de Probabilités de Saint-Flour XIII – 1983, 1117:1 – 198, 1985.

Charles E. Antoniak. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics, 2(6), November 1974.

Ron Artstein and Massimo Poesio. Inter-coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596, December 2008. ISSN 0891-2017.

Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, CICLing ’02, pages 136–145, London, UK, 2002. Springer-Verlag. ISBN 3-540-43219-1.

Carol A. Bean and Rebecca Green. Relationships in the Organization of Knowledge. Springer, 2001. ISBN 978-0792368137.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007. ISBN 978-0387310732.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993 – 1022, 2003.

Donald M. Borchert. Encyclopedia Of Philosophy. Macmillan Reference USA; 2 edition, 2005. ISBN 978-0028657905.

George E. P. Box. Empirical Model-Building and Response Surfaces. Wiley; 1 edition, 1987. ISBN 978-0471810339.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. The TIGER Treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, pages 24 – 42, 2002.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. The Salsa Corpus: A German Corpus Resource for Lexical Semantics. In The International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, 2006.

Rainer E. Burkard and Eranda Çela. Linear Assignment Problems and Extensions. Handbook of Combinatorial Optimization, Supplement Volume A, 1999.


Kenneth P. Burnham and David R. Anderson. Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. Springer, 2 edition, 2002. ISBN 0387953647.

Noam Chomsky. Aspects of the Theory of Syntax. The MIT Press, 1969a. ISBN 978-0262530071.

Noam Chomsky. Some Empirical Assumptions in Modern Philosophy of Language. Philosophy, Science and Method: Essays in Honor or Ernest Nagel, 1969b.

Noam Chomsky. Lectures on Government and Binding: The Pisa Lectures. Mouton de Gruyter; 7 Sub edition, 1993. ISBN 978-3110141313.

Silvie Cinková, Martin Holub, Adam Rambousek, and Lenka Smejkalová. A Database of Semantic Clusters of Verb Usages. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, pages 3176–3183. European Language Resources Association, 2012. ISBN 978-2-9517408-7-7.

Gregory F. Cooper. Probabilistic Inference Using Belief Networks Is NP-Hard. Artificial Intelligence, 42:309–347, 1990.

D. Alan Cruse. Lexical Semantics. Cambridge University Press, 1986. ISBN 978-0521276436.

Hoa Trang Dang. Investigations Into the Role of Lexical Semantics in Word Sense Disambiguation. PhD thesis, University of Pennsylvania, 2004.

Hoa Trang Dang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. Investigating Regular Sense Extensions Based on Intersective Levin Classes. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, ACL ’98, pages 293–299, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics.

Marie-Catherine de Marneffe and Christopher D. Manning. Stanford Typed Dependencies Manual. Technical report, Stanford University, 2008.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating Typed Dependency Parses from Phrase Structure Parses. In The International Conference on Language Resources and Evaluation (LREC) 2006, 2006.

Reinhard Diestel. Graph Theory. Graduate Texts in Mathematics. Springer, 2010. ISBN 978-3642142789.

Robert M. W. Dixon. A New Approach to English Grammar on Semantic Principles. Clarendon Press, Oxford, 1991. ISBN 978-0198242727.

David R. Dowty. Thematic Roles and Semantics. In Proceedings of the Twelfth Annual Meeting of the Berkeley Linguistics Society, pages 340–354, 1986.

S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using Latent Semantic Analysis to Improve Access to Textual Information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’88, pages 281–285, New York, NY, USA, 1988. ACM. ISBN 0-201-14237-6.

Benjamin Van Durme and Daniel Gildea. Topic Models for Corpus-centric Knowledge Generalization. Technical report, Computer Science Department, The University of Rochester, Rochester, NY, 2009.

Carl Eckart and Gale Young. The Approximation of One Matrix by Another of Lower Rank. Psychometrika, 1(3):211–218, 1936. ISSN 0033-3123.

Roger Eckhardt. Stan Ulam, John von Neumann, and the Monte Carlo Method. Los Alamos Science, pages 131–137, 1987.

Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed-membership Models of Scientific Publications. Proceedings of the National Academy of Sciences, 101(suppl. 1):5220–5227, 2004.

Michael D. Escobar and Mike West. Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association, 90:577–588, 1995.

Christiane Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998. ISBN 978-0-262-06197-1.

Thomas S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1:209–230, 1973.

Charles J. Fillmore. The Case for Case. In Universals in Linguistic Theory. New York: Holt, Rinehart and Winston, 1968.

Charles J. Fillmore. Topics in Lexical Semantics. In Current Issues in Linguistic Theory. Indiana University Press, 1976.

Charles J. Fillmore. Scenes-and-frames Semantics. In A. Zampolli, ed., Linguistic Structures Processing, pages 55–81. North-Holland, Amsterdam, 1977a.

Charles J. Fillmore. The Case for Case Reopened. In Syntax and Semantics: Grammatical Relations. Academic Press Inc., 1977b.

Charles J. Fillmore. Frame Semantics. In Linguistics in the Morning Calm, pages 111–137. Seoul, Hanshin Publishing Co., 1982.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R. L. Petruck. Background to FrameNet. In International Journal of Lexicography, pages 235–250. Oxford University Press, 2003.

John Rupert Firth. A Synopsis of Linguistic Theory 1930–1955. Studies in linguistic analysis, 51:1–31, 1957.

Bela A. Frigyik, Amol Kapila, and Maya R. Gupta. Introduction to the Dirichlet Distribution and Related Processes. Technical report, University of Washington, December 2010.

G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure. In Proceedings of the 11th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’88, pages 465–480, New York, NY, USA, 1988. ACM. ISBN 2-7061-0309-4.

Aldo Gangemi, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. Sweetening Ontologies with DOLCE. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, EKAW ’02, pages 166–181, London, UK, 2002. Springer. ISBN 3-540-44268-5.

Andrew Gelman and Donald B. Rubin. Inference from Iterative Simulation Using Multiple Sequences. Statistical Science, 7(4):457– 472, 1992. ISSN 08834237.

Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, November 1984. ISSN 0162-8828.

John Geweke. Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments. In Bayesian Statistics, pages 169–193. University Press, 1992.

Charles J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483, 1992. ISSN 08834237.

Daniel Gildea and Daniel Jurafsky. Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3):245–288, September 2002. ISSN 0891-2017.

Mark Girolami and Ata Kabán. On an equivalence between PLSI and LDA. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’03, pages 433–434, New York, NY, USA, 2003. ACM. ISBN 1-58113-646-3.

Talmy Givón. Syntax: a Functional-Typological Introduction. J. Benjamins Pub. Co, 1990. ISBN 9781556190964.

Trond Grenager and Christopher D. Manning. Unsupervised Discovery of a Statistical Verb Lexicon. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages 1–8, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. ISBN 1-932432-73-6.

Thomas L. Griffiths and Mark Steyvers. Finding Scientific Topics. In Proceedings of the National Academy of Sciences of the United States of America, pages 5228–5235, 2004.

Jeffrey Steven Gruber. Studies in Lexical Relations. PhD thesis, Massachusetts Institute of Technology, Department of Modern Languages, 1965.

Liliane Haegeman. Introduction to Government and Binding Theory. Blackwell Publishers Inc, Malden, Massachusetts, USA, 1994. ISBN 0631190678.

Jan Hajič. Complex Corpus Annotation: The Prague Dependency Treebank. In Insight into Slovak and Czech Corpus Linguistics. Veda, Vydavateľstvo SAV, Bratislava, 2005.

Michael Halliday. Categories of the theory of grammar. Word, pages 241–292, 1961.

Patrick Hanks. The Syntagmatics of Metaphor and Idiom. In International Journal of Lexicography, pages 245–274. Oxford University Press, 2004.

Patrick Hanks and James Pustejovsky. A Pattern Dictionary for Natural Language Processing. In Revue Française de Langue Appliquée. Brandeis University, 2005.

W. K. Hastings. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57(1):97–109, April 1970. ISSN 1464-3510.

Michiel Hazewinkel. Encyclopedia of Mathematics. Springer, 2001. ISBN 978-1-55608-010-4.

Kyoko Hirose Ohara, Seiko Fujii, Toshio Ohori, Ryoko Suzuki, Hiroaki Saito, and Shun Ishizaki. The Japanese FrameNet Project: An Introduction. In The Fourth International Conference on Language Resources and Evaluation (LREC’04), pages 9–11, 2004.

Dana Hlaváčková and Aleš Horák. VerbaLex – New Comprehensive Lexicon of Verb Valencies for Czech. In Computer Treatment of Slavic and East European Languages, pages 107–115. Slovenský národný korpus, Bratislava, Slovakia, 2006.

Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Proceedings of Uncertainty in Artificial Intelligence, UAI, 1999.

Nancy Ide and Jean Véronis. Introduction to the Special Issue on Word Sense Disambiguation: the State of the Art. Computational Linguistics, 24(1):2–40, 1998. ISSN 0891-2017.

Ray S. Jackendoff. Semantic Interpretation in Generative Grammar. The MIT Press, 1974. ISBN 978-0262600071.

Ray S. Jackendoff. Semantics and Cognition. The MIT Press, 1985. ISBN 978-0262600132.

Elisabetta Jezek and Patrick Hanks. What Lexical Sets Tell us about Conceptual Categories. Lexis, 4:7–22, 2010. ISSN 1951-6215.

Norman L. Johnson and Samuel Kotz. Urn Models and Their Applications: An Approach to Modern Discrete Probability Theory. John Wiley and Sons Inc., New York, 1977. ISBN 978-0471446309.

Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress, pages 105–116. Lorient, France, 2004.

Jin H. Kim and Judea Pearl. A Computational Model for Causal and Diagnostic Reasoning in Inference Systems. In Proceedings of the Eighth international joint conference on Artificial intelligence - Volume 1, IJCAI’83, pages 190–193, San Francisco, CA, USA, 1983. Morgan Kaufmann Publishers Inc.

Paul Kingsbury and Martha Palmer. From TreeBank to PropBank. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, 2002.

Paul Kingsbury, Martha Palmer, and Mitch Marcus. Adding Predicate Argument Structure to the Penn TreeBank. In Proceedings of the second international conference on Human Language Technology Research, HLT ’02, pages 252–256, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.

Karin Kipper, Hoa Trang Dang, and Martha Palmer. Class-Based Construction of a Verb Lexicon. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 691–696. AAAI Press, 2000. ISBN 0-262-51112-6.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. A Large-Scale Classification of English Verbs. Language Resources and Evaluation, 42(1):21–40, 2008.

S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

Dan Klein and Christopher D. Manning. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL ’03, pages 423– 430, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 1st edition, 2009. ISBN 978-0262013192.

Sopan Govind Kolte and Sunil G. Bhirud. Word Sense Disambiguation Using WordNet Domains. In ICETET ’08 Proceedings of the 2008 First International Conference on Emerging Trends in Engineering and Technology, pages 1187–1191, 2008. ISBN 978-0-7695-3267-7.

Henry Kucera, W. Nelson Francis, John B. Carroll, and W. F. Twaddell. Computational Analysis of Present Day American English. Brown University Press, 1st edition, 1967. ISBN 978-0870571053.

Solomon Kullback and Richard A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

P. J. M. Laarhoven and E. H. L. Aarts, editors. Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, Norwell, MA, USA, 1987. ISBN 9-027-72513-6.

Joel Lang and Mirella Lapata. Unsupervised Induction of Semantic Roles. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 939–947, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 1-932432-65-5.

Joel Lang and Mirella Lapata. Unsupervised Semantic Role Induction via Split-merge Clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1117–1126, Stroudsburg, PA, USA, 2011a. Association for Computational Linguistics. ISBN 978-1-932432-87-9.

Joel Lang and Mirella Lapata. Unsupervised Semantic Role Induction with Graph Partitioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1320–1331, Stroudsburg, PA, USA, 2011b. Association for Computational Linguistics. ISBN 978-1-937284-11-4.

Douglas Lenat. CYC: A Large-Scale Investment in Knowledge Infrastructure. In Communications of the ACM, volume 38, pages 33–38, 1995.

Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press, 1993.

Xiaobin Li, Stan Szpakowicz, and Stan Matwin. A WordNet-based Algorithm for Word Sense Disambiguation. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1368–1374, 1995.

Jun S. Liu. The collapsed Gibbs sampler in Bayesian computations with application to a gene regulation problem. Journal of the American Statistical Association, 89:958–966, 1994.

Jun S. Liu, W. H. Wong, and A. Kong. Covariance Structure of the Gibbs Sampler with Applications to the Comparisons of Estimators and Augmentation Schemes. Biometrika, 81(1):27–40, 1994.

Markéta Lopatková, Zdeněk Žabokrtský, and Karolína Skwarska. Valency Lexicon of Czech Verbs: Alternation-based Model. In The International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, 2006.

John Lyons. Semantics: Volume 1. Cambridge University Press, 1st edition, 1977a. ISBN 978-0521291866.

John Lyons. Semantics: Volume 2. Cambridge University Press, 1st edition, 1977b. ISBN 978-0521291866.

Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1st edition, 1999. ISBN 978-0262133609.

Jiří Materna. LDA-Frames: An Unsupervised Approach to Generating Semantic Frames. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part I, CICLing’12, pages 376–387, Berlin, Heidelberg, 2012a. Springer-Verlag. ISBN 978-3-642-28603-2.

Jiří Materna. Building A Thesaurus Using LDA-Frames. In 6th Workshop on Recent Advances in Slavonic Natural Language Processing, pages 97–103. Tribun EU, 2012b. ISBN 978-80-263-0313-8.

Jiří Materna. Parameter Estimation for LDA-Frames. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 482–486, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

Paola Merlo and Eva Esteve Ferrer. The Notion of Argument in Prepositional Phrase Attachment. Computational Linguistics, 32(3):341–378, 2006. ISSN 0891-2017.

Nicholas Metropolis and Stanislaw M. Ulam. The Monte Carlo Method. Journal of the American Statistical Association, 44(247):335–341, September 1949. ISSN 01621459.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. ISSN 00219606.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. Five Papers on WordNet. Technical report, Princeton University, 1993.

Thomas P. Minka. Estimating a Dirichlet distribution. Technical report, Microsoft Research, 2009.

Marvin Minsky. A Framework for Representing Knowledge. In Psychology of Computer Vision. McGraw-Hill, New York, 1975.

Ashutosh Modi, Ivan Titov, and Alexandre Klementiev. Unsupervised Induction of Frame-semantic Representations. In Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure, WILS ’12, pages 1–7, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

Roberto Navigli. Word sense disambiguation: A survey. ACM Comput. Surv., 41(2):10:1–10:69, February 2009. ISSN 0360-0300.

David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10:1801–1828, December 2009. ISSN 1532-4435.

Mark Newman and Gerard T. Barkema. Monte Carlo Methods in Statistical Physics. Oxford University Press, 1999. ISBN 0198517963.

I. Niles and A. Pease. Towards a Standard Upper Ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems, pages 2–9, 2001.

Peter Norvig, Alon Halevy, and Fernando Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2):8–12, March 2009. ISSN 1541-1672.

Klára Osolsobě and Karel Pala. Czech Stem Dictionary. In Sborník prací filozofické fakulty brněnské univerzity, pages 51–60. Masarykova Univerzita, Brno, 1993.

Karel Pala and Pavel Smrž. Building Czech WordNet. In Romanian Journal of Information Science and Technology, volume 7, pages 1–2. Romanian Academy, 2004.

Karel Pala and Pavel Ševeček. Valence českých sloves. In Sborník prací FFBU, pages 41–54. Masarykova Univerzita, Brno, 1997.

Giorgio Parisi. Statistical Field Theory. Westview Press, 1998. ISBN 978-0738200514.

Charles Sanders Peirce. Collected Papers. Harvard University Press, 1931–1958. ISBN 978-0674138001.

Steven T. Piantadosi, Harry Tily, and Edward Gibson. The Communicative Function of Ambiguity in Language. Cognition, 2011.

Jim Pitman. Poisson–Dirichlet and GEM Invariant Distributions for Split-and-Merge Transformations of an Interval Partition. Combinatorics, Probability and Computing, 11(5):501–514, 2002. ISSN 0963-5483.

Alexandrin Popescul, Lyle H. Ungar, David M. Pennock, and Steve Lawrence. Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 437–444, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1.

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, and Daniel Jurafsky. Semantic Role Labeling Using Different Syntactic Views. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 2005, pages 581–588, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. A Comprehensive Grammar of the English Language. Longman, 2nd revised edition, 1985. ISBN 978-0582517349.

Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, MA, USA, 1988. ISBN 0-521-34750-5.

Adrian E. Raftery and Steven Lewis. How Many Iterations in the Gibbs Sampler? In Bayesian Statistics 4, pages 763–773. Oxford University Press, 1992.

Howard Raïffa and Robert Schlaifer. Applied Statistical Decision Theory. Wiley classics library. Wiley, 2000. ISBN 9780471383499.

Alan Ritter, Mausam, and Oren Etzioni. A Latent Dirichlet Allocation Method for Selectional Preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 424–434, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005. ISBN 0387212396.

G. Ronning. Maximum Likelihood Estimation of Dirichlet Distributions. Journal of Statistical Computation and Simulation, 32(4):215–221, 1989.

Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. Inducing a Semantically Annotated Lexicon via EM-based Clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 104–111, Stroudsburg, PA, USA, 1999. Association for Computational Linguistics. ISBN 1-55860-609-3.

J. Ruppenhofer, M. Ellsworth, M. R. L. Petruck, C. R. Johnson, and J. Scheffczyk. FrameNet II: Extended Theory and Practice, 2006. http://www.icsi.berkeley.edu/framenet.

Joe Sachs. Aristotle’s Metaphysics. Green Lion Press, Santa Fe, New Mexico, 1999. ISBN 1-888009-03-9.

John I. Saeed. Semantics. Wiley-Blackwell, 3rd edition, 2009. ISBN 978-1405156394.

Karin Kipper Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 2005.

Diarmuid Ó Séaghdha. Latent Variable Models of Selectional Preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 435–444, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

Jayaram Sethuraman. A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4(2):639–650, 1994.

Petr Sgall, Eva Hajičová, and Jarmila Panevová. The Meaning of the Sentence in its Semantic and Pragmatic Aspects. D. Reidel Publishing Company, Dordrecht, 1986.

Andrzej Siemiński. WordNet Based Word Sense Disambiguation. In Proceedings of the Third international conference on Computational collective intelligence: technologies and applications - Volume Part II, ICCCI’11, pages 405–414. Springer-Verlag, 2011. ISBN 978-3-642-23937-3.

John Simpson and Edmund Weiner. The Oxford English Dictionary. Clarendon Press, 2nd edition, 1989. ISBN 978-0870571053.

Mark Steyvers and Tom Griffiths. Probabilistic Topic Models. Lawrence Erlbaum Associates, 2007. ISBN 1410615340.

Carlos Subirats and Miriam Petruck. Surprise: Spanish FrameNet. In International Congress of Linguists. Workshop on Frame Semantics. Prague, Matfyzpress, 2003.

Robert S. Swier and Suzanne Stevenson. Unsupervised Semantic Role Labelling. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP ’04, 2004.

Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101:1566–1581, 2006.

Lucien Tesnière. Éléments de syntaxe structurale. Klincksieck, Paris, 1959. ISBN 2252018615.

Ivan Titov and Alexandre Klementiev. A Bayesian Approach to Unsupervised Semantic Role Induction. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’12, pages 12–22, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. ISBN 978-1-937284-19-0.

Piek Vossen and Graeme Hirst. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.

Zdeněk Žabokrtský. Valency Lexicon of Czech Verbs. PhD thesis, Charles University, Prague, 2005.

Bruce Walsh. Markov Chain Monte Carlo and Gibbs Sampling. Lecture Notes for EEB 581, version 26, 2004.

Nicolas Wicker, Jean Muller, Ravi Kiran Reddy Kalathur, and Olivier Poch. A Maximum Likelihood Approximation Method for Dirichlet’s Parameter Estimation. Computational Statistics and Data Analysis, 52(3):1315–1322, January 2008. ISSN 0167-9473.

Ludwig Lazarus Zamenhof. Fundamento de Esperanto. Edistudio, Italy, 1st edition, 1991. ISBN 978-8870360462.

INDEX

acyclic, 45
antonymy, 7
aperiodicity, 53
Aspect Model, 71
bag-of-words representation, 69
base distribution, 92
Bayesian network, 41
Bernoulli distribution, 60
beta distribution, 61
beta function, 112
binomial coefficient, 60
binomial distribution, 60
burn-in period, 56
calculus of variations, 49
categorical distribution, 61
Chapman-Kolmogorov equation, 53
child frame, 33
Chinese restaurant franchise, 99
Chinese restaurant process, 91
classification, 45
collapsed Gibbs sampler, 58
compositional semantics, 5
conditional independence, 42
conjugate prior, 58, 59
core frame element, 32
cycle, 45
deep case, 31
dependent, 124
digamma function, 111
directed acyclic graph, 45
Dirichlet distribution, 63
Dirichlet process, 91
Dirichlet-Multinomial distribution, 110
discriminative model, 45
domain-specific ontology, 8
exponential distribution, 65
extra-thematic frame element, 32
factor, 42
frame, 9
frame element, 31, 32
frame semantics, 31
functional, 49
functional generative description, 36
gamma distribution, 64, 113
gamma function, 61
GEM distribution, 95
generative model, 45
Gibbs sampler, 57
governor, 124
graph edge, 41
graph node, 41
head, 124
hidden variable, 44
Hierarchical Dirichlet process, 98
holonymy, 7
homonymy, 5
hypernymy, 7
hyponymy, 7
irreducibility, 53
Kullback-Leibler divergence, 49
Latent Dirichlet Allocation, 68, 74
Latent Semantic Analysis, 68
Latent Semantic Indexing, 67
latent variable, 44
lexical semantics, 5
lexical unit, 5, 31
lexicon, 5
logistic regression, 46
Markov chain, 53
Markov chain Monte Carlo, 48
Markov property, 53
Markov random field, 41
mean, 43
Mean-Field approximation, 50
meronymy, 7
Metropolis-Hastings algorithm, 53
Monte Carlo, 50
multinomial distribution, 61
naïve Bayes classifier, 46
near synonymy, 6
non-parametric distribution, 59
normal distribution, 43
observed variable, 44
parametric distribution, 59
parent frame, 33
peripheral frame element, 32
perplexity, 131
plate notation, 43
polysemy, 5
precision, 43
prepositional phrase attachment, 17
probabilistic graphical model, 41
Probabilistic Latent Semantic Analysis, 68
proposal distribution, 52
rate parameter, 64
regression, 45
rejection sampling, 52
reversibility condition, 54
self-conjugate prior, 59
semantic frame, 5, 22, 31
semantic role, 10
semantic role labeling, 21
semantic type, 34
shape parameter, 64
similarity set, 145
simplex, 63
simulated annealing, 56
Singular Value Decomposition, 68
slot, 9
standard deviation, 43
stationary distribution, 54
statistical induction, 45
statistical inference, 45, 48
synonymy, 6
synset, 4
thematic role grid, 22
token, 4
transition matrix, 53
transition probability, 53
type, 4
upper ontology, 8
valency frame, 22
variance, 43
variational inference, 48
verb valency, 22
weighted maximum bipartite matching problem, 135
word, 4
word sense disambiguation, 3
word sense induction, 3
word sketch, 146
word-form, 4