Modeling Subcategorization Through Co-Occurrence Outline
Total Page:16
File Type:pdf, Size:1020Kb
Introducing LexIt Building Distributional Profiles Ongoing Work Conclusions Outline Modeling Subcategorization Through Co-occurrence 1 Introducing LexIt A Computational Lexical Resource for Italian Verbs The Project Distributional Profiles 1 2 Gabriella Lapesa ,AlessandroLenci 2 Building Distributional Profiles Pre-processing 1University of Osnabr¨uck, Institute of Cognitive Science Subcategorization Frames 2 University of Pisa, Department of Linguistics Lexical sets Selectional preferences Explorations in Syntactic Government and Subcategorisation nd University of Cambridge, 2 September 2011 3 Ongoing Work 4 Conclusions Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 2/ 38 Introducing LexIt Introducing LexIt Building Distributional Profiles The Project Building Distributional Profiles The Project Ongoing Work Distributional Profiles Ongoing Work Distributional Profiles Conclusions Conclusions Computational approaches to argument structure LexIt: a computational lexical resource for Italian The automatic acquisition of lexical information from corpora is a longstanding research avenue in computational linguistics LexIt is a computational framework for the automatic acquisition subcategorization frames (Korhonen 2002, Schulte im Walde and exploration of corpus-based distributional profiles of Italian 2009, etc.) verbs, nouns and adjectives selectional preferences (Resnik 1993, Light & Greiff 2002, Erk et al. 2010, etc.) LexIt is publicly available through a web interface: verb classes (Merlo & Stevenson 2001, Schulte im Walde 2006, http://sesia.humnet.unipi.it/lexit/ Kipper-Schuler et al. 2008, etc.) Corpus-based information has been used to build lexical LexIt is the first large-scale resource of such type for Italian, resources aiming at characterizing the valence properties of predicates cf. VALEX for English (Korohnen et al. 2006), LexSchem for fully on distributional ground French (Messiant et al. 2008), etc. Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 3/ 38 Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 4/ 38 Introducing LexIt Introducing LexIt Building Distributional Profiles The Project Building Distributional Profiles The Project Ongoing Work Distributional Profiles Ongoing Work Distributional Profiles Conclusions Conclusions The LexIt Distributional Profiles Modeling subcategorization through co-occurrence The distributional profile for a word w is an array of statistical Distributional profiles in LexIt are automatically extracted information extracted from a corpus to characterize the from large corpora with computational linguistics tools distributional behavior of w (without any manual revision) The Lexit distributional profiles include: The LexIt profiles contain statistical indexes to identify the most salient and prototypical distributional features of syntactic profiles,specifyingthesyntactic slots (subject, predicates: complements, modifiers, etc.) and syntactic frames with co-occurrence frequency which predicates co-occur association measures semantic profiles, composed by: Corpus-derived statistics are used to model the association the lexical sets with the most prototypical fillers realizing the between verbs and syntactic constructions, lexical fillers and syntactic slots; semantic classes as a gradient preference instead of the semantic classes characterizing the selectional preferences categorical selection of syntactic slots Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 5/ 38 Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 6/ 38 Introducing LexIt Introducing LexIt Pre-processing Building Distributional Profiles The Project Building Distributional Profiles Subcategorization Frames Ongoing Work Distributional Profiles Ongoing Work Lexical sets Conclusions Conclusions Selectional preferences Association Measures The LexIt framework “A simple association measure interprets co-occurrence frequency O by comparison with the expected frequency E,andcalculates and association score as a quantitative measure for the attraction between two words” (Evert, 2008:18) LexIt is an open and parametrizable framework source corpora Local Mutual Information (Evert, 2008) part of speech to be profiled definition of subcategorization frames O LMI = O log2 (1) statistical indexes × E semantic classes for selectional preferences, etc. Key properties of LMI: Today we focus on the acquisition of distributional profiles for downgrades the risk of overestimating the significance of low Italian verbs frequency events is a two-sided measure: quantifies both attraction and repulsion Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 7/ 38 Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 8/ 38 Introducing LexIt Pre-processing Introducing LexIt Pre-processing Building Distributional Profiles Subcategorization Frames Building Distributional Profiles Subcategorization Frames Ongoing Work Lexical sets Ongoing Work Lexical sets Conclusions Selectional preferences Conclusions Selectional preferences Building distributional profiles Pre-processing Tokenization, Lemmatization, Part-of-speech tagging TANL (Text Analytics and Natural Language), a suite of Pre-processing: linguistic analysis with automatic tools modules for Italian Natural Language Processing developed by Extraction of subcategorization frames from parsed text the University of Pisa and ILC-CNR Assignment of lexical sets to argument slots Dependency Parsing Selectional preferences: from lexical sets to semantic classes DeSR,astochasticdependencyparser(Attardi&Dell’Orletta 2009) dependency trees are constructed without relying on any subcategorization lexicon Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 9/ 38 Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 10/ 38 Introducing LexIt Pre-processing Introducing LexIt Pre-processing Building Distributional Profiles Subcategorization Frames Building Distributional Profiles Subcategorization Frames Ongoing Work Lexical sets Ongoing Work Lexical sets Conclusions Selectional preferences Conclusions Selectional preferences Building distributional profiles Subcategorization frames Subcategorization Frame (SCF): represents a pattern of syntactic dependencies headed by the target lemma Pre-processing: linguistic analysis with automatic tools is formed by an unordered set of slots, representing argument Extraction of subcategorization frames from parsed text positions (i.e., subject, object, etc.) Assignment of lexical sets to argument slots is identified by a synthetic label Selectional preferences: from lexical sets to semantic classes Verb SCFs also include: the zero argument construction Gianni `earrivato “John arrived” subj#0 ⇒ the reflexive pronoun si Il vaso si `erotto “The vase si-broke” subj#si#0 ⇒ Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 11/ 38 Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 12/ 38 Introducing LexIt Pre-processing Introducing LexIt Pre-processing Building Distributional Profiles Subcategorization Frames Building Distributional Profiles Subcategorization Frames Ongoing Work Lexical sets Ongoing Work Lexical sets Conclusions Selectional preferences Conclusions Selectional preferences Subcategorization frames Subcategorization frames No formal distinction is made between arguments and adjuncts Frame: SUBJ#OBJ#COMP-A abitare al mare (“to live at the sea”) subj#comp-a Target: dare (“to give”), freq 336731; Frame-verb: freq 107388, mangiare al mare (“to eat at the sea”)⇒ subj#comp-a LMI 327656 ⇒ information between argument-adjuncts is not explicitly Gianni ha dato il libro a Maria “Gianni gave the book to encoded in the parser Mary” arguments and adjuncts are notoriously hard to discriminate Gianni ha dato a Maria il libro “Gianni gave Mary the book” For each frame, the LexIt profiles also specify the most prototypical: Gianni ha generosamente dato a Maria il libro “Gianni gave verbal modifiers Mary the book generously” entrare correndo (“to run into”) (Lui) ha dato il libro a sua madre piangendo “(He) gave the adverbial modifiers book to his mother crying” correre velocemente (“to run fast”) Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 13/ 38 Gabriella Lapesa, Alessandro Lenci Modeling Subcategorization Through Co-occurrence 14/ 38 Introducing LexIt Pre-processing Introducing LexIt Pre-processing Building Distributional Profiles Subcategorization Frames Building Distributional Profiles Subcategorization Frames Ongoing Work Lexical sets Ongoing Work Lexical sets Conclusions Selectional preferences Conclusions Selectional preferences Subcategorization frames Extracting subcategorization frames SUBJ#SI#0 1 104 SCFs were selected among the most frequent syntactic Target: rompere (“to break”) freq 52537; Frame-verb: freq dependency combinations in the parsed corpus 1980, LMI 3293. 2 The joint frequency between each verb and the SCFs was Il vetro si `erotto “The glass broke” computed from the verb dependency patterns automatically Il vetro si rompe facilmente “Glass breakes easily” extracted from the parsed corpus Target: fermare (“to stop”) freq 52537; Frame-verb : freq 10967, LMI 28864. 3 The statistical salience of each SCFs with the target word was La macchina si `efermata frenando “The car stopped braking” estimated with