Universal Dependencies
Linguistic Institute 2017
Corpus Linguistics: Treebanks & Dependencies
Nathan Schneider ([email protected])
Amir Zeldes ([email protected])

Corpus examples on the title slide:
‣ The <richer> the people, the <bigger> the crates
‣ the more <literal> the <better>
‣ the less <elaborate> you can be the <better>
‣ The <slower> the <better>
‣ the <higher> the temperature the <darker> the malt
‣ The <older> a database is, the <richer> it becomes
‣ The <higher> the clay content, the more <sticky> the soil
‣ The <fresher> the soot, the <longer> it needed
‣ the <higher> the number, the <greater> the protection
‣ The <older> you are the <greater> the chance
‣ The <smaller> the mesh the more <expensive> the net

Taking Stock
Thus far in this course:
• Why Corpora?
‣ Issues in assembling corpora
• Markup, tokenization, metadata
• Searching, frequency, collocations
• Annotation: POS

Coming Attractions
• Syntactic Treebanks
‣ Annotation: Universal Dependencies
• Computational Lexical Semantics
‣ Annotation: Entity Types
‣ WordNet, Frame Semantics, Distributional Vectors

POS Tags Aren't Everything
• A POS tag helps narrow down what grammatical constructions a word may participate in.
‣ The main challenge in: "Will Trey Gowdy Benghazi Trump with inquiries?"
‣ But the tag doesn't exactly specify how the token relates to other tokens in the sentence.
• We want to be able to search corpora for words in certain syntactic contexts. This helps us answer questions like:
‣ When does adverbial home tend to precede vs. follow the object? (Fillmore 1992, p. 48: take home the leftovers vs. take the leftovers home)
‣ What kinds of nouns prefer to be subjects vs. objects?
‣ Which verbs can take infinitival complements?

Ambiguity beyond POS
• Lots of constructions = lots of ambiguity! Sometimes humans even notice it:
‣ PP attachment: "Illinois Sends Bill Allowing Gay Marriage to Governor"
‣ Adjective attachment: "Police Shoot Dead Suspect Inside L.A. Emergency Room"
‣ Verb argument attachment: "Attorneys for Afghan family detained by immigration officials in Los Angeles obtain restraining order" (Who was detained?)
‣ Verb argument function (depends on VVD vs. VVN): "Top U.N. Climate Official Denied Meeting with U.S. Secretary of State"
• We also want to be able to resolve ambiguity automatically for natural language understanding.
‣ Foreshadowing future lectures: we'll need additional representations for sense ambiguity, e.g. in "Campaign manager for Donald Trump is charged with battery"
Credit: Dirk Hovy and Jonathan May for pointing out some of these headlines

Treebanks
• A treebank is a corpus of sentences with syntactic trees, used for
‣ Corpus-based studies involving syntax
‣ Evaluating syntactic parsers
‣ Training statistical syntactic parsers
• Gold standard trees: human-annotated from scratch or manually corrected parser output
‣ Silver trees: uncorrected parser output; will contain many errors

Types of Syntax Trees
• Phrase structure or constituency trees
‣ Nested bracketing of the sentence, usually with constituent labels like S (sentence), VP (verb phrase), etc., down to POS tags for individual tokens
• Dependency trees
‣ Edges (often labeled) connect words directly (see the sketch below for a side-by-side comparison of the two encodings)
• Other formalisms: CCG, LFG, HPSG, construction grammar, etc. have other kinds of structure, such as nested feature structures
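To make the contrast concrete, here is a minimal sketch of the two encodings for one sentence. It assumes NLTK is installed; the dependency edges are written by hand for illustration (using UD-style relation names), not produced by a parser.

```python
# A minimal sketch contrasting the two encodings of "kids saw birds with fish".
# Assumes NLTK is installed (pip install nltk); the dependency edges below are
# hand-written for illustration, not produced by a parser.
from nltk import Tree

# Constituency: nested bracketing with phrase labels down to POS tags.
constituency = Tree.fromstring(
    "(S (NP (NNS kids)) (VP (VBD saw) (NP (NP (NNS birds)) (PP (IN with) (NP (NN fish))))))"
)
constituency.pretty_print()

# Dependency: labeled edges connect words directly (head index, relation, dependent index).
# Token indices: 1 kids, 2 saw, 3 birds, 4 with, 5 fish; 0 is the artificial root.
dependencies = [
    (2, "nsubj", 1),   # kids  <- saw
    (0, "root",  2),   # saw   <- root
    (2, "obj",   3),   # birds <- saw
    (5, "case",  4),   # with  <- fish
    (3, "nmod",  5),   # fish  <- birds   (PP attached to the noun)
]
for head, rel, dep in dependencies:
    print(f"{head:>2} -{rel}-> {dep}")
```

The same syntactic analysis is stored either as nested phrases or as a flat list of word-to-word edges; the next slides look at corpora built around each choice.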
Constituency Treebanks
• English
‣ Penn Treebank (PTB; Marcus et al., 1993): primarily trees for 1M words of Wall Street Journal news articles from 1989.
‣ OntoNotes (Hovy et al., 2006; Pradhan et al., 2013): extends the PTB with more genres (broadcast news, web, …), two additional languages (Arabic, Chinese), and semantic annotations (word senses, named entities, coreference, PropBank predicate-argument structures). OntoNotes 5.0 has 3M words: 50% English, 40% Chinese, 10% Arabic.
‣ English Web Treebank (Bies et al., 2012): PTB-style trees, 5 genres (blogs, email, newsgroups, reviews, question-answers), 250K words.
• French Treebank, TIGER (German), …
• Other formalisms: CCGbank (converted from the PTB), Redwoods (HPSG), …

Dependency Treebanks
• Prague School dependency syntax
‣ Czech–English Dependency Treebank: WSJ sentences and their Czech translations, in a rich multilayer formalism
• Universal Dependencies (UD): corpora in several dozen languages and various genres
‣ including a manually corrected conversion of the English Web Treebank
• Twitter: e.g., Tweebank (Kong et al., 2014)

Cost & Licensing
• Creating a high-quality treebank is expensive:
‣ Need to hire linguists (often linguistics grad students)
‣ Training and guidelines development
‣ Speed of annotation
‣ Quality control
• The PTB and many other corpora are licensed through the Linguistic Data Consortium (LDC) and similar entities

Constituency Trees

Corpus hits for <tree bank>:
F331 Encourage the creation of local marketing cooperatives and <tree banks> for small woodland growers.
…sion of the corpus into a standardised knowledge-representation language, or <tree bank>, held on computer

[Figure: a standard generative tree for a German sentence, glossed "Though since days thick clouds the sky cover, tells the weather-report fore, that towards evening the sun shines" (with traces and empty categories)]

[Figure: a computational parse equivalent of the same sentence, PTB style]

The first sentence of the Penn Treebank (Wall Street Journal):
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken))
           (, ,)
           (ADJP (NP (CD 61) (NNS years)) (JJ old))
           (, ,))
   (VP (MD will)
       (VP (VB join)
           (NP (DT the) (NN board))
           (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
           (NP-TMP (NNP Nov.) (CD 29))))
   (. .))

• POS tags (preterminals) sit directly above the tokens
• Constituents are the nonterminals
• Branching is not necessarily binary
• Some constituents have function tags: subject (NP-SBJ), temporal (NP-TMP), PP-CLR = "closely related" PP (see the search sketch below)
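Once trees are in this bracketed form, corpus searches over constituents become simple tree traversals. Below is a minimal sketch, assuming NLTK and using the Vinken bracketing above, that pulls out every constituent carrying a function tag such as -SBJ, -TMP, or -CLR.

```python
# A minimal sketch of searching a PTB-style tree for function-tagged constituents.
# Assumes NLTK is installed; the bracketing is the Vinken sentence from the slide.
from nltk import Tree

vinken = Tree.fromstring("""
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
           (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
   (VP (MD will)
       (VP (VB join)
           (NP (DT the) (NN board))
           (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
           (NP-TMP (NNP Nov.) (CD 29))))
   (. .))
""")

# Walk all subtrees and keep those whose label carries a function tag
# (e.g. NP-SBJ, NP-TMP, PP-CLR); print the tag and the words it spans.
for subtree in vinken.subtrees(lambda t: "-" in t.label()):
    print(f"{subtree.label():8s} {' '.join(subtree.leaves())}")
```

Running this prints the subject, the "closely related" PP, and the temporal NP, which is exactly the kind of syntactic-context query the earlier slides motivated.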
Lexicalized Constituency Parse
kids saw birds with fish

(S-saw (NP-kids kids)
       (VP-saw (V-saw saw)
               (NP-birds (NP-birds birds)
                         (PP-fish (P-with with) (NP-fish fish)))))

It is sometimes useful to create a lexicalized constituency parse, where each nonterminal label includes the phrasal head. (How would you determine this?)

Head Rules
• A set of head rules systematizes the selection of a lexical head for each constituent so it can be done automatically.

Parent  Direction  Priority List
ADJP    Left       NNS QP NN $ ADVP JJ VBN VBG ADJP JJR NP JJS DT FW RBR RBS SBAR RB
ADVP    Right      RB RBR RBS FW ADVP TO CD JJR JJ IN NP JJS NN
PRN     Left
PRT     Right      RP
QP      Left       $ IN NNS NN JJ RB DT CD NCD QP JJR JJS
S       Left       TO IN VP S SBAR ADJP UCP NP
SBAR    Left       WHNP WHPP WHADVP WHADJP IN DT S SQ SINV SBAR FRAG
VP      Left       TO VBD VBN MD VBZ VB VBG VBP VP ADJP NN NNS NP

Selected head rules from Collins (1999); a set of head rules is often called a head percolation table. (Jurafsky & Martin, SLP3 online draft, ch. 11, Figure 11.12)

‣ Traverse the tree bottom-up. The last row says how to choose a head for a VP constituent:
✴ First scan its daughters from left to right until a TO node is encountered; if one is found, copy its head (the word it is the tag for).
✴ Otherwise, scan its daughters from left to right until a VBD node is encountered; if one is found, copy its head.
✴ …
• Different headedness conventions (e.g., for PPs: the P or the N?) require different head rules (a minimal implementation is sketched below).
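The head-finding procedure itself is only a few lines of code. Here is a minimal sketch, assuming NLTK and using only a simplified subset of the Collins rules from the table above (the NP and PP entries are crude stand-ins for illustration, not the full Collins treatment), that percolates lexical heads up a tree bottom-up.

```python
# A minimal head-percolation sketch over an NLTK tree, using a simplified subset
# of the Collins (1999) head rules from the table above. Real implementations
# cover many more categories and special cases (especially NPs and coordination).
from nltk import Tree

# parent label -> (direction to scan daughters, priority list of daughter labels)
HEAD_RULES = {
    "S":  ("left",  ["TO", "IN", "VP", "S", "SBAR", "ADJP", "UCP", "NP"]),
    "VP": ("left",  ["TO", "VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP",
                     "VP", "ADJP", "NN", "NNS", "NP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),   # crude stand-in for Collins' NP rule
    "PP": ("left",  ["IN", "TO"]),                 # P-headed convention for PPs
}

def find_head(tree):
    """Return the lexical head word of a tree, percolating heads bottom-up."""
    if isinstance(tree, str):                      # a bare token
        return tree
    if len(tree) == 1 and isinstance(tree[0], str):
        return tree[0]                             # preterminal: head is its token
    direction, priority = HEAD_RULES.get(tree.label(), ("left", []))
    daughters = list(tree) if direction == "left" else list(reversed(tree))
    for label in priority:                         # try each priority label in turn
        for daughter in daughters:
            if daughter.label() == label:
                return find_head(daughter)
    return find_head(daughters[0])                 # fall back to the first daughter scanned

sentence = Tree.fromstring(
    "(S (NP (NNS kids)) (VP (VBD saw) (NP (NP (NNS birds)) (PP (IN with) (NP (NN fish))))))"
)
for subtree in sentence.subtrees():
    print(f"{subtree.label():4s} -> {find_head(subtree)}")
```

Applied to "kids saw birds with fish", this labels S and VP with "saw", both NPs over "birds" with "birds", and the PP with "with"; swapping the PP rule to prefer the nominal daughter would instead give the N-headed convention mentioned in the last bullet.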