LI Corpling 5 6 Dependencies.Pdf


Linguistic Institute 2017
Corpus Linguistics
Treebanks & Dependencies
Nathan Schneider
Amir Zeldes

The <richer> the people, the <bigger> the crates
the more <literal> the <better>
the less <elaborate> you can be the <better>
The <slower> the <better>
the <higher> the temperature the <darker> the malt
The <older> a database is, the <richer> it becomes
The <higher> the clay content, the more <sticky> the soil
The <fresher> the soot, the <longer> it needed
the <higher> the number, the <greater> the protection
The <older> you are the <greater> the chance
The <smaller> the mesh the more <expensive> the net

Taking Stock
Thus far in this course:
• Why Corpora?
  ‣ Issues in assembling corpora
• Markup, tokenization, metadata
• Searching, frequency, collocations
• Annotation: POS

Coming Attractions
• Syntactic Treebanks
  ‣ Annotation: Universal Dependencies
• Computational Lexical Semantics
  ‣ Annotation: Entity Types
  ‣ WordNet, Frame Semantics, Distributional Vectors

POS Tags Aren’t Everything
• A POS tag helps narrow down what grammatical constructions a word may participate in.
  ‣ The main challenge in: “Will Trey Gowdy Benghazi Trump with inquiries?”
  ‣ But the tag doesn’t specify exactly how the token relates to other tokens in the sentence.
• We want to be able to search corpora for words in certain syntactic contexts. This helps us answer questions like:
  ‣ When does adverbial home tend to precede vs. follow the object? (Fillmore 1992, p. 48: take home the leftovers vs. take the leftovers home)
  ‣ What kinds of nouns prefer to be subjects vs. objects?
  ‣ Which verbs can take infinitival complements?

Ambiguity beyond POS
• Lots of constructions = lots of ambiguity! Sometimes humans even notice it:
  ‣ PP attachment: “Illinois Sends Bill Allowing Gay Marriage to Governor”
  ‣ Adjective attachment: “Police Shoot Dead Suspect Inside L.A. Emergency Room”
  ‣ Verb argument attachment: “Attorneys for Afghan family detained by immigration officials in Los Angeles obtain restraining order” (Who was detained?)
  ‣ Verb argument function (depends on VVD vs. VVN): “Top U.N. Climate Official Denied Meeting with U.S. Secretary of State”
• We also want to be able to resolve ambiguity automatically for natural language understanding.
  ‣ Foreshadowing future lectures: We’ll need additional representations for sense ambiguity, e.g. in “Campaign manager for Donald Trump is charged with battery”
Credit: Dirk Hovy and Jonathan May for pointing out some of these headlines

Treebanks
• A treebank is a corpus of sentences with syntactic trees, used for:
  ‣ Corpus-based studies involving syntax
  ‣ Evaluating syntactic parsers
  ‣ Training statistical syntactic parsers
• Gold standard trees: human-annotated from scratch, or manually corrected parser output
  ‣ Silver trees: uncorrected parser output, which will contain many errors

Types of Syntax Trees
• Phrase structure or constituency trees
  ‣ Nested bracketing of the sentence, usually with constituent labels like S (sentence), VP (verb phrase), etc., down to POS tags for individual tokens
• Dependency trees
  ‣ Edges (often labeled) connect words directly
• Other formalisms: CCG, LFG, HPSG, construction grammar, etc. have other kinds of structure, such as nested feature structures

Constituency Treebanks
• English
  ‣ Penn Treebank (PTB; Marcus et al., 1993): primarily trees for 1M words of Wall Street Journal news articles from 1989.
  ‣ OntoNotes (Hovy et al., 2006; Pradhan et al., 2013): extends PTB with more genres (broadcast news, web, …), two additional languages (Arabic, Chinese), and semantic annotations (word senses, named entities, coreference, PropBank predicate-argument structures). OntoNotes 5.0 has 3M words: 50% English, 40% Chinese, 10% Arabic.
  ‣ English Web Treebank (Bies et al., 2012): PTB-style trees for 5 genres (blogs, email, newsgroups, reviews, questions-answers). 250K words.
• French Treebank, TIGER (German), …
• Other formalisms: CCGBank (converted from PTB); Redwoods (HPSG), …

Dependency Treebanks
• Prague School dependency syntax
  ‣ Czech–English Dependency Treebank: WSJ sentences and their Czech translations, in a rich multilayer formalism
• Universal Dependencies (UD) corpora in several dozen languages and various genres
  ‣ including a manually-corrected conversion of the English Web Treebank
• Twitter: e.g., Tweebank (Kong et al. 2014)

Cost & Licensing
• Creating a high-quality treebank is expensive
  ‣ Need to hire linguists (often linguistics grad students)
  ‣ Training, guidelines development
  ‣ Speed of annotation
  ‣ Quality control
• PTB and many other corpora are licensed through the Linguistic Data Consortium (LDC) and similar entities

Constituency Trees

F331 Encourage the creation of local marketing cooperatives and <tree banks> for small woodland growers.
sion of the corpus into a standardised knowledge-representation language, or <tree bank>, held on computer

[Figure: A standard generative tree for German, with 0 and t nodes. Word-by-word gloss: “Though since days thick clouds the sky cover, tells the weather-report fore, that towards evening the sun shines”]

[Figure: A computational parse equivalent (PTB style), Penn Treebank. Word-by-word gloss: “Though since days thick clouds the sky cover, tells the current weather-report fore, that towards evening the sun shines”]

The first sentence of the Penn Treebank (Wall Street Journal)

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

    (S
      (NP-SBJ
        (NP (NNP Pierre) (NNP Vinken))
        (, ,)
        (ADJP (NP (CD 61) (NNS years)) (JJ old))
        (, ,))
      (VP (MD will)
        (VP (VB join)
          (NP (DT the) (NN board))
          (PP-CLR (IN as)
            (NP (DT a) (JJ nonexecutive) (NN director)))
          (NP-TMP (NNP Nov.) (CD 29))))
      (. .))

Highlighted in this tree:
• POS tags (preterminals) / tokens
• constituents (nonterminals)
• nonbinary branching
• some constituents have function tags (subject, temporal, PP-CLR = “closely related” PP)

Lexicalized Constituency Parse

kids saw birds with fish

    (S-saw
      (NP-kids kids)
      (VP-saw
        (V-saw saw)
        (NP-birds
          (NP-birds birds)
          (PP-fish
            (P-with with)
            (NP-fish fish)))))

It is sometimes useful to create a lexicalized constituency parse, where each nonterminal label includes the phrasal head. (How would you determine this?)

Head Rules
• A set of head rules systematizes the selection of a lexical head for each constituent so it can be done automatically.

    Parent  Direction  Priority List
    ADJP    Left       NNS QP NN $ ADVP JJ VBN VBG ADJP JJR NP JJS DT FW RBR RBS SBAR RB
    ADVP    Right      RB RBR RBS FW ADVP TO CD JJR JJ IN NP JJS NN
    PRN     Left
    PRT     Right      RP
    QP      Left       $ IN NNS NN JJ RB DT CD NCD QP JJR JJS
    S       Left       TO IN VP S SBAR ADJP UCP NP
    SBAR    Left       WHNP WHPP WHADVP WHADJP IN DT S SQ SINV SBAR FRAG
    VP      Left       TO VBD VBN MD VBZ VB VBG VBP VP ADJP NN NNS NP

    Figure 11.12 Selected head rules from Collins (1999). The set of head rules is often called a head percolation table. (Jurafsky & Martin: SLP3 online draft, ch. 11)

  ‣ Traverse the tree bottom-up. The last row says how to choose a head for a VP constituent:
    ✴ First scan its daughters from left to right until a TO node is encountered; if one is found, copy its head (the word it is the tag for).
    ✴ Otherwise, scan its daughters from left to right until a VBD node is encountered; if one is found, copy its head.
    ✴ …
• Different headedness conventions (e.g., for PPs: the P or the N?) require different head rules.
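The head-rule procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Collins (1999) implementation: the S and VP rows are copied from the table, but the NP and PP rules, the fallback to the first daughter scanned, and all names (`Node`, `parse`, `assign_heads`, `lexicalized`, `dependencies`) are hypothetical choices made for this example. The PP rule picks the noun phrase so that the output matches the PP-fish label in the lexicalized tree for "kids saw birds with fish"; the PTB-style tags on that sentence are also assumed.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    word: Optional[str] = None            # set only on preterminals (POS tags)
    head: Optional[str] = None            # lexical head, filled in by assign_heads
    head_child: Optional["Node"] = None   # daughter the head was copied from

def parse(text):
    """Read one PTB-style bracketed tree, e.g. (S (NP (NNS kids)) ...)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def read():
        nonlocal pos
        assert tokens[pos] == "("
        node = Node(tokens[pos + 1])
        pos += 2
        if tokens[pos] != "(":            # preterminal: (TAG word)
            node.word = tokens[pos]
            pos += 1
        else:
            while tokens[pos] == "(":
                node.children.append(read())
        assert tokens[pos] == ")"
        pos += 1
        return node
    return read()

HEAD_RULES = {
    # Rows copied from the head percolation table above.
    "S":  ("left", ["TO", "IN", "VP", "S", "SBAR", "ADJP", "UCP", "NP"]),
    "VP": ("left", ["TO", "VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP",
                    "VP", "ADJP", "NN", "NNS", "NP"]),
    # ASSUMPTIONS (not in the table): a toy NP rule and an N-headed PP rule.
    "NP": ("left", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left", ["NP"]),               # use ["IN"] for a P-headed convention
}

def base(label):
    """Strip PTB function tags: NP-SBJ -> NP."""
    return label.split("-")[0] or label

def assign_heads(node):
    """Bottom-up traversal: pick a head daughter and copy its lexical head."""
    if node.word is not None:
        node.head = node.word
        return
    for child in node.children:
        assign_heads(child)
    direction, priorities = HEAD_RULES.get(base(node.label), ("left", []))
    daughters = node.children if direction == "left" else node.children[::-1]
    chosen = daughters[0]                 # ASSUMED fallback when nothing matches
    for wanted in priorities:
        match = next((d for d in daughters if base(d.label) == wanted), None)
        if match is not None:
            chosen = match
            break
    node.head_child, node.head = chosen, chosen.head

def lexicalized(node):
    """Bracketed tree with each label annotated as LABEL-head."""
    label = f"{node.label}-{node.head}"
    if node.word is not None:
        return f"({label} {node.word})"
    return f"({label} " + " ".join(lexicalized(c) for c in node.children) + ")"

def dependencies(node):
    """(dependent, head) pairs: each non-head daughter attaches to the parent's head."""
    deps = []
    for child in node.children:
        if child is not node.head_child:
            deps.append((child.head, node.head))
        deps.extend(dependencies(child))
    return deps

root = parse("(S (NP (NNS kids)) (VP (VBD saw) "
             "(NP (NP (NNS birds)) (PP (IN with) (NP (NN fish))))))")
assign_heads(root)
print(lexicalized(root))
print(dependencies(root))
```

Changing the assumed PP rule from `["NP"]` to `["IN"]` makes the preposition the head of the PP, which is one concrete way to see why different headedness conventions need different head rules. The `dependencies` helper also hints at how unlabeled dependency edges can be read off a constituency tree once heads are assigned.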