Proceedings of the 16Th International Workshop on Treebanks and Linguistic Theories

Total Page:16

File Type:pdf, Size:1020Kb

Proceedings of the 16Th International Workshop on Treebanks and Linguistic Theories TLT16 January 2018 Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories January 23–24, 2018 Prague, Czech Republic ISBN: 978-80-88132-04-2 Edited by Jan Hajič. These proceedings with papers presented at the 16th International Workshop on Treebanks and Linguistic Theories are included in the ACL Anthology, supported by the Association for Computational Linguistics (ACL) at http://aclweb.org/anthology; both individual papers as well as the full proceedings book are available. © 2017. Copyright of each paper stays with the respective authors (or their employers). Distributed under a CC-BY 4.0 licence. TLT16 takes place in Prague, Czech Republic on January 23–24, 2018. Preface The Sixteenth International Workshop on Treebanks and Linguistic Theories (TLT16) is being held at Charles University, Czech Republic, 23–24 January 2018, for the second time in Prague, which hosted TLT already in 2006. This year, TLT16 is co-located with the Workshop on Data Provenance and Annota- tion in Computational Linguistics 2018 (January 22, 2018, at the same place and venue) and immediately followed by the 2nd Workshop on Corpus-based Research in the Humanities, held in Vienna, Austria. This year, TLT16 received 32 submissions of which 15 have been selected to be presented as oral presentations, and additional seven have been asked to present as posters. We were happy that our in- vitation to give a plenary talk has been accepted by both Lilja Øvrelid of University of Oslo in Norway (with a talk “Downstream use of syntactic analysis: does representation matter?”) and Marie Candito, of University of Paris Diderot, France (“Annotating and parsing to semantic frames: feedback from the French FrameNet project”). We have also included a panel discussion on the very topic of Treebanks and Linguistic Theories – to discuss opinions of the future of the field and the workshop with the participants. We are grateful to the members of the program committee, who worked hard to review the submissions and provided authors with valuable feedback. We would also like to thank various sponsors, mainly the Faculty of Mathematics and Physics of Charles University, providing local and logistical and accounting and financial support, and the Centre for Advanced Study (CAS) at the Norwegian Academy of Science and Letters for supporting one of the invited speakers. Last but not least, we would like to thank all authors for submitting interesting and relevant papers, and we wish all participants a fruitful workshop. December 2017 Jan Hajič (also in the name of local organizers), Charles University, Prague, Czech Republic Sandra Kuebler, Indiana University, Bloomington, Indiana, USA Stephan Oepen, University of Oslo, Norway, and CAS, Norwegian Academy of Science and Letters, Oslo Markus Dickinson, Indiana University, Bloomington, Indiana, USA Program Committee Co-chairs i Program Chairs Jan Hajič Charles University, Prague Sandra Kübler Indiana University Bloomington Stephan Oepen Universitetet i Oslo Markus Dickinson Indiana University Bloomington Reviewers Patricia Amaral Indiana University Bloomington Emily M. Bender University of Washington Eckhard Bick University of Southern Denmark Ann Bies Linguistic Data Consortium, University of Pennsylvania Gosse Bouma Rijksuniversiteit Groningen Miriam Butt Konstanz Silvie Cinková Charles University, Prague Marie-Catherine de Marneffe The Ohio State University Koenraad De Smedt University of Bergen Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute Filip Ginter University of Turku Memduh Gokirmak Istanbul Technical University Eva Hajicova Charles University, Prague Dag Haug University of Oslo Barbora Hladka Charles University, Prague Lori Levin Carnegie Mellon University Teresa Lynn Dublin City University Adam Meyers New York University Jiří Mírovský Charles University, Prague Kaili Müürisep University of Tartu Joakim Nivre Uppsala University Petya Osenova Sofia University and IICT-BAS Agnieszka Patejuk Institute of Computer Science, Polish Academy of Sciences Tatjana Scheffler Universität Potsdam Olga Scrivner Indiana University Bloomington Djamé Seddah Alpage/Université Paris la Sorbonne Michael White The Ohio State University Nianwen Xue Brandeis University Daniel Zeman Charles University, Prague Heike Zinsmeister University of Hamburg Local Organizing Committee Jan Hajič (chair) Charles University, Prague Kateřina Bryanová Charles University, Prague Zdeňka Urešová Charles University, Prague Eduard Bejček (publication chair) Charles University, Prague ii Table of Contents Invited Speakers Annotating and parsing to semantic frames: feedback from the French FrameNet project . v Marie Candito Downstream use of syntactic analysis: does representation matter? . vi Lilja Øvrelid Tuesday, 23rd January Distributional regularities of verbs and verbal adjectives: Treebank evidence and broader implications . 1 Daniël de Kok, Patricia Fischer, Corina Dima and Erhard Hinrichs UD Annotatrix: An annotation tool for Universal Dependencies . 10 Francis M. Tyers, Maria Sheyanova and Jonathan Washington The Treebanked Conspiracy. Actors and Actions in Bellum Catilinae . 18 Marco Passarotti and Berta Gonzalez Saavedra Universal Dependencies-based syntactic features in detecting human translation varieties. 27 Maria Kunilovskaya and Andrey Kutuzov Graph Convolutional Networks for Named Entity Recognition. 37 Alberto Cetoli, Stefano Bragaglia, Andrew O’Harney and Marc Sloan Extensions to the GrETEL Treebank Query Application . 46 Jan Odijk, Martijn van der Klis and Sheean Spoel The Relation of Form and Function in Linguistic Theory and in a Multilayer Treebank . 56 Eduard Bejček, Eva Hajičová, Marie Mikulová and Jarmila Panevová Literal readings of multiword expressions: as scarce as hen’s teeth . 64 Agata Savary and Silvio Ricardo Cordeiro Querying Multi-word Expressions Annotation with CQL . 73 Natalia Klyueva, Anna Vernerova and Behrang Qasemizadeh REALEC learner treebank: annotation principles and evaluation of automatic parsing . 80 Olga Lyashevskaya and Irina Panteleeva A semiautomatic lemmatisation procedure for treebanks. Old English strong and weak verbs . 88 Marta Tío Sáenz and Darío Metola Rodríguez Data point selection for genre-aware parsing . 95 Ines Rehbein and Felix Bildhauer Error Analysis of Cross-lingual Tagging and Parsing. 106 Rudolf Rosa and Zdeněk Žabokrtský Wednesday, 24th January A Telugu treebank based on a grammar book. 119 Taraka Rama and Sowmya Vajjala Recent Developments within BulTreeBank . 129 Petya Osenova and Kiril Simov iii Towards a dependency-annotated treebank for Bambara . 138 Ekaterina Aplonova and Francis M. Tyers Merging the Trees – Building a Morphological Treebank for German from Two Resources . 146 Petra Steiner What I think when I think about treebanks . 161 Anders Søgaard Syntactic Semantic Correspondence in Dependency Grammar . 167 Catalina Maranduc, Catalin Mititelu and Victoria Bobicev Multi-word annotation in syntactic treebanks – Propositions for Universal Dependencies . 181 Sylvain Kahane, Marine Courtin and Kim Gerdes A Universal Dependencies Treebank for Marathi . 190 Vinit Ravishankar Dangerous Relations in Dependency Treebanks . 201 Chiara Alzetta, Felice Dell’Orletta, Simonetta Montemagni and Giulia Venturi iv Annotating and parsing to semantic frames: feedback from the French FrameNet project Invited talk Marie Candito Université Paris Diderot, France [email protected] Abstract Building systems able to provide a semantic representation of texts has long been an objective, both in linguistics and in applied NLP. Although advances in machine learning sometimes seem to diminish the need to use as input sophisticated structured representations of sentences, the enthusiasm for interpreting trained neural networks somewhat seems to reaffirm that need. Because they represent schematic situations, semantic frames (Fillmore, 1982), as intantiated into FrameNet (Baker, Fillmore and Petruck, 1983) are an appealing level of generalization over the eventu- alities described in texts. In this talk, I will present some feedback from the development of a French FrameNet, including anal- ysis of the main difficulties we faced during annotation. I will describe how linking generalizations can be extracted from the frame-annotated data, using deep syntactic annotations. I will then investigate what kind of input is most effective for FrameNet parsing, from no syntax at all to deep syntactic representa- tions. The work I’ll present is a joint work with: • Marianne Djemaa, Philippe Muller, Laure Vieu, G. de Chalendar, B. Sagot, P. Amsili (French FrameNet), • C. Ribeyre, D. Seddah, G. Perrier, B. Guillaume (deep syntax), • Olivier Michalon, Alexis Nasr (semantic parsing). v Downstream use of syntactic analysis: does representation matter? Invited talk Lilja Øvrelid University of Oslo, Norway [email protected] Abstract Research in syntactic parsing is largely driven by progress in intrinsic evaluation and there have been impressive developments in recent years in terms of evaluation measures, such as F-score or labeled accuracy. At the same time, a range of different syntactic representations have been put to use in treebank annotation projects and there have been studies measuring various aspects of the ”learnability” of these representations and their suitability for automatic parsing, mostly also evaluated in terms of intrinsic measures. In this talk I will provide a different perspective on these developments and give an overview of re- search that
Recommended publications
  • Histcoroy Pyright for Online Information and Ordering of This and Other Manning Books, Please Visit Topwicws W.Manning.Com
    www.allitebooks.com HistCoroy pyright For online information and ordering of this and other Manning books, please visit Topwicws w.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Tutorials Special Sales Department Offers & D e al s Manning Publications Co. 20 Baldwin Road Highligh ts PO Box 761 Shelter Island, NY 11964 Email: [email protected] Settings ©2017 by Manning Publications Co. All rights reserved. Support No part of this publication may be reproduced, stored in a retrieval system, or Sign Out transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps. Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid­free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. Manning Publications Co. PO Box 761 Shelter Island, NY 11964 www.allitebooks.com Development editor: Cynthia Kane Review editor: Aleksandar Dragosavljević Technical development editor: Stan Bice Project editors: Kevin Sullivan, David Novak Copyeditor: Sharon Wilkey Proofreader: Melody Dolab Technical proofreader: Doug Warren Typesetter and cover design: Marija Tudor ISBN 9781617292576 Printed in the United States of America 1 2 3 4 5 6 7 8 9 10 – EBM – 22 21 20 19 18 17 www.allitebooks.com HistPoray rt 1.
    [Show full text]
  • Using Constraint Grammar for Treebank Retokenization
    Using Constraint Grammar for Treebank Retokenization Eckhard Bick University of Southern Denmark [email protected] Abstract multiple tokens, in this case allowing the second part (the article) to become part of a separate np. This paper presents a Constraint Grammar-based method for changing the Tokenization is often regarded as a necessary tokenization of existing annotated data, evil best treated by a preprocessor with an establishing standard space-based abbreviation list, but has also been subject to ("atomic") tokenization for corpora methodological research, e.g. related to finite- otherwise using MWE fusion and state transducers (Kaplan 2005). However, there contraction splitting for the sake of is little research into changing the tokenization of syntactic transparency or for semantic a corpus once it has been annotated, limiting the reasons. Our method preserves ingoing comparability and alignment of corpora, or the and outgoing dependency arcs and allows evaluation of parsers. The simplest solution to the addition of internal tags and structure this problem is making conflicting systems for MWEs. We discuss rule examples and compatible by changing them into "atomic evaluate the method against both a tokenization", where all spaces are treated as Portuguese treebank and live news text token boundaries, independently of syntactic or annotation. semantic concerns. This approach is widely used in the machine-learning (ML) community, e.g. 1 Introduction for the Universal Dependencies initiative (McDonald et al. 2013). The method described in In an NLP framework, tokenization can be this paper can achieve such atomic tokenization defined as the identification of the smallest of annotated treebank data without information meaningful lexical units in running text.
    [Show full text]
  • Page 657 Monterey Program (November 19-20)- Page 666
    EAST Lansing Program (November 12-13)- Page 657 Monterey Program (November 19-20)- Page 666 VI 0 0 Notices of the American Mathematical Society z c:: 3 o­ ..,~ z 0 < ~ November 1982, Issue 221 ·5- Volume 29, Number 7, Pages 617- 720 ..,~ Providence, Rhode Island USA ISSN 0002-9920 Calendar of AMS Meetings THIS CALENDAR lists all meetings which have been approved by the Council prior to the date this issue of the Notices was sent to press. The summer and annual meetings are joint meetings of the Mathematical Association of America and the Ameri· can Mathematical Society. The meeting dates which fall rather far in the future are subject to change; this is particularly true of meetings to which no numbers have yet been assigned. Programs of the meetings will appear in the issues indicated below. First and second announcements of the meetings will have appeared in earlier issues. ABSTRACTS OF PAPERS presented at a meeting of the Society are published in the journal Abstracts of papers presented to the American Mathematical Society in the issue corresponding to that of the Notices which contains the program of the meet· ing. Abstracts should be submitted on special forms which are available in many departments of mathematics and from the office of the Society in Providence. Abstracts of papers to be presented at the meeting must be received at the headquarters of the Society in Providence, Rhode Island, on or before the deadline given below for the meeting. Note that the deadline for ab· stracts submitted for consideration for presentation at special sessions is usually three weeks earlier than that specified below.
    [Show full text]
  • Using Danish As a CG Interlingua: a Wide-Coverage Norwegian-English Machine Translation System
    Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System Eckhard Bick Lars Nygaard Institute of Language and The Text Laboratory Communication University of Southern Denmark University of Oslo Odense, Denmark Oslo, Norway [email protected] [email protected] Abstract running, mixed domain text. Also, some languages, like English, German and This paper presents a rule-based Japanese, are more equal than others, Norwegian-English MT system. not least in a funding-heavy Exploiting the closeness of environment like MT. Norwegian and Danish, and the The focus of this paper will be existence of a well-performing threefold: Firstly, the system presented Danish-English system, Danish is here is targeting one of the small, used as an «interlingua». «unequal» languages, Norwegian. Structural analysis and polysemy Secondly, the method used to create a resolution are based on Constraint Norwegian-English translator, is Grammar (CG) function tags and ressource-economical in that it uses dependency structures. We another, very similar language, Danish, describe the semiautomatic as an «interlingua» in the sense of construction of the necessary translation knowledge recycling (Paul Norwegian-Danish dictionary and 2001), but with the recycling step at the evaluate the method used as well SL side rather than the TL side. Thirdly, as the coverage of the lexicon. we will discuss an unusual analysis and transfer methodology based on Constraint Grammar dependency 1 Introduction parsing. In short, we set out to construct a Norwegian-English MT Machine translation (MT) is no longer system by building a smaller, an unpractical science. Especially the Norwegian-Danish one and piping its advent of corpora with hundreds of output into an existing Danish deep millions of words and advanced parser (DanGram, Bick 2003) and an machine learning techniques, bilingual existing, robust Danish-English MT electronic data and advanced machine system (Dan2Eng, Bick 2006 and 2007).
    [Show full text]
  • Preliminary Version of the Text Analysis Component, Including: Ner, Event Detection and Sentiment Analysis
    D4.1 PRELIMINARY VERSION OF THE TEXT ANALYSIS COMPONENT, INCLUDING: NER, EVENT DETECTION AND SENTIMENT ANALYSIS Grant Agreement nr 611057 Project acronym EUMSSI Start date of project (dur.) December 1st 2013 (36 months) Document due Date : November 30th 2014 (+60 days) (12 months) Actual date of delivery December 2nd 2014 Leader GFAI Reply to [email protected] Document status Submitted Project co-funded by ICT-7th Framework Programme from the European Commission EUMSSI_D4.1 Preliminary version of the text analysis component 1 Project ref. no. 611057 Project acronym EUMSSI Project full title Event Understanding through Multimodal Social Stream Interpretation Document name EUMSSI_D4.1_Preliminary version of the text analysis component.pdf Security (distribution PU – Public level) Contractual date of November 30th 2014 (+60 days) delivery Actual date of December 2nd 2014 delivery Deliverable name Preliminary Version of the Text Analysis Component, including: NER, event detection and sentiment analysis Type P – Prototype Status Submitted Version number v1 Number of pages 60 WP /Task responsible GFAI / GFAI & UPF Author(s) Susanne Preuss (GFAI), Maite Melero (UPF) Other contributors Mahmoud Gindiyeh (GFAI), Eelco Herder (LUH), Giang Tran Binh (LUH), Jens Grivolla (UPF) EC Project Officer Mrs. Aleksandra WESOLOWSKA [email protected] Abstract The deliverable reports on the resources and tools that have been gathered and installed for the preliminary version of the text analysis component Keywords Text analysis component, Named Entity Recognition, Named Entity Linking, Keyphrase Extraction, Relation Extraction, Topic modelling, Sentiment Analysis Circulated to partners Yes Peer review Yes completed Peer-reviewed by Eelco Herder (L3S) Coordinator approval Yes EUMSSI_D4.1 Preliminary version of the text analysis component 2 Table of Contents 1.
    [Show full text]
  • Information Retrieval and Web Search
    Information Retrieval and Web Search Text processing Instructor: Rada Mihalcea (Note: Some of the slides in this slide set were adapted from an IR course taught by Prof. Ray Mooney at UT Austin) IR System Architecture User Interface Text User Text Operations Need Logical View User Query Database Indexing Feedback Operations Manager Inverted file Query Searching Index Text Ranked Retrieved Database Docs Ranking Docs Text Processing Pipeline Documents to be indexed Friends, Romans, countrymen. OR User query Tokenizer Token stream Friends Romans Countrymen Linguistic modules Modified tokens friend roman countryman Indexer friend 2 4 1 2 Inverted index roman countryman 13 16 From Text to Tokens to Terms • Tokenization = segmenting text into tokens: • token = a sequence of characters, in a particular document at a particular position. • type = the class of all tokens that contain the same character sequence. • “... to be or not to be ...” 3 tokens, 1 type • “... so be it, he said ...” • term = a (normalized) type that is included in the IR dictionary. • Example • text = “I slept and then I dreamed” • tokens = I, slept, and, then, I, dreamed • types = I, slept, and, then, dreamed • terms = sleep, dream (stopword removal). Simple Tokenization • Analyze text into a sequence of discrete tokens (words). • Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. – However, frequently they are not. • Simplest approach is to ignore all numbers and punctuation and use only case-insensitive
    [Show full text]
  • The Dravidian Languages
    THE DRAVIDIAN LANGUAGES BHADRIRAJU KRISHNAMURTI The Pitt Building, Trumpington Street, Cambridge, United Kingdom The Edinburgh Building, Cambridge CB2 2RU, UK 40 West 20th Street, New York, NY 10011–4211, USA 477 Williamstown Road, Port Melbourne, VIC 3207, Australia Ruiz de Alarc´on 13, 28014 Madrid, Spain Dock House, The Waterfront, Cape Town 8001, South Africa http://www.cambridge.org C Bhadriraju Krishnamurti 2003 This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2003 Printed in the United Kingdom at the University Press, Cambridge Typeface Times New Roman 9/13 pt System LATEX2ε [TB] A catalogue record for this book is available from the British Library ISBN 0521 77111 0hardback CONTENTS List of illustrations page xi List of tables xii Preface xv Acknowledgements xviii Note on transliteration and symbols xx List of abbreviations xxiii 1 Introduction 1.1 The name Dravidian 1 1.2 Dravidians: prehistory and culture 2 1.3 The Dravidian languages as a family 16 1.4 Names of languages, geographical distribution and demographic details 19 1.5 Typological features of the Dravidian languages 27 1.6 Dravidian studies, past and present 30 1.7 Dravidian and Indo-Aryan 35 1.8 Affinity between Dravidian and languages outside India 43 2 Phonology: descriptive 2.1 Introduction 48 2.2 Vowels 49 2.3 Consonants 52 2.4 Suprasegmental features 58 2.5 Sandhi or morphophonemics 60 Appendix. Phonemic inventories of individual languages 61 3 The writing systems of the major literary languages 3.1 Origins 78 3.2 Telugu–Kannada.
    [Show full text]
  • Programming in HTML5 with Javascript and CSS3 Ebook
    spine = 1.28” Programming in HTML5 with JavaScript and CSS3 and CSS3 JavaScript in HTML5 with Programming Designed to help enterprise administrators develop real-world, About You job-role-specific skills—this Training Guide focuses on deploying This Training Guide will be most useful and managing core infrastructure services in Windows Server 2012. to IT professionals who have at least Programming Build hands-on expertise through a series of lessons, exercises, three years of experience administering and suggested practices—and help maximize your performance previous versions of Windows Server in midsize to large environments. on the job. About the Author This Microsoft Training Guide: Mitch Tulloch is a widely recognized in HTML5 with • Provides in-depth, hands-on training you take at your own pace expert on Windows administration and has been awarded Microsoft® MVP • Focuses on job-role-specific expertise for deploying and status for his contributions supporting managing Windows Server 2012 core services those who deploy and use Microsoft • Creates a foundation of skills which, along with on-the-job platforms, products, and solutions. He experience, can be measured by Microsoft Certification exams is the author of Introducing Windows JavaScript and such as 70-410 Server 2012 and the upcoming Windows Server 2012 Virtualization Inside Out. Sharpen your skills. Increase your expertise. • Plan a migration to Windows Server 2012 About the Practices CSS3 • Deploy servers and domain controllers For most practices, we recommend using a Hyper-V virtualized • Administer Active Directory® and enable advanced features environment. Some practices will • Ensure DHCP availability and implement DNSSEC require physical servers.
    [Show full text]
  • ED132833.Pdf
    DOCUMENT RESUME ED 132 833 FL 008 225 AUTHOR Johnson, Dora E.; And Others TITLE Languages of South Asia. A Survey of Materials for the Study of the Uncommonly Taught Languages. INSTITUTION Center for Applied Linguistics, Arlington, Va. SPONS AGENCY Office of Education (DHEW), Washington, D.C. PUB DATE 76 CONTRACT 300-75-0201 NOTE 52p. AVAILABLE FROMCenter for Applied Linguistics, 1611 North Kent Street, Arlington, Virginia 22209 ($3.95 each fascicle: complete set of 8, $26.50) BDRS PRICE MF-$0.83 HC-$3.50 Plus Postage. DESCRIPTORS Adult Education; Afro Asiatic Languages; *Annotated Bibliographies; Bengali; Dialect Studies; Dictionaries; *Dravidian Languages; Gujarati; Hindi; Ind° European Languages; Instructional Materials; Kannada; Kashmiri; Language Instruction; Language Variation; Malayalam; Marathi; Nepali; Panjabi; Reading Materials; Singhalese; Sino Tibetan Languages; Tamil; Telugu; *Uncommonly Taught Languages; Urdu IDENTIFIERS Angami Naga; Ao Naga; Apatani; Assamese; Bangru; Bhili; Bhojpuri; Boro; Brahui; Chakesang Naga; Chepang; Chhatisgarhi; Dafla; Galiong; Garo; Gondi; Gorum; Gurung; Hindustani; *Indo Aryan Languages; Jirel; Juang; Kabui Naga; Kanauri; Khaling; Kham; Kharia; Kolami; Konda; Korku; Kui; Kumauni; Kurukh; Ruwi; Lahnda; Maithili; Mundari Ho; Oraon; Oriya; Parenga; Parji; Pashai; Pengo; Remo; Santali; Shina; Sindhi; Sora; *Tibet° Burman Languages; Tulu ABSTRACT This is an annotated bibliography of basic tools of access for the study_of the uncommonly taught languages of South Asia. It is one of eight fascicles which constitute a revision of "A Provisional Survey of Materials for the Study of the Neglected Languages" (CAI 1969). The emphasis is on materials for the adult learner whose native language is English. Languages are grouped according to the following classifications: Indo-Aryan; Dravidian; Munda; Tibeto-Burman; Mon-Khmer; Burushaski.
    [Show full text]
  • Floresta Sinti(C)Tica : a Treebank for Portuguese
    )ORUHVWD6LQWi F WLFD$WUHHEDQNIRU3RUWXJXHVH 6XVDQD$IRQVR (FNKDUG%LFN 5HQDWR+DEHU 'LDQD6DQWRV *VISL project, University of Southern Denmark Institute of Language and Communication, Campusvej, 55, 5230 Odense M, Denmark [email protected], [email protected] ¡ SINTEF Telecom & Informatics, Pb 124, Blindern, NO-0314 Oslo, Norway [email protected],[email protected] $EVWUDFW This paper reviews the first year of the creation of a publicly available treebank for Portuguese, Floresta Sintá(c)tica, a collaboration project between the VISL and the Computational Processing of Portuguese projects. After briefly describing the main goals and the organization of the project, the creation of the annotated objects is presented in detail: preparing the text to be annotated, applying the Constraint Grammar based PALAVRAS parser, revising its output manually in a two-stage process, and carefully documenting the linguistic options. Some examples of the kind of interesting problems dealt with are presented, and the paper ends with a brief description of the tools developed, the project results so far, and a mention to a preliminary inter-annotator test and what was learned from it. supporting 16 different languages. VISL's Portuguese ,QWURGXFWLRQ0RWLYDWLRQDQGREMHFWLYHV system is based on the PALAVRAS parser (Bick, 2000), There are various good motives for creating a and has been functioning as a role model for other Portuguese treebank, one of them simply being the desire languages. More recently, VISL has moved to incorporate to make a new research tool available to the Portuguese semantic research, machine translation, and corpus language community, another the wish to establish some annotation proper.
    [Show full text]
  • Constructing a Lexicon of Arabic-English Named Entity Using SMT and Semantic Linked Data
    Constructing a Lexicon of Arabic-English Named Entity using SMT and Semantic Linked Data Emna Hkiri, Souheyl Mallat, and Mounir Zrigui LaTICE Laboratory, Faculty of Sciences of Monastir, Tunisia Abstract: Named entity recognition is the problem of locating and categorizing atomic entities in a given text. In this work, we used DBpedia Linked datasets and combined existing open source tools to generate from a parallel corpus a bilingual lexicon of Named Entities (NE). To annotate NE in the monolingual English corpus, we used linked data entities by mapping them to Gate Gazetteers. In order to translate entities identified by the gate tool from the English corpus, we used moses, a statistical machine translation system. The construction of the Arabic-English named entities lexicon is based on the results of moses translation. Our method is fully automatic and aims to help Natural Language Processing (NLP) tasks such as, machine translation information retrieval, text mining and question answering. Our lexicon contains 48753 pairs of Arabic-English NE, it is freely available for use by other researchers Keywords: Named Entity Recognition (NER), Named entity translation, Parallel Arabic-English lexicon, DBpedia, linked data entities, parallel corpus, SMT. Received April 1, 2015; accepted October 7, 2015 1. Introduction Section 5 concludes the paper with directions for future work. Named Entity Recognition (NER) consists in recognizing and categorizing all the Named Entities 2. State of the art (NE) in a given text, corresponding to two intuitive IAJIT First classes; proper names (persons, locations and Written and spoken Arabic is considered as difficult organizations), numeric expressions (time, date, money language to apprehend in the domain of Natural and percent).
    [Show full text]
  • Proposal to Encode 0C34 TELUGU LETTER LLLA
    Proposal to encode 0C34 TELUGU LETTER LLLA Shriramana Sharma, Suresh Kolichala, Nagarjuna Venna, Vinodh Rajan jamadagni, suresh.kolichala, vnagarjuna and vinodh.vinodh: *-at-gmail.com 2012Jan-17 §1. Character to be encoded 0C34 TELUGU LETTER LLLA §2. Background LLLA is the Unicode term for the native form in South Indian scripts of the voiced retroflex approximant. It is an ill-informed notion that it is limited or unique to one particular language or script. Such written forms are already known to exist in Tamil, Malayalam and Kannada and characters for the same are already encoded in Unicode. Evidence for the native presence of corresponding written forms of LLLA in other South Indian scripts, which may or may not be identical to characters already encoded, is forthcoming. This document provides evidence for Telugu LLLA identical to Kannada LLLA and requests its encoding. Similar evidence is to be expected for other South Indian scripts in future as well. §3. Attestation Currently the Telugu script does not use LLLA. However historic usage before the 10th century is undeniably attested. When the literary Telugu language was effectively standardized by the time of the Mahābhārata of Nannayya in the 11th century, this letter had disappeared from use. It is clear that this disappeared from the Telugu language long before it disappeared from Kannada. However, it was definitely used as part of the Telugu script and language and is considered a Telugu character by Telugu epigraphist scholars. Even though the pre-10th-century Telugu writing had much (more) in common with the Kannada writing of those days and is perhaps better analysed as a single proto-Telugu Kannada script, and hence the older written form would not be a part of the modern Telugu 1 script currently encoded in the block 0C00-0C7F, the modern written form used when transcribing the older form is most definitely a part of the modern Telugu script and hence should be encoded as part of it.
    [Show full text]