Word-Formation Research
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Collection of Usage Information for Language Resources from Academic Articles
Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD. -
Conference Abstracts
EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Email: [email protected] and [email protected] Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association ii Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues. -
The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
The Translation Equivalents Database (Treq) as a Lexicographer’s Aid Michal Škrabal, Martin Vavřín Institute of the Czech National Corpus, Charles University, Czech Republic E-mail: [email protected], [email protected] Abstract The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context; and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary 1. Introduction The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database1. -
Better Web Corpora for Corpus Linguistics and NLP
Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel Advisor: Pavel Rychlý i Acknowledgements I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý for their problem insight, help with software design and con- stant encouragement. I am also grateful to my colleagues from Natural Language Process- ing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP and Jan Pomikálek who helped me to start. I thank to my wife Kateřina who supported me a lot during writing this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. ii Abstract The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods. -
Unit 3: Available Corpora and Software
Corpus building and investigation for the Humanities: An on-line information pack about corpus investigation techniques for the Humanities Unit 3: Available corpora and software Irina Dahlmann, University of Nottingham 3.1 Commonly-used reference corpora and how to find them This section provides an overview of commonly-used and readily available corpora. It is also intended as a summary only and is far from exhaustive, but should prove useful as a starting point to see what kinds of corpora are available. The Corpora here are divided into the following categories: • Corpora of General English • Monitor corpora • Corpora of Spoken English • Corpora of Academic English • Corpora of Professional English • Corpora of Learner English (First and Second Language Acquisition) • Historical (Diachronic) Corpora of English • Corpora in other languages • Parallel Corpora/Multilingual Corpora Each entry contains the name of the corpus and a hyperlink where further information is available. All the information was accurate at the time of writing but the information is subject to change and further web searches may be required. Corpora of General English The American National Corpus http://www.americannationalcorpus.org/ Size: The first release contains 11.5 million words. The final release will contain 100 million words. Content: Written and Spoken American English. Access/Cost: The second release is available from the Linguistic Data Consortium (http://projects.ldc.upenn.edu/ANC/) for $75. The British National Corpus http://www.natcorp.ox.ac.uk/ Size: 100 million words. Content: Written (90%) and Spoken (10%) British English. Access/Cost: The BNC World Edition is available as both a CD-ROM or via online subscription. -
The Bulgarian National Corpus: Theory and Practice in Corpus Design
The Bulgarian National Corpus: Theory and Practice in Corpus Design Svetla Koeva, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, and Ekaterina Tarpomanova Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria abstract The paper discusses several key concepts related to the development of Keywords: corpora and reconsiders them in light of recent developments in NLP. corpus design, On the basis of an overview of present-day corpora, we conclude that Bulgarian National Corpus, the dominant practices of corpus design do not utilise the technologies computational adequately and, as a result, fail to meet the demands of corpus linguis- linguistics tics, computational lexicology and computational linguistics alike. We proceed to lay out a data-driven approach to corpus design, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies allowing fast collection, automatic metadata description and annotation of large amounts of data. Thus, the gist of the approach we propose is that corpus design should be centred on amassing large amounts of mono- and multilin- gual texts and on providing them with a detailed metadata description and high-quality multi-level annotation. We go on to illustrate this concept with a description of the com- pilation, structuring, documentation, and annotation of the Bulgar- ian National Corpus (BulNC). At present it consists of a Bulgarian part of 979.6 million words, constituting the corpus kernel, and 33 Bulgarian-X language corpora, totalling 972.3 million words, 1.95 bil- lion words (1.95×109) altogether. The BulNC is supplied with a com- prehensive metadata description, which allows us to organise the texts according to different principles. -
Collection of Usage Information for Language Resources from Academic Articles
Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy and Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD. -
Psyneuroling
ABSTRACT BOOK International Online Conference of Psycholinguistic and Neurolinguistic Research: Methods, Materials, and Approaches PSYNEUROLING 17-19 December 2020 17 DECEMBER 2020 Session 1 Language and speech in speakers with neurocognitive diseases (1) Niharika Mahajan, Sonal Chitnis, Hemina Dawar, Sujit Jagtap, Poornima Karandikar & Sankar Prasad Gorthi Effect of education on cognitive and communicative abilities in mild dementia: Preliminary study from India (2) Niharika Mahajan, Abhishek Chaudhari, Sonal Chitnis & Sujit Jagtap Cognitive and communicative reserve in bilingual biliterate lady with Alzheimer’s dementia: A case study (3) Ana Varela Suárez Communicative guidelines for caregivers of people with dementia: A decalogue (4) Olga Ivanova Humor and dementia: what can the type of neurodegeneration tell us about pragmatic competence in human language? Session 2 Brain correlates of language processing (5) Jordi Martorell, Simona Mancini, Nicola Molinaro & Manuel Carreiras Oscillatory tracking of syntactic structure and cross-linguistic variation (6) Effrosyni Ntemou, Ann-Katrin Ohlerth, Sebastian Ille, Sandro M. Krie, Roelien Bastiaanse & Adrià Rofes The effect of transitivity on cortical regions involved in action naming: evidence from navigated Transcranial Magnetic Stimulation (7) Monica Vanoncini, Olga Dragoy, Victoria Pozdnyakova & Adrià Rofes The Contribution of the Motor and Auditory Cortex in Priming Action and Sound Verbs: a Pilot Study (8) Roha M. Kaipa Hemispheric Differences in Conflict Sentence Processing in Multilinguals Session 3 Emotions & emotivity in language processing, learning and production (9) Suma Raju & N.P. Nataraja Assessment of verbal fluency skills in Kannada-English speaking bilingual and Kannada speaking monolingual children in the age range of eight to ten years (10) Spandan Chowdhury Emotion Categorization based on Phrase Order Preferences in Bengali (11) Ratul Ghosh An acoustic and neural study of emotions expressed in Bengali speech: A Pilot Study (12) Lucía Sabater, Marta Ponari, Juan Haro, Eva Moreno, Miguel A. -
Edited Proceedings 1July2016
UK-CLC 2016 Conference Proceedings UK-CLC 2016 preface Preface UK-CLC 2016 Bangor University, Bangor, Gwynedd, 18-21 July, 2016. This volume contains the papers and posters presented at UK-CLC 2016: 6th UK Cognitive Linguistics Conference held on July 18-21, 2016 in Bangor (Gwynedd). Plenary speakers Penelope Brown (Max Planck Institute for Psycholinguistics, Nijmegen, NL) Kenny Coventry (University of East Anglia, UK) Vyv Evans (Bangor University, UK) Dirk Geeraerts (University of Leuven, BE) Len Talmy (University at Buffalo, NY, USA) Dedre Gentner (Northwestern University, IL, USA) Organisers Chair Thora Tenbrink Co-chairs Anouschka Foltz Alan Wallington Local management Javier Olloqui Redondo Josie Ryan Local support Eleanor Bedford Advisors Christopher Shank Vyv Evans Sponsors and support Sponsors: John Benjamins, Brill Poster prize by Tracksys Student prize by De Gruyter Mouton June 23, 2016 Thora Tenbrink Bangor, Gwynedd Anouschka Foltz Alan Wallington Javier Olloqui Redondo Josie Ryan Eleanor Bedford UK-CLC 2016 Programme Committee Programme Committee Michel Achard Rice University Panos Athanasopoulos Lancaster University John Barnden The University of Birmingham Jóhanna Barðdal Ghent University Ray Becker CITEC - Bielefeld University Silke Brandt Lancaster University Cristiano Broccias Univeristy of Genoa Sam Browse Sheffield Hallam University Gareth Carrol University of Nottingham Paul Chilton Lancaster University Alan Cienki Vrije Universiteit (VU) Timothy Colleman Ghent University Louise Connell Lancaster University Seana Coulson -
New Kazakh Parallel Text Corpora with On-Line Access
New Kazakh parallel text corpora with on-line access Zhandos Zhumanov1, Aigerim Madiyeva2 and Diana Rakhimova3 al-Farabi Kazakh National University, Laboratory of Intelligent Information Systems, Almaty, Kazakhstan [email protected], [email protected], [email protected] Abstract. This paper presents a new parallel resource – text corpora – for Ka- zakh language with on-line access. We describe 3 different approaches to col- lecting parallel text and how much data we managed to collect using them, par- allel Kazakh-English text corpora collected from various sources and aligned on sentence level, and web accessible corpus management system that was set up using open source tools – corpus manager Mantee and web GUI KonText. As a result of our work we present working web-accessible corpus management sys- tem to work with collected corpora. Keywords: parallel text corpora, Kazakh language, corpus management system 1 Introduction Linguistic text corpora are large collections of text used for different language studies. They are used in linguistics and other fields that deal with language studies as an ob- ject of study or as a resource. Text corpora are needed for almost any language study since they are basically a representation of the language itself. In computer linguistics text corpora are used for various parsing, machine translation, speech recognition, etc. As shown in [1] text corpora can be classified by many categories: Size: small (to 1 million words); medium (from 1 to 10 million words); large (more than 10 million words). Number of Text Languages: monolingual, bilingual, multilingual. Language of Texts: English, German, Ukrainian, etc. Mode: spoken; written; mixed. -
Ad Hoc and General-Purpose Corpus Construction from Web Sources Adrien Barbaresi
Ad hoc and general-purpose corpus construction from web sources Adrien Barbaresi To cite this version: Adrien Barbaresi. Ad hoc and general-purpose corpus construction from web sources. Linguistics. ENS Lyon, 2015. English. tel-01167309 HAL Id: tel-01167309 https://tel.archives-ouvertes.fr/tel-01167309 Submitted on 24 Jun 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License THÈSE en vue de l'obtention du grade de Docteur de l’Université de Lyon, délivré par l’École Normale Supérieure de Lyon Discipline : Linguistique Laboratoire ICAR École Doctorale LETTRES, LANGUES, LINGUISTIQUE, ARTS présentée et soutenue publiquement le 19 juin 2015 par Monsieur Adrien Barbaresi _____________________________________________ Construction de corpus généraux et spécialisés à partir du web ______________________________________________ Directeur de thèse : M. Benoît HABERT Devant la commission d'examen formée de : M. Benoît HABERT, ENS Lyon, Directeur M. Thomas LEBARBÉ, Université Grenoble -
Henning Lobin, Roman Schneider Und Andreas Witt (Hrsg.) Digitale Infrastrukturen Für Die Germanistische Forschung Germanistische Sprachwissenschaft Um 2020
Henning Lobin, Roman Schneider und Andreas Witt (Hrsg.) Digitale Infrastrukturen für die germanistische Forschung Germanistische Sprachwissenschaft um 2020 Herausgegeben von Albrecht Plewnia und Andreas Witt Band 6 Digitale Infrastrukturen für die germanistische Forschung Herausgegeben von Henning Lobin, Roman Schneider und Andreas Witt Die Open-Access-Publikation dieses Bandes wurde gefördert vom Institut für Deutsche Sprache, Mannheim. ISBN 978-3-11-053675-1 e-ISBN (PDF) 978-3-11-053866-3 e-ISBN (EPUB) 978-3-11-053681-2 Dieses Werk ist lizenziert unter der Creative Commons Attribution 4.0 Lizenz. Weitere Informationen finden Sie unter http://creativecommons.org/licenses/by/4.0/. Bibliografische Information der Deutschen Nationalbibliothek Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind im Internet über http://dnb.dnb.de abrufbar. © 2018 Henning Lobin, Roman Schneider und Andreas Witt, publiziert von Walter de Gruyter GmbH, Berlin/Boston Foto Einbandabbildung: © Oliver Schonefeld, Institut für Deutsche Sprache, Mannheim Portrait Ludwig M. Eichinger, Seite V: © David Ausserhofer, Leibniz-Gemeinschaft Satz: Meta Systems Publishing & Printservices GmbH, Wustermark Druck und Bindung: CPI books GmbH, Leck www.degruyter.com Ludwig M. Eichinger gewidmet Vorwort Wo steht die germanistische Sprachwissenschaft aktuell? Der vorliegende Band mit dem Titel „Digitale Infrastrukturen für die germanistische Forschung“ ist der sechste Teil einer auf sechs Bände angelegten Reihe, die eine zwar nicht exhaustive, aber doch umfassende Bestandsaufnahme derjenigen Themen- felder innerhalb der germanistischen Linguistik bieten will, die im Kontext der Arbeiten des Instituts für Deutsche Sprache in den letzten Jahren für das Fach von Bedeutung waren und in den kommenden Jahren von Bedeutung sein werden (und von denen nicht wenige auch vom Institut für Deutsche Sprache bedient wurden und werden).