The National Corpus of Contemporary Welsh
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Talk Bank: a Multimodal Database of Communicative Interaction
Talk Bank: A Multimodal Database of Communicative Interaction 1. Overview The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can to extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web- based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes. -
CLIB 2016 Proceedings
The Second International Conference Computational Linguistics in Bulgaria (CLIB 2016) is organised within the Operation for Support for International Scientific Conferences Held in Bulgaria of the National Science Fund Grant № ДПМНФ 01/9 of 11 Aug 2016. National Science Fund CLIB 2016 is organised by: The Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences PUBLICATION AND CATALOGUING INFORMATION Title: Proceedings of the Second International Conference Computational Linguistics in Bulgaria (CLIB 2016) ISSN: 23675675 (online) Published and The Institute for Bulgarian Language Prof. Lyubomir distributed by: Andreychin, Bulgarian Academy of Sciences Editorial address: Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences 52 Shipchenski prohod blvd., bldg. 17 Sofia 1113, Bulgaria +359 2/ 872 23 02 Copyright of each paper stays with the respective authors. The works in the Proceedings are licensed under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0). Licence details: http://creativecommons.org/licenses/by/4.0 Proceedings of the Second International Conference Computational Linguistics in Bulgaria 9 September 2016 Sofia, Bulgaria PREFACE We are excited to welcome you to the second edition of the International Conference Computational Linguistics in Bulgaria (CLIB 2016) in Sofia, Bulgaria! CLIB aspires to foster the NLP community in Bulgaria and further the cooperation among researchers working in NLP for Bulgarian around the world. The need for a conference dedicated to NLP research dealing with or applicable to Bulgarian has been felt for quite some time. We believe that building a strong community of researchers and teams who have chosen to work on Bulgarian is a key factor to meeting the challenges and requirements posed to computational linguistics and NLP in Bulgaria. -
The Main Features of the E-Glava Online Valency Dictionary
The Main Features of the e-Glava Online Valency Dictionary Matea Birtić, Ivana Brač, Siniša Runjaić Institute of Croatian Language and Linguistics, Ulica Republike Austrije 16, HR-10000 Zagreb, Croatia E.mail: [email protected], [email protected], [email protected] Abstract E-Glava is an online valency dictionary of Croatian verbs. The theoretical approach to valency follows the German tradition, particularly that of the VALBU dictionary, with some minor changes and adjustments. The main principle of our valency approach is to link valency patterns to specific verb meanings. The verb list is compiled semi-automatically on the basis of the Croatian Frequency Dictionary and Croatian language textbooks. Currently, e-Glava contains descriptions of 57 psychological verbs with 187 meanings and 375 valency patterns. The lexicographic articles are written in Tschwanelex. A Document Type Definition editing module has been used, and the description of verbs follows a three-level linguistic schema prepared for lexicographers. Verbs are distributed throughout 34 semantic classes, and examples are extracted manually from Croatian corpora. Fully processed data for each semantic class will be publicly available in the form of a browsable HTML dictionary. The paper also presents a comparison between e-Glava and other cognate resources, as well as a summary of its main advantages, disadvantages, and potential applied uses. Keywords: Croatian language; valency dictionary; e-dictionary; syntax 1. Introduction Sentence structure and the syntactic behaviour of verbs were perhaps the most intriguing and interesting topics for early grammatical descriptions and, later, linguistic descriptions of language. Valency properties are relevant to both theoretical and applied linguistic considerations. -
Collection of Usage Information for Language Resources from Academic Articles
Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD. -
From CHILDES to Talkbank
From CHILDES to TalkBank Brian MacWhinney Carnegie Mellon University MacWhinney, B. (2001). New developments in CHILDES. In A. Do, L. Domínguez & A. Johansen (Eds.), BUCLD 25: Proceedings of the 25th annual Boston University Conference on Language Development (pp. 458-468). Somerville, MA: Cascadilla. a similar article appeared as: MacWhinney, B. (2001). From CHILDES to TalkBank. In M. Almgren, A. Barreña, M. Ezeizaberrena, I. Idiazabal & B. MacWhinney (Eds.), Research on Child Language Acquisition (pp. 17-34). Somerville, MA: Cascadilla. Recent years have seen a phenomenal growth in computer power and connectivity. The computer on the desktop of the average academic researcher now has the power of room-size supercomputers of the 1980s. Using the Internet, we can connect in seconds to the other side of the world and transfer huge amounts of text, programs, audio and video. Our computers are equipped with programs that allow us to view, link, and modify this material without even having to think about programming. Nearly all of the major journals are now available in electronic form and the very nature of journals and publication is undergoing radical change. These new trends have led to dramatic advances in the methodology of science and engineering. However, the social and behavioral sciences have not shared fully in these advances. In large part, this is because the data used in the social sciences are not well- structured patterns of DNA sequences or atomic collisions in super colliders. Much of our data is based on the messy, ill-structured behaviors of humans as they participate in social interactions. Categorizing and coding these behaviors is an enormous task in itself. -
The British National Corpus Revisited: Developing Parameters for Written BNC2014 Abi Hawtin (Lancaster University, UK)
The British National Corpus Revisited: Developing parameters for Written BNC2014 Abi Hawtin (Lancaster University, UK) 1. The British National Corpus 2014 project The ESRC Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and Cambridge University Press are working together to create a new, publicly accessible corpus of contemporary British English. The corpus will be a successor to the British National Corpus, created in the early 1990s (BNC19941). The British National Corpus 2014 (BNC2014) will be of the same order of magnitude as BNC1994 (100 million words). It is currently projected that the corpus will reach completion in mid-2018. Creation of the spoken sub-section of BNC2014 is in its final stages; this paper focuses on the justification and development of the written sub- section of BNC2014. 2. Justification for BNC2014 There are now many web-crawled corpora of English, widely available to researchers, which contain far more data than will be included in Written BNC2014. For example, the English TenTen corpus (enTenTen) is a web-crawled corpus of Written English which currently contains approximately 19 billion words (Jakubíček et al., 2013). Web-crawling processes mean that this huge amount of data can be collected very quickly; Jakubíček et al. (2013: 125) note that “For a language where there is plenty of material available, we can gather, clean and de-duplicate a billion words a day.” So, with extremely large and quickly-created web-crawled corpora now becoming commonplace in corpus linguistics, the question might well be asked why a new, 100 million word corpus of Written British English is needed. -
Conference Abstracts
EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Email: [email protected] and [email protected] Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association ii Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues. -
Morphological Doublets in Croatian: a Multi-Methodological Analysis
Morphological Doublets in Croatian: A multi-methodological analysis By: Dario Lečić A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy The University of Sheffield Faculty of Arts and Humanities Department of Russian and Slavonic Studies 20 January, 2017 Acknowledgments Many a PhD student past and present will agree that doing a PhD is a time-consuming process with lots of ups and downs, motivational issues and even a number of nervous breakdowns. Having experienced all of these, I can only say that they were right. However, having reached the end of the tunnel, I have to admit that the feeling is great. I would like to use this opportunity to thank all the people who made this possible and who have helped me during these four years spent researching the intricate world of morphological doublets in Croatian. First of all, I would like to express my immense gratitude to my primary supervisor, Professor Neil Bermel from the Department or Russian and Slavonic Studies at the University of Sheffield for offering his guidance from day one. Our regular supervisory meetings as well as numerous e-mail exchanges have been eye-opening and I would not have been able to do this without you. I hope this dissertation will justify all the effort you have put into me as your PhD student. Even though the jurisdiction of the second supervisor as defined by the University of Sheffield officially stretches mostly to matters of the Doctoral Development Programme, my second supervisor, Dr Dagmar Divjak, nevertheless played a major role in this research as well, primarily in matters of statistics. -
FERSIWN GYMRAEG ISOD the National Corpus of Contemporary
FERSIWN GYMRAEG ISOD The National Corpus of Contemporary Welsh Project Report, October 2020 Authors: Dawn Knight1, Steve Morris2, Tess Fitzpatrick2, Paul Rayson3, Irena Spasić and Enlli Môn Thomas4. 1. Introduction 1.1. Purpose of this report This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups. 1.2. Licence The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool (for links to all tools, refer to section 10 of this report). When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged (see 1.3). § To access the corpus visit: www.corcencc.org/explore § To access the GitHub site: https://github.com/CorCenCC o GitHub is a cloud-based service that enables developers to store, share and manage their code and datasets. -
Languages in the European Information Society Croatian
META-NET White Paper Series Languages in the European Information Society Croatian Early Release Edition META-FORUM 2011 27-28 June 2011 Budapest, Hungary The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission under contracts T4ME (Grant Agreement 249119), CESAR (Grant Agreement 271022), METANET4U (Grant Agreement 270893) and META-NORD (Grant Agreement 270899). This white paper is for educators, journalists, politicians, language communities and others, who want to establish a truly multilingual Europe. This white paper is part of a series that promotes knowledge about language technology and its potential. The availability and use of language technology in Europe varies between languages. Conse- quently, the actions that are required to further support research and development of language technologies also differs for each language. The required actions depend on many factors, such as the complexity of a given language and the size of its community. META-NET, a European Commission Network of Excellence, has conducted an analysis of current language resources and technolo- gies. This analysis focused on the 23 official European languages as well as other important regional languages in Europe. The results of this analysis suggests that there are many significant research gaps for each language. A more detailed, expert analysis and as- sessment of the current situation will help maximise the impact of additional research and minimize any risks. META-NET consists of 44 research centres from 31 countries who are working with stakeholders from commercial businesses, gov- ernment agencies, industry, research organisations, software com- panies, technology providers and European universities. -
The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
The Translation Equivalents Database (Treq) as a Lexicographer’s Aid Michal Škrabal, Martin Vavřín Institute of the Czech National Corpus, Charles University, Czech Republic E-mail: [email protected], [email protected] Abstract The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context; and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary 1. Introduction The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database1. -
The Workshop Programme
The Workshop Programme Monday, May 26 14:30 -15:00 Opening Nancy Ide and Adam Meyers 15:00 -15:30 SIGANN Shared Corpus Working Group Report Adam Meyers 15:30 -16:00 Discussion: SIGANN Shared Corpus Task 16:00 -16:30 Coffee break 16:30 -17:00 Towards Best Practices for Linguistic Annotation Nancy Ide, Sameer Pradhan, Keith Suderman 17:00 -18:00 Discussion: Annotation Best Practices 18:00 -19:00 Open Discussion and SIGANN Planning Tuesday, May 27 09:00 -09:10 Opening Nancy Ide and Adam Meyers 09:10 -09:50 From structure to interpretation: A double-layered annotation for event factuality Roser Saurí and James Pustejovsky 09:50 -10:30 An Extensible Compositional Semantics for Temporal Annotation Harry Bunt, Chwhynny Overbeeke 10:30 -11:00 Coffee break 11:00 -11:40 Using Treebank, Dictionaries and GLARF to Improve NomBank Annotation Adam Meyers 11:40 -12:20 A Dictionary-based Model for Morpho-Syntactic Annotation Cvetana Krstev, Svetla Koeva, Du!ko Vitas 12:20 -12:40 Multiple Purpose Annotation using SLAT - Segment and Link-based Annotation Tool (DEMO) Masaki Noguchi, Kenta Miyoshi, Takenobu Tokunaga, Ryu Iida, Mamoru Komachi, Kentaro Inui 12:40 -14:30 Lunch 14:30 -15:10 Using inheritance and coreness sets to improve a verb lexicon harvested from FrameNet Mark McConville and Myroslava O. Dzikovska 15:10 -15:50 An Entailment-based Approach to Semantic Role Annotation Voula Gotsoulia 16:00 -16:30 Coffee break 16:30 -16:50 A French Corpus Annotated for Multiword Expressions with Adverbial Function Eric Laporte, Takuya Nakamura, Stavroula Voyatzi