Linguistic Annotation In/For Corpus Linguistics
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Talk Bank: a Multimodal Database of Communicative Interaction
Talk Bank: A Multimodal Database of Communicative Interaction 1. Overview The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can to extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web- based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes. -
Adapting to Trends in Language Resource Development: a Progress
Adapting to Trends in Language Resource Development: A Progress Report on LDC Activities Christopher Cieri, Mark Liberman University of Pennsylvania, Linguistic Data Consortium 3600 Market Street, Suite 810, Philadelphia PA. 19104, USA E-mail: {ccieri,myl} AT ldc.upenn.edu Abstract This paper describes changing needs among the communities that exploit language resources and recent LDC activities and publications that support those needs by providing greater volumes of data and associated resources in a growing inventory of languages with ever more sophisticated annotation. Specifically, it covers the evolving role of data centers with specific emphasis on the LDC, the publications released by the LDC in the two years since our last report and the sponsored research programs that provide LRs initially to participants in those programs but eventually to the larger HLT research communities and beyond. creators of tools, specifications and best practices as well 1. Introduction as databases. The language resource landscape over the past two years At least in the case of LDC, the role of the data may be characterized by what has changed and what has center has grown organically, generally in response to remained constant. Constant are the growing demands stated need but occasionally in anticipation of it. LDC’s for an ever-increasing body of language resources in a role was initially that of a specialized publisher of LRs growing number of human languages with increasingly with the additional responsibility to archive and protect sophisticated annotation to satisfy an ever-expanding list published resources to guarantee their availability of the of user communities. Changing are the relative long term. -
Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna
International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-7S, May 2019 Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna At this period corpora were used in semantics and related Abstract: Corpus linguistic will be the one of latest and fast field. At this period corpus linguistics was not commonly method for teaching and learning of different languages. used, no computers were used. Corpus linguistic history Proposed article can provide the corpus linguistic introduction were opposed by Noam Chomsky. Now a day’s modern and give the general overview about the linguistic team of corpus. corpus linguistic was formed from English work context and Article described about the corpus, novelties and corpus are associated in the following field. Proposed paper can focus on the now it is used in many other languages. Alongside was using the corpus linguistic in IT field. As from research it is corpus linguistic history and this is considered as obvious that corpora will be used as internet and primary data in methodology [Abdumanapovna, 2018]. A landmark is the field of IT. The method of text corpus is the digestive modern corpus linguistic which was publish by Henry approach which can drive the different set of rules to govern the Kucera. natural language from the text in the particular language, it can Corpus linguistic introduce new methods and techniques explore about languages that how it can inter-related with other. for more complete description of modern languages and Hence the corpus linguistic will be automatically drive from the these also provide opportunities to obtain new result with text sources. -
Collection of Usage Information for Language Resources from Academic Articles
Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD. -
Studying Social Tagging and Folksonomy: a Review and Framework
Studying Social Tagging and Folksonomy: A Review and Framework Item Type Journal Article (On-line/Unpaginated) Authors Trant, Jennifer Citation Studying Social Tagging and Folksonomy: A Review and Framework 2009-01, 10(1) Journal of Digital Information Journal Journal of Digital Information Download date 02/10/2021 03:25:18 Link to Item http://hdl.handle.net/10150/105375 Trant, Jennifer (2009) Studying Social Tagging and Folksonomy: A Review and Framework. Journal of Digital Information 10(1). Studying Social Tagging and Folksonomy: A Review and Framework J. Trant, University of Toronto / Archives & Museum Informatics 158 Lee Ave, Toronto, ON Canada M4E 2P3 jtrant [at] archimuse.com Abstract This paper reviews research into social tagging and folksonomy (as reflected in about 180 sources published through December 2007). Methods of researching the contribution of social tagging and folksonomy are described, and outstanding research questions are presented. This is a new area of research, where theoretical perspectives and relevant research methods are only now being defined. This paper provides a framework for the study of folksonomy, tagging and social tagging systems. Three broad approaches are identified, focusing first, on the folksonomy itself (and the role of tags in indexing and retrieval); secondly, on tagging (and the behaviour of users); and thirdly, on the nature of social tagging systems (as socio-technical frameworks). Keywords: Social tagging, folksonomy, tagging, literature review, research review 1. Introduction User-generated keywords – tags – have been suggested as a lightweight way of enhancing descriptions of on-line information resources, and improving their access through broader indexing. “Social Tagging” refers to the practice of publicly labeling or categorizing resources in a shared, on-line environment. -
Instructions on the Annotation of Pdf Files
P-annotatePDF-v11b INSTRUCTIONS ON THE ANNOTATION OF PDF FILES To view, print and annotate your chapter you will need Adobe Reader version 9 (or higher). This program is freely available for a whole series of platforms that include PC, Mac, and UNIX and can be downloaded from http://get.adobe.com/reader/. The exact system requirements are given at the Adobe site: http://www.adobe.com/products/reader/tech-specs.html. Note: if you opt to annotate the file with software other than Adobe Reader then please also highlight the appropriate place in the PDF file. PDF ANNOTATIONS Adobe Reader version 9 Adobe Reader version X and XI When you open the PDF file using Adobe Reader, the To make annotations in the PDF file, open the PDF file using Commenting tool bar should be displayed automatically; if Adobe Reader XI, click on ‘Comment’. not, click on ‘Tools’, select ‘Comment & Markup’, then click If this option is not available in your Adobe Reader menus on ‘Show Comment & Markup tool bar’ (or ‘Show then it is possible that your Adobe Acrobat version is lower Commenting bar’ on the Mac). If these options are not than XI or the PDF has not been prepared properly. available in your Adobe Reader menus then it is possible that your Adobe Acrobat version is lower than 9 or the PDF has not been prepared properly. This opens a task pane and, below that, a list of all (Mac) Comments in the text. PDF ANNOTATIONS (Adobe Reader version 9) The default for the Commenting tool bar is set to ‘off’ in version 9. -
Characteristics of Text-To-Speech and Other Corpora
Characteristics of Text-to-Speech and Other Corpora Erica Cooper, Emily Li, Julia Hirschberg Columbia University, USA [email protected], [email protected], [email protected] Abstract “standard” TTS speaking style, whether other forms of profes- sional and non-professional speech differ substantially from the Extensive TTS corpora exist for commercial systems cre- TTS style, and which features are most salient in differentiating ated for high-resource languages such as Mandarin, English, the speech genres. and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and 2. Related Work are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from TTS speakers are typically instructed to speak as consistently “found” data collected for other purposes (e.g. training ASR as possible, without varying their voice quality, speaking style, systems) or available on the web (e.g. news broadcasts, au- pitch, volume, or tempo significantly [1]. This is different from diobooks) to produce TTS systems for low-resource languages other forms of professional speech in that even with the rela- (LRLs) which do not currently have expensive, commercial sys- tively neutral content of broadcast news, anchors will still have tems. This study investigates whether traditional TTS speakers some variance in their speech. Audiobooks present an even do exhibit significantly less variation and better speaking char- greater challenge, with a more expressive reading style and dif- acteristics than speakers in found genres. By examining char- ferent character voices. Nevertheless, [2, 3, 4] have had suc- acteristics of f0, energy, speaking rate, articulation, NHR, jit- cess in building voices from audiobook data by identifying and ter, and shimmer in found genres and comparing these to tra- using the most neutral and highest-quality utterances. -
Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
RANLPStud 2011 Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011) 13 September, 2011 Hissar, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2011 PROCEEDINGS Hissar, Bulgaria 13 September 2011 ISBN 978-954-452-016-8 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The Recent Advances in Natural Language Processing (RANLP) conference, already in its eight year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by -
Linguistic Annotation of the Digital Papyrological Corpus: Sematia
Marja Vierros Linguistic Annotation of the Digital Papyrological Corpus: Sematia 1 Introduction: Why to annotate papyri linguistically? Linguists who study historical languages usually find the methods of corpus linguis- tics exceptionally helpful. When the intuitions of native speakers are lacking, as is the case for historical languages, the corpora provide researchers with materials that replaces the intuitions on which the researchers of modern languages can rely. Using large corpora and computers to count and retrieve information also provides empiri- cal back-up from actual language usage. In the case of ancient Greek, the corpus of literary texts (e.g. Thesaurus Linguae Graecae or the Greek and Roman Collection in the Perseus Digital Library) gives information on the Greek language as it was used in lyric poetry, epic, drama, and prose writing; all these literary genres had some artistic aims and therefore do not always describe language as it was used in normal commu- nication. Ancient written texts rarely reflect the everyday language use, let alone speech. However, the corpus of documentary papyri gets close. The writers of the pa- pyri vary between professionally trained scribes and some individuals who had only rudimentary writing skills. The text types also vary from official decrees and orders to small notes and receipts. What they have in common, though, is that they have been written for a specific, current need instead of trying to impress a specific audience. Documentary papyri represent everyday texts, utilitarian prose,1 and in that respect, they provide us a very valuable source of language actually used by common people in everyday circumstances. -
MASC: the Manually Annotated Sub-Corpus of American English
MASC: The Manually Annotated Sub-Corpus of American English Nancy Ide*, Collin Baker**, Christiane Fellbaum†, Charles Fillmore**, Rebecca Passonneau†† *Vassar College Poughkeepsie, New York USA **International Computer Science Institute Berkeley, California USA †Princeton University Princeton, New Jersey USA ††Columbia University New York, New York USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected] Abstract To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing -
Multilingvální Systémy Rozpoznávání Řeči a Jejich Efektivní Učení
Multilingvální systémy rozpoznávání řeči a jejich efektivní učení Disertační práce Studijní program: P2612 – Elektrotechnika a informatika Studijní obor: 2612V045 – Technická kybernetika Autor práce: Ing. Radek Šafařík Vedoucí práce: prof. Ing. Jan Nouza, CSc. Liberec 2020 Multilingual speech recognition systems and their effective learning Dissertation Study programme: P2612 – Electrotechnics and informatics Study branch: 2612V045 – Technical cybernetics Author: Ing. Radek Šafařík Supervisor: prof. Ing. Jan Nouza, CSc. Liberec 2020 Prohlášení Byl jsem seznámen s tím, že na mou disertační práci se plně vztahu- je zákon č. 121/2000 Sb., o právu autorském, zejména § 60 – školní dílo. Beru na vědomí, že Technická univerzita v Liberci (TUL) nezasahu- je do mých autorských práv užitím mé disertační práce pro vnitřní potřebu TUL. Užiji-li disertační práci nebo poskytnu-li licenci k jejímu využi- tí, jsem si vědom povinnosti informovat o této skutečnosti TUL; v tomto případě má TUL právo ode mne požadovat úhradu nákla- dů, které vynaložila na vytvoření díla, až do jejich skutečné výše. Disertační práci jsem vypracoval samostatně s použitím uvedené literatury a na základě konzultací s vedoucím mé disertační práce a konzultantem. Současně čestně prohlašuji, že texty tištěné verze práce a elektro- nické verze práce vložené do IS STAG se shodují. 1. 12. 2020 Ing. Radek Šafařík Abstract The diseratation thesis deals with creation of automatic speech recognition systems (ASR) and with effective adaptation of already existing system to a new language. Today’s ASR systems have modular stucture where individual moduls can be consi- dered as language dependent or independent. The main goal of this thesis is research and development of methods that automate and make the development of language dependent modules effective as much as possible using free available data fromthe internet, machine learning methods and similarities between langages. -
Segmentability Differences Between Child-Directed and Adult-Directed Speech: a Systematic Test with an Ecologically Valid Corpus
Report Segmentability Differences Between Child-Directed and Adult-Directed Speech: A Systematic Test With an Ecologically Valid Corpus Alejandrina Cristia 1, Emmanuel Dupoux1,2,3, Nan Bernstein Ratner4, and Melanie Soderstrom5 1Dept d’Etudes Cognitives, ENS, PSL University, EHESS, CNRS 2INRIA an open access journal 3FAIR Paris 4Department of Hearing and Speech Sciences, University of Maryland 5Department of Psychology, University of Manitoba Keywords: computational modeling, learnability, infant word segmentation, statistical learning, lexicon ABSTRACT Previous computational modeling suggests it is much easier to segment words from child-directed speech (CDS) than adult-directed speech (ADS). However, this conclusion is based on data collected in the laboratory, with CDS from play sessions and ADS between a parent and an experimenter, which may not be representative of ecologically collected CDS and ADS. Fully naturalistic ADS and CDS collected with a nonintrusive recording device Citation: Cristia A., Dupoux, E., Ratner, as the child went about her day were analyzed with a diverse set of algorithms. The N. B., & Soderstrom, M. (2019). difference between registers was small compared to differences between algorithms; it Segmentability Differences Between Child-Directed and Adult-Directed reduced when corpora were matched, and it even reversed under some conditions. Speech: A Systematic Test With an Ecologically Valid Corpus. Open Mind: These results highlight the interest of studying learnability using naturalistic corpora Discoveries in Cognitive Science, 3, 13–22. https://doi.org/10.1162/opmi_ and diverse algorithmic definitions. a_00022 DOI: https://doi.org/10.1162/opmi_a_00022 INTRODUCTION Supplemental Materials: Although children are exposed to both child-directed speech (CDS) and adult-directed speech https://osf.io/th75g/ (ADS), children appear to extract more information from the former than the latter (e.g., Cristia, Received: 15 May 2018 2013; Shneidman & Goldin-Meadow,2012).