Preparation and Analysis of Linguistic Corpora
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Talk Bank: a Multimodal Database of Communicative Interaction
Talk Bank: A Multimodal Database of Communicative Interaction 1. Overview The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can to extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web- based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes. -
Adapting to Trends in Language Resource Development: a Progress
Adapting to Trends in Language Resource Development: A Progress Report on LDC Activities Christopher Cieri, Mark Liberman University of Pennsylvania, Linguistic Data Consortium 3600 Market Street, Suite 810, Philadelphia PA. 19104, USA E-mail: {ccieri,myl} AT ldc.upenn.edu Abstract This paper describes changing needs among the communities that exploit language resources and recent LDC activities and publications that support those needs by providing greater volumes of data and associated resources in a growing inventory of languages with ever more sophisticated annotation. Specifically, it covers the evolving role of data centers with specific emphasis on the LDC, the publications released by the LDC in the two years since our last report and the sponsored research programs that provide LRs initially to participants in those programs but eventually to the larger HLT research communities and beyond. creators of tools, specifications and best practices as well 1. Introduction as databases. The language resource landscape over the past two years At least in the case of LDC, the role of the data may be characterized by what has changed and what has center has grown organically, generally in response to remained constant. Constant are the growing demands stated need but occasionally in anticipation of it. LDC’s for an ever-increasing body of language resources in a role was initially that of a specialized publisher of LRs growing number of human languages with increasingly with the additional responsibility to archive and protect sophisticated annotation to satisfy an ever-expanding list published resources to guarantee their availability of the of user communities. Changing are the relative long term. -
Characteristics of Text-To-Speech and Other Corpora
Characteristics of Text-to-Speech and Other Corpora Erica Cooper, Emily Li, Julia Hirschberg Columbia University, USA [email protected], [email protected], [email protected] Abstract “standard” TTS speaking style, whether other forms of profes- sional and non-professional speech differ substantially from the Extensive TTS corpora exist for commercial systems cre- TTS style, and which features are most salient in differentiating ated for high-resource languages such as Mandarin, English, the speech genres. and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and 2. Related Work are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from TTS speakers are typically instructed to speak as consistently “found” data collected for other purposes (e.g. training ASR as possible, without varying their voice quality, speaking style, systems) or available on the web (e.g. news broadcasts, au- pitch, volume, or tempo significantly [1]. This is different from diobooks) to produce TTS systems for low-resource languages other forms of professional speech in that even with the rela- (LRLs) which do not currently have expensive, commercial sys- tively neutral content of broadcast news, anchors will still have tems. This study investigates whether traditional TTS speakers some variance in their speech. Audiobooks present an even do exhibit significantly less variation and better speaking char- greater challenge, with a more expressive reading style and dif- acteristics than speakers in found genres. By examining char- ferent character voices. Nevertheless, [2, 3, 4] have had suc- acteristics of f0, energy, speaking rate, articulation, NHR, jit- cess in building voices from audiobook data by identifying and ter, and shimmer in found genres and comparing these to tra- using the most neutral and highest-quality utterances. -
Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
RANLPStud 2011 Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011) 13 September, 2011 Hissar, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2011 PROCEEDINGS Hissar, Bulgaria 13 September 2011 ISBN 978-954-452-016-8 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The Recent Advances in Natural Language Processing (RANLP) conference, already in its eight year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by -
MASC: the Manually Annotated Sub-Corpus of American English
MASC: The Manually Annotated Sub-Corpus of American English Nancy Ide*, Collin Baker**, Christiane Fellbaum†, Charles Fillmore**, Rebecca Passonneau†† *Vassar College Poughkeepsie, New York USA **International Computer Science Institute Berkeley, California USA †Princeton University Princeton, New Jersey USA ††Columbia University New York, New York USA E-mail: [email protected], [email protected], [email protected], [email protected], [email protected] Abstract To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing -
100000 Podcasts
100,000 Podcasts: A Spoken English Document Corpus Ann Clifton Sravana Reddy Yongze Yu Spotify Spotify Spotify [email protected] [email protected] [email protected] Aasish Pappu Rezvaneh Rezapour∗ Hamed Bonab∗ Spotify University of Illinois University of Massachusetts [email protected] at Urbana-Champaign Amherst [email protected] [email protected] Maria Eskevich Gareth J. F. Jones Jussi Karlgren CLARIN ERIC Dublin City University Spotify [email protected] [email protected] [email protected] Ben Carterette Rosie Jones Spotify Spotify [email protected] [email protected] Abstract Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typi- cally studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a re- source for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus opens up new avenues for research. -
Child Language
ABSTRACTS 14TH INTERNATIONAL CONGRESS FOR THE STUDY OF CHILD LANGUAGE IN LYON, IASCL FRANCE 2017 WELCOME JULY, 17TH21ST 2017 SPECIAL THANKS TO - 2 - SUMMARY Plenary Day 1 4 Day 2 5 Day 3 53 Day 4 101 Day 5 146 WELCOME! Symposia Day 2 6 Day 3 54 Day 4 102 Day 5 147 Poster Day 2 189 Day 3 239 Day 4 295 - 3 - TH DAY MONDAY, 17 1 18:00-19:00, GRAND AMPHI PLENARY TALK Bottom-up and top-down information in infants’ early language acquisition Sharon Peperkamp Laboratoire de Sciences Cognitives et Psycholinguistique, Paris, France Decades of research have shown that before they pronounce their first words, infants acquire much of the sound structure of their native language, while also developing word segmentation skills and starting to build a lexicon. The rapidity of this acquisition is intriguing, and the underlying learning mechanisms are still largely unknown. Drawing on both experimental and modeling work, I will review recent research in this domain and illustrate specifically how both bottom-up and top-down cues contribute to infants’ acquisition of phonetic cat- egories and phonological rules. - 4 - TH DAY TUESDAY, 18 2 9:00-10:00, GRAND AMPHI PLENARY TALK What do the hands tell us about lan- guage development? Insights from de- velopment of speech, gesture and sign across languages Asli Ozyurek Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands Most research and theory on language development focus on children’s spoken utterances. However language development starting with the first words of children is multimodal. Speaking children produce gestures ac- companying and complementing their spoken utterances in meaningful ways through pointing or iconic ges- tures. -
Segmentability Differences Between Child-Directed and Adult-Directed Speech: a Systematic Test with an Ecologically Valid Corpus
Report Segmentability Differences Between Child-Directed and Adult-Directed Speech: A Systematic Test With an Ecologically Valid Corpus Alejandrina Cristia 1, Emmanuel Dupoux1,2,3, Nan Bernstein Ratner4, and Melanie Soderstrom5 1Dept d’Etudes Cognitives, ENS, PSL University, EHESS, CNRS 2INRIA an open access journal 3FAIR Paris 4Department of Hearing and Speech Sciences, University of Maryland 5Department of Psychology, University of Manitoba Keywords: computational modeling, learnability, infant word segmentation, statistical learning, lexicon ABSTRACT Previous computational modeling suggests it is much easier to segment words from child-directed speech (CDS) than adult-directed speech (ADS). However, this conclusion is based on data collected in the laboratory, with CDS from play sessions and ADS between a parent and an experimenter, which may not be representative of ecologically collected CDS and ADS. Fully naturalistic ADS and CDS collected with a nonintrusive recording device Citation: Cristia A., Dupoux, E., Ratner, as the child went about her day were analyzed with a diverse set of algorithms. The N. B., & Soderstrom, M. (2019). difference between registers was small compared to differences between algorithms; it Segmentability Differences Between Child-Directed and Adult-Directed reduced when corpora were matched, and it even reversed under some conditions. Speech: A Systematic Test With an Ecologically Valid Corpus. Open Mind: These results highlight the interest of studying learnability using naturalistic corpora Discoveries in Cognitive Science, 3, 13–22. https://doi.org/10.1162/opmi_ and diverse algorithmic definitions. a_00022 DOI: https://doi.org/10.1162/opmi_a_00022 INTRODUCTION Supplemental Materials: Although children are exposed to both child-directed speech (CDS) and adult-directed speech https://osf.io/th75g/ (ADS), children appear to extract more information from the former than the latter (e.g., Cristia, Received: 15 May 2018 2013; Shneidman & Goldin-Meadow,2012). -
From CHILDES to Talkbank
From CHILDES to TalkBank Brian MacWhinney Carnegie Mellon University MacWhinney, B. (2001). New developments in CHILDES. In A. Do, L. Domínguez & A. Johansen (Eds.), BUCLD 25: Proceedings of the 25th annual Boston University Conference on Language Development (pp. 458-468). Somerville, MA: Cascadilla. a similar article appeared as: MacWhinney, B. (2001). From CHILDES to TalkBank. In M. Almgren, A. Barreña, M. Ezeizaberrena, I. Idiazabal & B. MacWhinney (Eds.), Research on Child Language Acquisition (pp. 17-34). Somerville, MA: Cascadilla. Recent years have seen a phenomenal growth in computer power and connectivity. The computer on the desktop of the average academic researcher now has the power of room-size supercomputers of the 1980s. Using the Internet, we can connect in seconds to the other side of the world and transfer huge amounts of text, programs, audio and video. Our computers are equipped with programs that allow us to view, link, and modify this material without even having to think about programming. Nearly all of the major journals are now available in electronic form and the very nature of journals and publication is undergoing radical change. These new trends have led to dramatic advances in the methodology of science and engineering. However, the social and behavioral sciences have not shared fully in these advances. In large part, this is because the data used in the social sciences are not well- structured patterns of DNA sequences or atomic collisions in super colliders. Much of our data is based on the messy, ill-structured behaviors of humans as they participate in social interactions. Categorizing and coding these behaviors is an enormous task in itself. -
Multimedia Corpora (Media Encoding and Annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek)
Multimedia Corpora (Media encoding and annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek) Draft submitted to CLARIN WG 5.7. as input to CLARIN deliverable D5.C3 “Interoperability and Standards” [http://www.clarin.eu/system/files/clarindeliverableD5C3_v1_5finaldraft.pdf] Table of Contents 1 General distinctions / terminology................................................................................................................................... 1 1.1 Different types of multimedia corpora: spoken language vs. speech vs. phonetic vs. multimodal corpora vs. sign language corpora......................................................................................................................................................... 1 1.2 Media encoding vs. Media annotation................................................................................................................... 3 1.3 Data models/file formats vs. Transcription systems/conventions.......................................................................... 3 1.4 Transcription vs. Annotation / Coding vs. Metadata ............................................................................................. 3 2 Media encoding ............................................................................................................................................................... 5 2.1 Audio encoding ..................................................................................................................................................... 5 2.2 -
The General Regionally Annotated Corpus of Ukrainian (GRAC, Uacorpus.Org): Architecture and Functionality
The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality Maria Shvedova[0000-0002-0759-1689] Kyiv National Linguistic University [email protected] Abstract. The paper presents the General Regionally Annotated Corpus of Ukrainian, which is publicly available (GRAC: uacorpus.org), searchable online and counts more than 400 million tokens, representing most genres of written texts. It also features regional annotation, i. e. about 50 percent of the texts are attributed with regard to the different regions of Ukraine or countries of the di- aspora. If the author is known, the text is linked to their home region(s). The journalistic texts are annotated with regard to the place where the edition is pub- lished. This feature differs the GRAC from a majority of general linguistic cor- pora. Keywords: Ukrainian language, corpus, diachronic evolution, regional varia- tion. 1 Introduction Currently many major national languages have large universal corpora, known as “reference” corpora (cf. das Deutsche Referenzkorpus – DeReKo) or “national” cor- pora (a label dating back ultimately to the British National Corpus). These corpora are large, representative for different genres of written language, have a certain depth of (usually morphological and metatextual) annotation and can be used for many differ- ent linguistic purposes. The Ukrainian language lacks a publicly available linguistic corpus. Still there is a need of a corpus in the present-day linguistics. Independently researchers compile dif- ferent corpora of Ukrainian for separate research purposes with different size and functionality. As the community lacks a universal tool, a researcher may build their own corpus according to their needs. -
Gold Standard Annotations for Preposition and Verb Sense With
Gold Standard Annotations for Preposition and Verb Sense with Semantic Role Labels in Adult-Child Interactions Lori Moon Christos Christodoulopoulos Cynthia Fisher University of Illinois at Amazon Research University of Illinois at Urbana-Champaign [email protected] Urbana-Champaign [email protected] [email protected] Sandra Franco Dan Roth Intelligent Medical Objects University of Pennsylvania Northbrook, IL USA [email protected] [email protected] Abstract This paper describes the augmentation of an existing corpus of child-directed speech. The re- sulting corpus is a gold-standard labeled corpus for supervised learning of semantic role labels in adult-child dialogues. Semantic role labeling (SRL) models assign semantic roles to sentence constituents, thus indicating who has done what to whom (and in what way). The current corpus is derived from the Adam files in the Brown corpus (Brown, 1973) of the CHILDES corpora, and augments the partial annotation described in Connor et al. (2010). It provides labels for both semantic arguments of verbs and semantic arguments of prepositions. The semantic role labels and senses of verbs follow Propbank guidelines (Kingsbury and Palmer, 2002; Gildea and Palmer, 2002; Palmer et al., 2005) and those for prepositions follow Srikumar and Roth (2011). The corpus was annotated by two annotators. Inter-annotator agreement is given sepa- rately for prepositions and verbs, and for adult speech and child speech. Overall, across child and adult samples, including verbs and prepositions, the κ score for sense is 72.6, for the number of semantic-role-bearing arguments, the κ score is 77.4, for identical semantic role labels on a given argument, the κ score is 91.1, for the span of semantic role labels, and the κ for agreement is 93.9.