The Phonetic Analysis of Speech Corpora


Jonathan Harrington
Institute of Phonetics and Speech Processing
Ludwig-Maximilians University of Munich
Germany

Wiley-Blackwell

Contents

Relationship between International and Machine Readable Phonetic Alphabet (Australian English)
Relationship between International and Machine Readable Phonetic Alphabet (German)
Downloadable speech databases used in this book
Preface
Notes of downloading software

Chapter 1 Using speech corpora in phonetics research
  1.0 The place of corpora in the phonetic analysis of speech
  1.1 Existing speech corpora for phonetic analysis
  1.2 Designing your own corpus
    1.2.1 Speakers
    1.2.2 Materials
    1.2.3 Some further issues in experimental design
    1.2.4 Speaking style
    1.2.5 Recording setup
    1.2.6 Annotation
    1.2.7 Some conventions for naming files
  1.3 Summary and structure of the book

Chapter 2 Some tools for building and querying labelling speech databases
  2.0 Overview
  2.1 Getting started with existing speech databases
  2.2 Interface between Praat and Emu
  2.3 Interface to R
  2.4 Creating a new speech database: from Praat to Emu to R
  2.5 A first look at the template file
  2.6 Summary
  2.7 Questions

Chapter 3 Applying routines for speech signal processing
  3.0 Introduction
  3.1 Calculating, displaying, and correcting formants
  3.2 Reading the formants into R
  3.3 Summary
  3.4 Questions
  3.5 Answers

Chapter 4 Querying annotation structures
  4.1 The Emu Query Tool, segment tiers and event tiers
  4.2 Extending the range of queries: annotations from the same tier
  4.3 Inter-tier links and queries
  4.4 Entering structured annotations with Emu
  4.5 Conversion of a structured annotation to a Praat TextGrid
  4.6 Graphical user interface to the Emu query language
  4.7 Re-querying segment lists
  4.8 Building annotation structures semi-automatically with Emu-Tcl
  4.9 Branching paths
  4.10 Summary
  4.11 Questions
  4.12 Answers

Chapter 5 An introduction to speech data analysis in R: a study of an EMA database
  5.1 EMA recordings and the ema5 database
  5.2 Handling segment lists and vectors in Emu-R
  5.3 An analysis of voice onset time
  5.4 Inter-gestural coordination and ensemble plots
    5.4.1 Extracting trackdata objects
    5.4.2 Movement plots from single segments
    5.4.3 Ensemble plots
  5.5 Intragestural analysis
    5.5.1 Manipulation of trackdata objects
    5.5.2 Differencing and velocity
    5.5.3 Critically damped movement, magnitude, and peak velocity
  5.6 Summary
  5.7 Questions
  5.8 Answers

Chapter 6 Analysis of formants and formant transitions
  6.1 Vowel ellipses in the F2 x F1 plane
  6.2 Outliers
  6.3 Vowel targets
  6.4 Vowel normalisation
  6.5 Euclidean distances
    6.5.1 Vowel space expansion
    6.5.2 Relative distance between vowel categories
  6.6 Vowel undershoot and formant smoothing
  6.7 F2 locus, place of articulation and variability
  6.8 Questions
  6.9 Answers

Chapter 7 Electropalatography
  7.1 Palatography and electropalatography
  7.2 An overview of electropalatography in Emu-R
  7.3 EPG data reduced objects
    7.3.1 Contact profiles
    7.3.2 Contact distribution indices
  7.4 Analysis of EPG data
    7.4.1 Consonant overlap
    7.4.2 VC coarticulation in German dorsal fricatives
  7.5 Summary
  7.6 Questions
  7.7 Answers

Chapter 8 Spectral analysis
  8.1 Background to spectral analysis
    8.1.1 The sinusoid
    8.1.2 Fourier analysis and Fourier synthesis
    8.1.3 Amplitude spectrum
    8.1.4 Sampling frequency
    8.1.5 dB-Spectrum
    8.1.6 Hamming and Hann(ing) windows
    8.1.7 Time and frequency resolution
    8.1.8 Preemphasis
    8.1.9 Handling spectral data in Emu-R
  8.2 Spectral average, sum, ratio, difference, slope
  8.3 Spectral moments
  8.4 The discrete cosine transformation
    8.4.1 Calculating DCT-coefficients in EMU-R
    8.4.2 DCT-coefficients of a spectrum
    8.4.3 DCT-coefficients and trajectory shape
    8.4.4 Mel- and Bark-scaled DCT (cepstral) coefficients
  8.5 Questions
  8.6 Answers

Chapter 9 Classification
  9.1 Probability and Bayes theorem
  9.2 Classification: continuous data
    9.2.1 The binomial and normal distributions
  9.3 Calculating conditional probabilities
  9.4 Calculating posterior probabilities
  9.5 Two-parameters: the bivariate normal distribution and ellipses
  9.6 Classification in two dimensions
  9.7 Classifications in higher dimensional spaces
  9.8 Classifications in time
    9.8.1 Parameterising dynamic spectral information
  9.9 Support vector machines
  9.10 Summary
  9.11 Questions
  9.12 Answers

References

Relationship between Machine Readable (MRPA) and International Phonetic Alphabet (IPA) for Australian English

MRPA    IPA     Example

Tense vowels
i:      i:      heed
u:      ʉ:      who'd
o:      ɔ:      hoard
a:      ɐ:      hard
@:      ɜ:      heard

Lax vowels
I       ɪ       hid
U       ʊ       hood
E       ɛ       head
O       ɔ       hod
V       ɐ       bud
A       æ       had

Diphthongs
I@      ɪə      here
E@      eə      there
U@      ʉə      tour
ei      æɪ      hay
ai      ɐɪ      high
au      æʉ      how
oi      ɔɪ      boy
ou      ɔʉ      hoe

Schwa
@       ə       the

Consonants
p       p       pie
b       b       buy
t       t       tie
d       d       die
k       k       cut
g       g       go
tS      ʧ       church
dZ      ʤ       judge
H       h       (Aspiration/stop release)
m       m       my
n       n       no
N       ŋ       sing
f       f       fan
v       v       van
T       θ       think
D       ð       the
s       s       see
z       z       zoo
S       ʃ       shoe
Z       ʒ       beige
h       h       he
r       ɻ       road
w       w       we
l       l       long
j       j       yes

Relationship between Machine Readable (MRPA) and International Phonetic Alphabet (IPA) for German

The MRPA for German is in accordance with SAMPA (Wells, 1997), the speech assessment methods phonetic alphabet.

MRPA    IPA     Example

Tense vowels and diphthongs
2:      ø:      Söhne
2:6     øɐ      stört
a:      a:      Strafe, Lahm
a:6     a:ɐ     Haar
e:      e:      geht
E:      ɛ:      Mädchen
E:6     ɛ:ɐ     fährt
e:6     e:ɐ     werden
i:      i:      Liebe
i:6     i:ɐ     Bier
o:      o:      Sohn
o:6     o:ɐ     vor
u:      u:      tun
u:6     u:ɐ     Uhr
y:      y:      kühl
y:6     y:ɐ     natürlich
aI      aɪ      mein
aU      aʊ      Haus
OY      ɔY      Beute

Lax vowels and diphthongs
U       ʊ       Mund
9       œ       zwölf
a       a       nass
a6      aɐ      Mark
E       ɛ       Mensch
E6      ɛɐ      Lärm
I       ɪ       finden
I6      ɪɐ      wirklich
O       ɔ       kommt
O6      ɔɐ      dort
U6      ʊɐ      durch
Y       Y       Glück
Y6      Yɐ      würde
6       ɐ       Vater

Consonants
p       p       Panne
b       b       Baum
t       t       Tanne
d       d       Daumen
k       k       kahl
g       g       Gaumen
pf      pf      Pfeffer
ts      ʦ       Zahn
tS      ʧ       Cello
dZ      ʤ       Job
Q       ʔ       (Glottal stop)
h       h       (Aspiration)
m       m       Miene
n       n       nehmen
N       ŋ       lang
f       f       friedlich
v       v       weg
s       s       lassen
z       z       lesen
S       ʃ       schauen
Z       ʒ       Genie
C       ç       riechen
x       x       Buch, lachen
h       h       hoch
r       r, ʁ    Regen
l       l       lang
j       j       jemand

Downloadable speech databases used in this book

Database name | Description | Language/dialect | n | S | Signal files | Annotations | Source
aetobi | A fragment of the AE-TOBI database: read and spontaneous speech | American English | 17 | various | Audio | Word, tonal, break | Beckman et al (2005); Pitrelli et al (1994); Silverman et al (1992)
ae | Read sentences | Australian English | 7 | 1M | Audio, spectra, formants | Prosodic, phonetic, tonal | Millar et al (1997); Millar et al (1994)
andosl | Read sentences | Australian English | 200 | 2M | Audio, formants | Same as ae | Millar et al (1997); Millar et al (1994)
ema5 (ema) | Read sentences | Standard German | 20 | 1F | Audio, EMA | Word, phonetic, tongue-tip, tongue-body | Bombien et al (2007)
epgassim | Isolated words | Australian English | 60 | 1F | Audio, EPG | Word, phonetic | Stephenson & Harrington (2002); Stephenson (2003)
epgcoutts | Read speech | Australian English | 2 | 1F | Audio, EPG | Word | Passage from Hewlett & Shockey (1992)
epgdorsal | Isolated words | German | 45 | 1M | Audio, EPG, formants | Word, phonetic | Ambrazaitis & John (2004)
epgpolish | Read sentences | Polish | 40 | 1M | Audio, EPG | Word, phonetic | Guzik & Harrington (2007)
first | 5 utterances from gerplosives | | | | | |
gerplosives | Isolated words in carrier sentence | German | 72 | 1M | Audio, spectra | Phonetic | Unpublished
gt | Continuous speech | German | 9 | various | Audio, f0 | Word, Break, Tone | Utterances from various sources
isolated | Isolated word production | Australian English | 218 | 1M | Audio, formants, b-widths | Phonetic | As ae above
kielread | Read sentences | German | 200 | 1M, 1F | Audio, formants | Phonetic | Simpson (1998); Simpson et al (1997)
mora | Read | Japanese | 1 | 1F | Audio | Phonetic | Unpublished
second | Two speakers from gerplosives | | | | | |
stops | Isolated words in carrier sentence | German | 470 | 3M, 4F | Audio, formants | Phonetic | Unpublished
timetable | Timetable enquiries | German | 5 | 1M | Audio | Phonetic | As kielread

Preface

In undergraduate courses that include phonetics, students typically acquire both skills in ear-training and an understanding of the acoustic, physiological, and perceptual characteristics of speech sounds. But there is usually less opportunity to test this knowledge on sizeable quantities of speech data, partly because putting together any database that is sufficient in extent to be able to address non-trivial questions in phonetics is very time-consuming. In the last ten years, this issue has been offset somewhat by the rapid growth of national and international speech corpora, which has been driven principally by the needs of speech technology. But there is still usually a big gap between the knowledge acquired in phonetics from classes on the one hand and applying this knowledge to available speech corpora with the aim of solving different kinds of theoretical problems on the other. The difficulty stems not just from getting the right data out of the corpus but also from deciding what kinds of graphical and quantitative techniques are available and appropriate for the problem that is to be solved.