English Corpus Linguistics: An Introduction - Charles F. Meyer

Cambridge University Press 0521808790 - English Corpus Linguistics: An Introduction - Charles F. Meyer
© Cambridge University Press www.cambridge.org

Index

Aarts, Bas, 4, 102
adequacy, 2–3, 10–11
age, 49–50
Altenberg, Bengt, 26–7
AMALGAM Tagging Project, 86–7, 89
American National Corpus, 24, 84, 142
American Publishing House for the Blind Corpus, 17, 142
analyzing a corpus, 100
  determining suitability, 103–7, 107t
  exploring a corpus, 123–4
  extracting information: defining parameters, 107–9; coding and recording, 109–14, 112t; locating relevant constructions, 114–19, 116f, 118f
  framing research question, 101–3
  future prospects, 140–1
  see also pseudo-titles (corpus analysis case study); statistical analysis
anaphors, 97
annotation, 98–9
  future prospects, 140
  grammatical markup, 81
  parsing, 91–6, 98, 140
  part-of-speech markup, 81
  structural markup, 68–9, 81–6
  tagging, 86–91, 97–8, 111, 117–18, 140
  types, 81
appositions, 42, 98
  see also pseudo-titles (corpus analysis case study)
ARCHER (A Representative Corpus of English Historical Registers), 21, 22, 79 n6, 140, 142
Aston, Guy, 19
AUTASYS Tagger, 87

Bank of English Corpus, 15, 96, 142
BBC English Dictionary, 15, 16–17
Bell, Alan, 100, 101–3, 104, 108, 110, 131
Bergen Corpus of London Teenage English see COLT Corpus
Biber, Douglas, 10, 19–20, 22, 32, 33, 36, 39–40, 41, 42, 52, 78, 121, 122, 126
Biber, Douglas, et al. (1999) 14
Birmingham Corpus, 15, 142
Blachman, Edward, 76–7
BNC see British National Corpus
Brill, Eric, 86
Brill Tagger, 86–8
British National Corpus (BNC), 143
  annotation, 84
  composition, 18, 31t, 34, 36, 38, 40–1, 49
  copyright, 139–40
  planning, 30–2, 33, 43, 51, 138
  record keeping, 66
  research using, 15, 36
  speech samples, 59
  tagging, 87
  time-frame, 45
British National Corpus (BNC) Sampler, 139–40, 143
Brown Corpus, xii, 1, 143
  genre variation, 18
  length, 32
  research using, 6, 9–10, 12, 42, 98, 103
  sampling methodology, 44
  tagging, 87, 90
  time-frame, 45
  see also FROWN (Freiburg–Brown) Corpus
Burges, Jen, 52
Burnard, Lou, 19, 82, 84, 85–6

Cambridge International Corpus, 15, 143
Cambridge Learners’ Corpus, 15, 143
Canterbury Project, 79, 143
Cantos, Pascual, 33 n2
Chafe, Wallace, 3, 32, 52, 72, 85
CHAT system (Codes for the Human Analysis of Transcripts), 26, 113–14
Chemnitz Corpus, 23, 143
CHILDES (Child Language Data Exchange System) Corpus, xiii, 26, 113, 144
Chomsky, Noam, 2, 3
CIA (contrastive interlanguage analysis), 26
CLAN software programs, 26
CLAWS tagger, 25, 87, 89–90
Coates, Jennifer, 12, 13
collecting data
  general considerations, 55–6
  record keeping, 64–6
  speech samples, 56; broadcasts, 61; future prospects, 139; microphones, 60; “natural” speech, 56–8, 59; permission, 57; problems, 60–1; recording, 58–9; sample length, 57–8; tape recorders, 59–60
  writing samples: copyright, 38, 61–2, 79 n6, 139–40; electronic texts, 63–4; future prospects, 139; sources, 62–4
  see also sampling methodology
Collins, Peter, xii–xiii
Collins COBUILD English Dictionary, 15
Collins COBUILD Project, 14, 15
COLT Corpus (Bergen Corpus of London Teenage English), xiii–xiv, 18, 49, 142
competence vs. performance, 4
computerizing data
  directory structure, 67, 68f
  file format, 66–7
  markup, 67, 68–9 see also annotation
  speech, see speech, computerizing
  written texts, 78–80, 139
concordancing programs
  KWIC format, 115–16, 116f
  for language learning, 27–8
  “lemma” searches, 116
  programs, 115, 117, 150–1
  with tagged or parsed corpus, 117–18
  uses, 16, 86, 114
  “wild card” searches, 116–17
Conrad, Susan, 126
contrastive analysis, 22–4
contrastive interlanguage analysis (CIA), 26
Cook, Guy, 72, 86
copyright, 38, 44, 57, 61–2, 79 n6, 139–40
Corpora Discussion List, 144
corpus (corpora)
  balanced, xii
  construction see planning corpus construction
  definitions, xi–xii
  diachronic, 46
  historical, 20–2, 37–8, 46, 51, 78–9
  learner, 26–7
  monitor, 15
  multi-purpose, 36
  parallel, 22–4
  parsed, 96
  resources, 142–9
  special-purpose, 36
  synchronic, 45–6
corpus linguistics, xi, xiii–xiv, 1–2, 3–4
Corpus of Early English Correspondence, 22, 37, 144
Corpus of Middle English Prose and Verse, 144
Corpus of Spoken Professional English, 71, 144
corpus-based research, 11
  contrastive analysis, 22–4
  grammatical studies, 11–13
  historical linguistics, 20–2
  language acquisition, 26–7
  language pedagogy, 27–8
  language variation, 17–20
  lexicography, 14–17
  limitations, 124
  natural language processing (NLP), xiii, 24–6
  reference grammars, 13–14
  translation theory, 22–4
Crowdy, Steve, 43, 59
Curme, G., 13

data-driven learning, 27–8
de Haan, Pieter, 97–8
descriptive adequacy, 2, 3
diachronic corpora, 46
dialect variation, 51–2
dictionaries, 14–17
Du Bois, John, 32, 52, 85
Dunning, Ted, 132

EAGLES Project see Expert Advisory Group on Language Engineering Standards, The
Ebeling, Jarle, 23
education, 50
Ehlich, Konrad, 77
Electronic Beowulf, The, 21, 144
electronic texts, 63–4
elliptical coordination
  frequency, 7, 12
  functional analysis, 6–11
  genres, 6, 9–10
  position, 6–7
  repetition in speech, 9
  serial position effect, 7–8, 8t
  speech vs. writing, 8–9
  suspense effect, 7–8, 8t
empty categories, 4–5
ENGCG Parser, 96
EngCG-2 tagger, 88
EngFDG parser, 91, 93–4, 93–4 n8, 96
English–Norwegian Parallel Corpus, 23, 62, 144
ethnographic information, 65–6
  see also sociolinguistic variables
Expert Advisory Group on Language Engineering Standards, The (EAGLES), xi, 84, 144
explanatory adequacy, 2, 3, 10–11
Extensible Markup Language see XML
Eyes, Elizabeth, 91

Fernquest, Jon, 114
Fillmore, Charles, 4, 17
Finegan, Edward, 22
Fletcher, P., 121–2
FLOB (Freiburg–Lancaster–Oslo–Bergen) Corpus, 21, 45, 145
“frame” semantics, 17
Francis, W. Nelson, 1, 88
FROWN (Freiburg–Brown) Corpus, 21, 145
FTF see fuzzy tree fragments
functional descriptions of language
  elliptical coordination, 6–11, 8t, 12
  repetition in speech, 9
  voice, 5–6
fuzzy tree fragments (FTF), 119, 119f

Gadsby, Adam, 27
Garside, Roger, 88–9
Gavioli, Laura, 28
gender, 18, 22, 48–9
generative grammar, 1, 3–5
genre variation, 18, 19–20, 31t, 34–8, 35t, 40–2
Gillard, Patrick, 27
government and binding theory, 4–5
grammar
  generative, 1, 3–5
  universal, 2–3
“Grammar Safari”, 28
grammars, reference, 13–14
grammatical markup see parsers
grammatical studies, 11–13
Granger, Sylvianne, 26
Greenbaum, Sidney, 7, 14, 22, 35t, 64, 75, 95
Greene, B. B., 87, 88

Haegeman, Lilliane, 2–3, 4–5, 6
Hasselgård, Hilde, 23
Helsinki Corpus, 145
  composition, 20–1, 38
  planning, 46
  research using, 22, 37, 51
  symbols system, 67
Helsinki Corpus of Older Scots, 145
historical corpora, 20–2, 37–8, 46, 51, 78–9
  see also ARCHER; Helsinki Corpus
Hofland, Knut, 23
Hong Kong University of Science and Technology (HKUST) Learner Corpus, 26, 145
Hughes, A., 121–2

ICAME Bibliography, 145
ICAME CD-ROM, 67, 145
ICE (International Corpus of English), 146
  annotation, 82–3, 84, 85, 87, 90
  composition, 34, 35t, 36, 38, 39, 40–2, 104
  computerizing data, 72, 73
  copyright, 38, 44
  criteria, 50
  record keeping, 66
  regional components, 104, 105–6, 106t, 110, 123, 124
  research using, 6, 9 see also pseudo-titles (corpus analysis case study)
  sampling, 44, 56
  time-frame, 45
  see also ICECUP; ICE-GB; ICE-USA
ICE Markup Assistant, 85, 86
ICE Tree, 95
ICECUP (ICE Corpus Utility Program), 19, 116, 119, 146
ICE-East Africa, 106, 106t, 107t, 110, 123t, 124
ICE-GB, 146
  annotation, 25, 83–4, 86, 92–3, 92f, 96, 117–18, 118f, 140
  composition, 106t
  computerizing data, 73
  criteria, 50
  record keeping, 64–5
  research using, 14, 19, 115–16, 116f
  see also pseudo-titles (corpus analysis case study)
ICE-Jamaica, 106t, 107t, 110, 123t
ICE-New Zealand, 106, 106t, 107t, 123t, 125, 130–3
ICE-Philippines, 106, 106t, 107t, 110, 123t, 125, 130–3
ICE-Singapore, 106t, 110, 123t
ICE-USA
  composition, 53, 106t
  computerizing data, 70, 71, 73–4, 79
  copyright, 62
  criteria, 46–7
  directory structure, 67–8, 68f
  length, 32–3
  record keeping, 64, 65
  research using see pseudo-titles (corpus analysis case study)
  sampling, 58, 60–1
ICLE see International Corpus of Learner English
Ingegneri, Dominique, 42–3
International Corpus of English see ICE
International Corpus of Learner English (ICLE), 26, 27, 146

Jespersen, Otto, xii, 13
Johansson, Stig, 23

Kalton, Graham, 43
Kennedy, Graeme, 89
Kettemann, Bernhard, 27–8
Kirk, John, 52
Kolhapur Corpus of Indian English, 104
Kretzschmar, William A., Jr., 42–3
Kucera, Henry, 1
KWIC (key word in context), 115–16, 116f
Kytö, M., 37
Kytö, Merja, 42

Labov, W., 9
Lampeter Corpus, 38, 146
Lancaster Corpus, 12, 147
  see also LOB (Lancaster–Oslo–Bergen) Corpus
Lancaster–Oslo–Bergen Corpus see LOB (Lancaster–Oslo–Bergen) Corpus
Lancaster Parsed Corpus, 91–2, 96, 147
Lancaster/IBM
London–Lund Corpus, 147
  annotation, 82
  composition, 53
  names in, 75
  research using, 12, 19, 39, 42, 98
Longman Dictionary of American English, 15
Longman Dictionary of Contemporary English, 15
Longman Essential Activator, 27
Longman–Lancaster Corpus, 12, 148
Longman Learner’s Corpus, 26, 27, 148
Longman Spoken and Written English Corpus, The (LSWE), 14, 90, 148
LSWE see Longman Spoken and Written English Corpus, The

Mair, Christian, 45
Map Task Corpus, 59, 148
markup, 67, 68–9

Recommended publications
  • Talk Bank: a Multimodal Database of Communicative Interaction
    Talk Bank: A Multimodal Database of Communicative Interaction 1. Overview The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web-based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes.
  • MASC: the Manually Annotated Sub-Corpus of American English
    MASC: The Manually Annotated Sub-Corpus of American English. Nancy Ide (Vassar College, Poughkeepsie, New York USA), Collin Baker (International Computer Science Institute, Berkeley, California USA), Christiane Fellbaum (Princeton University, Princeton, New Jersey USA), Charles Fillmore (International Computer Science Institute, Berkeley, California USA), Rebecca Passonneau (Columbia University, New York, New York USA). Abstract: To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing
  • Child Language
    Abstracts, 14th International Congress for the Study of Child Language (IASCL 2017), Lyon, France, July 17th-21st, 2017. Summary: Plenary (Days 1-5), Symposia (Days 2-5), Poster (Days 2-4). Day 1, Monday 17th, 18:00-19:00, Grand Amphi. Plenary talk: Bottom-up and top-down information in infants’ early language acquisition. Sharon Peperkamp, Laboratoire de Sciences Cognitives et Psycholinguistique, Paris, France. Decades of research have shown that before they pronounce their first words, infants acquire much of the sound structure of their native language, while also developing word segmentation skills and starting to build a lexicon. The rapidity of this acquisition is intriguing, and the underlying learning mechanisms are still largely unknown. Drawing on both experimental and modeling work, I will review recent research in this domain and illustrate specifically how both bottom-up and top-down cues contribute to infants’ acquisition of phonetic categories and phonological rules. Day 2, Tuesday 18th, 9:00-10:00, Grand Amphi. Plenary talk: What do the hands tell us about language development? Insights from development of speech, gesture and sign across languages. Asli Ozyurek, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands. Most research and theory on language development focus on children’s spoken utterances. However, language development starting with the first words of children is multimodal. Speaking children produce gestures accompanying and complementing their spoken utterances in meaningful ways through pointing or iconic gestures.
  • Multimedia Corpora (Media Encoding and Annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek)
    Multimedia Corpora (Media encoding and annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek). Draft submitted to CLARIN WG 5.7. as input to CLARIN deliverable D5.C-3 “Interoperability and Standards” [http://www.clarin.eu/system/files/clarin-deliverable-D5C3_v1_5-finaldraft.pdf] Table of Contents: 1 General distinctions / terminology; 1.1 Different types of multimedia corpora: spoken language vs. speech vs. phonetic vs. multimodal corpora vs. sign language corpora; 1.2 Media encoding vs. Media annotation; 1.3 Data models/file formats vs. Transcription systems/conventions; 1.4 Transcription vs. Annotation / Coding vs. Metadata; 2 Media encoding; 2.1 Audio encoding; 2.2
  • Conference Abstracts
    EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION. Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner. MAY 23-24-25, 2012, ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE, ISTANBUL, TURKEY. CONFERENCE ABSTRACTS. Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon. © ELRA – European Language Resources Association. All rights reserved. Title: LREC 2012 Conference Abstracts. Distributed by: ELRA – European Language Resources Association, 55-57, rue Brillat Savarin, 75013 Paris, France. Tel.: +33 1 43 13 33 33, Fax: +33 1 43 13 33 30, www.elra.info and www.elda.org. Copyright by the European Language Resources Association. ISBN 978-2-9517408-7-7, EAN 9782951740877. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association. Introduction of the Conference Chair, Nicoletta Calzolari: I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues.
  • Gold Standard Annotations for Preposition and Verb Sense With
    Gold Standard Annotations for Preposition and Verb Sense with Semantic Role Labels in Adult-Child Interactions. Lori Moon (University of Illinois at Urbana-Champaign), Christos Christodoulopoulos (Amazon Research), Cynthia Fisher (University of Illinois at Urbana-Champaign), Sandra Franco (Intelligent Medical Objects, Northbrook, IL USA), Dan Roth (University of Pennsylvania). Abstract: This paper describes the augmentation of an existing corpus of child-directed speech. The resulting corpus is a gold-standard labeled corpus for supervised learning of semantic role labels in adult-child dialogues. Semantic role labeling (SRL) models assign semantic roles to sentence constituents, thus indicating who has done what to whom (and in what way). The current corpus is derived from the Adam files in the Brown corpus (Brown, 1973) of the CHILDES corpora, and augments the partial annotation described in Connor et al. (2010). It provides labels for both semantic arguments of verbs and semantic arguments of prepositions. The semantic role labels and senses of verbs follow Propbank guidelines (Kingsbury and Palmer, 2002; Gildea and Palmer, 2002; Palmer et al., 2005) and those for prepositions follow Srikumar and Roth (2011). The corpus was annotated by two annotators. Inter-annotator agreement is given separately for prepositions and verbs, and for adult speech and child speech. Overall, across child and adult samples, including verbs and prepositions, the κ score for sense is 72.6, for the number of semantic-role-bearing arguments, the κ score is 77.4, for identical semantic role labels on a given argument, the κ score is 91.1, for the span of semantic role labels, and the κ for agreement is 93.9.
  • Background and Context for CLASP
    Background and Context for CLASP. Nancy Ide, Vassar College. The Situation: Standards efforts have been on-going for over 20 years. Interest and activity mainly in Europe in 90’s and early 2000’s. Text Encoding Initiative (TEI), 1987: still ongoing, used mainly by humanities. EAGLES/ISLE: developed standards for morpho-syntax, syntax, sub-categorization, etc. (links on CLASP wiki). Corpus Encoding Standard (now XCES - http://www.xces.org). Main Aspects: harmonization of formats for linguistic data and annotations; harmonization of descriptors in linguistic annotation; these two are often mixed, but need to deal with them separately (see CLASP wiki). Formats, The Past 20 Years: 1987 TEI; myriad of formats; 1994 MULTEXT, CES; ~1996 XML; 2000 ISO TC37 SC4; 2001 LAF model introduced; now LAF/GrAF, ISO standards; myriad of formats. Actually, things are better now: XML use; moves toward common models, especially in Europe; US community seeing the need for interoperability; emergence of common processing platforms (GATE, UIMA) with underlying common models. Resources, 1990 to present: WordNet gains ground as a “standard” LR; Penn Treebank, Wall Street Journal Corpus; World Wide Web; British National Corpus; EuroWordNet; XML; Comlex; FrameNet; American National Corpus; Semantic Web; Global WordNet; more FrameNets; SUMO; VerbNet; PropBank, NomBank; MASC. NLP software: 1994 MULTEXT > LT tools, LT XML; 1995 GATE (Sheffield); 1996; 1998 Alembic Workbench; ATLAS (NIST); 2003 What happened to this?; 200? Callisto; UIMA. Now: GATE
  • Informatics 1: Data & Analysis
    Informatics 1: Data & Analysis. Lecture 12: Corpora. Ian Stark, School of Informatics, The University of Edinburgh. Friday 27 February 2015, Semester 2 Week 6. http://www.inf.ed.ac.uk/teaching/courses/inf1/da Student Survey Final Day. ESES: The Edinburgh Student Experience Survey http://www.ed.ac.uk/students/surveys Please log on to MyEd before 1 March to complete the survey. Help guide what we do at the University of Edinburgh, improving your future experience here and that of the students to follow. Ian Stark, Inf1-DA / Lecture 12, 2015-02-27. Lecture Plan. XML: We start with technologies for modelling and querying semistructured data. Semistructured Data: Trees and XML; Schemas for structuring XML; Navigating and querying XML with XPath. Corpora: One particular kind of semistructured data is large bodies of written or spoken text: each one a corpus, plural corpora. Corpora: What they are and how to build them; Applications: corpus analysis and data extraction. Homework. Tutorial Exercises: Tutorial 5 exercises went online earlier this week. In these you use the xmllint command-line tool to check XML validity and run your own XPath queries. Reading: T. McEnery and A. Wilson. Corpus Linguistics. Second edition, Edinburgh University Press, 2001. Chapter 2: What is a corpus and what is in it? (§2.2.2 optional) Photocopied handout, also available from the ITO. Remote Access to DICE: Much coursework can be done on your own machines, but sometimes it’s important to be able to connect to and use DICE systems.
  • The Expanding Horizons of Corpus Analysis
    The expanding horizons of corpus analysis. Brian MacWhinney, Carnegie Mellon University. Abstract: By including a focus on multimedia interactions linked to transcripts, corpus linguistics can vastly expand its horizons. This expansion will rely on two continuing developments. First, we need to develop easily used methods for each of the ten analytic methods we have examined, including lexical analyses, QDA (qualitative data analysis), automatic tagging, language profiles, group comparisons, change scores, error analysis, feedback studies, conversation analysis, and modeling. Second, we need to work together to construct a unified database for language studies and related sciences. This database must be grounded on the principles of open access, data-sharing, interoperability, and integrated structure. It must provide powerful tools for searching, multilinguality, and multimedia analysis. If we can build this infrastructure, we will be able to explore more deeply the key questions underlying the structure and functioning of language, as it emerges from the intermeshing of processes operative on eight major timeframes. 1. Introduction: Corpus linguistics has benefitted greatly from continuing advances in computer and Internet hardware and software. These advances have made it possible to develop facilities such as BNCweb (bncweb.lancs.ac.uk), LDC (Linguistic Data Consortium) online, the American National Corpus (americannationalcorpus.org), TalkBank (talkbank.org), and CHILDES (childes.psy.cmu.edu). In earlier periods, these corpora were limited to written and transcribed materials. However, most newer corpora now include transcripts linked to either audio or video recordings. The development of this newer corpus methodology is facilitated by technology which makes it easy to produce high-quality video recordings of face-to-face interactions.
  • (Or, the Raising of Baby Mondegreen) Dissertation
    PRESERVING SUBSEGMENTAL VARIATION IN MODELING WORD SEGMENTATION (OR, THE RAISING OF BABY MONDEGREEN). Dissertation presented in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Graduate School of The Ohio State University. By Christopher Anton Rytting, B.A. The Ohio State University, 2007. Dissertation Committee: Dr. Christopher H. Brew, Co-Advisor; Dr. Eric Fosler-Lussier, Co-Advisor; Dr. Mary Beckman; Dr. Brian Joseph. Graduate Program in Linguistics. ABSTRACT: Many computational models have been developed to show how infants break apart utterances into words prior to building a vocabulary (the “word segmentation task”). Most models assume that infants, upon hearing an utterance, represent this input as a string of segments. One type of model uses statistical cues calculated from the distribution of segments within the child-directed speech to locate those points most likely to contain word boundaries. However, these models have been tested in relatively few languages, with little attention paid to how different phonological structures may affect the relative effectiveness of particular statistical heuristics. This dissertation addresses this issue by comparing the performance of two classes of distribution-based statistical cues on a corpus of Modern Greek, a language with a phonotactic structure significantly different from that of English, and shows how these differences change the relative effectiveness of these cues. Another fundamental issue critically examined in this dissertation is the practice of representing input as a string of segments. Such a representation implicitly assumes complete certainty as to the phonemic identity of each segment.
  • Metapragmatics of Playful Speech Practices in Persian
    To Be with Salt, To Speak with Taste: Metapragmatics of Playful Speech Practices in Persian. Author: Arab, Reza. Published: 2021-02-03. Thesis Type: Thesis (PhD Doctorate). School: School of Hum, Lang & Soc Sc. DOI: https://doi.org/10.25904/1912/4079 Copyright Statement: The author owns the copyright in this thesis, unless stated otherwise. Downloaded from http://hdl.handle.net/10072/402259 Griffith Research Online: https://research-repository.griffith.edu.au Reza Arab, BA, MA, School of Humanities, Languages and Social Science, Griffith University. Thesis submitted in fulfilment of the requirements of the Degree of Doctor of Philosophy, September 2020. Abstract: This investigation is centred around three metapragmatic labels designating valued speech practices in the domain of ‘playful language’ in Persian. These three metapragmatic labels, used by speakers themselves, describe success and failure in use of playful language and construe a person as pleasant to be with. They are hāzerjavāb (lit. ready.response), bāmaze (lit. with.taste), and bānamak (lit. with.salt). Each is surrounded and supported by a cluster of (related) word meanings, which are instrumental in their cultural conceptualisations. The analytical framework is set within the research area known as ethnopragmatics, which is an offspring of Natural Semantic Metalanguage (NSM). With the use of explications and scripts articulated in cross-translatable semantic primes, the metapragmatic labels and the related clusters are examined in meticulous detail. This study demonstrates how ethnopragmatics, its insights on epistemologies backed by corpus pragmatics, can contribute to the metapragmatic studies by enabling a robust analysis using a systematic metalanguage.
  • Neuroinformatics.Pdf
    Neuroinform, DOI 10.1007/s12021-013-9210-5. Original Article. Action and Language Mechanisms in the Brain: Data, Models and Neuroinformatics. Michael A. Arbib, James J. Bonaiuto, Ina Bornkessel-Schlesewsky, David Kemmerer, Brian MacWhinney, Finn Årup Nielsen, Erhan Oztop. © Springer Science+Business Media New York 2013. Abstract: We assess the challenges of studying action and language mechanisms in the brain, both singly and in relation to each other to provide a novel perspective on neuroinformatics, integrating the development of databases for encoding, separately or together, neurocomputational models and empirical data that serve systems and cognitive neuroscience. Keywords: Linking models and experiments; Models, neurocomputational; Databasing empirical data; Federation of databases; Collaboratory workspaces. Bridging the Gap Between Models and Experiments: The present article offers a perspective on the neuroinformatics challenges of linking neuroscience data with models of the