Corpus-Driven Lexicon of MSA 13
Total Page:16
File Type:pdf, Size:1020Kb

Load more
Recommended publications
-
FERSIWN GYMRAEG ISOD the National Corpus of Contemporary
FERSIWN GYMRAEG ISOD The National Corpus of Contemporary Welsh Project Report, October 2020 Authors: Dawn Knight1, Steve Morris2, Tess Fitzpatrick2, Paul Rayson3, Irena Spasić and Enlli Môn Thomas4. 1. Introduction 1.1. Purpose of this report This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups. 1.2. Licence The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool (for links to all tools, refer to section 10 of this report). When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged (see 1.3). § To access the corpus visit: www.corcencc.org/explore § To access the GitHub site: https://github.com/CorCenCC o GitHub is a cloud-based service that enables developers to store, share and manage their code and datasets. -
Distributional Properties of Verbs in Syntactic Patterns Liam Considine
Early Linguistic Interactions: Distributional Properties of Verbs in Syntactic Patterns Liam Considine The University of Michigan Department of Linguistics April 2012 Advisor: Nick Ellis Acknowledgements: I extend my sincerest gratitude to Nick Ellis for agreeing to undertake this project with me. Thank you for cultivating, and partaking in, some of the most enriching experiences of my undergraduate education. The extensive time and energy you invested here has been invaluable to me. Your consistent support and amicable demeanor were truly vital to this learning process. I want to thank my second reader Ezra Keshet for consenting to evaluate this body of work. Other thanks go out to Sarah Garvey for helping with precision checking, and Jerry Orlowski for his R code. I am also indebted to Mary Smith and Amanda Graveline for their participation in our weekly meetings. Their presence gave audience to the many intermediate challenges I faced during this project. I also need to thank my roommate Sean and all my other friends for helping me balance this great deal of work with a healthy serving of fun and optimism. Abstract: This study explores the statistical distribution of verb type-tokens in verb-argument constructions (VACs). The corpus under investigation is made up of longitudinal child language data from the CHILDES database (MacWhinney 2000). We search a selection of verb patterns identified by the COBUILD pattern grammar project (Francis, Hunston, Manning 1996), these include a number of verb locative constructions (e.g. V in N, V up N, V around N), verb object locative caused-motion constructions (e.g. -
The National Corpus of Contemporary Welsh 1. Introduction
The National Corpus of Contemporary Welsh Project Report, October 2020 1. Introduction 1.1. Purpose of this report This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups. 1.2. Licence The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool (for links to all tools, refer to section 10 of this report). When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged (see 1.3). § To access the corpus visit: www.corcencc.org/explore § To access the GitHub site: https://github.com/CorCenCC o GitHub is a cloud-based service that enables developers to store, share and manage their code and datasets. -
Edited Proceedings 1July2016
UK-CLC 2016 Conference Proceedings UK-CLC 2016 preface Preface UK-CLC 2016 Bangor University, Bangor, Gwynedd, 18-21 July, 2016. This volume contains the papers and posters presented at UK-CLC 2016: 6th UK Cognitive Linguistics Conference held on July 18-21, 2016 in Bangor (Gwynedd). Plenary speakers Penelope Brown (Max Planck Institute for Psycholinguistics, Nijmegen, NL) Kenny Coventry (University of East Anglia, UK) Vyv Evans (Bangor University, UK) Dirk Geeraerts (University of Leuven, BE) Len Talmy (University at Buffalo, NY, USA) Dedre Gentner (Northwestern University, IL, USA) Organisers Chair Thora Tenbrink Co-chairs Anouschka Foltz Alan Wallington Local management Javier Olloqui Redondo Josie Ryan Local support Eleanor Bedford Advisors Christopher Shank Vyv Evans Sponsors and support Sponsors: John Benjamins, Brill Poster prize by Tracksys Student prize by De Gruyter Mouton June 23, 2016 Thora Tenbrink Bangor, Gwynedd Anouschka Foltz Alan Wallington Javier Olloqui Redondo Josie Ryan Eleanor Bedford UK-CLC 2016 Programme Committee Programme Committee Michel Achard Rice University Panos Athanasopoulos Lancaster University John Barnden The University of Birmingham Jóhanna Barðdal Ghent University Ray Becker CITEC - Bielefeld University Silke Brandt Lancaster University Cristiano Broccias Univeristy of Genoa Sam Browse Sheffield Hallam University Gareth Carrol University of Nottingham Paul Chilton Lancaster University Alan Cienki Vrije Universiteit (VU) Timothy Colleman Ghent University Louise Connell Lancaster University Seana Coulson -
The Spoken British National Corpus 2014
The Spoken British National Corpus 2014 Design, compilation and analysis ROBBIE LOVE ESRC Centre for Corpus Approaches to Social Science Department of Linguistics and English Language Lancaster University A thesis submitted to Lancaster University for the degree of Doctor of Philosophy in Linguistics September 2017 Contents Contents ............................................................................................................. i Abstract ............................................................................................................. v Acknowledgements ......................................................................................... vi List of tables .................................................................................................. viii List of figures .................................................................................................... x 1 Introduction ............................................................................................... 1 1.1 Overview .................................................................................................................................... 1 1.2 Research aims & structure ....................................................................................................... 3 2 Literature review ........................................................................................ 6 2.1 Introduction .............................................................................................................................. -
The Corpus of Contemporary American English As the First Reliable Monitor Corpus of English
The Corpus of Contemporary American English as the first reliable monitor corpus of English ............................................................................................................................................................ Mark Davies Brigham Young University, Provo, UT, USA ....................................................................................................................................... Abstract The Corpus of Contemporary American English is the first large, genre-balanced corpus of any language, which has been designed and constructed from the ground up as a ‘monitor corpus’, and which can be used to accurately track and study recent changes in the language. The 400 million words corpus is evenly divided between spoken, fiction, popular magazines, newspapers, and academic journals. Most importantly, the genre balance stays almost exactly the same from year to year, which allows it to accurately model changes in the ‘real world’. After discussing the corpus design, we provide a number of concrete examples of how the corpus can be used to look at recent changes in English, including morph- ology (new suffixes –friendly and –gate), syntax (including prescriptive rules, Correspondence: quotative like, so not ADJ, the get passive, resultatives, and verb complementa- Mark Davies, Linguistics and tion), semantics (such as changes in meaning with web, green, or gay), and lexis–– English Language, Brigham including word and phrase frequency by year, and using the corpus architecture Young University. -
Language, Communities & Mobility
The 9th Inter-Varietal Applied Corpus Studies (IVACS) Conference Language, Communities & Mobility University of Malta June 13 – 15, 2018 IVACS 2018 Wednesday 13th June 2018 The conference is to be held at the University of Malta Valletta Campus, St. Paul’s Street, Valletta. 12:00 – 12:45 Conference Registration 12:45 – 13:00 Conferencing opening welcome and address University of Malta, Dean of the Faculty of Arts: Prof. Dominic Fenech Room: Auditorium 13:00 – 14:00 Plenary: Dr Robbie Love, University of Leeds Overcoming challenges in corpus linguistics: Reflections on the Spoken BNC2014 Room: Auditorium Rooms Auditorium (Level 2) Meeting Room 1 (Level 0) Meeting Room 2 (Level 0) Meeting Room 3 (Level 0) 14:00 – 14:30 Adverb use in spoken ‘Yeah, no, everyone seems to On discourse markers in Lexical Bundles in the interaction: insights and be saying that’: ‘New’ Lithuanian argumentative Description of Drug-Drug implications for the EFL pragmatic markers as newspaper discourse: a Interactions: A Corpus-Driven classroom represented in fictionalized corpus-based study Study - Pascual Pérez-Paredes & Irish English - Anna Ruskan - Lukasz Grabowski Geraldine Mark - Ana Mª Terrazas-Calero & Carolina Amador-Moreno 1 14:30 – 15:00 “Um so yeah we’re <laughs> So you have to follow The Study of Ideological Bias Using learner corpus data to you all know why we’re different methods and through Corpus Linguistics in inform the development of a here”: hesitation in student clearly there are so many Syrian Conflict News from diagnostic language tool and -
Preparation and Analysis of Linguistic Corpora
Preparation and Analysis of Linguistic Corpora The corpus is a fundamental tool for any type of research on language. The availability of computers in the 1950’s immediately led to the creation of corpora in electronic form that could be searched automatically for a variety of language features and compute frequency, distributional characteristics, and other descriptive statistics. Corpora of literary works were compiled to enable stylistic analyses and authorship studies, and corpora representing general language use became widely used in the field of lexicography. In this era, the creation of an electronic corpus required entering the material by hand, and the storage capacity and speed of computers available at the time put limits on how much data could realistically be analyzed at any one time. Without the Internet to foster data sharing, corpora were typically created, and processed at a single location. Two notable exceptions are the Brown Corpus of American English (Francis and Kucera, 1967) and the London/Oslo/Bergen (LOB) corpus of British English (Johanssen et al., 1978); both of these corpora, each containing one millions words of data tagged for part of speech, were compiled in the 1960’s using a representative sample of texts produced in the year 1961. For several years, the Brown and LOB were the only widely available computer-readable corpora of general language, and therefore provided the data for numerous language studies. In the 1980’s, the speed and capacity of computers increased dramatically, and, with more and more texts being produced in computerized form, it became possible to create corpora much larger than the Brown and LOB, containing millions of words. -
INTRODUCTION to CORPORA at STANFORD Anubha Kothari for Ling 395A Research Methods, 04/18/08 OUTLINE
INTRODUCTION TO CORPORA AT STANFORD Anubha Kothari For Ling 395A Research Methods, 04/18/08 OUTLINE Data What kinds of corpora do we have? Access How do we get to them? Search How do we find what we need? (Thanks to Hal Tily whose presentation was partially copied/adapted for this one.) WHY CORPORA ARE USEFUL… Though messy and requiring much time, care, and technical expertise, corpora are great to work with! Large amounts of searchable data consisting of real language use Saves you time from collecting data yourself Analyze the distribution of some linguistic variant and understand what factors might explain its variation Discover new linguistic and non-linguistic association patterns Investigate predictions of your theory WHAT KINDS OF CORPORA DO WE HAVE? Text and speech in various languages English, German, Spanish, Mandarin, Arabic, French, Japanese, Czech, Korean, Old & Middle English, Russian, Hindi, Tamil, etc. Complete inventory at http://www.stanford.edu/dept/linguistics/corpora/inventory.html LDC and non-LDC corpora available Also scout around on the web for corpora (e.g. Native American languages), and if they aren’t free just ask the corpus TA – it might be possible to acquire them! Many possible types of annotations: Syntactic structure, part-of-speech, coreference chains, animacy, information status, pitch accents, word times, speaker gender, dialect, age, education level, etc. You can always take a (partially annotated) corpus and add your own annotations! SOME MULTILINGUAL CORPORA Parallel texts (translations) -
Masterarbeit / Master's Thesis
MASTERARBEIT / MASTER’S THESIS Titel der Masterarbeit / Title of the Master‘s Thesis Europarl Corpus HFWL Glossaries: A Set of Multilingual Glossaries Based on the Quantitative Corpus-Based Lexical Analysis of the Europarl Corpus verfasst von / submitted by Bc. Marek Paulovič, BA angestrebter akademischer Grad / in partial fulfilment of the requirements for the degree of Master of Arts (MA) Wien, 2016 / Vienna 2016 Studienkennzahl lt. Studienblatt / A 065 331 342 degree programme code as it appears on the student record sheet: Studienrichtung lt. Studienblatt / Masterstudium Dolmetschen UG2002 Deutsch Englisch degree programme as it appears on the student record sheet: Betreut von / Supervisor: Univ.-Prof.Mag.Dr. Gerhard Budin EUROPARL CORPUS HWFL GLOSSARIES 3 ACKNOWLEDGMENTS Firstly, I would like to express my sincere gratitude to my advisor Univ.-Prof. Mag. Dr. Gerhard Budin for his continuous and immense support, for his patience, motivation, and considerable knowledge in this field. His guidance has helped me throughout the duration of researching and writing of this thesis. I could not have imagined having a better advisor and mentor for my master thesis. Besides my advisor, I would like to thank PhDr. Jana Rejšková and Mgr. Věra Kloudová, Ph.D. for their insightful comments and help with identification of the problematic internationalisms from the both didactic and linguistic perspective. My sincere thanks also goes to Beth Garner for her immense help with proofreading of this master thesis. Without her encouragement, I might have not decided to write this master thesis in English. I would like to also thank to my loved ones, family and friends who supported me throughout entire process, both by keeping me harmonious and helping me with putting the pieces together. -
A GLOSSARY of CORPUS LINGUISTICS 809 01 Pages I-Iv Prelims 5/4/06 12:13 Page Ii
809 01 pages i-iv prelims 5/4/06 12:13 Page i A GLOSSARY OF CORPUS LINGUISTICS 809 01 pages i-iv prelims 5/4/06 12:13 Page ii TITLES IN THE SERIES INCLUDE Peter Trudgill A Glossary of Sociolinguistics 0 7486 1623 3 Jean Aitchison A Glossary of Language and Mind 0 7486 1824 4 Laurie Bauer A Glossary of Morphology 0 7486 1853 8 Alan Davies A Glossary of Applied Linguistics 0 7486 1854 6 Geoffrey Leech A Glossary of English Grammar 0 7486 1729 9 Alan Cruse A Glossary of Semantics and Pragmatics 0 7486 2111 3 Philip Carr A Glossary of Phonology 0 7486 2234 9 Vyvyan Evans A Glossary of Cognitive Linguistics 0 7486 2280 2 Mauricio J. Mixco and Lyle Campbell A Glossary of Historical Linguistics 0 7486 2379 5 809 01 pages i-iv prelims 5/4/06 12:13 Page iii A Glossary of Corpus Linguistics Paul Baker, Andrew Hardie and Tony McEnery Edinburgh University Press 809 01 pages i-iv prelims 5/4/06 12:13 Page iv © Paul Baker, Andrew Hardie and Tony McEnery, 2006 Edinburgh University Press Ltd 22 George Square, Edinburgh Typeset in Sabon by Norman Tilley Graphics, Northampton, and printed and bound in Finland by WS Bookwell A CIP record for this book is available from the British Library ISBN-10 0 7486 2403 1 (hardback) ISBN-13 978 0 7486 2403 4 ISBN-10 0 7486 2018 4 (paperback) ISBN-13 978 0 7486 2018 0 The right of Paul Baker, Andrew Hardie and Tony McEnery to be identified as authors of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. -
Resources for Corpus Linguistics
RESOURCES FOR CORPUS LINGUISTICS 1. (FIRST GENERATION) WRITTEN CORPORA Brown Corpus Properties: 1m words, written, American, 1961 Design: original (see Corpus Design handout) Availability: ICAME-CD, Free online access at LDC Lancaster-Oslo/Bergen (LOB) Properties: 1m words, written, British, 1961 Design: based on BROWN Availability: ICAME-CD Kolhapur Corpus of Indian English Properties: 1m words, Written, Indian, 1978 Design: roughly based on BROWN (less specific fiction, more general fiction) Availability: ICAME-CD Wellington Corpus of Written New Zealand English Properties: 1m words, written, NZ, 1986-90 Design: roughly based on BROWN (no subcategories within Fiction) Availability: ICAME-CD, Wellington Corpus CD Australian Corpus of English (ACE) Properties: 1m words, written, Australian, 1986 Design: roughly based on BROWN Availability: ICAME-CD, ACE-CD London-Lund (LLC) Properties: 0.5m words, Spoken, British, 1959ff Design: derived from SEU Availability: ICAME-CD Freiburg-Brown (FROWN) Properties: 1m words, Written, American, 1991 Design: based on BROWN Availability: ICAME-CD Freiburg-LOB (FLOB) Properties: 1m words, Written, British, 1991 Design: based on LOB (i.e. BROWN) Availability: ICAME-CD 2. MEGA CORPORA Cobuild Bank of English Properties: > 300m words, mostly written, mostly British (some American) Design: opportunistic Availability: HarperCollins, Online demo at Cobulid web site British National Corpus (BNC) Properties: 100m words, spoken (10m) and written (90m), 1990s Design: original (see Corpus Design handout) Availability: BNC CD, Online demo at BNC web site. Corpus Linguistics 1/3 © 2003 Anatol Stefanowitsch [email protected] Resources for corpus linguistics 2/3 3. VARIOUS SPECIALIZED CORPORA Spoken Corpora Corpus of Spoken American English Properties: growing, spoken, American Design: free conversation among friends in natural setting, tape recorded by participants Availability: Free download at Talkbank Switchboard (SWB) Properties: ca.