A GLOSSARY of CORPUS LINGUISTICS
Recommended publications
Construction of English-Bodo Parallel Text
International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018 CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRANSLATION Saiful Islam 1, Abhijit Paul 2, Bipul Syam Purkayastha 3 and Ismail Hussain 4 1,2,3 Department of Computer Science, Assam University, Silchar, India 4 Department of Bodo, Gauhati University, Guwahati, India ABSTRACT A corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language which exists in machine-readable form. The scope of the corpus is endless in Computational Linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most applications of NLP, especially for Statistical Machine Translation (SMT). SMT is the most popular approach to Machine Translation (MT) nowadays, and it can produce high-quality translation results based on a huge amount of aligned parallel text corpora in both the source and target languages. Although Bodo is a recognized natural language of India and a co-official language of Assam, the amount of machine-readable information in the Bodo language is still very low. Therefore, to expand the computerized information of the language, an English to Bodo SMT system has been developed. This paper, however, mainly focuses on building English-Bodo parallel text corpora to implement the English to Bodo SMT system using the Phrase-Based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have constructed General and Newspaper domain English-Bodo parallel text corpora. Finally, the quality of the constructed parallel text corpora has been tested using two evaluation techniques in the SMT system. -
Geoffrey Neil Leech Summary of Curriculum Vitae April 1994
Geoffrey Leech PUBLICATIONS A. Books 1. G. N. Leech (1966), English in Advertising, London: Longman, pp.xiv + 240 2. G. N. Leech (1969), A Linguistic Guide to English Poetry, London: Longman, pp.xiv + 240 3. G. N. Leech (1969), Towards a Semantic Description of English, London: Longman pp.xiv + 277 4. G. N. Leech (1971), Meaning and the English Verb, London: Longman, pp.xiv + 132 (2nd and 3rd editions: 1987, 2004) 5. R. Quirk, S. Greenbaum, G. Leech and J. Svartvik (1972), A Grammar of Contemporary English, London: Longman, pp.xii + 1120 6. G. Leech (1974), Semantics, London: Penguin, pp.xii + 386 (2nd edition, entitled Semantics: the Study of Meaning, 1981) 7. G. Leech and J. Svartvik (1975), A Communicative Grammar of English, London: Longman, pp.324 (2nd and 3rd editions: 1994, 2002) 8. G. Leech (1980), Explorations in Semantics and Pragmatics, Amsterdam: Benjamins, pp.viii + 133 9. S. Greenbaum, G. Leech and J. Svartvik (eds.) (1980), Studies in English Linguistics: for Randolph Quirk, London: Longman, pp.xvi + 304 10. G. N. Leech and M. H. Short (1981), Style in Fiction: A Linguistic Introduction to English Fictional Prose, London: Longman, pp. xiv + 402 (2nd edition, 2007) 11. G. Leech, M. Deuchar and R. Hoogenraad (1982), English Grammar for Today: a New Introduction, London: Macmillan, pp.xvi + 224 (2nd edition, 2006) 12. G. Leech (1983), Principles of Pragmatics, London: Longman, pp.xiv + 250 13. R. Quirk, S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language, London: Longman pp. xii + 1779 14. G. Leech and C. -
Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna
International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-7S, May 2019 Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna Abstract: Corpus linguistic will be the one of latest and fast method for teaching and learning of different languages. Proposed article can provide the corpus linguistic introduction and give the general overview about the linguistic team of corpus. Article described about the corpus, novelties and corpus are associated in the following field. Proposed paper can focus on the using the corpus linguistic in IT field. As from research it is obvious that corpora will be used as internet and primary data in the field of IT. The method of text corpus is the digestive approach which can drive the different set of rules to govern the natural language from the text in the particular language, it can explore about languages that how it can inter-related with other. Hence the corpus linguistic will be automatically drive from the text sources. At this period corpora were used in semantics and related field. At this period corpus linguistics was not commonly used, no computers were used. Corpus linguistic history were opposed by Noam Chomsky. Now a day’s modern corpus linguistic was formed from English work context and now it is used in many other languages. Alongside was corpus linguistic history and this is considered as methodology [Abdumanapovna, 2018]. A landmark is modern corpus linguistic which was publish by Henry Kucera. Corpus linguistic introduce new methods and techniques for more complete description of modern languages and these also provide opportunities to obtain new result with -
Collection of Usage Information for Language Resources from Academic Articles
Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozawa†, Hitomi Tohyama††, Kiyotaka Uchimoto†††, Shigeki Matsubara† †Graduate School of Information Science, Nagoya University ††Information Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan †††National Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction In recent years, such language resources (LRs) as corpora and dictionaries are being widely used for research in the Thesaurus. • He also employed Roget’s Thesaurus in 100 words of window to implement WSD. -
Linguistic Annotation of the Digital Papyrological Corpus: Sematia
Marja Vierros Linguistic Annotation of the Digital Papyrological Corpus: Sematia 1 Introduction: Why to annotate papyri linguistically? Linguists who study historical languages usually find the methods of corpus linguistics exceptionally helpful. When the intuitions of native speakers are lacking, as is the case for historical languages, the corpora provide researchers with materials that replace the intuitions on which the researchers of modern languages can rely. Using large corpora and computers to count and retrieve information also provides empirical back-up from actual language usage. In the case of ancient Greek, the corpus of literary texts (e.g. Thesaurus Linguae Graecae or the Greek and Roman Collection in the Perseus Digital Library) gives information on the Greek language as it was used in lyric poetry, epic, drama, and prose writing; all these literary genres had some artistic aims and therefore do not always describe language as it was used in normal communication. Ancient written texts rarely reflect the everyday language use, let alone speech. However, the corpus of documentary papyri gets close. The writers of the papyri vary between professionally trained scribes and some individuals who had only rudimentary writing skills. The text types also vary from official decrees and orders to small notes and receipts. What they have in common, though, is that they have been written for a specific, current need instead of trying to impress a specific audience. Documentary papyri represent everyday texts, utilitarian prose, and in that respect, they provide us with a very valuable source of language actually used by common people in everyday circumstances. -
MASC: the Manually Annotated Sub-Corpus of American English
MASC: The Manually Annotated Sub-Corpus of American English Nancy Ide*, Collin Baker**, Christiane Fellbaum†, Charles Fillmore**, Rebecca Passonneau†† *Vassar College, Poughkeepsie, New York USA **International Computer Science Institute, Berkeley, California USA †Princeton University, Princeton, New Jersey USA ††Columbia University, New York, New York USA Abstract To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing -
British Muslims Caught Amidst Fogs—A Discourse Analysis of Religious Advice and Authority
religions Article British Muslims Caught Amidst FOGs—A Discourse Analysis of Religious Advice and Authority Usman Maravia 1,*, Zhazira Bekzhanova 2, Mansur Ali 3 and Rakan Alibri 4 1 ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, Lancaster LA1 4YW, UK 2 English Language Program, Astana IT University, Nur-Sultan 010000, Kazakhstan 3 Centre for the Study of Islam in the UK, School of History, Archaeology and Religion, Cardiff University, Cardiff CF10 3EU, UK 4 Department of Languages and Translation, University of Tabuk, Tabuk 47512, Saudi Arabia * Corresponding author Abstract: This paper discusses the symbolic capital found within Islamic documents that were circulated in the UK during the COVID-19 outbreak. Specifically, the work explores “fatwas” and “other” similar documents as well as “guidance” documents (referred to as FOGs) that were disseminated in March–April 2020 on the internet and social media platforms for British Muslim consumption. We confine our materials to FOGs produced only in English. Our study takes its cue from the notion that the existence of a variety of documents created a sense of foggy ambiguity for British Muslims in matters of religious practice. From a linguistic angle, the study seeks to identify (a) the underlying reasons behind the titling of the documents; and (b) the construction of discourses in the documents. Our corpus-assisted critical discourse analysis (CA-CDA) found noticeable patterns that hold symbolic capital in the fatwa register. We also found that producers of “other” documents imitate the fatwa register in an attempt to strengthen the symbolic capital of their documents. -
Lexical Ambiguity • Syntactic Ambiguity • Semantic Ambiguity • Pragmatic Ambiguity
Welcome to the course! Introduction to Natural Language Processing (NLP) Professors: Marta Gatius Vila, Horacio Rodríguez Hontoria Hours per week: 2h theory + 1h laboratory Web page: http://www.cs.upc.edu/~gatius/engpln2017.html Main goal: Understand the fundamental concepts of NLP • Most well-known techniques and theories • Most relevant existing resources • Most relevant applications Content 1. Introduction to Language Processing 2. Applications 3. Language models 4. Morphology and lexicons 5. Syntactic processing 6. Semantic and pragmatic processing 7. Generation Assessment • Exams: mid-term exam in November; end-of-term exam in the final exams period, covering all the course contents • Development of 2 programs, in groups of two or three students Course grade = maximum(midterm exam * 0.15 + final exam * 0.45, final exam * 0.6) + assignments * 0.4 Related (or the same) disciplines: • Computational Linguistics, CL • Natural Language Processing, NLP • Linguistic Engineering, LE • Human Language Technology, HLT Linguistic Engineering (LE) • LE consists of the application of linguistic knowledge to the development of computer systems able to recognize, understand, interpret and generate human language in all its forms. • LE includes: • Formal models (representations of knowledge of language at the different levels) • Theories and algorithms • Techniques and tools • Resources (Lingware) • Applications Linguistic knowledge levels – Phonetics and phonology. Language models – Morphology: Meaningful components of words. Lexicon: doors is plural – Syntax: Structural relationships between words. -
Creating and Using Multilingual Corpora in Translation Studies Claudio Fantinuoli and Federico Zanettin
Chapter 1 Creating and using multilingual corpora in translation studies Claudio Fantinuoli and Federico Zanettin 1 Introduction Corpus linguistics has become a major paradigm and research methodology in translation theory and practice, with practical applications ranging from professional human translation to machine (assisted) translation and terminology. Corpus-based theoretical and descriptive research has investigated written and interpreted language, and topics such as translation universals and norms, ideology and individual translator style (Laviosa 2002; Olohan 2004; Zanettin 2012), while corpus-based tools and methods have entered the curricula at translation training institutions (Zanettin, Bernardini & Stewart 2003; Beeby, Rodríguez Inés & Sánchez-Gijón 2009). At the same time, taking advantage of advancements in terms of computational power and increasing availability of electronic texts, enormous progress has been made in the last 20 years or so as regards the development of applications for professional translators and machine translation system users (Coehn 2009; Brunette 2013). The contributions to this volume, which are centred around seven European languages (Basque, Dutch, German, Greek, Italian, Spanish and English), add to the range of studies of corpus-based descriptive studies, and provide examples of some less explored applications of corpus analysis methods to translation research. The chapters, which are based on papers first presented at the 7th congress of the European Society of Translation Studies held in Germersheim in July/August 2013, encompass a variety of research aims and methodologies, and vary as concerns corpus design and compilation, and the techniques used to analyze the data. Claudio Fantinuoli & Federico Zanettin. 2015. Creating and using multilingual corpora in translation studies. In Claudio Fantinuoli & Federico Zanettin (eds.), New directions in corpus-based translation studies, 1–11. Berlin: Language Science Press. -
Conference Abstracts
EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues. -
Background and Context for CLASP
Background and Context for CLASP Nancy Ide, Vassar College The Situation Standards efforts have been on-going for over 20 years Interest and activity mainly in Europe in 90’s and early 2000’s Text Encoding Initiative (TEI) – 1987 Still ongoing, used mainly by humanities EAGLES/ISLE Developed standards for morpho-syntax, syntax, sub-categorization, etc. (links on CLASP wiki) Corpus Encoding Standard (now XCES - http://www.xces.org) Main Aspects • Harmonization of formats for linguistic data and annotations • Harmonization of descriptors in linguistic annotation • These two are often mixed, but need to deal with them separately (see CLASP wiki) Formats: The Past 20 Years 1987 TEI Myriad of formats 1994 MULTEXT, CES ~1996 XML 2000 ISO TC37 SC4 2001 LAF model introduced now LAF/GrAF, ISO standards Myriad of formats Actually… • Things are better now • XML use • Moves toward common models, especially in Europe • US community seeing the need for interoperability • Emergence of common processing platforms (GATE, UIMA) with underlying common models Resources 1990 • WordNet gains ground as a “standard” LR • Penn Treebank, Wall Street Journal Corpus World Wide Web • British National Corpus • EuroWordNet XML • Comlex • FrameNet • American National Corpus Semantic Web • Global WordNet • More FrameNets • SUMO • VerbNet • PropBank, NomBank • MASC present NLP software 1994 • MULTEXT > LT tools, LT XML 1995 • GATE (Sheffield) 1996 1998 • Alembic Workbench • ATLAS (NIST) 2003 • What happened to this? 200? • Callisto • UIMA Now: GATE -
Geoffrey Neil Leech Summary of Curriculum Vitae April 1994
Geoffrey Neil Leech Curriculum Vitae August 2012 I. GENERAL Date of Birth: 16 January, 1936 Age: 76 Present post: Professor Emeritus, English Linguistics, Lancaster University Degrees: B.A. (1959), M.A. (1963), PhD (1968), all at the University of London Honorary Fil.Dr., University of Lund (1987) Hon.D.Litt., University of Wolverhampton (2002) D.Litt., Lancaster University (2002) Hon. PhD, Charles University, Prague (2012) Fellowships, etc.: Harkness Fellowship (held at MIT) (1964-5) Fellow of the British Academy (1987–) Member of Academia Europaea (1989–) Honorary Fellow, University College London (1989–) Member, Norwegian Academy of Science & Letters (1993–) Honorary Professor, Beijing Foreign Studies University (1994–) Honorary Fellow, Lancaster University (2009–) Previous posts: Assistant Lecturer in English, University College London (1962-5) Lecturer in English, University College London (1965-9) Reader in English Language, Lancaster University (1969-74) Professor of Linguistics and Modern English Language, Lancaster University (1974-96) Research Professor in English Linguistics (1997-2001) Other positions at Lancaster: Chairman, Institute for English Language Education (1985-90) Joint Director, Unit for Computer Research on the English Language [UCREL] (1984-95) Chair, University Centre for Computer Corpus Research on Language [UCREL] (1995-2002) Main academic interests: English grammar, semantics, stylistics, pragmatics and corpus linguistics. II. PUBLICATIONS AND RESEARCH A. Books 1. G. N. Leech (1966), English in Advertising, London: Longman, pp.xiv + 240 2. G. N. Leech (1969), A Linguistic Guide to English Poetry, London: Longman, pp.xiv + 240 3. G. N. Leech (1969), Towards a Semantic Description of English, London: Longman pp.xiv + 277 4. G. N. Leech (1971), Meaning and the English Verb, London: Longman, pp.xiv + 132 (2nd and 3rd editions: 1987, 2004) 5.