Collection of Usage Information for Language Resources from Academic Articles


Shunsuke Kozawa†, Hitomi Tohyama††, Kiyotaka Uchimoto††† and Shigeki Matsubara†
†Graduate School of Information Science, Nagoya University
††Information Technology Center, Nagoya University
Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan
†††National Institute of Information and Communications Technology
4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan
fkozawa,[email protected], [email protected], [email protected]

Abstract

Recently, language resources (LRs) have become indispensable for linguistic research. However, existing LRs are often not fully utilized because the variety of their uses is not well known, which suggests that their intrinsic value is not well recognized either. Lists of usage information could improve LR searches and lead to more efficient use of LRs. In this research, therefore, we collect a list of usage information for each LR from academic articles in order to promote the efficient utilization of LRs. This paper proposes the construction of a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. The extracted sentences are then annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining it with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus.

1. Introduction

In recent years, language resources (LRs) such as corpora and dictionaries have been widely used for research in the fields of linguistics, natural language processing, and spoken language processing, reflecting the recognition that objectively analyzing linguistic behavior based on actual examples is important. Because the importance of LRs is widely recognized, they have been constructed as a research infrastructure and are becoming indispensable for linguistic research. However, existing LRs are not fully utilized. Even though metadata search services for LR archives (Hughes and Kamat, 2005) and web services for LRs (Dalli et al., 2004; Biemann et al., 2004; Quasthoff et al., 2006; Ishida et al., 2008) have become available, they have so far not been enough for users to efficiently find and use LRs suitable for their own purposes.

If there were a system that could return a list of LRs in answer to questions such as "Which LRs can be used for developing a syntactic parser?" and "Which LRs can be used for developing a Chinese-English machine translation system?", it would help users efficiently find appropriate LRs.

The information satisfying these demands is sometimes described as the usages of individual LRs on their official home pages. The metadata database of LRs named SHACHI (Tohyama et al., 2008) manages and provides this information in an integrated fashion by collecting and listing it as "usage information" for LRs. SHACHI contains metadata on approximately 2,400 LRs. However, usage information is registered in SHACHI for only about 900 LRs, since usage information is not usually described on official home pages, although it is often described in academic articles. For instance, the following sentence, found in the proceedings of ACL2006, shows that Roget's Thesaurus is useful for word sense disambiguation, although this usage information is not announced on the web page of Roget's Thesaurus [1].

• He also employed Roget's Thesaurus in 100 words of window to implement WSD.

Therefore, we could more easily find LRs suitable for our own purposes by collecting lists of usage information for LRs from academic articles and integrating them with the metadata contained in SHACHI. Although a method for automatically extracting such lists has been proposed (Kozawa et al., 2008), the variation of the extracted usage information was limited, since its extraction rules were based on the analysis of small lists of usage information. This issue could be addressed by collecting large lists of usage information and then analyzing them to expand the extraction rules. Therefore, in this paper, we propose to construct a text corpus annotated with usage information (UI corpus).

This paper is organized as follows. In section 2, we introduce the design of the UI corpus. In section 3, we construct the UI corpus by extracting sentences containing LR names from academic articles and annotating them with usage information. In sections 4 and 5, we provide statistics of the UI corpus and analyze the usage information it contains. In section 6, we show that the UI corpus contributes to efficient LR searches by combining it with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. Finally, in section 7, we summarize this paper and describe future work.

[1] http://poets.notredame.ac.jp/Roget/about.html
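The kind of search envisioned here can be made concrete with a small example. The sketch below only illustrates the idea of combining LR metadata with usage information harvested from articles; the entries, field names, and the search_by_usage helper are hypothetical and are not taken from SHACHI or from the system evaluated in section 6.

```python
# Minimal sketch: answering "Which LRs can be used for X?" by combining
# LR metadata with usage information collected from academic articles.
# All entries and field names are illustrative, not actual SHACHI records.

metadata = {
    "Penn Treebank": {"type": "corpus", "language": "English"},
    "WordNet": {"type": "lexicon", "language": "English"},
}

usage_info = {
    "Penn Treebank": ["training a syntactic parser", "POS tagger evaluation"],
    "WordNet": ["lexical lookup", "word sense disambiguation"],
}

def search_by_usage(query: str) -> list[str]:
    """Return LR names whose collected usage information mentions the query."""
    query = query.lower()
    return [name for name, usages in usage_info.items()
            if any(query in usage.lower() for usage in usages)]

for name in search_by_usage("syntactic parser"):
    # Print each matching LR together with its metadata record.
    print(name, metadata.get(name, {}))
```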
Figure 1: Flow of the UI corpus construction.

2. Design of the UI Corpus

2.1. Data Collection

It is unrealistic to collect all sentences and annotate them, because only a small number of sentences in an article include usage information for LRs. Regarding this issue, Kozawa et al. reported that most instances of usage information for LRs are found in sentences containing LR names (Kozawa et al., 2008). Therefore, in this research, we collect sentences containing LR names from academic articles to build the UI corpus.

2.2. Annotation Policy

The collected sentences are annotated with the following information: (A), (B) and (C) are provided for each sentence, while (D) and (E) are provided for word sequences. (A) through (D) are provided automatically when the sentences are collected from academic articles.

(A) sentence ID

(B) the title of the proceedings

(C) article ID

(D) LR name
Word sequences matched with LR names are annotated with LR tags, as in the following example:

• For comparison, <LR>Penn Treebank</LR> contains over 2400 (much shorter) WSJ articles.

Note that the LR tags with which word sequences are automatically annotated need to be examined, since homographs of LR names (e.g. the names of projects and associations) are sometimes erroneously annotated as LR names. For example, 'Penn Treebank' is often used as an LR name, but it is sometimes used as a project name in a different context. It is therefore difficult to discriminate proper LR names from others, so if inappropriate word sequences are annotated with LR tags, they are manually eliminated.

If word sequences matched with two or more LRs have a coordinate structure, as in the following example, "Chinese" and "English PropBanks" are annotated with LR tags. This is because we distinguish between two or more LR names sharing a coordinate structure (e.g. Chinese and English PropBanks) and a single LR name that itself has a coordinate structure (e.g. Cobuild Concordance and Collocations Sampler).

• The functional tags for <LR>Chinese</LR> and <LR>English PropBanks</LR> are to a large extent similar.

(E) usage information
Each word sequence matched with usage information for a certain LR is annotated with UI tags. Word sequences are annotated with usage information by referring only to the given sentence, without its adjacent sentences, in order to reduce the labor costs of the annotators. In our research, we assume that usage information A for LR X can be paraphrased as "X is used for A." The following are examples of usage information for WordNet.

• We use <LR>WordNet</LR> for <UI>lexical lookup</UI>.

• <LR>WordNet</LR> <UI>specifies relationships among the meanings of words</UI>.

• It uses the content of <LR>WordNet</LR> to <UI>measure the similarity or relatedness between the senses of a target word and its surrounding words</UI>.

Note that since usage information indicates specific events, vague expressions such as "our proposed method" and "this purpose" are not our target. We also ignore expressions that can be represented as "X is used for X" and those that represent updating, expansion, or modification of LR X, as in the following example:

• We applied an automatic mapping from <LR>WordNet 1.6</LR> to <LR>WordNet 1.7.1</LR> synset labels.
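To make the annotation scheme of section 2.2 concrete, the sketch below shows one way an annotated sentence could be stored and how LR–usage pairs could be read back out. The record fields, the XML-like serialization of the <LR> and <UI> tags, and the extract_usage_pairs helper are assumptions for illustration only; the paper does not specify the on-disk format of the UI corpus.

```python
import re

# One annotated sentence following the tag inventory of section 2.2.
# The dictionary keys and example values are hypothetical.
record = {
    "sentence_id": "s001",        # (A) sentence ID
    "proceedings": "ACL 2006",    # (B) title of the proceedings
    "article_id": "P06-0001",     # (C) article ID
    "text": "We use <LR>WordNet</LR> for <UI>lexical lookup</UI>.",
}

LR_PATTERN = re.compile(r"<LR>(.*?)</LR>")
UI_PATTERN = re.compile(r"<UI>(.*?)</UI>")

def extract_usage_pairs(text: str) -> list[tuple[str, str]]:
    """Pair every LR name in the sentence with every usage span in it.

    This all-pairs pairing is a simplification; sentences with several
    LR names would need a finer-grained association.
    """
    lrs = LR_PATTERN.findall(text)
    uis = UI_PATTERN.findall(text)
    return [(lr, ui) for lr in lrs for ui in uis]

print(extract_usage_pairs(record["text"]))
# [('WordNet', 'lexical lookup')]
```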
3. Construction of the UI Corpus

This section describes the method for constructing the UI corpus. Figure 1 shows the flow of the corpus construction. First, sentences containing LR names are automatically extracted from academic articles. Then, the extracted sentences are annotated by two annotators in a cascaded manner.

3.1. Automatic Extraction of Sentences Containing LR Names

First, we converted the academic articles to plain text using Xpdf [2]. We used 2,971 articles contained in the proceedings of ACL from 2000 to 2006, LREC2004 and LREC2006, because LRs are often used in the field of computational linguistics. Next, we extracted the sentences containing LR names from the articles. As the LR names, the approximately 2,400 LRs registered in SHACHI (Tohyama et al., 2008) were used. Consequently, 10,959 sentences were extracted from 1,848 articles, and the word sequences matched with the LR names were automatically annotated with LR tags (a sketch of this step is given below).

[2] http://www.foolabs.com/xpdf/

Figure 2: Flow of corpus annotation.
Figure 3: Web-based GUI for supporting annotation.

Table 1: Size of the UI corpus.
  Item                                      Number
  articles                                    1533
  sentences                                   8135
  LR tags                                    10504
  sentences containing usage information      1110
  UI tags                                     1183

3.2. Annotation of Word Sequences with Usage Information

Two annotators were involved in the annotation. One of the annotators (Annotator 1) had experience in collecting metadata for SHACHI, while the other (Annotator 2) majored in computational linguistics.

• Annotating word sequences representing the usage information for LRs with the UI tags
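Returning to the extraction step of section 3.1: converting articles to plain text and keeping only the sentences that contain a registered LR name can be sketched as follows. The sketch assumes the pdftotext tool from the Xpdf suite is available on the command line and uses a toy list of LR names in place of the roughly 2,400 names registered in SHACHI; the sentence splitter and matching rules are simplifications, not the authors' implementation.

```python
import re
import subprocess

# Toy stand-in for the ~2,400 LR names registered in SHACHI.
LR_NAMES = ["Penn Treebank", "WordNet", "Roget's Thesaurus"]

# Longer names first so that a longer name is preferred when one LR name
# contains another; real matching would also have to handle homographs
# (project or association names that coincide with LR names).
LR_REGEX = re.compile(
    "|".join(re.escape(name) for name in sorted(LR_NAMES, key=len, reverse=True))
)

def pdf_to_text(pdf_path: str) -> str:
    """Convert one article to plain text with pdftotext (part of Xpdf)."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def extract_tagged_sentences(text: str) -> list[str]:
    """Keep sentences containing an LR name and wrap each match in <LR> tags."""
    # Naive sentence split on end punctuation; the paper does not describe
    # its sentence splitter.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tagged = []
    for sentence in sentences:
        if LR_REGEX.search(sentence):
            tagged.append(LR_REGEX.sub(lambda m: f"<LR>{m.group(0)}</LR>", sentence))
    return tagged

if __name__ == "__main__":
    sample = ("For comparison, Penn Treebank contains over 2400 WSJ articles. "
              "We report parsing accuracy.")
    print(extract_tagged_sentences(sample))
    # ['For comparison, <LR>Penn Treebank</LR> contains over 2400 WSJ articles.']
```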