History, Features, and Typology of Language Corpora

Niladri Sekhar Dash • S. Arulmozi

Niladri Sekhar Dash, Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India

S. Arulmozi, Centre for Applied Linguistics and Translation Studies, University of Hyderabad, Hyderabad, Telangana, India

ISBN 978-981-10-7457-8
ISBN 978-981-10-7458-5 (eBook)
https://doi.org/10.1007/978-981-10-7458-5
Library of Congress Control Number: 2017962060

© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Dedicated to the people of Shabra and Sathyamangalam

Preface

The purpose of this introductory book is to confirm the importance of speech and text corpora in the modern age of linguistic studies. We consider corpus linguistics to be one of the fundamental domains of applied linguistics within the main research and development activities of man–machine interaction in language understanding. Keeping this observation in mind, we have tried to convey some of the general ideas and issues related to corpus linguistics and corpus-based studies of languages. For the works of speech corpora development and utilization in speech and language technology, during the last few decades, corpora have created unprecedented expectations among scholars. Since we want to keep this expectation alive, we have tried to bring in an extra shade to the field of corpus application so that corpora can meet the great challenges we have been facing in understanding natural languages in all their intricacies.

The present book is the result of our intensive research in the area of corpus linguistics for more than 25 years. In this book, we have tried to address some of the basic issues of corpus linguistics with reference to corpora of English and other languages. We have focussed on the revival and rejuvenation of the empirical approach to language study to show how language corpora of various types are developed and used in various works of mainstream linguistics, applied linguistics and language technology. We have shown how new findings obtained from language corpora are becoming useful to refute or substantiate previous observation about languages. We have provided working definitions of the corpus, identified the general features of the corpus, and focussed on the application potentials of the corpus.
We have drawn lines of distinction between different types of corpora; discussed the form and content of parallel translation corpora; addressed issues involved in the generation of web text corpora; presented a short history of pre-digital corpora; described some digital text and speech corpora; and, finally, highlighted some limitations of language corpora. In this course-cum-reference book, we have given emphasis to English and Indian languages since no book previously existed in this area that has adequately highlighted the issues linked with Indian languages.

The topics discussed in this book have a strong theoretical as well as practical significance. Over the years, corpus-based language study has remarkably changed the trends of language research and application across the globe. However, it has failed to create an impact on Indian and South Asian languages, in spite of the fact that language corpora have contributed on a large scale to new growth and to the advancement of linguistics in most of the advanced countries. This initial apathy is gradually ebbing away and, in fact, some Indian universities, as well as the universities of some neighboring countries like Bangladesh, Bhutan, Nepal, Maldives, Pakistan and Sri Lanka, are planning to introduce a fully fledged course on corpus linguistics at the university level. This book will be highly useful in this context, since it possesses the information necessary to address the requirements of students enrolled in such university-level courses.

The present book contains short but highly valuable and relevant discussions on the forgotten past of corpus-based linguistic research and applications that have been carried out over a few centuries across the languages of the world. The historical narrative results from our intensive investigation into the terrains of language corpus use in earlier centuries.
This is perhaps the first book of its kind that aims to encompass the history of language description and application with close reference to corpora developed manually by the masters of the craft. Over the decades, the basic methods of corpus making have undergone changes with the advent of new tools and techniques for text collection and access. In this book, we have made an attempt to show how, in the earlier centuries, the process of language corpora generation was practised long before the introduction of the computer and how earlier scholars designed, developed and used handmade corpora in their language-based activities relating to dictionary making, the study of dialects, language teaching, understanding word meanings, defining usages of words and terms, exploring the nature and manners of language acquisition, writing grammar books, preparing text materials, exploring specific stylistic traits of some literary masters and so on. In all such works, the earlier scholars utilized handmade corpora of selected text samples to gather and extract relevant linguistic information and examples to enhance the quality and reliability of their works. With full reference to this history, this book is expected to create awareness among the scholars about this area in order to encourage interest in using corpora in research, development and application in linguistics, as well as in sister disciplines.

The information presented in this book categorically underlines that analysis of corpora of actual language use can yield new information and insights to describe a language in a more faithful manner, as well as to deal with the problems of linguistics with certified authenticity.

Our experience in dealing with language corpora, along with the experience of some other scholars in India and abroad, has helped us to realize that a book of this kind is long overdue for those interested to know the utility of language corpora for linguistic research and applications.
This inspired us to assemble relevant information from various fields of linguistics and sister disciplines to write a book that would provide the necessary philosophical perspectives about this new field of language research and application. This book will provide scholars with a panoramic exposure to this new area of language study, as well as inspire them to explore this area with enthusiasm.

This book also presents primary information about corpora and their typologies. It presents a colorful picture of the present state of corpus-based language study with a clear focus on the future course of activities relating to corpus generation and usage. The book intends to emphasize the compilation, analysis and investigation of actual language data from both qualitative and functional perspectives in order to address some theoretical and methodological issues and principles relating to descriptive linguistics, applied linguistics and language technology.

The topics discussed and referred to in the book have strong referential and academic relevance in the global context. We have come across many queries made by scholars across the world about the history and the present state of corpus linguistics in general, and since no such book has previously been written in this area, this book is highly suitable for addressing these queries.

Deeper investigations into languages have shown many unique aspects of languages that are not only interesting but also quite useful. We have observed that within a natural setting, a language, in speech as well as in writing, is used as a versatile tool of communication. In this context, the goal of a language investigator is to understand the language in minute detail so that (s)he can develop computer systems that can perform like normal human beings in terms of exerting the regular functions of hearing and understanding a language.
With regard to the present state of research in corpus linguistics across the world, there is a need for more effort focused towards developing natural, spontaneous and unconstrained language corpora for better man–machine interaction. In addition, there is an urgent need for utilization of information obtained from the analysis of language data of various text types collected empirically and compiled in corpora for developing domain-free and workable commercial systems for speech and language technology. Only then can we think of weaving a realistic linguistic fabric for the benefit of the common people.