446931 1 En Bookfrontmatter 1..29

History, Features, and Typology of Language Corpora

Niladri Sekhar Dash • S. Arulmozi

Niladri Sekhar Dash
Linguistic Research Unit
Indian Statistical Institute
Kolkata, West Bengal
India

S. Arulmozi
Centre for Applied Linguistics and Translation Studies
University of Hyderabad
Hyderabad, Telangana
India

ISBN 978-981-10-7457-8
ISBN 978-981-10-7458-5 (eBook)
https://doi.org/10.1007/978-981-10-7458-5
Library of Congress Control Number: 2017962060

© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Dedicated to the people of Shabra and Sathyamangalam

Preface

The purpose of this introductory book is to confirm the importance of speech and text corpora in the modern age of linguistic studies. We consider corpus linguistics to be one of the fundamental domains of applied linguistics within the main research and development activities of man–machine interaction in language understanding. Keeping this observation in mind, we have tried to convey some of the general ideas and issues related to corpus linguistics and corpus-based studies of languages. Over the last few decades, the development and use of corpora in speech and language technology have created unprecedented expectations among scholars. Since we want to keep this expectation alive, we have tried to add an extra dimension to the field of corpus application so that corpora can meet the great challenges we face in understanding natural languages in all their intricacies.

The present book is the result of our intensive research in the area of corpus linguistics for more than 25 years. In this book, we have tried to address some of the basic issues of corpus linguistics with reference to corpora of English and other languages. We have focussed on the revival and rejuvenation of the empirical approach to language study to show how language corpora of various types are developed and used in mainstream linguistics, applied linguistics and language technology. We have shown how new findings obtained from language corpora are becoming useful to refute or substantiate previous observations about languages. We have provided working definitions of the corpus, identified the general features of the corpus, and focussed on the application potentials of the corpus.
We have drawn lines of distinction between different types of corpora; discussed the form and content of parallel translation corpora; addressed issues involved in the generation of web text corpora; presented a short history of pre-digital corpora; described some digital text and speech corpora; and, finally, highlighted some limitations of language corpora. In this course-cum-reference book, we have emphasized English and Indian languages, since no previous book in this area has adequately highlighted the issues linked with Indian languages.

The topics discussed in this book have a strong theoretical as well as practical significance. Over the years, corpus-based language study has remarkably changed the trends of language research and application across the globe. However, it has failed to create an impact on Indian and South Asian languages, in spite of the fact that language corpora have contributed on a large scale to new growth and to the advancement of linguistics in most of the advanced countries. This initial apathy is gradually ebbing away and, in fact, some Indian universities, as well as the universities of some neighboring countries like Bangladesh, Bhutan, Nepal, Maldives, Pakistan and Sri Lanka, are planning to introduce a fully fledged course on corpus linguistics at the university level. This book will be highly useful in this context, since it contains the information necessary to address the requirements of students enrolled in such university-level courses.

The present book contains short but highly valuable and relevant discussions on the forgotten past of corpus-based linguistic research and applications that have been carried out over a few centuries across the languages of the world. The historical narrative results from our intensive investigation into the terrains of language corpus use in earlier centuries.
This is perhaps the first book of its kind that aims to encompass the history of language description and application with close reference to corpora developed manually by the masters of the craft.

Over the decades, the basic methods of corpus making have undergone changes with the advent of new tools and techniques for text collection and access. In this book, we have made an attempt to show how, in the earlier centuries, the process of language corpora generation was practised long before the introduction of the computer and how earlier scholars designed, developed and used handmade corpora in their language-based activities relating to dictionary making, the study of dialects, language teaching, understanding word meanings, defining usages of words and terms, exploring the nature and manners of language acquisition, writing grammar books, preparing text materials, exploring specific stylistic traits of some literary masters and so on. In all such works, the earlier scholars utilized handmade corpora of selected text samples to gather and extract relevant linguistic information and examples to enhance the quality and reliability of their works. With full reference to this history, this book is expected to create awareness among scholars about this area in order to encourage interest in using corpora in research, development and application in linguistics, as well as in sister disciplines. The information presented in this book categorically underlines that analysis of corpora of actual language use can yield new information and insights to describe a language in a more faithful manner, as well as to deal with the problems of linguistics with certified authenticity.

Our experience in dealing with language corpora, along with the experience of some other scholars in India and abroad, has helped us to realize that a book of this kind is long overdue for those interested in knowing the utility of language corpora for linguistic research and applications.
This inspired us to assemble relevant information from various fields of linguistics and sister disciplines to write a book that would provide the necessary philosophical perspectives about this new field of language research and application. This book will provide scholars with a panoramic exposure to this new area of language study, as well as inspire them to explore this area with enthusiasm.

This book also presents primary information about corpora and their typologies. It presents a colorful picture of the present state of corpus-based language study with a clear focus on the future course of activities relating to corpus generation and usage. The book intends to emphasize the compilation, analysis and investigation of actual language data from both qualitative and functional perspectives in order to address some theoretical and methodological issues and principles relating to descriptive linguistics, applied linguistics and language technology.

The topics discussed and referred to in the book have strong referential and academic relevance in the global context. We have come across many queries made by scholars across the world about the history and the present state of corpus linguistics in general, and since no such book has previously been written in this area, this book is highly suitable for addressing these queries.

Deeper investigations into languages have shown many unique aspects of languages that are not only interesting but also quite useful. We have observed that within a natural setting, a language, in speech as well as in writing, is used as a versatile tool of communication. In this context, the goal of a language investigator is to understand the language in minute detail so that (s)he can develop computer systems that can perform like normal human beings in terms of exerting the regular functions of hearing and understanding a language.
With regard to the present state of research in corpus linguistics across the world, more effort needs to be directed towards developing natural, spontaneous and unconstrained language corpora for better man–machine interaction. In addition, there is an urgent need to use the information obtained from the analysis of language data of various text types, collected empirically and compiled in corpora, to develop domain-free and workable commercial systems for speech and language technology. Only then can we think of weaving a realistic linguistic fabric for the benefit of the common people.
