Advanced Information and Knowledge Processing
Series Editors
Professor Lakhmi Jain
[email protected]
Professor Xindong Wu
[email protected]

Also in this series
Gregoris Mentzas, Dimitris Apostolou, Andreas Abecker and Ron Young, Knowledge Asset Management, 1-85233-583-1
K.C. Tan, E.F. Khor and T.H. Lee, Multiobjective Evolutionary Algorithms and Applications, 1-85233-836-9
Michalis Vazirgiannis, Maria Halkidi and Dimitrios Gunopulos, Uncertainty Handling and Quality Assessment in Data Mining, 1-85233-655-2
Nikhil R. Pal and Lakhmi Jain (Eds), Advanced Techniques in Knowledge Discovery and Data Mining, 1-85233-867-9
Asunción Gómez-Pérez, Mariano Fernández-López and Oscar Corcho, Ontological Engineering, 1-85233-551-3
Amit Konar and Lakhmi Jain, Cognitive Engineering, 1-85233-975-6
Miroslav Kárný (Ed.), Optimized Bayesian Dynamic Advising, 1-85233-928-4
Arno Scharl (Ed.), Environmental Online Communication, 1-85233-783-4
Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos and Yannis Theodoridis, R-trees: Theory and Applications, 1-85233-977-2
Shichao Zhang, Chengqi Zhang and Xindong Wu, Knowledge Discovery in Multiple Databases, 1-85233-703-6
Jason T.L. Wang, Mohammed J. Zaki, Hannu T.T. Toivonen and Dennis Shasha (Eds), Data Mining in Bioinformatics, 1-85233-671-4
Sanghamitra Bandyopadhyay, Ujjwal Maulik, Lawrence B. Holder and Diane J. Cook (Eds), Advanced Methods for Knowledge Discovery from Complex Data, 1-85233-989-6
C.C. Ko, Ben M. Chen and Jianping Chen, Creating Web-based Laboratories, 1-85233-837-7
Marcus A. Maloof (Ed.), Machine Learning and Data Mining for Computer Security, 1-84628-029-X
Manuel Graña, Richard Duro, Alicia d'Anjou and Paul P. Wang (Eds), Information Processing with Evolutionary Algorithms, 1-85233-886-0
Sifeng Liu and Yi Lin, Grey Information, 1-85233-995-0
Colin Fyfe, Hebbian Learning and Negative Feedback Networks, 1-85233-883-0
Vasile Palade, Cosmin Danut Bocaniala and Lakhmi Jain (Eds), Computational Intelligence in Fault Diagnosis, 1-84628-343-4
Yun-Heh Chen-Burger and Dave Robertson, Automating Business Modelling, 1-85233-835-0
Mitra Basu and Tin Kam Ho (Eds), Data Complexity in Pattern Recognition, 1-84628-171-7
Dirk Husmeier, Richard Dybowski and Stephen Roberts (Eds), Probabilistic Modeling in Bioinformatics and Medical Informatics, 1-85233-778-8
Samuel Pierre (Ed.), E-learning Networked Environments and Architectures, 1-84628-351-5
Ajith Abraham, Lakhmi Jain and Robert Goldberg (Eds), Evolutionary Multiobjective Optimization, 1-85233-787-7
Arno Scharl and Klaus Tochtermann (Eds), The Geospatial Web, 1-84628-826-5
Ngoc Thanh Nguyen, Advanced Methods for Inconsistent Knowledge Management, 1-84628-888-3
Amnon Meisels, Search by Constrained Agents, 978-1-84800-039-1
Francesco Camastra and Alessandro Vinciarelli, Machine Learning for Image, Video and Audio Analysis, 978-1-84800-006-3
Mikhail Prokopenko (Ed.), Advances in Applied Self-organizing Systems, 978-1-84628-981-1

András Kornai
Mathematical Linguistics
András Kornai
MetaCarta Inc.
350 Massachusetts Ave.
Cambridge, MA 02139
USA
ISBN: 978-1-84628-985-9 e-ISBN: 978-1-84628-986-6 DOI: 10.1007/978-1-84628-986-6
British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library
Library of Congress Control Number: 2007940401
© Springer-Verlag London Limited 2008
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.
The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
Springer Science+Business Media
springer.com

To my family

Preface
Mathematical linguistics is rooted both in Euclid's (circa 325–265 BCE) axiomatic method and in Pāṇini's (circa 520–460 BCE) method of grammatical description. To be sure, both Euclid and Pāṇini built upon a considerable body of knowledge amassed by their precursors, but the systematicity, thoroughness, and sheer scope of the Elements and the Aṣṭādhyāyī would place them among the greatest landmarks of all intellectual history even if we disregarded the key methodological advance they made. As we shall see, the two methods are fundamentally very similar: the axiomatic method starts with a set of statements assumed to be true and transfers truth from the axioms to other statements by means of a fixed set of logical rules, while the method of grammar is to start with a set of expressions assumed to be grammatical both in form and meaning and to transfer grammaticality to other expressions by means of a fixed set of grammatical rules.

Perhaps because our subject matter has attracted the efforts of some of the most powerful minds (of whom we single out A. A. Markov here) from antiquity to the present day, there is no single easily accessible introductory text in mathematical linguistics. Indeed, to the mathematician the whole field of linguistics may appear to be hopelessly mired in controversy, and neither the formidable body of empirical knowledge about languages nor the standards of linguistic argumentation offer an easy entry point. Those with a more postmodern bent may even go as far as to doubt the existence of a solid core of mathematical knowledge, often pointing at the false theorems and incomplete or downright wrong proofs that slip through the peer review process at a perhaps alarming rate. Rather than attempting to drown such doubts in rivers of philosophical ink, the present volume will simply proceed more geometrico in exhibiting this solid core of knowledge. In Chapters 3–6, a mathematical overview of the traditional main branches of linguistics (phonology, morphology, syntax, and semantics) is presented.
Who should read this book?
The book is accessible to anyone with sufficient general mathematical maturity (graduate or advanced undergraduate). No prior knowledge of linguistics or languages is assumed on the part of the reader. The book offers a single entry point to the central methods and concepts of linguistics that are made largely inaccessible to the mathematician, computer scientist, or engineer by the surprisingly adversarial style of argumentation (see Section 1.2), the apparent lack of adequate definitions (see Section 1.3), and the proliferation of unmotivated notation and formalism (see Section 1.4) all too often encountered in research papers and monographs in the humanities. Those interested in linguistics can learn a great deal more about the subject here than what is covered in introductory courses just from reading through the book and consulting the references cited. Those who plan to approach linguistics through this book should be warned in advance that many branches of linguistics, in particular psycholinguistics, child language acquisition, and the study of language pathology, are largely ignored here – not because they are viewed as inferior to other branches but simply because they do not offer enough grist for the mathematician's mill. Much of what the linguistically naive reader may find interesting about language turns out to be more pertinent to cognitive science, the philosophy of language, and sociolinguistics than to linguistics proper, and the Introduction gives these issues the shortest possible shrift, discussing them only to the extent necessary for disentangling mathematical linguistics from other concerns.

Conversely, issues that linguists sometimes view as peripheral to their enterprise will get more discussion here simply because they offer such a rich variety of mathematical techniques and problems that no book on mathematical linguistics that ignored them could be considered complete. After a brief review of information theory in Chapter 7, we will devote Chapters 8 and 9 to phonetics, speech recognition, the recognition of handwriting and machine print, and in general to issues of linguistic signal processing and pattern matching, including information extraction, information retrieval, and statistical natural language processing. Our treatment assumes a bit more mathematical maturity than the excellent textbooks by Jelinek (1997) and Manning and Schütze (1999) and intends to complement them. Kracht (2003) conveniently summarizes and extends much of the discrete (algebraic and combinatorial) work on mathematical linguistics. It is only because of the timely appearance of this excellent reference work that the first six chapters could be kept to a manageable size and we could devote more space to the continuous (analytic and probabilistic) aspects of the subject. In particular, expository simplicity would often dictate that we keep the underlying parameter space discrete, but in the later chapters we will be concentrating more on the case of continuous parameters, and discuss the issue of quantization losses explicitly.

In the early days of computers, there was a great deal of overlap between the concerns of mathematical linguistics and computer science, and a surprising amount of work that began in one field ended up in the other, sometimes explicitly as part of computational linguistics, but often as general theory with its roots in linguistics largely forgotten. In particular, the basic techniques of syntactic analysis are now
firmly embedded in the computer science curriculum, and the student can already choose from a large variety of textbooks that cover parsing, automata, and formal language theory. Here we single out the classic monograph by Salomaa (1973), which shows the connection to formal syntax in a way readily accessible to the mathematically minded reader. We will selectively cover only those aspects of this field that address specifically linguistic concerns, and again our guiding principle will be mathematical content, as opposed to algorithmic detail. Readers interested in the algorithms should consult the many excellent natural language processing textbooks now available, of which we single out Jurafsky and Martin (2000, with a new edition planned in 2008).
How is the book organized?
To the extent feasible we follow the structure of the standard introductory courses to linguistics, but the emphasis will often be on points only covered in more advanced courses. The book contains many exercises. These are, for the most part, rather hard (over level 30 in the system of Knuth 1971) but extremely rewarding. Especially in the later chapters, the exercises are often based on classical and still widely cited theorems, so the solutions can usually be found on the web quite easily simply by consulting the references cited in the text. However, readers are strongly advised not to follow this route before spending at least a few days attacking the problem. Unsolved problems presented as exercises are marked by an asterisk, a symbol that we also use when presenting examples and counterexamples that native speakers would generally consider wrong (ungrammatical): Scorsese is a great director is a positive (grammatical) example, while *Scorsese a great director is is a negative (ungrammatical) example. Some exercises, marked by a dagger †, require the ability to manipulate sizeable data sets, but no in-depth knowledge of programming, data structures, or algorithms is presumed. Readers who write code effortlessly will find these exercises easy, as they rarely require more than a few simple scripts. Those who find such exercises problematic can omit them entirely. They may fail to gain direct appreciation of some empirical properties of language that drive much of the research in mathematical linguistics, but the research itself remains perfectly understandable even if the motivation is taken on faith. A few exercises are marked by a raised M – these are major research projects the reader is not expected to see to completion, but spending a few days on them is still valuable.

Because from time to time it will be necessary to give examples from languages that are unlikely to be familiar to the average undergraduate or graduate student of mathematics, we decided, somewhat arbitrarily, to split languages into two groups. Major languages are those that have a chapter in Comrie's (1990) The World's Major Languages – these will be familiar to most people and are left unspecified in the text. Minor languages usually require some documentation, both because language names are subject to a great deal of spelling variation and because different groups of people may use very different names for one and the same language. Minor languages are therefore identified here by their three-letter Ethnologue code (15th edition, 2005) given in square brackets [].
Each chapter ends with a section on further reading. We have endeavored to make the central ideas of linguistics accessible to those new to the field, but the discussion offered in the book is often skeletal, and readers are urged to probe further. Generally, we recommend those papers and books that presented the idea for the first time, not just to give proper credit but also because these often provide perspective and insight that later discussions take for granted. Readers who industriously follow the recommendations made here should do so for the benefit of learning the basic vocabulary of the field rather than in the belief that such reading will immediately place them at the forefront of research.

The best way to read this book is to start at the beginning and to progress linearly to the end, but the reader who is interested only in a particular area should not find it too hard to jump in at the start of any chapter. To facilitate skimming and alternative reading plans, a generous number of forward and backward pointers is provided – in a hypertext edition these would be replaced by clickable links. The material is suitable for an aggressively paced one-semester course or a more leisurely paced two-semester course.
Acknowledgments
Many typos and stylistic infelicities were caught, and excellent references were suggested, by Márton Makrai (Budapest Institute of Technology), Dániel Margócsy (Harvard), Doug Merritt (San Jose), Reinhard Muskens (Tilburg), Almerindo Ojeda (University of California, Davis), Bálint Sass (Budapest), Madeleine Thompson (University of Toronto), and Gabriel Wyler (San Jose). The painstaking work of the Springer editors and proofreaders, Catherine Brett, Frank Ganz, Hal Henglein, and Jeffrey Taub, is gratefully acknowledged. The comments of Tibor Beke (University of Massachusetts, Lowell), Michael Bukatin (MetaCarta), Anssi Yli-Jyrä (University of Helsinki), Péter Gács (Boston University), Marcus Kracht (UCLA), András Serény (CEU), Péter Siptár (Hungarian Academy of Sciences), Anna Szabolcsi (NYU), Péter Vámos (Budapest Institute of Technology), Károly Varasdi (Hungarian Academy of Sciences), and Dániel Varga (Budapest Institute of Technology) resulted in substantive improvements.

Writing this book would not have been possible without the generous support of MetaCarta Inc. (Cambridge, MA), the MOKK Media Research center at the Budapest Institute of Technology Department of Sociology, the Farkas Heller Foundation, and the Hungarian Telekom Foundation for Higher Education – their help and care are gratefully acknowledged.

Contents
Preface ...... vii

1 Introduction ...... 1
1.1 The subject matter ...... 1
1.2 Cumulative knowledge ...... 2
1.3 Definitions ...... 3
1.4 Formalization ...... 4
1.5 Foundations ...... 6
1.6 Mesoscopy ...... 6
1.7 Further reading ...... 7

2 The elements ...... 9
2.1 Generation ...... 9
2.2 Axioms, rules, and constraints ...... 13
2.3 String rewriting ...... 17
2.4 Further reading ...... 20

3 Phonology ...... 23
3.1 Phonemes ...... 24
3.2 Natural classes and distinctive features ...... 28
3.3 Suprasegmentals and autosegments ...... 33
3.4 Phonological computation ...... 40
3.5 Further reading ...... 49

4 Morphology ...... 51
4.1 The prosodic hierarchy ...... 53
4.1.1 Syllables ...... 53
4.1.2 Moras ...... 55
4.1.3 Feet and cola ...... 56
4.1.4 Words and stress typology ...... 56
4.2 Word formation ...... 60
4.3 Optimality ...... 67
4.4 Zipf's law ...... 69
4.5 Further reading ...... 75

5 Syntax ...... 77
5.1 Combinatorical theories ...... 78
5.1.1 Reducing vocabulary complexity ...... 79
5.1.2 Categorial grammar ...... 81
5.1.3 Phrase structure ...... 84
5.2 Grammatical theories ...... 88
5.2.1 Dependency ...... 89
5.2.2 Linking ...... 94
5.2.3 Valency ...... 99
5.3 Semantics-driven theories ...... 102
5.4 Weighted theories ...... 111
5.4.1 Approximation ...... 112
5.4.2 Zero density ...... 116
5.4.3 Weighted rules ...... 120
5.5 The regular domain ...... 122
5.5.1 Weighted finite state automata ...... 122
5.5.2 Hidden Markov models ...... 129
5.6 External evidence ...... 132
5.7 Further reading ...... 137

6 Semantics ...... 141
6.1 The explanatory burden of semantics ...... 142
6.1.1 The Liar ...... 142
6.1.2 Opacity ...... 144
6.1.3 The Berry paradox ...... 145
6.1.4 Desiderata ...... 148
6.2 The standard theory ...... 150
6.2.1 Montague grammar ...... 151
6.2.2 Truth values and variable binding term operators ...... 158
6.3 Grammatical semantics ...... 167
6.3.1 The system of types ...... 168
6.3.2 Combining signs ...... 173
6.4 Further reading ...... 177

7 Complexity ...... 179
7.1 Information ...... 179
7.2 Kolmogorov complexity ...... 185
7.3 Learning ...... 189
7.3.1 Minimum description length ...... 190
7.3.2 Identification in the limit ...... 194
7.3.3 Probable approximate correctness ...... 198
7.4 Further reading ...... 199

8 Linguistic pattern recognition ...... 201
8.1 Quantization ...... 202
8.2 Markov processes, hidden Markov models ...... 205
8.3 High-level signal processing ...... 209
8.4 Document classification ...... 211
8.5 Further reading ...... 217

9 Speech and handwriting ...... 219
9.1 Low-level speech processing ...... 220
9.2 Phonemes as hidden units ...... 231
9.3 Handwriting and machine print ...... 238
9.4 Further reading ...... 245

10 Simplicity ...... 247
10.1 Previous reading ...... 250

References ...... 251

Index ...... 281

1 Introduction
1.1 The subject matter
What is mathematical linguistics? A classic book on the subject (Jakobson 1961) contains papers on a variety of subjects, including categorial grammar (Lambek 1961), formal syntax (Chomsky 1961, Hiż 1961), logical semantics (Quine 1961, Curry 1961), phonetics and phonology (Peterson and Harary 1961, Halle 1961), Markov models (Mandelbrot 1961b), handwriting (Chao 1961, Eden 1961), parsing (Oettinger 1961, Yngve 1961), glottochronology (Gleason 1961), and the philosophy of language (Putnam 1961), as well as a number of papers that are harder to fit into our current system of scientific subfields, perhaps because there is a void now where once there was cybernetics and systems theory (see Heims 1991).

A good way to understand how these seemingly so disparate fields cohere is to proceed by analogy to mathematical physics. Hamiltonians receive a great deal more mathematical attention than, say, the study of generalized incomplete Gamma functions, because of their relevance to mechanics, not because the subject is, from a purely mathematical perspective, necessarily more interesting. Many parts of mathematical physics find a natural home in the study of differential equations, but other parts fit much better in algebra, statistics, and elsewhere. As we shall see, the situation in mathematical linguistics is quite similar: many parts of the subject would fit nicely in algebra and logic, but there are many others for which methods belonging to other fields of mathematics are more appropriate. Ultimately the coherence of the field, such as it is, depends on the coherence of linguistics.

Because of the enormous impact that the works of Noam Chomsky and Richard Montague had on the postwar development of the discipline, there is a strong tendency, observable both in introductory texts such as Partee et al. (1990) and in research monographs such as Kracht (2003), to simply equate mathematical linguistics with formal syntax and semantics. Here we take a broader view, assigning syntax (Chapter 5) and semantics (Chapter 6) no greater scope than they would receive in any book that covers linguistics as a whole, and devoting a considerable amount of space to phonology (Chapter 3), morphology (Chapter 4), phonetics (Chapters 8 and 9), and other areas of traditional linguistics. In particular, we make sure that the reader will learn (in Chapter 7) the central mathematical ideas of information theory and algorithmic complexity that provide the foundations of much of the contemporary work in mathematical linguistics.

This does not mean, of course, that mathematical linguistics is a discipline entirely without boundaries. Since almost all social activity ultimately rests on linguistic communication, there is a great deal of temptation to reduce problems from other fields of inquiry to purely linguistic problems. Instead of understanding schizoid behavior, perhaps we should first ponder what the phrase multiple personality means. Mathematics already provides a reasonable notion of 'multiple', but what is 'personality', and how can there be more than one per person? Can a proper understanding of the suffixes -al and -ity be the key? This line of inquiry, predating the Schoolmen and going back at least to the cheng ming (rectification of names) doctrine of Confucius, has a clear and convincing rationale (The Analects 13.3, D.C. Lau transl.):

When names are not correct, what is said will not sound reasonable; when what is said does not sound reasonable, affairs will not culminate in success; when affairs do not culminate in success, rites and music will not flourish; when rites and music do not flourish, punishments will not fit the crimes; when punishments do not fit the crimes, the common people will not know where to put hand and foot. Thus when the gentleman names something, the name is sure to be usable in speech, and when he says something this is sure to be practicable. The thing about the gentleman is that he is anything but casual where speech is concerned.

In reality, linguistics lacks the resolving power to serve as the ultimate arbiter of truth in the social sciences, just as physics lacks the resolving power to explain the accidents of biological evolution that made us human. By applying mathematical techniques we can at least gain some understanding of the limitations of the enterprise, and this is what this book sets out to do.
1.2 Cumulative knowledge
It is hard to find any aspect of linguistics that is entirely uncontroversial, and to the mathematician less steeped in the broad tradition of the humanities it may appear that linguistic controversies are often settled on purely rhetorical grounds. Thus it may seem advisable, and only fair, to give both sides the full opportunity to express their views and let the reader be the judge. But such a book would run to thousands of pages and would be of far more interest to historians of science than to those actually intending to learn mathematical linguistics. Therefore we will not necessarily accord equal space to both sides of such controversies; indeed, often we will present a single view and will proceed without even attempting to discuss alternative ways of looking at the matter.

Since part of our goal is to orient the reader not familiar with linguistics, typically we will present the majority view in detail and describe the minority view only tersely. For example, Chapter 4 introduces the reader to morphology and will rely heavily on the notion of the morpheme – the excellent book by Anderson (1992), denying the utility, if not the very existence, of morphemes, will be relegated to footnotes. In some cases, when we feel that the minority view is the correct one, the emphasis will be inverted: for example, Chapter 6, dealing with semantics, is more informed by the 'surface compositional' than the 'logical form' view. In other cases, particularly in Chapter 5, dealing with syntax, we felt that such a bewildering variety of frameworks is available that the reader is better served by an impartial analysis that tries to bring out the common core than by in-depth formalization of any particular strand of research.

In general, our goal is to present linguistics as a cumulative body of knowledge. In order to find a consistent set of definitions that offer a rational reconstruction of the main ideas and techniques developed over the course of millennia, it will often be necessary to take sides in various controversies. There is no pretense here that mathematical formulation will necessarily endow a particular set of ideas with greater verity, and often the opposing view could be formalized just as well. This is particularly evident in those cases where theories diametrically opposed in their means actually share a common goal, such as describing all and only the well-formed structures (e.g. syllables, words, or sentences) of languages. As a result, we will see discussions of many 'minority' theories, such as case grammar or generative semantics, which are generally believed to have less formal content than their 'majority' counterparts.
1.3 Definitions
For the mathematician, definitions are nearly synonymous with abbreviations: we say 'triangle' instead of describing the peculiar arrangement of points and lines that define it, 'polynomial' instead of going into a long discussion about terms, addition, monomials, multiplication, or the underlying ring of coefficients, and so forth. The only sanity check required is to exhibit an instance, typically an explicit set-theoretic construction, to demonstrate that the defined object indeed exists. Quite often, counterfactual objects such as the smallest group K not meeting some description, or objects whose existence is not known, such as the smallest nontrivial root of ζ not on the critical line, will play an important role in (indirect) proofs, and occasionally we find cases, such as motivic cohomology, where the whole conceptual apparatus is in doubt.

In linguistics, there is rarely any serious doubt about the existence of the objects of inquiry. When we strive to define 'word', we give a mathematical formulation not so much to demonstrate that words exist, for we know perfectly well that we use words both in spoken and written language, but rather to handle the odd and unexpected cases. The reader is invited to construct a definition now and to write it down for comparison with the eventual definition that will emerge only after a rather complex discussion in Chapter 4.

In this respect, mathematical linguistics is very much like the empirical sciences, where formulating a definition involves at least three distinct steps: an ostensive definition based on positive and sometimes negative examples (vitriol is an acid, lye is not), followed by an extensive definition delineating the intended scope of the notion (every chemical that forms a salt with a base is an acid), and the intensive definition that exposes the underlying mechanism (in this case, covalent bonds) emerging rather late as a result of a long process of abstraction and analysis.

Throughout the book, the first significant instance of key notions will appear in italics, usually followed by ostensive examples and counterexamples in the next few paragraphs. (Italics will also be used for emphasis and for typesetting linguistic examples.) The empirical observables associated with these notions are always discussed, but textbook definitions of an extensive sort are rarely given. Rather, a mathematical notion that serves as a stand-in will be defined in a rigorous fashion: in the defining phrase, the same notion is given in boldface. Where an adequate mathematical formulation is lacking and we proceed by sheer analogy, the key terms will be slanted – such cases are best thought of as open problems in mathematical linguistics.
1.4 Formalization
In mathematical linguistics, as in any branch of applied mathematics, the issue of formalizing semiformally or informally stated theories comes up quite often. A prime example is the study of phrase structure, where Chomsky (1956) took the critical step of replacing the informally developed system of immediate constituent analysis (ICA, see Section 5.1) by the rigorously defined context-free grammar (CFG, see Section 2.3) formalism. Besides improving our understanding of natural language, a worthy goal in itself, the formalization opened the door to the modern theory of computer languages and their compilers. This is not to say that every advance in formalizing linguistic theory is likely to have a similarly spectacular payoff, but clearly the informal theory remains a treasure-house inasmuch as it captures important insights about natural language. While not entirely comparable to biological systems in age and depth, natural language embodies a significant amount of evolutionary optimization, and artificial communication systems can benefit from these developments only to the extent that the informal insights are captured by formal methods.

The quality of formalization depends both on the degree of faithfulness to the original ideas and on the mathematical elegance of the resulting system. Because the proper choice of formal apparatus is often a complex matter, linguists, even those as evidently mathematical-minded as Chomsky, rarely describe their models with full formal rigor, preferring to leave the job to the mathematicians, computer scientists, and engineers who wish to work with their theories. Choosing the right formalism for linguistic rules is often very hard. There is hardly any doubt that linguistic behavior is governed by rather abstract rules or constraints that go well beyond what systems limited to memorizing previously encountered examples could explain. Whether these rules have a stochastic aspect is far from settled: engineering applications are dominated by models that crucially rely on probabilities, while theoretical models, with the notable exception of the variable rules used in sociolinguistics
(see Section 5.4.3), rarely include considerations relating to the frequency of various phenomena. The only way to shed light on such issues is to develop alternative formalizations and compare their mathematical properties.

The tension between faithfulness to the empirical details and the elegance of the formal system has long been familiar to linguists: Sapir (1921) already noted that "all grammars leak". One significant advantage that probabilistic methods have over purely symbolic techniques is that they come with their own built-in measure of leakiness (see Section 5.4). It is never a trivial matter to find the appropriate degree of idealization in pursuit of theoretical elegance, and all we can do here is to offer a couple of convenient stand-ins for the very real but still somewhat elusive notion of elegance.

The first stand-in, held in particularly high regard in linguistics, is brevity. The contemporary slogan of algorithmic complexity (see Section 7.2), that the best theory is the shortest theory, could have been invented by Pāṇini. The only concession most linguists are willing to make is that some of the complexity should be ascribed to principles of universal grammar (UG) rather than to the parochial rules specific to a given language, and since the universal component can be amortized over many languages, we should maximize its explanatory burden at the expense of the parochial component.

The second stand-in is stability in the sense that minor perturbations of the definition lead to essentially the same system. Stability has always been highly regarded in mathematics: for example, Birkhoff (1940) spent significant effort on establishing the value of lattices as legitimate objects of algebraic inquiry by investigating alternative definitions that ultimately lead to the same class of structures. There are many ways to formalize an idea, and when small changes in emphasis have a very significant impact on the formal properties of the resulting system, its mathematical value is in doubt. Conversely, when variants of formalisms as different as indexed grammars (Aho 1968), combinatory categorial grammar (Steedman 2001), head grammar (Pollard 1984), and tree adjoining grammar (Joshi 2003) define the same class of languages, the value of each is significantly enhanced.

One word of caution is in order: the fact that some idea is hard to formalize, or even seems so contradictory that a coherent mathematical formulation appears impossible, can be a reflection on the state of the art just as well as on the idea itself. Starting with Berkeley (1734), the intuitive notion of infinitesimals was subjected to all kinds of criticism, and it took over two centuries for mathematics to catch up and provide an adequate foundation in Robinson (1966). It is quite conceivable that equally intuitive notions, such as a semantic theory of information, which currently elude our mathematical grasp, will be put on firm foundations by later generations. In such cases, we content ourselves with explaining the idea informally, describing the main intuitions and pointing at possible avenues of formalization only programmatically.

1.5 Foundations
For the purposes of mathematical linguistics, the classical foundations of mathematics are quite satisfactory: all objects of interest are sets, typically finite or, rarely, denumerably infinite. This is not to say that nonclassical metamathematical tools such as Heyting algebras find no use in mathematical linguistics but simply to assert that the fundamental issues of this field are not foundational but definitional.

Given the finitistic nature of the subject matter, we will in general use the terms set, class, and collection interchangeably, drawing explicit cardinality distinctions only in the rare cases where we step out of the finite domain. Much of the classical linguistic literature of course predates Cantor, and even the modern literature typically conceives of infinity in the Gaussian manner of a potential, as opposed to actual, Cantorian infinity. Because of immediate empirical concerns, denumerable generalizations of finite objects such as ω-words and Büchi automata are rarely used (for a contrary view, see Langendoen and Postal 1984), and in fact even the trivial step of generalizing from a fixed constant to arbitrary n is often viewed with great suspicion.

Aside from the tradition of Indian logic, the study of languages had very little impact on the foundations of mathematics. Rather, mathematicians realized early on that natural language is a complex and in many ways unreliable construct and created their own simplified language of formulas and the mathematical techniques to investigate it. As we shall see, some of these techniques are general enough to cover essential facets of natural languages, while others scale much more poorly.

There is an interesting residue of foundational work in the Berry, Richard, Liar, and other paradoxes, which are often viewed as diagnostic of the vagueness, ambiguity, or even 'paradoxical nature' of natural language. Since the goal is to develop a mathematical theory of language, sooner or later we must define English in a formal system. Once this is done, the buck stops there, and questions like "what is the smallest integer not nameable in ten words?" need to be addressed anew. We shall begin with the seemingly simpler issue of the first number not nameable in one word. Since it appears to be one hundred and one, a number already requiring four words to name, we should systematically investigate the number of words in number names. There are two main issues to consider: what is a word? (see Chapter 4); and what is a name? (see Chapter 6). Another formulation of the Berry paradox invokes the notion of syllables; these are also discussed in Chapter 4. Eventually we will deal with the paradoxes in Chapter 6, but our treatment concentrates on the linguistic, rather than the foundational, issues.
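The word-counting question is easy to probe mechanically. The sketch below uses a deliberately crude numeral grammar of our own devising (it is not from the text), and, as the discussion stresses, the answer depends entirely on what counts as a word and a name: under whitespace tokenization the first multi-word name is already "one hundred", while a convention admitting "hundred" as a one-word name for 100 yields "one hundred and one", as above.

```python
# Toy probe of "what is the first number not nameable in one word?".
# The numeral grammar is our own crude convention; changing it (e.g.
# letting 'hundred' alone name 100) changes the answer, which is
# precisely the definitional point made in the text.

ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
        'seventy', 'eighty', 'ninety']

def name(n: int) -> str:
    """Name 0 <= n < 1000 in one (of many possible) written conventions."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ('-' + ONES[rest] if rest else '')
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + ' hundred' + (' and ' + name(rest) if rest else '')

def word_count(n: int) -> int:
    # Whitespace tokenization: hyphenated numerals like 'ninety-nine'
    # count as a single word.
    return len(name(n).split())

first = next(n for n in range(1000) if word_count(n) > 1)
print(first, repr(name(first)))   # -> 100 'one hundred'
```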
1.6 Mesoscopy
Physicists speak of mesoscopic systems when these contain, say, fifty atoms, too large to be given a microscopic quantum-mechanical description but too small for the classical macroscopic properties to dominate the behavior of the system. Linguistic systems are mesoscopic in the same broad sense: they have thousands of rules and axioms compared with the handful of axioms used in most branches of mathematics. Group theory explores the implications of five axioms, arithmetic and set theory get along with five and twelve axioms respectively (not counting members of axiom schemes separately), and the most complex axiom system in common use, that of geometry, has fewer than thirty axioms.

It comes as no surprise that with such a large number of axioms, linguistic systems are never pursued microscopically to yield implications in the same depth as group theory or even less well-developed branches of mathematics. What is perhaps more surprising is that we can get reasonable approximations of the behavior at the macroscopic level using the statistical techniques pioneered by A. A. Markov (see Chapters 7 and 8).

Statistical mechanics owes its success largely to the fact that in thermodynamics only a handful of phenomenological parameters are of interest, and these are relatively easy to link to averages of mechanical quantities. In mathematical linguistics the averages that matter (e.g. the percentage of words correctly recognized or correctly translated) are linked only very indirectly to the measurable parameters, of which there is such a bewildering variety that it requires special techniques to decide which ones to employ and which ones to leave unmodeled. Macroscopic techniques, by their very nature, can yield only approximations for mesoscopic systems. Microscopic techniques, though in principle easy to extend to the mesoscopic domain, are in practice also prone to all kinds of bugs, ranging from plain errors of fact (which are hard to avoid once we deal with thousands of axioms) to more subtle, and often systematic, errors and omissions. Readers may at this point feel very uncomfortable with the idea that a given system is only 70%, 95%, or even 99.99% correct. After all, isn't a single contradiction or empirically false prediction enough to render a theory invalid? Since we need a whole book to develop the tools needed to address this question, the full answer will have to wait until Chapter 10.

What is clear from the outset is that natural languages offer an unparalleled variety of complex algebraic structures. The closest examples we can think of are in crystallographic topology, but the internal complexity of the groups studied there is a product of pure mathematics, while the internal complexity of the syntactic semigroups associated to natural languages is more attractive to the applied mathematician, as it is something found in vivo. Perhaps the most captivating aspect of mathematical linguistics is not just the existence of discrete mesoscopic structures but the fact that these come embedded, in ways we do not fully understand, in continuous signals (see Chapter 9).
1.7 Further reading
The first works that can, from a modern standpoint, be called mathematical linguistics are Markov's (1912) extension of the weak law of large numbers (see Theorem 8.2.2) and Thue's (1914) introduction of string manipulation (see Chapter 2), but pride of place must go to Pāṇini, whose inventions include not just grammatical rules but also a formal metalanguage to describe the rules and a set of principles governing their interaction. Although the Aṣṭādhyāyī is available on the web in its entirety, the reader will be at a loss without the modern commentary literature, starting with Böhtlingk (1887, reprinted 1964). For modern accounts of various aspects of the system, see Staal (1962, 1967), Cardona (1965, 1969, 1970, 1976, 1988), and Kiparsky (1979, 1982a, 2002). Needless to say, Pāṇini did not work in isolation. Much like Euclid, he built on the inventions of his predecessors, but his work was so comprehensive that it effectively drove the earlier material out of circulation. While much of linguistics has aspired to formal rigor throughout the ages (for the Masoretic tradition, see Aronoff 1985; for medieval syntax, see Covington 1984), the continuous line of development that culminates in contemporary formal grammar begins with Bloomfield's (1926) Postulates (see Section 3.1), with the most important milestones being Harris (1951) and Chomsky (1956, 1959).

Another important line of research, only briefly alluded to above, could be called mathematical antilinguistics, its goal being the elimination, rather than the explanation, of the peculiarities of natural language from the system. The early history of the subject is discussed in depth in Eco (1995); the modern mathematical developments begin with Frege's (1879) system of Concept Writing (Begriffsschrift), generally considered the founding paper of mathematical logic. There is no doubt that many great mathematicians from Leibniz to Russell were extremely critical of natural language, using it more for counterexamples and cautionary tales than as a part of objective reality worthy of formal study, but this critical attitude has all but disappeared with the work of Montague (1970a, 1970b, 1973). Contemporary developments in model-theoretic semantics or 'Montague grammar' are discussed in Chapter 6.

Major summaries of the state of the art in mathematical linguistics include Jakobson (1961), Levelt (1974), Manaster-Ramer (1987), and the subsequent Mathematics of Language (MOL) conference volumes. We will have many occasions to cite Kracht's (2003) indispensable monograph The Mathematics of Language.

The volumes above are generally more suitable for the researcher or advanced graduate student than for those approaching the subject as undergraduates. To some extent, the mathematical prerequisites can be learned from the ground up from classic introductory textbooks such as Gross (1972) or Salomaa (1973). Gruska (1997) offers a more modern and, from the theoretical computer science perspective, far more comprehensive introduction. The best elementary introduction to the logical prerequisites is Gamut (1991). The discrete side of the standard "mathematics for linguists" curriculum is conveniently summarized by Partee et al. (1990), and the statistical approach is clearly introduced by Manning and Schütze (1999). The standard introduction to pattern recognition is Duda et al. (2000).
Variable rules were introduced in Cedergren and Sankoff (1974) and soon became the standard modeling method in sociolinguistics – we shall discuss them in Chapter 5.

2 The elements
A primary concern of mathematical linguistics is to effectively enumerate those sets of words, sentences, etc., that play some important linguistic role. Typically, this is done by means of generating the set in question, a definitional method that we introduce in Section 2.1 by means of examples and counterexamples that show the similarities and the differences between the standard mathematical use of the term 'generate' and the way it is employed in linguistics.

Because the techniques used in defining sets, functions, relations, etc., are not always directly useful for evaluating them at a given point, an equally important concern is to solve the membership problem for the sets, functions, relations, and other structures of interest. In Section 2.2 we therefore introduce a variety of grammars that can be used to, among other things, create certificates that a particular element is indeed a member of the set, gets mapped to a particular value, stands in a prescribed relation to other elements, and so on, and compare generative systems to logical calculi.

Since generative grammar is most familiar to mathematicians and computer scientists as a set of rather loosely collected string-rewriting techniques, in Section 2.3 we give a brief overview of this domain. We put the emphasis on context-sensitive grammars both because they play an important role in phonology (see Chapter 3) and morphology (see Chapter 4) and because they provide an essential line of defense against undecidability in syntax (see Chapter 5).
2.1 Generation
To define a collection of objects, it is often expedient to begin with a fixed set of primitive elements E and a fixed collection of rules R (we use this term in a broad sense that does not imply strict procedurality) that describe permissible arrangements of the primitive elements as well as of more complex objects. If x, y, z are objects satisfying a (binary) rule z = r(x, y), we say that z directly generates x and y (in this order) and use the notation z →_r xy. The smallest collection of objects closed under direct generation by any r ∈ R and containing all elements of E is called the set generated from E by R.

Very often the simplest or most natural definition yields a superset of the real objects of interest, which is therefore supplemented by some additional conditions to narrow it down. In textbooks on algebra, the symmetric group is invariably introduced before the alternating group, and the latter is presented simply as a subgroup of the former. In logic, closed formulas are typically introduced as a special class of well-formed formulas. In context-free grammars, the sentential forms produced by the grammar are kept only if they contain no nonterminals (see Section 2.3), and we will see many similar examples (e.g. in the handling of agreement; see Section 5.2.3).

Generative definitions need to be supported by some notion of equality among the defined objects. Typically, the notion of equality we wish to employ will abstract away from the derivational history of the object, but in some cases we will need a stronger definition of identity that defines two objects to be the same only if they were generated the same way. Of particular interest in this regard are derivational strata. A specific intermediary stage of a derivation (e.g. when a group of rules has been exhausted or when some well-formedness criterion is met) is often called a stratum and is endowed with theoretically significant properties, such as availability for interfacing with other modules of grammar. Theories that recognize strata are called multistratal, and those that do not are called monostratal – we shall see examples of both in Chapter 5.

In mathematical linguistics, the objects of interest are the collection of words in a language, the collection of sentences, the collection of meanings, etc. Even the most tame and obviously finite collections of this kind present great definitional difficulties. Consider, for example, the set of characters (graphemes) used in written English. Are uppercase and lowercase forms to be kept distinct? How about punctuation, digits, or Zapf dingbats? If there is a new character for the euro currency unit, as there is a special character for dollar and pound sterling, shall it be included on account of Ireland having already joined the euro zone or shall we wait until England follows suit? Before proceeding to words, meanings, and other more subtle objects of inquiry, we will therefore first refine the notion of a generative definition on some familiar mathematical objects.

Example 2.1.1 Wang tilings. Let C be a finite set of colors and S be a finite set of square tiles, each colored on the edges according to some function e : S → C⁴. We assume that for each coloring type we have an infinite supply of tokens colored with that pattern: these make up the set of primitive elements E. The goal is to tile the whole plane (or just the first quadrant), laying down the tiles so that their colors match at the edges.
To express this restriction more precisely, we use a rule system R with four rules n, s, e, w as follows. Let Z be the set of integers, ′ be the successor function "add one", and ‵ be its inverse "subtract one". For any i, j ∈ Z, we say that the tile u whose bottom left corner is at (i, j) has a correct neighbor to the north if the third component of e(u) is the same as the first component of e(v), where v is the tile at (i, j′). Denoting the ith projection by π_i, we can write π₃(e(u)) = π₁(e(v)) for v at (i, j′). Similarly, the west rule requires π₄(e(u)) = π₂(e(v)) for v at (i‵, j), the east rule requires π₂(e(u)) = π₄(e(v)) for v at (i′, j), and the south rule requires π₁(e(u)) = π₃(e(v)) for v at (i, j‵). We define first-quadrant (plane) tilings as functions from N × N (Z × Z) to E that satisfy all four rules.

Discussion While the above may look very much like a generative definition, there are some crucial differences. First, the definition relies on a number of externally given objects, such as the natural numbers, the integers, the successor function, and Cartesian products. In contrast, the definitions we will encounter later, though they may require some minimal set-theoretical scaffolding, are almost always noncounting, both in the broad sense of being free of arithmetic aspects and in the narrower semigroup-theoretic sense (see Section 5.2.3).

Second, these rules are well-formedness conditions (WFCs, see Section 2.3) rather than procedural rules of production. In many cases, this is a distinction without a difference, since production rules can often be used to enforce WFCs. To turn the four WFCs n, s, e, w into rules of production requires some auxiliary definitions: we say that a new tile (i, j) is north-adjacent to a preexisting set of tiles T if (i, j) ∉ T but (i, j′) ∈ T, and similarly for east-, west-, and south-adjacency (any combination of these relations may simultaneously obtain between a new tile and some suitably shaped T). A (north, south, east, west)-addition of a tile at (i, j) is an operation that is permitted between a set of tiles T and a (south, north, west, east)-adjacent tile (i, j) to form T ∪ (i, j) iff the n, s, w, e rules are satisfied, so there are 2⁴ production rules. It is a somewhat tedious but entirely trivial exercise to prove that these sixteen rules of production can be used to successively build all and only well-formed first-quadrant (plane) tilings starting from a single tile placed at the origin.

Obviously, for some tile inventories, the production rules can also yield partial tilings that can never be completed as well-formed first-quadrant (or plane) tilings, and in general we will often see reason to consider broader production processes, where ill-formed intermediate structures are integral to the final outcome. This becomes particularly interesting in cases where the final result shows some regularity that is not shared by the intermediate structures. For example, in languages that avoid two adjacent vowels (a configuration known as hiatus), if a vowel-initial word would come after a vowel-final one, there may be several distinct processes that enforce this constraint, e.g. by deleting the last vowel of the first word or by inserting a consonant between them (as in the very idea[r] of it). It has long been observed (see in particular Kisseberth 1970) that such processes can conspire to enforce WFCs, and an important generative model, optimality theory (see Section 4.3), takes this observation to be fundamental in the sense that surface regularities appear as the cause, rather than the effect, of production rules that conspire to maintain them. On the whole, there is great interest in constraint-based theories of grammar where the principal mechanism of capturing regularities is by stating them as WFCs.

Third, the four WFCs (as opposed to the 2⁴ production rules) have no recursive aspect whatsoever. There is no notion of larger structures built via intermediate structures: we go from the atomic units (tiles) to the global structure (tiling of the first quadrant) in one leap.
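Since the four WFCs are stated pointwise, they are straightforward to check mechanically on any finite partial tiling. The sketch below is our own, and the (south, east, north, west) component order is our reconstruction from the rules above; scanning each tile's north and east neighbors covers all four conditions, because the south and west rules are the mirror images of the other two.

```python
# Checking the n, s, e, w well-formedness conditions on a finite partial
# tiling.  A tile type is a 4-tuple of colors in the assumed component
# order (south, east, north, west).
S_, E_, N_, W_ = 0, 1, 2, 3

def well_formed(tiling):
    """tiling: dict mapping (i, j) grid positions to color 4-tuples."""
    for (i, j), u in tiling.items():
        north = tiling.get((i, j + 1))
        east = tiling.get((i + 1, j))
        if north is not None and u[N_] != north[S_]:
            return False          # violates the n rule
        if east is not None and u[E_] != east[W_]:
            return False          # violates the e rule
        # The s and w rules are checked implicitly: each south/west pair
        # is seen as a north/east pair from the neighboring tile.
    return True

# A 2x2 patch built from one tile type whose opposite edges match:
t = ('red', 'blue', 'red', 'blue')            # south=north, east=west
patch = {(i, j): t for i in range(2) for j in range(2)}
print(well_formed(patch))                     # True
```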
Linguistic objects, as we shall see, are generally organized in intermediate layers that are of interest in themselves: a typical example is provided by phonemes (sounds), which are organized in syllables, which in turn are organized in metrical feet, which may be organized in cola (superfeet) before reaching the word level (see Section 4.1). In contrast, Example 2.1.1 will lack recursive structure not only in the presentation chosen above but in any other presentation.

Theorem 2.1.1 (Berger 1966) It is recursively undecidable whether a given inventory of tiles E can yield a Wang tiling.

Example 2.1.2 Presentation of groups in terms of generators and relations. Let E be a set of generators g_1, …, g_k and their inverses g_1⁻¹, …, g_k⁻¹, and let R contain both the formal product rule that forms a string of these from two such strings by means of concatenation and the usual rules of cancellation g_i g_i⁻¹ = g_i⁻¹ g_i = λ (the empty word) as context-free string-rewriting rules. Formal products composed from the g_i and g_i⁻¹ define the free group (or, if we omit inverses and cancellation, the free monoid) over k generators, with the usual conventions that the empty word is the multiplicative unit of the group (monoid) and that formal products containing canceling terms are equivalent to those with the canceling terms omitted. If a broader set of formal products is defined as canceling, representatives of this set are called defining relations for the group being presented, which is the factor of the free group by the cancellation kernel.

Discussion As is well-known, it is in general undecidable whether a formal product of generators and inverses is included in the kernel or not (Sims 1994) – we will discuss the relationship between combinatorial group theory and formal languages in Chapter 5. Note that it is a somewhat arbitrary technical decision whether we list the defining relations as part of our production rules or as part of the equality relation: we can keep one or the other (but not both) quite trivial without any loss of expressive power. In general, the equality clause of generative definitions can lead to just the same complications as the rules clause.

Example 2.1.3 Herbrand universes. As the reader will recall, a first order language (FOL) consists of logical symbols (variables, connectives, quantifiers, equal sign) plus some constants (e.g., distinguished elements of algebras), as well as function and relation symbols (each with finite arity). The primitive elements of the Herbrand universe are the object constants of the FOL under study (or an arbitrary constant if no object constant was available initially), and there are as many rules to describe permissible arrangements of elements as there are function/relation constants in the FOL under study: if f is such a constant of arity n, f(x_1, …, x_n) is in the Herbrand universe provided the x_i were.

Discussion Herbrand universes are used in building purely formula-based models, which are in some sense canonical among the many models that first order theories have. It should come as no surprise that logic offers many par excellence examples of generative definition – after all, the techniques developed for formalizing mathematical statements grew out of the larger effort to render statements of all sorts formally. However, the definition of an FOL abstracts away from several important properties of natural language.
In FOLs, functions and relations of arbitrary arity are permitted, while in natural language the largest number of arguments one needs to consider is five (see Section 5.2). Also, in many important cases (see Chapter 3), the freedom to utilize an infinite set of constants or variables is not required. 2.2 Axioms, rules, and constraints 13
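The generative character of Example 2.1.3 is easy to see in code: starting from the object constants, each function symbol contributes one rule for building new terms. Here is a minimal sketch, with a toy signature of our own choosing; since the universe is infinite as soon as a function symbol is present, we enumerate it only up to a fixed term depth.

```python
from itertools import product

def herbrand_levels(constants, functions, depth):
    """Generate the Herbrand universe of a signature up to a term depth.

    'functions' maps each function symbol to its arity; terms are nested
    tuples.  Each pass adds every term buildable from the terms so far,
    mirroring the rule "f(x_1, ..., x_n) is in the universe provided the
    x_i were".
    """
    universe = {(c,) for c in constants}
    for _ in range(depth):
        universe |= {(f,) + args
                     for f, n in functions.items()
                     for args in product(universe, repeat=n)}
    return universe

# Toy signature (ours): one constant a, one unary f, one binary g.
terms = herbrand_levels({'a'}, {'f': 1, 'g': 2}, depth=2)
print(len(terms))   # 13 terms: a, f(a), g(a,a), f(f(a)), g(a,f(a)), ...
```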
As a matter of fact, it is often tempting to replace natural languages, the true object of inquiry, by some well-regimented semiformal or fully formal construct used in mathematics. Certainly, there is nothing wrong with a bit of idealization, especially with ignoring factors best classified as noise. But a discussion about the English word triangle cannot rely too much on the geometrical object by this name since this would create problems where there aren’t any; for example, it is evident that a hunter circling around a clearing does not require that her path keep the exact same distance from the center at all times. To say that this amounts to fuzzy definitions or sloppy language use is to put the cart before the horse: the fact to be explained is not how a cleaned-up language could be used for communication but how real language is used.
Exercise 2.1 The Fibonacci numbers are defined by f_0 = 0, f_1 = 1, f_{n+1} = f_n + f_{n-1}. Is this a generative definition? Why?
2.2 Axioms, rules, and constraints
There is an unbroken tradition of argumentation running from the Greek sophists to the Oxford Union, and the axiomatic method has its historic roots in the efforts to regulate the permissible methods of debate. As in many other fields of human activity, ranging from ritual to game playing, regulation will lay bare some essential features of the activity and thereby make it more enjoyable for those who choose to participate. Since it is the general experience that almost all statements are debatable, to manage argumentation one first needs to postulate a small set of primitive statements on which the parties agree – those who will not agree are simply excluded from the debate. As there is remarkable agreement about the validity of certain kinds of inference, the stage is set for a fully formal, even automatic, method of verifying whether a given argument indeed leads to the desired conclusion from the agreed-upon premises.

There is an equally venerable tradition of protecting the full meaning and exact form of sacred texts, both to make sure that mispronunciations and other errors that may creep in over the centuries do not render them ineffectual and to make sure that misinterpretations do not confuse those whose task is to utter them on the right occasion. Even if we ignore the phonetic issues related to 'proper' pronunciation (see Chapter 8), writing down the texts is far from sufficient for the broader goals of preservation. With any material of great antiquity, we rarely have a single fully preserved and widely accepted version – rather, we have several imperfect variants and fragments. What is needed is not just a frozen description of some texts, say the Vedas, but also a grammar that defines what constitutes a proper Vedic text. The philological ability to determine the age of a section and undo subsequent modifications is especially important because the words of earlier sages are typically accorded greater weight.

In defining the language of a text, a period, or a speech community, we can propagate grammaticality the same way we propagate truth in an axiomatic system, by choosing an initial set of grammatical expressions and defining some permissible combinatorial operations that are guaranteed to preserve grammaticality. Quite often, such operations are conceptualized as being composed of a purely combinatorial step (typically concatenation) followed by some tidying up; e.g., adding a third-person suffix to the verb when it follows a third-person subject: compare I see to He sees. In logic, we mark the operators overtly by affixing them to the sequence of the operands – prefix (Polish), infix (standard), and postfix (reverse Polish) notations are all in wide use – and tend not to put a great deal of emphasis on tidying up (omission of parentheses is typical). In linguistics, there is generally only one operation considered, concatenation, so no overt marking is necessary, but the tidying up is viewed as central to the enterprise of obtaining all and only the attested forms.

The same goal of characterizing all and only the grammatical forms can be accomplished by more indirect means. Rather than starting from a set of fully grammatical forms, we can begin with some more abstract inventory, such as the set of words W, elements of which need not in and of themselves be grammatical, and rather than propagating grammaticality from the parts to the whole, we perform some computation along the way to keep score.

Example 2.2.1 Balanced parentheses.
We have two atomic expressions, the left and the right paren, and we assign the value +1 to '(' and -1 to ')'. We can successively add new paren symbols on the right as long as the score (overall sum of the +1 and -1 values) does not dip below zero: the well-formed (balanced) expressions are simply those where this WFC is met and the overall score is zero.

Discussion The example is atypical for two reasons: first because linguistic theories are noncounting (they do not rely on the full power of arithmetic) and second because it is generally not necessary for a WFC to be met at every stage of the derivation. Instead of computing the score in Z, a better choice is some finite structure G with well-understood rules of combination, and instead of assigning a single value to each atomic expression, it gives us much-needed flexibility to make the assignment disjunctive (taking any one of a set of values). Thus we have a mapping c: W → 2^G and consider grammatical only those sequences of words for which the rules of combination yield a desirable result. Demonstrating that the assigned elements of G indeed combine in the desired manner constitutes a certificate of membership according to the grammar defined by c.

Example 2.2.2 Categorial grammar. If G behaves like a free group except that formal inverses of generators do not cancel from both sides (g·g^{-1} = e is assumed but g^{-1}·g = e is not) and we consider only those word sequences w_1.w_2...w_n for which there is at least one h_i in each c(w_i) such that h_1·...·h_n = g_0 (i.e. the group-theoretical product of the h_i yields a distinguished generator g_0), we obtain a version of bidirectional categorial grammar (Bar-Hillel 1953, Lambek 1958). If we take G as the free Abelian group, we obtain unidirectional categorial grammar (Ajdukiewicz 1935). These notions will be developed further in Section 5.2.

Example 2.2.3 Unification grammar. By choosing G to be the set of rooted directed acyclic node-labeled graphs, where the labels are first order variables and constants, and considering only those word sequences for which the assigned graphs will unify, we obtain a class of unification grammars.
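A minimal Python sketch (the function name is ours) of the running-score WFC of Example 2.2.1; note how the check is enforced at every stage, the very property the Discussion calls atypical:

    def balanced(expr):
        # Running-score WFC: the partial sum of +1/-1 values must never
        # dip below zero, and the total must be exactly zero.
        # Input is assumed to contain only '(' and ')'.
        score = 0
        for ch in expr:
            score += 1 if ch == '(' else -1
            if score < 0:          # WFC violated mid-derivation
                return False
        return score == 0

    assert balanced('(()())') and not balanced(')(')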
Example 2.2.4 Link grammar. By choosing G to satisfy a generalized version of the (horizontal) tiling rules of Example 2.1.1, we obtain the link grammars of Sleator and Temperley (1993).

We will investigate a variety of such systems in detail in Chapters 5 and 6, but here we concentrate on the major differences between truth and grammaticality. First, note that systems such as those above are naturally set up to define not only one distinguished set of strings but its cosets as well. For example, in a categorial grammar, we may inquire not only about those strings of words for which multiplication of the associated categories yields the distinguished generator but also about those for which the yield contains another generator or any specific element of G. This corresponds to the fact that e.g. the house of the seven gables is grammatical but only as a noun phrase and not as a sentence, while the house had seven gables is a grammatical sentence but not a grammatical noun phrase. It could be tempting to treat the cosets in analogy with n-valued logics, but this does not work well since the various stringsets defined by a grammar may overlap (and will in fact irreducibly overlap in every case where a primitive element is assigned more than one disjunct by c), while truth values are always uniquely assigned in n-valued logic.

Second, the various calculi for propagating truth values by specific rules of inference can be supported by an appropriately constructed theory of model structures. In logic, a model will be unique only in degenerate cases: as soon as there is an infinite model, by the Löwenheim–Skolem theorems we have at least as many nonisomorphic models as there are cardinalities. In grammar, the opposite holds: as soon as we fix the period, dialect, style, and possibly other parameters determining grammaticality, the model is essentially unique.

The fact that up to isomorphism there is only one model structure M gives rise to two notions peculiar to mathematical linguistics: overgeneration and undergeneration. If there is some string w_1.w_2...w_n ∉ M that appears in the yield of c, we say that c overgenerates (with respect to M), and if there is a w_1.w_2...w_n ∈ M that does not appear in the yield of c, we say that c undergenerates. It is quite possible, indeed typical, for working grammars to have both kinds of errors at the same time. We will develop quantitative methods to compare the errors of different grammars in Section 5.4, and note here that neither undergeneration nor overgeneration is a definitive diagnostic of some fatal problem with the system. In many cases, overgeneration is benign in the sense that the usefulness of a system that e.g. translates English sentences to French is not at all impaired by the fact that it is also capable of translating an input that lies outside the confines of fully grammatical English. In other cases, the aim of the system may be to shed light only on a particular range of phenomena, say on the system of intransitive verbs, to the exclusion of transitive, ditransitive, etc., verbs. In the tradition of Montague grammar (see Section 6.2), such systems are explicitly called fragments. Constraint-based theories, which view the task of characterizing all and only the well-formed structures as one of (rank-prioritized) intersection of WFCs (see Section 4.2), can have the same under- and overgeneration problems as rule-based systems, as long as they have too many (too few) constraints.
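In the unidirectional (free Abelian) case of Example 2.2.2, score-keeping is just addition of generator exponents, which also makes the coset queries above easy to state. A hedged Python sketch, with a two-word toy lexicon that is entirely our own invention:

    from collections import Counter
    from itertools import product

    # Disjunctive assignment c: W -> 2^G, each category a vector of
    # exponents in the free Abelian group over the generators {n, s}.
    LEXICON = {
        'john':   [Counter({'n': 1})],
        'sleeps': [Counter({'s': 1, 'n': -1})],   # n\s, abelianized
    }

    def yields(words):
        # All group elements obtainable by choosing one disjunct per word.
        results = []
        for choice in product(*(LEXICON[w] for w in words)):
            total = Counter()
            for v in choice:
                total.update(v)               # exponent-wise addition
            results.append(Counter({k: c for k, c in total.items() if c}))
        return results

    def grammatical_as(words, generator):
        # Membership in the coset named by `generator`.
        return Counter({generator: 1}) in yields(words)

    assert grammatical_as(['john', 'sleeps'], 's')      # a sentence
    assert not grammatical_as(['john', 'sleeps'], 'n')  # not a noun phrase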
In spite of these major differences, the practice of logic and that of grammar have a great deal in common. First, both require a systematic ability to analyze sentences into component parts, so that generalizations involving only some part can be stated, and the ability to construct new sentences from ones already seen. Chapter 5 will discuss such syntactic abilities in detail. We note here that the practice of logic is largely normative in the sense that constructions outside those explicitly permitted by its syntax are declared ill-formed, while the practice of linguistics is largely descriptive in the sense that it takes the range of existing constructions as given and strives to adjust the grammar so as to match this range.

Second, both logic and grammar are largely driven by an overall consideration of economy. As the reader will no doubt have noticed, having a separate WFC for the northern, southern, eastern, and western edges of a tile in Example 2.1.1 is quite unnecessary: any two orthogonal directions would suffice to narrow down the range of well-formed tilings. Similarly, in context-free grammars, we often find it sufficient to deal only with rules that yield only two elements on the right-hand side (Chomsky normal form), and there has to be some strong reason for departing from the simplest binary branching structure (see Chapter 5).

From the perspective of linguistics, logical calculi are generation devices, with the important caveat that in logic the rules of deduction are typically viewed as possibly having more than one premiss, while in linguistics such rules would generally be viewed as having only one premiss, namely the conjunction of the logically distinct premisses, and axiom systems would be viewed as containing a single starting point (the conjunction of the axioms). The deduction of theorems from the axiom by brute force enumeration of all proofs is what linguists would call free generation. The use of a single conjunct premiss instead of multiple premisses may look like a distinction without a difference, but it has the effect of making generative systems invertible: for each such system with rules r_1, ..., r_k we can construct an inverted system with rules r_1^{-1}, ..., r_k^{-1} that is now an accepting, rather than generating, device. This is very useful in all those cases where we are interested in characterizing both production (synthesis, generation) and perception (analysis, parsing) processes, because the simplest hypothesis is that these are governed by the same set of abstract rules.

Clearly, definition by generation differs from deduction by a strict algorithmic procedure only in that the choice of the next algorithmic step is generally viewed as being completely determined by the current step, while in generation the next step is freely drawn from the set of generative rules. The all-important boundary between recursive and recursively enumerable (r.e.) sets is drawn the same way by certificates (derivation structures), but the systems of interest congregate on different sides of this boundary. In logic, proving the negation of a statement requires the same kind of certificate (a proof object rooted in the axioms and terminating in the desired conclusion) as proving the statement itself – the difficulty is that most calculi are r.e. but not recursive (decidable).
In grammar, proving the ungrammaticality of a form requires an apparatus very different from proving its grammaticality: for the latter purpose an ordinary derivation suffices, while for the former we typically need to
exhaustively survey all forms of similar and lesser complexity, which can be difficult, even though most grammars are not only r.e. but in fact recursive.
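Free generation by brute force enumeration is easy to make concrete. The following Python sketch (our own illustration, not the book's) enumerates breadth-first the strings derivable in a semi-Thue system from a single axiom; every string it emits comes with a derivation as certificate, but non-membership is never certified by this procedure alone, which is the r.e./recursive asymmetry just described:

    from collections import deque

    def free_generation(axiom, rules, limit=30):
        # rules: (lhs, rhs) string-rewriting pairs; returns the first
        # `limit` derivable strings (sentential forms included).
        seen, queue, out = {axiom}, deque([axiom]), []
        while queue and len(out) < limit:
            s = queue.popleft()
            out.append(s)
            for lhs, rhs in rules:
                i = s.find(lhs)
                while i != -1:           # apply the rule at every position
                    t = s[:i] + rhs + s[i + len(lhs):]
                    if t not in seen:
                        seen.add(t)
                        queue.append(t)
                    i = s.find(lhs, i + 1)
        return out

    # A toy system generating a^n b^n from the axiom S:
    print(free_generation('S', [('S', 'aSb'), ('S', '')], limit=10))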
2.3 String rewriting
Given a set of atomic symbols Σ called the alphabet, the simplest imaginable operation is that of concatenation, whereby a complex symbol xy is formed from x and y by writing them in succession. Applying this operation recursively, we obtain strings of arbitrary length. Whenever such a distinction is necessary, the operation will be denoted by . (dot). The result of the dot operation is viewed as having no internal punctuation: u.v = uv both for atomic symbols and for more complex strings, corresponding to the fact that concatenation is by definition associative. To forestall confusion, we mention here that in later chapters the . will also be used in glosses to connect a word stem to the complex of morphosyntactic (inflectional) features the word form carries: for example geese = goose.PL (the plural form of goose is geese) or Hungarian házammal = house.POSS1SG.INS 'with my house', where POSS1SG refers to the suffix that signifies possession by a first-person singular entity and INS refers to the instrumental case ending roughly analogous to English with. (The reader should be forewarned that translation across languages rarely proceeds as smoothly on a morpheme by morpheme basis as the example may suggest: in many cases morphologically expressed concepts of the source language have no exact equivalent in the language used for glossing.)

Of special interest is the empty string λ, which serves as a two-sided multiplicative unit of concatenation: λ.u = u.λ = u. The whole set of strings generated from Σ by concatenation is denoted by Σ+ (λ-free Kleene closure) or, if the empty string is included, by Σ* (Kleene closure). If u.v = w, we say that u (v) is a left (right) factor of w. If we define the length l(x) of a string x as the number of symbols in x, counted with multiplicity (the empty word has length 0), l is a homomorphism from Σ* to the additive semigroup of nonnegative integers. In particular, the semigroup of nonnegative integers (with ordinary addition) is isomorphic to the Kleene closure of a one-symbol alphabet (with concatenation): the latter may be called integers in base one notation.

Subsets of Σ* are called stringsets, formal languages, or just languages. In addition to the standard Boolean operations, we can define the concatenation of strings and languages U and V as UV = {uv | u ∈ U, v ∈ V}, suppressing the distinction between a string and a one-member language, writing xU instead of {x}U, etc. The (λ-free) Kleene closure of strings and languages is defined analogously to the closure of alphabets. For a string w and a language U, we say u ∈ U is a prefix of w if u is a left factor of w and no smaller left factor of w is in U.

Finite languages have the same distinguished status among all stringsets that the natural numbers N have among all numbers: they are, after all, all that can be directly listed without relying on any additional interpretative mechanism. And as in arithmetic, where the simplest natural superset of the integers includes not only finite decimal fractions but some infinite ones as well, the simplest natural superset of the finite languages is best defined by closure under operations (both Boolean and string operations) and will contain some infinite languages as well. We call regular all finite languages, and all languages that can be obtained from these by repeated application of union, intersection, complementation, concatenation, and Kleene closure.
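These operations are straightforward to state in code. A minimal Python sketch (function names are ours), with the Kleene closure necessarily truncated to a finite bound, since U* is infinite whenever U contains a nonempty string:

    def concat(U, V):
        # UV = {uv | u in U, v in V}
        return {u + v for u in U for v in V}

    def kleene(U, bound):
        # Approximates U* by all concatenations of at most `bound` factors.
        closure, layer = {''}, {''}
        for _ in range(bound):
            layer = concat(layer, U)
            closure |= layer
        return closure

    assert concat({'a', 'ab'}, {'b'}) == {'ab', 'abb'}
    assert kleene({'a'}, 3) == {'', 'a', 'aa', 'aaa'}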
The classic Kleene theorem guarantees that regular languages have the same distinguished status among languages that the rationals Q have among the reals R.

Exercise 2.2 Let F be the language of Fibonacci numbers written in base one. Is F finitely generated?

The generative grammars defining stringsets typically use an alphabet V that is a proper superset of the alphabet Σ containing the symbols of interest. Elements of N = V \ Σ are called nonterminal symbols or just nonterminals to distinguish them from elements of Σ (called terminal symbols or terminals). Nonterminals play only a transient role in generating the objects of real interest, inasmuch as the yield of a grammar is explicitly restricted to terminal strings – the name nonterminal comes from the notion that a string containing them corresponds to a stage of the derivation that has not (yet) terminated. In context-free grammars (CFGs), we use a start symbol S ∈ N and productions or rewrite rules of the form A → v, where A ∈ N and v ∈ V*.

Example 2.3.1 A CFG for base ten integers. We use nonterminals SIGN and DIGIT and posit the rules S → SIGN DIGIT; S → DIGIT; DIGIT → DIGIT DIGIT; DIGIT → 0; DIGIT → 1; ... DIGIT → 9; SIGN → +; SIGN → -. (The nonterminals are treated here as atomic symbols rather than strings of Latin letters. We use whitespace to indicate token boundaries rather than the Algol convention of enclosing each token in ⟨ ⟩.) At the first step of the derivation, we can only choose the first or the second rule (since no other rule rewrites S), and we obtain the string SIGN DIGIT or DIGIT. Taking the first option and using the last rule to rewrite SIGN, we obtain -DIGIT, and using the third rule n times, we get -DIGIT^{n+1}. By eliminating the nonterminals, we obtain a sequence of n+1 decimal digits preceded by the minus sign.

Discussion Needless to say, base ten integers are easy to define by simpler methods (see Section 6.1), and the CFG used above is overkill also in the sense that strings with three or more digits will have more than one derivation. Context-free languages (languages generated by a CFG) are a proper superset of regular languages. For example, consider the CFG with nonterminal S, terminals a, b, and rewrite rules S → aSa; S → bSb; S → a; S → b; S → λ. It is easily seen that this grammar defines the language of palindromes over {a, b}, which contains exactly those strings that are their own reversal (mirror image).

Exercise 2.3 Given a CFG G generating some CFL L not containing the empty string, create another CFG G' generating the same language such that every production has the form A → bC, where A and C are nonterminals (members of N) and b is a terminal (member of Σ).

Continuing with the comparison to numbers, CFLs play the same role among languages that algebraic numbers play among the reals. To appreciate this, one needs to generalize from the view of languages as stringsets to a view of languages as mappings (from strings to weights in a semiring or similar structure). We take up this matter in Section 5.4.

Exercise 2.4 Prove that the language of palindromes is not regular.

If a context-free rewrite rule A → v is applied to a string iAj and l is a right factor of i (r is a left factor of j), we say that the rule is applied in the left context l (in the right context r).
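A quick Python sketch (ours) that expands the palindrome grammar above to a fixed depth and confirms that every terminal string obtained is indeed its own reversal:

    def palindromes(depth):
        # Expand S -> aSa | bSb | a | b | λ for `depth` rounds; each
        # sentential form of this grammar contains at most one S.
        forms = {'S'}
        for _ in range(depth):
            forms |= {f.replace('S', rhs, 1)
                      for f in forms if 'S' in f
                      for rhs in ('aSa', 'bSb', 'a', 'b', '')}
        return {f for f in forms if 'S' not in f}   # terminal yields only

    assert all(w == w[::-1] for w in palindromes(4))
    assert 'abba' in palindromes(4) and 'ab' not in palindromes(4)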
Context-sensitive rewrite rules are defined as triples (p, l, r), where p is a context-free production as above, and l and r are (possibly empty) strings defining the left and the right contexts of the rule in question. In keeping with the structuralist morphology and phonology of the 1950s, the # symbol is often used as an edge marker signifying the beginning or end of a string. Traditionally, p is called the structural change; the context, written l __ r, is called the structural description; and the triple is written as the structural change separated from the structural description by / (assumed to be outside the alphabet). For example, a rule that deletes a leading zero from an unsigned decimal could be written 0 → λ / # __, and the more general rule that deletes it irrespective of the presence of a sign could be written 0 → λ / {#, +, -} __. Note that the right context of these rules is empty (it does not matter what digit, if any, follows the leading 0), while the left context 'edge of string' needs to be explicitly marked by the # symbol for the rules to operate correctly.

When interpreted as WFCs, the context statements simply act as filters on derivations: an otherwise legitimate rewriting step iAj → ivj is blocked (deemed ill-formed) unless l is a right factor of i and r is a left factor of j. This notion of context-sensitivity adds nothing to the generative power of CFGs: the resulting system is still capable only of generating CFLs.

Theorem 2.3.1 (McCawley 1968) Context-free grammars with context checking generate only context-free languages.

However, if context-sensitivity is part of the generation process, we can obtain context-sensitive languages (CSLs) that are not CFLs. If λ-rules (rewriting nonterminals as the empty string in some context) are permitted, every r.e. language can be generated. If such rules are disallowed, we obtain the CSL family proper (the case when CSLs contain the empty string has to be treated separately).

Theorem 2.3.2 (Jones 1966) The context-sensitive (CS) family of languages that can be generated by λ-free context-sensitive productions is the same as the family of languages that can be generated by using only length-increasing productions (i.e. productions of the form u → v, where l(v) ≥ l(u) holds) and the same as the family of languages computable by linear bounded automata (LBAs). LBAs are one-tape Turing machines (TMs) that accept on the empty tape, with the additional restriction that at all stages of the computation, the reading head must remain on the portion of the tape that was used to store the string whose membership is to be decided.

These results are traditionally summarized in the Chomsky hierarchy: assigning regular languages to Type 3, CFLs to Type 2, CSLs to Type 1, and r.e. languages to Type 0, Chomsky (1956) demonstrated that each type is properly contained in the next lower one. These proofs, together with examples of context-free but not regular, context-sensitive but not context-free, and recursive but not context-sensitive languages, are omitted here, as they are discussed in many excellent textbooks of formal language theory such as Salomaa (1973) or Harrison (1978). To get a better feel for CSLs, we note the following results:

Theorem 2.3.3 (Karp 1972) The membership problem for CSLs is PSPACE-complete.

Theorem 2.3.4 (Szelepcsényi 1987, Immerman 1988) The complement of a CSL is a CSL.
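Returning to the leading-zero rule 0 → λ / {#, +, -} __, a minimal Python sketch (the function name is ours) of applying a context-checked deletion to an edge-marked string:

    def delete_leading_zero(s, left_contexts=('#', '+', '-')):
        # Apply 0 -> λ / {#,+,-} __ once to an edge-marked string like '#0070#'.
        for i in range(1, len(s)):
            if s[i] == '0' and s[i - 1] in left_contexts:
                return s[:i] + s[i + 1:]   # left context met: delete this 0
        return s                            # rule does not apply

    s = '#0070#'
    while (t := delete_leading_zero(s)) != s:   # iterate until no change
        s = t
    assert s == '#70#'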
Exercise 2.5 Construct three CSGs that generate the language F of Fibonacci numbers in base one, the language F_2 of Fibonacci numbers in base two, and the language F_10 of Fibonacci numbers in base ten. Solve the membership problem for 117467.

Exercise 2.6 Call a set of natural numbers k-regular if their base k representations are a regular language over the alphabet of k digits. It is easy to see that a 1-regular language is 2-regular (3-regular) and that the converse is not true. Prove that a set that is both 2-regular and 3-regular is also 1-regular.
2.4 Further reading
Given that induction is as old as mathematics itself (the key idea going back at least to Euclid's proof that there are infinitely many primes) and that recursion can be traced back at least to Fibonacci's (1202) Liber Abaci, it is somewhat surprising that the closely related notion of generation is far more recent: the first systematic use is in von Dyck (1882) for free groups. See Chandler and Magnus (1982 Ch. I.7) for some fascinating speculation as to why the notion did not arise earlier within group theory. The kernel membership problem is known as the word problem in this setting (Dehn 1912). The use of freely generated pure formula models in logic was pioneered by Herbrand (1930); Wang tilings were introduced by Wang (1960). Theorem 2.1.1 was proven by Berger (1966), who demonstrated the undecidability by encoding the halting problem in tiles. For a discussion, see Gruska (1997 Sec. 6.4.3). The notion that linguistic structures are noncounting goes back at least to Chomsky (1965:55).

From Pāṇini to the neogrammarians of the 19th century, linguists were generally eager to set up the system so as to cover related styles, dialects, and historical stages of the same language by minor variants of the same theory. In our terms this would mean that e.g. British English and American English or Old English and Modern English would come out as models of a single 'abstract English'. This is one point where current practice (starting with de Saussure) differs markedly from the traditional approach. Since grammars are intended as abstract theories of the native speaker's competence, they cannot rely on data that are not observable by the ordinary language learner. In particular, they are restricted to a single temporal slice, called the synchronic view by de Saussure, as opposed to a view encompassing different historical stages (called the diachronic view). Since the lack of cross-dialectal or historical data is never an impediment in the process of children acquiring their native language (children are capable of constructing their internal grammar without access to such data), by today's standards it would raise serious methodological problems for the grammarian to rely on facts outside the normal range of input available to children. (De Saussure actually arrived at the synchrony/diachrony distinction based on somewhat different considerations.) The neogrammarians amassed a great deal of knowledge about sound change, the historical process whereby words change their pronunciation over the centuries, but some of their main tenets, in particular the exceptionlessness of sound change laws, have been found not to hold universally (see in particular Wang 1969, Wang and Cheng 1977, Labov 1981, 1994).

Abstract string manipulation begins with Thue (1914, reprinted in Nagell 1977), who came to the notion from combinatorial group theory. For Thue, rewriting is symmetrical: if AXB can be rewritten as AYB, the latter can also be rewritten as the former. This is how Harris (1957) defined transformations. The direct precursors of the modern generative grammars and transformations introduced by Chomsky (1956, 1959) are semi-Thue systems, where rewriting need not necessarily work in both directions. The basic facts about regular languages, finite automata, and Kleene's theorem are covered in most textbooks about formal language theory or the foundations of computer science; see e.g. Salomaa (1973) or Gruska (1997).
We will develop the connection between these notions and semigroup theory along the lines of Eilenberg (1974) in Chapter 5. Context-free grammars and languages are also well covered in computer science textbooks such as Gruska (1997); for more details on context-sensitivity, see Section 10 of Salomaa (1973). Theorem 2.3.1 was discovered by McCawley (1968); for a rigorous proof, see Peters and Ritchie (1973), and for a modern discussion, see Oehrle (2000).

Some generalizations of the basic finite state notions that are of particular interest to phonologists, namely regular relations and finite k-automata, will be discussed in Chapter 3. Other generalizations, which are also relevant to syntax, involve weighted (probabilistic) languages, automata, and transducers – these are covered in Sections 5.4 and 5.5. Conspiracies were first pointed out by Kisseberth (1970) – we return to this matter in Section 4.3. The founding papers on categorial grammars are Ajdukiewicz (1935) and Lambek (1958). Unification grammars are discussed in Shieber (1986, 1992).

3 Phonology
The fundamental unit of linguistics is the sign, which, as a first approximation, can be defined as a conventional pairing of sound and meaning. By conventional we mean both that signs are handed down from generation to generation with little modification and that the pairings are almost entirely arbitrary, just as in bridge, where there is no particular reason for a bid of two clubs in response to one no trump to be construed as an inquiry about the partner's major suits. One of the earliest debates in linguistics, dramatized in Plato's Cratylus, concerns the arbitrariness of signs. One school maintained that for every idea there is a true sound that expresses it best, something that makes a great deal of sense for onomatopoeic words (describing e.g. the calls of various animals) but is hard to generalize outside this limited domain. Ultimately the other school prevailed (see Lyons 1968 Sec. 1.2 for a discussion), at least as far as the word-level pairing of sound and meaning is concerned.

It is desirable to build up the theory of sounds without reference to the theory of meanings, both because the set of atomic units of sound promises to be considerably simpler than the set of atomic units of meaning and because sounds as linguistic units appear to possess clear physical correlates (acoustic waveforms; see Chapter 8), while meanings, for the most part, appear to lack any direct physical embodiment. There is at least one standard system of communication, Morse code, that gets by with only two units, dot (short beep) and dash (long beep), or possibly three (if we count pause/silence as a separate unit; see Ex. 7.7). To be sure, Morse code is parasitic on written language, which has a considerably larger alphabet, but the enormous success of the alphabetic mode of writing itself indicates clearly that it is possible to analyze speech sounds into a few dozen atomic units, while efforts to do the same with meaning (such as Wilkins 1668) could never claim similar success.

There is no need to postulate the existence of some alphabetic system for transcribing sounds, let alone a meaning decomposition of some given kind. In Section 3.1 we will start with easily observable entities called utterances, which are defined as maximal pause-free stretches of speech, and describe the concatenative building blocks of sound structure called phonemes. For each natural language L these will act as a convenient set of atomic symbols P_L that can be manipulated by context-sensitive string-rewriting techniques, giving us what is called the segmental phonology of the language. This is not to say that the set of words W_L, viewed as a formal language over P_L, will be context-sensitive (Type 1) in the sense of formal language theory. On the contrary, we have good reasons to believe that W_L is in fact regular (Type 3).

To go beyond segments, in Section 3.2 we introduce some subatomic components called distinctive features and the formal linguistic mechanisms required to handle them. To a limited extent, distinctive features pertaining to tone and stress are already useful in describing the suprasegmental phonology of languages. To get a full understanding of suprasegmentals, in Section 3.3 we introduce multitiered data structures more complex than strings, composed of autosegments. Two generalizations of regular languages motivated by phonological considerations, regular transducers and regular k-languages, are introduced in Section 3.4.
The notions of prosodic hierarchy and optimality, being equally relevant for phonology and morphology, are deferred to Chapter 4.
3.1 Phonemes
We are investigating the very complex interpretation relation that obtains between certain structured kinds of sounds and certain structured kinds of meanings; our eventual goal is to define it in a generative fashion. At the very least, we must have some notion of identity that tells us whether two signs sound the same and/or mean the same. The key idea is that we actually have access to more information, namely, whether two utterances are partially similar in form and/or meaning. To use Bloomfield's original examples:

A needy stranger at the door says I'm hungry. A child who has eaten and merely wants to put off going to bed says I'm hungry. Linguistics considers only those vocal features which are alike in the two utterances ... Similarly, Put the book away and The book is interesting are partly alike (the book).

That the same utterance can carry different meanings at different times is a fact we shall not explore until we introduce disambiguation in Chapter 6 – the only burden we now place on the theory of meanings is that it be capable of (i) distinguishing meaningful from meaningless and (ii) determining whether the meanings of two utterances share some aspect. Our expectations of the observational theory of sound are similarly modest: we assume we are capable of (i′) distinguishing pauses from speech and (ii′) determining whether the sounds of two utterances share some aspect.

We should emphasize at the outset that the theory developed on this basis does not rely on our ability to exercise these capabilities to the extreme. We have not formally defined what constitutes a pause or silence, though it is evident that observationally such phenomena correspond to very low acoustic energy when integrated over a period of noticeable duration, say 20 milliseconds. But it is not necessary to be able to decide whether a 19.2 millisecond stretch that contains exactly 1.001 times the physiological minimum of audible sound energy constitutes a pause or not. If this stretch is indeed a pause, we can always produce another instance, one that will have a significantly larger duration, say 2000 milliseconds, and contain only one-tenth of the previous energy. This will show quite unambiguously that we had two utterances in the first place. If it was not a pause, but rather a functional part of sound formation such as a stop closure, the new 'utterances' with the artificially interposed pause will be deemed ill-formed by native speakers of the language.

Similarly, we need not worry a great deal whether Colorless green ideas sleep furiously is meaningful, or what exactly it means. The techniques described here are robust enough to perform well on the basis of ordinary data without requiring us to make ad hoc decisions in the edge cases. The reason for this robustness comes from the fact that, when viewed as a probabilistic ensemble, the edge cases have very little weight (see Chapter 8 for further discussion).

The domain of the interpretation relation I is the set of forms F, and the codomain is the set of meanings M, so we have I ⊆ F × M. In addition, we have two overlap relations, O_F ⊆ F × F and O_M ⊆ M × M, that determine partial similarity of form and meaning, respectively. O_F is traditionally divided into segmental and suprasegmental overlaps. We will discuss mostly segmental overlap here and defer suprasegmentals such as tone and stress to Section 3.3 and Section 4.1, respectively.
Since speech happens in time, we can define two forms α and β as segmentally overlapping if their temporal supports as intervals on the real line can be made to overlap, as in the 'the book' example above. In the segmental domain, at least, we therefore have a better notion than mere overlap: we have a partial ordering defined by the usual notion of interval containment. In addition to O_F, we will therefore use sub- and superset relations (denoted by ⊆_F, ⊇_F) as well as intersection, union, and complementation operations in the expected fashion, and we have

α ∩_F β ≠ ∅ ⇒ α O_F β        (3.1)
In the domain of I, we find obviously complex forms, such as a full epic poem, and some that are atomic in the sense that

∀x ⊂_F α : x ∉ dom(I)        (3.2)
These are called minimum forms. A form that can stand alone as an utterance is a free form; the rest (e.g. forms like ity or al as in electricity, electrical), which cannot normally appear between pauses, are called bound forms. Typically, utterances are full phrases or sentences, but when circumstances are right, e.g. because a preceding question sets up the appropriate context, forms much smaller than sentences can stand alone as complete utterances. Bloomfield (1926) defines a word as a minimum free form. For example, electrical is a word because it is a free form (it can appear e.g. as an answer to the question What kind of engine is in this car?) and it cannot be decomposed further into free forms (electric would be free, but al is bound). We will have reason to revise this definition in Chapter 4, but for now we can provisionally adopt it here because in defining phonemes it is sufficient to restrict ourselves to free forms. For the rest of this section, we will only consider the set of words W ⊆ F, and we are in the happy position of being able to ignore the meanings of words entirely.
We may know that forms such as city and velocity have nothing in common as far as their meanings are concerned and that we cannot reasonably analyze the latter as containing the former, but we also know that the two rhyme, and as far as their forms are concerned velocity = velo.city. Similarly, velo and kilo share the form lo, so we can isolate ve, ki, and lo as more elementary forms. In general, if pOq, we have a nonempty u such that p = aub and q = cud. Here a, b, c, d, u will be called word fragments obtained from comparing p and q, and we say p is a subword of q, denoted p ⊑ q, if a = b = λ. We denote by W̃ the smallest set containing W and closed under the operation of taking fragments – W̃ contains all and only those fragments that can be obtained from W in finitely many steps.

By successively comparing forms and fragments, we can rapidly extract a set of short fragments P that is sufficiently large for each w ∈ W̃ to be a concatenation of elements of P and sufficiently small that no two elements of it overlap. A phonemic alphabet P is therefore defined by (i) W̃ ⊆ P* and (ii) ∀p, q ∈ P: p O_F q ⇒ p = q. To forestall confusion, we emphasize here that P consists of mental rather than physical units, as should be evident from the fact that the method of obtaining them relies on human oracles rather than on some physical definition of (partial) similarity. The issue of relating these mental units to physical observables will be taken up in Chapters 8 and 9.

We emphasize here that the procedure for finding P does not depend on the existence of an alphabetic writing system. All it requires is an informant (oracle) who can render judgments about partial similarity, and in practice this person can just as well be illiterate. Although the number of unmapped languages is shrinking, to this day the procedure is routinely carried out whenever a new language is encountered. In some sense (to be made more precise in Section 7.3), informant judgments provide more information than is available to the language learner: the linguist's discovery procedure is driven both by positive (grammatical) and negative (ungrammatical) data, while it is generally assumed that infants learning the language only have positive data at their disposal, an assumption made all the more plausible by the wealth of language acquisition research indicating that children ignore explicit corrections offered by adults.

For an arbitrary set W endowed with an arbitrary overlap relation O_F, there is no guarantee that a phonemic alphabet exists; for example, if W is the set of intervals [0, 2^{-n}] with overlap defined in the standard manner, P = {[2^{-(i+1)}, 2^{-i}] | i ≥ 0} will enjoy (ii) but not (i). In actual word inventories W and their extensions W̃, we never see the phenomenon of an infinite descending chain of words or fragments w_1, w_2, ... such that each w_{i+1} is a proper part of w_i, nor can we find a large number (say > 2^8) of words or fragments such that no two of them overlap. We call such statements of contingent facts about the real world postulates to distinguish them from ordinary axioms, which are not generally viewed as subject to falsification.
Postulate 3.1.1 Foundation. Any sequence of words and word fragments w_1, w_2, ... such that each w_{i+1} ⊑ w_i, w_{i+1} ≠ w_i, terminates after a finite number of steps.
Postulate 3.1.2 Dependence. Any set of words or word fragments w_1, w_2, ..., w_m contains two different but overlapping words or fragments for any m > 2^8.

From these two postulates both the existence and the uniqueness of phonemic alphabets follow. Foundation guarantees that every w ∈ W contains at least one atom under ⊑, and dependence guarantees that the set P of atoms is finite. Since different atoms cannot overlap, all that remains to be seen is that every word of W̃ is indeed expressible as a concatenation of atoms. Suppose indirectly that q is a word or fragment that could not be expressed this way: either q itself is atomic or we can find a fragment q_1 in it that is not expressible. Repeating the same procedure for q_1, we obtain q_2, ..., q_n. Because of Postulate 3.1.1, the procedure terminates in an atomic q_n. But by the definition of P, q_n is a member of it, a contradiction that proves the indirect hypothesis false.

Discussion Nearly every communication system that we know of is built on a finite inventory of discrete symbols. There is no law of nature that would forbid a language to use measure predicates such as tall that take different vowel lengths in proportion to the tallness of the object described. In such a hypothetical language, we could say It was taaaaaaall to express the fact that something was seven times as tall as some standard of comparison, and It was taaall to express that it was only three times as tall. The closest thing we find to this is in Arabic/Persian calligraphy, where joining elements are sometimes sized in accordance with the importance of a word, or in Web2.0-style tag clouds, where font size grows with frequency. Yet even though analog signals like these are always available, we find that in actual languages they are used only to convey a discrete set of possible values (see Chapter 9), and no communication system (including calligraphic text and tag clouds) makes their use obligatory.

Postulates 3.1.1 and 3.1.2 go some way toward explaining why discretization of continuous signals must take place. We can speculate that foundation is necessitated by limitations of perception (it is hard to see how a chain could descend below every perceptual threshold), and dependence is caused by limitations of memory (it is hard to see how an infinite number of totally disjoint atomic units could be kept in mind). No matter how valid these explanations turn out to be, the postulates have a clear value in helping us to distinguish linguistic systems from nonlinguistic ones. For example, the dance of bees, where the direction and size of figure-8 movements is directly related to the direction and distance from the hive to where food can be collected (von Frisch 1967), must be deemed nonlinguistic, while the genetic code, where information about the composition of proteins is conveyed by DNA/RNA strings, can at least provisionally be accepted as linguistic.

Following the tradition of Chomsky (1965), memory limitations are often grouped together with mispronunciations, lapses, hesitations, coughing, and other minor errors as performance factors, while more abstract and structural properties are treated as competence factors. Although few doubt that some form of the competence vs. performance distinction is valuable, at least as a means of keeping the noise out of the data, there has been a great deal of debate about where the line between the two should be drawn.
Given the orthodox view that limitations of memory and perception are matters of performance, it is surprising that such a deeply structural property as the existence of phonemic alphabets can be derived from postulates rooted in these limitations.
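The fragment-taking operation behind these postulates can be sketched in code. The following Python toy is entirely our own illustration: it uses longest common substrings as a crude stand-in for the informant's overlap judgments, and closes a small word set under fragment extraction:

    from difflib import SequenceMatcher

    def fragments(p, q):
        # Split p = aub, q = cud on their longest shared part u.
        m = SequenceMatcher(None, p, q).find_longest_match(0, len(p), 0, len(q))
        if m.size == 0:
            return set()
        return {p[:m.a], p[m.a:m.a + m.size], p[m.a + m.size:],
                q[:m.b], q[m.b + m.size:]} - {''}

    def closure(words):
        # W~: the smallest set containing the words and closed under fragments.
        W, changed = set(words), True
        while changed:
            changed = False
            for p in list(W):
                for q in list(W):
                    new = fragments(p, q) - W
                    if new:
                        W |= new
                        changed = True
        return W

    # Recovers ve, ki, lo, city, ... from the running example:
    print(sorted(closure({'velocity', 'city', 'velo', 'kilo'})))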
3.2 Natural classes and distinctive features
Isolating the atomic segmental units is a significant step toward characterizing the phonological system of a language. Using the phonemic alphabet P, we can write every word as a string w ∈ P*, and by adding just one extra symbol # to denote the pause between words, we can write all utterances as strings over P ∪ {#}. Since in actual connected speech pauses between words need not be manifest, we need an interpretative convention that # can be phonetically realized either as silence or as the empty string λ (zero realization). Silence, of course, is distinctly audible and has positive duration (usually 20 milliseconds or longer), while λ cannot be heard and has zero duration.

In fact, similar interpretative conventions are required throughout the alphabet, e.g. to take care of the fact that in English word-initial t is aspirated (released with a puff of air similar in effect to h but much shorter), while in many other positions t is unaspirated (released without an audible puff of air): compare ton to stun. The task of relating the abstract units of the alphabet to their audible manifestations is a complex one, and we defer the details to Chapter 9. We note here that the interpretation process is by no means trivial, and there are many unassailable cases, such as aspirated vs. unaspirated t and silenceful vs. empty #, where we permit two or more alternative realizations for the same segment. (Here and in what follows, we reserve the term segment for alphabetic units, i.e. strings of length one.) Since λ can be one of the alternatives, an interesting technical possibility is to permit cases where it is the only choice, i.e. to declare elements of a phonemic alphabet that never get realized. The use of such abstract or diacritic elements (anubandha) is already pivotal in Pāṇini's system and remains characteristic of phonology to this day. This is our first example of the linguistic distinction between underlying (abstract) and surface (concrete) forms – we will see many others later.

Because in most cases alternative realizations of a symbol are governed by the symbols in its immediate neighborhood, the mathematical tool of choice for dealing with most of segmental phonology is string rewriting by means of context-sensitive rules. Here a word of caution is in order: from the fact that context-sensitive rules are used, it does not follow that the generated stringset over P, or over a larger alphabet Q that includes abstract elements as well, will be context-sensitive. We defer this issue to Section 3.4, and for now emphasize only the convenience of context-sensitive rules, which offer an easy and well-understood mechanism to express the phonological regularities or sound laws that have been discovered over the centuries.

Example 3.2.1 Final devoicing in Russian. The nominative form of Russian nouns can be predicted from their dative form by removing the dative suffix u and inspecting the final consonant: if it is b or p, the final consonant of the nominative form will be p. This could be expressed in a phonological rule of final b devoicing: b → p / __ #.
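A minimal Python sketch of Example 3.2.1, generalized to the other voiced obstruents mentioned in the next paragraph; the transliterated segment inventory is our own simplification, not a full treatment of Russian:

    # Voiced obstruent -> voiceless counterpart (crude transliteration).
    DEVOICE = {'b': 'p', 'd': 't', 'g': 'k', 'z': 's', 'v': 'f'}

    def final_devoicing(word):
        # [+obstruent] -> [-voice] / __ #, applied at the word edge.
        last = word[-1]
        return word[:-1] + DEVOICE.get(last, last)

    assert final_devoicing('gorod') == 'gorot'   # 'city', nominative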
When it is evident that the change is caused by some piece of the environment where the rule applies, we speak of the piece triggering the change; here the trigger is the final #. Remarkably, we find that a similar rule links d to t, g to k, and in fact any voiced obstruent to its voiceless counterpart. The phenomenon that the structural description and/or the structural change in rules extends to some disjunction of segments is extremely pervasive. Those sets of segments that frequently appear together in rules (either as triggers or as undergoers) are called natural classes; for example, the class {p, t, k} of unvoiced stops and the class {b, d, g} of voiced stops are both natural, while the class {p, t, d} is not. Phonologists would be truly astonished to find a language where some rule or regularity affects p, t, and d but no other segment.

The linguist has no control over the phonemic alphabet of a language: P is computed as the result of a specific (oracle-based, but otherwise deterministic) algorithm. Since the set N ⊆ 2^P of natural classes is also externally given by the phonological patterning of the language, over the millennia a great deal of effort has been devoted to the problem of properly characterizing it, both in order to shed some light on the structure of P and to help simplify the statement of rules.

So far, we have treated P as an unordered set of alphabetic symbols. In the Aṣṭādhyāyī, Pāṇini arranges elements of P in a linear sequence (the śivasūtras) with some abstract (phonetically unrealized) symbols (anubandha) interspersed. Simplifying his treatment somewhat (for a fuller discussion, see Staal 1962), natural classes (pratyāhāra) are defined in his 1.1.71 as those subintervals of the śivasūtras that end in some anubandha. If there are k symbols in P, in principle there could be as many as 2^k natural classes. However, the Pāṇinian method will generate at most k(k+1)/2 subintervals (or even fewer, if diacritics are used more sparingly), which is in accordance with the following postulate.

Postulate 3.2.1 In any language, the number of natural classes is small.

We do not exactly spell out what 'small' means here. Certainly it has to be polynomial, rather than exponential, in the size of P. The European tradition reserves names for many important natural classes such as the apicals, aspirates, bilabials, consonants, continuants, dentals, fricatives, glides, labiodentals, linguals, liquids, nasals, obstruents, sibilants, stops, spirants, unaspirates, velars, vowels, etc. – all told, there could be a few hundred, but certainly not a few thousand, such classes. As these names suggest, the reason why a certain class of sounds is natural can often be found in shared aspects of production (e.g. all sounds crucially involving a constriction at the lips are labials, and all sounds involving turbulent airflow are fricatives), but often the justification is far more complex and indirect. In some cases, the matter of whether a particular class is natural is heavily debated. For a particularly hard chestnut, the ruki class, see Section 9.2, Collinge's (1985) discussion of Pedersen's law I, and the references cited therein.

For the mathematician, the first question to ask about the set of natural classes N is neither its size nor its exact membership but rather its algebraic structure: under what operations is N closed?
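The Pāṇinian interval construction is easy to sketch directly. In the following Python toy, the marker sequence is abridged and partly invented rather than the actual śivasūtras; a class is named by a sound followed by an anubandha, so at most k(k+1)/2 classes can arise:

    # Toy sivasutra-like sequence: lowercase letters are sounds,
    # uppercase letters are anubandha (phonetically unrealized markers).
    SEQ = ['a', 'i', 'u', 'N', 'e', 'o', 'C', 'h', 'y', 'v', 'r', 'T']

    def pratyaharas(seq):
        # Natural classes = subintervals starting at a sound and
        # ending at a marker.
        classes = {}
        for i, start in enumerate(seq):
            if start.isupper():
                continue
            for j in range(i + 1, len(seq)):
                if seq[j].isupper():              # interval ends at an anubandha
                    name = start + seq[j]          # e.g. 'aN' = {a, i, u}
                    classes[name] = [s for s in seq[i:j] if s.islower()]
        return classes

    C = pratyaharas(SEQ)
    assert C['aN'] == ['a', 'i', 'u']              # the genuine pratyahara aN
    assert len(C) <= len(SEQ) * (len(SEQ) + 1) // 2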
To the extent that Pāṇini is right, the structure is not fully Boolean: the complement of an interval typically will not be expressible as a single interval, but the intersection of two intervals (pratyāhāra) will again be an interval. We state this as the following postulate.

Postulate 3.2.2 In any language, the set of natural classes is closed under intersection.

This postulate makes N a meet semilattice, and it is clear that the structure is not closed under complementation since single segments are natural classes but their complements are not. The standard way of weakening the Boolean structure is to consider meet semilattices of linear subspaces. We embed P in a hypercube so that natural classes correspond to hyperplanes parallel to the axes. The basis vectors that give rise to the hypercube are called distinctive features and are generally assumed to be binary; a typical example is the voiced/unvoiced distinction that is defined by the presence/absence of periodic vocal fold movements. It is debatable whether the field underlying this vector space construct should be R or GF(2). We take the second option and use GF(2), but we will have reason to return to the notion of real-valued features in Chapters 8 and 9. Thus, we define a feature assignment as an injective mapping C from the set Q of segments into the linear space GF(2,n).

This is a special case of a general situation familiar from universal algebra: if the A_i are algebras of the same signature and A = ∏A_i is their direct product, we say that a subalgebra B of A is a subdirect product of the A_i if all its projections on the components A_i are surjective. A classic theorem of Birkhoff asserts that every algebra can be represented as a subdirect product of subdirectly irreducible algebras. Here the algebras are simply finite sets, and as the only subdirectly irreducible sets have one or two members (and one-member sets obviously cannot contribute to a product), we obtain distinctive feature representations (also called feature decompositions) for any set for free. Since any set, not just phonological segments, could be defined as vectors (also called bundles) of features, to give feature decomposition some content that is specific to phonology we must go a step further and link natural classes to this decomposition. This is achieved by defining as natural classes those sets of segments that can be expressed by fewer features than their individual members (see Halle 1964:328).

To further simplify the use of natural classes, we assume a theory of markedness (Chomsky and Halle 1968 Ch. IX) that supplies those features that are predictable from the values already given (see Section 7.3). For example, high vowels will be written as [+syll, +high], requiring only two features, because the other features that define this class, such as [-low] or [+voice], are predictable from the values already given.

In addition to using pratyāhāra, Pāṇini employs a variety of other devices, most notably the concept of 'homogeneity' (sāvarṇya), as a means of cross-classification (see Cardona 1965). This device enables him to treat quality distinctions in vowels separately from length, nasality, and tone distinctions, as well as to treat place of articulation distinctions in consonants separately from nasality, voicing, and aspiration contrasts. Another subsidiary concept, that of antara 'nearness', is required to handle the details of mappings between natural classes. Since Pāṇinian rules always map classes onto classes, the image of a segment under a rule is decided by P1.1.50
sthāne 'ntaratamaḥ 'in replacement, the nearest'. The modern equivalent of P1.1.50 is the convention that features unchanged by a rule need not be explicitly mentioned, so that the Russian final devoicing rule that we began with may simply be stated as [+obstruent] → [-voice] / __ #.

For very much the same empirical reasons that forced Pāṇini to introduce additional devices like sāvarṇya, the contemporary theory of features also relaxes the requirement of full orthogonality. One place where the standard (Chomsky and Halle 1968) theory of distinctive features shows some signs of strain is the treatment of vowel height. Phonologists and phoneticians are in broad agreement that vowels come in three varieties, high, mid, and low, which form an interval structure: we often have reason to group high and mid vowels together or to group mid and low vowels together, but we never see a reason to group high and low vowels together to the exclusion of mid vowels. The solution adopted in the standard theory is to use two binary features, [±high] and [±low], and to declare the conjunction [+high, +low] ill-formed.

Similar issues arise in many other corners of the system, e.g. in the treatment of place of articulation features. Depending on where the major constriction that determines the type of a consonant occurs, we distinguish several places of articulation, such as bilabial, labiodental, dental, alveolar, postalveolar, retroflex, palatal, velar, pharyngeal, epiglottal, and glottal, moving back from the lips to the glottis inside the vocal tract. No single language has phonemes at every point of articulation, but many show five- or six-way contrasts. For example, Korean distinguishes bilabial, dental, alveolar, velar, and glottal, and the difference is noted in the basic letter shape. Generally, there is more than one consonant per point of articulation; for example, English has the alveolars n, t, d, s, z, l. Consonants sharing the same place of articulation are said to be homorganic, and they form a natural class (as can be seen e.g. from rules of nasal assimilation that replace e.g. input by imput).

Since the major classes (labial, coronal, dorsal, radical, laryngeal) show a five-way contrast, the natural way to deal with the situation would be the use of one GF(5)-valued feature rather than three (or more) underutilized GF(2) values, but for reasons to be discussed presently this is not a very attractive solution. What the system really needs to express is the fact that some features tend to occur together in rules to the exclusion of others, a situation somewhat akin to that observed among the segments. The first idea that leaps to mind would be to utilize the same solution, using features of features (metafeatures) to express natural classes of features. However, the Cartesian product operation that is used in the feature decomposition (subdirect product form) of P is associative, and therefore it makes no difference whether we perform the feature decomposition twice in a metafeature setup or just once at the segment level. Also, the inherent ordering of places of articulation (for consonants) or height (for vowels) is very hard to convey by features, be they 2-valued or n-valued, without recourse to arithmetic notions, something we would very much like to avoid as it would make the system overly expressive.
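To see the hyperplane view of natural classes in code, here is a hedged Python sketch (the segment inventory and feature choices are our own toy, not a standard feature chart): a natural class is the set of segments matching a partial feature specification, and conjoining two specifications intersects the classes, exactly the meet-semilattice behavior of Postulate 3.2.2:

    # Toy GF(2) feature assignment for a few segments.
    FEATURES = {        # (voice, nasal, labial)
        'p': (0, 0, 1),
        'b': (1, 0, 1),
        'm': (1, 1, 1),
        't': (0, 0, 0),
        'd': (1, 0, 0),
        'n': (1, 1, 0),
    }
    AXES = {'voice': 0, 'nasal': 1, 'labial': 2}

    def natural_class(**spec):
        # Segments lying on the axis-parallel hyperplanes the spec fixes.
        return {s for s, v in FEATURES.items()
                if all(v[AXES[f]] == val for f, val in spec.items())}

    # Conjoining specifications = intersecting the classes:
    assert natural_class(voice=1) & natural_class(labial=1) == \
           natural_class(voice=1, labial=1) == {'b', 'm'}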
The solution now widely accepted in phonology (Clements 1985, McCarthy 1988) is to arrange the features in a tree structure, using intermediate class nodes.
[Fig. 3.1 (feature geometry tree): the class node SUPRALARYNGEAL dominates the features nasal, cont, lateral, and approx together with the class node PLACE; PLACE dominates the articulator nodes LABIAL, CORONAL, DORSAL, and RAD, which in turn dominate terminal features including distr, round, anterior, back, high, and low.]

Fig. 3.1. Feature geometry tree.

Rules that required the special principle of sāvarṇya can be stated using the supralaryngeal class node to express the grouping together of some features to the exclusion of others (see Fig. 3.1). This solution, now permanently (mis)named feature geometry, is in fact a generalization of both the pratyāhāra and the standard feature decomposition methods. The linear intervals of the Pāṇinian model are replaced by generalized (lattice-theoretic) intervals in the subsumption lattice of the tree, and the Cartesian product appearing in the feature decomposition corresponds to the special case where the feature geometry tree is a star (one distinguished root node, all other nodes being leaves). A small computational sketch of this subsumption structure is given at the end of the section.

Discussion The segmental inventories P developed in Section 3.1 are clearly different from language to language. As far as natural classes and feature decomposition are concerned, many phonologists look for a single universal inventory of features arranged in a universally fixed geometry such as the one depicted in Fig. 3.1. Since the cross-linguistic identity of features such as [nasal] is anchored in their phonetic (acoustic and articulatory) properties rather than in some combinatorial subtleties of their intralanguage phonological patterning, this search can lead to a single object, unique up to isomorphism, that will, much like Mendeleyev's periodic table, encode a large number of regularities in a compact format.

Among other useful distinctions, Chomsky and Halle (1968) introduce the notion of formal vs. substantive universals. Using this terminology, meet semilattices are a formal, and a unique feature geometry tree such as the one in Fig. 3.1 would be a substantive, universal. To the extent that phonological research succeeds in identifying a unique feature geometry, every framework, such as semilattices, that permits a variety of geometries overgenerates. That said, any theory is interesting to people other than its immediate developers only to the extent that it can be generalized to problems other than the one it was originally intended to solve. Phonology, construed broadly as an abstract theory of linguistic form, applies not only to speech but to other forms of communication (handwritten, printed, signed, etc.) as well. In fact, phonemes, distinctive features, and feature geometry are widely used in the study of sign language (see e.g. Sandler 1989, Liddell and Johnson 1989); where substantive notions like nasality may lose their grip, the formal theory remains valuable. (However, as the abstract theory is rooted in the study of sound, we will keep on talking about 'utterances', 'phonemes', 'syllables', etc., rather than using 'gestures', 'graphemes', or other narrow terms.)

Exercise 3.2 What are the phonemes in the genetic code? How would you define feature decomposition and feature geometry there?
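As promised, here is a small computational sketch of the subsumption structure of a feature geometry tree. The tree shape is abbreviated from Fig. 3.1, and the dict encoding and helper names are entirely ours; it recovers the natural class of terminal features that a class node groups together:

    # Feature geometry as a nested dict: keys with nonempty subtrees are
    # class nodes, keys with empty subtrees are terminal features.
    GEOMETRY = {
        'SUPRALARYNGEAL': {
            'nasal': {}, 'cont': {}, 'lateral': {}, 'approx': {},
            'PLACE': {
                'LABIAL':  {'round': {}},
                'CORONAL': {'distr': {}, 'anterior': {}},
                'DORSAL':  {'back': {}, 'high': {}, 'low': {}},
            },
        },
    }

    def leaves(tree):
        """Terminal features dominated by (the root of) a subtree."""
        out = set()
        for key, sub in tree.items():
            out |= {key} if not sub else leaves(sub)
        return out

    def dominated_features(tree, node):
        """Find the class node anywhere in the tree; return its leaf set."""
        for key, sub in tree.items():
            if key == node:
                return leaves(sub)
            found = dominated_features(sub, node)
            if found is not None:
                return found
        return None

    print(dominated_features(GEOMETRY, 'PLACE'))
    # {'round', 'distr', 'anterior', 'back', 'high', 'low'} (order may vary)

In this encoding a natural class of features is simply the leaf set of a subtree, i.e. a generalized interval of the kind just discussed.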
3.3 Suprasegmentals and autosegments
In Section 3.1 we noted that words can be partially alike even when they do not share any segments. For example, blackbird and whitefish share the property that they have a single stressed syllable, a property that was used by Bloomfield (1926) to distinguish them from multiword phrases such as black bird or white fish, which will often be pronounced without an intervening pause but never without both syllables stressed. In addition to stress, there are other suprasegmentals, such as tone, that appear to be capable of holding constant over multisegment stretches of speech, typically over syllables.

Traditionally, the theory of suprasegmentals has been considered harder than that of segmental phenomena for the following reasons. First, their physical correlates are more elusive: stress is related to amplitude, and tone to frequency, but the relationship is quite indirect (see Lehiste 1970). Second, informant judgments are harder to elicit: native speakers of a language often find it much harder to judge e.g. whether two syllables carry the same degree of stress than to judge whether they contain the same vowel. Finally, until recently, a notation as transparent as the alphabetic notation for phonemes was lacking. In this section, we will deal mainly with tone and tone-like features of speech, leaving the discussion of stress and prosody to Section 4.1.

Starting in the 1970s, phonological theory abandoned the standard string-based theory and notation in favor of a generalization called autosegmental theory. Autosegmental theory (so named because it encompasses not only suprasegmental but also subsegmental aspects of sound structure) generalizes the method of using a string over some alphabet P to k-tuples of strings connected by association relations that spell out which segments in the two strings are overlapping in time. We will begin with the simplest case, that of bistrings composed of two strings and an association relation. First, the strings are placed on tiers, which are very much like
Turing-machine tapes, except that the number of blank squares between nonblank ones cannot be counted.

Definition 3.3.1 A tier is an ordered pair (Z, N) where Z is the set of integers equipped with the standard identity and ordering relations '=' and '<', and N is the name of the tier.

The original example motivating the use of separate tiers was tone, which is phonologically distinctive in many languages. Perhaps the best-known example of this is Mandarin Chinese, where the same syllable ma means 'mother', 'hemp', 'horse', 'admonish', and 'wh (question particle)', depending on whether it is uttered with the first, second, third, fourth, or fifth tone. As the tonology of Chinese is rather complex (see e.g. Yip 2002), we begin with examples from lesser-known but simpler tonal systems, primarily from the Niger-Congo family, where two contrastive level tones called high and low (abbreviated H and L, and displayed over vowels as acute or grave accents; e.g. má, mà) are typical, three level tones (high, mid, low; H, M, L) are frequent, four levels (1, 2, 3, 4) are infrequent, and five levels are so rare that their analysis in terms of five distinct levels is generally questionable.

In the study of such systems, several salient (not entirely exceptionless, but nevertheless widespread) generalizations emerge. First, a single syllable may carry not just a single tone but also sequences of multiple tones, with HL realized as falling and LH as rising tones. Such sequences, known as contour tones, are easily marked by combining the acute (high) and grave (low) accent marks over the tone-bearing vowel (e.g. mâ for falling, mǎ for rising tone), but as the sequences get more complex, this notation becomes cumbersome (and the typographical difficulty greatly increases in cases where accent marks such as umlaut are also used to distinguish vowels such as u and ü, o and ö). Second, sequences of multiple tones show a remarkable degree of stability in that deletion of the tone-bearing vowel need not be accompanied by deletion of the accompanying tone(s) – this 'autonomy' of tones motivates the name autosegmental. Third, processes like assimilation do not treat contour tones as units; rather, the last level tone of a contour sequence continues, e.g. mǎ+ta does not become mǎtǎ but rather mǎtá. As an example of stability, consider the following example in Lomongo [LOL], where phrase-level rules turn bàlóngó bǎkáé 'his book' into bàlóngãkáé: the H on the deleted o survives and attaches to the beginning of the LH contour of the first a of (b)ǎkáé. In the following autosegmental diagram, we segregate the segmental content from the tonal content by placing them on separate tiers:
    Segmental tier:   ba   long   (o)   (b)a   ka   e
    Tonal tier:        L    H      H    L  H    H    H

Here ba is linked to L, long to H, and ka and e each to H; the H originally linked to the deleted o reassociates to the following a, which is also linked to its own L and H and thus comes to bear the HLH contour. (Parenthesized material is deleted.)
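The diagram can be rendered directly as data. In the following sketch (the encoding and names are our own, with syllable-sized units for readability, and anticipating the association relations defined below) a bistring is a pair of tiers plus a set of position pairs; deleting a segment removes its association lines but leaves its tones floating, free to reattach:

    segments = ['ba', 'long', 'o', 'ba', 'ka', 'e']   # indices 0..5
    tones    = ['L', 'H', 'H', 'L', 'H', 'H', 'H']    # indices 0..6
    assoc    = {(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 5), (5, 6)}

    def delete_segment(i, assoc):
        """Deleting segment i removes its association lines only; the
        tones it carried are left floating, not deleted (tone stability)."""
        return {(s, t) for (s, t) in assoc if s != i}

    assoc = delete_segment(2, assoc)                 # the o of balongo goes
    floating = set(range(len(tones))) - {t for (_, t) in assoc}
    print(sorted(floating))                          # [2]: the o's H floats
    assoc.add((3, 2))                                # ...and reattaches to a
    print(sorted(t for (s, t) in assoc if s == 3))   # [2, 3, 4]: a bears HLH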
Each tier N has its own tier alphabet T_N, and we can assume without loss of generality that the alphabets of different tiers are disjoint except for a distinguished blank symbol G (purposely kept distinct from the pause symbol #) that is adjoined to every tier alphabet. Two tiers bearing identical names can only be distinguished by inspecting their contents. We define a tier containing a string t_0 t_1 ... t_n starting at position k by a mapping that maps k on t_0, k+1 on t_1, ..., k+n on t_n, and everything else on G. Abstracting away from the starting position, we have the following definition.

Definition 3.3.2 A tier N containing a string t_0 t_1 ... t_n over the alphabet T_N ∪ {G} is defined as the class of mappings F_k that take k+i into t_i for 0 ≤ i ≤ n and into G if i is outside this range. Unless noted otherwise, this class will be represented by the mapping F_0.

Strings containing any number of successive G symbols are treated as equivalent to those strings that contain only a single G at the same position. G-free strings on a given tier are called melodies. Between strings on the same tier and within the individual strings, temporal ordering is encoded by their usual left-to-right ordering. The temporal ordering of strings on different tiers is encoded by association relations.

Definition 3.3.3 An association relation between two tiers N and M containing the strings n = n_0 n_1 ... n_k and m = m_0 m_1 ... m_l is a subset of {0, 1, ..., k} × {0, 1, ..., l}. An element that is not in the domain or range of the association relation is called floating.

Note that the association relation, being an abstract pattern of synchrony between the tiers, is one step removed from the content of the tiers: association is defined on the domain of the representative mappings, while content also involves their range. By Definition 3.3.3, there are 2^{kl} association relations possible between two strings of length k and l. Of these relations, the no crossing constraint (NCC; see Goldsmith 1976) rules out as ill-formed all relations that contain pairs (i, v) and (j, u) such that 0 ≤ i < j ≤ k and 0 ≤ u < v ≤ l.
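In the set-of-pairs encoding used above, the NCC becomes a simple predicate; here is a sketch (function names ours). Two association lines cross exactly when their endpoints are ordered oppositely on the two tiers, which the sign test captures:

    from itertools import combinations

    def crossing(p, q):
        """Lines (i, v) and (j, u) cross iff i, j and v, u are ordered
        oppositely, i.e. the differences have opposite signs."""
        (i, v), (j, u) = p, q
        return (i - j) * (v - u) < 0

    def satisfies_ncc(assoc):
        """True iff no two lines in the association relation cross."""
        return not any(crossing(p, q) for p, q in combinations(assoc, 2))

    print(satisfies_ncc({(0, 0), (1, 1), (1, 2)}))  # True: sharing is fine
    print(satisfies_ncc({(0, 1), (1, 0)}))          # False: the lines cross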