Introduction to Digital Speech Processing Full Text Available At
Total Page:16
File Type:pdf, Size:1020Kb
Full text available at: http://dx.doi.org/10.1561/2000000001 Introduction to Digital Speech Processing Full text available at: http://dx.doi.org/10.1561/2000000001 Introduction to Digital Speech Processing Lawrence R. Rabiner Rutgers University and University of California Santa Barbara USA [email protected] Ronald W. Schafer Hewlett-Packard Laboratories Palo Alto, CA USA Boston – Delft Full text available at: http://dx.doi.org/10.1561/2000000001 Foundations and Trends R in Signal Processing Published, sold and distributed by: now Publishers Inc. PO Box 1024 Hanover, MA 02339 USA Tel. +1-781-985-4510 www.nowpublishers.com [email protected] Outside North America: now Publishers Inc. PO Box 179 2600 AD Delft The Netherlands Tel. +31-6-51115274 The preferred citation for this publication is L. R. Rabiner and R. W. Schafer, Intro- duction to Digital Speech Processing, Foundations and Trends R in Signal Process- ing, vol 1, no 1–2, pp 1–194, 2007 ISBN: 978-1-60198-070-0 c 2007 L. R. Rabiner and R. W. Schafer All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers. Photocopying. In the USA: This journal is registered at the Copyright Clearance Cen- ter, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The ‘services’ for users can be found on the internet at: www.copyright.com For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copy- ing, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to pho- tocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected] now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: [email protected] Full text available at: http://dx.doi.org/10.1561/2000000001 Foundations and Trends R in Signal Processing Volume 1 Issue 1–2, 2007 Editorial Board Editor-in-Chief: Robert M. Gray Dept of Electrical Engineering Stanford University 350 Serra Mall Stanford, CA 94305 USA [email protected] Editors Abeer Alwan (UCLA) Jelena Kovacevic (CMU) John Apostolopoulos (HP Labs) B.S. Manjunath (UCSB) Pamela Cosman (UCSD) Urbashi Mitra (USC) Michelle Effros (California Institute Thrasos Pappas (Northwestern of Technology) University) Yonina Eldar (Technion) Mihaela van der Shaar (UCLA) Yariv Ephraim (George Mason Luis Torres (Technical University University) of Catalonia) Sadaoki Furui (Tokyo Institute Michael Unser (EPFL) of Technology) P.P. Vaidyanathan (California Vivek Goyal (MIT) Institute of Technology) Sinan Gunturk (Courant Institute) Rabab Ward (University Christine Guillemot (IRISA) of British Columbia) Sheila Hemami (Cornell) Susie Wee (HP Labs) Lina Karam (Arizona State Clifford J. Weinstein (MIT Lincoln University) Laboratories) Nick Kingsbury (Cambridge Min Wu (University of Maryland) University) Josiane Zerubia (INRIA) Alex Kot (Nanyang Technical University) Full text available at: http://dx.doi.org/10.1561/2000000001 Editorial Scope Foundations and Trends R in Signal Processing will publish sur- vey and tutorial articles on the foundations, algorithms, methods, and applications of signal processing including the following topics: • Adaptive signal processing • Signal processing for • Audio signal processing communications • Biological and biomedical signal • Signal processing for security and processing forensic analysis, biometric signal processing • Complexity in signal processing • Signal quantization, sampling, • Digital and multirate signal analog-to-digital conversion, processing coding and compression • Distributed and network signal • Signal reconstruction, processing digital-to-analog conversion, • Image and video processing enhancement, decoding and • Linear and nonlinear filtering inverse problems • Multidimensional signal processing • Speech/audio/image/video • Multimodal signal processing compression • Multiresolution signal processing • Speech and spoken language processing • Nonlinear signal processing • Statistical/machine learning • Randomized algorithms in signal processing • Statistical signal processing • Sensor and multiple source signal • Classification and detection processing, source separation • Estimation and regression • Signal decompositions, subband • Tree-structured methods and transform methods, sparse representations Information for Librarians Foundations and Trends R in Signal Processing, 2007, Volume 1, 4 issues. ISSN paper version 1932-8346. ISSN online version 1932-8354. Also available as a combined paper and online subscription. Full text available at: http://dx.doi.org/10.1561/2000000001 Foundations and Trends R in Signal Processing Vol. 1, Nos. 1–2 (2007) 1–194 c 2007 L. R. Rabiner and R. W. Schafer DOI: 10.1561/2000000001 Introduction to Digital Speech Processing Lawrence R. Rabiner1 and Ronald W. Schafer2 1 Rutgers University and University of California, Santa Barbara, USA, [email protected] 2 Hewlett-Packard Laboratories, Palo Alto, CA, USA Abstract Since even before the time of Alexander Graham Bell’s revolution- ary invention, engineers and scientists have studied the phenomenon of speech communication with an eye on creating more efficient and effective systems of human-to-human and human-to-machine communi- cation. Starting in the 1960s, digital signal processing (DSP), assumed a central role in speech studies, and today DSP is the key to realizing the fruits of the knowledge that has been gained through decades of research. Concomitant advances in integrated circuit technology and computer architecture have aligned to create a technological environ- ment with virtually limitless opportunities for innovation in speech communication applications. In this text, we highlight the central role of DSP techniques in modern speech communication research and appli- cations. We present a comprehensive overview of digital speech process- ing that ranges from the basic nature of the speech signal, through a variety of methods of representing speech in digital form, to applica- tions in voice communication and automatic synthesis and recognition of speech. The breadth of this subject does not allow us to discuss any Full text available at: http://dx.doi.org/10.1561/2000000001 aspect of speech processing to great depth; hence our goal is to pro- vide a useful introduction to the wide range of important concepts that comprise the field of digital speech processing. A more comprehensive treatment will appear in the forthcoming book, Theory and Application of Digital Speech Processing [101]. Full text available at: http://dx.doi.org/10.1561/2000000001 Contents 1 Introduction 1 1.1 The Speech Chain 2 1.2 Applications of Digital Speech Processing 7 1.3 Our Goal for this Text 14 2 The Speech Signal 17 2.1 Phonetic Representation of Speech 17 2.2 Models for Speech Production 19 2.3 More Refined Models 23 3 Hearing and Auditory Perception 25 3.1 The Human Ear 25 3.2 Perception of Loudness 27 3.3 Critical Bands 28 3.4 Pitch Perception 29 3.5 Auditory Masking 31 3.6 Complete Model of Auditory Processing 32 4 Short-Time Analysis of Speech 33 4.1 Short-Time Energy and Zero-Crossing Rate 37 4.2 Short-Time Autocorrelation Function (STACF) 40 4.3 Short-Time Fourier Transform (STFT) 42 4.4 Sampling the STFT in Time and Frequency 44 ix Full text available at: http://dx.doi.org/10.1561/2000000001 4.5 The Speech Spectrogram 46 4.6 Relation of STFT to STACF 49 4.7 Short-Time Fourier Synthesis 51 4.8 Short-Time Analysis is Fundamental to our Thinking 53 5 Homomorphic Speech Analysis 55 5.1 Definition of the Cepstrum and Complex Cepstrum 55 5.2 The Short-Time Cepstrum 58 5.3 Computation of the Cepstrum 58 5.4 Short-Time Homomorphic Filtering of Speech 63 5.5 Application to Pitch Detection 65 5.6 Applications to Pattern Recognition 67 5.7 The Role of the Cepstrum 72 6 Linear Predictive Analysis 75 6.1 Linear Prediction and the Speech Model 75 6.2 Computing the Prediction Coefficients 79 6.3 The Levinson–Durbin Recursion 84 6.4 LPC Spectrum 87 6.5 Equivalent Representations 91 6.6 The Role of Linear Prediction 96 7 Digital Speech Coding 97 7.1 Sampling and Quantization of Speech (PCM) 97 7.2 Digital Speech Coding 105 7.3 Closed-Loop Coders 108 7.4 Open-Loop Coders 127 7.5 Frequency-Domain Coders 134 7.6 Evaluation of Coders 136 8 Text-to-Speech Synthesis Methods 139 8.1 Text Analysis 140 8.2 Evolution of Speech Synthesis Systems 145 Full text available at: http://dx.doi.org/10.1561/2000000001 8.3 Unit Selection Methods 152 8.4 TTS Applications 159 8.5 TTS Future Needs 160 9 Automatic Speech Recognition (ASR) 163 9.1 The Problem of Automatic Speech Recognition 163 9.2 Building a Speech Recognition System 165 9.3 The Decision Processes in ASR 168 9.4 Representative Recognition Performance 181 9.5 Challenges in ASR Technology 183 Conclusion 185 Acknowledgments 187 References 189 Supplemental References 197 Full text available at: http://dx.doi.org/10.1561/2000000001 1 Introduction The fundamental purpose of speech is communication, i.e., the trans- mission of messages. According to Shannon’s information theory [116], a message represented as a sequence of discrete symbols can be quanti- fied by its information content in bits, and the rate of transmission of information is measured in bits/second (bps).