10. Pitch and Voicing Determination of Speech with an Extension Toward Music Signals Pitch Andw
Total Page:16
File Type:pdf, Size:1020Kb
181 10. Pitch and Voicing Determination of Speech with an Extension Toward Music Signals Pitch andW. J. Hess Vo 10.2.3 Frequency-Domain Methods: This chapter reviews selected methods for pitch Harmonic Analysis ........................ 188 determination of speech and music signals. As 10.2.4 Active Modeling ........................... 190 both these signals are time variant we first define 10.2.5 Least Squares what is subsumed under the term pitch.Thenwe and Other Statistical Methods ........ 191 subdivide pitch determination algorithms (PDAs) 10.2.6 Concluding Remarks ..................... 192 into short-term analysis algorithms, which apply some spectral transform and derive pitch from 10.3 Selected Time-Domain Methods............. 192 a frequency or lag domain representation, and 10.3.1 Temporal Structure Investigation.... 192 time-domain algorithms, which analyze the signal 10.3.2 Fundamental Harmonic Processing. 193 directly and apply structural analysis or determine 10.3.3 Temporal Structure Simplification... 193 10.3.4 Cascaded Solutions 195 individual periods from the first partial or compute ....................... the instant of glottal closure in speech. In the 1970s, 10.4 A Short Look into Voicing Determination. 195 when many of these algorithms were developed, 10.4.1 Simultaneous Pitch the main application in speech technology was the and Voicing Determination............ 196 vocoder, whereas nowadays prosody recognition in 10.4.2 Pattern-Recognition VDAs ............. 197 speech understanding systems and high-accuracy 10.5 Evaluation and Postprocessing .............. 197 pitch period determination for speech synthesis 10.5.1 Developing Reference PDAs corpora are emphasized. In musical acoustics, pitch with Instrumental Help................. 197 determination is applied in melody recognition or 10.5.2 Error Analysis............................... 198 Part B automatic musical transcription, where we also 10.5.3 Evaluation of PDAs have the problem that several pitches can exist and VDAs– Some Results ............... 200 simultaneously. 10.5.4 Postprocessing and Pitch Tracking .. 201 10 10.6 Applications in Speech and Music........... 201 10.1 Pitch in Time-Variant Quasiperiodic 10.7 Some New Challenges and Developments 203 Acoustic Signals.................................... 182 10.7.1 Detecting the Instant 10.1.1 Basic Definitions .......................... 182 of Glottal Closure ......................... 203 10.1.2 Why is the Problem Difficult? ......... 184 10.7.2 Multiple Pitch Determination......... 204 10.1.3 Categorizing the Methods.............. 185 10.7.3 Instantaneousness 10.2 Short-Term Analysis PDAs ...................... 185 Versus Reliability.......................... 206 10.2.1 Correlation and Distance Function 185 .. 10.8 Concluding Remarks ............................. 207 10.2.2 Cepstrum and Other Double-Transform Methods ........... 187 References .................................................. 208 Pitch and voicing determination of speech signals are the presence of a voiced excitation and the presence of the two subproblems of voice source analysis. In voiced a voiceless excitation, a problem we will refer to as voic- speech, the vocal cords vibrate in a quasiperiodic way. ing determination and, for the segments of the speech Speech segments with voiceless excitation are gener- signal in which a voiced excitation is present, the rate of ated by turbulent air flow at a constriction or by the vocal cord vibration, which is usually referred to as pitch release of a closure in the vocal tract. The parameters determination or fundamental frequency determination we have to determine are the manner of excitation, i. e., in the literature. 182 Part B Signal Processing for Speech Unlike the analysis of vocal-tract parameters, where statistical methods, particularly in connection with si- a number of independent and equivalent representations nusoidal models [10.2], entered the domain. A number are possible, there is no alternative to the parameters of known methods were improved and refined, whereas pitch and voicing, and the quality of a synthesized sig- other solutions that required an amount of computational nal critically depends on their reliable and accurate effort that appeared prohibitive at the time the algorithm determination. This chapter presents a selection of the was first developed were revived. With the widespread methods applied in pitch and voicing determination. The use of databases containing many labeled and processed emphasis, however, is on pitch determination. speech data, it has nowadays also become possible to Over the last two decades, the task of fundamen- thoroughly evaluate the performance of the algorithms. tal frequency determination has become increasingly The bibliography in [10.1], dating from 1983, in- popular in musical acoustics as well. In the begin- cludes about 2000 entries. To give a complete overview ning the methodology was largely imported from the of the more-recent developments, at least another 1000 speech community, but then the musical acoustics com- bibliographic entries would have to be added. It goes munity developed algorithms and applications of their without saying that this is not possible here given the own, which in turn became increasingly interesting to space limitations. So we will necessarily have to present the speech communication area. Hence it appears jus- a selection, and many important contributions cannot be tified to include the aspect of fundamental frequency described. determination of music signals and to present some of The remainder of this chapter is organized as fol- the methods and specific problems of this area. One lows. In Sect. 10.1 the problems of pitch and voicing specific problem is multipitch determination from poly- determination are described, definitions of what is sub- phonic signals, a problem that might also occur in speech sumed under the term pitch are given, and the various when we have to separate two or more simultaneously PDAs are grossly categorized. Sections 10.2 and 10.3 speaking voices. give a more-detailed description of selected PDAs. Pitch determination has a rather long history which Section 10.4 shortly reviews a selection of voicing de- goes back even beyond the times of vocoding. Liter- termination algorithms (VDAs); Sect. 10.5 deals with Part B ally hundreds of pitch determination algorithms (PDAs) questions of error analysis and evaluation. Selected ap- have been developed. The most important developments plications are discussed in Sect. 10.6, and Sect. 10.7 leading to today’s state of the art were made in the finally presents a couple of new developments, such 10.1 1960s and 1970s; most of the methods that are briefly as determining the instant of glottal closure or process- reviewed in this chapter were extensively discussed dur- ing signals that contain more than one pitch, such as ing this period [10.1]. Since then, least-squares and other polyphonic music. 10.1 Pitch in Time-Variant Quasiperiodic Acoustic Signals 10.1.1 Basic Definitions a look at the parameter pitch and provide a clear defi- nition of what should be measured and what is actually Pitch, i. e., the fundamental frequency F0 and fundamen- measured. tal period T0 of a (quasi)periodic signal, can be measured A word on terminology first. There are three points in many ways. If a signal is completely stationary and of view for looking at such a problem of acoustic signal periodic, all these strategies – provided they operate cor- processing [10.3]: the production,thesignal-processing, rectly – lead to identical results. Since both speech and and the perception points of view. For pitch deter- musical signals, however, are nonstationary and time mination of speech, the production point of view is variant, aspects of each strategy such as the starting obviously oriented toward phonation in the human lar- point of the measurement, the length of the measuring ynx; we will thus have to start from a time-domain interval, the way of averaging (if any), or the operating representation of the waveform as a train of laryn- domain (time, frequency, lag etc.) start to influence the geal pulses. If a pitch determination algorithm (PDA) results and may lead to estimates that differ from algo- works in a speech-production oriented way, it measures rithm to algorithm even if all these results are correct individual laryngeal excitation cycles or, if some aver- and accurate. Before entering a discussion on individ- aging is performed, the rate of vocal-fold vibration.The ual methods and applications, we must therefore have signal-processing point of view, which can be applied Pitch and Voicing Determination of Speech 10.1 Pitch in Time-Variant Quasiperiodic Acoustic Signals 183 2. T is defined as the elapsed time between two a) T [(Def. (2)] 0 0 successive laryngeal pulses. Measurement starts at an arbitrary point within an excitation cycle. The choice of this point depends on the individual t method, but for a given PDA it is always located at T0 [(Def. (3)] the same position within the cycle. b) T0 [(Def. (1)] Time-domain PDAs usually follow this definition. The reference point can be a significant extreme, a certain zero crossing, etc. The signal is tracked t period by period in a synchronous way yielding individual pitch periods. This principle can be ap- Fig. 10.1a,b Time-domain definitions of T0. (a) Speech plied to both speech and music signals. In speech it signal (a couple of periods), (b) glottal waveform (recon- may even be possible to derive the point of glottal structed). For further detail, see the text closure from the reference point when the signal is undistorted. to any acoustic signal, means that (quasi)periodicity 3. T0 is defined as the elapsed time between two suc- or at least cyclic behavior is observed, and that the cessive excitation cycles. Measurement starts at an task is to extract those features that best represent arbitrary instant which is fixed according to exter- this periodicity. The pertinent terms are fundamental nal conditions, and ends when a complete cycle has frequency and fundamental period. If individual cy- elapsed. cles are determined, we may (somewhat inconsistently) This is an incremental definition.