Pitch Contour Stylization by Marking Voice Intonation
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 3, 2021, www.ijacsa.thesai.org

Sakshi Pandey, Amit Banerjee, Subramaniam Khedika
Computer Science Department, South Asian University, New Delhi, India

Abstract: The stylization of pitch contour is a primary task in speech prosody for the development of a linguistic model. It is performed either by statistical learning or by statistical analysis. Recent statistical learning models require a large amount of training data and rely on complex machine learning algorithms. In contrast, statistical analysis methods perform stylization based on the shape of the contour and require further processing to capture the voice intonations of the speaker. The objective of this paper is to devise a low-complexity transcription algorithm for the stylization of pitch contour based on the voice intonation of a speaker. For this, we propose using pitch marks as the subset of points for the stylization of the pitch contour. Pitch marks are the instants of glottal closure in a speech waveform, and they capture characteristics of the speech uttered by a speaker. The selected subset can interpolate the shape of the pitch contour and acts as a template that captures the intonation of a speaker's voice, which can be used for designing applications in speech synthesis and speech morphing. The algorithm balances the quality of the stylized curve against its cost in terms of the number of data points used. We evaluate the performance of the proposed algorithm using the mean square error and the number of lines used for fitting the pitch contour. Furthermore, we compare it with other existing stylization algorithms using the LibriSpeech ASR corpus.

Keywords: Pitch contour; pitch marking; linear stylization; straight-line approximation

I. INTRODUCTION

Speech prosody represents the pitch contour of a voice signal and can be used for the construction of linguistic models and their interaction with other linguistic domains, such as morphing and speech transformation [1]. In addition, pitch contours are used for learning generative models for text-to-speech synthesis applications [2], language identification [3], emotion prediction, and forensic research [4]. Researchers have also used the pitch and intensity of sound for predicting the mood of a speaker [5]. To remove the variability in the pitch contour, stylization is used to encode the contour into meaningful labels [6] or templates [7] for speech synthesis applications. According to [8], stylization is the process of representing the pitch contour of an audio signal with a minimum number of line segments, such that the original pitch contour is auditorily indistinguishable from the re-synthesized pitch contour.

Broadly, the stylization of pitch contour uses either statistical learning or statistical analysis models. In statistical analysis models, the pitch contour is decomposed into a set of previously defined functions such as polynomials [9], [10], parabolas [11], and B-splines [12]. In addition, low-pass filtering is used for preserving the slow time variations in the pitch contours [6]. Recently, researchers have studied statistical learning models, using hierarchically structured deep neural networks for modeling the F0 trajectories [13] and a sparse coding algorithm based on deep learning auto-encoders [14]. In general, the statistical learning models require a large amount of data and use complex machine learning algorithms for training [13], [14]. On the other hand, the statistical analysis models decompose the pitch contour into a set of functions based on the shape and structure of the contour, and require further processing to capture the voice intonations of the speaker [9]–[12], [15]. Table I summarizes the algorithms proposed for the stylization of pitch contour. Many successful speech applications use piecewise stylization of the pitch, including the study of sentence boundaries [16], disfluency [17], dialogue acts [18], and speaker verification [19].

In this paper, we use statistical analysis for piecewise decomposition of the pitch contour, using the instants of glottal closure (pitch marks) to stylize the pitch contour as well as capture the intonation of the speaker's voice. As mentioned above, previous works based on the statistical analysis approach [6], [9]–[12] mainly consider the shape and structure of the contour for stylization. For example, [12] uses best-fit B-splines to define the segments of a pitch contour, and [11] uses parabolic functions to approximate the pitch contour. In contrast to these approaches, in this paper we model the instants of glottal closure (pitch marks) of the source speaker. An advantage of the proposed approach is that the pitch marks can be used directly as templates for speech synthesis or speech morphing, making the approach suitable for various real-time applications.

The piecewise stylization approximates the pitch contour using $K$ subset points. That is, if we let $\{y_n\}_{n=1}^{N}$ be the pitch at each instant of time in a speech signal, then the piecewise stylization can be defined using function (1), where $g(y)$ is the stylized pitch, $a_i$ and $b_i$ are the slope and $y$-intercept of each line at each time instant, and $K$ is the minimum subset size required for the stylization of the speech signal. In this paper, we select the pitch marks as the subset of points for the reconstruction of the pitch contour. These pitch marks are selected to fit the pitch contour so as to capture large-scale variations. For this, we propose an algorithm using pitch marks as the subset points for the stylization of the pitch contour. The proposed algorithm can be used for retrieving the pitch marks from the voiced region of a pitch contour.
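The piecewise-linear reconstruction described above can be illustrated with a small sketch: given a pitch sequence and K subset points, each consecutive pair of points is joined by a line with its own slope and intercept, and the fit is scored with the mean square error. This is only an illustration under stated assumptions, not the authors' algorithm. The function names, the toy pitch values, and the choice of subset indices (`knots`, standing in for pitch marks) are all hypothetical, and the sample index is assumed to serve as the time axis.

```python
from typing import List, Sequence, Tuple


def fit_segments(pitch: Sequence[float], knots: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (slope, intercept) of the line joining each pair of consecutive knots.

    `knots` are sample indices into `pitch`; here they stand in for pitch marks.
    """
    segments = []
    for left, right in zip(knots, knots[1:]):
        slope = (pitch[right] - pitch[left]) / (right - left)
        intercept = pitch[left] - slope * left
        segments.append((slope, intercept))
    return segments


def stylize(pitch: Sequence[float], knots: Sequence[int]) -> List[float]:
    """Rebuild the contour between the first and last knot from the line segments."""
    segments = fit_segments(pitch, knots)
    stylized: List[float] = []
    for (slope, intercept), left, right in zip(segments, knots, knots[1:]):
        # Each segment covers samples left .. right - 1; the last knot is added below.
        stylized.extend(slope * n + intercept for n in range(left, right))
    stylized.append(float(pitch[knots[-1]]))
    return stylized


def mean_squared_error(original: Sequence[float], approximation: Sequence[float]) -> float:
    """MSE between two equally long sequences."""
    return sum((o - a) ** 2 for o, a in zip(original, approximation)) / len(approximation)


# Toy contour in Hz (made-up values, not taken from any corpus).
pitch = [100.0, 112.0, 118.0, 130.0, 125.0, 120.0, 115.0]
# Hypothetical subset points spanning the whole contour: K = 3 points, K - 1 = 2 lines.
knots = [0, 3, 6]

stylized = stylize(pitch, knots)
mse = mean_squared_error(pitch, stylized)
print(len(knots) - 1, "segments, MSE =", round(mse, 4))
```

In this sketch the cost of the stylization is the number of knots and the quality is the MSE, so adding knots where the contour bends lowers the error at the price of more segments.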
In addition, the algorithm can stylize the voiced and unvoiced regions of the contour after pitch smoothing, which suits the applications mentioned above as well as text-to-speech conversion [24], [25]. The general flow of the proposed methodology on a smoothed pitch contour is shown in Fig. 1. As shown in the figure, the approach uses autocorrelation to detect pitch and a median filter with a window of length 3 to remove sudden spikes, which yields the corresponding pitch contour. This contour is used to extract the pitch marks and to approximate the pitch contour using linear interpolation. The number of linear segments depends on the number of pitch marks in the speech signal.

Fig. 1. Block Diagram of Proposed Method. (Block labels: Signal, Pitch Contours, Auto Correlation, Pitch Marks, Linear Stylization, Pitch Marks on Raw Pitch.)

$$g(y) = \sum_{i=1}^{K-1} \sum_{j=i}^{i+1} \left( a_i y_j + b_i \right) \tag{1}$$

The proposed work is closely related to [10], [15]. In [10], the authors discuss a computationally efficient dynamic programming solution for the stylization of pitch contour. The approach calculates the MSE (mean square error) of the stylized pitch by predetermining the number of segments K using [15]. The authors in [15] use the Daubechies wavelet (Db10) to perform a multilevel decomposition of the pitch contour and use the third-level decomposition to extract the number of extremes (K) for the stylization. The choice of the third level is based on empirically tested results, which show the best result for 60% of the cases. However, for the same data, 29% of the cases show better results for higher wavelet decompositions or fewer segments, and 11% of the cases perform better with second-level decomposition. In contrast, in our approach the number of segments is determined by the intonation of the speaker's voice, and no pre-determination is required. That is, the number of segments required for pitch stylization is neither pre-determined nor dependent on any empirical result. The algorithm computes the optimal number of segments based on the change in the pitch trajectory of the speaker.

To understand the performance of the proposed algorithm, we analyze metrics such as the mean squared error (MSE) and the number of line segments (K) used for stylization. For our analysis, we use voice samples from the LibriSpeech ASR corpus [26] and the EUSTACE speech corpus [27] to compare the performance with [15]. The experimental results show that, in comparison to [15], the proposed methodology uses fewer lines (K) to represent the pitch contour of a speech signal. The proposed approach also has a lower MSE than stylization via wavelet decomposition [15].

The rest of the paper is organized as follows. Section II presents the related work. Section III presents the methodology of the proposed piecewise linear stylization approach. In Section IV, we discuss the experimental setup and simulation

TABLE I. SUMMARY OF EXISTING WORK ON PITCH CONTOUR STYLIZATION

| Works | Approach | Algorithm | Application |
|---|---|---|---|
| Xiang Yin et al. [2016] [13] | Statistical learning | Hierarchically structured deep neural networks | Statistical parametric speech synthesis |
| Nicolas Obin et al. [2018] [14] | Statistical learning | Deep auto-encoders | Learning pitch templates for synthesis and voice conversion |
| J. 't Hart et al. [1991] [11] | Statistical analysis | Piecewise stylization | Parabolas adequate for F0 approximations |
| Daniel Hirst et al. [1993] [12] | Statistical analysis | Stylization using quadratic spline function | Coding and synthesis of curves used for different languages |
| D'Alessandro et al. [1995] [20] | Statistical analysis | Perceptual model of intonation | Prosodic analysis and speech synthesis |
| Nygaard et al. [1998] [9] | Statistical analysis | Piecewise polynomial approximation | Electrocardiogram (ECG) |
| Dagen Wang et al. [2005] [15] | Statistical analysis | Piecewise stylization via wavelet analysis | Pitch stylization for spoken languages |

Prashant K.