EMOTIONAL VOICE CONVERSION WITH CYCLE-CONSISTENT ADVERSARIAL NETWORK

Songxiang Liu, Yuewen Cao, Helen Meng

Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
{sxliu, ywcao, hmmeng}@se.cuhk.edu.hk

ABSTRACT

Emotional Voice Conversion, or emotional VC, is a technique for converting speech from one emotion state into another while keeping the basic linguistic information and speaker identity. Previous approaches to emotional VC need parallel data and use a dynamic time warping (DTW) method to temporally align the source-target speech parameters. These approaches often define a minimum generation loss, such as an L1 or L2 loss, as the objective function to learn model parameters. Recently, cycle-consistent generative adversarial networks (CycleGAN) have been used successfully for non-parallel VC. This paper investigates the efficacy of using CycleGAN for emotional VC tasks. Rather than attempting to learn a mapping between parallel training data using a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process, where the discriminators aim to differentiate between the natural and converted speech and the classifier aims to classify the underlying emotion from the natural and converted speech. The training process of the CycleGAN models randomly pairs source-target speech parameters, without any temporal alignment operation. The objective and subjective evaluation results confirm the effectiveness of using CycleGAN models for emotional VC. The non-parallel training for a CycleGAN indicates its potential for non-parallel emotional VC.

Index Terms— Emotional voice conversion, generative adversarial networks, CycleGAN

1. INTRODUCTION

Human speech is a complex signal that contains rich information, which includes linguistic, para-linguistic and non-linguistic information. While linguistic information and para-linguistic information controlled by the speaker help make the expression convey information precisely, non-linguistic information such as the emotion accompanying speech plays an important role in human social interaction. Compared with Voice Conversion (VC) [1, 2, 3, 4], which is a technique to convert one speaker's voice to sound like that of another, emotional Voice Conversion, or emotional VC, is a technique for converting speech from one emotion state into another while keeping the basic linguistic information and speaker identity.

Most VC approaches focus more on the conversion of short-time spectral features and less on the conversion of prosodic features such as F0 [5, 6, 7, 8, 3]. In these works, F0 features are usually converted by a simple logarithmic Gaussian (LG) normalized transform. For emotional VC, however, parametrically modeling the prosodic features is important, since prosody plays an important role in conveying various types of non-linguistic information, such as intention, attitude and mood, which represent the emotions of the speaker [9]. Recently, there has been much active research in modeling prosodic features of speech for emotional VC, most of which involves modeling two prosodic elements, namely the F0 contour and the energy contour. A Gaussian mixture model (GMM) and a classification regression tree model were adopted to model the F0 contour conversion from neutral speech to emotional speech in [10]. A system for transforming the emotion in speech was built in [11], where the F0 contour was modeled and generated by context-sensitive syllable-based HMMs, the duration was transformed using phone-based relative decision trees, and the spectrum was converted using a GMM-based or a codebook selection approach.

Prosody is inherently supra-segmental and hierarchical in nature, and its conversion is affected by both short- and long-term dependencies, such as the sequence of segments, syllables and words within an utterance, as well as the lexical and syntactic systems of a language [12, 13, 14, 15, 16]. There have been many attempts to model prosodic characteristics at multiple temporal levels, such as the phone, syllable and phrase levels [17, 18, 19, 20]. Continuous wavelet transform (CWT) can effectively model F0 at different temporal scales and significantly improve speech synthesis performance [21]. CWT methods have also been adopted for emotional VC: CWT was used for F0 modeling within a non-negative matrix factorization (NMF) model [22], and for F0 and energy contour modeling within a deep bidirectional LSTM (DBLSTM) model [23]. Using the CWT method to decompose the F0 at different scales has also been explored in [9, 24], where neural networks (NNs) or deep belief networks (DBNs) were adopted.

While previous work has shown the efficacy of using GMMs, DBLSTMs, NNs and DBNs to model the feature mapping for the spectral and prosodic features, they all need parallel data and parallel training, which means the source and target data should have parallel scripts and a dynamic time warping (DTW) method is used to temporally align the source and target features before training the models. Parallel training data is more difficult to collect than non-parallel data in many cases. Besides, the use of DTW may introduce alignment errors, which degrades VC performance. Moreover, previous emotional VC approaches often define a minimum generation loss, such as an L1 or L2 loss, as the objective function. One of the issues with using a minimum generation loss is the over-smoothing effect often observed in the generated speech parameters. Since this loss may also be inconsistent with human perception of speech, directly optimizing the model parameters using a minimum generation loss may not generate speech that sounds natural to humans. Generative adversarial networks (GANs) [25] have been incorporated into TTS and VC systems [26], where it is found that GAN models are capable of generating more natural spectral parameters and F0 than a conventional minimum generation error training algorithm regardless of its hyper-parameter settings. Since any utterance spoken by a speaker with the source or target emotional state can be used as a training sample, if a non-parallel emotional VC model can achieve comparable performance with its parallel counterparts, it will be more flexible, more practical and more valuable than parallel emotional VC systems. The recently emerged cycle-consistent generative adversarial network (CycleGAN) [27], which belongs to the large family of GAN models, provides a potential way to achieve non-parallel emotional VC. The CycleGAN was originally designed to transform styles in images, where the styles of the images are translated while the textures remain unchanged. CycleGAN models have been used successfully for developing non-parallel VC systems [28, 29].

In this paper, we investigate the efficacy of using CycleGAN models for emotional VC tasks. Emotional VC is similar to image style transformation, where we can regard the underlying linguistic information as analogous to image content and the accompanying emotion as analogous to image style. Rather than attempting to learn a mapping between parallel training data using a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process: the discriminators aim to differentiate between natural and converted speech, and the classifier aims to classify the underlying emotion from the natural and converted speech. The spectral features, F0 contour and energy contour are simultaneously converted by the CycleGAN model. We utilize CWT or logarithmic representations of the F0 and energy features. Although the training data we use is parallel, a non-parallel training process is adopted to learn the CycleGAN model, which means that source and target features are randomly paired during training, without any temporal alignment process. The objective and subjective evaluation results confirm the effectiveness of using CycleGAN models for emotional VC. The advantages offered by the CycleGAN model include (i) utilizing a GAN loss instead of a minimum generation loss, (ii) getting rid of source-target alignment errors and (iii) flexible non-parallel training. The non-parallel training for a CycleGAN indicates its potential for non-parallel emotional VC.

The rest of this paper is organized as follows: Section 2 introduces the CycleGAN model for emotional VC and Section 3 describes the details of implementation. Section 4 gives the experimental setups and evaluations. Conclusions are drawn in Section 5.
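For reference, the simple logarithmic Gaussian (LG) normalized F0 transform mentioned above (and used later as the F0 baseline when F0 is not explicitly modeled) can be sketched as follows. This is a minimal sketch assuming NumPy arrays and per-emotion log-F0 statistics estimated from training data; the function name is illustrative and not part of the released code.

```python
import numpy as np

def lg_f0_transform(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """Logarithmic Gaussian (LG) normalized F0 transform.

    f0_source: (T,) linear-scale F0 contour of the source utterance.
    src_mean/src_std, tgt_mean/tgt_std: mean and standard deviation of
    log-F0 over voiced frames of the source and target training data.
    Unvoiced frames (F0 == 0) are left untouched.
    """
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0
    lf0 = np.log(f0_source[voiced])
    lf0_converted = (lf0 - src_mean) / src_std * tgt_std + tgt_mean
    f0_converted[voiced] = np.exp(lf0_converted)
    return f0_converted
```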

2. EMOTIONAL VC WITH CYCLEGAN

The CycleGAN model consists of two generators (G_AB and G_BA), two discriminators (D_A and D_B) and one emotion classifier (C), as shown in Fig. 1, where we denote spectral and prosodic features in the domain of emotion A as S_A and spectral and prosodic features in the domain of emotion B as S_B, respectively. S_AB denotes the spectral and prosodic features converted from emotion A to emotion B by the generator G_AB, while S_ABA denotes the features converted back to emotion A from S_AB by the generator G_BA. To effectively learn the parameters of the generators, discriminators and classifier, several losses are defined as follows.

[Fig. 1. CycleGAN structure for emotional voice conversion. G_AB and G_BA refer to generators, D_A and D_B refer to discriminators, and C refers to the emotion classifier. S denotes genuine or converted spectral and prosodic features.]

Adversarial Loss: Generator G_AB serves as a mapping function from emotion domain A to emotion domain B, while generator G_BA does the opposite, serving as a mapping function from emotion domain B to emotion domain A. The discriminators, D_A and D_B, aim to distinguish between genuine and converted spectral and prosodic features, i.e., discriminator D_A distinguishes between S_A and S_BA, and discriminator D_B distinguishes between S_B and S_AB. To this end, an adversarial loss, which measures how distinguishable the converted features S_AB are from the genuine target-domain features S_B, is defined as

L_{adv}(G_{AB}, D_B) = \mathbb{E}_{S_B \sim p_{data}(S_B)}[\log D_B(S_B)] + \mathbb{E}_{S_A \sim p_{data}(S_A)}[\log(1 - D_B(G_{AB}(S_A)))].    (1)

The adversarial loss distinguishing the converted features S_BA from the genuine source-domain features S_A has a similar formulation.

Cycle Consistency Loss: The adversarial loss makes S_A and S_BA, or S_B and S_AB, as similar as possible, while the cycle consistency loss guarantees that an input S_A can retain its original form after passing through the two generators G_AB and G_BA. Using the notation in Fig. 1, S_ABA, which equals G_BA(G_AB(S_A)), should not diverge too much from S_A. This is very important for emotional voice conversion, since we do not want to change the linguistic and speaker information during the conversion process. The cycle consistency loss is defined as

L_{cyc}(G_{AB}, G_{BA}) = \mathbb{E}_{S_A \sim p_{data}(S_A)}[\|G_{BA}(G_{AB}(S_A)) - S_A\|_1] + \mathbb{E}_{S_B \sim p_{data}(S_B)}[\|G_{AB}(G_{BA}(S_B)) - S_B\|_1],    (2)

where \|\cdot\|_1 denotes the L1 norm.

Emotion Classification Loss: To explicitly guide the CycleGAN model to learn the emotion conversion function, we add an additional emotion classification loss to the original model. Specifically, an accompanying emotion classifier C, as shown in Fig. 1, is trained, which determines whether S_A, S_BA and S_ABA match the desired emotion label A, as well as whether S_B, S_AB and S_BAB match the desired emotion label B. To achieve this, the following emotion classification loss is introduced:

L_{emo}(G_{AB}, G_{BA}, C) = L_{emo-A}(G_{AB}, G_{BA}, C) + L_{emo-B}(G_{AB}, G_{BA}, C),    (3)

where

L_{emo-A}(G_{AB}, G_{BA}, C) = d(C(S_A), label_A) + d(C(S_{AB}), label_B) + d(C(S_{ABA}), label_A),    (4)

and

L_{emo-B}(G_{AB}, G_{BA}, C) = d(C(S_B), label_B) + d(C(S_{BA}), label_A) + d(C(S_{BAB}), label_B).    (5)

In equations (4) and (5), d(·) can be any divergence function used for classification problems, e.g., the binary cross-entropy loss function.

Full Objective: Combining the above adversarial loss, cycle consistency loss and emotion classification loss, the full training objective is

L(G_{AB}, G_{BA}, D_A, D_B, C) = L_{adv}(G_{AB}, D_B) + L_{adv}(G_{BA}, D_A) + \lambda_1 L_{cyc}(G_{AB}, G_{BA}) + \lambda_2 L_{emo}(G_{AB}, G_{BA}, C),    (6)

where \lambda_1 and \lambda_2 are trade-off parameters adjusting the relative weights of these loss terms.
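A minimal PyTorch-style sketch of the full objective in equation (6) is given below. The module interfaces (generators, discriminators returning probabilities, a classifier returning class logits), the choice of cross-entropy for d(·), and the default trade-off values are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cyclegan_emovc_objective(G_AB, G_BA, D_A, D_B, C, S_A, S_B,
                             label_A, label_B, lambda_cyc=10.0, lambda_emo=1.0):
    """Value of the full objective in Eq. (6) for one batch.

    S_A, S_B: batches of concatenated spectral and prosodic feature maps from
    emotion domains A and B. label_A, label_B: integer emotion class labels.
    D_A, D_B are assumed to output probabilities in (0, 1); C is assumed to
    output unnormalized class logits.
    """
    eps = 1e-8
    # Forward and backward translations (see Fig. 1).
    S_AB, S_BA = G_AB(S_A), G_BA(S_B)
    S_ABA, S_BAB = G_BA(S_AB), G_AB(S_BA)

    # Adversarial losses, Eq. (1) and its mirror for the opposite direction.
    L_adv_AB = torch.log(D_B(S_B) + eps).mean() + torch.log(1.0 - D_B(S_AB) + eps).mean()
    L_adv_BA = torch.log(D_A(S_A) + eps).mean() + torch.log(1.0 - D_A(S_BA) + eps).mean()

    # Cycle-consistency loss, Eq. (2): L1 reconstruction after a round trip.
    L_cyc = F.l1_loss(S_ABA, S_A) + F.l1_loss(S_BAB, S_B)

    # Emotion classification losses, Eqs. (3)-(5), with d(.) as cross-entropy.
    L_emo = (F.cross_entropy(C(S_A), label_A)
             + F.cross_entropy(C(S_AB), label_B)
             + F.cross_entropy(C(S_ABA), label_A)
             + F.cross_entropy(C(S_B), label_B)
             + F.cross_entropy(C(S_BA), label_A)
             + F.cross_entropy(C(S_BAB), label_B))

    # Full objective, Eq. (6). In practice the discriminators and classifier
    # maximize/fit their own terms while the generators minimize the overall
    # loss, using alternating optimization steps.
    return L_adv_AB + L_adv_BA + lambda_cyc * L_cyc + lambda_emo * L_emo
```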
3. IMPLEMENTATION

In this paper, speech features including Mel-cepstral coefficients (MCCs) and F0 are computed using WORLD [30]. The spectral features used for conversion are 36-dimensional MCCs. The energy contour and the F0 contour, as well as their corresponding CWT decompositions, are computed as in [23]. We use the CWT or logarithmic representation for the F0 and energy features. For convenience, we denote the CWT representations of the F0 and energy contours as LF0cwt and LEcwt, respectively, and the logarithmic F0 as LF0. The network architectures of the generators, discriminators and classifier are shown in Table 1. The DBLSTM models, which serve as the baseline, are set to have the same network architecture as in [23]. The hyper-parameters and training details are made available in the released source code (https://github.com/liusongxiang/CycleGAN-EmoVC) but are left out here for space limitations.

Table 1. Network architecture of generators, discriminators and classifier. IN stands for instance normalization [31]. LReLU indicates leaky ReLU. ConvTran refers to transposed convolution. Cin is the input channel to the layer. k1 and k2 are the input feature height and width divided by 16.

Generator
  Conv block:           Conv@3×9×1×64, IN, ReLU
  Down-sampling block:  Conv@4×8×64×128, stride=2, IN, ReLU
                        Conv@4×8×128×256, stride=2, IN, ReLU
  Residual block ×6:    Conv@3×3×256×256, IN, ReLU
                        Conv@3×3×256×256, IN
  Up-sampling block:    ConvTran@4×4×256×128, stride=2, IN, ReLU
                        ConvTran@4×4×128×64, stride=2, IN, ReLU
  Output layer:         Conv@7×7×64×1
Discriminator/Classifier
  Conv block:           Conv@4×4×1×64, stride=2, LReLU
  Stride block ×4:      Conv@4×4×Cin×2Cin, stride=2, LReLU
  Output layer:         Conv@k1×k2×1024×1

In the training and conversion stages, the MCC features and prosodic features are concatenated, so the model maps these features together. The features are normalized to zero mean and unit variance before being fed into the CycleGAN and DBLSTM models. During conversion, the aperiodicity component remains intact and is directly copied over. We first compute the logarithm-scale F0 and energy contours from the converted CWT-represented F0 and energy features, respectively. Then mean-variance denormalization and an exponential operation are applied to compute the linear-scale F0 and energy contours of the target emotion from the normalized logarithm-scale ones. If we denote the converted spectral feature as SP^c, which is computed from the converted MCCs, and the linear-scale energy contour as e^c, the final converted spectrum SP~ is computed as follows: (i) compute the energy contour e^t of SP^c; (ii) compute the element-wise ratio r = e^c / e^t; (iii) scale the i-th frame vector SP^c_i by r_i to obtain SP~_i.
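The conversion-time energy rescaling in steps (i)-(iii) can be sketched as follows, assuming NumPy arrays and taking the per-frame energy of the spectral envelope as the sum over frequency bins; the helper name is illustrative.

```python
import numpy as np

def rescale_spectrum_energy(sp_converted, energy_converted):
    """Steps (i)-(iii): match the energy contour of the converted spectrum
    to the converted (target-emotion) energy contour.

    sp_converted:     (T, n_bins) spectral envelope reconstructed from the
                      converted MCCs (SP^c).
    energy_converted: (T,) linear-scale energy contour e^c recovered from the
                      converted CWT representation.
    """
    # (i) Energy contour e^t of SP^c, here taken as the per-frame sum of the
    #     spectral envelope (an assumption for illustration).
    energy_current = sp_converted.sum(axis=1) + 1e-12
    # (ii) Element-wise ratio r = e^c / e^t.
    ratio = energy_converted / energy_current
    # (iii) Scale the i-th frame vector SP^c_i by r_i to obtain SP~_i.
    return sp_converted * ratio[:, None]
```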
4. EXPERIMENTS AND RESULTS

4.1. Experiment conditions

We use the CASIA Chinese Emotional Corpus, recorded by the Institute of Automation, Chinese Academy of Sciences, where each sentence with the same semantic text is spoken by 2 female and 2 male speakers in six different emotional tones: happy, sad, angry, surprise, fear, and neutral. We choose three emotions (sad, neutral and angry), which form a strong contrast, from one female speaker. We use 260 utterances for each emotion as the training set, 20 utterances as the validation set and another 20 utterances as the evaluation set.

Note that although the training data is parallel, the training process of the CycleGAN models randomly pairs source-target features, thus dynamic time warping (DTW) alignment is not needed. Since the DBLSTM models by nature need a frame-to-frame mapping between the source-target features, a DTW process is necessary to temporally align the source-target spectral features as well as the prosodic features, i.e., the F0 and energy representations. The experimental setups are listed in Table 2, where each model performs two conversion tasks, namely neutral-to-sad conversion and neutral-to-angry conversion.

Table 2. Mapping models and features.
  Models       Converted Features
  DBLSTM-1     MCCs
  DBLSTM-2     MCCs, LF0
  DBLSTM-3     MCCs, LF0cwt
  DBLSTM-4     MCCs, LF0cwt, LEcwt
  CycleGAN-1   MCCs
  CycleGAN-2   MCCs, LF0
  CycleGAN-3   MCCs, LF0cwt
  CycleGAN-4   MCCs, LF0cwt, LEcwt

4.2. Objective evaluation

The Mel Cepstral Distortion (MCD) is used for the objective evaluation of spectral conversion. The MCD is computed as

\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{36} (\mathrm{MCC}_i^t - \mathrm{MCC}_i^c)^2},    (7)

where MCC_i^t and MCC_i^c represent the target and the converted Mel-cepstral coefficients, respectively. The LogF0 mean squared error (MSE) is computed to evaluate the F0 conversion, which has the form

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\log(F0_i^t) - \log(F0_i^c))^2,    (8)

where F0_i^t and F0_i^c denote the target and the converted F0 features, respectively.
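A minimal sketch of the two objective metrics in equations (7) and (8), assuming the converted and target features have already been temporally aligned (e.g., by DTW) and are given as NumPy arrays:

```python
import numpy as np

def mel_cepstral_distortion(mcc_target, mcc_converted):
    """Average MCD in dB over frames, Eq. (7).

    mcc_target, mcc_converted: (T, 36) aligned Mel-cepstral coefficients,
    assumed to contain exactly the 36 coefficients used in Eq. (7).
    """
    diff = mcc_target - mcc_converted
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

def logf0_mse(f0_target, f0_converted):
    """LogF0 mean squared error, Eq. (8), computed over frames where both
    contours are voiced (F0 > 0), an assumption for illustration."""
    voiced = (f0_target > 0) & (f0_converted > 0)
    diff = np.log(f0_target[voiced]) - np.log(f0_converted[voiced])
    return float(np.mean(diff ** 2))
```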

The average MCD and LogF0-MSE results are illustrated in Table 3. The MCD and LogF0-MSE between the source and the target emotion are computed as a reference.

Table 3. MCD and LogF0-MSE results.
               MCD (dB)           LogF0-MSE
               Sad      Angry     Sad      Angry
  Source       10.87    14.56     0.063    0.098
  DBLSTM-1     9.97     10.59     0.065    0.132
  DBLSTM-2     9.55     9.74      0.027    0.039
  DBLSTM-3     10.60    11.38     0.029    0.045
  DBLSTM-4     10.57    11.83     0.025    0.042
  CycleGAN-1   10.43    10.60     0.065    0.132
  CycleGAN-2   10.70    10.49     0.030    0.057
  CycleGAN-3   10.04    10.55     0.030    0.075
  CycleGAN-4   10.30    10.26     0.034    0.059

Based on the MCD results, the best performing approach is DBLSTM-2, which converts the spectral features and the logarithmic F0 contour. CycleGAN-2 has the worst MCD for neutral-to-sad conversion, while DBLSTM-4 has the worst MCD for neutral-to-angry conversion. Comparing CycleGAN-3 with DBLSTM-3 and CycleGAN-4 with DBLSTM-4, we see that the CycleGANs have lower MCDs than their DBLSTM counterparts for both conversions, although there is no explicitly defined minimum generation loss when training the CycleGANs. Based on the LogF0-MSE results, DBLSTM-4 has the lowest error for the sad emotion and DBLSTM-2 has the lowest for the angry emotion. Comparing DBLSTM-1 with DBLSTM-(2-4) and CycleGAN-1 with CycleGAN-(2-4), we can see that simultaneously modeling F0 features (CWT or logarithmic representations) with spectral features achieves better conversion results than just using a simple logarithmic Gaussian normalized transform for F0 conversion.
It is reasonable that the DBLSTMs achieve the lowest results for both the MCD and LogF0-MSE metrics, since they are trained by optimizing an explicitly defined minimum generation loss between the DTW-aligned source and target speech features, and the MCD computation also uses DTW to align the converted and the genuine speech features.

4.3. Subjective Evaluation

A subjective emotion classification test is conducted, where each model has 20 testing utterances (10 for each conversion). The listeners are asked to label the stimuli as more 'sad' or more 'angry' when compared with a neutral reference. 16 listeners take part in this test. Since the converted waveforms from DBLSTM-(2-4) are obviously worse in speech naturalness than those from the other settings according to a preliminary listening test, we only conduct subjective evaluations for five models, i.e., DBLSTM-1 and CycleGAN-(1-4).

Table 4. Subjective classification results.
  Model        Target \ Perception   Sad      Angry    Neutral
  DBLSTM-1     Sad                   45.2%    12.3%    42.5%
               Angry                 2.5%     83.8%    13.7%
  CycleGAN-1   Sad                   43.8%    0.8%     55.4%
               Angry                 3.3%     78.9%    17.8%
  CycleGAN-2   Sad                   65.6%    2.2%     32.2%
               Angry                 6.7%     62.5%    30.8%
  CycleGAN-3   Sad                   56.7%    2.1%     41.2%
               Angry                 3.3%     62.3%    34.4%
  CycleGAN-4   Sad                   56.9%    0.9%     42.2%
               Angry                 1.2%     74.4%    24.4%

The subjective classification results are shown in Table 4. We see some inconsistency between the objective metrics and the subjective evaluation results, which is often encountered in the VC and TTS literature. For the conversion from neutral to sad, CycleGAN-2, which converts the spectral features together with the logarithmic F0 simultaneously, achieves the best result. For the conversion from neutral to angry, DBLSTM-1 achieves the best result, while CycleGAN-1 also achieves a good result with a degradation of only 5.8%. The neutral-to-sad conversion has lower results than the neutral-to-angry conversion under both the DBLSTM and CycleGAN settings, except for CycleGAN-2. One possible reason is that perceiving the emotional state from sad speech is more difficult than from angry speech when using the neutral speech as a reference. Sad speech is characterized by low energy and a slow speech rate, while angry speech is characterized by high energy and a fast speech rate. Comparing the different CycleGAN settings, which convert different speech features, we can see that different feature combinations obtain good results for different emotion conversions: CycleGAN-1 has a high result for neutral-to-angry conversion and CycleGAN-2 has a high result for neutral-to-sad conversion. The subjective emotion classification test shows the effectiveness of using the CycleGAN model for emotional VC. Since the training process of the CycleGANs is non-parallel, where source-target speech parameters are randomly paired, this work validates the utility of the CycleGAN approach in emotional VC based on training with non-parallel emotional databases.

5. CONCLUSIONS

This paper investigates the efficacy of using CycleGAN for emotional VC tasks. Rather than attempting to learn a mapping between parallel training data using a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process, where the discriminators aim to differentiate between the natural and converted speech and the classifier aims to classify the underlying emotion from the natural and converted speech. The training process of the CycleGAN models randomly pairs source-target speech parameters, thus a DTW process is not needed. The objective and subjective evaluation results confirm the effectiveness of using CycleGAN models for emotional VC. To sum up, the advantages offered by the CycleGAN model include (i) utilizing a GAN loss instead of a minimum generation loss, (ii) getting rid of source-target alignment errors and (iii) flexible non-parallel training. The non-parallel training process also indicates the potential to use non-parallel emotional speech data for developing emotional VC systems, which will be our future work.

6. REFERENCES
[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71-76, 1990.
[2] K. Shikano, S. Nakamura, and M. Abe, "Speaker adaptation and voice conversion by codebook mapping," in IEEE International Symposium on Circuits and Systems. IEEE, 1991, pp. 594-597.
[3] S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, "Voice conversion across arbitrary speakers based on a single target-speaker utterance," in Proc. Interspeech, 2018, pp. 496-500.
[4] S. Liu, Y. Cao, X. Wu, L. Sun, X. Liu, and H. Meng, "Jointly trained conversion model and WaveNet vocoder for non-parallel voice conversion using mel-spectrograms and phonetic posteriorgrams," in Proc. Interspeech, 2019, pp. 714-718.
[5] T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.
[6] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Proc. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4869-4873.
[7] T. Nakashika, T. Takiguchi, and Y. Minami, "Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2032-2045, 2016.
[8] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," arXiv preprint arXiv:1704.00849, 2017.
[9] Z. Luo, J. Chen, T. Takiguchi, and Y. Ariki, "Emotional voice conversion with adaptive scales F0 based on wavelet transform using limited amount of emotional data," in Proc. Interspeech, 2017, pp. 3399-3403.
[10] J. Tao, Y. Kang, and A. Li, "Prosody conversion from neutral speech to emotional speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1145-1154, 2006.
[11] Z. Inanoglu and S. Young, "A system for transforming the emotion in speech: Combining data-driven conversion techniques for prosody and voice quality," in Eighth Annual Conference of the International Speech Communication Association, 2007.
[12] Y. Xu, "Speech prosody: A methodological review," Journal of Speech Sciences, vol. 1, no. 1, pp. 85-115, 2011.
[13] D. Hirst and A. Di Cristo, Intonation Systems: A Survey of Twenty Languages, Cambridge University Press, 1998.
[14] K. Yu, "Review of F0 modelling and generation in HMM based speech synthesis," in Signal Processing (ICSP), 2012 IEEE 11th International Conference on. IEEE, 2012, vol. 1, pp. 599-604.
[15] A. Wennerstrom, The Music of Everyday Speech: Prosody and Discourse Analysis, Oxford University Press, 2001.
[16] J.J.O.H. Kawasaki-Fukumori and J. Ohala, "Alternatives to the sonority hierarchy for explaining segmental sequential constraints," Language and its Ecology: Essays in Memory of Einar Haugen, vol. 100, pp. 343, 1997.
[17] J. Latorre and M. Akamine, "Multilevel parametric-base F0 model for speech synthesis," in Ninth Annual Conference of the International Speech Communication Association, 2008.
[18] C. Wu, C. Hsia, C. Lee, and M. Lin, "Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1394-1405, 2010.
[19] N. Obin, A. Lacheret, and X. Rodet, "Stylization and trajectory modelling of short and long term speech prosody variations," in Proc. Interspeech, 2011.
[20] Y. Qian, Z. Wu, B. Gao, and F. Soong, "Improved prosody generation by maximizing joint probability of state and longer units," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1702-1710, 2011.
[21] A. Suni, D. Aalto, T. Raitio, P. Alku, M. Vainio, et al., "Wavelets for intonation modeling in HMM speech synthesis," in 8th ISCA Workshop on Speech Synthesis, Barcelona, August 31-September 2, 2013. ISCA, 2013.
[22] H. Ming, D. Huang, M. Dong, H. Li, L. Xie, and S. Zhang, "Fundamental frequency modeling using wavelets for emotional voice conversion," in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2015, pp. 804-809.
[23] H. Ming, D. Huang, L. Xie, J. Wu, M. Dong, and H. Li, "Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion," in Proc. Interspeech, 2016, pp. 2453-2457.
[24] Z. Luo, T. Takiguchi, and Y. Ariki, "Emotional voice conversion using neural networks with different temporal scales of F0 based on wavelet transform," in SSW, 2016, pp. 140-145.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.
[26] Y. Saito, S. Takamichi, and H. Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84-96, 2018.
[27] J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint, 2017.
[28] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, 2017.
[29] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, "High-quality nonparallel voice conversion based on cycle-consistent adversarial network," arXiv preprint arXiv:1804.00425, 2018.
[30] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
[31] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: the missing ingredient for fast stylization," CoRR, vol. abs/1607.08022, 2016.