AudioViewer: Learning to Visualize Sound
by
Yuchi Zhang
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)
The University of British Columbia (Vancouver)
December 2020
© Yuchi Zhang, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:
AudioViewer: Learning to Visualize Sound submitted by Yuchi Zhang in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.
Examining Committee:
Helge Rhodin, Computer Science (Supervisor)
Kwang Moo Yi, Computer Science (Examining Committee Member)
Abstract
Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design is to translate from audio to video by compressing both into a common latent space with a shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features since they can easily be parsed by humans to match and distinguish between sounds, words, and speakers.
Lay Summary
We develop AudioViewer, an automated tool that translates speech into videos of faces or hand-written numbers. Without audio feedback, it is challenging for hearing impaired people to learn speech. To help with this, we propose AudioViewer to generate videos from utterances as a visual substitute for auditory feedback. Our current work is a proof of concept. AudioViewer is designed to generate consistent and recognizable visualizations of pronunciation. For example, the word ‘water’ will be visualized as a short animation of a portrait of a young brunette lady with a short hair style facing rightward with a neutral expression, which gradually transforms into a smiling young blonde lady facing forward. From our preliminary user study, we demonstrate that people can tell the difference between the visualizations of different pronunciations and recognize the consistency of identical ones.
Preface
The work presented here is original work done by the author, Yuchi Zhang, in collaboration with Willis Peng and Dr. Bastian Wandt under the supervision of Dr. Helge Rhodin. A shorter version of this work is under review as a conference submission. Design, implementation and experiments were done by Yuchi Zhang. Dr. Helge Rhodin provided suggestions and feedback during the development, and Willis Peng helped with the user study evaluation. The initial draft of the paper was written by all of us and revised by Dr. Bastian Wandt. The user studies run for this work were approved by the UBC Behavioural Research Ethics Board (Certificate Number: H19-03006).
Table of Contents
Abstract ...... iii
Lay Summary ...... iv
Preface ...... v
Table of Contents ...... vi
List of Tables ...... viii
List of Figures ...... x
Acknowledgments ...... xiii
1 Introduction ...... 1
2 Related Work ...... 4
3 Method ...... 9
3.1 Audio Encoding ...... 10
3.2 Structuring the Audio Encoding ...... 11
3.3 Image Encoding for Video Generation ...... 13
3.4 Linking Audio and Visual Spaces ...... 14
4 Experiments ...... 16
4.1 Temporal Smoothness Effectiveness ...... 18
4.2 Information Throughput ...... 20
4.3 Visual Quality and Phonetics ...... 22
4.4 Visual Domain Influence ...... 22
4.5 Results on Environment Sounds ...... 23
4.6 User study ...... 25
4.6.1 Design of the questionnaire ...... 25
4.6.2 Results ...... 28
5 Limitations and Future Work ...... 31
6 Conclusion ...... 35
Bibliography ...... 36
List of Tables
Table 4.1 Audio encoding analysis. The average SNR of the reconstructed amplitude Mel spectrogram on the TIMIT test set and the average speed and acceleration between the latent vectors (dim = 128) of neighbouring frames (∆t = 0.04 s). The SNR of the autoencoder drops with the additional constraints we enforce (smoothness and disentanglement). This is a trade-off for improving velocity and acceleration properties as well as for inducing interpretability. ...... 18
Table 4.2 Throughput lower bound. The throughput from audio to video with AudioViewer can be bounded by mapping back to the audio. It is evident that the cycle consistency improves it greatly, while the enforcement of additional constraints with Lp, log MSE, and Lrr takes only a small dip. ...... 21
Table 4.3 Information throughput on environment sounds, showing that the reconstruction error increases when evaluating on a test set that contains sounds vastly different from the training set (speech vs. environment sounds). ...... 24
Table 4.4 Audio VAE Mel spectrum reconstruction. The average SNR of autoencoding and decoding Mel spectrograms on ESC-50 shows a significant reconstruction loss. The average speed and acceleration between the latent vectors (dim = 128) of neighbouring frames (∆t = 0.04 s) confirms the experiments in the main document, that smoothness comes at the cost of lower reconstruction accuracy. ...... 24
Table 4.5 A breakdown of the questionnaire format by sound type and tested factors, listing their frequency of occurrence. ...... 27
Table 4.6 User study results. Values indicate mean accuracy and standard deviation for distinguishing between visualizations of the tested factor across participants as a percentage. ...... 28
List of Figures
Figure 1.1 AudioViewer: A tool developed towards the long-term goal of helping hearing impaired users to see what they can not hear. We map an audio stream to video and thereby use faces or numbers for visualizing the high-dimensional audio features intuitively. Different from the principle of lip reading, this can encode general sound and pass on information on the style of spoken language, for instance, the timbre of the speaker. . . . 2
Figure 2.1 An illustration of harmonics and formants of speech voice in a filter bank image. The mel-scaled power spectrogram shows the word ‘party’ spoken by a male speaker. The vertical axis indicates mel-scaled filter banks and the horizontal axis is time. The narrow stripes in the spectrogram are the harmonics, which are the integral multiples of the fundamental frequency (F0), reflecting the vibration of the vocal folds. Note that the change in F0 reflects the intonation during speech. The wide band peaks are the formants, reflecting the resonant frequencies of the vocal tract. Formant patterns are different when different vowels (e.g., aa, r and iy) are spoken. ...... 5
Figure 3.1 Overview. A joint latent encoding is learned with audio and video VAEs that are linked with a cycle constraint and augmented with a smoothness constraint (∆v annotation). ...... 10
Figure 3.2 Disentanglement, by mixing content encodings from two speakers saying the same phones (z_ci with z'_ci) and style encodings from the same speaker saying different words (z_sa with z'_sa). ...... 12
Figure 3.3 Cycle constraint. We link audio and video latent spaces with a cycle constraint that ensures that the audio signal content is preserved through video decoding and encoding. ...... 14
Figure 4.1 Smoothness evaluation by plotting the latent embedding of an audio snippet in 3D via dimensionality reduction. Left: PCA baseline model, consecutive audio sample embeddings are scattered. Middle: Similar performance of our model without temporal smoothness constraint. Right: Our full model leads to smoother trajectories. ...... 19
Figure 4.2 Throughput. The information throughput can be measured by mapping from audio to video and back. Playing the depicted Mel Spectrogram shows that the speech is still recognizable. ...... 20
Figure 4.3 Word analysis. Instances (different rows) of the same words are encoded with a similar video sequence (single row), yet they show pronunciation variations if spoken by different speakers. ...... 22
Figure 4.4 Phoneme similarity analysis. The same phonemes occurring at the beginning of different words are mapped to visually similar images (columns) while different speakers have minimal influence (rows). ...... 23
Figure 4.5 Environment sounds visualization. Although not trained for it, our approach also visualizes natural sounds. Here showing a clear difference between chirp and alarm clock. The visualization looks brighter and more intense when a target sound occurs. ...... 25
Figure 4.6 Matching Example. Examples from the CelebA-combined and MNIST-content questionnaire of a matching question asked to participants in the user study. ...... 26
Figure 4.7 Grouping Example. Examples from the CelebA-combined and MNIST-content questionnaire of a grouping question asked to participants in the user study. ...... 26
Figure 4.8 Model Comparison on content questions. The mean user accuracy and standard deviation for distinguishing phoneme sequences is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, * for p < 0.05, ** for p < 0.01). The mean accuracy for random guessing is 0.384. ...... 29
Figure 4.9 Model Comparison on style questions. The mean user accuracy and standard deviation for distinguishing speaker sex (left) and dialect (right) is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, *** for p < 0.001). The mean accuracies for random guessing are 0.43 and 0.5 respectively. ...... 30
Figure 5.1 Recombined reconstruction sample. Top left: The Mel spectrograms of a sentence spoken by a female speaker and its reconstruction by our audio model with disentanglement training. Top right: The original and reconstructed Mel spectrogram of another sentence spoken by a male speaker. Bottom: The Mel spectrograms generated from recombined latent encodings. The upper one is decoded from the content latent encodings of the first utterance combined with the speaker latent encoding of the male speaker of the second utterance. The lower one is the second sentence recombined with the speaker encoding of the female speaker from the first sentence. Due to the higher F0 of the female speaker, the harmonics in her original and reconstructed Mel spectrograms are separated farther apart than those of the male speaker, which can also be noticed in the recombined reconstructed Mel spectrogram. ...... 34
Acknowledgments
This work is funded by the University of British Columbia (UBC) Department of Computer Science "Start up funds". I would also like to acknowledge the computing resources provided by Compute Canada. I would like to thank Dr. Helge Rhodin for supervising my thesis. I really appreciate all your support, direction, and patience. It is a great honor to work with you. I would also like to thank Dr. Kwang Moo Yi for agreeing to be the second reader of the thesis. Your suggestions, scrutiny, and feedback made this thesis a better one. To Willis Peng and Bastian Wandt, I really appreciate all your essential contributions to this project. Thanks to Farnoosch Javadi and Tanzila Rahman for the feedback on our conference submission. I would like to acknowledge all the participants of our user study for their participation and feedback, including but not limited to Frank Yu, Xingzhe He, Daniel Ajisafe, Shih-Han Chou, Wei Jiang, Lixin Liao, Kevin Ta, Dr. Ulrike Pestel-Schiller, Ge Fang, and Xuanyi Zhao. I would like to thank Dr. Dinesh Pai for all the support and patience as my advisor in the first year. It helped me a lot. I appreciate all the wonderful instructors and colleagues I have worked with as a teaching assistant. Thanks for your help; I really enjoyed the time working with you. To the labmates from Helge's and Dinesh's groups, thanks for all the discussion and help. To the new friends I met here, especially Jan Hansen, Jolande Fooken, and Yihan Zhou: I did not only enjoy playing sports and games with you, but also
appreciate all the help and suggestions from you. I would also like to thank Xuanyi Zhao and Ge Fang for their company during this hard time of COVID-19. I enjoyed the time with you. I appreciate all the care and support from my family and the friends I have known for years. You are where I am from. To Xiuyun Wu, I could not make it here without you.
Chapter 1
Introduction
Humans perceive their environment through diverse channels, including vision, hearing, touch, taste, and smell. Impairment in any of these modes of perception can lead to drastic consequences, such as challenges in learning and communication. Various approaches have been proposed to substitute lost senses, from cochlear implants [34] to brain-machine interfaces (BMIs) that directly interface with neurons in the brain (e.g., the recently popularized NeuraLink [64]). One of the least intrusive approaches is to substitute auditory perception with the visual modality, which is challenging due to the different characteristics of the two domains. Information in the auditory domain is usually high-frequency and low-parallel, while it is low-frequency and high-parallel in the visual domain.

In this paper, we make a first step to visualize audio with natural images in real time, forming a live video that characterizes the entire audio content from an acoustic perspective. It could be seen as an automated form of translation, with its own throughput, abstraction, automation, and readability trade-offs. Figure 1.1 illustrates the pipeline of our method. Such a tool could help to perceive sound and could also be used to train deaf speech by providing immediate and high-throughput feedback on pronunciation, since traditional visual tools for hearing-impaired speech learning that rely on spectrogram representations [18, 54, 92, 93] are difficult to learn from.

Figure 1.1: AudioViewer: A tool developed towards the long-term goal of helping hearing impaired users to see what they can not hear. We map an audio stream to video and thereby use faces or numbers for visualizing the high-dimensional audio features intuitively. Different from the principle of lip reading, this can encode general sound and pass on information on the style of spoken language, for instance, the timbre of the speaker. (Pipeline: audio signal → Mel spectrogram → audio encoder → latent space trajectory → visual decoder → translated video frames, face or number decoder.)

Our long-term goal is to show that hearing-impaired users can learn to perceive sound through the visual system. Similar to learning to understand sign language, and like an unimpaired child that learns to associate meaning with seen objects and perceived sounds and words, the user will have to learn a mapping from our sound visualization to its semantics. Then articulation could be practiced by exploring the visual feedback space, and speaking could be learned by articulating sounds that reproduce the same visual feedback as produced by the teacher or parent.

Translating an audio signal to video is difficult as, in our setup, there is no canonical one-to-one mapping across these domains with vastly different dimensions (one-dimensional audio to high-dimensional video) and representation characteristics (time-frequency to spatial structure). It is an open question which level of visual abstraction is best. Methods for digital dubbing and lip-syncing spoken audio [16, 75, 91] would be natural, allowing deaf people to lip read. However, natural lip motion is rather a result of speaking: facial expressions only contain a fraction of the audio information and do not apply to environment sounds, which are important when they are related to one's safety [26, 67, 97]. Another direction could be to translate speech to words with a recognition system [66, 79], for example the spoken word 'dog' would be translated to the text 'dog'. A conceivable extension is to further translate the word into an image using conditional generative adversarial networks (GANs) [24]. A dog image could be shown when the word 'dog' is recognized in the audio. This is intuitive in terms of semantics but still indirect in terms of phonetics; such a translation, however, does not contain any vocal feedback on style or differences between male and female speakers. Moreover, events and ambient sounds that can not be indicated by a single word are ill-represented, such as the echo of a dropped object or the repeating beep of an alarm clock.

To overcome these limitations, we seek an intuitive and immediate visual representation of the high-dimensional audio data, inspired by early techniques from the 1970s that visualize general high-dimensional data with facial line drawings with varying length of the nose and curvature of the mouth [10], which were shown to be recognized more easily and quickly by human users [42]. Instead of the computer-drawn cartoon faces used in these previous studies, we use learning-based generative models trained on natural images to generate more realistic human faces with more features, so that more information can be visualized.

In this paper, we design a real-time audio to video mapping leveraging ideas from computer vision and audio processing for unpaired domain translation. The technical difficulty lies in finding a mapping that is expressive yet easy to perceive. We base our method, AudioViewer, on four principles. First, humans are good at recognizing patterns and objects that they see in their environment, particularly human faces [1, 10, 42, 47, 83]. We therefore map sound to videos of natural images and faces. Second, quick and non-natural (e.g., flickering) temporal changes lead to disruption and tiredness [78]. Hence, we enforce temporal smoothness on the learned mapping. Third, audio information should be preserved in the generated visualization. We therefore exploit cycle consistency to learn a joint structure between audio and video modalities. Fourth, we disentangle style, such as gender and dialect, from content, i.e., individual phones and sounds, so that users can more easily recognize, for example, the phonetic aspects of the visualization of an utterance.

Our core contributions are: a new approach for enabling persons to see what they can not hear; a method for mapping from audio to video via linked latent spaces; a quantitative analysis on the type of target videos, different latent space models, and the effect of smoothness constraints; and a perceptual user study on the recognition capabilities. We demonstrate the feasibility of AudioViewer with a functional prototype and analyze the performance by quantifying the loss of information content as well as showing through a user study that words and pronunciation can be distinguished from the generated video features.
Chapter 2
Related Work
In the following sections, we first introduce fundamental speech acoustics and place our approach in the context of existing assistive systems. Then, we present an overview of studies on sensory substitution and audio visualization. Finally, we review the literature on audio and video generation with a particular focus on cross-modal models.
Speech acoustics. According to the source-filter theory of speech production [19, 45], vocal fold vibration is usually viewed as the source of sound in vowels, and the vocal tract is an acoustic filter which modifies the sound produced by the vocal folds. As shown in Figure 2.1, given a power spectrogram of a voiced waveform, the fundamental frequency (F0) is defined as the first (lowest-frequency) peak in the spectrogram, and the harmonic components are the multiples of the fundamental frequency; they reflect the vibration of the vocal folds and partially determine the timbre of one's voice. Therefore, harmonic features can be used in speaker identification [41]. Due to the resonant frequencies of the vocal tract, harmonics in the voice source are filtered in a band-pass fashion, and the peaks in the envelope of the filtered power spectrum are the formants. The envelope is determined by the configuration of the vocal tract and is important for phoneme recognition. Vowels, for instance, can be identified from their formant patterns [29].
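These quantities can be inspected directly from a recording. The following is a minimal sketch, assuming librosa is available and that `speech.wav` is a hypothetical 16 kHz speech file; it computes a mel-scaled power spectrogram, in which harmonics appear as narrow stripes and formants as broader energy bands, and estimates F0 with the pYIN algorithm.

```python
import librosa
import numpy as np

# Load a speech recording (hypothetical file) at 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)

# Mel-scaled power spectrogram: harmonics show up as narrow stripes,
# formants as broader bands of energy in the spectral envelope.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024,
    win_length=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Fundamental frequency (F0) track via pYIN; its integer multiples are the harmonics.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

print("mel spectrogram shape (bands, frames):", mel_db.shape)
print("median F0 of voiced frames [Hz]:", np.nanmedian(f0))
```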
Figure 2.1: An illustration of harmonics and formants of speech voice in a filter bank image. The mel-scaled power spectrogram shows the word ‘party’ spoken by a male speaker. The vertical axis indicates mel-scaled filter banks and the horizontal axis is time. The narrow stripes in the spectrogram are the harmonics, which are the integral multiples of the fundamental frequency (F0), reflecting the vibration of the vocal folds. Note that the change in F0 reflects the intonation during speech. The wide band peaks are the formants, reflecting the resonant frequencies of the vocal tract. Formant patterns are different when different vowels (e.g., aa, r and iy) are spoken.

Deaf speech support tools. Improvements in speech production of hearing-impaired people have been achieved through non-auditory aids, and these improvements last beyond learning sessions and generalize to words not encountered during the sessions [28, 54]. While electrophysiological [67, 79, 97] and haptic learning aids [17, 51] have demonstrated efficacy for improving speech production, such techniques can be more invasive, especially for young children, as compared to visual aids. Elssmann et al. [18] demonstrate that visual feedback from the speech spectrographic display (SSD) [79] is as effective at improving speech production as feedback from a speech-language pathologist. Alternative graphical plots generated from transformed spectral data have been explored [92, 93], which aim at improving upon spectrograms by creating plots that are more distinguishable with respect to speech parameters. Other methods aim at providing feedback by explicitly estimating vocal tract shapes [68]. In addition, Levis et al. [57] demonstrate that distinguishing between discourse-level intonation (intonation in conversation) and sentence-level intonation (intonation of sentences spoken in isolation) is possible through speech visualization and argue that deaf speech learning could be further improved by incorporating the former. SpeechViewer [5, 40, 90], a computer-based speech training (CBST) system developed by IBM, is designed to help hearing-impaired children practice phoneme pronunciation with varied visualizations of specified speech parameters and with games controlled by specified phonemes [14]. The efficacy of SpeechViewer has been observed for hearing-impaired children and teenagers, autistic children, and individuals with dysarthria [6, 72, 85], but its limitations, such as inaccurate feedback on some phonetic features, have also been noted. Visualizing high-dimensional data with faces has been found intuitive for human users [10, 42]. Instead of visualizing hand-crafted speech features with the limited elements of a computer-drawn cartoon face, our method generates faces with deep generative models trained on natural images, which can carry more information. Specifically, our image generation approach extends these spectrogram visualization techniques by leveraging the generative ability of variational autoencoders (VAEs) in creating a mapping to a more intuitive and memorable video representation. It is our hope that improved visual aids will lead to more effective learning in the future.
Sensory substitution and audio visualization. Related to our work is the field of sensory substitution, whereby information from one modality is provided to an individual through a different modality. While many sensory substitution methods focus on converting visual information into other senses like tactile or auditory stimulation to help visual rehabilitation [23, 37, 60], few methods target substituting the auditory modality with visualization. The idea is feasible from the perspective of cognitive neuroscience. Deaf individuals are found to perform better in various visual tasks than hearing controls [2, 62, 63]. Moreover, visual cross-modal activation of the auditory cortex induced by non-linguistic visual stimuli, which is absent in hearing participants, has been observed in neuroimaging studies [20, 21], indicating cross-modal recruitment in hearing-impaired people. Although ASL is viewed as a successful sensory substitution system which translates information usually presented in the auditory modality into the visual modality [3], it is essentially different from spoken languages, which developed in the audio-vocal modality. Audio visualization is another field related to our work. Music visualization works generate visualizations of songs so that users can browse songs more efficiently without listening to them [82, 95]. On the learning side, [96] visualize the intonation and volume of each word in speech by the font size, so that users can learn narration strategies. Different from the works mentioned above, our model tries to visualize speech at the phoneme level with deep learning models instead of hand-crafted features.
Classical audio to video mappings. One of the major applications of generating video from speech is digital dubbing, or synthesizing facial expressions that match the speech. Recent approaches employ deep generative models like GANs to synthesize realistic videos of people speaking [16, 75, 91]. Other approaches try to reconstruct facial attributes of the speaker, such as gender and ethnicity, from their utterances, or map music to facial or body animation [48, 77, 80, 84]. In contrast to our setting, all of these tasks can be supervised by learning from videos with accompanying audio, e.g., a talking person where the correspondence of lip motion, expressions and facial appearance to the spoken language is provided to train the relation between sound and mouth opening. Similar to the unsupervised digital dubbing approach trained with cycle consistency [52], in this work we go beyond manual supervision and opt for an unsupervised translation mechanism with cycle consistency. Moreover, we can map sound to non-facial features such as background and brightness.
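The cycle-consistency idea referenced here can be summarized in a few lines: decoding an audio latent code into an image and re-encoding that image should return to the original code. The sketch below uses hypothetical encoder/decoder names and a plain MSE penalty; it illustrates the principle rather than the exact constraint defined later in Chapter 3.

```python
import torch.nn.functional as F

def latent_cycle_loss(z_audio, visual_decoder, visual_encoder):
    """Latent cycle consistency: decoding a sound code to an image and
    re-encoding the image should recover the original latent code.

    visual_encoder is assumed to return (mean, log-variance); the mean is used.
    """
    image = visual_decoder(z_audio)
    z_back, _ = visual_encoder(image)
    return F.mse_loss(z_back, z_audio)
```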
Audio and visual generation models. Deep generative models take many forms such as generative adversarial networks (GANs) [24], variational autoencoders (VAEs) [53], and auto-regressive models [88]. The highest image fidelity is attained with hierarchical models that inject noise and latent codes at various network stages by changing their feature statistics [39, 49]. One of the state-of-the-art methods in face generation, StyleGAN [49], as well as its updated version StyleGAN2 [50], can learn disentangled latent distributions with a non-linear mapping network. The vector quantized variational autoencoder (VQ-VAE-2) [74], another state-of-the-art method, can generate a diverse range of images with high fidelity comparable to GAN models, using a hierarchy of vector quantized codes. Considering that the ultimate goal of our project is visualization, however, the fidelity of the generation is not the main concern. Hence, we choose the deep feature consistent variational autoencoder (DFC-VAE) [33], which is trained with feature losses to improve the perceptual consistency of reconstructions. As a VAE with a relatively shallow structure, DFC-VAE is faster and more stable to train compared to deeper and more complicated models. For audio, unlike traditional speech analysis where research focuses on hand-crafted features such as LPC coefficients, Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) coefficients, etc. [9, 30], recent learning-based methods learn features from the speech data directly. Although some methods operate on the raw waveform [46], it is more common to use spectrograms and to apply convolutional models inspired by the ones used for image generation [8, 15, 36]. We use SpeechVAE [36] as the audio backbone model to learn important audio features.
Cross-modal latent variable models. CycleGAN and its variations [81, 98] have seen considerable success in performing cross-modal unsupervised domain transfer, for example in medical imaging [31, 87] and audio to visual translation [27]. However, they often encode information as a high-frequency signal that is invisible to the human eye and are susceptible to adversarial attacks [12]. An alternative approach involves training a VAE subject to a cycle-consistency constraint [44, 94], but these works have only been applied to domain transfers within a single modality. The work most similar to ours is the joint audio and video model proposed by Tian et al. [86], which uses a VAE to map between two cross-modal latent spaces using supervised alignment of attributes. Their method, however, only operates
Chapter 3
Method
Our goal is to translate a one-dimensional audio signal $A = (a_0, \dots, a_{T_A})$ into a video visualization $V = (I_0, \dots, I_{T_V})$, where $a_i \in \mathbb{R}$ are sound wave samples recorded over $T_A$ frames and $I_i$ are images representing the same content over $T_V$ frames. In our case, we do not have a ground-truth answer to relate to our input data, hence supervised training is not possible. Instead, we utilize individual images and audio snippets without correspondence and learn a latent space $Z$ that captures the importance of image and audio features and translates between them with minimal loss of content. Figure 3.1 shows an overview of the proposed pipeline. The audio encoder $E_A$ yields a latent code $z_i \in Z$ and the visual decoder $D_V(z_i)$ outputs a corresponding image $I_i$. Applied sequentially, this produces a video representation of the audio. We start by learning individual audio and video models, which are subsequently refined together with a cycle constraint.
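As a rough illustration of this pipeline (Figure 3.1), translation amounts to applying the audio encoder to each mel segment and the visual decoder to the resulting latent code, one video frame at a time. The following is a minimal sketch under assumed interfaces; `audio_encoder` and `visual_decoder` are hypothetical stand-ins for $E_A$ and $D_V$, not the thesis implementation.

```python
import torch

@torch.no_grad()
def audio_to_video(mel_segments, audio_encoder, visual_decoder):
    """Translate a sequence of mel-spectrogram segments into video frames.

    mel_segments: tensor of shape (T_V, 1, 20, 80), one segment per video frame
    audio_encoder: maps a segment to a latent code z_i (posterior mean)
    visual_decoder: maps z_i to an image I_i
    """
    frames = []
    for segment in mel_segments:
        z = audio_encoder(segment.unsqueeze(0))   # (1, latent_dim)
        frame = visual_decoder(z)                 # (1, C, H, W)
        frames.append(frame.squeeze(0))
    return torch.stack(frames)                    # (T_V, C, H, W)
```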
Matching sound and video representation. A first technical problem lies in the higher audio sampling frequency (16 kHz), which prevents a one-to-one mapping to 25 Hz video. We follow common practice [36, 89] and represent the one-dimensional sound wave with a perceptually motivated mel-scaled spectrogram, $M = (m_1, \dots, m_{T_M})$, $m_i \in \mathbb{R}^F$, where $F = 80$ is the number of filter banks. It is computed via the short-time Fourier transform with a 25 ms Hanning window with 10 ms shifts. The resulting coefficients are converted to decibel units, flooring at -80 dB. Because the typical mel-spectrogram still has a higher sampling rate (100 Hz) than the video, we chunk it into overlapping segments of length $T_M = 20$, covering $\Delta t = 200$ ms. In the following, we explain how to map from mel segment to video frame.

Figure 3.1: Overview. A joint latent encoding is learned with audio and video VAEs that are linked with a cycle constraint and augmented with a smoothness constraint ($\Delta v$ annotation). (Pipeline: audio signal → Mel spectrogram → audio encoder → audio latent code → audio decoder → Mel reconstruction; cycle constraint linking to the visual encoder, visual latent code, visual decoder, and image reconstruction.)

3.1 Audio Encoding

Given unlabelled audio data samples, we start by learning independent encoder-decoder pairs for sound and for video. We use VAEs since these do not only learn a compact representation of the structure, but also allow us to control the shape of the latent distribution to be a standard normal. Let $x$ be a sample from the unlabelled dataset. We optimize over all samples the VAE objective [53]:

$$\mathcal{L}(x) = -D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big], \tag{3.1}$$

where $D_{KL}$ is the Kullback-Leibler divergence, and

$$q_\phi(z|x) = \mathcal{N}\big(\mu(x), \sigma^2(x)\big) \tag{3.2}$$

and

$$p_\theta(x|z) = \mathcal{N}\big(\rho(z), \omega^2(z)\big) \tag{3.3}$$

the latent code and output domain posteriors, respectively. These have a parametric form, with $\mu(x), \sigma(x)$ the output of the encoder and $\rho(z), \omega(z)$ the output of the decoder.

Audio network architecture. We use the SpeechVAE model from Hsu et al. [36] that is widely used for generative sound models. Fig. 3.1, left, sketches how the mel-spectrogram is encoded with an encoder $E_A$ of three convolutional layers followed by a fully connected layer which flattens the spatial dimensions. As noted above, the mel-spectrogram $M$ is a two-dimensional array, spanning time and frequency, respectively. To account for the structural differences of the two, convolutions are split into separable $1 \times F$ and $3 \times 1$ filters, where $F$ is the number of frequencies captured by $M$, and striding is applied only on the temporal axis. The decoder is symmetric to the encoder. We use ReLU activation and batch normalization layers in the encoder and decoder.
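To make the separable-filter design concrete, here is a minimal PyTorch sketch of such an encoder: a 1 × F convolution collapses the frequency axis, two 3 × 1 convolutions stride along time only, and fully connected layers map to the posterior parameters. The channel width and exact internal configuration are illustrative assumptions, not the exact SpeechVAE settings.

```python
import torch
import torch.nn as nn

class AudioEncoderSketch(nn.Module):
    """Encodes a mel segment (1, T=20, F=80) into Gaussian posterior parameters."""
    def __init__(self, n_mels=80, latent_dim=128, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            # 1 x F filter collapses the frequency axis in one step.
            nn.Conv2d(1, channels, kernel_size=(1, n_mels)),
            nn.BatchNorm2d(channels), nn.ReLU(),
            # 3 x 1 filters operate along time only; striding is temporal.
            nn.Conv2d(channels, channels, kernel_size=(3, 1), stride=(2, 1), padding=(1, 0)),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(3, 1), stride=(2, 1), padding=(1, 0)),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )
        # 20 time steps -> 10 -> 5 after two stride-2 convolutions.
        self.fc_mu = nn.Linear(channels * 5, latent_dim)
        self.fc_logvar = nn.Linear(channels * 5, latent_dim)

    def forward(self, x):                 # x: (N, 1, 20, 80)
        h = self.conv(x).flatten(1)       # (N, channels * 5)
        return self.fc_mu(h), self.fc_logvar(h)
```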
3.2 Structuring the Audio Encoding

To generate temporally smooth visualization of the phonemes in input utterances, we desire our latent space to change smoothly in time and disentangle the style, such as gender and dialect, from the content conveyed in phones.
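One simple way to encourage the desired temporal smoothness, sketched here as an assumption rather than the exact loss used in this thesis, is to penalize the velocity and acceleration of consecutive latent codes, the quantities reported in Table 4.1.

```python
import torch

def smoothness_penalty(z):
    """z: (T, d) latent codes of consecutive frames.

    Penalizes large first differences (speed) and second differences (acceleration)
    so that neighbouring frames move smoothly through the latent space.
    """
    vel = z[1:] - z[:-1]
    acc = vel[1:] - vel[:-1]
    return vel.pow(2).mean() + acc.pow(2).mean()
```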
Disentangling content and style. We construct a SpeechVAE that disentangles the style (speaker identity) from the content (phonemes) in the latent encodings, i.e., the latent encoding $z = [z_1, \dots, z_d]^T \in \mathbb{R}^d$ can be separated into a style part $z_s = [z_1, \dots, z_m]^T$ and a content part $z_c = [z_{m+1}, \dots, z_d]^T$, where $d$ is the dimension of the whole audio latent space and $m$ is the dimension of the audio style latent space. To encourage such disentanglement we use an audio dataset with phone and speaker ID annotation. As we show in Figure 3.2, at training time we feed triplets of mel-spectrogram segments $T_{a,b,i,j} = \{M_{a,i}, M_{b,i}, M_{a,j}\}$, where $M_{a,i}$ and $M_{b,i}$ are the same phoneme sequence $p_i$ spoken by different speakers $s_a$ and $s_b$ respectively, and $M_{a,j}$ shares the speaker $s_a$ with the first segment but contains a different phoneme sequence. Each element of the input triplet is encoded individually by $E_A$, forming the latent triplet $\{z_{a,i}, z_{b,i}, z_{a,j}\} = \{[z_{s_a}, z_{c_i}]^T, [z_{s_b}, z'_{c_i}]^T, [z'_{s_a}, z_{c_j}]^T\}$. Instead of reconstructing the inputs from the corresponding latent encodings as in an autoencoder, we reconstruct the first sample $M_{a,i}$ from a recombined latent encoding of the other two, $z'_{a,i} = [z'_{s_a}, z'_{c_i}]^T$. Formally, we replace the reconstruction loss term in the VAE objective with a recombined reconstruction loss term,
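A loose sketch of this recombination step is given below; the encoder/decoder interfaces and the split index m are assumptions, the posterior mean is used as the latent code, and a plain MSE stands in for the likelihood term of the VAE objective.

```python
import torch
import torch.nn.functional as F

def recombined_reconstruction_loss(encoder, decoder, M_ai, M_bi, M_aj, m):
    """Reconstruct M_ai from the style of z_aj and the content of z_bi.

    M_ai, M_bi: same phoneme sequence, spoken by speakers a and b.
    M_aj:       speaker a, different phoneme sequence.
    m:          size of the style part of the latent vector.
    """
    z_bi, _ = encoder(M_bi)     # content source (z'_ci)
    z_aj, _ = encoder(M_aj)     # style source   (z'_sa)
    z_recombined = torch.cat([z_aj[:, :m], z_bi[:, m:]], dim=1)
    M_hat = decoder(z_recombined)
    return F.mse_loss(M_hat, M_ai)
```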