AudioViewer: Learning to Visualize Sound

by

Yuchi Zhang

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHEREQUIREMENTSFORTHEDEGREEOF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2020

c Yuchi Zhang, 2020 The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

AudioViewer: Learning to Visualize Sound submitted by Yuchi Zhang in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee: Helge Rhodin, Computer Science Supervisor Kwang Moo Yi, Computer Science Examining Committee Member

ii Abstract

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound for hearing impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design is to translate from audio to video by compressing both into a common latent space with a shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video and that generated videos of faces and numbers are well suited for visualizing high- dimensional audio features since they can easily be parsed by humans to match and distinguish between sounds, words, and speakers.

iii Lay Summary

We develop AudioViewer, an automated that translates speech into videos of faces or hand-written numbers. Without audio feedback, it’s challenging for hearing impaired people to learn speech. To help with the situation, we propose AudioViewer to generate videos from utterances as a visual feedback in substitution. Our current work is a proof of concept. AudioViewer is designed to generate consistent and recognizable visualization of pronunciation. For example, the word ‘water’ will be visualized as a short animation of a portrait of a young brunette lady with a short hair style facing rightward with a neutral expression, which is gradually transforming into a smiling young blonde lady facing forward. From our preliminary user study, we demonstrate that people can tell the difference between the visualization of different pronunciations and recognize the consistency of the identical ones.

iv Preface

The work presented here is original work done by the author Yuchi Zhang in collaboration with Willis Peng and Dr. Bastian Wandt under the supervision of Dr. Helge Rhodin. A shorter version of this work is under review as a conference submission. Design, implementation and experiments were done by Yuchi Zhang. Dr. Helge Rhodin provided suggestions and feedback during the development and Willis Peng helped with the user study evaluation. The initial draft of the paper is done by all of us and revised by Dr. Bastian Wandt. The user studies ran for this work have been approved by the UBC Behavioural Research Ethics Board (Certificate Number: H19-03006)

v Table of Contents

Abstract ...... iii

Lay Summary ...... iv

Preface ...... v

Table of Contents ...... vi

List of Tables ...... viii

List of Figures ...... x

Acknowledgments ...... xiii

1 Introduction ...... 1

2 Related Work ...... 4

3 Method ...... 9 3.1 Audio Encoding ...... 10 3.2 Structuring the Audio Encoding ...... 11 3.3 Image Encoding for Video Generation ...... 13 3.4 Linking Audio and Visual Spaces ...... 14

4 Experiments ...... 16 4.1 Temporal Smoothness Effectiveness ...... 18 4.2 Information Throughput ...... 20

vi 4.3 Visual Quality and Phonetics ...... 22 4.4 Visual Domain Influence ...... 22 4.5 Results on Environment Sounds ...... 23 4.6 User study ...... 25 4.6.1 Design of the questionnaire ...... 25 4.6.2 Results ...... 28

5 Limitations and Future Work ...... 31

6 Conclusion ...... 35

Bibliography ...... 36

vii List of Tables

Table 4.1 Audio encoding analysis. The average SNR of reconstructed amplitude Mel spectrogram on the TIMIT test set and the average speed and acceleration between the latent vector (dim = 128) of neighbouring frames (∆t = 0.04s). The SNR of the autoencoder drops with the additional constraints we enforce (smoothness and disentanglement). This is a trade-off for improving velocity and acceleration properties as well as for inducing interpretability. 18 Table 4.2 Throughput lower bound. The throughput from audio to video with AudioViewer can be bounded by mapping back to the audio. It is evident that the cycle consistency improves greatly while

the enforcement of additional constraints with Lp,log MSE, and

Lrr, takes only a small dip...... 21 Table 4.3 Information throughput on environment sounds, showing that the reconstruction error increases when evaluating on a test set that contains sounds vastly different from the training set (speech vs. environment sounds)...... 24 Table 4.4 Audio VAE Mel spectrum reconstruction. The average SNR of autoencoding and decoding Mel spectrograms on ESC-50 shows a significant reconstruction loss.The average speed and acceleration between the latent vector (dim = 128) of neigh- bouring frames (∆t = 0.04s) confirms the experiments in the main document, that smoothness comes at the cost of lower reconstruction accuracy...... 24

viii Table 4.5 A breakdown of the questionnaire format by sound type and tested factors, listing their frequency of occurrence...... 27 Table 4.6 User study results. Values indicate mean accuracy and standard deviation for distinguishing between visualizations of the tested factor across participants as a percentage...... 28

ix List of Figures

Figure 1.1 AudioViewer: A tool developed towards the long-term goal of helping hearing impaired users to see what they can not hear. We map an audio stream to video and thereby use faces or numbers for visualizing the high-dimensional audio features intuitively. Different from the principle of lip reading, this can encode general sound and pass on information on the style of spoken language, for instance, the timbre of the speaker. . . . 2

Figure 2.1 An illustration of harmonics and formants of speech voice in a filter bank image. The above mel-scaled power spectro- gram shows the word ‘party’ spoken by a male speaker. The vertical axis indicates mel-scaled filter banks and the horizontal axis is time. The narrow stripes in the spectrogram are the harmonics which are the integral multiples of the fundamental

frequency (F0), reflecting the vibration of the vocal fold. Note

that the change in F0 reflects the intonation during speech. The wide band peaks are the formants, reflecting the resonant fre- quencies of the vocal tract. Formant patterns are different when differnt vowels (e.g., aa, r and iy) are spoken...... 5

Figure 3.1 Overview. A joint latent encoding is learned with audio and video VAEs that are linked with a cycle constraint and aug- mented with a smoothness constraint (∆v annotation). . . . . 10

x Figure 3.2 Disentanglement, by mixing content encodings from two speak- 0 ers saying the same phones (zci with zci ) and style encodings 0 from the same speaker saying different words (zsa with zsa ). . 12 Figure 3.3 Cycle constraint. We link audio and video latent spaces with a cycle constraint that ensures that the audio signal content is preserved through video decoding and encoding...... 14

Figure 4.1 Smoothness evaluation by plotting the latent embedding of an audio snipped in 3D via dimensionality reduction. Left: PCA baseline model, consecutive audio sample embeddings are scattered. Middle: Similar performance of our model without temporal smoothness constraint. Right: Our full model leads to smoother trajectories...... 19 Figure 4.2 Throughput. The information throughput can be measured by mapping from audio to video and back. Playing the depicted Mel Spectrogram shows that the speech is still recognizable. . 20 Figure 4.3 Word analysis. Instances (different rows) of the same words are encoded with a similar video sequence (single row), yet the show pronunciation variations if spoken by different speakers. 22 Figure 4.4 Phoneme similarity analysis. The same phonemes occurring at the beginning of different words are mapped to visually similar images (colums) while different speakers have minimal influence (rows)...... 23 Figure 4.5 Environment sounds visualization. Although not trained for it, our approach also visualizes natural sounds. Here showing a clear difference between chirp and alarm clock. The visualiza- tion looks brighter and intenser when a target sound occurs. . 25 Figure 4.6 Matching Example. Examples from the CelebA-combined and MNIST-content questionnaire of a matching question asked to participants in the user study...... 26 Figure 4.7 Grouping Example. Examples from the CelebA-combined and MNIST-content questionnaire of a grouping question asked to participants in the user study...... 26

xi Figure 4.8 Model Comparison on content questions. The mean user accuracy and standard deviation for distinguishing phoneme sequences is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, * for p < 0.05, ** for p < 0.01). The mean accuracy for random guessing is 0.384. . 29 Figure 4.9 Model Comparison on style questions. The mean user accu- racy and standard deviation for distinguishing speaker sex (left) and dialect(right) is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, *** for p < 0.001). The mean accuracies for random guessing are 0.43 and 0.5 respectively...... 30

Figure 5.1 Recombined reconstruction sample. Top left: The Mel spec- trograms of a sentence spoken by a female speaker and its re- construction by our audio model with disentanglement training. Top right: The original and reconstructed Mel spectrogram of another sentence spoken by a male speaker. Bottom: The Mel spectrograms of generated by recombined latent encodings. The above one is decoded from the content latent encodings of the first utterance combined with the speaker latent encoding of the male speaker of the second utterance. The below one is the sencond sentence with the female speaker in first sentence.

Due to a higher F0 of the female speaker, the harmonics in the original and reconstructed Mel spectrograms are seperated farther apart than the ones of the male speaker, which can also be also noticed in the recombined reconstructed Mel spectrogram. 34

xii Acknowledgments

This work is funded by University of British Columbia(UBC) Department of Com- puter Science "Start up funds". I would also like to acknowledge the computing resources provided by Compute Canada. I would like to thank Dr. Helge Rhodin for supervising my thesis. I really appreciate all your support, direction, and patience. It is a great honor to work with you. I would also like to thank Dr. Kwang Moo Yi for agreeing to be the second reader of the thesis. Your suggestions, scrutiny, and feedback made this thesis a better one. To Willis Peng and Bastian Wandt, I really appreciate all your essential con- tributions to this project. Thanks Farnoosch Javadi and Tanzila Rahman for the feedback to our conference submission. I would like to acknowledge all the participants of our user study for the participation and feedback, including but not limited to Frank Yu, Xingzhe He, Daniel Ajisafe, Shih-Han Chou, Wei Jiang, Lixin Liao, Kevin Ta, Dr. Ulrike Pestel-Schiller, Ge Fang, and Xuanyi Zhao. I would like to thank Dr. Dinesh Pai for all the support and patience as my advisor in the first year. It helped me a lot during my first year. I appreciate all the wonderful instructors and colleagues I have worked with as a teaching assistant. Thanks for your help. I really enjoyed the time working with you. To the labmates from groups of Helge’s and Dinesh’s, thanks for all the discus- sion and help. To the new friends I knew here, especially Jan Hansen, Jolande Fooken, and Yihan Zhou. I did not only enjoy playing sports and games with you, but also

xiii appreciate all the help and suggestions from you. I would also like to appreciate Xuanyi Zhao and Ge Fang for their accompany during this hard time of COVID-19. I enjoyed the time with you. I apprecate all the care and support from my family and friends I have known for years. You are where I am from. To Xiuyun Wu, I could not make it here without you.

xiv Chapter 1

Introduction

Humans perceive their environment through diverse channels, including vision, hearing, touch, taste, and smell. Impairment in any of these modals of perception can lead to drastic consequences, such as challenges in learning and communication. Various approaches have been purposed to substitute lost senses, from cochlear implants[34] to attempts to brain machine interfaces (BMIs) that directly interface with neurons in the brain(e.g., recently popularized [64]). One of the least intrusive approaches is to substitute auditory perception with visual modal, which is challenging due to the different characteristics of the two domain. Information in the auditory domain is usually high-frequency and low- parallel, while it is low-frequency and high-parallel in visual domian. In this paper, we make a first step to visualize audio with natural images in real time, forming a live video that characterizes the entire audio content from an acoustic perspective. It could be seen as an automated form of translation, with its own throughput, abstraction, automation, and readability trade-offs. Figure 1.1 illustrate the pipeline of our method. Such a tool could help to perceive sound and also be used to train deaf speech, as immediate and high-throughput feedback on pronunciations since it is difficult to learn from traditional visual tools for the hearing-impaired speech learning that rely on spectrogram representations [18, 54, 92, 93]. Our long-term goal is to show that hearing-impaired users can learn to perceive sound through the visual system. Similar to learning to understand sign language and likewise an unimpaired child learns to associate meaning to seen objects and

1 dg ol etasae otetx dg.Acnevbeetnini ofurther to is extension conceivable A ’dog’. text the to translated be would ’dog’ GN)[ (GANs) iulfebc pc n paiglandb riuaigsud htreproduce that sounds articulating by the learned exploring speaking by and practiced space be feedback could visual articulation Then semantic. its to visualization adde o oti n oa edako style or [ feedback speakers vocal female any is and contain translation male not a between does differences such and however, phonetics semantics, terms of in terms indirect in still intuitive is This audio. the networks adversarial generative conditional using image an into word the translate [ system recognition a with words to speech also safety[ translate one’s are to which related sounds environment are they to fraction when apply a important not contains does only and and information speaking audio of the result of a rather is motion lip natural ever, from expressions facial [ lip-syncing audio and spoken dubbing abstraction digital visual for which Methods of best. question is open level an is It structure). spatial (time-frequency to representation characteristics and video) high-dimensional (one- to dimensions different audio vastly dimensional with domains these across mapping one-to-one ical parent. or teacher the by produced as feedback visual same the sound our from mapping a learn to have will user the words, and sounds perceived Audio signal iue11 AudioViewer: 1.1: Figure teei ocanon- no is there setup, our in as, difficult is video to signal audio an Translating o ntne h ibeo h speaker . the of timbre on the information instance, on for pass language, and spoken sound of general style encode the can this reading, principle lip the of from the Different visualizing intuitively. for an features numbers map or audio We faces high-dimensional hear. use thereby not and can video they to what stream audio see to users impaired hearing helping 24 Mel Spectrogram .Adgiaecudb hw hntewr dg srcgie in recognized is ’dog’ word the when shown be could image dog A ]. 16 , 75 , 91 Audio encoder ol entrl loigda epet i ed How- read. lip to people deaf allowing natural, be would ] Latentspace trajectory oldvlpdtwrsteln-emga of goal long-term the towards developed tool A

2 Visual decoder Translatedframesvideo 26 67 (face decoder) nte ieto ol eto be could direction Another ]. , 79 66 , 97 ,freapetespoken the example for ], or .Mroe,eet and events Moreover, ]. Translatedframesvideo (numberdecoder) ambient sounds that can not be indicated by a single word are ill-represented, such as the echo of a dropped object or the repeating beep of an alarm clock. To overcome these limitations, we seek an intuitive and immediate visual representation of the high-dimensional audio data, inspired by early techniques from the 1970s that visualize general high-dimensional data with facial line drawings with varying length of the nose and curvature of the mouth [10], which are shown easier and faster to be recognized by human users[42]. Instead of using computer- drawn cartoon faces in the previous studies, we use learning-based generative models trainined on natural images to generate more realistic human faces with more features so that more information can be visualized. In this paper, we design a real-time audio to video mapping leveraging ideas from computer vision and audio processing for unpaired domain translation. The technical difficulty lies in finding a mapping that is expressive yet easy to perceive. We base our method, AudioViewer, on four principles. First, humans are good at recognizing patterns and objects that they see in their environment, particularly human faces [1, 10, 42, 47, 83]. We therefore map sound to videos of natural images and faces. Second, quick and non-natural (e.g., flickering) temporal changes lead to disruption and tiredness [78]. Hence, we enforce temporal smoothness on the learned mapping. Third, audio information should be preserved in the generated visualization. We therefore exploit cycle consistency to learn a joint structure between audio and video modalities. Fourth, we disentangle style, such as gender and dialect, from content, individual phones and sounds, so that user can easier recognize, for example, the phonetic aspect from the visualization of an utterance. Our core contributions are a new approach for enabling persons to see what they can not hear; a method for mapping from audio to video via linked latent spaces; a quantitative analysis on the type of target videos, different latent space models, and the effect of smoothness constraints; and a perceptual user study on the recognition capabilities. We demonstrate the feasibility of AudioViewer with a functional prototype and analyze the performance by quantifying the loss of information content as well as showing through a user study that words and pronunciation can be distinguished from the generated video features.

3 Chapter 2

Related Work

In the following section, we first introduce the fundamental speech acoustic theories and relate our approach in context with existing assistive systems. Then, we present the overview of the studies about and audio visualization. Finally, we review the literature on audio and video generation with a particular focus on cross-modal models.

Speech acoustics. According to the source-filter theory of speech production[19, 45], vocal fold vibration is usually viewd as the source of sound in vowels, and the vocal tract is an acoustic filter which modifies the sound produced by vocal folds. As shown in Figure 2.1, given a power spectrogrogram of a voicing waveform, the fundamental frequency (F0) is defined as the first peak (lowest-frequency) in the sprectrogram, and the harmonic components are the multiples of the fundamental frequency, which reflects the vibration of the vocal folds and partially determined the timbre of one’s voice. Therefore, harmonic features can be used in speaker identification [41]. Due to the resonant fequencies of the vocal tract, harmonics in the voice source will be filtered in a band-passed way and the peaks in the envelop of the filtered power spectrum are formants. The envelop is determined by the configuration of the vocal tract and is important to phoneme recognition. Vowels, for instance, can be identified from the formant patterns [29].

Deaf speech support tools. Improvements in speech production of the hearing- impaired people have been achieved through non-auditory aids and these improve-

4 ], which aim at 93 ] and haptic learn- , 51 92 , , 28 54

Harmonics 0 … 퐹 5 ), reflecting the vibration of the vocal fold. 0 reflects the intonation during speech. The F 0 F ] is as effective at improving speech production as ] demonstrate that visual feedback from the speech

79 The above mel-scaled power spectrogram shows

18

Formant s ]. While electrophysiological [ ) are spoken. 97 , iy 79 , and 67 r ] have demonstrated efficacy for improving speech production, such , 17 mel-scaled filter banks and the horizontalin axis the is spectrogram time. are The the narrowthe harmonics stripes fundamental which frequency are ( the integralNote multiples of that the change in the vocal tract. Formant patternsaa are different when differnt vowels (e.g., a filter bank image. the word ‘party’ spoken by a male speaker. The vertical axis indicates wide band peaks are the formants, reflecting the resonant frequencies of Figure 2.1: An illustration of harmonics and formants of speech voice in spectrographic display (SSD) [ feedback from a speech-language pathologist. Alternativefrom graphical transformed plots spectral generated data have beenimproving explored upon by spectrograms [ by creating plots that are more distinguishable with ments last beyond learning sessions and generalizethe to sessions words [ not encountered during ing aids [ techniques can be more invasive, especially for young children, as compared to visual aids. Elssmann et al. [ respect to speech parameters. Other methods aim at providing feedback by explicitly estimating vocal tract shapes [68]. In addition, Levis et al. [57] demonstrate that distinguishing between discourse-level intonation (intonation in conversation) and sentence-level intonation (intonation of sentences spoken in isolation) is possible through speech visualization and argues that deaf speech learning could be further improved by incorporating the former. SpeechViewer[5, 40, 90], a computer-based speech training (CBST) system developed by IBM , is designed to help hearing- impaired children practice phoneme pronunciation with variant visualizations of specified speech parameters and games that controled by specified phonemes[14]. The efficacy of SpeechViewer is observed on children or teeanage individuals with hearing impairement, autism children or dysarthria [6, 72, 85], but its limitations, such as inaccurate feedback on some phonetic features are also noticed. Visualizing high dimensional data with face is found intuitive to human users[10, 42]. Instead of visualizing hand-crafted speech featurs with limited elements in a computer-drawn cartoon face, our method generate faces with deep genrative models trained on natural images which contain more information. Specificly, our image generation approach extends these spectrogram visualization techniques by leveraging the generative ability of variational autoencoders (VAEs) in creating a mapping to a more intuitive and memorable video representation. It is our hope that improved visual aids will lead to more effective learning in the future.

Sensory substitution and audio visualization. Related to our work is the field of sensory substitution, whereby information from one modality is provided to an individual through a different modality. While many sensory substitution methods focus on substituting visual information into other senses like tactile or auditory stimulation to help visual rehabilitation [23, 37, 60], few methods target substituting auditory modal with visualization. The idea is feasible from the perspective of cog- nitive . Deaf individuals are found to perform better in various visual tasks than hearing controls.[2, 62, 63]. Moreover, visual cross-modal activation of auditory cortex induced by non-linguistic visual stimuli, which is absent with hear- ing participants, has been observed according to the studies[20, 21], indicating cross-modal recruitment of hearing-impaired people. Although ASL is viewed as a successul sensory substitution system which translates information

6 usually presented in auditory to visual modal[3], it is essentially diffenrent to spoken languages that are developed in the audio-vocal modality. Audio visualization is another field related to our work. Music visualization works generate visualiza- tions of songs, so that user can browsing songs more efficiently without listen to them [82, 95]. On the learning side, [96] visualize the intonation and volume of each word in speech by the font size, so that user can learn narration strategies. Different from the works mentioned above, our model tries to visualize speech in phoneme level with deep learning models instead of hand-crafted features.

Classical audio to video mappings. One of the major applications of generat- ing video from speech is digital dubbing or synthesizing facial expressions that match the speech. Recent approaches employ deep generative model like GANs to synthesis realistic videos of people speaking [16, 75, 91]. Other approaches try to reconstruct facial attributes of the speaker, such as gender and ethnicity, from their utterances or map music to facial or body animation [48, 77, 80, 84]. By contrast to our setting, all of these tasks can be supervised by learning from videos with audio lines, e.g., a talking person where the correspondence of lip motion, expressions and facial appearance to the spoken language is provided to train the relation between sound and mouth opening. Similar to the unsupervised digital dubbing approach trained with cycle consistency[52], in this work, we go beyond manual supervision, and opt for an unsupervised translation mechanism with cycle consistency. Moreover, we can map sound to non-facial features such as background and brightness.

Audio and visual generation models. Deep generative models take many forms such as generative adversarial networks (GANs) [24], variational autoencoders (VAEs) [53], and auto-regressive models[88]. The highest image fidelity is attained with hierarchical models that inject noise and latent codes at various network stages by changing their feature statistics [39, 49]. One of the state-of-the-art methods in face generation, StyleGAN[49], as well as its updated version StyleGAN2[50], can learn disentangled latent distributions with a non-linear mapping network. The vector quantized variational autoencoder (VQ-VAE-2) [74], another state-of-the-art method, can generate a diverse range of images with high fidelity comparable to GAN models with a hierachy of vector quantized codes. Considering the ultimate

7 goal of our project as a visualization, however, fidelity of the generation is not the main concern. Hence, we choose the deep feature consistent variational autoencoder (DFC-VAE)[33] which is trained with feature losses to improve the perceptual consistency in reconstruction. As a VAE with a relatively shallow structure, DFC- VAE is faster and more stable to train comparing to the deeper and more complicated models. For audio, unlike traditional speech analysis where researches focus on hand crafting features such as LPC coefficients, Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) coefficients, etc.[9, 30], recent learning- based learn the features from the speech dataset directly. Although few methods operate on the raw waveform [46], it is more common use spectrograms and to apply convolutional models inspired by the ones used for image generation [8, 15, 36]. We use SpeechVAE[36] as the audio backbone model to learn important audio features.

Cross-modal latent variable models. CycleGAN and its variations [81, 98] have seen considerable success in performing cross-modal unsupervised domain transfer, for example in medical imaging [31, 87] and audio to visual translation [27]. How- ever they often encode information as high frequency signal that is invisible to the human eye and are susceptible to adversarial attacks [12]. An alternative approach involves training a VAE, subject to a cycle-consistent constraint [44, 94], but these works have only been applied to domain transfers within a single modality. Most similar work to ours is the joint audio and video model proposed by Tian et al. [86], which uses a VAE to map between two cross-modal latent spaces using supervised alignment of attributes. Their method, however, only operates

8 Chapter 3

Method

Our goal is to translate a one-dimensional audio signal A = (a0, ··· , aTA ) into

a video visualization V = (I0, ··· , ITV ), where ai ∈ R are sound wave samples recorded over TA frames and Ii are images representing the same content over TV frames. In our case, we do not have a ground-truth answer to relate to our input data, hence supervised training is not possible. Instead, we utilize individual images and audio snippets without correspondence and learn a latent space Z that captures the importance of image and audio features and translates between them with minimal loss of content. Figure 3.1 shows an overview of the proposed pipeline. The audio

encoder EA yields latent code zi ∈ Z and the visual decoder DV (zi) outputs a

corresponding image Ii. This produces a video representation of the audio when applied sequentially. We start by learning individual audio and video models, and subsequently refined together with a cycle constraint.

Matching sound and video representation. A first technical problem lies in the higher audio sampling frequency (16 kHz), that prevents a one-to-one mapping to 25 Hz video. We follow common practice[36, 89] and represent the one- dimensional sound wave with a perceptually motivated mel-scaled spectrogram, F M = (m1,..., mTM ), mi ∈ R , where F = 80 is the number of filter banks. It is computed via the short-time Fourier transform with a 25ms Hanning window with 10 ms shifts. The resulting coefficients are converted to decibel units flooring at -80dB. Because the typical mel-spectrogram still has a higher sampling rate (100

9 where with ie frame. video AdoEncoding Audio 3.1 hti ieyue o eeaiesudmdl.Fg .,lf,sece o h mel- the how sketches left, 3.1, Fig. models. sound generative for used widely is that architecture. network Audio form parametric a have These respectively. posterior domain output and code normal standard a [53]: be objective VAE to the distribution samples latent the also of but Let shape structure, distribution. latent the the control of to representation us compact allow a learn only not do these encoder- independent learning by pairs start decoder we samples, data audio unlabelled Given length of segments overlapping into it chunk covering we video, the than Hz) Audio signal iue31 Overview. 3.1: Figure D ρ KL ie Asta r ikdwt yl osritadagetdwt a with augmented and constraint cycle a with linked are that VAEs video mohescntan ( constraint smoothness and ∆ Mel Spectrogram t 200 h ulakLilrdvrec,and divergence, Kullback-Leibler the , L ω ( s ntefloig eepanhwt a rmmlsgetto segment mel from map to how explain we following, the In ms. ( x E h upto h noe and encoder the of output the = ) x A D ,

easml rmteulble aae.W piieoe all over optimize We dataset. unlabelled the from sample a be Audio encoder − Audio latentcode A Smoothness constraint D ) q p KL φ o on and sound for θ ( ( ∆ on aetecdn slandwt ui and audio with learned is encoding latent joint A z x ( v q | | x φ z eueteSecVEmdlfo s ta.[ al. et Hsu from model SpeechVAE the use We ( = ) = ) z ∆ Audio decoder Mel | x v reconstruction N N ) annotation). k p ( ( ρ µ θ Cycleconstraint 10 ( ( ( ( z E x z )+ )) ) ) V , , D , ω σ µ 2 2 E ( ( and V z x q ) Image φ ) ) q ( I o ie.W s Assince VAEs use We video. for I z φ ) σ ) | ( , x z n (3.2) and ) h upto h decoder. the of output the |

x Visual encoder log ) Visual and p latentcode θ ( p x θ | ( z x )  | z , ) Image Visual decoder h latent the , T M reconstructio 20 = (3.1) (3.3) 36 n ] spectrogram is encoded with an encoder EA of three convolutional layers followed by a fully connected layer which flattens the spatial dimensions. As noted above, the mel-spectrogram M is a two-dimensional array, spanning the time and frequency, respectively. To account for the structural differences of the two, convolutions are split into separable 1 × F and 3 × 1 filters, where F is the number of frequencies captured by M and striding is applied only on the temporal axis. The decoder is symmetric to the encoder. We use ReLU activation and batch normalization layers in the encoder and decoder.

3.2 Structuring the Audio Encoding To generate temporally smooth visualization of the phonemes in input utterances, we desire our latent space to change smoothly in time and disentangle the style, such as gender and dialect, from the content conveyed in phones.

Disentangling content and style. We construct a SpeechVAE that disentangles the style (speaker identity) content (phonemes) in the latent encodings, i.e., the T d latent encoding z = [z1, ··· , zd] ∈ R can be separated as a style part zs = T T [z1, ··· , zm] and a content part zc = [zm+1, ··· , zd] , where d is the whole audio latent space dimension and m in the audio style latent space dimension. To encourage such disentanglement we use an audio dataset with phone and speaker ID annotation. As we show in Figure 3.2, at training time, we feed triplets of mel-spectogram segments Ta,b,i,j = {Ma,i, Mb,i, Ma,j}, where Ma,i and Mb,i are the same phoneme sequence pi spoken by different speakers sa and sb respectively, and Ma,j shares the speaker sa with the first segment but a different phoneme sequence. Each element of the input triplet is encoded individually by EA, forming T 0 T 0 T latent triplet {za,i, zb,i, za,j} = {[zsa , zci ] , [zsb , zci ] , [zsa , zcj ] }, instead of re- constructing the inputs from the corresponding latent encodings in an autoencoder, we reconstruct the first sample Ma,i from a recombined latent encoding of the other 0 0 0 T two, za,i = [zsa , zci ] . Formally, we replace the reconstruction loss term in the VAE objective with a recombined reconstruction loss term,

0  Lrr(T ) = 0 log p (Ma,i|z ) . (3.4) a,b,i,j Eqφ(za,i|Mb,i,Ma,j ) θ a,i

11 Recombined reconstruction loss

퐳풂,풊 퐳풂,풊

decoder

Audio Audioencoder concatenation ′ 퐌푎,푖 퐌푏,푖 퐌푎,푗 퐌푎,푖 Mel spectrogram Audio latent code recombined Triplet recombination reconstruction

Figure 3.2: Disentanglement, by mixing content encodings from two speak- 0 ers saying the same phones (zci with zci ) and style encodings from the 0 same speaker saying different words (zsa with zsa ).

With the recombination reconstruction, the model can only successfully recon- struct the target sa,i by encoding the content-related features of sb,i, which shares 0 the content information pi with the target, to the content part of zb,i (i.e., zci ), and 0 encoding style-related features of sa,j to the style part (zsa ). Therefore, this setup forces the model to learn separate encodings for the style and phoneme information while not requiring additional loss terms. 0 Note that we could alternatively enforce za,i to be close to za,i without decoding (the unused za,i in Figure 3.2). But we do not observe any improvement in either reconstruction scores or convergence rate from preliminary experiments.

Temporal smoothness. The audio encoder has a small temporal receptive field, encoding time segments of 200 ms. This lets encodings of subsequent sounds to be encoded to distant latent codes leading to discrete and abrupt visual changes in the decoding. To mitigate, we add an additional smoothness prior. We experiment with

12 the following negative log likelihoods. We sample a pair of input mel-spectrogram segments {Mi, Mj} at random time steps {ti, tj}, spaced at most 800ms apart, from the same utterance. We test the three different pair loss functions to enforce temporal smoothness in the embedded content vectors. First, by making changes in latent space proportional to changes in time,

2 1 PN  ˆ  Lp,MSE = N i ∆ti − kti,1 − ti,2k , (3.5) where the latent space dimension scale is learned by scalar parameter s, with ∆ˆti = s · kEA(Mi) − EA(Mj)k = s · kzi − zjk for models without style disentanglement, and ∆ˆti = s · kzc,i − zc,jk for the disentangled ones, where the predicted time difference only calculated with the content part of the latent encodings. Second, we try whether enforcing the ratio of velocities to be constant is better,

 ˆ 2 1 PN ∆ti Lp,Q = − 1 , (3.6) N i kti,1−ti,2k with learned scale as before. The same quotient constraint can also be measured in logarithmic scale,

2 1 PN  ˆ  Lp,log MSE = N i log ∆ti − log kti,1 − ti,2k . (3.7)

Our final model uses Lp,log MSE, which behaves better than the other two in our experiments. All objective terms are learned jointly, with weights λcycle = 10 and 3 λp = 10 balancing their influence.

3.3 Image Encoding for Video Generation Audio encoders and video decoders can be made compatible across modalities by using two VAEs with matching latent dimension and prior distribution p(z) over z ∈ Z. We apply the per-frame image DFC-VAE [33], which prioritizes perceptual consistency of reconstuction. The input image is passed through four 4 × 4 convolutional layers followed by a fully-connected layer to arrive at the latent code. The decoder uses four 3 × 3 convolutions. Batch normalization and LeakyReLU layers are applied between convolutions. Besides the per-pixel loss of

13 ecncnaeaeteadoecdrwt h mg eoe o u desired our for to approximations decoder only image are encoders the learned with the However, encoder translation. AudioViewer audio the concatenate can we where h aetdmninmths ota h iulzto a ecnetansi or as content-agnostic long be can as visualization part speaker-agnostic. the the content that to or so applied matches, part dimension be style latent can the the loss only cycle visualize the and loss, models smoothness disentangled the to differences Similar point-wise minimizing position. values in mean on operate modules video the and as space content the on to implemented spirit is It similar back. a in datasets, paired of [ need the posterior without space latent well, audio reconstructed the from are samples additional that an ensures introduce that we loss consistency structures, cycle space latent different bridge To content. of in points all not and distribution true the prior, space latent same the with VAEs video and audio individual Spaces learning Visual By and Audio Linking 3.4 frames. video output the create to sequentially applied is trained but is a images model of individual This space on consistency. feature feature the high-level for in network loss VGG-19 perceptual trained a uses DFC-VAE the VAE, classical the 44 Audio signal iue33 yl constraint. Cycle 3.3: Figure , 94 z .Fgr . hw h ylccann rmadoltn oet ie and video to code latent audio from chaining cyclic the shows 3.3 Figure ]. c yl osritta nue htteadosga otn spreserved is encoding. content and signal decoding audio video the through that ensures that constraint cycle sdanfo h otro ftecnetpart content the of posterior the from drawn is Mel spectrogram L

cycle Audio encoder Audio latentcode = eln ui n ie aetsae iha with spaces latent video and audio link We | E 퐳 1 V 퐳 2 ( Z 14 D r oeldeulywl,laigt loss to leading well, equally modelled are V ( z c ))) Visual decoder Visual − Cycleconstraint z reconstruction E c | A , ( M ) fteadoencoder audio the of Visual encoder Visual 퐳 latentcode 1 ′ 퐳 2 ′ (3.8) mel-sp

15 Chapter 4

Experiments

In the following, we show qualitatively and quantitatively that AudioViewer conveys important audio features via visualizations of portraits or numbers and that it applies to speech and environment sounds.

Quantitative metrics. We evaluate the quality of the latent embeddings by measur- ing the reconstruction error in terms of signal to noise ratio (SNR) of the reconstruc- tion. We measure the smoothness as the change in latent space position.

Perceptual study. We perform a user study to analyze the human capability to perceive the translated audio content. Some results may be subjective and we therefore report statistics over 29 questions answered by 19 users.

Baselines. We compare the proposed coupled audio-visual model and perform an ablation study using individual VAEs for audio and speech generation as well as when disabling the style-content disentangling. We further investigate the latent space priors and training strategies introduced in section 3.2 and 3.4. As an additional baseline, we experiment with principal component analysis (PCA) for building a common latent space. PCA yields a projection matrix W that rotates the training samples ui by zi = Wui to have diagonal covariance matrix Σ and maximal variance. We can use PCA for sound and images, with u being either the flattened mel-segment vec(Mi) or equivalently the flattened video frame vec(Ij). We use the truncated PCA, where W maps input vectors to a d-dimensional space,

16 which essentially encodes the input vector to a latent space. We further scale rows of W such that Σ is the identity matrix I. By this construction, the latent space has the same prior covariance as the VAE spaces. This encoding can be decoded with the transpose W T .

Audio Datasets. We use the TIMIT dataset [22] for learning speech embeddings. It contains 5.4 hours of audio recordings (16 bit, 16 kHz) as well as time-aligned orthographic, phonetic and word transcriptions for 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. We use the training split (462/50/24 non-overlapping speakers) of the KALDI toolkit [71]. The phonetic annotation is only used at training time and as ground truth for the user study. In addition we report how well a model trained on speech generalizes to other sound on the ESC-50 dataset of environmental sounds [70] in the supplemental videos.

Image Datasets. To test what kind of visualization is best for humans to perceive the translated audio, we train and test our model on two image datasets: The face attributes dataset CelebA [58] (162770/19962 images for train/val respectively), the MNIST [55] datasets (60000/10000 images for train/val respectively).

Training and runtime. Speech models without disentanglment are trained for 300 epochs (120,000 iterations) with a batch size of 64. Disentanglement speech models are trained for 600 epochs (165,000 iterations) with a batch size of 64. Note that the total size of triplet dataset is different to the paired dataset due to the requirement of parallel data. CelebA models for 38 epochs and MNIST models for 24 epochs with batch size 144, and joint models are fine-tuned jointly on audio and image examples for 10 audio epochs. AudioViewer is real-time capable and we showcase a demonstration in the supplemental video, the amortized inference time for a frame is 5 ms on an i7-9700KF CPU @ 3.60GHz with a single NVIDIA GeForce RTX 2080 Ti.

17 Table 4.1: Audio encoding analysis. The average SNR of reconstructed am- plitude Mel spectrogram on the TIMIT test set and the average speed and acceleration between the latent vector (dim = 128) of neighbouring frames (∆t = 0.04s). The SNR of the autoencoder drops with the additional con- straints we enforce (smoothness and disentanglement). This is a trade-off for improving velocity and acceleration properties as well as for inducing interpretability.

−1 −2 Audio models λp SNR (dB) Velocity (s ) Acceleration (s ) Audio PCA - 23.37 329.02 13395.11 SpeechVAE w/o Lp - 21.90 280.09 11648.17 1 16.74 231.91 9490.04 3 SpeechVAE w/ Lp,log MSE 10 19.86 108.89 3052.19 107 11.43 117.43 2674.86 109 7.77 76.50 1735.97 1 18.83 234.19 9595.15 3 SpeechVAE w/ Lp,Q 10 18.44 119.94 3133.26 107 11.35 123.78 2476.44 109 9.00 119.60 2405.18 1 17.89 243.55 10067.36 3 SpeechVAE w/ Lp,MSE 10 18.15 226.74 9252.76 107 12.64 129.10 4192.21 109 10.56 84.05 2709.96 SpeechVAE w/ Lrr - 7.73 80.80 2947.70 3 SpeechVAE w/ Lp,log MSE, Lrr 10 6.20 59.78 1530.81 3 SpeechVAE w/ Lp,log MSE, Lrr, dim=256 10 6.28 65.63 1757.95

4.1 Temporal Smoothness Effectiveness As it is mentioned above, we evaluate the encoding and generation performance of the models with reconstruction SNR (encoding followed by decoding with the audio VAE) and we evaluate the temporal smoothness with the speed and acceleration in the latent space. Although in Table 4.1, the PCA baseline has the best reconstruction SNR, the high velocity and acceleration in the latent space make the generated videos flickering too much to be applicable as a clear visualization. Due to the low SNR of the disentangled model, here we report the ablation ex- periment on λp without the disentangled training to better illustrate the performance of the two loss terms. Table 4.1 shows the reconstruction SNR on the test set with models trained with different temporal smoothness loss functions and λp. The anal- ysis shows that 4.1, Lp,log MSE and Lp,Q are more sensitive to λp and retain more

18 Baseline-PCA Ours w/o smoothness Ours (log smoothness) 3D MDS embedding

Baseline-PCA Ours w/o smoothness Ours (log smoothness) 3D Isomap

Figure 4.1: Smoothness evaluation by plotting the latent embedding of an audio snipped in 3D via dimensionality reduction. Left: PCA baseline model, consecutive audio sample embeddings are scattered. Middle: Similar performance of our model without temporal smoothness con- straint. Right: Our full model leads to smoother trajectories.

information while learning temporal smoothness. The SpeechVAE w/ Lp,logMSE 3 with λp = 10 is the best performing model that was selected for disentangled model. The advantage of disentengled model as visualization, is illustrated in the Section 4.6 We also visualize the gain in smoothness in Figure 4.1 by plotting the high- dimensional latent trajectory embedded into three dimensions using ISOMAP [4] and multi-dimensional scaling (MDS) [13]. The gain is similar for all of the proposed smoothness constraints. The supplemental videos show how the smoothness eases information perception.

19 weights iulVE r rie neednl rrfie ihteadooe sn different using ones audio the with refined or independently trained are VAEs visual inSRidctsahge aaiyo nomto ffc iulzto hnthe than visualization face of ones. information figure of capacity higher reconstruc- a cycled indicates higher SNR the tion expected, have we as Moreover, version. disentangled λ the where versions compare We individually. throughput its measuring by coder lower a having to due perhaps encoding. loss, audio cycle the the to with compared well dimensionality fair not does from and reduction train a to relatively SNR, a the has 4.2) on Table dimensions. toll of other low block the (lower along space disentangled strike the not in does Interpretability and properties but smoothness accuracy poor reconstruction highest has the attains and PCA original samples. the audio comparing reconstructed and listening by an be qualitatively gives can analyzed difference 4.2 and The Figure quantified quantitatively. relations audio. summarizes as reconstructed 4.2 information the Table of and to example loss point the starting quantify the bound us of lower lets distance a pattern as the back cyclic and This video throughput. to the audio the from of no use map can to as we decoder However, video and expensive. to encoder are learned audio studies user from and throughput available is information truth the ground quantify to difficult is It Throughput Information 4.2 c Input iue42 Throughput. 4.2: Figure 10 = eas nlz h feto h diinlcntanso h ui autoen- audio the on constraints additional the of effect the analyze also We Mel spectrogram λ rga hw httesec ssilrecognizable. Spec- still Mel is depicted speech the the Playing that back. shows and trogram video to audio from mapping ok ettgte ihtebt eprllse n sslce o the for selected is and losses temporal both the with together best works c o h yl os repcieo h iuldmi Clb rMNIST), or (CelebA domain visual the of Irrespective loss. cycle the for

Audio encoder

Visual decoder Correspondingsynthesized images h nomto hogptcnb esrdby measured be can throughput information The 20 4 . 43 to 4 . 16

NS rvdls stable less proved MNIST . Visual encoder

Audio decoder Reconstructed Mel spectrogram Table 4.2: Throughput lower bound. The throughput from audio to video with AudioViewer can be bounded by mapping back to the audio. It is evident that the cycle consistency improves greatly while the enforcement of additional constraints with Lp,log MSE, and Lrr, takes only a small dip.

Audio models Visual models λc SNR(dB) Audio PCA Visual PCA - 23.37 DFC-VAE on CelebA - 1.65 DFC-VAE on MNIST - 2.01 SpeechVAE [36] DFC-VAE on CelebA (refined w/ Lcycle) 10 4.43 DFC-VAE on MNIST (refined w/ Lcycle) 10 0.78 DFC-VAE on CelebA - 1.51 DFC-VAE on MNIST - 2.16 DFC-VAE on CelebA (refined w/ Lcycle) 1 4.86 SpeechVAE w/ Lp,log MSE DFC-VAE on MNIST (refined w/ Lcycle) 1 4.54 3 (λp = 10 ) DFC-VAE on CelebA (refined w/ Lcycle) 10 11.81 DFC-VAE on MNIST (refined w/ Lcycle) 10 3.98 DFC-VAE on CelebA (refined w/ Lcycle) 100 10.51 DFC-VAE on MNIST (refined w/ Lcycle) 100 5.22 DFC-VAE on CelebA - 1.96 DFC-VAE on MNIST - 1.87 DFC-VAE on CelebA (refined w/ Lcycle) 1 5.95 SpeechVAE w/ Lp,Q DFC-VAE on MNIST (refined w/ Lcycle) 1 7.06 3 (λp = 10 ) DFC-VAE on CelebA (refined w/ Lcycle) 10 12.25 DFC-VAE on MNIST (refined w/ Lcycle) 10 3.12 DFC-VAE on CelebA (refined w/ Lcycle) 100 8.71 DFC-VAE on MNIST (refined w/ Lcycle) 100 5.26 DFC-VAE on CelebA - 0.84 SpeechVAE w/ Lp,log MSE, DFC-VAE on MNIST - 0.81 Lrr, dim=256 DFC-VAE on CelebA (refined w/ Lcycle) 10 4.16 DFC-VAE on MNIST (refined w/ Lcycle) 10 3.68

3 Out of the models with smoothness, Lp,log MSE with λp = 10 attains the highest reconstruction quality (SNR) and lowest latent space velocity calculated with finite differences. The disentanglement reduces audio reconstruction significantly. However, this has limited effect on the overall throughput, as analyzed before, likely due to bottlenecks in the visual model dominating the overall accuracy.

21 Figure 4.3: Word analysis. Instances (different rows) of the same words are encoded with a similar video sequence (single row), yet the show pronunciation variations if spoken by different speakers.

4.3 Visual Quality and Phonetics A meaningful audio translation should encode similar sounds with similar visuals. Figure 4.4 depicts such similarity and dissimilarity on the first 200 ms of a spoken word since these can be visualized with a single video frame. This figure compares the content encoding to MNIST and faces, both are well suited to distinguish sounds. The same word (column) has high visual similarity across different speakers (F#: female, M#: male; # the speaker ID) and dialects (D#) while words starting with different sounds are visually different. Multiple frames of entire words are shown in Fig. 4.3. Our user study and supplemental document analyzes their differences in more detail and on entire sentences.

4.4 Visual Domain Influence Earlier work [10, 42] suggests that faces are suitable for representing high-dimensional data. Our experiments support this finding in that the measurable information con- tent is larger. The SNR of the jointly trained CelebA models is larger than the MNIST models. Also the Figure 4.4 indicates that facial models are richer. The

22 Figure 4.4: Phoneme similarity analysis. The same phonemes occurring at the beginning of different words are mapped to visually similar images (colums) while different speakers have minimal influence (rows). distinctiveness is further analyzed in the user study evaluation.

4.5 Results on Environment Sounds With the style disentangling training of our AudioViewer method, the audio model is more specific to human speech than general sounds. In the following, we test the

23 Table 4.3: Information throughput on environment sounds, showing that the reconstruction error increases when evaluating on a test set that con- tains sounds vastly different from the training set (speech vs. environment sounds).

Audio models Visual models SNR(dB) Audio PCA Visual PCA 17.23 DFC-VAE on CelebA 1.03 DFC-VAE on MNIST 2.22 SpeechVAE DFC-VAE on CelebA (refined w/ Lcycle) 2.83 DFC-VAE on MNIST (refiend w/ Lcycle) 0.76 DFC-VAE on CelebA 0.46 SpeechVAE w/ Lp,log MSE, DFC-VAE on MNIST 0.46 Lrr, dim=256 DFC-VAE on CelebA (refined w/ Lcycle) 1.26 DFC-VAE on MNIST (refiend w/ Lcycle) 1.68

Table 4.4: Audio VAE Mel spectrum reconstruction. The average SNR of autoencoding and decoding Mel spectrograms on ESC-50 shows a signifi- cant reconstruction loss.The average speed and acceleration between the latent vector (dim = 128) of neighbouring frames (∆t = 0.04s) confirms the experiments in the main document, that smoothness comes at the cost of lower reconstruction accuracy.

Audio models SNR (dB) Velocity (s−1) Acc.(s−2) Audio PCA 17.23 170.13 6960.80 SpeechVAE[36] 10.17 172.57 7331.77 SpeechVAE w/ Lp,log MSE 8.92 58.78 1859.95 SpeechVAE w/ Lrr 2.98 40.54 1580.86 SpeechVAE w/ Lp,log MSE, Lrr 2.19 30.66 909.39 SpeechVAE w/ Lp,log MSE, Lrr, dim=256 2.51 33.88 1037.97 generalization capability of our method (trained on the speach TIMIT dataset [22]) on the ESC-50 environment sound dataset [70]. Table 4.3 shows the reconstruction accuracy when going via the audio and video VAEs (see Information Throughput section in the main document). The SNR for reconstructed mel-spectrogram is generally lower than the speech dataset, which is expected since it was trained on the latter. The analysis of the reconstruction ability of the audio VAE in isolation (without going through the video VAE) reported

24 Figure 4.5: Environment sounds visualization. Although not trained for it, our approach also visualizes natural sounds. Here showing a clear dif- ference between chirp and alarm clock. The visualization looks brighter and intenser when a target sound occurs. in Table 4.4 shows that a large fraction of this loss of accuracy stems from the learning of speech specific features of the audio VAE. Moreover, with a recombined reconstruction loss term on the human speech dataset, the model was fitted to speech features and tended to loss high pitch information. Although trained entirely on speech, our low-level formulation at the level of sounds enables AudioViewer to also visualize environment sounds. Figure 4.5 gives two examples. Moreover, according to the face visualization of the content encoding as we showed in the HTML file, our AudioViewer can generate consistent visualization to given environment sounds.

4.6 User study

4.6.1 Design of the questionnaire Our user study required participants to complete one of 4 versions of a question- naire: disentangled content model with CelebA visualizations (CelebA-content), disentangled style model with CelebA visualizations (CelebA-style), disentangled content model with MNIST visualizations (MNIST-content) and a combined model with CelebA visualizations (CelebA-combined). Each version of the questionnaire asked the same set of 29 questions with randomized ordering of answers within each question. The study was conducted with 14, 15, 12, and 14 participants for the CelebA-content, CelebA-style, Celeba-combined and MNIST-content ques- tionnaires, respectively. Amongst all the questionnaires, there were 27 unique participants. It took participants between 10-15 minutes to complete the each

25 Figure 4.6: Matching Example. Examples from the CelebA-combined and MNIST-content questionnaire of a matching question asked to partici- pants in the user study.

Figure 4.7: Grouping Example. Examples from the CelebA-combined and MNIST-content questionnaire of a grouping question asked to partici- pants in the user study. questionnaire. In addition, the questionnaire did not indicate the purpose of the underlying research in case it would potentially bias the participants.. We only asked participant to perform two possible tasks: matching and grouping visualizations. The format of the questionnaire is outlined in Table 4.5. The questions tested for two factors: sound content, sounds that share the same phoneme sequences, and sound style, sounds produced by speakers of the same sex or speaker dialect. This purely visual comparison allows us analyze different aspects individually.

Matching questions. Matching questions asked the participants to choose which of two possible visualizations is most visually similar to a given reference visual. Figure 4.6 shows examples of matching questions. Matching questions were used to assess whether users can distinguish between the same sounds produced by speakers with different speaker traits, as well as whether structural similarities in the underlying audio translate into similarities in the visualization.

Table 4.5: A breakdown of the questionnaire format by sound type and tested factors, listing their frequency of occurrence.

Question type        Sound type    Tested factor                       Questions
Matching questions   Phone-pairs   Content                             3
                                   Style (sex)                         3
                                   Style (dialect)                     2
                     Words         Content                             3
                                   Style (sex)                         3
                                   Style (dialect)                     2
Grouping questions   Phone-pairs   Content + style (dialect)           2
                                   Content + style (dialect + sex)     2
                     Words         Content                             3
                                   Content + style (dialect)           2
                                   Content + style (sex)               1
                                   Content + style (dialect + sex)     1
                                   Content (similar sounding words)    2
Total                                                                  29

In particular, the questionnaire contained 6 questions for evaluating the ability to distinguish between sound content, which compared visualizations of sounds with different phoneme sequences (3 for phone-pairs and 3 for words). Phone-pairs are short in length and therefore the corresponding visualization was a single-frame image, whereas visualizations of words were videos. To evaluate the ability to distinguish between sound styles, 6 questions compared visualizations of the same phoneme sequence spoken by male and female speakers, and 4 questions compared visualizations from speakers of different dialects. In total, there were 16 matching questions. Since each question has two options, the expected mean accuracy for random guessing is 50%.

Grouping questions. Grouping questions asked the participants to group 4 visualizations into two pairs of similar visualizations. Figure 4.7 shows examples of grouping questions. Grouping questions were used to assess the degree to which visualizations of different words are distinguishable and visualizations of the same word are similar. In particular, the study required users to group visualizations of two pairs of sounds, where different pairs are sound clips sharing either the same sound content or the same sound style. In total, the user study consisted of 4 grouping questions based on phone-pairs and 9 grouping questions based on words. Since there are three possible groupings, the expected mean accuracy for random guessing is 33.3%.
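As a sanity check on the chance levels quoted above, grouping four visualizations into two pairs admits exactly three distinct pairings, giving a random-guess accuracy of 1/3, versus 1/2 for a two-option matching question. The small enumeration below is only an illustration, not part of the study analysis.

    def pairings(items):
        """Enumerate all ways to split four items into two unordered pairs."""
        first = items[0]
        for partner in items[1:]:
            pair_a = frozenset({first, partner})
            pair_b = frozenset(set(items) - pair_a)
            yield frozenset({pair_a, pair_b})

    groupings = set(pairings(['A', 'B', 'C', 'D']))
    print(len(groupings))             # 3 distinct groupings
    print(1 / 2, 1 / len(groupings))  # matching vs. grouping chance accuracy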

Table 4.6: User study results. Values indicate mean accuracy and standard deviation for distinguishing between visualizations of the tested factor across participants as a percentage.

Tested factor     Sub-type      Baseline   CelebA-disentangled   CelebA-combined   MNIST-content
Content           Phone-pairs   40.3       85.7 ± 11.2           70.2 ± 9.6        91.8 ± 9.2
                  Words         37.3       84.5 ± 8.6            74.3 ± 14.0       73.8 ± 7.2
                  Matching      50.0       84.5 ± 13.8           84.7 ± 4.8        77.4 ± 10.6
                  Grouping      33.3       85.2 ± 8.8            67.3 ± 14.0       80.8 ± 8.4
                  Overall       38.4       85.0 ± 6.8            72.8 ± 10.0       79.7 ± 6.5
Style (dialect)   Overall       50.0       56.7 ± 22.1           39.6 ± 24.9       -
Style (sex)       Overall       43.3       78.0 ± 10.8           43.3 ± 8.9        -

4.6.2 Results

For each of the three models, we tested for sound content (phoneme sequences) and for sound style (speaker dialect and speaker sex). We report the mean accuracy and standard deviation for each tested factor and each question sub-type in Table 4.6. The results of the disentangled models with CelebA visualizations (CelebA-disentangled) are aggregated by taking the results of CelebA-content on the questions that tested for sound content and the results of CelebA-style on the questions that tested for sound style. Since we only evaluated the disentangled content model for the MNIST visualizations, it is only compared on the content questions. Significance of differences between model means was calculated using a two-sample, two-tailed t-test with unequal variance and without any outlier rejection.
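A minimal sketch of this significance test, a two-sample, two-tailed t-test with unequal variances (Welch's t-test), is given below; the per-participant accuracy lists are placeholders, not the actual study data.

    from scipy import stats

    # Per-participant accuracies for two models (placeholder values only).
    celeba_disentangled = [0.86, 0.79, 0.91, 0.83, 0.88, 0.81]
    celeba_combined = [0.71, 0.68, 0.77, 0.74, 0.70, 0.75]

    # Two-sample, two-tailed t-test with unequal variances (Welch's t-test).
    t_stat, p_value = stats.ttest_ind(celeba_disentangled, celeba_combined,
                                      equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")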

Identifying and distinguishing sounds. Figure 4.8 shows that users achieve the highest overall accuracy with the CelebA-disentangled model, at 85.0 ± 6.8% (significant with p < 0.05), for distinguishing between visualizations of different content.

Figure 4.8: Model Comparison on content questions. The mean user accuracy and standard deviation for distinguishing phoneme sequences are compared between the models, annotated with the significance level († for 0.05 < p < 0.1, * for p < 0.05, ** for p < 0.01). The mean accuracy for random guessing is 0.384.

The MNIST-content model has the highest accuracy for distinguishing between different phone-pairs, at 91.8 ± 9.2%, although not significantly higher than the CelebA-disentangled one (p > 0.05), but it has a much lower accuracy for distinguishing between different words, suggesting that the MNIST visualizations may be better suited for representing shorter sounds.

Latent space disentanglement. As shown in Figure 4.9, the CelebA-disentangled model outperforms the CelebA-combined model for distinguishing between speakers of different sex, with 78.0 ± 10.8% (significant with p < 0.001), and between speakers of different dialects, with 56.7 ± 22.1% (marginally significant with 0.05 < p < 0.1), which, however, is not significantly higher than the random-guessing baseline of 50%. The results of the user study show that both disentanglement training of the audio model and face visualization make it easier for human users to recognize the visualized phonemes and the speaker identity (sex).

Figure 4.9: Model Comparison on style questions. The mean user accuracy and standard deviation for distinguishing speaker sex (left) and dialect (right) are compared between the models, annotated with the significance level († for 0.05 < p < 0.1, *** for p < 0.001). The mean accuracies for random guessing are 0.43 and 0.5, respectively.

The task of distinguishing between speakers of different dialects, however, is much more difficult, since there are 8 categories of dialects in the dataset and differences in dialects are much more subtle and often confounded with other factors. This translates to a lower accuracy in this question category.

Chapter 5

Limitations and Future Work

As shown in Figure 5.1, the recombination of the latent encodings successfully recombines the speaker-identity-related features (e.g., the fundamental frequencies) and the phoneme-related features (e.g., formant patterns), which indicates the effect of our disentanglement training. However, the blurry reconstructed spectrogram also indicates information loss during encoding, such as the intonation and the higher harmonics, which are important for natural speech synthesis. The quantitative evaluation also shows a significant drop in reconstruction SNR after the recombined reconstruction loss is introduced to the disentanglement training. Using a linear spectrogram with more filter banks instead of the current Mel spectrogram might help with the reconstruction of the higher harmonics, since their spacing would be preserved as clearly as that of the lower ones. It remains unknown, however, whether this would significantly improve the generated audio quality, since Mel filters are designed to match the sensitivity of human auditory perception.
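To illustrate this trade-off, the following sketch compares a linear-frequency magnitude spectrogram with a Mel spectrogram on a synthetic harmonic signal using librosa; the signal, FFT size, hop length, and number of Mel bands are illustrative assumptions rather than the settings used in this thesis.

    import numpy as np
    import librosa

    # Synthetic harmonic signal (200 Hz fundamental with 20 harmonics) used
    # here only as a stand-in for a speech recording.
    sr = 16000
    t = np.arange(0, 1.0, 1.0 / sr)
    y = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 21))

    # Linear-frequency magnitude spectrogram: harmonics stay evenly spaced
    # along the frequency axis, also in the upper range.
    linear_spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=160))

    # Mel spectrogram: the Mel filter bank compresses high frequencies, so the
    # upper harmonics fall into fewer, wider bins and are harder to separate.
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                              hop_length=160, n_mels=80)

    print(linear_spec.shape, mel_spec.shape)  # e.g. (513, 101) vs. (80, 101)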

To avoid trivial solutions that harm the disentanglement performance, a direct reconstruction loss is not applied during training, which might be one of the reasons for the drop. SpeechVAE [36], on the other hand, is not designed for such disentanglement. Although our recombined reconstruction loss is trained in a self-supervised manner, it still requires phoneme-level annotations to align the input triplets. Recent deep learning works in voice conversion [11, 35, 38, 56, 69, 73] explicitly model such disentanglement of speaker identity and speech content and can be trained with non-parallel data. In the future, we could replace SpeechVAE [36] with other VAE-based voice conversion models like [11] for better audio disentanglement and reconstruction performance. Meanwhile, our temporal smoothness constraint and cycle constraint would remain compatible with such models.

In the current work, we use DFC-VAE [33] to encourage the model to prioritize the consistency of visually perceptive features in the reconstruction. Although the visualization works as expected according to our information throughput experiment and user study, we notice that, for instance, in the visualization of ‘children’ in Figure 4.3 the background also changes over time, indicating that it also encodes audio information, even though the background is generally less attended to by human users. In the future, additional foreground masks could be applied to the reconstructed image during the joint refinement of the visual model to prioritize visualization in the facial region. Although the visualization by AudioViewer does not have to be high in resolution or fidelity, since it is presented as video frames and users do not have enough time to recognize the details in each frame, a more powerful AE-based visual generator like ALAE could make the visualization more expressive. Similarly, although our work has a different goal than lip synthesis, as explained in Chapter 2, lip movement still helps speech learning [32]. Therefore, lip motion synthesis could be combined with AudioViewer. A possible integration is to use foreground masks that force AudioViewer to visualize audio information with facial attributes outside the mouth region, while synthesizing lip movements corresponding to the speech content in the mouth region, so that the user receives both kinds of information simultaneously.

We evaluate the smoothness in the latent space to show how temporal smoothness in the latent space helps with the smoothness of the generated visual sequence. A no-reference (NR) visual evaluation of temporal smoothness, especially of flickering in the generated videos, would be a direct and reasonable quantitative criterion for future work [76]. For instance, Naito et al. [65] proposed an NR flickering measurement based on inter-frame differences, with which we could quantitatively evaluate how temporal smoothness in the audio latent space eventually improves the temporal smoothness of the generated videos.
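As a concrete starting point for such a no-reference evaluation, the sketch below scores flickering as the mean absolute inter-frame difference of a generated video. This is only a rough proxy in the spirit of inter-frame-difference measures, not a reimplementation of the hybrid metric of Naito et al. [65].

    import numpy as np

    def flicker_score(frames):
        """Mean absolute inter-frame difference of a video.

        `frames` has shape (T, H, W, C) with values in [0, 1]; lower scores
        indicate temporally smoother, less flickering output.
        """
        diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
        return float(diffs.mean())

    # A smooth fade should score lower than a sequence of unrelated frames.
    fade = np.linspace(0.0, 1.0, 30)[:, None, None, None] * np.ones((30, 64, 64, 3))
    noise = np.random.rand(30, 64, 64, 3)
    print(flicker_score(fade), flicker_score(noise))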

Whether AudioViewer can help hearing impaired people to learn speech still needs more investigation. Although the visual differences and similarities are consistent with the ones in the pronunciation according to the user study, the question of whether humans, especially hearing impaired people, can be trained to control the generated visualization by changing their pronunciation still needs to be addressed with further user studies. Finally, whether visualization by AudioViewer can significantly facilitate speech learning compared to spectrogram-based visualization should also be examined with behavioural experiments.

Besides helping deaf speech training, another possible application of AudioViewer is foreign language learning. For instance, native speakers of Japanese who learned English after childhood were found to have difficulty perceiving the acoustic difference between English /r/ and /l/ [25], and the ability to produce and discriminate the two phonemes can be improved by training [7, 59, 61]. When a language learner cannot distinguish phonemes of a foreign language in the auditory modality at the beginning, visualization by AudioViewer can present such differences as visual assistance. Although the audio dataset TIMIT [22] we use in this study is an English speech dataset, our method should apply to other spoken languages, since it is not designed to model a specific language, as long as it is trained on a corresponding dataset. To obtain a more general speech model that works across languages, our model should also be trained on multiple datasets covering different languages.

Besides human speech visualization, environment sound visualization is also a topic worth exploring, since it is hard to present environment sounds with transcripts compared to speech. AudioViewer is mainly designed and trained to work with vocal inputs to help with speech learning, so it is expected that the reconstruction SNR on environment sounds is not as high as on speech, which indicates information loss in the audio encoding. The architecture of AudioViewer would still be compatible with an audio VAE trained on environment sound data. Considering the difference in usage scenario, however, visualizing the category or location [43] of an environment sound might be a more useful application for hearing impaired people.

Figure 5.1: Recombined reconstruction sample. Top left: The Mel spectrogram of a sentence spoken by a female speaker and its reconstruction by our audio model with disentanglement training. Top right: The original and reconstructed Mel spectrogram of another sentence spoken by a male speaker. Bottom: The Mel spectrograms generated from recombined latent encodings. The upper one is decoded from the content latent encoding of the first utterance combined with the speaker latent encoding of the male speaker of the second utterance. The lower one is the second sentence combined with the speaker latent encoding of the female speaker of the first sentence. Due to the higher F0 of the female speaker, the harmonics in her original and reconstructed Mel spectrograms are separated farther apart than those of the male speaker, which can also be noticed in the recombined reconstructed Mel spectrograms.
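The recombination illustrated in Figure 5.1 amounts to decoding the content code of one utterance together with the speaker code of another. The sketch below shows this swap with a stand-in model; the encode/decode interface, the 64/64 split of the latent code, and the dummy implementation are assumptions made for illustration only, not our actual model code.

    import numpy as np

    class DummyDisentangledVAE:
        """Stand-in for a disentangled audio VAE (illustration only)."""

        def encode(self, mel):
            # Pretend the first 64 latent dims carry content (phonemes) and
            # the last 64 carry speaker style; a real model learns this split.
            z = mel.mean(axis=1)
            return z[:64], z[64:128]

        def decode(self, content, speaker):
            # A real decoder would synthesize a Mel spectrogram from the codes.
            return np.concatenate([content, speaker])

    vae = DummyDisentangledVAE()
    mel_female = np.random.rand(128, 200)  # utterance of the female speaker
    mel_male = np.random.rand(128, 200)    # utterance of the male speaker

    content_f, _ = vae.encode(mel_female)
    _, speaker_m = vae.encode(mel_male)
    # Content of the female utterance rendered with the male speaker code,
    # analogous to the upper recombined spectrogram in Figure 5.1.
    recombined = vae.decode(content_f, speaker_m)
    print(recombined.shape)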

Chapter 6

Conclusion

We presented AudioViewer, a learning-based approach for visualizing audio, especially speech, via generative models coupled with cycle consistency and temporal smoothness constraints. We applied a recombined reconstruction loss for disentanglement training to make the audio model speaker agnostic. According to our user study, the human face visualization of the disentangled encoding is best recognized by human users for its rich features, which can retain more information than the handwritten-digit one; this is consistent with the quantitative results on information throughput. As a proof of concept, AudioViewer shows the feasibility of visualizing speech with unsupervised learning and its potential for audio substitution. We hope that our approach will catalyze the development of future tools for assisting with handicaps linked to perceiving subtle audio differences.

Bibliography

[1] G. K. Aguirre, E. Zarahn, and M. D’esposito. An area within human ventral cortex sensitive to “building” stimuli: evidence and implications. Neuron, 21 (2):373–383, 1998. → page 3

[2] P. Arnold and C. Murray. Memory for faces and objects by deaf and hearing signers and hearing nonsigners. Journal of psycholinguistic research, 27(4): 481–497, 1998. → page 6

[3] P. Bach-y Rita. Brain plasticity as a basis of sensory substitution. Journal of Neurologic Rehabilitation, 1(2):67–71, 1987. → page 7

[4] M. Balasubramanian and E. L. Schwartz. The isomap algorithm and topological stability. Science, 295(5552):7–7, 2002. → page 19

[5] S. Barry. Speech viewer 2. Child Language Teaching and Therapy, 10(2): 206–213, 1994. → page 6

[6] V. Bernard-Opitz, N. Sriram, and S. Sapuan. Enhancing vocal imitations in children with autism using the ibm speech viewer. Autism, 3(2):131–147, 1999. → page 6

[7] A. R. Bradlow, R. Akahane-Yamada, D. B. Pisoni, and Y. Tohkura. Training japanese listeners to identify english/r/and/l: Long-term retention of learning in perception and production. Perception & psychophysics, 61(5):977–985, 1999. → page 33

[8] J.-P. Briot, G. Hadjeres, and F.-D. Pachet. Deep learning techniques for music generation–a survey. arXiv preprint arXiv:1709.01620, 2017. → page 8

[9] K. Chen, L. Wang, and H. Chi. Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification. International Journal of Pattern Recognition and Artificial Intelligence, 11(03):417–445, 1997. → page 8

[10] H. Chernoff. The use of faces to represent points in k-dimensional space graphically. Journal of the American statistical Association, 68(342): 361–368, 1973. → pages 3, 6, 22

[11] J.-c. Chou, C.-c. Yeh, and H.-y. Lee. One-shot voice conversion by separating speaker and content representations with instance normalization. arXiv preprint arXiv:1904.05742, 2019. → pages 31, 32

[12] C. Chu, A. Zhmoginov, and M. Sandler. Cyclegan, a master of steganography. arXiv preprint arXiv:1712.02950, 2017. → page 8

[13] M. A. Cox and T. F. Cox. Multidimensional scaling. In Handbook of data visualization, pages 315–347. Springer, 2008. → page 19

[14] F. Destombes, B. Elsendoorn, and F. Coninx. The development and application of the ibm speech viewer. Interactive learning technology for the deaf, pages 187–196, 1993. → page 6

[15] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang. Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. → page 8

[16] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i Nieto. Wav2pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 3, 2019. → pages 2, 7

[17] S. P. Eberhardt, L. E. Bernstein, D. C. Coulter, and L. A. Hunckler. Omar a haptic display for speech perception by deaf and deaf-blind individuals. In Proceedings of IEEE Virtual Reality Annual International Symposium, pages 195–201. IEEE, 1993. → page 5

[18] S. F. Elssmann and J. E. Maki. Speech spectrographic display: use of visual feedback by hearing-impaired adults during independent articulation practice. American Annals of the Deaf, pages 276–279, 1987. → pages 1, 5

[19] G. Fant. Acoustic theory of speech production. Number 2. Walter de Gruyter, 1970. → page 4

[20] E. M. Finney, I. Fine, and K. R. Dobkins. Visual stimuli activate auditory cortex in the deaf. Nature neuroscience, 4(12):1171–1173, 2001. → page 6

[21] E. M. Finney, B. A. Clementz, G. Hickok, and K. R. Dobkins. Visual stimuli activate auditory cortex in deaf subjects: evidence from meg. Neuroreport, 14 (11):1425–1427, 2003. → page 6

[22] J. S. Garofolo. Timit acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993, 1993. → pages 17, 24, 33

[23] L. H. Goldish and H. E. Taylor. The optacon: A valuable device for blind persons. Journal of Visual Impairment & Blindness, 68(2):49–56, 1974. → page 6

[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. → pages 2, 7

[25] H. Goto. Auditory perception by normal japanese adults of the sounds "l" and "r". Neuropsychologia, 1971. → page 33

[26] B. Gygi, G. R. Kidd, and C. S. Watson. Similarity and categorization of environmental sounds. Perception & psychophysics, 69(6):839–855, 2007. → page 2

[27] W. Hao, Z. Zhang, and H. Guan. Cmcgan: A uniform framework for cross-modal visual-audio mutual generation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. → page 8

[28] W. J. Hardcastle, F. E. Gibbon, and W. Jones. Visual display of tongue-palate contact: electropalatography in the assessment and remediation of speech disorders. International Journal of Language & Communication Disorders, 26(1):41–74, 1991. → page 5

[29] J. Harrington and S. Cassidy. Techniques in speech acoustics, volume 8. Springer Science & Business Media, 2012. → page 4

[30] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. Rasta-plp speech analysis. In Proc. IEEE Int’l Conf. Acoustics, Speech and Signal Processing, volume 1, pages 121–124, 1991. → page 8

[31] Y. Hiasa, Y. Otake, M. Takao, T. Matsuoka, K. Takashima, A. Carass, J. L. Prince, N. Sugano, and Y. Sato. Cross-modality image synthesis from unpaired data using cyclegan. In International workshop on simulation and synthesis in medical imaging, pages 31–41. Springer, 2018. → page 8

[32] Y. Hirata and S. D. Kelly. Effects of lips and hands on auditory learning of second-language speech sounds. Journal of Speech, Language, and Hearing Research, 2010. → page 32

[33] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1133–1141, 2017. → pages 8, 13, 32

[34] W. F. House. Cochlear implants. Annals of Otology, Rhinology & Laryngology, 85(3_suppl):3–3, 1976. → page 1

[35] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice conversion from non-parallel corpora using variational auto-encoder. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–6. IEEE, 2016. → page 31

[36] W.-N. Hsu, Y. Zhang, and J. Glass. Learning latent representations for speech generation and transformation. In Interspeech, pages 1273–1277, 2017. → pages 8, 9, 10, 21, 24, 31

[37] D. Hu, D. Wang, X. Li, F. Nie, and Q. Wang. Listen to the image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. → page 6

[38] W.-C. Huang, H. Luo, H.-T. Hwang, C.-C. Lo, Y.-H. Peng, Y. Tsao, and H.-M. Wang. Unsupervised representation disentanglement using cross domain features and adversarial learning in variational autoencoder based voice conversion. IEEE Transactions on Emerging Topics in Computational Intelligence, 2020. → page 31

[39] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017. → page 7

[40] IBM. Speech viewer iii, 2004. → page 6

[41] B. Imperl, Z. Kačič, and B. Horvat. A study of harmonic features for the speaker recognition. Speech communication, 22(4):385–402, 1997. → page 4

[42] R. J. Jacob and H. E. Egeth. The face as a data display. Human Factors, 18(2): 189–200, 1976. → pages 3, 6, 22

[43] D. Jain, L. Findlater, J. Gilkeson, B. Holland, R. Duraiswami, D. Zotkin, C. Vogler, and J. E. Froehlich. Head-mounted display visualizations to support sound awareness for the deaf and hard of hearing. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 241–250, 2015. → page 33

[44] A. H. Jha, S. Anand, M. Singh, and V. Veeravasarapu. Disentangling factors of variation with cycle-consistent variational auto-encoders. In European Conference on Computer Vision, pages 829–845. Springer, 2018. → pages 8, 14

[45] K. Johnson. Acoustic and auditory phonetics. John Wiley & Sons, 2011. → page 4

[46] H. Kamper. Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6535–3539. IEEE, 2019. → page 8

[47] N. Kanwisher, D. Stanley, and A. Harris. The fusiform face area is selective for faces not animals. Neuroreport, 10(1):183–187, 1999. → page 3

[48] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017. → page 7

[49] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. → page 7

[50] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020. → page 7

[51] W. F. Katz and S. Mehta. Visual feedback of tongue movement for novel speech sound learning. Frontiers in human neuroscience, 9:612, 2015. → page 5

[52] H. Kim, M. Elgharib, M. Zollhöfer, H.-P. Seidel, T. Beeler, C. Richardt, and C. Theobalt. Neural style-preserving visual dubbing. ACM Transactions on Graphics (TOG), 38(6):1–13, 2019. → page 7

[53] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. → pages 7, 10

[54] B. J. Kröger, P. Birkholz, R. Hoffmann, and H. Meng. Audiovisual tools for phonetic and articulatory visualization in computer-aided pronunciation training. In Development of multimodal interfaces: Active listening and synchrony, pages 337–345. Springer, 2010. → pages 1, 5

[55] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324, 1998. → page 17

[56] K. Lee, I.-C. Yoo, and D. Yook. Voice conversion using cycle-consistent variational autoencoder. arXiv preprint arXiv:1909.06805, 2019. → page 31

[57] J. Levis and L. Pickering. Teaching intonation in discourse using speech visualization technology. System, 32(4):505–524, 2004. → page 6

[58] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. → page 17

[59] S. E. Lively, D. B. Pisoni, R. A. Yamada, Y. Tohkura, and T. Yamada. Training japanese listeners to identify english/r/and/l/. iii. long-term retention of new phonetic categories. The Journal of the acoustical society of America, 96(4):2076–2087, 1994. → page 33

[60] S. Maidenbaum, S. Abboud, and A. Amedi. Sensory substitution: Closing the gap between basic research and widespread practical visual rehabilitation. Neuroscience & Biobehavioral Reviews, 41:3–15, 2014. → page 6

[61] J. L. McClelland, J. A. Fiez, and B. D. McCandliss. Teaching the/r/–/l/discrimination to japanese adults: behavioral and neural aspects. Physiology & Behavior, 77(4-5):657–662, 2002. → page 33

[62] S. McCullough and K. Emmorey. Face processing by deaf asl signers: Evidence for expertise in distinguishing local features. Journal of Deaf Studies and Deaf Education, pages 212–222, 1997. → page 6

[63] L. B. Merabet and A. Pascual-Leone. Neural reorganization following sensory loss: the opportunity of change. Nature Reviews Neuroscience, 11(1):44–52, 2010. → page 6

[64] E. Musk et al. An integrated brain-machine interface platform with thousands of channels. Journal of medical Internet research, 21(10):e16194, 2019. → page 1

[65] S. Naito, S. Sakazawa, A. Koike, et al. Objective perceptual video quality measurement method based on hybrid no reference framework. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 2237–2240. IEEE, 2009. → page 32

[66] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42 (4):722–737, 2015. → page 2

[67] A. Öster. Teaching speech skills to deaf children by computer-based speech training. STL-Quarterly Progress and Status Report, 36(4):67–75, 1995. → pages 2, 5

[68] S. H. Park, D. J. Kim, J. H. Lee, and T. S. Yoon. Integrated speech training system for hearing impaired. IEEE Transactions on Rehabilitation Engineering, 2(4):189–196, 1994. → page 6

[69] S.-w. Park, D.-y. Kim, and M.-c. Joe. Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data. arXiv preprint arXiv:2005.03295, 2020. → page 31

[70] K. J. Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1015–1018, 2015. → pages 17, 24

[71] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding, number CONF. IEEE Signal Processing Society, 2011. → page 17

[72] S. R. Pratt, A. T. Heintzelman, and S. E. Deming. The efficacy of using the ibm speech viewer vowel accuracy module to treat young children with hearing impairment. Journal of Speech, Language, and Hearing Research, 36 (5):1063–1074, 1993. → page 6

[73] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson. Autovc: Zero-shot voice style transfer with only autoencoder loss. arXiv preprint arXiv:1905.05879, 2019. → page 31

[74] A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, pages 14866–14876, 2019. → page 7

[75] N. Sadoughi and C. Busso. Speech-driven expressive talking lips with conditional sequential generative adversarial networks. IEEE Transactions on Affective Computing, 2019. → pages 2, 7

[76] M. Shahid, A. Rossholm, B. Lövström, and H.-J. Zepernick. No-reference image and video quality assessment: a classification and review of recent approaches. EURASIP Journal on image and Video Processing, 2014(1):40, 2014. → page 32

[77] E. Shlizerman, L. Dery, H. Schoen, and I. Kemelmacher-Shlizerman. Audio to body dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7574–7583, 2018. → page 7

[78] G. Sperling. The information available in brief visual presentations. Psychological monographs: General and applied, 74(11):1, 1960. → page 3

[79] L. W. Stewart L and H. R. A real-time sound spectrograph with implications for speech training for the deaf. In IEEE International Conference Acoustics Speech and Signal Processing, 1976. → pages 2, 5

[80] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017. → page 7

[81] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016. → page 8

[82] T. Takahashi, S. Fukayama, and M. Goto. Instrudive: A music visualization system based on automatically recognized instrumentation. In ISMIR, pages 561–568, 2018. → page 7

[83] M. J. Tarr and I. Gauthier. Ffa: a flexible fusiform area for subordinate-level visual processing automatized by expertise. Nature neuroscience, 3(8): 764–769, 2000. → page 3

[84] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):1–11, 2017. → page 7

[85] N. Thomas-Stonell, M. McClean, and E. Hunt. Evaluation of the speechviewer computer-based speech training system with neurologically impaired individuals. Journal of Speech-Language Pathology and Audiology, 15(4):47–56, 1991. → page 6

[86] Y. Tian and J. Engel. Latent translation: Crossing modalities by bridging generative models. arXiv preprint arXiv:1902.08261, 2019. → page 8

[87] O. Tmenova, R. Martin, and L. Duong. Cyclegan for style transfer in x-ray angiography. International journal of computer assisted radiology and surgery, 14(10):1785–1794, 2019. → page 8

[88] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29:4790–4798, 2016. → page 7

[89] S. Vasquez and M. Lewis. Melnet: A generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083, 2019. → page 9

[90] T. Wempe and M. van Nunen. The ibm speechviewer. Proceedings IFA, 15: 121–129, 1991. → page 6

[91] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–686, 2018. → pages 2, 7

[92] W. Xu, X. Lifang, Y. Dan, and H. Zhiyan. Speech visualization based on robust self-organizing map (rsom) for the hearing impaired. In 2008 International Conference on BioMedical Engineering and Informatics, volume 2, pages 506–509. IEEE, 2008. → pages 1, 5

[93] D. Yang, B. Xu, and X. Wang. Speech visualization based on improved spectrum for deaf children. In 2010 Chinese Control and Decision Conference, pages 4377–4380. IEEE, 2010. → pages 1, 5

[94] D. Yook, S.-G. Leem, K. Lee, and I.-C. Yoo. Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders. In Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pages 215–221, 2020. → pages 8, 14

[95] K. Yoshii and M. Goto. Music thumbnailer: Visualizing musical pieces in thumbnail images based on acoustic features. In ISMIR, pages 211–216, 2008. → page 7

[96] L. Yuan, Y. Chen, S. Fu, A. Wu, and H. Qu. Speechlens: A visual analytics approach for exploring speech strategies with textural and acoustic features. In 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 1–8. IEEE, 2019. → page 7

[97] C. M. Zaccagnini and S. D. Antia. Effects of multisensory speech training and visual phonics on speech production of a hearing-impaired child. Journal of Childhood Communication Disorders, 15(2):3–8, 1993. → pages 2, 5

[98] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017. → page 8
