arXiv:2012.13341v3 [cs.HC] 11 Mar 2021

AudioViewer: Learning to Visualize Sound

Yuchi Zhang, University of British Columbia, Vancouver, Canada
Willis Peng, University of British Columbia, Vancouver, Canada
Bastian Wandt, Leibniz University Hannover, Hannover, Germany
Helge Rhodin, University of British Columbia, Vancouver, Canada

Abstract

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design is to translate from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features since they can easily be parsed by humans to match and distinguish between sounds, words, and speakers.

1. Introduction

Humans perceive their environment through diverse channels, including vision, hearing, touch, taste, and smell. Impairment in any of these modalities of perception can lead to drastic consequences, such as challenges in learning and communication. Various approaches have been proposed to substitute lost senses, going all the way to recently popularized attempts to directly interface with neurons in the brain (e.g., NeuraLink [39]).

One of the least intrusive approaches is to substitute audio with video, which is, however, challenging due to their high throughput and different modality. In this paper, we make a first step to visualize audio with natural images in real time, forming a live video that characterizes the entire audio content. It could be seen as an automated form of sign language translation, with its own throughput, abstraction, automation, and readability trade-offs. Figure 1 shows an example. Such a tool could help to perceive sound and also be used to train deaf speech, since immediate and high-throughput feedback on pronunciation is difficult to obtain from traditional visual tools for hearing-impaired speech learning, which rely on spectrogram representations [12, 61, 60, 33].

Our long-term goal is to show that hearing-impaired persons can learn to perceive sound through the visual system. Similar to learning to understand sign language, and just as an unimpaired child learns to associate meaning with seen objects and perceived sounds and words, the user will have to learn a mapping from our sound visualization to its semantics. Still, articulation could be practiced by exploring the visual feedback space, and speaking could be learned by articulating sounds that reproduce the same visual feedback as produced by the teacher or parent.

Translating an audio signal to video is difficult as there is no canonical one-to-one mapping across these domains with vastly different dimensions (one-dimensional audio to high-dimensional video) and characteristics (time-frequency representation to spatial structure). It is an open question which visual abstraction level is best. Methods for digital dubbing and lip-syncing facial expressions from spoken

language [10, 46, 59] would be natural, allowing deaf people to lip read. However, lip motion is rather a result of speaking; it contains only a fraction of the audio information and does not apply to environment sounds. Another direction could be to translate speech to words with a recognition system [28, 53], for example the spoken 'dog' would be translated to the text 'dog'. A conceivable extension is to further translate the word into an image using conditional generative adversarial networks (GANs) [16]. A dog image could be shown when the word 'dog' is recognized in the audio. This is intuitive; however, such a translation is still indirect and does not convey vocal feedback such as the differences between male and female speakers or style [49, 65, 42]. Moreover, events and ambient sounds that cannot be indicated by a single word are ill-represented, such as the echo of a dropped object or the repeating beep of an alarm clock.

Figure 1. AudioViewer: A tool developed towards the long-term goal of helping hearing impaired persons to see what they can not hear. We map an audio stream to video and thereby use faces or numbers for visualizing the high-dimensional audio features intuitively. Different from the principle of lip reading, this can encode general sound and pass on information on the style of spoken language. (Panels: audio signal, mel spectrogram, audio encoder, latent space trajectory, visual decoder, translated video frames with the face decoder or the number decoder.)

To overcome these limitations, we seek an intuitive and immediate visual representation of the high-dimensional audio data and are inspired by early techniques from the 1970s that visualize general high-dimensional data with facial line drawings with varying length of the nose and curvature of the mouth [5, 25].

In this paper, we design an immediate audio to video mapping leveraging ideas from computer vision and audio processing for unpaired domain translation. The technical difficulty lies in finding a mapping that is expressive yet easy to perceive. We base AudioViewer on four principles. First, humans are good at recognizing patterns and objects that they see in their environment, particularly human faces [41]. We therefore map to videos that have the image statistics of natural images and faces. Second, humans are able to perceive complex spatial structure, but non-natural (e.g., flickering) and quick temporal changes lead to disruption and tiredness [48]. Hence, we enforce smoothness constraints on the learned mapping. Third, frequent audio features should be mapped to frequent video features. We therefore exploit cycle consistency to learn a joint structure between audio and video modalities. Fourth, we disentangle factors of variation, to separate style, such as gender and dialect, from content, such as individual phones and sounds.

Our core contributions are a new approach for enabling people to see what they can not hear; a method for mapping from audio to video via linked latent spaces; a quantitative analysis on the type of target videos, different latent space models, and the effect of smoothness constraints; and a first perceptual user study on the recognition capabilities. We demonstrate the feasibility of this AudioViewer approach with a working prototype and analyze the success by quantifying the loss of information content as well as showing that words and pronunciation can be distinguished from the generated video features in a user study.

2. Related Work

In the following section we first review the literature on audio and video generation, with a particular focus on cross-modal models. We then put our approach in context with existing assistive systems.

Classical audio to video mappings. Audio to video translation methods have mostly been designed for digital dubbing or lip-syncing facial expressions to spoken audio [10, 46, 59]. Other approaches reconstruct facial attributes, such as gender and ethnicity, from audio [54, 29, 50, 47], generate matching facial images, and map music to facial or body animation [58]. By contrast to our setting, all of these tasks can be supervised by learning from videos with audio lines, e.g., a talking person where the correspondence of lip motion, expressions and facial appearance to the spoken language is used to train the relation between sound and mouth opening. We go beyond spoken sounds and map to non-facial features such as background and brightness with an unsupervised translation mechanism.

Audio and video generation models. Image generation models predominantly rely on GAN [16] and VAE [32, 20] formulations. The highest image fidelity is attained with hierarchical models that inject noise and latent codes at various network stages by changing their feature statistics [23, 30]. For audio, only few methods operate on the raw waveform [27]. It is more common to use spectrograms and to apply convolutional models inspired by the ones used for image generation [21, 9, 2]. We use cross-modal VAE models as a basis to learn the important audio and image features.

Cross-modal latent variable models. CycleGAN and its variations [67, 51] have seen considerable success in performing cross-modal unsupervised domain transfer, for medical imaging [19, 56] and audio to visual translation [17], but often encode information as a high-frequency signal that is invisible to the human eye and susceptible to adversarial attacks [6]. An alternative approach involves training a VAE, subject to a cycle-consistency condition [26, 62], but these works were restricted to domain transfers within a single modality. Most similar is the joint audio and video model proposed by Tian et al. [55], which uses a VAE to map between two incompatible latent spaces using supervised alignment of attributes. However, it operates on a word level rather than a phoneme level, and has no mechanism to ensure temporal smoothness or information throughput. Our contributions address these shortcomings. Relatedly, encoder-decoder and GAN models have been applied to generating video reconstructions of lip movements based on audio data [7, 4, 57, 66]. However, due to mapping ambiguities between phonemes and visemes, lip movements are not a reliable source of feedback for learning sound production [36, 40, 3, 13].

Deaf speech support tools. Improvements in speech production for the hearing impaired have been achieved through non-auditory aids, and these improvements persist beyond learning sessions and extend to words not encountered during the sessions [49, 65, 42]. While electrophysiological [18, 31] and haptic learning aids [11] have demonstrated efficacy for improving speech production, such techniques can be more invasive, especially for young children, as compared to visual aids. Elssmann et al. [12] demonstrate that visual feedback from the Speech Spectrographic Display (SSD) [49] is equally effective at improving speech production as feedback from a speech-language pathologist. Alternative graphical plots generated from transformed spectral data have been explored by [61, 60, 33], which aim at improving upon spectrograms by creating plots that are more distinguishable with respect to speech parameters. Other methods aim at providing feedback by explicitly estimating vocal tract shapes [43]. In addition, Levis et al. [35] demonstrate that distinguishing between discourse-level intonation (intonation in conversation) and sentence-level intonation (intonation of sentences spoken in isolation) is possible through speech visualization and argue that deaf speech learning could be further improved by incorporating the former. Commercially, products such as IBM's Speech Viewer [24] are available to the public. Our image generation approach extends these spectrogram visualization techniques by leveraging the generative ability of VAEs in creating a mapping to a more intuitive and memorable video representation. It is our hope that improved visual aids will lead to more effective learning in the future.

Sensory substitution and audio visualization. Related to our work is the field of sensory substitution, whereby information from one modality is provided to an individual through a different modality. While many sensory substitution methods focus on substituting visual information with other senses, such as tactile or auditory stimulation, to help visual rehabilitation [38, 22, 15], few methods target substituting the auditory modality with visualization. Audio visualization is another field related to our work. Music visualization works generate visualizations of songs, so that users can browse songs more efficiently without listening to them [63, 52]. On the learning side, [64] visualize the intonation and volume of each word in speech by the font size, so that users can learn narration strategies. Different from the works mentioned above, our model tries to visualize speech at the phoneme level with deep learning models instead of hand-crafted features.

3. Method

Our goal is to translate a one-dimensional audio signal $A = (a_0, \cdots, a_{T_A})$ into a video visualization $V = (I_0, \cdots, I_{T_V})$, where $a_i \in \mathbb{R}$ are sound wave samples recorded over $T_A$ frames and $I_i$ are images representing the same content over $T_V$ frames. Supervised training is not possible in the absence of paired labels. Instead, we utilize individual images and audio snippets without correspondence and learn a latent space $\mathcal{Z}$ that captures the important image and audio features and translates between them with minimal information loss. Figure 2 shows the individual mapping steps. The audio encoder $E_A$ yields latent code $z_i \in \mathcal{Z}$ and the visual decoder $D_V(z_i)$ outputs a corresponding image $I_i$. This produces a video representation of the audio when applied sequentially. We start by learning individual audio and video models that are subsequently linked with a translation layer.

Matching sound and video representation. A first technical problem lies in the higher audio sampling frequency (16 kHz), which prevents a one-to-one mapping to 25 Hz video. We follow common practice and represent the one-dimensional sound wave with a perceptually motivated mel-scaled spectrogram, $M = (m_0, \ldots, m_{T_M})$, $m_i \in \mathbb{R}^F$, where $F = 80$ is the number of filter banks. It is computed via the short-time Fourier transform with a 25 ms Hanning window with 10 ms shifts. The resulting coefficients are converted to decibel

units, flooring at -80 dB. Because the typical mel spectrogram still has a higher sampling rate (100 Hz) than the video, we chunk it into overlapping segments of length $T_M = 20$ covering 200 ms. In the following, we explain how to map from mel segment to video frame.

3.1. Audio Encoding

Given unlabelled audio and video sequences, we start by learning independent encoder-decoder pairs $(E_A, D_A)$ for sound and $(E_V, D_V)$ for video. We use probabilistic VAEs since these not only learn a compact representation of the latent structure, but also allow us to control the shape of the latent distribution to be a standard normal distribution. Let $x$ be a sample from the unlabelled audio set. We optimize over all samples the VAE objective [32]:

$$\mathcal{L}(x) = -D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right], \tag{1}$$

with $D_{KL}$ the Kullback-Leibler divergence, and $q_\phi(z|x)$ and $p_\theta(x|z)$ the latent code and output domain posterior, respectively. These have a parametric form, with

$$q_\phi(z|x) = \mathcal{N}\left(\rho(x), \omega^2(x) I\right) \quad \text{and} \tag{2}$$

$$p_\theta(x|z) = \mathcal{N}\left(\mu(z), \sigma^2(z) I\right), \tag{3}$$

where $\rho$ and $\omega$ are the output of the encoder and $\mu$ and $\sigma$ the output of the decoder.

Audio network architecture. We use the SpeechVAE model from Hsu et al. [21] that is widely used for generative sound models. Fig. 2, left, sketches how the mel spectrogram is encoded with an encoder $E_A$ of three convolutional layers followed by a fully connected layer which flattens the spatial dimensions. The mel spectrogram $M$ is a two-dimensional array, spanning time and frequency, respectively. To account for the structural differences of the two, convolutions are split into separable $1 \times F$ and $3 \times 1$ filters, where $F$ is the number of frequencies captured by $M$, and striding is applied only on the temporal axis. The decoder is symmetric to the encoder. We use ReLU activation and batch normalization layers in the encoder and decoder.

3.2. Structuring the Audio Encoding

We desire our latent space to change smoothly in time and to disentangle the style, such as gender and dialect, from the content conveyed in phones.

Disentangling content and style. We construct a SpeechVAE that disentangles the style (speaker identity) from the content (phonemes) in the latent encodings, i.e., the latent encoding $z = [z_1, \cdots, z_d]^T \in \mathbb{R}^d$ can be separated into a style part $z_s = [z_1, \cdots, z_m]^T$ and a content part $z_c = [z_{m+1}, \cdots, z_d]^T$, where $d$ is the whole audio latent space dimension and $m$ is the audio style latent space dimension.

To encourage such disentanglement we use an audio dataset with phone and speaker ID annotation. As shown in Figure 3, at training time, we feed triplets of mel spectrogram segments $\{M_{a,i}, M_{b,i}, M_{a,j}\}$, where $M_{a,i}$ and $M_{b,i}$ are the same phoneme sequence $p_i$ spoken by different speakers $s_a$ and $s_b$ respectively, and $M_{a,j}$ shares the speaker $s_a$ with the first segment but has a different phoneme sequence. Each element of the input triplet is encoded individually by $E_A$, forming the latent triplet $\{z_{a,i}, z_{b,i}, z_{a,j}\} = \{[z_{s_a}, z_{c_i}]^T, [z_{s_b}, z'_{c_i}]^T, [z'_{s_a}, z_{c_j}]^T\}$. Instead of reconstructing the inputs from the corresponding latent encodings as in an autoencoder, we reconstruct the first sample $M_{a,i}$ from a recombined latent encoding of the other two, $z'_{a,i} = [z'_{s_a}, z'_{c_i}]^T$. Formally, we replace the reconstruction loss term in the VAE objective by a recombined reconstruction loss term,

$$\mathcal{L}_{rr}(T_{a,b,i,j}) = \mathbb{E}_{q_\phi(z'_{a,i} | M_{b,i}, M_{a,j})} \log p_\theta(M_{a,i} | z'_{a,i}). \tag{4}$$

This setup forces the model to learn separate encodings for the style and phoneme information while not requiring additional loss terms.

Note that we could alternatively enforce $z_{a,i}$ to be close to $z'_{a,i}$ without decoding (the unused $z_{a,i}$ in Figure 3). However, an additional L2 loss on the latent space led to a bias towards zero and lower reconstruction scores than the proposed mixing strategy that works with the original VAE objective.

Temporal smoothness. The audio encoder has a small temporal receptive field, encoding time segments of 200 ms. As a result, subsequent sounds can be encoded to distant latent codes, leading to quick visual changes in the decoding. To counteract this, we add an additional smoothness prior. We experiment with the following negative log likelihoods. For training the audio encoder, we sample a pair of input mel spectrogram segments $\{M_i, M_j\}$ at random time steps $\{t_i, t_j\}$, spaced at most 800 ms apart, from the same utterance. We test three different pair loss functions to enforce temporal smoothness in the embedded content vectors. First, by making changes in latent space proportional to changes in time,

$$\mathcal{L}_{p,MSE} = \frac{1}{N} \sum_i \left( \Delta\hat{t}_i - \|t_{i,1} - t_{i,2}\| \right)^2, \tag{5}$$

where $t_{i,1}$ and $t_{i,2}$ are the two time steps of the $i$-th sampled pair and the latent space dimension scale is learned by a scalar parameter $s$, with $\Delta\hat{t}_i = s \cdot \|E_A(M_i) - E_A(M_j)\| = s \cdot \|z_i - z_j\|$ for models without style disentanglement, and $\Delta\hat{t}_i = s \cdot \|z_{c,i} - z_{c,j}\|$ for the disentangled ones, where the predicted time difference is only calculated with the content part of the latent encodings. Second, we try whether

enforcing the ratio of velocities to be constant is better,

$$\mathcal{L}_{p,Q} = \frac{1}{N} \sum_i \left( \frac{\Delta\hat{t}_i}{\|t_{i,1} - t_{i,2}\|} - 1 \right)^2, \tag{6}$$

with learned scale as before. The same quotient constraint can also be measured in logarithmic scale,

$$\mathcal{L}_{p,\log MSE} = \frac{1}{N} \sum_i \left( \log \Delta\hat{t}_i - \log \|t_{i,1} - t_{i,2}\| \right)^2, \tag{7}$$

which behaves better than quotients in our experiments. Our final model uses $\mathcal{L}_{p,\log MSE}$. All objective terms are learned jointly, with weights $\lambda_{cycle} = 10$ and $\lambda_p = 10^3$ balancing their influence.

Figure 2. Overview. A joint latent encoding is learned with audio and video VAEs that are linked with a cycle constraint and augmented with a smoothness constraint ($\Delta v$ annotation).

Figure 3. Disentanglement by mixing content encodings from two speakers saying the same phones ($z_{c_i}$ and $z'_{c_i}$) and style encodings from the same speaker saying different words ($z_{s_a}$ and $z'_{s_a}$).

3.3. Image Encoding for Video Generation

Audio encoders and video decoders can be made compatible across modalities by using two VAEs with matching latent dimension and prior distribution. We apply the per-frame image DFC-VAE model from Hou et al. [20]. The input image is passed through four 4x4 convolutional layers followed by a fully-connected layer to arrive at the latent code. The decoder uses four 3x3 convolutions. Batch norm and LeakyReLU layers are applied between the convolutions. Besides the per-pixel loss of the classical VAE, the DFC-VAE uses a perceptual loss in the feature space of a trained VGG-19 network for high-level feature consistency. This model is trained on individual videos but is applied sequentially to create the output video frames.

3.4. Linking Audio and Visual Spaces

By learning individual audio and video VAEs with the same latent space prior, we can concatenate the audio encoder with the image decoder for our desired AudioViewer translation. However, the learned encoders are only approximations to the true distribution and not all points in latent space are modelled equally well, leading to information loss. To bridge different latent space structures, we introduce an additional cycle constraint that works without audio and image correspondence and ensures that samples from the audio latent space posterior are reconstructed well, similar in spirit to [26, 62]. Figure 4 shows the cyclic chaining from audio latent code to video and back. It is implemented in the content space as

$$\mathcal{L}_{cycle} = \left| E_V(D_V(z_c)) - z_c \right|, \tag{8}$$

where $z_c \in \mathcal{Z}$ is drawn from the posterior of the content part of the audio encoder $E_A(M)$ and the video modules operate on mean values, minimizing point-wise differences in position. Similar to the smoothness loss, the cycle loss can be applied to the disentangled models to visualize only the style part or only the content part, as long as the latent dimension matches, so that the visualization can be content-agnostic or speaker-agnostic.

Figure 4. Cycle constraint. We link audio and video latent spaces with a cycle constraint that ensures that the audio signal content is preserved through video decoding and encoding.

4. Experiments

In the following, we show qualitatively and quantitatively that AudioViewer conveys important audio features via visualizations of portraits or numbers and that it applies to speech and environment sounds. We provide additional qualitative results in the supplemental videos and document.

Quantitative metrics. We compare latent embeddings by the Euclidean distance and mel spectrograms with the signal to noise ratio (SNR). Smoothness is measured as the change in latent space position.

Perceptual study. We perform a user study to analyze the human capability to perceive the translated audio content. Some results may be subjective and we therefore report statistics over 29 questions answered by 19 users. The study details are given in the supplemental document.

Baselines. We compare the proposed coupled audio-visual model and ablate using individual VAEs for audio and speech generation as well as disabling the style-content disentangling. Furthermore, we ablate the latent space priors and training strategies introduced in Sections 3.2 and 3.4.

As a further baseline, we experiment with principal component analysis (PCA) for building a common latent space. PCA yields a projection matrix $W$ that rotates the training samples $u_i$, via $z_i = W u_i$, to have diagonal covariance matrix $\Sigma$ and maximal variance. We can use PCA for sound and images, with $u_i$ being either the flattened mel segment $\mathrm{vec}(M_i)$ or equivalently the flattened video frame $\mathrm{vec}(I_i)$. We use the reduced PCA version, where $W$ maps to a lower-dimensional space. We further scale the rows of $W$ such that $\Sigma$ is the identity matrix $I$. By this construction, the latent space has the same prior covariance as the VAE spaces. This encoding can be decoded with the pseudo-inverse $W^\dagger$.

Audio Datasets. We use the TIMIT dataset [14] for learning speech embeddings. It contains 5.4 hours of audio recordings (16 bit, 16 kHz) as well as time-aligned orthographic, phonetic and word transcriptions for 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. We use the training split (462/50/24 non-overlapping speakers) of the KALDI toolkit [44]. The phonetic annotation is only used at training time and as ground truth for the user study. In addition, we report how well a model trained on speech generalizes to other sounds on the ESC-50 dataset of environmental sounds [45] in the supplemental videos.

Image Datasets. To test what kind of visualization is best for humans to perceive the translated audio, we train and test our model on two image datasets: the face attributes dataset CelebA [37] (162770/19962 images for train/val respectively), and the MNIST [34] dataset (60000/10000 images for train/val respectively).

Training and runtime. Speech models are trained for 300 epochs with batch size 128 (65 pairs), CelebA models for 38 epochs and MNIST models for 24 epochs with batch size 144, and joint models are fine-tuned jointly on audio and image examples for 10 epochs. AudioViewer is real-time capable and we showcase a demonstration in the supplemental video; the amortized inference time for a frame is 5 ms on an i7-9700KF CPU @ 3.60GHz with a single NVIDIA GeForce RTX 2080 Ti.

Figure 5. Word analysis. Instances (different rows) of the same words are encoded with a similar video sequence (single row), yet show the pronunciation variations if spoken by different speakers.

Figure 6. Environment sounds visualization. Although not trained for it, our approach also visualizes natural sounds. Here showing a clear difference between chirp and alarm clock.

Table 1. Throughput lower bound. The throughput from audio to video with AudioViewer can be bounded by mapping back to the audio. It is evident that the cycle consistency improves greatly while the enforcement of additional constraints with $\mathcal{L}_{p,\log MSE}$ and $\mathcal{L}_{rr}$ takes only a small dip.

Audio models                                        Visual models                                            SNR (dB)
Audio PCA                                           Visual PCA                                               23.37
SpeechVAE [21]                                      DFC-VAE on CelebA                                        1.65
                                                    DFC-VAE on MNIST                                         2.01
                                                    DFC-VAE on CelebA (refined w/ $\mathcal{L}_{cycle}$)     4.43
                                                    DFC-VAE on MNIST (refined w/ $\mathcal{L}_{cycle}$)      0.78
SpeechVAE w/ $\mathcal{L}_{p,\log MSE}$,            DFC-VAE on CelebA                                        0.84
$\mathcal{L}_{rr}$, dim=256                         DFC-VAE on MNIST                                         0.81
                                                    DFC-VAE on CelebA (refined w/ $\mathcal{L}_{cycle}$)     4.16
                                                    DFC-VAE on MNIST (refined w/ $\mathcal{L}_{cycle}$)      3.68
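The SNR values in Table 1 compare an input mel spectrogram with its reconstruction after the audio-to-video-to-audio cycle. The exact SNR formula is not spelled out in this section, so the following is only a minimal sketch under the common assumption that SNR is the ratio of signal energy to residual energy, expressed in decibels. The function name `snr_db` and the `eps` stabilizer are hypothetical and not taken from the paper.

```python
import numpy as np


def snr_db(reference, reconstruction, eps=1e-12):
    """Signal-to-noise ratio in dB between a reference array and its reconstruction.

    Assumption (not confirmed by the paper): SNR = 10 * log10(signal energy / residual energy).
    `eps` only guards against division by zero for perfect reconstructions.
    """
    reference = np.asarray(reference, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    signal_energy = np.sum(reference ** 2)
    noise_energy = np.sum((reference - reconstruction) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (noise_energy + eps))


# Toy usage: a reconstruction that is uniformly 10% too small.
mel_segment = np.ones((20, 80))          # 20 frames x 80 mel bins, as in Section 3
mel_reconstructed = 0.9 * mel_segment
print(round(snr_db(mel_segment, mel_reconstructed), 3))
```

Under this definition, a residual with 1% of the signal energy corresponds to 20 dB, which gives a rough sense of scale for the single-digit values reported for the full cycle.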

Table 2. Audio encoding analysis. The SNR of the autoencoder drops with the additional constraints we enforce (smoothness and disentanglement). This is a trade-off for improving velocity and acceleration properties as well as for inducing interpretability.

Audio models                                                       SNR (dB)   Velocity ($s^{-1}$)   Acc. ($s^{-2}$)
Audio PCA                                                          23.37      329.02                13395.11
SpeechVAE [21]                                                     21.89      280.09                11648.17
SpeechVAE w/ $\mathcal{L}_{p,\log MSE}$                            19.89      108.89                3052.19
SpeechVAE w/ $\mathcal{L}_{rr}$                                    7.73       80.80                 2947.70
SpeechVAE w/ $\mathcal{L}_{p,\log MSE}$, $\mathcal{L}_{rr}$        6.20       59.78                 1530.81
SpeechVAE w/ $\mathcal{L}_{p,\log MSE}$, $\mathcal{L}_{rr}$, dim=256   6.28   65.63                 1757.95

Figure 7. Phoneme similarity analysis. The same phonemes occurring at the beginning of different words are mapped to visually similar images (columns) while different speakers have minimal influence (rows).

4.1. Visual Quality and Phonetics

A meaningful audio translation should encode similar sounds with similar visuals. Figure 7 depicts such similarity and dissimilarity on the first 200 ms of a spoken word, since these can be visualized with a single video frame. This figure compares the content encoding to MNIST and faces; both are well suited to distinguish sounds. The same word (column) has high visual similarity across different speakers (F#: female, M#: male; # the speaker ID) and dialects (D#), while words starting with different sounds are visually different. Multiple frames of entire words are shown in Fig. 5. Our user study and supplemental document analyze their differences in more detail and on entire sentences.

Although trained entirely on speech, our low-level formulation at the level of sounds enables AudioViewer to also visualize environment sounds. Figure 6 gives two examples.

4.2. Information Throughput

It is difficult to quantify the information throughput from audio to video as no ground truth is available and user studies are expensive. However, we can use the learned encoder and decoder to map from audio to video and back as a lower bound. This cyclic pattern lets us quantify the loss of information as the distance of the starting point to the reconstructed audio. Figure 8 gives an example and Table 1 summarizes relations quantitatively. The difference can be quantified and analyzed qualitatively by listening and comparing the original and reconstructed audio samples. PCA attains the highest reconstruction accuracy but has poor smoothness properties and does not do well along the other dimensions. Interpretability in the disentangled space (lower half of Table 1) has a relatively low toll on the SNR, a reduction from 4.43 to 4.16. MNIST proved less stable to train and does not fare well with the cycle loss, perhaps due to having a lower dimensionality compared to the audio encoding.

We further analyze the effect of the additional constraints on the audio autoencoder by measuring its throughput individually. Out of the models with smoothness, $\mathcal{L}_{p,\log MSE}$ with $\lambda_p = 10^3$ attains the highest reconstruction quality (SNR) and lowest latent space velocity calculated with finite differences. The disentanglement reduces audio reconstruction significantly. However, this has limited effect on the overall throughput, as analyzed before, likely due to bottlenecks in the visual model dominating the overall accuracy.

4.3. Temporal Smoothness Effectiveness

Mapping from audio to video with VAEs without constraints and PCA leads to choppy results, with mean latent space velocities of roughly 280 to 330 $s^{-1}$ (Table 2). We visualize the gain in smoothness in Figure 9 by plotting the high-dimensional latent trajectory embedded into three dimensions using multi-dimensional scaling (MDS) [8]. The gain is similar for all of the proposed smoothness constraints. The supplemental videos show how the smoothness eases information perception.
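The latent space velocity and acceleration mentioned above are described as being calculated with finite differences. A minimal sketch of such a measurement is given below, assuming a latent trajectory stored as a (T, d) array sampled at a fixed time step; the function name `latent_velocity_acceleration`, the use of the Euclidean norm, and the averaging over time are assumptions for illustration and may differ from the authors' evaluation code.

```python
import numpy as np


def latent_velocity_acceleration(latents, dt):
    """Mean speed and mean acceleration magnitude of a latent trajectory.

    latents: array of shape (T, d), one latent code per time step (T >= 3).
    dt: time between consecutive latent codes in seconds (assumed constant).
    First and second finite differences approximate velocity and acceleration.
    """
    latents = np.asarray(latents, dtype=float)
    velocity = np.diff(latents, n=1, axis=0) / dt           # shape (T-1, d)
    acceleration = np.diff(latents, n=2, axis=0) / dt ** 2  # shape (T-2, d)
    mean_speed = np.linalg.norm(velocity, axis=1).mean()
    mean_acc = np.linalg.norm(acceleration, axis=1).mean()
    return mean_speed, mean_acc


# Toy usage: a straight-line trajectory moves at constant speed, so acceleration is zero.
trajectory = np.outer(np.arange(5), [3.0, 4.0])  # each step moves by a vector of length 5
speed, acc = latent_velocity_acceleration(trajectory, dt=0.04)  # 0.04 s matches 25 Hz video
print(speed, acc)
```

A smooth encoding keeps both numbers low, while a trajectory that jumps between distant codes at every frame yields large values for both.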

[Figure 8: pipeline diagram. Input Mel spectrogram, audio encoder, visual decoder, corresponding synthesized images, visual encoder, audio decoder, reconstructed Mel spectrogram.]

Figure 8. Throughput. The information throughput can be measured by mapping from audio to video and back. Playing the depicted Mel spectrogram shows that the speech is still recognizable.

[Figure 9: 3D MDS embedding. Panels: Baseline-PCA, Ours w/o smoothness, Ours (log smoothness).]

Figure 9. Smoothness evaluation by plotting the latent embedding of an audio snippet in 3D via dimensionality reduction. Left: Without smoothness constraint, consecutive audio sample embeddings are scattered. Right: Our full model leads to smoother trajectories.

4.4. Visual Domain Influence

Earlier work [5, 25] suggests that faces are suitable for representing high-dimensional data. Our experiments support this finding in that the measurable information content is larger. The SNR of the jointly trained CelebA models is larger than that of the MNIST models. Figure 7 also indicates that the facial models are richer. The distinctiveness is further analyzed in the user study evaluation.

4.5. User Study

We analyze the gain from a disentangled latent space vs. a combined one and the difference of using faces or digits for visualization with a user study of 29 questions among 19 participants, with the same questions repeated for the three visualization techniques mentioned above and with questions randomized.

Identifying and distinguishing sounds. Our user study analyzes the ability to distinguish between different sounds, using visualizations similar to the ones shown in Figure 7; the results are shown in Table 3. Overall, the users were able to correctly recognize the visualizations for the same sound with an accuracy of 85.4% for the CelebA content model and 79.7% for the MNIST content model. Broken down by tasks, users of the CelebA model achieve the better accuracy, 84.6%, in correctly matching one of two visualizations against a reference (random guessing yields 50%), while for the task of grouping four visualizations into pairs, users achieved an accuracy of 85.8% (random guessing yields 33.3%).

Table 3. User study results. User answer accuracy in percent (± std.) for distinguishing between visualizations of different sounds, broken down by question type and phone-pairs vs. words.

| | CelebA | MNIST |
|---|---|---|
| Matching questions | 84.5 ± 13.8 | 77.4 ± 10.6 |
| Grouping questions | 85.2 ± 8.8 | 80.8 ± 8.4 |
| Phone-pairs | 85.7 ± 11.2 | 91.8 ± 9.2 |
| Words | 84.5 ± 8.6 | 73.8 ± 7.2 |
| Overall | 85.0 ± 6.8 | 79.7 ± 6.5 |

Latent space disentanglement. Visualizing the style and content part separately with our full model significantly improves recognition scores. The disentangled face model increases the accuracy for distinguishing between speakers of different sex and from different dialects. Our AudioViewer prototype should therefore use the full model and show separate visual decodings side by side or have the option to switch between content and style visualization.

The user study is reported in full in the supplemental document.

5. Limitations and Future Work

Users were sometimes confusing the visualization of sounds and words, and not all dialects could be distinguished. This may improve with users learning the specifics of the method but could also be further supported by making the method more intuitive and by providing higher information content with more powerful face generators. In the future, we plan to integrate realistic lip motion for spoken sounds that are captured in the lip motion while maintaining the proposed general encoding for those that are not and for environment sounds. Environment sounds lose higher frequencies since the model is trained on speech.

6. Conclusion

We presented AudioViewer, a new approach for visualizing audio via generative models coupled with cycle consistency and smoothness constraints. We have explored different generative models and target domains and found that while visualizations based on CelebA faces have richer features and a higher SNR, the visualizations based on the MNIST digits translated with VAEs give the most consistent mapping that is best recognized by human participants, as indicated by our user study. Because our method is self-supervised, personalized models for any language or sound domain can be learned. We hope that our approach will catalyze the development of future tools for assisting with handicaps linked to perceiving subtle audio differences.

References

[1] Aguirre, G.K., Zarahn, E., D'Esposito, M.: An area within human ventral cortex sensitive to "building" stimuli: evidence and implications. Neuron 21(2), 373–383 (1998)
[2] Briot, J.P., Hadjeres, G., Pachet, F.D.: Deep learning techniques for music generation–a survey. arXiv preprint arXiv:1709.01620 (2017)
[3] Cappelletta, L., Harte, N.: Phoneme-to-viseme mapping for visual speech recognition. In: ICPRAM (2). pp. 322–329 (2012)
[4] Chen, L., Li, Z., K Maddox, R., Duan, Z., Xu, C.: Lip movements generation at a glance. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 520–535 (2018)
[5] Chernoff, H.: The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association 68(342), 361–368 (1973)
[6] Chu, C., Zhmoginov, A., Sandler, M.: Cyclegan, a master of steganography. arXiv preprint arXiv:1712.02950 (2017)
[7] Chung, J.S., Jamaludin, A., Zisserman, A.: You said that? arXiv preprint arXiv:1705.02966 (2017)
[8] Cox, M.A., Cox, T.F.: Multidimensional scaling. In: Handbook of data visualization, pp. 315–347. Springer (2008)
[9] Dong, H.W., Hsiao, W.Y., Yang, L.C., Yang, Y.H.: Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
[10] Duarte, A., Roldan, F., Tubau, M., Escur, J., Pascual, S., Salvador, A., Mohedano, E., McGuinness, K., Torres, J., Giro-i Nieto, X.: Wav2pix: speech-conditioned face generation using generative adversarial networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). vol. 3 (2019)
[11] Eberhardt, S.P., Bernstein, L.E., Coulter, D.C., Hunckler, L.A.: Omar a haptic display for speech perception by deaf and deaf-blind individuals. In: Proceedings of IEEE Virtual Reality Annual International Symposium. pp. 195–201. IEEE (1993)
[12] Elssmann, S.F., Maki, J.E.: Speech spectrographic display: use of visual feedback by hearing-impaired adults during independent articulation practice. American Annals of the Deaf pp. 276–279 (1987)
[13] Fernandez-Lopez, A., Sukno, F.M.: Optimizing phoneme-to-viseme mapping for continuous lip-reading in spanish. In: International Joint Conference on Computer Vision, Imaging and Computer Graphics. pp. 305–328. Springer (2017)
[14] Garofolo, J.S.: Timit acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993 (1993)
[15] Goldish, L.H., Taylor, H.E.: The optacon: A valuable device for blind persons. Journal of Visual Impairment & Blindness 68(2), 49–56 (1974)
[16] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
[17] Hao, W., Zhang, Z., Guan, H.: Cmcgan: A uniform framework for cross-modal visual-audio mutual generation. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
[18] Hardcastle, W.J., Gibbon, F.E., Jones, W.: Visual display of tongue-palate contact: electropalatography in the assessment and remediation of speech disorders. International Journal of Language & Communication Disorders 26(1), 41–74 (1991)
[19] Hiasa, Y., Otake, Y., Takao, M., Matsuoka, T., Takashima, K., Carass, A., Prince, J.L., Sugano, N., Sato, Y.: Cross-modality image synthesis from unpaired data using cyclegan. In: International workshop on simulation and synthesis in medical imaging. pp. 31–41. Springer (2018)
[20] Hou, X., Shen, L., Sun, K., Qiu, G.: Deep feature consistent variational autoencoder. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1133–1141 (2017)
[21] Hsu, W.N., Zhang, Y., Glass, J.: Learning latent representations for speech generation and transformation. In: Interspeech. pp. 1273–1277 (2017)
[22] Hu, D., Wang, D., Li, X., Nie, F., Wang, Q.: Listen to the image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
[23] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1501–1510 (2017)
[24] IBM: Speech viewer iii (2004)
[25] Jacob, R.J., Egeth, H.E.: The face as a data display. Human Factors 18(2), 189–200 (1976)
[26] Jha, A.H., Anand, S., Singh, M., Veeravasarapu, V.: Disentangling factors of variation with cycle-consistent variational auto-encoders. In: European Conference on Computer Vision. pp. 829–845. Springer (2018)
[27] Kamper, H.: Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6535–6539. IEEE (2019)
[28] Kanwisher, N., Stanley, D., Harris, A.: The fusiform face area is selective for faces not animals. Neuroreport 10(1), 183–187 (1999)
[29] Karras, T., Aila, T., Laine, S., Herva, A., Lehtinen, J.: Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36(4), 1–12 (2017)
[30] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)
[31] Katz, W.F., Mehta, S.: Visual feedback of tongue movement for novel speech sound learning. Frontiers in human neuroscience 9, 612 (2015)
[32] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
[33] Kröger, B.J., Birkholz, P., Hoffmann, R., Meng, H.: Audiovisual tools for phonetic and articulatory visualization in computer-aided pronunciation training. In: Development of multimodal interfaces: Active listening and synchrony, pp. 337–345. Springer (2010)
[34] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
[35] Levis, J., Pickering, L.: Teaching intonation in discourse using speech visualization technology. System 32(4), 505–524 (2004)
[36] Lidestam, B., Beskow, J.: Visual phonemic ambiguity and speechreading. Journal of Speech, Language, and Hearing Research (2006)
[37] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (December 2015)
[38] Maidenbaum, S., Abboud, S., Amedi, A.: Sensory substitution: Closing the gap between basic research and widespread practical visual rehabilitation. Neuroscience & Biobehavioral Reviews 41, 3–15 (2014)
[39] Musk, E., et al.: An integrated brain-machine interface platform with thousands of channels. Journal of medical Internet research 21(10), e16194 (2019)
[40] Newman, J.L., Theobald, B.J., Cox, S.J.: Limitations of visual speech recognition. In: Auditory-Visual Speech Processing 2010 (2010)
[41] Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T.: Audio-visual speech recognition using deep learning. Applied Intelligence 42(4), 722–737 (2015)
[42] Öster, A.: Teaching speech skills to deaf children by computer-based speech training. STL-Quarterly Progress and Status Report 36(4), 67–75 (1995)
[43] Park, S.H., Kim, D.J., Lee, J.H., Yoon, T.S.: Integrated speech training system for hearing impaired. IEEE Transactions on Rehabilitation Engineering 2(4), 189–196 (1994)
[44] Piczak, K.J.: Esc: Dataset for environmental sound classification. In: Proceedings of the 23rd ACM international conference on Multimedia. pp. 1015–1018 (2015)
[45] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al.: The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding. No. CONF, IEEE Signal Processing Society (2011)
[46] Sadoughi, N., Busso, C.: Speech-driven expressive talking lips with conditional sequential generative adversarial networks. IEEE Transactions on Affective Computing (2019)
[47] Shlizerman, E., Dery, L., Schoen, H., Kemelmacher-Shlizerman, I.: Audio to body dynamics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7574–7583 (2018)
[48] Sperling, G.: The information available in brief visual presentations. Psychological monographs: General and applied 74(11), 1 (1960)
[49] Stewart L, L.W., R, H.: A real-time sound spectrograph with implications for speech training for the deaf. In: IEEE International Conference Acoustics Speech and Signal Processing (1976)
[50] Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4), 1–13 (2017)
[51] Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200 (2016)
[52] Takahashi, T., Fukayama, S., Goto, M.: Instrudive: A music visualization system based on automatically recognized instrumentation. In: ISMIR. pp. 561–568 (2018)
[53] Tarr, M.J., Gauthier, I.: Ffa: a flexible fusiform area for subordinate-level visual processing automatized by expertise. Nature neuroscience 3(8), 764–769 (2000)
[54] Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A.G., Hodgins, J., Matthews, I.: A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36(4), 1–11 (2017)
[55] Tian, Y., Engel, J.: Latent translation: Crossing modalities by bridging generative models. arXiv preprint arXiv:1902.08261 (2019)
[56] Tmenova, O., Martin, R., Duong, L.: Cyclegan for style transfer in x-ray angiography. International journal of computer assisted radiology and surgery 14(10), 1785–1794 (2019)
[57] Vougioukas, K., Petridis, S., Pantic, M.: End-to-end speech-driven facial animation with temporal gans. arXiv preprint arXiv:1805.09313 (2018)
[58] Wen, Y., Singh, R., Raj, B.: Reconstructing faces from voices. arXiv preprint arXiv:1905.10604 (2019)
[59] Wiles, O., Sophia Koepke, A., Zisserman, A.: X2face: A network for controlling face generation using images, audio, and pose codes. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 670–686 (2018)
[60] Xu, W., Lifang, X., Dan, Y., Zhiyan, H.: Speech visualization based on robust self-organizing map (rsom) for the hearing impaired. In: 2008 International Conference on BioMedical Engineering and Informatics. vol. 2, pp. 506–509. IEEE (2008)
[61] Yang, D., Xu, B., Wang, X.: Speech visualization based on improved spectrum for deaf children. In: 2010 Chinese Control and Decision Conference. pp. 4377–4380. IEEE (2010)
[62] Yook, D., Leem, S.G., Lee, K., Yoo, I.C.: Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders. In: Proc. Odyssey 2020 The Speaker and Language Recognition Workshop. pp. 215–221 (2020)
[63] Yoshii, K., Goto, M.: Music thumbnailer: Visualizing musical pieces in thumbnail images based on acoustic features. In: ISMIR. pp. 211–216 (2008)
[64] Yuan, L., Chen, Y., Fu, S., Wu, A., Qu, H.: Speechlens: A visual analytics approach for exploring speech strategies with textural and acoustic features. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). pp. 1–8. IEEE (2019)
[65] Zaccagnini, C.M., Antia, S.D.: Effects of multisensory speech training and visual phonics on speech production of a hearing-impaired child. Journal of Childhood Communication Disorders 15(2), 3–8 (1993)
[66] Zhou, H., Liu, Y., Liu, Z., Luo, P., Wang, X.: Talking face generation by adversarially disentangled audio-visual representation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 9299–9306 (2019)
[67] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)

Supplemental Document
AudioViewer: Learning to visualize sounds

Yuchi Zhang, University of British Columbia, Vancouver, Canada, [email protected]
Willis Peng, University of British Columbia, Vancouver, Canada, [email protected]
Bastian Wandt, Leibniz University Hannover, Hannover, Germany, [email protected]
Helge Rhodin, University of British Columbia, Vancouver, Canada, [email protected]

This document provides an additional analysis and details on the user study, supplementing the main document. We also prepared an additional supplemental HTML file (link) that contains audio and video snippets that cannot be played in a conventional PDF. These give additional qualitative results for spoken sentences and environment sounds.

1. Results on Environment Sounds

With the style disentangling training of our AudioViewer method, the audio model is more specific to human speech than general sounds. In the following, we test the generalization capability of our method (trained on the speech TIMIT dataset [1]) on the ESC-50 environment sound dataset [3].

Table 1 shows the reconstruction accuracy when going via the audio and video VAEs (see the Information Throughput section in the main document). The SNR for the reconstructed Mel spectrum is generally lower than on the speech dataset, which is expected since the model was trained on the latter. The analysis of the reconstruction ability of the audio VAE in isolation (without going through the video VAE) reported in Table 2 shows that a large fraction of this loss of accuracy stems from the learning of speech-specific features of the audio VAE. Moreover, with a recombined reconstruction loss term on the human speech dataset, the model was fitted to speech features and tended to lose high-pitch information. Still, according to the face visualization of the content encoding shown in the HTML file, our AudioViewer can generate consistent visualizations for given environment sounds.

Table 1. Information throughput on environment sounds, showing that the reconstruction error increases when evaluating on a test set that contains sounds vastly different from the training set (speech vs. environment sounds).

| Audio models | Visual models | SNR (dB) |
|---|---|---|
| Audio PCA | Visual PCA | 17.23 |
| SpeechVAE | DFC-VAE on CelebA | 1.03 |
| | DFC-VAE on MNIST | 2.22 |
| | DFC-VAE on CelebA (refined w/ cycle) | 2.83 |
| | DFC-VAE on MNIST (refined w/ cycle) | 0.76 |
| SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$, $\mathcal{L}_{rr}$, dim=256 | DFC-VAE on CelebA | 0.46 |
| | DFC-VAE on MNIST | 0.46 |
| | DFC-VAE on CelebA (refined w/ cycle) | 1.26 |
| | DFC-VAE on MNIST (refined w/ cycle) | 1.68 |

Table 2. Audio VAE Mel spectrum reconstruction. The average SNR of autoencoding and decoding Mel spectrograms on ESC-50 shows a significant reconstruction loss. The average speed and acceleration between the latent vectors (dim = 128) of neighbouring frames (∆t = 0.04 s) confirm the experiments in the main document: smoothness comes at the cost of lower reconstruction accuracy.

| Audio models | SNR (dB) | Velocity (s$^{-1}$) | Acc. (s$^{-2}$) |
|---|---|---|---|
| Audio PCA | 17.23 | 170.13 | 6960.80 |
| SpeechVAE [2] | 10.17 | 172.57 | 7331.77 |
| SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$ | 8.92 | 58.78 | 1859.95 |
| SpeechVAE w/ $\mathcal{L}_{rr}$ | 2.98 | 40.54 | 1580.86 |
| SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$, $\mathcal{L}_{rr}$ | 2.19 | 30.66 | 909.39 |
| SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$, $\mathcal{L}_{rr}$, dim=256 | 2.51 | 33.88 | 1037.97 |

2. Additional ablation study on $\lambda_c$ and $\lambda_p$

Due to the low SNR of the disentangled model, here we report the ablation experiments on $\lambda_c$ and $\lambda_p$ without the disentangled training to better illustrate the performance of the two loss terms.

Table 3 shows the reconstruction error on TIMIT when reconstructing the Mel spectrum via the audio and video VAEs (see the Information Throughput section in the main document). We compare versions where the visual VAEs are trained independently or refined with the audio ones using different weights $\lambda_c$ for the cycle loss. Irrespective of the visual domain (CelebA or MNIST), $\lambda_c = 10$ works best together with both temporal losses and is selected for the disentangled version. Moreover, as expected, the higher cycle-reconstruction SNR indicates that the face visualization has a higher information capacity than the digit one.

Table 4 shows the reconstruction SNR (encoding followed by decoding with the audio VAE) on the test set for models trained with different loss functions, as in the main document but with a larger number of tested weights $\lambda_p$. The SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$ with $\lambda_p = 10^3$ is the best-performing model and was selected for the disentangled model.
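The SNR values discussed here compare an original Mel spectrogram with its reconstruction. The exact formula is not restated in this document, so the following is only a sketch under the standard definition, SNR = 10 log10(signal energy / error energy), applied to amplitude spectrograms stored as nested lists. The function name `snr_db` is hypothetical.

```python
import math


def snr_db(reference, reconstruction):
    """Signal-to-noise ratio in dB between two equally shaped 2D spectrograms.

    reference, reconstruction: lists of rows (lists of floats).
    Assumes the standard energy-ratio definition, which may differ in detail
    from the definition used to produce the tables.
    """
    signal = sum(x * x for row in reference for x in row)
    noise = sum(
        (x - y) ** 2
        for ref_row, rec_row in zip(reference, reconstruction)
        for x, y in zip(ref_row, rec_row)
    )
    if signal == 0:
        raise ValueError("reference has zero energy")
    if noise == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / noise)


# Hypothetical 2x2 example: every bin is reconstructed 10% too low.
original = [[1.0, 1.0], [1.0, 1.0]]
decoded = [[0.9, 0.9], [0.9, 0.9]]
value = snr_db(original, decoded)
```

With this definition, a uniform 10% amplitude error corresponds to an energy ratio of 100, that is, about 20 dB, which gives a sense of scale for the single-digit SNR values in the tables.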

Table 3. Experiments with different cycle loss weight $\lambda_c$. Average SNR of the cycle-reconstructed amplitude Mel spectrogram on the TIMIT test set.

| Audio models | Visual models | $\lambda_c$ | SNR (dB) |
|---|---|---|---|
| SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$ ($\lambda_p = 10^3$) | DFC-VAE on CelebA | - | 1.51 |
| | DFC-VAE on MNIST | - | 2.16 |
| | DFC-VAE on CelebA (refined w/ $\mathcal{L}_c$) | 1 | 4.86 |
| | DFC-VAE on MNIST (refined w/ $\mathcal{L}_c$) | 1 | 4.54 |
| | DFC-VAE on CelebA (refined w/ $\mathcal{L}_c$) | 10 | 11.81 |
| | DFC-VAE on MNIST (refined w/ $\mathcal{L}_c$) | 10 | 3.98 |
| | DFC-VAE on CelebA (refined w/ $\mathcal{L}_c$) | 100 | 10.51 |
| | DFC-VAE on MNIST (refined w/ $\mathcal{L}_c$) | 100 | 5.22 |
| SpeechVAE w/ $\mathcal{L}_{p,Q}$ ($\lambda_p = 10^3$) | DFC-VAE on CelebA | - | 1.96 |
| | DFC-VAE on MNIST | - | 1.87 |
| | DFC-VAE on CelebA (refined w/ $\mathcal{L}_c$) | 1 | 5.95 |
| | DFC-VAE on MNIST (refined w/ $\mathcal{L}_c$) | 1 | 7.06 |
| | DFC-VAE on CelebA (refined w/ $\mathcal{L}_c$) | 10 | 12.25 |
| | DFC-VAE on MNIST (refined w/ $\mathcal{L}_c$) | 10 | 3.12 |
| | DFC-VAE on CelebA (refined w/ $\mathcal{L}_c$) | 100 | 8.71 |
| | DFC-VAE on MNIST (refined w/ $\mathcal{L}_c$) | 100 | 5.26 |

Table 4. Additional ablation study on $\lambda_p$. The average SNR of the reconstructed amplitude Mel spectrogram on the TIMIT test set and the average speed and acceleration between the latent vectors (dim = 128) of neighbouring frames (∆t = 0.04 s).

| Audio models | $\lambda_p$ | SNR (dB) | Velocity (s$^{-1}$) | Acceleration (s$^{-2}$) |
|---|---|---|---|---|
| Audio PCA | - | 23.37 | 329.02 | 13395.11 |
| SpeechVAE w/o $\mathcal{L}_p$ | - | 21.90 | 280.09 | 11648.17 |
| SpeechVAE w/ $\mathcal{L}_{p,\log\mathrm{MSE}}$ | 1 | 16.74 | 231.91 | 9490.04 |
| | $10^3$ | 19.86 | 108.89 | 3052.19 |
| | $10^7$ | 11.43 | 117.43 | 2674.86 |
| | $10^9$ | 7.77 | 76.50 | 1735.97 |
| SpeechVAE w/ $\mathcal{L}_{p,Q}$ | 1 | 18.83 | 234.19 | 9595.15 |
| | $10^3$ | 18.44 | 119.94 | 3133.26 |
| | $10^7$ | 11.35 | 123.78 | 2476.44 |
| | $10^9$ | 9.00 | 119.60 | 2405.18 |
| SpeechVAE w/ $\mathcal{L}_{p,\mathrm{MSE}}$ | 1 | 17.89 | 243.55 | 10067.36 |
| | $10^3$ | 18.15 | 226.74 | 9252.76 |
| | $10^7$ | 12.64 | 129.10 | 4192.21 |
| | $10^9$ | 10.56 | 84.05 | 2709.96 |

3. User Study

Our user study required participants to complete one of 4 versions of a questionnaire: disentangled content model with CelebA visualizations (CelebA-content), disentangled style model with CelebA visualizations (CelebA-style), disentangled content model with MNIST visualizations (MNIST-content) and a combined model with CelebA visualizations (CelebA-combined). Each version of the questionnaire asked the same set of 29 questions with randomized ordering of answers within each question. The study was conducted with 14, 15, 12, and 14 participants for the CelebA-content, CelebA-style, CelebA-combined and MNIST-content questionnaires, respectively. Amongst all the questionnaires, there were 27 unique participants. It took participants between 10 and 15 minutes to complete each questionnaire. In addition, the questionnaire did not indicate the purpose of the underlying research and only asked participants to perform two possible tasks: matching and grouping visualizations.

The format of the questionnaire is outlined in Table 5. The questions tested for two factors: sound content, sounds that share the same phoneme sequences, and sound style, sounds produced by speakers of the same sex or speaker dialect. This purely visual comparison allows us to analyze different aspects individually.

Table 5. A breakdown of the questionnaire format by sound type and tested factors, listing their frequency of occurrence.

| Question type | Sound type | Tested factor | Questions |
|---|---|---|---|
| Matching questions | Phone-pairs | Content | 3 |
| | | Style (sex) | 3 |
| | | Style (dialect) | 2 |
| | Words | Content | 3 |
| | | Style (sex) | 3 |
| | | Style (dialect) | 2 |
| Grouping questions | Phone-pairs | Content + style (dialect) | 2 |
| | | Content + style (dialect + sex) | 2 |
| | Words | Content | 3 |
| | | Content + style (dialect) | 2 |
| | | Content + style (sex) | 1 |
| | | Content + style (dialect + sex) | 1 |
| | | Content (similar sounding words) | 2 |
| Total | | | 29 |

Figure 1. Matching Example. Examples from the CelebA-combined and MNIST-content questionnaire of a matching question asked to participants in the user study.

Matching questions. Matching questions asked the participants to choose which of two possible visualizations is most visually similar to a given reference visual. Figure 1 shows examples of matching questions. Matching questions were used to assess the viability for users to distinguish between the same sounds produced by speakers possessing different speaker traits as well as determining whether structural similarities in the underlying audio translated into similarities in the visualization. In particular, the questionnaire contained 6 questions for evaluating the ability to distinguish between sound content, which compared visualizations of sounds of different phoneme sequences (3 for phone-pairs and 3 for words). Phone-pairs are short in length and therefore the corresponding visualization was a single frame image, whereas visualizations of words were videos. In order to evaluate the ability to distinguish between sound style, 6 questions compared visualizations of the same phoneme sequence between male and female speakers and 4 questions addressed distinguishing between speakers of different dialects. In total there were 16 matching questions. Since each question has two options, the expected mean accuracy for random guessing is 50%.

Figure 2. Grouping Example. Examples from the CelebA-combined and MNIST-content questionnaire of a grouping question asked to participants in the user study.

Grouping questions. Grouping questions asked the participants to group 4 visualizations into two pairs of similar visualizations. Figure 2 shows examples of grouping questions. Grouping questions were used to assess the degree to which visualizations of different words are distinguishable and visualizations of the same word are similar. In particular, the study required users to group visualizations of two pairs of sounds, whereby different pairs are sound clips with shared factors of the same sound content or same sound style. In total, the user study consisted of 4 grouping questions based on phone-pairs and 9 grouping questions based on words. Since there are three possible options, the expected mean accuracy for random guessing is 33.3%.

3.1. Results

For each of the three models, we tested for sound content (phoneme sequences) and sound style (speaker dialect and speaker sex). We generated the mean accuracy and standard deviation for each tested factor and each question sub-type as shown in Table 6. The results of the disentangled models with CelebA visualizations (CelebA-disentangled) are aggregated by taking the results of CelebA-content on the questions which tested for sound content and the results of CelebA-style on questions which tested for sound style. Since we only evaluated the disentangled content model for the MNIST visualizations, it is only compared for the content questions. Significance values comparing model means were calculated using a two-sample two-tailed t-test with unequal variance and without any outlier rejection.

Table 6. User study results. Values indicate mean accuracy and standard deviation for distinguishing between visualizations of the tested factor across participants as a percentage.

| Tested factor | Sub-type | Baseline | CelebA-disentangled | CelebA-combined | MNIST-content |
|---|---|---|---|---|---|
| Content | Phone-pairs | 40.3 | 85.7 ± 11.2 | 70.2 ± 9.6 | 91.8 ± 9.2 |
| | Words | 37.3 | 84.5 ± 8.6 | 74.3 ± 14.0 | 73.8 ± 7.2 |
| | Matching | 50.0 | 84.5 ± 13.8 | 84.7 ± 4.8 | 77.4 ± 10.6 |
| | Grouping | 33.3 | 85.2 ± 8.8 | 67.3 ± 14.0 | 80.8 ± 8.4 |
| | Overall | 38.4 | 85.0 ± 6.8 | 72.8 ± 10.0 | 79.7 ± 6.5 |
| Style (dialect) | Overall | 50.0 | 56.7 ± 22.1 | 39.6 ± 24.9 | - |
| Style (sex) | Overall | 43.3 | 78.0 ± 10.8 | 43.3 ± 8.9 | - |

[Figure 3: bar chart of accuracy by visualization model (CelebA-disentangled, MNIST-disentangled, CelebA-combined).]

Figure 3. Model Comparison on content questions. The mean user accuracy and standard deviation for distinguishing phoneme sequences is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, * for p < 0.05, ** for p < 0.01). The mean accuracy for random guessing is 0.384.

[Figure 4: two bar charts of accuracy by visualization model (CelebA-disentangled, CelebA-combined).]

Figure 4. Model Comparison on style questions. The mean user accuracy and standard deviation for distinguishing speaker sex (left) and dialect (right) is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, *** for p < 0.001). The mean accuracies for random guessing are 0.43 and 0.5 respectively.

Figure 3 shows that users achieve the highest overall accuracy on the CelebA-disentangled model with 85.0 ± 6.8% (significant with p < 0.05) for distinguishing between visualizations of different content. The MNIST-content model has the highest accuracy for distinguishing between different phone-pairs with 91.8 ± 9.2%, although not significantly higher than the CelebA-disentangled one (p > 0.05), but has a much lower accuracy for distinguishing between different words, suggesting that the MNIST visualizations may be better suited for representing shorter sounds. The CelebA-disentangled model outperforms the CelebA-combined model for distinguishing between speakers of different sex with 78.0 ± 10.8% (significant with p < 0.001) and between speakers of different dialects with 56.7 ± 22.1% (marginally significant with 0.05 < p < 0.10), as shown in Figure 4. The task of distinguishing between speakers of different dialects is much more difficult than distinguishing between phoneme sequences, since there are 8 categories of dialects in the dataset and differences in dialects are much more subtle and often overlap. This translates to lower accuracy in this question category.

References

[1] Garofolo, J.S.: Timit acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993 (1993)
[2] Hsu, W.N., Zhang, Y., Glass, J.: Learning latent representations for speech generation and transformation. In: Interspeech. pp. 1273–1277 (2017)
[3] Piczak, K.J.: Esc: Dataset for environmental sound classification. In: Proceedings of the 23rd ACM international conference on Multimedia. pp. 1015–1018 (2015)
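Note on the significance test in Section 3.1. The two-sample t-test with unequal variance named there is commonly known as Welch's t-test. As a rough illustration, the sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom from summary statistics alone. It assumes that the reported ± values are per-participant sample standard deviations and that the group sizes are the participant counts given in Section 3; both are assumptions, and the function name is hypothetical. Converting the statistic to a p-value additionally needs the CDF of Student's t distribution, which is left out here.

```python
import math


def welch_t_from_summary(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and approximate degrees of freedom.

    Inputs are group means, sample standard deviations and group sizes.
    """
    # Squared standard errors of the two group means
    se2_a = std_a ** 2 / n_a
    se2_b = std_b ** 2 / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Overall content accuracy from Table 6:
# CelebA-disentangled 85.0 +/- 6.8 (assumed n = 14, CelebA-content)
# CelebA-combined     72.8 +/- 10.0 (assumed n = 12)
t_stat, dof = welch_t_from_summary(85.0, 6.8, 14, 72.8, 10.0, 12)
```

Under these assumptions the statistic comes out at roughly 3.6 with about 19 degrees of freedom. This is only a plausibility check on the summary numbers and not a reproduction of the reported significance levels, which were computed from per-participant data.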