AudioViewer: Learning to Visualize Sound
AudioViewer: Learning to Visualize Sound

by

Yuchi Zhang

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2020

© Yuchi Zhang, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled AudioViewer: Learning to Visualize Sound, submitted by Yuchi Zhang in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Helge Rhodin, Computer Science (Supervisor)
Kwang Moo Yi, Computer Science (Examining Committee Member)

Abstract

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design is to translate from audio to video by compressing both into a common latent space with a shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video, and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features, since they can easily be parsed by humans to match and distinguish between sounds, words, and speakers.

Lay Summary

We develop AudioViewer, an automated tool that translates speech into videos of faces or hand-written numbers. Without audio feedback, it is challenging for hearing impaired people to learn speech. To help, we propose AudioViewer, which generates videos from utterances as a substitute form of visual feedback. Our current work is a proof of concept. AudioViewer is designed to generate consistent and recognizable visualizations of pronunciation. For example, the word 'water' is visualized as a short animation: a portrait of a young brunette lady with a short hairstyle, facing rightward with a neutral expression, gradually transforms into a smiling young blonde lady facing forward. Our preliminary user study demonstrates that people can tell the difference between visualizations of different pronunciations and recognize the consistency of identical ones.

Preface

The work presented here is original work done by the author, Yuchi Zhang, in collaboration with Willis Peng and Dr. Bastian Wandt under the supervision of Dr. Helge Rhodin. A shorter version of this work is under review as a conference submission. Design, implementation, and experiments were done by Yuchi Zhang. Dr. Helge Rhodin provided suggestions and feedback during the development, and Willis Peng helped with the user study evaluation. The initial draft of the paper was written by all of us and revised by Dr. Bastian Wandt.
The user studies conducted for this work have been approved by the UBC Behavioural Research Ethics Board (Certificate Number: H19-03006).

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
3 Method
3.1 Audio Encoding
3.2 Structuring the Audio Encoding
3.3 Image Encoding for Video Generation
3.4 Linking Audio and Visual Spaces
4 Experiments
4.1 Temporal Smoothness Effectiveness
4.2 Information Throughput
4.3 Visual Quality and Phonetics
4.4 Visual Domain Influence
4.5 Results on Environment Sounds
4.6 User study
4.6.1 Design of the questionnaire
4.6.2 Results
5 Limitations and Future Work
6 Conclusion
Bibliography

List of Tables

Table 4.1 Audio encoding analysis. The average SNR of the reconstructed amplitude Mel spectrogram on the TIMIT test set and the average speed and acceleration between the latent vectors (dim = 128) of neighbouring frames (∆t = 0.04 s). The SNR of the autoencoder drops with the additional constraints we enforce (smoothness and disentanglement). This is a trade-off for improving velocity and acceleration properties as well as for inducing interpretability.

Table 4.2 Throughput lower bound. The throughput from audio to video with AudioViewer can be bounded by mapping back to the audio. It is evident that cycle consistency improves it greatly, while the enforcement of the additional constraints L_{p,log}, MSE, and L_{rr} causes only a small dip.

Table 4.3 Information throughput on environment sounds, showing that the reconstruction error increases when evaluating on a test set that contains sounds vastly different from the training set (speech vs. environment sounds).

Table 4.4 Audio VAE Mel spectrum reconstruction. The average SNR of encoding and decoding Mel spectrograms on ESC-50 shows a significant reconstruction loss. The average speed and acceleration between the latent vectors (dim = 128) of neighbouring frames (∆t = 0.04 s) confirm the experiments in the main document: smoothness comes at the cost of lower reconstruction accuracy.

Table 4.5 A breakdown of the questionnaire format by sound type and tested factors, listing their frequency of occurrence.

Table 4.6 User study results. Values indicate the mean accuracy and standard deviation, as a percentage, for distinguishing between visualizations of the tested factor across participants.

List of Figures

Figure 1.1 AudioViewer: A tool developed towards the long-term goal of helping hearing impaired users to see what they cannot hear. We map an audio stream to video and thereby use faces or numbers for visualizing the high-dimensional audio features intuitively. Different from the principle of lip reading, this can encode general sound and pass on information about the style of spoken language, for instance, the timbre of the speaker.

Figure 2.1 An illustration of harmonics and formants of a speech voice in a filter bank image. The mel-scaled power spectrogram above shows the word 'party' spoken by a male speaker. The vertical axis indicates mel-scaled filter banks and the horizontal axis is time. The narrow stripes in the spectrogram are the harmonics, which are integral multiples of the fundamental frequency (F0) and reflect the vibration of the vocal folds. Note that the change in F0 reflects the intonation during speech. The wide band peaks are the formants, reflecting the resonant frequencies of the vocal tract. Formant patterns differ when different vowels (e.g., aa, r, and iy) are spoken.

Figure 3.1 Overview. A joint latent encoding is learned with audio and video VAEs that are linked with a cycle constraint and augmented with a smoothness constraint (∆v annotation).

Figure 3.2 Disentanglement, by mixing content encodings from two speakers saying the same phones (z_ci with z_ci') and style encodings from the same speaker saying different words (z_sa with z_sa').

Figure 3.3 Cycle constraint. We link audio and video latent spaces with a cycle constraint that ensures that the audio signal content is preserved through video decoding and encoding.

Figure 4.1 Smoothness evaluation by plotting the latent embedding of an audio snippet in 3D via dimensionality reduction. Left: the PCA baseline model; consecutive audio sample embeddings are scattered. Middle: similar behaviour for our model without the temporal smoothness constraint. Right: our full model leads to smoother trajectories.

Figure 4.2 Throughput. The information throughput can be measured by mapping from audio to video and back. Playing the depicted Mel spectrogram shows that the speech is still recognizable.

Figure 4.3 Word analysis. Instances (different rows) of the same words are encoded with a similar video sequence (single row), yet they show pronunciation variations if spoken by different speakers.

Figure 4.4 Phoneme similarity analysis. The same phonemes occurring at the beginning of different words are mapped to visually similar images (columns), while different speakers have minimal influence (rows).

Figure 4.5 Environment sounds visualization. Although not trained for it, our approach also visualizes natural sounds, here showing a clear difference between a chirp and an alarm clock. The visualization looks brighter and more intense when a target sound occurs.

Figure 4.6 Matching Example. Examples from the CelebA-combined and MNIST-content questionnaires of a matching question asked to participants in the user study.

Figure 4.7 Grouping Example. Examples from the CelebA-combined and MNIST-content questionnaires of a grouping question asked to participants in the user study.

Figure 4.8 Model Comparison on content questions. The mean user accuracy and standard deviation for distinguishing phoneme sequences is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, * for p < 0.05, ** for p < 0.01). The mean accuracy for random guessing is 0.384.

Figure 4.9 Model Comparison on style questions. The mean user accuracy and standard deviation for distinguishing speaker sex (left) and dialect (right) is compared between the models, annotated with the significance level († for 0.05 < p < 0.1, *** for p < 0.001). The mean accuracies for random guessing are 0.43 and 0.5, respectively.

Figure 5.1 Recombined reconstruction sample. Top left: The Mel spectrograms of a sentence spoken by a female speaker and its reconstruction by our audio model with disentanglement training.
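The metrics referenced in Tables 4.1 and 4.4 (reconstruction SNR and the average speed and acceleration between neighbouring latent frames) can be made concrete with a short sketch. The following Python snippet is not the thesis implementation; the function names, array shapes, and the use of per-frame Euclidean norms are illustrative assumptions only.

# Minimal sketch, assuming per-frame latent vectors of dimension 128 sampled
# every 0.04 s, as stated in the table captions. Not the thesis code.
import numpy as np

def snr_db(reference, reconstruction):
    # Signal-to-noise ratio (in dB) of a reconstructed Mel spectrogram.
    noise = reference - reconstruction
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def latent_speed_and_acceleration(z, dt=0.04):
    # z: array of shape (T, 128), one latent vector per audio frame.
    # dt: frame spacing in seconds (0.04 s in the table captions).
    velocity = np.diff(z, axis=0) / dt              # (T-1, 128)
    acceleration = np.diff(velocity, axis=0) / dt   # (T-2, 128)
    avg_speed = np.linalg.norm(velocity, axis=1).mean()
    avg_accel = np.linalg.norm(acceleration, axis=1).mean()
    return avg_speed, avg_accel

if __name__ == "__main__":
    # Toy data standing in for a Mel spectrogram, its reconstruction,
    # and a latent trajectory; values are random and purely illustrative.
    rng = np.random.default_rng(0)
    mel = rng.random((80, 200))
    mel_hat = mel + 0.05 * rng.standard_normal(mel.shape)
    z = rng.standard_normal((100, 128))
    print(f"SNR: {snr_db(mel, mel_hat):.1f} dB")
    print("avg speed, avg acceleration:", latent_speed_and_acceleration(z))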