Hifi-GAN: Generative Adversarial Networks for Efficient and High
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Enhance Your DSP Course with These Interesting Projects
AC 2012-3836: ENHANCE YOUR DSP COURSE WITH THESE INTER- ESTING PROJECTS Dr. Joseph P. Hoffbeck, University of Portland Joseph P. Hoffbeck is an Associate Professor of electrical engineering at the University of Portland in Portland, Ore. He has a Ph.D. from Purdue University, West Lafayette, Ind. He previously worked with digital cell phone systems at Lucent Technologies (formerly AT&T Bell Labs) in Whippany, N.J. His technical interests include communication systems, digital signal processing, and remote sensing. Page 25.566.1 Page c American Society for Engineering Education, 2012 Enhance your DSP Course with these Interesting Projects Abstract Students are often more interested learning technical material if they can see useful applications for it, and in digital signal processing (DSP) it is possible to develop homework assignments, projects, or lab exercises to show how the techniques can be used in realistic situations. This paper presents six simple, yet interesting projects that are used in the author’s undergraduate digital signal processing course with the objective of motivating the students to learn how to use the Fast Fourier Transform (FFT) and how to design digital filters. Four of the projects are based on the FFT, including a simple voice recognition algorithm that determines if an audio recording contains “yes” or “no”, a program to decode dual-tone multi-frequency (DTMF) signals, a project to determine which note is played by a musical instrument and if it is sharp or flat, and a project to check the claim that cars honk in the tone of F. Two of the projects involve designing filters to eliminate noise from audio recordings, including designing a lowpass filter to remove a truck backup beeper from a recording of an owl hooting and designing a highpass filter to remove jet engine noise from a recording of birds chirping. -
Self-Training Wavenet for TTS in Low-Data Regimes
StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes Manish Sharma, Tom Kenter, Rob Clark Google UK fskmanish, tomkenter, [email protected] Abstract is increased. However, it can be seen from their results that the quality degrades when the number of recordings is further Recently, WaveNet has become a popular choice of neural net- decreased. work to synthesize speech audio. Autoregressive WaveNet is To reduce the voice artefacts observed in WaveNet stu- capable of producing high-fidelity audio, but is too slow for dent models trained under a low-data regime, we aim to lever- real-time synthesis. As a remedy, Parallel WaveNet was pro- age both the high-fidelity audio produced by an autoregressive posed, which can produce audio faster than real time through WaveNet, and the faster-than-real-time synthesis capability of distillation of an autoregressive teacher into a feedforward stu- a Parallel WaveNet. We propose a training paradigm, called dent network. A shortcoming of this approach, however, is that StrawNet, which stands for “Self-Training WaveNet”. The key a large amount of recorded speech data is required to produce contribution lies in using high-fidelity speech samples produced high-quality student models, and this data is not always avail- by an autoregressive WaveNet to self-train first a new autore- able. In this paper, we propose StrawNet: a self-training ap- gressive WaveNet and then a Parallel WaveNet model. We refer proach to train a Parallel WaveNet. Self-training is performed to models distilled this way as StrawNet student models. using the synthetic examples generated by the autoregressive We evaluate StrawNet by comparing it to a baseline WaveNet teacher. -
4 – Synthesis and Analysis of Complex Waves; Fourier Spectra
4 – Synthesis and Analysis of Complex Waves; Fourier spectra Complex waves Many physical systems (such as music instruments) allow existence of a particular set of standing waves with the frequency spectrum consisting of the fundamental (minimal allowed) frequency f1 and the overtones or harmonics fn that are multiples of f1, that is, fn = n f1, n=2,3,…The amplitudes of the harmonics An depend on how exactly the system is excited (plucking a guitar string at the middle or near an end). The sound emitted by the system and registered by our ears is thus a complex wave (or, more precisely, a complex oscillation ), or just signal, that has the general form nmax = π +ϕ = 1 W (t) ∑ An sin( 2 ntf n ), periodic with period T n=1 f1 Plots of W(t) that can look complicated are called wave forms . The maximal number of harmonics nmax is actually infinite but we can take a finite number of harmonics to synthesize a complex wave. ϕ The phases n influence the wave form a lot, visually, but the human ear is insensitive to them (the psychoacoustical Ohm‘s law). What the ear hears is the fundamental frequency (even if it is missing in the wave form, A1=0 !) and the amplitudes An that determine the sound quality or timbre . 1 Examples of synthesis of complex waves 1 1. Triangular wave: A = for n odd and zero for n even (plotted for f1=500) n n2 W W nmax =1, pure sinusoidal 1 nmax =3 1 0.5 0.5 t 0.001 0.002 0.003 0.004 t 0.001 0.002 0.003 0.004 £ 0.5 ¢ 0.5 £ 1 ¢ 1 W W n =11, cos ¤ sin 1 nmax =11 max 0.75 0.5 0.5 0.25 t 0.001 0.002 0.003 0.004 t 0.001 0.002 0.003 0.004 ¡ 0.5 0.25 0.5 ¡ 1 Acoustically equivalent 2 0.75 1 1. -
Lecture 1-10: Spectrograms
Lecture 1-10: Spectrograms Overview 1. Spectra of dynamic signals: like many real world signals, speech changes in quality with time. But so far the only spectral analysis we have performed has assumed that the signal is stationary: that it has constant quality. We need a new kind of analysis for dynamic signals. A suitable analogy is that we have spectral ‘snapshots’ when what we really want is a spectral ‘movie’. Just like a real movie is made up from a series of still frames, we can display the spectral properties of a changing signal through a series of spectral snapshots. 2. The spectrogram: a spectrogram is built from a sequence of spectra by stacking them together in time and by compressing the amplitude axis into a 'contour map' drawn in a grey scale. The final graph has time along the horizontal axis, frequency along the vertical axis, and the amplitude of the signal at any given time and frequency is shown as a grey level. Conventionally, black is used to signal the most energy, while white is used to signal the least. 3. Spectra of short and long waveform sections: we will get a different type of movie if we choose each of our snapshots to be of relatively short sections of the signal, as opposed to relatively long sections. This is because the spectrum of a section of speech signal that is less than one pitch period long will tend to show formant peaks; while the spectrum of a longer section encompassing several pitch periods will show the individual harmonics, see figure 1-10.1. -
Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J
1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17]. -
Unsupervised Speech Representation Learning Using Wavenet Autoencoders
Unsupervised speech representation learning using WaveNet autoencoders https://arxiv.org/abs/1901.08810 Jan Chorowski University of Wrocław 06.06.2019 Deep Model = Hierarchy of Concepts Cat Dog … Moon Banana M. Zieler, “Visualizing and Understanding Convolutional Networks” Deep Learning history: 2006 2006: Stacked RBMs Hinton, Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks” Deep Learning history: 2012 2012: Alexnet SOTA on Imagenet Fully supervised training Deep Learning Recipe 1. Get a massive, labeled dataset 퐷 = {(푥, 푦)}: – Comp. vision: Imagenet, 1M images – Machine translation: EuroParlamanet data, CommonCrawl, several million sent. pairs – Speech recognition: 1000h (LibriSpeech), 12000h (Google Voice Search) – Question answering: SQuAD, 150k questions with human answers – … 2. Train model to maximize log 푝(푦|푥) Value of Labeled Data • Labeled data is crucial for deep learning • But labels carry little information: – Example: An ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes Labels: 1M * 10bit = 10Mbits Raw data: (128 x 128 images): ca 500 Gbits! Value of Unlabeled Data “The brain has about 1014 synapses and we only live for about 109 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 105 dimensions of constraint per second.” Geoff Hinton https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/ Unsupervised learning recipe 1. Get a massive labeled dataset 퐷 = 푥 Easy, unlabeled data is nearly free 2. Train model to…??? What is the task? What is the loss function? Unsupervised learning by modeling data distribution Train the model to minimize − log 푝(푥) E.g. -
Analysis of EEG Signal Processing Techniques Based on Spectrograms
ISSN 1870-4069 Analysis of EEG Signal Processing Techniques based on Spectrograms Ricardo Ramos-Aguilar, J. Arturo Olvera-López, Ivan Olmos-Pineda Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science, Puebla, Mexico [email protected],{aolvera, iolmos}@cs.buap.mx Abstract. Current approaches for the processing and analysis of EEG signals consist mainly of three phases: preprocessing, feature extraction, and classifica- tion. The analysis of EEG signals is through different domains: time, frequency, or time-frequency; the former is the most common, while the latter shows com- petitive results, implementing different techniques with several advantages in analysis. This paper aims to present a general description of works and method- ologies of EEG signal analysis in time-frequency, using Short Time Fourier Transform (STFT) as a representation or a spectrogram to analyze EEG signals. Keywords. EEG signals, spectrogram, short time Fourier transform. 1 Introduction The human brain is one of the most complex organs in the human body. It is considered the center of the human nervous system and controls different organs and functions, such as the pumping of the heart, the secretion of glands, breathing, and internal tem- perature. Cells called neurons are the basic units of the brain, which send electrical signals to control the human body and can be measured using Electroencephalography (EEG). Electroencephalography measures the electrical activity of the brain by recording via electrodes placed either on the scalp or on the cortex. These time-varying records produce potential differences because of the electrical cerebral activity. The signal generated by this electrical activity is a complex random signal and is non-stationary [1]. -
Real-Time Black-Box Modelling with Recurrent Neural Networks
Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019 REAL-TIME BLACK-BOX MODELLING WITH RECURRENT NEURAL NETWORKS Alec Wright, Eero-Pekka Damskägg, and Vesa Välimäki∗ Acoustics Lab, Department of Signal Processing and Acoustics Aalto University Espoo, Finland [email protected] ABSTRACT tube amplifiers and distortion pedals. In [14] it was shown that the WaveNet model of several distortion effects was capable of This paper proposes to use a recurrent neural network for black- running in real time. The resulting deep neural network model, box modelling of nonlinear audio systems, such as tube amplifiers however, was still fairly computationally expensive to run. and distortion pedals. As a recurrent unit structure, we test both Long Short-Term Memory and a Gated Recurrent Unit. We com- In this paper, we propose an alternative black-box model based pare the proposed neural network with a WaveNet-style deep neu- on an RNN. We demonstrate that the trained RNN model is capa- ral network, which has been suggested previously for tube ampli- ble of achieving the accuracy of the WaveNet model, whilst re- fier modelling. The neural networks are trained with several min- quiring considerably less processing power to run. The proposed utes of guitar and bass recordings, which have been passed through neural network, which consists of a single recurrent layer and a the devices to be modelled. A real-time audio plugin implement- fully connected layer, is suitable for real-time emulation of tube ing the proposed networks has been developed in the JUCE frame- amplifiers and distortion pedals. -
Linear Prediction-Based Wavenet Speech Synthesis
LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis Min-Jae Hwang Frank Soong Eunwoo Song Search Solution Microsoft Naver Corporation Seongnam, South Korea Beijing, China Seongnam, South Korea [email protected] [email protected] [email protected] Xi Wang Hyeonjoo Kang Hong-Goo Kang Microsoft Yonsei University Yonsei University Beijing, China Seoul, South Korea Seoul, South Korea [email protected] [email protected] [email protected] Abstract—We propose a linear prediction (LP)-based wave- than the speech signal, the training and generation processes form generation method via WaveNet vocoding framework. A become more efficient. WaveNet-based neural vocoder has significantly improved the However, the synthesized speech is likely to be unnatural quality of parametric text-to-speech (TTS) systems. However, it is challenging to effectively train the neural vocoder when the target when the prediction errors in estimating the excitation are database contains massive amount of acoustical information propagated through the LP synthesis process. As the effect such as prosody, style or expressiveness. As a solution, the of LP synthesis is not considered in the training process, the approaches that only generate the vocal source component by synthesis output is vulnerable to the variation of LP synthesis a neural vocoder have been proposed. However, they tend to filter. generate synthetic noise because the vocal source component is independently handled without considering the entire speech To alleviate this problem, we propose an LP-WaveNet, production process; where it is inevitable to come up with a which enables to jointly train the complicated interactions mismatch between vocal source and vocal tract filter. -
Improved Spectrograms Using the Discrete Fractional Fourier Transform
IMPROVED SPECTROGRAMS USING THE DISCRETE FRACTIONAL FOURIER TRANSFORM Oktay Agcaoglu, Balu Santhanam, and Majeed Hayat Department of Electrical and Computer Engineering University of New Mexico, Albuquerque, New Mexico 87131 oktay, [email protected], [email protected] ABSTRACT in this plane, while the FrFT can generate signal representations at The conventional spectrogram is a commonly employed, time- any angle of rotation in the plane [4]. The eigenfunctions of the FrFT frequency tool for stationary and sinusoidal signal analysis. How- are Hermite-Gauss functions, which result in a kernel composed of ever, it is unsuitable for general non-stationary signal analysis [1]. chirps. When The FrFT is applied on a chirp signal, an impulse in In recent work [2], a slanted spectrogram that is based on the the chirp rate-frequency plane is produced [5]. The coordinates of discrete Fractional Fourier transform was proposed for multicompo- this impulse gives us the center frequency and the chirp rate of the nent chirp analysis, when the components are harmonically related. signal. In this paper, we extend the slanted spectrogram framework to Discrete versions of the Fractional Fourier transform (DFrFT) non-harmonic chirp components using both piece-wise linear and have been developed by several researchers and the general form of polynomial fitted methods. Simulation results on synthetic chirps, the transform is given by [3]: SAR-related chirps, and natural signals such as the bat echolocation 2α 2α H Xα = W π x = VΛ π V x (1) and bird song signals, indicate that these generalized slanted spectro- grams provide sharper features when the chirps are not harmonically where W is a DFT matrix, V is a matrix of DFT eigenvectors and related. -
Anomaly Detection in Raw Audio Using Deep Autoregressive Networks
ANOMALY DETECTION IN RAW AUDIO USING DEEP AUTOREGRESSIVE NETWORKS Ellen Rushe, Brian Mac Namee Insight Centre for Data Analytics, University College Dublin ABSTRACT difficulty to parallelize backpropagation though time, which Anomaly detection involves the recognition of patterns out- can slow training, especially over very long sequences. This side of what is considered normal, given a certain set of input drawback has given rise to convolutional autoregressive ar- data. This presents a unique set of challenges for machine chitectures [24]. These models are highly parallelizable in learning, particularly if we assume a semi-supervised sce- the training phase, meaning that larger receptive fields can nario in which anomalous patterns are unavailable at training be utilised and computation made more tractable due to ef- time meaning algorithms must rely on non-anomalous data fective resource utilization. In this paper we adapt WaveNet alone. Anomaly detection in time series adds an additional [24], a robust convolutional autoregressive model originally level of complexity given the contextual nature of anomalies. created for raw audio generation, for anomaly detection in For time series modelling, autoregressive deep learning archi- audio. In experiments using multiple datasets we find that we tectures such as WaveNet have proven to be powerful gener- obtain significant performance gains over deep convolutional ative models, specifically in the field of speech synthesis. In autoencoders. this paper, we propose to extend the use of this type of ar- The remainder of this paper proceeds as follows: Sec- chitecture to anomaly detection in raw audio. In experiments tion 2 surveys recent related work on the use of deep neu- using multiple audio datasets we compare the performance of ral networks for anomaly detection; Section 3 describes the this approach to a baseline autoencoder model and show su- WaveNet architecture and how it has been re-purposed for perior performance in almost all cases. -
Attention, I'm Trying to Speak Cs224n Project: Speech Synthesis
Attention, I’m Trying to Speak CS224n Project: Speech Synthesis Akash Mahajan Management Science & Engineering Stanford University [email protected] Abstract We implement an end-to-end parametric text-to-speech synthesis model that pro- duces audio from a sequence of input characters, and demonstrate that it is possi- ble to build a convolutional sequence to sequence model with reasonably natural voice and pronunciation from scratch in well under $75. We observe training the attention to be a bottleneck and experiment with 2 modifications. We also note interesting model behavior and insights during our training process. Code for this project is available on: https://github.com/akashmjn/cs224n-gpu-that-talks. 1 Introduction We have come a long way from the ominous robotic sounding voices used in the Radiohead classic 1. If we are to build significant voice interfaces, we need to also have good feedback that can com- municate with us clearly, that can be setup cheaply for different languages and dialects. There has recently also been an increased interest in generative models for audio [6] that could have applica- tions beyond speech to music and other signal-like data. As discussed in [13] it is fascinating that is is possible at all for an end-to-end model to convert a highly ”compressed” source - text - into a substantially more ”decompressed” form - audio. This is a testament to the sheer power and flexibility of deep learning models, and has been an interesting and surprising insight in itself. Recently, results from Tachibana et. al. [11] reportedly produce reasonable-quality speech without requiring as large computational resources as Tacotron [13] and Wavenet [7].