
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Deep Learning for Speech Enhancement - A Study on WaveNet, GANs and General CNN-RNN Architectures

OSCAR XING LUO

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Deep Learning for Speech Enhancement - A Study on WaveNet, GANs and General CNN-RNN Architectures

OSCAR XING LUO

Master in Computer Science
Date: July 1, 2019
Supervisor: Jonas Beskow
Examiner: Olov Engwall
Swedish title: Djupinlärning för talsignalförbättring - en studie om WaveNet, GANs och generell CNN-RNN-arkitektur
School of Electrical Engineering and Computer Science


Abstract

Clarity and intelligibility are important aspects of speech, especially in a time of misinformation and mistrust. The breakthrough in generative models for audio has brought massive improvements for speech enhancement. DeepMind's WaveNet architecture has been modified for noise reduction in a model called WaveNet denoising and has proven to be state-of-the-art. Another competitor on the market would be the Speech Enhancement Generative Adversarial Network (SEGAN), which adapts the GAN architecture to applications on speech. While most older models focus on feature extraction and spectrogram analysis, these two models attempt to skip those steps and become completely end-to-end. While end-to-end is good, data preprocessing is still a valuable asset to consider. A network designed by Microsoft Research called EHNet uses spectrogram data as input instead of the mere 1D waveforms to capture more relations between datapoints, as a higher dimension can enable more information. This thesis aims to explore the speech enhancement field of study from a deep learning perspective and focuses on the three mentioned architectures through theory dissection and results on new datasets. There is also an implementation of the Wiener filter as a benchmark. We arrive at the conclusion that all three networks are viable for the task of enhancing speech; however, SEGAN performed better on our dataset and was more robust to new data in comparison. For future work one could improve the evaluation methods, change datasets and implement hyperparameter optimization for further comparative analysis.

Sammanfattning

Klarhet och förståelse är viktiga aspekter av tal, särskilt i en tid då falsk information och misstrogenhet är vanligt. Genombrottet för generativa modeller inom ljud har medfört stora förbättringar inom talsignalförbättring. DeepMinds WaveNet-arkitektur har modifierats för brusreducering i en modell som kallas för WaveNet-denoising, vilket har visat goda resultat. En annan konkurrent på marknaden är det generella adversariella nätverket för talsignalförbättring (SEGAN) som anpassar GAN-arkitekturen till tillämpningar på tal. Medan de flesta äldre modeller fokuserar på särdragsextraktion och spektrogramanalys, så försöker de två nya modellerna att ignorera dessa koncept och vara end-to-end istället. Medan end-to-end är bra är databehandling fortfarande en viktig aspekt som är värdefull att överväga. Ett nätverk som designats av Microsoft Research heter EHNet och använder spektrogramdata som input istället för enbart 1D-vågformer för att fånga upp fler relationer mellan datapunkter, då högre dimensioner möjliggör mer information. Detta examensarbete syftar till att utforska studieområdet inom talsignalförbättring samt utreda de tre nämnda arkitekturerna genom teoretisk undersökning och resultat på nya dataset. Det kommer också vara en implementering av Wienerfilter som riktmärke för resultaten. Vi kommer fram till slutsatsen att alla tre nätverk är möjliga alternativ inom talsignalförbättring men SEGAN är den bästa modellen när det kommer till resultat på vårt specifika dataset och med avseende på robusthet. För framtida arbeten kan man förbättra utvärderingsmetoderna, ändra datasetet och implementera hyperparameteroptimering för ytterligare jämförande analyser.

Acknowledgements

The thesis was made possible thanks to several people in my life. I'd like to acknowledge all my friends and family who supported me and kept me sane throughout the study. I will specifically acknowledge my mother, Kezhao Xing, since she asked for it. Obviously the ones who aren't mentioned by name are equally important to me, but a mother is a mother. I would also like to acknowledge my supervisor Jonas Beskow, examiner Olov Engwall and course coordinator Ann Bengtsson for being very helpful and for their fast response times regarding both administrative issues and feedback on the thesis work.

Contents

1 Introduction
  1.1 State of the art
    1.1.1 New Models
  1.2 Ethics
  1.3 Related Works
  1.4 Scientific Question

2 Background
  2.1 Speech Enhancement
    2.1.1 Evaluation
  2.2 Signal Processing
    2.2.1 Wiener Filter
  2.3 Short-Time Fourier Transform
  2.4 Machine Learning
    2.4.1 Supervised Learning
    2.4.2 Unsupervised Learning
    2.4.3 Semi-supervised Learning
  2.5 Artificial Neural Networks
    2.5.1 Perceptron
    2.5.2 Convolutional Neural Networks
    2.5.3 Activation Functions
    2.5.4 Recurrent Neural Networks
    2.5.5 ResNet
    2.5.6 Batch Normalization
  2.6 Generative Models
    2.6.1 Gated PixelCNN
    2.6.2 WaveNet
    2.6.3 Generative Adversarial Networks
  2.7 Dataset
  2.8 Models in Review
    2.8.1 SEGAN
    2.8.2 WaveNet Denoising
    2.8.3 EHNet
    2.8.4 Robustness

3 Method
  3.1 Evaluation
  3.2 SEGAN
  3.3 WaveNet Denoising
  3.4 EHNet
  3.5 Wiener Filter
  3.6 Technical Implementation

4 Results
  4.1 Loss Functions
    4.1.1 SEGAN
    4.1.2 WaveNet Denoising
    4.1.3 EHNet
  4.2 Evaluation Scores

5 Discussion
  5.1 Training Procedure and Data
  5.2 Comparative Analysis in Overall Performance
  5.3 Improvements Outside the Models
  5.4 Conclusion

Bibliography

A Chinese and Swedish Perspective

Chapter 1

Introduction

In an ever-growing globalized world, communication is an important part of everyday life. Speech is arguably the best communicative method as it provides the most context to the message. However, there are many scenarios where speech can't be used and we are forced to resort to text or other mediums. This thesis looks into one of the aspects that can ruin the availability of speech: background noise.

The field of study is called speech enhancement and aims to remove noise and background intrusiveness to introduce clarity and intelligibility to the speech sample. Speech enhancement is an important subject that has several application opportunities in various industries. The main application would be noise reduction in conversations via mobile phones and similar technologies. One non-obvious application could be the usage on controversial recordings of public figures that have been contaminated by a noisy background. In a world where correct information is the key to truth and misinformation spreads like wildfire, clarity is of great importance. An example of this would be an audio clip of a world leader uttering something in a crowded room that is picked up by a bystander's phone. In this audio clip the speech is not clear because of the environment, and a conversation that could incriminate one of the most powerful leaders in the world becomes uncertain. With the help of speech enhancement technology, this situation would become crystal clear.

Another more conventional application of speech enhancement would be to use it as a tool in machine learning, as previously noisy files wouldn't be considered unviable as data anymore. It would also even the playing field for independent creators who can record their podcasts, interviews, audiobooks etc. anywhere without the need for professional equipment. We also have societal contributions in hearing aids and voice-command technologies for those who aren't otherwise able, as well as enhancing speech and amplifying the voices of those who require assistance in speaking.

1.1 State of the art

While signal processing and subspace methods have historically been the dominant theories behind speech enhancement, the industry today has moved on to greener pastures. In computer science there is a field that has been popularized in recent times called Machine Learning (ML). This field has been applied to vastly different industries and succeeded in almost all of them. The reason for this is the improvement in processing power with GPU technology, as NVIDIA and AMD have taken the next step towards AI research. The algorithms which have been so revolutionary have all been built upon foundations of statistical theory that in the past were practically limited only by time. However, as with all research, people will always strive towards improvement and further innovation. For industry giants such as Google, this means creating new combinations and new iterative algorithms that can re-examine previously impractical solutions to hard problems, like AlphaGo by Silver et al. [1]. Despite that, not all research requires giant corporations, as there have been innovations birthed from the scarcity of big data as well.

1.1.1 New Models

The modern innovations that have taken the spotlight are all built with the core of older deep learning theory. Many of the newer architectures are a natural expansion of previous ideas, simply taken to the next level. Some have thought of combining two successful architectures into one, while others have simply borrowed concepts from other areas and applied them with great success in newer fields.

WaveNet

Among the new innovations that have been brought up is the WaveNet architecture, a generative model for raw audio. It is an end-to-end autoregressive model that has become the state-of-the-art for speech generation. The model is based on the PixelCNN architecture, which has cemented its importance in image generation. The core architecture is so good that modifications to it have produced considerable results in other fields. Staying in the speech enhancement field of study, a newly developed model called WaveNet denoising, based on the original architecture, is specifically aimed at reducing the noise in contaminated audio files to enhance speech.

Generative Adversarial Networks

One example of an architecture that could be a possible competitor or alternative to WaveNet is the new concept of Generative Adversarial Networks (GANs). This concept has been extremely successful for image generation and image tagging. To adapt the method from a visual to an audio format would be interesting to try, as the data representation can be similarly achieved. GANs were introduced to the machine learning scene as a force to be reckoned with in 2016; however, the first instance of them can be found in 2014 [2]. From that point, a smorgasbord of variations has been developed spanning a massive range of topics. To name a few:

• GAP – Context Aware Generative Adversarial Privacy: A generative model that lets “data holders learn the privatization mechanism directly from the dataset without requiring access to the dataset statistics.” Introduced by Huang et al. [3].

• SeqGAN: A policy model that is applied to sequence generation. Introduced by Yu et al. [4].

• SEGAN – Speech Enhancement GAN: Using ConvNets to explore noise reduction in audio files specifically containing speech. Introduced by Pascual, Bonafonte, and Serrà [5].

A full list of GANs can be found in the GAN-Zoo on GitHub by Hindupur [6], if the reader is interested in exploring further.

1.2 Ethics

When it comes to ethical questions, engineering and scientific fields of study often share the same worries. However, there are almost no ethical questions to be answered for speech enhancement that are specific to the subject. One does have to take into account applications in war and malicious uses such as harassment or other disorderly conduct. Nevertheless, outside of those general topics that concern all engineers, there shouldn't be anything major that could create a divide in the future for speech enhancement. If we are talking about deep learning as a whole and the impact the technology can have ethically on the world, then we are discussing a different topic with a broader view. It's probable that some of the speech enhancement technologies might advance adjacent fields of study that could harm the world, since deep learning algorithms are very general in their applications. Nevertheless, these are more philosophical questions that are outside the scope of this thesis.

1.3 Related Works

There have been many studies in the speech enhancement community with different approaches through deep learning. While this thesis focuses primarily on exploring the field of study, we still omit some networks that could be interesting to examine. The reason for omitting most of these is that their code isn't available online, so we can't do an independent evaluation of their results. Other reasons include noise-specific problems that are fringe to the study.

Speech Denoising with Deep Feature Losses

In a study from late 2018, Germain, Chen, and Koltun [7] introduce a new algorithm with deep feature losses and compare its performance to SEGAN and WaveNet. The proposed algorithm is a fully-convolutional context aggregation network using a deep feature loss, a concept related to the perceptual losses used in image style transfer. They manage to outperform the optimal weights of both state-of-the-art models on the same dataset and argue that the advantage of this approach is particularly pronounced for data with the most intrusive background noise.

Speech Enhancement In Multiple-Noise Conditions using Deep Neural Networks

Kumar and Florencio [8] present the scenario where multiple noises simultaneously corrupt speech. The argument is that most speech enhancement tools focus on the presence of a single noise in corrupted speech, despite this not being true to real life. By combining deep neural networks with concepts such as the Mel-frequency spectrum and the short-time Fourier transform, they achieve results that perform "remarkably well".

Whisper-to-Voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

Pascual et al. [9] introduce an adaptation of the SEGAN architecture to work on whispered-to-voiced speech conversion to help patients who suffer from aphonia. The modifications are specifically implemented for a dataset of whispered samples. In this paper they also present an improvement to the original SEGAN by making it shallower and adding learnable parameters for the skip connections; they call it SEGAN+.

Speech Enhancement Based on Improved Deep Neural Networks with MMSE Pretreatment Features

Feature extraction can be important in deep learning. Han et al. [10] explore a new feature which is extracted through a minimum mean square error (MMSE) estimator pre-treatment.

1.4 Scientific Question

Since the field of deep learning is so vast in current times, it can be hard to keep up with what is currently viable as a method of choice for one's task. This thesis aims to explore some of the possible speech enhancement algorithms available in the domain of deep learning and compare their performances with each other. The three models in question are the Speech Enhancement Generative Adversarial Network (SEGAN), WaveNet denoising and the convolutional-recurrent neural network called EHNet. We will explore these three as they have produced state-of-the-art results and compare their performances against each other.

To be more specific, the scientific question becomes: How do SEGAN, WaveNet denoising and EHNet compare against each other, especially in terms of robustness? In addition to these three deep learning models, we will also have an implementation of the Wiener filter as a benchmark comparison. The outcome of this thesis will further the understanding of current speech enhancement algorithms. Evidently, a successful algorithm can also lead to further research in adjacent fields, as the subsets of deep learning are commonly interchangeable as long as the data representation can be similar.

Chapter 2

Background

Lin, Tegmark, and Rolnick [11] state that with the explosion of machine learning and big data, the state of the art has shifted from carefully hand-crafted and fully understood analytical algorithms to more architecture-based deep learning algorithms. They mention that these are only understood at a heuristic level, where we empirically know that certain training protocols will result in excellent performance. In their paper they try to explain the efficacy of deep learning with physics, but a complete conclusion can still not be reached as to why certain neural networks perform better than others. Being blind to the underlying factors is never a good thing in science, which can make it uncertain as to which direction this field should take. Nevertheless, this archaic search for understanding can't compete with the fact that deep learning can often work end-to-end and doesn't require the often limiting factor of clinical feature extraction backed up by a foundation of scientific theory. One could believe that the field of study is simply throwing darts at a board and that random trial and error can achieve improved performance. This is obviously false, as one still needs the groundwork of several different understandings. For this thesis specifically, the knowledge includes some basic speech enhancement theory, general deep learning theory and more in-depth knowledge about generative models. To let the reader have a better comprehension of the final results we will go through each important aspect required to build this proficiency. As an overview, the basic building blocks for the three models in review can be outlined as such:


SEGAN

• Convolutional neural networks

• Skip connections

• Generative adversarial networks

WaveNet Denoising

• Convolutional neural networks

• General WaveNet architecture

EHNet

• Spectrogram extraction

• Convolutional neural networks

• Long short-term memory recurrent neural networks

• Fully connected neural networks

We will go through the background theory one concept at a time, from the most basic machine learning and speech enhancement theory all the way to the intricate theory behind the specific models. By doing this, we hope that a reader who has zero knowledge of deep learning or machine learning can still grasp some concepts and understand the underlying work that has been done. For more experienced readers who are only interested in the dataset and models, section 2.7 and section 2.8 in this chapter would be the most appropriate.

2.1 Speech Enhancement

The term speech enhancement is a more specific way of describing noise reduction, as noise reduction can be more general depending on the problem being solved. The aim behind speech enhancement is to identify where the noise is in the audio and remove it to introduce clarity, intelligibility or pleasantness. The algorithms used for speech enhancement vary, and from a historic perspective signal processing has been the most prevalent method for denoising.

However, in recent times the deep learning field has taken over some of the state-of-the-art technology. The underlying problem with speech enhancement is the dilemma of noise and speech parity [12]. With how signal processing works, especially when using linear filters or subspace methods, reducing noise will introduce speech distortion while enhancing speech will bring more artifacts. Much like the bias-variance decomposition [13] that we encounter in machine learning, this is a delicate case of fine-tuning and optimization to achieve the best results. While most older models use spectrogram analysis and feature extraction, newer deep learning algorithms attempt to remove these methods in favour of end-to-end neural networks. While the three models in this thesis are considered end-to-end, there could be a debate as to how accurate that assessment is, as some models still require a little preprocessing.

2.1.1 Evaluation

Clarity, intelligibility and pleasantness are all very subjective to the human ear; therefore it is not that simple to evaluate mathematically how well a result is performing. This has not stopped people from trying, as there are still some objective evaluations that can be used. For our objective evaluation we applied Perceptual Evaluation of Speech Quality (PESQ) [14] and Short-Time Objective Intelligibility (STOI) [15]. Both evaluation methods correlate with speech intelligibility and attempt to model human perception scores. Hu and Loizou [16] mention that PESQ is the most complex evaluation method to compute and that it is recommended by ITU-T, one of the three sectors of the International Telecommunication Union. It produces a score between -0.5 and 4.5, where higher scores indicate higher quality. STOI is a newer evaluation method which has been shown to have high correlation with the intelligibility of both noisy and time-frequency weighted noisy speech. It produces a score between 0 and 1, where higher scores indicate higher quality. The theory behind both evaluation models is quite extensive, and as this thesis mainly focuses on the deep learning aspect of speech enhancement, we refer the interested reader to the sources themselves.
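As a concrete illustration, per-utterance scores could be computed as in the minimal sketch below. It assumes the third-party Python packages pesq, pystoi and soundfile are installed and that the audio is sampled at 16 kHz; the file names are illustrative and this is not necessarily how the evaluation in this thesis was implemented.

```python
# Hedged sketch: PESQ and STOI for one enhanced utterance, assuming the
# "pesq", "pystoi" and "soundfile" packages are available.
import soundfile as sf
from pesq import pesq      # ITU-T P.862 wrapper
from pystoi import stoi

clean, fs = sf.read("clean.wav")        # reference speech (illustrative file name)
enhanced, _ = sf.read("enhanced.wav")   # output of the model under test

# PESQ in wide-band mode for 16 kHz audio; scores lie roughly in [-0.5, 4.5]
pesq_score = pesq(fs, clean, enhanced, "wb")

# STOI produces a score in [0, 1]; higher means more intelligible
stoi_score = stoi(clean, enhanced, fs, extended=False)

print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.2f}")
```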

While objective evaluation methods can be combined into composite measures as linear combinations of each other, there is still interest in seeing the values independently. There was also a human evaluation with 17 participants, who rated 14 selected samples from each model on a scale from 1 to 5, where:

• 5 - Excellent: Very natural speech with no degradation and not noticeable noise.

• 4 - Very good: Speech is natural sounding but there are some noticeable background noises.

• 3 - OK: Either the background noise is too overpowering or the speech quality has degraded (with almost no background noise)

• 2 - Very bad: Speech is bad and the background noise is very distracting.

• 1 - Terrible: This is not a person who is talking, at least going by the things I can decipher.

This is similar to the evaluation scale that Pascual, Bonafonte, and Serrà [5] used in their paper.

2.2 Signal Processing

While newer models are mostly in the deep learning and machine learning field, from a historic perspective we still have a foundation of signal processing theory that was the state-of-the-art in the recent past. We explore two concepts that are often used in signal processing theory: the Wiener filter, which we will use as a benchmark, and the Short-Time Fourier Transform, a common way to extract spectrogram data.

2.2.1 Wiener Filter

The Wiener filter is one of the most basic filters in signal processing. The equation that defines a finite-duration impulse response (FIR) Wiener filter is stated by Vaseghi [17] as:

$$\hat{x}(m) = \sum_{k=0}^{P-1} w_k\, y(m-k) = \mathbf{w}^{T}\mathbf{y} \qquad (2.1)$$

where $\mathbf{w} = [w_0, w_1, \ldots, w_{P-1}]$ is the Wiener filter coefficient vector, $m$ is the discrete-time index and $\mathbf{y} = [y(m), y(m-1), \ldots, y(m-P+1)]$ is the filter input signal. We can see the whole process in figure 2.1: an input signal with stationary noise, $y$, is filtered by the Wiener filter to produce an approximation $\hat{x}(m)$ of the desired signal $x(m)$. The error function is defined as:

$$e(m) = x(m) - \hat{x}(m) \qquad (2.2)$$

Least-Square Error Estimation

The objective criterion in Wiener theory is the least-square error estimation and is calculated as:

$$E[e^2(m)] = E[(x(m) - \mathbf{w}^{T}\mathbf{y})^2] = E[x^2(m)] - 2\mathbf{w}^{T}E[\mathbf{y}x(m)] + \mathbf{w}^{T}E[\mathbf{y}\mathbf{y}^{T}]\mathbf{w} = r_{xx}(0) - 2\mathbf{w}^{T}\mathbf{r}_{yx} + \mathbf{w}^{T}\mathbf{R}_{yy}\mathbf{w} \qquad (2.3)$$

where $\mathbf{R}_{yy} = E[\mathbf{y}(m)\mathbf{y}^{T}(m)]$ is the autocorrelation matrix of the input signal $\mathbf{y}$ and $\mathbf{r}_{yx} = E[x(m)\mathbf{y}(m)]$ is the cross-correlation vector of the input and target signals. By minimizing the mean square error we obtain the Wiener filter as:

$$\mathbf{w} = \mathbf{R}_{yy}^{-1}\mathbf{r}_{yx} \qquad (2.4)$$

which means that the autocorrelation matrix and the cross-correlation vector are the two components required to calculate our optimal Wiener filter.
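To make equations (2.1)-(2.4) concrete, the sketch below estimates the correlations empirically from a paired noisy/clean signal and solves for the filter taps. It is a minimal numpy illustration rather than the benchmark implementation used later in the thesis; the filter length and toy signals are arbitrary.

```python
import numpy as np

def fir_wiener(y, x, P):
    """Estimate the P-tap FIR Wiener filter w = R_yy^{-1} r_yx (2.4) from a
    noisy signal y and the desired clean signal x, using empirical correlations."""
    N = len(y)
    # Rows are delayed copies of y: [y(m), y(m-1), ..., y(m-P+1)]
    Y = np.stack([np.concatenate([np.zeros(k), y[:N - k]]) for k in range(P)])
    R_yy = Y @ Y.T / N          # autocorrelation matrix estimate
    r_yx = Y @ x / N            # cross-correlation vector estimate
    return np.linalg.solve(R_yy, r_yx)

def apply_fir(w, y):
    """x_hat(m) = sum_k w_k y(m-k), equation (2.1)."""
    return np.convolve(y, w)[:len(y)]

# Toy usage: a sinusoid buried in white noise
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 0.05 * np.arange(2000))
y = x + 0.5 * rng.standard_normal(2000)
x_hat = apply_fir(fir_wiener(y, x, P=32), y)
```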

2.3 Short-Time Fourier Transform

An audio file has many properties that can be used as data. While many ML models are moving towards complete end-to-end learning without any feature extraction, there are still some methods that can make the data fit the model better. Since audio is often one-dimensional data, it can be limiting in learning the relationships between datapoints. The Short-Time Fourier Transform, or STFT, is a way to extract more information from an audio wave.

Figure 2.1: Wiener filter structure. An input signal with static noise $y(m)$ is filtered by a Wiener filter and produces $\hat{x}(m)$. Image adapted from Vaseghi [17].

Let $x(n)$ be a signal defined for all $n$ and let $X_n(e^{j\omega_k})$ be the short-time Fourier transform of $x(n)$ evaluated at time $n$ and frequency $\omega_k$. We can then define the STFT as:

$$X_n(e^{j\omega_k}) = \sum_{m=-\infty}^{\infty} w(n-m)\, x(m)\, e^{-j\omega_k m} \qquad (2.5)$$

where $w(n)$ is a window function (low-pass filter). From this it is common to compute a spectrogram, which is a 2D plot of frequency against time. A model reviewed in this thesis, EHNet, uses these properties to let a convolutional neural network go over a 2D space where the original data is 1D.
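The sketch below is a minimal numpy rendering of (2.5) and of the log-magnitude spectrogram built from it; the Hann window, frame length and hop size are illustrative choices and not necessarily those used for EHNet.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Windowed FFT per frame, a discrete version of (2.5).
    Returns a (frames x frequency bins) complex matrix."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def spectrogram(x, **kwargs):
    """Log-magnitude spectrogram: the 2D representation fed to EHNet-style models."""
    return np.log1p(np.abs(stft(x, **kwargs)))

# Toy usage: one second of a slowly rising tone at 16 kHz
t = np.arange(16000) / 16000
S = spectrogram(np.sin(2 * np.pi * (200 + 300 * t) * t))
print(S.shape)   # (frames, frame_len // 2 + 1)
```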

2.4 Machine Learning

The definition of machine learning changes every once in a while and depends on where the source is from. Lecturers from Stanford like to describe it as "...the science of getting computers to act without being explicitly programmed". Faggella [18] compiled a list of quotes, where we have those within the industry such as NVIDIA, who instead chooses to be more complicated with the statement "Machine Learning at its most basic is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world". People not within the industry, such as McKinsey & Co., have a different definition which is similar to the quote from Stanford: "Machine learning is based on algorithms that can learn from data without relying on rules-based programming". The truth is that all of these definitions can be correct; machine learning is all of this and everything in between. It's a giant umbrella term, which is why researchers are very specific about what subset they are working with.

Speech Enhancement

ML has, as mentioned, many applications in various fields. For speech enhancement we can consider it a problem of denoising, and going by Goodfellow, Bengio, and Courville [19] it is defined as follows: "...an algorithm is given an input of a corrupted example $\tilde{x} \in \mathbb{R}^n$ which is obtained by an unknown corruption process from a clean example $x \in \mathbb{R}^n$. The learner has to predict the conditional distribution $p(x \mid \tilde{x})$". The way of learning can be varied and the algorithm used is often decided by the data available. However, these approaches always fall within three categories: supervised learning, semi-supervised learning and unsupervised learning. We will explore these definitions with the help of Bishop [20].

2.4.1 Supervised Learning

A very common type of dataset that one will encounter is a dataset where there are input vectors $\mathbf{x}$ with corresponding target values $y$ available. The aim is to create a machine learning model that can correctly predict $y$ with the calculations given by the input vectors $\mathbf{x}$. During learning, the algorithm will adjust the weights of the system accordingly as it gets closer and closer to the correct predictions. This learning can be either a classification problem or a regression problem depending on the dataset. Classification problems are characterized by mapping input vectors $\mathbf{x}$ to one specific value out of a finite number of discrete categories, e.g. input data about attributes of a song mapped to a record label. Regression problems are characterized by mapping input vectors $\mathbf{x}$ to one or several continuous variables, e.g. the input data consists of a football player's height, weight, lean muscle mass and body fat, used to predict the pace of their run, which is given in m/s.

2.4.2 Unsupervised Learning

In the case of no labels for the dataset at hand, unsupervised learning is required. As there are no corresponding target values to map the input vectors to, the aim could instead be to cluster the data into groups of similar attributes or to estimate the distribution of the data. One such example is Principal Component Analysis, where higher-dimensional data is projected onto a lower dimension to reduce the dimensionality of the dataset. Most generative models are unsupervised; however, some can also be seen as semi-supervised learning.

2.4.3 Semi-supervised Learning

By combining a small labeled dataset with a bigger unlabeled dataset, one can use semi-supervised learning. Semi-supervised learning is the mixture of the unsupervised and supervised concepts and has been proven successful in increasing performance for certain tasks. The advantages of semi-supervised learning are that the computational power required for the supervised learning is minimized and that the unsupervised learning part is supported by some real-world labeled data, which gives some guidance to the learning. We will mostly explore semi-supervised and unsupervised learning for this thesis.

2.5 Artificial Neural Networks

The field of study under ML which is interesting for this thesis will be artificial neural networks and deep learning. A neural network is an attempt at mimicking the human brain by connecting nodes to each other in a structured way. Neurons are connected to each other via weights (synapses) and the output is decided by which neurons fire and which don't.

Figure 2.2: The machine learning family structure with artificial neural networks and deep learning shown as subsets.

Different Types of Networks

There are many kinds of artificial neural networks; however, they are split into different categories depending on how the data flows through the network. The feedforward neural network is distinguished by the forward flow of the data, while there are also cyclical and bidirectional neural networks. We will explore all three as they play a part in the models in review.

2.5.1 Perceptron

The simplest neural network is the perceptron, and it is described by Bishop [21] as a two-class model with three components: an input layer, an output layer and the connection between the two layers. Mathematically, we can describe the output as $y_i = f_i(\mathbf{w}_i^{T}\mathbf{x})$, where $y_i$ is the $i$th element of the vector $\mathbf{y}$, which is of size $n$, $\mathbf{x}$ is the input vector of size $m$, $\mathbf{w}_i$ is the corresponding weight vector and $f$ is the nonlinear activation function given by:

$$f_i(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases} \qquad (2.6)$$

The choice of the mapping values of the activation function $f$ is of course designed by the value representation of the two classes. Consider our target vector $\mathbf{t}$ of size $m$ containing the two different classes.

Class 1 is represented as a $1$ while class 2 is represented by a $-1$. An example would be $\mathbf{t} = [1, -1, 1, 1]$ where the second value belongs to class 2 while the rest are of class 1. The aim of the algorithm is to find the correct weights $\mathbf{w}$ such that it can predict $\mathbf{y}$ correctly. We can do this by iteratively updating the weights according to the performance of the output. If the output is correct we keep the weights, and if the output is wrong, we change them slightly and try again. The update process becomes an optimization problem and there are many different approaches. Most implementations involve an error function that should be minimized, i.e. a lower error value should indicate better performance of the network. From the activation function, we want every sample $\mathbf{x}_i$ to have $f(\mathbf{x}_i) > 0$ if it's in class 1 and $f(\mathbf{x}_i) < 0$ if it's in class 2. Since $t_i \in \{-1, +1\}$ we can reduce the problem to $t_i f(\mathbf{x}_i) > 0$ for all $\mathbf{x}_i$. Therefore we can write the error function as

$$E_p(\mathbf{w}) = -\sum_{n \in \mathcal{M}} t_n \mathbf{w}^{T}\mathbf{x}_n \qquad (2.7)$$

where $\mathcal{M}$ is the set of all misclassified samples. This function is called the perceptron criterion. As the error function is not differentiable everywhere, we will have to use stochastic gradient descent (SGD). This approach iteratively updates the weights towards the minimum of $E_p(\mathbf{w})$ according to

$$\Delta\mathbf{w} = -\eta\nabla E_p(\mathbf{w}) \qquad (2.8)$$

where $0 < \eta \leq 1$ is the learning rate of the system. The perceptron convergence theorem guarantees that if there exists a hyperplane that can separate the two classes, the algorithm will converge within a finite amount of steps. Despite all the building blocks being in place, there is still one piece missing. If all the values of the input vector are zero, the weights will not matter as the output will always be zero. This can be resolved by adding a bias term to the input vector. Instead of $\mathbf{x}$ being of length $n$, it is now of length $n+1$ where the last element in $\mathbf{x}$ is a constant $\theta \neq 0$. The bias is the equivalent of the intercept in linear regression. The perceptron was created by Frank Rosenblatt, and his research behind the perceptron algorithm ended up being published in 1962 in the book "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms". To summarize the perceptron algorithm (a minimal code sketch follows the listing below):

Figure 2.3: The XOR problem. We have a two-class problem where the data is not linearly separable. Image adapted from Battini [22].

1. Initialize all the weights randomly.

2. (a) Calculate the output vector $\mathbf{y} = f(\mathbf{w}^{T}\mathbf{x})$.
   (b) Update the weights according to stochastic gradient descent:

$$\Delta\mathbf{w} = -\eta\nabla E_p(\mathbf{w})$$

3. Repeat step 2 until convergence or for a fixed number of iterations.
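The listing above translates directly into code. The sketch below is a minimal numpy version under the stated assumptions (labels in {-1, +1}, a constant bias element appended to each sample); the toy data and hyperparameters are illustrative.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100, seed=0):
    """Steps 1-3 above: random init, then SGD on the perceptron criterion (2.7)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])     # bias element appended
    w = rng.standard_normal(Xb.shape[1])          # 1. random initialization
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(Xb, t):
            if t_n * (w @ x_n) <= 0:              # 2(a). misclassified sample
                w += eta * t_n * x_n              # 2(b). Delta w = -eta * grad E_p
                errors += 1
        if errors == 0:                           # 3. converged
            break
    return w

# Toy usage: two linearly separable clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
t = np.array([+1] * 50 + [-1] * 50)
w = train_perceptron(X, t)
```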

The perceptron is a very important milestone in the field of pattern recognition and machine learning. However, Marvin Minsky noted that there existed some great limitations to the architecture. He brought up the fact that the famous XOR problem (figure 2.3) couldn't be solved by the original perceptron algorithm. Note: the error function is also known as the objective function, value function or the loss function. In the future we will denote this as $L$.

Multi-layer Perceptron

All solutions derived from the perceptron algorithm are linear combinations of a single-dimensional hyperplane. The issue arises when the two-class problem isn't linearly separable, as can be seen in the XOR problem (figure 2.3). This caused a severe decrease in trust for neural network research and halted the development of neural networks until a solution was revealed several years later. People realized that by adding a layer to the perceptron algorithm and turning it into a multi-layer perceptron, a new dimension appeared. This new layer is called a hidden layer and plays a major part in the field of artificial neural networks. By adding this new layer, however, the perceptron algorithm becomes insufficient for learning and has to be modified.

Figure 2.4: A fully connected neural network with four layers: two hidden layers and the input/output layers.

Backpropagation

The backpropagation algorithm is the most important component of deep learning as it's the "learning" part of the networks. Step 2(b) in the update algorithm for perceptron learning is a backpropagation of the simplest kind. When the layers increase, we have an error that becomes dependent on the preceding layers, and one has to use the chain rule to obtain the solution. However, when the architecture becomes more complex we will also have a different kind of backpropagation. Convolutional layers will give one kind of error propagation while bidirectional networks will have another approach in their calculations. There are many different ways of combining layers and nodes for deep learning. The earlier example of an MLP is considered a fully connected neural network. This is characterized by the fact that every node in one layer is connected to every node in the adjacent layer. MLPs are great for classification and regression problems that have tabular datasets. They do not, however, perform as well when the data is of a different kind, such as pixel data from an image. One reason for this is that adjacent pixels in an image are more dependent on each other compared to pixels that are further apart. It is therefore not as beneficial to connect nodes that are further apart. There are also no intrinsic features to be mapped onto a target value for an image set compared to tabular data, which often has distinct attributes. For example, a car's probability to crash will be dependent on tire pressure, tire size, car size, the person driving etc. The most important features would be extracted during a preprocessing procedure. Meanwhile, an image can be end-to-end as the pixel data could be fed directly into the model without any feature extraction beforehand. While feature extraction can certainly still be done for an image set, it is not always required as for some other types of datasets. The biggest disadvantage of fully connected neural networks is that they don't scale well for datasets such as full images. An example that CS231n [23] brings up: take an RGB (3 color channels) image that is 32 pixels in width and 32 pixels in height. The input size is therefore 32x32x3 and a node in the first hidden layer would have 3072 weights. This is already a massive amount of weights for such a small picture. As soon as the image scales, the number of weights grows rapidly and becomes unmanageable very fast. The best approach to images is instead a neural network specifically designed for such data.

2.5.2 Convolutional Neural Networks

Figure 2.5: The convolutional neural network structure. Image adapted from CS231n [23].

Convolutional operations have been prominent for many years in image recognition fields. The basic idea is to use a filter or a kernel to slide over the image and produce a feature map as an output. Taking the example of a 4x4 image, we can use a 3x3 filter to convolve over the image illustrated by figure 2.6 and produce a 2x2 feature map. This can mathematically be seen as

$$h_{i,j} = \sum_{k=1}^{m}\sum_{l=1}^{m} w_{k,l}\, x_{i+k-1,\, j+l-1} \qquad (2.9)$$

where $h$ is the convolved feature map, $w$ is the kernel, $x$ is the input image and $m$ is the width and height of the kernel.
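Equation (2.9) can be written out directly as nested loops. The following minimal numpy sketch reproduces the 4x4 image / 3x3 filter example of figure 2.6 (stride 1, no padding); the averaging kernel is only illustrative.

```python
import numpy as np

def conv2d_valid(x, w):
    """Direct implementation of (2.9): slide an m x m kernel w over image x."""
    m = w.shape[0]
    out_h, out_w = x.shape[0] - m + 1, x.shape[1] - m + 1
    h = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            h[i, j] = np.sum(w * x[i:i + m, j:j + m])
    return h

x = np.arange(16, dtype=float).reshape(4, 4)   # the 4x4 image
w = np.ones((3, 3)) / 9.0                      # a 3x3 averaging kernel (illustrative)
print(conv2d_valid(x, w).shape)                # (2, 2) feature map, as in figure 2.6
```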

Figure 2.6: Convolutional operation. A filter of size 3 is convolving over a 4x4 image and produces a 2x2 feature map as it slides over.

The advantage of using convolutions is that the nodes in a layer only have to be connected to a specific region of nodes in the adjacent layer. This can be shown by continuing the example by CS231n [23] from earlier and applying the convolutional approach. If we have a filter of size 5x5 on the 32x32x3 image, each node will only be required to have connections to 5x5x3 = 75 nodes in the next layer. This drastically reduces the amount of computation in contrast to a fully connected neural network. For convolutional layers, there are three big hyperparameters that are of interest (a small output-size calculation follows this list).

1. A convolutional neural network (CNN) will often have many filters implemented and the outputs of each feature map will be stacked in the next layer. The different filters will give nuance to the activation of nodes, as each filter can look at different things in the image. The amount of filters implemented will decide the depth of the next layer.

2. The filters can also move differently depending on what is desired, and the number of steps the filter moves each convolution is a hyperparameter called the stride. The stride often denotes the number of steps to the right per convolution, i.e. along the x-axis. There are some instances where one might want to use y-axis strides, although it's not common. A stride of n means that the filter moves n steps; the most basic CNN seen in figure 2.6 has a stride of 1, but there are many instances where a stride of 2 or even 3 would be used. Increasing the stride will reduce the output volume as the feature map will have fewer elements.

Figure 2.7: Transposed convolution: a filter of size 3 convolving over a 5x5 image with a single-cell border of zero padding to produce a feature map of size 4.

3. The last hyperparameter is zero padding, a technique which can control the spatial size of the output feature map. This procedure involves padding zeros around the edges of the input. Since convolutions will create a feature map of smaller dimensions compared to the input volume, zero padding can be a great way to retain the size throughout the whole network. It involves expanding the input space with zeros either around the border of the image or sometimes fractionally in-between.
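Together, the three hyperparameters determine the spatial size of the feature map through the standard relation (W - F + 2P)/S + 1 for input width W, filter size F, padding P and stride S; this formula is not stated explicitly in the text but follows from the sliding-window construction. A small sanity check:

```python
def conv_output_size(width, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (width - filter_size + 2 * padding) // stride + 1

print(conv_output_size(4, 3))                        # 2: the 4x4 / 3x3 example above
print(conv_output_size(32, 5, padding=2, stride=1))  # 32: zero padding retains the size
```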

While CNNs are specifically designed for images, the effectiveness of the algorithm has led to attempts at applying CNNs to datasets that can be represented similarly. Speech and audio in particular have similar locality dependencies, and audio waves can be represented in a similar way to pixel values. As a consequence of this, CNNs have also shown big improvements in the domains of speech recognition and speech enhancement.

Deconvolution - Transposed Convolution

There is a term called deconvolution in mathematics that is the exact reversal of a convolution, an inverse, e.g. $F_{\mathrm{deconvolve}}(G_{\mathrm{convolve}}(H)) = H$. However, in deep learning this term has been misused. A deconvolutional layer is defined as the application of fractionally-strided convolutions to produce a feature map in the same vein as a convolutional layer. It doesn't reverse the previous layer's convolving operations; what is reversed is the dimensions of the feature map, i.e. it upsamples the image. In recent times the community has tried to move away from this term and instead use transposed convolution as the correct denomination. A deconvolution can be easily understood by studying figure 2.7. We pad the input data with zeros such that the filter can slide over a bigger input space, thus creating a bigger feature space. The term fractionally-strided refers to the padding in-between the input data elements (figure 2.8).

Figure 2.8: Transposed convolution: a filter of size 3 convolving over a 3x3 image with fractionally strided zero padding of stride 2 to produce a feature map of size 5.
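A minimal sketch of this upsampling view of transposed convolution is given below: zeros are inserted between the input elements (fractional striding), a zero border is added, and an ordinary convolution is applied. It uses scipy for the sliding window and mirrors the 3x3-to-5x5 setting of figure 2.8; it is an illustration, not how the reviewed models implement their deconvolutional layers.

```python
import numpy as np
from scipy.signal import correlate2d

def transposed_conv2d(x, w, stride=1, border=0):
    """Transposed (fractionally-strided) convolution via explicit zero insertion."""
    n = x.shape[0]
    up = np.zeros((n + (n - 1) * (stride - 1),) * 2)
    up[::stride, ::stride] = x          # zeros in-between: fractional striding
    up = np.pad(up, border)             # zero border around the edges
    return correlate2d(up, w, mode="valid")

x = np.arange(9, dtype=float).reshape(3, 3)
print(transposed_conv2d(x, np.ones((3, 3)), stride=2, border=1).shape)  # (5, 5)
```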

2.5.3 Activation Functions

There are several different activation functions that are useful for deep learning. The backpropagation will have very different outcomes depending on our choice of activation function. These are some of the functions that are relevant to the models in review.

Softmax

The normalized exponential function is the most common function for the last layer of a neural network, as it normalizes the outputs into a probability distribution consisting of K probabilities (given K input points). It's defined as:

$$\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}} \qquad (2.10)$$

Figure 2.9: The . where i ∈ RK . It’s used primarily in classification tasks as we can find the maxi- mum of the softmax-function as the most probable output class for the input data.

Sigmoid

The sigmoid function is arguably the most common activation function that has been utilized in ML. It's defined as:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2.11)$$

where $x$ is the input. From the shape of the graph (figure 2.9) we see how the gradient is flat for large absolute values while very steep for low absolute values. If the data is not kind, we could have a case of exploding or vanishing gradients. The sigmoid function is often denoted as $\sigma(x)$.

ReLU

A very popular activation function is the rectified linear unit (ReLU), which has been shown to be remarkably efficient in many models despite its simplicity. The ReLU is defined as:

$$f(x) = \begin{cases} x, & x \geq 0 \\ 0, & x < 0 \end{cases} \qquad (2.12)$$

where $x$ is the input value. The ReLU ensures that the gradients won't explode, as its gradient is constant for values $x \geq 0$. However, with ReLUs there is an issue where nodes can die. Since values less than 0 won't affect the network, there can emerge a situation where the node is stuck in the negative space, which effectively means that the node stops contributing, i.e. it dies.

LeakyReLU

To combat this issue with ReLUs, we can use LeakyReLUs. Instead of completely killing the negative value inputs we define the activation function as:

$$f(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0 \end{cases} \qquad (2.13)$$

where we have an extra parameter $0 < \alpha < 1$ to ensure that the negative values still contribute, albeit to a small degree. The standard value of $\alpha$ is set to 0.01.
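For reference, the activation functions introduced so far can be written in a few lines of numpy; the max-subtraction in the softmax is a common numerical-stability trick and not part of definition (2.10) itself.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # shift for numerical stability (implementation detail)
    return e / e.sum()               # equation (2.10)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # equation (2.11)

def relu(x):
    return np.maximum(0.0, x)        # equation (2.12)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)  # equation (2.13)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(softmax(x), sigmoid(x), relu(x), leaky_relu(x))
```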

PReLU

It is often difficult to decide what the parameter $\alpha$ should be. One has to use empirical evidence from previous models to find which value is the most fitting, but this is generally unreliable. Much like many things in deep learning, we can get rid of this responsibility by letting the network learn this parameter. Introduced by He et al. [24], we have parametric rectified linear units:

$$f(x) = \begin{cases} x, & x \geq 0 \\ a x, & x < 0 \end{cases} \qquad (2.14)$$

Since we can now learn this parameter, for a single-layer CNN the backpropagation is defined as:

$$\frac{\partial L}{\partial a_i} = \sum_{y_i} \frac{\partial L}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i} \qquad (2.15)$$

where $\frac{\partial L}{\partial f(y_i)}$ is the gradient propagated from the deeper layer. The summation is over the whole feature map, i.e. $i \in \{1, \ldots, n^2\}$ where $n$ is the height/width of the feature map. The activation function's gradient is:

$$\frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & y_i \geq 0 \\ y_i, & y_i < 0 \end{cases} \qquad (2.16)$$

Figure 2.10: Recurrent Neural Networks. The left example shows a recursive model and the right shows the corresponding extended RNN model in a time sequential manner. Image adapted from Chen [25].

If we have color channels we simply sum over those feature maps as well. He et al. [24] state that the addition of PReLUs has a negligible effect on the time complexity of both forward and backward propagation in CNNs.

2.5.4 Recurrent Neural Networks

EHNet utilizes bidirectional LSTMs and it is therefore important to understand the concept behind this architecture. Both CNNs and MLPs are considered feedforward networks, where the architecture flow follows one direction. However, by allowing cyclical connections between the nodes we arrive at recurrent neural networks (RNNs). While standard feedforward neural networks have outputs that are only dependent on the current observation $x_t$, RNNs have their hidden state $h_t$ rely on the previous hidden state $h_{t-1}$ as well. Despite the similarities with Hidden Markov Models, RNNs' internal states operate outside the Markov assumptions, which means that they possess the ability to take into account long-distance relationships of previous datapoints. This means that an RNN will have an internal state where the sequence of inputs will influence the future outputs. For a generic RNN (figure 2.10), Pascanu, Mikolov, and Bengio [26] define a hidden state $h_t$ for time step $t$ as:

$$h_t = f(h_{t-1}, x_t, \theta) \qquad (2.17)$$

where $x_t$ is the input, $f$ is the nonlinear transformation function and $\theta$ are the parameters. For a vanilla RNN, $f$ is defined as:

$$f = W_{hh}\,\sigma(h_{t-1}) + W_{xh}\, x_t + b \qquad (2.18)$$

which has the recurrent weight matrix $W_{hh}$, the bias $b$ and the input weight matrix $W_{xh}$ defined as the parameter collection $\theta$. Let the activation function be a softmax function:

$$z_t = \mathrm{softmax}(W_{hz} h_t + b_z) \qquad (2.19)$$

The weights $W_{hh}$, $W_{xh}$ and $W_{hz}$ are shared temporally throughout the whole network. There are several different ways of constructing an RNN, although all will follow the same overall structure as the vanilla computational graph (figure 2.10). The deep part of deep learning for RNNs is the higher amount of hidden units in the hidden state. Nevertheless, we will first only consider the simplest of RNNs as the basis of this theory.

Vanishing Gradients

There is a great flaw with vanilla RNNs when the sequential data is very long. We can observe this by studying the backpropagation. Let $L$ denote the objective function and $L_t$ denote the loss at timestep $t$. From this we can derive:

$$\frac{\partial L}{\partial \theta} = \sum_{1 \leq t \leq T} \frac{\partial L_t}{\partial \theta} \qquad (2.20)$$

Now if we expand the term for each time step we obtain:

$$\frac{\partial L_t}{\partial \theta} = \sum_{1 \leq k \leq t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial^{+} h_k}{\partial \theta} \qquad (2.21)$$

The middle term in the sum is the culprit of vanishing or exploding gradients. This is because it is calculated as a product of previous entries, i.e.:

$$\frac{\partial h_t}{\partial h_k} = \prod_{t \geq i > k} \frac{\partial h_i}{\partial h_{i-1}} \qquad (2.22)$$

so if the entries are smaller than 1 we will have an evolution of gradients that get smaller and smaller until they vanish, or vice versa when the entries are greater than 1. Luckily, there is a way to combat this using Long Short-Term Memory cells.
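A toy numerical illustration of (2.22): if the per-step factors stay below one in magnitude their product shrinks towards zero, and if they stay above one it blows up.

```python
import numpy as np

steps = 100
print(np.prod(np.full(steps, 0.9)))   # ~2.7e-05: the gradient effectively vanishes
print(np.prod(np.full(steps, 1.1)))   # ~1.4e+04: the gradient explodes
```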

Figure 2.11: Long Short-Term Memory: With the help of gated units, the gradient can flow through the system smoothly.

Long Short-Term Memory

A way to combat the gradients exploding or vanishing is to add gated units to the network. The definitions of these gates are:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \qquad f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \qquad g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t), \qquad z_t = \mathrm{softmax}(W_{hz} h_t + b_z)$$

where $f_t$ indicates the forget gate, $i_t$ the input gate, $o_t$ the output gate, $g_t$ the input modulation gate and $\odot$ element-wise multiplication. The forget gate is the main character in solving the vanishing gradients, as it forces the problematic product component to have elements close to one. The derivation of this proof is quite long, hence it's left as an exercise for the reader (or see the explanation by Bayer [27]). This specific architecture of added gates is known as Long Short-Term Memory (LSTM).
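A single LSTM step following the gate equations above is sketched below in numpy. For brevity the separate $W_x$ and $W_h$ matrices are stacked into one matrix acting on the concatenation of $x_t$ and $h_{t-1}$, and the softmax output $z_t$ is omitted; dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of the gated update: returns the new hidden and cell states."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g          # forget gate keeps the gradient path open
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.standard_normal((4 * n_hidden, n_in + n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```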

Bidirectional

Figure 2.12: Bidirectional Recurrent Neural Networks. By adding a second layer of neurons that are in the other direction, we can capture the whole input data over all timesteps. Image adapted from Schuster and Paliwal [28].

Unfortunately, vanilla RNNs and vanilla LSTMs can't access future samples by design. Therefore, for problems where those samples are available and interesting to process, we have to use bidirectional RNNs (BRNNs). A BRNN is simply two RNNs stacked upon each other in the hidden layer, but the flow of data is opposite for the lower one. Developed by Schuster and Paliwal [28] in the late 90s, BRNNs manage to capture all time steps at once. In figure 2.12 we can see the flow of data for this type of RNN.

2.5.5 ResNet

Figure 2.13: A small residual block with an identity mapping of x. Image adapted from He et al. [29].

As the layers become deeper and deeper in neural networks, we encounter a degradation problem for the training accuracy. This is noted in a paper by He et al. [29]. They state that this degradation of training accuracy is unexpectedly not caused by overfitting. They show how a deeper network causes problems by design with an example. Let a shallow network be our reference point and add layers of identity mapping to form the deeper network. Intuitively, the training error should be the same, as the identity layers are not changing any of the calculations. However, as it turns out we don't achieve a comparably good or better solution in the deeper network in contrast to the shallow one. From these experiments the authors see the number of layers as the main culprit in the degradation of training accuracy. Therefore they introduce ResNets, a deep residual learning framework. Let $\mathcal{H}(x)$ be the underlying mapping that's normally fit by a few stacked layers. In residual learning we instead let these stacked layers approximate a residual function $\mathcal{F}(x) := \mathcal{H}(x) - x$. The original mapping is then recast into $\mathcal{F}(x) + x$. They achieve this new mapping by adding shortcut/skip connections to feedforward neural networks. For ResNets, these skip connections are defined as:

$$y = \mathcal{F}(x, \{W_i\}) + W_s x \qquad (2.23)$$

where $x$ and $y$ are the input and output of the residual block, while $\mathcal{F}$ is the residual mapping to be learned. In figure 2.13 we have a residual $\mathcal{F} = W_2 f(W_1 x)$ where $f$ denotes ReLU. $W_s$ is often the identity mapping, as it only influences the equation if $\mathcal{F}$ and $x$ are of different dimensions. The amazing thing about ResNet is that the concept can be applied to all neural networks with large spaces of connectivity, and it has become a staple in the community. Most recently, Bonnier et al. [30] showcase how a ResNet adaptation of their deep signature model can extend the performance of their network, which further cements the importance of the framework. We will see how it's implemented in both SEGAN and WaveNet denoising for improvements in performance as well.
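A minimal sketch of the residual block in equation (2.23) with an identity shortcut is given below (biases and normalization omitted; dimensions are illustrative, so $W_s$ is not needed).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = F(x, {W_i}) + x with F = W2 relu(W1 x), as in figure 2.13."""
    return W2 @ relu(W1 @ x) + x

rng = np.random.default_rng(0)
d = 8
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1
y = residual_block(rng.standard_normal(d), W1, W2)
```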

2.5.6 Batch Normalization

In deep learning there are more issues with training as the layers increase. Since each layer's inputs will have their distributions changed for each iteration that the parameters of the previous layers change, we have a case of internal covariate shift. This phenomenon can be mitigated by normalizing the data. To take it further, Ioffe and Szegedy [31] introduced a new way of normalization where they normalize each training minibatch instead of normalizing the whole training data. The strength in this is increased performance, and it enables the training to be more flexible in terms of parameter initialization and choice of learning rate. However, this method could be improved even further, as there are still some small issues with normalizing on a minibatch.
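A minimal sketch of minibatch normalization is shown below: each feature is normalized with the minibatch mean and variance and then scaled and shifted by the learnable parameters (here fixed scalars for simplicity).

```python
import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a minibatch X (rows = examples) feature-wise, then scale and shift."""
    mu, var = X.mean(axis=0), X.var(axis=0)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(16, 8))   # minibatch of 16 examples, 8 features
X_bn = batch_norm(X)
print(X_bn.mean(axis=0).round(3), X_bn.std(axis=0).round(3))   # ~0 and ~1 per feature
```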

Virtual Batch Normalization

When we normalize on a minibatch we still have a dependency issue where an input example $x$ is highly dependent on several other inputs $x'$ in the same minibatch. To remedy this, Salimans et al. [32] introduce virtual batch normalization (VBN), where we first have a reference batch of examples taken out from the data. When we later do the batch normalization, we don't normalize the input entry $x$ on the inputs in the same minibatch but instead on the combination of the reference batch and the input entry. The problem here is that virtual batch normalization would require two minibatches for each forward propagation, which can be considered computationally expensive. SEGAN and WaveNet denoising both utilize either batch normalization or VBN.

Figure 2.14: The taxonomy of generative models. Image adapted from Goodfellow [33].

2.6 Generative Models

Generative models are at their core at most a semi-supervised learning procedure, since much of the data is often unlabeled. The graph in figure 2.14 is the way Goodfellow [33] describes the taxonomy of generative models. The split depends on whether you have a function of implicit density or explicit density. As we can see, even though GANs and WaveNet are in the same generative family, they are still very different in regards to the intrinsic properties of the models.

2.6.1 Gated PixelCNN

In the original paper about PixelRNNs and PixelCNNs by Oord, Kalchbrenner, and Kavukcuoglu [34], they introduce the idea of modeling the pixel distributions with either two-dimensional LSTMs or with CNNs. While the RNN architecture gave much better results, CNNs performed much faster in training. In the modern market we come across resolutions such as 4K and even 8K, leading to unsustainable training for PixelRNNs. For this reason, they sought to improve the PixelCNN architecture as the computational speed became much more significant. We begin by defining the joint probability over an image $\mathbf{x}$ as:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}) \qquad (2.24)$$

where $n$ is the size of the image. The order of the pixels is raster scan order, i.e. we go row by row and study each pixel individually. This means that a pixel will be dependent on every pixel above and to the left of it. They handle this conditionality by applying a mask on the pixel data, making sure that no pixels are seen to the right of and below the current pixel. The RGB colors are also conditioned on each other, with B being conditioned on (R, G) and G being conditioned on R. This is achieved by splitting the feature map into three at every layer. An NxNx3 image as input and an NxNx3x256 prediction as output are the common dimensions for PixelCNNs. In the newer paper by Oord et al. [35] they state that one main advantage of the LSTM architecture is its gated units handling complex interactions and that "recurrent connections allow every layer in the network to access the entire neighbourhood of previous pixels, while the region of the neighbourhood available to PixelCNN grows linearly with the depth of the convolutional stack". PixelCNNs can handle the second part by simply adding more layers, but the first part isn't possible to affect with the original architecture. They rectify this by introducing Gated PixelCNNs, replacing the ReLUs between the masked convolutions with the gated activation function:

$$\mathbf{y} = \tanh(W_{k,f} * \mathbf{x}) \odot \sigma(W_{k,g} * \mathbf{x}) \qquad (2.25)$$

where $k$ is the number of the layer, $*$ is the convolution operator and $\odot$ is element-wise multiplication.
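Given the two convolution outputs $W_{k,f} * \mathbf{x}$ and $W_{k,g} * \mathbf{x}$ as feature maps, the gating in (2.25) is a simple element-wise operation, sketched below; the random inputs stand in for actual convolution outputs.

```python
import numpy as np

def gated_activation(conv_f, conv_g):
    """Equation (2.25): tanh branch gated element-wise by a sigmoid branch."""
    return np.tanh(conv_f) * (1.0 / (1.0 + np.exp(-conv_g)))

rng = np.random.default_rng(0)
y = gated_activation(rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
```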

Blind Spot Dilemma

A disadvantage of the original PixelCNN architecture is the blind spot that occurs in the receptive field because of the masking. Consider a mask like (2) in Figure 2.15: when the filter reaches the rightmost parts of the image, we will eventually miss some pixels that are never accounted for.

Figure 2.15: The PixelCNN architecture with convolutional masking. From left to right: (1) The original PixelCNN mapping a neighbourhood of pixels to predict the next pixel. A mask is used to prevent the input pixel from accessing the pixels that are below and strictly to the right. (2) An example of how a mask can be used on a 5x5 filter. (3) Top: The original PixelCNN has a blind spot in the receptive field. Bottom: The blind spot is removed by having two convolutional stacks. Image adapted from van den Oord et al. [35].

This leads to pixels predicted by PixelCNN not being dependent on all previous pixels as (2.24) requires. The illustration (figure 2.15) shows how this is mended by combining two convolutional network stacks, where one stack handles all the rows above the current pixel and the other handles all the pixels in the same row up to the current pixel. By combining the two outputs, we remove the blind spot. They denote these two stacks as a vertical and a horizontal stack. The horizontal stack has masking properties while the vertical stack does not. Every layer of the horizontal stack takes the output of the previous layer as well as the output of the vertical stack. However, the horizontal stack cannot be connected to the vertical stack in the reverse manner as this would break the conditional distribution.
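A minimal sketch of how such a mask can be built for a 5x5 filter is shown below; the type 'A'/'B' naming follows the PixelCNN papers, and the exact mask layout in a given implementation may differ.

```python
import numpy as np

def pixelcnn_mask(kernel_size, mask_type="B"):
    """Binary mask for a kernel_size x kernel_size PixelCNN filter.
    Pixels below the centre row and to the right of the centre in the same
    row are zeroed. A type 'A' mask (first layer) also hides the centre
    pixel itself so the prediction never sees the pixel it predicts."""
    mask = np.ones((kernel_size, kernel_size))
    centre = kernel_size // 2
    mask[centre, centre + 1:] = 0.0   # same row, right of the centre
    mask[centre + 1:, :] = 0.0        # all rows below the centre
    if mask_type == "A":
        mask[centre, centre] = 0.0
    return mask

print(pixelcnn_mask(5, "A"))
```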

2.6.2 WaveNet

During 2016, the impact of CNNs made people rethink autoregressive generative models. As mentioned before, CNNs were mainly built around image recognition and shifted the focus in computer vision. However, signal processing not only shares attributes with the theory behind Fourier transforms and convolutions, it also has a very similar data structure, as waveforms can be modeled the same way as pixel data.

Figure 2.16: A single layer in the Gated PixelCNN architecture. Convolution operations are shown in green, element-wise multiplications and additions are shown in red. The convolutions with Wf and Wg from Equation 2.25 are combined into a single operation shown in blue, which splits the 2p feature maps into two groups of p. Description and image adapted from Oord, Kalchbrenner, and Kavukcuoglu [34].

With this idea in mind, van den Oord et al. decided to remodel their gated PixelCNN architecture to be applied on audio waveforms and achieved great success. They called this new approach WaveNet [36]. Similarly to PixelCNNs for image pixels, they define the conditional probability of a waveform x = {x1, ..., xT} as the product of conditional probabilities:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (2.26)

Equivalent to earlier, the audio sample xt is conditioned on all previous timesteps and the model can therefore not violate the ordering in which the data is modeled. The idea for handling this condition is to introduce causal convolutions, another name for the same operation as the masks in PixelCNNs. Since audio only has a single dimension, they mention that the masking procedure can be handled by shifting the output of the convolution by a few timesteps.

Figure 2.17: A stack of causal convolutional layers. The input will only be dependent on previous entries. Image adapted from van den Oord et al. [36].

The data is seen sequentially: after each sample is predicted, it is fed back into the system to predict the next. The causal convolution structure is seen in figure 2.17.

Dilated Causal Convolutions

An issue with causal convolutions is that they require either a large number of layers or bigger filters to reach the necessary receptive field, which can limit the advantageous training speed of CNNs. Van den Oord et al. [36] introduce dilated causal convolutions for their architecture to maintain the benefits of a much larger receptive field without increasing the computational cost. They describe a dilated convolution as a convolution "where the filter is applied over an area larger than its length by skipping input values with a certain step". We can see how this is equivalent to dilating the original filter with zeros, and it is similar to pooling or strided convolutions. The difference is that the output keeps the size of the input. A visualization of the concept can be seen in figure 2.18.
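The sketch below illustrates the idea with a naive NumPy implementation of a dilated causal convolution and the receptive-field growth obtained by doubling the dilation per layer; it is meant as an illustration under these assumptions, not the WaveNet implementation itself.

```python
import numpy as np

def dilated_causal_conv1d(x, weights, dilation):
    """Causal 1D convolution with dilation: output[t] only depends on
    x[t], x[t - dilation], x[t - 2*dilation], ... (left-padded with zeros)."""
    k = len(weights)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i in range(k):
            out[t] += weights[i] * x_padded[t + pad - i * dilation]
    return out

# Stacking layers with dilations 1, 2, 4, ..., 512 grows the receptive
# field exponentially with depth instead of linearly.
dilations = [2 ** i for i in range(10)]          # 1 .. 512
kernel_size = 2
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(receptive_field)                           # 1024 samples
```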

Other Additions

While the gated activation functions (equation 2.25) are the same as for PixelCNNs, WaveNet does have some additions to the network, such as a softmax output modified for raw audio and the concept of residual and parameterised skip connections from ResNet [29].

In the gated PixelCNN paper, van den Oord et al. showed how a softmax distribution performs better on conditional distributions compared to mixture models.

Figure 2.18: Dilated causal convolutions. We increase the receptive field while maintaining the computational advantages of CNNs. The visualization shows a dilation that is doubled for each layer. Image adapted from van den Oord et al. [36].

On the other hand, since raw audio is commonly stored as 16-bit integer values per timestep, each softmax layer would require an output of 65536 probabilities per timestep to cover all possible values. To address this issue, they apply a µ-law non-linear companding transformation to the data and then quantize it to a more reasonable 256 values instead:

f(x_t) = \mathrm{sign}(x_t) \, \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}    (2.27)
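A common way to implement the companding and 256-level quantization of equation 2.27 is sketched below; the exact quantization grid used in the WaveNet implementation may differ.

```python
import numpy as np

def mu_law_encode(x, mu=255, bins=256):
    """Apply the mu-law companding transform of equation 2.27 to audio in
    [-1, 1] and quantize the result to `bins` discrete values."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] -> {0, ..., bins - 1}
    return ((compressed + 1.0) / 2.0 * (bins - 1)).astype(np.int64)

def mu_law_decode(q, mu=255, bins=256):
    """Invert the quantization and the companding transform (approximately)."""
    compressed = 2.0 * q.astype(np.float64) / (bins - 1) - 1.0
    return np.sign(compressed) * ((1.0 + mu) ** np.abs(compressed) - 1.0) / mu

audio = np.random.default_rng(0).uniform(-1.0, 1.0, size=16000)
codes = mu_law_encode(audio)
restored = mu_law_decode(codes)
```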

A residual block of the model can be seen in figure 2.19 and this block is stacked many times in their model.

2.6.3 Generative Adversarial Networks

The first instance of the Generative Adversarial Network was seen in the paper published by Goodfellow et al. in 2014. They introduced a new generative model to combat the difficulties that could arise for previously established models when it came to calculations such as MLE. Much like in the real world, the main idea behind GANs is that competition breeds success.

Instead of the general idea of training a single network, GANs introduce the concept of training two networks at the same time: one classifier and one generator. The classifying network is our discriminator while the generator is our adversary. The adversary will generate fake data that tries to fool the discriminator.

Figure 2.19: Residual and skip connections: a residual block and how it is connected to the whole architecture through skip connections. The stacking is shown by the dotted outlines. Image adapted from van den Oord et al. [36].

In their paper, Goodfellow et al. bring up the analogy of a counterfeiter attempting to deceive the police. They call the framework adversarial nets. Following their paper, suppose we have two MLPs as the respective networks in the adversarial net framework. The goal is to learn the generator's distribution pg over the data x. They define a prior on an input noise variable pz(z) and represent the mapping to data space as G(z; θg), where G is the MLP for the generator with parameters θg. The discriminator's MLP is then defined by D(x; θd) and outputs a scalar denoting the probability that x came from the real data rather than pg. For D we maximize the probability of assigning the correct label to where the input came from, and for G we minimize the probability that D labels the generator's data as fake, which we define as log(1 − D(G(z))). The loss function (which they denote as a value function V) of the whole adversarial net becomes the combination of the two objectives:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (2.28)

Goodfellow [37] notes that this minimax setup can also be seen as a zero-sum game. While loss functions in general pose an optimization problem where we seek a local minimum, for GANs we instead seek the Nash equilibrium from game theory. In a scenario where D and G both play their parts perfectly, the Nash equilibrium will have G(z) drawn from the same distribution as the training data and D(x) = 1/2 for all x.

Although the loss function can vary for different GAN structures, this notion of game theory is still the essential idea behind the concept. The most common networks used as replacements for the MLPs are CNNs and LSTMs.
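The following PyTorch sketch shows how the two networks are typically trained in alternation; the tiny MLPs and layer sizes are placeholders, and the generator update uses the common non-saturating variant (maximizing log D(G(z))) rather than literally minimizing log(1 − D(G(z))).

```python
import torch
import torch.nn as nn

# Stand-in generator and discriminator; sizes are illustrative only.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
D = nn.Sequential(nn.Linear(32, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real):
    batch = real.size(0)
    z = torch.randn(batch, 64)
    fake = G(z)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step (non-saturating): push D(G(z)) towards 1.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

training_step(torch.randn(16, 32))
```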

Difficulties Training

While GANs at first glance seem straightforward in concept, the training is much harder than for most standard deep learning architectures. Balancing two networks at once becomes increasingly hard as the complexity of the system grows, and some common problems emerge:

• The discriminator gets a lead too fast and the generator stops learning, as every fake sample is labeled correctly and the generator's gradients are therefore diminished.

• The game oscillates between winner and loser roles forever and the adversarial net never converges.

• Generator gets stuck with sample outputs of the same type (modes) - called mode collapse.

There are a few more detriments to GANs that are hard to account for, but the three mentioned are some of the more common ones. Since the concept of adversarial nets is still relatively fresh compared to other fields of study within deep learning, there is no optimal strategy set in stone for how to manage this balancing act. As mentioned in earlier sections, deep learning has no mathematical proof as to why something performs better given certain parameters and implementations. There is, as said, only empirical data that can be used as vague guidelines for future approaches. Nevertheless, outside of the regular training adjustments there are some ideas that can improve the performance of GANs if the vanilla edition does not converge:

• We update the discriminator more often than the generator, or vice versa, to ensure that no network breaks away. Some [37] are against this method and suggest that simultaneous updating is the most beneficial approach.

• We force the generator to intermittently generate fake samples outside its current modes to avoid a mode collapse.

• Modify the loss function to include consideration of more precise and finer details of the data.

SEGAN combats these issues by having a conditioned GAN that applies LSGAN theory. We explore these two concepts further below.

Least-squares GAN

The loss function of regular GANs adopts a sigmoid cross-entropy function (2.2) and has been shown to be problematic when training on certain datasets. X. Mao et al. [38] argue that problems occur because the evaluation of fake samples that are on the correct side of the decision boundary does not take into account the distance to this boundary. This means that a sample that is far from the real data will not create any error for the discriminator even though the distance could be exceedingly large. In turn, this is another case of vanishing gradients and it severely limits the training of the adversarial net. They propose least-squares GAN (LSGAN) to force the generator to produce samples that are closer to the real data by also penalizing distance. The loss function transforms into:

\min_D V_{LSGAN}(D) = \frac{1}{2}\mathbb{E}_{x \sim p_{data}(x)}\big[(D(x) - b)^2\big] + \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - a)^2\big]    (2.29)

\min_G V_{LSGAN}(G) = \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - c)^2\big]    (2.30)

where the parameters a and b are the labels for the two classes, fake and real data, and c is the value that G wants D to believe for fake data. Relevant to this thesis, we want G to generate samples that are as real as possible, which means that c = b.
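In code, the change from equation 2.28 to equations 2.29 and 2.30 amounts to replacing the cross-entropy terms with squared distances to the labels a, b and c, as in this sketch (with the common choice a = 0, b = c = 1); the reduction over the batch is an assumption.

```python
import torch

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Equation 2.29: push real outputs towards label b and fake towards a."""
    return 0.5 * ((d_real - b) ** 2).mean() + 0.5 * ((d_fake - a) ** 2).mean()

def lsgan_g_loss(d_fake, c=1.0):
    """Equation 2.30: push the discriminator's output on fakes towards c."""
    return 0.5 * ((d_fake - c) ** 2).mean()

# d_real / d_fake are the raw (linear) discriminator outputs.
d_real = torch.randn(8, 1)
d_fake = torch.randn(8, 1)
print(lsgan_d_loss(d_real, d_fake), lsgan_g_loss(d_fake))
```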

Conditioned GAN

An issue with GANs is the handling of one-to-many relations. As Mirza et al. [39] state, the vanilla GAN has a one-to-one mapping from input to output, which has limited use in the real world. An image can often be described by several tags, as there exist synonyms or related words in most languages, and it becomes unreasonable to account for them all in a one-to-one setting.

Figure 2.20: Conditioned GAN: By simply adding a second input layer y to the network we gain a conditioned output of D(x | y) and G(z | y). Image adapted from Mirza et al. [39].

A solution to this, which they proposed, is the conditional GAN that conditions the adversarial net on a new variable by adding an additional input layer (as seen in figure 2.20). Let us have a new observation q to condition the adversarial net on. If both D and G are introduced to this conditional attribute, we obtain a new loss function:

\min_G \max_D V(D, G) = \mathbb{E}_{x,q}[\log D(x, q)] + \mathbb{E}_{z,q}[\log(1 - D(G(z, q)))]    (2.31)

Isola et al. [40] test the importance of conditioning the discriminator by investigating a loss function where D does not observe q:

\min_G \max_D V(D, G) = \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z,q}[\log(1 - D(G(z, q)))]    (2.32)

They conclude that the mode collapse of G is much more prominent when D is blind to the conditional contribution which makes (2.31) the best approach to a conditional loss function.

2.7 Dataset

The data used in this thesis was taken from Edinburgh DataShare, published by the University of Edinburgh [41]. Its description is given in the paper by Valentini-Botinho et al. [42]. They state that it is a clean and noisy parallel speech corpus designed to train and test speech enhancement methods that operate at 48 kHz. A small preprocessing step is made to downsample the recordings to 16 kHz for faster computations (a resampling sketch is shown at the end of this section).

The dataset comes in two parts. The first part includes 28 speakers from the same accent region in England with a 50/50 split between male and female. The second part has 56 speakers (same split) from different accent regions in Scotland and the United States. Each speaker has around 400 sentences available.

According to Valentini-Botinho et al. [42], they created two artificial noises and used eight real noise recordings taken from the Demand database [43]. The artificial noises were a speech-shaped noise and a babble noise. The former was generated by "filtering white noise with a filter whose frequency response matched that of the long term speech level of a male speaker" while the latter was generated by "adding speech from six speakers from the Voice Bank corpus that were not used for either training or testing".

The real noises were extracted from the first channel of the respective 48 kHz versions in the Demand database. In detail, these noises were: "a domestic noise (inside a kitchen), an office noise (in a meeting room), three public space noises (cafeteria, restaurant, and subway station), two transportation noises (car and metro) and a street noise (busy traffic intersection)". They used 4 different signal-to-noise ratios for each noise (0 dB, 5 dB, 10 dB and 15 dB), which meant that they had a total of 40 different noisy conditions. More details on how the noises were specifically constructed and added to the clean waveforms can be found in their paper if the reader is interested in knowing more.

In the original papers for SEGAN and WaveNet denoising, only the dataset consisting of 28 different speakers was used. For this thesis, we expanded the tests and also incorporated the 56 speaker dataset. The smaller speaker dataset was still used to reproduce the results from the two papers.

For EHNet and WaveNet denoising there is a usage of a validation set. Unfortunately, if the code is accurate to their papers, then they have used the test set as the validation set. We know that this is a violation of the learning process as it contaminates the training and affects the final results during evaluation. For this thesis we modify the training data and split it as follows:

• 28 speakers dataset: p243 (male) and p268 (female) are omitted from the training set and instead added to the validation set.

• 56 speakers dataset: p234 (female), p336 (female), p237 (male) and p245 (male) are omitted from the training set and instead added to the validation set.

We still consider the validation set a part of the training data; however, the model is not explicitly learning on this set.
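As noted at the start of this section, the corpus is distributed at 48 kHz and was downsampled to 16 kHz before training. A minimal sketch of such a preprocessing step is shown below; the soundfile and scipy libraries and the file names are assumptions rather than the exact script used for the thesis.

```python
# A possible way to downsample the 48 kHz corpus files to 16 kHz.
import soundfile as sf
from scipy.signal import resample_poly

def downsample_file(path_in, path_out, sr_in=48000, sr_out=16000):
    audio, sr = sf.read(path_in)
    assert sr == sr_in
    # polyphase resampling by the rational factor 16000/48000 = 1/3
    audio_16k = resample_poly(audio, up=1, down=3)
    sf.write(path_out, audio_16k, sr_out)

downsample_file("p232_001_48k.wav", "p232_001_16k.wav")  # hypothetical files
```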

2.8 Models in Review

This thesis analyzes the performance of three different deep learning networks and uses a simple Wiener filter as a benchmark. The networks were chosen because they have distinct differences and similarities relevant to how the industry is moving.

SEGAN utilizes GANs, which are in vogue for image recognition; WaveNet has already shown its capabilities as the state of the art for speech generation, which makes denoising an interesting way of fleshing out the core framework; and finally EHNet is the combination of two tried and tested networks (CNNs and LSTMs) that are considered state of the art in many different domains.

2.8.1 SEGAN

The Speech Enhancement Generative Adversarial Network (SEGAN) was developed by Pascual, Bonafonte, and Serrà [5] and applies the GAN architecture to speech enhancement. In SEGAN, the generator network G works as the final tool for enhancing audio files. The network is given an extra input in the form of a latent vector z representing random noise. They use a binary coding with 1 and 0 as the class labels for real and fake. The loss function is a modified version of the conditional GAN loss, which substitutes the commonly used cross-entropy loss with the least-squares function:

\min_D V_{LSGAN}(D) = \frac{1}{2}\mathbb{E}_{x,\tilde{x} \sim p_{data}(x,\tilde{x})}\big[(D(x,\tilde{x}) - 1)^2\big] + \frac{1}{2}\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{data}(\tilde{x})}\big[D(G(z,\tilde{x}),\tilde{x})^2\big]    (2.33)

\min_G V_{LSGAN}(G) = \frac{1}{2}\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{data}(\tilde{x})}\big[(D(G(z,\tilde{x}),\tilde{x}) - 1)^2\big] + \lambda \lVert G(z,\tilde{x}) - x \rVert_1    (2.34)

Note the L1-norm penalty added to our loss function for G. It’s chosen as it has been proven to be effective in the image manipulation domain [40].

Generator G

G is fully convolutional without any dense (fully connected) layers, to save computational time and reinforce the correlation of temporally close points in the input. There is an encoding stage where the input signal x̃ is first projected and compressed to a much smaller representation that can be concatenated with the latent vector z.

Figure 2.21: Generator G in SEGAN: A noisy input is processed through an encoder stage of convolutional layers that ends with a thought vector c. This vector is concatenated with a noise sample z drawn from a normal distribution N(0, σ) and then fed into the decoding network, which is a mirror of the encoding stage. Skip connections between the respective encoding and decoding layers are also present. The final output is the enhanced signal with noise reduced and speech clarified.

The compression happens through a number of strided convolutions that have PReLUs as their activation functions. They call the projected vector before concatenation the thought vector c. The combined vector is then decoded in a symmetrical manner with transposed convolutions and PReLUs.

Much like WaveNet, they also introduce skip connections to the network for better performance. In this case each encoding layer is connected to the respective decoding layer. They reason that the compression process is a bottleneck that can lose valuable information, and skip connections are an approach to retain these important low-level details.

We implement the same settings as S. Pascual et al. [5] set up. The encoding layers of G have the dimensions 16384 × 1, 8192 × 16, 4096 × 32, 2048 × 32, 1024 × 64, 512 × 64, 256 × 128, 128 × 128, 64 × 256, 32 × 256, 16 × 512, and 8 × 1024. The concatenation with c is done with a noise sample z drawn from an 8 × 1024-dimensional normal distribution with zero mean and a standard deviation given as a hyperparameter. Because of the skip connections and the added latent vector, the number of feature maps doubles in the decoding stage. An illustration of the network can be seen in figure 2.21.
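To make the mirrored encoder and decoder and the doubling of feature maps concrete, here is a minimal two-level PyTorch sketch; it is not the 11-level configuration above, and the kernel sizes and channel counts are purely illustrative.

```python
import torch
import torch.nn as nn

def enc_block(c_in, c_out):
    # strided convolution halves the temporal length, PReLU activation
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=31, stride=2, padding=15),
        nn.PReLU())

def dec_block(c_in, c_out):
    # transposed convolution doubles the temporal length back
    return nn.Sequential(
        nn.ConvTranspose1d(c_in, c_out, kernel_size=31, stride=2,
                           padding=15, output_padding=1),
        nn.PReLU())

enc1, enc2 = enc_block(1, 16), enc_block(16, 32)
dec2, dec1 = dec_block(64, 16), dec_block(32, 1)

x = torch.randn(1, 1, 16384)             # one chunk of noisy waveform
e1 = enc1(x)                             # (1, 16, 8192)
c = enc2(e1)                             # thought vector, (1, 32, 4096)
z = torch.randn_like(c)                  # latent noise sample
d2 = dec2(torch.cat([c, z], dim=1))      # bottleneck + z -> (1, 16, 8192)
out = dec1(torch.cat([d2, e1], dim=1))   # skip connection -> (1, 1, 16384)
```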

Discriminator D

They implement the discriminator similarly to the generator's encoder stage; however, there are three key differences they mention:

• The input layer consists of two input channels of 16384 samples.

• It uses LeakyReLUs with α = 0.3 and has a virtual batch-norm applied to it beforehand.

• The last activation layer is a one-dimensional convolution layer with a filter of size 1. The reasoning behind this is to reduce the number of parameters in the fully connected layer from 8 × 1024 to 8 while still retaining the learning of the 1024 channels.

Figure 2.22, taken from their paper, shows the architecture in full.

Figure 2.22: The 3 cases for the Speech Enhancement Generative Adversarial Network. From left to right: (1) The discriminator is fed samples from the real data and labels them as real. (2) The discriminator is fed fake samples from the generator and labels them as fake. (3) The discriminator is fed fake samples from the generator and labels them as real. The dotted lines are the discriminator's gradient flow when backpropagating. Image adapted from S. Pascual et al. [5].

Figure 2.23: WaveNet denoising architecture. A symmetrical dilated convolutional network with skip connections is output through two 3x1 filter layers to predict a target field. The noisy background is extracted by subtracting the estimated clean speech from the noisy speech. Image adapted from Rethage, Pons, and Serra [44].

2.8.2 WaveNet Denoising

While the original WaveNet [36] architecture is focused on generating natural sounding speech samples, the modification by Rethage, Pons, and Serra [44] turns the model into an application that can enhance speech in audio files. They begin by formulating the problem of speech denoising as finding the speech signal st in a mixed signal mt, defined as the sum of the speech signal st and a background-noise signal bt:

m_t = s_t + b_t    (2.35)

Non-Causality

A major difference in architecture is the removal of causal convolutions, as future samples can help with the predictive ability of the network. They argue that this can still work in a real-time application by allowing a small latency in the response time to access the future data. Therefore, the causal nature is removed from the WaveNet denoising model and the predicted samples are no longer fed back into the network. To account for the future samples in the input, they change the asymmetrical padding of the original WaveNet model into a symmetrical padding at each dilated layer, as can be seen in figure 2.24.

Figure 2.24: Dilated convolutions without causality. The asymmetrical nature of WaveNet is switched for a symmetrical shape to double the contextual samples around the current sample. Image adapted from Rethage, Pons, and Serra [44].

This gives the model access to the same amount of past samples as future samples, which in turn doubles the amount of context around the current sample.

Discriminative Model

Equation 2.27 is for discrete softmax outputs and is designed to not make any assumptions about the data distribution. However, for Rethage, Pons, and Serra [44] a real-valued prediction following a uni-modal Gaussian distribution performed much better. One reasoning behind this is that the potentially multi-modal output derived from equation 2.27 can introduce artifacts which a real-valued output will not. They also encountered high variance and a disproportionate amplification of the background noise with µ-law quantization. Consequently, they chose to skip this preprocessing and build a predictive model with real-valued predictions instead.

By not feeding the previously generated samples back into the model, we can no longer consider this an autoregressive model. It is instead a discriminative model that tries to minimize a regression loss function, which they define as:

L(\hat{s}_t) = |s_t - \hat{s}_t| + |b_t - \hat{b}_t|    (2.36)

where ŝt is the denoised speech and b̂t = mt − ŝt.

This is the L1 loss with a small adjustment by adding the second term.
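A direct PyTorch sketch of equation 2.36 is shown below; the tensor shapes are placeholders and the mean reduction over the target field is an assumption.

```python
import torch

def denoising_loss(s_hat, s, b, m):
    """Equation 2.36: L1 on the speech estimate plus L1 on the implied
    background estimate b_hat = m - s_hat."""
    b_hat = m - s_hat
    return (s - s_hat).abs().mean() + (b - b_hat).abs().mean()

# toy usage with a synthetic mixture m = s + b
s = torch.randn(4, 1601)          # clean target field
b = 0.3 * torch.randn(4, 1601)    # background noise
m = s + b                         # noisy mixture
s_hat = 0.8 * m                   # stand-in for the network output
print(denoising_loss(s_hat, s, b, m))
```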

Conditioning

In their paper they use conditioning where the model receives a binary-encoded scalar as a bias term in every convolution operation. For the 28 speaker dataset, which the original paper is based on, the scalar is a value between 1 and 28. If the speaker is unknown, all zeros are used instead. In the denoising step the conditioning is set to zero.

Parameters

The model features 30 residual layers divided into 3 stacks, with a dilation factor that is doubled for each layer from 1 up to 512. The first step in the network linearly projects the 1-channel input to 128 channels with a 3x1 convolution. This accommodates the number of filters in each residual layer. The skip connections are designed as 128 convolutional filters of size 1x1, and the output is the summation of these connections passed through a ReLU.

There are then two final layers of 3x1 convolutions that are not dilated; they contain 2048 and 256 filters respectively, with ReLUs between these layers as well. Finally, the output layer uses a 1x1 filter to linearly project the feature map into a single-channel temporal signal.

The illustration in figure 2.23 gives an overview of the whole network.

2.8.3 EHNet

Another end-to-end model was introduced by Zhao et al. [45] and combines CNNs with LSTMs. EHNet is purely data-driven as it does not make any assumptions about the underlying noise. They define the problem as minimizing the following function to find the best model parameter θ:

L(\theta) = \frac{1}{2} \sum_{i=1}^{n} \lVert g_\theta(x_i) - y_i \rVert^2    (2.37)

where x ∈ R_+^{d×t} is the noisy spectrogram and y ∈ R_+^{d×t} is the corresponding clean spectrogram. Here d is the dimension of each frame, which they call the number of frequency bins of the spectrogram, and t is the time, or rather the length, of the spectrogram.

Figure 2.25: EHNet. The model architecture from start to finish in three steps. (1) The input is processed by convolutional layers and concatenated into a 2D feature map output. (2) The feature map is transformed by a bidirectional RNN along the time dimension. (3) The prediction is made by a fully connected layer that outputs a spectrogram frame-by-frame.

The model consists of three distinct steps, as can be seen in figure 2.25, excluding a preprocessing step.

Preprocessing

EHNet has a preprocessing step that applies a Short-Time Fourier Transform to extract the spectrogram from each audio file.
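A minimal sketch of such a preprocessing step using scipy.signal, which is also the package used for the reimplementation in section 3.4, is given below; the sampling rate and window length are assumptions.

```python
import numpy as np
from scipy.signal import stft

def to_spectrogram(audio, fs=16000, nperseg=512):
    """Compute an STFT and return the magnitude spectrogram (d x t) together
    with the phase, which is needed later to get back to a waveform."""
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg)
    return np.abs(Z), np.angle(Z)

audio = np.random.default_rng(0).normal(size=16000)  # 1 s of dummy audio
magnitude, phase = to_spectrogram(audio)
print(magnitude.shape)  # (frequency bins d, time frames t)
```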

Convolutional Component

The data of the spectrogram is processed by a convolutional network. They define their convolutional filters as b × w, i.e. not symmetrical in size. In order to retain the dimensions of the input spectrogram as it is being convolved, they use a zero padding of size d × ⌊w/2⌋ where they force w to be odd. They also suggest saving computational time, as adjacent frequency bins are very similar, by using a stride of b/2. Each layer is activated with ReLUs.

Bidirectional Recurrent Component

The RNNs implemented in EHNet are of the bidirectional variety to capture long-term relations in both directions. Before being fed to the RNN, the feature maps have to be transformed into something manageable. Each feature map is vertically concatenated along the feature dimension, forming a stacked 2D feature map. In mathematical notation, the k feature maps h_{z_j}(x) ∈ R^{p×t} are transformed into

H(x) = [h_{z_1}(x); \ldots; h_{z_k}(x)] \in \mathbb{R}^{pk \times t}    (2.38)

This new feature map is fed into a bidirectional LSTM.

Regression

The last component is the fully connected layer that takes the output from the LSTM, Ĥ(x) ∈ R^{q×t}, and applies a linear transformation followed by a ReLU to predict ŷ. For each t we end up with:

\hat{y}_t = \max\{0, W \hat{H}_t + b_W\}, \quad W \in \mathbb{R}^{d \times q},\ b_W \in \mathbb{R}^{d}    (2.39)

Converting Back to Waveform

They do not state in the paper how they convert the spectrogram data back to a waveform. This was clarified through contact with the original author, and the approximate method is explained in section 3.5.

2.8.4 Robustness

Robustness in deep learning is a measure of how well a network handles the introduction of new training and test data. To test robustness to the extreme, one would attempt adversarial attacks on the network, which could hinder the decision boundary of the model. Research [46][47] on this topic has been done for many different networks and several approaches exist.

In this thesis the concept of robustness will only be used as a method of evaluating the models rather than a full-on investigation, i.e. we consider the performance of the model in a comparative analysis on two different datasets. If the dataset does not affect the model too much, we claim that it is robust, and vice versa.

Chapter 3

Method

The three models were taken from their respective github repositories: Pascual, Bonafonte, and Serrà [48], Rethage, Pons, and Serra [49], and ododoyo [50].

3.1 Evaluation

Implementation of the Objective Evaluation Methods

The python package developed by Jingdong Li [51] on github is used as the implementation of PESQ. STOI is taken from the MATLAB implementation by C. H. Taal [52].

Human Evaluation

The participants for the human evaluation were 4 recent graduates from the Royal Institute of Technology in Sweden and 13 PhD candidates from the University of Science and Technology of China. The evaluation set was chosen as 14 of the same utterances from the 7 different models (Wiener, WaveNet denoising 28/56, SEGAN 28/56 and EHNet 28/56) in addition to the clean and noisy data, for a total of 126 samples to evaluate. The selected samples were randomly chosen with some bias towards noisy backgrounds. The sentences examined are listed in table 3.1.

They were given an excel-spreadsheet where they would give a rating (described in 2.1.1) next to each audio sample without knowing which model the audio file belonged to. The participants were told that there was no particular order for the audio samples so that they would think it was random.


Sentence spoken | .wav-file in Dataset
Please call Stella. | p232_001.wav
Now the system is right behind us. | p232_089.wav
He said that healthy eating was high on the council agenda. | p232_219.wav
We did not compete with any other local farmer. | p232_252.wav
This latter point is hugely important. | p232_155.wav
He was adamant that he was still ahead on everything that matters. | p232_303.wav
I was in a position to challenge for this event and didn't. | p232_410.wav
It is normal. | p257_110.wav
Customers for the panels include the National Trust for Scotland. | p257_135.wav
However, the figures were disputed by the Scottish Prison Service. | p257_175.wav
He raised the profile of the European Tour to the sky. | p257_256.wav
Everything will fall into place, it should be fine. | p257_265.wav
Whether his stance is shared by the incoming manager is another matter. | p257_358.wav
But it may take some time to confirm the findings. | p257_422.wav

Table 3.1: The evaluation dataset for each of the models in addition to the clean and noisy data.

However, the audio samples were in fact ordered by model per column to save time in implementation and calculations. The excel-spreadsheet can be downloaded at Luo [53].

3.2 SEGAN

The optimal model for the 28 speaker dataset in SEGAN was given by the authors. Here are a few of the more important parameters:

• The initial standard deviation for noise is set to 0.

• The initial weight for the L1 norm is set to 100.

• The batch size is 100.

• A pre-emphasis is put on the data and the parameter is set to 0.95.

• We have biases all throughout the network.

• The training runs for 86 epochs.

We applied the same parameters for the 56 speaker dataset. Evidently a better model could have been found, however the hyperparameter search was too costly for the thesis.

3.3 WaveNet Denoising

The optimal model for the 28 speaker dataset in WaveNet denoising was given by the authors. Here are a few of the more important parameters:

• A dilation depth of 9.

• There is no decay of the learning rate and the optimizer is the AdamOptimizer.

• The target field length is 1601.

• The batch size is 10.

• Early stopping is implemented, otherwise it runs for a maximum of 250 epochs.

The conditioning had to be changed from 29 different values to 57 different values when training on the 56 speaker dataset. Because of the bigger dataset we also increased the batch size to 15. Otherwise, everything was kept the same for both training procedures. We only applied the optimal parameters given by the creators on the training datasets.

3.4 EHNet

EHNet had not been trained on either dataset, however we still applied the same reasoning to exclude hyperparameter searching. Here are a few of the more important parameters:

• There are two convolutional layers with 256 filters each. The strides are 1 and 16 and the filter sizes are 11 and 32 for the respective layers.

• The hidden size of the LSTM is 1024 and there are two layers.

• The batch size is 32.

• Initial standard deviation is 0.02.

• A pretraining phase of 10 epochs is run.

• The learning rate starts at 1e-3 and decays by a factor of 0.5 until it reaches a minimum of 1e-6; the optimizer is the AdamOptimizer.

• Early stopping is implemented, otherwise it runs for a maximum of 200 epochs.

The github repository from which the code was taken is not the official code of the paper. It is an attempt by someone else to reconstruct the paper's directions. After contacting the original authors of the paper, they could not give out the code as it is privately owned by Microsoft Research. A big mistake in the github code was a wrong implementation of the STFT and ISTFT, as well as no function for reconstructing the audio from the spectrogram output. The implementation for this thesis switched out their Fourier transformations for functions from the scipy.signal package and modified the preprocessing layer to store the phase φ(ω) from the noisy audio files after the STFT. The outputs of the network are given as spectrogram data and only contain the magnitude information M of the signal. To reconstruct our prediction signal we use the equation:

x_{reconstructed} = M_{output}\, e^{j\phi(\omega)}    (3.1)

before applying the ISTFT. Since the paper does not mention which optimizer they used, it was not possible to follow their scheduled learning rate of {1.0, 0.1, 0.01} for every 60 epochs. The reason for choosing the AdamOptimizer is that it is one of the most widely used optimizers available.
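A sketch of this reconstruction with scipy.signal is shown below; the window length is an assumption and the identity "network" only illustrates the plumbing around equation 3.1.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_with_noisy_phase(noisy_audio, enhance_magnitude, fs=16000,
                             nperseg=512):
    """Combine the predicted magnitude with the phase of the noisy input
    (equation 3.1) and invert with the ISTFT."""
    _, _, Z_noisy = stft(noisy_audio, fs=fs, nperseg=nperseg)
    phase = np.angle(Z_noisy)
    M_output = enhance_magnitude(np.abs(Z_noisy))   # network prediction
    _, reconstructed = istft(M_output * np.exp(1j * phase), fs=fs,
                             nperseg=nperseg)
    return reconstructed

# identity "network" just to show the plumbing
noisy = np.random.default_rng(0).normal(size=16000)
out = enhance_with_noisy_phase(noisy, lambda m: m)
```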

3.5 Wiener Filter

The Wiener filter was implemented merely as a benchmark for the other three networks, and we use a MATLAB implementation [54].

3.6 Technical Implementation

All three models were run on Ubuntu 16.04 through an Amazon Web Services (AWS) p3.2xlarge EC2 server in Ireland. This corresponds to a Tesla V100 GPU with 418.40.04 drivers and CUDA version 10.1. However, none of the models had the same configurations.

SEGAN

The implementation was executed in an Anaconda environment for python 2.7 with Tensorflow-GPU 0.12 and CUDA 8.0 with cudnn v5.1 installed outside the virtual environment for Ubuntu. Some of the specifics can be found in the github repository, however not everything is stated there.

WaveNet Denoising

WaveNet denoising was also run in an Anaconda environment for python 2.7. However, instead of Tensorflow the project required Keras 1.2.0 with at least a Theano 0.9.0 backend. It also required a pygpu version of at least 0.6.9. The same CUDA settings were used for this.

EHNet

EHNet had the most modern configuration with python 3.6 and Tensorflow-GPU 1.10.0 + CUDA 9.0 with cudnn 7.3.1. CUDA was installed through the conda environment, specifically with 'conda install'.

Chapter 4

Results

The results that we explore are between 6 different models, 2 from each deep learning architecture, in addition to the noisy and Wiener benchmark tests.

4.1 Loss Functions

The loss function graphs displayed below are either taken from Tensorboard (SEGAN and EHNet) or reconstructed from logs in a .csv-file (WaveNet denoising). By studying the loss development we can infer the training performance and convergence to local extrema. The x-axis for the Tensorboard graphs is the number of samples processed. All graphs have outliers removed (a Tensorboard function), which can mean that the x-axis is inconsistent between graphs. This decision was made to increase readability without losing any important information.

4.1.1 SEGAN

Loss Function for 28 Speaker Dataset

We have smoothing (a Tensorboard function) applied to all the graphs for better readability. The lighter colored lines in the graphs are the "true" values while the thick orange line is the smoothed curve.


Figure 4.1: SEGAN discriminator loss for 28 speaker dataset. The training loss for the discriminator depicted in three graphs. From left to right: (1) The fake loss. (2) The total loss. (3) The real loss.

Figure 4.2: SEGAN generator loss for 28 speaker dataset. The training loss for the generator depicted in three graphs. From left to right: (1) The adversarial loss. (2) The L1 loss. (3) The total loss.

Loss Function for 56 Speaker Dataset

The same configuration as for the 28 speaker dataset. More smoothing was required in this case as the output had higher variance.

Figure 4.3: SEGAN discriminator loss for 56 speaker dataset. The training loss for the discriminator depicted in three graphs. From left to right: (1) The fake loss. (2) The total loss. (3) The real loss.

Figure 4.4: SEGAN generator loss for 56 speaker dataset. The training loss for the generator depicted in three graphs. From left to right: (1) The adversarial loss. (2) The L1 loss. (3) The total loss.

4.1.2 WaveNet Denoising

Loss Function for 28 Speaker Dataset

Each epoch's final output was put into a .csv-file and the graphs were modeled after this data.

Figure 4.5: WaveNet denoising training and validation loss for 28 speaker dataset. Blue denotes the training loss while red denotes the validation loss.

Loss Function for 56 Speaker Dataset

Figure 4.6: WaveNet denoising training and validation loss for 56 speaker dataset. Blue denotes the training loss while red denotes the validation loss.

4.1.3 EHNet

The validation set was evaluated only after every 500th sample, which is why the validation graphs appear smoother.

Loss Function for 28 Speaker Dataset

Smoothing is also applied here but to a much lesser degree than for SEGAN. We cut the graph off early as the gradients start to explode because of overfitting.

Figure 4.7: EHNet training (left) and validation (right) loss for 28 speaker dataset.

Loss Function for 56 Speaker Dataset

The training went better here and we did not have to cut off the graph at any point. The smoothing was stronger in this case, however, as the variance was much higher.

Figure 4.8: EHNet training (left) and validation (right) loss for 56 speaker dataset.

4.2 Evaluation Scores

For PESQ computed from a python wrapper, STOI from a MATLAB implementation and human evaluation taken as an average over 17 different people on 14 samples, we have the following results:

Model | PESQ | STOI | Human Evaluation
WaveNet denoising 28 | 2.91 | NaN | 3.01
WaveNet denoising 56 | 3.04 | NaN | 2.93
SEGAN 28 | 3.22 | 0.932 | 3.47
SEGAN 56 | 3.32 | 0.935 | 3.58
EHNet 28 | 2.71 | 0.886 | 2.87
EHNet 56 | 2.86 | 0.900 | 2.80
Wiener | 3.02 | 0.921 | 3.00
Noisy | 3.01 | 0.921 | 3.07

Table 4.1: Evaluation scores for the models in addition to Wiener and Noisy. The NaN entries for WaveNet denoising occur because the cutoff in the generation of the enhanced samples means they do not match the original length. Bold denotes the best, italic denotes the second best.

We do not have a STOI evaluation for WaveNet denoising as its samples were cut off earlier by design. Instead of padding the ends with zeroes and getting very bad results, we omit them from the table completely.

Chapter 5

Discussion

There are many things to look for when discussing the performance of a network. Outside of the obvious, such as evaluation scores, we also look at the training procedure and how easy it was to train the model. We also compare the two datasets for each model in addition to the original papers' results to check for robustness and reproducibility.

5.1 Training Procedure and Data

We study the loss for each model to see how well it matches the theory while commenting on performance relative to the loss development in training.

SEGAN

For SEGAN we know that the discriminator and generator losses should both be minimized for successful training according to the equations in section 2.8.1. In figures 4.1, 4.2, 4.3 and 4.4 we see how the total losses for both the discriminators and generators follow the same pattern as they slowly decrease until a minimum is found. Studying the initial samples, there is a big discrepancy in the adversarial loss for the generator. This is a typical shape as both networks are battling it out against each other without a clear winner at first. Still, in SEGAN the discriminator learns much faster than the generator and gets a head start, which could have caused an issue of vanishing gradients for the generator. However, the fine balance between generator and discriminator was still upheld, or else we would not have achieved the results that we got.


In the debate on how GANs are supposed to be trained there are many who advocate for strong discriminators. The reason is simple: if the generator becomes stronger than the discriminator first, we would have a network that converges instantly as the generator no longer has to feed the discriminator any different samples, i.e. we arrive at a type of mode collapse. Luckily, for the 28 speaker dataset we see a stable generator adversarial loss which equilibrates with low variance, just like the theory states. The 56 speaker dataset model has a bit higher variance for the adversarial loss, which means that we probably could have run the training for more epochs or implemented better hyperparameters. One should note that without the smoothing of the graphs we see a massive oscillating pattern, however this is also normal for GANs during training.

Comparing the loss for the two datasets we can see how the networks are robust in learning. Without changing any hyperparameters we still obtain a stable system for a new dataset. This bodes well for the efficiency of training SEGAN on other datasets. Surprisingly, the bigger dataset gives us better evaluation scores even though the model's hyperparameters were optimized for the smaller dataset. One could argue that this is because the datasets are still somewhat similar, as it is only the accents that differ, and a bigger dataset without overfitting will always give us better performance.

The samples produced by SEGAN were overall very good. The background noise would however find its way into the end of each audio file, where the speaker had stopped speaking. This is probably because the generator did not feel compelled to affect that region, as the discriminator would not see it as a relatively big deal.

WaveNet Denoising

The slightly modified L1 loss function for WaveNet denoising is relatively simple and we can see from figures 4.5 and 4.6 that there are no big surprises in the shape of the graphs. The development of the loss is very slow, however, as the rate of decrease drops very early. A reason for this could be the learning rate. In the logs we see that the default parameters set by the authors have the learning rate decay in a discrete manner. After a certain number of epochs the learning rate goes from the initial 0.001 to 0.0001, and after even more epochs it is lowered once more.

Now, if the network did not get close to the global optimum, a small learning rate would mean that it could get stuck in a local optimum instead. From the shape of the graphs we can infer that the network possibly got stuck in a valley surrounded by several other valleys and could not escape to get closer to the global minimum.

The learning rate went through several different iterations where there were attempts at increasing it or letting the tensorflow backend take care of it by itself (by leaving it at the default). Nevertheless, the same issues of gradients either exploding or vanishing still appeared. Ultimately, WaveNet denoising was a very frustrating architecture to work with as the training could get unstable very fast and convergence was hard to achieve. It does feel as if the thesis did the model a disservice, but we truly believe that the best efforts were made given the limited time and scope of the thesis. One has to say that it is unfortunate that it was not possible to reproduce their optimal network on the 28 speaker dataset.

If we want to study the robustness we should also take into account their best model, which they reference in their paper. While the difference in scores is minimal for the two different datasets, we could not reproduce the paper's results given many different runs. Therefore we could say that the architecture could be culpable for low robustness to new data.

WaveNet denoising could not remove the most intrusive background noises and distorted some of the speech. This would be a reason for the lower rating in human evaluation, as that was not pleasant to listen to even if the noises were quieter and appeared less frequently.

EHNet

The loss develops more naturally for EHNet, but the 28 speaker dataset model follows a similar shape to WaveNet denoising in the sense that it decreases fast early on. The learning rate in this case was also decreased in a similar manner, which could contribute to the slowdown of the loss progress.

The data had not been seen by the network before, as the original paper used a different dataset, hence all optimizations were done from a clean slate. This means that there might have been missed configurations in the hyperparameter space which would have garnered much better results.

Despite that, the results still came out sufficiently well and did not distort the voices. The author of this thesis does not fully agree with the lower rating from the human evaluation, and this score could have been different with another sample size and another study group. However, the configurations for the two datasets were very similar, hence we can still comment on robustness here. As the code retrieved from github was not the official code, there were some major mistakes in it, such as a faulty STFT implementation and no method to reconstruct the audio since the phase of the waves was not initially stored. The implementation was therefore written by the author of the thesis, and their limited coding knowledge could have made the model perform worse than its full potential.

5.2 Comparative Analysis in Overall Performance

We begin by first taking into account the runtime of each model as it’s important in real-time tasks.

Runtime

The fastest network to train was WaveNet denoising, which could average a run time of 100 epochs in about 2 hours with 1 minute per epoch. SEGAN was a bit slower with around 4 minutes per epoch, and EHNet was the slowest with around 10 minutes per epoch.

When it came to denoising, all three networks performed similarly and were quite fast, taking under a second for each speech sample. The reason for WaveNet denoising not living up to the notoriety of slow audio generation that the original WaveNet has is the removal of autoregressive causality, as stated in section 2.8.2. Overall, all three models are great at denoising in a short time frame and could be viable for real-time tasks.

Reflecting on the Implementation and Results

Since we did not use hyperparameter optimization we cannot be certain as to which network performs the best. However, we can still infer some advantages and disadvantages from the training procedure.

Immediately we can say that SEGAN has certain difficulties with finding a good training setup, as it inherits the same complications that all GANs have. Despite these issues, it still performed in a stable manner thanks to the pre-optimized guidance given by the creators. WaveNet denoising, on the other hand, was personally much harder to train. The github repository did not have sufficient information as to how they trained their optimal network, specifically in regards to which hyperparameters. While they stated that their network configuration could be found in the folder, those parameters could not reproduce the optimal results at all. As we see in the loss graphs (figures 4.5 and 4.6) as well, the training flattens out very early in the procedure and does not pick up the pace at all.

WaveNet denoising is arguably the most complex of the three, and it was hard to understand the motivation as to why some parts were implemented the way they were, although that is a characteristic of deep learning in general. Especially the loss function stood out, as it felt too simple for the otherwise intricate architecture and could be an interesting change to consider in the future. The biggest factor in performance was the conditioning implementation, which is a negative for end-to-end performance. There were some very bad results when the conditioning number did not match the dataset, e.g. the 56 speaker dataset required 57 as the conditioning parameter or else the results would be significantly worse. However, the thesis only had tests to determine the difference in having the conditioning number too low and did not consider trying the reverse of having a conditioning number far larger than the total amount of speakers. Nevertheless, this would still be a bottleneck for a generalized end-to-end model, as the training data that is input has to give the number of different speakers. If nobody labels the speaker data, this information would be in the dark. This should not be too difficult to correct in an improved version, but we must still acknowledge the detriment it had for the thesis work when implementing the new 56 speaker dataset.

The last model, EHNet, was very straightforward in its implementation. Since it was the simplest model of the three, we could easily comprehend the architecture with prior education from the courses taken at the Royal Institute of Technology, except for the STFT preprocessing, and this should be seen as a plus. As stated before, the sound clips were, in the personal opinion of the thesis' author, of higher quality than some other models', but the overall scores do not seem to match that opinion.

Nevertheless, this model should still be seen as a viable option in speech enhancement. One thing to note is that the reconstruction of the audio requires the phase of the wave, which is not learned but taken from the noisy input. This could be improved much further if we also added this information for learning, either to a separate model or embedded somehow into the spectrogram data.

While the Wiener filter scored higher than some of the networks, we should not take that at absolute face value since the referenced papers all report higher scores. However, this would of course be after rigorous hyperparameter optimization to achieve the best score for the data, as they are attempting to introduce a new model to the field of study. In the related works, Germain, Chen, and Koltun [7] take a similar comparative approach and arrive at the conclusion that Wiener and Noisy perform better than both SEGAN and WaveNet denoising in the measurements for signal distortion (SIG) and overall (OVL) performance, while worse in background intrusiveness (BAK). This strengthens the reliability of the analysis and results as they are very comparable.

5.3 Improvements Outside the Models

Evaluation Methods

Something that immediately catches our eye is the evaluation scores for the Noisy dataset without any speech enhancement. It scores higher than both the EHNet and WaveNet denoising architectures in all categories. Some of the data in the test set has very little intrusive noise, which the objective calculations could evaluate as minor differences in voice quality, which would affect the average over the whole dataset. Nevertheless, they seem to match the human evaluation and consequently should not be dismissed entirely.

One reasoning behind the human category is that the people doing the evaluation seemed to value the naturalness of the speech more than the amount of background noise. If the audio had any distortion, it immediately got a very low score even if the background noise was removed in the majority of the audio clip. The background intrusiveness was therefore not as important as how natural the voice was.

One could say that this was also a consequence of how the ratings were defined, as the word "natural" was in focus for the higher ratings. From feedback from peers of the thesis, we noted that the ratings could have been split up into two different categories: speech naturalness and background intrusiveness. The fact that two evaluation metrics were baked into one rating could have caused the disparity in ratings between participants.

If we look at Appendix A we see the split in evaluation between the Chinese and Swedish candidates and, interestingly, a divided opinion on model performance. While SEGAN is still highly performing, WaveNet denoising and EHNet are completely reversed in preference, where EHNet 56 actually scored the highest for the Swedish participants. This could be taken into account when assessing the models in future work.

From this we can infer that a more extensive human evaluation procedure could be implemented, with a larger sample size and a more diverse group, to gather another kind of result. Nevertheless, the main approach in this thesis was to explore the theory behind all three models rather than comparing absolute results to crown a winner. As stated before, without hyperparameter optimization we cannot completely conclude which is the best model for speech enhancement. Therefore, the evaluation scores should not be seen as hard truths either way.

Choice of Dataset

The 28 speaker dataset that both the SEGAN [5] and WaveNet denoising [44] authors had chosen was superseded on the Edinburgh DataShare website. The new dataset webpage had added the 56 speaker dataset as well, and this was seen as an opportunity to explore the models further. The reasoning was simple: (1) the dataset was free and readily available, (2) it was connected to the dataset that the models of interest had worked on, which would (3) make it easy to draw expanded conclusions and assess the reproducibility of their findings. The introduction of EHNet also expanded the benchmark for the dataset.

On the contrary, for all the same reasons that the dataset was perfect, it can also be seen as a detriment. To evaluate robustness we often want extreme cases of datapoints to see how well the network can handle them, and the 56 speaker dataset does leave a lot to be asked for in terms of vast differences.

It is certainly a topic of discussion for future work. Nevertheless, the scope of the thesis would have been much too big if we had to retrain the models from scratch.

Time Management and Hardware Limitations

The work had a long timeline of different hardware and software combinations, which impeded some of the preliminary goals that the thesis was aiming for. The first attempts at the start of the project were done on Windows 10 using Anaconda as the virtual environment. The issue here was that the models had requirements that were not compatible with Windows 10, e.g. SEGAN was built on tensorflow 0.12 in python 2.7. This combination was impossible to find for the Windows OS, whether through pip install, conda install or even Docker. After a few weeks of complete disaster in terms of progress, the project attempted a migration to OS X instead. However, the processing power of a GPU was required and the Macbook Pro's CPU could not train fast enough. This was also the primary reason for using Windows 10 in the beginning, as that computer had an NVIDIA GeForce GTX 1070 installed.

To make use of the hardware at hand, the author figured we could dual boot Windows 10 and Ubuntu 16.04. In Ubuntu, the correct packages could at last be found and the requirements given by the github repositories were finally installed. A new issue arose, however, of CUDA and cudnn compatibility with the conda environment. The learning curves in all of the setups were steep as the author had never attempted anything similar before. Fortunately, with the help of StackOverflow and medium.com we managed to solve the issue after a certain time.

Despite all the effort, nestled in an issue thread on the github repositories the author realized that the hardware on which both SEGAN and WaveNet had been trained was beyond the thesis' initial capacity. They required a Titan X Pascal 12GB to handle the memory and speed of the network. To remedy this we decided to rent an Amazon Web Services EC2 server, and the final results came through the p3.2xlarge server in Ireland. Looking back, it is evident that we should have just gone for AWS immediately.

One major reason, apart from time constraints, was financial restrictions that made the hyperparameter search improbable. Being forced to rent the server also made the thesis change its scope.

To reiterate, one major improvement for future work would be to put more time into hyperparameter search and truly find the optimal networks for all three models.
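As an aside on the environment issues described above, a quick sanity check of whether an environment actually sees the GPU can look like the sketch below. It assumes a modern TensorFlow 2.x build; the TensorFlow 0.12 release required by the original SEGAN code exposed a different API, so this is illustrative rather than a record of the exact setup used here.

# Minimal sanity check that a Python environment sees the GPU.
# Assumes a TensorFlow 2.x build; the TensorFlow 0.12 release used by the
# original SEGAN code exposed a different API for this kind of check.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print("Visible GPUs:", gpus)
else:
    print("No GPU visible - check that the installed CUDA and cuDNN "
          "versions match the ones this TensorFlow build expects.")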

5.4 Conclusion

Three deep learning architectures were examined for speech enhancement tasks. The datasets were chosen because they were similar to those examined before, and reproducibility was an important part of the results. SEGAN was the most robust and performed best across the experimental variables, so it can be regarded as the most reliable choice when picking a model for speech enhancement. However, it inherits flaws from the GAN family that could reduce performance in an industry setting where time is money. WaveNet denoising is an adaptation of WaveNet's generation of raw audio and can perform very well given the right circumstances; this could unfortunately not be realized in this work despite several different runs. Still, with reference to the original paper, it should be considered a viable option, as it could outperform most state-of-the-art results. EHNet can be considered a simpler model, but it should not be overlooked for that reason. It does, however, require preprocessing and uses the older method of spectrogram analysis, which the two other networks circumvent. While its evaluated scores were not great, the samples did not sound as poor as the numbers in this thesis might suggest. In conclusion, all three networks are possible options for the task at hand, but based on the results SEGAN would be recommended, despite the potential optimization difficulties that come with GANs, which are notoriously difficult to work with.

Bibliography

[1] D. Silver et al. Mastering the Game of Go without Human Knowledge. 2016.
[2] I. Goodfellow et al. Generative Adversarial Networks. 2014.
[3] C. Huang et al. Context-Aware Generative Adversarial Privacy. 2017.
[4] L. Yu et al. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. 2016.
[5] S. Pascual, A. Bonafonte, and J. Serrà. SEGAN: Speech Enhancement Generative Adversarial Network. 2017.
[6] Avinash Hindupur. The GAN Zoo. 2019. URL: https://github.com/hindupuravinash/the-gan-zoo.
[7] F. G. Germain, Q. Chen, and V. Koltun. Speech Denoising with Deep Feature Losses. 2018.
[8] A. Kumar and D. Florencio. Speech Enhancement In Multiple-Noise Conditions using Deep Neural Networks. 2016.
[9] S. Pascual et al. Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks. 2018.
[10] W. Han et al. Speech Enhancement Based on Improved Deep Neural Networks with MMSE Pretreatment Features. 2016.
[11] H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? 2017.
[12] J. Benesty et al. Speech Enhancement: A Signal Subspace Perspective, pp. 2-3. 2014.
[13] Christopher M. Bishop. Pattern Recognition and Machine Learning, pp. 147-152. 2006.


[14] A. W. Rix et al. Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs. IEEE, 2002.
[15] C. H. Taal et al. A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech. 2010.
[16] Y. Hu and P. C. Loizou. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE, 2008.
[17] Saeed V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction, Second Edition. 2000.
[18] Daniel Faggella. What is Machine Learning. 2019. URL: https://emerj.com/ai-glossary-terms/what-is-machine-learning.
[19] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, p. 100. 2017.
[20] Christopher M. Bishop. Pattern Recognition and Machine Learning, pp. 3-4. 2006.
[21] Christopher M. Bishop. Pattern Recognition and Machine Learning, pp. 192-196. 2006.
[22] Deepak Battini. Solving XOR Problem using neural network – C#. 2019. URL: https://www.tech-quantum.com/solving-xor-problem-using-neural-network-c/.
[23] Stanford CS Class CS231n. Convolutional Neural Networks for Visual Recognition. 2019.
[24] K. He et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 2015.
[25] G. Chen. A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation. 2018.
[26] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. 2013.
[27] Justin Bayer. Learning Sequence Representation, pp. 11-15. 2015.
[28] M. Schuster and K. K. Paliwal. Bidirectional Recurrent Neural Networks. 1997.
[29] K. He et al. Deep Residual Learning for Image Recognition. 2015.
[30] P. Bonnier et al. Deep Signatures. 2019.

[31] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015.
[32] T. Salimans et al. Improved Techniques for Training GANs. 2016.
[33] I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. 2017.
[34] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel Recurrent Neural Networks. 2016.
[35] A. van den Oord et al. Conditional Image Generation with PixelCNN Decoders. 2016.
[36] A. van den Oord et al. WaveNet: A Generative Model for Raw Audio. 2016.
[37] I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. 2016.
[38] X. Mao et al. Least Squares Generative Adversarial Networks. 2017.
[39] M. Mirza and S. Osindero. Conditional Generative Adversarial Nets. 2014.
[40] P. Isola et al. Image-to-Image Translation with Conditional Adversarial Networks. 2018.
[41] Cassia Valentini-Botinhao. Noisy speech database for training speech enhancement algorithms and TTS models. 2017. URL: https://datashare.is.ed.ac.uk/handle/10283/2791.
[42] C. Valentini-Botinhao et al. Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks. 2016.
[43] J. Thiemann, N. Ito, and E. Vincent. DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments. 2013. URL: https://zenodo.org/record/1227121#.XObEqlUzZhE.
[44] D. Rethage, J. Pons, and X. Serra. A Wavenet for Speech Denoising. 2018.
[45] H. Zhao et al. Convolutional-Recurrent Neural Networks for Speech Enhancement. 2018.
[46] T. E. Wang et al. Towards Robust Deep Neural Networks. 2018.

[47] K. Sun, Z. Zhu, and Z. Lin. Enhancing the Robustness of Deep Neural Networks by Boundary Conditional GAN. 2019.
[48] S. Pascual, A. Bonafonte, and J. Serrà. Implementation of SEGAN: Speech Enhancement Generative Adversarial Network. 2019. URL: https://github.com/santi-pdp/segan.
[49] D. Rethage, J. Pons, and X. Serra. Implementation of A Wavenet for Speech Denoising. 2019. URL: https://github.com/drethage/speech-denoising-.
[50] ododoyo. Partial implementation of Convolutional-Recurrent Neural Networks for Speech Enhancement. 2019. URL: https://github.com/ododoyo/EHNet.
[51] Jingdong Li. A python package for calculating the PESQ. 2019. URL: https://github.com/vBaiCai/python-pesq.
[52] Cees Taal. STOI – Short-Time Objective Intelligibility Measure. 2019. URL: http://www.ceestaal.nl/code/.
[53] Oscar Xing Luo. Human evaluation spreadsheet. 2019. URL: https://drive.google.com/file/d/1XnguF_eHwK-yogMQotkkxXSSae_w3zZp/view?usp=sharing.
[54] Esfandiar Zavarehei. Wiener Filter. 2019. URL: https://mathworks.com/matlabcentral/fileexchange/7673-wiener-filter.

Appendix A

Chinese and Swedish Perspective

Since our human evaluation drew on participants from two different cultural backgrounds, all proficient in English, it is interesting to see what is valued more in an audio clip with regard to quality. For the reader's interest and for transparency's sake, we present the average scores for the Chinese and Swedish participants respectively.

Model                    Human Evaluation
WaveNet denoising 28     3.14
WaveNet denoising 56     3.07
SEGAN 28                 3.5
SEGAN 56                 3.66
EHNet 28                 2.82
EHNet 56                 2.57
Wiener                   3.03
Noisy                    3.05

Table A.1: Average evaluation scores from the 13 Chinese participants. Bold denotes the best, italic denotes the second best.


Model                    Human Evaluation
WaveNet denoising 28     2.61
WaveNet denoising 56     2.5
SEGAN 28                 3.39
SEGAN 56                 3.30
EHNet 28                 3.00
EHNet 56                 3.54
Wiener                   2.91
Noisy                    3.11

Table A.2: Average evaluation scores from the 4 Swedish participants. Bold denotes the best, italic denotes the second best.
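For transparency, the group averages in Tables A.1 and A.2 could be reproduced with a short script along the following lines. The file name and column layout are assumptions about how the spreadsheet in [53] might be exported as CSV, not its actual format.

# Sketch of reproducing the per-group averages in Tables A.1 and A.2.
# Assumes a hypothetical CSV export with columns: group, model, rating.
import csv
from collections import defaultdict

def group_model_averages(path):
    """Return {(group, model): mean rating} from the hypothetical CSV."""
    totals = defaultdict(lambda: [0.0, 0])  # (group, model) -> [sum, count]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["group"], row["model"])
            totals[key][0] += float(row["rating"])
            totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

if __name__ == "__main__":
    averages = group_model_averages("human_evaluation.csv")
    for group in ("Chinese", "Swedish"):
        ranked = sorted(
            ((m, avg) for (g, m), avg in averages.items() if g == group),
            key=lambda item: item[1], reverse=True,
        )
        print(group)
        for model, avg in ranked:  # best first, i.e. bold then italic in the tables
            print(f"  {model:22s} {avg:.2f}")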
