Deep Learning-based Multimedia Content Processing

by Mohammad Akbari

M.Sc., University of Lethbridge, 2014
B.Sc., Shahid Bahonar University of Shiraz, 2010

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in the School of Engineering Science Faculty of Applied Science

© Mohammad Akbari 2020
SIMON FRASER UNIVERSITY
Summer 2020

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Mohammad Akbari

Degree: Doctor of Philosophy (Engineering Science)

Title: Deep Learning-based Multimedia Content Processing

Examining Committee:

Chair: Dr. Zhenman Fang, Assistant Professor

Dr. Jie Liang, Senior Supervisor, Professor

Dr. Ivan V. Bajić, Supervisor, Professor

Dr. Shahram Payandeh, Supervisor, Professor

Dr. Jiangchuan Liu, Internal Examiner, Professor, School of Computing Science

Dr. En-Hui Yang, External Examiner, Professor, Department of Electrical and Computer Engineering, University of Waterloo

Date Defended: July 24th, 2020

Abstract

In the last few years, deep learning has revolutionized many applications in the field of multimedia content processing, such as music information retrieval (MIR) and image compression, which are addressed in this thesis. In order to handle the challenges in acoustic-based MIR such as automatic music transcription, the video of the musical performance can be utilized. In Chapter 2, a new learning-based system for visually transcribing piano music using convolutional neural networks and support vector machines is presented, which achieves an average improvement of ≈ 0.37 in terms of F1 score over previous works. Another significant problem in MIR is music generation. In Chapter 3, a semi-recurrent hybrid model combining a variational auto-encoder and a generative adversarial network for the sequential generation of piano music is introduced, which achieves better results than previous methods.

Auto-encoders have also proven to be strong candidates for learned image compression, which has recently shown the potential to outperform standard codecs. Some efforts have also been made to integrate other computer vision tasks with image compression to improve the compression performance. In Chapter 4, a semantic segmentation-based layered image compression method is presented in which the segmentation map of the input is used in the compression procedure. Most learned image compression methods train multiple models for multiple bit rates, which increases the implementation complexity. In Chapter 5, we propose a variable-rate image compression model employing two novel loss functions and residual sub-networks in the auto-encoder. The proposed method outperforms the standard codecs as well as previous learned variable-rate methods on the Kodak image set.

The state of the art in image compression has been achieved by utilizing joint hyper-prior and auto-regressive models. However, these models suffer from the spatial redundancy of the low frequency information in the latents. In Chapter 6, we propose the first learned multi-frequency image compression approach, which uses the recently developed octave convolutions to factorize the latents into high and low frequencies. As the low frequency is represented at a lower resolution, its spatial redundancy is reduced, which improves the compression rate. Our experiments show that the proposed scheme outperforms all standard codecs and learning-based methods in both PSNR and MS-SSIM metrics, and establishes a new state of the art for learned image compression on the Kodak image set.

Keywords: multimedia content processing; machine and deep learning; music information retrieval; learned image compression; entropy coding

Dedication

To the holy presence of Imam Mahdi (may Allah hasten his reappearance)... and To my family...

Acknowledgements

I am using this opportunity to express my gratitude to everyone who supported me throughout this thesis. I am thankful for their aspiring guidance, invaluably constructive criticism, and friendly advice during the project work. I would like to express my special appreciation and thanks to my senior supervisor Dr. Jie Liang for all his support, efforts, and engagement through the process of this Ph.D. thesis. He continually and persuasively conveyed a spirit of adventure in regard to research and scholarship. He has been one of the best teachers and mentors that I have ever had. Without his supervision and constant help, this thesis would not have been possible. I also want to thank my supervisors, Dr. Ivan V. Bajić and Dr. Shahram Payandeh, for their useful comments and remarks during my Ph.D. studies and my defence session. In addition, I thank Dr. Zhenman Fang as the chair of my Ph.D. defence session as well as Dr. Jiangchuan Liu, my internal examiner, and Dr. En-Hui Yang, my external examiner, from the University of Waterloo. I would also like to thank my M.Sc. supervisor, Dr. Howard Cheng from the University of Lethbridge, for all his help and encouragement, especially his support and collaboration at the beginning of my Ph.D. studies. Furthermore, I would like to express my appreciation to all faculty members and instructors at the Schools of Engineering Science and Computing Science, especially Dr. Greg Mori, Dr. Mark Drew, Dr. Ping Tan, Dr. Craig Scratchely, and Dr. Andrew Rawics, for their instructions and all they have taught me in my Ph.D. course work and teaching assistant duties. Moreover, I want to thank Derek Pang, Balu Adsumilli, and Charlie Argast for their support in my approved internship at Google Inc., despite its cancellation due to Covid-19. I also express my gratitude to all of my good friends who supported me to strive towards my goal. A warm and special appreciation to Saeed Abbasi and Saeed Ranjbar for all their help, time, and engagement. I would additionally like to thank my good labmates Ehsan Mahoor, Hyomin Choi, Bingling Li, Jianping Lin, Elham Ideli, and Setareh Dabiri. I also want to acknowledge and thank my friends Mehran Khodabandeh, Hedayat Zarkoob, Mehdi Seyfi, Ali Pazoki, Akbar Rafiei, Mohsen Gholami, Ahmad Rezaei, Kazem Zabihollahi, Javad Nikpour, Maryam Bahrami, Mohadeseh Majdi, Milad Toutounchian, Shervin Jannesar, Saeid Izadi, Ehsan Esfahanian, Majid Talebi, Ehsan Haghshenas, Saeid Asgari,

Shahram Pourazadi, Abdollah Safari, Hashem Jeyhooni, Majid Askari, Mehdi Feyzi, Reza Ramezani, and Mahmoud Shafiei. The deepest thanks and appreciation go to my family, especially my mom, my sisters, my beloved wife, and my parents-in-law. Words cannot express how grateful I am to them for all of the sacrifices they have made on my behalf. Their prayers for me were what sustained me thus far.

List of Abbreviations

MIR      Music Information Retrieval
AMT      Automatic Music Transcription
MIDI     Musical Instrument Digital Interface
NMF      Non-Negative Matrix Factorization
PLCA     Probabilistic Latent Component Analysis
HMM      Hidden Markov Models
NNLSQ    Non-Negative Least Squares Fitting
CNN      Convolutional Neural Network
SVM      Support Vector Machine
RNN      Recurrent Neural Network
LSTM     Long Short-Term Memory
VAE      Variational Auto-Encoder
VRAE     Variational Recurrent Auto-Encoder
VRASH    Variational Recurrent Auto-Encoder Supported by History
CVAE     Conditional Variational Auto-Encoder
GAN      Generative Adversarial Network
WGAN     Wasserstein Generative Adversarial Network
LAPGAN   Laplacian Generative Adversarial Network
DCGAN    Deep Convolutional Generative Adversarial Network
CGAN     Conditional Generative Adversarial Network
RBM      Restricted Boltzmann Machine
C-RBM    Convolutional Restricted Boltzmann Machine
JPEG     Joint Photographic Experts Group
HEVC     High Efficiency Video Coding
BPG      Better Portable Graphics
FLIF     Free Lossless Image Format
VVC      Versatile Video Coding
VTM      Versatile Video Coding Test Model
MSE      Mean Squared Error
PSNR     Peak Signal-to-Noise Ratio

SSIM     Structural Similarity
MS-SSIM  Multi-Scale Structural Similarity
GDN      Generalized Divisive Normalization
IGDN     Inverse Generalized Divisive Normalization
ResNet   Residual Neural Network
DSSLIC   Deep Semantic Segmentation-based Layered Image Compression
RGB      Red Green Blue
GSM      Gaussian Scale Mixture
GMM      Gaussian Mixture Model
R-D      Rate-Distortion
DBU      Dynamic Background Update
ReLU     Rectified Linear Unit
ELU      Exponential Linear Unit
TRON     Trust Region Newton
KL       Kullback-Leibler
SGD      Stochastic Gradient Descent
STE      Straight-Through Estimate
BPP      Bits per Pixel
GoConv   Generalized Octave Convolution
GoTConv  Generalized Octave Transposed-Convolution

Table of Contents

Approval
Abstract
Dedication
Acknowledgements
List of Abbreviations
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Learned Music Information Retrieval
      1.1.1 Automatic Music Transcription (AMT)
      1.1.2 Automatic Music Generation
  1.2 Learned Image Compression
      1.2.1 Learned Variable-Rate Image Compression
      1.2.2 Context-Adaptive Entropy Models
  1.3 Scholarly Publications
      1.3.1 Journal Papers
      1.3.2 Conference Papers

2 A Real-Time System for Online Learning-based Visual Transcription of Piano Music
  2.1 Feature Extraction
      2.1.1 Preprocessing
      2.1.2 Dynamic Background Update (DBU)
      2.1.3 CNN-based Feature Extraction
  2.2 Online Learning-based Visual AMT
      2.2.1 Pressed Keys Classification
      2.2.2 Music Transcription
  2.3 Experimental Results
      2.3.1 Dataset
      2.3.2 Training
      2.3.3 Analysis of Online Learning Method
      2.3.4 Frame-Level Classification of Pressed Keys

3 Semi-Recurrent CNN-based VAE-GAN for Sequential Data: Piano Music Generation
  3.1 Variational Auto-Encoder (VAE)
  3.2 Generative Adversarial Network (GAN)
  3.3 Formulation and Objective
  3.4 Experiments: Piano Music Generation
      3.4.1 Setup
      3.4.2 Results

4 Deep Semantic Segmentation-based Layered Image Compression
  4.1 Network Architecture
  4.2 Formulation and Objective Functions
  4.3 Training
  4.4 Experiments
      4.4.1 Ablation Study

5 Learned Variable-Rate Image Compression with Residual Divisive Normalization
  5.1 Network Architecture
  5.2 Deep Encoder
  5.3 Stochastic Rounding-Based Quantization
  5.4 Deep Decoder
  5.5 Residual Coding
  5.6 Objective Function and Optimization
  5.7 Experimental Results
      5.7.1 Other Ablation Studies

6 Generalized Octave Convolutions for Learned Multi-Frequency Image Compression
  6.1 Vanilla vs. Octave Convolution
  6.2 Generalized Octave Convolution
  6.3 Multi-Frequency Entropy Model
      6.3.1 Objective Functions
  6.4 Experimental Results
      6.4.1 Ablation Study
  6.5 Multi-Frequency Image Semantic Segmentation
  6.6 Multi-Frequency Image Denoising

7 Conclusion

Bibliography

List of Tables

Table 2.1  CNN architecture used in our learning-based visual AMT.
Table 2.2  The F1 score of pressed black and white keys classification.
Table 2.3  The maximum and average processing times of different methods.
Table 3.1  The network architecture of the encoder, generator, and discriminator.
Table 4.1  Ablation study of different components in DSSLIC.
Table 6.1  Ablation study of different components in the proposed framework.
Table 6.2  UNet architecture.
Table 6.3  Comparison results for semantic segmentation on Cityscapes.
Table 6.4  Baseline convolutional auto-encoder for image denoising.
Table 6.5  Comparison results of the multi-frequency image denoising.

List of Figures

Figure 1.1   Acoustic music transcription.
Figure 1.2   Visual music transcription.
Figure 1.3   Audio-visual music transcription.
Figure 2.1   The overall proposed learning-based AMT framework.
Figure 2.2   Sample video frames with and without hands coverage.
Figure 2.3   Localization of black and white keys.
Figure 2.4   Background subtraction.
Figure 2.5   The cropped difference image of the pianist's right hand.
Figure 2.6   The camera setup over the piano.
Figure 2.7   Sample video frames taken from four different videos in our dataset.
Figure 2.8   The comparison of offline and online training times.
Figure 2.9   The performance of incrementally learning data.
Figure 2.10  Black and white pressed keys classification results.
Figure 2.11  The ablation study of CNN-SVM classification results.
Figure 3.1   The overall architecture of an auto-encoder.
Figure 3.2   The overall architecture of VAE.
Figure 3.3   The overall architecture of GAN.
Figure 3.4   The structure of different hybrid VAE-GAN models.
Figure 3.5   The proposed semi-recurrent hybrid VAE-GAN training model.
Figure 3.6   The proposed semi-recurrent hybrid VAE-GAN testing models.
Figure 3.7   Three music samples generated.
Figure 3.8   Measurements used for evaluating 300 generated music samples.
Figure 4.1   The proposed semantic segmentation-based image compression codec.
Figure 4.2   Comparison results on ADE20K test set.
Figure 4.3   Comparison results on Cityscapes test set.
Figure 4.4   Comparison results on Kodak image set.
Figure 4.5   ADE20K visual example (bits/pixel/channel, PSNR, MS-SSIM).
Figure 4.6   Cityscapes visual example (bits/pixel/channel, PSNR, MS-SSIM).
Figure 4.7   Kodak visual example (bits/pixel/channel, PSNR, MS-SSIM).
Figure 4.8   Visual comparison of different scenarios in DSSLIC.
Figure 5.1   The proposed variable-rate image compression framework.
Figure 5.2   Architecture of the proposed deep encoder and deep decoder networks.
Figure 5.3   The proposed ResGDN and ResIGDN sub-networks.
Figure 5.4   Sample quantized code maps.
Figure 5.5   Comparison results of our variable-rate approach with other methods.
Figure 5.6   Kodak visual example 1.
Figure 5.7   Kodak visual example 2.
Figure 5.8   Kodak visual example 3.
Figure 5.9   Ablation studies with different model configurations.
Figure 6.1   Vanilla convolution scheme.
Figure 6.2   The architecture of the original octave convolution.
Figure 6.3   The proposed generalized octave convolution (GoConv).
Figure 6.4   The proposed generalized transposed-convolution (GoTConv).
Figure 6.5   Overall framework of our multi-frequency image compression model.
Figure 6.6   Sample high and low frequency latent representations.
Figure 6.7   Kodak comparison results with state-of-the-art methods.
Figure 6.8   Kodak visual example (bits-per-pixel, PSNR, MS-SSIMdB).
Figure 6.9   The architecture of the original octave transposed-convolution.
Figure 6.10  Cityscapes visual example 1.
Figure 6.11  Cityscapes visual example 2.
Figure 6.12  Sample image denoising results from MNIST test set.
Figure 6.13  Sample image denoising results from CIFAR10 test set.

Chapter 1

Introduction

A huge amount of multimedia content in various forms of video, image, audio, music, and text is generated every day. This explosion of multimedia data has provided significant opportunities in various applications, such as computer vision, image/video compression, speech recognition, natural language processing (NLP), machine translation, bioinformatics, and also music information retrieval. However, it is difficult to process, analyze, search, and represent such massive and diverse multimedia data effectively.

The conventional techniques proposed for various multimedia applications mainly depend on hand-crafted features extracted and captured from data in different domains. In recent years, deep learning has had a great impact on a variety of multimedia applications, especially image classification and generation, speech recognition, and NLP. Deep learning is a subset of machine learning whose purpose is to automatically learn data representations using multiple levels of non-linear operations. Multimedia data are large, unstructured, and heterogeneous, so deep learning has the potential to automatically extract features from such unstructured data for retrieval, detection, segmentation, classification, tracking, and generation tasks without the need to rely on human intervention.

In this thesis, we study two different problems and applications in multimedia content processing, namely MIR and image compression, using deep learning techniques. In the following sections, a brief introduction of deep learning-based music information retrieval and learned image compression, along with the related works in both fields, is presented and discussed.

1.1 Learned Music Information Retrieval

The growth of digital music has considerably increased the availability of contents in the last few years. In order to analyze this amount of data, MIR has become an important field of research [31, 51, 123], which deals with problems such as automatic music transcription (AMT), genre detection [19, 46], music generation [37], music recognition [66], and music

recommendation [40, 74]. In recent years, deep learning methods have become very popular in the field of music content analysis and MIR. In this thesis, two of these problems, AMT and music generation, are addressed using deep learning techniques.

1.1.1 Automatic Music Transcription (AMT)

AMT is the conversion process from music to some symbolic notation such as a music score or a musical instrument digital interface (MIDI) file [26, 99, 127]. The core problem in AMT is multi-pitch estimation, which is the task of finding multiple pitches in polyphonic music [98]. Previous works have performed this task mostly by analyzing the acoustic signal of the music piece. However, it is difficult to obtain an accurate transcription from audio alone in a number of important situations. A variety of acoustic-based AMT methods has been proposed. In the following section, we will give a brief background of these methods and discuss their limitations. Following that, we describe some previous works based on image and video processing for visually analyzing and transcribing music. Finally, some existing multi-modal developments employing both audio and visual information for transcribing music are given.

Acoustic Music Transcription

The main task in the AMT problem is multi-pitch estimation in which concurrent pitches in a time frame are estimated (Figure 1.1). This analysis can be performed in time and/or frequency domains. The key idea in time domain methods is to use an auto-correlation function to find the periodicity of harmonic sounds. For instance, Davy and Godsill [48], and Cemgil et al. [38] used probabilistic modeling to perform this process. On the other hand, frequency domain methods basically deal with recognizing the harmonic patterns corresponding to each pitch. Iterative spectral subtraction [76], spectral peak modeling based on maximum likelihood [52, 98], probabilistic-based full spectrum modeling [143], and spectral decomposition using non-negative matrix factorization (NMF) [29] and probabilistic latent component analysis (PLCA) [25, 27] are some approaches in the frequency domain.

Figure 1.1: Acoustic music transcription.
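To make the frequency-domain decomposition concrete, the following sketch factorizes a magnitude spectrogram with NMF multiplicative updates. It is only an illustration of the NMF idea, not code from the cited works: the template matrix W, iteration count, and activation threshold are assumptions.

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-9):
    """Estimate per-frame pitch activations H for a magnitude spectrogram V (freq x time),
    given fixed spectral templates W (freq x pitches), by multiplicative updates that
    reduce the Frobenius norm ||V - WH||."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    return H

# Example: pitches whose activation exceeds a hand-picked threshold are reported as active.
# active = nmf_activations(V, W) > 0.1
```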

Most of the recent audio-based polyphonic music transcription research has focused on learning-based (or classification-based) methods in which multiple binary classifiers (each corresponding to one note) are trained with short-time acoustic features in a supervised manner [35, 117, 128]. Then, these classifiers are used to predict notes in new input

data. These methods have shown promising results in AMT compared to other approaches. However, acoustic variations such as different timbre, tuning, and recording conditions in the test set make it difficult to obtain good transcription results. In order to handle this problem, unsupervised feature learning methods can be utilized, in which robust feature representations invariant to acoustic variations can be learned [116].

Visual Music Transcription

Although there has been some recent progress in multi-pitch estimation techniques, they still need to deal with many important difficulties such as the presence of multiple lines of music (e.g., simultaneously played notes or instruments), audio noise, exact onset/offset detection, out-of-tune harmonics (incorrectly recorded pitches), octave ambiguity, etc. In order to deal with these acoustic-based challenges in AMT, other types of music representations such as visual information can be employed [47, 100] (Figure 1.2).

Figure 1.2: Visual music transcription.

In [124], an automated approach was proposed for visually detecting and tracking the piano keys played by a pianist using image differences between the background image and the current video frame. The background image, containing only the piano keyboard, had to be manually taken by the user at first. The algorithm had a 63.3% precision rate, a 74% recall rate, and an F1 score of 0.68 for identifying the pressed keys [124]. Gorodnichy and Yogeswaran [61] demonstrated a video recognition tool called C-MIDI to detect which hands and fingers were used to play the keys on a piano or a musical keyboard. Instead of detecting the played notes using transcription methods, the musical instrument was equipped with a MIDI interface to transmit the MIDI events associated with the played notes to the computer [61]. In 2007, Zhang et al. proposed a visual method for automatically transcribing violin music in which visual information about fingering and bowing on the violin was used to improve the accuracy of audio-based music transcription [145]. Their method had an accuracy of 92.4% in multiple finger tracking and 94.2% in string detection. However, automatic note inference, including the pitch and onset/offset detection of an inferred note, had an accuracy of only 14.9%. Extracting fingering information for many musical instruments such as piano [61, 94, 121] and guitar [96, 113] can be considered an important task used for a better transcription of

the corresponding music. However, these works often assume that the pitches of the played notes are already available from MIDI interfaces. Some researchers have considered the extraction of fingering information as a gesture recognition problem [115]. For example, Sotirios and Georgios [121] used hidden Markov models (HMM) [141] to classify hand gestures and retrieve fingering information from the right hand of a pianist. Akbari and Cheng proposed a computer vision-based automatic music transcription system named claVision to perform piano music transcription in real time [6, 7]. Instead of processing the music audio, their system performed the transcription only from the video of the performance captured by a camera mounted over the piano keyboard. Their method used a four-stage procedure including keyboard registration, illumination normalization, and pressed keys detection to perform music transcription. Although the proposed method in claVision [6] had a high accuracy (F1 score over 0.95) and a very low latency (about 7.0 ms) in real-time music transcription, it had a number of limitations under conditions such as drastic lighting changes, an inappropriate camera view angle, hand coverage of the keyboard, and vibrations of the camera or the keyboard.

Audio-Visual Music Transcription

There exist some developments that have focused on multi-modal transcription systems utilizing both audio and visual information [57, 101, 105, 83] (Figure 1.3). Frisson [57] implemented a multi-modal system for extracting information from guitar music. However, this system requires artificial markers on the performer's hands for visual tracking. Paleari et al. [96] presented a complete multi-modal approach for guitar music transcription. Their method first used visual techniques to detect the position of the guitar, the fret-board, and the hands. Then, the extracted visual information was combined with the audio data to produce an accurate output. According to the results described in [96], this technique had an 89% accuracy in detecting the notes. Tavares et al. [126] proposed an audio-visual vibraphone transcription technique to reduce the errors caused by polyphony and noise. First, audio detection was applied to estimate the notes played on the vibraphone. Different audio detection techniques such as PLCA and non-negative least squares fitting (NNLSQ) were tested. In parallel, the position of the mallets was visually detected using colour filtering and contour detection algorithms. Finally, a note was detected as played if the mallet was located over the corresponding bar. In 2009, a violin transcription system was introduced by Zhang and Wang [146] to enhance the audio-only onset detection of violin music. In their approach, a visual analysis of the bowing and fingering of the violin performance was utilized to infer the onsets. The proposed method was evaluated with feature-level (concatenating audio and visual features at an early stage) and decision-level (fusing the separately obtained classification results of the audio and video analysis) fusion techniques. The best F1 score, 93%, was achieved by decision-level data fusion.

Figure 1.3: Audio-visual music transcription.

In Chapter 2, we present a new real-time learning-based system for visually transcribing piano music using convolutional neural networks (CNNs) and support vector machines (SVMs) to classify the black and white keys as pressed or unpressed. The whole process in this technique is based on a visual analysis of the piano keyboard and the pianist's hands and fingers. A high accuracy, with an average F1 score of 0.95, is achieved even under non-ideal camera views, hand coverage, and lighting conditions. The proposed system has a low latency (about 20 ms) in real-time music transcription. In addition, a new dataset for visual transcription of piano music is created and made available to researchers in this area. Since not all possible varying patterns of the data used in our work are available, an online learning approach is applied to efficiently update the original model based on new data added to the training dataset.

1.1.2 Automatic Music Generation

The algorithmic or learned process of creating and composing a novel piece of music with minimum human intervention is called automatic music generation. Music generation can be defined as a sequential data generation problem in unsupervised learning. Recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have shown considerable performance in this area. However, they have difficulties in handling the vanishing and exploding gradient problems [97]. In order to deal with these issues, RNNs have been combined with the most recent deep generative architectures such as variational auto-encoders (VAEs) and generative adversarial networks (GANs) [54, 64, 68, 92, 132, 144], which are typically used for learning complex structures of data. Different learning-based approaches for music generation have been introduced by various researchers. In [53], an RNN-based architecture using LSTMs was proposed in which a piano-roll sequence of notes and chords is generated using an iterative feed-forward

strategy. Some limitations of this work include the hand-crafted training set and the deterministic generation. A decoder feed-forward strategy on a stacked auto-encoder architecture was presented in [112] to generate music spectrograms. In some works, a sampling strategy was used for music generation [32, 69]. For example, in [32] restricted Boltzmann machines (RBMs) were utilized for modeling and generating polyphonic music by learning a model from an audio corpus. The DeepBach architecture [69], which was specialized for Bach's chorales, combined two LSTMs and two feed-forward networks (forward and backward in time).

VAE, as one of the significant approaches considered for content generation, has been used by some researchers to generate musical content. In [54], a VAE-based method named variational recurrent auto-encoder (VRAE) was proposed in which the encoder and decoder consisted of LSTMs. In this work, the encoder sequentially reads each symbol of an input sequence and the decoder synthesizes the output sequence by predicting the next symbol. The variational recurrent auto-encoder supported by history (VRASH) [132] used the same architecture as VRAE, but also fed the output of the decoder back into the decoder. As a result, the previously generated note was considered as additional information for generating the next note. In [68], the objective function used in DeepBach was reformulated using VAE to have better control over the embedding of the data into the latent space.

Although most sequential data generation methods use RNNs to model the invariance in time, some recent works have shown that CNNs are also capable of generating realistic sequential data such as music [95, 142]. One advantage of CNNs is that they are practically faster to train and easier to parallelize than RNNs. In addition, applying convolutions to the time dimension can result in significant performance in some applications [33]. A system for generating raw audio music waveforms named WaveNet was proposed in [95], in which an extended convolutional network called dilated causal convolution was incorporated. In this work, the authors argue that dilated convolutions allow the receptive field to grow longer in a much cheaper way than using LSTM units. Another CNN-based architecture is the convolutional RBM (C-RBM) [80], which was developed for the generation of polyphonic music in MIDI format. In this work, convolution is performed in the time domain to model temporally invariant motives.

Some works have exploited GANs for generating music [92, 142]. An example of the use of GAN is C-RNN-GAN [92], with both the generator and discriminator being LSTM architectures, in which the goal was to transform random noise into melodies. A bidirectional RNN was utilized in the discriminator to take contexts from both past and future. In [142], a convolutional GAN architecture named MidiNet was proposed to generate pop music melodies from random noise (in a piano-roll-like format). In this approach, both the generator and discriminator were composed of convolutional networks. Similar to what an RNN would do in considering the history, the information from the previous musical measure was incorporated into intermediate layers.

In Chapter 3, we introduce a semi-recurrent hybrid VAE-GAN model for music generation. In order to consider the spatial correlation of the data in each frame of the generated sequence, CNNs are utilized in the encoder, generator, and discriminator. The subsequent frames are sampled from the latent distributions obtained by encoding the previous frames. As a result, the dependencies between the frames are maintained. Two testing frameworks for synthesizing a sequence with any number of frames are also proposed. The promising experimental results on piano music generation indicate the potential of the proposed framework in modelling other sequential data such as video.
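As a rough illustration of this sampling chain (not the exact architecture of Chapter 3), the following PyTorch-style sketch generates a sequence frame by frame, drawing each new latent from the distribution produced by encoding the previously generated frame; the encoder and generator modules are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def generate_sequence(encoder, generator, first_frame, num_frames):
    """Generate a sequence frame by frame: the latent of each new frame is sampled from
    the Gaussian returned by encoding the previously generated frame."""
    frames = [first_frame]                                        # (batch, C, H, W)
    for _ in range(num_frames - 1):
        mu, logvar = encoder(frames[-1])                          # q(z | previous frame)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        frames.append(generator(z))                               # CNN generator makes the next frame
    return torch.stack(frames, dim=1)                             # (batch, time, C, H, W)
```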

1.2 Learned Image Compression

In the last few years, deep learning has made tremendous progress in the well-studied topic of image compression. Deep learning-based image compression [10, 71, 84, 90, 129, 134] has shown the potential to outperform standard codecs such as JPEG2000, the H.265/HEVC-based BPG image codec [23], and also the new versatile video coding test model (VTM) [56], making it a very promising tool for the next-generation image compression.

In traditional compression methods, many components such as the entropy coding are fixed and hand-crafted, and linear transforms are usually used. Deep learning-based approaches have the potential to automatically discover and exploit the features of the data, thereby achieving better compression performance. In addition, deep learning allows non-linear and more flexible context modelling [16].

Various learning-based image compression frameworks have been proposed in the last few years. In [133], deep learning was first used to compress thumbnail images using LSTM-based RNNs, residual-based layer coding, and a stochastic binarization process to facilitate the back-propagation. The binary representations were then compressed using entropy coding. The probability estimation in the entropy coding was also handled by LSTM convolution. Better SSIM (structural similarity) [138] results than JPEG and WebP were reported. This approach was generalized in [134], which proposed a variable-rate framework for full-resolution images by introducing a gated recurrent unit, residual scaling, and deep learning-based entropy coding. This method can outperform JPEG in terms of PSNR (peak signal-to-noise ratio). These methods were further improved in [71] using an SSIM loss [138] and spatially adaptive bit allocation.

In [17, 30], a scheme involving a generalized divisive normalization (GDN)-based non-linear analysis transform, a uniform quantizer, and an inverse GDN (IGDN)-based synthesis transform was proposed. The encoding network consists of three stages of convolution and GDN layers, whereas the decoding network consists of three stages of transposed-convolution and IGDN layers. Despite its simple architecture, it outperforms JPEG2000 in both PSNR and SSIM.
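For readers unfamiliar with GDN, the sketch below implements the basic normalization y_i = x_i / sqrt(beta_i + sum_j gamma_ij x_j^2) as a 1×1 convolution over squared activations, assuming PyTorch; it is a minimal sketch that omits the constrained reparameterization of beta and gamma used in [17], simply taking absolute values to keep the parameters non-negative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2); the inverse (IGDN) version multiplies instead."""
    def __init__(self, channels, inverse=False, eps=1e-6):
        super().__init__()
        self.inverse = inverse
        self.eps = eps
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(1e-2 * torch.eye(channels))

    def forward(self, x):
        c = x.shape[1]
        gamma = self.gamma.abs().view(c, c, 1, 1)   # 1x1 conv pools squared responses across channels
        norm = torch.sqrt(F.conv2d(x * x, gamma, bias=self.beta.abs()) + self.eps)
        return x * norm if self.inverse else x / norm
```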

A compressive auto-encoder framework with residual connections as in ResNet (residual neural network) was proposed in [129], where the quantization was replaced by a smooth approximation, and a scaling approach was used to obtain different rates. In [1], a soft-to-hard vector quantization approach was introduced, and a unified framework was developed for both image compression and neural network model compression.

In order to take the spatial variation of image content into account, a content-weighted framework was introduced in [84], where an importance map for locally adaptive bit rate allocation was employed to handle the spatial variation of image content. Learned channel-wise quantization was also used to reduce the quantization error.

GANs have been exploited in a number of learning-based image compression schemes. In [111], a discriminator was used to help train the decoder. A perceptual loss based on the feature maps of an ImageNet-pretrained AlexNet was introduced, although only low-resolution image coding results were reported in [111]. In [108], an auto-encoder was embedded in the GAN framework in which the feature extraction adopted pyramid and inter-scale alignment. The discriminator also extracted outputs from different layers, similar to the pyramid feature generation. An adaptive training scheme was used where the discriminator was trained and a confusion signal was propagated through the reconstructor, depending on the prediction accuracy of the discriminator.

Recently, there have also been some efforts in combining computer vision tasks and image compression in one framework. In [87, 135], the authors tried to use the feature maps from learning-based image compression to help other tasks such as image classification and semantic segmentation, although the results from those tasks were not used to help the compression part. In [2], segmentation map-based image synthesis was utilized for image compression, where a GAN was used for training the entire framework. The scheme targets extremely low bit rates (< 0.1 bits/pixel), and uses synthesized images for non-important regions.

In Chapter 4, we propose a deep semantic segmentation-based layered image compression framework. In this approach, the semantic segmentation map of the input image is extracted by a deep learning network and losslessly encoded as the base layer of the bit-stream. Next, the input image and the segmentation map are used by another deep network to obtain a low-dimensional compact representation of the input, which is encoded into the bit-stream as the first enhancement layer. After that, the compact image and the segmentation map are used to obtain a coarse reconstruction of the image. The residual between the input and the coarse reconstruction is encoded as the second enhancement layer in the bit-stream. To improve the quality, the synthesized image from the segmentation map is designed to be a residual itself, which aims to compensate for the difference between the up-sampled version of the compact image and the input image. Therefore, the proposed scheme includes three layers of information. Experimental results in the RGB (4:4:4) domain show that the proposed

framework outperforms the H.265/HEVC-based BPG codec [23] in both PSNR and multi-scale structural similarity index (MS-SSIM) [139] metrics across a large range of bit rates, and is much better than JPEG, JPEG2000, and WebP [60]. Moreover, since the semantic segmentation map is included in the bit-stream, the proposed scheme can facilitate many other tasks such as image search and object-based adaptive image compression.
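The decoder-side combination of the three layers described above can be summarized by the following sketch. It is a conceptual illustration only: the upsampling factor and the interface of the synthesis network (here called fine_net) are assumptions, not the exact DSSLIC implementation.

```python
import torch.nn.functional as F

def reconstruct(seg_map, compact, residual, fine_net, scale=8):
    """Combine the three DSSLIC layers on the decoder side: the synthesized output of
    fine_net acts as a residual on top of the upsampled compact image, and the coded
    residual layer is added last."""
    upsampled = F.interpolate(compact, scale_factor=scale, mode='bilinear', align_corners=False)
    coarse = upsampled + fine_net(seg_map, compact)   # coarse reconstruction of the input
    return coarse + residual                          # add the second enhancement layer
```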

1.2.1 Learned Variable-Rate Image Compression

Since most learned image compression methods need to train multiple networks for multiple bit rates, variable-rate image compression approaches have also been proposed in which a single neural network model is trained to operate at multiple bit rates. This approach was first introduced in [133], which was then generalized for full-resolution images using deep learning-based entropy coding in [134]. In [36], a CNN-based multi-scale decomposition transform was optimized for all scales. Rate allocation algorithms were also applied to determine the optimal scale of each image block. The results in [36] were reported to be better than BPG in MS-SSIM. In [147], a learned progressive image compression model was proposed, in which bit-plane decomposition was adopted. Bidirectional assembling gated units were also introduced to reduce the correlation between different bit-planes.

In Chapter 5, we propose a new deep learning-based variable-rate image compression framework, which employs GDN layers. Two novel types of GDN-based residual sub-networks are also developed in the encoder and decoder networks by incorporating the shortcut connection from ResNet [70]. Our scheme uses stochastic rounding-based scalar quantization as in [65, 103, 133]. To further improve the performance, we encode the residual between the input and the reconstructed image from the decoder network by BPG as an enhancement layer. To enable a single model to operate at different bit rates and to learn multi-rate image features, a new variable-rate objective function is introduced. To the best of our knowledge, this is the first objective function simultaneously trained with multiple bit rates for learning-based image compression. Experimental results show that the proposed framework trained with the variable-rate objective function outperforms all standard codecs, including H.265/HEVC-based BPG in the more challenging YUV4:2:0 and YUV4:4:4 formats, as well as state-of-the-art learning-based variable-rate methods in terms of the MS-SSIM metric. Further gains can be achieved if we optimize multiple networks for different bit rates independently.
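A common way to realize stochastic rounding-based scalar quantization with a straight-through estimator (STE), assuming PyTorch, is sketched below; it mirrors the general idea in [65, 103, 133] rather than the exact quantizer used in Chapter 5.

```python
import torch

def stochastic_round_ste(x):
    """Stochastic rounding: round up with probability equal to the fractional part,
    which makes the quantizer unbiased in expectation. The straight-through estimator
    lets gradients pass through as if the quantizer were the identity."""
    floor = torch.floor(x)
    hard = floor + (torch.rand_like(x) < (x - floor)).float()
    return x + (hard - x).detach()
```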

1.2.2 Context-Adaptive Entropy Models

Most previous works used fixed entropy models shared between the encoder and decoder. In [18], a conditional entropy model based on a Gaussian scale mixture (GSM) was proposed where the scale parameters were conditioned on a hyper-prior learned using a hyper auto-encoder. The compressed hyper-prior was transmitted and added to the bit stream as side information. Three layers of convolutions along with GDN and IGDN layers were used

in the encoder and decoder networks, similar to [17]. The model in [18] was extended in [90, 81], where a Gaussian mixture model (GMM) with both mean and scale parameters conditioned on the hyper-prior was utilized. In these methods, the hyper-priors were combined with auto-regressive priors generated using context models, which outperformed BPG in terms of both PSNR and MS-SSIM. The coding efficiency in [81] was further improved in [82] by a joint optimization of the image compression and quality enhancement networks. Another context-adaptive approach was introduced in [151], in which multi-scale masked convolutional networks were utilized for the auto-regressive model combined with hyper-priors.

The state of the art in learned image compression has been achieved by context-adaptive entropy methods in which hyper-prior and auto-regressive models are combined [90]. These approaches are jointly optimized to effectively capture the spatial dependencies and probabilistic structures of the latent representations, which leads to a compression model with significant rate-distortion (R-D) performance. However, similar to natural images, the latents contain a mixture of information with multiple frequencies, which are usually represented by feature maps of the same spatial resolution. Some of these maps are spatially redundant because they consist of low frequency information, which can be more efficiently represented and compressed.

In Chapter 6, a multi-frequency entropy model is introduced in which octave convolutions [42] are utilized to factorize the latent representations into high and low frequencies. The low frequency information is then represented at a lower spatial resolution, which reduces the corresponding spatial redundancy and improves the compression ratio, similar to wavelet transforms [13]. In addition, due to the effective communication between high and low frequencies, the reconstruction performance is also improved. In order to preserve the spatial information and structure of the latents, we develop novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers. Experimental results show that the proposed multi-frequency framework outperforms all standard codecs, including BPG and VTM, as well as learning-based methods in both PSNR and MS-SSIM metrics, and establishes the new state of the art in learning-based image compression. The framework proposed in this work bridges the wavelet transform and deep learning; therefore, many techniques from the wavelet transform can be used in deep learning-based image compression. This will have a profound impact on future image coding research.
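For reference, the sketch below shows the original octave convolution of [42] on a (high, low) feature pair. It is only a simplified illustration; the proposed GoConv/GoTConv in Chapter 6 additionally insert internal activation layers and handle strided and transposed convolutions differently.

```python
import torch.nn as nn
import torch.nn.functional as F

class OctaveConv(nn.Module):
    """Original octave convolution [42]: inputs and outputs are (high, low) feature pairs,
    with the low-frequency branch stored at half the spatial resolution."""
    def __init__(self, in_ch, out_ch, alpha=0.5, k=3):
        super().__init__()
        in_lo, out_lo = int(alpha * in_ch), int(alpha * out_ch)
        in_hi, out_hi = in_ch - in_lo, out_ch - out_lo
        self.hh = nn.Conv2d(in_hi, out_hi, k, padding=k // 2)   # high -> high
        self.hl = nn.Conv2d(in_hi, out_lo, k, padding=k // 2)   # high -> low (after pooling)
        self.lh = nn.Conv2d(in_lo, out_hi, k, padding=k // 2)   # low -> high (then upsampled)
        self.ll = nn.Conv2d(in_lo, out_lo, k, padding=k // 2)   # low -> low

    def forward(self, x_h, x_l):
        # intra-frequency paths plus inter-frequency exchange between the two resolutions
        y_h = self.hh(x_h) + F.interpolate(self.lh(x_l), scale_factor=2, mode='nearest')
        y_l = self.ll(x_l) + self.hl(F.avg_pool2d(x_h, 2))
        return y_h, y_l
```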

1.3 Scholarly Publications

The scholarly publications achieved during my Ph.D. studies are outlined as follows:

1.3.1 Journal Papers

• M. Akbari, J. Liang, J. Han, and C. Tu. Generalized Octave Convolutions for Learned Multi-Frequency Image Compression. IEEE Journal on Selected Topics in Signal Processing (Special Issue on Deep Learning for Image/Video Restoration and Compression), submitted [12].

– This work has achieved the state of the art in image compression, outperforming all standard codecs (such as HEVC-based BPG and VVC) and learned image compression methods.

• M. Akbari, J. Liang, J. Han, and C. Tu. Learning-based Modulated Variable-Rate Image Compression with Residual Divisive Normalization. IEEE Transactions on Multimedia, submitted.

• M. Akbari, J. Liang, H. Cheng. A Real-time System for Online Learning-based Visual Transcription of Piano Music. Multimedia Tools and Applications, (2018) 1-23 [9].

1.3.2 Conference Papers

• M. Akbari, J. Liang, J. Han, and C. Tu. Learned Multi-Frequency Image Compression with Generalized Octave Convolutions. 2021 AAAI Conference on Artificial Intelligence, to be submitted.

• M. Akbari, J. Liang, J. Han, and C. Tu. Learned Variable-Rate Image Compression with Residual Divisive Normalization. 2020 International Conference on Multimedia and Expo (ICME) [11].

– Nominated for Best Paper/Best Student Paper Award

• M. Akbari, J. Liang, and J. Han. Deep Semantic Segmentation-based Layered Image Compression. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK [10].

• M. Akbari and J. Liang. Semi-Recurrent CNN-based VAE-GAN for Sequential Data Generation. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, CA [8].

Chapter 2

A Real-Time System for Online Learning-based Visual Transcription of Piano Music

Since music is usually composed from sound, most existing MIR methods only utilize audio information for the analysis of music content. However, sound is not the only important element in representing music. The visual representation is also a vital aspect of music in our daily life, since music is often accompanied by a visualization [21, 34, 130]. The visual language of music genres is a very good example, in which costumes and gestures play important roles [89]. That is why music video directors use a wide range of visual effects to draw attention and manipulate our perception of a song. As a result, the visual layer of music videos also provides a wide range of music information that is very useful for the analysis of music content [22, 93, 114, 115].

In this chapter, we introduce an online learning-based system to perform multi-pitch estimation of piano music by only visually analyzing the piano keyboard and the pianist's hands. This work is an extension of the claVision method proposed in [4, 5, 6]. Although claVision has a high accuracy (F1 score over 0.95) and a very low latency (about 7.0 ms) in real-time piano music transcription, it does not work well under the non-ideal conditions described below:

• In situations where the images are very dark or bright or the lighting conditions change sharply during the performance, claVision cannot function properly.

• Pressed keys that are directly below the camera are not accurately detected. In this case, the key press and release events cannot be reliably detected using the pressed keys detection algorithm.

• If a large portion of a piano key is covered by the pianist’s hands, pressed keys detection for that key may not be accurate. Moreover, the pressed keys under the hands cannot be seen by the camera or even human viewers.

Figure 2.1: The overall framework of our learning-based AMT approach.

• If the camera vibrates or moves slightly during the performance, the keyboard and the keys cannot be properly localized, which results in a low accuracy in pressed key detection.

The overall framework of the proposed system in this chapter is illustrated in Figure 2.1. In this work, an improvement of ≈ 0.44 (black keys) and ≈ 0.29 (white keys) in terms of F1 score is observed over the method in [6] under the above-mentioned non-ideal conditions. The statistical properties of the data in our work vary according to different camera angles, lighting conditions, skin colours, fingering and hand shapes, positions, and postures. Since not all possible varying patterns are available and there is always a possibility of new variations (e.g., playing styles used by different pianists), an online learning strategy is incorporated to dynamically adapt the model to new data added to the training set. In addition, a new dataset for visual transcription of piano music is created and made available. To the best of our knowledge, this is the first such dataset available to researchers.

This chapter is organized as follows. In Section 2.1, the feature extraction procedure, including keyboard detection, initial background image identification, keys localization, and hand area detection, is described. In order to deal with varying illumination during the performance, a dynamic background update method is introduced in Section 2.1.2 to update the background image in each video frame. In Section 2.1.3, CNNs are then utilized to extract features by feeding the difference images forward through the network. In Section 2.2, the two separate binary SVMs used for classifying pressed black and white keys are formulated and discussed. The experimental results are finally evaluated in Section 2.3.

2.1 Feature Extraction

In this section, the process of extracting features used for training our models and classi- fying the pressed keys is discussed. First, the video frames are preprocessed to identify the background image, localize the piano keys, and compute the difference image. A dynamic background update method is then employed to handle varying illumination conditions in the difference image. Finally, the feature vectors corresponding to the black and white keys are extracted using CNNs.

2.1.1 Preprocessing

Similar to the approach in [6], the keyboard is detected and localized by utilizing the Hough line transform and a homogeneous transform. The image of the detected keyboard is considered as the initial background image, denoted by G^0, which is an image with no hand coverage (Figure 2.2).

Figure 2.2: Sample video frames with hands coverage (top) and without hands coverage (bottom).

Next, keys localization is performed in which the black keys are detected as objects in a white background using the connected component labelling algorithm, and the white keys are localized using the dividing lines estimated from the positions of their adjacent black keys, similar to [6]. Each of the detected keys is identified by its bounding rectangle. More formally, let N = {k_1^black, ..., k_n^black} and M = {k_1^white, ..., k_m^white} be the sets of localized black and white keys, respectively. Each of the black keys is identified by the upper-left and bottom-right coordinates of its bounding rectangle in G^0. White keys require a more complicated description, as their shape can be considered to be the union of two rectangles. We divide each white key into its upper bounding rectangle and its lower rectangle (Figure 2.3). The bounding rectangle corresponding to each black key is rescaled to a fixed size of 10×40 for all samples. Similarly, the upper and lower bounding rectangles corresponding to each white key are rescaled to 10×40 and 10×20, respectively.

The musical information such as the octaves and the notes corresponding to all located keys is determined as follows. The octave numbers can be estimated by considering the middle visible octave as octave 4, which includes the Middle C key. The other octaves located on the left and right side of octave 4 can be numbered accordingly.

15 Figure 2.3: Localization of black and white keys. The lines separating the white keys are estimated based on their adjacent black keys. The rectangles highlighted using dots and striped lines show the upper and lower parts of two sample white keys.

The black keys on the piano are divided into groups of two black keys (C♯ and D♯) and three black keys (F♯, G♯, and A♯), which are repeated across the keyboard. Since there is no black key between these two groups, the space separating them is doubled in comparison to that between the black keys inside each group. Thus, these two groups can be distinguished to determine their associated musical notes. The natural notes corresponding to the white keys are then determined based on the notes of the black keys and the estimated separating lines. For example, the black key G♯ lies between the white keys corresponding to the notes G and A.

Given the background image and the video frame at time t, denoted by G^t and I^t, the difference image, denoted by D^t, is computed as follows to identify the changes caused by key presses as well as hand movements (Figure 2.4):

D^t = I^t − G^t.    (2.1)

If there are no drastic changes in illumination, we may use the same background image at each step, that is, G^t = G^0 for all t. In the next section, we will discuss how the background image can be updated to handle changing illumination conditions.

In [6], it was discussed that black and white key presses usually result in different sign changes in the difference image, so the positive and negative values in D^t were processed separately. However, this is not always true, especially for the pressed keys that are directly below the camera. In this work, we utilize both positive and negative values for both black and white keys to improve the detection.

The keys are usually pressed in the areas where the pianist's hands and fingers are present. Thus, the search domain for finding the pressed keys can be limited by detecting the location of the hands on the keyboard [6]. In addition, we can remove noisy detection results from other parts in which no hands exist. This process is performed by applying the connected component labelling algorithm to the binarized difference image D^t to extract a set of column indices, H^t, containing the pianist's hands and fingers (Figure 2.4).
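A minimal sketch of this step, assuming OpenCV and NumPy, is given below; the binarization threshold and minimum component area are illustrative values, since the exact parameters are not specified here.

```python
import cv2
import numpy as np

def hand_columns(diff_image, bin_thresh=30, min_area=500):
    """Return the set of column indices H^t covered by the pianist's hands, found as
    large connected components in the binarized difference image D^t."""
    _, binary = cv2.threshold(np.abs(diff_image).astype(np.uint8), bin_thresh, 255,
                              cv2.THRESH_BINARY)
    num, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    cols = set()
    for i in range(1, num):                                   # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            left, width = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_WIDTH]
            cols.update(range(left, left + width))
    return cols
```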

2.1.2 Dynamic Background Update (DBU)

Varying illumination can be a significant issue in our approach if we compare each frame to a fixed initial background image. Different types of noise and shadows cause difficulties in this approach. These variations cause an inconsistency between the background image and the other video frames, which results in noisy difference images after background subtraction.

Figure 2.4: Background subtraction. From top to bottom: the background image G^t; the video frame I^t; and the difference image D^t.

In order to make the background image consistent with the current probable variations (i.e., illumination changes, noise, and shadows), a dynamic background update (DBU) algorithm is employed, which improves over the illumination correction procedure in claVision [4, 6]. The updated background image at time t, denoted by G^t, is defined by

G^t_{i,j} = \begin{cases} I^{t-1}_{i,j} & \text{if } i \in H^{t-1}, \\ G^{t-1}_{i,j} & \text{otherwise}. \end{cases}    (2.2)

In other words, for columns that include the pianist's hands and fingers, the updated background image contains pixels from the previous frame I^{t-1}; otherwise, it contains pixels from the previous background image G^{t-1}. This allows the background to adapt to changes in illumination, especially around the areas where the pianist's hands are. In order to reduce computational complexity, we may choose to perform this update every k frames.
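A direct NumPy implementation of Eq. (2.2) could look like the following sketch; the array shapes and the representation of H^{t-1} as a set of column indices are assumptions for illustration.

```python
import numpy as np

def update_background(prev_background, prev_frame, hand_cols):
    """Eq. (2.2): columns under the hands are refreshed from the previous frame;
    all other columns keep the previous background."""
    new_background = prev_background.copy()
    cols = np.fromiter(hand_cols, dtype=int)
    if cols.size:
        new_background[:, cols] = prev_frame[:, cols]
    return new_background
```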

2.1.3 CNN-based Feature Extraction

Recent works on image classification tasks have shown that generic feature descriptors extracted from CNNs can be effective [104]. In order to extract robust feature representations from our data, we propose a simple CNN model whose architecture is summarized in Table 2.1. We use two convolutional layers (with 8 and 16 filters of size 3×3 and stride 1) in which the ReLU (rectified linear unit) activation function is utilized. Since the stride used in the convolutional layers is 1, we also set the zero-padding to 1 to ensure that the input and output of each layer have the same size. Each convolutional layer is followed by a maxpool layer. Since pooling sizes with larger receptive fields may be too destructive, we use the small but common 2×2 filter size with stride 2 to down-sample the input by half. In order to avoid over-fitting, we apply dropout regularisation with a dropout ratio of 0.5 after each maxpool layer. The last two layers are a fully-connected layer with 2 hidden units and a 2-way softmax loss.

Table 2.1: CNN architecture used in our learning-based visual AMT.

Layer  Type                  Filters  Size
1      Convolutional + ReLU  8        3×3, stride=1, pad=1
2      Maxpool               -        2×2, stride=2
3      Dropout (50%)         -        -
4      Convolutional + ReLU  16       3×3, stride=1, pad=1
5      Maxpool               -        2×2, stride=2
6      Dropout (50%)         -        -
7      Fully Connected       -        2 hidden units
8      Softmax               -        2-way

The input to our CNN for the black keys is the cropped gray-scale image (the concatenation of the gray-scale intensities in the bounding rectangle) corresponding to the i-th black key in the difference image. We use the output activations from Layer 6 as the feature vector for the i-th black key k_i^black, denoted by Φ_i. For the white keys, the concatenation of the upper and lower bounding rectangles corresponding to the i-th white key is considered as the input to the CNN. The feature vector for the i-th white key k_i^white, denoted by Ψ_i, is similarly extracted from Layer 6. Figure 2.5 shows some sample cropped difference images used as the inputs to the CNN.

As explained in Section 2.1.1, black and white keys have different sizes, shapes, structures, and contexts. Each white key is represented by the union of two rectangles: an upper bounding rectangle whose structure is similar to the black keys, and a lower bounding rectangle which usually includes the hands and fingers used for playing that key. So, in order to extract more robust features corresponding to each type of key, we train two separate CNN models with different input sizes, 10×40 and 10×60 (the concatenation of the lower and upper rectangles), which result in Layer 6 outputs of sizes 2×10×16 and 2×15×16, respectively. The black keys CNN, denoted by CNN^black, is pre-trained with all black key samples in the training dataset, and the white keys CNN, denoted by CNN^white, is pre-trained with all white key samples (see Section 2.3.1 for more details about the dataset). In each video frame, we used the MIDI events (transmitted from the digital piano keyboard to the computer using a MIDI cable) as the labels for training the CNNs. Note that for training and classification, our algorithm only considers feature vectors overlapping with the columns covered by the pianist's hands and fingers. In other words, a feature vector is only considered if it includes pixels with column indices in H. In order to standardize the range of features before training and classification, a min-max normalization is applied to scale the values to the range [0, 1].
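A minimal PyTorch sketch of the architecture in Table 2.1 is shown below; the dropout placement and the use of Layer-6 activations as features follow the text, while training details (optimizer, loss weighting) are omitted.

```python
import torch
import torch.nn as nn

class KeyCNN(nn.Module):
    """CNN of Table 2.1 for one cropped key patch (1 x H x W); black keys use 10x40 crops,
    white keys 10x60 crops."""
    def __init__(self, in_h=10, in_w=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, stride=2), nn.Dropout(0.5),
            nn.Conv2d(8, 16, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, stride=2), nn.Dropout(0.5),
        )
        feat_dim = 16 * (in_h // 4) * (in_w // 4)         # 320 for 10x40, 480 for 10x60
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(feat_dim, 2))

    def forward(self, x):
        return self.classifier(self.features(x))          # 2-way logits (softmax in the loss)

    def extract(self, x):
        return torch.flatten(self.features(x), 1)         # Layer-6 activations used as features
```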

Figure 2.5: From left to right: the cropped difference image of the pianist’s right hand covering octave 6 in Figure 2.4 with its corresponding 5 cropped images for the black keys and 7 cropped images for the white keys.

2.2 Online Learning-based Visual AMT

2.2.1 Pressed Keys Classification

Having the normalized CNN-based feature vectors Φ and Ψ extracted in the previous section, we model our pressed keys detection problem with two binary SVMs [24], which are separately trained in a supervised manner using the labels provided. They are used for linear classification on whether or not a black or a white key is pressed in each video frame.

Learning algorithms can be categorized into batch and online learning. In batch learning, all examples are available at once and the model is trained over the entire set. In contrast, online learning is the technique in which training data becomes available in a sequential order, which requires the model to be updated accordingly. This type of learning is very useful in situations where the model needs to dynamically adapt to new patterns in the data, when the data is created as a function of time, or when it is not computationally feasible to train over the entire dataset. Moreover, online learning is beneficial for dealing with large or non-stationary data and real-time processing tasks for continuous data streams [79]. For classification, when a small amount of data is added to or removed from the training set, the model is not required to change much. Incremental and decremental learning algorithms can be used to efficiently update the model without retraining over the entire data set from the beginning [136].

Different camera angles, illumination conditions, styles of pianists’ hands and fingers in playing (i.e., fingering and hand shapes, positions, and postures), skin colours of the hands, etc. result in varying patterns in the data in our work. These variations could happen in the same video performance (e.g., varying illumination) or between performances (e.g., camera view). A complete dataset including all these possible patterns is not available because there is always a possibility of new variations, for example, playing styles used by different pianists. As a result, we apply an online learning technique to efficiently update the original model to incorporate new patterns in the data added to the training set in a sequential order over time.

The process of adding new training data and updating the model can be done when the classification performance in specific situations is not satisfactory to the user. One way to collect the new training data is to ask the user to play different pieces of music on a digital keyboard (with a MIDI cable providing the labels) in any desired situation described in Section 2.3.1 (e.g., different camera angles). It is also possible to collect the data using an acoustic piano. In this case, the user is asked to play some predetermined keys whose corresponding labels are already available to the program.

In order to efficiently update the models, an alternate formulation of the SVM optimization problem is needed. In [6], it was discussed that the separate detection of the pressed black and white keys results in a better performance in pressed keys detection, since it is unlikely that the white keys are incorrectly detected as their adjacent black keys and vice versa. In this work, we follow the same strategy and consider two separate binary SVMs for the black and white keys. We use the trust region Newton (TRON) method, which is one of the most efficient Newton approaches for solving linear SVMs [85]. Since this method is only applicable to the L2-loss function, we only consider the L2 loss in our work.

Let d_v and d_w be the dimensions of the feature vectors for the black keys and white keys, respectively. Given the training sets {(Φi, φi)}, Φi ∈ [0, 1]^{d_v}, φi ∈ {+1, −1}, i = 1, ..., n and {(Ψi, ψi)}, Ψi ∈ [0, 1]^{d_w}, ψi ∈ {+1, −1}, i = 1, ..., m, the two-class linear classification problems for the black and white keys are represented as the following primal L2-loss SVM optimization problems [39].

min_{v,ν_0} f^black(v) = ½‖v‖² + C_v^+ Σ_{φ_i=+1} γ_i + C_v^− Σ_{φ_i=−1} γ_i,
s.t. φ_i(v^T Φ_i + ν_0) ≥ 1 − γ_i,  γ_i ≥ 0,  i = 1, ..., n,  (2.3)

and

min_{w,ω_0} f^white(w) = ½‖w‖² + C_w^+ Σ_{ψ_i=+1} ξ_i + C_w^− Σ_{ψ_i=−1} ξ_i,
s.t. ψ_i(w^T Ψ_i + ω_0) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., m,  (2.4)

where γ_i = max(0, 1 − φ_i v^T Φ_i)² and ξ_i = max(0, 1 − ψ_i w^T Ψ_i)². φi and ψi are the labels representing the status of the key in the i-th training instance such that the key is considered as unpressed (key-up) if the label is −1 and considered as pressed (key-down) if the label is +1. Since the number of positive and negative instances in our work may be unbalanced [3, 41, 55], we use two different penalty parameters in each SVM formulation to balance the data [39]. The penalty parameters C_v^+, C_v^−, C_w^+, and C_w^− are estimated using the following
heuristic rules:

C_v^+ = λ^+ · n / Σ_{i=1}^{n} Φ_i Φ_i^T   for φ_i = +1,
C_v^− = λ^− · n / Σ_{i=1}^{n} Φ_i Φ_i^T   for φ_i = −1,
C_w^+ = λ^+ · m / Σ_{i=1}^{m} Ψ_i Ψ_i^T   for ψ_i = +1,
C_w^− = λ^− · m / Σ_{i=1}^{m} Ψ_i Ψ_i^T   for ψ_i = −1,  (2.5)

where λ^+ and λ^− are the positive and negative class weights indicating how much of the parameter C should be applied to instances carrying the positive and negative labels. In order to efficiently update the models formulated in Equations 2.3 and 2.4 when new data is added to the dataset, the warm start setting proposed in [136] is utilized to incrementally learn the patterns in the new data. In this setting, the optimal solutions v∗ and w∗ obtained from the most recent training and optimization are considered as the initial solutions of the new training and optimization problem, as follows:

v¯ ≡ v∗ and w¯ ≡ w∗. (2.6)

As a result, the number of iterations required for retraining the new data set is reduced. As discussed in [136], the effectiveness of this strategy is dependent on the optimization method used for training. They showed that if an initial solution is close to the optimal solution, the warm start setting significantly speeds up a high-order optimization method such as TRON [136], which is the method we use for solving the SVM problems defined in Equations 2.3 and 2.4.

In each video frame, the black and white keys k_i^black and k_i^white are classified into the pressed (+1) or unpressed (−1) classes using the decision functions sign(v^T Φ_i + ν_0) and sign(w^T Ψ_i + ω_0), respectively, if their corresponding feature vectors Φ_i and Ψ_i overlap with the columns covered by the pianist’s hands and fingers. Otherwise, they are classified into the unpressed (−1) class.
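The following scikit-learn sketch illustrates the warm-start idea of Equation 2.6 with an L2-loss linear classifier; it is not the TRON solver of [85] used in this work, and the feature matrices, labels, and class weights below are placeholder values.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Placeholder black-key features (Phi) and labels (phi in {+1, -1}).
rng = np.random.default_rng(0)
Phi_old, phi_old = rng.random((1000, 320)), rng.choice([-1, 1], 1000)
Phi_new, phi_new = rng.random((200, 320)), rng.choice([-1, 1], 200)

# squared_hinge approximates the L2-loss SVM; class_weight plays the role of C+/C-.
clf = SGDClassifier(loss="squared_hinge", warm_start=True,
                    class_weight={1: 3.0, -1: 1.0}, random_state=0)
clf.fit(Phi_old, phi_old)                      # initial (offline) training

# Warm start: the previous coef_/intercept_ serve as the initial solution when
# the enlarged dataset is refit, analogous to Equation 2.6.
clf.fit(np.vstack([Phi_old, Phi_new]), np.concatenate([phi_old, phi_new]))

pred = clf.predict(Phi_new[:5])                # +1 = pressed, -1 = unpressed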

2.2.2 Music Transcription

For transcribing all classified pressed keys to symbolic notation (e.g., MIDI structure), information such as note name, octave number, and note onset/offset is required. The note names and their octave numbers corresponding to the pressed keys are determined from the keys localization step (Section 2.1.1). The onset (attack time) of each played note is the time (frame) t at which its corresponding key is classified as pressed, and the offset (release time) is the one at which the key is classified as unpressed. As soon as a pressed key is released, a new MIDI event (containing the required Note On and Note Off messages) is created from the obtained note information (i.e., name, octave number, and onset/offset).

The event is then added to a MIDI structure representing the transcription of the played music at the end of the performance.
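As a hedged illustration of this step, the snippet below writes (note, onset frame, offset frame) triples to a MIDI file using the mido library; mido is not the toolkit used in the thesis, and the tempo, velocity, and helper name are assumptions.

import mido

def notes_to_midi(notes, path="transcription.mid", fps=30, bpm=120, ticks_per_beat=480):
    """notes: list of (midi_note, onset_frame, offset_frame), sorted by onset."""
    tempo = mido.bpm2tempo(bpm)
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage("set_tempo", tempo=tempo))

    # Convert video frames to MIDI ticks and emit Note On / Note Off messages.
    ticks_per_frame = mido.second2tick(1.0 / fps, ticks_per_beat, tempo)
    events = sorted([(on, "note_on", n) for n, on, off in notes] +
                    [(off, "note_off", n) for n, on, off in notes])
    prev = 0
    for frame, kind, note in events:
        delta = int(round((frame - prev) * ticks_per_frame))
        track.append(mido.Message(kind, note=note, velocity=64, time=delta))
        prev = frame
    mid.save(path)

notes_to_midi([(60, 10, 25), (64, 30, 50)])   # e.g., C4 then E4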

2.3 Experimental Results

In this section, the results obtained from our experiments are discussed and the efficiency of the proposed method is analyzed and compared to the state of the art. The method proposed in this work was evaluated on a PC with an Intel Core i7-4470 CPU (3.40 GHz). All the videos and images used in this evaluation were captured using an HD 720p webcam with resolution and frame rate of 640×360 pixels and 30 FPS (frames per second). As shown in Figure 2.6, a stable stand was used to hold the camera at the top of the piano keyboard to capture a video of the pianist’s performance (similar to [6]). In order to implement the proposed method, the ConvNetSharp [73], Accord.NET [122], and LIBSVM [39] libraries were utilized.

Figure 2.6: The camera setup over the piano.

2.3.1 Dataset

Since there is no existing dataset for visual piano music transcription, we created a basic dataset including videos of piano performances on an 88-key digital piano1. This dataset consists of videos captured in different situations, as described below (Figure 2.7).

• Camera Positions: variations in camera positions lead to different rotated and an- gled views of the keyboard. We considered these variations (the angles between -45 and +45 degrees from vertical) in our dataset except the drastic perspective views because they may result in incorrect registration of all visible piano keys located on the piano keyboard. As a consequence, inappropriate feature vectors are produced.

• Lighting Conditions: a variety of illumination conditions was taken into account. We did not consider the situations where the images are very dark or very bright such that the black and white keys become unclear and difficult to be distinguished from each other.

• Pianists: we included the performances by 14 pianists with different playing styles, size of hands and fingers (e.g., male/female or kid/adult), and skin colours.

• Music: for each of the above-mentioned situations, two types of performances are provided: 1) each key is pressed once sequentially and 2) real pieces of piano music including different chords and melodies are played.

• Positive and Negative Examples: the dataset includes both positive and negative training samples for black and white keys. The negative examples are usually produced when the pianist’s hands and fingers cover or move over the keyboard without pressing any key (Figure 2.7).

We extracted n ≈ 600, 000 samples for the black keys and m ≈ 400, 000 samples for the white keys from 71 recorded videos with 70,540 frames in total.

2.3.2 Training

In a real piano performance, the number of unpressed keys over the entire keyboard is usually more than the number of pressed keys. Since the training samples in our dataset were extracted from real piano performances, the dataset contains more negative instances (unpressed keys) than positive ones (pressed keys), i.e., n^+/n^− < 1 and m^+/m^− < 1. In order to deal with this imbalance, the following settings are applied to the positive and negative class weights used in Equation 2.5: λ^− = 1, λ^+ = n^+/n^− for the black keys, and λ^+ = m^+/m^− for the white keys.

1The dataset can be downloaded from http://www.sfu.ca/~akbari/MTA/Dataset.

Figure 2.7: Sample video frames taken from four different videos in our dataset.

Each black key sample k_i^black, i ∈ [1, n], is fed forward to our pre-trained CNN network up to Layer 6, whose output is considered as the feature vector Φi of dimension d_v = 320.

The same process is performed to construct the feature vector Ψi of dimension d_w = 480 corresponding to the white key sample k_i^white, i ∈ [1, m], from the pre-trained CNN network. In each video frame, we used the MIDI events (transmitted from the digital piano keyboard to the computer using a MIDI cable) as the labels φi and ψi corresponding to each extracted sample feature vector. The same labels were used for training the CNNs. The SVM models formulated in Equations 2.3 and 2.4 were separately trained with the training sets {(Φ, φ)} and {(Ψ, ψ)} to be later used for the classification of the pressed black and white keys, respectively.

2.3.3 Analysis of Online Learning Method

In order to evaluate the computational complexity of the incremental learning approach described in Section 2.2, we divide the black and white datasets into 11 parts, where the first part is considered as the original data and the other 10 parts are considered as new data. In order to analyze the effectiveness of the warm start strategy, we choose different levels of data changes, from a small to a large increase of the data in the new subsets (the number of samples in each dataset is summarized in Figure 2.8). In this evaluation, we analyze and compare the offline and online learning fashions as described below.

In the offline mode, the model is first trained using the original dataset (OS). Then, the next new data subset is added and the model is retrained from scratch. In other words, no warm start setting is used in this mode, which means the solutions obtained from the previous training and optimization process are not utilized for the next training. This process continues until all subsets are merged and the final trained model is obtained. In the online learning approach, in which the warm start strategy is employed, the solution obtained in each training and optimization process is considered as the initial point for the training procedure in the next iteration. For example, the solution of training OS is used as the initial point for training OS plus the first new data subset (S1), the solution of training OS ∪ S1 is considered as the initial solution for training OS ∪ S1 ∪ S2, and so on.

Figure 2.8 illustrates the convergence times for training the black and white keys models using the offline (without the warm start strategy) and online (with the warm start strategy) learning approaches. Since no initial solution is available for training OS, only the training time for the offline mode is provided. From the results shown in Figure 2.8, it can be seen that the online learning approach generally provides a faster convergence for both the black and white models. For S1, S2, S4, and S5 in the black keys model and S1, S2, and S4 in the white keys model, the convergence in the online mode is almost 5 times faster because the new data statistics do not significantly change. On the other hand, the data that causes the model to change more (e.g., S3 in the white model) results in online training converging about twice as fast. On average, the results show that the warm start strategy used for incrementally learning the new data and updating the models formulated in Equations 2.3 and 2.4 results in convergence approximately three times faster than the offline method.

Figure 2.8: The comparison of offline and online (warm start strategy) training times for the black and white keys models formulated in Equations 2.3 and 2.4.

In order to evaluate the effectiveness of incremental learning of new data patterns in our classification problem, we set up another experiment in which different fingering styles, hand shapes and postures are taken into account. 7 recorded videos of different piano

performances (denoted by OT, T1, T2, T3, T4, T5, and T6) on a 60-key electronic keyboard are considered as our training sets 2. The properties of the videos are summarized below.

• OT: in this video, all black and white keys on the keyboard are sequentially played with the index fingers of the left and right hands such that only the middle part of the keys is pressed. This video also includes some negative examples produced by moving the hands over the keyboard while no key is pressed. We consider the samples extracted from this video as the original dataset.

• T1: the same style of playing as in OT is used in this video. However, only the tips of the keys are pressed.

• T2: various white-white-white triads (the chords that consist of three distinct white keys) such as C Major and D Minor are played in this video sample. These chords are played with both hands and in different octaves.

• T3: this video is composed of different black-black-black triads such as F-sharp Major and D-sharp Minor.

• T4: other types of triads such as C Minor, D Major, E-flat Major, etc. are included in this video performance.

• T5: this performance includes a variety of dyads (two-note chords) in which the two notes have the same pitch class, but fall into two subsequent octaves (e.g., C4 and C5). These two notes are played using the thumb and the little finger of left and right hands.

• T6: this video consists of a piece of music with some chords and melodies, which are played with different fingering and hand shapes and postures.

A single test video including all the variations described in the above 7 videos is used. The training process is first performed over the samples extracted from OT. Then, the resulting trained models are used to classify the pressed black and white keys in the test video. This process is iterated 7 times in which the T1 to T6 videos are sequentially added to the training dataset. The classification performance in each iteration is illustrated in Figure 2.9. As shown in the figure, adding data samples with new patterns of varying fingering and hand postures improves the classification performance for both black and white keys.

For example, once T3 is added, the F1 score related to the black keys increases from 0.535 to 0.689. It is because T3 provides the properties corresponding to the black-black-black triads, many of which appear in the test video.

2The videos can be downloaded from http://www.sfu.ca/~akbari/MTA/OnlineLearningExperiments.

Figure 2.9: The performance of incrementally learning data with new fingering and hand shapes and postures.

2.3.4 Frame-Level Classification of Pressed Keys

We consider 10 test videos including music performances with different ranges of speed, complexity, and camera positions (angles and rotations) 3. Four modes are used to evaluate the proposed method:

• CNN-SVM: the feature representations extracted from Layer 6 of the pre-trained CNNs are considered as the input to the SVM models (as described in Section 2.1). DBU is also applied in this setup.

• CNN: the CNN proposed in Section 2.1 is considered for both feature extraction and classification (using the last 2-way softmax layer). DBU is also applied in this setup.

• SVM: we use the gray-scale intensities in the bounding rectangles corresponding to the keys as our features and take them as the input to the SVMs. DBU is also applied in this setup.

• SVM (NoDBU): the same setup as in the third mode is used, but without applying DBU.

The classification results of the pressed black and white keys in terms of F1 score are shown in Table 2.2 and Figure 2.10. Since the offline and online learning methods described

3The test videos and the classification results can be downloaded from http://www.sfu.ca/~akbari/MTA.

in Section 2.2.1 provide the same classification performance, we only report the results of the offline mode in the table.

Table 2.2: The F1 score of pressed black and white keys classification. View: the camera view of the keyboard, which can be Rotated (R), Angled (A), or none (-). claVision: the results using the method in [6].

Video  View  CNN-SVM      CNN          SVM          SVM (NoDBU)  claVision [6]
              B     W      B     W      B     W      B     W      B     W
V1      RA    0.88  0.97   0.87  0.88   0.77  0.90   0.38  0.89   0.69  0.64
V2      A     0.91  0.95   0.89  0.83   0.81  0.74   0.32  0.63   0.63  0.59
V3      -     0.95  0.95   0.97  0.93   0.91  0.92   0.66  0.91   0.66  0.69
V4      -     0.97  0.97   0.92  0.94   0.76  0.93   0.64  0.92   0.45  0.74
V5      RA    0.96  0.98   0.92  0.94   0.90  0.91   0.90  0.90   0.63  0.76
V6      -     0.96  0.96   0.91  0.98   0.91  0.94   0.35  0.82   0.24  0.66
V7      A     0.97  0.91   0.95  0.88   0.95  0.86   0.54  0.75   0.64  0.58
V8      RA    0.97  0.93   0.94  0.87   0.83  0.84   0.70  0.71   0.69  0.67
V9      R     0.94  0.96   0.93  0.95   0.99  0.92   0.86  0.87   0.10  0.73
V10     -     0.94  0.97   0.97  0.92   0.91  0.87   0.73  0.75   0.28  0.64
Average:      0.94  0.96   0.93  0.91   0.87  0.88   0.61  0.74   0.50  0.67

As shown in Table 2.2, the average F1 scores for the classification of the pressed black and white keys using SVM (NoDBU) are 0.61 and 0.74, respectively. However, if the DBU (dynamic background update) method is used, the results are improved to 0.87 and 0.88 in the SVM setup, which demonstrates the effectiveness of DBU in adjusting to varying illumination during the video performance. This is also illustrated in Figure 2.10, in which claVision and SVM yield more fluctuating and unstable results due to the varying illumination conditions in different test videos. However, by applying DBU, the results become more stable, especially for CNN-SVM. From the results, it is seen that CNN outperforms SVM by ≈5%, which shows the effectiveness of using deep feature representations.

In Section 2.2.1, it was explained that the training samples in our dataset are unevenly distributed among the two classes (pressed and unpressed keys). As discussed in [58], when the dataset is unbalanced, CNNs usually perform better on the majority class than on the minority class. However, the classification ability on the minority class (pressed keys) is essential in our work. In order to address this problem, the CNN-SVM model is proposed, in which two different penalty parameters in each SVM formulation are considered to balance the dataset (Sections 2.2.1 and 2.3.2). Our experiments show that the hybrid CNN-SVM model (in which the SVM is trained using the features extracted from the 6th layer of the CNN) achieves the best performance, with an average F1 score of 0.95.

Compared to the F1 scores of 0.50 and 0.67 obtained from the pressed keys detection method used in [6], the classification-based approach proposed in this work provides a better performance. In [6], it was discussed that the average F1 score of the pressed keys detection

in the ideal situation (30-45 degrees from vertical) is 0.974. However, as summarized in Table 2.2, it does not have a good performance in non-ideal situations (e.g., when the camera angle is around 0 and the pressed keys are located right under the camera). Another difficulty in [6] was the changing illumination, which made the results noisy and less accurate. In particular, when the pianist’s hands cover a large portion of the keyboard, the method in [6] resulted in many false positives (incorrect detection of the hands as pressed keys) and false negatives (pressed keys covered by the hands cannot be seen and detected).

Figure 2.10: Black and white pressed keys classification results summarized in Table 2.2.

Table 2.3: The maximum and average processing times of different methods.

Method         Maximum (ms)  Average (ms)
CNN-SVM        51.0          21.0
SVM            25.0          7.5
SVM (NoDBU)    25.0          7.5
claVision [6]  23.0          6.6

In the training dataset described in Section 2.3.1, we included different camera positions (especially the challenging angles around 0 degrees), illumination conditions, and also a variety of negative examples including partially covered keyboards. Having all these variations used for training the models in Equations 2.3 and 2.4 resulted in a more accurate classification of the pressed and unpressed keys compared to the approach in [6].

The maximum and average processing times of the proposed method in different modes are given in Table 2.3. The claVision method in [6], with a maximum of 23.0 ms and an average of 6.6 ms, has the lowest processing times compared to the other methods. SVM, however, results in a 0.9 ms higher average latency than claVision, which is due to the extra time required for the classification of the pressed keys. By employing the CNN-based feature extractor, the maximum and average processing times increase to 51.0 ms and 21.0 ms, respectively. The average time required to forward each data sample (which corresponds to the image vector of one black or white key) to the CNNs up to Layer 6 and extract the features is 1.4 ms. CNN-SVM performs 13.5 ms slower than SVM on average, which is due to the extra forward times required for about 10 samples per video frame on average. The computational complexity of keyboard detection, keys localization, and DBU did not affect the real-time processing because they are performed in separate threads.

In order to analyze the performance of using more filters in the convolutional layers of our CNNs (Section 2.1.3), we ran a new experiment with 32 and 48 filters in Layers 1 and 4, respectively. The classification results obtained by using CNN-SVM with the 8/16 and 32/48 choices are compared in Figure 2.11. As illustrated in the figure, both the 8/16 and 32/48 choices result in almost the same classification performance on average. The average black keys F1 score is 0.945 for 8/16 and 0.943 for 32/48. For the white keys, the F1 scores are 0.955 and 0.962, respectively. Increasing the number of filters in the CNN results in a higher forward time as well as a higher SVM classification time, due to the increase in the size of the feature vectors extracted from the CNNs with more filters (i.e., d_v = 960 and d_w = 1440). The forward times of the CNN^black and CNN^white networks with 32 and 48 filters are 11.0 ms and 18.0 ms, respectively. The average processing time of CNN-SVM with the choice of 32 and 48 filters is 145.12 ms, which is about 7 times slower than 8/16 with 21.0 ms. As a result, our choice of 8 and 16 filters is shown to give the best trade-off for this work, especially for real-time processing.

Figure 2.11: The comparison of CNN-SVM classification results with 8 & 16 filters and 32 & 48 filters for the first and second convolution layers. Top: results for black keys; Bottom: results for white keys.

Utilizing GPUs can significantly reduce the computational complexity and processing time of the proposed method, especially CNN-based feature extraction. However, we chose to run our algorithms on a PC with a regular CPU to reduce the cost of the system and make it more accessible to the users.

Chapter 3

Semi-Recurrent CNN-based VAE-GAN for Sequential Data: Piano Music Generation

In this chapter, a semi-recurrent hybrid VAE-GAN model for generating temporal data such as music is introduced. The model is composed of three main units: the encoder (E), the generator/decoder (G), and the discriminator (D). In this work, the VAE decoder and the GAN generator are collapsed into one model by letting them share parameters and training them jointly. To the best of our knowledge, this is the first hybrid VAE-GAN framework introduced for generating sequential data. The feasibility of this model is investigated and evaluated on piano music generation, which shows that the proposed framework is a viable way of training networks that model music, and has potential for modelling many other types of sequential data such as videos.

The outline of this chapter is as follows. In Sections 3.1 and 3.2, a brief background of the deep generative VAE and GAN models is given. In Section 3.3, the proposed VAE-GAN is formulated and described. The feasibility of this model is then evaluated on piano music generation in Section 3.4.

3.1 Variational Auto-Encoder (VAE)

In recent years, deep generative models have achieved significant success, especially in generating natural images [59, 75, 102, 107, 110], texts, and sounds. In these models, complex structures of the data can be learned using deep architectures with multiple layers. VAEs [75, 107] and GANs [59, 102, 110] are two powerful frameworks for learning deep generative models in an unsupervised manner. Auto-encoders are basically used for dimensionality reduction, which is the process of reducing the number of features representing the data. The general idea in auto-encoders is

to set up an encoder and a decoder using neural networks and then learn the best encoding-decoding scheme that minimizes the reconstruction error using an optimization process. The encoder encodes the data into a low-dimensional representation (also called latents), and the decoder tries to reconstruct the input data from the latents. The overall architecture of an auto-encoder is shown in Figure 3.1.

Figure 3.1: The overall architecture of an auto-encoder. x: input data; z: latent representa- tion; xˆ: reconstructed data.

VAE is indeed an auto-encoder in which the distribution of the encodings (latent representations) is regularized to ensure that the latent space has the ability to generate new data. In regular auto-encoders, the input is directly encoded as a single point. However, in VAE, the input is first encoded as a distribution over the latent space, and then a data point in the latent space is sampled from this distribution. Similar to auto-encoders, a VAE consists of an encoder and a decoder [75]. The encoder encodes a data sample x as a latent (hidden) distribution from which the latent representation z in the latent space is sampled as follows:

z ∼ q(z|x),  (3.1)

where q(z|x) denotes the probability distribution over the latent space. The decoder, however, decodes the latent representation back to the data space to reconstruct the input as follows:

x̂ ∼ p(x|z),  (3.2)

where x̂ is the reconstructed output, and p(x|z) is the probability distribution of the data in the data space. The VAE regularizes the encoder by imposing a prior over the latent distribution p(z), where z ∼ N(0, I). The loss function of the VAE is the expected log likelihood with a regularizer:

L_VAE = L_rec + L_prior,  (3.3)

with

L_rec = −E_{q(z|x)}[log p(x|z)]  (3.4)

L_prior = KL(q(z|x) ‖ p(z)),  (3.5)

where L_rec, L_prior, and KL are the reconstruction loss, a prior regularization term, and the Kullback-Leibler (KL) divergence, respectively. The regularizer measures how close q is to p, or in other words, how much information is lost when using q to represent p. If the regularizer were not included in the VAE loss, the encoder could learn to cheat by giving each datapoint a totally separate representation in a different region of Euclidean space. The KL regularizer helps keep the representations of similar datapoints (e.g., images of cats) sufficiently close.

Figure 3.2: The overall architecture of VAE. x: input data; q(z|x): latent space distribution; z: sampled latent representation; p(x|z): data distribution; xˆ: reconstructed data.
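For concreteness, a standard PyTorch sketch of the VAE objective in Equations 3.3-3.5 is shown below, using the closed-form KL term for a diagonal Gaussian posterior and a Bernoulli reconstruction likelihood; this is a generic illustration, not the training code of this chapter.

import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """L_VAE = L_rec + L_prior for q(z|x) = N(mu, diag(exp(logvar))) and p(z) = N(0, I)."""
    # Reconstruction term (Eq. 3.4): negative expected log-likelihood of the data.
    l_rec = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Prior term (Eq. 3.5): KL(q(z|x) || N(0, I)) in closed form.
    l_prior = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return l_rec + l_prior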

VAEs are generally easy to implement and train, and are also robust to the choice of hyperparameters. However, the generated results have low quality (e.g., blurry images) due to the imperfect reconstruction error functions being minimized. Another important reason for the low-quality results of VAEs is that variational inference optimizes a lower bound on the likelihood rather than the actual likelihood (i.e., the true probability distribution).

3.2 Generative Adversarial Network (GAN)

Another popular generative model is the GAN, which has shown significant success in synthesizing realistic images. This model is composed of two networks named the discriminator and the generator, which are trained at the same time [59]. The generator model G(z) captures the data distribution by mapping the latent z to the data space, while the discriminator model D(x) ∈ [0, 1] estimates the probability that x is a real training sample rather than a fake sample synthesized by G. These two models compete in a two-player minimax game in which the objective is to find a binary classifier D that discriminates the real data from the fake (generated) ones, while simultaneously encouraging G to fit the true data distribution. In other words, D is trained to maximize the probability of assigning the correct label to the real and fake data, while G is simultaneously trained to minimize log(1 − D(x̃)), where x̃ = G(z). This goal is achieved by the following objective function:

L_GAN = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(x̃))],  (3.6)

where G tries to minimize this objective against D, which tries to maximize it. The overall scheme of GAN is shown in Figure 3.3.

Figure 3.3: The overall architecture of GAN. x: real data; x̃: fake data synthesized by the generator; z: the input latent vector.
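A minimal PyTorch sketch of the two-player objective in Equation 3.6 is given below; it assumes D outputs a probability in [0, 1] and uses the common non-saturating generator update (minimizing −log D(G(z))) instead of log(1 − D(G(z))), so it is an illustrative variant rather than the exact formulation.

import torch
import torch.nn.functional as F

def gan_step_losses(D, G, x_real, z):
    """Returns (discriminator loss, generator loss) for one batch."""
    x_fake = G(z)
    d_real = D(x_real)
    d_fake = D(x_fake.detach())          # detach: do not backprop into G here
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)

    # D maximizes log D(x) + log(1 - D(G(z))), written here as a minimization.
    d_loss = F.binary_cross_entropy(d_real, ones) + F.binary_cross_entropy(d_fake, zeros)
    # Non-saturating generator objective: minimize -log D(G(z)).
    g_loss = F.binary_cross_entropy(D(x_fake), torch.ones_like(d_fake))
    return d_loss, g_loss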

Although GANs are powerful generative models, they suffer from training instability. This model is very sensitive to the selected hyperparameters and also to the imbalance between the discriminator and the generator, which can result in the oscillation of parameters and non-convergence. One of the hardest problems to solve in GANs is mode collapse, which happens when the generator generates a limited diversity of samples or even the same samples, regardless of the input. In other words, mode collapse occurs when the optimizer forces the generator to move towards a single ideal point. One solution to handle this problem is to reset training when the generator has collapsed into one mode and cannot get out.

Different approaches have been proposed to improve GANs. Wasserstein GAN (WGAN) [14] used the Wasserstein distance as an objective for training GANs to improve the stability of learning. Laplacian GANs (LAPGANs) [50] achieved coarse-to-fine conditional generation through Laplacian pyramids. Deep convolutional GAN (DCGAN) [102] was proposed as an effective and stable architecture for the discriminator and the generator using deeper CNNs to achieve significant image synthesis results. GANs generate samples with higher quality, but they suffer from training instability. On the other hand, VAEs are generally easy to train, but the generated results have low quality due to imperfect measures such as the squared error. In order to take the advantages of both VAE and GAN into account, different hybrid VAE-GAN models have been proposed [78, 88], which improve the training process and the quality of the samples generated. In these models, the VAE decoder and the GAN generator are collapsed into one model.

In order to perform conditional generation, we can train a conditional VAE (CVAE) [120] or conditional GAN (CGAN) [91], in which the probabilistic mapping problem is handled. One example is the text-to-image synthesis method in [106], which generates high-quality images conditioned on some input text description. The conditional VAE-GAN (CVAE-GAN) model in [20] is another example, which combines VAE with GAN under a conditioned generative process to synthesize new images of a specified category. The structures of different VAE and GAN combinations are illustrated in Figure 3.4.

Figure 3.4: The structure of different VAE, GAN, and their hybrid combinations [20] including VAE-GAN, CVAE, CGAN, and CVAE-GAN. x: input image; E: encoder; z: latent vector; G: generator (decoder); x′: generated (decoded) image; D: discriminator; y: discriminator's binary (real/fake) output; c: some condition (attribute or class label); C: classifier.

In the following section, the proposed VAE-GAN model for sequential data generation is formulated and described.

3.3 Formulation and Objective

Considering any sequential data generation problem such as music generation as a problem of generating a sequence of discrete frames, two sub-problems need to be addressed: the strong spatial correlation of the data within each frame, and the dependencies between the frames (temporal correlation). In this work, both of these problems are efficiently addressed. Figures 3.5 and 3.6 illustrate the overall training and testing frameworks proposed in this work.

In order to maintain the strong local correlation of the data in each generated frame, we use CNNs in the three networks used in this work. Convolutions are rarely used in modelling signals with invariance in time such as music, but they have been very successful in models whose data has strong local spatial correlation, such as images, which is also important for sequential data. In this work, we consider the input time-dependent data as a sequence of individual frames, which have internal spatial correlation. Thus, we exploit CNNs for the separate generation of each of these frames.

In order to keep the dependencies across the frames, for each pair of sequential frames, the previous frame is encoded to its corresponding latent representation using the encoder E. Next, the generator (decoder) G tries to generate (predict) the subsequent frame from the latent distribution of the previous frame. As a result, the history and the information from previous frames are incorporated for generating the next ones, and the dependencies across the frames are preserved. The current real training frame in each pair and the synthesized frame are then forwarded to D as real and fake data, respectively.

Figure 3.5: The training framework of the proposed semi-recurrent hybrid VAE-GAN model. E: encoder; G: generator (decoder); D: discriminator; x^{t−1}: data frame at time t−1; x^t: data frame at time t; z_p^{t−1}: the latent vector sampled from the standard normal distribution; z^{t−1}: the latent vector; µ and σ: the mean and standard deviation of the latent vector; x̃_p^t: generated data frame from the latent z_p^{t−1}; x̃^t: generated data frame from the latent z^{t−1}.

Let X = {x^0, ..., x^{t−1}, x^t, ..., x^n} be a sequence from the training data with n + 1 frames. The encoder network maps a training frame x^{t−1} (the previous data frame) to the mean µ and the standard deviation σ of the latent vector:

{µ, σ} = E(x^{t−1}),  (3.7)

where the latent vector z^{t−1} is sampled as follows:

z^{t−1} ∼ N(µ, σ).  (3.8)

As in any VAE, one problem in implementing our VAE is taking the derivatives of the random variable z^{t−1}. The sampling process expressed in Equation 3.8 does not allow the error to be backpropagated through the network. One simple solution is to use the reparameterization trick to make gradient descent possible despite the random sampling happening halfway through the architecture. Assuming the random variable z^{t−1} follows a Gaussian distribution with µ and σ, the sampling can be done as follows:

z^{t−1} = µ + σ ⊙ z_p^{t−1},  (3.9)

where z_p^{t−1} ∼ N(0, I) and ⊙ denotes element-wise multiplication. In order to reduce the gap between the prior p(z^{t−1}) and the encoder's distribution q(z^{t−1}|x^{t−1}) and measure how much information is lost, the KL loss is used:

L_prior = KL(q(z^{t−1}|x^{t−1}) ‖ p(z^{t−1})).  (3.10)

The network G then generates two frames x̃^t and x̃_p^t by decoding the latent representations z^{t−1} (sampled using E) and z_p^{t−1} (sampled from a normal distribution) back to the data space, respectively:

x̃^t = G(z^{t−1}),  (3.11)

and

x̃_p^t = G(z_p^{t−1}).  (3.12)

Element-wise reconstruction errors are generally inadequate for signals with invariances [78]. As a result, in order to measure the quality of the reconstructed samples in this work, the following pair-wise feature matching loss between the real data x^t and the synthesized data x̃^t and x̃_p^t is utilized:

L_l = ½ ‖D_l(x^t) − D_l(x̃^t)‖_2^2 + ½ ‖D_l(x^t) − D_l(x̃_p^t)‖_2^2,  (3.13)

where D_l denotes the features (hidden representations) of an intermediate layer of the network D. Thus, the loss of network E is calculated as:

LE = Ll + Lprior. (3.14)

In order to distinguish the real training data x^t from the synthesized frames x̃^t and x̃_p^t, the following loss function is minimized by D:

L_D = −(log D(x^t) + log(1 − D(x̃^t)) + log(1 − D(x̃_p^t))),  (3.15)

while G tries to fool D by minimizing

L_G = −(log D(x̃^t) + log D(x̃_p^t)) + L_l,  (3.16)

where L_l is the pair-wise feature matching loss (Equation 3.13), which is a shared error signal between E and G. Finally, our goal is to minimize the following hybrid loss function:

L = LE + LD + LG. (3.17)
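To summarize the formulation, the sketch below assembles the losses of Equations 3.9-3.17 for one pair of consecutive frames; the assumption that E returns (µ, σ) and that D returns both its probability output and the intermediate features D_l, as well as all names, are illustrative, and graph/optimizer handling (e.g., detaching generator outputs when updating D) is omitted for brevity.

import torch

def semi_recurrent_vae_gan_losses(E, G, D, x_prev, x_curr):
    """One-step sketch of L_E, L_D, and L_G (Eqs. 3.14-3.16); L = L_E + L_D + L_G."""
    mu, sigma = E(x_prev)                                   # Eq. (3.7)
    z_p = torch.randn_like(mu)
    z = mu + sigma * z_p                                    # Eq. (3.9), reparameterization
    x_tilde, x_tilde_p = G(z), G(z_p)                       # Eqs. (3.11)-(3.12)

    # KL between q(z|x^{t-1}) and the standard normal prior (Eq. 3.10).
    l_prior = -0.5 * torch.sum(1 + 2 * torch.log(sigma) - mu.pow(2) - sigma.pow(2))

    d_real, f_real = D(x_curr)
    d_fake, f_fake = D(x_tilde)
    d_fake_p, f_fake_p = D(x_tilde_p)

    # Pair-wise feature matching loss (Eq. 3.13).
    l_feat = 0.5 * (f_real - f_fake).pow(2).sum() + 0.5 * (f_real - f_fake_p).pow(2).sum()

    l_E = l_feat + l_prior                                                             # Eq. (3.14)
    l_D = -(torch.log(d_real) + torch.log(1 - d_fake) + torch.log(1 - d_fake_p)).sum() # Eq. (3.15)
    l_G = -(torch.log(d_fake) + torch.log(d_fake_p)).sum() + l_feat                    # Eq. (3.16)
    return l_E, l_D, l_G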

3.4 Experiments: Piano Music Generation

The results obtained from our experiments are discussed and the efficiency of the proposed method is analyzed, evaluated, and compared to the state of the art in this section. In order to evaluate the performance of the proposed approach in generating sequential data, we applied it to piano music generation 1.

3.4.1 Setup

In this experiment, we used the Nottingham dataset 2 as our training data, which contains 695 pieces of folk piano music in MIDI file format. Each MIDI file was divided into separate bars, and a bar is represented by a real-valued 2-D matrix x ∈ [0, 1]^{h×w}, where h and w represent the number of MIDI notes/pitches (i.e., h = 88 in this work) and the number of time steps (i.e., w = 16 with a sampling interval of 0.125 sec), respectively. The value of each element of the matrix is the velocity (volume) of a note at a certain time step. The sequence of n + 1 bars is denoted by X = {x^0, ..., x^{t−1}, x^t, ..., x^n}, where x^{t−1} and x^t are two consecutive bars.

The details of the networks E, G, and D are summarized in Table 3.1. The output layer of E is a fully-connected layer with 256 hidden units, where its first and second 128 units are respectively considered as the mean µ and standard deviation σ used for sampling the latent z^{t−1} of dimension 128 (Equations 3.7 and 3.9). The latents z^{t−1} and z_p^{t−1} (of dimension 128) are provided to G to output the synthesized bars x̃^t, x̃_p^t ∈ [0, 1]^{88×16}. Before the Tanh layer of G, another convolution is applied to map to the number of output channels (that is 1 in this work).

1The source code: https://github.com/makbari7/SR-CNN-VAE-GAN

2http://www.iro.umontreal.ca/~lisa/deep/data

Table 3.1: The network architecture of the encoder, generator, and discriminator. conv: convolution layer; t-conv: transposed-convolution layer; AF: the activation functions used after each convolution/transposed-convolution layer; In and Out: the input and the output of each network.

Network        Layers (filters)                         Filter         AF         In                  Out
Encoder        conv (8, 16, 32), Fully-connected layer  5×5, stride=2  ELU        x^{t−1}             {µ, σ}
Generator      t-conv (64, 32, 16, 8), Tanh layer       3×3, stride=1  ReLU       z^{t−1}, z_p^{t−1}  x̃^t, x̃_p^t
Discriminator  conv (8, 16, 32, 64), Sigmoid layer      3×3, stride=1  LeakyReLU  x^t, x̃^t            0 or 1

An extra convolution is also applied before the Sigmoid layer of D to represent the output by a 1-D feature map, which is used as D_l for calculating the pair-wise feature matching loss (Equation 3.13). This network takes the 2-D matrices x^t and x̃^t as inputs and predicts whether they are real or generated (fake) MIDI bars.

In the encoder network, three convolutional layers (with 8, 16, and 32 filters of size 5 × 5 and stride 2) followed by the ELU (exponential linear unit) [44] activation function are used. The output layer is a fully-connected layer with 256 hidden units, where its first and second 128 units are respectively considered as the mean µ and standard deviation σ used for representing the latent z.

In the generator network, a series of four transposed convolutions (also called fractionally-strided convolutions) with 64, 32, 16, and 8 filters of size 3 × 3 and stride 1 is used, with the ReLU activation function. After the last layer, another convolution is applied to map to the number of output channels (that is 1 in this work), followed by a Tanh function. A uniform distribution z^{t−1} of dimension 128 is given to this network, which outputs the synthesized bar x̃^t ∈ [0, 1]^{h×w}.

The discriminator network is composed of four convolutional layers (with 8, 16, 32, and 64 filters of size 3×3 and stride 1) in which the LeakyReLU (leaky rectified linear unit) activation function is utilized. An extra convolution is finally applied to represent the output by a 1-D vector, which is followed by a Sigmoid function. This network takes the 2-D matrices x^t and x̃^t as inputs and predicts whether they are real or generated MIDI bars.

Inspired by DCGAN [102], batch normalization was applied to all layers in all three models and no pooling layer was used. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 64. The Adam optimizer with a momentum of 0.5 and a learning rate of 0.0005 for the encoder and the generator, and 0.0001 for the discriminator, was used. In order to keep the losses corresponding to the encoder, generator, and discriminator balanced in each iteration, we train the encoder and generator twice and the discriminator once.
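As a concrete illustration of the bar representation used above (an 88×16 matrix of normalized velocities with 0.125-second time steps), the following NumPy sketch builds one bar from note tuples; the pitch-to-row mapping and the helper name are assumptions.

import numpy as np

def bar_to_pianoroll(notes, h=88, w=16, step=0.125, max_velocity=127):
    """notes: iterable of (midi_pitch, onset_sec, offset_sec, velocity) within one bar."""
    x = np.zeros((h, w), dtype=np.float32)
    for pitch, onset, offset, velocity in notes:
        row = pitch - 21                         # map MIDI pitches 21-108 to rows 0-87
        t0, t1 = int(onset / step), int(np.ceil(offset / step))
        x[row, t0:min(t1, w)] = velocity / max_velocity
    return x

bar = bar_to_pianoroll([(60, 0.0, 0.5, 90), (64, 0.5, 1.0, 70)])  # C4 then E4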

Figure 3.6: The testing schemes of the proposed semi-recurrent hybrid VAE-GAN model. E: encoder; G: generator (decoder); x^0: first input data frame; z_p^0: the input latent vector sampled from the standard normal distribution; z^{t−1}: the latent vector; x̃^0: first generated data frame; x̃^{t−1} and x̃^t: consecutive generated data frames.

Two testing models illustrated in Figure 3.6 are proposed to sequentially generate music with an arbitrary number of bars. In model 1 (top model in Figure 3.6), the input to E, denoted by x0, is a bar randomly selected from some real data samples, which is considered as the first bar of the generated music. x0 is then mapped to the latent z1 using E. G synthesizes the next bar x˜1 by decoding z1 back to the data space. By feeding the generated bar x˜1 to E, this process is repeatedly performed to generate a sequence of bars. In model 2 (bottom model in Figure 3.6), the same recurrent process is applied, but the first bar is also a bar synthesized using G from zp. Three sample pieces of music generated using model 1 (top model in Figure 3.6) are illustrated in Figure 3.7.

3.4.2 Results

In order to evaluate the music samples generated using our approach, the following measurements were taken into account [92]:

• Scale consistency: the percentage for the best matching musical scale that a sample is part of,

• Uniqueness: the percentage of unique tones used in a sample,

• Velocity span: the velocity range in which the tones are played,

Figure 3.7: Three sample pieces of music generated using the proposed testing model 1 (top model in Figure 3.6). All samples are composed of five bars (frames).

• Recurrence: repetitions of short subsequences of length 2 in a sample,

• Tone span: the number of half-tone steps between the lowest and the highest tones in a sample, and

• Diversity: the average pairwise Levenshtein edit distance [67] of the generated data.

Figure 3.8 shows the results of evaluating 300 generated pieces of music of length 10 seconds (i.e., 5 two-second bars).

Figure 3.8: Measurements used for evaluating 300 generated music samples. The measures include scale consistency, intensity span, uniqueness, tone span, and recurrence.

As seen in Figure 3.8, the scale consistency (with an average of ≈ 87%) shows that the generated music significantly follows the standard musical scales in all samples, which outperforms C-RNN-GAN [92] with an average of ≈ 75%. A variety of velocities exist in the music generated, which is illustrated by the oscillating velocity span. The average percentage

of the unique tones used in the generated pieces is ≈ 37%. Compared to the velocity span, less variability is seen in the tone span (with a minimum and maximum of 10 and 21) of the generated music due to the low tone span in the training samples (the majority of the music in the dataset is played in 1 or 2 octaves). The number of 2-tone repetitions is ≈ 7 on average. Diversity is another metric we took into account to evaluate how realistic the generated music sounds. Compared to ORGAN [64] with an average of 0.551, a higher diversity with an average of ≈ 0.59 was achieved in this work.

The auto-encoder framework makes it a perfect candidate for application in image compression. In the next three chapters, three different auto-encoder-based approaches for learned image compression will be introduced and discussed.

Chapter 4

Deep Semantic Segmentation-based Layered Image Compression

One advantage of deep learning is that it can extract a much more accurate semantic segmentation map from a given image than traditional methods [149]. Recently, it was further shown that deep learning can even synthesize a high-quality and high-resolution image using only a semantic segmentation map as input [137], thanks to GAN [59]. This suggests the possibility of developing efficient image compression using deep learning-based semantic segmentation and the associated image synthesis. The idea of semantic segmentation-based compression was already studied in MPEG-4 object-based video coding in the 1990s [125]. However, due to the lack of high-quality and fast segmentation methods, object-based image/video coding has not been widely adopted. Thanks to the rapid development of deep learning algorithms and hardware, it is now time to revisit this approach. For example, in [2], the semantic segmentation map was utilized for image compression. GAN was used for training the entire framework. The scheme in [2] employed synthesized images for non-important regions to target extremely low bit rates (< 0.1 bits/pixel).

In this chapter, we propose a novel semantic segmentation-based layered image compression approach (DSSLIC). The overall framework of the proposed codec is shown in Figure 4.1. The encoder includes three deep neural networks: SegNet, CompNet, and FineNet. The semantic segmentation map (denoted by s) of the input image (denoted by x) is first obtained using SegNet. In this work, the PSPNet in [149] is used as SegNet. The segmentation map is encoded to serve as side information provided to CompNet for generating a low-dimensional version (denoted by c) of the original image. Both the semantic map s and the low-dimensional image c are losslessly encoded using the FLIF codec [119], which is the state-of-the-art lossless image codec. As shown in Figure 4.1, given the segmentation map and the compact image, the RecNet module in our framework tries to obtain a high-quality reconstruction of the input image.

Figure 4.1: The overall framework of the proposed deep semantic segmentation-based layered image compression codec. x: input original image; s: semantic segmentation map; c: compact image (low-dimensional version of the input image); c′: up-sampled compact image; f: fine information image; x′: first reconstruction of the original image; r: the residual image between x and x′; r′: decoded residual image; x̃: final reconstruction of the image.

Inside the RecNet, the low-dimensional image (compact image) c is first up-sampled, which, together with the segmentation map, is fed into a FineNet. Note that although GAN-based synthesized images from segmentation maps are visually appealing, their details can be quite different from the original images. To minimize the distortion of the synthesized images, we modify the existing segmentation-based synthesis framework in [137] and add the up-sampled version of the compact image c as an additional input. Besides, FineNet is trained to learn the missing fine information of the up-sampled version of c with respect to the input image. This makes it easier to control the output of the GAN network. After adding the up-sampled version of c and the FineNet's output (denoted by f), we get a better estimate of the input. The residual r between the input and the estimate is then obtained and encoded by a lossy codec. In this work, the H.265/HEVC intra coding-based BPG codec is used [23]. In our scheme, the segmentation map s serves as the base layer, and the compact image c and the residual r are respectively the first and second enhancement layers.

At the decoder side, the segmentation map and compact representation are decoded to be used by RecNet to get an estimate of the input image. The output of RecNet is then added to the decoded residual image to get the final reconstruction of the image x̃. In order to deal with negative values, the residual image r is shifted by 255 and then divided by 2 before encoding. This mapping is inverted in the decoder. The pseudo code of the encoding and decoding procedures is given in Algorithm 1.

Algorithm 1 The pseudo code of the encoding and decoding procedure.

procedure Encode(x)
    s ← SegNet(x)              ▷ encode s (base layer)
    c ← CompNet(x, s)          ▷ encode c (1st enhancement layer)
    x′ ← RecNet(s, c)
    r ← ½[(x − x′) + 255]      ▷ encode r (2nd enhancement layer)

procedure Decode(s, c, r′)
    x′ ← RecNet(s, c)
    x̃ ← x′ + (2r′ − 255)

function RecNet(s, c)
    c′ ← upsample(c)
    f ← FineNet(s, c′)
    x′ ← c′ + f
    return x′
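A hedged Python sketch of Algorithm 1 is shown below; seg_net, comp_net, and rec_net are assumed to be callables wrapping the trained networks, and the actual entropy coding of s and c (FLIF) and of r (BPG) is omitted.

import numpy as np

def dsslic_encode(x, seg_net, comp_net, rec_net):
    s = seg_net(x)                                   # base layer (losslessly coded)
    c = comp_net(x, s)                               # 1st enhancement layer (losslessly coded)
    x_prime = rec_net(s, c)
    r = ((x.astype(np.int16) - x_prime) + 255) // 2  # shifted residual, 2nd enhancement layer
    return s, c, r

def dsslic_decode(s, c, r, rec_net):
    x_prime = rec_net(s, c)
    x_tilde = x_prime + (2 * r.astype(np.int16) - 255)
    return np.clip(x_tilde, 0, 255).astype(np.uint8)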

The rest of this chapter is organized as follows. In Section 4.1, the architectures of the deep networks used in the codec are presented. The loss functions used for training the model are then formulated and explained in Section 4.2, which is followed by the training

procedure given in Section 4.3. Finally, in Section 4.4, the performance of the proposed method is evaluated and compared with the JPEG, JPEG2000, WebP, and BPG codecs.

4.1 Network Architecture

The architectures of the CompNet (proposed in this work) and FineNet (modified from [137]) networks are defined as follows:

• CompNet: c64, d128, d256, d512, c3, tanh, and

• FineNet: c64, d128, d256, d512, 9 × r512, u256, u128, u64, c3, tanh, where

• ck: 7×7 convolution layers (with k filters and stride 1) followed by instance normalization and ReLU,

• dk: 3×3 convolution layers (with k filters and stride 1) followed by instance normalization and ReLU,

• rk: a residual block containing reflection padding and two 3×3 convolution layers (with k filters) followed by instance normalization, and

• uk: 3×3 fractional-strided convolution layers (with k filters and stride 1/2) followed by instance normalization and ReLU.

Inspired by [137], for the adversarial training of the proposed model, two discriminators denoted by D1 and D2 operating at two different image scales are used in this work. D1 operates at the original scale and has a more global view of the image. Thus, the generator can be guided to synthesize fine details in the image. On the other hand, D2 operates with 2× down-sampled images, leading to coarse information in the synthesized image. Both discriminators have the following architecture:

• C64,C128,C256,C512, where Ck denotes 4×4 convolution layers with k filters and stride 2 followed by instance normalization and LeakyReLU. In order to produce a 1-D output, a convolution layer with 1 filter is utilized after the last layer of the discriminator.

4.2 Formulation and Objective Functions

Let x ∈ R^{h×w×k} be the original image; the corresponding semantic segmentation map s ∈ Z^{h×w} and the compact representation c ∈ R^{(h/α)×(w/α)×k} are generated as follows:

s = SegNet(x), (4.1)

48 and c = CompNet(s, x). (4.2)

Conditioned on s and the upscaled compact image (denoted by c′ ∈ R^{h×w×k}), FineNet (our GAN generator) reconstructs the fine information image (denoted by f ∈ R^{h×w×k}), which is then added to c′ to get the first estimate of the input:

x′ = c′ + f,  (4.3)

where

f = FineNet(s, c′).  (4.4)

The error between x and x′ is measured using a combination of different losses, including the L_1, L_SSIM, L_DIS, L_VGG, and GAN losses. The L1-norm loss (least absolute errors) is defined as:

L_1 = 2λ‖x − x′‖_1.  (4.5)

It has been shown that combining pixel-wise losses such as L1 with SSIM loss can significantly improve the perceptual quality of the reconstructed images [148]. As a result, we also utilize the SSIM loss denoted by LSSIM in our work, which is defined as:

L_SSIM = −I(x, x′) · C(x, x′) · S(x, x′),  (4.6)

where the three comparison functions, luminance I, contrast C, and structure S, are computed as:

I(x, x′) = (2 µ_x µ_{x′} + C_1) / (µ_x² + µ_{x′}² + C_1),
C(x, x′) = (2 σ_x σ_{x′} + C_2) / (σ_x² + σ_{x′}² + C_2),      (4.7)
S(x, x′) = (σ_{xx′} + C_3) / (σ_x σ_{x′} + C_3),

where µ_x and µ_{x′} are the means of x and x′, σ_x and σ_{x′} are the standard deviations, and σ_{xx′} is the correlation coefficient. C_1, C_2, and C_3 are the constants used for numerical stability. To stabilize the training of the generator and produce natural statistics, two perceptual feature-matching losses based on the discriminator and VGG networks [118] are employed. The discriminator-based loss is calculated as:

L_DIS = λ Σ_{d=1,2} Σ_{i=1}^{n} (1/N_i) ‖D_d^{(i)}(s, c′, x) − D_d^{(i)}(s, c′, x′)‖_1,  (4.8)

where D_d^{(i)} denotes the features extracted from the i-th intermediate layer of the discriminator network D_d (with n layers and N_i elements in each layer). Similar to [111], a pre-trained VGG network with m layers and M_j elements in each layer is used to construct the VGG perceptual loss as follows:

L_VGG = λ Σ_{j=1}^{m} (1/M_j) ‖V^{(j)}(x) − V^{(j)}(x′)‖_1,  (4.9)

where V^{(j)} represents the features extracted from the j-th layer of VGG. In order to distinguish the real training image x from the reconstructed image x′, given s and c′, the following objective function is minimized by the discriminator D_d:

L_D = − Σ_{d=1,2} (log D_d(s, c′, x) + log(1 − D_d(s, c′, x′))),  (4.10)

while the generator (denoted by FineNet in this work) tries to fool D_d by minimizing −Σ_{d=1,2} log D_d(s, c′, x′). The final generator loss, including all the reconstruction and perceptual losses, is then defined as:

L_G = − Σ_{d=1,2} log D_d(s, c′, x′) + L_1 + L_SSIM + L_DIS + L_VGG.  (4.11)

Finally, our goal is to minimize the following hybrid loss function:

L = LD + LG. (4.12)
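As one example of the perceptual terms above, the following is a sketch of the VGG feature-matching loss in Equation 4.9 using torchvision; the chosen layers, the weight λ, and the omission of ImageNet input normalization are simplifying assumptions rather than the thesis configuration.

import torch
import torchvision

class VGGPerceptualLoss(torch.nn.Module):
    """Sketch of L_VGG (Eq. 4.9): weighted L1 distance between VGG features of x and x'."""
    def __init__(self, layers=(3, 8, 15, 22), weight=10.0):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layers, self.weight = vgg, set(layers), weight

    def forward(self, x, x_rec):
        loss, a, b = 0.0, x, x_rec
        for idx, layer in enumerate(self.vgg):
            a, b = layer(a), layer(b)
            if idx in self.layers:
                loss = loss + torch.mean(torch.abs(a - b))   # the 1/M_j factor via the mean
            if idx >= max(self.layers):
                break
        return self.weight * loss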

4.3 Training

The Cityscapes (with 30 semantic labels) [45] and ADE20K (with 150 semantic labels) [150] datasets are used for training the proposed model.¹ For Cityscapes, all the 2974 RGB images (street scenes) in the dataset are used. All images are then rescaled to 512×1024 (i.e., $h = 512$, $w = 1024$, and $k = 3$ for RGB channels) for training. For ADE20K, the images with at least 512 pixels in height or width are used (9272 images in total). All images are rescaled to $h = 256$ and $w = 256$ to have a fixed size for training. Note that no resizing is needed for the test images, since the model can work with any size at test time. We set the down-sampling factor $\alpha = 8$ to get compact representations of size 64×128×3 for Cityscapes and 32×32×3 for ADE20K. We also use the weight $\lambda = 10$ for $L_1$, $L_{DIS}$, and $L_{VGG}$. All models were jointly trained for 150 epochs with mini-batch SGD and batch sizes of 2 and 8 for Cityscapes and ADE20K, respectively. The Adam solver with a learning rate of 0.0002 was used; the learning rate was fixed for the first 100 epochs and gradually decreased to zero over the next 50 epochs. Perceptual feature-matching losses usually guide the generator towards more synthesized textures in the predicted images, which causes a slightly higher pixel-wise reconstruction error, especially in the last epochs. To handle this issue, we did not include the perceptual losses $L_{DIS}$ and $L_{VGG}$ in the generator loss for the last 50 epochs. The SegNet, CompNet, FineNet, and discriminator networks proposed in this work are all trained in the RGB domain.

¹The source code can be found at https://github.com/makbari7/DSSLIC.

4.4 Experiments

In this section, we compare the performance of our scheme with JPEG, JPEG2000, WebP, and the H.265/HEVC intra coding-based BPG codec [23]. Since the networks are trained for RGB images, we encode all images using the RGB (4:4:4) format in the different codecs for a fair comparison. We use both PSNR and MS-SSIM [139] as the evaluation metrics. In this experiment, we encode the RGB components of the residual image $r$ using the lossy BPG codec with different quantization values. The results on the ADE20K and Cityscapes test sets are given in Figures 4.2 and 4.3. The results are averaged over 50 random test images not included in the training set. As shown in the figures, our method gives better PSNR and MS-SSIM than BPG, especially when the bit rate is less than ≈0.9 bits/pixel/channel (bpp for short) on ADE20K and less than ≈0.5 bpp on Cityscapes. In particular, the average PSNR gain is more than 2dB on the ADE20K test set when the bit rate is between 0.4 and 0.7 bpp. The average results on the Kodak dataset, including 24 test images, are illustrated in Figure 4.4. For this experiment, the model trained on the ADE20K dataset is used. When the bit rate is less than about 1.4bpp, our scheme achieves better results than the other codecs in both PSNR and MS-SSIM. For example, the average gain is about 2dB between 0.4 and 0.8bpp. This also shows that our method generalizes very well when the training and testing images are from different datasets. The average RecNet decoding time for Kodak images on CPU and GPU is ≈44s and ≈0.013s, respectively. Some visual examples from the ADE20K, Cityscapes, and Kodak test sets are given in Figures 4.5, 4.6, and 4.7. For clearer visualization, only cropped parts of the reconstructed images are shown in these examples. As seen in all examples, JPEG has poor performance due to blocking artifacts. Some artifacts are also visible in the JPEG2000 results. Although WebP provides higher quality results than JPEG2000, the images are blurred in some areas. The images encoded using BPG are smoother, but fine structures are missing in some areas, for example, the 'car rental' text in Figure 4.5, the right brown wall in Figure 4.6, and the trees in Figure 4.7.

Figure 4.2: Comparison results on the ADE20K test set in terms of PSNR (top) and MS-SSIM (bottom) vs. bpp (bits/pixel/channel). The results are averaged over RGB channels.

Figure 4.3: Comparison results on the Cityscapes test set in terms of PSNR (top) and MS-SSIM (bottom) vs. bpp (bits/pixel/channel). The results are averaged over RGB channels.

Figure 4.4: Comparison results on the Kodak image set in terms of PSNR (top) and MS-SSIM (bottom) vs. bpp (bits/pixel/channel). The results are averaged over RGB channels.

Table 4.1: Ablation study of different components in DSSLIC. upComp: upsampled compact image only; noSeg: no semantic segmentation map used; withSeg: the proposed DSSLIC without BPG-based residual coding; synth: perceptual losses considered in all training epochs.

                   ADE20K                              Kodak
           upComp   synth   noSeg   withSeg    upComp   synth   noSeg   withSeg
BPP         0.095   0.092   0.08    0.095      0.087    0.088   0.080   0.087
PSNR        17.50   21.91   22.24   23.11      17.77    20.97   21.46   21.91
MS-SSIM     0.759   0.887   0.905   0.914      0.738    0.858   0.887   0.891

4.4.1 Ablation Study

Figure 4.8 and Table 4.1 report ablation studies of different configurations, all obtained without using the BPG-based residual coding:

• upComp: the results are obtained without the FineNet network in the pipeline, i.e., $x' = c'$ (the upsampled compact image only);

• noSeg: the segmentation maps are not used in either CompNet or FineNet, i.e., $x' = c' + f$, where $c'$ is the upsampled version of $c = CompNet(x)$ and $f = FineNet(c')$;

• withSeg: all the DSSLIC components shown in Figure 4.1 are used in this configuration (except BPG-based residual coding);

• synth: the settings are the same as withSeg, except that the perceptual losses $L_{VGG}$ and $L_{DIS}$ are used in all training epochs.

The poor performance of using only the upsampled compact images in upComp shows the importance of FineNet in predicting the missing fine information, which is also visually obvious in Figure 4.8. Considering perceptual losses in all training epochs (synth) leads to sharper and perceptually more natural images, but the PSNR is much lower. The results with segmentation maps (withSeg) provide slightly better PSNR than noSeg, although the visual gain is more pronounced, e.g., the dark wall in Figure 4.8. Overall, our approach preserves more details and provides results with higher visual quality compared to BPG and the other codecs. This significant gain demonstrates the great potential of deep learning-based image compression. In addition, the built-in semantic map enables new applications such as fast content-based image retrieval, object-based video coding, and region-of-interest coding.

(a) Original (with segmentation map) (b) Ours (0.18bpp, 30.87dB, 0.973)

(c) BPG (0.18bpp, 29.93dB, 0.965) (d) WebP (0.21bpp, 27.48dB, 0.956)

(e) JPEG2000 (0.21bpp, 27.13dB, 0.946) (f) JPEG (0.23bpp, 24.61dB, 0.907)

Figure 4.5: ADE20K visual example (bits/pixel/channel, PSNR, MS-SSIM)

(a) Original (with segmentation map) (b) Ours (0.12bpp, 34.72dB, 0.987)

(c) BPG (0.12bpp, 33.49dB, 0.976) (d) WebP (0.12bpp, 29.89dB, 0.954)

(e) JPEG2000 (0.12bpp, 30.26dB, 0.953) (f) JPEG (0.14bpp, 24.81dB, 0.852)

Figure 4.6: Cityscapes visual example (bits/pixel/channel, PSNR, MS-SSIM)

(a) Original (with segmentation map) (b) Ours (0.69bpp, 32.54dB, 0.982)

(c) BPG (0.71bpp, 27.86dB, 0.957) (d) WebP (0.71bpp, 26.01dB, 0.952)

(e) JPEG2000 (0.71bpp, 26.71dB, 0.942) (f) JPEG (0.72bpp, 24.77dB, 0.958)

Figure 4.7: Kodak visual example (bits/pixel/channel, PSNR, MS-SSIM)

(a) Original (PSNR, MS-SSIM) (b) upComp (18.01 dB, 0.73)

(c) synth (23.68 dB, 0.84) (d) noSeg (22.18 dB, 0.86)

(e) withSeg (25.09 dB, 0.88)

Figure 4.8: Visual comparison of different scenarios given in Table 4.1 at 0.08 BPP.

Chapter 5

Learned Variable-Rate Image Compression with Residual Divisive Normalization

In this chapter, we propose a variable-rate image compression framework, which employs more GDN layers than previous GDN-based image compression methods [17, 90]. Novel GDN-based residual sub-networks are also developed in the encoder and decoder networks. At the encoder side, two layers of information are encoded: the encoder's output (called the code map) and the residual image. The code map is a low-dimensional feature map of the original input image, obtained by the deep encoder and quantized by a stochastic rounding-based scalar quantizer. The quantized code map is encoded losslessly, so that the decoder receives exactly the same information that is given to the deep decoder. In this work, the quantized code map is encoded using the FLIF lossless codec [119], which is the state of the art in lossless image coding. To further improve the performance, the reconstruction of the input image from the quantized code map is obtained using the deep decoder, and the residual between the input and the reconstruction is encoded by the BPG codec as an enhancement layer [23]. At the decoder side, the reconstruction from the deep decoder and the decoded residual image are added to get the final reconstruction. To enable a single model to operate at different bit rates and to learn multi-rate image features, a new objective function is also introduced. Experimental results show that the proposed framework trained with the variable-rate objective function outperforms JPEG2000 and state-of-the-art learning-based variable-rate methods, and achieves better MS-SSIM than the H.265/HEVC-based BPG codec. The overall framework of the proposed approach is shown in Figure 5.1. This chapter is organized as follows. The architecture of the proposed deep encoder and decoder networks is described in Section 5.1. Following that, the deep encoder, the quantizer, the deep decoder, and the residual coding are formulated in Sections 5.2-5.5. In Section 5.6, the objective functions are formulated and explained. Finally, the experimental results on the Kodak image set, as well as the ablation studies, are presented in Section 5.7.

Figure 5.1: Overall framework of the proposed codec. $x$: input image; $f_E$: deep encoder; $c$: code map; $Q$: uniform scalar quantizer; $c_q$: quantized code map; $f_D$: deep decoder; $x'$: image generated by the deep decoder; $r$: residual image; $\hat{Q}$: dequantizer; $\hat{c}_q$: dequantized code map; $r'$: decoded residual image; $\tilde{x}$: final reconstructed image.

5.1 Network Architecture

It has been shown that end-to-end optimization of a model including cascades of differentiable nonlinear transformations has better performance than traditional linear transforms [17]. One example is the GDN, which is very efficient at Gaussianizing the local statistics of natural images and has been shown to improve the efficiency of transforms compared to other popular nonlinearities such as ReLU [15]. It also provides significant improvements when utilized as a prior for different computer vision tasks such as image denoising, and for image compression when used with scalar quantization. GDN transforms were first introduced in [17] for a learning-based image compression framework, which had a simple architecture of 3 down-sampling convolution layers (either 5×5 or 9×9), each followed by a GDN layer. The architectures of the proposed deep encoder and decoder networks are illustrated in Figure 5.2. Several modifications to the GDN-based schemes in [17, 90] are developed. First, we adopt a deeper architecture including 11 convolution layers (either 3×3 or 7×7), followed by GDN or IGDN operations in the encoder and decoder networks. Second, for deeper learning of image statistics and faster convergence, the concept of identity shortcut connections from ResNet [70] is introduced into some GDN and IGDN layers, and we denote the resulting blocks as ResGDN and ResIGDN. The internal architectures of the ResGDN and ResIGDN blocks are shown in Figure 5.3.

Figure 5.2: Architecture of the proposed deep encoder and deep decoder networks. DeConv: transposed-convolution.

Figure 5.3: Architecture of the proposed ResGDN (top) and ResIGDN (bottom) sub-networks ($n$: the channel size used in the affine convolutions).

Unlike the traditional residual blocks in ResNet, where ReLU and batch (or instance) normalization are employed, we utilize GDN and IGDN layers in our residual blocks, which provide better performance and a faster convergence rate. In Figure 5.2, the encoder can be divided into 5 stages. The first and the last convolution layers are of size 7×7 (with stride 1) and are followed by a simple GDN layer. Between them, there are three stages, each of which includes a 3×3 down-sampling convolution layer (with stride 2), a simple GDN layer, and a ResGDN block. To avoid edge effects, reflection padding of size 3 is used before the convolution layers at the first and the last stages. The channel sizes of the convolution layers are 64, 128, 256, 512, and 8, respectively. The deep encoder encodes the input RGB image of size $w \times h \times 3$ into a code map of size $\frac{w}{\alpha} \times \frac{h}{\alpha} \times \alpha_c$. On the other hand, the deep decoder decodes the code map back to a reconstructed image. This network is basically the reverse of the deep encoder, except that the GDN layers are replaced by IGDN layers, and the down-sampling convolution layers are replaced by transposed-convolution (up-sampling) layers. Similar to the deep encoder, reflection padding is also used before the convolutions at the first and last stages. The channel sizes used in the deep decoder's convolution layers are 512, 256, 128, 64, and 3, respectively.

5.2 Deep Encoder

Let $x \in \mathbb{R}^{h\times w\times 3}$ be the original image. The code map $c \in \mathbb{R}^{\frac{h}{\alpha}\times \frac{w}{\alpha}\times \alpha_c}$ is generated using the parametric deep encoder $f_E$ represented as:

$$c = f_E(x; \Phi), \tag{5.1}$$

where $\Phi$ is the parameter vector that needs to be optimized. The encoder consists of 5 stages, where the input to the $k$-th stage is denoted by $U^{(k)}$. The input image $x$ is then represented as $U^{(0)}$, which is the input to the first stage of the encoder. Each stage begins with a convolution layer defined as:

$$V^{(k)} = \begin{cases} H^{(k)} \ast U^{(k)} & k \in \{0, 4\},\\[2pt] H^{(k)} \ast_{\downarrow} U^{(k)} & \text{otherwise}, \end{cases} \tag{5.2}$$

where $\ast$ and $\ast_{\downarrow}$ are affine and down-sampling convolutions, respectively. Each convolution layer is followed by a GDN operation [17] defined as:

$$w_i^{(k)}(m, n) = \frac{v_i^{(k)}(m, n)}{\sqrt{\beta_i^{(k)} + \sum_j \gamma_{ij}^{(k)} \left(v_j^{(k)}(m, n)\right)^2}}, \tag{5.3}$$

where $i$ and $j$ run over channels and $(m, n)$ is the spatial location of a specific value of a tensor (e.g., $V^{(k)}$). Except for the first and last stages, a ResGDN transform, denoted by $T$, is applied at the end of each stage:

$$U^{(k+1)} = \begin{cases} W^{(k)} & k = 0,\\[2pt] T(W^{(k)}) + W^{(k)} & \text{otherwise}, \end{cases} \tag{5.4}$$

where $T$ is composed of two subsequent pairs of affine convolutions (Equation 5.2 with $k = 0$), each followed by a GDN operation (Equation 5.3).
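A minimal PyTorch sketch of the GDN operation in Equation 5.3 and the ResGDN block of Equation 5.4 is given below. The positivity handling of $\beta$ and $\gamma$ (simple squaring) and the 3×3 kernel size inside the block are assumptions made only to keep the example short; the actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Sketch of the GDN operation of Eq. 5.3. beta and gamma are kept
    positive by squaring, which is one simple choice (an assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, v):                           # v: (N, C, H, W)
        gamma = (self.gamma ** 2).view(*self.gamma.shape, 1, 1)
        # denominator: beta_i + sum_j gamma_ij * v_j(m, n)^2, via a 1x1 convolution
        norm = F.conv2d(v * v, gamma, bias=self.beta ** 2)
        return v / torch.sqrt(norm)

class ResGDN(nn.Module):
    """Sketch of the ResGDN block of Eq. 5.4: two (conv + GDN) pairs plus
    an identity shortcut (3x3 kernels assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), SimpleGDN(channels),
            nn.Conv2d(channels, channels, 3, padding=1), SimpleGDN(channels))

    def forward(self, w):
        return self.body(w) + w                     # T(W) + W
```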

5.3 Stochastic Rounding-Based Quantization

The output of the last stage of the encoder, $W^{(4)}$, represents the code map $c$ with $\alpha_c$ channels. Each channel, denoted by $c^{(i)}$, is then quantized to a discrete-valued vector using a stochastic rounding-based uniform scalar quantizer as:

$$c_q^{(i)} = Q(c^{(i)}), \tag{5.5}$$

where the function $Q$ is defined as in [65, 103, 133]:

$$Q(c^{(i)}) = \mathrm{Round}\left(\frac{c^{(i)} + \epsilon}{\Delta}\right) + z, \tag{5.6}$$

where $\epsilon \in [-\frac{1}{2}, \frac{1}{2}]$ is produced by a uniform random number generator. $\Delta$ and $z$ respectively represent the quantization step (scale) and the zero-point, which are defined as:

$$\Delta = \frac{\max(c^{(i)}) - \min(c^{(i)})}{2^B - 1}, \tag{5.7}$$

and

$$z = \begin{cases} 0 & \frac{-\min(c^{(i)})}{\Delta} < 0,\\[4pt] 2^B - 1 & \frac{-\min(c^{(i)})}{\Delta} > 2^B - 1,\\[4pt] \frac{-\min(c^{(i)})}{\Delta} & \text{otherwise}, \end{cases} \tag{5.8}$$

where $B$ is the number of bits, and $\min(c^{(i)})$ and $\max(c^{(i)})$ are the input's minimum and maximum values over the $i$-th channel, respectively. The zero-point $z$ is an integer ensuring that zero is quantized with no error, which avoids quantization error in common operations such as zero padding [77]. The stochastic rounding approach in Equation 5.6 provides better performance and a faster convergence rate compared to the round-to-nearest algorithm. Stochastic rounding is indeed an unbiased rounding scheme, which maintains a non-zero probability of small parameters [62]. In other words, it possesses the desirable property that the expected rounding error is zero:

$$\mathbb{E}\left[\mathrm{Round}(x)\right] = x. \tag{5.9}$$

As a consequence, the gradient information is preserved and the network is able to learn with low bits of precision without any significant loss in performance. For the entropy coding of the quantized code map, denoted by $c_q^{(i)}$, the FLIF codec [119] is utilized, which is the state-of-the-art lossless image codec. Since FLIF can also work with grayscale images, each of the quantized code map channels (considered as a grayscale image) is separately entropy-coded by FLIF.
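For illustration, a NumPy sketch of the per-channel quantizer of Equations 5.6-5.8 and the dequantizer of Equation 5.10 is shown below; the zero-point is rounded to the nearest integer and clamped here, which is one way to realize the integer constraint stated above, not necessarily the exact implementation.

```python
import numpy as np

def quantize_channel(c, num_bits=8, rng=None):
    """Sketch of the stochastic-rounding uniform scalar quantizer
    (Eqs. 5.6-5.8) applied to one code-map channel c (a float array)."""
    rng = np.random.default_rng() if rng is None else rng
    c_min, c_max = float(c.min()), float(c.max())
    delta = (c_max - c_min) / (2 ** num_bits - 1)                     # Eq. 5.7
    z = int(np.clip(np.round(-c_min / delta), 0, 2 ** num_bits - 1))  # Eq. 5.8, rounded to an integer
    eps = rng.uniform(-0.5, 0.5, size=c.shape)                        # stochastic rounding noise
    c_q = np.round((c + eps) / delta) + z                             # Eq. 5.6
    return c_q, delta, z

def dequantize_channel(c_q, delta, z):
    """Sketch of Eq. 5.10: map the quantized channel back to real values."""
    return delta * (c_q - z)
```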

5.4 Deep Decoder

The quantization process described in Section 5.3 rescales the input values to $[0, 2^B - 1]$. To rescale them back to $[-1, 1]$ at the decoder side, the quantized code map is dequantized using the following function:

$$\hat{Q}(c_q^{(i)}) = \Delta \cdot \left(c_q^{(i)} - z\right). \tag{5.10}$$

Given the dequantized code map $\hat{c}_q^{(i)} = \hat{Q}(c_q^{(i)})$, the parametric decoder $f_D$ (with the parameter vector $\Psi$) reconstructs the image $x' \in \mathbb{R}^{h\times w\times 3}$ as follows:

$$x' = f_D(\hat{c}_q; \Psi). \tag{5.11}$$

Similar to the encoder, the decoder network is composed of 5 stages in which all the operations are reversed. Each stage at the decoder network starts with an IGDN operation computed as follows:

$$\hat{v}_i^{(k)}(m, n) = \hat{w}_i^{(k)}(m, n) \cdot \sqrt{\hat{\beta}_i^{(k)} + \sum_j \hat{\gamma}_{ij}^{(k)} \left(\hat{w}_j^{(k)}(m, n)\right)^2}, \tag{5.12}$$

which is followed by a convolution layer defined as:

$$\hat{U}^{(k)} = \begin{cases} \hat{H}^{(k)} \ast \hat{V}^{(k)} & k \in \{0, 4\},\\[2pt] \hat{H}^{(k)} \ast_{\uparrow} \hat{V}^{(k)} & \text{otherwise}, \end{cases} \tag{5.13}$$

where $\ast_{\uparrow}$ denotes the transposed-convolution used for up-sampling the input tensor. As the reverse of the encoder, each convolution at the middle stages is followed by a ResIGDN block defined as:

$$\hat{Z}^{(k)} = \hat{T}(\hat{U}^{(k)}) + \hat{U}^{(k)}, \quad k \in [1, 3], \tag{5.14}$$

where $\hat{T}$ consists of two subsequent pairs of an IGDN operation (Equation 5.12) followed by an affine convolution (Equation 5.13 with $k = 0$). The reconstructed image $x'$ is finally obtained from the output of the decoder, represented by $\hat{U}^{(4)}$.

5.5 Residual Coding

As an enhancement layer of the bit-stream, the residual $r$ between the input image $x$ and the deep decoder's output $x'$ is further encoded by the BPG codec, as in [10]. To do this, the minimal and maximal values of the residual image $r$ are first obtained, and the range between them is rescaled to [0, 255], so that we can call the BPG codec directly to encode it as a regular 8-bit image. The minimal and maximal values are also sent to the decoder for inverse scaling after BPG decoding.
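The rescaling of the residual to an 8-bit range (and its inverse at the decoder) can be sketched as follows; the rounding to uint8 is an assumed detail.

```python
import numpy as np

def residual_to_8bit(r):
    """Map the residual r (float array) to [0, 255] so it can be passed to
    BPG as a regular 8-bit image; min/max are sent as side information."""
    r_min, r_max = float(r.min()), float(r.max())
    r8 = np.round(255.0 * (r - r_min) / (r_max - r_min)).astype(np.uint8)
    return r8, r_min, r_max

def residual_from_8bit(r8, r_min, r_max):
    """Inverse scaling applied after BPG decoding."""
    return r8.astype(np.float32) / 255.0 * (r_max - r_min) + r_min
```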

5.6 Objective Function and Optimization

Our cost function is a combination of an $L_2$-norm loss, denoted by $L_2$, and an MS-SSIM loss [138], denoted by $L_{MS}$, as follows:

$$L(\Phi, \Psi) = 2L_2 + L_{MS}, \tag{5.15}$$

where $\Phi$ and $\Psi$ are the optimization parameter vectors of the deep encoder and decoder, respectively, each defined as the full set of parameters across all 5 stages: $\Phi = \{H^{(k)}, \beta^{(k)}, \gamma^{(k)}\}$ and $\Psi = \{\hat{H}^{(k)}, \hat{\beta}^{(k)}, \hat{\gamma}^{(k)}\}$, where $k = [0, 4]$. In order to optimize the parameters such that our codec can operate at a variety of bit rates, we propose the following novel variable-rate objective functions for the $L_2$ and $L_{MS}$ losses:

$$L_2 = \sum_{B \in R} \left\| x - x'_B \right\|_2, \tag{5.16}$$

and

$$L_{MS} = -\sum_{B \in R} I_M(x, x'_B) \prod_{j=1}^{M} C_j(x, x'_B) \cdot S_j(x, x'_B), \tag{5.17}$$

where $x'_B$ denotes the image reconstructed with a $B$-bit quantizer (Equations 5.7 and 5.8), and $B$ can take all possible values in a set $R$. In this work, $R = \{2, 4, 8\}$ is used for training our variable-rate model. The MS-SSIM metric uses the luminance $I$, contrast $C$, and structure $S$ terms of Equation 4.7 to compare the pixels and their neighborhoods in $x$ and $x'$. Moreover, MS-SSIM operates at multiple scales, where the images are iteratively down-sampled by factors of $2^j$, for $j \in [1, M]$. Our goal is to minimize the objective $L(\Phi, \Psi)$ over the continuous parameters $\{\Phi, \Psi\}$.

However, both terms depend on the quantized values of $c_q$, whose derivative is discontinuous, which makes the quantizer non-differentiable [17]. To overcome this issue, we use the fact that the exact derivatives of discrete variables are zero almost everywhere, and employ the straight-through estimator (STE) approach in [28] to approximate the differentiation through discrete variables in the backward pass. Using STE, we set the incoming gradients of our quantizer equal to its outgoing gradients, which effectively disregards the derivative of the quantizer itself. Note that many methods optimize for PSNR and MS-SSIM separately in order to get better performance in each of them, while our scheme jointly optimizes for both. It will be shown later that we can still achieve satisfactory results in both metrics.
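A common way to realize the STE described above in PyTorch is the "detach trick" shown below, together with a sketch of how the multi-rate $L_2$ term of Equation 5.16 sums reconstruction errors over the bit-depths in $R$; encoder, decoder, quantize, and dequantize are assumed callables, not the exact implementation of this chapter.

```python
import torch

def ste_round(t):
    """Round in the forward pass but pass gradients straight through
    in the backward pass (the STE approximation)."""
    return t + (torch.round(t) - t).detach()

def variable_rate_l2(x, encoder, decoder, quantize, dequantize, bit_set=(2, 4, 8)):
    """Sketch of Eq. 5.16: the same network is evaluated with several
    quantizer bit-depths B and the L2 reconstruction errors are summed."""
    c = encoder(x)
    loss = x.new_zeros(())
    for bits in bit_set:
        c_hat = dequantize(quantize(c, bits), bits)   # quantize() is assumed to use ste_round internally
        x_b = decoder(c_hat)
        loss = loss + torch.norm(x - x_b, p=2)        # ||x - x'_B||_2
    return loss
```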

Figure 5.4: Sample quantized code map (channel size 8) generated by the deep encoder. Left: original input image from Kodak image set.

5.7 Experimental Results

The ADE20K dataset [150] was used for training the proposed model. The images with at least 512 pixels in height or width were used (9272 images in total). All images were rescaled to h = 256 and w = 256 to have a fixed size for training. We set the down-sampling factor

$\alpha = 8$ and the channel size $\alpha_c = 8$ to get the code map of size 32×32×8. The code maps corresponding to one sample image from the Kodak image set are shown in Figure 5.4. The deep encoder and decoder models were jointly trained for 200 epochs with mini-batch SGD and a mini-batch size of 16. The Adam solver with a learning rate of 0.00002 was used; the learning rate was fixed for the first 100 epochs and gradually decreased to zero over the next 100 epochs. All the networks were trained in the RGB domain.

Figure 5.5: Comparison results of the proposed variable-rate approach with state-of-the-art variable-rate methods on the Kodak test set in terms of PSNR (top) and MS-SSIM (bottom) vs. bpp (bits/pixel). Compared methods: Ours, BPG (4:4:4 and 4:2:0), JPEG2000, WebP, JPEG, Cai2018, Zhang2019, and Toderici2017.

(a) Original (bits/pixel, PSNR, MS-SSIM) (b) Ours (0.10bpp, 30.88dB, 0.95)

(c) BPG (0.10bpp, 30.56dB, 0.91) (d) J2K (0.10bpp, 29.28dB, 0.87)

(e) WebP (0.10bpp, 28.67dB, 0.87) (f) JPEG (0.15bpp, 21.91dB, 0.61)

Figure 5.6: Kodak visual example 1 (bits/pixel, PSNR, MS-SSIM) in RGB domain. BPG: in YUV4:4:4 format; J2K: JPEG2000.

In this section, we compare the performance of the proposed scheme with two types of methods: 1) standard codecs including JPEG, JPEG2000 [43], WebP [60], and the H.265/HEVC intra coding-based BPG codec [23]; and 2) the state-of-the-art learning-based

variable-rate image compression methods in [36], [134], and [147], in which a single network was trained to generate multiple bit rates. We use both PSNR and MS-SSIM [139] as the evaluation metrics. The model was trained using 3 different bit rates, i.e., $R = \{2, 4, 8\}$ in Equation 5.16. However, the trained model can operate at any bit rate in the range [1, 8] at test time. The comparison results on the popular Kodak image set (averaged over 24 test images) are shown in Figure 5.5. Different points on the R-D curve of our variable-rate results are obtained from 5 different bit rates for the code maps in the base layer, i.e., $R = \{3, 4, 5, 6, 7\}$. The corresponding residual images $r$ in the enhancement layer are coded by BPG (YUV4:4:4 format) with quantizer parameters of {50, 40, 35, 30, 25}, respectively. Better results could be obtained by performing some rate allocation optimization. As shown in Figure 5.5, our method outperforms the state-of-the-art learning-based variable-rate image compression models and JPEG2000 in terms of both PSNR and MS-SSIM. Our PSNR results are slightly lower than BPG (in both YUV4:2:0 and YUV4:4:4 formats), but we achieve better MS-SSIM, especially at low rates. The BPG-based residual coding in our scheme is exploited to avoid re-training the entire model for another bit rate and, more importantly, to boost the quality at high bit rates. For the 5 points (low to high) in Figure 5.5, the percentage of bits used by the residual image is {2%, 34%, 52%, 68%, 76%}. This shows that as the bit rate increases, the residual coding makes a more significant contribution to the R-D performance. Three visual examples from the Kodak image set are given in Figures 5.6, 5.7, and 5.8, in which our results are compared with BPG (YUV4:4:4 format), JPEG2000, WebP, and JPEG. JPEG has very poor performance due to the ringing artifacts at edges. BPG has the highest PSNR and also smoother results compared to JPEG2000 and WebP. However, the details and fine structures in the BPG results (e.g., the grooves on the door in Figure 5.6 and the grass on the ground in Figure 5.7) are not well-preserved in many areas. Our method achieves the best MS-SSIM and also provides the highest visual quality compared to the other methods, including BPG.

5.7.1 Other Ablation Studies

In order to evaluate the performance of different components of the proposed framework, the following additional ablation studies are performed. The results are shown in Figure 5.9.

• Code map channel size: Figure 5.9 shows the results with channel sizes of 4 and 8 (i.e., $\alpha_c \in \{4, 8\}$). It can be seen that $\alpha_c = 8$ gives better results than $\alpha_c = 4$. In general, we find that a larger code map channel size with fewer quantization bits provides better R-D performance, because deeper texture information of the input image is preserved within the feature maps.

70 (a) Original (bits/pixel, PSNR, MS-SSIM) (b) Ours (0.17bpp, 23.34dB, 0.947)

(c) BPG (0.18bpp, 23.66dB, 0.891) (d) J2K (0.17bpp, 22.47dB, 0.866)

(e) WebP (0.17bpp, 22.88dB, 0.868) (f) JPEG (0.21bpp, 19.42dB, 0.749)

Figure 5.7: Kodak visual example 2 (bits/pixel, PSNR, MS-SSIM) in RGB domain. BPG: in YUV4:4:4 format; J2K: JPEG2000.

(a) Original (bits/pixel, PSNR, MS-SSIM) (b) Ours (0.10bpp, 31.84dB, 0.977)

(c) BPG (0.10bpp, 32.25dB, 0.949) (d) J2K (0.10bpp, 30.71dB, 0.936)

(e) WebP (0.10bpp, 28.93dB, 0.919) (f) JPEG (0.16bpp, 22.50dB, 0.727)

Figure 5.8: Kodak visual example 3 (bits/pixel, PSNR, MS-SSIM) in RGB domain. BPG: in YUV4:4:4 format; J2K: JPEG2000.

• GDN vs. ReLU: In order to show the performance of the proposed GDN-based architecture, we compare it with a ReLU-based variant of our model, denoted as noGDN in Figure 5.9. In this model, all GDN and IGDN layers in the deep encoder and decoder are removed; instead, instance normalization followed by ReLU is added after all convolution layers. The last GDN and IGDN layers in the encoder and decoder are replaced by a Tanh layer. As shown in Figure 5.9, the models with the GDN structure outperform the ones without GDN.

• Conventional vs. GDN/IGDN-based residual block: In this scenario, the model composed of the proposed ResGDN/ResIGDN blocks (denoted by ResGDN in Figure 5.9) is compared with the conventional ReLU-based residual block (denoted by ResReLU). The results with neither ResReLU nor ResGDN (the yellow blocks in Figure 5.1), denoted by noRes, are also included. As demonstrated in Figure 5.9, ResGDN achieves better performance than the other scenarios.

Figure 5.9: Ablation studies with different model configurations. Ours: our variable-rate results. nCh: n channels for the code map; ResGDN: ResGDN/IGDN transforms in the network architecture; ResReLU: conventional residual block with ReLU; noRes: neither ResGDN nor ResReLU is used; noGDN: GDN and IGDN layers in our main architecture replaced by ReLU. Curves shown: Ours (8Ch + ResGDN), 8Ch + noRes, 4Ch + ResGDN, 4Ch + noGDN + ResReLU, and 4Ch + noGDN + noRes, in terms of PSNR (top) and MS-SSIM (bottom) vs. bpp.

In terms of complexity, the average processing times of the deep encoder, quantizer, and deep decoder on the Kodak images are 65 ms, 2 ms, and 51 ms, respectively, on an NVIDIA TITAN X Pascal GPU.

Chapter 6

Generalized Octave Convolutions for Learned Multi-Frequency Image Compression

In this chapter, we propose the first learned multi-frequency image compression approach that uses the recently developed octave convolutions to factorize the latents into high and low frequencies, which are represented at high and low resolutions, respectively. Since the low frequency part is represented at a lower resolution, its spatial redundancy is reduced, which improves the compression rate. Moreover, octave convolutions impose effective communication between the high and low frequencies, which can improve the reconstruction quality. We also develop novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers to preserve the spatial structure of the information. Our experiments show that the proposed scheme outperforms all standard codecs and learning-based methods in both PSNR and MS-SSIM metrics, and establishes the new state of the art for learned image compression. This chapter is organized as follows. In Section 6.1, vanilla and octave convolutions are briefly described and compared. The proposed generalized octave convolution and transposed-convolution with internal activation layers are formulated and discussed in Section 6.2. Following that, the architecture of the proposed multi-frequency image compression framework, as well as the multi-frequency entropy model, are introduced in Section 6.3. Finally, in Section 6.4, the experimental results along with the ablation study are discussed and compared with state-of-the-art methods in learning-based image compression.

6.1 Vanilla vs. Octave Convolution

Let $X, Y \in \mathbb{R}^{h\times w\times c}$ be the input and output feature vectors with $c$ channels of size $h \times w$. Each feature map in the vanilla convolution is obtained as follows:

$$Y_{(p,q)} = \sum_{i,j \in \mathcal{N}_k} \Phi_{\left(i+\frac{k-1}{2},\, j+\frac{k-1}{2}\right)}^{T}\, X_{(p+i,\, q+j)}, \tag{6.1}$$

where $\Phi \in \mathbb{R}^{k\times k\times c}$ is a $k \times k$ convolution kernel, $(p, q)$ is the location coordinate, and $\mathcal{N}_k$ is a local neighborhood. The overall scheme of the vanilla convolution is shown in Figure 6.1.

Figure 6.1: Vanilla convolution scheme. Φ: the convolution kernel; X: input feature map; Y : output feature map.

In the vanilla convolution, all input and output feature maps are of the same spatial resolution, which represents a mixture of information at high and low frequencies. Due to the redundancy in the low frequency information, the vanilla convolution is not efficient in terms of both memory and computation cost. To address this problem, in the recently developed octave convolution [42], the feature maps are factorized into high and low frequencies, each processed with different convolutions. As a result, the resolution of the low-frequency feature maps can be spatially reduced, which saves both memory and computation. The architecture of the original octave convolution is illustrated in Figure 6.2. The factorization of the input vector $X$ in octave convolutions is denoted by $X = \{X^H, X^L\}$, where $X^H \in \mathbb{R}^{h\times w\times (1-\alpha)c}$ and $X^L \in \mathbb{R}^{\frac{h}{2}\times \frac{w}{2}\times \alpha c}$ are respectively the high and low frequency maps. The ratio of channels allocated to the low frequency feature representation (i.e., at half of the spatial resolution) is defined by $\alpha \in [0, 1]$.

Figure 6.2: The architecture of the original octave convolution. $X^H$ and $X^L$: input high and low frequency feature maps; $f$: regular vanilla convolution; downsample: fixed down-sampling operation (e.g., maxpooling); upsample: fixed up-sampling operation (e.g., bilinear); $Y^{H\to H}$ and $Y^{L\to L}$: intra-frequency updates; $Y^{H\to L}$ and $Y^{L\to H}$: inter-frequency communications; $\Phi^{H\to H}$ and $\Phi^{L\to L}$: intra-frequency convolution kernels; $\Phi^{H\to L}$ and $\Phi^{L\to H}$: inter-frequency convolution kernels; $Y^H$ and $Y^L$: output high and low frequency feature maps.

The factorized output vector is denoted by $Y = \{Y^H, Y^L\}$, where $Y^H \in \mathbb{R}^{h\times w\times (1-\alpha)c}$ and $Y^L \in \mathbb{R}^{\frac{h}{2}\times \frac{w}{2}\times \alpha c}$ are the output high and low frequency maps, calculated as follows:

$$Y^H = Y^{H\to H} + Y^{L\to H}, \tag{6.2}$$

and

$$Y^L = Y^{L\to L} + Y^{H\to L}, \tag{6.3}$$

where $Y^{H\to H}$ and $Y^{L\to L}$ are the intra-frequency update components, and $Y^{H\to L}$ and $Y^{L\to H}$ denote the inter-frequency communication components. The intra-frequency components are used to update the information within each of the high and low frequency convolutions, while the inter-frequency communication enables information exchange between them. The octave convolution kernel is given by $\Phi = [\Phi^H, \Phi^L]$, with which the inputs $X^H$ and $X^L$ are respectively convolved. $\Phi^H$ and $\Phi^L$ are further divided into intra- and inter-frequency components as follows: $\Phi^H = [\Phi^{H\to H}, \Phi^{L\to H}]$ and $\Phi^L = [\Phi^{L\to L}, \Phi^{H\to L}]$. For the intra-frequency update, the regular vanilla convolution is used. However, up- and down-sampling interpolations are applied to compute the inter-frequency communication, formulated as:

$$Y^H = f(X^H; \Phi^{H\to H}) + \mathrm{upsample}\left(f(X^L; \Phi^{L\to H}), 2\right), \tag{6.4}$$

and

$$Y^L = f(X^L; \Phi^{L\to L}) + f\left(\mathrm{downsample}(X^H, 2); \Phi^{H\to L}\right), \tag{6.5}$$

where $f$ denotes a vanilla convolution with parameters $\Phi$. As reported in [42], due to the effective inter-frequency communications, the octave convolution can achieve better classification and recognition performance than the vanilla convolution. Moreover, the octave convolution divides the input signal into high and low frequency components, which is similar to the wavelet transform that has found good applications in image compression and led to the JPEG2000 standard. This motivates us to apply it to learned image compression.

6.2 Generalized Octave Convolution

In the original octave convolutions, average pooling and nearest-neighbor interpolation are respectively employed for the down- and up-sampling operations in the inter-frequency communication [42]. Such conventional interpolations do not preserve the spatial information and structure of the input feature map. In addition, in convolutional auto-encoders, where sub-sampling needs to be reversed at the decoder side, fixed operations such as pooling result in poor performance [72]. In this work, we propose a novel generalized octave convolution (GoConv) in which strided convolutions are used to sub-sample the feature vectors and compute the inter-frequency communication in a more effective way. Fixed sub-sampling operations such as pooling are designed to forget about spatial structure, for example, in object recognition, where we only care about the object, not its position. However, if the spatial information is important, strided convolution can be a useful alternative. With learned filters, strided convolutions can learn to handle discontinuities from striding and preserve certain spatial properties required in the down-sampling operation [72]. Moreover, since they can learn how to summarize, better generalization with respect to the input is achieved. As a result, better performance with less spatial information loss can be achieved, especially in auto-encoders, where it is easier to reverse strided convolutions. In addition, as in ResNet, applying a strided convolution (i.e., convolution and down-sampling at the same time) reduces the computational cost compared to a convolution followed by a fixed down-sampling operation (e.g., average pooling). The architecture of the proposed GoConv is shown in Figure 6.3. Compared to the original octave convolution (Figure 6.2), we apply another important modification regarding the inputs to the inter-frequency convolution operations. As summarized in Section 6.1, in order to calculate the inter-frequency communication outputs (denoted by $Y^{H\to L}$ and $Y^{L\to H}$), the input high and low vectors (denoted by $X^H$ and $X^L$) are considered as the inputs to the high-to-low and low-to-high convolutions, respectively

($f_{\downarrow 2}$ and $g_{\uparrow 2}$ in Figure 6.3). This strategy is only efficient for a stride of 1 (i.e., the input and output high/low vectors have the same size). However, in GoConv, this procedure can

result in significant information loss for larger strides. As an example, consider using stride 2, which results in down-sampled output high/low feature maps (at half the resolution of the input high/low maps). To achieve this, stride 2 is required for the intra-frequency convolution ($f$ in Figure 6.3). However, for the inter-frequency convolution $f_{\downarrow 2}$, a harsh stride of 4 would be needed, which results in significant spatial information loss. In order to deal with this problem, we instead use two consecutive convolutions with stride 2, where the first convolution is the intra-frequency operation $f$. In other words, to compute $Y^{H\to L}$, we exploit the filters learned by $f$ to reduce the information loss. Thus, instead of $X^H$ and $X^L$, we set $Y^{H\to H}$ and $Y^{L\to L}$ as the inputs to $f_{\downarrow 2}$ and $g_{\uparrow 2}$.

Figure 6.3: Architecture of the proposed generalized octave convolution (GoConv). $X^H$ and $X^L$: input high and low frequency feature maps; Act: the activation layer; $f$: regular vanilla convolution; $f_{\downarrow 2}$: regular convolution with stride 2; $g_{\uparrow 2}$: regular transposed-convolution with stride 2; $Y^{H\to H}$ and $Y^{L\to L}$: intra-frequency updates; $Y^{H\to L}$ and $Y^{L\to H}$: inter-frequency communications; $\Phi^{H\to H}$ and $\Phi^{L\to L}$: intra-frequency convolution kernels; $\Phi^{H\to L}$ and $\Phi^{L\to H}$: inter-frequency convolution kernels; $Y^H$ and $Y^L$: output high and low frequency feature maps.

The output high and low frequency feature maps in GoConv are then formulated as follows:

$$Y^H = Y^{H\to H} + g_{\uparrow 2}(Y^{L\to L}; \Phi^{L\to H}), \tag{6.6}$$

and

$$Y^L = Y^{L\to L} + f_{\downarrow 2}(Y^{H\to H}; \Phi^{H\to L}), \tag{6.7}$$

with

$$Y^{H\to H} = f(X^H; \Phi^{H\to H}), \tag{6.8}$$

and

$$Y^{L\to L} = f(X^L; \Phi^{L\to L}), \tag{6.9}$$

where $f_{\downarrow 2}$ and $g_{\uparrow 2}$ are respectively vanilla convolution and transposed-convolution operations with a stride of 2. We also propose a generalized octave transposed-convolution, denoted by GoTConv, which can replace the conventional transposed-convolution commonly employed in deep auto-encoder (encoder-decoder) architectures. Figure 6.4 illustrates the overall scheme of GoTConv. Let $\tilde{Y} = \{\tilde{Y}^H, \tilde{Y}^L\}$ and $\tilde{X} = \{\tilde{X}^H, \tilde{X}^L\}$ respectively be the factorized input and output feature vectors. The output high and low frequency maps $\tilde{X}^H$ and $\tilde{X}^L$ in GoTConv are obtained as follows:

$$\tilde{X}^H = \tilde{X}^{H\to H} + g_{\uparrow 2}(\tilde{X}^{L\to L}; \Psi^{L\to H}), \tag{6.10}$$

and

$$\tilde{X}^L = \tilde{X}^{L\to L} + f_{\downarrow 2}(\tilde{X}^{H\to H}; \Psi^{H\to L}), \tag{6.11}$$

with

$$\tilde{X}^{H\to H} = g(\tilde{Y}^H; \Psi^{H\to H}), \tag{6.12}$$

and

$$\tilde{X}^{L\to L} = g(\tilde{Y}^L; \Psi^{L\to L}), \tag{6.13}$$

where $\tilde{Y}^H, \tilde{X}^H \in \mathbb{R}^{h\times w\times (1-\alpha)c}$ and $\tilde{Y}^L, \tilde{X}^L \in \mathbb{R}^{\frac{h}{2}\times \frac{w}{2}\times \alpha c}$. Unlike GoConv, in which the regular convolution operation is used, a transposed-convolution, denoted by $g$, is applied for the intra-frequency update in GoTConv. For the up- and down-sampling operations in the inter-frequency communication, the same strided convolutions $g_{\uparrow 2}$ and $f_{\downarrow 2}$ as in GoConv are respectively utilized. In the original octave convolutions, activation layers (e.g., ReLU) are applied to the output high and low frequency maps. However, in this work, we utilize activations for each internal convolution performed in our proposed GoConv and GoTConv. In this way, we ensure activation functions are properly applied to each feature map computed by the convolution operations. Each of the inter- and intra-frequency components is then followed by an activation layer in GoConv. This process is inverted in the proposed GoTConv, where the activation layer is followed by the inter- and intra-frequency communications, as shown in Figures 6.3 and 6.4. Similar to the original octave convolution, the proposed GoConv and GoTConv are designed and formulated as generic, plug-and-play units. As a result, they can respectively replace vanilla convolution and transposed-convolution units in any CNN architecture, especially auto-encoder-based frameworks such as image compression, image denoising, and semantic segmentation. When used in an auto-encoder, the input image to the encoder is not represented as a multi-frequency tensor. In this case, to compute the output of the first GoConv layer in the encoder, Equation 6.6 is modified as follows:

$$Y^H = f(X; \Phi^{H\to H}), \tag{6.14}$$

and

$$Y^L = f_{\downarrow 2}(Y^H; \Phi^{H\to L}). \tag{6.15}$$

Figure 6.4: Architecture of the proposed generalized transposed-convolution (GoTConv). $\tilde{Y}^H$ and $\tilde{Y}^L$: input high and low frequency feature maps; Act: the activation layer; $g$: regular transposed-convolution; $f_{\downarrow 2}$: regular convolution with stride 2; $g_{\uparrow 2}$: regular transposed-convolution with stride 2; $\tilde{X}^{H\to H}$ and $\tilde{X}^{L\to L}$: intra-frequency updates; $\tilde{X}^{H\to L}$ and $\tilde{X}^{L\to H}$: inter-frequency communications; $\Psi^{H\to H}$ and $\Psi^{L\to L}$: intra-frequency transposed-convolution kernels; $\Psi^{H\to L}$: inter-frequency convolution kernel; $\Psi^{L\to H}$: inter-frequency transposed-convolution kernel; $\tilde{X}^H$ and $\tilde{X}^L$: output high and low frequency feature maps.

Similarly, at the decoder side, the output of the last GoTConv is a single tensor representation, which can be formulated by modifying Equation 6.10 as:

$$\tilde{X} = \tilde{X}^{H\to H} + g_{\uparrow 2}(\tilde{X}^{L\to L}; \Psi^{L\to H}), \tag{6.16}$$

with

$$\tilde{X}^{H\to H} = g(\tilde{Y}^H; \Psi^{H\to H}), \tag{6.17}$$

and

$$\tilde{X}^{L\to L} = g(\tilde{Y}^L; \Psi^{L\to L}). \tag{6.18}$$
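As a concrete (but simplified) illustration of GoConv, the PyTorch sketch below implements Equations 6.6-6.9 with stride-2 convolutions for the inter-frequency paths; ReLU is used in place of the GDN activations, and the 3×3 kernels are assumptions made only to keep the example compact.

```python
import torch
import torch.nn as nn

class GoConvSketch(nn.Module):
    """Simplified GoConv (Eqs. 6.6-6.9): strided convolutions replace the
    fixed pooling/interpolation of the original octave convolution."""
    def __init__(self, in_ch, out_ch, alpha=0.5, kernel=3):
        super().__init__()
        in_h = int(in_ch * (1 - alpha)); in_l = in_ch - in_h
        out_h = int(out_ch * (1 - alpha)); out_l = out_ch - out_h
        pad = kernel // 2
        self.f_hh = nn.Conv2d(in_h, out_h, kernel, padding=pad)        # intra: high -> high
        self.f_ll = nn.Conv2d(in_l, out_l, kernel, padding=pad)        # intra: low -> low
        self.f_hl = nn.Conv2d(out_h, out_l, 3, stride=2, padding=1)    # inter: high -> low (f_down2)
        self.g_lh = nn.ConvTranspose2d(out_l, out_h, 3, stride=2,
                                       padding=1, output_padding=1)    # inter: low -> high (g_up2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x_h, x_l):
        y_hh = self.act(self.f_hh(x_h))            # Eq. 6.8
        y_ll = self.act(self.f_ll(x_l))            # Eq. 6.9
        y_h = y_hh + self.act(self.g_lh(y_ll))     # Eq. 6.6
        y_l = y_ll + self.act(self.f_hl(y_hh))     # Eq. 6.7
        return y_h, y_l
```

For example, with in_ch = out_ch = 192 and α = 0.5, a (1, 96, 64, 64) high-frequency map paired with a (1, 96, 32, 32) low-frequency map is mapped to outputs of the same two shapes.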

Figure 6.5: Overall framework of the proposed image compression model. H-AE and H-AD: arithmetic encoder and decoder for the high frequency latents. L-AE and L-AD: arithmetic encoder and decoder for the low frequency latents. H-CM and L-CM: the high and low frequency context models, each composed of one 5×5 masked convolution layer with 2M filters and a stride of 1. Q: represents the additive uniform noise for training, or the uniform quantizer for testing.

6.3 Multi-Frequency Entropy Model

Octave convolution is similar to the wavelet transform [13], since it uses a lower resolution for the low frequency than for the high frequency. Therefore, it can be used to improve the R-D performance in learning-based image compression frameworks. Octave convolutions store the features in a multi-frequency representation where the low frequency is stored at half the spatial resolution, which results in a higher compression ratio. Moreover, due to the effective high and low frequency communication as well as the receptive field enlargement in octave convolutions, they also improve the performance of the analysis (encoding) and synthesis (decoding) transforms in a compression framework. The overall architecture of the proposed multi-frequency image compression framework is shown in Figure 6.5. Similar to [90], our architecture is composed of two sub-networks: the core auto-encoder and the entropy sub-network. The core auto-encoder is used to learn a quantized latent vector of the input image, while the entropy sub-network is responsible for learning a probabilistic model over the quantized latent representations, which is utilized for entropy coding. We have made several improvements to the scheme in [90]. In order to handle multi-frequency entropy coding, all vanilla convolutions in the core encoder, hyper encoder, and parameter estimator are replaced by the proposed GoConv, and all vanilla transposed-convolutions in the core and hyper decoders are replaced by GoTConv. In [90], each convolution and transposed-convolution is accompanied by an activation layer (e.g., GDN/IGDN or LeakyReLU). In our architecture, we move these layers into the GoConv and GoTConv architectures and directly apply them to the inter- and intra-frequency components, as described in Section 6.2. GDN/IGDN transforms are respectively used for the GoConv and GoTConv employed in the proposed deep encoder and decoder, while LeakyReLU is utilized for the hyper auto-encoder and the parameter estimator. The convolution properties (i.e., the size and number of filters and the strides) of all networks, including the core and hyper auto-encoders, context models, and parameter estimator, are the same as in [90]. Let $x \in \mathbb{R}^{h\times w\times 3}$ be the input image. The multi-frequency latent representations, denoted by $\{y^H, y^L\}$, where $y^H \in \mathbb{R}^{\frac{h}{16}\times \frac{w}{16}\times (1-\alpha)M}$ and $y^L \in \mathbb{R}^{\frac{h}{32}\times \frac{w}{32}\times \alpha M}$, are generated using the parametric deep encoder (i.e., analysis transform) $g_e$ represented as:

$$\{y^H, y^L\} = g_e(x; \theta_{g_e}), \tag{6.19}$$

where $\theta_{g_e}$ is the parameter vector to be optimized. $M$ denotes the total number of output channels in $g_e$, which is divided into $(1-\alpha)M$ channels for the high frequency and $\alpha M$ channels for the low frequency (i.e., at half the spatial resolution of the high frequency part). The calculation in Equation 6.14 is used for the first GoConv layer, while the other encoder layers are formulated using Equation 6.6.

At the decoder side, the parametric decoder (i.e., synthesis transform) $g_d$ with the parameter vector $\theta_{g_d}$ reconstructs the image $\tilde{x} \in \mathbb{R}^{h\times w\times 3}$ as follows:

$$\tilde{x} = g_d\left(\{\tilde{y}^H, \tilde{y}^L\}; \theta_{g_d}\right), \tag{6.20}$$

with

$$\{\tilde{y}^H, \tilde{y}^L\} = Q\left(\{y^H, y^L\}\right), \tag{6.21}$$

where $Q$ represents the addition of uniform noise to the latent representations during training, or uniform quantization (i.e., the round function in this work) and arithmetic coding/decoding of the latents during testing. As illustrated in Figure 6.5, the quantized high and low frequency latents $\tilde{y}^H$ and $\tilde{y}^L$ are entropy-coded using two separate arithmetic encoders and decoders. The entropy sub-network in our architecture contains two models: a context model and a hyper auto-encoder. The context model is an auto-regressive model over the multi-frequency latent representations. Unlike the other networks in our architecture, where GoConv units are incorporated for their convolutions, we use vanilla convolutions in the context model to ensure that the causality of the contexts is not spoiled by the inter-frequency communication in GoConv. The contexts of the high and low frequency latents, denoted by $\phi_i^H$ and $\phi_i^L$, are then predicted with two separate models $f_{cm}^H$ and $f_{cm}^L$ defined as follows:

$$\phi_i^H = f_{cm}^H(\tilde{y}_{<i}^H; \theta_{cm}^H), \tag{6.22}$$

and

$$\phi_i^L = f_{cm}^L(\tilde{y}_{<i}^L; \theta_{cm}^L), \tag{6.23}$$

where $\theta_{cm}^H$ and $\theta_{cm}^L$ are the parameters to be optimized. Both $f_{cm}^H$ and $f_{cm}^L$ are composed of one 5×5 masked convolution [49] with a stride of 1. The hyper auto-encoder learns to represent side information useful for correcting the context-based predictions. The hyper auto-encoder is composed of a hyper encoder and a hyper decoder. The latents are transformed by the hyper encoder to form the hyper latent representations, which are then transmitted and added to the bit stream as side information. At the other side, the hyper decoder is responsible for transforming (decoding) the hyper latents into the parameters required for our entropy models. The spatial dependencies of $\{\tilde{y}^H, \tilde{y}^L\}$ are then captured into the multi-frequency hyper latent representations $\{z^H, z^L\}$ using the parametric hyper encoder $h_e$ (with the parameter vector $\theta_{h_e}$) defined as:

$$\{z^H, z^L\} = h_e\left(\{\tilde{y}^H, \tilde{y}^L\}; \theta_{h_e}\right). \tag{6.24}$$

The quantized hyper latents are also part of the generated bitstream that is required to be entropy-coded and transmitted. Similar to the core latents, two separate arithmetic coders are used for the quantized high and low frequency hyper latents $\tilde{z}^H$ and $\tilde{z}^L$. Given the quantized hyper latents, the side information used for the entropy model estimation is reconstructed using the hyper decoder $h_d$ (with the parameter vector $\theta_{h_d}$), formulated as:

$$\{\psi^H, \psi^L\} = h_d\left(\{\tilde{z}^H, \tilde{z}^L\}; \theta_{h_d}\right), \tag{6.25}$$

with

$$\{\tilde{z}^H, \tilde{z}^L\} = Q\left(\{z^H, z^L\}\right). \tag{6.26}$$

As shown in Figure 6.5, to estimate the mean and scale parameters required for a conditional Gaussian entropy model, the information from both the context model and the hyper decoder is combined by another network, denoted by $f_{pe}$ (with the parameter vector $\theta_{ep}$), represented as follows:

$$\{\mu_i^H, \mu_i^L, \sigma_i^H, \sigma_i^L\} = f_{pe}\left(\{\psi^H, \psi^L\}, \{\phi_i^H, \phi_i^L\}; \theta_{ep}\right), \tag{6.27}$$

where $\mu_i^H$ and $\sigma_i^H$ are the parameters for entropy modelling of the high frequency information, and $\mu_i^L$ and $\sigma_i^L$ are for the low frequency information.

6.3.1 Objective Functions

The objective function for training is composed of two terms: rate R, which is the expected length of the bitstream, and distortion D, which is the expected error between the input and reconstructed images. The R-D balance is determined by a Lagrange multiplier denoted by λ. The R-D optimization problem is then defined as follows:

$$L = R + \lambda D, \tag{6.28}$$

with

$$R = R^H + R^L, \tag{6.29}$$

and

$$D = \mathbb{E}_{x\sim p_x}\left[d(x, \tilde{x})\right], \tag{6.30}$$

where $p_x$ is the unknown distribution of natural images and $D$ can be based on any distortion metric $d$, such as MSE or MS-SSIM. $R^H$ and $R^L$ are the rates corresponding to the high and low frequency information (bitstreams), defined as follows:

$$R^H = \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\tilde{y}^H|\tilde{z}^H}(\tilde{y}^H|\tilde{z}^H)\right] + \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\tilde{z}^H}(\tilde{z}^H)\right], \tag{6.31}$$

and

$$R^L = \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\tilde{y}^L|\tilde{z}^L}(\tilde{y}^L|\tilde{z}^L)\right] + \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\tilde{z}^L}(\tilde{z}^L)\right], \tag{6.32}$$

where $p_{\tilde{y}^H|\tilde{z}^H}$ and $p_{\tilde{y}^L|\tilde{z}^L}$ are respectively the conditional Gaussian entropy models for the high and low frequency latent representations ($y^H$ and $y^L$), formulated as:

$$p_{\tilde{y}^H|\tilde{z}^H}(\tilde{y}^H|\tilde{z}^H, \theta_{hd}, \theta_{cm}, \theta_{ep}) = \prod_i \left(\mathcal{N}(\mu_i^H, {\sigma_i^H}^2) \ast \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\tilde{y}_i^H), \tag{6.33}$$

and

$$p_{\tilde{y}^L|\tilde{z}^L}(\tilde{y}^L|\tilde{z}^L, \theta_{hd}, \theta_{cm}, \theta_{ep}) = \prod_i \left(\mathcal{N}(\mu_i^L, {\sigma_i^L}^2) \ast \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\tilde{y}_i^L), \tag{6.34}$$

where each latent is modelled as a Gaussian convolved with a unit uniform distribution,¹ which ensures a good match between the encoder and decoder distributions of both the quantized and the continuous-valued latents. The mean and scale parameters $\mu_i^H$, $\sigma_i^H$, $\mu_i^L$, and $\sigma_i^L$ are generated by the network $f_{pe}$ defined in Equation 6.27. Since the compressed hyper latents $\tilde{z}^H$ and $\tilde{z}^L$ are also part of the generated bitstream, their transmission costs are also considered in the rate terms formulated in Equations 6.31 and 6.32. As in [18, 90], to model the high and low frequency hyper-priors, we assume the entries to be independent and identically distributed (i.i.d.) and fit a univariate piecewise linear density model to represent each channel $j$. The non-parametric, fully-factorized density models for the high and low frequency hyper latents are then formulated as follows:

$$p_{\tilde{z}^H|\Theta^H}(\tilde{z}^H|\Theta^H) = \prod_j \left(p_{z_j^H|\Theta_j^H}(\Theta_j^H) \ast \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\tilde{z}_j^H), \tag{6.35}$$

and

$$p_{\tilde{z}^L|\Theta^L}(\tilde{z}^L|\Theta^L) = \prod_j \left(p_{z_j^L|\Theta_j^L}(\Theta_j^L) \ast \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\tilde{z}_j^L), \tag{6.36}$$

where $\Theta^H$ and $\Theta^L$ denote the parameter vectors for the univariate distributions $p_{\tilde{z}^H|\Theta^H}$ and $p_{\tilde{z}^L|\Theta^L}$.

¹The probability distribution of the sum of two or more independent random variables is the convolution of their individual distributions [63].
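In practice, the Gaussian-convolved-with-uniform model above can be evaluated as the probability mass of a unit-width bin under the Gaussian CDF, as in the sketch below; the clamping constant is an assumed numerical-stability choice.

```python
import torch

def latent_likelihood_and_rate(y_tilde, mu, sigma):
    """Evaluate N(mu, sigma^2) * U(-1/2, 1/2) at y_tilde (Eqs. 6.33-6.34) as a
    difference of Gaussian CDFs, and return the corresponding rate in bits."""
    normal = torch.distributions.Normal(mu, sigma)
    likelihood = (normal.cdf(y_tilde + 0.5) - normal.cdf(y_tilde - 0.5)).clamp(min=1e-9)
    rate_bits = -torch.log2(likelihood).sum()     # contribution to R^H or R^L
    return likelihood, rate_bits
```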

Figure 6.6: Sample high and low frequency latent representations. Left column: original image; Middle columns: high frequency; Right column: low frequency.

6.4 Experimental Results

The ADE20K dataset [150], using images of at least 512 pixels in height or width (9272 images in total), was used for training the proposed model. All images were rescaled to $h = 256$ and $w = 256$ to have a fixed size for training. We set $\alpha = 0.5$, so that 50% of the latent representations are assigned to the low frequency part at half spatial resolution. Sample high and low frequency latent representations are shown in Figure 6.6. Considering the four layers of strided convolutions (with a stride of 2) and the output channel size $M = 192$ in the core encoder (Figure 6.5), the high and low frequency latents $y^H$ and $y^L$ will respectively be of size 16×16×96 and 8×8×96 for training. As discussed in [18], the optimal number of filters (i.e., $N$) increases with the R-D balance factor $\lambda$, which indicates that higher network capacity is required for the models with higher bit rates. As a result, in order to avoid λ-dependent performance saturation and to boost the network capacity, we set $M = N = 256$ for higher bit rates (bpp > 0.5). All models in our framework were jointly trained for 200 epochs with mini-batch SGD and a batch size of 8. The Adam solver with a learning rate of 0.00005 was fixed for the first 100 epochs, and the learning rate was gradually decreased to zero over the next 100 epochs. We compare the performance of the proposed scheme with standard codecs including JPEG, JPEG2000 [43], WebP [60], BPG (both YUV4:2:0 and YUV4:4:4 formats) [23], VTM (version 7.1) [56], and also the state-of-the-art learned image compression methods in

[81, 82, 84, 90, 151]. We use both PSNR and MS-SSIM$_{dB}$ as the evaluation metrics, where MS-SSIM$_{dB}$ represents the MS-SSIM score in dB, defined as:

$$\text{MS-SSIM}_{dB} = -10\log_{10}(1 - \text{MS-SSIM}). \tag{6.37}$$
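For example, an MS-SSIM score of 0.99 corresponds to $-10\log_{10}(1 - 0.99) = 20$ dB, and 0.999 corresponds to 30 dB, so each additional "nine" of MS-SSIM adds 10 dB on this scale.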

The comparison results on the popular Kodak image set (averaged over 24 test images) are shown in Figure 6.7. For the PSNR results, we optimized the model for the MSE loss as the distortion metric $d$ in Equation 6.28, while the perceptual MS-SSIM metric was used for the MS-SSIM results reported in Figure 6.7. In order to obtain the seven different bit rates on the R-D curve illustrated in Figure 6.7, seven models with seven different values of λ were trained.

Figure 6.7: Kodak comparison results of our approach with traditional codecs and learning-based image compression methods (Proposed, VTM, Lee2019, Lee2018, Minnen2018, Li2019, Zhou2019, BPG 4:4:4, BPG 4:2:0, JPEG2000, WebP, and JPEG) in terms of PSNR (top) and MS-SSIM (bottom) vs. bpp.

As shown in Figure 6.7, our method outperforms all standard codecs as well as the state-of-the-art learning-based image compression methods in terms of both PSNR and MS-SSIM. Compared to the new VTM and the method in [82], our approach provides ≈0.5dB better PSNR at lower bit rates (bpp < 0.4) and almost the same performance at higher rates. To the best of our knowledge, these are the best results in the literature, and this is the first time that a learning-based solution outperforms the VTM. Our method bridges the wavelet transform and deep learning-based image compression, and allows many techniques from wavelet transform research to be applied to learned image compression. This will lead to many exciting research topics in the future. One visual example from the Kodak image set is given in Figure 6.8, in which our results are qualitatively compared with JPEG2000 and BPG (YUV4:4:4 format) at 0.1bpp. As seen in the example, our method provides the highest visual quality compared to the others. JPEG2000 has poor performance due to ringing artifacts. The BPG result is smoother compared to JPEG2000, but the details and fine structures are not preserved in many areas, for example, in the patterns on the shirt and the colors around the eye.

Table 6.1: Ablation study of different components in the proposed framework. The reported PSNR, MS-SSIM, and inference time are averaged over the Kodak image set. The inference time includes the entire encoding and decoding time. BPP: bits-per-pixel (high/low: BPPs for the high and low frequency latents). Act-Out: activation layers moved out of GoConv/GoTConv. Core-Oct: proposed GoConv/GoTConv only used for the core auto-encoder. Org-Oct: GoConv/GoTConv replaced by the original octave convolutions.

                  α = 0.5           α = 0.25          α = 0.75          Act-Out   Core-Oct   Org-Oct
BPP               0.271             0.350             0.243             0.270     0.267      0.266
  (high | low)    (0.217 | 0.054)   (0.323 | 0.027)   (0.104 | 0.139)
PSNR (dB)         32.10             32.12             31.11             31.84     31.62      28.70
MS-SSIM (dB)      16.31             16.34             16.13             16.04     15.34      13.23
Size (MB)         95.2              110               84                94.9      72.1       73
Time (sec)        0.447             0.476             0.411             0.421     0.487      0.445

6.4.1 Ablation Study

In order to evaluate the performance of different components of the proposed framework, ablation studies were performed, which are reported in Table 6.1. The results are averaged over the Kodak image set. All the models reported in this ablation study were optimized for the MSE distortion metric (at a single bit rate). However, the results were evaluated with both the PSNR and MS-SSIM metrics.

(a) Original (bits-per-pixel, PSNR, MS-SSIM$_{dB}$) (b) Ours (0.10bpp, 31.70dB, 14.57dB)

(c) BPG (0.10bpp, 30.29dB, 11.72dB) (d) J2K (0.11bpp, 28.95dB, 10.98dB)

Figure 6.8: Kodak visual example (bits-per-pixel, PSNR, MS-SSIM$_{dB}$).

• Ratio of high and low frequency: in order to study varying choices of the ratio of channels allocated to the low frequency feature representation, we evaluated our model with three different ratios α ∈ {0.25, 0.5, 0.75}. As summarized in Table 6.1, assigning 50% of the latents to the low frequency part at half resolution (i.e., α = 0.5) results in the best R-D performance in both PSNR and MS-SSIM at 0.271bpp (where the contributions of the high and low frequency latents are 0.217bpp and 0.054bpp). As the ratio decreases to α = 0.25, less compression with a higher bit rate of 0.350bpp (0.323bpp for high and 0.027bpp for low frequency) is obtained, while no significant gain in the reconstruction quality is achieved. Although increasing the ratio to 75% provides better compression with 0.243bpp (high: 0.104bpp, low: 0.139bpp), it results in a significantly lower PSNR. As indicated by the model size and inference time in the table, a larger ratio results in a smaller and faster model, since less space is required to store the low frequency maps with half spatial resolution.

• Internal vs. external activation layers: in this scenario, we remove the internal activations (i.e., GDN/IGDN) employed in our proposed GoConv and GoTConv. Instead, as in the original octave convolution [42], we apply GDN to the output high and low frequency maps in GoConv, and IGDN before the input high and low frequency maps in GoTConv. This experiment is denoted by Act-Out in Table 6.1. As the comparison results indicate, the proposed architecture with internal activations (α = 0.5) provides better performance (with ≈0.26dB higher PSNR), since all internal feature maps corresponding to the inter- and intra-frequency components benefit from the activation function.

• Octave only for the core auto-encoder: as described in Section 6.3, we utilized the proposed multi-frequency entropy model for both the latents and the hyper latents. In order to evaluate the effectiveness of the multi-frequency modelling of the hyper latents, we also report results in which the proposed entropy model is only used for the core latents (denoted by Core-Oct in Table 6.1). To deal with the high and low frequency latents resulting from the multi-frequency core auto-encoder, we used two separate networks (similar to [90], with vanilla convolutions) for each of the hyper encoder, hyper decoder, and parameter estimator. As summarized in the table, a PSNR gain of ≈0.48dB is achieved when both the core and hyper auto-encoders benefit from the proposed multi-frequency model.

• Original octave convolutions: in this experiment, the performance of the proposed GoConv and GoTConv architectures is compared with that of the original octave convolutions (denoted by Org-Oct in Table 6.1). We replace all GoConv layers in the proposed framework (Figure 6.5) by original octave convolutions. For the octave transposed-convolution (Figure 6.9) used in the core and hyper decoders, we reverse the octave convolution operation, formulated as follows:

\tilde{X}^H = g(\tilde{Y}^H; \Psi^{H \to H}) + \mathrm{upsample}\big(g(\tilde{Y}^L; \Psi^{L \to H}), 2\big), \qquad (6.38)

and

\tilde{X}^L = g(\tilde{Y}^L; \Psi^{L \to L}) + g\big(\mathrm{downsample}(\tilde{Y}^H, 2); \Psi^{H \to L}\big), \qquad (6.39)

where $\{\tilde{Y}^H, \tilde{Y}^L\}$ and $\{\tilde{X}^H, \tilde{X}^L\}$ are the input and output feature maps, and $g$ denotes a vanilla transposed-convolution. Similar to the octave convolution defined in [42], average pooling and nearest-neighbour interpolation are respectively used for the downsample and upsample operations. As reported in Table 6.1, Org-Oct provides significantly lower performance than the architecture with the proposed GoConv and GoTConv, which is due to the fixed sub-sampling operations incorporated in its inter-frequency components. The PSNR and MS-SSIM of the proposed architecture are respectively ≈3.4dB and ≈3.08dB higher than those of Org-Oct at the same bit rate. A code sketch of this reversed octave operation is given below, after Figure 6.9. Note that the ratio α = 0.5 was used for the Act-Out, Core-Oct, and Org-Oct models.

Figure 6.9: The architecture of the original octave transposed-convolution. $\tilde{Y}^H$ and $\tilde{Y}^L$: input high and low frequency feature maps; $g$: regular vanilla transposed-convolution; upsample: fixed up-sampling operation (i.e., nearest-neighbour interpolation); downsample: fixed down-sampling operation (i.e., average pooling); $\tilde{X}^{H \to H}$ and $\tilde{X}^{L \to L}$: intra-frequency updates; $\tilde{X}^{H \to L}$ and $\tilde{X}^{L \to H}$: inter-frequency communications; $\Psi^{H \to H}$ and $\Psi^{L \to L}$: intra-frequency transposed-convolution kernels; $\Psi^{H \to L}$ and $\Psi^{L \to H}$: inter-frequency transposed-convolution kernels; $\tilde{X}^H$ and $\tilde{X}^L$: output high and low frequency feature maps.
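To make Eqs. (6.38) and (6.39) concrete, the following is a minimal PyTorch sketch of this reversed (transposed) octave convolution used for the Org-Oct baseline. It assumes average pooling for downsample and nearest-neighbour interpolation for upsample, as stated above; the layer sizes and names in the usage example are illustrative, not the exact configuration used in the experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OrgOctTransposedConv(nn.Module):
    """Original octave transposed-convolution (Eqs. 6.38-6.39):
    four vanilla transposed-convolutions plus fixed resampling."""

    def __init__(self, in_ch, out_ch, alpha=0.5, kernel_size=3, stride=1, padding=1):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        conv = lambda ci, co: nn.ConvTranspose2d(ci, co, kernel_size, stride, padding)
        self.h2h = conv(in_h, out_h)   # Psi^{H->H}
        self.l2h = conv(in_l, out_h)   # Psi^{L->H}
        self.l2l = conv(in_l, out_l)   # Psi^{L->L}
        self.h2l = conv(in_h, out_l)   # Psi^{H->L}

    def forward(self, y_h, y_l):
        # Eq. (6.38): X^H = g(Y^H) + upsample(g(Y^L), 2)
        x_h = self.h2h(y_h) + F.interpolate(self.l2h(y_l), scale_factor=2, mode="nearest")
        # Eq. (6.39): X^L = g(Y^L) + g(downsample(Y^H, 2))
        x_l = self.l2l(y_l) + self.h2l(F.avg_pool2d(y_h, 2))
        return x_h, x_l

# Example: 64 channels with alpha = 0.5 -> 32 high frequency (16x16) and 32 low frequency (8x8) maps.
y_h = torch.randn(1, 32, 16, 16)
y_l = torch.randn(1, 32, 8, 8)
x_h, x_l = OrgOctTransposedConv(64, 64, alpha=0.5)(y_h, y_l)
print(x_h.shape, x_l.shape)  # torch.Size([1, 32, 16, 16]) torch.Size([1, 32, 8, 8])

In contrast, the proposed GoTConv replaces the fixed resampling above with learned strided transposed-convolutions and places the IGDN activations on the internal inter- and intra-frequency branches, which is exactly the difference measured by the Org-Oct and Act-Out rows of Table 6.1.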

As explained in Section 6.2, since the proposed GoConv and GoTConv are designed as generic, plug-and-play units, they can be used in any auto-encoder-based architecture. In the following sections, the use of GoConv and GoTConv in image semantic segmentation and image denoising is analyzed.

6.5 Multi-Frequency Image Semantic Segmentation

In this experiment, we evaluate the proposed GoConv/GoTConv units on image semantic segmentation. The popular UNet model [109] is used as the baseline. UNet is a CNN-based architecture originally developed for segmenting biomedical images, but it is also widely used as a baseline for general image semantic segmentation. UNet follows an auto-encoder scheme composed of two paths: the contracting path (i.e., the encoder) and the expanding path (i.e., the decoder). The encoder captures the context in the image, while the decoder

is used to enable precise localization. The architecture of the UNet auto-encoder is summarized in Table 6.2. All the vanilla convolution layers used in the encoder and decoder are followed by batch normalization and ReLU layers.

Table 6.2: UNet architecture (Conv: vanilla convolution; T-Conv: vanilla transposed- convolution).

Encoder                   Decoder
Conv (3*3, 64, s1)        T-Conv (3*3, 512, s2)
Maxpool (2*2, s2)         Conv (3*3, 512, s1)
Conv (3*3, 128, s1)       T-Conv (3*3, 256, s1)
Maxpool (2*2, s2)         Conv (3*3, 256, s1)
Conv (3*3, 256, s1)       T-Conv (3*3, 128, s1)
Maxpool (2*2, s2)         Conv (3*3, 128, s1)
Conv (3*3, 512, s1)       T-Conv (3*3, 64, s2)
Maxpool (2*2, s2)         Conv (3*3, 64, s1)
Conv (3*3, 1024, s1)      Conv (3*3, 19, s1)
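For clarity, the sketch below shows how the encoder column of Table 6.2 translates into code, with each convolution followed by batch normalization and ReLU as stated above. It is a minimal reading of the table (skip connections and the decoder are omitted, and a 3-channel RGB input is assumed), not the exact implementation used in the thesis.

import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # "Conv (3*3, C, s1)" rows of Table 6.2, each followed by BN and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Encoder column of Table 6.2: Conv/Maxpool stages with 64 -> 1024 channels.
unet_encoder = nn.Sequential(
    conv_bn_relu(3, 64),    nn.MaxPool2d(2, stride=2),
    conv_bn_relu(64, 128),  nn.MaxPool2d(2, stride=2),
    conv_bn_relu(128, 256), nn.MaxPool2d(2, stride=2),
    conv_bn_relu(256, 512), nn.MaxPool2d(2, stride=2),
    conv_bn_relu(512, 1024),
)

x = torch.randn(1, 3, 256, 256)   # a Cityscapes-sized crop, for illustration only
print(unet_encoder(x).shape)      # torch.Size([1, 1024, 16, 16])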

In this study, we build a multi-frequency UNet model, denoted by GoConv-UNet, by replacing all the UNet convolution layers with GoConv and all transposed-convolutions with GoTConv. The other properties, such as the number of layers, filters, and strides, are kept the same as in the original UNet. In order to compare the performance of the original octave convolutions with our proposed GoConv/GoTConv, we build another model, denoted by OrgOct-UNet, in which the original octave convolution units (shown in Figures 6.2 and 6.9) are employed.

All the experiments in this section are performed on the Cityscapes dataset [45], which contains 2974 training images with 19 semantic labels. All models were trained with the cross-entropy loss function for 100 epochs. The models were then evaluated on the Cityscapes validation set (500 images) in terms of pixel accuracy, mean intersection-over-union (IoU), number of parameters, and number of FLOPs. The mean IoU (mIoU) is calculated by averaging the IoU values over all semantic classes; a sketch of this computation is given at the end of this section.

Table 6.3 presents the comparison results of the original UNet and the proposed multi-frequency OrgOct-UNet and GoConv-UNet models with different α's. All the reported values are averaged over the Cityscapes validation set. As given in the table, the GoConv-UNet model with α = 0.125 achieves the highest accuracy and mIoU. Compared to the original UNet model, GoConv-UNet (α = 0.125) requires ≈0.7M fewer parameters and achieves a 25% reduction in FLOPs, which clearly shows the benefit of the proposed GoConv and GoTConv units in multi-frequency semantic segmentation. Although larger α's can further reduce the FLOPs, they have a negative impact on the model performance. OrgOct-UNet has the lowest number of parameters and FLOPs compared to the other models; however, it achieves the lowest performance, with no improvement over UNet, which is basically due to the fixed sub-sampling operations used in the original octave convolutions.

Table 6.3: Comparison results for image semantic segmentation on the Cityscapes dataset. UNet: original UNet architecture; GoConv-UNet: multi-frequency UNet with GoConv/GoTConv; OrgOct-UNet: multi-frequency UNet with original octave units; Ratio (α): the LF ratio used in octave convolutions; mIoU: mean intersection-over-union; Params (M): the number of model parameters (in millions).

             UNet     GoConv-UNet              OrgOct-UNet
Ratio (α)    0.0      0.5     0.25    0.125    0.5     0.125
Accuracy     0.787    0.781   0.789   0.791    0.776   0.784
mIoU         0.402    0.381   0.401   0.414    0.379   0.399
Params (M)   34.54    32.96   33.35   33.84    32.85   33.73
FLOPs (G)    103.44   46.77   70.60   85.80    45.64   85.46

The visual comparison of UNet and GoConv-UNet is given in Figures 6.10 and 6.11. As shown in the results, higher quality semantic segmentations are obtained with GoConv-UNet. For example, the pedestrian with the backpack in the top example, and the trees, the board, and the wall on the right side of the bottom example, are more accurately segmented with GoConv-UNet.

Figure 6.10: Visual example 1 from Cityscapes: (a) input image; (b) ground truth; (c) GoConv-UNet (Acc: 0.86, mIoU: 0.37); (d) UNet (Acc: 0.84, mIoU: 0.31). GoConv-UNet: GoConv/GoTConv-based multi-frequency UNet with α = 0.125; UNet: original UNet architecture; Acc: pixel accuracy; mIoU: mean IoU.

Figure 6.11: Visual example 2 from Cityscapes: (a) input image; (b) ground truth; (c) GoConv-UNet (Acc: 0.89, mIoU: 0.38); (d) UNet (Acc: 0.86, mIoU: 0.34). Abbreviations as in Figure 6.10.
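As referenced above, the following is a minimal sketch of the per-class IoU and mIoU computation over the 19 Cityscapes classes. The class count comes from the text; the ignore label of 255 is an assumption (it is the usual Cityscapes convention), and the metric is shown per image for simplicity, whereas dataset-level numbers such as those in Table 6.3 are typically accumulated over all validation images before averaging.

import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_label=255):
    """Per-class IoU averaged over all classes (mIoU).

    pred, target: integer label maps of the same shape.
    ignore_label: ground-truth pixels with this value are excluded
                  (255 is assumed here, as is common for Cityscapes).
    """
    valid = target != ignore_label
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy usage with random label maps of a Cityscapes-like size.
rng = np.random.default_rng(0)
pred = rng.integers(0, 19, size=(512, 1024))
gt = rng.integers(0, 19, size=(512, 1024))
print(round(mean_iou(pred, gt), 3))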

6.6 Multi-Frequency Image Denoising

In this experiment, we build a simple convolutional auto-encoder and use it for the image denoising problem [86, 131, 140], in which we try to recover images corrupted by white Gaussian noise, a common artifact of many acquisition channels. The architecture of the auto-encoder used in this experiment is summarized in Table 6.4, where the encoder and decoder are respectively composed of a sequence of vanilla convolutions and transposed-convolutions, each followed by batch normalization and ReLU.

Table 6.4: Baseline convolutional auto-encoder for image denoising (Conv: vanilla convolution; T-Conv: vanilla transposed-convolution).

Encoder                   Decoder
Conv (3*3, 32, s1)        T-Conv (3*3, 128, s2)
Conv (3*3, 32, s1)        T-Conv (3*3, 128, s1)
Conv (3*3, 64, s1)        T-Conv (3*3, 64, s1)
Conv (3*3, 64, s2)        T-Conv (3*3, 64, s2)
Conv (3*3, 128, s1)       T-Conv (3*3, 32, s1)
Conv (3*3, 128, s1)       T-Conv (3*3, 32, s1)
Conv (3*3, 256, s2)       T-Conv (3*3, 3, s1)

We performed our experiments on the MNIST and CIFAR10 datasets. After 100 epochs of training, the baseline achieved average PSNRs of 23.19dB and 23.29dB on the MNIST and CIFAR10 test sets, respectively. In order to analyze the performance of GoConv and GoTConv in this experiment, we replaced all vanilla convolution layers with GoConv and all transposed-convolutions with GoTConv. The other properties of the encoder and decoder networks (e.g., the number of layers, filters, and strides) were kept the same as the baseline in Table 6.4. We set α = 0.125 and trained the model for 100 epochs. On MNIST, the multi-frequency auto-encoder achieved an average PSNR of 23.20dB, almost the same as the baseline with vanilla convolutions. On CIFAR10, however, it achieved an average PSNR of 23.54dB, which is 0.25dB higher than the baseline, thanks to the effective communication between the high and low frequencies. In addition, the proposed multi-frequency auto-encoder has fewer parameters and FLOPs than the baseline model, which indicates the benefit of octave convolutions in reducing parameters and operations with no performance loss. The comparison results are presented in Table 6.5, and a sketch of the noise model and PSNR evaluation used here is given below.

Table 6.5: Comparison results of the baseline and multi-frequency auto-encoders for image denoising on the MNIST and CIFAR10 test sets.

              Baseline            Multi-Frequency
              MNIST    CIFAR10    MNIST    CIFAR10
PSNR (dB)     23.19    23.29      23.20    23.54
Params (M)    1.16     1.17       1.10     1.14
FLOPs (G)     0.20     0.23       0.17     0.20

Figure 6.12: Sample image denoising results from the MNIST test set: (a) original input images; (b) input noisy images; (c) baseline denoised results; (d) multi-frequency denoised results.
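As referenced above, the evaluation protocol of this section (corrupt with white Gaussian noise, denoise, report PSNR) can be sketched as follows. The noise standard deviation and the [0, 1] image range are assumptions, since the exact settings are not restated here.

import torch

def add_white_gaussian_noise(images, sigma=0.1):
    """Corrupt images in [0, 1] with additive white Gaussian noise (sigma is assumed)."""
    noisy = images + sigma * torch.randn_like(images)
    return noisy.clamp(0.0, 1.0)

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two image batches in [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Toy usage: an identity "denoiser" simply returns the noisy input.
clean = torch.rand(8, 3, 32, 32)                       # CIFAR10-sized batch
noisy = add_white_gaussian_noise(clean)
denoised = noisy                                       # replace with the trained auto-encoder
print(f"PSNR: {psnr(denoised, clean).item():.2f} dB")  # average PSNR reported per test set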

Figure 6.13: Sample image denoising results from the CIFAR10 test set: (a) original input images; (b) input noisy images; (c) baseline denoised results; (d) multi-frequency denoised results.

Eight visual examples from the MNIST and CIFAR10 test sets are given in Figures 6.12 and 6.13. Compared to the baseline model with vanilla convolutions and transposed-convolutions, the multi-frequency model with the proposed GoConv and GoTConv yields higher visual quality in the denoised images (e.g., the red car in the second column from the right in Figure 6.13).

Chapter 7

Conclusion

In this thesis, two important problems in deep learning-based multimedia content processing were studied: MIR (focused on music transcription and generation tasks) and image compression.

In Chapter 2, it was shown that the visual analysis of music performances can be very useful for dealing with MIR problems such as AMT. A learning-based visual approach for transcribing piano music in real time was presented. The proposed CNN-SVM approach achieved F1 scores of 0.94 and 0.96 in the classification of the pressed black and white keys, respectively. Compared to the state-of-the-art, improvements of over 0.44 (black keys) and 0.29 (white keys) in F1 score were achieved under non-ideal camera view, hand coverage, and illumination conditions. In order to adapt our classifiers to new patterns in data added to the training set, an incremental learning strategy named warm start was employed to efficiently retrain and update the models. A new dataset for research in this area was also created. The proposed system has a number of limitations, such as intense darkness or brightness of the images, sharp lighting changes during the performance, extreme camera views, expressive aspects of music (e.g., dynamics), and camera movements. Some of these limitations are intrinsic to the approach and cannot be removed. Others could be addressed using more sophisticated, and often more time-consuming, algorithms (e.g., keyboard tracking when the camera or the keyboard moves); we chose not to use such approaches in order to allow real-time transcription. One potential direction of this work is to utilize both audio and visual information to analyze music content. Such multi-modal systems can improve the performance of music transcription as well as other MIR tasks, especially when machine/deep learning approaches are used.

A semi-recurrent VAE-GAN model for generating sequential music data was presented in Chapter 3. The model consisted of three networks, an encoder, a generator, and a discriminator, in which convolutions were utilized to spatially learn the local correlations of the data in individual frames. Each frame was sampled from a latent distribution obtained by mapping the previous frame with the encoder. As a consequence, the consistency between the frames in a generated sequence was also preserved. Our experiments on piano music generation presented promising results, comparable to the state-of-the-art. One potential direction of this work is to use this framework for modelling and generating other types of sequential data such as video.

In Chapter 4, we introduced a deep semantic segmentation-based layered image compression framework in which the semantic segmentation map of the input image was used to synthesize the image, and the residual was encoded as an enhancement layer in the bit-stream. Experimental results showed that the proposed framework outperforms the H.265/HEVC-based BPG and the other standard codecs in both PSNR and MS-SSIM metrics in the RGB (4:4:4) domain. In addition, since the semantic segmentation map is included in the bit-stream, the proposed scheme can facilitate many other tasks such as image search and object-based adaptive image compression. The proposed scheme opens up many future topics, for example, improving its high-rate performance, modifying the scheme for YUV-coded images, and applying the framework to other tasks.

Following that, in Chapter 5, we proposed a new variable-rate image compression framework by applying GDN layers and incorporating GDN-based shortcut connections in the ResNet. We also used a stochastic rounding-based scalar quantization. To further improve the performance, the residual between the input and the reconstructed image from the decoder network was encoded by BPG as an enhancement layer. A novel variable-rate objective function was also proposed. Experimental results showed that our variable-rate model can outperform all standard codecs, including BPG, in MS-SSIM, as well as state-of-the-art learning-based variable-rate methods in both PSNR and MS-SSIM. Further gains could be achieved by optimizing multiple networks for different bit rates independently. Another future topic is the rate allocation optimization between the base layer and the enhancement layer.

Finally, a new multi-frequency image compression scheme with octave convolutions was proposed in Chapter 6, in which the latents were factorized into high and low frequencies, and the low frequency was stored at a lower resolution to reduce its spatial redundancy. To retain the spatial structure of the input, novel generalized octave convolution and transposed-convolution architectures, denoted by GoConv and GoTConv, were also introduced. Our experiments showed that the proposed method significantly improves the R-D performance and achieves the new state of the art in learned image compression. Further improvements can be achieved by multi-resolution factorization of the latents into a sequence of high to low frequencies, as in the wavelet transform. This work links the research fields of wavelet transforms and learned image compression, which leads to many other research topics in the future. Another potential direction is to employ the proposed GoConv/GoTConv in other CNN-based architectures, particularly auto-encoder-based schemes such as image denoising and semantic segmentation.

Bibliography

[1] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc V. Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. arXiv preprint arXiv:1704.00648, 2017.

[2] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. International Conference on Computer Vision (ICCV), 2019.

[3] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying support vector machines to imbalanced datasets. In European conference on machine learning, pages 39–50, 2004.

[4] Mohammad Akbari. claVision: Visual automatic piano music transcription. Master’s thesis, University of Lethbridge, Lethbridge, Alberta, Canada, 2014.

[5] Mohammad Akbari and Howard Cheng. claVision: Visual automatic piano music transcription. In International Conference on New Interfaces for Musical Expression (NIME), pages 313–314, Baton Rouge, Louisiana, USA, 2015.

[6] Mohammad Akbari and Howard Cheng. Real-time piano music transcription based on computer vision. Transactions on Multimedia, 17(12):2113–2121, 2015.

[7] Mohammad Akbari and Howard Cheng. Methods and systems for visual music tran- scription, 2016. US Patent 9,418,637.

[8] Mohammad Akbari and Jie Liang. Semi-recurrent cnn-based VAE-GAN for sequential data generation. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2321–2325, 2018.

[9] Mohammad Akbari, Jie Liang, and Howard Cheng. A real-time system for online learning-based visual transcription of piano music. Multimedia Tools and Applications, 77(19):25513–25535, 2018.

[10] Mohammad Akbari, Jie Liang, and Jingning Han. DSSLIC: Deep semantic segmentation-based layered image compression. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2042–2046, 2019.

[11] Mohammad Akbari, Jie Liang, Jingning Han, and Chengjie Tu. Learned variable-rate image compression with residual divisive normalization. arXiv preprint arXiv:1912.05688, 2019.

[12] Mohammad Akbari, Jie Liang, Jingning Han, and Chengjie Tu. Generalized octave convolutions for learned multi-frequency image compression. arXiv preprint arXiv:2002.10032, 2020.

[13] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image coding using wavelet transform. Transactions on image processing (TIP), 1(2):205–220, 1992.

[14] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[15] Johannes Ballé. Efficient nonlinear transforms for lossy image compression. In 2018 Picture Coding Symposium (PCS), pages 248–252. IEEE, 2018.

[16] Johannes Ballé, Philip A. Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin Hwang, and George Toderici. Nonlinear transform cod- ing. arXiv preprint arXiv:2007.03034, 2020.

[17] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimization of nonlinear transform codes for perceptual quality. In Picture Coding Symposium (PCS), pages 1–5, 2016.

[18] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick John- ston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.

[19] Babu Kaji Baniya and Joonwhoan Lee. Importance of audio feature reduction in automatic music genre classification. Multimedia Tools and Applications, 75(6):3013– 3026, 2016.

[20] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Cvae-gan: fine- grained image generation through asymmetric training. International Conference on Computer Vision (ICCV), pages 2745–2754, 2017.

[21] Dominikus Baur, Frederik Seiffert, Michael Sedlmair, and Sebastian Boring. The streams of our lives: Visualizing listening histories in context. Transactions on Visu- alization and Computer Graphics, 16(6):1119–1128, 2010.

[22] Alessio Bazzica, Cynthia CS Liem, and Alan Hanjalic. On detecting the playing/non- playing activity of musicians in symphonic music videos. Computer Vision and Image Understanding, 144:188–204, 2016.

[23] Fabrice Bellard. Bpg image format (http://bellard.org/bpg/), 2020.

[24] Asa Ben-Hur and Jason Weston. A user’s guide to support vector machines. Data mining techniques for the life sciences, pages 223–239, 2010.

[25] Emmanouil Benetos and Simon Dixon. A shift-invariant latent variable model for automatic music transcription. Computer Music Journal, 36(4):81–94, 2012.

[26] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems, 41:407–434, 2013.

[27] Emmanouil Benetos and Tillman Weyde. An efficient temporally-constrained probabilistic model for multiple-instrument music transcription. International Society for Music Information Retrieval, 2015.

[28] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[29] Nancy Bertin, Roland Badeau, and Emmanuel Vincent. Enforcing harmonicity and smoothness in bayesian non-negative matrix factorization applied to polyphonic music transcription. Transactions on Audio, Speech, and Language Processing, 18(3):538–549, 2010.

[30] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.

[31] Nastaran Borjian, Ehsanollah Kabir, Sanaz Seyedin, and Ellips Masehian. A query-by-example music retrieval system using feature and decision fusion. Multimedia Tools and Applications, pages 1–25, 2017.

[32] Nicolas Boulanger-Lewandowski, Pascal Vincent, Yoshua Bengio, Patrick Gray, and Chinmaya Naguri. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

[33] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep learning techniques for music generation - a survey. arXiv preprint arXiv:1709.01620, 2017.

[34] Steven Brown. The perpetual music track: The phenomenon of constant musical imagery. Journal of Consciousness Studies, 13(6):43–62, 2006.

[35] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121–124, 2012.

[36] Chunlei Cai, Li Chen, Xiaoyun Zhang, and Zhiyong Gao. Efficient variable rate image compression with multi-scale decomposition network. Transactions on Circuits and Systems for Video Technology, 2018.

[37] Xizheng Cao, Lin Sun, Jingwen Niu, Ruiqi Wu, Yanmei Liu, and Huijuan Cai. Automatic composition of happy melodies based on relations. Multimedia Tools and Applications, 74(21):9097–9115, 2015.

[38] Ali Taylan Cemgil, Hilbert J. Kappen, and David Barber. A generative model for music transcription. Transactions on Audio, Speech, and Language Processing, 14(2):679–694, 2006.

[39] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[40] Hong-Yi Chang, Shih-Chang Huang, and Jia-Hao Wu. A personalized music recommendation system based on electroencephalography feedback. Multimedia Tools and Applications, pages 1–20, 2016.

[41] Nitesh V. Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1):1–6, 2004.

[42] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In International Conference on Computer Vision (ICCV), pages 3435–3444, 2019.

[43] Charilaos Christopoulos, Athanassios Skodras, and Touradj Ebrahimi. The JPEG2000 still image coding system: an overview. Transactions on consumer electronics, 46(4):1103–1127, 2000.

[44] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accu- rate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[45] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[46] Débora C. Corrêa and Francisco AP. Rodrigues. A survey on symbolic data-based music genre classification. Expert Systems with Applications, 60:190–210, 2016.

[47] Roger B. Dannenberg. Music representation issues, techniques, and systems. Com- puter Music Journal, 17(3):20–30, 1993.

[48] Manuel Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis. Bayesian Statistics, 7:105–124, 2003.

[49] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems (NIPS), pages 4790–4798, 2016.

[50] Emily L. Denton, Soumith Chintala, and Rob Fergus. Deep generative image models using laplacian pyramid of adversarial networks. In Advances in neural information processing systems (NIPS), pages 1486–1494, 2015.

[51] J. Stephen Downie. Music information retrieval. Annual review of information science and technology, 37(1):295–340, 2003.

[52] Zhiyao Duan, Bryan Pardo, and Changshui Zhang. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. Transactions on Audio, Speech, and Language Processing, 18(8):2121–2133, 2010.

[53] Douglas Eck and Juergen Schmidhuber. A first look at music composition using lstm recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, 103:48, 2002.

[54] Otto Fabius and Joost R. Van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.

[55] M. A. H. Farquad and Indranil Bose. Preprocessing unbalanced data using support vector machine. Decision Support Systems, 53(1):226–233, 2012.

[56] Fraunhofer HHI. VVC official test model VTM. https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM, 2020.

[57] Christian Frisson, Loïc Reboursière, Wen-Yang Chu, Otso Lähdeoja, John Ander- son Mills III, Cécile Picard, Ao Shen, and Todor Todoroff. Multimodal guitar: Perfor- mance toolbox and study workbench. QPSR Numediart Research Program, 2(3):67– 84, 2009.

[58] Mengyue Geng, Yaowei Wang, Yonghong Tian, and Tiejun Huang. Cnusvm: Hybrid cnn-uneven svm model for imbalanced visual learning. In International Conference on Multimedia Big Data (BigMM), pages 186–193, 2016.

[59] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems (NIPS), pages 2672–2680, 2014.

[60] Google Inc. WebP (https://developers.google.com/speed/webp/), 2020.

[61] Dmitry O. Gorodnichy and Arjun Yogeswaran. Detection and tracking of pianist hands and fingers. Canadian Conference on Computer and Robot Vision, page 63, 2006.

[62] R. M. Gray and T. G. Stockham. Dithered quantizers. IEEE Transactions on Infor- mation Theory, 39(3):805–812, 1993.

[63] Charles Miller Grinstead and James Laurie Snell. Introduction to probability. Ameri- can Mathematical Society, 2012.

[64] Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.

[65] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning (ICML), pages 1737–1746, 2015.

[66] Salvador Gutiérrez and Salvador García. Landmark-based music recognition sys- tem optimisation using genetic algorithms. Multimedia Tools and Applications, 75(24):16905–16922, 2016.

[67] Amaury Habrard, José Manuel Iñesta, David Rizo, and Marc Sebban. Melody recog- nition with learned edit distances. In Joint IAPR International Workshops on Statis- tical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 86–96, 2008.

[68] Gaëtan Hadjeres, Frank Nielsen, and François Pachet. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. Symposium Series on Computational Intelligence (SSCI), pages 1–7, 2017.

[69] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. DeepBach: a steerable model for bach chorales generation. International Conference on Machine Learning (ICML), 70:1362–1371, 2017.

[70] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.

[71] Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung Jin Hwang, Joel Shor, and George Toderici. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. Conference on computer vision and pattern recognition (CVPR), pages 4385–4393, 2018.

[72] Alexey Dosovitskiy Jost Tobias Springenberg, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[73] Andrej Karpathy. Convnetsharp. https://github.com/cbovar/ConvNetSharp, 2016.

[74] Rahul Katarya and Om Prakash Verma. Efficient music recommender system using context graph and particle swarm. Multimedia Tools and Applications, 77(2):2673– 2687, 2018.

[75] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[76] Anssi P. Klapuri. Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. Transactions on Speech and Audio Processing, 11(6):804– 816, 2003.

[77] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

[78] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

[79] Pavel Laskov, Christian Gehl, Stefan Krüger, and Klaus-Robert Müller. Incremen- tal support vector learning: Analysis, implementation and applications. Journal of machine learning research, 7:1909–1936, 2006.

[80] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted boltzmann machines and constraints. arXiv preprint arXiv:1612.04742, 2016.

[81] Jooyoung Lee, Seunghyun Cho, and Seung-Kwon Beack. Context-adaptive entropy model for end-to-end optimized image compression. arXiv preprint arXiv:1809.10452, 2018.

[82] Jooyoung Lee, Seunghyun Cho, and Munchurl Kim. A hybrid architecture of jointly learning image compression and quality enhancement with improved entropy minimization. arXiv preprint arXiv:1912.12817, 2019.

[83] Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma. Creat- ing a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. Transactions on Multimedia, 21(2):522–535, 2018.

[84] Mu Li, Wangmeng Zuo, Shuhang Gu, Jane You, and David Zhang. Learning content- weighted deep image compression. arXiv preprint arXiv:1904.00664, 2019.

[85] Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region newton method for logistic regression. Journal of Machine Learning Research, 9(Apr):627–650, 2008.

[86] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo. Multi-level wavelet-cnn for image restoration. In Conference on computer vision and pattern recognition Workshops (CVPRW), pages 773–782, 2018.

[87] Sihui Luo, Yezhou Yang, Yanling Yin, Chengchao Shen, Ya Zhao, and Mingli Song. DeepSIC: Deep semantic image compression. International Conference on Neural Information Processing, pages 96–106, 2018.

[88] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[89] Anabel Maler. Songs for hands: Analyzing interactions of sign language and music. Music Theory Online, 19(1), 2013.

[90] David Minnen, Johannes Ballé, and George D. Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Advances in neural information processing systems (NIPS), pages 10771–10780, 2018.

[91] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[92] Olof Mogren. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.

[93] Loris Nanni, Yandre MG Costa, Alessandra Lumini, Moo Young Kim, and Seung Ryul Baek. Combining visual and acoustic features for music genre classification. Expert Systems with Applications, 45:108–117, 2016.

[94] Akiya Oka and Manabu Hashimoto. Marker-less piano fingering recognition using sequential depth images. Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), pages 1–4, 2013.

[95] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[96] Marco Paleari, Benoit Huet, Antony Schutz, and Dirk Slock. A multimodal approach to music transcription. In International Conference on Image Processing (ICIP), pages 93–96, 2008.

[97] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), pages 1310–1318, 2013.

[98] Paul H. Peeling and Simon J. Godsill. Multiple pitch estimation using non-homogeneous Poisson processes. Journal of Selected Topics in Signal Processing, 6(5):1133–1143, 2011.

[99] Antonio Pertusa and José M. Iñesta. Polyphonic monotimbral music transcription using dynamic networks. Pattern Recognition Letters, 26(12):1809–1818, 2005.

[100] Michael Poast. Color music: Visual color notation for musical expression. Leonardo, 33(3):215–221, 2000.

[101] Garry Quested, Roger Boyle, and Kia Ng. Polyphonic note tracking using multimodal retrieval of musical events. In International Computer Music Conference (ICMC), 2008.

[102] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[103] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. In 2015 International Conference on Learning Representations (ICLR), 2015.

[104] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Conference on computer vision and pattern recognition Workshops (CVPRW), pages 806–813, 2014.

[105] Loïc Reboursière, Christian Frisson, Otso Lähdeoja, John A. Mills, Cécile Picard-Limpens, and Todor Todoroff. MultimodalGuitar: a toolbox for augmented guitar performances. In Proceedings of the New Interfaces for Musical Expression++ (NIME++), 2010.

[106] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), pages 1060–1069, 2016.

[107] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[108] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. International Conference on Machine Learning (ICML), 70:2922–2930, 2017.

[109] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.

[110] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems (NIPS), pages 2234–2242, 2016.

[111] Shibani Santurkar, David Budden, and Nir Shavit. Generative compression. Picture Coding Symposium (PCS), pages 258–262, 2018.

[112] Andy Sarroff and Michael A. Casey. Musical audio synthesis using autoencoding neural nets. In International Society for Music Information Retrieval Conference (ISMIR), 2014.

[113] Joseph Scarr and Richard Green. Retrieval of guitarist fingering information us- ing computer vision. International Conference of Image and Vision Computing New Zealand (IVCNZ), pages 1–7, 2010.

[114] Alexander Schindler and Andreas Rauber. Harnessing music-related visual stereo- types for music information retrieval. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):20, 2016.

[115] Rodrigo A. Seger, Marcelo M. Wanderley, and Alessandro L. Koerich. Automatic detection of musicians’ ancillary gestures based on video analysis. Expert Systems with Applications, 41(4):2098–2106, 2014.

[116] Siddharth Sigtia, Emmanouil Benetos, Nicolas Boulanger-Lewandowski, Tillman Weyde, Artur S. d’Avila Garcez, and Simon Dixon. A hybrid recurrent neural network for music transcription. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2061–2065, 2015.

[117] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An end-to-end neural net- work for polyphonic piano music transcription. Transactions on Audio, Speech and Language Processing (TASLP), 24(5):927–939, 2016.

[118] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[119] Jon Sneyers and Pieter Wuille. FLIF: Free lossless image format based on maniac compression. In International Conference on Image Processing (ICIP), pages 66–70. IEEE, 2016.

[120] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output represen- tation using deep conditional generative models. In Advances in neural information processing systems (NIPS), pages 3483–3491, 2015.

[121] Manitsaris Sotirios and Pekos Georgios. Computer vision method for pianist’s fingers information retrieval. In International Conference on Information Integration and Web-based Applications & Services, pages 604–608, 2008.

[122] Cézar Roberto Souza. Accord .net framework. http://www.accord-framework.net, 2014.

[123] Sebastian Stober and Andreas Nürnberger. Adaptive music retrieval–a state of the art. Multimedia Tools and Applications, 65(3):467–494, 2013.

[124] Potcharapol Suteparuk. Detection of piano keys pressed in video. Technical report, Department of Computer Science, Stanford University, April 2014.

[125] Raj Talluri, Karen Oehler, Tom Barmon, Jon D. Courtney, Arnab Das, and Judy Liao. A robust, scalable, object-based video compression technique for very low bit-rate coding. Transactions on Circuits and Systems for Video Technology, 7(1):221–233, 1997.

[126] Tiago F. Tavares, Gabrielle Odowichuck, Sonmaz Zehtabi, and George Tzanetakis. Audio-visual vibraphone transcription in real time. In International Workshop on Multimedia Signal Processing (MMSP), pages 215–220, 2012.

[127] Tiago Fernandes Tavares, Jayme Garcia Arnal Barbedo, Romis Attux, and Amauri Lopes. Survey on automatic transcription of music. Journal of the Brazilian Computer Society, 19(4):589–604, 2013.

[128] Pat Taweewat and Chai Wutiwiwatchai. Musical pitch estimation using a supervised single hidden layer feed-forward neural network. Expert Systems with Applications, 40(2):575–589, 2013.

[129] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.

[130] William Forde Thompson, Phil Graham, and Frank A. Russo. Seeing music perfor- mance: Visual influences on perception and experience. Semiotica, 2005(156):203–227, 2005.

[131] Chunwei Tian, Lunke Fei, Wenxian Zheng, Yong Xu, Wangmeng Zuo, and Chia-Wen Lin. Deep learning on image denoising: An overview. arXiv preprint arXiv:1912.13171, 2019.

[132] Alexey Tikhonov and Ivan P. Yamshchikov. Music generation with variational recur- rent autoencoder supported by history. arXiv preprint arXiv:1705.05458, 2017.

[133] George Toderici, Sean M. O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085, 2015.

[134] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. In Conference on computer vision and pattern recognition (CVPR), pages 5435–5443. IEEE, 2017.

[135] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Tim- ofte, and Luc Van Gool. Towards image understanding from deep compression without decoding. arXiv preprint arXiv:1803.06131, 2018.

[136] Cheng-Hao Tsai, Chieh-Yen Lin, and Chih-Jen Lin. Incremental and decremental training for linear classification. In International conference on Knowledge discovery and data mining, pages 343–352, 2014.

[137] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. Conference on computer vision and pattern recognition (CVPR), pages 8798–8807, 2018.

[138] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image qual- ity assessment: from error visibility to structural similarity. Transactions on image processing (TIP), 13(4):600–612, 2004.

[139] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398–1402. Ieee, 2003.

[140] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in neural information processing systems (NIPS), pages 341–349, 2012.

[141] Jie Yang and Yangsheng Xu. Hidden Markov model for gesture recognition. Technical report, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 1994.

[142] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation using 1d and 2d conditions. arXiv preprint arXiv:1703.10847, 2017.

[143] Kazuyoshi Yoshii and Masataka Goto. A nonparametric bayesian multipitch analyzer based on infinite latent harmonic allocation. Transactions on Audio, Speech, and Language Processing, 20(3):717–730, 2012.

[144] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI Conference on Artificial Intelligence, pages 2852–2858, 2017.

[145] Bingjun Zhang, Ye Wang Jia Zhu, and Wee Kheng Leow. Visual analysis of fingering for pedagogical violin transcription. International Conference on Multimedia, pages 521–524, 2007.

[146] Bingjun Zhang and Ye Wang. Automatic music transcription using audio-visual fu- sion for violin practice in home environment. Technical Report TRA7/09, School of Computing, National University of Singapore, 2009.

[147] Zhizheng Zhang, Zhibo Chen, Jianxin Lin, and Weiping Li. Learned scalable im- age compression with bidirectional context disentanglement network. arXiv preprint arXiv:1812.09443, 2018.

[148] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restora- tion with neural networks. Transactions on Computational Imaging, 3(1):47–57, 2017.

[149] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Conference on computer vision and pattern recognition (CVPR), pages 2881–2890, 2017.

[150] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Conference on computer vision and pattern recognition (CVPR), pages 633–641, 2017.

[151] Jing Zhou, Sihan Wen, Akira Nakagawa, Kimihiko Kazui, and Zhiming Tan. Multi- scale and context-adaptive entropy model for image compression. arXiv preprint arXiv:1910.07844, 2019.
