Towards audio-conditioned generation of music artwork

Jorge Bustos

Supervisors: Perfecto Herrera and Joan Serrà

September 28, 2020

This work is licensed under a Creative Commons "Attribution-NonCommercial-ShareAlike 4.0 International" license.

Contents

1 Introduction
  1.1 Motivation
  1.2 Objective and structure

2 State of the art
  2.1 Album, the sum of cover and music
    2.1.1 Album covers throughout time
    2.1.2 Album covers by genre
    2.1.3 Conclusion
  2.2 Generative Systems and Generative Models
    2.2.1 Audio and visual generation examples
    2.2.2 Image generation based on audio
    2.2.3 Conclusion
  2.3 Deep Learning and Generative models
    2.3.1 Representation learning: Autoencoders
    2.3.2 Generative Models
      Auto-regressive Models
      Flow Models
      Variational Autoencoders (VAEs)
      Generative Adversarial Networks (GANs)
      Least Squares Generative Adversarial Networks (LSGANs)
      Wasserstein Generative Adversarial Networks (WGANs)
      Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP)
    2.3.3 Conclusion

3 Dataset
  3.1 MSD-I dataset
  3.2 Custom dataset based on AcousticBrainz and MusicBrainz
  3.3 Resulting datasets

4 Methodology
  4.1 Preliminary experiments
    4.1.1 Representation learning for good image reconstruction
    4.1.2 Album artwork generation only trained on covers
  4.2 Album artwork generation based on audio
  4.3 Training methodologies
  4.4 Proposed evaluation - Survey

5 Results and discussion
  5.1 Preliminary experiments
    5.1.1 Representation learning for good image reconstruction
    5.1.2 Generative models for cover generation trained on covers
  5.2 Preliminary experiments' discussion
    5.2.1 Representation learning for good image reconstruction
    5.2.2 Generative models for cover generation trained on covers
  5.3 Album artwork generation based on audio
    5.3.1 Album artwork generation based on audio discussion

6 Conclusion
  6.1 Conclusions
  6.2 Future work

A Survey draft
  A.1 Introduction
  A.2 Introductory questions
  A.3 Questions proposed
    A.3.1 Original album artwork questions
    A.3.2 Generated album artwork without audio conditioning questions
    A.3.3 Generated album artwork with audio conditioning questions

B Generative album artwork examples

Acknowledgement

I want to thank all the people that have supported me along this project. Special thanks to Perfecto and Joan for all the knowledge, dedication, support and time they have shared with me. It has been a pleasure to be tutored by you and to be part of this amazing idea. I also want to thank Sergio Oramas and Alastair Porter for their help with my doubts when gathering the dataset we use in this project. Finally, a warm hug to my SMC colleagues and the MTG for this strange but special year.

"It is indeed a surprising and fortunate fact that nature can be expressed by relatively low-order mathematical functions." Rudolf Carnap

Abstract

Music nowadays can hardly be conceived without its artwork. Since album covers were first used, their importance has changed over time. In our digital era, audiovisual content is everywhere and, regarding music albums, covers play an important role. In the last decade, Computer Vision has unleashed powerful technologies for image generation, which have been used for many different applications. In particular, the main discoveries are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The latest research on these technologies has contributed to understanding and improving them, achieving high-quality and complex image generation. In this thesis, we experiment with the latest image generation tools to achieve album artwork generation based on audio samples. We first analyse image generation without audio conditioning for VAEs and three GAN approaches: vanilla GAN, Least Squares GAN (LSGAN) and Wasserstein GAN with gradient penalty (WGAN-GP). Finally, we take the best model from these experiments and add audio conditioning. Despite being able to generate new album covers without audio conditioning, we do not achieve the final objective of album cover generation based on audio. We finally discuss which state-of-the-art tools could be reviewed and implemented for this project.

Keywords: Generative models, Generative Adversarial Networks (GAN), Image generation, Album cover

List of Figures

1.1 United States recorded-music revenues by format in billions of dollars. Source: Recording Industry Association of America

2.1 Left image: format of the first album sleeves in 78 rpm discs. Right image: first album cover, made by Alex Steinweiss in 1939
2.2 (a) The Dave Brubeck Quartet – Time Out (1959) by S Neil Fujita, (b) Ella Fitzgerald And Louis Armstrong – Ella And Louis (1956) by Phil Stern, (c) Jackie McLean – It's Time! (1965) by Reid Miles
2.3 Examples of album covers of (a) Elvis Presley - Elvis Presley (1956), courtesy of RCA, (b) Steve Miller Band - Children of the Future (1968), designed by Victor Moscoso, (c) Pink Floyd - The Dark Side of the Moon (1973), designed by Storm Thorgerson and (d) Sex Pistols - Never Mind The Bollocks Here's The Sex Pistols (1977)
2.4 (a) Queen - Hot Space (1982) by John Barr, Norm Ung, Steve Miller, (b) The Rolling Stones - Dirty Work (1986) by Annie Leibovitz
2.5 (a) Oasis - Definitely Maybe (1994) by Michael Spencer Jones, (b) OutKast – Aquemini (1998) by Greg Hawkins, (c) Radiohead - Kid A (2000) by Tchock and Stanley, (d) Eminem - The Eminem Show (2002) by Jonathan Mannion
2.6 (a) 2Pac - All Eyez on Me, (b) Earl Sweatshirt - Doris
2.7 (a) Kraftwerk - Computer Love, (b) Cybotron – Enter, (c) Plastikman - Closer, (d) Eno, Moebius, Roedelius, Plank – Begegnungen, (e) Aphex Twin - Selected Ambient Works 85-92, (f) David Guetta - Nothing but the Beat
2.8 (a) Black Sabbath - Black Sabbath, (b) Judas Priest - Sad Wings of Destiny, (c) Metallica - Metallica, (d) Blackletter fonts
2.9 (a) M. C. Escher - "Sky and Water I", (b) Mandelbrot set fractal, (c) Michael Noll - "Gaussian-Quadratic"
2.10 (a) Mario Klingemann - The Butcher's Son (2018), (b)(c) Album artwork results from the neural style transfer technique by Carr, C. and Zukowski, Z. [1], (d) Album covers generated by Alexander Hepburn et al. [2]
2.11 Image generation results based on music data by Qiu et al. [3]
2.12 Illustration of how deep learning is able to extract and learn high-level features given only pixels. Image extracted from the "Deep Learning" book by Goodfellow et al. [4]
2.13 Autoencoder schema, where θ represents the weights of the encoder and φ those of the decoder
2.14 2D latent space of an autoencoder based on the MNIST dataset, by David Foster [5]
2.15 Normalizing flow model concept, transforming a simple distribution p0(z0) into a complex one pK(zK) step by step. Figure taken from [6] by Weng
2.16 Variational Autoencoder schema, where θ represents the weights of the encoder and φ those of the decoder
2.17 2D latent space of a variational autoencoder based on the MNIST dataset, by David Foster [5]
2.18 Schema of a Generative Adversarial Network
2.19 Image from Arjovsky et al. [7], where they show the discriminator gradients for GAN and WGAN
2.20 Graphs extracted from Gulrajani et al. [8]. We can perceive the effect of vanishing and exploding gradients with weight clipping during training on the Swiss Roll dataset, unlike gradient penalty

3.1 Number of samples per genre in: (a) MSD-I dataset, (b) Custom dataset
3.2 Processing flow to obtain the cover dataset
3.3 Processing flow of the audio and cover dataset

4.1 Architecture of the autoencoder based on convolutional and transposed convolutional layers
4.2 Architecture of the autoencoder based on convolutional layers. Reconstruction of the image done by upsampling and padding layers
4.3 VAE architecture employed
4.4 GAN architecture proposed for DCGAN, LSGAN and WGAN-GP
4.5 VAE architecture with FiLM layers for audio conditioning on image generation
4.6 Generator architecture with FiLM layers for audio conditioning on image generation

5.1 (a) Original images from the test set, (b) Output images from the transposed convolutional layers and unmaxpooling autoencoder architecture, (c) Output images from the autoencoder with upsampling, reflection pad and convolutional layers
5.2 VAE model loss graph for training and validation set with the corresponding learning rates for each of the epochs
5.3 VAE model: (a) Original images from a batch of the test set, (b) Output images for the batch of figure (a)
5.4 VAE model generated images
5.5 2D VAE model latent space for all the blues, country and electronic images in the training set
5.6 Images for 25,000, 50,000 and 75,000 iterations of GAN models without audio conditioning: (a) DCGAN, (b) LSGAN, (c) WGAN-GP
5.7 Eight manually selected images of 128x128 resolution generated from the LSGAN model without audio conditioning
5.8 Eight manually selected images of 128x128 resolution generated from the WGAN-GP model without audio conditioning
5.9 VAE model with audio conditioning: (a) Original images from a batch of the test set, (b) Output images for the batch of figure (a)
5.10 VAE model with audio conditioning: (Left) Image that corresponds to the audio features that condition image generation by sampling from the mean and variance latent space
5.11 2D latent space of the VAE with audio conditioning for all the blues, country and electronic images in the training set
5.12 Images for 25,000, 50,000 and 75,000 iterations of the conditional LSGAN for: (a) AcousticBrainz data, (b) custom dataset
5.13 In this figure each of the columns corresponds to the same noise vector input into the generator, and each of the rows corresponds to different audio feature vectors from the same genre. These outputs correspond to the conditioned LSGAN model
5.14 Images for 25,000, 50,000 and 75,000 iterations of the conditional WGAN-GP

B.1 Generated album covers from the LSGAN model without audio conditioning

Chapter 1

Introduction

Music and images have been tied together at least since the first religions emerged, when rituals were accompanied by visual elements and icons of deities together with music. Nowadays, image and music can almost be considered as one, with the appearance of cinema and the video clip, the countless events where music goes together with visual art, advertisements, and so on. The music industry has also benefited from this connection since 1939, with the invention of album covers. Artwork had a huge influence on vinyl formats. With the creation of the CD in the 80s, people switched to this physical format, where artwork had less presence; and even less with the appearance of mp3 and the first portable media players. Mp3 made music copying very easy and piracy took its toll on music sales. However, in the last 10 years the revival of vinyl and the awareness in streaming platforms of the importance of album covers is reviving artwork design (see figure 1.1).

Figure 1.1: United States recorded-music revenues by format in billions of dollars. Source: Recording Industry Association of America.

The music industry has used images to represent music content: jazz with its cutting-edge photography, illustrations and typography in the 50s; punk mainly based on collage artwork; metal with its characteristic Gothic culture aesthetic... However, it is not absolutely true that artwork in albums is strictly based on the music content, as Pérez shows in [9]. Information about most of the artwork creation process is unknown, and designers have to deal with strict deadlines while working on several projects; thus, we cannot be completely sure about the relationship between music content and album covers without information on the design process. However, some studies researching the relation between an album's music and its artwork have managed to discover specific patterns in the album covers of similar music content. Dorochowicz and Bozena [10] and Rudolf [11] have studied different aspects of album covers for specific genres. Both conclude that similarity in music is also reflected in album covers. All this audio and image information has motivated multiple research efforts in an era where improvements in artificial intelligence are taking advantage of the huge amount of data available to develop new real-world applications. One of the approaches within artificial intelligence that is gaining importance due to improvements in the field is generative modelling. These models are able to learn from observed data to generate new data. Several new applications have emerged thanks to generative models. One example is the generation of images based on audio. Works by Wen et al. [12] and Duarte et al. [13] show successful projects where the model is able to learn the correlation between speech (audio) and faces (images) to generate faces by inputting speech audio into the model. In the music domain, Qiu et al. [3] have used this idea to generate visual content that best matches music, to reinforce the music appreciation experience. Therefore, by learning image and audio correlations, these projects are able to generate images based on audio content.

1.1 Motivation

Taking as a premise that music similarity is also reflected in album covers [10][11], the main motivation of this research arises from the interest in studying the relationship between music and visual content in albums, following artificial intelligence approaches. By learning this connection we could be able to generate new album covers based on new music. Besides, a personal interest in studying deep learning and generative tools applied to audio and image, and the application this idea could have in the real world as a useful tool for designers, are also motivations for this research.

1.2 Objective and structure

The main objectives and structure of this work are:

• Review and analysis of the relation between album artwork and audio, and of generative systems for image generation, with a deeper study of the most promising systems (chapter 2).

• Preprocessing and gathering methods to obtain a final multimodal dataset with which to face album artwork generation based on audio (chapter 3).

• Presentation of the experiments proposed to obtain a final generative model for album cover generation with audio conditioning (chapter 4).

• Exposition of the results obtained for the different experiments and their discussion (chapter 5).

• Final conclusions and further work on the project (chapter 6).

Chapter 2

State of the art

Audio and visual are two concepts whose connection has grown over the last century to the point where they are almost conceived as one. In this thesis, we want to learn about the link between music and image in albums, as a way of generating new album artwork. To show the existing relationship between album cover and music, we first go through album artwork history, discussing how the society, economy and political movements of each time have left a mark on music statements and aesthetics, and therefore on the cover. Then, different album cover aesthetics are analyzed by genre to highlight again the link between music and artwork. After this section, we give an overview of the principal generative tools for audio and image generation separately, and then of the principal generative technologies for images based on audio. Finally, the techniques applied in this thesis, based on Deep Learning, are presented and described in detail.

2.1 Album, the sum of cover and music

As mentioned before, one of the main reasons for this project surfaced from the idea of discovering the link between music and album cover. The powerful bond these two elements share has been used by the music industry since its origins to create legendary pieces that cannot be conceived without their cover and, of course, without their music. "A musical genre is a particular type or style of music which is recognizable by certain features"; this is one of the definitions that can be found in the English dictionary. However, there are music genres that go beyond this. They are not only understood as having specific musical features, they are also considered a culture. Some music genres have strong habits, traditions and beliefs with which people feel identified. Alex Steinweiss once said: "I love music so much and I had such ambition that I was willing to go way beyond what the hell they paid me for. I wanted people to look at the artwork and hear the music". As the inventor of album artwork stated, what you first perceive when buying or listening to an LP, EP or single is the cover. This fact is used by record labels to attract and seduce customers, and by using representative characteristics of the culture of each genre they are able to transmit to and reach customers who might feel attracted or identified. However, in most cases the purpose behind the design is unknown. As Pérez reflects in [9], the design process of the album cover is constrained. During the process, artists create various drafts, and the final decision might not rely on them. The designer might also be working on several projects at the same time, and short deadlines oblige them to do semi-automatic work. All this creative process information is not available to the public, which complicates the analysis of album covers. Therefore, we have to keep in mind that our own interpretations of the artwork might only be partially true, since they provide guesswork about one example from various possible solutions. Still, multiple studies have tried to analyse the relation between music and album artwork. For instance, Dorochowicz and Bozena [10] studied the relationship between typographic, compositional and coloristic elements of music album cover design and the music contained in the album. The differences between genres in the typographic, compositional and coloristic analysis suggest the possibility of creating a rule-based system that could automatically assign a given album to a music genre. A similar study on the analysis of similarity between album art within a genre is done by Rudolf [11], where he concludes that similarity in music is also reflected in album covers. Some works related to this thesis have also taken into consideration the existing relationship between music and artwork to generate album covers. Hepburn et al. [2] train a generative model (we will go deeper into generative models in the next section) to generate album covers by just taking genre tags as input.

Therefore, to better comprehend this link and the concept of the album, we first go through an overview of its history, following the analysis done by Lopez Medel [14], where music and album cover tendencies by decade are shown. Finally, we exhibit how different music styles have followed different pictorial and design conventions. Both analyses, in time and by genre, reinforce our research by demonstrating the connection between audio and album covers.

2.1.1 Album covers throughout time

When album covers were born

Around 1910, discs became the standard medium for recorded sound, replacing the phonograph cylinder. Discs were sold covered in brown paper, which was sometimes plain or had the information about the producer, etc. printed on it. The cover also had a circular cutout in the middle that allowed the record label and artist name to be seen (Figure 2.1, left).

Album covers were basically intended to protect the record. But this purpose changed in 1939, when Alex Steinweiss, an artist and designer working for Columbia Records, introduced the idea of creating more sophisticated covers for each album (Figure 2.1, right). The two objectives of Alex Steinweiss were to transmit the meaning of the music and to attract new customers.

Figure 2.1: Left image: Format of the first album sleeves in 78 rpm discs. Right image: First album cover made by Alex Steinweiss in 1939

The golden years

Alex Steinweiss's innovation pushed the industry to an entirely new level, where new technology was released and new artistic styles and generations of artists appeared for album artwork.

The 50's. Jazz was the main genre of this decade, and inspired the most sophisticated covers, which showed an avant-garde approach to photography, illustrations and typography. Modern art illustrations (figure 2.2(a)) and photographs (figure 2.2(b)) were the main elements in album covers, with bold typography which sometimes was the only element on the cover (figure 2.2(c)). As the legend of album cover design Storm Thorgerson says, jazz covers maintained a strong sense of integrity and dedication to the music because most jazz record labels were run by small groups with a strong sense of musical history. The rising Rock and Roll also set trends. Rock albums served to collect hit singles from successful stars, so artists became teenage idols. This is why the marketing of rock and roll covers was tied to the movies, leading to face-shot album covers and big typography. The 50s being a conservative era, the stars' shots on the covers maintained the conservatism of the time to please both American teenagers and their parents (Figure 2.3(a)).


Figure 2.2: (a) The Dave Brubeck Quartet – Time Out (1959) by S Neil Fujita, (b) Ella Fitzgerald And Louis Armstrong – Ella And Louis (1956) by Phil Stern, (c) Jackie McLean – It's Time! (1965) by Reid Miles

The 60's. Marked by the British band invasion led by The Beatles, The Rolling Stones and The Animals. The Beatles were especially aware of the power of album covers and were one of the main figures who created innovative pieces for their LPs. Other designers experimented with psychedelic illustrations due to the growing hippie movement (Figure 2.3(b)).

The 70's. The years of live music, when mega bands and marketing exploded. Albums became more conceptual and complex. The norm was to break the norm (Figure 2.3(c)). But the oil crisis of 1973 caused a switch from grandiose ideas to a much simpler and cheaper style. In the last years of the decade, Punk made its breakthrough. Its album cover designs were based on the motto 'Do-It-Yourself'; collage became the dominant look (Figure 2.3(d)) [14].

The decline

The 80's. The death of the album cover would start with two main events in the 1980s:

• The arrival of MTV in 1981, which made music videos the primary visual partner of music.

• The launch of the Compact Disc. This new technology was received with fear by album designers. However, a new generation of designers appeared, from a more technological background. Experimentation was still possible with the CD format.

However, album covers in the 80s mirrored the bright and manic aesthetic of the decade itself. With "electro-pop" being the genre of this decade, album artwork stood out, from garish portraits to sharp-cornered designs in neon-colored glory [15] (see Figure 2.4).


Figure 2.3: Examples of album covers of (a) Elvis Presley - Elvis Presley (1956), courtesy of RCA, (b) Steve Miller Band - Children of the Future (1968), designed by Victor Moscoso, (c) Pink Floyd - The Dark Side of the Moon (1973), designed by Storm Thorgerson and (d) Sex Pistols - Never Mind The Bollocks Here’s The Sex Pistols (1977)


Figure 2.4: (a) Queen - Hot Space (1982) by John Barr, Norm Ung, Steve Miller (b) The Rolling Stones - Dirty Work (1986) by Annie Leibovitz

The 90's and 00's

But the end came with the mp3. Album covers did not have an important presence within this format: from the 12" LP, to a 4"x4" square with the CD, to a small place on the screen. Also, the existence of technology capable of replicating the original master sank the music industry, and the appearance of online platforms to exchange music, like Napster, put an end to the analog format [14].

Although new technologies killed album covers, albums still incorporated covers, and new technologies defined the artwork style of these decades, mainly because of the first software for image editing and image generation. These technologies allowed the digital manipulation of images and led to new image processing techniques such as image editing (see Figures 2.5(a) and 2.5(d)), computer-based illustrations (see Figures 2.5(b) and 2.5(c)), superimposed images, etc. [16]


Figure 2.5: (a) Oasis - Definitely Maybe (1994) by Michael Spencer Jones, (b) OutKast – Aquemini (1998) by Greg Hawkins, (c) Radiohead - Kid A (2000) by Tchock and Stanley, (d) Eminem - The Eminem Show (2002) by Jonathan Mannion

The resurrection

This last epoch takes us to the present time, when album covers are experiencing a comeback. The reasons for this situation can be traced to three key facts: the resurrection of the music industry, the rebirth of vinyl for the more dedicated fan, and, most importantly, the fact that album covers have become pop objects. Album covers have gone beyond their original meaning and function (the practical role of protection) into the public sphere, where they are printed on t-shirts, posters, etc. [14] This increase in the importance of the visual part of albums in our culture has led the latest technologies, like music streaming platforms, to give album covers a very important role in their interfaces, to the point of covering the whole screen with visual animations.

Therefore, music and album cover aesthetics have been influenced over time by technology and by the social, economic and political situation. These influences have created different music styles that have become a genre at a specific time. Thus, because music genres are based on a specific musical aesthetic and statement, the music industry has used specific album cover styles to reinforce the message of each music genre.

2.1.2 Album covers by genre

In the previous section, we went through characteristic album artwork features for specific genres like punk, rock and jazz. Next, the album covers of other popular music genres present in our study are examined: hip-hop/rap, electronic and metal. We need to emphasize that these characteristics, as mentioned before by Pérez in [9], are personal interpretations, which does not make them completely true.

Hip-Hop music

Hip-Hop or Rap music has several aspects that make its artwork recognisable. Because hip-hop music is part of hip-hop culture, album covers make use of its elements. Graffiti art is directly connected to hip-hop culture, graffiti being the predominant lettering used in the artwork. Another aspect commonly seen on hip-hop covers is the use of a picture of the artists themselves, which may be because facial or bodily album covers usually help the promotion of the album: the audience immediately recognises which artist it is without reading any other information on the cover (see figure 2.6).


Figure 2.6: (a) 2Pac - All Eyez on Me, (b) Earl Sweatshirt - Doris

Electronic music

Electronic music has evolved over time and has led to multiple new sub-genres. We are going to go through the characteristics of the main ones. The first electronic music albums came out around the 70s with the pioneers Kraftwerk. The artwork concept was oriented towards futuristic and computerized subjects, as shown in figure 2.7(a). These releases spread electronic music all over the globe. In the 80s, a new popular current and philosophical thought surfaced due to the shift from the industrial era to the information era (see figure 2.7(b)). This new mentality is reflected in highly influential films like Blade Runner and Tron and in the creation of a new literature genre: cyberpunk. This led to album artwork mainly based on a science fiction aesthetic. But it soon morphed into house music, which became the new music of the clubs, also associated with ethnic and social minorities. Because of this, artwork in house music depicted pictures from disco culture, black culture, as well as gay culture. In the late 80s/90s, minimal techno became popular in Europe. The simplicity of the music was also reflected in the covers, with simple geometries and abstract figures (figure 2.7(c)).



Figure 2.7: (a) Kraftwerk - Computer Love, (b) Cybotron – Enter, (c) Plastikman - Closer, (d) Eno, Moebius, Roedelius, Plank – Begegnungen, (e) Aphex Twin - Selected Ambient Works 85-92, (f) David Guetta - Nothing but the Beat

Ambient music also increased in popularity, and its covers reflected a comeback to the "real world", as opposed to club pictures or futuristic worlds. Designs of natural landscapes helped to establish a sound space (figure 2.7(d)). In these years, electronic music grew in popularity and some artists like Aphex Twin and Daft Punk began branding themselves to differentiate their music and identity from others in the field (figure 2.7(e)). Nowadays, some electronic music artists have stepped into pop music, bringing covers where the artist appears prominently on the album's cover (figure 2.7(f)). However, there is still an enormous underground/indie scene in electronic music whose aesthetics represent the new millennium with modernized, visually attractive images that seem to speak to a visually curious youth audience.

Metal music

Metal music has a strong and well-defined culture due to the way it emerged. In the late 60s and early 70s, metal burst onto the scene as a "harder variation" of Rock music, born of a general youth disappointment with society and politics which transformed old hopes into a new belief in paganism. This ideology is reflected in the first artworks, which have remained over time. The use of a Gothic culture aesthetic is prominent: dark pictures with black and red tonalities containing mainly religious, satanic and magic elements (see figures 2.8(a) and 2.8(b)). The use of blackletter typography is also noticeable (figure 2.8(d)). These striking artworks reflected the message they wanted to give to society.


Figure 2.8: (a) Black Sabbath - Black Sabbath, (b) Judas Priest - Sad Wings of Destiny, (c) Metallica - Metallica, (d) Blackletter fonts

2.1.3 Conclusion

Going through the history of music and some specific genres, we perceive how artwork is connected to music, visually transmitting its message, and how both are influenced by external agents of our society: economy, politics and technology, which have given room to ideologies and movements that have been reflected and transmitted through music. However, as we have already mentioned, these are conventions based on personal perception, because we usually do not have information about the design process. Therefore, taking into consideration these conventions that relate music and artwork, we want to apply cutting-edge technologies to learn this connection and generate new album covers based on AI approaches. But we first need to go through an overview of the systems that have tackled this approach of generating new data automatically.

2.2 Generative Systems and Generative Models

The term "generative system" has been controversial since its coinage in 1970 by Sonia Landy Sheridan as an academic program at the School of the Art Institute of Chicago. This term has been studied in the field of "generative art" by many artists and philosophers who have tried to find a good definition. One of the most used definitions of what can be considered a generative system is found in Philip Galanter's 2008 definition of "generative art" [17]:

"Generative art refers to any art practice in which the artist cedes control to a system with functional autonomy that contributes to, or results in, a completed work of art. Systems may include natural language instructions, biological or chemical processes, computer programs, machines, self-organizing materials, mathematical operations, and other procedural inventions." (Galanter 2008)

It needs to be highlighted that generative systems are not only computer-based systems; other systems based on chemical, biological or self-organizing materials or mathematical operations are also considered "generative systems" if they have autonomy. Generative systems' autonomy has led to a controversial discussion about the role of humans in these systems. Purely generative systems involve no intervention of humans once they are in motion. However, this does not mean that human knowledge is not included in the creative process before the system starts operating. Other systems need humans as collaborators in order to operate, but they are not considered purely generative.

With the growth of computational power and the improvement of machine learning in the last decades, machine learning generative systems have approached the generative process by learning the probability distribution of observed data, instead of following strict knowledge representations imposed by humans. These new systems are called "generative models".

Thus, the traditional generative systems approach is rule-based, where rules are settled on human knowledge, while the new data-based approach acquires knowledge from observed data. As Galanter mentions in his definition of generative systems, these are based on any procedural invention with functional autonomy. Therefore, we consider "generative models" as a specific approach to generative systems. In this section we will refer to rule-based systems as "traditional generative systems" and to data-based systems as "generative models".

2.2.1 Audio and visual generation examples

Audio generation

Audio generative systems, traditional approach

As mentioned before, generative systems do not have to be computer based. One of the most interesting and earliest generative systems in the music field was devised by W. A. Mozart, but published after his death in 1793 by J. J. Hummel in Berlin-Amsterdam. This system is called "The Musikalisches Würfelspiel" (Musical Dice Game) [18]¹. The game consists of creating a 16-bar Viennese minuet by throwing dice. To do this, Mozart offered 2 choices for the 8th and 16th bars, and 11 choices for the other bars. With this random selection of choices you are able to generate a Viennese minuet. In 1957, F. P. Brooks et al. [19] created a statistical system of Markov chains which calculated the probabilities of the elements by analyzing samples of 37 melodies. By elements they meant element pairs (digrams, two notes), trigrams, and so on up to the eighth order. The probabilities derived were used for the synthesis of original melodies by a random process. Several rule-based music generative systems appeared in the last part of the 20th century. Moorer [20] in 1972 implemented an algorithmic composer based on different heuristic procedures to generate rhythm, chords and melody. In 1989, Pressing [21] researched discrete nonlinear maps as generators for musical design. By heuristic exploration of the maps' solution space, the system was able to control pitch, envelope attack time, dynamics, tempo, textural density and section length. One of the latest rule-based music generative systems was implemented by Jehan [22] in 2005. He developed a system that automates the life cycle of listening, composing, and performing music, only feeding the system with a song database. The system was designed to arbitrarily combine the extracted musical parameters as a way to synthesize new and musically meaningful structures.
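To make the statistical approach of Brooks et al. [19] more concrete, the snippet below sketches a minimal first-order Markov chain (digram) melody generator in Python. The toy corpus, note values and probabilities are our own invented illustration; the original work estimated digrams up to eighth-order elements from 37 melodies.

import random
from collections import defaultdict

# Toy corpus of melodies as MIDI note numbers; purely illustrative data.
melodies = [
    [60, 62, 64, 65, 67, 65, 64, 62, 60],
    [60, 64, 67, 64, 60, 62, 64, 62, 60],
]

# Estimate first-order transition statistics (digrams) from the corpus.
transitions = defaultdict(list)
for melody in melodies:
    for current_note, next_note in zip(melody, melody[1:]):
        transitions[current_note].append(next_note)

def generate(start=60, length=16):
    """Synthesize a new melody by a random walk over the learned digrams."""
    melody = [start]
    for _ in range(length - 1):
        candidates = transitions.get(melody[-1])
        if not candidates:          # dead end: fall back to the start note
            candidates = [start]
        melody.append(random.choice(candidates))
    return melody

print(generate())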

Audio generative models

With the increase of computational power, the improvement of machine learning techniques and the quality of the results, neural networks have demonstrated that they are good generative tools. The first AI able to generate new music by capturing the style of multiple compositions is the "Flow Machines" project by Sony [23]. Due to intellectual property, there is no technical information about the model. Multiple music generative model technologies have come out under intellectual property. For instance, "AIVA" [24] is able to generate short compositions in various styles (classical, electronic, jazz, ...) by just selecting the genre, providing the MIDI files for each of the different channels in the composition (bass, guitar, piano, percussion, ...). "Deep Composer" by Amazon [25] also uses generative modelling to generate a complete song by inputting a single melody into the model. Open-source projects have also been released lately. For instance, Dadabots is a model able to create death metal, jazz and rock music, among other genres, 24 hours a day. This model uses auto-regressive conditional generation, which generates new data based on previous steps and the conditions given. This auto-regressive conditional generation is implemented with SampleRNN, a Recurrent Neural Network (RNN) originally designed for text-to-speech applications but adapted for music generation in this work by Carr, C. and Zukowski, Z. [1]. Another interesting line of work in audio generative models is the Magenta research project by Google. They have projects like GANSynth by Jesse Engel et al. [26], which uses Generative Adversarial Networks (GANs) to synthesize audio by exploiting the potential of this type of network in image generation to generate new audio spectrograms. The latest generative model is Jukebox by OpenAI, developed by Dhariwal et al. [27]. Jukebox is able to generate multiple-minutes-long pieces in raw audio, imitating many different styles and artists, with recognizable singing in natural-sounding voices, by training a hierarchical VQ-VAE.

1 There are online versions of this famous game: https://mozart.vician.cz/

Image generation

Image generative systems, traditional approach

Generative systems based on mathematical operations became a useful tool in the 20th century, joining science and art. One of the most well-known artists was M. C. Escher. In his works, he develops "impossible" and strange perspectives using a mathematical approach (see Figure 2.9(a)). He studied the use of tessellations of a plane (patterns on a plane), convergence to a limit and various transformations of shapes, among other techniques. Modern technologies developed in the second half of the 20th century influenced design and art, leading to the computer era in image generation. One of the most famous mathematical approaches of this era is fractals, discovered by the mathematician Benoît Mandelbrot. Fractals are infinitely complex patterns that are self-similar across different scales, obtained by repeating a process over and over in a continuous feedback loop (see Figure 2.9(b)). This new era gave new possibilities to computer-based generative systems. New computer programs allowed the creation of another image generation technique based on code in some programming language. This new technique was named algorithmic art. One of the pioneers was the researcher Michael Noll with his work Gaussian Quadratic in 1963 (see Figure 2.9(c)). This representation is made by a system that connects 100 points with 99 lines, whose horizontal positions are distributed following a Gaussian distribution of random numbers. The vertical positions increase according to a quadratic equation. When a point reaches the top, it is reflected to the bottom to continue its rise.
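As an illustration of this kind of algorithmic art, the sketch below follows the procedure just described: 100 points whose horizontal coordinates are Gaussian random numbers and whose vertical coordinates grow quadratically, returning to the bottom when they reach the top. The plotting library, constants and simple wrap-around (instead of Noll's exact reflection) are our own choices for illustration only.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_points = 100

# Horizontal coordinates: Gaussian-distributed random numbers.
x = rng.normal(loc=0.0, scale=1.0, size=n_points)

# Vertical coordinates: quadratic growth, wrapped back to the bottom at the top.
i = np.arange(n_points)
y = (0.05 * i + 0.002 * i**2) % 1.0

# Connect the 100 points with 99 straight lines.
plt.plot(x, y, color="black", linewidth=0.8)
plt.axis("off")
plt.show()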


Figure 2.9: (a) M.C.Escher - ”Sky and Water I”, (b) Mandelbrot set fractal, (c) Michael Noll - ”Gaussian-Quadratic”

Image generative models

Image generative models have aroused a lot of interest due to the new findings in Machine Learning and the available computing power. The most used architecture in generative modelling lately is the GAN, after its introduction in 2014 by Ian Goodfellow. There are incredible pieces of art based on GANs, like the Lumen Prize 2018 winner "The Butcher's Son" by Mario Klingemann (see Figure 2.10(a)). Other works use techniques like style transfer. This technique consists of taking two images (a content image and a style reference image, such as an artwork by a famous painter) and mixing them together so the output image looks like the content image, but "painted" in the style of the reference image. This technique has acquired high quality with GANs, specifically with an extension of GANs called CycleGAN. One example of image generation, specifically for album covers, using style transfer is the project done by Carr, C. and Zukowski, Z. [1] (see Figures 2.10(b) and 2.10(c)). Another project based on GANs which tackles our target, but with a different approach, is the one previously mentioned developed by Hepburn et al. [2], where they generate album covers with another extension of GANs called the Auxiliary-Classifier GAN or AC-GAN. This model codes some descriptive variables, in this case genre tags, into the noise which is used as input to the generator network (see results in Figure 2.10(d)). Hepburn et al.'s work is similar to ours; however, we do not condition image synthesis with genre tags but with audio features. We think conditioning only with genre tags might not be optimal because a lot of the information designers take into consideration when creating their designs is missed. Moreover, designers not only consider the music content but also the lyrics. We do not consider the lyrics content in this project, but we discuss this additional approach in the conclusion.



Figure 2.10: (a) Mario Klingemann - The Butcher's Son (2018), (b)(c) Album artwork results from the neural style transfer technique by Carr, C. and Zukowski, Z. [1], (d) Album covers generated by Alexander Hepburn et al. [2]

However, not all image generative models are based on GANs; others are based on auto-regressive models. An example is PixelRNN, created by Van den Oord et al. [28], which uses several Recurrent Neural Networks, networks good at modeling temporal sequences, to approach pixel generation.

With these examples of single-modal generation of music and image, we get a better intuition of the existing technologies and their potential. Nevertheless, our specific task involves learning audio and image features to generate new images based on new audio. Thus, we are going to explore multi-modal approaches for image generation based on audio.

2.2.2 Image generation based on audio

Before focusing on related works that generate images based on music, let us first introduce previous works built on the same idea of image generation based on audio samples but with a different application. One of the main such tasks nowadays is the generation of faces based on voice recordings.

Face generation based on speech signals

Tae-Hyun et al. [29] use the natural co-occurrence of audio and visual signals as a supervision signal. They use two convolutional encoders, one for voice and one already trained for face feature extraction. By training the voice encoder to match the face features, they are able to generate faces by attaching an already pretrained face decoder after the voice encoder. Y. Wen et al. [12] use a Generative Adversarial Network (GAN) for face generation. To do that, they use a pretrained voice embedding network which is used to train the face generator and discriminator in an adversarial manner. A similar work is done by Duarte et al. [13], where they use a Least-Squares GAN. To generate images based on speech, they use images with an associated speech segment of one second. The architecture consists of a pretrained speech encoder whose output is fed into the generator. The generator is trained by back-propagating the speech error calculated in the discriminator through the generator only.

Image generation based on music samples

For the task which concerns this work, few investigations have been done. Y. Qiu et al. [3] have tried to find the visual content that best matches music, with which we can expect a more expressive music appreciation experience. For this, they use different neural networks to extract music and image features. For music feature extraction, they use a CNN over 16-second frames and then feed the CNN output vectors into an LSTM to capture time-series features. For image feature extraction they use the AlexNet CNN. Once they have both music and image features, they train the network using the loss function proposed by Reed et al. [30] to maximize the inner product of correlated image and music pairs. For image generation, they extract features by feeding images through the trained CNN-LSTM model, from which they can obtain music features thanks to the audio-image correlations learnt during training. Then, they fuse those extracted music features into a DCGAN to generate images (see results in Figure 2.11).
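As a rough, simplified sketch of the kind of CNN-LSTM music encoder described above (our own interpretation in PyTorch; the exact layers, dimensions and the AlexNet image branch used by Qiu et al. [3] are not reproduced here):

import torch
import torch.nn as nn

class MusicEncoder(nn.Module):
    """CNN over per-frame spectrograms followed by an LSTM over the frame sequence."""

    def __init__(self, feature_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, feature_dim),
        )
        self.lstm = nn.LSTM(feature_dim, feature_dim, batch_first=True)

    def forward(self, frames):
        # frames: (batch, n_frames, 1, mel_bins, time), one spectrogram per 16 s frame
        b, n, c, h, w = frames.shape
        per_frame = self.cnn(frames.reshape(b * n, c, h, w)).reshape(b, n, -1)
        _, (h_n, _) = self.lstm(per_frame)
        return h_n[-1]  # last hidden state used as the music feature vector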

Figure 2.11: Image generation results based on music data by Qiu et al. [3]

2.2.3 Conclusion

Having reviewed multiple types of generative systems, we notice how popular GANs have become in image generation applications. For our work, we propose the usage of Generative Adversarial Networks for album cover generation based on music samples, as this approach has not been taken before and we think it may improve upon previous works like the already mentioned one by Alexander Hepburn et al. [2], which is based on genre tags.

2.3 Deep Learning and Generative models

Deep learning exploits the concept of representation learning by concatenating several layers of mathematical operations. Each layer is able to capture higher abstractions of the data based on previous, lower-level representations of the data. Zeiler and Fergus [31] demonstrated how deep learning is able to capture high-level concepts by combining simpler functions (see figure 2.12). Therefore, at each layer of the network we obtain different factors of variation. The most used deep learning architecture for representation learning, i.e. to obtain factors of variation, is the autoencoder.

Figure 2.12: Illustration of how deep learning is able to extract and learn high-level features given only pixels. Image extracted from the "Deep Learning" book by Goodfellow et al. [4]

2.3.1 Representation learning: Autoencoders

An autoencoder is an unsupervised learning² (also known as "self-supervised" learning) method whose target is to obtain an output (x̂) as similar as possible to the input (x) (Figure 2.13). By concatenating several layers we are able to obtain good representations of the input in the bottleneck of the architecture.

² Unsupervised learning is a machine learning approach used to draw inferences from datasets of unlabeled data.

Structure

20 An autoencoder has two clearly defined parts:

• Encoder: compresses high-dimensional input data into a lower-dimensional representation vector, also known as the "latent space" (z). It works similarly to classical dimensionality reduction methods like PCA (Principal Component Analysis); however, unlike PCA, autoencoders learn non-linear transformations. This dimensionality reduction performed by the autoencoder allows the network to learn high-level non-linear features. However, we do not have latent space targets for our input data x, so there is no way to train the encoder by minimizing a loss function to generate high-level features. To solve this, we attach a decoder after the encoder.

• Decoder: decompresses a given representation vector (z) back to the original dimensional domain (x̂). Now we are able to train the network by minimizing the loss function between the input x and the output x̂. Therefore, the target is to generate an output as similar as possible to the input. By extracting the latent space we obtain the features learned by the network from our input data.

Figure 2.13: Autoencoder schema, where θ represents the weights of the encoder and φ of the decoder.
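As an illustration of the encoder-decoder scheme of Figure 2.13, a minimal convolutional autoencoder could be written in PyTorch as follows. The framework, layer sizes, input resolution and latent dimension are illustrative choices only, not the exact architectures used later in this thesis.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: compress a 3x64x64 image into a latent vector z.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # Decoder: reconstruct the image from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z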

Training

As in all machine learning and deep learning architectures, we need a function to optimize, which is called the cost function. The cost function is calculated as an average of loss functions, each of which is a value calculated for every instance. By minimizing the cost function between the input x and the output x̂ we are able to train autoencoders. Typical loss functions used in autoencoders are listed below (a minimal training sketch using the MSE variant follows the list):

• Root-Mean-Squared Error (RMSE) (see 2.1). This is the most typical loss function used in autoencoders. It measures how similar the generated output is to the original input. The Mean-Squared Error (MSE) loss function can also be used.

$$L(x, \hat{x}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{x}_i)^2} \qquad (2.1)$$

• Cross Entropy. Used when the data format is a binary vector or a vector of probabilities in the range [0, 1]. It measures the number of bits required to transmit a randomly selected event from a probability distribution.

$$L(x, \hat{x}) = -\sum_{i=1}^{n} x_i \log(\hat{x}_i) \qquad (2.2)$$
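Training the autoencoder then reduces to minimizing one of these losses over the dataset. A minimal sketch using the MSE variant of equation 2.1 is shown below; the optimizer, learning rate, number of epochs and train_loader are hypothetical placeholders, and Autoencoder is the class defined in the previous sketch.

import torch
import torch.nn as nn

model = Autoencoder(latent_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()                  # equation 2.1 without the square root

for epoch in range(10):
    for x, _ in train_loader:             # hypothetical DataLoader of cover images
        x_hat, _ = model(x)
        loss = criterion(x_hat, x)         # L(x, x_hat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()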

Are autoencoders generative?

If we sample from the autoencoder latent space and feed this sample to our decoder, we can generate an output. But generative models, as mentioned before, generate new data from a probabilistic model. This architecture is not learning any probability density function, but it does learn the features from which we are going to learn the probability density function. Hence, autoencoders are not considered generative, and because they are not a probabilistic model they face several challenges. To understand them better, let's look at Figure 2.14, a 2D representation of an autoencoder's latent space for the MNIST data set³. The challenges for autoencoders are the following:

• Gaps between different classes. This issue causes the generation of poor images.

• Overlapping classes. There are areas in the latent space shared by various classes. Selecting a point in this area may not be decoded into a well-formed image.

• Lack of diversity. Some classes are spread over a big area while others cover a smaller one.

³ MNIST is a data set of handwritten single-digit numbers from 0 to 9.

22 Figure 2.14: 2D latent space of an autoencoder based on MNIST dataset by David Foster [5]

2.3.2 Generative Models

Generative models use AI techniques to learn the probability distribution of observed data. This capability makes these models very useful to (i) analyze observed data and (ii) generate "fake" but realistic data from real observed data, making generative models a potential tool to "understand our world". Different approaches exist to learn the probability distribution of observed data. Explicit density models provide an explicit parametric specification by assuming some distribution of the observed data. This explicit density function can be obtained by carefully designing the density functions, which is the case of "auto-regressive models" and "Flow Models". Instead of obtaining the density function by design, other methods obtain it by learning an approximation of the density function, like "Variational Autoencoders" (VAEs). These models, however, have a high computational cost or errors due to the inconsistency of the probability distribution obtained. On the other hand, implicit density models learn the probability distribution through specific procedures. In this branch of generative models we find "Generative Adversarial Networks" or GANs. This network, built by Goodfellow et al. [32], has shown incredible results on image synthesis since its appearance in 2014. Some surprising applications of GANs for image generation are, for example, new anime character generation for game development and animation production, high-resolution image synthesis, text-to-image generation, etc.

In this thesis, we will study VAEs and GANs in more detail. However, we will also explain how auto-regressive and flow models work. After presenting the different models, we will make a comparison between them, analyzing their pros and cons, and justify our decision of choosing a specific model.

Auto-regressive Models

These models use the chain rule of probability to decompose the probability distribution over a vector into a product over each of the members of the vector (see equation 2.3). "PixelRNN", developed by Van den Oord et al. [28], is an example of these networks.

$$p_{model}(x) = p_{model}(x_1) \prod_{i=2}^{n} p_{model}(x_i \mid x_1, \ldots, x_{i-1}) \qquad (2.3)$$

This approach has a high computational cost: each time we want to sample a different x_i from the vector x, we need to run the model again sequentially; it is not possible to run it in parallel. Another drawback is that we cannot sample from a latent space to generate new data. As we have seen before, latent representations can be very useful as a data compression tool that is very powerful for representing features and abstractions.
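The sequential nature of equation 2.3 is what makes sampling slow: every new element depends on all previously generated ones. A schematic sketch of this sampling loop is shown below, where model is a hypothetical network returning the conditional distribution p(x_i | x_1, ..., x_{i-1}).

import torch

def sample_autoregressive(model, length):
    """Draw x_1, ..., x_n one element at a time using the chain rule (eq. 2.3)."""
    x = []
    for i in range(length):
        # The model must be re-run at every step with everything generated so far.
        probs = model(torch.tensor(x, dtype=torch.long))  # hypothetical conditional p(x_i | x_<i)
        x_i = torch.multinomial(probs, 1).item()
        x.append(x_i)
    return x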

Flow Models

Another way of obtaining the probability density function are "Flow Models". These models are based on the change of variables theorem. The theorem allows a random multivariable z with a known probability density function z ∼ π(z) to be used to construct a new random multivariable through a one-to-one function mapping x = f(z). In order to infer the unknown probability density function p(x), x = f(z) needs to be invertible, z = f⁻¹(x). Taking the definition of a density function (2.4), we obtain (2.5).

$$\int p(x)\,dx = \int \pi(z)\,dz = 1 \qquad (2.4)$$

$$p(x) = \pi(z)\left|\det\frac{dz}{dx}\right| = \pi\!\left(f^{-1}(x)\right)\left|\det\frac{df^{-1}}{dx}\right| \qquad (2.5)$$

"Normalizing Flows" are one of the most well-known types of Flow models. Because we want to be able to run backpropagation, the probability density function is expected to be simple enough to calculate the derivative easily and efficiently. That is why Normalizing Flows choose a Normal distribution for the latent variable. By applying sequences of invertible transformation functions, they are able to obtain complex probability distributions (see figure 2.15).

Figure 2.15: Normalizing flow model concept, transforming a simple distribution p0(z0) into a complex one pK(zK) step by step. Figure taken from [6] by Weng.

So, we can explicitly learn the p(x) function, based on the latent variables z extracted from x, by minimizing the loss function, which is the negative log-likelihood; this enables us to generate new data from z. The requirements for the computation of the equation are that (i) the transform function f_i has to be easy to compute and (ii) so does its Jacobian determinant.
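As a simple worked example of equation 2.5 (ours, for illustration only), consider a one-dimensional affine flow x = f(z) = az + b with z ∼ π(z) = N(0, 1) and a ≠ 0:

$$z = f^{-1}(x) = \frac{x - b}{a}, \qquad \left|\det\frac{df^{-1}}{dx}\right| = \frac{1}{|a|}, \qquad p(x) = \pi\!\left(\frac{x - b}{a}\right)\frac{1}{|a|},$$

which is exactly the density of N(b, a²). Stacking many such invertible steps, each with a tractable Jacobian, yields the complex distributions illustrated in figure 2.15.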

Variational Autoencoders (VAEs)

A Variational Autoencoder can be defined as an autoencoder whose training is regularised to ensure that the latent space has a specific distribution. Therefore, VAEs learn a mean and a standard deviation of the latent space for each of the classes (see Figure 2.16). To achieve this, they use a statistical method called variational inference. Variational autoencoders are based on the idea of learning the probability distribution of z given an input x, let's say p_data(z|x). We can calculate it with Bayes' Theorem:

$$p_{data}(z|x) = \frac{p_{data}(x|z)\,p_{data}(z)}{p_{data}(x)} \qquad (2.6)$$

However, there is a problem with the calculation of this equation because of the term p_data(x):

$$p_{data}(x) = \int p_{data}(x|z)\,p_{data}(z)\,dz \qquad (2.7)$$

Equation 2.7 becomes intractable for high-dimensional spaces. Therefore, instead of calculating p_data(z|x), we use another distribution q_θ(z|x) which, by adjusting the parameters θ, can be made similar to p_data(z|x) and used instead. To make q_θ(z|x) as similar as possible to p_data(z|x), we use the Kullback-Leibler divergence or KL divergence (see equation 2.8). The KL divergence is a mathematical tool that measures the similarity between two different distributions (the lower the KL divergence, the more similar the distributions).

$$D_{KL}\big(q_\theta(z|x)\,\|\,p(z|x)\big) = -\sum_{z} q_\theta(z|x)\,\log\frac{p(z|x)}{q_\theta(z|x)} \qquad (2.8)$$

So, the KL divergence is a good metric to optimize. By solving the minimization of the KL divergence, we obtain the loss function represented in equation 2.9. This loss function has two main components. The first is the reconstruction loss which, if we consider our input as following a Gaussian distribution, will be similar to the autoencoder loss function, specifically the MSE loss. The second term is the KL divergence mentioned before, which tells how similar the distribution q_θ(z|x) is going to be to a known distribution p_θ(z). Usually we want this known distribution to be a Normal (p_θ(z) = N(0, 1)).

$$\mathcal{L}(\theta, \phi, x) = -\mathbb{E}_{q_\theta(z|x)}\big[\log p_\phi(\hat{x}|z)\big] + D_{KL}\big(q_\theta(z|x)\,\|\,p_\theta(z)\big) \qquad (2.9)$$

Therefore, in a variational autoencoder, the model is more likely to place similar samples near each other, by forcing the network to learn a specific probability distribution of the data (usually Normal), generating appropriate data this way.

Figure 2.16: Variational Autoencoder schema, where θ represents the weights of the encoder and φ of the decoder

How to train VAEs?

Having explained the reconstruction loss and how VAEs work, we find a problem when trying to train these networks. The latent space in VAEs is based on a stochastic sampling procedure. We cannot calculate the gradient of a stochastic operation, and therefore we cannot back-propagate to update the network's weights: back-propagation requires deterministic nodes.

As a way of solving this issue, the reparametrization trick is used. Let z be a continuous random variable and z ∼ q_θ(z|x) a conditional distribution. It is possible to express the random variable z as a deterministic variable z = g_θ(ε, x), where ε is a random variable with zero mean and unit variance, and g_θ(·) is some function parametrized by θ. Therefore, by reparametrizing the stochastic latent space (z ∼ N(µ, σ²)) into a deterministic one, g_θ(ε, x) is equal to equation 2.10, where µ is the mean of the distribution and σ the standard deviation, which is element-wise multiplied by ε.

\[ z = \mu + \sigma \odot \epsilon \tag{2.10} \]
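In PyTorch, equation 2.10 is a few lines; a minimal sketch, assuming the encoder outputs the log-variance rather than σ directly (a common choice for numerical stability):

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I); the randomness is isolated in eps,
    # so gradients can flow through mu and sigma during back-propagation
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```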

Are variational autoencoders generative?

We saw in the previous section 2.3.1 why autoencoders are not generative and the issues this causes. However, variational autoencoders, by being able to learn a probability density function for each class, manage to solve some of the issues present in autoencoders. As we see in the 2D latent space of a VAE represented in figure 2.17, for the same dataset as the previous latent space of an autoencoder (see figure 2.14), this model is able to overcome the gaps between classes and the lack of diversity. However, we perceive some overlapping classes due to the self-supervised training.

Figure 2.17: 2D latent space of a variational autoencoder based on the MNIST dataset, by David Foster [5]

Generative Adversarial Networks (GANs)
The Generative Adversarial Network is a generative model proposed in 2014 by Goodfellow et al. [32]. Unlike VAEs, GANs manage to find the underlying structure of the data by following an adversarial process instead of variational inference. This adversarial process has two main components:

• Generator (Gθ). It consists of a model that generates x samples from z.

• Discriminator (Dφ). It is a network whose job is to distinguish samples from the real dataset from those produced by the generator.

The generator and the discriminator are trained together by playing a minimax game, where the generator tries to minimize the value function (driving pθ towards pdata) and the discriminator tries to maximize it. The intuition is that the generator creates fake samples until the discriminator cannot distinguish between real and fake images. Mathematically, the GAN objective is the following:

\[ \min_\theta \max_\phi V(G_\theta, D_\phi) \tag{2.11} \]

where V(G_θ, D_φ) is the value function, defined as:

\[ V(G_\theta, D_\phi) = \mathbb{E}_{x\sim p_{data}}\big[\log D_\phi(x)\big] + \mathbb{E}_{z\sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big] \tag{2.12} \]

Figure 2.18: Schema of a Generative Adversarial Network

Optimal discriminator

From equation 2.11 we see that the discriminator maximizes the value function with respect to its weights φ, performing binary classification given a fixed generator Gθ: it assigns a probability of 1 to data coming from the training set, x ∼ pdata, and a probability of 0 to generated samples, x ∼ pG. By fixing the generator we obtain the optimal discriminator (see equation 2.13).

\[ D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)} \tag{2.13} \]

Optimal generator

The optimal generator for a fixed and optimal discriminator D^{*}_{G} is reached when the generator loss function attains its minimum. At this point, the probability distribution of the generated data is equal to the distribution of the dataset (p_g = p_data). An extensive mathematical demonstration of the GAN formulas is shown in https://medium.com/@jonathan_hui/proof-gan-optimal-point-658116a236fb.

Training algorithm

Taking into account that this minimax game has a global minimum, as demonstrated before, with the training algorithm proposed by Ian Goodfellow et al. [32] in their paper (see algorithm 1) we can obtain the desired result.

Challenges of GANs

GANs have several advantages in comparison with other generative models, mainly computational ones, and they are also capable of representing very sharp and degenerate distributions. However, they have some challenges, like "mode collapse"⁴, which is not yet fully understood. Other contributions addressing the stability of the optimization procedure have been proposed in the last 5 years, achieving better results in modelling more complex datasets and stabilizing training. Let us introduce some of the latest contributions to GANs that we have experimented with in this work.

4refers to models that are not able to learn the different modes of the probability density function of the dataset, preventing the model from generating a wide variety of outputs

Algorithm 1: Minibatch stochastic gradient descent training for generative adversarial networks
for epochs 1, ..., N do
  for k steps do
    Sample a minibatch of m noise samples {z^{(1)}, ..., z^{(m)}} from the noise prior p_g(z);
    Sample a minibatch of m examples {x^{(1)}, ..., x^{(m)}} from the data distribution p_{data}(x);
    Gradient ascent for the discriminator parameters φ:

    \[ \nabla_\phi V(G_\theta, D_\phi) = \nabla_\phi \frac{1}{m} \sum_{i=1}^{m} \Big[ \log D_\phi\big(x^{(i)}\big) + \log\Big(1 - D_\phi\big(G_\theta(z^{(i)})\big)\Big) \Big] \tag{2.14} \]

    Gradient descent for the generator parameters θ:

    \[ \nabla_\theta V(G_\theta, D_\phi) = \nabla_\theta \frac{1}{m} \sum_{i=1}^{m} \log\Big(1 - D_\phi\big(G_\theta(z^{(i)})\big)\Big) \tag{2.15} \]

  end
end
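A condensed PyTorch sketch of algorithm 1, using the binary cross-entropy formulation of equation 2.12 and assuming the discriminator ends in a sigmoid. The `generator`, `discriminator`, data loader and optimizer settings are placeholders, not the exact models used in this thesis; note also that the generator step uses the common non-saturating variant (maximize log D(G(z))) rather than minimizing log(1 − D(G(z))).

```python
import torch
import torch.nn.functional as F

def train_gan(generator, discriminator, loader, z_dim=100, epochs=10, lr=2e-4, device="cpu"):
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
    for _ in range(epochs):
        for real, _ in loader:
            real = real.to(device)
            z = torch.randn(real.size(0), z_dim, device=device)
            fake = generator(z)

            # Discriminator step: push D(real) towards 1 and D(fake) towards 0
            d_real = discriminator(real)
            d_fake = discriminator(fake.detach())
            loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
                     F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # Generator step (non-saturating): push D(fake) towards 1
            d_fake = discriminator(fake)
            loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```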

Least Squares Generative Adversarial Networks (LSGANs)
Least Squares Generative Adversarial Networks, proposed by Mao et al. [33], use the same architectures as vanilla GANs. However, they change the cost function. They state that the "sigmoid cross entropy loss function used in the discriminator in vanilla GANs leads to vanishing gradients problems when updating the generator using the fake samples that are on the correct side of the decision boundary, but are still far from the real data." When fake samples from the generator are used to update the generator by making the discriminator believe they are from real data, they cause almost no error because they are on the correct side. Based on this idea, they propose a least squares loss function for the discriminator. This new function penalizes samples that lie far from the real data even if they are on the correct side of the decision boundary. The resulting loss functions for the discriminator and the generator are the following:

\[ \min_D V_{LSGAN}(D) = \tfrac{1}{2}\,\mathbb{E}_{x\sim p_{data}(x)}\big[(D(x) - b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z\sim p_z(z)}\big[(D(G(z)) - a)^2\big] \]
\[ \min_G V_{LSGAN}(G) = \tfrac{1}{2}\,\mathbb{E}_{z\sim p_z(z)}\big[(D(G(z)) - c)^2\big] \tag{2.16} \]

where a and b are the labels for fake and real data, zero and one respectively, and c

denotes the value that G wants D to believe for fake data, one.
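Equation 2.16 translates directly into a few lines of PyTorch; a minimal sketch, assuming the discriminator outputs raw (unbounded) scores and using a = 0, b = 1, c = 1 as in the paper:

```python
import torch

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    # 0.5 * E[(D(x) - b)^2] + 0.5 * E[(D(G(z)) - a)^2]
    return 0.5 * torch.mean((d_real - b) ** 2) + 0.5 * torch.mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, c=1.0):
    # 0.5 * E[(D(G(z)) - c)^2]
    return 0.5 * torch.mean((d_fake - c) ** 2)
```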

Contributions and challenges

In the paper, Mao et al. [33] demonstrate that LSGANs generate higher quality images than standard GANs and offer faster and more stable training. However, LSGAN still suffers from mode collapse. Other new loss functions have been proposed after LSGANs.

Wasserstein Generative Adversarial Networks (WGANs)
Arjovsky et al. [7] also propose a new loss function, motivated by the vanishing and exploding gradients problem of the GAN discriminator loss. The Wasserstein distance, however, enables smoother gradients during training. To achieve this, they use the Wasserstein distance, or Earth Mover's distance (EM), which is the minimum cost of moving data from one probability distribution into another probability distribution (see equation 2.17).

\[ W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y \rVert\big] \tag{2.17} \]

where Π(P_r, P_g) is the set of all joint distributions γ(x, y) whose marginals are P_r and P_g, i.e., all possible plans for moving data from one distribution to the other. γ(x, y) states how much data needs to be moved from point x to point y to make x follow the same probability distribution as y, and ‖x − y‖ is the travelling distance between x and y. Therefore, the expected cost averaged across all the (x, y) pairs can be computed as:

\[ \sum_{x,y} \gamma(x, y)\,\lVert x - y \rVert = \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y \rVert\big] \tag{2.18} \]

We then compute the infimum (inf), or greatest lower bound, which indicates the smallest cost. However, exploring all the possible distributions in Π(P_r, P_g) to compute the infimum is intractable. Consequently, the authors propose a transformation of the formula based on the Kantorovich-Rubinstein duality⁵:

\[ W(\mathbb{P}_r, \mathbb{P}_g) = \frac{1}{K} \sup_{\lVert f \rVert_L \le K} \mathbb{E}_{x\sim\mathbb{P}_r}[f(x)] - \mathbb{E}_{x\sim\mathbb{P}_g}[f(x)] \tag{2.19} \]

where sup is the least upper bound, or supremum. So, like in other deep learning problems, we can learn f with a neural network. Still, this new Wasserstein distance needs to satisfy ‖f‖_L ≤ K, which means that the function needs to be K-Lipschitz continuous.

5for further explanations of the Kantorovich-Rubinstein duality check Vincent Herrmann's post: https://vincentherrmann.github.io/blog/wasserstein/

A real-valued function f : R → R is K-Lipschitz continuous if there exists a real constant K ≥ 0 such that, for all x1, x2 ∈ R:

\[ |f(x_1) - f(x_2)| \le K\,|x_1 - x_2| \tag{2.20} \]

where K is a known Lipschitz constant for the function f, which, in our case, is equal to 1. To satisfy this requirement, Arjovsky et al. [7] propose clipping the weights of the neural network between two hyperparameters −c and c.
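In practice, weight clipping is a short operation applied after every critic update; a minimal PyTorch sketch, where `critic` is a placeholder for the critic network and c = 0.01 is the default value used by Arjovsky et al. [7]:

```python
# Clamp every critic weight to [-c, c] after each optimizer step
c = 0.01
for p in critic.parameters():
    p.data.clamp_(-c, c)
```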

Resulting loss function

Hence, as we have already mentioned, we can obtain the function f with a neural network, which is going to be very similar to the discriminator. However, instead of a probability, this network outputs a scalar score of how real the input images are. Therefore, due to this change of role, they rename the discriminator to critic. The resulting loss functions are the following:

\[ \max_f L(f) = \mathbb{E}_{x\sim p_{data}}[f(x)] - \mathbb{E}_{z\sim p_z}[f(G(z))] \]
\[ \min_G L(G) = -\mathbb{E}_{z\sim p_z}[f(G(z))] \tag{2.21} \]

where f is the critic and G is the generator.

Contributions and challenges

As with LSGAN, WGAN proposes a new and meaningful loss metric for the generator's convergence, improved stability during the training process (see figure 2.19) and good sample quality. One of the most important results of WGANs is the absence of mode collapse in the experiments, although the community still does not fully understand the reason. Further contributions have been made to WGANs to ensure Lipschitz continuity by means other than weight clipping, which may still produce poor quality images and fail to converge due to the high sensitivity to the c hyperparameter.

Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP)
Gulrajani et al. [8] propose a different way of satisfying 1-Lipschitz continuity in the critic, rather than weight clipping. In the paper, they demonstrate how weight clipping behaves as a weight regularizer, reducing the capacity of the critic f and limiting its capability to model complex functions. They show how weight clipping causes vanishing and exploding gradients compared to the gradient penalty (see figure 2.20).

Figure 2.19: Image from Arjovsky et al. [7] where they show the discriminator gradients for GAN and WGAN

In the paper they propose a gradient penalty as a regularizer added to the WGAN loss function. A function f is 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere. So, instead of applying weight clipping, they penalize the model if the gradient norm moves away from its target norm of 1. Therefore the resulting loss function for the critic is:

\[ \min_f L(f) = \mathbb{E}_{z\sim\mathbb{P}_z}\big[f(G(z))\big] - \mathbb{E}_{x\sim\mathbb{P}_x}\big[f(x)\big] + \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} f(\hat{x}) \rVert_2 - 1\big)^2\Big] \tag{2.22} \]

where x̂ is sampled along straight lines between generated and real samples, with t uniformly sampled between 0 and 1:

\[ \hat{x} = t\,G(z) + (1 - t)\,x, \qquad 0 \le t \le 1 \tag{2.23} \]
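A minimal PyTorch sketch of the gradient penalty term of equation 2.22, with λ = 10 as recommended by Gulrajani et al. [8]; `critic`, `real` and `fake` are placeholders for the critic network and a batch of real and generated images:

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Interpolate between real and generated samples (equation 2.23)
    t = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (t * fake + (1 - t) * real).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grads = grads.view(grads.size(0), -1)
    # Penalize deviations of the gradient norm from 1
    return lambda_gp * torch.mean((grads.norm(2, dim=1) - 1) ** 2)
```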

Contributions and challenges

Gulrajani et al. [8] have contributed strong modeling performance and training stability with WGAN-GP. However, the gradient penalty adds an extra computational cost to training. The WGAN-GP loss function is currently used in the latest state-of-the-art GAN architectures like Progressive GAN, StyleGAN and BigGAN, which are now able to model complex datasets with high resolution image generation. Still, more research needs to be done on mode collapse in GANs, which is not yet a fully understood phenomenon, on the latent space size, and on other aspects of adversarial networks.

Figure 2.20: Graphs extracted from Gulrajani et al. [8]. We can perceive the effect of vanishing and exploding gradients with weight clipping during training on the Swiss Roll dataset, unlike gradient penalty.

2.3.3 Conclusion
In this section we have gone through an overview of the most relevant generative models, with more focus on VAEs and GANs. We have seen the main advantages and drawbacks of each of the models, with which we will experiment in the following sections. The latest state-of-the-art GANs are not introduced because we do not experiment with them due to time and hardware constraints.

Chapter 3

Dataset

In this section we will go through all the steps we have followed to obtain a multimodal dataset for our specific task. We will explain the different datasets used, the processing applied to the data and the resulting datasets. For our specific task we need audio features, the corresponding album cover and metadata with the music genre:

• Audio features. Audio features are the different properties/knowledge that humans have extracted from raw audio. Some audio feature examples are: loudness, bpm (beats per minute), mel frequency cepstrum coefficients, tuning frequency, etc. As we already mentioned when reviewing related work, we think conditioning with audio features will be more beneficial for album artwork generation based on music similarity than just conditioning with genre tags, as Hepburn et al. [2] do.

• Album artwork. Only front covers from albums are considered for album artwork generation.

• Music genre metadata. Music genre metadata is a must, as we will use it to manually evaluate album cover generation. This way we do not analyze each of the features from the audio conditioning and just do a general analysis by looking at the genre¹ with which album artwork generation is conditioned.

Due to the specific task of this thesis, there are not many multimodal datasets including audio and image data. To train the considered architectures, two different datasets are used and mixed: the MSD-I dataset and a self-developed dataset based on AcousticBrainz and MusicBrainz.

1as we have done in the state of the art when reviewing album artwork based on the music genre.

3.1 MSD-I dataset

The MSD-I dataset is a multimodal dataset created by Oramas et al. [34]. This dataset contains audio, visual and audio-visual embeddings, links to album covers and some metadata including unique genre tags from 15 classes. The total number of tracks is 30,713, which are associated with 16,753 unique albums. Therefore there are 16,753 unique album covers, of 200x200 resolution and 8 bit depth. Covers are randomly divided into three parts: 70% for training, 15% for validation, and 15% for test, with no artist and album overlap across these sets. The number of samples per genre is shown in figure 3.1(a).

3.2 Custom dataset based on Acoustic Brainz and Music Brainz

AcousticBrainz² is a huge open source database developed by the Music Technology Group of Pompeu Fabra University, which contains metadata from MusicBrainz³ (an open music encyclopedia that collects music metadata) and audio features extracted with Essentia⁴ (an open-source library and tools for audio and music analysis, description and synthesis). To create our customized dataset we first download the low-level dumps from https://acousticbrainz.org/download, which contain metadata from AllMusic, Discogs, Lastfm and Tagtraum together with the MusicBrainz id of the album. MusicBrainz has a cover art API⁵ which requests album covers from coverartarchive.org based on the MusicBrainz id. Therefore, once we had the metadata and audio data from the low-level dumps, we only considered single-tagged songs from the 15 different music genres shown in figure 3.1(b). After applying this filter we downloaded, using the MusicBrainz id from the low-level dumps, the front album covers with the cover art API. We finally obtained a total of 104,196 songs, of which only 21,391 had unique covers (see figure 3.1(b)).
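A minimal sketch of how a front cover can be fetched from the Cover Art Archive given a MusicBrainz release id; the endpoint shown is the public `/release/<mbid>/front` route of coverartarchive.org, while the helper name and error handling are our own illustration rather than the exact download script used in this work:

```python
import requests

def download_front_cover(release_mbid, out_path):
    # Cover Art Archive serves the front cover of a release by its MusicBrainz id
    url = f"https://coverartarchive.org/release/{release_mbid}/front"
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        with open(out_path, "wb") as f:
            f.write(response.content)
        return True
    return False  # no front cover available for this release
```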

3.3 Resulting datasets

Having these two datasets, we mixed them after several preprocessing steps, resulting in two final datasets: a covers-only dataset with only visual data, which we use to experiment with image generation without taking audio features into consideration; and an audio-and-cover dataset with audio and visual data, for album artwork generation based on audio.

2https://acousticbrainz.org/ 3https://musicbrainz.org/ 4https://essentia.upf.edu/ 5https://musicbrainz.org/doc/Cover_Art_Archive/API

Figure 3.1: Number of samples per genre in: (a) MSD-I dataset, (b) customized dataset

Covers-only dataset

To create this set (see figure 3.2), we first ensured the MSD-I dataset did not have duplicate images in different splits by running an image hash algorithm⁶ implemented by Adrian Rosebrock [35], and resized the covers to 128x128 resolution. From our customized downloaded dataset we removed duplicates and album compilations, resulting in a total of 19,373 covers. Then, samples were randomly split following the same criterion as the MSD-I dataset: 70% for training, 15% for validation and 15% for testing, resized to 128x128 resolution and mixed with the MSD-I dataset. The total number of covers when merging both datasets is 50,084. Once both sets were mixed, we ensured that similar images did not end up in different splits. The image hash algorithm maps images to numbers, and these numbers vary even when only one pixel has changed. Because images from our customized dataset are downloaded from different sources and have been scaled from different original sizes, images that human perception would consider duplicates are not flagged as duplicates by the image hash algorithm. To detect similar images we implemented an algorithm for similar image detection. The procedure is the following:

• First, all images are resized to 4x4 resolution.

• Image pixel values are reshaped and saved as a vector.

6a hash function maps data to fixed-size values

• Finally, a KDTree is used to organize all the images in this space, so that, by finding the closest neighbours, we were able to find similar images⁷.

Once similar images were detected, they were all moved into the same split (train, validation or test). A minimal sketch of this procedure is shown below.
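The sketch assumes the covers are available as image files referenced by a hypothetical `cover_paths` list; the 50.0 distance threshold is illustrative, not the exact value used in this work:

```python
import numpy as np
from PIL import Image
from scipy.spatial import cKDTree

def tiny_vector(path):
    # Downscale to 4x4 and flatten the pixel values into a single vector
    img = Image.open(path).convert("RGB").resize((4, 4))
    return np.asarray(img, dtype=np.float32).ravel()

vectors = np.stack([tiny_vector(p) for p in cover_paths])  # cover_paths: list of cover files (assumed)
tree = cKDTree(vectors)
# Pairs of covers closer than the threshold are treated as near-duplicates
similar_pairs = tree.query_pairs(r=50.0)
```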

Figure 3.2: Processing flow to obtain cover dataset

Audio-cover dataset

This set of data is composed of the audio features and related covers of the MSD-I dataset and our AcousticBrainz-based dataset. The reason for mixing these two datasets is that deep learning is very data hungry. If we only used MSD-I data, we would only have 30,713 audio features from 16,753 unique albums. With our custom dataset we would have a total of 104,196 audio features and only 21,391 unique albums. By mixing both datasets we have a total of 119,592 samples, of which 35,022 are unique albums. This way we have a decent dataset in number of samples. Mixing these two datasets means that we need to concatenate audio features from both datasets into one audio vector. This might sound messy; however, a neural network should be able to grasp knowledge embedded independently in different datasets. In order to make this mix of datasets, several data preprocessing steps were necessary:

1. Ensure the same number of features in both datasets. Because we want to mix both datasets, preprocessing of the audio features is necessary in order to have the same number of features from each dataset, scaled and normalized similarly. As a first step we evaluate the number of audio features from each dataset: 2048 from MSD-I and 580⁸ from AcousticBrainz. Because 580 features is the minimum number from both datasets, we apply PCA to compress the 2048 features from MSD-I into 580.

7we need to highlight that applying a KDTree over hash values was not detecting similar images

Figure 3.3: Processing flow of audio and cover dataset

2. Scale and standardize. AcousticBrainz features extracted with Essentia have different orders of magnitude between features (from 10⁻² to 10¹⁷). Because we want all features to have the same impact on our neural network, we standardize feature-wise by subtracting the mean and scaling to unit variance. A minimal sketch of these two preprocessing steps follows this list.
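The sketch below uses scikit-learn and assumes `msdi_features` and `ab_features` are the raw feature matrices of the two datasets (hypothetical variable names):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: compress the 2048-dimensional MSD-I embeddings to 580 dimensions,
# matching the AcousticBrainz feature size
msdi_580 = PCA(n_components=580).fit_transform(msdi_features)

# Step 2: standardize each feature to zero mean and unit variance
msdi_580 = StandardScaler().fit_transform(msdi_580)
ab_580 = StandardScaler().fit_transform(ab_features)
```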

Once the audio features from both datasets were preprocessed, we take the union of the datasets by checking whether a song is the same using the metadata from AcousticBrainz and from the Million Song Dataset. Because metadata with the song and album names is not in the MSD-I dataset, we used metadata from the Million Song Dataset⁹, from which MSD-I was created. When checking each song, two different cases can appear:

1. Same song in both datasets. In this case, features from both sets are merged into one vector of size 1160 where MSD-I features take the first 580 positions and AcousticBrainz features the last 580.

2. Song is not repeated. In case the song appears in one dataset but not in the other, the missing features are filled with zeros. Therefore, if the song is only in the MSD-I dataset we build a vector of size 1160 where the first 580 positions belong to the MSD-I

8because some AcousticBrainz features are frame-based and depend on the duration of the song, the mean and standard deviation are extracted from frame-based features to make all samples the same size 9http://millionsongdataset.com/sites/default/files/AdditionalFiles/track_metadata.db

audio features and the second 580 are filled with zeros. In case the song only appears in the AcousticBrainz dataset, the first 580 positions are filled with zeros and the last 580 with the AcousticBrainz features.

After these processes we end up with a total of 119,592 samples, of which 30,711 belong to the MSD-I dataset, 90,545 to the AcousticBrainz dataset and 1,664 songs belong to both sets and therefore have a fully filled data vector. Check figure 3.3 to see a schema of the process followed to build this dataset.

Chapter 4

Methodology

As we have seen in section 2.2, due to the computational power and new techniques in AI, generative models are providing good results in image synthesis. We also reviewed in section 2.3 different types of generative models, their benefits and drawbacks. In the next chapter we will report on our experiments using them, but here we need first to characterize or specify the architectures and parameters we are going to play with. The main objective of this thesis is to generate album artwork based on audio. However, before experimenting with conditioned artwork generation on audio we are going to do some preliminary experiments, to explore different image reconstruction techniques and generative models. For these experiments we are only going to consider our Covers-only dataset:

• Representation learning architectures for good image reconstruction. We first analyze different autoencoder architectures to obtain good quality reconstructed images.

• Generative models for cover-to-cover generation. The next steps consist of dealing with the first generative models, specifically VAEs and GANs, without considering the audio component of the album cover.

After achieving good results on image generation trained only on covers, we will experiment with image generation conditioned on audio samples:

• Generative models for audio-to-cover generation. We finally implement generative models with audio conditioning to obtain audio-to-cover generation.

4.1 Preliminary experiments

As we already mentioned, before experimenting with album artwork generation based on audio we are going to do some preliminary experiments on different image reconstruction techniques and image generation. Once we have a clear idea of the image reconstruction techniques and the best models for image generation, we move to the considered architectures for album artwork generation based on audio samples.

4.1.1 Representation learning for good image reconstruction
We first implement an autoencoder for 128x128 image reconstruction with a standard architecture: convolutional and max-pooling layers for the encoder, and transposed convolution and max-unpooling layers for the decoder. We also use batch normalization and LeakyReLU as activation function (see figure 4.1). As we will see in the results, the images reconstructed by this autoencoder have some issues. Thus, we propose an alternative architecture inspired by [36], based on convolutional layers in the encoder, as in the previous architecture, but where the reconstruction of the image is done by upsampling, padding and convolving each of the inputs. Batch normalization and LeakyReLU are also used (see figure 4.2).
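The two decoder variants can be sketched in PyTorch as follows; the channel sizes and kernel parameters are illustrative, not the exact values of our architectures:

```python
import torch.nn as nn

# Decoder block based on a transposed convolution (prone to checkerboard artifacts)
deconv_block = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # doubles spatial size
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2),
)

# Alternative block: upsample, reflection-pad, then convolve
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.ReflectionPad2d(1),
    nn.Conv2d(128, 64, kernel_size=3),  # padding handled by the reflection pad above
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2),
)
```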

Figure 4.1: Architecture of autoencoder based of convolutional and transpose convolutional layers

42 Figure 4.2: Architecture of autoencoder based on convolutional layers. Reconstruction of the image done by upsampling and padding layers.

4.1.2 Album artwork generation only trained on covers
After comparing the previous architectures we obtain the preferred architecture for image reconstruction¹. Therefore, in this step we experiment with VAEs and GANs following a similar architecture for image reconstruction to the previous experiments. During training, only covers are considered for image generation; no audio conditioning is applied. The architecture proposed for the VAE is a vanilla VAE, shown in figure 4.3. For GANs, we experiment with the DCGAN architecture proposed by Gulrajani et al. [37], but extended for 128x128 image resolution². We also experiment with the main GAN loss functions reviewed in the state of the art and adapt, for each of the losses, the architecture proposed by the authors (see the architecture in figure 4.4). Taking this architecture, we experiment with three different loss functions:

1. vanilla GAN loss function (see equation 2.12); 2. LSGAN loss function (see equation 2.16); 3. and WGAN-GP loss function (see equation 2.22), changing Batch Normalization in the critic for Instance Normalization as Gulrajani et al. recommend in [8].

1However, we will only be able to make transposed convolutional layers converge 2DCGAN from [37] has a 64x64 image resolution

43 Figure 4.3: VAE architecture employed.

Figure 4.4: GANs architecture proposed for DCGAN, LSGAN and WGAN-GP.

4.2 Album artwork generation based on audio

Generation of album covers based on audio input is our final goal. To generate covers considering audio we propose the same VAE and GAN architectures as in the previous subsection 4.1.2, but using Feature-wise Linear Modulation (FiLM), introduced by Perez et al. [38], to condition cover generation. FiLM layers are able to modulate the per-feature-map distribution of activations³. By learning two functions f and h we calculate γ_{i,c} and β_{i,c} (see equation 4.1), which modulate the feature map activations F_{i,c}, as shown in equation 4.2. This conditioning method is computationally very efficient and scalable because it only needs to learn two parameters per feature map, which depend on the number of features and not on the size of the input. This is critical in image generation, because an increase in resolution involves a huge increase in memory usage, which does not affect FiLM layers.

\[ \gamma_{i,c} = f_c(x_i) \qquad \beta_{i,c} = h_c(x_i) \tag{4.1} \]

\[ FiLM(F_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c}\,F_{i,c} + \beta_{i,c} \tag{4.2} \]
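A minimal PyTorch sketch of a FiLM layer following equations 4.1 and 4.2; here f_c and h_c are implemented as single linear layers, which is a simplification of the conditioning networks actually used:

```python
import torch.nn as nn

class FiLM(nn.Module):
    """Predicts per-channel gamma and beta from a conditioning vector
    and modulates a feature map as gamma * F + beta."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.f = nn.Linear(cond_dim, num_channels)  # gamma_{i,c} = f_c(x_i)
        self.h = nn.Linear(cond_dim, num_channels)  # beta_{i,c}  = h_c(x_i)

    def forward(self, feature_map, cond):
        # feature_map: (B, C, H, W), cond: (B, cond_dim)
        gamma = self.f(cond).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.h(cond).unsqueeze(-1).unsqueeze(-1)
        return gamma * feature_map + beta
```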

Conditional VAE

For VAEs we add to the previous vanilla-VAE architecture (without conditioning) the audio conditioning part of the architecture (see figure 4.5). Audio vectors are fed into a multilayer perceptron of 3 layers, whose output is then input to the conditioning functions, each consisting of 2 linear layers. We do not run further experiments due to the results obtained.

Conditional GAN

For the GAN architectures we try the best models obtained without audio conditioning, LSGAN and WGAN-GP, and we add FiLM layers to the generator. We first train with data only from AcousticBrainz, which has a total of 90,545 songs (with 580-dimensional features) and covers (of which 21,391 were unique). We then train with our full custom dataset, which has more data. The proposed architecture is shown in figure 4.6. The audio conditioning part is exactly the same as the one implemented for the conditional VAE. However, the latent space to which the audio vectors are mapped is instead set to 512, as Karras et al. propose for image generation control with StyleGAN in [39].

3where i is the number of inputs and c is the number of features

Figure 4.5: VAE architecture with FiLM layers for audio conditioning on image generation.

Figure 4.6: Generator architecture with FiLM layers for audio conditioning on image generation.

4.3 Training methodologies

Different training methodologies are employed for each of the architectures used in the experiments:

• For AEs and VAEs, we train these models with the following methodology: we stop training for a specific learning rate when the minimum validation loss does not improve for 5 epochs. We repeat the process with lower learning rates until the loss converges. Once it converges, we manually analyze the output images. A minimal sketch of this schedule is shown after this list.

• For GANs, we do not use standard evaluation metrics for generative models like the Inception Score (IS) or the Fréchet Inception Distance (FID)⁴, as we consider this task has a high creative and subjective component. We do not think any of these metrics would help us find the best model for album artwork generation. Therefore, to evaluate GAN performance, we manually check the output of the generative models every 100 iterations and select the model with the best output images.
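The AE/VAE schedule mentioned in the first bullet can be sketched as follows; `train_one_epoch` and `validate` are hypothetical helper functions standing in for a standard training and validation loop, and the learning rates are illustrative:

```python
import torch

def fit(model, learning_rates=(1e-3, 1e-4, 1e-5), patience_limit=5):
    best_val = float("inf")
    for lr in learning_rates:             # restart training with a lower learning rate
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        patience = 0
        while patience < patience_limit:  # stop after 5 epochs without improvement
            train_one_epoch(model, optimizer)  # hypothetical helper
            val_loss = validate(model)         # hypothetical helper
            if val_loss < best_val:
                best_val, patience = val_loss, 0
            else:
                patience += 1
```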

4.4 Proposed evaluation - Survey

As we have mentioned, standard evaluation metrics, such as the IS and FID, are not considered in this work. We instead propose a qualitative evaluation, specifically a survey, to analyze audio conditioning in album artwork generation. Taking into consideration that our main goal regarding the survey is to verify that our model is able to generate album artwork that fits a certain music genre, we define the subjects, materials and procedure of the survey:

Subjects
Because our task requires an understanding of music genres and the related visual artwork, the desired subjects are people who consume music frequently and who are aware of the importance of the visual content of music albums. The minimum number of subjects considered is 30, so that the answers of each subject have an impact of 3.33%. However, depending on the time spent consuming music and the importance each subject gives to the visual content of music albums, some weighting can be applied to the answers.

Materials
We select Google Forms as the platform to implement the survey. In order to make the

4see more information about these metrics in https://medium.com/@jonathan_hui/gan-how-to-measure-gan-performance-64b988c47732

evaluation, we are going to need original album artwork and the covers generated from our models with and without conditioning. Images will be selected randomly, with a resolution of 128x128 for all the samples.

Procedure⁵
In the questionnaire we first introduce the subjects to our topic, stating the number of questions and images they will be shown. Then, some introductory questions are asked regarding the time spent consuming music, the importance of album artwork and the selection of the most familiar music genre among the ten music genres used in the survey. After this, we start with the questionnaire. There are three different types of questions, based on the images shown:

1. Original album covers. Two original album covers from two different genres are shown and we ask the subject to select the album artwork that corresponds to a specific genre. Then, to ensure that the answer has been based on their perception of the artwork and not on personal knowledge about the artist, we ask about this. In case the subject has some personal knowledge about the artist, the answer to the album cover genre identification is discarded.

2. Generated album covers without audio conditioning. Two generated album covers without conditioning are shown, and we ask to select the album artwork that corre- sponds to a specific genre.

3. Generated album covers with audio conditioning. Two generated album covers with conditioning are shown, and we ask to select the album artwork that corresponds to a specific genre.

In the questionnaire, ten music genres are used. A total of 40 questions are asked, where 20 questions correspond to original album covers (therefore, we ask to identify an original album cover of a specific music genre twice), 10 to generated album covers without audio conditioning and 10 to generated album covers with audio conditioning. We need to highlight that no indication of whether the generated album artwork is conditioned or not is given, in order not to bias the answers.

Statistics
No specific statistics are planned due to the results obtained (see section 5.3, Album artwork generation based on audio). However, we have defined the conclusions we can extract from the questions proposed. With the original album cover questions we have a control measure with which we can compare and evaluate the answers for the generated outputs of our model. From the last two

5Check appendix A to see a draft of the proposed procedure

types of questions, we can evaluate whether audio conditioning is perceived or not by comparing the answers for conditioned album artwork with the answers for generated album covers without audio conditioning.

Chapter 5

Results and discussion

In this chapter we will go through the results obtained from the different experiments already explained in the methodology, and the related discussion. As mentioned in section 4.3, we follow different training methodologies for each of the architectures. We need to highlight that the album artwork generated by these models is not analyzed or considered as final designs; the new album artwork generations are considered as sketches of album covers. Experiments have been run on Google Colab, which provides K80, T4, P4 and P100 NVIDIA GPUs. We use PyTorch 1.6.0 as the deep learning framework.

5.1 Preliminary experiments

5.1.1 Representation learning for good image reconstruction
From the two autoencoder architectures proposed we obtained different outputs. In figure 5.1(a) we show the original images from the test set; the output from the autoencoder with transposed convolutional layers and max-unpooling is shown in figure 5.1(b), and the output of the autoencoder with upsampling, reflection padding and convolutional layers in figure 5.1(c).

5.1.2 Generative models for cover generation trained on covers
VAE
For the VAE architecture we train the model for 78 epochs, until the validation loss no longer improves for 5 epochs at any learning rate (see figure 5.2). In order to generate new album covers, we sample the mean and standard deviation from the latent space and apply equation 2.10. The outputs are shown in figure 5.4. We do a further analysis of the model by calculating the 2D t-SNE of the latent space for all music genres, to see how the architecture is placing album covers in the latent space. We only plot the blues, country and electronic 2D latent space for better visualization (see figure 5.5).


Figure 5.1: (a) Original images from test set, (b) Output images from transposed convolutional layers and unmaxpooling autoencoder architecture, (c) Output images from the autoencoder with upsampling, reflection pad and convolutional layers

GANs
As mentioned in the methodology, we experiment with three different GANs: DCGAN, LSGAN and WGAN-GP. We train each of the models for more than 75,000 iterations and show images from each of the models at 25,000, 50,000 and 75,000 iterations in figure 5.6. Then we show eight original-size images (128x128 resolution) manually selected from the LSGAN and WGAN-GP models in figures 5.7 and 5.8.

51 Figure 5.2: VAE model loss graph for training and validation set with the corresponding learning rates for each of the epochs


Figure 5.3: VAE model: (a) Original images from a batch of the test set (b) Output images for batch of figure (a)

Figure 5.4: VAE model generated images

Figure 5.5: 2D VAE model latent space for all the blues, country and electronic images in the training set


Figure 5.6: Images at 25,000, 50,000 and 75,000 iterations of the GAN models without audio conditioning: (a) DCGAN, (b) LSGAN, (c) WGAN-GP

Figure 5.7: Eight manually selected images of 128x128 resolution generated from the LSGAN model without audio conditioning.

Figure 5.8: Eight manually selected images of 128x128 resolution generated from WGAN-GP model without audio conditioning.

5.2 Preliminary experiment's discussion

5.2.1 Representation learning for good image reconstruction
For the image reconstruction experiments, we analyse the outputs from the architectures shown in figures 4.1 and 4.2. We perceive how figure 5.1(b) presents some artifacts that figure 5.1(c) does not have. These artifacts are called checkerboard patterns. This type of pattern is caused by transposed convolutions, which make some of the pixels of the image receive overlapping contributions. Odena et al. [36] do a further analysis on checkerboard patterns. Therefore, by analyzing the outputs from these two different architectures, upsampling, reflection padding and convolutional layers should be the selected image reconstruction methodology. However, we only manage to make the GAN models converge, and obtain better results, with transposed convolutions.

5.2.2 Generative models for cover generation trained on covers
VAE
Following the methodology mentioned for the VAE and the architecture shown in figure 4.3, we are able to make the model converge for the train and validation sets (see figure 5.2). The resulting outputs for the test set are shown in figure 5.3. We can see that the output images in figure 5.3(b) are blurred; this blurriness is a typical feature of VAEs. Some newly generated album cover samples from the model are shown in figure 5.4. These outputs cannot be considered album covers, as they do not have defined figures nor what we could consider text elements of a cover, such as the album title or the artist. With the 2D t-SNE latent space representation we obtain more insights about why the output is not what we expected (see figure 5.5): we see how the three genres are placed in the same region of the latent space, which causes these blurry and noisy images.

GANs
From the three architectures already introduced in figure 4.4, we obtain the resulting images shown in figure 5.6. We perceive that there is mode collapse¹ only for DCGAN; LSGAN and WGAN-GP obtain higher quality results and do not suffer from mode collapse. Comparing image quality between LSGAN and WGAN-GP in figures 5.7 and 5.8, we obtain better defined figures, sometimes human-like representations, and what could be the titles of the album artwork with LSGAN. We also see text representations in WGAN-GP, but less defined. However, better contrast is obtained with WGAN-GP. See more generated album artwork from the LSGAN model in appendix B.

1we already mentioned mode collapse in the state of the art. Still, to refresh memory, mode collapse refers to models that are not able to learn the different modes of the probability density function of the dataset, preventing the model from generating a wide variety of outputs

5.3 Album artwork generation based on audio

Conditional VAE

We train the conditional VAE for 20 epochs following the training methodology already mentioned. Output images generated from the test set are shown in figure 5.9. To generate new album covers with the conditioned VAE model, we run an experiment for image generation where we calculate the centroid of the blues album covers in the latent space of the VAE. We do this in order to obtain the most representative point of blues album artwork. This way, the expected output image should have blues album cover characteristics, obtained by sampling from the blues centroid in the latent space and conditioning with one of the blues audios from the test set. The resulting image is shown in figure 5.10. We finally run the same experiment as without audio conditioning: we plot the 2D t-SNE of the latent space for the blues, country and electronic images from the training set, to see how the architecture is placing album covers in the latent space. The resulting latent space is shown in figure 5.11.

Figure 5.9: VAE model with audio conditioning: (a) Original images from a batch of the test set (b) Output images for batch of figure (a)

Figure 5.10: VAE model with audio conditioning: image that corresponds to the audio features conditioning the generation, obtained by sampling from the mean and variance of the latent space

Figure 5.11: 2D latent space of the VAE with audio conditioning for all the blues, country and electronic images in the training set

56 Conditional GAN

For the conditional GAN we first experiment with LSGAN, as we obtained the best results with it for album artwork generation without conditioning. For LSGAN, we first train with samples from AcousticBrainz (see the results in figure 5.12(a)) and then with our custom dataset (explained in section 3.2), obtaining the outputs shown in figure 5.12(b). A further experiment to check whether audio conditioning affects image generation is implemented. We show 64 images in figure 5.13, in which images in the same column are generated using the same random vector, and images in the same row are conditioned with different audio features from the same genre. Due to the results obtained with the conditional LSGAN, we also train a WGAN-GP and obtain the images shown in figure 5.14.


Figure 5.12: Images for 25,000, 50,000 and 75,000 iterations of conditional LSGAN for: a) AcousticBrainz data, b) custom dataset.

57 Figure 5.13: In this figure each of the columns corresponds to the same noise vector input into the generator, and each of the rows corresponds to different audio feature vectors from the same genre. These outputs correspond to the conditioned LSGAN model.

Figure 5.14: Images for 25,000, 50,000 and 75,000 iterations of conditional WGAN-GP.

5.3.1 Album artwork generation based on audio discussion
Conditional VAE
The conditional VAE model also produces noisy and blurry images as generated album covers, like the vanilla VAE. By running the same experiment as for the vanilla VAE, we again see, in figure 5.11, how album covers of different genres are placed in the same region of the latent space. Therefore, the model is not placing the images correctly in the latent space, which might be a reason why the generated output is mainly noise.

Conditional GAN
Taking a look at the results from the LSGAN in figures 5.12(a) and 5.12(b), higher quality images are obtained with the whole custom dataset than with the AcousticBrainz samples only. However, we do not obtain the well-defined figures that the LSGAN model outputs without conditioning. We also see similar or equal images in the output batches. We can infer two things from this fact:

1. image generation outputs the same images independently of the audio; therefore, image generation is ignoring audio conditioning.

2. mode collapse

To better check that audio conditioning is not working for image generation, we show 64 images in figure 5.13. Images in the same column are generated using the same random vector; images in the same row are conditioned with different audio features from the same genre. We clearly see how audio conditioning is ignored by the neural network. Another possibility is that the audio conditioning layers are not able to condition image generation due to mode collapse in the conditional LSGAN. Because of this, we implement a conditional WGAN-GP, which does not suffer from mode collapse. However, the generated outputs cannot be considered as album covers or sketches: there are no clearly defined figures, despite having some text representations in the generated album covers.

Chapter 6

Conclusion

6.1 Conclusions

In this thesis, we have researched album artwork generation based on audio samples. For this, we have first collected a multimodal dataset with album artwork and audio features. We have also reviewed and experimented with the latest tools for image generation. We have analysed different image reconstruction techniques and experimented with image generation tools with and without audio conditioning. From the experiments run, we can obtain some important insights about album artwork generation based on audio samples. We have seen from the experiments how LSGAN achieves the best results in capturing the probability density function of our dataset. Generated images from this model could be interpreted as sketches for designers¹. Nevertheless, when conditioned, this model suffers from mode collapse or ignores audio conditioning. To face mode collapse we try WGAN-GP, as it does not suffer from it. Our WGAN-GP, however, is not able to model our album artwork dataset. To verify that audio conditioning is being ignored, further experiments should be run. As a first step, a simpler conditioning neural network should be implemented to check whether audio features do or do not influence image generation. Other possible approaches to solve these issues can be taken from the latest state-of-the-art GANs. One of the latest contributions is BigGAN [40]. BigGAN is able to model complex datasets, such as the one we are dealing with, using WGAN-GP. They, however, use complex and computationally expensive computer vision tools like self-attention. Because we are computationally limited, this type of tool is out of our scope. Still, the original WGAN-GP is able to model complex datasets like LSUN, but at a 32x32 resolution. Further research from Karras et al. [41] makes high-resolution image generation of 1024x1024 possible with the WGAN-GP loss function. This GAN is known as Progressive GAN. There are also other models that use conditioning to control image generation, like StyleGAN [39]. We follow

1further human evaluation would be needed to confirm this statement

a similar approach to condition image generation with audio. Still, the architecture and hyperparameters used are not the same as ours.

So, to demonstrate image generation conditioned with audio, we need a generator that does not suffer from mode collapse and which is able to condition image generation. As already mentioned, some of the image generation tools or experiments that we need to take into consideration to achieve a better result are:

• WGAN-GP loss function, as it does not suffer mode collapse,

• progressive training for higher image resolution and

• run simpler experiments on the conditioning neural network to check if audio conditioning really works. Another experiment that should be tried is to use a similar image conditioning architecture to the one StyleGAN uses to control image generation, but, in our case, we would use it for audio conditioning.

Therefore, following the approaches of these successful tools, we should be able to get a good generative model able to generate album artwork based on audio. Another important aspect is to train the models for more iterations. The aforementioned models are usually trained for more than 100,000 iterations and sometimes 200,000 iterations. Due to time and computational resource constraints we have trained the models for 75,000 iterations at most. We also wanted to experiment with different models to obtain a deeper analysis of the existing models, but we did not have enough computational resources. Because of this, we want to mention that more computational resources and facilities should be offered to students in order not to limit the learning and the research baseline of future students, as models get bigger and bigger every year.

All the models and experiments run in this thesis are publicly available at: https://github.com/jbu5105/Towards_album_artwork_generation_based_on_audio

6.2 Future work

These experiments serve as a baseline to obtain a final generative system able to generate album artwork conditioned by audio. As already mentioned, the main future work should focus on creating a good generative model with audio conditioning by trying, and analysing the results of, the techniques reviewed in the conclusion. Once we have a working model, we need ways to evaluate the system. Evaluating image generation based on music similarity is not a simple task that can be checked with Euclidean

distances between images and audio features. Because we believe in the highly subjective and creative component of this task, a subjective evaluation, such as a survey, is the best method for evaluating the model. The survey should first analyse user perception of real album artwork for each music genre, as a control measure: users would be asked to identify, between two album covers, which corresponds to a specific music genre. Then, as a second step, images generated by the model without audio conditioning would be shown, and finally with audio conditioning, asking the same question format as for real album covers. With these two final stages we could check whether users can perceive audio conditioning or not, thus evaluating the model.

Further work can be done once a successful model is obtained. As we mentioned when reviewing previous works, genre tags are not enough to condition image generation, as we think audio features should be considered instead. However, the lyrical content is also very important in the artwork design process. It is true that usually the meaning is also reflected in the music (instruments, harmony, melody, ...), but further research on conditioning image generation with lyrics should be taken into consideration. Another future work proposition is to research how different audio features influence image generation when applied at different depths in the generator. This is similar to NVIDIA's work with StyleGAN, where they analyse image generation control. By researching this idea, depending on the results, an interface for image generation based on audio conditioning could be implemented². This way, the model could be used as a tool for designers and artists during their creation process in album artwork design, where they could obtain album cover sketches by conditioning image generation with different songs from the album at different depths in the generator.

2NVIDIA already built an interface to have some control on image generation: https://www.youtube.com/watch?v=kSLJriaOumA

Bibliography

[1] C. Carr Dadabots and Z. Zukowski Dadabots, “Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands,” Tech. Rep., [Online] Available: http: //dadabots.bandcamp.com. Accessed on September 28, 2020.

[2] A. Hepburn, R. McConville, and R. Santos-Rodriguez, “Album cover generation from genre tags,” 10 2017.

[3] Y. Qiu and H. Kataoka, “Image generation associated with music data,” Tech. Rep.

[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, [Online] Available: http://www.deeplearningbook.org. Accessed on September 28, 2020.

[5] D. Foster, Generative Deep Learning, 2019, vol. 6, no. November.

[6] L. Weng, "Flow-based deep generative models," lilianweng.github.io/lil-log, 2018, [Online] Available: http://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html. Accessed on September 28, 2020.

[7] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” Tech. Rep.

[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved Training of Wasserstein GANs," Tech. Rep., [Online] Available: https://github.com/igul222/improved_wgan_training. Accessed on September 28, 2020.

[9] A. Pérez Sánchez, La grabación sonora como objeto de estudio, 10 2015, pp. 67–84.

[10] A. Dorochowicz and B. Kostek, “Relationship between album cover design and music genres,” 09 2019, pp. 93–98.

[11] R. Mayer, “Analysing the similarity of album art with self-organising maps,” vol. 6731, 06 2011, pp. 357–366.

[12] Y. Wen, R. Singh, and B. Raj, “Reconstructing faces from voices,” 2019.

[13] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. Mcguinness, J. Torres, and X. Giro-i-Nieto, "WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks," in ICASSP, 2019, [Online] Available: https://imatge-upc.github.io/wav2pix/. Accessed on September 28, 2020.

[14] I. Lopez Medel, "Death and resurrection of the album cover," Index Comunicación, 2014, [Online] Available: https://www.researchgate.net/publication/260424430_Death_and_resurrection_of_the_album_cover. Accessed on September 28, 2020.

[15] “Best Album Covers of the 80s,” [Online] Available: https://vocal.media/beat/ best-album-covers-of-the-80s. Accessed on September 28, 2020.

[16] “Unforgettable nineties album covers — Pixartprinting,” [Online] Available: https: //www.pixartprinting.co.uk/blog/nineties-album-covers/. Accessed on September 28, 2020.

[17] P. Galanter, “A Companion to Digital Art,” Tech. Rep., 2016.

[18] Z. Ruttkay, “Composing Mozart Variations with Dice,” Teaching Statistics, vol. 19, no. 1, pp. 18–19, mar 1997, [Online] Available: http://doi.wiley.com/10.1111/j. 1467-9639.1997.tb00313.x. Accessed on September 28, 2020.

[19] F. P. Brooks, A. L. Hopkins, P. G. Neumann, and W. V. Wright, “An Experiment in Musical Composition,” IRE Transactions on Electronic Computers, vol. EC-6, no. 3, pp. 175–182, 1957.

[20] J. A. Moorer, “Music and Computer Composition,” Tech. Rep., 1972.

[21] J. Pressing, “NONLINEAR MAPS AS GENERATORS OF MUSICAL DESIGN.” Computer Music Journal, vol. 12, no. 2, pp. 35–46, 1988, [Online] Available: https: //www.jstor.org/stable/3679940. Accessed on September 28, 2020.

[22] T. Jehan, “Creating Music by Listening,” Tech. Rep., 2005.

[23] “Flow Machines -AI assisted music production-,” [Online] Available: https://www. flow-machines.com/. Accessed on September 28, 2020.

[24] “AIVA - The AI composing emotional soundtrack music,” [Online] Available: https: //www.aiva.ai/. Accessed on September 28, 2020.

[25] “AWS DeepComposer,” [Online] Available: https://aws.amazon.com/es/ deepcomposer/. Accessed on September 28, 2020.

[26] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial Neural Audio Synthesis," 2019, pp. 1–14, [Online] Available: https://magenta.tensorflow.org/datasets/nsynth. Accessed on September 28, 2020.

[27] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A Generative Model for Music,” Tech. Rep., [Online] Available: https://github.com/ openai/jukebox. Accessed on September 28, 2020.

[28] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel Recurrent Neural Networks," Tech. Rep., 2016.

[29] T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik, "Speech2Face: Learning the Face Behind a Voice," Tech. Rep., [Online] Available: https://speech2face.github.io. Accessed on September 28, 2020.

[30] S. Reed, Z. Akata, B. Schiele, and H. Lee, “Learning deep representations of fine- grained visual descriptions,” 05 2016.

[31] M. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural net- works,” vol. 8689, 11 2013.

[32] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” Tech. Rep., [Online] Available: http://www.github.com/goodfeli/adversarial. Accessed on September 28, 2020.

[33] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least Squares Generative Adversarial Networks,” Tech. Rep., 2017.

[34] S. Oramas, F. Barbieri, O. Nieto, and X. Serra, “Multimodal Deep Learning for Music Genre Classification,” Transactions of the International Society for Music Information Retrieval, vol. 1, no. 1, pp. 4–21, sep 2018, [Online] Available: http://transactions. ismir.net/articles/10.5334/tismir.10/. Accessed on September 28, 2020.

[35] "Detect and remove duplicate images from a dataset for deep learning - PyImageSearch," [Online] Available: https://www.pyimagesearch.com/2020/04/20/detect-and-remove-duplicate-images-from-a-dataset-for-deep-learning/. Accessed on September 28, 2020.

[36] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016, [Online] Available: http://distill.pub/2016/deconv-checkerboard. Ac- cessed on September 28, 2020.

[37] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved Training of Wasserstein GANs," Tech. Rep., [Online] Available: https://github.com/igul222/improved_wgan_training. Accessed on September 28, 2020.

[38] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual Reason- ing with a General Conditioning Layer,” Tech. Rep., [Online] Accessed: www.aaai.org. Accessed on September 28, 2020.

[39] T. Karras, S. Laine, and T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks," Tech. Rep., [Online] Available: https://github.com/NVlabs/stylegan. Accessed on September 28, 2020.

[40] A. Brock, J. Donahue, and K. Simonyan, “LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS,” Tech. Rep., [Online]Available: https://tfhub.dev/s?q=biggan. Accessed on September 28, 2020.

[41] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “PROGRESSIVE GROWING OF GANS FOR IMPROVED QUALITY, STABILITY, AND VARIATION,” Tech. Rep., [Online] Available: https://youtu.be/G06dEcZ-QTg. Accessed on September 28, 2020.

Appendix A

Survey draft

In this appendix we show a small draft with the questions proposed to evaluate our models. We show the introduction, the introductory questions and the questionnaire proposition.

A.1 Introduction
A.2 Introductory questions
A.3 Questions proposed

A.3.1 Original album artwork questions
Question format from question 1 to 20:

A.3.2 Generated album artwork without audio conditioning questions
Question format from question 20 to 30:

A.3.3 Generated album artwork with audio conditioning questions
Question format from question 30 to 40:

Appendix B

Generative album artwork examples

Here we show some of the generated album artwork produced by one of the selected models. Generated images have a 128x128 resolution. For better visualization of the images we downsample them.

Figure B.1: Generated album covers from LSGAN model without audio conditioning.