<<

Creativity and Machine Learning: A Survey

GIORGIO FRANCESCHELLI, Alma Mater Studiorum Università di Bologna, Italy
MIRCO MUSOLESI, University College London, United Kingdom, The Alan Turing Institute, United Kingdom, and Alma Mater Studiorum Università di Bologna, Italy

There is a growing interest in the area of machine learning and creativity. This survey presents an overview of the history and the state of the art of computational creativity theories, machine learning techniques, including generative deep learning, and corresponding automatic evaluation methods. After presenting a critical discussion of the key contributions in this area, we outline the current research challenges and emerging opportunities in this field.

1 INTRODUCTION
In 1842, Lady Lovelace, an English mathematician and writer recognized by many as the first programmer, wrote that the Analytical Engine - the digital programmable machine proposed by Charles Babbage [3] a hundred years before the first computers [114] - “has no pretensions to originate anything, since it can only do whatever we know how to order it to perform” [72]. This consideration, which Alan Turing referred to as “Lovelace’s objection” [115], was just the first fundamental meeting between artificial intelligence and creativity. In particular, the last thirty years of the Twentieth century were marked by many attempts to build machines able to “originate something”. From the beginning of the Seventies, with the AARON Project by Harold Cohen, a program designed to autonomously draw images provided with a domain knowledge and a knowledge of representational strategy [16], and the Computerized Haiku by Margaret Masterman1, we have witnessed several examples of applications of artificial intelligence to a variety of artistic fields. Examples include the storyteller TALE-SPIN [71], RACTER and its book of poems [84], and MEXICA and its short narratives [82]. Applications were not limited to novels and paintings: BACON was used to simulate human thought processes and discover scientific laws [61], and with the COPYCAT Project Douglas Hofstadter and Melanie Mitchell proposed a computer program to discover insightful analogies [46]. These themes have also been extensively examined by Douglas Hofstadter in the Pulitzer Prize-winning Gödel, Escher, Bach: an Eternal Golden Braid [45], in which the author explains the idea of self-reference in the production of creative work and its implications for artificial intelligence2. In this context, we have witnessed the emergence of the Computational Creativity field [13]. We will adopt the definition by Colton and Wiggins [21], according to whom Computational Creativity is the philosophy, science and engineering of computational systems which, by taking on particular responsibilities, exhibit behaviors that unbiased observers would deem to be creative. It is important to highlight the use of the terms “responsibilities” (so, not considering tools but really independent systems) and, of course, “unbiased observers” (so, without any kind of prejudice about what a machine can or cannot do; our hope is that the reader can be considered as an “unbiased observer”; if not, we will, in any case, do our best).

Computational Creativity can also be defined as the study and support, through computational means and methods, of behaviors exhibited by natural and artificial systems which would be deemed creative if exhibited by humans, as proposed by Wiggins [124]. In this work, the author studies the links between creativity models and AI search methods from a theoretical perspective, showing that, by studying them in depth, it might be possible to design solutions for both exploratory and transformational creativity.

1http://www.in-vacua.com/cgi-bin/haiku.pl
2Quite interestingly, after Hofstadter’s classic, many AI projects focused on Bach’s music, from Bach by Design, the first recording produced by the artificial composer developed by David Cope [22], to COCONET [48] and DeepBach [44].

The definitions of exploratory and transformational creativity are provided by Margaret Boden in [8], another seminal work about Computational Creativity. Boden identifies three forms of creativity: combinatorial, exploratory and transformational creativity. Combinatorial creativity is about making unfamiliar combinations of familiar ideas; the definitions of the other two, instead, are based on conceptual spaces, i.e., structured styles of thought (any disciplined way of thinking that is familiar to a certain culture or peer group): exploratory creativity involves the exploration of the structured conceptual space, while transformational creativity involves changing the conceptual space in such a way that new thoughts, previously inconceivable, become possible. This is not the only interesting definition Boden provides. In fact, Boden also states that “creativity is the ability to generate ideas or artifacts that are new, surprising and valuable”, where three sorts of surprise are considered. The first is the surprise that goes against statistics; the second arises when you had not realized that a particular idea was part of the conceptual space; and the third arises when the idea is (or, better, was) apparently impossible. We will consider this definition, which we will call Boden’s three criteria, as one of the cornerstones of the rest of this survey. Finally, Boden provides one last definition, linked to the very beginning of this Introduction. In fact, Boden states that, when talking about Computational Creativity, it is possible to distinguish four different questions, called Lovelace questions because many people would respond to them by using the argument cited above. The first Lovelace question is whether computational ideas can help us understand how human creativity is possible; the second is whether computers could ever do things which at least appear to be creative; the third is whether a computer could ever appear to recognize creativity; and the fourth is whether computers themselves could ever really be creative (i.e., whether the originality of the products can be attributed only to the machine) [8]. The first question is studied in Boden’s work in detail; from now on, after discussing the scope of the survey, we will provide the reader with an overview of the techniques that can answer the second question and, possibly, generate creative artifacts, and of how machine learning can be used to measure creativity, trying to answer the third question. Before starting the description of techniques for generation and evaluation, we need to explain why this survey is not about artificial intelligence in general, defined as the design and implementation of intelligent agents that receive perception information from the environment and take actions that affect that environment (according to the definition contained in [95], which is shared by many researchers in this field). The survey focuses instead on machine learning, a subset of artificial intelligence defined as the study of computer algorithms that improve automatically through experience [74].
In order to consider a machine as potentially creative, there is the need to assign more and more responsibilities to the generative system [21], and machine learning seems more appropriate for this, since it avoids the need for a human behind it to specify all the required knowledge [35]; as pointed out in [17], the use of evolutionary or machine learning methods makes the behavior of the software unpredictable for the programmer, who is, in a sense, left out of the generative process. The fact that artificial intelligence and machine learning will be the enabling technologies for supporting creativity has been apparent since Turing’s seminal work. We can find this intuition in particular in the section “Learning Machines” [115], in which Turing, trying to find a way to overcome Lovelace’s objection, suggested that probably the best possible way to simulate a human mind and pass the Turing Test is not to produce a program, but instead to simulate a child’s mind and then make it grow by learning. The relevance of this kind of systems, and in particular of neural networks, to the study of creativity is also well explained by Boden, who notices that connectionist ideas of neural networks can help answer the first Lovelace question and might answer (and maybe have already answered) the second question, also suggesting that, by using them, computers might recognize certain aspects of creativity (answering the third question) [8]. It is worth noting that, according to some, computers not only might do things which appear to be creative, but are already expressing some forms of creativity (see, for example, Goodfellow’s interview reported in [73], where he said that “if creativity means producing something new and beneficial, machine learning algorithms are already at that point”). Following these considerations, to the best of our knowledge, this survey is the first to explore if and how machine learning algorithms have been used as a basis for (computational) creativity on both the generation and evaluation paths. Instead, for a reader interested in a survey concerning AI and creativity in general, we recommend [93]; for a more technical view of generative deep learning and of other related techniques (for example, style transfer or machine translation), we suggest [30]; finally, for a view of artistic AI works (also in human-computer co-creations [43]), [73] is a very comprehensive source of information.

2 GENERATIVE MODELS
In answering the second Lovelace question, we can say that certain programs are able to perform artifact generation, since they output valuable artifacts like paintings, poems or theorems, which can be assessed, ideally, in their own right [17]. But how can we distinguish between simple generation and creativity? Two main factors might be considered in order to assess the creativity of an artifact generation program: firstly, whether the output is creative; and, secondly, whether the process is creative [17]. The dichotomy between creative product and creative process is a central aspect of the debate about creativity in general; in fact, this distinction does not only concern computational creativity, but also human creativity (process and product are two of the four perspectives of creativity, as highlighted in [88]). However, for our purposes, we will consider Boden’s three criteria (novelty, surprise and value) in order to assess the creativity of the product; the definition of the three degrees of creativity (combinatorial, exploratory and transformational) is in some way borderline between process and product; and we will then consider the creative tripod proposed in [17] for focusing on the process: the system should exhibit behaviors that are skillful, appreciative and imaginative. We now present an overview of the most important generative techniques from a creativity perspective, to examine these kinds of differences and to try to understand how to assess creativity not only in (some of the) products, but also, in some way, during the process. Indeed, we can identify creativity (or, at least, a seeming creativity) in several different applications. Examples include systems for playing games like AlphaGo [103], where, using the words from [111], moves are not merely powerful, but in some cases also highly creative, but also design [34, 69], recipes [76, 119], scientific discovery [17, 102], etc. We focus our attention in particular on artistic outputs. In the next two subsections, we will first consider the case of “continuous” artifacts (like images) and then the case of “discrete” or “sequential” ones (like texts or sounds), according to the definition that can be found in [127]. Finally, game creation is one of the main application areas of these techniques (see, for instance, [42, 91, 112, 123]). It is worth noting that deep learning techniques for game content generation have already been extensively covered in [65]; we will only consider approaches which are relevant from a creativity perspective.

2.1 Continuous Models
We can identify three main families of continuous models: Variational Auto-Encoders (VAE) (Section 2.1.1), autoregressive models (Section 2.1.2) and Generative Adversarial Networks (GANs) (Section 2.1.3).

2.1.1 Variational Auto-Encoders. An auto-encoder is a neural network composed of two parts: an encoder, which compresses high-dimensional input data into a lower-dimensional representation vector, and a decoder, which decompresses a given representation vector back to the original domain [30]. The goal is to learn how to correctly reconstruct the input in the output by using the representation vector, a sort of compression of the original input data. The idea is that, after learning, we can choose any point in the latent space (for example, by random noise) and, through the decoder, generate novel data. There are several types of auto-encoder architectures, but in the field of generative deep learning the most important one is probably the VAE, first proposed in [60]: instead of mapping the input data directly to one point in the latent space, it is mapped to a multivariate normal distribution around a point in the latent space (this is done by mapping each input to a mean vector and a standard deviation vector; the point given as input to the decoder is obtained by sampling from the Gaussian distribution they define). As stated in [30], this guarantees that points in the same neighborhood produce very similar data when decoded (which is not guaranteed by standard auto-encoders). In this way, even when a point in the latent space that has never been seen by the decoder is chosen, the decoder is likely to generate a sample that is well formed. In particular, the Very Deep VAE (VDVAE) [14] has been demonstrated to be at least as strong as autoregressive models (but with fewer parameters). VAEs can also be used, for example, to generate object images from a high-level description, as in [125], where a layered foreground-background generative model based on the (classic) latent variable and also on an attribute variable is used. Another interesting use of VAE is DRAW [38], which constructs scenes in an iterative way, by accumulating changes emitted by the decoder (then given to the encoder as input), which allows iterative self-corrections and a more natural form of image construction. Both the encoder and the decoder are implemented with RNNs; during the iterative self-correction, the encoder also receives the previous outputs of the decoder. The decoder’s outputs are added to the distribution that ultimately generates the data; an attention mechanism is used to decide at each time-step where to focus attention. From a creativity point of view, models based on VAE can be considered the best possible examples of exploratory creativity, since they try to build the most accurate possible space and then sample from it. We have no guarantees of value, novelty and surprise, even if they sometimes result in creative products.
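A minimal sketch of the encode-sample-decode path described in this subsection follows; the layer sizes and dimensions are our own illustrative choices, not those of any specific paper. The reparameterization line is where the Gaussian sampling happens.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder maps each input to a mean and a log-variance
    vector, a latent point is sampled from the resulting Gaussian, and the
    decoder maps it back to the input domain."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

# After training, generation is exploration of the latent space:
# new_sample = model.decoder(torch.randn(1, 32))
```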

2.1.2 Autoregressive Models. The proposal of autoregressive models is attributed to [117] where, for generative image modeling, in order to tractably model a joint distribution of the pixels in the image, the authors propose to cast it as a product of conditional distributions. The factorization turns the joint modeling problem into a sequence problem, where one learns to predict the next pixel given all the previously generated ones. Fundamentally, the idea is to start at the top left pixel and proceed towards the bottom right pixel, predicting each pixel based on the previous ones. In order to do so, they propose two main architectures: PixelRNN and PixelCNN [117]. PixelRNN is a two-dimensional RNN, which can be based on rows or on diagonals. On the other hand, PixelCNN is a fully convolutional network, used with a fixed dependency range (the filters of the convolution are masked so that they can only use information about pixels above and to the left of the current one). It is interesting to notice that, to generate images (so, continuous artifacts), one of the fundamental approaches is discrete and sequential. In [118] these models are, in a way, evolved, since the authors combine the strengths of the two models by proposing a gated variant of PixelCNN (Gated PixelCNN); in addition, they also introduce a conditional variant of the Gated PixelCNN (Conditional PixelCNN) to model the complex conditional distributions of natural images given a latent vector embedding. Gated PixelCNN replaces the ReLU used by PixelCNN with a gated activation unit. In addition, in Conditional PixelCNN, it is assumed that the conditional probability of a pixel also depends on a latent vector that represents a high-level image description, containing information about what should be in the image. The interesting fact is that this Conditional PixelCNN can also be used as a decoder in a VAE, where the latent vector is output by the encoder. Then again, in [80], the authors propose to use self-attention to achieve both the virtually unlimited receptive field of PixelRNN and the parallelizability of PixelCNN. The proposed Image Transformer uses an encoder-decoder architecture to support image-conditioned generation; for both the encoder and the decoder, it uses stacks of self-attention and position-wise feed-forward layers. In addition, the decoder uses an attention mechanism to consume the encoder representation; the decoder can then be used to generate class-conditioned images. Finally, in [86] we have the proposal of DALL·E, a version of GPT-3 [11] adapted to generate images from text descriptions. DALL·E is composed of a Transformer decoder that receives as input both the text and the image generated so far; it uses an attention mask at each self-attention layer to correlate each image token with all text tokens, a causal mask for the text tokens, and sparse attention based on row, column or convolutional attention patterns for the image tokens. Autoregressive models seem to sit midway between exploratory and combinatorial creativity. In fact, they are based on probabilistic predictions and, for this reason, they are able to generate completely new outputs in the same space, but also to reuse sequences of pixels from different works, combining them together. Results are often valuable and they might be considered novel but, in general, not as surprising3.
However, the emergence of methods for generating images starting from textual descriptions may increase the probability of diverging from and transforming the conceptual space (in the sense that, even if the generative approach is not completely creative, because it is predictable, with a creative text description it is possible to obtain images that we would not otherwise generate).
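A minimal sketch of the masked convolution at the core of the PixelCNN family follows; the mask layout reflects the raster-scan ordering described above, while the layer sizes are illustrative choices of ours.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so that each output position only
    sees pixels above and to the left, enforcing the raster-scan ordering
    p(x) = prod_i p(x_i | x_1, ..., x_{i-1})."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(1, 1, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0  # same row, right of center
        mask[:, :, kH // 2 + 1:, :] = 0                          # all rows below
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero out "future" pixels at every call
        return super().forward(x)

# Type 'A' (first layer) also hides the current pixel; type 'B' (deeper
# layers) allows it.
first_layer = MaskedConv2d('A', in_channels=1, out_channels=16,
                           kernel_size=7, padding=3)
```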

2.1.3 GANs. The proposal of GANs by [36] represents a sort of turning point for generative systems. Here, Goodfellow proposes an adversarial network framework with two models: a generative model and a discriminative one, which learns to determine whether a sample comes from the model distribution or from the data distribution. Meanwhile, the generative model tries to produce samples that fool the discriminative model. This competition drives both models to improve, until the fake samples are indistinguishable from the original ones. Originally, the generative model generates samples from random noise given as input. It is worth noting that, instead of random noise, it is possible to use other forms of data (see, for example, Pix2Pix [50], which uses an actual image). A remarkable example of how GANs can be used to generate aesthetically pleasant artifacts is the collection of works by Obvious, a French art collective that, using GANs, has developed two series of paintings [121]; one of their works has been sold for more than 400,000 dollars4. What follows is a summary of some significant works based on GANs; the list is not exhaustive5, but it provides a good overview of the GAN world, of what they can generate and, in particular, of how they can generate it. For an in-depth survey about GANs, we refer to [39]. The first variant of GAN we consider is known as ProGAN [57]. It is based on the idea that, in order to learn high-resolution images in a faster and more stable way, it is possible to incrementally grow both the generator and the discriminator, starting from easier low-resolution images, and adding new layers for higher-resolution details over time. In this way, the networks can at first discover large-scale structures, and then shift attention to increasingly finer-scale details (instead of having to learn all of them together). A second variant of GAN, which uses self-attention [120] inside both the generator and the discriminator, is SAGAN [128]. In this solution, self-attention modules are used for modeling long-range, multi-level dependencies across image regions, so that the generator can produce images with distant details correlated to each other, and the discriminator can enforce more complicated constraints on the global image structure. Starting from SAGAN, another proposed modification is BigGAN [10], based on the demonstrated fact that GANs dramatically benefit from scaling; here, the scaling is obtained by increasing the batch size and the number of channels in each layer, together with small changes to the architecture of the models. In addition, in the experiments, class information is provided to the generator in order to obtain generated images corresponding to that class.

3The reader should remember that surprise is based on unexpectedness [6]: choosing according to probabilities learned by trying to correctly predict real data makes the result... expectable.
4Fun fact: the sold painting is called Portrait of Edmond De Belamy because Belamy sounds like bel ami, a sort of French translation of... Goodfellow.
5For an exhaustive and constantly updated list of GAN papers see for example: https://github.com/hindupuravinash/the-gan-zoo

Finally, the last variant we examine is StyleGAN [58, 59], where the generator architecture is re-designed in order to control the image synthesis process. It starts from a learned constant input and adjusts the style of the image at each convolutional layer based on the latent code (transformed by a non-linear mapping network into an intermediate latent code that controls the generator through adaptive instance normalization at each convolutional layer), therefore controlling image features at different scales and allowing automatic separation of high-level attributes from stochastic variation in the generated images. It also allows mixing regularization, where two random latent codes (so, two different input data) are used alternately to guide the generator. GANs are certainly the most difficult approach to evaluate from a creativity perspective. The generator never sees real works, so it samples from a conceptual space that is not built directly from them; in rare cases, this can also lead to a different conceptual space (with respect to the original one) and so to transformational creativity, but typically it leads to exploratory creativity. In fact, the goal is to learn to generate artifacts that look as much as possible like the real ones, so, in theory, the approach works best when it works with exploratory creativity. Products are usually valuable (this is, in some way, the objective), but nothing ensures that they will also be new and surprising. However, it can happen: in this case we can also consider the creative tripod to see if the process is also creative in some way. The network is surely skillful (and all the modified versions can increase its skills), since it has the ability to produce artworks. It is important to note that the proposed architectures were specifically used for face generation and based on ImageNet [94]; however, it is apparent that the approach can also be extended to creative tasks. In addition, GANs are also appreciative, in the sense that the discriminative network can judge the value of artifacts; but nothing seems traceable back to imagination (notice that in StyleGAN, with the possibility of controlling many stylistic characteristics, one could think of this, in a way, as a possible source of imagination; however, until now the control has been exercised by humans and not autonomously by the network).
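To make the adversarial game of [36] concrete, here is a minimal sketch of one training step; the networks are toy MLPs standing in for the deep convolutional architectures discussed above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784
# Toy MLPs standing in for the deep convolutional generator/discriminator.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    b = real_batch.size(0)
    # Discriminator: real samples are labeled 1, generated samples 0.
    fake = G(torch.randn(b, latent_dim)).detach()
    loss_d = bce(D(real_batch), torch.ones(b, 1)) + bce(D(fake), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: try to make the discriminator answer "real".
    fake = G(torch.randn(b, latent_dim))
    loss_g = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```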

2.2 Sequential Models
Here too, we can identify three main families of sequential models: sequence prediction models (Section 2.2.1), autoregressive models (Section 2.2.2) and adaptations of GANs to sequence generation (Section 2.2.3).

2.2.1 Sequence Prediction Models. Sequence prediction models are the simplest and most used models to generate sequential data. In general, sequence prediction models are models that, given as input the previous parts of a composition, output the next part. The difference with respect to autoregressive models is that they are based on RNNs (so, instead of explicitly using all previous outputs each time, they store them in a hidden memory). The effectiveness of this approach is greatly motivated and demonstrated in [56]. In [107], for example, the authors train their model with two different approaches: one through modeling blocks of characters and building a model of the joint probability of each character given the previous 50 characters (char-RNN); and the other by modeling tokens from single transcriptions of folk music (folk-RNN). In this way, they obtain a generative system able to output, one symbol after the other, a sample that is close to those in the training set. But of course, a sequence prediction model can also be based on different units: in [83] the generator is based on words, in [47] on phonemes, and in [131] on syllables; it can even be based on lines, as in [129], where the RNN is fed as input the current line produced so far plus a vector that encodes the previous lines (produced by another RNN). It is also possible to use richer architectures with more models to represent different characteristics: an example is provided in [62], where three models are used: one for language (based on words), one for pentameter and one for rhymes (both char-RNNs). However, simply predicting the next token, after learning to do so on a large number of real poems or songs, does not seem particularly creative. We can make the same observations discussed before for the autoregressive models, while also considering the aspects related to skills given by the richer architectures. The lack of surprise in RNN models is demonstrated in [12]; nevertheless, one of the most interesting results obtained by generative models in the field of art was produced by a sequence prediction model. Sunspring might be considered the first AI-scripted movie: it was generated by a char-RNN trained on thousands of sci-fi scripts [77]. The result was certainly unexpected [73], but it was able to place in the top ten at the annual Sci-Fi London Film Festival in its 48-Hour Film Challenge6. Although [92] is not focused on generating artifacts but only on helping writers, Creative Help might be considered among the approaches shown above. In fact, its help comes in the form of new sentence generation; although it was intended just for writing down sporadic sentences that would then be modified by the human author, it can be used as a full generative model, by asking for one sentence after the other. Its peculiarity, with respect to the other RNNs, is the use of a temperature at sampling time to increase the randomness of choices in composing the sentence, with the goal of increasing unpredictability and, therefore, creativity. Nevertheless, as highlighted by [92], results in narrative generation are commonly afflicted by a lack of coherence.
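Before turning to the coherence problem, the temperature mechanism just mentioned is simple enough to show directly; the following is a minimal sketch, with an invented vocabulary size and illustrative temperature values.

```python
import torch

def sample_next_token(logits, temperature=1.0):
    """Sample from a next-token distribution sharpened or flattened by a
    temperature: values above 1 make unlikely tokens more probable (more
    unpredictability), values towards 0 approach greedy, fully predictable
    decoding."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(100)  # toy scores over a 100-token vocabulary
cautious = sample_next_token(logits, temperature=0.7)
adventurous = sample_next_token(logits, temperature=1.5)
```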
In order to solve the coherence problem, many approaches have been proposed: for instance, in [70], an encoder-decoder RNN (also known as Sequence-to-Sequence, see [108]) is used to generate a story in terms of events, and then another encoder-decoder RNN is employed to generate sentences starting from those events; an alternative is to focus on entities, as in [15], also based on a sequence-to-sequence model, but where the encoding part takes care of the content of the previous sentence and the current state of the entities from the story so far (and, of course, the content of the current sentence that has already been generated). These can be considered important efforts for achieving higher-quality generation, and working at different levels of abstraction might indirectly lead to outputs that might be considered creative. However, even if this might be a matter of debate, in our opinion these systems are not explicitly designed to generate creative artifacts according to the definitions stated above. Finally, an interesting evolution of sequence prediction models is proposed in [52], where reinforcement learning is used to impose structure on an LSTM. The authors use a Note-RNN model (based on single notes) trained using Deep Q-Network (see [75]); as rewards, they consider both the classic loss used in sequence prediction models and a reward based on rules of music theory. In this way, the model learns a set of composition rules, while still maintaining information about the transition probabilities learned from the training data. Another interesting example of how the use of reinforcement learning might improve a sequence prediction model comes from [110], where a pre-trained LSTM language model is used as a policy model. The learning aims at maximizing a reward function which controls the story generation in order to reach a specific narrative goal (i.e., a certain event occurring at the end of the narrative). One last example of how the combination of RNN and RL can lead to better performance is provided in [126], where the rewards are based on the automatic calculation of fluency, coherence, meaningfulness and overall quality of the generated poems. The potential of reinforcement learning techniques for sequential generation will be discussed in more detail in Section 2.2.3.
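Returning to [52], the kind of combined reward it uses can be sketched as follows; the callable names are hypothetical, the weighting scheme is illustrative, and the in-key rule is a toy stand-in for the paper's music-theory rules.

```python
def in_key_reward(note, state):
    # Toy rule: small bonus when the note (a MIDI pitch) is in the C-major scale.
    return 0.1 if note % 12 in {0, 2, 4, 5, 7, 9, 11} else -0.1

def combined_reward(note, state, note_rnn_logprob, rules=(in_key_reward,), c=0.5):
    """The data-driven term (the pre-trained Note-RNN's log-probability)
    preserves the transition statistics learned from real music, while the
    rule-based terms push towards music-theory compliance. `note_rnn_logprob`
    is an assumed callable, not part of any real API."""
    return note_rnn_logprob(note, state) + c * sum(r(note, state) for r in rules)
```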

2.2.2 Autoregressive Models. A variety of autoregressive models have also been proposed in recent years. For example, WaveNet [116] is an audio generative model based on the Gated PixelCNN architecture discussed above. WaveNet provides a flexible framework for tackling many applications that rely on audio generation (like music generation). Just like a sequence prediction model, it learns to predict the next audio sample of the waveform; however, it does not use an RNN but a CNN (with causal convolutions to avoid the risk that the actual prediction depends on any of the future samples).
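The causal convolution just mentioned can be obtained simply by left-padding the input, as in this minimal sketch (channel counts and kernel sizes are illustrative choices of ours):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples: the input is padded
    on the left so that output t never depends on inputs after t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x has shape (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Stacking layers with doubling dilations (1, 2, 4, ...) grows the receptive
# field exponentially, which is how WaveNet covers long waveforms.
layer = CausalConv1d(channels=16, dilation=4)
```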

6Quite interestingly, the AI that wrote Sunspring declared that its name was Benjamin, probably in honor of Walter Benjamin, the German philosopher who, already in 1935 [5], understood that new mechanical techniques related to art can radically change the public attitude towards art and artists.

By conditioning the model on other input variables, it is possible to guide WaveNet’s generation to produce audio with certain predefined characteristics. A possible improvement is represented by the use of attention mechanisms like those found in the Music Transformer [49]: self-attention over its own previous outputs allows an autoregressive model to access any part of the previously generated output at every generation step, and using it in subsequent layers helps capture the self-referential phenomena typical of music. We will now discuss one of the most interesting and promising current areas of research in machine learning: general language models. As exemplars, we will discuss two main projects: BERT [23] (by Google) and GPT-3 [11], the evolution of GPT-2 [85] (by OpenAI). BERT is a fine-tuning-based representation model that uses masked language modeling (randomly masking some of the tokens from the input) to enable pre-trained deep bidirectional representations: a deep bidirectional Transformer is pre-trained not only for next-sentence prediction, but also for predicting the masked words of a passage from their context. In this way, BERT becomes able, after fine-tuning (where all the parameters are fine-tuned and an appropriate output layer is defined), to perform well on a large suite of sentence-level and token-level tasks. GPT-2, instead, is a general model (that uses a Transformer-based architecture) with the ability to perform a wide range of tasks in a zero-shot setting, without any parameter or architecture modification and without any demonstration of the task, just given a natural language description of it. For example, it is able to generate short stories given a long text in input while maintaining long-term consistency (see [85]). GPT-3, its evolution, is a pre-trained 175-billion-parameter autoregressive language model (almost identical to the one used for GPT-2, but bigger) based on “in-context learning”, where the input text specifies the task; in this way, the model is conditioned on a natural language instruction, but it can also use a few demonstrations of the task. The model has been tested in zero-shot, few-shot (a few demonstrations of the task, typically between 10 and 100, are provided at inference time, but no weight updates are allowed; this is useful for rapidly adapting to new tasks) and one-shot settings (just one demonstration is allowed, in addition to a natural language description of the task). One of the additional tasks on which GPT-3 has been tested in [11] that is of interest for this survey is poetry generation: the model is used to write a poem with a given title and a given style (more specifically, Wallace Stevens’ style), obtaining good results which, however, are not evaluated through any previous methodology proposed in the creativity literature.
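As a concrete illustration of masked-token prediction, the snippet below queries a pre-trained BERT through the Hugging Face transformers library; the library and the model name are our assumptions for the example, not tools used in [23].

```python
# Masked-token prediction with a pre-trained BERT via the Hugging Face
# `transformers` library (an assumption of this sketch).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The poet wrote a [MASK] about the sea."):
    # Both left context ("wrote a") and right context ("about the sea")
    # inform each candidate, which is what "bidirectional" means here.
    print(candidate["token_str"], round(candidate["score"], 3))
```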
Another interesting example of autoregressive models is MuseNet [81], which uses the same large-scale Transformer as GPT-2, trained to predict the upcoming note given a set of previous ones. It can generate 4-minute musical compositions with 10 different instruments and can combine styles from country to Mozart to the Beatles. In MuseNet, the underlying model is conditioned on both the composer and the instrumentation of the samples. In other words, this information is at the basis of the prediction mechanism. At generation time, it is possible to condition the model to create samples in a chosen style. Finally, an additional example is provided by Jukebox [24]. Working directly in the raw audio domain, it can generate multiple minutes of music guaranteeing long-range coherence. Jukebox is based on a hierarchical VQ-VAE architecture [87] (where each level of abstraction consists of a WaveNet) to compress audio into a low-dimensional discrete space; then, a sparse Transformer is trained over this compressed space. The generation process can be conditioned on genre, author and even lyrics; this makes it theoretically able to generate novel compositions, at least if it is conditioned on an unseen genre-artist pair, on an embedding that interpolates different artists or styles, or on new and creative lyrics. From a creativity perspective, the observations we have made for autoregressive models used in continuous tasks, and those about conditioning in GANs, also apply in this case. When talking about general language models, it is more difficult to assess exploratory creativity for autoregressive models: the fact that they are general models and, for this reason, are trained on several different sources appears to lead to a sort of transformational creativity, in which the conceptual space of the related task is contaminated by the conceptual spaces learned as representing different tasks.

2.2.3 Adaptations and Extensions of GANs. The first adaptation of GANs to sequential generation we consider is MuseGAN [25]. In order to generate multi-track polyphonic music with harmonic, rhythmic and temporal structure and multi-track interdependency, the authors propose a GAN-based model for multi-track sequence generation that works by generating music one bar after another using transposed convolutional neural networks. The multi-track music can be generated in different ways: with a different and “independent” generator for each instrument (each one with its own discriminator); with a single generator that outputs multiple tracks at the same time (with, of course, only one discriminator); or with a different generator for each instrument, where each generator receives in input the same vector in order to coordinate the creation and, in fact, only one discriminator is used. Another GAN model used to generate music is proposed in [28]. The generator, a stack of convolutions, generates output data in the form of audio samples of four seconds in length from a random vector, and then passes the output to the discriminator. In addition, the method involves conditioning on an additional source of information: the authors append a one-hot representation of musical pitch to the latent vector, which expresses the desired goal in terms of pitch and timbre. In order to make the generator use this new information, the loss is augmented with an auxiliary classification loss coming from a discriminator that tries to predict the pitch label. Instead, a GAN model proposed to generate sentences is TextGAN [130]. It has the goal of learning to write sentences that match the features of real sentences: here, the discriminator is not simply trained to discriminate between real and synthetic data, but to learn the sentence features that are most discriminative, representative and challenging. To do so, the discriminator outputs the classic binary signal plus a reconstruction of the input vector passed to the generator; and it is trained to correctly discriminate, to minimize the distance between the original and the reconstructed versions of the input vector, and also to maximize the discrepancy between the features of a real sentence and the features of the generated sentence. In this way, the discriminator learns the most challenging features for the generator to match, while the generator has to try to minimize this discrepancy between features, and therefore learns to produce sentences which have real features. Then, we examine SeqGAN [127], a general architecture that adapts GANs to sequence generation by using reinforcement learning. In fact, the authors consider the sequence generation procedure as a sequential decision-making process; the generative model is seen as an agent in a reinforcement learning approach, where the state is given by the tokens generated so far and the action is the next token that the model has to generate. For the rewards, they use a discriminator to evaluate the sequence, as in a classic GAN. To solve the problem that the gradient cannot pass back to the generative model when the output is discrete, they consider the generative model as a stochastic parametrized policy, implementing a Monte Carlo search to approximate the state-action value, and directly train the generative model via policy gradient using the REINFORCE algorithm.
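The policy-gradient step at the heart of SeqGAN can be sketched as follows; the generator interface (`initial_state`, `step`) is an assumed API of ours, and the per-token Monte Carlo rollouts of the actual paper are collapsed into a single end-of-sequence reward for brevity.

```python
import torch

def reinforce_step(generator, discriminator, optimizer, seq_len=20):
    """One REINFORCE update in the spirit of SeqGAN. `generator.step` is
    assumed to return a next-token distribution and the updated hidden
    state; this is a hypothetical interface, not the paper's code."""
    state = generator.initial_state()
    log_probs, tokens = [], []
    for _ in range(seq_len):
        dist, state = generator.step(state)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        tokens.append(token)
    # The discriminator's "probability of being real" acts as the reward.
    reward = discriminator(torch.stack(tokens))
    loss = -reward.detach() * torch.stack(log_probs).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```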
They tested the proposed model by building a poetry generator (considering a corpus of Chinese quatrains) and a music composer (considering a collection of folk tunes), using an RNN for the generative model and a CNN for the discriminative model. An interesting modification of SeqGAN called ORGAN [40] considers the generator as trained to maximize a weighted average of two types of rewards: the one that comes from the discriminator, plus a so-called “objective” composed of domain-specific metrics. This may help maximize specified heuristics and reach particular behaviors while keeping the generated samples close to the real data distribution. For instance, their experiments involve music generation with the objectives of tonality (finding pleasant note sequences) and ratio of steps (avoiding melodic leaps). Although the experiments were limited to providing certain qualities to the outputs, ORGAN might represent a very interesting framework for the generation of creative sequences, for instance by considering heuristics about novelty and surprise. Another adaptation of SeqGAN, which incorporates the idea of TextGAN in an RL framework, is LeakGAN [41]. It has two main differences with respect to SeqGAN: the generator, instead of just receiving the reward signal at the end of the generation, also receives a feature vector from the discriminator, i.e., its last convolutional layer, at each generated token; in addition, the generator has a hierarchical structure, being composed of a Manager module, which, at each timestep, takes the feature vector as input and outputs a goal vector, and of a Worker module, which, at each timestep, takes the goal vector and the last generated token as input, and outputs the next one. In these terms, the Worker and the discriminative model are trained as in SeqGAN, while the Manager is trained separately with the goal of predicting advantageous directions in the discriminative feature space. In this way, LeakGAN becomes able to produce texts that match the features of real texts, to deal with the problem of receiving a reward only at the completion of an episode, and to produce long texts maintaining a certain coherence (thanks to the hierarchical architecture). Then, a different kind of learning is represented by MaskGAN [29]. The goal of this solution is to deal with the problems concerning long text generation and training stability caused by the fact that the loss from the discriminator is only observed after the completion of the sentence. MaskGAN is based on so-called “in-filling” learning: given a text, the generator, a sequence-to-sequence model [108], is conditioned on a masked version of the text (where some tokens are missing) as well as on what has been filled in up to that point and what has to be filled next (i.e., the next gap). In addition, in order to state whether the filled-in token is real or generated, the discriminator outputs the discriminative signal given as input the masked text and the sequence produced so far, including the last generated token. Finally, instead of using the REINFORCE algorithm, the authors suggest using REINFORCE with Baseline [109], where the correction is not based on the discriminative signal, but on the difference between the signal and a baseline learned by a critic network, since this helps reduce the high variance of the gradient estimates. Finally, another GAN-based technique is RelGAN [78].
This solution consists of a relational memory-based generator (based on the relational network by [98]), which uses the Gumbel-softmax relaxation [51, 67] to deal with training on discrete data, and a discriminator based on multiple embedded representations to enable a more diverse and informative signal for the generator updates. In addition, the Gumbel-softmax allows the use of a single temperature parameter (able to provide a kind of novelty, as we have already seen) to control the trade-off between quality and diversity. This architecture has been used as the basis of CTERM-GAN for (long) text generation [7]. In this case, the generator is also conditioned on an external input (through an embedding returned by a different neural network taking the conditioning input), while the discriminator is composed of two networks: a syntactic discriminator, trained to predict whether a sentence is correct, and a semantic discriminator, trained to predict whether a sentence is coherent with the external input. In this way, the generator can learn to write texts that are both correct and coherent. In general, from a creativity perspective, the considerations we have made on GANs in continuous tasks are still valid in this context (also considering all the efforts made to generate higher-quality artifacts), even if the potential of ORGAN may warrant additional, more positive considerations.

2.3 Creativity-Oriented Models
Until now, we have seen a large number of generative models that can be considered, in some way, non-creativity-oriented. Their original goal is not to generate creative artifacts; and even if in some applications they are used for this purpose, the process is not focused on the creation of something new, valuable and surprising, to use Boden’s three criteria. But, as we will see now, a few studies have been devoted to this direction.

Fig. 1. Training workflow of the Composer-Audience model, highlighting how the audience part transforms the classic sequence prediction model. The sequence fed as input to both the audience and composer models is sampled from $D_c$, the dataset on which the composer is trained; on the contrary, the audience model has to have already been trained on a different dataset $D_a$.

The first one is the Composer-Audience Architecture [12]. It can be considered a sequence prediction model designed to produce surprising, and not only valuable, artifacts. The architecture is composed of two models: an audience model, trained on a given dataset to learn to predict the next token, and a composer model, trained on a different dataset and receiving as input not only the previous tokens, but also the expectations of the audience model when given the same tokens. Even without an explicit surprise-related loss, the composer model learns patterns of violations of expectations, characterized by how the two datasets differ. For instance, the example provided considers as the audience dataset a set of non-humorous texts, and as the composer dataset a set of humorous texts; the composer model learns to choose, at the proper moments, terms that are humorous not because they simply follow the others, but because they are highly unexpected by the audience. The training workflow is summarized in Figure 1. Of course, the Composer-Audience Architecture does not take care of novelty, but it seems a very promising way to correctly embed value and surprise; however, it has many of the limitations highlighted for classic sequence prediction models, such as the impossibility of resulting in transformational creativity. In addition, it might be difficult to find, for each kind of problem, a meaningful pair of datasets such that the difference between the two represents the surprising part of the data. The second one is CAN (Creative Adversarial Network), proposed in [26], a sort of creativity-oriented variant of GANs. GANs are designed to generate artifacts as close as possible to the original ones, without really producing creative ones. On the contrary, in order to develop a generator that could be considered creative, according to the authors of this work, it should produce new things or, more precisely, try to increase the stylistic ambiguity and deviations from style norms (a kind of novelty) while, at the same time, avoiding moving too far away from what is accepted as art (to guarantee its value). For these reasons, as shown in the workflow in Figure 2, the CAN discriminator returns two signals: the classification of “art or not art”, as in GAN, but also how well it can classify the generated art into established styles; in this way, the goal of the generative model becomes maximizing the generation of what is recognized as art, while making it difficult to classify it into established styles. The proposed approach has been tested on paintings from the Fifteenth century to the Twentieth century, learning to produce something like abstract paintings and, in particular, obtaining, as described by Elgammal in an interview reported in [73], a machine that has captured the trajectory of art history. Applying the same kind of considerations as before, we can say that CAN is very close to transformational creativity, because the goal is to do something new and different with respect to the original data, not just sampling from that conceptual space.

Fig. 2. Training workflow of Creative Adversarial Network, highlighting how the style classification output transforms the classic Generative Adversarial Network.
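The essence of CAN can be captured in the generator's objective, sketched below under our own assumptions: the head names and tensor shapes are illustrative, and where [26] writes the style-ambiguity term as a cross-entropy against the uniform distribution, we use a KL divergence, which differs from it only by a constant.

```python
import torch
import torch.nn.functional as F

def can_generator_loss(art_logit, style_logits):
    """Sketch of the CAN generator objective: be classified as art (first
    term) while being hard to assign to any established style (second term,
    which pushes the style posterior towards the uniform distribution)."""
    art_loss = F.binary_cross_entropy_with_logits(
        art_logit, torch.ones_like(art_logit))
    n_styles = style_logits.size(-1)
    uniform = torch.full_like(style_logits, 1.0 / n_styles)
    ambiguity_loss = F.kl_div(
        F.log_softmax(style_logits, dim=-1), uniform, reduction='batchmean')
    return art_loss + ambiguity_loss
```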

In addition, the products are created with the objectives of novelty and value, but surprise is not taken into account. Moreover, if by imagination we also mean the ability of diversification, it could “complete” the creative tripod. Finally, in [101] Schmidhuber formalizes his theory about creativity and his way to achieve it by using reinforcement learning (see also [99] or [100]). According to Schmidhuber, an agent can learn to be creative and discover new patterns and skills thanks to the use of intrinsic rewards measuring novelty, interestingness and surprise. Schmidhuber’s model is based on two learning modules for each agent: an adaptive predictor of the growing data history as the agent interacts with its environment, and a controller which acts in the environment. The learning progress of the predictor is used as the intrinsic reward of the controller; in this way, the controller is motivated to invent things that the predictor does not yet know, but that it can easily learn from its current state. In addition to the intrinsic reward, the agent can also be trained to maximize an external reward provided by the environment (which can be useful to guarantee other behavioral properties of the agent). The learning process is summarized in Figure 3. Even if this is only a theoretical proposal, searching each time for novel patterns seems very similar to transformational creativity; the fact that the goal is to maximize fun or surprise appears able to guarantee the presence of novelty and surprise (while value may be achieved thanks to the external reward), and the creative tripod seems to be respected too.

Fig. 3. Training process of the Reinforcement Learning system based on the formal theory of creativity proposed by Schmidhuber. The Controller, based on the current state $S_t$, takes an action $A_t$; the action changes the Environment, which returns a new state $S_{t+1}$ and an external reward $R^{ext}_{t+1}$, which is summed with an intrinsic reward $R^{int}_{t+1}$ to obtain the total reward used to train the Controller. The intrinsic reward reflects the difference between the Predictor model at instant $t$ and the Predictor model at instant $t+1$, after being trained to predict the new state.
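A minimal sketch of the learning-progress reward just described follows, assuming a `predictor` object exposing `error` and `update` methods (a hypothetical interface of ours, not Schmidhuber's formalization):

```python
def intrinsic_reward(predictor, new_observation, history):
    """Learning-progress reward: how much does one training step on the new
    observation improve the predictor? The drop in prediction error is the
    reward for the controller that produced the observation."""
    error_before = predictor.error(new_observation, history)
    predictor.update(new_observation, history)   # one learning step
    error_after = predictor.error(new_observation, history)
    # Positive only when something was learnt: boring (already predictable)
    # and noisy (unlearnable) observations both yield rewards close to zero.
    return error_before - error_after
```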

For instance, a (partially) inspired implementation of this proposal is DeLeNoX (Deep Learning Novelty Explorer) [63], which combines autoencoders with evolutionary algorithms to achieve novelty in game content generation. In particular, a denoising autoencoder [122] is used as the predictor to learn low-dimensional representations of the last generated artifacts. These learned features are not used to directly shape the intrinsic reward, but to define a distance measure in terms of typical generated patterns. Finally, as the reward guides the generation of new artifacts, the distance measure guides a particular type of novelty search called feasible-infeasible novelty search [64], aimed at maximizing distances between artifacts generated by a Compositional Pattern-Producing Network (CPPN) [105]. Just as the external reward provided by the environment may help model qualitative characteristics, the feasible-infeasible search may discriminate based on qualitative characteristics. Although DeLeNoX uses learned patterns to guide an evolutionary algorithm (as suggested by [66]) rather than a reinforcement learning algorithm, this process is very similar to the formal theory of creativity [101], and might be seen as a successful implementation of it. Table 1 summarizes all the generative methods discussed in this section, highlighting their characteristics from a creativity perspective.

3 CREATIVITY MEASURES
In this section we present different methodologies for evaluating the creativity of artifacts generated by artificial agents. These can typically be extended to human-generated artifacts as well. The presence of several different proposals can be associated with the fact that, as highlighted in [32], when talking about text generation, it is not always easy to determine the “right” question to ask in an evaluation of a creative artifact. This is also highlighted by the absence of creativity evaluation approaches in the GEM (Generation, Evaluation, and Metrics) Benchmark for natural language generation [33], because of the inadequacy of the creative metrics proposed until now. The focus of our overview is on measures that are based on or associated with machine learning techniques. It is worth noting that some of them might simply be considered computational measures. However, we believe that all of them can be extended or implemented through machine learning techniques.

Type of artifacts | Generative technique | Type of creativity | Boden’s criteria
--- | --- | --- | ---
Continuous | VAE | Exploratory | ∼ Value, ∼ Novelty, ∼ Surprise
Continuous | Autoregressive model | Combinatorial, Exploratory | ∼ Value, ∼ Novelty, × Surprise
Continuous | GAN | Exploratory | ✓ Value, ∼ Novelty, ∼ Surprise
Continuous | CAN | Transformational | ✓ Value, ✓ Novelty, ∼ Surprise
Sequential | Sequence prediction model | Combinatorial, Exploratory | ∼ Value, ∼ Novelty, × Surprise
Sequential | Autoregressive model | Combinatorial, Exploratory* | ∼ Value, ∼ Novelty, × Surprise
Sequential | Sequential adaptation of GAN | Exploratory | ✓ Value, ∼ Novelty, ∼ Surprise
Sequential | Audience-Composer architecture | Exploratory | ✓ Value, ∼ Novelty, ✓ Surprise
Possibly any kind | Active learning based on intrinsic reward | Transformational | ∼ Value, ✓ Novelty, ✓ Surprise

Table 1. A summary of all the methods explained so far, considering their type of creativity and the possible presence of Boden’s criteria. *General language models may allow a transformational creativity.

Instead, for an in-depth overview of creativity measures not strictly related to machines, we refer to [96].

3.1 Lovelace 2.0 Test
The first evaluation method is strictly related to what was discussed in the Introduction, even if it is not based on machine learning techniques. In fact, in 2001 the so-called Lovelace Test (LT) was proposed [9] as a sort of creativity-oriented alternative to the world-famous Turing Test [115]:

Definition 3.1. An artificial agent $A$, designed by $H$, passes LT if and only if:
1) $A$ outputs $o$;
2) $A$’s outputting $o$ is not the result of a fluke hardware error, but rather the result of processes $A$ can repeat;
3) $H$ (or someone who knows what $H$ knows, and has $H$’s resources) cannot explain how $A$ produced $o$ by appeal to $A$’s architecture, knowledge-base, and core functions.

Then, a new version was proposed [89], with the goal of making the test easier to pass and of defining more precisely what is needed to do so. The so-called Lovelace 2.0 Test is as follows:

Definition 3.2. An artificial agent $A$ is challenged as follows:
1) $A$ must create an artifact $o$ of type $t$;
2) $o$ must conform to a set of constraints $C$, where $c_i \in C$ is any criterion expressible in natural language;
3) a human evaluator $h$, having chosen $t$ and $C$, is satisfied that $o$ is a valid instance of $t$ and meets $C$;
4) a human referee $r$ determines the combination of $t$ and $C$ to not be unrealistic for an average human.

An interesting property is that it is possible to use the Lovelace 2.0 Test to quantify the creativity of an artificial agent (or, at least, of an artificial product) just by considering a set $H$ of human evaluators; with $n_i$ being the first test performed by evaluator $h_i \in H$ not passed by the agent, the creativity of the artificial agent can be expressed as the mean number of challenges passed per evaluator: $\frac{\sum_i n_i}{|H|}$. This methodology represents a very interesting way to measure creativity, since it is ideally applicable to any field and it is quantitative. However, since it has to take into consideration a non-negligible number of aspects to correctly evaluate creativity, it requires considerable human intervention; therefore, it cannot be used for automatic evaluation without a non-negligible human effort.
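The score itself is trivial to compute; in the toy illustration below, the list of first-failed challenge indices is invented for the example.

```python
def lovelace_2_score(first_failed):
    """Mean number of challenges passed per evaluator, as defined above:
    first_failed[i] is n_i, the index of the first challenge posed by
    evaluator h_i that the agent did not pass."""
    return sum(first_failed) / len(first_failed)

# Toy data: three evaluators whose first failed challenges were the
# 4th, 2nd and 5th ones they posed.
print(lovelace_2_score([4, 2, 5]))  # 3.666...
```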

3.2 Ritchie’s Criteria
Another quantitative way to measure creativity is proposed in [90], where the author defines a set of criteria for evaluating the extent to which a program has been creative (or not) in generating artifacts. Novelty, quality and typicality are identified as factors useful to define creativity, where novelty is intended as the sum of “untypicality” (so, it can be defined in terms of typicality) and innovation. Quality and typicality are collected using human opinions about the produced artifacts or, in the case of typicality, also using “measurable” characteristics concerning, for instance, syntax and metric (for a poetry generator). The result set of produced artifacts is considered alongside the inspiring set (composed of artifacts of the field used during training and/or generation) in the formulation of the criteria. The defined criteria, intended to be just a “repertoire” to help define specific methods, are: whether the average typicality or quality over the result set is greater than a given threshold (or not greater than it); what proportion of items with a good typicality score is also characterized by a high quality score; whether a large proportion of the program’s output falls into the supposedly desirable category of untypical but high-valued; a comparison between untypical high-valued items and typical high-valued items; what proportion of the inspiring set has been reproduced in the result set; and, finally, all the already defined criteria computed not over the entire result set, but only over the subset not containing any item from the inspiring set. In our opinion, these measures represent a promising way of evaluating creativity, but their application does not appear straightforward. First of all, the paper does not clearly specify how to measure novelty in terms of innovation; moreover, all the measures require a large number of thresholds to be set. The criteria for the selection of these thresholds are not trivial per se, and it is difficult to identify a general methodology for selecting these values. In addition, the collection of reliable human opinions (in terms of uniformity of the meter used, of audience selection, etc.) is not a trivial task either.
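To make the flavor of these criteria concrete, here is a toy computation of two of them; the thresholds are illustrative, and choosing them is exactly the non-trivial step discussed above.

```python
def ritchie_ratios(items, typ_threshold=0.5, qual_threshold=0.6):
    """Toy versions of two of Ritchie's criteria. `items` is a list of
    (typicality, quality) pairs in [0, 1]. Returns (a) the share of typical
    items that are also high-quality and (b) the share of all output that is
    untypical but high-quality."""
    typical = [(t, q) for t, q in items if t >= typ_threshold]
    good_typical = sum(q >= qual_threshold for _, q in typical)
    untypical_valued = sum(t < typ_threshold and q >= qual_threshold
                           for t, q in items)
    ratio_a = good_typical / len(typical) if typical else 0.0
    ratio_b = untypical_valued / len(items) if items else 0.0
    return ratio_a, ratio_b

print(ritchie_ratios([(0.9, 0.8), (0.7, 0.3), (0.2, 0.9)]))  # (0.5, 0.333...)
```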

3.3 FACE
In [19] the authors introduce descriptive models for the processing and output of software designed to be creative: FACE, for describing the creative act, and IDEA, for capturing notions related to the impact of such a creative act. The FACE model considers a creative act as a non-empty tuple of generative acts of eight possible types: an expression of a concept; a method for generating expressions of a concept; a concept; a method for generating concepts; an aesthetic measure; a method for generating aesthetic measures; an item of framing information; and a method for generating framing information. This model can be used in a quantitative way, in a cumulative way (considering how many types of creative acts are used) and in a qualitative way (using the aesthetic function). The IDEA model, on the contrary, describes a cycle in which the software is modified in order to increase the impact of its creations on an audience: iteratively, the system generates products that are judged by the audience and, based on the results, the system is adapted to obtain better outcomes, and so on. The FACE model represents an interesting way of studying the creative process of the machine; in particular, the possibility of using it in a qualitative way seems promising, but nothing explains how the aesthetic function has to be defined. On the other hand, the IDEA model seems more a way to incorporate human opinions about products into the creative act than a way to study its creativity.

3.4 SPECS
Instead, in [53], Jordanous proposes a Standardised Procedure for Evaluating Creative Systems (SPECS); it does not use machine learning techniques, but the procedure can also include them in the testing phase. First, the author suggests several components that are known to be linked to creativity. Then, SPECS is articulated into three steps: first, the identification of a definition of creativity that the system should satisfy to be considered creative, using the suggested creative components and possibly other domain-specific components; then, the specification of the standards to be used to evaluate the creativity of the system; and, finally, the testing of the system against the standards and the reporting of the results. This is an effective framework for working with computational creativity, but it cannot be considered a real evaluation method. For this reason, its effectiveness depends on how it is implemented for the specific task. The last three evaluation methods described above, plus the creative tripod from [17] and a human opinion survey, were then evaluated through a sort of meta-evaluation based on five criteria: correctness, usefulness, faithfulness (as a model of creativity), usability (of the methodology) and generality [54]. This meta-evaluation was applied to music improvisation generators, which were evaluated with the five methods and then judged, according to the five criteria, by the developers of the generative systems. Results show that the opinion survey was globally the worst method, while SPECS seemed to be the best one, followed by Ritchie’s criteria and then the creative tripod, in terms of the insights provided to the developers. Although this comparison does not help us find the best creativity measure, it helps understand which methods are best suited to obtaining additional knowledge about a generative model, which may be very useful to improve the results.

3.5 Creativity Implication Network
A completely different approach is based on building an art graph from which creativity is computed [27]. Given a set of artworks, it is possible to construct a directed graph with a vertex for each artwork, with an arc connecting artwork $p_i$ to $p_j$ if $p_i$ has been realized before $p_j$, and with a positive weight $w_{ij}$ for each arc that is the similarity score between the two artworks. This score can be computed by any similarity function; for example, in [27] it was based on computer vision techniques. Given the constructed network, the creativity of an artwork $p_i$ depends on its similarity with the previous artworks (the higher the similarity, the lower the creativity) and with the subsequent artworks (the higher the similarity, the higher the creativity). In this way, the creativity score synthesizes both the value, provided by the influence over future works, and the novelty, provided by the influence of past works. This represents an effective way to deal with the creativity of sets of artworks. Two drawbacks have to be highlighted considering its application to the automatic evaluation of creativity. The first one concerns artworks that are “leaves” in the graph: if there are no subsequent works, their creativity would only be based on novelty, and not on value. The second one is more subtle, and it is about time-positioning. As demonstrated by [27], moving an artwork back in time has the effect of increasing, more or less, its creativity; however, this seems conceptually wrong. As discussed in [8], the time location of an artwork is fundamental in the determination of its creativity. In fact, it may happen that, due to the surprise component of creativity, if an artwork comes too early and observers are not able to truly understand it, it will not be considered surprising, but mostly bewildering and disappointing; on the contrary, if it comes too late, it will be considered obvious and not surprising at all. In conclusion, even if this approach is able to correctly capture value and novelty, it lacks a comprehension of surprise.
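To make the idea concrete, the following is a deliberately simplified sketch of the intuition (the similarity values are invented, and the actual formulation in [27] is a more sophisticated network-wide computation rather than a local average):

```python
import numpy as np

# Toy Creativity Implication Network: three artworks in chronological order;
# sim[i, j] is the similarity between artworks i and j (any similarity
# function would do; values here are invented).
sim = np.array([[0.0, 0.8, 0.3],
                [0.8, 0.0, 0.6],
                [0.3, 0.6, 0.0]])

def creativity(i: int, n: int) -> float:
    # Penalize similarity to earlier works (low novelty) and reward similarity
    # from later works (influence, hence value); "leaf" works get no value term.
    earlier = sim[i, :i].mean() if i > 0 else 0.0
    later = sim[i, i + 1:n].mean() if i < n - 1 else 0.0
    return later - earlier

for i in range(3):
    print(i, creativity(i, 3))
```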

3.6 Generate-and-Test Setting
Another important paper is [113], in which the authors try to highlight the different roles machine learning can play through the formalization of the generate-and-test setting. Let gen() denote the generative function of the system, and let a be an artifact generated by that function. Artifact a is then evaluated by another function eval(a) and, if it passes the evaluation, it is output by the system. Of course, the evaluation function can also be considered as a sum of different components that measure different aspects of creativity, like the ones in Boden’s three criteria. It seems a very useful and, in some cases, correct formulation of the problem, since, starting from it, it is possible to define domain-specific evaluation tests, as we will see later; but its applicability is surely not general, as we have seen in the previous section. The paper also cites three generate-and-test models that use machine learning for the evaluation. The first one is [119], in which the authors propose the development of a computational creativity system for culinary recipes and menus. In this work, even the selection step is based on two creativity measures associated with novelty and quality, where the novelty is computed using the concept of “Bayesian surprise” [4] and the quality (in this specific case, the flavorfulness) is derived using a regression model, built on olfactory pleasantness considering its constituent ingredients. The Bayesian surprise is a measure of surprise in terms of the impact of a piece of data that changes a prior distribution over a set of possible models into a posterior distribution, calculated by applying Bayes’ theorem; the surprise is then the distance between the posterior and prior distributions. The second one is DARCI, an original computer system designed to produce images through creative means based on Colton’s creative tripod. In particular, in [79] the attention is focused on the need for appreciation of art, i.e., the ability to interpret images - and maybe discover creativity in them. For this reason, DARCI works by making associations between image features and descriptions of the images, learned using a series of artificial neural networks as the basis for the appreciation. It has also been used as a sort of artificial art critic to complement The Painting Fool, a very ambitious and interesting generative art program with decision-making abilities that uses a knowledge base of different settings to control all aspects of the artistic style it uses for a particular artwork (see [18]). As explained in [20], DARCI represents a turning point for The Painting Fool because, with computer vision abilities, it can analyze its output, assessing the appreciation of its projects and allowing for rendering styles created by the software itself and not by a human. Finally, the third one is PIERRE [76], a recipe generation system composed of two models, one for generation (based on a genetic algorithm) and one for evaluation. The evaluation is based on two creativity measures: novelty and quality.


Novelty is computed by counting the new combinations of ingredients used, while quality is based on two MLPs that perform a regression of user ratings based on the amounts of the different ingredients, working at two levels of abstraction.
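To make the generate-and-test setting and the Bayesian surprise measure discussed above concrete, here is a minimal sketch (all names and numbers are invented; SciPy’s entropy(p, q) computes the KL divergence used here as the distance between posterior and prior):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def bayesian_surprise(prior, likelihood):
    # Surprise of an observation: distance (KL divergence) from the prior to
    # the posterior obtained by applying Bayes' theorem.
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return entropy(posterior, prior)

def generate_and_test(gen, evaluate, threshold, max_tries=1000):
    # Output the first generated artifact that passes the evaluation.
    for _ in range(max_tries):
        artifact = gen()
        if evaluate(artifact) >= threshold:
            return artifact
    return None

# Uniform prior over two hypothetical "recipe models"; the likelihood of the
# observed artifact under each model drives how surprising the artifact is.
prior = np.array([0.5, 0.5])
print(bayesian_surprise(prior, np.array([0.9, 0.2])))
```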

3.7 Common Model of Creativity for Design
Another proposed approach, specific to design, comes from [69]. In this work, the authors consider creativity as a relative measure in a conceptual space of potential and existing designs; according to them, novelty, value and surprise can capture different aspects of creativity in that space. For this reason, they assume each design is represented by attribute-value pairs, where novelty is considered on a description space, value on a performance space, and surprise is based on finding violations of patterns that it is possible to anticipate in the space of both actual and possible designs. In particular, they use the K-means clustering algorithm, with Euclidean distance and centroids, to organize the known designs, and they then obtain novelty, value and surprise measures of a new design just by looking at the distance to the nearest cluster centroid.
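A minimal sketch of this idea follows (the attribute values are invented, and only the novelty measure is shown, computed as the distance to the nearest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

# Known designs as attribute-value vectors (numbers purely illustrative).
known_designs = np.array([[1.0, 0.2], [0.9, 0.3], [0.1, 0.8], [0.2, 0.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(known_designs)

def novelty(design):
    # Euclidean distance from a new design to its nearest cluster centroid.
    return float(np.linalg.norm(km.cluster_centers_ - design, axis=1).min())

print(novelty(np.array([0.5, 0.5])))  # far from both clusters: relatively novel
```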

3.8 Regent-Dependent Creativity Metric
By contrast, the Regent-Dependent Creativity metric [31] measures novelty and value through Bayesian surprise and synergy, a metric that captures the cooperative nature of the constituent elements of an artifact. It relies on a description of the artifacts based on the Regent-Dependent Model, as collections of features, i.e., dependency pairs labeled with a value to express the intensity of that pair in that context. Novelty is calculated using the Bayesian surprise, while value is measured by the synergy of the elements that exist within the artifact. The synergy measure is based on modeling the artifacts as graphs, where the vertices are the dependency pairs of the artifact and an edge between two pairs exists only if they belong to the same synergy set, i.e., if they exhibit a better effect together than separately. Although these last two approaches are based on the use of the novelty-value pair and Boden’s three criteria, they are limited by the fact that artifacts have to be described through attribute-value pairs. In particular, in order to describe artifacts in the best way, we need to consider a large number of features; otherwise, we might lose aspects that are fundamental to correctly measuring creativity. In addition, since we cannot know in advance which are the fundamental features, we need to define as many features as possible, with the risk of defining an excessive number of non-informative attributes, making it too difficult to compute the metrics. To make matters worse, the majority of the described approaches use clustering techniques: but data points become increasingly “sparse” as the dimensionality increases, and since many clustering techniques are based on (Euclidean) distance, they would not form meaningful clusters [106]. Finally, as in classic machine learning techniques, there is the need to manually define and extract the chosen features from unstructured data, which is a time-consuming and possibly error-prone activity. For these reasons, we think that one way - probably not the only one - to overcome the problems of feature extraction and the curse of dimensionality might be the use of deep learning techniques, since they are able to work directly on unstructured data, learning by themselves which features are fundamental for the specific goal.
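The sparsity problem can be checked empirically: as the number of attributes grows, pairwise distances between random points concentrate, and the relative contrast between the farthest and nearest pairs, on which distance-based clustering relies, collapses. A quick demonstration:

```python
import numpy as np
from scipy.spatial.distance import pdist

# As dimensionality d grows, the relative contrast between the farthest and
# the nearest pair of random points shrinks, degrading distance-based clustering.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    dists = pdist(rng.random((200, d)))  # all pairwise Euclidean distances
    print(d, (dists.max() - dists.min()) / dists.min())
```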

3.9 Unexpectedness Measure
The creative measure proposed in [37] is a measure of unexpectedness, which is related to novelty, surprise and domain transformation, making it a vital component of computational creativity evaluation. This measure is based on a comparison between artifacts and not on subjective perceptions. According to the authors, novelty is expectation-based, since the structures acquired to evaluate novelty can be seen as a model with which the system attempts to predict the world, even if these structures are not built for the purpose of prediction, since they represent assumptions about how the underlying domain can be organized. Vice versa, surprise occurs when a creative system’s expectation is violated in response to observing an artifact. It is a very subtle but fundamental difference: in the case of transformational creativity, novelty is about the possibility of updating the domain’s conceptual space; surprise, instead, is about the possibility of violating the system’s expectations.

3.10 Essential Criteria of Creativity
Another metric based on unexpectedness (intended here as surprise) is the one proposed in [68], where the evaluation metric is based on three components: novelty, value and, of course, unexpectedness, retracing Boden’s three criteria. The idea is that the metric has to be independent of the domain and also of the producer: novelty is calculated as the distance between the artifact and the other artifacts in its class (using a clustering algorithm); surprise is calculated by considering whether or not the artifact follows the expected next artifact in the pattern recognized over recent artifacts (using self-supervised neural networks and reinforcement learning to correct the learned trajectory in the case of sequential fields); and value is calculated as the weighted sum of its performance variables (using a co-evolutionary algorithm with a function that can change over time, as the current population of artifacts changes) [68]. This idea is very interesting, because it tries to capture the evolution of creativity while remaining usable. Nonetheless, in our opinion, it is limited by the fact that it requires the definition of performance variables (we have discussed the problems of attribute-value approaches above) and of a way to calculate distances among artifacts in order to apply the clustering algorithm (which requires human fine-tuning).
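A toy sketch of how the three components could be combined is shown below; artifacts are represented as vectors, and the predictor producing the expected next artifact, the class members and the weights are all assumed given and invented here ([68] itself relies on clustering, self-supervised networks and co-evolution rather than these simplifications):

```python
import numpy as np

def essential_criteria(artifact, class_members, expected_next, weights):
    # Novelty: mean distance from the other artifacts in the same class
    # (standing in for the clustering step of [68]).
    novelty = float(np.mean([np.linalg.norm(artifact - m) for m in class_members]))
    # Surprise: deviation from the artifact a learned model expected next.
    surprise = float(np.linalg.norm(artifact - expected_next))
    # Value: weighted sum of the artifact's performance variables.
    value = float(weights @ artifact)
    return novelty, surprise, value

a = np.array([0.7, 0.4])
others = [np.array([0.2, 0.3]), np.array([0.3, 0.5])]
print(essential_criteria(a, others, expected_next=np.array([0.5, 0.5]),
                         weights=np.array([0.6, 0.4])))
```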

3.11 Computational Tools for Lateral Thinking
Finally, in [55] the authors present four creativity metrics concerning novelty, surprise, rarity and recreational effort for the specific case of storytelling. Novelty is considered as the average semantic distance between the dominant terms included in the textual representation of the story, compared to the average semantic distance of the dominant terms in all stories; surprise is defined as the average semantic distance between the consecutive fragments of each story; rarity is based on computing the clusters of terms in each story, and requires calculating the semantic distance between the individual clusters (with respect to those in the story set); and recreational effort is the number of different clusters each story contains. Although value is not considered, the proposed metrics appear to be appropriate for evaluating novelty and surprise, but they suffer from two problems: they are intrinsically domain-specific, and they require that all the types of clusters are defined correctly, which is very difficult to ensure.
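As an illustration of the kind of computation involved (the embeddings below are invented; in practice they would come from a word-embedding model, and the same distance would be applied to consecutive story fragments to obtain the surprise measure):

```python
import numpy as np

def avg_pairwise_distance(embeddings):
    # Average cosine distance over all pairs of term embeddings.
    v = np.array(embeddings, dtype=float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos_sims = v @ v.T
    i, j = np.triu_indices(len(v), k=1)
    return float((1.0 - cos_sims[i, j]).mean())

# Toy 3-d "embeddings" of a story's dominant terms.
terms = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3], [0.4, 0.4, 0.5]]
print(avg_pairwise_distance(terms))
```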

To summarize, Table 2 reports all the evaluation methods analyzed in this section, highlighting the dimensions they try to capture, their applicability and their limitations, as discussed above.

Name | What it evaluates | How it evaluates | Applicability | Limits
Lovelace 2.0 Test | What evaluators state creativity is | Mean number of challenges per evaluator passed | General | Requires massive human intervention
Ritchie's criteria | Quality, novelty, typicality | Human opinions (then elaborated through 18 criteria) | General | Requires human evaluation; requires thresholds to be stated; no definition of innovation
FACE | Tuples of generative acts | Volume of acts, number of acts, quality (through an aesthetic measure) | General | Not a punctual method; definition of the aesthetic measure left to the user
SPECS | What we state creativity is | Identification and testing of standards for the creativity components | General | More a framework for defining evaluation methods than a real method
Creativity implication network | Value, novelty | Similarity between works (subsequent works for value, previous works for novelty) | General | Cannot correctly compute creativity for the last works; wrong correlation between creativity and time-positioning
Creativity assessment for Chef Watson | Novelty, quality of pleasantness | Bayesian surprise; regression over smell pleasantness | Specific (culinary recipes) | Requires human ratings
DARCI | Art appreciation | Neural network to associate image features and descriptions | Specific (visual art) | Not based on the product; considers just one element of the creative tripod
PIERRE (evaluation part) | Novelty, quality | Count of new combinations; user ratings over ingredients | Specific (recipes) | Requires user ratings
Common model of creativity for design | Novelty, value, surprise | K-means on a description space and a performance space; degree of violation of anticipated patterns | Specific (design) | Requires attribute-value pairs to be defined; requires clustering parameters to be defined
Regent-Dependent creativity metric | Novelty, value | Bayesian surprise and synergy | General | Requires features to be defined and extracted
Unexpectedness | Novelty, surprise, transformativity | Possibility of updating, and degree of violation of, expectations | General | Does not take value into account
Essential criteria of creativity | Value, novelty, surprise | Sum of performance variables; distance between artifacts and between real and expected artifact | General | Requires performance variables to be defined; requires clustering parameters to be defined
Computational tools for lateral thinking | Novelty, surprise, rarity, recreational effort | Semantic distance between dominant terms, consecutive fragments or clusters of terms | Specific (storytelling) | Requires dominant terms to be defined; requires clustering parameters to be defined
Table 2. Summary of creativity evaluation methods and their characteristics.

4 OUTLOOK AND CONCLUSION
Several research directions can be identified in the various emerging areas discussed in this survey. First of all, an interesting direction is the exploration of creativity-oriented objective functions, to directly train models to be creative, as initially proposed in [26], or to find new ways of learning transformational creativity, as suggested in [101]. Of course, as we have seen, the focus should not only be on creative generation. A fundamental area is the definition of novel and robust evaluation techniques for both generated and real artifacts. There is also an orthogonal debate around the role of human evaluation versus machine evaluation. Another promising research direction concerns machine interpretation of art [1]. Machine learning techniques might also be used to investigate psychological dimensions of creativity as well [2]. There are also foundational questions related to machine learning and creativity. For example, it is not clear if machine-generated works could, should and will be protected by Intellectual Property and, if so, who should be the owner of the related rights, as discussed, for instance, in [97]. In addition, other problems related to copyright should be considered in our opinion, such as whether training over protected work is admitted [104] and what happens if the results are similar to some of the training data7. Another important ongoing debate is about authorship and the human role in creative fields in the era of AI8. In this survey we have provided the reader with an overview of the state-of-the-art at the intersection between creativity and machine learning. At first, we have tried to examine the most common definitions of creativity. We have then examined and compared the most important generative models and evaluation techniques used to assess creativity. Generative models are already able to generate artworks, but creativity-oriented models are just at the beginning, and new architectures could be used to achieve an even greater degree of creativity. Evaluation metrics are already able to capture many fundamental aspects of creativity, but we believe this is an open area of research from both conceptual and practical points of view. We hope that this survey represents a valuable guideline for further efforts in this fascinating cross-disciplinary and trans-disciplinary area.

REFERENCES
[1] Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, and Leonidas Guibas. 2021. ArtEmis: Affective Language for Art. CoRR abs/2101.07396 (2021).
[2] Sergio Agnoli, Laura Franchin, Enrico Rubaltelli, and Giovanni Emanuele Corazza. 2019. The emotionally intelligent use of attention and affective arousal under creative frustration and creative success. Personality and Individual Differences 142 (2019), 242–248.
[3] Charles Babbage. 1864. Of the Analytical Engine. In Passages from the Life of a Philosopher. Vol. 3. Longman, Green, Longman, Roberts, & Green, 112–141.
[4] Pierre Baldi and Laurent Itti. 2010. Of Bits and Wows: A Bayesian Theory of Surprise with Applications to Attention. Neural Networks: The Official Journal of the International Neural Network Society 23 (2010), 649–666.
[5] Walter Benjamin. 1968. The Work of Art in the Age of Mechanical Reproduction. Random House.
[6] Daniel E. Berlyne. 1971. Aesthetics and Psychobiology. Appleton-Century-Crofts.
[7] Federico Betti, Giorgia Ramponi, and Massimo Piccardi. 2020. Controlled Text Generation with Adversarial Learning. In INLG 2020. Association for Computational Linguistics, 29–34.
[8] Margaret A. Boden. 2003. The Creative Mind: Myths and Mechanisms. Routledge.
[9] Selmer Bringsjord, Paul Bello, and David Ferrucci. 2001. Creativity, the Turing Test, and the (Better) Lovelace Test. Minds and Machines 11 (2001), 3–27.
[10] Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In ICLR 2019.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS 2020, Vol. 33. 1877–1901.
[12] Razvan C. Bunescu and Oseremen O. Uduehi. 2019. Learning to Surprise: A Composer-Audience Architecture. In ICCC 2019. Association for Computational Creativity (ACC), 41–48.
[13] Amílcar Cardoso, Tony Veale, and Geraint A. Wiggins. 2009. Converging on the Divergent: The History (and Future) of the International Joint Workshops in Computational Creativity. AI Magazine 30, 3 (2009).
[14] Rewon Child. 2020. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv:2011.10650 [cs.LG]
[15] Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In NAACL 2018. Association for Computational Linguistics, 2250–2260.

7These questions are all so crucial that in 2019 WIPO (World Intellectual Property Organization) has started a series of conversations to discuss the impact of AI on IP: https://www.wipo.int/about-ip/en/artificial_intelligence/conversation.html
8This is one of the ethical dilemmas highlighted by UNESCO in its Preliminary study on the Ethics of Artificial Intelligence: https://unesdoc.unesco.org/ark:/48223/pf0000367823

[16] Harold Cohen. 1988. How to Draw Three People in a Botanical Garden. In AAAI 1988.
[17] Simon Colton. 2008. Creativity Versus the Perception of Creativity in Computational Systems. In AAAI 2008 Spring Symposium.
[18] Simon Colton. 2012. The Painting Fool: Stories from Building an Automated Painter. Computers and Creativity (2012), 3–38.
[19] Simon Colton, John William Charnley, and Alison Pease. 2011. Computational Creativity Theory: The FACE and IDEA Descriptive Models. In ICCC 2011.
[20] Simon Colton, Jakob Halskov, Dan Ventura, Ian Gouldstone, Michael Cook, and Blanca Pérez-Ferrer. 2015. The Painting Fool Sees! New Projects with the Automated Painter. In ICCC 2015.
[21] Simon Colton and Geraint A. Wiggins. 2012. Computational Creativity: The Final Frontier?. In ECAI 2012.
[22] David Cope. 1989. Experiments in musical intelligence (EMI): Non-linear linguistic-based composition. Journal of New Music Research 18 (1989), 117–139.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL 2019. 4171–4186.
[24] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. 2020. Jukebox: A Generative Model for Music. arXiv:2005.00341 [eess.AS]
[25] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. 2018. MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment. In AAAI 2018. 34–41.
[26] Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. 2017. CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms. In ICCC 2017.
[27] Ahmed Elgammal and Babak Saleh. 2015. Quantifying Creativity in Art Networks. arXiv:1506.00711 [cs.AI]
[28] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. 2019. GANSynth: Adversarial Neural Audio Synthesis. In ICLR 2019.
[29] William Fedus, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better Text Generation via Filling in the ______. In ICLR 2018.
[30] David Foster. 2019. Generative Deep Learning. O'Reilly.
[31] Celso França, Luís Fabrício Wanderley Góes, Alvaro Amorim, Rodrigo C. O. Rocha, and Alysson Ribeiro Da Silva. 2016. Regent-Dependent Creativity: A Domain Independent Metric for the Assessment of Creative Artifacts. In ICCC 2016.
[32] Albert Gatt and Emiel Krahmer. 2018. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation. Journal of Artificial Intelligence Research 61, 1 (2018), 65–170.
[33] Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. arXiv:2102.01672 [cs.CL]
[34] John Gero. 2000. Computational Models of Innovative and Creative Design Processes. Technological Forecasting and Social Change 64 (2000).
[35] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[36] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NeurIPS 2014. Curran Associates, Inc., 2672–2680.
[37] Kazjon Grace and Mary Lou Maher. 2014. What to expect when you're expecting: The role of unexpectedness in computationally evaluating creativity. In ICCC 2014.
[38] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A Recurrent Neural Network For Image Generation. In PMLR 2015, Vol. 37. 1462–1471.
[39] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. 2020. A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. arXiv:2001.06937 [cs.LG]
[40] Gabriel L. Guimaraes, Benjamin Sanchez-Lengeling, Pedro Luis Cunha Farias, and Alan Aspuru-Guzik. 2017. Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv:1705.10843 [cs.ML]
[41] Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Long Text Generation via Adversarial Training with Leaked Information. arXiv:1709.08624 [cs.CL]
[42] Matthew Guzdial and Mark O. Riedl. 2018. Automated Game Design via Conceptual Expansion. In AIIDE 2018, Vol. 18. 31–37.
[43] Matthew Guzdial and Mark O. Riedl. 2019. An Interaction Framework for Studying Co-Creative AI. In CHI 2019.
[44] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. 2017. DeepBach: a Steerable Model for Bach Chorales Generation. In ICML 2017, Vol. 70. 1362–1371.
[45] Douglas R. Hofstadter. 1979. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, Inc.
[46] Douglas R. Hofstadter and Melanie Mitchell. 1994. The Copycat project: A model of mental fluidity and analogy-making. In Advances in Connectionist and Neural Computation Theory, Vol. 2: Analogical Connections. Ablex Publishing, 31–112.

[47] Jack Hopkins and Douwe Kiela. 2017. Automatically Generating Rhythmic Verse with Neural Networks. In ACL 2017. Association for Computational Linguistics, 168–178.
[48] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. 2017. Counterpoint by Convolution. In ISMIR 2017.
[49] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer. In ICLR 2019.
[50] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR 2017. IEEE, 5967–5976.
[51] Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. arXiv:1611.01144 [cs.ML]
[52] Natasha Jaques, Shixiang Gu, Richard E. Turner, and Douglas Eck. 2016. Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning. In NeurIPS 2016.
[53] Anna Jordanous. 2012. A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What it is to be Creative. Cognitive Computation 4 (2012).
[54] Anna Jordanous. 2014. Stepping Back to Progress Forwards: Setting Standards for Meta-Evaluation of Computational Creativity. In ICCC 2014.
[55] Pythagoras Karampiperis, Antonis Koukourikos, and Evangelia Koliopoulou. 2014. Towards Machines for Measuring Creativity: The Use of Computational Tools in Storytelling Activities. In ICALT 2014. IEEE.
[56] Andrej Karpathy. 2015. The Unreasonable Effectiveness of Recurrent Neural Networks. karpathy.github.io/2015/05/21/rnn-effectiveness.
[57] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR 2018.
[58] Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In CVPR 2019. IEEE/CVF, 4396–4405.
[59] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In CVPR 2020. 8107–8116.
[60] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In ICLR 2014.
[61] Pat Langley, Herbert A. Simon, Gary L. Bradshaw, and Jan M. Zytkow. 1987. Scientific Discovery: Computational Explorations of the Creative Processes. MIT Press.
[62] Jey Han Lau, Trevor Cohn, Timothy Baldwin, Julian Brooke, and Adam Hammond. 2018. Deep-speare: A joint neural model of poetic language, meter and rhyme. In ACL 2018. 1948–1958.
[63] Antonios Liapis, Hector P. Martinez, Julian Togelius, and Georgios N. Yannakakis. 2013. Transforming exploratory creativity with DeLeNoX. In ICCC 2013. 56–63.
[64] Antonios Liapis, Georgios N. Yannakakis, and Julian Togelius. 2013. Enhancements to constrained novelty search: Two-population novelty search for generating game content. In GECCO 2013. 343–350.
[65] Jialin Liu, Sam Snodgrass, Ahmed Khalifa, Sebastian Risi, Georgios N. Yannakakis, and Julian Togelius. 2021. Deep learning for procedural content generation. Neural Computing and Applications 33 (2021), 19–37.
[66] Penousal Machado, Juan Romero, Antonino Santos, Amílcar Cardoso, and Alejandro Pazos. 2007. On the development of evolutionary artificial artists. Computers and Graphics 31, 6 (2007), 818–826.
[67] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In ICLR 2017.
[68] Mary Maher. 2010. Evaluating creativity in humans, computers, and collectively intelligent systems. In DESIRE 2010. 22–28.
[69] Mary Maher and Doug Fisher. 2012. Using AI to Evaluate Creative Designs. In ICDC 2012.
[70] Lara J. Martin, Prithviraj Ammanabrolu, William Hancock, Shruti Singh, Brent Harrison, and Mark O. Riedl. 2018. Event Representations for Automated Story Generation with Deep Neural Nets. In AAAI 2018.
[71] James R. Meehan. 1977. TALE-SPIN, an Interactive Program That Writes Stories. In IJCAI 1977. Morgan Kaufmann Publishers Inc., 91–98.
[72] Luigi Federico Menabrea and Ada Lovelace. 1843. Sketch of The Analytical Engine Invented by Charles Babbage. In Scientific Memoirs. Vol. 3. Richard and John E. Taylor, 666–731.
[73] Arthur I. Miller. 2019. The Artist in the Machine. The MIT Press.
[74] Tom M. Mitchell. 1997. Machine Learning. McGraw-Hill.
[75] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518 (2015), 529–533.
[76] Richard G. Morris, Scott H. Burton, Paul Bodily, and Dan Ventura. 2012. Soup Over Bean of Pure Joy: Culinary Ruminations of an Artificial Chef. In ICCC 2012.
[77] Annalee Newitz. 2016. Movie written by algorithm turns out to be hilarious and intense. arstechnica.com/gaming/2016/06/an-ai-wrote-this-movie-and-its-strangely-moving/.
[78] Weili Nie, Nina Narodytska, and Ankit Patel. 2019. RelGAN: Relational Generative Adversarial Networks for Text Generation. In ICLR 2019.
[79] David Norton, Derral Heath, and Dan Ventura. 2010. Establishing Appreciation in a Creative System. In ICCC 2010.


[80] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In PMLR 2018, Vol. 80. 4055–4064.
[81] Christine Payne. 2019. MuseNet. openai.com/blog/musenet.
[82] Rafael Perez y Perez. 2017. Mexica: 20 Years-20 Stories [20 años-20 historias]. Counterpath Press.
[83] Peter Potash, Alexey Romanov, and Anna Rumshisky. 2015. GhostWriter: Using an LSTM for Automatic Rap Lyric Generation. In EMNLP 2015. Association for Computational Linguistics, 1919–1924.
[84] Racter. 1984. The Policeman's Beard Is Half Constructed. Warner Books, Inc.
[85] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
[86] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, and Scott Gray. 2021. DALL·E: Creating Images from Text. openai.com/blog/dall-e.
[87] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. 2019. Generating Diverse High-Fidelity Images with VQ-VAE-2. In NeurIPS 2019, Vol. 32.
[88] Mel Rhodes. 1961. An Analysis of Creativity. The Phi Delta Kappan 42, 7 (1961), 305–310.
[89] Mark O. Riedl. 2014. The Lovelace 2.0 Test of Artificial Creativity and Intelligence. arXiv:1410.6142 [cs.AI]
[90] Graeme Ritchie. 2007. Some Empirical Criteria for Attributing Creativity to a Computer Program. Minds and Machines 17 (2007), 67–99.
[91] Ruben Rodriguez Torrado, Ahmed Khalifa, Michael Cerny Green, Niels Justesen, Sebastian Risi, and Julian Togelius. 2020. Bootstrapping Conditional GANs for Video Game Level Generation. In CIG 2020. 41–48.
[92] Melissa Roemmele and Andrew S. Gordon. 2018. Automated Assistance for Creative Writing with an RNN Language Model. In IUI 2018. Association for Computing Machinery.
[93] Jon Rowe and Derek Partridge. 1993. Creativity: a survey of AI approaches. Artificial Intelligence Review 7 (1993), 43–70.
[94] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (2015), 211–252.
[95] Stuart Russell and Peter Norvig. 2009. Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall Press.
[96] Sameh Said Metwaly, Wim Van den Noortgate, and Eva Kyndt. 2017. Approaches to Measuring Creativity: A Systematic Literature Review. Creativity. Theories – Research – Applications 4, 2 (2017).
[97] Pamela Samuelson. 1986. Allocating Ownership Rights in Computer-Generated Works. In Symposium Cosponsored by University of Pittsburgh Law Review and The Software En on The Future of Software Protection. University of Pittsburgh Press, 1185–1228.
[98] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. 2018. Relational recurrent neural networks. In NeurIPS 2018, Vol. 31. Curran Associates, Inc.
[99] Jürgen Schmidhuber. 2009. Simple Algorithmic Theory of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes. Journal of SICE 48 (2009).
[100] Jürgen Schmidhuber. 2010. Artificial Scientists and Artists Based on the Formal Theory of Creativity. In Artificial General Intelligence – Proceedings of the Third Conference on Artificial General Intelligence, AGI 2010 (2010).
[101] Jürgen Schmidhuber. 2010. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2, 3 (2010), 230–247.
[102] Michael D. Schmidt and Hod Lipson. 2009. Distilling Free-Form Natural Laws from Experimental Data. Science 324 (2009), 81–85.
[103] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–489.
[104] Benjamin Sobel. 2020. A Taxonomy of Training Data: Disentangling the Mismatched Rights, Remedies, and Rationales for Restricting Machine Learning. Artificial Intelligence and Intellectual Property (2020).
[105] Kenneth O. Stanley. 2006. Exploiting Regularity Without Development. In AAAI 2006.
[106] Michael Steinbach, Levent Ertöz, and Vipin Kumar. 2004. The Challenges of Clustering High Dimensional Data. New Directions in Statistical Physics (2004), 273–309.
[107] Bob L. Sturm, João Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. 2016. Music transcription modelling and composition using deep learning. arXiv:1604.08723 [cs.SD]
[108] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NeurIPS 2014. MIT Press, 3104–3112.
[109] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NeurIPS 2000, Vol. 12. MIT Press.
[110] Pradyumna Tambwekar, Murtaza Dhuliawala, Lara J. Martin, Animesh Mehta, Brent Harrison, and Mark O. Riedl. 2019. Controllable Neural Story Plot Generation via Reward Shaping. In IJCAI 2019. 5982–5988.
[111] Max Tegmark. 2017. Life 3.0: Being Human in the Age of Artificial Intelligence. Allen Lane.
[112] Julian Togelius and Jurgen Schmidhuber. 2008. An experiment in automatic game design. In CIG 2008. IEEE, 111–118.
[113] Hannu Toivonen and Oskar Gross. 2015. Data mining and machine learning in computational creativity. WIREs Data Mining and Knowledge Discovery 5, 6 (2015).

[114] Alan M. Turing. 1937. On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society s2-42, 1 (1937), 230–265.
[115] Alan M. Turing. 1950. Computing Machinery and Intelligence. Mind LIX, 236 (1950), 433–460.
[116] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499 [cs.SD]
[117] Aaron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel Recurrent Neural Networks. In ICML 2016. JMLR.org, 1747–1756.
[118] Aaron Van Den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. 2016. Conditional Image Generation with PixelCNN Decoders. In NeurIPS 2016. Curran Associates Inc., 4797–4805.
[119] Lav R. Varshney, Florian Pinel, Kush R. Varshney, Debarun Bhattacharjya, Angela Schoergendorfer, and Yi-Min Chee. 2019. A big data approach to computational creativity: The curious case of Chef Watson. IBM Journal of Research and Development 63, 1 (2019), 7:1–7:18.
[120] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.
[121] Gauthier Vernier, Hugo Caselles-Dupré, and Pierre Fautrel. 2020. Electric Dreams of Ukiyo: A Series of Japanese Artworks Created by an Artificial Intelligence. Patterns 1, 2 (2020), 100026.
[122] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML 2008. 1096–1103.
[123] Vanessa Volz, Jacob Schrum, Jialin Liu, Simon M. Lucas, Adam Smith, and Sebastian Risi. 2018. Evolving Mario Levels in the Latent Space of a Deep Convolutional Generative Adversarial Network. In GECCO 2018. Association for Computing Machinery, 221–228.
[124] Geraint A. Wiggins. 2006. Searching for Computational Creativity. New Generation Computing 24 (2006), 209–222.
[125] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. 2016. Attribute2Image: Conditional Image Generation from Visual Attributes. In ECCV 2016.
[126] Xiaoyuan Yi, Maosong Sun, Ruoyu Li, and Wenhao Li. 2018. Automatic Poetry Generation with Mutual Reinforcement Learning. In EMNLP 2018. Association for Computational Linguistics, 3143–3153.
[127] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI 2017. AAAI Press, 2852–2858.
[128] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019. Self-Attention Generative Adversarial Networks. In PMLR 2019, Vol. 97. 7354–7363.
[129] Xingxing Zhang and Mirella Lapata. 2014. Chinese Poetry Generation with Recurrent Neural Networks. In EMNLP 2014. Association for Computational Linguistics, 670–680.
[130] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial Feature Matching for Text Generation. In ICML 2017. JMLR.org, 4006–4015.
[131] Andrea Zugarini, Stefano Melacci, and Marco Maggini. 2019. Neural Poetry: Learning to Generate Poems Using Syllables. Lecture Notes in Computer Science (2019), 313–325.
