memeBot: Towards Automatic Image Meme Generation
Aadhavan Sadasivam, Kausic Gunasekar, Hasan Davulcu, Yezhou Yang
Arizona State University, Tempe AZ, United States
{asadasi1, kgunase3, hdavulcu, yz.yang}@asu.edu

arXiv:2004.14571v1 [cs.CL] 30 Apr 2020

Abstract

Image memes have become a widespread tool used by people for interacting and exchanging ideas over social media, blogs, and open messengers. This work proposes to treat automatic image meme generation as a translation process, and presents an end-to-end neural and probabilistic approach to generate an image-based meme for any given sentence using an encoder-decoder architecture. For a given input sentence, an image meme is generated by combining a meme template image and a text caption, where the meme template image is selected from a set of popular candidates using a selection module and the meme caption is generated by an encoder-decoder model. An encoder maps the selected meme template and the input sentence into a meme embedding, and a decoder decodes the meme caption from the meme embedding. The generated natural language meme caption is conditioned on the input sentence and the selected meme template. The model learns the dependencies between the meme captions and the meme template images and generates new memes using the learned dependencies. The quality of the generated captions and the generated memes is evaluated through both automated and human evaluation. An experiment is designed to score how well the generated memes can represent tweets from Twitter conversations. Experiments on Twitter data show the efficacy of the model in generating memes for sentences in online social interaction.

Figure 1: An illustrative figure of memeBot. It generates an image meme for a given input sentence (e.g., "I am curiously waiting for my father to cook supper tonight") by combining the selected meme template image and the generated meme caption.

1 Introduction

An Internet meme commonly takes the form of an image and is created by combining a meme template (image) and a caption (natural language sentence). The image typically comes from a set of popular image candidates, and the caption conveys the intended message through natural language. Over the internet, information exists in the form of text, images, video, or a combination of these. However, information that exists as a combination of an image or video and short text often goes viral. Image memes combine image, text, and humor, making them a powerful tool to deliver information. Image memes are popular because they portray the culture and social choices embraced by the internet community, and they have a strong influence on the cultural norms of how specific demographics of people operate. For example, in Figure 2 we present memes used by an online deep learning community to ridicule how new pre-training methods are outperforming the previous state-of-the-art models. The popularity of image-based memes can be attributed to the fact that visual information is easier to process and understand than large blocks of text, and this fact is evident in Figure 2. The key role played by image memes in shaping the popular culture of the internet community makes automatic meme generation a demanding research topic to delve into.

Davison (2012) separates a meme into three components: manifestation, behavior, and ideal. In an image meme, the ideal is the idea that needs to be conveyed; the behavior is selecting a suitable meme template and caption to convey that idea; and the manifestation is the final meme image with a caption conveying the idea. Wang and Wen (2015) and Peirson et al. (2018) focus on the behavior and manifestation of a meme, while not much importance is given to the ideal. Their approaches to image meme generation are limited to selecting the most appropriate meme caption or generating a meme caption for a given image and template name. In this work, we intend to automatically generate an image meme to represent a given input sentence (the ideal), as illustrated in Figure 1, which is a challenging NLP task with immediate practical applications for online social interaction.

By taking a deep look into the process of image meme generation, we propose to relate image meme generation to natural language translation. To translate a sentence from a source to a target language, one has to decode the meaning of the sentence in its entirety, analyze its meaning, and then encode that meaning into the target sentence. Similarly, a sentence can be translated into a meme by encoding the meaning of the sentence into an image-caption pair capable of conveying the same meaning or emotion as the sentence. Motivated by this intuition, we develop a model that operates beyond the known approaches and extends image meme generation to generating memes for a given input sentence. We summarize our contributions as follows:

• We present an end-to-end encoder-decoder architecture to generate an image meme for any given sentence.
• We compiled the first large-scale meme caption dataset.
• We design experiments based on human evaluation and provide a thorough analysis of the experimental results on using the generated memes for online social interaction.

Figure 2: Memes used by the online deep learning community on social media to ridicule the state-of-the-art pre-training models.

2 Related Work

There are only a few studies on automatic image meme generation, and the existing approaches treat meme generation as a caption selection or caption generation problem. Wang and Wen (2015) combined an image and its text description to select a meme caption from a corpus based on a ranking algorithm. Peirson et al. (2018) extend natural language description generation to generating a meme caption using an encoder-decoder model with an attention mechanism (Luong et al., 2015).

Although there is not much work on automatic meme generation, the task can be closely aligned with tasks like sentiment analysis, neural machine translation, image caption generation, and controlled natural language generation. In Natural Language Understanding (NLU), researchers have explored classifying a sentence based on its sentiment (Socher et al., 2013). We extend this idea to classify a sentence based on its compatibility with a meme template. The idea of creating an encoded representation and decoding it into a desired target is well established in neural machine translation and image caption generation. Sutskever et al. (2014), Bahdanau et al. (2014) and Vaswani et al. (2017) use an encoder-decoder model to encode and decode a sentence from a source to a target language. Vinyals et al. (2015), Xu et al. (2015), and Karpathy and Fei-Fei (2015) encode the visual features of an image and use a decoder to generate a natural language description of the image. Fang et al. (2020) generate natural language descriptions containing common-sense knowledge from the encoded visual inputs.

Our proposed model of meme generation shares a similar spirit with the above-mentioned problems: we encode the given input sentence into a latent space and then decode it into a meme caption that can be combined with the meme image to convey the same meaning as the input sentence. However, the generated meme caption should represent the input sentence through the selected meme template, making it a conditioned or controlled text generation task. Controlled natural language generation with desired emotions, semantics, and keywords has been studied previously. Huang et al. (2018) generate text with desired emotions by embedding emotion representations into Seq2Seq models. Hu et al. (2017) concatenate a control vector to the latent space of their model to generate text with designated semantics. Su et al. (2018) and Miao et al. (2019) generate a sentence with desired emotions or keywords using sampling techniques. While these approaches to controlled text generation involve relatively complex conditioning factors, we implement a transformer-based (Vaswani et al., 2017) encoder-decoder model to generate a meme caption conditioned on both the input sentence and the selected meme template.

Figure 3: memeBot model architecture. For a given input sentence, a meme is created by combining the meme image selected by the template selection module and the meme caption generated by the caption generation transformer.

3 Our Approach

In this section, we describe our approach: an end-to-end neural and probabilistic architecture for meme generation. Our model has two components: first, a meme template selection module to identify a compatible meme template (image) for the input sentence, and second, a meme caption generator, as illustrated in Figure 3.

3.1 Meme Template Selection Module

Pre-trained language representations from transformer-based architectures like BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019) are being used in a wide range of Natural Language Understanding tasks. Devlin et al. (2019), Yang et al. (2019) and Liu et al. (2019) show that these models can be fine-tuned for a range of NLU tasks to create state-of-the-art models.
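The selection module is, in effect, a sentence classifier over template labels. The toy sketch below illustrates the idea with a hypothetical bag-of-words scorer standing in for the fine-tuned transformer encoder and its linear classification layer; the template names and weights here are illustrative, not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical subset of templates; the real module classifies over 24.
TEMPLATES = ["success-kid", "grumpy-cat"]

# Toy "learned" weights: one word->score map per template, standing in
# for the RoBERTa encoder followed by a single linear layer.
WEIGHTS = {
    "success-kid": {"win": 2.0, "finally": 1.5, "waiting": -0.5},
    "grumpy-cat": {"no": 2.0, "hate": 1.5, "win": -0.5},
}

def template_scores(sentence):
    """One logit per template from bag-of-words features."""
    words = Counter(sentence.lower().split())
    return {t: sum(w.get(tok, 0.0) * n for tok, n in words.items())
            for t, w in WEIGHTS.items()}

def select_template(sentence):
    """Pick the template with the highest softmax probability."""
    logits = template_scores(sentence)
    z = sum(math.exp(v) for v in logits.values())
    probs = {t: math.exp(v) / z for t, v in logits.items()}
    return max(probs, key=probs.get), probs

best, probs = select_template("when you finally win an argument")
print(best)  # "success-kid" under these toy weights
```

Training the real module maximizes the log-probability of the correct template (Eq. 1), i.e., standard cross-entropy over the softmax outputs.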
For the meme template selection module, we fine-tune the pre-trained language representation models with a linear neural network on the meme template selection task. In training, the probability of selecting the correct template for a given sentence is maximized using the formulation below:

    l(θ1) = arg max_{θ1} Σ_{(T,S)} log P(T | S, θ1),    (1)

where θ1 denotes the parameters of the meme template selection module, T is the template, and S is the input sentence.

3.2 Meme Caption Generator

We train the meme caption generator by corrupting the input caption, borrowing from the denoising autoencoder (Vincent et al., 2008). We extract the parts of speech of the input caption using a part-of-speech (POS) tagger (Honnibal and Montani, 2017). Using the POS vector, we mask the input caption such that only the noun phrases and verbs are passed as input to the meme caption generator. We corrupt the data to facilitate learning meme generation from existing captions and to generalize the process of meme generation to any given input sentence during inference.

The meme caption generator uses a transformer architecture inspired by Vaswani et al. (2017). Our transformer encoder creates a meme embedding for a given sentence by performing multi-head scaled dot-product attention on the selected meme template and the input sentence. The transformer decoder first performs masked multi-head attention on the expected caption and then performs multi-head scaled dot-product attention between the encoded meme embedding and the outputs of the masked multi-head attention, as shown in Figure 3. This enables the meme caption generator to learn the dependency between the input sentence, the selected meme template, and the expected meme caption. We optimize the transformer using the formulation below:

    l(θ2) = arg max_{θ2} Σ_{(S,C)} log P(C | M, θ2),    (2)

where θ2 denotes the parameters of the meme caption generator, C is the meme caption, and M is the meme embedding obtained from the transformer encoder.

4 Dataset

4.1 Meme Caption Dataset

To make possible and validate the aforementioned technical framework, we collect a dataset that enables us to learn the dependency between a meme template and a meme caption. We adopt the open online resource imgflip¹, one of the most commonly used meme generators, and developed a web crawler to automatically collect the memes.

We observe that only a few meme templates dominate the collection. We investigate these dominating memes along with the factors that can make a meme popular. Replication of a meme depends on the mental processes of observation and learning of the group of people across which it is shared (Davison, 2012). Popular meme templates make content shareable and are replicated frequently because of their capability to make content viral. To this end, we experiment on image meme generation using the popular meme templates.

Our dataset has 177,942 meme captions from 24 templates. The distribution of meme captions across the meme templates is presented in Figure 4. The dataset consists of meme template (image and template name) and meme caption pairs; a sample from the dataset is illustrated in Table 1. To add diversity to the generated memes, we use various images for the same meme template. The meme template figures used and a sample of the additional images used for the meme templates are presented in Appendices A and B.

4.2 Twitter Dataset

We collect tweets from Twitter to evaluate the efficacy of our model in generating memes for sentences used in online social interaction. We randomly sampled 6,000 tweets for the query "Corona virus" and pruned them to 1,000 tweets by selecting only those with non-negative sentiment using VADER-Sentiment-Analysis (Hutto and Gilbert, 2014). Twitter is an open domain and may contain tweets that could affect the beliefs and sentiments of people; to retain control over our model, we remove the tweets with negative sentiment. The goal is to prompt our model to generate an image meme from a tweet and evaluate whether the generated meme is relevant to the tweet.

¹ https://imgflip.com

Table 1: Sample examples (template name, captions, and meme image) from the meme caption dataset.

Template Name: Leonardo Dicaprio Cheers
Captions:
• to those who have been fortunate enough to have known true love
• cheers to us making the future bright
• when you see your cousin at a family gathering

Template Name: Success Kid
Captions:
• carries the laundry didn't drop a single sock
• when you win your first fortnite game
• late to work and boss was even later
• when she gives you her phone number
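The caption-corruption step of Section 3.2 can be sketched as follows on a caption from Table 1. A small hand-written tag dictionary stands in for the spaCy POS tagger used in the paper, and full noun-phrase chunking is simplified here to keeping noun and verb tokens.

```python
# Toy POS lexicon standing in for the spaCy tagger; the corruption step
# keeps only nouns/verbs and masks everything else, so the generator
# learns to reconstruct captions from sparse content words.
TOY_POS = {
    "you": "PRON", "win": "VERB", "your": "DET", "first": "ADJ",
    "fortnite": "NOUN", "game": "NOUN", "when": "ADV",
}

KEEP = {"NOUN", "PROPN", "VERB"}

def corrupt_caption(caption, pos=TOY_POS, mask="<mask>"):
    """Replace every token whose POS tag is not a noun/verb with a mask
    token, mimicking the denoising-autoencoder-style input corruption."""
    out = []
    for tok in caption.lower().split():
        tag = pos.get(tok, "X")
        out.append(tok if tag in KEEP else mask)
    return " ".join(out)

print(corrupt_caption("when you win your first fortnite game"))
# -> "<mask> <mask> win <mask> <mask> fortnite game"
```

At inference time, the noun phrases and verbs of an arbitrary input sentence can then be fed through the same interface, which is what lets the model generalize beyond the training captions.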
Figure 4: Distribution of meme caption count for the meme templates.
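The tweet-filtering step of Section 4.2 can be sketched as below. A toy lexicon scorer stands in for VADER's compound sentiment score (the real pipeline uses VADER-Sentiment-Analysis); the word lists are illustrative only.

```python
# Illustrative word lists; a crude stand-in for VADER's lexicon.
NEGATIVE = {"hate", "terrible", "awful", "worst", "scared"}
POSITIVE = {"love", "great", "good", "save", "best"}

def toy_compound(tweet):
    """Toy stand-in for a VADER-style compound score in [-1, 1]."""
    toks = tweet.lower().split()
    score = sum(t in POSITIVE for t in toks) - sum(t in NEGATIVE for t in toks)
    return score / max(len(toks), 1)

def keep_non_negative(tweets, scorer=toy_compound):
    """Keep only tweets whose sentiment score is non-negative,
    mirroring the pruning of the sampled Twitter data."""
    return [t for t in tweets if scorer(t) >= 0]

tweets = ["stay at home and save your life",
          "i hate this awful situation"]
print(keep_non_negative(tweets))  # keeps only the first tweet
```

With the real analyzer, `scorer` would be replaced by a function returning `polarity_scores(tweet)["compound"]`.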
5 Experiments and Results

We train our model on the meme caption dataset (Section 4.1). The train, validation, and test sets contain 142,341, 17,802 and 17,799 samples, respectively. We evaluate the performance of the meme template selection module in selecting the compatible template, the effectiveness of the caption generator in generating captions similar to the input captions from the meme caption test dataset, and the efficacy of the model in generating memes for real-world examples (tweets) through human evaluation.

5.1 Meme Template Selection Module

We fine-tune the pre-trained language representation models (BERT-base (Devlin et al., 2019), XLNet-base (Yang et al., 2019) and RoBERTa-base (Liu et al., 2019)) using a single-layer linear neural network with 768 units on the meme template selection task using the meme caption dataset. The performance of the meme template selection module on the meme caption test data using variants of pre-trained language representation models is reported in Table 2.

Table 2: Meme template selection module performance on the meme caption test dataset using the fine-tuned language representation models. Bold font highlights the best scores. For accuracy and F1, higher is better; for loss, lower is better.

Model        | Accuracy ↑ | F1 ↑  | Loss ↓
BERT-base    | 68.18      | 68.71 | 1.15
XLNet-base   | 68.74      | 69.06 | 1.17
RoBERTa-base | 69.31      | 69.87 | 1.10

We adopt the best-performing model, the fine-tuned RoBERTa-base, as the meme template selection module in the meme generation pipeline.

Figure 5: Memes generated by the caption generator variants (MT2MC, SMT2MC-NP, SMT2MC-NP+V) for the input sentence "Waiting for trump to answer a question."
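The accuracy and F1 numbers in Table 2 follow standard definitions; a self-contained sketch is below (the paper does not state whether F1 is macro- or micro-averaged, so macro-averaging is assumed here).

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of exactly matching template predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged over classes (unweighted)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical predictions over a handful of template labels.
y_true = ["success-kid", "grumpy-cat", "success-kid", "10-guy"]
y_pred = ["success-kid", "grumpy-cat", "grumpy-cat", "10-guy"]
print(accuracy(y_true, y_pred))  # 0.75
```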
5.2 Meme Caption Generator

For meme caption generation, we experiment with two variants. The first variant, Meme Template to Meme Caption (MT2MC), takes the selected meme template as input and generates a meme caption. The second variant, Sentence and Meme Template to Meme Caption (SMT2MC), takes the input sentence along with the selected meme template and generates a meme caption. We use the two variants as an ablation study to demonstrate that the input sentence features enable our model to generate memes relevant to the input sentence.

We also experiment with two variants of SMT2MC. The first uses only the noun phrases from the input sentence, while the second uses the verbs along with the noun phrases. We experiment with only noun phrases in order to study to what extent the addition of verbs directs the context of the generated meme towards the context of the input sentence. The SMT2MC and MT2MC architectures follow the same notation as Vaswani et al. (2017). We report the hyper-parameters used in Table 3.

Table 3: Hyper-parameters used in the caption generation models.

Architecture | N | d_model | d_ff | h
MT2MC        | 8 | 768     | 2048 | 12
SMT2MC       | 6 | 512     | 2048 | 8

We use residual dropout (P_drop) (Srivastava et al., 2014) for regularization and the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98 and ε = 1e−9, with a learning-rate scheduler using cosine annealing with warm restarts (Loshchilov and Hutter, 2016). During inference we generate meme captions using beam search with a beam size of 6 and length penalty α = 0.7. We stop caption generation when a special end token is produced or the maximum length of 32 tokens is reached.

A sample of memes generated by the caption generator variants is presented in Figure 5. MT2MC generates a random caption for the given meme template that is irrelevant to the input sentence, while the meme captions generated by the SMT2MC variants are contextually relevant to the input sentence. Among the SMT2MC variants, the captions generated using noun phrases and verbs as inputs better represent the input sentence.

5.3 Evaluation Metrics

We use the BLEU score (Papineni et al., 2002) to evaluate the quality of the generated captions. The perception of the quality of a meme is subjective and varies among people, and to the best of our knowledge there are no automatic evaluation metrics for the quality of a meme. A fairly reliable technique is human evaluation, in which a set of raters scores the quality of a meme on a subjective scale.

In machine translation, adequacy and fluency (Snover et al., 2009) are used to subjectively rate the correctness and fluency of a translation. Inspired by adequacy and fluency, we define two metrics, Coherence and Relevance, to evaluate the generated memes:

• Coherence: Can you understand the message conveyed through the meme (image + text)?
• Relevance: Is the meme contextually relevant to the text?

The Coherence score captures the quality (fluency) of the generated meme, and the Relevance score captures how well the generated meme represents the input sentence (correctness). We also ask the raters whether they like the meme, to evaluate whether the generated memes are good. The Relevance and Coherence metrics are scored on a range of 1-4. The User Likes score represents the percentage of raters who liked the meme. To score these metrics, we set up an Amazon Mechanical Turk (AMT) experiment.

5.4 Evaluation Results

5.4.1 Caption Generation Results

The scores for the caption generator variants are reported in Table 4. The SMT2MC variants produce meme captions textually similar to the input sentence. Among the SMT2MC variants, the variant that inputs verbs along with noun phrases has better scores; using verbs with noun phrases enables the caption generator to generate captions more relevant to the input sentence than the variant that inputs only the noun phrases. We use the best-performing SMT2MC-NP+V to generate memes for the Twitter data.

Table 4: BLEU scores for the caption generator variants. Bold font highlights the best scores. NP - noun phrase, V - verb.

Variant     | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
MT2MC       | 13.93  | 6.85   | 4.45   | 2.72
SMT2MC-NP   | 38.71  | 24.67  | 14.56  | 8.89
SMT2MC-NP+V | 45.67  | 27.56  | 17.14  | 11.12

5.4.2 Human Evaluation Task Setup

We choose Amazon Mechanical Turk (AMT) for the evaluation of the generated memes because of its easy-to-use platform and the ready availability of a large worker pool with the required skills. An example AMT questionnaire is presented in Appendix C. Each sample was rated by two raters; in case of disagreement between the raters, we take their average score as the final score.

In our AMT setup, we design a two-stage process to evaluate a meme. We first display the meme image and ask the workers to score the Coherence metric based only on their understanding of the meme. We then display the tweet, ask them to read the text, and ask them to score the Relevance metric based on their comprehension of the tweet and the meme. Our expectation of the AMT workers is that they are capable of visually understanding an image, capable of semantically and contextually understanding a sentence, and possess the reasoning ability to compare context from different information sources. We assume an adult human being is well qualified to meet these expectations.

5.4.3 Human Evaluation Results

The performance of the SMT2MC-NP+V model on the human evaluation metrics is reported in Table 5, and the score distribution across the evaluation metrics is presented in Figure 6. A qualitative comparison of memes grouped by rater scores is presented in Figures 7 and 8.

Figure 6: Human evaluation scores. (a) Coherence score distribution, (b) Relevance score distribution, (c) User Likes score distribution.

Figure 7: Qualitative comparison of memes grouped by Coherence score (very good (4) to very poor (1)).

Figure 8: Qualitative comparison of memes grouped by Relevance score (very good (4) to very poor (1)).

Table 5: Human evaluation scores on Twitter data. Relevance and Coherence metrics are scored out of 4. The User Likes score represents the percentage of raters who liked the meme.

Metric     | Score
Coherence  | 2.66
Relevance  | 2.65
User Likes | 0.65

Before interpreting the scores, we review the image meme generation task. It requires the ability to semantically and contextually understand the input sentence along with contextual knowledge of the image memes. Even with this understanding, one has to possess good fluency in natural language to generate a meme caption that is compatible with the meme image, and the generated meme should also be relevant to the input sentence.

Observing the score distribution in Figure 6, we infer that more than 60% of the generated memes are coherent and relevant to the input tweets. From Table 5, we see that 65% of the raters liked the meme shown to them, and the like percentage correlates with the Coherence and Relevance scores. We infer that the raters liked a meme if they understood the information conveyed through it and if it was relevant to the input tweet. Quantitatively, our model generates coherent memes with 66.5% confidence and relevant memes with 66.25% confidence. Our model performs with good confidence on the challenging image meme generation task while using only the language features of the image meme during training.
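Agreement between the two raters is quantified in Section 6 with Cohen's kappa; a minimal self-contained sketch with hypothetical ratings on the 1-4 scale follows.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the expected chance agreement from each rater's
    marginal label distribution."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical raters scoring eight memes on the 1-4 Coherence scale.
a = [4, 3, 3, 2, 4, 1, 2, 3]
b = [4, 3, 2, 2, 4, 1, 2, 4]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Perfect agreement yields κ = 1, while κ = 0 means the observed agreement is no better than chance.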
We analyze the performance of our model under the assumption that a human-generated meme would receive a perfect score across all the metrics.

6 Inter Rater Reliability

We use Cohen's kappa (κ) to measure the reliability among the raters. Cohen's kappa is defined as

    κ = (p_o − p_e) / (1 − p_e),    (3)

where p_o is the relative observed agreement and p_e is the hypothetical probability of chance agreement among the raters. The Inter Rater Reliability (IRR) score among the raters on the different metrics is reported in Table 6.

Table 6: Inter Rater Reliability scores [%] on the human evaluation metrics.

Metric     | Agreement Score
Coherence  | 71.68
Relevance  | 61.39
User Likes | 73.85

The raters have higher than 60% agreement on all the metrics, which establishes good consistency among the raters in evaluating the quality of the generated image memes.

7 Controlling Meme Generation

Corrupting the input data during training enables our model to learn from the meme caption dataset and scale to any input sentence during inference, as shown in Figure 8. In an ideal scenario, the user might want to select the meme template. During the experiments, we observed that for a given sentence, information abstraction during training enabled our model to create a meme caption conditioned on any given meme template. We experiment further on this by forcing the caption generator to generate captions for an input sentence conditioned on different meme templates. The generated memes are presented in Figure 9; our model is capable of generating a meme for an input sentence conditioned on a selected or given meme template.

Figure 9: Memes generated by the SMT2MC-NP+V variant for the input tweet "Please save the world from Corona" by conditioning caption generation on different meme templates.

8 Conclusion and Future Work

We have presented memeBot, an end-to-end architecture that can automatically generate a meme for a given sentence. memeBot is composed of two components: a module to select a meme template, and an encoder-decoder to generate a meme caption. The model is trained on a meme caption dataset to maximize the likelihood of selecting a template given a caption and to maximize the likelihood of generating a meme caption given the input sentence and the meme template. Automatic evaluation on the meme caption test data and human evaluation scores on Twitter data show promising performance in generating memes for sentences in online social interaction.

The concept of the quality of a meme varies highly among people and is hard to evaluate using a set of pre-defined metrics. In real-world scenarios, if an individual likes a meme, he or she shares it with others, and if a group of individuals likes the same meme, it can become viral or trending. Future work includes evaluating a meme by introducing it into a social media stream and rating it based on its transmission among people. The meme transmission rate and the group of people it transmits across can be used as reinforcement to generate more creative and better-quality memes.

Acknowledgment: The Office of Naval Research (award #N00014-18-1-2761) and the National Science Foundation under the Robust Intelligence Program (#1750082) are gratefully acknowledged.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Patrick Davison. 2012. The language of internet memes. The Social Media Reader, pages 120–134.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. 2020. Video2Commonsense: Generating commonsense descriptions to enrich video captioning. arXiv preprint arXiv:2003.05162.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1587–1596. JMLR.org.

Chenyang Huang, Osmar Zaiane, Amine Trabelsi, and Nouha Dziri. 2018. Automatic dialogue generation with expressed emotions. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pages 49–54.

Clayton J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Abel L. Peirson V and E. Meltem Tolunay. 2018. Dank learning: Generating memes using deep neural networks. arXiv preprint arXiv:1806.04510.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Jinyue Su, Jiacheng Xu, Xipeng Qiu, and Xuanjing Huang. 2018. Incorporating discriminator in sentence generation: A Gibbs sampling method. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

William Yang Wang and Miaomiao Wen. 2015. I can has cheezburger? A nonparanormal approach to combining textual and visual information for predicting and generating popular meme descriptions. In Proceedings of NAACL-HLT 2015, pages 355–365.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

A Meme Templates
Meme templates and meme images from the meme caption dataset used in our experiments (the template images themselves are omitted here):

• Leonardo Dicaprio Cheers
• Bad Luck Brian
• Success Kid
• X X Everywhere
• Ancient Aliens
• Disaster Girl
• Third World Skeptical Kid
• One Does Not Simply
• Futurama Fry
• Am I The Only One Around Here
• Captain Picard Facepalm
• 10 Guy
• Brace Yourselves X is Coming
• Creepy Condescending Wonka
• Matrix Morpheus
• But Thats None Of My Business
• Roll Safe Think About It
• Y U No
• The Most Interesting Man In The World
• First World Problems
• Waiting Skeleton
• Hide the Pain Harold
• That Would Be Great
• Grumpy Cat
A sample of the additional images used for the Waiting Skeleton meme template.
C Amazon Mechanical Turk Questionnaire
Sample AMT questionnaire - Coherence metric.

Sample AMT questionnaire - Relevance and User Likes metrics.

Sample AMT questionnaire - disclaimer.

D memeBot Demo

A live demo of the presented memeBot is available on the website.