Personalized Image Captioning with Context Sequence Memory Networks

Attend to You: Personalized Image Captioning with Context Sequence Memory Networks Cesc Chunseong Park∗ Byeongchang Kim Gunhee Kim Lunit Inc. Seoul National University Seoul, Korea Seoul, Korea [email protected] [email protected] [email protected] Abstract Query Image Softmax We address personalization issues of image captioning, CNN which have not been discussed yet in previous research. For Context yt a query image, we aim to generate a descriptive sentence, Memory accounting for prior knowledge such as the user’s active vo- yt cabularies in previous documents. As applications of per- Image feature User context Word output CSMN model sonalized image captioning, we tackle two post automation User’s Active Vocabulary tasks: hashtag prediction and post generation, on our newly Task1. Hashtag prediction collected Instagram dataset, consisting of 1.1M posts from 6.3K users. We propose a novel captioning model named Context Sequence Memory Network (CSMN). Its unique up- Task2. Post generation dates over previous memory network models include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into Figure 1. Problem statement of personalized image captioning memory to capture long-term information without suffer- with an Instagram example. As main applications, we address hashtag prediction and post generation tasks. Given a query im- ing from the vanishing gradient problem, and (iii) adopting age, the former predicts a list of hashtags, while the latter gener- CNN memory structure to jointly represent nearby ordered ates a descriptive text to complete a post. We propose a versatile memory slots for better context understanding. With quan- context sequence memory network (CSMN) model. titative evaluation and user studies via Amazon Mechanical Turk, we show the effectiveness of the three novel features of CSMN and its performance enhancement for personalized captioning, which have not been discussed in previous re- image captioning over state-of-the-art captioning models. search. We aim to generate a descriptive sentence for an image, accounting for prior knowledge such as the user’s active vocabularies or writing styles in previous documents. Potentially, personalized image captioning is applicable to 1. Introduction a wide range of automation services in photo-sharing social networks. For example, in Instagram or Facebook, users in- Image captioning is a task of automatically generating stantly take and share pictures as posts using mobile phones. a descriptive sentence of an image [3, 4, 9, 12, 20, 22, 28, One bottleneck to complete an image post is to craft hash- 31, 33]. As this task is often regarded as one of frontier- tags or associated text description using their own words. AI problems, it has been actively studied in recent vision Indeed, crafting text is more cumbersome than taking a pic- and language research. It requires an algorithm not only to ture for general users; photo-taking can be done with only a understand the image content in depth beyond category or single tab on the screen of a smartphone, whereas text writ- attribute levels, but also to connect its interpretation with a ing requires more time and mental energy for selecting suit- language model to create a natural sentence. able keywords and completing a sentence to describe theme, This work addresses personalization issues of image sentiment, and context of the image. ∗This work was done while the author was at Seoul National University. In this paper, as examples of personalized image cap- Codes: https://github.com/cesc-park/attend2u. tioning, we focus on two post automation tasks: hashtag 895 prediction and post generation. Figure 1 shows an Insta- captioning dataset, comprising 1.1M Instagram posts from gram post example. The hashtag prediction automatically 6.3K users. Instagram is a great source for personalized predicts a list of hashtags for the image, while the post gen- captioning, because posts mostly include personal pictures eration creates a sentence consisting of normal words, emo- with long hashtag lists and characteristic text with a wide jis, and even hashtags. Personalization is key to success in range of topics. For each picture post, we consider the body these two tasks, because text in social networks is not sim- text or a list of hashtags as groundtruth captions. ple description of image content, but the user’s own story Our experimental results demonstrate that aforemen- and experience about the image with his or her favorite vo- tioned three unique features of our CSMN model indeed cabularies and expressions. improve captioning performance, especially for personal- To achieve personalized image captioning tasks, we pro- ization purpose. We also validate that our CSMN signifi- pose a memory network model named as context sequence cantly outperforms several state-of-the-art captioning mod- memory network (CSMN). Our model is inspired by mem- els with the decoders of RNNs or LSTMs (e.g. [27, 28, ory networks [5, 24, 29], which explicitly include memory 31]). We evaluate with quantitative language metrics (e.g. components to which neural networks read and write data BLEU [21], CIDEr [26], METEOR [14], and ROUGE [15]) for capturing long-term information. Our major updates and user studies via Amazon Mechanical Turk. over previous memory network models are three-fold. We summarize contribution of this work as follows. First, we propose to use the memory as a context reposi- (1) To the best of our knowledge, we propose a first per- tory of prior knowledge for personalized image captioning. sonalized image captioning approach. We introduce two Since the topics of social network posts are too broad and practical post automation tasks that benefit from personal- users’ writing styles are too diverse, it is crucial to leverage ized captioning: post generation and hashtag prediction. prior knowledge about the authors or metadata around the (2) We propose a novel memory network model named images. Our memory retains such multiple types of context CSMN for personalized captioning. The unique updates information to promote more focused prediction, including of CSMN include (i) exploiting memory as a repository users’ active vocabularies and various image descriptors. for multiple context information, (ii) appending previously Second, we design the memory to sequentially store all generated words into memory to capture long-term infor- of the words that the model generates. It leads two im- mation without, and (iii) adopting CNN memory structure portant advantages. First, it enables the model to selec- to jointly represent nearby ordered memory slots. tively attend, at every step, on the most informative pre- (3) For evaluation of personalized image captioning, we vious words and their combination with other context infor- introduce a novel Instagram dataset. We make the code and mation in the memory. Second, our model does not suf- data publicly available. fer from the vanishing gradient problem. Most caption- (4) With quantitative evaluation and user studies via ing models are equipped with RNN-based encoders (e.g. AMT, we demonstrate the effectiveness of three novel fea- [3, 22, 25, 28, 31, 33]), which predict a word at every time tures of CSMN and its performance superiority over state- step, based on only a current input and a single or a few hid- of-the-art captioning models, including [27, 28, 31]. den states as an implicit summary of all previous history. Thus, RNNs and their variants often fails to capture long- 2. Related work term dependencies, which could worsen if one wants to use Image Captioning. In recent years, much work has been prior knowledge together. On the other hand, our state- published on image captioning, including [3, 4, 9, 12, 20, based sequence generation explicitly retains all information 22, 28, 31, 33], to name a few. Many proposed caption- in the memory to predict next words. By using teacher- ing models exploit RNN-based decoders to generate a se- forced learning [30], our model has a markov property at quence of words from encoded representation of input im- training time; predicting a previous word yt−1 has no effect ages. For example, long-term recurrent convolutional net- on predicting a next word yt, which depends on only the works [3] are one of earliest model to use RNNs for model- current memory state. Thus, the gradients from the current ing the relations between sequential inputs and outputs. You time step prediction yt are not propagated through the time. et al. [33] exploit semantic attention to combine top-down Third, we propose to exploit a CNN to jointly represent and bottom-up strategies to extract richer information from nearby ordered memory slots for better context understand- images, and couples it with an LSTM decoder. Compared ing. Original memory networks [24, 29] leverage time em- to such recent progress of image captioning research, it is bedding to model the memory order. Still its representation novel to replace an RNN-based decoder with a sequence power is low, since it cannot represent the correlations be- memory. Moreover, no previous work has tackled the per- tween multiple memory slots, for which we exploit convo- sonalization issue, which is the key objective of this work. lution layers that lead much stronger representation power. We also introduce post completion and hashtag prediction For evaluation, we collect a new personalized image as solid and practical applications of image captioning. 896 Dataset # posts # users # posts/user # words/post words, which consist of the 10 most common hashtags for caption 721,176 4,820 149.6 (118) 8.55 (8) each of 27 general categories of Pinterest (e.g. design, food, hashtag 518,116 3,633 142.6 (107) 7.45 (7) style). We use the Pinterest categories because they are Table 1. Statistics of Instagram dataset. We also show average well-defined topics to obtain image posts of diverse users.

Load more