arXiv:1904.09073v3 [cs.CV] 7 Nov 2019

Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Julia Kruk1,∗, Jonah Lubin2,∗, Karan Sikka1, Xiao Lin1, Dan Jurafsky3, Ajay Divakaran1,†

1 SRI International, Princeton, NJ
2 The University of Chicago, Chicago, Illinois
3 Stanford University, Stanford, CA

∗ Equal contribution. Work done while Julia (from Cornell University) and Jonah were interns at SRI International.
† Corresponding author, [email protected].

Abstract

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine, via what has been called meaning multiplication (Bateman, 2014), to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. The gain with multimodality is greatest when the image and caption diverge semiotically. Our dataset offers a new resource for the study of the rich meanings that result from pairing text and image. The data is available here: https://github.com/karansikka1/documentIntent_emnlp19.

1 Introduction

Multimodal social platforms such as Instagram let content creators combine visual and textual modalities. The resulting widespread use of text+image makes interpreting author intent in multimodal messages an important task for NLP for document understanding.

There are many recent language processing studies of images accompanied by basic text labels or captions (Chen et al., 2015; Faghri et al., 2018, inter alia). But prior work on image-text pairs has generally been asymmetric, regarding either image or text as the primary content, and the other as mere complement. Scholars from semiotics as well as computer science have pointed out that this is insufficient; often text and image are not combined by a simple addition or intersection of the component meanings (Bateman, 2014; Marsh and Domas White, 2003; Zhang et al., 2018).

Rather, determining author intent with text+image content requires a richer kind of meaning composition that has been called meaning multiplication (Bateman, 2014): the creation of new meaning through integrating image and text. Meaning multiplication includes simple meaning intersection or concatenation (a picture of a dog with the label "dog", or the label "Rufus"). But it also includes more sophisticated kinds of composition, such as irony or indirection, where the text+image integration requires inference that creates a new meaning. For example, in Figure 1, a picture of a young woman smoking is given two different hypothetical captions that result in different composed meanings. In Pairing I, the image and text are parallel, with the picture used to highlight relaxation through smoking. Pairing II uses the tension between her image and the implications of her actions to highlight the dangers of smoking.

Figure 1: Image-Caption meaning multiplication: A change in the caption completely changes the overall meaning of the image-caption pair.

Computational models that detect complex relationships between text and image and how they cue author intent could be significant for many areas, including the computational study of advertising, the detection and study of propaganda, and our deeper understanding of many other kinds of persuasive text, as well as allowing NLP applications to news media to move beyond pure text.

To better understand author intent given such meaning multiplication, we create three novel taxonomies related to the relationship between text and image and their combination/multiplication in Instagram posts, designed by modifying existing taxonomies (Bateman, 2014; Marsh and Domas White, 2003) from semiotics, rhetoric, and media studies. Our taxonomies measure the authorial intent behind the image-caption pair and two kinds of text-image relations: the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We then introduce a new dataset, MDID (Multimodal Document Intent Dataset), with 1299 Instagram posts covering a variety of topics, annotated with labels from our three taxonomies.

Finally, we build a deep neural network model for annotating Instagram posts with the labels from each taxonomy, and show that combining text and image leads to better classification, especially when the caption and the image diverge. While our goal here is to establish a computational framework for investigating multimodal meaning multiplication, in other pilot work we have begun to consider some applications, such as using intent for social media event detection and for user engagement prediction. Both these directions highlight the importance of the intent and semiotic structure of a social media posting in determining its influence on the social network as a whole.

2 Prior Work

A wide variety of work in multiple fields has explored the relationship between text and image and extracting meaning, although often assigning a subordinate role to either text or images, rather than the symmetric relationship in media such as Instagram posts. The earliest work in the Barthesian tradition focuses on advertisements, in which the text serves as merely another connotative aspect to be incorporated into a larger connotative meaning (Heath et al., 1977). Marsh and Domas White (2003) offer a taxonomy of the relationship between image and text by considering image/illustration pairs found in textbooks or manuals. We draw on their taxonomy, although as we will see, the connotational aspects of Instagram posts require some additions.

For our model of speaker intent, we draw on the classic concept of illocutionary acts (Austin, 1962) to develop a new taxonomy of illocutionary acts focused on the kinds of intentions that tend to occur on social media. For example, we rarely see commissive posts on Instagram and Facebook because of the focus on information sharing and constructions of self-image.

Computational approaches to multi-modal document understanding have focused on key problems such as image captioning (Chen et al., 2015; Faghri et al., 2018), visual question answering (Goyal et al., 2017; Zellers et al., 2018; Hudson and Manning, 2019), or extracting the literal or connotative meaning of a post (Soleymani et al., 2017). More recent work has explored the role of image as context for interaction and pragmatics, either in dialog (Mostafazadeh et al., 2016, 2017), or as a prompt for users to generate descriptions (Bisk et al., 2019). Another important direction has looked at an image's perlocutionary force (how it is perceived by its audience), including aspects such as memorability (Khosla et al., 2015), saliency (Bylinskii et al., 2018), popularity (Khosla et al., 2014) and virality (Deza and Parikh, 2015; Alameda-Pineda et al., 2017).

Some prior work has focused on intention. Joo et al. (2014) and Huang and Kovashka (2016) study prediction of intent behind politician portraits in the news. Hussain et al. (2017) study the understanding of image and video advertisements, predicting topic, sentiment, and intent. Alikhani et al. (2019) introduce a corpus of the coherence relationships between recipe text and images. Our work builds on Siddiquie et al. (2015), who focused on a single type of intent (detecting politically persuasive video on the internet), and even more closely on Zhang et al. (2018), who study visual rhetoric as interaction between the image and the text slogan in advertisements. They categorize image-text relationships into parallel equivalent (image and text deliver the same point at equal strength), parallel non-equivalent (image and text deliver the same point at different levels) and non-parallel (text or image alone is insufficient in point delivery). They also identify the novel issue of understanding the complex, non-literal ways in which text and image interact. Weiland et al. (2018) study the non-literal meaning conveyed by image-caption pairs and draw on a knowledge base to generate the gist of the image-caption pair.

3 Taxonomies

As Berger (1972) points out in discussing the relationship between one image and its caption:

    It is hard to define exactly how the words have changed the image but undoubtedly they have. (p. 28)

We propose three taxonomies in an attempt to answer Berger's implicit question, two (contextual and semiotic) to capture different aspects of the relationship between the image and the caption, and one to capture speaker intent.

Figure 2: Examples of multimodal document intent: advocative, provocative, expressive and promotive content.

3.1 Intent Taxonomy

The proposed intent taxonomy is a generalization and elaboration of existing rhetorical categories pertaining to illocution that targets multimodal social networks like Instagram. We developed a set of eight illocutionary intents from our examination and clustering of a large body of representative Instagram content, informed by previous studies of intent in Instagram posts. There is some overlap between categories; to bound the

3. exhibitionist: create a self-image reflecting the person, state etc. for the user using selfies, pictures of belongings (e.g. pets, clothes) etc.
4. expressive: express emotion, attachment, or admiration at an external entity or group.
5. informative: relay information regarding a subject or event using factual language.
6. entertainment: entertain using art, humor, memes, etc.
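
As a rough illustration of how the three label sets could attach to a single image-caption pair, the following Python sketch models one annotated post. It is not taken from the released dataset or its code: the names `Intent`, `AnnotatedPost` and `intent_counts`, and all field names, are hypothetical. The enum lists only the intent names that appear in this excerpt (the full taxonomy has eight), and the contextual and semiotic labels are left as optional free-form strings because their values are not listed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class Intent(Enum):
    """Intent labels named in this excerpt only; the full taxonomy has eight."""
    ADVOCATIVE = "advocative"
    PROVOCATIVE = "provocative"
    EXHIBITIONIST = "exhibitionist"
    EXPRESSIVE = "expressive"
    INFORMATIVE = "informative"
    ENTERTAINMENT = "entertainment"
    PROMOTIVE = "promotive"


@dataclass(frozen=True)
class AnnotatedPost:
    """One image-caption pair with its three taxonomy labels (hypothetical schema)."""
    image_path: str
    caption: str
    intent: Intent
    # The contextual and semiotic label sets are not given in this excerpt,
    # so they are kept as optional strings rather than enums.
    contextual: Optional[str] = None
    semiotic: Optional[str] = None


def intent_counts(posts: Iterable[AnnotatedPost]) -> Dict[Intent, int]:
    """Count how many posts carry each intent label, including zero counts."""
    counts = {intent: 0 for intent in Intent}
    for post in posts:
        counts[post.intent] += 1
    return counts
```

Keeping the three labels as separate fields on one record reflects the description of the taxonomies as orthogonal: each post receives one label from each, and a per-taxonomy tally such as `intent_counts` can be computed independently of the other two.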
