LIIR at Semeval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images Using CLIP Features

LIIR at SemEval-2021 task 6: Detection of Persuasion Techniques In Texts and Images using CLIP features Erfan Ghadery , Damien Sileo , and Marie-Francine Moens Department of Computer Science (CS) KU Leuven ferfan.ghadery, damien.sileo, [email protected] Abstract method, named LIIR, has achieved a competitive We describe our approach for SemEval-2021 performance with the best performing methods in task 6 on detection of persuasion techniques the competition. Also, empirical results show that in multimodal content (memes). Our sys- the augmentation approach is effective in improv- tem combines pretrained multimodal models ing the results. (CLIP) and chained classifiers. Also, we pro- The rest of the article is organized as follows. pose to enrich the data by a data augmentation The next section reviews related works. Section 3 technique. Our submission achieves a rank of describes the methodology of our proposed method. 8/16 in terms of F1-micro and 9/16 with F1- We will discuss experiments and evaluation results macro on the test set. in Sections 4 and 5, respectively. Finally, the last 1 Introduction section contains the conclusion of our work. Online propaganda is potentially harmful to soci- 2 Related work ety, and the task of automated propaganda detection has been suggested to alleviate its risks (Martino et al., 2020b). In particular, providing a justifi- This work is related to computational techniques cation when performing propaganda detection is for automated propaganda detection (Martino et al., important for acceptability and application of the 2020b) and is the continuation of a previous shared decisions. Previous challenges have focused on the task (Martino et al., 2020a). detection of propaganda techniques (Martino et al., Taks 11 of SemEval-2020 proposes a more fine- 2020a), based on news articles. However, many graind analysis by also identifying the underly- use cases do not solely involve text, but can also ing techniques behind propaganda in news text, involve other modalities, notably images. Task 6 of with annotations derived from previously proposed SemEval-2021 proposes a shared task on the detec- propaganda techniques typologies (Miller, 1939; tion of persuasion techniques detection in memes, Robinson, 2019). where both images and text are involved. Substasks This current iteration of the task tackles a more 1 and 2 deal with text in isolation, but we focus on challenging domain, by including multimodal con- subtask 3: visuolinguistic persuasion technique de- tent, notably memes. The subtle interaction be- tection. tween text and image is an open challenge for state This article presents the system behind our sub- of the art multimodal models. For instance, the mission for subtask 3 (Dimitrov et al., 2021). To Hateful Memes challenge (Kiela et al., 2020) was handle this problem, we use a model containing recently proposed, as a binary task for detection of three components: data augmentation, image and hateful content. The recent advances in pretrain- text feature extraction, and chain classifier compo- ing of visuolinguistic representations (Chen et al., nents. First, given a paired image-text as the input, 2020) lead the model closer to human accuracy we paraphrase the text part using back-translation (Sandulescu, 2020). and pair it again with the corresponding image to More generally, propaganda detection is at the enrich the data. Then, we extract visual and textual crossroad of many tasks, since it can be helped features using the CLIP (Radford et al., 2021) im- by many subtasks. Fact-checking (Aho and Ull- age encoder and text encoder, respectively. Finally, man, 1972; Dale, 2017) can be involved with pro- we use a chain classifier to model the relation be- paganda detection, alongside various social, emo- tween labels for the final prediction. Our proposed tional and discursive aspects (Sileo et al., 2019a), 1015 Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1015–1019 Bangkok, Thailand (online), August 5–6, 2021. ©2021 Association for Computational Linguistics If Trump Isn't Hitler, Then I'm a Moron Back-Translation If Trump is not Hitler, Then I'm an idiot CLIP Text Encoder Chained Estimated Classifier Probabilities CLIP Image Encoder Figure 1: The overall architecture of our proposed model. For each example, use Back-Translation to derive augmentations of the text, and we compute persuasion techniques probabilities separately. Then, we average the estimated probabilities from augmented and original examples. including offensive language detection (Pradhan the test time, we average the probability distribu- et al., 2020; Ghadery and Moens, 2020) emotion tions over the original and paraphrased sentence- analysis (Dolan, 2002), computational study of image pairs. persuasiveness (Guerini et al., 2008; Carlile et al., 2018) and argumentation (Palau and Moens, 2009; 3.2 Feature Extraction Habernal and Gurevych, 2016). Our system isProbabilities of a combination of pretrained visuolinguistic and linguistic models. 3 Methodology We use CLIP (Radford et al., 2021) as a pretrained visuolinguistic model. CLIP provides an In this section, we introduce the design of our pro- image encoder fi and a text encoder ft. They posed method. The overall architecture of our were pretrained on a prediction of matching im- method is depicted in figure1. Our model con- age/text pairs. The training objective incentivizes sists of several components: a data augmentation high values of fi(I):ft(T ) if I and T are match- component (Back-translation), a feature extraction ing in the training corpus, and low values of they component(CLIP), and a chained classifier. Details are not matching1. Instead of using a dot prod- of each component are described in the following uct, we create features with element-wise product subsections. fi(I) ft(T ) of image and text encoding. This en- ables aspect-based representations of the matching 3.1 Augmentation Method between image and text. We experimented with One of the challenges in this subtask is the low other compositions (Sileo et al., 2019b) which did number of training data where the organizers have not lead to significant improvement. provided just 200 training samples. To enrich the We then use a classifier C on top of fi(I) ft(T ) training set we propose to use the back-translation to predict the labels. technique (Sennrich et al., 2016) for paraphrasing a given sentence by translating it to a specific target 3.3 Chained Classifier language and translating back to the original lan- In this task, we are dealing with a multilabel clas- guage. To this end, we use four translation models, sification problem, which means we need to pre- English-to-German, German-to-English, English- dict a subset of labels for a given paired image- to-Russian, and Russian-to-English provided by text sample as the input. We noticed that label (Ng et al., 2019). Therefore, for each training sen- 1They assign each image to the text associated to other tence, we obtain two paraphrased version of it. In images in the current batch to generate negative examples 1016 Slogans 0.40 Appeal to authority Label Count Presenting Irrelevant Data Transfer 0.35 Flag-waving Smears 199 Thought-terminating cliché 0.30 Appeal to Emotions Appeal to fear Loaded Language 134 Bandwagon 0.25 Smears Name calling/Labeling 118 Repetition Black-and-white Fallacy 0.20 Glittering generalities (Virtue) 54 Exaggeration/Minimisation Doubt 0.15 Appeal to (Strong) Emotions 43 Causal Oversimplification Name calling Straw Man 0.10 Appeal to fear/prejudice 42 Reductio ad hitlerum Whataboutism Exaggeration/Minimisation 42 Glittering generalities 0.05 Loaded Language Transfer 41 Obfuscation 0.00 Slogans 28 Doubt 25 Doubt Smears Slogans Transfer Repetition Flag-waving 24 Straw Man Bandwagon Obfuscation Flag-waving Name calling Appeal to fear Whataboutism Causal Oversimplification 22 Loaded Language Misrepresentation of Someone’s Position 21 Appeal to authority Appeal to Emotions Reductio ad hitlerum Glittering generalities Whataboutism 14 Black-and-white Fallacy Causal Oversimplification Presenting Irrelevant Data Exaggeration/Minimisation Thought-terminating cliché Black-and-white Fallacy/Dictatorship 13 Thought-terminating cliche´ 10 Figure 2: Probabilities of label co-occurence in the Reductio ad hitlerum 10 training set. Some label pairs, for instance (SMEARS Appeal to authority 10 Repetition 3 and LOADED LANGUAGE) are frequently associated. Obfuscation, Intentional vagueness, Confusion 3 Bandwagon 1 Presenting Irrelevant Data (Red Herring) 1 co-occurrences were not uniformly distributed, as shown in figure2. To further address the data spar- Table 1: Labels of persuasion techniques with associ- sity, we use another inductive bias at the classifier- ated counts in the training set level with a chained classifier (Read et al., 2009) using scikit-learn implementation (Pedregosa et al., 2011). is an image and its corresponding text. We use Instead of considering each classification task 10% of the training set as the validation set for independently, a chained classifier begins with the hyperparameter tuning. training of one classifier for each of the L labels. But we also sequentially train L other classifier instances thereafter, each of them using the outputs 5 Evaluation and Results of the previous classifier as input. This allows our model to model the correlations between labels. We 5.1 Results use a Logistic Regression with default parameter In this section, we present the results obtained by as our base classifier. our model on the test sets for Subtask 3. Table2 Our chain classifier uses combined image and shows the results obtained by the submitted final text features as the input. We transfer the predicted model on the test set. All the results are provided in probabilities of the classifier via the sigmoid activa- terms of macro-F1 and Micro-F1. Furthermore, we tion function to make the probability values more provide the results obtained by the random baseline, discriminating(Ghadery et al., 2018). Then we ap- the best performing method in the competition, and ply thresholding on the L labels probabilities since median result for the sake of comparison.

Load more