A Survey on Multimodal Disinformation Detection

Firoj Alam1∗, Stefano Cresci2, Tanmoy Chakraborty3, Fabrizio Silvestri4, Dimiter Dimitrov5, Giovanni Da San Martino6, Shaden Shaar1, Hamed Firooz7, Preslav Nakov1
1Qatar Computing Research Institute, HBKU, Doha, Qatar, 2IIT-CNR, Pisa, Italy, 3IIIT-Delhi, India, 4Sapienza University of Rome, Italy, 5Sofia University, Bulgaria, 6University of Padova, Italy, 7Facebook AI
{fialam, sshaar, pnakov}@hbku.edu.qa, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
∗Contact Author

Abstract

Recent years have witnessed the proliferation of fake news, misinformation, and disinformation online. While initially this was mostly about textual content, over time images and videos gained popularity, as they are much easier to consume, attract much more attention, and spread further than simple text. As a result, researchers started targeting different modalities and combinations thereof. As different modalities are studied in different research communities with insufficient interaction, here we offer a survey that explores the state of the art on multimodal disinformation detection, covering various combinations of modalities: text, images, audio, video, network structure, and temporal information. Moreover, while some studies focused on factuality, others investigated how harmful the content is. While these two components in the definition of disinformation – (i) factuality and (ii) harmfulness – are equally important, they are typically studied in isolation. Thus, we argue for the need to tackle disinformation detection by taking into account multiple modalities as well as both factuality and harmfulness, in the same framework. Finally, we discuss current challenges and future research directions.

1 Introduction

The proliferation of online social media has encouraged individuals to freely express their opinions and emotions. On one hand, the freedom of speech has led to a massive growth of online content which, if systematically mined, can be used for citizen journalism, public awareness, political campaigning, etc. On the other hand, its misuse has given rise to hostility in online media, resulting in false information in the form of fake news, hate speech, propaganda, cyberbullying, etc. It has also set the dawn of the Post-Truth Era, where appeal to emotions has become more important than the truth. More recently, with the emergence of the COVID-19 pandemic, a new blending of medical and political false information has given rise to the first global infodemic.

The term "fake news" was declared Word of the Year 2017 by the Collins dictionary. However, this term is very generic and may mislead people and fact-checking organizations to focus only on veracity. Recently, several international organizations such as the EU have redefined the term to make it more precise: disinformation, which refers to information that (i) is fake and (ii) is spread deliberately to deceive and harm others. The latter aspect of disinformation (i.e., harmfulness) is often ignored, but it is equally important. In fact, the reason disinformation emerged as an important issue is that news was weaponized.

Disinformation often spreads as text. However, the Internet and social media allow the use of different modalities, which can make a disinformation message attractive as well as impactful: a meme or a video is much easier to consume, attracts much more attention, and spreads further than simple text [Zannettou et al., 2018; Hameleers et al., 2020; Li and Xie, 2020].

Yet, multimodality remains under-explored when it comes to the problem of disinformation detection. [Bozarth and Budak, 2020] performed a meta-review of 23 fake news detection models and the data modalities they used, and found that 91.3% leveraged text, 47.8% used network structure, 26.0% used temporal data, and only a handful leveraged images or videos. Moreover, while there has been research on trying to detect whether an image or a video has been manipulated, there has been less research in a truly multimodal setting [Zlatkova et al., 2019].

In this paper, we survey the research on multimodal disinformation detection covering various combinations of modalities: text, images, audio, video, network structure, and temporal information. We further argue for the need to cover multiple modalities in the same framework, while taking both factuality and harmfulness into account.

While there have been a number of surveys on "fake news" [Cardoso Durier da Silva et al., 2019; Kumar and Shah, 2018], misinformation [Islam et al., 2020], fact-checking [Kotonya and Toni, 2020], truth discovery [Li et al., 2016], and propaganda detection [Da San Martino et al., 2020], none of them had multimodality as their main focus. Moreover, they targeted either factuality (most surveys above) or harmfulness (the latter survey), but not both. Therefore, the current survey is organized as follows. We analyze the literature covering various aspects of multimodality: image, speech, video, and network and temporal information. For each modality, we review the relevant literature on the aspects of factuality and harmfulness. Figure 1 shows how we envision the multimodality aspect of misinformation interacting with harmfulness and factuality. We also highlight major takeaways and limitations of existing studies, and suggest future directions in this domain.

[Figure 1: a diagram linking the multimodal components (image, speech, video, network and temporal) to the factual and harmful aspects. Caption: Our vision of how multimodality interacts with factuality and harmfulness in this survey.]

2 Multimodal Factuality Prediction

In this section, we focus on the first aspect of disinformation – factuality. It is crucial to identify the factuality of statements that can mislead people. Automatic detection of factual claims is important to debunk the spread of misleading information. A large body of work has been devoted to automatically detecting factuality from textual content. However, such misleading claims are also expressed and disseminated with other modalities (e.g., with image, speech, and video) and propagated through different social networks. This section summarizes relevant studies that explore all modalities except text-only, as text is commonly used together with the other modalities.

2.1 Image

Due to the growing number of claims disseminated with images, there have been various studies in the literature that address predicting misleading content with visual information [Volkova et al., 2019; Zhang et al., 2018], fake images [Gupta et al., 2013], images shared with misinformation in political groups [Garimella and Eckles, 2020], and fauxtography detection [Zhang et al., 2018; Wang et al., 2020b]. Text with visual content (e.g., images) spreads faster and gets 18% more clicks, 89% more likes, and 150% more retweets; such content is perceived as having "clickbait" power, which drives engagement.

[Garimella and Eckles, 2020] manually annotated a sample of 2,500 images collected from public WhatsApp groups in India that had already been fact-checked, labeling them as misinformation, not misinformation, or unclear. They identified different types of images, such as out-of-context old images, doctored images, memes (funny and misleading images), and other types of fake images. The image-related features exploited in this study include images with faces and dominant colors. The authors also found that violent and graphic images spread faster.

[Nakamura et al., 2020] developed a multimodal dataset containing one million posts, including text, image, metadata, and comments, collected from Reddit. The dataset was labeled with 2-, 3-, and 6-way labels, and they developed models for detecting misleading information using images and text. [Volkova et al., 2019] also proposed models for detecting misleading information using images and text, based on a corpus of 500,000 posts with images labeled with six classes: disinformation, propaganda, hoaxes, conspiracies, clickbait, and satire. They then compared how different combinations of models (i.e., visual, textual, and lexical characteristics of the text) performed. [Zlatkova et al., 2019] investigated the factuality of claims with respect to images and compared the performance of different feature groups between text and images. Another notable work in this direction is FauxBuster [Zhang et al., 2018], which uses content in social media (e.g., comments) in addition to the content of the images and the texts. [Wang et al., 2020b] analyzed fauxtography images on Twitter and Reddit and found that posts with doctored images increase user engagement. Their findings reveal that doctored images are often used as memes to mislead or as a means of satire. Moreover, a large body of work was done in the direction of detecting whether an image has been manipulated.

2.2 Speech

Acoustic signals have also been commonly used in the context of misinformation in social media. There has been effort to use acoustic signals to predict factuality in political debates [Kopev et al., 2019; Shaar et al., 2020], for left-center-right bias prediction in political debates [Dinkov et al., 2019], and for deception detection in speech [Hirschberg et al., 2005].

[Hirschberg et al., 2005] created the Columbia-SRI-Colorado (CSC) corpus by eliciting within-speaker deceptive and non-deceptive speech. Different sets of information were used in their experiments, including acoustic, prosodic, and lexical information, 68 LIWC categories, and a variety of paralinguistic features (e.g., speaker gender, field-pause, filled pauses, laughter). [Kopev et al., 2019] reported that the acoustic signal consistently helps to improve performance compared to using textual and metadata features only. Similarly, [Dinkov et al., 2019] reported that the use of speech signals improves the performance of political bias prediction (i.e., left, center, right) in political debates. Using the same corpus, an evaluation campaign was organized, where different multimodal approaches were proposed, such as the fusion of acoustic and lexical sources [Kaya and Karpov, 2016], and of acoustic, prosodic, lexical, and phonotactic features [Levitan et al., 2016].
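To make the fusion idea concrete, the sketch below combines utterance-level acoustic descriptors with lexical features from the transcript in a single classifier. It is a minimal illustration under our own assumptions (MFCC statistics and TF-IDF as stand-ins for richer prosodic and paralinguistic feature sets), not the pipeline of any of the systems cited above.

```python
# Minimal sketch: feature-level fusion of acoustic and lexical views for
# utterance-level deception/factuality classification (illustrative only).
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def acoustic_features(wav_path: str) -> np.ndarray:
    """Summarize an utterance with mean/std of MFCCs (a simple stand-in
    for richer acoustic-prosodic feature sets)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_fusion_model(wav_paths, transcripts, labels):
    # Lexical view: TF-IDF over the transcripts.
    vectorizer = TfidfVectorizer(max_features=5000)
    lexical = vectorizer.fit_transform(transcripts).toarray()
    # Acoustic view: one fixed-size descriptor per utterance.
    acoustic = np.vstack([acoustic_features(p) for p in wav_paths])
    # Fuse by concatenation and train a linear classifier.
    fused = np.hstack([lexical, acoustic])
    clf = LogisticRegression(max_iter=1000).fit(fused, labels)
    return vectorizer, clf
```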
2.3 Video

In addition to textual, imagery, and speech content, the information in video plays a significant role in capturing cues of deceptive behavior. Such cues in videos (e.g., facial expressions, gestures) have been investigated for deception detection in several studies [Perez-Rosas et al., 2015; Krishnamurthy et al., 2018; Soldner et al., 2019]. [Perez-Rosas et al., 2015] developed a real-life courtroom trial dataset, which includes 61 deceptive and 60 truthful videos. They explored the use of n-gram features from transcripts and non-verbal features (i.e., facial expressions, eyebrows, eyes, mouth openness, mouth lips, head movements, and hand gestures) to develop models that discriminate between liars and truth-tellers. [Krishnamurthy et al., 2018] used textual, audio, and visual features for deception detection. They used a 3D CNN to extract visual features from each frame, spatio-temporal features, and facial expressions such as smile, fear, or stress. The textual features consisted of word embeddings, the audio features used the Interspeech 2013 ComParE feature set, and the micro-expressions were frowning, smiling, eyebrow raising, etc. [Soldner et al., 2019] developed a multimodal deception dataset using TV shows and experimented with textual (i.e., n-grams, part-of-speech, psycholinguistic, and word embedding features), visual (i.e., facial expressions), and dialog features.

Relevant evaluation campaigns include the MediaEval Benchmark [Boididou et al., 2016], where the focus of the tasks was the automatic detection of manipulated and misleading social media content.

2.4 Network and Temporal Information

The rationale for leveraging network information stems from early work [Shao et al., 2018; Vosoughi et al., 2018], which showed that the propagation and interaction networks of fake news are deeper and wider than those of real news. [Vosoughi et al., 2018] also found that fake information spreads faster than factual information, thus advocating for the use of temporal information. Propagation networks can be homogeneous or heterogeneous (e.g., encompassing news articles, publishers, users, and posts), and they can be analyzed at different scales (e.g., node-level, ego-level, triad-level, community-level, and the overall network) [Zhou and Zafarani, 2019].

[Shu et al., 2020] tackled the fake news classification task by proposing an approach based on hierarchical propagation networks. At both micro- and macro-scale, they extracted and jointly considered network features, temporal features, and linguistic features. Experiments on the PolitiFact and GossipCop datasets revealed that the temporal features have the largest contribution, followed by the network and the linguistic features.

[Shu et al., 2019] provided one of the most thorough multimodal frameworks for fake news classification. They used news content embeddings and user embeddings, both obtained via non-negative matrix factorization, as well as user-news interaction embeddings and publisher-news relation embeddings. Their experimental results suggest that social context (i.e., network-derived) features are more informative than news content ones. Furthermore, the best results are obtained when network and content data are used in conjunction, indicating that the two provide complementary information.
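As a rough, simplified illustration of how such latent representations can be obtained and combined (not the actual framework of [Shu et al., 2019], which additionally models publisher-news relations), one can factorize a news-term matrix and a user-news engagement matrix with non-negative matrix factorization and concatenate the resulting news-side embeddings:

```python
# Illustrative sketch: NMF-based content and social-context embeddings for
# news-level classification (simplified; not the original framework).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

def build_news_embeddings(news_term_matrix, user_news_matrix, dim=32):
    """news_term_matrix: (n_news, n_terms) non-negative counts or TF-IDF.
    user_news_matrix:   (n_users, n_news) non-negative engagement counts."""
    # Content view: factorize news x terms and keep the news factors.
    content_emb = NMF(n_components=dim, init="nndsvda",
                      max_iter=500).fit_transform(news_term_matrix)
    # Social-context view: factorize users x news and keep the news-side factors.
    nmf_social = NMF(n_components=dim, init="nndsvda", max_iter=500)
    nmf_social.fit(user_news_matrix)
    social_emb = nmf_social.components_.T          # (n_news, dim)
    return np.hstack([content_emb, social_emb])    # (n_news, 2 * dim)

def train_classifier(news_term_matrix, user_news_matrix, labels):
    features = build_news_embeddings(news_term_matrix, user_news_matrix)
    return LogisticRegression(max_iter=1000).fit(features, labels)
```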
[Vosoughi et al., 2017] proposed Rumor Gauge, a system that jointly exploits temporal and propagation information in conjunction with linguistic and user credibility features for checking the veracity of rumors. In particular, Rumor Gauge leverages text (i.e., handcrafted features), users (i.e., those who shared the rumor), and network propagation (i.e., structural characteristics of the retweet cascade). The temporal modality does not directly provide features, but is instead considered by recomputing all other features at regular time steps, thus yielding multiple time series. As a rumor spreads, each time series is fed to a Hidden Markov Model that provides a "local" veracity score. Local scores are then aggregated to predict the "global" veracity of the rumor at each time step. Experiments revealed that the network features yield the largest contribution; textual and user-derived features followed, providing comparable contributions. Results by [Vosoughi et al., 2017] and [Kwon et al., 2017] also demonstrated that the contribution of the different data modalities changes through time.

To mitigate the "cold start" problem of propagation-based early detection of fake news, [Liu and Wu, 2018] proposed an approach that is primarily based on user and temporal information. First, they built a propagation path for each news item as a time series of user representations. The time series for a given news item only contains the ordered representations of those users who shared that news. Then, they learned two vector representations of each propagation path via GRUs and CNNs, respectively. Finally, the two representations were concatenated and fed to a multi-layer feedforward network that predicted the final class label. Their experiments showed that the proposed system was capable of detecting fake news with 0.8 accuracy after only 5 minutes since the news started spreading. The increased efficacy at the early stage of news propagation depends on the choice to rely heavily on user information.
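A minimal sketch of the propagation-path idea is given below: order the users who shared a news item by share time, represent each with a fixed-size feature vector, and run a recurrent encoder over the prefix observed so far. This is our own simplified rendering; the model of [Liu and Wu, 2018] additionally uses a convolutional branch.

```python
# Sketch: classify a news item from the time-ordered sequence of user feature
# vectors of its sharers, truncated to an early observation window.
import torch
import torch.nn as nn

class PropagationPathClassifier(nn.Module):
    def __init__(self, user_feat_dim: int, hidden_dim: int = 64, n_classes: int = 2):
        super().__init__()
        self.gru = nn.GRU(user_feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, user_sequences: torch.Tensor) -> torch.Tensor:
        # user_sequences: (batch, seq_len, user_feat_dim), ordered by share time.
        _, last_hidden = self.gru(user_sequences)   # (1, batch, hidden_dim)
        return self.head(last_hidden.squeeze(0))    # (batch, n_classes) logits

# Usage: keep only the users observed in the first few minutes for early detection.
model = PropagationPathClassifier(user_feat_dim=16)
early_paths = torch.randn(8, 20, 16)                # 8 news items, first 20 sharers
logits = model(early_paths)
```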
[Zannettou et al., 2018] analyzed different aspects of memes, such as how they evolve and propagate in different mainstream and fringe web communities, and which variants of memes propagate. Finally, [Van-Hoang et al., 2020] proposed the Factual News Graph (FANG), which exploits the social structure and the engagement patterns of users for fake news detection.

3 Multimodal Harmful Content Detection

In this section, we focus on the second aspect of disinformation – harmfulness. Once again, we do not discuss text-only models, as text is almost always present in multimodal approaches.

It is essential to filter or flag harmful content on social media platforms. Harmful content includes child abuse material, violent and extreme content, hate speech, graphic content, sexual content, cruel and insensitive material, and spam. In recent years, the ability to recognize harmful content within online communities has received a lot of attention from researchers and policymakers who aim to keep users safe in the digital world. Studies in this direction include detecting harmful content in network science [Ribeiro et al., 2018], natural language processing [Waseem et al., 2017; Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018], and computer vision [Yang et al., 2019a; Vijayaraghavan et al., 2019; Gomez et al., 2020; Dimitrov et al., 2021].

3.1 Image

Among the different types of harmful content, cyberbullying is one of the major growing problems, affecting teens significantly. [Hosseinmardi et al., 2015] investigated Instagram images and their associated comments for detecting cyberbullying and online harassment. They developed a crowd-sourced dataset of manually labeled images and their associated comments.

Hate speech is another important problem that spreads over social media. The "Hateful Memes Challenge" is an important milestone to advance the research on this topic, and its task is to detect hateful memes [Kiela et al., 2020]. [Das et al., 2020] proposed different approaches for hatefulness detection in memes, such as (i) extracting the caption and including this information in a multimodal model, and (ii) using sentiment as an additional feature together with multimodal representations. For hate speech detection, [Yang et al., 2019a] reported that augmenting text with image embeddings immediately boosts performance. [Vijayaraghavan et al., 2019] proposed methods for interpreting multimodal hate speech detection models, where the modalities consist of text and socio-cultural information rather than images. Concurrently, [Gomez et al., 2020] introduced a larger dataset (150,000 tweets) for multimodal hate speech detection, which consists of six categories: no attacks to any community, racist, sexist, homophobic, religion-based attacks, and attacks to other communities.

Propaganda is another topic that has been explored in a multimodal setting. [Seo, 2014] showed how Twitter was used by the Israel Defense Forces and Hamas as a propaganda tool during the 2012 Gaza conflict to build international support for their own side. [Dimitrov et al., 2021] addressed the detection of persuasion techniques in memes. They collected a dataset of 950 memes on multiple relevant topics, such as politics, COVID-19, vaccines, and gender equality. They used both the textual and the visual content of a meme to identify whether it contains one or more of the 22 propaganda techniques defined in their work. Their analysis of the dataset showed that, while propaganda is not always factually false or harmful, most memes are used to damage the reputation of a person or a group of people. [Glenski et al., 2019] explored multilingual multimodal content for deception detection, aiming to distinguish between disinformation, propaganda, conspiracy, hoaxes, and clickbait.
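The finding that simply augmenting text with image embeddings helps [Yang et al., 2019a] can be illustrated with a small concatenation model. The sketch below pairs a BERT text encoder with a ResNet-18 image encoder and a linear head; the specific backbones and dimensions are our own assumptions and do not reproduce any cited system.

```python
# Sketch: augment a text classifier with an image embedding for
# hateful-meme-style classification (illustrative baseline only).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from torchvision import models

class TextImageClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop fc
        self.head = nn.Linear(768 + 512, n_classes)

    def forward(self, texts, images):
        # texts: list of meme captions; images: (batch, 3, 224, 224), normalized.
        tokens = self.tokenizer(texts, padding=True, truncation=True,
                                return_tensors="pt")
        text_emb = self.text_encoder(**tokens).last_hidden_state[:, 0]  # [CLS]
        img_emb = self.image_encoder(images).flatten(1)                 # (batch, 512)
        return self.head(torch.cat([text_emb, img_emb], dim=-1))        # logits
```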
3.2 Speech

Cues in spoken content can represent harmful behaviors, and those cues can be used to automatically detect such content. Due to the lack of data, studies using the speech-only modality are comparatively fewer than for other modalities, even though speech has a major role in many contexts. For example, for detecting violent content such as screaming and gunshots, the speech modality can play a significant role that other modalities might not be able to offer. This is important, as journalists often post user-generated content without verifying it, which can lead to serious consequences. [Giannakopoulos, 2009] studied audio segmentation approaches for segmenting violent (e.g., gunshots, screams) and non-violent (e.g., music, speech) content in movies. The studies related to violent content detection using acoustic features also include [Esra et al, 2013], where the focus was finding violent content in movies. [Junwei et al., 2017] proposed Localized Self-Paced Reranking (LSPaR) for detecting gunshots and explosions in videos using acoustic features. [Soni and Singh, 2018] investigated audio, visual, and textual features for cyberbullying detection. Their findings suggest that audio and visual features are associated with the occurrence of cyberbullying, and that both of these features complement textual features.

3.3 Video

There are multiple studies on detecting cyberbullying in video-based social networks such as Vine [Rafiq et al., 2015] and YouTube [Dadvar and Eckert, 2018]. These studies show that, although the percentage of cyberbullying in video sessions is quite low, automatic detection of these types of content is very challenging. [Wang et al., 2020a] utilized textual, visual, and other meta-information to detect social media posts with bullying topics. Their proposed method was evaluated on publicly available multimodal cyberbullying datasets. [Abd Kadir et al., 2016] investigated the relationship between emotion and propaganda in YouTube videos. Their findings suggest that propaganda techniques in YouTube videos affect emotional responses.

Content (e.g., YouTube videos) can also be attacked by hateful users who post hateful comments through a coordinated effort. [Mariconti et al., 2019] investigated whether a video is likely to be attacked, using different modalities such as metadata, audio transcripts, and thumbnails.

There has been a recent interest from different government agencies to stop the spread of violent content. [Constantin et al., 2020] developed a multimodal dataset, which consists of more than 96 hours of Hollywood and YouTube videos with high variability of content. Their study suggests that multimodal approaches with audio and image perform better than other settings.

3.4 Network and Temporal Information

The use of network data for predicting factuality was motivated by results showing different propagation patterns for fake and real content. Such results are lacking for harmful content. However, the aim to harm in social media is often pursued via coordinated actions, for instance by groups of users (e.g., social bots and trolls [Cresci, 2020]) that target certain characters or minorities. These collaborative harmful actions, perpetrated to increase the efficacy of the harm, make it possible to exploit networks for detecting harmful campaigns.

[Chatzakou et al., 2019] focused on detecting cyberbullying and cyberaggression by training models for detecting (i) bullies, (ii) aggressors, (iii) spammers, and (iv) normal users on Twitter. To solve these tasks, they leveraged a combination of 38 features extracted from user profiles, the textual content of their posts, and network information (e.g., user degree and centrality measures in the social graph).

Orthogonally and synergically to the detection of disinformation, scholars have recently focused on the novel task of detecting Coordinated Inauthentic Behavior (CIB) [Nizzoli et al., 2020]. CIB is defined as coordinated activities that aim to mislead and manipulate others (see http://about.fb.com/news/2018/12/inside-feed-coordinated-inauthentic-behavior/). The threat of CIB does not necessarily stem from the content being spread – which may well be true and factual – but rather from the shared intent to mislead and manipulate, which may cause harm. Detecting CIB typically involves analyzing interaction networks to detect suspicious coordination, as well as the coordinated users and the content they shared, to detect inauthentic users and harmful content [Nizzoli et al., 2020]. Given the importance of coordination in CIB, analyses typically start from the available network data by applying community detection algorithms, and subsequently move to the analysis of textual data (e.g., social media posts or news articles). [Weber and Neumann, 2020] also considered the timings of actions to detect coordination, thus leveraging both the network and the temporal data modalities. Despite the promising initial results, the detection of CIB is still in its infancy, and thus additional data modalities such as images and audio are yet to be studied.
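As a simplified illustration of such a coordination-detection pipeline (not the method of any specific paper cited here), the sketch below links users who act on the same item within a short time window, accumulates these co-actions into a user-user graph, and returns densely co-acting groups; the thresholds are illustrative assumptions.

```python
# Sketch: surface candidate coordinated groups from co-timed actions.
from collections import defaultdict
from itertools import combinations
import networkx as nx

def coordination_graph(actions, window_seconds=60):
    """actions: iterable of (user_id, item_id, unix_timestamp) tuples, e.g.,
    retweets. Users acting on the same item within the window get an edge
    whose weight counts such co-occurrences."""
    per_item = defaultdict(list)
    for user, item, ts in actions:
        per_item[item].append((ts, user))
    graph = nx.Graph()
    for events in per_item.values():
        events.sort()
        for (t1, u1), (t2, u2) in combinations(events, 2):
            if u1 != u2 and abs(t2 - t1) <= window_seconds:
                if graph.has_edge(u1, u2):
                    graph[u1][u2]["weight"] += 1
                else:
                    graph.add_edge(u1, u2, weight=1)
    return graph

def candidate_coordinated_groups(graph, min_weight=3, min_size=3):
    # Keep only strongly repeated co-actions, then take connected components
    # (a community detection algorithm could be used here instead).
    strong = nx.Graph()
    strong.add_edges_from((u, v, d) for u, v, d in graph.edges(data=True)
                          if d["weight"] >= min_weight)
    return [c for c in nx.connected_components(strong) if len(c) >= min_size]
```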
4 Modeling Techniques

In this section, we discuss modeling techniques for both factuality and harmfulness.

Several approaches have been proposed to combine multiple modalities: (i) early fusion, where low-level features from different modalities are learned, fused, and fed into a single prediction model [Jin et al., 2017b; Yang et al., 2018; Zhang et al., 2019; Singhal et al., 2019; Zhou et al., 2020; Kang et al., 2020]; (ii) late fusion, where unimodal decisions are fused with mechanisms such as averaging and voting [Agrawal et al., 2017; Qi et al., 2019]; and (iii) hybrid fusion, where a subset of the learned features is passed to the final classifier (early fusion) and the remaining modalities are fed to the classifier later (late fusion) [Jin et al., 2017a]. Within these fusion strategies, the learning setup can be further divided into unsupervised, semi-supervised, supervised, and self-supervised methods.
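A minimal sketch of the first two strategies, assuming precomputed per-modality feature vectors, is given below; it illustrates only the architectural difference and does not reproduce any cited model.

```python
# Minimal early- vs. late-fusion classifiers over precomputed text/image features.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, text_dim, image_dim, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, text_feat, image_feat):
        # Concatenate modality features, then predict with a single model.
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

class LateFusion(nn.Module):
    def __init__(self, text_dim, image_dim, n_classes=2):
        super().__init__()
        self.text_clf = nn.Linear(text_dim, n_classes)
        self.image_clf = nn.Linear(image_dim, n_classes)

    def forward(self, text_feat, image_feat):
        # Average the unimodal class probabilities (soft voting).
        p_text = self.text_clf(text_feat).softmax(dim=-1)
        p_image = self.image_clf(image_feat).softmax(dim=-1)
        return (p_text + p_image) / 2

text_feat, image_feat = torch.randn(4, 768), torch.randn(4, 512)
early_logits = EarlyFusion(768, 512)(text_feat, image_feat)
late_probs = LateFusion(768, 512)(text_feat, image_feat)
```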
[Dimitrov et al., 2021] investigated different fusion strategies (e.g., early and late fusion, and self-supervised models) for propaganda detection. For the different fusion strategies, the authors used VisualBERT [Li et al., 2019], MMBT [Kiela et al., 2019], and ViLBERT [Lu et al., 2019]. Their findings suggest that self-supervised joint learning models such as MMBT, ViLBERT, and VisualBERT perform better, in increasing order, compared to the other fusion methods. As part of the "Hateful Memes Challenge" to classify hateful vs. not-hateful memes, several such models have also been investigated [Kiela et al., 2020]. In addition, Kiela et al. also experimented with other models, such as the Gated Multimodal Unit (GMU) [Arevalo et al., 2017] and ConcatBERT [Kiela et al., 2020]. These models learn individual and non-overlapping training objectives for each modality.

Attempts to design unsupervised models are limited. [Muller-Budack et al., 2020] introduced the Cross-modal Consistency Verification Tool (CCVT) to check the coherence between images and their associated texts. [Yang et al., 2019b] defined the trust of news and the credibility of the users who spread the news, and used Bayesian learning to iteratively update these quantities; news with low trustworthiness is returned as fake news. [Gangireddy et al., 2020] proposed GTUT, a graph-based approach that exploits the underlying bipartite network of users and news articles to detect dense communities of fake news and fraudulent users.

Due to the scarcity of labeled data, a few studies attempted to design semi-supervised methods that leverage an ample amount of unlabeled data. [Helmstetter and Paulheim, 2018; Gravanis et al., 2019] presented weak supervision for fake news detection. [Guacho et al., 2018] presented a tensor-decomposition-based semi-supervised method for fake content detection. [Dong et al., 2020] developed a deep semi-supervised model via two-path learning (one path utilizes the limited labeled data, the other path explores the unlabeled data) for timely fake news detection. [Paka et al., 2021b] presented Cross-SEAN, a cross-stitch-based semi-supervised end-to-end neural attention model for COVID-19 fake news detection. They further extended it by combining exogenous and endogenous signals with a semi-supervised co-attention network for early detection of fake news [Paka et al., 2021a].

Within the supervised learning setup, two other types of learning methods have also been explored for disinformation detection: adversarial learning and autoencoder-based learning. Adversarial learning models for fake news detection include EANN [Wang et al., 2018], an event-adversarial neural network to detect emerging and time-critical fake news, and SAME [Cui et al., 2019], a sentiment-aware multimodal embedding method which, along with multiple modalities, leverages the sentiment expressed by readers in their comments. [Qi et al., 2019] proposed MAVE, a multimodal variational autoencoder that jointly encodes text and images and uses the intermediate representation for fake news detection.
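None of the systems above works exactly this way, but the generic semi-supervised recipe they build on can be illustrated with simple self-training over pseudo-labels, assuming precomputed feature vectors:

```python
# Sketch: self-training with pseudo-labels as a generic semi-supervised recipe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(x_labeled, y_labeled, x_unlabeled, rounds=3, threshold=0.9):
    """x_labeled, x_unlabeled: (n, d) feature arrays; y_labeled: (n,) labels."""
    x_train, y_train = x_labeled.copy(), y_labeled.copy()
    clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    pool = x_unlabeled.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break
        # Add confidently pseudo-labeled examples to the training set and retrain.
        pseudo_labels = clf.classes_[probs[confident].argmax(axis=1)]
        x_train = np.vstack([x_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo_labels])
        pool = pool[~confident]
        clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return clf
```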
5 Lessons Learned

1. A lot of progress has been made on the problem, but the two components in the definition of disinformation (falseness and harmfulness) have been considered in isolation. There is a need to combine factuality and intentional harmfulness in the detection model.

2. Most multimodal datasets cover just two modalities, or put modalities together mechanically. Moreover, no multimodal dataset looks at both aspects of disinformation: factuality and harmfulness.

3. In the early phase of (dis)information spreading, user and content features are those that provide the highest contribution for detecting factuality. Indeed, at that time, few interactions with the content are available and the propagation network is small and sparse. As information spreads, the contribution of content-derived features remains constant, while propagation-derived features become richer and more informative. In summary, early prediction of factuality and veracity must necessarily rely heavily on users and content – be it text, image, video, or audio. Instead, analyses carried out at later times benefit more from network propagation and temporal data.

6 Major Challenges

Recently, several initiatives were undertaken by major companies and government entities to combat disinformation in social media (e.g., https://digi.org.au/disinformation-code/). However, automatic detection of misleading and harmful content poses diverse challenges. Below, we offer a non-exhaustive list of what is to be addressed in the future.

1. Models Combining Multiple Modalities. The major challenge is to devise a mechanism to combine multiple modalities in a systematic way, so that one modality complements the others. Current research primarily adopts fusion mechanisms, which might not bring out the complementarity across modalities.

2. Datasets. There is a need for truly multimodal datasets that combine multiple modalities such as text, image, speech, video, temporal information, user profiles, and network structure. In addition, we also need to focus on datasets from multiple platforms (e.g., news, and posts from Twitter, Reddit, Gab, and Instagram), as different data sources have different styles and focus on diverse topics. For example, Gab promotes the free flow of information, which can hinder content moderation.

3. Contextualization. Existing methods for disinformation detection are mostly non-contextualized, i.e., the broader context of a news item, in terms of the responses of the readers and how users perceive them, is not captured. We argue that the response thread under a news item, the underlying social network among users, and the propagation dynamics of the news and its mentions across social media need suitable integration to better capture the overall perspective on the news.

4. Meta Information. Along with the news and its context, other information, such as the authenticity of the news, the credibility of its authors, and the factuality of the news, also plays an important role in disinformation detection. Moreover, knowing whether the disinformation attack is a coordinated effort or an individual activity would also help in understanding its severity.

5. Fine-grained Detection. Existing disinformation detection models are mostly binary classifiers – given a news item, they detect whether it is disinformation or not. We argue that such systems may not be enough to provide actionable results. Therefore, rather than a binary classification, one could cast the problem as a multi-class classification task or even as a regression task. This would also help prioritize disinformation for reactive measures.

6. Bias, Region, and Cultural Awareness. The performance of most existing systems is limited to the underlying dataset, particularly to its demography and underlying cultural aspects. For instance, a model trained on an Indian political dataset may not generalize to a US health-related dataset [Fortuna et al., 2021].

7. Transparent and Accountable Models. The detection models should be designed in such a way that their outcomes are unbiased and more accountable to ethical considerations.

8. Disinformation on Evolving Topics. Often, claims or harmful content are disseminated based on current events; information about COVID-19 and its vaccines is an example of such a use case. Existing models might fail on such use cases; zero-shot or few-shot learning might be a future avenue to explore.

9. Common-sense Reasoning. Models for disinformation detection should present their outcome in such a way that a practitioner can interpret it and understand why a piece of information is flagged as disinformation, what the related real news is on which the judgement was based, which part of the information was counterfeit, etc. There is also a lack of datasets containing disinformation with explanations and the corresponding real information.

10. Personalization. The way a news item is perceived is subjective. Therefore, a disinformation detection system should adhere to the preferences of individual readers. We posit that the detection system should be personalized, from the level of an individual person up to the level of a group or community.
7 Future Forecasting

Based on the challenges mentioned in the previous section, we forecast the following research directions:

1. Explainability. Model interpretation remains largely unexplored. This can be addressed in future studies to understand the general capability of the models. Providing evidence of why certain claims are false is also important. Current methods for disinformation detection lack such capability, and, at present, justifications are written manually by human fact-checkers.

2. Beyond Content and Network Signals. Current state-of-the-art solutions for multimodal factuality prediction and harmful content detection are primarily based on content signals and network structure. However, the information in these signals is limited and does not include personal preferences or cultural desires. In the near future, we envision multimodal techniques for disinformation detection that would go beyond content and network signals and would include signals such as common sense and information about the user. Moreover, multimodal models will become larger, with more heterogeneous signals as input, and they will be pre-trained on a wider variety of tasks to cover both aspects of disinformation: factuality and harmfulness.

8 Conclusion

We surveyed the state of the art in multimodal disinformation detection based on prior work on different modalities, focusing on disinformation, i.e., information that is both false and intended to do harm. We argued for the need to tackle disinformation detection by taking into account multiple modalities, as well as both factuality and harmfulness, in the same framework. Finally, we discussed current challenges and forecast some future research directions.

References

[Abd Kadir et al., 2016] Shamsiah Abd Kadir et al. Emotion and techniques of propaganda in videos. In IJST, 2016.
[Agrawal et al., 2017] Taruna Agrawal et al. Multimodal detection of fake social media use through a fusion of classification and pairwise ranking systems. In EUSIPCO, 2017.
[Arevalo et al., 2017] John Arevalo et al. Gated multimodal units for information fusion. In ICLR, 2017.
[Boididou et al., 2016] Christina Boididou et al. Verifying multimedia use at MediaEval 2016. MediaEval, 2016.
[Bozarth and Budak, 2020] Lia Bozarth and Ceren Budak. Toward a better performance evaluation framework for fake news classification. In ICWSM, 2020.
[Cardoso Durier da Silva et al., 2019] Fernando Cardoso Durier da Silva et al. Can machines learn to detect fake news? A survey focused on social media. In HICSS, 2019.
[Chatzakou et al., 2019] Despoina Chatzakou et al. Detecting cyberbullying and cyberaggression in social media. TWEB, 2019.
[Constantin et al., 2020] M. G. Constantin et al. Affect in multimedia: Benchmarking violent scenes detection. TAC, 2020.
[Cresci, 2020] Stefano Cresci. A decade of social bot detection. CACM, 2020.
[Cui et al., 2019] Limeng Cui et al. SAME: Sentiment-aware multimodal embedding for detecting fake news. In ASONAM, 2019.
[Da San Martino et al., 2020] Giovanni Da San Martino et al. A survey on computational propaganda detection. In IJCAI, 2020.
[Dadvar and Eckert, 2018] Maral Dadvar and Kai Eckert. Cyberbullying detection in social networks using deep learning based models; a reproducibility study. arXiv:1812.08046, 2018.
[Das et al., 2020] Abhishek Das et al. Detecting hate speech in multi-modal memes. arXiv:2012.14891, 2020.
[Dimitrov et al., 2021] Dimiter Dimitrov et al. Task 6 at SemEval-2021: Detection of persuasion techniques in texts and images. In SemEval, 2021.
[Dinkov et al., 2019] Yoan Dinkov et al. Predicting the leading political ideology of YouTube channels using acoustic, textual, and metadata information. In Interspeech, 2019.
[Dong et al., 2020] Xishuang Dong et al. Two-path deep semi-supervised learning for timely fake news detection. TCSS, 2020.
[Esra et al, 2013] Acar Esra et al. Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues. In MM, 2013.
[Fortuna and Nunes, 2018] Paula Fortuna and Sergio Nunes. A survey on automatic detection of hate speech in text. CSUR, 2018.
[Fortuna et al., 2021] Paula Fortuna et al. How well do hate speech, toxicity, abusive and offensive language classification models generalize across datasets? IPM, 2021.
[Gangireddy et al., 2020] Siva Charan Reddy Gangireddy et al. Unsupervised fake news detection: A graph-based approach. In HT, 2020.
[Garimella and Eckles, 2020] Kiran Garimella and Dean Eckles. Images and misinformation in political groups: Evidence from WhatsApp in India. arXiv:2005.09784, 2020.
[Giannakopoulos, 2009] Theodoros Giannakopoulos. Study and application of acoustic information for the detection of harmful content, and fusion with visual information. University of Athens, 2009.
[Glenski et al., 2019] Maria Glenski et al. Multilingual multimodal digital deception detection and disinformation spread across social platforms. arXiv:1909.05838, 2019.
[Gomez et al., 2020] Raul Gomez et al. Exploring hate speech detection in multimodal publications. In WACV, 2020.
[Gravanis et al., 2019] Georgios Gravanis et al. Behind the cues: A benchmarking study for fake news detection. Expert Systems with Applications, 2019.
[Guacho et al., 2018] Gisel Bastidas Guacho et al. Semi-supervised content-based detection of misinformation via tensor embeddings. In ASONAM, 2018.
[Gupta et al., 2013] Aditi Gupta et al. Faking Sandy: Characterizing and identifying fake images on Twitter during hurricane Sandy. In WWW, 2013.
[Hameleers et al., 2020] Michael Hameleers et al. A picture paints a thousand lies? The effects and mechanisms of multimodal disinformation and rebuttals disseminated via social media. Pol. Comm., 2020.
[Helmstetter and Paulheim, 2018] Stefan Helmstetter and Heiko Paulheim. Weakly supervised learning for fake news detection on Twitter. In ASONAM, 2018.
[Hirschberg et al., 2005] Julia Hirschberg et al. Distinguishing deceptive from non-deceptive speech. In Interspeech, 2005.
[Hosseinmardi et al., 2015] Homa Hosseinmardi et al. Detection of cyberbullying incidents on the Instagram social network. arXiv:1503.03909, 2015.
[Islam et al., 2020] Md Rafiqul Islam et al. Deep learning for misinformation detection on online social networks: a survey and new perspectives. SNAM, 2020.
[Jin et al., 2017a] Zhiwei Jin et al. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In MM, 2017.
[Jin et al., 2017b] Zhiwei Jin et al. Novel visual and statistical image features for microblogs news verification. TMM, 2017.
[Junwei et al., 2017] Liang Junwei et al. Temporal localization of audio events for conflict monitoring in social media. In ICASSP, 2017.
[Kang et al., 2020] SeongKu Kang et al. Multi-modal component embedding for fake news detection. In IMCOM, 2020.
[Kaya and Karpov, 2016] Heysem Kaya and Alexey A. Karpov. Fusing acoustic feature representations for computational paralinguistics tasks. In Interspeech, 2016.
[Kiela et al., 2019] Douwe Kiela et al. Supervised multimodal bitransformers for classifying images and text. arXiv:1909.02950, 2019.
[Kiela et al., 2020] Douwe Kiela et al. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv:2005.04790, 2020.
[Kopev et al., 2019] Daniel Kopev et al. Detecting deception in political debates using acoustic and textual features. In ASRU, 2019.
[Kotonya and Toni, 2020] Neema Kotonya and Francesca Toni. Explainable automated fact-checking: A survey. In COLING, 2020.
[Krishnamurthy et al., 2018] Gangeshwar Krishnamurthy et al. A deep learning approach for multimodal deception detection. arXiv:1803.00344, 2018.
[Kumar and Shah, 2018] Srijan Kumar and Neil Shah. False information on web and social media: A survey. arXiv:1804.08559, 2018.
[Kwon et al., 2017] Sejeong Kwon et al. Rumor detection over varying time windows. PLoS ONE, 2017.
[Levitan et al., 2016] Sarah Ita Levitan et al. Combining acoustic-prosodic, lexical, and phonotactic features for automatic deception detection. In Interspeech, 2016.
[Li and Xie, 2020] Yiyi Li and Ying Xie. Is a picture worth a thousand words? An empirical study of image content and social media engagement. JMR, 2020.
[Li et al., 2016] Yaliang Li et al. A survey on truth discovery. In SIGKDD, 2016.
[Li et al., 2019] L. Li et al. VisualBERT: A simple and performant baseline for vision and language. arXiv:1908.03557, 2019.
[Liu and Wu, 2018] Yang Liu and Yi-Fang Wu. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In AAAI, 2018.
[Lu et al., 2019] Jiasen Lu et al. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265, 2019.
[Mariconti et al., 2019] Enrico Mariconti et al. "You Know What to Do": Proactive detection of YouTube videos targeted by coordinated hate attacks. HCI (CSCW), 2019.
[Muller-Budack et al., 2020] Eric Muller-Budack et al. Multimodal analytics for real-world news using measures of cross-modal entity consistency. arXiv:2003.10421, 2020.
[Nakamura et al., 2020] Kai Nakamura et al. r/Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection. In LREC, 2020.
[Nizzoli et al., 2020] Leonardo Nizzoli et al. Coordinated behavior on social media in 2019 UK general election. arXiv:2008.08370, 2020.
[Paka et al., 2021a] W. Paka et al. Combining exogenous and endogenous signals with a semi-supervised co-attention network for early detection of COVID-19 fake tweets. In PAKDD, 2021.
[Paka et al., 2021b] William Scott Paka et al. Cross-SEAN: A cross-stitch semi-supervised neural attention model for COVID-19 fake news detection. arXiv:2102.08924, 2021.
[Perez-Rosas et al., 2015] Veronica Perez-Rosas et al. Deception detection using real-life trial data. In ICMI, 2015.
[Qi et al., 2019] Peng Qi et al. Exploiting multi-domain visual information for fake news detection. arXiv:1908.04472, 2019.
[Rafiq et al., 2015] Rahat Ibn Rafiq et al. Careful what you share in six seconds: Detecting cyberbullying instances in Vine. In ASONAM, 2015.
[Ribeiro et al., 2018] Manoel Ribeiro et al. Characterizing and detecting hateful users on Twitter. In ICWSM, 2018.
[Schmidt and Wiegand, 2017] Anna Schmidt and Michael Wiegand. A survey on hate speech detection using natural language processing. In SocialNLP, 2017.
[Seo, 2014] Hyunjin Seo. Visual propaganda in the age of social media: An empirical analysis of Twitter images during the 2012 Israeli-Hamas conflict. Visual Communication Quarterly, 2014.
[Shaar et al., 2020] Shaden Shaar et al. That is a known lie: Detecting previously fact-checked claims. In ACL, 2020.
[Shao et al., 2018] Chengcheng Shao et al. The spread of low-credibility content by social bots. Nature Communications, 2018.
[Shu et al., 2019] Kai Shu et al. Beyond news contents: The role of social context for fake news detection. In WSDM, 2019.
[Shu et al., 2020] Kai Shu et al. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In ICWSM, 2020.
[Singhal et al., 2019] Shivangi Singhal et al. SpotFake: A multi-modal framework for fake news detection. In BigMM, 2019.
[Soldner et al., 2019] Felix Soldner et al. Box of lies: Multimodal deception detection in dialogues. In NAACL-HLT, 2019.
[Soni and Singh, 2018] Devin Soni and Vivek K. Singh. See no evil, hear no evil: Audio-visual-textual cyberbullying detection. HCI (CSCW), 2018.
[Van-Hoang et al., 2020] Nguyen Van-Hoang et al. FANG: Leveraging social context for fake news detection using graph representation. In CIKM, 2020.
[Vijayaraghavan et al., 2019] Prashanth Vijayaraghavan et al. Interpretable multi-modal hate speech detection. In ICML, 2019.
[Volkova et al., 2019] Svitlana Volkova et al. Explaining multimodal deceptive news prediction models. In ICWSM, 2019.
[Vosoughi et al., 2017] Soroush Vosoughi et al. Rumor gauge: Predicting the veracity of rumors on Twitter. TKDD, 2017.
[Vosoughi et al., 2018] Soroush Vosoughi et al. The spread of true and false news online. Science, 2018.
[Wang et al., 2018] Yaqing Wang et al. EANN: Event adversarial neural networks for multi-modal fake news detection. In SIGKDD, 2018.
[Wang et al., 2020a] Kaige Wang et al. Multi-modal cyberbullying detection on social networks. In IJCNN, 2020.
[Wang et al., 2020b] Yuping Wang et al. Understanding the use of fauxtography on social media. arXiv:2009.11792, 2020.
[Waseem et al., 2017] Zeerak Waseem et al. Understanding abuse: A typology of abusive language detection subtasks. In ACL Workshops, 2017.
[Weber and Neumann, 2020] Derek Weber and Frank Neumann. Who's in the gang? Revealing coordinating communities in social media. arXiv:2010.08180, 2020.
[Yang et al., 2018] Yang Yang et al. TI-CNN: Convolutional neural networks for fake news detection. arXiv:1806.00749, 2018.
[Yang et al., 2019a] Fan Yang et al. Exploring deep multimodal fusion of text and photo for hate speech classification. In ACL Workshops, 2019.
[Yang et al., 2019b] Shuo Yang et al. Unsupervised fake news detection on social media: A generative approach. In AAAI, 2019.
[Zannettou et al., 2018] Savvas Zannettou et al. On the origins of memes by means of fringe web communities. In IMC, 2018.
[Zhang et al., 2018] Daniel Yue Zhang et al. FauxBuster: A content-free fauxtography detector using social media comments. In BigData, 2018.
[Zhang et al., 2019] Huaiwen Zhang et al. Multi-modal knowledge-aware event memory network for social media rumor detection. In MM, 2019.
[Zhou and Zafarani, 2019] X. Zhou and R. Zafarani. Network-based fake news detection: A pattern-driven approach. SIGKDD, 2019.
[Zhou et al., 2020] Xinyi Zhou et al. SAFE: Similarity-aware multi-modal fake news detection. arXiv:2003.04981, 2020.
[Zlatkova et al., 2019] Dimitrina Zlatkova et al. Fact-checking meets fauxtography: Verifying claims about images. In EMNLP, 2019.