Clickbait Detection Using Multimodel Fusion and Transfer Learning Rajapaksha Waththe Vidanelage Praboda Chathurangani Rajapaksha

To cite this version:
Rajapaksha Waththe Vidanelage Praboda Chathurangani Rajapaksha. Clickbait detection using multimodel fusion and transfer learning. Social and Information Networks [cs.SI]. Institut Polytechnique de Paris, 2020. English. NNT: 2020IPPAS025. tel-03139880

HAL Id: tel-03139880
https://tel.archives-ouvertes.fr/tel-03139880
Submitted on 12 Feb 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Clickbait Detection using Multimodel Fusion and Transfer Learning

Doctoral thesis of the Institut Polytechnique de Paris, prepared at Télécom SudParis
Doctoral school no. 626: École Doctorale de l'Institut Polytechnique de Paris (ED IP Paris)
Doctoral specialty: Computer Science
NNT: 2020IPPAS025

Thesis presented and defended at Evry on 27/11/2020 by
RAJAPAKSHA WATHTHE VIDANELAGE PRABODA CHATHURANGANI RAJAPAKSHA

Composition of the jury:
Gareth Tyson, Associate Professor (Maître de Conférences), Queen Mary University of London, UK: President
Xiaoming Fu, Professor, University of Goettingen, Germany: Reviewer (Rapporteur)
Christophe Cerisara, Researcher, CNRS, France: Reviewer (Rapporteur)
Bruce Maggs, Professor, Duke University, USA: Examiner
Cecile Bothorel, Associate Professor (Maître de Conférences), IMT Atlantique, France: Examiner
Gareth Tyson, Associate Professor (Maître de Conférences), Queen Mary University of London, UK: Examiner
Noel Crespi, Professor, IMT, Telecom SudParis, France: Thesis director
Reza Farahbakhsh, Associate Professor (Maître de Conférences associé), IMT, Telecom SudParis, France: Thesis co-director

Title: Clickbait Detection using Multimodel Fusion and Transfer Learning

Keywords: Clickbait, Transfer Learning, BERT, XLNet, RoBERTa, Deep Learning, Sentiment Classification, Topic Detection, Originality Detection, Social Media, Facebook, Twitter, News Media

Abstract: Internet users are likely to fall victim to clickbait by assuming it is legitimate news. The notoriety of clickbait can be partially attributed to misinformation, as clickbait uses an attractive headline that is deceptive, misleading or sensationalized. A major type of clickbait takes the form of spam and advertisements that redirect users to web sites that sell products or services (often of dubious quality). Another common type of clickbait is designed to appear as news headlines and redirect readers to the publisher's online venue with the intent of earning revenue from page views, but this news can be deceptive, sensationalized and misleading. News media often use clickbait to propagate news using a headline that lacks the context needed to represent the article. Since news media exchange information by acting as both content providers and content consumers, misinformation that is deliberately created to mislead requires serious attention. Hence, an automated mechanism is required to estimate the likelihood of a news item being clickbait.

Predicting how clickbaity a given news item is remains difficult, as clickbaits are very short messages written in an obscure way. The main feature that can be used to identify clickbait is the gap between what is promised in the social media post and news headline and what is delivered by the article linked from them. Recent advances in Natural Language Processing (NLP) can be adapted to distinguish linguistic patterns and syntax among the social media post, the news headline and the news article. In my thesis, I propose two innovative approaches to explore clickbait generated by news media in social media. The contributions of my thesis are two-fold: 1) propose a multimodel fusion-based approach by incorporating deep learning and text mining techniques, and 2) adapt Transfer Learning (TL) models to investigate the efficacy of transformers for predicting clickbait content.

In the first contribution, the fusion model is built using three main features: similarity between post and headline, sentiment of the post and headline, and topical similarity between news article and post. The fusion model uses three different algorithms to generate an output for each of these features and fuses them at the output to generate the final classifier. In addition to implementing the fusion classifier, we conducted four extended experiments mainly focusing on news media in social media. The first experiment explores the content originality of a social media post by combining features extracted from the author's writing style and her online circadian rhythm. This originality detection approach is used to identify news dissemination patterns among the news media community on Facebook and Twitter by observing news originators and news consumers. For this experiment, the dataset was collected using crawlers we implemented on top of the Facebook Graph API and the Twitter streaming APIs. The next experiment explores flaming events on news media on Twitter using an improved sentiment classification model. The final experiment focuses on detecting topics discussed in a real-time meeting, with the aim of generating a brief summary at the end.

The second contribution is to adapt Transfer Learning models for the clickbait detection task. We evaluated the performance of three Transfer Learning models (BERT, XLNet and RoBERTa) and delivered a set of architectural changes to optimize these models. We believe that these three models are representative of most other Transfer Learning models in terms of their architectural properties (autoregressive vs. autoencoding) and training datasets. The experiments introduce advanced fine-tuning approaches to each model, such as layer pruning, attention pruning, weight pruning, model expansion and generalization. To the best of the author's knowledge, there have been very few attempts to use Transfer Learning models for clickbait detection, and no comparative analysis of multiple Transfer Learning models focused on this task.

French title: Détection de Clickbait utilisant Fusion Multimodale et Apprentissage par Transfert

French abstract (Résumé), translated: Almost all Internet users are likely to fall victim to clickbait, wrongly assuming that it is legitimate information. One major type of clickbait takes the form of spam and advertisements that are used to redirect users to web sites. Another type of clickbait is designed to make front-page headlines and redirect readers to the publishers' online sites, but this sensational news can be misleading. It is difficult to predict how clickbaity a given news item is, because clickbaits are very short messages that are often written in an obscure way. The main feature for identifying clickbait is the gap between what is expected from a post and the news headline, and the information actually present in the linked article. This thesis proposes two innovative approaches to explore clickbait generated by news media in social media. The contributions are 1) to propose a multimodel fusion-based approach incorporating deep learning and text mining techniques, and 2) to adapt transfer learning (TL) models to study the efficacy of transformers for […] social media post by fusing features extracted from the author's writing style and online circadian rhythm. This originality detection approach is used to identify news dissemination patterns among the news media community on Facebook and Twitter by observing news authors and news consumers. For this experiment, the dataset was collected using our crawlers implemented via the Facebook Graph API and the Twitter streaming APIs. The next experiment explores flaming events on news media on Twitter using an improved sentiment classification model. The final experiment focuses on detecting topics discussed in a real-time meeting, with the aim of generating a brief summary at the end.

The second contribution is to adapt Transfer Learning models for the clickbait detection task. We evaluated the performance of three transfer learning models (BERT, XLNet and RoBERTa) and provided a set of architectural modifications to optimize these models. We believe that […]
