
The Good, the Bad and the Bait: Detecting and Characterizing Clickbait on YouTube

Savvas Zannettou, Sotirios Chatzis, Kostantinos Papadamou, Michael Sirivianos Department of Electrical Engineering, Computer Engineering and Informatics Cyprus University of Technology Limassol, Cyprus [email protected], {sotirios.chatzis,michael.sirivianos}@cut.ac.cy, [email protected]

Abstract—The use of deceptive techniques in user-generated video portals is ubiquitous. Unscrupulous uploaders deliberately mislabel video descriptors aiming at increasing their views and, subsequently, their ad revenue. This problem, usually referred to as "clickbait," may severely undermine user experience. In this work, we study the clickbait problem on YouTube by collecting metadata for 206k videos. To address it, we devise a deep learning model based on variational autoencoders that supports the diverse modalities of data that videos include. The proposed model relies on a limited amount of manually labeled data to classify a large corpus of unlabeled data. Our evaluation indicates that the proposed model offers improved performance when compared to other conventional models. Our analysis of the collected data indicates that YouTube's recommendation engine does not take clickbait into account. Thus, it is susceptible to recommending misleading videos to users.

[Fig. 1: Comments that were found in clickbait videos. The users' frustration is apparent (we omit users' names for ethical reasons).]

I. INTRODUCTION

Recently, YouTube surpassed cable TV in terms of popularity among teenagers [28]. This is because YouTube offers a vast amount of videos, which are always available on demand. However, because videos are generated by the users of the platform, known as YouTubers, a plethora of them are of dubious quality. The ultimate goal of YouTubers is to increase their ad revenue by ensuring that their content gets viewed by millions of users. Several YouTubers deliberately employ techniques that aim to deceive viewers into clicking their videos. These techniques include: (i) the use of eye-catching thumbnails, such as depictions of abnormal objects or attractive adults, which are often irrelevant to the video content; (ii) the use of headlines that aim to intrigue viewers; and (iii) the encapsulation of false information in the headline, the thumbnail, or the video content. We refer to videos that employ such techniques as clickbait. The continuous exposure of users to clickbait causes frustration and a degraded user experience (see Fig. 1).

The clickbait problem is essentially a peculiar form of the well-known spam problem [38], [37], [36], [17], [20]. In spam, malicious users try to deceive users by sending them misleading messages, mainly to advertise websites or to perform attacks (e.g., phishing) by redirecting users to malicious websites. Nowadays, the spam problem is not as prevalent as it was a few years ago, due to the deployment of systems that diminish it. Furthermore, users have an increased awareness of typical spam content (e.g., spam emails) and can effortlessly discern it. However, this is not the case for clickbait, which usually contains hidden false or ambiguous information that users or systems might not be able to perceive.

Recently, the aggravation of the problem has drawn broader public attention to clickbait. For instance, Facebook aims at removing clickbait from its newsfeed [27], [14]. In this work, we focus on YouTube for two reasons: (i) anecdotal evidence suggests that the problem exists on YouTube [8]; and (ii) to the best of our knowledge, YouTube relies on users to flag suspicious videos and then reviews them manually, an approach that is deemed inefficient. Hence, the need for an automated approach that minimizes human intervention is indisputable. To attain this goal, we leverage recent advances in the field of Deep Learning [22] by devising a novel formulation of variational autoencoders (VAEs) [21], [23] that fuses the different modalities of a YouTube video and infers latent correlations between them.

The proposed model infers a latent variable vector for each video that encodes a high-level representation of the content and the correlations between the various modalities. The significance of learning to compute such concise representations is that: (i) this learning procedure can be robustly performed by leveraging large unlabeled data corpora; and (ii) the obtained representations can be subsequently utilized to drive the classification process, with very limited requirements in labeled data.
To this end, we formulate the encoder part of the devised VAE model as a 2-component finite mixture model [10]. That is, we consider a set of alternative encoder models that may generate the data pertaining to each video. The decision of which specific encoder (corresponding to one possible class) generates each video is obtained via a trainable probabilistic gating network [29]; this constitutes an integral part of the developed autoencoder. The whole model is trained in an end-to-end fashion, using all the available training data, both the unlabeled examples and the few labeled ones. The latter are specifically useful for appropriately fitting the postulated gating network, which infers the posterior distribution of mixture component (and corresponding class) allocation.

Contributions. We propose a deep generative model that allows for combining data from modalities as diverse as the video headline text, the thumbnail image, and the tags text, as well as various numerical statistics, including statistics derived from comments. Most importantly, the proposed model allows for successfully addressing the problem of learning from limited labeled samples and numerous unlabeled ones (semi-supervised learning). This is achieved by postulating a deep variational autoencoder that employs a finite mixture model as its encoder. In this context, mixture component assignment is regulated via an appropriate gating network; this also constitutes the eventually obtained classification mechanism of our deep learning system. Finally, we provide a large-scale analysis of YouTube; we show that, with respect to the collected data, its recommendation engine does not consider how misleading a recommended video is. Hence, it is susceptible to recommending clickbait videos to its users.
II. METHODOLOGY

By leveraging YouTube's Data API, between August and November of 2016, we collected metadata of videos published between 2005 and 2016. Specifically, we collected the following data descriptors for 206k videos: (i) basic details like headline, tags, etc.; (ii) thumbnail; (iii) comments from users; (iv) statistics (e.g., views, likes, etc.); and (v) related videos, based on YouTube's recommendation system. We started our retrieval from a popular (108M views) clickbait video [1] and iteratively collected all the related videos recommended by YouTube. Note that this approach enables us to study interesting aspects of the problem by constructing a graph that captures the relations (recommendations) between videos; a sketch of this crawl is given below.

To get labeled data, we opted for two different approaches. First, we manually reviewed a small subset of the collected data by inspecting the headline, the thumbnail, comments from users, and the video content. Specifically, we watched the whole video and compared it to the thumbnail and headline. A video is considered clickbait only if the thumbnail and headline deviate substantially from its content. However, this task is both cumbersome and time consuming; thus, we elected to retrieve more labeled data in a second way. To this end, we compiled a list of channels (available at [2]) that habitually employ clickbait techniques and channels that do not. To obtain this list, we used a pragmatic approach: we found channels that are outed by other users as clickbait channels. For each channel, we retrieved up to 500 videos, hence creating a larger labeled dataset.
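For illustration, a minimal sketch of such a snowball crawl follows. The endpoint and parameter names reflect the public YouTube Data API v3 as documented around the time of the study (e.g., search.list with the relatedToVideoId parameter, which has since been deprecated); the API key handling, quota management, and helper names are placeholders, not our actual crawler.

```python
# Hypothetical sketch of the breadth-first crawl over related videos.
# API_KEY is a placeholder; relatedToVideoId reflects the v3 API at the
# time of the study and may no longer be available.
import collections
import requests

API_KEY = "YOUR_API_KEY"
SEED_ID = "W2WgTE9OKyg"   # the popular clickbait seed video [1]
BASE = "https://www.googleapis.com/youtube/v3"

def get_related_video_ids(video_id, max_results=25):
    """Return IDs recommended as 'related' to video_id (assumption: search.list)."""
    r = requests.get(f"{BASE}/search", params={
        "part": "id", "type": "video", "relatedToVideoId": video_id,
        "maxResults": max_results, "key": API_KEY})
    r.raise_for_status()
    return [item["id"]["videoId"] for item in r.json().get("items", [])]

def get_metadata(video_id):
    """Fetch headline, tags, thumbnail URL and statistics for one video."""
    r = requests.get(f"{BASE}/videos", params={
        "part": "snippet,statistics", "id": video_id, "key": API_KEY})
    r.raise_for_status()
    item = r.json()["items"][0]
    return {"id": video_id,
            "headline": item["snippet"]["title"],
            "tags": item["snippet"].get("tags", []),
            "thumbnail": item["snippet"]["thumbnails"]["high"]["url"],
            "statistics": item["statistics"]}

def crawl(seed_id=SEED_ID, limit=206_000):
    """BFS over the recommendation graph; edges feed the graph analysis of Sec. II-A."""
    seen, queue = {seed_id}, collections.deque([seed_id])
    corpus, edges = [], []
    while queue and len(corpus) < limit:
        vid = queue.popleft()
        corpus.append(get_metadata(vid))
        for rel in get_related_video_ids(vid):
            edges.append((vid, rel))
            if rel not in seen:
                seen.add(rel)
                queue.append(rel)
    return corpus, edges
```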
The overall labeled dataset consists of: (i) 1,071 clickbait and 955 non-clickbait videos obtained from the manual review process; and (ii) 8,999 clickbait and 8,470 non-clickbait videos obtained from the curated list of channels. The importance of this dataset is two-fold: it allows us to study the problem, and it is instrumental for training our deep learning model.

A. Manually Reviewed Ground Truth Dataset Analysis

In order to better grasp the problem, we perform a comparative analysis of the manually reviewed ground truth.

Category. Table I reports the categories we find in the videos. In total, we find 15 categories, but for brevity we only show the top five in terms of count. We observe that most clickbait videos belong to the Entertainment and Comedy categories, whereas non-clickbait videos are prevalent in the Sports category. This indicates that, within this dataset, YouTubers employ clickbait techniques mainly on entertainment videos.

Headline. YouTubers normally employ deceptive techniques in the headline, such as the use of exaggerating phrases. To verify that this applies to our ground truth dataset, we perform stemming on the words found in clickbait and non-clickbait headlines. Fig. 2(a) depicts the ratio of the top 20 stems found in our ground truth clickbait videos (e.g., 95% of the videos that contain the stem "sexi" are clickbait). In essence, we observe that magnetizing stems like "sexi" and "hot" are frequently used in clickbait videos, whereas their use in non-clickbait videos is low. The same applies to words used for exaggeration, like "viral" and "epic".

Thumbnail. To study the thumbnails, we make use of Imagga [19], which provides descriptive tags for an image. We tag all the thumbnails in our ground truth dataset. Fig. 2(b) shows the ratio of the top 10 Imagga tags found in the manually reviewed ground truth. We observe that clickbait videos typically use sexually-appealing thumbnails in order to attract viewers. For instance, 81% of the videos whose thumbnail contains the "pretty" tag are clickbait.

Tags. Tags are words that are defined by YouTubers before publishing and can dictate whether a video will surface in users' search queries. We notice that clickbait videos use specific words in their tags, whereas non-clickbait videos do not. Fig. 2(c) depicts the ratio of the top 20 stems found in clickbait videos. We observe that many clickbait videos use tags like "try not to laugh", "viral", "hot", and "impossible"; phrases that are usually used for exaggeration.

Statistics. Fig. 2(d) shows the normalized scores of the video statistics for both classes of videos. Interestingly, clickbait and non-clickbait videos have similar view counts, suggesting that viewers are not able to easily discern clickbait videos and hence click on them. Also, non-clickbait videos have more likes and fewer dislikes than clickbait videos. This is reasonable, as many users feel frustrated after watching clickbait.

Comments. We notice that users on YouTube implicitly flag suspicious videos by commenting on them. For instance, we note several comments like the following: "the title is misleading", "i clicked because of the thumbnail", "where is the thumbnail?", and "clickbait". Hence, we argue that comments from viewers are a valuable resource for assessing videos. To this end, we analyze the ground truth dataset to extract the mean number of occurrences of words widely used for flagging clickbait videos. Fig. 2(e) depicts the normalized mean scores for the identified words. We observe that these words are frequently used in comments on clickbait videos but not on non-clickbait ones. Also, it is particularly interesting that comments referring to the video's thumbnail were found 2.5 times more often in clickbait than in non-clickbait videos.
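As an illustration of the comment analysis above, the sketch below computes per-class mean occurrences of the flagging words (the word list follows Section III-A); the regex tokenization and the max-normalization are assumptions on our part.

```python
# Sketch of the flag-word analysis behind Fig. 2(e): mean occurrences of
# each flagging word in the comments of clickbait vs. non-clickbait videos.
import re
import numpy as np

FLAG_WORDS = ["misleading", "bait", "thumbnail", "clickbait", "deceptive",
              "deceiving", "clicked", "flagged", "title"]

def flag_word_counts(comments):
    """Occurrences of each flagging word over all comments of one video."""
    tokens = re.findall(r"[a-z']+", " ".join(comments).lower())
    return np.array([tokens.count(w) for w in FLAG_WORDS], dtype=np.float32)

def mean_flag_scores(videos):
    """videos: dicts with 'comments' (list of str) and boolean 'is_clickbait'."""
    per_class = {True: [], False: []}
    for v in videos:
        per_class[v["is_clickbait"]].append(flag_word_counts(v["comments"]))
    means = {cls: np.mean(np.stack(rows), axis=0) for cls, rows in per_class.items()}
    # Normalize each word's score by its maximum across the two classes.
    top = np.maximum(means[True], means[False]) + 1e-9
    return {cls: m / top for cls, m in means.items()}
```

These per-video counts are also what we later reuse as the comments modality of the detection model (Section III-A).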

[Fig. 2: bar plots, panels (a)-(e); see the caption below.]

Category            Clickbaits (%)    Non-clickbaits (%)
Entertainment       406 (38%)         308 (32%)
Comedy              318 (29%)         228 (24%)
People & Blogs      155 (14%)         115 (12%)
Autos & Vehicles     33 (3%)           49 (5%)
Sports               29 (3%)          114 (12%)

Fig. 2 & TABLE I: Analysis of the manually reviewed ground truth dataset. Normalized mean scores for: (a) stems from headline text; (b) tags derived from thumbnails; (c) stems from tags that were defined by uploaders; (d) video statistics; and (e) comments that contain words for flagging suspicious videos. Table I shows the top five categories (and their respective percentages) in our ground truth dataset.

Source           Destination       Norm. Mean
clickbait        clickbait         4.10
clickbait        non-clickbait     2.73
non-clickbait    clickbait         2.75
non-clickbait    non-clickbait     3.57

TABLE II: Normalized mean number of related videos for clickbait and non-clickbait videos in the ground truth dataset.

Graph Analysis. Users often watch videos according to YouTube's recommendations. From manual inspections, we have noted that, when watching a clickbait video, YouTube is more likely to recommend another clickbait video. To confirm this against our data, we create a directed graph G = (V, E), where V is the set of videos and E the set of connections between videos pointing to one another via a recommendation. Then, for all the videos, we select their immediate neighbors in the graph and count the videos that are clickbait and non-clickbait. Table II depicts the normalized mean number of connected videos for each class. We apply a normalization factor to mitigate the bias towards clickbait videos, which are slightly more numerous in our ground truth. We observe that, when a user watches a clickbait video, they are recommended 4.1 clickbait videos on average, as opposed to 2.73 non-clickbait recommendations. A similar pattern holds for non-clickbait videos; a user is less likely to be served a clickbait video when watching a non-clickbait one.

YouTube's countermeasures. To get an insight into whether YouTube employs any countermeasures, we calculate the number of offline videos (either deleted by YouTube or removed by the uploader) in our manually reviewed ground truth, as of January 10, 2017 and April 30, 2017. We find that only 3% (January 10) and 10% (April 30) of the clickbait videos are offline. Similarly, only 1% (January 10) and 5% (April 30) of the non-clickbait videos are offline. To verify that the ground truth dataset does not consist only of recent videos (thus, just published and not yet detected), we calculate the mean number of days that passed from the publication date up to January 10, 2017. We find that the mean age is 700 days for the clickbait videos and 917 days for the non-clickbait videos. The very low offline ratio, as well as the high mean age, indicate that YouTube is not able to tackle the problem in a timely manner.

III. CLICKBAIT DETECTION MODEL

A. Processed Modalities

Our model processes the following modalities: (i) Headline: we consider both the content and the style of the text. For the content, we use sent2vec embeddings [26] trained on Twitter data; for the style, we use the features proposed in [6]. (ii) Thumbnail: we scale the images down to 28x28 pixels and convert them to grayscale. This way, we decrease the number of trainable parameters of our deep network, thus speeding up training without compromising achievable performance. (iii) Comments: we preprocess user comments to count the occurrences of words used for flagging videos. We consider the following words: "misleading", "bait", "thumbnail", "clickbait", "deceptive", "deceiving", "clicked", "flagged", "title". (iv) Tags: we encode the tags' text as a binary representation over the top 100 words found in the whole corpus. (v) Statistics: numerical statistics such as likes, views, etc.
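A minimal sketch of the thumbnail and tag preprocessing described above follows. The library choices (Pillow, NumPy, scikit-learn's CountVectorizer) are assumptions rather than the original pipeline; the headline features (sent2vec plus the stylistic features of [6]) are omitted, and the comments modality can reuse the flag-word counts from the sketch in Section II-A.

```python
# Illustrative preprocessing for the thumbnail and tags modalities of Sec. III-A.
import numpy as np
from PIL import Image
from sklearn.feature_extraction.text import CountVectorizer

def thumbnail_features(path):
    """28x28 grayscale thumbnail, scaled to [0, 1]."""
    img = Image.open(path).convert("L").resize((28, 28))
    return np.asarray(img, dtype=np.float32) / 255.0

def fit_tag_vocabulary(all_tag_lists, top_k=100):
    """Binary bag-of-words over the top-100 tag words of the whole corpus."""
    vectorizer = CountVectorizer(max_features=top_k, binary=True)
    vectorizer.fit([" ".join(tags) for tags in all_tag_lists])
    return vectorizer

def tag_features(vectorizer, tags):
    """Binary 100-dimensional encoding of one video's tags."""
    return vectorizer.transform([" ".join(tags)]).toarray().ravel().astype(np.float32)
```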
[Fig. 3: Overview of the proposed model. The dotted rectangle represents the encoding component of our model.]

B. Model Formulation

In Fig. 3, we provide an overview of the proposed model. The thumbnail is initially processed, at the encoding part of the proposed model, by a CNN [40]. We use a CNN that comprises four convolutional layers with 64 filters each and ReLU activations. The first three of these layers are followed by max-pooling layers. The fourth is followed by a simple densely connected layer, which comprises 32 units with ReLU activations. This initial processing stage learns to extract a high-level, 32-dimensional representation of the thumbnail, which contains the most useful bits of information for driving classification.

The overarching goal of the devised model is to limit the required amount of labeled data, while making the most out of large corpora of unlabeled examples. To this end, after this first processing stage, we split the encoding part into two distinct subencoders that work in tandem. Both subencoders are presented with the aforementioned 32-dimensional thumbnail representation, fused with the data stemming from all the other available modalities. This results in an 855-dimensional input vector, first processed by one dense-layer network (the Fusing Network) that comprises 300 ReLU units. Due to its large number of parameters, this dense layer may become prone to overfitting; hence, we regularize it using the prominent Dropout technique [34]. We use a Dropout level of d = 0.5; this means that, at each iteration of the training algorithm, 50% of the units are randomly omitted from updating their associated parameters. The obtained 300-dimensional vector, say h(·), is the one eventually presented to both postulated subencoders.

The rationale behind this novel configuration is motivated by a key observation: the two modeled classes are expected to entail significantly different patterns of correlations and latent underlying dynamics between the modalities. Hence, it is plausible that each class can be adequately and effectively modeled by means of distinct, and different, encoder distributions, inferred by the two subencoders. Each of these subencoders is a dense-layer network comprising a hidden layer with 20 ReLU units and an output layer with 10 units. Since the devised model constitutes a VAE, the output units of the subencoders are of a stochastic nature; specifically, we consider stochastic outputs, say $\tilde{z}$ and $\hat{z}$, with Gaussian (posterior) densities, as is usual in the VAE literature [21], [23]. Hence, what the postulated subencoders actually compute are the means, $\tilde{\mu}$ and $\hat{\mu}$, as well as the (diagonal) covariance matrices, $\tilde{\sigma}^2$ and $\hat{\sigma}^2$, of these Gaussian posteriors. On this basis, the actual subencoder output vectors, $\tilde{z}$ and $\hat{z}$, are each time sampled from the corresponding (inferred) Gaussian posteriors. Note that our choice of sharing the initial CNN-based processing part between the two subencoders significantly reduces the number of trainable parameters, without limiting the eventually obtained modeling power.

Under this mixture model formulation, we need to establish an effective mechanism for inferring which observations (i.e., videos) are more likely to match the learned distribution of each component subencoder. This is crucial for effectively selecting between the samples of $\tilde{z}$ or $\hat{z}$ at the output of the encoding stage of the devised model. In layman's terms, this can be considered analogous to a (soft) classification mechanism. This mechanism can be obtained by computing the posterior distribution of mixture component membership of each video (also known as the "responsibility" in the literature of finite mixture models [25]). To allow for inferring this posterior distribution, we postulate a gating network. This is a dense-layer network, which comprises one hidden layer with 100 ReLU units, and it is presented with the same vector, h(·), as the two postulated subencoders. It is trained alongside the rest of the model, and it is the only part of the model that requires the availability of labeled data for its training.

Note that this gating network entails only a modest number of trainable parameters, since both its input and its single hidden layer are rather small. As such, it can be effectively trained even with limited availability of labeled data. This is a key merit of our approach, which fully differentiates it from conventional classifiers that are presented with the raw observed data, which are typically prohibitively high-dimensional.
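For concreteness, the following is a minimal tf.keras sketch of the encoding component of Fig. 3 (the shared CNN, the Fusing Network, the two subencoders, and the gating network). The paper's implementation uses Keras and TensorFlow [12], [24], but the kernel sizes, pooling windows, latent dimensionality, and the exact 823-dimensional layout of the fused modality vector are our assumptions rather than the original code.

```python
# Sketch of the encoding part of Fig. 3 (assumptions noted in the lead-in).
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 10    # output units of each subencoder
OTHER_DIM = 823    # headline + tags + comments + statistics features (assumed layout)

def build_encoder():
    thumb_in = layers.Input(shape=(28, 28, 1), name="thumbnail")
    other_in = layers.Input(shape=(OTHER_DIM,), name="other_modalities")

    # Shared CNN: four conv layers (64 filters each), max-pooling after the
    # first three, then a 32-unit dense summary of the thumbnail.
    x = thumb_in
    for i in range(4):
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        if i < 3:
            x = layers.MaxPooling2D(2)(x)
    thumb_vec = layers.Dense(32, activation="relu")(layers.Flatten()(x))

    # Fusing Network: 855-dim fused input -> 300 ReLU units with Dropout 0.5.
    fused = layers.Concatenate()([thumb_vec, other_in])
    h = layers.Dropout(0.5)(layers.Dense(300, activation="relu")(fused))

    # Two subencoders (clickbait / non-clickbait): 20 ReLU -> Gaussian parameters.
    def subencoder(name):
        hidden = layers.Dense(20, activation="relu", name=f"{name}_hidden")(h)
        mu = layers.Dense(LATENT_DIM, name=f"{name}_mu")(hidden)
        log_var = layers.Dense(LATENT_DIM, name=f"{name}_log_var")(hidden)
        return mu, log_var

    mu_cb, log_var_cb = subencoder("clickbait")
    mu_ncb, log_var_ncb = subencoder("non_clickbait")

    # Gating network: 100 ReLU units on h, sigmoid posterior q(c = clickbait | x).
    gate_hidden = layers.Dense(100, activation="relu")(h)
    gate = layers.Dense(1, activation="sigmoid", name="gate")(gate_hidden)

    return tf.keras.Model(
        inputs=[thumb_in, other_in],
        outputs=[mu_cb, log_var_cb, mu_ncb, log_var_ncb, gate])
```

Sharing the convolutional trunk between the two subencoders, as in this listing, is what keeps the overall parameter count modest.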
To conclude the formulation of the proposed VAE, we need to postulate an appropriate decoder distribution, and a corresponding network that infers it. In this work, we opt for a simple dense-layer neural network, which is fed with the (sampled) output of the postulated finite mixture model encoder and attempts to reconstruct the original modalities. Specifically, we postulate a network comprising one hidden layer with 300 ReLU units and a set of 823 output units that attempt to reconstruct the original modalities, with the exception of the thumbnail. The reason why we exclude the thumbnail modality from the decoding process is that handling it appropriately would require deconvolutional network layers, which is quite complicated. Hence, we essentially treat the thumbnail modality as side-information that regulates the inferred posterior distribution of the latent variables z (encoder) in the sense of a covariate, instead of an observed modality in the conventional sense. It is empirically known that such a setup, if appropriately implemented, does not undermine modeling effectiveness [30].
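A corresponding sketch of the decoder network follows: one hidden layer of 300 ReLU units and 823 linear outputs that reconstruct every modality except the thumbnail. Treating the outputs as Gaussian means (with the variances left implicit) is a simplification on our part.

```python
# Sketch of the decoder of Sec. III-B (simplified: outputs are Gaussian means).
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(latent_dim=10, output_dim=823):
    z_in = layers.Input(shape=(latent_dim,), name="z")
    hidden = layers.Dense(300, activation="relu")(z_in)
    x_mean = layers.Dense(output_dim, name="reconstruction_mean")(hidden)
    return tf.keras.Model(z_in, x_mean, name="decoder")
```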
Let us denote as $x_n$ the set of observable data pertaining to the $n$th available video. We write $x_n = \{x_n^{he}, x_n^{th}, x_n^{ta}, x_n^{co}, x_n^{st}\}$ for the set of five distinct modalities, i.e., headline, thumbnail, tags, comments, and statistics, respectively. Then, based on the above description, the encoder distribution of the postulated model reads:

q(z_n | x_n) = q(\tilde{z}_n | x_n)^{q(c_n = 1 | x_n)} \, q(\hat{z}_n | x_n)^{q(c_n = 0 | x_n)}    (1)

Here, $z_n$ is the output of the encoding stage of the proposed model that corresponds to $x_n$, $\tilde{z}_n$ is the output of the first subencoder, corresponding to the clickbait class, $\hat{z}_n$ is the output of the second subencoder, corresponding to the non-clickbait class, and $c_n$ is a latent indicator variable of whether $x_n$ belongs to the clickbait class or not. We also postulate

q(\tilde{z}_n | x_n) = \mathcal{N}\big(\tilde{z}_n \,\big|\, \tilde{\mu}(x_n; \tilde{\theta}), \mathrm{diag}\,\tilde{\sigma}^2(x_n; \tilde{\theta})\big)    (2)

q(\hat{z}_n | x_n) = \mathcal{N}\big(\hat{z}_n \,\big|\, \hat{\mu}(x_n; \hat{\theta}), \mathrm{diag}\,\hat{\sigma}^2(x_n; \hat{\theta})\big)    (3)

Here, $\tilde{\mu}(x_n; \tilde{\theta})$ and $\tilde{\sigma}^2(x_n; \tilde{\theta})$ are the outputs of a deep neural network, with parameter set $\tilde{\theta}$, that corresponds to the clickbait-class subencoder; it comprises a first CNN-type part that processes the observed thumbnails, and a further densely-connected part that fuses and processes the rest of the observed modalities, as described previously. Similarly, $\hat{\mu}(x_n; \hat{\theta})$ and $\hat{\sigma}^2(x_n; \hat{\theta})$ are the outputs of a deep neural network with parameter set $\hat{\theta}$ that corresponds to the non-clickbait-class subencoder.

The posterior distribution of mixture component allocation, $q(c_n | x_n)$, which is parameterized by the aforementioned gating network, is a simple Bernoulli distribution that reads

q(c_n | x_n) = \mathrm{Bernoulli}\big(\varpi(h(x_n); \phi)\big)    (4)

Here, $\varpi(h(x_n); \phi) \in [0, 1]$ is the output of the gating network, with trainable parameter set $\phi$. It is presented with an intermediate encoding of the input modalities (shared with the subencoder networks), $h(x_n)$, as described previously, and infers the probability of $x_n$ belonging to the clickbait class.

Lastly, the postulated decoder distribution reads

p(x_n | z_n) = \mathcal{N}\big(x_n^{he}, x_n^{ta}, x_n^{co}, x_n^{st} \,\big|\, \mu(z_n; \varphi), \mathrm{diag}\,\sigma^2(z_n; \varphi)\big)    (5)

where the means and diagonal covariances, $\mu(z_n; \varphi)$ and $\sigma^2(z_n; \varphi)$, are the outputs of a deep network with trainable parameter set $\varphi$, configured as described previously.

C. Model Training

Let us consider a training dataset $X = \{x_n\}_{n=1}^{N}$ that consists of $N$ video samples. A small subset, $X_l$, of size $M$ of these samples is considered to be labeled, with corresponding label set $Y = \{y_m\}_{m=1}^{M}$. Then, following the VAE literature [21], model training is performed by maximizing the evidence lower bound (ELBO) of the model over the parameter sets $\{\tilde{\theta}, \hat{\theta}, \phi, \varphi\}$. The ELBO of our model reads:

\log p(X) \;\ge\; \mathcal{L}(\tilde{\theta}, \hat{\theta}, \phi, \varphi \,|\, X) = -\sum_{n=1}^{N} \mathrm{KL}\big[q(z_n|x_n) \,\|\, p(z_n)\big] + \gamma \sum_{n=1}^{N} \mathbb{E}\big[\log p(x_n|z_n)\big] + \sum_{x_m \in X_l} \log q(c_m = y_m | x_m)    (6)

Here, $\mathrm{KL}[q\|p]$ is the KL divergence between the distributions $q(\cdot)$ and $p(\cdot)$, while $\mathbb{E}[\cdot]$ is the (posterior) expectation of a function w.r.t. its entailed random (latent) variables. Note also that, in the ELBO expression (6), the introduced hyperparameter $\gamma$ is a simple regularization constant, employed to ameliorate the overfitting tendency of the postulated decoder networks, $p(x_n|z_n)$. We have noticed that this simple trick yields a significant improvement in generalization capacity.

In Eq. (6), the posterior expectation of the log-likelihood term $p(x_n|z_n)$ cannot be computed analytically, due to the nonlinear form of the decoder. Hence, we must approximate it by drawing Monte Carlo (MC) samples from the posterior (encoder) distributions (2)-(3). However, MC gradients are well known to suffer from high variance. To resolve this issue, we utilize a smart re-parameterization of the drawn MC samples. Specifically, following the related derivations in [21], we express these samples as a differentiable transformation of an (auxiliary) random noise variable $\epsilon$; this random variable is the one we actually draw MC samples from:

\tilde{z}_n^{(s)} = \tilde{\mu}_n + \tilde{\sigma}_n \cdot \epsilon_n^{(s)}, \quad \epsilon_n^{(s)} \sim \mathcal{N}(0, I)    (7)

\hat{z}_n^{(s)} = \hat{\mu}_n + \hat{\sigma}_n \cdot \epsilon_n^{(s)}    (8)

Hence, such a re-parameterization reduces the computed expectations to averages over samples from a random variable with low (unitary) variance, $\epsilon$. This way, by maximizing the obtained ELBO expression, one can yield low-variance estimators of the sought (trainable) parameters, under some mild conditions [21]. Turning to the maximization of $\mathcal{L}(\tilde{\theta}, \hat{\theta}, \phi, \varphi | X)$, this can be effected using modern stochastic optimization algorithms, such as AdaGrad [16] or RMSprop [39].
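The following sketch illustrates how the reparameterized sampling of Eqs. (7)-(8) and the three ELBO terms of Eq. (6) could be written in TensorFlow. The soft, responsibility-weighted combination of the two subencoder samples and the unit-variance Gaussian reconstruction likelihood are simplifications we introduce for brevity; the listing is not claimed to reproduce the exact training objective implementation.

```python
# Sketch of the reparameterization trick and the (negative) ELBO terms of Eq. (6).
import tensorflow as tf

def sample(mu, log_var):
    """z = mu + sigma * eps, eps ~ N(0, I)  -- Eqs. (7)-(8)."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * tf.reduce_sum(tf.exp(log_var) + tf.square(mu) - 1.0 - log_var, axis=-1)

def negative_elbo(x_target, labels_onehot, is_labeled,
                  mu_cb, log_var_cb, mu_ncb, log_var_ncb, gate,
                  decoder, gamma=1.0):
    # Mixture encoder: combine the two components' samples, weighted by the
    # gating posterior q(c | x) (soft selection -- a simplification of Eq. (1)).
    z = gate * sample(mu_cb, log_var_cb) + (1.0 - gate) * sample(mu_ncb, log_var_ncb)

    # Reconstruction term: Gaussian log-likelihood up to an additive constant.
    recon = -tf.reduce_sum(tf.square(x_target - decoder(z)), axis=-1)

    # KL term, averaged over the two mixture components with weights q(c | x).
    kl = (tf.squeeze(gate, -1) * kl_to_standard_normal(mu_cb, log_var_cb)
          + tf.squeeze(1.0 - gate, -1) * kl_to_standard_normal(mu_ncb, log_var_ncb))

    # Supervised term: log q(c_m = y_m | x_m) for the labeled subset only
    # (index 1 of labels_onehot corresponds to the clickbait class).
    q_c = tf.concat([1.0 - gate, gate], axis=-1)
    supervised = is_labeled * tf.reduce_sum(
        labels_onehot * tf.math.log(q_c + 1e-8), axis=-1)

    # Negative ELBO to be minimized.
    return tf.reduce_mean(kl - gamma * recon - supervised)
```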
IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

Our model is implemented with Keras [12] and TensorFlow [24].

MC Samples. To perform model training, we use S = 10 drawn MC samples, $\epsilon^{(s)}$; we found that increasing this value does not yield any statistically significant accuracy improvement, despite the associated increase in computational cost.

Initialization. We employ Glorot uniform initialization [18]. This scheme allows us to train deep network architectures without the need for layerwise pretraining. This is effected by initializing the network weights in a way that ensures that the signal propagated across the network layers remains in a reasonable range of values, irrespective of the network depth (i.e., it does not explode to infinity or vanish to zero).

Stochastic optimization. We found that RMSprop works better than the AdaGrad algorithm suggested in [21]. The RMSprop hyperparameter values we used comprise an initial learning rate equal to 0.001, $\rho = 0.9$, and $\epsilon = 10^{-8}$.

Prediction generation. To predict the class of a video, $x_n$, we compute the mixture assignment posterior distribution $q(c_n|x_n)$, inferred via the postulated gating network $\varpi(h(x_n); \phi)$. On this basis, the video is assigned to the clickbait class if $\varpi(h(x_n); \phi) > 0.5$.

Baselines. To evaluate our model, we compare it against two baseline models: first, a simple Support Vector Machine (SVM) with parameters $\gamma = 0.001$ and $C = 100$; second, a supervised deep network (SDN) that comprises (i) the same CNN as the proposed model and (ii) a 2-layer fully-connected neural network with a Dropout level of d = 0.5.

Evaluation. We train our proposed model using the entirety of the available unlabeled dataset, as well as a randomly selected 80% of the available labeled dataset, comprising an equal number of clickbait and non-clickbait examples. Subsequently, the trained model is used to perform out-of-sample evaluation; that is, we compute the classification performance of our approach on the fraction of the available labeled examples that were not used for model training.
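A sketch of the prediction rule and the metrics of Table III follows, assuming the multi-output encoder model from the earlier listing (its last output is the gating posterior $\varpi$); scikit-learn is used here purely for convenience. Training itself could use, e.g., tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-8), matching the hyperparameters above.

```python
# Prediction rule of Sec. IV-A and the metrics reported in Table III.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def predict_clickbait(model, thumbnails, other_features, threshold=0.5):
    """Assign a video to the clickbait class iff q(c = clickbait | x) > 0.5.

    thumbnails: array of shape (N, 28, 28, 1); other_features: (N, 823).
    """
    gate = model.predict([thumbnails, other_features])[-1].ravel()
    return (gate > threshold).astype(int)

def evaluate(y_true, y_pred):
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred)}
```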
Model                          Accuracy    Precision    Recall    F1 Score
SVM                            0.882       0.909        0.884     0.896
SDN                            0.908       0.920        0.907     0.917
Proposed Model (U = 25%)       0.915       0.918        0.926     0.923
Proposed Model (U = 50%)       0.918       0.918        0.934     0.926
Proposed Model (U = 100%)      0.924       0.921        0.942     0.931

TABLE III: Performance metrics for the evaluated methods. We also report the performance of our model when using only 25% or 50% of the available unlabeled data.

Table III reports the performance of the proposed model as well as that of the two considered baselines. We observe that the neural-network-based approaches, i.e., the SDN and the proposed model, outperform the SVM in terms of all the considered metrics. Specifically, the best performance is obtained by the proposed model, which outperforms the SVM by 3.8%, 1.2%, 5.8%, and 3.5% on accuracy, precision, recall, and F1 score, respectively. Further, to assess the importance of using unlabeled data, we also report results with reduced amounts of unlabeled data. We observe that, using only 25% of the available unlabeled data, the proposed model undergoes a substantial performance decrease, as measured by all the employed performance metrics. This deterioration only slightly improves when we elect to retain 50% of the available unlabeled data.

Corpus-Level Inference Insights. Having demonstrated the performance of our model, we now provide some insights into the obtained inferential outcomes on the whole corpus of data. From the 206k examples, our model predicts that 84k (41%) are clickbait, whereas 122k (59%) are non-clickbait. The considerable percentage of clickbait in the corpus, in conjunction with the data collection procedure, suggests that, with respect to the collected data, YouTube does not consider misleading videos in its recommendations. Note also that we have performed an analysis of the whole corpus akin to the ground truth dataset analysis of Section II-A. The obtained results follow the same pattern as for the ground truth dataset. Specifically, the normalized mean values reported in Table II for the labeled data become, for the whole corpus, 11.18 for pairs of clickbait videos, whereas for clickbait/non-clickbait pairs the mean is 2.62. This validates our deduction that a user is more likely to be recommended a clickbait video when viewing a clickbait video on YouTube.

V. RELATED WORK

The clickbait problem has also been identified by prior work that proposes tools for alleviating it on various web portals. Specifically, Chen et al. [11] provide useful information regarding the clickbait problem and future directions for tackling it using SVM and Naive Bayes approaches. Rony et al. [32] analyze 1.67M posts on Facebook in order to understand the extent and impact of the clickbait problem, as well as users' engagement; for detecting clickbait, they propose the use of sub-word embeddings with a linear classifier. Potthast et al. [31] focus on the Twitter platform, where they suggest the use of Random Forests for distinguishing tweets that contain clickbait content. Furthermore, Chakraborty et al. [9] propose the use of SVMs in conjunction with a browser add-on for offering a detection system to end-users of news articles. Moreover, Biyani et al. [6] recommend the use of Gradient Boosted Decision Trees for clickbait detection in news articles; they also demonstrate that the degree of informality of the landing page content can help in discerning clickbait news articles. To the best of our knowledge, Anand et al. [4] is the first work that suggests the use of deep learning techniques for mitigating the clickbait problem; specifically, they propose the use of Recurrent Neural Networks in conjunction with word2vec embeddings for identifying clickbait news articles. Agrawal [3] proposes the use of CNNs in conjunction with word2vec embeddings for discerning clickbait headlines on Reddit, Facebook, and Twitter. Other efforts include browser add-ons [15], [13], [7] and manually associating user accounts with clickbait content [33], [35], [5].

Remarks. In contrast to the aforementioned works, we focus on the YouTube platform and propose a deep learning model that: (i) successfully and properly fuses and correlates the diverse set of modalities related to a video; and (ii) leverages large unlabeled datasets, while imposing much more limited requirements on labeled data availability.
VI. CONCLUSION

In this work, we have explored the use of variational autoencoders for tackling the clickbait problem on YouTube. Our approach constitutes the first proposed semi-supervised deep learning technique in the field of clickbait detection. This way, it enables more effective automated detection of clickbait videos in the absence of large-scale labeled data. Our analysis further indicates that YouTube's recommendation engine does not take the clickbait problem into account in its recommendations.

ACKNOWLEDGMENTS

This project has received funding from the European Union's Horizon 2020 Research and Innovation program, under the Marie Skłodowska-Curie ENCASE project (Grant Agreement No. 691025). We also gratefully acknowledge the support of NVIDIA Corporation, with the donation of the Titan Xp GPU used for this research.

REFERENCES

[1] Clickbait video. https://www.youtube.com/watch?v=W2WgTE9OKyg.
[2] List of ground truth channels. https://goo.gl/ZSabkn, 2018.
[3] A. Agrawal. Clickbait detection using deep learning. In NGCT, 2016.
[4] A. Anand, T. Chakraborty, and N. Park. We used Neural Networks to Detect Clickbaits: You won't believe what happened Next! arXiv preprint arXiv:1612.01340, 2016.
[5] Anti-Clickbait. https://twitter.com/articlespoiler, 2015.
[6] P. Biyani, K. Tsioutsiouliklis, and J. Blackmer. "8 Amazing Secrets for Getting More Clicks": Detecting Clickbaits in News Streams Using Article Informality. 2016.
[7] B.S. Detector browser extension. http://bsdetector.tech/.
[8] R. Campbell. You Wont Believe How Clickbait is Destroying YouTube!, 2017. https://tritontimes.com/11564/columns/you-wont-believe-how-clickbait-is-destroying-/.
[9] A. Chakraborty, B. Paranjape, S. Kakarla, and N. Ganguly. Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media. 2016.
[10] S. P. Chatzis, D. I. Kosmopoulos, and T. A. Varvarigou. Signal Modeling and Classification Using a Robust Latent Space Model Based on t Distributions. TOSP, 2008.
[11] Y. Chen, N. J. Conroy, and V. L. Rubin. Misleading Online Content: Recognizing Clickbait as False News. In ACM MDD, 2015.
[12] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
[13] Clickbait Remover for Facebook. http://bit.ly/2DIrpj6.
[14] J. Constine. Facebook feed change fights clickbait post by post in 9 more languages, 2017. http://archive.is/l8hIt.
[15] Downworthy browser extension. http://downworthy.snipe.net/.
[16] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2010.
[17] H. Gao, Y. Chen, K. Lee, D. Palsetia, et al. Towards Online Spam Filtering in Social Networks. In NDSS, 2012.
[18] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[19] Imagga. Tagging Service, 2016. https://imagga.com/.
[20] N. Jindal and B. Liu. Review spam detection. In WWW, 2007.
[21] D. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[23] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary Deep Generative Models. In ICML, 2016.
[24] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[25] G. McLachlan and D. Peel. Finite Mixture Models. 2000.
[26] M. Pagliardini, P. Gupta, and M. Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. arXiv, 2017.
[27] A. Peysakhovich. Reducing clickbait in Facebook feed, 2016. http://newsroom.fb.com/news/2016/08/news-feed-fyi-further-reducing-clickbait-in-feed.
[28] Piper Jaffray. Survey, 2016. http://archive.is/AA34y.
[29] E. A. Platanios and S. P. Chatzis. Gaussian Process-Mixture Conditional Heteroscedasticity. TPAMI, 2014.
[30] I. Porteous, A. Asuncion, and M. Welling. Bayesian Matrix Factorization with Side Information. In AAAI, 2010.
[31] M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In ECIR, 2016.
[32] M. M. U. Rony, N. Hassan, and M. Yousuf. Diving Deep into Clickbaits: Who Use Them to What Extents in Which Topics with What Effects? arXiv preprint arXiv:1703.09400, 2017.
[33] SavedYouAClick. https://twitter.com/savedyouaclick, 2014.
[34] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014.
[35] StopClickBait. http://stopclickbait.com/, 2016.
[36] G. Stringhini, M. Egele, A. Zarras, T. Holz, C. Kruegel, and G. Vigna. B@bel: Leveraging Email Delivery for Spam Mitigation. In USENIX Security, 2012.
[37] G. Stringhini, T. Holz, B. Stone-Gross, C. Kruegel, and G. Vigna. BOTMAGNIFIER: Locating Spambots on the Internet. In USENIX Security, 2011.
[38] G. Stringhini, C. Kruegel, and G. Vigna. Detecting spammers on social networks. In ACSAC, 2010.
[39] T. Tieleman and G. Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. 2012.
[40] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In ECCV, 2014.