
Teknik och samhälle, Datavetenskap och medieteknik

Bachelor Thesis, 15 credits, first cycle

Machine Learning explainability in text classification for Fake News detection

Lukas Kurasinski

Degree: Bachelor's degree, 180 credits
Supervisor: Radu Mihailescu
Main field: Computer Science
Examiner: Agnes Tegen
Programme: System Developer
Date of final seminar: 2020-06-03

Abstract

Fake news detection has gained interest in recent years. This has led researchers to look for models that can classify text for the purpose of fake news detection. While new models are developed, researchers mostly focus on the accuracy of a model. Little research has been done on the explainability of Neural Network (NN) models constructed for text classification and fake news detection. When trying to add a level of explainability to a Neural Network model, a lot of different aspects have to be taken into consideration. Text length, pre-processing, and complexity play an important role in achieving successful classification. A model's architecture has to be taken into consideration as well. All these aspects are analyzed in this thesis. In this work, an analysis of attention weights is performed to give an insight into NN reasoning about texts. Visualizations are used to show how two models, a Bidirectional Long-Short Term Memory Convolutional Neural Network (BiDir-LSTM-CNN) and Bidirectional Encoder Representations from Transformers (BERT), distribute their attention while training on and classifying texts. In addition, statistical data is gathered to deepen the analysis. After the analysis, it is concluded that explainability can positively influence the decisions made while constructing a NN model for text classification and fake news detection. Although explainability is useful, it is not a definitive answer to the problem. Architects should test and experiment with different solutions to be successful in effective model construction.

Acknowledgements

To supervisors Radu and Dennis. Thank you for your support.

Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Research question
  1.4 Limitations

2 Method
  2.1 Data-sets and pre-processing
    2.1.1 News corpus data-set
    2.1.2 Summarization
    2.1.3 Stemming
    2.1.4 Lemmatization
  2.2 Neural Networks architectures
    2.2.1 Embedding layer
    2.2.2 Bidirectional Recurrent Layer with Long-Short Term Memory
    2.2.3 Convolutional Layer
    2.2.4 Attention Mechanism
    2.2.5 Encoder-Decoder
    2.2.6 Transformer
  2.3 Models
    2.3.1 Bidirectional Long-Short Term Memory Neural Network with Attention mechanism and Convolutional layer (BiDir-LSTM-CNN)
    2.3.2 Bidirectional Encoder Representation from Transformers (BERT)
  2.4 Method execution
  2.5 Method discussion

3 Results
  3.1 Visualizations
  3.2 Data-sets used and texts presented in the analysis
  3.3 Models' accuracy
  3.4 Attention weights coverage
  3.5 Attention weights span over nearby words
  3.6 Domain specific words
  3.7 Parts of speech

4 Analysis
  4.1 Architectures
  4.2 Length of input-data
  4.3 Domain specific words
  4.4 Pre-processing
  4.5 Parts of speech

5 Conclusion

6 Future work

A Overview of attention visualizations for BiDir-LSTM-CNN and BERT - all cases
B Visual representations of Attention Mechanism for BiDir-LSTM-CNN - "sports"
C Visual representations of Attention Mechanism for BiDir-LSTM-CNN - "tech"
D Visual representations of Attention Mechanism for BERT - "sports"
E Visual representations of Attention Mechanism for BERT - "tech"
F Statistical data

Bibliography

Abbreviations

BERT - Bidirectional Encoder Representation from Transformers

BiDir - Bidirectional

CNN - Convolutional Neural Network

LSTM - Long-Short Term Memory

ML - Machine Learning

NLTK - Natural Language Tool Kit

NN - Neural Network

RNN - Recurrent Neural Network

SVM - Support Vector Machine

TF-IDF - Term Frequency-Inverse Document Frequency

1 Introduction

In this section, an introduction to the thesis is presented. First, in section 1.1, a brief background to the subject is laid out, presenting the overall scope. Later, in section 1.2, the issue regarding the explainability of Machine Learning Neural Network models for text classification and fake news detection is presented. In section 1.3, a research question is formulated, clarifying the subject of this work. Finally, limitations are presented in section 1.4, framing the boundaries of this thesis.

1.1 Background

The Internet has become a fast and cost-effective way of sharing information. With its rise, more and more people rely on online media to provide information, news, and facts about the world via news-feeds, online newspapers, and social media [1]. The Internet reaches a broader public than any paper publication ever did. With the vast amount of information it provides, an everyday user cannot keep up with fact-checking everything he or she consumes. A lack of in-depth knowledge on a subject gives online sources the benefit of trust from the public, which counts on the medium to provide reliable and honest information.

As the amount of honest information available on the Internet grew, so did the spread of misinformation [2]. Distribution of false claims can happen unknowingly, by providing facts based on unreliable sources, or knowingly, to confuse, mislead, and deceive people. Whatever the case may be, fake news, as it came to be known, has turned into a rising problem internet users face today. For this reason, the process of detecting false information has become a subject of extensive study in the research community. There is a need for a system that makes it possible to detect, classify, and eliminate news articles providing malicious content. Systems based on manual identification of this type of material are costly and time-consuming. That is why an effort is being made to find an effective Machine Learning (ML) approach to solve these issues.

Many different solutions have been published so far, as researchers try to find an efficient ML model for tackling the problem. Many of those solutions are based on different ML architectures, approaches to data representation, and Natural Language Processing. The problem turns out to be very difficult to solve, and researchers have yet to find a fully reliable system. While proposing different ML models, researchers focus on the effectiveness of a particular model, without trying to understand why one model performs better than another. Researchers omit the explainability of a model, focusing only on its effectiveness. In this work, an effort is made to understand the explainability of different popular ML models used in fake news detection.

An attempt is made to understand how data representation and architecture influence ML understanding of texts, in the context of text classification.

1.2 Problem

The work done in the process of creating an ML model for text classification using a Neural Network (NN) architecture can be simplified to a couple of basic steps. First, a model is researched. Later, a data-set is fed into it, and the model is trained. Lastly, the accuracy of fitting the model to the data is used as a measure of its usefulness. Some parameters of a given architecture are changed along the way, to maximize the result. The problem is that researchers seldom ask why a Neural Network model is effective or not, and focus only on the result. The essence of this work is to peek inside the black box that is a NN and try to understand the reasoning behind it. There is very little to no research done on the subject of providing explainability of Neural Network systems for text classification. Hopefully, this analysis helps future researchers pursuing effective Machine Learning Neural Network models for text classification and fake news detection, by providing a deeper understanding of the explainability behind them.

1.3 Research question

The research question is derived directly from the problem description:

• How can we provide explainability of neural-based models for the task of fake news detection via text classification?

1.4 Limitations

The focus of this work is not on producing a perfect neural network model. The neural network models used are derived from previous work showing acceptable results. The models chosen are a means of obtaining results for the analysis, hopefully leading to a better, general understanding of Neural Networks in Machine Learning and text classification.

Computational limitations and the time needed for training were important factors in downsizing the data-sets. The size of the data-sets is such that training is possible in a reasonable amount of time. Language is very versatile, fluid, and complex, which makes it a difficult subject to study. Therefore, additional limitations on the texts are imposed. Only texts written in English, according to English grammatical and structural rules, are used in this analysis.

Other limitations of this work involve data pre-processing, reduction of layers' dimensionality, and noise-cleaning. All of these operations have a positive impact on model performance. Unfortunately, every pre-processing step brings a risk of distorting the meaning of a text. Although data pre-processing in this work is minimal, it still introduces some constraints to the analysis.

In this work, the focus lies on exploring Deep Artificial Neural Networks only. Study [3] shows that Deep Neural Networks outperform other traditional methods of text classification. For this reason, methods like Support Vector Machines (SVM), Bayesian methods, and Decision Trees are omitted from this analysis.

The analysis involves text classification based on the patterns the models find. No study is done in the field of text classification based on fact-checking or other techniques involving source analysis. In this work, only the text itself and the model architecture are taken into consideration.

2 Method

In this thesis, an analysis of two different ML Deep Learning Neural Network models is performed: one based on a Convolutional Neural Network architecture (CNN) combined with a Bidirectional Recurrent Neural Network architecture (BiDir-LSTM), and a second one based on Bidirectional Encoder Representation from Transformers (BERT). These two Neural Network architectures are used because of the difference in their approach to textual data processing. They are representative of the two most popular, general approaches to text classification: the first is based on sequential text processing, and the second is based on processing the text as a whole. Both architectures give good results and are actively developed by researchers today. The two models make use of the same type of basic layers, although in different ways. One of the layers both models use is an Attention Layer. This layer plays a special role in the analysis: in addition to being an important component of both models, it is also used as a visualization tool, giving a deeper insight into the results produced by the Neural Networks.

In addition, a wide variety of data-sets are used in the analysis. All of the sets used are part of the same larger data-set, the "Fake News Corpus" [4]. By categorizing the articles into two domains, "tech" and "sports", two smaller sets are constructed for this analysis. Before processing the data-sets through the Neural Network models, data pre-processing is performed. This data preparation involves stop-word removal, lemmatization, stemming, and text summarization. All of the aspects of the model architectures used, and of the data preparation, are explained in detail next.

First, data pre-processing is discussed in section 2.1. Later, in section 2.2, all of the important layers are explained. Next, in section 2.3, the model architectures are outlined. Method execution is explained after that, in section 2.4, where the whole process of training and generating results is presented. Finally, the method description is closed with some final notes and observations in section 2.5. Although an effort is made to explain all of the techniques used, a basic knowledge of the subject is expected.

2.1 Data-sets and pre-processing

Preparing a data-set, scraping websites, finding texts, and labeling them for training can be a very challenging and time-consuming task. That is why, in this work, an already composed collection of texts is used. This helps minimize the time taken to produce data for the analysis. In turn, more time can be invested in other aspects of the analysis, like model training and results evaluation.

Pre-processing involves a reduction of unnecessary dimensionality and an optimization of model performance, achieved by cleaning the data-sets of unnecessary words and characters. In this section, an in-depth description of the data-sets used is presented, along with a detailed explanation of how they are prepared and the processes they undergo before being used in model training and classification.

2.1.1 News corpus data-set

The data-sets used in this analysis are derived from the "Fake News Corpus" [4]. This set is composed of over nine million texts, each labeled with one of 11 categories: false, satire, bias, conspiracy, state, junksci, hate, clickbait, unreliable, political, true. Additionally, the set is categorized by topic using the Term Association Technique [5]. For the comparison in this thesis, two smaller sets, one for each topic, are derived from the "Fake News Corpus". A number of articles, labeled "true" or "false", and categorized as "sports" or "tech", are pulled out of the collection. A detailed outline of all sets used is presented in table 1.

Set 1. For the first set, 1000 randomly chosen articles categorized as "sports" and labeled "true" or "false" are used. The composition is balanced: 500 articles labeled "true" and 500 labeled "false".

Set 2. The second set is composed of 1000 randomly chosen articles categorized as "tech". The composition is also balanced: 500 articles labeled "true" and 500 labeled "false".
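For illustration, a balanced subset like Set 1 or Set 2 could be drawn with pandas roughly as sketched below. The file name and the column names (topic, label, text) are hypothetical, since the corpus' exact schema is not described here.

```python
import pandas as pd

# Hypothetical columns: 'topic' ("sports"/"tech"), 'label' ("true"/"false"), 'text'.
corpus = pd.read_csv("fake_news_corpus_subset.csv")

def balanced_subset(df, topic, n_per_label=500, seed=42):
    """Draw n_per_label 'true' and n_per_label 'false' articles for one topic."""
    pool = df[df["topic"] == topic]
    true_part = pool[pool["label"] == "true"].sample(n_per_label, random_state=seed)
    false_part = pool[pool["label"] == "false"].sample(n_per_label, random_state=seed)
    return pd.concat([true_part, false_part]).sample(frac=1, random_state=seed)  # shuffle

sports_set = balanced_subset(corpus, "sports")   # Set 1: 1000 balanced articles
tech_set = balanced_subset(corpus, "tech")       # Set 2: 1000 balanced articles
```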

To deepen the analysis, an additional 16 sets are derived from these two, giving in total 18 balanced sets, each containing 1000 articles. There are 9 sets for each category ("sports", "tech"). Both categories contain articles subjected to different pre-processing operations. In each category there is a "vanilla" set of articles, a set of articles summarized to a maximum of 5 sentences each, and a set of articles summarized to a maximum of 10 sentences each. Summarization is based on pioneering work on the subject by H. P. Luhn [6], who successfully summarized texts by taking into consideration word frequencies and stop-word (a, an, etc.) removal. Later, in each group of "vanilla" texts, texts summarized to 5 sentences, and texts summarized to 10 sentences, additional variations are made using lemmatization and stemming. All of the methods used in data preparation are explained next.

Table 1: A summary of all of the data-sets used in the analysis.

Data-set       Size   Summarized        Stemmed   Lemmatized
1  "sports"    1000   -                 -         -
2  "sports"    1000   -                 Yes       -
3  "sports"    1000   -                 -         Yes
4  "sports"    1000   max 5 sentences   -         -
5  "sports"    1000   max 5 sentences   Yes       -
6  "sports"    1000   max 5 sentences   -         Yes
7  "sports"    1000   max 10 sentences  -         -
8  "sports"    1000   max 10 sentences  Yes       -
9  "sports"    1000   max 10 sentences  -         Yes
10 "tech"      1000   -                 -         -
11 "tech"      1000   -                 Yes       -
12 "tech"      1000   -                 -         Yes
13 "tech"      1000   max 5 sentences   -         -
14 "tech"      1000   max 5 sentences   Yes       -
15 "tech"      1000   max 5 sentences   -         Yes
16 "tech"      1000   max 10 sentences  -         -
17 "tech"      1000   max 10 sentences  Yes       -
18 "tech"      1000   max 10 sentences  -         Yes

2.1.2 Summarization

Text summarization is utilized in many Neural Network models used for text classification, which is why it is included in this analysis. The method of text summarization used here is based on the Term Frequency-Inverse Document Frequency (TF-IDF) method described in [7]. It produces a summary accuracy of 68%, which is better than other similar summarization methods available. Summary accuracy is a measure of meaning conveyance: it describes how closely the meaning of a summary generated by the model resembles the meaning of a summary a human would create. The technique produces summaries by extracting sentences from a text. Sentences in the summaries are the same as in the original texts, and their order of appearance is preserved.

The method is based on a statistical approach, where the frequency with which a word appears in a document gives it a TF-IDF score:

TF = (Total appearances of a word in the document) / (Total words in the document)    (1)

IDF = log( (Total number of sentences) / (Number of sentences containing the word) )    (2)

TF-IDF score = TF × IDF    (3)

Common words like "a", "are", and "in" are omitted from scoring. Since every sentence is bound to contain them, they do not really contribute to the result.

After giving a score to words, every sentence in a document gets an additional score of its own. This sentence score is the sum of the TF-IDF scores of the individual words in that sentence:

Sentence score = Σ TF-IDF score (summed over the words in the sentence)    (4)

Sentences with the highest scores are understood as those containing the highest amount of relevant information in the document.

Before applying TF-IDF, texts are pre-processed with the help of a Python module, NLTK [8]. Stop-words are removed to declutter the texts from unnecessary noise data.
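A minimal sketch of this extractive scoring (equations 1-4) is given below. It assumes NLTK's punkt and stopwords data are available; the thesis's actual implementation is not reproduced here, so the function and variable names are illustrative.

```python
import math
from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

STOP = set(stopwords.words("english"))

def summarize(text, max_sentences=5):
    """Extractive TF-IDF summary: score sentences by the sum of their words'
    TF-IDF scores and keep the top ones in their original order."""
    sentences = sent_tokenize(text)
    tokenized = [[w.lower() for w in word_tokenize(s)
                  if w.isalpha() and w.lower() not in STOP] for s in sentences]
    words = [w for sent in tokenized for w in sent]
    tf_counts = Counter(words)                       # equation (1), numerator
    total_words = len(words) or 1
    idf = {w: math.log(len(sentences) / sum(1 for s in tokenized if w in s))
           for w in tf_counts}                       # equation (2)
    scores = [sum((tf_counts[w] / total_words) * idf[w] for w in sent)
              for sent in tokenized]                 # equations (3) and (4)
    keep = sorted(sorted(range(len(sentences)), key=lambda i: scores[i],
                         reverse=True)[:max_sentences])
    return " ".join(sentences[i] for i in keep)      # original sentence order preserved
```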

2.1.3 Stemming

This data pre-processing technique involves reducing words to a base form, mapping related words to the same stem. A stem does not necessarily have to be a valid word by itself.

The stemming used in this work uses a suffix stripping technique, removing suffixes like "-ed", "-ize", and "-s" from a word. The Porter Stemmer, as it is called, reduces words like "cats" to "cat", and words like "trouble", "troubling", and "troubled" to a common stem. On the other hand, words like "misstated" are reduced to "misstat". This technique is used in various Neural Networks for text classification. It helps simplify the data inputs to the network, reducing dimensionality and complexity. This process contributes positively to results in many different supervised architectures, as presented in [9]. In this analysis, the Python NLTK package [8] is used to stem the data-sets.
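A minimal example of the suffix stripping described above, using NLTK's Porter stemmer (the package used in this thesis); the exact stems printed depend on the NLTK version.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "trouble", "troubling", "troubled", "misstated"]:
    # Suffixes such as "-s" and "-ed" are stripped; the stem need not be a dictionary word.
    print(word, "->", stemmer.stem(word))
```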

2.1.4 Lemmatization

This type of data pre-processing also uses Python's NLTK package [8]. Lemmatization, unlike stemming, reduces words to their proper dictionary root form. While words like "cats" are still reduced to the root form "cat", words like "misstated" are not. Lemmatization is often used as an alternative to stemming if dictionary vocabulary is to be preserved.

In theory, text simplification, whether stemming or lemmatization, helps reduce the dimensionality of a model without a significant negative effect on training accuracy. As shown in [11], models using texts that undergo these pre-processing operations can still achieve acceptable results. Stemming is quicker than lemmatization, but texts can end up containing nonsensical, non-dictionary words. For a better understanding of Neural Network reasoning, both alternatives are tested and compared in this analysis.
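The difference between the two operations can be illustrated with NLTK. This is a sketch: the WordNet lemmatizer needs the wordnet data downloaded and, unless a part-of-speech tag is supplied, treats words as nouns by default.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["cats", "misstated", "troubling"]:
    # Stemming may produce non-dictionary stems; lemmatization keeps dictionary forms.
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word))
```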

Lastly, both sets undergo simple, overall pre-processing and cleaning operations. Stop-words like "a", "an", and "the" are removed during training, and the texts are checked for artifacts. Stray letters, parentheses, and non-UTF-8 encoded characters like the copyright sign © and the registered trademark sign ® are also removed. All of these are considered noise data that does not contribute to the result.

2.2 Neural Networks architectures

Layers are the primary building blocks of any Neural Network. There are various kinds of Neural Network layers, all having their advantages and disadvantages. In this section, all of the layers used in the models, and their usefulness in text classification, are explained.

2.2.1 Embedding layer

Even if the processing involves texts, the data has to be transformed into numbers first. Before feeding textual data into a NN model, every word has to be converted into numerical data. To do that, an Embedding Layer is used. In an Embedding Layer, every word gets assigned a value. This value is associated with an embedding lookup table, in which additional values are stored, corresponding to the similarity of words with each other. There are two possibilities for obtaining the embedding values. First, the embedding layer's values can be trained; in this method, an additional model is developed that generates these values from a given data-set. Another possibility is to use an already available embedding lookup table. The second option is used in this thesis: an existing embedding lookup table named GloVe 300d, developed by Stanford University researchers [10], is used in this analysis.
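A sketch of how a frozen GloVe 300d lookup table could back a tf.keras Embedding layer. The file name and the word_index mapping (from a fitted tokenizer) are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 300  # GloVe 300-dimensional vectors

def glove_embedding_layer(glove_path, word_index):
    """Build an Embedding layer initialised with pretrained GloVe vectors."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vector = parts[0], parts[1:]
            if word in word_index and len(vector) == EMBED_DIM:
                matrix[word_index[word]] = np.asarray(vector, dtype="float32")
    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0], output_dim=EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False)  # keep the pretrained lookup table frozen

# word_index maps words to integer ids, e.g. from tf.keras.preprocessing.text.Tokenizer.
```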

2.2.2 Bidirectional Recurrent Layer with Long-Short Term Memory

In a simple Recurrent Neural Network (RNN), the current network state knows only about past inputs. This can be a disadvantage in text processing, since words in a text are connected not only to the ones before them, but also to those appearing after. For this reason, a network that can process textual data within its context, in both directions, is much more suitable. A Bidirectional (BiDir) Neural Network has exactly this kind of ability. In a BiDir network, two sets of hidden states are maintained. The first is a forward state, where the network processes inputs from beginning to end, like in any typical RNN. The second is a backward state, where the input is processed back to front. Knowledge of the context on both sides of the current input improves the results of models dealing with sequential data like texts [12]. The BiDir architecture is presented in figure 1.

Figure 1: Bidirectional Neural Network layer with two hidden states (h): forward and backward. Input data (X) is processed from both sides by two hidden layers simultaneously and fed into the output [23].

A BiDir network is suitable for processing short textual data like words and sentences. Unfortunately, it has problems when dealing with extensively long data, like paragraphs and articles. In this situation, a problem of vanishing or exploding gradients can occur. Since states in the BiDir network are passed from one hidden layer to another, there is a chance that with lengthy texts the gradient starts approaching a minimum or a maximum value. While the weights in the hidden states are updated, the result can in the end be overwhelmed by some dominant values. This situation, if it occurs, renders the network unstable and in effect useless. For this reason, another type of component is added to the structure: a Long-Short Term Memory (LSTM) cell, shown in figure 2. This cell is responsible for retaining at least a part of the hidden state, even when dealing with lengthy texts. The weights that are at risk of overflowing are now regulated by the addition of values carried by the LSTM layer. The LSTM hinders the gradient from destabilizing the network. Using LSTM in conjunction with BiDir makes it possible to avoid gradient problems in data with dependencies extending over a longer text.

Figure 2: LSTM cell passing a part of the hidden states (h) even with long inputs [24], regulating the BiDir layer weights in the process.
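In tf.keras, the BiDir-LSTM combination described in this section is a single wrapped layer; the unit count below is illustrative.

```python
import tensorflow as tf

# Two LSTMs read the embedded sequence forwards and backwards; their hidden states
# are concatenated at every time step (return_sequences=True keeps per-word outputs).
bidir_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))

x = tf.random.normal((2, 50, 300))   # 2 texts, 50 tokens, 300-d embeddings
print(bidir_lstm(x).shape)           # (2, 50, 256): forward and backward states side by side
```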

2.2.3 Convolutional Layer

The Convolutional Neural Network (CNN), as explained in [12], is one of the most popular Neural Network architectures used today. First developed for 2-dimensional image processing, it was with time adapted to other tasks like text processing. The main operation associated with a CNN is the convolution operation. Partially connected convolutional layers substantially decrease the number of weights (parameters) in the network, making it very efficient in the process. How convolution operates is explained next.

Convolution works by taking data into consideration in the context of its surroundings. It uses dependencies and relationships present in the data to achieve learning. This property makes it a good tool for text processing, since words in a text are not isolated but sequential, dependent on one another. In the context of image data, it makes sense to assume that nearby pixels are typically more relevant to each other than pixels that are far away. The same analogy can be projected onto textual data, where words also have a relation to each other [15].

A CNN is built with inner layers (dimensions), where in every layer some features of the input layer are convolved into the next layer. Since there are many inner layers, different features of the input are exposed and projected onto those layers. This gives the network an ability to capture salient features, dependencies among words, and their surroundings. In figure 3, C1 and C2 represent inner layers subjected to the convolution operation. Different filters are used to highlight and extract different features from the input layer. This operation is repeated in the hidden layers by applying filters to the layers representing the output of the previous filters.

An important aspect of the inner layers is that they are not fully connected, but rather represent some partial region of the previous layer. This characteristic demands a careful approach while designing a model. While only a spatial region of one layer is represented in the next, the parameters are shared across all of them. Both the local connectivity between inner layers and the parameter sharing reduce dimensionality in the network, making it memory efficient.

While conceptually a CNN is similar to BiDir-LSTM, their inner workings are very different. BiDir-LSTM processes input data in sequences from both sides, moving the relationships further to the next hidden layer. A CNN, on the other hand, filters out partial chunks of the input, projecting them onto the next layer.

Figure 3: A typical CNN layer with inner convolutional layers. C1 with 6 filters projects onto inner layers, representing different partial relationships of the input layer [13].
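For text, the convolution is one-dimensional: each filter slides over windows of adjacent word embeddings, so every output feature reacts to a small neighbourhood of words. A minimal tf.keras sketch, with the filter count and window size chosen only for illustration:

```python
import tensorflow as tf

conv = tf.keras.layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")

x = tf.random.normal((2, 50, 300))   # 2 texts, 50 tokens, 300-d embeddings
print(conv(x).shape)                 # (2, 50, 64): one 64-d feature vector per word position
```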

2.2.4 Attention Mechanism

First developed by Bahdanau [16], and later refined by Luong [17], the Attention Mechanism is a key addition to architectures dealing with lengthy textual data. The main idea of an Attention Mechanism is to establish a connection between the input layer and the output layer by assigning additional weights (attention) to every word that is processed by the network. The weights are carried over the hidden layers of the network, even when the input data is overly long. Words having a substantial influence on the result retain a higher weight value than words that are less important. The attention mechanism improves the translation of longer sentences, making it an additional improvement over LSTM.

Both models have an Attention Layer implemented in their architecture. Thanks to its construction, the Attention Layer makes it possible to note the level of interest a Neural Network model has in different aspects of the data used in training and classification. In the case of this analysis, the aspects of interest are the words and sentences in any given input to the network. As a by-product of assigning attention weights to words, it is possible to extract those values for the analysis.
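A minimal additive attention layer, in the spirit of Bahdanau's formulation, is sketched below. It returns both the pooled context vector and the per-word weights (the kind of values later visualized); this is a generic sketch, not the exact implementation used in the two models.

```python
import tensorflow as tf

class WordAttention(tf.keras.layers.Layer):
    """Score each hidden state, softmax-normalise the scores, and return the
    weighted sum of states together with the per-word attention weights."""

    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.proj = tf.keras.layers.Dense(units, activation="tanh")
        self.score = tf.keras.layers.Dense(1)

    def call(self, hidden_states):                                 # (batch, time, features)
        scores = self.score(self.proj(hidden_states))              # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)                    # attention per word
        context = tf.reduce_sum(weights * hidden_states, axis=1)   # (batch, features)
        return context, tf.squeeze(weights, axis=-1)
```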

2.2.5 Encoder-Decoder

Introduced by Google researchers in 2014 [18], this architecture is used for mapping an input data sequence to an output data sequence. Both input and output sequences have a fixed maximum length, not necessarily equal to each other. The layers, as seen in figure 4, are constructed with either RNNs or LSTMs, and an Attention Mechanism. The Encoder-Decoder is composed of two processing blocks: the encoding block and the decoding block. After feeding data to the encoder, hidden states are produced and passed further down the network. The hidden state produced by the last layer of the encoding block is vectorized and passed as the first hidden state to the decoding block. The decoder, making use of the context information gathered in the Attention Mechanism and its last hidden state, maps an input to an appropriate output. The real value of this architecture is in the possibility of having input data of one length mapped to output of a different length, while simultaneously paying attention to the context in which it occurs.

Figure 4: Encoder-Decoder architecture based on RNNs, mapping input data to output data. An additional intermediate Attention Mechanism is used for context extraction [14].

2.2.6 Transformer

This architecture was first described in [19]. It is similar to the Encoder-Decoder in that it is also used for sequence-to-sequence translation. The difference is that there are no RNNs or LSTMs used in this implementation. The architecture is based only on an Attention Mechanism and a feed-forward network. Since there is no sequential processing in this structure, an additional dimension is added to every word. This dimension represents the relative position of every word in the context, and can be seen in figure 5 as the area marked "positional encoding". After the hidden states are produced in the encoder, and the initial masking operations are performed in the decoder, the hidden states of the encoder are introduced as the initial states of the decoder. This process of passing the hidden state is similar to the one described for the Encoder-Decoder in the earlier section. This architecture has proven to achieve better results than the Encoder-Decoder architecture.

Figure 5: Transformer architecture mapping input data to output data. An implementation without RNNs, with Attention Mechanisms only [20].

2.3 Models

In this section, the two models used in the analysis are presented, with a description of each model and its overall architecture.

2.3.1 Bidirectional Long-Short Term Memory Neural Network with Attention mechanism and Convolutional layer (BiDir-LSTM-CNN)

This Neural Network model is built with all of the basic layers presented in section 2.2. The layers used in this model are a BiDir-LSTM, a CNN, and an Attention Mechanism. According to the authors [21], the model uses the CNN layer for extraction of a high-level representation from the Embedding layer, and the BiDir-LSTM for extraction of past and future context. The Attention Mechanism is used to give focus to the different outputs from the hidden layers of the BiDir-LSTM. The model has proven to be very effective in text classification. In the original work, it was tested on a wide variety of different data-sets, giving a mean accuracy of 90%. This model was chosen for the analysis since it is representative of the sequential approach to text processing, and has proven to give similarly good results independently of the data-set used.

The model's architecture is shown in figure 6, and goes as follows. First, a data-set is pre-processed, tokenized, and fed into the Embedding layer. Then, high-level feature extraction is performed by the CNN layer. After that, the features are fed into the BiDir layer. There, a forward and a backward inner layer extract context from past and future sequences and pass them further into two Attention layers, one for the forward state and one for the backward state. The Attention Mechanisms process long-term dependencies further. Finally, both contexts are concatenated and processed through a last simple feed-forward layer, which outputs the classification.

Figure 6: Overall architecture of the BiDir-LSTM-CNN model used in the analysis [22]. Input data is processed by the layers and output as a probability of a text being true or false.
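A simplified tf.keras sketch of this pipeline (embedding, CNN feature extraction, BiDir-LSTM, attention pooling, binary output) is given below. It compresses the two separate forward/backward attention layers of the original architecture into one pooling step, and all sizes are illustrative rather than taken from [21].

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 300, 500   # illustrative sizes

def build_bidir_lstm_cnn():
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)   # GloVe weights could be loaded here
    x = tf.keras.layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(x)
    # Attention pooling over per-word hidden states (cf. section 2.2.4).
    scores = tf.keras.layers.Dense(1)(x)
    weights = tf.keras.layers.Softmax(axis=1)(scores)
    context = tf.keras.layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(context)   # P(text is true)
    return tf.keras.Model(inputs, outputs)

model = build_bidir_lstm_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```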

2.3.2 Bidirectional Encoder Representation from Transformers (BERT)

This Neural Network model is based on a paper [25] published by Google researchers in 2019. It is chosen for the comparison in this analysis because of its innovative approach to text classification. The architecture makes use of a partial Transformer structure and BiDir layers: from the Transformer, the encoder block is utilized for context extraction, and the BiDir layer is used for classification. The architecture of this model is presented in figure 7. BERT differs from the BiDir-LSTM-CNN model in the way it processes data. It does not look at the input in sequences, from left to right and from right to left. Instead, it processes the input as a whole, like a Transformer would. After learning the context in the encoding block, the outputs are fed into the BiDir layers, where they are trained according to a given classification.

BERT is a very effective model for text classification. Unfortunately, it is also very resource demanding. For this reason, a slightly simplified version of BERT, called DistilBERT, is used in this thesis. This implementation, as described in [26], achieves 98% of BERT's accuracy, with greater speed and lighter processing. The architecture simplifies the classification layer and optimizes dimensionality, without sacrificing the quality of the outcome from the network.

Figure 7: BERT architecture with a Transformer encoding block for context processing and RNN layers for classification [31].
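A sketch of fine-tuning DistilBERT for binary true/false classification with the Huggingface transformers library is shown below. The texts, labels, and hyper-parameters are placeholders, and exact argument names can vary between library versions; setting output_attentions=True is one way to obtain the per-token attention tensors needed for visualization.

```python
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, output_attentions=True)

texts = ["first article body ...", "second article body ..."]   # placeholder data
labels = tf.constant([1, 0])                                     # 1 = true, 0 = false
enc = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="tf")

model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dict(enc), labels, epochs=1, batch_size=2)

out = model(dict(enc))
print(len(out.attentions), out.attentions[0].shape)   # layers, (batch, heads, tokens, tokens)
```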

2.4 Method execution

First, the data-sets (table 1) are created according to the descriptions in section 2.1. Next, the two models, BiDir-LSTM-CNN and BERT, are built in accordance with their architectures in section 2.3. Later, every data-set is trained on both models. Lastly, a prediction is obtained, and the Attention Mechanism's weights are extracted. To aid the analysis, the attention weights are presented visually as color-coded texts, representing the interest given by the network to every word in the processed data. Visualizations based on those attentions can be seen in appendices A-E. Lastly, statistical data is collected to strengthen the analysis. Statistics representing the attention distribution among words in all of the texts give a deeper insight into the NN classification process.

The visualizations are generated with a color-coded scheme and should be understood as follows. A background color is applied to every word in a text. The coloration represents the level of attention a model pays to a particular word while training on and classifying a text. Words marked as red should be understood as those not having a substantial influence on the model's classification. In contrast, words marked as green should be recognized as those having a strong effect on the outcome of the text's classification. Following this understanding, yellow words should be interpreted as those with a moderate influence on a model's classification. In addition to the 3 principal background colors, a color strength, or alpha, is used to further define the levels of attention. Words with stronger colors, the ones with a higher alpha value for any of the 3 principal colors, have a higher influence on classification than words with the same color and a weaker alpha value. For BERT, a yellow background color is not present: in this model, the intermediate weak attention is omitted from the visualization, due to technical difficulties. This difference in visualizations has no effect on the results and analysis. An entire lack of coloration indicates a virtually complete absence of attention for a particular word.
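As an illustration, the color coding described above could be produced along these lines. The thresholds follow the thirds used for BiDir-LSTM-CNN, and the RGB values are arbitrary choices, not the ones used to generate the appendix figures.

```python
import html

def colorize(tokens, weights):
    """Render tokens as HTML spans whose background hue (red/yellow/green) and
    alpha strength follow each word's normalised attention weight."""
    top = max(weights, default=1.0) or 1.0
    spans = []
    for token, w in zip(tokens, weights):
        share = w / top
        if share > 2 / 3:
            rgb = (0, 180, 0)          # strong attention -> green
        elif share > 1 / 3:
            rgb = (235, 200, 0)        # moderate attention -> yellow
        else:
            rgb = (220, 0, 0)          # weak attention -> red
        spans.append(
            f'<span style="background: rgba({rgb[0]},{rgb[1]},{rgb[2]},{share:.2f})">'
            f"{html.escape(token)}</span>")
    return " ".join(spans)
```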

In addition, statistical data is generated. This data shows the cumulative distribution of attention levels among different parts of speech in the texts examined. Statistical data is produced for each of the two text categories, "sports" and "tech", and for each of the models, BiDir-LSTM-CNN and BERT. The statistical data, together with the visual representation of the attention weights, is believed to give insight into the NN classification process.
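The thesis does not detail the tooling behind these statistics; one plausible way to compute the share of each part of speech per attention level, using NLTK's tagger (requires the averaged_perceptron_tagger data), is sketched below.

```python
from collections import Counter

import nltk

def pos_shares_per_attention(tokens, attention_levels):
    """For every attention level ('strong', 'moderate', 'weak', 'none'), return the
    fraction of its words falling into each coarse part-of-speech group (NN, VB, ...)."""
    tagged = nltk.pos_tag(tokens)                      # e.g. [('ball', 'NN'), ('ran', 'VBD')]
    counts = {}
    for (word, tag), level in zip(tagged, attention_levels):
        counts.setdefault(level, Counter())[tag[:2]] += 1
    return {level: {pos: n / sum(c.values()) for pos, n in c.items()}
            for level, c in counts.items()}
```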

The models are built with the help of the TensorFlow 2.0 [27] library for Python. Since the BERT implementation is slightly simplified, the additional packages Huggingface [28] and KTrain [29], [30] have to be used as well.

2.5 Method discussion

Text processing is problematic, since a text has to be viewed not only as a collection of words but also as a whole semantic structure. The context in which words appear in sentences has a significant influence on the training process. Words exist in relation to each other, as well as in relation to the whole text. The positioning of words needs to be taken into account: words appearing at the beginning of a sentence, at the end of a sentence, or in the middle have different influences on a Neural Network's reasoning about them. Text length is something that has to be taken into consideration as well. Some inputs can be a couple of words long, and others can be full-fledged articles. Other aspects like language diversity and the semantic domain can further add to the overall complexity. This is the reason why two models with two distinct approaches to text processing, and an extensive, versatile group of texts, are chosen for this analysis.

The BiDir-LSTM-CNN model's approach is sequential, front to back and back to front. The context is extracted in chunks with the help of the CNN, and the local dependencies are highlighted with the BiDir-LSTM. BERT, on the other hand, processes inputs as a whole, extracting context and dependencies from both sides at the same time. Both models, among other building blocks, include the Attention Mechanism. This mechanism is used, in this thesis, for visualization purposes. The task of extracting attention weights is complicated and difficult. Since the models are built with the help of available Python libraries, some modifications had to be made in these packages to facilitate this functionality. The process of creating the layers, the models, the attention weight extraction, and the visualization is beyond the scope of this work, which is why it is not included in this thesis.

3 Results

In this section, all of the results are provided in detail. The figures are gathered in the appendices for easy access; here, a description of the results is provided. First, a summary of the results is laid out, with references to the appropriate tables and figures. Later, the results are explained in detail. The results covered are:

• Visualizations of the attention weights for typical texts.

• Classification accuracy - statistics.

• Attention weight coverage in texts - statistics.

• Attention span over nearby words - statistics.

• Domain specific words - statistics.

• Parts of speech in every attention type - statistics.

• Parts of speech in all attentions together - statistics.

Gathering the data for the analysis, and building and training the models, is a difficult and time-consuming task. It takes about 200 hours to train the models and produce the results: a sum of about 100 hours is needed to train the models, and an additional 100 to gather data for the analysis. Although the time taken to train the models is substantial, it is not as drastic as the time reported in the original work [26]. Additional steps are taken to improve the efficiency of this analysis. The data-sets used in this analysis are smaller, and the models are simplified according to the description in section 2.3.2. These actions positively affected the time needed for producing the results.

After training both models and extracting the attention weights, visualizations of the data and statistics are generated. While visualizations are made for a larger number of texts, only 2 particular examples are presented in this thesis. The texts presented exemplify typical results achieved for the majority of the input data, for both contextual domains ("sports", "tech"), in all data-sets used. An overall visual overview of the results can be seen in appendix A, and in detail in appendices B-E. The visualizations exhibit the same texts subjected to different models and pre-processing operations.

The statistical data presented in appendix F, in tables 4-21, is a quantified summation of the visual results. In tables 4 and 5, attention shares in texts are shown. The data illustrates the percentual attention coverage in a text: the overall share of a text covered by each attention type is shown, for both models. In tables 6-9, the average attention span is presented. The results of the two models show a different attention stretch over nearby words. Here, the average number of nearby words covered with the same attention type is presented for both models and all data-sets. In all tables, additional cumulative data is presented as well: average values for all texts, and for texts belonging to one of the 3 text-length groups, are also included. In tables 10-13, the percentual share of context-specific words is presented. The context-specific words are those used in assigning texts to a particular domain; in this case, the 2 domains used are "sports" and "tech". The percentual share is a ratio obtained by looking at the number of context-specific words among all of the words the models paid attention to. In tables 14-21, statistical data representing the shares of parts of speech is given. First, in tables 14-17, the most significant percentual shares of parts of speech in a particular attention type (strong, weak, etc.) are presented. Later, in tables 18-21, the percentual share of parts of speech in all of the attentions for a given text is presented.

3.1 Visualizations

The results for the BiDir-LSTM-CNN model are color-coded with the red-yellow-green scheme. The colors correspond to the level of the NN's attention to words while training on and classifying texts. In BERT, the yellow background color is not present; in this model, the lack of color represents the model's absence of interest in a particular word. For BiDir-LSTM-CNN, the attention thresholds divide the attention weight range into thirds, one for each attention type. For BERT, on the other hand, the division is into halves for each attention type. More about the color-coding schemes can be seen in section 2.4.

As presented in appendices A-E, several different results have been achieved. The visualizations show differences in the attention paid by the networks to words, depending on the texts' lengths, word diversity, level of pre-processing, and contextual domain. The results also show how a NN shifts focus to different words when the same text is structured differently. Models trained on the original texts present different results than the same models trained on the same texts summarized to 5 and 10 sentences. Similar observations can be made about texts that are lemmatized and stemmed. Although essentially the same texts are used in the classification, the text distortions caused by pre-processing operations have a significant influence on the models' reasoning about them. The influence a text's formatting has on the BiDir-LSTM-CNN model's reasoning differs substantially from its influence on BERT's.

3.2 Data-sets used and texts presented in the analysis

The texts chosen for the visualization illustrate a typical result achieved in training and classification. For "sports", a representative text for the domain is chosen. It is a typical text with sports as its subject. It is used in the training of both models, with sports as its context. The models are expected to learn to classify this text with this domain in mind. For "tech", a representative example of a domain-specific text is chosen as well. In this case, the models' effort to reason about the text within the boundaries of the domain is also expected. The text holds well to the subject of technology. By learning on domain-specific data-sets, the models can achieve a better understanding of the importance of individual words for that domain. Both texts can be seen in appendices A-E. Both texts are classified correctly.

3.3 Models' accuracy

The accuracy of a model is the share of correctly classified texts in a data-set. For the original texts, the accuracy is 85% for BiDir-LSTM-CNN and 52% for BERT. All accuracy values for both models are presented in tables 2 and 3. In both cases, the models' classification accuracies are presented for all data-set types. The percentage values correspond to the level of accuracy a particular model has in its classification decisions. In this thesis' implementation of the models, the inputs differ from those used in the original scientific work. This is the reason why the results achieved here deviate from those in the original papers [21], [26]. The models were not originally developed for classifying fake news. Nonetheless, both models show a consistent classification accuracy above 50%, where 50% corresponds to chance-level classification of the actual label. For "sports", a noticeable advantage is demonstrated by BiDir-LSTM-CNN on the original text and the original lemmatized text, with values of 85% and 66%. BERT, despite consistently low values, demonstrates slightly higher values for the original text, the original text lemmatized, and the text summarized to 5 sentences. For "tech", both models show higher values for the original text and the text summarized to 5 sentences. The highest value for BiDir-LSTM-CNN in "tech" is achieved on the lemmatized text summarized to 5 sentences. For BERT, the highest value in "tech" is achieved on the text summarized to 5 sentences.

Table 2: BiDir-LSTM-CNN and BERT "sports" - Accuracy for all texts.

"sports"                       BiDir-LSTM-CNN accuracy   BERT accuracy
Original text                  85%                       52%
Original text lemmatized       66%                       50%
Original text stemmed          51%                       51%
Text max 5 sen.                50%                       51%
Text max 5 sen. lemmatized     50%                       50%
Text max 5 sen. stemmed        50%                       50%
Text max 10 sen.               51%                       50%
Text max 10 sen. lemmatized    50%                       50%
Text max 10 sen. stemmed       55%                       50%

Table 3: BiDir-LSTM-CNN and BERT "tech" - Accuracy for all texts.

"tech"                         BiDir-LSTM-CNN accuracy   BERT accuracy
Original text                  52%                       51%
Original text lemmatized       50%                       50%
Original text stemmed          51%                       50%
Text max 5 sen.                53%                       52%
Text max 5 sen. lemmatized     56%                       51%
Text max 5 sen. stemmed        50%                       50%
Text max 10 sen.               50%                       51%
Text max 10 sen. lemmatized    50%                       50%
Text max 10 sen. stemmed       53%                       50%

3.4 Attention weights coverage

Attention weight coverage represents the percentual amount of a text covered by attention weights. BERT's results show a more selective approach to assigning attention to words, while BiDir-LSTM-CNN shows a broader attention coverage. As presented in tables 4-5, BiDir-LSTM-CNN has double the coverage of strong attention compared to BERT. For moderate attention in BiDir-LSTM-CNN and weak attention in BERT, the situation seems to be reversed: in this group, it is BERT that has double the attention coverage in comparison to BiDir-LSTM-CNN. The weak attention for BiDir-LSTM-CNN and the absence of attention for BERT seem to be roughly the same.

3.5 Attention weights span over nearby words

The attention weight span represents the number of adjacent words covered with the same attention type. Both models' architectures and approaches to input-data processing influence the classification process. For BiDir-LSTM-CNN, attentions in the original, longer texts tend to have spans stretching over a higher number of words than in BERT. As presented in the statistical data in tables 6-7, a median of 39 and 129 words in succession are covered by strong or weak attention in BiDir-LSTM-CNN. BERT, on the other hand, has an attention span of 2 words, as can be seen in tables 8-9. The median is used, since average values are exaggerated by BiDir-LSTM-CNN's overblown attention span coverage in some of the texts. Nevertheless, this difference in attention spans indicates BiDir-LSTM-CNN's broader contextual consideration of texts compared to BERT. A median taken across all of the text groups shows a span of 15 and 9 words for strong attention for BiDir-LSTM-CNN in the two domains. Meanwhile, BERT holds a 2-word attention span across all of the texts. Although BiDir-LSTM-CNN has a higher attention span than BERT, there is a lot of volatility in its statistical numbers.

3.6 Domain specific words

The data-sets used were assigned to one of two contextual groups, "sports" and "tech". The contextual grouping was done with the help of keyword association: some words in a text are characterized as belonging to a particular domain. For example, the word "ball" is characterized as belonging to the "sports" domain. Only nouns are used as domain-specific keywords. Tables 10 to 13 show the share of domain-specific words among all of the words the models paid attention to, in all of the attention groups. For BiDir-LSTM-CNN, in texts with the original length, about 3% of the words with strong attention are domain-specific, for both "sports" and "tech". BERT seems to follow this pattern, with slightly higher values for "tech". It is worth noting that the data is consistent for both models, with very few exceptions. For summarized texts, the data shows a higher variation of results for both models, with values ranging from a 0% to an 8% share of domain-specific words. For the moderate, weak, and no attention categories, there is a lot of difference in values between the models and the types of text. The shares of domain-specific words differ drastically depending on the text's length and the type of pre-processing operation the text was exposed to. All of the specifics can be seen in the tables.

3.7 Parts of speech

Not only words from a specific domain, but different parts of speech as well, influence the learning and classification process. In tables 14 to 21, the shares of words belonging to a particular part of speech are presented. Both models show an overwhelming dominance of nouns, followed by verbs, as the parts of speech with the highest attention. With a couple of exceptions, these two groups of words seem to have the foremost influence on both models' reasoning about the texts. For the shares in each of the attention groups, presented in tables 14-17, and for the shares in all of the attentions together, presented in tables 18-21, the dominant parts of speech are nouns and verbs. In tables 14-17, it can be seen that in every attention group, for both models, nouns and verbs are the dominant parts of speech taken into consideration by the models. In tables 18-21 as well, it can be seen that of all the words the models paid attention to in a particular text, nouns are the type of word both models paid strong attention to the most.

4 Analysis

Many different aspects have to be taken into account when analyzing the classification process of NN models for text classification and fake news detection. First, the model itself and its architecture have to be taken into consideration. A model's learning and classification capabilities are going to be affected by the way it is designed. Second, the data and its representation influence the learning process. The texts on which models are trained impact the way a model processes data while classifying new texts. A text's length, its complexity, its stylistic domain, lemmatization, and stemming influence the models' reasoning as well. All of these aspects influence the way models learn to distinguish what is important, and what is not, in a given classification. The results of learning, and how they affect the models' ability to classify, can be analyzed by studying attention weights. Such a study gives a degree of explainability of NN models for text classification and fake news detection. In this section, the two models are analyzed from the perspective of all the aforementioned aspects affecting the learning and classification processes in fake news detection.

4.1 Architectures

The two models chosen for this analysis differ from each other in the way they process input data. These architectural differences translate into the way attention is assigned to words in a processed text. As can be seen in section 3, and in appendices A-E, the two models demonstrate different approaches to distributing their attention to words in all of the texts examined. BiDir-LSTM-CNN's approach to data processing is sequential: words are processed from left to right and from right to left. This processing strategy causes the model to pay attention to a larger number of words. Attentions in this model tend to spread over a larger number of words, sometimes covering whole sentences. BERT, on the other hand, tends to be more selective in distributing its attention weights; attentions in this model are scattered throughout the text. BERT looks at the whole contextual structure of every text it is processing. For BiDir-LSTM-CNN to make a decision when classifying a text, broader contextual data is needed. For the original text presented in appendix B, for example, this model pays strong attention to almost every word present in the text to make a classification decision. The model seems to struggle to cherry-pick the nuances existing in texts that could be used to generalize the classification process. Instead, it seems to need a wider amount of data to come to its conclusion. This behavior can be observed in all of the original texts processed by BiDir-LSTM-CNN, for both "sports" and "tech". Although attentions shift from strong to weak in these examples, there is still a repeating pattern of attention spanning over a large number of words. This observation is confirmed by the statistical data shown in tables 6 and 7. When dealing with versatile, large amounts of input data, BiDir-LSTM-CNN seems to have a very wide attention span. BERT, on the other hand, demonstrates a very different approach to attention distribution. The attention weights are scattered throughout the text. This behavior, too, is a consequence of the way BERT is constructed. It seems that its architecture, where the whole contextual structure of a given text is processed simultaneously, helps in identifying nuances in a given text. BERT seems to cope better with finding keywords in texts for a given classification. As can be seen in appendices A-E, the model generalizes its classification process, basing decisions on particular words in texts. There is no situation where a broader contextual attention is used, in any of the original texts, for this model.

4.2 Length of input-data

The length of the input data is another important factor in the models' ability to learn. All of the nuances and dependencies in texts have to be carried over from one layer to another. If the input data is very long, it can lead to a corruption of the learning process; an example of such corruption is the problem of vanishing/exploding gradients explained in section 2.2.2. The two models utilize different approaches to processing texts of different lengths. For BiDir-LSTM-CNN, when a text is very long, the model seems to either try to pay attention to every word in the text or to ignore large chunks of it. It seems that this model cannot cope with overly long texts and tries to find a broader context to ease the classification process. This problem of not coping with long texts seems to be supported by the way the model processes short texts: in this case, BiDir-LSTM-CNN seems to be more selective in its distribution of attention weights in some situations, while paying attention to whole texts in others. BERT, on the other hand, is consistent in its attention distribution. This model is more selective in its choice of the parts of the text important for classification. A text's length does not influence BERT's way of distributing weights. This observation seems logical when the architecture is taken into account. BiDir-LSTM-CNN, having problems coping with longer texts, tries to overcompensate by paying attention to a larger group of words. In contrast, BERT has the same approach to finding contextual dependencies in a text independently of its length. From the accuracy standpoint, it is BiDir-LSTM-CNN that is most effective in classifying text. This model not only has the highest accuracy for the original text, but the highest accuracy overall. Even though the texts are overly long, the model's approach of paying attention to a larger amount of words seems to pay off. The accuracy of this model seems to drop drastically when dealing with shorter texts. This behavior suggests that BiDir-LSTM-CNN is best suited for longer texts. In contrast, BERT achieves similar accuracy independently of the length of a text. This model's accuracy values are not as good as those of BiDir-LSTM-CNN, but they are consistent for all text lengths.

4.3 Domain specific words

Texts belonging to a certain domain are characterized by the presence of words typical for that domain. This contextual repetitiveness of domain-specific words eases the models' ability to learn dependencies in texts. It helps to identify the words or phrases having the biggest impact on the classification, by keeping all of the texts in contextual coherence with each other. The models exhibit different approaches to text processing when the domain in which words appear within a text changes. The shared contextual domain of all of the texts provided for learning helps to find nuances and dependencies, with the domain as an underlying subject. Both models tend to pay strong, or at least moderate, attention (weak in the case of BERT) to words identified as belonging to a particular domain. Both models exhibit a pattern of reasoning based on the contextual domain in which they are trained. For "sports", words like "stadium" and "team" are of interest, while for "tech", words like "screen" and "camera" are interesting. This behavior is due to the learning process itself. While the models learn to classify texts from the same contextual domain, and the same words keep coming up over and over again in different texts, they learn to pay attention to those words. These words become the central interest of the models while developing generalized dependencies, which later are passed into classification.

4.4 Pre-processing

The pre-processing techniques used in this analysis are lemmatization and stemming. These techniques are explained in section 2.1. From a human perspective, such pre-processing does not help but rather distorts texts, clouding their meaning. From an NN perspective, however, these operations help to minimize the complexity of texts. For an NN model, there is no difference whether words are grammatically correct or not. NN models do not operate on an understanding of texts as humans do; for an NN model, every word is a variable with connections and dependencies to other words in a particular text. By minimizing the number of unique words, for example by stemming words to their root form, the complexity of each text is reduced. Models can still find relationships between words, just not based on their grammatical structure. For BiDir-LSTM-CNN, the effect of pre-processing can be seen in the ”sports” texts presented in the appendices. Texts subjected to lemmatization and stemming tend to receive attention in a more selective manner than the original, non-pre-processed texts. Since the vocabulary in the pre-processed texts is smaller, models have a greater chance of finding dependencies between words that matter for a particular classification. The focus of attention also shifts to different words when pre-processing operations are applied to texts. When a word exists in texts in different tense forms, some nuances are not discovered. However, if a word is in its root form, textual complexity is reduced and new dependencies suddenly appear. This situation can be seen in ”sports”, in the text summarized to 10 sentences,

in appendix C. In this example, the original and stemmed texts keep the word ”going” in its original form. This causes BiDir-LSTM-CNN not to pay attention to this part of the text. On the other hand, when the word ”going” is lemmatized and simplified to its root form ”go”, the model starts to perceive this part of the text as an important aspect of the classification. In every text where words are in their original form, some dependencies are missed since there are too many variables. Once words are brought to their root forms, new dependencies are observed, giving a better generalization of the classification. Simplifying texts and finding more dependencies does not, however, guarantee that a model is going to be successful. From the accuracy perspective, models tend to achieve better accuracy on texts that are not subjected to pre-processing operations. In almost all cases, models get worse accuracy values on variations of texts subjected to pre-processing operations than on the original texts. It cannot be concluded whether stemming or lemmatization has the more positive effect on classification. However, by comparing original texts with pre-processed ones, it can be concluded that both models tend to get better classification results when texts are not processed in any way.
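As a point of reference for the operations discussed in this section, a minimal sketch using NLTK's Porter stemmer and WordNet lemmatizer is given below. The example sentence is invented, and the thesis pipeline itself may tokenize and lemmatize differently; the sketch only illustrates how both operations shrink the vocabulary.

# A minimal sketch of the two pre-processing operations discussed above,
# using NLTK. The example sentence is invented; the pipeline used in this
# thesis may tokenize and tag words differently.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # needed by WordNetLemmatizer

tokens = "he runs daily he was running yesterday and he will run tomorrow".split()

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [stemmer.stem(t) for t in tokens]
# WordNet lemmatization needs a part-of-speech hint; 'v' (verb) is assumed
# here so that inflected forms such as "running" collapse to "run".
lemmatized = [lemmatizer.lemmatize(t, pos="v") for t in tokens]

print("original  :", tokens)
print("stemmed   :", stemmed)      # "was" -> "wa": stems need not be real words
print("lemmatized:", lemmatized)   # "was" -> "be", "running" -> "run"

# Both operations reduce the number of unique word forms the model must learn.
print(len(set(tokens)), "unique ->", len(set(stemmed)), "stemmed,",
      len(set(lemmatized)), "lemmatized")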

4.5 Parts of speech

The most dominant parts of speech used in classification are nouns and verbs. For both models, in all texts examined, nouns and verbs get the highest attention. As can be seen in tables 16-21, the models seem to recognize the fact that other parts of speech, like adjectives, pronouns, and adverbs, are largely the same across all texts. What makes a particular sentence unique, and what makes a difference in a text, are the nouns and verbs used. The models recognize this and adapt to focus on them. While learning, and later when classifying, these two types of words end up receiving the highest attention across all attention types. While training, the models develop an understanding of the importance of nouns and, in combination with the contextual domain, can distinguish which nouns should receive strong attention and which should be omitted while classifying. Evidence of the models' understanding of the importance of nouns and verbs can be seen in statistical tables 17 and 21. Not only within each particular attention group, but also in their share of all attentions combined, these two parts of speech are indisputably dominant. This translates well to a human understanding of language: using only nouns and verbs, it would be possible to express a thought closely resembling the intended meaning.
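The share of nouns and verbs among highly attended words can be estimated with a standard part-of-speech tagger. The sketch below uses NLTK's Penn Treebank tagger on an invented list of (word, attention) pairs; it mirrors the idea behind tables 14-21 rather than the exact procedure used to produce them.

# A small sketch of measuring the share of parts of speech among strongly
# attended words with NLTK. The token list and its attention labels are
# invented for illustration only.
import nltk
from collections import Counter

nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)  # name used by newer NLTK versions

# (token, attention label) pairs, as they might come out of a visualization
tokens = [("the", "no"), ("team", "strong"), ("won", "strong"),
          ("the", "no"), ("final", "weak"), ("match", "strong"),
          ("at", "no"), ("wembley", "strong"), ("yesterday", "weak")]

tags = nltk.pos_tag([t for t, _ in tokens])  # Penn Treebank tags

strong = Counter()
for (word, attention), (_, tag) in zip(tokens, tags):
    if attention == "strong":
        # NN* = nouns, VB* = verbs, JJ* = adjectives, RB* = adverbs, ...
        strong[tag[:2]] += 1

total = sum(strong.values())
for pos, count in strong.most_common():
    print(f"{pos}: {count / total:.0%} of strongly attended words")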

5 Conclusion

The architecture alone is not the definitive reason why a model is successful or not. For a model to be successful, a lot of different aspects have to be taken into consideration. The data used in learning, along with its length, complexity, and pre-processing, has a big impact on the learning process. This is why a level of explainability of NN models for text classification and fake news detection has a positive impact on the creation of successful models. Two architectures have been analyzed in this thesis: first, BiDir-LSTM-CNN, which looks at the data sequentially, word by word, from left to right and from right to left; second, BERT, which finds dependencies between words across the whole text. Both models have proven to be successful to different degrees when classifying texts in the direction of fake news detection. The outcome of the analysis showed how different the two models' approaches to learning and classifying texts are. BiDir-LSTM-CNN takes larger chunks of text under consideration or ignores them entirely, while BERT is more selective in distributing attention. There is also a big difference in the two models' learning process when it comes to shorter texts or texts pre-processed in some way. Sometimes, as with BiDir-LSTM-CNN when a text is too long for the model to handle, it tries to overcompensate by paying attention to the whole text. In this situation pre-processing, although messy from a human perspective, can help reduce text complexity and ease the model's learning. On the other hand, pre-processing texts does not necessarily mean that a model is going to achieve adequate accuracy. It seems that an NN architect must balance the representation of the data-set between the length of the input and its complexity to achieve the best accuracy possible. Another aspect of learning is the fact that nouns and verbs are the primary candidates to influence a model's learning and classification, which is why the texts best suited for learning are those rich in words belonging to these parts of speech. BiDir-LSTM-CNN is superior in its accuracy, scoring 85% on the original texts and 65% on the original lemmatized texts. In contrast, BERT is consistent, with an accuracy of 50% for all data-sets. Overall, the explainability of NN models cannot be reduced to text length, pre-processing, or word patterns; an NN architect should experiment with different text representations to get the best results. The conclusion is that it is possible to provide a degree of explainability to NN models for text classification and fake news detection, as shown in the analysis. Explainability of neural-based models for the task of fake news detection via text classification can be provided by taking a closer look at the state of the data-set used and the model's architecture. However, although explainability can be provided, all aspects of text preparation before training should be tested and experimented with to achieve the best results.

6 Future work

In this thesis, an analysis is made of the models' ability to detect patterns in order to classify texts and detect fake news. Future work could expand this into a broader statistical analysis. A larger number of data-sets could be used in training and in generating visualizations and statistical data, which would give an even deeper understanding of the subject. Moreover, the influence of context on the learning and classification processes could be analyzed further by expanding the data-sets with additional randomized sets not belonging to any domain. These additional data-sets could be cross-checked with the domain-specific sets used in this thesis. More extensive training on a larger number of texts labeled as true or false would also be beneficial in examining text classification for fake news detection. Finally, an experiment involving training models on a set of texts from one domain and using them to classify texts from a different domain could give more information about attention and its distribution.

A Overview of attention visualizations for BiDir-LSTM-CNN and BERT - all cases

Figure organization:

• Figure 8, BiDir-LSTM-CNN ”sports”.

• Figure 9, BERT ”sports”.

• Figure 10, BiDir-LSTM-CNN ”tech”.

• Figure 11, BERT ”tech”.

Overview organization:

1. Original text.

2. Original text - lemmatized.

3. Original text - stemmed.

4. Text summarized max 5 sentences.

5. Text summarized max 5 sentences - lemmatized.

6. Text summarized max 5 sentences - stemmed.

7. Text summarized max 10 sentences.

8. Text summarized max 10 sentences - lemmatized.

9. Text summarized max 10 sentences - stemmed.


Figure 8: BiDir-LSTM-CNN ”sports” - Visual overview of all results.

Figure 9: BERT ”sports” - Visual overview of all results.

Figure 10: BiDir-LSTM-CNN ”tech” - Visual overview of all results.

Figure 11: BERT ”tech” - Visual overview of all results.

B Visual representations of Attention Mechanism for BiDir-LSTM-CNN - ”sports”

Figure 12: BiDir-LSTM-CNN ”sports” - no pre-processing, typical example.

Figure 13: BiDir-LSTM-CNN ”sports” - Lemmatization, typical example.

Figure 14: BiDir-LSTM-CNN ”sports” - Stemmed, typical example.

Figure 15: BiDir-LSTM-CNN ”sports” - Summarized max 5 sentences in training, typical example.

Figure 16: BiDir-LSTM-CNN ”sports” - Summarized max 5 sentences in training, Lemmatization, typical example.

Figure 17: BiDir-LSTM-CNN ”sports” - Summarized max 5 sentences in training, Stemmed, typical example.

Figure 18: BiDir-LSTM-CNN ”sports” - Summarized max 10 sentences in training, typical example.

Figure 19: BiDir-LSTM-CNN ”sports” - Summarized max 10 sentences in training, Lemmatization, typical example.

Figure 20: BiDir-LSTM-CNN ”sports” - Summarized max 10 sentences in training, Stemmed, typical example.

C Visual representations of Attention Mechanism for BiDir-LSTM-CNN - ”tech”

Figure 21: BiDir-LSTM-CNN ”tech” - no pre-processing, typical example.

Figure 22: BiDir-LSTM-CNN ”tech” - Lemmatization, typical example.

Figure 23: BiDir-LSTM-CNN ”tech” - Stemmed, typical example.

Figure 24: BiDir-LSTM-CNN ”tech” - Summarized max 5 sentences in training, typical example.

Figure 25: BiDir-LSTM-CNN ”tech” - Summarized max 5 sentences in training, Lemmatization, typical example.

Figure 26: BiDir-LSTM-CNN ”tech” - Summarized max 5 sentences in training, Stemmed, typical example.

Figure 27: BiDir-LSTM-CNN ”tech” - Summarized max 10 sentences in training, typical example.

Figure 28: BiDir-LSTM-CNN ”tech” - Summarized max 10 sentences in training, Lemmatization, typical example.

Figure 29: BiDir-LSTM-CNN ”tech” - Summarized max 10 sentences in training, Stemmed, typical example.

D Visual representations of Attention Mechanism for BERT - ”sports”

Figure 30: BERT ”sports” - no pre-processing, typical example.

Figure 31: BERT ”sports” - Lemmatization, typical example.

Figure 32: BERT ”sports” - Stemmed, typical example.

Figure 33: BERT ”sports” - Summarized max 5 sentences in training, typical example.

Figure 34: BERT ”sports” - Summarized max 5 sentences in training, Lemmatization, typical example.

Figure 35: BERT ”sports” - Summarized max 5 sentences in training, Stemmed, typical example.

Figure 36: BERT ”sports” - Summarized max 10 sentences in training, typical example.

Figure 37: BERT ”sports” - Summarized max 10 sentences in training, Lemmatization, typical example.

Figure 38: BERT ”sports” - Summarized max 10 sentences in training, Stemmed, typical example.

E Visual representations of Attention Mechanism for BERT - ”tech”

Figure 39: BERT ”tech” - no pre-processing, typical example.

Figure 40: BERT ”tech” - Lemmatization, typical example.

Figure 41: BERT ”tech” - Stemmed, typical example.

Figure 42: BERT ”tech” - Summarized max 5 sentences in training, typical example.

Figure 43: BERT ”tech” - Summarized max 5 sentences in training, Lemmatization, typical example.

Figure 44: BERT ”tech” - Summarized max 5 sentences in training, Stemmed, typical example.

Figure 45: BERT ”tech” - Summarized max 10 sentences in training, typical example.

Figure 46: BERT ”tech” - Summarized max 10 sentences in training, Lemmatization, typical example.

Figure 47: BERT ”tech” - Summarized max 10 sentences in training, Stemmed, typical example.

F Statistical data

Attention share - The percentage of a text covered by a given attention type.

Attention span - The number of adjacent words covered by the same attention type.
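A minimal sketch of how these two statistics could be computed from a list of per-word attention labels is given below; the label sequence is invented purely for illustration.

# A minimal sketch of computing attention share and attention span from a
# list of per-word attention labels (the labels here are invented).
from itertools import groupby
from statistics import mean

labels = ["strong", "strong", "weak", "no", "no", "strong",
          "weak", "weak", "weak", "no"]

# Attention share: percentage of the text covered by each attention type.
for kind in ("strong", "weak", "no"):
    share = labels.count(kind) / len(labels)
    print(f"{kind:6s} share: {share:.0%}")

# Attention span: number of adjacent words carrying the same attention type.
spans = [(kind, len(list(group))) for kind, group in groupby(labels)]
print(spans)  # e.g. [('strong', 2), ('weak', 1), ('no', 2), ...]

# Average span per attention type, as reported in Tables 6 - 9.
for kind in ("strong", "weak", "no"):
    lengths = [n for k, n in spans if k == kind]
    print(kind, "avg span:", mean(lengths))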

Table organization:

• Table 4, BiDir-LSTM-CNN and BERT ”sports” attention shares in texts.

• Table 5, BiDir-LSTM-CNN and BERT ”tech” attention shares in texts.

• Table 6, BiDir-LSTM-CNN ”sports” average attention span in texts.

• Table 7, BiDir-LSTM-CNN ”tech” average attention span in texts.

• Table 8, BERT ”sports” average attention span in texts.

• Table 9, BERT ”tech” average attention span in texts.

• Table 10, BiDir-LSTM-CNN ”sports” percentage share of context words.

• Table 11, BiDir-LSTM-CNN ”tech” percentage share of context words.

• Table 12, BERT ”sports” percentage share of context words.

• Table 13, BERT ”tech” percentage share of context words.

• Table 14, BiDir-LSTM-CNN ”sports” percentage share of parts of speech in a given attention.

• Table 15, BiDir-LSTM-CNN ”tech” percentage share of parts of speech in a given attention.

• Table 16, BERT ”sports” percentage share of parts of speech in a given attention.

• Table 17, BERT ”tech” percentage share of parts of speech in a given attention.

• Table 18, BiDir-LSTM-CNN ”sports” percentage share of parts of speech in all attentions.

• Table 19, BiDir-LSTM-CNN ”tech” percentage share of parts of speech in all attentions.

• Table 20, BERT ”sports” percentage share of parts of speech in all attentions.

• Table 21, BERT ”tech” percentage share of parts of speech in all attentions.

Table 4: BiDir-LSTM-CNN and BERT ”sports” overall attention coverage of texts.

Coverage ”sports”           | BiDir-LSTM-CNN (Strong / Moderate / Weak) | BERT (Strong / Weak / No attent.)
Original text               | 82% 18% 0%   | 31% 41% 28%
Original text lemmatized    | 21% 7% 72%   | 40% 39% 21%
Original text stemmed       | 91% 2% 7%    | 37% 39% 24%
Text max 5 sen.             | 4% 17% 79%   | 48% 24% 28%
Text max 5 sen. lemmatized  | 31% 69% 0%   | 45% 45% 10%
Text max 5 sen. stemmed     | 10% 24% 66%  | 55% 31% 14%
Text max 10 sen.            | 55% 33% 12%  | 45% 46% 9%
Text max 10 sen. lemmatized | 11% 32% 57%  | 35% 55% 10%
Text max 10 sen. stemmed    | 64% 10% 26%  | 41% 50% 9%
Avrg. original texts        | 64% 9% 27%   | 36% 40% 24%
Avrg. max 5 sen. texts      | 15% 37% 48%  | 50% 33% 17%
Avrg. max 10 sen. texts     | 43% 25% 32%  | 40% 51% 9%
Tot. avrg.                  | 59% 13% 28%  | 37% 41% 22%

Table 5: BiDir-LSTM-CNN and BERT ”tech” overall attention coverage of texts.

Coverage ”tech”             | BiDir-LSTM-CNN (Strong / Moderate / Weak) | BERT (Strong / Weak / No attent.)
Original text               | 41% 57% 2%   | 32% 48% 20%
Original text lemmatized    | 100% 0% 0%   | 34% 46% 20%
Original text stemmed       | 97% 2% 1%    | 46% 42% 12%
Text max 5 sen.             | 63% 36% 1%   | 49% 40% 11%
Text max 5 sen. lemmatized  | 95% 5% 0%    | 47% 47% 6%
Text max 5 sen. stemmed     | 91% 8% 1%    | 32% 59% 9%
Text max 10 sen.            | 22% 29% 49%  | 35% 52% 13%
Text max 10 sen. lemmatized | 13% 48% 39%  | 34% 55% 11%
Text max 10 sen. stemmed    | 99% 1% 0%    | 39% 51% 10%
Avrg. original texts        | 79% 20% 1%   | 37% 46% 17%
Avrg. max 5 sen. texts      | 83% 16% 1%   | 43% 48% 9%
Avrg. max 10 sen. texts     | 45% 26% 29%  | 36% 52% 12%
Tot. avrg.                  | 68% 21% 11%  | 38% 48% 14%

Table 6: BiDir-LSTM-CNN ”sports” average and median word attention span in texts.

BiDir-LSTM-CNN ”sports”     | Average word attent. span (Strong / Moderate / Weak) | Median word attent. span (Strong / Moderate / Weak)
Original text               | 23 5 0    | 16 5 0
Original text lemmatized    | 19 4 41   | 20 4 44
Original text stemmed       | 410 2 17  | 410 2 17
Max 5 sen.                  | 1 2 23    | 1 2 23
Max 5 sen. lemmatized       | 5 9 0     | 5 9 0
Max 5 sen. stemmed          | 3 3 17    | 3 3 17
Max 10 sen.                 | 52 9 5    | 52 6 5
Max 10 sen. lemmatized      | 5 5 12    | 5 3 10
Max 10 sen. stemmed         | 19 2 8    | 7 2 4
Avrg. original texts        | 39 5 33   | 18 4 38
Avrg. max 5 sen. texts      | 3 5 13    | 3 3 17
Avrg. max 10 sen. texts     | 20 5 9    | 7 2 4
Tot. avrg.                  | 31 5 21   | 15 4 17

Table 7: BiDir-LSTM-CNN ”tech” average and median word attention span in texts.

BiDir-LSTM-CNN ”tech”       | Average word attent. span (Strong / Moderate / Weak) | Median word attent. span (Strong / Moderate / Weak)
Original text               | 17 7 1    | 12 5 1
Original text lemmatized    | 453 0 0   | 453 0 0
Original text stemmed       | 446 3 1   | 446 3 1
Max 5 sen.                  | 16 9 1    | 14 9 1
Max 5 sen. lemmatized       | 70 5 0    | 70 5 0
Max 5 sen. stemmed          | 68 3 1    | 68 3 1
Max 10 sen.                 | 6 3 9     | 6 2 11
Max 10 sen. lemmatized      | 3 3 5     | 2 3 3
Max 10 sen. stemmed         | 174 1 0   | 174 1 0
Avrg. original texts        | 129 5 1   | 18 4 1
Avrg. max 5 sen. texts      | 37 6 1    | 20 5 1
Avrg. max 10 sen. texts     | 17 4 6    | 4 3 4
Tot. avrg.                  | 49 5 5    | 9 3 3

Table 8: BERT ”sports” average and median word attention span in texts.

BERT ”sports”               | Average word attent. span (Strong / Weak / No attent.) | Median word attent. span (Strong / Weak / No attent.)
Original text               | 2 2 1 | 2 2 1
Original text lemmatized    | 2 2 1 | 2 2 1
Original text stemmed       | 2 2 1 | 2 2 1
Max 5 sen.                  | 3 2 1 | 2 1 1
Max 5 sen. lemmatized       | 1 1 1 | 1 1 1
Max 5 sen. stemmed          | 3 1 1 | 2 1 1
Max 10 sen.                 | 2 2 1 | 2 2 1
Max 10 sen. lemmatized      | 2 2 1 | 2 2 1
Max 10 sen. stemmed         | 2 2 1 | 2 2 1
Avrg. original texts        | 2 2 1 | 2 2 1
Avrg. max 5 sen. texts      | 2 1 1 | 1 1 1
Avrg. max 10 sen. texts     | 2 2 1 | 2 2 1
Tot. avrg.                  | 2 2 1 | 2 2 1

Table 9: BERT ”tech” average and median word attention span in texts.

BERT ”tech”                 | Average word attent. span (Strong / Weak / No attent.) | Median word attent. span (Strong / Weak / No attent.)
Original text               | 1 2 1 | 1 2 1
Original text lemmatized    | 2 2 1 | 1 2 1
Original text stemmed       | 2 3 1 | 2 2 1
Max 5 sen.                  | 3 2 1 | 2 1 1
Max 5 sen. lemmatized       | 2 2 1 | 2 2 1
Max 5 sen. stemmed          | 2 3 1 | 2 2 1
Max 10 sen.                 | 2 3 1 | 1 2 1
Max 10 sen. lemmatized      | 2 2 1 | 2 2 1
Max 10 sen. stemmed         | 2 3 1 | 2 2 1
Avrg. original texts        | 2 2 1 | 2 2 1
Avrg. max 5 sen. texts      | 2 2 1 | 2 2 1
Avrg. max 10 sen. texts     | 2 3 1 | 1 2 1
Tot. avrg.                  | 2 2 1 | 2 2 1

Table 10: BiDir-LSTM-CNN ”sports” percentage share of context word attentions in all attentions.

BiDir-LSTM-CNN ”sports”     | All attent. | Strong | Moderate | Weak
Original text               | 3%  | 3%  | 2%  | 5%
Original text lemmatized    | 3%  | 2%  | 8%  | 3%
Original text stemmed       | 3%  | 3%  | 0%  | 0%
Text max 5 sen.             | 3%  | 0%  | 14% | 0%
Text max 5 sen. lemmatized  | 3%  | 10% | 0%  | 0%
Text max 5 sen. stemmed     | 3%  | 25% | 0%  | 0%
Text max 10 sen.            | 3%  | 5%  | 0%  | 0%
Text max 10 sen. lemmatized | 3%  | 7%  | 3%  | 2%
Text max 10 sen. stemmed    | 3%  | 4%  | 0%  | 0%

Table 11: BiDir-LSTM-CNN ”tech” percentage share of context word attentions in all attentions.

BiDir-LSTM-CNN ”tech”       | All attent. | Strong | Moderate | Weak
Original text               | 3%  | 4%  | 2%  | 0%
Original text lemmatized    | 3%  | 3%  | 0%  | 0%
Original text stemmed       | 3%  | 3%  | 0%  | 0%
Text max 5 sen.             | 2%  | 0%  | 3%  | 0%
Text max 5 sen. lemmatized  | 1%  | 1%  | 0%  | 0%
Text max 5 sen. stemmed     | 1%  | 1%  | 0%  | 0%
Text max 10 sen.            | 3%  | 3%  | 4%  | 2%
Text max 10 sen. lemmatized | 3%  | 8%  | 2%  | 1%
Text max 10 sen. stemmed    | 3%  | 3%  | 0%  | 0%

Table 12: BERT ”sports” percentage share of context word attentions in all attentions.

BERT ”sports”               | All attent. | Strong | Weak | No attent.
Original text               | 4%  | 3%  | 5%  | 1%
Original text lemmatized    | 4%  | 3%  | 4%  | 11%
Original text stemmed       | 4%  | 2%  | 6%  | 2%
Text max 5 sen.             | 7%  | 0%  | 10% | 17%
Text max 5 sen. lemmatized  | 7%  | 7%  | 8%  | 0%
Text max 5 sen. stemmed     | 7%  | 6%  | 10% | 0%
Text max 10 sen.            | 4%  | 4%  | 5%  | 0%
Text max 10 sen. lemmatized | 4%  | 2%  | 6%  | 0%
Text max 10 sen. stemmed    | 4%  | 7%  | 2%  | 0%

Table 13: BERT ”tech” percentage share of context word attentions in all attentions.

BERT ”tech”                 | All attent. | Strong | Weak | No attent.
Original text               | 3%  | 4%  | 3%  | 2%
Original text lemmatized    | 3%  | 5%  | 4%  | 0%
Original text stemmed       | 3%  | 4%  | 1%  | 10%
Text max 5 sen.             | 1%  | 0%  | 3%  | 0%
Text max 5 sen. lemmatized  | 1%  | 0%  | 2%  | 0%
Text max 5 sen. stemmed     | 1%  | 0%  | 2%  | 0%
Text max 10 sen.            | 3%  | 3%  | 4%  | 0%
Text max 10 sen. lemmatized | 3%  | 4%  | 3%  | 0%
Text max 10 sen. stemmed    | 3%  | 3%  | 4%  | 6%

Table 14: BiDir-LSTM-CNN ”sports” percentage share of parts of speech in a given attention. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BiDir-LSTM-CNN ”sports” (N V J P R) | Strong attent.       | Moderate attent.     | Weak attent.
Original text               | 32% 15% 3% 8% 8%    | 24% 27% 2% 15% 8%   | 27% 19% 5% 8% 5%
Original text lemmatized    | 21% 22% 5% 10% 11%  | 29% 15% 6% 3% 3%    | 33% 16% 2% 8% 7%
Original text stemmed       | 32% 16% 3% 8% 7%    | 37% 27% 0% 9% 0%    | 20% 29% 0% 20% 11%
Text max 5 sen.             | 0% 0% 0% 0% 100%    | 57% 0% 0% 0% 0%     | 20% 4% 8% 4% 8%
Text max 5 sen. lemmatized  | 10% 10% 10% 0% 10%  | 35% 0% 4% 4% 9%     | 0% 0% 0% 0% 0%
Text max 5 sen. stemmed     | 25% 0% 0% 0% 25%    | 50% 12% 0% 0% 0%    | 19% 0% 10% 5% 10%
Text max 10 sen.            | 22% 12% 5% 8% 7%    | 24% 24% 0% 18% 12%  | 31% 6% 6% 6% 6%
Text max 10 sen. lemmatized | 29% 7% 0% 0% 0%     | 10% 23% 3% 10% 10%  | 30% 8% 5% 3% 8%
Text max 10 sen. stemmed    | 19% 18% 4% 4% 9%    | 44% 0% 0% 11% 0%    | 32% 13% 3% 19% 10%

Table 15: BiDir-LSTM-CNN ”tech” percentage share of parts of speech in a given attention. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BiDir-LSTM-CNN ”tech” (N V J P R) | Strong attent.       | Moderate attent.     | Weak attent.
Original text               | 36% 13% 5% 4% 4%    | 42% 9% 3% 7% 4%     | 43% 0% 14% 14% 0%
Original text lemmatized    | 39% 11% 4% 6% 4%    | 0% 0% 0% 0% 0%      | 0% 0% 0% 0% 0%
Original text stemmed       | 39% 11% 4% 6% 4%    | 33% 17% 17% 0% 0%   | 100% 0% 0% 0% 0%
Text max 5 sen.             | 0% 0% 0% 0% 100%    | 57% 0% 0% 0% 0%     | 20% 4% 8% 4% 8%
Text max 5 sen. lemmatized  | 42% 10% 4% 8% 2%    | 41% 7% 7% 3% 0%     | 100% 0% 0% 0% 0%
Text max 5 sen. stemmed     | 44% 8% 6% 4% 1%     | 17% 17% 0% 33% 0%   | 100% 0% 0% 0% 0%
Text max 10 sen.            | 46% 10% 3% 5% 5%    | 33% 15% 11% 7% 2%   | 41% 10% 1% 7% 4%
Text max 10 sen. lemmatized | 48% 4% 4% 4% 0%     | 38% 13% 2% 6% 4%    | 39% 11% 7% 8% 4%
Text max 10 sen. stemmed    | 4% 11% 4% 7% 4%     | 0% 0% 0% 0% 0%      | 0% 0% 0% 0% 0%

Table 16: BERT ”sports” percentage share of parts of speech in a given attention. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BERT ”sports” (N V J P R)   | Strong attent.       | Weak attent.         | No attent.
Original text               | 39% 14% 2% 8% 7%    | 38% 18% 3% 4% 7%    | 29% 13% 4% 1% 14%
Original text lemmatized    | 38% 13% 2% 7% 7%    | 35% 19% 5% 4% 10%   | 57% 16% 3% 5% 5%
Original text stemmed       | 31% 16% 1% 3% 9%    | 48% 13% 3% 9% 7%    | 32% 21% 11% 0% 13%
Text max 5 sen.             | 36% 7% 7% 0% 21%    | 20% 0% 10% 10% 0%   | 50% 0% 0% 0% 0%
Text max 5 sen. lemmatized  | 29% 0% 7% 0% 21%    | 38% 8% 0% 8% 0%     | 0% 0% 50% 0% 0%
Text max 5 sen. stemmed     | 28% 6% 6% 0% 16%    | 40% 0% 10% 10% 0%   | 0% 0% 0% 0% 0%
Text max 10 sen.            | 31% 22% 4% 4% 12%   | 40% 12% 5% 7% 7%    | 33% 0% 0% 0% 0%
Text max 10 sen. lemmatized | 31% 22% 4% 4% 13%   | 43% 8% 4% 6% 6%     | 0% 0% 0% 0% 0%
Text max 10 sen. stemmed    | 29% 12% 5% 7% 12%   | 42% 18% 4% 4% 8%    | 25% 50% 0% 0% 0%

Table 17: BERT ”tech” percentage share of parts of speech in a given attention. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BERT ”tech” (N V J P R)     | Strong attent.       | Weak attent.         | No attent.
Original text               | 40% 9% 4% 3% 2%     | 45% 17% 7% 5% 6%    | 47% 11% 0% 2% 0%
Original text lemmatized    | 38% 14% 7% 0% 2%    | 43% 12% 4% 4% 5%    | 56% 14% 0% 12% 4%
Original text stemmed       | 38% 13% 1% 7% 3%    | 43% 15% 6% 1% 4%    | 55% 7% 13% 3% 7%
Text max 5 sen.             | 51% 7% 2% 2% 2%     | 47% 13% 3% 10% 0%   | 25% 50% 25% 0% 0%
Text max 5 sen. lemmatized  | 60% 0% 0% 3% 0%     | 37% 16% 9% 9% 3%    | 100% 0% 0% 0% 0%
Text max 5 sen. stemmed     | 54% 11% 7% 0% 4%    | 44% 12% 2% 10% 0%   | 43% 0% 14% 0% 0%
Text max 10 sen.            | 46% 17% 0% 4% 4%    | 46% 12% 7% 4% 3%    | 32% 5% 11% 5% 5%
Text max 10 sen. lemmatized | 47% 14% 4% 4% 4%    | 42% 13% 5% 5% 4%    | 37% 12% 0% 12% 0%
Text max 10 sen. stemmed    | 42% 18% 5% 5% 3%    | 44% 11% 4% 5% 2%    | 47% 6% 6% 0% 18%

Table 18: BiDir-LSTM-CNN ”sports” percentage share of parts of speech in all attentions. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BiDir-LSTM-CNN ”sports” (N V J P R) | Strong attent.       | Moderate attent.     | Weak attent.
Original text               | 24% 11% 2% 6% 6%    | 4% 5% 1% 3% 1%      | 2% 1% 1% 1% 1%
Original text lemmatized    | 5% 5% 1% 2% 2%      | 2% 1% 1% 1% 1%      | 23% 11% 1% 6% 5%
Original text stemmed       | 30% 14% 2% 7% 8%    | 1% 1% 0% 0% 0%      | 1% 2% 0% 1% 1%
Text max 5 sen.             | 0% 0% 0% 0% 3%      | 12% 0% 0% 0% 0%     | 15% 3% 6% 3% 6%
Text max 5 sen. lemmatized  | 3% 3% 3% 0% 3%      | 24% 0% 3% 3% 6%     | 0% 0% 0% 0% 0%
Text max 5 sen. stemmed     | 3% 0% 0% 0% 3%      | 12% 3% 0% 0% 0%     | 12% 0% 6% 3% 6%
Text max 10 sen.            | 12% 6% 3% 5% 4%     | 7% 7% 0% 6% 4%      | 5% 1% 1% 1% 1%
Text max 10 sen. lemmatized | 4% 1% 0% 0% 0%      | 3% 7% 1% 3% 3%      | 18% 5% 3% 1% 5%
Text max 10 sen. stemmed    | 12% 11% 3% 3% 6%    | 4% 0% 0% 1% 0%      | 10% 4% 1% 6% 3%

Table 19: BiDir-LSTM-CNN ”tech” percentage share of parts of speech in all attentions. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BiDir-LSTM-CNN ”tech” (N V J P R) | Strong attent.       | Moderate attent.     | Weak attent.
Original text               | 21% 8% 3% 3% 2%     | 17% 4% 1% 3% 1%     | 1% 0% 1% 1% 0%
Original text lemmatized    | 39% 11% 4% 6% 4%    | 0% 0% 0% 0% 0%      | 0% 0% 0% 0% 0%
Original text stemmed       | 38% 11% 4% 6% 4%    | 1% 1% 1% 0% 0%      | 1% 0% 0% 0% 0%
Text max 5 sen.             | 26% 6% 3% 5% 1%     | 15% 3% 3% 1% 0%     | 1% 0% 0% 0% 0%
Text max 5 sen. lemmatized  | 41% 8% 5% 4% 1%     | 1% 1% 0% 3% 0%      | 0% 0% 0% 0% 0%
Text max 5 sen. stemmed     | 39% 8% 5% 4% 1%     | 1% 1% 0% 3% 0%      | 1% 0% 0% 0% 0%
Text max 10 sen.            | 10% 2% 1% 1% 1%     | 10% 4% 3% 2% 1%     | 21% 5% 1% 3% 2%
Text max 10 sen. lemmatized | 6% 1% 1% 1% 0%      | 18% 6% 1% 2% 2%     | 15% 4% 3% 3% 2%
Text max 10 sen. stemmed    | 40% 11% 4% 7% 4%    | 0% 0% 0% 0% 0%      | 0% 0% 0% 0% 0%

Table 20: BERT ”sports” percentage share of parts of speech in all attentions. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BERT ”sports” (N V J P R)   | Strong attent.       | Weak attent.         | No attent.
Original text               | 16% 6% 1% 3% 3%     | 17% 8% 1% 2% 3%     | 4% 2% 1% 1% 2%
Original text lemmatized    | 18% 7% 1% 3% 4%     | 15% 8% 2% 1% 4%     | 57% 4% 1% 1% 1%
Original text stemmed       | 14% 8% 1% 2% 4%     | 20% 6% 1% 4% 3%     | 3% 2% 1% 0% 1%
Text max 5 sen.             | 17% 3% 3% 0% 10%    | 7% 0% 3% 3% 0%      | 10% 0% 0% 0% 0%
Text max 5 sen. lemmatized  | 14% 0% 3% 0% 10%    | 17% 3% 0% 3% 0%     | 0% 0% 3% 0% 0%
Text max 5 sen. stemmed     | 17% 3% 3% 0% 10%    | 14% 0% 3% 3% 0%     | 0% 0% 0% 0% 0%
Text max 10 sen.            | 16% 12% 2% 2% 6%    | 18% 5% 2% 3% 3%     | 1% 0% 0% 0% 0%
Text max 10 sen. lemmatized | 15% 11% 2% 2% 6%    | 22% 4% 2% 3% 3%     | 0% 0% 0% 0% 0%
Text max 10 sen. stemmed    | 13% 5% 2% 3% 5%     | 22% 9% 2% 2% 4%     | 1% 2% 0% 0% 0%

Table 21: BERT ”tech” percentage share of parts of speech in all attentions. Only the 5 parts of speech with the highest values are presented.

N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb
BERT ”tech” (N V J P R)     | Strong attent.       | Weak attent.         | No attent.
Original text               | 15% 3% 1% 1% 1%     | 20% 8% 3% 2% 3%     | 8% 2% 0% 1% 0%
Original text lemmatized    | 16% 6% 3% 0% 1%     | 17% 5% 1% 1% 2%     | 10% 3% 0% 2% 1%
Original text stemmed       | 17% 6% 1% 3% 1%     | 19% 7% 3% 1% 2%     | 6% 1% 1% 1% 1%
Text max 5 sen.             | 29% 4% 1% 1% 1%     | 18% 5% 1% 4% 0%     | 1% 1% 3% 0% 0%
Text max 5 sen. lemmatized  | 25% 0% 0% 1% 0%     | 21% 9% 5% 5% 1%     | 26% 0% 0% 0% 0%
Text max 5 sen. stemmed     | 20% 4% 3% 0% 1%     | 24% 7% 1% 5% 0%     | 4% 0% 1% 0% 0%
Text max 10 sen.            | 18% 7% 0% 2% 2%     | 23% 6% 3% 2% 2%     | 3% 1% 1% 1% 1%
Text max 10 sen. lemmatized | 19% 6% 2% 2% 2%     | 22% 7% 3% 3% 2%     | 3% 1% 0% 1% 0%
Text max 10 sen. stemmed    | 18% 8% 2% 2% 1%     | 21% 5% 2% 2% 1%     | 4% 1% 1% 0% 2%
