Detecting Fake News on Twitter Using Machine Learning Models

Emma Cueva, Grace Ee, Akshat Iyer, Alexandra Pereira, Alexander Roseman

Dayrene Martinez*

New Jersey’s Governor’s School of Engineering and Technology
July 24, 2020

*Corresponding Author

Abstract—With the rising popularity of social media, people have become more aware of current events and important news, often through sources such as Twitter. One issue with these sources of news is the prevalence of false information, or fake news. Even as some social media platforms take initiative with labels or warnings, fake news continues to have dangerous consequences beyond misinformation. The goal of this research is to implement a highly effective method of identifying fake news spread on Twitter through the use of Artificial Intelligence (AI). More specifically, the investigation studied Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), and Natural Language Processing (NLP) networks to compare their accuracy when predicting fake news. The data was preprocessed and used to train the models that were developed; figures were then generated for analysis. All three models achieved high accuracy in detecting fake news; however, the NLP model was the only iteration that possessed the ability to identify satire as fake news. For this reason, the NLP model was the preferred choice for detecting fake news on Twitter.

I. INTRODUCTION

Along with the rise of technology, social media has become increasingly popular in the average person's daily life. Innovations allow people to absorb vast amounts of information on a daily basis. Social media provides its users with a platform to voice their thoughts and connects people around the world. However, a significant downside to these advances in technology is the increasing prevalence of false information. Fake news is defined as articles that misrepresent information to deceive and manipulate their audience. Such articles are "70% more likely to be retweeted on Twitter than true ones" [1]. The ripple effects of fake news can include increased bigotry, global misunderstandings of current events, and biased election outcomes [2].

Twitter is a popular social media platform where users can easily share links to articles regardless of validity. As a result, fake news is rampant. Current solutions to combating fake news are often heavily reliant on the initiative of readers. Social media users are encouraged to be vigilant regarding the news they see to avoid being manipulated. On average, humans identify lies with 54% accuracy, so the use of AI to spot fake news more accurately is a much more reliable solution [3]. Some AI programs have already been created to detect fake news; one such program, developed by researchers at the University of Western Ontario, performs with 63% accuracy [3].

Over the course of this project, by analyzing both real and fake news, several AI models were trained and optimized with the goal of increasing the accuracy of fake news detection. This paper discusses the development and comparative analysis of three AI models, LSTM neural networks, GRU, and NLP neural networks, that accurately detect fake news.

II. BACKGROUND

A. Machine Learning

Machine learning is an application of AI designed to look for patterns in large quantities of data and improve the predictive accuracy and classification of each data point without being explicitly programmed. It is regularly used online in targeted advertisements and recommendations on streaming services. By identifying patterns in fake news and comparing them to patterns in real news, machine learning can predict the falsity of information [4].
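As a toy illustration of learning patterns from labeled examples rather than from hand-written rules, the sketch below counts word frequencies in a handful of made-up headlines and labels new text by whichever class makes its words more likely. It is not one of the models studied in this paper: the function names `train` and `predict` and the four example headlines are hypothetical, and the scoring is a simple add-one smoothed word-likelihood comparison.

```python
import math
from collections import Counter


def train(examples):
    """Count word occurrences per label from (text, label) pairs."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Return 1 (fake) or 0 (real) by comparing smoothed log-likelihoods."""
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label in (0, 1):
        # Add-one smoothing so unseen words do not produce log(0).
        total = sum(counts[label].values()) + len(vocab)
        scores[label] = sum(
            math.log((counts[label][word] + 1) / total)
            for word in text.lower().split()
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: 1 = fake, 0 = real.
toy_examples = [
    ("shocking miracle cure doctors hate", 1),
    ("you will not believe this shocking secret", 1),
    ("senate passes budget bill", 0),
    ("city council approves budget for schools", 0),
]
toy_model = train(toy_examples)
print(predict(toy_model, "shocking secret cure"))
print(predict(toy_model, "council approves budget"))
```

With so few examples the result only reflects word overlap, but the same principle, estimating patterns from labeled data, underlies the larger models discussed in the rest of the paper.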

B. Deep Learning

Deep learning is an enhanced version of machine learning that can identify and magnify patterns that other forms of machine learning often miss [4]. Systems called neural networks use deep learning to mimic the structure of an organic brain. While machine learning uses algorithms to analyze data and apply the patterns it finds, deep learning develops an ensemble algorithm to learn how to make decisions on its own [5].

C. Natural Language Processing

1) NLP Pipeline: NLP is a field of artificial intelligence that uses text mining and analytics to process natural human language in applications such as chatbots, speech recognition, and targeted advertisements. NLP pipelines can be implemented into neural networks. They are designed to extract, analyze, and apply the critical information from a sample of raw language. These pipelines can be deconstructed into eight components [6].

• Sentence segmentation, where blocks of text are split into smaller samples that are easier for the network to analyze.
• Word tokenization, which occurs after the sentences are broken up and splits each sentence into its words. Doing so makes it easier for the network to analyze each word for its components, such as meaning, context, and part of speech.
• Part-of-speech tagging, where the network labels the words it has deemed most important with their parts of speech. This provides context for the network to identify how a word is being used and to connect it with adjacent text.
• Text lemmatization and stemming, which reduce words to their basic forms so that a network can more easily infer their meaning and connect them to other words. With stemming, the words extracted from the analysis, especially verbs, can be broken into roots that are more indicative of their definition. For example, “cleaned” would turn to “clean.” However, this process can produce stemmed results that are not coherent because it only strips suffixes. Lemmatization also changes words, but it ensures that the products are whole words in the language [7].
• Identifying stop words, which filters out filler words (like “a,” “the,” “and”) that are unnecessary for the neural network's analysis. These words are usually listed for the NLP pipeline to recognize and flag so that the network can ignore them.
• Dependency parsing, which reads through each sentence and determines how the tokenized words connect to each other to understand the full meaning. It also creates a hierarchy by identifying parent words and associating all other words with them.
• Identifying noun phrases, which is important for separating commonly-used nouns from those that hold a different meaning when combined with other words, such as in idioms like “it's raining cats and dogs” versus the normal usage of “dogs” alone.
• Named entity recognition (NER), where the network identifies proper nouns, such as “America” or “Donald Trump”, which may have their own associated backgrounds that would need to be identified. Additionally, due to the usage of pronouns in English, the network would need to identify proper nouns to connect them with pronouns used to refer to them.
• Co-reference resolution, the process of connecting all pronouns to the nouns that they refer to. This allows the network to consolidate and link the relevant information.

2) Word Embedding: Modern word embedding converts words to dense vectors by projecting them into a high-dimensional vector space. This allows neural networks to develop connections between words, and plays an integral role in the conversion of words to numbers (vectorization), which can then be processed. Embedding can be incorporated into neural networks via an embedding layer, which often utilizes pre-trained word embeddings, allowing for more accurate translation than retraining with every new model [8].

Fig. 1. These graphs depict Word2vec and GloVe word embeddings [9].

D. Neural Networks

While there are several kinds of neural networks, their general structures and the manners in which they process data are all similar, because they are modeled after the network of neurons in the brain. Neural networks are constructed with layers of densely interconnected nodes that pass input data from one layer to the next [10]. Fundamental layers include the input and output layers, as well as the hidden layers, where the unique features of each neural network appear. Within these layers, each node is assigned a weight to multiply with incoming data. If the product satisfies a threshold value, the data is passed on. These weights and threshold values are randomized initially, and then trained over time [10]. The final value is produced by the output layer and can be translated into a result; neural networks can be trained and adjusted until these experimental results match the expected outputs [9].

E. Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a type of neural network with a convolution operation in its hidden layers. This operation transforms inputs to outputs by filtering the data. It

reduces higher-level functions into a smaller, more condensed matrix [11]. CNNs perform these transformations through their hidden convolutional layers that are each equipped with several filters. These qualities allow for specialization in classification problems such as image recognition and pattern detection [12]. This convolution operation differentiates it from the multilayer perceptron (MLP).

Each filter can be mapped to a randomized matrix of a specified size. These filters convolve, or slide, across each matrix along the input while storing the dot product of each matrix. In the case of image recognition problems, early layers in a CNN may have filters that are designed to detect geometric patterns, such as edges and corners [12]. As these filters grow more complex in the network's deeper layers, they may begin to identify more sophisticated concepts, such as whole objects or animals [12].

N-grams are critical to understanding how CNNs handle NLP. N-grams can be thought of as one-dimensional versions of convolutions, with text split into sections (n-grams) of n words each and then converted into numbers. This allows the algorithm to account for connections between sequenced words rather than looking at each word in isolation, accomplishing a similar task to memory in an LSTM.

Fig. 2. This is an example of a split into uni-, bi- and trigrams [13].

F. Long Short Term Memory

Recurrent neural networks (RNN) are made of layers containing neuron-like nodes that have time-varying activation functions and a directed connection to every other node in the succeeding layer [14]. These networks also contain loops that allow them to analyze new information based on prior knowledge, solving the issue of short-term memory found in basic MLPs [15].

1) Structure: An LSTM is a type of RNN that specializes in solving problems that require context, such as handwriting recognition, speech recognition, and anomaly detection in intrusion detection systems (IDSs). An LSTM differs from other networks because of its unique structure consisting of four gates [16]. Figure 3 depicts how these gates interact with each other throughout the model. The first gate, the forget gate, selects which data the network will retain and which it will forget. The second gate, the input gate, determines how much information should be written into the Internal Cell State, which describes the data that was retained by the forget gate. The third gate, which is occasionally combined with the input gate, is the input modulation gate. It controls the information written onto the Internal Cell State by the input gate. The final gate, the output gate, determines what output will be generated from the current Internal Cell State, or the next hidden state, which gives the network a notion of past events. This hidden layer will reset after every test to separate the tested batches of input [16].

2) Hidden Layers: The hidden layers of an LSTM serve as the “memory” of the network, and it is crucial to identify how many layers and neurons to use in these layers. Using too many hidden neurons or layers could result in overfitting, which means that the model only learns to identify training data and does not do well with external data. Using too few could result in underfitting, which means that the model does not learn enough to identify either training or testing data, thereby outputting inaccurate results [14]. Additionally, while adding more layers increases precision, after a point, additions become suboptimal as the additional training time will not justify a minimal increase in precision [17]. In practice, the number of neurons and layers to use is determined by trial and error and cross-validation for each situation.

3) Dropout Layer: The goal of the Dropout layer is to reduce overfitting. Overfitting occurs when the model is too complex and overly adapted for the training data, causing it to perform well on training data but poorly on new data [18]. Dropout will randomly select nodes to ignore, decreasing the risk of an overfit model [19]. When first implementing this layer, the dropout rate should be around 20%; however, it is common for the rate to be as high as 50% [19]. The Dropout layer can appear anywhere in the model between the input and output layers.

*Past work has indicated that CNNs can be more effective in fake news classification when combined with LSTMs [8], [20].

G. Gated Recurrent Unit

The GRU network is also derived from RNNs and is similar to the LSTM network. Unlike the LSTM, it does not have a cell state, so it only has two gates, the reset gate and the update gate, as opposed to the LSTM's four gates [16]. The update gate performs the same role as the forget and input gates in the LSTM, throwing old information away and introducing new data. The reset gate selects old data to delete from the model's memory [16]. GRUs train at faster rates than LSTMs. However, there are benefits to using each model [16]. Figure 3 shows the differences between the structure of the LSTM and GRU. The GRU's simpler design allows for fewer parameters, but this can result in a reduced ability to run complex functions [21].

H. Optimization

1) Loss Function (Binary Cross-Entropy): The loss function measures how far a model's predictions are from the expected outputs. The goal of training a model is to make the loss function's output as small as possible [22]. Since news is classified as real or fake, the models use binary classification. Thus, the loss function employed is binary cross-entropy, which is designed

specifically for binary classification, predicting either a zero or a one for each input. The loss function can be expressed numerically [23].

\[ H_p(q) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(p(y_i)) + (1 - y_i) \cdot \log(1 - p(y_i)) \right] \]

Entropy refers to the uncertainty in a given distribution, with entropy approaching zero as uncertainty decreases. Therefore, a set with only one value would have an entropy of zero. Similarly, cross-entropy is calculated using an approximation of the distribution ($p(y)$ in the equation) instead of the distribution itself. To calculate loss, the loss values for each point, $y \cdot \log(p(y)) + (1 - y) \cdot \log(1 - p(y))$, are averaged, represented as the summation multiplied by $\frac{1}{N}$, with $N$ being the number of points in the data.

Fig. 3. Pictured is a comparison of LSTM and GRU models [16].

2) Gradient Descent: Gradient descent is a process that minimizes the loss of a model.

Fig. 4. This figure describes gradient descent, a process used to minimize the loss of a function by adjusting the weights in the opposite direction of the gradient [24].

Loss is defined as the error in the model with relation to the hidden-layer weights ($W_n$, for $n$ weights) [25]. After defining a loss function, the hidden layer weights need to be optimized to produce the lowest loss possible. The process calculates the gradient of a random point on the curve and uses the opposite of the gradient to travel down the loss function [25]. This process gets repeated until a minimum is reached. At this point the hidden layer locks in current weights, and continues to optimize the neural network [25].

Fig. 5. This is a 3D depiction of gradient descent [26].

3) Adam Optimizer: The Adam optimizer is a stochastic gradient descent method that uses an adaptive learning rate, meaning each parameter is calculated with a unique learning rate [27]. The individualized rates are determined by using a gradient's estimated first and second moments, where the nth moment is the expected value of the gradient raised to the nth power. These values are then adapted to the weights generated by the neural network [27]. Adam specializes in neural networks that use large quantities of data or many parameters [28].

Fig. 6. This is a graphical representation of the importance of using an appropriate learning rate [29].

4) Bayesian Optimizer: Bayesian optimizers are alternatives to active learning that decide which succeeding points to evaluate based on the information they already have [30]. The goal of Bayesian optimizers is to narrow down hyperparameters to optimize them and create a more productive model [31].

\[ x_{t+1} = \operatorname{argmax}(\alpha_{PI}(x)) = \operatorname{argmax}(P(f(x) \geq (f(x^{+}) + \epsilon))) \]

\[ P(\text{score} \mid \text{hyperparameters}) \]

5) Hyperparameters: It is important to note that hyperparameters are distinct from model parameters. In a machine learning model, the model parameters are learned properties of the training dataset, such as weights and biases [32]. On the other hand, model hyperparameters are properties, set before training, that affect how the classification model learns. Examples of

hyperparameters include the learning rate, number of epochs, hidden layers, hidden units and activation functions [32]. These properties are essential to the training process, and have an impact on the overall performance of the model. Quality hyperparameters achieve a balance between speed and precision, making optimization straightforward, especially when dealing with a large amount of data.

6) Epochs, Iterations, and Batches: One epoch occurs when the dataset moves through the model once. This involves both forward propagation and backward propagation. Forward propagation is where the algorithm calculates output based on input. Backward propagation, on the other hand, is when the model reverses the process to find and minimize sources of error. Given the iterative nature of algorithms such as gradient descent, fully training a model typically requires multiple epochs, meaning the full dataset must pass through multiple times [33]. The use of too many or too few epochs can lead to overfitting or underfitting, both of which harm model accuracy. Often, there is too much data to be passed into the model all at once. Instead, it gets divided into batches. The number of batches in an epoch equals the number of iterations in that epoch; specifically, the number of samples in one epoch is the batch size multiplied by the number of iterations.

7) Activation Functions: The activation function determines the output of each node in a model by transforming the input with weights and biases. This allows the model to better understand intricate data patterns. This specificity is what differentiates AI models from simple linear regression [34]. Different kinds of activation functions can be used depending on the task at hand.

The rectified linear unit (ReLU) function is a commonly used activation function in deep learning. When given a negative value, ReLU will return the value of zero. When given a positive value, ReLU will return the same value that was input. The function is represented by the equation $f(x) = \max(0, x)$ [36]. Since ReLU is a simple function, it works significantly faster than other activation functions and is efficient on large datasets.

Fig. 7. This is a graphed form of the ReLU activation function [35].

The sigmoid function, also known as the “squashing” function, limits an output between zero and one [36]. This feature makes it useful for networks that produce probabilities, and the function is also differentiable [36]. The function can be represented by the equation $f(x) = \frac{1}{1 + e^{-x}}$ [37].

Fig. 8. Shown is the form the sigmoid activation function takes on a graph [38].

I. Python Libraries

1) Pandas: Pandas, or Python Data Analysis Library, is an open-source data analysis library used in this project to organize the data from the fake and real news samples [39]. Setting the information in a DataFrame makes handling data easier. Pandas is useful when working with tabular data, which can be found in the form of a CSV. There are two primary data structures in Pandas, a Series and a DataFrame [40]. A DataFrame holds two-dimensional data, whereas a Series only holds one-dimensional data.

2) NumPy: NumPy is one of the most important libraries involved in this project. Its ability to perform efficient vectorized operations used in linear algebra helps with performing the mass calculations required for neural network matrix processing [41]. While its functions may not be called explicitly in this project's code, they are still used within the models that are initialized by Keras.

3) Keras: This library drives the machine learning models that are being implemented. Keras has many models for machine learning purposes; however, the main model that will be implemented is the Sequential model [42]. Keras also has many useful methods that are used to train and backpropagate models. It has built-in loss functions, so the tedious process of creating a loss function is no longer necessary [42].

4) Scikit-Optimize: Scikit-Optimize is a library that accesses the Sequential model and tunes hyperparameters, returning the best combination of epochs, hidden layers/units and activation functions [43].

5) GloVe: Global Vectors for Word Representation, often abbreviated as GloVe, was developed by Stanford for word vectorization [44]. Similar phrases are grouped together and assigned a number, allowing the model to understand large volumes of text through a simple string of numbers. It is used for both data preprocessing and training the model.

6) Matplotlib: Matplotlib is a Python library that creates graphs and charts. It can plot large amounts of information and produce a visual representation that is easy to interpret [45]. This library was used to chart the success of models for simple comparison for this project.
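For reference, the activation functions and the loss described in Section II-H, which Keras provides built in, can be written in a few lines of plain Python. This is a minimal sketch for intuition rather than the Keras implementation; the function names and the `eps` clipping constant are assumptions added for illustration.

```python
import math


def relu(x):
    """ReLU: return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid: squash any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over paired labels and predicted probabilities.

    `eps` (an assumed safeguard) clips probabilities away from 0 and 1
    so that log() is always defined.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Confident, correct predictions give a small loss; uncertain ones a larger loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

The second call shows that a model predicting 0.5 for every input incurs a loss of ln 2 per example, which is a useful baseline when reading loss curves.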

7) Tokenizer: (keras.preprocessing.text.Tokenizer) This class splits text into sequences of tokens and removes stop words to prepare the text for padding in the next step of preprocessing [46].

8) Sequence: (keras.preprocessing.sequence) This module contains the methods that pad the sequences of tokenized text into vectors of floats that can be interpreted by the Keras libraries [47].

9) TF-IDF Vectorizer: (sklearn.feature_extraction.text.TfidfVectorizer) The vectorizer is similar to the tokenizing and sequencing that was performed in Keras, but it is specifically compatible with the Scikit-Learn classifiers for data input [48].

III. PROCEDURE

A. Data Acquisition

The first step in training the neural network was finding a fake and real news dataset from a reputable source. Multiple datasets were examined on Kaggle (an online platform with data repositories for different projects), and two were selected to train and test the models. One contained tweets with false information or links to fake news. The other had articles fact-checked for validity. Both datasets were CSVs in which each data point held the title of an article, the article itself, its subject and the date of publication. The fake news file had 23,537 rows, ranging from March 31st, 2015 to December 31st, 2017, while the real news file had 21,418 rows, spanning from January 13th, 2016 to December 31st, 2017.

B. Preprocessing (See Appendix A for flowchart, Appendix C for code)

After the data was downloaded, files were uploaded to Drive and imported into Colab so they could be accessed by the Pandas library. Once Pandas was imported, the datasets were copied into DataFrame objects. After labels were added (“1” for fake news and “0” for real news), the data was combined into a single DataFrame and then split into training and validation sets.

The next step was preprocessing the data so that it could be interpreted by the computer, since Keras models only accept numerical inputs. To do this, the data was passed into a Tokenizer object that was able to remove “stop-words” from the data, and then processed through sequencing functions to translate the remaining words into numerical vectors.

C. Model Construction

Prior to this project, there were many existing architectures built for classifying news as real or fake. One of these models, built by Aaron Abrahamson, utilized an LSTM layer. This model alone was able to achieve a high accuracy of 98.5% [49]. This model was excellent as is; however, there was still room for improvement in both the training and validation accuracy. In order to do this, a one-dimensional convolutional layer was added to the model, which further improved its results.

Keras was used to build the model itself, and Scikit-Optimize's Bayesian optimization methods were used to find optimal hyperparameters. Dynamically defining the model and creating arrays of possible hyperparameters helped the optimizer minimize loss and maximize accuracy.

The models at their final states were structured as shown in Tables I and II.

TABLE I
GRU MODEL STRUCTURE

Layer (Type)                     Output Shape       Param #
embedding_7 (Embedding)          (None, 400, 50)    10000
bidirectional_1 (Bidirectional)  (None, 128)        44544
dense_14 (Dense)                 (None, 256)        33024
activation_11 (Activation)       (None, 256)        0
dropout_6 (Dropout)              (None, 256)        0
out_layer (Dense)                (None, 1)          257
activation_12 (Activation)       (None, 1)          0

TABLE II
LSTM MODEL STRUCTURE

Layer (Type)                     Output Shape       Param #
embedding_7 (Embedding)          (None, 400, 50)    10000
conv1d (Conv1D)                  (None, 396, 64)    16064
lstm_5 (LSTM)                    (None, 64)         33024
dense_14 (Dense)                 (None, 256)        33024
activation_11 (Activation)       (None, 256)        0
dropout_6 (Dropout)              (None, 256)        0
out_layer (Dense)                (None, 1)          257
activation_12 (Activation)       (None, 1)          0

Note that the GRU model runs on a bidirectional layer that allows the model to take a forward and backward pass over the text, allowing for more possibilities of classification and thereby leading to a less-overfit and more accurate model.

After these methods were defined, one call to the gp_minimize() method with the specified hyperparameters ran the function and saved the most favorable model. To check for overfitting, the history of the model was graphed on two separate matplotlib data plots. The model accuracy graph displays the number of epochs on the x-axis and the accuracy (in decimal form) on the y-axis. The graph should show an increase in accuracy as the training (number of epochs) increased [50]. The model loss graph displays the number of epochs on the x-axis and the loss (in decimal form) on the y-axis. If the model was functioning correctly, the graph should show a decrease in loss as the training (number of epochs) increased [50]. Once the model's outputs were satisfactory, it was saved and used for further training and testing.

    ## Building the GRU Model
    from keras.models import Sequential
    from keras.layers import Activation, Bidirectional, Dense, Dropout, Embedding, GRU

    # max_words, embed_dim, max_len, and gru_out are assumed to be defined earlier.
    model = Sequential()
    model.add(Embedding(max_words, embed_dim, input_length=max_len))
    model.add(Bidirectional(GRU(gru_out)))
    model.add(Dense(256))
    model.add(Activation('relu'))

    model.add(Dropout(0.5))
    model.add(Dense(1, name='out_layer'))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    ## Building the LSTM Model
    from keras.models import Sequential
    from keras.layers import Activation, Conv1D, Dense, Dropout, Embedding, LSTM

    # max_words, embed_dim, max_len, and lstm_out are assumed to be defined earlier.
    model2 = Sequential()
    model2.add(Embedding(max_words, embed_dim, input_length=max_len))
    model2.add(Conv1D(64, 5, activation='relu'))
    model2.add(LSTM(lstm_out))
    model2.add(Dense(256))
    model2.add(Activation('relu'))
    model2.add(Dropout(0.5))
    model2.add(Dense(1, name='out_layer'))
    model2.add(Activation('sigmoid'))

    model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model2.summary())

D. Solution Prediction (See Appendix D, E)

After the model was saved and the graphs were generated, the model was used to predict the reliability of news sources. To do this, articles were taken in and stored as strings inside of a list. The articles then underwent the same padding procedure used for preprocessing and translating words into numerical vectors. The model.predict(input) method was invoked and the results were stored. These results could be returned to inform the user of the model's prediction, or used to construct a confusion matrix. By passing the predicted and expected values to the confusion_matrix() function in sklearn.metrics and storing the output in a variable, the model was able to be analyzed for potential modifications.

With binary classification, a confusion matrix summarizes a model's performance in a two-by-two matrix. The expected values are listed on the left side, and the predicted values are listed on the top; true and false positives and negatives are plotted in the matrix, as shown in Figure 9.

Fig. 9. A two-by-two matrix used in binary classification to plot the network's predicted values and the data's actual values.

True positives (TP) and true negatives (TN) signify that the prediction was correct, while false positives (FP) and false negatives (FN) indicate that it was incorrect [50]. Positive results refer to fake news, which has a value of one, and negative outputs are real news with a value of zero. For instance, false positives have an actual value of 0, meaning that they are real news, but the model predicts they have a value of 1, meaning it incorrectly predicts they are fake news. Confusion matrices can be used to calculate many statistics for insightful analysis.

Accuracy measures the overall proportion of correct predictions [51]. Precision measures how many positive predictions are accurate with the expression $\frac{TP}{TP + FP}$ [39]. Recall measures how many positives were predicted correctly and is calculated with the expression $\frac{TP}{TP + FN}$ [52]. F-measure is a tool that evaluates precision and recall simultaneously with the expression $\frac{2 \cdot recall \cdot precision}{recall + precision}$ [52].

1) Model Accuracy/Loss Graphs: The model accuracy graph displays the number of epochs on the x-axis and the accuracy (in decimal form) on the y-axis. The graph should model an increase in accuracy as the training (number of epochs) increases [50].

The model loss graph displays the number of epochs on the x-axis and the loss (in decimal form) on the y-axis. If the model is functioning correctly, the graph should show a decrease in loss as the training (number of epochs) increases [51].

E. NLP Model Construction

As with the LSTM model, existing NLP examples for text preprocessing were used as references when building the models. The guide that was used was created by Filippos Dounis and was adapted to suit this project's needs.

1) Preprocessing (See Appendix F): Preprocessing of NLP data was conducted to translate words into embedded vectors of floats. Once the data was split into training and testing sets using scikit-learn's train_test_split() function, a TF-IDF vectorizer object was initialized to translate the training and testing data.

2) Model Building (See Appendix G, H): The classifier that was used for the NLP model is called the PassiveAggressive Classifier. This type of NLP classifier is adept at handling the large quantities of text used while training the model. After constructing the PassiveAggressive Classifier, the model was trained by calling the model.fit(training_data, training_label) function. After training was complete, the model generated predictions for the testing data, which were then mapped onto a confusion matrix.

IV. RESULTS

A. Model Graphs

The graphs in Figures 10 and 11 are from the first iteration of the LSTM model. As shown on the graphs, the model stabilized at an accuracy of around 59% and the loss stabilized at 0.6775. This was because of a major flaw in the preprocessing of the data. The preprocessing that was applied eliminated most of the text in each of the articles. As a result, the model was unable to correctly classify the data, as it was either non-existent or there was not enough data to make a valid prediction. The approach to preprocessing was revised to retain most of the data and translate the words into tokens, followed by vectors. The data below is from the revised model, and the same preprocessing method was then implemented for the GRU model as well.
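The metrics defined in Section III-D can be checked directly against the reported results. The sketch below is a minimal illustration in plain Python, not the authors' code: the helper name `binary_metrics` is hypothetical, and the cell assignment (TP = 4058, FP = 3653, FN = 17359, TN = 19828) is an assumption about how Table III is laid out, chosen because it reproduces the values reported for the first LSTM trial.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure


# Counts assumed from Table III (first LSTM trial); the layout is an assumption.
accuracy, precision, recall, f_measure = binary_metrics(
    tp=4058, fp=3653, fn=17359, tn=19828
)
print(f"accuracy={accuracy:.5f} precision={precision:.5f} "
      f"recall={recall:.5f} f_measure={f_measure:.5f}")
```

Under that reading, the four values round to the accuracy (0.53201), precision (0.52626), recall (0.18948), and F-measure (0.27863) quoted for the first trial, and the same helper can be applied to the later confusion matrices.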

Fig. 10. A graph of the accuracy of the early version of the LSTM model, which lacked padded data.

Fig. 11. A graph of the initial LSTM model, depicting the network's loss across epochs.

TABLE III
LSTM CONFUSION MATRIX (TRIAL 1)

4058    3653
17359   19828

After the first iteration of the LSTM was trained, it outputted 0.606 as an optimal value for every input.

The confusion matrix in Table III displays the results of the first iteration. A total of 23886 predictions (out of 44898 total) were correct, yielding an overall accuracy of 53.201%. There was a significant number of false positives and false negatives, meaning that the model was still lacking in accuracy. Calculating the recall reveals that 18.948% of fake news was accurately identified as fake. Additionally, 84.443% of real news was accurately identified as real. This disparity in percentages was a result of the model consistently outputting the same value for each news story. The model had a precision of 0.52626, signifying that 52.626% of predicted fake news was actually fake. The f-measure was 0.27863, which was far from the ideal f-measure value of 1.

TABLE IV
LSTM CONFUSION MATRIX (TRIAL 2)

10500   8771
10917   14710

For the second iteration of the LSTM, the training process was corrected so that the network no longer outputted the same value for every input. The confusion matrix in Table IV describes this second iteration of the LSTM model. It had an accuracy of 0.56149, or 56.149%. The recall indicated that 49.026% of fake news was correctly classified as fake, which was a significant improvement from the first LSTM model. Real news was accurately identified 62.646% of the time. While this was lower than the previous iteration, the percentages of accurately identified real and fake news were closer to each other, indicating a better balance in the model. The model's precision was 0.54486, meaning 54.486% of predictions of fake news were accurate. Finally, the f-measure was 0.51612, which was a significant improvement from the first iteration's f-measure of 0.27863. This second model was a significant step in the right direction.

TABLE V
LSTM CONFUSION MATRIX (TRIAL 3)

5290    20
19      5738

For the third iteration of the LSTM network, the number of nodes was reduced, and a convolutional layer was added. Table V depicts the third iteration of the LSTM neural network. The accuracy was 0.99648, or 99.648%. The model accurately identified 99.642% of fake news as fake and classified 99.653% of real news correctly. This was a significant improvement from the previous LSTM models. The precision was 0.99623 and the f-measure was 0.99632. In this trial, the new f-measure was extremely close to the ideal value of 1. These modifications made the LSTM network nearly perfect in its predictions on data similar to what was in its training dataset.

TABLE VI
GRU CONFUSION MATRIX (TRIAL 1)

1247    1829
20170   21652

Table VI depicts the results of the GRU's first iteration through the use of a confusion matrix. The overall accuracy of the model was 51.002%. As with the LSTM, the large number of false positives and false negatives, which contributed to the low accuracy, signified that the model had much room for improvement. This model's recall revealed that 5.822% of fake news was correctly classified and 92.211% of real news was accurately identified as real. As with the LSTM, this showed that real news was consistently classified more accurately than fake news. Of the predicted fake news, 40.540% was actually fake. The f-measure was 0.10181, which strayed far from the target value of 1.
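The statistics quoted for each confusion matrix follow from its four counts. The sketch below is an illustration rather than the project's code: the function name confusion_stats is hypothetical, and it assumes the cell layout that the text's figures imply, namely true positives in the top-left cell, false positives top-right, false negatives bottom-left, and true negatives bottom-right. The counts are those of Table IV.

```python
def confusion_stats(tp, fp, fn, tn):
    """Compute the summary statistics discussed in the text from four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)          # share of "fake" predictions that were fake
    recall = tp / (tp + fn)             # share of fake news identified as fake
    return {
        "accuracy": (tp + tn) / total,  # share of all predictions that were correct
        "recall": recall,
        "specificity": tn / (tn + fp),  # share of real news identified as real
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Counts from Table IV (LSTM, Trial 2), read under the layout assumption above
stats = confusion_stats(tp=10500, fp=8771, fn=10917, tn=14710)

for name, value in stats.items():
    print(f"{name}: {value:.5f}")
```

Under this reading, the printed values agree with the accuracy, recall, real-news rate, precision, and f-measure reported for the second LSTM iteration, and the same function can be applied to the counts in the other tables.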

TABLE VII
GRU CONFUSION MATRIX (TRIAL 2)

735     727
20682   22754

TABLE VIII
GRU CONFUSION MATRIX (TRIAL 3)

1526    2466
3474    2534

The matrices in Tables VII and VIII summarize the results of the GRU's second and third iterations. The models had accuracies of 52.316% and 40.6%, respectively, indicating a downward trend in efficacy due to the models being poorly fit. To address this, a fourth iteration of the GRU model was created and trained more effectively.

TABLE IX
GRU CONFUSION MATRIX (TRIAL 4)

5272    3
37      5755

For the fourth GRU model, the number of input nodes was decreased. Table IX describes this fourth iteration of the GRU model. It had an accuracy of 0.99639. The model identified 99.303% of fake news as fake and accurately classified 99.948% of real news. The precision was 0.99943 and the f-measure was 0.99622. As with its LSTM counterpart, the modifications to the GRU architecture greatly improved the model's efficacy.

TABLE X
NLP CONFUSION MATRIX (TRIAL 1)

5299    33
31      5862

Table X describes the results of the NLP model. The model had a high accuracy of 0.99430. It accurately identified 99.418% of fake news as fake and correctly classified 99.440% of real news. The precision was 0.99381 and the f-measure was 0.99399. These statistics show that the NLP model was extremely effective in detecting fake news, even in its first iteration.

Fig. 12. A graph of the second LSTM model after the data retention issue was fixed, showing the network's accuracy in predicting fake news.

Fig. 13. A graph of the loss recorded from the repaired second LSTM model.

Fig. 14. This graph shows the accuracy of the GRU model in predicting which news would be fake.

Figures 12 and 13 depict the first attempt at running the model after the data retention problem was addressed. As shown in the graphs, the training loss was minimized and the training accuracy was maximized. However, the validation data did not follow the same trend. Due to this discrepancy, the model was deemed to be overfit. As a result, the accuracy of test predictions was drastically lower than the accuracy on the training data. This was also true for the GRU model shown in Figures 14 and 15. This issue could be diagnosed and potentially solved by lowering the number of epochs to one, where the validation accuracy is at its highest. This was the next step in ensuring that test predictions would be as accurate as possible.
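The overfitting check described for the training graphs, where training accuracy keeps rising while validation accuracy does not, can be sketched as a comparison of per-epoch histories. The function names and the accuracy values below are hypothetical and invented purely for illustration; they are not measurements from this project.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy (earliest epoch on ties)."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1


def overfit_gap(train_accuracy, val_accuracy):
    """Per-epoch difference between training and validation accuracy; a growing gap suggests overfitting."""
    return [t - v for t, v in zip(train_accuracy, val_accuracy)]


# Hypothetical histories: training accuracy climbs while validation accuracy falls
train_acc = [0.90, 0.96, 0.99, 0.995]
val_acc = [0.62, 0.60, 0.58, 0.57]

gaps = overfit_gap(train_acc, val_acc)
print("gap per epoch:", [round(g, 3) for g in gaps])
print("epoch with best validation accuracy:", best_epoch(val_acc))
```

With histories shaped like these, the validation accuracy peaks in the first epoch, so stopping there would be the natural choice. In practice the two lists would come from the training history recorded by the deep learning framework.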

Fig. 15. The depicted graph shows the loss of the GRU model.

Fig. 16. The depicted graph shows the final accuracy of the LSTM model.

Fig. 17. The depicted graph shows the final loss of the LSTM model.

Fig. 18. The depicted graph shows the final accuracy of the GRU model.

Fig. 19. The depicted graph shows the final loss of the GRU model.

Figures 16 and 17 were the result of a slight adjustment to the model, which increased accuracy and minimized loss. The number of neurons in the LSTM layer and the number of dense layers were reduced, while a new convolutional layer (Conv1D) was added. This modification was made because too many neurons in the model would require more data for optimal training. After the number of neurons and layers was reduced (LSTM neurons from 300 to 64 and Dense layers from 5 to 1), the model ran faster and was able to classify fake and real news with greater accuracy. The graphs indicated that the model was now a good fit, compared to the previously overfit model. The Conv1D, or one-dimensional convolutional layer, helped break up the inputs into smaller pieces before they moved on to the LSTM layer (this was not implemented in the GRU model). A predefined validation split for the data was also used to ensure the model was not being indiscriminately tested on more real news than fake news, or vice versa. With these changes, the model was able to run successfully and produce a high accuracy with minimal loss on both training and testing data.
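To give a sense of why reducing the LSTM layer from 300 to 64 neurons lightens the training burden, the number of trainable weights in a standard LSTM layer can be counted directly: each of its four gates has input weights, recurrent weights, and a bias vector. The sketch below is a back-of-the-envelope illustration, not the project's code; the input width EMBEDDING_DIM is a hypothetical value, since the width of the vectors fed to the LSTM layer is not stated here.

```python
def lstm_param_count(units, input_dim):
    """Trainable weights in a standard LSTM layer with biases.

    Each of the four gates has an input weight matrix (units x input_dim),
    a recurrent weight matrix (units x units), and a bias vector (units).
    """
    return 4 * (units * input_dim + units * units + units)


EMBEDDING_DIM = 100  # hypothetical input width, chosen only for illustration

large = lstm_param_count(units=300, input_dim=EMBEDDING_DIM)
small = lstm_param_count(units=64, input_dim=EMBEDDING_DIM)

print("300 units:", large, "weights")
print(" 64 units:", small, "weights")
print("reduction factor:", round(large / small, 1))
```

Under this assumed input width, the smaller layer has roughly a tenth as many weights to fit, which is in line with the observation that the reduced model trained faster and no longer overfit the available data.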

B. Additional Validation Testing

After the models were trained and optimized, they all operated at over 99% accuracy. As shown by the outputs below, the models were very accurate at identifying and classifying fake and real news. However, a peculiarity arose upon further testing. The GRU and LSTM models were unable to detect satire in an article and therefore classified the news as real, while the NLP model was able to accurately detect satire and classify it as fake. This may be due to the fact that NLP can factor the tone and context of an article into its decision, while the GRU and LSTM cannot.

V. CONCLUSION

A. Successes and Shortcomings

Ultimately, all three models were successful in detecting fake news that was written to mislead its readers. The LSTM had an accuracy of 99.648%, the GRU had an accuracy of 99.639%, and the NLP had an accuracy of 99.430%. While the NLP's overall accuracy was slightly lower than the other two models' accuracies, when asked to classify satirical and comedic news from publications such as The Onion, only the NLP model accurately detected sarcasm. Both the LSTM and GRU networks consistently classified the satire as real news. As a result, the NLP model emerged as the preferred choice among the three.

B. Future Work

1) Satire Analysis: In the future, it would be beneficial to expand the data set and add satirical news stories to train the LSTM and GRU models. This would allow both models to more accurately classify satire and comedy as fake news.

2) Gradual Classification: Modifying the models to classify articles on a spectrum would be more complex, but could be more helpful for readers than simply determining whether an article is real or fake. This would show readers how much of the article should be trusted, rather than completely writing off a source if it is only partially fake. Adding this to the project would entail changing the network from a binary classifier into a multi-class classification system.

3) Usage Beyond Twitter: While the models were trained using data from Twitter, verifying the systems' effectiveness on other platforms could further reduce the prevalence of fake news. Facebook and Snapchat, for instance, are popular platforms that feature news pages, and applying these models there could help users differentiate real from fake news.

4) User-Interface Implementation: It would be beneficial to connect the models to an interface that allows users to input a news article and view whether the models classify it as fake or real. Doing so would make the research described in this paper more accessible to users around the globe.

APPENDIX A

Fig. 20. This flowchart details the process of creating a Deep Learning model.

APPENDIX B

Fig. 21. This flowchart shows the process of creating an NLP model.

APPENDIX C

    import pandas as pd
    from sklearn.model_selection import train_test_split

    ## Reading the Fake and Real News Databases
    real = pd.read_csv('DataSet/True/True.csv')
    fake = pd.read_csv('DataSet/Fake/Fake.csv')

    ## Give labels to data before combining
    fake['fake'] = 1
    real['fake'] = 0

    combined = pd.concat([fake, real])

    ## Splits the Data into training and testing data
    features = combined['text']
    labels = combined['fake']

    X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=42)

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    ## Defines a few hyperparameters
    max_words = 2000
    max_len = 400

    ## Creates a tokenizer and
    ## first tokenizes the words
    token = Tokenizer(num_words=max_words, lower=True, split=' ')
    token.fit_on_texts(X_train.values)

    ## After the words are tokenized
    ## the words are sequenced and padded
    sequences = token.texts_to_sequences(X_train.values)
    train_sequences_padded = pad_sequences(sequences, maxlen=max_len)

APPENDIX D

    ## now compare to test values
    test_sequences = token.texts_to_sequences(X_test)
    test_sequences_padded = pad_sequences(test_sequences, maxlen=max_len)

    ## Evaluates the model
    model2.evaluate(test_sequences_padded, y_test)

    Output:
    >> 346/346 [======]
    >> - 3s 8ms/step
    >> - loss: 0.0502
    >> - accuracy: 0.9852
    >> [0.0501866415143013, 0.9851811528205872]
    ## [Loss, Accuracy]

APPENDIX E

    from sklearn.metrics import confusion_matrix

    ## Stores the Predicted Data for LSTM in a variable
    predictions = model2.predict(test_sequences_padded)

    ## So the old data is preserved
    newPred = predictions.copy()

    ## Rounds the data up or down in order to classify
    ## with either a 1 or 0 and stores it in newPred
    for i in range(len(newPred)):
        newPred[i][0] = round(newPred[i][0])

    ## Invokes confusion_matrix
    ## and passes it the predictions
    ## and the expected values
    ## (predictions are passed first, so rows follow the predicted labels)
    matrix = confusion_matrix(newPred, y_test)
    print(matrix)
    >> [[5290   20]
        [  19 5738]]

APPENDIX F

    from sklearn.feature_extraction.text import TfidfVectorizer

    ## Initialize a TfidfVectorizer
    ## (this will be used to vectorize the texts)
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

    ## Fit & transform train set,
    ## transform test set
    tfidf_train = tfidf_vectorizer.fit_transform(X_train)
    tfidf_test = tfidf_vectorizer.transform(X_test)

APPENDIX G

    from sklearn.linear_model import PassiveAggressiveClassifier

    # Initialize the PassiveAggressiveClassifier
    ## and fit training sets
    model = PassiveAggressiveClassifier(max_iter=50)
    model.fit(tfidf_train, y_train)

APPENDIX H

    y_pred = model.predict(tfidf_test)
    matrix = confusion_matrix(y_pred, y_test)

    print(matrix)
    >> [[5299   33]
        [  31 5862]]

ACKNOWLEDGEMENTS

The authors of this paper gratefully acknowledge the following for their assistance in the completion of this project: Residential Teaching Assistant and Project Liaison Akila Saravanan for her constant guidance and admirable composure; Project Mentor Dayrene Martinez for her instruction on Python and machine learning; Head Residential Teaching Assistant Rajas Karajgikar for his critique and for organizing activities and schedules through the course of the program; Research Coordinator Benjamin Lee for making this research project possible; Director Patrick Antoine for his enthusiastic support and insight; Director Emeritus Ilene Rosen for making this educational program possible; the Governor's School of Engineering & Technology (GSET), Rutgers University, the Rutgers School of Engineering, the New Jersey Space Grant Consortium, and the State of New Jersey for these unique and enlightening opportunities in engineering; GSET alumni for their support by sharing their experiences and knowledge; and sponsor Lockheed Martin for generously funding these enriching pursuits.

REFERENCES

[1] K. Paul, "False news stories are 70% more likely to be retweeted on Twitter than true ones", MarketWatch, 2018. [Online]. Available: https://www.marketwatch.com/story/fake-news-spreads-more-quickly-on-twitter-than-real-news-2018-03-08. [Accessed: 20-Jul-2020].

[2] S. Vosoughi, D. Roy and S. Aral, "The spread of true and false news online", Science, 2020. [Online]. Available: https://science.sciencemag.org/content/359/6380/1146. [Accessed: 14-Jul-2020].

[3] N. Akpan, "The very real consequences of fake news stories and why your brain can't ignore them", PBS NewsHour, 2020. [Online]. Available: https://www.pbs.org/newshour/science/real-consequences-fake-news-stories-brain-cant-ignore. [Accessed: 11-Jul-2020].

[4] K. Hao, "What is machine learning?", MIT Technology Review, 2020. [Online]. Available: https://www.technologyreview.com/2018/11/17/103781/what-is-machine-learning-we-drew-you-another-flowchart/. [Accessed: 14-Jul-2020].

[5] B. Grossfeld, "Deep learning vs machine learning", Zendesk, 2020. [Online]. Available: https://www.zendesk.com/blog/machine-learning-and-deep-learning/#:~:text=To%20recap%20the%20differences%20between,intelligent%20decisions%20on%20its%20own. [Accessed: 14-Jul-2020].

[6] A. Geitgey, "Natural Language Processing is Fun!", Medium, 2020. [Online]. Available: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e. [Accessed: 14-Jul-2020].

[7] H. Jabeen, "Stemming and Lemmatization in Python", DataCamp Community, 2020. [Online]. Available: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python. [Accessed: 14-Jul-2020].

[8] A. Agarwal, M. Mittal, A. Pathak, L. M. Goyal, "Fake news detection using a blend of neural networks: an application of deep learning", SN Computer Science, Apr. 2020. [Online]. Doi: https://doi.org/10.1007/s42979-020-00165-4. [Accessed: 18-Jul-2020].

[9] R. Khandelwal, word2vec and GloVe word embeddings. 2019.

[10] L. Hardesty, "Explained: Neural networks", MIT News, 2020. [Online]. Available: http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414. [Accessed: 14-Jul-2020].

[11] S. Saha, "A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way", Medium, 2020. [Online]. Available: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53. [Accessed: 14-Jul-2020].

[12] deeplizard, Convolutional Neural Networks (CNNs) explained. 2020.

[13] Exemplary split of a phrase into uni-, bi- and trigrams. 2015.

[14] "Neural Network Bias: Bias Neuron, Overfitting and Underfitting - MissingLink.ai", MissingLink.ai, 2020. [Online]. Available: https://missinglink.ai/guides/neural-network-concepts/neural-network-bias-bias-neuron-overfitting-underfitting/#:~:text=Overfitting%20and%20Underfitting%201%20Overfitting%20in%20Neural%20Networks,training%20set%206%20High%20variance%20More%20items...%20. [Accessed: 15-Jul-2020].

[15] C. Olah, "Understanding LSTM Networks - colah's blog", Colah.github.io, 2020. [Online]. Available: https://colah.github.io/posts/2015-08-Understanding-LSTMs/. [Accessed: 18-Jul-2020].

[16] M. Phi, "Illustrated Guide to LSTM's and GRU's: A step by step explanation", Medium, 2018. [Online]. Available: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21. [Accessed: 16-Jul-2020].

[17] J. Heaton, "The Number of Hidden Layers", Heaton Research, 2017. [Online]. Available: https://www.heatonresearch.com/2017/06/01/hidden-layers.html. [Accessed: 08-Jul-2020].

[18] "Overfitting in Machine Learning: What It Is and How to Prevent It", EliteDataScience, 2016. [Online]. Available: https://elitedatascience.com/overfitting-in-machine-learning. [Accessed: 13-Jul-2020].

[19] J. Brownlee, "Dropout Regularization in Deep Learning Models With Keras", Machine Learning Mastery, 2016. [Online]. Available: https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/. [Accessed: 09-Jul-2020].

[20] M. Koning, "How to build a recurrent neural network to detect fake news", Towards Data Science, Feb. 2020. [Online]. Available: https://towardsdatascience.com/how-to-build-a-recurrent-neural-network-to-detect-fake-news-35953c19cf0b. [Accessed: 18-Jul-2020].

[21] E. Muccino, "LSTM vs GRU: Experimental Comparison", Medium, 2019. [Online]. Available: https://medium.com/mindboard/lstm-vs-gru-experimental-comparison-955820c21e8b. [Accessed: 09-Jul-2020].

[22] V. Zhou, "Machine Learning for Beginners: An Introduction to Neural Networks", Victorzhou.com, 2019. [Online]. Available: https://victorzhou.com/blog/intro-to-neural-networks/. [Accessed: 05-Jul-2020].

[23] D. Godoy, "Understanding binary cross-entropy / log loss: a visual explanation", Medium, 2018. [Online]. Available: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a. [Accessed: 16-Jul-2020].

[24] S. Raschka, "Gradient Descent And Stochastic Gradient Descent - Mlxtend", Rasbt.github.io, 2020. [Online]. Available: http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/. [Accessed: 19-Jul-2020].

[25] "Reducing Loss: Gradient Descent - Machine Learning Crash Course", Google Developers, 2020. [Online]. Available: https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent. [Accessed: 16-Jul-2020].

[26] Gradient Descent. 2020.

[27] V. Bushaev, "Adam - latest trends in deep learning optimization.", Medium, 2018. [Online]. Available: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c. [Accessed: 11-Jul-2020].

[28] K. Team, "Keras documentation: Adam", Keras.io, 2020. [Online]. Available: https://keras.io/api/optimizers/adam/. [Accessed: 12-Jul-2020].

[29] J. Jordan, Learning Rates. 2018.

[30] W. Koehrsen, "A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning", Medium, 2018. [Online]. Available: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f. [Accessed: 08-Jul-2020].

[31] A. Alonso, "fmfn/BayesianOptimization", GitHub, 2020. [Online]. Available: https://github.com/fmfn/BayesianOptimization. [Accessed: 13-Jul-2020].

[32] Prabhu, "Understanding Hyperparameters and its Optimisation techniques", Medium, 2018. [Online]. Available: https://towardsdatascience.com/understanding-hyperparameters-and-its-optimisation-techniques-f0debba07568. [Accessed: 09-Jul-2020].

[33] S. Sharma, "Epoch vs Batch Size vs Iterations", Medium, 2017. [Online]. Available: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9. [Accessed: 10-Jul-2020].

[34] D. Gupta, "Activation Functions - Fundamentals Of Deep Learning", Analytics Vidhya, 2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/. [Accessed: 16-Jul-2020].

[35] S. Sharma, ReLU Function. 2017.

[36] "Sigmoid Function", DeepAI, 2020. [Online]. Available: https://deepai.org/machine-learning-glossary-and-terms/sigmoid-function. [Accessed: 12-Jul-2020].

[37] S. Sharma, "Activation Functions in Neural Networks", Medium, 2017. [Online]. Available: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6. [Accessed: 03-Jul-2020].

[38] S. Sharma, Sigmoid Function. 2017.

[39] J. Brownlee, "Display Deep Learning Model Training History in Keras", Machine Learning Mastery, 2016. [Online]. Available: https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/. [Accessed: 13-Jul-2020].

[40] A. Bronshtein, "A Quick Introduction to the “Pandas” Python Library", Medium, 2017. [Online]. Available: https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673. [Accessed: 12-Jul-2020].

[41] "pandas", PyPI, 2020. [Online]. Available: https://pypi.org/project/pandas/. [Accessed: 11-Jul-2020].

[42] "NumPy", Numpy.org, 2020. [Online]. Available: https://numpy.org/. [Accessed: 18-Jul-2020].

[43] K. Team, "Keras: the Python deep learning API", Keras.io, 2020. [Online]. Available: https://keras.io/. [Accessed: 11-Jul-2020].

[44] "Getting started - scikit-optimize 0.7.4 documentation", Scikit-optimize.github.io, 2020. [Online]. Available: https://scikit-optimize.github.io/stable/getting_started.html. [Accessed: 12-Jul-2020].

[45] "Word Vectorization using GloVe", Medium, 2020. [Online]. Available: https://medium.com/analytics-vidhya/word-vectorization-using-glove-76919685ee0b#:~:text=GloVe%20stands%20for%20global%20vectors%20for%20word%20representation.,occurrence%20matrix%20from%20a%20corpus.&text=In%20this%20blog%20post%2C%20we,about%20GloVe%. [Accessed: 13-Jul-2020].

[46] "Python - Introduction to Matplotlib - GeeksforGeeks", GeeksforGeeks, 2020. [Online]. Available: https://www.geeksforgeeks.org/python-introduction-matplotlib/. [Accessed: 07-Jul-2020].

[47] "tf.keras.preprocessing.text.Tokenizer - TensorFlow Core v2.2.0", TensorFlow, 2020. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer. [Accessed: 18-Jul-2020].

[48] "tf.keras.utils.Sequence - TensorFlow Core v2.2.0", TensorFlow, 2020. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence. [Accessed: 18-Jul-2020].

[49] A. Abrahamson, "Detecting Fake News With Deep Learning", Medium, 30-Mar-2020. [Online]. Available: https://towardsdatascience.com/detecting-fake-news-with-deep-learning-7505874d6ac5. [Accessed: 18-Jul-2020].

[50] "sklearn.feature_extraction.text.TfidfVectorizer - scikit-learn 0.23.1 documentation", Scikit-learn.org, 2020. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. [Accessed: 18-Jul-2020].

[51] S. Narkhede, "Understanding Confusion Matrix", Medium, 2018. [Online]. Available: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62. [Accessed: 18-Jul-2020].

[52] J. Brownlee, "What is a Confusion Matrix in Machine Learning", Machine Learning Mastery, 2020. [Online]. Available: https://machinelearningmastery.com/confusion-matrix-machine-learning/#:~:text=A%20confusion%20matrix%20is%20a%20summary%20of%20prediction%20results%20on,broken%20down%20by%20each%20class. [Accessed: 08-Jul-2020].

[53] F. Dounis, "Detecting Fake News With Python And Machine Learning", Medium, 05-Jun-2020. [Online]. Available: https://medium.com/swlh/detecting-fake-news-with-python-and-machine-learning-f78421d29a06. [Accessed: 18-Jul-2020].
