
California State University, Northridge

A Comparison of Lexicographical and Machine Learning

Approaches to Sentiment Analysis

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By Jeffrey Yoshida

May 2019

The thesis of Jeffrey Yoshida is approved:

______Robert McIlhenny, Ph.D. Date

______Kyle Dewey, Ph.D. Date

______George Wang, Ph.D., Chair Date

California State University, Northridge


Acknowledgments

I would like to thank Dr. George Wang for being my committee chair and for all his help throughout the entire thesis process. Thank you also to Dr. Robert McIlhenny for making time for me despite being on countless other thesis committees, and to Dr. Kyle Dewey for giving me tremendous feedback on my paper.

Finally, I would like to thank my parents for the support they have given me all these years. I would not be where I am today without your constant support.


Table of Contents

SIGNATURES ...... ii

ACKNOWLEDGEMENTS …………………………………………………….. iii

LIST OF FIGURES ...... vi

LIST OF TABLES ...... vii

ABSTRACT ...... viii

1. INTRODUCTION ...... 1

1.1 Background …………………………….…………………………………..1

1.2 Approaches to Sentiment Analysis …………………………………………2

2. RELATED WORKS…………...... 4

3. TECHNICAL APPROACH...... 5

3.1 Data Exploration ...... 5

3.2 General Workflow ...... 7

3.3 Hardware ...... 8

3.4 Software …………………………………………………………………....9

4. DATA PREPROCESSING ...... 10

5. WORD EMBEDDINGS ...... 13

5.1 Count-Based Methods …………………………………………………… 13

5.2 Prediction-Based Methods ………………………………………………. 15

5.3 Word Embedding Implementations ……………………………... 17

6. MODELS ...... 19

6.1 SentiWordNet ……………………………………………………………. 20

6.2 AFINN …………………………………………………………….…..…. 20

6.3 Logistic Regression …………………………………...…………. 21

6.4 Support Vector Machine …………………………………………………. 21


6.5 Random Forest …………………………………..……………….. 23

6.6 Naïve Bayes Classifier …………………………...………………………. 24

6.7 Multilayer Perceptron …………………….………...……………. 25

6.8 Convolutional Neural Network …………………………………..………. 27

7. RESULTS ...... 29

7.1 Metrics ……………………………………………………...……………. 29

7.2 Model Results ……………………………………………………………. 31

8. CONCLUSION ...... 35

REFERENCES ………………………………………………………………….36


List of Figures

Figure 1 – Histogram of Review Scores in the Amazon Fine Foods Dataset ...... 6

Figure 2 – Histogram of Positive and Negative Reviews in the Amazon Fine Foods Dataset …………………………………………………...... 6

Figure 3 – Workflow for Project …………………...... 7

Figure 4 – Example of Discrete Representation of Words ...... 13

Figure 5 – Word2vec Example …………………………...... 16

Figure 6 – SVM with Linearly Separable Data …………………………...... 22

Figure 7 – Example of Linearly Inseparable Data ………………………...... 23

Figure 8 – Perceptron Diagram ...... 26

Figure 9 – Multilayer Perceptron Diagram ...... 27

Figure 10 –True Positives, False Positives, False Negatives, and True Negatives …..... 29


List of Tables

Table 1 – Sample Count Vectorization ...... 14

Table 2 – 10 Most Similar Words to “tasty” in the Amazon Fine Foods Reviews Dataset Word2vec Model ...... 17

Table 3 – All Predictive Model and Word Embedding Combinations Used ...... 19

Table 4 – Overall Results ...... 31

Table 5 – Single-Class Classification Results ...... 31

Table 6 – Domain-Specific Words Similar to Good ...... 34

Table 7 – Pre-Trained Model Words Similar to Good ...... 34


Abstract

A Comparison of Lexicographical and Machine Learning

Approaches to Sentiment Analysis

By Jeffrey Yoshida

Master of Science in Computer Science

Sentiment analysis is an area of computer science research that deals with extracting subjective information such as opinions, attitudes and emotions from text data.

Although sentiment analysis is a technically challenging task, the potential benefits it can yield are great. An industry particularly interested in sentiment analysis is the e-commerce industry, where major companies often receive far more product reviews than can be handled manually. These reviews, if analyzed accurately, can become an invaluable source of consumer insights that go beyond numerical review scores. The goal of this thesis is to analyze and compare the performance of different methods of sentiment analysis on a dataset of Amazon product reviews.


1. Introduction

1.1 Background

Due to rapid improvements in internet technology, the amount of data being produced, consumed, and stored has skyrocketed. In 2013, it was estimated that 90% of the world's stored data had been generated in only the two years prior [1]. What is more astounding is that an estimated 95% of this data is unstructured data such as text, videos, and audio [2]. Because of this rich abundance of unstructured data, there has never been a greater need for methods of extracting valuable insights from it. This increased need to generate insights from unstructured data has fueled interest in research areas such as sentiment analysis. Sentiment analysis is generally defined as an area of computer science research that deals with extracting subjective information such as opinions, attitudes, and emotions from text data [3]. For the purposes of this paper, sentiment analysis is the task of computationally determining whether a body of text has a positive or negative tone.

Whether it be understanding the social sentiment regarding a clothing brand or political opinions on a contentious topic, sentiment analysis can provide crucial insights that are impractical to obtain by manual inspection of data. With the growth of e-commerce, the need for in-depth consumer insights has never been greater. A study of online consumer-generated reviews by Comscore, a large media analytics company, found that consumers were willing to pay at least 20% more for products and services that received a 5-star rating when compared to the same service that had received a 4-star rating [4]. For food, legal, and hotel services, consumers were willing to pay between 40% and 99% more [4]. The influence of product reviews is further supported by a 2018 study done by a site called BrightLocal, which found that 95% of people between the ages of 18 and 34 read reviews of local businesses [5]. More importantly, this survey also found that 57% of consumers will only use a business if it has 4 or more stars. Because of this, understanding product sentiment is valuable.

1.2 Approaches to Sentiment Analysis

There are two main approaches to sentiment analysis: the lexicographical approach and the machine learning approach [6]. In the lexicographical approach to sentiment analysis, the overall attitude of a body of text is determined by analyzing individual words or phrases. The polarity of each individual word is determined using a sentiment dictionary, a specialized dictionary that gives the word a sentiment polarity score. The sentiment of the entire body of text is computed as the sum of the polarity scores of the individual words or phrases in the text. There are many different sentiment dictionaries available, such as SentiWordNet, AFINN, and Opinion Lexicon [7, 8, 9, 10, 11].

The other common method of performing sentiment analysis is by using machine learning. Like many other research fields, the field of sentiment analysis has been influenced by the rapid growth of machine learning [12, 13]. By creating a training dataset consisting of pieces of text labeled as positive or negative, a model can be trained to classify new examples. Traditional machine learning approaches such as support vector machines (SVMs) and Naïve Bayes classifiers, as well as deep learning methods such as convolutional neural networks (CNNs), have been shown to perform well on text classification problems such as sentiment analysis [14, 15].


Lexicographical and machine learning approaches have their own unique strengths and weaknesses. Lexicographical methods save time since they do not need to train on a sample dataset. A weakness of these methods is that they do not adapt well to different domains or languages [14]. On the other hand, machine learning algorithms are adaptable to any domain or language but often require large amounts of high-quality data to generate accurate results [16].


2. Related Works

Although some papers were published in prior years, 2001 is considered the beginning of widespread interest in sentiment analysis [17]. One of the first papers to propose a lexicographical approach to sentiment classification was by Turney [18]. In this paper, the polarity of bodies of text was computed using the polarity of the text's individual words. The sentiment dictionary used to score the individual words was generated using the PMI-IR algorithm [19]. Since then, several other papers have been published using lexicons to perform sentiment analysis.

One of the first papers to apply supervised machine learning to sentiment analysis was by Pang et al. in 2002 [20]. This paper compared the performance of Naïve Bayes, maximum entropy classifiers, and support vector machines on sentiment classification of movie reviews. Since then, several papers have analyzed the performance of different machine learning algorithms on the problem of sentiment classification. Due to the more recent rise in popularity of deep learning, deep neural networks have also been applied to the problem of sentiment analysis. A number of different neural network architectures such as CNNs, long short-term memory (LSTM) networks, and recursive neural networks have been applied to NLP problems [21, 22, 23]. In many cases these models outperformed traditional models such as SVMs and Naïve Bayes [24].


3. Technical Approach

3.1 Data Exploration

The Amazon Fine Food Reviews dataset I performed sentiment analysis on consists of 568,454 food reviews collected between October 1999 and October 2012 [25]. This dataset was collected by the Stanford Network Analysis Project (SNAP) for the purpose of analyzing large social and information networks. It contains reviews from 256,059 unique users covering 74,258 products. Each product review has a score of one to five stars, which was used to label each review as either positive or negative. In the dataset, there were 363,122 five-star reviews, 80,655 four-star reviews, 42,640 three-star reviews, 29,769 two-star reviews, and 52,268 one-star reviews. A histogram of review scores is displayed in Figure 1.

To perform sentiment classification, reviews must first be given a ground truth sentiment. For the purposes of this project, 4- and 5-star reviews were labeled as positive, and reviews with 3 or fewer stars were labeled as negative. Overall, there were 443,777 positive reviews and 124,677 negative reviews. A histogram of positive and negative review scores is shown in Figure 2. This dataset was chosen because of its large size and because relatively few papers had been published using it. Since many machine learning models perform better as the amount of training data increases, having a larger dataset is advantageous. For a dataset to be usable for sentiment analysis, it must have a sentiment polarity score associated with each piece of text. This score can be derived from something like a review score or can be manually assigned. Because it would be impractical to label a large dataset myself, I needed to pick a dataset that had review scores or had previously been manually labeled. The Amazon Fine Foods Reviews dataset was larger than many of the publicly available datasets I could find that were suitable for sentiment analysis. Additionally, relatively few papers had been published using this dataset compared to comparable Twitter, Yelp, and IMDB datasets.

Figure 1 – Histogram of Review Scores in the Amazon Fine Foods Reviews Dataset.

Figure 2 - Histogram of Positive and Negative Reviews in the Amazon Fine Foods Reviews Dataset.


3.2 General Workflow

The general workflow for the project is illustrated in Figure 3 and consists of taking the raw reviews stored in a CSV file, cleaning and preprocessing the reviews, and classifying the reviews as positive or negative. Similar workflows are used for both lexicographical and machine learning approaches. The difference between the two is that the machine learning workflow has a word embedding step. Word embeddings are methods for converting review text strings into numerical vectors that can be used as inputs to machine learning models. Word embeddings are discussed in detail in section 5.

Figure 3 - Workflow for Project.


3.3 Hardware

Local Machine

• OS: Windows 10

• Processor: Intel i5 2.3GHz with 2 Cores

• GPU: N/A

• Memory: 8 GB RAM

Training a machine learning model on a large dataset is a computationally expensive task. Because of this, additional hardware was procured to reduce the training time of the machine learning models. A major factor when selecting hardware for machine learning is the large number of matrix multiplications performed when a model is trained. Graphics processing units (GPUs) have been shown to perform matrix multiplications faster than traditional central processing units (CPUs) [26]. Similarly, one study by the University of Toronto found that a GPU implementation of a neural network tested on an NVIDIA GTX280 GPU achieved a 66-fold speed-up over an optimized C++ implementation running on a 2.83 GHz Intel processor [27]. Because of this, a virtual machine with a GPU was provisioned to train complex machine learning models.

Virtual Machine

• Hosted on Google's Cloud Platform

• OS: Debian 9

• Processor: 2 x virtual CPUs


• Memory: 13 GB RAM

• GPU: NVIDIA Tesla K80

3.4 Software

• Python 3

• Anaconda

• Jupyter Notebook

• Keras (TensorFlow Backend)

• NLTK

• Pandas

• Numpy

• Beautiful Soup

• Scikit-learn

• AFINN

• Matplotlib

• Inflect

• TextBlob


4. Data Preprocessing

Building a model with raw data can lead to poor prediction accuracy. Many large datasets are incomplete, inaccurate, and difficult to work with. Data cleaning is a general process for removing inaccuracies and noise from a dataset [28]. Another important task in data preprocessing is data reduction. The goal of data reduction is to reduce the number of features in a given dataset. Having too many features in a dataset can lead to what is known as the curse of dimensionality: as the number of features in a dataset increases, so do the processing time and memory required to run machine learning methods on the data [28]. In the context of sentiment analysis, data reduction consists of reducing the number of unique words in the dataset without losing the sentiment polarity of reviews.

The best preprocessing steps for text data depend on both the dataset one is working with and the NLP task the data will be used for [29]. The methods used to clean and reduce the dataset are discussed in detail in the following paragraphs, but a brief overview of the steps performed is as follows:

1) Remove irrelevant features

2) Remove rows with null values for review text or review score.

3) Remove non-ASCII characters and convert emojis into their textual representations

4) Remove HTML/XML tags

5) Replace contractions

6) Remove punctuation

7) Convert uppercase characters to lowercase


8) Replace numbers with their textual representations

9) Remove stop words

10) Tokenize text

11) Lemmatize text

Before performing any of the preprocessing, the dataset was converted from a CSV file to a Pandas DataFrame to allow for easy manipulation and analysis. Next, irrelevant features were removed from the DataFrame. The features included in this dataset were Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, and Text. At this stage all features except for review text and score were removed. The next step in the data preprocessing process was removing entries with null values. Null values were searched for using the built-in Pandas function isnull, but no null values were found in the dataset.

Next, non-ASCII characters were removed from the corpus and replaced with their closest ASCII representations. For example, Cześć was turned into Czesc. Since emojis may contain important sentiment information, they were turned into their textual descriptions instead of being completely removed. For example, 😠 was converted into “Angry Face”. Next, HTML and XML tags were removed from the review strings. These tags don't provide any additional information about the reviews and take up computational resources. The removal of these tags was done using regular expressions.

Next, contractions were replaced with their full word representations, punctuation marks were removed, and numerical representations of numbers were replaced with their textual representations. Contractions were expanded using the Beautiful Soup Python library, and numbers were converted using the Inflect Python library. Next, stop words were removed from the reviews. Removing stop words from text data is a common practice in text preprocessing [30]. Stop words are common words that provide little lexical content [31]. Some common stop words are “a”, “the”, and “it” [31]. A list of 153 stop words from the NLTK Python library was used as a base list of stop words. From this base list, negation words such as “not” were removed (that is, kept in the reviews), since their presence in a sentence can completely change its sentiment. This edited list was used to remove stop words from the reviews.

Tokenization in NLP is the process of splitting a string into separate elements based on a delimiter [32, 33]. In the tokenization phase of data preprocessing, each review was split into tokens based on whitespace. For example, the sentence “The quick brown fox jumps over the lazy dog” would be tokenized into the following list of tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”. These operations were performed using the NLTK text preprocessing library. Tokenizing the review strings also makes later data cleaning steps more convenient. The final step in data preprocessing was lemmatization. Lemmatization is a method of reducing the inflectional forms of a word and is a common technique used in processing text data [34, 35]. For example, “ran”, “running”, and “runs” are all forms of the word “run”. By lemmatizing the text, we can reduce the number of unique words a model needs to learn. Lemmatization was performed on the review data using the NLTK library's WordNetLemmatizer method.
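As an illustration, the following is a minimal sketch of the core cleaning steps described above (lowercasing, punctuation removal, stop word removal with negations kept, tokenization, and lemmatization) using NLTK. It is not the project's exact pipeline: the earlier HTML, contraction, and number-conversion steps are assumed to have run already, and NLTK's word_tokenize is used here in place of the plain whitespace split described above.

```python
# A minimal sketch of the core preprocessing steps, assuming the HTML,
# contraction, and number-conversion steps have already been applied.
# Requires: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("wordnet")
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Base NLTK stop word list with negation words kept, as described above.
stop_words = set(stopwords.words("english")) - {"not", "no", "nor"}
lemmatizer = WordNetLemmatizer()

def preprocess(review):
    """Lowercase, strip punctuation, tokenize, drop stop words, lemmatize."""
    review = review.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(review)
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```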


5. Word Embeddings

Most machine learning methods are unable to directly process strings of text and instead require vectors of numerical values. Word embeddings are a way of converting raw text data into a numerical representation. Word embeddings typically fall into two categories: count-based and prediction-based [36, 37].

5.1 Count-Based Methods

Figure 4 – Example of Discrete Representation of Words.

Individual words can be represented mathematically as discrete-valued vectors using a scheme called one-hot encoding [37]. One-hot encoding uses binary vectors, where all but one index is zero, to represent different words. An example of one-hot encoding is shown in Figure 4. This raises the question: how can sentences and other bodies of text be represented as numerical values?

Count-based word embedding methods (also known as bag-of-words methods) compute statistics of how often words co-occur in a body of text [37]. One of the most important count-based methods is count vectorization. This simple word embedding technique builds upon the idea of one-hot encoding by creating a co-occurrence matrix based on which words occur in the same body of text [37]. Table 1 shows a count vectorization of the strings “This cat food smells bad” and “My cat loves this cat food”.

Although this method is simple, it is effective in many machine learning models. One major drawback of this word embedding method is the high dimensionality of count vectors. The dimensionality of a count vector is equal to the number of unique words in the corpus being vectorized. In addition to being high dimensional, these vectors can be sparse. A vector is said to be sparse when most of its elements are zero [38]. For example, suppose the count vectorization from Table 1 was performed on a larger set of strings with an overall vocabulary size of 5,000 instead of 7. The count vector for string 1 would still have 1s for “this”, “cat”, “food”, “smells”, and “bad” but would have zeros for the remaining 4,995 elements. Machine learning algorithms typically involve many matrix operations, so performing computations on sparse matrices can be expensive both in terms of time and memory. A direct quote from [39] regarding sparse matrices reads:

“It is wasteful to use general methods of linear algebra on such problems, because most of the O(N^3) arithmetic operations devoted to solving the set of equations or inverting the matrix involve zero operands.”

Table 1 – Sample Count Vectorization

         this  cat  food  smells  bad  my  loves
String 1    1    1     1       1    1   0      0
String 2    1    2     1       0    0   1      1
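The following short sketch reproduces Table 1 with scikit-learn's CountVectorizer, the class used for the count vectors in this project (see section 5.3). Note that scikit-learn orders the vocabulary alphabetically rather than by first appearance, so the columns differ in order from Table 1.

```python
# Reproducing Table 1 with scikit-learn's CountVectorizer. The vocabulary
# is sorted alphabetically, so the columns are ordered differently here.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This cat food smells bad", "My cat loves this cat food"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per string

print(vectorizer.get_feature_names_out())  # get_feature_names() in older versions
# ['bad' 'cat' 'food' 'loves' 'my' 'smells' 'this']
print(counts.toarray())
# [[1 1 1 0 0 1 1]
#  [0 2 1 1 1 0 1]]
```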


5.2 Prediction-Based Methods

The second type of word embedding method is prediction-based. Prediction-based methods have become popular in recent years and attempt to learn the relationships between words [37]. A notable prediction-based method is Word2vec, which uses a neural network to learn which words are similar in meaning [40]. A neural network is a machine learning method based on biological neurons [41]. Neural networks learn to recognize specific patterns in data by training on a sample dataset. The concept of neural networks is discussed in greater detail in section 6.

Word2vec is based on the distributional hypothesis of linguistics, which is the idea that words used in the same contexts tend to have similar meanings [40, 42]. Word2vec builds upon this concept by using a neural network to learn which words appear in similar contexts. By doing so, the model can learn which words are similar in meaning. Mathematically, the neural network takes in a large corpus of one-hot encoded text data and produces a vector space where each unique word is represented as a continuous-valued vector. These word vectors are positioned in the vector space such that words with similar meanings have similar vector representations [43]. Similarity between words can be calculated as the cosine similarity between their word vectors [44]. Cosine similarity is a measure of similarity that is commonly used to compare the similarity of words or documents [28, 44]. Suppose we have two word vectors x and y. The cosine similarity of the two word vectors can be calculated using the following equation:

$$sim(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where x·y is the inner product of x and y and ||x|| is the Euclidean norm of x [28]. A cosine similarity closer to 1 means the words are similar, and a value closer to 0 means the words are dissimilar. As an example, Table 2 shows the 10 most similar words to “tasty” according to a Word2vec model.
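A minimal sketch of this computation in NumPy follows; the two vectors are illustrative, not actual word vectors.

```python
# Cosine similarity as defined above: inner product over Euclidean norms.
import numpy as np

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(x, y))  # 1.0 -- parallel vectors are maximally similar
```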

Word2vec learns to map words to this vector space using one of two methods: Skip-Gram and Continuous Bag of Words (CBOW) [42]. The difference between the Skip-Gram model and the CBOW model is how the neural networks are trained. Both models examine a center word and a variable-length window of words around it [42]. Figure 5 illustrates a context window of size 2 around the word “brown” given the sentence “The quick brown fox jumps over the lazy dog”.

Figure 5 – Word2vec Example.

The CBOW model learns to predict the center word given the context window words, while the Skip-Gram model learns to predict the context window words given the center word [42]. In either case, the model is learning how to map words that appear in similar contexts to similar parts of the vector space.

Finally, sentences and other bodies of text can be represented using Word2vec word vectors. The simplest way of doing this is to represent a body of text as the average of the word vectors that make up the text [45].


Table 2 – 10 Most Similar Words to “tasty” in the Amazon Fine Foods Reviews Dataset Word2vec Model.

Word         Cosine Similarity
delicious    0.6921
yummy        0.5842
flavorful    0.5791
tastey       0.5581
goodtasting  0.4787
tastyi       0.4255
versatile    0.4104
enjoyable    0.3979
crunchy      0.3974
goodthese    0.3921

5.3 Word Embedding Implementations

The count vectors used in this paper were generated using the CountVectorizer class from the sklearn Python package. When working with Word2vec, one has the option of either training one's own word vectors or using an existing pre-trained set. Due to the popularity of Word2vec, a number of word vector sets have been trained and are available online. These pre-trained word vectors have been trained on massive datasets and are convenient to use. Training one's own word vectors requires a large amount of preprocessed text data but can potentially capture language and terminology unique to the dataset they are trained on. For this project, I evaluated both a pre-trained set of word vectors and a set of word vectors that I trained on the Amazon Fine Foods Reviews dataset.

Since there were no pre-trained word vectors for the Amazon Fine Foods Reviews dataset or product reviews in general, a set of word vectors trained on Google News data was used instead. This pre-trained model was trained by Google on a Google News dataset that contained over 100 billion words and contains 300-dimensional word vectors for 3 million unique words. To work with this pre-trained version of Word2vec, the Gensim library was used, and the word vector binary files were downloaded from http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python [46].

The word vectors trained as a part of this project were trained using the Gensim library. To train word vectors on the Amazon Fine Foods Reviews dataset, the reviews were first tokenized into individual sentences using NLTK's tokenize library. This list of all review sentences was then fed to the Word2vec model to train the word vectors. The resulting model has 300-dimensional word vectors and was trained on roughly 30 million words. For the remainder of the project, this set of word vectors is referred to as the domain-specific word vectors.
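A sketch of both steps with Gensim follows. The parameter values shown (window size, minimum count, Skip-Gram mode) are illustrative rather than the project's exact settings, the two placeholder sentences stand in for the real review sentences, and the API shown is Gensim 4.x (older versions use size instead of vector_size).

```python
# Training domain-specific word vectors with Gensim on tokenized sentences.
from gensim.models import Word2Vec

# In the project, one entry per review sentence; these two are placeholders.
sentences = [
    ["my", "cat", "loves", "this", "cat", "food"],
    ["this", "cat", "food", "smells", "bad"],
]

# 300-dimensional vectors to match the pre-trained Google News model.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)
print(model.wv.most_similar("cat", topn=3))  # cf. Table 2

# Loading the pre-trained Google News vectors instead:
# from gensim.models import KeyedVectors
# kv = KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin", binary=True)
```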


6. Models

Two lexicographical methods and six machine learning methods were evaluated for this paper. Each of the six machine learning models was paired with each of the three word embedding methods. The only exception was a CNN with a count vector embedding, which I was unable to get working properly. In total, 17 different combinations of machine learning method and word embedding were evaluated. These combinations are listed in Table 3. 75% of the dataset was used for training the machine learning models, while the remaining 25% was used to evaluate the performance of the models. The lexicographical methods were evaluated on the entire dataset since they did not require any training. This data split was done using the scikit-learn library's train_test_split function.

Table 3 – All Predictive Model and Word Embedding Combinations Used.

Model                  Word Embedding
SentiWordNet           N/A
AFINN                  N/A
Logistic Regression    Count Vector
Logistic Regression    Pre-Trained Word2vec
Logistic Regression    Domain-Specific Word2vec
SVM                    Count Vector
SVM                    Pre-Trained Word2vec
SVM                    Domain-Specific Word2vec
Bayesian Classifier    Count Vector
Bayesian Classifier    Pre-Trained Word2vec
Bayesian Classifier    Domain-Specific Word2vec
Random Forest          Count Vector
Random Forest          Pre-Trained Word2vec
Random Forest          Domain-Specific Word2vec
Multilayer Perceptron  Count Vector
Multilayer Perceptron  Pre-Trained Word2vec
Multilayer Perceptron  Domain-Specific Word2vec
CNN                    Pre-Trained Word2vec
CNN                    Domain-Specific Word2vec
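A minimal sketch of the 75/25 split follows; the feature matrix and labels below are random placeholders standing in for the embedded reviews and their sentiment labels.

```python
# 75/25 train/test split with scikit-learn; the data below is illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 300)           # e.g. reviews as 300-d Word2vec vectors
y = np.random.randint(0, 2, size=1000)  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```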


6.1 SentiWordNet

WordNet is a massive lexical database of English nouns, verbs, adjectives, and adverbs organized into sets of cognitive synonyms. In this database, words that are similar are linked together [47]. SentiWordNet is an opinion lexicon derived from the WordNet database and was created by applying semi-supervised learning methods to WordNet [48, 49]. Each of the dictionary's words has a positive, negative, and objective score associated with it that can be used to classify reviews. To calculate the positive and negative scores of a word, the word's part of speech must be provided. To get the parts of speech of review words, I used NLTK's pos_tag function. For this project, I used NLTK's sentiwordnet module as the implementation of SentiWordNet.

To classify a given review, the sums of the positive and negative word scores were computed, and if the positive score minus the negative score was greater than or equal to zero, the review was classified as positive. When a word being scored has multiple meanings, the most commonly used definition of the word is used.
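A minimal sketch of this scoring rule using NLTK's sentiwordnet corpus reader follows; for brevity it omits the POS tagging step described above and simply takes each word's most common sense.

```python
# SentiWordNet scoring: sum the positive and negative scores of each word's
# most common sense; a non-negative difference means a positive review.
# Requires: nltk.download("sentiwordnet"), nltk.download("wordnet")
from nltk.corpus import sentiwordnet as swn

def classify(tokens):
    pos_score = neg_score = 0.0
    for word in tokens:
        synsets = list(swn.senti_synsets(word))
        if synsets:  # first synset = most common sense
            pos_score += synsets[0].pos_score()
            neg_score += synsets[0].neg_score()
    return "positive" if pos_score - neg_score >= 0 else "negative"

print(classify(["terrible", "stale", "food"]))  # likely "negative"
```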

6.2 AFINN

AFINN is a sentiment dictionary developed by Årup Nielsen that contains 2,477 English words [10]. Each word is labeled with a polarity score between -5 and 5, where -5 is the most negative, 5 is the most positive, and 0 is considered neutral. The sentiment of the entire body of text is computed as the average of the polarity scores of the individual words in the text. To use this dictionary, a Python package called afinn was downloaded from Årup Nielsen's GitHub page https://github.com/fnielsen/afinn [10]. A difference between AFINN and SentiWordNet is that each word in the AFINN lexicon has only one score associated with it. This score is associated with the word itself and not with a specific definition of the word.
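A short sketch of AFINN scoring with the afinn package follows. Averaging by word count matches the description above, but the zero decision threshold is an assumption here, not a detail stated in this section.

```python
# AFINN scoring: Afinn.score returns the sum of word polarities; dividing
# by the word count gives the average described above.
from afinn import Afinn

afinn = Afinn()
review = "this cat food smells bad"
avg = afinn.score(review) / len(review.split())  # "bad" scores -3 -> -0.6
print("positive" if avg >= 0 else "negative")    # assumed threshold: 0
```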

6.3 Logistic Regression

Logistic regression is a machine learning algorithm used to predict binary outcomes given a set of independent variables [50]. Logistic regression is a non-linear transformation of the linear regression model [51]. This transformation is done by applying the sigmoid function to the output of the linear regression model. The sigmoid function maps any real value to a number between 0 and 1. The resulting number can be treated as a probability, which can be used for classifying a data point [52]. For example, if the probability of a data point belonging to class 1 is greater than 0.5, the data point is labeled as class 1. Due to its simplicity and power, logistic regression is one of the most popular machine learning algorithms. For this project, logistic regression was implemented using scikit-learn.
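A minimal sketch of the sigmoid transformation and a scikit-learn fit follows; the one-dimensional toy data is illustrative, not the project's features.

```python
# Sigmoid maps any real value into (0, 1); LogisticRegression learns the
# weights of the underlying linear model. The toy data is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.1], [0.4], [0.6], [0.9]])  # a single illustrative feature
y = np.array([0, 0, 1, 1])                  # 1 = positive review

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.8]]))  # [P(class 0), P(class 1)]
```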

6.4 Support Vector Machine

SVMs are a set of supervised learning methods used for classification and regression. Given a set of points in N-dimensional space, an SVM will generate an (N-1)-dimensional hyperplane to linearly separate the data points into two distinct groups [53]. The hyperplane the algorithm selects is the one that maximizes the separation between the two groups of points [53, 54]. An example SVM decision boundary is shown in Figure 6. In addition to being a strong linear classifier, SVMs can be used to classify data that is not linearly separable by using a technique known as the kernel trick [55]. An example of a dataset that is not linearly separable is shown in Figure 7. An intuitive way of thinking about the kernel trick is mapping the dataset into a different feature space where the data is linearly separable.

Figure 6 – SVM with Linearly Separable Data

In the context of SVMs, a kernel is a way of computing the dot product in some feature space (often a higher dimensional one) [55]. Computing the dot product is a crucial part of calculating the optimal dividing hyperplane between classes. The power of the kernel trick lies in the fact that one can calculate the dot product of vectors in a higher dimensional space without having to explicitly map each point into that space [55]. This allows one to solve for the optimal dividing hyperplane without mapping all features to a higher dimensional space [55]. Although using the kernel trick is more efficient than explicitly mapping features to a higher dimensional space, many kernels have performance issues when used on high dimensional datasets [56].

Because of these performance issues, SVMs with linear kernels were used. SVMs with linear kernels use the standard dot product which is less computationally expensive than other kernels [56]. In addition to saving computation time, SVMs with linear kernels have been shown to perform well on NLP tasks [57]. scikit-learn’s LinearSVC class was used to implement the SVMs for this project.


Figure 7 – Example of Linearly Inseparable Data.
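A minimal sketch of a linear-kernel SVM with scikit-learn's LinearSVC follows, on illustrative linearly separable toy data rather than the project's review vectors.

```python
# LinearSVC uses the standard dot product (linear kernel), the setup
# described above. The four points below are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])  # two linearly separable classes

clf = LinearSVC().fit(X, y)
print(clf.predict([[0.1, 0.0], [0.95, 0.9]]))  # [0 1]
```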

6.5 Random Forest

The decision tree is a supervised learning method that can be used for either regression or classification. Each internal node in a tree represents a decision that partitions the dataset into subsets, and each leaf node represents a subset of the dataset [58]. Starting with the root node, internal nodes are added to a decision tree based on some predetermined criterion such as information gain [44]. Information gain is a measure of how important a given feature is for discriminating between the classes to be learned. For example, in the case of sentiment analysis, the presence of the word “terrible” would likely tell us more about the polarity of a review than the presence of the word “plastic” and thus would likely have a higher information gain. We keep adding internal nodes until the dataset has been completely partitioned into subsets or another stopping condition is met. After training, a decision tree can be used to classify new data points using its internal nodes.


The Random Forest classifier is a type of ensemble learning algorithm that builds upon the ideas of the decision tree. Ensemble learning algorithms combine multiple machine learning methods into a single predictive model. The goal of ensemble methods is to overcome the shortcomings of individual methods. The three types of ensemble methods are bagging, boosting, and stacking [59]. Bagging algorithms take the average of multiple models in order to reduce the variance of the individual models [59]. Random forest is a type of bagging algorithm that creates a series of random decision trees to classify a training dataset [59].

A common drawback of decision trees is their propensity to overfit data [60]. By creating multiple decision trees, the random forest algorithm can substantially mitigate this issue. The random forest model is trained by creating multiple decision trees, where each tree is based on a random subset of the features and training data points. To classify an example, the model takes the individual predictions of the decision trees and selects the class by majority vote [58]. For this project, the Random Forest classifier was implemented using the scikit-learn library.
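A minimal sketch with scikit-learn's RandomForestClassifier follows; the number of trees and the random data are illustrative, not the project's actual settings.

```python
# Random forest: an ensemble of decision trees, each trained on a random
# subset of features and samples; prediction is by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 300)           # e.g. reviews as Word2vec vectors
y = np.random.randint(0, 2, size=100)  # 1 = positive, 0 = negative

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(clf.predict(X[:5]))  # majority vote across the individual trees
```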

6.6 Naïve Bayes Classifier

The Naïve Bayes classifier is a supervised learning algorithm based on Bayes' Theorem. Bayes' Theorem is a notable theorem from the field of probability built on the idea of conditional probability. Conditional probability is the probability that something will occur given that something else has occurred [61, 62]. The “naïve” assumption is that the probability of each feature is independent of all other features. The Naïve Bayes classifier works by calculating the most probable class given the input features. It is a relatively simple classification algorithm that scales well with the amount of data present and still performs well with smaller datasets [62]. For this project, I used the multinomial Naïve Bayes algorithm with the count vectorized word embedding and the Gaussian Naïve Bayes algorithm with the two Word2vec word embeddings. The two versions of the algorithm are similar but differ in what type of distribution is used to model the data. The Multinomial Naïve Bayes classifier assumes features conform to a multinomial distribution [63]. This version of the algorithm is suitable for discrete-valued inputs such as count vectorizations. On the other hand, the Gaussian Naïve Bayes algorithm assumes that the input features follow a Gaussian distribution [63]. This form of the algorithm is suitable for continuous-valued inputs such as our Word2vec embeddings. Both algorithms were implemented using the scikit-learn library.
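A minimal sketch of the two variants with scikit-learn follows, on random data shaped like the two embedding types described above.

```python
# MultinomialNB for discrete count vectors, GaussianNB for continuous
# averaged Word2vec vectors, mirroring the pairing described above.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

y = np.random.randint(0, 2, size=100)              # 1 = positive, 0 = negative
counts = np.random.randint(0, 3, size=(100, 50))   # count-vectorized reviews
vectors = np.random.randn(100, 300)                # averaged Word2vec reviews

mnb = MultinomialNB().fit(counts, y)   # multinomial distribution over counts
gnb = GaussianNB().fit(vectors, y)     # Gaussian distribution per feature
print(mnb.predict(counts[:3]), gnb.predict(vectors[:3]))
```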

6.7 Multilayer Perceptron

Multilayer are a type of artificial neural network that can be used for classification [40,64]. A perceptron is a simple computational unit based on biological neurons that consists of inputs, weights, an and outputs. The perceptron takes in the weighted input signals and applies the activation function to produce its output. An example of a perceptron is shown in Figure 8. In this figure xi are the inputs, wi are the weights, f(x) is the activation function and y is the output. The perceptron produces a single output based on several real-valued inputs by forming a linear combination of its inputs [64]. By adjusting the weights of the input, a perceptron can be used to approximate linear functions. The perceptron “learns” by adjusting the weights of the different inputs depending on how accurately the perceptron models the data [65].


Figure 8 – Perceptron Diagram.

When organized into layers, perceptrons can be used to model non-linear functions [64]. These models became known as multilayer perceptrons. Simple multilayer perceptrons consist of three types of layers: an input layer, which has a perceptron for each input feature; hidden layers, which attempt to learn to make predictions about the data; and an output layer, which outputs the result of the model. An example multilayer perceptron architecture is shown in Figure 9. In multilayer perceptrons, the output of each individual perceptron is fed to every perceptron in the following layer. The input and output layers have fixed numbers of perceptrons, but the hidden layers can have as many nodes as one desires.

Training a multilayer perceptron has two phases: the forward pass and backpropagation [64]. In the forward pass phase, signals from the input layer are sent to subsequent layers of the network and eventually reach the output layer. In the backpropagation phase, the output of the network is compared to the expected output for the given inputs. If the output of the network is incorrect, changes to the perceptron weights are made, starting with the layer closest to the output layer. These changes are propagated backwards through the network until they reach the first hidden layer. Thanks to improvements in computational hardware, artificial neural networks such as the multilayer perceptron have become popular machine learning tools due to their strong performance and flexibility. For this project, a multilayer perceptron was implemented using the Keras deep learning framework.

Figure 9 – Multilayer Perceptron Diagram.
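A minimal Keras sketch of such a network follows. The hidden layer size and training settings are illustrative rather than the thesis architecture, and the tensorflow.keras import path is used here in place of standalone Keras with a TensorFlow backend.

```python
# A small multilayer perceptron: one hidden layer feeding a sigmoid output
# that estimates P(positive). Data shapes are illustrative.
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

X = np.random.randn(100, 300)          # averaged Word2vec review vectors
y = np.random.randint(0, 2, size=100)  # 1 = positive, 0 = negative

model = Sequential([
    Dense(64, activation="relu", input_shape=(300,)),  # hidden layer
    Dense(1, activation="sigmoid"),                    # output: P(positive)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)  # forward pass + backprop
```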

6.8 Convolutional Neural Networks

CNNs are a class of neural network that has become popular in recent years. CNNs are most easily understood in the context of image processing. One of the defining features of the CNN is its use of convolution layers [66]. Convolution is an important concept in signal and image processing in which a filter is applied to an image or signal. The CNN's convolution layer applies a sliding filter to images to learn key features from the image.

Although this neural network architecture is better known for its performance on image data, it has been shown to have equally impressive results on NLP applications [66]. Regardless of application, CNNs perform well on inputs where relative spatial location is important. For example, in images, pixels that are near each other are more likely to form important features than pixels that are far from each other [66]. Similarly, words that are close to each other in a body of text are more likely to form important features than ones that are far from each other. Rather than extract features from adjacent pixels, CNNs in NLP applications learn features of adjacent words using filters. For NLP tasks, the input to a CNN is typically 2D (a vector of sentences) or 1D (a vector of all the words in a review). For this project, a 1D input consisting of the vectorized review words was used as the input to the CNNs [67].
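A minimal Keras sketch of a 1D convolutional classifier over sequences of word vectors follows; the sequence length, filter count, and kernel size are illustrative assumptions, not the thesis configuration.

```python
# Conv1D slides a filter over adjacent word vectors, mirroring how the
# network learns features of adjacent words. Data shapes are illustrative.
import numpy as np
from tensorflow.keras.layers import Conv1D, Dense, GlobalMaxPooling1D
from tensorflow.keras.models import Sequential

X = np.random.randn(100, 50, 300)      # 100 reviews, 50 words, 300-d vectors
y = np.random.randint(0, 2, size=100)  # 1 = positive, 0 = negative

model = Sequential([
    Conv1D(64, kernel_size=3, activation="relu",
           input_shape=(50, 300)),     # filter spans 3 adjacent words
    GlobalMaxPooling1D(),              # keep the strongest filter responses
    Dense(1, activation="sigmoid"),    # output: P(positive)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=3, verbose=0)
```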


7. Results

7.1 Metrics

The metrics the various models will be evaluated on are accuracy, precision, recall, and F1 score. Accuracy is the number of correct predictions divided by the total number of test data points. Although accuracy is an intuitive choice for determining how well a predictive model performs, it is not always the best metric for evaluating classification algorithms [68]. Accuracy's shortcomings are most evident when working with imbalanced datasets. Suppose a model is being developed to identify malignant tumors, and suppose the ratio of benign to malignant tumors in the training dataset is 99 to 1. If the model learns to mark all images as benign, it will achieve 99% accuracy. Although the accuracy of this model is high, the model does not accomplish its goal of identifying malignant tumors. The ratio of positive to negative reviews in the Amazon Fine Foods dataset is roughly 4 to 1. Although this is not as imbalanced as the previous hypothetical dataset, other performance metrics should still be used when evaluating the models. To better evaluate the models, precision, recall, and F1 score will be used in addition to accuracy.

Figure 10 –True Positives, False Positives, False Negatives, and True Negatives.


A prediction is considered a true positive when the model classifies an example as positive and it is positive; a false positive is when the model predicts positive but the example is negative; a false negative is when the model predicts negative but the example is positive; and a true negative is when the model predicts negative and the example is negative. This is summarized in Figure 10. Precision is the number of true positives divided by the total number of examples classified as positive:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall is the number of true positives divided by the total number of actual positive examples. Recall is defined mathematically by the following equation:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

F1 score is the harmonic mean of precision and recall and gives a general measure of how well the model fits the data:

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
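A small sketch computing all four metrics from raw counts follows; the example counts are arbitrary, and scikit-learn's metrics module produces the same values.

```python
# Accuracy, precision, recall, and F1 from true/false positive and
# negative counts, following the definitions above.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=80, fp=10, fn=20, tn=90))
# (0.85, 0.888..., 0.8, 0.842...)
```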


7.2 Model Results

Table 4 – Overall Results

Model                  Word Embedding            Accuracy
SentiWordNet           N/A                       73.56
AFINN                  N/A                       80.45
Logistic Regression    Count Vector              90.03
Logistic Regression    Pre-Trained Word2vec      85.36
Logistic Regression    Domain-Specific Word2vec  86.85
SVM                    Count Vector              90.84
SVM                    Pre-Trained Word2vec      85.33
SVM                    Domain-Specific Word2vec  86.81
Bayesian Classifier    Count Vector              88.00
Bayesian Classifier    Pre-Trained Word2vec      71.57
Bayesian Classifier    Domain-Specific Word2vec  77.17
Random Forest          Count Vector              89.58
Random Forest          Pre-Trained Word2vec      88.89
Random Forest          Domain-Specific Word2vec  90.03
Multilayer Perceptron  Count Vector              91.29
Multilayer Perceptron  Pre-Trained Word2vec      86.81
Multilayer Perceptron  Domain-Specific Word2vec  85.11
CNN                    Pre-Trained Word2vec      90.44
CNN                    Domain-Specific Word2vec  90.39

Table 5 – Single-Class Classification Results

                                                 Precision       Recall          F1-Score
Model                  Word Embedding            Neg.    Pos.    Neg.    Pos.    Neg.    Pos.
SentiWordNet           N/A                       38.85   82.35   35.81   84.17   37.27   83.25
AFINN                  N/A                       66.21   81.68   22.80   96.73   33.91   88.57
Logistic Regression    Count Vector              82.92   92.08   70.58   95.92   76.25   93.96
Logistic Regression    Pre-Trained Word2vec      72.63   87.74   52.26   94.49   60.99   90.99
Logistic Regression    Domain-Specific Word2vec  75.36   89.22   58.90   94.64   66.12   91.85
SVM                    Count Vector              82.00   93.04   74.58   95.41   78.11   94.21
SVM                    Pre-Trained Word2vec      73.45   87.46   51.15   94.86   60.30   91.01
SVM                    Domain-Specific Word2vec  76.12   88.92   57.47   94.98   65.49   91.85
Bayesian Classifier    Count Vector              75.65   90.96   66.72   93.98   70.90   92.44
Bayesian Classifier    Pre-Trained Word2vec      41.42   90.69   73.83   70.94   53.07   79.61
Bayesian Classifier    Domain-Specific Word2vec  48.83   92.30   76.73   77.63   59.68   84.33
Random Forest          Count Vector              96.80   88.57   54.24   99.50   69.52   93.72
Random Forest          Pre-Trained Word2vec      90.81   88.60   54.49   98.47   68.11   93.27
Random Forest          Domain-Specific Word2vec  89.31   90.17   61.62   97.95   72.92   93.90
Multilayer Perceptron  Count Vector              85.25   92.66   72.51   96.51   78.37   94.55
Multilayer Perceptron  Pre-Trained Word2vec      77.95   88.46   55.53   95.59   64.86   91.89
Multilayer Perceptron  Domain-Specific Word2vec  82.87   85.39   40.43   97.66   54.35   91.11
CNN                    Pre-Trained Word2vec      84.87   91.65   68.62   96.57   75.89   94.04
CNN                    Domain-Specific Word2vec  81.86   92.44   72.14   95.51   76.69   93.95


Table 4 shows the overall accuracy results, while Table 5 shows the individual class results with precision, recall, and F1-score. The most effective method of sentiment analysis was the multilayer perceptron with a count vectorization word embedding. In general, the machine learning methods outperformed the two lexicographical methods by a significant margin. The results also showed that for most of the machine learning methods, count vectorization was the most effective word embedding strategy. This result is surprising given the popularity of Word2vec and prediction-based word embedding strategies. Word2vec has been shown to have strong performance on NLP tasks involving individual words, with state-of-the-art performance on the tasks of rating semantic relatedness, concept categorization, and analogy identification [69]. Although Word2vec performs well on a number of different NLP tasks, it may not be well-suited for sentiment analysis.

One potential reason for the weaker performance of the Word2vec models is that Word2vec has trouble capturing sentiment information. For example, in both models, the words “good” and “bad” are close to each other in vector space. Table 6 shows the 10 most similar words to “good” in the domain-specific model, while Table 7 shows the 10 closest words in the pre-trained model. In the pre-trained Word2vec model, “bad” was the second closest word to “good”, behind only “great”. This happens because opposite words often occur in similar contexts. For example, given the sentence “Today's weather is _”, the blank could be replaced with either “good” or “bad”. Because opposites can often be used in the same context, Word2vec gives them similar vector representations. For this reason, the common implementation of Word2vec may not be well-suited for the task of sentiment classification.

Another potential reason for this is the way word vectors for reviews are created. The standard implementation of Word2vec maps individual words to a dense vector space. Taking the average of the word vectors that make up a review is a rough way of producing a vector for the review and may result in the loss of some semantic information [45, 70]. Instead of using Word2vec, a model such as Doc2vec that is intended to embed whole bodies of text could yield better results for sentiment classification depending on the input data. Additionally, averaging assumes that all words in a sentence contribute equally to the sentiment of the sentence. Research is being done on using weighted averages of word vectors to represent sentence vectors [71].

Another notable result was the performance of the Word2vec model that was trained on the relatively small Amazon Fine Foods dataset. When training prediction-based word embeddings, most sources suggest using large datasets. The roughly 30 million word corpus used to train the domain-specific word embeddings is small compared to the 100 billion word dataset used to train the Google News word embeddings. Yet for almost every machine learning method, the domain-specific Word2vec embeddings performed better than the pre-trained version. An explanation for this is the context-dependent nature of NLP. Although the pre-trained Word2vec model was trained on more data, it was trained on news data, which has different terminology and semantics from food product reviews. One such difference between news articles and user product reviews is the amount of slang and informal language. Product reviews likely contain internet acronyms such as “lol”, “omg”, and “bff”, while news articles contain more formal language. Similarly, the pre-trained word embeddings likely weren't trained to handle misspelled words. Although the domain-specific word vectors weren't trained on as much data, they seem to capture more domain-specific semantics than the pre-trained Google News vectors.

Table 6 – Domain-Specific Words Similar to Good

Word       Cosine Similarity
great      0.6779
decent     0.6246
excellent  0.5341
awesome    0.4683
bad        0.4650
goodi      0.4641
better     0.4532
nice       0.4373
fantastic  0.4351
like       0.4082

Table 7 – Pre-Trained Model Words Similar to Good

Word       Cosine Similarity
great      0.7291
bad        0.7190
terrific   0.6889
decent     0.6837
nice       0.6836
excellent  0.6442
fantastic  0.6407
better     0.6120
solid      0.5806
lousy      0.5764


8. Conclusion

For the task of sentiment classification, machine learning methods performed better than lexicographical methods. This was especially true when machine learning methods were paired with count vector word embeddings. Although simple, count-based word embeddings seemed to perform well despite the popularity of prediction-based methods such as Word2vec. Additionally, domain-specific Word2vec embeddings seem to outperform general pre-trained embeddings, even when the domain-specific model was trained on significantly less data.


REFERENCES

[1] “Big Data, for better or worse: 90% of world's data generated over last two years,” ScienceDaily. [Online]. Available: https://www.sciencedaily.com/releases/2013/05/130522085217.htm. [Accessed: 06-May-2018].
[2] M. Tanwar, R. Duggal, and S. K. Khatri, “Unravelling unstructured data: A wealth of information in big data,” in 2015 4th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), 2015, pp. 1–6.
[3] S. Al-Asmari and M. Dahab, “Sentiment Detection, Recognition and Aspect Identification,” International Journal of Computer Applications, vol. 177, pp. 975–8887, Nov. 2017.
[4] “Online Consumer-Generated Reviews Have Significant Impact on Offline Purchase Behavior,” Comscore, Inc. [Online]. Available: http://www.comscore.com/Insights/Press-Releases/2007/11/Online-Consumer-Reviews-Impact-Offline-Purchasing-Behavior. [Accessed: 17-Mar-2019].
[5] “Local Consumer Review Survey | Online Reviews Statistics & Trends,” BrightLocal. [Online]. Available: https://www.brightlocal.com/learn/local-consumer-review-survey/. [Accessed: 17-Mar-2019].
[6] O. Kolchyna, T. T. P. Souza, P. Treleaven, and T. Aste, “Twitter Sentiment Analysis: Lexicon Method, Machine Learning Method and Their Combination,” arXiv:1507.00955 [cs, stat], Jul. 2015.
[7] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing – Volume 10, 2002, pp. 79–86.
[8] M. V. Mäntylä, D. Graziotin, and M. Kuutila, “The evolution of sentiment analysis—A review of research topics, venues, and top cited papers,” Computer Science Review, vol. 27, pp. 16–32, Feb. 2018.
[9] Z. Hailong, G. Wenyan, and J. Bo, “Machine Learning and Lexicon Based Methods for Sentiment Classification: A Survey,” in 2014 11th Web Information System and Application Conference, 2014, pp. 262–265.
[10] H. Cho, J.-S. Lee, and S. Kim, “Enhancing lexicon-based review classification by merging and revising sentiment dictionaries,” in Proceedings of the Sixth International Joint Conference on Natural Language Processing, 2013, pp. 463–470.
[11] F. Å. Nielsen, “A new ANEW: evaluation of a word list for sentiment analysis in microblogs,” in Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, CEUR Workshop Proceedings, vol. 718, M. Rowe, M. Stankovic, A.-S. Dadzie, and M. Hardey, Eds., May 2011, pp. 93–98.
[12] X. Fang and J. Zhan, “Sentiment analysis using product review data,” Journal of Big Data, vol. 2, p. 5, Jun. 2015.
[13] C. Rain, “Sentiment Analysis in Amazon Reviews Using Probabilistic Machine Learning,” Swarthmore College, 2013.
[14] Z. Hailong, G. Wenyan, and J. Bo, “Machine Learning and Lexicon Based Methods for Sentiment Classification: A Survey,” in 2014 11th Web Information System and Application Conference, 2014, pp. 262–265.


[15] S. Wang, D. S. Wang, and R. Greiner, “Machine Learning Approaches to Sentiment Classification,” CMPUT 551: Course Project, Winter 2005.
[16] H. Thakkar and D. Patel, “Approaches for sentiment analysis on twitter: A state-of-art study,” arXiv preprint arXiv:1512.01043, 2015.
[17] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” p. 94.
[18] P. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 417–424.
[19] P. D. Turney, “Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL,” arXiv:cs/0212033, Dec. 2002.
[20] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing – Volume 10, 2002, pp. 79–86.
[21] R. Socher et al., “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank,” p. 12.
[22] K. S. Tai, R. Socher, and C. D. Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” arXiv:1503.00075 [cs], Feb. 2015.
[23] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” arXiv:1408.5882 [cs], Aug. 2014.
[24] P. Singhal and P. Bhattacharyya, “Sentiment Analysis and Deep Learning: A Survey,” p. 12.
[25] J. McAuley and J. Leskovec, “From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews,” arXiv:1303.4402 [physics], Mar. 2013.
[26] T. Dobravec and P. Bulić, University of Ljubljana, Faculty of Computer and Information Science, Slovenia, “Comparing CPU and GPU Implementations of a Simple Matrix Multiplication Algorithm,” International Journal of Computer and Electrical Engineering, vol. 9, no. 2, pp. 430–438, 2017.
[27] D. L. Ly, V. Paprotski, and D. Yen, “Neural Networks on GPUs: Restricted Boltzmann Machines,” p. 5.
[28] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Haryana, India; Burlington, MA: Morgan Kaufmann, 2011.
[29] J. Brownlee, “How to Clean Text for Machine Learning with Python,” Machine Learning Mastery, 17-Oct-2017. [Online]. Available: https://machinelearningmastery.com/clean-text-machine-learning-python/. [Accessed: 03-May-2018].
[30] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st ed. O'Reilly Media, Inc., 2009.
[31] “Stopwords.” [Online]. Available: https://www.ranks.nl/stopwords. [Accessed: 05-Jan-2019].
[32] “Tokenization.” [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html. [Accessed: 11-Apr-2019].
[33] “The Art of Tokenization (Language Processing),” IBM developerWorks. [Online]. Available: https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization. [Accessed: 11-Apr-2019].
[34] “Stemming and lemmatization.” [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html. [Accessed: 17-Mar-2019].
[35] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st ed. O'Reilly Media, Inc., 2009.
[36] H. Heidenreich, “Natural Language Processing: Count Vectorization with scikit-learn,” Towards Data Science, 24-Aug-2018. [Online]. Available: https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e. [Accessed: 17-Mar-2019].
[37] “Intuitive Understanding of Word Embeddings: Count Vectors to Word2Vec,” Analytics Vidhya, 04-Jun-2017. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/. [Accessed: 03-Oct-2018].
[38] “Sparse Vectors.” [Online]. Available: https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html. [Accessed: 11-Apr-2019].
[39] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd ed. Cambridge, UK; New York: Cambridge University Press, 2007.
[40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119.
[41] “A Beginner's Guide to Multilayer Perceptrons (MLP) | Skymind.” [Online]. Available: https://skymind.ai/wiki/multilayer-perceptron. [Accessed: 17-Mar-2019].
[42] Z. S. Harris, “Distributional Structure,” WORD, vol. 10, no. 2–3, pp. 146–162, Aug. 1954.
[43] “Vector Representations of Words,” TensorFlow. [Online]. Available: https://www.tensorflow.org/tutorials/word2vec. [Accessed: 22-Feb-2018].
[44] “A Beginner's Guide to Word2Vec and Neural Word Embeddings,” Skymind. [Online]. Available: http://skymind.ai/wiki/word2vec. [Accessed: 04-Dec-2018].
[45] Q. V. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” arXiv:1405.4053 [cs], May 2014.
[46] C. McCormick, Python code for checking out Google's pre-trained, 3M word Word2Vec model: chrisjmccormick/inspect_word2vec, 2019.
[47] “WordNet | A Lexical Database for English.” [Online]. Available: https://wordnet.princeton.edu/. [Accessed: 17-Mar-2019].
[48] S. Baccianella, A. Esuli, and F. Sebastiani, “SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining,” p. 5.
[49] B. Ohana, “Sentiment Classification of Reviews Using SentiWordNet,” Dublin Institute of Technology, 2009.
[50] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer, 2009.


[51] “Logistic Regression — ML Cheatsheet documentation.” [Online]. Available: https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html. [Accessed: 17-Mar-2019].
[52] “An Introduction to Logistic Regression.” [Online]. Available: http://www.appstate.edu/~whiteheadjc/service/logit/intro.htm. [Accessed: 05-May-2019].
[53] K. S. Parikh and T. P. Shah, “Support Vector Machine – A Large Margin Classifier to Diagnose Skin Illnesses,” Procedia Technology, vol. 23, pp. 369–375, Jan. 2016.
[54] “1.4. Support Vector Machines — scikit-learn 0.19.1 documentation.” [Online]. Available: http://scikit-learn.org/stable/modules/svm.html. [Accessed: 22-Feb-2018].
[55] S. Bhattacharyya, “Understanding Support Vector Machine: Part 2: Kernel Trick; Mercer's Theorem.” [Online]. Available: https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d. [Accessed: 15-Nov-2018].
[56] K. Eshghi and M. Kafai, “Support Vector Machines with Sparse Binary High-Dimensional Feature Vectors,” 2016.
[57] T. Joachims, “Text categorization with Support Vector Machines: Learning with many relevant features,” in Machine Learning: ECML-98, vol. 1398, C. Nédellec and C. Rouveirol, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 137–142.
[58] R. Saxena, “How Decision Tree Algorithm works,” Dataaspirant, 30-Jan-2017. [Online]. Available: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/. [Accessed: 17-Mar-2019].
[59] V. Smolyakov, “Ensemble Learning to Improve Machine Learning Results,” Stats and Bots, 22-Aug-2017. [Online]. Available: https://blog.statsbot.co/ensemble-learning-d1dcd548e936. [Accessed: 17-Mar-2019].
[60] N. Donges, “The Random Forest Algorithm,” Towards Data Science, 22-Feb-2018. [Online]. Available: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd. [Accessed: 12-Oct-2018].
[61] “In Depth: Naive Bayes Classification | Python Data Science Handbook.” [Online]. Available: https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html. [Accessed: 17-Mar-2019].
[62] “How the Naive Bayes Classifier works in Machine Learning.” [Online]. Available: http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/. [Accessed: 17-Mar-2019].
[63] “1.9. Naive Bayes — scikit-learn 0.20.3 documentation.” [Online]. Available: https://scikit-learn.org/stable/modules/naive_bayes.html. [Accessed: 05-May-2019].
[64] “A Beginner's Guide to Multilayer Perceptrons (MLP) | Skymind.” [Online]. Available: https://skymind.ai/wiki/multilayer-perceptron. [Accessed: 17-Mar-2019].
[65] J. Brownlee, “Crash Course On Multi-Layer Perceptron Neural Networks,” Machine Learning Mastery, 16-May-2016. [Online]. Available: https://machinelearningmastery.com/neural-networks-crash-course/. [Accessed: 17-Mar-2019].
[66] “How do Convolutional Neural Networks work?” [Online]. Available: https://brohrer.github.io/how_convolutional_neural_networks_work.html. [Accessed: 26-Feb-2018].


[67] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative Study of CNN and RNN for Natural Language Processing,” arXiv:1702.01923 [cs], Feb. 2017.
[68] W. Koehrsen, “Beyond Accuracy: Precision and Recall,” Towards Data Science, 03-Mar-2018. [Online]. Available: https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c. [Accessed: 17-Mar-2019].
[69] M. Baroni, G. Dinu, and G. Kruszewski, “Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, 2014, pp. 238–247.
[70] L. Wu et al., “Word Mover's Embedding: From Word2Vec to Document Embedding,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 4524–4534.
[71] C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt, “Representation learning for very short texts using weighted word embedding aggregation,” Pattern Recognition Letters, vol. 80, pp. 150–156, Sep. 2016.
