MACHINE LEARNING METHODS TO UNDERSTAND TEXTUAL DATA by Sahar Sohangir

A Dissertation Submitted to the Faculty of The College of Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Florida Atlantic University
Boca Raton, FL
December 2018

Copyright 2018 by Sahar Sohangir

ACKNOWLEDGEMENTS

I want to thank my advisor, Dr. Dingding Wang. It has been an honor to be her first Ph.D. student, and I appreciate all her contributions of time, ideas, and support throughout my Ph.D. I also gratefully acknowledge partial support by the National Science Foundation, under grant number CNS-1427536. Any opinions, findings, and conclusions or recommendations expressed in this dissertation are those of the author and do not necessarily reflect the views of the National Science Foundation. I would like to thank my family for all their love and encouragement.

ABSTRACT

Author: Sahar Sohangir
Title: Machine Learning Methods to Understand Textual Data
Institution: Florida Atlantic University
Dissertation Advisor: Dr. Dingding Wang
Degree: Doctor of Philosophy
Year: 2018

The amount of textual data produced every minute on the internet is extremely high. Processing this tremendous volume of mostly unstructured data is not a straightforward task, but the enormous amount of useful information it contains motivates scientists to investigate efficient and effective techniques and algorithms to discover meaningful patterns. Social network applications provide opportunities for people around the world to be in contact and share their valuable knowledge through chat, comments, and discussion boards. People usually do not care about spelling and accurate grammatical construction of a sentence in everyday conversations, so extracting information from such datasets is more complicated. Text mining can be a solution to this problem. Text mining is a knowledge discovery process used to extract patterns from natural language. Applying text mining techniques to social networking websites can reveal a significant amount of information. Text mining in conjunction with social networks can be used for finding the general opinion about any particular subject, identifying human thinking patterns, and group identification. In this study, we investigate machine learning methods for textual data in six chapters.

1. Text representation and encoding: This chapter will take a look at some techniques to represent documents in vector space and some machine learning methods to analyze textual data.

2. Text Similarity: In this chapter, we will propose a new similarity measurement. This new similarity can alleviate the problem of cosine similarity in high-dimensional data.

3. Textual Data: Natural language processing and the techniques it includes will be investigated in this chapter. Lexicon-based and machine learning based approaches are the two commonly used techniques in sentiment analysis.

4. Lexicon Based Financial Sentiment Analysis: In this chapter, lexicon-based methods will be used to extract the sentiment of people in a financial forum. We will investigate whether people who are bullish (believe the stock price will increase) use positive words and people who are bearish (believe the stock price will decrease) use negative words in their sentences.

5. Financial Sentiment Analysis: This chapter investigates deep learning methods to extract the sentiment of users in a financial forum. Based on our results, the convolutional neural network is the best method to extract user sentiment in the financial forum.

6. Expert Recognition in Social Media: The main goal of this chapter is to evaluate deep learning methods to find people who are expert at predicting stock price movement. In other words, we will try to see whether there is any relation between people's words and their ability to predict stock prices.

To the graduate students of Florida Atlantic University.

List of Figures

1 Introduction and background
  1.1 Vector Space Model
  1.2 Text Preprocessing
    1.2.1 Tokenization
    1.2.2 Dropping common terms
    1.2.3 Equivalence classing of terms (Normalization)
    1.2.4 Capitalization
    1.2.5 Stemming and lemmatization
    1.2.6 Term scoring
  1.3 Learning Methods
    1.3.1 Supervised learning methods
    1.3.2 Supervised learning evaluation metrics
    1.3.3 Unsupervised learning methods
    1.3.4 Unsupervised learning evaluation metrics
    1.3.5 Semi-supervised learning methods
  1.4 Brief revision of the dissertation

2 Text Similarity
  2.1 Text Similarity Measurement
  2.2 Cosine Similarity
  2.3 Sqrt-Cosine Similarity
  2.4 ISC Similarity
  2.5 Experiment
  2.6 DataSets
  2.7 Learners
  2.8 Performance Metrics
  2.9 Experimental results
  2.10 Overall Results
  2.11 Results using Different Learners
  2.12 Results using Different datasets and Learners
  2.13 Summary

3 Textual Data
  3.1 Text mining approaches
  3.2 Information Retrieval
  3.3 Natural Language Processing
    3.3.1 Text summarization
    3.3.2 Sentiment analysis

4 Lexicon Based Financial Sentiment Analysis
  4.1 Why Financial Sentiment Analysis
  4.2 Previous work on Financial Sentiment Analysis
  4.3 Methodology
    4.3.1 VADER: Valence Aware Dictionary for sEntiment Reasoning
    4.3.2 SentiWordNet
  4.4 Experiments
    4.4.1 Machine Learning Approaches
    4.4.2 Lexicon Based Approaches
    4.4.3 Combined Results
  4.5 Summary

5 Financial Sentiment Analysis
  5.1 Social network
  5.2 Big Data
  5.3 Machine Learning in Social network information extraction
  5.4 Methodology
    5.4.1 Sentiment Analysis with Data Mining Approaches
    5.4.2 Increase Accuracy by using Feature selection
    5.4.3 Deep Learning in Big Data Analytics
    5.4.4 Sentiment Analysis with Deep Learning Approaches
    5.4.5 Results and Discussion
  5.5 Summary

6 Expert Recognition in Social Media
  6.1 How can we find the experts in Social Media?
  6.2 Previous work in finding Experts in Social Media
  6.3 Methodology
    6.3.1 Expert Recognition with Data Mining Approach
    6.3.2 Experiments Using Neural Networks
  6.4 Summary

7 Summary and future work
  7.1 Future Works

Bibliography

LIST OF FIGURES

2.1 Accuracy in classification box plot
2.2 Purity in clustering box plot

4.1 Comparative Area Under the ROC curve for Lexicon versus Machine Learning based sentiment analysis

5.1 Receiver Operating Characteristic for Logistic Regression
5.2 Accuracy of logistic regression by using feature selection methods
5.3 Distributed Memory Architecture
5.4 Distributed Bag of Words
5.5 Area Under the ROC curve for doc2vec with window size of 5 and 10
5.6 Area Under the ROC curve for Long Short-Term Memory
5.7 Compare Area Under the ROC curve for Convolutional Neural Network in various steps

6.1 Logistic regression (Area Under the ROC curve)
6.2 Area Under the ROC curve for window size of 5
6.3 Compare Area Under the ROC curve for Convolutional Neural Network in different steps

CHAPTER 1
INTRODUCTION AND BACKGROUND

The most common way to represent a document is the bag of words (BOW) [1, 2]. The bag-of-words model views a document as a collection of words and disregards grammar and word order. This leads to a vector representation which facilitates further analysis of the documents; for instance, by representing documents as vectors, the distance between the vectors can be used to measure the similarity between documents. This chapter will take a look at some preprocessing techniques that we need to apply to text datasets. We will also see some common machine learning methods to extract information from textual data.
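To make the bag-of-words idea concrete, the following minimal sketch builds count vectors for two toy documents. The library choice (scikit-learn) and the example sentences are our own assumptions for illustration, not something the dissertation specifies.

```python
# Bag-of-words sketch: each document becomes a vector of term counts,
# discarding grammar and word order. Toy corpus invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock price will increase",
    "the stock price will decrease",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # shape: (n_documents, vocabulary_size)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```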

1.1 VECTOR SPACE MODEL

Representing documents by numerical vectors enables efficient analysis of extensive collections of documents. This representation is called the "Vector Space Model" (VSM) [3]. The vector space model is used in various text mining algorithms and information retrieval systems. In VSM, each word has a weight which shows the importance of the word in the document.

1.2 TEXT PREPROCESSING

One of the critical components in analyzing textual data is preprocessing. For example, a text categorization framework comprises preprocessing, feature extraction, feature selection, and classification steps. Uysal et al. [4] have investigated the effect of preprocessing tasks, particularly in the area of text classification. Although feature extraction, feature selection, and the classification algorithm have a significant impact on the classification process, the preprocessing phase may also have a noticeable influence on this success. The preprocessing stage usually includes tokenization, dropping common terms (stop words), normalization, capitalization, stemming and lemmatization, and term scoring. In the following, we briefly describe each of these concepts.

1.2.1 Tokenization

Tokenization is the task of chopping the document up into pieces, called tokens. Tokens are often loosely referred to as terms or words, but we need to make a type/token distinction. A type is the class of all tokens containing the same character sequence. A token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing.

1.2.2 Dropping common terms

Stop words are common words that have little value in helping select documents that match a user's need; as a result, we need to exclude them from the vocabulary. One popular method to build a stop list is to sort the terms by the total number of times each term appears in the document collection; the most frequent terms are then often hand-filtered for their semantic content relative to the domain of the documents. Discarding stop words can significantly reduce the number of postings that a system has to store. Some modern designs of information retrieval systems use better ways to reduce the impact of common words. For example, idf (inverse document frequency), as a standard term weighting, leads to very common words having little impact on document rankings. You can find more about idf in Section 1.2.6.

1.2.3 Equivalence classing of terms (Normalization)

There are many cases when two words are not quite the same, but you would like a match to occur. For instance, if the word you search for is "USA", you want to see all documents that contain "U.S.A" as well. Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequence of the tokens. One of the most commonly used methods in normalization is creating equivalence classes, which are normally named after one member of the set. For instance, if in document text and queries the tokens U.S.A and USA are both mapped onto the term USA, then searches for one will retrieve documents that contain either. Another method that can be used for normalization is to maintain relations between unnormalized tokens. The most common way is to index unnormalized tokens and to keep a query expansion list of multiple vocabulary entries to consider for a particular query term; a query term is then effectively a disjunction of several postings lists. An alternative is to perform the expansion during index construction: when a document contains automobile, we index it under car as well. Neither of these two methods is as efficient as equivalence classing: the first requires more processing at query time, and the second requires more space for storing postings. On the other hand, these approaches are more flexible than equivalence classes because the expansion lists can overlap while not being identical [3].

1.2.4 Capitalization

Case-folding, reducing all letters to lower case, is often a good idea. For instance, it can help to match ferrari and Ferrari in a search query. On the other hand, many names are distinguished from common words only by their capital letters, such as person names (Bush, Black), General Motors, or Associated Press. An alternative is to convert only some tokens to lower case, such as the first word of a sentence and all words in a title that is entirely uppercase. Machine learning can also help to decide when case-folding is appropriate; this method is known as truecasing.

1.2.5 Stemming and lemmatization

Words in a document can appear in different forms, but these various forms of a word do not give us extra information about the document. The primary goal of stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to lemmatize words correctly. Stemming chops off the ends of words and often includes the removal of derivational affixes. Lemmatization, on the other hand, usually refers to the use of a vocabulary and morphological analysis of words, typically aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Another difference is that stemming most commonly collapses derivationally related words, whereas lemmatization regularly only collapses the different inflectional forms of a lemma.
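The contrast between the two techniques is easy to see in code. Below is a small sketch using NLTK, which is our choice of toolkit (the dissertation does not prescribe one); the word list is illustrative.

```python
# Stemming chops suffixes by rule; lemmatization looks up dictionary forms.
# Requires: import nltk; nltk.download("wordnet") on first use.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["operating", "operational", "operations"]:
    print(word,
          "->", stemmer.stem(word),                   # e.g. "operating" -> "oper"
          "/", lemmatizer.lemmatize(word, pos="v"))   # verb base form when available
```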

1.2.6 Term scoring

Each document is evaluated based on the words that compose it. One way to score a document is by whether or not a query term is present in a zone within the document. To be more accurate, a document that mentions a query term more often should receive a higher score, since it has more to do with that query. The simplest way to achieve this is to assign a weight equal to the number of occurrences of term t in document d. This weighting is called term frequency and is denoted tf_{t,d}. In term frequency, all terms are considered equally important. However, a collection of documents on the car industry, for instance, is likely to have the term car in almost every document. We therefore need a mechanism to reduce the effect of terms that occur too often in the collection: we can reduce the tf weight of a term as its collection frequency increases. Collection frequency refers to the total number of occurrences of a term in the collection. It is more common to use document frequency instead of collection frequency. Document frequency df_t is the number of documents in the collection that contain a term t. Denoting the number of documents in a collection by N, the inverse document frequency of a term t is defined as follows:

\[ \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} \tag{1.1} \]

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.

Tf-idf weighting: By combining term frequency and inverse document frequency, we can define the tf-idf weighting scheme, which assigns to term t a weight in document d given by:

\[ \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \tag{1.2} \]

The value of tf-idf is high when term t occurs many times within a small number of documents, thus lending high discriminating power to those documents. The value is low when the term occurs fewer times in a document or occurs in many documents, and it is lowest for a term that occurs in virtually all documents. From now on, we can view each document as a vector with one component corresponding to each term in the dictionary, using tf-idf as the weight of each component. The representation of a set of documents as vectors in a common vector space is known as the vector space model, which is fundamental to information retrieval operations ranging from scoring documents on a query to document classification and document clustering. The score of a document d is defined as the sum of the tf-idf weights of each query term in d. This score is called the overlap score measure.

\[ \mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d} \tag{1.3} \]
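As a concrete illustration of Equations (1.1) to (1.3), the short sketch below computes tf-idf weights and the overlap score on a three-document toy collection; the documents and query are invented.

```python
# Direct implementation of tf (term counts), idf (Eq. 1.1), tf-idf (Eq. 1.2),
# and the overlap score measure (Eq. 1.3) on a toy collection.
import math
from collections import Counter

docs = [
    "car insurance auto insurance".split(),
    "best car auto".split(),
    "insurance rates".split(),
]
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d)  # document frequency df_t
    return math.log(N / df)                 # Eq. (1.1)

def tf_idf(term, doc):
    return Counter(doc)[term] * idf(term)   # Eq. (1.2)

def score(query, doc):
    # Eq. (1.3): sum tf-idf weights of query terms that occur in the collection
    return sum(tf_idf(t, doc) for t in query.split()
               if any(t in d for d in docs))

print(score("car insurance", docs[0]))
```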

1.3 LEARNING METHODS

1.3.1 Supervised learning methods:

Supervised learning methods are machine learning techniques that learn a classifier from training data in order to perform predictions on unseen data. Documents may be classified based on their subject, document type, content, author, publication year, etc. Text classification has been broadly studied in different areas like data mining, machine learning, and information retrieval, and it is used in various domains such as image processing, medical diagnosis, and document organization. A classification is soft if a probability of belonging to a class is assigned to an instance; it is hard if a label is explicitly assigned to the instance. Naive Bayes, nearest neighbor, decision trees, and support vector machines are examples of classifiers.

Naive Bayes Classifier: Naive Bayes [5] is one of the most commonly used probabilistic classifiers, and it works based on a presumption in document classification: it assumes that the distributions of different terms are independent of each other. Although this assumption is false, Naive Bayes performs surprisingly well. Among all the variants of Naive Bayes, the multi-variate Bernoulli and multinomial models are the most common.

Multi-variate Bernoulli Model: In this model, documents are represented by vectors of binary features. Each feature shows whether a word exists in the document.

Multinomial Model: In this model, each document is represented by a vector that shows the frequencies of the words in the document.

With a small vocabulary, the Bernoulli model is preferred over the multinomial model; with a large vocabulary, the multinomial model always outperforms the Bernoulli model, and it almost always performs better when the vocabulary size is chosen optimally for both models [6].

Assume that text documents are generated by a mixture model parametrized by θ. The mixture model consists of mixture components c_j ∈ C = {c_1, ..., c_{|C|}}. A document d_i is created by selecting a component according to the priors P(c_j | θ) and having the mixture component generate a document according to its own parameters, with distribution P(d_i | c_j; θ). As a result, the likelihood of a document can be calculated as a sum of total probability over all mixture components:

\[ P(d_i \mid \theta) = \sum_{j=1}^{k} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \tag{1.4} \]

Each document has a class label, and c_j is used to indicate both the j-th mixture component and the j-th class.

Given a set of labeled training examples, D = {d_1, d_2, ..., d_{|D|}}, as a first step we need to learn θ, the parameters of the probabilistic classification model. Then, using the estimates of these parameters, we can classify test documents by calculating the posterior probability of each class c_j and selecting the class with the highest probability [6].

\[ P(c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta})\, P(d_i \mid c_j; \hat{\theta}_j)}{P(d_i \mid \hat{\theta})} = \frac{P(c_j \mid \hat{\theta})\, P(\{w_1, w_2, \ldots, w_{n_i}\} \mid c_j; \hat{\theta}_j)}{\sum_{c \in C} P(\{w_1, w_2, \ldots, w_{n_i}\} \mid c; \hat{\theta}_c)\, P(c \mid \hat{\theta})} \tag{1.5} \]

Based on the naive Bayes assumption, the words in a document are independent, so:

\[ P(\{w_1, w_2, \ldots, w_{n_i}\} \mid c_j; \hat{\theta}_j) = \prod_{i=1}^{n_i} P(w_i \mid c_j; \hat{\theta}_j) \tag{1.6} \]

Nearest Neighbor: The nearest neighbor algorithm has been shown to be effective in text categorization [7]. For 1NN, we assign each document to the class of its closest neighbor. KNN is a similarity-based learning algorithm: for a given test instance, it finds the k nearest neighbors among the training documents and then uses the categories of those k neighbors to weight the category candidates. The weight of each neighbor document's category is defined by the similarity score of that neighbor to the test document, and cosine similarity can be used to weight the votes. In a scenario where two classes have the same number of neighbors in the top k, the query is assigned to the class with the more similar neighbors. Generally, weighting by similarities is more accurate than simple voting. In this scheme, the class score is computed as:

\[ \mathrm{score}(c, d) = \sum_{d' \in S_k(d)} I_c(d') \cos\big(\vec{v}(d'), \vec{v}(d)\big) \tag{1.7} \]

where S_k(d) is the set of d's k nearest neighbors and I_c(d') = 1 iff d' is in class c and 0 otherwise. Based on this definition, we assign a document to the class with the highest score. The parameter k in KNN is often chosen based on experience or knowledge about the classification problem at hand [8].
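A compact version of this classifier can be assembled from scikit-learn parts, as in the sketch below. The corpus and labels are invented, and weights="distance" only approximates the cosine-weighted voting of Eq. (1.7), since scikit-learn weights votes by inverse distance rather than by the similarity itself.

```python
# KNN text classification on tf-idf vectors with a cosine metric.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["stock prices rose", "earnings beat estimates",
              "rain expected today", "sunny skies this weekend"]
train_labels = ["finance", "finance", "weather", "weather"]

knn = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance"),
)
knn.fit(train_docs, train_labels)
print(knn.predict(["market prices fell"]))  # expected: ["finance"]
```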

Support Vector Machines: For two-class, separable training datasets, there are many possible linear separators. Some learning methods, like the perceptron algorithm, try to find just any linear separator; others, like Naive Bayes, try to find the best linear separator based on some criterion. The support vector machine (SVM) is one of the most successful algorithms for separating two classes. SVM defines the criterion to be a decision surface that is maximally far away from any data point; the distance from the decision surface to the closest data point determines the margin of the classifier. In other words, SVM tries to maximize the margin [9].

Based on this definition, the decision function for an SVM is fully specified by a small subset of the data that defines the position of the separator. These points are referred to as the support vectors; other data points play no part in determining the decision surface. Instances that are near the decision surface represent uncertain classification decisions: they have almost a 50% chance of being assigned either way by the classifier. A classifier with a large margin has a lower probability of misclassifying an instance, and even a slight error in measurement will not cause a misclassification. Another advantage of a large margin is that there are fewer choices of where the separator can be put; as a result, the memory capacity of the model is decreased, and we expect its ability to generalize correctly to test data to increase.

Support vector machines make the classification decision based on the value of a linear combination of the features of the documents. Thus, the output of a linear predictor is defined to be y = a · x + b, where x = (x_1, x_2, ..., x_n) is the normalized document word frequency vector, a = (a_1, a_2, ..., a_n) is a vector of coefficients, and b is a scalar. The predictor y = a · x + b can be interpreted as a separating hyperplane between different classes.

Since the SVM needs only the support vectors to do classification, it rarely needs feature selection, which also helps SVM to be robust to high dimensionality. Joachims et al. [10] explain that SVM is an ideal classifier for text data due to the sparse, high-dimensional nature of text with few irrelevant features. SVM is one of the common supervised learning models and has been used in many application domains such as pattern recognition, face detection, and spam filtering [11, 12, 13].
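A minimal linear SVM text classifier in this spirit might look as follows; scikit-learn's LinearSVC and the tiny spam/ham corpus are our illustrative assumptions, not the dissertation's setup.

```python
# Linear SVM over sparse tf-idf features: the sparse, high-dimensional
# representation is exactly the text setting Joachims describes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["free prize offer now", "cheap meds online",
        "meeting at noon", "project deadline tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["free offer online"]))  # expected: ["spam"]
```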

Decision Tree: The decision tree is another example of a supervised learning algorithm. This algorithm recursively partitions a dataset into smaller subdivisions to build a tree. The tree is composed of a root node, a set of internal nodes, and a set of leaf nodes. Each internal node in the decision tree has exactly one parent node and two or more descendant nodes.

A new instance is classified by passing it through the decision tree framework, sequentially subdividing according to the decision pattern defined by the tree; a class label is assigned to each observation according to the leaf node into which the observation falls. Several advantages distinguish the decision tree from other supervised learning models: decision trees are nonparametric and make no presumptions in defining the tree, and they can handle nonlinear relations between features and classes [14]. The decision tree is a common algorithm for text datasets. In text classification, the internal nodes of the decision tree are terms in the text documents; for instance, a node may be subdivided into its children based on the presence or absence of a particular term in the document. Decision trees have been used in combination with boosting techniques; ensemble methods like bagging or boosting can improve the accuracy of the decision tree. These methods increase the accuracy of weak learners like the decision tree by combining multiple of them.

1.3.2 Supervised learning evaluation metrics:

The way we evaluate a solution to a problem is called a performance measure. Depending on the type of problem, there are various performance measures; regression, classification, and clustering, for instance, each have their own. The score that these performance metrics provide should be meaningful in your problem domain. Sometimes we need to know more detail about the performance of the model. In a cancer detection classification problem, for instance, we want to know about the false positives and the false negatives, because it is very important to know what percentage of affected people are classified as healthy.

Table 1.1: Confusion matrix

                  Relevant                 Nonrelevant
  Retrieved       true positives (tp)      false positives (fp)
  Not retrieved   false negatives (fn)     true negatives (tn)

There are many standard performance measures, and we need to select them based on the problem at hand. In the following, we take a look at common performance measures for classification problems. The most commonly used measures for information retrieval effectiveness are precision and recall.

Precision (P) is the fraction of retrieved documents that are relevant:

\[ \mathrm{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved}) \tag{1.8} \]

Based on the confusion matrix in Table 1.1, we can define precision as:

\[ \mathrm{Precision} = \frac{tp}{tp + fp} \tag{1.9} \]

Recall (R) is the fraction of relevant documents that are retrieved

\[ \mathrm{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant}) \tag{1.10} \]

\[ \mathrm{Recall} = \frac{tp}{tp + fn} \tag{1.11} \]

Accuracy is another measure of effectiveness in information retrieval: the fraction of classifications that are correct. Based on the contingency table in Table 1.1, accuracy can also be defined in terms of true positives, true negatives, false positives, and false negatives:

\[ \mathrm{accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \tag{1.12} \]

Accuracy is not an appropriate measure for information retrieval, because it gives a misleading picture of classification performance, especially on skewed datasets. Consider a scenario where more than 99.9% of the documents are in the nonrelevant category. A method that simply deems all documents nonrelevant for all queries achieves an accuracy of 99.9%; although the model looks great by this measure, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user. Precision and recall are a good replacement for accuracy because they account for true positives and false positives. We need to consider both precision and recall because, in many circumstances, one is more important than the other. For instance, high precision is essential for web surfers, since they want every result on the first page to be relevant; recall matters less to them because they are not interested in looking at every relevant document. On the other hand, professionals such as paralegals and intelligence analysts are very concerned with getting as high a recall as possible. In a good system, precision usually decreases as the number of documents retrieved increases, while recall is a non-decreasing function of the number of documents retrieved.

The F measure is a single measure that trades off precision versus recall:

\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad \text{where} \quad \beta^2 = \frac{1 - \alpha}{\alpha} \tag{1.13} \]

where α ∈ [0, 1] and thus β^2 ∈ [0, ∞). The F measure is commonly written as F1, which is short for F_{β=1}. Setting β to one, the formula simplifies to:

\[ F_{\beta=1} = \frac{2PR}{P + R} \tag{1.14} \]

Receiver operating characteristic (ROC) analysis is another method for evaluating, comparing, and selecting classifiers by their performance. ROC analysis has been used in signal detection [15], psychophysics [16], and medicine [17]; its first applications in the field of machine learning date back to the 1980s [18]. ROC analysis measures the performance of two-class classification methods. The ROC graph is defined as a two-dimensional plot of FPR (= 1 − specificity) on the x-axis against TPR (sensitivity) on the y-axis. One advantage of ROC curves is that they are robust to changes in class distribution. For a probabilistic classifier, we build the ROC curve by sorting instances according to their scores. The process starts at (0,0) with the instance having the highest score; in every step, if the instance's true class is positive, we move one unit up, and if it is negative, one unit to the right. The process is repeated with different thresholds on the scores and terminates when the upper corner (1,1) is reached. All these points are finally connected to make the ROC curve. Selecting various threshold values may yield different (FPR, TPR) points, and it is an advantage of ROC curves that this is visible in the plot. The area under the ROC curve (AUC) can be used to compare a pair of classifiers: the AUC of a classifier is the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [19]. AUC has some special characteristics in comparison to other performance measures [20]:

• As AUC and the number of test samples increase, the standard error decreases.

• AUC is independent of a decision threshold.

• AUC is invariant to prior class probabilities.

• AUC shows to what degree the negative and positive classes are separated.

The value of AUC is between zero and one, and it is related to other measures such as the Wilcoxon statistic [21] and the Gini index [22].
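All of the measures in this section are one call away in scikit-learn; the sketch below evaluates a hypothetical classifier's hard predictions and ranking scores (the label vectors are made up for illustration).

```python
# Precision (Eq. 1.9), recall (Eq. 1.11), F1 (Eq. 1.14), and AUC on toy labels.
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred  = [1, 1, 0, 1, 0, 0, 0, 0]                    # hard decisions
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]   # ranking scores

print("precision:", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("recall:   ", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("F1:       ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))   # ranking view of AUC
```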

1.3.3 Unsupervised learning methods:

These techniques work with unlabeled data and try to find hidden structure in it. Since there is no labeled data, there is no training phase, and they can be applied to any text data without manual effort. In the context of text data, the two most common unsupervised learning approaches are clustering and topic modeling. In clustering, we segment a collection of documents into groups (clusters) where documents in the same group are more similar to each other than to those in other groups. Document clustering is used in filtering, topic extraction, summarization, and fast information retrieval. In topic modeling, soft clustering is used: each document has a probability distribution over all the clusters, and a probabilistic model is used to assign these probabilities. In topic models, topics are represented as probability distributions over words, and documents are expressed as probability distributions over topics. Each cluster has a topic, and there is a probability that shows the membership of a document to a topic.

Text datasets have some special characteristics that make them different from other datasets. For instance, text data usually has a very high-dimensional representation, because of the massive vocabulary from which a document can be built. At the same time, the underlying data is sparse, because a given document may have only a few hundred words. Also, the words of the vocabulary of a given collection of documents are correlated, and the number of concepts in the data is much smaller than the feature space; in text clustering, we need to consider this correlation between words. Finally, since documents have different numbers of words, we need to normalize them during the clustering process. In the following, we take a closer look at some common clustering methods for textual data.

K-Means: K-means is one of the most commonly used clustering algorithms for textual data. The primary objective of K-means is to minimize the average squared Euclidean distance of documents from their cluster centers. A cluster center is the mean, or centroid, of the documents in a cluster ω:

\[ \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x} \tag{1.15} \]

The ideal cluster in K-means is a sphere with the centroid as its center of gravity; in other words, the best centroid minimizes the distance to each vector in its cluster. This idea is used to measure how well the centroids represent the members of their clusters and is called the residual sum of squares, or RSS. The main goal of K-means is to minimize RSS as much as possible.

\[ \mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} \left| \vec{x} - \vec{\mu}(\omega_k) \right|^2 \tag{1.16} \]

The first step in the K-means algorithm is to select K documents randomly as the seeds. The algorithm then tries to reduce RSS by moving the cluster centers around the space: in each step, each document is reassigned to the cluster with the closest centroid, and each centroid is then recomputed based on the new members of its cluster [23] (a minimal code sketch follows the seed-selection notes below). When should we stop this process? We can select one of the following termination conditions based on the problem.

• If we want to limit the running time of the algorithm, we can define a fixed number of iterations. An insufficient number of iterations can, in some cases, lead to poor results.

• We can stop when the assignment of documents to clusters no longer changes between iterations. Although this can give us good clustering results, it may take a long time to execute.

• We can terminate the algorithm when RSS falls below a threshold. This criterion ensures the desired quality after termination. In practice, to guarantee termination, we need to combine it with a bound on the number of iterations.

• Terminate when the decrease in RSS falls below a threshold. Again, this criterion should be combined with a bound on the number of iterations to prevent long runtimes.

K-means cannot work appropriately if a document set contains many outliers. If an outlier is chosen as an initial seed, we end up with a singleton cluster, because no other vector is assigned to it during subsequent iterations. How, then, can we select seeds that lead to a good result?

• We need to exclude outliers from the seed set.

• We can try the algorithm with multiple starting points and choose the clustering with the lowest cost.

• We can obtain seeds from another method such as hierarchical clustering.
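The sketch promised above runs K-means on tf-idf document vectors with scikit-learn (our library choice; the documents are invented). The n_init parameter implements the multiple-starting-points advice: the algorithm is restarted from several random seeds and the solution with the lowest RSS is kept.

```python
# K-means document clustering; inertia_ is the final RSS of Eq. (1.16).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stock prices rose", "markets fell sharply",
        "rain expected today", "sunny skies tomorrow"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
print(km.fit_predict(X))  # cluster label per document
print(km.inertia_)        # residual sum of squares after convergence
```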

The Normalized Cuts Algorithm: The main idea behind this clustering algorithm is to find a hyperplane that passes through the dataset with as great a distance to the data points as possible while separating the data into two even clusters. Normalized cuts views the dataset as a graph, where nodes represent data points and edges are weighted according to the similarity or affinity between data points. Normalized cuts lifts the dataset to an infinite-dimensional feature space and cuts the data by passing a hyperplane through a "gap" in the lifted data. It then labels points that fall on the same side of the hyperplane as belonging to the same cluster [24]. Normalized cuts maximizes a gap that weights data points away from the mean of the dataset more than those in the center of the dataset. This weighting causes Normalized Cuts to be sensitive to outliers. By defining a new gap that gives equal weight to all data points, we can derive a clustering algorithm that does not exhibit these problems.

Given a set of data points X = {x_i | x_i ∈ R^d, i ∈ {1..N}} and an "affinity" measure k(x, y), build the affinity matrix K with K_{ij} = k(x_i, x_j). A common choice for K is the Gaussian kernel:

\[ k(x, y) = \exp\left( -\frac{\| x - y \|^2}{2\sigma^2} \right) \tag{1.17} \]

The affinity matrix K defines the weights on a fully connected graph where each node corresponds to a data point x_i and K_{ij} is the weight of the edge between node i and node j. Assigning a label y_i ∈ {−1, +1} to each x_i cuts the graph into a set A of the vertices with label −1 and a set B of the vertices with label +1. The cost cut(A, B) is the sum of the weights of the edges between vertices in A and vertices in B. Normalized cuts tries to find the best cut that minimizes the following cost function [25]:

\[ \mathrm{cut}(A, B)\left( \frac{1}{\mathrm{Vol}(A)} + \frac{1}{\mathrm{Vol}(B)} \right) \tag{1.18} \]

where Vol is the sum of the weights in a set. This cost function is designed to penalize cuts that are not well balanced. Finding the optimal normalized cut is an NP-hard problem, so the Normalized Cuts algorithm optimizes a relaxation of the above:

\[ v^* = \operatorname*{argmax}_{v} \frac{v^T D^{-1/2} K D^{-1/2} v}{v^T v} \quad \text{s.t.} \quad v^T D \mathbf{1} = 0 \tag{1.19} \]

D is a diagonal matrix whose ii-th entry is the sum of the i-th row of K, and 1 is the column vector of all ones. The optimum v* is the second eigenvector of D^{-1/2} K D^{-1/2}. The components of v* are then thresholded to yield a vector in {−1, +1}^N:

\[ \hat{y} = \mathrm{sgn}(v^*) \tag{1.20} \]

This is the labeling reported by Normalized Cuts. We refer to this algorithm as Normalized Cuts and to the unrelaxed cost function (1.18) as the Normalized Cut cost [26].
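The pipeline of Eqs. (1.17) to (1.20) is short enough to write out directly; the NumPy sketch below runs it on four invented 2-D points (sigma and the data are arbitrary choices for illustration).

```python
# Normalized Cuts sketch: Gaussian affinities (Eq. 1.17), the normalized
# matrix D^{-1/2} K D^{-1/2}, its second eigenvector (Eq. 1.19), and the
# sign thresholding of Eq. (1.20).
import numpy as np

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))        # affinity matrix

d = K.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
M = D_inv_sqrt @ K @ D_inv_sqrt

vals, vecs = np.linalg.eigh(M)                  # eigenvalues in ascending order
v = vecs[:, -2]                                 # second-largest eigenvector

print(np.sign(v))                               # cluster labels in {-1, +1}
```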

K-means Clustering via Principal Component Analysis: Principal component analysis (PCA) is one of the most effective unsupervised dimension reduction methods, proposed in 1901 by Karl Pearson as an analog of the principal axis theorem in mechanics. PCA uses the singular value decomposition (SVD), which gives the best low-rank approximation to the original data in the L2 norm [27].

Consider n data points in m-dimensional space contained in the data matrix X = (x_1, ..., x_n). We define the centered data matrix Y = (y_1, ..., y_n), where y_i = x_i − x̄ and x̄ = Σ_i x_i / n. The covariance matrix is given by S = Σ_i (x_i − x̄)(x_i − x̄)^T = YY^T. The principal eigenvectors u_k of YY^T are the principal directions of the data Y. The principal eigenvectors v_k of the Gram matrix Y^T Y are the principal components; the entries of each v_k are the projected values of the data points on the principal direction u_k. v_k and u_k are related via v_k = Y^T u_k / λ_k^{1/2}, where λ_k is the eigenvalue of the covariance matrix YY^T. The principal directions u_k and principal components v_k are eigenvectors satisfying [28]:

\[ YY^T u_k = \lambda_k u_k \tag{1.21} \]

\[ Y^T Y v_k = \lambda_k v_k \tag{1.22} \]

\[ v_k = Y^T u_k / \lambda_k^{1/2} \tag{1.23} \]

These are the defining equations for the SVD of Y:

\[ Y = \sum_k \lambda_k^{1/2} u_k v_k^T \]

The elements of v_k are the projected values of the data points on the principal direction u_k. Ding et al. [29] prove that the principal components are the continuous solution to the discrete cluster membership indicators for K-means clustering. Based on their results, unsupervised dimension reduction is closely related to unsupervised learning.

Probabilistic Clustering and Topic Models: Topic modeling is one of the popular clustering approaches that create a probabilistic generative model for a corpus of text documents [30, 31, 32]. The main goal of a topic model is extracting topics from a collection of documents. A topic model considers a topic as a probability distribution over words and a document as a mixture of topics. There are two main topic models: Probabilistic Latent Semantic Analysis (pLSA) [32] and Latent Dirichlet Allocation (LDA) [30]. LDA extends pLSA by introducing a Dirichlet prior on the mixture weights of topics per document; the main difference between the two models is that LDA provides a probabilistic model at the document level. In the LDA method, let D be the corpus and V the vocabulary of the corpus. A topic z_j, 1 ≤ j ≤ k, is represented as a multinomial probability distribution over the words, p(w_i | z_j), with Σ_{i=1}^{|V|} p(w_i | z_j) = 1. The distribution of words given the document is calculated as follows:

\[ p(w_i \mid d) = \sum_{j=1}^{k} p(w_i \mid z_j)\, p(z_j \mid d) \tag{1.24} \]
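As a concrete example, scikit-learn's LatentDirichletAllocation (one possible LDA implementation; the four documents below are invented) returns exactly the two distributions named above: a topic mixture per document and a word distribution per topic.

```python
# LDA sketch: doc_topic[i, j] estimates p(z_j | d_i); lda.components_
# relates to the per-topic word distributions p(w_i | z_j) of Eq. (1.24).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market trading shares", "market prices shares rally",
        "soccer match goal score", "team wins match final"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)
print(doc_topic.round(2))   # per-document topic mixtures
```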

Non-negative Matrix Factorization: The problem of computing NMF is formulated in Equation (1.25), where X ∈ R_+^{m×n}, C ∈ R_+^{m×k}, G ∈ R_+^{n×k}, ||·||_F denotes the Frobenius norm, and k < min{m, n}.

\[ \min_{C, G \geq 0} \| X - CG^T \|_F^2 \tag{1.25} \]

NMF can be used for dimensionality reduction and clustering of non-negative data [33]. Suppose we have n data points as the columns of X and try to group them into k clusters. In Equation (1.25), the columns of C are the cluster centroids, and the i-th column of G^T is e_j if x_i belongs to cluster j (e_j denotes the j-th column of I_{k×k}). By choosing the largest entry in the corresponding column of G^T, we can obtain the clustering assignment of each data point. This is possible because of the nonnegativity of NMF and is not possible in lower-rank approximation methods like the singular value decomposition. The main goal of clustering is to find a partitioning of the data points where similarity is high within each cluster and low across clusters.

NMF has received wide attention in clustering with many types of data, including documents [34], images [35], and microarray data [36]. NMF is especially successful in document clustering [37, 33], because documents with similar word distributions should be classified into the same group. Although NMF has been widely used for clustering, and sometimes even works better than classical clustering methods like K-means, it is not a general clustering method that performs well in every circumstance. The reason is that NMF assumes that each cluster can be represented by a single basis vector, and different clusters must correspond to different basis vectors. In other words, NMF tries to approximate the original data matrix; as a result, when the underlying k clusters have nonlinear structure, NMF cannot find k basis vectors that represent the k clusters respectively.

Most of the time, the relationships between data points are better represented in the form of a graph. In the graph model, each node corresponds to a data point, and a similarity matrix A_{n×n} contains similarity values between each pair of nodes; for instance, the (i, j)-th entry of A represents the similarity between x_i and x_j. Another method that has been widely used for clustering is a symmetric variation of NMF. SymNMF uses A directly as input. The factorization of A generates a clustering assignment matrix that is nonnegative and captures well the cluster structure inherent in the graph representation. Kuang et al. [38] formulate the nonnegative symmetric factorization (SymNMF) of the similarity matrix A as:

\[ \min_{H \geq 0} \| A - HH^T \|_F^2 \tag{1.26} \]

In this formula, H is a nonnegative matrix of size n × k, where k is the number of clusters required. Due to the nonnegativity of H, the largest entry in the i-th row of H indicates the clustering assignment of the i-th data point. SymNMF is more flexible in terms of choosing similarities for the data points; NMF implicitly chooses inner products as the similarity measure, which might not be suitable to distinguish different clusters.
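A small NMF clustering run illustrates the reading of Equation (1.25). Note that scikit-learn factors X ≈ WH with documents as rows, the transpose of the column convention used above, so cluster assignments come from the rows of W; the corpus is invented.

```python
# NMF document clustering: argmax over each document's factor row.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["stock market shares", "market prices rally",
        "soccer match goal", "team wins final"]

X = TfidfVectorizer().fit_transform(docs)      # documents are rows here
W = NMF(n_components=2, random_state=0).fit_transform(X)
print(np.argmax(W, axis=1))                    # cluster index per document
```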

1.3.4 Unsupervised learning evaluation metrics:

In this section, we take a look at some quality measures for clustering methods. Although clustering is unsupervised and there are no labels to evaluate a model against, there are some measures we can use to assess a method. In document clustering, the ideal goal is to partition documents into clusters with high intra-cluster similarity (similarity between documents inside a cluster) and low inter-cluster similarity (similarity between documents from different clusters). This is an internal criterion for the quality of a clustering. In an application, however, good results on an internal criterion do not necessarily mean that the clustering method is effective. Instead, we can evaluate our model directly in the application, for instance by the time it takes users to find an answer using different clustering methods. Although this approach gives us a good sense of the quality of the clustering, it is expensive. The alternative to user judgment is a gold standard, produced by human judges with a high level of inter-judge agreement. We can then use an external criterion to evaluate how well the clustering matches the gold standard classes. In the following, we introduce some external criteria of clustering quality.

Purity: To measure the purity of a clustering, each cluster is assigned to the class which is most frequent in the cluster. The accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by the total number of instances.

\[ \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \tag{1.27} \]

where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes. A perfect clustering has a purity of one, while a bad clustering has a purity close to zero. Note that with a large number of clusters we can easily achieve high purity [3].

Normalized mutual information (NMI): The advantage of NMI compared to purity is that the number of clusters does not affect the result. NMI is defined as:

\[ \mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2} \tag{1.28} \]

Here I is mutual information; I(Ω; C) in Equation 1.28 measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are:

\[ I(\Omega; C) = \sum_{k} \sum_{j} P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k)\, P(c_j)} \tag{1.29} \]

where P(ω_k) is the probability of a document being in cluster ω_k, P(c_j) is the probability of a document being in class c_j, and P(ω_k ∩ c_j) is the probability of a document being in the intersection of ω_k and c_j. I(Ω; C) has a minimum value of zero when the clustering is random with respect to class membership; in that case, knowing that a document is in a particular cluster does not give us any new information about what its class might be. In Equation 1.28, H is entropy, defined as:

\[ H(\Omega) = -\sum_{k} P(\omega_k) \log P(\omega_k) \tag{1.30} \]

Mutual information does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better. In Equation 1.28, the normalization by the denominator [H(Ω) + H(C)]/2 fixes this problem. The value of NMI is always a number between zero and one [3].

Rand index: Clustering can be seen as a series of decisions; the main focus of clustering is to assign two similar documents to the same cluster. In this view, a true positive (TP) is the number of times two similar documents are assigned to the same cluster, and a true negative (TN) is the number of times the model decides to cluster two dissimilar documents into different clusters. On the other hand, the model can be wrong by clustering two dissimilar documents together (FP) or by assigning two similar documents to different clusters (FN). The Rand index measures the percentage of decisions that are correct; this is the definition of accuracy:

\[ RI = \frac{TP + TN}{TP + FP + FN + TN} \tag{1.31} \]

False positives and false negatives have equal weight in the Rand index. Depending on the problem at hand, false positives or false negatives can be more costly; in some cases, for instance, separating similar documents is more costly than putting pairs of dissimilar documents in the same cluster. In the F measure, we can penalize false negatives more strongly than false positives by selecting a value β > 1; in this way we

(β2 + 1)PR F = (1.32) β β2P + R

1.3.5 Semi-supervised learning methods:

Although train a model by using labeled data can give us the better results but most of the time our data is unlabeled. Especially in a text context, there are a massive amount of unlabeled text data that lots of information latent on them. Labeled in- stances, however, are often expensive, and time-consuming to obtain because they require the efforts of experienced human annotators. Semi-supervised learning is a method that helps us to use a significant amount of unlabeled data, together with the labeled data to build a better classifier. On the other words, semi-supervised learning uses the advantages of supervised and unsupervised learning at the same time. The favored approach is to use the expectation-maximization (EM) algorithm on generative models such as Naive Bayes, treating unlabeled data as data with missing information. Maximum entropy is another popular model due to its com- putational tractability and straightforward optimization. In this research we do not use semi-supervised learning and investigating this method is out of the scope of this study.

1.4 BRIEF REVISION OF THE DISSERTATION.

Similarity measurement plays a significant rule in natural language processing. Sim- ilarity measurement broadly uses in information retrieval, text classification, docu- ment clustering, topic detection, questions generation, , , text summarization, etc. In chapter2, we will propose a new similarity measurement based on Hellinger distance. We will compare the performance of the new similarity with cosine sim-

24 ilarity. In our experiment, we used the common text dataset that usually uses as a benchmark. Also, we applied classification methods that we discuss in this chap- ter such as Naive Bayes, SVM, and KNN and clustering methods such as K-mean, the normalized cuts, and K-mean clustering via principal component analysis. We evaluated the performance of classification methods by using AUC and accuracy also the performance of the clustering by using accuracy, purity and normalized mutual information. As a next step, we will evaluate the effect of each of these elements including dataset, methods, and performance metrics in the performance of the new similarity and compare the results with cosine similarity. Chapter3 is an overview of natural language processing and information retrieval. We also will discuss sentiment analysis as an example of NLP tasks. Lexical-based methods and supervised machine learning based methods can be used to extract the sentiment of people from their text documents. Lexical-based use a predefined list of positive and negative words to extract the sentiment of new documents. We will discuss lexicon based sentiment analysis in chapter4. On the other hand, machine learning methods don’t need any predefined lexicon. They use machine learning methods to predict the sentiment of the author based on his/her words. Chapter5 and6 is allocated to machine learning based sentiment analysis.

CHAPTER 2
TEXT SIMILARITY

Text similarity measurement aims to find the commonality existing among text documents, which is fundamental to most information extraction, information retrieval, and text mining problems. Cosine similarity based on Euclidean distance is currently one of the most widely used similarity measurements. However, Euclidean distance is generally not an effective metric for dealing with probabilities, which are often used in text analytics. In this chapter, we propose a new similarity measure based on sqrt-cosine similarity.

2.1 TEXT SIMILARITY MEASUREMENT

In the past decade, there has been explosive growth in the volume of text documents flowing over the Internet. This has brought about a need for efficient and effective methods of automated document understanding, which aims to deliver desired information to users. Document similarity is a practical and widely used approach to address the issues encountered when machines process natural language. The similarity between two documents is measured based on the distance between them, and there are many different distance measurement techniques.

Block distance, also known as Manhattan distance, computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed; the Block distance between two items is the sum of the differences of their corresponding components [39]. Euclidean distance, or L2 distance, is the square root of the sum of squared differences between corresponding elements of the two vectors. The matching coefficient is a very simple vector-based approach which simply counts the number of terms (dimensions) on which both vectors are non-zero. The overlap coefficient considers two strings a full match if one is a subset of the other [40]. The Gaussian model is a probabilistic model which can characterize a group of feature vectors of any number of dimensions with two values, a mean vector and a covariance matrix; it is one way of calculating the conditional probability [41]. Traditional spectral clustering algorithms typically use a Gaussian kernel function as a similarity measure. Kullback-Leibler divergence [42] is another measure for computing the similarity between two vectors; it is a non-symmetric measure of the difference between the probability distributions corresponding to the two vectors [43]. The Canberra distance metric [44] is always used with non-negative vectors. Chebyshev distance is defined on a vector space where the distance between two vectors is the greatest of the differences along any coordinate dimension [45]. Triangle distance is considered as the cosine of a triangle between two vectors, and its value ranges between 0 and 2 [46]. The Bray-Curtis similarity measure [47], which is sensitive to outlying values, is a city-block metric. The Hamming distance [48], [49] is the number of positions at which the associated symbols are different. IT-Sim, an information-theoretic measure for document similarity, was proposed in [50], [51]. The Suffix Tree Document (STD) model [52] is a phrase-based measure.

Also, there are some similarity measures which incorporate the inner product in their definition. The inner product of two vectors yields a scalar which is sometimes called the dot product or scalar product [53]. In [54], Kumar and Hassebrook used the inner product to measure the peak-to-correlation energy (PCE). The Jaccard coefficient, also called Tanimoto [55], is the normalized inner product; Jaccard similarity is computed as the number of shared terms over the number of all unique terms in both strings [56]. The Dice coefficient [57], also called Sorensen, Czekannowski, Hodgkin-Richards [58], or Morisita [59], is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [60]. The cosine coefficient measures the angle between two vectors; it is the normalized inner product and is also called Ochiai [61] and Carbo [58]. Some similarity measures, like the soft cosine measure proposed in [62], take into account the similarity of features; they add new features to the vector space model by calculating the similarity of each pair of already existing features. Pairwise-adaptive similarity dynamically selects a number of features prior to every similarity measurement; based on this method, a relevant subset of terms is selected that will contribute to the measured distance between the two related vectors [63].

Some examples of document similarity applications are document clustering, document categorization, document summarization, and query-based search. Similarity measurement usually uses a bag-of-words model [64]. For example, consider that we want to compute a similarity score between two documents, t and d. One common method is to first assign a weight to each term in the document by using the number of times the term occurs, then invert the number of occurrences of the term in all documents (tf-idf_{t,d}) [65][66], and finally calculate the similarity based on the weighting results using a vector space model [67]. In a vector space scoring model, each document is viewed as a vector, and each term in the document corresponds to a component in vector space.

Another popular and commonly used similarity measure is cosine similarity, which can be derived directly from Euclidean distance; however, Euclidean distance is generally not a desirable metric for high-dimensional data mining applications. In this chapter, we propose a new similarity measurement based on Hellinger distance. Hellinger distance (L1 norm) is considerably more desirable than Euclidean distance (L2 norm) as a metric for high-dimensional data mining applications [68]. We conduct comprehensive experiments to compare our newly proposed similarity measurement with the most widely used cosine and Gaussian model-based similarity measurements in various document understanding tasks, including document classification, document clustering, and query search. These experimental results show that our proposed method is indeed effective.

2.2 COSINE SIMILARITY

Similarity measurement is a major computational burden in document understanding tasks, and cosine similarity is one of the most popular text similarity measures. Manning and Raghavan provide an example in [65] which clearly demonstrates the functionality of cosine similarity. In this example, four terms (affection, jealous, gossip, and wuthering) are extracted from the novels Sense and Sensibility (SaS) and Pride and Prejudice (PaP) by Jane Austen and Wuthering Heights (WH) by Emily Bronte. For the sake of simplicity, we ignore idf and use Equation 2.1 to calculate the log frequency weight of term t in novel d.

w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)

In Tables 2.1 and 2.2, the number of occurrences and the log frequency weight of these terms in each of the novels are provided, respectively. Table 2.3 then shows the cosine similarity between these novels. Cosine similarity returns one when two documents are practically identical and zero when the documents are completely dissimilar. In order to find the cosine similarity between two documents x and y, we need to normalize them to one in the L2 norm (Eq. 2.2):

\sum_{i=1}^{m} x_i^2 = 1 \qquad (2.2)

Table 2.1: Term frequencies of terms in each of the novels

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Table 2.2: Log frequency weight of terms in each of the novels

term        SaS   PaP   WH
affection   3.06  2.76  2.30
jealous     2.00  1.85  2.04
gossip      1.30  0     1.78
wuthering   0     0     2.58

Given two normalized vectors x and y, the cosine similarity between them is simply their dot product (Eq. 2.3).

\cos(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \sqrt{\sum_{i=1}^{m} y_i^2}} \qquad (2.3)

Careful examination of Equation (2.3) shows that cosine similarity is directly derived from Euclidean distance (Eq. 2.4).

d_{Euclid}(x, y) = \left[ \sum_{i=1}^{n} (x_i - y_i)^2 \right]^{\frac{1}{2}} = \left[ 2 - 2 \sum_{i=1}^{n} x_i y_i \right]^{\frac{1}{2}} \qquad (2.4)

Table 2.3: Cosine similarity between novels

      SaS   PaP   WH
SaS   1.00  0.94  0.79
PaP   0.94  1.00  0.69
WH    0.79  0.69  1.00
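For concreteness, the weighting of Eq. (2.1) and the cosine of Eq. (2.3) can be reproduced in a few lines of Python. This is a minimal illustrative sketch using the term frequencies of Table 2.1; it is not the implementation used in our experiments.

```python
import math

# Term frequencies from Table 2.1 (terms: affection, jealous, gossip, wuthering).
tf = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_weight(freqs):
    # Eq. (2.1): w = 1 + log10(tf) if tf > 0, else 0.
    return [1 + math.log10(f) if f > 0 else 0.0 for f in freqs]

def cosine(x, y):
    # Eq. (2.3): inner product normalized by the L2 norms of both vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

w = {novel: log_weight(freqs) for novel, freqs in tf.items()}
print(round(cosine(w["SaS"], w["PaP"]), 2))  # 0.94, as in Table 2.3
print(round(cosine(w["SaS"], w["WH"]), 2))   # 0.79, as in Table 2.3
```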

2.3 SQRT-COSINE SIMILARITY

Zhu et al. in [69] attempted to use the advantages of Hellinger distance (Eq. 2.6) and proposed a new similarity measurement, sqrt-cosine similarity. They claim that, as a similarity measurement, it provides a value between zero and one, which is better assessed with probability-based approaches, whereas Euclidean distance is not a good metric for dealing with probabilities. The sqrt-cosine similarity defined in Eq. (2.5) is based on the Hellinger distance of Eq. (2.6).

\mathrm{SqrtCos}(x, y) = \frac{\sum_{i=1}^{m} \sqrt{x_i y_i}}{\left( \sum_{i=1}^{m} x_i \right) \left( \sum_{i=1}^{m} y_i \right)} \qquad (2.5)

H(x, y) = \left[ \sum_{i=1}^{m} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2 \right]^{\frac{1}{2}} = \left[ 2 - 2 \sum_{i=1}^{m} \sqrt{x_i y_i} \right]^{\frac{1}{2}} \qquad (2.6)

In some cases, the behavior of sqrt-cosine similarity is in conflict with the definition of a similarity measurement. To clarify our claim, we use the same example provided in Section 2.2. Sqrt-cosine similarity is calculated between these three novels and shown in Table 2.4. Surprisingly, the sqrt-cosine similarity between two identical novels does not equal one, exposing a flaw in this design. Furthermore, from Table 2.4, we can see that the SaS (Sense and Sensibility) novel is more similar to PaP (Pride and Prejudice) than to itself! Comparing Tables 2.4 and 2.3 also reveals that, as opposed to cosine similarity, we cannot distinguish within two decimal places of accuracy which novel is most similar to WH (Wuthering Heights). Based on the above example, we believe that sqrt-cosine similarity is not a trustworthy similarity measurement. To address this problem, we propose an improved similarity measurement based on sqrt-cosine similarity and compare it with other common similarity measurements.

Table 2.4: Sqrt-cosine similarity scores among novels

      SaS   PaP   WH
SaS   0.15  0.16  0.11
PaP   0.16  0.21  0.11
WH    0.11  0.11  0.11

2.4 ISC SIMILARITY

Information retrieval from high-dimensional data is very common, but this space becomes a problem when working with Euclidean distances: in higher dimensions, Euclidean distance can rarely be considered an effective distance measurement in machine learning. Aggarwal et al., in [68], prove from a theoretical and empirical view that the Euclidean (L2) norm is often not a desirable metric for high-dimensional data mining applications. For a wide variety of distance functions, because of the concentration of distance in high-dimensional spaces, the ratio of the distances of the nearest and farthest neighbors to a given target is almost one; as a result, there is little variation between the distances of different data points. Also in [68], the authors investigate the behavior of the

Lk norm in high-dimensional space. Based on these results, for a given high dimensionality d, it may be preferable to use a lower value of k. In other words, for a high-dimensional application, an L1 distance, like Hellinger, is more favorable than

L2 (Euclidean distance). We propose our improved sqrt-cosine (ISC) similarity measurement below.

\mathrm{ISC}(x, y) = \frac{\sum_{i=1}^{m} \sqrt{x_i y_i}}{\sqrt{\sum_{i=1}^{m} x_i} \sqrt{\sum_{i=1}^{m} y_i}} \qquad (2.7)

In Equation (2.5), each document is normalized to 1 in the L1 norm: \sum_{i=1}^{m} x_i = 1. We propose the ISC similarity measurement in Equation (2.7); in this equation, instead of the L1 norm itself, we use the square root of the L1 norm in the denominator.

Table 2.5: The proposed ISC similarity scores among novels

      SaS   PaP   WH
SaS   1.00  0.89  0.83
PaP   0.89  1.00  0.70
WH    0.83  0.70  1.00

The same example from Section 2.2 is used to compare our ISC similarity with the previous measure. The results of applying ISC similarity to these three novels are shown in Table 2.5. The similarity between two identical novels is one, and we can now clearly identify, within two decimal places of accuracy, which novel is most similar to WH (Wuthering Heights).
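The difference between the two measures is easy to verify in code. The following minimal Python sketch, using the rounded log frequency weights of Table 2.2, implements Eq. (2.5) and Eq. (2.7) and shows that sqrt-cosine self-similarity falls well below one while ISC self-similarity equals one:

```python
import math

def sqrt_cosine(x, y):
    # Eq. (2.5): denominator is the product of the L1 norms.
    num = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return num / (sum(x) * sum(y))

def isc(x, y):
    # Eq. (2.7): denominator uses the square roots of the L1 norms instead.
    num = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return num / (math.sqrt(sum(x)) * math.sqrt(sum(y)))

sas = [3.06, 2.00, 1.30, 0.0]  # log frequency weights from Table 2.2
pap = [2.76, 1.85, 0.0, 0.0]

print(round(sqrt_cosine(sas, sas), 2))  # about 0.16 -- self-similarity far from 1
print(round(isc(sas, sas), 2))          # 1.0 -- self-similarity equals 1
print(round(isc(sas, pap), 2))          # 0.89, as in Table 2.5
```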

2.5 EXPERIMENT

Cosine similarity is considered the "state of the art" in similarity measurement. ISC is very close to cosine similarity in terms of implementation complexity in major engines such as Spark [70] or any improved big data architecture [71] [72]. We conduct comprehensive experiments to compare ISC similarity with cosine similarity and Gaussian model-based similarity in various application domains, including document classification, document clustering, and query-based information retrieval. In this study, we use several popular learning algorithms and apply them to multiple datasets. We also use various evaluation metrics in order to validate and compare our results.

2.6 DATASETS

Five different datasets from different application domains were used in this experiment. In Table 2.6, a list of these datasets is presented. Our reason for selecting these datasets is that they are commonly used and considered a benchmark for document classification and clustering.

Table 2.6: Summary of the real-world datasets [73]

             #Sample  #Dim  #Class
CSTR             475  1000  4
DBLP            1367   200  9
Reuters         2900  1000  8-52
WebKB4          4199  1000  4
Newsgroups     11293  1000  20

More information about each of the datasets used in our experiments follows.

1. The CSTR dataset is a collection of about 550 abstracts of technical reports published from 1991 to 2007 in computer science journals at the University of Rochester. These can be classified into four different groups: Natural Language Processing, Robotics/Vision, Systems, and Theory.

2. The DBLP dataset contains the titles of the last 20 years' papers, published by 552 active researchers, from nine different research areas: database, data mining, software engineering, computer theory, computer vision, operating systems, machine learning, networking, and natural language processing.

3. Reuters-21578 is a collection of documents that appeared on the Reuters newswire in 1987. R8 and R52 are subsets of the Reuters-21578 Text Categorization collection. Reuters Ltd personnel collected this document set and labeled the contents.

4. The WebKB dataset contains 8,280 documents which are web pages from various college computer science departments. These documents are divided into seven groups: student, faculty, staff, course, project, department, and other.

The four most popular of these seven categories are selected to make the WebKB4 set: student, faculty, course, and project [74].

5. The 20 Newsgroups dataset is a collection of documents from about 20 different newsgroups [75]. Containing around 20,000 newsgroup documents, it is one of the most commonly-used datasets in text processing.

2.7 LEARNERS

We apply various classification and clustering methods to analyze the performance of our new similarity measurement. We used Nearest Neighbor, Naïve Bayes, and Support Vector Machine, which are among the most common classification models. As clustering models, we used K-Means, the Normalized Cut algorithm, K-Means clustering via Principal Component Analysis, and Symmetric Nonnegative Matrix Factorization (SymNMF). We implemented these learners in the R language [76].

2.8 PERFORMANCE METRICS

In the experiments, we use five different performance metrics to compare the models we constructed based on our ISC similarity with other similarity measures. The evaluation metrics include the following: Area under the ROC Curve [77], Accuracy for classification [78], Accuracy for clustering [79], Purity [65], and Normalized Mutual Information [80].
In addition to these performance metrics, we test the results for statistical significance at the α = 5% level using a one-factor analysis of variance (ANOVA) [81]. An ANOVA model can be used to test the hypothesis that classification performances for each level of the main factor(s) are equal versus the alternative hypothesis that at least one is different. In this chapter, we use a one-factor ANOVA model, which

can be represented as:

\psi_{jn} = \mu + \theta_j + \epsilon_{jn} \qquad (2.8)

where \psi_{jn} represents the response (i.e., AUC, ACC, Purity, or NMI) for the nth observation of the jth level of experimental factor θ; µ represents overall mean performance;

\theta_j is the mean performance of level j for factor θ; and \epsilon_{jn} is random error. In our experiment, θ is the similarity measure, and we aim to compare the average performance of the newly proposed similarity measurement with cosine similarity and the Gaussian-based similarity measurement. If at least one level of θ is different, many procedures exist that can be used to specify which levels of θ differ. In this chapter, we use Tukey's Honestly Significant Difference (HSD) test [82] to identify which levels of θ are significantly different.
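As an illustration of this testing procedure, the sketch below runs a one-factor ANOVA and Tukey's HSD with SciPy and statsmodels; the per-run accuracy values are hypothetical placeholders, not our experimental results.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-run accuracies for each similarity measure (illustrative only).
isc_acc      = np.array([0.66, 0.64, 0.67, 0.65])
cosine_acc   = np.array([0.64, 0.63, 0.65, 0.62])
gaussian_acc = np.array([0.29, 0.30, 0.28, 0.27])

# One-factor ANOVA: is at least one group mean different?
f_stat, p_value = f_oneway(isc_acc, cosine_acc, gaussian_acc)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey's HSD identifies which levels of the factor differ at alpha = 0.05.
scores = np.concatenate([isc_acc, cosine_acc, gaussian_acc])
groups = ["ISC"] * 4 + ["cosine"] * 4 + ["Gaussian"] * 4
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```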

2.9 EXPERIMENTAL RESULTS

In this section, we provide the results of our experiments and compare our ISC similarity with cosine similarity and Gaussian-based similarity. As a first step, we focus on the performance metrics across all five datasets and seven different learners (three classification and four clustering models). As a second step, we consider different learners separately to compare the performance of these similarity measurements for different learners. Finally, combinations of learners and datasets are considered to see their effectiveness.

2.10 OVERALL RESULTS

First, we compare the average performance of our proposed ISC similarity measurement with cosine similarity and the Gaussian-based similarity measurement. Results are provided in Tables 2.7 and 2.8. The Mean columns represent the average of the performance metrics across all learners (clustering and classification) and datasets.

Table 2.7: Average performances of the similarity measures across all clustering learners and datasets.

             Accuracy       Purity         NMI
Similarity   Mean    HSD    Mean    HSD    Mean    HSD
ISC          0.3563  A      0.5950  A      0.1590  A
cosine       0.3370  A      0.5608  A      0.1363  A
Gaussian     0.2949  A      0.5597  A      0.0990  A

Table 2.8: Average performances of the similarity measures across all classification learners and datasets.

             Accuracy       AUC
Similarity   Mean    HSD    Mean    HSD
ISC          0.6562  A      0.7901  A
cosine       0.6371  A      0.7780  A
Gaussian     0.2872  B      0.5582  B

According to the mean values in Tables 2.7 and 2.8, ISC similarity in all cases outperforms cosine similarity and the Gaussian-based similarity measurement.

Columns labeled HSD represent the results of Tukey's Honestly Significant Difference test at the 95% confidence level. If two similarity measurements have the same letter in the HSD column, then according to the HSD test their average performances are not significantly different from each other. For example, based on Table 2.8, using Area under the ROC Curve (AUC) or Accuracy as a performance measure for each classifier indicates that ISC and cosine similarity are in the same group, so their performances are not significantly different from each other. In Table 2.8, the Gaussian-based similarity belongs to group B, which means ISC and cosine similarity outperform the Gaussian-based similarity. Generally speaking, based on the HSD test, when averaging performance across all datasets and learners, the proposed ISC similarity and cosine similarity belong to the same group.

Figure 2.1: Accuracy in classification box plot.

Figure 2.2: Purity in clustering box plot.

In addition, we use box plots to examine outliers and the spread of the classification and clustering performance across all datasets and learners for these three similarity measurements. In this way, we can compare their performance at various points of the distribution, not only at the mean value as in Tables 2.7 and 2.8. For example, based on Figures 2.1 and 2.2, the distributions of accuracy and purity for ISC similarity are more favorable than those of cosine similarity and the Gaussian-based similarity.

2.11 RESULTS USING DIFFERENT LEARNERS

As a second step, we compare the effect of different learners on the performance of our ISC similarity, cosine similarity, and the Gaussian-based similarity measurement. Tables 2.9 and 2.10 show the average performance of these similarity measurements when applying different classification and clustering methods. We used K-Nearest Neighbor, Naïve Bayes, and SVM as our classification models; K-Means, Normalized Cut, K-Means clustering via Principal Component Analysis, and Symmetric Nonnegative Matrix Factorization (SymNMF) are the clustering methods we apply.

Table 2.9: Performances of the similarity measures using classification learners averaged across all datasets

                        KNN            Naïve Bayes    SVM
Metric     Similarity   Mean    HSD    Mean    HSD    Mean    HSD
Accuracy   ISC          0.7079  A      0.8589  A      0.4019  A
           cosine       0.6476  A      0.8633  A      0.4004  A
           Gaussian     0.4606  A      0.1795  B      0.2215  A
AUC        ISC          0.8779  A      0.8806  A      0.6120  A
           cosine       0.7977  AB     0.8892  A      0.6473  A
           Gaussian     0.6620  B      0.5084  B      0.5042  A

We summarize our observations below:

1. With Naïve Bayes as the base learner and using Accuracy and Area under the ROC Curve (AUC) to measure performance, ISC similarity and cosine similarity are preferred over the Gaussian-based similarity measurement.

2. Based on mean values, ISC similarity is preferred over cosine similarity and the Gaussian-based similarity measurement.

3. Based on the HSD test, both ISC similarity and cosine similarity belong to group 'A', the top grade range.

Table 2.10: Performances of the similarity measures using clustering learners averaged across all datasets

                        Kmeans         Ncut           PCA-Kmeans     SymNMF
Metric     Similarity   Mean    HSD    Mean    HSD    Mean    HSD    Mean    HSD
Accuracy   ISC          0.3354  A      0.3220  A      0.3090  A      0.4589  A
           cosine       0.3115  A      0.3104  A      0.3070  A      0.4191  A
           Gaussian     0.3005  A      0.3020  A      0.3020  A      0.2750  A
Purity     ISC          0.4357  A      0.5606  A      0.8499  A      0.5337  A
           cosine       0.4217  A      0.5626  A      0.7771  A      0.5072  A
           Gaussian     0.3919  A      0.5693  A      0.8457  A      0.4066  A
NMI        ISC          0.1740  A      0.1367  A      0.0369  A      0.2886  A
           cosine       0.1332  A      0.1321  A      0.0335  A      0.2464  A
           Gaussian     0.0992  A      0.1337  A      0.0309  A      0.1321  A

2.12 RESULTS USING DIFFERENT DATASETS AND LEARNERS

In this section, we investigate the effect of different datasets from various domains on the performance of the similarity measurements under discussion. We consider six different datasets from different application domains: WebKB, R8, R52, News, DBLP, and CSTR. Table 2.11 shows the results of evaluating all classification methods, and Table 2.12 presents the results of the clustering methods. We use various performance evaluations for both classification and clustering. For each performance metric, we specify a row which reports the number of datasets where the given technique is in group A, as well as the average performance across all six datasets. Based on Tables 2.11 and 2.12, regardless of learners and datasets, the ISC and cosine similarity measures are always in group A; on the other hand, the Gaussian-based similarity measurement is in group B for some datasets when we use classification learners. According to the average performance across all datasets in these tables, regardless of learners, datasets, or even quality measurement, ISC similarity always outperforms the Gaussian-based and also the cosine similarity measures.

Table 2.11: Performance of the similarity measures on each dataset, averaged across all classification learners.

                     ISC            cosine         Gaussian
Metric     dataset   Mean    HSD    Mean    HSD    Mean    HSD
Accuracy   WEBKB     0.6104  A      0.5929  A      0.3046  A
           R8        0.7166  A      0.7361  A      0.4485  A
           R52       0.4975  A      0.4230  A      0.1945  A
           NEWS      0.6009  A      0.5989  A      0.2468  A
           DBLP      0.7101  A      0.6842  A      0.2234  B
           CSTR      0.8019  A      0.7873  A      0.3052  B
           Average   0.6562         0.6370         0.2871
           #A's      6              6              4
AUC        WEBKB     0.8162  A      0.8304  A      0.6171  A
           R8        0.7342  A      0.6641  A      0.5341  A
           R52       0.7826  A      0.7540  A      0.5075  A
           NEWS      0.7514  A      0.7570  A      0.5852  A
           DBLP      0.9253  A      0.9287  A      0.6011  B
           CSTR      0.7313  A      0.7340  A      0.5040  A
           Average   0.7901         0.7780         0.5581
           #A's      6              6              5

Table 2.12: Performance of the similarity measures on each dataset, averaged across all clustering learners.

                     ISC            cosine         Gaussian
Metric     dataset   Mean    HSD    Mean    HSD    Mean    HSD
Accuracy   WEBKB     0.4798  A      0.4434  A      0.3824  A
           R8        0.4384  A      0.4291  A      0.4472  A
           R52       0.2320  A      0.2283  A      0.2395  A
           NEWS      0.1659  A      0.1544  A      0.1179  A
           DBLP      0.3886  A      0.3574  A      0.2640  A
           CSTR      0.4332  A      0.4095  A      0.3182  A
           Average   0.3563         0.3370         0.2948
           #A's      6              6              6
Purity     WEBKB     0.6248  A      0.6091  A      0.5548  A
           R8        0.5769  A      0.5790  A      0.6446  A
           R52       0.4440  A      0.4225  A      0.4478  A
           NEWS      0.4234  A      0.4410  A      0.3948  A
           DBLP      0.7980  A      0.6363  A      0.6531  A
           CSTR      0.7026  A      0.6704  A      0.6700  A
           Average   0.59495        0.55971        0.56085
           #A's      6              6              6
NMI        WEBKB     0.1500  A      0.1177  A      0.0879  A
           R8        0.1978  A      0.1912  A      0.2030  A
           R52       0.1376  A      0.1321  A      0.1179  A
           NEWS      0.0855  A      0.0761  A      0.0731  A
           DBLP      0.2439  A      0.1940  A      0.0948  A
           CSTR      0.1396  A      0.1069  A      0.0172  A
           Average   0.1590         0.1363         0.0990
           #A's      6              6              6


2.13 SUMMARY

Finding an effective and efficient way to calculate text similarity is a critical problem in text mining and information retrieval. One of the most popular similarity measures is cosine similarity, which is based on Euclidean distance. It has been shown to be useful in many applications; however, cosine similarity is not ideal, since Euclidean distance is based on the L2 norm and does not work well with high-dimensional data. In this chapter, we proposed a new similarity measurement technique, called improved sqrt-cosine (ISC) similarity, which is based on Hellinger distance. Hellinger distance is based on the L1 norm, and it has been shown that in high-dimensional data the L1 norm works better than the L2 norm. Most applications consider cosine similarity the "state of the art" in similarity measurement. We compare the performance of ISC with cosine similarity, and with other popular techniques for measuring text similarity, in various document understanding tasks. Through comprehensive experiments, we observe that although ISC is very close to cosine similarity in terms of implementation, it performs favorably when compared to other similarity measures in high-dimensional data.

CHAPTER 3
TEXTUAL DATA

A primary goal in processing textual data is extracting patterns, knowledge, or information from an unstructured text document and transforming it into an understandable structure for future use. A variety of tasks, including document categorization, text clustering, information extraction, sentiment analysis, document summarization, information retrieval, tagging, pattern recognition, word frequency study, visualization, and predictive analytics, are considered text analysis. In this chapter, we take a closer look at each of these concepts and their importance in processing textual data.

3.1 TEXT MINING APPROACHES

Text mining, or knowledge discovery from text (KDT), is the process of extracting high-quality information from text. Feldman et al. introduced this concept for the first time [83]. Knowledge discovery in databases is the nontrivial extraction of implicit, valid, new, and potentially useful information from data [84]. Data mining is the application of particular algorithms for extracting patterns from data. Text mining covers many topics and algorithms for analyzing text, including information retrieval, natural language processing, data mining, etc.

3.2 INFORMATION RETRIEVAL

The definition of information retrieval can be very broad. In the academic field, we can define it as follows: information retrieval (IR) is finding material (usually documents) of an

unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [3]. Data that is not easy for a computer to structure is defined as unstructured data. People use various forms of IR in a typical day; using a web search engine or searching their emails are some simple examples of IR. Information retrieval also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents. IR can be divided into three different groups based on the scale at which it operates. If the target is information retrieval from the web, or web search, there are billions of documents stored on millions of computers. At this enormous scale of data, IR needs to build systems that work effectively and handle particular aspects of the web, such as hypertext or the manipulation of a page to boost its search engine ranking.

3.3 NATURAL LANGUAGE PROCESSING

Research in Natural Language Processing (NLP) has been going on for several decades, dating back to the 1940s. The first computer-based application related to natural language was machine translation. There is no single agreed-upon definition of natural language processing that satisfies everybody. Liddy in [85] defines natural language processing as a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis to achieve human-like language processing for a range of tasks or applications. Based on this definition, there are various techniques to choose from to accomplish a particular type of language analysis (a range of computational techniques).
Naturally occurring texts in this definition means the text should not be constructed for the study; it should be gathered from actual usage. When we are working with human-produced language, there are multiple levels of

language processing known to be at work, but various NLP systems utilize different levels, or combinations of levels, of linguistic analysis, and this is seen in the differences among various NLP applications. The notion of levels of linguistic analysis refers to this point. Human-like language processing reveals that NLP is considered a discipline within Artificial Intelligence (AI); while the full lineage of NLP does depend on many other disciplines, since NLP strives for human-like performance, it is appropriate to consider it an AI discipline. For a range of tasks or applications in the definition refers to the fact that NLP is not usually considered a goal in itself, except perhaps for AI researchers. For others, NLP is the means of accomplishing a particular task. Therefore, there are Information Retrieval (IR) systems that utilize NLP, as well as Machine Translation (MT), Question Answering, etc.
Although the entire field is referred to as Natural Language Processing, there are two distinct focuses in NLP: language processing and language generation. Language processing involves the analysis of language for the purpose of producing a meaningful representation; its task is equivalent to the role of the reader/listener. Language generation refers to the production of language from a representation; its task is equivalent to the role of the writer/speaker.

Information extraction (IE) is the task of automatically extracting semantic concepts from unstructured or semi-structured text data. IE is commonly used in text mining, information retrieval, natural language processing, and web mining. Information extraction includes two fundamental tasks: named entity recognition and relation extraction.

Named entity recognition (NER): a sequence of words that identifies a real-world entity is called a named entity. NER is the task of classifying named entities in free text

into predefined categories such as person, organization, location, etc. NER is not as simple as matching words against a dictionary of entities. First of all, it is impossible to provide a dictionary that contains all of the entities. Also, context plays an essential role in NER. Most named entity recognition techniques are statistical learning methods such as hidden Markov models [86], maximum entropy models [87], support vector machines [88], and conditional random fields [89].

Hidden Markov Models: HMMs are probabilistic models which consider the predicted labels of the neighboring words, and they have been successfully used in named entity recognition systems. The hidden Markov model assumption is that the generation of a label or an observation depends on one or a few previous labels or observations. If we consider Y = (y_1, y_2, ..., y_n) as a sequence of

labels and X = (x_1, x_2, ..., x_n) an observation sequence, then:

y_i \sim p(y_i \mid y_{i-1}), \qquad x_i \sim p(x_i \mid y_i) \qquad (3.1)
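A toy Python sketch of this factorization is given below; the transition, emission, and initial tables are made-up illustrative values, not trained parameters.

```python
# Joint probability of a label sequence Y and an observation sequence X under
# the HMM factorization of Eq. (3.1). All probability tables are toy values.
transition = {("B-PER", "I-PER"): 0.6, ("B-PER", "O"): 0.4}    # p(y_i | y_{i-1})
emission   = {("B-PER", "john"): 0.3, ("I-PER", "smith"): 0.2}  # p(x_i | y_i)
initial    = {"B-PER": 0.1}                                     # p(y_1)

def joint_prob(labels, words):
    # p(Y, X) = p(y_1) p(x_1 | y_1) * product of p(y_i | y_{i-1}) p(x_i | y_i)
    p = initial.get(labels[0], 0.0) * emission.get((labels[0], words[0]), 0.0)
    for prev, cur, word in zip(labels, labels[1:], words[1:]):
        p *= transition.get((prev, cur), 0.0) * emission.get((cur, word), 0.0)
    return p

print(joint_prob(["B-PER", "I-PER"], ["john", "smith"]))  # 0.1*0.3*0.6*0.2 = 0.0036
```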

Relation extraction: the relation extraction task is to find the semantic relations among the entities in a text document. The primary goal in relation extraction is to categorize the relation between two entities into one of a fixed set of relation types, such as spouse-of, child-of, membership, etc. The most common technique in relation extraction is to treat the task as a classification problem. There are many studies, including [90, 91, 92, 93, 94], that use classification for relation extraction.

Event extraction: this is the task of finding the events in which these entities participate. We need to recognize temporal expressions, like days of the week, months, and holidays, to figure out when an event in a text happened.

Template filling: template filling can be used to find recurring stereotypical situations in documents and fill the template slots with appropriate material. This information may consist of text data extracted directly from the text, or of concepts, like times, amounts, or ontology entities, that have been inferred from text elements through additional processing [95].

3.3.1 Text summarization

Textual information in the form of digital documents quickly accumulates into huge amounts of data. Most of this large volume of documents is unstructured and unrestricted and has not been organized into traditional databases. Processing such documents is, therefore, a cumbersome task, mostly due to the lack of standards [96]. There are various motivations for text summarization, including reduced reading time, easier document selection, more effective indexing, and usefulness in question-answering systems. Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [97]. Generally, two main approaches are used for text summarization: extractive methods and abstractive methods.

Extractive text summarization: in this method, some parts of the source document are selected as the summary of the text. Techniques involve constructing an intermediate representation of the input text which expresses the main aspects of the document, then ranking sentences based on their relevance to the central concept of the source. Finally, the summary of the source document is made by selecting the k most essential sentences. One approach to identifying words that describe the topic of the input document is topic words; in this method, the log-likelihood ratio is used to identify explanatory words. Latent semantic analysis (LSA) is another method to

select highly ranked sentences for document summarization. The LSA method first builds a term-sentence matrix. The weights of the words in this matrix are computed by TF-IDF; then singular value decomposition is used to transform the matrix A into three matrices: A = UΣV^T. Matrix U is a term-topic matrix holding the weights of words, matrix Σ shows the strength of each topic, and matrix V^T is the topic-sentence matrix. The matrix D = ΣV^T describes how much each sentence represents each topic; for example, d_{ij} shows the weight of topic i in sentence j.
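A minimal sketch of this LSA ranking procedure is shown below, assuming scikit-learn and NumPy are available. The sentence score used here (the length of each column of D) is one common scoring choice, and the example sentences are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_rank_sentences(sentences, k=2):
    # Build the TF-IDF term-sentence matrix A (terms as rows, sentences as columns).
    A = TfidfVectorizer().fit_transform(sentences).T.toarray()
    # Decompose A = U @ diag(s) @ Vt via singular value decomposition.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    D = np.diag(s) @ Vt                      # topic-sentence weight matrix
    scores = np.sqrt((D ** 2).sum(axis=0))   # one common per-sentence score
    top = np.argsort(scores)[::-1][:k]       # k highest-scoring sentences
    return [sentences[i] for i in sorted(top)]  # keep original document order

docs = ["The cat sat on the mat.",
        "Dogs and cats are common pets.",
        "The stock market closed higher today."]
print(lsa_rank_sentences(docs, k=2))
```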

Abstractive text summarization: in this method, the summary of a source document contains entirely new phrases and sentences that convey the meaning of the source document.

3.3.2 Sentiment analysis

Sentiment analysis tries to extract opinions from a dataset. It can help us extract public opinion about products, services, politics, or any other topic that people have opinions about. There are a variety of sentiment analysis tools: some of them focus on polarity (positive, negative, neutral), some focus on detecting feelings like anger, happiness, or sadness, and others identify intentions (e.g., interested vs. not interested). Generally speaking, sentiment analysis is a form of classifying text documents into numerous groups; most of the time, we need only to classify documents into positive and negative classes. Furthermore, there are different methods of sentiment analysis that can help us measure sentiments. These methods include lexicon-based approaches and supervised machine learning methods. Machine learning models are more popular because lexicon-based approaches, which rely on the semantics of words, use a predefined list of positive and negative words to extract the sentiment of new documents. Creating these predefined lists is time-consuming, and we cannot build a single lexicon-based dictionary to be used in

every separate context.
In the rest of this study, we mainly focus on different methods of sentiment analysis for financial data. In Chapter 4, we take a look at a lexicon-based approach to extracting the sentiment of authors in the StockTwits dataset; StockTwits is a Twitter-like website where people share their knowledge about stock prices. In Section 3.3.2, we discuss the significance of our dataset. In Chapter 5, we adopt deep learning to extract the sentiment of authors on the StockTwits website and see how deep learning methods can increase the accuracy of our prediction.

StockTwits Dataset
We were fortunate to receive permission from StockTwits Inc. to access their datasets. StockTwits is a financial social network which was established in 2009. Information about the stock market, like the latest stock prices, price movements, stock exchange history, and buying or selling recommendations, is available to StockTwits users. In addition, as a social network, it provides an opportunity for traders in the stock market to share their experience. Through the StockTwits website, investors, analysts, and others interested in the market can contribute short messages, limited to 140 characters, about the stock market. Each message is posted to a public stream visible to all site visitors. Moreover, messages can be labeled Bullish or Bearish by their authors to specify their sentiment about various stocks. Each message includes a messageID, a userID, the author's number of followers, a timestamp, the current price of the stock, and other record-keeping attributes.
We examined the posts to see if there is any relation between the future stock price and users' sentiment. In other words, we want to see if we can predict a future stock price based on the current sentiment of many users. We can use the Pearson Correlation Coefficient [98] to see if there is a linear relation between a stock's future price and the users' sentiment. Pearson Correlation is one

of the most widely-used functions to measure the linear correlation between two variables. It returns one if there is a perfect positive correlation between the two input variables, -1 if there is a perfect negative correlation, and 0 if there is no correlation. The Pearson Correlation Coefficient between a stock price and a general user's sentiment is equal to 0.05, which means that only 53% of the time are users able to predict future stock prices correctly. This is a little better than a random guess, so we examine whether that accuracy improves if the number of predictions is increased.
Gang Wang in [99] tried to find whether there are authors in financial social media whose contributions provide good predictors of stock price but are buried in the noise. They ranked authors based on their performance in predicting stock prices within the week of their prediction, using two consecutive years of data: the first year as a benchmark to find such top authors, and the second year to examine the top authors' performance. Based on the results published in [99], the correlation score for top authors is around 0.4, which means that top authors can predict stock price movement with an accuracy of about 75%. Knowing the sentiment of top authors, we can predict stock prices with an accuracy of 75%, but unfortunately, only 10% of messages in StockTwits are labeled. To increase the accuracy of stock price prediction, we need a powerful method for the sentiment analysis of top authors.
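For reference, the sketch below computes the Pearson Correlation Coefficient with SciPy on hypothetical sentiment and return series; the numbers are illustrative only, not the StockTwits data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical aligned series: daily aggregate user sentiment vs. next-day return.
sentiment = np.array([0.2, -0.1, 0.4, 0.0, 0.3, -0.2])
next_day_return = np.array([0.01, -0.02, 0.03, 0.00, 0.01, 0.02])

r, p = pearsonr(sentiment, next_day_return)
print(f"Pearson r = {r:.2f} (p = {p:.2f})")  # r near 0 implies little linear relation
```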

CHAPTER 4
LEXICON BASED FINANCIAL SENTIMENT ANALYSIS

The modern stock market is a popular place to increase wealth and generate income, but the fundamental problems of when to buy or sell shares, or which stocks to buy, have not been solved. With the availability of the Internet and its financial social networks, such as StockTwits and SeekingAlpha, investors around the world have new opportunities to gather and share their experiences. Individual experts can predict the movement of the stock market in financial social networks with reasonable accuracy, but how accurate is a large group of such experts in aggregate? One way to answer this question is by examining the sentiment of a massive group of these authors towards various stocks: by extracting the sentiment of the whole group, a collective prediction can be observed. Although sentiment extraction is a major technical challenge, the lexicon-based approach is an effective method of determining how positive or negative the content of a text document is. In this chapter, we investigate whether we can improve the performance of sentiment extraction from financial social media data by using lexicon-based approaches.

4.1 WHY FINANCIAL SENTIMENT ANALYSIS

The Internet has become a tool of open communication for billions of people around the world, allowing interaction between individuals who may never have been able to connect previously. Crowdsourcing uses the collective wisdom of a large group of people to achieve a specific goal and has brought about a social revolution. One website which brings these opportunities to its users is StockTwits.

By leveraging Twitter's 140-character tweet system, StockTwits aggregates market analyses from the Twitter social media platform and condenses them into a focused, curated stream of data. If this stream were examined in full, it would be possible to determine the crowd's collective sentiment towards the market and make predictions from it. What makes StockTwits special is its users' ability to add a tag to their tweets indicating whether their post is "Bullish", meaning they think the stock or market will improve, or "Bearish", meaning they think the stock or market will get worse. In this chapter, we will examine a labeled dataset from StockTwits and determine whether lexicon-based sentiment analysis methods are effective for classification. We will begin by reviewing a selection of works related to the application of machine learning and sentiment analysis on financial social media data. The next section covers our methodology, in which we compare sentiment analysis approaches based on machine learning and on sentiment lexicons. The following section provides our experimental results, which show that lexicon-based approaches can offer improved performance over machine learning methods. In the last section, we summarize our conclusions and recommend the VADER system of lexicon-based sentiment analysis for the classification of StockTwits tweets.

4.2 PREVIOUS WORK ON FINANCIAL SENTIMENT ANALYSIS

Early work on Twitter and sentiment analysis comes from Bollen et al. in [100], with their use of OpinionFinder and the Google Profile of Mood States (GPOMS). These tools took tweets as input and produced the author's sentiment, which was then compared against the performance of a stock market index. The authors showed that sentiment analysis of a large Twitter dataset regarding stock movement is possible. Additionally, they found that this analysis can be used for market predictions, with an accuracy of around 87%.

Expanding on the work of Bollen, Mittal and Goel in [101] looked further into sentiment analysis applied to Twitter data. They realized that having a good sentiment analysis system was extremely important for their task and evaluated multiple analyzers, including OpinionFinder and SentiWordNet. By stressing the importance of sentiment analysis on financial tweets, this work also leads us to examine the topic more closely. One of the most popular works in this field is by Loughran and McDonald [102]. They used filings from the U.S. Securities and Exchange Commission portal from 1994 to 2008 to make a financial lexicon, manually creating six word lists: positive, negative, litigious, uncertainty, modal strong, and modal weak. Supervised classification methods, such as Support Vector Machines, Naïve Bayes, or ensembles [103, 104], have been deployed to perform sentiment analysis in multiple research projects. Machine learning techniques mainly use the bag-of-words model [64], in which a text is represented as the collection of its words, disregarding the order of those words in their sentences. In addition, machine learning methods require feature engineering. Wang et al. in [105] applied machine learning approaches, including Support Vector Machine, Naive Bayes, and Decision Tree, to classify StockTwits tweets as "bullish" or "bearish." They found that the SVM model was the most accurate at 76.2%. Our research builds on this work by re-evaluating various machine learning models and then investigating lexicon-based sentiment analyzers to see if better accuracy can be attained. With an improved method of determining the overall feelings of StockTwits users, more accurate predictions can be made from their aggregate data.

4.3 METHODOLOGY

A sentiment lexicon is a list of lexical features which are generally labeled according to their semantic orientation as either positive or negative [106]. Due to the challenge of creating a lexicon, most research in sentiment analysis relies heavily on preexisting manually constructed lexicons. The three most common lexicons in use are LIWC 1, GI 2, and Hu-Liu04 3. In the following sections, we briefly provide an overview of the two most commonly-used sentiment lexicons, VADER and SentiWordNet. VADER uses a combination of qualitative and quantitative methods, and SentiWordNet is an extension of WordNet [107].

4.3.1 VADER: Valence Aware Dictionary for sEntiment Reasoning

VADER, as a parsimonious rule-based model for sentiment analysis, can be used in multiple domains. It is constructed from a generalized, valence-based, human-curated gold standard sentiment lexicon. In addition, the impact of grammatical and syntactical rules, including punctuation, capitalization, contrastive conjunctions, etc., on the sentiment of text is considered. VADER is fast enough to use online with streaming data, and it does not suffer from a speed-performance trade-off. These features make VADER one of the most popular methods for sentiment analysis, especially on social media data. In VADER, a group of well-established sentiment lexicons, like LIWC, ANEW, and GI, is used to construct a candidate list. Combining this list with lexical features common to sentiment expression in microblogs, including Western-style emoticons 4, sentiment-related acronyms and initialisms 5, and commonly used slang 6 with sentiment value, provides over 9,000 lexical feature candidates. The wisdom of the crowd is used to estimate the sentiment valence of each candidate feature: ten independent humans rate each of the features on a scale from -4 (extremely negative) to 4 (extremely positive), with 0 being neutral. Only lexical features that have a non-zero mean rating, and whose standard deviation is less than 2.5, as determined by the aggregate of the ten independent raters, are kept. This process provides a set of 7,500 lexical features with valence scores which indicate both the sentiment polarity and the sentiment intensity on a scale from -4 to +4 [108].

Footnotes:
1. www.liwc.net
2. http://www.wjh.harvard.edu/ inquirer
3. http://www.cs.uic.edu/ liub/FBS/sentiment-analysis.html
4. http://en.wikipedia.org/wiki/List-of-emoticons
5. http://en.wikipedia.org/wiki/List-of-acronyms
6. http://www.internetslang.com/

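A minimal usage sketch of VADER, via the vaderSentiment Python package, is shown below; the example message is made up.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos proportions and a normalized
# 'compound' score in [-1, 1] that aggregates the valence of the text.
scores = analyzer.polarity_scores("$AAPL looking GREAT today!!! :)")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```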

4.3.2 SentiWordNet

SentiWordNet is a lexical resource which uses sets of synonyms, or synsets, instead of individual terms. The reasoning for this choice is that different senses of the same term may have different opinion-related properties. SentiWordNet assigns three numerical scores, Obj(s), Pos(s), and Neg(s), to each synset of WordNet (version 2.0). These scores describe how objective, positive, and negative the terms contained in the synset are. SentiWordNet is built by training a set of ternary classifiers. These classifiers produce different results because each is trained with a different training set and semi-supervised learning method. If all the ternary classifiers agree in assigning the same label to a synset, that label is assigned to that synset; otherwise, each label receives a score proportional to the number of classifiers that have assigned it [109].
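A minimal sketch of querying SentiWordNet scores through NLTK's corpus interface is shown below; it assumes the sentiwordnet and wordnet corpora have been downloaded, and "bullish" is just an example query word.

```python
from nltk.corpus import sentiwordnet as swn
# Requires: nltk.download('sentiwordnet') and nltk.download('wordnet').

# Each synset carries positive, negative, and objective scores.
synset = list(swn.senti_synsets("bullish"))[0]
print(synset.pos_score(), synset.neg_score(), synset.obj_score())
```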

4.4 EXPERIMENTS

In this section, we describe how our experiment applies machine learning and lexicon-based approaches to the StockTwits dataset. Our experiment investigates whether there is any relation between Bullish tweets and positive polarity, or Bearish tweets and negative polarity. In the following sections, we seek to determine whether lexicon-based models improve the accuracy of sentiment analysis of StockTwits data compared to machine learning approaches.

Table 4.1: Performance of the machine learning models on sentiment analysis in the StockTwits data set

Accuracy Precision Recall F-measure AUC

Logistic Regression 0.814 0.822 0.981 0.894 0.716

Na¨ıve Bayes 0.808 0.809 0.996 0.893 0.714

Linear SVM 0.814 0.820 0.984 0.895 0.716

Table 4.2: Performance of the TextBlob on sentiment analysis in the StockTwits data set

Accuracy Precision Recall F-measure AUC

TextBlob 0.810 0.842 0.726 0.780 0.804

4.4.1 Machine Learning Approaches

As we mentioned before, 10% of the messages in our dataset are labeled. In our experiment, we use these messages and supervised machine learning methods to classify StockTwits users' messages into either Bullish or Bearish sentiment. Unigrams are used as features, and infrequent unigrams that occur fewer than 300 times over all messages have been removed. In Table 4.1, we provide the performance of Naïve Bayes, Linear Support Vector Machine (SVM), and Logistic Regression on the StockTwits data based on different performance metrics. Based on Table 4.1, the performance of logistic regression, linear SVM, and Naive Bayes in classifying messages as Bullish or Bearish is very close: the accuracy of prediction is around 80%, the F-measure around 90%, and the Area Under the Curve around 70%. In the following section, we try to see if we can adopt lexicons to improve the performance of the prediction.
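A minimal scikit-learn sketch of this classification setup is shown below. The two example messages are made up, and min_df filters by document frequency, which only approximates the raw 300-occurrence threshold described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled messages; the real experiment uses the labeled 10%
# of StockTwits posts.
messages = ["to the moon, buying more", "selling everything, this will crash"]
labels = ["Bullish", "Bearish"]

# Unigram features with an infrequent-term cutoff, then a linear classifier.
model = make_pipeline(CountVectorizer(min_df=1), LogisticRegression())
model.fit(messages, labels)
print(model.predict(["time to buy"]))
```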

Table 4.3: Performance of the SentiWordNet on sentiment analysis in the StockTwits data set

Accuracy Precision Recall F-measure AUC

SentiWordNet 0.870 0.837 0.661 0.739 0.806

4.4.2 Lexicon Based Approaches

TextBlob: The first method we used to extract the sentiment of messages in the StockTwits data was TextBlob [110]. It uses a sentiment lexicon and the pattern.en sentiment analysis engine; pattern.en leverages WordNet to score sentiment according to the English adjectives used in the text. When TextBlob runs sentiment analysis on text, it returns a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1, 1]. We first establish whether there is any correlation between positive polarity and Bullish, and then between negative polarity and Bearish. In order to compare the result of the machine learning approach to the lexicon-based approach, we apply TextBlob to the 2,522,557 messages that we used in the machine learning methods (around 500,000 Bearish and more than 2,000,000 Bullish). From this set, TextBlob found 1,125,130 neutral messages. We remove all of the neutral messages and provide the result of comparing TextBlob sentiment on the StockTwits data with the actual labels of the messages in Table 4.2. Based on the results shown in Table 4.2, TextBlob is not an effective method for extracting sentiment from StockTwits data: it labels too many messages as neutral, and its performance metrics are not considerably improved in comparison to the machine learning approaches.
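A minimal TextBlob usage sketch is shown below; the example sentences are made up.

```python
from textblob import TextBlob

# .sentiment returns (polarity, subjectivity); polarity is a float in [-1, 1].
print(TextBlob("This stock is performing wonderfully").sentiment.polarity)  # > 0
print(TextBlob("terrible earnings report").sentiment.polarity)              # < 0
```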

SentiWordNet: Again, we consider a positive message as Bullish and a negative message as Bearish. Among all 2,522,557 messages, SentiWordNet found 214,972 neutral messages. All such neutral messages were removed, and then the SentiWordNet sentiment for each message was compared with the actual label of that message. The result of applying SentiWordNet to the StockTwits data is provided in Table 4.3. Comparing Tables 4.3 and 4.1, it is clear that SentiWordNet can improve the accuracy, precision, and area-under-the-curve values in comparison to the machine learning models, but the difference is still not considerable: although accuracy and AUC grow by around 9%, the F-measure is reduced by more than 10%.

Table 4.4: Performance of the VADER on sentiment analysis in the StockTwits data set

        Accuracy  Precision  Recall  F-measure  AUC
VADER   0.944     0.847      0.745   0.793      0.861

VADER: Among all 2,522,557 messages, VADER found 899,503 neutral messages and labeled them with zero. We removed all of the messages that VADER found neutral and then compared VADER's determined sentiment with the actual label of each message. Our results are shown in Table 4.4. We found that using VADER to predict the sentiment of StockTwits users improves accuracy and area under the curve when compared to the machine learning methods (Table 4.1), TextBlob (Table 4.2), and SentiWordNet (Table 4.3).

4.4.3 Combined Results

Figure 4.1 compares the ROC curves of the machine learning methods and the sentiment lexicon methods, including VADER, SentiWordNet, and TextBlob. Sentiment lexicons outperform machine learning methods based on these ROC curves. In Table 4.5, we provide the number of messages that were labeled as neutral by TextBlob, SentiWordNet, and VADER. Fewer neutral messages indicate better performance from an analyzer, and so SentiWordNet clearly gives the best results here. However, Tables 4.2, 4.3, and 4.4 reveal that, among the sentiment lexicon methods studied, VADER's higher performance metrics make it the best method for use in predicting StockTwits users' sentiment.

Table 4.5: Number of neutral messages

         TextBlob   SentiWordNet  VADER
neutral  1,125,130  214,972       899,503

Figure 4.1: Comparative Area Under the ROC curve for Lexicon versus Machine Learning based sentiment analysis

[ROC curves: VADER (area = 0.86), SentiWordNet (area = 0.81), TextBlob (area = 0.80), Logistic Regression (area = 0.72), MultinomialNB (area = 0.71); x-axis: False Positive Rate, y-axis: True Positive Rate]

4.5 SUMMARY

Knowing the sentiment of top authors, we can predict stock prices with an accuracy of 75%, but unfortunately, only 10% of messages in StockTwits are labeled. To increase the accuracy of stock price prediction, we need a powerful method for the

sentiment analysis of top authors. Sentiment analysis has two main approaches: lexicon-based and machine learning. The primary drawback of machine learning is the training process, which is very time-consuming and computationally expensive. The lexicon-based approach, on the other hand, does not need training data, and so it is favorable, particularly in tasks that involve high-dimensional data. There are a variety of lexicon-based methods that can be used to perform sentiment analysis. In this chapter, we applied VADER, SentiWordNet, and TextBlob to StockTwits data to see if they can increase the accuracy of sentiment analysis. Logistic Regression, Linear SVM, and Naive Bayes classification were used as our baselines and compared to the results of applying the lexicon-based models. Based on our results, not only does VADER outperform machine learning methods in extracting sentiment from financial social media like StockTwits, it is also faster.

CHAPTER 5
FINANCIAL SENTIMENT ANALYSIS

Sentiment Analysis (SA) is a common method which is increasingly used to assess the feelings of social media users towards a subject. Financial social media brings people, companies, and organizations together so that they can generate ideas and share information with each other. This media provides a huge amount of unstructured data that can be integrated into the decision-making process. Such Big Data can be considered a great source of real-time estimation because of its high frequency of creation and low cost of acquisition. Deep Learning is beneficial when facing large amounts of unsupervised data, like the data provided by social media. In this chapter, we adopt Deep Learning to perform sentiment analysis of top authors. We believe that using Deep Learning can vastly improve correct classification in sentiment analysis regarding various stock picks and thus exceed the current accuracy of stock price prediction.

5.1 SOCIAL NETWORK INFORMATION EXTRACTION

The Internet, as a global system of interconnection, provides a link between billions of devices and people around the world. The rapid development of social networks has caused tremendous growth in users and digital content [111]. It opens opportunities for people with various skills and knowledge to share their experiences and wisdom with each other. There are many websites, like Yelp, Wikipedia, and Flickr, that use the power of the Internet to help their users make optimal decisions. Furthermore, there are websites that give users the ability to consult with pro-

fessionals, and one topic that is always popular is investment. Companies like Goldman Sachs and Lehman Brothers have provided investment advice for more than 150 years. In the Internet age, independent analysts and retail investors around the world can collaborate with each other through the web. Seeking Alpha and StockTwits are two examples of common financial social media platforms focused on the stock market, giving their users a way to connect with information and each other and grow their investments [99]. Following the early work in sentiment analysis done in [112, 113], we examine source materials and apply natural language processing techniques to determine the attitude of a writer towards a subject. With the growing popularity of social media, huge datasets of reviews, blogs, and social network feeds are being generated continuously. Growing data, data-intensive technologies, and increasing data storage resources have developed Big Data science. The main concept in Big Data analytics is extracting meaningful patterns from a huge amount of data, and we need special methods to do so. Deep Learning has the potential to provide a solution to the learning and data analysis problems that exist in massive amounts of data, and Deep Learning models are better at learning complex data patterns. There are other problems, such as domain adaptation and streaming data, that large-scale Deep Learning models have to contend with. Concepts and methods from sentiment analysis that can help us extract information from these areas have become increasingly important as businesses, organizations, and individuals seek to make better use of their data.

5.2 BIG DATA

The term Big Data has been in use since the 1990s. In 2012, Gartner updated its previous definition and defined it as follows: "Big Data is high-volume, high-velocity

and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision-making, and process automation." The term refers to growing digital data that is difficult to manage and analyze using traditional software tools and technologies. Big Data often has a large number of samples, a large number of class labels, and very high dimensionality (attributes). Its target size moves continually; in 2012 it ranged from a few dozen terabytes to many petabytes of data. There are four attributes, Volume, Variety, Velocity, and Veracity, that define Big Data [114]. Obviously, data volume is the primary attribute of this data. As the volume of the data increases, its complexity and underlying relationships increase as well. Many social media companies, including Facebook, Twitter, StockTwits, and LinkedIn, have large amounts of data; as data becomes bigger, Deep Learning approaches become more important for providing data analysis. The other thing that makes Big Data really big is the variety of data: Big Data comes from a greater variety of sources than ever before. Web sources including social media, clickstreams, and logs are some examples of these resources. One of the challenges in Big Data processing is working with this variety of different data; in order to extract a structured representation, unstructured data needs to be preprocessed. Velocity is another feature of Big Data: the frequency of data generation in Big Data is high. For example, consider the stream of messages coming from the StockTwits website. Velocity is just as important as volume and variety, and the quickness of processing input into usable information is key to dealing with it. Veracity refers to the trustworthiness of the data: as the number of data sources and types increases, trust in Big Data becomes a practical challenge. In addition to the four Vs, there are many challenges, including data cleansing, feature engineering [115, 116, 117], high dimensionality, and data redundancy, that Big Data analytics faces.
Deep Learning is used in industrial products that have the opportunity to access

a large volume of digital data. Google uses Deep Learning algorithms and the Big Data available on the Internet for Google's translator. In some application domains, such as social media, marketing, and financial data feeds, using Deep Learning algorithms and architectures for analyzing large-scale [118, 119], fast-moving streaming data is encouraged, but analyzing Big Data using Deep Learning applications largely remains unexplored. Big Data has the potential to make a huge change in science and all aspects of our society, but extracting information from this data is not an easy task. Decentralized control and autonomous data sources are two other important characteristics of Big Data: each data source can collect information without any centralized control. Big Data technology is still young; many technical problems in stream computing, parallel computing, Big Data architecture, Big Data models, and software systems that can support Big Data should still be investigated. Today, machine learning techniques, especially Deep Learning models, together with powerful computers, play an important role in Big Data analysis. Deep Learning methods can leverage the predictive power of Big Data in fields like search engines, medicine, and astronomy. In contrast to the conventional, mostly noise-free datasets used in data mining approaches, Big Data is often incomplete because of its disparate origins. Big Data brings transformative potential and big opportunities for various fields. Typical data mining algorithms require all data to be loaded into main memory, which is a clear technical difficulty for Big Data spread across different locations. In addition, data mining methods need to overcome the sparsity, heterogeneity, uncertainty, and incompleteness of Big Data as well. Deep Learning and Big Data are considered major developments, and bases for American innovation and economic revolution. Even in government and society, Big Data has emerged as a useful remedy for some problems. In 2012, the Obama Administration announced a "Big Data Research and Development Initiative" to help solve some of the Nation's most pressing

65 challenges.

5.3 MACHINE LEARNING IN SOCIAL NETWORK INFORMATION EXTRACTION

Deep Learning and Big Data analytics are two focal points of data science. Deep Learning models have achieved remarkable results in speech recognition [120, 121, 122, 123] and computer vision [124, 125, 126, 127, 128] in recent years. Big Data is important for organizations that need to collect a huge amount of data, such as social networks, and one of the greatest assets of Deep Learning is its ability to analyze a massive amount of data. This advantage makes Deep Learning a valuable tool for Big Data. The modern stock market is an example of these social networks. Stock markets are a popular place to increase wealth and generate income, but the fundamental problem of when to buy or sell shares, or which stocks to buy, has not been solved. It is very common among investors to have professional financial advisors, but what is the best resource to support the decisions these people make? Investment banks such as Goldman Sachs, Lehman Brothers, and Salomon Brothers dominated the world of financial advice for more than a decade. However, with the popularity of the Internet and financial social networks such as StockTwits and SeekingAlpha, investors around the world have a new opportunity to gather and share their experiences. Individual experts can predict the movement of the stock market in financial social networks with reasonable accuracy, but what is the sentiment of a mass group of these expert authors towards various stocks? Specific Big Data domains, including computer vision [127] and speech recognition [129], have seen the advantages of using Deep Learning to improve classification modeling results, but there are only a few works on Deep Learning architectures for sentiment analysis. In 2006 Alexandrescu et al. [130] presented a model in which each word is represented as a vector of features, with a single embedding matrix used to look up

all of these features. In [131] Luong et al. use a recursive neural network (RNN) to model the morphological structures of words and learn morphologically-aware embeddings. In 2013 Lazaridou et al. [132] tried to learn the meanings of phrases by using compositional distributional semantic models. In 2013 Chrupala used a simple recurrent network (SRN) to learn continuous vector representations for sequences of characters, applying the model to a character-level labeling task. A meaningful search space can be constructed via Deep Learning by using recurrent neural networks [133]. Socher et al. in 2011 [134] used recursive autoencoders [135, 136, 137, 138] for predicting sentiment distributions and proposed a semi-supervised approach. In 2012 Socher et al. [139] proposed a model for semantic compositionality with the ability to learn compositional vector representations for sentences of arbitrary length; their model is a matrix-vector recursive neural network. The Recursive Neural Tensor Network (RNTN) architecture was proposed in [140]. RNTN uses word vectors and a parse tree to represent a phrase and then uses a tensor-based composition function to compute vectors for higher nodes [141]. Regarding convolutional networks for NLP tasks, Collobert et al. in [142] avoid excessive feature engineering by using a convolutional neural network, and in 2011 Collobert used a similar network architecture for syntactic parsing. In [143] a deep convolutional neural network is proposed that exploits character- to sentence-level information to perform sentiment analysis of short texts. The experiments in this chapter focus on market sentiment. Based on the definition in [144], market sentiment is the general prevailing attitude of investors as to anticipated price development in a market. This attitude is the combination of various factors such as world events, history, economic reports, seasonal factors, and many others. Market sentiment is found through sentiment analysis, also known as opinion

mining [145], which is the use of natural language processing methods to extract the attitude of a writer from source materials. Wang and Sambasivan in [99] applied market sentiment analysis to the StockTwits dataset, using supervised sentiment analysis to classify messages in StockTwits as "Bullish" or "Bearish." An investor is considered Bullish if he or she believes that the stock price will increase over time and recommends purchasing shares. Oppositely, if an investor is Bearish he or she expects downward price movement and will recommend selling shares or advise against buying. Supervised classification methods, such as Support Vector Machines, Naïve Bayes, or ensembles have been deployed to perform sentiment analysis in multiple research projects. These machine learning techniques mainly use the bag-of-words model, in which a text is represented as the collection of its words, disregarding the order of those words in their sentences. However, the order of the words in a sentence can change the sentiment of a word. For example, consider the word "underestimate." This word potentially has a negative connotation, but next to other words, as in "underestimated stock," it can become positive. Recently, Deep Learning approaches have emerged as a powerful tool for sentiment analysis in Big Data due to the advantages they provide over other methods. One of these advantages is that features are learned hierarchically during the Deep Learning process instead of through the feature engineering that is required in data mining. Additionally, in Deep Learning methods, each word is considered as part of a sentence, so relevant information contained in word order, proximity, and relationships is not lost. Furthermore, Deep Learning benefits from a similarity model: word embedding creates a vector representation of words with a much lower dimensional space compared to the bag-of-words model, and the vectors representing similar words are therefore closer together in vector space. One of the other main concepts in Deep Learning algorithms is the automatic extraction of representations (abstractions) [146].

To achieve this goal, Deep Learning uses a massive amount of unsupervised data and extracts complex representations automatically. One of the advantages of the abstract representations extracted by Deep Learning algorithms is their generalization: features extracted from a given dataset can be used successfully for a discriminative task on another dataset. Deep Learning is an important aspect of artificial intelligence because it provides a complex representation of Big Data and also makes the machine independent of human knowledge. Deep Learning constructs complicated representations of image and video data with a high level of abstraction. The high-level data representations provided by Deep Learning allow simpler linear models to be used on Big Data. Such representations can be useful for image indexing and retrieval; in other words, Deep Learning can be used in the discriminative task of semantic tagging in the context of Big Data analysis. In this chapter, we seek to determine if Deep Learning models can be adapted to improve the performance of sentiment analysis for StockTwits. We applied several neural network models, such as long short-term memory [147], doc2vec [148], and convolutional neural networks [127], to stock market opinions posted on StockTwits. Our results show that Deep Learning models can be used effectively for financial sentiment analysis and that a convolutional neural network is the best model for predicting the sentiment of authors in the StockTwits dataset.

5.4 METHODOLOGY

Concepts and methods from sentiment analysis that can help us extract information from Big Data have become increasingly important as businesses, organizations, and individuals seek to make better use of their data. In the following section, we start by investigating the performance of sentiment analysis based on data mining approaches for our dataset.

Table 5.1: Performance of the Logistic Regression on the StockTwits dataset

Accuracy Precision Recall F-measure AUC

0.7088 0.7134 0.6980 0.7056 0.7088

5.4.1 Sentiment Analysis with Data Mining Approaches

Gang Wang in [99] uses a supervised data mining approach to find the sentiment of messages in the StockTwits dataset. They removed all stopwords, stock symbols, and company names from the messages, considered ground-truth messages as training data, and tested multiple data mining models, including Naïve Bayes, Support Vector Machines (SVM), and Decision Trees. By running 10-fold cross validation, they found that the SVM model produces the highest accuracy (76.2%). They used unigrams as features and removed infrequent unigrams that occur fewer than 300 times over all messages, because using n-grams can lead to a data sparsity problem. As a result, it is necessary to use lower-order n-grams to address the sparsity problem; otherwise performance would decrease. On the other hand, by using lower-order n-grams we lose the order of the words in a sentence, and as we know, word order can help us better understand the sentiment of a document. We believe that using Deep Learning to predict the sentiment of authors can help us overcome these problems and increase prediction accuracy. Deep Learning's nonlinear feature extraction can improve data mining results and classification modeling [146]. Logistic regression applies the logistic sigmoid function to weighted input values to classify input data, so it is similar to a Deep Learning network without hidden layers; indeed, logistic regression is used as the classifier in the final layer of a Deep Learning network, so Deep Learning algorithms can be seen as multiple feature learning steps stacked before it. Logistic regression is very fast and simple, so it is often used for large datasets. We follow Gang Wang's [99] approach and apply logistic regression [149] to the StockTwits dataset. In Table 5.1, we provide the performance of logistic regression

on the StockTwits data based on different performance metrics. Also, in Figure 5.1, we present the ROC curve [21] for this model.
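To make the baseline concrete, the following minimal sketch reproduces this setup with scikit-learn. The variables `messages` (a list of StockTwits message texts) and `labels` (1 for Bullish, 0 for Bearish) are hypothetical stand-ins for our preprocessed data, and the hyperparameters are illustrative rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the bag-of-words logistic regression baseline.
# `messages` and `labels` are hypothetical stand-ins for the preprocessed data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),   # unigram bag-of-words features
    LogisticRegression(max_iter=1000),
)
# 10-fold cross-validated accuracy, mirroring the evaluation protocol in [99].
scores = cross_val_score(baseline, messages, labels, cv=10, scoring="accuracy")
print(scores.mean())
```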

Figure 5.1: Receiver Operating Characteristic for Logistic Regression

5.4.2 Increasing Accuracy by Using Feature Selection

One of the problems that prevents us from accurately classifying Big Data is the noise found within it. Feature selection, including the removal of noisy features and the elimination of ineffective vocabulary, makes training and applying a classifier more effective [150]. The existing approaches to finding an adequate subset of features fall into two groups: feature filters and feature wrappers [151]. In feature filters, the final set of features is selected based on the statistical properties of those features. With feature wrappers, an iterative search process is applied through a modeling tool's results: in each iteration, a candidate set of features is used in the modeling tool and the results are recorded; each step uses the results from the previous step to generate new tentative sets; and the process is repeated until some specified convergence criteria are met. In our experiment, we had a huge number of features and instances, and thus our data was very sparse. We tried several feature selection methods to see how they would affect the accuracy of our sentiment analysis.

From the methods tested, we selected three feature filters: chi-squared, ANOVA, and mutual information. The advantages of these feature selection techniques are their speed, their scalability, and their independence from the classifier. Our reason for choosing these methods is their ability to deal with sparse data. On the other hand, these methods have some drawbacks as well: they ignore feature dependencies and they ignore interaction with the classifier [152]. In this section, we examine these methods and the results of applying them to our dataset.

Chi-square

Pearson's chi-squared test [153] is used for two types of comparison: a test of independence or a test of goodness of fit. We apply the test of independence to our dataset to see if the occurrence of a specific feature is independent of the class. Our terms are ranked by their score as determined with Eq.(5.1), where $O$ stands for the observed frequency and $E$ stands for the expected frequency. A high $X^2$ score rejects the null hypothesis of independence of the term and class.

$$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \tag{5.1}$$

Applying chi-squared to our dataset and decreasing the number of features gradually allowed us to see how it affects the performance of logistic regression. Classifier results are provided in Table 5.2. Reducing the number of features increases accuracy in some cases; for example, by reducing the number of features from 40,000 to 500, accuracy increases by seven percent. However, this is an irregularity in our dataset and does not mean that chi-squared is an effective feature selection method for increasing the accuracy of our classifier.
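A minimal sketch of this filtering step with scikit-learn is shown below; it assumes the hypothetical `messages` and `labels` variables from the baseline sketch above, and the value of k is illustrative.

```python
# Sketch of chi-squared feature filtering on sparse term counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

X = CountVectorizer().fit_transform(messages)   # sparse unigram counts
selector = SelectKBest(chi2, k=500)             # keep the 500 highest-scoring terms
X_reduced = selector.fit_transform(X, labels)   # scores follow Eq.(5.1)
```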


Table 5.2: Performance of the chi-squared feature selection on the StockTwits dataset

Features Accuracy Precision Recall F-measure AUC

55820 0.7088 0.7134 0.6980 0.7056 0.7088

40000 0.4796 0.4851 0.6645 0.5608 0.4796

20000 0.5018 0.5013 0.6879 0.5800 0.5018

4000 0.5274 0.5206 0.6946 0.5951 0.5274

2000 0.5221 0.5190 0.6036 0.5581 0.5221

400 0.5308 0.5278 0.5834 0.5542 0.5308

200 0.5333 0.5280 0.6284 0.5738 0.5333

50 0.5314 0.5232 0.7071 0.6014 0.5314

Analysis of variance

One of the other feature selection methods that we used was analysis of variance (ANOVA). ANOVA [154] is used to determine if there are any statistically significant differences between the arithmetic means of independent groups. By using ANOVA for feature selection in our experiment, we clarify the relevance of terms by assigning each a score based on an F-test. Top-scoring terms are taken as our desired features and sent to the classification models. The F-test formula is shown in Eq.(5.2).

$$F = \frac{MS_B}{MS_W} \tag{5.2}$$

In this equation $MS_B$ is the between-group variability, Eq.(5.3), and $MS_W$ is the within-group variability, Eq.(5.4). In the between-group variability, $n_i$ is the total number of observations of class $i$, $m$ is the number of classes, and $\bar{x}$ denotes the general mean of the data.

$$MS_B = \frac{\sum_i n_i(\bar{x}_i - \bar{x})^2}{m - 1} \tag{5.3}$$

In the within-group variability, $x_{ij}$ denotes the $j$-th observation in the $i$-th class [155].

Table 5.3: Performance of the ANOVA F-test feature selection on the StockTwits dataset

Features Accuracy Precision Recall F-measure AUC

55820 0.7088 0.7134 0.6980 0.7056 0.7088

40000 0.7094 0.7130 0.7010 0.7070 0.7094

20000 0.7091 0.7127 0.7007 0.7066 0.7091

4000 0.5274 0.5206 0.6946 0.5951 0.5274

2000 0.7045 0.7048 0.7038 0.7043 0.7045

400 0.6785 0.6638 0.7233 0.6923 0.6785

200 0.6611 0.6378 0.7457 0.6875 0.6611

50 0.6191 0.5863 0.8084 0.6797 0.6191

$$MS_W = \frac{\sum_{ij}(x_{ij} - \bar{x}_i)^2}{n - m} \tag{5.4}$$

By extracting the most effective features based on F-test scores, we examined whether ANOVA feature selection improves the accuracy of the classification methods. Per the results provided in Table 5.3, accuracy is not improved through ANOVA feature selection, so it will not be used for further testing.

Information Gain

Our results show that the ANOVA and chi-squared feature selection methods cannot considerably increase the accuracy of our classification models. In this section, we look at mutual information feature selection, one of the most commonly used feature selection methods. Mutual information measures the dependency between two random variables. This allows us to determine information gain, which is the amount of information acquired about one random variable through another random variable.

Table 5.4: Performance of the Mutual Information feature selection on the StockTwits dataset

Features Accuracy Precision Recall F-measure AUC

55820 0.7088 0.7134 0.6980 0.7056 0.7088

40000 0.5417 0.5311 0.7115 0.6082 0.5417

20000 0.5123 0.5087 0.7144 0.5943 0.5123

4000 0.5391 0.5337 0.6190 0.5732 0.5391

2000 0.5406 0.5350 0.6193 0.5741 0.5406

400 0.5665 0.5540 0.6815 0.6112 0.5665

200 0.4713 0.4760 0.5692 0.6126 0.5459

50 0.5077 0.5052 0.7414 0.6009 0.5077

The mutual information between two random variables $X$ and $Y$ is defined in Eq.(5.5).

$$I(X;Y) = \sum_{y \in Y}\sum_{x \in X} p(x,y)\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right) \tag{5.5}$$

In this equation, if $x$ and $y$ are independent, i.e. $p(x,y) = p(x) \times p(y)$, their mutual information will be zero, which in turn means that knowing one of these random variables gives us no information about the other one. By using mutual information for feature selection, we explore how much information each term provides for making the correct classification decision. This method extracts the features with the highest mutual information values, so that we keep the features that contain the most information about the class. In our experiment, mutual information for feature selection was also not effective, as shown by the results provided in Table 5.4. In Figure 5.2 we decrease the number of features by applying the chi-square, ANOVA, and mutual information selection methods and compare the accuracy of logistic regression. As our results demonstrate, feature selection methods cannot considerably improve the accuracy of logistic regression. Data mining algorithms cannot extract the complex and nonlinear patterns that exist in Big Data. By extracting these features, Deep Learning allows simpler linear models to be used for Big Data analysis tasks, including classification and prediction, which is important when we deal with the scale of Big Data.
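To make Eq.(5.5) concrete, the short snippet below evaluates it on two toy 2x2 joint distributions; for the independent case the mutual information is exactly zero.

```python
# Toy evaluation of Eq.(5.5) on 2x2 joint probability tables.
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats for a joint probability table p_xy."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                 # skip zero-probability cells
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x * p_y)[nz])).sum())

independent = np.array([[0.25, 0.25], [0.25, 0.25]])
dependent   = np.array([[0.45, 0.05], [0.05, 0.45]])
print(mutual_information(independent))  # 0.0: knowing X tells us nothing about Y
print(mutual_information(dependent))    # > 0: X carries information about Y
```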

Figure 5.2: Accuracy of logistic regression by using feature selection methods

[Line plot: Accuracy (0.40 to 0.80) versus Number of Features (0 to 50,000) for the chi-square, ANOVA, and mutual information filters.]
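A comparison like the one in Figure 5.2 could be produced along the lines of the following sketch, which sweeps the number of selected features for each filter. It reuses the hypothetical `messages` and `labels` variables and the term-count matrix `X` from the earlier sketches; the exact feature counts and classifier settings in our experiments may differ.

```python
# Sketch of the sweep behind Figure 5.2: accuracy of logistic regression as the
# number of selected features varies for each of the three filters.
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

methods = {"chi-square": chi2,
           "ANOVA": f_classif,
           "mutual information": mutual_info_classif}
for name, score_func in methods.items():
    for k in (50, 200, 400, 2000, 4000, 20000, 40000):  # k must not exceed the vocabulary size
        model = make_pipeline(SelectKBest(score_func, k=k),
                              LogisticRegression(max_iter=1000))
        acc = cross_val_score(model, X, labels, cv=10, scoring="accuracy").mean()
        print(name, k, round(acc, 4))
```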

With the result of logistic regression based on the bag-of-words model used as a baseline, we investigate whether Deep Learning methods can improve on this accuracy for Big Data. The bag-of-words model does not consider word order or the other words in a sentence, and it has a limited sense of word sentiment. We believe that using Deep Learning methods instead of the bag-of-words model may help us improve the accuracy of our model. In the consecutive layers of deep architectures, each layer applies a nonlinear transformation to its input and provides a representation as its output. In other words, Deep Learning can learn representations of Big Data in a deep architecture with multiple levels of representation. It is important to note that the transformations in the layers of Deep Learning are nonlinear and try to extract the underlying factors in the Big Data. The output of the final layer (the final representation of the data constructed by the Deep Learning algorithm) can be used as features for classifiers or other applications. In this chapter, we mainly focus on how Deep Learning can assist with sentiment analysis of StockTwits data and which Deep Learning algorithm can be adapted to improve the accuracy of sentiment analysis on StockTwits in comparison to data mining models. With respect to the first topic, we explore three Deep Learning algorithms, doc2vec [156, 157, 158], LSTM [159], and CNN [160], to see if they can more accurately predict StockTwits users' sentiment.

5.4.3 Deep Learning in Big Data Analytics

In this section, we explore the advantages of using Deep Learning algorithms in Big Data analysis. We also take a look at some Big Data characteristics that challenge Deep Learning in Big Data analysis. Deep Learning algorithms extract an abstract representation of Big Data through multi-level hierarchical learning. Deep Learning is attractive for extracting information from Big Data because it can learn from a massive amount of unlabeled data; once Deep Learning has learned from the unsupervised data, more traditional models can be trained with a smaller amount of labeled data [161, 162, 163]. Deep Learning can also better capture global relationships in Big Data. Among the advantages of the abstract representations learned by Deep Learning are that a simple model can work effectively with the knowledge of a more abstract data representation, and that automating the extraction of data representations allows broad application to different data types. These specific characteristics of Deep Learning make it desirable for Big Data analytics. Deep Learning algorithms can be used to address the Volume and Variety dimensions of Big Data analytics. Effectively using a massive amount of data (Volume) is one of the advantages of Deep Learning. Since Deep Learning deals with data abstraction, it is well suited to working with raw data in different formats and from different sources

(Variety) and minimizes the need for feature selection on new data types observed in Big Data. However, Big Data has some characteristics, including streaming and fast-moving data, that can pose challenges for adopting Deep Learning. There are some works associated with Deep Learning and streaming Big Data. For instance, the adaptive deep belief networks introduced in [164] illustrate how Deep Learning can be used to learn from streaming data. In [165] Zhou et al. describe how Deep Learning algorithms can be used for feature learning on Big Data. Another problem associated with using Deep Learning on Big Data is training large-scale models on massive datasets. In [166] Dean et al. use thousands of CPU cores to train a Deep Learning neural network with billions of parameters. In [167] Coates et al. suggest using the power of a cluster of GPU servers to overcome the problem of Deep Learning on large-scale datasets. Big Data encompasses a lot of things, from medicine, genomic, and biological data to call centers. To handle the huge volumes of input associated with Big Data, large-scale Deep Learning models are desirable: they can determine the optimal number of model parameters and overcome the challenges of Deep Learning for Big Data analysis. There are other Big Data problems, like domain adaptation and streaming data, that large-scale Deep Learning models for Big Data need to handle. Variety, one of the other characteristics of Big Data, focuses on the variation of input domains and data types, so the problem of domain adaptation is another issue that Deep Learning needs to overcome. There are some studies, including [168, 169], that mainly focus on domain adaptation during the learning process. In [168] Glorot et al. illustrate that Deep Learning can find intermediate data representations in a hierarchical learning manner and that these representations can be used for other domains. Chopra et al. in [169] propose a new Deep Learning model for domain adaptation.

Their proposed model considers the information available from the distribution shift between the training and test data. This chapter mainly focuses on information retrieval, so in the following section we summarize Deep Learning in sentiment analysis.

5.4.4 Sentiment Analysis with Deep Learning Approaches

In the prior section, we discussed some advantages of using Deep Learning in Big Data analysis, including the application of Deep Learning algorithms to Big Data analysis and how specific characteristics of Big Data can lead to challenges in adopting Deep Learning algorithms for Big Data analytics tasks. In this section, we explore sentiment analysis using Deep Learning algorithms. In data mining prediction tasks, feature engineering is the most important and most difficult skill, and the effort involved in feature engineering is the main reason to seek algorithms that can learn features by themselves. Hierarchical feature learning in Deep Learning extracts multiple layers of nonlinear features, and then a classifier combines all the features to make predictions [170]. Data mining models based on shallow learning, like Support Vector Machines and decision trees, are not able to extract complex features. Deep Learning algorithms, on the other hand, have the capability to generalize in global ways, generating learning patterns and relationships beyond immediate neighbors in the Big Data [161]. In order to obtain more complex features, Deep Learning algorithms transform first-level features, like edges and blobs in an image, to extract more informative features that distinguish between classes. This process is very close to brain activity: the first hierarchy of neurons in the visual cortex is sensitive to specific edges and blobs, while brain regions further down the visual pipeline are sensitive to more complex structures such as faces. In other words, Deep Learning learns the representation of Big Data in a deep architecture, and the more layers the data goes through, the more complicated the

nonlinear transformations that are constructed. Hierarchical feature learning historically suffered from major problems, such as the vanishing gradient in very deep layers, which made deep architectures perform poorly in comparison to shallow learning algorithms. Deep Learning methods can overcome the vanishing gradient problem, so they can train dozens of layers of nonlinear hierarchical features. Deep Learning methods are not only about learning deep nonlinear hierarchical features; they can also be used to detect very long nonlinear time dependencies in sequential data. Long Short-Term Memory (LSTM) and Recurrent Neural Networks are two examples of neural networks that can increase prediction accuracy by picking up on activity hundreds of time steps in the past. One of the main problems in Big Data is storing data effectively and retrieving information from it. Deep Learning algorithms can be used to generate high-level abstract data representations which can then be used for sentiment analysis. While a vector representation of Big Data provides faster information retrieval, Deep Learning can be used for a relational understanding of the Big Data. Using Deep Learning algorithms can help us extract semantic features from a massive amount of text data in addition to reducing the dimensionality of the data representations. In [171] Hinton et al. propose a Deep Learning model to learn binary codes for documents: the word count vector of a document is the lowest layer and the learned binary code of the document is the highest layer. The binary code can be used for information retrieval in Big Data. We can also use unsupervised data in training a Deep Learning model [172]; in [173] Ranzato et al. propose a study in which a Deep Learning model learns with both supervised and unsupervised Big Data. Deep Learning algorithms provide the opportunity to extract the semantic aspects of a document by capturing complex nonlinear representations of word occurrences. Using Deep Learning can help us leverage

unlabeled documents and thus gain access to a huge amount of data. Since Deep Learning has become popular relatively recently, additional work needs to be done to use the hierarchical learning strategy as a method for sentiment analysis of Big Data.

Tomas Mikolov in [174] proposed the word2vec model. In this model, instead of relying on the number of occurrences of words, neural network methods are used to produce a dense vector representation of each word or document. Word2vec uses the location of words relative to each other in a sentence to find the semantic relationship between them. In contrast to the bag-of-words model, word2vec can capture sentimental similarity among words. Word2vec is implemented in two different model architectures, continuous bag-of-words and skip-gram. In the continuous bag-of-words architecture, we have a sequence of words and we need to predict which word is most likely to be the next word in this sequence. In the skip-gram architecture, given a word, we try to predict the most probable surrounding window of words. The outcome is a vector space in which similar words are nearby. When using the word2vec model, the order of the words in a sentence is ignored, and only the words and their distance from each other are considered. Quoc Le and Tomas Mikolov in [156] describe the doc2vec method. Doc2vec generalizes word2vec by adding a paragraph vector: each paragraph, like each word, is mapped to a vector. The advantage of considering a paragraph as a vector is that it can work as a kind of memory to keep the order of the words in a sentence. Doc2vec, like word2vec, is implemented with two different methods, distributed memory and distributed bag-of-words. In distributed memory, a paragraph is treated the same as a word.

Figure 5.3: Distributed Memory Architecture

This is beneficial because, after paragraph vectors have been learned from Big Data, they can be used effectively for a task even when labeled data is limited. The distributed bag-of-words model ignores the word context as input, and instead predicts words by randomly selecting samples from a paragraph. The architectures of distributed memory and distributed bag-of-words are provided in Figures 5.3 and 5.4.

Figure 5.4: Distributed Bag-of-Words Architecture

To achieve the goal of higher accuracy, in each iteration of stochastic gradient descent we sample a text window and select some random words from this window. At the end of this process, based on the given paragraph vector, we form a classification. The distributed bag-of-words model is conceptually simple and does not need to store word vectors, so it needs less memory. Deep Learning algorithms are powerful at extracting useful representations from various kinds of Big Data, and the discriminative results provided by Deep Learning can be used for information retrieval [175].

Recurrent Neural Network

The idea behind the Recurrent Neural Network (RNN) is that input data are not independent of each other: knowing the data from previous iterations will improve our prediction accuracy. For example, suppose we want to predict the next word in a sequence of words; knowledge of the previous words helps us improve the accuracy of our prediction. Recurrent neural networks perform the same task for every element of a sequence while taking the previous computations into account. In other words, an RNN has memory that captures information about what has been calculated so far. In practice, however, the vanishing gradient is a common problem in Deep Learning, and because of it, RNNs look back just a few steps. Although vanishing gradients are not exclusive to RNNs, they limit our network depth to less than the length of the sentence. Thankfully, there are a variety of methods that can help us address the vanishing gradient problem; for example, instead of using tanh or sigmoid as activation functions, we can use ReLU. However, we chose a more popular solution for our work: Long Short-Term Memory (LSTM).

Long Short-Term Memory (LSTM)

LSTM was proposed in [159] by Sepp Hochreiter and Jürgen Schmidhuber. The main difference between RNNs and LSTMs is the gated cell. Gated cells in LSTMs help the system store more information in comparison to RNNs: information can be stored in, written to, or read from a cell, and cells decide whether to remove or store information by opening and closing gates. A cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. The forget gate is the element that allows the cell to remember or forget its previous state. For example, assume that we want to capture the gender of the subject of a sentence. In this case, when a new subject is seen, the previous one should be forgotten so that the relevant information can be determined and stored.

Convolutional Neural Network

One of the most commonly used Deep Learning models is the fully-connected neural network. Although fully-connected neural networks are considered a good solution for classification tasks, the huge number of connections in these networks may lead to problems, and these problems can be further amplified in text processing because of the high number of neurons required. In addition, we believe that words which appear close together in a sentence are more related to each other than words which never appear close together in any sentence, but fully-connected neural networks treat input words which are far apart the same as words which are close together in a sentence. The hierarchical learning process of Deep Learning also makes it expensive for high-dimensional data like images or text; in other words, these kinds of Deep Learning algorithms can stall when dealing with Big Data of large Volume. Convolutional neural networks offer certain advantages that make them desirable for addressing these problems. First, each neuron in the first hidden layer, instead of connecting to all input neurons, is connected only to a small region of them. This reduction in connection complexity also reduces potential computational problems. Second, using the same weights for each of the hidden neurons provides the opportunity to detect the same feature at different locations in the input text. At the end of the network, a pooling layer simplifies the information from the convolutional layers to the output. The convolutional neural network is thus one of the methods that can be used effectively for Big Data analysis. The convolutional neural network, one of the powerful models in Deep Learning, uses convolutional layers to filter inputs for useful information. In [127] Hinton et al. use a Deep Learning convolutional neural network for image object recognition. Their Deep Learning model outperforms other existing

approaches, and their work is valuable because it shows the importance of Deep Learning in image searching. Dean et al. in [166] use a similar Deep Learning modeling approach but with a large-scale software infrastructure for training, and in [176] video data is used; they use Deep Learning methods like stacking and convolution to learn hierarchical representations.

5.4.5 Results and Discussion

In this section, we explain our experiments applying Deep Learning methods to the StockTwits dataset. We tried to see if Deep Learning models could improve the accuracy of sentiment analysis of StockTwits messages. Deep Learning attempts to mimic the hierarchical learning approach of the human brain, and using Deep Learning to extract features brings non-linearity to Big Data analysis. The results of applying three Deep Learning methods commonly used in natural language processing are provided in the following sections.

Doc2vec

As our first step, we apply the doc2vec model to the StockTwits dataset to see if it can increase the accuracy of sentiment prediction for stock market writers. This model was chosen first because it uses the paragraph as a memory to keep the order of the words in a sentence, and maps paragraphs, as well as words, to vectors. Quoc Le in [156] recommends using both doc2vec architectures simultaneously to create a paragraph vector. Following this method, in our experiment each paragraph vector is a combination of two vectors: one learned by the distributed memory (DM) architecture and the other learned by the distributed bag-of-words (DBOW) architecture. The accuracy of the doc2vec model is also likely to be affected by window size, with larger windows expected to give higher accuracy. In order to evaluate this, we consider windows of the two most commonly used sizes, 5 and 10.

Table 5.5: Performance of doc2vec on the StockTwits dataset

Window Accuracy Precision Recall F-measure AUC

5 0.6202 0.6097 0.6682 0.6376 0.6202

10 0.6723 0.6687 0.6830 0.6757 0.6723

The Gensim library in Python was used to implement doc2vec, and all words with a total frequency of less than two were ignored. The results are shown in Table 5.5. As we expected, the accuracy of doc2vec with a window size of 10 is higher than with a window size of 5, but the difference is small. By comparing the results of applying logistic regression as a baseline on the StockTwits dataset in Table 5.1 with the results of doc2vec in Table 5.5, we find that doc2vec is not an effective model for predicting sentiment in the StockTwits dataset. In Figure 5.5 we provide the receiver operating characteristic curves for the window sizes of five and ten and compare their results with the ROC of the logistic regression.
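The following sketch shows how the combined DM+DBOW paragraph vectors could be built with Gensim (here using the Gensim 4.x API, which differs slightly from the version used in our experiments); `tokenized`, a list of token lists for the StockTwits messages, is a hypothetical stand-in for our preprocessed corpus.

```python
# Sketch of combined DM+DBOW paragraph vectors (Gensim 4.x API).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokenized)]
dm   = Doc2Vec(docs, dm=1, vector_size=100, window=10, min_count=2, epochs=20)
dbow = Doc2Vec(docs, dm=0, vector_size=100, window=10, min_count=2, epochs=20)

# Each message is represented by the concatenation of its two paragraph vectors,
# which is then fed to a classifier such as logistic regression.
dm_vecs   = np.array([dm.dv[i] for i in range(len(docs))])
dbow_vecs = np.array([dbow.dv[i] for i in range(len(docs))])
features  = np.hstack([dm_vecs, dbow_vecs])   # shape: (n_messages, 200)
```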

Figure 5.5: Area Under the ROC curve for doc2vec with window sizes of 5 and 10

Long Short-Term Memory

Based on the findings in the previous section, doc2vec is not a good model for predicting the sentiment of authors regarding the stock market, and so we move on to Recurrent Neural Networks (RNNs), some of the other most popular models in natural language processing, which have shown very good results.

Table 5.6: Performance of the LSTM on the StockTwits dataset

Accuracy Precision Recall F-measure AUC

0.6923 0.8518 0.6571 0.7419 0.7109

RNNs were adopted to see if they could help improve the accuracy of StockTwits sentiment analysis. Although an actual RNN was not used in our experiment, Long Short-Term Memory [177, 178, 179, 180] is a viable replacement because it has a deeper memory structure. In our implementation, we used the Theano [181] library in Python, with average pooling as the pooling method. In the last step, we fed the result of the pooling into a logistic regression layer to find the target class label associated with the current input sequence. We present the results of our experiments in Table 5.6. Although using LSTM increased accuracy compared to doc2vec, the accuracy is still lower than our requirements. Using logistic regression as the baseline and comparing the results in Tables 5.1 and 5.6 reveals that LSTM is not an effective model for predicting sentiment in the StockTwits dataset. In Figure 5.6, we compare the area under the ROC curve for the results of applying LSTM and logistic regression.
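For illustration, the sketch below expresses the described pipeline (embedding, LSTM, average pooling, and a final logistic layer) in tf.keras rather than the Theano implementation we actually used; the vocabulary size and layer dimensions are illustrative assumptions.

```python
# Minimal tf.keras equivalent of the described LSTM pipeline
# (the experiment itself used Theano); dimensions are illustrative.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=50000, output_dim=128),  # word vectors
    tf.keras.layers.LSTM(128, return_sequences=True),            # gated memory cells
    tf.keras.layers.GlobalAveragePooling1D(),                    # average pooling over time
    tf.keras.layers.Dense(1, activation="sigmoid"),              # logistic regression layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```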

Convolutional Neural Network

With LSTM found to be ineffective, we turn to the convolutional neural network (CNN). Although the CNN is very popular in image processing, its ability to find the internal structure of Big Data makes it a desirable model for our purposes. We employ a CNN, implemented with the Tensorflow [182] package in Python, to see if it can improve our sentiment analysis task. The first step of our process is embedding words into low dimensional vectors.

Figure 5.6: Area Under the ROC curve for Long Short-Term Memory

After that, we perform convolutions with different filter sizes over the embedded word vectors; in our experiment, we used filter sizes of 3, 4, and 5. Then we apply max pooling to the results of the convolutions and add dropout regularization. The process concludes with a softmax layer that classifies our results. Table 5.7 shows the results of these operations. By comparing the accuracy of logistic regression as a baseline in Table 5.1 with the results of applying the convolutional neural network provided in Table 5.7, we conclude that the CNN outperforms logistic regression after fewer than 2,000 steps. After 6,000 steps the accuracy of the CNN is around 86%, which is considerably higher than the other models. Additionally, in Figure 5.7, we provide the receiver operating characteristic curves for the CNN, comparing the area under the ROC curve after applying the CNN for increasing numbers of steps. As evident in Figure 5.7, with more steps in the CNN, the ROC curve gets closer to the top left corner of the diagram. This shows that as the CNN proceeds stepwise on the StockTwits dataset, the accuracy of prediction increases gradually.
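A minimal tf.keras sketch of this multi-filter architecture is given below; the vocabulary size, sequence length, and number of filters are illustrative assumptions rather than our exact experimental settings.

```python
# Sketch of the multi-filter text CNN described above (tf.keras functional API).
import tensorflow as tf

inputs = tf.keras.Input(shape=(50,), dtype="int32")           # padded token ids
x = tf.keras.layers.Embedding(50000, 128)(inputs)             # low-dimensional word vectors
pooled = []
for size in (3, 4, 5):                                        # filter sizes used in the text
    c = tf.keras.layers.Conv1D(100, size, activation="relu")(x)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(c))    # max pooling per filter size
x = tf.keras.layers.Concatenate()(pooled)
x = tf.keras.layers.Dropout(0.5)(x)                           # dropout regularization
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)   # softmax classifier
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```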

Table 5.7: Performance of the Convolutional Neural Network on the StockTwits dataset

steps Accuracy Precision Recall F-measure AUC

100 0.5700 0.6348 0.3294 0.4338 0.5700

2000 0.7943 0.7787 0.8221 0.7999 0.7943

4000 0.8210 0.7828 0.8885 0.8323 0.8210

6000 0.8651 0.8778 0.8484 0.8629 0.8651

8000 0.8891 0.8774 0.9046 0.8908 0.8891

10000 0.9093 0.9168 0.9004 0.9086 0.9093

70000 0.9897 0.9909 0.9885 0.9897 0.9897

5.5 SUMMARY

Deep Learning has shown good performance and promise in many areas, such as natural language processing, and it has the potential to address the data analysis and learning problems in Big Data. In contrast to data mining approaches, with their shallow learning processes, Deep Learning algorithms transform inputs through more layers. The hidden layers in Deep Learning are generally used to extract features or data representations, and this hierarchical learning process provides the opportunity to find word semantics and relations. These attributes make Deep Learning one of the most desirable models for sentiment analysis.

Table 5.8: Comparison of Deep Learning models in financial sentiment analysis

Model Accuracy Precision Recall F-measure AUC

Logistic regression 0.7088 0.7134 0.6980 0.7056 0.7088

Doc2vec 0.6723 0.6687 0.6830 0.6757 0.6723

LSTM 0.6923 0.8515 0.6571 0.7419 0.7109

CNN(10000 steps) 0.9093 0.9168 0.9004 0.9086 0.9093

Figure 5.7: Comparison of the Area Under the ROC curve for the Convolutional Neural Network at various steps

[ROC curves (True Positive Rate versus False Positive Rate) after 100 to 70,000 training steps; the area under the curve grows from 0.57 at 100 steps to 0.99 at 70,000 steps.]

In this chapter, we showed that convolutional neural networks can outperform data mining approaches in stock sentiment analysis (Table 5.8). In the standard data mining approach to text categorization, documents are represented as bag-of-words vectors. These vectors capture which words appear in a document but do not consider the order of the words in a sentence, and it is clear that in some cases word order can change the sentiment of a sentence. One remedy to this problem is using bi-grams or n-grams in addition to uni-grams [183, 168, 184]; unfortunately, using n-grams with n > 1 is not effective [185]. Using a CNN provides the opportunity to exploit n-gram information and extract the sentiment of a document effectively: it benefits from the internal structure of the data that exists in a document through convolution layers, where each computation unit responds to a small region of the input data. We used logistic regression, which works on the bag-of-words model, as a baseline and compared the results of applying Deep Learning to it. Based on our results, among the common Deep Learning methods in sentiment analysis, only the convolutional neural network outperforms logistic regression, and its accuracy, in comparison to the other models, is considerably better. Based on our results, we can use a CNN to extract the sentiment of authors regarding stocks from their words. There are some people in financial social networks who can correctly predict the stock market; by using a CNN to predict their sentiment, we can predict future market movement.

CHAPTER 6
EXPERT RECOGNITION IN SOCIAL MEDIA

With the popularity of the Internet and financial social networks such as StockTwits and SeekingAlpha, investors around the world have a new opportunity to gather and share their experiences. This raises new questions: do the users provide trustworthy information? How can we find the experts? We rank authors based on their ability to predict stock price movement, using two different datasets: one for finding top authors and the other for examining the top authors' performance. Deep learning is one of the most powerful methods for analyzing Big Data. In this chapter, we seek to determine if Deep Learning methods can help us find the experts in a set of StockTwits tweets. Based on our results, the Convolutional Neural Network, with an accuracy of around 90%, is the most effective method for finding expert authors in StockTwits data.

6.1 HOW CAN WE FIND THE EXPERTS IN SOCIAL MEDIA?

In the Internet age, independent analysts and retail investors around the world can collaborate with each other through the web [186]. SeekingAlpha and StockTwits are two examples of common financial social media platforms focused on the stock market, giving their users a way to connect with each other, share information, and grow their investments [99, 187]. How can finding top authors be helpful in financial social media? Is there any relation between user sentiment and stock price movement? Based on our experiment, the Pearson Correlation Coefficient between a stock price and an average user's

sentiment is equal to 0.05, which means that users are able to predict future stock prices correctly only 53% of the time, a little better than a random guess. We tried to find whether there are authors in financial social media whose contributions are good predictors of stock price but are hidden in the noise. We ranked authors based on their ability to accurately predict stock price within a week of their prediction and then examined two consecutive years of data: the first year was a benchmark to find the top authors, and the second year was used to examine the top authors' performance. Based on our results, the Pearson Correlation Coefficient for top authors is around 0.4, which means that top authors can predict stock price movement with an accuracy of about 75%. Financial social media brings people and organizations together so that they can generate ideas and share information with others [188, 189]. This media provides a huge amount of unstructured data that can be integrated into the decision-making process [190]. Such Big Data can be considered a great source of real-time estimation because of its high frequency of creation and low-cost acquisition. But is all of this data actually useful? Who produces it? Are they all experts? Are they really trying to help other people? These questions bring our attention to the fact that we do not have an effective method to establish the trustworthiness of a source. In this chapter, we seek to determine if there is a way to differentiate expert users from regular users in the StockTwits dataset. Deep Learning algorithms provide the opportunity to extract complex data representations at a high level of abstraction: high-level features with more abstraction are defined in terms of lower-level features with less abstraction [191]. The convolutional neural network (CNN) [192] is one example of the various Deep Learning models. The CNN model, which is extensively used for image analysis, makes use of the internal structure of data through convolution layers; because similar internal structure exists inside text documents, the CNN has been gaining attention for text data as well. CNNs are used in

systems for tagging, entity search, sentence modeling, etc. [142, 193, 194]. The remainder of this chapter is organized as follows: in the section "Previous Work in Finding Experts in Social Media" we look at previous work on expert recognition and the methods employed therein; in the section "Expert Recognition with Data Mining Approach" we explore whether data mining can be used to find top authors in the StockTwits dataset; the section "Methodology" explains our experiments and goes into depth about how we can apply Deep Learning to extract top authors from financial datasets like StockTwits; and our primary findings and conclusions are presented in the section "Summary".

6.2 PREVIOUS WORK IN FINDING EXPERTS IN SOCIAL MEDIA

Several studies have addressed the problem of finding the most influential users on social networks, especially on Twitter. Weng et al. [195] use topical similarity, link structure, and PageRank to introduce the TwitterRank measure. Cha et al. [196] use indegree and retweets to rank users, showing that retweets have a higher correlation with user influence than indegree. Based on the results of Bakshy et al. [197], larger cascades tend to be created by users that have been more influential in the past. All of these studies were done on Twitter and show that this problem can be approached in many ways. Eliacik and Erdogan [198] ranked users based on their degree of membership (users that mutually follow each other) and degree of interest, which is based on the words that they used in their articles. Tianyi Wang et al. [199] use two different sets of heuristics to find expert authors. First, they used empirical past performance, which looks at the average hypothetical return of all articles posted by a given author in a given period of time. Second, they ranked authors based on either their total number of comments or comments per article, because they believed that user feedback and engagement with content can be a good indicator of value. Although empirical performance-based metrics are likely the most direct way to rank authors,

Table 6.1: Performance of the Logistic Regression in finding experts on the StockTwits dataset

Accuracy Precision Recall F-measure AUC

0.6443 0.6397 0.5542 0.5939 0.6391

they need a significant amount of historical data and computational power. Is there a way to use empirical performance from the past to predict new expert authors? In this chapter, we adopt data mining and neural network methods to answer this question; to the best of our knowledge, machine learning methods have not previously been used for expert recognition.

6.3 METHODOLOGY

We adopt machine learning methods to find expert authors in the financial social forum. We start our investigation by applying a data mining method and then compare the results with deep learning methods. In the following sections we describe the methods used in our experiments.

6.3.1 Expert Recognition with Data Mining Approach

The first step in our process was to see if data mining methods can predict experts based on their messages. Our assumption is that we can guess whether a user is an expert based on the words that he or she uses in his or her messages. We used the 2015 dataset as a benchmark and extracted top authors by selecting users that predicted stock price movement correctly over 15 times. We then labeled the 2016 dataset by using the expert authors from 2015. We applied logistic regression to the StockTwits dataset, using unigrams as features. In Table 6.1, we show how logistic regression performed on the StockTwits data based on different metrics, and in Figure 6.1, we present the ROC curve for this model.
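A sketch of this labeling step is shown below using pandas; the file names and the `author` and `correct` columns (where `correct` marks a message whose prediction matched the subsequent price movement) are hypothetical stand-ins for our actual data layout.

```python
# Sketch of the benchmark-and-label step (hypothetical file and column names).
import pandas as pd

msgs_2015 = pd.read_csv("stocktwits_2015.csv")
msgs_2016 = pd.read_csv("stocktwits_2016.csv")

# Experts: authors with more than 15 correct predictions in the benchmark year.
correct_counts = msgs_2015.groupby("author")["correct"].sum()
experts = set(correct_counts[correct_counts > 15].index)

# Label every 2016 message by whether its author was a 2015 expert.
msgs_2016["expert"] = msgs_2016["author"].isin(experts).astype(int)
```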

Figure 6.1: Logistic regression (Area Under the ROC curve)

[ROC curve (True Positive Rate versus False Positive Rate) for logistic regression, area = 0.71.]

Figure 6.2: Area Under the ROC curve for Doc2Vec with a window size of 5

[ROC curve (True Positive Rate versus False Positive Rate) for Doc2Vec with a window size of 5, area = 0.59.]

6.3.2 Experiments Using Neural Networks

In this section, we provide the results of our experiments applying neural network methods to expert recognition in the StockTwits dataset. We used the two common neural network methods that were discussed in Section 5.4.4.

Doc2vec

As previously described, doc2vec is implemented with two different architectures: distributed memory and distributed bag-of-words.

Table 6.2: Performance of Doc2Vec in finding top authors on the StockTwits dataset

Window Accuracy Precision Recall F-measure AUC

5 0.5900 0.5991 0.5438 0.5701 0.5900

In order to increase the performance of the model, Quoc Le [156] recommends using both of these architectures to make a paragraph vector. In our experiment, we followed this method and built each paragraph vector by combining two vectors: one learned through the distributed memory architecture and the other learned through the distributed bag-of-words architecture. We used the Gensim [200] Python library to implement doc2vec and ignored all words with a total frequency of less than three. Negative sampling was used, and we set the dimensionality of the feature vectors to 100. The results of applying doc2vec to the StockTwits data are shown in Table 6.2. In Figure 6.2 we provide the receiver operating characteristic (ROC) curve for the window size of five. By comparing the results of applying logistic regression as a baseline on the StockTwits dataset in Table 6.1 with the results of doc2vec in Table 6.2, we find that doc2vec is not an effective model for predicting the experts in the StockTwits social network.

Convolutional Neural Network

With doc2vec found to be ineffective, we turn to the convolutional neural network. Although CNNs are very popular for image processing, their ability to find the internal structure of a dataset makes them a desirable model for our purposes. In this chapter, we use the Tensorflow [182] package in Python to see if CNNs can be used to find top authors. The first step of our process is embedding words into low dimensional vectors. After that, we perform convolutions with filter sizes of 3, 4, and 5 over the embedded word vectors.

Table 6.3: Performance of the Convolutional Neural Network in finding top authors in the StockTwits dataset

steps Accuracy Precision Recall F-measure AUC

500 0.5740 0.6457 0.3279 0.4350 0.5740

2000 0.6493 0.6873 0.5479 0.6098 0.6493

4000 0.7241 0.7222 0.7284 0.7253 0.7241

8000 0.8032 0.8360 0.7544 0.7931 0.8032

10000 0.8388 0.8629 0.8056 0.8333 0.8388

14000 0.8930 0.9203 0.8605 0.8894 0.8930

18000 0.9205 0.9234 0.9171 0.9203 0.9205

Then we apply max pooling to the results of the convolutions and add dropout regularization to avoid overfitting. The process concludes with a softmax layer that classifies our results. Table 6.3 shows the results of these operations. By comparing the accuracy of logistic regression as a baseline in Table 6.1 with the results of applying the convolutional neural network provided in Table 6.3, we conclude that the CNN outperforms logistic regression after 2,000 steps. After 8,000 steps the accuracy of the CNN is around 80%, which is considerably high in comparison to the other models, and passing 18,000 steps gives us an accuracy of more than 90%. Additionally, in Figure 6.3, we provide the receiver operating characteristic curves for our CNN, comparing the area under the ROC curve after applying the CNN for increasing numbers of steps. As evident in Figure 6.3, with more steps in the CNN, the ROC curve gets closer to the top left corner of the graph. This shows that by proceeding stepwise with a CNN on the StockTwits dataset, the accuracy of predicting top authors increases gradually.
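Once such a model is trained, message-level scores can be aggregated into an author ranking, for example along the lines of the sketch below; the `model` (the CNN sketched in Section 5.4.4) and the hypothetical `msgs_2016` DataFrame with an equal-length, padded `tokens` column are assumptions carried over from the earlier sketches, not our exact pipeline.

```python
# Sketch of turning message-level CNN scores into an author ranking.
import numpy as np

probs = model.predict(np.stack(msgs_2016["tokens"]))[:, 1]  # P(expert) per message
msgs_2016["expert_score"] = probs
ranking = (msgs_2016.groupby("author")["expert_score"]
           .mean()                                          # average over an author's messages
           .sort_values(ascending=False))
print(ranking.head(10))                                     # candidate top authors
```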

Figure 6.3: Comparison of the Area Under the ROC curve for the Convolutional Neural Network at different steps

[ROC curves (True Positive Rate versus False Positive Rate) after 500 to 18,000 training steps; the area under the curve grows from 0.57 at 500 steps to 0.92 at 18,000 steps.]

6.4 SUMMARY

Neural networks have shown great performance and promise in many areas, such as natural language processing. In contrast to data mining approaches, with their shallow learning process, neural network algorithms transform inputs through more layers. This hierarchical learning process provides the opportunity to find word semantics and relations. In this chapter, we applied doc2vec and convolutional neural networks to the StockTwits data to see if they can be used to find the top authors based on their words. Logistic regression, which works on a bag-of-words model, was used as a baseline and then compared to the results of applying neural network methods. Based on our results, the convolutional neural network outperforms logistic regression, and its accuracy, in comparison to the other models, is considerably high.

CHAPTER 7
SUMMARY AND FUTURE WORK

The majority of available information is in textual format. Text data are easily generated in various scenarios, and they are a perfect example of unstructured data. Although humans can quickly process and understand natural language, it is significantly harder for a machine. Based on a prediction by the International Data Corporation (IDC), the volume of text data will grow to 40 zettabytes by 2020, a fifty-fold increase since the beginning of 2010. In recent years the text mining field has gained a great deal of attention due to the availability of massive amounts of data in a variety of forms such as social networks, patient records, healthcare insurance data, news outlets, etc. This volume of text is an incredible source of information and knowledge, so there is a pressing need to design methods and algorithms that can effectively process this massive amount of text in a wide variety of applications. Although documents contain lots of words, not all of them provide useful information. Some words, such as stop words, should be removed during the preprocessing of text data. The preprocessing step usually consists of tasks such as tokenization, filtering, lemmatization, and stemming. The primary purpose of text mining is to extract meaningful numeric values that represent the text data; in this way, we can apply various data mining algorithms to the text dataset. By turning text into numbers (meaningful indices), it can be incorporated into other analyses such as supervised or unsupervised learning models. In Chapter 1 we provided more information about preprocessing methods, converting text data to meaningful numeric values, and various data mining models for processing text data. In that chapter we also explained how representing documents as numerical vectors enables efficient analysis of large collections of documents; for example, we showed how representing documents as vectors can help us measure the similarity between documents. One of the most popular similarity measurements is cosine similarity, which uses the angle between the vector representations of documents to measure their similarity. Although cosine similarity is used effectively in text analysis, it has some drawbacks. The cosine value lies between -1 and 1, while a similarity value should lie between 0 and 1. Also, cosine similarity has some difficulties in high dimensional data. The reason is that cosine similarity is derived from the Euclidean distance, which is based on the L2 norm, and in high dimensional data the Euclidean distance between two close points is, within standard error, the same as between two remote points.

In Chapter 2 we proposed a new similarity measurement. We showed that for the Lk norm, given a high value of the dimensionality d, it may be preferable to use a lower value of k. In other words, for a high-dimensional application, an L1-type distance, such as the Hellinger distance, is more favorable than L2 (the Euclidean distance).

To alleviate the problem of cosine similarity in high-dimensional data, we proposed a new similarity measurement based on the Hellinger distance. Since the Hellinger distance works better than cosine similarity in high-dimensional data, we believe our newly proposed similarity also works better than cosine similarity in such settings. In Chapter 2 we presented a comprehensive experiment comparing cosine similarity with the newly proposed similarity measurement. Based on our results, although the new similarity and cosine similarity fall in the same group under the Tukey test, the new similarity measurement outperforms cosine similarity, and these results are independent of the dataset, of the classification or clustering methods used, and of the performance metrics.
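As a rough illustration of the idea, and not the exact formulation from Chapter 2, the sketch below computes the Hellinger distance between two L1-normalized term-frequency vectors and converts it into a similarity score. The normalization of raw counts and the one-minus-distance conversion are assumptions made for this example.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

def hellinger_similarity(x, y):
    """Turn raw term counts into distributions, then map the distance,
    which lies in [0, 1] for distributions, to a similarity in [0, 1]."""
    p = x / x.sum()
    q = y / y.sum()
    return 1.0 - hellinger_distance(p, q)

doc_a = np.array([2.0, 0.0, 1.0, 3.0])  # hypothetical term counts
doc_b = np.array([1.0, 1.0, 0.0, 2.0])
print(hellinger_similarity(doc_a, doc_b))
```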

Finding similarity between documents is only one example of using natural language processing to analyze text data. The popularity and availability of textual data make natural language processing and information retrieval among the most commonly used artificial intelligence techniques. The primary goal of processing text data is to use machines to extract information from unstructured text documents and transform it into understandable information for future use. A variety of tasks, such as document categorization, text clustering, information extraction, pattern recognition, document summarization, and sentiment analysis, are considered natural language processing.

In Chapter 3 we defined natural language processing and some of the common tasks in information retrieval and NLP. Sentiment analysis is one example of an information retrieval task discussed in Chapter 3. Sentiment analysis helps us extract public opinion about products, services, politics, or any other topic people have opinions about. Lexicon-based methods and supervised machine learning methods can both be used to extract people's sentiment from their text documents. Lexicon-based methods use a predefined list of positive and negative words to extract the sentiment of new documents. Machine learning methods, on the other hand, do not need any predefined lexicon; they predict the sentiment of an author based on his or her words.

People around the world make many comments every day about various products, services, politics, and so on. These comments can save a company from a big loss and help it in its future decisions. With the popularity of social media, a massive number of users around the world write comments about different topics, and an enormous amount of information is hidden in this incredible source of textual data. StockTwits is one example of a social network with a large user base; more details about StockTwits are available in Chapter 3. Information about the stock market, such as the latest stock prices, price movements, stock exchange history, and buying or selling recommendations, is available to StockTwits users. In addition, as a social network, StockTwits provides the opportunity to share experience among traders in the stock market. For example, users can write short comments about a specific stock.
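A lexicon-based analyzer can score such a comment directly. Below is a minimal usage sketch with VADER, one of the lexicon-based tools examined in Chapter 4; the use of the vaderSentiment package is an assumption (the implementation is not specified here), and the message text is hypothetical.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Hypothetical StockTwits-style message.
scores = analyzer.polarity_scores("$AAPL looking strong, great earnings, very bullish!")

# scores is a dict with 'neg', 'neu', 'pos', and 'compound' keys;
# a compound score above zero suggests a positive (bullish-leaning) message.
print(scores["compound"])
```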

What kind of information can we extract from these comments? How can we use these data to get to know our users and provide more services to them? As a first step, we can predict the sentiment of authors regarding various stock prices. In this way, if a person is bearish we can offer that customer stock-selling services, and if he or she is bullish we can recommend stock-buying services.

In Chapter 4 we used lexicon-based approaches to predict the sentiment of authors. We investigated whether there is any relation between positive sentences and bullish authors, or between negative sentences and bearish authors. In other words, we tried to evaluate whether people who wrote positive comments are bullish and whether authors with negative comments are bearish. We applied three lexicon-based methods, and based on our results there is a close relation between positive comments and bullishness, and between negative comments and bearishness. In Chapter 4 we also showed that VADER is the best lexicon-based method for extracting author sentiment; with an AUC of more than 85%, it can predict the sentiment of users.

Data mining techniques can also be used for sentiment analysis. Machine learning models are more popular because lexicon-based approaches, which rely on the semantics of words, use a predefined list of positive and negative words to extract the sentiment of new documents. Creating these predefined lists is time-consuming, and we cannot build a single lexicon that can be used in every separate context. In Chapter 5 we investigated the performance of machine learning techniques for sentiment analysis on the StockTwits dataset. Using logistic regression, we could predict the sentiment of authors in StockTwits with an AUC of 70%.
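For reference, a bag-of-words logistic regression baseline of this kind can be sketched with scikit-learn as follows. The toy messages and labels are hypothetical, and this is not the dissertation's exact experimental setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Hypothetical labeled messages: 1 = bullish, 0 = bearish.
texts = [
    "great earnings, very bullish",
    "sell now, this will drop",
    "strong buy signal",
    "bearish on this stock",
]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# AUC is computed from the predicted probability of the bullish class.
probs = model.predict_proba(texts)[:, 1]
print(roc_auc_score(labels, probs))
```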

The StockTwits comments are short, but a massive number of people use this social network to communicate with one another, so we have a very large vocabulary. This means there are many features, and we are not sure which of them can help us predict the sentiment of authors. As a first step, we used feature selection techniques to pick out the words most effective for our prediction. We selected three feature filters: chi-squared, ANOVA, and mutual information. The advantages of these feature selection techniques are their speed, their scalability, and their independence from the classifier; our reason for choosing them was their ability to deal with sparse data. Based on our results in Chapter 5, feature selection techniques could not improve the performance of sentiment analysis on the StockTwits dataset.

The remarkable results of deep learning models make them a desirable technique for analyzing data, and textual data are no exception. Because text data contain many words and features, deep learning methods are commonly used on them and provide impressive results. Some of the advantages of deep learning models in comparison to data mining methods, discussed in Chapter 5, include the following:

• Features are learned hierarchically during the process of deep learning instead of the feature engineering that is required in data mining.

• In deep learning methods, each word is considered as part of a sentence. In this way, relevant information contained in word order, proximity, and relationships is not lost.

• Deep learning benefits from a similarity model: word2vec creates vector representations of words in a much lower-dimensional space than the bag-of-words model, so the vectors representing similar words are closer together in the vector space (see the sketch after this list).

• Automatic extraction of representations (abstractions) is another advantage of deep learning. To achieve this goal, deep learning uses a massive amount of unsupervised data and extracts complex representations automatically.
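To illustrate the word2vec point above, here is a minimal gensim word2vec sketch. The toy corpus and the 50-dimensional vector size are hypothetical; a real model would be trained on the full message collection.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized StockTwits-style corpus.
sentences = [
    ["bullish", "on", "aapl", "strong", "earnings"],
    ["bearish", "weak", "earnings", "sell"],
    ["strong", "buy", "signal", "bullish"],
]

# Each word is mapped to a dense 50-dimensional vector (illustrative size).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Words used in similar contexts end up close together in the vector space.
print(model.wv.most_similar("bullish", topn=3))
```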

Also in Chapter 5 we went through the most common neural network models for textual data, including doc2vec, long short-term memory, and convolutional neural networks, and we investigated the performance of these models in predicting the sentiment of authors in the StockTwits dataset. Based on our results, the CNN outperforms the other neural network methods in predicting the sentiment of authors regarding future stock prices.

Is there any way to use this information to predict future stock prices? Do the users provide trustworthy information? How can we find the experts? Is there any relation between user sentiment and stock price movement? These are the questions we answered in Chapter 6. Based on our experiments, the Pearson Correlation Coefficient between a stock price and the average user's sentiment is 0.05, which means that users are able to predict future stock prices correctly only 53% of the time; 53% accuracy is only a little better than a random guess.

Are there authors in financial social media whose contributions are good predictors of stock price but are hidden in the noise? Based on our results, the Pearson Correlation Coefficient for top authors is around 0.4, which means that the top authors can predict stock price movement with an accuracy of about 75%. But how can we find the top authors? In Chapter 6 we ranked authors by their ability to predict a stock's price accurately within a week of their prediction and then examined two consecutive years of data: the first year served as a benchmark to find such top authors, and the second year was used to evaluate the top authors' performance. We applied doc2vec and convolutional neural networks to the StockTwits data to see if they can be used to find the top authors based on their words. Based on our results, the convolutional neural network outperforms the other models in predicting top authors in the StockTwits data.
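As a sketch of a CNN text classifier of the kind just described, under assumed hyperparameters rather than the exact architecture from Chapter 5, a Keras model for short messages could look like this:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # assumed vocabulary size for the tokenized messages

model = models.Sequential([
    # Map each token id to a 100-dimensional dense embedding.
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=100),
    # A 1-D convolution over windows of 3 words captures local n-gram patterns.
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),
    # Keep the strongest response per filter, regardless of where it occurs.
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    # One sigmoid unit for the binary bullish/bearish decision.
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```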

7.1 FUTURE WORK

Artificial intelligence researchers have always tried to find ways for machines to understand human language as it is spoken. Natural language processing is an area of artificial intelligence that investigates how to program computers to process and analyze large amounts of natural language data. Over the decades, and with the popularity of the internet, people have used different ways to express their emotions, and NLP needs to be prepared for these changes. Deep learning is one of the most effective methods for extracting information from unstructured data, and textual data, as one of the most popular kinds of unstructured data, can take advantage of the power of deep learning in big data analysis. As future work, we plan to investigate other aspects of NLP that deep learning can help to improve. In addition, we want to use our new similarity measurement in other tasks, such as recommendation engines, and to investigate the performance of our newly proposed similarity in comparison to other similarity measurements, along with its contribution to improving other aspects of NLP.
