MACHINE LEARNING METHODS to UNDERSTAND TEXTUAL DATA by Sahar Sohangir

MACHINE LEARNING METHODS TO UNDERSTAND TEXTUAL DATA by Sahar Sohangir A Dissertation Submitted to the Faculty of The College of Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy Florida Atlantic University Boca Raton, FL December 2018 Copyright 2018 by Sahar Sohangir ii iii ACKNOWLEDGEMENTS I want to thank my advisor Dr. Dingding Wang. It has been an honor to be her first Ph.D. student. I appreciate all her contributions of time, ideas, and support to make my Ph.D. I also gratefully acknowledge partial support by the National Science Founda- tion, under grant number CNS-1427536. Any opinions, findings, and conclusions or recommendations expressed in this dissertation are those of the author and do not necessarily reflect the views of the National Science Foundation. I would like to thank my family for all their love and encouragement. iv ABSTRACT Author: Sahar Sohangir Title: Machine Learning Methods to Understand Textual Data Institution: Florida Atlantic University Dissertation Advisor: Dr. Dingding Wang Degree: Doctor of Philosophy Year: 2018 The amount of textual data that produce every minute on the internet is ex- tremely high. Processing of this tremendous volume of mostly unstructured data is not a straightforward function. But the enormous amount of useful information that lay down on them motivate scientists to investigate efficient and effective techniques and algorithms to discover meaningful patterns. Social network applications provide opportunities for people around the world to be in contact and share their valuable knowledge, such as chat, comments, and discussion boards. People usually do not care about spelling and accurate grammatical construction of a sentence in everyday life conversations. Therefore, extracting information from such datasets are more complicated. Text mining can be a solution to this problem. Text mining is a knowledge discovery process used to extract patterns from natural language. Application of text mining techniques on social networking websites can reveal a significant amount of information. Text mining in conjunction with social networks can be used for finding a general opinion about any special subject, human thinking patterns, and group identification. In this study, we investigate machine learning methods in textual data in six chapters. v 1. Text representation and encoding: This chapter will take a look at some techniques to represent documents in vector space and some machine learning methods to analyze textual data. 2. Text Similarity: In this chapter, we will propose a new similarity measurement. This new similarity can alleviate the Cosine similarity problem in high dimensional data. 3. Textual Data: Natural language processing and information retrieval techniques include sentiment analysis will investigate in this chapter. Lexicon based and machine learning base are two commonly used techniques in sentiment analysis. 4. Lexicon Based Financial Sentiment Analysis: In this chapter lexicon based methods will use to extract sentiment of people in a financial forum. In this chapter We will investigate if people who are Bullish (believe the stock price will be increase) use positive words and people who are Bearish (believe the stock price will be decrease) use negative words in their sentences. 5. Financial Sentiment Analysis: Investigate deep learning methods to extract the sentiment of users in the financial forum. Based on our results Convolution Neural Network is the best method to extract user sentiment in the financial forum. 6. Expert Recognition in Social Media: The main goal of this chapter is to evaluate deep learning methods to find expert people in predicting stock price movement. In other words, we will try to see if there is any relation between people words and their ability in predicting stock price. vi To the graduate students of Florida Atlantic University. MACHINE LEARNING METHODS TO UNDERSTAND TEXTUAL DATA List of Figures ............................. xi 1 Introduction and background ..................... 1 1.1 Vector Space Model............................1 1.2 Text Preprocessing............................1 1.2.1 Tokenization............................2 1.2.2 Dropping common terms.....................2 1.2.3 Equivalence classing of terms (Normalization).........3 1.2.4 Capitalization...........................3 1.2.5 Stemming and lemmatization..................4 1.2.6 Term scoring...........................4 1.3 Learning Methods.............................6 1.3.1 supervised learning methods:..................6 1.3.2 Supervised learning evaluation metrics:............ 10 1.3.3 Unsupervised learning methods:................ 14 1.3.4 Unsupervised learning evaluation metrics:........... 21 1.3.5 Semi-supervised learning methods:............... 24 1.4 Brief revision of the dissertation..................... 24 2 Text Similarity ............................. 26 2.1 Text Similarity Measurement...................... 26 2.2 Cosine Similarity............................. 29 2.3 Sqrt-Cosine Similarity.......................... 30 viii 2.4 ISC Similarity............................... 32 2.5 Experiment................................ 33 2.6 DataSets.................................. 33 2.7 Learners.................................. 35 2.8 Performance Metrics........................... 35 2.9 Experimental results........................... 36 2.10 Overall Results.............................. 36 2.11 Results using Different Learners..................... 38 2.12 Results using Different datasets and Learners............. 40 2.13 Summary................................. 43 3 Textual Data .............................. 44 3.1 Text mining approaches......................... 44 3.2 Information Retrieval........................... 44 3.3 Natural Language Processing...................... 45 3.3.1 Text summarization....................... 48 3.3.2 Sentiment analysis........................ 49 4 Lexicon Based Financial Sentiment Analysis . 52 4.1 Why Financial Sentiment Analysis................... 52 4.2 previous work on Financial Sentiment Analysis............. 53 4.3 Methodology............................... 54 4.3.1 VADER: Valence Aware Dictionary for sEntiment Reasoning. 55 4.3.2 SentiWordNet........................... 56 4.4 Experiments................................ 56 4.4.1 Machine Learning Approaches.................. 57 4.4.2 Lexicon Based Approaches.................... 58 4.4.3 Combined Results......................... 59 4.5 Summary................................. 60 ix 5 Financial Sentiment Analysis ..................... 62 5.1 Social network information extraction.................. 62 5.2 Big Data.................................. 63 5.3 Machine Learning in Social network information extraction...... 66 5.4 Methodology............................... 69 5.4.1 Sentiment Analysis with Data Mining Approaches....... 70 5.4.2 Increase Accuracy by using Feature selection.......... 71 5.4.3 Deep Learning in Big Data Analytics.............. 77 5.4.4 Sentiment Analysis with Deep Learning Approaches...... 79 5.4.5 Results and Discussion...................... 85 5.5 Summary................................. 89 6 Expert Recognition in Social Media . 92 6.1 How can we find the experts in Social Media?............. 92 6.2 Previous work in finding Experts in Social Media........... 94 6.3 Methodology............................... 95 6.3.1 Expert Recognition with Data Mining Approach........ 95 6.3.2 Experiments Using Neural Networks.............. 96 6.4 Summary................................. 99 7 Summary and future work . 100 7.1 Future Works............................... 106 Bibliography .............................. 107 x LIST OF FIGURES 2.1 Accuracy in classification box plot.................... 38 2.2 Purity in clustering box plot....................... 38 4.1 Comparative Area Under the ROC curve for Lexicon versus Machine Learning based sentiment analysis.................... 60 5.1 Receiver Operating Characteristic for Logistic Regression....... 71 5.2 Accuracy of logistic regression by using feature selection methods.. 76 5.3 Distributed Memory Architecture.................... 82 5.4 Distributed Bag of words......................... 82 5.5 Area Under the ROC curve for doc2vec with window size of 5 and 10 86 5.6 Area Under the ROC curve for Long Short-Term Memory...... 88 5.7 Compare Area Under the ROC curve for Convolutional Neural Network in various steps.............................. 90 6.1 logistic regression (Area Under the ROC curve)............ 96 6.2 Area Under the ROC curve for window size of 5............ 96 6.3 Compare Area Under the ROC curve for Convolutional Neural Network in different steps............................. 99 xi CHAPTER 1 INTRODUCTION AND BACKGROUND The most common way to represent a document is bag of words (BOW) [1,2]. Bag of words model views a document as a collection of words and disregards grammar and word order. This representation leads to a vector representation which facilitates further analysis of the documents. For instance by representing a document as a vector, dot product of the vectors can be used to measure the similarity between documents. This chapter will take a look at some preprocessing techniques that we need to apply on text dataset. Also, we will see some common machine learning methods to extract information from textual data. 1.1 VECTOR SPACE MODEL Representing documents by the numerical vectors enable efficient analysis of the ex- tensive collection

MACHINE LEARNING METHODS to UNDERSTAND TEXTUAL DATA by Sahar Sohangir

Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann

Multiple Segmentations of Thai Sentences for Neural Machine Translation

A Clustering-Based Algorithm for Automatic Document Separation

An Incremental Text Segmentation by Clustering Cohesion

A Text Denormalization Algorithm Producing Training Data for Text Segmentation

Topic Segmentation: Algorithms and Applications

A Generic Neural Text Segmentation Model with Pointer Network

Text Segmentation Techniques: a Critical Review

Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model

Evaluating Vector-Space Models of Word Representation, Or, the Unreasonable Effectiveness of Counting Words Near Other Words

Text Segmentation Based on Semantic Word Embeddings

Steps Involved in Text Recognition and Recent Research in OCR; a Study