
Classification of Twitter Trends using Feature Ranking and Feature Selection

A Thesis presented to the Faculty of the Graduate School at the University of Missouri

In Partial Fulfillment of the Requirements for the Degree Master of Science

by Abhishek Shah

Dr. Wenjun Zeng, Thesis Supervisor

December 2015

The undersigned, appointed by the dean of the Graduate School, have examined the thesis entitled

Classification of Twitter Trends using Feature Ranking and Feature Selection

presented by Abhishek Shah, a candidate for the degree of Master of Science, and hereby certify that, in their opinion, it is worthy of acceptance.

Professor Wenjun Zeng
Professor Toni Kazic
Professor Mike McKean

Acknowledgments

I would like to thank Dr. Wenjun Zeng for his tremendous support and guidance. He not only provided excellent guidance, but also asked tough questions and pushed for excellence. He took extra time out of his current job to provide guidance and assistance whenever necessary. Without his knowledge and guidance, this work would not have been possible.

I would also like to thank Dr. Suman Deb Roy, my mentor, who brought this problem to my attention in the first place. He was the one who showed the pathway to this research. His support and guidance have also been very useful.

I would like to acknowledge A. Zubiaga, D. Spina, V. Fresno and R. Martinez, who provided the dataset and whose results were used as a benchmark for our research.

Last but not least, I would like to thank my parents, my brother and Nikita Bhatia for providing unconditional support and affection throughout my academic life.

Table of Contents

List of Figures
List of Tables
Abstract
Chapter 1: Introduction
    1.1 Background
    1.2 Related Work
Chapter 2: Classification System and Dataset
    2.1 System Overview
    2.2 Data Collection
    2.3 Trending Topic Categories
    2.4 Data Pre Processing
    2.5 Data organization
    2.6 Data Cleanup
Chapter 3: Feature Selection
    3.1 Feature Ranking
    3.2 Forward Selection
Chapter 4: Classification
    4.1 N-different classifiers
    4.2 Training and Testing Dataset
    4.3 Naïve Bayes Classifier
    4.4 Bayesian Network
Chapter 5: Results
    5.1 Bag-of-Words Ranking Analysis
    5.2 TF-IDF Ranking Analysis
    5.3 Bag-of-Words vs TF-IDF
    5.4 Class Precision Analysis
Chapter 6: Discussion and Future Work
    6.1 Discussion
    6.2 Recognition
    6.3 Future Work
Chapter 7: Conclusion
References

List of Figures

Figure 1: An overview of the end-to-end classification system
Figure 2: Bayesian Network for trending topic classification
Figure 3: Feature Selection (Bag-of-words) vs. No Feature Selection
Figure 4: Feature Selection (TF-IDF) vs. No Feature Selection
Figure 5: Meme class precision
Figure 6: News class precision
Figure 7: Ongoing-event class precision
Figure 8: Commemorative class precision

List of Tables

Table 1: Trending Topic Examples with a Sample Tweet
Table 2: Example and description of each category of trending topic
Table 3: Top 10 features using bag-of-words approach for each category
Table 4: Top 10 features using TF-IDF for each category
Table 5: Class Precision (%) for features selected using Frequency Count
Table 6: Class precision comparison. Feature Selection (bag-of-words) vs No Feature Selection
Table 7: Class precision (%) for features selected using TF-IDF ranking
Table 8: Feature Selection (TF-IDF) vs No Feature Selection

Abstract

Twitter handles 500 million tweets per day and has 316 million monthly active users. The majority of tweets are written in natural language, which makes Twitter's data difficult to interpret programmatically. In our research, we attempt to solve this challenge using various machine learning techniques. This thesis presents a new approach for classifying Twitter trends by adding a layer of feature selection and feature ranking. A variety of feature ranking algorithms, such as TF-IDF and bag-of-words, are used to facilitate the feature selection process. This helps surface the important features while reducing the feature space and making the classification process more efficient. Four Naïve Bayes text classifiers (one for each class), backed by these feature ranking and feature selection techniques, are used to categorize Twitter trends. Using the bag-of-words and TF-IDF rankings, our research provides an average class precision improvement over the current methodologies of 33.14% and 28.67%, respectively.

Chapter 1: Introduction

In recent years, with the sudden increase in popularity of various social networks, the way we produce and consume information has changed dramatically. A massive amount of information flows through these social networks. This has forced news organizations, journalists, marketing companies, business organizations, musicians, actors, bloggers, programmers and almost all businesses and communities to change their approach to branding, marketing and networking.