SPEAKER IDENTIFICATION IN LIVE EVENTS USING TWITTER

by

MINUMOL JOSEPH

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT ARLINGTON
December 2015

Copyright © by MINUMOL JOSEPH 2015
All Rights Reserved

To my family

ACKNOWLEDGEMENTS

I would like to thank my supervising professor, Dr. Chengkai Li, for his continuous motivation and support. Dr. Li's guidance helped me greatly in completing the thesis work on time, and his insight into various problems helped me set my goals. Without his guidance and support, this thesis would not have been possible.

I would like to thank my committee members, Dr. Leonidas Fegaras and Dr. Dimitrios Zikos, for their support and for being a part of my thesis supervising committee.

I would like to thank all IDIR Lab members for their help in various areas of my thesis work. I would especially like to thank Naeemul Hassan for his guidance and motivation throughout my work. Also, I am thankful to Sumesh Balan, Fatma Dogan and Jisa Sebastian for their valuable motivation.

Finally, I would like to thank my husband, Tijo Thomas, and other family members for their continuous support and unending motivation.

November 20, 2015

ABSTRACT

SPEAKER IDENTIFICATION IN LIVE EVENTS USING TWITTER

MINUMOL JOSEPH, M.S.
The University of Texas at Arlington, 2015

Supervising Professor: Dr. Chengkai Li

The prevalence of social media has given rise to a new research area. Data from social media is now being used in research to gather deeper insights into many different fields. Twitter is one of the most popular microblogging websites. Users express themselves on a variety of topics in 140 characters or less. Oftentimes, users "tweet" about issues and subjects that are gaining in popularity, a great example being politics. Any development in politics frequently results in a tweet of some form.

The research that follows focuses on identifying a speaker's name at a live event by collecting and using data from Twitter. The identification process involves collecting the transcript of the broadcast event, preprocessing the data, and then using it to collect the necessary data from Twitter. By following this process, a speaker can be successfully identified at a live event. For the experiments, the 2016 presidential candidate debates have been used. In principle, the approach can be applied to identify speakers at other types of live events.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

Chapter
1. INTRODUCTION
   1.1 Motivation
   1.2 Summary of the problem
2. BACKGROUND AND RELATED WORK
   2.1 Speaker Recognition Methods
   2.2 Twitter data processing
3. SPEAKER IDENTIFICATION OF LIVE EVENTS USING TWITTER
   3.1 Problem Definition
4. DATA COLLECTION
   4.1 Live debate transcript
   4.2 Twitter Data
       4.2.1 Twitter REST API
       4.2.2 Twitter Streaming API
   4.3 Official Transcript of the debate
5. SYSTEM OVERVIEW
   5.1 Process the closed caption data
   5.2 Identify possible candidates and non-candidates and generate phrases
   5.3 Collect tweets using Twitter's REST and Streaming API
   5.4 Process the tweets and identify the speaker
       5.4.1 Word-by-word Checking
       5.4.2 NLTK named entity recognition
       5.4.3 Sum of individual score values of all sentences
       5.4.4 Total number of times each name identified as a candidate
       5.4.5 Sum of scores of the candidate based on the sentence rank
6. APPLICATION OF SPEAKER IDENTIFICATION
7. EXPERIMENTS AND EVALUATION
   7.1 Republican presidential debate held on Sep 16 2015 on CNN
       7.1.1 Scenario I: All sentences in the debate
       7.1.2 Scenario II: All sentences with more than five words in the debate
       7.1.3 Scenario III: All sentences spoken by the debaters
       7.1.4 Scenario IV: All sentences by the debaters with more than five words
   7.2 Democratic presidential debate held on Oct 13 2015 on CNN and Republican debate held on Oct 28 2015 on CNBC
       7.2.1 Scenario I: All sentences in the debate
       7.2.2 Scenario II: All sentences with more than five words in the debate
       7.2.3 Scenario III: All sentences by the debaters in the debate
       7.2.4 Scenario IV: All sentences by the debaters with more than five words
8. CONCLUSION

Appendix
A. Twitter API Parameters

REFERENCES
BIOGRAPHICAL STATEMENT

LIST OF ILLUSTRATIONS

Figure
1.1 % of users who use Twitter and Facebook for news
1.2 % of users who use Twitter and Facebook to follow breaking news
4.1 RealTerm Serial Data Capture Program
4.2 REST API functionality [1]
4.3 Streaming API functionality [1]
5.1 System Architecture block diagram
6.1 ClaimBuster tweets using live speaker identification system
6.2 Debate visualization using ClaimBuster and Speaker Identification
7.1 Speaker identified from tweets by NLTK and word comparison
7.2 Top five candidates for paragraph using Streaming API
7.3 Top five candidates for paragraph using REST API
7.4 Top five candidate accuracy for all sentences using Streaming API
7.5 Top five candidate accuracy for all sentences using REST API
7.6 Top five candidate accuracy for all sentences using REST and Streaming API
7.7 Top five candidate accuracy for all sentences with more than five words using Streaming API
7.8 Top five candidate accuracy for all sentences with more than five words using REST API
7.9 Top five candidate accuracy for all sentences with more than five words using REST and Streaming API
7.10 Top five candidate accuracy for all sentences by the debaters using Streaming API
7.11 Top five candidate accuracy for all sentences spoken by the debaters using REST API
7.12 Top five candidate accuracy for all sentences by the debaters using REST and Streaming API
7.13 Top five candidate accuracy for all sentences by the debaters with more than five words using Streaming API
7.14 Top five candidate accuracy for all sentences spoken by the debaters with more than five words using REST API
7.15 Top five candidate accuracy for all sentences by the debaters with more than five words using REST and Streaming API
7.16 Top five candidate accuracy for all sentences using Streaming API
7.17 Top five candidate accuracy for all sentences using REST API
7.18 Top five candidate accuracy for all sentences using REST and Streaming API
7.19 Top five candidate accuracy for all sentences with more than five words using Streaming API
7.20 Top five candidate accuracy for all sentences with more than five words using REST API
7.21 Top five candidate accuracy for all sentences with more than five words using REST and Streaming API
7.22 Top five candidate accuracy for all sentences by the debaters using Streaming API
7.23 Top five candidate accuracy for all sentences by the debaters using REST API
7.24 Top five candidate accuracy for all sentences by the debaters using REST and Streaming API
7.25 Top five candidate accuracy for all sentences by the debaters with more than five words using Streaming API
7.26 Top five candidate accuracy for all sentences by the debaters with more than five words using REST API
7.27 Top five candidate accuracy for all sentences by the debaters with more than five words using REST and Streaming API

LIST OF TABLES

Table
4.1 Streaming Endpoints in Twitter [1]
5.1 Candidate/Non-candidate speakers in a sentence and scores
5.2 Phrases generated in continuous processing
5.3 Phrases generated in reprocessing
5.4 Candidates of a paragraph with twelve sentences
5.5 Top five candidates of a paragraph based on scores
5.6 Top five candidates of a paragraph based on mentions
5.7 Top five candidates of a paragraph based on rank score
7.1 Top five candidates for paragraph using Streaming API
7.2 Top five candidates for paragraph using REST API
7.3 Top five candidate accuracy for all sentences using Streaming API
7.4 Top five candidate accuracy for all sentences using REST API
7.5 Top five candidate accuracy for all sentences using REST and Streaming API
7.6 Top five candidate accuracy for all sentences with more than five words using Streaming API
7.7 Top five candidate accuracy for all sentences with more than five words using REST API
7.8 Top five candidate accuracy for all sentences with more than five words using REST and Streaming API
7.9 Top five candidate accuracy for all sentences by the debaters using Streaming API
7.10 Top five candidate accuracy for all sentences spoken by the debaters using REST API
7.11 Top five candidate accuracy for all sentences by the debaters using REST and Streaming API
7.12 Top five candidate accuracy for all sentences by the debaters with more than five words using Streaming API