
Claremont Colleges Scholarship @ Claremont

CMC Senior Theses CMC Student Scholarship

2021

Using Twitter API to Solve the GOAT Debate: Michael Jordan vs. LeBron James

Jordan Trey Leonard


Part of the Applied Mathematics Commons, Data Science Commons, and the Statistics and Probability Commons

Recommended Citation Leonard, Jordan Trey, "Using Twitter API to Solve the GOAT Debate: Michael Jordan vs. LeBron James" (2021). CMC Senior Theses. 2733. https://scholarship.claremont.edu/cmc_theses/2733

This Open Access Senior Thesis is brought to you by Scholarship@Claremont. It has been accepted for inclusion in this collection by an authorized administrator.

Claremont McKenna College

Using Twitter API to Solve the GOAT Debate: Michael Jordan vs. LeBron James

submitted to Professor Mark Huber

by Jordan Leonard

for Senior Thesis in Mathematics

May 3, 2021

Introduction

What is Sentiment Analysis?

Sentiment analysis (Feldman 2013) is a data mining (Hand and Adams 2014) technique that uses natural language processing (Chowdhury 2003), text analysis (Bernard and Ryan 1998), computational linguistics (Grishman 1986), and biometrics (Jain, Flynn, and Ross 2007) to identify, extract, quantify, and study subjective information. It is commonly used to gauge public opinion by breaking down text to determine whether it carries positive or negative sentiment. Many studies gather their text data from social media platforms because of the large number of users and the volume of available content. In this thesis, I use sentiment analysis (Feldman 2013) to analyze tweet data collected from Twitter in RStudio (Allaire 2012).

Problem Description

In this paper, I gather and analyze tweets from real Twitter users to compare the social sentiment toward professional athletes in the National Football League (NFL), National Basketball Association (NBA), and Major League Baseball (MLB), as well as athletes who play National Collegiate Athletic Association (NCAA) Division 1 basketball. My analysis of the social sentiment toward athletes on Twitter began with my interest in settling the dispute over which athletes deserve the label of GOAT, or “Greatest of All Time.” Every professional athlete is extremely talented and reached the professional level for a reason, but the label of GOAT is reserved for the best of the best. In the NBA, the discussion tends to come down to comparing LeBron James and Michael Jordan. I therefore decided to employ sentiment analysis (Feldman 2013) and a Twitter API (Makice 2009) in an attempt to find some sort of resolution. I also considered this analysis timely because Michael Jordan released a documentary of his NBA career called The Last Dance (“Everything You Need to Know about ’The Last Dance”’ 2020) in April of 2020, and LeBron James won an NBA Championship with the Lakers in October of 2020. With these two major events occurring in the same year, I hoped there would be enough material to compare both athletes on a roughly even scale.

Data Gathering Process

To access tweets publicly and for free, one must go through an application process that grants confidential keys to the Twitter API (Makice 2009) for a specific project. Once these keys are granted, they can be used as search tokens within the rtweet package in RStudio (Allaire 2012) to run the API that enables tweet collection. Using the search_tweets() function, I was able to input a keyword to search for across the Twitter database and get an output of tweets that include that keyword. However, the Twitter API (Makice 2009) access that I was granted is limited: I can only search for 18,000 tweets in a given 15-minute period, and the tweets searched for a given keyword must have been posted in the past 6-9 days. Hence, the API gave me a limited time frame to work with, which was unfortunate as I had hoped to access tweets from the past couple of years. A larger time frame would have allowed me to see how the social sentiment around athletes fluctuated with their athletic performances and achievements within their respective seasons. Even though the tweet data only covers the 6-9 days before the search_tweets() call, I was still able to gather meaningful results for the athletes in my study because the NBA, MLB, and NCAA basketball seasons were still ongoing. Although the NFL season was not in progress like the other sports, there was still valuable tweet data to be collected and analyzed.

In the data gathering code, the search_tweets() call is simple and readable: it uses a search query argument, q = [Name] GOAT OR [Name] Goat, for every athlete. It also includes the arguments n, include_rts = FALSE, -filter = "replies", and lang = "en". These arguments specify the desired number of tweets to be returned while filtering out retweets and replies, and collect only tweets that are in English. These arguments help keep my data concise and focused on the portions that are necessary for further analysis.
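The query pattern described above can be sketched in base R. This is only an illustration: build_goat_query() is a hypothetical helper name that does not appear in the thesis code, and the commented search_tweets() call (including n = 18000) is an assumption about how the query might be passed along, not a record of the call actually used.

```r
# Hypothetical helper (not from the thesis): builds the
# "[Name] GOAT OR [Name] Goat" query string for one athlete.
build_goat_query <- function(name) {
  paste0(name, " GOAT OR ", name, " Goat")
}

athletes <- c("Michael Jordan", "LeBron James")
# vapply() names the result after each athlete, so queries can be looked up by name.
queries <- vapply(athletes, build_goat_query, character(1))
print(unname(queries))

# Assumed usage (not run here; requires rtweet and API keys):
# rtweet::search_tweets(q = queries[["Michael Jordan"]], n = 18000,
#                       include_rts = FALSE, lang = "en")
```

Generating the query from the athlete's name keeps the search terms consistent across players, so differences in the collected tweets are less likely to come from differently worded queries.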

Analysis

NBA

Exploring Tweets

Read in Data

Using the search_tweets() function from the rtweet package, we are able to collect tweet data regarding Michael Jordan. Because the focus is on tweets that contain the term GOAT, the search query q = “Michael Jordan GOAT OR Michael Jordan Goat” is used to narrow the scope of the search. The results of the search are stored in the variable mj_goat_tw, which contains 279 observations of 91 variables. In other words, there are 279 tweets, each described by 91 column variables such as “user_id,” “created_at,” “text,” etc. To ensure that the analysis is done on the same set of data instead of repeatedly collecting new tweets, the write_as_csv() function stores the contents of mj_goat_tw in a CSV file. This not only saves the data in a safe, readable format but also makes it possible to read the data back in at the start of each session using the read.csv() function.

### Michael Jordan
mj_goat_tw <- read.csv("mj_goat_tw.csv", fill = TRUE)
mj_goat_tw2 <- read.csv("mj_goat_tw.csv", fill = TRUE)
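The save-once, reload-later workflow can be illustrated with a minimal base R sketch. It uses write.csv() as a stand-in for rtweet's write_as_csv(), and the data frame below is made up for illustration; it is not the thesis data.

```r
# Toy stand-in for the tweet data frame; columns and values are made up.
tweets <- data.frame(
  screen_name = c("user_a", "user_b"),
  text = c("Michael Jordan is the GOAT", "MJ GOAT, no debate"),
  favorite_count = c(3L, 0L),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
# Save once, right after collection.
write.csv(tweets, csv_path, row.names = FALSE)
# Reload the same snapshot in later sessions instead of searching again.
restored <- read.csv(csv_path, stringsAsFactors = FALSE)

print(dim(restored))
```

Because the search window only reaches back 6-9 days, a saved snapshot is the only way to rerun the analysis on identical tweets later.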

Now that the tweet data for Michael Jordan is collected, it is time to dive into the data by performing exploratory data analysis (EDA). This helps us fully understand the collected data before performing further analyses. In the following code, we pipe the data into sample_n() and select() to view a sample of size 3 for the selected column variables of “created_at,” “screen_name,” “text,” “favorite_count,” “retweet_count.” From the output, it can be seen that the “text” column contains the terms “Michael Jordan” and GOAT. This is important because it confirms that the search_tweets() function from the data gathering process is returning relevant results.

library(dplyr)  # provides %>%, sample_n(), and select()

mj_goat_tw %>%
  sample_n(3) %>%
  select(created_at, screen_name, favorite_count)

##            created_at    screen_name favorite_count
## 1 2021-03-30 02:16:03 not_andrew____              1
## 2 2021-03-31 03:31:48      ulforicks              1
## 3 2021-04-02 22:48:08  Jimmyrealdeal              0

This process can be replicated for the other NBA players in our sample to confirm that the tweets we analyze do in fact reference each specified player. The samples of tweets for LeBron James, James Harden, Kevin Durant, and Kobe Bryant were produced with the same code applied to their respective datasets of tweets. Since the data outputs contain the “created_at” column variable, which records the date and time that each tweet was published, it is worth looking at the tweet frequency of users who are invested in the NBA “GOAT” conversation.

Timeline of Tweets - Frequency Plot

The ts_plot() function allows us to investigate the frequency of tweets posted between “2021-03-26 06:25:39” and “2021-04-03 03:12:00.” In the code below, we specify the time interval to aggregate over, which is where the "hours" and "days" arguments come into effect. Both plots display a spike in tweet frequency on “2021-03-28,” totaling more than 50 tweets for that day.

### Michael Jordan
ts_plot(mj_goat_tw, "hours") +
  labs(x = NULL, y = NULL,
       title = "Frequency of tweets with Michael Jordan GOAT Keyword",
       caption = "Data collected from Twitter's API via rtweet") +
  theme_minimal()

[Figure: Frequency of tweets with Michael Jordan GOAT Keyword, hourly interval, Mar 27 to Apr 02. Data collected from Twitter's API via rtweet.]

ts_plot(mj_goat_tw2, "days") +
  labs(x = NULL, y = NULL,
       title = "Frequency of tweets with Michael Jordan GOAT Keyword",
       caption = "Data collected from Twitter's API via rtweet") +
  theme_minimal()

[Figure: Frequency of tweets with Michael Jordan GOAT Keyword, daily interval, Mar 27 to Apr 02. Data collected from Twitter's API via rtweet.]

Similarly, we are able to investigate LeBron James’ tweet frequency for tweets posted between “2021-03-26 23:49:59” and “2021-04-04 02:17:50.” The following frequency plot has spikes on “2021-03-28” and “2021-03-31,” with totals of about 70 and 55 tweets for those days, respectively.

[Figure: Frequency of tweets with LeBron James GOAT Keyword, Mar 28 to Apr 03. Data collected from Twitter's API via rtweet.]

Next, we can take a look at James Harden’s tweet frequency between “2021-03-26 23:49:59” and “2021-04-04 02:17:50.” The following frequency plot has a spike on “2021-03-31,” totaling about 25 tweets for that day.

[Figure: Frequency of tweets with James Harden GOAT Keyword, Mar 28 to Apr 03. Data collected from Twitter's API via rtweet.]

Then, we can take a look at Kevin Durant’s tweet frequency between “2021-03-27 21:57:47” and “2021-04-04 02:26:39.” The following frequency plot has spikes on “2021-03-28” and “2021-03-30,” with totals of about 20 and 25 tweets for those days, respectively.

[Figure: Frequency of tweets with Kevin Durant GOAT Keyword, Mar 29 to Apr 04. Data collected from Twitter's API via rtweet.]

Finally, Kobe Bryant’s tweet frequency spans “2021-04-05 06:16:15” to “2021-04-13 03:00:21.” The following frequency plot has spikes on “2021-04-06,” “2021-04-09,” and “2021-04-12,” with totals of about 6, 10, and 12 tweets for those days, respectively.

[Figure: Frequency of tweets with Kobe Bryant GOAT Keyword, Apr 06 to Apr 12. Data collected from Twitter's API via rtweet.]

When comparing the time intervals of hours versus days, the days interval provides an interesting visual of the daily frequency, but the hours interval provides better insight since we are dealing with a time frame of only 6-9 days. If we were dealing with a dataset of tweets spanning a month or more, the days interval would be the more effective choice. The frequency plots above offer valuable insight into the NBA GOAT conversation, as they show that these athletes are consistently being “talked” about throughout their dataset time frames. Since each athlete had at least one spike in tweet frequency, it may be beneficial to understand the sentiment polarity during those periods. This would allow us to determine whether those spikes were positive or negative, and how they compare to the other days in the dataset.
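To make the difference between the hourly and daily intervals concrete, the following base R sketch bins a handful of timestamps both ways. The timestamps are made up, and the binning with format() and table() is only an assumed approximation of what ts_plot() aggregates internally.

```r
# Made-up timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as created_at.
created_at <- c("2021-03-28 01:15:00", "2021-03-28 01:45:10",
                "2021-03-28 17:05:30", "2021-03-31 09:00:00")
stamps <- as.POSIXct(created_at, tz = "UTC")

# Daily bins: one count per calendar date.
per_day <- table(format(stamps, "%Y-%m-%d"))
# Hourly bins: one count per date and hour.
per_hour <- table(format(stamps, "%Y-%m-%d %H:00"))

print(per_day)
print(per_hour)
```

The daily table collapses the three tweets from March 28 into a single count, while the hourly table shows that two of them arrived within the same hour, which is the extra detail the hours interval provides over a short window.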

Top Tweeting Location

Another variable that could play a role in the public sentiment toward an athlete is the location where a Twitter user lives. When it comes to sports, fans tend to develop competitive attitudes, which may lead to resentment toward opponents. Some fans follow the teams from their hometown or state while others do not, but either way location is a factor. In the following code chunk, we filter the mj_goat_tw variable to remove any NA values in the location column, and count the tweets from each location. NA values are excluded because they provide no useful information, and removing them gives a cleaner summary. From the output, we can see that there were 104 tweets from a blank location, 16 tweets from Chicago, IL, and 4 tweets each from the United States and Washington, D.C. In this case, the 104 tweets from a blank location were not represented as NA values, so they were not removed: these users had not enabled the location feature when they published the tweet, and the missing location was stored as a blank cell when the data was read from the CSV file with the fill = TRUE argument. To correct this, we can run the following assignment to replace those blank cells with NA values so that the filter properly selects the non-NA values. Comparing the two outputs shows that the blank cells are removed.

mj_goat_tw2[mj_goat_tw2 == ""] <- NA

mj_goat_tw %>%
  filter(!is.na(location)) %>%
  count(location, sort = TRUE) %>%
  top_n(5) %>%
  kable()

location            n
(blank)           104
Chicago, IL        16
United States       4
Washington, DC      4
Boston, MA          3

mj_goat_tw2 %>%
  filter(!is.na(location)) %>%
  count(location, sort = TRUE) %>%
  top_n(5) %>%
  kable()

location              n
Chicago, IL          16
United States         4
Washington, DC        4
Boston, MA            3
Charlotte, NC         2
Dallas, TX            2
Detroit, MI           2
Downtown              2
Lagos, Nigeria        2
Los Angeles, CA       2
Miami, FL             2
San Francisco, CA     2
Somewhere             2
Your head rent free   2
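The blank-versus-NA behavior can be reproduced on a small made-up CSV. The sketch below also shows an alternative that is not used in the thesis: passing na.strings = c("", "NA") to read.csv() so that blank cells are treated as missing at read time.

```r
# Made-up CSV with one blank location, mimicking a user without location enabled.
csv_text <- 'screen_name,location
user_a,"Chicago, IL"
user_b,
user_c,"Chicago, IL"
user_d,"Boston, MA"'

# Default read: the blank location comes back as "" and survives !is.na().
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
# Alternative (assumption, not the thesis code): treat "" as missing when reading.
fixed <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("", "NA"))

print(sum(!is.na(raw$location)))    # blank row is still counted
print(sum(!is.na(fixed$location)))  # blank row is excluded
print(sort(table(fixed$location), decreasing = TRUE))
```

Either approach gives the same result here; replacing blanks after reading, as the thesis does, has the advantage of leaving the original read step untouched.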

From the following bar chart, we observe that our dataset includes Twitter users from all over the United States and even reaches users as far away as Lagos, Nigeria. It makes sense that Chicago is the top location, since Michael Jordan played for the Chicago Bulls for 14 years, which was essentially his entire career.

# Omits NA Locations
mj_goat_tw2 %>%
  count(location, sort = TRUE) %>%
  mutate(location = reorder(location, n)) %>%
  na.omit() %>%
  top_n(12) %>%
  ggplot(aes(x = location, y = n)) +
  geom_col(fill = "red", color = "black") +
  coord_flip() +
  # labs() follows the aesthetics, not the flipped axes: x is location, y is the count
  labs(x = "Location", y = "Count",
       title = "Top Locations of Michael Jordan GOAT Tweets")

[Figure: Top Locations of Michael Jordan GOAT Tweets. Locations shown: Chicago, IL; Washington, DC; United States; Boston, MA; Your head rent free; Somewhere; San Francisco, CA; Miami, FL; Los Angeles, CA; Lagos, Nigeria; Downtown; Detroit, MI; Dallas, TX; Charlotte, NC.]

Similar to the bar chart for Michael Jordan, every other NBA athlete has top locations that span the U.S. and even extend to other countries and continents. However, some locations listed by users are not real locations but occurred often enough to make the list. Regardless, it is worthwhile to explore all aspects of our data even if that leads to locations such as “Your head rent free.”

[Figure: Top Locations of LeBron James GOAT Tweets. Locations shown: United States; Washington, DC; Iowa, USA; Chicago, IL; San Antonio, TX; Michigan, USA; Lagos, Nigeria; Florida, USA; Boise, ID; 2011 and 2007 finals.]

[Figure: Top Locations of James Harden GOAT Tweets. Locations shown: Your head rent free; NEVADA; Texas, USA; nfl; Houston, TX.]

[Figure: Top Locations of Kevin Durant GOAT Tweets. Locations shown: Los Angeles, CA; Cleveland, OH; Boise, ID; 16 he/him.]

[Figure: Top Locations of Kobe Bryant GOAT Tweets. Locations shown: Buena Park, CA; Agoura Hills, CA.]

Similar to other social media applications, Twitter allows users to like/favorite tweets. If a tweet has a considerable number of likes, it is reasonable to assume that others share the same opinion and agree with what is being communicated. The output below shows the top 3 tweets with the most likes/favorites for Michael Jordan. Each of these tweets relays a positive attitude toward Michael Jordan being considered the GOAT; however, this may not always be the case. Later, we will investigate the overall attitude toward Michael Jordan and the other NBA players to see just how positive or negative it is.

##            created_at     screen_name favorite_count
## 1 2021-03-28 22:22:59 AllThingsSnyder            387
## 2 2021-03-26 21:50:08     BurnerKhris            356
## 3 2021-04-01 15:17:43      undisputed            252
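The code that produced the table above is not shown in this section. A minimal base R sketch of one way to rank tweets by likes is given below; the data frame and its values are made up, and this is not necessarily how the thesis computed the ranking.

```r
# Made-up tweets with like counts; not the thesis data.
tweets <- data.frame(
  screen_name = c("user_a", "user_b", "user_c", "user_d"),
  favorite_count = c(40L, 7L, 0L, 15L),
  stringsAsFactors = FALSE
)

# Sort by likes in descending order and keep the first three rows.
top3 <- head(tweets[order(tweets$favorite_count, decreasing = TRUE), ], 3)
print(top3)
```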

Word Cloud Analysis

Another text analysis we can perform involves creating a word cloud (Heimerl et al. 2014), which allows us to visualize common words within tweets. What makes this visualization method distinctive is that the size of each word is determined by its frequency, which signals its importance or relevance to the overall Twitter dataset. To build the dataset of top tweeted words, each tweet must be cleaned by removing unnecessary characters and symbols while detecting the strings that count as individual words. In this case we filter with regular expressions (Li et al. 2008). Stop words, which are commonly used words considered unimportant to the text analysis, must also be filtered out to shift the focus onto the more meaningful words. Once the frequency of each word is computed, we can apply the wordcloud() function from the wordcloud package.

## Top Words
### Michael Jordan
data("stop_words")

words_mj_goat <- mj_goat_tw %>%
  mutate(text = str_remove_all(text, "&|<|>"),
         # remove URLs
         text = str_remove_all(text, "\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)"),
         # remove non-ASCII characters
         text = str_remove_all(text, "[^\x01-\x7F]")) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"),
         !str_detect(word, "^#"),
         !str_detect(word, "@\\S+")) %>%
  count(word, sort = TRUE)
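The core of this step, splitting tweets into words, dropping stop words, and counting what remains, can be sketched in base R. The tweets and the three-word stop list below are made up for illustration, and strsplit() is a much cruder tokenizer than unnest_tokens().

```r
# Made-up tweets and a tiny hypothetical stop-word list.
tweets <- c("Michael Jordan is the GOAT!", "LeBron is not the GOAT, Jordan is")
stop_list <- c("is", "the", "not")

# Lowercase, then split on anything that is not a letter.
words <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
# Drop empty strings and stop words.
words <- words[nzchar(words) & !words %in% stop_list]

word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```

The resulting counts are what a word cloud maps to font size, so any token that slips through cleaning (for example a URL fragment) would show up as a visible word.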

Given that word size reflects frequency, we can confidently say that “Michael,” “Jordan,” and GOAT are the top words within the dataset. These top words occur 287, 287, and 267 times, respectively, with the next top word being “LeBron” with a frequency of 73. In some sense this outcome was expected, especially since the NBA GOAT debate usually compares Michael Jordan and LeBron James. Looking at LeBron’s dataset of top words, Michael Jordan’s name is the seventh-most frequent word with 57 occurrences.

Note: The following figures may contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.

words_mj_goat %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = "red"))

[Figure: Word cloud of the 100 most frequent words in the Michael Jordan GOAT tweets.]

words_lebron_goat %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = "purple"))

[Figure: Word cloud of the 100 most frequent words in the LeBron James GOAT tweets.]

words_harden_goat %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = "black"))

[Figure: Word cloud of the 100 most frequent words in the James Harden GOAT tweets.]

words_kd_goat %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = "gray1"))

[Figure: Word cloud of the most frequent words in the Kevin Durant GOAT tweets.]

words_kobe_goat %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = "yellow2"))

[Figure: Word cloud of the most frequent words in the Kobe Bryant GOAT tweets.]

It is interesting to note that most of the word clouds contain words referencing other elite athletes that could be considered the GOAT in their respective sports.

Word Networks

After performing some initial exploratory data analysis on the datasets of our NBA athletes, we can dive deeper into text analyses such as bigram and trigram (Martin, Liermann, and Ney 1998) analysis. We have used the unnest_tokens() function to tokenize by word, but we can also use it to tokenize by consecutive sequences of words, called n-grams (Cavnar, Trenkle, and others 1994). By determining how often word X is followed by word Y, we can model the relationship between them. This is done by adding the token = "ngrams" argument to unnest_tokens() and setting the n argument to the number of words we wish to capture in each n-gram.
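As a small illustration of what an n-gram tokenizer returns, the base R sketch below slides a window of n words across a sentence. make_ngrams() is a hypothetical helper written for this example; it is not part of tidytext and does not replicate all of unnest_tokens()'s cleaning.

```r
# Hypothetical helper: returns every run of n consecutive words in a string.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("michael jordan is the goat", 2)
trigrams <- make_ngrams("michael jordan is the goat", 3)
print(bigrams)
print(trigrams)
```

A five-word sentence yields four overlapping bigrams and three overlapping trigrams, which is why the same word can appear in several rows of the n-gram tables.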

## Michael Jordan

### Bigram Analysis
mj_goat_tw_paired_words <- mj_goat_tw %>%
  select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)

head(mj_goat_tw_paired_words %>%
       count(paired_words, sort = TRUE), 10) %>%
  kable()

paired_words       n
michael jordan   256
the goat         155
is the            92
jordan is         73
lebron james      29
of all            27
â â               24
u 0001f410        23
goat michael      22
goat u            22

# split each bigram on the space between its two words
mj_goat_tw_sep_words <- mj_goat_tw_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

# drop bigrams in which either word is a stop word
mj_goat_tw_filtered_01 <- mj_goat_tw_sep_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
mj_goat_tw_bigram_counts <- mj_goat_tw_filtered_01 %>%
  count(word1, word2, sort = TRUE)

head(mj_goat_tw_bigram_counts) %>%
  kable()

word1        word2      n
michael      jordan   256
lebron       james     29
â            â         24
goat         michael   22
undisputed   goat      17
tom          brady     16

To create Michael Jordan’s bigram word network (Zuo, Zhao, and Xu 2016), we must set n = 2. In the figure below, we can visualize the relationships between two words whose pairing forms a bigram. Each node represents a word within the filtered dataset, and the connection between two words is represented by an arrow that begins at word X and points to word Y. The frequency of each bigram is shown by the size/boldness of the arrow, as in the arrow connecting “Michael” and “Jordan” compared to the arrow connecting “Steph” and “Curry.” It is fascinating to see that there are multiple bigram chains, with the largest located in the bottom left of the figure.

Note: The following figures contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.

# Arrowhead for the edges. The excerpt does not define `a`; a closed arrow is assumed here.
a <- grid::arrow(type = "closed", length = grid::unit(0.15, "inches"))

mj_goat_tw_bigram_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), arrow = a) +
  geom_node_point(color = "red", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = "Bigram Word Network",
       subtitle = "Tweets using Michael Jordan GOAT Keyword",
       x = "", y = "")

[Figure: Bigram Word Network, tweets using Michael Jordan GOAT keyword.]

As before, Michael Jordan’s trigram word network can be found by adjusting the unnest_tokens() function such that n = 3. The resulting figure visualizes the relationships between three words whose sequence forms a trigram. Each node represents a word within the filtered dataset, and the connections are represented by arrows that begin at word X, point to word Y, and then point to word Z. The frequency of each trigram is again shown by the size/boldness of the arrow. Unlike the bigram figure, there are fewer trigram chains, and half of them are extremely bold. In the tables of paired words for the bigram and trigram analysis, “â â” and “â â â” appear because special characters were included in the “stripped_text” column during the tweet gathering process.

### Trigram Analysis
mj_goat_tw_tri_paired_words <- mj_goat_tw %>%
  select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 3)

head(mj_goat_tw_tri_paired_words %>%
       count(paired_words, sort = TRUE), 10) %>%
  kable()

paired_words            n
michael jordan is      70
is the goat            54
jordan is the          53
â â â                  20
goat michael jordan    20
the undisputed goat    16
is the undisputed      15
goat of all            14
lebron is the          14
of all sports          14

# split each trigram on the spaces between its three words
mj_goat_tw_sep_words_3 <- mj_goat_tw_tri_paired_words %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

mj_goat_tw_filtered_02 <- mj_goat_tw_sep_words_3 %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word)

# new trigram counts:
mj_goat_tw_trigram_counts <- mj_goat_tw_filtered_02 %>%
  count(word1, word2, word3, sort = TRUE)

head(mj_goat_tw_trigram_counts) %>%
  kable()

word1        word2      word3     n
â            â          â        20
goat         michael    jordan   20
baseball     babe       ruth     13
basketball   michael    jordan   13
boxing       muhammed   ali      13
football     tom        brady    13

[Figure: Trigram Word Network, tweets using Michael Jordan GOAT keyword.]

Replicating the bigram and trigram analysis for LeBron James, James Harden, Kevin Durant, and Kobe Bryant produces the following word network figures.

[Figure: Bigram Word Network and Trigram Word Network, tweets using LeBron James GOAT keyword.]

[Figure: Bigram Word Network and Trigram Word Network, tweets using James Harden GOAT keyword.]

[Figure: Bigram Word Network and Trigram Word Network, tweets using Kevin Durant GOAT keyword.]

[Figure: Bigram Word Network and Trigram Word Network, tweets using Kobe Bryant GOAT keyword.]

The varying shapes of the word networks and the bigram/trigram word relationships among the athletes are intriguing to interpret. It is expected that the boldest arrow, or one of the boldest, connects each player’s first and last name, but the fact that other professional athletes’ names are also present emphasizes how intertwined the GOAT debate is across sports. As long as the search for the GOAT goes on, athletes will continue to be compared and grouped together with those from other sports as Twitter users, fans, and sports media voice their opinions.

Sentiment Analysis

To truly understand the connotation behind the GOAT tweets in the datasets involving the NBA athletes, we can perform a text analysis known as sentiment analysis (Feldman 2013). The get_sentiments() function offered by the tidytext package enables us to retrieve data frames containing words and their corresponding sentiment within a given lexicon (Ding, Liu, and Yu 2008). The lexicons available through get_sentiments() are “bing,” “afinn,” “loughran,” and “nrc.” What differentiates the lexicons is their word list, the size of each word list, and how the sentiment is evaluated. For example, the bing lexicon categorizes sentiment as either “positive” or “negative”; the afinn lexicon scores sentiment as a numeric value ranging from −5 to 5; the loughran lexicon categorizes sentiment as “negative,” “positive,” “litigious,” “uncertainty,” “constraining,” or “superfluous”; the nrc lexicon assigns each word sentiment values drawn from the 8 emotions of Plutchik’s Wheel of Emotions (Tromp and Pechenizkiy 2014), along with the “positive” and “negative” labels that appear in the sample output. To see this in action, we can randomly sample 5 rows from each lexicon data frame using the following code. This gives us a glimpse into the lexicons’ word variety and the sentiment values associated with each lexicon.

set.seed(12345)
sample_n(get_sentiments("bing"), 5) %>%
  kable()

word           sentiment
undisputably   positive
accursed       negative
bump           negative
buoyant        positive
senseless      negative

sample_n(get_sentiments("afinn"), 5) %>%
  kable()

word           value
empathetic         2
delighting         3
trauma            -3
protected          1
affectionate       3

sample_n(get_sentiments("loughran"), 5) %>%
  kable()

word        sentiment
frivolous   negative
drag        negative
quitting    negative
mediators   litigious
injures     negative

sample_n(get_sentiments("nrc"), 5) %>%
  kable()

word        sentiment
peaceful    trust
unhealthy   negative
cultivate   anticipation
crowning    positive
alien       fear

When we are ready to perform sentiment analysis (Feldman 2013) on our dataset of tweets, we are only interested in the literal text of each tweet, so that the analysis runs smoothly and does not encounter unnecessary errors. In the following code, a new column, stripped_text, is created within the original dataset to hold each tweet’s text with URLs removed (everything from the first “http” onward is deleted). Once the text column has been created, we can build a cleaned dataset that breaks each tweet in the stripped_text column into individual words, one word per row. Finally, we remove any stop words from the dataset and arrive at the final product, mj_goat_tw_clean_02.

    #### Data Cleaning
    ##### Michael Jordan

    # Strip URLs: drop everything from the first "http" to the end of the tweet
    mj_goat_tw$stripped_text <- gsub("http.*", "", mj_goat_tw$text)
    mj_goat_tw$stripped_text <- gsub("https.*", "", mj_goat_tw$stripped_text)

    # Tokenize: one row per word
    mj_goat_tw_clean_01 <- mj_goat_tw %>%
      select(stripped_text) %>%
      unnest_tokens(word, stripped_text)

    # Remove stop words
    mj_goat_tw_clean_02 <- mj_goat_tw_clean_01 %>%
      anti_join(stop_words)
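The URL-stripping step can be tried on a couple of made-up tweets. The sketch below uses base R only; the tweets and the helper squish() are hypothetical, and the second pattern, "https?://\\S+", is an alternative shown for comparison rather than the pattern used in this paper. It illustrates that "http.*" also deletes any words that follow a URL, whereas the narrower pattern removes only the URL itself.

```r
# Toy tweets (hypothetical), standing in for mj_goat_tw$text
tweets <- c(
  "MJ is the GOAT https://t.co/abc123 no debate",
  "LeBron or Jordan? http://example.com/poll"
)

# Pattern used in the text: deletes the URL and everything after it
greedy <- gsub("http.*", "", tweets)

# Narrower alternative (an assumption, not the paper's code):
# deletes only the URL token itself
targeted <- gsub("https?://\\S+", "", tweets)

# Hypothetical helper: collapse repeated whitespace and trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))
greedy <- squish(greedy)
targeted <- squish(targeted)

print(greedy)
print(targeted)
```

Whether the trailing words matter depends on how often users place a link in the middle of a tweet rather than at the end.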

In order to see the sentiment frequencies among the four sentiment lexicons, we can create bar charts that group words by each lexicon’s sentiment values and display the most frequent words within each value. Using the cleaned dataset, mj_goat_tw_clean_02, we can perform an inner join with the bing lexicon word list, which keeps only the words that occur in both sets and attaches the bing sentiment value to each. We can then count the number of times each word occurs. In the following figure, we can see the most frequent words grouped into the “negative” and “positive” sentiment values that the bing lexicon uses.

Note: The following figures may contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.
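The join-and-count step is not printed in this section, so the following is only a minimal sketch of the idea in base R. The token list and the four-word lexicon are hypothetical; in the tidyverse pipeline used elsewhere in this paper, the same step would plausibly be an inner_join() with get_sentiments("bing") followed by count(word, sentiment, sort = TRUE).

```r
# Hypothetical cleaned tokens, one word per row (stands in for mj_goat_tw_clean_02)
tokens <- data.frame(
  word = c("love", "goat", "win", "lost", "love", "doubt", "jordan", "win", "love"),
  stringsAsFactors = FALSE
)

# Hypothetical four-word excerpt of a bing-style lexicon
lexicon <- data.frame(
  word = c("love", "win", "lost", "doubt"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: keep only tokens that also appear in the lexicon
matched <- merge(tokens, lexicon, by = "word")

# Count how often each (word, sentiment) pair occurs, most frequent first
matched$n <- 1L
counts <- aggregate(n ~ word + sentiment, data = matched, FUN = sum)
counts <- counts[order(-counts$n, counts$word), ]
rownames(counts) <- NULL

print(counts)
```

Words such as "goat" and "jordan" drop out because they carry no entry in the lexicon, which is why the bar charts show only sentiment-bearing words.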

[Figure: Sentiment of Michael Jordan Goat Tweets using Bing Lexicon. Panels: negative, positive. x-axis: Word Frequency.]

If we slightly change the above code so that the inner join is performed on the afinn, loughran, and nrc lexicons, more words are represented in the figures, since those lexicons offer a greater variety of sentiment values.

[Figure: Sentiment of Michael Jordan Goat Tweets using AFINN Lexicon. Panels: −5, −4, −3, −2, −1, 1, 2, 3, 4. x-axis: Word Frequency.]

[Figure: Sentiment of Michael Jordan Goat Tweets using Loughran Lexicon. Panels: litigious, negative, positive, uncertainty. x-axis: Word Frequency.]

[Figure: Sentiment of Michael Jordan Goat Tweets using NRC Lexicon. Panels: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust. x-axis: Word Frequency.]

Now that we have seen each lexicon’s word frequency variation for Michael Jordan, the same visualization process can be repeated for the other players. Next, we determine the sentiment polarity values for each player using the bing lexicon. By creating a function called sentiment_bing_score(), we can pass in the “text” values from Michael Jordan’s tweet dataset and receive a list of bing sentiment polarities that we can then transform into a readable tibble. One important facet of the function is that it assigns a score of −1 to words with “negative” sentiment, 1 to words with “positive” sentiment, and 0 in the case that a tweet has no words left in the “text” column after being cleaned and filtered. The new tibble can be displayed in a histogram to understand the statistical distribution of the bing sentiment polarities. Repeating this procedure for LeBron James, James Harden, Kevin Durant, and Kobe Bryant lets us compare the sentiment polarity distributions visually.

    ggplot(mj_goat_tw_sent_score_bing2, aes(x = Score)) +
      geom_histogram(bins = 15, alpha = 0.9, fill = "red", color = "black") +
      xlab("Sentiment Polarity: Michael Jordan") +
      ylab("Count") +
      theme_minimal()
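The body of sentiment_bing_score() is not reproduced in this section. The sketch below is therefore an assumption about its behavior, written in base R with a hypothetical six-word lexicon and a hypothetical stop-word list in place of the bing lexicon and stop_words. It assumes the tweet-level score is the sum of the +1 and −1 word scores, which is consistent with the tweet scores reported later spanning values such as −4 to 4. The "score" and "type" element names mirror the way the results are extracted later with map(), but the type labels themselves are made up.

```r
# Hypothetical excerpt of a bing-style lexicon; the real one has thousands of words
toy_lexicon <- data.frame(
  word = c("love", "win", "hero", "lost", "hate", "fraud"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical stop-word list
toy_stop_words <- c("the", "is", "a", "i", "he", "and", "to", "in")

# Sketch of a tweet-level scorer: +1 per positive word, -1 per negative word,
# summed over the tweet; 0 if no lexicon word survives cleaning
sentiment_score_sketch <- function(tweet, lexicon = toy_lexicon,
                                   stops = toy_stop_words) {
  text <- gsub("https?://\\S+", "", tolower(tweet))  # drop URLs
  words <- unlist(strsplit(text, "[^a-z']+"))         # crude tokenizer
  words <- words[nzchar(words) & !(words %in% stops)]
  hits <- lexicon$sentiment[match(words, lexicon$word)]
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) {
    return(list(score = 0, type = "no lexicon words"))
  }
  list(score = sum(ifelse(hits == "positive", 1, -1)), type = "scored")
}

tweets <- c(
  "I love MJ, he is a hero and the GOAT",
  "LeBron lost in the finals, fraud",
  "Who is the GOAT? https://t.co/xyz"
)
scores <- vapply(tweets, function(x) sentiment_score_sketch(x)$score, numeric(1))
print(unname(scores))
```

Under this design a tweet with one positive and one negative word scores 0, the same as a tweet with no sentiment words at all, which is one reason a separate "type" element is useful.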

[Figure: histogram of bing sentiment polarity scores. x-axis: Sentiment Polarity: Michael Jordan; y-axis: Count.]

From the histogram, we see that Michael Jordan’s bing sentiment polarity is fairly neutral, with a slight advantage on the right which may bring his overall score to being positive. The following code aims to interpret the above histogram by finding the range of values, the overall mean score, and the standard error of that score.

    mj_goat_tw_sent_score_bing2$Score %>% summary()

    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -4.00000  0.00000  0.00000  0.07885  1.00000  4.00000

    tibble(
      sent_mean = mean(mj_goat_tw_sent_score_bing2$Score),
      sent_err = sd(mj_goat_tw_sent_score_bing2$Score) /
        sqrt(length(mj_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.078853   0.0585912

For Michael Jordan, the bing sentiment polarity scores range from −4 to 4 with a mean and standard error of 0.079 ± 0.059. The standard error measures how precisely the mean is estimated, so taking one standard error on either side places the mean within [0.020, 0.138]. Therefore, the dataset we gathered and analyzed using the bing lexicon indicates a minor positive sentiment polarity for Michael Jordan.

Moving to LeBron James, we see that he also has a fairly even histogram shape with the majority of scores at 0. Yet he has a higher frequency of negative scores, which may lower his overall polarity.
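Since the same mean ± standard error calculation is repeated for every player, it can be wrapped in a small helper. The sketch below is base R; the function name mean_se_interval() and the sample scores are hypothetical. It computes the standard error as sd(x) / sqrt(n), as in the code above, and returns the one-standard-error bounds. Under a normal approximation, a one-standard-error interval covers the mean roughly 68% of the time, and a 95% interval would use about ±1.96 standard errors instead.

```r
# Standard error of the mean and the mean +/- one-SE interval
mean_se_interval <- function(x) {
  m <- mean(x)
  se <- sd(x) / sqrt(length(x))
  c(mean = m, se = se, lower = m - se, upper = m + se)
}

# Hypothetical tweet-level scores (not the data from this paper)
scores <- c(-1, 0, 0, 1, 2, 0, -2, 1, 0, 1)
print(round(mean_se_interval(scores), 3))

# Recomputing Michael Jordan's interval from the printed mean and standard error
mj <- c(mean = 0.078853, se = 0.0585912)
mj_bounds <- c(lower = mj[["mean"]] - mj[["se"]],
               upper = mj[["mean"]] + mj[["se"]])
print(round(mj_bounds, 3))
```

Working from the unrounded values gives an upper bound that rounds to 0.137; the 0.138 quoted in the text comes from adding the already rounded 0.079 and 0.059.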

[Figure: histogram of bing sentiment polarity scores. x-axis: Sentiment Polarity: LeBron James; y-axis: Count.]

For LeBron James, the bing sentiment polarity scores range from −4 to 3 with a mean and standard error of −0.005 ± 0.056. This means that the mean is estimated to be within [−0.061, 0.051]. Therefore, the dataset we gathered and analyzed using the bing lexicon indicates a neutral sentiment polarity with a slight lean in the negative direction for LeBron James.

    lebron_goat_tw_sent_score_bing2$Score %>% summary()

    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    ## -4.000000  0.000000  0.000000 -0.004706  1.000000  3.000000

    tibble(
      sent_mean = mean(lebron_goat_tw_sent_score_bing2$Score),
      sent_err = sd(lebron_goat_tw_sent_score_bing2$Score) /
        sqrt(length(lebron_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.0047059  0.0559446

Next, James Harden’s histogram is somewhat even but is shifted in the negative direction, such that it now has a peak at about −1. This differs from the previous histograms, which leads us to believe that his polarity is likely to be negative.

[Figure: histogram of bing sentiment polarity scores. x-axis: Sentiment Polarity: James Harden; y-axis: Count.]

James Harden’s bing sentiment polarity scores range from −3 to 3 with a mean and standard error of −0.758 ± 0.093. As a result, the mean is estimated to be within [−0.851, −0.665]. Hence, the dataset we gathered and analyzed using the bing lexicon indicates a negative sentiment polarity for James Harden.

    harden_goat_tw_sent_score_bing2$Score %>% summary()

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ## -3.0000 -1.0000 -1.0000 -0.7576  0.0000  3.0000

    tibble(
      sent_mean = mean(harden_goat_tw_sent_score_bing2$Score),
      sent_err = sd(harden_goat_tw_sent_score_bing2$Score) /
        sqrt(length(harden_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.7575758  0.0931491

Unlike James Harden, Kevin Durant’s histogram continues the trend of a distribution that is centered at 0; however, he does have a higher frequency of negative polarity values.

[Figure: histogram of bing sentiment polarity scores. x-axis: Sentiment Polarity: Kevin Durant; y-axis: Count.]

Kevin Durant’s sentiment polarity scores range from −5 to 4 with a mean and standard error of −0.27 ± 0.12. The mean can then be estimated to be within [−0.39, −0.15]. Consequently, the dataset we gathered and analyzed using the bing lexicon indicates a slightly negative sentiment polarity for Kevin Durant.

    kd_goat_tw_sent_score_bing2$Score %>% summary()

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ## -5.0000 -1.0000  0.0000 -0.2667  0.0000  4.0000

    tibble(
      sent_mean = mean(kd_goat_tw_sent_score_bing2$Score),
      sent_err = sd(kd_goat_tw_sent_score_bing2$Score) /
        sqrt(length(kd_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.2666667  0.1180364

[Figure: histogram of bing sentiment polarity scores. x-axis: Sentiment Polarity: Kobe Bryant; y-axis: Count.]

Kobe Bryant’s sentiment polarity scores range from −4 to 5 with a mean and standard error of −0.08 ± 0.16. The overall mean sentiment polarity can be estimated to be within [−0.24, 0.08]. Thus, the dataset we gathered and analyzed using the bing lexicon indicates a sentiment polarity for Kobe Bryant that is slightly negative but close to neutral.

    kobe_goat_tw_sent_score_bing2$Score %>% summary()

    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -4.00000  0.00000  0.00000 -0.07843  0.00000  5.00000

    tibble(
      sent_mean = mean(kobe_goat_tw_sent_score_bing2$Score),
      sent_err = sd(kobe_goat_tw_sent_score_bing2$Score) /
        sqrt(length(kobe_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.0784314  0.1604971

With each NBA athlete’s bing sentiment polarity score having been calculated, the players can be ordered from most positive to least positive as Michael Jordan, LeBron James, Kobe Bryant, Kevin Durant, and James Harden. Many factors could contribute to a player receiving a positive or negative sentiment polarity score based on tweets, but identifying them would require a larger dataset that covers a longer period than 6 to 9 days.
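The ordering can be checked by collecting the printed means and standard errors into one table. The sketch below is base R; the data frame name nba and its column layout are hypothetical, while the numbers are the sent_mean and sent_err values printed in the kable() outputs of this section.

```r
# Means and standard errors as printed in the kable() outputs of this section
nba <- data.frame(
  player = c("Michael Jordan", "LeBron James", "James Harden",
             "Kevin Durant", "Kobe Bryant"),
  sent_mean = c(0.078853, -0.0047059, -0.7575758, -0.2666667, -0.0784314),
  sent_err = c(0.0585912, 0.0559446, 0.0931491, 0.1180364, 0.1604971),
  stringsAsFactors = FALSE
)

# One-standard-error bounds around each mean
nba$lower <- nba$sent_mean - nba$sent_err
nba$upper <- nba$sent_mean + nba$sent_err

# Order from most positive to least positive mean
ranked <- nba[order(-nba$sent_mean), ]
rownames(ranked) <- NULL

print(cbind(ranked["player"], round(ranked[, -1], 2)))
```

Laid side by side, the one-standard-error intervals for Michael Jordan, LeBron James, and Kobe Bryant overlap, so the ordering among those three rests on small differences, while James Harden's interval sits well below the rest.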

Bing Sentiment Polarities of Tweet Frequency Plots

Earlier in the paper, we plotted the tweet frequency by Twitter users for each of the NBA players and their tweet datasets. For each player, there was at least one noticeable spike in tweet frequency, which raised the question of why it took place and whether it was beneficial or detrimental to the sentiment. Using the following code, we can create new datasets that contain only the tweets posted on the dates of the frequency spikes. Once that is done, we can find the bing sentiment polarity score as we did in the previous section, by taking the mean and standard error.

    ### Bing Sentiment Score
    #### Michael Jordan
    mj_goat_tw_freq <- mj_goat_tw[
      (mj_goat_tw$created_at >= "2021-03-28 00:00:00" &
         mj_goat_tw$created_at < "2021-03-29 00:00:00"), ]

    mj_freq_sent_score_bing <- lapply(mj_goat_tw_freq$text,
                                      function(x) { sentiment_bing_score(x) })

    mj_freq_sent_score_bing2 <- rbind(
      tibble(
        Name = "Michael Jordan",
        Score = unlist(map(mj_freq_sent_score_bing, "score")),
        Type = unlist(map(mj_freq_sent_score_bing, "type"))
      )
    )
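The date filter can be illustrated on a handful of made-up timestamps. The sketch below is base R and every value in it is hypothetical. It makes the time zone explicit, on the assumption that created_at is stored in UTC; when a date-time column is compared against a plain character string, as in the code above, the string may be interpreted in the session's local time zone, which can shift the day boundary. The second filter shows one way to select several non-adjacent spike days at once, as is needed for players with more than one spike.

```r
# Hypothetical tweets with UTC timestamps (stands in for mj_goat_tw)
tw <- data.frame(
  text = c("a", "b", "c", "d"),
  created_at = as.POSIXct(
    c("2021-03-27 23:59:59", "2021-03-28 00:00:00",
      "2021-03-28 18:30:00", "2021-03-29 00:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Half-open window [start, end) covering one calendar day in UTC
start <- as.POSIXct("2021-03-28 00:00:00", tz = "UTC")
end <- as.POSIXct("2021-03-29 00:00:00", tz = "UTC")
spike <- tw[tw$created_at >= start & tw$created_at < end, ]

# Several spike days can be selected by calendar date instead
spike_days <- as.Date(c("2021-03-28", "2021-03-29"))
multi <- tw[as.Date(tw$created_at, tz = "UTC") %in% spike_days, ]

print(spike$text)
print(multi$text)
```

The half-open window keeps a tweet stamped exactly at midnight in only one day, so no tweet is counted twice when adjacent days are analyzed separately.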

The bing sentiment polarity scores for the day of Michael Jordan’s tweet frequency spike, “2021-03-28,” range from −3 to 2 with a mean and standard error of 0.07 ± 0.11. This means that the true mean polarity is estimated to be within [−0.04, 0.18]. This is a similar mean estimate to that of the overall polarity score, but the interval for the spike is wider, with a lower bottom estimate and a higher upper estimate, largely because it is based on fewer tweets. The spike day is therefore compatible with both a more negative and a more positive bing sentiment polarity than the dataset as a whole.

    mj_freq_sent_score_bing2$Score %>% summary()

    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -3.00000  0.00000  0.00000  0.07692  0.00000  2.00000

    tibble(
      sent_mean = mean(mj_freq_sent_score_bing2$Score),
      sent_err = sd(mj_freq_sent_score_bing2$Score) /
        sqrt(length(mj_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.0769231  0.1093176

Tweets referring to LeBron James as the GOAT experienced spikes on “2021-03-28” and “2021-03-31.” On these dates, the sentiment polarities range from −4 to 3 with a mean and standard error of −0.14 ± 0.11, such that the true mean is estimated to be within [−0.25, −0.03]. These frequency spikes produced a noticeably more negative bing sentiment polarity than his entire dataset. On the above dates the Los Angeles Lakers, whom LeBron James plays for, had two games, of which they won one and lost the other. LeBron did not play in either game, so the negative sentiment does not seem to be the result of his own personal performance, but it could have been caused by his team’s performance and the fact that he did not participate.

    lebron_freq_sent_score_bing2$Score %>% summary()

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ## -4.0000 -1.0000  0.0000 -0.1368  0.0000  3.0000

    tibble(
      sent_mean = mean(lebron_freq_sent_score_bing2$Score),
      sent_err = sd(lebron_freq_sent_score_bing2$Score) /
        sqrt(length(lebron_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.1367521  0.1064567

Tweets about James Harden had a spike on “2021-03-31.” On this date, the sentiment polarities range from −3 to 0 with a mean and standard error of −1.04 ± 0.14, such that the true mean is estimated to be within [−1.18, −0.90]. On “2021-03-31” the team that James Harden plays for, the Brooklyn Nets, won a game against his former team, the Houston Rockets. He played in the game and had a decent performance, with 17 points, 6 assists, and 8 rebounds in 27 minutes of play. Despite the victory and performance, the estimated sentiment polarity on this day is about 0.3 more negative than that of his dataset as a whole.

    harden_freq_sent_score_bing2$Score %>% summary()

    ##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ## -3.000  -1.000  -1.000  -1.042  -1.000   0.000

    tibble(
      sent_mean = mean(harden_freq_sent_score_bing2$Score),
      sent_err = sd(harden_freq_sent_score_bing2$Score) /
        sqrt(length(harden_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    -1.041667  0.1408973

Kevin Durant’s GOAT tweets experienced spikes on “2021-03-28” and “2021-03-30.” On these dates, the sentiment polarities range from −2 to 4 with a mean and standard error of −0.02 ± 0.14, such that the true mean is estimated to be within [−0.16, 0.12]. Kevin Durant also plays for the Brooklyn Nets with James Harden, but there was no game on the above dates, so the increase in frequency was not related to any game performance. However, on “2021-03-30” Kevin Durant and actor Michael Rapaport exchanged direct messages, which were screenshotted and posted to Twitter by Rapaport. The contents of the messages were not necessarily friendly, yet Kevin Durant’s sentiment polarity on these dates is much closer to neutral and is approximately 0.25 more positive than that of the original dataset.

    kd_freq_sent_score_bing2$Score %>% summary()

    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -2.00000  0.00000  0.00000 -0.02128  0.00000  4.00000

    tibble(
      sent_mean = mean(kd_freq_sent_score_bing2$Score),
      sent_err = sd(kd_freq_sent_score_bing2$Score) /
        sqrt(length(kd_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.0212766  0.1442367

Kobe Bryant’s tweet dataset encountered frequency spikes on “2021-04-06,” “2021-04-09,” and “2021-04-12.” On these dates, the sentiment polarities range from −2 to 1 with a mean and standard error of −0.04 ± 0.15, such that the true mean is estimated to be within [−0.19, 0.11]. The spikes in frequency produced a polarity score that is almost identical to that of Kobe’s entire dataset, just a touch more positive. The increase in tweets on “2021-04-12” is most likely due to the five-year anniversary of his farewell game, in which he played his final NBA game and scored 60 points.

    kobe_freq_sent_score_bing2$Score %>% summary()

    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -2.00000  0.00000  0.00000 -0.04348  0.00000  1.00000

    tibble(
      sent_mean = mean(kobe_freq_sent_score_bing2$Score),
      sent_err = sd(kobe_freq_sent_score_bing2$Score) /
        sqrt(length(kobe_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.0434783  0.1471503

After comparing the bing sentiment polarity values for each player across their dataset as a whole and across the spikes in tweet frequency, the spike-day mean was more positive than the overall mean for only two of the five NBA athletes, Kevin Durant and Kobe Bryant. The following figures show the histograms of the NBA players’ original bing sentiment polarities along with the sentiment polarities of the frequency spikes as a method of comparison.

    ggplot(nba_goat_tw_sent_score_bing, aes(x = Score, fill = Name)) +
      geom_histogram(bins = 15, alpha = 0.9) +
      facet_grid(~Name) +
      xlab("Original Sentiment Polarity") +
      ylab("Count") +
      theme_minimal()

[Figure: histograms of the original bing sentiment polarity scores, one panel per player (James Harden, Kevin Durant, Kobe Bryant, LeBron James, Michael Jordan). x-axis: Original Sentiment Polarity; y-axis: Count.]

    ggplot(nba_freq_sent_score_bing, aes(x = Score, fill = Name)) +
      geom_histogram(bins = 15, alpha = 0.9) +
      facet_grid(~Name) +
      xlab("Frequency Spike Sentiment Polarity") +
      ylab("Count") +
      theme_minimal()

[Figure: histograms of the frequency spike bing sentiment polarity scores, one panel per player (James Harden, Kevin Durant, Kobe Bryant, LeBron James, Michael Jordan). x-axis: Frequency Spike Sentiment Polarity; y-axis: Count.]

From the above results and analyses, the GOAT debate between Michael Jordan and LeBron James can be decided in favor of Michael Jordan, who has the most positive sentiment polarity at 0.079 ± 0.059. LeBron James’ sentiment polarity was not far behind, so it would be interesting to see how much the results vary with new datasets. We can also extend our analyses to other sports to determine how the GOAT debate plays out for their athletes. With that being said, we may be able to crown a GOAT for each sport using the athletes we sampled.

NFL

For the NFL, the professional athletes that we gathered Twitter data on include Aaron Rodgers, Jerry Rice, Patrick Mahomes, and Tom Brady. Aaron Rodgers is a quarterback for the Green Bay Packers who won Super Bowl XLV, was named Super Bowl MVP, and is considered to be one of the best quarterbacks in the NFL. Jerry Rice is a former wide receiver who won three Super Bowls (XXIII, XXIV, XXIX) and a Super Bowl MVP, and was named to the Pro Football Hall of Fame in 2010. Patrick Mahomes is a quarterback for the Kansas City Chiefs who won Super Bowl LIV and was named Super Bowl MVP. Finally, Tom Brady is a quarterback for the Tampa Bay Buccaneers who has won seven Super Bowls (XXXVI, XXXVIII, XXXIX, XLIX, LI, LIII, LV) and five Super Bowl MVPs (XXXVI, XXXVIII, XLIX, LI, LV), and is widely considered to be the NFL’s GOAT. With the above NFL athletes, we will focus on the unique and interesting results from the various analyses that the NBA athletes were put through.

Timeline of Tweets - Frequency Plot

Tom Brady

To begin, we can take a look at the frequency plots of Tom Brady, plotted both by hours and by days. Within the NFL sample, Tom Brady has the highest frequency of tweets with consistent spikes in activity. The two biggest frequency spikes occurred on “2021-03-27” and “2021-03-28” with approximately 60 and 50 tweets, respectively.

[Figure: Frequency of tweets with Tom Brady GOAT Keyword, plotted by hours. Data collected from Twitter's API via rtweet.]

[Figure: Frequency of tweets with Tom Brady GOAT Keyword, plotted by days. Data collected from Twitter's API via rtweet.]
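The counts behind hourly and daily frequency plots like these can be reproduced by bucketing tweet timestamps. The sketch below is base R with hypothetical timestamps; it does not use rtweet, whose ts_plot() function is presumably what drew the figures in this paper, and it assumes timestamps are in UTC.

```r
# Hypothetical tweet timestamps in UTC (not the data from this paper)
created_at <- as.POSIXct(
  c("2021-03-27 14:05:00", "2021-03-27 14:40:00", "2021-03-27 21:10:00",
    "2021-03-28 02:15:00", "2021-03-28 02:45:00", "2021-03-30 09:00:00"),
  tz = "UTC"
)

# Tweets per calendar day
per_day <- table(as.Date(created_at, tz = "UTC"))

# Tweets per hour: label each timestamp with the start of its hour
per_hour <- table(format(created_at, "%Y-%m-%d %H:00", tz = "UTC"))

print(per_day)
print(per_hour)

# Day with the most tweets, i.e. the largest spike
busiest_day <- names(per_day)[which.max(per_day)]
print(busiest_day)
```

Hourly buckets show sharper but lower peaks than daily buckets, which is why the two plots of the same data have different y-axis scales. Note that table() omits days or hours with no tweets rather than listing them as zero.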

Word Cloud Analysis

Tom Brady

It was also interesting to see how Tom Brady’s word cloud stacked up against the others, because his word cloud not only contains more words than the others but also emphasizes his presence and dominance within the NFL’s GOAT debate.

Note: The following figures may contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.

    words_tb_goat %>%
      with(wordcloud(word, n, random.order = FALSE, max.words = 100,
                     colors = "red3"))

[Figure: word cloud of the Tom Brady GOAT tweets (up to 100 words).]

Word Networks

Tom Brady

Tom Brady’s bigram and trigram figures also have interesting results, since the names of other NFL quarterbacks such as Aaron Rodgers and Patrick Mahomes appear in them. It can also be noted that Tom Brady shares a connection with elite athletes in other sports like Michael Jordan, Wayne Gretzky, and Muhammad Ali, since they also appear in the networks.

Note: The following figures may contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.

[Figure: Bigram Word Network and Trigram Word Network, Tweets using Tom Brady GOAT Keyword.]

Bing Sentiment Polarity

As we move past the initial analysis phase, we can transition into the sentiment analysis phase to determine each NFL player’s polarity within their dataset. Like before, we can use the results to name a GOAT within the NFL sample and compare the NFL values to the NBA sentiment polarity values.

Aaron Rodgers

Using the same sentiment_bing_score() function from the NBA analysis, we can calculate Aaron Rodgers’ sentiment polarity scores, which range from −2 to 5 with a mean and standard error of 0.44 ± 0.23. Using the standard error, the true polarity is within [0.21, 0.67], which can be interpreted as Aaron Rodgers having a generally positive dataset.

    tibble(
      sent_mean = mean(arodgers_goat_tw_sent_score_bing2$Score),
      sent_err = sd(arodgers_goat_tw_sent_score_bing2$Score) /
        sqrt(length(arodgers_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.4444444  0.2269342

Jerry Rice

Likewise, we can calculate Jerry Rice’s sentiment polarity scores, which range from −2 to 2 with a mean and standard error of −0.16 ± 0.14. The true mean is then within [−0.30, −0.02], which means that Jerry Rice’s data has a slightly negative attitude towards him.

    tibble(
      sent_mean = mean(jrice_goat_tw_sent_score_bing2$Score),
      sent_err = sd(jrice_goat_tw_sent_score_bing2$Score) /
        sqrt(length(jrice_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    -0.15625   0.1427649

Patrick Mahomes

Next, Patrick Mahomes’ sentiment polarity scores range from −2 to 3 with a mean and standard error of 0.24 ± 0.22. So, the true mean is within [0.02, 0.46], which is positive but not as much as Aaron Rodgers.

    tibble(
      sent_mean = mean(mahomes_goat_tw_sent_score_bing2$Score),
      sent_err = sd(mahomes_goat_tw_sent_score_bing2$Score) /
        sqrt(length(mahomes_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.2380952  0.2171763

Tom Brady

Finally, Tom Brady’s sentiment polarity scores have the greatest range, from −5 to 5, with an estimated mean and standard error of 0.371 ± 0.077. Therefore, the true polarity is in [0.294, 0.448], which is highly positive. It is even greater than Michael Jordan’s sentiment polarity, which was the highest among the NBA players.

    tibble(
      sent_mean = mean(tb_goat_tw_sent_score_bing2$Score),
      sent_err = sd(tb_goat_tw_sent_score_bing2$Score) /
        sqrt(length(tb_goat_tw_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.3712575  0.077264

Bing Sentiment Polarities of Tweet Frequency Plots

While the sentiment polarities of the frequency spikes for the NBA athletes were not necessarily higher than their overall sentiment polarities, it is sensible to look into how the NFL sample behaves, because the result could be entirely different.

Aaron Rodgers

Within Aaron Rodgers’ tweet frequency plot, which covers the days from “2021-03-27” to “2021-04-03,” there was an increase in tweets on “2021-04-01” and “2021-04-03.” On these days, the sentiment polarity scores range from −1 to 2 with a mean and error of 0.38 ± 0.32. If the true sentiment polarity for these two frequency spikes is within [0.06, 0.70], then the spikes can be seen as positive influences on the overall sentiment polarity. However, the frequency spikes do not yield a better sentiment polarity, since the overall dataset has a greater floor estimate and an almost identical ceiling.

    tibble(
      sent_mean = mean(arodgers_freq_sent_score_bing2$Score),
      sent_err = sd(arodgers_freq_sent_score_bing2$Score) /
        sqrt(length(arodgers_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.375      0.3238992

Jerry Rice

Jerry Rice’s tweet frequency plot spans from “2021-03-27” to “2021-04-04” with frequency spikes on “2021-04-01” and “2021-04-03.” The sentiment polarity scores for the two days range from −2 to 1, with a mean and error of −0.16 ± 0.18. Since the true sentiment polarity is within [−0.34, 0.02], the spikes can be seen as negative influences on the overall sentiment polarity. Also, the tweet spikes have a worse floor estimate, so they are not better than the dataset as a whole.

    tibble(
      sent_mean = mean(jrice_freq_sent_score_bing2$Score),
      sent_err = sd(jrice_freq_sent_score_bing2$Score) /
        sqrt(length(jrice_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean   sent_err
    -0.1578947  0.1754386

Patrick Mahomes

The tweets in Patrick Mahomes’ tweet frequency plot were posted between “2021-03-27” and “2021-04-04,” with a surge coming on “2021-04-01.” The sentiment polarity scores for this day range from 0 to 3, with a mean and error of 0.60 ± 0.60. Given that the true sentiment polarity is within [0.0, 1.2], the increase in frequency was a positive influence on the overall sentiment polarity. On another note, the sentiment polarity for this spike is also greater than the dataset’s, making it a successful day.

    tibble(
      sent_mean = mean(mahomes_freq_sent_score_bing2$Score),
      sent_err = sd(mahomes_freq_sent_score_bing2$Score) /
        sqrt(length(mahomes_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    0.6        0.6

Tom Brady

Lastly, Tom Brady’s tweet frequency plot extends from “2021-03-27” to “2021-04-04” and captured tweet frequency spikes on “2021-03-27” and “2021-03-28.” The sentiment polarity scores for these days have a minimum and maximum of −3 and 5, in addition to a mean and error of 1.41 ± 0.21. With the true sentiment polarity of the frequency spikes in the range [1.20, 1.62], we are able to consider the spikes as being strongly positive, well above the sentiment polarity of 0.371 ± 0.077 calculated for his dataset as a whole in the previous section.

    tibble(
      sent_mean = mean(tb_freq_sent_score_bing2$Score),
      sent_err = sd(tb_freq_sent_score_bing2$Score) /
        sqrt(length(tb_freq_sent_score_bing2$Score))
    ) %>% kable()

    sent_mean  sent_err
    1.409836   0.2134508

In review, the tweet frequency spikes that took place in the NFL datasets had a better impact than those in the NBA datasets, seeing as the spike-day sentiment polarities exceeded the overall polarities for Patrick Mahomes and Tom Brady. Moreover, we can crown Aaron Rodgers as the GOAT over the other NFL players we researched for having an overall bing sentiment polarity of 0.44 ± 0.23.

Comparing Sentiment Histograms

In the figures below, we can compare the histograms of the sentiment polarity distributions that we computed in the past two sections. While the shapes of the histograms vary between the two situations, the biggest change was the decrease in counts: the top of the y-axis dropped from 150 to 25.

    ggplot(nfl_goat_tw_sent_score_bing, aes(x = Score, fill = Name)) +
      geom_histogram(bins = 15, alpha = 0.9) +
      facet_grid(~Name) +
      xlab("Original Sentiment Polarity") +
      ylab("Count") +
      theme_minimal()

[Figure: histograms of the original bing sentiment polarity scores, one panel per player (Aaron Rodgers, Jerry Rice, Patrick Mahomes, Tom Brady). x-axis: Original Sentiment Polarity; y-axis: Count.]

    ggplot(nfl_freq_sent_score_bing, aes(x = Score, fill = Name)) +
      geom_histogram(bins = 15, alpha = 0.9) +
      facet_grid(~Name) +
      xlab("Frequency Spike Sentiment Polarity") +
      ylab("Count") +
      theme_minimal()

[Figure: histograms of the frequency spike bing sentiment polarity scores, one panel per player (Aaron Rodgers, Jerry Rice, Patrick Mahomes, Tom Brady). x-axis: Frequency Spike Sentiment Polarity; y-axis: Count.]

MLB

Another sport that we analyze is baseball, and our sampled athletes consist of Clayton Kershaw, Miguel Cabrera, Mike Trout, and Shohei Ohtani of the MLB. Clayton Kershaw is currently a pitcher for the Los Angeles Dodgers who has three Cy Young Awards (2011, 2013, 2014) and won the World Series in 2020. Miguel Cabrera plays for the Detroit Tigers as a first baseman, is a two-time American League MVP (2012, 2013), and won the World Series in 2003. Mike Trout plays center field for the Los Angeles Angels, was selected to the All-MLB First Team in 2019 and 2020, and is a three-time American League MVP (2014, 2016, 2019). Shohei Ohtani also plays for the Los Angeles Angels as a pitcher and is a Japan Series champion (2016) and a Pacific League MVP (2016).

Timeline of Tweets - Frequency Plot

Mike Trout

Among the MLB players in our sample, Mike Trout has the most unique and active frequency plot with multiple spikes, but the most significant came on “2021-03-30” and “2021-04-02.” On those days, the second frequency plot allows us to decipher that the number of tweets increased from 1 to 6 and 9 to 18, respectively.

[Figure: Frequency of tweets with Mike Trout GOAT Keyword, plotted by hours. Data collected from Twitter's API via rtweet.]

[Figure: Frequency of tweets with Mike Trout GOAT Keyword, plotted by days. Data collected from Twitter's API via rtweet.]

Word Cloud Analysis

Mike Trout

Mike Trout’s word cloud was also the most unique among the others, as it consisted of more than just his name and actually referenced baseball terms such as the American and National League, as well as the name of a baseball legend. Michael Jordan’s last name was also referenced in the cloud analysis.

Note: The following figures may contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.

    words_mtrout_goat %>%
      with(wordcloud(word, n, random.order = FALSE, max.words = 100,
                     colors = "red"))

[Figure: Word cloud of tweets using Mike Trout GOAT Keyword.]

Word Networks Shohei Ohtani For the bigram and trigram word network analysis, Shohei Ohtani’s data performed the best, with many connections of substance in networks that remain readable.

[Figure: Bigram Word Network and Trigram Word Network, Tweets using Shohei Ohtani GOAT Keyword.]

Bing Sentiment Polarity Clayton Kershaw After applying the sentiment_bing_score() function to Clayton Kershaw’s dataset, the bing sentiment polarities range over [−2, 1]. The code below gives a mean and standard error of −0.25 ± 0.75, which places the true polarity value (the mean plus or minus one standard error) in [−1.0, 0.5]. This implies that the tweets in his dataset have a slightly negative connotation.

tibble(
  sent_mean = mean(kershaw_goat_tw_sent_score_bing2$Score),
  sent_err = sd(kershaw_goat_tw_sent_score_bing2$Score) /
    sqrt(length(kershaw_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
-0.25        0.75
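The same mean, standard error, and interval calculation is repeated for every athlete, so it can be wrapped in a small helper. The sketch below is one way to do that in base R; the function name polarity_summary and its output columns are assumptions, not part of the thesis code. The four example scores are hypothetical: they were chosen only because they are one set of values in [−2, 1] that reproduces the reported summary of −0.25 ± 0.75, and the actual per-tweet scores are not shown in the text.

```r
# Mean, standard error, and the mean +/- one-standard-error interval
# for a vector of per-tweet sentiment scores.
polarity_summary <- function(scores) {
  n <- length(scores)
  m <- mean(scores)
  se <- sd(scores) / sqrt(n)
  data.frame(n = n, sent_mean = m, sent_err = se,
             lower = m - se, upper = m + se)
}

# Hypothetical scores for illustration only (not the thesis data).
example_scores <- c(-2, -1, 1, 1)
example_summary <- polarity_summary(example_scores)
print(example_summary)
```

Returning the sample size alongside the interval makes it easier to see when a wide interval is simply the result of having very few tweets.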

Miguel Cabrera Miguel Cabrera’s sentiment scores range over [0, 4], with a mean and standard error of 1.00 ± 0.77. This puts the true polarity value somewhere within [0.23, 1.77]. Even with such a wide interval, the attitudes are still positive.

tibble(
  sent_mean = mean(mcabrera_goat_tw_sent_score_bing2$Score),
  sent_err = sd(mcabrera_goat_tw_sent_score_bing2$Score) /
    sqrt(length(mcabrera_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
1            0.7745967

Mike Trout Mike Trout’s polarity scores range over [−1, 2], with a mean and standard error of 0.000 ± 0.084. Because the true sentiment polarity lies in [−0.084, 0.084], an interval that is as negative as it is positive, the dataset’s tweets portray a neutral opinion of Mike Trout.

tibble(
  sent_mean = mean(mtrout_goat_tw_sent_score_bing2$Score),
  sent_err = sd(mtrout_goat_tw_sent_score_bing2$Score) /
    sqrt(length(mtrout_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0            0.0842842

Shohei Ohtani Finally, Shohei Ohtani’s polarities range over [−2, 1]. These polarity values have a mean and standard error of −0.14 ± 0.34, which places the true sentiment polarity within [−0.48, 0.20]. The tweets in the dataset therefore have a more negative than positive tone towards Shohei.

tibble(
  sent_mean = mean(sohtani_goat_tw_sent_score_bing2$Score),
  sent_err = sd(sohtani_goat_tw_sent_score_bing2$Score) /
    sqrt(length(sohtani_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean     sent_err
-0.1428571    0.340068

Bing Sentiment Polarities of Tweet Frequency Plots Once again, we gather the sentiment polarity of the spikes in the tweet frequency plots for each MLB athlete and then determine whether the spikes had a positive or negative impact on the sentiment polarity of the entire dataset. Knowing whether an increase in tweet frequency was positive or negative is helpful because it gives a reason to research what caused it.

Clayton Kershaw Clayton Kershaw’s frequency plot covers “2021-03-30” to “2021-04-05,” including the tweet surges on “2021-03-31” and “2021-04-05.” The sentiment polarity values on these days range over [−2, 1], with a mean and standard error of −0.67 ± 0.88. This leaves us with a mostly negative sentiment polarity within [−1.55, 0.21]. The increase in tweet frequency therefore had a negative influence, with a polarity lower than that of the entire dataset.

tibble(
  sent_mean = mean(kershaw_freq_sent_score_bing2$Score),
  sent_err = sd(kershaw_freq_sent_score_bing2$Score) /
    sqrt(length(kershaw_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean     sent_err
-0.6666667    0.8819171

Miguel Cabrera Secondly, Miguel Cabrera’s frequency plot extends from “2021-04-01” to “2021-04-02.” Within this time frame, a single tweet on “2021-04-02” caused the spike and obtained a value of 4 ± NA. The standard error is NA because the standard deviation of a single observation is undefined. With only one tweet, the result cannot be interpreted without bias, especially with a value as high as 4.

tibble(
  sent_mean = mean(mcabrera_freq_sent_score_bing2$Score),
  sent_err = sd(mcabrera_freq_sent_score_bing2$Score) /
    sqrt(length(mcabrera_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
4            NA
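The NA in the table above follows directly from how R computes a sample standard deviation, which divides by n − 1. A short sketch of that behavior is shown below, together with a guarded helper that makes the small-sample case explicit. The name safe_se is hypothetical, and the second score vector is invented for illustration.

```r
# A single score, as in a one-tweet spike.
single_tweet <- c(4)
single_sd <- sd(single_tweet)  # NA: the sample sd divides by n - 1 = 0

# Guarded standard error: returns NA explicitly when fewer than two
# scores are available instead of relying on sd() to propagate it.
safe_se <- function(scores) {
  if (length(scores) < 2) {
    return(NA_real_)
  }
  sd(scores) / sqrt(length(scores))
}

print(single_sd)
print(safe_se(single_tweet))
print(safe_se(c(0, 0, 0, 1)))  # hypothetical four-tweet spike
```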

Mike Trout Next, Mike Trout’s dataset contains tweets from “2021-03-29” to “2021-04-05,” with increases in tweet frequency on “2021-03-30” and “2021-04-02.” The sentiment values on those days range over [−1, 1], with a mean and standard error of −0.04 ± 0.12. The true sentiment polarity can therefore be placed within [−0.16, 0.08], which is marginally negative. This also follows the trend of being worse than the dataset’s overall polarity, since the overall sentiment is neutral.

tibble(
  sent_mean = mean(mtrout_freq_sent_score_bing2$Score),
  sent_err = sd(mtrout_freq_sent_score_bing2$Score) /
    sqrt(length(mtrout_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
-0.04        0.122202

Shohei Ohtani Finally, Shohei Ohtani’s dataset captures tweets from “2021-03-30” through “2021-04-05.” The day with a frequency spike is “2021-04-05,” which has a mean and standard error of 0.25 ± 0.25 and minimum and maximum values of [0, 1]. This gives a true sentiment polarity within [0.0, 0.5], which is positive and greater than Shohei’s sentiment across the entire dataset.

tibble(
  sent_mean = mean(sohtani_freq_sent_score_bing2$Score),
  sent_err = sd(sohtani_freq_sent_score_bing2$Score) /
    sqrt(length(sohtani_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.25         0.25

To sum up the sentiment analysis (Feldman 2013) of the MLB players, and setting aside Miguel Cabrera’s single-tweet spike, Shohei Ohtani was the only player whose spike in tweet frequency had a positive sentiment polarity that exceeded the polarity of his dataset as a whole. Furthermore, Miguel Cabrera’s overall sentiment polarity of 1.00 ± 0.77 was the highest in the sample, so we can crown him the GOAT of our MLB sample.

Comparing Sentiment Histograms In the figures below, we are able to compare the histograms of the sentiment polarity distributions that we computed in the prior two sections. The shapes of the histograms marginally shift from one figure to the next, but the main differences are the decrease in the count of the sentiment values which dropped from 40 to 15, and Miguel Cabrera’s lack of negative values in the sentiment frequency histogram.

ggplot(mlb_goat_tw_sent_score_bing, aes(x = Score, fill = Name)) +
  geom_histogram(bins = 15, alpha = 0.9) +
  facet_grid(~Name) +
  xlab("Original Sentiment Polarity") +
  ylab("Count") +
  theme_minimal()

[Figure: Histograms of Original Sentiment Polarity (x-axis) against Count (y-axis), with one panel each for Clayton Kershaw, Miguel Cabrera, Mike Trout, and Shohei Ohtani.]

ggplot(mlb_freq_sent_score_bing, aes(x = Score, fill = Name)) +
  geom_histogram(bins = 15, alpha = 0.9) +
  facet_grid(~Name) +
  xlab("Frequency Spike Sentiment Polarity") +
  ylab("Count") +
  theme_minimal()

[Figure: Histograms of Frequency Spike Sentiment Polarity (x-axis) against Count (y-axis), with one panel each for Clayton Kershaw, Miguel Cabrera, Mike Trout, and Shohei Ohtani.]

NCAA Basketball After focusing on the NBA, NFL, and MLB, I thought it would be interesting to shift the focus to the NCAA. The NCAA Division I Men’s and Women’s Basketball Tournaments both take place in March, and given their vast popularity and media coverage, I took the opportunity to include two players from each tournament in my NCAA Basketball sample. On the men’s side, I selected Drew Timme and Jalen Suggs, who both played for the Gonzaga Bulldogs, the runners-up in the 2021 championship game. For the women, I selected Aari McDonald, who played for the Arizona Wildcats, the runners-up in the 2021 championship game. I also chose Paige Bueckers, who plays for the UConn Huskies, as she led her team to the Final Four and became the first freshman to win AP Player of the Year, Naismith Trophy, Wooden Award POY, and the Award.

Timeline of Tweets - Frequency Plot Paige Bueckers Paige Bueckers’ tweet frequency plot shows a high count of tweets referring to her as the GOAT, with surges that appear throughout the entire dataset. The most notable peaks took place on “2021-03-30” and “2021-04-01.” The second frequency plot looks at the dataset from another angle by adjusting the time interval to group tweets by days instead of hours.

[Figure: Frequency of tweets with Paige Bueckers GOAT Keyword, grouped by hours (first plot) and by days (second plot), covering Mar 30 to Apr 06. Data collected from Twitter's API via rtweet.]

Word Cloud Analysis Jalen Suggs A few things stand out in Jalen Suggs’ word cloud: Paige Bueckers’ name appears; the word “final” is mentioned, which may refer to the championship game; and “buzzerbeater” is listed because of his game-winning shot in the semi-finals against the University of California, Los Angeles (UCLA) that advanced Gonzaga to the finals. Note: The following figures may contain inappropriate language, which is included to illustrate the prevalence of such terms within the dataset.

[Figure: Word cloud of tweets using Jalen Suggs GOAT Keyword.]

Paige Bueckers Similarly, Jalen Suggs’ name appears in Paige’s word cloud. Paige’s word cloud contains the most words of the NCAA basketball players in the sample, with some referencing her achievements and extraordinary skills.

[Figure: Word cloud of tweets using Paige Bueckers GOAT Keyword.]

Word Networks Jalen Suggs Using the bigram and trigram analyses, we are able to see the connections between words that appear in pairs and in trios. After seeing the top words in his word cloud, it is interesting to see how those words are all connected.

Paige Bueckers Using the bigram and trigram analyses, we are also able to see the connections between pairs and trios of words in Paige’s dataset. One interesting point is that Paige has a stronger connection to the term GOAT in both the bigram and trigram plots than Jalen does in his.

[Figure: Bigram Word Network and Trigram Word Network, Tweets using Jalen Suggs GOAT Keyword.]

[Figure: Bigram Word Network and Trigram Word Network, Tweets using Paige Bueckers GOAT Keyword.]
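The word networks above are built from counts of adjacent word pairs. The thesis code that produces those counts is not shown in this section, so the following is only a dependency-free sketch of the idea in base R. The tweets are hypothetical, and the helper name bigrams_of is an assumption.

```r
# Hypothetical cleaned tweet texts (lower-case, punctuation removed).
tweets <- c("paige bueckers is the goat",
            "jalen suggs buzzer beater",
            "paige bueckers freshman phenom")

# Split a text into words and paste each adjacent pair into a bigram.
bigrams_of <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < 2) {
    return(character(0))
  }
  paste(head(words, -1), tail(words, -1))
}

# Count every bigram across all tweets, most frequent first.
bigram_counts <- sort(table(unlist(lapply(tweets, bigrams_of))),
                      decreasing = TRUE)
print(head(bigram_counts, 3))
```

In a network plot, each bigram becomes an edge between its two words and the count becomes the edge weight, which is why frequently paired words are drawn with the boldest links. Trigrams extend the same idea to three adjacent words.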

Bing Sentiment Polarity As we did for the other sports and athletes, we will use the sentiment_bing_score() function to calculate the bing sentiment polarity for the NCAA athletes’ datasets.

Drew Timme Drew Timme’s sentiment polarities fall within the range [−1, 2], with a mean and standard error of 0.06 ± 0.14. Adjusting for the error produces a true sentiment polarity within [−0.08, 0.20], which is slightly positive overall.

tibble(
  sent_mean = mean(timme_goat_tw_sent_score_bing2$Score),
  sent_err = sd(timme_goat_tw_sent_score_bing2$Score) /
    sqrt(length(timme_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.0625       0.1434326

Jalen Suggs Jalen Suggs’ sentiment polarities range over [−1, 3], with a mean and standard error of 0.030 ± 0.067. Adjusting for the error produces a true sentiment polarity within [−0.037, 0.097], which is slightly positive, like Drew Timme’s.

tibble(
  sent_mean = mean(jsuggs_goat_tw_sent_score_bing2$Score),
  sent_err = sd(jsuggs_goat_tw_sent_score_bing2$Score) /
    sqrt(length(jsuggs_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.03         0.0673525

Aari McDonald Aari McDonald’s sentiment polarities fall within [−1, 0], with a mean and standard error of −0.091 ± 0.091. Adjusting for the error produces a true sentiment polarity within [−0.182, 0.000], which is slightly negative.

tibble(
  sent_mean = mean(amcdonald_goat_tw_sent_score_bing2$Score),
  sent_err = sd(amcdonald_goat_tw_sent_score_bing2$Score) /
    sqrt(length(amcdonald_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean     sent_err
-0.0909091    0.0909091

Paige Bueckers Paige Bueckers’ sentiment polarities range over [−1, 3], with a mean and standard error of 0.37 ± 0.10. Adjusting for the error produces a true sentiment polarity within [0.27, 0.47], which is positive and the highest among the NCAA athletes.

tibble(
  sent_mean = mean(pbueckers_goat_tw_sent_score_bing2$Score),
  sent_err = sd(pbueckers_goat_tw_sent_score_bing2$Score) /
    sqrt(length(pbueckers_goat_tw_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.369863     0.1004307

Bing Sentiment Polarities of Tweet Frequency Plots

### Frequency Bing Sentiment Score
timme_goat_tw_freq <- timme_goat_tw[
  (timme_goat_tw$created_at >= "2021-04-04 00:00:00" &
     timme_goat_tw$created_at < "2021-04-05 00:00:00") |
    (timme_goat_tw$created_at >= "2021-04-05 00:00:00" &
       timme_goat_tw$created_at < "2021-04-06 00:00:00"),
]
timme_freq_sent_score_bing <- lapply(timme_goat_tw_freq$text,
                                     function(x){sentiment_bing_score(x)})
timme_freq_sent_score_bing2 <- rbind(
  tibble(
    Name = "Drew Timme",
    Score = unlist(map(timme_freq_sent_score_bing, "score")),
    Type = unlist(map(timme_freq_sent_score_bing, "type"))
  )
)
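The date-window subsetting above has to be rewritten for every athlete and every spike. As a hedged alternative, the sketch below wraps the same idea in a reusable base R helper that selects tweets by calendar day. The function name filter_spike_days and the toy data frame are hypothetical, and the sketch assumes that created_at is a POSIXct column interpreted in UTC.

```r
# Keep the rows of a tweet data frame whose created_at falls on one of
# the given calendar days (days supplied as "YYYY-MM-DD" strings).
filter_spike_days <- function(tweets, days, tz = "UTC") {
  tweet_day <- as.Date(tweets$created_at, tz = tz)
  tweets[tweet_day %in% as.Date(days), , drop = FALSE]
}

# Hypothetical tweets for illustration only.
toy_tweets <- data.frame(
  created_at = as.POSIXct(
    c("2021-04-03 23:59:00", "2021-04-04 00:00:00",
      "2021-04-05 12:00:00", "2021-04-06 00:00:00"),
    tz = "UTC"
  ),
  text = c("tweet a", "tweet b", "tweet c", "tweet d"),
  stringsAsFactors = FALSE
)

spike_tweets <- filter_spike_days(toy_tweets, c("2021-04-04", "2021-04-05"))
print(spike_tweets$text)
```

Passing the spike days as a vector keeps the list of dates in one place, which makes it easier to check that the days used in the code match the days named in the text.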

Drew Timme Referring back to the tweet frequency plots, Drew Timme’s tweet dataset ranges from “2021-03-30” to “2021-04-06,” with two spikes on “2021-04-05” and “2021-04-06.” On those days, the bing sentiment polarity scores range over [0, 2] with a mean and standard error of 0.25 ± 0.25, meaning that the true mean is likely within [0.0, 0.5]. These values are much more positive than Timme’s overall sentiment polarity score, so the spike acted in his favor.

tibble(
  sent_mean = mean(timme_freq_sent_score_bing2$Score),
  sent_err = sd(timme_freq_sent_score_bing2$Score) /
    sqrt(length(timme_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.25         0.25

Jalen Suggs Jalen Suggs’ tweet dataset ranges from “2021-03-31” to “2021-04-06,” with a spike in frequency on “2021-04-04.” The sentiment polarity ranges over [−1, 3] with a mean and standard error of 0.024 ± 0.076. This places the true mean within [−0.052, 0.1]. The spike mean of 0.024 is slightly lower than his original mean of 0.030, so the spike was marginally less positive than his overall sentiment polarity.

tibble(
  sent_mean = mean(jsuggs_freq_sent_score_bing2$Score),
  sent_err = sd(jsuggs_freq_sent_score_bing2$Score) /
    sqrt(length(jsuggs_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.0238095    0.0756994

Aari McDonald Aari McDonald’s dataset covers “2021-04-03” to “2021-04-05” and underwent spikes on “2021-04-03” and “2021-04-05,” with polarities that range over [−1, 0] and a mean and standard error of −0.17 ± 0.17. This gives a true mean within [−0.34, 0], suggesting that the spike was detrimental, since the lower bound almost doubled in magnitude compared with the overall dataset.

tibble(
  sent_mean = mean(amcdonald_freq_sent_score_bing2$Score),
  sent_err = sd(amcdonald_freq_sent_score_bing2$Score) /
    sqrt(length(amcdonald_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean     sent_err
-0.1666667    0.1666667

Paige Bueckers Finally, the tweet dataset for Paige Bueckers covers “2021-03-29” to “2021-04-05,” with spikes on “2021-03-30” and “2021-04-01.” On these dates, the sentiment polarity ranges over [−1, 3] with a mean and standard error of 0.22 ± 0.10. With the true estimate within [0.12, 0.32], these values are smaller than those of the overall dataset, but the effect is closer to neutral than harmful since they remain positive.

tibble(
  sent_mean = mean(pbueckers_freq_sent_score_bing2$Score),
  sent_err = sd(pbueckers_freq_sent_score_bing2$Score) /
    sqrt(length(pbueckers_freq_sent_score_bing2$Score))
) %>%
  kable()

sent_mean    sent_err
0.2195122    0.1018858

After comparing each NCAA basketball player’s bing sentiment polarity over their entire dataset to the polarity of their spikes in tweet frequency, the frequency spikes’ polarity was more positive, or better, for only one of the four players. On a positive note, Paige Bueckers has earned the title of GOAT among the NCAA Basketball athletes within our sample with a bing sentiment polarity score of 0.37 ± 0.10.
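The overall-versus-spike comparison summarized above can be checked with a few lines of arithmetic. The sketch below uses the means reported in the tables of this section; the data frame layout and column names are assumptions made for illustration.

```r
# Reported means from the tables in this section.
ncaa <- data.frame(
  name = c("Drew Timme", "Jalen Suggs", "Aari McDonald", "Paige Bueckers"),
  overall_mean = c(0.0625, 0.03, -0.0909091, 0.369863),
  spike_mean = c(0.25, 0.0238095, -0.1666667, 0.2195122),
  stringsAsFactors = FALSE
)

# Positive change means the spike days were more positive than the
# athlete's dataset as a whole.
ncaa$change <- ncaa$spike_mean - ncaa$overall_mean
ncaa$spike_more_positive <- ncaa$change > 0

print(ncaa)
```

This compares point estimates only. Given the size of the standard errors, most of these differences fall well within the overlap of the two intervals.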

Comparing Sentiment Histograms The following figures represent the histograms of the NCAA basketball players from the original bing sentiment polarities along with the sentiment polarities of the frequency spikes as a method of comparison.

ggplot(ncaab_goat_tw_sent_score_bing, aes(x = Score, fill = Name)) +
  geom_histogram(bins = 15, alpha = 0.9) +
  facet_grid(~Name) +
  xlab("Original Sentiment Polarity") +
  ylab("Count") +
  theme_minimal()

[Figure: Histograms of Original Sentiment Polarity (x-axis) against Count (y-axis), with one panel each for Aari McDonald, Drew Timme, Jalen Suggs, and Paige Bueckers.]

ggplot(ncaab_freq_sent_score_bing, aes(x = Score, fill = Name)) +
  geom_histogram(bins = 15, alpha = 0.9) +
  facet_grid(~Name) +
  xlab("Frequency Spike Sentiment Polarity") +
  ylab("Count") +
  theme_minimal()

[Figure: Histograms of Frequency Spike Sentiment Polarity (x-axis) against Count (y-axis), with one panel each for Aari McDonald, Drew Timme, Jalen Suggs, and Paige Bueckers.]

Comparisons Across Sports The two athletes with the highest bing sentiment polarity are Miguel Cabrera of the MLB and Aaron Rodgers of the NFL, with scores of 1.00 ± 0.77 and 0.44 ± 0.23, respectively. On the other hand, two of the athletes with the lowest bing sentiment polarity are James Harden of the NBA and Aari McDonald of NCAA Women’s Basketball, with scores of −0.758 ± 0.093 and −0.17 ± 0.17, respectively; McDonald’s figure is her frequency-spike polarity, and her overall polarity was −0.091 ± 0.091. This shows that athletes receive both criticism and praise even when they are at the top of their game. To summarize, the GOATs within our four-sport sample are Michael Jordan for the NBA, Aaron Rodgers for the NFL, Miguel Cabrera for the MLB, and Paige Bueckers for NCAA Basketball.

Conclusion

Regardless of the results presented by the various analyses performed on the athletes’ datasets, there are always improvements that can be made. To begin, it would be more insightful to increase the number of athletes sampled within each sport along with the number of sports analyzed. This would allow comparisons on a greater scale, both within and across sports. Another improvement would be to gather more tweets per athlete over a period of time longer than 6 to 9 days. This would improve the precision of the bing sentiment scores by decreasing the standard error. Increasing the time period would also make it easier to locate frequency trends, which could help in predicting an athlete’s sentiment from prior events. One final improvement is simply repeating the analysis multiple times, because social media can be more subjective than factual, meaning that opinions change easily. For example, an athlete’s sentiment polarity could be positive one day and negative the next due to a bad performance or some other “negative” event. Although the Twitter API (Makice 2009) did present some limitations, it was a valuable experience learning to access real Twitter data and exploring the capabilities that it had to offer. In conclusion, the answer to the main question that motivated this entire project and analysis is that Michael Jordan is the GOAT over LeBron James, based on the bing sentiment polarities that were calculated using the tweet datasets gathered by the Twitter API (Makice 2009).
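The point made above about gathering more tweets can be made concrete: for a fixed spread of scores, the standard error shrinks with the square root of the number of tweets. The sketch below illustrates this with a hypothetical standard deviation of 1.5; the function name se_for_n and the sample sizes are assumptions chosen for illustration.

```r
# Standard error of the mean for a given standard deviation s and
# sample size n.
se_for_n <- function(s, n) {
  s / sqrt(n)
}

# Hypothetical standard deviation and a range of tweet counts.
n_values <- c(10, 40, 160, 640)
se_table <- data.frame(n = n_values, se = se_for_n(1.5, n_values))
print(se_table)
```

Each fourfold increase in the number of tweets halves the standard error, so narrowing an interval substantially requires many more tweets rather than a few extra days of collection.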

Bibliography

Allaire, J. 2012. “RStudio: Integrated Development Environment for R.” Boston, MA 770: 394.
Bernard, H Russell, and Gery Ryan. 1998. “Text Analysis.” Handbook of Methods in Cultural Anthropology 613.
Cavnar, William B, John M Trenkle, and others. 1994. “N-Gram-Based Text Categorization.” In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. Vol. 161175. Citeseer.
Chowdhury, Gobinda G. 2003. “Natural Language Processing.” Annual Review of Information Science and Technology 37 (1): 51–89.
Ding, Xiaowen, Bing Liu, and Philip S Yu. 2008. “A Holistic Lexicon-Based Approach to Opinion Mining.” In Proceedings of the 2008 International Conference on Web Search and Data Mining, 231–40.
“Everything You Need to Know about ’The Last Dance’.” 2020. ESPN.com. https://www.espn.com/nba/story/_/id/28973557/the-last-dance-updates-untold-story-michael-jordan-chicago-bulls.
Feldman, Ronen. 2013. “Techniques and Applications for Sentiment Analysis.” Communications of the ACM 56 (4): 82–89.
Grishman, Ralph. 1986. Computational Linguistics: An Introduction. Cambridge University Press.
Hand, David J, and Niall M Adams. 2014. “Data Mining.” Wiley StatsRef: Statistics Reference Online, 1–7.
Heimerl, Florian, Steffen Lohmann, Simon Lange, and Thomas Ertl. 2014. “Word Cloud Explorer: Text Analytics Based on Word Clouds.” In 2014 47th Hawaii International Conference on System Sciences, 1833–42. IEEE.
Jain, Anil K, Patrick Flynn, and Arun A Ross. 2007. Handbook of Biometrics. Springer Science & Business Media.
Li, Yunyao, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and HV Jagadish. 2008. “Regular Expression Learning for Information Extraction.” In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 21–30.
Makice, Kevin. 2009. Twitter API: Up and Running: Learn How to Build Applications with the Twitter API. O’Reilly Media, Inc.
Martin, Sven, Jörg Liermann, and Hermann Ney. 1998. “Algorithms for Bigram and Trigram Word Clustering.” Speech Communication 24 (1): 19–37.
Tromp, Erik, and Mykola Pechenizkiy. 2014. “Rule-Based Emotion Detection on Social Media: Putting Tweets on Plutchik’s Wheel.” arXiv Preprint arXiv:1412.4682.
Zuo, Yuan, Jichang Zhao, and Ke Xu. 2016. “Word Network Topic Model: A Simple but General Solution for Short and Imbalanced Texts.” Knowledge and Information Systems 48 (2): 379–98.

Appendix

Code that was applied to Michael Jordan’s tweet data but not included in the paper can be found here. For the other athletes, the results are reproduced with the same code, substituting each dataset’s name for the mj prefix seen below.

Libraries

# Set-up
library(rtweet)
library(ggmap)
library(igraph)
library(ggraph)
library(tidytext)
library(ggplot2)
library(dplyr)
library(readr)
library(magrittr)
library(wordcloud)
library(widyr)
library(tidyr)
library(utils)
library(purrr)
library(stringr)
library(knitr)
library(grid)
library(ggpubr)
library(tinytex)

Code

## Michael Jordan
### Code Collection
mj_goat_tw <- search_tweets(
  q = "Michael Jordan GOAT OR Michael Jordan Goat",
  n = 300,
  include_rts = FALSE,
  `-filter` = "replies",
  lang = "en"
)

## Michael Jordan
### Exporting tweets to CSV
write_as_csv(mj_goat_tw, "mj_goat_tw.csv")

## Michael Jordan
### Most Liked Tweets
mj_goat_tw %>%
  arrange(-favorite_count) %>%
  top_n(3, favorite_count) %>%
  select(created_at, screen_name, favorite_count)

### Arrow within bigram/trigram networks
a <- arrow(length = unit(.075, "inches"), type = "closed")

## Michael Jordan
### Trigram plot
mj_goat_tw_trigram_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), arrow = a) +
  geom_node_point(color = "red", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = "Trigram Word Network",
       subtitle = "Tweets using Michael Jordan GOAT Keyword",
       x = "", y = "")

## NBA
### Arranging bigram/trigram plots
ggarrange(lebron_bigram_plot, lebron_trigram_plot, ncol = 2)
ggarrange(harden_bigram_plot, harden_trigram_plot, ncol = 2)
ggarrange(kd_bigram_plot, kd_trigram_plot, ncol = 2)
ggarrange(kobe_bigram_plot, kobe_trigram_plot, ncol = 2)

## Michael Jordan
### Bing Lexicon Counts/Histogram
mj_goat_tw_bing_word_counts <- mj_goat_tw_clean_02 %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

mj_goat_tw_bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Sentiment of Michael Jordan Goat Tweets using Bing Lexicon",
       y = "Word Frequency",
       x = NULL) +
  coord_flip()

## Michael Jordan
### AFINN Lexicon Counts/Histogram
mj_goat_tw_afinn_word_counts <- mj_goat_tw_clean_02 %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>%
  ungroup()

mj_goat_tw_afinn_word_counts %>%
  group_by(value) %>%
  top_n(4) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = value)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~value, scales = "free_y") +
  labs(title = "Sentiment of Michael Jordan Goat Tweets using AFINN Lexicon",
       y = "Word Frequency",
       x = NULL) +
  coord_flip()

## Michael Jordan
### Loughran Lexicon Counts/Histogram
mj_goat_tw_loughran_word_counts <- mj_goat_tw_clean_02 %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

mj_goat_tw_loughran_word_counts %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Sentiment of Michael Jordan Goat Tweets using Loughran Lexicon",
       y = "Word Frequency",
       x = NULL) +
  coord_flip()

## Michael Jordan
### NRC Lexicon Counts/Histogram
mj_goat_tw_nrc_word_counts <- mj_goat_tw_clean_02 %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

mj_goat_tw_nrc_word_counts %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Sentiment of Michael Jordan Goat Tweets using NRC Lexicon",
       y = "Word Frequency",
       x = NULL) +
  coord_flip()

### Bing Sentiment Score Function

sentiment_bing_score <- function(twt) {
  # Step 1: Perform text cleaning (on tweet)
  twt_tbl <- tibble(text = twt) %>%
    mutate(
      # Remove http elements manually
      stripped_text = gsub("http\\S+", "", text)
    ) %>%
    unnest_tokens(word, stripped_text) %>%
    anti_join(stop_words) %>% # remove stop words
    inner_join(get_sentiments("bing")) %>% # merge with bing sentiment
    count(word, sentiment, sort = TRUE) %>%
    ungroup() %>%
    # Create a column "score" that assigns a -1 to all negative words
    # and 1 to all positive words
    mutate(
      score = case_when(
        sentiment == 'negative' ~ n * (-1),
        sentiment == 'positive' ~ n * (1)
      )
    )
  ## Calculate the total score
  sentiment.score <- case_when(
    nrow(twt_tbl) == 0 ~ 0, # if there are no words, score is 0
    nrow(twt_tbl) > 0 ~ sum(twt_tbl$score) # sum the positives & negatives
  )
  # This is to keep track of which tweets
  # contained no words at all from the bing list
  zero.type <- case_when(
    nrow(twt_tbl) == 0 ~ "Type 1", # Type 1: no words at all, zero = no
    nrow(twt_tbl) > 0 ~ "Type 2" # Type 2: zero means sum of words = 0
  )
  list(score = sentiment.score, type = zero.type, twt_tbl = twt_tbl)
}
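To make the scoring logic of the function above easier to follow, here is a dependency-free sketch of the same idea in base R. It is not the thesis implementation: the six-word toy_lexicon is hypothetical and stands in for the bing lexicon, stop-word removal is omitted, and the name score_tweet is an assumption. The sketch keeps the two zero types, so a tweet with no lexicon words ("Type 1") can be told apart from one whose positive and negative words cancel out ("Type 2").

```r
# Hypothetical toy lexicon standing in for the bing lexicon:
# +1 for positive words, -1 for negative words.
toy_lexicon <- c(great = 1, best = 1, love = 1,
                 bad = -1, worst = -1, overrated = -1)

score_tweet <- function(text, lexicon = toy_lexicon) {
  # Lower-case the text and drop URLs.
  text <- gsub("http\\S+", "", tolower(text))
  # Split on anything that is not a letter or apostrophe.
  words <- unlist(strsplit(text, "[^a-z']+"))
  words <- words[nzchar(words)]
  # Look up the polarity of every word found in the lexicon.
  hits <- lexicon[words[words %in% names(lexicon)]]
  list(
    score = sum(hits),
    type = if (length(hits) == 0) "Type 1" else "Type 2"
  )
}

print(score_tweet("The best ever, but overrated https://t.co/x"))
print(score_tweet("Just a game"))
print(score_tweet("worst take, bad"))
```

Both of the first two example tweets score 0, but only the second is "Type 1", which is the distinction the Type column in the thesis code is meant to preserve.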

## Michael Jordan
### Applying Bing Sentiment Score Function
mj_goat_tw_sent_score_bing <- lapply(mj_goat_tw$text,
                                     function(x){sentiment_bing_score(x)})

mj_goat_tw_sent_score_bing2 <- rbind(
  tibble(
    Name = "Michael Jordan",
    Score = unlist(map(mj_goat_tw_sent_score_bing, "score")),
    Type = unlist(map(mj_goat_tw_sent_score_bing, "type"))
  )
)

## NBA
### Bing Sentiment
nba_goat_tw_sent_score_bing <- rbind(
  tibble(
    Name = "Michael Jordan",
    Score = unlist(map(mj_goat_tw_sent_score_bing, "score")),
    Type = unlist(map(mj_goat_tw_sent_score_bing, "type"))
  ),
  tibble(
    Name = "LeBron James",
    Score = unlist(map(lebron_goat_tw_sent_score_bing, "score")),
    Type = unlist(map(lebron_goat_tw_sent_score_bing, "type"))
  ),
  tibble(
    Name = "James Harden",
    Score = unlist(map(harden_goat_tw_sent_score_bing, "score")),
    Type = unlist(map(harden_goat_tw_sent_score_bing, "type"))
  ),
  tibble(
    Name = "Kevin Durant",
    Score = unlist(map(kd_goat_tw_sent_score_bing, "score")),
    Type = unlist(map(kd_goat_tw_sent_score_bing, "type"))
  ),
  tibble(
    Name = "Kobe Bryant",
    Score = unlist(map(kobe_goat_tw_sent_score_bing, "score")),
    Type = unlist(map(kobe_goat_tw_sent_score_bing, "type"))
  )
)

## NBA
### Bing Sentiment of Frequency Plots
nba_freq_sent_score_bing <- rbind(
  tibble(
    Name = "Michael Jordan",
    Score = unlist(map(mj_freq_sent_score_bing, "score")),
    Type = unlist(map(mj_freq_sent_score_bing, "type"))
  ),
  tibble(
    Name = "LeBron James",
    Score = unlist(map(lebron_freq_sent_score_bing, "score")),
    Type = unlist(map(lebron_freq_sent_score_bing, "type"))
  ),
  tibble(
    Name = "James Harden",
    Score = unlist(map(harden_freq_sent_score_bing, "score")),
    Type = unlist(map(harden_freq_sent_score_bing, "type"))
  ),
  tibble(
    Name = "Kevin Durant",
    Score = unlist(map(kd_freq_sent_score_bing, "score")),
    Type = unlist(map(kd_freq_sent_score_bing, "type"))
  ),
  tibble(
    Name = "Kobe Bryant",
    Score = unlist(map(kobe_freq_sent_score_bing, "score")),
    Type = unlist(map(kobe_freq_sent_score_bing, "type"))
  )
)