
Investigating Sports Commentator Bias within a Large Corpus of American Football Broadcasts

Jack Merullo*♠  Luke Yeh*♠  Abram Handler♠  Alvin Grissom II♣  Brendan O'Connor♠  Mohit Iyyer♠

University of Massachusetts Amherst♠  Ursinus College♣
{jmerullo,lyeh,ahandler,miyyer,brenocon}@umass.edu  [email protected]

*Authors contributed equally.

Abstract

Sports broadcasters inject drama into play-by-play commentary by building team and player narratives through subjective analyses and anecdotes. Prior studies based on small datasets and manual coding show that such theatrics evince commentator bias in sports broadcasts. To examine this phenomenon, we assemble FOOTBALL, which contains 1,455 broadcast transcripts from American football games across six decades that are automatically annotated with 250K player mentions and linked with racial metadata. We identify major confounding factors for researchers examining racial bias in FOOTBALL, and perform a computational analysis that supports conclusions from prior social science studies.

Player          Race      Mention text
Baker Mayfield  white     "Mayfield the ultimate competitor he's tough he's scrappy"
Jesse James     white     "this is a guy . . . does nothing but work brings his lunch pail"
Manny Lawson    nonwhite  "good specs for that defensive end freakish athletic ability"
B.J. Daniels    nonwhite  "that otherworldly athleticism he has saw it with Michael Vick"

Table 1: Example mentions from FOOTBALL that highlight racial bias in commentator sentiment patterns.

1 Introduction

Sports broadcasts are major events in contemporary popular culture: televised American football (henceforth "football") games regularly draw tens of millions of viewers (Palotta, 2019). Such broadcasts feature live sports commentators who weave the game's mechanical details into a broader, more subjective narrative. Previous work suggests that this form of storytelling exhibits racial bias: nonwhite players are less frequently praised for good plays (Rainville and McCormick, 1977), while white players are more often credited with "intelligence" (Bruce, 2004; Billings, 2004). However, such prior scholarship forms conclusions from small datasets[1] and subjective manual coding of race-specific language.

[1] Rainville and McCormick (1977), for example, study only 16 games.

We revisit this prior work using large-scale computational analysis. From YouTube, we collect broadcast football transcripts and identify mentions of players, which we link to metadata about each player's race and position. Our resulting FOOTBALL dataset contains over 1,400 games spanning six decades, automatically annotated with ∼250K player mentions (Table 1). Analysis of FOOTBALL identifies two confounding factors for research on racial bias: (1) the racial composition of many positions is very skewed (e.g., only ∼5% of running backs are white), and (2) many mentions of players describe only their actions on the field (not player attributes). We experiment with an additive log-linear model for teasing apart these confounds. We also confirm prior social science studies on racial bias in naming patterns and sentiment. Finally, we publicly release FOOTBALL,[2] the first large-scale sports commentary corpus annotated with player race, to spur further research into characterizing racial bias in mass media.

[2] http://github.com/jmerullo/football

2 Collecting the FOOTBALL dataset

We collect transcripts of 1,455 full game broadcasts from the U.S. National Football League (NFL) and National Collegiate Athletic Association (NCAA) recorded between 1960 and 2019. Next, we identify and link mentions of players within these transcripts to information about their race (white or nonwhite) and position (e.g., quarterback). In total, FOOTBALL contains 267,778 mentions of 4,668 unique players, 65.7% of whom are nonwhite.[3] We now describe each stage of our data collection process.

[3] See Appendix for more detailed statistics.

2.1 Processing broadcast transcripts

We collect broadcast transcripts by downloading YouTube videos posted by nonprofessional, individual users identified by querying YouTube for football archival channels.[4] YouTube automatically captions many videos, allowing us to scrape caption transcripts from 601 NFL games and 854 NCAA games. We next identify the teams playing and the game's year by searching for exact string matches in the video title and manually labeling any videos with underspecified titles.

[4] We specifically query for full NFL|NCAA|college football games 1960s|1970s|1980s|1990s|2000, and the full list of channels is listed in the Appendix.

After downloading videos, we tokenize transcripts using spaCy.[5] As part-of-speech tags predicted by spaCy are unreliable on our transcript text, we tag FOOTBALL using the ARK TweetNLP POS tagger (Owoputi et al., 2013), which is more robust to noisy and fragmented text, including TV subtitles (Jørgensen et al., 2016). Additionally, we use phrasemachine (Handler et al., 2016) to identify all corpus noun phrases. Finally, we identify player mentions in the transcript text using exact string matches of first, last, and full names to roster information from online archives; these rosters also contain the player's position.[6] Although we initially had concerns about the reliability of transcriptions of player names, we noticed minimal errors on more common names. Qualitatively, we noticed that even uncommon names were often correctly transcribed and capitalized. We leave a more systematic study for future work.

[5] https://spacy.io/ (2.1.3), Honnibal and Montani (2017).
[6] Roster sources are listed in the Appendix. We tag first and last name mentions only if they can be disambiguated to a single player in the rosters from opposing teams.
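For concreteness, the following is a minimal Python sketch of the exact-string-matching strategy described above. The roster format, helper names, and example sentence are invented for illustration; the actual pipeline additionally runs the ARK TweetNLP tagger and phrasemachine over the transcripts.

    # Sketch of player mention identification by exact string matching against
    # the two teams' rosters; names shared by multiple players are discarded
    # (cf. footnote 6). All data below is hypothetical.
    from collections import defaultdict

    rosters = {
        "home": [("Todd", "Gurley", "RB"), ("Jared", "Goff", "QB")],
        "away": [("Tom", "Brady", "QB"), ("Rob", "Gronkowski", "TE")],
    }

    def build_name_index(rosters):
        """Map first, last, and full names to players, keeping only
        unambiguous names across both rosters."""
        index = defaultdict(list)
        for players in rosters.values():
            for first, last, pos in players:
                for key in (first, last, f"{first} {last}"):
                    index[key].append((first, last, pos))
        return {name: ps[0] for name, ps in index.items() if len(ps) == 1}

    def find_mentions(tokens, name_index):
        """Scan the token stream, preferring full-name (bigram) matches."""
        mentions, i = [], 0
        while i < len(tokens):
            bigram = " ".join(tokens[i:i + 2])
            if i + 1 < len(tokens) and bigram in name_index:
                mentions.append((i, i + 2, name_index[bigram]))
                i += 2
            elif tokens[i] in name_index:
                mentions.append((i, i + 1, name_index[tokens[i]]))
                i += 1
            else:
                i += 1
        return mentions

    tokens = "Gurley takes the handoff from Goff up the middle".split()
    print(find_mentions(tokens, build_name_index(rosters)))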
2.2 Identifying player race

Racial identity in the United States is a creation of complex, fluid social and historical processes (Omi and Winant, 2014), rather than a reflection of innate differences between fixed groups. Nevertheless, popular perceptions of race in the United States and the prior scholarship on racial bias in sports broadcasts which informs our work (Rainville and McCormick, 1977; Rada, 1996; Billings, 2004; Rada and Wulfemeyer, 2005) typically assume hard distinctions between racial groups, which measurably affect commentary. In this work, we do not reify these racial categories; we use them as commonly understood within the context of the society in which they arise.

To conduct a large-scale re-examination of this prior work, we must identify whether each player in FOOTBALL is perceived as white or nonwhite.[7] Unfortunately, publicly available rosters or player pages do not contain this information, so we resort to crowdsourcing. We present crowd workers on the Figure Eight platform with 2,720 images of professional player headshots from the Associated Press paired with player names. We ask them to "read the player's name and examine their photo" to judge whether the player is white or nonwhite. We collect five judgements per player from crowd workers in the US, whose high inter-annotator agreement (all five workers agree on the race for 93% of players) suggests that their perceptions are very consistent. Because headshots were only available for a subset of players, the authors labeled the race of an additional 1,948 players by performing a Google Image search for the player's name[8] and manually examining the resulting images. Players whose race could not be determined from the search results were excluded from the dataset.

[7] While we use the general term "nonwhite" in this paper, the majority of nonwhite football players are black: in 2013, 67.3% of the NFL was black and most of the remaining players (31%) were white (Lapchick, 2014).
[8] We appended "NFL" to every query to improve precision of results.
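As an illustration of how such judgments can be aggregated, the sketch below takes five hypothetical crowd labels per player, assigns the majority label, and reports the fraction of players with unanimous agreement (93% in FOOTBALL). The player names and votes are invented.

    # Sketch of aggregating crowdsourced race judgments: majority vote per
    # player plus the unanimity rate reported above. Toy data only.
    from collections import Counter

    judgments = {                      # player -> five crowd labels
        "Player A": ["white"] * 5,
        "Player B": ["nonwhite", "nonwhite", "nonwhite", "nonwhite", "white"],
        "Player C": ["nonwhite"] * 5,
    }

    labels, unanimous = {}, 0
    for player, votes in judgments.items():
        label, count = Counter(votes).most_common(1)[0]
        labels[player] = label
        unanimous += int(count == len(votes))

    print(labels)
    print(f"unanimous agreement: {unanimous / len(judgments):.0%}")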

3 Analyzing FOOTBALL

We now demonstrate confounds in the data and revisit several established results from racial bias studies in sports broadcasting. For all experiments, we seek to analyze the statistics of contextual terms that describe or have an important association with a mentioned player. Thus, we preprocess the transcripts by collecting contextual terms in windows of five tokens around each player mention, following the approach of Ananya et al. (2019) for gendered mention analysis.[9]

[9] If multiple player mentions fall within the same window, we exclude each term to avoid ambiguity.

We emphasize that different term extraction strategies are possible, corresponding to different precision–recall tradeoffs. For instance, instead of collecting all terms in a window (high recall), we might instead only collect terms in copular constructions with the entity mention (high precision), such as 'devoted' in "Tebow is devoted". Because mention detection strategies affect conclusions about bias in FOOTBALL, systematically defining, analyzing, or even learning different possible strategies offers an exciting avenue for future work.
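The sketch below illustrates this window-based extraction under the simplifying assumption that mention spans have already been identified; windows overlapping a second mention are dropped, in the spirit of footnote 9. The helper function and example data are ours, not part of the released code.

    # Sketch of contextual term extraction: collect tokens within a five-token
    # window on either side of each player mention, skipping windows that
    # overlap another mention. Mention spans are assumed to be precomputed.
    def context_windows(tokens, mentions, k=5):
        """mentions: list of (start, end, player) token spans."""
        spans = [(s, e) for s, e, _ in mentions]
        windows = []
        for start, end, player in mentions:
            lo, hi = max(0, start - k), min(len(tokens), end + k)
            # Exclude windows that contain a different player mention.
            if any(s < hi and e > lo and (s, e) != (start, end) for s, e in spans):
                continue
            terms = tokens[lo:start] + tokens[end:hi]
            windows.append((player, terms))
        return windows

    tokens = "what a catch by Paulsen he is just so so fast".split()
    mentions = [(4, 5, "Paulsen")]
    print(context_windows(tokens, mentions))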

5000 resent how much a player’s race or position con- tributes to the chance of observing an (w, e) pair. 0 1970 1980 1990 2000 2010 Formally, we model such effects using a sparse Decade MAP estimation variant of SAGE (Eisenstein Figure 2: The percentage of nonwhite quarterbacks et al., 2011). We define the binary vector ye ∈ mentions has drastically increased over time, exem- {0, 1}J to represent the observed player covari- plifying the changing racial landscape in FOOTBALL ates of race (white or nonwhite) and position. across time. For example, component ye,k will be set to 1 if player e is a quarterback and the component k in- precision–recall tradeoffs. For instance, instead dexes the quarterback covariate; ye is a concate- of collecting all terms in a window (high recall) nation of two one-hot vectors. We then model |V| we might instead only collect terms in copular p(w | e) ∝ exp (βw + (γye)w), with βw ∈ R as constructions with the entity mention (high pre- a background distribution over the vocabulary V, cision), such as ‘devoted’ in “Tebow is devoted”. set to empirical corpus-wide word and phrase log- J×|V| Because mention detection strategies affect con- probabilities, and γ ∈ R as a matrix of fea- clusions about bias in FOOTBALL, systematically ture effects on those log probabilities. γj,w denotes defining, analyzing or even learning different pos- the difference in log-probability of w for the jth sible strategies offers an exciting avenue for future player feature being on versus off. For example, if work. j indexes the quarterback covariate and w indexes the word “tough”, then γj,w represents how much 3.1 Statistical and linguistic confounds more likely the word “tough” is to be applied to Identifying racial bias in football broadcasts quarterbacks over the base distribution. We im- presents both statistical and linguistic modeling pose a uniform Laplace prior on all elements of challenges. Many descriptions of players in γ to induce sparsity, and learn a MAP estimate broadcasts describe temporary player states (e.g., with the LibLBFGS implementation of OWL-QN, “Smith deep in the backfield”) or discrete player an L1-capable quasi-Newtonian convex optimizer actions (“Ogden with a huge block”), rather than (Andrew and Gao, 2007; Okazaki, 2010). possibly-biased descriptions of players themselves Table2 shows several highest-valued γj,w for a (“Cooper is one scrappy receiver”). Moreover, subset of the J covariates. The adjective “quick”

Table 2 shows several of the highest-valued γ_{j,w} for a subset of the J covariates. The adjective "quick" is predictive of wide receivers, but refers to an action the players take on the field ("quick slant"), not an attribute of the receivers themselves. We also find that since "strong safety" is a kind of defensive back, the adjective "strong" is often associated with defensive backs, who are often nonwhite. In this case, "strong" does not reflect racial bias. Preliminary experiments with per-position mention-level race classifiers, as per Ananya et al. (2019), were also unable to disentangle race and position.

Covariate  Top terms
White      long way, long time, valuable
DB         strong safety, free safety, state university
RB         second effort, single setback, ground game
QB         freshman quarterback, arm strength, easier
WR         quick slant, end zone touchdown, punt returner

Table 2: Top terms for the white, defensive back (DB), running back (RB), quarterback (QB), and wide receiver (WR) covariates for the log-linear model.

These results suggest that a more sophisticated approach may be necessary to isolate race effects from the confounds; they also raise sharp conceptual questions about the meaning of race-conditional statistical effects in social scientific inquiry, since race is a multifaceted construct (a "bundle of sticks," as Sen and Wasow (2016) argue). For future work, it may be useful to think of comparisons between otherwise similar players: how do broadcasters differ in their discussions of two players who are both quarterbacks, and who have similar in-game performance, but differ by race?

We now describe two experiments that sidestep some of these confounds, each motivated by prior work in social science: the first examines player naming patterns, which are less tied to action on the field than player attributes. The other uses words with known sentiment polarity to identify positive and negative attributes, regardless of player position or game mechanics.

3.2 Exploring naming patterns

Naming patterns in sports broadcasting—how commentators refer to players by name (e.g., first or last name)—are influenced by player attributes, as shown by prior small-scale studies. For example, Koivula (1999) finds that women are more frequently referred to by their first names than men in a variety of sports. Bruce (2004) discovers a similar trend for race in basketball games: white players are more frequently referred to by their last names than nonwhite players, often because commentators believe their first names sound too "normal". Bruce (2004) further points out that the "practice of having fun or playing with the names of people from non-dominant racial groups" contributes to racial "othering". A per-position analysis of player mentions in FOOTBALL corroborates these findings for all offensive positions (Table 3).

Position  Race      First  Last   Full
QB        white      8.3%  20.0%  71.7%
QB        nonwhite  18.1%   7.5%  74.5%
WR        white      6.9%  36.5%  56.5%
WR        nonwhite  11.3%  24.1%  64.6%
RB        white     10.5%  41.2%  48.4%
RB        nonwhite   8.5%  35.4%  56.1%
TE        white     16.6%  18.7%  64.7%
TE        nonwhite  13.8%  16.6%  69.7%

Table 3: White players at the four major offensive positions are referred to by last name more often than nonwhite players at the same positions, a discrepancy that may reflect unconscious racial boundary-marking.
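A per-position naming analysis like Table 3 can be computed directly from the linked mentions. The sketch below tallies first-, last-, and full-name mentions by position and race; the mention records are hypothetical.

    # Sketch of the naming-pattern analysis behind Table 3: for each
    # (position, race) group, compute the share of mentions that use the
    # player's first name, last name, or full name. Records are toy data.
    from collections import Counter, defaultdict

    # Each record: (position, race, mention_type)
    mentions = [
        ("QB", "white", "last"), ("QB", "white", "full"),
        ("QB", "nonwhite", "first"), ("QB", "nonwhite", "full"),
        ("RB", "nonwhite", "last"), ("RB", "white", "last"),
    ]

    counts = defaultdict(Counter)
    for position, race, mention_type in mentions:
        counts[(position, race)][mention_type] += 1

    for (position, race), c in sorted(counts.items()):
        total = sum(c.values())
        shares = {t: c[t] / total for t in ("first", "last", "full")}
        print(position, race, {t: f"{v:.1%}" for t, v in shares.items()})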
3.3 Sentiment patterns

Prior studies examine the relationship between commentator sentiment and player race: Rainville and McCormick (1977) conclude that white players receive more positive coverage than black players, and Rada (1996) shows that nonwhite players are praised more for physical attributes and less for cognitive attributes than white ones.

To examine sentiment patterns within FOOTBALL, we assign a binary sentiment label to contextualized terms (i.e., a window of words around a player mention) by searching for words that match those in domain-specific sentiment lexicons from Hamilton et al. (2016).[10] This method identifies 49,787 windows containing sentiment-laden words, only 12.8% of which are of negative polarity, similar to the 8.3% figure reported by Rada (1996).[11] We compute a list of the most positive words for each race, ranked by the ratio of relative frequencies (Monroe et al., 2008).[12]

[10] We use a filtered intersection of lexicons from the NFL, CFB, and sports subreddits, yielding 121 positive and 125 negative words.
[11] Preliminary experiments with a state-of-the-art sentiment model trained on the Stanford Sentiment Treebank (Peters et al., 2018) produced qualitatively unreliable predictions due to the noise in FOOTBALL.
[12] We follow Monroe et al. (2008) in removing infrequent words before ranking; specifically, a word must occur at least ten times for each race to be considered.
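The sketch below illustrates this procedure on toy data: window terms are matched against a small positive lexicon, and words are then ranked by the ratio of their relative frequencies across the two groups, with a minimum-frequency filter in the spirit of footnote 12. The lexicon, windows, and smoothing choice are ours, not the paper's.

    # Sketch of the sentiment analysis in this section: label windows with a
    # domain-specific lexicon, then rank positive words for each race by the
    # ratio of relative frequencies (Monroe et al., 2008). Toy data throughout.
    from collections import Counter

    positive_lexicon = {"smart", "spectacular", "gifted", "athletic", "calm", "speed"}
    min_count = 10   # footnote 12 applies this threshold per race

    # windows: (player_race, window tokens); in FOOTBALL these come from Section 3.
    windows = [("white", ["calm", "smart", "pocket"]),
               ("nonwhite", ["gifted", "athletic", "speed"])] * 12

    freq = {"white": Counter(), "nonwhite": Counter()}
    for race, tokens in windows:
        for tok in tokens:
            if tok in positive_lexicon:
                freq[race][tok] += 1

    def ranked_positive(target, other):
        total_t = sum(freq[target].values())
        total_o = sum(freq[other].values())
        scores = {}
        for word, c in freq[target].items():
            if c < min_count:                               # frequency filter
                continue
            rel_t = c / total_t
            rel_o = (freq[other][word] + 1) / (total_o + 1)  # smooth zero counts
            scores[word] = rel_t / rel_o
        return sorted(scores, key=scores.get, reverse=True)

    print(ranked_positive("white", "nonwhite"))
    print(ranked_positive("nonwhite", "white"))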

A qualitative inspection of these lists (Table 4) confirms that nonwhite players are much more frequently praised for physical ability than white players, who are praised for personality and intelligence (see Table 1 for more examples).

Race            Most positive words
white (all)     enjoying, favorite, calm, appreciate, loving, miracle, spectacular, perfect, cool, smart
nonwhite (all)  speed, gift, versatile, gifted, playmaker, natural, monster, wow, beast, athletic
white (QBs)     cool, smart, favorite, safe, spectacular, excellent, class, fantastic, good, interesting
nonwhite (QBs)  ability, athletic, brilliant, awareness, quiet, highest, speed, wow, excited, wonderful

Table 4: Positive comments for nonwhite players (top two rows: all player mentions; bottom two rows: only quarterback mentions) focus on their athleticism, while white players are praised for personality and intelligence.

Limitations: The small lexicon results in the detection of relatively few sentiment-laden windows; furthermore, some of those are false positives (e.g., "beast mode" is the nickname of former NFL running back Marshawn Lynch). The former issue precludes a per-position analysis for all non-QB positions, as we are unable to detect enough sentiment terms to draw meaningful conclusions. The top two rows of Table 4, which were derived from all mentions regardless of position, are thus tainted by the positional confound discussed in Section 3.1. The bottom two rows of Table 4 are derived from the same analysis applied to just quarterback windows; qualitatively, the results appear similar to those in the top two rows. That said, we hope that future work on contextualized term extraction and sentiment detection in noisy domains can shed more light on the relationship between race and commentator sentiment patterns.
4 Related Work

Our work revisits specific findings from social science (§3) on racial bias in sports broadcasts. Such non-computational studies typically examine a small number of games drawn from a single season and rely on manual coding to identify differences in announcer speech (Rainville and McCormick, 1977; Billings, 2004; Rada and Wulfemeyer, 2005). For example, Rada (1996) performs a fine-grained analysis of five games from the 1992 season, coding for aspects such as players' cognitive or physical attributes. Our computational approach allows us to revisit this type of work (§3) using FOOTBALL, without relying on subjective human coding.

Within NLP, researchers have studied gender bias in word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017), racial bias in police stops (Voigt et al., 2017) and on Twitter (Hasanuzzaman et al., 2017), and biases in NLP tools like sentiment analysis systems (Kiritchenko and Mohammad, 2018). Especially related to our work is that of Ananya et al. (2019), who analyze mention-level gender bias, and Fu et al. (2019), who examine gender bias in tennis broadcasts. Other datasets in the sports domain include the event-annotated baseball commentaries of Keshet et al. (2011) and the WNBA and NBA basketball commentaries of Aull and Brown (2013), but we emphasize that FOOTBALL is the first large-scale sports commentary corpus annotated for race.

5 Conclusion

We collect and release FOOTBALL to support large-scale, longitudinal analysis of racial bias in sports commentary, a major category of mass media. Our analysis confirms the results of prior smaller-scale social science studies on commentator sentiment and naming patterns. However, we find that baseline NLP methods for quantifying mention-level genderedness (Ananya et al., 2019) and modeling covariate effects (Eisenstein et al., 2011) cannot overcome the statistical and linguistic confounds in this dataset. We hope that presenting such a technically challenging resource, along with an analysis showing the limitations of current bias-detection techniques, will contribute to the emerging literature on bias in language. Important future directions include examining the temporal aspect of bias as well as developing more precise mention identification techniques.

Acknowledgments

We thank the anonymous reviewers for their insightful comments, and Emma Strubell, Patrick Verga, David Fisher, Katie Keith, Su Lin Blodgett, and other members of the UMass NLP group for help with earlier drafts of the paper. This work was partially supported by NSF IIS-1814955.

References

Ananya Ananya, Nitya Parthasarthi, and Sameer Singh. 2019. GenderQuant: Quantifying mention-level genderedness. In Conference of the North American Chapter of the Association for Computational Linguistics.

G. Andrew and J. Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the International Conference of Machine Learning.

Laura L. Aull and David West Brown. 2013. Fighting words: A corpus analysis of gender representations in sports reportage. Corpora, 8(1).

Andrew C. Billings. 2004. Depicting the quarterback in black and white: A content analysis of college and professional football broadcast commentary. Howard Journal of Communications, 15(4).

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of Advances in Neural Information Processing Systems.

Toni Bruce. 2004. Marking the boundaries of the 'normal' in televised sports: The play-by-play of race. Media, Culture & Society, 26(6).

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334).

Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. 2011. Sparse additive generative models of text. In Proceedings of the International Conference of Machine Learning.

Liye Fu, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2019. Tie-breaker: Using language models to quantify gender bias in sports commentary. In International Joint Conference on Artificial Intelligence.

William L. Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of Empirical Methods in Natural Language Processing.

Abram Handler, Matthew Denny, Hanna M. Wallach, and Brendan T. O'Connor. 2016. Bag of what? Simple noun phrase extraction for text analysis. In NLP+CSS@EMNLP.

Mohammed Hasanuzzaman, Gaël Dias, and Andy Way. 2017. Demographic word embeddings for racism detection on Twitter. In International Joint Conference on Natural Language Processing.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Anna Jørgensen, Dirk Hovy, and Anders Søgaard. 2016. Learning a POS tagger for AAVE-like language. In Conference of the North American Chapter of the Association for Computational Linguistics.

Ezra Keshet, Terrence Szymanski, and Stephen Tyndall. 2011. BALLGAME: A corpus for computational semantics. In Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011).

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Joint Conference on Lexical and Computational Semantics.

Nathalie Koivula. 1999. Gender stereotyping in televised media coverage. Sex Roles, 41(7-8).

Richard Lapchick. 2014. The 2014 racial and gender report card: National Football League.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4).

Naoaki Okazaki. 2010. libLBFGS: A library of limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [online].

Michael Omi and Howard Winant. 2014. Racial Formation in the United States. Routledge.

Olutobi Owoputi, Brendan T. O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Conference of the North American Chapter of the Association for Computational Linguistics.

Frank Palotta. 2019. NFL ratings rebound after two seasons of declining viewership. CNN Business.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Conference of the North American Chapter of the Association for Computational Linguistics.

James A. Rada. 1996. Color blind-sided: Racial bias in network television's coverage of professional football games. Howard Journal of Communications, 7(3).

James A. Rada and K. Tim Wulfemeyer. 2005. Color coded: Racial descriptors in television coverage of intercollegiate sports. Journal of Broadcasting & Electronic Media, 49(1).

Raymond E. Rainville and Edward McCormick. 1977. Extent of covert racial prejudice in pro football announcers' speech. Journalism Quarterly, 54(1).

Maya Sen and Omar Wasow. 2016. Race as a 'bundle of sticks': Designs that estimate effects of seemingly immutable characteristics. Annual Review of Political Science, 19:499–522.

Rob Voigt, Nicholas P. Camp, Vinodkumar Prabhakaran, William L. Hamilton, Rebecca C. Hetey, Camilla M. Griffiths, David Jurgens, Dan Jurafsky, and Jennifer L. Eberhardt. 2017. Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25).
