




…and entertain us with their high-spirited debates, but how much do they, and other movie reviewers, really disagree?

Evaluating Agreement and Disagreement Among Movie Reviewers

Alan Agresti and Larry Winner

Thumbs up, or thumbs down? Two reviewers face each other across a theater aisle, arguing, sometimes forcefully, the merits and faults of the latest releases.

This is the entertaining, and often imitated, format that Chicago's Gene Siskel and Roger Ebert originated some 20 years ago with a local program in their home city. Siskel and Ebert's growing popularity led to their program on PBS and later their syndicated show, currently distributed by Buena Vista Television, Inc. In their day jobs, they are rival movie critics at the Chicago Tribune and the Chicago Sun-Times, respectively. They highlight this friendly rivalry in their on-camera face-offs, often creating the impression that they strongly disagree about which movies deserve your entertainment time and dollars.

But how strongly do they really disagree? In this article, we'll study this question. We'll also compare their patterns of agreement and disagreement to those of Michael Medved and Jeffrey Lyons, the reviewers on Sneak Previews between 1985 and fall 1996. We then look at whether the degree of disagreement between Siskel and Ebert and between Medved and Lyons is typical of movie reviewers by evaluating agreement and disagreement for the 28 pairs of eight popular movie reviewers.

The Database

Each week in an article titled "Crix' Picks," Variety magazine summarizes reviews of new movies by critics in New York, Los Angeles, Washington, DC, Chicago, and London. Each review is categorized as Pro, Con, or Mixed, according to whether the overall evaluation is positive, negative, or a mixture of the two. We constructed a database using reviews of movies for the period April 1995 through September 1996. The database contains the Variety ratings for these critics as well as some explanatory variables for the movies, discussed later, that could influence the ratings.

Summarizing the Siskel and Ebert Ratings

Table 1 shows the ratings by Siskel and Ebert of the 160 movies they both reviewed during the study period. This square contingency table shows the counts of the nine possible combinations of ratings. For instance, for 24 of the 160 movies, both Siskel and Ebert gave a Con rating, "thumbs down." The 24 + 13 + 64 = 101 observations on the main diagonal of Table 1 are the movies for which they agreed, giving the same rating. Their agreement rate was 63% (i.e., 101/160). Fig. 1 portrays the counts in Table 1.

Figure 1. Movie ratings for Siskel and Ebert.

To achieve perfect agreement, all observations would need to fall on the main diagonal. Fig. 2 portrays corresponding ratings having perfect agreement. When there is perfect agreement, the number of observations in each category is the same for both reviewers. That is, the row marginal percentages are the same as the corresponding column marginal percentages, and the table satisfies marginal homogeneity.
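As a quick check on these figures, here is a minimal sketch in Python (NumPy assumed; the tool and layout are an editorial choice, not part of the original analysis) that recovers the 63% agreement rate and each reviewer's marginal rating distribution from the counts in Table 1.

```python
import numpy as np

# Table 1: Siskel (rows) by Ebert (columns), categories ordered Con, Mixed, Pro
counts = np.array([[24,  8, 13],    # Siskel: Con
                   [ 8, 13, 11],    # Siskel: Mixed
                   [10,  9, 64]])   # Siskel: Pro

n = counts.sum()                    # 160 movies rated by both reviewers
agreements = np.trace(counts)       # 24 + 13 + 64 = 101 on the main diagonal
print(f"agreement rate: {agreements / n:.0%}")   # 63%

# Marginal rating distributions for each reviewer
for name, margin in [("Siskel", counts.sum(axis=1)), ("Ebert", counts.sum(axis=0))]:
    con, mixed, pro = margin / n
    print(f"{name}: Pro {pro:.0%}, Mixed {mixed:.0%}, Con {con:.0%}")
```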

In Table 1, the relative frequencies of the ratings (Pro, Mixed, Con) were (52%, 20%, 28%) for Siskel and (55%, 19%, 26%) for Ebert. Though they are not identical, the percentage of times that each of the three ratings occurred is similar for the two raters. There is not a tendency for Siskel or Ebert to be easier or tougher than the other in his ratings. If this were not true, it would be more difficult to achieve decent agreement. If one reviewer tends to give tougher reviews than the other, the agreement may be weak even if the statistical association between the reviewers is strong.

Figure 2. Movie ratings showing perfect agreement.

The agreement in Table 1 seems fairly good, better than we might have expected. In particular, the two largest counts occur in cells where both Siskel and Ebert gave Pro ratings or they both gave Con ratings. If the ratings had been statistically independent, however, a certain amount of agreement would have occurred simply "by chance." The cell frequencies expected under this condition are shown in parentheses in Table 1. These are the expected frequencies for the Pearson chi-squared test of independence for a contingency table. If Siskel's and Ebert's ratings had no association, we would still expect agreement in (11.8 + 6.0 + 45.6) = 63.4 of their evaluations (a 39.6% agreement rate). The observed counts are larger than the expected counts on the main diagonal and smaller off that diagonal, reflecting better than expected agreement and less than expected disagreement.

Of course, having agreement that is better than chance agreement is no great accomplishment, and the strength of that agreement is more relevant. Where does the Siskel and Ebert agreement fall on the spectrum ranging from statistical independence to perfect agreement? A popular measure for summarizing agreement with categorical scales is Cohen's kappa. It equals the difference between the observed number of agreements and the number expected by chance (i.e., if the ratings were statistically independent), divided by the maximum possible value of that difference. For the 160 observations with 101 agreements and 63.4 expected agreements in Table 1, for instance, sample kappa compares the difference 101 - 63.4 = 37.6 to the maximum possible value of 160 - 63.4 = 96.6, equaling 37.6/96.6 = .389. The sample difference between the observed agreement and the agreement expected under independence is 39% of the maximum possible difference. Kappa equals 0 when the ratings are statistically independent and equals 1 when there is perfect agreement. According to this measure, the agreement between Siskel and Ebert is not impressive, being moderate at best.

The rating scale (Pro, Mixed, Con) is ordinal, and Cohen's kappa does not take into account the severity of disagreement. A disagreement in which Siskel's rating is Pro and Ebert's is Con is treated no differently than one in which Siskel's rating is Pro and Ebert's is Mixed. A generalization of kappa, called weighted kappa, is designed for ordinal scales and places more weight on disagreements that are more severe. For Table 1, weighted kappa equals .427, which is also not especially strong.

Symmetric Disagreement Structure

Table 1 is consistent with an unusually simple disagreement structure. The counts are roughly symmetric about the main diagonal. For each of the three pairs of categories (x, y) for which the raters disagree, the number of times that Siskel's rating is x and Ebert's is y is about the same as the number of times that Siskel's rating is y and Ebert's is x. The model of symmetry for square contingency tables states that the probability of each pair of ratings (x, y) for (Siskel, Ebert) is the same as the probability of the reversed pair of ratings (y, x). In fact, this model fits Table 1 well. The symmetry model has cell expected frequencies that average the pairs of counts that fall across the main diagonal from each other. For instance, the expected frequencies are (13 + 10)/2 = 11.5 for the two cells in which one rating is Pro and the other is Con. The Pearson chi-squared statistic for testing the fit of the symmetry model is the sum of (observed - expected)²/expected for the six cells corresponding to the three disagreement pairs. It equals .59, based on df = 3, showing that the data are consistent with the hypothesis of symmetry (P = .90).

Whenever a square contingency table satisfies symmetry, it also satisfies marginal homogeneity. The counts in Table 1 are within the limits of sampling error both for the conditions of symmetry and marginal homogeneity. Nonetheless, both these conditions can occur even if the ratings show weak agreement or are statistically independent, so we next focus on the kappa measures for summarizing strength of agreement.

Table 1 - Ratings of 160 Movies by Gene Siskel and Roger Ebert, with Expected Frequencies in Parentheses for Statistical Independence

                             Ebert rating
Siskel rating   Con           Mixed         Pro           Total
Con             24 (11.8)      8 (8.4)      13 (24.8)      45
Mixed            8 (8.4)      13 (6.0)      11 (17.6)      32
Pro             10 (21.8)      9 (15.6)     64 (45.6)      83
Total           42            30            88            160

Source: Data taken from Variety, April 1995 through September 1996.
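The kappa, weighted kappa, and symmetry calculations above can be reproduced along the same lines. The sketch below assumes NumPy and SciPy are available and uses linear weights for weighted kappa; the article does not state which weights were used, and linear weights are assumed here because they reproduce the reported value of .427.

```python
import numpy as np
from scipy.stats import chi2

# Table 1 counts: Siskel rows, Ebert columns, categories ordered Con, Mixed, Pro
counts = np.array([[24,  8, 13],
                   [ 8, 13, 11],
                   [10,  9, 64]], dtype=float)
n = counts.sum()
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n   # independence fit

# Cohen's kappa: (observed agreement - chance agreement) / (n - chance agreement)
obs_agree = np.trace(counts)             # 101
exp_agree = np.trace(expected)           # about 63.4
kappa = (obs_agree - exp_agree) / (n - exp_agree)
print(f"kappa = {kappa:.3f}")            # 0.389

# Weighted kappa with linear weights (assumed): full credit on the diagonal,
# half credit for a one-step disagreement (e.g., Pro vs Mixed), none for Pro vs Con
k = counts.shape[0]
i, j = np.indices((k, k))
w = 1 - np.abs(i - j) / (k - 1)
wkappa = ((w * counts).sum() - (w * expected).sum()) / (n - (w * expected).sum())
print(f"weighted kappa = {wkappa:.3f}")  # 0.427

# Pearson test of symmetry: compare each off-diagonal count with its mirror image;
# this is algebraically the same as summing (observed - expected)^2/expected over
# the six off-diagonal cells with expected = the average of the mirrored counts
X2 = sum((counts[a, b] - counts[b, a]) ** 2 / (counts[a, b] + counts[b, a])
         for a in range(k) for b in range(a + 1, k))
df = k * (k - 1) // 2                    # 3 disagreement pairs
print(f"X2 = {X2:.2f}, df = {df}, P = {chi2.sf(X2, df):.2f}")   # X2 = 0.59, P = 0.90
```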

Agreement for Other Raters

Our summary of Table 1 using kappa seems to confirm what viewers see on television: Siskel's and Ebert's level of agreement is not especially strong. But how does it compare to the agreement between other movie reviewers? We next study Table 2, which shows joint ratings of the most recent Sneak Previews reviewers, Michael Medved and Jeffrey Lyons. One difference we notice immediately is that the ratings are not symmetric. For instance, there are many more cases in which Lyons's rating is Pro and Medved's is Con than the reverse. The marginal row and column totals suggest that Medved is a much more critical reviewer, tending to give more negative reviews than Lyons.

As already mentioned, strong agreement is difficult to achieve when the overall distributions of ratings differ substantially between two reviewers. For Table 2, in fact, weighted kappa is only .204, about half as large as the value for Siskel and Ebert. Many of the disagreements are ones in which Medved's rating is Con while Lyons's rating is Pro. Except for the spike in the cell where both ratings are Con, Medved's rating is essentially independent of Lyons's. In fact, the model of independence applied only to the other eight cells in Table 2 (which is a special case of a quasi-independence model) fits well. Its chi-squared goodness-of-fit statistic equals .8 with df = 3. We mention this statistic only as an informal index, because the figure motivated the choice of model.

Figure 3. Movie ratings for Medved and Lyons.

Variety reports ratings for several reviewers, so we next analyzed how the Siskel and Ebert agreement compares to agreement among other popular reviewers. Table 3 is a matrix of kappa and weighted kappa values for eight reviewers: the four already mentioned, as well as Peter Travers of Rolling Stone, Rex Reed of the New York Observer, Gene Shalit of The Today Show (NBC), and Joel Siegel of Good Morning America (ABC). Of the 28 pairs of reviewers, the Siskel and Ebert agreement is the strongest, according to either agreement measure. For instance, the next strongest kappa after the value of .389 between Siskel and Ebert is .283 between Shalit and Travers. Though the values are not entirely comparable, the overwhelming impression one gets from this table is that many reviewers simply don't agree much more strongly than they would if they were randomly and blindly making their ratings.

In Table 3, one reviewer stands out from the others. Michael Medved has very poor agreement with each of the other reviewers.

Table 3 - Matrix of Agreement Indexes for Eight Movie Reviewers and a Consensus Rating, with Weighted Kappa Above the Main Diagonal and Cohen's Kappa Below the Main Diagonal

           Ebert  Lyons  Medved  Reed   Shalit  Siegel  Siskel  Travers  Consensus
Ebert      1.0    .209   .178    .200   .361    .224    .427    .210     .431
Lyons      .182   1.0    .204    .330   .183    .285    .267    .178     .404
Medved     .140   .177   1.0     .210   .081    .103    .229    .161     .356
Reed       .138   .259   .180    1.0    .295    .226    .250    .215     .403
Shalit     .232   .110   .033    .215   1.0     .217    .305    .374     .552
Siegel     .218   .224   .081    .211   .174    1.0     .227    .348     .410
Siskel     .389   .240   .211    .177   .239    .176    1.0     .246     .499
Travers    .123   .124   .129    .143   .283    .275    .170    1.0      .475
Consensus  .315   .284   .270    .304   .460    .303    .387    .383     1.0
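Table 3 contains one kappa and one weighted kappa per pair of reviewers, presumably computed from the movies that both members of each pair rated. A minimal sketch of that pairwise computation follows; the data layout (one list of ratings per reviewer, aligned by movie, with None where a reviewer skipped a film) and the tiny example values are hypothetical, not the actual Variety database.

```python
import itertools
import numpy as np

CATS = ["Con", "Mixed", "Pro"]          # ordinal category order used throughout

def kappa(table, weighted=False):
    """Cohen's kappa, or linearly weighted kappa, from a square table of counts."""
    table = np.asarray(table, dtype=float)
    n, k = table.sum(), table.shape[0]
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    i, j = np.indices((k, k))
    w = 1 - np.abs(i - j) / (k - 1) if weighted else (i == j).astype(float)
    return ((w * table).sum() - (w * expected).sum()) / (n - (w * expected).sum())

# Hypothetical layout: one list of ratings per reviewer, aligned by movie,
# with None where that reviewer did not review the movie
ratings = {
    "Siskel": ["Pro", "Con", "Mixed", "Pro"],
    "Ebert":  ["Pro", "Con", "Pro",   "Mixed"],
    "Lyons":  ["Pro", "Pro", "Mixed", None],
    "Medved": ["Con", "Con", "Mixed", "Con"],
}

def pair_table(a, b):
    """Contingency table of reviewer a (rows) vs reviewer b (cols) over commonly rated movies."""
    t = np.zeros((len(CATS), len(CATS)))
    for ra, rb in zip(ratings[a], ratings[b]):
        if ra is not None and rb is not None:
            t[CATS.index(ra), CATS.index(rb)] += 1
    return t

# One kappa and one weighted kappa per pair, as in the 28 pairs behind Table 3
for a, b in itertools.combinations(ratings, 2):
    t = pair_table(a, b)
    print(f"{a}-{b}: kappa = {kappa(t):.3f}, weighted kappa = {kappa(t, weighted=True):.3f}")
```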

Who's the Toughest, Who's the Easiest?

Table 4 shows the distribution of ratings across the three categories for each of the eight reviewers. This table reveals one reason why the agreement tends to be, at best, moderate. The reviewers varied considerably in their propensity to assign Pro and Con ratings. For instance, Travers gave a Pro rating only 28.0% of the time, whereas Siegel gave it 56.0% of the time.

Table 4 - Marginal Distributions of Ratings by the Eight Reviewers

Reviewer     Pro     Mixed   Con
Ebert        55.0%   18.8%   26.2%
Lyons        52.0%   16.3%   31.7%
Medved       35.5%   25.7%   38.8%
Reed         42.6%   17.9%   39.5%
Shalit       49.1%   26.3%   24.6%
Siegel       56.0%   15.7%   28.4%
Siskel       51.9%   20.0%   28.1%
Travers      28.0%   34.8%   37.1%
Consensus    27.2%   56.4%   16.4%

Nonetheless, even given this variability in ratings' distributions, the agreement shown in Table 3 is quite unimpressive. For instance, raters with the margins displayed in Table 2 for Lyons and Medved have the potential for a weighted kappa as high as .707, much larger than the observed value of .204. This would occur for the counts portrayed in Fig. 4, which show the greatest agreement possible for the given margins. We must conclude that, even for the given ratings distributions, ratings among movie reviewers show weak agreement.

Figure 4. Movie ratings showing the strongest possible agreement, given the margins exhibited by Medved and Lyons.

Agreement With a Consensus Rating

For each movie rated, we also noted the consensus rating. We define this to be the rating nearest the mean of the reviewers' ratings, based on equally spaced scores for the three response categories. As Table 4 shows, the consensus rating is much more likely to be Mixed than is the rating by any particular reviewer. How strong is the agreement between each reviewer and this consensus rating?

Table 3 also shows the kappa and weighted kappa values between each reviewer and the consensus. Not surprisingly, the reviewers tended to agree more strongly with the consensus than with other reviewers. For each reviewer, the kappa and weighted kappa values with the consensus exceed the corresponding values with every other reviewer, the only exception being the kappa between Siskel and Ebert. The strength of agreement is still not exceptionally strong, however. The agreement tends to be slightly weaker yet if we form kappa between a reviewer and the consensus of the other reviewers (i.e., excluding that reviewer in determining the consensus).
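Here is a minimal sketch of the consensus rule just described: score the ordinal ratings with equally spaced values, average the available ratings for a movie, and take the category nearest that mean. The particular scores 1, 2, 3 and the tie-breaking behavior are assumptions, since the article specifies neither.

```python
SCORES = {"Con": 1, "Mixed": 2, "Pro": 3}    # equally spaced scores; the coding 1, 2, 3 is assumed

def consensus(ratings):
    """Consensus rating for one movie: the category nearest the mean of the reviewers' scores."""
    rated = [r for r in ratings if r is not None]          # ignore reviewers who skipped the movie
    mean_score = sum(SCORES[r] for r in rated) / len(rated)
    # A mean exactly halfway between two categories needs a tie-breaking rule;
    # the article does not give one, and min() here simply keeps the first candidate.
    return min(SCORES, key=lambda cat: abs(SCORES[cat] - mean_score))

print(consensus(["Pro", "Pro", "Con", "Mixed"]))   # mean 2.25, nearest category: Mixed
```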
What Causes Disagreement?

We next studied whether certain explanatory variables suggested reasons for the weak agreement. We added three indicator explanatory variables representing our judgment about whether the film contained large amounts of violence, contained large amounts of sex, or was strong in promoting family values. We also recorded a film's studio, genre (action/adventure, comedy, drama, etc.), and rating by the Motion Picture Association of America (MPAA). Perhaps surprisingly, none of these factors showed much association with observed disagreement between reviewers. Partly this may reflect the relatively small number of movies in this sample judged to have large amounts of sex or violence.

We also studied the effects of these factors on the ratings themselves by the various reviewers. The only noticeable effects concerned high sexual content and the ratings by Medved and Lyons. For instance, for movies with high sexual content, the percentage of Con ratings was 78% for Medved and 62% for Lyons, compared to Con percentages of 37% by Medved and 31% by Lyons for other movies. By contrast, the Con percentages for movies with high sexual content versus other movies were 33% and 30% for Siskel and 41% and 25% for Ebert.

Disagreements Among Movie Raters

Disagreement among a group of professional movie reviewers is not uncommon. For the movies in this database for which Siskel, Ebert, Lyons, and Medved all provided ratings, only 21.3% had the same rating by all four reviewers. Here are examples of movie ratings in which disagreements occurred, often between Medved and the others (P = Pro, M = Mixed, and C = Con):

Movie                            Siskel  Ebert  Lyons  Medved
Nixon                            P       P      P      C
Forget Paris                     P       P      P      C
Pocahontas                       P       P      P      C
Dangerous Minds                  C       C      P      C
Seven                            P       P      P      C
Leaving Las Vegas                P       P      C      C
Waterworld                       C       M      C      C
Bridges of Madison County        P       P      P      C
Jumanji                          C       C      P      P
Hunchback of Notre Dame          P       P      P      C
How to Make an American Quilt    M       C      P      C
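The 21.3% figure quoted in the box is the share of commonly rated movies on which all four reviewers gave the same rating. A small sketch of that calculation, using the same kind of hypothetical per-reviewer lists as the earlier sketch rather than the actual database:

```python
# Hypothetical per-reviewer rating lists, aligned by movie (None = not reviewed)
ratings = {
    "Siskel": ["Pro", "Con", "Mixed", "Pro"],
    "Ebert":  ["Pro", "Con", "Pro",   "Mixed"],
    "Lyons":  ["Pro", "Pro", "Mixed", None],
    "Medved": ["Pro", "Con", "Mixed", "Con"],
}

movies = list(zip(*(ratings[r] for r in ("Siskel", "Ebert", "Lyons", "Medved"))))
rated_by_all = [m for m in movies if None not in m]          # movies all four reviewed
unanimous = sum(len(set(m)) == 1 for m in rated_by_all)      # all four ratings identical
print(f"{unanimous / len(rated_by_all):.1%} of commonly rated movies are unanimous")
```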
So, what film qualities help determine a reviewer's or other movie-goer's rating? It's not possible to draw conclusions based on our limited sample, but some research has been conducted in the fields of marketing and management to quantify differences among individuals in movie enjoyment. Eliashberg and Sawhney (1994) attempted to predict enjoyment of hedonic experiences in general, and movie enjoyment in particular. The variables they used in prediction were:

• Temporary moods of individuals
  - Arousal (low, high)
  - Pleasure (low, high)
• Individual sensation-seeking tendency (ordinal scale, four levels)
• Individual-specific mood transition variability (times spent in mood states)
• Movie-specific emotional content (varies from scene to scene)
  - Arousal (low, high)
  - Pleasure (low, high)

Although their model, at best, had limited success at predicting the enjoyment (on two scales) of 36 individuals watching a single made-for-HBO movie, their theoretical framework seems reasonable and provides possible reasons for differences in ratings among viewers of the same movie. For instance, violent and sex-themed movies might be more popular among viewers with higher "sensation-seeking tendencies," and family fare may be more popular among viewers with lower "sensation-seeking tendencies." Of course, initial mood state and mood transition rates could affect ratings as well. Unfortunately, there are no means of measuring these effects with our current database, and obtaining reliable and fair measurement would probably require replication, which would be impossible.

In our opinion, there are likely to be many very diverse reasons for the potential enjoyment of a movie, most of which have small effects. That's probably for the best, or else Hollywood would use a standard regression formula and make movies that are even more predictable than they currently are (anyone for Rocky/Rambo n?).

The effects of whatever factors do influence ratings probably differ between professional movie reviewers and the general movie-going public. One might well be more successful using such factors to predict movie enjoyment for the general public than for critics. For instance, individuals are likely to form specific preferences based on genre or degree of sex or violence in a movie, whereas most critics aim to assess the overall quality of the film, regardless of such characteristics. A result is that variation in levels of agreement among members of the movie-going population is probably at least as great as among movie reviewers.

Based on the weak agreements found in this article among "expert" reviewers, we should not be surprised to find ourselves disagreeing with the assessment of any particular reviewer. But over time, our individual experience may help us to determine with which reviewer we tend to agree most strongly. For these two authors, Agresti listens most closely to Gene Siskel (weighted kappa = .507) and Winner listens most closely to Roger Ebert (weighted kappa = .475). See you at the movies.

[The authors thank Jacki Levine, Assistant Managing Editor of The Gainesville Sun, for many helpful comments.]

References and Further Reading

Eliashberg, J., and Sawhney, M. (1994), "Modeling Goes to Hollywood: Predicting Individual Differences in Movie Enjoyment," Management Science, 40, 1151-1173.

Medved, M. (1992), Hollywood Versus America: Popular Culture and the War on Traditional Values, New York: Harper Collins.

Spitzer, R. L., Cohen, J., Fleiss, J. L., and Endicott, J. (1967), "Quantification of Agreement in Psychiatric Diagnosis," Archives of General Psychiatry, 17, 83-87.
