PRELIMINARY VERSION: DO NOT CITE The AAAI Digital Library will contain the published version some time after the conference
Catch Me if I Can: Detecting Strategic Behaviour in Peer Assessment
Ivan Stelmakh, Nihar B. Shah and Aarti Singh
School of Computer Science
Carnegie Mellon University
{stiv, nihars, aarti}@cs.cmu.edu
Abstract

We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homeworks and peer review in hiring or promotions. When a peer-assessment task is competitive (e.g., when students are graded on a curve), agents may be incentivized to misreport evaluations in order to improve their own final standing. Our focus is on designing methods for the detection of such manipulations. Specifically, we consider a setting in which agents evaluate a subset of their peers and output rankings that are later aggregated to form a final ordering. In this paper, we investigate a statistical framework for this problem and design a principled test for detecting strategic behaviour. We prove that our test has strong false alarm guarantees and evaluate its detection ability in practical settings. For this, we design and conduct an experiment that elicits strategic behaviour from subjects and release a dataset of patterns of strategic behaviour that may be of independent interest. We use this data to run a series of real and semi-synthetic evaluations that reveal a strong detection power of our test.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Ranking a set of items submitted by a group of people (or ranking the people themselves) is a ubiquitous task that is faced in many applications, including education, hiring, employee evaluation and promotion, and academic peer review. Many of these applications have a large number of submissions, which makes obtaining an evaluation of each item by a set of independent experts prohibitively expensive or slow. Peer-assessment techniques offer an appealing alternative: instead of relying on independent judges, they distribute the evaluation task across the fellow applicants and then aggregate the received reviews into the final ranking of items. This paradigm has become popular for employee evaluation (Edwards and Ewen 1996) and grading students' homeworks (Topping 1998), and is now expanding to the more novel applications of massive open online courses (Kulkarni et al. 2013; Piech et al. 2013) and hiring at freelancing platforms (Kotturi et al. 2020).

The downside of such methods, however, is that reviewers are incentivized to evaluate their counterparts strategically to ensure a better outcome for their own item (Huang et al. 2019; Balietti, Goldstone, and Helbing 2016; Hassidim, Romm, and Shorrer 2018). Deviations from truthful behaviour decrease the overall quality of the resulting ranking and undermine the fairness of the process. This issue has led to a long line of work (Alon et al. 2009; Aziz et al. 2016; Kurokawa et al. 2015; Kahng et al. 2018; Xu et al. 2019) on designing "impartial" aggregation rules that eliminate the impact of the ranking returned by a reviewer on the final position of their own item.

While impartial methods remove the benefits of manipulation, such robustness may come at the cost of some accuracy loss when reviewers do not engage in strategic behaviour. This loss is caused by less efficient data usage (Kahng et al. 2018; Xu et al. 2019) and a reduction of the effort put in by reviewers (Kotturi et al. 2020). Implementation of such methods also introduces additional logistical burden on the system designers; as a result, in many critical applications (e.g., conference peer review) non-impartial mechanisms are employed. An important barrier that prevents stakeholders from making an informed choice to implement an impartial mechanism is a lack of tools to detect strategic behaviour. Indeed, to evaluate the trade-off between the loss of accuracy due to manipulations and the loss of accuracy due to impartiality, one needs to be able to evaluate the extent of strategic behaviour in the system. With this motivation, in this work we focus on detecting strategic manipulations in peer-assessment processes.

Specifically, we consider a setting where each reviewer is asked to evaluate a subset of works submitted by their counterparts. In a carefully designed randomized study of strategic behaviour when evaluations take the form of ratings, Balietti, Goldstone, and Helbing (2016) were able to detect manipulations by comparing the distribution of scores given by target reviewers to a truthful reference. However, other works (Huang et al. 2019; Barroga 2014) suggest that in more practical settings reviewers may strategically decrease some scores and increase others in an attempt to mask their manipulations or intentionally promote weaker submissions, thereby keeping the distribution of output scores unchanged and making distribution-based detection inapplicable. Inspired by this observation, we aim to design tools to detect manipulations when the distributions of scores output by reviewers are fixed, that is, we assume that evaluations are collected in the form of rankings. Ranking-based evaluation is used in practice (Hazelrigg 2013) and has theoretical properties that make it appealing for peer grading (Shah et al. 2013; Caragiannis, Krimpas, and Voudouris 2014), which provides additional motivation for our work.

Contributions In this work we present two sets of results.

• Theoretical. First, we propose a non-parametric test for the detection of strategic manipulations in the peer-assessment setup with rankings. Second, we prove that our test has reliable control over the false alarm probability (the probability of claiming existence of the effect when there is none). Conceptually, we avoid the difficulties associated with dealing with rankings as covariates by carefully accounting for the fact that each reviewer is "connected" to their submission(s); therefore, the manipulation they employ is naturally not an arbitrary deviation from the truthful strategy, but instead a deviation that potentially improves the outcome of their works.

• Empirical. On the empirical front, we first design and conduct an experiment that incentivizes strategic behaviour of participants. This experiment yields a novel dataset of patterns of strategic behaviour that can be useful for other researchers (the dataset is attached in the supplementary materials).¹ Second, we use the experimental data to evaluate the detection power of our test on the answers of real participants and in a series of semi-synthetic simulations. These evaluations demonstrate that our testing procedure has non-trivial detection power, while not making strong modelling assumptions on the manipulations employed by strategic agents.

¹Supplementary materials are on the first author's website.

Related work Although the motivation for this work comes from the studies of Balietti, Goldstone, and Helbing (2016) and Huang et al. (2019), an important difference between rankings and ratings that we highlight in Section 2.2 makes the models considered in these works inapplicable to our setup. Several other papers (Thurner and Hanel 2011; Cabotà, Grimaldo, and Squazzoni 2013; Paolucci and Grimaldo 2014) specialize on the problem of strategic behaviour in peer review and perform simulations to explore its detrimental impact on the quality of published works. These works are orthogonal to the present paper because they do not aim to detect the manipulations.

In this paper, we formulate the test for strategic behaviour as a test for independence of the rankings returned by reviewers from their own items. Classical statistical works (Lehmann and Romano 2005) on independence testing are not directly applicable to this problem due to the absence of low-dimensional representations of items. To avoid dealing with unstructured items, one could alternatively formulate the problem as a two-sample test and obtain a control sample of rankings from non-strategic reviewers. This approach, however, has two limitations. First, past work suggests that the test and control rankings may have different distributions even in the absence of manipulations due to a misalignment of incentives (Kotturi et al. 2020). Second, existing works (Mania et al. 2018; Gretton et al. 2012; Jiao and Vert 2018; Rastogi et al. 2020) on two-sample testing with rankings ignore the authorship information that is crucial in our case, as we show in the sequel (Section 2.2).

This paper also falls in the line of several recent works in computer science on the peer-evaluation process that includes both empirical (Tomkins, Zhang, and Heavlin 2017; Sajjadi, Alamgir, and von Luxburg 2016; Kotturi et al. 2020) and theoretical (Wang and Shah 2018; Stelmakh, Shah, and Singh 2018; Noothigattu, Shah, and Procaccia 2018) studies. Particularly relevant are recent papers (Tomkins, Zhang, and Heavlin 2017; Stelmakh, Shah, and Singh 2019) that consider the problem of detecting biases (e.g., gender bias) in single-blind peer review. The biases studied therein manifest in reviewers being harsher to some subset of submissions (e.g., those authored by females), making the methods designed in these works inapplicable to the problem we study. Indeed, in our case there does not exist a fixed subset of works that reviewers need to put at the bottom of their rankings to improve the outcome of their own submissions.

2 Problem Formulation

In this section we present our formulation of the manipulation-testing problem.

2.1 Preliminaries

In this paper, we operate in the peer-assessment setup in which reviewers first conduct some work (e.g., homework assignments) and then judge the performance of each other. We consider a setting where reviewers are asked to provide a total ranking of the set of works they are assigned to review. We let R = {1, 2, ..., m} and W = {1, 2, ..., n} denote the set of reviewers and the set of works submitted for review, respectively. We let matrix C ∈ {0, 1}^(m×n) represent conflicts of interest between reviewers and submissions, that is, the (i, j)th entry of C equals 1 if reviewer i is in conflict with work j and 0 otherwise. Matrix C captures all kinds of conflicts of interest, including authorship, affiliation and others, and many of them can be irrelevant from the manipulation standpoint (e.g., affiliation may put a reviewer in conflict with dozens of submissions they are not even aware of). We use A ∈ {0, 1}^(m×n) to denote a subset of "relevant" conflicts — those that reviewers may be incentivized to manipulate for — identified by stakeholders. For ease of presentation, we assume that A represents the authorship conflicts, as reviewers are naturally interested in improving the final standing of their own works, but in general it can capture any subset of conflicts. For each reviewer i ∈ R, the non-zero entries of the corresponding row of matrix A indicate the submissions that are (co-)authored by reviewer i. We let C(i) and A(i) ⊆ C(i) denote the possibly empty sets of works conflicted with and authored by reviewer i, respectively.

Each work submitted for review is assigned to λ non-conflicting reviewers subject to a constraint that each reviewer gets assigned µ works. For brevity, we assume that the parameters n, m, λ, µ are such that nλ = mµ, so we can assign exactly µ works to each reviewer. The assignment is represented by a binary matrix M ∈ {0, 1}^(m×n) whose (i, j)th entry equals 1 if reviewer i is assigned to work j and 0 otherwise. We call an assignment valid if it respects the (submission, reviewer)-loads (λ, µ) and does not assign a reviewer to a conflicting work. Given a valid assignment M of works W to reviewers R, for each i ∈ R, we use M(i) to denote the set of works assigned to reviewer i.
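For concreteness, the validity conditions on an assignment can be checked mechanically. The following is a minimal sketch (the function name and the list-of-lists encoding of M and C are ours, not part of the paper):

```python
from typing import List

def is_valid_assignment(M: List[List[int]], C: List[List[int]],
                        lam: int, mu: int) -> bool:
    """Check that a binary assignment matrix M is valid: every reviewer
    receives mu works, every work receives lam reviewers, and no reviewer
    is assigned a work they are in conflict with (C[i][j] == 1)."""
    m = len(M)      # number of reviewers
    n = len(M[0])   # number of works
    # Each reviewer i is assigned exactly mu works.
    if any(sum(row) != mu for row in M):
        return False
    # Each work j is assigned to exactly lam reviewers.
    if any(sum(M[i][j] for i in range(m)) != lam for j in range(n)):
        return False
    # No reviewer is assigned a conflicting work.
    return all(not (M[i][j] and C[i][j]) for i in range(m) for j in range(n))
```

For instance, with m = n = 5 and C the identity matrix (authorship conflicts only), a circulant assignment in which reviewer i reviews works i+1, i+2, i+3 (mod 5) satisfies λ = µ = 3 and avoids all conflicts.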
Π[M(i)] denotes the set of all |M(i)|! total rankings of these works, and reviewer i returns a ranking π_i ∈ Π[M(i)]. The rankings from all reviewers are aggregated to obtain a final ordering Λ(π_1, π_2, ..., π_m) that matches each work j ∈ W to its position Λ_j(π_1, π_2, ..., π_m), using some aggregation rule Λ known to all reviewers. The grades or other rewards are then distributed according to the final ordering Λ(π_1, π_2, ..., π_m), with authors of higher-ranked works receiving better grades or rewards.

In this setting, reviewers may be incentivized to behave strategically because the ranking they output may impact the outcome of their own works. The focus of this work is on designing tools to detect strategic behaviour of reviewers when a non-impartial aggregation rule Λ (e.g., a rule that theoretically allows reviewers to impact the final standing of their own submissions) is employed.

Figure 1: Comparison of fixed deterministic strategies available to a single strategic reviewer depending on the position of their work in the true underlying ranking.

2.2 Motivating Example

To set the stage, we start by highlighting an important difference between rankings and ratings in the peer-assessment setup. To this end, let us consider the experiment conducted by Balietti, Goldstone, and Helbing (2016), in which reviewers are asked to give a score to each work assigned to them for review and the final ranking is computed based on the mean score received by each submission. It is not hard to see that in their setting, the dominant strategy for each rational reviewer who wants to maximize the positions of their own works in the final ranking is to give the lowest possible score to all submissions assigned to them. Observe that this strategy is fixed, that is, it does not depend on the quality of the reviewer's work — irrespective of the position of their work in the underlying ordering, each reviewer benefits from assigning the lowest score to all submissions they review. Similarly, Huang et al. (2019) also operate with ratings and consider a fixed model of manipulations in which strategic agents increase the scores of low-quality submissions and decrease the scores of high-quality submissions, irrespective of the quality of the reviewers' works.

In contrast, when reviewers are asked to output rankings of submissions, the situation is different and reviewers can no longer rely on fixed strategies to gain the most for their own submission. To highlight this difference, let us consider a toy example of the problem with 5 reviewers and 5 submissions (m = n = 5), authorship and conflict matrices given by an identity matrix (C = A = I), and three works (reviewers) assigned to each reviewer (work), that is, λ = µ = 3. In this example, we additionally assume that: (i) the assignment of reviewers to works is selected uniformly at random from the set of all valid assignments, (ii) the aggregation rule Λ is the Borda count, that is, the positional scoring rule with weights equal to positions in the ranking,² (iii) reviewers are able to reconstruct the ground-truth ranking of the submissions assigned to them without noise, and (iv) all but one of the reviewers are truthful.

²We use the variant without tie-breaking — tied submissions share the same position in the final ordering.

Under this simple formulation, we qualitatively analyze the strategies available to the strategic reviewer, say reviewer i*. Specifically, following the rating setup, we consider the fixed deterministic strategies that do not depend on the work created by reviewer i*. Such strategies are limited to permutations of the ground-truth ranking of the submissions in M(i*). Figure 1 represents the expected gain of each strategy as compared to the truthful strategy for positions 2–4 of the work authored by reviewer i* in the ground-truth ranking, where the expectation is taken over the randomness in the assignment. The main observation is that there does not exist a fixed strategy that dominates the truthful strategy for every possible position of the reviewer's work. Therefore, in the setup with rankings, strategic reviewers need to consider how their own work compares to the works they rank in order to improve the outcome of their submission.

2.3 Problem Setting

With the motivation given in Section 2.2, we are ready to present the formal hypothesis-testing problem we consider in this work. When deciding how to rank the works, the information available to reviewers is the content of the works they review and the content of their own works. Observe that while a truthful reviewer does not take their own submissions into account when ranking the works of others, the aforementioned intuition suggests that the ranking output by a strategic agent should depend on their own works. Our formulation of the test for manipulations as an independence test captures this motivation.

Problem 1 (Testing for strategic behaviour). Given a non-impartial aggregation rule Λ, an assignment M of works to reviewers, rankings returned by reviewers {π_i, i ∈ R}, conflict matrix C, authorship matrix A and the set of works W submitted for review, the goal is to test the following hypotheses:

  Null (H_0): π_i ⫫ A(i) for every i ∈ R with A(i) ≠ ∅.
  Alternative (H_1): π_i ⫫ A(i) fails for some i ∈ R with A(i) ≠ ∅.

In words, under the null hypothesis, reviewers who have their submissions under review do not take into account their own works when evaluating the works of others and hence are not engaged in manipulations that can improve the outcome of their own submissions.
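The Borda count variant used in the toy example above (positional scores, no tie-breaking, tied submissions sharing a position) can be sketched as follows. This is an illustrative encoding of ours: each ranking is a list of work ids, best first, and lower aggregate score means a better final position.

```python
from typing import Dict, List

def borda_positions(rankings: List[List[int]]) -> Dict[int, int]:
    """Aggregate rankings with Borda count: a work scores the sum of its
    (1-indexed) positions across the rankings that contain it.  Works are
    ordered by increasing score; tied works share the same final position."""
    scores: Dict[int, int] = {}
    for ranking in rankings:
        for pos, work in enumerate(ranking, start=1):
            scores[work] = scores.get(work, 0) + pos
    # Competition-style ranking: ties share the position of the first
    # tied work; no tie-breaking is performed.
    position: Dict[int, int] = {}
    ordered = sorted(scores, key=lambda w: scores[w])
    for idx, work in enumerate(ordered, start=1):
        if idx > 1 and scores[work] == scores[ordered[idx - 2]]:
            position[work] = position[ordered[idx - 2]]
        else:
            position[work] = idx
    return position
```

For example, aggregating the two rankings [1, 2, 3] and [2, 1, 3] gives works 1 and 2 the same total score, so they share position 1, while work 3 takes position 3.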
In contrast, under the alternative hypothesis, some reviewers choose their ranking depending on how their own works compare to the works they rank, suggesting that they are engaged in manipulations.

Assumptions Our formulation of the testing problem makes two assumptions about the data-generation process to ensure that an association between the works authored by reviewer i and the ranking π_i can be caused only by strategic manipulations and not by some intermediate mediator variables.

(A1) Random assignment. We assume that the assignment of works to reviewers is selected uniformly at random from the set of all assignments that respect the conflict matrix C. This assumption ensures that the works authored by a reviewer do not impact the set of works assigned to them for review. The assumption of random assignment holds in many applications, including peer grading (Freeman and Parks 2010; Kulkarni et al. 2013) and the NSF review of proposals (Hazelrigg 2013).

(A2) Independence of ranking noise. We assume that under the null hypothesis of absence of strategic behaviour, the reviewer identity is independent of the works they author, that is, the noise in reviewers' evaluations (e.g., the noise due to subjectivity of the truthful opinion) is not correlated with their submissions. This assumption is satisfied by various popular models for the generation of rankings, including the Plackett-Luce model (Luce 1959; Plackett 1975) and the more general location-family random utility models (Soufiani, Parkes, and Xia 2012).

Of course, the aforementioned assumptions may be violated in some practical applications.³ For example, in conference peer review, the reviewer assignment is performed in a manner that maximizes the similarity between papers and reviewers, and hence is not independent of the content of the submissions. While the test we design subsequently does not control the false alarm probability in this case, we note below that the output of our test is still meaningful even when these assumptions are violated.

³Assumption (A1) can be relaxed (Appendix B) to allow for assignments of any fixed topology. We examine the behaviour of our test under a realistic violation of Assumption (A2) in Appendix A.2.

3 Testing Procedure

In this section, we introduce our testing procedure. Before we delve into the details, we highlight the main intuition that determines our approach to the testing problem. Observe that when a reviewer engages in strategic behaviour, they tweak their ranking to ensure that their own works experience a better outcome when all rankings are aggregated by the rule Λ. Hence, when successful strategic behaviour is present, we may expect to see that the ranking returned by a reviewer influences the position of their own works under aggregation rule Λ in a more positive way than that of other works not reviewed by this reviewer. Therefore, the test we present in this work attempts to identify whether the rankings returned by reviewers have a more positive impact on the final standing of their own works than what would happen by chance.

For any reviewer i ∈ R, let U_i be the uniform distribution over the rankings Π[M(i)] of the works assigned to them for review. With this notation, we formally present our test as Test 1 below. Among other arguments, our test accepts an optional set of rankings {π*_i, i ∈ R}, where for each i ∈ R, π*_i is a ranking of the works M(i) assigned to reviewer i, but one constructed by an impartial agent (e.g., an outside reviewer who has no work in submission). For ease of exposition, let us first discuss the test in the case when the optional set of rankings is not provided (i.e., the test has no supervision), and then we will make a case for the usefulness of this set.

In Step 1, the test statistic is computed as follows: for each reviewer i ∈ R and for each work j ∈ A(i) authored by this reviewer, we compute the impact of the ranking returned by the reviewer on the final standing of this work. To this end, we compare the position actually taken by the work (the first term in the inner difference in Equation 1) to the expected position it would take if the reviewer sampled their ranking of the works M(i) uniformly at random (the second term in the inner difference in Equation 1). To see the motivation behind this choice of the test statistic, note that if a reviewer i is truthful, then the ranking they return may be either better or worse for their own submissions than a random ranking, depending on how their submissions compare to the works they review. In contrast, a strategic reviewer may choose the ranking that delivers a better final standing for their submissions, biasing the test statistic to the negative side.

Having defined the test statistic, we now study its behaviour under the null hypothesis to quantify when its value is too extreme to be observed in the absence of manipulations for a given significance level α. To this end, we note that for a given assignment matrix M, there are many pairs of conflict and authorship matrices (C′, A′) that (i) are equal to the actual matrices C and A up to permutations of rows and columns and (ii) do not violate the assignment M, that is, do not declare a conflict between any pair of reviewer i and submission j such that submission j is assigned to reviewer i in M. Next, observe that under the null hypothesis of absence of manipulations, the behaviour of reviewers would not change if matrix A were substituted by another such matrix A′, that is, the ranking returned by any reviewer i would not change if that reviewer were an author of the works A′(i) instead of A(i). Given that the structure of the alternative matrices C′ and A′ is the same as that of the actual matrices C and A, under the null hypothesis of absence of manipulations we expect the actual test statistic to have a value similar to that computed under C′ and A′.

The aforementioned idea drives Steps 2–4 of the test. In Step 2 we construct the set of all pairs of conflict and authorship matrices of the fixed structure that do not violate the assignment M. We then compute the value of the test statistic for each of these authorship matrices in Step 3 and finally reject the null hypothesis in Step 4 if the actual value of the test statistic τ appears too extreme against the values computed in Step 3 for the given significance level α.
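The rejection decision of Step 4 has the shape of a standard permutation-test rule. The following is a simplified sketch of ours, not the paper's exact specification: we assume a one-sided lower-tail rejection (since manipulation biases τ to the negative side) and use the standard add-one permutation p-value for validity.

```python
from typing import Sequence

def permutation_test(tau: float,
                     null_values: Sequence[float],
                     alpha: float) -> bool:
    """Steps 2-4 in miniature: compare the observed statistic tau with the
    values of the same statistic computed under the alternative authorship
    matrices A' (null_values).  Manipulation biases tau downward, so we
    reject when tau falls in the lower alpha-tail of the null values."""
    # Add-one correction yields a valid permutation p-value.
    p_value = (1 + sum(v <= tau for v in null_values)) / (1 + len(null_values))
    return p_value <= alpha  # True means: reject the null hypothesis
```

For instance, an observed τ far below all nine null values gives a p-value of 1/10 and is rejected at α = 0.1, while a τ equal to every null value is not.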
Therefore, the test we present in this work at- finally reject the null hypothesis in Step 4 if the actual value tempts to identify whether rankings returned by reviewers of the test statistic ⌧ appears to be too extreme against values have a more positive impact on the final standing of their computed in Step 3 for the given significance level ↵. own works than what would happen by chance. If additional information in the form of impartial rank- 3Assumption (A1) can be relaxed (Appendix B) to allow for as- ings is available (i.e., the test has a supervision), then our signments of any fixed topology. We examine the behaviour of our test can detect manipulations better. The idea of supervision test under realistic violation of Assumption (A2) in Appendix A.2 is based on the following intuition. In order to manipulate Test 1 Test for strategic behaviour Input: Reviewers’ rankings ⇡i,i Assignment M of works{ to2 reviewersR} Conflict and authorship matrices (C, A) Significance level ↵, aggregation rule ⇤ Optional Argument: Impartial rankings ⇡⇤,i { i 2R} 1. Compute the test statistic ⌧ as
⌧ = ⇤j(⇡10 ,⇡20 ,...,⇡i,...,⇡m0 ) E⇡˜ i ⇤j(⇡10 ,⇡20 ,...,⇡˜,...,⇡m0 ) , (1) ⇠U i j A(i) X2R 2X ⇣ ⇥ ⇤ ⌘ where ⇡0,i , equals ⇡⇤ if the optional argument is provided and equals ⇡ otherwise. i 2R i i 2. Compute a multiset (M) as follows. For each pair (p ,p ) of permutations of m and n items, respectively, apply permu- P m n tation pm to rows of matrices C and A and permutation pn to columns of matrices C and A. Include the obtained matrix A0 to (M) if it holds that for each i : P 2R A0(i) C0(i) M(i). ✓ ⇢W\ 3. For each matrix A0 (M) define '(A0) to be the value of the test statistic (1) if we substitute A with A0, that is, '(A0) is 2P the value of the test statistic if the authorship relationship was represented by A0 instead of A. Let