Understanding Differences in Perceived Peer-Review Helpfulness using Natural Language Processing

Wenting Xiong
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
[email protected]

Diane Litman
Department of Computer Science & Learning Research and Development Center
University of Pittsburgh
Pittsburgh, PA 15260
[email protected]

Abstract

Identifying peer-review helpfulness is an important task for improving the quality of feedback received by students, as well as for helping students write better reviews. As we tailor standard product-review analysis techniques to our peer-review domain, we notice that peer-review helpfulness differs not only between students and experts but also between types of experts. In this paper, we investigate how different types of perceived helpfulness might influence the utility of features for automatic prediction. Our feature selection results show that certain low-level linguistic features are more useful for predicting student perceived helpfulness, while high-level cognitive constructs are more effective in modeling experts' perceived helpfulness.

1 Introduction

Peer review of writing is a commonly recommended technique to include in good writing instruction. It not only provides more feedback compared to what students might get from their instructors, but also provides opportunities for students to practice writing helpful reviews. While existing web-based peer-review systems facilitate peer review from the logistic aspect (e.g. collecting papers from authors, assigning reviewers, and sending reviews back), there still remains the problem that the quality of peer reviews varies, and potentially good feedback is not written in a helpful way. To address this issue, we propose to add a peer-review helpfulness model to current peer-review systems, to automatically predict peer-review helpfulness based on features mined from textual reviews using Natural Language Processing (NLP) techniques. Such an intelligent component could enable peer-review systems to 1) control the quality of peer reviews that are sent back to authors, so authors can focus on the helpful ones; and 2) provide feedback to reviewers with respect to their reviewing performance, so students can learn to write better reviews.

In our prior work (Xiong and Litman, 2011), we examined whether techniques used for predicting the helpfulness of product reviews (Kim et al., 2006) could be tailored to our peer-review domain, where the definition of helpfulness is largely influenced by the educational context of peer review. While previously we used the average of two expert-provided ratings as our gold standard of peer-review helpfulness,[1] there are other types of helpfulness rating (e.g. author perceived helpfulness) that could be the gold standard, and that could potentially impact the features used to build the helpfulness model. In fact, we observe that peer-review helpfulness seems to differ not only between students and experts (Example 1), but also between types of experts (Example 2).

[1] Averaged ratings are considered more reliable since they are less noisy.

In the following examples, students judge helpfulness with discrete ratings from one to seven; experts judge it using a one-to-five scale. Higher ratings on both scales correspond to the most helpful reviews.

Example 1:

Student rating = 7, Average expert rating = 2


The author also has great logic in this paper. How can we consider the United States a great democracy when everyone is not treated equal. All of the main points were indeed supported in this piece.

Student rating = 3, Average expert rating = 5

I there were some good opportunities to provide further data to strengthen your argument. For example the statement "These methods of intimidation, and the lack of military force offered by the government to stop the KKK, led to the rescinding of African American democracy." Maybe here include data about how . . . (126 words)

Example 2:

Writing-expert rating = 2, Content-expert rating = 5

Your over all arguements were organized in some order but was unclear due to the lack of thesis in the paper. Inside each arguement, there was no order to the ideas presented, they went back and forth between ideas. There was good support to the arguements but yet some of it didnt not fit your arguement.

Writing-expert rating = 5, Content-expert rating = 2

First off, it seems that you have difficulty writing transitions between paragraphs. It seems that you end your paragraphs with the main idea of each paragraph. That being said, . . . (173 words) As a final comment, try to continually move your paper, that is, have in your a logical flow with every paragraph having a purpose.

To better understand such differences and investigate their impact on automatically assessing peer-review helpfulness, in this paper we compare helpfulness predictions using different possibilities for the gold-standard rating. In particular, we compare the predictive ability of features across gold-standard ratings by examining the most useful features and feature ranks using standard feature selection techniques. We show that paper ratings and lexical categories that suggest clear transitions and opinions are most useful in predicting helpfulness as perceived by students, while review length is generally effective in predicting expert helpfulness. While the presence of praise and summary comments is more effective in modeling writing-expert helpfulness, providing solutions is more useful in predicting content-expert helpfulness.

2 Related Work

To our knowledge, no prior work on peer review from the NLP community has attempted to automatically predict peer-review helpfulness. Instead, the NLP community has focused on issues such as highlighting key sentences in papers (Sandor and Vorndran, 2009), detecting important feedback features in reviews (Cho, 2008; Xiong and Litman, 2010), and adapting peer-review assignment (Garcia, 2010). However, many NLP studies have been done on the helpfulness of other types of reviews, such as product reviews (Kim et al., 2006; Ghose and Ipeirotis, 2010), movie reviews (Liu et al., 2008), book reviews (Tsur and Rappoport, 2009), etc. Kim et al. (2006) used regression to predict the helpfulness ranking of product reviews based on various classes of linguistic features. Ghose and Ipeirotis (2010) further examined the socio-economic impact of product reviews using a similar approach and suggested the usefulness of subjectivity analysis. Another study of movie reviews (Liu et al., 2008) showed that helpfulness depends on reviewers' expertise, their writing style, and the timeliness of the review. Tsur and Rappoport (2009) proposed RevRank to select the most helpful book reviews in an unsupervised fashion based on review lexicons.

To tailor the utility of this prior work on helpfulness prediction to educational peer reviews, we draw upon research on peer review in cognitive science. One empirical study of the nature of peer-review feedback (Nelson and Schunn, 2009) found that feedback implementation likelihood is significantly correlated with five feedback features. Of these features, problem localization — pinpointing the source of the problem and/or solution in the original paper — and solution — providing a solution to the observed problem — were found to be most important. Researchers (Cho, 2008; Xiong and Litman, 2010) have already shown that some of these constructs can be automatically learned from textual input using Machine Learning and NLP techniques. In addition to investigating what properties of textual comments make peer reviews helpful, researchers have also examined how the comments produced by students differ from those produced by different types of experts (Patchan et al., 2009). Though focusing on differences between what students and experts produce, such work sheds light on our study of students' and experts' helpfulness ratings of the same student comments (i.e. what students and experts value).

Our work in peer-review helpfulness prediction integrates the NLP techniques and cognitive-science approaches mentioned above. We will particularly focus on examining the utility of features motivated by related work from both areas, with respect to different types of gold-standard ratings of peer-review helpfulness for automatic prediction.

3 Data

In this study, we use a previously annotated peer-review corpus (Nelson and Schunn, 2009; Patchan et al., 2009) that was collected in an introductory college history class using the freely available web-based peer-review SWoRD (Scaffolded Writing and Rewriting in the Discipline) system (Cho and Schunn, 2007). The corpus consists of 16 papers (about six pages each) and 189 reviews (varying from twenty words to about two hundred words) accompanied by numeric ratings of the papers. Each review was manually segmented into idea units (defined as contiguous feedback referring to a single topic) (Nelson and Schunn, 2009), and these idea units were then annotated by two independent annotators for various coding categories, such as feedback type (praise, problem, and summary), problem localization, solution, etc. For example, the second case in Example 1, which has only one idea unit, was annotated as feedbackType = problem, problemLocalization = True, and solution = True. The agreement (Kappa) between the two annotators is 0.92 for feedbackType, 0.69 for localization, and 0.89 for solution.[2]

[2] For Kappa value interpretation, Landis and Koch (1977) propose the following agreement standard: 0.21-0.40 = "Fair"; 0.41-0.60 = "Moderate"; 0.61-0.80 = "Substantial"; 0.81-1.00 = "Almost Perfect". Thus, while localization signals are more difficult to annotate, the inter-annotator agreement is still substantial.

Our corpus also contains author-provided back evaluations. At the end of the peer-review assignment, students were asked to provide a back evaluation on each review that they received by rating review helpfulness using a discrete scale from one to seven. After the corpus was collected, one writing expert and one content expert were also asked to rate review helpfulness with a slightly different scale from one to five. For our study, we will also compute the average of the ratings given by the two experts, yielding four types of possible gold-standard ratings of peer-review helpfulness for each review. Figure 1 shows the rating distribution of each type. Interestingly, we observed that expert ratings roughly follow a normal distribution, while students are more likely to give higher ratings (as illustrated in Figure 1).

4 Features

Our features are motivated by the prior work introduced in Section 2, in particular NLP work on predicting product-review helpfulness (Kim et al., 2006), as well as work on automatically learning cognitive-science constructs (Nelson and Schunn, 2009) using NLP (Cho, 2008; Xiong and Litman, 2010). The complete list of features is shown in Table 3 and described below. The computational linguistic features are automatically extracted based on the output of syntactic analysis of the reviews and papers.[3] These features represent structural, lexical, syntactic and semantic information of the textual content, and also include information for identifying certain important cognitive constructs:

• Structural features consider the general structure of reviews, which includes review length in terms of tokens (reviewLength), number of sentences (sentNum), average sentence length (sentLengthAve), percentage of sentences that end with question marks (question%), and number of exclamatory sentences (exclams); a small illustrative sketch of these counts appears below.

• Lexical features are counts of ten lexical categories (Table 1), where the categories were learned in a semi-supervised way from review lexicons in a pilot study. We first manually created a list of words that were specified as signal words for annotating feedbackType and problem localization in the coding manual; we then supplemented the list with words selected by a decision-tree model learned using a Bag-of-Words representation of the peer reviews (a counting sketch appears after Table 1).

[3] We used MSTParser (McDonald et al., 2005) for syntactic analysis.
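As a concrete illustration of the structural features, the sketch below computes them for a single review. It assumes whitespace tokenization and a naive punctuation-based sentence splitter; the function name and preprocessing are ours, not the parser-based pipeline used in the experiments.

import re

def structural_features(review):
    """Toy extraction of the structural features listed above.

    Assumes whitespace tokenization and punctuation-based sentence
    splitting; the original experiments relied on parser output instead.
    """
    tokens = review.split()
    # Split on sentence-final punctuation, keeping the delimiter.
    sentences = [s.strip() for s in re.findall(r'[^.!?]+[.!?]?', review) if s.strip()]
    n_sent = max(len(sentences), 1)
    return {
        'reviewLength': len(tokens),                      # tokens
        'sentNum': len(sentences),                        # sentences
        'sentLengthAve': len(tokens) / n_sent,            # tokens per sentence
        'question%': sum(s.endswith('?') for s in sentences) / n_sent,
        'exclams': sum(s.endswith('!') for s in sentences),
    }

print(structural_features("Your thesis is unclear. Could you add a citation? Nice work overall!"))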

Figure 1: Distribution of peer-review helpfulness when rated by students and experts

Tag  Category          Word list
SUG  suggestion        should, must, might, could, need, needs, maybe, try, revision, want
LOC  location          page, paragraph, sentence
ERR  problem           error, mistakes, typo, problem, difficulties, conclusion
IDE  idea verb         consider, mention
LNK  transition        however, but
NEG  negative words    fail, hard, difficult, bad, short, little, bit, poor, few, unclear, only, more
POS  positive words    great, good, well, clearly, easily, effective, effectively, helpful, very
SUM  summarization     main, overall, also, how, job
NOT  negation          not, doesn't, don't
SOL  solution          revision, specify, correction

Table 1: Ten lexical categories
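To make the use of Table 1 concrete, the sketch below turns the categories into count features. The word lists are transcribed from Table 1; the lowercasing, punctuation stripping, and whole-token matching are our own simplifying assumptions, since the exact matching rules (and handling of multi-word cues) are not spelled out here.

LEXICAL_CATEGORIES = {
    'SUG': {'should', 'must', 'might', 'could', 'need', 'needs', 'maybe',
            'try', 'revision', 'want'},
    'LOC': {'page', 'paragraph', 'sentence'},
    'ERR': {'error', 'mistakes', 'typo', 'problem', 'difficulties', 'conclusion'},
    'IDE': {'consider', 'mention'},
    'LNK': {'however', 'but'},
    'NEG': {'fail', 'hard', 'difficult', 'bad', 'short', 'little', 'bit',
            'poor', 'few', 'unclear', 'only', 'more'},
    'POS': {'great', 'good', 'well', 'clearly', 'easily', 'effective',
            'effectively', 'helpful', 'very'},
    'SUM': {'main', 'overall', 'also', 'how', 'job'},
    'NOT': {'not', "doesn't", "don't"},
    'SOL': {'revision', 'specify', 'correction'},
}

def lexical_counts(review):
    """Count, for each category in Table 1, how many review tokens match it."""
    tokens = [t.strip('.,;:!?"()').lower() for t in review.split()]
    return {tag: sum(tok in words for tok in tokens)
            for tag, words in LEXICAL_CATEGORIES.items()}

print(lexical_counts("The main argument is unclear; maybe try a revision, but keep the good examples."))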

Compared with commonly used lexical unigrams and bigrams (Kim et al., 2006), these lexical categories are equally useful in modeling peer-review helpfulness, and significantly reduce the feature space.[4]

• Syntactic features mainly focus on nouns and verbs, and include the percentages of tokens that are nouns, verbs, verbs conjugated in the first person (1stPVerb%), adjectives/adverbs, and open classes, respectively.

• Semantic features capture two important peer-review properties: their relevance to the main topics in students' papers, and their opinion sentiment polarities. Kim et al. (2006) extracted product property keywords from external resources based on their hypothesis that helpful product reviews refer frequently to certain product properties. Similarly, we hypothesize that helpful peer reviews are closely related to domain topics that are shared by all student papers in an assignment. Our domain topic set contains 288 words extracted from the collection of student papers using topic-lexicon extraction software;[5] our feature domainWord counts how many words of a given review belong to the extracted set. For sentiment polarities, we extract positive and negative sentiment words from the General Inquirer Dictionaries,[6] and count their appearance in reviews in terms of their sentiment polarity (posWord, negWord).

[4] Lexical categories help avoid the risk of over-fitting, given only 189 peer reviews in our case compared to more than ten thousand Amazon.com reviews used for predicting product-review helpfulness (Kim et al., 2006).
[5] The software extracts topic words based on topic signatures (Lin and Hovy, 2000), and was kindly provided by Annie Louis.
[6] http://www.wjh.harvard.edu/~inquirer/homecat.htm
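A minimal sketch of the semantic counts follows; it assumes the 288-word domain topic set and the General Inquirer positive/negative word lists have already been loaded into Python sets (the topic-signature extraction itself and the dictionary download are not reproduced here).

def semantic_features(review, domain_words, positive_words, negative_words):
    """Count domain-topic and sentiment words in a review.

    domain_words:   topic lexicon extracted from the student papers
    positive_words: positive entries from the General Inquirer
    negative_words: negative entries from the General Inquirer
    All three are assumed to be precomputed sets of lowercase words.
    """
    tokens = [t.strip('.,;:!?"()').lower() for t in review.split()]
    return {
        'domainWord': sum(t in domain_words for t in tokens),
        'posWord': sum(t in positive_words for t in tokens),
        'negWord': sum(t in negative_words for t in tokens),
    }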

Feature      Description
regTag%      The percentage of problems in reviews that could be matched with a localization pattern.
soDomain%    The percentage of sentences where any domain word appears between the subject and the object.
dDeterminer  The number of demonstrative determiners.
windowSize   For each review sentence, we search for the most likely referred window of words in the related paper; windowSize is the average number of words of all windows.

Table 2: Localization features

• Localization features are motivated by linguistic features that are used for automatically predicting problem localization (an important cognitive construct for feedback understanding and implementation) (Nelson and Schunn, 2009), and are presented in Table 2; a minimal sketch appears at the end of this section. To illustrate how these features are computed, consider the following critique:

The section of the essay on African Americans needs more careful attention to the timing and reasons for the federal governments decision to stop protecting African American civil and political rights.

The review has only one sentence, in which one regular expression is matched with "the section of", thus regTag% = 1; there is no demonstrative determiner, thus dDeterminer = 0; "African" and "Americans" are domain words appearing between the subject "section" and the object, so soDomain is true for this sentence and thus soDomain% = 1 for the given review.

In addition to the low-level linguistic features presented above, we also examined non-linguistic features that are derived from the ratings and the prior manual annotations of the corpus, described in Section 3.

• Cognitive-science features are motivated by an empirical study (Nelson and Schunn, 2009) which suggests significant correlations between certain cognitive constructs (e.g. feedbackType, problem localization, solution) and review implementation likelihood. Intuitively, helpful reviews are more likely to get implemented, thus we introduced these features to capture desirable high-level characteristics of peer reviews. Note that in our corpus these cognitive constructs are manually coded at the idea-unit level (Nelson and Schunn, 2009); however, peer-review helpfulness is rated at the review level.[7] Our cognitive-science features aggregate the annotations up to the review level by reporting the percentage of idea units in a review that exhibit each characteristic: the distribution of review types (praise%, problem%, summary%), the percentage of problem-localized critiques (localization%), as well as the percentage of solution-provided ones (solution%).

• Social-science features introduce elements reflecting interactions between students in a peer-review assignment. As suggested in related work on product-review helpfulness (Kim et al., 2006; Danescu-Niculescu-Mizil et al., 2009), some social dimensions (e.g. customer opinion on related product quality) are of great influence in the perceived helpfulness of product reviews. Similarly, in our case, we introduced the related paper rating (pRating) — to consider whether and how helpfulness ratings are affected by the rating that the paper receives[8] — and the absolute difference between that rating and the average score given by all reviewers (pRatingDiff) — to measure the variation in perceived helpfulness of a given review.

[7] Details of the different granularity levels of annotation can be found in (Nelson and Schunn, 2009).
[8] That is, to examine whether students give higher ratings to peers who gave them higher paper ratings in the first place.
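As a minimal sketch of the localization features, the code below computes regTag% and dDeterminer for the problem sentences of a review. The regular expressions and the demonstrative-determiner list are illustrative stand-ins rather than the actual patterns used; soDomain% and windowSize are omitted because they additionally require a dependency parse and alignment against the reviewed paper.

import re

# Illustrative localization patterns; the real pattern set is not listed here.
LOCALIZATION_PATTERNS = [
    r'\bthe (section|paragraph|part) (of|on|about)\b',
    r'\b(on )?page \d+\b',
    r'\bin the (introduction|conclusion|thesis)\b',
]
DEMONSTRATIVES = {'this', 'that', 'these', 'those'}  # assumed determiner list

def localization_features(problem_sentences):
    """Compute regTag% and dDeterminer over a review's problem sentences."""
    n = max(len(problem_sentences), 1)
    tagged = sum(
        any(re.search(p, s.lower()) for p in LOCALIZATION_PATTERNS)
        for s in problem_sentences)
    d_det = sum(tok.strip('.,;:!?"()') in DEMONSTRATIVES
                for s in problem_sentences for tok in s.lower().split())
    return {'regTag%': tagged / n, 'dDeterminer': d_det}

# The worked example above yields regTag% = 1.0 and dDeterminer = 0.
print(localization_features([
    "The section of the essay on African Americans needs more careful "
    "attention to the timing and reasons for the federal governments "
    "decision to stop protecting African American civil and political rights."
]))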

5 Experiments

We take a machine learning approach to model different types of perceived helpfulness (student helpfulness, writing-expert helpfulness, content-expert helpfulness, average-expert helpfulness) based on combinations of linguistic and non-linguistic features extracted from our peer-review corpus. Then we compare the different helpfulness types in terms of the predictive power of the features used in their corresponding models. For comparison purposes, we consider the linguistic and non-linguistic features both separately and in combination, which generates three sets of features: 1) linguistic features, 2) non-linguistic features, and 3) all features. For each set of features, we train four models, each corresponding to a different kind of helpfulness rating. For each learning task (three by four), we use two standard feature selection algorithms to find the most useful features based on 10-fold cross validation. First, we perform Linear Regression with Greedy Stepwise search (stepwise LR) to select the most useful features when testing in each of the ten folds, and count how many times each feature is selected in the ten trials. Second, we use Relief Feature Evaluation[9] with Ranker (Relief) (Kira and Rendell, 1992; Witten and Frank, 2005) to rank all used features based on their average merits (the ability of the given feature to differentiate between two example pairs) over the ten trials.[10]

[9] Relief evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and of a different class.
[10] Both algorithms are provided by Weka (http://www.cs.waikato.ac.nz/ml/weka/).

Although both methods are supervised, the wrapper is "more aggressive" because its feature evaluation is based on the performance of the regression model and thus the resulting feature set is tailored to the learning algorithm. In contrast, Relief does not optimize feature sets directly for classifier performance, thus it takes class information into account in a "less aggressive" manner than the wrapper method. We use both methods in our experiment to provide complementary perspectives. While the former can directly tell us which features are most useful, the latter gives feature ranks which provide more detailed information about differences between features. To compare the feature selection results, we examine the four kinds of helpfulness models for each of the three feature sets separately, as presented below. Note that the focus of this paper is comparing feature utilities in different helpfulness models rather than predicting those types of helpfulness ratings. (Details of how the average-expert model performs can be found in our prior work (Xiong and Litman, 2011).)

5.1 Feature Selection of Linguistic Features

Table 4 presents the feature selection results for the computational linguistic features used in modeling the four different types of peer-review helpfulness. The first row lists the four sources of helpfulness ratings, and each column represents a corresponding model. The second row presents the most useful features in each model selected by stepwise LR, where "# of folds" refers to the number of trials in which the given feature appears in the resulting feature set during the 10-fold cross validation. Here we only report features that are selected in no less than five folds (half the time). The third row presents feature ranks computed using Relief, where we only report the top six features due to space limits. Features are ordered by descending rank, and the average merit and its standard deviation are reported for each feature.

The selection results of stepwise LR show that reviewLength is most useful for predicting expert helpfulness in general, while specific lexical categories (i.e. LNK and NOT) and positive words (posWord) are more useful in predicting student helpfulness. When looking at the ranking results, we observe that transition cues (LNK) and posWord are also ranked high in the student-helpfulness model, although question% and suggestion words (SUG) are ranked highest. For the expert-helpfulness models, windowSize and posWord, which are not listed in the selected features for expert helpfulness (although they are selected for students), are actually ranked high for modeling average-expert helpfulness. While exclamatory sentence number (exclams) and summarization cues are ranked top for the writing expert, the percentage of questions (question%) and error cues (ERR) are ranked top for the content expert. In addition, the percentage of words that are verbs conjugated in the first person (1stPVerb%) is both selected and ranked high in the content-expert helpfulness model. Out of the four models, SUG is ranked high for predicting both student and content-expert helpfulness. These observations indicate that both students and experts value questions (question%) and suggestions (SUG) in reviews, and students particularly favor clear signs of logical flow in review arguments (LNK), positive words (posWord), as well as references to their paper content which provide explicit context information (windowSize). In addition, experts in general prefer longer reviews (reviewLength), and the writing expert considers clear summary signs (SUM) to be important indicators of helpful peer reviews.
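To make this selection protocol concrete, here is a rough re-creation of the stepwise-LR fold counting in Python. The original experiments used Weka's Linear Regression with Greedy Stepwise search and Relief with Ranker, so the scikit-learn forward selector below is only an approximation of that setup; the feature matrix X, the rating vector y, and feature_names are assumed to be prepared elsewhere.

from collections import Counter

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def stepwise_fold_counts(X, y, feature_names, n_select=5, n_folds=10, seed=0):
    """Count how often each feature is chosen by greedy forward selection
    across the training portions of 10 cross-validation folds, analogous
    to the "# of folds" column in Table 4. X is a NumPy array of shape
    (n_reviews, n_features); y holds one type of helpfulness rating.
    """
    counts = Counter()
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, _ in folds.split(X):
        selector = SequentialFeatureSelector(
            LinearRegression(),
            n_features_to_select=n_select,
            direction='forward')
        selector.fit(X[train_idx], y[train_idx])
        for name, keep in zip(feature_names, selector.get_support()):
            counts[name] += int(keep)
    return counts

# Usage sketch:
# print(stepwise_fold_counts(X, y, feature_names).most_common())

A Relief-style ranking could be produced over the same folds with a similar loop (for example with a third-party ReliefF implementation), averaging the per-fold merits to obtain the ranked lists reported in the tables below.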

Type               Features
Structural         reviewLength, sentNum, sentLengthAve, question%, exclams
Lexical            SUG, LOC, ERR, IDE, LNK, NEG, POS, SUM, NOT, SOL (Table 1)
Syntactic          noun%, verb%, 1stPVerb%, adj+adv%, opClass%
Semantic           domainWord, posWord, negWord
Localization       regTag%, soDomain%, dDeterminer, windowSize (Table 2)
Cognitive-science  praise%, problem%, summary%, localization%, solution%
Social-science     pRating, pRatingDiff

Table 3: Summary of features

Stepwise LR (feature, # of folds selected out of 10):
  Students:        LNK (9), posWord (8), NOT (6), windowSize (6), POS (5)
  Writing expert:  reviewLength (8)
  Content expert:  reviewLength (10), question% (6), sentNum (5), 1stPVerb% (5)
  Expert average:  reviewLength (10), sentNum (8), question% (8)

Relief (feature, average merit ± std. dev.):
  Students:        question% .019 ± .002, SUG .015 ± .003, LNK .014 ± .003, sentLengthAve .012 ± .003, POS .011 ± .002, posWord .010 ± .001
  Writing expert:  exclams .010 ± .003, SUM .008 ± .004, NEG .006 ± .004, negWord .005 ± .002, windowSize .004 ± .002, sentNum .003 ± .001
  Content expert:  question% .010 ± .004, ERR .009 ± .003, SUG .009 ± .004, posWord .007 ± .002, exclams .006 ± .001, 1stPVerb% .007 ± .004
  Expert average:  exclams .010 ± .003, question% .011 ± .004, windowSize .008 ± .002, posWord .006 ± .002, reviewLength .004 ± .001, sentLengthAve .004 ± .001

Table 4: Feature selection based on linguistic features

5.2 Feature Selection of Non-Linguistic Features

When switching to the high-level non-linguistic features (Table 5), we find that solution% is always selected (in all ten trials) as a most useful feature for predicting all four kinds of helpfulness, and is also ranked high for content-expert and student helpfulness. Especially for the content expert, solution% has a much higher merit (0.013) than all the other features (≤ 0.003). This agrees with our observation in Section 5.1 that SUG is ranked high in both cases. localization% is selected as one of the most useful features in the content-expert helpfulness model, and it is also ranked top in the student model (though not selected frequently by stepwise LR). For modeling writing-expert helpfulness, praise (praise%) is more important than problem and summary, and the paper rating (pRating) loses its predictive power compared to how it works in the other models. In contrast, pRating is both selected and ranked high for predicting students' perceived helpfulness.

5.3 Feature Selection of All Features

When considering all features together, as reported in Table 6, pRating is only selected in the student-helpfulness model, and it still remains the most important feature for predicting students' perceived helpfulness. As for the experts, the structural feature reviewLength stands out from all other features in both the writing-expert and the content-expert models.

Stepwise LR (feature, # of folds selected out of 10):
  Students:        pRating (10), solution% (10)
  Writing expert:  solution% (10)
  Content expert:  localization% (10), solution% (10), problem% (9), pRating (10)
  Expert average:  solution% (10), pRating (10), localization% (9)

Relief (feature, average merit ± std. dev.):
  Students:        localization% .012 ± .003, pRatingDiff .010 ± .002, pRating .007 ± .002, solution% .006 ± .005, problem% .004 ± .002, summary% .004 ± .003
  Writing expert:  praise% .008 ± .002, problem% .007 ± .002, summary% .001 ± .004, localization% .001 ± .005, pRating .004 ± .004, pRatingDiff .007 ± .002
  Content expert:  solution% .013 ± .005, pRating .003 ± .002, praise% .001 ± .002, localization% .001 ± .004, problem% .001 ± .002, pRating .002 ± .003
  Expert average:  problem% .004 ± .002, localization% .004 ± .006, praise% .003 ± .003, solution% .002 ± .004, pRating .005 ± .003, pRatingDiff .006 ± .005

Table 5: Feature selection based on non-linguistic features
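Table 5 evaluates the non-linguistic features, which are the review-level aggregates of the idea-unit annotations and ratings described in Section 4. A minimal sketch of that aggregation is shown below; the dictionary field names are ours, not the corpus annotation schema.

def nonlinguistic_features(idea_units, paper_rating, avg_paper_rating):
    """Aggregate idea-unit annotations and ratings to review-level features.

    idea_units: list of dicts with keys 'feedbackType' (praise / problem /
    summary), 'localization' (bool) and 'solution' (bool); these names are
    illustrative only.
    """
    n = max(len(idea_units), 1)

    def frac(pred):
        return sum(pred(u) for u in idea_units) / n

    return {
        'praise%': frac(lambda u: u['feedbackType'] == 'praise'),
        'problem%': frac(lambda u: u['feedbackType'] == 'problem'),
        'summary%': frac(lambda u: u['feedbackType'] == 'summary'),
        'localization%': frac(lambda u: u['localization']),
        'solution%': frac(lambda u: u['solution']),
        'pRating': paper_rating,
        'pRatingDiff': abs(paper_rating - avg_paper_rating),
    }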

Stepwise LR (feature, # of folds selected out of 10):
  Students:        pRating (10), dDeterminer (7), sentNum (5)
  Writing expert:  reviewLength (10), problem% (8)
  Content expert:  reviewLength (10), problem% (6)
  Expert average:  reviewLength (10), pRatingDiff (5)

Relief (feature, average merit ± std. dev.):
  Students:        pRating .030 ± .006, NOT .019 ± .004, pRatingDiff .019 ± .005, sentNum .014 ± .002, question% .014 ± .003, NEG .013 ± .002
  Writing expert:  exclams .016 ± .003, praise% .015 ± .003, SUM .013 ± .004, summary% .008 ± .003, problem% .009 ± .003, reviewLength .004 ± .001
  Content expert:  solution% .025 ± .003, domainWord .012 ± .002, regTag% .012 ± .007, reviewLength .009 ± .002, question% .010 ± .003, sentNum .008 ± .002
  Expert average:  exclams .015 ± .004, question% .012 ± .004, LOC .007 ± .002, sentNum .007 ± .002, reviewLength .007 ± .001, praise% .008 ± .004

Table 6: Feature selection based on all features

Interestingly, it is the number of sentences (sentNum) rather than review length among the structural features that is useful in the student-helpfulness model. Demonstrative determiners (dDeterminer) are also selected, which indicates that having a clear sign of comment targets is considered important from the students' perspective. When examining the ranking results, we find that more lexical categories are ranked high for students compared to the other kinds of helpfulness. Specifically, NOT appears high again, suggesting that clear expression of opinion is important in predicting student helpfulness. Across the four types of helpfulness, again, we observe that the writing expert tends to value praise and summary (indicated by both SUM and summary%) in reviews, while the content expert favors critiques, especially solution-provided critiques.

5.4 Discussion

Based on our observations from the above three comparisons, we summarize our findings with respect to the different feature types and provide interpretation: 1) Review length (in tokens) is generally effective in predicting expert perceived helpfulness, while the number of sentences is more useful in modeling student perceived helpfulness. Interestingly, there is a strong correlation between these two features (r = 0.91, p ≤ 0.001), and why one is selected over the other in different helpfulness models needs further investigation. 2) Lexical categories such as transition cues, negation, and suggestion words are of more importance in modeling student perceived helpfulness. This might indicate that students prefer clear expression of problems, references and even opinions in terms of specific lexical clues, the lack of which is likely to result in difficulty understanding the reviews. 3) As for the cognitive-science features, solution is generally an effective indicator of helpful peer reviews. Within the three feedback types of peer reviews, praise is valued highly by the writing expert. (It is interesting to note that although praise is shown to be more important than problem and summary for modeling writing-expert helpfulness, positive sentiment words do not appear to be more predictive than negative ones.) In contrast, problem is more desirable from the content expert's point of view.

Although students assign less importance to the problems themselves, solution-provided peer reviews could be helpful for them with respect to the learning goal of peer-review assignments. 4) Paper rating is a very effective feature for predicting review helpfulness as perceived by students, which is not the case for either expert. This supports the argument that social aspects play a role in people's perception of review helpfulness, and it also reflects the fact that students tend to be nice to each other in such peer-review interactions. However, this dimension might not correspond with the real helpfulness of the reviews, at least from the perspective of both the writing expert and the content expert.

6 Conclusion and Future Work

We have shown that the type of helpfulness to be predicted does indeed influence the utility of different feature types for automatic prediction. Low-level general linguistic features are more predictive when modeling students' perceived helpfulness; high-level theory-supported constructs are more useful in experts' models. However, in the related area of automated essay scoring (Attali and Burstein, 2006), others have suggested the need for the use of validated features related to meaningful dimensions of writing, rather than low-level (but easy to automate) features. In this perspective, our work similarly poses a challenge to the NLP community in terms of how to take into account the education-oriented dimensions of helpfulness when applying traditional NLP techniques for automatically predicting review helpfulness. In addition, it is important to note that predictive features of perceived helpfulness are not guaranteed to capture the nature of "truly" helpful peer reviews (in contrast to the perceived ones).

In the future, we would like to investigate how to integrate the useful dimensions of helpfulness perceived by different audiences in order to come up with a "true" helpfulness gold standard. We would also like to explore more sophisticated features and other NLP techniques to improve our model of peer-review helpfulness. As we have already built models to automatically predict certain cognitive constructs (problem localization and solution), we will replace the annotated cognitive-science features used here with their automatic predictions, so that we can build our helpfulness model fully automatically. Finally, we would like to integrate our helpfulness model into a real peer-review system and evaluate its performance extrinsically in terms of improving students' learning and reviewing performance in future peer-review assignments.

Acknowledgments

This work was supported by the Learning Research and Development Center at the University of Pittsburgh. We thank Melissa Patchan and Chris Schunn for generously providing the manually annotated peer-review corpus. We are also grateful to Michael Lipschultz and Chris Schunn for their feedback while writing this paper.

References

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater v.2. The Journal of Technology, Learning and Assessment (JTLA), 4(3), February.

Kwangsu Cho and Christian D. Schunn. 2007. Scaffolded writing and rewriting in the discipline: A web-based reciprocal peer review system. Computers and Education, 48:409–426.

Kwangsu Cho. 2008. Machine classification of peer comments in physics. In Proceedings of the First International Conference on Educational Data Mining (EDM 2008), pages 192–196.

Cristian Danescu-Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. In Proceedings of WWW, pages 141–150.

Raquel M. Crespo Garcia. 2010. Exploring document clustering techniques for personalized peer assessment in exploratory courses. In Proceedings of the Computer-Supported Peer Review in Education (CSPRED) Workshop at the Tenth International Conference on Intelligent Tutoring Systems (ITS 2010).

Anindya Ghose and Panagiotis G. Ipeirotis. 2010. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering, 99.

Soo-Min Kim, Patrick Pantel, Tim Chklovski, and Marco Pennacchiotti. 2006. Automatically assessing review helpfulness. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 423–430, Sydney, Australia, July.

Kenji Kira and Larry A. Rendell. 1992. A practical approach to feature selection. In Derek H. Sleeman and Peter Edwards, editors, ML92: Proceedings of the Ninth International Conference on Machine Learning, pages 249–256, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, volume 1 of COLING '00, pages 495–501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. 2008. Modeling and predicting the helpfulness of online reviews. In Proceedings of the Eighth IEEE International Conference on Data Mining, pages 443–452, Los Alamitos, CA, USA.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 91–98, Stroudsburg, PA, USA. Association for Computational Linguistics.

Melissa M. Nelson and Christian D. Schunn. 2009. The nature of feedback: How different types of peer feedback affect writing performance. Instructional Science, 37(4):375–401.

Melissa M. Patchan, Davida Charney, and Christian D. Schunn. 2009. A validation study of students' end comments: Comparing comments by students, a writing instructor, and a content instructor. Journal of Writing Research, 1(2):124–152.

Agnes Sandor and Angela Vorndran. 2009. Detecting key sentences for automatic assistance in peer-reviewing research articles in educational sciences. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP), pages 36–44.

Oren Tsur and Ari Rappoport. 2009. RevRank: A fully unsupervised algorithm for selecting the most helpful book reviews. In Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM 2009), pages 36–44.

Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann, San Francisco, CA.

Wenting Xiong and Diane Litman. 2010. Identifying problem localization in peer-review feedback. In Proceedings of the Tenth International Conference on Intelligent Tutoring Systems (ITS 2010), volume 6095, pages 429–431.

Wenting Xiong and Diane Litman. 2011. Automatically predicting peer-review helpfulness. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Portland, Oregon, June.
