Question Difficulty Estimation in Community Question Answering Services∗
Jing Liu†  Quan Wang‡  Chin-Yew Lin♯  Hsiao-Wuen Hon♯
†Harbin Institute of Technology, Harbin 150001, P.R. China
‡Peking University, Beijing 100871, P.R. China
♯Microsoft Research Asia, Beijing 100080, P.R. China
[email protected]  [email protected]  {cyl,hon}@microsoft.com

∗ This work was done when Jing Liu and Quan Wang were visiting students at Microsoft Research Asia. Quan Wang is currently affiliated with the Institute of Information Engineering, Chinese Academy of Sciences.

Abstract

In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.

1 Introduction

In recent years, community question answering (CQA) services such as Stackoverflow (http://stackoverflow.com) and Yahoo! Answers (http://answers.yahoo.com) have seen rapid growth. A great deal of research effort has been devoted to CQA, including: (1) question search (Xue et al., 2008; Duan et al., 2008; Suryanto et al., 2009; Zhou et al., 2011; Cao et al., 2010; Zhang et al., 2012; Ji et al., 2012); (2) answer quality estimation (Jeon et al., 2006; Agichtein et al., 2008; Bian et al., 2009; Liu et al., 2008); (3) user expertise estimation (Jurczyk and Agichtein, 2007; Zhang et al., 2007; Bouguessa et al., 2008; Pal and Konstan, 2010; Liu et al., 2011); and (4) question routing (Zhou et al., 2009; Li and King, 2010; Li et al., 2011).

However, less attention has been paid to question difficulty estimation in CQA. Question difficulty estimation can benefit many applications: (1) Experts are usually under time constraints, and we do not want to bore them by routing every question, both easy and hard, to them. Assigning questions to experts by matching question difficulty with expertise level, not just question topic, will make better use of the experts' time and expertise (Ackerman and McDonald, 1996). (2) Nam et al. (2009) found that winning the point awards offered by the reputation system is a driving factor in user participation in CQA. Question difficulty estimation would be helpful in designing a better incentive mechanism that assigns higher point awards to more difficult questions. (3) Question difficulty estimation can help analyze user behavior in CQA, since users may make strategic choices when encountering questions of different difficulty levels.

To the best of our knowledge, not much research has been conducted on the problem of estimating question difficulty in CQA. The most relevant work is a PageRank-based approach proposed by Yang et al. (2008) to estimate task difficulty in crowdsourcing contest services. Their key idea is to construct a graph of tasks, creating an edge from a task t1 to a task t2 whenever a user u wins task t1 but loses task t2, implying that task t2 is likely to be more difficult than task t1. The standard PageRank algorithm is then run on the task graph, and the PageRank score of each task serves as its difficulty score. This approach implicitly assumes that task difficulty is the only factor affecting the outcomes of competitions (i.e., the best answer).
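As a concrete illustration of this baseline, the following is a minimal sketch, not the implementation of Yang et al. (2008): the win/loss records are hypothetical, and the damping factor and edge handling are simply left at the defaults of the networkx library.

```python
import networkx as nx

# Hypothetical win/loss records: (user, task_won, task_lost) means the
# user won task_won but lost task_lost.
records = [
    ("u1", "t1", "t2"),
    ("u2", "t1", "t3"),
    ("u3", "t2", "t3"),
]

# Task graph: an edge from t1 to t2 whenever some user wins t1 but
# loses t2, i.e., t2 is likely more difficult than t1.
graph = nx.DiGraph()
for user, won, lost in records:
    graph.add_edge(won, lost)

# Standard PageRank on the task graph; each task's PageRank score is
# read off as its difficulty score.
difficulty = nx.pagerank(graph, alpha=0.85)
for task, score in sorted(difficulty.items(), key=lambda kv: -kv[1]):
    print(task, round(score, 3))
```

In this toy example t3, which is only ever lost, accumulates the highest PageRank and is therefore ranked hardest; note that the ranking ignores who the winners and losers were.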
However, the outcomes of competitions depend on both the difficulty levels of tasks and the expertise levels of competitors (i.e., the other answerers).

Inspired by Liu et al. (2011), we propose a competition-based approach which jointly models question difficulty and user expertise. Our approach is based on two intuitive assumptions: (1) given a question answering thread, the difficulty score of the question is higher than the expertise score of the asker, but lower than that of the best answerer; (2) the expertise score of the best answerer is higher than that of the asker as well as all other answerers. Given these two assumptions, we can determine question difficulty scores and user expertise scores through pairwise comparisons between (1) a question and an asker, (2) a question and a best answerer, (3) a best answerer and an asker, and (4) a best answerer and all other non-best answerers.

The main contributions of this paper are:
• We propose a competition-based approach to estimate question difficulty (Sec. 2). Our model significantly outperforms the PageRank-based approach (Yang et al., 2008) for estimating question difficulty on the data of Stack Overflow (Sec. 3.2).
• Additionally, we calibrate question difficulty scores across two CQA services to verify the effectiveness of our model (Sec. 3.3).
• Most importantly, we demonstrate that different words or tags in the question descriptions indicate question difficulty levels. This implies the possibility of predicting question difficulty purely from the text of question descriptions (Sec. 3.4).

2 Competition-based Question Difficulty Estimation

CQA is a virtual community where people can ask questions and seek opinions from others. Formally, when an asker ua posts a question q, there will be several answerers who answer her question. One answer among the received ones will be selected as the best answer, either by asker ua or by a vote of the community. The user who provides the best answer is called the best answerer ub, and we denote the set of all non-best answerers as S = {uo1, ..., uoM}. Assuming that question difficulty scores and user expertise scores are expressed on the same scale, we make the following two assumptions:
• The difficulty score of question q is higher than the expertise score of asker ua, but lower than that of the best answerer ub. This is intuitive since the best answerer ub correctly answers question q, which asker ua does not know how to solve.
• The expertise score of the best answerer ub is higher than that of asker ua and of all answerers in S. This is straightforward since the best answerer ub solves question q better than asker ua and all non-best answerers in S.

Let us view question q as a pseudo user uq. Taking a competitive viewpoint, each pairwise comparison can be viewed as a two-player competition with one winner and one loser: (1) one competition between pseudo user uq and asker ua, (2) one competition between pseudo user uq and the best answerer ub, (3) one competition between the best answerer ub and asker ua, and (4) |S| competitions between the best answerer ub and the non-best answerers in S. Pseudo user uq wins the first competition, and the best answerer ub wins all remaining (|S| + 2) competitions.

Hence, the problem of estimating question difficulty scores (and user expertise scores) is cast as the problem of learning the relative skills of players from the win-loss results of the generated two-player competitions. Formally, let Q denote the set of all questions in one category (or topic), and let Rq denote the set of all two-player competitions generated from question q ∈ Q, i.e., Rq = {(ua ≺ uq), (uq ≺ ub), (ua ≺ ub), (uo1 ≺ ub), ..., (uo|S| ≺ ub)}, where j ≺ i means that user i beats user j in the competition. Define

    R = ∪q∈Q Rq        (1)

as the set of all two-player competitions. Our problem is then to learn the relative skills of players from R. The learned skills of the pseudo question users are question difficulty scores, and the learned skills of all other users are their expertise scores.

TrueSkill. In this paper, we follow Liu et al. (2011) and apply TrueSkill to learn the relative skills of players from the set of generated competitions R (Equ. 1). TrueSkill (Herbrich et al., 2007) is a Bayesian skill rating model developed for estimating the relative skill levels of players in games. Here we present a two-player version of TrueSkill with no draws.

TrueSkill assumes that the practical performance of each player in a game follows a normal distribution N(µ, σ²), where µ is the player's estimated skill level and σ is the uncertainty of that estimate. TrueSkill learns the skill levels of players by leveraging Bayes' theorem: given the current estimated skill levels of two players (the prior probability) and the outcome of a new game between them (the likelihood), the model updates its estimation of the players' skill levels (the posterior probability). TrueSkill updates the skill level µ and the uncertainty σ intuitively: (a) if the outcome of a new competition is expected, i.e., the player with the higher skill level wins the game, it causes small updates to the skill level µ and the uncertainty σ; (b) if the outcome of a new competition is unexpected, i.e., the player with the lower skill level wins, it causes large updates to µ and σ.
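To make the pipeline concrete, the following is a minimal self-contained sketch, not the authors' implementation: the thread data and user names are hypothetical, the prior parameters (µ0 = 25, σ0 = 25/3, β = σ0/2) are the conventional TrueSkill defaults from Herbrich et al. (2007) rather than values reported here, and the updates are the standard closed-form two-player no-draw TrueSkill equations, applied sequentially over the generated competitions.

```python
import math

MU0 = 25.0            # conventional TrueSkill prior mean
SIGMA0 = MU0 / 3.0    # conventional prior uncertainty
BETA = SIGMA0 / 2.0   # conventional performance noise

def _pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    # Standard normal cumulative distribution.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

class Player:
    """A player's skill estimate: N(mu, sigma^2)."""
    def __init__(self):
        self.mu, self.sigma = MU0, SIGMA0

def update(winner, loser):
    # Two-player, no-draw TrueSkill update (Herbrich et al., 2007).
    c = math.sqrt(2.0 * BETA**2 + winner.sigma**2 + loser.sigma**2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # additive correction to the means
    w = v * (v + t)         # multiplicative shrinkage of the variances
    winner.mu += winner.sigma**2 / c * v
    loser.mu -= loser.sigma**2 / c * v
    winner.sigma *= math.sqrt(1.0 - winner.sigma**2 / c**2 * w)
    loser.sigma *= math.sqrt(1.0 - loser.sigma**2 / c**2 * w)

# Hypothetical resolved threads: (question id, asker, best answerer, others).
threads = [
    ("q1", "alice", "bob", ["carol", "dave"]),
    ("q2", "carol", "alice", ["dave"]),
]

players = {}
def player(name):
    return players.setdefault(name, Player())

# Generate the competitions R of Equation (1) and apply the updates:
# the pseudo question user uq beats the asker ua, while the best
# answerer ub beats uq, ua, and every non-best answerer in S.
for qid, asker, best, others in threads:
    uq = player("Q:" + qid)              # pseudo user for the question
    update(uq, player(asker))            # ua < uq
    update(player(best), uq)             # uq < ub
    update(player(best), player(asker))  # ua < ub
    for other in others:
        update(player(best), player(other))  # uo < ub

# mu of a "Q:" player = question difficulty; mu of any other = expertise.
for name, p in sorted(players.items()):
    print(f"{name:8s} mu={p.mu:6.2f} sigma={p.sigma:5.2f}")
```

Entering each question into the rating pool as one more player is what places difficulty scores and expertise scores on the same scale, so the four kinds of pairwise competitions suffice to estimate both at once.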
3 Experiments

3.1 Data Sets

We use the Stack Overflow C++ (SO/CPP) and mathematics (SO/Math) questions for our main experiments. Additionally, we use the data of Math Overflow (MO) for calibrating question difficulty scores across communities (Sec. 3.3). The statistics of these data sets are shown in Table 1.

                 SO/CPP   SO/Math       MO
# of questions  122,012    51,174   27,333
# of answers    357,632    94,488   65,966
# of users       67,819    16,961   12,064

Table 1: The statistics of the data sets.

To evaluate the effectiveness of our proposed model for estimating question difficulty scores, we