An Estimation and Analysis Framework for the Rasch Model

Andrew S. Lan¹, Mung Chiang², Christoph Studer³

¹Department of Electrical Engineering, Princeton University; ²Purdue University; ³School of Electrical and Computer Engineering, Cornell University. Correspondence to: Andrew S. Lan.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

Abstract

The Rasch model is widely used for item response analysis in applications ranging from recommender systems to psychology, education, and finance. While a number of estimators have been proposed for the Rasch model over the last decades, the available analytical performance guarantees are mostly asymptotic. This paper provides a framework that relies on a novel linear minimum mean-squared error (L-MMSE) estimator which enables an exact, nonasymptotic, and closed-form analysis of the parameter estimation error under the Rasch model. The proposed framework provides guidelines on the number of items and responses required to attain low estimation errors in tests or surveys. We furthermore demonstrate its efficacy on a number of real-world collaborative filtering datasets, which reveals that the proposed L-MMSE estimator performs on par with state-of-the-art nonlinear estimators in terms of predictive performance.

1. Introduction

This paper presents a novel framework that enables an exact, nonasymptotic, and closed-form analysis of the parameter estimation error under the Rasch model. The Rasch model was proposed in 1960 for modeling the responses of students/users to test/survey items (Rasch, 1960), and has enjoyed great success in applications including (but not limited to) psychometrics (van der Linden & Hambleton, 2013), educational tests (Lan et al., 2016), crowdsourcing (Whitehill et al., 2009), public health (Cappelleri et al., 2014), and even market and financial research (Schellhorn & Sharma, 2013; Brzezińska, 2016).

Mathematically, the (dichotomous) Rasch model, also known as the 1PL item response theory (IRT) model (Lord, 1980), is given by

p(Y_{u,i} = 1) = Φ(a_u − d_i),   (1)

where Y_{u,i} ∈ {−1, +1} denotes the response of user u to item i, with +1 standing for a correct response and −1 for an incorrect response. The parameters a_u ∈ R model the scalar abilities of users u = 1, ..., U, and the parameters d_i ∈ R model the scalar difficulties of items i = 1, ..., Q. The function Φ(x) = ∫_{−∞}^{x} N(t; 0, 1) dt, often referred to as the inverse probit link function¹, is the cumulative distribution function of a standard normal random variable, where N(t; 0, 1) denotes the probability density function of a standard normal random variable evaluated at t.

¹While some publications assume the inverse logit link function, i.e., the sigmoid Φ(x) = 1/(1 + e^{−x}), in most real-world applications the choice of the link function has no significant performance impact. In what follows, we focus on the inverse probit link function for reasons that will be discussed in Section 3.
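To make the generative model (1) concrete, the sketch below simulates a full response matrix in Python; the dimensions and unit parameter variances are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
U, Q = 50, 30                      # illustrative numbers of users and items
a = rng.normal(0.0, 1.0, size=U)   # user abilities a_u
d = rng.normal(0.0, 1.0, size=Q)   # item difficulties d_i

# p(Y_{u,i} = +1) = Phi(a_u - d_i), with Phi the standard normal CDF
P = norm.cdf(a[:, None] - d[None, :])

# draw responses in {-1, +1}
Y = np.where(rng.random((U, Q)) < P, 1, -1)
```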
The literature describes a range of parameter estimation methods under the Rasch model and related IRT models; see (Baker & Kim, 2004) for an overview. However, existing analytical results for the associated parameter estimation error are limited; see (Tsutakawa & Johnson, 1990) for an example. The majority of existing results have been proposed in the psychometrics and educational measurement literature; see, e.g., (Carroll et al., 2006) for a survey. The proposed analysis tools rely, for example, on multiple imputation (Yang et al., 2012) or Markov chain Monte Carlo (MCMC) techniques (Patz & Junker, 1999), and are thus not analytical. Hence, their accuracy strongly depends on the available data.

Other analysis tools use the Fisher information matrix (Zhang et al., 2011; Yang et al., 2012) to obtain lower bounds on the estimation error. Such methods are of asymptotic nature, i.e., they yield accurate results only when the numbers of users and items tend to infinity. For real-world settings with limited data, these bounds are typically loose. As an example, in computerized adaptive testing (CAT) (Chang & Ying, 2009), a user enters the system and starts responding to items. The system maintains an estimate of their ability parameter, and adaptively selects the next-best item to assign to the user, i.e., the item that is most informative of the ability estimate. Calculating the informativeness of each item requires an analysis of the uncertainty in the ability estimate. Initially, after the user has responded to only a few items, these asymptotic methods lead to highly inaccurate analyses, which may result in poor item selections.

Another family of analysis tools relies on concentration inequalities and yields probabilistic bounds, i.e., bounds that hold with high probability (Bunea, 2008; Filippi et al., 2010). Such results are often impractical in real-world applications. However, an exact analysis of the estimation error of the Rasch model is critical to ensure a certain degree of reliability of assessment scores in tests (Thompson, 2002).

1.1. Contributions

We propose a novel framework for the Rasch model that enables an exact, nonasymptotic, and closed-form analysis of the parameter estimation error. To this end, we generalize a recently-proposed linear estimator for binary regression (Lan et al., 2018) to the Rasch model, which enables us to derive a sharp upper bound on the mean squared error (MSE) of model parameter estimates. Our analytical results are in stark contrast to existing analytical results, which either provide loose lower bounds or are asymptotic in nature, rendering them impractical in real-world applications.

To demonstrate the efficacy of our framework, we provide experimental results on both synthetic and real-world data. First, using synthetic data, we show that our upper bound on the MSE is (often significantly) tighter than the Fisher information-based lower bound, especially when the problem size is small and when the data is noisy. Therefore, our framework enables a more accurate analysis of the estimation error in real-world settings. Second, using real-world student question response and user movie rating datasets, we show that our linear estimator achieves predictive performance competitive with more sophisticated, nonlinear estimators for which sharp performance guarantees are unavailable.

2. Rasch Model and Probit Regression

The Rasch model in (1) can be written in the following equivalent matrix-vector form (Hoff, 2009):

y = sign(Dx + w).   (2)

Here, the UQ-dimensional vector y ∈ {−1, +1}^{UQ} contains all user responses to all items, and the Rasch model matrix D = [1_Q ⊗ I_{U×U}, I_{Q×Q} ⊗ 1_U] is constructed with the Kronecker product operator ⊗, identity matrices I, and all-ones vectors 1. The vector x^T = [a^T, −d^T] to be estimated consists of the user abilities a ∈ R^U and the item difficulties d ∈ R^Q. The "noise" vector w contains i.i.d. standard normal random variables. In this equivalent form, parameter estimation under the Rasch model can be cast as a probit regression problem (Bliss, 1935), for which numerous estimators have been proposed in the past.
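The Kronecker structure of D is easy to form and check directly; the following minimal sketch of (2) uses illustrative dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
U, Q = 5, 4
a, d = rng.normal(size=U), rng.normal(size=Q)

# D = [1_Q kron I_U, I_Q kron 1_U] and x = [a; -d]
D = np.hstack([np.kron(np.ones((Q, 1)), np.eye(U)),
               np.kron(np.eye(Q), np.ones((U, 1)))])
x = np.concatenate([a, -d])

w = rng.standard_normal(U * Q)   # i.i.d. standard normal "noise"
y = np.sign(D @ x + w)           # y in {-1, +1}^{UQ}

# sanity check: Dx stacks a_u - d_i over all (user, item) pairs
assert np.allclose(D @ x, (a[:, None] - d[None, :]).ravel(order="F"))
```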
2.1. Estimators for Probit Regression

The two most prominent estimators for probit regression are the posterior mean (PM) estimator, given by

x̂^PM = E_x[x | y] = ∫_{R^N} x p(x | y) dx,   (3)

and the maximum a-posteriori (MAP) estimator, given by

x̂^MAP = arg min_{x ∈ R^N} − Σ_{m=1}^{M} log Φ(y_m d_m^T x) + (1/2) x^T C_x^{−1} x.

Here, p(x | y) denotes the posterior probability of the vector x given the observations y under the model (2), d_m^T denotes the mth row of the matrix of covariates D, and C_x denotes the covariance matrix of the multivariate Gaussian prior on x. A special case of the MAP estimator is the well-known maximum likelihood (ML) estimator, which does not impose a prior distribution on x.

The PM estimator is optimal in terms of minimizing the MSE of the estimated parameters, which is defined as

MSE(x̂) = E_{x,w}[ ||x − x̂||² ].   (4)

However, there are no simple methods to evaluate the expectation in (3) under the probit model. Thus, one typically resorts to Markov chain Monte Carlo (MCMC) methods (Albert & Chib, 1993) to perform PM estimation, which can be computationally intensive. In contrast to the PM estimator, MAP and ML estimation is generally less complex, since it can be implemented using standard convex optimization algorithms (Nocedal & Wright, 2006; Hastie et al., 2010; Goldstein et al., 2014). On the flip side, MAP and ML estimation is not optimal in terms of minimizing the MSE in (4).
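As an illustration of how MAP estimation can be carried out with an off-the-shelf quasi-Newton solver (a generic sketch, not the specific solvers cited above; the function name and argument layout are hypothetical), the negative log-posterior and its gradient suffice:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def map_probit(y, D, Cx_inv):
    """MAP estimate: argmin_x -sum_m log Phi(y_m d_m^T x) + 0.5 x^T Cx^{-1} x."""
    def objective(x):
        t = y * (D @ x)
        return -np.sum(norm.logcdf(t)) + 0.5 * x @ Cx_inv @ x

    def gradient(x):
        t = y * (D @ x)
        ratio = np.exp(norm.logpdf(t) - norm.logcdf(t))  # N(t;0,1) / Phi(t)
        return -D.T @ (y * ratio) + Cx_inv @ x

    res = minimize(objective, np.zeros(D.shape[1]), jac=gradient,
                   method="L-BFGS-B")
    return res.x
```

Omitting the quadratic prior term yields the ML estimator; note that for the Rasch matrix D in (2), the prior also resolves the ambiguity of shifting all abilities and difficulties by a common constant.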
In contrast to such well-established, nonlinear estimators, we build our framework on the family of linear estimators recently proposed in (Lan et al., 2018). There, a linear minimum MSE (L-MMSE) estimator was proposed for a certain class of probit regression problems. This L-MMSE estimator was found to perform on par with the PM estimator and to outperform the MAP estimator in terms of the MSE in certain settings, while enabling an exact and nonasymptotic analysis of the MSE.

2.2. Analytical Performance Guarantees

In the statistical estimation literature, there exist numerous analytical results characterizing the estimation errors of binary regression problems in the asymptotic setting. For example, (Brillinger, 1982) shows that least squares estimation is particularly effective when the design matrix D has i.i.d. Gaussian entries and the number of observations approaches infinity; in this case, its performance was shown to differ from that of the PM estimator only by a constant factor. Recently, (Thrampoulidis et al., 2015) provided a related analysis for the case in which the parameter vector x is sparse. Another family of probabilistic results relies on the asymptotic normality of ML estimators, either in the standard (dense) setting (Gourieroux & Monfort, 1981; Fahrmeir & Kaufmann, 1985) or the sparse setting (Bunea, 2008; Bach, 2010; Ravikumar et al., 2010; Plan & Vershynin, 2013), providing bounds on the MSE that hold with high probability. Since numerous real-world applications, such as the Rasch model, rely on deterministic, structured matrices and have small problem dimensions, existing analytical performance bounds are often loose; see Section 4 for experiments that support this claim.

3. Main Results

Our main result is as follows; the proof is given in Appendix A.

Theorem 1. Assume that x ∼ N(x̄, C_x) with mean vector x̄ and positive definite covariance matrix C_x, and assume that the vector w contains i.i.d. standard normal random variables. Consider the general probit regression model

y = sign(Dx + m + w),   (5)

where D is a given matrix of covariates and m is a given bias vector. Then, the L-MMSE estimate is given by

x̂^{L-MMSE} = E^T C_y^{−1} y + b,

where we use the following quantities (M denotes the number of rows of D, and ⊙ denotes the element-wise product):

E = 2 diag(N(c; 0, 1) ⊙ diag(C_z)^{−1/2}) D C_x
c = z̄ ⊙ diag(C_z)^{−1/2}
z̄ = D x̄ + m
C_z = D C_x D^T + I
C_y = 2 (Φ_2(c 1^T, 1 c^T; R) + Φ_2(−c 1^T, −1 c^T; R)) − 1_{M×M} − ȳ ȳ^T
R = diag(diag(C_z)^{−1/2}) C_z diag(diag(C_z)^{−1/2})
ȳ = Φ(c) − Φ(−c)
b = x̄ − E^T C_y^{−1} ȳ.

Here, Φ_2(x, y; ρ) denotes the cumulative density of a two-dimensional zero-mean Gaussian distribution with covariance matrix [1, ρ; ρ, 1], ρ ∈ [0, 1), defined as

Φ_2(x, y; ρ) = ∫_{−∞}^{x} ∫_{−∞}^{y} 1/(2π √(1 − ρ²)) exp(−(s² − 2ρst + t²) / (2(1 − ρ²))) dt ds,

and is applied element-wise on matrices. Furthermore, the associated estimation MSE is given by

MSE(x̂^{L-MMSE}) = tr(C_x − E^T C_y^{−1} E).
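For concreteness, the following is a direct, illustrative implementation of Theorem 1 (a sketch, not the authors' code); Φ_2 is evaluated pairwise with SciPy's bivariate normal CDF, so it is only practical for small numbers of observations M.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def lmmse_probit(y, D, x_bar, Cx, m):
    """L-MMSE estimate for y = sign(Dx + m + w), following Theorem 1 (sketch)."""
    M = D.shape[0]
    z_bar = D @ x_bar + m
    Cz = D @ Cx @ D.T + np.eye(M)
    sig = np.sqrt(np.diag(Cz))
    c = z_bar / sig                       # elementwise z_bar / sqrt(diag(Cz))
    R = Cz / np.outer(sig, sig)
    y_bar = norm.cdf(c) - norm.cdf(-c)

    # C_y = 2(Phi2(c, c'; R) + Phi2(-c, -c'; R)) - 1 - y_bar y_bar^T, entrywise
    Cy = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            rho = np.clip(R[i, j], -0.999999, 0.999999)  # clip for the diagonal
            cov = [[1.0, rho], [rho, 1.0]]
            p_pp = multivariate_normal.cdf([c[i], c[j]], cov=cov)
            p_mm = multivariate_normal.cdf([-c[i], -c[j]], cov=cov)
            Cy[i, j] = 2.0 * (p_pp + p_mm) - 1.0 - y_bar[i] * y_bar[j]

    E = 2.0 * np.diag(norm.pdf(c) / sig) @ D @ Cx
    W = np.linalg.solve(Cy, E).T          # W = E^T C_y^{-1} (C_y is symmetric)
    b = x_bar - W @ y_bar
    return W @ y + b
```

For the Rasch model, M = UQ, so this generic routine is costly for large problems; the closed form in Theorem 3 below avoids it entirely.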

We note that the linear estimator developed in (Lan et al., 2018, Thm. 1) is a special case of our result with x̄ = 0 and m = 0. As we will show below, including both of these terms is essential for our analysis.

Remark 1. We exclusively focus on probit regression, since the matrices E and C_y exhibit tractable expressions under this model. We are unaware of any closed-form expressions for these quantities in the logistic regression case.

As an immediate consequence of the fact that the PM estimator minimizes the MSE, we can use Theorem 1 to obtain the following upper bound on the MSE of the PM estimator.

Corollary 2. The MSE of the PM estimator is upper-bounded as follows:

MSE(x̂^PM) ≤ MSE(x̂^{L-MMSE}).   (6)

As we will demonstrate in Section 4, this upper bound on the MSE turns out to be surprisingly sharp for a broad range of parameters and problem settings.

We now specialize Theorem 1 to the Rasch model and use Corollary 2 to analyze the associated MSE. We divide our results into two cases: (i) both the user abilities and the item difficulties are unknown, and (ii) one of the two sets of parameters is known and the other is unknown. Due to the symmetry of the Rasch model, we present our results for unknown/known item difficulties while the user abilities are unknown and to be estimated; a corresponding analysis of the estimation error of the item parameters follows immediately.

3.1. First Case: Unknown Item Parameters

We now analyze the case in which both the user abilities and the item difficulties are unknown and need to be estimated. In practice, this scenario is relevant if a new set of items is deployed with little or no prior knowledge of their difficulty parameters. We assume that there is no missing data, i.e., we observe all user responses to all items.² In the psychometrics literature (see, e.g., (Linacre, 1999)), one typically assumes that the entries of the ability vector a and the difficulty vector d are i.i.d. zero-mean Gaussian with variance σ_a² and σ_d², respectively, i.e., a_u ∼ N(0, σ_a²) and d_i ∼ N(0, σ_d²); this can be included in our model assumptions. Thus, we can leverage the special structure of the Rasch model, since it corresponds to a special case of the generic probit regression model in (5) with D = [1_Q ⊗ I_{U×U}, I_{Q×Q} ⊗ 1_U] and m = 0. We have the following result on the MSE of the L-MMSE estimator; the proof is given in Appendix B.

²Our analysis can readily be generalized to missing data; the results, however, depend on the missing data pattern.

Theorem 3. Assume that σ_a² = σ_d² = σ_x² and that the covariance matrix of x is C_x = σ_x² I_{(U+Q)×(U+Q)}. Let

s = (2/π) arcsin(σ_x² / (2σ_x² + 1)).

Then, the MSE of the L-MMSE estimator of the user abilities under the Rasch model is given by

MSE_a = E_{x,w}[(a_u − â_u)²] = σ_x² (1 − (2/π) · (σ_x² / (2σ_x² + 1)) · Q(s(Q + U − 3) + 1) / ((s(Q − 2) + 1)(s(Q + U − 2) + 1))).   (7)

To the best of our knowledge, Theorem 3 is the first exact, nonasymptotic, and closed-form analysis of the MSE of a parameter estimation method for the Rasch model. From (7), we see that if σ_x² is held constant, then the relationship between MSE_a and the numbers of users (U) and items (Q) is given by the ratio of two second-order polynomials. If the signal-to-noise ratio (SNR) is low (or, equivalently, the data is noisy), i.e., σ_x² ≪ 1, then we have σ_x²/(2σ_x² + 1) ≈ 0 and hence s = (2/π) arcsin(σ_x²/(2σ_x² + 1)) ≈ 0. In this case, we have MSE_a ≈ σ_x², i.e., increasing the number of users/items does not affect the accuracy of the ability and difficulty parameter estimates; this behavior is as expected.

When U, Q → ∞, the MSE satisfies

MSE_a → σ_x² (1 − (σ_x² / (2σ_x² + 1)) / arcsin(σ_x² / (2σ_x² + 1))),   (8)

which is a non-negative quantity, since x ≤ arcsin(x) for x ∈ [0, 1]. This result implies that the L-MMSE estimator exhibits a residual MSE even as the number of users/items grows large; at high SNR values, this residual error approaches σ_x²(1 − 3/π). We note, however, that this result does not imply that the L-MMSE estimator is inconsistent under the Rasch model, since the number of parameters to be estimated (U + Q) grows with the number of observations (UQ) instead of remaining constant.

Remark 2. The above MSE analysis is data-independent, in contrast to error estimates that rely on the responses y (which is, for example, the case for the method in (Carroll et al., 2006)). This fact implies that our result provides an error estimate before observing y. Thus, Theorem 3 provides guidelines on how many items to include and how many users to recruit for a study, given a desired MSE level on the user ability and item difficulty parameter estimates.
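Since (7) depends only on U, Q, and σ_x², it can be evaluated before collecting any data, in the spirit of Remark 2. The helper below is an illustrative sketch; the function name, the target MSE, and the value of U are hypothetical choices, not values from the paper.

```python
import numpy as np

def mse_ability(U, Q, sigma2):
    """Closed-form L-MMSE ability MSE from Theorem 3."""
    s = (2.0 / np.pi) * np.arcsin(sigma2 / (2.0 * sigma2 + 1.0))
    num = Q * (s * (Q + U - 3.0) + 1.0)
    den = (s * (Q - 2.0) + 1.0) * (s * (Q + U - 2.0) + 1.0)
    return sigma2 * (1.0 - (2.0 / np.pi) * sigma2 / (2.0 * sigma2 + 1.0)
                     * num / den)

# hypothetical study design: smallest Q with MSE_a below a target, for U = 100
U, sigma2, target = 100, 1.0, 0.45
Q_min = next(q for q in range(1, 10_000) if mse_ability(U, q, sigma2) <= target)
```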

3.2. Second Case: Known Item Difficulties

We now analyze the case in which the user abilities are unknown and need to be estimated, while the item difficulties d are given. In practice, this scenario is relevant if a large number of users have previously responded to a set of items, so that a good estimate of the item difficulties is available. Let a denote the scalar ability parameter of a user. Then, their responses to the items are modeled as

p(y = 1) = Φ(1_Q a − d).

The following result follows from Theorem 1 by setting x = a (a scalar with prior mean x̄), C_x = σ_x², D = 1_Q, and m = −d.

Corollary 4. Assume that a ∼ N(x̄, σ_x²). Then, the L-MMSE estimate of the user ability is given by

â = e^T C_y^{−1} y + b,

where

e = 2 (σ_x² / √(σ_x² + 1)) N(c; 0, 1)
c = z̄ ⊙ diag(C_z)^{−1/2}
z̄ = x̄ 1_Q − d
C_z = σ_x² 1_{Q×Q} + I
ȳ = Φ(c) − Φ(−c)
C_y = 2 (Φ_2(c 1^T, 1 c^T; R) + Φ_2(−c 1^T, −1 c^T; R)) − 1_{Q×Q} − ȳ ȳ^T
R = diag(diag(C_z)^{−1/2}) C_z diag(diag(C_z)^{−1/2}),

and b = x̄ − e^T C_y^{−1} ȳ. The MSE of the user ability estimate is given by MSE(â) = σ_x² − e^T C_y^{−1} e.
4. Numerical Results

We now experimentally demonstrate the efficacy of the proposed framework. First, we use synthetically generated data to numerically compare our L-MMSE-based upper bound on the MSE of the PM estimator to the widely-used lower bound based on Fisher information (Zhang et al., 2011; Yang et al., 2012). We then use several real-world collaborative filtering datasets to show that the L-MMSE estimator achieves predictive performance comparable to that of the PM and MAP estimators.

4.1. Experiments with Synthetic Data

We start with synthetic data to demonstrate the exact and nonasymptotic nature of our analytical MSE expressions.

4.1.1. First Case: Unknown Item Parameters

Experimental Setup. We vary the number of users U ∈ {20, 50, 100} and the number of items Q ∈ {20, 50, 100, 200}. We generate the user ability and item difficulty parameters from zero-mean Gaussian distributions with variance σ_x² = σ_a² = σ_d². We vary σ_x² so that the signal-to-noise ratio (SNR) corresponds to {−10, 0, 10} decibels (dB). We then randomly generate the response of each user to each item, Y_{u,i}, according to (1). We repeat these experiments for 1,000 random instances of user and item parameters and responses, and report the averaged results.

We compute the L-MMSE-based upper bound on the MSE of the PM estimator using Theorem 1 and the Fisher information-based lower bound using the method detailed in (Zhang et al., 2011; Yang et al., 2012). Since the calculation of the Fisher information matrix requires the true values of the user ability and item difficulty parameters (which are to be estimated in practice), we use the PM estimates of these parameters instead. We also calculate the empirical parameter estimation MSEs of the L-MMSE and PM estimators. To this end, we use a standard Gibbs sampling procedure (Albert & Chib, 1993); we use the mean of the generated samples over 20,000 iterations as the PM estimate, after a burn-in phase of 10,000 iterations. We then use these estimates to calculate the empirical MSE.
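For reference, the following is a compact sketch of an Albert & Chib (1993)-style data-augmentation Gibbs sampler for PM estimation; it is illustrative rather than the experiment code, and the default iteration counts are placeholders rather than the 20,000/10,000 used above.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_pm(y, D, sigma2_x, n_iter=2000, burn_in=1000, seed=0):
    """PM estimate via Albert & Chib (1993)-style data augmentation (sketch)."""
    rng = np.random.default_rng(seed)
    M, N = D.shape
    x = np.zeros(N)
    # the posterior covariance of x given z is fixed: (D^T D + I / sigma2_x)^{-1}
    V = np.linalg.inv(D.T @ D + np.eye(N) / sigma2_x)
    L = np.linalg.cholesky(V)
    samples = []
    for it in range(n_iter):
        # z_m | x, y_m is N(d_m^T x, 1) truncated to the side selected by y_m
        mu = D @ x
        lo = np.where(y > 0, -mu, -np.inf)
        hi = np.where(y > 0, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # x | z is Gaussian with mean V D^T z and covariance V
        x = V @ (D.T @ z) + L @ rng.standard_normal(N)
        if it >= burn_in:
            samples.append(x)
    return np.mean(samples, axis=0)
```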

[Figure 1 appears here: a 3×3 grid of MSE plots. Panels (a)–(c): U = 20, 50, 100 at SNR = −10 dB; panels (d)–(f): U = 20, 50, 100 at SNR = 0 dB; panels (g)–(i): U = 20, 50, 100 at SNR = 10 dB.]

Figure 1. Empirical MSEs of the L-MMSE and PM estimators and the L-MMSE-based upper and Fisher information-based lower bounds on the MSE of the PM estimator for various SNR levels and problem sizes, when both user and item parameters are unknown. We see that the upper bound is tight at low SNR and at all SNRs when the problem size is small.

Results and Discussion. Fig. 1 shows the empirical MSEs of the L-MMSE and PM estimators, together with the L-MMSE-based upper bound and the Fisher information-based lower bound on the MSE of the PM estimator, for every problem size and every SNR. First, we see that the analytical and empirical MSEs of the L-MMSE estimator match perfectly, which confirms that our analytical MSE expressions are exact. We also see that for low SNR (i.e., the first row of Fig. 1), our L-MMSE-based upper bound on the MSE of the PM estimator is tight. Moreover, at all noise levels, the L-MMSE-based upper bound is tighter at small problem sizes, while the Fisher information-based lower bound is tighter at very large problem sizes and at high SNR.

These results confirm that our L-MMSE-based upper bound on the MSE is nonasymptotic, while the Fisher information-based lower bound is asymptotic and thus only tight at very large problem sizes. Therefore, the L-MMSE-based upper bound is more practical than the Fisher information-based lower bound in real-world applications, especially in situations like the initial phase of CAT, when the number of items a user has responded to is small.

4.1.2. Second Case: Known Item Parameters

Experimental Setup. In this experiment, we randomly generate the item parameters from the standard normal distribution (σ_d² = 1) and treat these parameters as known; we then estimate the user ability parameters via Corollary 4. The rest of the experimental setup remains unchanged.
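An illustrative implementation of the Corollary 4 estimate and its closed-form MSE (a sketch with a hypothetical function name, not the original experiment code):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def lmmse_ability_estimate(y, d, x_bar, sigma2):
    """L-MMSE ability estimate and its MSE per Corollary 4 (sketch)."""
    Q = d.size
    z_bar = x_bar * np.ones(Q) - d
    Cz = sigma2 * np.ones((Q, Q)) + np.eye(Q)
    sig = np.sqrt(np.diag(Cz))              # all entries equal sqrt(sigma2 + 1)
    c = z_bar / sig
    R = Cz / np.outer(sig, sig)
    y_bar = norm.cdf(c) - norm.cdf(-c)
    e = 2.0 * sigma2 / np.sqrt(sigma2 + 1.0) * norm.pdf(c)

    Cy = np.empty((Q, Q))
    for i in range(Q):
        for j in range(Q):
            rho = np.clip(R[i, j], -0.999999, 0.999999)
            cov = [[1.0, rho], [rho, 1.0]]
            Cy[i, j] = (2.0 * (multivariate_normal.cdf([c[i], c[j]], cov=cov)
                               + multivariate_normal.cdf([-c[i], -c[j]], cov=cov))
                        - 1.0 - y_bar[i] * y_bar[j])

    w = np.linalg.solve(Cy, e)              # C_y^{-1} e
    a_hat = w @ y + (x_bar - w @ y_bar)     # e^T C_y^{-1} y + b
    mse = sigma2 - e @ w                    # sigma_x^2 - e^T C_y^{-1} e
    return a_hat, mse
```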

[Figure 2 appears here: MSE plots for panels (a) SNR = −10 dB, (b) SNR = 1 dB, and (c) SNR = 10 dB.]

Figure 2. Empirical MSEs of the L-MMSE and PM estimators and the L-MMSE-based upper and Fisher information-based lower bounds on the MSE of the PM estimator for various SNR levels and various problem sizes, when the item parameters are known. We see that the upper bound is tight at low SNR and at higher SNRs when the problem sizes are small.

Results and Discussion. Fig. 2 shows the empirical MSEs of the L-MMSE and PM estimators, together with the L-MMSE-based upper bound and the Fisher information-based lower bound on the MSE of the PM estimator, for every problem size and every SNR. We see that the analytical and empirical MSEs of the L-MMSE estimator match. We also see that the L-MMSE-based upper bound on the MSE is tighter than the Fisher information-based lower bound at low SNR levels (−10 dB and 1 dB), and especially when the problem size is small (fewer than 50 items). These results further confirm that our L-MMSE-based upper bound on the MSE is nonasymptotic, and is thus practical in the "cold-start" setting of recommender systems.

4.2. Experiments with Real-World Data

We now test the performance of the proposed L-MMSE estimator using a variety of real-world datasets. Since the noise model in real-world datasets is generally unknown, we also consider the performance of MAP estimation using the inverse logit link function (Logit-MAP).

Datasets. We perform our experiments using a range of collaborative filtering datasets. These datasets are matrices that contain the binary-valued ratings (or graded responses) of users (or students) to movies (or items). For these datasets, we use the probit Rasch model. The datasets include (i) "MT", which consists of students' binary-valued (correct/incorrect) graded responses to questions in a high-school algebra test, with U = 99 students' 3,366 responses to Q = 34 questions; (ii) "SS", which consists of student responses in a signals and systems course, with U = 92 students' 5,692 responses to Q = 203 questions; (iii) "edX", which consists of student responses in an edX course, with U = 3,241 students' 177,181 responses to Q = 191 questions; and (iv) "ML", a processed version of the ml-100k dataset from the MovieLens project (Herlocker et al., 1999), with 37,175 integer-valued ratings by U = 943 users of Q = 1,152 movies. We adopt the procedure used in (Davenport et al., 2014) to transform the latter dataset into binary values by comparing each rating to the overall average rating.

Experimental Setup. We evaluate the prediction performance of the L-MMSE, MAP, PM, and Logit-MAP estimators using ten-fold cross validation. We randomly divide the entire dataset into ten equally-partitioned folds (of user-item response pairs), leave out one fold as the held-out test set, and use the other folds as the training set. We then use the training set to estimate the learner abilities a_u and item difficulties d_i, and use these estimates to predict the user responses in the test set. We tune the prior variance parameter σ_x² using a separate validation set (one fold in the training set). To assess the performance of these estimators, we use two metrics that are common in binary classification problems: prediction accuracy (ACC), which is simply the portion of correct predictions, and area under the receiver operating characteristic curve (AUC) (Jin & Ling, 2005). Both metrics have range [0, 1], with larger values indicating better predictive performance.
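The prediction step and both metrics can be computed as follows (an illustrative sketch; the helper name and argument layout are hypothetical, and scikit-learn's roc_auc_score is used for the AUC):

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def evaluate(a_hat, d_hat, test_users, test_items, test_y):
    """ACC and AUC on held-out (user, item, response) triples; test_y in {-1, +1}."""
    # predicted probability of a +1 response under the probit Rasch model (1)
    p = norm.cdf(a_hat[test_users] - d_hat[test_items])
    acc = np.mean(np.where(p > 0.5, 1, -1) == test_y)
    auc = roc_auc_score(test_y == 1, p)
    return acc, auc
```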
Results and Discussion. Tables 1 and 2 show the mean and standard deviation, across the folds, of the performance of each estimator on both metrics. We observe that the performance of the considered estimators is comparable on the ACC metric, while the L-MMSE estimator performs slightly worse than the MAP, PM, and Logit-MAP estimators on the AUC metric for most datasets.

We find it quite surprising that a well-designed linear estimator performs on par with more sophisticated nonlinear estimators on these real-world datasets. We also note that the L-MMSE estimator is more computationally efficient than the PM estimator. As an example, on the MT and ML datasets, one run of the L-MMSE estimator takes 0.23 s and 79 s, respectively, while one run of the PM estimator takes 1.9 s and 528 s (with 2,000 and 10,000 iterations required for convergence, respectively) on a standard laptop computer. These observations suggest that the L-MMSE estimator is computationally efficient and thus scales favorably to large datasets.

Table 1. Mean and standard deviation of the prediction accuracy (ACC) for the L-MMSE, MAP, PM, and Logit-MAP estimators.

        L-MMSE          MAP             PM              Logit-MAP
MT      0.795 ± 0.016   0.796 ± 0.015   0.796 ± 0.016   0.794 ± 0.015
SS      0.860 ± 0.007   0.859 ± 0.007   0.859 ± 0.007   0.859 ± 0.010
edX     0.932 ± 0.001   0.934 ± 0.002   0.935 ± 0.002   0.934 ± 0.002
ML      0.715 ± 0.004   0.713 ± 0.004   0.713 ± 0.004   0.714 ± 0.004

Table 2. Area under the receiver operating characteristic curve (AUC) of the L-MMSE, MAP, PM, and Logit-MAP estimators.

        L-MMSE          MAP             PM              Logit-MAP
MT      0.840 ± 0.016   0.843 ± 0.015   0.843 ± 0.015   0.842 ± 0.015
SS      0.800 ± 0.014   0.803 ± 0.013   0.803 ± 0.013   0.802 ± 0.013
edX     0.900 ± 0.004   0.909 ± 0.004   0.909 ± 0.004   0.909 ± 0.004
ML      0.755 ± 0.005   0.756 ± 0.004   0.756 ± 0.004   0.756 ± 0.004

5. Conclusions

We have generalized a recently proposed linear estimator for probit regression and applied the method to the classic Rasch model in item response analysis. We have shown that the L-MMSE estimator enables an exact, closed-form, and nonasymptotic MSE analysis, which is in stark contrast to existing analytical results that are asymptotic, probabilistic, or loose. As a result, we have shown that the nonasymptotic, L-MMSE-based upper bound on the parameter estimation error of the PM estimator under the Rasch model can be tighter than the common Fisher information-based asymptotic lower bound, especially in practical settings. An avenue for future work is to apply our analysis to models that are more sophisticated than the Rasch model, e.g., the latent factor model in (Lan et al., 2014).

A. Proof of Theorem 1

Let z = Dx + m + w. Thus, z ∼ N(Dx̄ + m, D C_x D^T + I) := N(z̄, C_z). The L-MMSE estimator for x has the general form x̂^{L-MMSE} = Wy + b, where W = E^T C_y^{−1} and b = x̄ − W ȳ, with

C_y = E[(y − ȳ)(y − ȳ)^T] = E[y y^T] − ȳ ȳ^T := C̃_y − ȳ ȳ^T

and

E = E[(y − ȳ)(x − x̄)^T] = E[y x^T] − ȳ x̄^T := Ẽ − ȳ x̄^T.

We need to evaluate three quantities: ȳ, C̃_y, and Ẽ.

We start with ȳ. Its ith entry is given by

ȳ_i = ∫_{−∞}^{∞} sign(z_i) N(z_i; z̄_i, [C_z]_{i,i}) dz_i
    = −∫_{−∞}^{0} N(z_i; z̄_i, [C_z]_{i,i}) dz_i + ∫_{0}^{∞} N(z_i; z̄_i, [C_z]_{i,i}) dz_i
    = Φ(z̄_i / √([C_z]_{i,i})) − Φ(−z̄_i / √([C_z]_{i,i})).

Next, we calculate C̃_y. Its (i, j)th entry is given by

[C̃_y]_{i,j} = ∫∫ sign(z_i) sign(z_j) N([z_i; z_j]; [z̄_i; z̄_j], [[C_z]_{i,i}, [C_z]_{i,j}; [C_z]_{j,i}, [C_z]_{j,j}]) dz_j dz_i
    (a) = v_1 + v_2 − v_3 − v_4
    (b) = 2(v_1 + v_2) − 1
        = 2 (Φ_2(c_i, c_j; ρ) + Φ_2(−c_i, −c_j; ρ)) − 1,

where c_i = z̄_i / √([C_z]_{i,i}), c_j = z̄_j / √([C_z]_{j,j}), ρ = [C_z]_{i,j} / √([C_z]_{i,i} [C_z]_{j,j}), and v_1, v_2, v_3, v_4 denote the probabilities that the standardized pair falls in the quadrants (−∞, −c_i) × (−∞, −c_j), (−c_i, ∞) × (−c_j, ∞), (−∞, −c_i) × (−c_j, ∞), and (−c_i, ∞) × (−∞, −c_j), respectively. Here, we have used (a) the change of variables (z_i − z̄_i)/√([C_z]_{i,i}) → z_i and (z_j − z̄_j)/√([C_z]_{j,j}) → z_j, under which the integrand becomes sign(z_i + c_i) sign(z_j + c_j) N([z_i; z_j]; 0, [1, ρ; ρ, 1]), and (b) the fact that v_1 + v_2 + v_3 + v_4 = 1; note that v_1 = Φ_2(−c_i, −c_j; ρ) and, by the symmetry of the zero-mean bivariate normal distribution, v_2 = Φ_2(c_i, c_j; ρ).

The computation of Ẽ follows from that in (Lan et al., 2018) and is omitted.
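The closed-form expressions for ȳ and C_y above can be sanity-checked by Monte Carlo simulation; a minimal sketch with illustrative dimensions:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
M, N = 4, 3
D = rng.normal(size=(M, N))
x_bar, m = rng.normal(size=N), rng.normal(size=M)
Cx = np.eye(N)

# closed forms from Theorem 1 / Appendix A
z_bar = D @ x_bar + m
Cz = D @ Cx @ D.T + np.eye(M)
sig = np.sqrt(np.diag(Cz))
c = z_bar / sig
y_bar = norm.cdf(c) - norm.cdf(-c)
rho = Cz[0, 1] / (sig[0] * sig[1])
Cy01 = (2 * (multivariate_normal.cdf([c[0], c[1]], cov=[[1, rho], [rho, 1]])
             + multivariate_normal.cdf([-c[0], -c[1]], cov=[[1, rho], [rho, 1]]))
        - 1 - y_bar[0] * y_bar[1])

# Monte Carlo comparison
x = rng.multivariate_normal(x_bar, Cx, size=200_000)
w = rng.standard_normal((200_000, M))
y = np.sign(x @ D.T + m + w)
print(np.abs(y.mean(axis=0) - y_bar).max())    # ~0 up to Monte Carlo error
print(np.cov(y[:, 0], y[:, 1])[0, 1] - Cy01)   # ~0 up to Monte Carlo error
```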

B. Proof of Theorem 3

Recall that the expression for the MSE is tr(C_x − E^T C_y^{−1} E); the critical part is to evaluate E^T C_y^{−1} E. We begin by evaluating C_y^{−1}. For the Rasch model, we have D = [1_Q ⊗ I_{U×U}, I_{Q×Q} ⊗ 1_U]. Therefore, since C_x = σ_x² I_{U+Q}, we have

C_z = D C_x D^T + I_{UQ×UQ} = σ_x² D D^T + I_{UQ×UQ}
    = σ_x² [1_Q ⊗ I_{U×U}, I_{Q×Q} ⊗ 1_U] [1_Q^T ⊗ I_{U×U}; I_{Q×Q} ⊗ 1_U^T] + I_{UQ×UQ}
    = σ_x² (1_{Q×Q} ⊗ I_{U×U} + I_{Q×Q} ⊗ 1_{U×U}) + I_{UQ×UQ},

where 1_{U×U} denotes the all-ones matrix of size U × U. Therefore, the UQ × UQ matrix C_z consists of three parts: (i) Q copies of the all-ones matrix σ_x² 1_{U×U} in its diagonal U × U blocks, (ii) copies of the matrix σ_x² I_{U×U} in every off-diagonal U × U block, plus (iii) the identity matrix I_{UQ×UQ}. Consequently, its diagonal elements are 2σ_x² + 1 and its non-zero off-diagonal elements are σ_x².

As detailed in (Lan et al., 2018, (7)), one can show that

C_y = (2/π) arcsin(diag(diag(C_z)^{−1/2}) C_z diag(diag(C_z)^{−1/2})),

and the term inside the arcsin function has the same structure as C_z, with diagonal entries equal to 1 and non-zero off-diagonal entries equal to σ_x²/(2σ_x² + 1). Therefore, C_y also has the same structure, with diagonal entries equal to 1 and non-zero off-diagonal entries equal to

s = (2/π) arcsin(σ_x² / (2σ_x² + 1)).

Since C_y^{−1} satisfies C_y C_y^{−1} = I_{UQ×UQ}, it is easy to see that the entries of C_y^{−1} only contain four distinct values (denoted by a, b, c, and d) and consist of two parts: (i) Q copies of a U × U matrix with a on its diagonal and b everywhere else, in its diagonal blocks, and (ii) copies of a U × U matrix with c on its diagonal and d everywhere else, in its off-diagonal blocks. We next compute a, b, c, and d. The first column of C_y^{−1} is given by

[a, b 1_{1×(U−1)}, c, d 1_{1×(U−1)}, c, d 1_{1×(U−1)}, ...]^T.

Since its inner product with the first row of C_y equals one (because C_y C_y^{−1} = I_{UQ×UQ}), we get

a + (U − 1)s b + (Q − 1)s c = 1.

Similarly, its inner products with the second, (U+1)th, and (U+2)th rows are all zero; this gives

s a + ((U − 2)s + 1) b + (Q − 1)s d = 0,
s a + ((Q − 2)s + 1) c + (U − 1)s d = 0,
s b + s c + ((U + Q − 4)s + 1) d = 0.

Solving the linear system given by these four equations results in

a = ((3U² + 3Q² − U²Q − UQ² + 8UQ − 15U − 15Q + 20)s³ + (−U² − Q² − 3UQ + 11U + 11Q − 22)s² + (−2U − 2Q + 8)s − 1) / r
b = ((UQ + Q² − 3U − 5Q + 8)s³ + (U + 2Q − 6)s² + s) / r
c = ((UQ + U² − 5U − 3Q + 8)s³ + (2U + Q − 6)s² + s) / r
d = (−(U + Q − 4)s³ − 2s²) / r,   (9)

where

r = (2s − 1)((U − 2)s + 1)((Q − 2)s + 1)((Q + U − 2)s + 1).

Now, let A be the U × U matrix with c on its diagonal and d everywhere else, and let B denote the U × U matrix with a − c on its diagonal and b − d everywhere else; then we can write C_y^{−1} as

C_y^{−1} = 1_{Q×Q} ⊗ A + I_{Q×Q} ⊗ B.   (10)

Our second task is to evaluate E. Since

E = √(2/π) diag(diag(C_z)^{−1/2}) D C_x = √(2/π) (σ_x² / √(2σ_x² + 1)) [1_Q ⊗ I_{U×U}, I_{Q×Q} ⊗ 1_U],

we have

E^T C_y^{−1} E = (2/π) (σ_x⁴ / (2σ_x² + 1)) [1_Q^T ⊗ I_{U×U}; I_{Q×Q} ⊗ 1_U^T] (1_{Q×Q} ⊗ A + I_{Q×Q} ⊗ B) [1_Q ⊗ I_{U×U}, I_{Q×Q} ⊗ 1_U],

whose top-left U × U block equals (2/π) (σ_x⁴ / (2σ_x² + 1)) Q(QA + B), where we have used the mixed-product property (X ⊗ Y)(U ⊗ V) = (XU) ⊗ (YV). Therefore, entry (1, 1) of E^T C_y^{−1} E equals

(2/π) (σ_x⁴ / (2σ_x² + 1)) Q(a + (Q − 1)c),

and the MSE of the user ability parameter estimates follows as

MSE_a = σ_x² − (2/π) (σ_x⁴ / (2σ_x² + 1)) Q(a + (Q − 1)c)
      = σ_x² (1 − (2/π) (σ_x² / (2σ_x² + 1)) Q(s(Q + U − 3) + 1) / ((s(Q − 2) + 1)(s(Q + U − 2) + 1))),

where we have used (9), thus completing the proof.

Acknowledgments

C. Studer was supported in part by Xilinx, Inc. and by the US National Science Foundation (NSF) under grants ECCS-1408006, CCF-1535897, CCF-1652065, and CNS-1717559.

References

Albert, J. H. and Chib, S. Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc., 88(422):669–679, June 1993.

Bach, F. Self-concordant analysis for logistic regression. Electron. J. Stat., 4:384–414, 2010.

Baker, F. B. and Kim, S. H. Item Response Theory: Parameter Estimation Techniques. Marcel Dekker Inc., 2nd edition, 2004.

Bliss, C. The calculation of the dosage-mortality curve. Ann. Appl. Biol., 22(1):134–167, Feb. 1935.

Brillinger, D. A Festschrift for Erich L. Lehmann, chapter "A generalized linear model with 'Gaussian' regressor variables", pp. 97–114. Wadsworth Statistics/Probability Series. Chapman & Hall/CRC, 1982.

Brzezińska, J. Latent variable modelling and item response theory analyses in marketing research. J. University of Szczecin, 16(2):163–174, Dec. 2016.

Bunea, F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. Electron. J. Stat., 2:1153–1194, 2008.

Cappelleri, J. C., Lundy, J. J., and Hays, R. D. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures. Clinical Therapeutics, 36(5):648–662, May 2014.

Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, 2006.

Chang, H. and Ying, Z. Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. The Annals of Statistics, 37(3):1466–1488, Jun. 2009.

Davenport, M. A., Plan, Y., van den Berg, E., and Wootters, M. 1-bit matrix completion. Inf. Infer., 3(3):189–223, Sep. 2014.

Fahrmeir, L. and Kaufmann, H. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Stat., 13(1):342–368, Mar. 1985.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Proc. Adv. Neural Info. Proc. Syst., pp. 586–594, Dec. 2010.

Goldstein, T., Studer, C., and Baraniuk, R. A field guide to forward-backward splitting with a FASTA implementation. arXiv preprint: 1411.3406, Nov. 2014.

Gourieroux, C. and Monfort, A. Asymptotic properties of the maximum likelihood estimator in dichotomous logit models. J. Econometrics, 17(1):83–97, Sep. 1981.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, 2010.

Herlocker, J., Konstan, J., Borchers, A., and Riedl, J. An algorithmic framework for performing collaborative filtering. In Proc. Ann. Intl. Conf. Res. Develop. Inf. Retrieval, pp. 230–237, Aug. 1999.

Hoff, P. D. A First Course in Bayesian Statistical Methods. Springer, 2009.

Jin, H. and Ling, C. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng., 17(3):299–310, March 2005.

Lan, A. S., Waters, A. E., Studer, C., and Baraniuk, R. G. Sparse factor analysis for learning and content analytics. J. Mach. Learn. Res., 15:1959–2008, June 2014.

Lan, A. S., Goldstein, T., Baraniuk, R. G., and Studer, C. Dealbreaker: A nonlinear latent variable model for educational data. In Proc. Intl. Conf. Mach. Learn., pp. 266–275, June 2016.

Lan, A. S., Chiang, M., and Studer, C. Linearized binary regression. arXiv preprint: 1802.00430, Feb. 2018. https://arxiv.org/abs/1802.00430.

Linacre, J. M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas., 3(4):382–405, 1999.

Lord, F. Applications of Item Response Theory to Practical Testing Problems. Erlbaum Associates, 1980.

Nocedal, J. and Wright, S. Numerical Optimization. Springer, 2006.

Patz, R. J. and Junker, B. W. A straightforward approach to Markov chain Monte Carlo methods for item response models. J. Educ. Behav. Stat., 24(2):146–178, June 1999.

Plan, Y. and Vershynin, R. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Trans. Inf. Th., 59(1):482–494, Sep. 2013.

Rasch, G. Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests. Nielsen & Lydiche, 1960.

Ravikumar, P., Wainwright, M., and Lafferty, J. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Stat., 38(3):1287–1319, June 2010.

Schellhorn, C. and Sharma, R. Using the Rasch model to rank firms by managerial ability. Managerial Finance, 39(3):306–319, Mar. 2013.

Thompson, B. Score Reliability: Contemporary Thinking on Reliability Issues. Sage Publications, 2002.

Thrampoulidis, C., Abbasi, E., and Hassibi, B. Lasso with non-linear measurements is equivalent to one with linear measurements. In Proc. Adv. Neural Info. Proc. Syst., pp. 3420–3428, Dec. 2015.

Tsutakawa, R. K. and Johnson, J. C. The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55(2):371–390, June 1990.

van der Linden, W. J. and Hambleton, R. K. Handbook of Modern Item Response Theory. Springer Science & Business Media, 2013.

Whitehill, J., Wu, T., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Proc. Adv. Neural Info. Proc. Syst., pp. 2035–2043, Dec. 2009.

Yang, J. S., Hansen, M., and Cai, L. Characterizing sources of uncertainty in item response theory scale scores. Educ. Psychol. Meas., 72(2):264–290, Aug. 2012.

Zhang, J., Xie, M., Song, X., and Lu, T. Investigating the impact of uncertainty about item parameters on ability estimation. Psychometrika, 76(1):97–118, Jan. 2011.