Predicting Accurate Probabilities with a Ranking Loss


Aditya Krishna Menon¹, Xiaoqian Jiang¹, Shankar Vembu², Charles Elkan¹, Lucila Ohno-Machado¹

¹University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
²University of Toronto, 160 College Street, Toronto, ON M5S 3E1, Canada

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

In many real-world applications of machine learning classifiers, it is essential to predict the probability of an example belonging to a particular class. This paper proposes a simple technique for predicting probabilities based on optimizing a ranking loss, followed by isotonic regression. This semi-parametric technique offers both good ranking and regression performance, and models a richer set of probability distributions than statistical workhorses such as logistic regression. We provide experimental results that show the effectiveness of this technique on real-world applications of probability prediction.

1. Introduction

Classification is the problem of learning a mapping from examples to labels, with the goal of categorizing future examples into one of several classes. However, many real-world applications instead require that we estimate the probability of an example having a particular label. For example, when studying the click behaviour of ads in computational advertising, it is essential to model the probability of an ad being clicked, rather than just predicting whether or not it will be clicked (Richardson et al., 2007). Accurate probabilities are also essential for medical screening tools to trigger early assessment and admission to an ICU (Subbe et al., 2001).

In this paper, we propose a simple semi-parametric model for predicting accurate probabilities that uses isotonic regression in conjunction with scores derived from optimizing a ranking loss. We analyze theoretically and empirically where our approach can provide more reliable estimates than standard statistical workhorses for probability estimation, such as logistic regression. The model attempts to achieve good ranking (in an area under ROC sense) and regression (in a squared error sense) performance simultaneously, which is important in many real-world applications (Sculley, 2010). Further, our model is much less expensive to train than full-blown nonparametric methods, such as kernel logistic regression. It is thus an appealing choice in situations where parametric models are employed for probability estimation, such as medical informatics and credit scoring.

The paper is organized as follows. First, we provide motivating examples for predicting probabilities, and define the fundamental concept of proper losses. We then review existing methods used to predict probabilities, and discuss their limitations. Next, we detail our method to estimate probabilities, based on optimizing a ranking loss and feeding the results into isotonic regression. Finally, we provide experimental results on real-world datasets to validate our analysis and to test the efficacy of our method.

We first fix our notation. We focus on probability estimation for examples x ∈ X with labels y ∈ {0, 1}. Each x has a conditional probability function h(x) := Pr[y = 1 | x]. For our purposes, a model is some deterministic mapping ŝ : X → ℝ. A probabilistic model ĥ is a model whose outputs are in [0, 1], and may be derived by composing a model with a link function f : ℝ → [0, 1]. The scores of a model may be thresholded to give a classifier ŷ : X → {0, 1}. We assume ŝ is learned from a training set {(x_i, y_i)}_{i=1}^n of n iid draws from X × {0, 1}.
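To make this notation concrete, here is a minimal sketch (not from the paper; the linear scorer and sigmoid link are illustrative assumptions) of how a score model ŝ, a link function f, a probabilistic model ĥ, and a thresholded classifier ŷ fit together:

```python
import numpy as np

def score(w, x):
    """A model s_hat: X -> R; an (assumed) linear scorer w^T x for illustration."""
    return np.dot(w, x)

def sigmoid(s):
    """A link function f: R -> [0, 1]."""
    return 1.0 / (1.0 + np.exp(-s))

def predict_proba(w, x):
    """Probabilistic model h_hat(x) = f(s_hat(x))."""
    return sigmoid(score(w, x))

def predict_label(w, x, threshold=0.5):
    """Classifier y_hat obtained by thresholding the estimated probability."""
    return int(predict_proba(w, x) >= threshold)

w = np.array([0.8, -1.2])
x = np.array([1.0, 0.5])
print(predict_proba(w, x), predict_label(w, x))
```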
2. Background and motivation

Classically, the supervised learning literature has focussed on the scenario where we want to minimize the number of misclassified examples on test data. However, practical applications of machine learning models often have more complex constraints and requirements, which demand that we output the probability of an example possessing a label. Examples of such applications include:

Building meta-classifiers, where the output of a model is fed to a meta-classifier that uses additional domain knowledge to make a prediction. For example, doctors prefer to use a classifier's prediction as evidence to aid their own decision-making process (Manickam & Abidi, 1999). In such scenarios, it is essential that the classifier assess the confidence in its predictions being correct, which may be captured using probabilities;

Using predictions to take actions, such as deciding whether or not to contact a person for a marketing campaign. Such actions have an associated utility that is to be maximized, and maximization of expected utility is most naturally handled by estimating probabilities rather than making hard decisions (Zadrozny & Elkan, 2001);

Non-standard learning tasks, where problem constraints demand estimating uncertainty. For example, in the task of learning from only positive and unlabelled examples, training a probabilistic model that distinguishes labelled versus unlabelled examples is a provably (under some assumptions) sufficient strategy (Elkan & Noto, 2008).

Intuitively, probability estimates ĥ(·) are accurate if, on average, they are close to the true probability h(·). Quantifying "close to" requires picking some sensible discrepancy measure, and this idea is formalized by the theory of proper loss functions, which we now discuss. A model for binary classification uses a loss function ℓ : {0, 1} × ℝ → ℝ₊ to measure the discrepancy between a label y and the model's prediction ŝ for some example x. If our model outputs probability estimates ĥ by transforming scores with a link function f(·), we may equivalently think of there being a probabilistic loss ℓ^P(·, ·) such that ℓ(y, ŝ) = ℓ^P(y, f(ŝ)). The empirical error of ŝ with respect to the loss ℓ is

\[ E_{\mathrm{emp}}(\hat{s}(\cdot)) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \hat{s}(x_i)), \]

which is a surrogate for the generalization error

\[ E(\hat{s}(\cdot)) = \mathbb{E}_x\,\mathbb{E}_{y|x}\,\ell(y, \hat{s}(x)) = \mathbb{E}_x\!\left[ h(x)\,\ell(1, \hat{s}(x)) + (1 - h(x))\,\ell(0, \hat{s}(x)) \right] := \mathbb{E}_x\, L_\ell(h(x), \hat{s}(x)). \tag{1} \]

The term L_ℓ(h, ŝ) is a measure of discrepancy between an example's probability of being positive and its predicted score. Let s*(h) = argmin_s L_ℓ(h, s). Then, we call a loss function ℓ Bayes consistent (Buja et al., 2005) if for every h ∈ [0, 1], s*(h) · (h − 1/2) ≥ 0, meaning that s*(h) agrees in sign with the optimal prediction under the 0-1 loss ℓ(y, ŝ) = 1[yŝ ≤ 0]. If s*(h) is invertible, then (s*)⁻¹(s*(h)) = h, so that the optimal scores are some transformation of h(x). In such cases, we call the corresponding probabilistic loss ℓ^P a proper (or Fisher-consistent) loss (Buja et al., 2005), and say that ℓ corresponds to a proper loss.

Many commonly used loss functions, such as square loss ℓ(y, ŝ) = (y − ŝ)² and logistic loss ℓ(y, ŝ) = log(1 + e^{−(2y−1)ŝ}), correspond to a proper loss function. Thus, a model with good regression performance according to squared error, say, can be thought to yield meaningful probability estimates. The hinge loss of SVMs, ℓ(y, ŝ) = max(0, 1 − (2y−1)ŝ), is Bayes consistent but does not correspond to a proper loss function, which is why SVMs do not output meaningful probabilities (Platt, 1999).
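As a quick numerical illustration of this distinction (not from the paper; a small SciPy check under the definitions above), minimizing the expected logistic loss at a fixed h recovers h through the sigmoid link, whereas the expected hinge loss is minimized at ŝ = ±1 regardless of how far h is from 1/2:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expected_loss(loss, h, s):
    """L_ell(h, s) = h * ell(1, s) + (1 - h) * ell(0, s)."""
    return h * loss(1, s) + (1 - h) * loss(0, s)

logistic = lambda y, s: np.log(1 + np.exp(-(2 * y - 1) * s))
hinge    = lambda y, s: max(0.0, 1 - (2 * y - 1) * s)

for h in [0.1, 0.3, 0.7, 0.9]:
    s_log = minimize_scalar(lambda s: expected_loss(logistic, h, s),
                            bounds=(-20, 20), method="bounded").x
    s_hin = minimize_scalar(lambda s: expected_loss(hinge, h, s),
                            bounds=(-20, 20), method="bounded").x
    # Logistic loss is proper: sigmoid(s*) recovers h.
    # Hinge loss is only Bayes consistent: s* sits near +/-1, so h is lost.
    print(f"h={h}: sigmoid(s*_logistic)={1 / (1 + np.exp(-s_log)):.3f}, "
          f"s*_hinge={s_hin:.2f}")
```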
3. Analysis of existing paradigms to learn accurate probabilities

We now analyze two major paradigms for probability estimation, and study their possible failure modes.

3.1. Optimization of a proper loss

A direct approach to predicting probabilities is to optimize a proper loss function on the training data using some hypothesis class, e.g. linear separators. Examples include logistic regression and linear regression (after truncation to [0, 1]), which are instances of the generalized linear model framework, which assumes E[y | x] = f(wᵀx) for some link function f(·). The loss-dependent error measure, L_ℓ(h, ŝ), is one metric by which we can choose amongst proper losses. For example, the discrepancy measures for square and logistic loss are (Zhang, 2004)

\[ L_{\mathrm{square}}(h, \hat{s}) = (h - \hat{s})^2 + C_1 \tag{2} \]
\[ L_{\mathrm{logistic}}(h, \hat{s}) = \mathrm{KL}\!\left(h \,\middle\|\, \frac{1}{1 + e^{-\hat{s}}}\right) + C_2, \tag{3} \]

where KL denotes the Kullback–Leibler divergence, and C₁, C₂ are independent of the prediction ŝ. Based on this, Zhang (2004) notes that logistic regression has difficulty when h(x)(1 − h(x)) ≈ 0 for some x, by virtue of requiring |ŝ(x)| → ∞. This has been observed in practical uses of logistic regression with imbalanced classes (King & Zeng, 2001; Foster & Stine, 2004), with the latter proposing the use of linear regression as a more robust alternative.

3.2. Post-processing methods

A distinct strategy is to train a model in some manner, and then extract probability estimates from it in a post-processing step. Three popular techniques of this type are Platt scaling (Platt, 1999), binning (Zadrozny & Elkan, 2001), and isotonic regression (Zadrozny & Elkan, 2002). We focus on the latter, as it is more flexible than the former two approaches by virtue of being nonparametric, and has been shown to work well empirically for a range of input models (Niculescu-Mizil & Caruana, 2005).

Isotonic regression is a nonparametric technique to find a monotone fit to a set of target values.

… up to the choice of link function, i.e. h(x) = f(wᵀx), but f(·) is not the sigmoid function. The maximum likelihood estimates of a generalized linear model with a misspecified link function are known to be asymptotically biased (Czado & Santner, 1992). Isotonic regression alleviates this particular …
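As a rough illustration of the two-stage recipe discussed above — train a scoring model, then calibrate its scores with isotonic regression — the following sketch uses scikit-learn on synthetic data. The plain logistic scorer is only a stand-in for the paper's ranking-loss optimizer, and all data here are illustrative:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data with a known conditional probability h(x).
X = rng.normal(size=(2000, 1))
p_true = 1 / (1 + np.exp(-3 * X[:, 0]))
y = rng.binomial(1, p_true)

# Stage 1: any model producing a ranking of examples (placeholder scorer here).
scorer = LogisticRegression().fit(X, y)
scores = scorer.decision_function(X)

# Stage 2: isotonic regression maps scores to probabilities monotonically.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
probs = iso.fit_transform(scores, y)

print("mean squared error vs true h(x):", np.mean((probs - p_true) ** 2))
```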
Recommended publications
  • Bayesian Isotonic Regression for Epidemiology
    BAYESIAN ISOTONIC REGRESSION FOR EPIDEMIOLOGY. Authors: Dunson, D.B.* and Neelon, B. Affiliation: Biostatistics Branch, National Institute of Environmental Health Sciences, U.S. National Institutes of Health, MD A3-03, P.O. Box 12233, RTP, NC 27709, U.S.A. Email: [email protected]; Phone: 919-541-3033; Fax: 919-541-4311. Corresponding author: Dunson, D.B. Keywords: Additive model; Smoothing; Trend test. Topic area of the submission: Statistics in Epidemiology.
    Abstract: In many applications, the mean of a response variable, Y, conditional on a predictor, X, can be characterized by an unknown isotonic function, f(·), and interest focuses on (i) assessing evidence of an overall increasing trend; (ii) investigating local trends (e.g., at low dose levels); and (iii) estimating the response function, possibly adjusted for the effects of covariates, Z. For example, in epidemiologic studies, one may be interested in assessing the relationship between dose of a possibly toxic exposure and the probability of an adverse response, controlling for confounding factors. In characterizing biologic and public health significance, and the need for possible regulatory interventions, it is important to efficiently estimate dose response, allowing for flat regions in which increases in dose have no effect. In such applications, one can typically assume a priori that an adverse response does not occur less often as dose increases, adjusting for important confounding factors, such as age and race. It is well known that incorporating such monotonicity constraints can improve estimation efficiency and power to detect trends [1], and several frequentist approaches have been proposed for smooth monotone curve estimation [2-3].
  • Demand STAR Ranking Methodology
    The methodology used to assess demand in this tool is based upon a process used by the State of Louisiana's "Star Rating" system. Data regarding current openings, short-term and long-term hiring outlooks, along with wages, are combined into a single five-point ranking metric.
    Long Term Occupational Projections 2014-2024. The steps to derive a rank for the long-term hiring outlook (DUA Occupational Projections) are as follows: 1) Eliminate occupations with a SOC code ending in "9" in order to remove catch-all occupational titles containing "All Other" in the description. 2) Compile occupations by six-digit Standard Occupational Classification (SOC) codes. 3) Calculate a decile ranking for each occupation based on: a. Total Projected Employment 2024; b. Projected Change from 2014-2024. 4) For each metric, assign 1-10 points to each occupation based on the decile ranking. 5) Average the points for Projected Employment and Change from 2014-2024.
    Short Term Occupational Projections 2015-2017. The steps to derive occupational ranks for the short-term hiring outlook are the same as those used for the long-term hiring outlook, but using the Short Term Occupational Projections 2015-2017 data set.
    Current Job Openings. Current job openings rankings are assigned based on actual jobs posted online for each region over a 12-month period. The 12-month average posting volume for each occupation by six-digit SOC code was captured using The Conference Board's Help Wanted On-Line analytics tool. The process for ranking is as follows: 1) Eliminate occupations with a SOC code ending in "9" in order to remove catch-all occupational titles containing "All Other" in the description. 2) Compile occupations by six-digit Standard Occupational Classification (SOC) codes. 3) Determine the decile ranking for the average number of online postings by occupation. 4) Assign 1-10 points to each occupation based on the decile ranking.
    Wages. In an effort to prioritize occupations with higher wages, wages are weighted more heavily than the current, short-term, and long-term hiring outlook rankings.
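A hedged sketch of the decile-points step described above (hypothetical data and column names; pandas/NumPy, not the tool's actual implementation):

```python
import numpy as np
import pandas as pd

# Hypothetical long-term projections (column names are illustrative).
df = pd.DataFrame({
    "soc_code": ["11-1011", "13-2011", "15-1252", "29-1141", "51-9199"],
    "projected_employment_2024": [1200, 5400, 8700, 15200, 9900],
    "change_2014_2024": [80, 420, 1300, 2100, -150],
})

# Step 1: drop catch-all "All Other" titles (SOC codes ending in "9").
df = df[~df["soc_code"].str.endswith("9")].copy()

def decile_points(series):
    """Steps 3-4: assign 1-10 points from the decile of a metric."""
    pct = series.rank(pct=True, method="average")   # percentile rank in (0, 1]
    return np.ceil(pct * 10).clip(1, 10).astype(int)

# Step 5: average the two long-term point scores.
df["emp_points"] = decile_points(df["projected_employment_2024"])
df["chg_points"] = decile_points(df["change_2014_2024"])
df["long_term_score"] = df[["emp_points", "chg_points"]].mean(axis=1)
print(df)
```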
  • Ranking of Genes, SNVs, and Sequence Regions: Ranking Elements Within Various Types of Biosets for Metaanalysis of Genetic Data
    Technical Note: Informatics. Ranking of Genes, SNVs, and Sequence Regions — ranking elements within various types of biosets for metaanalysis of genetic data.
    Introduction. One of the primary applications of the BaseSpace® Correlation Engine is to allow researchers to perform metaanalyses that harness large amounts of genomic, epigenetic, proteomic, and assay data. Such analyses look for potentially novel and interesting results that cannot necessarily be seen by looking at a single existing experiment. These results may be interesting in themselves (eg, associations between different treatment factors, or between a treatment and an existing known pathway or protein family), or they may be used to guide further research and experimentation. The primary entity in these analyses is the bioset. It is a ranked list of elements (genes, probes, proteins, compounds, single-nucleotide variants [SNVs], sequence regions, etc.) that corresponds to a given treatment or condition in an experiment, an assay, or a single patient sample (eg, mutations).
    In summary, biosets may contain many of the following columns: an identifier of an entity such as a gene, SNV, or sequence region (required); other identifiers of the entity (eg, chromosome, position); and summary statistics (eg, p-value, fold change, score, rank, odds ratio).
    Ranking of Elements within a Bioset. Most biosets generated at Illumina include elements that are changing relative to a reference genome (mutations) or due to a treatment (or some other test factor), along with a corresponding rank and directionality. Typically, the rank will be based on the magnitude of change (eg, fold change); however, other values, including p-values, can be used for this ranking. Directionality is determined from the sign of the statistic: eg, up (+) or down (-) regulation or copy-number gain …
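A small illustrative sketch of ranking bioset elements by magnitude of change with sign-based directionality (hypothetical data and column names, not Illumina's actual schema):

```python
import numpy as np
import pandas as pd

# Hypothetical bioset: one row per gene with a summary statistic.
bioset = pd.DataFrame({
    "gene": ["TP53", "BRCA1", "EGFR", "MYC"],
    "fold_change": [2.5, -4.0, 1.2, -1.8],
    "p_value": [0.001, 0.0005, 0.04, 0.01],
})

# Rank by magnitude of change; direction comes from the sign of the statistic.
bioset["rank"] = bioset["fold_change"].abs().rank(ascending=False).astype(int)
bioset["direction"] = np.sign(bioset["fold_change"]).map({1.0: "up", -1.0: "down"})
print(bioset.sort_values("rank"))
```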
  • Conduct and Interpret a Mann-Whitney U-Test
    Statistics Solutions — Advancement Through Clarity — http://www.statisticssolutions.com
    Conduct and Interpret a Mann-Whitney U-Test. What is the Mann-Whitney U-Test? The Mann-Whitney U-test is a statistical comparison of the mean. The U-test is a member of the bigger group of dependence tests. Dependence tests assume that the variables in the analysis can be split into independent and dependent variables. A dependence test that compares the mean scores of an independent and a dependent variable assumes that differences in the mean score of the dependent variable are caused by the independent variable. In most analyses the independent variable is also called a factor, because the factor splits the sample into two or more groups, also called factor steps. Other dependence tests that compare the mean scores of two or more groups are the F-test, ANOVA, and the t-test family. Unlike the t-test and F-test, the Mann-Whitney U-test is a non-parametric test. That means that the test does not assume any properties regarding the distribution of the underlying variables in the analysis. This makes the Mann-Whitney U-test the analysis to use when analyzing variables of ordinal scale. The Mann-Whitney U-test is also the mathematical basis for the H-test (also called the Kruskal-Wallis H), which is basically nothing more than a series of pairwise U-tests. Because the test was initially designed in 1945 by Wilcoxon for two samples of the same size, and in 1947 further developed by Mann and Whitney to cover different sample sizes, the test is also called the Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test, Wilcoxon–Mann–Whitney test, or Wilcoxon two-sample test.
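A minimal example of running the test with SciPy (hypothetical ordinal data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Two hypothetical ordinal-scale samples (e.g., ratings from two groups).
group_a = rng.integers(1, 6, size=30)
group_b = rng.integers(2, 7, size=25)

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```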
  • Different Perspectives for Assigning Weights to Determinants of Health
    COUNTY HEALTH RANKINGS WORKING PAPER: DIFFERENT PERSPECTIVES FOR ASSIGNING WEIGHTS TO DETERMINANTS OF HEALTH. Bridget C. Booske, Jessica K. Athens, David A. Kindig, Hyojun Park, Patrick L. Remington. February 2010.
    Table of Contents: Summary; Historical Perspective; Review of the Literature; Weighting Schemes Used by Other Rankings; Analytic Approach; Pragmatic Approach; References; Appendix 1: Weighting in Other Rankings; Appendix 2: Analysis of 2010 County Health Rankings Dataset.
  • Learning to Combine Multiple Ranking Metrics for Fault Localization Jifeng Xuan, Martin Monperrus
    Learning to Combine Multiple Ranking Metrics for Fault Localization. Jifeng Xuan, Martin Monperrus. ICSME — 30th International Conference on Software Maintenance and Evolution, Sep 2014, Victoria, Canada. doi: 10.1109/ICSME.2014.41. HAL Id: hal-01018935, https://hal.inria.fr/hal-01018935, submitted on 18 Aug 2014.
    Jifeng Xuan, INRIA Lille - Nord Europe, Lille, France. Martin Monperrus, University of Lille & INRIA, Lille, France.
    Abstract — Fault localization is an inevitable step in software debugging. Spectrum-based fault localization applies a ranking metric to identify faulty source code. Existing empirical studies on fault localization show that there is no optimal ranking metric for all the faults in practice.
    … [12], Ochiai [2], Jaccard [2], and Ample [4]). Most of these metrics are manually and analytically designed based on assumptions on programs, test cases, and their relationship with faults [16]. To our knowledge, only the work by Wang …
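The text above does not define the individual metrics; as an assumed illustration, the sketch below uses the standard Ochiai formula from the spectrum-based fault localization literature (not code from this paper) to rank program elements by suspiciousness:

```python
import math

def ochiai(n_cf, n_cs, n_f):
    """Standard Ochiai suspiciousness for one program element.

    n_cf: failing tests that cover the element
    n_cs: passing tests that cover the element
    n_f:  total number of failing tests
    """
    denom = math.sqrt(n_f * (n_cf + n_cs))
    return n_cf / denom if denom > 0 else 0.0

# Hypothetical coverage spectra (n_cf, n_cs) for three statements; 10 failing tests total.
spectra = {"s1": (9, 2), "s2": (4, 0), "s3": (1, 30)}
ranking = sorted(spectra, key=lambda s: ochiai(*spectra[s], n_f=10), reverse=True)
print(ranking)  # statements ordered from most to least suspicious
```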
  • Isotonic Regression in General Dimensions
    Isotonic regression in general dimensions. Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee and Richard J. Samworth. September 1, 2017. arXiv:1708.09468v1 [math.ST] 30 Aug 2017.
    Abstract: We study the least squares regression function estimator over the class of real-valued functions on [0, 1]^d that are increasing in each coordinate. For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order n^{-min{2/(d+2), 1/d}} in the empirical L_2 loss, up to poly-logarithmic factors. Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on k hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of (k/n)^{min(1, 2/d)}, again up to poly-logarithmic factors. Previous results are confined to the case d ≤ 2. Finally, we establish corresponding bounds (which are new even in the case d = 2) in the more challenging random design setting. There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate.
    1 Introduction. Isotonic regression is perhaps the simplest form of shape-constrained estimation problem, and has wide applications in a number of fields. For instance, in medicine, the expression of a leukaemia antigen has been modelled as a monotone function of white blood cell count and DNA index (Schell and Singh, 1997), while in education, isotonic regression has been used to investigate the dependence of college grade point average on high school ranking and standardised test results (Dykstra and Robertson, 1982).
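A compact restatement of the estimator described in the abstract, in standard notation (assumed from context, not quoted from the paper):

```latex
\[
\hat f_n \in \operatorname*{argmin}_{f \in \mathcal{F}_d}\;
\frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - f(X_i)\bigr)^2,
\qquad
\mathcal{F}_d = \bigl\{ f:[0,1]^d \to \mathbb{R} : f(x) \le f(y)\ \text{whenever } x_j \le y_j \text{ for all } j \bigr\},
\]
```

with the abstract's rate for the empirical L_2 loss being n^{-min{2/(d+2), 1/d}} up to poly-logarithmic factors.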
  • Shape Restricted Regression with Random Bernstein Polynomials
    IMS Lecture Notes–Monograph Series, Complex Datasets and Inverse Problems: Tomography, Networks and Beyond, Vol. 54 (2007) 187–202. © Institute of Mathematical Statistics, 2007. DOI: 10.1214/074921707000000157.
    Shape restricted regression with random Bernstein polynomials. I-Shou Chang¹, Li-Chu Chien², Chao A. Hsiung², Chi-Chung Wen³ and Yuh-Jenn Wu⁴. National Health Research Institutes, Tamkang University, and Chung Yuan Christian University, Taiwan.
    Abstract: Shape restricted regressions, including isotonic regression and concave regression as special cases, are studied using priors on Bernstein polynomials and Markov chain Monte Carlo methods. These priors have large supports, select only smooth functions, can easily incorporate geometric information into the prior, and can be generated without computational difficulty. Algorithms generating priors and posteriors are proposed, and simulation studies are conducted to illustrate the performance of this approach. Comparisons with the density-regression method of Dette et al. (2006) are included.
    1. Introduction. Estimation of a regression function with shape restriction is of considerable interest in many practical applications. Typical examples include the study of dose response experiments in medicine and the study of utility functions, product functions, profit functions and cost functions in economics, among others. Starting from the classic works of Brunk [4] and Hildreth [17], there exists a large literature on the problem of estimating monotone, concave or convex regression functions. Because some of these estimates are not smooth, much effort has been devoted to the search of a simple, smooth and efficient estimate of a shape restricted regression function. Major approaches to this problem include the projection methods for constrained smoothing, which are discussed in Mammen et al.
  • Large-Scale Probabilistic Prediction with and Without Validity Guarantees
    Large-scale probabilistic prediction with and without validity guarantees. Vladimir Vovk, Ivan Petej, and Valentina Fedorova.
    "The practical conclusions of probability theory can be substantiated as consequences of hypotheses about the limiting, under the given constraints, complexity of the phenomena under study."
    On-line Compression Modelling Project (New Series), Working Paper #13. First posted November 2, 2015. Last revised November 3, 2019. Project web site: http://alrw.net
    Abstract: This paper studies theoretically and empirically a method of turning machine-learning algorithms into probabilistic predictors that automatically enjoys a property of validity (perfect calibration) and is computationally efficient. The price to pay for perfect calibration is that these probabilistic predictors produce imprecise (in practice, almost precise for large data sets) probabilities. When these imprecise probabilities are merged into precise probabilities, the resulting predictors, while losing the theoretical property of perfect calibration, are consistently more accurate than the existing methods in empirical studies. The conference version of this paper was published in Advances in Neural Information Processing Systems 28, 2015.
    Contents: 1 Introduction; 2 Inductive Venn–Abers predictors (IVAPs); 3 Cross Venn–Abers predictors (CVAPs); 4 Making probability predictions out of multiprobability ones; 5 Comparison with other calibration methods (5.1 Platt's method; 5.2 Isotonic regression); 6 Empirical studies; 7 Conclusion; References.
    1 Introduction. Prediction algorithms studied in this paper belong to the class of Venn–Abers predictors, introduced in [19]. They are based on the method of isotonic regression [1] and prompted by the observation that when applied in machine learning the method of isotonic regression often produces miscalibrated probability predictions (see, e.g., [8, 9]); it has also been reported ([3], Section 1) that isotonic regression is more prone to overfitting than Platt's scaling [13] when data is scarce.
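For reference, a minimal sketch of Platt-style calibration, one of the baselines named in the contents above (a one-dimensional logistic fit on held-out scores; Platt's label-smoothing details are omitted and the data are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(scores_cal, labels_cal, scores_test):
    """Fit a sigmoid on calibration scores; return calibrated test probabilities."""
    lr = LogisticRegression()
    lr.fit(scores_cal.reshape(-1, 1), labels_cal)
    return lr.predict_proba(scores_test.reshape(-1, 1))[:, 1]

rng = np.random.default_rng(1)
scores_cal = rng.normal(size=500)
labels_cal = rng.binomial(1, 1 / (1 + np.exp(-scores_cal)))
print(platt_calibrate(scores_cal, labels_cal, np.array([-2.0, 0.0, 2.0])))
```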
  • Stat 8054 Lecture Notes: Isotonic Regression
    Stat 8054 Lecture Notes: Isotonic Regression. Charles J. Geyer. June 20, 2020.
    1 License. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
    2 R. The version of R used to make this document is 4.0.1. The version of the Iso package used to make this document is 0.0.18.1. The version of the rmarkdown package used to make this document is 2.2.
    3 Lagrange Multipliers. The following theorem is taken from the old course slides http://www.stat.umn.edu/geyer/8054/slide/optimize.pdf. It originally comes from the book Shapiro, J. (1979). Mathematical Programming: Structures and Algorithms. Wiley, New York.
    3.1 Problem. Minimize f(x) subject to g_i(x) = 0 for i ∈ E and g_i(x) ≤ 0 for i ∈ I, where E and I are disjoint finite sets. Say x is feasible if the constraints hold.
    3.2 Theorem. The following is called the Lagrangian function, L(x, λ) = f(x) + ∑_{i ∈ E ∪ I} λ_i g_i(x), and the coefficients λ_i in it are called Lagrange multipliers. If there exist x* and λ such that (1) x* minimizes x ↦ L(x, λ); (2) g_i(x*) = 0 for i ∈ E and g_i(x*) ≤ 0 for i ∈ I; (3) λ_i ≥ 0 for i ∈ I; and (4) λ_i g_i(x*) = 0 for i ∈ I; then x* solves the constrained problem (preceding section). These conditions are called (1) Lagrangian minimization, (2) primal feasibility, (3) dual feasibility, and (4) complementary slackness. A correct proof of the theorem is given on the slides cited above (it is just algebra).
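A small worked example (not from the notes) showing the four conditions of the theorem on a one-variable problem:

```latex
\[
\min_x\; f(x)=x^2 \quad\text{s.t.}\quad g_1(x)=1-x\le 0
\qquad (E=\varnothing,\; I=\{1\}).
\]
\[
L(x,\lambda)=x^2+\lambda_1(1-x). \quad
\text{Take } \lambda_1=2:\;
\partial_x L = 2x-2=0 \Rightarrow x^*=1;\quad
g_1(x^*)=0\le 0;\quad
\lambda_1\ge 0;\quad
\lambda_1\, g_1(x^*)=0.
\]
```

All four conditions hold, so x* = 1 solves the constrained problem, matching the obvious answer for minimizing x² over x ≥ 1.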
  • Pairwise Versus Pointwise Ranking: a Case Study
    Schedae Informaticae Vol. 25 (2016): 73–83. doi: 10.4467/20838476SI.16.006.6187
    Pairwise versus Pointwise Ranking: A Case Study. Vitalik Melnikov¹, Pritha Gupta¹, Bernd Frick², Daniel Kaimann², Eyke Hüllermeier¹. ¹Department of Computer Science, ²Faculty of Business Administration and Economics, Paderborn University, Warburger Str. 100, 33098 Paderborn. E-mail: {melnikov, prithag, eyke}@mail.upb.de, {bernd.frick, daniel.kaimann}@upb.de. Received: 11 December 2016 / Accepted: 30 December 2016.
    Abstract. Object ranking is one of the most relevant problems in the realm of preference learning and ranking. It is mostly tackled by means of two different techniques, often referred to as pairwise and pointwise ranking. In this paper, we present a case study in which we systematically compare two representatives of these techniques, a method based on the reduction of ranking to binary classification and so-called expected rank regression (ERR). Our experiments are meant to complement existing studies in this field, especially previous evaluations of ERR. And indeed, our results are not fully in agreement with previous findings and partly support different conclusions. Keywords: Preference learning, object ranking, linear regression, logistic regression, hotel rating, TripAdvisor.
    1. Introduction. Preference learning is an emerging subfield of machine learning that has received increasing attention in recent years [1]. Roughly speaking, the goal in preference learning is to induce preference models from observed data that reveals information about the preferences of an individual or a group of individuals in a direct or indirect way; these models are then used to predict the preferences in a new situation.
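A minimal sketch (not the authors' code) of the pairwise reduction mentioned in the abstract: every pair of objects with different relevance scores becomes one binary example on the feature difference, and the learned weight vector is reused as a linear scoring function:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_examples(X, y):
    """Turn ranked objects into binary examples on feature differences."""
    diffs, labels = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue  # ties carry no pairwise preference
        diffs.append(X[i] - X[j])
        labels.append(1 if y[i] > y[j] else 0)
    return np.array(diffs), np.array(labels)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                   # hypothetical objects
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)).round(1)

D, L = pairwise_examples(X, y)
ranker = LogisticRegression().fit(D, L)   # weight vector defines a linear scorer
scores = X @ ranker.coef_.ravel()         # rank objects by w^T x
print(scores[:5])
```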
  • Power Comparisons of the Mann-Whitney U and Permutation Tests
    Power Comparisons of the Mann-Whitney U and Permutation Tests Abstract: Though the Mann-Whitney U-test and permutation tests are often used in cases where distribution assumptions for the two-sample t-test for equal means are not met, it is not widely understood how the powers of the two tests compare. Our goal was to discover under what circumstances the Mann-Whitney test has greater power than the permutation test. The tests’ powers were compared under various conditions simulated from the Weibull distribution. Under most conditions, the permutation test provided greater power, especially with equal sample sizes and with unequal standard deviations. However, the Mann-Whitney test performed better with highly skewed data. Background and Significance: In many psychological, biological, and clinical trial settings, distributional differences among testing groups render parametric tests requiring normality, such as the z test and t test, unreliable. In these situations, nonparametric tests become necessary. Blair and Higgins (1980) illustrate the empirical invalidity of claims made in the mid-20th century that t and F tests used to detect differences in population means are highly insensitive to violations of distributional assumptions, and that non-parametric alternatives possess lower power. Through power testing, Blair and Higgins demonstrate that the Mann-Whitney test has much higher power relative to the t-test, particularly under small sample conditions. This seems to be true even when Welch’s approximation and pooled variances are used to “account” for violated t-test assumptions (Glass et al. 1972). With the proliferation of powerful computers, computationally intensive alternatives to the Mann-Whitney test have become possible.
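A rough Monte Carlo sketch of the kind of power comparison described above (illustrative Weibull shapes, sample sizes, and resample counts, not the study's settings):

```python
import numpy as np
from scipy.stats import mannwhitneyu, permutation_test

rng = np.random.default_rng(7)

def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

def estimate_power(n_sims=200, n=20, shape_a=1.0, shape_b=1.6, alpha=0.05):
    """Rough Monte Carlo power of both tests under a Weibull shape difference."""
    rej_mw = rej_perm = 0
    for _ in range(n_sims):
        a, b = rng.weibull(shape_a, n), rng.weibull(shape_b, n)
        if mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
            rej_mw += 1
        perm = permutation_test((a, b), mean_diff, n_resamples=999,
                                alternative="two-sided")
        if perm.pvalue < alpha:
            rej_perm += 1
    return rej_mw / n_sims, rej_perm / n_sims

print("power (Mann-Whitney, permutation):", estimate_power())
```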