Summary Measures for Binary Classification Systems in Animal Ecology
North-Western Journal of Zoology Vol. 6, No. 2, 2010, pp.323-330
P-ISSN: 1584-9074, E-ISSN: 1843-5629
Article No.: 061401
©NwjZ, Oradea, Romania, 2010, www.herp-or.uv.ro/nwjz

Szilárd NEMES and Tibor HARTEL

Mihai Eminescu Trust, Str. Cojocarilor 2, 545400 Sighişoara, Romania.
Correspondence: Sz. Nemes, E-mail: [email protected]; T. Hartel, E-mail: [email protected]

Abstract. Ecological studies often result in dichotomous, binary outcomes of the response variables (e.g. the presence or absence of the studied species). In such cases, logistic regression is used to model organismal response to various environmental factors. The Receiver Operating Characteristic (ROC) curve, a relatively old technique, is a standard procedure for assessing classifier performance in various fields of science and is increasingly used in ecology. In this note, we present the idea and sketch the mathematics behind the ROC curve, discussing its utility and interpretation.

Key-words: AUC-ROC, model performance, animal occurrence.

Introduction

Empirical ecological research yields a wide variety of data types, sometimes with dichotomous, binary outcomes (e.g. presence or absence of a species or studied phenomenon). Regression analysis is one of the most powerful statistical tools. Since the development of Generalized Linear Models (Nelder & Wedderburn 1972), regression analysis can also deal with outcomes that are not normally distributed and continuous. Binary responses are mostly modelled with logistic regression. The coefficients of the model are estimated by numerical methods and tested using their asymptotic properties (Pawitan 2001).

In addition to assessing the significance of an individual variable or of all estimated parameters, it is useful to have a single number that summarizes the overall appropriateness of the model. Such a measure eases the comparison of competing models as well. Model comparison can be done with information theoretic criteria (AIC or BIC) or by performing a global test (e.g. the Likelihood Ratio test) (Hilborn & Mangel 1997). Both are extensively used in model selection, but their usage as a single numerical measure of model suitability is impractical. Scientists often desire a single scalar measure that allows the comparison not only of different models in one study, but also of models from different studies. Such a measure must have a straightforward universal interpretation and a proper scale with easily interpretable values. Both information theoretic criteria and global significance tests fail to meet this requirement. Global significance tests provide a relatively simple and powerful tool for comparing alternative models, but only nested models can be compared. Information theoretic criteria are suitable for comparing both nested and non-nested models, but their interpretation becomes problematic when different models have different support sets (are built on different data sets). The outcome of both methods can theoretically be any real number, and so lacks a straightforward interpretation as a single numerical measure. Instead, their interpretation depends on the context, and two identical values might lead to different conclusions in different situations.

The most common measure of model goodness of fit that meets the above criteria is the coefficient of determination, $R^2$. The coefficient of determination originates from Least Squares Regression methods and is defined as the percentage of the total variation in the response variable that can be attributed to the relationship with the predictors. Numerous counterparts of this index have been constructed for binary responses (reviewed by Long 1997), despite the fact that $R^2$ is not suitable for judging the effectiveness of regression models with binary responses (Cox & Wermuth 1992).

Binary response variables make the assessment of the model's discriminatory performance (the extent to which a model successfully separates the positive and negative observations and classifies them correctly) more feasible. Single measures of association, such as regression coefficients or odds ratios, do not meaningfully describe the predictor variables' ability to classify observations (Pepe et al. 2004). Reporting only odds (or hazard) ratios might intuitively overestimate the predictors' prognostic capacity; therefore there is a need for external validating measures (Dunkler et al. 2007). Such measures ease the comparison of competing models as well.

The Receiver Operating Characteristic (ROC) curve, a relatively old technique, is a standard procedure for assessing classifier performance in various fields of science (e.g. biomedical research, pattern recognition, machine learning, psychology) and is increasingly used in ecology (e.g. Huntley et al. 2004, Arntzen 2006, Broennimann et al. 2007, Hartel et al. 2007, 2010a,b, Hartley et al. 2007, Arntzen & Espregueira-Themudo 2008, Rödder et al. 2008). ROC curves are conceptually simple, with a mathematical algorithm that is easy to follow and understand. Studies that model animal occurrence in relation to habitat and landscape features are still scarce in Romania and Eastern Europe (see e.g. Hartel et al. 2008, 2010a,b). Given this circumstance, we believe that it is timely to present the idea and sketch the mathematics behind the ROC curve and to discuss its utility and interpretation.

The receiver operating characteristic (ROC) curve

The Receiver Operating Characteristic (ROC) is a graphical technique used for visualizing and selecting classifiers based on their performance. Formally, each event is mapped to the set {p, n} of positive and negative outcomes (Fawcett 2006). Tests used for binary classification do not produce binary values but a continuous one, $T_i$, and a threshold c is applied in order to predict the class of the outcome. Generally, high values of the test are assumed to indicate the event of interest ($T_i \geq c$: positive outcome, p), while low values indicate the absence of the event ($T_i < c$: negative outcome, n). Given a binary classifier and the event of interest, there are four possible situations:

i) The event is positive and it is classified as positive, True Positive (TP),
ii) The event is positive and it is classified as negative, False Negative (FN),
iii) The event is negative and it is classified as negative, True Negative (TN),
iv) The event is negative and it is classified as positive, False Positive (FP).

The performance of a binary classifier at a given threshold can be represented as a two-by-two contingency table, the confusion matrix (Fig. 1). From the confusion matrix the following two metrics are calculated: Sensitivity (SN), the proportion of positive events that are correctly classified as positive, and Specificity (SP), the proportion of negative events that are correctly classified as negative.

Figure 1. Two-by-two confusion matrix of a binary classifier. The observations along the major diagonal represent the correct decisions, the minor diagonal the errors (the confusion between different classes).

Sensitivity and Specificity are estimated as follows:

$$SN = \frac{TP}{TP + FN} \quad \text{and} \quad SP = \frac{TN}{TN + FP}.$$

It should be noted that Sensitivity depends only on the positive observations, while Specificity depends only on the negative observations. As a consequence, Sensitivity, Specificity and ROC curves can be derived without worrying about the proportional representation of the two groups in the sample.

A ROC curve captures in a single graph the trade-off between the sensitivity and specificity of the test over its entire range. Thus the ROC curve plots SN vs. 1 − SP as the threshold varies over its entire range,

$$\{(SN\{c\},\ 1 - SP\{c\}) : -\infty < c < \infty\},$$

each data point on the plot representing a particular setting of the threshold (c). Each threshold setting defines a particular set of True Positive and False Positive results, and consequently a particular pair of SN and 1 − SP values (Lasko 2005). Sensitivity and 1 − specificity are synonymous with the True Positive and False Positive rates (TPR and FPR), defined as

$$TPR\{c\} = P(T_i \geq c \mid Y_i = 1) \quad \text{and} \quad FPR\{c\} = P(T_i \geq c \mid Y_i = 0),$$

where $Y_i$ denotes the true class of observation i (1 for a positive event, 0 for a negative one), the ROC curve being a two dimensional plot of

$$\{(FPR\{c\},\ TPR\{c\}) : -\infty < c < \infty\}$$

(Ma et al. 2006).

Fawcett (2006) points out several important points in the ROC space (Fig. 2):

i) (0,0), the lower left point, represents the strategy of never issuing a positive classification (i.e. $c > \max(T_i)$),
ii) (1,1), the upper right point, represents the opposite strategy of always issuing a positive classification,
iii) (0,1) represents the perfect classification with no error.

A ROC curve generated from a finite data set is a step function, a two dimensional representation of the classification performance. In scientific publications it is not always feasible to reproduce ROC curves, thus numerical indices are commonly used to summarize the curves. These numerical indices have to convey important information about the ROC curve and about the test that the ROC curve was constructed for. Although there are several summary indices, one in particular, the area under the ROC curve (AUC), dominates practical applications (Hanley & McNeil 1982). The area under a ROC curve is defined as

$$AUC = \int_0^1 ROC(t)\, dt.$$

Generally the test with the higher AUC score is considered the better. If test A is uniformly better than test B in the sense that $ROC_A(t) \geq ROC_B(t)$ for all t, then their AUC statistics are ordered as well, $AUC_A \geq AUC_B$ (Pepe 2003) (Fig. 3). However, the converse is not true: a higher AUC value does not necessarily imply a higher ROC value at a given sensitivity and specificity, nor do equal AUC values assure equal classification accuracy as the threshold varies over its entire range (Fig. 4). Formal comparisons can be made by means of statistical significance testing (Hanley & McNeil 1982, DeLong et al. 1988, Pepe 2003, Rosset 2004).

Two less frequently encountered summary measures of ROC curves are the partial AUC (pAUC), which restricts attention to certain specificity intervals, and the Kolmogorov-Smirnov statistic (KS). The latter is the maximum vertical distance between the ROC curve and the 45° line, the line of the uninformative test.
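The quantities described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy scores and labels are hypothetical, chosen only for illustration, and the AUC is approximated with the trapezoidal rule over the empirical ROC points, which is one common choice among several. An observation is classified as positive when its score is at or above the threshold c, as in the text.

```python
# Toy data (invented for illustration): test values T_i and true classes Y_i.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def confusion_counts(scores, labels, c):
    """Return (TP, FN, TN, FP) when T_i >= c is classified as positive."""
    tp = sum(1 for t, y in zip(scores, labels) if t >= c and y == 1)
    fn = sum(1 for t, y in zip(scores, labels) if t < c and y == 1)
    tn = sum(1 for t, y in zip(scores, labels) if t < c and y == 0)
    fp = sum(1 for t, y in zip(scores, labels) if t >= c and y == 0)
    return tp, fn, tn, fp


def sensitivity_specificity(scores, labels, c):
    """SN = TP / (TP + FN) and SP = TN / (TN + FP) at threshold c."""
    tp, fn, tn, fp = confusion_counts(scores, labels, c)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """Empirical ROC curve as (FPR, TPR) pairs, running from (0,0) to (1,1)."""
    # Start above every score (nothing classified positive), then lower the
    # threshold through each distinct observed score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for c in thresholds:
        sn, sp = sensitivity_specificity(scores, labels, c)
        points.append((1.0 - sp, sn))  # (FPR, TPR) = (1 - SP, SN)
    return points


def auc_trapezoid(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def ks_statistic(points):
    """Maximum vertical distance between the ROC points and the 45° line."""
    return max(tpr - fpr for fpr, tpr in points)


points = roc_points(scores, labels)
print("ROC points:", points)
print("AUC:", auc_trapezoid(points))
print("KS:", ks_statistic(points))
```

Because the sketch computes SN and SP separately within the positive and negative groups, changing the ratio of positives to negatives in the toy data (while keeping the score distribution within each group) would leave the ROC points unchanged, which is the property noted in the text.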