About the Pitfall of Erroneous Validation Data in the Estimation of Confusion Matrices

remote sensing Article About the Pitfall of Erroneous Validation Data in the Estimation of Confusion Matrices Julien Radoux *,† and Patrick Bogaert † Earth & Life Institute, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium; [email protected] * Correspondence: [email protected] † These authors contributed equally to this work. Received: 3 December 2020; Accepted: 14 December 2020; Published: 17 December 2020 Abstract: Accuracy assessment of maps relies on the collection of validation data, i.e., a set of trusted points or spatial objects collected independently from the classified map. However, collecting spatially and thematically accurate dataset is often tedious and expensive. Despite good practices, those datasets are rarely error-prone. Errors in the reference dataset propagate to the probabilities estimated in the confusion matrices. Consequently, the estimates of the quality are biased: accuracy indices are overestimated if the errors are correlated and underestimated if the errors are conditionally independent. The first findings of our study highlight the fact that this bias could invalidate statistical tests of map accuracy assessment. Furthermore, correlated errors in the reference dataset induce unfair comparison of classifiers. A maximum entropy method is thus proposed to mitigate the propagation of errors from imperfect reference datasets. The proposed method is based on a theoretical framework which considers a trivariate probability table that links the observed confusion matrix, the confusion matrix of the reference dataset and the “real” confusion matrix. The method was tested with simulated thematic and geo-reference errors. It proved to reduce the bias to the level of the sampling uncertainty. The method was very efficient with geolocation errors because conditional independence of errors can reasonably be assumed. Thematic errors are more difficult to mitigate because they require the estimation of an additional parameter related to the amount of spatial correlation. In any case, while collecting additional trusted labels is usually expensive, our result show that the benefits for accuracy assessment are much larger than collecting a larger number of questionable reference data. Keywords: accuracy; confusion matrix; quality assessment; response design; reference data; uncertainty 1. Introduction Providing a confusion matrix together with a geolocationmap is a prevailing good practice in remote sensing data analysis [1,2]. The confusion matrix indeed yields a complete summary about the matching between geographic data product and reference data. The entries of the confusion matrix allow map users to derive numerous summary measures of the class accuracy, that can be used to compare different classifiers in scientific studies or to inform end users about the quality of the map. The confusion matrix is also used to achieve a better estimate of the area covered by each class. The reference datasets used to build the confusion matrix are often assumed to be error-free, but even ground data often include some errors [3]. Confusion matrices and associated quality indices are, therefore, likely to yield biased quality assessment of the maps. The negative impact of those errors on the apparent accuracy of maps has been highlighted in previous studies [4,5]. This impact on accuracy indices has been quantified in situations of correlated and non correlated errors. For instance, Remote Sens. 2020, 12, 4128; doi:10.3390/rs12244128 www.mdpi.com/journal/remotesensing Remote Sens. 2020, 12, 4128 2 of 23 a case study showed that 10% errors in a reference dataset lead to 18.5% underestimation of the producer accuracy if errors were independent and 12.5% overestimation if they were correlated [6]. The sources of errors in the collection of ground reference data are well documented, and some recent studies aimed at assessing their contribution depending on whether the errors are thematic or positional. Thematic errors consist in assigning the wrong label to a sample unit because of erroneous classification, (uncertainties of the response design [7], temporal mismatch, transitional classes, class interpretation errors, careless error [8]). Positional errors (also called geolocation errors) may result in incorrect matching of reference and map labels because the geographic position of the sampling unit is shifted with respect to the map (mislocation of testing sites, mislocation of the map or uncertain definition of boundaries) [2,9,10]. A range of approaches have been developed to deal with imperfect reference data. In medical science, Enøe et al. [11] used maximum likelihood methods to handle unknown reference information in a binary case, while Espeland and Handelman [12] used latent class models to obtain consensual confusion matrix among several observers. Those methods help to validate datasets without gold standard [13], but they do not adjust for poor quality reference datasets. In remote sensing, Sarmento et al. [14] integrated information from a second label in the reference dataset to build a confidence value on the accuracy estimates. Foody [6] proposed a method to correct sensitivity and specificity estimates when conditional independence holds and the quality of the reference classification is known. Carlotto [3] proposed a method to retrieve the actual overall accuracy based on the apparent accuracy and the errors between the reference and the “ground truth”. However, this method is assuming that errors are equally distributed amongst classes. Sub-optimal reference data is a reflection of the costs of acquiring high quality data [6]. Instead of correcting the reference dataset, our method therefore proposes to manage its uncertainty globally in order to obtain a corrected confusion matrix, i.e., a confusion matrix that reflects the matching between the map and the trusted labels. This would be particularly useful to manage the diversity of data sources that are now used for validation ( ground survey or aerial surveys; crowdsourced or expert-based). The major difference with previous studies is that we rebuild at best the full confusion matrix and not only some of the indices that can be derived from it. Furthermore, the proposed method does not assume that errors are equally distributed amongst classes nor that the errors are conditionally independent. This paper is presenting the theoretical developments for recovering the trusted confusion matrix based on the observed confusion matrix and a confusion matrix between the reference dataset and the trusted labels. These theoretical development are then implemented to illustrate the method based on two synthetic case studies. The first case study addresses thematic errors while the second one focuses on geolocation errors. 2. Theoretical Framework The confusion matrix is usually assumed to represent the overlay between a map and the “ground truth”. In practice the reference data used to estimate it is not perfect, therefore calling it “ground truth” is improper. In this study, we are referring to a map where a set of m classes are considered, where labels from the classification are indexed with i, while the trusted (ground truth) labels are indexed with j. In parallel, we assume that a reference dataset containing errors has been collected, with labels indexed with k. Typically, the correspondence between the classification and the reference is assessed through a confusion matrix (i.e., the table of the joint counts n(i, k) of pixels classified in class i but referenced in class k), from which the m × m table of the joint probability estimates pb(i, k) = n(i, k)/n is computed, where n refers to the sample size. In parallel, let us consider similarly that the m × m table of the joint counts n(j, k) is at hand (but not necessarily estimated from the same sample), so that the probability estimates pb(j, k) = n(j, k)/n assess the correspondence between the trusted and the reference classes over the study area. There are thus two known confusion matrices: the classified map vs the reference and the reference vs the trusted labels. Remote Sens. 2020, 12, 4128 3 of 23 What is sought for are estimates for the unknown joint p(i, j) probabilities that assess the correspondence between the classification and the ground truth. Instead of the p(i, k) that only quantifies the quality of the map with respect to a reference—that might differ from the ground truth and whose choice might also differ between users—the p(i, j)’s are truly assessing the accuracy of the map. The question is how to estimate at best these probabilities when the only knowledge at hand are the sets of pb(i, k)’s and pb(j, k)’s while possibly considering an additional conditional independence hypothesis if relevant. We will show hereafter that this problem can be handled using a maximum entropy (MaxEnt) approach (see, e.g., [15,16] for a presentation of the rationale of the method and its panel of applications). 2.1. Joint Probability Table In order to ease the discussion, let us consider the set of unknown joint probabilities p(i, j, k) that can be organised in a 3D probability table, as shown in Figure1. By definition, all bivariate probabilities p(i, j), p(i, k) and p(j, k) are marginal probabilities from this table, so that equalling the marginal probabilities with their estimates leads to m D ∑ p(i, j, k) = p(j, k) = pb(j, k) 8j, k (1a) i=1 m D ∑ p(i, j, k) = p(i, k) = pb(i, k) 8i, k (1b) j=1 m D ∑ p(i, j, k) = p(i, j) 8i, j (1c) k=1 where the unknown p(i, j, k)’s are thus subject to the two constraints given in Equations (1a) and (1b), D while Equation (1c) allows us to compute the required p(i, j)’s. Furthermore, because ∑k p(j, k) = p(j), we can further derive from Equation (1a) that p(j) = pb(j). k ^ p( j,k) j i i k k j p^(i,k) p(i, j) Figure 1.

About the Pitfall of Erroneous Validation Data in the Estimation of Confusion Matrices

CMPSCI 585 Programming Assignment 1 Spam Filtering Using Naive Bayes

Using Machine Learning to Improve Dense and Sparse Matrix Multiplication Kernels

Arxiv:1902.03680V3 [Cs.LG] 17 Jun 2019 1

Performance Metric Elicitation from Pairwise Classifier Comparisons Arxiv:1806.01827V2 [Stat.ML] 18 Jan 2019

MASTER TP^ Ostrlbunon of THIS DOCUMENT IS UNL.M.TED Kaon Content of Three-Prong Decays of the Tan Lepton

Designing Alternative Representations of Confusion Matrices to Support Non-Expert Public Understanding of Algorithm Performance

A Generalized Hierarchical Approach for Data Labeling Zachary D. Blanks

PREDICTING ACADEMIC PERFORMANCE of POTENTIAL ELECTRICAL ENGINEERING MAJORS a Thesis by SANJHANA SUNDARARAJ Submitted to the Offi

Jets + Missing Energy Signatures at the Large Hadron Collider

Multiple Classifier Systems Incorporating Uncertainty

Classifier Uncertainty: Evidence, Potential Impact, and Probabilistic Treatment

Evaluating Prediction Strategies in an Enhanced Meta-Learning Framework