Arxiv:2103.14693V2 [Stat.ME] 30 Jun 2021 [2]

Inapplicability of the TVOR Method to USHMM Data Outlier Identification

Melkior Ornik∗

Abstract

The recent paper "TVOR: Finding Discrete Total Variation Outliers Among Histograms" introduces the Total Variation Outlier Recognizer (TVOR) method for identification of outliers among a given set of histograms. After providing a theoretical discussion of the method and verifying its success on synthetic and population census data, it applies the TVOR model to histograms of ages of Holocaust victims produced using United States Holocaust Memorial Museum data. It purports to identify the list of victims of the Jasenovac concentration camp as potentially suspicious. In this comment paper, we show that the TVOR model and its assumptions are grossly inapplicable to the considered dataset. When applied to the considered data, the model is biased in assigning a higher outlier score to histograms of larger sizes, the set of data points is extremely sparse around the point of interest, the dataset has not been reviewed to remove obvious data processing errors, and, contrary to the model requirements, the distributions of the victims' ages naturally vary significantly across victim lists.

Notice of publication: This paper has been published in IEEE Access, vol. 9, pp. 78586–78593, 2021, under the title "Comment on 'TVOR: Finding Discrete Total Variation Outliers Among Histograms'". The difference in titles is due to the journal policy on naming of comment papers.

1 Introduction

Focusing on the problem of identifying compromised data, the recently published article [1] introduces a novel method named Total Variation Outlier Recognizer (TVOR) for identification of outliers across a set of histograms. In proposing its scheme based on the difference in discrete total variations among histograms, the TVOR method critically relies on the assumption that all histograms in a dataset should come from the same probability distribution, or should at least have the same smoothness properties.
Initial experiments in [1] focus on outlier detection in synthetic datasets drawn from a normal distribution, from a beta distribution, and from subsamples of the German census of 1939. The TVOR method indeed successfully recognizes the outliers in those datasets. Following these experiments, the work in [1] considers data about Holocaust victims made available by the United States Holocaust Memorial Museum (USHMM) [2]. By comparing histograms obtained from 7106 historical documents such as lists of ghetto inhabitants, lists of casualties, census records, and concentration camp population lists (including the lists of victims of the Jasenovac concentration camp [3] differentiated by ethnicity), the authors of [1] claim to have detected "the potentially problematic parts of a sample, which in the case of the Jasenovac list lies in the birth years of Serbian inmates" [1, Appendix D].

In this comment paper we show that the use of TVOR on the USHMM records, in the manner employed in [1], is inappropriate. We identify multiple features that make TVOR and its assumptions inapplicable to the USHMM dataset:

(i) the histograms are not drawn from the same probability distribution;
(ii) the model, when applied to the particular dataset, is biased towards providing a higher outlier score to larger lists such as the Jasenovac list; and
(iii) the dataset is sparse around the point of interest, and whatever data does exist does not satisfy the assumption of the histograms sharing similar smoothness, let alone the same probability distribution.

We also show that the method used in [1] to pull the dataset from the USHMM database results in a significant contamination of the dataset, by duplicating hundreds of thousands of entries.

∗ Coordinated Science Laboratory and Department of Aerospace Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected])
We emphasize that we do not contest the theoretical underpinnings of the TVOR method given in the first half of [1], and recognize the method to be valid in a variety of contexts. Our comment is solely on the application of the TVOR method to the particular dataset.

Notation. In the remainder of the paper, we use the notation $\mathbb{N}_0$ for nonnegative integers and $[n]$ for the set $\{1, \ldots, n\}$. Notation $E[X]$ denotes the expected value of a random variable $X$ on an underlying probability space.

2 TVOR Preliminaries and USHMM Dataset

Motivated by a classical problem of detection of outlying data [4, 5], TVOR [1] seeks to identify outliers in a finite set of histograms $\{h_1, \ldots, h_k\}$.

Definition 1. A histogram is defined as a function $h : [n] \to \mathbb{N}_0$, where $h(i)$ signifies the number of samples in a bin $i$. We refer to the sum $\sum_{i=1}^{n} h(i)$ as the size of the histogram.

The technical approach of TVOR relies on the fact that histograms drawn from the same probability distribution will, informally stated, share similar smoothness. To formally encode the notion of smoothness, [1] proposes measuring a histogram's discrete total variation (DTV).

Definition 2. A discrete total variation (DTV) of a histogram $h : [n] \to \mathbb{N}_0$ is given by $\|h\|_V = \sum_{i=2}^{n} |h(i) - h(i-1)|$.

To detect outliers, TVOR then compares the DTV of each histogram with its expected value across the set $\{h_1, \ldots, h_k\}$, where the expectation is calculated by a best curve fit of the DTV data to a function of form $aN + b\sqrt{N}$.

Definition 3. An expected DTV for a set of histograms $\{h_1, \ldots, h_k\}$, where $h_i : [n] \to \mathbb{N}_0$, is a function $m(N) = aN + b\sqrt{N}$ which best fits the set $\{(N_i, \|h_i\|_V) \mid i \in [k]\}$, where $N_i$ is the size of histogram $h_i$.

While [1] does not seem to explicitly state the metric of best curve fit for the expected DTV, the presented results indicate that the experiments consider a standard least squares fit or a similar metric.
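Definitions 1 to 3 can be illustrated with a minimal sketch. It assumes an ordinary least squares fit without an intercept, which [1] does not state explicitly, and the function names `histogram_size`, `dtv`, and `fit_expected_dtv` are hypothetical names chosen here for illustration; they are not taken from [1] or from repository [8].

```python
from math import sqrt


def histogram_size(h):
    """Size of a histogram: the total number of samples over all bins."""
    return sum(h)


def dtv(h):
    """Discrete total variation: sum of |h(i) - h(i-1)| over adjacent bins."""
    return sum(abs(h[i] - h[i - 1]) for i in range(1, len(h)))


def fit_expected_dtv(sizes, dtvs):
    """Fit m(N) = a*N + b*sqrt(N) to the points (N_i, DTV_i).

    Assumption: ordinary least squares, solved through the 2x2 normal
    equations with regressors x = N and y = sqrt(N).
    """
    s_xx = sum(n * n for n in sizes)
    s_xy = sum(n * sqrt(n) for n in sizes)
    s_yy = sum(n for n in sizes)  # sqrt(N)^2 = N
    s_xv = sum(n * v for n, v in zip(sizes, dtvs))
    s_yv = sum(sqrt(n) * v for n, v in zip(sizes, dtvs))
    det = s_xx * s_yy - s_xy * s_xy
    if abs(det) <= 1e-12 * s_xx * s_yy:
        raise ValueError("need at least two distinct histogram sizes")
    a = (s_xv * s_yy - s_xy * s_yv) / det
    b = (s_xx * s_yv - s_xy * s_xv) / det
    return a, b


if __name__ == "__main__":
    # Toy histograms, made up purely for illustration.
    histograms = [[1, 3, 2, 5], [4, 4, 9, 1, 0], [2, 0, 7, 7, 3, 1]]
    sizes = [histogram_size(h) for h in histograms]
    dtvs = [dtv(h) for h in histograms]
    print(fit_expected_dtv(sizes, dtvs))
```

The fit has only two free coefficients shared by every histogram in the set, so a single pair $(a, b)$ is meaningful only to the extent that the histograms share a distribution.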
Form $aN + b\sqrt{N}$ for the fitting function $m$ is motivated in [1] by a bound of

$$E[\|h\|_V] \le \|D\|_V N + E[\|R\|_V] \sqrt{N} \tag{1}$$

derived in [1], where $N$ is the size of histogram $h$, $\|D\|_V$ is the total variation of a distribution $D : [n] \to [0, 1]$ from which $h$ is drawn, and $E[\|R\|_V]$ is the expected deviation between the histogram drawn from $D$ and the distribution $D$ itself. We invite the reader to consult [1, Sec. III-D] for further details on the derivation of the bound. For the purposes of our paper, we emphasize, however, that values $\|D\|_V$ and $E[\|R\|_V]$ depend on the distribution from which $h$ is drawn. Hence, if multiple histograms are drawn from different distributions, the coefficients next to $N$ and $\sqrt{N}$ in (1) may be different.

Finally, [1] defines the TVOR outlier score $d'$ for a histogram $h : [n] \to \mathbb{N}_0$ of size $N$ belonging to the set $\{h_1, \ldots, h_k\}$ by normalizing the difference of its DTV with the expected DTV for a histogram of its size:

$$d' = \frac{\left| \|h\|_V - m(N) \right|}{\sqrt{N}}. \tag{2}$$

The normalization in (2) is motivated in [1, Sec. III-F] by the square root law [6], i.e., the observation that the standard error of a sum of $N$ samples drawn from a distribution is often proportional to $\sqrt{N}$; for more details, we refer the reader to [1, Sec. III-F]. We note that, if existent, the proportionality constant between the standard error and $\sqrt{N}$ is dependent on the distribution [6]; different distributions may lead to different constants.

In parallel with TVOR, [1] also proposes its modification

$$d'' = \frac{\left| \|h\|_V - \hat{\mu}_N \right|}{\hat{\sigma}_N}, \tag{3}$$

where instead of $m(N)$ and $\sqrt{N}$, the model considers the mean and standard deviation of DTVs of appropriately sized subsets of the German census of 1939 [7].

Experimental work in [1] begins by successfully demonstrating TVOR on synthetic datasets and subsets of the German 1939 census data, which, by design, satisfy the assumption of shared probability distributions across histograms. The method performs well on these experiments, recognizing the outlying histograms successfully.
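The two scores in (2) and (3) can be written out as a short sketch. The function names `tvor_score` and `tvor_score_reference` are hypothetical, and the coefficients `a`, `b` and the reference statistics `mu_hat`, `sigma_hat` are assumed to have been estimated beforehand (from the curve fit and from census subsamples, respectively); the sketch does not reproduce how [1] computes them.

```python
from math import sqrt


def dtv(h):
    """Discrete total variation: sum of |h(i) - h(i-1)| over adjacent bins."""
    return sum(abs(h[i] - h[i - 1]) for i in range(1, len(h)))


def tvor_score(h, a, b):
    """Score d' from (2): |DTV(h) - m(N)| / sqrt(N), with m(N) = a*N + b*sqrt(N).

    a and b are assumed to come from a previously computed curve fit.
    """
    n_samples = sum(h)
    if n_samples == 0:
        raise ValueError("score is undefined for an empty histogram")
    expected = a * n_samples + b * sqrt(n_samples)
    return abs(dtv(h) - expected) / sqrt(n_samples)


def tvor_score_reference(h, mu_hat, sigma_hat):
    """Score d'' from (3): |DTV(h) - mu_hat| / sigma_hat.

    mu_hat and sigma_hat are assumed to be the mean and standard deviation
    of DTVs of reference subsamples whose size matches that of h.
    """
    if sigma_hat <= 0:
        raise ValueError("sigma_hat must be positive")
    return abs(dtv(h) - mu_hat) / sigma_hat


if __name__ == "__main__":
    h = [1, 3, 2, 3]  # toy histogram of size 9, made up for illustration
    print(tvor_score(h, a=1.0, b=1.0))
    print(tvor_score_reference(h, mu_hat=2.0, sigma_hat=0.5))
```

Both scores measure deviation from a single reference curve, so a histogram drawn from a distribution with different coefficients in (1) can receive a large score without any data tampering.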
The later part of [1] focuses on the dataset compiled from the United States Holocaust Memorial Museum (USHMM) lists of Holocaust victims. The vast majority of the data used in that experiment is provided by the authors of [1] online, in [8]. All the data that we use in our experiments is contained in that repository. The full dataset consists of years of birth for more than 3.6 million individual records across 7106 lists. The lists are obtained from historical documents of varying provenance, including wedding announcements, ghetto inhabitant lists, death lists, concentration camp lists, and census records [2, 9].

The experiment in [1] pays particular attention to the list of victims of the Jasenovac concentration camp, marked in the USHMM records by ID 45409 [3]. In addition to the full Jasenovac list, it also considers its multiple sublists, with victim records separated by ethnicity. These sublists do not seem to be available in repository [8], but they are described in detail in [1].

Each list in the dataset forms a natural histogram, where each bin corresponds to a particular year of birth. Some of the lists contain a small number of obviously impossible dates of birth, as discussed later; in our experiments, following the figures in [1], we a priori remove all years prior to 1850 or after 1945, resulting in $n = 96$ bins and a total of 3 619 428 records.

The authors of [1] then apply the TVOR method to identify the outlying histograms. They note that the Jasenovac list results in the highest $d'$-score, with $d' \approx 43.13$, thus labeling its data as "potentially problematic" [1, Sec. IV-F]. After adding the ethnicity sublists within the Jasenovac list to the histogram set, the list of Serbian victims also yields a high $d'$-score of $d' \approx 40.82$ and is also described as potentially problematic [1, Appendix D].
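The preprocessing step described above, in which each list becomes a histogram over birth years 1850 to 1945, can be sketched as follows. The function name `birth_year_histogram` and the assumption that a list arrives as a plain sequence of integer years are ours; the actual file layout of repository [8] is not reproduced here.

```python
FIRST_YEAR = 1850  # earliest birth year kept
LAST_YEAR = 1945   # latest birth year kept


def birth_year_histogram(years):
    """Turn a sequence of birth years into a histogram with one bin per year.

    Years before FIRST_YEAR or after LAST_YEAR are discarded, which leaves
    LAST_YEAR - FIRST_YEAR + 1 = 96 bins.
    """
    n_bins = LAST_YEAR - FIRST_YEAR + 1
    h = [0] * n_bins
    for year in years:
        if FIRST_YEAR <= year <= LAST_YEAR:
            h[year - FIRST_YEAR] += 1
    return h


if __name__ == "__main__":
    # Made-up years for illustration only; not taken from any USHMM list.
    sample = [1849, 1850, 1900, 1900, 1945, 1946, 2001]
    h = birth_year_histogram(sample)
    print(len(h), sum(h))
```

Filtering by year range removes only the impossible dates that fall outside the window; it does not address duplicated entries or other processing errors within the window.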
