NIST Technical Note 2044

Unreliable evidence in binary classification problems

David Flater

This publication is available free of charge from: https://doi.org/10.6028/NIST.TN.2044

David Flater
Software and Systems Division
Information Technology Laboratory

May 2019

U.S. Department of Commerce
Wilbur L. Ross, Jr., Secretary

National Institute of Standards and Technology
Walter Copan, NIST Director and Undersecretary of Commerce for Standards and Technology

Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

National Institute of Standards and Technology Technical Note 2044
Natl. Inst. Stand. Technol. Tech. Note 2044, 14 pages (May 2019)
CODEN: NTNOEF

Abstract

Binary classification problems include such things as classifying email messages as spam or non-spam and screening for the presence of disease (which can be seen as classifying a subject as disease-positive or disease-negative). Both Bayesian and frequentist approaches have been applied to these problems. Both kinds of approaches provide poor estimates of the uncertainty of the predictive value of tests for which the number of positive results in the sample is either very small or very large, and these estimates are very vulnerable to making inferences from unreliable evidence. This report explains the problem and explores the options for accounting for the often-neglected uncertainty. A neat solution that does no harm to the less uncertain cases remains elusive.

1 Introduction

Consider the following scenario:

• We have a sample of size N, meaning N individuals chosen randomly from a much larger population.
• Each individual in the sample has been assigned to one of two classes. This assignment is "ground truth." The nature of the individuals does not matter; they could be people, objects, email messages, or something abstract.
• There is a set of tests that we can apply to any individual and obtain a positive or negative result. Equivalently, we can think of a set of properties that each individual either has or does not have.
• We assume (for the purposes of this discussion) that each test or inspection has no cost and that its result has no uncertainty. We have all of the test results "for free."
• We assume that the tests, singly or in combination, can possibly and plausibly be reliable predictors or indicators of individuals' classes.
• Given the sample and the set of tests, we want to find a reliable method of classifying other individuals that are drawn from the population without ground truth.

This is a generic binary classification problem. The pattern arises in different contexts, and different models and methods are used to tackle it.

One example is classifying email messages as spam or non-spam. The properties that we test for are whether particular words appear in the emails or not; the ground truth is whether the emails are deemed to be spam or non-spam by the recipient. A simple solution would identify particular words that are treated as reliable indicators that a message is either spam or non-spam. A more complex solution would assign to each word an estimate of the strength of evidence that its presence or absence provides to reach a classification for the message and then combine the evidence from all of the words.

A different example is diagnosing a disease based on medical tests. Each test, for each patient, produces a positive or negative result; the ground truth is whether the patient actually does or does not have the disease.

For email classification, it is common to use Bayesian and Bayes-like approaches to estimate the evidence provided by the individual tests and then combine the evidence from all of the tests (since the real cost of testing is low). For disease diagnosis, it is more common to speak in terms of sensitivity, specificity, positive or negative predictive value, false positive and false negative rates, likelihood ratios, and related metrics, and then select the one or two tests that are found to be most useful (since the real cost of testing is high). The issue at hand is that both approaches provide poor estimates regarding tests for which the number of positive results in the sample is either very small or very large; for example, where only one or two individuals in the sample had positive results, or where only one or two individuals had negative results. In oversimplified applications, the predictive value of a test is assessed based on the correlations between test results and the ground truth in a given sample, neglecting the effect that the separate numbers of positive and negative results have on the uncertainty of that estimate. As a result, tests with a large imbalance between positive and negative results appear to be the strongest, evidence with unproven predictive value floats to the top, and its combined force determines the outcome.

Figure 1 shows a demonstrative example using Bogofilter [1], where a spam message was classified as "ham" (non-spam) due to the presence of words having 3 or fewer occurrences in the training corpus.

[Figure 1 is a Bogofilter word-by-word scoring table (columns n, pgood, pbad, fw) for a message with the header lines "Subject: re : urgent buy recommendation" and "X-Bogosity: Ham, tests=bogofilter, spamicity=0.284 240, version=1.2.4"; words with 3 or fewer occurrences, such as "highly," "initial," "medical," "price," "strong," and "quote," dominate the combined score.]

Figure 1: Demonstrative example of words with few occurrences dominating a Bogofilter classification.

The remainder of this report is organized as follows. Section 2 surveys related work. Section 3 gives an intuitive overview of the problem. Sections 4 and 5 provide Bayesian and frequentist examples respectively. Section 6 addresses philosophical matters about multiple layers of uncertainty. Section 7 finds another side of the problem using a hypothesis testing approach. Section 8 describes an attempt to resolve the problem using likelihood ratios. Finally, Section 9 wraps up with conclusions.

2 Related work

A screening test for membership in a given class can be evaluated in terms of sensitivity, specificity, positive or negative predictive value, and related statistics. With these statistics being in common use under various names across different fields, a canonical reference for them is difficult to identify, but their definitions are widely available [2]. Confidence intervals for these statistics can be approached in many ways using recipes that apply to binomial proportions [3, 4] or with Bayesian methods [5].

Bayesian classification is one branch of work that was built upon Bayes' theorem [6, 7, 8, 9]. A lot of practical implementation work and refinement of Bayesian and Bayes-like classification methods was done for the purpose of detecting spam email [10, 11, 12, 13, 14, 15, 16, 17, 18]. Bogofilter [1] was one application of that work, but there are many other email filters that implement the same or similar methods. Bayesian classification also is used in the evaluation of forensic evidence, particularly through likelihood ratios. National Institute of Standards and Technology (NIST) statisticians recently looked at the question of uncertainty characterization for this context [19].

3 Intuition of the problem

Since the problem is symmetrical, I will discuss only the case of classifying individuals or diagnosing disease based on positive test results.

For many statistics, the size of the sample is the dominant factor in the statistical significance ascribed to results. This is not the case in the scenario described in Section 1. Suppose that a test t is applied N times, once to each of the N individuals in the sample, and that it returns just one positive result, which is for a member of class C. Naive estimates find the single positive result of test t to be sufficient to determine class C with 100 % probability regardless of the values of N and r. But the N − 1 times that the test declined to return a positive result are just as informative as the one time that it did.

Now consider the example of a corpus of emails. Since a rare word is rare, the prevalence of the word appearing in a sample of emails is going to be low regardless of any property of the word itself. If it has 1 occurrence in the sample, is it right to conclude that it is positive for spam or non-spam? What if it has 2 occurrences? Analogously, consider a medical test that seldom returns a positive result. If its positive results happen to be right in the sample, is it right to conclude that it is positive for the disease, regardless of the sample?

The "regardless of the sample" is the crux of the matter. If a test comes up positive only once, and the one positive result happens to be correct, we do not know whether the test actually has anything to do with the disease or whether it just randomly distributes positive results among the individuals that it is applied to. As a surrogate for this knowledge, we analyze the correlations with statistics. It just so happens that these statistics are misleading for a test that seldom returns a positive result.

4 Bayesian example

Let N be the size of the sample. Let r be the proportion of the sample that is class C. Suppose that a test t returns just one positive result, and it is for a member of C. Using naive Bayes estimates, we find the following.

By Bayes' rule,

    p̂(C | t) = p̂(t | C) · p̂(C) / p̂(t)

where the naive estimates are

    p̂(t | C) = 1/(rN)    (1 of the rN individuals in class C is positive for t)
    p̂(C)     = r         (proportion of the sample that is class C)
    p̂(t)     = 1/N       (prevalence of positive results for t)

so

    p̂(C | t) = (1/(rN) · r) / (1/N) = 1.

Q.E.D. The estimated likelihood ratio is infinite, so we find a positive result from t to be sufficient to determine class C with 100 % probability regardless of the values of N and r.

This conclusion is wrong because the uncertainty associated with the use of estimates has been ignored. While the Bayesian pedigree of Bogofilter's model has been questioned [17, § 1][19, § 2], it is close enough in any event to serve as a real world example of the issue just described. With all configuration parameters left at their default values, Bogofilter will assign a "spamminess" (ostensible probability of indicating spam) of 0.991 605 to a word that appears only once in the training corpus when that one appearance is in a spam message. This value is then interpreted as an indication of the evidentiary significance of the presence of the word, on a scale from 0 to 1, and used in the combined evidentiary assessment of the entire message. If the word appears in a non-spam message instead, it is assigned a spamminess of 0.009 094, and the extremity of the value is taken as useful evidence for the opposite conclusion. These outliers are unaffected by the size of the training corpus, while words with less extreme values are excluded from the combined assessment, and it does not take many of them to overwhelm the balance of the evidence and destroy the ham or spam performance of the system.

Instead of calculating a point estimate, one could use a more complete Bayesian prediction approach with probability distributions to model the real uncertainty, resulting in a wide posterior predictive distribution, but there is much less consensus on an acceptable recipe.

Discarding words with low counts is a "common text classification practice" [18]. Even without misspellings, there is always a proliferation of words that appear only once in any given corpus, so ignoring the problem often leads to poor results. However, it is worrisome to discard low-frequency words just as a rule of thumb without quantifying the underlying issue. Bogofilter provides two parameters, s and x, for the purpose of de-weighting words with few occurrences in the training corpus: x is the prior value that is used for the spamminess, and s is the weighting factor for x relative to a single occurrence of the word. By tuning these parameters, one can moderate the effects of rare words, but it is far from obvious whether the action of these parameters in the spamminess formula is the right tool for the job.

5 Frequentist example

Let N be the size of the sample. Let r be the proportion of the sample that is positive for a disease (as ground truth). Suppose that a test t returns just one positive result, and it is for someone who has the disease. The estimated specificity and positive predictive value (PPV) of t are 1, and its estimated positive likelihood ratio is infinite, so we find a positive result from t to be sufficient to diagnose the disease with 100 % probability regardless of the values of N and r.

This conclusion is wrong because the uncertainty of the statistics has been ignored. Various recipes are available to derive confidence intervals [3, 4, 5]. Practitioners seeking the most simple and familiar approach might use a Wald type formulation of the confidence interval using the standard estimates for sensitivity, specificity, and PPV. But unless continuity corrections are applied, this approach yields a degenerate, zero-width interval around 1. Users may therefore be misled into believing that the estimates are certain when the opposite is true. As of December 11, 2018, a Google search for "confidence interval predictive value" returned an implementation of this misleading approach as the very first result [20].
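The degenerate interval and an exact alternative are easy to reproduce. The following is a minimal stdlib-only Python sketch (the function names and the bisection approach are mine, not code from this report); it computes a Wald interval and a Clopper-Pearson interval for a PPV estimated from one true positive out of one positive test result:

```python
from math import comb, sqrt

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def wald_interval(k, n, z=1.96):
    """Wald 95 % interval for a proportion, no continuity correction."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) interval via bisection on the binomial tails."""
    alpha = 1 - conf
    def bisect(f):
        lo, hi = 0.0, 1.0
        for _ in range(200):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) else (lo, mid)
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) <= alpha / 2)
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper

# PPV estimated from 1 true positive out of 1 positive test result:
print(wald_interval(1, 1))     # (1.0, 1.0): degenerate, zero width
print(clopper_pearson(1, 1))   # roughly (0.025, 1.0): almost no information
```

The contrast is the point of the next section: the exact interval honestly reports that a single positive result carries almost no information, while the Wald interval reports false certainty.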

A better choice in this case is the Clopper-Pearson interval [21]. The Clopper-Pearson interval for the specificity narrows around 1 as N increases, but the same does not occur for the PPV. Since PPV depends only on the positive test results, the 95 % Clopper-Pearson interval for PPV with just one true positive and no false positives is approximately (0.03 to 1.00) regardless of the size of the sample. The Clopper-Pearson interval for PPV thus accurately conveys the fact that the strength of evidence provided by a test that has few positive results is limited by that small number and is not improved by having a larger sample. Other intervals surveyed in [3] are also suitable and are said to be better.

6 Layers of uncertainty

Both Bayesian and frequentist approaches to the scenario produce misleading estimates of the predictive value of a test. Predictive value can be interpreted as a probability estimate or level of confidence about whether a test result is indicative of a particular ground truth. The problem therefore could be filed under "uncertainty of the uncertainty," which is sometimes philosophically rejected or categorically disregarded. Multiple layers of uncertainty can be collapsed through an argument of the following kind:

1. The estimated PPV expresses certainty that a positive test result reliably indicates the presence of the disease.
2. However, the PPV estimate is very uncertain.
3. Therefore, we have little confidence that a positive test result reliably indicates the presence of the disease.

The problem is then reduced to: one of the significant components of uncertainty for the reliability of the indication has been disregarded. We conclude that the values being used for p̂(C | t) are wrong.

7 Significance test

Ref. [3] begins, "Applied statisticians have long been aware of the serious limitations of hypothesis tests when used as the principal method of summarizing data. Following persuasion ... [leading medical journals] have indicated that in general confidence intervals (CIs) are preferred to hypothesis tests in the presentation of results." While that point is well-taken, framing the original problem in terms of p-values provides a different perspective that reveals another side of the problem.

In the context of the problem at hand, the following test may be recognized as an application of Fisher's exact test [24]; more generically, it may simply be called the hypergeometric test [25].

We take it as given that the prevalence of positive test results is not informative about the likelihood that the test is valid. Test results that are manufactured by coin toss (a fake test) are no less likely to contain the same number of "hits." If the correlation of a test's results with ground truth has no statistically significant separation from what could be expected by comparison with a random distribution having the same number of positive results, then the test's predictive value has not been established; its results are no less fraudulent than if they were fabricated using a random number source, and it should not be relied upon as evidence.

For a given sample, test results, and level of confidence, the question of significance is answered using the hypergeometric distribution. As defined in R [23], the hypergeometric cumulative distribution function phyper is

    phyper(x, m, n, k) = Σ_{i=0}^{x} C(m, i) C(n, k − i) / C(m + n, k)    (1)

The null hypothesis is that the screening test is no more sensitive than a fake test in which the same number of positive test results are distributed randomly through the sample without any reference to ground truth. (Since the other parameters are fixed, specificity is an equivalent criterion.)

The hypergeometric inverse cumulative distribution function qhyper then is

    qhyper(q, m, n, k) = minimum x in the 0 to k range such that q ≤ phyper(x, m, n, k)    (2)

where, in R's terms, m is "the number of white balls in the urn," here the number of disease-positives in the sample; n is "the number of black balls in the urn," here the number of disease-negatives in the sample; k is "the number of balls drawn from the urn," here the number of positive test results; and x is "the number of white balls drawn without replacement from an urn which contains both black and white balls," here the number of true positives. The criterion for significance is then

    reject H0 ↔ x > qhyper(c0, m, n, k)    (3)

where c0 is the level of confidence (1 − α) adjusted for multiple comparisons as explained below; equivalently, the probability of at least x true positives arising by chance must not exceed 1 − c0.

The level of confidence must be adjusted to account for the fact that we are testing one independent hypothesis for each different test that is identified for potential use in classification. A sufficient correction is obtained using the Šidák inequality [26, 27], which amounts to raising the unadjusted level of confidence to the power of 1 over the number of simultaneous comparisons (i.e., the number of tests). With a large number of tests, the probability of some test appearing to be significant when it is not becomes high, so when the number of simultaneous comparisons is large, the adjusted level of confidence that is required of the individual null hypotheses rises dramatically. For example, if 1000 distinct tests are considered, then an adjusted confidence level of 0.9999 ("four nines") provides only 90 % confidence against false discovery.

It is impossible to reject the null hypothesis when the probability of a fake test having no false positives exceeds 1 − c0; this occurs readily when the number of positive results is small and the disease-positives are a large fraction of the sample. However, it is also impossible to reject the null hypothesis when the probability of a fake test having no false negatives exceeds 1 − c0; this occurs readily in the opposite case, not as a symmetry but as a separate mode of failure. In a small sample with 10 disease-positives and 90 disease-negatives, the resulting criterion at an adjusted confidence level of 0.9999 is unsatisfiable for tests with only one or two positive results. The two distinct modes of failure for a small sample reveal that a rule of thumb to discard tests having few positive results is justified, but not sufficient. In addition, the cutoff values that this criterion produces vary based on the parameters, so an arbitrarily chosen threshold cannot be best for all cases.

8 Classification using likelihood ratios

The preceding section provided a way to detect when a given test does not meet a standard of significance. However, a method to adjust the evidential weight that is given to individual tests is still needed: discarding tests that do not meet the standard while assigning full weight to those that barely meet it is a crude approach.

One option is to reformulate the classifier in terms of likelihood ratios (LR) [19, § 5]. An LR for the combination of all tests is the product of the prior odds (if any) and the LRs for each individual test. To avoid giving weight to evidence that manifests as insignificant, one can estimate a confidence interval for each individual test's LR, choose the value from the interval that is closest to 1, and use that value in subsequent calculations.

Unfortunately, confidence intervals for LRs are even more troublesome than those for PPV. Marill et al. [29] surveyed previous work and commented, "Calculating the 95 % CI for the LR, a ratio that incorporates both sensitivity and specificity, can be challenging." They defined a bootstrap method for finding a confidence interval for the LR, a ratio of binomial proportions, and provided an implementation in the R package bootLR [28].
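Returning to the Section 7 criterion, it can be sketched in stdlib-only Python under the interpretation given above (sidak_confidence and min_true_positives are my own names, not functions from R or this report). With 10 disease-positives, 90 disease-negatives, and 1000 candidate tests, the adjusted confidence level is about 0.999895, and no outcome of a test with one positive result can reach significance:

```python
from math import comb

def phyper(x, m, n, k):
    """Cumulative hypergeometric distribution, as in R's phyper."""
    denom = comb(m + n, k)
    return sum(comb(m, i) * comb(n, k - i) for i in range(min(x, k) + 1)) / denom

def qhyper(q, m, n, k):
    """Smallest c such that q <= phyper(c, m, n, k), as in R's qhyper."""
    for c in range(k + 1):
        if phyper(c, m, n, k) >= q:
            return c
    return k

def sidak_confidence(family_conf, comparisons):
    """Per-comparison confidence giving the stated family-wise confidence,
    assuming independent comparisons (Sidak)."""
    return family_conf ** (1.0 / comparisons)

def min_true_positives(conf, m, n, k):
    """Smallest number of true positives x with P(X >= x) <= 1 - conf.
    If this exceeds k, no outcome of the test can reach significance."""
    return qhyper(conf, m, n, k) + 1

c0 = sidak_confidence(0.90, 1000)
print(round(c0, 6))                       # 0.999895 ("four nines")
print(min_true_positives(c0, 10, 90, 1))  # 2 > k = 1: unsatisfiable
```

Even a test whose 10 positive results are all true positives barely clears the bar here, illustrating how hard the adjusted criterion is to satisfy in a small sample.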

While it avoids the degeneracy that arises with methods based on normal approximations, it does not solve the problem that was discussed earlier of a word appearing only once (1 true positive in a sample of size 100) being considered evidentially significant. For that example, the 95 % interval for the positive LR remains stubbornly above 1 (Figure 2).

    > BayesianLR.test(truePos=1, totalDzPos=10, trueNeg=90, totalDzNeg=90)

        Likelihood ratio test of a 2x2 table

    data:
        truePos totalDzPos trueNeg totalDzNeg
           1        10        90       90
    Positive LR: Inf (2.505 - Inf)
    Negative LR: 0.900 (0.664 - 1.022)
    95% confidence intervals computed via BCa bootstrapping.

Figure 2: Confidence intervals computed according to Marill et al. [28].

To work around this, we can instead construct a conservative interval for the LR using the endpoints of Clopper-Pearson confidence intervals for the numerator and denominator. The full "LR method" of classification is then as follows:

1. Assign an initial message LR value of 1, indicating equal odds of the message being ham or spam.
2. A word that exists in the database is either present in (T+) or absent from (T−) the message. For whichever applies (Tx), compute Clopper-Pearson confidence intervals for the numerator and denominator of the LR p(Tx | spam) / p(Tx | ham) using the relevant counts from the database.
3. Form a conservative LR interval by using (numerator minimum) / (denominator maximum) and (numerator maximum) / (denominator minimum) as endpoints.
4. Choose the value from that interval that is closest to 1 and multiply the message's LR by that value.
5. The final message LR r, which is the product of all of the LRs that were computed for each word in the database, is converted to a probability p = r/(1 + r) that the message is spam (labelled "spamicity" in Bogofilter's output).

The LR method uses different approaches to filtering, weighting, and combining evidence than does Bogofilter. Also, it considers the absence of a word from a message as possible evidence, while Bogofilter only considers the presence of words.

The first step in evaluating this method is to see how it compares with Bogofilter as the baseline in a "normal" scenario. For this, we use the bare version of the Ling-Spam corpus [11, 30]. This corpus is evenly divided into 10 partitions of comparable size and content. The message counts for each of the 10 parts are shown in Table 1. The ranges of the counts for ham and spam are 241 to 242 and 48 to 49 respectively, for a total count of 289 to 291 and a spam prevalence of approximately 17 % in each part.

The purpose of the partitioning was to enable a 10-fold cross-validation where the classifier is trained on 9 parts before being tested on the 10th. This strategy was followed to collect data on the performance of the original Bogofilter method and the LR method described above. LR data were collected by replacing the scoring mechanism in Bogofilter, keeping the tokenization logic and everything else the same.

A kernel density plot of the resulting spamicities (Figure 3) shows that they are tightly clustered for both methods, but the LR method produced fewer inconclusive classifications. Unfortunately, the receiver operating characteristic (ROC) curves reveal that classification accuracy suffered under the LR method. Figure 4 compares the Bogofilter method with the LR method, first using a 95 % level of confidence for the individual Clopper-Pearson intervals and then using a 99.9999 % level of confidence.
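The five steps above can be sketched in stdlib-only Python. This is a simplified reconstruction, not the instrumented Bogofilter code; all function names and the example counts are illustrative:

```python
from math import comb

def binom_cdf(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, conf=0.95):
    """Exact binomial interval via bisection on the binomial tails."""
    alpha = 1 - conf
    def bisect(f):
        lo, hi = 0.0, 1.0
        for _ in range(200):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) else (lo, mid)
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) <= alpha / 2)
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper

def word_lr_factor(present, word_spam, spam_total, word_ham, ham_total, conf=0.95):
    """Steps 2-4: conservative LR interval for one word; return the value
    from the interval that is closest to 1."""
    # Counts matching the observed state: presence (T+) or absence (T-).
    num_k = word_spam if present else spam_total - word_spam
    den_k = word_ham if present else ham_total - word_ham
    num_lo, num_hi = clopper_pearson(num_k, spam_total, conf)  # p(Tx | spam)
    den_lo, den_hi = clopper_pearson(den_k, ham_total, conf)   # p(Tx | ham)
    lr_lo = num_lo / den_hi if den_hi > 0 else float("inf")
    lr_hi = num_hi / den_lo if den_lo > 0 else float("inf")
    if lr_lo <= 1.0 <= lr_hi:
        return 1.0   # interval straddles 1: contributes no evidence
    return lr_lo if lr_lo > 1.0 else lr_hi

def spamicity(word_evidence):
    """Steps 1 and 5: start at even odds, multiply per-word factors, and
    convert the final LR r to a probability r / (1 + r)."""
    r = 1.0
    for args in word_evidence:
        r *= word_lr_factor(*args)
    return r / (1.0 + r)

# A word seen once in 50 spam messages and never in 250 ham messages
# contributes no evidence at 95 % confidence:
print(word_lr_factor(True, 1, 50, 0, 250))   # 1.0
```

By contrast, a hypothetical word present in 40 of 50 spam messages and 1 of 250 ham messages yields a conservative LR well above 1, so it still moves the spamicity decisively.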

Table 1: Message counts in the 10 parts of the Ling-Spam corpus.

    Part     Ham   Spam   Total
    part1    241    48     289
    part2    241    48     289
    part3    241    48     289
    part4    241    48     289
    part5    242    48     290
    part6    241    48     289
    part7    241    48     289
    part8    241    48     289
    part9    241    48     289
    part10   242    49     291

Since the curves for the LR method are almost entirely underneath the curves for the Bogofilter method, it seems that the LR method is actually harmful. The value 0.4 at the elbow of the curve for the LR method indicates that a significant number of spam messages received much lower spamicities that place them in the ham cluster. The elbow of the curve for the original Bogofilter method shows that Bogofilter's default thresholds of 0.45 and 0.99 for ham and spam respectively are suboptimal for this corpus; the filter's parameters would need to be tuned to realize the performance represented by the ROC curve.

9 Conclusion

This report called attention to a component of uncertainty that has sometimes been overlooked in both Bayesian and frequentist approaches to binary classification problems. Discarding tests that show few positive results reduces the impact that this uncertainty can have, but this mitigation is incomplete and the threshold might have been set arbitrarily. To handle the uncertainty, a frequentist confidence interval for the predictive value can be used as a quantitative basis on which to discard tests. Alternatively, a significance test, instead of an arbitrary rule of thumb, can be used to adjust the evidential weight that is given to individual tests. Questions remain of how to adjust inferences to account for the uncertainty of predictive value, which depends on the particular uncertainty model and predictor used. An attempt to do this using confidence intervals for likelihood ratios actually degraded predictor performance on the corpus tested, but it is possible that the likelihood ratio approach could perform better in different circumstances. For classifiers that combine evidence using Fisher's combined probability test [22, § 21·1][31], perhaps it should be replaced with the weighted Z-method [32], with weights being set according to the uncertainty of the individual probabilities.

If the sample is insufficient for the intended purpose, suppressing evidence or propagating uncertainty does not solve the problem. Nevertheless, it is better to get an "unsure" result and know that the evidence is insufficient to support a conclusion than not to know that a conclusion is unreliable.

Acknowledgment

Thanks to Steven Lund, Hari Iyer, and William F. Guthrie for statistical advice and to Ya-Shian Li-Baboud for other helpful suggestions.

[Figures 3 and 4 are not reproduced in this text rendering. Figure 3 plots density against spamicity for the Bogofilter method and the likelihood ratio method (α = 0.05). Figure 4 plots spam-sensitivity / ham-specificity against spam-specificity / ham-sensitivity for the Bogofilter method and the likelihood ratio method at α = 0.05 and α = 10^−6, with floating numbers indicating cutoff values.]

Figure 3: Gaussian kernel density (bandwidth = 0.01) of spamicities.

Figure 4: Approximate ROC curves (with 95 % confidence intervals).

