Deep Dominance - How to Properly Compare Deep Neural Models

Rotem Dror, Segev Shlomov, Roi Reichart
Faculty of Industrial Engineering and Management, Technion, IIT
rtmdrr|segevs|[email protected]

Abstract

Comparing Deep Neural Network (DNN) models based on their performance on unseen data is crucial for the progress of the NLP field. However, these models have a large number of hyper-parameters and, being non-convex, their convergence point depends on the random values chosen at initialization and during training. Proper DNN comparison hence requires a comparison between their empirical score distributions on unseen data, rather than between single evaluation scores as is standard for simpler, convex models. In this paper, we propose to adapt to this problem a recently proposed test for the Almost Stochastic Dominance relation between two distributions. We define the criteria for a high quality comparison method between DNNs, and show, both theoretically and through analysis of extensive experimental results with leading DNN models for sequence tagging tasks, that the proposed test meets all criteria while previously proposed methods fail to do so. We hope the test we propose here will set a new working practice in the NLP community.[1]

[1] Our code is available at: https://github.com/rtmdrr/deepComparison

1 Introduction

A large portion of the research activity in Natural Language Processing (NLP) is devoted to the development of new algorithms for existing or new tasks. To evaluate the quality of a new method, its performance on unseen datasets is compared to the performance of existing methods. The progress of the field hence crucially depends on our ability to draw conclusions from such comparisons.

In the past, most supervised NLP models were linear (or log-linear), convex and relatively simple (e.g. Toutanova et al., 2003; Finkel et al., 2008; Ritter et al., 2011). Hence, their training was deterministic and the number of configurations a model could have was rather small – decisions about model design were usually limited to feature selection and the selection of one of a few loss functions. Consequently, when one model performed better than another on unseen data, it was safe to argue that the winning model was generally better, especially when the results were statistically significant (Dror et al., 2018) and when the effect of multiple hypothesis testing was taken into account in cases of evaluation with multiple datasets (Dror et al., 2017).

With the recent emergence of Deep Neural Networks (DNNs), data-driven performance comparison has become much more complicated. While models such as the LSTM (Hochreiter and Schmidhuber, 1997), the Bi-LSTM (Schuster and Paliwal, 1997) and the transformer (Vaswani et al., 2017) have improved the state of the art in many NLP tasks (e.g. Dozat and Manning, 2017; Hershcovich et al., 2017; Yadav and Bethard, 2018), it is much more difficult to compare the performance of algorithms that are based on these models. This is because the loss functions of these models are non-convex (Dauphin et al., 2014), making the solution to which they converge (a local minimum or a saddle point) sensitive to random weight initialization and to the order of the training examples. Moreover, as these complex models are not fully understood, their training is often enhanced by heuristics such as random dropouts (Srivastava et al., 2014) that introduce another level of non-determinism into the training process. Finally, the increased model complexity results in a much larger number of configurations, governed by a large space of hyper-parameters for model properties such as the number of layers and the number of neurons in each layer.

With so many degrees of freedom governed by random and arbitrary values, when comparing two DNNs it is not possible to consider a single test-set evaluation score for each model. If we do that, we might compare just the best models that someone happened to train rather than the methods themselves. Instead, it is necessary to compare the score distributions generated by different runs of each of the models. Unfortunately, this comparison task, which is fundamental to the progress of the field, has not received a systematic treatment thus far. Our goal is to close this gap and propose a simple and effective comparison tool between two DNNs based on their test-set score distributions. Particularly, we make four contributions:

Defining a DNN comparison framework: We define three criteria that a DNN comparison tool should meet: (a) Since we observe only a sample from the population score distribution of each model, the decision should be significant under well justified statistical assumptions. This assures that future runs of the superior model are likely to get higher scores than future runs of the inferior model; (b) The decision mechanism should be powerful, being able to make decisions in most possible decision tasks; and, finally, (c) Since both models depend on random decisions, it is likely that neither of them is promised to be superior over the other in all cases (e.g. with all possible random seeds). A powerful comparison tool should hence augment its decision with a confidence score, reflecting the probability that the superior model will indeed produce a better output.

Analysis of existing solutions (§3, §5): The comparison problem we address has been highlighted by Reimers and Gurevych (2017b, 2018), who established its importance in an extensive experimentation with neural sequence models (Reimers and Gurevych, 2017a), and proposed two main solutions (§3). One solution, which we refer to as the collection of statistics (COS) solution, is based on the analysis of statistics of the empirical score distribution of the two algorithms, such as their mean and standard deviation (std), as well as their minimum and maximum values. Unfortunately, this solution does not respect criterion (a) as it does not deal with significance, and as we demonstrate in §5 its power (criterion (b)) is also limited. Their second solution is based on significance testing for Stochastic Order (SO) (Lehmann, 1955), a strict criterion that is hardly met in reality. While this solution respects criterion (a), it is not designed to deal with criterion (c), since it does not provide information beyond its decision whether one of the distributions is stochastically dominant over the other or not, and as we show in §5 its power (criterion (b)) is very limited.

A new comparison tool (§4): We propose a solution that meets our three criteria. Particularly, we adapt to our problem the recently presented concept of Almost Stochastic Order (ASO) between two distributions (Álvarez-Esteban et al., 2017),[2] and the statistical significance test for this property, which makes very modest assumptions about the participating distributions (criterion (a)). Further, in line with criterion (c), the test returns a variable ε ∈ [0, 1] that quantifies the degree to which one algorithm is stochastically larger than the other, with ε = 0 reflecting stochastic order. We further show that the test is designed to be very powerful (criterion (b)), which is possible because the decision on the superior algorithm is complemented by the confidence score.

Extensive experimental analysis (§5): We revisit the extensive experimental setup of Reimers and Gurevych (2017a,b), who performed 510 comparisons between strong DNN-based sequence tagging models. In each of their experiments they compared two models – either different models or two variants of the same model differing in some of their hyper-parameters – and reported the score distributions of each model across various random seeds and hyper-parameter configurations. Our analysis reveals that while our test can declare one of the algorithms superior in 100% of the cases, the COS approach can do so in 49.01% of the cases, and the SO approach in a mere 0.98%.

In addition to being powerful, the decisions and the confidence scores of our proposed test are also well aligned with the tests proposed in previous literature: when the previous methods are challenged, our method still makes a decision, but it also indicates the smaller gap between the algorithms. We hope that this work will establish a standard for the comparison between DNNs.

[2] We use the terms Almost Stochastic Order and Almost Stochastic Dominance interchangeably in this paper.

2 Performance Variance in DNNs

In this section we discuss the sources of non-determinism in DNNs, focusing on hyper-parameter configurations and random choices.

2774 parameters. A formal definition of a hyper- As discussed in §1, being non-convex, the con- parameter, differentiating it from a standard pa- vergent point of DNNs is deeply affected by these rameter, is usually a parameter whose value is random effects. Unfortunately, exploring all pos- set before the learning process begins. We can sible random seeds is impossible both because roughly say that hyper-parameters determine the they form an uncountable set and because their structure of the model and particular algorithmic values are uninterpretable and it is hence even hard decisions related, e.g., to its optimization. Some to decide on the relevant search space for their val- popular structure-related hyper-parameters in the ues. This dictates the need for reporting model re- DNN literature are the number of layers, layer sults with multiple random choices. sizes, activation functions, loss functions, win- dow sizes, stride values, and parameter initializa- 3 Comparing DNNs: Problem tion methods. Some optimization (training) re- Formulation and Background lated hyper-parameters are the optimization algo- Problem Definition Given two algorithms, each rithms, learning rates, number of epochs, momen- associated with a set of test-set evaluation scores, tum, mini-batch sizes, whether or not to use op- our goal is to determine which algorithm, if any, timization heuristics such as gradient clipping and is superior. In this research, the score distributions gradient normalization, and sampling and ordering are generated when running two different DNNs methods of the training data. with various hyper-parameter configurations and To decide on the hyper-parameter values, it is random seeds. For both DNNs, the performance is standard to explore several configurations and ob- measured using the same evaluation measure over serve which performs best on an unseen, held-out the same dataset,3 but, to be as general as possible, dataset, commonly referred to as the development the number of scores may vary between the DNNs. set. For some hyper-parameters (e.g. the learning As noted in §1, several methods were proposed rate and the optimization algorithm) the range of for the comparison between the score distributions feasible values reflects the intuitions of the model of two DNNs. We now discuss these methods. author, and the tuned value provides some insight about the model and the data. However, for many 3.1 Collection of Statistics (COS) other hyper-parameters (e.g. the number of neu- This approach is based on the analysis of statistics rons in each layer of the model and the number of of the empirical score distributions. For example, epochs of the optimization algorithm) the range of Reimers and Gurevych(2018) averaged the test- values and the selected values are quite arbitrary. set scores and applied the Welch’s t-test (Welch, Hence, although hyper-parameters can be tuned 1947) for comparing between the means. Notice on development data, the distribution of model’s that the Welch’s t-test is based on the assumption evaluation scores across these configurations is of that the test-set scores are drawn from normal dis- interest, especially when considering the hyper- tributions – an assumption that has not been val- parameters with the more arbitrary values. idated for DNN score distributions. Hence, this Random Choices There are also hyper- method does not meet criterion (a) from §1, that parameters that do not follow the above tuning requires the comparison method to check for sta- logic. 
These include some of the hyper-parameters tistical significance under realistic assumptions. that govern the random ordering of the training Moreover, comparing only the mean of two dis- examples, the dropout process and the initializa- tributions is not always sufficient for making pre- tion of the model parameters. The values of these dictions about future comparisons between the al- hyper-parameters are often set randomly. gorithms. Other statistics such as the std, median In other cases, randomization is introduced to and the minimum and maximum values are of- the model without an explicit hyper-parameter. ten also relevant. For example, it might be that For example, a popular initialization method for the expected value of algorithm A is indeed larger DNN weights is the Xavier method (Glorot and than that of algorithm B, but A’s std is also much Bengio, 2010). In this method, the initial weights larger, making prediction very challenging. In §5 are sampled from a Gaussian distribution with a we show that if both larger mean and smaller std p mean of 0 and an std of 2/ni, where ni is the 3Without loss of generality we will assume that higher number of input units of the i-th layer. values of the measure are better.
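For illustration only, the sketch below shows how such a COS-style comparison could be run with NumPy and SciPy. The score arrays are hypothetical and this is not the implementation used in the papers discussed here; note that Welch's t-test compares only the means and, as discussed above, assumes approximately normal score distributions.

```python
# A minimal sketch of a COS-style comparison: Welch's t-test on the means,
# plus the additional summary statistics a practitioner might inspect.
import numpy as np
from scipy import stats

# Hypothetical test-set scores from two DNNs (for illustration only).
scores_a = np.array([0.912, 0.905, 0.918, 0.909, 0.915, 0.911])
scores_b = np.array([0.907, 0.901, 0.913, 0.899, 0.910, 0.904])

# Welch's t-test: unequal variances allowed, but (approximate) normality assumed.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
print(f"Welch's t = {t_stat:.3f}, p = {p_value:.3f}")

# COS summary statistics: mean, std, median, min, max.
for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name, s.mean(), s.std(ddof=1), np.median(s), s.min(), s.max())
```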

In §5 we show that if both a larger mean and a smaller std are required for a decision, the COS approach is decisive (i.e. it can declare that one algorithm is better than the other) in only 49.01% of the 510 setups considered in Reimers and Gurevych (2017b). This violates our criterion (b), which requires the comparison test to be powerful.

3.2 Stochastic Order (SO)

Another approach, proposed by Reimers and Gurevych (2018), tests whether a score drawn from the distribution of algorithm A (denoted as $X_A$) is likely, with a probability higher than 0.5, to be larger than a score drawn from the distribution of algorithm B ($X_B$). Formally, algorithm A is declared superior to algorithm B if:

$$P(X_A \geq X_B) > 0.5. \quad (1)$$

To test whether this requirement holds based on the empirical score distributions of the two algorithms, the authors applied the Mann-Whitney U test for independent pairs (Mann and Whitney, 1947), which tests whether there exists a stochastic order (SO) between two random variables. This test is non-parametric, making no assumptions about the participating distributions except for being continuous. In the appendix we show that if there is an SO between two distributions, the condition in Equation 1 also holds.

We next describe the concept of SO in more detail. But first, in order to keep our paper self-contained, we define the cumulative distribution function (CDF) and the empirical CDF of a probability distribution. These definitions are required for the definition of SO we make next.

The CDF
For a random variable X, the CDF is defined as follows:

$$F(t) = P(X \leq t).$$

For a sample $\{x_1, \ldots, x_n\}$, the empirical CDF is defined as follows:

$$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{x_i \leq t} = \frac{\text{number of } x_i\text{'s} \leq t}{n},$$

where $1_{x_i \leq t}$ is an indicator function that takes the value of 1 if $x_i \leq t$, and 0 otherwise.

Stochastic Order (SO)
Lehmann (1955) defines a random variable X to be stochastically larger than a random variable Y (denoted by $X \succeq Y$) if $F(a) \leq G(a)$ for all $a$ (with a strict inequality for some values of $a$), where F and G are the CDFs of X and Y, respectively. That is, if we observe a random value sampled from the first distribution, it is likely to be larger than a random value sampled from the second distribution.

If it can be concluded from the empirical score distributions of two DNNs that SO exists between their respective population distributions, this means that one algorithm is more likely to produce higher quality solutions than the other, and this algorithm can be declared superior. As discussed above, Reimers and Gurevych (2018) applied the Mann-Whitney U-test to test for this relationship. The U-test has high statistical power when the tested distributions are moderate-tailed, e.g., the normal distribution or the logistic distribution. When the distribution is heavy-tailed, e.g., the Cauchy distribution, there are several alternative statistical tests that have higher statistical power, for example likelihood-based tests (Lee and Wolfe, 1976; El Barmi and McKeague, 2013).

The main limitation of this approach is that SO can rarely be proved to hold based on two empirical distributions. Indeed, in §5 we show that an SO holds between the two compared algorithms in only 0.98% of the comparisons performed by Reimers and Gurevych (2017a). Hence, while this approach meets our criterion (a) (testing for significance under realistic assumptions), it does not meet criterion (b) (being a powerful test) or criterion (c) (providing a confidence score). We next describe another approach that does meet our three criteria.

4 Our Approach: Almost Stochastic Dominance

Our starting point is that the requirement of SO is unrealistic, because it means that the inequality $F(a) \leq G(a)$ should hold for every value of $a$. This criterion is hence likely to fail to determine dominance between two distributions even when a "reasonable" decision-maker would clearly prefer one DNN over the other. We hence propose to employ a relaxed version of this criterion. We next discuss different definitions of such a relaxation.
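Before turning to the relaxation, the following sketch illustrates the machinery of §3.2 on two hypothetical score samples: it evaluates the empirical CDFs on a grid to check the strict SO condition and runs the Mann-Whitney U-test with SciPy. The sample arrays and grid are assumptions made for illustration only; this is not the code released with the paper.

```python
# A minimal sketch: empirical CDFs, a strict-SO check, and the Mann-Whitney U-test.
import numpy as np
from scipy import stats

def empirical_cdf(sample, grid):
    """F_n(t) = (number of sample points <= t) / n, evaluated on a grid."""
    sample = np.sort(sample)
    return np.searchsorted(sample, grid, side="right") / len(sample)

scores_a = np.random.normal(0.91, 0.01, size=50)   # hypothetical score samples
scores_b = np.random.normal(0.90, 0.01, size=60)

grid = np.linspace(0.0, 1.0, 1001)
F = empirical_cdf(scores_a, grid)
G = empirical_cdf(scores_b, grid)

# Strict stochastic order of A over B requires F(a) <= G(a) for every a.
print("empirical SO holds:", bool(np.all(F <= G)))

# One-sided U-test: alternative is that A's scores are stochastically larger.
u_stat, p_value = stats.mannwhitneyu(scores_a, scores_b, alternative="greater")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```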

A Potential Relaxation
For ε > 0 and random variables X and Y with CDFs F and G, respectively, we can define the following notion of ε-stochastic dominance:

$$X \succeq_{\varepsilon} Y \;\text{ if }\; F(a) \leq G(a) + \varepsilon \;\text{ for all } a.$$

That is, we allow the distributions to violate the stochastic order, so one CDF does not have to be strictly below the other for all $a$. The practical shortcomings of this definition are apparent in cases where $F(a)$ is greater than $G(a)$ for all $a$, with a gap bounded by, for example, $\varepsilon/2$. In such cases we would not want to determine that $X \sim F$ is ε-stochastically dominant over $Y \sim G$, because its CDF is strictly above the CDF of Y, and hence Y is stochastically larger than X. However, according to this relaxation, $X \sim F$ is indeed ε-stochastically larger than $Y \sim G$.

Almost Stochastic Dominance
To overcome the limitations of the above straightforward approach, and define a relaxation of stochastic order, we turn to a definition that is based on the proportion of points in the domain of the participating distributions for which SO holds. That is, the test we introduce below is based on the following two violation sets:

$$V_X = \{a : F(a) > G(a)\}, \qquad V_Y = \{a : F(a) < G(a)\}.$$

Intuitively, the variable with the smaller violation set should be declared superior, and the ratio between these sets should define the gap between the distributions.

To implement this idea, del Barrio et al. (2018) defined the concept of almost stochastic dominance. Here we describe their work, which aims to compare two distributions, and discuss its applicability to our problem of comparing two DNN models based on the three criteria defined in §1. We start with a definition: for a CDF F, the quantile function associated with F is defined as:

$$F^{-1}(t) = \inf\{x : t \leq F(x)\}, \quad t \in (0, 1). \quad (2)$$

It is possible to define stochastic order using the quantile function in the following manner:

$$X \succeq Y \iff F^{-1}(t) \geq G^{-1}(t), \;\forall t \in (0, 1). \quad (3)$$

The advantage of this definition is that the domain of the quantile function is bounded between 0 and 1, in contrast to the CDF, whose domain is unbounded.

From this definition, it is clear that a violation of the stochastic order between X and Y occurs when $F^{-1}(t) < G^{-1}(t)$. Hence, it is easy to redefine $V_X$ and $V_Y$ based on the quantile functions:

$$A_X = \{t \in (0, 1) : F^{-1}(t) < G^{-1}(t)\}, \qquad A_Y = \{t \in (0, 1) : F^{-1}(t) > G^{-1}(t)\}.$$

del Barrio et al. (2018) employed these definitions in order to define the distance of each random variable from stochastic dominance over the other:

$$\varepsilon_{W_2}(F, G) := \frac{\int_{A_X} (F^{-1}(t) - G^{-1}(t))^2\, dt}{W_2^2(F, G)}, \quad (4)$$

where $W_2(F, G)$, also known as the univariate $L_2$-Wasserstein distance between distributions, is defined as:

$$W_2(F, G) = \sqrt{\int_0^1 (F^{-1}(t) - G^{-1}(t))^2\, dt}. \quad (5)$$

This ratio explicitly measures the distance of X (with CDF F) from stochastic dominance over Y (with CDF G), since it reflects the probability mass for which Y dominates X. The corresponding definition for the distance of Y from being stochastically dominant over X is obtained from the above equations by replacing the roles of F and G and integrating over $A_Y$ instead of $A_X$.

This index satisfies $0 \leq \varepsilon_{W_2}(F, G) \leq 1$, where 0 corresponds to perfect stochastic dominance of X over Y and 1 corresponds to perfect stochastic dominance of Y over X. It also holds that $\varepsilon_{W_2}(F, G) = 1 - \varepsilon_{W_2}(G, F)$, and smaller values of the smaller index (which is by definition bounded between 0 and 0.5) indicate a smaller distance from stochastic dominance.

Statistical Significance Testing for ASO
Using this index it is possible to formulate the following hypothesis testing problem to test for almost stochastic dominance:

$$H_0 : \varepsilon_{W_2}(F, G) \geq \varepsilon \qquad\qquad H_1 : \varepsilon_{W_2}(F, G) < \varepsilon,$$

which tests, for a predefined ε > 0, whether the violation index is smaller than ε. Rejecting the null hypothesis means that the first score distribution F is almost stochastically larger than G with ε distance from stochastic order.

del Barrio et al. (2018) proved that, without further assumptions, $H_0$ is rejected with a significance level of α if:

$$\sqrt{\frac{nm}{n+m}}\left(\varepsilon_{W_2}(F_n, G_m) - \varepsilon\right) < \hat{\sigma}_{n,m}\,\Phi^{-1}(\alpha),$$

where $F_n, G_m$ are the empirical CDFs with n and m samples, respectively, ε is the violation level, $\Phi^{-1}$ is the inverse CDF of the standard normal distribution, and $\hat{\sigma}_{n,m}$ is the estimated standard deviation of the value

$$\sqrt{\frac{nm}{n+m}}\left(\varepsilon_{W_2}(F_n^*, G_m^*) - \varepsilon_{W_2}(F_n, G_m)\right),$$

where $\varepsilon_{W_2}(F_n^*, G_m^*)$ is computed using samples $X_n^*, Y_m^*$ from the empirical distributions $F_n$ and $G_m$.[4]

In addition, the minimal ε for which we can claim, with a confidence level of $1 - \alpha$, that F is almost stochastically dominant over G is:

$$\varepsilon_{\min}(F_n, G_m, \alpha) = \varepsilon_{W_2}(F_n, G_m) - \sqrt{\frac{n+m}{nm}}\,\hat{\sigma}_{n,m}\,\Phi^{-1}(\alpha).$$

If $\varepsilon_{\min}(F_n, G_m, \alpha) < 0.5$, we can claim that algorithm A is better than B, and the lower $\varepsilon_{\min}(F_n, G_m, \alpha)$ is, the greater is the gap between the algorithms. When $\varepsilon_{\min}(F_n, G_m, \alpha) = 0$, algorithm A is stochastically dominant over B. However, if $\varepsilon_{\min}(F_n, G_m, \alpha) \geq 0.5$, then F is not almost stochastically larger than G (with confidence level $1 - \alpha$), and hence we should accept the null hypothesis that algorithm A is not superior to algorithm B.

del Barrio et al. (2018) proved that, assuming accurate estimation of $\hat{\sigma}_{n,m}$, it holds that:

$$\varepsilon_{\min}(F_n, G_m, \alpha) = 1 - \varepsilon_{\min}(G_m, F_n, \alpha).$$

Hence, for a given α value, one of the algorithms will be declared superior, unless $\varepsilon_{\min}(F_n, G_m, \alpha) = \varepsilon_{\min}(G_m, F_n, \alpha) = 0.5$.

Notice that the minimal ε and the rejection condition of the null hypothesis depend on n and m, the number of scores we have for each algorithm. Hence, for the statistical test to have high statistical power we need to make sure that n and m are big enough. While we cannot provide a method for tuning these numbers, we note that in the extensive analysis of §5 the test had enough statistical power to make decisions in all cases. The pseudo code of our implementation is provided in the appendix.

To summarize, the test for almost stochastic dominance meets the three criteria defined in §1. It is a test for statistical significance under very minimal assumptions on the distribution from which the performance scores are drawn (criterion (a)). Moreover, it quantifies the gap between the two reference distributions (criterion (c)), which allows it to make decisions even in comparisons where the gap between the superior algorithm and the inferior algorithm is not large (criterion (b)).

To demonstrate the appropriateness of this method for the comparison between two DNNs, we next revisit the extensive experimental setup of Reimers and Gurevych (2017a).

5 Analysis

Tasks and Models
In this section we demonstrate the potential impact of testing for almost stochastic dominance on the way empirical results of NLP models are analyzed. We use the data of Reimers and Gurevych (2017a)[5] and Reimers and Gurevych (2017b).[6] This data contains 510 comparison setups for five common NLP sequence tagging tasks: Part Of Speech (POS) tagging with the WSJ corpus (Marcus et al., 1993), syntactic chunking with the CoNLL 2000 data (Sang and Buchholz, 2000), Named Entity Recognition with the CoNLL 2003 data (Sang and De Meulder, 2003), Entity Recognition with the ACE2005 data (Walker et al., 2006), and event detection with the TempEval3 data (UzZaman et al., 2013). In each setup two leading DNNs, either different architectures or variants of the same model with different hyper-parameter configurations, are compared across various choices of random seeds and hyper-parameter configurations. The exact details of the comparisons are beyond the scope of this paper; they are documented in the above papers.

For each experimental setup, we report the outcome of three alternative comparison methods: collection of statistics (COS), stochastic order (SO), and almost stochastic order (ASO). For COS, we report the mean, std, and median of the scores for each algorithm, as well as their minimum and maximum values. We consider one algorithm to be superior over another only if both its mean is greater and its std is smaller. For SO, we employ the U-test as proposed by Reimers and Gurevych (2018), and consider a result significant if p ≤ 0.05. Finally, for ASO we employ the method of §4 and report the identity of the superior algorithm along with its ε value, using p ≤ 0.01.

[4] The more samples, the better. In our implementation we employ the inverse transform sampling method to generate samples.
[5] https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf
[6] Which was generously given to us by the authors.
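To make the ε index of §4 concrete, here is a minimal sketch of how ε_W2 can be estimated from two score samples, assuming a uniform quantile grid and NumPy's interpolated empirical quantiles. It is an illustration only, not the released implementation referenced in footnote 1.

```python
# A minimal sketch of the ASO violation index eps_W2 (Equations 4 and 5).
import numpy as np

def epsilon_w2(scores_f, scores_g, grid_size=1000):
    """Estimate eps_W2(F, G): the normalized mass by which G's quantiles exceed
    F's, i.e. the distance of F from stochastic dominance over G."""
    t = (np.arange(grid_size) + 0.5) / grid_size      # grid of points in (0, 1)
    f_inv = np.quantile(scores_f, t)                  # empirical quantile functions
    g_inv = np.quantile(scores_g, t)
    diff = f_inv - g_inv
    w2_sq = np.mean(diff ** 2)                        # Riemann sum for W2^2
    if w2_sq == 0.0:
        return 0.5                                    # identical samples: no dominance either way
    violation = np.mean((diff ** 2) * (diff < 0))     # integrate only over A_X = {t: F^-1 < G^-1}
    return violation / w2_sq

# Hypothetical samples; values below 0.5 suggest the first sample is almost
# stochastically larger than the second.
eps = epsilon_w2(np.random.normal(0.92, 0.02, 100), np.random.normal(0.90, 0.03, 120))
print(f"eps_W2 = {eps:.4f}")
```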

Analysis Structure
We divide our analysis into three cases. In Case A both the COS and the SO approaches indicate that one of the models is superior. In Case B the previous methods reach contradicting conclusions: while COS indicates that one of the algorithms is superior, SO comes out insignificant. Finally, in Case C both COS and SO are indecisive. In the 510 comparisons we analyze there is no setup where SO was significant but COS could not reach a decision. We start with an example setup for each case and then provide a summary of all 510 comparisons.

Results: Case A
We demonstrate that if algorithm A is stochastically larger than algorithm B, then all three methods agree that algorithm A is better than B. As an example setup we analyze the comparison between the NER models of Lample et al. (2016) and Ma and Hovy (2016) when running both algorithms multiple times, changing only the random seed fed into the random number generator (41 scores from Lample et al. (2016), 87 scores from Ma and Hovy (2016)). The evaluation measure is F1 score. The collection of statistics for the two models is presented in Table 1.

          Lample et al.   Ma & Hovy
  Mean    0.9075          0.9056
  STD     0.2237          0.3211
  Median  0.9080          0.9063
  Min     0.9018          0.8853
  Max     0.9113          0.9100

Table 1: NER results (Case A).

The U-test states that Lample et al. (2016) is stochastically larger than Ma and Hovy (2016) with a p-value of 0.00025. This result is also consistent with the prediction of the COS approach, as Lample et al. (2016) is better than Ma and Hovy (2016) both in terms of mean (larger) and std (smaller). Finally, the minimum ε value of the ASO method is 0, which also reflects an SO.

Results: Case B
We demonstrate that if the measures of mean and std from the COS approach indicate that algorithm A is better than algorithm B but stochastic dominance does not hold, then it also holds that A is almost stochastically larger than B with a small ε > 0. As an example case we consider the experiment where the performance of a BiLSTM POS tagger with one of two optimizers, Adam (Kingma and Ba, 2014) (3898 scores) or RMSProp (Hinton et al., 2012) (1822 scores), is compared across various hyper-parameter configurations and random seeds. The evaluation measure is word level accuracy. The COS for the two models is presented in Table 2.

           Adam     RMSProp
  Average  0.9224   0.9190
  STD      0.0604   0.0920
  Median   0.9319   0.9349
  Min      0.1746   0.1420
  Max      0.9556   0.9573

Table 2: POS tagging results (Case B).

The result of the U-test came out insignificant with a p-value of 0.4562. The COS approach predicts that Adam is the better optimizer, as both its mean is larger and its std is smaller. When comparing between Adam and RMSProp, the ASO method returns an ε of 0.0159, indicating that the former is almost stochastically larger than the latter.

We note that decisions with the COS method are challenging, as it potentially involves a large number of statistics (five in this analysis). Our decision here is to make the COS prediction based on the mean and std of the score distribution, even when according to other statistics the conclusion might have been different. We consider this ambiguity an inherent limitation of the COS method.

Results: Case C
Finally, we address the case where stochastic dominance does not hold and no conclusions can be drawn from the statistics collection. Our observation is that even in these cases ASO is able to determine which algorithm is better with a reasonable level of confidence. We consider again a BiLSTM architecture, this time for NER, where the comparison is between two dropout policies – no dropout (225 scores) and variational dropout (2599 scores). The evaluation measure is the F1 score and the collection of statistics is presented in Table 3.

          Variational   No Dropout
  Mean    0.8850        0.8772
  STD     0.0392        0.0247
  Median  0.8896        0.8799
  Min     0.0119        0.5547
  Max     0.9098        0.8995

Table 3: NER results (Case C).

[Figure 1 panels: (a) Case A, (b) Case B, (c) Case C]

Figure 1: Histograms of the ε values of the ASO method for cases A, B and C.

The U-test came out insignificant with a p-value of 0.5. COS is also inconclusive, as the mean result of the variational dropout approach is larger, but so is its std. In this case, looking at the other statistics also gives a mixed picture, as the median and max values of the variational approach are larger, but its min value is substantially smaller.

The ASO approach indicates that the no-dropout approach is almost stochastically larger, with ε = 0.0279. An in-depth consideration supports this decision, as the much larger std and the much smaller minimum of the variational approach are indicators of a skewed score distribution that leaves low certainty about future performance.

Results: Summary
We now turn to a summary of our analysis across the 510 comparisons of Reimers and Gurevych (2017a). Table 4 presents the percentage of comparisons that fall into each category, along with the average and std of the ε value of ASO for each case (all ASO results are significant with p ≤ 0.01). Figure 1 presents the histogram of these ε values in each case.

           % of comparisons   Avg. ε   ε std
  Case A   0.98%              0.0      0.0
  Case B   48.04%             0.072    0.108
  Case C   50.98%             0.202    0.143

Table 4: Results summary over the 510 comparisons of Reimers and Gurevych (2017a).

The number of comparisons that fall into case A is only 0.98%, indicating that it is rare that a decision about stochastic dominance of one algorithm can be reached when comparing DNNs. We consider this a strong indication that the Mann-Whitney U test is not suitable for DNN comparison, as it has very little statistical power (criterion (b)).

COS makes a decision in 49.01% of the comparisons (cases A and B). This method is also somewhat powerful (criterion (b)), but much less so than ASO, which is decisive in all 510 comparisons. The ε values of ASO are higher for case B than for case A (middle line of the table, middle graph of the figure). For case C the ε distribution is qualitatively different – ε receives a range of values (rightmost graph of the figure) and its average is 0.202 (bottom line of the table). We consider this to be a desired behavior, as the more complex the picture drawn by COS and SO is, the less confident we expect ASO to be. Being able to make a decision in all 510 comparisons while quantifying the gap between the distributions, we believe that ASO is an appropriate tool for DNN comparison.

6 Error Rate Analysis

While our extensive analysis indicates the quality of the ASO test, it does not allow us to estimate its false positive and false negative rates. This is because in our 510 comparisons there is no oracle (or gold standard) that says whether one of the algorithms is superior. Below we provide such analyses.

False Positive Rate
The ASO test is defined such that the ε value required for rejecting the conclusion that algorithm A is better than B is set by the practitioner. While ε = 0.5 indicates a clear rejection, most researchers would probably set a lower ε threshold. Our goal in the next analysis is to present a case where the false positive rate of ASO is very low, even when one refrains from declaring one algorithm as better than the other only when ε is very close to 0.5.

To do that, we consider a scenario where each of the 255 score distributions of the experiments in §5 is compared to a variant of the same distribution, created by adding to each of the scores a Gaussian noise with a 0 expectation and a standard deviation of 0.001.

[Figure 2 panels: (a) False Positive Rate Experiment, (b) False Negative Rate Experiment]

Figure 2: Histograms of the ε values of the ASO test in the ablation experiments.

Since in all the tasks we consider the scores are in the [0, 1] range, the value of 0.001 is equivalent to 0.1%. Since the average of the standard deviations of these 255 score distributions is 0.06, our noise is small but not negligible. We choose this relatively small symmetric noise so that, with a high probability, the original score distribution and the modified one should not be considered different. We run 100 comparisons for each of the 255 algorithms.

We compute ε such that a value of 0 means that the non-noisy version is better than the noisy one with the strongest confidence, while a value of 1 means the exact opposite (both values are not observed in practice). A value of 0.5 indicates that no algorithm is superior – the correct prediction.

Figure 2(a) presents a histogram of the ε values. The average ε is 0.502 with a standard deviation of 0.0472, and 95% of the ε values are in [0.396, 0.631]. This means that if we set a threshold of 0.4 on ε (i.e. declare superiority only for values lower than 0.4 or higher than 0.6), the false positive rate would be lower than 5%. In comparison, the COS approach declares the noisy version superior in 26.2% of the 255 comparisons, and the non-noisy version in 23.8%: a false positive rate of 50%.[7] The SO test makes no mistakes, as a false positive of this test is equivalent to an ε value of 0 or 1 for ASO.

Finally, we also considered a setup where for each of the 255 algorithms the performance score set was randomly split into two equal sized sets. We repeated this process 100 times for each algorithm, using ASO to compare between the sets. In all cases we observed an average ε of 0.5, indicating that the method avoids false positive predictions when an algorithm is compared to itself.

[7] Recall that we consider one algorithm superior over the other according to COS when both the mean of its scores is larger than the mean of the other and its std is smaller.

False Negative Rate
This analysis complements the previous one by demonstrating the low false negative rate of ASO in a case where it is clear that one distribution is better than the other. For each of the 255 score distributions we generate a noisy distribution by randomly splitting the scores into a set A containing 1/4 of the scores and the complementary set Â containing the rest of the scores. For each score s we sample a noise parameter φ from a Gaussian with a 0 expectation and an std of 0.01, adding to s the value −φ² if s ∈ A, and φ² if s ∈ Â. The noisy distribution is hence superior to the original one with a high probability. As before, we perform 100 comparisons for each of the 255 algorithms.

We compute ε such that a value of 0 would mean that the noisy version is superior. The ε values are plotted in Figure 2(b): their average is 0.134, their standard deviation is 0.07, and more than 99% of the values are lower than 0.4 (the same threshold as in the first experiment). The COS test deems the noisy distribution superior in 87.4% of the cases, while in the rest it considers none of the distributions superior. SO has a false negative rate of 100%, as ε > 0 in all experiments.

7 Conclusions

We considered the comparison of two DNNs based on their test-set score distributions. We defined three criteria for a high quality comparison method, demonstrated that previous methods do not meet these criteria, and proposed to use the recently proposed test for almost stochastic dominance, which does meet these criteria. We analyzed the extensive experimental setup of Reimers and Gurevych (2017a) and demonstrated the effectiveness of our proposed test. Having released our code, we hope this will become a new evaluation standard in the NLP community.

References

P. C. Álvarez-Esteban, Eustasio del Barrio, Juan Antonio Cuesta-Albertos, C. Matrán, et al. 2017. Models for the assessment of treatment improvement: the ideal and the feasible. Statistical Science, 32(3):469–485.

Eustasio del Barrio, Juan A. Cuesta-Albertos, and Carlos Matrán. 2018. An optimal transportation approach for assessing almost stochastic order. In The Mathematics of the Uncertain, pages 33–44. Springer.

Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proc. of ICLR.

Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. 2017. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 5:471–486.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392.

Hammou El Barmi and Ian W. McKeague. 2013. Empirical likelihood-based tests for stochastic ordering. Bernoulli, 19(1):295.

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL-08: HLT, pages 959–967.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1127–1138.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. 2012. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Young Jack Lee and Douglas A. Wolfe. 1976. A distribution-free test for stochastic ordering. Journal of the American Statistical Association, 71(355):722–727.

Erich Leo Lehmann. 1955. Ordered families of distributions. The Annals of Mathematical Statistics, pages 399–419.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.

Nils Reimers and Iryna Gurevych. 2017a. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. arXiv preprint arXiv:1707.06799.

Nils Reimers and Iryna Gurevych. 2017b. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348.

Nils Reimers and Iryna Gurevych. 2018. Why comparing single performance scores does not allow to draw conclusions about machine learning approaches. arXiv preprint arXiv:1803.09578.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics.

Erik F. Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. arXiv preprint cs/0009008.

Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 173–180. Association for Computational Linguistics.

Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57.

Bernard L. Welch. 1947. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1/2):28–35.

Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158.

A Proof: Equivalent Definitions of Stochastic Order

As discussed in §3, our goal here is to prove that if a random variable X is stochastically larger than a random variable Y (denoted by $X \succeq Y$), then it also holds that $P(X \geq Y) > 0.5$. This lemma explains why Reimers and Gurevych (2018) employed the Mann-Whitney U test, which tests for stochastic order, while their requirement for stating that one algorithm is better than the other was that $P(X \geq Y) > 0.5$ (where X is the score distribution of the superior algorithm and Y is the score distribution of the inferior algorithm).

Lemma 1. If $X \succeq Y$ then $P(X \geq Y) > 0.5$.

Proof. For every two continuous random variables X, Y it holds that:

$$P(X \geq Y) + P(Y > X) = 1.$$

Let us first assume that X and Y are i.i.d. and continuous. If this is the case then:

$$P(X \geq Y) + P(Y > X) = 1$$
$$P(X \geq Y) + P(X > Y) = 1$$
$$2P(X \geq Y) = 1$$
$$P(X \geq Y) = 0.5.$$

The first pass is true because X and Y are identically distributed, and the second pass is true because X and Y are continuous random variables.

Assuming that the density functions of the random variables X and Y exist (which is true because they are continuous variables), we can write $P(X \geq Y)$ in the following manner:

$$P(X \geq Y) = \int_{y=-\infty}^{\infty}\int_{x=y}^{\infty} f_X(x)\, f_Y(y)\, dx\, dy = \int_{y=-\infty}^{\infty} f_Y(y)\, P(X \geq y)\, dy = \int_{y=-\infty}^{\infty} f_Y(y)\, P(Y \geq y)\, dy = 0.5,$$

where the equality to 0.5 was proved above.

In our case, $X \succeq Y$. This means that X and Y are independent but not identically distributed. By the definition of stochastic order this also means that $P(X \geq a) > P(Y \geq a)$ for all $a$, with strict inequality for at least one value of $a$. We get that:

$$P(X \geq Y) = \int_{y=-\infty}^{\infty}\int_{x=y}^{\infty} f_X(x)\, f_Y(y)\, dx\, dy = \int_{y=-\infty}^{\infty} f_Y(y)\, P(X \geq y)\, dy > \int_{y=-\infty}^{\infty} f_Y(y)\, P(Y \geq y)\, dy = 0.5,$$

where the last pass holds because X is stochastically larger than Y. We get that $P(X \geq Y) > 0.5$.

Note that the opposite direction does not always hold, i.e., it is easy to come up with an example where $P(X \geq Y) > 0.5$ but there is no stochastic order between the two random variables. However, the opposite direction is true under the additional assumption that the CDFs do not cross one another (which we do not prove here).

B Hypothesis Testing for Almost Stochastic Dominance

In this section we discuss the implementation of the algorithm for hypothesis testing of the almost stochastic dominance relation between two random variables (empirical score distributions). The code of the algorithm is publicly available.

We are given two sets of scores from two algorithms, n scores from algorithm A and m scores from algorithm B: $A = \{x_1, x_2, \ldots, x_n\}$, $B = \{y_1, y_2, \ldots, y_m\}$. The pseudocode of the algorithm is as follows:

1. Sort the data points from the smallest to the largest in both sets, creating two lists: $A = [x_{(1)}, \ldots, x_{(n)}]$ and $B = [y_{(1)}, \ldots, y_{(m)}]$, where $x_{(i)}$ is the i-th smallest value.

2. Build the empirical score distributions $F_n, G_m$ using the following formula:

$$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{x_{(i)} \leq t} = \frac{\text{number of } x_i\text{'s} \leq t}{n}.$$

3. Build the empirical inverse score distributions $F^{-1}(t), G^{-1}(t)$ using the following formula:[8]

$$F^{-1}(t) = \inf\{x : t \leq F(x)\}, \quad t \in (0, 1).$$

[8] It is possible to compute the inverse CDF without explicitly computing the CDF.

4. Compute the index of stochastic dominance violation $\varepsilon_{W_2}(F, G)$ (Equation 4 of the main paper). In practice we compute the integral using the definition of the Riemann integral: when computing $\int_0^1 f(t)\, dt$, we partition the interval between 0 and 1 into small parts of size $\Delta$ and sum the function value in each part times $\Delta$.

5. Estimate σ: take many samples $X_n^*, Y_m^*$ from the empirical distributions $F_n$ and $G_m$; for each of those samples compute the expression:

$$\sqrt{\frac{nm}{n+m}}\left(\varepsilon_{W_2}(F_n^*, G_m^*) - \varepsilon_{W_2}(F_n, G_m)\right),$$

and use the variance of those values as the estimate for $\sigma^2$; take the square root of that estimate to obtain $\hat{\sigma}_{n,m}$. The more samples, the better. In our implementation we employ the inverse transform sampling method to generate samples.

6. The minimal ε for which we can claim that algorithm A is almost stochastically larger than algorithm B with a confidence level of $1 - \alpha$ is:

$$\varepsilon_{\min}(F_n, G_m, \alpha) = \varepsilon_{W_2}(F_n, G_m) - \sqrt{\frac{n+m}{nm}}\,\hat{\sigma}_{n,m}\,\Phi^{-1}(\alpha).$$
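For completeness, below is a minimal, self-contained Python sketch of steps 1-6 above. It assumes a uniform quantile grid for the Riemann sums and uses resampling with replacement from the observed scores in place of explicit inverse transform sampling; it is an illustration of the procedure, not the authors' released implementation.

```python
# A minimal sketch of the ASO significance test (Appendix B, steps 1-6).
import numpy as np
from scipy.stats import norm

def epsilon_w2(f, g, grid_size=1000):
    """Violation index eps_W2(F, G) estimated on a uniform quantile grid."""
    t = (np.arange(grid_size) + 0.5) / grid_size
    diff = np.quantile(f, t) - np.quantile(g, t)      # F^-1(t) - G^-1(t)
    w2_sq = np.mean(diff ** 2)                        # Riemann sum for W2^2
    return 0.5 if w2_sq == 0.0 else np.mean((diff ** 2) * (diff < 0)) / w2_sq

def min_epsilon(scores_f, scores_g, alpha=0.05, n_bootstrap=1000, grid_size=1000, seed=0):
    """Smallest eps for which F is almost stochastically larger than G
    at confidence level 1 - alpha."""
    rng = np.random.default_rng(seed)
    n, m = len(scores_f), len(scores_g)
    const = np.sqrt(n * m / (n + m))
    eps_fg = epsilon_w2(scores_f, scores_g, grid_size)        # step 4

    # Step 5: estimate sigma by resampling from the empirical distributions.
    samples = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        f_star = rng.choice(scores_f, size=n, replace=True)
        g_star = rng.choice(scores_g, size=m, replace=True)
        samples[b] = const * (epsilon_w2(f_star, g_star, grid_size) - eps_fg)
    sigma_hat = samples.std(ddof=1)

    # Step 6: minimal eps at confidence level 1 - alpha.
    return eps_fg - (sigma_hat / const) * norm.ppf(alpha)

# Usage with hypothetical score samples.
a = np.random.normal(0.92, 0.02, 100)
b = np.random.normal(0.90, 0.03, 120)
print("min epsilon:", min_epsilon(a, b))
```

With this sketch, ε_min values below 0.5 would support declaring the first algorithm superior, and values near 0 would indicate (almost) strict stochastic dominance, mirroring the decision rule of §4.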
