Deep Dominance - How to Properly Compare Deep Neural Models

Rotem Dror, Segev Shlomov, Roi Reichart
Faculty of Industrial Engineering and Management, Technion, IIT
rtmdrr|segevs|[email protected]

Abstract

Comparing Deep Neural Network (DNN) models based on their performance on unseen data is crucial for the progress of the NLP field. However, these models have a large number of hyper-parameters and, being non-convex, their convergence point depends on the random values chosen at initialization and during training. Proper DNN comparison hence requires a comparison between their empirical score distributions on unseen data, rather than between single evaluation scores as is standard for simpler, convex models. In this paper, we propose to adapt to this problem a recently proposed test for the Almost Stochastic Dominance relation between two distributions. We define the criteria for a high quality comparison method between DNNs, and show, both theoretically and through analysis of extensive experimental results with leading DNN models for sequence tagging tasks, that the proposed test meets all criteria while previously proposed methods fail to do so. We hope the test we propose here will set a new working practice in the NLP community.¹

¹Our code is available at: https://github.com/rtmdrr/deepComparison

1 Introduction

A large portion of the research activity in Natural Language Processing (NLP) is devoted to the development of new algorithms for existing or new tasks. To evaluate the quality of a new method, its performance on unseen datasets is compared to the performance of existing methods. The progress of the field hence crucially depends on our ability to draw conclusions from such comparisons.

In the past, most supervised NLP models were linear (or log-linear), convex and relatively simple (e.g. (Toutanova et al., 2003; Finkel et al., 2008; Ritter et al., 2011)). Hence, their training was deterministic and the number of configurations a model could have was rather small – decisions about model design were usually limited to feature selection and the selection of one of a few loss functions. Consequently, when one model performed better than another on unseen data it was safe to argue that the winning model was generally better, especially when the results were statistically significant (Dror et al., 2018), and when the effect of multiple hypothesis testing was taken into account in cases of evaluation with multiple datasets (Dror et al., 2017).

With the recent emergence of Deep Neural Networks (DNNs), data-driven performance comparison has become much more complicated. While models such as the LSTM (Hochreiter and Schmidhuber, 1997), the Bi-LSTM (Schuster and Paliwal, 1997) and the transformer (Vaswani et al., 2017) have improved the state-of-the-art in many NLP tasks (e.g. (Dozat and Manning, 2017; Hershcovich et al., 2017; Yadav and Bethard, 2018)), it is much more difficult to compare the performance of algorithms that are based on these models. This is because the loss functions of these models are non-convex (Dauphin et al., 2014), making the solution to which they converge (a local minimum or a saddle point) sensitive to random weight initialization and the order of the training examples. Moreover, as these complex models are not fully understood, their training is often enhanced by heuristics such as random dropout (Srivastava et al., 2014) that introduce another level of non-determinism to the training process. Finally, the increased model complexity results in a much larger number of configurations, governed by a large space of hyper-parameters for model properties such as the number of layers and the number of neurons in each layer.
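To make the effect just described concrete, here is a toy sketch (my own illustration, not from the paper or its repository): it trains a small scikit-learn feed-forward network on synthetic data, keeping the data and hyper-parameters fixed and varying only the random seed, so that any change in the test score is due solely to the random choices made during initialization and training.

```python
# Toy illustration (not from the paper): the same non-convex model, trained on the same
# data with the same hyper-parameters, reaches a different solution, and hence a different
# test score, depending only on the random seed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(10):
    # Only the seed varies: it controls weight initialization and mini-batch shuffling.
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=60, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print("test accuracies across seeds:", [round(s, 3) for s in scores])
print("spread: min=%.3f max=%.3f" % (min(scores), max(scores)))
```

No single number from such a list characterizes the method itself; it is the empirical distribution of these scores that the comparison approaches discussed below operate on.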
With so many degrees of freedom governed by random and arbitrary values, when comparing two DNNs it is not possible to consider a single test-set evaluation score for each model. If we do that, we might compare just the best models that someone happened to train rather than the methods themselves. Instead, it is necessary to compare between the score distributions generated by different runs of each of the models. Unfortunately, this comparison task, which is fundamental to the progress of the field, has not received a systematic treatment thus far. Our goal is to close this gap and propose a simple and effective comparison tool between two DNNs based on their test set score distributions. Particularly, we make four contributions:

Defining a DNN comparison framework: We define three criteria that a DNN comparison tool should meet: (a) Since we observe only a sample from the population score distribution of each model, the decision should be significant under well justified statistical assumptions. This assures that future runs of the superior model are likely to get higher scores than future runs of the inferior model; (b) The decision mechanism should be powerful, being able to make decisions in most possible decision tasks; and, finally, (c) Since both models depend on random decisions, it is likely that none of them is promised to be superior over the other in all cases (e.g. with all possible random seeds). A powerful comparison tool should hence augment its decision with a confidence score, reflecting the probability that the superior model will indeed produce a better output.

Analysis of existing solutions (§3, §5): The comparison problem we address has been highlighted by Reimers and Gurevych (2017b, 2018), who established its importance in an extensive experimentation with neural sequence models (Reimers and Gurevych, 2017a), and proposed two main solutions (§3). One solution, which we refer to as the collection of statistics (COS) solution, is based on the analysis of statistics of the empirical score distribution of the two algorithms – such as their mean, median and standard deviation (std), as well as their minimum and maximum values. Unfortunately, this solution does not respect criterion (a) as it does not deal with significance, and as we demonstrate in §5 its power (criterion (b)) is also limited. Their second solution is based on significance testing for Stochastic Order (SO) (Lehmann, 1955), a strict criterion that is hardly met in reality. While this solution respects criterion (a), it is not designed to deal with criterion (c), since it does not provide information beyond its decision if one of the distributions is stochastically dominant over the other or not, and as we show in §5 its power (criterion (b)) is very limited.

A new comparison tool (§4): We propose a solution that meets our three criteria. Particularly, we adapt to our problem the recently presented concept of Almost Stochastic Order (ASO) between two distributions (Álvarez-Esteban et al., 2017),² and the statistical significance test for this property, which makes very modest assumptions about the participating distributions (criterion (a)). Further, in line with criterion (c), the test returns a value ε ∈ [0, 1] that quantifies the degree to which one algorithm is stochastically larger than the other, with ε = 0 reflecting stochastic order. We further show that the test is designed to be very powerful (criterion (b)), which is possible because the decision on the superior algorithm is complemented by the confidence score.

²We use the terms Almost Stochastic Order and Almost Stochastic Dominance interchangeably in this paper.

Extensive experimental analysis (§5): We revisit the extensive experimental setup of Reimers and Gurevych (2017a,b), who performed 510 comparisons between strong DNN-based sequence tagging models. In each of their experiments they compared two models – either different models or two variants of the same model differing in some of their hyper-parameters – and reported the score distributions of each model across various random seeds and hyper-parameter configurations. Our analysis reveals that while our test can indeed declare one of the algorithms superior in 100% of the cases, the COS approach can do that in 49.01% of the cases, and the SO approach in a mere 0.98%. In addition to being powerful, the decisions and the confidence scores of our proposed test are also well aligned with the tests proposed in previous literature: when the previous methods are challenged, our method still makes a decision but it also indicates the smaller gap between the algorithms. We hope that this work will establish a standard for the comparison between DNNs.
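To make the contrast between the three approaches tangible, the following sketch compares two synthetic score samples (stand-ins for real per-run test scores). It is a simplified, hypothetical illustration rather than the paper's implementation: the printed summary statistics correspond to the COS solution, the one-sided Mann-Whitney U test is used only as a generic stand-in for a stochastic-order test, and aso_violation_ratio is my reading of the violation-ratio idea of Álvarez-Esteban et al. (2017) as a plain point estimate; the actual test, defined in §4, adds a significance procedure on top of such an estimate.

```python
# Simplified comparison of two synthetic score samples with the three approaches
# discussed above: COS statistics, a stochastic-order probe, and an ASO-style epsilon.
import numpy as np
from scipy.stats import mannwhitneyu

def aso_violation_ratio(scores_a, scores_b, grid_size=1000):
    """Point estimate of the degree to which A fails to stochastically dominate B.
    0.0 -> A stochastically dominates B, ~0.5 -> neither dominates, 1.0 -> B dominates A.
    (A simplified reading of the violation-ratio idea; no significance correction.)"""
    t = (np.arange(grid_size) + 0.5) / grid_size
    qa = np.quantile(scores_a, t)           # empirical quantile function of A
    qb = np.quantile(scores_b, t)           # empirical quantile function of B
    diff = qb - qa                          # positive where A's quantile falls below B's
    denom = np.sum(diff ** 2)
    if denom == 0.0:                        # identical samples: no dominance either way
        return 0.5
    return float(np.sum(np.clip(diff, 0.0, None) ** 2) / denom)

rng = np.random.default_rng(0)
# Synthetic "test-set F1 over many runs" samples, standing in for real score collections.
scores_a = rng.normal(loc=90.5, scale=0.6, size=200)
scores_b = rng.normal(loc=90.2, scale=0.9, size=200)

# (i) COS: summary statistics of each empirical score distribution (descriptive only).
for name, s in (("A", scores_a), ("B", scores_b)):
    print(f"{name}: mean={s.mean():.2f} median={np.median(s):.2f} std={s.std():.2f} "
          f"min={s.min():.2f} max={s.max():.2f}")

# (ii) A strict stochastic-order probe, here a one-sided Mann-Whitney U test as a stand-in.
u_stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="greater")
print(f"one-sided Mann-Whitney U p-value (A greater than B): {p_value:.4f}")

# (iii) ASO-style graded output.
print(f"empirical violation ratio epsilon: {aso_violation_ratio(scores_a, scores_b):.3f}")
```

Under this reading, ε = 0 would mean the first sample stochastically dominates the second, values slightly above 0 suggest almost stochastic dominance, and values around 0.5 indicate that neither algorithm dominates; it is exactly this graded output, absent from the COS and SO solutions, that criterion (c) calls for.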
2 Performance Variance in DNNs

In this section we discuss the sources of non-determinism in DNNs, focusing on hyper-parameter configurations and random choices.

Hyper-parameter Configurations: DNNs are complex models governed by a variety of hyper-parameters. A formal definition of a hyper-parameter, differentiating it from a standard parameter, is that it is a parameter whose value is set before the learning process begins. We can roughly say that hyper-parameters determine the structure of the model and particular algorithmic decisions related, e.g., to its optimization. Some popular structure-related hyper-parameters in the DNN literature are the number of layers, layer ...

As discussed in §1, being non-convex, the convergence point of DNNs is deeply affected by these random effects. Unfortunately, exploring all possible random seeds is impossible, both because they form an uncountable set and because their values are uninterpretable, and it is hence even hard to decide on the relevant search space for their values. This dictates the need for reporting model results with multiple random choices.
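In practice this dictates an evaluation protocol in which every hyper-parameter configuration is trained with several random seeds and a whole collection of test scores is reported per model. The sketch below is hypothetical: train_and_evaluate is a placeholder that merely simulates a noisy score (so the sketch runs end to end), standing in for training a real sequence tagger and evaluating it on a held-out test set, and the tiny grid is only for illustration.

```python
# Hypothetical evaluation protocol: every hyper-parameter configuration is trained
# with several random seeds, and a score distribution (not a single score) is kept.
import itertools
import random

def train_and_evaluate(config: dict, seed: int) -> float:
    """Placeholder for: train a DNN with this configuration and seed, return a test score.
    It only simulates a configuration-dependent mean plus seed-dependent noise."""
    rng = random.Random(f"{sorted(config.items())}-{seed}")
    base = 88.0 + 0.5 * config["num_layers"] + 100.0 * config["learning_rate"]
    return base + rng.gauss(0.0, 0.4)

# A tiny grid; real spaces (layers, layer sizes, dropout rates, optimizers, ...) are far
# larger, which is why single-run comparisons are so easy to get wrong.
grid = {
    "num_layers": [1, 2],
    "learning_rate": [0.001, 0.01],
}
configurations = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

score_distributions = {}
for config in configurations:
    # Several seeds per configuration: each run may converge to a different solution.
    scores = [train_and_evaluate(config, seed) for seed in range(5)]
    score_distributions[str(config)] = scores

for name, scores in score_distributions.items():
    print(name, [round(s, 2) for s in scores])
```

Score collections of this form, such as those reported by Reimers and Gurevych (2017a,b) for strong sequence taggers, are the inputs that the comparison methods introduced above (COS, SO and ASO) operate on.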