On the Statistical Significance Testing for Natural Language Processing

Haotian Zhu

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

University of Washington
2020

Reading Committee:
Fei Xia
Gina-Anne Levow

Program Authorized to Offer Degree: Department of Linguistics

© Copyright 2020 Haotian Zhu

Abstract

Chair of the Supervisory Committee: Professor Fei Xia, Linguistics

This thesis explores and compares, from several angles, the statistical significance tests frequently used to compare the performance of Natural Language Processing (NLP) systems. We begin by establishing the fundamentals of NLP system performance comparison and formulating it as four major tasks specific to NLP. Each statistical significance test is explained in detail, with its assumptions made explicit and its testing procedure outlined. We stress the importance of verifying test assumptions before conducting a test. In addition, we examine effect size and statistical power and discuss their role in statistical significance testing in NLP. To account for potential dependencies within a test set, the block bootstrap is introduced and used to calibrate significance testing when comparing the average performance of two systems. Four case studies with both simulated and real data, whose data dependencies vary in complexity, illustrate how to use a statistical significance test properly when comparing NLP system performance under different settings. We then discuss the results from different perspectives and raise some open issues, such as cross-domain comparison and violation of the i.i.d. assumption, which call for further study.
In conclusion, this thesis advocates the proper use of statistical significance testing in comparing NLP system performance and the reporting of comparison results with greater transparency and completeness.

TABLE OF CONTENTS

List of Figures
List of Tables
Table of Notation
Chapter 1: Introduction
Chapter 2: Preliminaries and Assumptions
    2.1 Definitions
    2.2 Assumptions
    2.3 Data type
    2.4 Statistical hypothesis testing procedure
    2.5 A note on the p-value
    2.6 Effect size
    2.7 Power analysis
    2.8 Hypothesis testing tasks
    2.9 Table of tests
Chapter 3: Previous Work
    3.1 Methodological work
    3.2 Empirical work
Chapter 4: Statistical Hypothesis Testing
    4.1 Verifying test assumptions
    4.2 Parametric tests
    4.3 Nonparametric test
    4.4 Multiple testing
    4.5 Table of tests (comprehensive)
Chapter 5: Test Comparison and Case Study
    5.1 Testing and reporting
    5.2 Case 1: simple 2-sample numerical data
    5.3 Case 2: paired 2-sample numerical data
    5.4 Case 3: dependent numerical samples
    5.5 Case 4: 2-sample categorical data (contingency table)
Chapter 6: Discussion
    6.1 Comparison with previous studies
    6.2 The choice and use of significance tests
    6.3 Transparent and complete reporting
    6.4 Interpretation of confidence intervals
    6.5 i.i.d. in categorical data
    6.6 Test statistic for block bootstrap
    6.7 Power estimation for block bootstrap
    6.8 Cross-domain comparison
    6.9 Choice of evaluation metric
Chapter 7: Conclusion and Future Work
    7.1 Contribution
    7.2 Future trajectories
Appendix A: Tables
    A.1 Tables of R and Python functions for statistical tests
    A.2 Table of R functions for effect size indices
Appendix B: Algorithms
    B.1 Algorithm for unpaired permutation test
    B.2 Algorithm for paired sign test
    B.3 Algorithm for unpaired bootstrap test
    B.4 Algorithm for block bootstrap test
    B.5 Algorithm for Monte Carlo power estimation
    B.6 Algorithm for bootstrap power estimation
Bibliography

LIST OF FIGURES

5.1 Histogram of sample X.
5.2 Histogram of sample Y.
5.3 Simulated power curves for case 1.
5.4 Histogram of difference Z.
5.5 Simulated power curves for case 2.
5.6 Histogram of BLEU scores of system A and B.
5.7 Estimated power curves of case 3.

LIST OF TABLES

2.1 General testing procedure
2.2 The statement on p-values
2.3 Cohen's d effect size
2.4 The table of tests (simplified)
3.1 The table of methodological work.
3.2 The table of example empirical work.
4.1 The table of tests to verify assumptions ('/' denotes the test is not covered in this thesis)
4.2 Testing procedure for F test
4.3 Testing procedure for unpaired t test
4.4 The contingency table of samples X and Y
4.5 Testing procedure for Wilcoxon signed-rank test
4.6 Holm procedure
4.7 Benjamini-Hochberg procedure
4.8 Table of repeated measures
4.9 The table of tests
5.1 General testing and reporting procedure.
5.2 Test comparison in case 1.
5.3 Testing result of case 1 (we expect a rejection).
5.4 Summary statistics for samples X and Y.
5.5 Test comparison in case 2.
5.6 Testing result of case 2 (we expect a rejection).
5.7 Summary statistics for samples X, Y and Z.
5.8 Additional testing result of case 2 (we expect a rejection).
5.9 Test comparison in case 3.
5.10 Testing result of case 3 (the official report of the evaluation results gives a rejection).
5.11 Summary statistics for the given samples.
5.12 Contingency table for case 4
5.13 Test comparison in case 4.
5.14 Testing result of case 4 (the official report of the evaluation results does not use significance testing).
5.15 Qualitative interpretation of odds ratio and Cohen's g
A.1 The table of R functions for tests
A.2 The table of Python functions for tests
A.3 The table of tests

TABLE OF NOTATION

X: a random variable (a sample)
µ_X: population mean of X
σ_X: population standard deviation of X
X̄: the sample mean of X
σ̂_X: the sample standard deviation of X
X_i: an observation in a sample
D.F.: degrees of freedom
⊥⊥: statistically independent
H_0: the null hypothesis
H_1: the alternative hypothesis
P_0(·): the conditional probability under the null hypothesis
P_1(·): the conditional probability under the alternative hypothesis
α: the significance level or Type I error
β: the Type II error
F_X: the cumulative distribution function of X
X ∼ Y: X and Y have the same distribution
A1: Assumption 1
T1: Task 1
RANK(X_i): the rank of observation X_i
I: the indicator function

ACKNOWLEDGMENTS

I would like first to extend my gratitude to my advisor, Professor Fei Xia, for her invaluable and meticulous guidance along the way. It is under her auspices that this thesis grew from a quixotic and immature idea, through numerous hour-long discussions via Skype, into its full form. I am also grateful to Professor Gina-Anne Levow for her indispensable suggestions, punctually provided in carefully annotated documents. I would like to express my appreciation to Professor Emily Bender for her understanding and support when the path before me was obscured. I would also like to thank the Department of Linguistics for its efficient and, I would say, magical administrative assistance.
During these three years of my graduate life, solitude and insomnia have always accompanied me, insidiously festering in invisibility, but even in the most desperate time of all there is always a scintilla of glistening flare. That flare comes from unwavering friendship and kinship. To my friend Fan Wang, I will always remember the time and felicity we have shared together. Lastly, and most important of all, I thank my parents, for their undivided and.
