Measuring the homogeneity and similarity of language corpora

Gabriela Maria Chiara Cavaglia

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS OF THE UNIVERSITY OF BRIGHTON

FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

JULY 2005

INFORMATION TECHNOLOGY RESEARCH INSTITUTE

UNIVERSITY OF BRIGHTON

Abstract

Corpus-based methods are now dominant in Natural Language Processing (NLP). Creating big corpora is no longer difficult, and the technology to analyse them is becoming faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance would be maintained on others. To make progress on these questions, we need methods for comparing corpora. This thesis investigates comparison methods based on the notions of corpus homogeneity and similarity.

We provide definitions for corpus homogeneity and similarity, expressed not in terms of language varieties, but in terms of performance on application tasks. Since performance is quantifiable through accuracy, the accuracy values can be used as a gold standard for the evaluation of any homogeneity and similarity measures. The applications of interest are ones that use a corpus as training data. Other things being equal, we can expect performance to be optimal if the application is trained on text that is very similar to the text used for evaluation, and for performance to degrade as the disparity between the two text types increases.

We build on earlier work on corpus similarity by proposing two easily computed measures, one for homogeneity and one for similarity, based on frequency lists of document-internal features and an inter-document similarity measure. We evaluate the measures by calculating the homogeneity and similarity for several corpora, comparing these values against the accuracy levels obtained by two NLP applications, a text classifier and a part-of-speech tagger, which use those corpora. Results show that our measures give fairly good estimates of the homogeneity and similarity of corpora, especially for the tagging task.

Contents

Abstract
Contents
List of Tables
Acknowledgements
Author's declaration

1 Introduction
1.1 Thesis overview

2 Language corpora
2.1 The use of corpora in language research
2.2 Corpus representativeness
2.3 Heterogeneous, balanced and homogeneous corpora
2.4 Corpus profiling
2.5 Corpus design process and external/internal criteria
2.6 Automatically built resources and corpus homogeneity
2.7 Portability of NLP applications and corpus similarity
2.8 Summary

3 Document similarity and the classification task
3.1 The classification task
3.2 Document features
3.3 Language Identification
3.4 Stylometry or computer-assisted authorship attribution
3.5 Text classification
3.5.1 Content classification
3.5.2 Genre classification
3.5.3 Sentiment classification
3.6 Information Retrieval
3.7 Summary

4 Comparing corpora
4.1 Corpus size: representativeness and aspect
4.2 Comparing corpora using surface features
4.3 Comparing corpora using structural features
4.4 Corpus homogeneity and similarity: a definition
4.5 Automatic sublanguage identification
4.6 Measuring corpus homogeneity and similarity
4.7 Summary

5 The document similarity algorithm
5.1 Step one: collecting feature types and weights
5.1.1 Deciding aspect and features
5.1.2 Frequency list
5.2 Step two: normalization and smoothing
5.2.1 Normalization techniques
5.2.2 Smoothing techniques
5.2.3 Producing corpus profiles
5.3 Step three: feature selection
5.3.1 "The K most frequent features" technique
5.3.2 Reducing corpus profiles
5.4 Step four: inter-document similarity measures
5.4.1 Chi square
5.4.2 Log-likelihood
5.4.3 Relative entropy
5.4.4 Computing inter-document similarity
5.4.5 The symmetry of inter-document measures
5.5 Summary

6 Definitions and methodologies
6.1 Non-circular definitions for corpus homogeneity and similarity
6.1.1 The performance of an NLP application
6.2 Corpus homogeneity and similarity measures
6.2.1 Computing corpus homogeneity
6.2.2 Computing corpus similarity
6.3 The evaluation method
6.3.1 Producing gold standard judgements for homogeneity
6.3.2 Producing gold standard judgements for similarity
6.4 Summary

7 Exploring corpus homogeneity using a text classification application
7.1 Data for the Rainbow experiment: the …
7.2 Measuring homogeneity
7.3 Producing the gold standard judgements using Rainbow
7.4 Results
7.5 Discussion
7.6 Conclusions

8 Exploring corpus homogeneity and similarity using a part of speech tagger
8.1 Data
8.2 Measuring corpus homogeneity
8.3 Measuring corpus similarity
8.4 Producing the gold standard judgments using the RASP tagger
8.4.1 Using cross-validation for producing gold standard judgments
8.4.2 Similarity between written and spoken English
8.5 Results for homogeneity
8.6 Results for similarity
8.7 Discussion
8.8 Conclusions

9 Exploring the homogeneity of corpora built from the web
9.1 The Web as a corpus resource
9.2 The experiment
9.3 Specialist and general terms
9.3.1 The specialist terms
9.3.2 The general terms
9.4 Developing corpora
9.5 Measuring corpus homogeneity
9.6 Results
9.7 Discussion
9.8 Conclusion

10 Conclusions
10.1 The thesis research questions
10.2 The contributions of this thesis
10.2.1 Definitions of corpus homogeneity and similarity
10.2.2 Measures for estimating corpus homogeneity and similarity
10.2.3 Constraints on measures for computing corpus homogeneity and similarity
10.2.4 Producing other measures
10.3 Directions for future research

List of References

Appendices

A Example of a frequency list and a probability distribution
B List of function words
C Results of the Rainbow experiment
C.1 Homogeneity scores
C.2 Accuracy scores
D List of Tags in the BNC Sampler
E Results of the RASP experiment
E.1 Homogeneity scores
E.2 Similarity scores
E.3 Accuracy values for homogeneity
E.4 Accuracy values for similarity
F Homogeneity and training size: results of a preliminary experiment
G Results of the experiment using the Web
G.1 List of technical and non-technical terms used in the WEB experiment
G.2 Homogeneity scores
H Published Papers

List of Tables

4.1 Interactions between homogeneity and similarity

5.1 Types of analysis and features for profiles
5.2 Corpus Matrix
5.3 Reduced Corpus Matrix
5.4 Contingency table
5.5 Inter-document similarity matrix
5.6 Inter-document similarity matrix when the similarity measure is symmetrical

7.1 Homogeneity scores computed using the 500 most frequent words in each corpus
7.2 Rainbow accuracy values obtained on homogeneous and heterogeneous subcorpora; analysing all words, and using training sets of 10 documents per class. Standard deviation values are given in brackets
7.3 Correlations between Rainbow accuracy values and homogeneity values

8.1 Homogeneity values and standard deviations obtained using POS tags and G²
8.2 Similarity values obtained using POS tags and G²
8.3 Similarity values obtained using POS tags and D(p||q)
8.4 Accuracy values obtained using the 10-fold cross-validation method for each of the seven corpora
8.5 Accuracy values obtained using a variant of the 2-fold cross-validation method for each pair of corpora
8.6 Ratio of double quote characters in written corpora
8.7 Contingency table
8.8 Ordered gold standard judgments for homogeneity
8.9 T-values for each pair of corpora
8.10 Correlations between homogeneity and accuracy
8.11 Related t-values computed between cells of the same quadrant
8.12 Related t-test values computed between cells of the different quadrants
8.13 Correlations between similarity and accuracy

9.1 Homogeneity values and standard deviations obtained using POS tags and G² for some of the corpora used in the experiment
9.2 Mann-Whitney U values and level of significance

A.1 Example of POS frequency list and probability distribution with and without smoothing. Part I: the 50 most frequent tags
A.2 Example of POS frequency list and probability distribution with and without smoothing. Part II
A.3 Example of POS frequency list and probability distribution with and without smoothing. Part III
A.4 Example of POS frequency list and probability distribution with and without smoothing. Part IV

C.1 Homogeneity scores for subcorpora 1-20
C.2 Homogeneity scores for subcorpora 21-40
C.3 Homogeneity scores for subcorpora 41-54
C.4 The accuracy values for subcorpora 1-29
C.5 The accuracy values for subcorpora 30-54

E.1 Homogeneity values obtained using POS tags
E.2 Homogeneity values obtained using function words
E.3 Homogeneity values obtained using POS tag bigrams
E.4 Homogeneity values obtained using POS tags and normalization with smoothing
E.5 Homogeneity values obtained using function words and normalization with smoothing
E.6 Homogeneity values obtained using POS tag bigrams and normalization with smoothing
E.7 Similarity values obtained using POS tags
E.8 Similarity values obtained using function words
E.9 Similarity values obtained using POS tag bigrams
E.10 Similarity values obtained using POS tags and smoothing
E.11 Similarity values obtained using function words and smoothing
E.12 Similarity values obtained using POS tag bigrams and smoothing
E.13 Accuracy values computed using the 10-fold cross-validation method for each written English corpus
E.14 Accuracy values computed using the 10-fold cross-validation method for each spoken English corpus
E.15 Accuracy values using the Guardian corpus for training
E.16 Accuracy values using the World Affairs corpus for training
E.17 Accuracy values using the Imaginative corpus for training
E.18 Accuracy values using the Business corpus for training
E.19 Accuracy values using the Leisure corpus for training
E.20 Accuracy values using the Public corpus for training
E.21 Accuracy values using the Demographic corpus for training

F.1 The accuracy obtained by Rainbow on homogeneous and heterogeneous subcorpora; analysing all words, and using training sets of different sizes: 1, 5 and 10 documents per class. Standard deviation values are given in brackets

G.1 Technical terms and their web frequency from the chemical domain
G.2 Technical terms and their web frequency from the genetic domain
G.3 Technical terms and their web frequency from the metallurgical domain
G.4 General reference terms and their web frequency: from 1 to 30
G.5 General reference terms and their web frequency: from 31 to 60
G.6 Homogeneity values obtained using POS tags and G² for the corpora developed from chemical terms
G.7 Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from chemical terms
G.8 Homogeneity values obtained using POS tags and G² for the corpora developed from genetic terms
G.9 Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from genetic terms
G.10 Homogeneity values obtained using POS tags and G² for the corpora developed from metallurgical terms
G.11 Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from metallurgical terms
G.12 Homogeneity values obtained using POS tags and G² for the corpora developed from metallurgical terms
G.13 Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from metallurgical terms
G.14 Homogeneity values obtained using POS tags and G² for the general reference corpora: from 1 to 30
G.15 Homogeneity values obtained using POS tags and G² for the general reference corpora: from 31 to 58
G.16 Homogeneity values obtained using POS tags and D(p||q) for the general reference corpora: from 1 to 30
G.17 Homogeneity values obtained using POS tags and D(p||q) for the general reference corpora: from 31 to 58

Acknowledgements

There are many people to whom I owe my gratitude. First, I would like to thank my supervisors Adam Kilgarriff, John Carroll and Richard Power. I wish to thank Adam Kilgarriff for his helpful comments, suggestions and words of encouragement throughout my doctorate. He steered me through the whole thesis from the very beginning, always being patient and supportive. I wish to thank John Carroll for the valuable feedback he has provided whenever I have requested it, and for his support and guidance when I worked with the RASP tagger that he provided. He provided clear and intelligent answers to all my questions which, by contrast, were sometimes not clear and intelligent! I wish to thank Richard Power for the interesting discussions on corpus homogeneity and statistical tests, and for having viewed the thesis from a different perspective. I am extremely grateful to Adam, John, and Richard for going through all my drafts: not only did they check the content but the form as well. They did a wonderful job, and the mistakes which are still present in the thesis just show how hard their task was.

Secondly, I wish to thank Donia Scott and all the ITRI people: researchers, students and administrators. I have learned something from everyone. In particular I am grateful to David Tugwell for helping me with Perl. ITRI has been a friendly environment for the whole period of my stay in Brighton.

Thanks are also due to the University of Brighton and the EPSRC for their financial support, without which this research and the whole experience of studying abroad would not have been possible.

On a personal note, I wish to thank Sue Atkins, Michael Rundell and Adam Kilgarriff for giving me the opportunity to take part in Lexicom. It was a pleasure listening to such good teachers and staying in such a lively and international context.

Finally, a big thank you to all my family. I wish to thank Patricia McDonagh for being my family in England and sharing with me her house and friends. I wish to thank Enrico, my mum and dad, Paola and Alessandro, and my grandparents.

Author's declaration

I declare that the research contained in this thesis, unless otherwise formally indicated within the text, is the original work of the author. The thesis has not been previously submitted to this or any other university for a degree, and does not incorporate any material already submitted for a degree.

Signed

Dated

21 July 2005

Chapter 1

Introduction

Does an NLP application perform better when trained and tested on corpora which are homogeneous? Does it perform better when trained on corpus A and tested on corpus B, if A and B are similar? These are the two main research questions addressed in this thesis.

In the literature several authors have pointed out that there is a connection between the performance of NLP applications and the homogeneity and similarity of corpora (Biber, 1993; Sekine, 1997; Roland and Jurafsky, 1998; Illouz, 2000; Gildea, 2001; Kilgarriff, 2001). But up to now the notions of corpus homogeneity and similarity have not been clearly defined.

The first aim of this thesis is the automatic creation of corpus profiles.¹ The profiles, which represent corpora through their internal features, support comparisons between corpora and the quantification of corpus homogeneity and similarity. With corpus profiles, proposing measures for corpus homogeneity and similarity becomes relatively easy. So the second aim of this thesis is to propose measures which are easy and quick to compute, and domain- and language-independent, so that they can be applied to corpora of any genre, language and size. The third and final aim of the thesis is to identify a method for the validation of the proposed measures that can equally be employed for the evaluation of other measures for homogeneity and similarity. A proper evaluation method requires finding independent gold-standard judgements against which to compare the measures.

¹ Even though corpora can contain non-textual information (e.g., images, sounds), this thesis restricts the information used to build corpus profiles to the textual.

The availability of methods for corpus profiling and effective measures for corpus homogeneity and similarity will support the corpus design process, the development of NLP resources and the assessment of the costs of porting a language technology application from one domain with one corpus to another (with a different corpus).
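To make the idea of a purely textual corpus profile concrete before the details in chapter 5, here is a minimal sketch, assuming plain words as the document-internal features (later chapters also use POS tags and function words). The function name and the toy data are invented for illustration and are not taken from the thesis software.

    from collections import Counter

    def corpus_profile(documents):
        # A corpus profile here is a relative-frequency distribution
        # over word features, aggregated over all documents.
        counts = Counter()
        for text in documents:
            counts.update(text.lower().split())
        total = sum(counts.values())
        return {word: n / total for word, n in counts.items()}

    # Toy usage: two tiny "corpora" with visibly different profiles.
    news = corpus_profile(["shares fell sharply", "the market rallied today"])
    chat = corpus_profile(["lol see you later", "see you at the market"])
    print(sorted(news.items(), key=lambda kv: -kv[1])[:3])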

1.1 Thesis overview

Chapter 2 surveys the various kinds of text corpora and their main applications in language studies. It also points out that the spread of corpus-based methods contrasts markedly with the lack of methods to describe corpora and quantify their properties (corpus profiling). Corpus homogeneity and similarity represent the basic step in corpus comparison and can make an important contribution to the process of corpus profiling.

Chapter 3 describes some areas of natural language processing (NLP) which involve the classification of documents. Out of the literature review comes the outline of a general algorithm for producing document profiles and quantifying document similarity, which represents the starting point for the development of corpus profiles and of homogeneity and similarity measures.

Chapter 4 presents previous work on measuring corpus homogeneity and similarity and, more generally, on corpus comparison. This second literature review confirms that the general algorithm used for the study of document similarity can be applied to corpora. It also presents two methods for the evaluation of corpus homogeneity and similarity measures.

The work of Kilgarriff (2001) represents the starting point for the development of the two measures for corpus homogeneity and similarity presented in this thesis, while Sekine (1997) describes a method for the evaluation of corpus similarity measures using a parser, which inspired the method proposed in this thesis for the evaluation of both homogeneity and similarity measures using any supervised NLP application.

Chapter 5 describes in detail each step of the general algorithm, identified in chapter 3, which profiles documents and quantifies their similarity. It describes how the technique is applied to corpora instead of documents.

Chapter 6 presents the main contribution of the thesis. Firstly, it provides new definitions of homogeneity and similarity which, unlike the notions accepted in the literature, are not affected by circularity. Secondly, on the basis of these definitions, two measures which are convenient to compute are proposed, based on previous work by Kilgarriff. Finally, a generalization of Sekine's method for the evaluation of the measures is proposed.

Chapters 7 and 8 present two experiments to apply and validate the measures for homogeneity and similarity proposed in chapter 6. The first experiment uses as gold standard judgements the performance values obtained on different subcorpora of the BNC using a text classifier. The second experiment uses the BNC Sampler to produce subcorpora and a POS tagger as the application.

Chapter 9 presents a preliminary experiment in which the web, a search engine and a downloading program are used to build homogeneous corpora. The purpose of the experiment is to show the measures proposed "in use".

Chapter 10 is a summary of the contributions of the thesis and the directions for future research.

Chapter 2

Language corpora

Language corpora are collections of texts, where a text is a unit of information in natural language; it can be a complete logical unit (e.g., an article) or a part of one (e.g., a chapter or some paragraphs of a book). In this thesis texts do not include other media such as images and sounds, and "corpora" will be used to refer to "language corpora".

Corpora are used in language studies as samples of data. The first attempts to exploit linguistic information from corpora and to develop corpus-based methodologies date back to the '40s. At that time the term "corpus linguistics" was created to identify all the corpus-based methodologies used in many areas of language study. In the '50s, Chomsky cast doubt upon corpora as sources of linguistic evidence for two main reasons. First, corpora represent language performance and not competence, which is what linguists "should" model in their studies, while performance is just a "poor mirror" of competence, full of noise and mistakes. Secondly, corpora have finite size, while language is infinite. Chomsky's conclusion is that corpora broadly fail in representing language. Beyond Chomsky's criticisms, there were practical problems with corpus linguistics, which requires data-processing abilities that were not available at that time. Since corpora were in paper form, any sort of analysis had to be manually performed. As a result of these criticisms, corpus linguistics was largely, but not completely, abandoned.¹ During that period of recession, computers developed and began to be seen as the indispensable tools for corpus linguistic studies. In the '80s the term 'corpus' became a synonym of 'machine-readable corpus' thanks to the availability of an ever-growing amount of texts in machine-readable form. In parallel, there was constant progress in the availability of storage space and better techniques to analyse corpora. The tight link between computers, corpora and linguistics comes from the ability of computers to carry out various processes that humans would find time-consuming, error-prone, and extremely expensive (such as searching and counting the occurrences of words, sequences of words and parts of speech, extracting a word and its surrounding context, or retrieving documents that contain it).

¹ E.g., in 1961 Francis and Kucera began to work on the 'Brown' corpus (Francis and Kucera, 1979).

2.1 The use of corpora in language research

The results achieved in terms of corpus availability and corpus-based methodologies since the '80s, and especially in the past 10 years, have made corpora the basic resources for many areas of language research. Up to now, the main areas in which corpora have been used are:

• Lexicography: lexicographers require massive amounts of data reflecting a wide range of language phenomena in order to collect all the different kinds of information needed to produce a dictionary (Atkins and Zampolli, 1993).

• Linguistics: corpora provide data for describing, explaining and theorising linguistic phenomena. They can be used in two main ways: for acquisition of information and for testing of analyses (Bod, Hay, and Jannedy, 2003).

• Information Retrieval (IR): the goal of IR systems is to retrieve, among all the documents that form the corpus, only those which are relevant to a user's query. Search engines are examples of IR systems (Baeza-Yates and Ribeiro-Neto, 1999).

• Natural Language Processing (NLP): corpora are used for training statistical models of natural language and speech (Machine Learning techniques) in order to acquire information and build resources, for example ontologies and grammars. They are also used for evaluating components of NLP systems and as input for real-life applications. Some of the main NLP sub-areas that use corpora are: Information Extraction, Text Summarization, Generation, Machine Translation, Speech Recognition and Spelling/Grammar Correction (Manning and Schutze, 1999).

• Psycholinguistics and Sociolinguistics: corpora are mostly used for providing word frequency information and data for experiments (Stubbs, 1996).

• Language Teaching: learner corpora are a powerful tool in second language learning research (Second Language Acquisition or SLA). Corpora for SLA have to be of considerable size and well balanced. They involve problems that are not usually encountered in the compilation of corpora of native speakers, such as the analysis of errors that can feed into remedial teaching/learning strategies (Granger, Hung, and Petch-Tyson, 2003).

An overview of corpus-based methodologies used in language studies is beyond the scope of this thesis. Nevertheless, in corpus linguistics it is possible to identify two broad types of analysis, qualitative and quantitative, which contribute to corpus studies in opposite ways. The aim of qualitative analysis is to produce a description of the linguistic phenomena contained in the corpus. Common and rare phenomena are studied with the same precision, because the interest in the frequency of the phenomena is minimal. Qualitative analysis may produce fine-grained results, but usually without frequency information.

By contrast, quantitative analysis is based on the counting of linguistic phenomena and on the production of statistical models. Its results can be generalised to larger populations and directly compared. Quantitative analysis allows us to discover which phenomena are reflections of a particular language variety, and which are simply due to chance. The main disadvantage of quantitative analysis is that its results are less rich and precise than those obtained from a qualitative process.
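As a hedged illustration of what "simply due to chance" means in practice, a contingency-table test can indicate whether a difference in feature counts between two corpora is larger than random variation would explain. The counts below are invented, and the chi-square test is used purely as a familiar example; the statistics actually adopted in this thesis are introduced in chapter 5.

    from scipy.stats import chi2_contingency

    # Invented counts: occurrences of one word versus all other words
    # in two corpora of 100,000 words each.
    table = [[120, 99880],   # corpus A
             [310, 99690]]   # corpus B

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
    # A very small p suggests the frequency difference reflects a real
    # difference between the corpora rather than chance.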

The work presented in this thesis is an example of quantitative analysis.

2.2 Corpus representativeness

From what has been said about corpus linguistics, and in particular about Chomsky's criticism and quantitative analysis, corpus representativeness emerges as a crucial issue.

Representativeness is the guarantee required for extending the results from a study performed on a particular sample to the entire population, without making mistakes that can invalidate any generalization. It follows that, for a sample corpus to be representative of a population, a rigorous definition of the entire population from which it is taken is required. Unfortunately, these kinds of definitions are not always possible and, when they are, they are quite expensive to apply.

So corpus representativeness and sampling methods are still far from being solved problems. However, several studies have focussed on aspects related to representativeness:

• more accurate design process: choosing which documents will be part of a corpus and which will not is a fundamental aspect of corpus development. This kind of attention has been clearly shown during the development of corpora such as the Brown Corpus (Francis and Kucera, 1979), the Lancaster-Oslo-Bergen Corpus (Johansson, Leech, and Goodluck, 1978) and the British National Corpus (Aston and Burnard, 1998);

• increase of corpus size: the ever-growing availability of machine-readable documents and of technologies which allow the storage and analysis of large quantities of data has made corpus collection and analysis more affordable.² The size of 'very large corpora' is a moving issue. The order of magnitude in number of words has changed almost every ten years: in the '70s the Brown and LOB corpora with 1 million words each, in the '80s the Cobuild corpus with 10 million words, in the '90s the BNC with 100 million words, and in the 2000s the LDC's Gigaword corpora with 1,000 million words.

² In the past corpus collection was a long and expensive task, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis.

2.3 Heterogeneous, balanced and homogeneous corpora

According to their representativeness it is possible to roughly divide corpora into three main classes: heterogeneous, balanced and homogeneous.

Heterogeneous corpora have been created with little attention to the range of language varieties they represent. They are big in size (hundreds of millions of words), on the assumption that size can compensate for the lack of care in the design of the corpus (Biber, 1993; Kilgarriff, 1997). The need for very large corpora is generally motivated by the Zipfian frequency distribution of words in a language (Zipf, 1949): while a small number of very common words makes up a great part of the entire text, a large number of low-frequency words makes up the rest. Therefore, in order to study this second group of words and the aspects they represent, an adequate amount of data is required. Banko and Brill (2001) present a study of the positive effects of data size on machine learning for an NLP task.
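Zipf's observation can be checked directly on any machine-readable text: when word types are ranked by frequency, frequency falls roughly as the inverse of rank, so rank times frequency stays roughly constant while most of the vocabulary sits in a long tail of rare words. A minimal sketch (the file name is a placeholder):

    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:  # placeholder file
        words = f.read().lower().split()

    freqs = sorted(Counter(words).values(), reverse=True)
    # Under Zipf's law the rank * frequency products below should stay
    # in the same ballpark across several orders of magnitude of rank.
    for rank in (1, 10, 100, 1000):
        if rank <= len(freqs):
            print(rank, freqs[rank - 1], rank * freqs[rank - 1])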

Balanced corpora are made up with the explicit purpose of being a representative sample of a "declared population" (i.e., an entire language). These corpora contain texts that represent different language varieties (genres, domains, registers, ...); the size of each language variety slice depends on the relevance of that particular variety within the represented language (Aston and Burnard, 1998). The Brown Corpus (about a million words) represents American English, while the Lancaster-Oslo-Bergen Corpus (about a million words) and the British National Corpus (BNC, about 100 million words) represent British English. Biber (1993) argues that, to be able to generalise their results to the whole language without making any "bias errors", lexicographic and linguistic studies should be performed using only balanced corpora:

. . . analyses must be based on a diversified corpus representing a wide range of registers in order to be appropriately generalised to the language as a whole, as in a dictionary or grammar of English, or a general purpose-tagging program for English. In fact, global generalizations are often not accurate at all, because there is no adequate overall linguistic characterization of the entire language; rather, there are marked linguistic differences across registers. Thus a complete description of the language often entails a composite analysis of features as they function in various registers. Such analyses must be based on corpora representing the range of registers (Biber, 1993).

Homogeneous corpora are made of texts that represent only one language variety (Kittredge and Lehrberger, 1982; Kilgarriff and Salkie, 1996). An example of a corpus representing only one language variety is the Patient Information Leaflets corpus (Scott et al., 2001), a collection of the instructions usually enclosed in drug boxes. Representing a single language variety, homogeneous corpora are characterised by a reduced variety of linguistic features. A proper definition of homogeneity in terms of language variety is presented in section 4.4.

Balanced and homogeneous corpora are the types of corpora that are usually used in language studies to achieve more reliable results. Sections 2.6 and 2.7 will show that they are also the ones that require a more accurate design process.

2.4 Corpus profiling

Almost every area in language studies that uses corpora does so in a different way, for different purposes, and for extracting different kinds of linguistic information. The types of linguistic information that are usually extracted from corpora are lexical, semantic and syntactic information. So people who use corpus-based methods mainly consider corpora as repositories of information to be exploited; they use corpora and do not consider them as objects in their own right, but just as entities from which they can derive some kind of information. The consequence is that not much effort has been put into determining the characteristics of documents and corpora, i.e. into producing corpus profiles.

People do not know how to produce corpus profiles for two main reasons. Firstly, because corpora are not specifically produced for human use: people do not read them from end to end and, consequently, they do not have any particular intuition about corpus characteristics. Secondly, because, containing different types of linguistic information, corpora are multi-dimensional objects, so they should be analysed according to different dimensions to produce reliable and useful descriptions.

The necessity of corpus profiling arises both from theoretical and practical points of view. Theoretically, each discipline should be able to describe its objects and produce a taxonomy, making the relations among sub-species explicit. A discipline should also develop measures to quantify their properties and allow the direct comparison of its objects. Since corpora are multi-dimensional objects, people cannot expect that a unique taxonomy can represent all the different relations among them, or that a single measure can quantify their different aspects.

Practically, as sections 2.5, 2.6 and 2.7 will show, corpus profiling can support corpus design, resource development and the portability of language technology applications across different domains. Without an accurate description of the information contained in a text, deciding if a text can be inserted into a corpus designed for a particular purpose is arbitrary. Moreover, a better description of corpora would also help researchers better estimate the amount of effort (in terms of time and relative costs) required to port an NLP application, developed for a particular language variety, represented by a given corpus, to a new one.

This thesis will produce software for corpus profiling and prove its usefulness for comparing corpora, and in particular for measuring corpus homogeneity and similarity.

2.5 Corpus design process and external/internal criteria

Nowadays, creating big corpora is no longer difficult, owing to the availability of an ever-growing amount of texts in machine-readable form, ever-increasing storage capability, and faster techniques for analysis. The Web, for example, provides a huge amount of text. However, as the experiment in chapter 9 shows, corpora produced from the Web can be quite heterogeneous, while language studies require homogeneous or balanced corpora to achieve reliable results. Producing homogeneous and balanced corpora is a more difficult task than producing heterogeneous ones, because a deeper understanding of the relations between documents and language varieties is required.

To design a homogeneous or balanced corpus, i.e. to select documents which belong to a particular language variety, one might use external or internal criteria. External criteria are essentially non-linguistic, and therefore not present in the document itself. They depend on external features (e.g., origin of the document, genre, and topic) and are standardly assigned by people. By contrast, internal criteria are based on linguistic features, which are more or less directly present inside a text (e.g., words are directly present in a text, while POS tags can be exploited only after further analysis) and so they are easily quantifiable in automatic ways. External and internal criteria might also be used to describe corpora. Section 3.2 presents the document features related to internal and external criteria.

Corpus descriptions and corpus design techniques are almost always based on external criteria (e.g., the 'Wall Street Journal corpus'). To produce the PILs corpus, a set of documents produced by some pharmaceutical companies was collected. The main problem with external criteria is that the features they use are not always available and, when they are, not always reliable (e.g., you cannot always use the title of a text to identify its topic). Moreover, external features are not quantifiable, so they cannot be used in automatic processing. A corpus based entirely on external criteria would also be prone to misrepresent the language varieties it claims to be a sample of, because its documents are only chosen on the basis of contextual factors (external features) and not by textual-linguistic ones, which are equally important. So corpora produced using solely external features can contain wide variations in internal features. The choice of producing corpora in which documents are selected only on the basis of external criteria is motivated when the users are human beings, who can cope without any problem with differences in linguistic features (e.g., ambiguities, changes of style or register). But when the user is an automatic system, as in NLP, the performance of any task can be degraded by the presence of different linguistic features.

The TypTex project (Illouz et al., 2000) presents a methodology and a set of tools for text profiling, with the aim of "calibrating" different parts of a corpus in terms of linguistic features based on the internal properties of each text. It is a supervised approach, similar to Biber's work on genre detection. The authors work on French and use a corpus of 14 million words from the "Le Monde" newspaper. Articles are taken from six sections (art-media-entertainment, economy, education-health-society, foreign affairs, general information-sports-current affairs and politics), and only articles of between 1,000 and 2,000 words have been used, to avoid comparing articles whose sizes are too dissimilar. The first step of the method is to perform queries to extract a subset of texts relevant to a particular domain. The subcorpus obtained is then tagged using a morpho-syntactic tagger and a parser; then a typological marking is used to replace the information generated from the morpho-syntactic tags by higher-level categories, according to which features they want to study (stems, part-of-speech tags, other morpho-syntactic information).³ They use an initial set of 200 features, reduced to 40 after a feature selection process. Then, for each text, a list containing the feature frequencies is generated. The resulting matrix, containing all the frequency lists, is then analysed by a statistical software program using multivariate analysis. The program produces descriptive statistics (feature correlation, over- and under-use of features in a given section in relation to its distribution in all six sections and to the length of the whole corpus, standard deviation) and text classification (Folch et al., 2000).

³ They experienced the fact that, when the morpho-syntactic tagging is too fine-grained, it can lead to a scattering of occurrences, which makes contrasts imperceptible. So the possibility of summing features to create "super features" is part of the methodology.
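The data-shaping step of such a pipeline can be pictured as building a texts-by-features matrix of relative frequencies, which is then handed to multivariate analysis. The sketch below shows only that step, on invented POS-tag sequences rather than the project's actual French data and tagger output:

    from collections import Counter

    # Invented tagger output: each text reduced to a sequence of POS tags.
    texts = {
        "art_01":  ["DET", "NOUN", "VERB", "DET", "NOUN"],
        "econ_01": ["NOUN", "NOUN", "PREP", "NOUN", "VERB"],
    }

    features = sorted({tag for tags in texts.values() for tag in tags})

    # One row of relative feature frequencies per text.
    for name, tags in texts.items():
        counts = Counter(tags)
        row = [round(counts[f] / len(tags), 2) for f in features]
        print(name, features, row)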

The main reason internal features were not used in the past is that the technology at the base of automatic text analysis and the availability of corpora are quite recent. Internal features are now considered useful features for corpus design and their use is recommended. Atkins, Clear, and Ostler (1992) say: "it is impossible to balance a corpus on the basis of extra-linguistic features alone", but also "a corpus selected entirely on internal criteria would yield no information about the relation between language and its context of situation". Researchers on the EAGLES project say that we should see progress in corpus compilation and text typology as a cyclical process: "The internal linguistic criteria of the text [are] analysed subsequently to the initial selection based on external criteria. The linguistic criteria are subsequently upheld as particular to the genre [... Thus] classification begins with external classification and subsequently focuses on linguistic criteria. If the linguistic criteria are then related back to the external classification and the categories adjusted accordingly, a sort of cyclical process ensues until a level of stability is established". They continue: "this process is one of frequent cross-checking between internal and external criteria so that each established a framework of relevance for the other". But they also point out that "the classification of texts based purely on internal criteria does not give prominence to the sociological environment of the text, thus obscuring the relationship between the linguistic and non linguistic criteria" (EAGLES, 1996).

The prospect of using both internal and external features for corpus design leads to the question of the relation between the two types of features. There are several areas in NLP which indirectly exploit this relationship. These areas, whose aim is automatic document classification, group documents according to external features on the basis of internal ones. Chapter 3 presents a brief overview of these areas.

2.6 Automatically built resources and corpus homogeneity

Sophisticated NLP applications, such as IE and NLG systems, are heavily language- and domain-dependent. This means they require linguistic and domain knowledge to carry out their tasks. Producing resources containing these kinds of knowledge in a machine-readable form is a time-consuming process if developed by hand (this is usually called the "resource bottleneck"). So, lately, many corpus-based methods have been developed to extract from a corpus, in a semi- or fully automatic way, the linguistic and domain knowledge required by a system.

But linguistic and domain resources for a particular task cannot be built from any corpus. Useful resources can be built only from corpora that contain the same type of information required by the system, and in a (statistically) significant quantity.

The phase in which an NLP application analyses a corpus to build the resources it needs is called "training", while the phase related to laboratory experiments and "real" use is called "testing". A popular paradigm is that a corpus used in NLP is divided into two parts: a training part, which is used during the training phase to build the resource, and a testing part, on which results are produced. Such applications are corpus-dependent: they can successfully handle the types of information which are present in the corpus and used for the resource development, but they need to be trained again, on different corpora, if other types of information need to be used.
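A minimal sketch of this train/test paradigm, with an arbitrary illustrative 90/10 split; nothing here is specific to any particular NLP application:

    import random

    def train_test_split(documents, test_fraction=0.1, seed=0):
        # Shuffle the corpus and hold out a fraction for testing.
        docs = list(documents)
        random.Random(seed).shuffle(docs)
        n_test = max(1, int(len(docs) * test_fraction))
        return docs[n_test:], docs[:n_test]  # training part, testing part

    corpus = [f"document {i}" for i in range(100)]
    train, test = train_test_split(corpus)
    print(len(train), "training documents,", len(test), "test documents")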

This thesis explores the hypothesis that systems that use a corpus as input data (i.e. IE, IR and Machine Learning systems) can reach good results when a homogeneous corpus is used, i.e. a corpus that represents just a single language variety. That is because the types of information extracted from the training corpus have to be as similar as possible to the ones contained in the test corpus, and this is only assured when a homogeneous corpus has been used. In contrast, if a corpus is made up of different kinds of text, which express different language varieties (balanced and, above all, heterogeneous corpora), the performance of the same system can be reduced.

Several authors have pointed out that the performance of NLP systems depends directly on corpus homogeneity:

• Biber (1993) has shown that different distributions of linguistic features in different language varieties (spoken vs. written) influence the performance of tools which are based on accurate estimates of the relative likelihood of syntactic categories, such as POS taggers and syntactic parsers.

• Illouz (2000) studied the performance variations of a POS tagger over 4 different subsets of the Brown corpus (Press, Miscellaneous, Fiction and Non-fiction). He showed that the subsets which intuitively are more homogeneous (Press and Fiction) also have a lower performance variance.

• Sekine (1997) has shown that the performance of a parser is dependent on the domain and that the use of a domain-specific grammar can produce better results than a domain-independent one. He also points out that there is no need for a very large corpus to achieve good-quality parsing results.

• Gildea (2001) also worked with a parser and set up separate experiments using WSJ data, Brown data and a combination of the two as training sets. All the performance values obtained with the Brown corpus are lower than the ones obtained using the WSJ, confirming the general intuition that the WSJ is more homogeneous than the Brown corpus. Results also showed that a small matched training set (WSJ) is more useful than a larger unmatched set (WSJ+Brown).

• Roland and Jurafsky (1998) have pointed out that verb subcategorization frequencies depend on corpora. They have shown as well that these differences are caused by variation in word sense, genre and discourse type.

But very little work has been done on measuring corpus homogeneity. Again, the main reasons are the multi-dimensionality of corpora, since a corpus can be homogeneous according to a particular aspect but not for others, and the consequent lack of unequivocal human judgements to use as gold standard judgements for the evaluation.

2.7 Portability of NLP applications and corpus similarity

Besides contributing to the corpus design process and the development of resources for NLP applications, corpus profiles help the portability of NLP applications as well. Porting an NLP application from the domain it was built for to another one can be expensive and time-consuming, even if the two domains and the associated corpora are homogeneous in themselves but represent different language varieties. The assessment of how difficult it might be to port an application from one domain to another is highly linked to the ability to compare corpora and determine their similarity. Moving an application from the domain it was developed with to a similar one should be quicker than moving it to a different domain.

Again, corpus multidimensionality and the lack of gold standard judgements complicate the study of corpus similarity. Chapter 4, together with an overview of work on corpus homogeneity, presents an overview of corpus similarity measures and studies on corpus comparison.

2.8 Summary

This chapter has briefly introduced what a corpus is and mentioned the main applications of corpora in language studies.

Then, the three main types of language corpora have been presented. Balanced and homogeneous corpora are the corpus types that language studies, and especially supervised NLP applications, use to achieve more reliable results. They are the corpora which require a more accurate design process.

The spread of corpus-based methods contrasts with the lack of methods to describe corpora (corpus profiling) and quantify their properties, allowing their direct comparison. The chapter has shown that the standard type of corpus description based on external features is not always adequate when the corpora are used by NLP applications.

Corpus homogeneity and similarity represent the basic step in any attempt to compare corpora; therefore they may make an important contribution to the process of corpus profiling. Other tasks which could benefit most from the production of measures for corpus homogeneity and similarity are the development of NLP resources and the assessment of how difficult it might be to port a language technology application from one domain with one corpus to another.

Chapter 3

Document similarity and the classification task

There is an important assumption in the thesis: the notions of corpus homogeneity and similarity are based on the notion of document similarity. This dependency suggests that there might be an interesting relation between measures developed for document similarity and measures for studying corpus homogeneity and similarity. The aim of the thesis becomes checking whether it is possible to adapt measures created for quantifying document similarity to corpora in order to study their homogeneity and similarity.

The importance of document similarity in this thesis is the reason why this first part of the literature review describes the wide range of methods already developed for measuring similarity between documents. The aim of the chapter is to explore whether there is a consensus view about how to measure document similarity. The general algorithm extracted will then be adapted to serve as a basis for corpus homogeneity and similarity measures.

The chapter starts with a general description of the document classification task (section 3.1) and of document external and internal features (section 3.2). Then it presents the main NLP areas based on the document classification task, which are:

• Language identification: produces a classification according to the language or dialect a document is written in (section 3.3);

• Automatic author attribution or stylometry: classifies according to document author (section 3.4);

• Text classification: groups documents either according to content/topic or genre or sentiment (section 3.5);

• Information Retrieval (IR): identifies which documents are relevant for a query (section 3.6).

These areas differ in the type of external features used for the classification, in the internal features chosen to represent documents, and in the learning techniques employed to perform the classification task. Each area is presented in a separate section: language identification is presented in section 3.3, stylometry in section 3.4, text classification in section 3.5, and IR in section 3.6. The goal of these four sections is to identify the relation between internal and external features for the purpose of document classification.

Chapter 5 proposes a way to adapt this general algorithm to corpora for the purpose of measuring corpus homogeneity and similarity.

3.1 The classification task

An important task in NLP based on document similarity is automatic document classification (or categorization). This is the task of assigning documents from a collection to classes on the basis of similarity in document characteristics.

The basic NLP classification method¹ is supervised. A supervised method is composed of two successive phases. The first, the training phase, requires a training corpus in which documents have been manually classified by assigning to each of them a class label. During this phase the information in the training corpus is encoded in a model. The second is the actual classification phase: the raw documents from a test corpus are automatically labelled by the system, using the same class labels present in the training corpus.

¹ Document classification is just one type of classification. The objects that can be classified in NLP are several. This chapter focuses on documents, but words and clauses are also classified in NLP: tagging and word sense disambiguation are examples of word classification tasks, and prepositional phrase attachment is an example of clause classification.

There are two types of information, generically called features, contained in the training corpus that have to be encoded in the model, as they are essential for the classification task. The class labels represent the first type of information: they are called 'external features' because they are not present in the raw documents, but manually associated with each document of the corpus. The second type of information, 'internal features', is present in the raw documents and represents the linguistic information in the documents.

Documents in the test corpus contain only internal features. The goal of the classification task is exactly that of assigning the most relevant external feature (class label), among the ones present in the training corpus, to each document of the test corpus, on the basis of the internal features contained in the document itself.

The model, which encodes the information contained in the training corpus, associates external features with the internal features (and their weights) that occur in each document of the corpus. In the vector space model, for example, each document is represented in the form (x, c), where x is a vector of the document's internal features, and c is the class label. The vector space is one widely used model in many applications requiring document classification, because of its computational simplicity (counting items) and the appealing idea of likening document similarity to spatial distance.
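To make "likening document similarity to spatial distance" concrete, here is a minimal sketch comparing two word-count vectors with cosine similarity. Cosine is used here only as a familiar illustration; the inter-document measures actually examined later in this thesis are chi-square, log-likelihood and relative entropy (chapter 5):

    import math
    from collections import Counter

    def cosine(doc_a, doc_b):
        # Cosine similarity between two documents' word-count vectors.
        a = Counter(doc_a.lower().split())
        b = Counter(doc_b.lower().split())
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(cosine("the market fell", "the market rallied"))  # shared words: higher
    print(cosine("the market fell", "see you later"))       # no overlap: 0.0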

Classification accuracy is related both to the internal features chosen for the representation of documents and to the size of the training corpus used. As far as internal features are concerned, not all types of internal features are equally easy to extract from a corpus. It mainly depends on the characteristics of the language and the NLP technologies available for each particular language. As far as corpus dimensions are concerned, generally the larger the training set, the better a classifier performs. But producing training sets can be an expensive task; moreover, there are techniques which are computationally expensive and, at least for these, a training set of reduced size would be more appropriate. The relation between classification accuracy and training size is an important factor for the study of the relation between internal and external features and the evaluation of learning techniques.

3.2 Document features

As mentioned earlier, document characteristics can be broadly divided into two groups: external and internal features. External features are used as criteria to classify documents, while internal features are the elements used to represent documents and compute their similarity.

External features are essentially non-linguistic; they are assigned on the basis of social, situational or functional evidence such as, for example, subject-matter, purpose and rhetorical structure, but also the relation with the target (expected) audience, the external context, or the background of the speaker or writer. Thus they are not present in the document itself, but are assigned by people, and usually stored in the header of each document as metadata. Some of the external features frequently used are:

• mode: mode of delivery of the original content of the text (i.e., written vs. spoken);

• medium: medium of the original content of the text (i.e., book, periodical, miscellaneous published material, unpublished written material and written material to be spoken);

• genre: division based on historical, socio-cultural, philosophical and occupational details;

• topic: text subject matter;

• date: date of publication;

• language: especially important for dialects;

• size: number of words and extent (start and end points) of a document;

• author information, such as name, age, gender, region of origin, and domicile;

• target audience information, such as age, gender and level (a subjective measure of the text's technicality or difficulty).

Attempts to identify a complete list of external features have been made (Atkins, Clear, and Ostler, 1992), but the list of external features is almost endless. In fact, as some researchers have pointed out:

. . . there is no requirement of linguistic relevance to help us distinguish the external features which are likely to be of value for the user, and those which are only of archival interest. Almost any conceivable feature may come into relevance on a future occasion, but that is not sufficient reason for investing in the cost of ascertaining and recording such a feature in respect of each text in the corpus. (EAGLES, 1996) [page number]

By contrast, internal features are linguistic features, more or less directly present inside documents. They can be divided into three main classes:

• surface features, which can be recovered directly from the character sequence. These can be subdivided into two classes depending on which element is considered as a unit:

- character features, when the unit is a single character. Examples of surface features at character level are: punctuation marks, upper and lower case, white space, and other non-alphanumerical characters. Character features can also be analysed in sequence, forming what are called "n-grams";

- lexical features, when the unit is a word, i.e. a sequence of characters bounded by delimiters. The tool which converts a stream of characters into a stream of words is called a 'tokeniser'. For European languages the tokenization task is relatively simple because words are delimited by special characters that are easily recognizable, like white space or punctuation marks 2. Features at the lexical level are the words that occur in the text. They can be divided into two further classes:

* function words, which are all the words that belong to closed classes like articles, prepositions, pronouns. Function words occur in every document;

* content words, which are nouns, verbs, adjectives and adverbs (all open classes); they can be single words but also segments (multiword terms or collocations) and quasi-segments (segments that are slightly altered, e.g. by the introduction of an adjective);

2 In some non-European languages like Chinese, for example, words are not separated by white space, so the process of identifying words is a much more complex task. As a result the identification of lexical features cannot be considered superficial in every language.

• derivative features, which are linear combinations of surface features such as the type/token ratio, sentence length, and mean and variance of word length. These are not directly present in the text, but can be easily computed without the need of further text analysis (a minimal extraction sketch follows this list). An important class of derivative features is represented by the vocabulary richness measures (see section 3.4);

• structural features, which are not directly present in the document and need further text analysis to be made explicit. Structural features are labels automatically assigned to words which make explicit the linguistic information contained in the raw text. Before the annotation process, the corpus is said to be 'unannotated' and its documents are in plain form. When enriched with labels, corpora are called 'annotated' and labels are usually expressed in a markup language 3. Various types of linguistic information can be automatically identified, and each of them produces a particular type of structural feature: morpho-syntactic information produces lemmas and part of speech tags (the tools are a lemmatiser and a part-of-speech tagger); semantic information produces word senses (word sense disambiguation (WSD)). The previous examples of structural information are associated with individual words, but structural information can also tag sequences of words, as in parse trees. Kingsbury and Palmer (2002) are creating a corpus of text annotated with information about basic semantic propositions using sense tags and argument labels based on lexicon entries. This annotation represents an example of complex semantic structural features.

3 A markup language is a predefined way to insert extra textual syntax that can be used to describe formatting actions, structural information and text semantics. The marks, usually called tags, are explicit (pieces of) information that can be easily spotted and processed by a program. Examples of the most common markup languages are SGML, XML and HTML.
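To make the distinction between surface and derivative features concrete, the following fragment (an illustrative Python sketch, not part of any system discussed in this thesis) extracts word tokens and character trigrams as surface features and derives the type/token ratio and mean word length from them:

    import re

    def extract_features(text):
        # Lexical (surface) features: word tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Character (surface) features: character trigrams.
        char_trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
        # Derivative features: simple combinations of the surface counts.
        n = len(tokens)
        return {
            "tokens": tokens,
            "char_trigrams": char_trigrams,
            "type_token_ratio": len(set(tokens)) / n if n else 0.0,
            "mean_word_length": sum(len(t) for t in tokens) / n if n else 0.0,
        }

    print(extract_features("The cat sat on the mat.")["type_token_ratio"])  # 5 types / 6 tokens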

3.3 Language Identification

Language identification recognises the language that a document of unknown origin is written in. Since language-specific pre-processing should be avoided, language attribution studies assume that documents are sequences of characters or words, and use surface features to produce language models. Not requiring any language understanding, language attribution is not as difficult as other classification tasks and can reach very high performance even on documents of small size (i.e., a few sentences).

Probably the most problematic aspect in language identification is that not all languages use the same writing system. The accuracy and complexity of a language identification system depend on the set of languages it distinguishes. Many systems only handle ASCII text; others handle different subsets of Unicode (e.g., the Latin subset); others work on document images. Writing systems are beyond the scope of this thesis, so they have been mentioned just to underline an important aspect of this area which we will not take further here.

Other factors that determine the performance and complexity of a system, and influence the decision about what type of features and statistical methods the system should use, are the size of the training and test data, and the number of languages the system has to distinguish.

The types of internal features most often used in language attribution have been at character level: unique letter combinations, diacritics, special characters, and character n-grams. Unique letter combinations are short sequences of characters, unique to a particular language. In the literature they have proved to be unreliable (Dunning, 1994). In fact the presence of loanwords in a language can be responsible for mistakes in the identification process when unique letter combinations are used for representing documents. Moreover, since these combinations are quite rare, a lot of data is required for this type of analysis. The analysis based on diacritics and special characters has also produced poor results, for the same reasons. Instead of using just special sets of combinations which are language-specific, all possible strings present in a document can be used for the representation. N-grams are sequences of n characters that can be identified automatically by just skimming through the text. N-grams are superficial and language independent features. Moreover, they occur in great quantity in documents - in fact, in any given document, there are more n-gram tokens than word tokens. This type of analysis, and especially work that employs the frequency of character n-grams, has produced very good results even on very small documents 4 (Cavnar and Trenkle, 1994; Dunning, 1994; Souter et al., 1994).
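As an illustration of how cheaply such features can be obtained, the following Python sketch builds a ranked character n-gram frequency profile of the kind used in the work cited above (the trigram unit and the profile size of 300 are illustrative parameter choices):

    from collections import Counter

    def ngram_profile(text, n=3, top=300):
        # Pad with spaces so that word-boundary n-grams are captured too.
        padded = " %s " % text.lower()
        grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        # Rank n-grams by decreasing frequency and keep the head of the list.
        return [g for g, _ in Counter(grams).most_common(top)]

    print(ngram_profile("the quick brown fox jumps over the lazy dog")[:5])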

Lexical features, such as common and function words, have also been used in language identification studies. If enough text is provided, the lexical approach works as well as the n-gram approach. But the need for bigger training corpora, a tokeniser, and lists of common and function words (one for each of the languages to be distinguished) makes the n-gram approach preferable. For a language like Chinese, the possibility of avoiding the tokenization phase is attractive.

4 Dunning (1994) and Souter et al. (1994) reported results specific to text of one line or less (approximately 50 characters).

Several statistical methods have been used for language identification: Markov models (Dunning, 1994), neural networks (Batchelder, 1992), discriminant analysis (Sibun and Spitz, 1994), and also simple techniques based on feature frequencies (Cavnar and Trenkle, 1994; Sibun and Reynar, 1996).

3.4 Stylometry or computer-assisted authorship attribution

Authorship attribution is a discipline whose origins go back at least as far as the middle of the nineteenth century. The aim of this section is to present an overview not of the entire discipline but only of a particular subarea called stylometry. Stylometry, or 'computer-assisted/non-traditional authorship attribution', is the application of statistical and computational methods to problems of authorship attribution. The aim of stylometric studies is not only the identification of the authors of disputed works but also the identification of anonymous authors and the study of changes in the style of a particular author over time.

The main assumption underlying stylometric studies is that authors have an unconscious as well as a conscious aspect to their style. Every author's style is thought to have certain features that are independent of the author's will, and since these features cannot be consciously manipulated by the author, they are considered to provide the most reliable data for a stylometric study (Holmes, 1994).

But because there is no clear evidence of such features, stylometry - and in general authorship attribution - is far from mature as an area of study. What is true is that some authors have a recognizable style, and for those, stylometry can be applied with useful results.

Another important aspect of stylometry is that many features used as discriminant factors for the analysis are author dependent. Features that can successfully discriminate between authors A and B may not be useful at all for authors C and D. This explains why stylometric studies can justifiably differ according to the type and the number of features analysed, and the techniques used to determine document similarity. Moreover, it explains why the production of a reusable piece of software for author identification has not been the primary goal of stylometric work (McEnery and Oakes, 2000, p. 548).

The most common approach to authorship attribution is based on lexical measures. It is possible to distinguish three main types of lexical features:

• the average length of words (number of characters), sentences (number of words) and paragraphs (number of sentences), or the distribution of word and sentence lengths;

• vocabulary richness;

• vocabulary distribution.

Word and sentence lengths were the features used in the first attempts at authorship attribution, since they can be computed without computers. But their discriminative power is not very strong and they are easy to control consciously (especially the sentence length measure), so they have now been abandoned.

Several methods for studying the richness of the vocabulary have been proposed: some are very simple, like the type-token ratio; others use elements of the frequency spectrum, such as hapax legomena and dis-legomena 5; others, even more complex approaches, make use of probabilistic models. Tweedie and Baayen (1998) have compared 21 measures of vocabulary richness. All the analysed measures have captured "aspects of authorial structure" even if a complete separation has not been obtained. These types of measure are dependent on document size; so, for a correct comparison, it is necessary to equalise the document lengths or to consider the distribution over the entire document.
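The simplest of these measures can be read directly off a frequency list, as the following sketch shows (illustrative only; as noted, all these values are sensitive to document length):

    from collections import Counter

    def vocabulary_richness(tokens):
        freqs = Counter(tokens)
        spectrum = Counter(freqs.values())  # the frequency spectrum
        return {
            "type_token_ratio": len(freqs) / len(tokens),
            "hapax_legomena": spectrum.get(1, 0),  # types occurring once
            "dis_legomena": spectrum.get(2, 0),    # types occurring twice
        }

    print(vocabulary_richness("to be or not to be".split()))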

Another type of feature, function words, has proved extremely useful for authorship attribution. The pioneering work of Mosteller and Wallace (Mosteller and Wallace, 1984) has pointed out the discriminatory potential of function words. Tweedie and Baayen (1998) have shown that the frequencies of the 100 most common function words produce results that generally outperform those obtained with vocabulary richness measures.

5Hapax legomena are the types with frequency 1 in a corpus, while dis-legomena are the types observed twice.

Rudman (1998), surveying authorship attribution studies, has pointed out that a vast number of (lexical) style markers have been identified, but none on its own may be an indicator of a particular style. It is their combined use that constitutes style. For this reason "Authorship studies must not fall into the trap of discarding style markers from their stylistic autoradiogram because they didn't work in some other study. Until the study is done, it is not known which style markers will be the discriminators" (Rudman, 1998, p. 361). Of course, this has raised the problem of finding good ways to select features on the basis of their discriminative power (see section 5.3) and techniques that allow the simultaneous analysis of various features (multivariate analysis).

Apart from lexical features there are two other feature types that have been used in the context of stylometry: features at character level, and structural features such as part of speech tags.

Character-based features are not easily controlled by authors (so are author independent) and can be extracted by a very shallow text analysis. These characteristics make the use of character-based features a plausible approach, at least from a theoretical point of view. Kjell (1994) has used frequencies of letter pairs within documents and worked on the Federalist Papers. Unfortunately, the results obtained with this approach were poor in comparison to applications that used lexical features. Kjell's method has worked on English, a language which does not contain particularly rarely-used characters. For languages other than English, letter-based approaches may produce better results.

As far as structural features are concerned, Baayen, van Halteren, and Tweedie (1996) have been the first to use a syntactically annotated corpus and "syntactic rewrite rules" as structural features. Their results have shown that within a single genre syntactic rules are more robust than any lexical feature. Moreover, the authors have stressed that function words seem to be an imperfect approximation to syntactic features. They have concluded that structural features are less prone to both conscious and intra-textual variations than lexical ones: "as a textual property becomes more abstract and less directly manipulable, it is more likely to be uniformly (randomly) distributed. And if it is more uniformly distributed, it is a more reliable clue for authorship attribution".

As several different types of document features can be used as discriminant factors, several statistical techniques have been used in stylometric studies. McEnery and Oakes (2000) have presented an overview of stylometry and in particular of the main statistical techniques used, including obsolete or infrequent techniques like univariate analysis, cumulative sum charts and generative algorithms, as well as techniques still in use and giving interesting results, such as multivariate analysis, neural networks and Bayesian probability.

Stylometric studies have focused mainly on literary texts: the Federalist Papers, Shakespeare and Marlowe's plays, and religious writings - i.e., documents of notable size and in accurate format. Nowadays, stylometric analysis has also been tested on real-world documents such as Modern Greek weekly newspapers (Stamatatos, Fakotakis, and Kokkinakis, 1999; Stamatatos, Fakotakis, and Kokkinakis, 2001).

3.5 Text classification

Text classification is another area of NLP in which the goal consists of assigning a class label to a document. Originally, labels indicated the topic/domain discussed in the document, but since the late '80s the document genre too has become a criterion for categorization. The derivation of genre classification studies from content classification studies is quite natural since some domains have developed their own specific genres, so that for some domains and genres there is a one-to-one relation. For the sake of clarity, in this thesis the term content classification will be used to refer to text classification studies which aim to categorise according to document topic, and genre classification for studies that categorise according to genre. Since the late '90s a new criterion for categorization has been used: the overall sentiment of the documents. The purpose of this new area of text classification, which is generally known as sentiment classification, is determining whether a review is positive or negative.

3.5.1 Content classification

Typically, the document features used for content classification are content words (lexical features). The number of content words that occur in a corpus is usually very large (tens or hundreds of thousands of content words), but not all content words have the same discriminative power. Moreover, some statistical classification techniques have problems in dealing with large quantities of data. As a result, techniques for weighting and selecting features are commonly used. Sections 5.2 and 5.3 present an overview of the main techniques for feature weighting and selection.

Various statistical techniques 6 have been used for content classification. Learning techniques which look at all the features simultaneously, like maximum entropy modelling, neural networks and k-nearest neighbour classification, have usually been preferred. By contrast, Naive Bayes and logistic regression have typically been used in cases of binary classification, while decision trees have the advantage of ease of interpretation.

If the Federalist Papers are the most popular benchmark used for evaluating author attribution research, in content classification this role is played by the Reuters corpus (Lewis et al., 2004), a collection of articles from the Reuters newswire manually classified according to more than 100 topics. The availability of a benchmark leads to the development of comparative evaluations of various classifiers. Yang and Liu (1999) have reported a controlled evaluation study on five text categorization methods: support vector machines, k-nearest neighbor, linear least squares fit, neural network, and Naive Bayes probabilistic classifiers. Their results have shown that all methods perform similarly when the number of positive training documents is large (over 300 documents per class), while support vector machines, k-nearest neighbor and linear least squares fit outperform neural network and Naive Bayes classifiers when the training set has less than 10 documents per class. Yang (1999) has compared the performance of twelve methods (learning and non-learning methods, binary and n-ary classifiers) using both previously published results and results from a new experiment. His results have confirmed that k-nearest neighbor and linear least squares fit are the best classifiers, while Naive Bayes classifiers often give the worst performance. The others perform equally well.

Content classification is a relatively mature area, as evidenced by the availability of several classification tools (e.g. Rainbow (McCallum, 1996) and Autoclass (Cheeseman et al., 1998)).

6Some non-statistical methods have also been used in text classification, e.g. the Reuters system uses finely-tuned manually crafted rules and gets good accuracy. But this brief overview focuses on the statistical methods.

3.5.2 Genre classification

"Genren , "style" and "register" are the terms most commonly used in the study of social, situational or functional variations in documents (i.e. language variations) such as purpose, rhetorical structure, the relation with the target (expected) au­ dience, the external context, and the background of the speaker or writer. These terms have been used since the late '50s not only by linguists but also in related disciplines like sociology, communication, rhetoric, education and anthropology. In spite of this long tradition there is no general consensus on their definition and, consequently, there is no agreed taxonomy connecting them. Biber (1988), Swales (1990), Eggins and Martin (1997) and Lee (2001) have presented overviews of the positions taken by scholars of various disciplines regarding the nature of genre, style and register, and how they are related to each other.

In this thesis, for the sake of simplicity, the term 'genre' will be used because it is the term usually preferred in text classification research not based on content.

Throughout the thesis it will be used in a general way, to indicate any language variety associated with situational and social context or purpose.

Douglas Biber was an early and probably the best-known scholar to work on corpus-based automatic language variation identification. He has pointed out that there is a close connection between document internal features and genres. The aim of his most famous book (Biber, 1988) was not genre classification but genre detection: his goal was to show that from an analysis of frequencies of particular sets of document-internal features it was possible to detect 'dimensions' along which genres can vary. He used a manually tagged version of the LOB corpus and extracted a set of 67 superficial and structural features. Using classic factor analysis, he revealed five dimensions along which English genres vary.

Much work in genre classification is more or less directly derived from Biber's (1988): many scholars have used predefined sets of features and types of multivariate analysis (discriminant analysis (Karlgren and Douglass, 1994; Stamatatos, Fakotakis, and Kokkinakis, 2000b), principal component analysis (Sigley, 1997)) to perform classification. Multivariate analysis has been used to study several languages other than English (Greek (Tambouratzis et al., 2000), French (Folch et al., 2000), and several non-European languages (Biber, 1993)). As well as multivariate analysis, classifiers used in traditional topic classification have been employed: e.g. neural networks (Kessler, Nunberg, and Schutze, 1997), k-nearest neighbor (Wolters and Kirsten, 1999), decision trees (Karlgren, 1999), Naive Bayes classifiers and support vector machines (Dewdney, VanEss-Dykema, and MacMillan, 2001).

As far as document features are concerned, in genre classification almost all types of features have been used, either alone or in combination with others. Lexical features have been used by Dewdney et al. (2001), who have employed content words, and by Wolters and Kirsten (1999), who have used function words. Kessler et al. (1997) have employed some lexical features in conjunction with character and derivative features, selected on the basis of pre-existing knowledge. The results of these studies are both language and task dependent. An alternative way to use lexical features for genre classification is represented by studies that use sets of common high-frequency words ('the N most frequent words'). These approaches are language and task independent since the set of high frequency words is automatically computed at the beginning of the analysis (Stamatatos, Fakotakis, and Kokkinakis, 2000b).

Besides lexical features, many approaches for genre classification have used structural features. The NLP tool most commonly used is a part-of-speech tagger (Karlgren and Douglass, 1994; Wolters and Kirsten, 1999; Dewdney, VanEss-Dykema, and MacMillan, 2001), since nowadays pre-processing analysis is robust and reliable on unrestricted text. But recently, parsers have been used as well (Karlgren, 1999). There are also lexical methods which require the employment of robust and accurate NLP tools like the methods based on structural features. In fact, when proper names have to be removed from a high frequency word list or homographic forms have to be separated, the use of complex pre-processing tools is required. Usually the NLP tools used for the production of structural features are treated as black boxes, and no tool-specific knowledge is required; the important thing is the tool output, which represents the data for the analysis. Stamatatos et al. (2000a; 2001) have proposed an approach that employs an NLP tool that detects sentence and chunk boundaries and uses as features "measures that represent the way in which the input text has been analysed" by the tool. These features are related to the tool-specific methodology used, but each tool can provide 'analysis-level' features.

Two research efforts in this area expressly compare the performance of genre clas­ sification systems when they use surface or structural features, to see whether one (or some) feature type(s) is/are better than the others in distinguishing among genres.

Kessler et al. (1997) have used a set of 55 features, which includes surface and derivative features, comparing them against the set of structural features employed in Karlgren and Douglass (1994). For their experiment they have used the Brown corpus 7 and a neural network classifier. Their experiment has shown only a small difference between the surface and structural features in favour of the structural ones, either when all features or only a selection are used.

7 They classify according to three sets of external features, which they call "brown" (popular, middle, upper middle and high), "narrative" (narrative/non-narrative) and "genre" (reportage, editorial, science and technology, legal, non-fiction and fiction).

Dewdney et al. (2001) have compared word features 8 to what they have called "presentational" features, a set of 89 features including structural features, but also character features, such as line spacing, tabulation and non-alphanumeric characters. They have experimented with the use of these two types of features with three different classifiers (Naive Bayes, decision trees and support vector machines) to investigate whether a particular classifier works better with a particular type of feature. For the experiment they have employed a corpus which includes text from seven different genres (advertisement, bulletin board, frequently asked questions, message board, radio news, Reuters newswire and television news). The experiment has shown that the use of presentational features with suitable classifiers (decision tree and support vector machine) yields better results, allowing good genre classification without the need for word features. However, the combination of the two types of features always increases classification accuracy.

8 Single words were weighted and reduced to a smaller and more effective set.

Stylometry and genre classification appear to be very similar areas. They have used the same types of features, and achieved almost the same outcome: structural features, when available, produce the best performance. The relation between genres and authors emerges as a particularly complex and interesting one. Some researchers reflect on this relation:

"Differences in genre seem to be more pronounced than differences

in author. The type of work that Biber has been doing is a little

like stylometry, except that he has been using a range of quantitative

features to distinguish genre of text. It seems that in doing so he has been following stronger distinctions than between authoring styles, because where genre changes, differences in authorship often become suppressed, because the shift in style between genres is so pronounced

7They classify according to three set of external features, which they call "brown" (popular, middle, upper middle and high), "narrative" (narrative/non-narrative) and "genre" (reportage, editorial, science and technology, legal, non-fiction and fiction). 8Single words were weighted and reduced to a smaller and more effective.

30 that the subtle differences in authorship become blocked out by the gross differences in genre style." (McEnery and Oakes; 2000) and also

most " ( ...) register variation masks author-specific variation on the important principal components. Nevertheless; after register variation has been partialed out; author-specific variation emerges with surpris­ ing clarity." (Baayen; van Halteren; and Tweedie: 1996)

Several papers in this area have pointed out that one of the main applications of automatic genre identification studies is in IR, to increase performance when the set of retrieved documents is too large and does not allow for easy manipulation or browsing.

3.5.3 Sentiment classification

A further type of text classification is sentiment classification, which categorises documents according to the overall opinion toward the subject matter (or sentiment) they express.

Sentiment classification is typically employed for categorizing reviews, for example of books or movies. By contrast with content and genre classification, sentiment classification is binary.

Since sentiments can be expressed in indirect ways as in: "How could anyone sit through this movie?", the task seems to require more "understanding" than content classification.

The complexity of the task means that much research has been at least partially knowledge based, involving the use of models derived from cognitive linguistics, or pre-selected sets of discriminant words. The general difficulty of classifying according to sentiments is increased by the fact that people alone may not always identify the best set of discriminant words.

In the literature there are two fully statistical approaches, which are briefly described in the remainder of the section.

Turney (2002) has classified a corpus of 410 reviews from Epinions, a web site which collects reviews written by non-professional writers about different subjects. He has collected reviews about cars, banks, movies and travel destinations for his experiment. Turney's approach involves three steps. In the first one a POS tagger has been used to identify in each review the phrases containing adjectives or adverbs. The second step estimates the "semantic orientation" of each extracted phrase by calculating the mutual information between it and the words "excellent" (positive association) and "poor" (negative association). The mutual information is computed using frequencies taken from the web via the AltaVista search engine. Finally, the third step assigns each review to the recommended or not recommended class on the basis of the average semantic orientation of all the phrases extracted from that particular review. The results have shown an accuracy of about 66% on movie reviews and an accuracy between 80% and 84% for banks and cars. Travel destination reviews have scored an intermediate accuracy. Movie reviews seem more difficult to classify since the whole semantic orientation is not necessarily the sum of the parts. The main limitation of this work is the time required by the search engine for computing the mutual information scores.
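The scoring logic of the second and third steps can be sketched as follows. This is a reconstruction under stated assumptions: hits and hits_near are hypothetical counting functions standing in for the AltaVista hit counts used in the original work, and the small smoothing constant Turney added to avoid zero counts is omitted.

    import math

    def semantic_orientation(phrase, hits, hits_near):
        # hits(q): number of documents containing q;
        # hits_near(a, b): number of documents where a occurs near b.
        # (The original work adds a small constant to the counts to
        # avoid division by zero; that smoothing is omitted here.)
        return math.log2(
            (hits_near(phrase, "excellent") * hits("poor"))
            / (hits_near(phrase, "poor") * hits("excellent"))
        )

    def classify_review(phrases, hits, hits_near):
        # Average the orientation of all phrases extracted from the review.
        avg = sum(semantic_orientation(p, hits, hits_near)
                  for p in phrases) / len(phrases)
        return "recommended" if avg > 0 else "not recommended"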

Pang et al. (2002) have used a corpus of movie reviews made of 752 negative and 1301 positive reviews written by 144 different writers, and compared the three machine learning methods most often employed for content classification: Naive Bayes, maximum entropy and support vector classifiers. In the experiments they have employed various types of document features and some of their combinations: all words, word bigrams, POS tags and word position. They have also created and used an unconventional feature type, with the purpose of modelling the effect of negation on context, by adding the tag "NOT_" to every word between a negation (e.g. "not", "isn't", "didn't", etc.) and the first punctuation mark which follows the negation. In this study they have experimented with either the feature frequencies or the simple feature presence. The results have shown that the information carried by the presence of all words is the most effective for sentiment classification and achieves an accuracy of over 80%. This finding is in direct opposition to the observation of McCallum and Nigam (1998) with respect to Naive Bayes for content classification. Results have also shown that a support vector classifier outperforms maximum entropy and Naive Bayes, although the differences are small.
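A minimal sketch of the negation-marking transformation (the list of negation words used here is illustrative, not the one used by Pang et al.):

    import re

    NEGATIONS = {"not", "isn't", "didn't", "no", "never"}  # illustrative list

    def mark_negation(tokens):
        out, negating = [], False
        for tok in tokens:
            if re.fullmatch(r"[.,;:!?]", tok):
                negating = False       # punctuation ends the negated span
                out.append(tok)
            elif tok.lower() in NEGATIONS:
                negating = True
                out.append(tok)
            else:
                out.append("NOT_" + tok if negating else tok)
        return out

    print(mark_negation("i did not like this movie at all .".split()))
    # ['i', 'did', 'not', 'NOT_like', 'NOT_this', 'NOT_movie',
    #  'NOT_at', 'NOT_all', '.']

In this way "like" and "NOT_like" become distinct features for the classifier.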

3.6 Information Retrieval

Although IR is an area which is not always considered part of NLP, it also performs a text classification task, so it can be included in this overview. IR is concerned with retrieving documents from a document collection. It is a classification task for which texts are classified as either relevant or irrelevant to a topic, as represented by a query.

The main difference between IR and text classification lies in the degree of language analysis that is performed. In IR the classes are made for human use, while in text classification the classes are expected to be processed by a system; thus in IR the language analysis performed is generally quite shallow compared to that of a text classification tool, because humans can cope with differences in language varieties.

In IR the meaning of a document is expressed through a set of keywords, or terms, directly present in documents. Of course this is an oversimplification, so it is not surprising that users judge many of the documents retrieved in response to a particular query to be irrelevant. Nevertheless IR has achieved useful results, and many applications of IR are now heavily used, e.g., Web search engines.

A keyword is a word, or a group of related words, whose semantics helps in representing the subject matter of the document. The keywords of a document form a vector which represents the document in further steps of the process 9.

There are various ways in which words can compete to be considered keywords and used to represent the semantics of a document. Some systems simply index all the words in a text, while others prefer to reduce the number of keywords and perform text preprocessing analysis. Indexing all words is a simple full text search, very intuitive for users, but it can lead to an imprecise representation of the documents. Moreover, full text search may not be the right option for large document repositories, because vector size can influence the search speed (the bigger the vector dimension, the slower the search process).

A reduction in noise and improvement in search speed can be obtained by reducing the number of keywords used for the representation. Typically, the preprocessing analysis applied in IR to reduce the vector's dimensions is tokenization 10 and a subsequent removal of stop words 11. In some systems there is a further (and final) preprocessing phase, usually called 'index term selection', based on the common knowledge that most meaning is carried by noun groups. Through it, verbs, adverbs and some adjectives are eliminated.

9 The vector model used in NLP was originally developed in the area of IR.

10 The tokenisation phase is in common with the full text indexing.

Because not all the remaining index terms identified as representing semantics are equally useful, on top of preprocessing a numerical weight is also assigned to each index term in a document. The weight quantifies the importance of the index term for describing the document's semantic content.

Once keywords have been weighted, they are organised in vectors and then used to compute the degree of similarity between documents and queries; this is the core of the retrieval process. Queries are represented as keyword vectors as well. There are several ways to compute document-query similarity and, again, the decision of which measure to use depends on the IR model. The cosine function (the cosine of the angle between two vectors) is the most common measure of similarity in IR; for vectors of non-negative weights it varies between 0 and 1. Being a similarity measure, a value of 1 means complete similarity and 0 complete dissimilarity. So, the larger the cosine value, the more similar the document is to the query.
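A minimal sketch of this vector-space machinery, assuming tf-idf weighting (one common choice of index-term weight; the discussion above does not commit to a particular scheme):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: a list of token lists; idf here is log(N/df), one common variant.
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                for doc in docs]

    def cosine(u, v):
        # Cosine of the angle between two sparse vectors (dicts term -> weight):
        # 1 means identical direction (complete similarity), 0 orthogonal.
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

Ranking the documents of the collection by their cosine value against the query vector, in decreasing order, yields the retrieval output described below.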

The document-query similarity value is calculated for each document of the collection available for the search, and the documents scoring the highest values are considered relevant for the query. By sorting the retrieved documents (in decreasing order of similarity), it is also possible to take into account documents which match the query partially.

Since IR technologies have been applied to the web, which is a huge collection of heterogeneous documents, a traditional search alone is often not effective enough to present a manually tractable set of 'good' answers to user queries. Usually the documents retrieved as relevant for a query are too many for a person to analyse. As a result, new operations on documents, focusing on their similarity, have been developed with the aim of improving the results of the retrieval task.

The clustering of documents judged as relevant for a query is an example of such an operation.

11 A third preprocessing phase can be represented by stemming. A stem is the portion of a word which is left after its prefix and suffix are removed. Stemming reduces variants of the same word (e.g., plays, playing, played to a common form play) and can be considered a soft/light version of a lemmatiser. There is controversy in the [literature] (where?) about the benefit of stemming, with the consequence that many search engines do not adopt it.

Clustering in IR

Clustering is a technique used in many different disciplines; its aim is to group similar objects from a collection into sets or clusters. There is no space in this thesis for a general introduction to this technique, which can be found in Everitt (1993).

The general assumption behind the use of document clustering in IR is that similar documents tend to be relevant for the same queries. So automatic identification of document groups can improve the recall of IR systems.

Early attempts to use clustering to improve retrieval results (see Willett (1988) for a general review) have built a partition of the entire corpus and then matched the query to the cluster representative documents (centroids). Hierarchical techniques have usually been used, but they were quite slow, and therefore not applicable as an on-line procedure when the document collection was large. Moreover, for a query that has not strongly matched any particular cluster, there was no way to improve the results. In such cases, the clustering techniques have been quite inefficient and, in general, they have never outperformed classical search methods. As a result, clustering in IR fell out of favour.

At the beginning of the 1990s, improvements in text analysis and the development of new clustering algorithms, which can handle many objects with many features in a reasonable time, allowed clustering to be reconsidered as a possible way of improving IR, especially in human-computer interfaces. Clustering has proved to be an alternative to ranking documents for the organization and viewing of retrieval results.

A relatively recent application of clustering to IR is Scatter/Gather (Cutting et al., 1992), a "browsing method" that clusters the retrieved documents into topically-coherent groups, and also presents the user with a description (summary) of each cluster identified in the form of a list of topical terms and typical titles. The main idea behind this system is that the user may choose, from the clusters initially presented, a subset of those that seem more interesting. This is the scatter phase. The system then re-clusters the documents of the clusters selected (gather phase), producing a more fine-grained partition. The process can be repeated until a manageable number of documents remains.

Two papers (Pirolli et al., 1996; Hearst and Pedersen, 1996) have shown that Scatter/Gather can be successfully applied to a large collection of texts only in combination with a search engine: the clustering technique can be applied to the output of a query to organise the retrieved documents into topical groups.

3.7 Summary

Documents are classified according to different criteria - such as language, author, topic, genre and sentiment - and represented through (some of) their internal features. Until now different methods for studying similarity between documents have been used, most of which performed well in the presence of large amounts of training data.

On the assumption that document similarity is the basis of both corpus homogeneity and similarity, and that the use of internal features for document profiling can be adapted to corpora, the main purpose of the overview presented in this chapter has been to identify the main steps used by each method so as to extract - if possible - a consensus view of how to compute document similarity. This algorithm represents the starting point for the development of measures for corpus homogeneity and similarity. The algorithm consists of four steps:

• collect the feature types and their weights: this consists of making explicit and counting the document-internal features that occur in each document;

• apply normalization and smoothing techniques: in the vector space model, which is a reliable and simple way to encode document information, features occur together with a weight which is computed on the basis of the frequency of each feature in the document and/or in the corpus. Normalization and smoothing techniques help in reducing the problems caused by differences in document size;

• apply feature selection techniques: this third step consists of the reduction of the feature space. It reduces the noise encoded in the model and improves the speed of the classifier;

• measure similarity: the fourth step uses a statistical measure for calculating the degree of similarity between document pairs. In the next chapter the most widely used similarity measures are described.

Chapter 5 will show how this four-step algorithm can be adapted to corpora and chapter 6 how to employ it for quantifying their homogeneity and similarity.

An interesting aspect which also emerges from the overview of the many different types of document classification task is the relation between document internal and external features. There is not a one-to-one relation between internal and external features: i.e. in order to classify a set of documents in relation to a particular type of external feature it is not trivial to decide which type of internal feature will produce the best classification, except perhaps for the language/n-gram pairing 12.

Moreover, in the literature there were cases in which the use of a task-specific set of features did not increase the accuracy of a classifier.

The conclusion that can be drawn from this evidence is pragmatic: there are several factors that should be taken into consideration when choosing the features to be used for a particular classification task, such as the type of annotation available, and the size and the heterogeneity of the corpora being used.

12Language/n-gram is the only relation between internal and external features that appears to be really stable thanks to the nature of n-grams which are very easy to obtain in statistically reliable quantities even for small portions of text.

Chapter 4

Comparing corpora

People who use corpus resources as an aid to performing a linguistic analysis are aware of differences among corpora. A linguist has to study the differences between two corpora if he/she wants to evaluate whether results based on the first corpus can be extended to the second, and to what extent. A grammar or parser developer has to identify the differences in order to judge if his/her tool is appropriate for a corpus which is not the one used for development. Also the difficulty of porting an NLP application (e.g., in IE) from a particular domain to another is, to some extent, dependent on corpus differences. Anyone who has used corpora has some intuitive knowledge about the differences among the corpora he/she has used, and would benefit from work that allows such differences to be described or quantified in an easier and better-defined way.

The first part of the literature review presented in the thesis (chapter 3) has described the wide range of methods already developed for measuring similarity between documents, on the assumption that those methods represent the basis for the measurement of corpus homogeneity and similarity. This second part describes the much smaller literature on comparing corpora and measuring their homogeneity and similarity. In fact, despite an increase of interest among the NLP community (see for example the workshop on comparing corpora held in conjunction with ACL 2000 (Kilgarriff, 2000)), only a few studies have focused directly on corpus comparison, and a clear understanding of the problem has not been obtained.

The chapter starts with section 4.1, which describes three issues that play an important role in the comparison of corpora: corpus size, representativeness and aspect. The following two sections (sections 4.2 and 4.3) describe work which aims to compare corpora: the former describes work which uses surface features, while the latter presents attempts to employ structural features. Section 4.4 presents the definition of homogeneity that emerges from the literature on sublanguage analysis. The definition of corpus homogeneity in terms of linguistic restrictions is then extended to the definition of corpus similarity as well. The weakness of both definitions is presented and linked to the difficulty for humans of developing proper gold standard judgements for the evaluation of corpus homogeneity and similarity measures.

Since the task of automatic sublanguage identification can be considered as addressing the problem of measuring corpus homogeneity, section 4.5 presents an overview of work in this area. Finally, section 4.6 describes work which directly aims to measure corpus homogeneity and similarity. This section is particularly interesting for the thesis not only for the identification of possible methods for quantifying homogeneity and similarity but especially for gathering suggestions about how any proposed measure of corpus homogeneity and similarity should be evaluated.

4.1 Corpus size, representativeness and aspect

There are three key issues that need to be carefully considered whenever a comparison between corpora is attempted. These are corpus size, representativeness and (linguistic) aspect.

As far as size is concerned, it is possible to compare corpora of the same size, or corpora of different sizes. The second possibility is more complex because the reliability of statistical tests and measures depends on corpus dimensions. As Baayen (2001) has pointed out, normalization and smoothing methods are not enough to compensate for size differences. As a result, statistical techniques and measures are affected by corpus size differences; e.g. vocabulary richness measures are not constant. Kilgarriff and Rose (1998) have proposed a practical way to overcome the problem: chunking each corpus into segments of the same size, 'slices', and then producing new subcorpora by randomly selecting a number of slices. This thesis takes Kilgarriff's suggestion: in all the experiments described (chapters 7, 8 and 9) all the corpora used have been chunked into slices.
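A minimal sketch of this slicing device (function and parameter names are illustrative):

    import random

    def make_subcorpus(corpus_tokens, slice_size, n_slices, seed=0):
        # Cut the corpus into contiguous, equal-sized slices...
        slices = [corpus_tokens[i:i + slice_size]
                  for i in range(0, len(corpus_tokens) - slice_size + 1,
                                 slice_size)]
        # ...then build a subcorpus by sampling slices without replacement.
        chosen = random.Random(seed).sample(slices, n_slices)
        return [tok for s in chosen for tok in s]

Because every subcorpus built this way has exactly slice_size * n_slices tokens, size-sensitive measures computed on two such subcorpora remain directly comparable.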

Representativeness is a much more complex issue than size, since it covers several aspects like time, language, domain, etc. Representativeness introduces into the comparison the problematic issue of corpus multidimensionality. Every comparison has to focus on a unique aspect each time, but for a comparison to be effective it should consider many aspects. Two corpora can be similar from one perspective but dissimilar from another.

As far as representativeness is concerned, it is possible to compare a sample corpus against a normative corpus, or two corpora against each other without considering one of them as normative. The main problem related to the first option is the difficulty of deciding what to use as a normative corpus, since different normative corpora can produce different results. The main problem related to the second option is that the comparison of two corpora without previous knowledge of their representativeness can be misleading. How is it possible to judge differences between a homogeneous and a heterogeneous corpus? Are they directly comparable? Section 4.6 will show how the study of corpus homogeneity can help when such comparisons are attempted.

A corpus contains several types of linguistic information (lexical, syntactic, semantic, layout). In order to produce useful descriptions and comparisons of corpora, each type - or aspect - has to be analysed separately. This thesis is mainly interested in textual information, so only the analyses from a lexical, syntactic and semantic point of view will be covered. Until the beginning of the 1990s the only aspect that had been studied was the lexical one. Recently, the syntactic aspect too has been analysed with interesting results. The reason is mainly connected with the availability of robust NLP tools for the automatic identification of lexical and syntactic features in documents. Lexical features are directly present in texts, and they can be extracted just with the help of a tokeniser, a relatively simple process for European languages such as English. As far as syntactic features are concerned, robust and domain-independent part of speech taggers and parsers were developed during the 1990s, and it is only recently that attempts to compare corpora from a syntactic point of view were undertaken. Attempts to compare corpora at a semantic level are still very few (Roland et al., 2000). At the moment, in fact, corpora which have been semantically tagged are few and small in size, and the present technology for word sense disambiguation is not sufficiently robust or domain independent to be used for the production of reliable automatic annotations.

4.2 Comparing corpora using surface features

The majority of studies on corpus comparison focus on finding words which are particularly characteristic of a document (or a corpus). Reviews of these approaches have been presented by Daille (1995) and Kilgarriff (1996).

There are two pieces of work which compare corpora using character n-grams instead of words.

Cavnar and Trenkle (1994) have extracted n-grams (bi-, tri- and quad-grams) from a corpus of mail taken from newsgroups in eight European languages. For each category sample, they have generated a profile by ranking the n-grams according to their frequencies, i.e. producing an n-gram frequency list. Then a similar profile has been generated for each new document to be classified. Finally, the document profile has been compared with each category profile, using a measure called 'out-of-place', which calculates "how far out of place an n-gram in one profile is from its place in the other profile" (Cavnar and Trenkle, 1994). The sum of all n-gram out-of-place values represents the profile distance measure. The out-of-place measure is a very simple and intuitive measure that focuses only on the feature ranking, and is similar in its aim to the better known Wilcoxon rank sum test and Spearman's rho test. As reported by Sibun and Reynar (1996), "two languages may have similar rankings, but different absolute frequencies", and in these cases the 'out-of-place' measure might fail to distinguish the two languages.
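The measure itself takes only a few lines; the sketch below assumes, as in the original paper, that an n-gram missing from the second profile receives a maximum out-of-place penalty:

    def out_of_place(profile_a, profile_b):
        # Profiles are lists of n-grams ordered by decreasing frequency.
        # An n-gram absent from profile_b incurs the maximum penalty.
        max_penalty = len(profile_b)
        rank_b = {g: r for r, g in enumerate(profile_b)}
        return sum(abs(r - rank_b.get(g, max_penalty))
                   for r, g in enumerate(profile_a))

A document is then assigned to the category whose profile yields the smallest distance.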

Each of the newsgroups chosen by Cavnar and Trenkle (1994) has its own domain, and the goal of the research was not only to identify the language of the mails but also the newsgroup they belong to. In doing so the authors have reached interesting conclusions on the relation between n-gram frequencies, i.e. the frequencies of internal features, and external features, i.e. language and topic. They have found that the most frequent n-grams are highly correlated with the language, whereas lower-ranked n-grams are more specific to the subject-matter of the document. They have also pointed out that the shift between language-specific n-grams and subject-specific n-grams depends on document size: the longer the documents, the later the shift appears in the rank.

The already mentioned work of Sibun and Reynar (1996) is another study which uses n-gram profiles for language identification. They have used a corpus containing documents in 18 Roman-alphabet languages and computed profiles using uni-, bi- and tri-grams. In order not to discard important information like absolute frequency, they have applied the Kullback-Leibler distance, or relative entropy, to calculate profile distance. The results have shown that this simple technique succeeds in discriminating a large number of Roman-alphabet languages using small texts, and that special characters contribute little to the system performance.
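A sketch of this profile distance, computed here over character n-gram distributions; the epsilon guard against n-grams unseen in the second text is an illustrative simplification rather than the smoothing, if any, used in the original work:

    import math
    from collections import Counter

    def ngram_distribution(text, n=3):
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(grams.values())
        return {g: c / total for g, c in grams.items()}

    def kl_divergence(text_a, text_b, n=3, epsilon=1e-6):
        # D(P || Q): how badly text_b's distribution models text_a's.
        p, q = ngram_distribution(text_a, n), ngram_distribution(text_b, n)
        return sum(pg * math.log2(pg / q.get(g, epsilon))
                   for g, pg in p.items())

Unlike the out-of-place measure, this distance uses the actual probabilities, so two texts with similar rankings but different absolute frequencies are kept apart.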

4.3 Comparing corpora using structural features

Rayson and Garside (2000) have also worked on discovering key items that differentiate one corpus from another, but they did not use only lexical features. They have compared a collection of ethnographic studies carried out at an air traffic control centre, and used a part of the BNC as a normative corpus. The method was applied at the word, part of speech tag, and semantic tag levels. For semantic tagging they have employed an analyser that assigns to words their semantic field using a lexicon and an idiom list. The tool, which does not perform a completely automatic analysis, provides a group of key items, ranked in decreasing order of significance, which distinguishes one corpus from another. Basically, they have produced a frequency list for each corpus and, for each item in the two frequency lists, they have calculated a distance value using log-likelihood. Log-likelihood is used to calculate the deviation of the second corpus from the normative one. The resulting values have formed a new frequency list which was then sorted, so that the items most indicative of the corpus, as compared to the normative one, were at the top of the list. The items with similar relative frequencies in the two corpora are lower down the list. Because they produce a single frequency list for each corpus, in order for the method to be reliable the two corpora must be homogeneous, otherwise the frequency lists are not representative of the characteristics of the corpus 1. So, for this method to be applicable some previous knowledge about the corpora has to be available.

1 For example, if the corpus is heterogeneous it may happen that a feature that has a very high frequency in the corpus occurs extremely often in a small number of documents and does not occur at all in the other documents.
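For a single item occurring freq_1 times in a corpus of size_1 tokens and freq_2 times in a normative corpus of size_2 tokens, the log-likelihood score can be computed as in the following sketch (the standard two-corpus G2 formulation, given here as one way of realising the calculation just described):

    import math

    def log_likelihood(freq_1, size_1, freq_2, size_2):
        # Expected counts under the hypothesis that the item is equally
        # frequent (per token) in both corpora.
        e1 = size_1 * (freq_1 + freq_2) / (size_1 + size_2)
        e2 = size_2 * (freq_1 + freq_2) / (size_1 + size_2)
        ll = 0.0
        if freq_1:
            ll += freq_1 * math.log(freq_1 / e1)
        if freq_2:
            ll += freq_2 * math.log(freq_2 / e2)
        return 2 * ll

    # An item occurring 40 times in a 100,000-token corpus against 20 times
    # in a 200,000-token normative corpus:
    print(round(log_likelihood(40, 100_000, 20, 200_000), 2))  # 27.73

Sorting all items by this score in decreasing order produces the ranked key-item list described above.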

Roland et al. (2000) have also compared corpora from a syntactic and a semantic point of view. They have explored the difference in verb subcategorization frequencies across three different corpora: the BNC (British English), the Brown corpus (American English) and the Wall Street Journal corpus (business domain). They have chosen 64 verbs which have a predominant sense and, for each corpus, they randomly selected 100 examples for each verb, which represent the data for the experiment. They have studied not only single verb frequencies but also frequencies of transitive/intransitive use, active/passive use and subcategorization frames. For each frequency type and verb they have applied the chi-square measure to test whether the difference in frequency was significant for each corpus pairing. The results have shown that differences in transitivity and subcategorization frames depend basically on differences of verb senses, and that interesting differences can only be spotted between broad balanced vs. homogeneous corpora (BNC or Brown vs. WSJ) and not between British and American English corpora (BNC vs. Brown).

4.4 Corpus homogeneity and similarity: a definition

The term 'corpus homogeneity' was used for the first time by Richard Kittredge (Kittredge and Lehrberger, 1982, p. 129): sublanguage analysis can be performed only using "a large corpus of homogeneous texts", which shows a certain degree of convergence, especially with respect to lexicon and grammar. So a homogeneous corpus is a corpus which shows a high degree of restriction at various (linguistic) levels, and the sublanguage is a constrained subset of the natural language represented in it.

A number of theoretical studies have been devoted to the definition and analysis of sublanguages (Kittredge and Lehrberger, 1982; Grishman and Kittredge, 1986; McEnery and Wilson, 1996). The linguistic restrictions which can characterise a sublanguage and, consequently, a homogeneous corpus are:

• lexical restrictions: specialist domains are usually characterised by the presence of words that lexicalise concepts which are specific to that domain, and by the absence of words which lexicalise concepts related to different domains. For example, a medical domain can be characterised by such words as endocrine, injection, thorax, while words such as parsley, violin and pencil case might not occur at all. Vocabulary restrictions do not apply to the same extent in all categories: open class words, particularly nouns, are the most limited, while most closed class (function) words may be found in most sublanguages;

• semantic restrictions: many word senses that occur in the whole language do not occur in sublanguages. They can therefore be weeded out when a homogeneous corpus is used. Sometimes this even implies a reduction in the syntactic classes of a word. This type of semantic restriction has been called the "one sense per discourse" constraint (Gale, Church, and Yarowsky, 1992): in 'well written coherent discourse', there is a tendency for occurrences of a word to be used in only one of its senses. Attempts to exploit this relationship between sense inventory and sublanguage corpora are called "lexical tuning";

• syntactic restrictions: sublanguages can differ also with respect to syntactic features such as the presence of direct questions, tag questions, passives, different kinds of relative clauses, and nominalizations. Syntactic features can be distributed differently across sublanguages and, moreover, the same linguistic feature can have different functions in different sublanguages;

• text structure restrictions: a sublanguage can be characterised by a particular textual organization. Patient information leaflets, for example, have a clear sequence of paragraphs, with certain elements obligatory and others optional, according to an underlying organization into which each subpart has to fit. Text structure aspects have recently been studied from the different perspectives of automatic summarization (Teufel, Carletta, and Moens, 1999), generation (Power and Scott, 1999), and information extraction;

• other restrictions: some specialist domains can also be characterised by the presence of specific symbols, like for example logical and mathematical symbols.

In light of what has been just said, the definitions of corpus homogeneity and similarity in terms of language varieties can be characterised as :

• a corpus is homogeneous if all its documents belong to the same language variety;

• two corpora are similar if they represent the same language variety.

Both definitions are weak because of the difficulty of determining what a language variety (or a linguistic restriction) is. Defining corpora in terms of language varieties only shifts the attention from the object to the notion represented by that particular object; it does not add any information for determining their characteristics.

The circularity in the definitions of corpus homogeneity and similarity is confirmed by the fact that people cannot use those definitions to judge the homogeneity and similarity of corpora according to precise and accurate rules. Hence it is difficult for people to develop proper gold standard judgements for the evaluation of corpus homogeneity and similarity measures.

In light of this, the main purpose of this thesis becomes to propose new definitions for corpus homogeneity and similarity which are not affected by circularity and can be used to identify gold standard judgements for the evaluation of the measures.

4.5 Automatic sublanguage identification

The definition of corpus homogeneity in terms of language varieties or sublanguages suggests that work on automatic sublanguage identification can be considered as addressing the problem of measuring corpus homogeneity. This section presents a brief overview of the attempts to automatically identify sublanguages.

The first studies into computational sublanguage identification are Walker and Amsler (1986) and Slocum (1986). The methods used exploit the observation that some words are associated with a particular sublanguage, and the occurrence of these words in a text identifies it as belonging to that particular sublanguage. This can also be true for polysemous words; in fact each sense of these words can be specialised for a different sublanguage. This is a semiautomatic method based on the use of a dictionary, in which each word sense has been enriched with tags representing its associated sublanguages. To identify the sublanguage of a given text: (a) each word is checked against the dictionary to determine the sublanguage tag ("semantic code") associated with it, then (b) the sublanguage is determined by accumulating the frequency of the tags of each word. This method, which continues to be used (Wilks and Stevenson, 1998; Magnini and Cavaglia, 2000), depends on the availability of dictionaries that contain sublanguage tags and so it suffers from the resource bottleneck problem (Artale et al., 1998).

An automatic sublanguage identification method has been proposed by Sekine (1994b; 1994a). He has exploited the defining aspects of sublanguage - lexical, syntactic and semantic restrictions - for automatic sublanguage identification. However, in practice, he has performed only lexical analysis, using word tokens.

For his experiments he has used a corpus of newspaper articles: the San Jose Mercury corpus, composed of 2147 articles. Firstly, he has clustered texts using, as similarity measure, the logarithm of inverse document frequency (IDF). Tokens which occur more than 200 times or fewer than 2 were not considered. Secondly, for each cluster, he has calculated the perplexity. Because, roughly speaking, perplexity indicates the amount of information in the source, a text with larger perplexity has a greater variety of tokens while a text with smaller perplexity has a sublanguage tendency. As far as evaluation is concerned, the perplexity of each cluster was compared against that of a random text of the same size as the cluster. Perplexity depends on the size of the corpus it has been calculated from, so only corpora of the same size can be compared. The random texts were obtained by selecting tokens at random from the entire corpus, and the expected perplexity values used as a baseline were the average of three such trials. In the paper, perplexity has been used not only to decide whether a cluster has a sublanguage tendency, but also to decide which cluster a new text is most similar to. In this second experiment, the evaluation has been performed by comparing the perplexity values against values calculated on the basis of the number of overlapping tokens between the article and the closest cluster.

4.6 Measuring corpus homogeneity and similarity

There are only four articles which explicitly aim to measure corpus homogeneity and similarity, and all of them follow, in a more or less evident way, the algorithm described at the end of the previous chapter. Among them, Kilgarriff (2001)'s and Sekine (1997)'s work is the most closely related to this thesis. The other two articles find a place in this section for completeness' sake, since they study corpus homogeneity and similarity from different perspectives: speech recognition (Rose, Haddock, and Tucker, 1997) and bilingual corpora (Besançon and Rajman, 2002).

Kilgarriff (2001)'s work on measuring corpus similarity, and his suggestion about how such measures should be used for the study of corpus homogeneity, represent the starting point of this thesis.

He has tested the validity of three similarity measures, the Spearman Rank Correlation Coefficient, the chi-square test, and cross entropy, using a set of Known-Similarity Corpora he has produced using part of the BNC. The procedure for creating a set of Known-Similarity Corpora starts with the identification of two distinct text types, A and B, from the original corpus. Then eleven sub-corpora are produced, so that corpus 1 comprises 100% of text type A, corpus 2 comprises 90% of text type A and 10% of B, corpus 3 comprises 80% of text type A and 20% of B, and so on until corpus 11, which is made of 100% of text type B. The eleven Known-Similarity Corpora form the gold standard judgement for the evaluation of the similarity measures. He has measured corpus similarity as follows (a sketch of the procedure is given after the list):

• each of the eleven known-similarity corpora is divided into slices to create sub-corpora of the same size by randomly allocating the same number of slices to each;

• a word frequency list is produced from each sub-corpus;

• for each pair of frequency lists a similarity value is computed (using either the Spearman Rank Correlation Coefficient, the chi-square test or cross entropy);

• the procedure is iterated with different random allocations of the slices initially created;

• the mean and the standard deviation are calculated over all the iterations.
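As an illustration, the whole procedure can be sketched compactly in Python. The sketch below is not Kilgarriff's actual code: the document pools, slice handling and whitespace tokenisation are simplifying assumptions.

    import random

    def known_similarity_corpora(docs_a, docs_b, steps=11):
        """Build corpora mixing text types A and B in 10% increments:
        corpus 1 is 100% type A, corpus 11 is 100% type B."""
        n = min(len(docs_a), len(docs_b))
        corpora = []
        for k in range(steps):
            n_b = round(n * k / (steps - 1))   # documents drawn from type B
            corpora.append(random.sample(docs_b, n_b) +
                           random.sample(docs_a, n - n_b))
        return corpora

    def frequency_list(docs):
        """Word frequency list of a (sub-)corpus, as a dict."""
        freqs = {}
        for doc in docs:
            for word in doc.split():
                freqs[word] = freqs.get(word, 0) + 1
        return freqs

    def chi_square(freqs_1, freqs_2):
        """Chi-square distance between two frequency lists."""
        n1, n2 = sum(freqs_1.values()), sum(freqs_2.values())
        total = 0.0
        for w in set(freqs_1) | set(freqs_2):
            o1, o2 = freqs_1.get(w, 0), freqs_2.get(w, 0)
            e1 = n1 * (o1 + o2) / (n1 + n2)    # expected count, corpus 1
            e2 = n2 * (o1 + o2) / (n1 + n2)    # expected count, corpus 2
            total += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
        return total

A good similarity measure should then rank the corpus pairs consistently with the known mixtures, e.g. score corpora 1 and 3 as further apart than corpora 1 and 2.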

The results have shown that both the Spearman and the chi-square measures outperform cross entropy, which, in contrast with the other two measures, also requires much greater computational effort. Chi-square, in turn, is significantly better than the Spearman.

In the same paper Kilgarriff has presented corpus homogeneity as an essential and preliminary step to any quantitative study of corpus similarity, but he has not proceeded to compute a specific measure. His claim is that without knowledge of corpus homogeneity it is not clear whether it would be appropriate to measure similarity between, for example, a homogeneous corpus like the PILLs corpus of patient information leaflets (Scott et al., 2001) and a balanced one like the Brown corpus. He has analysed the interaction between homogeneity and similarity; the results of his analysis are summarised in table 4.1:

Corpus 1   Corpus 2   Distance          Interpretation
low        low        low               same language variety
low        low        high              different language varieties
high       low        low               corpus 2 falls within the range of corpus 1
high       low        high              corpus 2 falls outside the range of corpus 1
high       high       low               impossible
high       high       slightly higher   share of same varieties

Table 4.1: Interactions between homogeneity and similarity

He has also stated that ideally the measure proposed for corpus similarity can be used for corpus homogeneity as well.

This thesis explores the validity of this claim. As will be explained in more detail in the next two chapters, Kilgarriff's low-tech method for measuring corpus similarity is adapted for quantifying corpus homogeneity. Since he has shown that chi-square outperforms both cross entropy and the Spearman correlation, this thesis analyses two other inter-document similarity measures, Log-likelihood and relative entropy, using the chi-square measure as baseline. Moreover, besides content words, the thesis experiments with the use of other feature types like function words, POS tags and POS tag bigrams.

The importance of Sekine (1997)'s work lies neither in the inter-document measure proposed for the quantification of similarity nor in the results obtained, which are affected by the small sampling problem. This article is particularly interesting because it proposes a proper method for addressing the evaluation of corpus similarity. In this thesis Sekine's method will be generalised so that it will be possible to apply it to other NLP applications. Moreover, it will be adapted to the evaluation of corpus homogeneity. The identification of a proper method for the evaluation will also provide important suggestions for the development of a new definition of corpus homogeneity and similarity which is not affected by circularity.

Sekine (1997) aimed to measure corpus similarity using the accuracy values of an NLP application as gold standard judgements instead of corpora of known similarity. He has parsed 8 sub-corpora of the Brown corpus, characterised by different domains (4 fiction and 4 non-fiction domains) but similar in size (around 2000 words), using four different types of grammar. These grammars were acquired from corpora of the same domain, all domains, the non-fiction domains and the fiction domains, using the same size of training corpus (24 samples). Then he has produced a frequency list of the syntactic trees with depth one, which correspond to rules of the grammar, for each domain and each grammar. Only the partial trees which satisfy the two following conditions have been used for the computation of the similarity values:

• the frequency of the partial tree in a domain should be 5 times greater than in the entire corpus;

• the partial tree occurs more than 5 times in the domain².

Finally, for each pair of frequency lists he has calculated cross entropy. The matrix containing the similarity values was compared, for the evaluation, with a matrix containing the parser accuracy values measured in the form of recall/precision. The results have shown that smaller cross entropy values are obtained when the grammar is acquired from either the same domain or domains of the same class, i.e. when the corpora more similar to the one used for the test are used to produce the grammar. The reason why in some cases the best results were acquired using the grammar of the same class rather than the same domain is the small size of the training corpora used to build the grammars. "Because only 24 samples are used, a single domain grammar tends to cover a relatively small part of the language phenomena. On the other hand, a corpus of similar domains could provide wider coverage for the grammar." (Sekine, 1997)

In the same paper a second experiment, more related to homogeneity, is also presented. The aim of this second experiment was to find out whether it is better to use a grammar built from a smaller corpus of the same domain or a grammar trained on a larger corpus. Two sub-corpora have been parsed with several grammars acquired from different domains, classes (fiction vs. non-fiction), and using different sizes of training corpus. The graph relating training corpus size to parser accuracy is generally an increasing curve which gradually flattens as the corpus size increases. As far as the order of performance is concerned, for one of the two corpora the best results have been obtained using the same domain, followed by the best class, all domains, the other class and the other domain. For the other corpus the results were not so clear-cut, but the same tendency could be observed. The author has judged that the small sampling problem might have been responsible for the unclear result of this second experiment.

²This condition removes noise.

This thesis borrows Sekine's idea of using the accuracy of an NLP application to evaluate both homogeneity and similarity measures. To avoid the small sample problem, the BNC was chosen as the corpus on which to conduct the experiments described in chapters 7 and 8.

Among the four articles described in this section, the work of Rose, Haddock, and Tucker (1997) is the only one which aims to measure corpus homogeneity. They have studied homogeneity as described in a previous version of Kilgarriff's article to explore how the performance of a speech recognition system varies with the size and homogeneity of the training data used to build the language model. First, they have collected a small corpus - almost 2 million words - of email messages, the target language type. Then they have applied a homogeneity measure to acquire further training documents of the same text type from a larger, more general corpus. The new similar documents were finally added to the seed corpus and the augmented corpus was used to build a new language model for the email domain. Spearman and Log-likelihood were the measures chosen to compute corpus homogeneity, and the BNC was the general corpus used to extract email-like documents. As far as evaluation is concerned, they have applied two methods. Firstly, they have used perplexity, which is the standard measure for language model quality. They have measured the homogeneity of the email corpus and the BNC, and the perplexity of the two associated language models. The results for perplexity have confirmed the intuition that homogeneity is correlated with low perplexity and heterogeneity with high perplexity. The email corpus has appeared to be more heterogeneous than the BNC. For the second evaluation method they have integrated the language model built with the email corpus with a speech recogniser and then tested the combined system on the two corpora, measuring the word error rate using two standard metrics: percentage correct and accuracy. These second results have shown that the email corpus produced the highest percentage correct despite its poor homogeneity and small size, underlining the importance of data quality over quantity.

Like Kilgarriff (2001), Besançon and Rajman (2002) have proposed a corpus similarity measure and applied it to corpora for which the similarities were already known. They have evaluated the cosine function as a document similarity measure, and the vector space model for document representation, by using an aligned bilingual corpus: a subset of the JOC corpus (questions and answers of the Official Journal of the European Community). They have used 6729 documents for each of the two languages considered in the experiments: English and French. In the first experiment they have pre-processed the aligned corpus to extract from each sub-corpus (the English and the French) a set of indexing terms. The sub-corpora were lemmatised and tagged; the combination of the three elements formed a linguistic unit. Then the linguistic units have been filtered using their document frequency, and only the ones whose frequencies were in a particular interval were used as indexing terms. It is important to note that the two indexing sets were not aligned, i.e., the term associated with the ith position in one of the vectors did not necessarily correspond to the translation of that term in the same position in the second vector. They have calculated the similarity of each pair of documents in the same sub-corpus using the cosine function. Then they have compared the "shape" of the two matrices using the Mantel test, which measures "the association between the elements in two matrices by a statistic r, and then assessed the significance of this measure by comparing it with the distribution of the values found by randomly reallocating the order of the elements in one of the matrices." The result has confirmed a positive correlation between the two similarity matrices. The authors have also set up a second experiment to check whether the closest element to a document in the source language was its translation in the target language. Firstly, they have divided each sub-corpus into a test and a reference set, maintaining the alignment so that the first document in the test set of the English sub-corpus was the translation of the first document in the test set of the French sub-corpus. Secondly, each document in the test and reference sets of the two sub-corpora has been represented in the vector space, using the indexing terms previously obtained for each language independently. Then, each document from the test set has been represented as a vector of similarity values calculated between the document itself and each of the documents of the reference set of the same language.

If d_i with i = 1, ..., n are the documents in the test set of the source language, D_i with i = 1, ..., N are the documents in the reference set of the source language, d'_i with i = 1, ..., n are the documents in the test set of the target language and D'_i with i = 1, ..., N are the documents in the reference set of the target language, the similarity vectors are:

V_s(d_i) = (sim(d_i, D_1), sim(d_i, D_2), ..., sim(d_i, D_N))

V_s(d'_i) = (sim(d'_i, D'_1), sim(d'_i, D'_2), ..., sim(d'_i, D'_N))

A similarity vector V_s(d_i) represents the relative position of the document d_i with respect to the documents in the reference set, and since the reference sets are aligned, the similarity vectors for the two languages are comparable. So, finally, the cosine function can also be used to measure the distance between the similarity vectors, i.e. the proximity between a document and its translation. The cosine measure has been employed to verify whether invariance by translation is maintained. The results have supported the hypothesis that the relative positions of the documents do not vary significantly when translated, even if the words which make up the vectors have been selected independently for each language.
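The construction of these similarity vectors is straightforward to express in code. The following minimal sketch, with documents represented as sparse term-weight dictionaries, is an illustration under assumed data structures, not the authors' implementation.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse vectors (dicts)."""
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def similarity_vector(doc, reference_set):
        """Represent a document by its similarity to each reference document."""
        return {i: cosine(doc, ref) for i, ref in enumerate(reference_set)}

The proximity between a document and its translation is then the cosine of the two similarity vectors computed over the aligned reference sets of the respective languages.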

As far as this thesis is concerned, Besançon and Rajman (2002)'s work represents a further confirmation that the simple algorithm presented at the end of the previous chapter can easily be adapted to corpora with noteworthy results.

4.7 Summary

This chapter has presented the literature on comparing corpora and measuring their homogeneity and similarity. The overview has generally confirmed the results of the previous chapter:

• there are several inter-document measures that can be adapted and applied to corpora for identifying their differences and quantifying homogeneity and similarity;

• there are several aspects that can be analysed to give a clear idea of corpus characteristics and the similarities between corpora.

The work presented in this chapter has also confirmed that it is not difficult to adapt the four step algorithm to corpora and obtain significant results. Chapter 5 takes up and analyses both these points, showing in detail the solutions chosen and applied in this thesis for measuring corpus homogeneity and similarity in the experiments described in chapters 7 and 8.

However, the most important contribution of this chapter to the thesis is related to the issue of evaluation. Proposing a measure for quantifying corpus homogeneity and similarity is not difficult, given the availability of several document similarity measures that can be quite easily adapted to corpora. On the contrary, the process of establishing the validity of a proposed measure is much more difficult, owing to the lack of gold standard judgements.

This chapter has shown that, in the literature, two different evaluation methods have been proposed. The first method makes use of corpora of known homogeneity and similarity. Thus it compares the measures with what is known about the corpora, and not with independent gold standard judgements. Up to now, this method has been used for evaluating similarity measures only (Kilgarriff, 2001; Besançon and Rajman, 2002), but it could be used for homogeneity as well. Corpus homogeneity and similarity are quantified with the measures to be evaluated and then the measurements are compared with the knowledge we have about the corpora. This method is simple and straightforward, but applying it requires the availability of special corpora, i.e. corpora whose homogeneity and similarity are known. Since both homogeneity and similarity are notions for which there is no proper definition, it is not clear how one should proceed in order to compile corpora of known homogeneity and similarity. Moreover, the use of such "artificial" data does not clarify the robustness of the measures in the presence of noisier data.

The second method, proposed by Sekine (1997) and, in a less evident way, by Rose, Haddock, and Tucker (1997), exploits the correlation between the performance of an NLP application and corpus homogeneity and similarity. It can be considered a proper evaluation method since it produces independent gold standard judgements against which the measures can be compared. As previously mentioned, Sekine (1997) used a domain-dependent parser and corpora derived from the Brown corpus for his experiments. The small size of the treebank affected the results of the experiment, and since then the approach has not been used again. However, I believe in the validity of the approach, so the thesis takes up Sekine's process and, to avoid the small sample problem, uses in the experiments systems for which a bigger, correctly annotated corpus is available. The necessity of having correctly annotated corpora, which are expensive to produce and not always freely available, implies that the method cannot be easily applied to every corpus with whatever application.

In section 4.6, Kilgarriff (2001) and Sekine (1997) have been presented as the two works most closely related to this thesis. Kilgarriff (2001)'s work represents the starting point of this thesis as far as measuring homogeneity and similarity is concerned. Following his intuition, this thesis explores the validity of his claim that the measure for corpus similarity can be used for quantifying corpus homogeneity. Chapter 6 describes how to use the same corpus profile and document similarity measure to quantify both corpus homogeneity and similarity.

Sekine (1997)'s work provides the starting point for developing a general method for the evaluation of any homogeneity and similarity measure.

Chapter 5

The document similarity algorithm

This chapter describes in detail each of the four steps which make up the algorithm identified in chapter 3 as the consensus method for measuring document similarity.

As confirmed by chapter 4, this method can be successfully adapted to a range of situations. In this chapter, each step is described in a single section:

step one "collecting feature types and weights" is described in section 5.1; •

• step two "applying normalization and smoothing techniques" is described in section 5.2; : • step three "applying featureselection techniques' is described in section 5.3;

• step four "measure similarity" is described in section 5.4.

Each section is structured in a similar way: it starts with a general description of the step's aim and then presents a brief overview of the techniques generally used for carrying out the task. It ends with a description of the technique applied in the experiments performed in the thesis and described in chapters 7, 8 and 9.

The point of each section is not to give a full overview of the main techniques, but just to point out their relevance for the similarity detection process (between documents or corpora) and to show the multitude of techniques that can be applied.

In this thesis, the criteria used to choose among the various techniques have been simplicity and ease of computation, since the thesis proposes measures that can be applied to corpora of any size and form. So it is important that the data can be collected and processed quickly.

5.1 Step one: collecting feature types and weights

Generally, an object's profile is a description of the object's main characteristics. Document characteristics are external and internal features. As argued in chapter 2, corpus descriptions used in corpus linguistics at present give accounts only of external features (e.g. the mode of delivery of the texts - "transcription of business meetings" -, the origin of the texts - "Wall Street Journal" -, or author information - "foreign learners' essay" -). But many external features of interest may be quite subjective and, above all, are not quantifiable. So document profiles based on external features alone are not appropriate as criteria for corpus design and they do not allow the direct comparison of corpora.

An alternative way to describe corpora, based only on the quantification of internal features, is through the vector space model. Chapter 3 has shown that the vector space model is a simple way to compute document profiles using internal features alone. The features that occur in the vector space are associated with a weight which represents how salient each feature is in the particular document it occurs in.

The efficiency and usefulness of vector representations for documents has been amply proven in information retrieval and automatic document classification. Chapter 4 has shown that the benefits of the vector space model carry across from documents to corpora, confirming its validity in the computation of corpus homogeneity and similarity as well.

There are two substeps for the realization of a vector:

• choose the feature type for representing the corpus;

• extract and count all the features of that particular type.

Typically, the simplest way to compute a feature weight is to count how many times the feature occurs in a document (sample frequency). The vectors which use the sample frequency as weights are called frequency lists.

5.1.1 Deciding aspect and features

In the vector space model, internal features are used for representing the linguistic information that is present in documents and corpora. The feature types most frequently employed to represent the lexicon of a text are words, content words or lemmas, while the feature types most frequently employed for representing the syntax are either function words or POS tags. Since the features related to other aspects are not easily available - e.g., word senses for the semantic aspect - these aspects are not frequently analysed.

The decision about which type of feature should be used often depends on practical reasons like the annotation present in the corpus or the availability of an accurate and robust annotation tool to pre-process the whole corpus and extract the features. This thesis focuses mainly on the lexical and the syntactic aspects since, for English, they are the aspects for which the technology is reliable. The software developed for the thesis analyses six types of features: words, content words and lemmas to study the lexical aspect; function words, POS tags and POS tag bigrams for the syntactic aspect. In the software, the decision about which feature type should be used for the representation is one of the parameters to set at the beginning of the analysis.

Analysis             Features          Tools and Resources
Lexical analysis     words             tokeniser
                     content words     function word list
                     lemmas            lemmatiser
Syntactic analysis   function words    function word list
                     POS tags          POS tagger
                     POS bi-grams      POS tagger

Table 5.1: Types of analysis and features for profiles

Table 5.1 summarises the feature types used in this thesis. It also indicates the tools or resources required to annotate each type of feature.

5.1.2 Frequency list

Once the feature type for representing the corpus is chosen, all the features of that particular type can be extracted and counted.

A frequency list is a list of the internal features that occur in a text, presented together with a weight computed on the basis of the feature frequency. It is usually represented as a list of tuples (x, f(x)) in which x is a feature instance, e.g. the function word "with" or the lemma "to cut", and f(x) is the frequency of the feature x in the document (the number of occurrences of "with" or "to cut" in the document).

In this thesis the frequency lists are made of all the document-internal features. There are several reasons for this. First, frequency lists (and, more generally, vectors) prove effective in representing some of the linguistic information contained in documents. Second, the use of all the features in the computation of the frequency lists does not require any theory of language. This is in contrast with Biber's approach, which is based on the use of a special sub-set of features. The use of the whole set of features also makes the methodology applicable without major changes to corpora of languages other than English. Third, frequency lists are easy and quick to produce, so they can be computed for documents and corpora of any size.

Appendix A gives an example of a frequency list of a small text.
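As a concrete illustration, such a frequency list can be produced in a few lines of Python; the whitespace tokenisation below is a simplifying assumption, since the thesis software uses a proper tokeniser.

    from collections import Counter

    def frequency_list(text):
        """Return the (feature, frequency) tuples of a text, using
        whitespace-separated word tokens as features."""
        counts = Counter(text.lower().split())
        return sorted(counts.items(), key=lambda t: -t[1])

    print(frequency_list("the cat sat on the mat"))
    # [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]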

5.2 Step two: normalization and smoothing

Documents in a corpus can be, and often are, of different sizes. This can cause problems when statistical methods are applied. Methods for computing document similarity often do not work properly when applied to documents which differ in size. To avoid scaling errors, instead of using the sample frequency directly, it is common to normalise, i.e., to combine the weight, which is a function of the feature frequency in the document, with a factor that depends on the document length. A simple normalization process transforms a frequency f(x), which is an integer in the interval 0-N (where N is the length of the text), into a probability p(x), a real number which varies from 0 to 1 whatever the size of the document. The probability of features that do not occur in the corpus is zero, while the sum of the probabilities of all the features that occur in the corpus is 1.

It follows that normalization techniques tend to underestimate the probability of unseen features and, by contrast, overestimate the probability of the features that actually occur in the documents. In order to limit this problem it is possible to apply a smoothing technique which performs a more accurate computation of probabilities by assigning a non-zero probability to the unseen features as well.

Sections 5.2.1 and 5.2.2 present the most common techniques for normalization and smoothing respectively. They also describe the techniques used for the experiments in this thesis.

The application of a normalization technique, with or without smoothing, transforms frequency lists into probability distributions of features, since the following conditions are satisfied:

• 0 ≤ p(x) ≤ 1

• Σ p(x) = 1

Appendix A also presents the probability distribution, with and without smoothing, of the previous example.

In this chapter, and in the rest of the thesis, we prefer to use the term "frequency list" instead of "vector" or "distribution" to refer to document or corpus profiles. But, as far as section 5.4 is concerned, the equivalence of these three terms is fundamental for understanding why different statistical measures employed for the comparison of documents can be adapted and applied to corpora for the computation of their homogeneity and similarity.

5.2.1 Normalization techniques

The normalization process is essential in many weighting methods. The most common weighting methods which use normalization are:

• term frequency (tf_{i,j}) is the simplest weight that can be associated with a feature: the number of occurrences of feature w_i in document d_j. The salience of a word that occurs eight times is higher than that of a word that occurs just once, but not eight times as high; to mitigate this effect a function such as square root or logarithm is applied when computing term frequency;

• document frequency (df_i) is the number of documents in the collection that the feature w_i occurs in. It is an indicator of "informativeness": a semantically focussed word will often occur several times in a document if it occurs at all, while semantically unfocussed words are spread out over all documents;

• tf.idf weighting schemes (tf.idf_{i,j}) are a large family of weighting procedures frequently used in IR. They combine term and document frequency values in a single weight.

Functions like information gain (IG), mutual information (MI), and the chi-square test (χ²) are also frequently used as normalization methods, especially in content classification.

In the experiments described in chapters 7, 8 and 9, Maximum Likelihood Estimation (MLE) is used as the normalization technique. This simply divides the sample frequency f(x) of each feature x in the document by the document size N (the number of words in the document):

p(x) = f(x) / N

As mentioned in section 5.2, this phase transforms a frequency list, in which 1 ≤ f(x) ≤ N and Σ f(x) = N, into a probability distribution, in which 0 < p(x) ≤ 1 and Σ p(x) = 1.
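In code, MLE normalization amounts to a single division per feature; a minimal sketch:

    def mle_normalise(freqs):
        """Turn a frequency list (dict: feature -> count) into a
        probability distribution by Maximum Likelihood Estimation."""
        n = sum(freqs.values())          # document size N
        return {x: f / n for x, f in freqs.items()}

The resulting values are all strictly positive and sum to 1, as required of a probability distribution.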

5.2.2 Smoothing techniques

The use of training corpora to calculate the probabilities of features has an unavoidable problem: how to calculate the probabilities of features that do not occur in the training corpus, but are perfectly acceptable and could indeed occur in a bigger corpus? Smoothing techniques tackle exactly this problem of assigning non-zero probabilities to features that do not occur in the training corpus.

We can also think of smoothing algorithms as "discounting" procedures which decrease some non-zero counts in order to assign the remaining probability mass to unseen features.

Some of the most widely used smoothing algorithms are as follows:

• Add-one smoothing is the simplest technique. It works by adding one to each of the feature counts, then normalizing as usual to transform the counts into probabilities. It often does not perform particularly well, but being so simple it gives a useful baseline;

• Witten-Bell discounting is a slightly more complex technique, frequently used in speech recognition. It is based on the idea of considering a zero-frequency feature as a feature that will occur for the first time. In this way the probability of the features seen only once is used to forecast the probability of the features not seen yet;

• Good-Turing discounting is a relatively complex technique which computes the probability mass assigned to zero-count features on the basis of the features with higher counts.

In this thesis we have presented only the general insight behind the three smoothing techniques, but Church and Gale (1991) and Witten and Bell (1991) are good references for a clear and detailed explanation of the last two techniques, especially how they distribute the probability mass among the unseen features.
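For illustration, add-one smoothing, the simplest of the three, can be sketched as follows; the vocabulary argument, standing for the full feature set including features unseen in the document, is an assumption of this sketch.

    def add_one_smooth(freqs, vocabulary):
        """Add-one (Laplace) smoothing: every feature in the vocabulary,
        seen or unseen, receives count + 1 before normalization."""
        n = sum(freqs.values()) + len(vocabulary)
        return {x: (freqs.get(x, 0) + 1) / n for x in vocabulary}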

Simple Good-Turing estimation (SGTE)

Since MLE underestimates the probability of unseen features, overestimating the probabilities of the features that actually occur in the documents, I also experimented with a smoothing technique which performs a more accurate computation of probabilities by assigning a non-zero probability to the unseen features as well: Simple Good-Turing estimation (SGTE) (Gale and Sampson, 1995). SGTE is a member of the family of Good-Turing estimates. It is easier to use than the other members previously described in the literature (Good, 1953; Church and Gale, 1991), but equally accurate in the calculation of smoothed probabilities.

The idea behind the Good-Turing techniques is to yield re-estimated frequencies for seen features (r*) and an estimate of the total probability of all the unseen features (P_0). r* and P_0 are calculated as follows:

r* = (r + 1) E(n_{r+1}) / E(n_r)

P_0 = E(n_1) / N

In the formulas above, N is the document size, n_r the number of different features that occur r times in the document¹, and E(n_r) is the expected frequency of the features which occur r times in the document. The expected frequencies are calculated as:

E(n_r) = Σ_{i=1..s} (p_i)^r (1 - p_i)^(N-r)

The list of observed frequencies r is characterised by having many gaps. For example, there can be one or more features which occur 624 times in a document but none which occur 623. It follows that r* cannot always be calculated, so there arises the need for a technique to transform n_r into a regular and continuous series.

Each Good-Turing estimate uses a different way to do this. Gale and Sampson (1995) have presented, step by step, the Simple Good-Turing estimator which we implemented and used in the experiments.
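The core Good-Turing re-estimation can be sketched as below. Note that this naive version uses the raw frequencies of frequencies n_r in place of the expectations E(n_r), so unlike SGTE it breaks down wherever n_{r+1} is zero; it is shown only to make the formulas concrete.

    from collections import Counter

    def turing_estimates(freqs):
        """Raw Turing re-estimates r* = (r+1) n_{r+1} / n_r and the
        total unseen-feature probability P0 = n_1 / N."""
        n = sum(freqs.values())              # document size N
        n_r = Counter(freqs.values())        # frequencies of frequencies
        r_star = {r: (r + 1) * n_r[r + 1] / n_r[r]
                  for r in n_r if n_r[r + 1] > 0}   # skip gaps in n_r
        p0 = n_r[1] / n
        return r_star, p0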

5.2.3 Producing corpus profiles

At the end of this second step a frequency list has been produced and normalised, with or without smoothing, for each single document in a corpus. It represents the document profile. It is now possible to create the corpus profile simply by aggregating the document profiles of all its documents. The resulting corpus profile is the set of the probability distributions which represent the documents in the corpus. It is represented in table 5.2 as a matrix where T_i represents a corpus document and f_j a feature in the corpus; n is the total number of documents in the corpus and m the total number of features in the corpus.

         T_1       T_2       ...   T_i       ...   T_n
f_1      p_1,1     p_1,2           p_1,i           p_1,n
f_2      p_2,1     p_2,2           p_2,i           p_2,n
f_3      p_3,1     p_3,2           p_3,i           p_3,n
...
f_j      p_j,1     p_j,2           p_j,i           p_j,n
...
f_m-1    p_m-1,1   p_m-1,2         p_m-1,i         p_m-1,n
f_m      p_m,1     p_m,2           p_m,i           p_m,n

Table 5.2: Corpus Matrix

Each cell of the matrix stores a value p_{j,i} which represents the probability of feature f_j occurring in document T_i. Being probabilities, the values p vary between 0 and 1. When MLE is used, some of the p values in the matrix may be zero. This happens when a feature of the corpus f_j does not occur at least once in every single document of the corpus: the cell (f_j, T_i), where T_i is a document in which the feature does not occur at all, contains a zero probability. The application of SGTE does not change this fact either, since SGTE does not tell us how to share the P_0 estimate among the unseen features, whose number is itself unknown. The probability distributions generated with MLE differ from those produced with MLE plus SGTE in that the sum of the probabilities of the former is equal to 1 (Σ p(x) = 1), while for the latter it is smaller than 1 (Σ p(x) < 1). It follows that the sum of all the values in each column of the matrix is equal to 1 when the frequency lists have only been normalised; when smoothing has also been applied, the sum of the values in each column is slightly smaller.

¹n_r is a 'frequency of frequencies' distribution.

5.3 Step three: feature selection

Some machine learning techniques, such as neural networks, have difficulty handling huge amounts of data, and the accuracy of every classification technique can be degraded if the training data contains noise. So it becomes desirable to reduce the initial feature space without losing classification accuracy.

Feature selection is a method that both removes non-informative features and combines lower level features into higher level ones. Typically it is applied in areas that prefer the use of content words as features to represent documents, such as content classification and IR. In fact, the set of content words differs in size from sets of other types of features - such as structural, derivative and superficial character-level features: while POS tags, characters, function or stop words are relatively small feature sets, usually with fewer than 1000 items, content words form a much bigger set; in a million word corpus there will be many thousands of content words. It follows that for content words the need to reduce the initial representation space is even stronger.

The simplest approach to feature selection is the filtering or thresholding approach. This uses the weight associated with each feature in the vector. The features are sorted according to their weights, and only the ones whose weights exceed a fixed threshold are kept. The threshold is fixed experimentally.

Two papers have described and compared several filtering approaches (Yang and Pedersen, 1997; Mladenic and Grobelnik, 1998). The results have suggested that 2% to 5% of features should be selected to get the best performance; this is in contrast with findings in IR, where common terms are considered less distinctive and useful for the retrieval process.

A more complex way to perform feature selection is the wrapper approach. While in the filtering approach the function used to compute the feature score is independent of the classifier used, the wrapper approach bases the selection of the features on a function that also produces the classification. In this case, the reduction of the feature space occurs at the same time as the classification process and not before, slowing down the entire process. Some wrapper approaches used for document classification are latent semantic analysis (Deerwester et al., 1990), class-based clustering (Brown et al., 1992), and distributional clustering (Baker and McCallum, 1998).

5.3.1 "The K most frequent fe atures" technique

The probability distributions obtained from the frequency lists through the application of Maximum Likelihood Estimation or Simple Good-Turing Estimation are characterised by a large number of rare features. The presence of rare features characterises the feature distribution of texts of any size. A characteristic of rare features is that their frequencies are not fully reliable for estimating the population probabilities². Moreover, the presence of rare features interferes with the computation of inter-document similarity, especially when certain inter-document similarity measures are used³.

²Baayen (2001) presents an overview of statistical techniques for the analysis of distributions with large numbers of rare features.

Another problem that can affect the computation of inter-document similarity is represented by features that do not occur in all the documents of the corpus, i.e. they occur in some columns of the matrix with a zero probability. As will be shown in section 5.4, many inter-document similarity measures compute a partial value for each feature in the distribution on the basis of the probabilities of that feature in the documents. But when a feature does not occur in both distributions it is not always obvious how to proceed (see section 5.4.4 for details).

Therefore, a simple feature selection method based on document frequencies has been applied. I refer to it as 'the K most frequent features' technique, where K is a positive integer to fix as a parameter at the beginning of the analysis. It selects the K most frequent features in the entire corpus: first it computes the frequency list of the whole corpus considered as one single document, then it orders the corpus frequency list in descending order, so that the most frequent features appear at the beginning of the list, and, finally, it selects the first K elements of the list. We chose this method because it is extremely simple, yet it has been shown to work well even in comparison with more complex techniques, especially in text classification (Yang and Pedersen, 1997; Mladenic and Grobelnik, 1998).
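A minimal sketch of the technique, operating on the per-document frequency lists:

    def k_most_frequent(doc_freq_lists, k):
        """Select the K most frequent features of the entire corpus,
        treating the corpus as one single document."""
        corpus_freqs = {}
        for freqs in doc_freq_lists:         # aggregate the document lists
            for x, f in freqs.items():
                corpus_freqs[x] = corpus_freqs.get(x, 0) + f
        ranked = sorted(corpus_freqs, key=corpus_freqs.get, reverse=True)
        return ranked[:k]

The reduced corpus matrix of section 5.3.2 then keeps only the rows corresponding to the selected features.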

The feature selection technique is applied independently of the type of feature analysed. However, the percentage feature reduction - i.e., the number of features eliminated as a consequence of applying the technique - depends on feature types. Function words and POS tags, which tend to be relatively small sets (a few hundred elements) independent of document and corpus size, are only marginally affected by feature reduction. On the contrary, other feature types - such as words, content words and lemmas - tend to be much bigger sets which grow with text size and might undergo a severe reduction. For example, in the experiments described in chapters 8 and 9, I have used corpora of 1,400,000 words and fixed K to 500. Since the total number of function words used in our experiment is 107, in the case of function words the entire set of features is used. As far as POS tags are concerned, the total number of labels is 572, so feature selection removes the 72 least frequent features. It follows that the percentage reduction for POS tags is always fixed at 12.59% when K = 500. For all the other feature types, since their number is not fixed and depends on text size (usually much bigger than 572), the percentage feature reduction is always bigger than 13%.

³For example, among the inter-document similarity measures chosen for the experiments of this thesis, χ² does not work properly with features that occur fewer than 5 times in the corpus.

Reducing the number of features also speeds up the computation of document similarity, since it reduces the number of elements that have to be analysed.

5.3.2 Reducing corpus profiles

Feature selection reduces the number of rows in the corpus matrix: from m, the total number of features in the corpus, to K, the 'K most frequent features in the corpus'. The reduced corpus matrix is represented in table 5.3.

         T_1       T_2       ...   T_i       ...   T_n
w_1      p_1,1     p_1,2           p_1,i           p_1,n
w_2      p_2,1     p_2,2           p_2,i           p_2,n
w_3      p_3,1     p_3,2           p_3,i           p_3,n
...
w_j      p_j,1     p_j,2           p_j,i           p_j,n
...
w_K-1    p_K-1,1   p_K-1,2         p_K-1,i         p_K-1,n
w_K      p_K,1     p_K,2           p_K,i           p_K,n

Table 5.3: Reduced Corpus Matrix

5.4 Step four: inter-document similarity measures

Daille (1995) and Kilgarriff (1996) have presented statistical measures that can be used to identify the features which differentiate one document from another. These are the measures that in the literature have also been used to compare corpora and to study corpus homogeneity and similarity. The best candidates - relative entropy and log-likelihood - are briefly described in this section. Because Kilgarriff (2001) has shown that chi-square outperformed the measures he compared, in the thesis experiments the homogeneity and similarity values computed using chi-square are reported as a baseline. Chi-square is also presented in the following section.

66 5.4.1 Chi Square

The Chi-Square measure (χ²) calculates, for each feature, how much its value differs from the maximum likelihood estimate (MLE) of the frequency based on the joint dataset. For each feature in the frequency list, it calculates the number of occurrences in each document that would be expected.

Suppose the sizes of documents A and B are respectively N_A and N_B, and feature w has observed frequency o_{w,A} in A and o_{w,B} in B; then the expected value e_{w,A} for A is:

e_{w,A} = N_A (o_{w,A} + o_{w,B}) / (N_A + N_B)

and likewise for e_{w,B} for document B. Then the χ² value for the entire document pair is computed by summing over all features w:

χ² = Σ_w [ (o_{w,A} - e_{w,A})² / e_{w,A} + (o_{w,B} - e_{w,B})² / e_{w,B} ]
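Applied to probability distributions, as in this thesis, the computation simplifies, since both 'sizes' equal 1 (see section 5.4.5); a minimal sketch:

    def chi_square_distance(p, q):
        """Chi-square distance between two probability distributions,
        represented as dicts over the selected feature set."""
        total = 0.0
        for x in set(p) | set(q):
            o1, o2 = p.get(x, 0.0), q.get(x, 0.0)
            e = (o1 + o2) / 2        # expected value, since N_A = N_B = 1
            if e > 0:
                total += (o1 - e) ** 2 / e + (o2 - e) ** 2 / e
        return total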

5.4.2 Log-likelihood

Dunning (1993) shows that Log-likelihood (G²) is a better approximation of the binomial distribution than χ² for events with frequencies smaller than 5. Therefore Log-likelihood is a measure that works quite well with both large and small feature frequencies, and allows the comparison of the significance of both common and rare events. Log-likelihood is calculated by constructing a contingency table as in table 5.4:

          Document A   Document B
w         a            b             a+b
not w     c            d             c+d
          a+c          b+d           a+b+c+d

Table 5.4: Contingency table

Note that the value a+c corresponds to the number of features in document A, while b+d is the number of features in document B. The values in the cells of the table (a, b, c and d) are the observed values o_i, and we need to calculate the expected values e_i according to the formula given for χ². Log-likelihood values are calculated according to the formula:

G² = 2 (a log a + b log b + c log c + d log d
        - (a+b) log(a+b) - (a+c) log(a+c)
        - (b+d) log(b+d) - (c+d) log(c+d)
        + (a+b+c+d) log(a+b+c+d))

The higher the G² value, the greater the difference between the two distributions.
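The contingency-table computation translates directly into code; the sketch below guards the x log x terms with the log(0) = 0 convention adopted in section 5.4.4.

    import math

    def xlogx(x):
        """x * log(x), with the convention 0 * log(0) = 0."""
        return x * math.log(x) if x > 0 else 0.0

    def log_likelihood(a, b, c, d):
        """G2 from the contingency table of table 5.4: a and b are the
        counts of feature w in documents A and B, c and d the counts of
        all the other features."""
        return 2 * (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
                    - xlogx(a + b) - xlogx(a + c)
                    - xlogx(b + d) - xlogx(c + d)
                    + xlogx(a + b + c + d))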

5.4.3 Relative entropy

Relative entropy, also known as Kullback-Leibler divergence, is the standard measure for calculating how similar two probability distributions over the same event space are. If p(i) and q(i) are the distributions which represent two documents, the relative entropy is calculated as follows:

D(p||q) = Σ_{i=1..n} p(i) log( p(i) / q(i) )

Relative entropy is always a non-negative quantity, and D(p||q) = 0 when p(i) = q(i). It is not defined for q(i) = 0, for which:

D(p(i)||q(i)) = p(i) log( p(i) / q(i) ) = x log(x/0) → ∞

so it is not applicable to probability distributions which contain zero frequencies⁴.

⁴When p(i) = 0, relative entropy is forced to be 0.
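A direct implementation of the measure; following footnote 4, terms with p(i) = 0 contribute nothing, while a zero in q where p is positive leaves the divergence undefined and is reported as an error here.

    import math

    def relative_entropy(p, q):
        """Kullback-Leibler divergence D(p||q) between two probability
        distributions over the same feature set (dicts)."""
        total = 0.0
        for x, p_x in p.items():
            if p_x > 0:                    # p(i) = 0 terms are skipped
                q_x = q.get(x, 0.0)
                if q_x == 0:
                    raise ValueError("D(p||q) undefined: q(i)=0, p(i)>0")
                total += p_x * math.log(p_x / q_x)
        return total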

68 5.4.4 Computing inter-document similarity

Since we want to measure the similarity of every pair of documents in the corpus, the measures are applied to each pair of distributions contained in the reduced corpus matrix. χ², G² and D(p||q) are sums over all the features in the distributions: for each of the K features, a partial value is computed on the basis of the two observed values stored in the matrix, i.e. the probabilities of the feature occurring in the documents. For χ² and G² the partial values are also calculated on the basis of the feature's expected values, computed according to the general formula:

e_{k,i} = N_i (o_{k,i} + o_{k,j}) / (N_i + N_j)

where e_{k,i} is the expected value of feature k in document i, o_{k,i} and o_{k,j} are the observed values of the feature k in documents i and j, N_i is the size of document i and N_j the size of document j.

When the partial values for each of the K most frequent features in the corpus have been computed, they are added together to give the inter-document similarity value s(T_i, T_j), which quantifies the similarity of the two documents T_i and T_j. This step ends with the production of an inter-document similarity matrix, like the one illustrated in table 5.5, where T_i represents a document of the corpus, k the total number of documents in the corpus and s_{T_i,T_j} the similarity between text T_i and text T_j. The inter-document similarity matrix provides the data for the subsequent computation of corpus homogeneity.

        T_1         T_2         T_3         ...   T_i         ...   T_k
T_1     0           s_T2,T1     s_T3,T1           s_Ti,T1           s_Tk,T1
T_2     s_T1,T2     0           s_T3,T2           s_Ti,T2           s_Tk,T2
T_3     s_T1,T3     s_T2,T3     0                 s_Ti,T3           s_Tk,T3
...
T_i     s_T1,Ti     s_T2,Ti     s_T3,Ti           0                 s_Tk,Ti
...
T_k     s_T1,Tk     s_T2,Tk     s_T3,Tk           s_Ti,Tk           0

Table 5.5: Inter-document similarity matrix
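Producing the matrix of table 5.5 is a double loop over the document distributions; a minimal sketch, parameterised by any of the three measures:

    def similarity_matrix(distributions, measure):
        """Inter-document similarity matrix: distributions is a list of
        per-document probability distributions, measure any inter-document
        measure (e.g. chi_square_distance)."""
        n = len(distributions)
        matrix = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    matrix[i][j] = measure(distributions[i], distributions[j])
        return matrix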

Since χ², G² and D(p||q) are all distance measures - i.e. they achieve their minimum when the two distributions are maximally similar - the most similar documents are identified by the smallest value, while the biggest value identifies the most dissimilar/distant documents.

In section 5.3.1 we said that not all the feature types are equally affected by the feature selection. For function words and POS tags the feature selection method leaves the tagset almost intact; it follows that the distribution similarity is only determined by differences in the frequencies of the features, which tend to occur in each document. By contrast, for other feature types the percentage of tagset reduction after the feature selection phase is relevant. In these cases the distribution similarity is determined by differences in the presence/absence of features, along with differences in frequencies.

Zero probability problem

The possible presence of zero probabilities in the reduced corpus matrix⁵ leads to problems in applying some inter-document measures.

The standard formula for relative entropy is:

D(p||q) = Σ_{i=1..n} p(i) log( p(i) / q(i) )

where p(i) and q(i) are the distributions (documents) for which we want to measure similarity.

When a feature contained in one of the two distributions has a zero probability, we have two possible scenarios.

If p(i) = 0 the formula becomes:

D(p(i)||q(i)) = p(i) log( p(i) / q(i) ) = 0 log(0/x)

⁵The 'K most frequent features' method reduces the presence of zero probabilities but does not guarantee their complete removal from the matrix.

which is undefined due to the log(0) term. We sidestep the problem simply by predefining log(0) = 0⁶. If q(i) = 0 the formula becomes:

D(p(i)||q(i)) = p(i) log( p(i) / 0 ) = x log(x/0) → ∞

which tends to infinity. This problem can be solved by computing the centroid for the entire corpus and adding it to each distribution before calculating the inter-document similarity⁷. The centroid can be considered as the 'average document' in the corpus: it is obtained by adding together all the distributions in the matrix and dividing the sum by the number of distributions. The corpus centroid is computed for only the K most frequent features in the corpus. Being an average, it does not contain any zero probabilities. If c(i) is the corpus centroid, the relative entropy is then calculated between the centroid-augmented distributions p(i) + c(i) and q(i) + c(i), neither of which contains any zeros.
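The centroid fix can be sketched as follows. How exactly the centroid is combined with each distribution is left implicit in the text, so the element-wise addition followed by renormalization used here is an assumption of the sketch.

    def corpus_centroid(distributions):
        """Average of all document distributions over the selected features."""
        n = len(distributions)
        features = set().union(*distributions)
        return {x: sum(d.get(x, 0.0) for d in distributions) / n
                for x in features}

    def add_centroid(p, centroid):
        """Add the centroid to a distribution and renormalise, so that no
        feature is left with a zero probability."""
        mixed = {x: p.get(x, 0.0) + centroid[x] for x in centroid}
        total = sum(mixed.values())
        return {x: v / total for x, v in mixed.items()}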

As far as Log-likelihood is concerned, the presence of zero probabilities is problematic since log(0) is undefined:

-2 log λ = 2 Σ_{i=1..n} o_i log( o_i / e_i )

As with D(p||q), we solve this problem by predefining log(0) to be zero.

⁶We acknowledge that log(x) → -∞ as x → 0, and we set it to zero for practical reasons.

⁷The problem of having zero probabilities in the q distribution has already been experienced and reported in the literature. Pereira, Tishby, and Lee (1993) handled it by using centroid clustering, a technique which does not compute similarity between two individual distributions, but only between a distribution and the average distribution, the centroid, which is guaranteed not to contain any zero features. Centroid clustering is an example of flat or partitioning clustering: it needs a set of initial cluster centres at the beginning. Then it performs several iterations during which it assigns each document to the cluster whose centre is closest. When all the documents have been assigned, the centroid of each cluster is recomputed and a new cycle of assignments is performed.

71 5.4.5 The symmetry of inter-document measures

None of the three inter-document similarity measures presented (χ², G² and D(p||q)) is symmetrical. D(p||q) is asymmetrical since its formula is affected by the order of the distributions employed:

D(p||q) = Σ_{i=1..n} p(i) log( p(i) / q(i) )  ≠  D(q||p) = Σ_{i=1..n} q(i) log( q(i) / p(i) )

χ² and G² are usually computed over counts, and their common formula for calculating expected values is asymmetrical since the quantities N_A and N_B are different.

However, when χ² and G² are applied to probability distributions, as in this thesis, or to objects of equal size, the measures become symmetrical, since N_A = N_B = 1:

e_{w,A} = 1 · (o_{w,A} + o_{w,B}) / (1 + 1) = e_{w,B}

Symmetry means that we do not distinguish between s_{T_i,T_j} and s_{T_j,T_i}, and that the inter-document similarity matrix presented in table 5.5 becomes symmetrical along its diagonal (as shown in table 5.6).

        T_1         T_2         T_3         ...   T_i         ...   T_k
T_1     0
T_2     s_T1,T2     0
T_3     s_T1,T3     s_T2,T3     0
...
T_i     s_T1,Ti     s_T2,Ti     s_T3,Ti           0
...
T_k     s_T1,Tk     s_T2,Tk     s_T3,Tk           s_Ti,Tk           0

Table 5.6: Inter-document similarity matrix when the similarity measure is symmetrical

72 5.5 Summary

This chapter has described in detail each of the four steps of the general algorithm identified in chapter 3 as the consensus way to compute document similarity (sections 5.1, 5.2, 5.3 and 5.4). For each step, the various possible techniques for realising the task have been surveyed and, among those, the technique chosen for this thesis has been described.

Moreover, the chapter has explained how the general algorithm has been adapted in the thesis. Sections 6.2.1 and 6.2.2 will respectively explain how to use the data contained in such a matrix to compute the homogeneity of each corpus and the similarity of two corpora.

Further details on the software implemented for the thesis to produce corpus profiles and calculate inter-document similarity can be found on the CD enclosed with the thesis. The software has been entirely implemented in Perl.

Chapter 6

Definitions and methodologies

Chapter 4 has presented the definitions of corpus homogeneity and similarity in terms of language varieties (or linguistic restrictions) that can be derived from the literature. It has also shown that the difficulty of determining what a language variety is makes such definitions extremely weak. In fact, instead of identifying the main characteristics of a prototypical homogeneous corpus or of two similar corpora, those definitions restrict themselves to identifying the notion represented by those particular objects. As a result, both definitions are circular and, therefore, responsible for the lack of proper gold standard judgements developed by humans to evaluate corpus homogeneity and similarity measures.

This chapter proposes two new definitions of corpus homogeneity and similarity in terms of performance of NLP applications (section 6.1). It also proposes two low-tech measures, one for homogeneity (section 6.2.1) and one for similarity (section 6.2.2), which use the general algorithm identified at the end of chapter 3 and applied to corpora in chapter 5. As previously mentioned, the measures proposed derive from the work of Kilgarriff (2001).

Since they are not affected by circularity, they allow the production of proper gold standard judgements. Section 6.3 therefore proposes a methodology for the evaluation of corpus homogeneity and similarity measures which compares homogeneity and similarity scores against the gold standard judgements that can now be produced following the new notions of homogeneity and similarity. As previously mentioned, the evaluation method derives from Sekine (1997)'s work.

6.1 Non-circular definitions for corpus homogeneity and similarity

Although it has not been previously formalised, the connection between the performance of NLP systems and the notions of corpus homogeneity and similarity is generally recognised by people who use language technology applications requiring corpora as input data 1 (e.g. information extraction and summarization). If the performance of an application, trained on one part of a corpus (training set) and tested on the other part (test set), is almost as good as the performance of the same application trained on a larger part of the same corpus, they are inclined to consider the corpus as homogeneous. Similarly, if an application trained on corpus A performs equally well when run on corpus B, corpora A and B are considered to be similar.

The work of Sekine (1997) on corpus similarity, which proposed the use of the accuracy of an NLP application as gold standard judgement for the evaluation of the similarity measure he was proposing, established a formal link between an application's performance and the definitions of corpus homogeneity and similarity that can be stated as follows:

• a corpus is homogeneous if an NLP application trained on part of the corpus and tested on the remaining parts performs well;

• corpora A and B are similar if an NLP application trained on corpus A and tested on corpus B (or vice versa) performs well.

These definitions, unlike the ones in terms of language varieties, make homogeneity and similarity both task-specific and application-specific. It follows that a corpus is homogeneous, and two corpora are similar, only with respect to a particular NLP task performed by an application, and the type of linguistic information considered in the task. For example, one corpus can in principle be homogeneous with respect to a document classification task and lexical information, but heterogeneous with respect to a parsing task and grammar.

1Section 2.6 has presented some works in which authors have pointed out that the performance of an NLP application depends directly on the homogeneity of the corpora employed (Biber, 1993; Illouz, 2000; Gildea, 2001; Roland and Jurafsky, 1998).

The definitions in terms of system performance also make homogeneity and similarity relative and not absolute quantities. For each task, homogeneity forms a scale - and not a set of homogeneous corpora opposed to the set of heterogeneous corpora - in which any corpus can be positioned. It follows that a corpus is more or less homogeneous than another corpus with respect to an NLP task. The same holds for similarity: corpus A is more or less similar to corpus B than corpus C with respect to the NLP task.

6.1.1 The performance of an NLP application

Accuracy is a standard measure for quantifying the performance of an NLP application. It is defined as the proportion of correct answers given by the system on the basis of the training received, and is calculated by comparing the system's answers against the correct ones. 2

Since corpus homogeneity and similarity have been defined in terms of the performance of NLP applications, accuracy represents the most accurate and effective measure for corpus homogeneity and similarity. While the definitions can be used directly to compute homogeneity and similarity, it is also useful to consider an indirect approach.

Although effective, accuracy is also the costliest measure for homogeneity and similarity, and its use is problematic for practical applications. In fact, the method for computing the system's accuracy requires a corpus containing the correct output of the system. Such corpora are expensive to build since they require that the tagging be manually checked. Therefore, the aim of this thesis is to propose measures which do not require such heavy conditions for their computation.

2Other commonly-used measures are precision, recall and F-measure. Similarly to accuracy, they can be calculated by constructing a 'contingency table' as follows:

              Predicted 1   Predicted 0
    True 1    a             b             a + b
    True 0    c             d             c + d
              a + c         b + d         a + b + c + d

Note that the value a + d corresponds to the number of answers returned by the system that are correct, while b + c is the number it returns that are wrong. According to the table above, accuracy (A) is defined as the proportion of correct items, A = (a + d)/(a + b + c + d); precision (P) is defined as the proportion of selected items the system gets correct, P = a/(a + c); recall (R) is defined as the proportion of target items the system gets correct, R = a/(a + b). F-measure is the harmonic mean of precision and recall, F = (2 * P * R)/(P + R).
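As an illustration of footnote 2, the four measures follow directly from the four counts; a minimal Python sketch, with invented values for a, b, c and d:

    # Hypothetical counts from a contingency table of the kind shown above:
    # a = true 1 / predicted 1, b = true 1 / predicted 0,
    # c = true 0 / predicted 1, d = true 0 / predicted 0.
    a, b, c, d = 80, 20, 10, 90

    accuracy  = (a + d) / (a + b + c + d)   # proportion of correct items
    precision = a / (a + c)                 # selected items that are correct
    recall    = a / (a + b)                 # target items that are recovered
    f_measure = (2 * precision * recall) / (precision + recall)
    print(accuracy, precision, recall, f_measure)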

Necessarily, such measures will not be as precise as accuracy, but the use of the accuracy values as gold standard judgements in the evaluation of the measures will allow us to understand to what degree those measures approximate accuracy.

The need for corpus homogeneity and similarity measures arises especially in the NLP community, which could benefit more from the use of simple and fast measures than from measures that involve the use of language technology applications 3. In fact, NLP researchers could use homogeneity and similarity measures in two different ways: directly, by using the measures to predict the performance a system could reach on an 'unknown' or 'new' corpus, and, indirectly, by using corpora that have been developed using those measures during the design process (to decide which documents to insert in the developing corpus).

As a result, this thesis proposes corpus homogeneity and similarity measures that can be implemented easily and quickly using frequency lists of document-internal features and inter-document similarity measures.

6.2 Corpus homogeneity and similarity measures

This section describes two low-tech measures for the quantification of corpus homogeneity and similarity. The measures are easily and quickly produced for corpora of any size.

The methodology for measuring homogeneity and similarity has four main steps:

• the automatic production of corpus profiles as a combination of document profiles, i.e. frequency lists of the document-internal features;

• the computation of the similarity of each pair of documents in the corpus, using document profiles as data, and using inter-document similarity measures;

• the computation of a homogeneity value for each corpus on the basis of the previously computed inter-document similarity values;

3NLP applications may be slow, costly, or hard to run. They may require corpora in a fixed format, or they may require annotations that are not contained in the corpus as it is distributed.

• the computation of a similarity value for each pair of corpora, using the inter-document similarity measures, but applying them to corpora, not to documents.

The first two steps have been described in sections 5.2.3 and 5.4.4 respectively. The next two sections present step three (section 6.2.1) and step four (section 6.2.2).
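As a concrete (and deliberately simplified) illustration, the first two steps might be sketched in Python as follows. The tokenisation is naive, and measure() stands in for any of the inter-document measures of section 5.4, applied to two profiles over a shared feature set; none of these function names come from the thesis software.

    from collections import Counter

    def document_profile(text):
        # Document profile: MLE relative frequencies of word features.
        words = text.lower().split()          # naive tokenisation
        counts = Counter(words)
        n = sum(counts.values())
        return {w: c / n for w, c in counts.items()}

    def similarity_matrix(profiles, measure):
        # Lower-triangular inter-document matrix, as in table 5.6:
        # one value s(i, j) for each unordered pair of documents.
        sims = {}
        for i in range(len(profiles)):
            for j in range(i):
                sims[(j, i)] = measure(profiles[j], profiles[i])
        return sims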

6.2.1 Computing corpus homogeneity

Once the inter-document similarity of each pair of documents in the corpus has been calculated, following the four steps described in chapter 5, corpus homogeneity can be computed as the average and standard deviation of the inter-document similarity values. The smaller the average distance and standard deviation among the documents in a corpus, the greater the homogeneity of the corpus, since its documents appear to be quite like each other as far as the distribution of their features is concerned.

Another way to calculate homogeneity is to use the maximum distance among documents. By using maximum distance we represent corpus homogeneity by fixing the worst possible case - the two most dissimilar documents.

Intuitively, the quantification of corpus homogeneity in terms of average and standard deviation is a better descriptor than the maximum distance, which could be perceived as a rounded-up value.
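Given the pairwise values, both variants of the homogeneity computation amount to a few lines. This sketch assumes sims holds the inter-document distance values of the matrix described above.

    import statistics

    def homogeneity_mean_sd(sims):
        # Corpus homogeneity as mean and standard deviation of the
        # inter-document distances; smaller values = more homogeneous corpus.
        values = list(sims.values())
        return statistics.mean(values), statistics.stdev(values)

    def homogeneity_max(sims):
        # Alternative: fix the worst case, the two most dissimilar documents.
        return max(sims.values())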

6.2.2 Computing corpus similarity

The methodology for computing corpus similarity is slightly different from the one used for computing corpus homogeneity since it uses entire corpora as objects and not corpus documents. Each corpus is treated as a single object and the inter-document measures are applied to each pair of corpora. We briefly describe how we vary each of the four steps for producing homogeneity to compute corpus similarity.

The first step - in which we decide the aspect to be analysed and the feature type to use for the representation - is unchanged. In the second step, instead of computing a frequency list for each document in the corpus, the corpus is considered as a single object - one big document - and a single frequency list is computed. We compute a single frequency list for each of the corpora we want to compare. The frequency lists are finally stored in a matrix similar to the one presented in table 5.2, the only difference being that each column represents a corpus and not a document.

In the third step, normalization (MLE) and optional smoothing (SGTE) are applied without changes to each frequency list previously produced, transforming them into probability distributions.

The reduction of features in step four is still based on feature frequencies. A frequency list, computed by summing all the corpus frequency lists, is ordered, and only the most frequent K features are kept and used as data. This time, the K most frequent features are computed over all the corpora considered in the analysis.

To compute corpus similarity, in contrast to the computation of corpus homogeneity, the calculation of either D(p||q), χ2, or G2 represents the final step of the method. The application of the inter-document similarity measures to each pair of corpus distributions directly produces the similarity value for those two corpora.
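A sketch of this corpus-level variant, again assuming a measure() function over two aligned probability distributions, and using a cut-off of K = 500, the value used in the chapter 7 experiment:

    from collections import Counter

    def corpus_frequency_list(documents):
        # The corpus is treated as one big document: a single frequency list.
        counts = Counter()
        for text in documents:
            counts.update(text.lower().split())
        return counts

    def corpus_similarity(corpus_a, corpus_b, measure, k=500):
        ca = corpus_frequency_list(corpus_a)
        cb = corpus_frequency_list(corpus_b)
        # Keep the K features that are most frequent over all corpora compared.
        top = [w for w, _ in (ca + cb).most_common(k)]
        na, nb = sum(ca.values()), sum(cb.values())
        p = [ca[w] / na for w in top]   # MLE normalization (no smoothing)
        q = [cb[w] / nb for w in top]
        return measure(p, q)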

Since in the computation of corpus similarity each corpus is considered as a single entity, if a corpus is not homogeneous its frequency list may not be representative of the corpus, so the quantification of its similarity with another corpus can be distorted. For example, a corpus frequency list may contain a feature that is particularly frequent in half of the documents, but not present at all in the other half. That particular frequency list will not be a good representative of the entire corpus because it does not describe the distinction between the two document sets. This corresponds to Kilgarriff's intuition (Kilgarriff, 2001) that corpus similarity can be calculated only among homogeneous corpora, and that the study of corpus homogeneity has to precede the study of corpus similarity. The computation of corpus similarity should always be preceded by one of corpus homogeneity, since, if corpora are very different in homogeneity, it will not be clear how to interpret the similarity score.

6.3 The evaluation method

As described in section 4.6, Sekine (1997) used parser accuracy values as gold standard judgements for evaluating the similarity measure he proposed. Here the logic behind this procedure is generalised to any NLP task which is performed by a supervised application and for which a correctly annotated corpus is available. Moreover, it is adapted for the evaluation of homogeneity measures as well. The method allows us to understand to what degree the measures that do not directly derive from the definitions of homogeneity and similarity approximate accuracy.

The method employs the accuracy values of the application as gold standard judgements and compares them with the homogeneity and similarity values obtained using the measures proposed. Naturally, the same method can be applied for the evaluation of any other corpus homogeneity and similarity measures.

In the next two sections the evaluation method is described step by step, firstly in relation to homogeneity measures (section 6.3.1), and secondly in relation to similarity measures (section 6.3.2).

6.3.1 Producing gold standard judgements for homogeneity

The process of computing the accuracy values to use as gold standard judgements for homogeneity has to be repeated for each corpus for which we have previously calculated the homogeneity score. There are five steps:

• divide the corpus into n slices of the same size;

• create two subcorpora, A and B, by randomly allocating m slices to A, with 0 < m < n, and the remaining slices to B;

• calculate the application accuracy using subcorpus A as training set for the NLP application and subcorpus B as test set;

• repeat the process with different random allocations of slices to subcorpora A and B;

• calculate the average and standard deviation of the accuracy values over all iterations.

The mean and standard deviation of the accuracy scores represent the gold stan­ dard judgement for the homogeneity of the corpus.
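In outline, the five steps reduce to the following loop; train_and_score is a hypothetical wrapper, standing in for whatever supervised application is being evaluated, that trains on one set of slices, tests on the other, and returns an accuracy value.

    import random
    import statistics

    def homogeneity_gold_standard(slices, m, iterations, train_and_score):
        # Average accuracy over repeated random splits of one corpus into
        # subcorpora A (m slices, training) and B (the rest, testing).
        scores = []
        for _ in range(iterations):
            shuffled = random.sample(slices, len(slices))
            a, b = shuffled[:m], shuffled[m:]
            scores.append(train_and_score(a, b))
        return statistics.mean(scores), statistics.stdev(scores)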

The set of gold standard judgements is then compared against the set of homogeneity values to quantify the degree of correspondence between them. We want to check whether the most homogeneous corpora according to our measures, i.e. the corpora with the smallest homogeneity values, are the corpora which scored the highest accuracy values.

The most appropriate test for this task is Spearman's rho (Owen and Jones, 1977). It ranks the corpora in each set according to their scores. Then it compares the two rankings and quantifies their correlation. Spearman's correlation is 1 when the two rankings are exactly the same, -1 when one ranking is the inverse of the other, and 0 when no relation is found between the two ranked lists.
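A minimal implementation, valid for rankings without ties (a library routine such as scipy.stats.spearmanr also handles ties), is:

    def spearman_rho(xs, ys):
        # Spearman's rank correlation for two equally long score lists, no ties.
        def ranks(values):
            order = sorted(range(len(values)), key=lambda i: values[i])
            r = [0] * len(values)
            for rank, i in enumerate(order):
                r[i] = rank + 1
            return r
        rx, ry = ranks(xs), ranks(ys)
        n = len(xs)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - (6 * d2) / (n * (n ** 2 - 1))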

6.3.2 Producing gold standard judgements for similarity

The process of computing the accuracy values to use as gold standard judgements for similarity has to be repeated for each pair of corpora, A and B, for which the similarity scores have been previously calculated. In this case the steps are:

• divide corpora A and B into n slices of the same size;

• train the application using the entire corpus A;

• test the application on one slice of corpus B;

• calculate the application accuracy;

• repeat the last two steps for each slice of corpus B;

• calculate the average and standard deviation of the accuracy values over all iterations;

• repeat the whole process using corpus B as training set and corpus A as testing set.

The process produces two means with their standard deviations of the accuracy scores, one for corpus pair A-B and the other for corpus pair B-A. They represent the gold standard judgements for the similarity of the corpus pairs A-B and B-A.

The set of gold standard judgements is then compared against the set of similarity values to quantify the degree of correspondence between them.

Again, the correlation between the two sets of values is analysed using Spearman's rho test.

6.4 Summary

This chapter presents the main contribution of the thesis. New definitions of corpus homogeneity and similarity are proposed. Homogeneity and similarity are identified in terms of the performance of NLP applications. Unlike the previous notions in terms of language variety, which were affected by circularity, these definitions are task- and application-dependent but allow the production of gold standard judgements for the evaluation of possible measures for corpus homogeneity and similarity.

The chapter has also presented a method for quantifying both corpus homogeneity and similarity using feature frequency lists and inter-document measures that can be applied to corpora of any size. The method represents a development of the work of Kilgarriff (2001) since it determines a way to compute homogeneity and uses several feature types. The objects used for the measurement of homogeneity are the documents in the corpus, while the objects used for similarity are entire corpora, each considered as a single big text. To quantify the homogeneity of a corpus, the average inter-document similarity and its standard deviation are computed over all the possible pairs of documents in the corpus. A homogeneity value is computed for each single corpus, and expresses how similar the distribution of the features of a particular type is among the documents in the corpus. To quantify the similarity of two or more corpora, the inter-document similarity measures are directly applied to each pair of corpora. Therefore, similarity is a value computed for each pair of corpora, and identifies how similar the distributions of features of a certain type are in the two corpora analysed.

The last important contribution is the method for evaluating homogeneity and similarity measures. The method, originally proposed by Sekine (1997) for evaluating similarity measures using a parser, has been generalised for evaluating measures of corpus homogeneity as well, using any supervised NLP application for which a corpus containing the correct output of the system is available. Producing corpus homogeneity and similarity measures is not particularly difficult in itself; however, showing whether a measure is valid for a particular task has been a much harder task up to now. This evaluation method should overcome the problem, allowing the creation and comparison of several (new) measures for the quantification of corpus homogeneity and similarity.

Both methods have been implemented in Perl.

The next two chapters describe two experiments which measure the homogeneity and similarity of various corpora, and compare the results with two sets of gold standard judgements obtained by studying the performance of two NLP applications over the same set of corpora.

In the first experiment (chapter 7) a text classification system is run on selected portions of the British National Corpus (BNC 4) containing subcorpora of different subject matters, classified by experts. Each subcorpus is divided into training and test sets. Then the accuracy achieved by the system in classifying each subset is compared with the degree of homogeneity that each subcorpus scores using the proposed measure. The expectation is that the accuracy for more homogeneous subcorpora is higher.

In the second experiment (chapter 8) a supervised POS tagger is run on a set of corpora derived from the BNC Sampler 5. I then compute the accuracy that the system achieves on each corpus when trained on part of itself or on other corpora. Finally, the homogeneity of each corpus and the similarity of each pair of corpora are computed using the proposed measures. For the evaluation of the homogeneity measure, the homogeneity values are compared with the POS tagger accuracy values obtained using different parts of the same corpus in the training and testing phases. Similarly, for the evaluation of the similarity measure, the similarity values are compared with the accuracy values obtained using one corpus for training and another for testing. The expectation of this second experiment is to find that the accuracy of the tagger is higher when more homogeneous or more similar corpora are used.

4See http://www.natcorp.ox.ac.uk/ 5See http://www.natcorp.ox.ac.uk/getting/sampler.html

Chapter 7

Exploring corpus homogeneity using a text classification application

The experiment described in this chapter was designed to test the hypothesis that a classifier achieves a high level of accuracy when trained and tested on different parts of a homogeneous corpus.

The system used for the first experiment is Rainbow (McCallum, 1996), which performs text classification. Rainbow was chosen because it is freely available, fast, and does not require any particular annotation or linguistic resource other than the corpus itself. To build its model Rainbow performs text analysis using either all words or content words. In this experiment, the evaluation of homogeneity and similarity measures was restricted to just these two internal feature types.

The four main steps of the experiment are:

• collection of a set of corpora for which a reliable classification is available (see section 7.1);

• measurement of the homogeneity of each corpus using the measures described in chapter 6 (see section 7.2);

• computation of Rainbow accuracy over each corpus to use as a gold standard (see section 7.3);

• computation of the correlation between the homogeneity values and Rainbow accuracy values (see section 7.4).

Section 7.5 comments on the results of the experiment.

7.1 Data for the Rainbow experiment: the British National Corpus

The corpus used for this first experiment was the British National Corpus (BNC) (Aston and Burnard, 1998). The BNC is a 100 million-word collection of samples of written and spoken language, from a wide range of sources, designed to represent a wide cross-section of current British English (i.e. it is a monolingual synchronic corpus). Moreover, it is a general corpus which includes many different language varieties, and is not limited to any particular subject field, genre or register. There is a lot of work on the classification of BNC documents. The BNC Index (Lee, 2001) is an attempt to combine and consolidate some of these suggestions. The result is a resource which provides an accurate classification of the documents in the BNC, according to many different kinds of external criteria such as medium, domain and genre. Regarding medium, BNC documents can be classified into six different classes: spoken, written-to-be-spoken, book, periodical, published miscellanea, and unpublished miscellanea. For domain, spoken English can be classified into five classes (e.g., transcription of business recordings, spontaneous natural conversations), and written English into nine (e.g., applied science, arts, belief and thought). As far as genres are concerned, there are 24 genres for spoken and 46 genres for written English. Among the genres for written English there are, for example, personal letters, university essays, tabloid newspaper, bibliographies and instructional texts. 1

Because the BNC classes differ in size, to avoid comparing classes whose size is too dissimilar, the following procedure was applied in this experiment:

• each BNC document was divided into chunks of a fixed size. For the first experiment, chunks of 20,000 words were produced. If a document was too small, it was discarded. If it was big enough to contain more than one chunk,

1See http://www.natcorp.ox.ac.uk/what/balance.html for the full classification of the BNC documents

multiple chunks were produced, and considered as individual documents in the subsequent analysis;

• for each BNC class from medium, domain and genre, a corpus was created with the same number of chunks. For the experiment, corpora of 20 chunks each were developed. If a class did not contain enough documents, it was discarded; otherwise 20 chunks were chosen randomly.

This gave 51 subcorpora with 20 documents of 20,000 words each. Three random subcorpora, made of 20 chunks chosen randomly from the BNC, were also created. Random subcorpora were expected to be less homogeneous than all the other subcorpora. This gave 54 subcorpora in total.
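In outline, and assuming each document is already available as a list of words (the actual BNC preprocessing is not shown), the chunking procedure might look like this:

    import random

    CHUNK_SIZE = 20000      # words per chunk in this experiment
    CHUNKS_PER_CLASS = 20   # chunks per subcorpus

    def fixed_size_chunks(words, size=CHUNK_SIZE):
        # Split one document into full chunks; an undersized tail is discarded.
        return [words[i:i + size]
                for i in range(0, len(words) - size + 1, size)]

    def build_subcorpus(class_documents):
        pool = [chunk for doc in class_documents
                      for chunk in fixed_size_chunks(doc)]
        if len(pool) < CHUNKS_PER_CLASS:
            return None                          # class too small: discarded
        return random.sample(pool, CHUNKS_PER_CLASS)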

7.2 Measuring homogeneity

The homogeneity of each of the 54 subcorpora produced for this experiment was computed according to the methodology described in section 6.2.1. In this experiment the homogeneity of each corpus was computed as the 'maximum distance' among its documents, and only the normalization technique was applied.

For each corpus, a homogeneity value was computed using a feature type (all words or content words) and an inter-document similarity measure (D(p||q), χ2 or G2), giving 2 * 3 = 6 different measures. This phase ended up producing 6 sets of 54 homogeneity values each. Then, for each set of values, the corpora were ranked in increasing order according to the homogeneity values scored. Therefore more homogeneous corpora, according to the measure, appear at the beginning of the list.

The six sets of homogeneity values are reported in appendix C.1. Table 7.1 shows a small extract of one set of homogeneity values obtained using "all words" as feature type. The values obtained with D(p||q), χ2 and G2 are reported just for the first and last five corpora in the ranked lists.

Subcorpus                  D(p||q)   χ2       G2
g-W_news_script            0.0489    0.0304   0.0663
g-W_newsp_tabloid          0.0948    0.0640   0.1503
g-W_newsp_other_report     0.1208    0.0817   0.1951
g-W_newsp_other_sports     0.1335    0.0750   0.1756
g-W_hansard                0.1459    0.0950   0.2283
...
d-W_app_science            0.6165    0.2973   0.7737
m-m_unpub                  0.6502    0.3711   0.9649
g-W_misc                   0.6572    0.2818   0.7040
g-W_advert                 0.7201    0.2949   0.7519
random2                    0.9434    0.4200   1.0906

Table 7.1: Homogeneity scores computed using the 500 most frequent words in each corpus

7.3 Producing the gold standard judgements using Rainbow

Rainbow was used for computing the accuracy values that were subsequently used as gold standard judgements for the homogeneity measures.

Firstly, the 54 subcorpora were merged to form one big corpus, made of 1080 documents (20 documents from each subcorpus).

Subcorpus (Class)          Accuracy % (st. dev.)
g-W_hansard                97.6 (4.31)
g-W_newsp_tabloid          83.4 (7.98)
g-W_ac_medicine            85.4 (8.85)
g-W_newsp_other_sports     92.6 (8.99)
g-W_news_script            71.2 (20.06)
...
m-periodical               0 (0)
m-book                     0.6 (2.39)
random3                    0 (0)
random2                    0 (0)
random1                    0.6 (2.39)

Table 7.2: Rainbow accuracy values obtained on homogeneous and heterogeneous subcorpora, analyzing all words, and using training sets of 10 documents per class. Standard deviation values are given in brackets.

Secondly, Rainbow was run to see how accurately it could recover each of the 54 subcorpora. Rainbow was instructed to classify the new big corpus using all words first and then just content words. The task for Rainbow was to rebuild from the merged corpus all the corpora it was made of. The set size used was 10 documents for each class. So, the task for Rainbow was to recover the other 10 same-class documents out of the pot of 1,080 documents.

Finally, for each of the 54 subcorpora, Rainbow was used to compute the accuracy, the proportion of correctly classified documents, and the standard deviation calculated over 50 trials. This phase ended up producing two sets of gold standard judgements, one for each of the feature types used, all words and content words.

According to the initial hypothesis, the classifier was expected to achieve a high level of accuracy when trained and tested on different parts of a homogeneous corpus and, conversely, worse accuracy values if a corpus was heterogeneous. So the accuracy of the categorization of each class was computed and then classes were ranked in descending order 2, so that the homogeneous ones - the corpora with a high accuracy value - appear at the beginning of the list. For each of the two feature types, a ranked list of Rainbow accuracy values was produced.

The two ranked lists are presented in appendix C.2. Table 7.2 shows an extract of the Rainbow accuracy values, which were obtained using all words, for the five top- and five bottom-ranked corpora.

7.4 Results

Table 7.2 shows that the accuracy values scored by the random subcorpora were always very low, but not much lower than other subcorpora. This fact confirmed the expectation about the heterogeneity of the random subcorpora, but also showed the heterogeneity of many other subcorpora that were expected to be more homogeneous. Also in the ranking of the homogeneity measures, based on inter-document distance (presented in table 7.1), random corpora all appear somewhere at the bottom of the ranked list. Therefore many of the corpora produced on the basis of the BNC classification were not homogeneous with respect to the classification

2Homogeneity values are distance values, i.e. small values correspond to more homogeneous subcorpora, while accuracy values are not distances, so higher accuracy scores indicate more homogeneous subcorpora.

task.

The lack of homogeneity of many subcorpora was also pointed out by the generally high values of the standard deviation. A high standard deviation value associated with a high accuracy value testified to the presence in the homogeneous subcorpus of one or more outlying documents, i.e. documents that differ considerably from the subcorpus's prototypical document. A high standard deviation value linked to a low accuracy value 3 testified that the accuracy figure should be considered to be just noise and the subcorpus was extremely heterogeneous.

As mentioned in section 6.3, Spearman's rho test was used to compare the ranks obtained using the homogeneity measures and Rainbow accuracy values, and to quantify the correlation of the two ranks. Each of the three sets of homogeneity values obtained using all words was compared against the set of accuracy values obtained using the same feature type. Similar comparisons were then performed between the ranks obtained for content words. The results, presented in table 7.3, are always within the significance level of 95% when all words were used as features. In fact the correlation values are greater than 0.306, the critical value when there are more than 30 degrees of freedom (in this case there are 53 degrees of freedom). This did not hold for content words, for which the correlation values were smaller than the critical value.

Rainbow features   Homogeneity measure   Spearman's correlation
all words          D(p||q)               0.432
all words          χ2                    0.451
all words          G2                    0.451
content words      D(p||q)               0.273
content words      χ2                    0.269
content words      G2                    0.281

Table 7.3: Correlations between Rainbow accuracy values and homogeneity values

3For some of the corpora in table 7.2, the standard deviation values are even larger than the difference between the accuracy value and zero (e.g. the m-book corpus, whose accuracy is 0.6 while the standard deviation is 2.39).

7.5 Discussion

The correlation between homogeneity and accuracy for all words confirmed the initial hypothesis that the classifier performs better when trained and tested on homogeneous corpora.

None of the three document-similarity measures used to compute homogeneity produced a ranking which follows exactly the order identified by Rainbow. But the constantly high standard deviation values (reported in brackets in table 7.2) suggested that the rank order identified by Rainbow is not fixed. G2 and D(p||q) seemed to perform equally well and both outperformed χ2. This experiment used chunks of 20,000 words, and 20 chunks for each subcorpus. Assuming the reason for the unclarity of the results was the lack of data, the use of chunks of 50,000 words and subcorpora made of 50 chunks each was also tried. But there were few BNC classes containing this much data, and they all appeared to be quite heterogeneous.

Another reason for the unclarity of the results could be the use of Rainbow as a gold-standard judgement for homogeneity. In fact, for classifying documents any system uses a mixture of homogeneity and similarity. The training set of a classifier contains a set of class labels, and for each label a set of documents that belong to that particular class. The system classifies all the documents contained in the test set according to the classes presented in the training set. If one or more of the documents in the test set do not properly belong to any particular class, the classifier forces them into the class that is least dissimilar. It follows that the attempt to use text classification applications to evaluate homogeneity measures can be misleading.

7.6 Conclusions

The experiment described in this chapter represented the first attempt to use the performance of an NLP application for the evaluation of homogeneity measures. The hypothesis tested was partially confirmed by the experimental results, and the problems experienced were probably due to the nature of the application task and the corpus available. As a general result it is possible to assert that the homogeneity measures proposed provide a fair estimate of the homogeneity of corpora in relation to the document classification task.

Chapter 8

Exploring corpus homogeneity and similarity using a part of speech tagger

The experiment described in this chapter was designed to test two parallel hypotheses:

• a tagger performs better when trained and tested on homogeneous corpora;

• it performs better when trained on corpus A and tested on corpus B, if A and B are similar.

The NLP application used for this second experiment is a part of speech tagger, originally developed by Elworthy (1994), that is distributed as part of the RASP system 1 (Briscoe and Carroll, 2002). The RASP tagger was chosen because it is freely available for non-commercial use 2, it is easily trained, and it is capable of achieving high levels of accuracy. Also, it is a bigram tagger, so the amount of training data needed is less than for a trigram tagger. Moreover, it can be easily deployed on varied corpora containing different language varieties.

The RASP tagger implements various algorithms for the stochastic assignment of part of speech (POS) labels to text, based on a Hidden Markov Model. The tagger uses the Baum-Welch re-estimation algorithm during tagging to assign probabilities

1Robust Accurate Statistical Parsing. 2See http://www.cogs.susx.ac.uk/lab/nlp/rasp/.

to tags. I also used the tagger's statistically-based unknown word guessing module (Piano, 1997), which improves the accuracy of the tag assignments in the presence of words that do not occur in the training corpus. The model collects statistics from the lexicon on the orthography of the start and end of words; unknown words are matched against this information. In the experiment the number of leading and trailing characters considered was fixed at 3.

Since a POS tagger is used in this experiment, the evaluation of the homogeneity and similarity measures is restricted to the three feature types most closely related to syntax: function words 3, POS tags and POS tag bigrams.

The experiment has four main steps:

• the collection of a set of corpora, each containing a different language variety. Each corpus needs to have a manually checked POS tag annotation to allow the computation of the accuracy of the tagger (see section 8.1);

• the computation of the homogeneity of each corpus (see section 8.2) and the similarity of each pair of corpora (see section 8.3). Both homogeneity and similarity were calculated using the measures proposed in chapter 6;

• the computation of the accuracy of the RASP tagger to use as gold standard judgments. The accuracy values employed for the evaluation of the homogeneity measure were computed using parts of the same corpus for both the training and the testing phase, while the accuracy values employed for validating the similarity measure were computed using an entire corpus for training and parts of another for testing (see section 8.4);

• the comparison of the homogeneity and similarity measures against the relative gold standard judgments, using Spearman's rho test (see sections 8.5 and 8.6).

8.1 Data

Three main requirements governed the decision about the data source for this second experiment: first, the availability of a correct part of speech annotation to use for training the tagger and testing its performance by computing the system

3The list of function words used in the experiment is presented in appendix B.

accuracy; second, the presence of corpora of different language varieties, which could possibly differ in homogeneity and similarity; third, the corpus size, which should be big enough so that the tagger has enough training data.

A data source that fulfils these requirements was the BNC Sampler Corpus, a subcorpus of the BNC of approximately two million words. Although relatively small, the BNC Sampler - like the whole BNC - contains both written and spoken English and a wide set of language varieties. It was designed to have a more equal balance of spoken and written material than the whole BNC, and within each broad division the proportion of language variations in the whole BNC was maintained. As far as the POS annotation is concerned, the Sampler corpus was manually checked, and corrected where necessary. This is in contrast with the whole BNC, which was word-class tagged automatically without manual checking. The tag set used in the BNC Sampler is presented in appendix D. The RASP tagger was forced to employ the same tagset for this experiment so that it was possible to compare the stochastic annotation with the manually checked one for computing the accuracy of the system.

The aim was to build several corpora containing only one language variety each, so that it was possible first to measure their homogeneity and, secondly, to compare them in pairs to measure their similarity. The corpus design process relied entirely on the BNC Sampler classification: each corpus was made of documents belonging to the same BNC Sampler class. As previously mentioned, both written and spoken varieties were considered in this experiment. Among the written English texts nine classes were used: imaginative, pure science, applied science, community and social science, world affairs, commerce and finance, belief and thought, art and leisure. Six classes were also used for spoken English: educational, business, leisure, public institutional and demographic.

Each class had a different size, which varied between the 29 thousand words of the Social Science corpus and the 277 thousand words of the World Affairs corpus. 4 Since the world affairs class was particularly big and contained a subset of articles from "The Guardian" of significant size, the Guardian articles were separated from the other 'world affairs' articles, and a further, hopefully highly homogeneous, class was created. The aim was to produce a corpus to use in the experiment from each class.

4See http://www.comp.lancs.ac.uk/ucrel/bnc2sampler/sampler.htm for the full classification of the documents of the BNC Sampler.

POS tagger developers suggested, for general use and the CLAWS tagset, training the tagger on a corpus of 150,000 words or more (Carroll, p.c.). The size of the classes in the BNC Sampler corpus obliged me to slightly reduce this value in order to have a sufficient number of corpora to use in the experiment. Balancing the number of corpora and corpus size, corpus size was fixed at 140,000 tokens.

All classes containing less than 140,000 words were discarded. For the other classes the following procedure was followed:

• all documents of the same class were merged in a random order to create one big corpus;

• each big corpus was divided into 10 slices of about 14,000 words, and the excess text was discarded. The 10 slices were grouped together, forming a new corpus employed in the experiment.

Since the POS tagger works best on documents made up of complete sentences, I made sure each slice started with the first word of a sentence, and ended when the sentence containing the 14,000th word finished. As a result, corpus slices were not exactly the same size, but usually varied between 14,000 and 14,020 words. However, the improvement in the performance of the tagger was more important than small differences in corpus size. Moreover, the homogeneity and similarity measures can overcome these small size differences.
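In outline, assuming the corpus is already available as a list of sentences, each a list of tokens (the sentence segmentation itself is not shown), the slicing might look like this:

    SLICE_SIZE = 14000   # approximate tokens per slice

    def slice_corpus(sentences, size=SLICE_SIZE, n_slices=10):
        # Cut the corpus into slices of roughly `size` tokens, always closing
        # a slice at the end of the sentence containing the size-th token.
        slices, current, count = [], [], 0
        for sentence in sentences:
            current.append(sentence)
            count += len(sentence)
            if count >= size:
                slices.append(current)
                current, count = [], 0
        return slices[:n_slices]   # excess text is discarded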

The procedure ended up with seven different corpora: Guardian, World Affairs, and Imaginative for written English, and Business, Leisure, Public, and Demographic for spoken English.

8.2 Measuring corpus homogeneity

The homogeneity of each of the seven corpora was computed according to the measures described in chapter 6. This experiment used three different feature types (function words, POS tags, and POS tag bigrams), the three inter-document measures (D(p||q), χ2, and G2), and two methods of normalization (with and without smoothing), making 18 conditions in all.

At the end of this phase, 18 different sets of values for homogeneity were obtained. They are all reported in appendix E.1. Table 8.1 presents, as an example, one of the 18 sets: the homogeneity values were computed using POS tags, the G2 inter-document measure, and normalization without smoothing.

Corpus   Homogeneity value   Standard deviation
G        0.0188              (0.0049)
WA       0.2026              (0.1071)
I        0.1208              (0.0547)
D        0.0763              (0.0441)
B        0.0732              (0.0245)
L        0.2280              (0.2424)
P        0.0797              (0.0254)

Table 8.1: Homogeneity values and standard deviations obtained using POS tags and G2.

Each corpus in table 8.1 is identified by its initial, so "G" stands for Guardian, "WA" for World Affairs, "I" for Imaginative, "D" for Demographic, "B" for Business, "L" for Leisure and "P" for Public. The same abbreviation scheme is used throughout the chapter.

8.3 Measuring corpus similarity

The similarity of each of the 21 pairs of corpora was computed using the measures described in chapter 6. For each pair, as in the case of homogeneity, there were 18 combinations of feature type, inter-document measure and normalization method, and hence 18 sets of similarity values. All the similarity values are reported in appendix E.2.

As mentioned in section 5.4.5, both χ2 and G2 are symmetrical measures, i.e. the distance of corpus X from corpus Y computed with these measures is equal to the distance of corpus Y from corpus X. It follows that, when using χ2 and G2, each pair of corpora gets a single similarity value. Table 8.2 presents, as an example, the similarity values obtained using POS tags, G2 and normalization without smoothing.

     G        WA       I        B        L        P        D
G
WA   0.0508
I    0.2159   0.2341
B    0.3297   0.3124   0.2559
L    0.3732   0.3454   0.2340   0.0546
P    0.3002   0.2811   0.2384   0.0295   0.0542
D    0.5857   0.5752   0.2832   0.1612   0.1168   0.1911

Table 8.2: Similarity values obtained using POS tags and G2

By contrast, relative entropy is asymmetrical, i.e. D(p||q) produces different results from D(q||p). So, when relative entropy is employed, both D(p||q) and D(q||p) have to be computed for each pair of corpora. As a result, each corpus pair is associated with two different values. These values should be kept separate for reasons that will become clear later in the chapter. So, each set of relative entropy similarity values is made of two subsets. Table 8.3 presents the similarity values obtained using POS tags, relative entropy and normalization. The values contained in table 8.3 were stored in two subsets called "subset A" and "subset B". Subset A contained the 21 values placed above the diagonal in the table, while subset B contains the remaining 21 values. Each of the 6 sets of similarity values which used relative entropy is made of a subset A and a subset B.

     G        WA       I        B        L        P        D
G             0.0248   0.1050   0.1545   0.1820   0.1403   0.2820
WA   0.0238            0.1126   0.1544   0.1737   0.1386   0.2824
I    0.0962   0.1054            0.1130   0.1310   0.1042   0.1556
B    0.1435   0.1373   0.1174            0.0264   0.0141   0.0780
L    0.1614   0.1492   0.1046   0.0264            0.0255   0.0568
P    0.1313   0.1232   0.1100   0.0142   0.0264            0.0912
D    0.2505   0.2479   0.1276   0.0747   0.0555   0.0866

Table 8.3: Similarity values obtained using POS tags and D(p||q)

8.4 Producing the gold standard judgments using the RASP tagger

Since a correctly annotated corpus was available, the accuracy of the RASP tagger could be calculated by comparing its stochastic assignments to those present in the corpus. As previously explained, the evaluation process is based on the hypothesis that the accuracy of a system can be used as gold standard judgment for our homogeneity and similarity measures. A corpus is assumed to be homogeneous if the tagger trained on part of it achieves good performance when tested on the remaining part. Similarly, two corpora are assumed to be similar if the tagger achieves high accuracy when trained on one of the two corpora and tested on the other.

So, in this phase, the accuracy values achieved by the RASP tagger were computed on each corpus. Different training sets made from data belonging either to the same corpus or to other corpora were used. More precisely, the accuracy values scored by the RASP tagger when the training and testing sets belong to the same corpus represented the gold standard judgments for evaluating the homogeneity measures. The accuracy values scored when the training and testing sets came from two different corpora represented the gold standard values to evaluate the similarity measures.

The accuracy values, achieved on each corpus using part of the corpus for training and the remaining part for testing, were computed using cross-validation, a common method for estimating a system's accuracy. This method is based on resampling, and is therefore particularly useful when the data sets are small.

8.4.1 Using cross-validation for producing gold standard judgments

The 10-fold cross-validation method was employed to produce gold standard judgments for homogeneity. It was applied to each of the seven corpora in the experiment. For each corpus, the tagger was run 10 times. Each time it was trained using nine of the ten slices into which the corpus was divided; the omitted slice was employed as a testing set. Each time, the accuracy of the tagger was computed. Then, the average and the standard deviation of the 10 accuracy values were calculated. The two values (average accuracy and its standard deviation) represent the gold standard judgments for the homogeneity measure computed for that particular corpus. The same procedure was repeated for each corpus. The gold standard judgments for homogeneity are presented in table 8.4; appendix E.3 reports extensively the different levels of accuracy computed by the RASP system for each corpus.

Corpus   Accuracy value   Standard deviation
G        95.267%          (0.83)
WA       92.569%          (0.86)
I        92.563%          (2.62)
B        93.397%          (0.73)
L        93.318%          (1.47)
P        93.449%          (1.03)
D        93.646%          (0.52)

Table 8.4: Accuracy values obtained using the 10-fold cross-validation method for each of the seven corpora.

According to the previous hypothesis, corpora with a high accuracy score were expected to be more homogeneous and, conversely, corpora associated with a smaller accuracy score to be less homogeneous. Section 8.5 presents the evaluation process, and shows to what extent our expectations were confirmed by the results.

A variant of 2-fold cross-validation was used to produce gold standard judgments for similarity. The RASP tagger was trained twice. The first time, it was trained on one whole corpus (using all ten slices to build the model). Then, using the model thus obtained, the tagger was tested ten times. Each time, the testing set was made from one of the ten slices into which the other corpus was divided. The accuracy was calculated and, finally, the mean accuracy and the standard deviation were computed to use as a gold standard for the similarity value.

Since tagger performance depends on which corpus in each pair was used for training, the same process had to be repeated swapping the training and the testing corpus. So, the second time, the tagger was trained using the corpus previously employed for the testing phase. Again, the tagger was tested 10 times, each time using one of the slices of the corpus previously used for training. Finally, the mean accuracy and standard deviation were calculated. Table 8.5 reports the gold standard judgments for similarity, while appendix E.4 reports extensively the different levels of accuracy computed by the RASP system.
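Schematically, with the same hypothetical train_and_score wrapper as before (it trains on a list of slices and returns the accuracy on one test slice):

    import statistics

    def similarity_gold_standard(slices_a, slices_b, train_and_score):
        # Train on the whole of one corpus, test on each of the other's ten
        # slices, then swap the roles of the two corpora.
        results = {}
        for train, tests, label in ((slices_a, slices_b, "A->B"),
                                    (slices_b, slices_a, "B->A")):
            scores = [train_and_score(train, s) for s in tests]
            results[label] = (statistics.mean(scores),
                              statistics.stdev(scores))
        return results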

The indices ta and te which appear in table 8.5 identify whether a corpus was used for training (ta) or testing (te). For example, the value that appears in the cell (Lte, Ita) is the average accuracy scored by the RASP tagger when the Imaginative corpus was the training set and the Leisure corpus was the testing set. In each cell the accuracy value (with the % sign) and the standard deviation (in brackets) are reported.

       Gta        WAta       Ita        Bta        Lta        Pta        Dta
Gte               89.366%    84.64%     82.662%    87.041%    83.077%    82.823%
                  (0.41)     (0.43)     (0.39)     (0.86)     (0.68)     (0.52)
WAte   91.364%               86.027%    83.645%    86.381%    84.783%    82.921%
       (0.81)                (2.66)     (3.7)      (1.61)     (3.28)     (2.09)
Ite    90.271%    90.303%               84.777%    81.35%     82.015%    83.799%
       (2.25)     (1.90)                (2.44)     (2.73)     (2.38)     (3.06)
Bte    86.674%    86.452%    87.202%               92.844%    92.882%    90.673%
       (1.43)     (1.38)     (0.96)                (0.75)     (0.81)     (1.45)
Lte    85.614%    85.187%    87.784%    91.286%               92.463%    91.888%
       (2.476)    (2.87)     (2.29)     (1.33)                (1.32)     (1.20)
Pte    87.682%    87.937%    88.183%    92.288%    92.827%               90.692%
       (1.31)     (1.37)     (1.25)     (0.69)     (0.36)                (0.71)
Dte    81.826%    82.099%    86.824%    90.719%    92.197%    90.553%
       (1.74)     (1.35)     (0.78)     (0.75)     (0.77)     (0.9)

Table 8.5: Accuracy values obtained using a variant of the 2-fold cross-validation method for each pair of corpora.

The reason for dividing the table into four quadrants is explained in the next section.

Corpus pairs with high accuracy values were expected to be more similar to each other than corpus pairs with low accuracy values. Section 8.6 shows to what extent these expectations were confirmed by the data.

8.4.2 Similarity between written and spoken English

Similarity values were measured for pairs of individual corpora but the broad classification of written vs. spoken English was not considered.

The analysis of the differences between written and spoken English was the main goal of several pieces of work, in linguistics first and in NLP subsequently (see Biber (1988) for a survey of the main approaches). Following the outcomes of the analysis of variation between written and spoken English, corpora belonging to the same class (written or spoken) were assumed to be more similar to one another than corpora from different classes. Considering the distinctions between written and spoken corpora, table 8.5 was divided into four quadrants so that:

• the first quadrant (top left corner of table 8.5) contains the accuracy values scored when the tagger was trained and tested on corpora belonging to the written classes;

• the second quadrant (top right corner of table 8.5) contains the accuracy values scored when the tagger was trained on spoken corpora and tested on written ones;

• the third quadrant (bottom left corner of table 8.5) contains the accuracy values scored when the tagger was trained on written corpora and tested on spoken ones;

• the fourth quadrant (bottom right corner of table 8.5) contains the accuracy values scored when the tagger was trained and tested on corpora belonging to the spoken classes.

The accuracy values in quadrants one and four were expected to be higher than the values in quadrants two and three, i.e. tagger performance was expected to be higher when the corpora used for training and testing belong to the same broad class (and, consequently, are more similar to each other).

An approximate comparison was made by counting how often the values in quadrants one and four were bigger than the values in quadrants two and three.

The 12 accuracy values in quadrant four (where training and test corpora belong to spoken English) were always bigger than the 12 values in quadrants two and three (where training and test corpora belong to opposite classes). This result was as expected. By contrast, the 6 accuracy values in quadrant one (where training and test corpora belong to written English) were not always bigger than the 12 values in quadrants two and three. The problem was caused by the Imaginative corpus which, when used during the training phase, produced a model that was more robust/reliable for the spoken-English corpora than for the written-English ones. Since the Imaginative corpus was comprised of narrative texts, a possible reason for the unexpected result was that the texts contain a lot of dialogue, which makes the Imaginative corpus more similar to spoken corpora than to written ones.

During the corpus design phase, described in section 8.1, corpora were created considering only corpus size, and relying on the BNC classification, which puts Imaginative among the written English classes. So it was checked whether the peculiar behaviour of the Imaginative corpus was due to its wrong classification.

The frequency of the double-quote characters (" and ") was counted in each written corpus, and the counts were then compared. The ratios between the number of double-quote characters and corpus size are presented in table 8.6.

Corpus   Double-quote characters   Corpus size   Percentage
G        2024                      140106        1.44%
WA       1422                      140253        1.01%
I        6469                      140069        4.61%

Table 8.6: Ratio of double-quote characters in written corpora

The χ2 test was used to compute whether the difference between the proportions in the Guardian and Imaginative corpora was a result of chance. The test was computed over the data presented in contingency table 8.7.

Corpus   Double quotes   Rest     Corpus size
G        2024            138082   140106
I        6469            133600   140069
Total    8493            271682   280175

Table 8.7: Contingency table

With one degree of freedom, the probability of obtaining a χ2 value of 2401 is less than 1%. So the number of double quotes in the Imaginative corpus was significantly bigger than the number of quotes in the Guardian, and the hypothesis about the Imaginative corpus being more similar to the spoken corpora than to the written ones was confirmed. Since the result of the comparison between the Guardian and the Imaginative corpus was so clear, the test between the World Affairs and the Imaginative corpus was not performed.
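The statistic can be checked against table 8.7 with a few lines of Python:

    def chi_square_2x2(a, b, c, d):
        # Pearson chi-square for the 2x2 contingency table [[a, b], [c, d]],
        # with expected values derived from the row and column totals.
        n = a + b + c + d
        expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                    (c + d) * (a + c) / n, (c + d) * (b + d) / n]
        observed = [a, b, c, d]
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Rows: Guardian and Imaginative; columns: double quotes and the rest.
    print(chi_square_2x2(2024, 138082, 6469, 133600))
    # prints roughly 2400, in line with the value of 2401 reported above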

8.5 Results for homogeneity

The evaluation process for the homogeneity measure has three steps:

• ranking each set of homogeneity values in increasing order, so that the corpus with the smallest value appears at the beginning of each list. According to the measures described in chapter 6, the highest ranked corpus is the most homogeneous, while the lowest ranked is the least homogeneous;

• ranking the set of RASP accuracy values in decreasing order, so that the corpus on which the RASP tagger scored the highest accuracy appears at the beginning of the list. According to the hypothesis, the highest ranked corpus is the most homogeneous, while the lowest ranked is the least homogeneous;

• comparing each ranking of homogeneity values against the accuracy value ranking. The correlation of the ranked pairs indicated how good our measures were.

Corpus   Accuracy value   Standard deviation
G        95.267%          (0.83)
D        93.646%          (0.52)
P        93.449%          (1.03)
B        93.397%          (0.73)
L        93.318%          (1.47)
WA       92.569%          (0.86)
I        92.563%          (2.62)

Table 8.8: Ordered gold standard judgments for homogeneity

Looking at the ranking of gold standard judgments, presented in table 8.8, the Guardian corpus is scored as the most homogeneous, having the highest accuracy and a small standard deviation. The Imaginative corpus seems the least homogeneous, having the lowest accuracy value. However, since the standard deviations are high, it is difficult to confirm this intuition. For the other corpora, it did not seem possible to fix a proper homogeneity scale, since their accuracy values were quite similar to one another, while the standard deviation values were always relatively large.

Before attempting a comparison against a list whose order might not be certain, the statistical validity of each gold standard value was studied by applying the unrelated t-test to each pair of corpora. The aim is to see whether the difference in their POS tagger accuracy values is statistically significant. If this is the case, the order of that particular pair is valid, and, consequently, the corpus with the higher accuracy is also more homogeneous than the corpus with the lower accuracy. If this is not the case, it is not possible to make any assumption about the order of the pair, which remains unspecified. Consequently, it is not possible to make any assumption about the homogeneity of the two corpora either.

The t-test was computed for each of the 21 pairs of corpora. The t-values obtained for each pair are reported in table 8.9.

Pair of corpora   T-value   Significance
BD                0.877     N
BG                5.325     Y
BI                0.967     N
BL                0.151     N
BP                0.130     N
BWA               2.308     Y
DG                5.199     Y
DI                1.279     N
DL                0.661     N
DP                0.538     N
DWA               3.363     Y
GI                3.103     Y
GL                3.628     Y
GP                4.327     Y
GWA               7.076     Y
IL                0.792     N
IP                0.993     N
IWA               0.006     N
LP                0.229     N
LWA               1.381     N
PWA               2.063     N

Table 8.9: T-values for each pair of corpora
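For reference, the unrelated (independent-samples) t statistic over two sets of 10 accuracy values can be computed as below; this is a generic sketch (scipy.stats.ttest_ind performs the same calculation), not the thesis code.

    import statistics

    def unrelated_t(xs, ys):
        # Independent two-sample t-test with pooled variance;
        # degrees of freedom = len(xs) + len(ys) - 2 (here 10 + 10 - 2 = 18).
        nx, ny = len(xs), len(ys)
        pooled_var = ((nx - 1) * statistics.variance(xs) +
                      (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
        se = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
        return (statistics.mean(xs) - statistics.mean(ys)) / se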

Since each accuracy value was an average over 10 trials, there are 18 degrees of freedom for each pair of corpora. With 18 degrees of freedom, the t-value has to exceed the critical value 2.101 in order to reach a significance level of 5%. As can be seen in the last column of table 8.9, only 8 values out of 21 are significant, i.e. the difference between the POS tagger accuracy values is statistically significant in just 8 cases. So it was not possible to completely order the list of gold standard judgments. Consequently, the scope of the evaluation had to be limited to the values that were significant. Looking at those values, it was possible to produce a partial order that could be correctly used for the evaluation. The order consisted of the following three statements:

• the accuracy of the Guardian corpus was greater than the accuracy of any other corpus. This implied that the Guardian corpus was more homogeneous than all the other corpora;

• the accuracy of the Business corpus was greater than the accuracy of the World Affairs corpus. Hence, the Business corpus was more homogeneous than the World Affairs corpus;

• the accuracy of the Demographic corpus was greater than the accuracy of the World Affairs corpus. Hence, the Demographic corpus was more homogeneous than the World Affairs corpus.

The 18 sets of homogeneity values (reported in appendix E.1) were manually checked to see whether the three statements were confirmed. In detail, it was checked whether:

• the homogeneity value for Guardian was the smallest in each set of homogeneity values;

• the homogeneity values for Business were smaller than the homogeneity values for World Affairs;

• the homogeneity values for Demographic were smaller than the homogeneity values for World Affairs.

The three statements were confirmed in each of the 18 sets of homogeneity values. These seemed to be good results, but they did not reveal whether there was a feature type, an inter-document measure, or a particular combination of the two that performed better than the others. To overcome this impasse, Spearman's rho test was used to compare the ranks of each of the 18 sets of homogeneity values and the RASP accuracy values. The results, presented in table 8.10, were always positive and within the significance level of 5%. In fact, all the correlation values were bigger than 0.71, which is the critical value for ranks of 7 elements.
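A minimal sketch of this correlation test, again assuming scipy; the homogeneity scores below are illustrative, while the accuracies are those of table 8.8. Because a smaller homogeneity score means a more homogeneous corpus, while a larger accuracy means better tagging, the raw coefficient comes out negative, and its sign is flipped to match the rank alignment used in table 8.10.

    from scipy.stats import spearmanr

    # Seven corpora in the same order in both lists; the homogeneity scores
    # are illustrative, the accuracies are taken from table 8.8.
    homogeneity = [0.019, 0.071, 0.080, 0.093, 0.120, 0.150, 0.228]
    accuracy    = [95.267, 93.646, 93.449, 93.397, 93.318, 92.569, 92.563]

    rho, p = spearmanr(homogeneity, accuracy)
    rho = -rho   # align rank directions: lowest homogeneity score = rank 1
    print(rho)   # values above 0.71 are significant at the 5% level for n = 7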

The correlation values between homogeneity and accuracy suggested that D(p||q) was a slightly better inter-document measure for homogeneity than χ² and G². The latter two performed equally well, except using bigrams, where G² slightly outperformed χ². The correlations also showed that in this case the feature type which performed best was function words, and that POS tag bigrams tended to outperform POS tags.

Set number   Homogeneity measure   Correlation value
1            tag chi               0.755
2            tag log               0.755
3            tag ent               0.825
4            fword chi             0.825
5            fword log             0.825
6            fword ent             0.860
7            bigram chi            0.751
8            bigram log            0.825
9            bigram ent            0.860
10           tag chi smo           0.751
11           tag log smo           0.751
12           tag ent smo           0.825
13           fword chi smo         0.825
14           fword log smo         0.825
15           fword ent smo         0.860
16           bigram chi smo        0.825
17           bigram log smo        0.825
18           bigram ent smo        0.860

Table 8.10: Correlations between homogeneity and accuracy

As far as the use of smoothing was concerned, the correlation results tended not to be affected by it. However, the validity of the correlation values was not entirely clear, since the ranking of homogeneity values was compared against a ranking of accuracy values which was not completely determined.

8.6 Results for similarity

Before starting the main evaluation process, the validity of the gold standard judgments produced in section 8.4.1 for similarity was studied. The accuracy values obtained by the tagger, using the same corpus as a testing set, were compared using the related t-test. Each value in table 8.5 was compared against every other value in the same row. The 105 t-values computed (15 comparisons for each of the 7 rows) are reported in tables 8.11 and 8.12. Since each accuracy value was an average over 10 trials, there were 9 degrees of freedom in the related t-test, and the critical value for assuring a 5% significance level was 2.262.
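The related t-test differs from the unrelated test of section 8.5 in that it pairs the observations trial by trial; a sketch with hypothetical per-trial values (scipy assumed):

    from scipy.stats import ttest_rel

    # Hypothetical accuracies of the same 10 test trials under two
    # different training corpora (paired observations).
    acc_train_x = [93.1, 93.4, 92.9, 93.6, 93.2, 93.0, 93.5, 93.3, 93.1, 93.4]
    acc_train_y = [92.5, 92.8, 92.4, 93.0, 92.6, 92.3, 92.9, 92.7, 92.5, 92.8]

    t, p = ttest_rel(acc_train_x, acc_train_y)  # 9 degrees of freedom
    print(t, abs(t) > 2.262)                    # significant at the 5% level?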

Table 8.11 reports the T-values computed between cells of the same row that also belong to the same quadrant, while table 8.12 reports the T-values computed between cells of the same row but from different quadrants. Cells from quadrant one were compared against cells of quadrant two, and cells from quadrant three against cells of quadrant four.

Since corpora belonging to the same class were assumed to be more similar to each other than corpora belonging to different classes, the differences between accuracy values in the same quadrant were expected to be smaller than the differences between accuracy values in different quadrants. So it was not considered problematic that in table 8.11 some of the differences were not statistically significant (18 t-values failed to reject the null hypothesis, being smaller than the critical value).

There were only two pairs of accuracy values, belonging to different quadrants of table 8.5, for which the difference was not statistically significant:

• WAteIta and WAteLta, i.e., the accuracy values calculated over the World Affairs corpus using the Imaginative corpus (86.027% ± 2.66) and the Leisure corpus (86.381% ± 1.61) for training;

Quadrant   Compared cells        T-value   Significance
1          GteIta - GteWAta      31.567
1          WAteGta - WAteIta      5.694
1          IteGta - IteWAta       0.211    N
2          GteBta - GteLta       24.903
2          GteBta - GtePta        2.779
2          GteBta - GteDta        1.397    N
2          GteLta - GtePta       27.759
2          GteLta - GteDta       25.900
2          GtePta - GteDta        1.346    N
2          WAteBta - WAteLta      3.596
2          WAteBta - WAtePta      4.662
2          WAteBta - WAteDta      0.662    N
2          WAteLta - WAtePta      2.591
2          WAteLta - WAteDta      7.306
2          WAtePta - WAteDta      1.952    N
2          IteBta - IteLta       13.998
2          IteBta - ItePta        2.062    N
2          IteBta - IteDta        5.880
2          IteLta - ItePta        8.572
2          IteLta - IteDta        3.318
2          ItePta - IteDta        3.358
3          BteGta - BteWAta       0.837    N
3          BteGta - BteIta        1.716    N
3          BteWAta - BteIta       1.849    N
3          LteGta - LteWAta       1.186    N
3          LteGta - LteIta        4.771
3          LteWAta - LteIta       7.473
3          PteGta - PteWAta       3.049
3          PteGta - PteIta        2.070    N
3          PteWAta - PteIta       1.234    N
3          DteGta - DteWAta       1.378    N
3          DteGta - DteIta       12.113
3          DteWAta - DteIta      18.395
4          BteLta - BtePta        0.216    N
4          BteLta - BteDta        8.281
4          BtePta - BteDta        5.338
4          LteBta - LtePta        5.224
4          LteBta - LteDta        1.094    N
4          LtePta - LteDta        0.944    N
4          PteBta - PteLta        2.241    N
4          PteBta - PteDta        4.461
4          PteLta - PteDta        9.980
4          DteBta - DteLta       13.465
4          DteBta - DtePta        1.016    N
4          DteLta - DtePta        7.334

Table 8.11: Related t-values computed between cells of the same quadrant

Compared cells        t-value   Sign.
GteWAta - GteBta      35.736
GteWAta - GteLta       8.531
GteWAta - GtePta      32.616
GteWAta - GteDta      29.180
GteIta - GteBta        9.579
GteIta - GteLta        8.095
GteIta - GtePta        5.969
GteIta - GteDta        8.878
WAteGta - WAteBta      6.314
WAteGta - WAteLta      7.745
WAteGta - WAtePta      5.988
WAteGta - WAteDta     10.635
WAteIta - WAteBta      2.874
WAteIta - WAteLta      0.708    N
WAteIta - WAtePta      1.798    N
WAteIta - WAteDta      6.917
IteGta - IteBta       11.434
IteGta - IteLta        8.686
IteGta - ItePta       10.482
IteGta - IteDta        8.110
IteWAta - IteBta      11.295
IteWAta - IteLta       8.528
IteWAta - ItePta      10.517
IteWAta - IteDta       7.782
BteGta - BteLta       11.776
BteGta - BtePta       11.467
BteGta - BteDta        5.938
BteWAta - BteLta      10.697
BteWAta - BtePta      10.965
BteWAta - BteDta       5.388
BteIta - BteLta       17.084
BteIta - BtePta       17.984
BteIta - BteDta        6.821
LteGta - LteBta        6.702
LteGta - LtePta        7.515
LteGta - LteDta        5.939
LteWAta - LteBta       6.703
LteWAta - LtePta       7.898
LteWAta - LteDta       5.475
LteIta - LteBta        5.405
LteIta - LtePta        7.119
LteIta - LteDta        3.969
PteGta - PteBta        8.685
PteGta - PteLta       10.516
PteGta - PteDta        6.901
PteWAta - PteBta       7.889
PteWAta - PteLta       9.690
PteWAta - PteDta       6.275
PteIta - PteBta        7.759
PteIta - PteLta       10.563
PteIta - PteDta        7.649
DteGta - DteBta       17.679
DteGta - DteLta       20.443
DteGta - DtePta       19.143
DteWAta - DteBta      19.995
DteWAta - DteLta      23.476
DteWAta - DtePta      21.971
DteIta - DteBta       11.134
DteIta - DteLta       15.513
DteIta - DtePta       11.232

Table 8.12: Related t-test values computed between cells of different quadrants

• WAteIta and WAtePta, i.e., the accuracy values calculated over the World Affairs corpus using the Imaginative corpus (86.027% ± 2.66) and the Public corpus (84.783% ± 3.28) for training.

Since most of the values reported in tables 8.11 and 8.12 were significant, the main evaluation process started, following the same three steps presented in section 8.5 for homogeneity.

Firstly, for each set of similarity values, the pairs of corpora were ranked in increasing order according to their similarity values, so that more similar pairs of corpora appear at the beginning of the list. Each set of similarity values computed with χ² or G² produced a single ranking, while each set of similarity values computed using relative entropy produced two rankings, one for each subset of values it was composed of (subsets A and B; see section 8.3).

Secondly, the set of accuracy values presented in table 8.5 was split into two subsets along the table diagonal, as was done for the similarity values calculated using relative entropy. Then each of the two subsets was ranked in descending order according to the accuracy values, so that the pair of documents associated with the highest accuracy score comes first in the list. From now on I refer to the ranking of the accuracy values placed in the upper part of table 8.5 as "ranking A", while scores positioned below the table diagonal are placed in "ranking B".

Finally, the rankings of similarity values were compared against the rankings of accuracy values, and the correlation for each pair was computed using Spearman's rho test.

Each ranking of similarity values computed using χ² or G² was compared against both rankings A and B, while each ranking of similarity scores computed with relative entropy was compared only against the appropriate ranking of accuracy values: i.e. the ranking obtained from similarity subset A was compared against ranking A, while the ranking obtained from subset B was only compared against ranking B.

All the Spearman correlation values are reported in table 8.13. They are always within the significance level of 5%; in fact they are always greater than 0.306, which is the critical value when the degrees of freedom exceed 30. Moreover, they are always within the significance level of 1%, since they always exceed the critical value 0.432.

Set number   Rank   Sim. measure       Correlation value
1            A      tag chi            0.690
             B      tag chi            0.938
2            A      tag log            0.701
             B      tag log            0.944
3            A      tag ent A          0.701
             B      tag ent B          0.938
4            A      fword chi          0.603
             B      fword chi          0.880
5            A      fword log          0.616
             B      fword log          0.884
6            A      fword ent A        0.625
             B      fword ent B        0.880
7            A      bigram chi         0.613
             B      bigram chi         0.860
8            A      bigram log         0.630
             B      bigram log         0.869
9            A      bigram ent A       0.659
             B      bigram ent B       0.705
10           A      tag chi smo        0.690
             B      tag chi smo        0.938
11           A      tag log smo        0.701
             B      tag log smo        0.944
12           A      tag ent smo A      0.701
             B      tag ent smo B      0.938
13           A      fword chi smo      0.603
             B      fword chi smo      0.880
14           A      fword log smo      0.616
             B      fword log smo      0.884
15           A      fword ent smo A    0.625
             B      fword ent smo B    0.880
16           A      bigram chi smo     0.613
             B      bigram chi smo     0.860
17           A      bigram log smo     0.630
             B      bigram log smo     0.869
18           A      bigram ent smo A   0.659
             B      bigram ent smo B   0.705

Table 8.13: Correlations between similarity and accuracy

The correlation values showed that for similarity the best inter-document measures were G² and relative entropy, followed by χ². Regarding feature types, the best type was POS tags, followed by function words and POS tag bigrams. The use of smoothing did not affect the performance of any measure.

8.7 Discussion

For homogeneity, the Spearman correlation values showed that the original difference between more and less homogeneous corpora was maintained by the RASP POS tagger. The Guardian corpus, which has the lowest homogeneity score for averages and one of the smallest for standard deviations, achieved the highest accuracy score in the POS annotation task. Moreover, the Business and Demographic corpora, which have lower homogeneity scores than the World Affairs corpus, achieved higher accuracy scores than the World Affairs corpus in the POS annotation.

The difference in homogeneity between the Guardian and the other six corpora reflects the differences in how they were compiled. The Guardian corpus was built with documents coming from the same source (articles from "The Guardian"), while all the other corpora were made of documents belonging to the same BNC Sampler class, but different sources.

For similarity, the Spearman correlation values showed that the distinction between more and less similar corpora was also maintained by the RASP tagger. The RASP system achieved high accuracy in the POS tagging task when tested on a corpus highly similar to the one used for the training phase. By contrast, it achieved the worst accuracy when tested on a corpus which differed from the one used for training. Among the seven corpora, Leisure, Public and Business were the most similar, and they all belonged to the spoken-English class.

The distinction between written and spoken English was also maintained by the RASP tagger: it achieved high accuracy when tested on a corpus that belonged to the same class as the corpus that was used for training.

D(p||q) and G² performed better than χ² in measuring homogeneity and similarity. The use of SGTE did not significantly change any homogeneity or similarity measure. The reason was that SGTE mainly affected the probabilities of the

less frequent features, which were then deleted during the feature selection phase.

SGTE did not significantly change the probabilities of any features that were not deleted during the feature selection phase; in fact, SGTE significantly modified only the zero-frequency features. As a result, the use of SGTE in our measures did not improve the estimation of homogeneity and similarity.

The generally high accuracy values achieved by the RASP tagger confirmed the domain independence of the tagger. However, the fact that corpus homogeneity and similarity also affect domain-independent systems shows that it is useful to analyse corpus characteristics and perform corpus comparisons by measuring corpus homogeneity and similarity.

The correlation between homogeneity and accuracy, which was always around 0.8 (see table 8.10), confirms that the tagger performs better when trained and tested on corpora judged to be homogeneous.

The correlation between similarity and accuracy, which was always between 0.6 and 0.9 (see table 8.13), confirms that the tagger performs better when trained on corpus A and tested on corpus B if A and B are judged to be similar.

8.8 Conclusions

The experiment described in this chapter represented the second attempt to use the performance of an NLP application for the evaluation of homogeneity measures. The hypothesis tested was confirmed, so it is possible to assert that the measures proposed provide a fair estimate of the homogeneity and similarity of corpora in relation to the POS tagging task.

Chapter 9

Exploring the homogeneity of corpora built from the web

Chapters 7 and 8 showed that the measures for corpus homogeneity and similarity proposed in chapter 6 are effective in relation to a classifier and a part of speech tagger.

This chapter represents a further step in the analysis of corpus homogeneity and similarity, since it describes a preliminary exercise that shows the corpus homogeneity measure "in use".

Chapter 2 presented the design of corpora - specifically of homogeneous and balanced corpora - as one of the tasks that stands to benefit most from the development of methods for corpus profiling and measures for corpus homogeneity and similarity. Regarding the design of language corpora, homogeneity and similarity measures can assist the process in two different ways, depending on the availability of a core homogeneous corpus. If there is such a starting corpus, the process of corpus design becomes a question of corpus maintenance, since it consists of enlarging the pre-existing corpus by choosing which documents can be added without losing overall homogeneity. In this case the corpus similarity measure can be employed to quantify the similarity between the core corpus and each possible additional document. If a document is similar to the core corpus, it follows that the addition of that document to the corpus should increase the information present in the resulting enlarged corpus but not the level of noise. Otherwise, it is better to discard the document. In contrast, when a corpus has to be built from scratch, it is the homogeneity measure that can represent an aid to the design process by analysing

the resulting corpus. The analysis will give useful suggestions about possible ways to change the corpus design process itself, for example the need to add some dedicated filters to reduce the variance of a particular type of linguistic data. Suppose that someone is developing a corpus to use for training a statistical parser on a domain different from the one used for development. In such a case, in order to guarantee that the parser reaches an acceptable level of accuracy, the documents which make up the corpus should be similar as far as syntax is concerned. If the corpus is not syntactically homogeneous, the corpus developer might decide to add a specific filter which gets rid of documents that present too many syntactic differences.

The aim of the experiment described in this chapter is to test the validity of a knowledge-poor process for the development of homogeneous corpora where the web is employed as the source of documents.

Section 9.1 presents a brief overview of the attempts to automatically create corpora from the Web. The rest of the chapter describes the experiment and presents the results.

9.1 The Web as a corpus resource

Over the last couple of years the Web has become the corpus resource chosen for many language studies (e.g., Resnik (1999), Fujii and Ishikawa (2000), Kilgarriff and Grefenstette (2003)). The web contains a huge quantity of text, which varies in subject-matter, genre and language (Grefenstette and Nioche, 2000). Texts from the web are available immediately, for free, and can be downloaded without major concern for copyright. 1

Many web documents do not contain only text, or do not contain text at all. Many documents are duplicates, and some others point at further documents which no longer exist. Also, some contain text in formats which require further processing before the text is usable, e.g. postscript or pdf files. All these characteristics may seem good reasons for not using the web as a linguistic resource. However, they are all obstacles that can be overcome by applying various linguistic filters to the downloaded material, and the development of methods for corpus profiling and

1 Copyright laws were established before the development of the Web, so the copyright regulation of Web documents is still a grey area.

measures for corpus homogeneity and similarity represents a step in this direction.

Moreover, developing technology that allows the web to be used as an object of language or literary study in its own right is also worthwhile since in the near future the web may not be the only vast source of text in digital form. In fact, as Church (2002) pointed out, as a result of changes in copyright laws, resources even bigger than the web can be realised with minimum effort.

Publishers like www.lexisnexis.com have impressive collections that may well surpass the public Internet. Private intranets and telephone networks have even larger sources of linguistic data.

Ghani, Jones, and Mladenić (2001) and Baroni and Bernardini (2004) present methods for building a corpus using automated search engine queries: the former apply it to the creation of minority language corpora, the latter to specialised language corpora. An extensive review of the literature in this area is outside the scope of this thesis; such a review can be found in the publications referenced above. In general, in this area, researchers proceed by first using a web search engine to obtain a set of URLs, then downloading those pages and using them, or whatever subset of them meets some criteria, as a corpus. Differences among the work in this paradigm lie in the expressions used as queries (single words, sets of words manually or automatically identified) and the nature of the criteria applied to the downloaded material to identify the texts that can be inserted into the developing corpus.

The rest of this chapter presents a simple preliminary exercise to investigate how corpus homogeneity and similarity measures can contribute to this task.

9.2 The experiment

When a technical term is used to query a search engine like AltaVista, Yahoo, or Google, we would expect the documents that the search engine finds to be fairly homogeneous. For example, we would expect the results for a technical term like vowel shift to be predominantly articles and course notes concerning phonology. By contrast, the use of a general term - i.e., an arbitrary adjective-noun combination such as suitable place - would generate a much more heterogeneous corpus, since the term will occur in documents of a very wide range of varieties.

The experiment is designed to test the hypothesis that technical terms give rise to more homogeneous corpora than general terms. The corpora built from the web

(made up of web documents) are created using a procedure which relies on Google as the search engine, a set of queries to identify the documents that might form the corpus (section 9.3), a downloading program to gather documents, and a set of filters to throw out those documents which are likely to contain too much noise (section 9.4). The homogeneity of corpora generated using different search terms (section 9.5) is quantified using the measure for homogeneity proposed in chapter 6, and the values obtained (section 9.6) are compared to test the hypothesis (section 9.7).

Since Google does not permit the running of automated queries without a licence, the Google Web API 2 has been used for the experiment. The Google Web API imposes a limit of 1,000 queries per day.

As the downloading program, the wget utility is used, which simply downloads the source file of a web page.
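The retrieval step can be sketched as follows. The Google Web API was a SOAP service that is no longer available, so search_urls below is a hypothetical stand-in for whichever search backend is used; only the wget call reflects the downloading step actually described here.

    import os
    import subprocess

    def search_urls(query, loops=50, per_loop=10):
        """Hypothetical stand-in for the Google Web API: return up to
        loops * per_loop result URLs for the given query."""
        raise NotImplementedError("plug in a search backend here")

    def download_pages(urls, out_dir):
        # Fetch the raw source of each page with wget; failed downloads
        # (dead links, timeouts) are simply skipped.
        os.makedirs(out_dir, exist_ok=True)
        for i, url in enumerate(urls):
            subprocess.run(["wget", "-q", "-O",
                            os.path.join(out_dir, f"{i}.html"), url])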

9.3 Specialist and general terms

While the goal of comparing the homogeneity of corpora based on specialist terms with ones based on general terms seems straightforward, the concept of a general term presents difficulties. Terms are most readily understood in relation to specialist language (or technical discourse), and finding specialist terms from termbanks did not present difficulties. On the contrary, gathering non-specialist terms is more complex, since the definition of what is to count as a general reference is unclear. In this experiment I simply used an arbitrary subset of the noun-noun (N-N) and adjective-noun (adj-N) sequences found in the BNC. An interacting concern was frequency: general language collocations were likely to be orders of magnitude more frequent than specialist terms. So, in order to monitor the relationship between the frequencies of the specialist and reference expressions, non-specialist expressions were chosen on the basis of web frequency.

2"API stands for "Application Programming Interface" , doorway for programmatic access a to particular resource or application, in this case the Google index" (Calishain and Dornfest, a 2004). For details see http://api .google . com/

9.3.1 The specialist terms

Specialist terms were collected from glossaries of different domains found on a dedicated website called TermList 3. The terms for the experiment were extracted from three different domains: chemistry, genetics and metallurgy. Typically, in glossaries of specialist discourses, expressions that are highly specific for that particular discourse occur together with expressions that are less specific. The identification of such differences is not easy, especially for a non-specialist. To compensate for the inability to identify glossaries containing mainly highly specific terms, I collected expressions from three different domains and chose specialist terms on the basis of web frequency.

The procedure for gathering technical terms started with the extraction of all the N-N and adj-N pairs from each of the three glossaries. Then, for each pair, which from now on is treated as a single two-word term, a Google search was instigated, using the term as a query, to retrieve the number of web pages containing that particular query. The count of Google frequencies is not accurate; nonetheless it is a useful value, as argued by Keller and Lapata (2003). Document frequency is a simple indicator of the spread of a term over a corpus, and is more reliable than sample frequency. Suppose for example that two terms occur the same number of times in a corpus. If one of the terms occurs in few documents and the other in almost all, it is likely that the former is a more specialist term than the latter.

Terms with a number of hits bigger than 1,000,000 were discarded as non-specialist terms, while those with a web frequency smaller than 500 were discarded as terms for which the web could not provide enough material for building a corpus. Since frequency changes the behaviour of words, the remaining set of terms in each glossary was also divided into three subsets according to the number of hits:

• low frequency specialist terms, ranging between 500 and 1,000 hits;

• medium frequency specialist terms, ranging between 1,001 and 100,000 hits;

• high frequency specialist terms, ranging between 100,001 and 1,000,000 hits.

The terms in each subset were ranked according to the number of hits, so that the more frequent terms appear at the beginning of the list in

3 See http://www.uwasa.fi/comm/termino/termlist.html

each subset. The first ten terms in each subset were used as "seed terms" for the development of the corpora. The 90 specialist terms - 10 terms for each of the three subsets for each of the three domains - are presented in appendix G.1, together with their web frequency.

9.3.2 The general terms

As briefly mentioned before, in the experiment an arbitrary subset of the N-N and adj-N pairs in the BNC was used as general terms. The simple procedure used to obtain a set of general terms consists of:

• extracting all the N-N and adj-N pairs from the BNC and counting their document frequency, i.e. the number of BNC documents in which each pair occurs at least once;

• ranking the list of pairs according to their document frequency, so that more widespread pairs appear at the beginning of the list 4;

• calculating the number of web pages containing each pair in the list and removing those terms with a web frequency smaller than 1,000,000 hits 5.

The method used for extracting BNC pairs caused the list to contain some highly similar terms, like insurance companies/insurance company, north east/north west, city council/county council/district council. To avoid using such similar terms for the development of corpora, pairs containing at least one word previously spotted in the list were removed from the list.
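One way to realise this filter is sketched below (the function name is mine, not from the thesis): walk down the ranked list and keep a pair only if neither of its words has already been spotted.

    def remove_similar_pairs(ranked_pairs):
        """Keep a pair only if neither word has appeared before, so that
        e.g. 'insurance company' is dropped after 'insurance companies'."""
        seen, kept = set(), []
        for w1, w2 in ranked_pairs:
            if w1 not in seen and w2 not in seen:
                kept.append((w1, w2))
            seen.update((w1, w2))
        return kept

    pairs = [("insurance", "companies"), ("insurance", "company"),
             ("north", "east"), ("north", "west"), ("city", "centre")]
    print(remove_similar_pairs(pairs))
    # [('insurance', 'companies'), ('north', 'east'), ('city', 'centre')]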

Among the pairs obtained at the end of the procedure, the first 70 were selected as general terms. They are listed in appendix G.1, together with their web frequency.

9.4 Developing corpora

The main requirement that corpora have to fulfil for the homogeneity measures to be applied correctly is to be monolingual. Because of the variety of

4 The likelihood of a term being general is higher for those terms with a greater document frequency (see the discussion in the previous section).
5 1,000,000 hits was the threshold fixed to discard, among the pairs extracted from the glossaries, those which seemed too frequent to be truly specific terms. Similarly, in this case, it was used to discard those pairs which might be too rare to be truly general terms.

the documents present on the web, a working definition of what is to count as language is needed, and the procedure for developing corpora from terms used in this experiment has to be based on this definition. At one extreme, everything included in the Brown corpus (bar the annotations) is clearly language. At the other, images and sounds are clearly not 6. In between, various problematic points can be identified, including indexes, bibliographies and headings.

Three main considerations led to the definition of what counts as language in this experiment. First, the document URL gives simple but essential suggestions about the nature of the document through the file extension. Second, when a document is particularly small, it is hard to determine whether the content of the document is a good instance of language. Third, Google provides free and relatively accurate technology to identify human languages.

Therefore, a document contains text in the proper language if and only if:

• the document URL ends with one of the following patterns: ".html", ".htm", "/" and ".txt";

• the document size is greater than 2000 bytes;

• Google technology considers it a document in English.

The document size threshold in the second condition was fixed on the basis of an empirical test, while the third condition was realised by setting the language restriction option in each Google Web API search to English ("lang_en").
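These three conditions translate into a small acceptance filter; a sketch, in which the is_english flag stands for the language restriction that was actually expressed as a search option rather than as a local test:

    import os

    ACCEPTED_ENDINGS = (".html", ".htm", "/", ".txt")
    MIN_SIZE_BYTES = 2000

    def accept_document(url, path, is_english):
        """True iff the downloaded page at 'path' counts as proper-language
        text under the working definition used in this experiment."""
        return (url.endswith(ACCEPTED_ENDINGS)
                and os.path.getsize(path) > MIN_SIZE_BYTES
                and is_english)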

The strategy adopted throughout the experiment was to err towards strictness, and to throw out items if there was not enough confidence to consider them as good instances of English, on the basis that the web is huge and one can always get more data. So there is no need to use data one is not entirely sure of.

A further problem is represented by HTML markup, which is clearly not English. It was my intention from the outset to exclude the markup from the computations used to build the corpus profile. The idea of using the semantics of HTML tags to guide the identification of language segments within documents is appealing, but it was not pursued here.

6 Text may be contained in images, but this issue is not addressed here.

For the purposes of developing corpora, specialist and general terms were used in the same way. The starting point consisted of instigating a Google Web API search for each term used as a query. The aim of this first step was to retrieve the term's web frequency. When the total number of hits for a term was greater than 10,000 7, the query - made of a single term - was enriched with a word randomly chosen from a list of low variance words 8. Variance is one of the possible measures that can be used to quantify word "burstiness", i.e. the word's distribution in a corpus. A word scores a low variance value when it tends to be equally spread across the documents in the corpus, i.e. when the word's behaviour is not particularly affected by possible differences in style, genre, and domain among documents in the corpus. The list of low variance words used in this experiment is a subset of Adam

Kilgarriff's list of BNC words 9. This experiment used words with a variance less than or equal to 0.1. The decision to add low variance words to highly frequent terms was taken to prevent the retrieval of pages containing the search term in the title. Moreover, augmenting the query size also increases the probability of retrieving documents likely to contain text. The procedure of inserting a new low variance word into a query was iterated until the number of hits for the updated query was smaller than 10,000. At this stage, each query was made of a single term, or a term plus one or more low variance words.
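The enrichment loop can be sketched as follows; hit_count stands in for the Google Web API frequency lookup, and low_variance_words for the subset of Kilgarriff's BNC list mentioned above (both are assumptions of this sketch, not code from the thesis):

    import random

    MAX_HITS = 10000

    def enrich_query(term, low_variance_words, hit_count):
        """Add randomly chosen low-variance words (joined by a logical AND)
        until the query returns fewer than 10,000 hits."""
        query = [term]
        pool = list(low_variance_words)
        random.shuffle(pool)
        while hit_count(" AND ".join(query)) >= MAX_HITS and pool:
            query.append(pool.pop())
        return " AND ".join(query)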

Before moving to a further step, for the terms belonging to the metallurgical domain a slightly different exercise was also performed. The term "metallurgy" was added to each specialist term to form new "enlarged" specialist terms. Then, for each of them, the procedure for calculating web frequency, adding a low variance word when necessary, was performed without further differences. The decision to add the term for the field to the queries was taken to test a more specific hypothesis: specialist terms enriched with the term for the field give rise to more homogeneous corpora than specialist terms alone.

The second step consisted of instigating a new Google Web API search for each query, looping through the results and printing the URLs of the pages containing the search term. In this experiment, the number of loops was set to 50 and the number of results retrieved for each loop to 10, so that 500 URLs were collected for

7 The threshold of 10,000 occurrences was used in the process for gathering specialist and non-specialist terms as well. It was employed to discard too frequent specialist terms and too rare non-specialist terms (see section 9.3).
8 The addition of each low variance word to the query was performed through a logical "and".
9 For the complete list of variances and details on how the scores were calculated see http://www.lexmasterclass.com

each term. Then the URLs were parsed: secure pages were discarded, along with addresses that did not end with one of ".html", ".htm", "/" and ".txt". Finally, downloads for each of the remaining URLs were attempted using the wget function, with most but not all pages successfully retrieved (some pages did not exist any more, or could not be found at that specific time). The source of each retrieved page was stored in a separate file. The downloading process produced 68 corpora based on general terms, and 79 corpora based on specialist terms.

The corpora obtained in this way were extremely rough. So the third step in the creation of experimental data consisted of applying filters to get rid of documents that would bring mainly noise into the experiment, and of the HTML tags. All the files containing less than 2000 bytes were discarded. Then all the HTML tags and some selected environments, such as comments, scripts and style definitions, were filtered out. Also, some special characters were removed or converted 10 (e.g. "&lt;" was converted to "<"). At the end of this third step, corpora contained only text according to the working definition presented.
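A rough sketch of this cleaning step (the thesis used two Perl filters, filter1.pl and filter2.pl; the regular expressions below are illustrative, not a transcription of them):

    import html
    import re

    def strip_markup(source):
        # Remove comments, scripts and style environments wholesale.
        source = re.sub(r"<!--.*?-->", " ", source, flags=re.S)
        source = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ",
                        source, flags=re.S | re.I)
        # Remove the remaining tags, then decode entities such as &lt;.
        source = re.sub(r"<[^>]+>", " ", source)
        return html.unescape(source)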

A further issue was represented by corpus size. The developed corpora were all different in size. Although the procedure retrieved the same number of URLs, not all pages were successfully downloaded. Moreover, web pages were very different in size (e.g. a few lines or several pages). So the last step of the procedure consisted of fixing a common corpus size, removing small corpora and reducing the size of the bigger ones. For this experiment the corpus size was fixed at 140,000 words. This corpus size is the same as that used in the experiment in chapter 8, which opens the way for direct comparison with the homogeneity scores established there 11. To reduce the size of big corpora, all the corpus documents were merged in a random order. Then the 140,000th word of the corpus was identified and the corpus was truncated at the end of the sentence containing that word. At the end of step four there were respectively 58 corpora based on general terms and 106 based on specialist terms. Among the latter, 26 were derived from chemical terms, 26 from genetic terms, 28 from metallurgical terms without adding the term for the field and 26 with the addition. The part of speech annotation was performed with the RASP system (Briscoe and Carroll, 2002), already used in the previous experiment. In this case the RASP

10 The complete list of the characters which were removed or converted can be found in the files called "filter1.pl" and "filter2.pl" on the CD containing the software produced for the thesis.

11 Also, the type of internal features used for the production of corpus profiles - POS tags - is the same as that used in the experiment in chapter 8.

tool was run as it is distributed for the annotation of general text, i.e. using the resources provided by the developers.

9.5 Measuring corpus homogeneity

The homogeneity of each of the 164 corpora was calculated using the measures described in chapter 6. This experiment used a unique feature type (POS tags), a unique normalization method (without smoothing), and two inter-document measures (G² and D(p||q)). So at the end of this computation two different sets of values for homogeneity were produced. Appendix G.2 reports the homogeneity values obtained for each corpus. In the appendix there is a table for each set of terms (chemistry, genetics, metallurgy and general reference) and each inter-document measure employed. In each table, corpora are ranked in increasing order according to their homogeneity values, so that the most homogeneous corpora appear at the beginning of each list. Corpora are named after the term used as a query for their development.
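As an illustration of the computation (a sketch, not the Perl implementation actually used): each document is reduced to a relative frequency list over POS tags, D(p||q) is computed between every ordered pair of lists, and homogeneity is reported as the mean of those values, with the standard deviation alongside it. The small epsilon is an assumption of this sketch; it merely keeps the logarithm defined when a tag is missing from q, standing in for the treatment of zero frequencies described in chapter 6.

    import math
    from itertools import permutations
    from statistics import mean, stdev

    EPS = 1e-9  # floor for tags unseen in q (sketch-level zero handling)

    def rel_freqs(tag_counts):
        total = sum(tag_counts.values())
        return {tag: n / total for tag, n in tag_counts.items()}

    def kl_divergence(p, q):
        # D(p||q) = sum over tags of p(t) * log2(p(t) / q(t))
        return sum(pt * math.log2(pt / q.get(tag, EPS))
                   for tag, pt in p.items() if pt > 0)

    def homogeneity(doc_tag_counts):
        profiles = [rel_freqs(c) for c in doc_tag_counts]
        values = [kl_divergence(p, q)
                  for p, q in permutations(profiles, 2)]
        return mean(values), stdev(values)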

In the experiment in chapter 8, the homogeneity values obtained on the seven BNC subcorpora vary between 0.0188 for the Guardian corpus, which is the most homogeneous, and 0.228 for the Leisure corpus, the least homogeneous 12. The homogeneity figures obtained for the web corpora were compared against the figures obtained for the BNC subcorpora. Of the corpora based on general terms, none scores a value smaller than the homogeneity value of the Guardian subcorpus, but almost all (bar one) have a homogeneity value smaller than the Leisure subcorpus. Of the 106 corpora based on specialist terms, just five scored a homogeneity value smaller than 0.0188. They are the corpus based on the term "centromere interference" from the genetic domain, and four corpora derived from terms of the metallurgical domain when the term "metallurgy" was used to enlarge the query ("mill-edge", "sinker-steel", "coil-breaks" and "long-terne"). 21 corpora scored a homogeneity value greater than 0.228.

Table 9.1 shows a small extract of the homogeneity values obtained using G² as the inter-document measure. For each of the five sets of terms (chemistry, genetics, metallurgy-one, metallurgy-two - enriched with the word "metallurgy" - and general reference), the homogeneity values are reported just for the first and last four

12 The figures reported were obtained using G².

Domain      Corpus                     Homogeneity   Standard dev.
Chemistry   basicity-function          0.0242        0.0193
Chemistry   degenerate-rearrangement   0.0322        0.0261
Chemistry   reaction-mechanism         0.0479        0.0390
Chemistry   stationary-state           0.0707        0.0198
Chemistry   valence-isomer             0.2141        0.1474
Chemistry   cation-radical             0.2226        0.2249
Chemistry   equilibrium-control        0.2336        0.1788
Chemistry   specific-catalysis         0.3068        0.2463
Genetics    centromere-interference    0.0061        0.0047
Genetics    genome-project             0.0642        0.0189
Genetics    escherichia-coli           0.0664        0.0243
Genetics    adaptive-value             0.0730        0.0271
Genetics    synaptonemal-complex       0.2154        0.1891
Genetics    restriction-site           0.2229        0.1141
Genetics    recombinant-clone          0.2278        0.2267
Genetics    gametic-selection          0.2396        0.1341
Met. one    endurance-limit            0.0714        0.0211
Met. one    specific-gravity           0.0807        0.0254
Met. one    cold-reduction             0.0922        0.0439
Met. one    blast-furnace              0.0927        0.0397
Met. one    mill-edge                  0.2884        0.1727
Met. one    shrinkage-cavity           0.2899        0.1492
Met. one    cluster-mill               0.3047        0.1934
Met. one    sinker-steel               0.3916        0.3921
Met. two    mill-edge                  0.0008        0.0003
Met. two    sinker-steel               0.0018        0.0016
Met. two    coil-breaks                0.0037        0.0060
Met. two    long-terne                 0.0082        0.0132
Met. two    indentation-hardness       0.3889        0.1996
Met. two    alpha-iron                 0.4204        0.3259
Met. two    blast-furnace              0.5131        0.4200
Met. two    tensile-test               0.5154        0.3838
Gen. ref.   home-secretary             0.0521        0.0244
Gen. ref.   way-things                 0.0530        0.0145
Gen. ref.   health-care                0.0556        0.0108
Gen. ref.   city-centre                0.0595        0.0143
Gen. ref.   university-college         0.1749        0.1066
Gen. ref.   world-cup                  0.1914        0.0728
Gen. ref.   health-service             0.2037        0.0873
Gen. ref.   building-society           0.2057        0.1140

Table 9.1: Homogeneity values and standard deviations obtained using POS tags and G² for some of the corpora used in the experiment.

corpora in each ranked list.

The homogeneity figures did not show any particular difference between corpora based on specialist and general terms, so the scores for the corpora based on general terms were compared against the homogeneity scores for each set of corpora based on specialist terms, resulting in eight different comparisons.

9.6 Results

Corpora   Measure   n1   n2   U        p           Significance (one-tailed) at 2.5%
C-R       G²        58   26    969.0   0.018992*   yes
G-R       G²        58   26    853.0   0.170996*   no
M1-R      G²        58   28   1183.0   0.000238*   yes
M2-R      G²        58   26    959.0   0.023917*   yes
C-R       D(p||q)   58   26    950.0   0.02936*    no
G-R       D(p||q)   58   26    863.0   0.147623*   no
M1-R      D(p||q)   58   28   1174.0   0.000321*   yes
M2-R      D(p||q)   58   26    959.0   0.023793*   yes

Table 9.2: Mann-Whitney U values and level of significance.

Assuming that specialist terms give rise to more homogeneous corpora than general terms, the homogeneity values of corpora based on specialist terms were expected to be generally smaller than the homogeneity values scored by corpora based on general terms. An appropriate test for this task is the Mann-Whitney U test (Owen and Jones, 1977), a non-parametric statistical significance test. It assesses whether the degree of overlap between two observed distributions is less than would be expected by chance; the null hypothesis is that the two samples are drawn from the same population. The test shows whether the differences between the two sets of corpora - specialist and general - may have occurred by chance.

The statistical values necessary for the application of the test were computed using freely distributed software 13. The results are presented in table 9.2.
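The same test can be run with scipy instead of the web tool cited in the footnote; alternative="less" encodes the one-tailed hypothesis that the specialist corpora have the smaller homogeneity scores. The arrays below reuse the extract of values from table 9.1 purely for illustration; the real test was run over the full sets of 58 and 26 (or 28) values.

    from scipy.stats import mannwhitneyu

    # Illustrative homogeneity scores: the chemistry and general-reference
    # extracts shown in table 9.1 (eight values each).
    specialist = [0.0242, 0.0322, 0.0479, 0.0707,
                  0.2141, 0.2226, 0.2336, 0.3068]
    general    = [0.0521, 0.0530, 0.0556, 0.0595,
                  0.1749, 0.1914, 0.2037, 0.2057]

    u, p = mannwhitneyu(specialist, general, alternative="less")
    print(u, p, p < 0.025)   # one-tailed test at the 2.5% level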

Each set of corpora in table 9.2 is identified by its initials, so "C" stands for Chemistry, "G" for Genetics, "M1" for Metallurgy when the terms are used alone, "M2"

13 See http://eatworms.swmed.edu/~leon/stats/utest.html

for Metallurgy when the term "metallurgy" is added in the developing process, and "R" for General Reference.

For three permutations (G-R using G², and C-R and G-R using D(p||q)), the difference was not significant at the 2.5% level, i.e. there was no evidence for a difference in homogeneity between corpora developed using specialist and general terms.

For the other five permutations, for which the difference was significant at the 2.5% level, the homogeneity values showed that the corpora based on general terms were slightly more homogeneous than the corpora based on specialist terms.

So on each occasion, the hypothesis that specialist terms give rise to more homogeneous corpora than general terms was rejected.

9.7 Discussion

The experiment clearly showed that the hypothesis about the genre specificity of terms made of N-N or adj-N sequences was generally false. Possible reasons for the hypothesis not being confirmed are the following:

• the definitions of specialist and general terms are incorrect;

• the use of short queries - made of single specialist terms - does not succeed in clearly identifying the specialist discourse from which similar documents can be obtained.

The last issue was briefly investigated by adding the term for the field (i.e. "metallurgy") to the metallurgical terms and comparing the homogeneity scores of the corpora based on metallurgical terms alone (M1 corpora) against the homogeneity scores obtained by the corpora developed using those specialist terms plus the term for the field (M2 corpora). For 13 terms the homogeneity score decreased when "metallurgy" was added to the query; for another 13, in contrast, the homogeneity value increased. Finally, for two terms the addition of the term for the field produced corpora that were too small. The hypothesis that specialist terms enriched with the term for the field give rise to more homogeneous corpora than specialist terms alone was rejected as well.

Because the experiment did not confirm the hypothesis, I inspected some of the corpora, and found that in the most homogeneous corpora many documents were

duplicates or near-duplicates; conversely, there was a high degree of heterogeneity in the other corpora.

All the corpora with a homogeneity score smaller than 0.04 appeared to be significantly affected by duplication.

Amongst the web pages for genome project, a specialist term with a low score from the genetic domain 14 that did not show any duplication among its documents, there were a biography of an Estonian artist, a list of live music events and a page about Oscar predictions. Naturally, the majority of the pages were on genetics, but these pages contained a high degree of heterogeneity, with some describing academic or national projects, one being the page dedicated to biology and Christian ethics of the Episcopal Church in Florida, and one being a page about biology explained to children. The corpus related to city centre, a low-scoring corpus based on a general term, seemed to get its low score because most of the pages were home pages of hotels sited in the city centre of towns all over the world. Equilibrium control, a high-scoring corpus based on a specialist term from the chemical domain, included a page about the sun and stellar structure, a page on biped locomotion, one about adolescent scoliosis, another about the revelation of the "Temple of the Sun", and various pages about academic projects (a project for developing and testing coordinated control of motorway networks, another about assessing the long term contaminant load of mine water discharges, a third about moving robots).

The experiment clearly showed that the web contains all sorts of documents, of all sorts of genres, for specialist and general terms alike, and the hypothesis about the genre specificity of terms is false. But the qualitative investigation of the corpora leads one to suspect that the underlying intuition might be valid in rather more constrained ways than the ones originally imagined.

As previously mentioned, two of the possible reasons for the hypothesis not being confirmed were the incorrectness of the definitions of specialist and general terms, and the use of too short queries, which did not succeed in clearly identifying the specialist discourse from which similar documents can be obtained.

The definition of specialist and general terms used in the experiment assumed a difference in document frequency between specialist and general terms which seems to be untrue. Specialist terms were N-N and adj-N pairs with a document frequency lower than the general terms. This may be true, but not always,

14 A low score indicates homogeneity, a high score heterogeneity.

since the document frequency of rare general terms is similar to the document frequency of specialist terms (at least to the document frequency of the most frequent specialist terms).

Regarding the query size issue, it would be interesting to run the experiment using the BootCaT toolkit. The BootCaT system (Baroni and Bernardini, 2004), which was recently distributed, randomly combines the terms to use as initial Google queries. Then it applies a term extraction procedure to the downloaded corpus to identify other terms, by comparing the frequency of occurrence of each term with its frequency of occurrence in a reference corpus. Random combinations of the new terms are then used for a new iteration of the procedure for downloading a corpus.

9.8 Conclusion

The main purpose of this chapter was to present an example of the homogeneity measures proposed in chapter 6 in use. The chapter presented a study into the relation between the linguistic characterisation of corpora and datasets obtainable for download from the Web, and the possibilities and difficulties of the interaction of the two. It showed how corpus profiling can be used in the design of language corpora.

Chapter 10

Conclusions

This final chapter is organised into three sections: section 10.1 restates the general problem the thesis aims to tackle, section 10.2 summarises the contributions of the thesis to corpus profiling and comparison, while section 10.3 outlines directions for future research.

10.1 The thesis research questions

Nowadays, the predominant methodology in the field of Natural Language Processing (NLP) is corpus-based. For many languages it is relatively easy to acquire raw textual data for creating large corpora, and the technology to analyse the language data is growing faster, more robust and more accurate. However, the field lacks strategies for describing and comparing corpora. As a result, when an interesting finding is obtained using one corpus on a specific task, it is not clear for what other corpora the same finding would hold.

In the thesis the issue has been reformulated and constrained as follows:

• does an NLP application perform better when trained and tested on corpora judged as homogeneous?

• does it perform better when trained on corpus A and tested on corpus B, if A and B are judged to be similar?

To make progress on these questions, this thesis investigates the issue of corpus profiling, as a way of capturing relevant linguistic information contained in corpora,

and the notions of corpus homogeneity and similarity that are at the basis of any attempt to compare corpora.

Up to now, corpora have usually been described in terms of external features (e.g., the "Wall Street Journal corpus", the "Patient Information Leaflet corpus"), which are not present in the corpus documents but assigned by people on the basis of social, situational or functional evidence. Therefore corpus profiles based on external features may misrepresent the textual and linguistic factors, which are the data used by automatic systems.

In this thesis, the corpus profiles chosen for representing corpora are based on document-internal features, which are linguistic, more or less directly present inside corpus documents, and countable. It follows that they can be produced automatically, and the thesis has shown that they can be used in the computation of statistics for the quantification of corpus homogeneity and similarity.

10.2 The contributions of this thesis

The main contributions of this thesis are:

• to give a definition for corpus homogeneity and similarity that is not circular and can provide gold standard judgements for the evaluation of corpus homogeneity and similarity;

• to provide a methodology for estimating corpus homogeneity and similarity, based on frequency lists of document-internal features and an inter-document similarity measure.

10.2.1 Definitions of corpus homogeneity and similarity

The existing definitions of corpus homogeneity and similarity are as follows:

• homogeneity: a corpus is homogeneous if and only if all its documents belong to the same language variety;

• similarity: two corpora are similar if and only if they represent the same language varieties.

These definitions in terms of language varieties are task-independent but circular, because of the difficulty in determining what a language variety is (see section 4.4). This thesis has provided new definitions of corpus homogeneity and similarity in terms of NLP system performance, so avoiding the circularity problem. The two new definitions are:

• homogeneity: a corpus is homogeneous if an NLP application trained on part of a corpus and tested on the remaining parts performs well;

• similarity: corpora A and B are similar if an NLP application trained on corpus A and tested on corpus B (or vice versa) performs well.

These definitions make homogeneity and similarity both task-specific and application-specific. It follows that a corpus is homogeneous, and two corpora are similar, only with respect to a particular NLP task performed by an application, and the type of linguistic information considered in the task.

Task-specificity may appear as a limitation of these new definitions. People have intuitions about corpus homogeneity and similarity, and a detailed analysis of such intuitions may identify possible new task-independent definitions that are not affected by circularity. But corpora are complex and multi-dimensional objects that are not usually built for human use. People cannot have fine-grained enough knowledge of all the different aspects of a corpus to produce complete and accurate descriptions of corpora that can be used for their direct comparison. Many corpora are built for automatic systems to exploit, and each task performed by a system tackles a particular corpus aspect and the corresponding type of linguistic information. Therefore it is likely that, to be useful, corpus profiles and definitions of homogeneity and similarity will need to be task-specific, and any attempt to move to task-independence will be affected by circularity, owing to the impossibility of representing the complexity and multi-dimensionality of corpora in a single profile.

The definitions in terms of system performance also make homogeneity and similarity relative rather than absolute quantities. Again this does not seem to be a problem, since the usefulness of the notions of corpus homogeneity and similarity is in relation to corpus comparison.

10.2.2 Measures for estimating corpus homogeneity and similarity

The non-circular definitions of homogeneity and similarity suggest that the proper measure for calculating corpus homogeneity and similarity is accuracy. However, accuracy is a particularly costly measure, and its use is problematic for practical applications since it assumes the availability of manual annotation. So the second aim of the thesis is to produce measures that do not involve the use of language technology applications. Necessarily, such measures represent an approximation of the accuracy values. However, the use of the accuracy values as gold standard judgements in the evaluation of the measures allows us to understand to what degree they succeed in approximating accuracy.

Apart from not having to use a language technology application, there are three more constraints a measure for corpus homogeneity or similarity should fulfil to be useful. It should be:

• easy and quick to compute, so that it can be applied to corpora of any size;

• domain-independent, so that it can be applied to corpora of any language variety;

• language-independent, so that it can be applied to corpora of any language.

This thesis has proposed and evaluated two measures, one for corpus homogeneity and the other for corpus similarity, which are both based on feature frequency lists and an inter-document similarity measure:

• homogeneity is computed as the average and standard deviation of the inter-document similarity values computed for each pair of frequency lists representing the documents in the corpus;

• similarity is represented by the inter-document similarity value computed between the two frequency lists that represent the corpora we want to compare.

The measures (which have been implemented in Perl) can use six different types of internal features (all words, content words, function words, lemmas, POS tags and POS tag bigrams) and three inter-document similarity measures (χ², G² and D(p||q)).

The experiments described in chapters 7 and 8 show that the measures proposed give fair estimates of the homogeneity and similarity of corpora in relation to a document classification and a POS tagging task. For the POS tagging task the measures succeed in estimating both corpus homogeneity and similarity; for the document classification task the success of the measures is limited, due to the heterogeneity and lack of available data and the particular nature of the task. In both cases the measures which use G² and D(p||q) outperform the measure based on χ², which in the literature proved to be reliable (Kilgarriff, 2001).

In principle, any task that is performed using a machine-learning algorithm is appropriate for the evaluation of the measures. However, there is a major constraint in experimenting with other tasks, i.e. the availability of corpora containing the correct answers, which are needed for the computation of the accuracy values. These corpora are not easily obtainable, since producing them involves manual annotation, or automatic annotation followed by manual checking. Moreover, these corpora are often quite small in size, causing the statistical measures for homogeneity and similarity not to be reliable. The evaluation method proposed also requires evaluation corpora for the same task for distinct text types, and these are rarely available.

Chapter 9 presented a preliminary experiment which shows the measures for homogeneity "in use". In the experiment the measures were used to explore the hypothesis that technical terms give rise to more homogeneous corpora than general terms. The hypothesis was not confirmed.

10.2.3 Constraints on measures for computing corpus homogeneity and similarity

To what extent do the measures proposed in chapter 6 fulfil the three requirements for useful homogeneity and similarity measures? Once the features have been annotated, the frequency list can be computed easily and quickly for the corpora nowadays available (i.e. of a size of the order of hundreds of millions of words). In the experiments described in this thesis five types of document-internal features were used: all words, content words, function words, POS tags and POS tag bigrams. However, the measures can easily be applied to any type of document feature, such as surface features, which are directly present in the corpus, and structural features, which can be made explicit during a pre-processing phase.

Nevertheless there are obviously great differences between the complexity of computing the various types of feature. This complexity depends on how close the feature is to the surface text, on the language, and on the availability for that particular language of automatic annotation tools.
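As a toy illustration of these differences in pre-processing cost (Python; the tagged sentence is invented, and in the experiments the tags come from an automatic tagger such as RASP):

# All-words features need no annotation; function words need only a stop
# list (Appendix B); POS tags and POS tag bigrams presuppose a tagger.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "on", "to"}  # tiny stand-in

tagged = [("the", "AT"), ("cat", "NN1"), ("sat", "VVD"),
          ("on", "II"), ("the", "AT"), ("mat", "NN1")]

all_words = [w for w, _ in tagged]                      # surface feature
function_words = [w for w in all_words if w in FUNCTION_WORDS]
pos_tags = [t for _, t in tagged]                       # needs a tagger
pos_bigrams = list(zip(pos_tags, pos_tags[1:]))         # sequence feature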

Regarding corpus size, on the one hand, the statistical nature of the inter-document similarity measures means that estimates for the homogeneity and similarity of small corpora can be inaccurate. On the other hand, the computation involved in processing frequency lists for very large corpora can adversely affect the speed of the calculation of both homogeneity and similarity.

In the experiments, corpora of fixed size were used. They were split into segments of the same size without concern for the natural boundaries of documents. The use of less artificial corpora and segmentation strategies is something that should be addressed in future experiments.

Regarding domain independence, the experiments described in this thesis have involved corpora of various domains taken from the BNC. The BNC is a corpus designed according to a specific set of criteria, so corpora built in a less careful way may provide a more difficult task for the measures proposed. However, the experiment in chapter 9 that computed the homogeneity measures for corpora automatically derived from the web did not reveal any major problems in the application of the measures to this type of corpus.

Regarding language-independence, the experiments have only dealt with English texts, owing to their availability and the availability of annotation tools. Testing the measures on corpora of other languages should be part of future research. In principle, the measures could be applied to corpora of any language. Again, the main difficulty resides in the availability, accuracy and robustness of annotation tools for a particular language. Moreover, to evaluate the measures with respect to other languages, appropriate evaluation corpora for those languages would also be necessary.

10.2.4 Producing other measures

From the literature reviews presented in chapters 3 and 4 it emerged that each of the four steps which make up the algorithm behind the measures proposed for homogeneity and similarity can be carried out in slightly different ways from the ones chosen in this thesis. For example, it would be possible to analyse:

• other features (such as word meaning and document layout)¹;

• other sequence features: with the sole exception of the POS tag bigrams, homogeneity and similarity were defined here over unigram distributions; however, more complex sequence features could be defined;

• other inter-document measures (such as Fisher's exact test or cosine, sketched after this list) and other statistical methods²;

• other feature selection techniques (such as ones that can cope with zero frequencies).
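As a sketch of the first of these alternatives, a minimal cosine similarity over the same frequency lists might look as follows. Note that cosine is a similarity (higher means more alike), where χ², G² and D(p||q) behave as distances, so its orientation would need reconciling before it could replace them directly:

import math

def cosine(f1, f2):
    # Cosine of the angle between two frequency lists viewed as vectors.
    dot = sum(f1[w] * f2[w] for w in set(f1) & set(f2))
    norm = (math.sqrt(sum(v * v for v in f1.values()))
            * math.sqrt(sum(v * v for v in f2.values())))
    return dot / norm if norm else 0.0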

10.3 Directions for future research

There are several directions for future research which may shed further light on the issue of corpus homogeneity and similarity and, more generally, on corpus profiling and comparison. They are:

• to widen the definition of corpus homogeneity by analysing the relation between homogeneity and the size of training data;

• to widen the scope of the measures for corpus homogeneity and similarity by studying the possibility of profiling and directly comparing not only homogeneous but also balanced corpora;

• to integrate the measures for corpus homogeneity and similarity into the process of corpus design.

¹ In particular, the work described here has not considered how HTML and other markup languages could be used to define surface-level features.

² An example: chapter 7 introduced Rainbow, a Naive Bayes classifier, which was used as a black box in the experiment. However, looking at the algorithm used in Naive Bayes classification it is possible to recognise a further measure for homogeneity and similarity, one not so different from the measures proposed in chapter 6. The Naive Bayes classifier computes a vector for each class present in the training set, on the basis of the internal features present in the documents that belong to a particular class. It then estimates the probability of each feature being present for each of the classes. A new document is assigned the class with the highest probability, computed from the class probability and the feature/class probabilities for the features present in the new document. When Naive Bayes classifiers are used in content classification, the document features commonly used are words. But with simple changes during the pre-processing phase, documents can be transformed from containing words into containing whatever type of features we want to analyse (e.g. POS tags, word senses, layout features etc.). It follows that corpus homogeneity and similarity computed using a Naive Bayes classifier might be studied using many different feature types, and consequently many different corpus aspects.
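A minimal sketch of the classifier described in footnote 2 may make the connection concrete. This is not Rainbow itself but a bare multinomial Naive Bayes with add-one smoothing (both simplifying assumptions); documents are lists of whatever feature tokens the pre-processing phase produces:

import math
from collections import Counter

def train(docs_by_class):
    # docs_by_class maps a class label to a list of token-list documents.
    priors, feats, vocab = {}, {}, set()
    total = sum(len(docs) for docs in docs_by_class.values())
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total                  # class probability
        feats[c] = Counter(t for doc in docs for t in doc)
        vocab |= set(feats[c])
    return priors, feats, vocab

def classify(doc, priors, feats, vocab):
    # Assign the class with the highest (log) posterior probability.
    def log_post(c):
        n = sum(feats[c].values()) + len(vocab)        # add-one smoothing
        return math.log(priors[c]) + sum(
            math.log((feats[c][t] + 1) / n) for t in doc if t in vocab)
    return max(priors, key=log_post)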

The experiments in this thesis did not fully investigate the relationship between corpus homogeneity and the size of the training set required by a system in order to achieve high performance in an NLP task. The main reason for this was the lack of sufficient, appropriate data. The intuition to explore is that a system can achieve a certain degree of accuracy using a relatively small training set when the corpus is homogeneous, whereas the same system would require a much bigger training set in order to achieve the same accuracy where the corpus used is not so homogeneous. A preliminary experiment - based on, but extending, the experiment described in chapter 7 - was carried out using Rainbow and different sizes of training set. The set sizes used were 1, 5 and 10 documents. Appendix F presents an extract of the Rainbow accuracy values, obtained using all words for the five top-, five middle- and five bottom-ranked corpora. The results in table F.1 show that for each training size Rainbow achieves high accuracy when the most homogeneous corpora are used, lower accuracy values when less homogeneous corpora are used, and extremely low accuracy values when heterogeneous corpora are used. Further experiments could be run to clarify the relationship between homogeneity and training data size. The main constraint in carrying out this study is again the availability of corpora with manually checked tagging containing homogeneous subcorpora of relevant size.
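Purely as a pointer for such further experiments, a training-size study could be scripted around any classifier; here is a sketch reusing the Naive Bayes functions above (Rainbow itself was used in the actual experiment, and corpus is a hypothetical mapping from class labels to token-list documents):

def accuracy(corpus, n_train):
    # Train on the first n_train documents per class, test on the rest.
    train_docs = {c: docs[:n_train] for c, docs in corpus.items()}
    test_docs = [(c, d) for c, docs in corpus.items()
                 for d in docs[n_train:]]
    priors, feats, vocab = train(train_docs)
    hits = sum(classify(d, priors, feats, vocab) == c for c, d in test_docs)
    return hits / len(test_docs)

# e.g. for n in (1, 5, 10): print(n, accuracy(corpus, n))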

The study of homogeneity has the aim of distinguishing corpora made up of documents that belong to the same language variety (homogeneous corpora) from corpora made up of texts that belong to different language varieties (balanced corpora). This is of interest in its own right, since it gives useful information about the internal characteristics of a corpus. Moreover, it is of interest since it represents the preliminary and necessary step for any quantitative attempt to analyse similarity among corpora, as presented by Kilgarriff (2001). In fact, the validity of any measure for corpus similarity may be questioned when a homogeneous corpus is directly compared against a balanced or a heterogeneous corpus. This thesis has analysed the similarity among homogeneous corpora. A next step would be to address the issue of similarity between a homogeneous and a balanced corpus, or between two balanced corpora. The first step in this direction should address the problem of profiling balanced corpora. Can frequency lists of internal features successfully represent the various types of documents which make up a balanced corpus? The answer to this question represents an interesting challenge and would shed further light on the issue of corpus comparison, since it would make it possible to extend the scope of corpus similarity measures (and the range of possible comparisons among corpora): not only among homogeneous corpora but also between a homogeneous and a balanced corpus - to check whether the language variety represented in the homogeneous corpus is among the ones represented in the balanced corpus - or between two balanced corpora. It seems plausible that a single frequency list will not be able to adequately represent the characteristics of the different components of which a balanced corpus is made up. However, if profiles for balanced corpora can still be based on frequency lists, at least in principle the measures for similarity presented in this thesis could be applied to the new, more complex profiles without major problems.

A further aspect that might be worth investigating is the possibility of integrating the measures for homogeneity and similarity into the corpus design process, particularly when creating large corpora automatically from text on the Web. The BootCat system (Baroni and Bernardini, 2004) for building specialised language corpora, and the procedure presented by Santini (2005) for automatically detecting genres on the web, are examples of corpus design processes which might benefit from being combined with corpus homogeneity and similarity measures.

References

Artale, Alessandro, Anna Goy, Bernardo Magnini, Emanuele Pianta, and Carlo Strapparava. 1998. Coping with WordNet sense proliferation. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC), volume 2, pages 873-878, Granada, Spain. ELRA.

Aston, Guy and Lou Burnard. 1998. The BNC Handbook. Edinburgh University Press. See also at http://www.natcorp.ox.ac.uk/.

Atkins, Sue, Jeremy Clear, and Nicholas Ostler. 1992. Corpus design criteria. Journal of Literary and Linguistic Computing, 7(1):1-16.

Atkins, Sue and Antonio Zampolli, editors. 1993. Approaches to the Lexicon. Oxford University Press, Oxford.

Baayen, R. Harald. 2001. Word frequency distributions. Kluwer Academic Publishers, Dordrecht.

Baayen, R. Harald, Hans van Halteren, and Fiona J. Tweedie. 1996. Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11:121-132.

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto, editors. 1999. Modern Information Retrieval. ACM Press, Wokingham, UK.

Baker, Douglas and Andrew McCallum. 1998. Distributional clustering of words for text classification. In Proceedings of SIGIR '98.

Banko, Michele and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. ACL.

Baroni, Marco and Silvia Bernardini. 2004. BootCaT: bootstrapping corpora and terms from the Web. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Barcelona, Spain. ELRA.

Batchelder, Eleanor Olds. 1992. A learning experience: training an artificial neural network to discriminate languages. CUNY technical report.

Besançon, Romaric and Martin Rajman. 2002. Evaluation of a vector space similarity measure in a multilingual framework. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands. ELRA.

Biber, Douglas. 1988. Variation across speech and writing. Cambridge University Press, Cambridge, UK.

Biber, Douglas. 1993. Using register-diversified corpora for general language studies. Computational Linguistics, 19(2):219-241.

Bod, Rens, Jennifer Hay, and Stefanie Jannedy, editors. 2003. Probabilistic Linguistics. The MIT Press, Cambridge, MA.

Briscoe, Ted and John Carroll. 2002. Robust accurate statistical annotation of general text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), pages 1499-1504, Las Palmas, Canary Islands. ELRA.

Brown, Peter F., Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Calishain, Tara and Rael Dornfest. 2004. Google Hacks. Tips & tools for smarter searching. O'Reilly.

Cavnar, William B. and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the 1994 Symposium on Document Analysis and Information Retrieval, pages 161-175, Las Vegas, USA.

Cheeseman, Peter, James Kelly, Matthew Self, John Stutz, Will Taylor, and Don Freeman. 1998. Autoclass: a Bayesian classification system. In Proceedings of the 5th International Conference on Machine Learning. ICML.

Church, Kenneth. 2002. Opinion. ELSNEWS 11.3, Autumn. The newsletter of the European network in human language technologies.

Church, Kenneth W. and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19-54.

Cutting, Douglass, David Karger, Jan Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM/SIGIR Conference, pages 318-329, Copenhagen, Denmark.

Daille, Beatrice. 1995. Combined approach for terminology extraction: lexical statistics and linguistic filtering. Technical report 5, UCREL, Lancaster University.

Deerwester, Scott C., Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407.

Dewdney, Nigel, Carol VanEss-Dykema, and Richard MacMillan. 2001. The form is the substance: classification of genres in text. In Workshop on Human Language Technology and Knowledge Management, Toulouse, France. ACL.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Dunning, Ted. 1994. Statistical identification of language. Technical report, Computing Research Lab, New Mexico State University.

EAGLES, Document EAG-TCWG-TTYP/P. 1996. Preliminary recommendations on text typology. Expert Advisory Group on Language Engineering Standards. Available at http://www.ilc.pi.cnr.it/EAGLES96/texttyp/texttyp.html.

Eggins, Suzanne and J.R. Martin, 1997. Discourse as structure and process, chapter Genres and Register of Discourse. Sage Publications, London.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the 4th ACL Conference on Applied NLP, pages 53-58, Stuttgart, Germany. ACL.

Everitt, Brian S. 1993. Cluster analysis. Edward Arnold, London.

Folch, Helka, Serge Heiden, Benoit Habert, Serge Fleury, Gabriel Illouz, Pierre Lafon, Julien Nioche, and Sophie Prevost. 2000. TypTex: inductive typological text classification by multivariate statistical analysis for NLP systems tuning/evaluation. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), pages 141-148, Athens, Greece. ELRA.

Francis, W. Nelson and Henry Kucera, 1979. Manual of information to accompany a standard corpus of present-day American English, for use with digital computers. RI: Department of Linguistics, Brown University.

Fujii, Atsushi and Tetsuya Ishikawa. 2000. Utilizing the World Wide Web as an encyclopedia: extracting term descriptions from semi-structured texts. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 488-495, Hong Kong. ACL.

Gale, William A., Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Speech and Natural Language Workshop, volume 26, pages 233-237, San Francisco. Morgan Kaufmann.

Gale, William A. and Geoffrey Sampson. 1995. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3).

Ghani, Rayid, Rosie Jones, and Dunja Mladenic. 2001. Using the Web to create minority language corpora. In Proceedings of the 10th International Conference on Information and Knowledge Management. CIKM.

Gildea, Daniel. 2001. Corpus variation and parser performance. In EMNLP-2001. EMNLP.

Good, Irving J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264.

Granger, Sylviane, Joseph Hung, and Stephanie Petch-Tyson, editors. 2003. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. John Benjamins Publishing Company, Amsterdam.

Grefenstette, Gregory and Julien Nioche. 2000. Estimation of English and non-English language use on the WWW. In Proceedings of RIAO (Recherche d'Informations Assistée par Ordinateur), pages 237-246, Paris.

Grishman, Ralph and Richard Kittredge. 1986. Analyzing language in restricted domains: sublanguage description and processing. Lawrence Erlbaum Associates, Hillsdale, NJ.

Hearst, Marti A. and Jan Pedersen. 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of the 19th Annual International ACM/SIGIR Conference, pages 78-84, Zurich, Switzerland.

Holmes, David I. 1994. Authorship attribution. Computers and the Humanities, 28:87-106.

Illouz, Gabriel. 2000. Sublanguage dependent evaluation: toward predicting NLP performances. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), Athens, Greece. ELRA.

Illouz, Gabriel, Benoit Habert, Helka Folch, Serge Heiden, Serge Fleury, Pierre Lafon, and Sophie Prevost. 2000. TypTex: generic features for text profiler. In Content-Based Multimedia Information Access, Paris, France. RIAO.

Johansson, Stig, Geoffrey N. Leech, and Helen Goodluck, 1978. Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Department of English, University of Oslo.

Karlgren, Jussi, 1999. Natural Language Information Retrieval, chapter Stylistic Experiment in Information Retrieval. Kluwer.

Karlgren, Jussi and Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proceedings of the 15th International Conference on Computational Linguistics. COLING.

Keller, Frank and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459-484.

Kessler, Brett, Geoffrey Nunberg, and Hinrich Schutze. 1997. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 32-38, Madrid, Spain. ACL/EACL.

Kilgarriff, Adam. 1996. Which words are particularly characteristic of a text? A survey of statistical approaches. In Language Engineering for Document Analysis and Recognition, Brighton, UK. AISB Workshop Series.

Kilgarriff, Adam. 1997. Using word frequency lists to measure corpus homogeneity and similarity between corpora. In 5th ACL Workshop on Very Large Corpora, Beijing and Hong Kong. ACL.

Kilgarriff, Adam, editor. 2000. Comparing corpora, Hong Kong, October. ACL. Workshop held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics.

Kilgarriff, Adam. 2001. Comparing corpora. International Journal of Corpus Linguistics, 6(1):1-37.

Kilgarriff, Adam and Gregory Grefenstette. 2003. Introduction to the Special Issue on Web as Corpus. Computational Linguistics, 29(3):333-348.

Kilgarriff, Adam and Tony Rose. 1998. Measures for corpus similarity and homogeneity. In The 3rd Conference on Empirical Methods in Natural Language Processing, pages 46-52, Granada, Spain.

Kilgarriff, Adam and Raphael Salkie. 1996. Corpus similarity and homogeneity via word frequency. In Proceedings of Euralex'96, pages 121-130, Gothenburg, Sweden. Euralex'96.

Kingsbury, Paul and Martha Palmer. 2002. From Treebank to Propbank. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands. ELRA.

Kittredge, Richard and John Lehrberger, editors. 1982. Sublanguage. Studies of language in restricted semantic domains. Walter de Gruyter, Berlin.

Kjell, Bradley. 1994. Authorship determination using letter pair frequency features with neural network classifiers. Literary and Linguistic Computing, 9:119-124.

Lee, David. 2001. Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, 5(3):37-72. Special Issue on: Using Corpora in Language Teaching and Learning.

Lewis, David D., Yiming Yang, Tony Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397.

Magnini, Bernardo and Gabriela Cavaglia. 2000. Integrating subject field codes into WordNet. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), Athens, Greece. ELRA.

Manning, Christopher D. and Hinrich Schutze. 1999. Foundations of statistical natural language processing. The MIT Press, Cambridge, MA.

McCallum, Andrew. 1996. Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.

McCallum, Andrew and Kamal Nigam. 1998. A comparison of event models for naive bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, pages 41-48. AAAI.

McEnery, Tony and Michael Oakes, 2000. Handbook of Natural Language Processing, chapter Authorship Identification and Computational Stylometry. Marcel Dekker.

McEnery, Tony and Andrew Wilson. 1996. Corpus Linguistics. Edinburgh University Press, Edinburgh.

Mladenic, Dunja and Marko Grobelnik. 1998. Feature selection for text classification based on text hierarchy. In Working notes of learning from text and the Web. Conference on Automated Learning and Discovery Learning. CONALD.

Mosteller, Frederick and David L. Wallace. 1984. Applied Bayesian and Classical Inference - the case of the Federalist Papers. Springer Verlag, New York.

Owen, Frank and Ronald Jones. 1977. Statistics. Polytech Publishers, Stockport.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 79-86. EMNLP.

Pereira, Fernando, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183-190. ACL.

Piano, Lawrence J. 1997. Modifications to the part of speech tagger. Available at http://www.cogs.susx.ac.uk/lab/nlp/rasp/.

Pirolli, Peter, Patricia Schank, Marti A. Hearst, and Christine Diehl. 1996. Scatter/Gather browsing communicates the topic structure of a very large text collection. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pages 213-220, Zurich, Switzerland.

Power, Richard and Donia Scott. 1999. Using layout for the generation, understanding or retrieval of documents. In Technical Report of the AAAI Symposium on Using Layout for the Generation, Understanding or Retrieval of Documents, Cape Cod, USA. AAAI.

Rayson, Paul and Roger Garside. 2000. Comparing corpora using frequency profiling. In The Workshop on Comparing Corpora, pages 1-6, Hong Kong, October. ACL.

Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527-534, College Park, Maryland, USA. ACL.

Roland, Douglas and Daniel Jurafsky. 1998. How verb subcategorization frequencies are affected by corpus choice. In The 36th Annual Meeting of the Association for Computational Linguistics, pages 1122-1128, Montreal, Canada. ACL'98.

Roland, Douglas, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder, and Chris Riddoch. 2000. Verb subcategorization frequency differences between business-news and balanced corpora: the role of verb sense. In The Workshop on Comparing Corpora, pages 28-34, Hong Kong, October. ACL.

Rose, Tony, Nicholas Haddock, and Roger Tucker. 1997. The effects of corpus size and homogeneity on language model quality. In ACL SIGDAT Workshop on Very Large Corpora, pages 178-191, Beijing and Hong Kong. ACL SIGDAT.

Rudman, Joseph. 1998. The state of authorship attribution studies: some problems and solutions. Computers and the Humanities, 31:351-365.

Santini, Marina. 2005. Genres in formation? An exploratory study of web pages using cluster analysis. In Proceedings of the 8th Annual Colloquium for the UK Special Interest Group for Computational Linguistics (CLUK 05), Manchester, UK.

Scott, Donia, Nadjet Bouayad-Agha, Richard Power, Stefan Schulz, Rainer Beck, Dawn Murphy, and Rose Lockwood. 2001. PILLS: a multilingual authoring system for patient information. In Vision of the future and lessons from the past. Proceedings of the 2001 AMIA Annual Symposium, Washington DC.

Sekine, Satoshi. 1994a. Automatic sublanguage identification for a new text. In 2nd Annual Workshop on Very Large Corpora, Kyoto, Japan.

Sekine, Satoshi. 1994b. A new direction for sublanguage NLP. In International Conference on New Methods in Language Processing, Manchester, UK.

Sekine, Satoshi. 1997. The domain dependence of parsing. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 96-102, Washington D.C., USA. ACL.

Sibun, Penelope and Jeffrey C. Reynar. 1996. Language identification: examining the issue. In Proceedings of the 1996 Symposium on Document Analysis and Information Retrieval, pages 125-135, Las Vegas, USA.

Sibun, Penelope and A. Lawrence Spitz. 1994. Language determination: natural language processing from scanned document images. In Proceedings of the 4th Applied Natural Language Processing Conference, pages 15-21, Stuttgart, Germany.

Sigley, Robert. 1997. Text categories and where you can stick them: a crude formality index. International Journal of Corpus Linguistics, 2(2):199-237.

Slocum, Jonathan, 1986. Analyzing language in restricted domains: sublanguage description and processing, chapter How one might automatically identify and adapt a sublanguage: an initial exploration. Lawrence Erlbaum, Hillsdale, NJ.

Souter, Clive, Gavin Churcher, Judith Hayes, John Hughes, and Stephen Johnson. 1994. Natural language identification using corpus-based models. Hermes Journal of Linguistics, 13:183-203.

Stamatatos, Efstathios, Nikos Fakotakis, and George Kokkinakis. 1999. Automatic authorship attribution. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, pages 158-164, Bergen, Norway.

Stamatatos, Efstathios, Nikos Fakotakis, and George Kokkinakis. 2000a. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4):471-495.

Stamatatos, Efstathios, Nikos Fakotakis, and George Kokkinakis. 2000b. Text genre detection using common word frequencies. In Proceedings of the 18th International Conference on Computational Linguistics, Saarbrucken, Germany. COLING.

Stamatatos, Efstathios, Nikos Fakotakis, and George Kokkinakis. 2001. Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35:193-214.

Stubbs, Michael. 1996. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Blackwell Publishers, Oxford.

Swales, John M. 1990. Genre analysis: English in Academic and Research Settings. The Cambridge Applied Linguistics Series. Cambridge University Press, New York.

Tambouratzis, George, Stella Markantonatou, Nikolaos Hairetakis, and George Carayannis. 2000. Automatic style categorisation of corpora in the Greek language. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), Athens, Greece. ELRA.

Teufel, Simone, Jean Carletta, and Marc Moens. 1999. An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, pages 110-117, Bergen, Norway. EACL.

Turney, Peter. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417-424, Philadelphia. ACL.

Tweedie, Fiona J. and R. Harald Baayen. 1998. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32:323-352.

Walker, Donald E. and Robert A. Amsler, 1986. Analyzing language in restricted domains: sublanguage description and processing, chapter The use of machine readable dictionaries in sublanguage analysis. Lawrence Erlbaum, Hillsdale, NJ.

Wilks, Yorick and Mark Stevenson. 1998. Word sense disambiguation using optimised combination of knowledge sources. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Canada. COLING-ACL.

Willett, Peter. 1988. Recent trends in hierarchical document clustering: a critical review. Information Processing and Management, 24(5):577-597.

Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37:1085-1094.

Wolters, Maria and Mathias Kirsten. 1999. Exploring the use of linguistic features in domain and genre classification. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway. EACL.

Yang, Yiming. 1999. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2):67-88.

Yang, Yiming and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42-49. SIGIR.

Yang, Yiming and Jan Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412-420. ICML.

Zipf, George K. 1949. Human behaviour and the principle of least effort. Addison-Wesley, Cambridge, MA.

Appendix A

Example of a frequency list and a probability distribution

Tables A.1, A.2, A.3 and A.4 present the document profile built according to the methodology described in chapter 6. The document used is the first of the 10 slices into which we divided the Guardian corpus used in the experiment described in chapter 8. We used POS tags as features and we present the frequency, the probability and the smoothed probability of each tag in decreasing order. The size of the text is 140,106 words.
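A sketch of how such a profile can be produced from a tagged text follows (Python; add-k smoothing is used purely as an illustrative stand-in, since the actual smoothing technique is the one specified in chapter 6):

from collections import Counter

def profile(tags, k=0.05, tagset_size=None):
    # Rows of (tag, frequency, ML probability, smoothed probability),
    # in decreasing order of frequency, as in tables A.1-A.4.
    counts = Counter(tags)
    n = len(tags)
    v = tagset_size if tagset_size else len(counts)
    return [(tag, f, f / n, (f + k) / (n + k * v))
            for tag, f in counts.most_common()]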

POS tag Frequency Probability Smoothed prob.
NN1 17579 0.125469288 0.125463576
JJ 10420 0.074372261 0.074368993
NP1 9263 0.066114228 0.066111355
II 9223 0.06582873 0.06582587
AT 9221 0.065814455 0.065811596
NN2 7156 0.051075614 0.05107346
YCOM 6153 0.043916749 0.043914937
- 5500 0.039255992 0.039254403
YSTP 5417 0.038663583 0.038662023
IO 3828 0.02732217 0.027321152
AT1 3084 0.022011905 0.022011141
VVI 3045 0.021733545 0.021732794
VVD 3017 0.021533696 0.021532955
VVN 2966 0.021169686 0.021168962
CC 2894 0.020655789 0.02065509
RR 2663 0.019007038 0.019006417
TO 2223 0.015866558 0.015866088
YQUO 2024 0.014446205 0.014445803
NNB 1682 0.012005196 0.01200491
VVG 1604 0.011448475 0.011448216
MC 1554 0.011091602 0.01109136
VM 1403 0.010013847 0.010013656
APPGE 1331 0.00949995 0.009499784
IF 1098 0.007836923 0.007836837
GE 1070 0.007637075 0.007636998
CST 1045 0.007458638 0.00745857
VBZ 998 0.007123178 0.007123126
VVZ 921 0.006573594 0.006573568
CS 900 0.006423708 0.006423689
VBDZ 881 0.006288096 0.006288084
PPHS1 739 0.005274578 0.005274614
RT 728 0.005196066 0.005196106
NNT1 726 0.005181791 0.005181831
IW 720 0.005138966 0.005139009
PPH1 708 0.005053317 0.005053363
VBI 674 0.004810643 0.004810702
VVO 669 0.004774956 0.004775016
RP 628 0.004482321 0.004482394
MD 614 0.004382396 0.004382475
Table A.1: Example of POS frequency list and probability distribution with and without smoothing. Part I: the 50 most frequent tags

POS tag Frequency Probability Smoothed prob.
DD1 581 0.00414686 0.00414695
DDQ 564 0.004025524 0.004025619
VHZ 536 0.003825675 0.00382578
XX 529 0.003775713 0.00377582
CCB 493 0.003518764 0.003518884
VBR 488 0.003483077 0.003483199
YDSH 427 0.003047692 0.003047835
NN 383 0.002733645 0.002733802
VBN 382 0.002726507 0.002726665
NNT2 373 0.00266227 0.002662431
VHD 373 0.00266227 0.002662431
PNQS 369 0.00263372 0.002633882
VBDR 367 0.002619445 0.002619608
VHO 341 0.002433871 0.002434043
PPHS2 336 0.002398184 0.002398358
RG 332 0.002369634 0.002369809
DA 296 0.002112686 0.002112873
DD 280 0.001998487 0.001998679
MC1 272 0.001941387 0.001941582
CSA 261 0.001862875 0.001863074
NPM1 240 0.001712989 0.001713195
DB 238 0.001698714 0.00169892
RL 226 0.001613064 0.001613275
PPIS2 224 0.001598789 0.001599001
VHI 204 0.00145604 0.001456259
NNU 194 0.001384666 0.001384887
EX 194 0.001384666 0.001384887
PPIS1 187 0.001334704 0.001334928
JJR 185 0.001320429 0.001320653
NNL1 174 0.001241917 0.001242145
DA2 158 0.001127718 0.001127951
DD2 156 0.001113443 0.001113677
YCOL 148 0.001056343 0.00105658
NPD1 147 0.001049206 0.001049443
YBL 135 0.000963556 0.000963798
YBR 135 0.000963556 0.000963798
JJT 131 0.000935006 0.000935249
NNO 127 0.000906457 0.000906701
PN1 118 0.000842219 0.000842466
RGR 104 0.000742295 0.000742547
Table A.2: Example of POS frequency list and probability distribution with and without smoothing. Part II

POS tag Frequency Probability Smoothed prob.
RRR 101 0.000720883 0.000721135
PPHO1 98 0.00069947 0.000699724
CSN 96 0.000685195 0.00068545
VBG 94 0.000670921 0.000671175
PPHO2 89 0.000635233 0.00063549
RRQ 88 0.000628096 0.000628353
ND1 87 0.000620958 0.000621216
FW 86 0.000613821 0.000614078
DAR 85 0.000606684 0.000606941
PPY 83 0.000592409 0.000592667
RA 76 0.000542446 0.000542707
PPX1 72 0.000513897 0.000514159
YQUE 71 0.000506759 0.000507021
RGT 62 0.000442522 0.000442787
VDO 60 0.000428247 0.000428513
NNO2 59 0.00042111 0.000421376
DA1 57 0.000406835 0.000407101
DAT 57 0.000406835 0.000407101
VDD 56 0.000399697 0.000399964
PPIO2 55 0.00039256 0.000392827
YSCOL 53 0.000378285 0.000378553
PPIO1 48 0.000342598 0.000342867
MC2 45 0.000321185 0.000321455
NP2 45 0.000321185 0.000321455
NNU2 42 0.000299773 0.000300044
CSW 42 0.000299773 0.000300044
VDZ 39 0.000278361 0.000278632
PPX2 37 0.000264086 0.000264358
VDI 33 0.000235536 0.000235809
DB2 32 0.000228398 0.000228672
YLIP 29 0.000206986 0.00020726
JK 28 0.000199849 0.000200123
DDQGE 27 0.000192711 0.000192985
VHG 24 0.000171299 0.000171574
FO 24 0.000171299 0.000171574
VVGK 19 0.000135612 0.000135887
VBM 19 0.000135612 0.000135887
VHN 19 0.000135612 0.000135887
VDN 18 0.000128474 0.000128749
PNQO 16 0.000114199 0.000114474
Table A.3: Example of POS frequency list and probability distribution with and without smoothing. Part III

POS tag Frequency Probability Smoothed prob.
NNU1 13 9.27869E-05 9.30608E-05
BCL 11 7.8512E-05 7.87848E-05
RGQ 10 7.13745E-05 7.16466E-05
UH 10 7.13745E-05 7.16466E-05
VMK 10 7.13745E-05 7.16466E-05
RRT 9 6.42371E-05 6.45081E-05
MC-MC 9 6.42371E-05 6.45081E-05
REX 9 6.42371E-05 6.45081E-05
YEX 8 5.70996E-05 5.73693E-05
PN 8 5.70996E-05 5.73693E-05
DDQV 8 5.70996E-05 5.73693E-05
MF 8 5.70996E-05 5.73693E-05
NNL2 7 4.99622E-05 5.02301E-05
VVNK 6 4.28247E-05 4.30903E-05
RRQV 5 3.56873E-05 3.59495E-05
NP 4 2.85498E-05 2.88071E-05
PPGE 4 2.85498E-05 2.88071E-05
VBO 4 2.85498E-05 2.88071E-05
VDG 3 2.14124E-05 2.16617E-05
ZZ1 3 2.14124E-05 2.16617E-05
RGQV 2 1.42749E-05 1.45101E-05
RPK 2 1.42749E-05 1.45101E-05
FU 2 1.42749E-05 1.45101E-05
NNA 2 1.42749E-05 1.45101E-05
NPD2 1 7.13745E-06 7.33988E-06
Table A.4: Example of POS frequency list and probability distribution with and without smoothing. Part IV

Appendix B

List of function words

The function words used for the experiments in this thesis are as follows:

A a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully

B b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by

C c c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently

D d definitely described despite did didn't different do does doesn't doing don't done down downwards during

E e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except

F f far few fifth first five followed following follows for former formerly forth four from further furthermore

G g get gets getting given gives go goes going gone got gotten greetings

H h had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however

I i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself

J j just

K k keep keeps kept know knows known

L l last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd

M m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself

N n name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere n't

O o obviously of off often oh ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own

P p particular particularly per perhaps placed please plus possible presumably probably provides

Q q que quite qv

R r rather rd re really reasonably regarding regardless regards relatively respectively right

S 's s said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure

T t t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two

U u un under unfortunately unless unlikely until unto up upon us use used useful uses using usually uucp

V v value various very via viz vs

W w want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would wouldn't

X x

Y y yes yet you you'd you'll you're you've your yours yourself yourselves

Z z zero

misc. ' 're 'll 'm

Appendix C

Results of the Rainbow experiment

This appendix presents the complete list of the results obtained in the experiment described in chapter 7.

C.1 Homogeneity scores

Tables C.1, C.2 and C.3 present the homogeneity values computed for each of the 54 subcorpora used. The homogeneity values are computed according to the measure described in chapter 6, using the 500 most frequent words in each corpus and, respectively, χ², G² and D(p||q). Standard deviation values are given in brackets.
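For reference, the standard forms of the three inter-document statistics over a pair of frequency lists are as follows (after Dunning, 1993, and Kilgarriff, 2001; the exact normalisation applied in chapter 6 to the 500-item lists is not repeated here). Writing o_{i,w} for the frequency of feature w in list i, N_i for the total of list i, and p, q for the two probability distributions:

\chi^2 = \sum_{w}\sum_{i=1}^{2}\frac{(o_{i,w}-e_{i,w})^2}{e_{i,w}}, \qquad
G^2 = 2\sum_{w}\sum_{i=1}^{2} o_{i,w}\ln\frac{o_{i,w}}{e_{i,w}}, \qquad
D(p\|q) = \sum_{w} p(w)\log\frac{p(w)}{q(w)},

where e_{i,w} = N_i\,(o_{1,w}+o_{2,w})/(N_1+N_2) is the expected frequency under the hypothesis that the two lists are drawn from the same population.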

N. Corpus χ² G² D(p||q)
1 g-W_news_script 0.0233 (0.0027) 0.0506 (0.0061) 0.0270 (0.0084)
2 g-W_newsp_tabloid 0.0357 (0.0143) 0.0796 (0.0338) 0.0406 (0.0243)
3 m-to_be_spoken 0.0408 (0.0364) 0.0940 (0.0906) 0.0574 (0.0645)
4 g-W_newsp_other_report 0.0413 (0.0137) 0.0935 (0.0337) 0.0465 (0.0254)
5 g-W_newsp_other_sports 0.0447 (0.0121) 0.1012 (0.0289) 0.0576 (0.0318)
6 g-W_newsp_other_commerce 0.0540 (0.0156) 0.1247 (0.0383) 0.0765 (0.0328)
7 g-W_newsp_brdsht_nat_misc (0.0192) (0.0476) (0.0405)
8 g-W_newsp_other_social 0.0556 (0.0196) 0.1283 (0.0473) 0.0766 (0.0489)
9 g-W_hansard 0.0561 (0.0138) 0.1287 (0.0328) 0.0739 (0.0269)
10 g-S_brdcast_news 0.0659 (0.0275) 0.1524 (0.0668) 0.0803 (0.0606)
11 g-W_pop_lore 0.0692 (0.0206) 0.1609 (0.0503) 0.1114 (0.0545)
12 g-S_conv 0.0727 (0.0148) 0.1631 (0.0347) 0.0801 (0.0230)
13 d-S_Demog_C1 0.0750 (0.0224) 0.1719 (0.0568) 0.0888 (0.0284)
14 d-W_arts 0.0825 (0.0241) 0.1946 (0.0592) 0.1186 (0.0631)
15 d-S_Demog_AB 0.0852 (0.0272) 0.1926 (0.0629) 0.1110 (0.0489)
16 g-W_non_ac_nat_science 0.0856 (0.0423) 0.2041 (0.1059) 0.1442 (0.1137)
17 d-S_Demog_C2 0.0857 (0.0255) 0.1967 (0.0628) 0.0989 (0.0408)
18 d-W_imaginative 0.0881 (0.0226) 0.2047 (0.0547) 0.1020 (0.0482)
19 d-S_cg_leisure 0.0887 (0.0410) 0.2070 (0.0996) 0.1059 (0.0837)
20 g-W_fict_prose 0.0897 (0.0263) 0.2089 (0.0662) 0.1210 (0.0612)
Table C.1: Homogeneity scores for subcorpora 1-20

N. Corpus χ² G² D(p||q)
21 m-periodical 0.0961 (0.0447) 0.2311 (0.1149) 0.1380 (0.0830)
22 g-W_religion 0.0966 (0.0268) 0.2253 (0.0665) 0.1356 (0.0615)
23 d-S_cg_education 0.1002 (0.0372) 0.2325 (0.0888) 0.1529 (0.0766)
24 d-S_Demog_DE 0.1011 (0.0363) 0.2371 (0.0895) 0.1088 (0.0441)
25 g-W_biography 0.1030 (0.0426) 0.2427 (0.1050) 0.1470 (0.1019)
26 g-W_ac_medicine 0.1037 (0.0371) 0.2502 (0.0951) 0.1629 (0.0827)
27 d-W_belief_thought 0.1042 (0.0302) 0.2444 (0.0746) 0.1381 (0.0718)
28 d-W_leisure 0.1050 (0.0529) 0.2499 (0.1338) 0.1363 (0.0964)
29 d-W_world_affairs 0.1051 (0.0336) 0.2504 (0.0833) 0.1662 (0.0724)
30 g-W_non_ac_humanities_arts 0.1072 (0.0252) 0.2565 (0.0630) 0.1670 (0.0614)
31 m- 0.1093 (0.0474) 0.2522 (0.1155) 0.1459 (0.0955)
32 d-W_commerce 0.1096 (0.0303) 0.2621 (0.0765) 0.1641 (0.0766)
33 g-W_instructional 0.1109 (0.0420) 0.2696 (0.1084) 0.2054 (0.1029)
34 g-W_non_ac_polit_law_edu 0.1136 (0.0523) 0.2753 (0.1339) 0.1602 (0.0925)
35 g-W_non_ac_medicine 0.1139 (0.0339) 0.2726 (0.0862) 0.1668 (0.0841)
36 g-S_pub_debate 0.1139 (0.0396) 0.2673 (0.0957) 0.1738 (0.0801)
37 g-W_commerce 0.1147 (0.0393) 0.2754 (0.1013) 0.1808 (0.0894)
38 g-W_ac_soc_science 0.1150 (0.0223) 0.2776 (0.0568) 0.1874 (0.0621)
39 random1 0.1183 (0.0423) 0.2826 (0.1061) 0.1629 (0.1044)
40 g-W_ac_polit_law_edu 0.1226 (0.0295) 0.2980 (0.0770) 0.1929 (0.0730)
Table C.2: Homogeneity scores for subcorpora 21-40

N. Corpus χ² G² D(p||q)
41 g-W_ac_nat_science 0.1250 (0.0339) 0.3070 (0.0879) 0.1853 (0.0869)
42 g-W_ac_humanities_arts 0.1281 (0.0497) 0.3102 (0.1255) 0.2047 (0.1053)
43 g-W_advert 0.1292 (0.0608) 0.3145 (0.1565) 0.1983 (0.1682)
44 g-W_non_ac_soc_science 0.1316 (0.0455) 0.3166 (0.1162) 0.2160 (0.1139)
45 d-W_nat_science 0.1319 (0.0443) 0.3253 (0.1156) 0.2523 (0.1106)
46 m-book 0.1337 (0.0417) 0.3208 (0.1060) 0.2157 (0.1096)
47 d-W_soc_science 0.1350 (0.0527) 0.3288 (0.1348) 0.2170 (0.1131)
48 d-W_app_science 0.1378 (0.0554) 0.3390 (0.1449) 0.2118 (0.1070)
49 random3 0.1396 (0.0526) 0.3407 (0.1369) 0.2194 (0.1099)
50 g-W_ac_tech_engin 0.1407 (0.0395) 0.3472 (0.1037) 0.2410 (0.1026)
51 g-W_misc 0.1439 (0.0469) 0.3520 (0.1224) 0.2410 (0.1358)
52 random2 0.1597 (0.0711) 0.3870 (0.1831) 0.2558 (0.1807)
53 m-m_pub 0.1614 (0.0685) 0.4007 (0.1770) 0.2595 (0.1230)
54 m-m_unpub 0.1719 (0.0689) 0.4245 (0.1807) 0.2431 (0.1257)
Table C.3: Homogeneity scores for subcorpora 41-54

C.2 Accuracy scores

Tables C.4 and C.5 contain the complete list of the Rainbow accuracy values for the 54 subcorpora used in the experiment described in chapter 7. The accuracy values are obtained by Rainbow on each subcorpus using all words and a training set of 10 documents per class. Standard deviation values are given in brackets.

N. Class Accuracy value
1 g-W_hansard 97.6% (4.31)
2 g-W_newsp_other_sports 92.6% (8.99)
3 g-W_ac_medicine 85.4% (8.85)
4 g-W_newsp_tabloid 83.4% (7.98)
5 g-W_non_ac_medicine 80.4% (11.59)
6 g-W_ac_tech_engin 78.4% (15.69)
7 g-W_newsp_other_report 74.2% (9.49)
8 g-W_news_script 71.2% (20.06)
9 g-S_pub_debate 68.6% (12.93)
10 g-W_non_ac_nat_science 64.2% (12.79)
11 g-W_biography 59.2% (15.75)
12 d-S_Demog_DE 58.2% (16.37)
13 g-W_instructional 57.6% (13.48)
14 g-W_religion 56.8% (12.02)
15 d-S_Demog_C2 56.2% (18.39)
16 g-W_non_ac_polit_law_edu 54.8% (13.43)
17 g-W_newsp_other_commerce 53.8% (17.48)
18 g-W_ac_soc_science 53.2% (15.17)
19 g-W_advert 52.8% (16.41)
20 g-W_fict_prose 52.4% (15.46)
21 g-W_newsp_other_social 47.8% (16.44)
22 g-W_ac_polit_law_edu 44.2% (14.58)
23 g-W_ac_nat_science 43.6% (14.10)
24 g-W_commerce 37.8% (19.19)
25 g-W_newsp_brdsht_nat_misc 35.4% (14.59)
26 d-W_imaginative 33.8% (17.36)
27 d-S_Demog_AB 31.0% (12.65)
28 d-S_cg_leisure 30.2% (16.59)
29 d-W_commerce 29.2% (13.06)
Table C.4: The accuracy values for subcorpora 1-29

N. Class Accuracy value
30 d-W_belief_thought 27.2% (13.25)
31 g-W_ac_humanities_arts 24.8% (12.16)
32 d-S_cg_education 24.6% (12.96)
33 d-W_arts 22.4% (13.02)
34 g-W_non_ac_soc_science 21.8% (8.49)
35 g-S_brdcast_news 21.2% (14.09)
36 d-W_soc_science 21.0% (11.11)
37 m-m_pub 20.6% (9.77)
38 m-to_be_spoken 18.4% (15.16)
39 d-S_Demog_C1 18.0% (11.95)
40 d-W_app_science 16.4% (10.45)
41 g-W_pop_lore 16.0% (9.89)
42 d-W_nat_science 15.2% (12.65)
43 m-m_unpub 12.6% (10.65)
44 g-S_conv 10.6% (10.95)
45 d-W_world_affairs 9.4% (8.66)
46 g-W_non_ac_humanities_arts 9.0% (6.77)
47 d-W_leisure 7.4% (6.64)
48 m- 5.2% (6.77)
49 g-W_misc 1.0% (3.03)
50 random1 0.6% (2.39)
51 m-book 0.6% (2.39)
52 random3 0.0% (0)
53 m-periodical 0.0% (0)
54 random2 0.0% (0)
Table C.5: The accuracy values for subcorpora 30-54

Appendix D

List of Tags in the BNC Sampler

For the RASP experiment described in chapter 8 we made the RASP tagger employ the same tagset used in the BNC Sampler, so that it was possible to compare the stochastic annotation with the manually checked one and compute the accuracy of the RASP tagger. The tag list¹ is presented here together with the list of punctuation tags and the list of the tags that in the experiment have been considered closed classes.

Word-class list

APPGE Possessive determiner, pre-nominal (e.g. my, your, her, his, their)

AT Article, neutral for number (e.g. the, no)

AT1 Singular article (e.g. a, an, every)

BCS "Before-conjunction" (e.g. in order preceding that, even preceding if)

BTO "Before-infinitive-marker" (e.g. in order or so as preceding to)

CC Coordinating conjunction, general (e.g. and, or)

CCB Coordinating conjunction but

CS Subordinating conjunction, general (e.g. if, when, while, because)

CSA As as conjunction

CSN Than as conjunction

CST That as conjunction

¹ Available at http://www.natcorp.ox.ac.uk/what/c7spec.html

CSW The conjunction whether, or if when it is equivalent in function to whether.

DA "After-determiner" (or postdeterminer), neutral fornum ber, e.g. such, former, same.

DAl Singular "after-determiner" (or postdeterminer), e.g. little, much

DA2 Plural "after-determiner" (or postdeterminer), e.g. few, many, several

DA2R Plural "after-determiner" , comparative form (e.g. fewer)

DA2T Plural "after-determiner", superlative form (e.g. fewest)

DAR Comparative "after-determiner11 , neutral for number (e.g. more, less)

DAT Superlative "after-determiner" , neutral for number (e.g. most, least)

DB "Before-determiner" (or predeterminer), neutral for number (e.g. all, half)

DB2 Plural "before-determiner" (or predeterminer), e.g. both

DD Central determiner, neutral for number (e.g. some, any, enough)

DD1 Singular central determiner (e.g. this, that, another)

DD2 Plural central determiner (e.g. these, those)

DDQ Wh-determiner (e.g. which, what)

DDQGE Wh-determiner, possessive (e.g. whose)

DDQV Wh-ever determiner (e.g. whichever, whatever)

EX Existential there

IF For as a preposition

II Preposition (general class: e.g. at, by, in, to, instead of)

IO Of as a preposition

IW With and without as prepositions

JJ Adjective (general or positive) (e.g. good, old, beautiful)

JJR General comparative adjective (e.g. better, older)

JJT General superlative adjective (e.g. best, oldest)

JK Catenative adjective (with a quasi-auxiliary function, e.g. able in 'be able to'; willing in 'be willing to')

LE "Leading coordinator": a word introducing correlative coordination (e.g. both in both . . . and, either in either . . . or)

MC Cardinal number, neutral for number (e.g. two, three, four, 98, 1066)

MC-MC Two numbers linked by a hyphen or dash (e.g. 40-50, 1770-1827)

MC1 Singular cardinal number (e.g. one, 1).

MC2 Plural cardinal number (e.g. tens, twenties, 1900s)

MD Ordinal number (e.g. first, sixth, 77th, last)

MF Fractional number, neutral for number (e.g. quarter, three-fourths, two-thirds)

ND1 Singular noun of direction (e.g. north, east, southwest, NNW)

NN Common noun, neutral for number (e.g. sheep, cod, group, people).

NN1 Singular common noun (e.g. bath, powder, disgrace, sister)

NN2 Plural common noun (e.g. baths, powders, sisters)

NNJ Human organization noun (e.g. council, orchestra, corporation, Company)

NNJ2 Plural human organization noun (e.g. councils, orchestras, corporations)

NNL Locative noun, neutral for number (e.g. Is. as an abbreviation for Island(s))

NNL1 Singular locative noun (e.g. island, street).

NNL2 Plural locative noun (e.g. islands, streets)

NNO Numeral noun, neutral for number (cf. MC above): e.g. hundred, thousand, dozen

NNO2 Plural numeral noun (e.g. hundreds, thousands, dozens)

NNSA Noun of style or title, following a name (e.g. PhD, J.P., Bart when following a person's name)

NNSB Noun of style or title, preceding a name (e.g. Sir, Queen, Ms, Mr when occurring as the first part of a person's name)

NNT1 Singular temporal noun (e.g. day, week, year, Easter)

NNT2 Plural temporal nouns (e.g. days, weeks, years)

NNU Unit-of-measurement noun, neutral for number (e.g. the abbreviations in., ft, cc)

NNU1 Singular unit-of-measurement noun (e.g. inch, litre, hectare)

NNU2 Plural unit-of-measurement noun (e.g. inches, litres, hectares)

NP Proper noun, neutral for number (e.g. acronymic names of companies and organizations, such as IBM, NATO, BBC).

NP1 Singular proper noun (e.g. Vivian, Clinton, Mexico)

NP2 Plural proper noun (e.g. Kennedys, Pyrenees, Cyclades)

NPD1 Singular weekday noun (e.g. Saturday, Wednesday)

NPD2 Plural weekday noun (e.g. Sundays, Fridays)

NPM1 Singular month noun (e.g. April, October)

NPM2 Plural month noun (e.g. Junes, Januaries)

PN Indefinite pronoun, neutral for number (e.g. none)

PN1 Singular indefinite pronoun (e.g. one, somebody, no one, everything)

PNQO Wh-pronoun, objective case (whom)

PNQS Wh-pronoun, subjective case (who)

PNQVS Wh-ever pronoun, subjective case (whoever)

PNX1 Reflexive indefinite pronoun, singular (oneself)

PPGE Nominal possessive pronoun (e.g. mine, yours, his, ours)

PPH1 Singular personal pronoun, third person (it)

PPHO1 Singular personal pronoun, third person, objective case (him, her)

PPHO2 Plural personal pronoun, third person, objective case (them)

PPHS1 Singular personal pronoun, third person, subjective case (he, she)

PPHS2 Plural personal pronoun, third person, subjective case (they)

PPIO1 Singular personal pronoun, first person, objective case (me)

PPIO2 Plural personal pronoun, first person, objective case (us)

PPIS1 Singular personal pronoun, first person, subjective case (I)

PPIS2 Plural personal pronoun, first person, subjective case (we)

PPX1 Singular reflexive pronoun (e.g. myself, yourself, herself)

PPX2 Plural reflexive pronoun (ourselves, yourselves, themselves)

PPY Second person personal pronoun (you)

RA Adverb, after nominal head (e.g. else, galore)

REX Adverb introducing appositional constructions (e.g. i.e., e.g., viz)

RG Positive degree adverb (e.g. very, so, too)

RGA Post-modifying positive degree adverb (e.g. enough, indeed)

RGQ Wh- degree adverb (e.g. how when modifying a gradable adjective, adverb, etc.)

RGQV Wh-ever degree adverb (however when modifying a gradable adjective, adverb etc.)

RGR Comparative degree adverb (e.g. more, less)

RGT Superlative degree adverb (e.g. most, least)

RL Locative adverb (e.g. forward, alongside, there)

RP Adverbial particle (e.g. about, in, out, up)

RPK Catenative adverbial particle (about in be about to)

RR General positive adverb (e.g. often, well, long, easily)

RRQ Wh- general adverb (e.g. how, when, where, why)

RRQV Wh-ever general adverb (e.g. however, whenever, wherever)

RRR Comparative general adverb (e.g. more, oftener, longer, further)

RRT Superlative general adverb (e.g. most, oftenest, longest, furthest)

RT Nominal adverb of time (e.g. now, tomorrow, yesterday)

TO The infinitive marker to

UH Interjection, or other isolate (e.g. oh, yes, wow)

VBO be as a finite form (in declarative and imperative clauses)

VBDR were

VBDZ was

VBG being

VBI be as an infinitive form

VBM am, 'm

VBN been

VBR are, 're

VBZ is, 's

VDO do as a finite form (in declarative and imperative clauses)

VDD did

VDG doing

VDI do as an infinitive form

VDN done

VDZ does, 's

VHO have, 've as a finite form (in declarative and imperative clauses)

VHD had, 'd as a past tense form

VHG having

VHI have as an infinitive form

VHN had as a past participle

VHZ has, 's

VM Modal auxiliary verb (e.g. can, could, will, would, must)

VMK Catenative modal auxiliary (i.e. ought and used when followed by the infinitive marker to)

VVO The base form of the lexical verb as a finite form (in declarative and imperative clauses) (e.g. give, find, look, receive)

VVD The past tense form of the lexical verb (e.g. gave, found, looked, received)

VVG The -ing form of the lexical verb (e.g. giving, finding, looking, receiving)

VVGK The -ing form as a catenative verb (e.g. going in be going to)

VVI The base form of the lexical verb as an infinitive (e.g. give, find, look, receive)

VVN The past participle form of the lexical verb (e.g. given, found, looked, received)

VVNK The past participle as a catenative verb (e.g. bound in be bound to)

VVZ The -s form of the lexical verb (e.g. gives, finds, looks, receives)

XX not, n't.

ZZ1 Singular letter of the alphabet (a, b, S, etc.)

ZZ2 Plural letter of the alphabet (a's, b's, Ss etc.)

Punctuation-tag list

! punctuation tag - exclamation mark

" punctuation tag - quotation marks

( punctuation tag - left bracket

) punctuation tag - right bracket

, punctuation tag - comma

- punctuation tag - dash

. punctuation tag - full-stop

... punctuation tag - ellipsis

: punctuation tag - colon

; punctuation tag - semi-colon

? punctuation tag - question-mark

Closed word-class list

APPGE, AT, AT1, BCL, CC, CCB, CS, CSA, CSN, CST, CSW, DA, DA1, DA2, DAR, DAT, DB, DB2, DD, DD1, DD2, DDQ, DDQGE, DDQV, EX, FO, FU, FW, GE, IF, II, IO, IW, JK, PN, PN1, PNQO, PNQS, PNQV, PNX1, PPGE, PPH1, PPHO1, PPHO2, PPHS1, PPHS2, PPIO1, PPIO2, PPIS1, PPIS2, PPX1, PPX2, PPY, REX, RG, RGQ, RGQV, RGR, RGT, RL, RP, RPK, TO, UH, VBO, VBDR, VBDZ, VBG, VBI, VBM, VBN, VBR, VBZ, VDO, VDD, VDG, VDI, VDN, VDZ, VHO, VHD, VHG, VHI, VHN, VHZ, VM, VMK, XX, YEX, YQUO, YBL, YBR, YCOM, YDSH, YSTP, YLIP, YCOL, YSCOL, YQUE, ZZ1, ZZ2

Appendix E

Results of the RASP experiment

This appendix presents the complete list of the results obtained in the experiment described in chapter 8.

E.1 Homogeneity scores

For the RASP experiment the homogeneity of each of the seven corpora has been measured using three different feature types (function words, POS tags, and POS tag bigrams), the three inter-document measures (D(p||q), χ², and G²), and two methods of normalization (with and without a smoothing technique), making 18 conditions in all. Tables E.1, E.2, E.3, E.4, E.5 and E.6 present the 18 sets of values for homogeneity.
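The measures named in these table headings are defined in chapter 6; as a reading aid, the sketch below shows one way such figures could be produced. It is a minimal illustration, not the thesis code: feature extraction is reduced to toy POS-tag lists, and summarizing the pairwise distances by their mean and standard deviation is an assumption about how the per-corpus values and the bracketed deviations in tables E.1-E.6 are derived.

```python
import math
from collections import Counter
from itertools import combinations

def to_probs(counts):
    # Turn a feature frequency list into a probability distribution.
    n = sum(counts.values())
    return {f: c / n for f, c in counts.items()}

def rel_entropy(p, q, cent):
    # D'(p||q): Kullback-Leibler divergence computed after adding the corpus
    # centroid to both distributions (the smoothing described in the paper
    # reproduced in appendix H), so that q(i) = 0 never occurs.
    total = 0.0
    for f in set(p) | set(q) | set(cent):
        pi = p.get(f, 0.0) + cent.get(f, 0.0)
        qi = q.get(f, 0.0) + cent.get(f, 0.0)
        if pi > 0.0:
            total += pi * math.log(pi / qi)
    return total

def chi_squared(ca, cb):
    # Divergence from the null hypothesis that documents A and B are
    # random samples from the same distribution.
    na, nb = sum(ca.values()), sum(cb.values())
    x2 = 0.0
    for f in set(ca) | set(cb):
        oa, ob = ca.get(f, 0), cb.get(f, 0)
        ea = na * (oa + ob) / (na + nb)   # expected count of f in A
        eb = nb * (oa + ob) / (na + nb)   # expected count of f in B
        x2 += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
    return x2

def g_squared(ca, cb):
    # Dunning's log-likelihood statistic, from each feature's 2x2
    # contingency table (counts in A/B vs. all other tokens in A/B).
    xlx = lambda v: v * math.log(v) if v > 0 else 0.0
    na, nb = sum(ca.values()), sum(cb.values())
    g2 = 0.0
    for f in set(ca) | set(cb):
        a, b = ca.get(f, 0), cb.get(f, 0)
        c, d = na - a, nb - b
        g2 += 2 * (xlx(a) + xlx(b) + xlx(c) + xlx(d)
                   - xlx(a + b) - xlx(a + c) - xlx(b + d) - xlx(c + d)
                   + xlx(a + b + c + d))
    return g2

def homogeneity(distances):
    # Summarize all pairwise distances as mean and standard deviation,
    # the shape of the figures reported below (an assumption).
    mean = sum(distances) / len(distances)
    sd = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    return mean, sd

# Toy corpus: each document is represented by its POS-tag counts.
docs = [Counter(t) for t in (["AT1", "NN1", "VVZ", "AT1"],
                             ["AT1", "NN2", "VVD", "II"],
                             ["PPIS1", "VV0", "AT1", "NN1"])]
probs = [to_probs(d) for d in docs]
centroid = to_probs(sum(docs, Counter()))
pairs = list(combinations(range(len(docs)), 2))
for name, dist in [
        ("chi-squared", lambda i, j: chi_squared(docs[i], docs[j])),
        ("G2", lambda i, j: g_squared(docs[i], docs[j])),
        ("D(p||q)", lambda i, j: rel_entropy(probs[i], probs[j], centroid))]:
    print(name, homogeneity([dist(i, j) for i, j in pairs]))
```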

Corpus    χ²                G²                D(p||q)
G         0.0087 (0.0022)   0.0188 (0.0049)   0.0097 (0.0027)
WA        0.0880 (0.0445)   0.2026 (0.1071)   0.0989 (0.0526)
I         0.0555 (0.0248)   0.1208 (0.0547)   0.0623 (0.0297)
D         0.0351 (0.0197)   0.0763 (0.0441)   0.0363 (0.0198)
B         0.0338 (0.0112)   0.0732 (0.0245)   0.0371 (0.0118)
L         0.0981 (0.0978)   0.2280 (0.2424)   0.1123 (0.1172)
P         0.0369 (0.0116)   0.0797 (0.0254)   0.0395 (0.0126)

Table E.1: Homogeneity values obtained using POS tags (standard deviations in brackets)

Corpus    χ²                G²                D(p||q)
G         0.0388 (0.0053)   0.0895 (0.0122)   0.0495 (0.0073)
WA        0.1502 (0.0614)   0.3608 (0.1534)   0.1979 (0.085)
I         0.1161 (0.0392)   0.2661 (0.0934)   0.1420 (0.0494)
D         0.0624 (0.0167)   0.1397 (0.0370)   0.0748 (0.0198)
B         0.0755 (0.0159)   0.1724 (0.0367)   0.0936 (0.0202)
L         0.1564 (0.1321)   0.3703 (0.3317)   0.1933 (0.1648)
P         0.0829 (0.0167)   0.1887 (0.0378)   0.1034 (0.0204)

Table E.2: Homogeneity values obtained using function words (standard deviations in brackets)

Corpus    χ²                G²                D(p||q)
G         0.0244 (0.0031)   0.0513 (0.0067)   0.0295 (0.0154)
WA        0.1350 (0.0545)   0.3146 (0.1348)   0.2080 (0.0969)
I         0.1070 (0.0412)   0.2387 (0.0960)   0.1348 (0.0782)
D         0.0662 (0.0264)   0.1430 (0.0604)   0.0649 (0.0387)
B         0.0660 (0.0154)   0.1437 (0.0350)   0.0656 (0.0207)
L         0.1618 (0.1491)   0.3860 (0.3863)   0.1857 (0.1857)
P         0.0681 (0.0138)   0.1486 (0.0313)   0.0906 (0.0332)

Table E.3: Homogeneity values obtained using POS tag bigrams (standard deviations in brackets)

Corpus    χ²                G²                D(p||q)
G         0.0087 (0.0022)   0.0188 (0.0049)   0.0096 (0.0028)
WA        0.0882 (0.0445)   0.2032 (0.1072)   0.0989 (0.0526)
I         0.0555 (0.0248)   0.1211 (0.0548)   0.0624 (0.0298)
D         0.0351 (0.0196)   0.0762 (0.0441)   0.0363 (0.0198)
B         0.0338 (0.0112)   0.0732 (0.0245)   0.0372 (0.0117)
L         0.0981 (0.0979)   0.2280 (0.2427)   0.1124 (0.1172)
P         0.0369 (0.0116)   0.0797 (0.0254)   0.0394 (0.0126)

Table E.4: Homogeneity values obtained using POS tags and normalization with smoothing (standard deviations in brackets)

Corpus    χ²                G²                D(p||q)
G         0.0387 (0.0055)   0.0896 (0.0129)   0.04963 (0.0091)
WA        0.1495 (0.0615)   0.3588 (0.1539)   0.1950 (0.0855)
I         0.1160 (0.0394)   0.2654 (0.0939)   0.1410 (0.0496)
D         0.0622 (0.0167)   0.1388 (0.0371)   0.0753 (0.0196)
B         0.0753 (0.0160)   0.1716 (0.0371)   0.0934 (0.0206)
L         0.1554 (0.1304)   0.3676 (0.3277)   0.1926 (0.1638)
P         0.0829 (0.0167)   0.1885 (0.03809)  0.1022 (0.0204)

Table E.5: Homogeneity values obtained using function words and normalization with smoothing (standard deviations in brackets)

Corpus    χ²                G²                D(p||q)
G         0.0256 (0.0032)   0.0541 (0.0070)   0.0291 (0.01675)
WA        0.1367 (0.0548)   0.3194 (0.1363)   0.2101 (0.0961)
I         0.1102 (0.0424)   0.2471 (0.0994)   0.1419 (0.0799)
D         0.0688 (0.0269)   0.1490 (0.0619)   0.0692 (0.0402)
B         0.0694 (0.0158)   0.1520 (0.0361)   0.0724 (0.0215)
L         0.1653 (0.1481)   0.3957 (0.3856)   0.1899 (0.1857)
P         0.0711 (0.0143)   0.1559 (0.0326)   0.0970 (0.0347)

Table E.6: Homogeneity values obtained using POS tag bigrams and normalization with smoothing (standard deviations in brackets)

Corpus pair   χ² (A = B)   G² (A = B)   D(p||q) A   D(p||q) B
G-WA          0.0237       0.0508       0.0248      0.0238
G-I           0.0974       0.2159       0.1051      0.0962
G-B           0.1398       0.3297       0.1545      0.1435
G-L           0.1579       0.3732       0.1820      0.1614
G-P           0.1271       0.3002       0.1313      0.1403
G-D           0.2476       0.5857       0.2821      0.25053
WA-I          0.1052       0.2341       0.1126      0.1054
WA-B          0.1345       0.3124       0.1544      0.1373
WA-L          0.1467       0.3454       0.1737      0.1492
WA-P          0.1199       0.2811       0.1386      0.1232
WA-D          0.2451       0.5752       0.2824      0.2479
I-B           0.1066       0.2559       0.1130      0.1174
I-L           0.0967       0.2340       0.1311      0.1046
I-P           0.0991       0.2384       0.1042      0.1101
I-D           0.1208       0.2832       0.1556      0.1276
B-L           0.0258       0.0546       0.0265      0.0264
B-P           0.0139       0.0295       0.0141      0.0142
B-D           0.0745       0.1612       0.0780      0.0747
L-P           0.0255       0.0542       0.0255      0.0264
L-D           0.0551       0.1168       0.0568      0.0555
P-D           0.0865       0.1911       0.0912      0.0866

Table E.7: Similarity values obtained using POS tags

E.2 Similarity scores

Tables E.7, E.8, E.9, E.10, E.11 and E.12 contain the 18 sets of values for similarity computed for each of the 21 pairs of corpora used in the RASP experiment described in chapter 8.
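The similarity of two corpora is, per the paper reproduced in appendix H, the distance between their centroid distributions. A minimal sketch of how the 21 pairwise values could be produced is given below; pooling each corpus's documents into a single distribution, and reusing the corpus-wide pool as the smoothing distribution, are simplifying assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def centroid(docs):
    # Average probability distribution of a corpus (pooled counts here;
    # averaging per-document distributions is an alternative reading).
    pooled = sum(docs, Counter())
    n = sum(pooled.values())
    return {f: c / n for f, c in pooled.items()}

def kl(p, q, smooth):
    # D(p||q) with an additive smoothing distribution, as for homogeneity.
    total = 0.0
    for f in set(p) | set(q) | set(smooth):
        pi = p.get(f, 0.0) + smooth.get(f, 0.0)
        qi = q.get(f, 0.0) + smooth.get(f, 0.0)
        if pi > 0.0:
            total += pi * math.log(pi / qi)
    return total

# Hypothetical corpora: name -> list of per-document POS-tag Counters.
corpora = {"G": [Counter(["AT1", "NN1", "VVZ"]), Counter(["AT1", "NN1", "VVD"])],
           "WA": [Counter(["NN2", "VVZ", "II"]), Counter(["NN2", "II", "AT1"])],
           "B": [Counter(["PPIS1", "VV0", "XX"]), Counter(["PPIS1", "VM", "VVI"])]}
cents = {name: centroid(docs) for name, docs in corpora.items()}
pool = centroid([d for docs in corpora.values() for d in docs])  # smoothing
for a, b in combinations(sorted(cents), 2):
    # The tables report D in both directions, since D(p||q) != D(q||p).
    print(f"{a}-{b}", kl(cents[a], cents[b], pool), kl(cents[b], cents[a], pool))
```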

Corpus pair   χ² (A = B)   G² (A = B)   D(p||q) A   D(p||q) B
G-WA          0.0408       0.0883       0.0422      0.0440
G-I           0.1429       0.3227       0.1580      0.1427
G-B           0.1784       0.4095       0.1971      0.1892
G-L           0.1956       0.4486       0.2147      0.2025
G-P           0.1472       0.3343       0.1629      0.1513
G-D           0.31471      0.7393       0.3455      0.3323
WA-I          0.1321       0.2965       0.1457      0.1324
WA-B          0.1721       0.3900       0.1885      0.1761
WA-L          0.1787       0.4071       0.1972      0.1815
WA-P          0.1382       0.311        0.1522      0.1391
WA-D          0.3092       0.7211       0.3395      0.3178
I-B           0.1174       0.2677       0.1221      0.1315
I-L           0.0825       0.1812       0.0858      0.0891
I-P           0.0849       0.1895       0.08731     0.0937
I-D           0.1264       0.2748       0.1332      0.1302
B-L           0.0463       0.0984       0.0475      0.0481
B-P           0.0257       0.0542       0.0266      0.0268
B-D           0.1089       0.2363       0.1147      0.1122
L-P           0.0376       0.0792       0.0387      0.0389
L-D           0.0652       0.1380       0.0675      0.0666
P-D           0.1114       0.2427       0.1176      0.1141

Table E.8: Similarity values obtained using function words

Corpus pair   χ² (A = B)   G² (A = B)   D(p||q) A   D(p||q) B
G-WA          0.0512       0.1120       0.0507      0.0567
G-I           0.1633       0.3695       0.1242      0.2216
G-B           0.1771       0.4190       0.1083      0.2968
G-L           0.2073       0.4924       0.1317      0.3481
G-P           0.1613       0.3803       0.0893      0.2779
G-D           0.3121       0.7510       0.2446      0.4597
WA-I          0.1873       0.4322       0.1620      0.2330
WA-B          0.1694       0.3985       0.1121      0.2670
WA-L          0.2044       0.4835       0.1377      0.3234
WA-P          0.1507       0.3547       0.0908      0.2494
WA-D          0.3173       0.7601       0.2526      0.4443
I-B           0.1472       0.3554       0.1452      0.2174
I-L           0.1412       0.3438       0.1363      0.2163
I-P           0.1411       0.3436       0.1261      0.2098
I-D           0.1781       0.4192       0.1671      0.2545
B-L           0.0376       0.0791       0.0301      0.0466
B-P           0.0224       0.0467       0.0228      0.0225
B-D           0.0943       0.2028       0.0823      0.1119
L-P           0.0453       0.0951       0.0568      0.0355
L-D           0.0705       0.1494       0.0677      0.0766
P-D           0.1218       0.2679       0.1154      0.1407

Table E.9: Similarity values obtained using POS tag bigrams

Corpus pair   χ² (A = B)   G² (A = B)   D(p||q) A   D(p||q) B
G-WA          0.0237       0.0508       0.0248      0.0238
G-I           0.0974       0.2159       0.1050      0.0962
G-B           0.1397       0.3297       0.1545      0.1435
G-L           0.1578       0.3731       0.1820      0.1614
G-P           0.1271       0.3002       0.1402      0.1313
G-D           0.2475       0.5856       0.2820      0.2504
WA-I          0.1052       0.2341       0.1126      0.1054
WA-B          0.1345       0.3124       0.1544      0.1373
WA-L          0.1467       0.3453       0.1736      0.1492
WA-P          0.1199       0.2811       0.1386      0.1232
WA-D          0.2449       0.5751       0.282       0.2479
I-B           0.1066       0.2558       0.1130      0.1174
I-L           0.0967       0.2340       0.1310      0.1046
I-P           0.0991       0.2383       0.1041      0.1100
I-D           0.1208       0.2832       0.1276      0.1556
B-L           0.0258       0.0546       0.0265      0.0264
B-P           0.0139       0.0295       0.0141      0.0142
B-D           0.0745       0.1612       0.0746      0.0780
L-P           0.0255       0.0542       0.0255      0.0263
L-D           0.0551       0.1168       0.0555      0.0568
P-D           0.0864       0.1911       0.0866      0.0912

Table E.10: Similarity values obtained using POS tags and smoothing

Corpus pair   χ² (A = B)   G² (A = B)   D(p||q) A   D(p||q) B
G-WA          0.0407       0.0880       0.0421      0.0439
G-I           0.1431       0.3232       0.1581      0.1431
G-B           0.1786       0.4098       0.1974      0.1891
G-L           0.1958       0.4493       0.2151      0.2027
G-P           0.1473       0.3346       0.1632      0.1511
G-D           0.3150       0.7402       0.3460      0.3324
WA-I          0.1322       0.2967       0.1457      0.1325
WA-B          0.1726       0.3912       0.1892      0.1763
WA-L          0.1791       0.4082       0.1978      0.1818
WA-P          0.1387       0.3131       0.1531      0.1393
WA-D          0.3097       0.7225       0.3401      0.3181
I-B           0.1174       0.2677       0.1223      0.1313
I-L           0.0825       0.1812       0.0860      0.0889
I-P           0.0849       0.1894       0.0875      0.0934
I-D           0.1264       0.2749       0.1334      0.1974
B-L           0.0464       0.0986       0.0475      0.0482
B-P           0.0258       0.0543       0.0267      0.0268
B-D           0.1091       0.2365       0.1147      0.1123
L-P           0.0377       0.0794       0.0389      0.0388
L-D           0.0652       0.1381       0.0675      0.0665
P-D           0.1115       0.2429       0.1176      0.1143

Table E.11: Similarity values obtained using function words and smoothing

Corpus pair   χ² (A = B)   G² (A = B)   D(p||q) A   D(p||q) B
G-WA          0.0514       0.1123       0.0501      0.0576
G-I           0.1637       0.3705       0.1247      0.2221
G-B           0.1775       0.4199       0.1082      0.2978
G-L           0.2078       0.4935       0.1314      0.3494
G-P           0.1617       0.3813       0.0894      0.2786
G-D           0.3128       0.7529       0.2451      0.4610
WA-I          0.1877       0.4331       0.1633      0.2325
WA-B          0.1697       0.3992       0.1126      0.2671
WA-L          0.2047       0.4843       0.1381      0.3238
WA-P          0.1509       0.3553       0.0916      0.2492
WA-D          0.3178       0.7614       0.2535      0.4445
I-B           0.1475       0.3562       0.1450      0.2184
I-L           0.1415       0.3446       0.1359      0.2175
I-P           0.1414       0.3445       0.1261      0.2106
I-D           0.1785       0.4203       0.1672      0.2554
B-L           0.0377       0.0792       0.0300      0.0469
B-P           0.0224       0.0468       0.0232      0.0223
B-D           0.0945       0.2033       0.0827      0.1119
L-P           0.0454       0.0954       0.0574      0.0351
L-D           0.0706       0.1497       0.06832     0.0764
P-D           0.1221       0.2685       0.1156      0.1411

Table E.12: Similarity values obtained using POS tag bigrams and smoothing

E.3 Accuracy values for homogeneity

Tables E.13 and E.14 report in full the different levels of accuracy computed by the RASP system for each corpus. The values computed for all words are used as gold standard judgments for homogeneity.
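Since the homogeneity measures are meant to predict these accuracies, the two rankings can be compared directly. Below is a small sketch of that comparison using Spearman's rank correlation; the numbers are copied from tables E.1, E.13 and E.14, and the simple rho formula ignores tie correction (the WA and I accuracies tie at 92.56%).

```python
def spearman_rho(xs, ys):
    # Spearman's rho for distinct values: 1 - 6*sum(d^2) / (n*(n^2-1)),
    # where d is the difference between the two ranks of each item.
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Corpora in the order G, WA, I, D, B, L, P:
# chi-squared homogeneity from table E.1, all-words accuracy from E.13/E.14.
homogeneity = [0.0087, 0.0880, 0.0555, 0.0351, 0.0338, 0.0981, 0.0369]
accuracy = [95.26, 92.56, 92.56, 93.64, 93.39, 93.31, 93.44]
# Low homogeneity scores should go with high accuracy, so a measure that
# predicts tagger performance well shows a strongly negative rho here.
print(spearman_rho(homogeneity, accuracy))
```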

                         G         WA        I
All words:               95.26%    92.56%    92.56%
Known words:             96.28%    95.27%    95.01%
Unknown words:           5.79%     9.00%     6.96%
                         78.66%    65.71%    60.96%
All ambiguous words:     47.02%    50.88%    48.25%
                         92.73%    89.71%    89.93%
Known ambiguous words:   93.82%    92.66%    91.45%
Standard deviation:      0.83      0.86      2.62

Table E.13: Accuracy values computed using the 10-fold cross-validation method for each written English corpus

                         B         L         P         D
All words:               93.39%    93.31%    93.44%    93.64%
Known words:             95.08%    95.17%    95.00%    95.04%
Unknown words:           4.63%     5.92%     4.95%     4.22%
                         59.26%    63.84%    65.16%    61.79%
All ambiguous words:     51.83%    49.22%    55.03%    48.67%
                         91.12%    91.06%    91.75%    90.22%
Known ambiguous words:   92.04%    92.07%    92.52%    91.0?%
Standard deviation:      0.730     1.47      1.03      0.52

Table E.14: Accuracy values computed using the 10-fold cross-validation method for each spoken English corpus

Guardian                 WA        I         B         L         P         D
All words:               91.36%    90.27%    86.67%    85.61%    87.68%    81.82%
Known words:             94.40%    93.29%    91.56%    91.23%    91.71%    88.57%
Unknown words:           14.00%    11.64%    10.36%    11.58%    9.31%     12.43%
                         72.44%    67.84%    44.45%    41.50%    48.55%    34.50%
All ambiguous words:     49.28%    46.58%    51.53%    50.71%    52.98%    51.27%
                         88.00%    85.75%    80.57%    78.76%    81.33%    74.29%
Known ambiguous words:   91.92%    89.08%    86.58%    85.56%    86.71%    80.06%
Standard deviation:      0.81      2.25      1.43      2.47      1.31      1.74

Table E.15: Accuracy values using the Guardian corpus for training

World Affairs            G         I         B         L         P         D
All words:               89.36%    90.30%    86.45%    85.18%    87.93%    82.09%
Known words:             94.15%    93.37%    90.49%    90.57%    91.57%    88.34%
Unknown words:           13.73%    9.82%     8.20%     9.87%     7.64%     10.49%
                         59.31%    62.37%    41.34%    36.17%    44.12%    28.78%
All ambiguous words:     51.40%    48.92%    53.46%    52.62%    54.82%    52.35%
                         87.56%    85.29%    79.86%    76.70%    81.59%    73.05%
Known ambiguous words:   91.86%    88.92%    84.75%    84.01%    86.55%    79.57%
Standard deviation:      0.41      1.90      1.38      2.87      1.37      1.35

Table E.16: Accuracy values using the World Affairs corpus for training

E.4 Accuracy values for similarity

Tables E.15-E.21 report in full the different levels of accuracy computed by the RASP system for each corpus. The values computed for all words are used as gold standard judgments for similarity.

Imaginative              G         WA        B         L         P         D
All words:               84.64%    86.02%    87.20%    87.78%    88.18%    86.82%
Known words:             92.66%    92.51%    91.58%    91.56%    91.33%    90.70%
Unknown words:           21.68%    19.70%    12.07%    8.84%     10.05%    6.93%
                         55.65%    59.24%    54.69%    48.36%    60.38%    34.88%
All ambiguous words:     50.48%    51.23%    50.56%    48.90%    51.65%    50.49%
                         85.03%    85.83%    84.35%    83.83%    84.77%    82.?2%
Known ambiguous words:   90.44%    89.99%    86.73%    85.70%    86.35%    83.61%
Standard deviation:      0.43      2.66      0.96      2.29      1.25      0.78

Table E.17: Accuracy values using the Imaginative corpus for training

Business                 G         WA        I         L         P         D
All words:               82.66%    83.64%    81.35%    91.28%    92.28%    90.71%
Known words:             93.42%    93.20%    92.88%    94.65%    94.60%    93.79%
Unknown words:           20.97%    19.17%    19.96%    7.00%     5.93%     5.90%
                         42.10%    44.07%    34.65%    46.37%    55.86%    41.91%
All ambiguous words:     48.03%    50.37%    46.84%    50.41%    54.09%    50.91%
                         85.09%    85.24%    85.46%    89.14%    90.38%    87.94%
Known ambiguous words:   90.55%    90.31%    89.61%    90.96%    91.65%    89.26%
Standard deviation:      0.39      3.70      2.73      1.33      0.69      0.75

Table E.18: Accuracy values using the Business corpus for training

Leisure                  G         WA        I         B         P         D
All words:               87.04%    86.38%    84.77%    92.84%    92.82%    92.19%
Known words:             93.81%    93.63%    93.91%    94.59%    94.31%    94.11%
Unknown words:           22.19%    21.07%    18.42%    7.95%     7.08%     5.26%
                         63.33%    59.16%    43.36%    72.48%    73.44%    57.85%
All ambiguous words:     44.26%    46.41%    45.67%    52.08%    53.90%    52.74%
                         89.02%    88.11%    88.47%    90.82%    90.58%    89.38%
Known ambiguous words:   91.85%    91.77%    91.25%    91.89%    91.59%    90.24%
Standard deviation:      0.86      1.61      2.44      0.75      0.36      0.77

Table E.19: Accuracy values using the Leisure corpus for training

Public                   G         WA        I         B         L         D
All words:               83.07%    84.78%    82.01%    92.88%    92.46%    90.55%
Known words:             93.46%    93.95%    93.27%    95.02%    95.12%    93.56%
Unknown words:           21.02%    19.19%    18.81%    6.19%     6.56%     5.94%
                         44.06%    46.81%    32.44%    60.29%    54.78%    43.25%
All ambiguous words:     49.26%    50.93%    48.46%    53.89%    51.85%    52.41%
                         86.65%    87.35%    86.50%    91.24%    90.29%    87.81%
Known ambiguous words:   91.26%    92.09%    90.22%    92.32%    92.02%    89.20%
Standard deviation:      0.68      3.28      2.38      0.81      1.32      0.90

Table E.20: Accuracy values using the Public corpus for training

Demographic              G         WA        I         B         L         P
All words:               82.82%    82.92%    83.79%    90.67%    91.88%    90.69%
Known words:             91.68%    91.58%    93.37%    93.73%    94.09%    93.40%
Unknown words:           30.40%    26.34%    18.44%    11.38%    7.85%     10.75%
                         62.54%    58.36%    40.26%    66.94%    65.94%    67.92%
All ambiguous words:     48.38%    48.14%    45.81%    50.10%    49.32%    52.35%
                         83.18%    83.44%    86.57%    88.19%    88.35%    88.01%
Known ambiguous words:   88.89%    88.76%    89.90%    90.27%    89.76%    90.11%
Standard deviation:      0.52      2.09      3.06      1.45      1.20      0.71

Table E.21: Accuracy values using the Demographic corpus for training

Appendix F

Homogeneity and training size: results of a preliminary experiment

In this preliminary experiment, set up to analyze the relation between corpus homogeneity and the size of the training set, we repeat the experiment described in chapter 7 using Rainbow and different sizes of training set. The set sizes used are 1, 5 and 10 documents. Table F.1 shows an extract of the Rainbow accuracy values, obtained using all words, for the five top-, five middle- and five bottom-ranked corpora. The results in table F.1 show that for each training size Rainbow achieves high accuracy when the most homogeneous corpora are used, lower accuracy when less homogeneous corpora are used, and extremely low accuracy when heterogeneous corpora are used. Future experiments will clarify the relation between homogeneity and training size.
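Rainbow itself is the naive Bayes toolkit used in chapter 7; since its command-line interface is not reproduced here, the sketch below mimics the experimental set-up with scikit-learn. Multinomial naive Bayes over bag-of-words is an assumed stand-in for Rainbow's default model, and load_subcorpora() is a hypothetical loader returning a {class_name: [document strings]} mapping.

```python
import random
from statistics import mean, pstdev
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def accuracy_for_size(subcorpora, n_train, trials=50, seed=0):
    # Per trial: n_train random documents per class train the classifier,
    # the remaining documents of every class form the test set.
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        train_x, train_y, test_x, test_y = [], [], [], []
        for label, docs in subcorpora.items():
            shuffled = docs[:]
            rng.shuffle(shuffled)
            train_x += shuffled[:n_train]
            train_y += [label] * n_train
            test_x += shuffled[n_train:]
            test_y += [label] * (len(shuffled) - n_train)
        vec = CountVectorizer()
        model = MultinomialNB().fit(vec.fit_transform(train_x), train_y)
        scores.append(model.score(vec.transform(test_x), test_y))
    return mean(scores), pstdev(scores)

subcorpora = load_subcorpora()  # hypothetical: {class: [doc, doc, ...]}
for n in (1, 5, 10):
    acc, sd = accuracy_for_size(subcorpora, n)
    print(f"{n} docs per class: {100 * acc:.2f}% ({100 * sd:.2f})")
```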

N.   Subcorpus (Class)            1 (19)           5 (15)           10 (10)
1    g-W_hansard                  73.05% (32.61)   97.2% (6.74)     97.6% (4.31)
2    g-W_newsp_tabloid            70.52% (25.62)   89.46% (5.05)    83.4% (7.98)
3    g-W_ac_medicine              58.63% (26.75)   85.06% (9.67)    85.4% (8.85)
4    g-W_newsp_other_sports       56.31% (21.24)   88.26% (13.57)   92.6% (8.99)
5    g-W_news_script              53.57% (30.64)   63.73% (24.32)   71.2% (20.06)
20   g-W_newsp_other_social       24.10% (9.15)    40.39% (13.76)   47.8% (16.44)
21   d-W_imaginative              23.99% (21.27)   37.19% (18.42)   33.8% (17.36)
22   g-W_newsp_other_commerce     23.57% (13.62)   46% (18.47)      53.8% (17.48)
23   g-W_biography                22.31% (15.68)   49.33% (15.23)   59.2% (15.75)
24   d-S_cg_leisure               21.89% (15.34)   32.4% (16.98)    30.2% (16.59)
25   g-W_newsp_brdsht_nat_misc    21.36% (10.91)   29.86% (15.89)   35.4% (14.59)
50   m-periodical                 3.26% (4.36)     0% (0)           0% (0)
51   m-book                       2.52% (5.12)     1.86% (3.31)     0.6% (2.39)
52   random3                      2.42% (4.28)     0.13% (0.94)     0% (0)
53   random2                      2.21% (3.84)     0.26% (1.32)     0% (0)
54   random1                      1.05% (2.12)     0.13% (0.94)     0.6% (2.39)

Table F.1: The accuracy obtained by Rainbow on homogeneous and heterogeneous subcorpora, analyzing all words and using training sets of different sizes (1, 5 and 10 documents per class). Standard deviation values are given in brackets

Appendix G

Results of the experiment using the Web

This appendix presents the complete list of the results obtained in the experiment described in chapter 9.

G.1 List of technical and non-technical terms used in the Web experiment

The 164 terms used in the experiment are presented together with their web frequency. Table G.1 presents the technical terms from the chemical domain, table G.2 the terms from the genetic domain and table G.3 the terms from the metallurgical domain. In each table there are 10 terms for each of the three subsets into which the web frequencies have been divided. The terms are ranked in decreasing order according to their web frequency.

Tables G.4 and G.5 present the 60 general reference terms. Again, the terms are ranked in decreasing order according to their web frequency.
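The division of each term list into three frequency subsets can be sketched as follows; splitting the ranked list into equal thirds is an assumption, since the text does not state where the band boundaries were drawn.

```python
def select_terms(term_freqs, per_band=10):
    # Rank (term, web frequency) pairs in decreasing order of frequency,
    # cut the ranking into three bands, and keep the top terms per band.
    ranked = sorted(term_freqs, key=lambda tf: tf[1], reverse=True)
    third = len(ranked) // 3
    bands = (ranked[:third], ranked[third:2 * third], ranked[2 * third:])
    return [band[:per_band] for band in bands]

# A few of the chemical terms from table G.1 (full lists omitted).
chemical = [("mean lifetime", 768_000), ("reaction mechanism", 701_000),
            ("selectivity factor", 70_300), ("condensation reaction", 67_200),
            ("autocatalytic reaction", 6_900), ("additivity principle", 6_820)]
for band in select_terms(chemical):
    print(band)
```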

Technical term               web frequency
mean lifetime                768,000
reaction mechanism           701,000
electron density             662,000
inclusion complex            638,000
activation energy            600,000
molecular metal              586,000
lewis acid                   572,000
stationary state             503,000
chemical shift               456,000
equilibrium control          406,000
selectivity factor           70,300
condensation reaction        67,200
intrinsic barrier            66,700
hydrophobic interaction      63,600
solvent parameter            57,100
specific catalysis           56,300
polar solvent                54,600
isotope exchange             54,100
acceptor number              49,900
cation radical               48,700
autocatalytic reaction       6,900
additivity principle         6,820
degenerate rearrangement     4,740
valence isomer               4,580
migratory insertion          4,420
spin adduct                  4,330
basicity function            4,220
bifunctional catalysis       4,160
isokinetic relationship      4,030
hildebrand parameter         3,900

Table G.1: Technical terms and their web frequency from the chemical domain

Technical term               web frequency
gene library                 941,000
genetic material             903,000
genome project               812,000
restriction site             681,000
parallel evolution           574,000
founder effect               486,000
adaptive value               478,000
point mutation               466,000
regulatory sequence          393,000
escherichia coli             367,000
central dogma                98,000
complementary rna            97,000
allele frequency             94,900
chromosomal address          88,300
clone bank                   85,300
tata box                     82,700
quantitative trait           81,900
reverse transcriptase        78,200
recombinant clone            74,900
homologous chromosome        66,500
heteroduplex dna             9,660
giemsa stain                 9,000
polygenic disorders          5,910
compound heterozygote        5,530
synaptonemal complex         3,730
multiple allelism            2,890
somaclonal variation         2,660
gametic selection            2,590
dihybrid cross               2,240
centromere interference      2,200

Table G.2: Technical terms and their web frequency from the genetic domain

Technical term               web frequency
carbon steel                 762,000
hot dip                      736,000
cold reduction               595,000
alpha iron                   526,000
light metals                 508,000
strip steel                  500,000
guide scratch                499,000
grain growth                 446,000
specific gravity             443,000
mill edge                    369,000
drill rod                    95,600
tensile test                 92,400
endurance limit              79,300
blast furnace                77,200
coke plate                   74,900
coil breaks                  35,400
continuous furnace           69,200
plastic deformation          67,200
cluster mill                 59,900
tungsten carbide             49,100
long terne                   9,840
ternary alloy                9,410
hydrogen embrittlement       8,970
indentation hardness         8,650
shrinkage cavity             8,500
roentgen rays                8,390
brass shim                   7,930
aisi steels                  7,280
silver solders               6,980
sinker steel                 6,620

Table G.3: Technical terms and their web frequency from the metallurgical domain

Non-technical term           web frequency
telephone number             7,560,000
mail order                   7,200,000
training courses             7,000,000
phone call                   6,730,000
information technology       5,940,000
way things                   5,920,000
family life                  5,900,000
credit card                  5,790,000
health service               5,720,000
city centre                  5,670,000
computer systems             5,420,000
end result                   5,290,000
south east                   5,260,000
market place                 5,110,000
west country                 4,980,000
government policy            4,940,000
management team              4,880,000
insurance company            4,850,000
art gallery                  4,730,000
subject matter               4,500,000
member states                4,420,000
local government             4,280,000
starting point               4,170,000
world war                    4,170,000
stock market                 4,160,000
working conditions           4,130,000
interest rates               3,960,000
age group                    3,950,000
car park                     3,820,000
water supply                 3,750,000

Table G.4: General reference terms and their web frequency: from 1 to 30

Non-technical term           web frequency
west coast                   3,680,000
post office                  3,630,000
air force                    3,610,000
home secretary               3,470,000
christmas day                3,450,000
track record                 3,390,000
public opinion               3,320,000
income tax                   3,190,000
press conference             3,160,000
executive committee          3,040,000
back seat                    2,890,000
building society             2,880,000
dining room                  2,870,000
stock exchange               2,820,000
university college           2,720,000
blood pressure               2,640,000
world cup                    2,560,000
ground floor                 2,560,000
football club                2,480,000
heart attack                 2,400,000
government departments       2,380,000
swimming pool                2,300,000
ice cream                    2,230,000
town hall                    2,170,000
cash flow                    2,110,000
election campaign            2,040,000
living standards             2,000,000
police officers              1,990,000
balance sheet                1,900,000
county council               1,850,000

Table G.5: General reference terms and their web frequency: from 31 to 60

G.2 Homogeneity scores

The homogeneity values are computed according to the measures described in chapter 6, using POS tags as features and G² and D(p||q) as inter-document measures. The homogeneity values for the corpora based on specialist terms are presented in tables G.6 and G.7 (corpora based on chemical terms, using G² and D(p||q) respectively), tables G.8 and G.9 (corpora derived from the genetic domain), tables G.10 and G.11 (corpora based on metallurgical terms), and tables G.12 and G.13 (corpora based on metallurgical terms plus the term "metallurgy").

Tables G.14, G.15, G.16 and G.17 present the homogeneity values for the corpora based on general reference terms.

Corpus                       homogeneity   standard dev.
basicity-function            0.0242        0.0193
degenerate-rearrangement     0.0322        0.02615
reaction-mechanism           0.0479        0.0390
stationary-state             0.0707        0.0198
electron-density             0.0714        0.0184
lewis-acid                   0.0758        0.0271
isotope-exchange             0.0828        0.0337
inclusion-complex            0.0946        0.0316
condensation-reaction        0.0973        0.0599
hydrophobic-interaction      0.1028        0.0542
autocatalytic-reaction       0.1252        0.0539
spin-adduct                  0.1470        0.0841
selectivity-factor           0.1477        0.0764
mean-lifetime                0.1563        0.0807
polar-solvent                0.1644        0.1548
migratory-insertion          0.1671        0.1226
intrinsic-barrier            0.1757        0.1000
molecular-metal              0.1921        0.0817
solvent-parameter            0.1961        0.1587
chemical-shift               0.2090        0.1459
bifunctional-catalysis       0.2135        0.1092
valence-isomer               0.2141        0.1474
cation-radical               0.2226        0.2249
equilibrium-control          0.2336        0.1788
specific-catalysis           0.3068        0.2463
additivity-principle         0.3069        0.2145

Table G.6: Homogeneity values obtained using POS tags and G² for the corpora developed from chemical terms

Corpus                       homogeneity   standard dev.
basicity-function            0.0109        0.0085
degenerate-rearrangement     0.0188        0.0173
reaction-mechanism           0.0260        0.0229
stationary-state             0.0353        0.0097
lewis-acid                   0.0384        0.0136
electron-density             0.0396        0.0107
isotope-exchange             0.0424        0.0172
inclusion-complex            0.0478        0.0152
condensation-reaction        0.0499        0.0295
hydrophobic-interaction      0.05144       0.0265
autocatalytic-reaction       0.0620        0.0275
spin-adduct                  0.0727        0.0394
selectivity-factor           0.0752        0.0384
polar-solvent                0.0781        0.0704
mean-lifetime                0.0793        0.0402
migratory-insertion          0.0835        0.0611
intrinsic-barrier            0.0899        0.0527
valence-isomer               0.0983        0.0664
molecular-metal              0.0986        0.0421
solvent-parameter            0.0988        0.0756
chemical-shift               0.1043        0.0640
bifunctional-catalysis       0.1095        0.0537
cation-radical               0.1120        0.1173
equilibrium-control          0.1151        0.0890
specific-catalysis           0.1478        0.1166
additivity-principle         0.1580        0.0975

Table G.7: Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from chemical terms

Corpus                       homogeneity   standard dev.
centromere-interference      0.0061        0.0047
genome-project               0.0642        0.0189
escherichia-coli             0.0664        0.0243
adaptive-value               0.0730        0.0271
founder-effect               0.0773        0.0261
chromosomal-address          0.0799        0.0311
homologous-chromosome        0.0854        0.0352
regulatory-sequence          0.0856        0.0310
dihybrid-cross               0.0884        0.0400
polygenic-disorders          0.0951        0.0569
allele-frequency             0.0987        0.0541
genetic-material             0.1092        0.0747
somaclonal-variation         0.1097        0.0680
clone-bank                   0.1131        0.0543
point-mutation               0.1193        0.1019
complementary-rna            0.1223        0.0745
parallel-evolution           0.1715        0.1364
multiple-allelism            0.1722        0.0812
tata-box                     0.1781        0.1677
quantitative-trait           0.2004        0.1649
central-dogma                0.2102        0.1279
synaptonemal-complex         0.2154        0.1891
restriction-site             0.2229        0.1141
recombinant-clone            0.2278        0.2267
gametic-selection            0.2396        0.1341
reverse-transcriptase        0.3004        0.1967

Table G.8: Homogeneity values obtained using POS tags and G² for the corpora developed from genetic terms

Corpus                       homogeneity   standard dev.
centromere-interference      0.0027        0.0021
genome-project               0.0341        0.0103
escherichia-coli             0.0351        0.0128
adaptive-value               0.0373        0.0130
founder-effect               0.0405        0.0143
chromosomal-address          0.0410        0.0163
dihybrid-cross               0.0445        0.0196
homologous-chromosome        0.0447        0.0193
regulatory-sequence          0.0472        0.0221
polygenic-disorders          0.0485        0.0287
allele-frequency             0.0495        0.0260
genetic-material             0.0568        0.0374
somaclonal-variation         0.0570        0.0351
clone-bank                   0.0578        0.0314
complementary-rna            0.0600        0.0344
point-mutation               0.0646        0.0567
parallel-evolution           0.0866        0.0691
multiple-allelism            0.0868        0.0401
tata-box                     0.0965        0.0928
quantitative-trait           0.0992        0.0816
central-dogma                0.1017        0.0637
recombinant-clone            0.1059        0.0920
synaptonemal-complex         0.1114        0.1056
restriction-site             0.1125        0.0558
gametic-selection            0.1151        0.0590
reverse-transcriptase        0.1534        0.1039

Table G.9: Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from genetic terms

Corpus                       homogeneity   standard dev.
endurance-limit              0.0714        0.0211
specific-gravity             0.0807        0.0254
cold-reduction               0.0922        0.0439
blast-furnace                0.0927        0.0397
strip-steel                  0.0984        0.0568
grain-growth                 0.0989        0.0518
carbon-steel                 0.1031        0.0395
indentation-hardness         0.1063        0.0586
silver-solders               0.1073        0.0323
coil-breaks                  0.1163        0.0494
light-metals                 0.1228        0.0626
drill-rod                    0.1244        0.0737
roentgen-rays                0.1256        0.0488
hot-dip                      0.1327        0.0445
tensile-test                 0.1437        0.0757
brass-shim                   0.1534        0.1243
continuous-furnace           0.1577        0.0897
tungsten-carbide             0.1693        0.0897
guide-scratch                0.1896        0.2509
ternary-alloy                0.2179        0.1198
alpha-iron                   0.2407        0.1328
plastic-deformation          0.2571        0.1850
aisi-steels                  0.2746        0.1950
mill-edge                    0.2884        0.1727
shrinkage-cavity             0.2899        0.1492
cluster-mill                 0.3047        0.1934
sinker-steel                 0.3916        0.3921
long-terne                   0.4755        0.2617

Table G.10: Homogeneity values obtained using POS tags and G² for the corpora developed from metallurgical terms

Corpus                       homogeneity   standard dev.
endurance-limit              0.0398        0.0109
specific-gravity             0.0413        0.0141
cold-reduction               0.0465        0.0223
blast-furnace                0.0465        0.0195
strip-steel                  0.0509        0.0314
grain-growth                 0.0510        0.0264
indentation-hardness         0.0531        0.0287
carbon-steel                 0.0537        0.0196
silver-solders               0.0552        0.0171
coil-breaks                  0.0598        0.0255
light-metals                 0.0606        0.0305
drill-rod                    0.0614        0.0345
roentgen-rays                0.0637        0.0249
hot-dip                      0.0689        0.0223
tensile-test                 0.0758        0.0401
continuous-furnace           0.0763        0.0406
brass-shim                   0.0789        0.0585
tungsten-carbide             0.0852        0.0479
guide-scratch                0.0975        0.1276
ternary-alloy                0.1194        0.0719
alpha-iron                   0.1259        0.0719
aisi-steels                  0.1322        0.0935
plastic-deformation          0.1328        0.0908
mill-edge                    0.1388        0.0726
shrinkage-cavity             0.1455        0.0736
cluster-mill                 0.1458        0.0851
sinker-steel                 0.1845        0.1915
long-terne                   0.2233        0.1224

Table G.11: Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from metallurgical terms

Corpus                       homogeneity   standard dev.
mill-edge                    0.0008        0.0003
sinker-steel                 0.0018        0.0016
coil-breaks                  0.0037        0.0060
long-terne                   0.0082        0.0132
continuous-furnace           0.0303        0.0190
tungsten-carbide             0.0689        0.0275
grain-growth                 0.0934        0.0449
cluster-mill                 0.0955        0.0687
carbon-steel                 0.1012        0.0414
silver-solders               0.1157        0.1809
cold-reduction               0.1168        0.1133
light-metals                 0.1177        0.0352
ternary-alloy                0.1412        0.0784
specific-gravity             0.1441        0.0603
plastic-deformation          0.1577        0.0676
shrinkage-cavity             0.1646        0.1312
hot-dip                      0.1699        0.0958
strip-steel                  0.1986        0.0891
roentgen-rays                0.2501        0.1229
aisi-steels                  0.2659        0.1940
drill-rod                    0.3308        0.3139
brass-shim                   0.3603        0.5219
indentation-hardness         0.3889        0.1996
alpha-iron                   0.4204        0.3259
blast-furnace                0.5131        0.4200
tensile-test                 0.5154        0.3838

Table G.12: Homogeneity values obtained using POS tags and G² for the corpora developed from metallurgical terms plus the term "metallurgy"

Corpus                       homogeneity   standard dev.
mill-edge                    0.0004        0.0001
sinker-steel                 0.0010        0.0010
coil-breaks                  0.0020        0.0034
long-terne                   0.0040        0.0064
continuous-furnace           0.0142        0.0090
tungsten-carbide             0.0355        0.0144
cluster-mill                 0.0432        0.0316
grain-growth                 0.0491        0.0259
carbon-steel                 0.0510        0.0202
silver-solders               0.0531        0.0815
light-metals                 0.0615        0.0197
cold-reduction               0.0639        0.0640
specific-gravity             0.0734        0.0319
ternary-alloy                0.0742        0.0394
shrinkage-cavity             0.0786        0.0627
plastic-deformation          0.0832        0.0380
hot-dip                      0.0896        0.0538
strip-steel                  0.1010        0.0444
roentgen-rays                0.1239        0.0617
aisi-steels                  0.1266        0.0966
drill-rod                    0.1538        0.1453
brass-shim                   0.1898        0.3040
indentation-hardness         0.2032        0.1068
alpha-iron                   0.2062        0.1609
blast-furnace                0.2263        0.1750
tensile-test                 0.2350        0.1696

Table G.13: Homogeneity values obtained using POS tags and D(p||q) for the corpora developed from metallurgical terms plus the term "metallurgy"

Corpus                       homogeneity   standard dev.
home-secretary               0.0521        0.0244
way-things                   0.0530        0.0145
health-care                  0.0556        0.0108
city-centre                  0.0595        0.0143
dining-room                  0.0641        0.0176
subject-matter               0.0673        0.0209
interest-rates               0.0680        0.0234
police-station               0.0729        0.0230
police-officers              0.0758        0.0279
home-town                    0.0765        0.0279
working-conditions           0.0784        0.0256
trade-unions                 0.0803        0.0417
mail-order                   0.0804        0.0218
information-technology       0.08126       0.0287
end-result                   0.0813        0.0219
executive-committee          0.0819        0.0298
ground-floor                 0.0856        0.0407
stock-market                 0.0864        0.0543
income-tax                   0.0875        0.0345
planning-permission          0.0878        0.0505
football-club                0.0880        0.0284
government-policy            0.0899        0.0445
south-east                   0.0911        0.0403
government-departments       0.0921        0.0807
management-team              0.0938        0.0439
post-office                  0.0939        0.0375
telephone-number             0.0955        0.0360
starting-point               0.0957        0.0362
health-authority             0.0970        0.0372
blood-pressure               0.0971        0.0324

Table G.14: Homogeneity values obtained using POS tags and G² for the general reference corpora: from 1 to 30

Corpus                       homogeneity   standard dev.
ice-cream                    0.0986        0.0385
west-coast                   0.0998        0.0414
county-council               0.1015        0.0332
balance-sheet                0.1016        0.0516
local-government             0.1018        0.0718
west-country                 0.1023        0.0620
press-conference             0.1092        0.0914
christmas-day                0.1106        0.0488
back-seat                    0.1115        0.0424
heart-attack                 0.1117        0.0523
grammar-school               0.1119        0.0609
living-standards             0.1121        0.0554
election-campaign            0.1139        0.0671
training-courses             0.1140        0.0505
insurance-company            0.1150        0.0932
parish-church                0.1178        0.0681
water-supply                 0.1193        0.0494
air-force                    0.1194        0.0625
age-group                    0.1247        0.0529
public-opinion               0.1253        0.0756
family-life                  0.1279        0.0715
credit-card                  0.1312        0.0707
cash-flow                    0.1372        0.0639
computer-systems             0.1423        0.1412
town-hall                    0.1436        0.0795
art-gallery                  0.1447        0.0774
car-park                     0.1466        0.0828
swimming-pool                0.1527        0.0549
crown-court                  0.1531        0.0687
estate-agents                0.1590        0.1560
carbon-dioxide               0.1642        0.1173
track-record                 0.1731        0.1139
phone-call                   0.1749        0.0899
university-college           0.1749        0.1066
world-cup                    0.1914        0.0728
health-service               0.2037        0.0873
building-society             0.2057        0.1140
member-states                0.2454        0.1565

Table G.15: Homogeneity values obtained using POS tags and G² for the general reference corpora: from 31 to 58

Corpus                       homogeneity   standard dev.
home-secretary               0.0260        0.0121
way-things                   0.0275        0.0079
health-care                  0.0288        0.0068
city-centre                  0.0303        0.0079
dining-room                  0.0333        0.0089
subject-matter               0.0347        0.0108
interest-rates               0.0363        0.0152
police-station               0.0368        0.0121
home-town                    0.0397        0.0155
police-officers              0.0399        0.0160
working-conditions           0.0404        0.0138
information-technology       0.0407        0.01472
trade-unions                 0.0413        0.0221
mail-order                   0.0417        0.0116
end-result                   0.0423        0.0124
executive-committee          0.0430        0.0155
income-tax                   0.0438        0.0170
football-club                0.0439        0.0135
ground-floor                 0.0444        0.0200
stock-market                 0.0455        0.0300
planning-permission          0.0463        0.0278
government-policy            0.0467        0.0242
south-east                   0.0472        0.0210
post-office                  0.0476        0.0191
starting-point               0.0488        0.0197
west-coast                   0.0488        0.0203
management-team              0.0491        0.0242
health-authority             0.0491        0.0191
telephone-number             0.0491        0.0198
local-government             0.0501        0.0332

Table G.16: Homogeneity values obtained using POS tags and D(p||q) for the general reference corpora: from 1 to 30

Corpus                       homogeneity   standard dev.
balance-sheet                0.0505        0.0259
ice-cream                    0.0511        0.0196
blood-pressure               0.0511        0.0179
county-council               0.0516        0.0167
west-country                 0.0526        0.0313
press-conference             0.0533        0.0441
government-departments       0.0536        0.0558
christmas-day                0.0560        0.0258
living-standards             0.0568        0.0293
back-seat                    0.0575        0.0222
election-campaign            0.0575        0.0340
training-courses             0.0579        0.0266
heart-attack                 0.0584        0.0287
grammar-school               0.0596        0.0355
insurance-company            0.0608        0.0524
parish-church                0.0609        0.0356
public-opinion               0.0620        0.0375
water-supply                 0.0624        0.0278
family-life                  0.0626        0.0355
age-group                    0.0635        0.0259
air-force                    0.0636        0.0388
credit-card                  0.0640        0.0344
art-gallery                  0.0698        0.0375
cash-flow                    0.0704        0.0337
year-olds                    0.0709        0.0283
town-hall                    0.0718        0.0394
car-park                     0.0725        0.0372
computer-systems             0.0729        0.0738
swimming-pool                0.0786        0.0290
crown-court                  0.0789        0.0377
carbon-dioxide               0.0811        0.0595
estate-agents                0.0819        0.0831
track-record                 0.0830        0.0497
university-college           0.0870        0.0510
phone-call                   0.0930        0.0515
world-cup                    0.1019        0.0390
health-service               0.1037        0.0434
building-society             0.1048        0.0589
member-states                0.1373        0.0970

Table G.17: Homogeneity values obtained using POS tags and D(p||q) for the general reference corpora: from 31 to 58

Appendix H

Published Papers

This appendix contains:

• a copy of the paper describing the Rainbow experiment discussed in chapter 7;

• a copy of the paper describing a preliminary exercise to the experiment discussed in chapter 9.

Gabriela Cavaglia, "Measuring corpus homogeneity using a range of measures for inter-document distance", in Proceedings of LREC 2002, Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, 2002, vol. II, pp. 426-431.

Gabriela Cavaglia and Adam Kilgarriff, "Corpora from the Web", in Proceedings of the Fourth Annual CLUK Colloquium, University of Sheffield, 2001.

Measuring corpus homogeneity using a range of measures for inter-document distance

Gabriela Cavaglia

ITRI, University of Brighton, Lewes Road, Brighton BN2 4GJ, United Kingdom, [email protected]

Abstract

With the ever more widespread use of corpora in language research, it is becoming increasingly important to be able to describe and compare corpora. The analysis of corpus homogeneity is preliminary to any quantitative approach to corpora comparison. We describe a method for text analysis based only on document-internal linguistic features, and a set of related homogeneity measures based on inter-document distance. We present a preliminary experiment to validate the hypothesis that in the presence of a homogeneous corpus, the subcorpus that is necessary to train an NLP system is smaller than the one required if a heterogeneous corpus is used.

1. Introduction in linguistic features. But when the user is a system, as in NLP, the performance of any task can be degraded by the Corpora are collections of documents, generallyin elec­ presence of differentlinguistic features, as shown in (Biber, tronic form, used mainly as a source of different kinds of 1993; Sekine, 1997; Roland and Jurafsky, 1998; Folch et linguistic information. In the last decade, the availabil­ al., 2000). ity of many texts in machine-readable form and the devel­ opment of powerful tools for exploiting them make cor­ Corpus profiling: homogeneity and pora the basic resource for many areas of language re­ 2. search (Lexicography, Linguistics, Psycholinguistics, Nat­ similarity ural Language Processing). When a study performed on a The problem of describing and comparing corpora in particular corpus obtains interesting results, thepos sibility relation to their internal features is becoming important. of extending them to a larger population is very tempting. Work on corpora comparison started in the early '80s But only corpora built according to explicit design crite­ (Hofland and Johansson, 1982) with the study of the dif­ ria, which constitute a representative sample of a defined ferences between British and American English and then language variety, can allow the result of the study to be ex ­ extended to the opposition between spoken and written tended without major bias errors. It follows that character­ English (Biber, 1988) and later to differences in register izations of existingco rpora and the design of new ones are (Biber, 1993; Kessler et al., 1997;Dewdney et al.,20 01). now receiving more attention: it becomes essential to be More recent developments focus on corpus homogene­ able to describe a corpus, compare it with others, and pro­ ity and similarity. Both homogeneity and similarity are duce new corpora that are representative samples of partic­ complex and multi-dimensional issues: a corpus can be ho­ ular language varieties. mogeneous, and two or more corpora can be similar, in re­ The criteria that one might use to describe or design lation to aspects such as lexis, syntax, semantics but also in a corpus can be external or internal. External criteria are relation to the structure of the texts or the presence of extra­ essentially non-linguistic, and therefore not present in the textual information. Because we are mainly interested in document itself; they cover the document's topic, genre and textual information, we restrict our analysis to lexical, se­ socio-cultural aspects (e.g., age and occupation of the au­ mantic and syntactic aspects, and we label the corpus pro­ thor) and are standardly assigned by people. By contrast, filing we are interested in as "linguistic". We call a corpus internal criteria are based on linguistic features, which are "homogeneous" when it does not containmajor differences more or less directly present inside a text (e.g., words are in internal features among its documents. directly present in a text, while POS tags can be exploited Kilgarriff(200 1) defines corpus similarity as the "like­ only after further analysis). hood that linguistic findings based on one corpus apply to Corpus descriptions and corpus design techniques are another". He presents corpus homogeneity as the prelimi­ usually based on external criteria (e.g., the 'Wall Street nary step to any quantitative study of corpus similarity: his Journal corpus'). 
The main problem with external features claim is that without knowledge of corpus homogeneity it is th at they are not always availableand, when they are, not is not clear if it would be appropriate to measure similar­ always reliable (e.g., you can not always use the title of a ity between, for example, a homogeneous corpus like the text to identify its topic). Moreover, corpora produced us­ PILLs corpus of Patient Information Leaflets (Scott et al., ing external fe atures can contain wide variations in internal 2001) and a balanced one like theBrown. He also states that features, which can cause problems when used by an NLP ideally the measure used for corpus�similarity can be used system. The decision to classify texts only on the basis of for corpus homogeneity, and presents an analysis based on external criteria is motivated when the users are human be­ word frequencylists. Illouz et al. (2000) present a method­ ings, who can cope without any problem with differences ology for text profiling that aims to produce measures for

207 corpus homogeneity within the different parts of a corpus. 3.3. Computing similarity Theirsupervised approach is similar to Biber's work on text Probability lists representing texts in the corpus can also classification, but they use a tagger/parser to analyze syn­ be seen as distributions. Two documents are considered tactic features. similar if their probability distributions are similar. We ex­ We present a technique for corpus analysis strictly plored the use of three different text-similaritymeasures . based on internal features and unsupervised learning tech­ Relative entropy, also know as Kullback-Leibler diver­ niques, together with a set of measures for corpus homo­ gence, is a well-known measure for calculating how sim­ geneity and similarity. ilar two probability distributions are (over the same event space). If p( i) and q( i) are thedistributi ons which represent 3. The methodology two documents, the relative entropy, D(pl\q), is calculated We propose a stochastic method to describe and com­ as follow: pare corpora, which is based onlyon theirinternal features. �  This method can be computed for any corpus, and is inde­ D(pl\q) = p(i)log pendent of any particular theory of language: it uses all the linguistic fe atures of the documents, and not just a special Because it is not defined for q(i) 0, for which sub-set as for example in Biber's work. The method has = D(p\\q) = oo, we compute the centroid, the average prob­ four steps: ability distribution of the corpus, and then add it to each distribution before calculating the similarity. The formula 1. choose the aspect you want to study and the type of for relative entropy becomes: featureyou want to use; p( 2. collect data for each document in the corpus; D'(p\\q) t(p(i) + c(i))log �) + c(�) = + i i=l qi( ) c () 3. calculate the similarity between each pair of docu­ with the centroid of the entire corpus. ments; c(i) We also tested two other similarity measures based on 4. quantify the characteristics of the corpus: we produce the divergence fromthe null hypothesis that the two docu­ both a description of the corpus and measures of its ments are random samples from the same distribution: Chi homogeneity and its similarity in relation to other cor­ Squared and Log-likelihood. pora. Chi Square measure (x2): for each feature in the fre­ quency list, we calculate the number of occurrences in each 3.1. Deciding aspect and feature types document that would be expected. Suppose the sizes of documents and are respectively and and fea­ Corpus profilingcan be studied from different perspec­ A B NA NB ture has observed frequency in and in tives. As has been said, we restrict our interest to linguistic w ow,A A Ow,B B, then the expected value forA is: analysis and in particular to the lexical, syntactic and se­ ew,A mantic aspects. Each aspect can be studied using different NA (Ow,A + Ow,B) feature types (i.e. words or POS tags). At the moment just ew A ------� , NA +NB . thelexical and syntactic aspectshave been investigated. Lexical analysis is ps:rformed to detect possible restric­ and likewise for ew,B for document B. Then the x2 value · tions in the vocabulary. As feature types forlexi cal analy­ for the document pair, A and B, is computed as follows: sis, either all-words or content words or lemmas are used. 2 To identify restrictions at the syntactic level, either function � (oi - e'i) x2 = � ei words. or POS tags or POS bi-grams are used. 
To detect i=l function words a list of function words is needed, while to with the sum over all the features . produce POS tag and POS bi-gram frequency lists a POS Log-likelihood (G2): Dunning (1993) showed that G2 tagger is required. is a much better approximation of thebinomial distribution than x2 3.2. Collecting the data especially forevents with frequencies smaller than 5. It is a measure that works quite well with both large and The obj ects employed to study corpus profiling are the small documents and allows the comparison of the signif­ texts that make up the corpus. Each text is represented by icance of both common and rare events. A Contingency a vector of features (attributes) or afrequency list. A fre­ table, as presented in table 1, helps us to understand the quency list is a list of pairs < >in which x is a fea­ x, f(x) formula for G2• ture instance, e.g., the function word "with" or the lemma "to cut", and f(x) is the frequency of the feature x in the document (the numberof occurrences of "with" or "to cut" 02w = in the document). Instead of using the sample frequency 2(alog(a) + blog(b) + clog(c) + dlog(d) f (x) direcly, we compute the estimate of the probability -(a+ b) log( a+ b) - (d + c) log( a + c) p(x) . This step yields a matrix which represents a corpus by -(b + d) log(b + d) - (c + d) log(c + d) the frequencylists of its documents. + (a + b + c + d) log( a + b + c + r},)

208 I Doc. A Doc. B using homogeneous and heterogeneous corpora. Then we w a b compare the accuracy that the system achieved on each cor­ ..., w I c d pus, with thedegree of homogeneity that the corpus scores. We expect to find that the accuracy for the homogeneous Table 1: Contingency table corpora is higher. The NLP system we use for the evaluation is Rainbow (McCallum, 1996), which performs text classification. We choose Rainbow because it is freely available, fast, and does not require any particular annotation or linguistic re­ source other than the corpus itself. Because Rainbow per­ forms text analysis (builds its model) using all words or content words, we have to restrict the evaluation to just This step yields a similarity matrix: to each pair of doc­ these two internal features in this experiment. We collect 2 uments a distance is associated. Relative entropy, x and a set of corpora for which we have a reliable classificatio11, G2 are all distance measures, so in the matrix the more siln­ and compute the homogeneity measure for each corpus . For ilar text couples appear with a small value assigned. each corpus we measure homogeneity using the three inter­ document similarity measures. Then, for each similarity 3.4. Quantify homogeneity measure, we rank the corporaaccording to their homogene­ The similarity values calculated in the previous step for ity value in increasing order, so that homogeneous corpora each pair of documents in a corpus are now employed to appear at the beginning of the list. For each of the two produce information to help in describing the corpus and in features, three ranked lists of homogeneity values are pro­ quantifying its homogeneity and its similarity in relation to duced. other corpora. The information we provide for the corpus We then use Rainbow to produce similar ranked lists; is: using both all-words and content words, to use as a gold standard. Allthe corpora for which we measure the homo­ a homogeneity measure which quantifies the variabil­ • geneity are merged to form a single big corpus. We then ity of the features inside the corpus. The homogeneity use Rainbow to classify the new big corpus using different measure corresponds to the maximum distance among sizes of training corpus. The task for Rainbow is to rebuild its documents; from the merged corpus all the corpora it was made of. Ac­ the corpus prototypical element, to give the user an cording to our hypothesis, in order to achieve the same level • idea of what kind of text he/she can find in the corpus . of accuracy, homogeneous corpora need to be trained on a In clustering, such an object is called "the medoid", smaller subcorpus than heterogeneous corpora. The accu­ the cluster element which is the nearest to the centroid; racy of theclassifica tion of each class is computed. Classes are then ranked in a descending order, so that the homo­ a similarity measure which describes the relative posi­ • geneous ones appear at the beginning of the list. For each tion of the corpusin relation to the others. The simi­ of the two features, a rank list of Rainbow accuracy values larity of corpus A and B is the distance between their is produced. Finally, the Spearman's rho test is employed centroids. to identify the correlationbetween the homogeneity values The usefulness of a prototypicalelement and the valid­ and Rainbow accuracy values. ity of a similarity measure depend directly on the homo­ geneity of the corpora analyzed. Themore a corpus is ho­ 5. 
Experiment mogeneous thebetter its prototypical element can describe The corpus we used for this first experiment is the the corpus documents, because there is a smaller variance British National Corpus (BNC). The BNC is a 100million­ between it and the other documents of the corpus. The in­ word collection of samples of written and spoken language, terpretation of a similarity measure computed between a from a wide range of sources, designed to represent a wide homogeneous corpus and a heterogeneous one, or between cross-section of current British English (monolingual syn­ two heterogeneous corpora, is not clear, and needs further chronic corpus). Moreover, it is a general corpus which analysis. In this paper we focus only on the evaluation of includes many different language varieties, and is not lim­ the homogeneity measures. ited to any particular subject field, genre or register. There has been a lot of work on the classification of BNC docu­ 4. Evaluation ments. The BNC Index (Lee, 2001) is an attempt to com­ The aim of our first experiment is to understand which bine and consolidate some of these suggestions. The re­ text-similarity measure is most reliable, among the three sult is a resource which provides an accurate classification 2 currently used (D(pllq ), x and G2 ). of the documents in the BNC, according to many differ­ To evaluatethe homogeneity of a corpus is difficult ow­ ent kinds of external criteria such as medium, domain and ing to the lack of gold-standard judgments with which the genre . According to the medium, BNC documents can be measures can be compared. The hypothesis at the base of classified into six different classes: .spoken, written-to�be­ homogeneity is that a NLP system can reach better results spoken, book, periodical, published miscellanea, and un­ when it uses an homogeneous corpus rather than an het­ published miscellanea. For domain, spoken English can be erogeneous one. In the experiment we run an NLP system classified into five classes (e.g., transcription of business

209 2 recordings, spontaneous natural conversations), and writ­ Corpus D(pllq) x:l 0 ten English into nine (e.g., applied science, arts, belief and g-W..11 ew s..script 0.0489 0.0304 0.0663 thought). There are 24 genres for spoken and 46 genres for g-W..11 ewsp...tabloid 0.0948 0.0640 0.1503 written English (among the genres for written English there g-W..newsp _other _report 0.1208 0.0817 0.1951 are for example personal letters, university essays, tabloid g-W..newsp_ other ..sports 0.1335 0.0750 0.1756 newspaper, bibliographies and instructional texts). g-W..hansard 0.1459 0.0950 0.2283 To avoid comparing classes whose size is too dissimilar: d-W....a pp..science 0.6165 0.2973 0.7737 m-m.nnpub 0.6502 0.3711 0.9649 each BNC document is divided into chunks of a fixed g-WJnisc 0.6572 0.2818 0.7040 • size. For the first experiment, chunks of 20,000 words g-W...advert 0.7201 0.2949 0.7519 were produced. If a document is too small it is dis­ random2 0.94344 0.4200 1.0906 charged. If it is big enough to contain more than one chunk, multiple chunks are produced, and considered Table 2: Homogeneity scores computed using the 500 most as individual documents in the analysis; frequent words in each corpus

for each BNC classes from medium, domain and • Rainbow Homogeneity Spearman's genre, a corpus is created with the same number of correlation chunks. For the experiment we produced corpora of 1 doc per class 0.526 20 chunks each. If a class does not contains enough D(pjjq) 1 doc per class 0.527 documents it is discharged; otherwise 20 chunks are x2 1 doc per class 0 0.530 chosen randomly. 2 5 doc per class D(pllq) 0.447 5 doc per class 2 0.473 This gives 51 corpora with 20 documents of 20,000 x 5 doc per class 02 0.474 words each. Threerandom corpora made of 20 chunks cho­ 10 doc per class 0.432 sen randomly fromthe BNC are also created. We expected D(pjjq) 10 doc per class 2 0.451 random corpora to be less homogeneous than all the other x 10 doc per class 02 0.451 corpora. 15 doc per class D(pjjq) 0.387 The three homogeneity measures (using D(pl jq), x2 15 doc per class x2 0.413 and G2) for each corpus are calculated, and thecorpora are 15 doc per class 02 0.415 ranked according to their homogeneity score: corpora with a lower score are considered more homogeneous than ones Table 4: Spearman's correlation between Rainbow accu­ with a higher score. Then the 54 corpora are merged to racy values and Homogeneity values using words form one big corpus, and Rainbow is used to see how accu­ rately it can recover each of the 54 corpora, using training sets of different sizes. The sizes used were 1, 5, 10, 15; e.g., when the training set size was 5, the task for Rainbow was 7. Conclusion andfu ture work to recover the other 15 same-class documents out of th e pot The Spearman correlation values show that the original of 1080 documents. For each corpus, we compute the accu­ distinction between homogeneous and heterogeneous cor­ racy, the proportion of �,orrectly classified documents, and pora is maintained in Rainbow: corpora with a low homo­ the standard deviation cii.lculated on 50 trials. geneity score need a small training set to achieve a high ac­ curacy in the classification ta,sk. By contrast; heterogeneous 6. Results and random corpora need a bigger trainingset to achieve an accuracy which, however, is smaller than the one obtained Tables 2 and 3 list the homogeneity measures for the by the homogeneous corpora. five corpora at the beginning and end of the lists ranked by None of the three text-similarity measures used to com­ homogeneity and Rainbow accuracy values, using all words pute homogeneity produces a rank which follows ex actly as features. Using Rainbow, the three random corpora ap ­ th e same order identified with Rainbow, even if the constant pear to be less homogeneous than all the other corpora, high values of the standard deviation suggest that the rank as expected. The homogeneity measures based on inter­ order identify by Rainbow is not fixed. Among the three document distance instead partially failed; in fact, although measures, 02 provides the closest rank, expecially wben they all appear somewhere at bottom of the rank list, the all words are used. just one of them turns up after all the non-random corpora. Various reasons may be responsible for the unclarity of We use Spearman's rho test (Owen and Jones, 1977) to the results : comparethe ranks obtained using the homogeneity measure and Rainbow. Spearman's correlation is 1 when the two lack of data: chunks, in which we divide the docu­ • ranks are exacly the same, and 0 when no correlation is ments, and the number of chunks, we set for each cor­ found between the two ranked lists. 
The results, presented pus, are not big enough. For this experiment we pro­ in table 4 for all words and in table 5 for content words, are duce chunks of 20,000 words - and we use corpora of always positive and usually within thesignificance level of 20 chunks each. We also try to use chunks of 50,000 53. words and corpora made of 50 chunks each, but the

210 Class 1 5 10 15 g-W..hans ard 73.05 (32.61) 97.2 (6.74) 97.6 (4.31) 94.8 (8.86) g-WJie wsp..tabloid 70.52 (25.62) 89.46 (5.05) 83.4 (7.98) 85.6 (16.18) g-W...ac _medicine 58.63 (26.75) 85.06 (9.67) 85.4 (8.85) 80.4 (16.89) g-WJiewsp _other....sp orts 56.31 (21.24) 88.26 (13.57) 92.6 (8.99) 96.8 (7.40) g-WJiews ....script 53.57 (30.64) 63.73 (24.32) 71.2 (20.06) 66.4 (19.56) m-periodical 3.26 (4.36) 0 (0) 0 (0) 0 (0) m-book 2.52 (5.12) 1.86 (3.31) 0.6 (2.39) 0.8 (3 .95) random3 2.42 (4.28) 0.13 (0.94) 0 (0) 0 (0) randorn2 2.21 (3.84) 0.26 (1.32) 0 (0) 0 (0) randoml 1.05 (2. 12) 0.13 (0.94) 0.6 (2.39) 0 (0)

Table 3: The accuracy obtained by Rainbow analyzing all words for homogeneous and heterogeneous subcorpora using training set of differentsize: 1, 5, 10 and 15 document per class respectively

    Rainbow             Homogeneity    Spearman's correlation
    1 doc per class     D(p||q)        0.445
    1 doc per class     χ²             0.383
    1 doc per class     G²             0.389
    5 doc per class     D(p||q)        0.291
    5 doc per class     χ²             0.277
    5 doc per class     G²             0.286
    10 doc per class    D(p||q)        0.273
    10 doc per class    χ²             0.269
    10 doc per class    G²             0.281
    15 doc per class    D(p||q)        0.240
    15 doc per class    χ²             0.232
    15 doc per class    G²             0.245

Table 5: Spearman's correlation between Rainbow accuracy values and homogeneity values using content words

7. Conclusion and future work

The Spearman correlation values show that the original distinction between homogeneous and heterogeneous corpora is maintained in Rainbow: corpora with a low homogeneity score need a small training set to achieve a high accuracy in the classification task. By contrast, heterogeneous and random corpora need a bigger training set to achieve an accuracy which, however, is smaller than the one obtained by the homogeneous corpora.

None of the three text-similarity measures used to compute homogeneity produces a rank which follows exactly the same order identified with Rainbow, even if the constantly high values of the standard deviation suggest that the rank order identified by Rainbow is not fixed. Among the three measures, G² provides the closest rank, especially when all words are used.

Various reasons may be responsible for the unclarity of the results:

• lack of data: the chunks into which we divide the documents, and the number of chunks we set for each corpus, are not big enough. For this experiment we produce chunks of 20,000 words and use corpora of 20 chunks each. We also tried chunks of 50,000 words and corpora made of 50 chunks each, but the BNC classes which contain these amounts of data are few and appear to be all quite heterogeneous;

• presence of noise in the data: at the moment we use the N most frequent internal features present in each corpus, and for this experiment we set N equal to 500. We would like to consider ways of identifying the features that seem more likely to show differences among the documents, and of filtering out those which instead can only create noise;

• the use of Rainbow as a gold-standard judgment for homogeneity: to classify texts, any system uses a mix of homogeneity and similarity, so the attempt to use its classification to evaluate a homogeneity measure can be misleading;

• the three text-similarity measures used may not be the best for studying corpus homogeneity and similarity.

The results obtained from this first attempt to evaluate the homogeneity measure confirm the hypothesis that homogeneous corpora need a smaller training set than heterogeneous corpora to achieve a certain degree of accuracy. But the methodology we have used is still too unrefined to produce clear results.

As far as the methodology is concerned, the aspect requiring further attention is the use of some kind of feature selection, in order to analyze just the more distinctive features. At the moment we are considering different types of feature selection. We also want to use a fourth text-similarity measure: perplexity.

As far as evaluation is concerned, other experiments to study the validity and reliability of the measures proposed to quantify homogeneity and similarity are needed. Because the main applications of the two measures are in NLP, they should still be tested in relation to an NLP task. We would like to consider a different system from text classification, and also possible ways of combining the two measures.

8. References

Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press.

Douglas Biber. 1993. Using register-diversified corpora for general language studies. Computational Linguistics, 19(2):219-241.

Nigel Dewdney, Carol VanEss-Dykema, and Richard MacMillan. 2001. The form is the substance: Classification of genres in text. In Workshop on Human Language Technology and Knowledge Management, Toulouse, France. ACL 2001.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Helka Folch, Serge Heiden, Benoît Habert, Serge Fleury, Gabriel Illouz, Pierre Lafon, Julien Nioche, and Sophie Prévost. 2000. TypTex: Inductive typological text classification by multivariate statistical analysis for NLP systems tuning/evaluation. In Second International Conference on Language Resources and Evaluation, pages 141-148, Athens, Greece. LREC 2000.

K. Hofland and S. Johansson. 1982. Word frequencies in British and American English. The Norwegian Computing Centre for the Humanities.

Gabriel Illouz, Benoît Habert, Helka Folch, Serge Heiden, Serge Fleury, Pierre Lafon, and Sophie Prévost. 2000. TypTex: Generic features for text profiler. In Content-Based Multimedia Information Access, Paris, France. RIAO 2000.

Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 32-38, Madrid, Spain. ACL '97.

Adam Kilgarriff. 2001. Comparing corpora. International Journal of Corpus Linguistics, 6(1):1-37.

David Lee. 2001. Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, 5(3):37-72. Special issue on "Using Corpora in Language Teaching and Learning".

Andrew Kachites McCallum. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.

Frank Owen and Ronald Jones. 1977. Statistics. Polytech Publishers, Stockport, UK.

Douglas Roland and Daniel Jurafsky. 1998. How verb subcategorization frequencies are affected by corpus choice. In 36th Annual Meeting of the Association for Computational Linguistics, pages 1122-1128, Montreal, Canada.

D. Scott, N. Bouayad-Agha, R. Power, S. Schulz, R. Beck, D. Murphy, and R. Lockwood. 2001. PILLS: A multilingual authoring system for patient information. In A Vision of the Future and Lessons from the Past: Proceedings of the 2001 AMIA Annual Symposium, Washington DC.

Satoshi Sekine. 1997. The domain dependence of parsing. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington D.C., USA. ACL.


Corpora from the web

Gabriela Cavaglia and Adam Kilgarriff
ITRI, University of Brighton
Email: {Gabriela.Cavaglia, Adam.Kilgarriff}@itri.brighton.ac.uk

Abstract

We investigate the potential of using the web as a huge corpus for language studies. We test the hypothesis that corpora produced by gathering web pages found when searching on a technical term are homogeneous, whereas corpora produced using a general term are heterogeneous.

1 Introduction

Over the last couple of years the Web has become the corpus resource of choice for many language studies (e.g., Resnik (1999), Fujii and Ishikawa (2000)). It contains a huge quantity of text, of any number of varieties, for a very wide range of languages (Grefenstette and Nioche, 2000). Moreover text is available immediately, for free, and can be downloaded without concern for copyright.

Many web documents do not contain only text, or do not contain text at all. Many are duplicates. Some others point at documents which do not exist any more. All these characteristics can sound like good reasons for not using the web as a linguistic resource. However they are all obstacles that can be overcome by judicious, linguistically-aware filtering of downloaded material. The prize of vast, accessible corpora will make effort spent on developing filters a good investment.

In this paper we propose a method for creating corpora built from the web (made up of web documents) using a search engine, a simple query and a downloading program. We then compare the homogeneity of corpora generated using different search terms.

When we use a search engine (Altavista, Yahoo, Google, ...) to search on a technical term, we would expect the texts that the search engine finds to be fairly homogeneous. For example, we would expect the results for the technical term vowel shift to be predominantly articles and course notes concerning phonology. If instead we use a general term - an arbitrary adjective-noun combination such as suitable place - we would expect the corpus generated to be much more heterogeneous, as the term will occur in texts of a very wide range of varieties. In this paper we describe an experiment designed to test the hypothesis that technical terms give rise to more homogeneous corpora than general terms.

2 Method

91 technical terms and 59 general terms were identified. The technical terms were drawn from the index of a phonology book, the general terms from a novel. For each, an Altavista search was instigated. The Altavista search results were parsed and the URLs of pages containing the search term were identified. Altavista gave at least 200 hits for 43 of the technical terms and 47 of the general terms. Downloads of the first 200 pages for each of these were attempted, with most but not all pages successfully retrieved. This gave 43 technical-term corpora and 47 general-term corpora for further experimentation.
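The download step is straightforward to reproduce. The sketch below assumes the search-engine stage has already produced a list of result URLs for a term (the format of Altavista's result pages, and hence the parser, is not specified here); it fetches up to 200 pages per term and tolerates failed downloads, mirroring the procedure described above. The function name and the timeout value are our own.

```python
import urllib.request
from urllib.error import URLError

def download_corpus(urls, max_pages=200, timeout=30):
    """Fetch up to max_pages of the URLs returned for one search term.
    Failed downloads are skipped, matching 'most but not all pages
    successfully retrieved' above."""
    pages = []
    for url in urls[:max_pages]:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                pages.append(response.read().decode("utf-8", errors="replace"))
        except (URLError, ValueError, TimeoutError):
            continue  # dead link, timeout or malformed URL: skip it
    return pages

# e.g., corpora = {term: download_corpus(urls) for term, urls in search_results.items()}
```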

The next question was, which of these pages comprised text of the relevant human language (in these experiments, English)? To answer this, we need a working definition of what is to count as language. At the one extreme, everything included in the Brown corpus clearly is language. At the other, images clearly are not. Problematic points in between include:

• timetables
• computer language code
• indexes
• bibliographies
• headings

There is also HTML markup. This was clearly not English, and it was our intention from the outset to exclude the markup from the computations used to determine language-like-ness or document similarity. We would like to use our knowledge of the semantics of HTML tags to guide the identification of language segments within documents, noting, for example, that data in HTML tables are less likely to be linguistic than data in HTML paragraphs, but this is future work.

Our strategy throughout was to err towards strictness, and to aim to throw out items if we were not confident they were good instances of language, on the basis that the web is huge and one can always get more data, so there is no need to use data one is not sure of.

Where pages do not contain many words, it is harder to determine algorithmically whether they are good instances of language. We rejected all pages containing less than 2000 non-markup words.

2.1 A unigram model of language-like-ness

We used a simple unigram model to identify whether a web page was predominantly linguistic and in English.[1] We took relative frequencies of twenty high-frequency words from a reference corpus, the British National Corpus[2] (BNC). For each page, we compared the frequencies of these twenty words with their BNC frequencies, and rejected the page if the score was above a threshold. The score for document d was computed as

    Score(d) = \sum_{i=1}^{20} \frac{(BNC_i - doc_{d,i})^2}{BNC_i}

where BNC_i is the relative frequency of the ith word in the BNC, and doc_{d,i} is its relative frequency in document d. We considered various measures, and selected this one because a similar measure had proved effective in Kilgarriff (2001).

The words used for the comparison were ideally words which were both very frequent and had a relatively stable frequency across genres. We used a BNC frequency list indicating, for each word, the variance in its frequency across a set of same-length samples from the BNC.[3] We took the first twenty items in the list for which the variance was less than ten times the mean frequency per sample. These were:

    the of and to a in it for be with on that by at not this but they from which

Items excluded because the variance was too great were:

    is was I you he are his had she

Following some experimentation, the threshold was set at 0.1. Documents d for which Score(d) > 0.1 were rejected.

Once short documents and high-scoring documents had been filtered out, for some of the original search terms there were less than twenty web pages remaining. In these cases, the entire corpus was set aside. After this process, there were corpora remaining for just seventy-four of the search terms (40 technical, 34 general). These terms and the associated corpora were the ones used for the homogeneity experiment.

[1] Fujii and Ishikawa (2000) faced similar problems of identifying fragments which were Japanese text, and used a trigram language model to filter out text fragments with perplexity above a threshold. We expect both techniques would lead to similar results.
[2] http://info.ox.ac.uk/bnc
[3] For details and list, see http://ftp.itri.bton.ac.uk/Adam.Kilgarriff/bnc-readme.html#variances

2.2 A unigram model of homogeneity

Following Kilgarriff (2001), we expect simple word frequencies, or unigrams, to provide ample information to assess the homogeneity of a corpus. Grammatical features (as listed in Biber (1988) amongst others), type-token ratios and average sentence length are amongst the features that other researchers have focussed on, and we are happy to accept that no individual word frequency may be as useful a feature for discriminating between text types as, for example, the type-token ratio. However, there is, we believe, more discriminating information in the frequencies of a large number of the highest-frequency words than in any short list of hand-selected features.
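The language-like-ness filter is easy to state in code. Below is a minimal sketch of Score(d) under stated assumptions: whitespace tokenisation (the paper does not specify one), and illustrative BNC relative frequencies. The numbers in BNC_RELFREQ are placeholders, not the published figures; real values would come from the BNC frequency list cited in footnote 3.

```python
# Illustrative placeholders only: real relative frequencies must be taken
# from the BNC list referred to in footnote 3.
BNC_RELFREQ = {
    "the": 0.061, "of": 0.030, "and": 0.027, "to": 0.025, "a": 0.022,
    "in": 0.019, "it": 0.010, "for": 0.009, "be": 0.0065, "with": 0.0065,
    "on": 0.0062, "that": 0.0059, "by": 0.0051, "at": 0.0047, "not": 0.0046,
    "this": 0.0045, "but": 0.0044, "they": 0.0042, "from": 0.0041, "which": 0.0037,
}

def score(document_text):
    """Score(d) = sum over the twenty stable words of (BNC_i - doc_i)^2 / BNC_i.
    Whitespace tokenisation is an assumption."""
    tokens = document_text.lower().split()
    n = len(tokens)
    total = 0.0
    for word, bnc in BNC_RELFREQ.items():
        doc = tokens.count(word) / n if n else 0.0
        total += (bnc - doc) ** 2 / bnc
    return total

def keep(document_text, threshold=0.1, min_words=2000):
    """Apply both filters from sections 2 and 2.1: length, then Score(d) <= 0.1."""
    return (len(document_text.split()) >= min_words
            and score(document_text) <= threshold)
```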

Word frequencies have the advantage that they are easy to count, and the counts are (almost) theory-free.

For each corpus, a measure of homogeneity was calculated as follows. First, the word frequencies for the entire corpus were calculated; this served to identify the n most frequent words, and their relative frequencies, in the corpus as a whole. Then, a document's similarity to the norm for the corpus was calculated as:

    SimToNorm(d) = \sum_{i=1}^{n} \frac{(corp_i - doc_{d,i})^2}{corp_i}

where corp_i is the relative frequency of the ith word in the corpus as a whole, and doc_{d,i} is its relative frequency in the first m words of document d. We experimented with values of n (the number of high-frequency words used as features) of 100, 200 and 500, and with values of m (the size of each document) of 1000, 2000 and 5000.

Then, the corpus homogeneity was defined as (1) the mean or (2) the median of the SimToNorm scores for its component documents.

The homogeneity figures for the general-term corpora and the technical-term corpora were then compared.
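A minimal sketch of this measure follows, under the same tokenisation assumption as before; the helper names are ours. It computes the corpus norm over the n most frequent words, scores each document against it using its first m words, and returns the mean (the statistic used for Table 1 below).

```python
from collections import Counter

def relfreq(tokens, vocab):
    """Relative frequencies of the vocabulary words in a token list."""
    c = Counter(tokens)
    n = len(tokens)
    return {w: c[w] / n for w in vocab}

def corpus_homogeneity(documents, n=100, m=2000):
    """Mean SimToNorm over the corpus; lower = more homogeneous.
    n and m default to the settings behind Table 1 (100 features,
    first 2000 words of each document)."""
    docs_tokens = [d.lower().split() for d in documents]
    all_tokens = [t for toks in docs_tokens for t in toks]
    vocab = [w for w, _ in Counter(all_tokens).most_common(n)]
    corp = relfreq(all_tokens, vocab)
    scores = []
    for toks in docs_tokens:
        doc = relfreq(toks[:m], vocab)
        scores.append(sum((corp[w] - doc[w]) ** 2 / corp[w] for w in vocab))
    return sum(scores) / len(scores)
```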
3 Results

In Table 1 we present the results for the 74 search terms and corpora, ordered according to homogeneity. These results were produced using the first 2000 words of each document, with the 100 highest-frequency words as features, and using the mean (rather than the median) as the measure of homogeneity. However we note that other permutations produced very similar results.

On each occasion, counter to our hypothesis, the results suggested that the general-term corpora were slightly more homogeneous than the technical-term corpora. Significance was examined using the Wilcoxon rank sums test (also known as the Mann-Whitney U test). For some permutations, the difference was significant at the 95% or 97.5% level; for others, it was not significant at either level. For the instance shown in the table, the sum of ranks for the smaller set of general-term corpora is 1092, which gives a z-score of 1.98, so it is (just) significant at the 95% level, using a two-tailed test.[4]

[4] Where there are more than 20 instances in each of the two datasets, it is appropriate to use a normal approximation to the distribution of Mann-Whitney U. This is calculated as

    z = \frac{2R - N_1(N+1)}{\sqrt{N_1 N_2 (N+1)/3}}

where R is the sum of ranks for the category with fewer items, N_1 is the number of items in the category with fewer items, N_2 is the number of items in the category with more items, and N is the total number of items.
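As a quick arithmetic check of the footnote, the reported z-score can be reproduced from the figures in the text, assuming N1 = 34 general-term corpora (the smaller category, per section 2.1), N2 = 40 technical-term corpora, and N = 74 in total:

```python
import math

# Normal approximation to Mann-Whitney U, as in footnote 4.
R, N1, N2 = 1092, 34, 40    # N1 = smaller (general) category: our reading of section 2.1
N = N1 + N2                 # 74 corpora in all
z = (2 * R - N1 * (N + 1)) / math.sqrt(N1 * N2 * (N + 1) / 3)
print(round(abs(z), 2))     # -> 1.98, matching the value reported in the text
```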

    g/t  Search term                Homog     g/t  Search term                   Homog
    g    rear door                  0.55      t    vowel shift                   1.01
    g    piercing pain              0.59      t    modal verb                    1.04
    t    native speaker             0.60      t    rising tone                   1.04
    g    soft soil                  0.61      t    vowel reduction               1.07
    g    dead leaves                0.61      t    pitch accent                  1.08
    g    medical attention          0.62      t    initial consonant             1.09
    g    military parade            0.62      g    imminent arrival              1.09
    g    personal effects           0.64      t    glottal stop                  1.09
    g    personal identification    0.65      t    surface representation        1.10
    t    word formation             0.65      t    loan words                    1.11
    g    complete record            0.65      g    important connection          1.12
    g    little patience            0.68      g    leather suitcase              1.13
    t    generative phonology       0.69      t    subordinate clause            1.16
    g    elevator entrance          0.72      g    strong defence                1.16
    g    parked car                 0.74      g    tv image                      1.17
    t    connected speech           0.74      t    parallel distribution         1.18
    g    television program         0.74      t    regional variation            1.19
    t    compensatory changes       0.74      t    local peak                    1.19
    g    upstairs window            0.75      t    minimal pair                  1.19
    t    error analysis             0.75      t    local accent                  1.19
    t    vocal cords                0.76      t    double articulation           1.19
    t    full forms                 0.76      t    immature forms                1.21
    g    long corridor              0.79      t    complementary distribution    1.21
    g    congested streets          0.81      t    continuous variation          1.21
    t    tone language              0.81      t    stressed                      1.22
    g    green hose                 0.82      g    strange accent                1.22
    g    keen interest              0.83      g    suitable place                1.22
    t    citation form              0.84      t    free variation                1.26
    t    tone group                 0.87      t    qualitative assessment        1.33
    t    formal style               0.88      g    excellent metaphor            1.33
    g    basic freedom              0.90      t    consonant cluster             1.37
    t    reduced forms              0.92      t    phonological process          1.45
    g    excellent judgment         0.93      g    former existence              1.46
    t    redundant features         0.93      g    bank officials                1.48
    g    dark liquid                0.95      t    substantive evidence          1.53
    t    sound change               0.95      g    excessive taxation            1.55
    g    narrow space               0.97      g    essential properties          1.70

Table 1: Homogeneity scores for 74 downloaded corpora. g indicates the search term was 'general', t that it was 'technical'.

4 Discussion and future work

Counter to our expectations, the experiment does not confirm the hypothesis. We manually investigated some of the corpora and found high degrees of heterogeneity in the web pages wherever we looked. Amongst the web pages for word formation, a technical term with a low score,[5] were a philosophical article on Gadamer, a bible studies piece and a page on programming, alongside assorted pages on language study; but these pages also formed a mixed bag, with some describing courses, one being the home page of the Lithuanian Language Club, and one a FAQ for an Estonian culture newsgroup. Military parade, a low-scoring general term, included some tourists' pages where the military parade was an event to see, documents about films and archeological projects, and news pages. Upstairs window, another low-scoring general term, seemed to get its low score because most of the pages were narrative - diaries, novels, stories, legends - though some were about housing (renting or maintaining), and a couple about cinema.

[5] A low score indicates homogeneity, a high score, heterogeneity.

For technical and general terms alike, the web contains all sorts of documents, of all sorts of genres, and it is not evident whether the tendency of technical terms to occur more narrowly will ever show through the tendency, on the web, for all sorts of terms to appear in all sorts of genres.

Thus our qualitative investigation of the corpora leads us to suspect the underlying intuition will only be valid in rather more constrained ways than we had originally imagined.

There are many other considerations which may have contributed to the hypothesis not being confirmed, and over the coming months they will be explored in greater detail. They include the following.

• Altavista returns an ordered list of search hits, with items where the search term was highly salient at the top. Where there were many more than 200 hits, our downloading strategy retrieved only high-salience items, which will not be representative of documents containing the search term in general.

• Web pages frequently contain a mix of text and other material; to get a good text corpus, an analysis at a finer-grained level than the complete web page is required. HTML markup should be used to guide decisions about which parts of web pages are text and which are not.

• The sample sizes (e.g., 2000 words or 5000 words from the beginning of the document) were too short.

• The arbitrary cutting-off of documents after 2000 or 5000 words undermined the integrity of the documents, thereby increasing the level of noise until it obscured the linguistic effect.

• First, a classification according to genre must be undertaken; the hypothesis will only be confirmed within a genre.

• Word frequencies are not appropriate features.

• Not enough word frequency figures were used.

• The formulae used for calculating homogeneity (and language-like-ness) were inappropriate.

This was an early foray into the relation between the linguistic characterisation of corpora and datasets obtainable by downloading. Some of the possibilities and difficulties of the interaction have been sketched, and we shall continue exploring further methods so that, in due course, language corpora of all kinds - homogeneous, heterogeneous, of specified language varieties - can be specified according to the interests of the researcher and downloaded from the web.

References

Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press.

Atsushi Fujii and Tetsuya Ishikawa. 2000. Utilizing the world wide web as an encyclopedia: Extracting term descriptions from semi-structured texts. In 38th Annual Meeting of the Association for Computational Linguistics, pages 488-495, Hong Kong. ACL'00.

Gregory Grefenstette and Julien Nioche. 2000. Estimation of English and non-English language use on the WWW. In Proceedings of RIAO (Recherche d'Informations Assistée par Ordinateur), Paris.

Adam Kilgarriff. 2001. Comparing corpora. International Journal of Corpus Linguistics, in press.

Philip Resnik. 1999. Mining the web for bilingual text. In 37th Annual Meeting of the Association for Computational Linguistics, pages 527-534, College Park, Maryland, USA. ACL'99.
