A Language Independent Statistical Grammar Checking Approach and a Comparison to Existing Approaches
Hochschule Darmstadt & Reykjavík University
Departments of Computer Science

LISGrammarChecker: Language Independent Statistical Grammar Checking

Master thesis to achieve the academic degree Master of Science (M.Sc.)

Verena Henrich and Timo Reuter
February 2009

First advisor: Prof. Dr. Bettina Harriehausen-Mühlbauer
Second advisor: Hrafn Loftsson, Ph.D., Assistant Professor

"There is something funny about me grammar checking a paper about grammar checking..."
(William Scott Harvey)

Abstract

People increasingly produce their texts with computers, and grammatical correctness is often very important, so grammar checkers are applied. Most grammar checkers available today are based on rules, but they often do not work as well as users expect. To counteract this problem, new approaches use statistical data instead of rules as their basis. This work introduces such a grammar checker: LISGrammarChecker, a Language Independent Statistical Grammar Checker.

This work hypothesizes that it is possible to check grammar up to a certain extent by using only statistical data. The approach should facilitate grammar checking even in those languages where rule-based grammar checking is an insufficient solution, e.g. because the language is so complex that mapping all grammatical features to a set of rules is not possible.

LISGrammarChecker extracts n-grams from correct sentences to build up a statistical database in a training phase. This data is used to find errors and to propose corrections. The database contains bi-, tri-, quad- and pentagrams of tokens as well as bi-, tri-, quad- and pentagrams of part-of-speech tags. To detect errors, every sentence is analyzed with regard to its n-grams, which are compared to those in the database. If an n-gram is not found in the database, it is assumed to be incorrect. For every incorrect n-gram, error points are assigned depending on the type of n-gram.

Evaluation results prove that this approach works for different languages, although the accuracy of the grammar checking varies. The reasons lie in differences in the morphological richness of the languages. The reliability of the statistical data is very important, i.e. it is mandatory to provide enough data of good quality to find all grammatical errors. The more tags the used tagset contains, the more grammatical features can be represented. Thus the quality of the statistical data and the used tagset influence the quality of the grammar checking result. The statistical data, i.e. the n-grams of tokens, can be extended by n-grams from the Internet. In spite of all improvements, there are still many issues in reliably finding all grammatical errors. We counteract this problem by combining the statistical approach with selected language dependent rules.
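The n-gram lookup described in the abstract can be illustrated with a minimal sketch. The Python code below is only an illustration under simplifying assumptions, not the thesis's implementation: the thesis stores its n-grams in a database and applies the same principle to both token n-grams and part-of-speech-tag n-grams, whereas here everything is kept in memory; the function names (extract_ngrams, train, check) and the error-point weights are hypothetical.

# Sketch of statistical n-gram grammar checking (illustration only).
from collections import defaultdict

NGRAM_SIZES = (2, 3, 4, 5)               # bi-, tri-, quad- and pentagrams
ERROR_POINTS = {2: 4, 3: 3, 4: 2, 5: 1}  # hypothetical weights per n-gram size

def extract_ngrams(sequence, n):
    # All contiguous n-grams of size n from a list of tokens or PoS tags.
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def train(sentences):
    # Training mode: store every n-gram seen in the (assumed correct) input.
    known = defaultdict(set)
    for sentence in sentences:
        for n in NGRAM_SIZES:
            known[n].update(extract_ngrams(sentence, n))
    return known

def check(sentence, known):
    # Checking mode: each n-gram never seen in training earns error points;
    # a high total suggests the sentence is ungrammatical.
    points = 0
    for n in NGRAM_SIZES:
        for ngram in extract_ngrams(sentence, n):
            if ngram not in known[n]:
                points += ERROR_POINTS[n]
    return points

known = train([["the", "cat", "sits", "on", "the", "mat", "."]])
print(check(["the", "cat", "sit", "on", "the", "mat", "."], known))  # > 0

Run as shown in the last two lines, the n-grams around "cat sit" are unknown and accumulate error points; in the thesis the accumulated points per sentence decide whether it is flagged, and the stored n-grams are also used to propose corrections.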
Contents

I. Introduction

1. Introduction
   1.1. Motivation
   1.2. Goal and Definition
   1.3. Structure of this Document

2. Fundamentals
   2.1. Natural Languages and Grammar Checking
        2.1.1. Definition: The Grammar of a Natural Language
        2.1.2. Tokenization
        2.1.3. Grammar Checking
        2.1.4. Types of Grammatical Errors
        2.1.5. Definition: n-grams
        2.1.6. Multiword Expressions
        2.1.7. Sphere of Words
        2.1.8. Language Specialities
   2.2. Corpora: Collections of Text
        2.2.1. Definition: Corpus
        2.2.2. Sample Corpora
   2.3. Part-of-Speech Tagging
        2.3.1. Tagset
        2.3.2. Types of PoS Taggers
        2.3.3. Combined Tagging

3. Related Work
   3.1. Rule-based Approaches
        3.1.1. Microsoft Word 97 Grammar Checker
        3.1.2. LanguageTool for OpenOffice
   3.2. Statistical Approaches
        3.2.1. Differential Grammar Checker
        3.2.2. n-gram Based Approach
   3.3. Our Approach: LISGrammarChecker

II. Statistical Grammar Checking

4. Requirements Analysis
   4.1. Basic Concept and Idea
        4.1.1. n-gram Checking
        4.1.2. Word Class Agreements
        4.1.3. Language Independence
   4.2. Requirements for Grammar Checking with Statistics
   4.3. Programming Language
   4.4. Data Processing with POSIX Shells
   4.5. Tokenization
   4.6. Part-of-Speech Tagging
        4.6.1. Combination of PoS Taggers
        4.6.2. Issues with PoS Tagging
   4.7. Statistical Data Sources
   4.8. Data Storage

5. Design
   5.1. Interaction of the Components
   5.2. User Interface: Input and Output
   5.3. Training Mode
        5.3.1. Input in Training Mode
        5.3.2. Data Gathering
   5.4. Grammar Checking Mode
        5.4.1. Input in Checking Mode
        5.4.2. Grammar Checking Methods
        5.4.3. Error Counting
        5.4.4. Correction Proposal
        5.4.5. Grammar Checking Output
   5.5. Tagging
   5.6. Data

6. Implementation
   6.1. User Interaction
   6.2. Tokenization
   6.3. Tagging
   6.4. External Program Calls
   6.5. Training Mode
   6.6. Checking Mode
        6.6.1. Checking Methods
        6.6.2. Internet Functionality
        6.6.3. Correction Proposal
        6.6.4. Grammar Checking Output
   6.7. Database
        6.7.1. Database Structure/Model
        6.7.2. Communication with the Database

III. Evaluation

7. Test Cases
   7.1. Criteria for Testing
        7.1.1. Statistical Training Data
        7.1.2. Input Data for Checking
        7.1.3. Auxiliary Tools
        7.1.4. PoS Tagger and Tagsets
   7.2. Operate Test Cases
        7.2.1. Case 1: Self-made Error Corpus (English), Penn Treebank Tagset
        7.2.2. Case 2: Same as Case 1, Refined Statistical Data
        7.2.3. Case 3: Self-made Error Corpus (English), Brown Tagset
        7.2.4. Case 4: Self-made Error Corpus (German)
        7.2.5. Case 5: Several Errors in a Sentence (English)
   7.3. Operate Test Cases with Upgraded Program
        7.3.1. Case 6: Self-made Error Corpus (English), Brown Tagset
        7.3.2. Case 7: Self-made Error Corpus with Simple Sentences (English)
   7.4. Program Execution Speed
        7.4.1. Training Mode
        7.4.2. Checking Mode

8. Evaluation
   8.1. Program Evaluation
        8.1.1. Correct Statistical Data
        8.1.2. Large Amount of Statistical Data
        8.1.3. Program Execution Speed
        8.1.4. Language Independence
        8.1.5. Internet Functionality
        8.1.6. Encoding
        8.1.7. Tokenization
   8.2. Error Classes
   8.3. Evaluation of Test Cases 1-5
   8.4. Program Extensions
        8.4.1. Possibility to Use More Databases at Once
        8.4.2. More Hybrid n-grams
        8.4.3. Integration of Rules
        8.4.4. New Program Logic: Combination of Statistics with Rules
   8.5. Evaluation of Upgraded Program

IV. Concluding Remarks

9. Conclusion

10. Future Work
    10.1. More Statistical Data
    10.2. Encoding
    10.3. Split Long Sentences
    10.4. Statistical Information About Words and Sentences
    10.5. Use n-gram Amounts
    10.6. Include More Rules
    10.7. Tagset that Conforms to the Requirements
    10.8. Graphical User Interface