Interface for Integration of Language Checking Tools to Text Editing Software
Masaryk University
Faculty of Informatics

Interface for Integration of Language Checking Tools to Text Editing Software

Bachelor’s Thesis
Jan Tojnar
Brno, Spring 2018

Declaration
Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.
Jan Tojnar

Advisor: RNDr. Adam Rambousek, Ph.D.

Acknowledgements
I would like to thank my advisor RNDr. Adam Rambousek, Ph.D. for his patience, and my parents for their support and proofreading.

Abstract
In this thesis, we propose a library that unifies various text checking tools behind a single interface for easier integration of grammar checking into applications. The library is modular and supports different providers; a grammar checking provider using LanguageTool and a spell checking provider using Enchant were developed as examples. Additionally, the AbiWord text editor was modified to use our library.

Keywords
grammar checking, spell checking, text editor integration, linguistic framework, freedesktop

Contents
1 Introduction
2 Overview of existing checkers
  2.1 Elixir
  2.2 Enchant
  2.3 Link Grammar
  2.4 LanguageTool
  2.5 After the Deadline
  2.6 Other checkers
3 Design of provider API
  3.1 Annotations
  3.2 Linking annotations to the text
  3.3 Scope of checked text
  3.4 Choosing module system
4 Provider implementation
  4.1 On choice of language
  4.2 Anatomy of provider
  4.3 Implementing a basic provider
  4.4 Implementing more providers
5 Library design
  5.1 Library API
  5.2 AbiWord integration
6 Conclusion and future work
  6.1 Conclusion
  6.2 Future work
A Source code
Bibliography

1 Introduction

To help prevent users from making mistakes, most software that allows composition of longer texts also integrates a spell checker. Spell checkers, however, usually only check whether each word of the text exists in their built-in dictionary, completely ignoring the word’s surroundings. Without accounting for the context of a word, many obvious mistakes are impossible to discover; for non-native speakers in particular, non-word errors may account for only a small fraction of all errors. [8, Table 2, p. 415] For that reason, advanced text processors often offer additional tools that can detect a wider variety of problems, including mistakes in grammar, typography or even style. These tools are commonly called grammar checkers.

Figure 1.1: Supporting common checkers in common applications would entail significant effort. (Diagram labels: AbiWord, …, Firefox, gedit; LanguageTool, …, Grammarly, enchant.)

There are many different checkers of varying quality¹ for different languages: AbiWord uses Link Grammar²; LibreOffice has plug-ins for After the Deadline³, LightProof⁴ and LanguageTool⁵; and there is a plethora of proprietary services such as Grammarly⁶ and Antidote⁷. What is more, the checkers offer distinct, mutually incompatible interfaces. On the other hand, we have applications. Each application could, in theory, implement support for a number of the most common checking libraries, but that would be far from an efficient use of resources, of which open-source applications often have a very limited amount. In practice, it would simply be infeasible.

¹ See this slightly dated comparison of grammar checkers: http://www.serenity-software.com/pages/comparisons.html
² https://www.abisource.com/projects/link-grammar/
³ http://afterthedeadline.com/
⁴ https://code.launchpad.net/lightproof
⁵ https://languagetool.org/
⁶ https://grammarly.com/
⁷ http://www.antidote.info/en

This was one of the reasons why Enchant⁸, a common interface for spell checkers, was designed. Applications can request a dictionary for a certain language and Enchant will supply them with an abstraction of one of the checkers it supports. The support for checkers is provided by so-called providers. This allows application creators to target a single checker library and the authors of checker libraries to benefit from integration into many applications.

⁸ https://abiword.github.io/enchant/

The goal of this thesis is the creation of a library integrating grammar checkers in a similar way to how Enchant integrates spell checkers. To continue with the naming scheme, we will call the library Patronus.

In the next chapter, we will compare the interfaces of selected open-source grammar checkers. Then we will design an interface for the providers integrating these checkers and describe the implementation of the providers. Finally, we will design the external library interface and cover AbiWord integration.

2 Overview of existing checkers

2.1 Elixir
In 2006, Elixir, a new library for integrating grammar checkers, was proposed. [15] It aimed to provide a unified interface for grammar checkers, similar to what Enchant did for spell checkers. It was, however, never released.

2.2 Enchant
Enchant is a de facto standard when it comes to text checking libraries; it is widely used by open-source applications, especially in the GNOME and KDE desktop environments (via gspell¹ and Sonnet², respectively). Since it is so widely supported, extending it to check grammar might sound appealing. Unfortunately, it concerns itself with spell checking only, and its API allows checking just one word at a time. It also returns merely the information whether the given word exists in the dictionary (with the possibility to request suggestions separately). [9] Modifying the checking function to parse blocks of text and return detailed annotations with descriptions of the issues would result in a completely different and incompatible API.

¹ https://wiki.gnome.org/Projects/gspell
² https://api.kde.org/frameworks/sonnet/html/

2.3 Link Grammar
Link Grammar works on the sentence level, constructing every possible graph of relations between words in a given sentence. [10] A sentence is assumed to be incorrect when no linkages can be found. It is not a grammar checker per se: it can only determine that a sentence is not grammatically correct; it cannot provide any information about the exact nature of the issue or suggest fixes.

2.4 LanguageTool
The LanguageTool HTTP API provides two endpoints: /languages, which lists the names and codes of supported languages, and /check, which performs the checking itself. In addition to specifying the primary language, the /check endpoint allows setting the user’s native language in order to check for false friends.³
The API returns a set of annotations, each containing the position of the problem in the text, a short description of the problem, a longer explanation and a list of suggested replacements. Additionally, a wider context is provided; for example, the whole sentence would be provided for a subject-verb agreement error.
Finally, each annotation includes the type of the rule that matched the problem, the rule’s general description and an identifier that can be used for disabling the rule. [3]

³ Words that look similar in two languages but actually mean different, or even opposite, things.

2.5 After the Deadline
After the Deadline has an API endpoint for checking grammar with spelling, one for checking grammar without spelling, and another for getting statistics of errors in a document. Additionally, there is an endpoint providing an HTML document describing an error in detail.
The response to a check includes the type and description of each error and a link to an HTML page with more details. Each error also has a list of suggested replacements sorted by relevance. Unlike LanguageTool’s API, After the Deadline does not provide the locations of errors. Instead, it lists the matched phrase and the preceding word(s); the annotations are then added to the text by traversing it and matching the errors from a queue. [2] This method can unfortunately lead to a misplaced annotation. Consider the following text: “You and I are bad. I are bad.” The first instance of “I are” will be marked instead of the intended, incorrect second occurrence.

2.6 Other checkers
Of the remaining checkers that were mentioned, Lightproof is not considered because it is not available as a standalone library, only as an extension for LibreOffice. Antidote does support multiple applications, but it is a commercial product for which no trial version is offered. Finally, Grammarly is an online service, but it does not provide a public API; [5] while the service could be reverse engineered, without stability guarantees such an integration could stop working at any time.

3 Design of provider API

Since we will probably want to be able to install new providers on demand, and recompiling Patronus every time would not be very convenient, the providers will have to take the form of dynamically loaded modules.
Generally, a provider receives a text and returns a list of annotations. Even here, there are many things to consider: the content of the annotations, the way they are paired with the text, the form the modules will take and the form of the communication between Patronus and the modules.

3.1 Annotations
As we saw in chapter 2, there are several types of checkers: [11] identifies rule-based checking, which we have seen in LanguageTool, and syntax-based checking, which is used, for example, by link-grammar.
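
The exchange described above, in which a provider receives a text and returns a list of annotations, can be sketched as follows. This is only an illustration under stated assumptions, not the Patronus interface: the names Annotation and toy_provider, the choice of fields (modelled on what LanguageTool reports) and the single hard-coded rule are all hypothetical. The sketch also shows why the way annotations are paired with the text matters: an annotation carrying an explicit offset identifies the intended occurrence in the example from section 2.5, whereas searching for the matched phrase finds the earlier, correct one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """One reported problem. Hypothetical structure, not the Patronus API."""
    offset: int   # index of the first affected character in the text
    length: int   # number of affected characters
    message: str  # short description of the problem
    replacements: List[str] = field(default_factory=list)


def toy_provider(text: str) -> List[Annotation]:
    """Toy rule-based provider (an assumption made for illustration only).

    Flags every "I are" that is not part of a compound subject
    such as "You and I are".
    """
    phrase = "I are"
    annotations = []
    start = text.find(phrase)
    while start != -1:
        if not text[:start].endswith("and "):
            annotations.append(
                Annotation(start, len(phrase), "Subject-verb agreement", ["I am"])
            )
        start = text.find(phrase, start + 1)
    return annotations


if __name__ == "__main__":
    text = "You and I are bad. I are bad."
    for a in toy_provider(text):
        # The offset points at the second, incorrect occurrence.
        print(a.offset, repr(text[a.offset:a.offset + a.length]), a.replacements)
    # Locating the error by its phrase alone finds the first occurrence instead.
    print(text.find("I are"))
```

With explicit offsets, the application can mark exactly the characters the provider meant, at the cost of requiring the provider and the application to agree on how positions are counted.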