
Masaryk University Faculty of Informatics

Interface for Integration of Language Checking Tools to Text Editing Software

Bachelor’s Thesis

Jan Tojnar

Brno, Spring 2018


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jan Tojnar

Advisor: RNDr. Adam Rambousek, Ph.D.


Acknowledgements

I would like to thank my advisor RNDr. Adam Rambousek, Ph.D. for his patience, and my parents for their support and proofreading.

Abstract

In this thesis, we propose a library that unifies various text checking tools behind a single interface for easier integration of grammar checking into applications. The library is modular and supports different providers; a grammar checking provider using LanguageTool and a spell checking provider using Enchant were developed as examples. Additionally, the AbiWord text editor was modified to use our library.

Keywords

grammar checking, spell checking, text editor integration, linguistic framework, freedesktop

Contents

1 Introduction

2 Overview of existing checkers
  2.1 Elixir
  2.2 Enchant
  2.3 Link Grammar
  2.4 LanguageTool
  2.5 After the Deadline
  2.6 Other checkers

3 Design of provider API
  3.1 Annotations
  3.2 Linking annotations to the text
  3.3 Scope of checked text
  3.4 Choosing module system

4 Provider implementation
  4.1 On choice of language
  4.2 Anatomy of provider
  4.3 Implementing a basic provider
  4.4 Implementing more providers

5 Library design
  5.1 Library API
  5.2 AbiWord integration

6 Conclusion and future work
  6.1 Conclusion
  6.2 Future work

A Source code

Bibliography

1 Introduction

To help prevent users from making mistakes, most software that allows composition of longer texts also integrates a spell checker. Spell checkers, however, usually only check whether each word of the text exists in their built-in dictionary, completely ignoring the word’s surroundings. Without accounting for the context of the word, many obvious mistakes are impossible to discover – especially for non-native speakers, non-word errors may actually account for only a small fraction of errors. [8, Table 2, p. 415] For that reason, advanced text processors often offer additional tools which can detect a wider variety of problems, including mistakes in grammar, typography or even style. These tools are commonly called grammar checkers.

Figure 1.1: Supporting common checkers in common applications would entail significant effort. (The diagram pairs applications such as AbiWord, … with checkers such as LanguageTool, …, Grammarly and enchant.)

There are many different checkers of varying quality¹ for different languages: AbiWord uses Link Grammar²; LibreOffice has plug-ins for After the Deadline³, LightProof⁴ and LanguageTool⁵; not to forget a plethora of proprietary services like Grammarly⁶ and Antidote⁷. What is more, the checkers offer distinct, mutually incompatible interfaces. On the other hand, we have applications. Each application could, in theory, implement a number of the most common checking libraries, but that would be far from an efficient use of resources, of which open-source applications often have a very limited amount. In practice, it would be simply infeasible.

This was one of the reasons why Enchant⁸, a common interface for spell checkers, was designed. Applications can request a dictionary for a certain language and Enchant will supply them with an abstraction of one of the checkers it supports. The support for checkers is provided by so-called providers. This allows application creators to target a single checker library and the authors of checker libraries to benefit from integration into many applications.

The goal of this thesis is the creation of a library integrating grammar checkers in a similar way to how Enchant integrates spell checkers. To continue with the naming scheme, we will be calling the library Patronus.

In the next chapter, we will compare the interfaces of selected open-source grammar checkers. Then we will design an interface for the providers integrating these checkers, and describe how the providers are authored. Finally, we will design the external library interface and cover AbiWord integration.

1. See this slightly dated comparison of grammar checkers: http://www.serenity-software.com/pages/comparisons.html
2. https://www.abisource.com/projects/link-grammar/
3. http://afterthedeadline.com/
4. https://code.launchpad.net/lightproof
5. https://languagetool.org/
6. https://grammarly.com/
7. http://www.antidote.info/en
8. https://abiword.github.io/enchant/

2 Overview of existing checkers

2.1 Elixir

In 2006, Elixir, a new library for integrating grammar checkers, was proposed. [15] It aimed to provide a unified interface for grammar checkers, similarly to how Enchant does for spell checkers. It was, however, never released.

2.2 Enchant

Enchant is a de facto standard when it comes to text checking libraries; it is widely used by open-source applications, especially in the GNOME and KDE desktop environments (via gspell¹ and Sonnet², respectively). Since it is so widely supported, extending it to check grammar might sound appealing. Unfortunately, it concerns itself with spell checking only, and its API allows checking just one word at a time. It also returns merely the information whether a given word exists in the dictionary (with the possibility to request suggestions separately). [9] Modifying the checking function to parse blocks of text and return detailed annotations with descriptions of the issues would result in a completely different and incompatible API.

2.3 Link Grammar

Link Grammar works on the sentence level, constructing every possible graph of relations between words in a given sentence. [10] A sentence is assumed not to be correct when no linkages can be found. It is not a grammar checker per se – it can only determine that a sentence is not grammatically correct; it cannot provide any information about the exact nature of the issue or suggest fixes.

1. https://wiki.gnome.org/Projects/gspell
2. https://api.kde.org/frameworks/sonnet/html/

2.4 LanguageTool

The LanguageTool HTTP API provides two endpoints: /languages for listing the names and codes of supported languages, and /check, which performs the checking itself. In addition to specifying the primary language, it allows setting the user’s native language to check for false friends³.

The API returns a set of annotations, each containing the position of the problem in the text, a short description of the problem, a longer explanation and a list of suggested replacements. Additionally, a wider context is provided; for example, the whole sentence would be provided for a subject–verb agreement error. Finally, there is the type of the rule that matched the problem, its general description and an identifier that can be used for disabling the rule. [3]

2.5 After the Deadline

After the Deadline has an API endpoint for checking grammar with spelling, one for checking grammar without spelling, and another for getting statistics of errors in a document. Additionally, there is an endpoint providing an HTML document describing an error in detail.

The response of checking includes the type and description of the error and a link to an HTML page with more details. Each error also has a list of suggested replacements sorted by relevance. Unlike LanguageTool’s API, After the Deadline does not provide the locations of errors. Instead, it lists the matched phrase and the preceding word(s); the annotations are added to the text by traversing the text and matching the errors from a queue. [2] This method can unfortunately lead to a misplaced annotation – consider the following text: “You and I are bad. I are bad.” The first instance of “I are” will be marked instead of the intended incorrect second occurrence.
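The misplacement is easy to reproduce with a naive matcher. The following sketch is not After the Deadline’s actual algorithm; naive_locate and all_occurrences are hypothetical helpers that only illustrate what happens when an error is reported as a phrase without a position and the client marks the first match it finds.

```rust
/// Hypothetical helper: with no position in the response, the client can
/// only guess, and the obvious guess is the first occurrence of the phrase.
/// Returns a byte offset into `text`.
fn naive_locate(text: &str, phrase: &str) -> Option<usize> {
    text.find(phrase)
}

/// Byte offsets of every occurrence, showing that the phrase is ambiguous.
fn all_occurrences(text: &str, phrase: &str) -> Vec<usize> {
    text.match_indices(phrase).map(|(index, _)| index).collect()
}
```

For the example sentence, the phrase “I are” occurs twice, so a position-less report cannot tell the correct “You and I are” apart from the incorrect “I are”.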

3. Words that look similar in two languages but actually mean different, or even opposite things.

2.6 Other checkers

Of the remaining checkers that were mentioned, Lightproof is not considered because it is not available as a standalone library, only as an extension for LibreOffice. Antidote does support multiple applications, but it is a commercial product for which no trial version is offered. Finally, Grammarly is an online service, but it does not provide a public API, [5] and while it could be reverse engineered, without stability guarantees it could stop working at any time.


3 Design of provider API

Since we will probably want to be able to install new providers on demand, and recompiling Patronus every time would not be very convenient, the providers will have to take the form of dynamically loaded modules. Generally, a provider receives a text and returns a list of annotations. Even here, there are many things to consider: the content of the annotations, the way they are paired with the text, the form the modules will take and the form of the communication between Patronus and the modules.

3.1 Annotations

As we saw in chapter 2, there are several types of checkers: [11] identifies rule-based checking, which we have seen in LanguageTool, and syntax-based checking used, for example, by Link Grammar. [12] also adds statistics-based checkers, which are not represented among the listed tools. The checkers will provide different information based on their underlying technology, and the content of the annotations will have to reflect that.

Syntax-based checkers, like Link Grammar, usually do not offer any information other than whether they consider the given text valid, but the user will probably want to know some details so that they can understand the problem and realise how to fix it, or be able to look the solution up; in this case, the provider will have to offer a generic message.

Rule-based grammar checkers, such as LanguageTool, usually provide a textual message describing the issue. Since the messages are unstructured text provided by the library, localization will be difficult, especially since some tools (e.g. LanguageTool) only provide the messages in the language they are checking. Since linguistic phenomena can be baffling even in one’s native language, the problem will only be accentuated in a language one has just started to learn – the tools will therefore need to be internationalized.

Sometimes additional information, including explanations, examples, or links, can be offered by the tool. Since this will vary widely among the tools, each provider should format it as an HTML page.


Applications could then request the details and render them in a WebView widget.

Statistical checkers are similar to the syntax-based ones in that they are also descriptive, not prescriptive. They work by calculating the probability of a given sequence of parts of speech or words; see, for example, [1]. For that reason, they cannot offer much in the way of explanations. Some statistical models could offer suggestions: for example, the trigram “He are tired” would be ranked with a low probability due to the low probability of “He” being followed by “are”; the checker could then suggest correcting it to “They are” or “He is”. A field denoting certainty or confidence could be added as well.

Each annotation should contain an id that will uniquely identify it among the annotations produced by a provider. This is useful for disabling a specific rule, debugging, or providing more precise feedback on the checker to its authors. This makes sense especially for rule-based checkers.

Finally, annotations could contain their type, based on what is considered wrong – spelling, grammar, typography, style or semantics. This would allow applications to reproduce the traditional copy-editing workflow. First, mechanical editing is carried out, during which spelling, capitalization, punctuation and other less subjective factors are checked and corrected. The second step, correlating parts, that is, checking cross-references and similar items, will probably be skipped, since it requires a more complex analysis of the text’s meaning. Next, language editing is focused on grammar; this is the principal task of grammar checkers and one of the more opinionated parts. The remaining steps – content editing and checking permissions – are specialised tasks that require a knowledge engine or, in the latter case, a database of granted permissions; grammar checking tools would therefore probably skip them.

3.2 Linking annotations to the text

In order to simplify correcting the issues, the annotations should be linked to the correct places in the text. The simplest solution would be returning the input text with the annotations inserted inline in the text using some kind of markup. The application could render the annotations just by parsing the markup and replacing the text with the rendered annotated text. For rich text editing, however, the checker would also have to handle the markup of the editor, or the editor would have to translate between plain text and rich text data and merge the annotations returned by the checker into its internal representation. If the text were edited in the meantime, the result of checking the original text would need to be discarded in order not to overwrite the changes. For longer texts, this would mean that not only would the text have to be passed from the application to the checker but also from the checker back to the application. This would be very inefficient, especially when the number of errors is comparatively low.

Alternatively, the annotations could contain the indexes of the first and last letter (or equivalently the length) of the matched expression. The application would need to be able to index the text in its internal data representation, but the reply overhead would be eliminated. The editor would then apply the annotations to its internal data structure just like rich text formatting. The indexing may cause issues with multi-byte Unicode characters: Patronus considers a character to correspond to a code point, whereas some applications may use byte-indexed strings. Providers can also normalize the strings internally, but the resulting annotation vector should use the indexes of the original string.
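An application that stores its text as UTF-8 has to translate code-point indexes into byte indexes before it can slice the text. The following is a minimal sketch of such a translation; byte_range is a hypothetical helper, not a function of Patronus.

```rust
use std::ops::Range;

/// Converts an annotation span given in Unicode code points
/// (`offset`, `length`) into a byte range usable for slicing a `&str`.
/// Returns `None` when the span reaches past the end of the text.
fn byte_range(text: &str, offset: usize, length: usize) -> Option<Range<usize>> {
    // Byte positions of every code point, followed by the end of the string.
    let mut bounds = text
        .char_indices()
        .map(|(index, _)| index)
        .chain(std::iter::once(text.len()));
    // Skip `offset` code points; the next boundary is where the span starts.
    let start = bounds.nth(offset)?;
    // Skip `length` more code points to find where the span ends.
    let end = if length == 0 {
        start
    } else {
        bounds.nth(length - 1)?
    };
    Some(start..end)
}
```

In “naïve café”, the word “café” starts at code point 6 but at byte 7, because “ï” occupies two bytes in UTF-8; slicing with the code-point offset directly would therefore mark the wrong segment.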

3.3 Scope of checked text

When designing the interaction between Patronus and the provider, we will have to decide on the granularity of the text that we will pass to the provider. Spell checkers work on single words, checking each in isolation and thus lacking the context. Since grammar is context-sensitive, this would not work; a wider context has to be provided. For simple syntax-based checkers, the optimal granularity is a sentence; more advanced checkers could require whole paragraphs (e.g. for checking paragraph coherence), or even multi-paragraph sections (e.g. for checking linking between paragraphs). For a global consistency check, the whole text would be needed. Obviously, this is problematic, especially with larger bodies of text like theses, since passing the whole content to the provider repeatedly, for each change, would be quite inefficient. This could be addressed by providers maintaining a replica of the application’s buffer and synchronizing it. Of course, this is a more complex use case and would require a separate API; for now, applications are recommended to use sentence or paragraph granularity.
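With paragraph granularity, the application has to remember where each paragraph starts so that annotation offsets, which are relative to the checked paragraph, can be mapped back to the document. A minimal sketch follows, assuming that paragraphs are separated by a blank line; both function names are hypothetical.

```rust
/// Splits `text` into paragraphs separated by a blank line and pairs each
/// paragraph with its starting offset, counted in Unicode code points.
fn paragraphs_with_offsets(text: &str) -> Vec<(usize, &str)> {
    let mut result = Vec::new();
    let mut offset = 0; // code points consumed so far
    for (index, chunk) in text.split("\n\n").enumerate() {
        if index > 0 {
            offset += 2; // account for the separator itself
        }
        if !chunk.is_empty() {
            result.push((offset, chunk));
        }
        offset += chunk.chars().count();
    }
    result
}

/// Maps an offset local to a paragraph back to the whole document.
fn to_document_offset(paragraph_start: usize, local_offset: usize) -> usize {
    paragraph_start + local_offset
}
```

After an edit, only the paragraph that changed needs to be sent to the checker again; the offsets of the other paragraphs merely have to be shifted.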

3.4 Choosing module system

Each provider could be an executable file with which Patronus would communicate through one of the Unix IPC methods like pipes. [4, Section 6.3] These methods were designed for the transfer of text or bytes, though, so we would have to devise a method for encoding the transferred annotations.

Higher-level IPC mechanisms like D-Bus use a more structured approach – D-Bus’s type system supports integers, strings, arrays and structs. [14] For example, the annotations returned by the check method would have the type s → a(uusuas). Native bindings could also be generated with gdbus-codegen¹ and GObject Introspection². A D-Bus round trip can, however, take hundreds of milliseconds, so it would make the already noticeably slow grammar checking even slower. [19] What is more, even though D-Bus has been ported to platforms other than its native one, it is not really common there.

Providers could also provide an HTTP API, as LanguageTool already does. Annotations could be serialized into JSON or XML, as is commonly done with HTTP APIs. With an OpenAPI definition, bindings could be generated using swagger-codegen. Just like D-Bus, HTTP suffers from communication overhead, though.

Finally, we could use shared libraries. They would allow us to work directly with native data structures and enjoy minimal latency. Additionally, this approach is quite portable, as shared libraries are available on most modern platforms. Shared libraries have their own set of caveats, though. Since the broker library and the providers can be compiled by different compilers or even from different languages, they need to agree on a common ABI³ they will use. Since languages like Rust often do not have a stable ABI, [6] the C ABI is used as the de facto lingua franca.

1. A tool that generates C code from D-Bus interface definitions: https://developer.gnome.org/gio/stable/gdbus-codegen.html
2. A tool that generates libraries that can be used by various programming languages from annotated C code: https://wiki.gnome.org/Projects/GObjectIntrospection
3. Application Binary Interface: among other things, function calling conventions and layout of data structures in memory [7, Section 3.1.1. Language Compatibility]


4 Provider implementation

As we described in the previous chapter, providers are shared libraries that should communicate through the C ABI. In this chapter, we will describe the API, as well as demonstrate how to implement a custom provider.

4.1 On choice of language

A traditional choice for writing a system library like Patronus would be C or modern C++. Recently, however, there have been quite a few languages targeting system programming. Since Patronus is meant to be used by other programs, it should be fairly performant. Due to the potential overhead of their runtime systems, Haskell, D and Go were ruled out. Rust is similar to C++ in being a low-level language focusing on zero-cost abstractions with some more advanced features like iterators and closures; in addition to those, it also has an actual module system, hygienic macros, abstract data types and a type system tracking value ownership¹. Rust’s package manager also provides access to a wide range of libraries.²

Patronus is written in Rust, but providers can be written in any language that supports building C-compatible shared libraries. For the convenience of real vectors and a string library, we will write our demonstration provider in Rust as well; the source code can also be found in the Patronus source tree under the providers/sample directory.

4.2 Anatomy of provider

A provider is a shared library exposing at least the following function:

patronus_provider_version()

1. https://doc.rust-lang.org/book/second-edition/ch04-00-understanding-ownership.html
2. A Rust package directory: https://crates.io/

pub extern "C" fn patronus_provider_version() -> libc::c_int;

The function announces the version of the API the provider supports, and based on the returned value, Patronus can decide whether it, in turn, supports the provider. Currently, the only API version is 1. In version 1, the API requires providers to have one additional symbol, the patronus_provider_init() function:

patronus_provider_init()

pub extern "C" fn patronus_provider_init() -> *mut Provider;

The function allocates and initializes a Provider structure and returns it as a pointer. It can store internal data in the data field.

Provider

#[repr(C)]
pub struct Provider {
    pub name: unsafe extern "C" fn() -> *const libc::c_char,
    pub check: unsafe extern "C" fn(
        props: *const Properties,
        text: *const libc::c_char,
        data: *mut libc::c_void,
    ) -> *mut AnnotationArray,
    pub free_annotations: unsafe extern "C" fn(*mut AnnotationArray),
    pub free_provider: unsafe extern "C" fn(*mut Provider),
    pub data: *mut libc::c_void,
}

Figure 4.1: Provider structure

name

The Provider structure contains a name of the provider – this could be used in the application for identifying the source of an annotation and in Patronus preferences for disabling providers.

check

Next, the check method is self-describing: when an application requests some text to be checked, Patronus passes the text, along with Properties, to each enabled provider through this method. Patronus will also pass in the data field, where the provider can store internal data it needs for its work. The function should return a vector of Annotations.

Properties

Properties is a structure allowing applications to control the providers. Currently, the only field is primary_language, which is a string containing an ISO 639-1 language code, optionally followed by an underscore and an ISO 3166-1 Alpha-2 country code. A provider should use this language to check the received text. If a provider does not support the given language, it should return an empty vector of annotations.
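The format of primary_language is simple enough to be taken apart by hand. The sketch below shows one way a provider might do it; parse_language_code is a hypothetical helper, and the strict requirements on letter case and length are an assumption of this example rather than something Patronus is known to enforce.

```rust
/// Splits a code such as "en_GB" into the language ("en") and the
/// optional country ("GB"). Returns `None` for malformed input.
/// Assumption: two lowercase ASCII letters for the language and
/// two uppercase ASCII letters for the country.
fn parse_language_code(code: &str) -> Option<(&str, Option<&str>)> {
    let mut parts = code.splitn(2, '_');
    let language = parts.next()?;
    let country = parts.next();

    let is_two_letters = |s: &str, upper: bool| {
        s.len() == 2
            && s.chars().all(|c| {
                if upper {
                    c.is_ascii_uppercase()
                } else {
                    c.is_ascii_lowercase()
                }
            })
    };

    if !is_two_letters(language, false) {
        return None;
    }
    match country {
        Some(c) if !is_two_letters(c, true) => None,
        other => Some((language, other)),
    }
}
```

A provider that does not distinguish between national varieties can then match on the language part alone and return an empty vector for anything it does not recognize.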

Annotation

The annotations were discussed in the Annotations section of the previous chapter. The returned annotation was largely based on the LanguageTool API, [3] as it had the richest structure. It contains the following fields:

• offset and length – As discussed in the Linking annotations to the text section of the previous chapter: the number of Unicode code points from the start of the checked text to the start of the annotation, and from the start of the annotation to its end, respectively.

• message – Unstructured message text. Providers should consider localizing it according to the system locale³.

• kind – Type of the annotation; it can contain one of the following enum values: Spelling, Grammar, Style, Typography and Suggestion. In the C header, they are available as uppercase constants prefixed by PATRONUS_ANNOTATION_KIND_.

• suggestions – Vector of strings, each intended to replace the whole annotated text segment.

Unlike LanguageTool, we do not return the rule details, since they could be obtained on demand.
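To make the role of the fields concrete, the following sketch applies a suggestion to a text. SimpleAnnotation and Kind are hypothetical, simplified stand-ins written with native Rust types; the structure that actually crosses the C ABI uses raw pointers instead of String and Vec.

```rust
/// Stand-in for the annotation kinds listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Spelling,
    Grammar,
    Style,
    Typography,
    Suggestion,
}

/// Hypothetical, simplified mirror of an annotation.
#[derive(Debug, Clone)]
struct SimpleAnnotation {
    offset: usize, // in Unicode code points
    length: usize, // in Unicode code points
    message: String,
    kind: Kind,
    suggestions: Vec<String>,
}

/// Replaces the whole annotated segment with the chosen suggestion.
/// Returns `None` when there is no suggestion with that index.
fn apply_suggestion(text: &str, ann: &SimpleAnnotation, choice: usize) -> Option<String> {
    let replacement = ann.suggestions.get(choice)?;
    let before: String = text.chars().take(ann.offset).collect();
    let after: String = text.chars().skip(ann.offset + ann.length).collect();
    Some(format!("{}{}{}", before, replacement, after))
}
```

Because every suggestion replaces the whole annotated segment, the application needs nothing but offset and length to carry out the correction.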

Vector representation

Providers use vectors in multiple places: the Provider’s check method returns a vector of Annotation objects, and they in turn contain a vector of suggestion strings. In C, vectors can be represented by a pointer to memory containing a NULL-terminated sequence of pointers to the contained structures. However, NULL-termination causes problems when a NULL pointer is inserted as one of the elements; for that reason, vectors are often represented by a structure remembering the length of the vector in addition to the contents.

The vector will travel over the ABI boundary without any problem, but when the annotations are no longer needed, they cannot simply be freed by the application – the application can use a different memory allocator than Patronus, and the provider can use yet another one. Just calling free will most likely fail.

One possible solution is having the provider expose a function for freeing arrays directly to the application, but the providers are an implementation detail and the high-level interface should not let the application communicate with the providers directly. Alternatively, Patronus could add another field to the vector structure containing a tag that would associate the vector with the provider module that created it; a freeing function in Patronus would then reach out to the provider module and call its free_vector function. Finally, we could add to the vector structure a pointer to the function for deallocating the vector structure; this method does not expose the whole provider module and is what we use in Patronus.⁴ The vector structures also contain an extra field, which the provider can use for storing extra metadata.⁵ There are currently two vector structures: PatronusSuggestionArray and PatronusAnnotationArray.

3. On Unix systems, refer to the locale(7) manual page.
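The idea of a vector that carries its own deallocator can be sketched as follows. IntArray, into_array and free_int_array are hypothetical names used only for this illustration, and the layout is an assumption; it is not the actual definition of PatronusSuggestionArray or PatronusAnnotationArray.

```rust
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts deallocations so that the example can show `free` really ran.
static FREED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical C-compatible vector that knows how to free itself.
#[repr(C)]
pub struct IntArray {
    pub data: *mut i32,
    pub len: usize,
    /// Extra metadata: Rust needs the capacity to rebuild the Vec.
    pub capacity: usize,
    /// Deallocator supplied by whoever allocated the array.
    pub free: unsafe extern "C" fn(*mut IntArray),
}

/// Hands a Rust vector over the ABI boundary as a raw pointer.
pub fn into_array(values: Vec<i32>) -> *mut IntArray {
    // Prevent Rust from freeing the buffer when `values` goes out of scope.
    let mut values = ManuallyDrop::new(values);
    Box::into_raw(Box::new(IntArray {
        data: values.as_mut_ptr(),
        len: values.len(),
        capacity: values.capacity(),
        free: free_int_array,
    }))
}

/// Rebuilds the Box and the Vec so that the allocator which created
/// them is also the one that releases them.
unsafe extern "C" fn free_int_array(ptr: *mut IntArray) {
    if ptr.is_null() {
        return;
    }
    unsafe {
        let array = Box::from_raw(ptr);
        drop(Vec::from_raw_parts(array.data, array.len, array.capacity));
    }
    FREED.fetch_add(1, Ordering::SeqCst);
}
```

The consumer never calls its own free on the pointer; it only invokes the function stored in the structure, so the memory is always returned to the allocator that produced it.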

4.3 Implementing a basic provider

In the previous section, we described the whole of the API related to providers. In this one, we will implement such a provider.

// Basically #include
#[macro_use]
extern crate patronus_provider;

// Import some names into the scope so we do not have to type
// qualified names all the time
use patronus_provider::*;
use std::borrow::Cow;
use std::ffi::CStr;
use std::os::raw::{c_char, c_int, c_void};

// Rust, like C++, uses name mangling by default. Here, we disable
// it and also force the function to use C calling conventions.
#[no_mangle]
pub extern "C" fn patronus_provider_version() -> c_int {
    // Simply return 1; Rust returns the last expression
    // in the function
    1
}

#[no_mangle]
pub extern "C" fn get_name() -> *const c_char {
    // We use a macro to convert a string to a C-style string
    // at compile-time and then return it.
    static_cstr!("Sample checker")
}

/// This is the main checking method – normally we would use
/// functions imported from some library but here, for simplicity,
/// we just create our own grammar checking library.
///
/// It finds all the occurrences of the “mistakes are good” string
/// in the input text and suggests correction to one of the contrary
/// statements.
fn check_text_english(text: Cow<str>) -> *mut AnnotationArray {
    // Here we first create an iterator with all occurrences
    // of the string and their indices, then we immediately
    // convert (map) them to an iterator of annotations
    let mistakes: Vec<Annotation> = text
        .match_indices("mistakes are good")
        .map(|(byte_offset, matched)| {
            // match_indices yields byte offsets, but annotations are
            // indexed by Unicode code points, so we convert here.
            let offset = text[..byte_offset].chars().count();
            let length = matched.chars().count();
            let suggestions = vec![
                static_cstr!("mistakes are never good"),
                static_cstr!("mistakes are bad"),
            ].into();
            Annotation {
                offset: offset,
                length: length,
                message: static_cstr!("Are you sure about mistakes being good?"),
                kind: AnnotationKind::Suggestion,
                suggestions: Box::into_raw(Box::new(suggestions)),
            }
        })
        .collect();
    // Boxing is Rust’s way of creating a pointer safely;
    // into_raw will then convert it to a raw C pointer.
    Box::into_raw(Box::new(mistakes.into()))
}

/// This is the function called by Patronus for checking text;
/// it handles the properties, calls the library function, and
/// usually converts the result to an annotation vector. Here, our
/// “library” is already producing the vector so we do not need to.
extern "C" fn check_text(
    props: *const Properties,
    text: *const c_char,
    _data: *mut c_void,
) -> *mut AnnotationArray {
    // Converts a C string into a Rust owned String.
    let lang_code = unsafe {
        CStr::from_ptr((*props).primary_language)
            .to_string_lossy()
            .into_owned()
    };
    // Same here but we do not need the ownership.
    let text = unsafe { CStr::from_ptr(text).to_string_lossy() };
    // We remove the country code, our checker is good for any English.
    let lang = lang_code
        .splitn(2, '_')
        .nth(0)
        .expect("not enough language code components");

    // This checker only knows English, so we return an empty vector
    // otherwise
    match lang {
        "en" => check_text_english(text),
        _ => Box::into_raw(Box::new(Vec::new().into())),
    }
}

unsafe extern "C" fn free_annotations(ptr: *mut AnnotationArray) {
    // Unboxing changes the ownership of the value to this scope,
    // so it is freed as soon as it is no longer used.
    let anns = Box::from_raw(ptr);
    for i in 0..anns.len {
        let ann = &*anns.data.offset(i as isize);
        Box::from_raw(ann.suggestions);
    }
}

unsafe extern "C" fn free_provider(ptr: *mut Provider) {
    assert!(!ptr.is_null(), "Trying to clean a NULL value");
    let _provider = Box::from_raw(ptr);
}

/// Initialize the provider with the functions
#[no_mangle]
pub extern "C" fn patronus_provider_init() -> *mut Provider {
    Box::into_raw(Box::new(Provider {
        name: get_name,
        check: check_text,
        free_annotations: free_annotations,
        free_provider: free_provider,
        // We do not really need to store anything here, so we just fill in null
        data: std::ptr::null_mut(),
    }))
}

// Some simple tests for the library
#[test]
fn test_english_single_mistake() {
    let text = "mistakes are good";
    let result = check_text_english(text.into());
    let length = unsafe { (*result).len };
    assert!(length == 1);
}

#[test]
fn test_english_multiple_mistakes() {
    let text = "Hello. It is true that mistakes are good. Mistakes are good, you know, mistakes are good!";
    let result = check_text_english(text.into());
    let length = unsafe { (*result).len };
    assert!(length == 2);
}

4. Instead of returning a struct, we could allocate a few extra bytes at the beginning of the memory area and store the metadata there. We would then return the &allocated_area + sizeof(metadata). This is how allocators store information about allocated memory, [13] though we would also need to implement a function for getting the length of the vector to be able to iterate on it.
5. For example, restoring a vector in Rust also requires the vector’s capacity: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.from_raw_parts

4.4 Implementing more providers

First we have decided to support Enchant. Since it is a C library, bindings had to be created first. In Rust, bindings consist of FFI⁶ declarations and wrapper code that allows for idiomatic usage of the library. By convention, the declarations are placed in a separate crate⁷ with the suffix -sys. The crate can also be generated automatically using the bindgen tool.⁸ The wrapper code should be idiomatic: it should use appropriate data structures like Vec instead of raw arrays and String instead of raw C strings, and use destructors for freeing resources after they are no longer needed.

LanguageTool has two different APIs: an HTTP API⁹ and a Java API¹⁰. While it would be possible to create bindings for the Java library using JNI¹¹ and Rucaja¹², the latter claims to have performance overhead as well as portability issues. For greater portability, we have decided to use the HTTP API. There exists a machine-readable API description¹³ that would allow using Swagger Codegen¹⁴ to generate the bindings. At the time of writing, Rust was not among the supported target languages, so the bindings had to be created manually.

6. Foreign Function Interface
7. Cargo (the Rust package manager) calls its packages crates.
8. https://github.com/rust-lang-nursery/rust-bindgen
9. https://languagetool.org/http-api/swagger-ui/
10. http://wiki.languagetool.org/java-api
11. Java Native Interface, an interface for calling Java functions from native code
12. Rust calls Java: https://github.com/kud1ing/rucaja
13. https://languagetool.org/http-api/languagetool-swagger.json
14. https://github.com/swagger-api/swagger-codegen


5 Library design

Unlike in Enchant, where only one dictionary is used at a time, Patronus returns the combination of results from all enabled providers. This allows combining multiple small specialised checkers. The providers themselves are loaded dynamically from a directory specified during compilation and from directories listed in the PATRONUS_PROVIDER_PATH environment variable. This allows using third-party checkers without the need to recompile the library or the applications depending on it.
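Assembling the list of directories to search can be sketched as follows. This is only an illustration under stated assumptions: that the variable uses the platform’s PATH-like separator and that the compiled-in directory is searched first. Neither is confirmed by the text above, and both function names are hypothetical.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Builds the provider search path: the directory fixed at compile time,
/// followed by the entries of a PATH-like variable value, if there is one.
fn provider_dirs(compiled_in: &str, env_value: Option<&OsStr>) -> Vec<PathBuf> {
    let mut dirs = vec![PathBuf::from(compiled_in)];
    if let Some(value) = env_value {
        // split_paths uses the platform separator (':' on Unix);
        // empty entries are dropped.
        dirs.extend(env::split_paths(value).filter(|path| !path.as_os_str().is_empty()));
    }
    dirs
}

/// Reads the environment variable named in the text.
fn provider_dirs_from_env(compiled_in: &str) -> Vec<PathBuf> {
    let value = env::var_os("PATRONUS_PROVIDER_PATH");
    provider_dirs(compiled_in, value.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the function easy to test without modifying the process environment.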

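[Figure: layered diagram, from top to bottom]
Applications: gspell, AbiWord, …
Library: patronus
Patronus providers: patronus::provider::enchant, patronus::provider::language_tool, patronus::provider::link_grammar, …
Checkers: enchant, LanguageTool, link_grammar
Enchant providers: enchant_provider_voikko, enchant_provider_ispell, …
Enchant back end: voikko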

Figure 5.1: The architecture of Patronus

The C bindings are partially generated using the cbindgen¹ tool.

5.1 Library API

The library interface itself is simple. It consists only of a constructor and a method for checking texts (Figure 5.2). To limit the use of global state, the checking method accepts a Properties object.

1. https://github.com/eqrion/cbindgen

impl Patronus {
    pub fn new() -> Self;
    pub fn check(&self, props: &Properties, text: String) -> Vec<Annotation>;
}

Figure 5.2: Functions of the Rust library
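To illustrate how a check method of this shape can combine the results of all enabled providers, consider the following minimal sketch. It is not the implementation of the library: the Provider trait, the fields of Annotation and Properties, the with_providers constructor and the two toy providers are hypothetical, and the text is passed as a string slice instead of a String to keep the example short.

```rust
/// Hypothetical, simplified stand-ins for the types of the provider API.
#[derive(Debug, Clone, PartialEq)]
pub struct Annotation {
    pub offset: usize,
    pub length: usize,
    pub message: String,
}

pub struct Properties {
    pub primary_language: String,
}

pub trait Provider {
    fn check(&self, props: &Properties, text: &str) -> Vec<Annotation>;
}

pub struct Patronus {
    providers: Vec<Box<dyn Provider>>,
}

impl Patronus {
    /// Hypothetical constructor; the real library discovers its providers itself.
    pub fn with_providers(providers: Vec<Box<dyn Provider>>) -> Self {
        Patronus { providers }
    }

    /// Asks every provider in turn and concatenates their annotations.
    pub fn check(&self, props: &Properties, text: &str) -> Vec<Annotation> {
        self.providers
            .iter()
            .flat_map(|provider| provider.check(props, text))
            .collect()
    }
}

/// Toy provider that reports every occurrence of a fixed pattern.
pub struct Pattern {
    pub needle: &'static str,
    pub message: &'static str,
}

impl Provider for Pattern {
    fn check(&self, _props: &Properties, text: &str) -> Vec<Annotation> {
        text.match_indices(self.needle)
            .map(|(offset, found)| Annotation {
                offset, // byte offset of the match in `text`
                length: found.len(),
                message: self.message.to_string(),
            })
            .collect()
    }
}
```

Because the annotations are simply concatenated in provider order, the caller sees one flat list and does not need to know which provider produced which annotation.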

The C library is analogous, except for the addition of clean-up functions (in Rust, clean-up is handled by destructors):

Patronus* patronus_create(void);
void patronus_free(Patronus* ptr);

PatronusAnnotationArray* patronus_check(Patronus* ptr, PatronusProperties const* props, char const* text);
void patronus_free_annotations(PatronusAnnotationArray* ptr);

Figure 5.3: Functions of the C language bindings

Since the library essentially merges the output of the providers, its API mirrors the provider API. The additional patronus_free_annotations function should be called once the annotations are no longer needed, for example, after the annotated text has been edited.
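The need for explicit clean-up functions follows from how a Rust value is typically handed over to C. The sketch below shows one common pattern, Box::into_raw paired with Box::from_raw; it is an assumption about how such bindings can be written, not the code of the library. The Patronus struct is an empty stand-in, the LIVE_INSTANCES counter exists only to make the effect observable, and accepting a null pointer in patronus_free is our own choice. Real bindings would additionally export the functions under unmangled names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts live objects; present only so that the sketch can be observed.
pub static LIVE_INSTANCES: AtomicUsize = AtomicUsize::new(0);

/// Empty stand-in for the library object.
pub struct Patronus {
    _private: (),
}

impl Drop for Patronus {
    fn drop(&mut self) {
        LIVE_INSTANCES.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Moves the object to the heap and gives up ownership to the C caller.
pub extern "C" fn patronus_create() -> *mut Patronus {
    LIVE_INSTANCES.fetch_add(1, Ordering::SeqCst);
    Box::into_raw(Box::new(Patronus { _private: () }))
}

/// Takes ownership back and runs the destructor.
/// The pointer must come from `patronus_create` and must not be used again.
pub unsafe extern "C" fn patronus_free(ptr: *mut Patronus) {
    if !ptr.is_null() {
        drop(unsafe { Box::from_raw(ptr) });
    }
}
```

Once the pointer has left Rust, the compiler can no longer tell when the object becomes unused, which is why the C API needs a free function where the Rust API relies on destructors.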

5.2 AbiWord integration

AbiWord has a modular design and offers many plug-ins. One of them adds support for grammar checking using the Link Grammar library. The plug-in checks each sentence in a document and marks the parts detected as incorrect. Since Patronus aims to replace all existing grammar checkers, the code of the plug-in was modified to call our library instead of Link Grammar. Because Link Grammar can neither give the reasons why a sentence is grammatically incorrect nor suggest corrections (see section 2.3), AbiWord does not have any interface for presenting this information to the user.


Figure 5.4: LanguageTool provider suggesting a correction in AbiWord


6 Conclusion and future work

6.1 Conclusion

In this thesis, we designed and implemented a library that allows applications to use third-party grammar checkers without having to add support for each of them separately. The thesis also serves as documentation for creating new grammar checking providers and for integrating Patronus into applications. Providers for LanguageTool and Enchant were developed, and AbiWord was modified to use Patronus for grammar checking.

6.2 Future work

Currently, Patronus does not offer any configuration management. Some providers might need to store user credentials, the location of API servers or other provider-specific configuration values. For example, the LanguageTool provider looks for a configuration file in the user's configuration directory during initialization. While it could check whether the configuration has changed before each request, doing so would not be very efficient. Patronus could listen for changes centrally, either by watching the configuration files or by using platform-specific settings storage. Additionally, it could provide a centralised configuration management tool, either as a standalone application or as a part of the system control centre.

At the moment, checkers are executed serially. Once more checkers are available, parallelization will become necessary. Futures, which are available as a library for Rust, are a convenient solution. [17]

Passing the whole text back and forth between the editor and the providers will be inefficient for any larger text. Patronus could send only the changed portions, but this would require smarter checkers.

During a discussion about a proposal for a new open-source Czech spell checking dictionary, an option for users to report incorrect annotations was cited as a useful feature. [16] Open-source spell checkers are often hidden in the background; having a direct way to contribute to the dictionary could improve them drastically. Before submitting a report, the user could edit the message context to protect their privacy.

Patronus support should also be added to more applications, libraries and languages. For the widest coverage, bindings for languages like Python and C++ should be created. Higher-level bindings for GLib would allow better asynchronous usage, as well as the creation of bindings via GObject Introspection. Implementing a native messaging host¹ and a WebExtension would easily allow the use of the library in modern web browsers, although we should aim for native usage for better integration. Modifying gspell would make grammar checking available to many applications in the GNOME desktop environment, as its widgets are largely self-contained; the lower-level interfaces, especially those dealing with Enchant, might need to be removed, though. [18]

Finally, more providers could be created, and the provider structure from Figure 5.1 could be flattened.
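The parallel execution of checkers mentioned above could look roughly as follows. Note that this sketch deliberately swaps in scoped threads from the standard library (std::thread::scope, available since Rust 1.63) in place of the futures library, so that it stays self-contained; it is not a proposal for the final design. The Checker trait, the check_parallel function and the toy Contains checker are hypothetical.

```rust
use std::thread;

/// Hypothetical, simplified checker interface. `Sync` is required so that
/// a checker can be shared with a worker thread by reference.
pub trait Checker: Sync {
    fn check(&self, text: &str) -> Vec<String>;
}

/// Runs every checker on its own thread and merges the results,
/// keeping the order in which the checkers were given.
pub fn check_parallel(checkers: &[Box<dyn Checker>], text: &str) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn all threads first, so that the checkers run concurrently.
        let handles: Vec<_> = checkers
            .iter()
            .map(|checker| scope.spawn(move || checker.check(text)))
            .collect();

        // Then wait for each of them and concatenate their messages.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("a checker panicked"))
            .collect()
    })
}

/// Toy checker that reports whether the text contains a given string.
pub struct Contains(pub &'static str);

impl Checker for Contains {
    fn check(&self, text: &str) -> Vec<String> {
        if text.contains(self.0) {
            vec![format!("found '{}'", self.0)]
        } else {
            Vec::new()
        }
    }
}
```

One thread per checker is the simplest possible scheme. For providers that mostly wait on network requests, such as an HTTP-based one, an asynchronous approach based on futures would likely use fewer resources.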

1. https://developer.mozilla.org/en-US/Add-ons/WebExtensions/Native_messaging

A Source code

The source code is stored in the appendices.tar.xz archive. It contains the following files:

The patronus directory contains the code of the library itself inside the patronus subdirectory, a set of struct definitions for implementing providers in the patronus-provider subdirectory, the initial set of providers in the providers subdirectory, and a procedural, C-compatible API in the bindings/c subdirectory.

The enchant-rs and languagetool-rs directories contain Rust bindings for Enchant and LanguageTool, respectively.

The 0001-Use-Patronus-for-grammar-checking.patch file contains a patch for the AbiWord editor to use Patronus. It was tested against SVN revision 35465.

Also attached, in the packaging directory, are Nix¹ expressions for easy installation of the libraries, as well as the patched AbiWord.

All sources can also be found on GitHub under the patronus-checker organization.²

1. https://nixos.org/nix/
2. https://github.com/patronus-checker


Bibliography

[1] Md Alam, Naushad UzZaman, Mumit Khan, et al. “N-gram based statistical grammar checker for Bangla and English”. In: (2007).
[2] Automattic. API Reference - Spell, Style, and Grammar Checker for WordPress, Firefox, TinyMCE, jQuery, and CKEditor. url: http://www.afterthedeadline.com/api.slp (visited on May 28, 2017).
[3] LanguageTool developers. LanguageTool HTTP API. url: https://languagetool.org/http-api/swagger-ui/ (visited on May 28, 2017).
[4] Sven Goldt et al. The Linux Programmer’s Guide. Version 0.4. 1995. url: http://www.tldp.org/LDP/lpg-0.4.pdf (visited on May 26, 2018).
[5] Grammarly. Sorry, no API. But, thanks for letting us know of your interest! May 9, 2018. url: https://twitter.com/Grammarly/status/994202262141571073 (visited on May 26, 2018).
[6] Steve Klabnik. enchant-provider.h. Jan. 20, 2015. url: https://github.com/rust-lang/rfcs/issues/600 (visited on May 28, 2018).
[7] Robert Krátký, Don Domingo, and Jacquelynn East. Red Hat Enterprise Linux 6. Oct. 20, 2017. url: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html-single/developer_guide/ (visited on May 26, 2018).
[8] Karen Kukich. “Techniques for Automatically Correcting Words in Text”. In: ACM Comput. Surv. 24.4 (Dec. 1992), pp. 377–439. issn: 0360-0300. doi: 10.1145/146370.146380. url: http://doi.acm.org/10.1145/146370.146380.
[9] Dom Lachowicz. enchant.h. Apr. 21, 2017. url: https://github.com/AbiWord/enchant/blob/66455864c6160ceb7bc83f3c97645a81bb770d12/src/enchant.h (visited on May 28, 2017).
[10] John Lafferty. The Link Parser Application Program Interface (API). Mar. 2, 2016. url: https://www.abisource.com/projects/link-grammar/api/index.html (visited on May 26, 2018).
[11] Ronaldo Teixeira Martins et al. “Linguistic issues in the development of ReGra: A grammar checker for Brazilian Portuguese”. In: Natural Language Engineering 4.4 (1998), pp. 287–307.
[12] Daniel Naber. “A rule-based style and grammar checker”. PhD thesis. Bielefeld: Technische Fakultät, Universität Bielefeld, Aug. 28, 2003. url: http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.pdf.
[13] Carlos O’Donell and DJ Delorie. MallocInternals – glibc wiki. Mar. 12, 2018. url: https://sourceware.org/glibc/wiki/MallocInternals?action=recall&rev=19 (visited on May 26, 2018).
[14] Havoc Pennington et al. D-Bus Tutorial. Version 0.5.0. Aug. 20, 2006. url: https://dbus.freedesktop.org/doc/dbus-tutorial.html (visited on May 26, 2018).
[15] Jacob R Rideout. Would you like to collaborate on grammar and style checking? Dec. 30, 2006. url: https://marc.info/?l=abiword-dev&m=116745754722835&w=2.
[16] Michal Stanke. Report z LinuxDays 2016. Oct. 11, 2016. url: http://lists.l10n.cz/pipermail/diskuze/2016-October/002244.html.
[17] Aaron Turon. Zero-cost futures in Rust. Aug. 11, 2016. url: https://aturon.github.io/blog/2016/08/11/futures/ (visited on May 29, 2017).
[18] Sébastien Wilmet. GspellEntry: gspell 1 Reference Manual. Mar. 10, 2018. url: https://developer.gnome.org/gspell/stable/GspellEntry.html (visited on May 26, 2018).
[19] Philip Withnall. D-Bus API Design Guidelines. Feb. 5, 2015. url: https://dbus.freedesktop.org/doc/dbus-api-design.html (visited on May 26, 2018).
