Tools to improve English text

June 16, 2020
This article was contributed by Martin Michlmayr

Open-source developers put a lot of emphasis on quality and have created many tools to improve source code, such as linters and code formatters. Documentation, on the other hand, doesn't receive the attention it deserves. LWN reviewed several grammar and style-checking tools back in 2016. It seems like a good time to evaluate progress in this area.

Spell checkers almost seem too basic to mention, but, given the number of typos I encounter in open-source documentation, they might warrant a brief look. One problem with technical texts is that English prose is often mixed with code or URLs that trigger spell checkers. Aspell offers several filter modes (including Markdown as of version 0.60.8), but Hunspell believes that other tools like editors should do the work to distinguish between code and text. This is where PySpelling comes in handy. It ships with filters that make it easy to run spell checkers on formatted text (such as Markdown) or programming languages (extracting docstrings from Python code). One aim of PySpelling is to embed spell checking into continuous integration (CI) systems, which is a laudable goal.
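The problem of code and URLs tripping spell checkers can be illustrated with a minimal sketch. It is not how PySpelling or Aspell implement their filters; the regular expressions and the function names `prose_only` and `words` are assumptions made for this example, and they only cover fenced code, inline code, and plain http(s) URLs in Markdown.

```python
import re

# Patterns for the parts of a Markdown document that are not prose.
# These are deliberately simple and will miss edge cases.
FENCED_CODE = re.compile(r"^```.*?^```[ \t]*$", re.DOTALL | re.MULTILINE)
INLINE_CODE = re.compile(r"`[^`\n]+`")
URL = re.compile(r"https?://\S+")


def prose_only(markdown: str) -> str:
    """Return the text with fenced code, inline code, and URLs blanked out."""
    text = FENCED_CODE.sub(" ", markdown)
    text = INLINE_CODE.sub(" ", text)
    return URL.sub(" ", text)


def words(markdown: str) -> list:
    """Split the remaining prose into candidate words for a spell checker."""
    return re.findall(r"[A-Za-z']+", prose_only(markdown))


if __name__ == "__main__":
    sample = "Run `pip install foo` and read https://example.org/xyzzy first."
    print(words(sample))
```

Only the words that survive the filter would be handed to the spell checker, so identifiers and URL fragments never reach it.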

Speaking of spell checking and editors, writing this article reminded me that, despite being a frequent user of spell checkers, I never configured spell checking in my editor of choice. Vim has had a built-in spell checker for a long time. One minor annoyance is that Vim relies on its own word list instead of using system-wide dictionaries from Aspell and Hunspell. With the following configuration, Vim underlines misspelled and unknown words in Markdown texts and Git commits:

    set spelllang=en
    autocmd FileType markdown setlocal spell
    autocmd FileType gitcommit setlocal spell

Even though I integrated several tools from this article into my workflow, I suspect this simple change will have the biggest impact. Similarly, Vim's Asynchronous Lint Engine (ALE) plugin (seen below) calls several grammar tools from this article while I'm typing. Given the noisy nature of some of these tools, it remains to be seen whether that actually leads to improvements in my writing or becomes an annoying distraction.

While spell checkers are useful, checking documentation can be time-consuming and some misspelled words are easily missed. The Debian package checker lintian ships with a list of frequently misspelled words that the script spellintian looks for. This is highly useful since there are almost no false positives (apart from the check for repeated words, which gets triggered by code and paragraphs that start with the same word as the preceding heading). In my experience, spellintian often finds some misspelled words even after running another spell checker. The Linux kernel has adopted a variant of this word list for its checkpatch.pl script and I see potential for integrating spellintian into CI workflows.
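The idea of checking against a fixed list of common misspellings, plus a repeated-word check, can be sketched in a few lines. This is not spellintian's implementation: the three-entry word list is a hypothetical excerpt for illustration, and the function name `check` is an assumption.

```python
import re

# Hypothetical excerpt for illustration; lintian's real list is far longer.
KNOWN_MISSPELLINGS = {
    "recieve": "receive",
    "seperate": "separate",
    "occured": "occurred",
}

WORD = re.compile(r"[A-Za-z]+")


def check(text: str) -> list:
    """Return (kind, word, suggestion) tuples for known typos and doubled words."""
    findings = []
    previous = None
    for match in WORD.finditer(text):
        word = match.group(0)
        lower = word.lower()
        if lower in KNOWN_MISSPELLINGS:
            findings.append(("typo", word, KNOWN_MISSPELLINGS[lower]))
        # Line breaks are ignored, so a heading followed by a paragraph
        # that starts with the same word is reported as a repetition.
        if previous == lower:
            findings.append(("repeated", word, None))
        previous = lower
    return findings


if __name__ == "__main__":
    print(check("We recieve the the data"))
    print(check("Installation\n\nInstallation is easy"))
```

Because only listed misspellings are reported, the typo check has almost no false positives. The second call shows how a check that ignores line breaks produces the heading-related false positive.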

Beyond words

Words are just the basic building blocks; what matters is how we put them together. One problem with many linters for prose is that they make it difficult to distinguish between objective mistakes (like genuine grammar issues) and subjective matters concerning writing style. One frequent mistake of the former type is the incorrect use of indefinite articles (a and an). While there's a clear rule for these, a simple regular expression cannot be used to find mistakes because the correct indefinite article depends on the pronunciation of a word rather than how it's written. I started investigating whether the Natural Language Toolkit (NLTK) could help one to figure that out. But then I saw that Jakub Wilk had that idea years ago when he created anorack (seen below), which uses eSpeak to break words into phonemes (units of sound) to identify incorrect usage of indefinite articles.
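A short sketch shows why a rule based on spelling alone falls short for indefinite articles. It does not use eSpeak and is not how anorack works; the word list and its expected articles are hand-written examples following standard English usage, and the function names are assumptions.

```python
VOWEL_LETTERS = "aeiou"


def article_by_spelling(word: str) -> str:
    """Naive rule: pick the article from the first letter alone."""
    return "an" if word[0].lower() in VOWEL_LETTERS else "a"


# Hand-written examples; the expected article depends on the first sound.
EXPECTED = {
    "apple": "an",
    "banana": "a",
    "hour": "an",        # silent "h", so the word starts with a vowel sound
    "university": "a",   # starts with a "y" sound
    "one-off": "a",      # starts with a "w" sound
    "SSH": "an",         # spelled out, it starts with "ess"
}


def spelling_rule_failures() -> list:
    """Return the words for which the naive rule picks the wrong article."""
    return [w for w, art in EXPECTED.items() if article_by_spelling(w) != art]


if __name__ == "__main__":
    print(spelling_rule_failures())
```

Four of the six words defeat the letter-based rule, which is why a tool needs pronunciation data rather than a pattern over the written form.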

The Grumpy Editor reviewed proselint in 2016 and compared it to "one of the world's worst elementary-school teachers criticizing you in front of the entire class about irrelevant details". I wanted to check whether the teacher had matured, but, alas, the project has been inactive since 2018. There are several contenders for the nagging teacher position, though. One is alex, whose aim is to point out insensitive and inconsiderate writing. While alex is useful in some cases (changing "chairman" to "chair" or "chairperson" is a good improvement that doesn't introduce unnatural language, for example), I found the tool too noisy. It complained about "simple math" and "invalid characters" in a technical manual (although "basic math" might indeed be an improvement). Computers are good at pattern matching but language is all about context. The documentation observes that "alex isn't very smart" and I tend to agree. (Interestingly, alex doesn't find that phrase offensive.)

Another tool is write-good, which flags "weasel" words (like "very") and passive voice. LWN previously looked at writegood-mode in the context of Emacs. Personally, I didn't find the feedback from write-good particularly useful, but opinions differ when it comes to passive voice.
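Checks for weasel words and passive voice are typically pattern-based, which a rough sketch can make concrete. This is not write-good's actual code: the word lists are small illustrative samples, and the heuristic (a form of "to be" followed by a word ending in "ed" or a listed irregular participle) is an assumption chosen for the example.

```python
import re

# Small illustrative lists; real tools ship much longer ones.
WEASEL_WORDS = ["very", "fairly", "quite", "extremely", "several", "clearly"]
IRREGULAR_PARTICIPLES = ["written", "done", "made", "found", "seen", "known"]

WEASEL = re.compile(r"\b(" + "|".join(WEASEL_WORDS) + r")\b", re.IGNORECASE)
PASSIVE = re.compile(
    r"\b(am|is|are|was|were|be|been|being)\s+"
    r"(\w+ed|" + "|".join(IRREGULAR_PARTICIPLES) + r")\b",
    re.IGNORECASE,
)


def weasels(sentence: str) -> list:
    """Return the weasel words found in the sentence."""
    return [m.group(0) for m in WEASEL.finditer(sentence)]


def passives(sentence: str) -> list:
    """Return phrases that look like passive voice to the heuristic."""
    return [m.group(0) for m in PASSIVE.finditer(sentence)]


if __name__ == "__main__":
    print(weasels("This is a very simple fix"))
    print(passives("The bug was fixed by the maintainer"))
    print(passives("She is excited"))  # flagged, though "excited" is an adjective
```

The last call shows the kind of noise such a heuristic produces: it cannot tell a participle from an adjective without knowing the context.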

RedPen looks like a credible alternative to proselint. It specifically mentions technical documentation and has support for several common markup formats, including Markdown, Textile, AsciiDoc, reStructuredText, and LaTeX. Installation seemed tricky at first. I couldn't find a Debian or RPM package, no Flatpak, and the Snap image is from 2016 (while the latest release is from earlier this year).

As I was waiting for the download of the 150MB file, I found an online instance into which text can be copied. The online version is particularly useful since it makes it easy to disable checks. One does not need to learn about the configuration file but can simply click some check boxes. Obviously, the online instance is not a solution if the text is private or has too many embarrassing mistakes to copy it to a random web site. I also found PyRedPen, which is a set of Python scripts that allow sending a file to this online instance of RedPen for analysis. After installing PyRedPen with pip, I ran it as follows:

    redpen-flymake -e -m guide.md

That gave me a quick, but somewhat dismaying result: over 600 problems in a 3,500 word document.

When my download of RedPen completed, I discovered that there was no need to worry about the lack of Linux packages. RedPen is mostly JAR files, some configuration files and some scripts, including bin/redpen which I ran on my example text with the -f markdown option. Looking at the 600 results, I noticed that excluding spelling issues resulted in a more manageable 280 results. Several suggestions were genuinely useful. For example, RedPen highlights phrases that are repeated within a sentence. While this results in some false positives, it can be helpful, such as pointing out that "a lot" is used multiple times in a sentence. RedPen also pointed out that 35% of sentences in another document I checked started with the same phrase, which is a good area for improvement. Even though RedPen supports markup formats, the checks are not intelligent enough. For example, a technical guide with the code element SUM(COST(position)) resulted in two warnings: that the "word" ")" is repeated twice and that "parenthesized sentences are nested too deeply". Overall, though, RedPen looks promising and I might be able to improve the results by turning off some checks.
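Two of the checks described above, phrases repeated within a sentence and sentences that start the same way, are easy to approximate. The sketch below is not RedPen's implementation; the naive sentence splitter, the restriction to two-word phrases, and all function names are assumptions.

```python
import re
from collections import Counter


def sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def repeated_bigrams(sentence: str) -> list:
    """Return two-word phrases that occur more than once in a sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    counts = Counter(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, n in counts.items() if n > 1]


def most_common_start(text: str) -> tuple:
    """Return the most frequent first word and the share of sentences using it.

    Assumes the text contains at least one sentence.
    """
    firsts = [s.split()[0].lower() for s in sentences(text)]
    word, n = Counter(firsts).most_common(1)[0]
    return word, n / len(firsts)


if __name__ == "__main__":
    print(repeated_bigrams("There is a lot of code and a lot of documentation."))
    print(most_common_start("The tool is fast. The tool is noisy. It works. The end."))
```

The tokenizer here keeps only letters and apostrophes, so punctuation never counts as a word. A tokenizer that treats punctuation as words would report noise on code elements, much like the repeated ")" warning described above.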

A previous LWN article looked at LanguageTool. It supports more than 20 languages, which sounds impressive given that rules are language specific. There are also plugins, including one for LibreOffice. LanguageTool comes in the form of several JAR files and it was easy to simply launch languagetool-commandline.jar. The first time I ran the tool, the system performed as if I had started Chromium and Firefox at the same time; resource use has been fine in subsequent runs, although the tool is a bit slow.

LanguageTool does not support markup formats and almost all warnings were related to running the tool on Markdown text. Unfortunately, even the text version resulted in a lot of noise, such as complaints about words containing an underscore character (which is common for identifiers like variable names). Nevertheless, LanguageTool noticed a number of genuine problems in the text. It highlighted a sentence that started with a conjunction, which should have been separated from the previous sentence with a comma. It pointed out a few words that are normally spelled with a hyphen.

The true power of LanguageTool lies in its data sets. As mentioned above, language depends on context. The optional 8GB data set for English allows LanguageTool to find additional errors, such as suggesting that the last word in "Don't forget to put on the breaks" should be "brakes". As with RedPen, I consider LanguageTool to be an interesting possibility for integration into one's writing process. Both tools seem too noisy for constant usage, but they are good for spot checks after writing a text.

Open-source uniformity

Finally, I discovered Vale, which is an interesting tool because it doesn't focus on grammatical correctness per se, but on the style of the writing, instead. Vale's introductory post shows individual authors using tools like proselint and write-good to improve their writing in a collaborative document, while Vale ensures that the organization-specific style, tone, and branding are followed.

The beauty of open source is its collaborative nature that is based on input from many contributors. Unfortunately, this often leads to inconsistencies in documentation, such as different writing styles or capitalization. Vale allows defining a particular style guide and ensures that it's followed. Vale offers style rules for the Microsoft Writing Style Guide and the Google developer documentation style guide. A YAML file format makes it easy to customize the configuration.

Vale is also smart about markup formats. While most tools use their knowledge of syntax to ignore certain parts of a text (like code elements), Vale is syntax-aware and allows applying rules to certain syntactical constructs. For example, Vale can be used to ensure that all words in a heading start with an upper-case letter. In addition to markup support and extensibility, Vale highlights performance as a key feature, which my initial tests confirm.
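Scoping a rule to one syntactical construct can be sketched as follows. This is not Vale's rule format (Vale rules are written in YAML) and not its implementation; it is a Python approximation that only handles "#"-style Markdown headings and fenced code, and the function name is an assumption.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def uncapitalized_headings(markdown: str) -> list:
    """Return (line_number, heading) pairs where some word starts lower-case."""
    problems = []
    in_code = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.startswith("```"):
            in_code = not in_code  # skip fenced code, where '#' is a comment
            continue
        match = None if in_code else HEADING.match(line)
        if match:
            title_words = match.group(2).split()
            if any(w[0].isalpha() and not w[0].isupper() for w in title_words):
                problems.append((number, match.group(2)))
    return problems


if __name__ == "__main__":
    doc = "# Getting Started\n\nsome text here\n\n## beyond words\n"
    print(uncapitalized_headings(doc))
```

The rule applies only to headings: body text in lower case is left alone, and a "#" comment inside a code block is not mistaken for a heading.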

Checking prose is difficult, partly because human language is complex and depends on the context, but also partly because there's a lot of disagreement on what is "right" and "wrong". The tools presented here offer some insightful feedback, but all too often the results are so noisy that it takes a lot of time to find what's genuinely relevant. I have incorporated spellintian and anorack into my toolbox and run them on pretty much every repository I download from GitHub.

Of the grammar tools, I intend to make some use of RedPen and LanguageTool. After getting constant, noisy feedback in my editor thanks to Vim's ALE plugin, which runs these tools as I write, my conclusion is that the feedback is too distracting. A better approach is to run them as spot checks after producing a first draft. Vale looks very promising for improving consistency within open-source documentation and it's a tool I'd like to explore in more detail.

Over the last few years, we've seen significant progress in terms of checking code quality before contributions are accepted. These days, a commit is rarely followed by another commit within a few minutes to fix a build issue. I'd like to see similar systems that allow reliable checks for documentation: typos, grammar issues, white space, and style issues. They have to give reliable feedback on agreed-upon issues so they can be integrated into continuous integration systems instead of producing a bunch of noise that is largely ignored.