
Diacritics Restoration Using Neural Networks

Jakub Náplava, Milan Straka, Pavel Straňák, Jan Hajič
Institute of Formal and Applied Linguistics, Charles University, Faculty of Mathematics and Physics
Malostranské náměstí 25, Prague, Czech Republic
{naplava,straka,stranak,hajic}@ufal.mff.cuni.cz

Abstract

In this paper, we describe a novel combination of a character-level recurrent neural-network based model and a language model applied to diacritics restoration. In many cases in the past, and still at present, people often replace characters bearing diacritics with their ASCII counterparts. Although the resulting text is usually easy for humans to understand, it is much harder for further computational processing. This paper opens with a discussion of the applicability of diacritics restoration in selected languages. Next, we present a neural network-based approach to diacritics generation. The core component of our model is a bidirectional recurrent neural network operating at the character level. We evaluate the model on two existing datasets covering four European languages. When combined with a language model, our model reduces the error of current best systems by 20% to 64%. Finally, we propose a pipeline for obtaining consistent diacritics restoration datasets for twelve languages and evaluate our model on it. All the code is available under an open-source license at https://github.com/arahusky/diacritics_restoration.

Keywords: neural networks, diacritics, diacritics generation, error correction

1. Introduction

When writing emails, tweets or texts in certain languages, people for various reasons sometimes write without diacritics. When doing so, they replace characters with diacritics (e.g. with an acute or a caron) by the underlying basic character without diacritics. Practically speaking, they write in ASCII. We offer several possible reasons for this phenomenon:

• Historically, many devices offered only an English keyboard and/or ASCII encoding (for example older mobile phones and SMS).

• Before Unicode became widespread, there were encoding problems among platforms and even among programs on the same platform, and many people still have this in mind.

• Even though text encoding is rarely a problem any more and all modern devices offer native keyboards, some problems persist. In situations of frequent code switching between English and a language with a substantially different keyboard layout, it is very hard to touch type in both layouts. It is much easier to type both languages using the same layout, although one of them without proper diacritics.

• In some circumstances, typing with diacritical marks is significantly slower than using just basic Latin characters. The most common example is on-screen keyboards on mobile devices. These keyboards do not include the top row (numerical on US English), so languages that use that row for accented characters are much slower to type. Naturally, users type without explicit accents and rely on auto-completion systems. However, these systems are usually simple and unigram-based, and depending on the word form ambiguity of a given language (cf. Table 1) they introduce many errors. Postponing the step of diacritics generation would be beneficial both for typing speed and accuracy.

• For example in Vietnamese, the language with most diacritics in our data (cf. Table 1), both the above problems are very pronounced: because Vietnamese uses diacritical marks to distinguish both tones (6) and quality of vowels (up to 3), a vowel can have (and often has) 2 marks. This need to provide efficient typing of very many accented characters led to the invention of systems like unikey.org that allow a user to type all the accented characters using sequences of basic letters. For instance, to typeset “đường” a user types “dduwowngf”. While this system elegantly solves the problems above with switching keyboard layouts and the missing top row of keys, it requires a special software package and it still takes 9 keystrokes to type 5 characters. That is why typing without accents in informal situations like emails or text messages is still common, and a system for efficient generation of diacritics would be very useful.

Typical languages where approximately half of the words contain diacritics are Czech, Hungarian or Latvian. Nevertheless, as we discuss in Sections 2 and 3, diacritics restoration (also known as diacritics generation or diacritization) is an active problem also in many languages with a substantially lower occurrence of diacritics.

Current approaches to restoration of diacritics (see Section 3) are mostly based on traditional statistical methods. However, in recent years, deep neural networks have shown remarkable results in many areas. To explore their capabilities, we propose a neural network-based model in Section 4 and evaluate its performance on two datasets in Section 5.

In Section 6, we describe a way to obtain a consistent multilingual dataset for diacritics restoration and evaluate our model on it. The dataset can be downloaded from the published link. Finally, Section 7 concludes this paper with a summary of outcomes.

2. Diacritics Restoration in Languages using Latin Script

Table 1 presents languages using (usually some extended version of) a Latin script. Employing UD 2.0 (Nivre et al., 2017) plain text data, we measure the ratio of words with diacritics, omitting languages with less than 5% of words with diacritics. In eleven of the languages, at least every fifth word contains diacritics; in another eleven languages, at least every tenth word does.

Naturally, a high occurrence of words with diacritics does not imply that generating diacritics is an ambiguous task. Consequently, we also evaluate the word error rate of a simple dictionary baseline for diacritics restoration: from large raw text corpora we construct a dictionary of the most frequent variant with diacritics for a given word without diacritics, and use the dictionary to perform the diacritics restoration.

Table 1 presents the results. We utilized the raw corpora by Ginter et al. (2017) released as supplementary material of the CoNLL 2017 UD Shared Task (Zeman et al., 2017), which contain circa a gigaword for each language, therefore providing a strong baseline. For nine languages, the word error rate is larger than 2%, and eight more languages have a word error rate still above 1%. We conclude that even with a very large dictionary, diacritics restoration is a challenging task for many languages, and a better method is needed.

Language             Words with diacritics   Word error rate of dictionary baseline
Vietnamese           88.4%                   40.53%
Romanian             31.0%                   29.71%
Latvian              47.7%                    8.45%
Czech                52.5%                    4.09%
Slovak               41.4%                    3.35%
Irish                29.5%                    3.15%
French               16.7%                    2.86%
Hungarian            50.7%                    2.80%
Polish               36.9%                    2.52%
Swedish              26.4%                    1.88%
Portuguese           13.3%                    1.83%
Galician             13.3%                    1.62%
Estonian             19.7%                    1.41%
Spanish              11.3%                    1.28%
Norwegian-Nynorsk    12.8%                    1.20%
Turkish              30.0%                    1.16%
Catalan              11.1%                    1.10%
Slovenian            14.0%                    0.97%
Finnish              23.5%                    0.89%
Norwegian-Bokmaal    11.7%                    0.79%
Danish               10.2%                    0.69%
German                8.3%                    0.59%
Croatian             16.7%                    0.34%

Table 1: Analysis of the percentage of words with diacritics and the word error rate of a dictionary baseline. Measured on UD 2.0 data, using the uninames method, and on the CoNLL 17 UD shared task raw data for the dictionary. Only words containing at least one alphabetical character are considered.

¹ What constitutes a “diacritical mark” is a bit of a problem. On one hand, not all characters with a graphical element added to a letter contain diacritics, e.g. ¥ (the symbol of the Japanese yen) or Ð/ð (Icelandic “eth”). On the other end of the spectrum we have clear diacritics with a Unicode canonical decomposition into a letter and a combining mark. Between these clear borders there are characters that do not have a Unicode decomposition, but their names still indicate that they are Latin letters with some modifier, and they are often used the same way as characters that do have a decomposition. E.g. Norwegian/Danish ø is used exactly like ö in Swedish; it is just an orthographic variation. However, while the latter has a canonical decomposition in Unicode, the former does not. This is why we opted to treat these characters also as “letters with diacritics”.

2.1. Methods of Diacritics Stripping

Although there is no standard way of stripping diacritics, a commonly used method is to convert the input word to NFD (The Unicode Consortium, 2017, Normalization Form D), which decomposes composite characters into a base character and a sequence of combining marks, then remove the combining marks, and convert the result back to NFC (Unicode Normalization Form C). We dub this method uninorms.

We however noted that this method does not strip diacritics for some characters (e.g. for đ and ł). We therefore propose a new method, uninames, which operates as follows: in order to remove diacritics from a given character, we inspect its name in the Unicode Character Database (The Unicode Consortium, 2017). If it contains the word WITH, we remove the longest suffix starting with it, try looking up a character with the remaining name, and yield that character if it exists.

The method is illustrated in Table 2, which presents four characters that do not decompose under NFD, but whose diacritics can be stripped by the proposed method. As shown in Table 3, the proposed uninames method recognizes all characters the uninorms method does, and some additional ones. Therefore, we employ the uninames method to strip diacritics throughout the paper.

Letter   Hex code   Unicode name
ø        00F8       LATIN SMALL LETTER O WITH STROKE
ł        0142       LATIN SMALL LETTER L WITH STROKE
đ        0111       LATIN SMALL LETTER D WITH STROKE
ʂ        0282       LATIN SMALL LETTER S WITH HOOK
ç        00E7       LATIN SMALL LETTER C WITH CEDILLA
š        0161       LATIN SMALL LETTER S WITH CARON

Table 2: Unicode characters that cannot be decomposed using NFD (first 4 lines), and those that can. The suffix of the name removed by the uninames method is the part starting with WITH. As we can see, the structure of the names is identical, so the method works for all of these characters.
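Both stripping methods can be expressed compactly with Python's standard unicodedata module. The following is a minimal sketch (the function names are ours, not taken from the released code) of the uninorms normalization-based stripping and the uninames name-based stripping described above:

import unicodedata

def strip_uninorms(text):
    # "uninorms": NFD-decompose, drop combining marks, recompose to NFC.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)

def strip_uninames(char):
    # "uninames": cut the suffix starting at " WITH" from the character's
    # Unicode name and look up a character with the remaining name.
    try:
        name = unicodedata.name(char)
    except ValueError:            # character has no Unicode name; keep it
        return char
    if " WITH " not in name:
        return strip_uninorms(char)
    try:
        return unicodedata.lookup(name[:name.index(" WITH ")])
    except KeyError:              # no such base character; fall back to NFD
        return strip_uninorms(char)

print(strip_uninorms("đ"), strip_uninames("đ"))   # prints: đ d

The example reproduces the case from Table 2: đ has no canonical decomposition, so the NFD-based method leaves it untouched, while the name-based method maps LATIN SMALL LETTER D WITH STROKE to LATIN SMALL LETTER D.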
3. Related Work

One of the first papers to describe systems for automatic diacritics restoration is a seminal work by Yarowsky (1999), who compares several algorithms for the restoration of diacritics in French and Spanish. Later, models for diacritization in Vietnamese (Nguyen and Ock, 2010), Czech (Richter et al., 2012), Turkish (Adali and Eryiğit, 2014), Arabic (Azmi and Almajed, 2015), Croatian, Slovenian and Serbian (Ljubešic et al., 2016), and many other languages were published. The system complexity ranges from the simplest models, which for each word apply its most frequent translation as observed in the training data, to models that incorporate language models, part-of-speech tags, morphological and many other features. One of the models most similar to ours is the system by Belinkov and Glass (2015), who used recurrent neural networks for Arabic diacritization.

Lowercased letters with diacritics
í 15.7%    ů 2.5%    ș ī ņ ō
á 11.7%    ú 1.6%    ā ē ū ŕ
é  9.8%    ö 1.5%    ñ ą ï ĺ
ě  6.6%    ă 1.3%    ł ò ļ ţ
ä  6.0%    ø 1.1%    ć ż ń ì
č  5.5%    à 0.9%    ň ğ ù ό
ř  5.2%    ç 0.9%    â ś û ḥ
ž  4.9%    ü 0.8%    õ ď ű ǎ
ý  4.5%    ã 0.8%    ť ô ķ ί
š  4.4%    è 0.6%    ê đ ģ έ
ó  3.2%    î 0.5%    ę ő ź ά
å  2.9%    ț 0.5%    ş ľ ë ʂ

Table 3: Most frequent characters with diacritics from the data listed in Table 1, together with their relative frequency. The bold characters are recognized only using the uninames method.

Figure 1: Visualisation of our model.

4. Model Architecture

The core of our model (see Figure 1) is a bidirectional recurrent neural network, which for each input character outputs its correct label (e.g. its variant with diacritics). The input and output vocabularies contain a special out-of-alphabet symbol.

The input characters are embedded, i.e. each character in the input sentence is represented by a vector of d real numbers. The character embeddings are initialized randomly and updated during training.

The embeddings are fed to a bidirectional RNN (Graves and Schmidhuber, 2005). The bidirectional RNN consists of two unidirectional RNNs, one reading the inputs in standard order (forward RNN) and the other in reverse order (backward RNN). The output is then a sum of the forward and backward RNN outputs. This way, the bidirectional RNN processes information from both the preceding and the following context. The model allows an arbitrary number of stacked bidirectional RNN layers.

The output of the (possibly multilayer) bidirectional RNN is at each time step reduced by an identical fully connected layer to an o-dimensional vector, where o is the size of the output alphabet. A nonlinearity is then applied to these reduced vectors. Finally, we use a softmax layer to produce a probability distribution over the output alphabet at each time step. The loss function is the cross-entropy loss summed over all outputs.

4.1. Residual connections

The proposed model allows an arbitrary number of stacked RNN layers. A model with multiple layers allows each stacked layer to process a more complex representation of the current input, which naturally brings the potential to improve the accuracy of the model.

As stated by Wu et al. (2016), simple stacking of more RNN layers works only up to a certain number of layers. Beyond this limit, the model becomes too difficult to train, which is most likely caused by the vanishing and exploding gradient problems (Pascanu et al., 2013). To improve the gradient flow, Wu et al. (2016) incorporate residual connections into the model. To formalize this idea, let RNN_i be the i-th RNN layer in the stack and x_0 = (inp_1, inp_2, ..., inp_N) the input to the first stacked RNN layer RNN_0. The model we have proposed so far works as follows:

  o_i, c_i = RNN_i(x_i)
  x_{i+1} = o_i
  o_{i+1}, c_{i+1} = RNN_{i+1}(x_{i+1}),

where o_i is the output of the i-th stacked RNN layer and c_i is the sequence of its hidden states. The model with residual connections between stacked RNN layers then works as follows:

  o_i, c_i = RNN_i(x_i)
  x_{i+1} = o_i + x_i
  o_{i+1}, c_{i+1} = RNN_{i+1}(x_{i+1})
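For concreteness, the following sketch shows the architecture of Section 4 together with the residual connections of Section 4.1. The released implementation (see the linked repository) differs; this is a compact PyTorch sketch with invented names (ResidualBiRNN, etc.) whose defaults follow the hyperparameters of Section 5.1, so it illustrates the idea rather than reproducing the authors' code.

import torch
import torch.nn as nn

class ResidualBiRNN(nn.Module):
    """Character-level tagger: embeddings, stacked bidirectional LSTM layers
    whose forward and backward outputs are summed, residual connections
    between layers, and a per-character distribution over the output alphabet."""

    def __init__(self, vocab_size, out_size, dim=300, layers=2, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.drop = nn.Dropout(dropout)
        self.rnns = nn.ModuleList(
            [nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
             for _ in range(layers)])
        self.out = nn.Linear(dim, out_size)

    def forward(self, chars):                  # chars: (batch, sequence_length)
        x = self.drop(self.embed(chars))       # x_0: embedded input characters
        for rnn in self.rnns:
            o, _ = rnn(x)                      # (batch, seq, 2 * dim)
            o = o[..., :o.size(-1) // 2] + o[..., o.size(-1) // 2:]  # sum fwd + bwd
            x = self.drop(o) + x               # residual connection: x_{i+1} = o_i + x_i
        return self.out(x)                     # per-character logits

# Training uses cross-entropy summed over all character positions, e.g.:
#   logits = model(batch_chars)                # (batch, seq, out_size)
#   loss = nn.functional.cross_entropy(
#       logits.reshape(-1, logits.size(-1)), gold.reshape(-1), reduction="sum")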
step, all beam hypotheses are extended with b most probable Finally, we use a softmax layer to produce a probability dis- variants of the respective character, creating b2 hypotheses. tribution over output alphabet at each time step. These are then sorted and the top b of them are kept. The loss function is the cross-entropy loss summed over all Whenever a space is observed in the output, all affected hy- outputs. potheses are reranked using both the RNN model output

1568 probabilities and language model probabilities. The hypoth- 17 968 828 sentences for Croatian, 11 223 924 sentences for esis probability in step can be computed as: Slovenian and 8 376 810 sentences for Serbian.2 All letters in the dataset are lowercased. P (y1:k|) = (1 − α) log PNN (y1:k|x) + α log PLM (y1:k), To remove diacritics from the collected texts, Ljubešic et al. (2016) used a simple script that replaced four letters (ž, ć, where x denotes the input sequence, stands for the decoded č, š) with their ASCII counterparts (, c, c, s), and one let- symbols contained within current hypothesis, PNN and PLM ter (đ) with its (dj). This results in are neural network and language model probabilities and the the input and target sentences having different length. Since hyper-parameter α determines the weight of the language our model requires both input and target sentences to have model. To keep both log PNN and log PLM terms within the same length, additional data preprocessing was required a similar range in the decoding, we compute the log PNN as before feeding the data into the model: we replace all occur- the mean of output token log probabilities and additionally rences of the dj sequence in both input and target sentences normalize PLM by the number of words in the sequence. by a special token, and replace it back to dj after decoding. To train the language model as well as to run it, we use the The results of the experiment with comparison to previ- open-source KenLM toolkit (Heafield, 2011). ous best system (Lexicon, Corpus) are presented in Table 4. The Lexicon method replaces each word by its most fre- 5. Experiments quent translation as observed in the training data. The Corpus To compare performance of our model with current ap- method extends it via log-linear model with context probabil- proaches, we perform experiments using two existing ity. These methods were evaluated by Ljubešic et al. (2016) datasets. The first one, created by Ljubešic et al. (2016), and the Corpus method is to the best of our knowledge state- consists of Croatian, Serbian and Slovenian sentences from of-the-art system for all three languages. System accuracy is three sources: Wikipedia texts, general Web texts and texts measured, similarly to the original paper, on all words which from Twitter. Since Web data are assumed to be the noisi- have at least one alphanumerical character. est, they are used only for training. Wikipedia and Twitter We incorporated the same language models as used by the testing sets should then cover both standard and non-standard authors of the original paper. There are two points in the language. The second evaluation dataset we utilize consists results we would like to stress: of Czech sentences collected mainly from newspapers, thus • Our system with language model reduces error by more it covers mostly standard Czech. than 30% on wiki data and by more than 20% on tweet data. Moreover, our model outperforms the current 5.1. Training and Decoding Details best system on wiki data even if it does not incorporate We used the same model configuration for all experiments. the additional language model, which makes the model The bidirectional RNN has 2 stacked layers with resid- much smaller (˜30MB instead of several gigabytes of ual connections and utilizes LSTM units (Hochreiter and the language model). Schmidhuber, 1997) of dimension 300. 
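As a small illustration of the reranking score above, the sketch below combines the two terms as described: the network term is the mean per-character log-probability, the language-model term is normalized by the number of decoded words, and α interpolates between them. The function and argument names are ours.

def hypothesis_score(char_logprobs, lm_logprob, num_words, alpha):
    """Reranking score of one beam hypothesis, following Section 4.2."""
    nn_term = sum(char_logprobs) / max(len(char_logprobs), 1)
    lm_term = lm_logprob / max(num_words, 1)
    return (1 - alpha) * nn_term + alpha * lm_term

# Example usage (illustrative attribute names; alpha is tuned on development data):
# beam = sorted(beam, key=lambda h: hypothesis_score(h.char_logprobs,
#               h.lm_logprob, h.num_words, alpha=0.3), reverse=True)[:b]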
5. Experiments

To compare the performance of our model with current approaches, we perform experiments using two existing datasets. The first one, created by Ljubešic et al. (2016), consists of Croatian, Serbian and Slovenian sentences from three sources: Wikipedia texts, general Web texts and texts from Twitter. Since the Web data are assumed to be the noisiest, they are used only for training. The Wikipedia and Twitter testing sets should then cover both standard and non-standard language. The second evaluation dataset we utilize consists of Czech sentences collected mainly from newspapers, thus it covers mostly standard Czech.

5.1. Training and Decoding Details

We used the same model configuration for all experiments. The bidirectional RNN has 2 stacked layers with residual connections and utilizes LSTM units (Hochreiter and Schmidhuber, 1997) of dimension 300. Dropout (Srivastava et al., 2014) at a rate of 0.2 is used both on the embedded inputs and after each bidirectional layer. All weights are initialized using the Xavier uniform initializer (Glorot and Bengio, 2010).

The vocabulary of each experiment consists of the 200 most frequently occurring characters in the training set and a special symbol for unknown characters.

To train the model, we use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.0003 and a minibatch size of 200. Each model was trained on a single GeForce GTX 1080 Ti for approximately 4 days. After training, the model with the highest accuracy on the corresponding development set was selected.

To estimate the decoding parameter α, we performed an exhaustive search over [0, 1] with a step size of 0.1. The parameter was selected to maximize model performance on the particular development set. All results were obtained using a beam width of 8.

5.2. Croatian, Serbian and Slovenian

The original dataset contains training files divided into Web, Twitter and Wikipedia subsets. However, Ljubešic et al. (2016) showed that concatenating all these language-specific sets for training yields the best results. Therefore, we used only the concatenated files for training the models for each of the three languages in our experiments. The training files contain 17 968 828 sentences for Croatian, 11 223 924 sentences for Slovenian and 8 376 810 sentences for Serbian.² All letters in the dataset are lowercased.

² The Serbian dataset is based on the Latin script.

To remove diacritics from the collected texts, Ljubešic et al. (2016) used a simple script that replaced four letters (ž, ć, č, š) with their ASCII counterparts (z, c, c, s), and one letter (đ) with the digraph dj. This results in the input and target sentences having different lengths. Since our model requires both input and target sentences to have the same length, additional data preprocessing was required before feeding the data into the model: we replace all occurrences of the dj sequence in both input and target sentences by a special token, and replace it back to dj after decoding.

The results of the experiment, with a comparison to the previous best systems (Lexicon, Corpus), are presented in Table 4. The Lexicon method replaces each word by its most frequent translation as observed in the training data. The Corpus method extends it via a log-linear model with context probability. These methods were evaluated by Ljubešic et al. (2016), and the Corpus method is to the best of our knowledge the state-of-the-art system for all three languages. System accuracy is measured, similarly to the original paper, on all words which have at least one alphanumerical character.

We incorporated the same language models as used by the authors of the original paper. There are two points in the results we would like to stress:

• Our system with a language model reduces the error by more than 30% on wiki data and by more than 20% on tweet data. Moreover, our model outperforms the current best system on wiki data even if it does not incorporate the additional language model, which makes the model much smaller (~30 MB instead of several gigabytes of the language model).

• The diacritics restoration problem is easier on standard language (wiki) than on non-standard data (tweets). This has, in our opinion, two reasons. First, the amount of wiki data in the training sets is substantially higher than the amount of non-standard data (tweets). This makes the model fit standard data better. Second, due to the lower language quality of the Twitter data, we suppose that the amount of errors in the gold data is higher.

5.3. Czech

The second experiment we conducted is devoted to diacritics restoration in Czech texts. To train both the neural network and the language models, we used the SYN2010 corpus (Křen et al., 2010), which contains 8 182 870 sentences collected from Czech literature and newspapers. To evaluate the model, the PDT3.0 (Hajič et al., 2018) testing set with 13 136 sentences originating from Czech newspapers is used. Both the training and testing sets, thus, contain mainly standard Czech. For language model training, we consider only those {2,3,4,5}-grams that occurred at least twice, and use default KenLM options.

Table 5 presents a comparison of our model's performance with Microsoft Office Word 2010, ASpell, CZACCENT (Rychlý, 2012) and Korektor (Richter et al., 2012), the latter being the state-of-the-art system for diacritics restoration in Czech.

                  wiki                           tweet
System            hr       sr       sl           hr       sr       sl
Lexicon           0.9936   0.9924   0.9933       0.9917   0.9893   0.9820
Corpus            0.9957   0.9947   0.9962       0.9938   0.9917   0.9912
Our model         0.9967   0.9961   0.9970       0.9932   0.9939   0.9882
Our model + LM    0.9973   0.9968   0.9974       0.9951   0.9944   0.9930
Error reduction   36.81%   39.74%   30.45%       21.62%   32.14%   20.77%

Table 4: Results obtained on the Croatian (HR), Serbian (SR) and Slovenian (SL) Wikipedia and Twitter testing sets. Note that the word accuracy presented in the table is not measured on all words, but only on words having at least one alphanumerical character.
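The error-reduction rows in Tables 4 and 5 appear to be the relative reduction of the word error rate of the best previous system (the Corpus method here, Korektor in Table 5). A minimal illustration; the published figures are presumably computed from unrounded accuracies, so this only approximately reproduces, e.g., the 36.81% for Croatian wiki data:

def error_reduction(baseline_accuracy, new_accuracy):
    # Relative reduction of the word error rate (1 - accuracy).
    return 1 - (1 - new_accuracy) / (1 - baseline_accuracy)

print(round(100 * error_reduction(0.9957, 0.9973), 2))   # ~37.2 for Croatian wiki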

System                           Word accuracy
Microsoft Office Word 2010 (*)   0.8910
ASpell (*)                       0.8839
Lexicon                          0.9527
CZACCENT                         0.9607
Corpus                           0.9713
Korektor                         0.9861
Our model                        0.9887
Our model + LM                   0.9951
Error reduction                  64.75%

Table 5: Comparison of several models of restoration of diacritics for Czech. The (*) denotes reduced test data (see text).

Figure 2: Comparison of RNN and Lexicon performance with varying training data size (word accuracy plotted against the number of training sentences).

Figure 3: Effect of using residual connections with respect to the number of stacked layers (word accuracy for models with and without residual connections).

Note that the evaluation using Microsoft Office Word 2010 and ASpell was performed only on the first 746 (636) sentences, because it requires user interaction (confirming the suggested alternatives).

As the results show, models that are not tuned to the task of diacritics restoration perform poorly. Our model combined with a language model reduces the error of the previous state-of-the-art system by more than 60%; our model achieves slightly higher accuracy than Korektor even if no language model is utilized.

5.3.1. Ablation Experiments

One of the reasons why deep learning works so well is the availability of large training corpora. This motivates us to explore the amount of data our model needs to perform well. As Figure 2 shows, the RNN model trained on 50 000 random sentences from the SYN2010 corpus performs better on the PDT3.0 testing set than the Lexicon baseline trained on the full SYN2010 corpus. Further, up to 5M sentences, the performance of the RNN model increases with the growing training set size. We do not observe any performance difference between the RNN model trained on 5M and 8M sentences.

The second ablation experiment examines the effect of residual connections. We trained models with 2, 3, 4 and 5 stacked layers, each with and without residual connections. We also trained a simple model with 1 bidirectional layer without residual connections. The results of this experiment are presented in Figure 3. Apart from the big difference in word accuracy between the model with 1 layer and the other models, we can see that models with residual connections perform generally better than when no residual connections are incorporated. It is also evident that when more layers are added to the stack, the performance of models without residual connections deteriorates, while the performance of models with additional residual connections does not.

6. New Multilingual Dataset

As discussed in the preceding sections, diacritics restoration is an active field of research. However, to the best of our knowledge, there is no consistent approach to obtaining datasets for this task. When a new diacritics restoration system is published, a new dataset is typically created both for training and testing. This makes it difficult to compare performance across systems. We thus propose a new pipeline for obtaining consistent multilingual datasets for the task of diacritics restoration.

6.1. Dataset

As the data for diacritics restoration need to be clean, we decided to utilize Wikipedia for both the development and testing sets. Because there may not be enough data to train potential diacritics restoration systems on Wikipedia texts only, we further decided to create training sets from the (general) Web. We chose two corpora for this task: the W2C corpus (Majliš, 2011) with texts from Wikipedia and the general Web in 120 languages, and the CommonCrawl corpus with language annotations generated by Buck et al. (2014), with a substantially larger amount of general Web texts in more than 150 languages.

To create training, development and testing data from the Wikipedia part of the W2C corpus, its data are first segmented into sentences, these are then converted to lowercase and finally split into disjoint training, development and testing sets. The split was performed in such a way that all three sets consist of sentences collected from whole articles rather than being randomly sampled across all articles. Each testing set consists of 30 000 sentences, each development set of approximately 15 000 sentences, and the rest of the data is preserved for the training set.

The pipeline for creating additional training data from the CommonCrawl corpus starts with the removal of invalid UTF-8 data and Wikipedia data. These filtered data are then segmented into sentences and converted to lowercase. Since these data come from the general Web and may be noisy (e.g. contain sentences with missing diacritics), only those sentences that have at least 100 characters and contain at least a certain amount of diacritics are preserved. The constant determining the minimum amount of diacritics is language specific and is derived from Table 1. Finally, the intersection with the existing development and testing sets is removed and maximally ten similar sentences are preserved in the training data. Since both baseline methods (Lexicon and Corpus) require the data to be word tokenized, all texts are also word tokenized.

The dataset was created for 12 languages (see Table 6), where the additional training sets were generated from the 2017_17 web crawl. The complete dataset can be downloaded from http://hdl.handle.net/11234/1-2607.
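The noise filter for the CommonCrawl training data can be sketched as follows. The paper does not specify exactly how the "amount of diacritics" is measured or the per-language thresholds, so the word-level ratio below and the example threshold are assumptions for illustration only.

import unicodedata

def has_diacritics(word):
    # A word carries diacritics if NFD-based stripping changes it
    # (see the Section 2.1 sketch for a variant that also handles đ, ł, ...).
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped) != word

def keep_sentence(sentence, min_diacritic_ratio):
    """Keep a CommonCrawl sentence only if it has at least 100 characters
    and a sufficient share of alphabetic words carrying diacritics."""
    if len(sentence) < 100:
        return False
    words = [w for w in sentence.split() if any(c.isalpha() for c in w)]
    if not words:
        return False
    return sum(has_diacritics(w) for w in words) / len(words) >= min_diacritic_ratio

# Example threshold loosely derived from Table 1 (Czech: ~52% of words carry diacritics):
# keep_sentence(line.lower(), min_diacritic_ratio=0.25)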
6.2. Experiments

We train and evaluate our model on the created dataset and compare its performance to two baseline methods. The same model hyperparameters as described in Section 5.1 are used, except for the RNN cell dimension, which is 500.

Training was performed in two phases. First, each language-specific model was trained on the particular CommonCrawl (Web) training set for approximately four days. Then, each model was fine-tuned with a smaller learning rate of 0.0001 on the respective Wikipedia training set for three more days. Finally, as all models seemed to be continuously improving on the development sets, we took the last model checkpoints for evaluation.

Both the baseline methods and the language models were trained on the concatenation of the Wikipedia and CommonCrawl training data. For language model training, we considered only those {2,3,4,5}-grams that occurred at least twice, and used default KenLM options.

To measure model performance, a modified word accuracy is used: the alpha-word accuracy considers only words that contain at least one alphabetical character, because only these can potentially be diacritized. The testing set results of the Lexicon and Corpus baselines, as well as of our models before and after fine-tuning, and with a language model, are presented in Table 6.

As the results show, our model outperforms both baselines even if no language model is used. Moreover, the incorporation of the language model helps the model perform better, as does model fine-tuning. Without fine-tuning, all models but the Romanian one outperform the baselines. We suspect that the reason why the Romanian model before fine-tuning performs worse than the Corpus method is that the non-standard Web data differ too much from the standard data from Wikipedia. It is also an interesting fact that the biggest error reduction is achieved for Vietnamese and Romanian, which seem to be the most difficult languages for both baseline methods.

7. Conclusion

In this work, we propose a novel combination of a recurrent neural network and a language model for performing diacritics restoration. The proposed system is language agnostic, as it is trained solely from a parallel corpus of texts without diacritics and diacritized texts. We test our system on two existing datasets comprising four languages, and we show that it outperforms previous state-of-the-art systems. Moreover, we propose a pipeline for generating consistent multilingual diacritics restoration datasets, run it on twelve languages, publish the created dataset, evaluate our system on it and provide a comparison with two baseline methods. Our method outperforms even the stronger contextual baseline method on the new dataset by a big margin.

Future work includes a detailed error analysis, which could reveal the types of errors made by our system. Since certain words may be correctly diacritized in several ways given the context of the whole sentence, such error analysis could also set the language-specific limit on the accuracy that can be achieved.

Further, when designing our multilingual dataset we decided to use testing sets with sentences from Wikipedia articles. This was well motivated, as we wanted it to contain sentences with proper diacritics. However, such testing sets contain mainly standard language and are thus less suitable for comparing models aiming to generate diacritics for non-standard language. Therefore, we plan to create additional development and testing sets in future work.

While experimenting with the model on Czech, we found out that when it is trained to output instructions (e.g. add caron) instead of letters, it performs better. Future work thus also includes a thorough inspection of this behavior when applied to all languages.

Finally, the system achieves better results when a language model is incorporated during inference. Because the use of an external model both slows down the inference process and requires significantly more memory, it would be desirable to train the model in such a way that no additional language model is needed. We suspect that multitask learning (e.g. training the model also to predict the next/previous letter) may compensate for the absence of a language model.

Language     Wiki sentences  Web sentences  Words with diacritics  Lexicon  Corpus  Our model w/o finetuning  Our model  Our model + LM  Error reduction
Vietnamese     819 918       25 932 077     73.63%                 0.7164   0.8639  0.9622                    0.9755     0.9773          83.33%
Romanian       837 647       16 560 534     24.33%                 0.8533   0.9046  0.9018                    0.9799     0.9837          82.96%
Latvian        315 807        3 827 443     39.39%                 0.9101   0.9457  0.9608                    0.9657     0.9749          53.81%
Czech          952 909       52 639 067     41.52%                 0.9590   0.9814  0.9852                    0.9871     0.9906          49.20%
Polish       1 069 841       36 449 109     27.09%                 0.9708   0.9841  0.9891                    0.9903     0.9955          71.64%
Slovak         613 727       12 687 699     35.60%                 0.9734   0.9837  0.9868                    0.9884     0.9909          44.21%
Irish           50 825          279 266     26.30%                 0.9735   0.9800  0.9842                    0.9846     0.9871          35.55%
Hungarian    1 294 605       46 399 979     40.33%                 0.9749   0.9832  0.9888                    0.9902     0.9929          58.04%
French       1 818 618       78 600 777     14.65%                 0.9793   0.9931  0.9948                    0.9954     0.9971          58.11%
Turkish        875 781       72 179 352     25.34%                 0.9878   0.9905  0.9912                    0.9918     0.9928          24.14%
Spanish      1 735 516       80 031 113     10.41%                 0.9911   0.9953  0.9956                    0.9958     0.9965          25.57%
Croatian       802 610        7 254 410     12.39%                 0.9931   0.9947  0.9951                    0.9951     0.9967          36.92%

Table 6: Results obtained on the new multilingual dataset. Note that the alpha-word accuracy presented in the table is measured only on those words that have at least one alphabetical character. The last column presents the error reduction of our model combined with the language model compared to the Corpus method.
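For reference, the alpha-word accuracy used in Table 6 can be computed as sketched below; the function name and the whitespace tokenization are our own simplifications.

def alpha_word_accuracy(system_tokens, gold_tokens):
    """Word accuracy counted only over gold tokens that contain at least
    one alphabetical character (the only tokens that can carry diacritics)."""
    scored = [(s, g) for s, g in zip(system_tokens, gold_tokens)
              if any(c.isalpha() for c in g)]
    if not scored:
        return 1.0
    return sum(s == g for s, g in scored) / len(scored)

print(alpha_word_accuracy("videl jsem tri 3 psy".split(),
                          "viděl jsem tři 3 psy".split()))   # 0.5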

8. Acknowledgements

The research described herein has been supported by the Grant No. DG16P02R019 of the Ministry of Culture of the Czech Republic. Data has been used from and stored into the repository of LINDAT/CLARIN, a large research infrastructure supported by the Ministry of Education, Youth and Sports of the Czech Republic under projects LM2015071 and CZ.02.1.01/0.0/0.0/16_013/0001781.

Thanks to NGỤY Giang Linh for the information and examples on Vietnamese.

9. Bibliographical References

Adali, K. and Eryiğit, G. (2014). Vowel and diacritic restoration for social media texts. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 53–61.

Azmi, A. M. and Almajed, R. S. (2015). A survey of automatic Arabic diacritization techniques. Natural Language Engineering, 21(3):477–495.

Belinkov, Y. and Glass, J. (2015). Arabic diacritization with recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2281–2285.

Buck, C., Heafield, K., and Van Ooyen, B. (2014). N-gram counts and language models from the Common Crawl. In LREC, volume 2, page 4.

Ginter, F., Hajič, J., Luotolahti, J., Straka, M., and Zeman, D. (2017). CoNLL 2017 shared task – automatically annotated raw texts and word embeddings. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610.

Hajič, Jan and Bejček, Eduard and Bémová, Alevtina and Buráňová, Eva and Hajičová, Eva and Havelka, Jiří and Homola, Petr and Kárník, Jiří and Kettnerová, Václava and Klyueva, Natalia and Kolářová, Veronika and Kučová, Lucie and Lopatková, Markéta and Mikulová, Marie and Mírovský, Jiří and Nedoluzhko, Anna and Pajas, Petr and Panevová, Jarmila and Poláková, Lucie and Rysová, Magdaléna and Sgall, Petr and Spoustová, Johanka and Straňák, Pavel and Synková, Pavlína and Ševčíková, Magda and Štěpánek, Jan and Urešová, Zdeňka and Vidová Hladká, Barbora and Zeman, Daniel and Zikánová, Šárka and Žabokrtský, Zdeněk. (2018). Prague Dependency Treebank 3.5. Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University.

Heafield, K. (2011). KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., Novotná, R., Petkevič, V., Procházka, P., Schmiedtová, V., et al. (2010). SYN2010: žánrově vyvážený korpus psané češtiny. Ústav Českého národního korpusu FF UK, Praha.

Ljubešic, N., Erjavec, T., and Fišer, D. (2016). Corpus-based diacritic restoration for South Slavic languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA).

Majliš, M. (2011). W2C – Web to Corpus – Corpora.

Nguyen, K.-H. and Ock, C.-Y. (2010). Diacritics restoration in Vietnamese: letter based vs. syllable based model. PRICAI 2010: Trends in Artificial Intelligence, pages 631–636.

Nivre, J., Agić, Ž., Ahrenberg, L., et al. (2017). Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague.

Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318.

Richter, M., Straňák, P., and Rosen, A. (2012). Korektor – a system for contextual spell-checking and diacritics completion. In COLING (Posters), pages 1019–1028.

Rychlý, P. (2012). CzAccent – simple tool for restoring accents in Czech texts. RASLAN 2012: Recent Advances in Slavonic Natural Language Processing, page 85.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

The Unicode Consortium. (2017). The Unicode Standard, Version 10.0.0. The Unicode Consortium, Mountain View, CA.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Xie, Z., Avati, A., Arivazhagan, N., Jurafsky, D., and Ng, A. Y. (2016). Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727.

Yarowsky, D. (1999). A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Natural Language Processing Using Very Large Corpora, pages 99–120. Springer.

Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., Luotolahti, J., Pyysalo, S., Petrov, S., Potthast, M., Tyers, F., Badmaeva, E., Gökırmak, M., Nedoluzhko, A., Cinková, S., Hajič jr., J., Hlaváčová, J., Kettnerová, V., Urešová, Z., Kanerva, J., Ojala, S., Missilä, A., Manning, C., Schuster, S., Reddy, S., Taji, D., Habash, N., Leung, H., de Marneffe, M.-C., Sanguinetti, M., Simi, M., Kanayama, H., de Paiva, V., Droganova, K., Martínez Alonso, H., Uszkoreit, H., Macketanz, V., Burchardt, A., Harris, K., Marheinecke, K., Rehm, G., Kayadelen, T., Attia, M., Elkahky, A., Yu, Z., Pitler, E., Lertpradit, S., Mandl, M., Kirchner, J., Fernandez Alcalde, H., Strnadova, J., Banerjee, E., Manurung, R., Stella, A., Shimada, A., Kwak, S., Mendonça, G., Lando, T., Nitisaroj, R., and Li, J. (2017). CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.
