Restoring Hebrew Diacritics Without a Dictionary
Total Page:16
File Type:pdf, Size:1020Kb
ACL-IJCNLP 2021 Submission 2830. Confidential Review Copy. DO NOT DISTRIBUTE. 000 050 001 Restoring Hebrew Diacritics Without a Dictionary 051 002 052 003 053 004 054 005 Anonymous ACL-IJCNLP submission 055 006 056 007 057 008 058 009 059 010 060 011 Abstract 061 062 המטוס נחת ברכות! 012 (a) hamatos naxat ???? 013 We demonstrate that it is feasible to diacritize 063 ‘The plane landed (unspecified)’ 014 Hebrew script without any human-curated re- 064 הַמָּטוֹס Éחַת בְּר¯כּוּת! sources other than plain diacritized text. We 015 065 present NAKDIMON, a two-layer character- (b) hamatos naxat b-rakut 016 level LSTM, that performs on par with ‘The plane landed softly’ 066 017 067 הַמָּטוֹס Éחַת בְּר´כוֹת! much more complicated curation-dependent 018 systems, across a diverse array of modern He- 068 (c) hamatos naxat braxot brew sources. 019 ‘The plane landed congratulations’ 069 020 1 Introduction 070 021 Table 1: An example of an undotted Hebrew text (a) 071 022 The vast majority of modern Hebrew texts are writ- (written right to left) which can be interpreted in at least 072 023 ten in a letter-only version of the Hebrew script, two different ways (b,c), dotted and pronounced differ- 073 ently, but only (b) makes grammatical sense. 024 one which omits the diacritics present in the full di- 074 1 025 acritized, or dotted variant. Since most vowels are 075 encoded via diacritics, the pronunciation of words 026 dotted text. Obtaining such data is not trivial, even 076 in the text is left underspecified, and a considerable 027 given correct pronunciation: the standard Tiberian 077 mass of tokens becomes ambiguous. This ambigu- 028 diacritic system contains several sets of identically- 078 ity forces readers and learners to infer the intended 029 vocalized forms, so while most Hebrew speakers 079 reading using syntactic and semantic context, as 030 easily read dotted text, they are unable to produce it. 080 well as common sense (Bentin and Frost, 1987; 031 Moreover, the process of manually adding diacrit- 081 Abu-Rabia, 2001). In NLP systems, recovering 032 ics in either handwritten script or through digital 082 such signals is difficult, and indeed their perfor- 033 input devices is mechanically cumbersome. Thus, 083 mance on Hebrew tasks is adversely affected by 034 the overwhelming majority of modern Hebrew text 084 the presence of undotted text (Shacham and Wint- is undotted, and manually dotting it requires exper- 035 ner, 2007; Goldberg and Elhadad, 2010; Tsarfaty 085 tise. The resulting scarcity of available dotted text 036 et al., 2019). As an example, the sentence in Ta- 086 in modern Hebrew contrasts with Biblical and Rab- 037 ble 1 (a) will be resolved by a typical reader as (b) 087 binical texts which, while dotted, manifest a very 038 in most reasonable contexts, knowing that the word 088 different language register. This state of affairs al- 039 “softly” may characterize landings. In contrast, an 089 lows individuals and companies to offer dotting as 040 automatic system processing Hebrew text may not 090 paid services, either by experts or automatically, 041 be as sensitive to this kind of grammatical knowl- 091 e.g. the Morfix engine by Melingo.2 042 edge and instead interpret the undotted token as the 092 Existing computational approaches to dotting 043 more frequent word in (c), harming downstream 093 are manifested as complex, multi-resourced sys- 044 performance. 094 tems which perform morphological analysis on the 045 One possible way to overcome this problem is 095 undotted text and look undotted words up in hand- 046 by adding diacritics to undotted text, or dotting, im- 096 crafted dictionaries as part of the dotting process. plemented using data-driven algorithms trained on 047 Dicta’s Nakdan (Shmidman et al., 2020), the cur- 097 048 1Also known as pointed text, or via the Hebrew term for 098 049 the diacritic marks, nikkud/niqqud. 2https://nakdan.morfix.co.il/ 099 1 ACL-IJCNLP 2021 Submission 2830. Confidential Review Copy. DO NOT DISTRIBUTE. 100 rent state-of-the-art, applies such methods in ad- dotting standard. For the sake of input integrity, 150 101 dition to applying multiple neural networks over and unlike some other systems, we opt not to re- 151 102 different levels of the text, requiring manual anno- move these characters, but instead employ a dotting 152 103 tation not only for dotting but also for morphology. policy consistent with full script. 153 104 Among the resources it uses are a diacritized corpus 154 New test set Shmidman et al.(2020) provide a 105 of 3M tokens and a POS-tagged corpus of 300K 155 3 benchmark dataset for dotting modern Hebrew doc- 106 tokens. Training the model takes several weeks. 156 uments. However, it is relatively small and non- 107 In this work, we set out to simplify the dotting 157 task as much as possible to standard modules. Our diverse: all 22 documents in the dataset originate in 108 a single source, Hebrew Wikipedia articles. There- 158 system, NAKDIMON, accepts the undotted char- 109 5 159 acter sequence as its input, consults no external fore, we created a new test set from a larger variety 110 of texts, including high-quality Wikipedia articles 160 111 resources or lexical components, and produces di- 161 acritics for each character, resulting in dotted text and edited news stories, as well as user-generated 112 162 whose quality is comparable to that of the commer- blog posts. This set consists of ten documents from 113 163 cial Morfix, on both character-level and word-level each of eleven sources (5x Dicta’s test set), and 114 164 accuracy. To our knowledge, this is the first at- totals 20,476 word tokens, roughly 3.5x Dicta’s. 115 165 tempt at a “light” model for Hebrew dotting since 116 3 Character-LSTM Dotter 166 early HMM-based systems (Kontorovich, 2001; 117 167 Gal, 2002). In experiments over existing and NAKDIMON embeds the input characters 118 168 novel datasets, we show that our system is particu- and passes them through a two-layer Bi- 119 larly useful in the main use case of modern dotting, LSTM (Hochreiter and Schmidhuber, 1997). The 169 120 which is to convey the desired pronunciation to LSTM output is fed into a single linear layer, 170 121 a reader, and that the errors it makes should be which then feeds three linear layers, one for each 171 122 more easily detectable by non-professionals than diacritic category. Each character then receives a 172 123 Dicta’s.4 prediction for each category independently, and 173 124 decoding is performed greedily with no additional 174 125 2 Dotting as Sequence Labeling search techniques.6 In training, we sum the 175 126 The input to the dotting task consists of a sequence cross-entropy loss from all categories. Trivial 176 127 of characters, each of which is assigned at most one decisions, such as the label for the shin/sin diacritic 177 letter, are masked. 178 ש!-of each of three separate diacritic categories: one for any non 128 129 category for the dot distinguishing shin ( ) from 179 !שׁ Training corpora Dotted modern Hebrew text 130 sin ( ), two consonants sharing a base character 180 !שׂ is scarce, since speakers usually read and write 131 ; another for the presence of dagesh/mappiq, a 181 undotted text, with the occasional diacritic for ש! 132 central dot affecting pronunciation of some con- 182 disambiguation when context does not suffice. 133 sonants, e.g. /p/ from /f/, but also present 183 As we are unaware of legally-obtainable dotted פ|! פּ! elsewhere; and one for all other diacritic marks, 134 modern corpora, we use a combination of dot- 184 which mostly determine vocalization, e.g. /da/ ted pre-modern texts and semi-automatically dot- 185 ד´! 135 de/. Diacritics of different categories may 186/ ד»! .vs 136 ted modern sources to train NAKDIMON. The co-occur on single letters, e.g. !µ, or may be absent 137 PRE-MODERN portion is obtained from two main 187 altogether. 138 sources: A combination of late pre-modern text 188 139 Full script Hebrew script written without inten- from Project Ben-Yehuda, mostly texts from the 189 140 tion of dotting typically employs a compensatory late 19th century and the early 20th century;7 and 190 141 variant known colloquially as full script (ktiv male, rabbinical texts from the medieval period, the most 191 important of which is Mishneh Torah.8 This por- 192 י! which adds instances of the letters ,(כתיב מלא! 142 in some places where they can aid pronuncia- tion contains roughly 2.6 million Hebrew tokens, 193 ו! and 143 144 tion, but are incompatible with the rules for dotted 194 5Provided as supplementary material. 145 script. In our formulation of dotting as a sequence 6We experimented with a CRF layer in preliminary phases 195 146 tagging problem, and in collecting our test set from and did not find a substantial benefit. 196 7 147 raw text, these added letters may conflict with the Obtained from the Ben-Yehuda Project https:// 197 benyehuda.org/. 148 3Private communication. 8Obtained from Project Mamre https://www. 198 149 4The system is a available at www.nakdimon.org mechon-mamre.org/. 199 2 ACL-IJCNLP 2021 Submission 2830. Confidential Review Copy. DO NOT DISTRIBUTE. 200 Genre Sources # Docs # Tokens final form (which may be the result of two or three 250 201 Wiki Dicta test set 22 5,862 decisions, e.g. dagesh + vowel); word accuracy 251 202 News Yanshuf 78 11,323 (WOR) is the portion of words with no diacritiza- 252 y Literary Books, forums 129 73,770 203 VOC 253 * Official gov.il 2 7,619 tion mistakes; and vocalization accuracy ( ) 204 * News / Mag Online outlets 84 55,467 is the portion of words where any dotting errors do 254 9 205 * User-gen.