ACL-IJCNLP 2021 Submission 2830. Confidential Review Copy. DO NOT DISTRIBUTE.

Restoring Hebrew Without a Dictionary

Anonymous ACL-IJCNLP submission

Abstract

We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text. We present NAKDIMON, a two-layer character-level LSTM, that performs on par with much more complicated curation-dependent systems, across a diverse array of modern Hebrew sources.

1 Introduction

The vast majority of Hebrew texts are written in a letter-only version of the Hebrew script, one which omits the diacritics present in the fully diacritized, or dotted, variant.1 Since most vowels are encoded via diacritics, the pronunciation of words in the text is left underspecified, and a considerable mass of tokens becomes ambiguous. This ambiguity forces readers and learners to infer the intended reading using syntactic and semantic context, as well as common sense (Bentin and Frost, 1987; Abu-Rabia, 2001). In NLP systems, recovering such signals is difficult, and indeed their performance on Hebrew tasks is adversely affected by the presence of undotted text (Shacham and Wintner, 2007; Goldberg and Elhadad, 2010; Tsarfaty et al., 2019). As an example, the sentence in Table 1 (a) will be resolved by a typical reader as (b) in most reasonable contexts, knowing that the word "softly" may characterize landings. In contrast, an automatic system processing Hebrew text may not be as sensitive to this kind of grammatical knowledge and instead interpret the undotted token as the more frequent word in (c), harming downstream performance.

1 Also known as pointed text, or via the Hebrew term for the diacritic marks, nikkud.

(a) המטוס נחת ברכות   hamatos naxat ????   'The plane landed (unspecified)'
(b) הַמָּטוֹס נָחַת בְּרַכּוּת   hamatos naxat b-rakut   'The plane landed softly'
(c) הַמָּטוֹס נָחַת בְּרָכוֹת   hamatos naxat braxot   'The plane landed congratulations'

Table 1: An example of an undotted Hebrew text (a) (written right to left) which can be interpreted in at least two different ways (b, c), dotted and pronounced differently, but only (b) makes grammatical sense.

One possible way to overcome this problem is by adding diacritics to undotted text, or dotting, implemented using data-driven algorithms trained on dotted text. Obtaining such data is not trivial, even given correct pronunciation: the standard Tiberian system contains several sets of identically-vocalized forms, so while most Hebrew speakers easily read dotted text, they are unable to produce it. Moreover, the process of manually adding diacritics in either handwritten script or through digital input devices is mechanically cumbersome. Thus, the overwhelming majority of modern Hebrew text is undotted, and manually dotting it requires expertise. The resulting scarcity of available dotted text in modern Hebrew contrasts with Biblical and Rabbinical texts which, while dotted, manifest a very different language register. This state of affairs allows individuals and companies to offer dotting as paid services, either by experts or automatically, e.g. the Morfix engine by Melingo.2

2 https://nakdan.morfix.co.il/

Existing computational approaches to dotting are manifested as complex, multi-resourced systems which perform morphological analysis on the undotted text and look undotted words up in hand-crafted dictionaries as part of the dotting process. Dicta's Nakdan (Shmidman et al., 2020), the current state of the art, applies such methods in addition to applying multiple neural networks over different levels of the text, requiring manual annotation not only for dotting but also for morphology. Among the resources it uses are a diacritized corpus of 3M tokens and a POS-tagged corpus of 300K tokens. Training the model takes several weeks.3

In this work, we set out to simplify the dotting task as much as possible, using only standard modules. Our system, NAKDIMON, accepts the undotted character sequence as its input, consults no external resources or lexical components, and produces diacritics for each character, resulting in dotted text whose quality is comparable to that of the commercial Morfix, on both character-level and word-level accuracy. To our knowledge, this is the first attempt at a "light" model for Hebrew dotting since early HMM-based systems (Kontorovich, 2001; Gal, 2002). In experiments over existing and novel datasets, we show that our system is particularly useful in the main use case of modern dotting, which is to convey the desired pronunciation to a reader, and that the errors it makes should be more easily detectable by non-professionals than Dicta's.4

3 Private communication.
4 The system is available at www.nakdimon.org

2 Dotting as Sequence Labeling

The input to the dotting task consists of a sequence of characters, each of which is assigned at most one of each of three separate diacritic categories: one category for the dot distinguishing shin (שׁ) from sin (שׂ), two consonants sharing the base character ש; another for the presence of dagesh/mappiq, a central dot affecting the pronunciation of some consonants, e.g. פּ /p/ from פ /f/, but also present elsewhere; and one for all other diacritic marks, which mostly determine vocalization, e.g. דַ /da/ vs. דֶ /de/. Diacritics of different categories may co-occur on single letters, e.g. שָׁ, or may be absent altogether.

Full script. Hebrew script written without intention of dotting typically employs a compensatory variant known colloquially as full script (ktiv male, כתיב מלא), which adds instances of the letters י and ו in some places where they can aid pronunciation, but are incompatible with the rules for dotted script. In our formulation of dotting as a sequence tagging problem, and in collecting our test set from raw text, these added letters may conflict with the dotting standard. For the sake of input integrity, and unlike some other systems, we opt not to remove these characters, but instead employ a dotting policy consistent with full script.

New test set. Shmidman et al. (2020) provide a benchmark dataset for dotting modern Hebrew documents. However, it is relatively small and non-diverse: all 22 documents in the dataset originate in a single source, Hebrew Wikipedia articles. Therefore, we created a new test set5 from a larger variety of texts, including high-quality Wikipedia articles and edited news stories, as well as user-generated blog posts. This set consists of ten documents from each of eleven sources (5x Dicta's test set), and totals 20,476 word tokens, roughly 3.5x Dicta's.

5 Provided as supplementary material.

3 Character-LSTM Dotter

NAKDIMON embeds the input characters and passes them through a two-layer Bi-LSTM (Hochreiter and Schmidhuber, 1997). The LSTM output is fed into a single linear layer, which then feeds three linear layers, one for each diacritic category. Each character then receives a prediction for each category independently, and decoding is performed greedily with no additional search techniques.6 In training, we sum the cross-entropy loss from all categories. Trivial decisions, such as the label for the shin/sin diacritic for any non-ש letter, are masked.

6 We experimented with a CRF layer in preliminary phases and did not find a substantial benefit.

Training corpora. Dotted modern Hebrew text is scarce, since speakers usually read and write undotted text, with the occasional diacritic for disambiguation when context does not suffice. As we are unaware of legally-obtainable dotted modern corpora, we use a combination of dotted pre-modern texts and semi-automatically dotted modern sources to train NAKDIMON. The PRE-MODERN portion is obtained from two main sources: a combination of late pre-modern text from Project Ben-Yehuda, mostly texts from the late 19th century and the early 20th century;7 and rabbinical texts from the medieval period, the most important of which is Mishneh Torah.8 This portion contains roughly 2.6 million Hebrew tokens, most of which are dotted, with a varying level of accuracy, varying dotting styles, and a varying degree of similarity to modern Hebrew.

7 Obtained from the Ben-Yehuda Project https://benyehuda.org/.
8 Obtained from Project Mamre https://www.mechon-mamre.org/.

The MODERN portion contains manually collected text in modern Hebrew, mostly from undotted sources, which we dot using Dicta and follow up by manually fixing errors, either using Dicta's API or via automated scripts which catch common mistakes. We use the same technique and style for dotting this corpus as we do for our test corpus (§2), but the documents were collected in different ways. We made an effort to collect a diverse set of sources: news, opinion columns, paragraphs from books, short stories, blog posts and forums expressing different domains and voices, Wikipedia articles, governmental publications, and more. As our dotting guidelines aim for readability and faithfulness to modern writing conventions, we were able to create a corpus large enough with limited expertise in dotting. Our modern corpus contains roughly 0.3 million Hebrew tokens, and is much more consistent and similar to the expectation of a native Hebrew speaker than the PRE-MODERN corpus. The sources and statistics of this dataset are presented in Table 2.

Genre        Sources          # Docs   # Tokens
Wiki         Dicta test set       22      5,862
News         Yanshuf              78     11,323
Literary†    Books, forums       129     73,770
Official*    gov.il                2      7,619
News/Mag*    Online outlets       84     55,467
User-gen.*   Blogs, forums        58     57,672
Wiki*        he.wikipedia         40     62,723
Total                            413    274,436

Table 2: Data sources for our MODERN Hebrew training set. Rows marked with * were automatically dotted via the Dicta API and corrected manually. Rows with † were dotted at low quality, requiring manual correction. The rest were available with professional dotting.

4 Experiments

In order to report findings compatible with those of previous work, we present results on both the Dicta test set, adapted to full script, and on our new test set. We report the following metrics: decision accuracy (DEC), which is computed over the entire set of individual possible decisions: dagesh/mappiq for letters that allow it, the sin/shin dot for the letter ש, and all other diacritics for letters that allow them; character accuracy (CHA) is the portion of characters in the text that end up in their intended final form (which may be the result of two or three decisions, e.g. dagesh + vowel mark); word accuracy (WOR) is the portion of words with no diacritization mistakes; and vocalization accuracy (VOC) is the portion of words where any dotting errors do not cause incorrect pronunciation.9

9 These are: the sin/shin dot, vowel distinctions across the a/e/i/o/u/null sets, and dagesh in the ב/כ/פ characters. We do not distinguish between kamatz gadol/kamatz katan, and shva is assumed to always be null.

We train NAKDIMON over PRE-MODERN followed by five epochs over the MODERN corpus. We pre-process the input by removing all but Hebrew characters, spaces and punctuation; digits are converted to a dedicated symbol, as are Latin characters. We strip all existing diacritic marks. We split the documents into chunks of at most 80 characters bounded at whitespace, ignoring sentence boundaries. We optimize using Adam (Kingma and Ba, 2014), and implement a cyclical learning rate schedule (Smith, 2017), which we found to be more useful than a constant learning rate. Based on tuning experiments (§4.3), we set the character embedding dimension and the LSTM's hidden dimension to d = h = 400, and apply a dropout rate of 0.1.

Following Shmidman et al. (2020), we compare against Dicta, Snopi,10 and Morfix (Kamir et al., 2002).

10 http://www.nakdan.com/Nakdan.aspx

4.1 Dicta Test Set

We present results for the Dicta test set in the left half of Table 3. In order to provide a fair comparison and to preempt overfitting on this test data, we ran this test in a preliminary setup on a variant of NAKDIMON which was not tuned or otherwise unfairly trained. This system, which we call NAKDIMON0, differs from our final variant in two main aspects: it is not trained on the Dicta portion of our training corpus (§3); and it employs a residual connection between the two character Bi-LSTM layers. Testing on the Dicta test set required some minimal evaluation adaptations resulting from encoding constraints,11 and so we copy the results reported in the original paper as well as our replication.

11 For example, we do not distinguish between kamatz katan and kamatz gadol.

We see that the untuned NAKDIMON0 performs on par with the proprietary Morfix, which uses word-level dictionary data, and does not perform far below the complex, labor-intensive Dicta on the character level.
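To make the three-category labeling scheme of §2 concrete, the following is a minimal sketch (ours, not the paper's code) that decomposes dotted text into per-letter label tuples. The Unicode codepoints are the standard Hebrew block; the label names ("shin"/"sin", a boolean dagesh flag) are illustrative choices.

```python
# Standard Unicode Hebrew block codepoints (not taken from the paper itself).
HEBREW_LETTERS = {chr(c) for c in range(0x05D0, 0x05EB)}   # א..ת, incl. final forms
SHIN_SIN = {"\u05C1": "shin", "\u05C2": "sin"}             # category 1: shin/sin dot
DAGESH = "\u05BC"                                          # category 2: dagesh/mappiq
NIQQUD = {chr(c) for c in range(0x05B0, 0x05BC)}           # category 3: other marks

def decompose(dotted):
    """Split dotted text into per-letter label tuples:
    (base char, shin/sin decision, dagesh decision, vocalization mark)."""
    out = []
    for ch in dotted:
        if ch in HEBREW_LETTERS:
            out.append([ch, "none", False, ""])
        elif out and ch in SHIN_SIN:
            out[-1][1] = SHIN_SIN[ch]       # attach dot to the preceding letter
        elif out and ch == DAGESH:
            out[-1][2] = True
        elif out and ch in NIQQUD:
            out[-1][3] = ch
        else:
            out.append([ch, "none", False, ""])  # space, punctuation, etc.
    return [tuple(t) for t in out]
```

Under this view, each of the three tuple slots is one independent labeling decision per character, exactly the shape of the task NAKDIMON is trained on.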

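The architecture of §3 can be sketched in PyTorch roughly as follows. This is a hedged reconstruction, not the authors' code: the vocabulary and head sizes are illustrative, the ReLU between the shared layer and the heads is our assumption, and masking of trivial decisions is omitted for brevity. The dimensions d = h = 400 and the 0.1 dropout follow §4.

```python
import torch
import torch.nn as nn

class CharDotter(nn.Module):
    """Sketch: char embeddings -> two-layer Bi-LSTM -> shared linear layer
    -> one classifier head per diacritic category (shin/sin, dagesh, niqqud)."""
    def __init__(self, n_chars=100, d=400, n_shin=3, n_dagesh=2, n_niqqud=16):
        super().__init__()
        self.embed = nn.Embedding(n_chars, d)
        self.lstm = nn.LSTM(d, d, num_layers=2, bidirectional=True,
                            batch_first=True, dropout=0.1)
        self.shared = nn.Linear(2 * d, d)
        self.heads = nn.ModuleDict({
            "shin": nn.Linear(d, n_shin),
            "dagesh": nn.Linear(d, n_dagesh),
            "niqqud": nn.Linear(d, n_niqqud),
        })

    def forward(self, char_ids):                 # char_ids: (batch, seq)
        h, _ = self.lstm(self.embed(char_ids))   # (batch, seq, 2d)
        h = torch.relu(self.shared(h))           # activation choice is ours
        return {k: head(h) for k, head in self.heads.items()}

def predict(model, char_ids):
    """Greedy decoding: independent argmax per category, per character."""
    with torch.no_grad():
        return {k: v.argmax(-1) for k, v in model(char_ids).items()}

def summed_loss(outputs, targets):
    """Training loss per the paper: sum of per-category cross-entropy.
    (The paper additionally masks trivial decisions, e.g. shin/sin labels
    for non-shin letters; that masking is omitted here.)"""
    ce = nn.CrossEntropyLoss()
    return sum(ce(outputs[k].flatten(0, 1), targets[k].flatten())
               for k in outputs)
```

The per-category heads make the three decisions independent at decoding time, which is what allows the greedy, search-free inference described above.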
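The pre-processing described in §4 (stripping existing diacritics, mapping digits and Latin characters to dedicated symbols, and splitting into chunks of at most 80 characters bounded at whitespace) can be sketched as below. The placeholder symbols "#" and "@" are our own, and the removal of other non-Hebrew characters is omitted.

```python
import re
import unicodedata

def normalize(text):
    """Strip combining diacritic marks and map digits / Latin letters to
    dedicated placeholder symbols (placeholders are illustrative)."""
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[0-9]", "#", text)     # digits -> one dedicated symbol
    text = re.sub(r"[A-Za-z]", "@", text)  # Latin  -> another dedicated symbol
    return text

def chunk(text, max_len=80):
    """Greedily split into chunks of at most max_len characters, breaking
    only at whitespace; sentence boundaries are ignored, as in the paper."""
    chunks, current = [], ""
    for token in text.split():
        candidate = (current + " " + token).strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = token  # an over-long token becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```

Note that because Hebrew niqqud are combining characters, `unicodedata.combining` suffices to strip all three diacritic categories at once.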

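The CHA and WOR metrics of §4 can be sketched as below, treating each word as a list of final per-character forms; DEC and VOC need decision-level and pronunciation-level information and are omitted. This is an illustrative implementation, not the paper's evaluation code.

```python
def char_and_word_accuracy(gold_words, pred_words):
    """gold_words / pred_words: parallel lists of words, each word a list of
    per-character final dotted forms. Returns (CHA, WOR): the fraction of
    characters in their intended final form, and the fraction of words with
    no diacritization mistakes."""
    char_hits = char_total = word_hits = 0
    for g_word, p_word in zip(gold_words, pred_words):
        assert len(g_word) == len(p_word)  # dotting never changes length
        matches = [g == p for g, p in zip(g_word, p_word)]
        char_hits += sum(matches)
        char_total += len(matches)
        word_hits += all(matches)
    return char_hits / char_total, word_hits / len(gold_words)
```

A single wrong character thus costs one CHA decision but an entire WOR word, which is why word-level scores in Table 3 run well below character-level ones.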
                 Dicta – reported    Dicta – reproduced             New test set (§2)
System           CHA     WOR         DEC    CHA    WOR    VOC       DEC    CHA    WOR    VOC
Snopi            78.96   66.41       87.81  79.92  66.57  70.35     91.14  85.45  74.73  77.17
Morfix           90.32   80.90       94.91  91.29  82.24  86.48     97.25  95.35  89.43  91.64
Dicta            95.12   88.23       97.48  95.67  89.25  91.03     98.94  98.23  95.83  95.93
NAKDIMON0                            95.78  92.59  79.00  83.01     97.05  94.87  85.70  88.30
NAKDIMON                                                            97.37  95.41  87.21  89.32

Table 3: Document-level macro % accuracy on the test set from Shmidman et al. (2020) and on our new test set.

4.2 New Test Set

We provide results on the new test set (§2) in the right half of Table 3. The improvement of our tuned NAKDIMON over NAKDIMON0 is clear. NAKDIMON outperforms Morfix on character-level metrics but not on word-level metrics, mostly due to the fact that Morfix ignores certain words altogether, incurring errors on multiple characters. We note the substantial improvement our model achieves on the VOC metric compared to the WOR metric: 16.5% of word-level errors are attributable to vocalization-agnostic dotting, compared to only 2.4% for Dicta and 9.7% for Snopi (but 20.9% for Morfix). Considering that the central use case for dotting modern Hebrew text is to facilitate pronunciation for learners and for reading, and that undotted homograph ambiguity typically comes with pronunciation differences, we believe this measure to be no less important than WOR.

4.3 Development Experiments

While developing NAKDIMON, we performed several evaluations over a held-out validation set of 40 documents with 27,681 tokens, on which Dicta performs at 91.56% WOR accuracy. We show in Figure 1 the favorable effect of training NAKDIMON over an increasing amount of MODERN text (§3), which closes more than half of the error gap from Dicta. We tried to further improve NAKDIMON by initializing its parameters from a language model trained to predict masked characters in a large undotted Wikipedia corpus (440MB of text, 30% masks), but this setup provided only a negligible 0.07% improvement.

Figure 1: WOR error rate on the validation set as a function of training set size vs. SOTA (Dicta), over five runs. Other metrics show similar trends.

Error analysis. A look at 20 cases which Dicta got right but NAKDIMON got wrong, and 20 of the opposite condition, reveals clear yet expected patterns: Dicta's errors mostly constitute selection of the wrong in-vocabulary word in a context, or a wrong inflection of a verb. NAKDIMON tends to create unreadable vocalization sequences, as it does not consult a dictionary and is decoded greedily. These types of errors are more friendly to the typical use cases of a dotting system, as they are likely to stand out to a reader. Other recurring error types include named entities and errors at sentence boundaries, which likely stem from lack of context.

5 Related Work

As noted in the introduction, works on diacritizing Hebrew are not common and all include word-level features. In Arabic, diacritization serves a purpose comparable to that in Hebrew, but not exclusively: most diacritic marks differentiate letters from each other (which only the sin/shin dot does in Hebrew), while vocalization marks are in a one-to-one relationship with their phonetic realizations. Recent work in neural-based Arabic diacritization includes a dictionary-less system (Belinkov and Glass, 2015) which uses a 3-layer Bi-LSTM with a sliding window of size 5. Similarly, Abandah et al. (2015) also use a Bi-LSTM, but replace the sliding windows with a one-to-one architecture (one diacritic for each letter) and a one-to-many architecture (allowing any number of diacritics per character). The former, which performed best, is similar to this work except for our separation of diacritization categories. Finally, Mubarak et al. (2019) tackled Arabic diacritization as a sequence-to-sequence problem, tasking the model with reproducing the characters as well as the marks.

6 Conclusion

Learning directly from plain diacritized text can go a long way, even with relatively limited resources. NAKDIMON demonstrates that a simple architecture for diacritizing Hebrew text as a sequence tagging problem can achieve performance on par with much more complex systems. We also introduce and release a corpus of dotted Hebrew text, as well as a source-balanced test set. In the future, we wish to evaluate the utility of dotting as a feature for downstream tasks, taking advantage of the fact that our simplified model can be easily integrated in an end-to-end Hebrew processing system.

References

Gheith Abandah, Alex Graves, Balkees Al-Shagoor, Alaa Arabiyat, Fuad Jamour, and Majid Al-Taee. 2015. Automatic diacritization of Arabic text using recurrent neural networks. International Journal on Document Analysis and Recognition (IJDAR), 18:183–197.

Salim Abu-Rabia. 2001. The role of vowels in reading Semitic scripts: Data from Arabic and Hebrew. Reading and Writing, 14(1-2):39–59.

Yonatan Belinkov and James Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2281–2285, Lisbon, Portugal. Association for Computational Linguistics.

Shlomo Bentin and Ram Frost. 1987. Processing lexical ambiguity and visual word recognition in a deep orthography. Memory & Cognition, 15(1):13–23.

Ya'akov Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Yoav Goldberg and Michael Elhadad. 2010. Easy-first dependency parsing of modern Hebrew. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 103–107, Los Angeles, CA, USA. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Dror Kamir, Naama Soreq, and Yoni Neeman. 2002. A comprehensive NLP system for modern standard Arabic and modern Hebrew. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Leonid Kontorovich. 2001. Problems in Semitic NLP: Hebrew vocalization using HMMs.

Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad, Younes Samih, and Kareem Darwish. 2019. Highly effective Arabic diacritization using sequence to sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2390–2395, Minneapolis, Minnesota. Association for Computational Linguistics.

Danny Shacham and Shuly Wintner. 2007. Morphological disambiguation of Hebrew: A case study in classifier combination. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 439–447, Prague, Czech Republic. Association for Computational Linguistics.

Avi Shmidman, Shaltiel Shmidman, Moshe Koppel, and Yoav Goldberg. 2020. Nakdan: Professional Hebrew diacritizer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 197–203, Online. Association for Computational Linguistics.

Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE.

Reut Tsarfaty, Shoval Sadde, Stav Klein, and Amit Seker. 2019. What's wrong with Hebrew NLP? And how to make it right. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 259–264, Hong Kong, China. Association for Computational Linguistics.
