Elaine Renée KEOWN Center for Judaic Studies, Philadelphia, Pennsylvania

HEBREW ALPHABETS, SYMBOLS AND COMPUTER CODES: HISTORY AND PRELIMINARY TABULATION

The Ancient Hebrew alphabet was first computerized2 in 1956 by Prof. Roberto Busa in Gallarate3, Italy. Drawing on seven years of experience indexing Aquinas with early IBM punch card equipment4, Busa produced an automated index of several published Dead Sea Scrolls written in Hebrew, Aramaic, and Nabatean5. By 1959 a second project, which would soon use computerized Hebrew, had started in Israel6. By the early 1960s researchers in France began writing more advanced software which included the computerized Tiberian teʿamim7. In Nancy, Gérard Weil became the first scholar to work with a computerized version of the Tanach8. His collaborator, Pierre Rivière, wrote the first algorithms to analyze the complex teʿamim patterns of the Leningrad Codex9. Weil's work led to the first center for studying electronic versions of the Tanach10. In the mid-

1. “Porting” is the process of converting any software product for use on a different hardware platform.
2. P. TASMAN, “Indexing the Dead Sea Scrolls by Electronic Literary Data Processing Methods”, IBM (Nov. 1958) [IBM Archives], p. 5.
3. The cards for the Hebrew were punched in Gallarate, soon the location of the Centro per l'Automazione dell'Analisi Letteraria; then they were flown to NYC, where they were sorted four times on an IBM 705 mainframe. See TASMAN, cited above, p. 5, and also “IBM and the Dead Sea Scrolls”, Business Machines (June 10, 1958), p. 4.
4. From 1887 to the 1970s, most calculating machines and computers used perforated paper tape or punched cards similar to those invented in Lyons for looms between 1725 and 1804. See J. MARGUIN, Histoire des instruments et machines à calculer, Paris, 1994, pp. 182, 191.
5. The Dead Sea Scrolls were written in the early 22-letter alphabet without final letters. The Nabatean was transliterated into Hebrew. See TASMAN, cited above, p. 5.
6. Writing in 1959 about the planned historical Hebrew dictionary, Ben-Hayyim observed, “it is worthwhile investigating whether we can employ automation, that is to say, making use of punch card machines or electronic computers”. See Z. BEN-HAYYIM, The Historical Dictionary of : A Plan, Jerusalem, 1959, p. 14.
7. G. WEIL, Concordance de la cantilation des Premiers Prophètes: Josué, Juges, Samuel et Rois, Paris, 1982, p. XXIVff.
8. J.J. HUGHES, Bits, Bytes and Biblical Studies, Grand Rapids, 1987, p. 515.
9. See WEIL, cited above, p. XI.
10. Prof. Weil's research group became CATAB, the Centre d'Analyse et de Traitement Automatique de la Bible, later in Lyons.

Revue des Études juives, 161 (1-2), janvier-juin 2002, pp. 235-240

1960s in Israel, at the Weizmann Institute and Bar Ilan, Profs. Aviezri Fraenkel and Yaacov Choueka started a large database of Hebrew responsa literature11.

In the 1960s and 1970s, operating systems with character sets of 128 could be multilingual, but they were usually “monoscript”, including only Roman alphabet variations. During this time, the IBM standard code, EBCDIC, with a character set of 256, led to commercial standards with two scripts in Russia, Japan, Korea, Thailand, and other countries12. Later, in 1988, there was also a very creative Israeli standard, SI 960, with 128 characters, bilingual English/Hebrew, with only English capital letters13. By the 1980s, standard computer operating systems had character sets of 256 symbols to manipulate14. In the late 1980s, new bilingual international computer standards were proposed for about 40 languages, including Hebrew, Arabic, Russian, Greek, and 30 European languages written with variations of the Roman alphabet15. Different technical standards for Hebrew, for libraries and for general computer usage, were proposed by various countries. At that point, some proposed international standards included the Tiberian vowels and other symbols16. However, when the first general, international, bilingual Latin/Hebrew computer code for 8-bit bytes was published, ISO 8859-8 included only the basic Hebrew consonants, despite having adequate code space for Tiberian vowels and other Masoretic symbols17.

In the United States, in 1980, an American academic Hebrew computer code, now usually called “Michigan-Claremont-Westminster” (after the three institutions where it was developed or improved), was begun by Profs. H.V.D. Parunak and R.E. Whitaker as they produced a new electronic version of the Leningrad Codex18. The original Michigan-Claremont text was later corrected against the codex by Prof. A. Groves of Westminster Seminary. This newest electronic version of Leningrad was used in collaborative or individual projects in Jerusalem, Bar Ilan, Philadelphia, Amsterdam, Utrecht, and also recently in Bielefeld19.

11. Y. CHOUEKA, “Computerized full-text retrieval systems and research in the humanities: the responsa project”, Computers and the Humanities 14, 1980, pp. 153-169.
12. J. CLEWS, Language automation worldwide: the development of character set standards, Harrogate, 1988, pp. 36-37.
13. Ibid., p. 94.
14. Ibid., p. 50.
15. Ibid., pp. 16-17. See also V. ILLINGWORTH (ed.), Dictionary of Computing, Oxford, 1996, pp. 269-270.
16. Ibid., pp. 96-97.
17. Ibid., p. 95.
18. HUGHES, op. cit., pp. 499-519.
19. C.-J. DOEDENS, Text databases: one database model and several retrieval languages, Amsterdam, 1994, pp. 262-263.

Although widely used and an excellent computer code, Michigan-Claremont-Westminster never became a public, internationally recognized character set like ISO-7, ASCII, or the ISO 8859-8 bilingual Latin/Hebrew code. The latest version of the Michigan-Claremont-Westminster text will be the electronic Biblia Hebraica Quinta20.

In the 1990s, more complete public computer codes for Hebrew were produced. In 1996, the Israeli standards group, the SII, produced a new code, SI 1311.2, that included the Tiberian teʿamim21.

However, the most recent international computer code has been developed in a collaboration between the ISO, based in Geneva, and the Unicode Consortium, based in Silicon Valley. ISO 10646, or Unicode, contains 65,536 code spaces. This is 256 times the number of spaces used in 1970s/80s operating systems. Unicode was initially designed in the late 1980s by a group of experts, many from the Chinese Language Computer Society. It allocates 22,000 spaces to “CJK” (Chinese, Japanese, and Korean) ideographs. The Unicode 2.0 Hebrew section now includes the basic Hebrew alphabet, Tiberian teʿamim, and other Masoretic marks used in the Leningrad Codex, plus vowel digraphs for literary Yiddish, for a total of 87 symbols22.

In 1997 in Greifswald, Germany, the Leningrad Codex in its Michigan-Claremont-Westminster electronic version was converted to a Unicode 2.0 format to become part of a parallel corpus (Leningrad-Vulgate-Septuagint in parallel plus Greek NT) with advanced retrieval functions23. This multi-database system is called QUEST.

However, the complete character set for Hebrew script includes at least 173 symbols or letters: variant Hebrew alphabets for Jewish languages or dialects written in Hebrew plus four sets of Biblical annotations (three regional Masorahs and the Samaritan set). Other unusual Jewish languages used Georgian24, Malayalam25, or Turkish runic26 scripts.
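The Unicode Hebrew section described above stores points and accents as separate combining marks that follow a base consonant. The short Python sketch below, which relies only on the standard `unicodedata` module, illustrates this with one pointed consonant. The helper name `strip_marks` is an assumption introduced here for illustration and comes from no standard or project named in this article.

```python
import unicodedata

# Bet with dagesh and patah, stored in decomposed form: one base letter
# followed by two combining marks from the Unicode Hebrew block.
pointed_bet = "\u05D1\u05BC\u05B7"

# List each code point with its official Unicode name and combining class.
for ch in pointed_bet:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  "
          f"combining class {unicodedata.combining(ch)}")


def strip_marks(text: str) -> str:
    """Hypothetical helper: drop points and accents, keep base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


# A precomposed presentation form (bet with dagesh, U+FB31) decomposes
# canonically to the same base letter plus the dagesh mark.
composed_bet = "\uFB31"
same_after_nfd = unicodedata.normalize("NFD", composed_bet) == "\u05D1\u05BC"
print("composed form decomposes to letter + mark:", same_after_nfd)
print("base letter only:", unicodedata.name(strip_marks(pointed_bet)))
```

Because the consonant survives unchanged once the marks are removed, the same stored text can serve both pointed display and unpointed searching.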

20. See http://www.unifr.ch/bif/Chapters/bh5.html, middle paragraphs.
21. See: http://www.qsm.co.il/Hebrew/stdisr.htm.
22. See The Unicode Consortium, The Unicode Standard, version 2.0, Reading, 1996, pp. 7-60 and 7-61. For some reason, the code repeats and twice, but I only count them once here.
23. W.-D. SYRING, “QUEST 2 — Computergestützte Philologie und Exegese”, Zeitschrift für Althebräistik 11, 1998, pp. 85-89.
24. For a photo of a bilingual Georgian-Hebrew haggada, see . NAISHTAT, Yehude Gruzyah, Tel Aviv, 1970, p. 64.
25. R. DANIEL, Ruby of Cochin: An Indian Jewish woman remembers, Philadelphia, 1995, pp. 146, 151-54, 174-77.
26. S. PLETNEVA, Khazary, Moskva, 1976, p. 33ff.

A preliminary27 tabulation28 of symbols for Hebrew scripts follows:

HEBREW SYMBOLS: A PRELIMINARY LIST

                                                    Complete  Net       First on
                                                    count29   count30   computer
Ancient or common symbols
  22-letter alphabet                                22        22        1956
  Ancient epigraphic punctuation31                  2         2
  Qumrani dots (upper middle and lower middle)32    2         2         early 1960s
  Final letters                                     5         5         early 1960s
  Tiberian pointing and other Masoretic apparatus   53        52        early 1960s
  Other Hebrew manuscript33 symbols                 7         7
  Net subset totals                                           90

Extra symbols for regional Jewish languages
  Arabic34                                          6         4         late 1960s35
  Berber36                                          1         0
  Persian37                                         3         0

27. I had difficulty finding material on symbols for several languages, e.g., Shuadit (standard Judeo-Provençal, as opposed to Comtadin) and Krimchak (a Rabbanite Kipchak Turkish language, similar to Crimean Tatar). For Krimchak, see W. MOSKOVITZ, “Krimchak Language”, Encyclopaedia Judaica Yearbook 1988/9, Jerusalem, 1989, p. 371.
28. Counting Hebrew symbols for computerization is an art, not a science. Except for digraphs and trigraphs, this count is minimal. It uses the “decomposed” method of counting Hebrew symbols, counting each symbol once instead of multiple times with each pointed consonant. The “decomposed” approach has been successfully used in many library systems. However, the competing “composed” approach, where each digraph or “composed” consonant with diacritic has a code number, may have certain advantages for complex computer software. Complex software includes pointed Tanach texts of piyyuṭim, lexical databases, and, possibly, Web pages.
29. This number includes the complete set of extra symbols found in the subset.
30. This number is the net number of symbols after subtracting those found in more than one category.
31. Y. AHARONI, Arad Inscriptions, Jerusalem, 1986, p. 34. This count does not include the hieratic numerals of Hebrew epigraphy. For these, see R. DEUTSCH, New epigraphic evidence from the Biblical period, Tel Aviv, 1995.
32. R. BUTIN, The ten nequdoth of the Torah, New York, 1969, p. XXV.
33. Here I include inverted nuns, pehs, and tsadis plus two abbreviation symbols.
34. B. HARY, “Adaptations of Hebrew script”, in P.T. DANIELS, The World's Writing Systems, Oxford, 1995, pp. 727-734.
35. Judeo-Arabic texts were apparently first computerized at Dropsie by Prof. Lawrence V. Berman. Later in the 1970s Prof. Alan Corré produced a computerized lexicon.
36. P. GALAND-PERNET, Une version berbère de la haggadah de Pesach, Paris, 1970. See also M. O'CONNOR, “The Berber scripts”, in DANIELS, cited above, p. 115.
37. Judeo-Persian, Bukhari, and Tat are dialects of Persian from different areas. See H. PAPER, A Judeo-Persian Pentateuch, Jerusalem, 1972.

  Tajik38 (Bukhari)                                 4         2
  Tat39                                             3         2
  Neo-Aramaic40, 41 (Kurdit)                        3         1
  Greek                                             0         0
  French42                                          7         3
  Comtadin43                                        1         0
  Italian44                                         6         1
  Ladino45                                          4         2
  Yiddish46                                         6         3
  Net subset totals                                           18

Other pointing, reading, masoretic systems
  Babylonian47                                      39        35
  Palestinian48                                     31        18
  Samaritan49                                       21        12
  Net subset totals                                           65

Total Hebrew symbols found to date: 173
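The net subset totals and the grand total in the list above can be checked by simple addition. The following Python sketch transcribes the net counts from the table and sums them. The dictionary layout and variable names are illustrative assumptions, not part of any standard.

```python
# Net counts transcribed from the preliminary list (the "Net count" column).
NET_COUNTS = {
    "Ancient or common symbols": {
        "22-letter alphabet": 22,
        "Ancient epigraphic punctuation": 2,
        "Qumrani dots": 2,
        "Final letters": 5,
        "Tiberian pointing and other Masoretic apparatus": 52,
        "Other Hebrew manuscript symbols": 7,
    },
    "Extra symbols for regional Jewish languages": {
        "Arabic": 4, "Berber": 0, "Persian": 0, "Tajik (Bukhari)": 2,
        "Tat": 2, "Neo-Aramaic (Kurdit)": 1, "Greek": 0, "French": 3,
        "Comtadin": 0, "Italian": 1, "Ladino": 2, "Yiddish": 3,
    },
    "Other pointing, reading, masoretic systems": {
        "Babylonian": 35, "Palestinian": 18, "Samaritan": 12,
    },
}

# Sum each subset, then sum the subsets for the overall figure.
subset_totals = {name: sum(rows.values()) for name, rows in NET_COUNTS.items()}
grand_total = sum(subset_totals.values())

for name, total in subset_totals.items():
    print(f"{name}: {total}")
print(f"Total Hebrew symbols found to date: {grand_total}")
```

The sums reproduce the subset totals of 90, 18, and 65 and the overall figure of 173.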

38. . TAGGER, Milon ʿIvri-Bukhari, Tel Aviv, 1960, passim.
39. H. HAARMAN, “Yiddish and the other Jewish languages in the Soviet Union”, in J. FISHMAN (ed.), Readings in the Sociology of Jewish languages, Leiden, 1985, p. 165.
40. The languages called “Kurdit” in Hebrew are actually Neo-Aramaic dialects, mostly from Kurdistan. See I. AVINERY, The Aramaic dialect of the Jews of Zakho, Jerusalem, 1988, p. V.
41. For symbols, see Y. SABAR, Targum de-Targum: an old Neo-Aramaic version of the Targum on Song of Songs, Wiesbaden, p. 9.
42. M. BANITT, Le glossaire de Bâle. Texte, Jerusalem, Académie Nationale des Sciences et des Lettres d'Israël, 1972, pp. IX, .
43. E. SABATIER, Chansons hébraïco-provençales des Juifs comtadins, Paris, 1927, pp. 11-12.
44. A. FREEDMAN, Italian texts in Hebrew characters: problems of interpretation, Wiesbaden, 1972, p. 123.
45. B. HARY, “Judeo-Spanish (Ladino)”, in DANIELS, cited above, p. 734.
46. H. ARONSON, “Yiddish”, in DANIELS, pp. 735-742.
47. P. KAHLE, Der masoretische Text des Alten Testaments nach der Überlieferung der babylonischen Juden, Leipzig, 1902, pp. 24, 34, 46-47.
48. M. DIETRICH, Neue palästinisch punktierte Bibelfragmente: Veröffentlicht und auf Text und Punktation hin untersucht, Leiden, 1968, p. 88* [Tafel II].
49. R. MACUCH, Grammatik des samaritanischen Hebräisch, Berlin, 1969, pp. 61-76.

This preliminary count of Hebrew symbols shows that the complete Hebrew character set will not fit into today's most common operating systems or frequently used software without special manipulation. Most standard software for electronic mail and Web pages can choose a particular character set of 256 symbols. However, in these standard systems, 128 spaces are already allocated for computer control codes, numbers, punctuation, and the English version of the Roman alphabet. This leaves only 128 spaces for Hebrew symbols, when the character set is designed in a standard fashion. If certain punctuation, mathematical, and other symbols are omitted, one can find perhaps 15 other spaces for Hebrew, for a total of 143 spaces for a bilingual Hebrew/English system. However, adding extra characters for French, Spanish, Italian, Swedish, Finnish, Norwegian, Afrikaans, etc., would leave fewer places for Hebrew.

This analysis suggests the possibility of designing several international, standard, mostly overlapping character sets for Hebrew. Such sets could be alternated in bilingual electronic mail programs or for Web design. For a single electronic mail message or one Web page, a writer would be able to choose the partial Hebrew character set most appropriate for his or her needs. In more complex and flexible software, such as word processors, alternation between character sets can be more easily automated. However, for material to be sent via e-mail or converted to a different computer format or hardware application, it is still easier to stay within one character set.

This technical history also suggests that today the Hebrew script family is at the stage the complete Roman alphabet family was 10-15 years ago: partially computerized, with a need for greater study of the symbol set and its collation50 issues.

Because the Hebrew alphabet family followed the early concept of the alphabet, without capital letters, in some ways it is easier to computerize and presents simpler multilingual possibilities, especially with regard to collation. This could be especially useful for lexical databases for Jewish interlinguistics or for the multilingual terminology required for the study of the Hebrew language and the masorah51.
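The space budget behind the proposal for several overlapping character sets can be worked through in a few lines. The Python sketch below assumes a one-byte character set of 256 positions, 128 of them taken by the standard English set, plus the 15 reclaimed positions estimated above. The partial sets listed are hypothetical groupings chosen for illustration and do not correspond to any published standard.

```python
# Assumptions (illustrative): a one-byte character set has 256 positions,
# the first 128 are already allocated, and about 15 more can be reclaimed.
CODE_POSITIONS = 256
ALLOCATED_POSITIONS = 128
RECLAIMED = 15
HEBREW_BUDGET = CODE_POSITIONS - ALLOCATED_POSITIONS + RECLAIMED

# Net symbol counts taken from the preliminary list.
NET = {
    "common": 90,
    "regional": 18,
    "babylonian": 35,
    "palestinian": 18,
    "samaritan": 12,
}

# Hypothetical partial sets; each one shares the 90 common symbols.
PARTIAL_SETS = {
    "Hebrew + regional languages": ["common", "regional"],
    "Hebrew + Babylonian pointing": ["common", "babylonian"],
    "Hebrew + Palestinian pointing": ["common", "palestinian"],
    "Hebrew + Samaritan pointing": ["common", "samaritan"],
    "Everything in one set": list(NET),
}


def size(groups):
    """Number of symbols needed for the given groups."""
    return sum(NET[g] for g in groups)


print(f"Positions available for Hebrew: {HEBREW_BUDGET}")
for name, groups in PARTIAL_SETS.items():
    needed = size(groups)
    if needed <= HEBREW_BUDGET:
        verdict = "fits"
    else:
        verdict = f"over by {needed - HEBREW_BUDGET}"
    print(f"{name:32} {needed:3d}  {verdict}")
```

Under these assumptions each partial set fits within the 143 available positions, while the full inventory of 173 symbols does not, which is the arithmetic behind alternating between overlapping sets.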

50. “Collation” means automatic alphabetizing or sorting, within all kinds of databases.
51. The author can be reached at [email protected]. She thanks her parents (Prof. and Mrs. E.R. Keown), sister, and brother-in-law (Dr. and Mrs. Brett Scott) for their financial support, her teachers (Profs. J. Levenson and J. Hackett), and the library staff at the Center for Judaic Studies (Philadelphia), the Library of Congress, the Kiev Judaica Collection, JTSA, and Georgetown University.