Hebrew Alphabets, Symbols and Computer Codes: History and Preliminary Tabulation
Elaine Renée KEOWN
Center for Judaic Studies, Philadelphia, Pennsylvania

The Ancient Hebrew alphabet was first computerized[2] in 1956 by Prof. Roberto Busa in Gallarate[3], Italy. Drawing on seven years of experience indexing Aquinas with early IBM punch card equipment[4], Busa produced an automated index of several published Dead Sea Scrolls written in Hebrew, Aramaic, and Nabatean[5]. By 1959 a second project, which would soon use computerized Hebrew, had started in Israel[6]. By the early 1960s researchers in France had begun writing more advanced software which included the computerized Tiberian teʿamim[7]. In Nancy, Gérard Weil became the first scholar to work with a computerized version of the Tanach[8]. His collaborator, Pierre Rivière, wrote the first algorithms to analyze the complex teʿamim patterns of the Leningrad Codex[9]. Weil's work led to the first center for studying electronic versions of the Tanach[10]. In the mid-1960s in Israel, at the Weizmann Institute and Bar Ilan, Profs. Aviezri Fraenkel and Yaacov Choueka started a large database of Hebrew responsa literature[11].

1. “Porting” is the process of converting any software product for use on a different hardware platform.
2. P. TASMAN, “Indexing the Dead Sea Scrolls by Electronic Literary Data Processing Methods”, IBM (Nov. 1958) [IBM Archives], p. 5.
3. The cards for the Hebrew were punched in Gallarate, soon the location of the Centro per l'Automazione dell'Analisi Letteraria; then they were flown to NYC, where they were sorted four times on an IBM 705 mainframe. See TASMAN, cited above, p. 5, and also “IBM and the Dead Sea Scrolls”, Business Machines (June 10, 1958), p. 4.
4. From 1887 to the 1970s, most calculating machines and computers used perforated paper tape or punched cards similar to those invented in Lyons for looms between 1725 and 1804. See J. MARGUIN, Histoire des instruments et machines à calculer, Paris, 1994, pp. 182, 191.
5. The Dead Sea Scrolls were written in the early 22-letter alphabet without final letters. The Nabatean was transliterated into Hebrew. See TASMAN, cited above, p. 5.
6. Writing in 1959 about the planned historical Hebrew dictionary, Ben-Hayyim observed, “it is worthwhile investigating whether we can employ automation, that is to say, making use of punch card machines or electronic computers”. See Z. BEN-HAYYIM, The Historical Dictionary of Hebrew Language: A Plan, Jerusalem, 1959, p. 14.
7. G. WEIL, Concordance de la cantilation des Premiers Prophètes: Josué, Juges, Samuel et Rois, Paris, 1982, p. XXIVff.
8. J.J. HUGHES, Bits, Bytes and Biblical Studies, Grand Rapids, 1987, p. 515.
9. See WEIL, cited above, p. XI.
10. Prof. Weil's research group became CATAB, the Centre d'Analyse et de Traitement Automatique de la Bible, later in Lyons.

Revue des Études juives, 161 (1-2), janvier-juin 2002, pp. 235-240

In the 1960s and 1970s, operating systems with character sets of 127 could be multilingual, but they were usually “monoscript”, including only Roman alphabet variations. During this time, the IBM standard code, EBCDIC, with a character set of 255, led to commercial standards with two scripts in Russia, Japan, Korea, Thailand, and other countries[12]. Later, in 1988, there was also a very creative Israeli standard, SI 960, with 127 characters, bilingual English/Hebrew, with only English capital letters[13].

By the 1980s, standard computer operating systems had character sets of 255 symbols to manipulate[14]. In the late 1980s, new bilingual international computer standards were proposed for about 40 languages, including Hebrew, Arabic, Russian, Greek, and 30 European languages written with variations of the Roman alphabet[15]. Different technical standards for Hebrew, for libraries and for general computer usage, were proposed by various countries.
At that point, some proposed international standards included the Tiberian vowels and other symbols[16]. However, the first general, international, bilingual Latin/Hebrew computer code for 8-bit bytes to be published, ISO 8859-8, included only the basic Hebrew consonants, despite having adequate code space for Tiberian vowels and other Masoretic symbols[17].

In the United States, in 1980, an American academic Hebrew computer code, now usually called “Michigan-Claremont-Westminster” (after the three institutions where it was developed or improved), was begun by Profs. H.V.D. Parunak and R.E. Whitaker as they produced a new electronic version of the Leningrad Codex[18]. The original Michigan-Claremont text was later corrected against the codex by Prof. A. Groves of Westminster Seminary. This newest electronic version of Leningrad was used in collaborative or individual projects in Jerusalem, Bar Ilan, Philadelphia, Amsterdam, Utrecht, and also recently in Bielefeld[19].

Although widely used and an excellent computer code, Michigan-Claremont-Westminster never became a public, internationally recognized character set like ISO-7, ASCII, or the ISO 8859-8 bilingual Latin/Hebrew code. The latest version of the Michigan-Claremont-Westminster text will be the electronic Biblia Hebraica Quinta[20].

11. Y. CHOUEKA, “Computerized full-text retrieval systems and research in the humanities: the responsa project”, Computers and the Humanities 14, 1980, pp. 153-169.
12. J. CLEWS, Language automation worldwide: the development of character set standards, Harrogate, 1988, pp. 36-37.
13. Ibid., p. 94.
14. Ibid., p. 50.
15. Ibid., pp. 16-17. See also V. ILLINGWORTH (ed.), Dictionary of Computing, Oxford, 1996, pp. 269-270.
16. Ibid., pp. 96-97.
17. Ibid., p. 95.
18. HUGHES, op. cit., pp. 499-519.
19. C.-J. DOEDENS, Text databases: one database model and several retrieval languages, Amsterdam, 1994, pp. 262-263.
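The practical effect of ISO 8859-8 carrying only the consonants, as described above, can be illustrated with a short sketch. It is not taken from any of the projects discussed here. It assumes a modern Python interpreter and relies on its standard "iso8859_8" codec and "unicodedata" module; the helper names (fits_iso_8859_8, unencodable) and the sample word are hypothetical and chosen only for illustration.

```python
import unicodedata

# Sample word "shalom", written with escapes to avoid bidirectional display issues.
# Bare consonants: shin, lamed, vav, final mem.
CONSONANTS = "\u05e9\u05dc\u05d5\u05dd"
# Same word with Tiberian points: shin dot and qamats after the shin,
# holam after the vav (each point is a separate code following its consonant).
POINTED = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"


def fits_iso_8859_8(text: str) -> bool:
    """Return True if every character of text has a code in ISO 8859-8."""
    try:
        text.encode("iso8859_8")
        return True
    except UnicodeEncodeError:
        return False


def unencodable(text: str) -> list[str]:
    """List the Unicode names of the characters that ISO 8859-8 cannot represent."""
    return [unicodedata.name(ch) for ch in text if not fits_iso_8859_8(ch)]


if __name__ == "__main__":
    print(fits_iso_8859_8(CONSONANTS))  # consonant-only text is representable
    print(fits_iso_8859_8(POINTED))     # pointed text is not
    print(unencodable(POINTED))         # the vowel points and the shin dot
```

Under these assumptions, the consonantal text fits in the 8-bit code while the pointed text does not, which is the gap that the later, more complete codes were meant to close.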
In the 1990s, more complete public computer codes for Hebrew were produced. In 1996, the Israeli standards group, the SII, produced a new code, SI 1311.2, that included the Tiberian teʿamim[21].

However, the most recent international computer code has been developed in a collaboration between the ISO, based in Geneva, and the Unicode Consortium, based in Silicon Valley. ISO 10646, or Unicode, contains 65,536 code spaces. This is 256 times the number of spaces used in 1970s/80s operating systems. Unicode was initially designed in the late 1980s by a group of experts, many from the Chinese Language Computer Society. It allocates 22,000 spaces to “CJK” (Chinese, Japanese, and Korean) ideographs. The Unicode 2.0 Hebrew section now includes the basic Hebrew alphabet, Tiberian teʿamim, and other Masoretic marks used in the Leningrad Codex, plus vowel digraphs for literary Yiddish, for a total of 87 symbols[22].

In 1997 in Greifswald, Germany, the Leningrad Codex in its Michigan-Claremont-Westminster electronic version was converted to a Unicode 2.0 format to become part of a parallel corpus (Leningrad-Vulgate-Septuagint in parallel plus Greek NT) with advanced retrieval functions[23]. This multi-database system is called QUEST.

However, the complete character set for Hebrew script includes at least 173 symbols or letters: variant Hebrew alphabets for Jewish languages or dialects written in Hebrew plus four sets of Biblical annotations (three regional Masorahs and the Samaritan set). Other unusual Jewish languages used Georgian[24], Malayalam[25], or Turkish runic[26] scripts.

20. See http://www.unifr.ch/bif/Chapters/bh5.html, middle paragraphs.
21. See: http://www.qsm.co.il/Hebrew/stdisr.htm.
22. See The Unicode Consortium, The Unicode Standard, version 2.0, Reading, 1996, pp. 7-60 and 7-61. For some reason, the code includes geresh and gershayim twice, but I only count them once here.
23. W.-D. SYRING, “QUEST 2 - Computergestützte Philologie und Exegese”, Zeitschrift für Althebräistik 11, 1998, pp. 85-89.
24. For a photo of a bilingual Georgian-Hebrew haggada, see M. NAISHTAT, Yehude Gruzyah, Tel Aviv, 1970, p. 64.
25. R. DANIEL, Ruby of Cochin: An Indian Jewish woman remembers, Philadelphia, 1995, pp. 146, 151-54, 174-77.
26. S. PLETNEVA, Khazary, Moskva, 1976, p. 33ff.

A preliminary[27] tabulation[28] of symbols for Hebrew scripts follows:

HEBREW SYMBOLS: A PRELIMINARY LIST

                                                     Complete symbol   Net         First on
                                                     count[29]         count[30]   computer

Ancient or common symbols
  22-letter alphabet                                 22                22          1956
  Ancient epigraphic punctuation[31]                 2                 2
  Qumrani dots (upper middle and lower middle)[32]   2                 2           early 1960s
  Final letters                                      5                 5           early 1960s
  Tiberian pointing and other Masoretic apparatus    53                52          early 1960s
  Other Hebrew manuscript[33] symbols                7                 7
  Net subset totals                                                    90

Extra symbols for regional Jewish languages
  Arabic[34]                                         6                 4           late 1960s[35]
  Berber[36]                                         1                 0
  Persian[37]                                        3                 0

27. I had difficulty finding material on symbols for several languages, e.g., Shuadit (standard Judeo-Provençal, as opposed to Comtadin) and Krimchak (a Rabbanite Kipchak Turkish language, similar to Crimean Tatar). For Krimchak, see W. MOSKOVITZ, “Krimchak Language”, Encyclopaedia Judaica Yearbook 1988/9, Jerusalem, 1989, p. 371.
28. Counting Hebrew symbols for computerization is an art, not a science. Except for digraphs and trigraphs, this count is minimal. It uses the “decomposed” method of counting Hebrew symbols, counting each diacritic once instead of multiple times with each pointed consonant. The “decomposed” approach has been successfully used in many library systems. However, the competing “composed” approach, where each digraph or “composed” consonant with diacritic has a code number, may have certain advantages for complex computer software. Complex software includes pointed Tanach texts or piyyuṭim, lexical databases, and, possibly, Web pages.
29. This number includes the complete set of extra symbols found in the subset.
30. This number is the net number of symbols after subtracting those found in more than one category.
31. Y. AHARONI, Arad Inscriptions, Jerusalem, 1986, p.