Unicode Encoding and Online Data Access: The TITUS Project
ALCTS "Library Catalogs and Non-Roman Scripts" Program
2004 ALA Annual Conference, Orlando
www.ala.org/alcts

Unicode Encoding and Online Data Access
Ralf Gehrke / Jost Gippert
The TITUS Project („Thesaurus indogermanischer Text- und Sprachmaterialien“) (since 1987/1993)

Scope of the TITUS project:
• Electronic retrieval engine covering the textual heritage of all ancient Indo-European languages
• Present retrieval task:
  – Documentation of the usage of all word forms occurring in the texts, in their respective contexts
• Survey of the parts of the text database:
  – http://titus.uni-frankfurt.de/texte/texte2.htm

Data formats (since 1995):
• Text formats:
  – WordCruncher text format (8-bit)
  – HTML (UTF-8, Unicode 4.0)
  – (Plain 7-bit ASCII format)
• Database format:
  – MS Access (relational, Unicode-based)
  – Retrieval via SQL

Original scripts covered:
• Latin (with all kinds of diacritics), incl. variants*
• Greek
• Slavic (Cyrillic and Glagolitic*)
• Armenian
• Georgian
• Devanāgarī
• Other Brāhmī scripts (Tocharian, Khotanese)*
• Avestan*
• Middle Persian (Pahlavī)*
• Manichean*
• Arabic (incl. Persian)
• Runic
• Ogham
• and many more
* not yet encodable (as such) in Unicode

Example 1a: Donelaitis (Lithuanian: formatted text incl. diacritics: 8-bit version)

Example 1b: Donelaitis (Lithuanian: formatted text incl. diacritics: Unicode version)

Example 2a: Catechism (Old Prussian: formatted text incl. diacritics: 8-bit version, special TITUS font)

Example 2b: Catechism (Old Prussian: formatted text incl.
diacritics: Unicode version, no special font)

Example 3a: Codex Suprasliensis (Old Church Slavonic Cyrillic text in tentative Unicode encoding: TITUS font)

Example 3b: Kiev folia (Old Church Slavonic Glagolitic text in substitutional Unicode encoding: TITUS font)

Example 4a: Rigveda (Sanskrit text in Unicode encoding, Roman transcription)

Example 4b: Rigveda (Same, other MS Windows 2000 font)

Example 4c: Rigveda (Sanskrit text in Unicode encoding, Devanāgarī script: MS Windows 2000 font)

Example 4d: Rigveda (Same: other MS Windows 2000 font)

Example 5a: Vīs u Rāmīn (Early New Persian text in Unicode encoding, Roman transcription)

Example 5b: Vīs u Rāmīn (Early New Persian text in Unicode encoding, original script: MS Windows 2000 font)

Example 5c: Vīs u Rāmīn (Same, other MS Windows 2000 font)

Basis of online retrieval:
• Multilevel referencing system defining
  – Text structure levels
    • Texts
    • Chapters
    • Paragraphs
  – Representation structure levels
    • Pages
    • Lines
    • Formatting types (headers, catchwords etc.)
• Language / script specific encoding

Query preliminaries:
• Manual query entry via form:
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.htm
• Features:
  – Preselection of languages / varieties
  – Text independent search
  – Preselection of query type
  – Combined search of up to 4 word forms
  – 7-bit based manual entry of word forms

Query form: Language preselection

Query form: Type preselection

Query form: Combined search

Query form: Result

Query preliminaries:
• User input feature:
  – alternate 7-bit (ASCII) based manual entry of word forms
  – purpose: cross-platform compatibility
  – problem: unavailability and / or inapplicability of “national” keyboards
  – precondition: “English” keyboard available and accessible everywhere

Query form: Character input

TITUS bibliography: An example

Query form: Character input
• Example 1: Latin special characters

Query form: Character input
• Example 2: Ancient Greek characters

Query form: Character input
• Example 3: Slavonic (Cyrillic) characters

Query preliminaries:
• Data transfer feature:
  – 7-bit (ASCII) based transmission of data in query strings
  – purpose: secure cross-platform compatibility
  – problem: unavailability and / or inapplicability of Unicode data in data transmission via HTTP
  – precondition: representation of non-ASCII characters by “hex
strings”

Query form: Character input
• Example 1: Latin special characters
  – Sanskrit īḷe represented by 2B01371Ee
    • ī = U+012B
    • ḷ = U+1E37
    • e = ASCII e
  – N.B. Wherever a precomposed character is encodable as such, this is used in the text database
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=22035&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=A

Query form: Character input

Query form: Character input
• Example 2: “Romanized” Devanāgarī
  – Sanskrit īḷe also represented by 2B01371Ee
    • ī = U+012B
    • ḷ = U+1E37
    • e = ASCII e
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=23059&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=D

Query form: Character input

Query form: Character input
• Example 3: Greek characters
  – Greek ἄνδρα represented by a)’ndra or 041FBD03B403C103B103
    • ἄ = U+1F04
    • ν = U+03BD
    • δ = U+03B4
    • ρ = U+03C1
    • α = U+03B1
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=041FBD03B403C103B103&LCPL=0&TCPL=0&C=H

Query form: Character input

Query form: Character input
• Example 4: Optional disregard of diacritics
  – Greek ανδρα represented by andra or B103BD03B403C103B103
    • α = U+03B1
    • ν = U+03BD
    • δ = U+03B4
    • ρ = U+03C1
    • α = U+03B1
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=B103BD03B403C103B103&LCPL=0&TCPL=0&C=H

Database properties
• Unicode specific treatment of diacritics
  – vs.
• Software specific treatment of diacritics
  – vs.
• TITUS specific treatment of diacritics

Database properties
• Unicode specific treatment of diacritics:
  – Precomposed characters
    • vs.
  – Sequences of characters and diacritics
• Correct treatment must be ensured by software

Database properties
• Software specific treatment of diacritics (MS Access 2000 / XP):
  – SQL query for <a> yields <a, á, à, â> etc.
    • while
  – SQL query for <á> yields only <á>
• Special functions depending on modern languages

Database properties
• TITUS language specific treatment of diacritics:
  – SQL query for Lithuanian <s$> yields <š, sch, sz> etc.
    • while
  – SQL query for <sch> yields only <sch>
• Special functions depending on cross-historical orthographic properties of languages

Database properties: Example SARDS
• SARDS = South Asia Research Documentation Services
• Part 1 covers the years 1789 – 1999 and contains more than 50 000 citations of research papers (no monographs) on Indology and South Asia Studies

Tustep encoding: Some diacritics

SARDS in Tustep encoding

SARDS in Unicode encoding

A question to librarians
• How is Unicode changing the cataloguing of books?
• Are authors and titles entered in original script or in transcriptions?
• Or will both methods be used in parallel?
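The contrast drawn under "Database properties" between precomposed characters and sequences of base character plus diacritic, and the lookup behaviour reported for MS Access, can be illustrated with a small Python sketch. It uses only the standard unicodedata module. The function names and the matching rule in matches() are hypothetical and merely model the behaviour described on the slides; they are not the actual MS Access or TITUS implementation.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """NFC: combine base character + diacritic into one code point where one exists."""
    return unicodedata.normalize("NFC", text)


def to_sequence(text: str) -> str:
    """NFD: split precomposed characters into base character + combining marks."""
    return unicodedata.normalize("NFD", text)


def strip_diacritics(text: str) -> str:
    """Drop all combining marks, then recompose whatever remains."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


def matches(query: str, candidate: str) -> bool:
    """Hypothetical rule modelled on the slide: a query without diacritics
    matches every diacritic variant; a query with diacritics matches exactly."""
    q = to_precomposed(query)
    c = to_precomposed(candidate)
    if strip_diacritics(q) == q:      # query carries no diacritics
        return strip_diacritics(c) == q
    return q == c                     # query carries diacritics: exact match only


if __name__ == "__main__":
    forms = ["a", "\u00e1", "\u00e0", "\u00e2", "a\u0301"]
    print([f for f in forms if matches("a", f)])        # all five forms
    print([f for f in forms if matches("\u00e1", f)])   # only the two encodings of á
```

Normalizing both sides before comparison is what makes the precomposed form and the sequence form of the same letter compare as equal; without that step the two encodings of á would be treated as different strings.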
Bibliography in original script and transcription: Example

UniTeNS
• UniTeNS = Unified Text Numbering System
• A new proposal for an identification system for texts
• Each text is assigned a 48-digit number that reflects author, language, era, type of text etc.
• This number is independent of publication in print or electronic form or manuscripts

UniTeNS
• All texts should be catalogued according to a complete classification scheme
• A central institution should keep track of publications of each text in printed, electronic or other form
• All publishers of texts should notify this institution about each publication

Text numbering system: Example
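The "hex string" representation used for the 7-bit data transfer in the query examples (2B01371Ee for īḷe, 041FBD03B403C103B103 for ἄνδρα) can be sketched in Python as follows. The rule is inferred from those examples only: ASCII characters pass through unchanged, and every other character is written as four uppercase hex digits with the low byte first. The function names are hypothetical, and the meaning of the URL parameters LXLANG, LCPL, TCPL and C is not documented on the slides; their values are simply copied from the example URLs.

```python
from urllib.parse import urlencode

# Base address copied from the example URLs on the slides.
BASE_URL = "http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp"


def to_hex_string(word: str) -> str:
    """Encode a word in the inferred 'hex string' form.

    Assumption: ASCII stays as-is; other characters up to U+FFFF become
    four uppercase hex digits, low byte before high byte.
    """
    parts = []
    for ch in word:
        code_point = ord(ch)
        if code_point < 0x80:
            parts.append(ch)
        elif code_point <= 0xFFFF:
            low, high = code_point & 0xFF, code_point >> 8
            parts.append("{:02X}{:02X}".format(low, high))
        else:
            raise ValueError("character outside the range covered by the examples")
    return "".join(parts)


def build_query_url(lang_code: int, word: str, c_value: str) -> str:
    """Assemble a query URL shaped like the slide examples (parameter meanings assumed)."""
    params = {
        "LXLANG": lang_code,
        "LXWORD": to_hex_string(word),
        "LCPL": 0,
        "TCPL": 0,
        "C": c_value,
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(to_hex_string("\u012b\u1e37e"))            # 2B01371Ee
    print(build_query_url(22035, "\u012b\u1e37e", "A"))
```

Because the resulting string contains only ASCII letters and digits, it passes through URL encoding unchanged, which matches the stated purpose of secure cross-platform transmission via HTTP.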