ALCTS Program "Library Catalogs and Non-Roman Scripts", 2004 ALA Annual Conference, Orlando
Unicode Encoding
and Online Data Access
Ralf Gehrke / Jost Gippert
The TITUS Project („Thesaurus indogermanischer Text- und Sprachmaterialien“)
(since 1987/1993)
www.ala.org/alcts
Scope of the TITUS project:
• Electronic retrieval engine covering the textual heritage of all ancient Indo-European languages
• Present retrieval task:
– Documentation of the usage of all word forms occurring in the texts, in their respective contexts
• Survey of the parts of the text database:
– http://titus.uni-frankfurt.de/texte/texte2.htm
Data formats (since 1995):
• Text formats:
– WordCruncher Text format (8-bit)
– HTML (UTF-8 Unicode 4.0)
– (Plain 7-bit ASCII format)
• Database format:
– MS Access (relational, Unicode-based)
– Retrieval via SQL
Original Scripts Covered:
• Latin (with all kinds of diacritics), incl. variants*
• Greek
• Slavic (Cyrillic and Glagolitic*)
• Armenian
• Georgian
• Devanāgarī
• Other Brāhmī scripts (Tocharian, Khotanese)*
• Avestan*
• Middle Persian (Pahlavī)*
• Manichean*
• Arabic (incl. Persian)
• Runic
• Ogham
• and many more
* not yet encodable (as such) in Unicode
Example 1a: Donelaitis
(Lithuanian: formatted text incl. diacritics: 8-bit version)
Example 1b: Donelaitis
(Lithuanian: formatted text incl. diacritics: Unicode version)
Example 2a: Catechism
(Old Prussian: formatted text incl. diacritics: 8-bit version, special TITUS font)
Example 2b: Catechism
(Old Prussian: formatted text incl. diacritics: Unicode version, no special font)
Example 3a: Codex Suprasliensis
(Old Church Slavonic Cyrillic text in tentative Unicode encoding: TITUS font)
Example 3b: Kiev folia
(Old Church Slavonic Glagolitic text in substitutional Unicode encoding: TITUS font)
Example 4a: Rigveda
(Sanskrit text in Unicode encoding, Roman transcription)
Example 4b: Rigveda
(Same, other MS Windows 2000 font)
Example 4c: Rigveda
(Sanskrit text in Unicode encoding, Devanāgarī script: MS Windows 2000 font)
Example 4d: Rigveda
(Same: other MS Windows 2000 font)
Example 5a: Vīs u Rāmīn
(Early New Persian text in Unicode encoding, Roman transcription)
Example 5b: Vīs u Rāmīn
(Early New Persian text in Unicode encoding, original script: MS Windows 2000 font)
Example 5c: Vīs u Rāmīn
(Same, other MS Windows 2000 font)
Basis of online retrieval:
• Multilevel referencing system defining
– Text structure levels
  • Texts
  • Chapters
  • Paragraphs
– Representation structure levels
  • Pages
  • Lines
  • Formatting types (headers, catchwords etc.)
  • Language / script specific encoding
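A multilevel reference of this kind can be pictured as a record attached to every word-form occurrence. The following is only a minimal sketch: the class names, field names, and sample values are hypothetical and are not taken from the TITUS database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reference:
    # Text structure levels
    text: str
    chapter: int
    paragraph: int
    # Representation structure levels
    page: int
    line: int
    formatting: str  # e.g. "body", "header", "catchword"
    encoding: str    # language / script specific encoding label


@dataclass(frozen=True)
class Occurrence:
    word_form: str
    ref: Reference


def find_references(index: List[Occurrence], word_form: str) -> List[Reference]:
    """Return the reference of every occurrence of word_form."""
    return [occ.ref for occ in index if occ.word_form == word_form]


# Invented sample data, for illustration only.
sample_index = [
    Occurrence("abc", Reference("Text A", 1, 1, 1, 1, "body", "latin")),
    Occurrence("abd", Reference("Text A", 1, 1, 1, 2, "body", "latin")),
    Occurrence("abc", Reference("Text A", 2, 5, 14, 3, "header", "latin")),
]

for ref in find_references(sample_index, "abc"):
    print(ref.text, ref.chapter, ref.paragraph, ref.page, ref.line, ref.formatting)
```

Keeping both groups of levels in one record means a hit can be cited by its logical position (text, chapter, paragraph) and by its physical position (page, line) at the same time.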
Query preliminaries:
• Manual query entry via form: – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.htm
• Features:
– Preselection of languages / varieties
– Text independent search
– Preselection of query type
– Combined search of up to 4 word forms
– 7-bit based manual entry of word forms
Query form: Language preselection
Query form: Type preselection
Query form: Combined search
Query form: Result
Query preliminaries:
• User input feature:
– alternate 7-bit (ASCII) based manual entry of word forms
– purpose: cross-platform compatibility
– problem: unavailability and / or inapplicability of “national” keyboards
– precondition: “English” keyboard available and accessible everywhere
Query form: Character input
TITUS bibliography: An example
Query form: Character input
• Example 1: Latin special characters
Query form: Character input
• Example 2: Ancient Greek characters
Query form: Character input
• Example 3: Slavonic (Cyrillic) characters
Query preliminaries:
• Data transfer feature:
– 7-bit (ASCII) based transmission of data in query strings
– purpose: secure cross-platform compatibility
– problem: unavailability and / or inapplicability of Unicode data in data transmission via HTTP
– precondition: representation of non-ASCII characters by “hex strings”
Query form: Character input
• Example 1: Latin special characters
– Sanskrit īḷe represented by 2B01371Ee
  • ī = U+012B
  • ḷ = U+1E37
  • e = ASCII e
– N.B. Wherever a precomposed character is encodable as such, this is used in the text database
– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=22035&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=A
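The hex string 2B01371Ee is consistent with a simple rule: ASCII characters are passed through unchanged, and every other character is written as its 16-bit code unit in hexadecimal with the low byte first. The sketch below reproduces the string and the query URL under that assumption. The function names are hypothetical, the meaning of the URL parameters is not documented on the slide (they are copied as they appear), and the treatment of characters outside the Basic Multilingual Plane is a guess.

```python
from urllib.parse import urlencode

BASE_URL = "http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp"


def to_hex_string(word: str) -> str:
    """Encode a word form as a 7-bit 'hex string'.

    Assumption: ASCII stays as-is; any other character becomes the
    upper-case hex of its UTF-16 little-endian bytes (low byte first).
    """
    parts = []
    for ch in word:
        if ord(ch) < 0x80:
            parts.append(ch)
        else:
            parts.append(ch.encode("utf-16-le").hex().upper())
    return "".join(parts)


def query_url(lang_id: int, word: str, c: str) -> str:
    """Build a query URL with the parameter names shown on the slide."""
    params = {
        "LXLANG": lang_id,
        "LXWORD": to_hex_string(word),
        "LCPL": 0,
        "TCPL": 0,
        "C": c,
    }
    return BASE_URL + "?" + urlencode(params)


# U+012B (i with macron), U+1E37 (l with dot below), ASCII e
word = "\u012b\u1e37e"
print(to_hex_string(word))
print(query_url(22035, word, "A"))
```

Only the encoding direction is sketched. Decoding such a string is ambiguous in general, since an ASCII letter such as e is also a valid hex digit, and the slides do not say how the server resolves this.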
Query form: Character input
• Example 2: “Romanized” Devanāgarī
– Sanskrit īḷe also represented by 2B01371Ee
  • ī = U+012B
  • ḷ = U+1E37
  • e = ASCII e
– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=23059&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=D
Query form: Character input
• Example 3: Greek characters
– Greek ἄνδρα represented by a)’ndra or 041FBD03B403C103B103
  • ἄ = U+1F04
  • ν = U+03BD
  • δ = U+03B4
  • ρ = U+03C1
  • α = U+03B1
– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=041FBD03B403C103B103&LCPL=0&TCPL=0&C=H
Query form: Character input
• Example 4: Optional disregard of diacritics
– Greek ανδρα represented by andra or B103BD03B403C103B103
  • α = U+03B1
  • ν = U+03BD
  • δ = U+03B4
  • ρ = U+03C1
  • α = U+03B1
– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=B103BD03B403C103B103&LCPL=0&TCPL=0&C=H
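One common way to disregard diacritics is to decompose each character and drop the combining marks, which turns U+1F04 into plain U+03B1. The sketch below shows that technique with Python's standard unicodedata module. It is an illustration only; the slides do not say how the TITUS retrieval engine implements this option, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in a predictable normalization form.
    return unicodedata.normalize("NFC", base_only)


with_diacritics = "\u1f04\u03bd\u03b4\u03c1\u03b1"  # alpha with psili and oxia + ndra
without = strip_diacritics(with_diacritics)
print(without)
print([f"U+{ord(ch):04X}" for ch in without])
```

A query normalized this way can be matched against equally normalized word forms, so that a search without accents and breathings still finds the accented form.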
Data base properties
• Unicode specific treatment of diacritics
– vs.
• Software specific treatment of diacritics
– vs.
• TITUS specific treatment of diacritics
Data base properties
• Unicode specific treatment of diacritics:
– Precomposed characters
• vs.
– Sequences of characters and diacritics
• Correct treatment must be guaranteed by the software
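The difference between the two Unicode representations matters for retrieval because a precomposed character and the equivalent sequence of base letter plus combining diacritic are different code point strings. A minimal sketch of how software can treat them as equal, using canonical normalization from Python's standard library (the helper name is hypothetical):

```python
import unicodedata

precomposed = "\u1e37"  # LATIN SMALL LETTER L WITH DOT BELOW
sequence = "l\u0323"    # l followed by COMBINING DOT BELOW


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == sequence)           # raw comparison of code points
print(same_text(precomposed, sequence))  # comparison after normalization
```

Normalizing both the stored text and the query to the same form (here NFC, which prefers precomposed characters where they exist) avoids missed matches.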
Data base properties
• Software specific treatment of diacritics (MS Access 2000 / XP):
• while
– SQL query for <á> yields only <á>
• Special functions depending on modern languages
Data base properties
• TITUS language specific treatment of diacritics:
– SQL query for Lithuanian <š> yields <š, sch, sz> etc.
• while
– SQL query for
• Special functions depending on cross-historical orthographic properties of languages
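A language-specific treatment of this kind can be approximated by expanding a query into all historical spellings before it is sent to the database. The sketch below assumes that the Lithuanian query grapheme is š with the variants š, sch, sz named on the slide; the table name, the sample word forms, and the use of SQLite in place of MS Access are illustrative assumptions, not the TITUS implementation.

```python
import sqlite3
from itertools import product

# Orthographic variants per grapheme (only the set named on the slide).
VARIANTS = {"\u0161": ["\u0161", "sch", "sz"]}


def spellings(word: str) -> list:
    """Expand a word into every combination of variant spellings."""
    options = [VARIANTS.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def search(conn: sqlite3.Connection, word: str) -> list:
    """Return stored word forms matching any variant spelling of word."""
    forms = spellings(word)
    placeholders = ", ".join("?" for _ in forms)
    sql = f"SELECT form FROM word_forms WHERE form IN ({placeholders})"
    return [row[0] for row in conn.execute(sql, forms)]


# Invented sample data: one word in three spellings plus a non-match.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_forms (form TEXT)")
conn.executemany(
    "INSERT INTO word_forms VALUES (?)",
    [("\u0161aka",), ("szaka",), ("schaka",), ("saka",)],
)

print(sorted(search(conn, "\u0161aka")))
```

Expanding the query rather than rewriting the stored text leaves the original orthography of each edition untouched.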
Data base properties: Example
SARDS
• SARDS = South Asia Research Documentation Services
• Part 1 covers the years 1789–1999 and contains more than 50,000 citations of research papers (no monographs) on Indology and South Asia Studies
Tustep Encoding: Some diacritics
SARDS in Tustep encoding
SARDS in Unicode encoding
A question to librarians
• How is Unicode changing the cataloguing of books?
• Are authors and titles entered in original script or in transcriptions?
• Or will both methods be used in parallel?
Bibliography in original script and transcription: Example
UniTeNS
• UniTeNS = Unified Text Numbering System
• A new proposal for an identification system for texts
• Each text is assigned a 48-digit number that reflects author, language, era, type of text, etc.
• This number is independent of whether the text exists in print, in electronic form, or in manuscript
UniTeNS
• All texts should be catalogued according to a complete classification scheme
• A central institution should keep track of publications of each text in printed, electronic or other form
• All publishers of texts should notify this institution about each publication
Text numbering system: Example