Unicode Encoding the TITUS Project

ALCTS "Library Catalogs and Non-Roman Scripts" Program
2004 ALA Annual Conference, Orlando
www.ala.org/alcts

Unicode Encoding and Online Data Access
Ralf Gehrke / Jost Gippert
The TITUS Project („Thesaurus indogermanischer Text- und Sprachmaterialien“) (since 1987/1993)

Scope of the TITUS project:
• Electronic retrieval engine covering the textual heritage of all ancient Indo-European languages
• Present retrieval task:
  – Documentation of the usage of all word forms occurring in the texts, in their respective contexts
• Survey of the parts of the text database:
  – http://titus.uni-frankfurt.de/texte/texte2.htm

Data formats (since 1995):
• Text formats:
  – WordCruncher Text format (8-bit)
  – HTML (UTF-8 Unicode 4.0)
  – (Plain 7-bit ASCII format)
• Database format:
  – MS Access (relational, Unicode-based)
  – Retrieval via SQL

Original scripts covered:
• Latin (with all kinds of diacritics), incl. variants*
• Greek
• Slavic (Cyrillic and Glagolitic*)
• Armenian
• Georgian
• Devanāgarī
• Other Brāhmī scripts (Tocharian, Khotanese)*
• Avestan*
• Middle Persian (Pahlavī)*
• Manichean*
• Arabic (incl. Persian)
• Runic
• Ogham
• and many more
* not yet encodable (as such) in Unicode

Example 1a: Donelaitis (Lithuanian: formatted text incl. diacritics: 8-bit version)
Example 1b: Donelaitis (Lithuanian: formatted text incl. diacritics: Unicode version)
Example 2a: Catechism (Old Prussian: formatted text incl. diacritics: 8-bit version, special TITUS font)
Example 2b: Catechism (Old Prussian: formatted text incl. diacritics: Unicode version, no special font)
Example 3a: Codex Suprasliensis (Old Church Slavonic Cyrillic text in tentative Unicode encoding: TITUS font)
Example 3b: Kiev folia (Old Church Slavonic Glagolitic text in substitutional Unicode encoding: TITUS font)
Example 4a: Rigveda (Sanskrit text in Unicode encoding, Roman transcription)
Example 4b: Rigveda (Same, other MS Windows 2000 font)
Example 4c: Rigveda (Sanskrit text in Unicode encoding, Devanāgarī script: MS Windows 2000 font)
Example 4d: Rigveda (Same: other MS Windows 2000 font)
Example 5a: Vīs u Rāmīn (Early New Persian text in Unicode encoding, Roman transcription)
Example 5b: Vīs u Rāmīn (Early New Persian text in Unicode encoding, original script: MS Windows 2000 font)
Example 5c: Vīs u Rāmīn (Same, other MS Windows 2000 font)

Basis of online retrieval:
• Multilevel referencing system defining
  – Text structure levels
    • Texts
    • Chapters
    • Paragraphs
  – Representation structure levels
    • Pages
    • Lines
    • Formatting types (headers, catchwords etc.)
• Language / script specific encoding

Query preliminaries:
• Manual query entry via form:
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.htm
• Features:
  – Preselection of languages / varieties
  – Text-independent search
  – Preselection of query type
  – Combined search of up to 4 word forms
  – 7-bit based manual entry of word forms

Query form: Language preselection
Query form: Type preselection
Query form: Combined search
Query form: Result

Query preliminaries:
• User input feature:
  – alternate 7-bit (ASCII) based manual entry of word forms
  – purpose: cross-platform compatibility
  – problem: unavailability and / or inapplicability of “national” keyboards
  – precondition: “English” keyboard available and accessible everywhere

Query form: Character input
TITUS bibliography: An example

Query form: Character input
• Example 1: Latin special characters
• Example 2: Ancient Greek characters
• Example 3: Slavonic (Cyrillic) characters

Query preliminaries:
• Data transfer feature:
  – 7-bit (ASCII) based transmission of data in query strings
  – purpose: secure cross-platform compatibility
  – problem: unavailability and / or inapplicability of Unicode data in data transmission via HTTP
  – precondition: representation of non-ASCII characters by “hex strings”

Query form: Character input
• Example 1: Latin special characters
  – Sanskrit īḷe represented by 2B01371Ee
    • ī = U+012B
    • ḷ = U+1E37
    • e = ASCII e
  – N.B. Wherever a precomposed character is encodable as such, this is used in the text database
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=22035&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=A

Query form: Character input
• Example 2: “Romanized” Devanāgarī
  – Sanskrit also represented by 2B01371Ee
    • ī = U+012B
    • ḷ = U+1E37
    • e = ASCII e
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=23059&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=D

Query form: Character input
• Example 3: Greek characters
  – Greek ἄνδρα represented by a)’ndra or 041FBD03B403C103B103
    • ἄ = U+1F04
    • ν = U+03BD
    • δ = U+03B4
    • ρ = U+03C1
    • α = U+03B1
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=041FBD03B403C103B103&LCPL=0&TCPL=0&C=H

Query form: Character input
• Example 4: Optional disregard of diacritics
  – Greek ανδρα represented by andra or B103BD03B403C103B103
    • α = U+03B1
    • ν = U+03BD
    • δ = U+03B4
    • ρ = U+03C1
    • α = U+03B1
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=B103BD03B403C103B103&LCPL=0&TCPL=0&C=H

Database properties
• Unicode specific treatment of diacritics
  – vs.
• Software specific treatment of diacritics
  – vs.
• TITUS specific treatment of diacritics

Database properties
• Unicode specific treatment of diacritics:
  – Precomposed characters
    • vs.
  – Sequences of characters and diacritics
• Correct treatment must be ensured by the software

Database properties
• Software specific treatment of diacritics (MS Access 2000 / XP):
  – SQL query for <a> yields <a, á, à, â> etc.
    • while
  – SQL query for <á> yields only <á>
• Special functions depending on modern languages

Database properties
• TITUS language specific treatment of diacritics:
  – SQL query for Lithuanian <s$> yields <š, sch, sz> etc.
    • while
  – SQL query for <sch> yields only <sch>
• Special functions depending on cross-historical orthographic properties of languages

Database properties: Example SARDS
• SARDS = South Asia Research Documentation Services
• Part 1 covers the years 1789 – 1999 and contains more than 50 000 citations of research papers (no monographs) on Indology and South Asia Studies

Tustep Encoding: Some diacritics
SARDS in Tustep encoding
SARDS in Unicode encoding

A question to librarians
• How is Unicode changing the cataloguing of books?
• Are authors and titles entered in original script or in transcriptions?
• Or will both methods be used in parallel?

Bibliography in original script and transcription: Example

UniTeNS
• UniTeNS = Unified Text Numbering System
• A new proposal for an identification system for texts
• Each text is assigned a 48-digit number that reflects author, language, era, type of text etc.
• This number is independent of whether the text is published in print, in electronic form, or in manuscript

UniTeNS
• All texts should be catalogued according to a complete classification scheme
• A central institution should keep track of publications of each text in printed, electronic or other form
• All publishers of texts should notify this institution about each publication

Text numbering system: Example
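
The “hex string” transmission described under the data transfer feature can be sketched in Python. Judging from the examples on the slides (2B01 for U+012B, 041F for U+1F04), each non-ASCII character appears to be sent as its 16-bit code unit with the low byte first, while ASCII characters are sent unchanged. That reading is an inference from the examples, not a documented TITUS specification. The function names below are hypothetical, and the parameter values are copied from the example URLs without any claim about what they mean.

```python
from urllib.parse import urlencode

# Base address copied from the example URLs on the slides.
BASE_URL = "http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp"


def to_hex_query(word: str) -> str:
    """Encode a word form as a 7-bit string (assumed scheme, see lead-in).

    ASCII characters pass through unchanged; every other character is
    written as the uppercase hex digits of its UTF-16 little-endian bytes.
    """
    parts = []
    for ch in word:
        if ord(ch) < 0x80:
            parts.append(ch)
        else:
            parts.append(ch.encode("utf-16-le").hex().upper())
    return "".join(parts)


def build_query_url(lang_code: int, word: str, charset: str) -> str:
    """Assemble a query URL shaped like the ones shown on the slides."""
    params = {
        "LXLANG": lang_code,          # value copied from the examples
        "LXWORD": to_hex_query(word),
        "LCPL": 0,                    # copied as-is; meaning not stated
        "TCPL": 0,                    # copied as-is; meaning not stated
        "C": charset,                 # "A", "D", "H" appear in the examples
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(to_hex_query("\u012b\u1e37e"))                     # Sanskrit example
    print(to_hex_query("\u1f04\u03bd\u03b4\u03c1\u03b1"))    # Greek example
    print(build_query_url(22035, "\u012b\u1e37e", "A"))
```

Because ASCII letters and hex digits share the same alphabet, a string such as 2B01371Ee cannot be decoded without extra rules; the sketch therefore only covers the encoding direction.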
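
The contrast between precomposed characters and sequences of characters and diacritics, and the asymmetric search behaviour reported for MS Access, can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch that imitates the behaviour described on the slides; it is not how MS Access or the TITUS database implement it, and the helper names are hypothetical.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Prefer precomposed characters wherever Unicode provides them (NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text: str) -> str:
    """Decompose, drop all combining marks, and recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


def matches(query: str, form: str) -> bool:
    """Asymmetric comparison modelled on the slide's description.

    A query without diacritics matches every accented variant of itself;
    a query that carries diacritics matches only the identical form.
    """
    query = to_precomposed(query)
    form = to_precomposed(form)
    if query == strip_diacritics(query):
        return strip_diacritics(form) == query
    return form == query


if __name__ == "__main__":
    # "a" followed by a combining acute becomes the single character U+00E1.
    print(to_precomposed("a\u0301") == "\u00e1")
    # Disregarding diacritics, as in the Greek Example 4.
    print(strip_diacritics("\u1f04\u03bd\u03b4\u03c1\u03b1"))
    print(matches("a", "\u00e1"), matches("\u00e1", "a"))
```

Normalizing both sides before comparison is what makes a precomposed character and its decomposed sequence count as the same word form.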
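
The TITUS language specific treatment, where a query for Lithuanian <s$> yields <š, sch, sz> while <sch> yields only <sch>, can be sketched as a query expansion step. Only the single mapping for `s$` is taken from the slides; the table structure, the function name and the sample input are assumptions made for illustration, and the real database presumably does this inside its SQL layer.

```python
from itertools import product
from typing import Dict, List

# Hypothetical variant table. Only the "s$" entry (š, sch, sz) is
# taken from the slides; a real table would be defined per language.
LITHUANIAN_VARIANTS: Dict[str, List[str]] = {
    "s$": ["\u0161", "sch", "sz"],
}


def expand_query(query: str, table: Dict[str, List[str]]) -> List[str]:
    """Expand abstract keys such as "s$" into all spelling variants.

    Characters that are not part of a key are kept literally, so a
    literal query like "sch" expands only to itself.
    """
    keys = sorted(table, key=len, reverse=True)  # try longest keys first
    slots: List[List[str]] = []
    i = 0
    while i < len(query):
        for key in keys:
            if query.startswith(key, i):
                slots.append(table[key])
                i += len(key)
                break
        else:
            # No key starts here: keep the character as it is.
            slots.append([query[i]])
            i += 1
    return ["".join(combo) for combo in product(*slots)]


if __name__ == "__main__":
    # "s$uo" is an illustrative input, not a query taken from the slides.
    print(expand_query("s$uo", LITHUANIAN_VARIANTS))
    print(expand_query("sch", LITHUANIAN_VARIANTS))
```

Each expanded spelling could then be passed to an ordinary exact-match lookup, which keeps the language-specific knowledge in one table per language.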
