
ALCTS Program "Library Catalogs and Non-Roman Scripts", 2004 ALA Annual Conference, Orlando

Unicode Encoding

and Online Data Access

Ralf Gehrke / Jost Gippert

The TITUS Project („Thesaurus indogermanischer Text- und Sprachmaterialien“)

(since 1987/1993)

www.ala.org/alcts

Scope of the TITUS project:

• Electronic retrieval engine covering the textual heritage of all ancient Indo-European languages
• Present retrieval task:
  – Documentation of the usage of all word forms occurring in the texts, in their respective contexts
• Survey of the parts of the text database:
  – http://titus.uni-frankfurt.de/texte/texte2.htm

Data formats (since 1995):

• Text formats:
  – WordCruncher Text format (8-bit)
  – HTML (UTF-8 4.0)
  – (Plain 7-bit ASCII format)

• Database format:
  – MS Access (relational, Unicode-based)
  – Retrieval via SQL


Original Scripts Covered:

• Latin (with all kinds of diacritics), incl. variants*
• Greek
• Slavic (Cyrillic and Glagolitic*)
• Armenian
• Georgian
• Devanāgarī
• Other Brāhmī scripts (Tocharian, Khotanese)*
• Avestan*
• Middle Persian (Pahlavī)*
• Manichean*
• Arabic (incl. Persian)
• and many more

* not yet encodable (as such) in Unicode

Example 1a: Donelaitis

(Lithuanian: formatted text incl. diacritics: 8-bit version)


Example 1b: Donelaitis

(Lithuanian: formatted text incl. diacritics: Unicode version)

Example 2a: Catechism

(Old Prussian: formatted text incl. diacritics: 8-bit version, special TITUS font)


Example 2b: Catechism

(Old Prussian: formatted text incl. diacritics: Unicode version, no special font)

Example 3a: Codex Suprasliensis

(Old Church Slavonic Cyrillic text in tentative Unicode encoding: TITUS font)


Example 3b: Kiev folia

(Old Church Slavonic Glagolitic text in substitutional Unicode encoding: TITUS font)

Example 4a: Rigveda

(Sanskrit text in Unicode encoding, Roman transcription)


Example 4b: Rigveda

(Same, other MS Windows 2000 font)

Example 4c: Rigveda

(Sanskrit text in Unicode encoding, Devanāgarī script: MS Windows 2000 font)


Example 4d: Rigveda

(Same: other MS Windows 2000 font)

Example 5a: Vīs u Rāmīn

(Early New Persian text in Unicode encoding, Roman transcription)


Example 5b: Vīs u Rāmīn

(Early New Persian text in Unicode encoding, original script: MS Windows 2000 font)

Example 5c: Vīs u Rāmīn

(Same, other MS Windows 2000 font)


Basis of online retrieval:

• Multilevel referencing system defining
  – Text structure levels
    • Texts
    • Chapters
    • Paragraphs
  – Representation structure levels
    • Pages
    • Lines
    • Formatting types (headers, catchwords etc.)
• Language / script specific encoding
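
A minimal sketch of how such a multilevel reference could be modelled, so that every occurrence of a word form carries both its text-structure and its representation-structure position. The class, field, and function names as well as the sample data are hypothetical; the slides do not describe the actual TITUS data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Position of one word-form occurrence (hypothetical model)."""
    # Text structure levels
    text: str
    chapter: int
    paragraph: int
    # Representation structure levels
    page: int
    line: int


def build_index(occurrences):
    """Map each word form to the list of references where it occurs."""
    index = defaultdict(list)
    for form, ref in occurrences:
        index[form].append(ref)
    return dict(index)


# Placeholder data, not taken from any TITUS text.
occurrences = [
    ("alpha", Reference("Sample text", 1, 1, 1, 3)),
    ("beta", Reference("Sample text", 1, 1, 1, 3)),
    ("alpha", Reference("Sample text", 1, 2, 2, 1)),
]
index = build_index(occurrences)
for ref in index["alpha"]:
    print(ref)
```

Keeping both kinds of levels in one record lets a hit be reported either by chapter and paragraph or by page and line of the edition.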

Query preliminaries:

• Manual query entry via form:
  – http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.htm

• Features:
  – Preselection of languages / varieties
  – Text independent search
  – Preselection of query type
  – Combined search of up to 4 word forms
  – 7-bit based manual entry of word forms


Query form: Language preselection

Query form: Type preselection


Query form: Combined search

Query form: Result


Query preliminaries:

• User input feature:

– alternate 7-bit (ASCII) based manual entry of word forms

– purpose: cross-platform compatibility
– problem: unavailability and / or inapplicability of “national” keyboards
– precondition: “English” keyboard available and accessible everywhere

Query form: Character input


TITUS bibliography: An example

Query form: Character input

• Example 1: Latin special characters


Query form: Character input

• Example 2: Ancient Greek characters

Query form: Character input

• Example 3: Slavonic (Cyrillic) characters


Query preliminaries:

• Data transfer feature:

– 7-bit (ASCII) based transmission of data in query strings

– purpose: secure cross-platform compatibility
– problem: unavailability and / or inapplicability of Unicode data in data transmission via HTTP
– precondition: representation of non-ASCII characters by “hex strings”

Query form: Character input

• Example 1: Latin special characters

– Sanskrit īḷe represented by 2B01371Ee

  • ī = U+012B
  • ḷ = U+1E37
  • e = ASCII e

– N.B. Wherever a character is encodable as such, this is used in the text database

– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=22035&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=A


Query form: Character input


• Example 2: “Romanized” Devanāgarī

– Sanskrit īḷe also represented by 2B01371Ee

  • ī = U+012B
  • ḷ = U+1E37
  • e = ASCII e

– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=23059&LXWORD=2B01371Ee&LCPL=0&TCPL=0&C=D


Query form: Character input


• Example 3: Greek characters

– Greek ἄνδρα represented by a)’ndra or 041FBD03B403C103B103

  • ἄ = U+1F04
  • ν = U+03BD
  • δ = U+03B4
  • ρ = U+03C1
  • α = U+03B1

– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=041FBD03B403C103B103&LCPL=0&TCPL=0&C=H
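
The hex strings in Examples 1 to 3 are consistent with a simple rule: each non-ASCII character is written as the four hex digits of its 16-bit code unit with the low byte first (U+1F04 becomes 041F, U+03BD becomes BD03; 2B01 likewise reads back as U+012B), while ASCII characters are passed through unchanged. The Python sketch below reproduces the strings shown on the slides under that assumption. The function names are hypothetical, the meaning of the URL parameters is not explained on the slides, and nothing here is taken from the TITUS server code.

```python
import unicodedata
from urllib.parse import urlencode

# Base URL as printed on the slides.
BASE_URL = "http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp"


def to_hex_query(word: str) -> str:
    """Encode a word form as a 7-bit string (rule inferred from the slides)."""
    out = []
    # Prefer precomposed characters, as the text database is said to do.
    for ch in unicodedata.normalize("NFC", word):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII is transmitted as-is
        elif cp <= 0xFFFF:
            # Low byte first, then high byte, in upper-case hex.
            out.append(f"{cp & 0xFF:02X}{cp >> 8:02X}")
        else:
            raise ValueError(f"no rule inferred for U+{cp:04X}")
    return "".join(out)


def build_query_url(lang_id: int, word: str, c: str) -> str:
    """Assemble a query URL using the parameter names seen on the slides."""
    params = {
        "LXLANG": lang_id,
        "LXWORD": to_hex_query(word),
        "LCPL": 0,
        "TCPL": 0,
        "C": c,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(to_hex_query("\u012b\u1e37e"))  # 2B01371Ee
print(build_query_url(8, "\u1f04\u03bd\u03b4\u03c1\u03b1", "H"))
```

Because the result contains only ASCII letters and digits, it survives any HTTP client or proxy unchanged, which is the purpose stated for the data transfer feature.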


Query form: Character input


• Example 4: Optional disregard of diacritics

– Greek ανδρα represented by andra or B103BD03B403C103B103

  • α = U+03B1
  • ν = U+03BD
  • δ = U+03B4
  • ρ = U+03C1
  • α = U+03B1

– http://titus.fkidg1.uni-frankfurt.de/database/titusinx/titusinx.asp?LXLANG=8&LXWORD=B103BD03B403C103B103&LCPL=0&TCPL=0&C=H
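
Example 4 sends the word without its breathing and accent. One common way to derive such a diacritic-free search key is to decompose the string, drop all combining marks, and recompose. The sketch below uses Python's standard `unicodedata` module; the function name is hypothetical, and how the TITUS engine actually disregards diacritics is not stated on the slides.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with all combining marks removed."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # combining() is non-zero for combining marks such as accents and breathings.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base)


print(strip_diacritics("\u1f04\u03bd\u03b4\u03c1\u03b1"))  # Greek word from Example 3
print(strip_diacritics("\u012b\u1e37e"))                   # Sanskrit word from Example 1
```

Applied to the word of Example 3, this yields the five plain letters U+03B1 U+03BD U+03B4 U+03C1 U+03B1 that Example 4 transmits.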


Data base properties

• Unicode specific treatment of diacritics

– vs.

• Software specific treatment of diacritics

– vs.

• TITUS specific treatment of diacritics

Data base properties

• Unicode specific treatment of diacritics:

– Precomposed characters

• vs.

– Sequences of characters and diacritics

• Correct treatment must be guaranteed by the software
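
The difference between the two representations can be shown in a few lines. In this sketch (standard Python `unicodedata`; the helper name is hypothetical) the precomposed character U+00E1 and the sequence "a" + U+0301 look identical on screen but compare as different strings until both are brought to the same Unicode normalization form.

```python
import unicodedata

precomposed = "\u00e1"   # single code point: LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # base letter + COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # False: different code point sequences
print(same_text(precomposed, decomposed))  # True: canonically equivalent
```

Software that stores or compares text without such a normalization step will miss matches between the two forms.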


Data base properties

• Software specific treatment of diacritics (MS Access 2000 / XP):

– SQL query for <a> yields <a, á> etc.

• while

– SQL query for <á> yields only <á>

• Special functions depending on modern languages

Data base properties

• TITUS language specific treatment of diacritics:

– SQL query for Lithuanian […] yields <š, sch, sz> etc.

• while

– SQL query for […] yields only […]

• Special functions depending on cross-historical orthographic properties of languages
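
A language-specific expansion of this kind can be sketched as a variant table applied before the SQL query is issued. In the example below only the spelling group š / sch / sz comes from the slide. The choice of š as the lookup key, the table and column names, the sample word forms, and the use of SQLite as a stand-in for MS Access are all assumptions made for illustration.

```python
import sqlite3
from itertools import product

# Only the group š / sch / sz is taken from the slide; the key is an assumption.
LITHUANIAN_VARIANTS = {"\u0161": ["\u0161", "sch", "sz"]}  # \u0161 = š


def expand_variants(word, variants=LITHUANIAN_VARIANTS):
    """Return every spelling obtained by substituting the listed variants."""
    choices = [variants.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


def find_forms(conn, word):
    """Query a (hypothetical) word_forms table for all spelling variants."""
    forms = expand_variants(word)
    placeholders = ", ".join("?" for _ in forms)
    sql = f"SELECT form FROM word_forms WHERE form IN ({placeholders})"
    return {row[0] for row in conn.execute(sql, forms)}


# In-memory database with illustrative word forms.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_forms (form TEXT)")
conn.executemany(
    "INSERT INTO word_forms VALUES (?)",
    [("szuo",), ("\u0161uo",), ("suo",)],
)
print(expand_variants("\u0161uo"))
print(find_forms(conn, "\u0161uo"))
```

A query for a single historical spelling would bypass the table and match only that spelling, which mirrors the asymmetric behaviour described on the slide.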


Data base properties: Example

SARDS

• SARDS = South Asia Research Documentation Services
• Part 1 covers the years 1789–1999 and contains more than 50,000 citations of research papers (no monographs) on Indology and South Asia Studies


Tustep Encoding: Some diacritics

SARDS in Tustep encoding


SARDS in Unicode encoding

A question to librarians

• How is Unicode changing the cataloguing of books?
• Are authors and titles entered in original script or in transcriptions?
• Or will both methods be used in parallel?


Bibliography in original script and transcription: Example

UniTeNS

• UniTeNS = Unified Text Numbering System
• A new proposal for an identification system for texts
• Each text is assigned a 48-digit number that reflects author, language, era, type of text etc.
• This number is independent of the form of publication (print, electronic, or manuscript)


UniTeNS

• All texts should be catalogued according to a complete classification scheme
• A central institution should keep track of publications of each text in printed, electronic or other form
• All publishers of texts should notify this institution about each publication

Text numbering system: Example
