The Glossarium Graeco-Arabicum: Linguistic Research and Database Design in Polyalphabetic Environments

Torsten Roeder (BBAW), Yury Arzhanov (Ruhr Universität Bochum)

Ms. Paris BnF 5847, f. 5: Muslim scholars in discussion.

Arabic translation of Dioscurides’ Materia medica

(Ibn al- al-- al-adwiya wa-l-aghdhiya, 1–4. 1291 H.)

Filecards for the Greek and Arabic Lexicon (GALex)

GALex

The Database Glossarium Græco-Arabicum

The Glossarium Graeco-Arabicum makes available information in the following fields of research:

• the vocabulary and syntax of Classical and Middle Arabic;
• the development of a scientific and technical vocabulary in Arabic;
• the vocabulary of Classical and Middle Greek;
• the chronology and nature of the translation movement into Arabic;
• the establishment of the texts of Greek works and their Arabic translations.

The Glossarium Graeco-Arabicum online:

November 2013 Telota Glossarium Graeco-Arabicum

BERLIN-BRANDENBURGISCHE AKADEMIE DER WISSENSCHAFTEN

I Technical Challenges → polyalphabetic environment

II Scholarly Requirements → linguistic database

III Technical vs. Scholarly → concluding discussion

OUTLINE

1 Languages Used in the GlossGA Interface

2 Character Corpus

3 Areas of Technical Challenges

4 Examples

I. TECHNICAL CHALLENGES

Languages used within the project:

Ancient Greek Medieval Arabic Modern English

Greek alphabet Arabic alphabet Latin alphabet

3 layers of optional vowel signs 1 layer of diacritics

LTR (left to right) RTL LTR

I.1. LANGUAGES

Unicode Chart                 Range      Description
C0 Controls and Basic Latin   0000-007F  Latin Alphabet
Latin Extended-A              0100-017F  symbols
Latin Extended Additional     1E00-1EFF  transliteration symbols
Greek and Coptic              0370-03FF
Greek Extended                1F00-1FFF
Arabic                        0600-06FF  Arabic Alphabet
Arabic Supplement             0750-077F  Arabic Alphabet
Spacing Modifier Letters      02B0-02FF  special Arabic characters

→ in total: about 450 different characters from eight different charts

I.2. UNICODE

Requirements:

1. Data input in all three alphabets with all vowels and diacritics → How to implement a comfortable interface?

2. Simultaneous display of texts in three alphabets and two directions → How to implement concurrent writing directions?

3. Search for terms, insensitive to diacritics or vowels → How to implement queries with different collation sets?

I.3. REQUIREMENTS

a Data Input

b Writing Directions

c Search

d Search Terms

I.4. EXAMPLES

ʾ ˒ ʿ ˓

I.4.a. DATA INPUT

[ʾ] U+02BE MODIFIER LETTER RIGHT HALF RING → transliteration of Arabic hamza
[˒] U+02D2 MODIFIER LETTER CENTRED RIGHT HALF RING → more rounded articulation
[ʿ] U+02BF MODIFIER LETTER LEFT HALF RING → transliteration of Arabic ain
[˓] U+02D3 MODIFIER LETTER CENTRED LEFT HALF RING → less rounded articulation

I.4.a. DATA INPUT

Problem: Appearance vs. Encoding

Users will normally choose characters …

→ not because of their Unicode description
→ but because of their appearance

How to bring Unicode to the user?
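One way to make the encoding visible next to the appearance is to show the code point and the official character name beside each glyph, for instance in a tooltip of the input form. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical and not part of the project.

```python
import unicodedata

# The four look-alike half rings discussed on the previous slide.
LOOKALIKES = ["\u02BE", "\u02D2", "\u02BF", "\u02D3"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Print glyph and encoding side by side so the user sees both.
    for ch in LOOKALIKES:
        print(ch, describe(ch))
```

Shown this way, two glyphs that look nearly identical on screen can still be told apart by their names.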

I.4.a. DATA INPUT

Solutions:

– restrict the characters accepted by the database → safe, but requires validation methods

– provide a virtual keyboard (onscreen) → user-friendly

Alternative methods:

code → less advisable from a Unicode point of view → but widely used
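The first solution, restricting the accepted characters, could be validated along the following lines. This is only a sketch: it accepts the whole Unicode blocks listed in section I.2, whereas the project corpus is a narrower selection of about 450 characters. The names `ALLOWED_RANGES` and `rejected_characters` are assumptions made for this illustration.

```python
# Block ranges taken from the Unicode chart overview in section I.2.
# Assumption: whole blocks are allowed; a real corpus list would be stricter.
ALLOWED_RANGES = [
    (0x0000, 0x007F),  # C0 Controls and Basic Latin
    (0x0100, 0x017F),  # Latin Extended-A
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x1F00, 0x1FFF),  # Greek Extended
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x02B0, 0x02FF),  # Spacing Modifier Letters
]


def is_allowed(ch: str) -> bool:
    """True if the character's code point lies in one of the allowed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def rejected_characters(text: str) -> list:
    """Return the characters of `text` that the database should refuse."""
    return [ch for ch in text if not is_allowed(ch)]


if __name__ == "__main__":
    # U+2019 (typographic apostrophe) looks like a half ring but is rejected.
    print(rejected_characters("\u02BFain vs. \u2019ain"))
```

A check like this catches look-alike characters that were typed or pasted from outside the project corpus before they reach the database.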

I.4.a. DATA INPUT

Phenomenon:

Home (THEN) Arabic Glossary (THEN) ص (THEN) صحة

becomes

ص> صحة < Home > Arabic Glossary

I.4.b. WRITING DIRECTIONS

Problem: Strong vs. Weak Characters

In Unicode, alphabetic characters are usually STRONG CHARACTERS, which determine the writing direction,

while non-alphabetic characters (e.g. punctuation) are usually WEAK CHARACTERS, which do not change the writing direction.

→ relevant in: separated lists, bibliographic references, breadcrumb lines, table alignments …

I.4.b. WRITING DIRECTIONS

Solutions:

– insert a “strong whitespace”: the Unicode characters U+200E (LEFT-TO-RIGHT MARK) or U+200F (RIGHT-TO-LEFT MARK)

– or, if in HTML, set the writing direction directly:
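Both solutions can be sketched for a breadcrumb line as follows. The function names and the `(text, direction)` input format are assumptions for this illustration; the only external facts used are the two Unicode direction marks and the standard HTML `dir` attribute. The plain-text variant assumes that the surrounding paragraph runs left to right.

```python
from html import escape

LRM = "\u200E"  # LEFT-TO-RIGHT MARK, an invisible strong LTR character
RLM = "\u200F"  # RIGHT-TO-LEFT MARK, the RTL counterpart (not used below)


def breadcrumb_plain(parts, sep=">"):
    """Join breadcrumb parts; each separator is wrapped in LRMs so that it
    keeps its left-to-right position even between two Arabic items."""
    strong_sep = f" {LRM}{sep}{LRM} "
    return strong_sep.join(parts)


def breadcrumb_html(parts):
    """Build HTML in which every item states its own direction.

    `parts` is a list of (text, direction) pairs, direction being
    'ltr' or 'rtl' (hypothetical input format).
    """
    items = [f'<span dir="{d}">{escape(text)}</span>' for text, d in parts]
    joined = " &gt; ".join(items)
    return f'<nav dir="ltr">{joined}</nav>'


if __name__ == "__main__":
    print(breadcrumb_plain(["Home", "Arabic Glossary", "ص", "صحة"]))
    print(breadcrumb_html([("Home", "ltr"), ("Arabic Glossary", "ltr"),
                           ("ص", "rtl"), ("صحة", "rtl")]))
```

The plain-text variant also works outside HTML, for example in exported text, while the markup variant keeps the stored data free of invisible characters.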

I.4.b. WRITING DIRECTIONS

GREEK: diacritics, not distinct → requirement: η also finds ἠ ἦ ἥ
ARABIC: vowel signs, not distinct → requirement: سبب also finds its vowelled forms
ENGLISH: diacritics, distinct → requirement: d does not find ḏ

Problem: Distinction vs. Collation

I.4.c. SEARCH

Solution:

Greek → Greek collation
Arabic → Arabic collation
English → Latin collation

Collation Charts:

Restrictions:

– does not work for mixed texts → data needs to be separated
– some environments do not support Arabic vowel collation → e.g. MySQL <6.0
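Where the database offers no suitable collation, a comparable effect can be approximated on the application side by storing a folded search key per language. The sketch below is not the collation-based approach described above but a substitute technique; the function `search_key` and the language labels are assumptions. Greek diacritics are removed via canonical decomposition, Arabic vowel signs are removed by code point (U+064B to U+0652), and Latin transliteration is left distinct.

```python
import unicodedata

# Arabic vowel signs from FATHATAN (U+064B) to SUKUN (U+0652), incl. SHADDA.
ARABIC_VOWEL_SIGNS = {chr(cp) for cp in range(0x064B, 0x0653)}


def search_key(text: str, language: str) -> str:
    """Return a folded key used for comparison (hypothetical helper)."""
    if language == "greek":
        # Decompose, drop combining marks (accents, breathings, iota subscript).
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(ch for ch in decomposed
                           if unicodedata.category(ch) != "Mn")
        return unicodedata.normalize("NFC", stripped)
    if language == "arabic":
        # Drop only the optional vowel signs; hamza carriers stay untouched.
        return "".join(ch for ch in text if ch not in ARABIC_VOWEL_SIGNS)
    # English / transliteration: diacritics are distinctive, only normalize.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(search_key("ἦ", "greek"))
    print(search_key("س\u064Eب\u064Eب", "arabic"))
    print(search_key("ḏ", "english") == search_key("d", "english"))
```

As with collations, the key depends on the language, so mixed texts still need to be separated before a key is computed.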

I.4.c. SEARCH

Phenomenon:

– user searches for Arabic words starting with مل
– truncation symbol (asterisk) appears on the wrong side: *مل

Problem: Neutral Writing Direction

– the standard asterisk is a NEUTRAL CHARACTER
– it adopts the main writing direction

I.4.d. SEARCH TERMS

Solution:

Unicode Arabic asterisk (U+066D ARABIC FIVE POINTED STAR), a right-to-left character: مل٭

I.4.d. SEARCH TERMS

Challenges for the Developer:

– Unicode does not provide general truncation or wildcard symbols

– different asterisk and wildcard signs must be processed

– no standard solution available
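Processing several truncation signs could look roughly like this: every accepted sign is mapped to the same wildcard before the query is evaluated. This is a sketch under assumptions, not the project's implementation; the set `WILDCARDS` and the functions `to_regex` and `matches` are invented for the illustration, and a regular expression stands in for the database query.

```python
import re
import unicodedata

# Signs accepted as truncation symbol (assumed set for this sketch):
# ASTERISK (direction-neutral) and ARABIC FIVE POINTED STAR (right-to-left).
WILDCARDS = {"*", "\u066D"}


def to_regex(term: str):
    """Translate a search term into a compiled regular expression,
    treating every accepted truncation sign as 'any sequence'."""
    pieces = [".*" if ch in WILDCARDS else re.escape(ch) for ch in term]
    return re.compile("".join(pieces))


def matches(term: str, word: str) -> bool:
    """True if `word` matches the (possibly truncated) search term."""
    return to_regex(term).fullmatch(word) is not None


if __name__ == "__main__":
    # Bidirectional classes of the two signs as reported by Python.
    print(unicodedata.bidirectional("*"), unicodedata.bidirectional("\u066D"))
    print(matches("مل\u066D", "ملك"), matches("مل*", "ملك"))
```

Both signs then behave identically in the search, and the user may type whichever one displays on the correct side.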

I.4.d. SEARCH TERMS

Technical Recommendations for Polyalphabetic Environments

– use software components that support Unicode throughout

– compose a project corpus of Unicode characters

– provide input methods to make the characters easily available

– consider Unicode writing directions and collations

– make sure that all characters not only appear correctly, but are also encoded correctly

SUMMARY OF I.

1 Corpus → How to deal with a database of 70,000+ words?

2 Translation movements → How to visualize transformations of language structures?

3 Single Lexemes → How to transform the database into a dictionary?

II. SCHOLARLY REQUIREMENTS

How to deal with a database of 70,000+ words?

– search form → user needs to know exactly what he/she is looking for

– browsing (e.g. by sources and words in alphabetical order) → user needs to know roughly what he/she is looking for

– visualization → statistical and/or graphical approach → user can explore the corpus

II.1. CORPUS

Distribution of sources in the GlossGA corpus

Area size corresponds to number of words

→ Which sources constitute the major/minor parts of the corpus?

II.1.a. CORPUS TREEMAP

Distribution of words in one source

Area size corresponds to number of words

→ What kind of vocabulary constitutes the source?

II.1.b. SOURCE TREEMAP

How to visualize transformation of language structures?

→ compare parts of speech in diagrams (experimental)

II.2. TRANSLATION MOVEMENTS

Compared Parts of Speech

Blue: Greek Parts of Speech

Red: Arabic Parts of Speech

Bar Length: number of words of respective part of speech

II.2.a. TRANSLATION MOVEMENTS

Compared Parts of Speech

X-Axis: Greek Parts of Speech

Y-Axis: Arabic Parts of Speech

Intersections: size represents number of words transferred from Greek PoS into Arabic PoS
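The numbers behind such a diagram amount to a cross-tabulation of part-of-speech pairs. A minimal sketch follows; the records are invented placeholders and do not come from the GlossGA database, and the function name `cross_tabulate` is an assumption.

```python
from collections import Counter

# Invented sample records: (Greek part of speech, Arabic part of speech).
SAMPLE_PAIRS = [
    ("noun", "noun"),
    ("verb", "noun"),
    ("verb", "verb"),
    ("adjective", "noun"),
    ("noun", "noun"),
]


def cross_tabulate(pairs):
    """Count how often each Greek PoS was rendered by each Arabic PoS."""
    return Counter(pairs)


if __name__ == "__main__":
    table = cross_tabulate(SAMPLE_PAIRS)
    # Each count would determine the size of one intersection in the diagram.
    for (greek_pos, arabic_pos), count in sorted(table.items()):
        print(f"{greek_pos} -> {arabic_pos}: {count}")
```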

II.2.b. TRANSLATION MOVEMENTS

How to transform the database into a dictionary?

Experimental preview:

→ collation of all entries of a Greek lexeme
→ ordered by Arabic lexeme
→ output with source and context
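The three steps of the preview can be sketched as a grouping operation. The record layout (`greek`, `arabic`, `source`, `context`), the sample entries and the function `dictionary_article` are all hypothetical. Note that the ordering here follows plain code point order, which is only a stand-in for a proper Arabic collation.

```python
from collections import defaultdict

# Invented sample entries; sources and contexts are placeholders.
SAMPLE_ENTRIES = [
    {"greek": "αἰτία", "arabic": "سبب", "source": "Source A", "context": "context 1"},
    {"greek": "αἰτία", "arabic": "علة", "source": "Source B", "context": "context 2"},
    {"greek": "αἰτία", "arabic": "سبب", "source": "Source C", "context": "context 3"},
    {"greek": "λόγος", "arabic": "قول", "source": "Source A", "context": "context 4"},
]


def dictionary_article(entries, greek_lexeme):
    """Collect all entries of one Greek lexeme, grouped and ordered by
    Arabic lexeme, each with its sources and contexts."""
    grouped = defaultdict(list)
    for entry in entries:
        if entry["greek"] == greek_lexeme:
            grouped[entry["arabic"]].append((entry["source"], entry["context"]))
    return sorted(grouped.items())


if __name__ == "__main__":
    for arabic, attestations in dictionary_article(SAMPLE_ENTRIES, "αἰτία"):
        print(arabic)
        for source, context in attestations:
            print("   ", source, "|", context)
```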

II.3.a. SINGLE LEXEMES

Export function via email:

II.3.b. SINGLE LEXEMES

Recommendations

1 provide multiple access methods → support various user scenarios

2 invent statistical and visual evaluation methods → profit from electronic data processing

3 provide conventional scholarly formats → correspond to the community’s needs

SUMMARY OF II.

Situation: Technical vs. Scholarly Requirements

– which one goes first?

→ technical requirements as necessary basis
→ scholarly requirements as superior objective

– both need attention from scholars
– both need attention from techies

→ mutual understanding
→ team competence

LAST BUT ONE SLIDE

Thank you for your attention!

Project Website http://telota.bbaw.de/glossga

Contact
Yury Arzhanov | [email protected]
Torsten Roeder | [email protected]