Corpus Linguistic Tools for Historical Semantics in Arabic

International Journal of Arabic-English Studies (IJAES) Vol. 15, 2014

Corpus Linguistic Tools for Historical Semantics in Arabic

Omaima Ismail, Sane Yagi and Bassam Hammo
University of Jordan, Jordan

Abstract: In this paper, we present a set of corpus linguistic tools for conducting historical semantic research in the Arabic language. We compiled a Historical Arabic Corpus (HAC) that spans more than 1500 years of continuous language use. Drawing on techniques from the field of Natural Language Processing (NLP), the tools presented here have been used to create the HAC and to explore lexical semantic change. The development of these tools is intended as a catalyst for the ambitious goal of compiling an Arabic dictionary on historical principles. The HAC and the tools can also be used for conducting research in a variety of areas of linguistics.

Keywords: Arabic historical corpus, diachronic semantics, etymology, computational lexicography, linguistic tools, NLP resources.

1. Introduction

Corpus Linguistics is a sub-discipline of the scientific study of language that uses machine-aided tools for the compilation, retrieval, and analysis of a large body of classified, machine-searchable texts that are representative of authentic language use. It uses frequencies, collocations, and phrase structures to make generalizations about language use.

A corpus is a database of a large body of classified machine-searchable texts that are selected to be representative of language as spoken or written in a specific geographic region or period of time, by a specific group of users, and/or for a specific function. The texts are often meta-encoded with information about the author, date, place, and medium of publication, genre, language, degree of representativeness, etc. They are also annotated at word and/or sentence level with information about a word’s part of speech (POS), grammatical function, morphological components, prosodic features, etc.
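The two layers of information described above (document-level metadata and word-level annotation) can be pictured as a simple record structure. The following is only a minimal sketch under stated assumptions: the class names, field names, and sample records are hypothetical illustrations and do not reflect the actual encoding used in the HAC.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    """One annotated word; all field names are illustrative assumptions."""
    surface: str                  # the word as it appears in the text
    pos: Optional[str] = None     # part-of-speech tag, if annotated
    lemma: Optional[str] = None   # dictionary form, if annotated
    root: Optional[str] = None    # consonantal root, if annotated


@dataclass
class CorpusText:
    """One text together with its document-level metadata."""
    text_id: str
    author: str
    year_ce: int                  # approximate date of the text
    place: str
    genre: str
    medium: str
    tokens: List[Token] = field(default_factory=list)


def texts_in_period(texts, start_year, end_year):
    """Return the texts dated within [start_year, end_year], inclusive."""
    return [t for t in texts if start_year <= t.year_ce <= end_year]


# Made-up sample records, for illustration only (not taken from the HAC).
sample = [
    CorpusText("t1", "unknown", 800, "unknown", "lexicography", "manuscript",
               [Token("شيخ", pos="NOUN", lemma="شيخ", root="ش ي خ")]),
    CorpusText("t2", "unknown", 2010, "unknown", "editorial", "newspaper",
               [Token("شيخ", pos="NOUN", lemma="شيخ", root="ش ي خ")]),
]

modern = texts_in_period(sample, 1900, 2014)
print([t.text_id for t in modern])
```

Keeping the date in the metadata is what makes diachronic queries possible: the same word can be retrieved separately for each period and its contexts compared.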
Corpora are at the basis of a variety of recent linguistic studies. They are considered a good resource for research in natural language processing, language teaching, machine translation, language engineering, information retrieval, lexical analysis, lexicography, and many other fields (Al-Sulaiti and Atwell, 2006).

Only a few studies have been conducted on Arabic from a historical linguistic perspective. One reason for this is that most NLP tools at the disposal of linguists have been geared to Modern Standard Arabic (MSA). The absence of an Arabic corpus on historical principles is at the root of this research deficiency.

Arabic has one of the longest traditions in lexicography, with the first full-fledged dictionary dating back to 786 C.E. It also has one of the richest bodies of lexicographic work, with dictionaries that are onomasiological, semasiological, alphabetical, retrogradic, encyclopaedic, terminological, etc. What it does not have is a dictionary that traces the semantic development of words; i.e., an etymological dictionary (a dictionary on historical principles).

With the ultimate aim of initiating systematic work in historical linguistic research and lexicography, we have leveraged information technology. This paper describes, previews, and demonstrates a set of computational tools that would facilitate research in historical semantics and etymological lexicography. It provides a set of tools that were used to create and analyze the Historical Arabic Corpus (HAC) in order to extract historical semantic knowledge about Arabic.

The rest of the paper is organized as follows: Section 2 defines historical semantics and discusses semantic change. Section 3 provides a background on corpus linguistics and reviews some previous work. In Section 4, we describe the research methodology we followed in the development of the corpus tools.
Section 5 presents some experiments that were conducted to illustrate the utility of our work. Section 6 concludes and draws a roadmap for future research.

2. Historical semantics and meaning change

Historical semantics is a scientific enterprise that studies diachronic change of meaning. It identifies, describes, and explains this change and attempts to discover the conditions that motivate it, the factors that regulate it, and the mechanisms that propagate it. It also studies the consequences of this change for meaning relations within and across semantic fields. Furthermore, historical semantics studies the conceptual history of a language community by investigating the etymology of words. Fritz (2012) asserts that historical semantics is also “a research area where fundamental problems of semantics tend to surface and which can be seen as a testing ground for theories of meaning and for methodologies of semantic description” (p. 2644).

On the benefit that historical semantics has gained from corpus linguistics, Fritz (2012) maintains that corpus linguistics has inspired historical semantics to reflect on the relationship between the collocations of a word and its senses. Historical corpora facilitate the study of gradual changes in contexts of use and make it possible to discriminate between new senses that are transient and new senses that become firmly established.

Semantic change, or meaning change, denotes the universal tendency of word meanings to become different over time: words gain new senses or lose old ones, replace default senses, drift in terms of word prototype, narrow or widen category boundaries, undergo pejoration or amelioration, and/or bleach.

As in all languages, Arabic words change meaning over time. Below we present some examples of the major types of semantic shift that have affected Arabic words. These examples have been culled from our Historical Arabic Corpus to illustrate both some types of semantic change and the potential of our compiled corpus.
Table 1 shows examples of words which underwent semantic specialization. We found that many of the cases of semantic change are due to specialization (or semantic narrowing).

Table 1. Examples of words that underwent specialization
Word | Classical meaning | Meaning in modern contexts
?amana أمانة | Trust | Safe deposit; honesty; integrity; fidelity; confidence
haraj حَرَج | Extreme distress | Shyness; modesty
najm نجم | A star in the sky | A celebrity

Broadening the senses of a word is another type of semantic change. Classical Arabic words that have undergone semantic generalization in MSA are few (cf. Table 2).

Table 2. Examples of words that underwent semantic generalization
Word | Classical meaning | Meaning in modern contexts
ru:?ya رؤيا | Good dream | Vision, trance, dream
haythu حيث | Adverb of place (where) | Adverb

Semantic pejoration (degradation) and semantic amelioration (elevation) also occur in modern Arabic contexts. Tables 3 and 4 illustrate both types.

Table 3. Examples of words that underwent semantic pejoration
Quranic word | Meaning in the Quran | Meaning in modern contexts
hayawa:n حيوان | True eternal life | Animal
?irha:b إرهاب | Deterrence of aggressors | Indiscriminate assault

Table 4. Examples of words that underwent semantic amelioration
Word | Meaning in the Quran | Meaning in modern contexts
shaikh شيخ | Old man | Leader; prince; religious scholar
marah مَرَح | Falsehood | Delight and joy
zaba:niya زبانية | Guardian spirits of Hell | Accomplices

Now, let us focus on one word, شيخ 'shaikh', and inspect how it acquired elevated senses in modern times. Figure 1 shows a snapshot from the concordance of n-gram textual phrases for the word 'shaikh' in current newspaper editorials from a variety of Arab countries.
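To make the notion of a concordance concrete, the following is a minimal keyword-in-context sketch. It is a generic illustration, not the authors' tool: the function name, the window size, whitespace tokenization, and the sample sentence (English glosses with the transliterated keyword) are all assumptions made for the example.

```python
from typing import List, Tuple


def concordance(tokens: List[str], keyword: str,
                window: int = 2) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each exact match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            # Up to `window` tokens on each side, clipped at the text edges.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Invented sample text, for illustration only (not taken from the HAC).
sample = "the shaikh of the tribe met the shaikh of the mosque".split()

hits = concordance(sample, "shaikh", window=2)
for left, kw, right in hits:
    print(f"{left:>10} [{kw}] {right}")
```

Exact string matching is a deliberate simplification. Arabic words frequently carry attached clitics such as the definite article, so a practical concordancer would need to match on lemmas or stems instead of raw surface forms.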
Whilst 'shaikh' meant 'old man' in Quranic Arabic, in Modern Arabic it means 'old man' as well as 'chief', 'religious leader', 'supreme religious leader', and 'scholar'.

Fig 1. A snapshot from the concordance, tracking the word شيخ 'shaikh' in MSA

In semantic transition, the original meaning of a word transforms into an entirely different sense. Table 5 shows a few examples of this type of semantic change.

Table 5. Examples of words that underwent semantic transition
Quranic word | Original meaning in Quran | Meaning in modern context
jiha:z جهاز | Furniture | Machinery
nasiya ناصبة | Exhausting | Erecting; fraudulent
yashri: يَشْرِى | Sells | Buys

Table 6 illustrates how morphological change plays an important role in the semantic development of words. In modern contexts, some plural forms of Quranic words are now used as singular and new plural forms have been coined.

Table 6. Examples of Quranic words that underwent morphological change
Quranic word | Meaning in the Quran | Meaning in modern contexts
حَاجّ 'ha:j' | Pilgrims | One pilgrim, the plural being حُجَّاج 'huja:j'
نَفَر 'nafar' | A group of people | One person, the plural being أنفار 'anfa:r'

3. Background and literature review

As corpora are large and structured collections of texts, electronically stored and processed, they are good resources for linguistic research, natural language processing, and data mining. To make corpora more useful for linguistic research, they are often annotated with different kinds of information, depending on the purpose for which the corpus is used. For example, a corpus could be annotated with Part of Speech Tags (POST), and some morphological information such as a word’s lemma, stem, root, morphological pattern,
