Proceedings of the Conference on LANGUAGE & TECHNOLOGY 2016
Center for Language Engineering Al-Khawarizmi Institute of Computer Science University of Engineering and Technology Lahore- Pakistan
CONFERENCE COMMITTEES Organizing Committee
General Chair: Sarmad Hussain, CLE, KICS-UET, Lahore, Pakistan Chair, Technical Committee and Co-Chair, Script Processing Track: Faisal Shafait, NUST, Islamabad, Pakistan
Co-Chair, Language Processing Track: Tahira Naseem, IBM, USA Co-Chair, Speech Processing Track: Tania Habib, UET, Lahore, Pakistan Chair, Publication Committee: Agha Ali Raza, IT University, Lahore, Pakistan Chair, Program Committee: Sana Shams, CLE, KICS-UET, Lahore, Pakistan
Technical Committee
Adnan ul Hasan National Engineering and Scientific Commission, Pakistan Agha Ali Raza Information Technology University, Pakistan Annette Hautli-Janisz University of Konstanz, Germany Athar Khurshid Lahore Leads University, Pakistan Faisal Shafait National University of Sciences and Technology, Pakistan Farhat Jabeen University of Konstanz, Germany Gernot Kubin Technical University of Graz, Austria Hassan Sajjad Qatar Computing Research Institute, Qatar Imran Siddiqi Bahria University, Pakistan Karthick Narasimhan Massachusetts Institute of Technology, USA Kashif Javed University of Engineering and Technology, Pakistan Miriam Butt University of Konstanz, Germany Muhammad Kamal Khan Shaheed Benazir Bhutto University, Pakistan Muhammad Shehzad Hanif King Abdul Aziz University, Saudi Arabia Muhammet BAŞTAN Turgut Özal University, Turkey Muzzamil Luqman University La Rochelle, France Nadir Durrani Qatar Computing Research Institute, Qatar Nibal Nayef University of Hamburg, Germany Rudolf Rabenstein University of Erlangen-Nuremberg, Germany Saad Irtza University of New South Wales, Australia Saba Urooj Center for Language Engineering, Pakistan Sarmad Hussain University of Engineering and Technology, Pakistan Sebastian Lory University of Konstanz, Germany Sharmeen Tarar University of the Punjab, Pakistan Sheikh Faisal Rashid University of Engineering and Technology, Pakistan Tafseer Ahmad DHA Suffa University, Karachi, Pakistan Tahira Naseem International Business Machines, USA Tania Habib University of Engineering and Technology, Pakistan Yoon Keok Lee International Business Machines Research, USA Young-Suk Lee International Business Machines Research, USA Yuan Zhang Massachusetts Institute of Technology, USA
Publication Committee
Agha Ali Raza Information Technology University, Pakistan Hira Ejaz Information Technology University, Pakistan
Program Committee:
Aisha Yousaf Center for Language Engineering, Pakistan Atif Ali Javed Center for Language Engineering, Pakistan Benazir Mumtaz Center for Language Engineering, Pakistan Dilshad Ali Center for Language Engineering, Pakistan Hina Khalid University of Engineering and Technology, Pakistan Kashif Javed University of Engineering and Technology, Pakistan Muhammad Kamran Khan Center for Language Engineering, Pakistan Sahar Ruaf Center for Language Engineering, Pakistan Sana Shams Center for Language Engineering, Pakistan Tania Habib University of Engineering and Technology, Pakistan Touqeer Ehsan Center for Language Engineering, Pakistan
FOREWORD
On behalf of the Organizing Committee, we welcome the authors and participants to the sixth Conference on Language and Technology.
Center for Language Engineering (CLE) at Al-Khawarizmi Institute Computer Science (KICS), UET, Lahore, is pleased to host the Conference on Language and Technology 2016 (CLT16). As the sixth CLT, this conference is helping mature the language technology research in Pakistan and providing the intended platform for researchers to interact.
Thirty eight papers have been submitted for CLT16, of which twelve have been accepted for presentation and four for posters, through a rigorous review process by an international technical committee. The papers cover Mewati, Pashto, Sindhi and Urdu language specifically, and a host of areas, including linguistics and computational aspects of phonetics, phonology, syntax and semantics. This CLT also presents exciting talks including recent research in syntax, prosody and frontiers in machine translation.
On behalf of the Organizing Committee we would like to show gratitude to all who volunteered to plan and support the conference. We would like to thank the technical committee members for their diligent reviews of the research articles. We would also like to thank the conference sponsors, especially Higher Education Commission of Pakistan, Punjab Higher Commission, University of Konstanz and the German Academic Exchange Service (DAAD). We are grateful to the management of Al-Khawarizmi Institute of Computer Science and University of Engineering and Technology, Lahore, for their unrelenting support to hold the conference.
We wish you all a very fruitful CLT16 and a pleasant stay in Lahore.
Sarmad Hussain On behalf of the Organizing Committee
TABLE OF CONTENTS
1. ANALYSIS AND DEVELOPMENT OF RESOURCES FOR URDU TEXT STEMMING ...... 1
2. URDU TEXT GENRE IDENTIFICATION ...... 9
3. URDU PHONOLOGICAL RULES IN CONNECTED SPEECH ...... 15
4. MORPHOLOGY OF MEWATI LANGUAGE...... 25
5. IDENTIFICATION OF DIPHTHONGS IN URDU AND THE ACOUSTIC PROPERTIES ...... 33
6. AUTOMATIC DERIVATION OF NOUNS FROM ADJECTIVES IN PASHTO...... 41
7. NAMED ENTITY DATASET FOR URDU NAMED ENTITY RECOGNITION TASK ...... 51
8. ACOUSTIC INVESTIGATION OF /L, J, V/ AS APPROXIMANTS IN URDU ...... 57
9. SUBJECTIVE TESTING OF URDU TEXT-TO-SPEECH (TTS) SYSTEM ...... 65
10. DEVELOPMENT OF SINDHI LEXICAL FUNCTIONAL GRAMMAR ...... 73
11. A COMPREHENSIVE IMAGE DATASET OF URDU NASTALIQUE DOCUMENT IMAGES ...... 81
12. CLUSTERING URDU NEWS USING HEADLINES ...... 89
Poster Papers
13. DATABASE SCHEMA INDEPENDENT ARCHITECTURE FOR NL TO SQL QUERY CONVERSION ...... 95
14. SENTENCE LEVEL SENTIMENT ANALYSIS USING URDU NOUNS ...... 101
15. A MANAGEMENT AND EVALUATION FRAMEWORK FOR ENGLISH TO URDU TRANSLATION .. 109
16. HANDCRAFTED SEMANTIC HIERARCHY TO DEVELOP URDU WORDNET ...... 119
Analysis and Development of Resources for Urdu Text Stemming
Abdul Jabbar Department of Computer Science Institute of Southern Punjab, Multan, Pakistan [email protected]
Sajid Iqbal Department of Computer Science Bahauddin Zakariya University, Multan, Pakistan [email protected]
Muhammad Usman Ghani Khan [email protected] Department of Computer Science and Engineering University of Engineering and Technology, Lahore
Abstract Urdu language is different from other languages like English in terms of its linguistics and phonetic rules. It was developed in the 12th century from the Urdu has been facing various challenges in 1 Natural Language Processing (NLP) due to its regional Apabhramsha of northwestern India . In the following paragraph, we list the prominent morphological richness. Stemming, as a preprocessing technique used in different differences as compared to English language with applications of natural language processing, is one respect to text stemming process. of the basic morphological operation applied in Urdu script writing orientation is from right to left. written text. The technique is used differently in its In Urdu word alphabets are connected and do not various applications like machine translation, query start with capital letter as in English. Furthermore, processing, question answering, text summarization most of the characters change their shape based on and information retrieval. This paper presents the their position in the word and adjunct letters. Table 1 Urdu resources for Urdu text stemming such as below demonstrates some of the Urdu characters affixes list, stop word list, stem word list and morphology. stemming rules to remove the infixes letter(s) and recoding to extract correct stems. Here, we collect Table 1 Different shape of Urdu Character ف ا ا در آ ى affixes, 1124 stop words, 40904 stem word list 1211 and 35 rules with their various variations to remove Final Medial Initial Letter the infixes. ﻋ ﻌ ﻊ ع ع، ، ، وس ر رف، ، ع Keywords: Stemming, Urdu stem words, Urdu affixes, Urdu stop words, Natural Language Processing In English each word is separated by a hard space, but in Urdu words are not always separated by hard 1. Introduction
1 http://www.britannica.com/topic/Urdu-language
1
resources for Urdu NLP. This paper focuses on Further, two or more words could .اد .space, e.g construction of Urdu resources which can be used This is a especially for stemmer development for Urdu text . ، از be written as a single word like challenging issue in Urdu text segmentation In and evaluation, and generally for research in English proper nouns always start with capital letter computational linguistics. This paper describes the but in Urdu proper nouns do not start with a capital lists of resources produced during my MS thesis and letter because Urdu has no letter casing. Due to this, are available online2. it is a challenging task to extract proper nouns from The remainder of this paper is organized as Urdu text. English language researchers have used follows: Section 2 gives the overview of stemming rules and tools like named entity recognition (NER) algorithms. Section 3 describes the development of to mark proper nouns in given text. On the other stop words list. Section 4 introduces the Urdu affix hand, such tools for Urdu are either rare or have low list. The Development of the stem word list is accuracy rate [32],[33]. described in section 5 and development of infix rules In English, inflectional or derivational forms are are described in section 6. In section 7 conclusion created by attaching affixes to either or both sides of and discussion of future work can be found. the root. For example, words readable and unreadable are created from the root word read. In 2. Stemming algorithms this example, “un” is a prefix and “able” is a suffix. However, in Urdu language, affix letter(s) could be In stemming, affixes are chopped off from found anywhere in the middle of the word. For derivational and inflectional forms of Urdu words to instance, derived from and derived from in ار، ش extract stem. For example, Urdu words ا ر ر م . This is a challenging issue in Urdu text stemming are affixes and its common stem is ار which and ا because, it is difficult to distinguish between affix . According to Lovins [15] “a stemming algorithm ش letters and actual part of the root letters. Conversion from singular to plural is also different in Urdu is a computational procedure that reduces all the language. There are multiple rules to do this. For words with same root by stripping each word of its example, a singular could be converted to plural by derivational and inflectional suffixes”. Khoja and Garside [13] define the stemming algorithm in Arabic by adding language context as “Stemming is the process of , ب adding some suffix such as Further there may be removing all of a word's prefixes, suffixes and infixes۔ some co-fix with suffix multiple rules to convert a word into its plural to produce the stem or root”. The case of broken plural words Anjali Ganesh [23] analyzed the English stemmer . ب / ں/ is also different, because they do not follow the and classified them into three: truncating, Statistical normal morphological rules. Urdu broken plural and mixed. Moral [24] grouped stemming algorithms words are somewhat like irregular English plurals. into algorithmic-based approaches and linguistic- The difference is that English singular and plural based approaches. Moghadam [25] classify Persian words resemble to each other, such as man to men, stemmer into three classes: structural stemmers, table but in the case of Urdu both may be non-identical lookup stemmers and statistical stemmers. Basically, there are two types of stemmers and third is a . Another difference between two combination of these two, these are described in ر languages is that English has only uni-gram words detail below. even after derivation. In Urdu there are uni-gram, bi- gram and tri-gram words that are obtained after a) Rule base stemmer . and ، دى ہ derivation for example Resource development is basic and fore most step This is most commonly used stemming technique. of Natural Language Processing (NLP) field. Urdu First stemmer [5] that is found in literature was a rule resource development starts with Urdu Zabta Takhti base stemmer. This stemmer is suitable for those (UZT) that is standard code for Urdu characters, approved by Government of Pakistan (GOP) [8]. A text corpus consisting of 18 million Urdu words is collected by the Center for Research in Urdu Language Processing (CRULP) at National
University of Computer and Emerging Sciences in 2 Pakistan [10]. CLE at University of Engineering and https://sourceforge.net/projects/resource-for-urdu- Technology, Lahore has also produced different stemmer/
2 languages that have well defined linguistic structure. 3. Development of stop words list Rule base approach is also a better choice for infixes removal. This stemmer can give promising results if Usually a sentence contains stop words and excellent rules are developed. On the other hand, too content words as shown in figure 1. Stop words have much rules make it more complex with inclusion of no linguistic meaning, so useless in queries and can contrary and recursive set of rules. To apply one rule generally be safely ignored, e.g. in Urdu you have to remove other (opposite) rule. This “ ”,” ”,” ” situation can easily lead toward deteriorated and in English “on”, “in”, “to”, and “the”. On the performance. other hand, content words are keywords of any b) Statistical stemmer sentence and have lexical meaning. Content words list usually contains nouns, verbs, or adjectives. Urdu Another popular approach is the use of statistical stop word list usually contains postpositions, techniques. In such approaches, the system learns determiners, pronouns, and conjunctions [7]. [2] from available corpus and decides on unseen data Used 400 translated Urdu stop words from English using knowledge gained through its experience, i.e. stop words. In these translated Urdu stop words, . ں، ت و ہ .statistical measures obtained through experience are some are not actually stop words e.g used to remove prefixes and affixes. However, the 1200 closed words list is provided in [22] and all the use of statistical methods is popular among the close words are not stop words, for instance language processing community. N-gram based , are not stop words but closed words, and these ت [statistics are used by W. B. Frakes [26]. Melucci [27 used HMMs and YASS: Yet another suffix stripper words are stem able words. by Majumder [28] is a few of the examples. It can easily be seen that such approaches are limited to Stop words gain experience. In other words, such approaches ن ا و rarely cover the complete grammar of the language
[29]. For example, N-gram approach is better to Content words remove suffixes only however, Urdu language has Figure 1: Example of stop word and content word suffixes, co-suffixes, prefixes, infixes and circumfixes (prefixes and suffixes simultaneously) Up to now, no significant work has been carried [30] and use of HMM requires that every word must out to find the stop words from Urdu text. After start with the prefix [27]. This review shows that studying the various Urdu grammar books and statistical approaches are insufficient in stemming literature [3], [4], [18], [22], [34], [35], [36], [37]. We process. A comprehensive review of stemming developed stop word list consisting of 1124 Urdu technique applied in Urdu, Persian and Arabic is words present in [39]. 4. Development of Affixes lists c) Hybrid stemmers The affixes are a morpheme or set of Hybrid stemmer is combination of the rule base morphemes/words that are frequently attached to stemmer and statistical stemmer [40]. other words and create new words. Internal structure of Urdu word is shown in figure 2.