Proceedings of the Conference on Language & Technology 2012
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the Conference on LANGUAGE & TECHNOLOGY 2012 Center for Language Engineering Al-Khawarizmi Institute of Computer Science University of Engineering and Technology Lahore- Pakistan Organized by Society for Natural Language Processing In collaboration with Center for Language Engineering Al-Khawarizmi Institute of Computer Science University of Engineering and Technology Lahore- Pakistan Copyright © 2012 Society for Natural Language Processing (SNLP), Pakistan www.snlp.org.pk ISBN: 978-969-9690-05-1 Conference Committees Organizing Committee General Chair: Sarmad Hussain, CLE, KICS-UET, Lahore, Pakistan Technical Committee Co-chair: Abid Khan, University of Peshawar, Pakistan Technical Committee Co-chair: Miriam Butt, University of Konstanz, Germany Publication Committee Co-chair: Tafseer Ahmed, University of Karachi, Pakistan Technical Committee Abid Khan, University of Peshawar, Pakistan (co-chair) Miriam Butt, University of Konstanz, Germany (co-chair) Afaq Husain, Riphah University, Pakistan Chai Wutiwiwatchai, NECTEC, Thailand Douglas Clarke, Cranfield University, UK Elena Bashir, University of Chicago, USA Ghulam Raza, PIEAS, Pakistan Imran Siddiqi, Bahria University Islamabad, Pakistan Khaver Zia, Beaconhouse National University, Pakistan Key-Sun Choi, KAIST, Korea Muhammad Afzal, KICSIT, Pakistan Pushpak Bhatacharyya, IIT Bombay, India Rachel Roxas, De La Salle University, Philippines Rajeev Sangal, IIIT Hyderabad, India Sarmad Hussain, KICS-UET, Pakistan Seemab Latif, NUST, Pakistan Tafseer Ahmed, University of Karachi, Pakistan Virach Sornlertlamvanich, NECTEC, Thailand Yogendra Yadava, Tribhuvan University, Kathmandu, Nepal Program Committee: Asad Abbas Tahir, CLE, KICS-UET, Lahore, Pakistan Atif Ali Javed, CLE, KICS-UET, Lahore, Pakistan Ammara Amanat, CLE, KICS-UET, Lahore, Pakistan Afia Mahmood, CLE, KICS-UET, Lahore, Pakistan Ayesha Zafar, CLE, KICS-UET, Lahore, Pakistan Farhat Abdullah, CLE, KICS-UET, Lahore, Pakistan Sana Shams, CLE, KICS-UET, Lahore, Pakistan Foreword On behalf of the Organizing Committee, we welcome the authors and participants to the fourth Conference on Language and Technology. Center for Language Engineering (CLE) at Al-Khawarizmi Institute Computer Science (KICS), UET, Lahore, is pleased to host the Conference on Language and Technology 2012 (CLT12). As the fourth CLT, this conference is helping mature the language technology research in Pakistan and providing the intended platform for researchers to interact. Twenty two papers have been submitted for CLT12, of which eleven have been accepted for presentation and four for posters, through a double blind review process by an international technical committee. With two withdrawals, the current volume has thirteen papers. The papers cover Pashto, Punjabi, Sindhi, Urdu languages specifically, and a host of areas, including linguistics and computational aspects of phonetics, phonology, syntax and semantics. This CLT also presents exciting workshops and demos, including recent research in linguistics of local languages and Nastalique optical character recognition. On behalf of the Organizing Committee we would like to show gratitude to all who volunteered to plan and support the conference. We would like to thank the technical committee members for their diligent reviews of the research articles. We would also like to thank the conference sponsors, especially University of Konstanz and the German Academic Exchange Service (DAAD). We are grateful to the management of Al-Khawarizmi Institute of Computer Science and University of Engineering and Technology, Lahore, for their unrelenting support to hold the conference. We wish you all a very fruitful CLT12 and a pleasant stay in Lahore. Sarmad Hussain On behalf of the Organizing Committee Table of Contents A Computational Multilingual Text Constituent Splitter and Phrasing: A Case of Pashto Language 1 Zaheer Ahmad, Mohammad Abid Khan, Jehan Zeb Khan Orakzai, Rahman Ali, Ibrar Ahmad Wikipedia is a Practical Alternative to the Web for measuring Co-occurrence based Word Association 9 Om P. Damani, Pankhil Chedda, Dipak Chaudhari The Pitch Contour of Declarative Sentences in Urdu Language 17 Farhat Jabeen, Sarmad Hussain Comparing Talks, Realities and Concerns over the Climate Change: Comparing Texts with Numerical 21 and Categorical Data Yasir Mehmood, Timo Honkela Urdu Syllable: Templates and Constraints 29 Mazhar Iqbal Ranjha Developing a Part of Speech Tagset for Sindhi 37 Mutee U Rahman CLE Urdu Digest Corpus 47 Saba Urooj, Sarmad Hussain, Farah Adeeba, Farhat Jabeen, Rahila Parveen Developing Urdu WordNet Using the Merge Approach 55 Ayesha Zafar, Afia Mahmood, Farhat Abdullah, Saira Zahid, Sarmad Hussain, Asad Mustafa An Acoustic Study of Vowel Nasalization in Punjabi 61 Saira Zahid, Sarmad Hussain Survey of Urdu OCR: An Offline Approach 67 Naila Fareen, Mohammad Abid Khan, Attash Durrani Poster Papers Error Analysis of Single Speaker Urdu Speech Recognition System 73 Saad Irtza, Dr. Sarmad Hussain A Dictionary based Urdu Word Segmentation using Dynamic Programming for Space Omission 81 Problem Rabiya Rashid, Seemab Latif Complexities and Implementation Challenges in Offline Urdu Nastaliq OCR 85 Danish Altaf Satti, Dr. Khalid Saleem A Computational Multilingual Text Constituent Splitter and Phrasing: A Case of Pashto Language Zaheer Ahmad, Mohammad Abid Khan, Jehan Zeb Khan Orakzai, Rahman Ali, Ibrar Ahmad Department of Computer Science, University of Peshawar, Khyber Pakhtunkhwa, Pakistan [email protected], [email protected], [email protected] [email protected], [email protected] Abstract discussion has been provided in section-4 about the Pashto language syntax rules. Section-5 is about the We propose a simple rules embedded matrix proposed approach to split sentences into their based method to split input sentences into their constituents. In the last, summary of the work has been constituents and phrases. Splitting a sentence into shared. phrases is a preprocess of machine translation for overcoming the problem of handling long sentences 1.1. Constituent Related Sentence Analysis and improving quality of automatic translation. An effort is made to remove or at least minimize the A group of words normally functions as a syntactic problem of recursion that is faced during the process unit/ constituent/phrases in a sentence [1], [2], [3], [4], of phrase splitting thereby saving a lot of time. The [5]. These groupings often give meaning to a sentence system is dynamic in design and theoretically would and help to identify important information about the work for any language that has some type of word structure of constituent boundary, linear order and order. However we have tested the system on Pashto syntactic categories [1], [2]. These constituents and language and this paper would describe the system in Phrases can be substituted and replaced, moved in a the perspective of Pashto language. The system can sentence, deleted, merged and built up by a series of achieve more than 90% results keeping in view the merger operations to form a larger constituent [1], [2], Phrase Rules are carefully captured in a table. [3]. The order of the constituent in a sentence is as important in the syntactic study of a sentence as the 1. Introduction word order [4]. It helps to understand a sentence after removing complexities, helps in translating a sentence Sentence splitting and getting constituents is a and representing the structure of constituent. To model specialized area of chunking. Sentences splitting into and represent constituent structure, Context Free constituents are an important preprocess in a wide Grammar (CFG) or phrase structure grammar has been variety of NLP disciplines, particularly in machine successfully used for languages such as English [6]. translation and sentence generation. It is not only However, there are many disadvantages in using CFG helpful to overcome the problems of complexity faced for natural languages such as ambiguity, left-recursion during the analysis of long sentences and improve the and repeated parsing of sub-trees. If a sentence is quality of translation. It can also be used in the structurally ambiguous, then the grammar assigns to it translation process directly for complete phrases. This more than one parse tree. It will be difficult to use CFG work is based on a language neutral approach to split in languages that do not follow strict word order [6]. sentences and get their constituents. However this research paper would mainly focus on Pashto language. 1.2. Recursion in Syntactic Structure of This section is dedicated to introduce and to elaborate Sentence sentence analysis and recursion. In section-2, related work has been discussed. Section-3 is describing some Unlike a chunk, a constituent of a particular challenges associated with Pashto language with the category can be embedded inside another constituent particular focus on Unicode for Arabic script. A detail of the same category, which, in turn, can be embedded inside another such constituent. This property or set of 1 properties of a sentence are called recursion [7], [8]. and a set of Noun grammar rules defined by Perini Rules that govern these properties are called recursive [13]. However, this approach needs sentences to be rules [7]. In recursion the categories like NP or VP manually tagged. Costa [14] used LXGram to describe repeat itself on the right side of the arrow when written the grammatical properties and the meaning of in Context Free Form. For Pashto