Recommended publications
-
ARTICLE Development of a Gold-Standard Pashto Dataset and a Segmentation App Yan Han and Marek Rychlik
ARTICLE Development of a Gold-standard Pashto Dataset and a Segmentation App Yan Han and Marek Rychlik ABSTRACT The article aims to introduce a gold-standard Pashto dataset and a segmentation app. The Pashto dataset consists of 300 line images and corresponding Pashto text from three selected books. A line image is simply an image consisting of one text line from a scanned page. To our knowledge, this is one of the first open access datasets which directly maps line images to their corresponding text in the Pashto language. We also introduce the development of a segmentation app using textbox expanding algorithms, a different approach to OCR segmentation. The authors discuss the steps to build a Pashto dataset and develop our unique approach to segmentation. The article starts with the nature of the Pashto alphabet and its unique diacritics which require special considerations for segmentation. Needs for datasets and a few available Pashto datasets are reviewed. Criteria of selection of data sources are discussed and three books were selected by our language specialist from the Afghan Digital Repository. The authors review previous segmentation methods and introduce a new approach to segmentation for Pashto content. The segmentation app and results are discussed to show readers how to adjust variables for different books. Our unique segmentation approach uses an expanding textbox method which performs very well given the nature of the Pashto scripts. The app can also be used for Persian and other languages using the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to content in Pashto. -
New Language Resources for the Pashto Language
New language resources for the Pashto language Djamel Mostefa 1, Khalid Choukri 1, Sylvie Brunessaux 2, Karim Boudahmane 3 1 Evaluation and Language resources Distribution Agency, France 2 CASSIDIAN, France 3 Direction Générale de l'Armement, France E-mail: [email protected], [email protected], [email protected], [email protected] Abstract This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. Firstly a monolingual text corpus of 100 million words is produced. Secondly a 100-hour speech database is recorded and manually transcribed. Finally a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto with a special focus on Machine Translation. Keywords: Pashto, low-resource language, speech corpus, monolingual and multilingual text corpora, web crawling. 1. Introduction There are very few corpora and Human Language Technology (HLT) services available for Pashto. No language resources for Pashto can be found in the catalogues of LDC and ELRA. Pashto is a very low-resource language. Google doesn't support Pashto in its search engine or translation services. [...] other one being Dari) and one regional language in Pakistan. The code assigned to the language by the ISO 639-3 standard is [pus]. According to the Ethnologue.com website, it is spoken by around 20 million people and three main dialects are to be considered: -
A Three Months Study of COVID-19 in Pakistan
A Three Months Study of COVID-19 in Pakistan Jabir Ali1, Hafiz Muhammad Asmar Naeem2, Sultan Ali1, Sajjad Rahman1, Kainat Gull1, Ujalla Tanveer1, and Muhammad Ahmad Mushtaq1 1University of Agriculture Faisalabad 2University of Agriculture, Faisalabad (UAF) July 7, 2020 Abstract COVID-19 is a new pandemic caused by SARS-CoV-2 which has created havoc worldwide. In no time it spread throughout the world, compelling all countries to take emergency measures to overcome the pandemic. No vaccine has yet been developed for the disease; however, different drugs are under trial. So, the only strategy to overcome the deadly virus is to avoid each and every way of contact with already infected patients. To enforce this avoidance policy, different countries took different measures according to their circumstances. Developing countries are much more affected than the developed ones as they already struggle to fulfil many basic necessities of life, including the economy. Pakistan is one of those developing countries whose economy is badly hit by following the model strategies of developed countries. So, Pakistan introduced different strategies like smart lockdown, tiger force, etc. Pakistan has faced the worst peak of the pandemic later than most countries, so, to keep pace with the world in all aspects, the Government should put its best efforts into the health sector to overcome COVID-19 as soon as possible. Introduction Coronavirus Disease 2019 (COVID-19) is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). SARS-CoV-2 is a single-stranded (SS), enveloped, positive-sense RNA beta coronavirus (Raza et al., 2020) which is a close relative of Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV). -
Language Attrition: the Next Phase Barbara Köpke, Monika Schmid
Language Attrition: The next phase Barbara Köpke, Monika Schmid To cite this version: Barbara Köpke, Monika Schmid. Language Attrition: The next phase. Monika S. Schmid, Barbara Köpke, Merel Keijzer, Lina Weilemar. First Language Attrition: Interdisciplinary perspectives on methodological issues, John Benjamins, pp.1-43, 2004, Studies in Bilingualism, 9027241392. hal-00879106 HAL Id: hal-00879106 https://hal.archives-ouvertes.fr/hal-00879106 Submitted on 31 Oct 2013 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Language Attrition: The Next Phase Barbara Köpke (Université de Toulouse – Le Mirail) and Monika S. Schmid (Vrije Universiteit Amsterdam) Barbara Köpke Laboratoire de Neuropsycholinguistique Jacques Lordat Institut des Sciences du Cerveau de Toulouse Université de Toulouse-Le Mirail 31058 Toulouse Cedex France [email protected] Monika S. Schmid Engelse Taal en Cultuur Faculteit der Letteren Vrije Universiteit 1081 HV Amsterdam The Netherlands [email protected] Published in: M.S. Schmid, B. Köpke, M. Keijzer & L. Weilemar (2004). First Language Attrition. Interdisciplinary perspectives on methodological issues (pp. 1-43). Amsterdam: John Benjamins. Köpke, B. & Schmid, M.S. (2004). Language Attrition: The Next Phase. In M.S. Schmid, B. -
Afghan Opiate Trade 2009.Indb
ADDICTION, CRIME AND INSURGENCY The transnational threat of Afghan opium UNITED NATIONS OFFICE ON DRUGS AND CRIME Vienna Copyright © United Nations Office on Drugs and Crime (UNODC), October 2009 Acknowledgements This report was prepared by the UNODC Studies and Threat Analysis Section (STAS), in the framework of the UNODC Trends Monitoring and Analysis Programme/Afghan Opiate Trade sub-Programme, and with the collaboration of the UNODC Country Office in Afghanistan and the UNODC Regional Office for Central Asia. UNODC field offices for East Asia and the Pacific, the Middle East and North Africa, Pakistan, the Russian Federation, Southern Africa, South Asia and South Eastern Europe also provided feedback and support. A number of UNODC colleagues gave valuable inputs and comments, including, in particular, Thomas Pietschmann (Statistics and Surveys Section) who reviewed all the opiate statistics and flow estimates presented in this report. UNODC is grateful to the national and international institutions which shared their knowledge and data with the report team, including, in particular, the Anti Narcotics Force of Pakistan, the Afghan Border Police, the Counter Narcotics Police of Afghanistan and the World Customs Organization. Thanks also go to the staff of the United Nations Assistance Mission in Afghanistan and of the United Nations Department of Safety and Security, Afghanistan. Report Team Research and report preparation: Hakan Demirbüken (Lead researcher, Afghan -
002.Alex.Comitato.2A Bozza
Xaverio Ballester /A/ AND INDO-EUROPEAN VOCALISM "Over the last forty years Indo-European linguistics has largely lost its way behind the myth of the laryngeals, of which I intend to take no account, and that of structuralism, of which I take only very limited account." Giuliano Bonfante, I dialetti indoeuropei, p. 8 From rule to law, or from bad to worse Well worth mentioning among the very earliest descriptions of the Indo-European vowel system is the proposal of a phonemic inventory with only three vowel qualities /a i u/, a proposal that was nevertheless abandoned, unfortunately and far too soon. Perhaps the most conspicuous consequence of this abandonment is that, for mainstream Indo-European linguistics, the absence of /a/ became in practice an axiom, so that almost every later proposal on Indo-European vocalism has adopted the idea that the vowel /a/ never existed, while the unavoidable cases of /a/ in the Indo-European material have been explained away with varied and bizarre arguments such as pejorative, childish or popular vocalism. However, considered today without prejudice, the argumentation that motivated the eviction of the original Indo-European /a/ has, at least from a present-day phonotypological perspective, no validity whatsoever. We shall invoke an objective account of the matter to set out the question briefly. O. Szemerényi wrote: "Under the impression of the archaic character of Sanskrit, the founders of Indo-European studies and their immediate successors thought that the triangular system of Sanskrit i–a–u represented the original situation. -
Pashto Alphabets
LEARNING PASHTO Intensive Elementary & Secondary Pashto for Military and other Professionals by Dawood Azami Visiting Scholar Email: [email protected] The Middle East Studies Center (MESC) The Ohio State University, Columbus August 2009 Aims of the Course: *To provide a thorough introductory course in basic Pashto with the accent on practical spoken Pashto, coverage of grammar, familiarity with Pashto pronunciation, and essential vocabulary. *Ability to communicate within a range of situations and to handle simple survival situations (e.g. finding lodging, food, transportation etc.) *Ability to read simple Pashto texts dealing with a variety of social and basic needs. In addition to the author's own command and expertise, a number of sources (books, both published and unpublished, journals, websites, etc.) have been consulted while preparing this material. Word of thanks: The author would like to thank Dr. Alam Payind, Director, Middle East Studies Center (MESC), and Melinda McClimans, Assistant Director, MESC. Their cooperation and assistance certainly made my stay in Columbus easier and more enjoyable. Copyright: This course material is for the teaching of the Pashto language at The Ohio State University. The author holds the copyright for any other use. The author intends to publish a modified version of the material as a book in the future. Please contact the author for more information. (Email: [email protected]) Contents I. Pashto Alphabet A. Pashto Sounds Similar to English B. Pashto Sounds Different from English C. Two letters pronounced differently in major Pashto dialects D. Arabic Letters/Sounds in Pashto II. -
Naila Habib Khan
LIGATURE RECOGNITION SYSTEM FOR PRINTED URDU SCRIPT USING GENETIC ALGORITHM BASED HIERARCHICAL CLUSTERING A Thesis Submitted to the Faculty of the Institute of Management Sciences, Peshawar in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY COMPUTER SCIENCE By NAILA HABIB KHAN DEPARTMENT OF COMPUTER SCIENCE INSTITUTE OF MANAGEMENT SCIENCES PESHAWAR, PAKISTAN SESSION 2014-2017 This is to certify that the research work presented in this thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” was conducted by Naila Habib Khan under the supervision of Dr. Awais Adnan, Institute of Management Sciences, Peshawar, Pakistan. No part of this thesis has been submitted anywhere else for any other degree. This thesis is submitted to the Institute of Management Sciences, Peshawar in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the field of Computer Science. Student Name: Naila Habib Khan Signature: ___________________________ Examination Committee: a) External Foreign Examiner 1: Dr. Yue Cao School of Computing and Communications, Lancaster University, UK Signature: ___________________________ b) External Foreign Examiner 2: Prof. Dr. Ibrahim A. Hameed Deputy Head of Research and Innovation Department of ICT and Natural Sciences, Norwegian University of Science and Technology, UK Signature: ___________________________ c) External Local Examiner: Dr. Saeeda Naz Head of Department/ Assistant Professor Govt. Girls Postgraduate College, Abbotabad, Pakistan Signature: ___________________________ d) Internal Local Examiner: Dr. Imran Ahmed Mughal Assistant Professor Institute of Management Sciences, Peshawar, Pakistan Signature: ___________________________ Supervisor: Dr. Awais Adnan Assistant Professor Institute of Management Sciences, Peshawar, Pakistan Signature: ___________________________ Director: Dr. -
List of Category -I Members Registered in Membership Drive-Ii
LIST OF CATEGORY -I MEMBERS REGISTERED IN MEMBERSHIP DRIVE-II MEMBERSHIP CGN QUOTA CATEGORY NAME DOB BPS CNIC DESIGNATION PARENT OFFICE DATE MR. DAUD AHMAD OIL AND GAS DEVELOPMENT COMPANY 36772 AUTONOMOUS I 25-May-15 BUTT 01-Apr-56 20 3520279770503 MANAGER LIMITD MR. MUHAMMAD 38295 AUTONOMOUS I 26-Feb-16 SAGHIR 01-Apr-56 20 6110156993503 MANAGER SOP OIL AND GAS DEVELOPMENT CO LTD MR. MALIK 30647 AUTONOMOUS I 22-Jan-16 MUHAMMAD RAEES 01-Apr-57 20 3740518930267 DEPUTY CHIEF MANAGER DESTO DY CHEIF ENGINEER CO- PAKISTAN ATOMIC ENERGY 7543 AUTONOMOUS I 17-Apr-15 MR. SHAUKAT ALI 01-Apr-57 20 6110119081647 ORDINATOR COMMISSION 37349 AUTONOMOUS I 29-Jan-16 MR. ZAFAR IQBAL 01-Apr-58 20 3520222355873 ADD DIREC GENERAL WAPDA MR. MUHAMMA JAVED PAKISTAN BORDCASTING CORPORATION 88713 AUTONOMOUS I 14-Apr-17 KHAN JADOON 01-Apr-59 20 611011917875 CONTRALLER NCAC ISLAMABAD MR. SAIF UR REHMAN 3032 AUTONOMOUS I 07-Jul-15 KHAN 01-Apr-59 20 6110170172167 DIRECTOR GENRAL OVERS PAKISTAN FOUNDATION MR. MUHAMMAD 83637 AUTONOMOUS I 13-May-16 MASOOD UL HASAN 01-Apr-59 20 6110163877113 CHIEF SCIENTIST PROFESSOR PAKISTAN ATOMIC ENERGY COMMISION 60681 AUTONOMOUS I 08-Jun-15 MR. LIAQAT ALI DOLLA 01-Apr-59 20 3520225951143 ADDITIONAL REGISTRAR SECURITY EXCHENGE COMMISSION MR. MUHAMMAD CHIEF ENGINEER / PAKISTAN ATOMIC ENERGY 41706 AUTONOMOUS I 01-Feb-16 LATIF 01-Apr-59 21 6110120193443 DERECTOR TRAINING COMMISSION MR. MUHAMMAD 43584 AUTONOMOUS I 16-Jun-15 JAVED 01-Apr-59 20 3820112585605 DEPUTY CHIEF ENGINEER PAEC WASO MR. SAGHIR UL 36453 AUTONOMOUS I 23-May-15 HASSAN KHAN 01-Apr-59 21 3520227479165 SENOR GENERAL MANAGER M/O PETROLEUM ISLAMABAD MR. -
Medieval Hebrew Texts and European River Names Ephraim Nissan London [email protected]
ONOMÀSTICA 5 (2019): 187–203 | RECEIVED 8.3.2019 | ACCEPTED 18.9.2019 Medieval Hebrew texts and European river names Ephraim Nissan London [email protected] Abstract: The first section of the Book of Yosippon (tenth-century Italy) maps the Table of Nations (Genesis 10) onto contemporary peoples and places and this text, replete with tantalizing onomastics, also includes many European river names. An extract can be found in Elijah Capsali’s chronicle of the Ottomans 1517. The Yosippon also includes a myth of Italic antiquities and mentions a mysterious Foce Magna, apparently an estuarine city located in the region of Ostia. The article also examines an onomastically rich passage from the medieval travelogue of Benjamin of Tudela, and the association he makes between the river Gihon (a name otherwise known in relation to the Earthly Paradise or Jerusalem) and the Gurganin or the Georgians, a people from the Caspian Sea. The river Gihon is apparently what Edmund Spenser intended by Guyon in his Faerie Queene. The problems of relating the Hebrew spellings of European river names to their pronunciation are illustrated in the case of the river Rhine. Key words: river names (of the Seine, Loire, Rhine, Danube, Volga, Dnieper, Po, Ticino, Tiber, Arno, Era, Gihon, Guyon), Kiev, medieval Hebrew texts, Book of Yosippon, Table of Nations (Genesis 10), historia gentium, mythical Foce Magna city, Benjamin of Tudela, Elijah Capsali, Edmund Spenser Textos hebreus medievals i noms de rius europeus Abstract: The opening of the Book of Yosippon (Italy, tenth century) relates the "table of nations" of Genesis 10 to contemporary peoples and places, and this text, full of tantalizing onomastic proposals, also includes European river names. -
Pashto Five Reasons Why You Should Pashto Learn More About Pashtuns سالم
PASHTO
SOME USEFUL PHRASES IN PASHTO
سﻻم. زما نوم جان دئ. [saˈlɔːm zmɔː num ʤɔːn dǝɪ] /salām. zmā num jān dǝy./ Hi. My name is John.
ستاسو نوم څه دئ؟ [ˈstɔːso num ʦǝ dǝɪ] /stāso num ʦǝ dǝy?/ What is your name?
تاسې څنګه یاست؟ زه ښه یم، مننه. [ˈʦǝŋgɒ jɔːst zǝ ʂɒ jǝm mɒˈnǝnɒ] /ʦǝnga yāst? zǝ ṣ̆a yǝm, manǝna./ How are you? I’m fine, thanks.
ستاسو دلیدو څخه خوشحاله شوم. [ˈstɔːso dǝˈliːdo ˈʦǝχa χoʃˈhɔːla ʃwǝm] /stāso dǝ-lido ʦǝxa xoš-hāla šwǝm./
FIVE REASONS WHY YOU SHOULD LEARN MORE ABOUT PASHTUNS AND THEIR LANGUAGE
1. Pashto is spoken as a first or second language by over 40 million people worldwide. The largest populations of speakers are in Afghanistan and Pakistan, with smaller populations in other Central Asian and Middle Eastern countries such as Tajikistan and Iran.
2. A member of the Indo-Iranian language family, Pashto shares many structural similarities with languages such as Dari, Farsi, and Tajiki.
3. Because of US involvement with Afghanistan over the past decade, those who study Pashto can find careers in a variety of fields including translation and interpreting, consulting, foreign service and intelligence, journalism, and many others.
Uzbeks are the most numerous Turkic people in Central Asia. They live predominantly in Uzbekistan, a landlocked country of Central Asia that shares borders with Kazakhstan to the west and north, Kyrgyzstan and Tajikistan to the east, and Afghanistan and Turkmenistan to the south. Many Uzbeks can also be found in Afghanistan, Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan and the Xinjiang Uyghur Autonomous Region of -
Doctor of Philosophy
DOCTOR OF PHILOSOPHY Linguistic identifiers of L1 Persian speakers writing in English NLID for authorship analysis Ria Perkins 2014 Aston University Some pages of this thesis may have been removed for copyright restrictions. If you have discovered material in AURA which is unlawful e.g. breaches copyright, (either yours or that of a third party) or any other law, including but not limited to those relating to patent, trademark, confidentiality, data protection, obscenity, defamation, libel, then please read our Takedown Policy and contact the service immediately Linguistic Identifiers of L1 Persian speakers writing in English. NLID for Authorship Analysis. Ria Charlotte Perkins, M.A. Centre for Forensic Linguistics School of Languages and Social Sciences, Aston University A thesis submitted for the fulfilment of the degree of Doctor of Philosophy December, 2012 ©Ria Perkins, 2012 Ria Perkins asserts her moral right to be identified as the author of this thesis. This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without proper acknowledgement. Thesis Summary Institution: Aston University Title: Linguistic Identifiers of L1 Persian speakers writing in English. NLID for Authorship Analysis. Name: Ria Charlotte Perkins Degree: Doctor of Philosophy Year of Submission: 2012 Synopsis: This research focuses on Native Language Identification (NLID), and in particular, on the linguistic identifiers of L1 Persian speakers writing in English. This project comprises three sub-studies; the first study devises a coding system to account for interlingual features present in a corpus of L1 Persian speakers blogging in English, and a corpus of L1 English blogs.