2017 Palestinian International Conference on Information and Communication Technology

WikiDocsAligner: an Off-the-Shelf Wikipedia Documents Alignment Tool

Motaz Saad, Department of Computer Science, Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine. Email: [email protected]
Basem O. Alijla, Department of Information Technology, Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine. Email: [email protected]

Abstract—The Wikipedia encyclopedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which requires effort and time. In this paper, we present WikiDocsAligner, a handy off-the-shelf tool for aligning Wikipedia articles. WikiDocsAligner does not require researchers to import or export interlanguage links databases: the user only needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. The software can easily align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia, shedding light on Wikipedia as a source of Arabic dialect language resources. The produced resources are interesting and useful, as the demand for Arabic/dialect language resources has increased over the last decade.

Index Terms—Comparable corpus, documents alignment, Arabic Wikipedia Corpus, Egyptian Wikipedia Corpus

I. INTRODUCTION

Document alignment is the task of arranging articles, written in many languages, into pairs related to the same topic. The term comparable corpus usually describes a set of documents in multiple languages that are aligned at the topic level; these documents are not necessarily translations of each other [1]. In contrast, a parallel corpus is a set of aligned sentences that are translations of each other.

Figure 1 shows an example of comparable English (https://en.wikipedia.org/wiki/T._E._Lawrence) and French (https://fr.wikipedia.org/wiki/Thomas_Edward_Lawrence) Wikipedia documents related to the biography of a person. It can be noted from the figure that the first paragraph of the English document is longer than the French one and provides more information about the person. Further, the English and French texts take different views of this person, so they are comparable.

Fig. 1. English-French comparable documents from Wikipedia: (a) the English document; (b) the French document.

Parallel and comparable corpora are useful for several tasks such as cross-lingual information retrieval, machine translation, and bilingual lexicon extraction. A comparable corpus is the best alternative when parallel texts are not available for low-resourced languages.

Comparable corpora can be collected from news documents or from Wikipedia. Wikipedia is an open encyclopedia written by contributors in several languages; anyone can edit and write Wikipedia articles. Some Wikipedia pages in some languages are translations of the corresponding English versions, while others are written independently. Wikipedia is an attractive source of comparable corpora because it covers many languages and domains, but aligning its documents is a challenging task [1].

Several researchers have compiled comparable corpora from Wikipedia. For example, [2] described a language-independent method to build a parallel corpus from Wikipedia. The work in [3] presented a method to extract aligned sentences from a comparable corpus derived from Wikipedia; in addition, the authors extracted a bilingual lexicon from Wikipedia. The authors of [4] used Wikipedia comparable documents to enhance the performance of machine translation.

Wikipedia contributors use Wiki markup to write articles. Wiki markup, also known as wikitext language or wikicode, is a lightweight markup language [5]. Interlanguage links are links from a page in one Wikipedia language to an equivalent page in another language; they have the form [[languagecode:Title]], as shown in Figure 2. Historically, the list of interlanguage links was included in the Wikipedia article itself, i.e., in the wikicode. Then, in 2012, the Wikimedia Foundation launched the Wikidata project, which is intended to centralize and provide a common source of data (including interlanguage links) for the different Wikimedia Foundation projects [6]. Interlanguage links are therefore now provided by the Wikidata project in SQL script format.

Fig. 2. Interlanguage links prior to the Wikidata project: on the left the wikicode, on the right the links as they appeared on the page.
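To make the historic inline format concrete, the following minimal Python sketch extracts interlanguage links from an article's raw wikitext. It only illustrates the [[languagecode:Title]] convention and is not code from WikiDocsAligner; the regular expression and the sample text are our own illustrative assumptions.

    import re

    # Interlanguage links in wikicode have the form [[languagecode:Title]],
    # e.g. [[fr:Thomas Edward Lawrence]]. This toy pattern accepts two- or
    # three-letter codes (optionally with a region suffix); a real parser
    # would check codes against Wikipedia's official language list.
    INTERLANG_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^|\]]+)\]\]")

    def interlanguage_links(wikitext):
        """Return a dict mapping language codes to equivalent article titles."""
        return {lang: title.strip()
                for lang, title in INTERLANG_RE.findall(wikitext)}

    sample = "T. E. Lawrence was a British officer... [[fr:Thomas Edward Lawrence]]"
    print(interlanguage_links(sample))  # {'fr': 'Thomas Edward Lawrence'}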
Researchers usually develop their own scripts to align Wikipedia articles based on interlanguage links, but this takes effort and time. In this paper, we present WikiDocsAligner, a handy off-the-shelf tool for aligning comparable Wikipedia documents. WikiDocsAligner does not require researchers to import or export SQL interlanguage links databases; the user only needs to download the Wikipedia dumps (interlanguage links and articles) and pass them to WikiDocsAligner, which performs the alignment. This software can easily align Wikipedia documents in any language pair.
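As a rough illustration of this workflow, the sketch below parses a langlinks SQL dump and pairs source-article page ids with their titles in a target language. It reflects our reading of the standard MediaWiki langlinks schema (ll_from, ll_lang, ll_title); the paper does not show WikiDocsAligner's internals, so the parsing details, file names, and the align helper are assumptions made for illustration only.

    import gzip
    import re

    # Rows in a MediaWiki langlinks dump (e.g. arwiki-20170120-langlinks.sql.gz,
    # an assumed file name) have the form (ll_from, 'll_lang', 'll_title').
    ROW_RE = re.compile(r"\((\d+),'([a-z-]+)','((?:[^'\\]|\\.)*)'\)")

    def load_langlinks(path, target_lang):
        """Map source page id -> title of the equivalent target-language page."""
        links = {}
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if not line.startswith("INSERT INTO"):
                    continue  # skip schema and comment lines
                for page_id, lang, title in ROW_RE.findall(line):
                    if lang == target_lang:
                        links[int(page_id)] = title.replace("\\'", "'")
        return links

    def align(source_pages, langlinks):
        """Pair each source article with its target-language title.

        source_pages: dict mapping page id -> (title, text), e.g. built
        from the articles dump by an extractor such as WikiExtractor.
        """
        return [(source_pages[pid], title)
                for pid, title in langlinks.items() if pid in source_pages]

    # Example: pair Standard Arabic articles with Egyptian Arabic ('arz') ones.
    # links = load_langlinks("arwiki-20170120-langlinks.sql.gz", "arz")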
WikiDocsAligner is then used in this paper to align comparable Wikipedia articles from Standard Arabic and the Egyptian Arabic dialect. The produced resources are interesting and useful, as the demand for Arabic/dialect language resources has increased in the research and business communities over the last seven years [7]. Arabs usually post on social media in their dialects, so one reason for this increased interest in Arabic dialect language resources may be the emergence of what is known as the Arab Spring revolutions at the end of 2010. Another reason is business applications such as sentiment analysis of product reviews.

In the research community, many researchers have built multi-dialect corpora of the Arabic language for applications such as dialect identification, machine translation, and sentiment analysis. For example, the authors in [8] built a corpus composed of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian, and Syrian Arabic, as well as English. The authors in [9] collected news and tweets in five dialects of Arabic (Egyptian, Gulf, Levantine, Maghrebi, and Iraqi) and built an automatic dialect identification system using the collected resources. The work of [10] presented a multi-dialectal corpus of Arabic collected from Twitter based on the geographical information in tweets.

As we can see from the work reviewed above, Arabic/dialect corpora are either built from scratch or collected from social media. The Egyptian Wikipedia corpus has never been extracted or used in the research community. In this paper, we shed light on Egyptian Wikipedia as a source of an Egyptian dialect corpus, and we present Egyptian/Arabic aligned documents as a comparable corpus. These resources can be used in many applications such as dialect identification and sentiment analysis.

The rest of this paper is organized as follows: Section II presents the extracts of the Arabic Wikipedia corpus and the Egyptian Wikipedia corpus, Section III describes WikiDocsAligner, and Section IV shows the application of WikiDocsAligner to align Arabic-Egyptian documents from Wikipedia and presents the produced comparable corpus.

II. ARABIC AND EGYPTIAN WIKIPEDIA EXTRACTS

This section presents the Arabic and Egyptian Wikipedia extracts. Wikimedia provides a free copy of all available contents of the Wikipedia encyclopedia (articles, revisions, and contributor discussions). These copies are called dumps. Because Wikipedia contents change over time, the dumps are provided regularly every month. Wikipedia dumps can be downloaded in XML format from https://dumps.wikimedia.org/. The URL and file names depend on the target language; for example, the Arabic Wikipedia dumps URL is https://dumps.wikimedia.org/arwiki/, and the file name is arwiki-date-pages-articles.xml.bz2, where -date- specifies the dump date. The Wikipedia extracts in this paper are from the dump of 20 Jan 2017. In this work, we use the WikiExtractor Python script to extract documents from the Wikipedia dumps.

A. Arabic Wikipedia Corpus

Arabic Wikipedia started in 2003 with 655 articles and had grown to 459,208 articles by 20 Jan 2017. Information about the Arabic Wikipedia corpus is presented in Table I, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (the number of unique words). We make this resource public to provide a recent extract of the Arabic Wikipedia corpus for the research community.

TABLE I
ARABIC WIKIPEDIA CORPUS

  |d|   459,208 documents
  |w|   83.5M words
  |v|   4.7M unique words

The corpus is available online at https://github.com/motazsaad/arWikiExtracts for the research community. The objective is to provide an up-to-date extract of Arabic Wikipedia and make it available to other researchers, because no recent extract is accessible to the research community.

Table II shows the most frequent words in the Arabic Wikipedia corpus, while Table III shows the most frequent words excluding stopwords. The most frequent words represent the common words of a language: if we consider the Arabic Wikipedia corpus as representative of the Arabic language, then these lists represent the most frequent (common) words in Arabic.
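As a small worked example, statistics like those in Table I and the frequent-word lists of Tables II and III can be computed from the extracted plain text roughly as follows. This sketch is ours, not the paper's code; whitespace tokenization and the stopwords set are simplifying assumptions.

    from collections import Counter

    def corpus_stats(documents, stopwords=frozenset()):
        """Compute |d|, |w|, |v| and the top non-stopword tokens.

        documents: list of strings, one per extracted Wikipedia article.
        """
        counts = Counter()
        for doc in documents:
            counts.update(doc.split())      # simple whitespace tokenization
        n_docs = len(documents)             # |d|: number of documents
        n_words = sum(counts.values())      # |w|: total number of tokens
        n_vocab = len(counts)               # |v|: number of unique tokens
        top = [(w, c) for w, c in counts.most_common()
               if w not in stopwords][:20]  # most frequent non-stopwords
        return n_docs, n_words, n_vocab, top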