2017 Palestinian International Conference on Information and Communication Technology
WikiDocsAligner: an off-the-shelf Wikipedia Documents Alignment Tool
Motaz Saad, Department of Computer Science, Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine. Email: [email protected]
Basem O. Alijla, Department of Information Technology, Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine. Email: [email protected]
Abstract—Wikipedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which takes time and effort. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia articles. WikiDocsAligner does not require researchers to import or export interlanguage links databases. The user only needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. The software can be used to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia, and so shed light on Wikipedia as a source of language resources for Arabic dialects. The produced resources are useful because the demand for Arabic and dialectal language resources has increased in the last decade.

Index Terms—Comparable corpus, documents alignment, Arabic Wikipedia Corpus, Egyptian Wikipedia Corpus

I. INTRODUCTION

Document alignment is the task of arranging articles written in different languages into pairs such that the articles in each pair relate to the same topic. The term comparable corpus usually describes a set of documents in multiple languages that are aligned at the topic level; these documents are not necessarily translations of each other [1]. In contrast, a parallel corpus is a set of aligned sentences that are translations of each other.

Figure 1 shows an example of English¹ and French² Wikipedia comparable documents, both biographies of the same person. The first paragraph of the English document is longer than the French one and provides more information about the person. Further, the English and French texts present different views of this person, so the documents are comparable rather than parallel.

Fig. 1. English-French comparable documents from Wikipedia: (a) the English document, (b) the French document.

Parallel and comparable corpora are useful for several tasks such as cross-lingual information retrieval, machine translation and bilingual lexicon extraction. A comparable corpus is the best alternative when parallel texts are not available, as is the case for low-resourced languages.

Comparable corpora can be collected from news documents or from Wikipedia. Wikipedia is an open encyclopedia written by contributors in several languages. Anyone can edit and write Wikipedia articles. Some Wikipedia pages in some languages are translations of the corresponding English versions, and others are written independently. Wikipedia is an attractive source of comparable corpora because it covers many languages and domains, but aligning these documents is a challenging task [1].

Many researchers have compiled comparable corpora from Wikipedia. The authors of [2] described a language-independent method to build a parallel corpus from Wikipedia. The work in [3] presented a method to extract aligned sentences from a comparable corpus derived from Wikipedia; in addition, the authors extracted a bilingual lexicon from Wikipedia. The authors of [4] used Wikipedia comparable documents to enhance the performance of machine translation.

Wikipedia contributors use Wiki markup to write articles. Wiki markup, also known as wikitext or wikicode, is a lightweight markup language [5]. Interlanguage links are links from a page in one Wikipedia language to an equivalent page in another language. These links have the form [[languagecode:Title]], as shown in Figure 2. Historically, the list of interlanguage links was included in the Wikipedia article itself, i.e., in the wikicode. The Wikimedia Foundation then launched the Wikidata project in 2012, which was intended to centralize and provide a common source of data (including interlanguage links) for the different Wikimedia Foundation projects [6]. Interlanguage links are therefore now provided through the Wikidata project, in SQL script format.

Fig. 2. Interlanguage links prior to the Wikidata project: on the left the wikicode, on the right the links as they appeared on the page.

Researchers usually develop their own scripts to align Wikipedia articles based on interlanguage links, but this takes time and effort. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia comparable documents. WikiDocsAligner does not require researchers to import or export SQL interlanguage links databases. The user only needs to download the Wikipedia dumps (interlanguage links and articles) and pass them to WikiDocsAligner, which performs the alignment. The software can be used to align Wikipedia documents in any language pair.

WikiDocsAligner is then used in this paper to align Wikipedia comparable articles in Standard Arabic and the Egyptian Arabic dialect. The produced resources are useful because the demand for Arabic and dialectal language resources has increased in the research and business communities in the last seven years [7]. Arabs usually post on social media in their dialects, so one reason for the increased interest in Arabic dialect resources may be the emergence of what is known as the Arab Spring revolutions at the end of 2010. Another reason is business applications such as sentiment analysis of product reviews.

In the research community, many researchers have built multi-dialect corpora of Arabic for applications such as dialect identification, machine translation and sentiment analysis. For example, the authors of [8] built a corpus composed of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, as well as English. The authors of [9] collected news and tweets in five dialects of Arabic (Egyptian, Gulf, Levantine, Maghrebi and Iraqi) and built an automatic dialect identification system using the collected resources. The work of [10] presented a multi-dialectal corpus of Arabic collected from Twitter based on the geographical information in tweets.

As the work reviewed above shows, Arabic dialect corpora are either built from scratch or collected from social media. The Egyptian Wikipedia corpus has not previously been extracted or used in the research community. In this paper we shed light on Egyptian Wikipedia as a source for an Egyptian dialect corpus, and we also present Egyptian/Arabic aligned documents as a comparable corpus. These resources can be used in many applications such as dialect identification, sentiment analysis, bilingual lexicon extraction and machine translation.

The rest of this paper is organized as follows: Section II presents the extracts of the Arabic Wikipedia corpus and the Egyptian Wikipedia corpus, Section III describes WikiDocsAligner, and Section IV shows the application of WikiDocsAligner to align Arabic-Egyptian documents from Wikipedia and presents the produced comparable corpus.

¹ https://en.wikipedia.org/wiki/T._E._Lawrence
² https://fr.wikipedia.org/wiki/Thomas_Edward_Lawrence

978-1-5090-6538-7/17 $31.00 © 2017 IEEE. DOI 10.1109/PICICT.2017.27

II. ARABIC AND EGYPTIAN WIKIPEDIA EXTRACTS

This section presents the Arabic and Egyptian Wikipedia extracts. Wikimedia provides a free copy of all available contents of Wikipedia (articles, revisions, discussions of contributors). These copies are called dumps. Because Wikipedia contents change over time, the dumps are provided regularly every month. Wikipedia dumps are in XML format and can be downloaded from https://dumps.wikimedia.org/. The URL and file names depend on the target language. For example, the Arabic Wikipedia dumps URL is https://dumps.wikimedia.org/arwiki/, and the file name has the form arwiki-date-pages-articles.xml.bz2, where -date- specifies the dump date. The Wikipedia extracts in this paper are from the 20 Jan 2017 dump. In this work, we use the WikiExtractor Python script³ to extract documents from Wikipedia dumps.

³ https://github.com/attardi/wikiextractor

A. Arabic Wikipedia Corpus

Arabic Wikipedia started in 2003 with 655 articles and grew to 459,208 articles by 20 Jan 2017. The information of the Arabic Wikipedia corpus is presented in Table I, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words).

The corpus is available online at https://github.com/motazsaad/arWikiExtracts for the research community. The objective is to provide an up-to-date extract of Arabic Wikipedia and make it available to other researchers, because no recent extract is accessible to the research community.

Table II shows the most frequent words in the Arabic Wikipedia corpus, while Table III shows the most frequent words excluding stopwords. The most frequent words represent the common words in a language. If we consider the Arabic Wikipedia corpus as representative of the Arabic language, then these lists represent the most frequent (common) words in Arabic.

TABLE I
ARABIC WIKIPEDIA CORPUS
|d|   459,208 documents
|w|   83.5M words
|v|   4.7M unique words
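As an illustration of how statistics such as those in Table I can be obtained, the following minimal sketch computes |d|, |w| and |v| for a collection of extracted documents. It rests on assumptions that are not stated in this paper: words are split on whitespace, and the documents are available as in-memory strings. The function name corpus_stats is hypothetical.

```python
from collections import Counter


def corpus_stats(documents):
    """Return (|d|, |w|, |v|) for an iterable of document strings."""
    vocabulary = Counter()
    n_docs = 0
    for doc in documents:
        n_docs += 1
        # Assumption: whitespace tokenization; a real pipeline may also
        # normalize punctuation and diacritics before counting.
        vocabulary.update(doc.split())
    n_words = sum(vocabulary.values())
    return n_docs, n_words, len(vocabulary)


# Toy usage with placeholder tokens instead of real Wikipedia text.
print(corpus_stats(["a b a", "b c"]))  # (2, 5, 3)
```

Counting with a single pass over an iterable keeps memory proportional to the vocabulary rather than to the corpus, which matters for an extract with tens of millions of words.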
TABLE II
MOST FREQUENT WORDS IN ARABIC WIKIPEDIA CORPUS
(The Arabic word column is not legible in this copy; ranks and frequencies are listed.)
Rank   Frequency
1      912670
2      664693
3      313382
4      246869
5      160484
6      122708
7      122642
8      110351
9      100635
10     92920
11     89223
12     83028
13     76458
14     75980
15     72981
16     71759
17     70403
18     66420
19     65653
20     64536
21     64420
22     59250
23     59228
24     57971
25     56778
26     53043
27     51493
28     50895
29     47149
30     42505

TABLE III
MOST FREQUENT WORDS IN ARABIC WIKIPEDIA CORPUS EXCLUDING STOPWORDS
(The Arabic word column is not legible in this copy; ranks and frequencies are listed.)
Rank   Frequency
1      122708
2      47149
3      37305
4      35873
5      35259
6      32556
7      28823
8      26614
9      25478
10     23319
11     22606
12     22458
13     22233
14     21754
15     21481
16     20706
17     20463
18     19791
19     19646
20     19107
21     19077
22     18935
23     18148
24     17820
25     17612
26     17514
27     16684
28     15800
29     15686
30     15602
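Frequency lists of the kind shown in Tables II and III can be produced along the following lines. This is a minimal sketch, not the script used for this paper: it assumes whitespace tokenization and a stopword set supplied by the user, and the name most_frequent is hypothetical. The tokens in the usage example are Latin placeholders standing in for Arabic words.

```python
from collections import Counter


def most_frequent(documents, k, stopwords=frozenset()):
    """Return the k most frequent words as (word, count) pairs.

    Words listed in `stopwords` are skipped, which gives a list in the
    style of Table III; with the default empty set the result is in the
    style of Table II.
    """
    counts = Counter(
        word
        for doc in documents
        for word in doc.split()  # assumption: whitespace tokenization
        if word not in stopwords
    )
    return counts.most_common(k)


docs = ["fi masr fi", "fi masr cairo"]
print(most_frequent(docs, 2))                    # [('fi', 3), ('masr', 2)]
print(most_frequent(docs, 2, stopwords={"fi"}))  # [('masr', 2), ('cairo', 1)]
```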
B. Egyptian Wikipedia Corpus

Egyptian Wikipedia started in 2008 with the language code "arz" [11]. It received many criticisms: that it would disperse efforts and divert them from Arabic Wikipedia; that it would be useless because few people search for information written in the Egyptian dialect; that the Egyptian dialect has no formal grammar or structure; that education in Egyptian schools and universities is in Standard Arabic; that the quality of Egyptian Wikipedia is poor; and that the project would kill the Arabic language [11]. However, the project continued, and it had 16,203 articles by 20 Jan 2017.

The information of the Egyptian Wikipedia corpus is presented in Table IV, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words). We make this resource public to provide an extract of the Egyptian Wikipedia corpus for the research community, and to shed light on Wikipedia as a source for an Egyptian dialect corpus.

TABLE IV
EGYPTIAN WIKIPEDIA CORPUS
|d|   16,203 documents
|w|   2.18M words
|v|   293.5K unique words

The corpus is available online at https://github.com/motazsaad/arzWikiExtracts for the research community. The objective is to shed light on Egyptian Wikipedia as a source of Egyptian dialect language resources. No such resource was previously available to the research community. This corpus can be used for many NLP tasks, such as building language models for Egyptian speech recognition systems or for Egyptian-Arabic machine translation systems, or building an Egyptian dialect identification system.

Table V shows the most frequent words in the Egyptian Wikipedia corpus. This list presents the most used words in the Egyptian dialect. The list is interesting because it can be used to extract a stopword list for the Egyptian dialect. In addition, it can be used to build an Egyptian dialect lexicon.

TABLE V
MOST FREQUENT WORDS IN EGYPTIAN WIKIPEDIA CORPUS
(The Egyptian word column is not legible in this copy; ranks and frequencies are listed.)
Rank   Frequency
1      102238
2      71374
3      42370
4      22464
5      20406
6      13703
7      12880
8      12149
9      10109
10     7833
11     7143
12     6876
13     6411
14     6290
15     6006
16     5963
17     5818
18     5150
19     5023
20     4891
21     4615
22     4504
23     4241
24     4166
25     4113
26     4111
27     4038
28     3824
29     3788
30     3717

III. WIKIPEDIA ARTICLES ALIGNER

WikiDocsAligner is designed to work on any pair of languages, so it is language independent. The alignment process is a pipeline with the following steps:
1) Download the Wikipedia articles and interlanguage links dumps from https://dumps.wikimedia.org/
2) Use WikiExtractor to extract plain Wikipedia documents from the XML dumps.
3) Pass the pages and interlanguage links dumps to WikiDocsAligner.
4) WikiDocsAligner parses the SQL scripts, which contain the insert statements of the interlanguage links, and converts them into Pandas DataFrames [12]. A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types; one can think of it as a spreadsheet, an SQL table, or a dictionary of objects.
5) WikiDocsAligner queries the DataFrame for the title of the target document by providing the document id in the source language and the target language code.
6) WikiDocsAligner searches the target corpus for the required title and gets the corresponding document.
7) Finally, WikiDocsAligner aligns the source and target document pair and saves the two documents in files with the same name, but in separate directories.

This tool is open source software licensed under GPL-3.0 and available online at https://github.com/motazsaad/WikiDocsAligner

IV. ARABIC-EGYPTIAN ALIGNED CORPUS

Table VI presents the Arabic-Egyptian comparable corpus aligned by WikiDocsAligner, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words). The table shows that the Arabic part of the Arabic-Egyptian corpus has more words than the Egyptian part: the Arabic part is composed of about 8.4M words and the Egyptian part of about 1.5M words. Additionally, the vocabulary of the Arabic part is larger than that of the Egyptian part. This is expected because the volume of contributions to Arabic Wikipedia is much larger than that of contributions to the Egyptian one.

TABLE VI
ARABIC-EGYPTIAN WIKIPEDIA CORPUS
       Arabic Wikipedia   Egyptian Wikipedia
|d|    10,197             10,197
|w|    8,397,154          1,543,516
|v|    740,055            215,659

The corpus is available online at https://github.com/motazsaad/comparableWikiCoprus for the research community. This resource is useful for many NLP applications, such as improving the performance of machine translation systems and building Arabic dialect identification systems.

Table VII shows the most frequent words in the Arabic-Egyptian comparable corpus. The table shows that the two lists are comparable, that is, the rank order of the Arabic words is close to that of their equivalent Egyptian words.

TABLE VII
MOST FREQUENT WORDS IN ARABIC-EGYPTIAN WIKIPEDIA COMPARABLE CORPUS
(The word columns are not legible in this copy; ranks and frequencies are listed.)
Rank   Egyptian frequency   Arabic frequency
1      79760                82496
2      52461                59782
3      31610                27523
4      16462                21187
5      15544                14220
6      10604                11616
7      9821                 10499
8      8992                 9051
9      6832                 8388
10     6237                 7778
11     5566                 7368
12     5061                 7279
13     4949                 6681
14     4668                 5924
15     4593                 5885
16     4542                 5846
17     3718                 5822
18     3686                 5621
19     3409                 5395
20     3316                 5137
21     3263                 4986
22     3257                 4876
23     3257                 4850
24     3180                 4777
25     3021                 4562
26     2975                 4359
27     2821                 3928
28     2799                 3842
29     2737                 3598
30     2667                 3581

V. CONCLUSION

In this paper we presented WikiDocsAligner, an off-the-shelf tool for aligning comparable documents from Wikipedia. The tool is open source, licensed under GPL-3.0, and publicly available to other researchers. It can be used to align Wikipedia documents in any language pair.

We then presented the Arabic Wikipedia corpus, which is a recent extract of Arabic Wikipedia. The objective was to provide an up-to-date extract of Arabic Wikipedia and make it available to other researchers. We also presented the Egyptian Wikipedia corpus. Such a resource does not exist in the literature. The objective was to introduce Egyptian Wikipedia as a source of Egyptian dialect language resources. Finally, we presented the Arabic-Egyptian comparable corpus, which is a set of Arabic-Egyptian documents aligned by WikiDocsAligner.

The resources built in this paper (the Arabic, Egyptian and Arabic-Egyptian corpora) are very useful for many NLP applications such as machine translation, dialect identification and speech recognition.

In future work, we will use the Arabic-Egyptian comparable corpus to improve the performance of an Arabic-Egyptian machine translation system. In addition, the corpus will be used to build an Egyptian dialect identification system.

REFERENCES
[1] M. Saad, “Mining Documents and Sentiments in Cross-lingual Context,” Ph.D. dissertation, Université de Lorraine, January 2015.
[2] A. Štromajerová, V. Baisa, and M. Blahuš, “Between comparable and parallel: English-czech corpus from wikipedia,” RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, p. 3, 2016.
[3] H. Hu and T. Yao, “Sentence alignment for bilingual comparable corpus from wikipedia,” Journal of Chinese Information Processing, vol. 1, p. 029, 2016.
[4] G. Tholpadi, C. Bhattacharyya, and S. Shevade, “Translation induction on indian language corpora using translingual themes from other languages,” in International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2015, pp. 505–519.
[5] Wikipedia, “Wiki markup — wikipedia, the free encyclopedia,” 2017, [Online; accessed 2-February-2017]. Available: https://en.wikipedia.org/w/index.php?title=Wiki_markup&oldid=762249244
[6] Wikipedia, “Wikidata — wikipedia, the free encyclopedia,” 2017, [Online; accessed 2-February-2017]. Available: https://en.wikipedia.org/w/index.php?title=Wikidata&oldid=763159853
[7] S. Harrat, K. Meftouh, M. Abbas, S. Jamoussi, M. Saad, and K. Smaïli, “Cross-dialectal Arabic Processing,” in 16th International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, April 14-20, vol. 9041. Springer International Publishing, 2015, pp. 620–632. Available: http://dx.doi.org/10.1007/978-3-319-18111-0_47
[8] H. Bouamor, N. Habash, and K. Oflazer, “A multidialectal parallel corpus of arabic,” in LREC, 2014, pp. 1240–1245.
[9] R. Cotterell and C. Callison-Burch, “A multi-dialect, multi-genre corpus of informal written arabic,” in LREC, 2014, pp. 241–245.
[10] H. Mubarak and K. Darwish, “Using twitter to collect a multi-dialectal corpus of arabic,” in Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, pp. 1–7.
[11] Wikipedia, “Egyptian arabic wikipedia — wikipedia, the free encyclopedia,” 2016, [Online; accessed 2-February-2017]. Available: https://ar.wikipedia.org/w/index.php?title=%D9%88%D9%8A%D9%83%D9%8A%D8%A8%D9%8A%D8%AF%D9%8A%D8%A7_%D9%85%D8%B5%D8%B1%D9%8A&oldid=21827573
[12] W. McKinney, “Data structures for statistical computing in python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–56.
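Illustrative sketch for Section III. The core of the alignment pipeline (parsing the interlanguage links SQL script, looking up the target title for a source document id, and pairing the documents) can be sketched as follows. This is not the WikiDocsAligner implementation. It makes several assumptions that should be checked against the real dumps: each row of the INSERT statements is taken to have the layout (page id, 'language code', 'title'); a plain Python dictionary stands in for the Pandas DataFrame; and the names parse_langlinks and align, as well as the in-memory document mappings, are hypothetical.

```python
import re

# Assumed row layout inside each INSERT statement: (page_id,'lang','title').
ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")


def parse_langlinks(sql_text, target_lang):
    """Map source page id -> title of the equivalent page in target_lang."""
    links = {}
    for line in sql_text.splitlines():
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for page_id, lang, title in ROW.findall(line):
            if lang == target_lang and title:
                # Undo SQL escaping of single quotes inside titles.
                links[int(page_id)] = title.replace("\\'", "'")
    return links


def align(source_docs, target_docs, links):
    """Pair documents through the link table.

    source_docs maps page id -> text; target_docs maps title -> text.
    Source documents without a linked target document are skipped.
    """
    pairs = []
    for page_id, text in source_docs.items():
        title = links.get(page_id)
        if title in target_docs:
            pairs.append((text, target_docs[title]))
    return pairs


# Toy usage with made-up ids and titles.
sql = "INSERT INTO `langlinks` VALUES (7,'arz','Masr'),(7,'en','Egypt'),(9,'fr','X');"
links = parse_langlinks(sql, "arz")
print(align({7: "Arabic text", 8: "unlinked"}, {"Masr": "Egyptian text"}, links))
```

A dictionary keyed by page id is enough here because only one target language is queried at a time; a DataFrame, as used by the tool, is more convenient when the same link table must be filtered by several language codes.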