2017 Palestinian International Conference on Information and Communication Technology

WikiDocsAligner: an off-the-shelf Documents Alignment Tool

Motaz Saad
Department of Computer Science, Faculty of Information Technology
Islamic University of Gaza, Gaza, Palestine
Email: [email protected]

Basem O. Alijla
Department of Information Technology, Faculty of Information Technology
Islamic University of Gaza, Gaza, Palestine
Email: [email protected]

Abstract—Wikipedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which requires effort and time. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia articles. WikiDocsAligner does not require researchers to import or export interlanguage link databases. The user just needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. This software can easily be used to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia, and thereby shed light on Wikipedia as a source of Arabic dialect language resources. The produced resources are interesting and useful, as the demand for Arabic and dialect language resources has increased in the last decade.

Index Terms—Comparable corpus, documents alignment, Arabic Wikipedia Corpus, Egyptian Wikipedia Corpus

I. INTRODUCTION

Document alignment is the task of arranging articles written in different languages into pairs such that the articles in each pair are related to the same topic. The term comparable corpus usually describes a set of documents in multiple languages that are aligned at the topic level; these documents are not necessarily translations of each other [1]. In contrast, a parallel corpus is a set of aligned sentences that are translations of each other.

Figure 1 shows an example of English¹ and French² Wikipedia comparable documents related to the biography of a person. It can be noted from the figure that the first paragraph of the English document is longer than the French one, and that it provides more information about the person. Further, the English and French texts present different views of this person, so the documents are comparable rather than parallel.

¹ https://en.wikipedia.org/wiki/T._E._Lawrence
² https://fr.wikipedia.org/wiki/Thomas_Edward_Lawrence

Parallel and comparable corpora are useful for several tasks such as cross-lingual information retrieval, machine translation and bilingual lexicon extraction. A comparable corpus is the best alternative when parallel texts are not available, as is the case for low-resourced languages.

Comparable corpora can be collected from news documents or from Wikipedia. Wikipedia is an open source encyclopedia written by contributors in several languages. Anyone can edit and write Wikipedia articles. Some Wikipedia pages in some languages are translations of the corresponding English versions, and others are written independently. Wikipedia is an attractive source of comparable corpora because it covers many languages and domains. However, aligning these documents is a challenging task [1].

Many researchers have compiled comparable corpora from Wikipedia. The authors of [2] described a language-independent method to build a parallel corpus from Wikipedia. The work in [3] presented a method to extract aligned sentences from a comparable corpus derived from Wikipedia; in addition, the authors extracted a bilingual lexicon from Wikipedia. The authors of [4] used Wikipedia comparable documents to enhance the performance of machine translation.

Wikipedia contributors use Wiki markup to write articles. Wiki markup, also known as wikitext or wikicode, is a lightweight markup language [5]. Interlanguage links are links from a page in one Wikipedia language to an equivalent page in another language. The form of these links is [[languagecode:Title]], as shown in Figure 2. Historically, the list of interlanguage links was included in the Wikipedia article itself, i.e., in the wikicode. Then, Wikimedia launched the Wikidata project in 2012, which is intended to centralize and provide a common source of data (including interlanguage links) for the different Wikimedia Foundation projects [6]. Interlanguage links are therefore now provided by the Wikidata project in SQL script format.

Researchers usually develop their own scripts to align Wikipedia articles based on interlanguage links, but this takes effort and time. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia comparable documents. WikiDocsAligner does not require researchers to import or export SQL interlanguage link databases; the user just needs to download the Wikipedia dumps (interlanguage links and articles) and pass them to WikiDocsAligner, which performs the alignment. This software can easily be used to align Wikipedia documents in any language pair.

WikiDocsAligner is then used in this paper to align Wikipedia comparable articles from Standard Arabic and

978-1-5090-6538-7/17 $31.00 © 2017 IEEE
DOI 10.1109/PICICT.2017.27

(a) The English document

(b) The French document

Fig. 1. English-French comparable documents from Wikipedia

Egyptian Arabic dialect. The produced resources are interesting and useful, as the demand for Arabic and dialect language resources has increased in the research and business communities in the last seven years [7]. Arabs usually post on social media in their dialects, so one of the reasons for this increased interest in Arabic dialect language resources may be the emergence of what is known as the Arab Spring revolutions at the end of 2010. Another reason is business applications such as sentiment analysis of product reviews.

In the research community, many researchers have built multi-dialect corpora of Arabic for applications such as dialect identification, machine translation and sentiment analysis. For example, the authors in [8] built a corpus composed of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, as well as English. The authors in [9] collected news and tweets in five dialects of Arabic (Egyptian, Gulf, Levantine, Maghrebi and Iraqi) and then built an automatic dialect identification system using the collected resources. The work of [10] presented a multi-dialectal corpus of Arabic collected from Twitter based on geographical information in tweets.

As the work reviewed above shows, Arabic dialect corpora are either built from scratch or collected from social media. The Egyptian Wikipedia corpus has never been extracted or used in the research community. In this paper, we shed light on Egyptian Wikipedia as a source of Egyptian dialect text, and we also present Egyptian/Arabic aligned documents as a comparable corpus. These resources can be used in many applications such as dialect identification, sentiment analysis, bilingual lexicon extraction and machine translation.

The rest of this paper is organized as follows: Section II presents the extracts of the Arabic Wikipedia corpus and the Egyptian Wikipedia corpus, Section III describes WikiDocsAligner, and Section IV shows the application of WikiDocsAligner to align Arabic-Egyptian documents from Wikipedia and presents the produced comparable corpus.

Fig. 2. Interlanguage links prior to Wikidata project: to the left the wikicode, to the right the links as they appeared on the page

II. ARABIC AND EGYPTIAN WIKIPEDIA EXTRACTS

This section presents the Arabic and Egyptian Wikipedia extracts. Wikimedia provides a free copy of all available contents of Wikipedia (articles, revisions, discussions of contributors). These copies are called dumps. Because Wikipedia contents change with time, the dumps are provided regularly every month. Wikipedia dumps can be downloaded in XML format from https://dumps.wikimedia.org/. The URL and file names depend on the target language; for example, the Arabic Wikipedia dumps URL is https://dumps.wikimedia.org/arwiki/. The file name should be arwiki-date-pages-articles.xml.bz2, where -date- specifies the dump date. The Wikipedia extracts in this paper are from the 20 Jan 2017 dump. In this work, we use the WikiExtractor Python script³ to extract documents from Wikipedia dumps.

³ https://github.com/attardi/wikiextractor

A. Arabic Wikipedia Corpus

Arabic Wikipedia started in 2003 with 655 articles and grew to 459,208 articles by 20 Jan 2017. The information of the Arabic Wikipedia corpus is presented in Table I, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words).

TABLE I
ARABIC WIKIPEDIA CORPUS
|d|    459,208 documents
|w|    83.5M words
|v|    4.7M unique words

The corpus is available online at https://github.com/motazsaad/arWikiExtracts for the research community. The objective is to provide an up-to-date extract of Arabic Wikipedia and make it available to other researchers, because no recent extract is accessible to the research community.

Table II shows the most frequent words in the Arabic Wikipedia corpus, while Table III shows the most frequent words excluding stopwords. The most frequent words represent the common words of a language. If we consider the Arabic Wikipedia corpus as representative of the Arabic language, then these lists represent the most frequent (common) words in Arabic.
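The corpus statistics reported in this section (|d|, |w| and |v|) and a frequency list of the kind shown in Table II can be computed along the following lines. This is a minimal sketch, not the authors' script: it assumes the extracted documents are available as plain-text strings and that words are separated by whitespace (the paper does not state its tokenization), and the function name `corpus_statistics` is hypothetical.

```python
from collections import Counter


def corpus_statistics(documents):
    """Return (|d|, |w|, |v|, word_counts) for an iterable of plain-text documents.

    Assumption: whitespace tokenization; the paper does not specify one.
    """
    word_counts = Counter()
    num_documents = 0
    for text in documents:
        num_documents += 1
        word_counts.update(text.split())
    num_words = sum(word_counts.values())   # |w|: total number of tokens
    vocabulary_size = len(word_counts)      # |v|: number of unique tokens
    return num_documents, num_words, vocabulary_size, word_counts


# Toy corpus of two short documents (illustrative data only).
toy_corpus = ["في البيت كتاب", "في المدرسة كتاب جديد"]
d, w, v, counts = corpus_statistics(toy_corpus)
print(d, w, v)
print(counts.most_common(2))
```

On a full Wikipedia extract the same function can be fed a generator that yields one document at a time, so the whole corpus never has to be held in memory.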

TABLE II
MOST FREQUENT WORDS IN ARABIC WIKIPEDIA CORPUS
[Columns: Arabic Word, Frequency. The Arabic words are not recoverable in this copy; frequencies in rank order:]
912670, 664693, 313382, 246869, 160484, 122708, 122642, 110351, 100635, 92920, 89223, 83028, 76458, 75980, 72981, 71759, 70403, 66420, 65653, 64536, 64420, 59250, 59228, 57971, 56778, 53043, 51493, 50895, 47149, 42505

TABLE III
MOST FREQUENT WORDS IN ARABIC WIKIPEDIA CORPUS EXCLUDING STOPWORDS
[Columns: Arabic Word, Frequency. The Arabic words are not recoverable in this copy; frequencies in rank order:]
122708, 47149, 37305, 35873, 35259, 32556, 28823, 26614, 25478, 23319, 22606, 22458, 22233, 21754, 21481, 20706, 20463, 19791, 19646, 19107, 19077, 18935, 18148, 17820, 17612, 17514, 16684, 15800, 15686, 15602
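Table III differs from Table II only in that stopwords are removed before ranking. A small sketch of that filtering step is given below; the stopword set and the counts are made-up illustrative values (the paper does not say which stopword list was used), and `most_frequent_words` is a hypothetical helper name.

```python
from collections import Counter


def most_frequent_words(word_counts, top_n, stopwords=frozenset()):
    """Return the top_n (word, frequency) pairs, skipping any word in stopwords."""
    filtered = Counter(
        {word: count for word, count in word_counts.items() if word not in stopwords}
    )
    return filtered.most_common(top_n)


# Illustrative counts and stopword set, not taken from the corpus.
example_counts = Counter({"في": 5, "من": 4, "مصر": 3, "كتاب": 2})
example_stopwords = frozenset({"في", "من"})

with_stopwords = most_frequent_words(example_counts, 2)
without_stopwords = most_frequent_words(example_counts, 2, example_stopwords)
print(with_stopwords)
print(without_stopwords)
```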

B. Egyptian Wikipedia Corpus

Egyptian Wikipedia started in 2008 with the language code “arz” [11]. It received many criticisms [11]: that it would disperse efforts and divert them from Arabic Wikipedia; that it would be useless because few people search for information written in the Egyptian dialect; that there is no formal grammar or structure for the Egyptian dialect; that education in schools and universities in Egypt is in Standard Arabic; that the quality of Egyptian Wikipedia is not good; and that the project would kill the Arabic language. However, the project continued, and it had 16,203 articles by 20 Jan 2017.

The information of the Egyptian Wikipedia corpus is presented in Table IV, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words). We make this resource public to provide an extract of the Egyptian Wikipedia corpus for the research community, and to shed light on Wikipedia as a source of Egyptian dialect text.

TABLE IV
EGYPTIAN WIKIPEDIA CORPUS
|d|    16,203 documents
|w|    2.18M words
|v|    293.5K unique words

The corpus is available online at https://github.com/motazsaad/arzWikiExtracts for the research community. The objective is to shed light on Egyptian Wikipedia as a source of Egyptian dialect language resources. Such a resource was not previously available to the research community. This corpus can be used for many NLP tasks, such as building language models for Egyptian speech recognition systems or for Egyptian-Arabic machine translation systems, or building an Egyptian dialect identification system.

Table V shows the most frequent words in the Egyptian Wikipedia corpus. As we can see, this list presents the most used words in the Egyptian dialect. The list is interesting because it can be used to extract a stopword list for the Egyptian dialect. In addition, it can be used to build an Egyptian dialect lexicon.

TABLE V
MOST FREQUENT WORDS IN EGYPTIAN WIKIPEDIA CORPUS
[Columns: Egyptian Word, Frequency. The Egyptian words are not recoverable in this copy; frequencies in rank order:]
102238, 71374, 42370, 22464, 20406, 13703, 12880, 12149, 10109, 7833, 7143, 6876, 6411, 6290, 6006, 5963, 5818, 5150, 5023, 4891, 4615, 4504, 4241, 4166, 4113, 4111, 4038, 3824, 3788, 3717

III. WIKIPEDIA ARTICLES ALIGNER

WikiDocsAligner is designed to work on any pair of languages, so it is language independent. The alignment process is done as a pipeline, as follows:

1) Download the Wikipedia articles and interlanguage links dumps from https://dumps.wikimedia.org/
2) Use WikiExtractor to extract plain Wikipedia documents from the XML dumps.
3) Pass the pages and interlanguage links dumps to WikiDocsAligner.
4) WikiDocsAligner parses the SQL scripts, which contain the insert statements of the interlanguage links, and converts them into Pandas DataFrames [12]. A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. One can think of a DataFrame as a spreadsheet or SQL table, or as a dictionary of Series objects.
5) WikiDocsAligner queries the DataFrame for the title of the target document, given the document id in the source language and the target language code.
6) WikiDocsAligner searches the target corpus for the required title and gets the corresponding document.
7) Finally, WikiDocsAligner aligns the source and target document pair and saves the two documents in files with the same name, but in separate directories.

This tool is open source software licensed under GPL-3.0 and is available online at https://github.com/motazsaad/WikiDocsAligner

IV. ARABIC-EGYPTIAN ALIGNED CORPUS

Table VI presents the Arabic-Egyptian comparable corpus aligned by WikiDocsAligner, where |d| is the number of documents in the corpus, |w| is the number of words in the corpus, and |v| is the vocabulary size (number of unique words). It can be noted from the table that the Arabic part of the Arabic-Egyptian corpus has more words than the Egyptian part: the Arabic part is composed of about 8.4M words and the Egyptian part of about 1.5M words. Additionally, the vocabulary of the Arabic part is larger than that of the Egyptian part. This is expected because the volume of contributions to Arabic Wikipedia is much larger than that to the Egyptian one.

TABLE VI
ARABIC-EGYPTIAN WIKIPEDIA CORPUS
       Arabic Wikipedia    Egyptian Wikipedia
|d|    10,197              10,197
|w|    8,397,154           1,543,516
|v|    740,055             215,659

The corpus is available online at https://github.com/motazsaad/comparableWikiCoprus for the research community. This resource is useful for many NLP applications, such as improving the performance of machine translation systems and building Arabic dialect identification systems.

Table VII shows the most frequent words in the Arabic-Egyptian comparable corpus. It can be noted from the table that the lists are comparable, that is, the rank order of the Arabic words and of their equivalent Egyptian words is close.

TABLE VII
MOST FREQUENT WORDS IN ARABIC-EGYPTIAN WIKIPEDIA COMPARABLE CORPUS
[Columns: Egyptian Word, Frequency, Arabic Word, Frequency. The words are not recoverable in this copy; frequencies in rank order:]
Egyptian: 79760, 52461, 31610, 16462, 15544, 10604, 9821, 8992, 6832, 6237, 5566, 5061, 4949, 4668, 4593, 4542, 3718, 3686, 3409, 3316, 3263, 3257, 3257, 3180, 3021, 2975, 2821, 2799, 2737, 2667
Arabic: 82496, 59782, 27523, 21187, 14220, 11616, 10499, 9051, 8388, 7778, 7368, 7279, 6681, 5924, 5885, 5846, 5822, 5621, 5395, 5137, 4986, 4876, 4850, 4777, 4562, 4359, 3928, 3842, 3598, 3581

V. CONCLUSION

In this paper we presented WikiDocsAligner, an off-the-shelf tool for aligning comparable documents from Wikipedia. The tool is open source, licensed under GPL-3.0, and publicly available to other researchers. It can easily be used to align Wikipedia documents in any language pair.

We then presented the Arabic Wikipedia corpus, which is a recent extract of Arabic Wikipedia. The objective was to provide an up-to-date extract of Arabic Wikipedia and make it available to other researchers. We also presented the Egyptian Wikipedia corpus. Such a resource does not exist in the literature. The objective was to introduce Egyptian Wikipedia as a source of Egyptian dialect language resources. Finally, we presented the Arabic-Egyptian comparable corpus, which is a set of Arabic-Egyptian documents aligned by WikiDocsAligner.

The resources built in this paper (the Arabic, Egyptian and Arabic-Egyptian corpora) are very useful for many NLP applications such as machine translation, dialect identification and speech recognition.

In future work, we will use the Arabic-Egyptian comparable corpus to improve the performance of an Arabic-Egyptian machine translation system. In addition, the corpus will be used to build an Egyptian dialect identification system.

REFERENCES

[1] M. Saad, “Mining Documents and Sentiments in Cross-lingual Context,” Ph.D. dissertation, Université de Lorraine, January 2015.
[2] A. Štromajerová, V. Baisa, and M. Blahuš, “Between comparable and parallel: English-czech corpus from wikipedia,” RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, p. 3, 2016.
[3] H. Hu and T. Yao, “Sentence alignment for bilingual comparable corpus from wikipedia,” Journal of Chinese Information Processing, vol. 1, p. 029, 2016.
[4] G. Tholpadi, C. Bhattacharyya, and S. Shevade, “Translation induction on indian language corpora using translingual themes from other languages,” in International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2015, pp. 505–519.
[5] Wikipedia, “Wiki markup — wikipedia, the free encyclopedia,” 2017, [Online; accessed 2-February-2017]. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Wiki_markup&oldid=762249244
[6] Wikipedia, “Wikidata — wikipedia, the free encyclopedia,” 2017, [Online; accessed 2-February-2017]. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Wikidata&oldid=763159853
[7] S. Harrat, K. Meftouh, M. Abbas, S. Jamoussi, M. Saad, and K. Smaïli, “Cross-dialectal Arabic Processing,” in 16th International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, April 14-20, vol. 9041. Springer International Publishing, 2015, pp. 620–632. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-18111-0_47
[8] H. Bouamor, N. Habash, and K. Oflazer, “A multidialectal parallel corpus of arabic,” in LREC, 2014, pp. 1240–1245.
[9] R. Cotterell and C. Callison-Burch, “A multi-dialect, multi-genre corpus of informal written arabic,” in LREC, 2014, pp. 241–245.
[10] H. Mubarak and K. Darwish, “Using twitter to collect a multi-dialectal corpus of arabic,” in Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, pp. 1–7.
[11] Wikipedia, “Egyptian arabic wikipedia — wikipedia, the free encyclopedia,” 2016, [Online; accessed 2-February-2017]. [Online]. Available: https://ar.wikipedia.org/w/index.php?title=%D9%88%D9%8A%D9%83%D9%8A%D8%A8%D9%8A%D8%AF%D9%8A%D8%A7_%D9%85%D8%B5%D8%B1%D9%8A&oldid=21827573
[12] W. McKinney, “Data structures for statistical computing in python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–56.
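As a rough illustration of the alignment steps listed in Section III (parse the interlanguage links SQL script, look up the target title for each source document id, then pair the documents), the following sketch may help. It is not the tool's actual code. It assumes each row of the links dump has the three-column layout (source page id, language code, target title), and it uses a plain dictionary in place of the Pandas DataFrame that the tool uses, to keep the example dependency-free. The names `parse_langlinks` and `align` and all sample data are hypothetical.

```python
import re

# One (page_id,'lang','title') tuple inside an INSERT statement.
# Assumption: three-column rows; backslash-escaped characters in titles are allowed.
ROW_PATTERN = re.compile(r"\((\d+),'([^']*)','((?:[^'\\]|\\.)*)'\)")


def parse_langlinks(sql_text, target_lang):
    """Map source page id -> title of the linked page in target_lang."""
    links = {}
    for line in sql_text.splitlines():
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for page_id, lang, title in ROW_PATTERN.findall(line):
            if lang == target_lang:
                links[int(page_id)] = title.replace("\\'", "'")
    return links


def align(source_docs, target_docs_by_title, links):
    """Pair each source document with its linked target document, when both exist.

    source_docs maps page id -> text; target_docs_by_title maps title -> text.
    """
    pairs = []
    for page_id, source_text in source_docs.items():
        title = links.get(page_id)
        if title in target_docs_by_title:
            pairs.append((source_text, target_docs_by_title[title]))
    return pairs


# Made-up sample data standing in for the dumps.
sample_sql = "INSERT INTO `langlinks` VALUES (7,'arz','مصر'),(7,'en','Egypt'),(9,'arz','');\n"
links = parse_langlinks(sample_sql, "arz")
source_docs = {7: "نص عربي عن مصر", 9: "نص بلا مقابل"}
target_docs = {"مصر": "نص مصري عن مصر"}
pairs = align(source_docs, target_docs, links)
print(len(pairs))
```

Writing each pair to two files with the same name in separate directories, as in step 7, would follow directly from the `pairs` list.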
