OSIAN: Open Source International Arabic News Corpus - Preparation and Integration into the CLARIN-infrastructure

Imad Zeroual*, Dirk Goldhahn†, Thomas Eckart†, Abdelhak Lakhouaja*
*Computer Sciences Laboratory, Mohamed First University, Morocco
{mr.imadine, abdel.lakh}@gmail.com
†Natural Language Processing Group, University of Leipzig, Germany
{dgoldhahn, teckart}@informatik.uni-leipzig.de

Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 175-182, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Abstract

The World Wide Web has become a fundamental resource for building large text corpora. Broadcasting platforms such as news websites are rich sources of data on diverse topics and form a valuable foundation for research. The Arabic language is extensively used on the Web; still, Arabic is a relatively under-resourced language in terms of the availability of freely annotated corpora. This paper presents the first version of the Open Source International Arabic News (OSIAN) corpus. The corpus data was collected from international Arabic news websites, all freely available on the Web. The corpus consists of about 3.5 million articles comprising more than 37 million sentences and roughly 1 billion tokens. It is encoded in XML; each article is annotated with metadata information, and each word is annotated with lemma and part-of-speech. The described corpus is processed, archived, and published in the CLARIN infrastructure. This publication includes descriptive metadata via OAI-PMH, direct access to the plain text material (available under the Creative Commons Attribution-NonCommercial 4.0 International License - CC BY-NC 4.0), and integration into the WebLicht annotation platform and CLARIN's Federated Content Search (FCS).

1 Introduction

The Arabic language is spoken by 422 million people, making it the fourth most used language on the Web1. Its presence on the Web showed the highest growth among the ten most frequent online languages over the last 18 years. However, a few years ago, Arabic was considered a relatively under-resourced language lacking the basic resources and corpora for computational linguistics; not a single Modern Standard Arabic tagged corpus was freely or publicly available. Since then, major progress has been made in building Arabic linguistic resources, primarily corpora (Zeroual and Lakhouaja, 2018a); still, building valuable annotated corpora of considerable size is expensive, time-consuming, and requires appropriate tools. Therefore, many Arabic corpus builders release their corpora in raw format.

For building the Open Source International Arabic News (OSIAN) corpus, the typical procedures of the Leipzig Corpora Collection were utilized. Furthermore, a language-independent part-of-speech (PoS) tagger, TreeTagger, was adapted to annotate the OSIAN corpus with lemma and part-of-speech tags.

The prime motivation for building the OSIAN corpus is the lack of open-source Arabic corpora that can serve the needs of Arabic Natural Language Processing (ANLP) and Arabic Information Retrieval (AIR), among other research areas. Hence, we expect that the OSIAN corpus can be used to answer relevant research questions in corpus linguistics, especially investigating variation and distinction between international and national news broadcasting platforms from a diachronic and geographical perspective.

After this introduction, the remainder of the paper is structured as follows: In Section 2, we highlight the state of the art of web-crawled corpora for the Arabic language. The methodology and tools used to build the OSIAN corpus are presented in Section 3. In Section 4, the OSIAN corpus is described in more detail, and some data analyses are performed and discussed. Finally, Section 5 contains concluding remarks and future work.

2 Literature review

The World Wide Web is an important source for researchers interested in the compilation of very large corpora. A recent survey (Zeroual and Lakhouaja, 2018b) reports that 51% of corpora are constructed, totally or partially, from Web content. Web corpora continue to gain relevance within computational and theoretical linguistics. Given their size and the variety of domains covered, using web-derived corpora is another way to overcome typical problems faced by statistical corpus-based studies, such as data sparseness and the lack of variation. Besides, they can be used to evaluate different approaches to the classification of web documents and content by text genre and topic area (e.g., Chouigui et al., 2017). Furthermore, web corpora have become a prime and well-established source for lexicographers to create many large and varied dictionaries using specialised tools such as the corpus query and corpus management tool Sketch Engine (Kovář et al., 2016). Moreover, completely new areas of research that deal exclusively with web corpora have emerged, aiming to build, investigate, and analyse corpora based on online social network posts, short messages, and online forum discussions.

Publicly available Arabic web corpora are quite limited, which greatly impacts research and development in Arabic NLP and IR. However, some research groups (Zaghouani, 2017) have shown potential in building web-derived corpora in recent years. Among them are:

• Open Source Arabic Corpora2 (OSAC) (Saad and Ashour, 2010): a collection of large, freely accessible raw corpora. The OSAC corpus consists of web documents extracted from over 25 Arabic websites using the open-source offline explorer HTTrack. The compilation procedure involves converting HTML/XML files to UTF-8 encoding using "Text Encoding Converter" as well as removing the HTML/XML tags. The final version of the corpus comprises roughly 113 million tokens. It covers several topics, namely Economy, History, Education, Religion, Sport, Health, Astronomy, Law, Stories, and Cooking Recipes.

• arTenTen (Arts et al., 2014): a member of the TenTen Corpus Family (Jakubíček et al., 2013). arTenTen is a web-derived corpus of Arabic crawled using SpiderLing (Suchomel et al., 2012) in 2012. The arTenTen corpus is partially tagged: one sample, comprising roughly 30 million words, is tagged using the Stanford Arabic part-of-speech tagger, while another sample, containing over 115 million words, is tokenised, lemmatised, and part-of-speech tagged using the MADA system. All in all, arTenTen comprises 5.8 billion words, but it can only be explored for a fee via the Sketch Engine website3.

• ArabicWeb16 (Suwaileh et al., 2016): Since 2009, the ClueWeb09 web crawl (Callan et al., 2009), which includes 29.2 million Arabic pages, was considered the only and largest Arabic web crawl available. However, in 2016, a new and larger crawl of the Arabic web became publicly available. This crawl, ArabicWeb16, comprises over 150M web pages crawled over the month of January 2016. In addition to addressing the limitations of ClueWeb09, ArabicWeb16 covers both dialectal and Modern Standard Arabic. Finally, the total size of the compressed ArabicWeb16 dataset is about 2 TB, and it is available for download after filling in a request form4.

• The GDELT Project5: a free open platform for research and analysis of the global database. All the datasets released are free, open, and available for unlimited and unrestricted use for any academic, commercial, or governmental purpose. It is also possible to download the raw data files, visualise them, or analyse them at virtually limitless scale. Recently, the GDELT Project has started to create linguistic resources. In fact, 9.5 billion words of worldwide Arabic news were monitored over 14 months (February 2015 to June 2016) to produce a trigram dataset for the Arabic language. Consequently, an Arabic trigram table of the 6,444,208 trigrams that appeared more than 75 times was produced6.

It is worth mentioning that larger corpora in the region of billions of words are usually created by downloading texts from the web unselectively with respect to their text type or content. Therefore, the content of such corpora cannot be determined before their construction; thus, it is necessary to filter, clean, and evaluate it afterwards.

In addition to direct access via a Web interface, LCC data is offered for free download. For each word the dictionaries contain:

• Word frequency information.
• Sample sentences.
• Statistically significant word co-occurrences (based on left or right neighbours or whole sentences).
• A semantic map visualizing the strongest word co-occurrences.
• Part-of-speech information (partially).
• Similar words and other semantic information (partially).

3.1.2 Crawling and processing of data

For corpus creation, an adapted version of the CURL portal (Crawling Under-Resourced Languages8) (Goldhahn et al., 2016) of the LCC was utilized. CURL allows creating Web-accessible and downloadable corpora by simply entering

1 http://www.internetworldstats.com/stats7.htm
2 https://sites.google.com/site/motazsite/corpora/osac
3 https://www.sketchengine.co.uk/
4 https://sites.google.com/view/arabicweb16
5 https://www.gdeltproject.org/
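The abstract states that OSIAN is encoded in XML, with per-article metadata and per-word lemma and part-of-speech annotation. As an illustration only, the following sketch reads such a file with Python's standard library; the element names (article, sentence, w) and attribute names (lemma, pos) are hypothetical assumptions for this example, not the published OSIAN schema.

```python
# Sketch: reading a hypothetical OSIAN-style annotated XML article.
# The schema below (<article>, <sentence>, <w>, lemma/pos attributes)
# is an illustrative assumption, not the actual OSIAN format.
import xml.etree.ElementTree as ET

SAMPLE = """
<article id="a1" source="example-news" url="http://example.com/1" date="2018-05-01">
  <sentence id="s1">
    <w lemma="كتب" pos="VERB">كتب</w>
    <w lemma="صحفي" pos="NOUN">الصحفي</w>
  </sentence>
</article>
"""

def read_article(xml_text):
    """Return (article metadata, sentences as lists of (token, lemma, pos))."""
    root = ET.fromstring(xml_text)
    meta = dict(root.attrib)  # article-level metadata attributes
    sentences = []
    for sent in root.iter("sentence"):
        words = [(w.text, w.get("lemma"), w.get("pos")) for w in sent.iter("w")]
        sentences.append(words)
    return meta, sentences

meta, sentences = read_article(SAMPLE)
print(meta["source"])   # example-news
print(sentences[0][0])  # ('كتب', 'كتب', 'VERB')
```

Keeping the annotation as attributes on a word element, as sketched here, makes the corpus directly consumable by standard XML tooling without a format-specific parser.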
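The GDELT trigram table described above keeps only trigrams above a frequency threshold (more than 75 occurrences in GDELT's case). A minimal sketch of this count-and-filter step, on a toy token list and threshold:

```python
# Sketch: building a frequency-thresholded trigram table, analogous in
# spirit to GDELT's Arabic trigram dataset (which kept trigrams that
# occurred more than 75 times). Toy data and threshold for illustration.
from collections import Counter

def trigram_table(tokens, min_count):
    """Count all adjacent trigrams, keep those occurring more than min_count times."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {tri: n for tri, n in counts.items() if n > min_count}

tokens = "a b c a b c a b d".split()
table = trigram_table(tokens, 1)
print(table)  # {('a', 'b', 'c'): 2, ('b', 'c', 'a'): 2, ('c', 'a', 'b'): 2}
```

At corpus scale the same logic would be run as a streaming or sharded count rather than an in-memory Counter, but the threshold-filter structure is identical.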
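The LCC dictionaries list "statistically significant word co-occurrences"; such significance is typically computed with an association measure over joint and marginal counts. The excerpt does not name the measure, so as one common choice the sketch below uses the log-likelihood ratio, comparing a pair that always co-occurs against a statistically independent pair.

```python
# Sketch: scoring word-pair co-occurrence with the log-likelihood ratio,
# a common association measure for collocation extraction. The measure
# actually used by the LCC is not specified in this excerpt.
import math

def _ll(k, n, p):
    """Log-likelihood of k successes in n Bernoulli trials with probability p."""
    if p <= 0 or p >= 1:
        return 0.0  # degenerate p; correct here since these arise only when k == 0 or k == n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(k, n1, n2, n):
    """k: joint count, n1/n2: counts of words A/B, n: total windows (e.g. sentences)."""
    p = n2 / n                  # P(B) under independence
    p1 = k / n1                 # P(B | A)
    p2 = (n2 - k) / (n - n1)    # P(B | not A)
    return 2 * (_ll(k, n1, p1) + _ll(n2 - k, n - n1, p2)
                - _ll(k, n1, p) - _ll(n2 - k, n - n1, p))

# A pair that always co-occurs scores far above an independent pair.
print(log_likelihood_ratio(k=20, n1=20, n2=20, n=1000) >
      log_likelihood_ratio(k=1, n1=20, n2=50, n=1000))  # True
```

Ranking neighbour pairs by such a score, then keeping those above a significance cutoff, yields exactly the kind of per-word co-occurrence lists the dictionaries expose.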