Towards an On-Demand Simple Portuguese Wikipedia

Arnaldo Candido Junior
Institute of Mathematics and Computer Sciences, University of São Paulo
arnaldoc at icmc.usp.br

Ann Copestake
Computer Laboratory, University of Cambridge
Ann.Copestake at cl.cam.ac.uk

Lucia Specia
Research Group in Computational Linguistics, University of Wolverhampton
l.specia at wlv.ac.uk

Sandra Maria Aluísio
Institute of Mathematics and Computer Sciences, University of São Paulo
sandra at icmc.usp.br

Proceedings of the 2nd Workshop on Speech and Language Processing for Assistive Technologies, pages 137–147, Edinburgh, Scotland, UK, July 30, 2011. © 2011 Association for Computational Linguistics

Abstract

The Simple English Wikipedia provides a simplified version of Wikipedia's English articles for readers with special needs. However, there are fewer efforts to make information in Wikipedia in other languages accessible to a large audience. This work proposes the use of a syntactic simplification engine with high precision rules to automatically generate a Simple Portuguese Wikipedia on demand, based on user interactions with the main Portuguese Wikipedia. Our estimates indicated that a human can simplify about 28,000 occurrences of analysed patterns per million words, while our system can correctly simplify 22,200 occurrences, with estimated f-measure 77.2%.

1 Introduction

The Simple English Wikipedia[1] is an effort to make information in Wikipedia[2] accessible for less competent readers of English by using simple words and grammar. Examples of intended users include children and readers with special needs, such as users with learning disabilities and learners of English as a second language.

Simple English (or Plain English), used in this version of Wikipedia, is a result of the Plain English movement that occurred in Britain and the United States in the late 1970's as a reaction to the unclear language used in government and business forms and documents. Some recommendations on how to write and organize information in Plain Language (the set of guidelines to write simplified texts) are related to both the syntactic and lexical levels: use short sentences; avoid hidden verbs; use active voice; use concrete, short, simple words. A number of resources, such as lists of common words[3], are available for the English language to help users write in Simple English. These include lexical resources like the MRC Psycholinguistic Database[4], which helps identify difficult words using psycholinguistic measures. However, such resources do not exist for Portuguese. An exception is a small list of simple words compiled as part of the PorSimples project (Aluísio et al., 2008).

Although the Plain Language guidelines can in principle be applied to many languages and text genres, for Portuguese there are very few efforts using Plain Language to make information accessible to a large audience. To the best of our knowledge, the solution offered by Portugues Claro[5] to help organizations produce European Portuguese (EP) documents in simple language is the only commercial option in such a direction. For Brazilian Portuguese (BP), a Brazilian Law (10098/2000) tries to ensure that content in e-Gov sites and services is written in simple and direct language in order to remove barriers in communication and to ensure citizens' rights to information and communication access. However, as has been shown in Martins and Filgueiras (2007), content in such websites still needs considerable rewriting to follow the Plain Language guidelines.

A few efforts from the research community have recently resulted in natural language processing systems to simplify Portuguese and make it clearer. ReEscreve (Barreiro and Cabral, 2009) is a multi-purpose paraphraser that helps users to simplify their EP texts by reducing their ambiguity, number of words and complexity. The linguistic phenomena currently paraphrased are support verb constructions, which are replaced by stylistic variants. In the case of BP, the lack of simplification systems led to the development of the PorSimples project (Aluísio and Gasperin, 2010). This project uses simplification at different linguistic levels to provide simplified text to poor literacy readers.

For English, automatic text simplification has been exploited for helping readers with poor literacy (Max, 2006) and readers with other special needs, such as aphasic people (Devlin and Unthank, 2006; Carroll et al., 1999). It has also been used in bilingual education (Petersen, 2007) and for improving the accuracy of Natural Language Processing (NLP) tasks (Klebanov et al., 2004; Vickrey and Koller, 2008).

Given the general scarcity of human resources to manually simplify large content repositories such as Wikipedia, simplifying texts automatically can be the only feasible option. The Portuguese Wikipedia, for example, is the tenth largest Wikipedia (as of May 2011), with 683,215 articles and approximately 860,242 contributors[6].

In this paper we propose a new rule-based syntactic simplification system to create a Simple Portuguese Wikipedia on demand, based on user interactions with the main Portuguese Wikipedia. We use a simplification engine to change passive into active voice and to break down and change the syntax of subordinate clauses. We focus on these operations because they are more difficult to process by readers with learning disabilities as compared to others such as coordination and complex noun phrases (Abedi et al., 2011; Jones et al., 2006; Chappell, 1985). User interaction with Wikipedia can be performed by a system like Facilita[7] (Watanabe et al., 2009), a browser plug-in developed in the PorSimples project to allow automatic adaptation (summarization and syntactic simplification) of any web page in BP.

This paper is organized as follows. Section 2 presents related work on syntactic simplification. Section 3 presents the methodology to build and evaluate the simplification engine for BP. Section 4 presents the results of the engine evaluation. Section 5 presents an analysis of simplification issues and discusses possible improvements. Section 6 contains some final remarks.

[1] http://simple.wikipedia.org/
[2] http://www.wikipedia.org/
[3] http://simple.wiktionary.org/
[4] http://www2.let.vu.nl/resources/elw/resource/mrc.html
[5] http://www.portuguesclaro.pt/
[6] http://meta.wikimedia.org/wiki/List_of_Wikipedias#Grand_Total
[7] http://nilc.icmc.usp.br/porsimples/facilita/

2 Related work

Given the dependence of syntactic simplification on linguistic information, successful approaches are mostly based on rule-based systems. Approaches using operations learned from corpora have not been shown to be able to perform complex operations such as the splitting of sentences with relative clauses (Chandrasekar and Srinivas, 1997; Daelemans et al., 2004; Specia, 2010). On the other hand, the use of machine learning techniques to predict when to simplify a sentence, i.e. learning the properties of language that distinguish simple from normal texts, has achieved relative success (Napoles and Dredze, 2010). Therefore, most work on syntactic simplification still relies on rule-based systems to simplify a set of syntactic constructions. This is also the approach we follow in this paper. In what follows we review some relevant work on syntactic simplification.

The seminal work of Chandrasekar and Srinivas (1997) investigated the induction of syntactic rules from a corpus annotated with part-of-speech tags augmented by agreement and subcategorization information. They extracted syntactic correspondences and generated rules aiming to speed up parsing and improve its accuracy, but not working on naturally occurring texts.

Daelemans et al. (2004) compared both machine learning and rule-based approaches for the automatic generation of TV subtitles for hearing-impaired people. In their machine learning approach, a simplification model is learned from parallel corpora with TV programme transcripts and the associated subtitles. Their method used a memory-based learner and features such as words, lemmas, POS tags, chunk tags, relation tags and proper name tags, among other features (30 in total). However, this approach did not perform as well as the authors expected, making errors like removing sentence subjects or deleting a part of a multi-word unit.

More recently, Specia (2010) presented a new approach for text simplification, based on the framework of Statistical Machine Translation. Although the results are promising for lexical simplification, syntactic rewriting was not captured by the model to address long-distance operations, since syntactic information was not included in the framework.

Inui et al. (2003) proposed a rule-based system for text simplification aimed at deaf people. Using about one thousand manually created rules, the authors generate several paraphrases for each sentence and train a classifier to select the simpler ones. Promising results were obtained, although different types of errors on the paraphrase [...]

[...] by a Link Grammar parser. Their motivation is to improve the performance of systems for extracting Protein-Protein Interactions automatically from biomedical articles by automatically simplifying sentences. Besides treating the syntactic phenomena described in Siddharthan (2006), they remove describing phrases occurring at the beginning of the sentences, like “These results suggest that” and “As reported previously”. While they focus on the scientific genre, our work is focused on the encyclopedic genre.

In order to obtain a text easier to understand
