13/10/2013

AUTHOR’S NOTE

Please consult the new version of the tutorial notes at: http://www.dtic.upf.edu/~hsaggion/workshops_tutorials.html which will be updated for the conference

Text Simplification with a Purpose
Horacio Saggion
Universitat Pompeu Fabra, Barcelona
http://www.dtic.upf.edu/~hsaggion

IJCNLP 2013 – NAGOYA – JAPAN © Horacio Saggion - 2013

Teacher

Horacio Saggion
http://taln.upf.edu/users/hsaggion
https://twitter.com/h_saggion
http://www.taln.upf.edu
Automatic text summarization: SumUM; SUMMA; MEAD; SummBank
: GATE
Text Simplification: www.simplext.es
TALN - Tractament Automàtic del Llenguatge Natural
http://www.taln.upf.edu
~15 members: profs/lecturers, researchers, PhDs, Masters, etc.
We are at Universitat Pompeu Fabra, Campus de la Comunicación, Barcelona

Text Simplification

The process of transforming a text into an equivalent which is more readable and/or understandable by a target audience
During simplification, complex sentences are split into simple ones and uncommon vocabulary is replaced by more common expressions
Started to attract the attention of natural language processing some years ago (1996) mainly as a pre-processing step
A number of events organized around the theme: ICCHP, SLPAT, NLP4ITA, PITR


This is simplification – human ☺

Original Text
Amnesty International accused the U.S. authorities to provide an "inhuman" treatment to Bradley Manning, a soldier accused of leaking "wires" of American diplomacy to the website Wikileaks.

Adapted Text (by trained editor)
United States treats very bad a soldier in prison.
The soldier is called Bradley Manning.
Bradley Manning is in prison for giving information about the Government of the United States to Wikileaks.
Wikileaks is a website which provides information on matters of public interest.

Why text simplification?

It is an interesting research problem for the NLP community
  Identify and measure sources of complexity / difficulty
  Create a “paraphrase” which is easy to read
  Several NLP expertises involved: summarization, natural language generation, sentence compression, word sense disambiguation, etc.
It is socially relevant
  Unprecedented democratization of information (e.g. Web)
  Information is not equally accessible to everyone
  UN Enable: make information and information services accessible to different groups of persons with disability


Simplification users

Deaf people (Inui & al., 2003; Chung et al., 2013)
Blind people (Grefenstette, 1998)
People with low-literacy (Williams & Reiter, 2008; Aluísio & al., 2008)
People with autism (Mitkov, 2012; Barbu et al., 2013; Orasan et al, 2013; Dornescu et al., 2013)
Second language learners (Petersen and Ostendorf, 2007; Burstein et al., 2013; Eskenazi et al. 2013)
Dyslexic people (Matausch & Peböck, 2010, Rello et al., 2013)
People with aphasia (Carroll et al., 1999)

NLP as a simplification user

Dealing with complex sentences
Initial simplification application (Chandrasekar et al., 1996)
Improve results in IE (Jonnalagadda & Gonzalez, 2011; Evans, 2011; Minard et al., 2012)
Assist in question generation (Bernhard et al, 2012)
Text summarization (Grefenstette, 1998, Siddharthan et al, 2004)


[Figure: SIMPLIFICATION AND IE. A simple text paired with an event template.]
Text: There was an earthquake on 28 June 1976. The earthquake struck the city of Tangshan. Tangshan is in China. The earthquake had a magnitude of 7.8. The earthquake killed 240,000 people.
Template:
  Epicenter: Tangshan
  Dead: 240,000
  Time & Date: 3:42 a.m. 28/06/1976
  Magnitude: 7.8
  Country: China

[Figure: SIMPLIFICATION, GENERATION. The same text paired with a template.]
Template:
  Epicenter: Tangshan
  Dead: 240,000
  Time: 28/06/1976
  Magnitude: 7.8
  Country: China

Languages

English (Chandrasekar et al., 1996; Siddharthan, 2002; Carroll et al. 1998, Bouayad-Agha et al., 2009; Zhu et al., 2010; Coster & Kauchak, 2011; Yatskar et al., 2011), French (Seratan, 2012; François & Fairon, 2012), Portuguese (Aluísio et al., 2008; Specia, 2010), Japanese (Inui et al., 2003), Arabic (Al-Subaihin and Al-Khalifa, 2011), Danish (Klerke & Søgaard, 2012), Swedish (Smith et al., 2010; Keskisärkkä, 2012), Spanish (Saggion et al. 2011; Bautista et al., 2012; Rello et al, 2013; Mosquera et al., 2013), Italian (Dell’Orletta et al. 2011, Tonelli et al. 2012), Basque (Aranzabe et al, 2012), Korean (Chung et al, 2013)

Simple language initiatives

Easy to read initiatives
  Plain English / Basic English (Ogden, 1930);
  French Rationale;
  Easy-to-Read network (Petz and Tronbacke, 2008);
  Fácil Lectura (http://www.lecturafacil.net);
  European Association Inclusion Europe
Guidelines
  Simple and direct language;
  one idea per sentence;
  avoid jargon, technical terms, and abbreviations;
  one word per concept;
  personalization;
  use of active voice


Where to find simple texts?

Original:
Opera is an art form in which singers and musicians perform a dramatic work combining text (called a libretto) and musical score.
The performance is typically given in an opera house, accompanied by an orchestra or smaller musical ensemble.

Simple:
Opera is a drama set to music. An opera is a play in which everything is sung instead of spoken.
Operas are usually performed in opera houses.


Where to find simple texts?

EASY NEWS


Where to find simple texts?
Klartale News Paper (Norwegian)
8 Pages News Paper (Swedish)


Where to find simple texts?
L’Essentiel News Paper (French)
Dueparole News Paper (Italian)


Where to find simple texts?
LiteracyWorks Web site (English)

Exercise

Take one article from one (or more) of the web sites given, translate it into English or your own language to “verify” that it is a “simple” text
  http://www.noticiasfacil.es/ES (Spanish)
  http://www.dueparole.it/sommario_.asp (Italian)
  http://www.8sidor.se/ (Swedish)
  http://www.klartale.no/ (Norwegian)


Opportunities for NLP

There is a heavy load associated to manual simplification of textual content
It is impossible to wait to create texts that are adaptable
Part of the work could be done automatically
Machines could also help professionals
There is also the idea that simple texts could be better dealt with by NLP tools ...

Natural Language Processing for Text Simplification

Induced or hand-crafted rules for reducing syntactic complexity (Chandrasekar et al. 1996; Devlin & Tait, 1998; Siddharthan, 2002; Aluísio & al, 2008; Bouayad-Agha et al. 2009)
  Simplification of relative clauses, treatment of passive constructions; substitution of pronouns by antecedents.
Lexical substitution procedures (Caroll & al., 1998; De Belder & al, 2011; Biran & al., 2011)
  Replacement of infrequent words by their most frequent synonym or replacement of words by a shorter synonym
Combined approach treating the problem as MT (Zhu et al., 2010)
  Using English Wikipedia and Simple English Wikipedia as a parallel corpus and learning from it word replacements, “drop”, “copy”, “split”, and “reordering” operations.


First Steps: manual rules

Rules over syntactic representations (Chandrasekar et al. 1996; Siddharthan, 2002)
Superficial analysis (chunking) to identify noun and verb groups
Rules: W X:NP, RELPRO Y, Z. => W X:NP Z. X:NP Y. (manually developed)
Hu Jintao, who is the current Paramount Leader of the People’s Republic of China, was visiting Spain
  W = ∅
  X = Hu Jintao
  RELPRO = who
  Y = is the current Paramount Leader of the People’s Republic of China
  Z = was visiting Spain
Hu Jintao was visiting Spain. Hu Jintao is the current Paramount Leader of the People’s Republic of China.

First Steps: rule learning

Learning to transform from “complex” to “simple” (Chandrasekar & Srinivas, 1996)
(O) Talwinder Singh, who masterminded the 1984 Kanishka crash, was killed in a fierce two-hour encounter.
(S) Talwinder Singh was killed in a fierce two-hour encounter. Talwinder Singh masterminded the 1984 Kanishka crash.
[Figure: ORIGINAL and SIMPLIFICATION trees. In the original, "was killed" dominates "Talwinder Singh" and "in … encounter", with the relative "who masterminded the… crash" attached to "Talwinder Singh". The relative is CUT, and "masterminded" is COPIED into a second tree with "Talwinder Singh" and "the… crash".]
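The manual rule W X:NP, RELPRO Y, Z. => W X:NP Z. X:NP Y. can be sketched over plain strings. This is only an illustration under strong assumptions: the noun phrase X is approximated by a run of capitalised words instead of the output of a chunker, and the names `RULE` and `split_relative` are hypothetical.

```python
import re

# Surface approximation of: W X:NP, RELPRO Y, Z. => W X:NP Z. X:NP Y.
# Assumption: X is a run of capitalised tokens; a real system would use chunker output.
RULE = re.compile(
    r"^(?P<W>.*?)(?P<X>[A-Z][\w']*(?: [A-Z][\w']*)*)\s*,\s*"
    r"(?P<RELPRO>who|which)\s+(?P<Y>[^,]+?)\s*,\s*(?P<Z>[^.]+?)\s*\.$"
)

def split_relative(sentence):
    """Return two simple sentences if the rule matches, else the input unchanged."""
    m = RULE.match(sentence.strip())
    if m is None:
        return [sentence]
    w, x, y, z = m.group("W"), m.group("X"), m.group("Y"), m.group("Z")
    # Main clause first, then the relative clause with X copied in as subject.
    return [f"{w}{x} {z}.", f"{x} {y}."]

if __name__ == "__main__":
    for s in split_relative(
        "Talwinder Singh, who masterminded the 1984 Kanishka crash, "
        "was killed in a fierce two-hour encounter."
    ):
        print(s)
```

A pattern over raw strings breaks as soon as Y or Z contains a comma, which is one reason the original work applies the rules over chunked or parsed representations.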


First Steps: user’s concerns

• Project PSET: “Practical Simplification of English Texts” (Devlin & Tait, 1998)
• Targeted simplification for people with aphasia
Transformation from passive to active voice
  “A bid to build an incinerator on local wasteland was today accepted by the council.” => The council today accepted a bid to build an incinerator on local wasteland.
  “Official documents were left on the underground by mistake.” => Mistake left official document on the underground. (note that this is a made up example !!!)
Resolution/replacement of anaphoric expressions
  Standard anaphora resolution system + replacement of pronouns for antecedents/referents
Vocabulary simplification
  Replacement of infrequent words by their most frequent synonym (use of a psycholinguistic database + WordNet)

What makes text difficult to read/understand?

There are a number of formulas that intend to associate a readability level or index to a given text
Some formulas are extremely simple
  they are based on concepts such as “word complexity”, frequency, sentence length, etc.
Experiments are generally carried out to show if these indices correlate with human associated indices of readability


Formulas for English

Flesch (1949) – Flesch Reading Ease
  S = 206.835 – (1.015 * ASL) – (84.6 * ASW)
  S is an index from 0 to 100, an index of 30 means very difficult, 70 means reasonable, 100 means easy
  ASL = average sentence length
  ASW = average number of syllables per word
FOG index (Gunning, 1952)
  S = 3.0680 + (0.877 * ASL) + (0.984 * PofP)
  PofP is the percentage of polysyllables (over the total number of words)
  (GL = 0.4 (ASL + # hard words))

Formulas for English

Flesch-Kincaid (Kincaid et al, 1986)
  Text Level = (0.39 x ASL) + (11.8 x ASW) - 15.59
  ASL = average number of words per sentence (# words / # sentences)
  ASW = average number of syllables per word (# syllables / # words)
  a reasonable readability index will be between 6 & 10 (grade level)
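The Flesch Reading Ease and Flesch-Kincaid formulas above can be computed directly once ASL and ASW are known. The sketch below is illustrative: the coefficients come from the slides, but the syllable counter (counting vowel groups) and the tokenisation are rough assumptions, and all function names are hypothetical.

```python
import re

def count_syllables(word):
    # Rough heuristic (an assumption, not part of the formulas): one syllable per vowel group.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_stats(text):
    """Return (ASL, ASW): words per sentence and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)
    asw = sum(count_syllables(w) for w in words) / len(words)
    return asl, asw

def flesch_reading_ease(asl, asw):
    # S = 206.835 - (1.015 * ASL) - (84.6 * ASW); higher means easier
    return 206.835 - 1.015 * asl - 84.6 * asw

def flesch_kincaid_grade(asl, asw):
    # Text Level = (0.39 * ASL) + (11.8 * ASW) - 15.59; a grade level
    return 0.39 * asl + 11.8 * asw - 15.59

if __name__ == "__main__":
    asl, asw = text_stats("Opera is a drama set to music. Operas are usually performed in opera houses.")
    print(round(flesch_reading_ease(asl, asw), 1), round(flesch_kincaid_grade(asl, asw), 1))
```

Because the syllable heuristic is approximate, scores obtained this way should be read as rough indications and checked against manual syllable counts.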


Language models for predicting text complexity

Word length and sentence length seem to be poor surrogates for text readability
Language models are very popular in NLP and can be used to measure text complexity (Si & Callan, 2001)
  text complexity prediction can be cast as a classic text classification problem
  the model seeks to estimate the degree of difficulty (g) of a document (d) using a probability distribution p(g|d)
  as always this probability is decomposed by Bayes into:

$$p(g \mid d) = \frac{p(d \mid g)\, p(g)}{p(d)}$$

Language models for predicting text complexity

The different components of the model are then estimated as follows:
  p(d|g) is a unigram model (the document is just a set of words)
  p(g) is the a priori distribution of a difficulty degree g
  p(d) is dropped from the equation because it does not affect the result
The probabilistic model is combined with a sentence length model which assumes normal distribution of lengths per difficulty level
The combined model has more predictive power than the classic Flesch-Kincaid test
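The classification view, picking the difficulty level g that maximises p(d|g) p(g) with a unigram model for p(d|g), can be sketched as follows. This is a minimal illustration, not the model of Si & Callan (2001): add-one smoothing is an assumption made here to handle unseen words, the sentence length component is left out, and the names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter

def train(docs_by_grade):
    """docs_by_grade: {grade: [token lists]} -> (per-grade statistics, vocabulary)."""
    vocab = {w for docs in docs_by_grade.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_grade.values())
    model = {}
    for g, docs in docs_by_grade.items():
        counts = Counter(w for d in docs for w in d)
        model[g] = {
            "log_prior": math.log(len(docs) / total_docs),  # log p(g)
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab

def predict(model_vocab, tokens):
    """Return argmax_g of log p(g) + sum_w log p(w|g); p(d) is constant and dropped."""
    model, vocab = model_vocab
    v = len(vocab) + 1  # add-one smoothing with one extra slot for unseen words (assumption)

    def score(g):
        m = model[g]
        return m["log_prior"] + sum(
            math.log((m["counts"][w] + 1) / (m["total"] + v)) for w in tokens
        )

    return max(model, key=score)

if __name__ == "__main__":
    toy = {  # invented training data
        "easy": [["the", "cat", "sat"], ["the", "dog", "ran"]],
        "hard": [["the", "sovereign", "state", "is", "bordered"],
                 ["the", "territory", "is", "sovereign"]],
    }
    print(predict(train(toy), ["the", "cat", "ran"]))
```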


Text Quality (Pitler & Nenkova, 2008)

Study the relative importance of different factors affecting “text quality”
  words, syntax, discourse (e.g. coreference), cohesion, etc.
Dataset: Penn Discourse Treebank, annotated with explicit and implicit discourse relations (dis.rel)
Human assessment of a set of 30 articles
  3 assessors for each text
  rating “how well this text is written?”
  scores averaged per text
Correlations are computed between values of features and “perceived” text quality

Text Quality (Pitler & Nenkova, 2008)

Features in the study
  baseline: char. per word, words per sentence, max number of words per sent, article length
  vocabulary (various Log Likelihood – LogL - unigram language models): WSJ, AP News, and variants that take into account word length
  syntactic features: height of parse tree, np per sentence, vp per sentence, subordinate per sentence
  cohesion: proper nouns, definite articles, various similarity features adj. sentences
  entity coherence: 34 features
  discourse relations: # discourse relations, LogL of article based on disc. rel., LogL based on disc. rel. and number of disc. rel, explicit disc. rel, implicit disc. rel.


Readability and Linguistic Features (Stajner et al. 2012)

Study the correlation between commonly used readability formulas in English and a number of features which arise from linguistic analysis of the text
Corpora: Simple Wikipedia, News, Health, Fiction
Readability measures: Flesch Reading Ease Score, Flesch-Kincaid, Fog Index, SMOG grading, char. / wrds, syl / wrds, ch/w, w/s, wrds / sentence
Features (average per sentence):
  Structural: Nouns, Adjectives, Determiners, Adverbs, Verbs, Infinitive, Coordinating Conj., Subordinating Conj., Prepositions
  Ambiguity: Pronouns, defNPs, Word Senses

Readability and Linguistic Features (Stajner et al. 2012)

Correlation between readability formulae in the 4 datasets is very high (over .95)
Correlation between Flesch Reading Ease score and POS features
  The strength of correlation depends on the dataset
  For most of the features (V, N, Prep, Det, Adv, A, CS, CC, INF, defNP), the smaller the value the more readable the text
  For word senses, the greater the value the more readable the text
Unexpected finding: fiction texts simpler than Simple Wikipedia if we accept readability formulae as predictors of text complexity


Additional Studies

Portuguese (Aluisio et al. 2010) and Italian (Tonelli et al, 2012)
  Coh-Metrix system (Graesser et al., 2004)
  Portuguese: Coh-Metrix-PORT tool to compute 59 features to perform 3-way classification incorporated into the tool FACILITA
  Italian: Coease system based on Coh-Metrix features (46) re-implemented for Italian used in 3-way classification
Italian (Dell’Orletta et al, 2011)
  Study readability assessment as text classification
  2 corpora: La Repubblica (normal) and Due Parole (simple)
  raw textual features, lexical features, morpho-syntactic features, syntactic
  Study sentence classification
French (François, 2011)
  AI Readability formula – 406 features (lexical, syntactic, semantic, FFL) – classification into 6 readability levels

Additional Studies

Swedish:
  LIX formula (Björnsson, 1968) which combines average number of words per sentence with average number of “complex” words per word (other readability indexes are “nominal ratio” and “word variation index”)
  Falkenjack et al (2013) apply a machine learning approach to classify texts into two categories (simple/normal), features - shallow, morpho, syntactic, lexical – are borrowed from previous research
Arabic (Al-Khalifa et al., 2010)
  use single and combination of features for readability assessment (3-levels): perplexity, sent. length, word length, avg. syllables, frequency features
Japanese (Sato et al., 2008)
  Character (3 alphabets) unigram LM for 13 different grades (LM1... LM13). Grade of text is the one that makes a text “more likely”
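As an illustration of the LIX index mentioned for Swedish, the sketch below adds the average number of words per sentence to the percentage of long words. The slides only speak of “complex” words; treating a word as long when it has more than six letters is the usual LIX convention and is stated here as an assumption, as is the simple tokenisation. The function name `lix` is hypothetical.

```python
import re

def lix(text):
    """LIX = words per sentence + 100 * (long words / words)."""
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    # Assumption: a "complex" word is one with more than six letters.
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

if __name__ == "__main__":
    print(round(lix("Operas are usually performed in opera houses."), 2))
```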


Additional Studies

Spaulding’s formula for Spanish (1956)
Found two factors correlated with text difficulty
  These two factors are not correlated
  Length of sentence (average) = ASL
  Density of use of the vocabulary (excluding words from a pre-defined list of 1500 Spanish words considered easy) = Density
Difficulty = 1.609 * ASL + 331.8 * Density + 22
  a (to), abajo (down), abandonar (quit), aborrecer (hate), abrazar (hug), abrir (open), ..., bailar (dance), bajar (descend), bajo (short), ..., único (unique), ...
The method comes with a graphic to map the obtained indices: 40-60 very easy; 61-80 easy; 81-100 moderately difficult; 101-120 difficult; > 121 exceptionally difficult

Additional Studies

Two stage approach to avoid underestimation in readability assessment (Sheehan et al, 2013)
  Classification into genre
  Estimate difficulty within genre
  Obtain better estimators
New measure: lexical tightness (Flor et al., 2013)
  Measures to what extent words used in a text are associated in the language
  if a text uses words that are strongly associated then it is easier to read
  It is derived from a modification of Pointwise Mutual Information
  Better text complexity estimators
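Spaulding's formula, Difficulty = 1.609 * ASL + 331.8 * Density + 22, can be sketched as below. Two points are assumptions of this illustration: Density is taken to be the proportion of tokens that are not in the easy-word list, and the tiny `easy_words` set stands in for the real 1500-word list, which is not reproduced here. The function name is hypothetical.

```python
def spaulding_difficulty(sentences, easy_words):
    """sentences: list of token lists; easy_words: set standing in for the 1500-word list."""
    words = [w.lower() for s in sentences for w in s]
    asl = len(words) / len(sentences)  # average sentence length
    # Assumption: density = share of tokens not found in the easy-word list.
    density = sum(w not in easy_words for w in words) / len(words)
    return 1.609 * asl + 331.8 * density + 22

if __name__ == "__main__":
    easy_words = {"a", "abajo", "abrir", "bailar", "bajar"}  # toy subset for illustration
    toy = [["bajar", "a", "bailar"], ["abrir", "abajo", "catedral"]]
    print(round(spaulding_difficulty(toy, easy_words), 3))
```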


Formulas for Spanish

FRASE method (Vari-Cartier, 1981) adapts the FRY method to Spanish
  Select 3 random passages each with 100 words
  Compute average number of syllables and average number of sentences in 3 passages
  Represent the two variables in the graphic below to obtain the readability index

Exercise

Compute the complexity of these two texts using Flesch Reading Ease and indicate which one is simpler (note that you have to count syllables)

(I) Spain is a country in Southern Europe. It is in the Iberian Peninsula near Portugal and Gibraltar. France and the little country of Andorra are on its northeast side, where the Pyrenees mountains are.

(II) Spain, officially the Kingdom of Spain, is a sovereign state and a member of the European Union located in southwestern Europe on the Iberian Peninsula. Its mainland is bordered to the south and east by the Mediterranean Sea except for a small land boundary with the British Overseas Territory of Gibraltar; to the north and north east by France, Andorra, and the Bay of Biscay; and to the northwest and west by the Atlantic Ocean and Portugal.


Lexical simplification as a NLP task

Lexical Simplification is a subtask of Text Simplification (Siddharthan, 2006) concerned with replacing words or short phrases by simpler variants in a context aware fashion (generally synonyms), which can be understood by a wider range of readers
SemEval-2012 featured a lexical simplification task (Specia et al. 2012)
Based on a previous challenge on lexical substitution (McCarthy & Navigli, 2007)

Lexical simplification as a NLP task

3 aspects were considered
  complexity analysis
  search for word substitutes
  ranking of substitutes based on context / simpler
a dataset is created based on aggregation of information by several annotators
  Example of Lex Subs.: “… a bright boy…”; bright = intelligent (3); clever (3); smart (1)
  Lex Simp. dataset uses average ranks provided by annotators
the task: replace a given word for its most appropriate substitute so that the sentence is easier to read
10 systems participated in the evaluation, a baseline system based on word frequency works fine and it is only beaten by one system
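The word-frequency baseline mentioned for the SemEval-2012 task can be illustrated by ranking candidate substitutes from most to least frequent. The frequency counts below are invented for the example and the function name is hypothetical; the actual baseline relied on counts from a large corpus.

```python
def rank_by_frequency(candidates, freq):
    """Order substitutes from most to least frequent; unseen words count as 0."""
    return sorted(candidates, key=lambda w: freq.get(w, 0), reverse=True)

if __name__ == "__main__":
    toy_freq = {"intelligent": 120, "clever": 300, "smart": 450}  # invented counts
    # Substitutes for "bright" in "... a bright boy ..."
    print(rank_by_frequency(["intelligent", "clever", "smart"], toy_freq))
```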


Lexical Simplification

Combination of two sources: lexicon & language model (De Belder & al. 2010)
Given a word in a text, two lists of words are generated
  L1: list of synonyms from the lexical database (authoritative source)
  L2: list of alternative words obtained from a latent words language model (learning from non annotated data)
A probabilistic model estimates the probability of replacement of one word (original) by another word
  P1(w|w_original) = P2(w|w_original, context) * P3(easy|w)
  P2 is a language model that w fits in the given context
  P3 can be modelled in different ways: frequency, morphosyntactic properties, complexity based on database, etc.
(De Belder & Moens, 2012) create a dataset for evaluation of TS in English – similar to work done in SemEval 2012
  Use data from the lexical substitution task proposed by (McCarthy & Navigli, 2007)
  Filtering of very difficult words
  Alternative words were ranked by difficulty by annotators
  Rankings grouped to obtain a single model

Extracting Lexical Simplifications from Wikipedia

Hypothesis: changes in Simple Wikipedia correspond to simplifications the author is making….(not always!) (Yatskar et al. 2010)
A mechanism is needed to model when the change of one word by another is due to a simplification operation
Various models are possible:
  One model computes the “probability” that the change of a word “A” by word “a” is due to: correction, simplification, etc.
    It is assumed that in the normal Wikipedia simplification changes are negligible
    It is also assumed that the proportion of corrections in Simple Wikipedia is equal to those in normal Wikipedia
    The probability of changing “A” by “a” p(a|A) is approximated by frequencies
    A model is obtained for the most probable replacement for “A”
  Another method looks at comments left by editors identifying which association of “A” and “a” is stronger using PMI (point-wise mutual information)
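The product P1(w|w_original) = P2(w|w_original, context) * P3(easy|w) can be illustrated by scoring each candidate and keeping the best one. The probabilities below are invented placeholders: in the original work P2 comes from a latent words language model and P3 from a separate easiness model, neither of which is implemented here. The function name is hypothetical.

```python
def best_replacement(candidates, p_context, p_easy):
    """Score each candidate with P2 * P3 and return (best word, all scores)."""
    scores = {w: p_context[w] * p_easy[w] for w in candidates}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    # Invented values: how well each word fits the context (P2) and how easy it is (P3).
    p_context = {"canine": 0.4, "dog": 0.3}
    p_easy = {"canine": 0.2, "dog": 0.9}
    print(best_replacement(["canine", "dog"], p_context, p_easy)[0])
```

The point of the product is that a word must both fit the context and be easy: a good contextual fit alone ("canine" here) is not enough to win.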


Extracting Lexical Simplifications from Wikipedia

Data extraction: pairs of words “A” and “a”
  Sentences are aligned and words which have been substituted are identified: A => a
Two baseline methods proposed:
  Frequency: use the most frequent substitution
  Random: choose a random valid substitution
Compare with a list created automatically
Human > Language Model > PMI > FREQ >= RANDOM

Learning simplification rules

Biran et al. (2011) also use English Wikipedia (EW) and Simple English Wikipedia (SEW)
EW is used to extract context vectors for each word (co-occurrences)
A similarity measure can be used to identify which words can be replaced by which words
The cosine between vector representations is used in this work
Some filtering applied using WordNet
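The idea of comparing words through the cosine of their co-occurrence vectors can be sketched as follows. This is a toy illustration: the window size, the three-sentence corpus, and the function names are assumptions, and the WordNet filtering step is not shown.

```python
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    corpus = [["the", "dog", "barked"], ["the", "canine", "barked"], ["a", "car", "stopped"]]
    vecs = context_vectors(corpus)
    print(cosine(vecs["dog"], vecs["canine"]), cosine(vecs["dog"], vecs["car"]))
```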


Learning simplification rules

Implementing the simplicity of a word: example “canine” and “dog”
check occurrences of both words in EW and SEW
  “canine” appears 9620 times in EW
  “canine” appears 62 times in SEW
  “dog” appears 171000 times in EW
  “dog” appears 1360 times in SEW
complexity(“canine”) = 9620/62 = 155
complexity(“dog”) = 171000/1360 = 125

Learning simplification rules

The length of the word is also taken as a measure of complexity
  len(“canine”)=6, len(“dog”)=3
  final_complexity = complexity * len
  fc(“canine”) = 155*6 = 930
  fc(“dog”) = 125*3 = 375
canine “is more difficult than” dog
So “canine” can be simplified by “dog”, but “dog” can not be simplified with “canine”
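The “canine” / “dog” computation can be reproduced in a few lines. The counts are the ones given on the slides; the function names are hypothetical, and integer division is used here only because it matches the slide's figures of 155 and 125 (the exact ratios are slightly larger).

```python
def complexity(count_ew, count_sew):
    # Ratio of occurrences in English Wikipedia to Simple English Wikipedia,
    # keeping the integer part to match the figures on the slide.
    return count_ew // count_sew

def final_complexity(word, count_ew, count_sew):
    # Word length is used as an additional complexity factor.
    return complexity(count_ew, count_sew) * len(word)

def can_simplify(source, target, counts):
    """True if `target` scores as less complex than `source`."""
    return final_complexity(target, *counts[target]) < final_complexity(source, *counts[source])

if __name__ == "__main__":
    counts = {"canine": (9620, 62), "dog": (171000, 1360)}  # (EW, SEW) counts from the slide
    print(final_complexity("canine", *counts["canine"]), final_complexity("dog", *counts["dog"]))
    print(can_simplify("canine", "dog", counts), can_simplify("dog", "canine", counts))
```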


Learning simplification rules

Grammaticality: generate all equivalent pairs, if word in past tense then its simplification in past tense, etc.
Choose as simplification of the target word w a replacement x that fits in the context
Baseline: replace a word by its most frequent synonym
Evaluation is a non-realistic scenario in the sense that only one word is simplified in the sentence
  choose a sentence where only one word has been replaced by the method
  three variables evaluated: simplification (yes/no), grammaticality (bad/ok/good), sense (yes/no)
  the proposed method is better than the “baseline”

Exercise

Try the method with the following word pairs indicating which word is easier
  cat, feline
  automobile, car
  inferno, fire


Exercise

Simplify the vocabulary in the following sentences
  John was flabbergasted when he saw the aesthetic illustrations in the gallery.
  Seek for appropriate replacements for gallery, flabbergasted, aesthetic, and illustrations.
Instruments
  Google, English Wikipedia, Simple English Wikipedia
  WordReference can be used to find synonyms of words to be simplified

Learning “all” simplification automatically

This work considers that text simplification is a kind of machine translation (Coster & Kauchak, 2011) (but this is not totally true!)
  normal language TO simple language
Can be applied in contexts where original and simplification examples abound.... (original and simplification are not very different)

Original sentence: Greene agreed that she could earn more by breaking away from 20th Century Fox.
Simplified sentence: Greene agreed that she could earn more by leaving 20th Century Fox.

Original sentence: In 1962, Steinbeck received the Nobel Prize for Literature.
Simplified sentence: Steinbeck won the Nobel Prize in Literature in 1962.

Original sentence: They established themselves here and called that port Menestheus’s port.
Simplified sentence: They called the port Menestheus’s port.


Learning “all” simplification automatically

It is applied in the more specific context of statistical machine translation and in particular phrase-based translation models
A statistical translation model is based on the following formula to model the “generation” of a translation from a source sentence
  P(e|f) is the probability that a text e is a translation of text f
  by Bayes' rule, and dropping P(f) which is constant for a given f, maximizing P(e|f) amounts to maximizing P(e)*P(f|e)
  P(e) is a language model
  P(f|e) is a translation model
  given a text f we look for a text e that maximizes the formula
  finding the “best” e requires examination of all possibilities which is impractical. A heuristic search is required to find a solution.
The probabilities as stated can not be modelled directly, one has to use text and sentence components for modelling the problem (e.g. an approximation)

Learning “all” simplification automatically

Creation of corpus for training and test
  Align articles of SEW and EW
  Align paragraphs using a similarity metric
  Align sentences using dynamic programming
Simplification model is based on MOSES (Koehn et al, 2007)
  each s_i is a phrase in a simple sentence and each n_i is a phrase in the original sentence

$$p(\text{simple} \mid \text{normal}) = \prod_{i=1}^{m} p(s_i \mid n_i)$$

Phrases can be computed using the computer program GIZA++ (Och & Ney, 2000)
In a translation model it is generally true that “null” phrases do not exist, however in simplification they are necessary
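The phrase-based factorisation, in which p(simple | normal) is the product of p(s_i | n_i) over aligned phrase pairs, can be sketched with a toy phrase table. All probabilities below are invented, the table layout and function name are hypothetical, and the empty string stands for a “null” phrase, i.e. a deletion. This is not how MOSES stores or searches its tables.

```python
import math

def log_p_simple_given_normal(phrase_pairs, phrase_table):
    """Sum of log p(s_i | n_i) over aligned phrase pairs (s_i, n_i)."""
    return sum(math.log(phrase_table[(s, n)]) for s, n in phrase_pairs)

if __name__ == "__main__":
    # Invented probabilities; keys are (simple phrase, normal phrase).
    phrase_table = {
        ("leaving", "breaking away from"): 0.5,
        ("", "here"): 0.25,  # "null" phrase: the normal phrase is deleted
    }
    pairs = [("leaving", "breaking away from"), ("", "here")]
    print(math.exp(log_p_simple_given_normal(pairs, phrase_table)))
```

Working in log space avoids numerical underflow when many small phrase probabilities are multiplied.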


Learning “all” simplification automatically

In the evaluation various systems are compared
  Normal translation model
  Translation model with delete operation
  Do nothing system (baseline)
  Other text simplification approaches (Cohn & Lapata, 2009; Knight & Marcu, 2002)
Evaluation (comparison with gold standard)
  BLEU (Papineni et al, 2002) used in machine translation
  Simple string accuracy (Clarke & Lapata, 2006)
  F-score over words
For all tested measures
  Translation system with delete operation is better than translation system without delete operation
  The baseline is better than the sophisticated systems

Learning simplification from trees

Based on a corpus of comparable documents of complex and simplified versions (Zhu et al. 2010)
  English Wikipedia/Simple English Wikipedia
  Align EW & SEW using a TF*IDF method and allow 1 to n alignments
This work models the following aspects:
  replacement of words and phrases
  syntactic simplification seen as composition of the following operations on a tree
    “Split”, “Drop”, “Copying”, “Reordering”


Learning simplification from trees

[Figure: PHRASE STRUCTURE OF COMPLEX SENTENCE. Parse tree of "August was the sixth month in the ancient Roman calendar which started in 735BC."]

[Figure: SPLIT. Probabilities associated to cutting/split points, shown on the same parse tree.]


Learning simplification from trees

[Figure: COPYING SUBJECTS. Probabilities of copying a component. Two trees: "August was the sixth month in the ancient Roman calendar" and "the ancient Roman calendar started in 735BC".]

[Figure: DELETE AND REORDERING. Probabilities of deleting and reordering components. Two trees: "August was the sixth month in the ancient calendar" and "the ancient calendar started in 735BC".]


Learning simplification from trees

[Figure: WORD SUBSTITUTION. Probability of replacing a word. Two trees in which "ancient" has been replaced by "old".]
August was the sixth month in the old calendar. The old calendar started in 735BC.

Simplification applications

Text Summarization (Pal & Ruger, 2002)
  Complexity of the sentence analyzed using a psycholinguistic database
  Replace complex words by simple synonyms using WordNet
Subtitling: reducing the number of characters given physical constraints of the medium (Daelemans et al, 2004)
  One important operation is deletion of material
  dataset: alignments between transcriptions and subtitles
  machine learning algorithms (using parallel corpus)
  knowledge based system (linguistic rules for deletion)
  machine learning approach performed badly
  rule-based approach worked better


Simplification applications

Simplification of patent claim “sentences” (Bouayad-Agha et al. 2009)

Original claim:
An optical disk drive comprising: a laser light source for emitting a laser beam; an optical system for conversing the laser beam from the laser light source on a signal plane of optical disk on which signal marks are formed and for transmitting the light reflected from the signal plane; one or more optical components, arranged in the optical path between the laser light source and the optical disk, for making the distribution of the laser beam converged by the conversing means located on a ring belt just after the passage of an aperture plane of the optical system; a detection means for detecting the light reflected from the optical disk; and a signal processing circuit for generating a secondary differential signal by differentiating the signals detected by the detection means and for detecting the edge positions of the signal marks by comparing the secondary differential signal with a detection level.

Simplified claim:
An optical disk drive comprises a laser light source, an optical system, a detection means, and a signal processing circuit.
The laser light source emits a laser beam.
The optical system converses the laser beam from the laser light source on a signal plane of optical disk. On the latter, signal marks are formed.
The optical system also transmits the light reflected from the signal plane.
…..
The signal processing circuit generates a secondary differential signal. To do so, it differentiates the signals detected by the detection means.
It also detects the edge positions of the signal mark. To do so, it compares the secondary differential signal with a detection level.

Portuguese text simplification: PorSimples project

Target population: people with low literacy
Study of simplification cases based on a corpus, and specification of simplification procedures (Aluísio & al, 2008)
Treatment of relative clauses:
  Input: “The book, which John gave me, belongs to Paul”
  Find the relative pronoun and check that the relative clause is non-restrictive
  Find where the relative clause ends
  Generate a sentence with the relative clause
  Generate a sentence with the main clause
  Reorder
  Output: “The book belongs to Paul. John gave me the book.”
There are various procedures in charge of dealing with different phenomena. These are cascaded in a specific order to achieve a correct simplification.
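The relative-clause procedure listed above can be illustrated with a very small sketch. This is not the PorSimples implementation: it assumes English input, a non-restrictive clause introduced by ", which" and closed by a comma, and a gap in object position. The pattern and the function name `split_relative` are hypothetical.

```python
import re

# Hypothetical pattern: "<NP>, which <clause with an object gap>, <main predicate>"
NON_RESTRICTIVE = re.compile(
    r"^(?P<np>[^,]+), which (?P<rel>[^,]+), (?P<main>.+?)\.?$"
)


def split_relative(sentence):
    """Split one comma-delimited 'which' clause into two sentences."""
    m = NON_RESTRICTIVE.match(sentence.strip())
    if m is None:
        # No comma-delimited relative found: leave the sentence untouched.
        return sentence
    np, rel, main = m.group("np"), m.group("rel"), m.group("main")
    # Sentence with the main clause.
    main_sentence = f"{np} {main}."
    # Sentence with the relative: the antecedent fills the object gap.
    np_inside = np[0].lower() + np[1:]
    rel_sentence = f"{rel[0].upper()}{rel[1:]} {np_inside}."
    # Reorder: main clause first, relative second.
    return f"{main_sentence} {rel_sentence}"


print(split_relative("The book, which John gave me, belongs to Paul"))
```

A restrictive relative such as "The book that John gave me belongs to Paul" does not match the pattern and is returned unchanged, which mirrors the check for non-restrictiveness in the procedure.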


Adapt simplification to the user

The FIRST project “Flexible Interactive Reading Support Tool” (funded by FP7) (Mitkov, 2012)
Simplification for people with autism: they have problems with complex sentences, ambiguity, figurative language, etc.
Adaptive tool will simplify text, provide navigation mechanisms, and add pictograms for better understanding
Simplification is not only for the final user but also for their carers

DysWebxia (Rello et al. 2012): Simplification/adaptation for dyslexic users
Dyslexia is a neurological reading disability that is characterized by difficulties with accurate and/or fluent word recognition and by poor spelling and decoding abilities
Research question: how to make textual content more accessible to people with dyslexia
10-17% of native English speakers are affected by dyslexia
7.5-10% of native Spanish speakers are affected by dyslexia
Basis for the work: use of eye-tracking methodology to verify different word/text conditions (frequency, length, use of paraphrase, etc.)


Adapt simplification to the user

Studying frequency and length factors (Rello & al. 2013)
  Frequency and readability: significant effect
  Length and readability: significant effect
  Frequency and understandability: no effect
  Length and understandability: significant effect
Substitute or Help? (Rello & al. 2013)
  Comparison between automatically replacing a word by a synonym using LexSiS (Bott & al., 2012) and providing an interactive way of accessing a synonym list (after word sense disambiguation and simplicity scoring)
  Dyslexic users prefer the helping aid to the substitution by synonyms

DysWebxia browser


Numeracy

• Simplification of numerical expressions in text (Bautista et al. 2012) and adaptation to Spanish (Bautista & Saggion, to appear)
• studies the problem of how to make numbers and numerical expressions simpler by the use of rounding and addition/change of modifiers

Original: Cerca de 1,9 millones de personas asistieron al concierto (About 1.9 million people attended the concert)
Simplification: Casi 2 millones de personas asistieron al concierto (Nearly 2 million people attended the concert)

Original: Sólo se ha vendido un cuarto de las entradas (Only a quarter of the tickets have been sold)
Simplification: Sólo se ha vendido ¼ de las entradas (Only ¼ of the tickets have been sold)

Original: Uno de cada cuatro niños hablan chino (One in four children speak Chinese)
Simplification: 1 de cada 4 niños hablan chino (1 in 4 children speak Chinese)

Original: Asistieron un 57% de la clase (57% of the class attended….)
Simplification: Asistieron más de la mitad de la clase (More than half of the class attended…)

Original: Aprobaron el 98% (98% passed….)
Simplification: Aprobaron casi todos (Almost everyone passed…)

• Numerical expressions and dyslexia findings (Rello & al., 2013)
• numerical information expressed as digits improves readability for dyslexic users
• numerical information expressed as percentages improves understandability for dyslexic users
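The rounding-plus-modifier idea in the examples above can be sketched as follows. This is only an illustration, not the method of Bautista et al.: the function names, the English modifiers "nearly" and "more than", and the choice of rounding step are assumptions, and the mapping from percentages to fractions ("57%" to "more than half") is not covered.

```python
def round_with_modifier(value, step):
    """Round value to the nearest multiple of step and pick a hedging modifier.

    The modifier records the direction of the rounding so that the
    simplified expression stays truthful (assumed English modifiers).
    """
    rounded = round(value / step) * step
    if rounded > value:
        modifier = "nearly"      # rounded up: the true value is slightly lower
    elif rounded < value:
        modifier = "more than"   # rounded down: the true value is slightly higher
    else:
        modifier = ""            # already a round number: no modifier needed
    return modifier, rounded


def describe(value, step, unit):
    """Build the simplified numerical expression as a string."""
    modifier, rounded = round_with_modifier(value, step)
    return " ".join(part for part in (modifier, f"{rounded:g}", unit) if part)


print(describe(1.9, 1, "million people"))   # nearly 2 million people
print(describe(57, 50, "percent"))          # more than 50 percent
```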


The Simplext Project (Saggion et al. 2011)

A text simplification system for people with cognitive disabilities
One of the first applications of simplification to the Spanish language
Technological objective
  Develop a ubiquitous simplification solution
  Simplified content can be delivered to different devices
Scientific objectives
  Development of resources for simplification in Spanish
  Development of methods and algorithms to simplify content
Social objective: simplifying content for people with cognitive disabilities
Spanish: spoken by over 500 million people, official language of 21 countries, 3rd language on the Internet


[Figure: corpus construction workflow. Labels: information provider, simplification expert, complex corpus, simpler corpus, sentence alignment program, alignment tables, bi-text editor, corpus for research.]

[Figure: GATE bi-text editor showing an aligned text pair.]

Original: Amnistía Internacional acusó a las autoridades estadounidenses de proporcionar un "trato inhumano" a Bradley Manning, un soldado acusado de filtrar “cables” de la diplomacia norteamericana al portal Wikileaks.

Simplified: Estados Unidos trata muy mal a un soldado detenido. El soldado se llama Bradley Manning. Bradley Manning está detenido por dar información del gobierno de Estados Unidos a Wikileaks. Wikileaks es una página web donde se da información sobre asuntos de interés público.


Making texts easy to read

[Figure: an original text next to its manual simplification.]

Lexical simplification
  Choose more accessible vocabulary / apply recurrent re-writing strategies
Reduction of structural complexity by splitting sentences
  Reduce sentence length and use simpler syntax
Content reduction
  Drop unnecessary material
Clarification
  Provide explanations/definitions of difficult terms


Syntactic simplification approach (Bott, Saggion, Mille; 2012)

Rule-based approach that transforms dependency-trees / graphs
Performs various sentence splitting operations on subordinate and coordinate structures
Copies subjects and verbs (e.g. relative pronoun replaced by lexical NP)
Orders the various clauses
Produces the output from the resulting dependency graphs
Tools: dependency parser (Bohnet, 2009) and MATE framework (Bohnet et al., 2000)

Syntactic simplification

Treatment of relative clauses, gerundive and participle constructions, coordination of clauses, verbs, and objects, and quotation inversion
“The torrential rains that began on October 1 …, causing the worst flooding...” -> “The torrential rains caused the worst… These rains began on…”
“Amnesty International pointed out that people have been arrested before the elections scheduled for November 29.” -> “Amnesty International pointed out… The elections are scheduled for November 29.”
“For this reason the people could not live in their houses and were given shelter in camps” -> "For this reason the people could not live in their houses. These people were given shelter in camps."
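As a rough illustration of splitting a coordinate structure on a dependency tree and copying the subject, here is a toy sketch. It does not use the MATE framework or the authors' rules: the `Token` layout, the relation labels (`subj`, `conj`, `cc`, `root`) and the hand-built parse are all assumptions made for the example, and the subject is copied verbatim rather than rewritten as "These people".

```python
from collections import namedtuple

# Hypothetical token layout: id, word form, id of the head, dependency relation.
Token = namedtuple("Token", "id form head deprel")

# Hand-built parse of a short sentence (labels are illustrative only).
EXAMPLE = [
    Token(1, "The", 2, "det"),
    Token(2, "people", 3, "subj"),
    Token(3, "left", 0, "root"),
    Token(4, "and", 5, "cc"),
    Token(5, "found", 3, "conj"),
    Token(6, "shelter", 5, "obj"),
]


def subtree(tokens, root_id, skip=()):
    """Ids of root_id and all its descendants, ignoring ids listed in skip."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.head in ids and t.id not in ids and t.id not in skip:
                ids.add(t.id)
                changed = True
    return ids


def split_verb_coordination(tokens):
    """Split 'SUBJ V1 ... and V2 ...' into two clauses, copying the subject."""
    root = next(t for t in tokens if t.deprel == "root")
    conj = next((t for t in tokens if t.deprel == "conj" and t.head == root.id), None)
    subj = next((t for t in tokens if t.deprel == "subj" and t.head == root.id), None)
    if conj is None or subj is None:
        return [" ".join(t.form for t in tokens)]
    # The conjunction itself is dropped from both output clauses.
    cc_ids = {t.id for t in tokens if t.deprel == "cc" and t.head == conj.id}
    second = subtree(tokens, conj.id, skip=cc_ids)
    subject = subtree(tokens, subj.id)
    first = {t.id for t in tokens} - second - cc_ids

    def words(ids):
        return " ".join(t.form for t in tokens if t.id in ids)

    return [words(first), words(subject | second)]


print(split_verb_coordination(EXAMPLE))
```

Working on token ids rather than on the surface string keeps the original word order inside each clause, which is the main reason to do the split on the graph instead of with string patterns.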


MATE Graph Tools

Example sentence: “El gato que es bonito come pescado.” (The cat that is pretty eats fish.)

MATE Rules

INPUT
  leftside = [
    ?Xl {
      S-> ?Yl {
        ?r-> ?Zl
      }
    }
  ]

OUTPUT
  rightside = [
    rc:?Zr {
      <=> ?Zl
      mark_relative_simple=applied
    }
  ]

CONSTRAINTS
  conditions = [
    ?Yl.ppos="v";
    ?Yl.mood="indicative"
    ?Zl.slex=que or ?Zl.slex=quién or ?Zl.slex=cuál or Zl.slex=donde;
    ?Zl.id


Rule Application Graph Transformation


Rule Evaluation

Simplification            Precision   Recall
Relative clauses          40%         66%
Gerundive                 63%         20%
Object coordination       42%         58%
VP/Clause coordination    64%         50%

Evaluation carried out over 886 sentences of the corpus
Sources of error: restrictive clauses, idioms, etc.

Lexical simplification (Bott et al., 2012)

LexSiS: Lexical simplification for Spanish
Word sense disambiguation (WSD) of a target word to identify an appropriate list of synonyms
Simplicity computation to choose an appropriate synonym for a target word
Morphology generation of the simpler synonym
Example
  Original: Descubren en Valencia una nueva ESPECIE de pez prehistórico / A new SPECIES of prehistoric fish is discovered in Valencia.
  Simplified: Descubren en Valencia un nuevo TIPO de pez prehistórico / A new TYPE of prehistoric fish is discovered in Valencia.


Word Sense Disambiguation

A freely available thesaurus (Spanish Open Thesaurus, OT) can be used to select synonyms
mono (4 senses) / monkey
  gorila, simio, antropoide / primate
  simio, chimpancé, mandril, mico, macaco / monkey
  overol, traje de faena / garment
  llamativo, vistoso, atractivo,…. / attractive
Using an 8M word corpus, a vector representation for each word entry in OT can be created
  the vector models the frequencies of words in context (4 words to the left, 4 words to the right)

The final representation of a word-sense w is an aggregated vector of all synonym terms for w
mono (4 senses)
  gorila, simio, antropoide => mono_vector_1
  simio, chimpancé, mandril, mico, macaco => mono_vector_2
  overol, traje de faena => mono_vector_3
  llamativo, vistoso, atractivo,…. => mono_vector_4
Given a target word w to be simplified, a vector is computed based on the current context of w
  Two methods used: one-sense-per-discourse vs many-senses-per-discourse
The similarity between this vector and all vectors for word w in the thesaurus is computed using cosine
Pick the list of synonyms whose vector is most similar to the target vector for w
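The sense-selection step described above (context window of 4 words on each side, cosine similarity against one aggregated vector per synonym list) can be sketched as follows. The sense vectors below are tiny invented counts, not values derived from the 8M word corpus, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def context_vector(tokens, index, window=4):
    """Count the words up to `window` positions left and right of the target."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return Counter(left + right)


def pick_sense(target_vector, sense_vectors):
    """Return the name of the sense vector most similar to the target context."""
    return max(sense_vectors, key=lambda s: cosine(target_vector, sense_vectors[s]))


# Invented co-occurrence counts for two of the senses of "mono".
SENSES = {
    "mono_vector_2": Counter({"zoo": 3, "selva": 2, "jaula": 2}),     # monkey
    "mono_vector_3": Counter({"trabajo": 3, "azul": 2, "taller": 2}),  # garment
}

tokens = "el mono del zoo come un plátano".split()
print(pick_sense(context_vector(tokens, tokens.index("mono")), SENSES))
```

In this toy run only "zoo" overlaps with a sense vector, so the monkey sense is selected; with real corpus counts the aggregated vectors would be far denser.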


Simplicity computation

Corpus analysis findings
  Word length
    short words are more abundant in simple texts than in non-simplified texts
  Frequency (based on a word frequency list from RAE)
    low frequency words or words not in our reference lexicon are 50% more common in non-simplified texts than in simplified texts
This has led to the implementation of a simplicity score for words that combines frequency and length
  Frequency (also used in previous art)
  Length (also used in previous art)
The target word is replaced only if a number of conditions are met

Example
  Descubren en Valencia una nueva ESPECIE de pez prehistórico (A new SPECIES of prehistoric fish is discovered in Valencia.)
  Open Thesaurus lists 5 different synonym lists for ESPECIE (SPECIES)
  A frequency-based lexical simplification procedure selects the word GRUPO (GROUP) as replacement for ESPECIE out of all available synonyms
  LexSiS chooses one of the correct synonym lists and from the list the word TIPO (TYPE) as an appropriate replacement because the length factor makes TYPE “simpler” than GROUP
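The slides do not give the actual LexSiS scoring formula, so the following is only a sketch of how a score combining frequency and length could behave on the ESPECIE example. The frequency counts, the weight `length_weight`, and the formula itself (log frequency minus a length penalty) are all assumptions.

```python
import math

# Invented corpus counts, chosen only to reproduce the behaviour in the example.
FREQ = {"especie": 50, "grupo": 300, "tipo": 280}


def most_frequent(candidates, freq):
    """Frequency-only baseline: pick the candidate with the highest count."""
    return max(candidates, key=lambda w: freq.get(w, 0))


def simplicity(word, freq, length_weight=0.5):
    """Assumed score: frequent words score higher, long words score lower."""
    return math.log(freq.get(word, 0) + 1) - length_weight * len(word)


def simplest(candidates, freq):
    """Pick the candidate with the highest combined simplicity score."""
    return max(candidates, key=lambda w: simplicity(w, freq))


candidates = ["especie", "grupo", "tipo"]
print(most_frequent(candidates, FREQ))  # frequency alone prefers "grupo"
print(simplest(candidates, FREQ))       # the length term tips the choice to "tipo"
```

With these made-up numbers "grupo" is slightly more frequent, but its extra letter costs more than the frequency gain, which is the kind of trade-off the example on the slide describes.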


LexSiS: Evaluation

Three versions compared gold standard, random, frequency, and LexSiS
Overall LexSiS shows higher meaning preservation and proposes more synonyms that are simpler (Bott et al. 2012)
Problems: replacements of words which are not difficult, replacement of words in collocations, out-of-vocabulary problem, ….

Users were presented with pairs of sentences and asked to indicate how the target word in the first sentence was when compared to the target word in the second sentence:
(A) El visitante puede contemplar los óleos y esculturas que se exponen en la PINACOTECA. (The visitor can appreciate the paintings and sculptures shown in the ART GALLERY.)
(B) El visitante puede contemplar los óleos y esculturas que se exponen en el MUSEO. (The visitor can appreciate the paintings and sculptures shown in the MUSEUM.)
Options: simpler, more complex, equally complex/simple, not same meaning, don’t understand


Other simplification operations (Drndarevic & Saggion, 2012)

Sentence deletion
  Productive operation, around 84% of documents have delete operations applied
  Baseline: delete last sentence achieves 0.73 F-score
  A classification system for sentence deletion using features such as position, cohesion, presence of numerical expressions, presence of named entities achieves 0.81 F-score
Definition insertion
  Also very productive: just over 57% of simplified texts contain definitions not present in the original text
  70% of definitions correspond to named entities (people, organizations, places)
  Problem: what terms should be explained? what is a simple definition? Where to insert the definition in the simplified text?

Other Simplification Mechanisms

Transformation of nouns/adjectives of nationality
  “Tunisian authorities” => “authorities of Tunisia”
Transformation of numerical expressions
  year clarification “2010” => “the year 2010”
  deletion “end of May of 2010” => “2010”
  small numbers in digits “seven books” => “7 books”
  replace by definition (“2 decades” => “20 years”)
Normalization of “reporting” verbs to the form decir (“say”)
  “X explained that Y” => “X said that Y”
Removal of information in parenthesis
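Several of the mechanisms listed above are shallow enough to approximate with string patterns. The sketch below works on English for readability (the Simplext rules target Spanish) and is not the authors' rule set: the verb list, the number-word table, and the function names are assumptions.

```python
import re

# Hypothetical, deliberately small lookup tables.
REPORTING_VERBS = ["explained", "announced", "stated"]
NUMBER_WORDS = {"two": "2", "three": "3", "four": "4", "five": "5",
                "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def remove_parentheticals(text):
    """Drop information in parentheses together with the space before it."""
    return re.sub(r"\s*\([^()]*\)", "", text)


def normalize_reporting_verbs(text):
    """Map 'X explained that Y' and similar to 'X said that Y'."""
    pattern = r"\b(?:" + "|".join(REPORTING_VERBS) + r")\s+that\b"
    return re.sub(pattern, "said that", text)


def small_numbers_to_digits(text):
    """Write small number words as digits ('seven books' to '7 books')."""
    pattern = r"\b(" + "|".join(NUMBER_WORDS) + r")\b"
    return re.sub(pattern, lambda m: NUMBER_WORDS[m.group(1)], text)


def clarify_years(text):
    """Prefix bare four-digit years with 'the year' unless already present."""
    return re.sub(r"(?<!year )\b((?:19|20)\d{2})\b", r"the year \1", text)


def simplify(text):
    """Apply the four operations in a fixed (assumed) order."""
    for step in (remove_parentheticals, normalize_reporting_verbs,
                 small_numbers_to_digits, clarify_years):
        text = step(text)
    return text


print(simplify("The minister (aged 54) explained that seven schools closed in 2010."))
```

Parentheticals are removed first so that numbers inside them are never rewritten; the order of the remaining steps does not matter for this example.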


Simplext full evaluation

Evaluation of syntactic simplification module, rule-based simplification, and whole system (Drndarevic et al., 2013)
  tested the degree of simplification achieved => “readability” indices
  tested the grammaticality and meaning preservation => human evaluation

Degree of simplification
  a set of readability indices
  [Figures: behaviour of the indices on manual simplification; behaviour of the indices on automatic simplification]


Simplext full evaluation

Grammaticality and meaning preservation
25 human evaluators asked to read and assess original and simplified sentences
Questionnaires with 38 pairs of original (O) and simplified (S) sentences
Pairs O-S contained at least one lexical change and one syntactic change
Order was randomized to counterbalance the sequence effect

Grammaticality and meaning preservation
Questions:
  1. paragraph A is grammatical
  2. paragraph B is grammatical
  3. paragraphs A & B have the same meaning
Answers on Likert scale: 1 completely disagree – 5 completely agree
Results
  grammaticality of originals 4.6 (~90% positive = scores 4 and 5)
  grammaticality of simplifications 3.85 (~60% positive = scores 4 and 5)
  meaning preservation 3.83 (~70% positive = scores 4 and 5)
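Each result above pairs a mean Likert score with the share of "positive" answers (scores 4 and 5). A minimal sketch of that arithmetic follows; the ratings list is invented for illustration and is not data from the evaluation.

```python
from statistics import mean


def summarize(scores):
    """Return (mean score, share of answers that are 4 or 5) for 1-5 ratings."""
    positive = sum(1 for s in scores if s >= 4)
    return mean(scores), positive / len(scores)


# Invented ratings from five hypothetical evaluators for one question.
ratings = [5, 4, 4, 3, 5]
average, positive_share = summarize(ratings)
print(f"mean = {average:.2f}, positive = {positive_share:.0%}")
```

Reporting both numbers is useful because the mean alone hides how answers are distributed: the same mean can come from mostly neutral answers or from a mix of strong agreement and strong disagreement.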


Evaluation with people with special needs

Extrinsic evaluation with Simplext final users
Performed by the PRODIS foundation
31 subjects with Down Syndrome
3 texts in 3 different versions: original, manually simplified, automatically simplified
4 “inferential” questions for each text (only correct/incorrect options were considered)
Measurements: reading time and number of correct answers
No differences were perceived in reading time and in percentage of correct answers in the three textual versions
Recent research (Fajardo et al., 2013) indicates that “easy-texts” have a role to play in making texts more understandable for people with intellectual disabilities

Original text

Alex de la Iglesia dimitirá como presidente de la Academia de Cine
El director Alex de la Iglesia presentará su dimisión como presidente de la Academia de Cine tras la próxima gala de los Goya del próximo 13 de febrero.
Lo anuncia en un artículo publicado en la edición digital de "El País", en el que no vincula claramente su salida con su desacuerdo respecto a la "Ley Sinde", algo que sí ha sugerido en su página en Twitter.
Explica que ha hablado últimamente con representantes de sectores contrarios a su posición sobre las descargas en la Red y ha dialogado con ellos con fluidez.
"En este país, cambiar de opinión es el mayor de los pecados. Creo que tenemos instalado el chip de la intransigencia desde hace tiempo", asegura.
"Lo coherente es dejarlo", seña el cineasta, que promete que seguirá defendiendo sus posiciones simplemente como director y que no "empañará" la gala de los Goya con su posición sobre la polémica de la Red.

Original text (google translate version)

Alex de la Iglesia resign as president of the Film Academy
The director Alex de la Iglesia shall resign as president of the Film Academy after the next gala Goya next February 13.
He announced in an article published in the online edition of the "Country", which clearly links his departure not his disagreement over the "Sinde Law", which itself has suggested in his Twitter page.
He explains that he has spoken recently with representatives of sectors opposed to its position on the downloads on the Web and has talked with them fluently.
"In this country, change your mind is the greatest sin. I think we have installed the chip intransigence in a while," he says.
"It is consistent to quit," signaled the filmmaker, who promises to continue to defend their positions just as a director and not "tarnish" the gala of the Goya with his position on the controversy of the Network


Manual simplification

Alex de la Iglesia dimitirá como presidente de la Academia de Cine
El director de cine Alex de la Iglesia deja la presidencia de la Academia de Cine.
Alex de la Iglesia dejará la presidencia de la Academia de Cine después de los Goya.
Los Goya es una fiesta donde se dan premios a los mejores actores y películas del cine español.
Alex de la Iglesia anunció su decisión en un artículo del periódico El País.
En su página web ha explicado que deja la Academia de Cine por no estar de acuerdo con la Ley Sinde.
La Ley Sinde es una ley que prohíbe las descargas gratis de películas en internet.

Manual simplification (google translate version)

Alex de la Iglesia resign as president of the Film Academy
Film director Alex de la Iglesia leaves the presidency of the Film Academy.
Alex de la Iglesia will leave the presidency of the Academy of Film after the Goya.
The Goya is a party where they give awards for the best actors and films of Spanish cinema.
Alex de la Iglesia announced his decision in an article in the newspaper El Pais.
On its website explained that leaves the Film Academy for not agreeing with the “Sinde” law.
Sinde law is a law prohibiting free movie downloads.

Automatic simplification

Alex de la Iglesia dimitirá como presidente de la Academia de Cine.
El director Alex de la Iglesia presentará su dimisión como presidente de la Academia de Cine tras la próxima gala de los Goya del próximo 13 de febrero.
Lo anuncia en un artículo publicado en la edición digital de "El País".
En este artículo no vincula claramente su salida con su desacuerdo respecto a la "Ley Sinde", algo que sí ha sugerido en su página en Twitter.
Explica que ha hablado últimamente con representantes de sectores contrarios a su posición sobre las descargas en la Red.
Ha dialogado con ellos con fluidez.
"En este país, cambiar de opinión es el mayor de los pecados."
Asegura: "Creo que tenemos instalado el chip de la intransigencia desde hace tiempo".
El cineasta seña: "Lo coherente es dejarlo".
Este cineasta promete que seguirá defendiendo sus posiciones simplemente como director y que no "empañará" la gala de los Goya con su posición sobre la polémica de la Red.

Automatic simplification (google translate version)

Alex de la Iglesia resign as president of the Film Academy.
The director Alex de la Iglesia shall resign as president of the Film Academy after the next gala Goya.
He announced in an article published in the online edition of “El Pais".
This article does not clearly links his departure with his disagreement over the "Sinde Law", which itself has suggested in his Twitter page.
He explains that he has spoken recently with representatives of sectors opposed to its position on the downloads on the Web.
He has spoken with them fluently.
"In this country, change your mind is the greatest of sins."
Ensures: "I think we have installed the chip intransigence in a while".
The filmmaker sign: "The consistent is to quit."
This filmmaker promises that simply continue to defend their positions as a director and not "tarnish" the gala of the Goya with his position on the controversy of the Network.


Questionnaire

1. ¿Qué cuenta la noticia que va hacer Alex de la Iglesia? (Dimitir / dejar de ser presidente de la Academia de Cine)
2. ¿Cuándo se retirará Alex de la Iglesia como presidente de la Academia de Cine? (Después los premios Goya, el 13 de Febrero)
3. ¿Dónde anunció Alex de la Iglesia que deja de ser presidente de la Academia de Cine? (En el periódico el País, en su página web de Twitter)
4. ¿Por qué se retira Alex de la Iglesia, como presidente de la Academia de cine? (Por no estar de acuerdo con la Ley Sinde)

Questionnaire (google translate version)

1. What has the news that goes to Alex de la Iglesia? (Resign / stop being president of the Film Academy)
2. When Alex de la Iglesia retire as president of the Film Academy? (After Goya Awards, 13 February)
3. Where Alex de la Iglesia said stop being president of the Film Academy? (In the paper the country, on their website Twitter)
4. Why is removed Alex de la Iglesia, as president of the Film Academy? (Not to be in accordance with the Law Sinde)

Summary

Objectives of text simplification
  transform a text into a version which is easier to read (and hopefully to understand) and which has more or less the same content as the original
  also contribute resources to assist the process of simplification, resources that could contribute to tasks other than the actual simplification
There is an increasing interest in NLP for the topic, and certainly in IR and Web Accessibility
In coming years we will see applications which will make textual access easier for different groups of people (and for many languages)


Summary

There are 3 axes in text simplification
  Detecting the complexity/simplicity of a text (probably with respect to external factors)
  Simplification of the lexicon
  Simplification of the syntactic structure
There are several techniques and a trend for corpus-based supervised methods; however, many languages lack appropriate resources, so linguistic intuitions are paramount here
Text simplification (word, sentence) is not the only way to make text accessible
  images, tables, pictures, animations
  disposition of the textual elements
  speech

References

Al-Ajlan, A., Hend S. Al-Khalifa, and AbdulMalik S. Al-Salman. ICDIM, page 506-511. IEEE, (2008)
Aluísio, S., Lucia Specia, Thiago Alexandre Salgueiro Pardo, Erick Galani Maziero, Renata Pontin de Mattos Fortes: Towards Brazilian Portuguese automatic text simplification systems. ACM Symposium on Document Engineering 2008: 240-248
Aluisio, S., Specia, L., Gasperin, C. and Scarton, C. (2010) Readability Assessment for Text Simplification. In Proceedings of the Fifth Workshop on Innovative Use of NLP for Building Educational Applications. Los Angeles, USA.
Andrea Petz, Bror Tronback (2008) People with Specific Learning Difficulties: Easy to Read and HCI. ICCHP 2008: 690-692
Barthe, Kathy; Juaneda, Claire; Leseigneur, Dominique; Loquet, Jean-Claude; Morin, Claude; Escande, Jean; Vayrette, Annick. (1999) GIFAS Rationalized French: A Controlled Language for Aerospace Documentation in French. Technical Communication, Volume 46, Number 2, May 1999, pp. 220-229(10)
Bautista, S. Raquel Hervás, Pablo Gervás, Richard Power, and Sandra Williams (2011). How to Make Numerical Information Accessible:
Björnsson, C.H. (1968). Läsbarhet. Stockholm: Liber.
Bott S, Saggion H, Mille S. (2012). Text Simplification Tools for Spanish. Proceedings of Language Resources and Evaluation Conference, 2012.
Bott S, Saggion H. (2011). An Unsupervised Alignment Algorithm for Text Simplification Corpus Construction. ACL Workshop on Monolingual Text-to-Text Generation.
Bott S, Saggion H. (2012). Automatic Simplification of Spanish Text for e-Accessibility. Proceedings of the 13th International Conference on Computers Helping People with Special Needs. :54–56.
Bouayad-Agha, N., Gerard Casamayor, Gabriela Ferraro, and Leo Wanner. (2009). Simplification of patent claim sentences for their paraphrasing and summarization. 2009 FLAIRS Conference.


Carroll, J. Guido Minnen, Yvonne Canning, Siobhan Devlin, and John Tait. (1998). Practical Simplification of
Chandrasekar, R. Christine Doran, and Bangalore Srinivas. 1996. Motivations and methods for text simplification. COLING 1996, pps1041--1044.
Coster, W. and David Kauchak. 2011. Learning to simplify sentences using wikipedia. Proceedings of Text-To-Text Generation, Portland, Oregon. Association for Computational Linguistics.
Daelemans, W. and Anja Höthker and Erik Tjong Kim Sang. (2004). Automatic Sentence Simplification for Subtitling in Dutch and English. Proceedings of the Language Resources and Evaluation Conference.
De Belder, Jan; Deschacht, Koen; Moens, Marie-Francine. (2010) Lexical simplification, 1st international conference on interdisciplinary research on technology, education and communication, Kortrijk, Belgium, 25-27 May 2010
Dell’Orletta, F., Montemagni, S., Venturi, G. (2011). READ-IT: assessing readability of Italian texts with a view to text simplification. Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies. 73-83
Devlin, S., Tait, J. (1998) The use of a psycholinguistic database in the simplification of text for aphasic readers. In J. Nerbonne, editor, Linguistic Databases, 161–173. CSLI Publications, Stanford, California
Drndarevic B, Saggion H. (2012a). Towards Automatic Lexical Simplification in Spanish: An Empirical Study. Workshop Predicting and improving text readability for target reader populations (PITR2012). NAACL
Drndarevic B, Saggion H. (2012b). Reducing Text Complexity through Automatic Lexical Simplification: an Empirical Study for Spanish. SEPLN Journal. 49
DuBay, W.H. (2004), The Principles of Readability. Impact Information.
Falkenjack, J., Katarina Heimann Mühlenbock, and Arne Jönsson. (2013) Features Indicating Readability in Swedish Text, Proceedings of NODALIDA 2013. Oslo, Norway.
Fellbaum, C. WordNet. An Electronic Lexical Database. The MIT Press.
Flesch, R. (1949) The art of readable writing. Harper, New York.
François, T. and Fairon, C. (2012) An « AI readability » formula for French as a foreign language. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea.
Gunning, R. (1952). The technique of clear writing. McGraw-Hill, New York.
Kincaid, J.P, Fishburne, R.P., Rogers, R.L., Chisson, B.S. (1986) Derivation of new readability formulas for Navy enlisted personnel.
Luz Rello, Ricardo Baeza-Yates, Horacio Saggion, Jennifer Pedler (2012). A First Approach to the Creation of a Spanish Corpus of Dyslexic Texts. Proceedings of NLP4ITA. Istanbul.
Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, Lillian Lee (2010) For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. HLT-NAACL 2010: 365-368
Mitkov, R. (2012) Natural Language Processing and Language Disabilities. Invited Talk. NLP4ITA 2012. LREC Workshop, Istanbul. http://www.taln.upf.edu/nlp4ita/


Ogden, C.K. (1930) Basic English: A General Introduction with Rules and Grammar.
Or Biran, Samuel Brody, and Noémie Elhadad. Putting it Simply: a Context-Aware Approach to Lexical Simplification. 2011. ACL, pp. 496-501. Portland, OR.
P. Lal and S. Rüger (2002) Extract-based Summarization and Simplification. In Proc. Workshop on Text Summarization Document Understanding Conference (DUC) 2002
Petersen, S.E. and Ostendorf, M. (2007) Text Simplification for Language Learners: A Corpus Analysis. In Proceedings of SLaTE-2007, 69-72.
Pitler, E. and Nenkova, A. (2008) Revisiting Readability: A Unified Framework for Predicting Text Quality. In Proceedings of EMNLP, 2008.
Rello, L., Ricardo A. Baeza-Yates, Laura Dempere-Marco, Horacio Saggion: Frequent Words Improve Readability and Short Words Improve Understandability for People with Dyslexia. INTERACT (4) 2013: 203-219.
Rello, L., Ricardo A. Baeza-Yates, Stefan Bott, Horacio Saggion: Simplify or help?: text simplification strategies for people with dyslexia. W4A 2013: 15
Rello, L., Susana Bautista, Ricardo A. Baeza-Yates, Pablo Gervás, Raquel Hervás, Horacio Saggion: One Half or 50%? An Eye-Tracking Study of Number Representation Readability. INTERACT (4) 2013: 229-245
Saggion H, Gómez-Martínez E, Etayo E, Anula A, Bourg L. (2011). Text Simplification in Simplext: Making Text More Accessible. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural.
Sato, S., Matsuyoshi, S., Kondoh, Y. (2008) Automatic Assessment of Japanese Text Readability Based on a Textbook Corpus. In Proceedings of LREC 2008.
Seratan, V. (2012) Acquisition of Syntactic Simplification Rules for French. In Proceedings of LREC 2012. Istanbul, Turkey.
Si, L. and Callan, J. (2001). "A statistical model for scientific readability." In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM) ACM.
Siddharthan, A. (2002). An architecture for a text simplification system. Proceedings of LREC 2002. pp64-71
Spaulding, S. 1956. "A Spanish Readability Formula" Modern Language Journal 40(8) 433-441.
Specia, L. (2010). Translating from Complex to Simplified Sentences. Computational Processing of the Portuguese Language. Lecture Notes in Computer Science Volume 6001, 2010, pp 30-39
Specia, L., Jauhar, S.K. and Mihalcea, R. (2012) SemEval-2012 Task 1: English Lexical Simplification. First Joint Conference on Lexical and Computational Semantics (*SEM). Montréal, Canada.
Stajner, S., Evans, R., Orasan, C. and Mitkov, R. (2012) What Can Readability Really Tell Us About Text Complexity? In Proceedings of NLP4ITA. Istanbul, Turkey.
Tonelli, S., Ke Tran Manh, Emanuele Pianta (2012). Making Readability Indices Readable. In Proceedings of the NAACL Workshop on Predicting and Improving Text Readability for Target Reader Population (PITR2012), Montréal, Canada.
Vari-Cartier, P. (1981): Development and Validation of a New Instrument to Assess the Readability of Spanish Prose, Modern Language Journal, 65, 141-148.
Woodsend, K. and Lapata, M. (2011) Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Edinburgh, UK.
Wubben, S., van der Bosch, A., Krahmer, E. (2012). Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju Island, Korea.
Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C., and Lee, L. (2011) In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Zhu, Z., Delphine Bernhard, and Iryna Gurevych. (2010). A monolingual tree-based translation model for sentence simplification. Proceedings of The 23rd International Conference on Computational Linguistics, pps1353--1361, Beijing, China, Aug.
