arXiv:1912.05200v2 [cs.CL] 12 Dec 2019

Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering

Casimiro Pio Carrino, Marta R. Costa-jussà, José A. R. Fonollosa
TALP Research Center
Universitat Politècnica de Catalunya, Barcelona
{casimiro.pio.carrino, marta.ruiz, jose.fonollosa}@upc.edu

Abstract
Recently, multilingual question answering became a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate-Align-Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluated our QA models with the recently proposed MLQA and XQuAD benchmarks for cross-lingual extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving the new state-of-the-art values of 68.1 F1 points on the Spanish MLQA corpus and 77.6 F1 and 61.8 Exact Match points on the Spanish XQuAD corpus. The resulting, synthetically generated SQuAD-es v1.1 corpus, containing almost 100% of the data of the original English version, is, to the best of our knowledge, the first large-scale QA training resource for Spanish.

1. Introduction
Question answering is a crucial and challenging task in machine-reading comprehension and represents a classical probe to assess the ability of a machine to understand natural language (Hermann et al., 2015). In the last years, the field of QA has made enormous progress, primarily by fine-tuning deep pre-trained architectures (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019b; Yang et al., 2019) on large-scale, high-quality QA datasets. Unfortunately, large annotated corpora are usually scarce in languages other than English, hindering advancement in multilingual QA research. Several approaches based on cross-lingual learning and synthetic corpora generation have been proposed. Cross-lingual learning refers to techniques that transfer the knowledge of a QA model trained on a source language to a target language, given zero or few training examples and language-specific resources (Artetxe et al., 2019; Lee and Lee, 2019; Liu et al., 2019a). On the other hand, machine-translation (MT) based synthetic corpora generation methods are designed to automatically generate QA training datasets (Türe and Boschee, 2016; Alberti et al., 2019; Lee et al., 2018). Additionally, a multilingual QA system based on MT at test time has also been explored (Asai et al., 2018).
In this paper, we follow the synthetic corpora generation direction. In particular, we developed the Translate-Align-Retrieve (TAR) method, based on MT and an unsupervised alignment algorithm, to automatically translate an English QA dataset to Spanish. Indeed, we applied our method to the popular SQuAD v1.1 dataset, generating its first Spanish version. We then trained Spanish QA systems by fine-tuning a Multilingual-BERT model on the generated corpora. Finally, we evaluated our models on two Spanish QA evaluation sets, demonstrating improvements over the current Spanish QA baselines and thereby assessing both the quality of our synthetically generated translated data and the capability of the TAR method.
In summary, the contributions we make are the following: i) We define an automatic method to translate the SQuAD v1.1 training dataset to Spanish that can be generalized to multiple languages. ii) We create SQuAD-es v1.1, the first large-scale training resource for Spanish QA, which establishes the current state-of-the-art for Spanish QA and validates our approach. iii) We make both the code and the SQuAD-es v1.1 dataset freely available¹.

¹https://github.com/ccasimiro88/TranslateAlignRetrieve
2. The Translate-Align-Retrieve (TAR) Method on SQuAD
This section describes the TAR method and its application to the automatic translation of the Stanford Question Answering Dataset (SQuAD) v1.1 (Rajpurkar et al., 2016) into Spanish. SQuAD v1.1 is a machine reading comprehension dataset containing more than 100,000 questions crowd-sourced on Wikipedia articles. It represents a high-quality, large-scale dataset for extractive question answering tasks, where the annotated answer to each question is a span from the context paragraph. It is partitioned into a training and a development set, with 80% and 10% of the total examples, respectively. Unlike the training set, the development set contains at least two answers for each posed question, intended for robust evaluation. Each example in SQuAD v1.1 is a (c, q, a) tuple made of a context paragraph, a question, and the related answers along with their start positions in the context.
Generally speaking, the TAR method is designed for the translation of the source (c_src, q_src, a_src) examples into the corresponding target-language (c_tgt, q_tgt, a_tgt) examples. It consists of three main components:

• A neural machine translation (NMT) model trained from the source to the target language
• A trained unsupervised word alignment model
• A procedure to translate the source (c_src, q_src, a_src) examples and retrieve the corresponding (c_tgt, q_tgt, a_tgt) translations using the previous model components

The next sections explain in detail how we built each component for the Spanish translation of the SQuAD v1.1 training set.
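Before describing each component, the following is a minimal sketch of how the three components compose into one translation pipeline. All the helpers passed as arguments (sentence splitter, translator, aligner and answer retriever) are hypothetical stand-ins for the components built in the next sections, not code from the released repository.

    def tar_translate(example, split_sentences, translate, align, retrieve_answer):
        """Sketch of the TAR pipeline for one (c_src, q_src, a_src) example."""
        c_src, q_src, a_src = example
        # Translate: the context is translated sentence by sentence
        src_sentences = split_sentences(c_src)
        tran_sentences = [translate(s) for s in src_sentences]
        c_tran = " ".join(tran_sentences)
        q_tran = translate(q_src)
        # Align: word alignments between each source sentence and its translation
        alignment = align(src_sentences, tran_sentences)
        # Retrieve: locate the answer span in the translated context
        a_tran = retrieve_answer(a_src, c_tran, alignment)
        return c_tran, q_tran, a_tran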
2.1. NMT Training
We built the first TAR component by training from scratch an NMT model for the English-to-Spanish direction. Our NMT parallel corpus was created by collecting en-es parallel data from several resources. We first collected data from the WikiMatrix project (Schwenk et al., 2019), which uses state-of-the-art multilingual sentence embedding techniques from the LASER toolkit² (Artetxe and Schwenk, 2018a; Artetxe and Schwenk, 2018b) to extract N-way parallel corpora from Wikipedia. Then, we gathered additional resources from the open-source OPUS corpus, avoiding domain-specific corpora that might produce a textual domain mismatch with the SQuAD v1.1 articles. Eventually, we selected data from 5 different resources: Wikipedia, TED-2013, News-Commentary, Tatoeba and OpenSubtitles (Lison and Tiedemann, 2016; Wolk and Marasek, 2015; Tiedemann, 2012).
The data pre-processing pipeline consisted of punctuation normalization, tokenization, true-casing and, eventually, a joint source-target BPE segmentation (Sennrich et al., 2016) with a maximum of 50k BPE symbols. Then, we filtered out sentences longer than 80 tokens and removed all source-target duplicates (see the sketch at the end of this section). The final corpora consist of almost 6.5M parallel sentences for the training set, 5k sentences for the validation set and 1k for the test set. The pre-processing pipeline is performed with the scripts in the Moses repository³ and the Subword-nmt repository⁴.
We then trained the NMT system with the Transformer model (Vaswani et al., 2017). We used the implementation available in the OpenNMT-py toolkit (Klein et al., 2017) in its default configuration for 200k steps on one GeForce GTX TITAN X device. Additionally, we shared the source and target vocabularies and, consequently, we also shared the corresponding source and target embeddings between the encoder and decoder. After the training, our best model was obtained by averaging the final three consecutive checkpoints.
Finally, we evaluated the NMT system with the BLEU score (Papineni et al., 2002) on our test set. The model achieved a BLEU score of 45.60 points, showing that it is good enough to be used as a pre-trained English-Spanish translator suitable for our purpose.

²https://github.com/facebookresearch/LASER
³https://github.com/moses-smt/mosesdecoder
⁴https://github.com/rsennrich/subword-nmt
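As an illustration, the length and duplicate filters described above amount to roughly the following sketch; the file names are hypothetical and the sentences are assumed to be already tokenized.

    def filter_corpus(src_lines, tgt_lines, max_len=80):
        """Drop sentence pairs longer than 80 tokens and source-target duplicates."""
        seen = set()
        for src, tgt in zip(src_lines, tgt_lines):
            src, tgt = src.strip(), tgt.strip()
            if len(src.split()) > max_len or len(tgt.split()) > max_len:
                continue  # length filter
            if (src, tgt) in seen:
                continue  # duplicate filter
            seen.add((src, tgt))
            yield src, tgt

    with open("train.tok.en") as f_en, open("train.tok.es") as f_es:
        pairs = list(filter_corpus(f_en, f_es))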
2.2. Source-Translation Context-Alignment
The role of the alignment component is to compute the alignment between the context sentences and their translations. We relied on an efficient and accurate unsupervised word alignment tool called eflomal (Östling and Tiedemann, 2016), based on a Bayesian model with Markov Chain Monte Carlo inference. We used a fast implementation named efmaral⁵, released by the same author as eflomal. The implementation also allows the generation of priors that are used at inference time to align quickly. Therefore, we used tokenized sentences from the NMT training corpus to train a token-level alignment model and generate such priors.

⁵https://github.com/robertostling/efmaral

2.3. Translate and Retrieve the Answers
The final component defines a strategy to translate and retrieve the answer translations in order to obtain the translated version of the SQuAD dataset. Given the original SQuAD textual content, three steps are applied:

1. Translate all the (c_src, q_src, a_src) instances
2. Compute the source-translation context alignment align_src-tran
3. Retrieve the answer translations a_tran using the source-translation context alignment align_src-tran

The next paragraphs describe these steps in detail.

Content Translation and Context-Alignment  The first two steps are quite straightforward. First of all, all the (c_src, q_src, a_src) examples are collected and translated with the trained NMT model. Second, the alignments between the context sentences and their translations are computed. Before the translation, each source context c_src is split into sentences. Both the final context translation c_tran and the context alignment align(c_src, c_tran) are then obtained by merging the sentence translations and the sentence alignments, respectively (see the sketch below). Furthermore, it is important to mention that, since the context alignment is computed at the token level, we computed an additional map from the token positions to the word positions in the raw context. The resulting alignment maps a word position in the source context to the corresponding word position in the context translation. Eventually, each source context c_src and question q_src is replaced with its translation c_tran and q_tran, while the source answer a_src is left unchanged. To obtain the correct answer translations, we designed a specific retrieval mechanism, implemented in the last TAR component described next.
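As a sketch of the merging step, assuming each sentence alignment is a list of (source position, translation position) word-index pairs local to that sentence, the per-sentence alignments can be shifted by the cumulative sentence lengths to produce a single context-level map:

    def merge_alignments(src_sentences, tran_sentences, sent_alignments):
        """Merge per-sentence word alignments into one context-level alignment."""
        context_align = {}
        src_offset = tran_offset = 0
        for src_sent, tran_sent, alignment in zip(src_sentences, tran_sentences, sent_alignments):
            for src_pos, tran_pos in alignment:
                # shift the sentence-local positions by the words seen so far
                context_align[src_offset + src_pos] = tran_offset + tran_pos
            src_offset += len(src_sent.split())
            tran_offset += len(tran_sent.split())
        return context_align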
Retrieve Answers with the Alignment  The SQuAD dataset is designed for extractive question answering models, where the answer to each question must be a span that belongs to the paragraph. This poses a significant constraint to take into account when the data are translated automatically. We found that in most cases the translation of an answer on its own is different from its translation inside the paragraph that contains it. This may occur because a neural generative model, like our NMT model, produces the translation of a word conditioned on its context.
To overcome this issue, we leverage the previously computed context alignment align_src-tran. We designed an answer extraction procedure that retrieves answers even when the answer translation is not contained in the paragraph translation. First, we use align_src-tran to map the word positions of the source answer (a_src^start, ..., a_src^end) to the corresponding word positions of the answer translation (a_tran^start, ..., a_tran^end). Also, a position reordering is applied to extract a'_tran^start and a'_tran^end as the minimum and maximum over (a_tran^start, ..., a_tran^end), respectively. This position reordering accounts for word inversions during the translation. Then, for a given context, we look up the answer translation a_tran as a span of the context translation c_tran. The span is searched from the corresponding start position a_tran^start in the context translation c_tran. It is necessary to detect in which part of the context translation c_tran the answer translation a_tran is mapped, to prevent the extraction of an incorrect answer span when the answer appears in more than one sentence. Furthermore, a_tran is lower-cased to improve the probability of matching it in c_tran. If the answer translation is found in the context translation, it is retrieved directly. In the opposite case, we retrieve the answer translation a_tran as the span between a'_tran^start and a'_tran^end. Algorithm 1 shows the implementation.

Algorithm 1: Implementation of the answer retrieval with alignment for each (c, q, a) example, rendered here as runnable Python. c_src and c_tran are the source and translated contexts, q_src is the source question, (a_src^start, ..., a_src^end) and (a_tran^start, ..., a_tran^end) are the word positions of the source answer and of the answer translation, align_src_tran is the source-translation context alignment, and a_tran is the retrieved answer. The get_* helpers stand for the lookups described in the text.

    def retrieve_answers(context_paragraphs):
        for c_src in context_paragraphs:
            c_tran = get_context_translation(c_src)
            align_src_tran = get_context_alignment(c_src, c_tran)
            for q_src in get_questions(c_src):
                for a_src in get_answers(q_src):
                    a_tran = get_answer_translation(a_src)
                    if a_tran in c_tran:
                        # the answer translation matches a span of c_tran directly
                        yield a_tran
                    else:
                        # map the source answer positions through the alignment
                        src_positions = get_src_positions(a_src)
                        tran_positions = [align_src_tran[p] for p in src_positions]
                        # min/max reordering accounts for inversions in translation
                        start, end = min(tran_positions), max(tran_positions)
                        yield c_tran[start:end]

3. The SQuAD-es Dataset
We applied the TAR method to the SQuAD v1.1 training dataset. The resulting Spanish version is referred to as SQuAD-es v1.1. In Table 3 we show some (c_es, q_es, a_es) examples taken from SQuAD-es v1.1. These examples include both good and bad translations, giving a first glimpse of the quality of the TAR method on the SQuAD dataset.

3.1. Error Analysis
In the following, we conduct a detailed error analysis in order to better understand the quality of the translated (c_es, q_es, a_es) data. This quality depends on both the NMT model and the unsupervised alignment model, and it is the interplay between them in the TAR method that determines the final result. Indeed, while bad NMT translations irremediably hurt the data quality, an erroneous source-translation alignment is responsible for the retrieval of wrong answer spans.
Therefore, we carried out a qualitative error analysis on SQuAD-es v1.1 in order to detect and characterize the produced errors and the factors behind them. We inferred the quality of the (c_es, q_es, a_es) examples by looking at errors in the answer translations {a_es}. Indeed, given how the TAR method works, a wrong answer translation provides an easy diagnostic clue for a potential error in either the context translation c_es or the source-translation context alignment align_en-es. We also made use of the question translations {q_es} to assess the level of correctness of the answer translations {a_es}. We identified systematic errors and classified them into two types:

Type I: Misaligned Span  The answer translation is a span extracted from a wrong part of the context translation and therefore does not represent the correct translation of the source answer. This error is caused by a misalignment in the source-translation alignment, or by a translation error when the NMT model produces a context translation shorter than the source context, which consequently generates a wrong span.

Type II: Overlapping Span  The answer translation is a span with a certain degree of overlap with the golden span in the context translation. It might contain some additional words or punctuation, such as trailing commas or periods; in particular, extra punctuation appears when the words in the translated span carry punctuation that the source annotation excludes. Sometimes, we also found that the answer translation span overlaps part of the next sentence. This error type is generated by a slightly imprecise source-translation alignment or by an NMT translation error. Nevertheless, we noticed that often enough the resulting answer translation, with respect to its question, turns out to be acceptable. Table 4 shows some examples of each error type.

Overall, the two factors behind these error types are the NMT component and the alignment component of our TAR method. In order to get a more quantitative idea of how the error types are distributed across the (c_es, q_es, a_es) examples, we randomly selected an article from SQuAD v1.1 and manually counted the error types. Besides, we divided the total examples into two sets, depending on how the answer translations are retrieved. The first set contains all the (c_es, q_es, a_es) instances, while the second, smaller set contains only the (c_es, q_es, a_es) instances retrieved when the answer translation is matched as a span in the context translation. In this way, we isolate the effect of the alignment component in the answer retrieval and evaluate its impact on the distribution of the error types. Table 1 shows the number of occurrences of each error type on the two sets of (c_es, q_es, a_es) examples.

Error               {(c,q,a)}_all    {(c,q,a)}_no-align
Type I              15 (7%)          0 (0%)
Type II             98 (43%)         7 (10%)
Total               113 (50%)        7 (10%)
Type II (acc.)      68               4
# of (c,q,a) ex.    227              67

Table 1: The number of error type occurrences and their percentages for the two sets of (c_es, q_es, a_es) instances retrieved with and without alignment. Type II (acc.) counts the Type II spans judged acceptable.

As a result, we found that the alignment is responsible for the introduction of a large number of error occurrences in the translated (c_es, q_es, a_es) instances. When the answer translations are retrieved with the source-translation alignment, we found a significant increase of 40% in the total errors. On the other side, when the alignment is not involved in the retrieval process, the total number of translated examples is drastically reduced, by 70% with respect to the other case. However, the numbers show that the relative percentage of acceptable answers increases when the alignment is used. This analysis indicates that the TAR method can produce two kinds of synthetic dataset: a bigger one with noisy examples and a smaller one with higher quality. In the next section, we generate two Spanish translations of the SQuAD v1.1 training dataset, by either keeping or discarding the (c_es, q_es, a_es) examples retrieved with the alignment, to empirically evaluate their impact on QA system performance.

3.2. Cleaning and Refinements
After the translation, we applied some heuristics to clean and refine the retrieved (c_es, q_es, a_es) examples. Based on the error analysis, we post-processed the Type II errors (a sketch of these heuristics is given at the end of this section). In particular, we first filter out words in the answer translations belonging to the next sentence. Then, we remove the extra leading and trailing punctuation. Eventually, we also remove empty answer translations that might be generated during the translation process.
Moreover, in order to examine the impact on QA system performance, we produced two versions of the SQuAD-es v1.1 training dataset: a standard one, containing all the translated (c_es, q_es, a_es) examples and referred to as SQuAD-es v1.1, and another one that keeps only the (c_es, q_es, a_es) examples retrieved without the use of the alignment, named SQuAD-es-small. Table 2 shows the statistics of the final SQuAD-es v1.1 datasets in terms of how many (c_es, q_es, a_es) examples are translated over the total number of examples in the original English version. We also show the average context, question and answer lengths in terms of tokens. As a result, the SQuAD-es v1.1 training set contains almost 100% of the SQuAD v1.1 data, while SQuAD-es-small v1.1 is approximately half the size, with about 53% of the data. In the next section, these two SQuAD-es v1.1 datasets will be employed to train Spanish Question Answering models.

              SQuAD-es        SQuAD-es-small
# of ex.      87595/87599     46260/87599
Avg. c len    140             138
Avg. q len    13              12
Avg. a len    4               3

Table 2: Number of (c_es, q_es, a_es) examples in the final SQuAD-es v1.1 datasets over the total number of the original SQuAD v1.1 (c_en, q_en, a_en) examples. We also computed the average lengths of the contexts, questions and answers in terms of tokens.
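The Type II post-processing just described can be sketched as follows; the sentence-boundary heuristic and the punctuation set are simplified approximations of the actual clean-up.

    PUNCT = '.,;:!?"\'() '

    def clean_answer(answer):
        """Sketch of the Type II clean-up heuristics."""
        answer = answer.split(". ")[0]   # drop words belonging to the next sentence
        answer = answer.strip(PUNCT)     # remove extra leading/trailing punctuation
        return answer                    # may be empty, in which case it is discarded

    answers = ["38 premios Pulitzer. Los", "(10.7%)", ""]
    cleaned = [a for a in map(clean_answer, answers) if a]
    # -> ["38 premios Pulitzer", "10.7%"]; the empty answer is removed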
4. QA Experiments
We trained two Spanish QA models by fine-tuning a pre-trained Multilingual-BERT (mBERT) model on our SQuAD-es v1.1 datasets, following the method used in (Devlin et al., 2018). We employed the implementation in the open-source HuggingFace Transformers library (Wolf et al., 2019). Our models were trained for three epochs on one GTX TITAN X GPU device, using the default parameter values set in the HuggingFace scripts. The goal is to assess the quality of our synthetically generated datasets when used as a training resource for Spanish QA models.
We performed the Spanish QA evaluation on two recently proposed, freely available corpora for cross-lingual QA evaluation, the MLQA and XQuAD corpora (Lewis et al., 2019; Artetxe et al., 2019). The MLQA corpus is an N-way parallel dataset consisting of thousands of QA examples crowd-sourced from Wikipedia in 7 languages, among which Spanish. It is split into development and test sets, and it represents an evaluation benchmark for cross-lingual QA systems. Similarly, the XQuAD corpus is another cross-lingual QA benchmark consisting of question-answer pairs from the development set of SQuAD v1.1 translated by professional translators into 10 languages, including Spanish. Therefore, we used both the MLQA and XQuAD benchmarks to evaluate the performance of our trained models. We measured the QA performance with the F1 and Exact Match (EM) scores computed using the official evaluation script available in the MLQA repository, which is a multilingual generalization of the commonly-used SQuAD evaluation script⁶.
Results of the evaluation are shown in Table 5 and Table 6.

⁶https://rajpurkar.github.io/SQuAD-explorer/
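For illustration, once fine-tuned, such a model can be queried through the library's question-answering pipeline. This is a minimal usage sketch, not the training script itself: the checkpoint path is hypothetical, and the pipeline API assumes a reasonably recent version of Transformers.

    from transformers import pipeline

    # Querying a Spanish QA model fine-tuned on SQuAD-es v1.1;
    # "./mbert-squad-es" is a hypothetical local checkpoint directory.
    qa = pipeline("question-answering", model="./mbert-squad-es", tokenizer="./mbert-squad-es")

    result = qa(
        question="¿En qué pueblo nació Frédéric?",
        context="Fryderyk Chopin nació en Żelazowa Wola, 46 kilómetros al oeste de Varsovia.",
    )
    print(result["answer"], result["score"])  # expected span: "Żelazowa Wola"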

Context (en): Fryderyk Chopin was born in Żelazowa Wola, 46 kilometres (29 miles) west of Warsaw, in what was then the Duchy of Warsaw, a Polish state established by Napoleon. The parish baptismal record gives his birthday as 22 February 1810, and cites his given names in the Latin form Fridericus Franciscus (in Polish, he was Fryderyk Franciszek). However, the composer and his family used the birthdate 1 March, [n 2] which is now generally accepted as the correct date.
Context (es): Fryderyk Chopin nació en Żelazowa Wola, 46 kilómetros al oeste de Varsovia, en lo que entonces era el Ducado de Varsovia, un estado polaco establecido por Napoleón. El registro de bautismo de la parroquia da su cumpleaños el 22 de febrero de 1810, y cita sus nombres en latín Fridericus Franciscus (en polaco, Fryderyk Franciszek). Sin embargo, el compositor y su familia utilizaron la fecha de nacimiento 1 de marzo, [n 2] que ahora se acepta generalmente como la fecha correcta.

Question 1 (en): In what village was Frédéric born in?
Question 1 (es): ¿En qué pueblo nació Frédéric?
Answer 1 (en): Żelazowa Wola
Answer 1 (es): Żelazowa Wola,

Question 2 (en): When was his birthday recorded as being?
Question 2 (es): ¿Cuándo se registró su cumpleaños?
Answer 2 (en): 22 February 1810
Answer 2 (es): 22 de febrero de 1810,

Question 3 (en): What is the Latin form of Chopin's name?
Question 3 (es): ¿Cuál es la forma latina del nombre de Chopin?
Answer 3 (en): Fridericus Franciscus
Answer 3 (es): Fridericus Franciscus

Context (en): During the Five Dynasties and Ten Kingdoms period of China (907–960), while the fractured political realm of China saw no threat in a Tibet which was in just as much political disarray, there was little in the way of Sino-Tibetan relations. Few documents involving Sino-Tibetan contacts survive from the Song dynasty (960–1279). The Song were far more concerned with countering northern enemy states of the Khitan-ruled Liao dynasty (907–1125) and Jurchen-ruled Jin dynasty (1115–1234).
Context (es): Durante el período de las Cinco Dinasties y los Diez Reinos de China (907-960), mientras que el reino político fracturado de China no vio ninguna amenaza en un Tíbet que estaba en el mismo desorden político. Pocos documentos que involucran contactos sino-tibetanos sobreviven de la dinastía Song (960-1279). Los Song estaban mucho más preocupados por contrarrestar los estados enemigos del norte de la dinastía Liao gobernada por Kitán (907-1125) y la dinastía Jin gobernada por Jurchen (1115-1234).

Question 1 (en): When did the Five Dynasties and Ten Kingdoms period of China take place?
Question 1 (es): ¿Cuándo tuvo lugar el período de las Cinco Dinasties y los Diez Reinos de China?
Answer 1 (en): 907–960
Answer 1 (es): (907-960),

Question 2 (en): Who ruled the Liao dynasty?
Question 2 (es): ¿Quién gobernó la dinastía Liao?
Answer 2 (en): the Khitan
Answer 2 (es): la dinastía Liao gobernada por Kitán

Question 3 (en): Who ruled the Jin dynasty?
Question 3 (es): ¿Quién gobernó la dinastía Jin?
Answer 3 (en): Jurchen
Answer 3 (es): gobernada por Jurchen

Table 3: Examples of (c_es, q_es, a_es) examples selected from two of the first articles in the SQuAD v1.1 training dataset, Frédéric Chopin and Sino-Tibetan relations during the Ming dynasty.

On the MLQA corpus, our best model beats the state-of-the-art F1 score of the XLM (MLM + TLM, 15 languages) (Lample and Conneau, 2019) baseline for Spanish QA. Equally, on the XQuAD corpus, we set the state-of-the-art for both the F1 and EM scores. Overall, the TAR-train+mBERT models perform better than the current mBERT-based baselines, showing significant improvements of 1.2%−3.3% in F1 score and 0.6%−6.5% in EM score across the MLQA and XQuAD benchmarks. Interestingly enough, when compared to a model trained on synthetic data, such as the Translate-train+mBERT, our TAR-train+mBERT models show even more substantial improvements, with a notable increase of 11.6−14.2% in F1 score and 9.8−10.9% in EM score. These results indicate that the quality of the SQuAD-es v1.1 dataset is good enough to train a Spanish QA model able to reach state-of-the-art accuracy, therefore proving the efficacy of the TAR method for synthetic corpora generation.

Context (en): The Medill School of Journalism has produced notable journalists and political activists including 38 Pulitzer Prize laureates. National correspondents, reporters and columnists such as The New York Times's Elisabeth Bumiller, David Barstow, Dean Murphy, and Vincent Laforet, USA Today's Gary Levin, Susan Page and Christine Brennan, NBC correspondent Kelly O'Donnell, CBS correspondent Richard Threlkeld, CNN correspondent Nicole Lapin and former CNN and current Al Jazeera America anchor Joie Chen, and ESPN personalities Rachel Nichols, Michael Wilbon, Mike Greenberg, Steve Weissman, J. A. Adande, and Kevin Blackistone. The bestselling author of the A Song of Ice and Fire series, George R. R. Martin, earned a B.S. and M.S. from Medill. Elisabeth Leamy is the recipient of 13 Emmy awards and 4 Edward R. Murrow Awards.
Context (es): La Escuela Medill de Periodismo ha producido notables periodistas y activistas políticos incluyendo 38 premios Pulitzer. Los corresponsales nacionales, reporteros y columnistas como Elisabeth Bumiller de The New York Times, David Barstow, Dean Murphy, y Vincent Laforet, Gary Levin de USA Today, Susan Page y Christine Brennan de NBC. A. Adande, y Kevin Blackistone. El autor más vendido de la serie Canción de Hielo y Fuego, George R. R. Martin, obtuvo un B.S. Y M.S. De Medill. Elisabeth Leamy ha recibido 13 premios Emmy y 4 premios Edward R. Murrow.

Question 1 (en): Which CBS correspondent graduated from The Medill School of Journalism?
Question 1 (es): ¿Qué corresponsal de CBS se graduó de la Escuela Medill de Periodismo?
Answer 1 (en): Richard Threlkeld
Answer 1 (es): NBC.
Error: Type I

Question 2 (en): How many Pulitzer Prize laureates attended the Medill School of Journalism?
Question 2 (es): ¿Cuántos premios Pulitzer asistieron a la Escuela Medill de Periodismo?
Answer 2 (en): 38
Answer 2 (es): 38 premios Pulitzer. Los
Error: Type II

Context (en): Admissions are characterized as "most selective" by U.S. News & World Report. There were 35,099 applications for the undergraduate class of 2020 (entering 2016), and 3,751 (10.7%) were admitted, making Northwestern one of the most selective schools in the United States. For freshmen enrolling in the class of 2019, the interquartile range (middle 50%) on the SAT was 690–760 for critical reading and 710–800 for math, ACT composite scores for the middle 50% ranged from 31–34, and 91% ranked in the top ten percent of their high school class.
Context (es): Las admisiones se caracterizan como "más selectivas" por U.S. News & World Report. Hubo 35.099 solicitudes para la clase de pregrado de 2020 (ingresando en 2016), y 3.751 (10.7%) fueron admitidos, haciendo de Northwestern una de las escuelas más selectivas en los Estados Unidos. Para los estudiantes de primer año de matriculación en la clase de 2019, el rango intermedio (50% medio) en el SAT fue de 690-760 para lectura crítica y 710-800 para matemáticas, compuesto ACT para el 31%.

Question 1 (en): What percentage of freshman students enrolling in the class of 2019 ranked in the top 10% of their high school class?
Question 1 (es): ¿Qué porcentaje de estudiantes de primer año matriculados en la clase de 2019 se ubicó en el 10% superior de su clase de secundaria?
Answer 1 (en): 91%
Answer 1 (es): 31%
Error: Type I

Question 2 (en): What percentage of applications were admitted for the undergraduate class entering in 2016?
Question 2 (es): ¿Qué porcentaje de solicitudes fueron admitidas para la clase de pregrado en 2016?
Answer 2 (en): 10.7%
Answer 2 (es): (10.7%)
Error: Type II (acc.)

Question 3 (en): How selective are admissions at Northwestern characterized by U.S. News?
Question 3 (es): ¿Cuán selectivas son las admisiones en Northwestern caracterizadas por U.S. News?
Answer 3 (en): most selective
Answer 3 (es): "más selectivas"
Error: Type II (acc.)

Table 4: Examples of error types in the (c_es, q_es, a_es) examples from the randomly selected article of the SQuAD v1.1 dataset. The Error column marks wrong answer translations by type; (acc.) marks Type II answers judged acceptable.

                                                    F1 / EM (es)
Our models        TAR-train+mBERT (SQuAD-es)        68.1 / 48.3
                  TAR-train+mBERT (SQuAD-es-small)  65.5 / 47.2
mBERT baselines   mBERT                             64.3 / 46.6
                  Translate-train+mBERT             53.9 / 37.4
State of the art  XLM (MLM + TLM, 15 languages)     68.0 / 49.8

Table 5: Evaluation results in terms of F1 and EM scores on the MLQA corpus for our Multilingual-BERT models trained with the two versions of SQuAD-es v1.1 and the current Spanish Multilingual-BERT baselines.

                                                         F1 / EM (es)
Our models        TAR-train+mBERT (SQuAD-es)             77.6 / 61.8
                  TAR-train+mBERT (SQuAD-es-small)       73.8 / 59.5
mBERT baselines   JointMulti 32k voc                     59.5 / 41.3
                  JointMulti 200k voc (state of the art) 74.3 / 55.3
                  JointPair with Joint voc               68.3 / 47.8
                  JointPair with Disjoint voc            72.5 / 52.5

Table 6: Evaluation results in terms of F1 and EM scores on the XQuAD corpus for our Multilingual-BERT models trained with the two versions of SQuAD-es v1.1 and the current Spanish Multilingual-BERT baselines.

The QA evaluation also demonstrates that the performance of mBERT on the Spanish MLQA and XQuAD benchmarks increases by 2.6−4.2% in F1 score and 1.1−2.3% in EM score when the SQuAD-es v1.1 dataset is used compared to the SQuAD-es-small v1.1 dataset. Based on the error analysis in Section 3, we can assume that SQuAD-es v1.1 is a bigger but noisier dataset, compared to SQuAD-es-small, which is smaller but more accurate. Therefore, we observe that the mBERT model may be robust enough to tolerate noisy data, giving more importance to quantity. This observation connects to the problem of quality versus quantity in synthetic corpora generation and its application to multilingual reading comprehension (Lee et al., 2019).

5. Conclusions
In this work we have designed the TAR method to automatically translate the SQuAD v1.1 training dataset to Spanish. We applied the TAR method to generate the SQuAD-es v1.1 dataset, the first large-scale training resource for Spanish QA. Finally, we employed the SQuAD-es v1.1 dataset to train QA systems that achieved state-of-the-art performance on the Spanish QA task, demonstrating the efficacy of the TAR approach for synthetic corpora generation. Therefore, we make the SQuAD-es dataset freely available and encourage its usage for multilingual QA.
The results achieved so far encourage us to look forward and extend our approach in future works. First of all, we will apply the TAR method to translate the SQuAD v2.0 dataset (Rajpurkar et al., 2018) and other large-scale extractive QA datasets such as Natural Questions (Kwiatkowski et al., 2019). Moreover, we will also exploit the modularity of the TAR method to support languages other than Spanish, to prove the validity of our approach for synthetic corpora generation.

Acknowledgements
This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, the contract TEC2015-69266-P (MINECO/FEDER, EU) and the contract PCIN-2017-079 (AEI/MINECO).

6. Bibliographical References
Alberti, C., Andor, D., Pitler, E., Devlin, J., and Collins, M. (2019). Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173, Florence, Italy, July. Association for Computational Linguistics.
Artetxe, M. and Schwenk, H. (2018a). Margin-based parallel corpus mining with multilingual sentence embeddings. CoRR, abs/1811.01136.
Artetxe, M. and Schwenk, H. (2018b). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. CoRR, abs/1812.10464.
Artetxe, M., Ruder, S., and Yogatama, D. (2019). On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856.
Asai, A., Eriguchi, A., Hashimoto, K., and Tsuruoka, Y. (2018). Multilingual extractive reading comprehension by runtime machine translation. CoRR, abs/1809.03275.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching machines to read and comprehend. CoRR, abs/1506.03340.
Klein, G., Kim, Y., Deng, Y., Crego, J., Senellart, J., and Rush, A. (2017). OpenNMT: Open-source toolkit for neural machine translation.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Lample, G. and Conneau, A. (2019). Cross-lingual language model pretraining. CoRR, abs/1901.07291.
Lee, C. and Lee, H. (2019). Cross-lingual transfer learning for question answering. CoRR, abs/1907.06042.
Lee, K., Yoon, K., Park, S., and Hwang, S.-w. (2018). Semi-supervised training data generation for multilingual question answering. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
Lee, K., Park, S., Han, H., Yeo, J., Hwang, S.-w., and Lee, J. (2019). Learning with limited data for multilingual reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2840–2850, Hong Kong, China, November. Association for Computational Linguistics.
Lewis, P., Oguz, B., Rinott, R., Riedel, S., and Schwenk, H. (2019). MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
Liu, J., Lin, Y., Liu, Z., and Sun, M. (2019a). XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368, Florence, Italy, July. Association for Computational Linguistics.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019b). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Östling, R. and Tiedemann, J. (2016). Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146, October.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August. Association for Computational Linguistics.
Türe, F. and Boschee, E. (2016). Learning to translate for multilingual question answering. CoRR, abs/1609.08210.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In I. Guyon, et al., editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

7. Language Resource References
Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923–929, Portorož, Slovenia, May. European Language Resources Association (ELRA).
Schwenk, H., Chaudhary, V., Sun, S., Gong, H., and Guzmán, F. (2019). WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. CoRR, abs/1907.05791.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).
Wolk, K. and Marasek, K. (2015). Building subject-aligned comparable corpora and mining it for truly parallel sentence pairs. CoRR, abs/1509.08881.