Accepted as a short paper in ACL 2018

Split and Rephrase: Better Evaluation and a Stronger Baseline

Roee Aharoni & Yoav Goldberg
Computer Science Department
Bar-Ilan University
Ramat-Gan, Israel
{roee.aharoni,yoav.goldberg}@gmail.com

Abstract

Splitting and rephrasing a complex sentence into several shorter sentences that convey the same meaning is a challenging problem in NLP. We show that while vanilla seq2seq models can reach high scores on the proposed benchmark (Narayan et al., 2017), they suffer from memorization of the training set, which contains more than 89% of the unique simple sentences from the validation and test sets. To address this, we present a new train-development-test data split and neural models augmented with a copy mechanism, outperforming the best reported baseline by 8.68 BLEU and fostering further progress on the task.

Narayan et al. (2017) considered two system setups: a text-to-text setup that does not use the accompanying RDF information, and a semantics-augmented setup that does. They report a BLEU score of 48.9 for their best text-to-text system, and of 78.7 for the best RDF-aware one. We focus on the text-to-text setup, which we find to be more challenging and more natural.

We begin with vanilla SEQ2SEQ models with attention (Bahdanau et al., 2015) and reach a score of 77.5 BLEU, substantially outperforming the text-to-text baseline of Narayan et al. (2017) and approaching their best RDF-aware method. However, manual inspection reveals many cases of unwanted behavior in the resulting outputs: (1) many resulting sentences are unsupported by the input: they contain correct facts about relevant entities, but these facts were not mentioned in the input sentence; (2) some facts are repeated: the same fact is mentioned in multiple output sentences; and (3) some facts are missing: they are mentioned in the input but omitted in the output.
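To make the memorization claim above concrete, the overlap can be measured by set intersection over the unique simple sentences in each split. The following is a minimal sketch, not the authors' analysis code; the file names (train.simple, validation.simple, test.simple, one simple sentence per line) are assumptions about how the benchmark data might be stored.

```python
# Sketch: fraction of unique simple sentences in the validation/test sets
# that also appear verbatim in the training set. File names are hypothetical.

def unique_sentences(path):
    """Read one sentence per line and return the set of unique sentences."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

train = unique_sentences("train.simple")

for split in ("validation.simple", "test.simple"):
    sents = unique_sentences(split)
    overlap = len(sents & train) / len(sents)
    print(f"{split}: {overlap:.1%} of unique simple sentences also occur in training")
```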
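For readers unfamiliar with copy mechanisms, a common formulation (in the style of pointer-generator networks; the paper's exact variant may differ) interpolates the decoder's vocabulary distribution with an attention-derived copy distribution at each step. A minimal NumPy sketch, omitting the usual extension of the vocabulary with out-of-vocabulary source words:

```python
import numpy as np

def copy_augmented_distribution(p_vocab, attn, src_ids, p_gen):
    """Combine generation and copy distributions for one decoder step.

    p_vocab: (V,) softmax over the output vocabulary (generation path)
    attn:    (L,) attention weights over the L source tokens
    src_ids: (L,) vocabulary id of each source token
    p_gen:   scalar in [0, 1], probability of generating rather than copying
    """
    p_final = p_gen * p_vocab  # mass kept for ordinary generation
    # Route the remaining mass through attention onto the vocabulary ids of
    # the source tokens, letting input words be copied verbatim.
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attn)
    return p_final
```

Because the two input distributions each sum to one, the mixture remains a valid probability distribution, and attention mass on duplicate source ids accumulates correctly via np.add.at.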
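Finally, the BLEU figures quoted throughout are corpus-level scores. The benchmark ships its own multi-reference evaluation script, so the numbers above should not be expected to match ad-hoc tooling exactly, but as an illustration of the metric, sacrebleu computes a corpus BLEU as follows (the example sentences are invented):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["Alan Bean was born in Wheeler, Texas. He served as a test pilot."]
references = ["Alan Bean was born in Wheeler, Texas. Alan Bean served as a test pilot."]

# corpus_bleu takes the system outputs plus a list of reference streams,
# one stream per reference set.
result = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {result.score:.1f}")
```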