Data Mining and Knowledge Discovery manuscript No.
(will be inserted by the editor)

Controlling Hallucinations at Word Level in Data-to-Text Generation

Clément Rebuffel* · Marco Roberti* · Laure Soulier · Geoffrey Scoutheeten · Rossella Cancelliere · Patrick Gallinari

Received: date / Accepted: date

* Equal contribution

Clément Rebuffel
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France

Marco Roberti
University of Turin, Italy

Laure Soulier
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France

Geoffrey Scoutheeten
BNP Paribas, Paris

Rossella Cancelliere
University of Turin, Italy

Patrick Gallinari
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
Criteo AI Lab, Paris

arXiv:2102.02810v2 [cs.CL] 9 Jul 2021

Abstract Data-to-Text Generation (DTG) is a subfield of Natural Language Generation aiming at transcribing structured data in natural language descriptions. The field has been recently boosted by the use of neural-based generators which exhibit on one side great syntactic skills without the need of hand-crafted pipelines; on the other side, the quality of the generated text reflects the quality of the training data, which in realistic settings only offer imperfectly aligned structure-text pairs. Consequently, state-of-the-art neural models include misleading statements (usually called hallucinations) in their outputs. The control of this phenomenon is today a major challenge for DTG, and is the problem addressed in the paper. Previous work deals with this issue at the instance level, using an alignment score for each table-reference pair. In contrast, we propose a finer-grained approach, arguing that hallucinations should rather be treated at the word level. Specifically, we propose a Multi-Branch Decoder which is able to leverage word-