
Contextual Text Style Transfer

Yu Cheng1, Zhe Gan1, Yizhe Zhang2, Oussama Elachqar2, Dianqi Li3, Jingjing Liu1
1Microsoft Dynamics 365 AI Research   2Microsoft Research   3University of Washington
{yu.cheng,zhe.gan,yizhe.zhang,ouelachq,jinjl}@microsoft.com, [email protected]

Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2915-2924, November 16-20, 2020. © 2020 Association for Computational Linguistics

Abstract

We introduce a new task, Contextual Text Style Transfer - translating a sentence into a desired style with its surrounding context taken into account. This brings two key challenges to existing style transfer approaches: (i) how to preserve the semantic meaning of the target sentence and its consistency with the surrounding context during transfer; (ii) how to train a robust model with limited labeled data accompanied by context. To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context. A classifier is further trained to ensure contextual consistency of the generated sentence. To compensate for the lack of parallel data, additional self-reconstruction and back-translation losses are introduced to leverage non-parallel data in a semi-supervised fashion. Two new benchmarks, Enron-Context and Reddit-Context, are introduced for formality and offensiveness style transfer. Experimental results on these datasets demonstrate the effectiveness of the proposed CAST model over state-of-the-art methods across style accuracy, content preservation and contextual consistency metrics.[1]

[1] Code and datasets will be released at https://github.com/ych133/CAST.

1 Introduction

Text style transfer has been applied to many applications (e.g., sentiment manipulation, formalized writing) with remarkable success. Early work relies on parallel corpora with a sequence-to-sequence learning framework (Bahdanau et al., 2015; Jhamtani et al., 2017). However, collecting parallel annotations is highly time-consuming and expensive. There have also been studies on developing text style transfer models with non-parallel data (Hu et al., 2017; Li et al., 2018; Prabhumoye et al., 2018; Subramanian et al., 2018), assuming that disentangling style information from semantic content can be achieved in an auto-encoding fashion with the introduction of additional regularizers (e.g., adversarial discriminators (Shen et al., 2017), language models (Yang et al., 2018)).

Despite promising results, these techniques still have a long way to go for practical use. Most existing models focus on sentence-level rewriting. However, in real-world applications, sentences typically reside in a surrounding paragraph context. In formalized writing, the rewritten span is expected to align well with the surrounding context to keep a coherent semantic flow. For example, to automatically replace a gender-biased sentence in a job description document, a style transfer model taking the sentence out of context may not be able to understand the proper meaning of the statement and the intended message. Taking a single sentence as the sole input of a style transfer model may fail to preserve topical coherency between the generated sentence and its surrounding context, leading to low semantic and logical consistency on the paragraph level (see Example C in Table 4). Similar observations can be found in other style transfer tasks, such as offensive-to-non-offensive and political-to-neutral transfer.

Motivated by this, we propose and investigate a new task - Contextual Text Style Transfer. Given a paragraph, the system aims to translate sentences into a desired style, while keeping the edited section topically coherent with its surrounding context. To achieve this goal, we propose a novel Context-Aware Style Transfer (CAST) model, by jointly considering style translation and context alignment.

To leverage parallel training data, CAST employs two separate encoders to encode the source sentence and its surrounding context, respectively. With the encoded sentence and context embeddings, a decoder is trained to translate the joint features into a new sentence in a specific style. A pre-trained style classifier is applied for style regularization, and a coherence classifier learns to regularize the generated target sentence to be consistent with the context. To overcome the data sparsity issue, we further introduce a set of unsupervised training objectives (e.g., self-reconstruction loss, back-translation loss) to leverage non-parallel data in a hybrid approach (Shang et al., 2019). The final CAST model is jointly trained with both parallel and non-parallel data via end-to-end training.

As this is a newly proposed task, we introduce two new datasets, Enron-Context and Reddit-Context, collected via crowdsourcing. The former contains 14,734 formal vs. informal paired samples from Enron (Klimt and Yang, 2004) (an email dataset), and the latter contains 23,158 offensive vs. non-offensive paired samples from Reddit (Serban et al., 2017). Each sample contains an original sentence and a human-rewritten one in the target style, accompanied by its paragraph context. In experiments, we also leverage 60k formal/informal sentences from GYAFC (Rao and Tetreault, 2018) and 100k offensive/non-offensive sentences from Reddit (dos Santos et al., 2018) as additional non-parallel data for model training.

The main contributions of this work are summarized as follows: (i) We propose a new task - Contextual Text Style Transfer, which aims to translate a sentence into a desired style while preserving its style-agnostic semantics and topical consistency with the surrounding context. (ii) We introduce two new datasets for this task, Enron-Context and Reddit-Context, which provide strong benchmarks for evaluating contextual style transfer models. (iii) We present a new model - Context-Aware Style Transfer (CAST), which jointly optimizes the generation quality of the target sentence and its topical coherency with adjacent context. Extensive experiments on the new datasets demonstrate that the proposed CAST model significantly outperforms state-of-the-art style transfer models.

2 Related Work

2.1 Text Style Transfer

Text style transfer aims to modify an input sentence into a desired style while preserving its style-independent semantics. Previous work has explored this as a sequence-to-sequence learning task using parallel corpora with paired source/target sentences in different styles. For example, Jhamtani et al. (2017) pre-trained embeddings by leveraging external dictionaries mapping Shakespearean to modern English words and additional text. However, available parallel data in different styles are very limited. Therefore, there is a recent surge of interest in considering a more realistic setting, where only non-parallel stylized corpora are available. A typical approach is: (i) disentangling the latent space into content and style features; then (ii) generating stylistic sentences by tweaking style-relevant features and passing them through a decoder, together with the original content-relevant features (Xu et al., 2018).

Many of these approaches borrowed the idea of an adversarial discriminator/classifier from the Generative Adversarial Network (GAN) framework (Goodfellow et al., 2014). For example, Shen et al. (2017); Fu et al. (2018); Lample et al. (2018) used adversarial classifiers to force the decoder to transfer the encoded source sentence into a different style/language. Alternatively, Li et al. (2018) achieved disentanglement by filtering stylistic words of input sentences. Another direction for text style transfer without parallel data is using back-translation (Prabhumoye et al., 2018) with a de-noising auto-encoding objective (Logeswaran et al., 2018; Subramanian et al., 2018).

Regarding the tasks, sentiment transfer is one of the most widely studied problems. Transferring from informality to formality (Rao and Tetreault, 2018; Li et al., 2019) is another direction of text style transfer, aiming to change the style of a given sentence to more formal text. dos Santos et al. (2018) presented an approach to transferring offensive text to non-offensive text based on social network data. In Prabhumoye et al. (2018), the authors proposed the political slant transfer task. However, all these previous studies did not directly consider context-aware text style transfer, which is the main focus of this work.

2.2 Context-aware Text Generation

Our work is related to context-aware text generation (Mikolov and Zweig, 2012; Tang et al., 2016), which can be applied to many NLP tasks (Mangrulkar et al., 2018). For example, previous work has investigated language modeling with context information (Wang and Cho, 2015; Wang et al., 2017; Li et al., 2020), treating the preceding sentences as context.

[Figure 1 diagram: in the parallel-data path, the input sentence and its surrounding context are fed to a Sentence Encoder and a Context Encoder, whose outputs (together with the target style) drive the Sentence Decoder, trained with a Contextual Seq2Seq Loss and a Contextual Coherence Loss from a Coherence Classifier. In the non-parallel-data path, the same Sentence Encoder and Decoder are trained with a Reconstruction Loss (original style), a Style Classification Loss from a Style Classifier, and a Back-Translation Loss.]

Figure 1: Model architecture of the proposed CAST model for contextual text style transfer. Both training paths share the same sentence encoder and decoder. See Sec. 3 for details.

There are also studies on response generation for conversational systems (Sordoni et al., 2015b; Wen et al., 2015), where dialogue history is treated as context. Zang and Wan (2017) introduced a neural model to generate long reviews from aspect-sentiment scores given the topics. Vinyals and Le (2015) proposed a model to predict the next sentence given the previous sentences in a dialogue session. Sordoni et al. (2015a) presented a hierarchical recurrent encoder-decoder model to encode dialogue context. Our work is the first to explore context information in the text style transfer task.

3 Context-Aware Style Transfer

In this section, we first describe the problem definition and provide an overview of the model architecture in Section 3.1. Section 3.2 presents the proposed Context-Aware Style Transfer (CAST) model with supervised training objectives, and Section 3.3 further introduces how to augment the CAST model with non-parallel data in a hybrid approach.

3.1 Overview

Problem Definition  The problem of contextual text style transfer is defined as follows. A style-labelled parallel dataset P = {(x_i, l_i), (y_i, ˜l_i), c_i}_{i=1}^{M} includes: (i) the i-th instance containing the original sentence x_i with a style l_i, (ii) its corresponding rewritten sentence y_i in another style ˜l_i, and (iii) the paragraph context c_i. x_i and y_i are expected to encode the same semantic content, but in different language styles (i.e., l_i ≠ ˜l_i). The goal is to transform x_i in style l_i to y_i in style ˜l_i, while keeping y_i semantically coherent with its context c_i. In practice, labelled parallel data may be difficult to garner. Ideally, additional non-parallel data U = {(x_i, l_i)}_{i=1}^{N} can be leveraged to enhance model training.

Model Architecture  The architecture of the proposed CAST model is illustrated in Figure 1. The hybrid model training process consists of two paths, one for parallel data and the other for non-parallel data. In the parallel path, a Seq2Seq loss and a contextual coherence loss are included, for the joint training of two encoders (Sentence Encoder and Context Encoder) and the Sentence Decoder. The non-parallel path is designed to further enhance the Sentence Encoder and Decoder with three additional losses: (i) a self-reconstruction loss; (ii) a back-translation loss; and (iii) a style classification loss. The final training objective, uniting both parallel and non-parallel paths, is formulated as:

\mathcal{L}^{P,U}_{\text{final}} = \mathcal{L}^{P}_{c\text{-}s2s} + \lambda_1 \mathcal{L}^{P}_{\text{cohere}} + \lambda_2 \mathcal{L}^{U}_{\text{recon}} + \lambda_3 \mathcal{L}^{U}_{\text{btrans}} + \lambda_4 \mathcal{L}^{U}_{\text{style}} ,    (1)

where λ1, λ2, λ3 and λ4 are hyper-parameters to balance the different objectives. Each of these loss terms will be explained in the following subsections.
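To make the hybrid objective in Eqn. (1) concrete, the following is a minimal PyTorch-style sketch of one training step that mixes a parallel and a non-parallel mini-batch. The loss methods on `model` are hypothetical placeholders standing in for the terms defined in Sections 3.2 and 3.3; this is a sketch under those assumptions, not the released implementation.

```python
def hybrid_training_step(model, optimizer, parallel_batch, nonparallel_batch,
                         lambda1=1.0, lambda2=1.0, lambda3=1.0, lambda4=1.0):
    # Parallel-data path (Section 3.2): contextual Seq2Seq + contextual coherence losses.
    l_cs2s = model.contextual_seq2seq_loss(parallel_batch)        # hypothetical helper
    l_cohere = model.coherence_loss(parallel_batch)               # hypothetical helper
    # Non-parallel path (Section 3.3): reconstruction, back-translation, style classification.
    l_recon = model.reconstruction_loss(nonparallel_batch)        # hypothetical helper
    l_btrans = model.back_translation_loss(nonparallel_batch)     # hypothetical helper
    l_style = model.style_classification_loss(nonparallel_batch)  # hypothetical helper

    # Weighted sum of the five terms, as in Eqn. (1).
    loss = (l_cs2s + lambda1 * l_cohere + lambda2 * l_recon
            + lambda3 * l_btrans + lambda4 * l_style)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```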

3.2 Supervised Training Objectives

In this subsection, we discuss the training objectives associated with parallel data, consisting of: (i) a contextual Seq2Seq loss; and (ii) a contextual coherence loss.

Contextual Seq2Seq Loss  When parallel data is available, a Seq2Seq model can be directly learned for text style transfer. We denote the Seq2Seq model as (E, D), where the semantic representation of sentence x_i is extracted by the encoder E, and the decoder D aims to learn a conditional distribution of y_i given the encoded feature E(x_i) and style ˜l_i:

\mathcal{L}^{P}_{s2s} = - \mathbb{E}_{x_i, y_i \sim P} \left[ \log p_D(y_i \mid E(x_i), \tilde{l}_i) \right] .    (2)

However, in such a sentence-to-sentence style transfer setting, the context in the paragraph is ignored, which, if well utilized, could help improve generation quality such as paragraph-level topical coherence.

Thus, to take advantage of the paragraph context c_i, we use two separate encoders E_s and E_c to encode the sentence and the context independently. The outputs of the two encoders are combined via a linear layer, to obtain a context-aware sentence representation, which is then fed to the decoder to generate the target sentence. The model is trained to minimize the following loss:

\mathcal{L}^{P}_{c\text{-}s2s} = - \mathbb{E}_{x_i, c_i, y_i \sim P} \left[ \log p_D(y_i \mid E_s(x_i), E_c(c_i), \tilde{l}_i) \right] .    (3)

Compared with Eqn. (2), the use of E_c(c_i) makes the text style transfer process context-dependent. The generated sentence can be denoted as ỹ_i = D(E_s(x_i), E_c(c_i), ˜l_i).

Contextual Coherence Loss  To enforce contextual coherence (i.e., to ensure the generated sentence y_i aligns with the surrounding context c_i), we train a coherence classifier that judges whether c_i is the context of y_i, by adopting a language model with an objective similar to next sentence prediction (Devlin et al., 2019).

Specifically, assume that y_i is the t-th sentence of a paragraph p_i (i.e., y_i = p_i^{(t)}), and c_i = {p_i^{(0)}, ..., p_i^{(t-1)}, p_i^{(t+1)}, ..., p_i^{(T)}} is its surrounding context. We first reconstruct the paragraph p_i = {p_i^{(0)}, ..., p_i^{(T)}} by inserting y_i into the proper position in c_i, denoted as [c_i; y_i]. Based on this, we obtain a paragraph representation u_i via a language model encoder. Then, we apply a linear layer to the representation, followed by a tanh function and a softmax layer to predict a binary label s_i, which indicates whether c_i is the right context for y_i:

u_i = \mathrm{LM}([c_i; f(y_i)]) ,    (4)
p_{\mathrm{LM}}(s_i \mid c_i, y_i) = \mathrm{softmax}(\tanh(W u_i + b)) ,

where LM represents the language model encoder, and s_i = 1 indicates that c_i is the context of y_i. f(·) is a softmax function with temperature τ, where the logits are the predicted network output with a dimension of the vocabulary size. Note that since ỹ_i consists of discrete tokens that are non-differentiable, we use the continuous feature f(ỹ_i) that generates ỹ_i as the input of the language model. We construct paired data {y_i, c_i, s_i}_{i=1}^{N} for training the classifier, where the negative samples are created by replacing a sentence in a paragraph with another random sentence. After pre-training, the coherence classifier is used to obtain the contextual coherence loss:

\mathcal{L}^{P}_{\text{cohere}} = - \mathbb{E}_{x_i, c_i \sim P} \left[ \log p_{\mathrm{LM}}(s_i = 1 \mid c_i, f(\tilde{y}_i)) \right] .    (5)

Intuitively, minimizing L^P_cohere encourages ỹ_i to blend better with its context c_i. Note that the coherence classifier is pre-trained, and remains fixed during the training of the CAST model. The above coherence loss can be used to update the parameters of E_s, E_c and D during model training.
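Below is a minimal PyTorch sketch of the parallel-data path described above: two one-layer Transformer encoders whose outputs are fused by a linear layer and fed, together with a style embedding, to a Transformer decoder trained with the teacher-forced loss of Eqn. (3). The dimensions mirror the implementation details in Section 5.3, but the mean-pooling fusion, the way the style embedding is attached to the decoder memory, and the omission of positional encodings are simplifying assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareSeq2Seq(nn.Module):
    """Sketch of the parallel-data path: sentence encoder E_s, context encoder E_c,
    a linear fusion layer, and a Transformer decoder D conditioned on the target style."""

    def __init__(self, vocab_size, num_styles, d_model=256, nhead=4, d_ff=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.style_embed = nn.Embedding(num_styles, d_model)
        self.sent_encoder = nn.TransformerEncoder(                       # E_s
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=d_ff,
                                       batch_first=True), num_layers=1)
        self.ctx_encoder = nn.TransformerEncoder(                        # E_c
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=d_ff,
                                       batch_first=True), num_layers=1)
        self.fuse = nn.Linear(2 * d_model, d_model)                      # linear fusion layer
        self.decoder = nn.TransformerDecoder(                            # D
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=d_ff,
                                       batch_first=True), num_layers=1)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x, c, y_in, style):
        # x: (B, Tx) source tokens; c: (B, Tc) context tokens;
        # y_in: (B, Ty) shifted target tokens (teacher forcing); style: (B,) style ids.
        hx = self.sent_encoder(self.embed(x))                            # (B, Tx, d)
        hc = self.ctx_encoder(self.embed(c))                             # (B, Tc, d)
        # Mean-pool each encoder output and fuse with a linear layer; the paper only
        # states that the two outputs are combined "via a linear layer", so the pooling
        # choice is an assumption made for this sketch.
        fused = self.fuse(torch.cat([hx.mean(dim=1), hc.mean(dim=1)], dim=-1))  # (B, d)
        memory = torch.stack([fused, self.style_embed(style)], dim=1)    # (B, 2, d)
        causal = torch.triu(torch.full((y_in.size(1), y_in.size(1)), float("-inf"),
                                       device=y_in.device), diagonal=1)
        h = self.decoder(self.embed(y_in), memory, tgt_mask=causal)      # (B, Ty, d)
        return self.out(h)                                               # vocabulary logits

def contextual_seq2seq_loss(model, x, c, y_in, y_out, style, pad_id=0):
    """Eqn. (3): teacher-forced negative log-likelihood of the reference rewrite y."""
    logits = model(x, c, y_in, style)
    return F.cross_entropy(logits.transpose(1, 2), y_out, ignore_index=pad_id)
```

The coherence loss of Eqn. (5) would additionally pass the soft output distribution f(ỹ_i) through the frozen, pre-trained coherence classifier and add its negative log-likelihood to this objective.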
3.3 Unsupervised Training Objectives

For the contextual style transfer task, there are not many parallel datasets available with style-labeled paragraph pairs. To overcome the data sparsity issue, we propose a hybrid approach to leverage additional non-parallel data U = {(x_i, l_i)}_{i=1}^{N}, which are abundant and less expensive to collect. In order to fully exploit U to enhance the training of the Sentence Encoder and Decoder (E_s, D), we introduce three additional training losses, detailed below.

Reconstruction Loss  The reconstruction loss aims to encourage E_s and D to reconstruct the input sentence itself, if the desired style is the same as the input style. The corresponding objective is similar to Eqn. (2):

\mathcal{L}^{U}_{\text{recon}} = - \mathbb{E}_{x_i \sim U} \left[ \log p_D(x_i \mid E_s(x_i), l_i) \right] .    (6)

Compared to Eqn. (2), here we encourage the decoder D to recover x_i's original style properties as accurately as possible, given the style label l_i. The self-reconstructed sentence is denoted as x̂_i = D(E_s(x_i), l_i).

Back-Translation Loss  The back-translation loss requires the model to reconstruct the input sentence after a transformation loop. Specifically, the input sentence x_i is first transferred into the target style, i.e., x̃_i = D(E_s(x_i), ˜l_i). Then the generated target sentence is transferred back into its original style, i.e., x̂_i = D(E_s(x̃_i), l_i). The back-translation loss is defined as:

\mathcal{L}^{U}_{\text{btrans}} = - \mathbb{E}_{x_i \sim U,\; \tilde{x}_i \sim p_D(y_i \mid E_s(x_i), \tilde{l}_i)} \left[ \log p_D(x_i \mid E_s(\tilde{x}_i), l_i) \right] ,    (7)

where the source and target styles are denoted as l_i and ˜l_i, respectively.

Style Classification Loss  To further boost the model, we use U to train a classifier that predicts the style of a given sentence, and regularize the training of (E_s, D) with the pre-trained style classifier. The objective is defined as:

\mathcal{L}_{\text{style}} = - \mathbb{E}_{x_i \sim U} \left[ \log p_C(l_i \mid x_i) \right] ,    (8)

where p_C(·) denotes the style classifier. After the classifier is trained, we keep its parameters fixed, and apply it to update the parameters of (E_s, D). The resulting style classification loss utilizing the pre-trained style classifier is defined as:

\mathcal{L}^{U}_{\text{style}} = - \mathbb{E}_{x_i \sim U} \Big[ \mathbb{E}_{\hat{x}_i \sim p_D(\hat{x}_i \mid E_s(x_i), l_i)} \log p_C(l_i \mid \hat{x}_i) + \mathbb{E}_{\tilde{x}_i \sim p_D(\tilde{x}_i \mid E_s(x_i), \tilde{l}_i)} \log p_C(\tilde{l}_i \mid \tilde{x}_i) \Big] .    (9)
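As a sketch of how the three non-parallel losses can be wired up with the shared sentence encoder and decoder, the snippet below assumes hypothetical helpers `decode_without_context` (teacher-forced logits from E_s and D only), `greedy_generate` (hard decoding, standing in for the hard-sampling used in the paper), and `soft_generate` (per-step softmax distributions), plus a frozen pre-trained style classifier that accepts soft token distributions; none of these names come from the released implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(model, x_in, x_out, style, pad_id=0):
    """Eqn. (6): reconstruct x from E_s(x) and D when the requested style equals the input style."""
    logits = model.decode_without_context(x_in, style)               # (B, T, V)
    return F.cross_entropy(logits.transpose(1, 2), x_out, ignore_index=pad_id)

def back_translation_loss(model, x_in, x_out, src_style, tgt_style, pad_id=0):
    """Eqn. (7): transfer x into the target style, then translate it back to the
    original style and score the reconstruction of the original sentence."""
    with torch.no_grad():
        x_tilde = model.greedy_generate(x_in, tgt_style)              # intermediate sentence,
    logits = model.decode_without_context(x_tilde, src_style)        # treated as data here
    return F.cross_entropy(logits.transpose(1, 2), x_out, ignore_index=pad_id)

def style_classification_loss(model, style_clf, x_in, src_style, tgt_style):
    """Eqn. (9): the frozen style classifier scores the self-reconstructed sentence
    (original style) and the transferred sentence (target style); soft token
    distributions keep the objective differentiable w.r.t. the generator."""
    soft_recon = model.soft_generate(x_in, src_style)                 # (B, T, V) softmax outputs
    soft_trans = model.soft_generate(x_in, tgt_style)
    loss_recon = F.cross_entropy(style_clf(soft_recon), src_style)
    loss_trans = F.cross_entropy(style_clf(soft_trans), tgt_style)
    return loss_recon + loss_trans
```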

4 New Benchmarks

Existing text style transfer datasets, either parallel or non-parallel, do not contain contextual information, and are thus unsuitable for the contextual transfer task. To provide benchmarks for evaluation, we introduce two new datasets: Enron-Context and Reddit-Context, derived from two existing datasets - Enron (Klimt and Yang, 2004) and Reddit Politics (Serban et al., 2017).

1) Enron-Context  To build a formality transfer dataset with paragraph contexts, we randomly sampled emails from the Enron corpus (Klimt and Yang, 2004). After pre-processing and filtering with NLTK (Bird et al., 2009), we asked Amazon Mechanical Turk (AMT) annotators to identify informal sentences within each email, and rewrite them in a more formal style. Then, we asked a different group of annotators to verify if each rewritten sentence is more formal than the original sentence.

2) Reddit-Context  Another typical style transfer task is offensive vs. non-offensive, for which we collected another dataset from the Reddit Politics corpus (Serban et al., 2017). First, we identified offensive sentences in the original dataset with sentence-level classification. After filtering out extremely long/short sentences, we randomly selected a subset of sentences (10% of the whole dataset) and asked AMT annotators to rewrite each offensive sentence into two non-offensive alternatives.

After manually removing wrong or duplicate annotations, we obtained a total of 14,734 rewritten sentences for Enron-Context, and 23,158 for Reddit-Context. We also limited the vocabulary size by replacing words with a frequency of less than 20/70 in the Enron/Reddit datasets with a special unknown token. Table 1 provides the statistics of the two datasets. More details on AMT data collection are provided in the Appendix.

5 Experiments

In this section, we compare our model with state-of-the-art baselines on the two new benchmarks, and provide both quantitative analysis and human evaluation to validate the effectiveness of the proposed CAST model.

5.1 Datasets and Baselines

In addition to the two new parallel datasets, we also leverage non-parallel datasets for CAST model training. For formality transfer, one choice is Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018), crawled and annotated from two domains in Yahoo Answers. This corpus contains paired informal-formal sentences without context. We randomly selected a subset of sentences (28,375/29,774 formal/informal) from the GYAFC dataset as our training dataset. For offensiveness transfer, we utilize the Reddit dataset. Following dos Santos et al. (2018), we used a pre-trained classifier to extract 53,028/53,714 offensive/non-offensive sentences from Reddit posts as our training dataset.

Table 2 provides the statistics of the parallel and non-parallel datasets used for the two style transfer tasks. For the non-parallel datasets, we split them into two: one for CAST model training ('Train'), and the other for the style classifier pre-training.

Dataset          # sent.   # rewritten sent.   # words per sent.   # words per paragraph   # vocabulary
Reddit-Context   14,734    14,734              9.4                 38.5                    4,622
Enron-Context    23,158    25,259              7.6                 25.9                    2,196

Table 1: Statistics on Enron-Context and Reddit-Context datasets.

Formality Transfer
  Non-parallel   Train   Style classifier     Parallel         Train   Dev    Test   Coherence classifier
  GYAFC          58k     12k                  Enron-Context    13k     0.5k   1k     2.5k

Offensiveness Transfer
  Non-parallel   Train   Style classifier     Parallel         Train   Dev    Test   Coherence classifier
  REDDIT         106k    15k                  Reddit-Context   22k     0.5k   1k     3.5k

Table 2: Statistics of the parallel and non-parallel datasets on two text style transfer tasks.

Similarly, for the parallel datasets, the training sets are divided into two as well, for the training of CAST ('Train/Dev/Test') and the coherence classifier, respectively.

We compare the CAST model with several baselines: (i) Seq2Seq: a Transformer-based Seq2Seq model (Eqn. (2)), taking sentences as the only input, trained on parallel data only; (ii) Contextual Seq2Seq: a Transformer-based contextual Seq2Seq model (Eqn. (3)), taking both context and sentence as input, trained on parallel data only; (iii) Hybrid Seq2Seq (Xu et al., 2019): a Seq2Seq model leveraging both parallel and non-parallel data; (iv) ControlGen (Hu et al., 2017, 2018): a state-of-the-art text style transfer model using non-parallel data; (v) MulAttGen (Subramanian et al., 2018): another state-of-the-art style transfer model that allows flexible control over multiple attributes.

5.2 Evaluation Metrics

The contextual style transfer task requires a model to generate sentences that: (i) preserve the original semantic content and structure of the source sentence; (ii) conform to the pre-specified style; and (iii) align with the surrounding context in the paragraph. Thus, we consider the following automatic metrics for evaluation:

Content Preservation.  We assess the degree of content preservation during transfer by measuring BLEU scores (Papineni et al., 2002) between generated sentences and human references. Following Rao and Tetreault (2018), we also use GLEU as an additional metric for the formality transfer task, which was originally introduced for the grammatical error correction task (Napoles et al., 2015). For offensiveness transfer, we include perplexity (PPL) as used in dos Santos et al. (2018), which is computed by a word-level LSTM language model pre-trained on non-offensive sentences.

Style Accuracy.  Similar to prior work, we measure style accuracy using the prediction accuracy of the pre-trained style classifier over generated sentences (Acc.).

Context Coherence.  We use the prediction accuracy of the pre-trained coherence classifier to measure how well a generated sentence matches its surrounding context (Coherence).

The evaluation classifiers are trained separately from those used to train CAST, following dos Santos et al. (2018). For formality transfer, the style classifier and coherence classifier reach 91.35% and 86.78% accuracy, respectively, on the pre-training dataset. For offensiveness transfer, the accuracies are 93.47% and 84.96%.
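The BLEU and style-accuracy metrics above can be computed as in the sketch below; sacrebleu is used only as an illustrative choice of BLEU implementation (the paper does not name a specific tool), and `tokenizer`/`style_clf` are hypothetical stand-ins for the pre-trained evaluation classifier and its preprocessing.

```python
import torch
import sacrebleu

def content_preservation_bleu(hypotheses, references):
    """Corpus-level BLEU between generated sentences and human references."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

def style_accuracy(style_clf, tokenizer, hypotheses, target_style_id, device="cpu"):
    """Fraction of generated sentences that the pre-trained style classifier
    assigns to the requested target style (the Acc. column)."""
    correct = 0
    with torch.no_grad():
        for sent in hypotheses:
            ids = torch.tensor([tokenizer(sent)], device=device)   # (1, T) token ids
            pred = style_clf(ids).argmax(dim=-1).item()
            correct += int(pred == target_style_id)
    return correct / len(hypotheses)
```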

5.3 Implementation Details

The context encoder, sentence encoder and sentence decoder are all implemented as a one-layer Transformer with 4 heads. The hidden dimension of one head is 256, and the hidden dimension of the feed-forward sub-layer is 1024. The context encoder takes a maximum of 50 words from the surrounding context of the target sentence. For the style classifier, we use a standard CNN-based sentence classifier (Kim, 2014).

Since the non-parallel corpus U contains more samples than the parallel one P, we down-sample U to assign each mini-batch the same number of parallel and non-parallel samples to balance training, alleviating the 'catastrophic forgetting problem' described in Howard and Ruder (2018).
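A minimal sketch of the down-sampling described above - every mini-batch draws the same number of parallel and non-parallel examples - is shown below; the data structures are illustrative, not the authors' pipeline.

```python
import random

def balanced_batches(parallel_data, nonparallel_data, batch_size, num_steps):
    """Yield (parallel, non-parallel) mini-batch pairs with an equal number of
    examples from each corpus, effectively down-sampling the larger corpus U."""
    half = batch_size // 2
    for _ in range(num_steps):
        p_batch = random.sample(parallel_data, half)
        u_batch = random.sample(nonparallel_data, half)
        yield p_batch, u_batch
```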

Model                Formality Transfer                     Offensiveness Transfer
                     Acc.    Coherence   BLEU    GLEU       Acc.    Coherence   BLEU    PPL
Seq2Seq              64.05   78.09       24.16   10.46      83.05   80.28       17.22   140.39
Contextual Seq2Seq   64.28   81.25       23.72   10.37      83.42   81.69       18.74   138.42
Hybrid Seq2Seq       65.09   79.62       24.35   10.93      83.28   84.87       20.78   107.12
ControlGen           62.18   73.66       14.32   8.72       82.15   78.81       10.44   92.14
MulAttGen            63.36   72.97       15.14   8.91       82.71   78.45       11.03   92.56
CAST                 68.04   85.47       26.38   15.06      88.45   85.98       23.92   93.03

Table 3: Quantitative evaluation results of different models on two style transfer tasks.

Task: informal to formal transfer

Example A
  Context:    I’ll call him back to a meeting. [Input]. I asked him what sort of deals they’re working on.
  Input:      I’m assuming that you’d set up be part of that meeting ?
  ControlGen: I’m guessing that you would be set up that call ?
  MulAttGen:  I’m guessing that you would be set up that meeting ?
  C-Seq2Seq:  I am assuming that you would part of that person .
  H-Seq2Seq:  I am assuming that you would be part of that party ?
  CAST:       Am I correct to assume that you would attend that meeting?

Example B
  Context:    Thanks . Can someone let the C/P know that the deals are good ? [Input]. If not deal confirmations could but they need the deal details .
  Input:      Do y’all interface with C/P .
  ControlGen: Do you compete with them ?
  MulAttGen:  Do you interface with them ?
  C-Seq2Seq:  Do we interface with them ?
  H-Seq2Seq:  Do we interface with them ?
  CAST:       Do you all interface with C/P?

Task: offensive to non-offensive transfer

Example C
  Context:    With the glasses , [Input]. I don’t need them because I never read . How do i look ?
  Input:      You are ugly .
  ControlGen: You bad guy !
  MulAttGen:  You are sad .
  C-Seq2Seq:  Have a bad day .
  H-Seq2Seq:  What a bad day !
  CAST:       You look not good .

Table 4: Examples from the two datasets, where [Input] marks the position of the sentence to be transferred within its surrounding context (C-Seq2Seq: Contextual Seq2Seq; H-Seq2Seq: Hybrid Seq2Seq).

We train the model using the Adam optimizer with a mini-batch size of 64 and a learning rate of 0.0005. The validation set is used to select the best hyper-parameters. Hard-sampling (Logeswaran et al., 2018) is used to back-propagate the loss through discrete tokens from the pre-trained classifiers to the model.

For the ControlGen (Hu et al., 2017) baseline, we use the code provided by the authors with their default hyper-parameter settings. For Hybrid Seq2Seq (Xu et al., 2019) and MulAttGen (Subramanian et al., 2018), we re-implement their models following the original papers.

5.4 Experimental Results

Formality Transfer  Results on the formality transfer task are summarized in Table 3. The CAST model achieves better performance than all the baselines. In particular, CAST is able to boost the GLEU and Coherence scores by a large margin. Hybrid Seq2Seq also achieves good performance by utilizing non-parallel data. By incorporating context information, Contextual Seq2Seq also improves over the vanilla Seq2Seq model. As expected, ControlGen does not perform well, since only non-parallel data is used for training.

Offensiveness Transfer  Results are summarized in Table 3. CAST achieves the best performance on all the metrics except for PPL. In terms of Coherence, Contextual Seq2Seq and CAST, which leverage context information, achieve better performance than the Seq2Seq baseline. Contextual Seq2Seq also improves BLEU, which is different from the observation in the formality transfer task. On PPL, CAST produces slightly worse performance than ControlGen and MulAttGen. We hypothesize that this is because our model tends to use the same non-offensive word to replace an offensive word, producing some untypical sentences, as discussed in dos Santos et al. (2018).

Qualitative Analysis  Table 4 presents some generation examples from different models. We observe that CAST is better at replacing informal words with formal ones (Examples B and C), and generates more context-aware sentences (Examples A and C), possibly due to the use of the coherence and style classifiers.

Model                    Formality Transfer                     Offensiveness Transfer
                         Acc.    Coherence   BLEU    GLEU       Acc.    Coherence   BLEU    PPL
CAST                     68.04   85.47       26.38   15.06      88.45   85.98       23.92   93.03
w/o context encoder      65.35   82.9        23.98   14.17      84.15   80.96       20.54   127.02
w/o cohere. classifier   65.47   80.16       14.82   14.45      85.11   79.37       21.97   115.57
w/o both                 62.19   74.47       15.88   10.46      72.69   78.15       13.14   147.31
w/o non-parallel data    60.19   75.49       13.5    9.88       70.84   78.72       10.53   151.08

Table 5: Ablation study of CAST on two style transfer tasks.

Task                     Aspects                CAST vs. Contextual Seq2Seq   CAST vs. Hybrid Seq2Seq   CAST vs. ControlGen
                                                win    lose   tie             win    lose   tie          win    lose   tie
Formality Transfer       Style Control          57.1   28.3   14.6            46.9   26.1   28.0         72.1   12.6   25.3
                         Content Preservation   59.7   22.1   18.2            50.4   20.8   28.2         68.8   14.5   17.7
                         Context Consistence    56.4   23.1   20.5            51.5   19.7   28.8         70.1   10.6   19.3
Offensiveness Transfer   Style Control          58.6   25.3   16.1            50.1   29.2   20.3         54.8   19.9   25.3
                         Content Preservation   62.3   26.5   11.2            54.0   17.5   28.5         53.1   30.2   16.7
                         Context Consistence    60.1   32.4   17.5            55.3   24.9   20.8         58.1   35.8   16.7

Table 6: Results of pairwise human evaluation between CAST and three baselines on two style transfer tasks. Win/lose/tie indicate the percentage of results generated by CAST being better/worse/equal to the reference model.

We also observe that the exploitation of context information can help the model preserve the semantic content of the original sentence (Example B).

Ablation Study  To investigate the effectiveness of each component of the CAST model, we conduct detailed ablation studies and summarize the results in Table 5. Experiments show that the context encoder and the coherence classifier play an important role in the proposed model. The context encoder is able to improve content preservation and style transfer accuracy, demonstrating the effectiveness of using context. The coherence classifier helps improve the coherence score, but not much the style accuracy. By using these two components, our model can strike a proper balance between translating to the correct style and maintaining contextual consistency. When both of them are removed (the 4th row), performance on all the metrics drops significantly. We also observe that without using non-parallel data, the model performs poorly, showing the benefit of using a hybrid approach and more data for this task.

Human Evaluation  Considering the subjective nature of this task, we conduct human evaluation to judge model outputs regarding content preservation, style control and context consistency. Given an original sentence along with its corresponding context and a pair of generated sentences from two different models, AMT workers were asked to select the best one based on these three aspects. The AMT interface also allows a neutral option, if the worker considers both sentences equally good in a certain aspect. We randomly sampled 200 sentences from the test set, and collected three human responses for each pair. Table 6 reports the pairwise comparison results on both tasks. Based on human judgment, the quality of sentences transferred by CAST is significantly higher than that of the other methods across all three metrics. This is consistent with the experimental results on the automatic metrics discussed earlier.

6 Conclusion

In this paper, we present a new task - Contextual Text Style Transfer. Two new benchmark datasets are introduced for this task, which contain annotated sentence pairs accompanied by paragraph context. We also propose a new CAST model, which can effectively enforce content preservation and context coherence, by exploiting abundant non-parallel data in a hybrid approach. Quantitative and human evaluations demonstrate that the CAST model significantly outperforms baseline methods that do not consider context information. We believe our model takes a first step towards modeling context information for text style transfer, and we will explore more advanced solutions, e.g., using a better encoder/decoder like GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2019), adversarial learning (Zhu et al., 2020), or knowledge distillation (Chen et al., 2019).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2019. Distilling the knowledge of BERT for text generation. CoRR, abs/1911.03829.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In AAAI.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NeurIPS.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.

Zhiting Hu, Haoran Shi, Zichao Yang, Bowen Tan, Tiancheng Zhao, Junxian He, Wentao Wang, Lianhui Qin, Di Wang, et al. 2018. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.

Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. Shakespearizing modern language using copy-enriched sequence-to-sequence models. arXiv preprint arXiv:1707.01161.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Bryan Klimt and Yiming Yang. 2004. Introducing the Enron corpus. In CEAS.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In ICLR.

Dianqi Li, Yizhe Zhang, Zhe Gan, Yu Cheng, Chris Brockett, Ming-Ting Sun, and Bill Dolan. 2019. Domain adaptive text style transfer. arXiv preprint arXiv:1908.09395.

Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, and Bill Dolan. 2020. Contextualized perturbation for textual adversarial attack. arXiv preprint arXiv:2009.07502.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In NAACL.

Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. 2018. Content preserving text generation with attribute controls. In NeurIPS.

Sourab Mangrulkar, Suhani Shrivastava, Veena Thenkanidiyoor, and Dileep Aroor Dinesh. 2018. A context-aware convolutional natural language generation model for dialogue systems. In SIGDIAL.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In IEEE Spoken Language Technology Workshop (SLT).

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In ACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. arXiv preprint arXiv:1804.09000.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In NAACL.

Cicero Nogueira dos Santos, Igor Melnyk, and Inkit Padhi. 2018. Fighting offensive language on social media with unsupervised text style transfer. In ACL.

Iulian Vlad Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Mudumba, Alexandre de Brébisson, Jose Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, and Yoshua Bengio. 2017. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349.

Mingyue Shang, Piji Li, Zhenxin Fu, Lidong Bing, Dongyan Zhao, Shuming Shi, and Rui Yan. 2019. Semi-supervised text style transfer: Cross projection in latent space. In EMNLP-IJCNLP.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In NeurIPS.

Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015a. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In CIKM.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. In NAACL.

Sandeep Subramanian, Guillaume Lample, Eric Michael Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2018. Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552.

Jian Tang, Yifan Yang, Samuel Carton, Ming Zhang, and Qiaozhu Mei. 2016. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2017. Topic compositional neural language model. arXiv preprint arXiv:1712.09783.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

Jingjing Xu, Xu Sun, Qi Zeng, Xuancheng Ren, Xiaodong Zhang, Houfeng Wang, and Wenjie Li. 2018. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. arXiv preprint arXiv:1805.05181.

Ruochen Xu, Tao Ge, and Furu Wei. 2019. Formality style transfer with hybrid textual annotations. arXiv preprint arXiv:1903.06353.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In NeurIPS.

Hongyu Zang and Xiaojun Wan. 2017. Towards automatic generation of product reviews from aspect-sentiment scores. In Proceedings of the 10th International Conference on Natural Language Generation.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.
