Xiaomingbot: a Multilingual Robot News Reporter
Runxin Xu1,∗, Jun Cao2, Mingxuan Wang2, Jiaze Chen2, Hao Zhou2, Ying Zeng2, Yuping Wang2, Li Chen2, Xiang Yin2, Xijin Zhang2, Songcheng Jiang2, Yuxuan Wang2, and Lei Li2,†
1 School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2 ByteDance AI Lab, Shanghai, China
[email protected]
caojun.sh, wangmingxuan.89, chenjiaze, zhouhao.nlp, zengying.ss, wangyuping, chenli.cloud, yinxiang.stephen, zhangxijin, jiangsongcheng, wangyuxuan.11, [email protected]

∗ The work was done while the author was an intern at ByteDance AI Lab.
† Corresponding author.

Abstract

This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation. Its system summarizes Chinese news that it automatically generates from data tables. Next, it translates the summary or the full article into multiple languages, and reads the multilingual rendition through synthesized speech. Notably, Xiaomingbot utilizes a voice cloning technology to synthesize the speech trained from a real person's voice data in one input language. The proposed system enjoys several merits: it has an animated avatar, and is able to generate and read multilingual news. Since it was put into practice, Xiaomingbot has written over 600,000 articles, and gained over 150,000 followers on social media platforms.

1 Introduction

The emergence of automated news reporting as a research topic has seen the development and deployment of several robot news reporters with various capabilities. Technological improvements in modern natural language generation have further enabled automatic news writing in certain areas. For example, GPT-2 is able to create fairly plausible stories (Radford et al., 2019). Bayesian generative methods have been able to create descriptions or advertisement slogans from structured data (Miao et al., 2019; Ye et al., 2020). Summarization technology has been exploited to produce reports on sports news from human commentary text (Zhang et al., 2016).

While very promising, most previous robot reporters and machine writing systems have limited capabilities, producing reports on sports news that focus only on text generation. We argue in this paper that an intelligent robot reporter should acquire the following capabilities to be truly user friendly: a) it should be able to create news articles from input data; b) it should be able to read the articles with lifelike character animation like in TV broadcasting; and c) it should be multilingual to serve global users. None of the existing robot reporters is able to display performance on these tasks that matches that of a human reporter. In this paper, we present Xiaomingbot, a robot news reporter capable of news writing, summarization, translation, reading, and visual character animation. To our knowledge, it is the first multilingual and multimodal AI news agent. Hence, the system shows great potential for large scale industrial applications.

Figure 1 shows the capabilities and components of the proposed Xiaomingbot system. It includes four components: a) a news generator, b) a news translator, c) a cross-lingual news reader, and d) an animated avatar. The text generator takes input information from data tables and produces articles in natural languages. Our system targets news areas with available structured data, such as sports games and financial events. The fully automated news generation function is able to write and publish a story within mere seconds after the event took place, and is therefore much faster than manual writing. The system also uses a pretrained text summarization technique to create summaries for users to skim through. Xiaomingbot can also translate news so that people from different countries can promptly understand the general meaning of an article. Xiaomingbot is equipped with a cross-lingual voice reader that can read the report in different languages in the same voice. It is worth mentioning that Xiaomingbot excels at voice cloning. It is able to learn a person's voice from audio samples as short as two hours, and to maintain precise consistency in using that voice even when reading in different languages. In this work, we recorded 2 hours of Chinese voice data from a female speaker, and Xiaomingbot learnt to speak in English and Japanese with the same voice. Finally, the animation module produces an animated cartoon avatar with lip and facial expression synchronized to the text and voice. It also generates the full body with animated cloth texture. The demo video is available at https://www.youtube.com/watch?v=zNfaj_DV6-E. The home page is available at https://xiaomingbot.github.io.

Figure 1: Xiaomingbot System Architecture. [Four components and their modules: News Generation (Data-To-Text Generation, Text Summarization); News Translation (Neural Machine Translation); News Reading (Text-To-Speech Synthesis, Cross-lingual Voice Cloning); Avatar Animation (Lip Syncing, Body-cloth Render).]

The system has the following advantages: a) It produces timely news reports for certain areas and is multilingual. b) By adding a voice cloning model to Xiaomingbot's neural cross-lingual voice reader, we allow it to learn a voice in different languages from only a few examples. c) For a better user experience, we also apply a cross-lingual visual rendering model, which generates synthesized lip syncing consistent with the generated voice. d) Xiaomingbot has been put into practice, has produced over 600,000 articles, and has gained over 150,000 followers on social media platforms.

2 System Architecture

The Xiaomingbot system includes four components working together in a pipeline, as shown in Figure 1. The system receives input from a data table containing event records, which, depending on the domain, can be either a sports game with time-line information, or a financial piece such as stock market tracking. The final output is an animated avatar reading the news article with a synthesized voice. Figure 2 illustrates an example of our Xiaomingbot system. First, the text generation model generates a piece of sports news. Then, as is shown on the top of the figure, the text summarization module trims the produced news into a summary, which can be read by users who prefer a condensed abstract instead of the whole news. Next, the machine translation module translates the summary into the language that the user specifies, as illustrated on the bottom right of the figure. Relying on the text to speech (TTS) module, Xiaomingbot can read both the summary and its translation in different languages using the same voice. Finally, the system can visualize an animated character with synchronized lip motion and facial expression, as well as lifelike body and clothing.

Figure 2: User Interface of Xiaomingbot. On the left is a piece of sports news, which is generated from a Table2Text model. On the top is the text summarization result. On the bottom right corner, Xiaomingbot produces the corresponding speech and visual effects. [Panel labels: Generated News; Summary (Text Summarization); Translation, Speech, Animation (Machine Translation, Text-To-Speech, Avatar Animation).]

3 News Generation

In this section, we will first describe the automated news generation module, followed by the news summarization component.

3.1 Data-To-Text Generation

Our proposed Xiaomingbot is targeted at writing news for domains with structured input data, such as sports and finance. To generate reasonable text, several methods have been proposed (Miao et al., 2019; Sun et al., 2019; Ye et al., 2020). However, since it is difficult to generate correct and reliable content through most of these methods, we employ a template-based table2text technique to write the articles.

Table 1 illustrates one example of soccer game data and its generated sentences. In the example, Xiaomingbot retrieved the tabled data of a single sports game with time-lines and events, as well as statistics for each player's performance. The data table contains time, event type (scoring, foul, etc.), player, team name, and possible additional attributes. Using these tabulated data, we integrated and normalized the key-value pairs from the table. We can also obtain processed key-value pairs such as “Winning team”, “Lost team”, “Winning Score”, and use a template-based method to generate news

• In-match Description. It describes the most important events in the game, such as “someone scored a goal” or “someone received a yellow card”.

• Post-match Summary. It is a brief summary of the game, which also includes predictions of the progress of the subsequent matches.

3.2 Text Summarization

For users who prefer a condensed summary of the report, Xiaomingbot can provide a short gist version using a pre-trained text summarization model. We choose to use this model instead of generating the summary directly from the table data because the former can create more general content, and can be employed to process manually written reports as well. There are two approaches to summarizing a text: extractive and abstractive summarization. Extractive summarization trains a sentence selection model to pick the important sentences from an input article, while abstractive summarization further rephrases the sentences and explores the potential for combining multiple sentences into a simplified one.

We trained two summarization models. One is a general text summarization model using a BERT-based sequence labelling network. For training, we use the TTNews dataset, a Chinese single-document summarization dataset from the NLPCC 2017 and 2018 shared tasks (Hua et al., 2017; Li and Wan, 2018). It includes 50,000 Chinese documents with human-written summaries. The article is separated into a sequence of sentences, and the BERT-based summarization model outputs a 0-1 label for each sentence. In addition, for soccer news, we trained a special summarization model based on the commentary-to-summary technique (Zhang et al., 2016).
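The two steps described in this section, filling templates from normalized table rows and then keeping the sentences labelled 1, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Event` fields, the template wording, and the helper functions are hypothetical and do not come from the Xiaomingbot implementation, and the 0-1 labels are hard-coded where the described system would obtain them from its trained sequence labelling model.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One row of a hypothetical event table (field names are assumptions)."""
    minute: int
    event_type: str  # e.g. "goal" or "yellow_card"
    player: str
    team: str


# Hypothetical templates, one per event type; the real templates are not shown in the text.
TEMPLATES = {
    "goal": "In minute {minute}, {player} scored a goal for {team}.",
    "yellow_card": "In minute {minute}, {player} of {team} received a yellow card.",
}


def generate_sentences(events):
    """Fill one template per event row, in time order."""
    ordered = sorted(events, key=lambda e: e.minute)
    return [
        TEMPLATES[e.event_type].format(minute=e.minute, player=e.player, team=e.team)
        for e in ordered
    ]


def select_by_labels(sentences, labels):
    """Keep the sentences whose 0-1 label is 1, preserving article order."""
    if len(sentences) != len(labels):
        raise ValueError("need exactly one label per sentence")
    return [s for s, keep in zip(sentences, labels) if keep == 1]


events = [
    Event(67, "goal", "Player B", "Team Y"),
    Event(12, "goal", "Player A", "Team X"),
    Event(30, "yellow_card", "Player C", "Team Y"),
]
article = generate_sentences(events)

# Stand-in labels: a trained labelling model would predict these, one per sentence.
labels = [1, 0, 1]
summary = " ".join(select_by_labels(article, labels))
print(summary)
```

Keeping generation and selection as separate functions mirrors the point made above that the summarizer works on any article text, including manually written reports, rather than on the table data directly.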