Fine-grained Coordinated Cross-lingual Text Stream Alignment for Endless Language Knowledge Acquisition

Tao Ge 1,2, Qing Dou 3, Heng Ji 4, Lei Cui 2, Baobao Chang 1, Zhifang Sui 1, Furu Wei 2 and Ming Zhou 2
1 MOE Key Laboratory of Computational Linguistics, Peking University, Beijing, 100871, China
2 Microsoft Research Asia, Beijing, 100080, China
3 Facebook, CA, 94025, USA
4 Rensselaer Polytechnic Institute, NY, 12180, USA
{tage,lecu,fuwei,[email protected]
[email protected], [email protected], {chbb,[email protected]

Abstract

This paper proposes to study fine-grained coordinated cross-lingual text stream alignment through a novel information network decipherment paradigm. We use Burst Information Networks as media to represent text streams and present a simple yet effective network decipherment algorithm with diverse clues to decipher the networks for accurate text stream alignment. Experiments on Chinese-English news streams show that our approach not only outperforms previous approaches on bilingual lexicon extraction from coordinated text streams but also can harvest high-quality alignments from large amounts of streaming data for endless language knowledge mining, which makes it a promising new paradigm for automatic language knowledge acquisition.

Figure 1: Knowledge derived from fine-grained cross-lingual text stream alignments: (a) word translations; (b) polysemy/multi-references; (c) synonym/co-reference; (d) entity phrases

1 Introduction

Coordinated text streams (Wang et al., 2007) refer to text streams that are topically related and indexed by the same set of time points. Previous studies (Wang et al., 2007; Hu et al., 2012) on coordinated text streams focus on discovering and aligning common topic patterns across languages. Despite their contributions to applications like cross-lingual information retrieval and topic analysis, such a coarse-grained topic-level alignment framework inevitably overlooks much useful fine-grained alignment knowledge. For example, Figure 1 shows typical knowledge that can be derived from fine-grained Chinese-English text stream alignments. In addition to (a) bilingual word translations, we can also discover (b) polysemous and multi-referential words if one Chinese word is aligned to multiple English words, (c) synonymous and co-referential word pairs if two Chinese words are aligned to the same English word, and (d) entity phrases (e.g., 阿布扎比 "Abu Dhabi" in Figure 1) if adjacent Chinese words in text are aligned to the same English named entity.

In order to acquire language knowledge for Natural Language Processing (NLP) applications, we study fine-grained cross-lingual text stream alignment. Instead of directly turning massive, unstructured data streams into structured knowledge (D2K), we adopt a new Data-to-Network-to-Knowledge (D2N2K) paradigm, based on the following observations: (i) most information units are not independent; instead, they are interconnected or interacting, forming massive networks; (ii) if information networks can be constructed across multiple languages, they may bring tremendous power to make knowledge mining algorithms more scalable and effective, because we can employ the graph structures to acquire and propagate knowledge.

Based on these motivations, we employ a promising text stream representation – Burst Information Networks (BINets) (Ge et al., 2016a), which can be easily constructed without rich language resources, as media to display the most important information units and illustrate their connections in the text streams.
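Constructing a BINet requires little more than burst detection over word counts. As an illustrative sketch only, a minimal two-state batch variant in the spirit of the Kleinberg (2003) burst detection algorithm used for BINets in (Ge et al., 2016a) might look as follows; the parameters s and gamma and the binomial state costs are common simplified choices, not necessarily the paper's exact configuration:

```python
import math

def kleinberg_bursts(r, d, s=2.0, gamma=1.0):
    """Two-state burst detection in the spirit of Kleinberg (2003).

    r[t]: number of documents at time t containing the target word
    d[t]: total number of documents at time t
    Returns a 0/1 flag per time step (1 = burst state).
    """
    T = len(r)
    p0 = sum(r) / sum(d)            # base emission rate
    p1 = min(0.9999, s * p0)        # elevated burst-state rate
    if p1 <= p0:
        return [0] * T

    def log_choose(n, k):
        return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    def cost(p, rt, dt):            # -log P(rt out of dt | rate p)
        return -(log_choose(dt, rt) + rt * math.log(p) + (dt - rt) * math.log(1 - p))

    trans = gamma * math.log(T)     # cost of entering the burst state
    # Viterbi over the two states (0 = base, 1 = burst)
    best = [cost(p0, r[0], d[0]), trans + cost(p1, r[0], d[0])]
    back = []                       # back-pointers for the optimal path
    for t in range(1, T):
        to0 = (best[0], best[1])            # leaving a burst is free
        to1 = (best[0] + trans, best[1])    # entering a burst costs `trans`
        back.append((0 if to0[0] <= to0[1] else 1,
                     0 if to1[0] <= to1[1] else 1))
        best = [min(to0) + cost(p0, r[t], d[t]),
                min(to1) + cost(p1, r[t], d[t])]
    state = 0 if best[0] <= best[1] else 1
    states = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        states.append(state)
    return states[::-1]
```

A word's burst periods are then the maximal runs of 1s; for instance, `kleinberg_bursts([1, 1, 1, 1, 20, 25, 22, 1, 1, 1], [100] * 10)` marks steps 4–6 as one burst period, which, paired with the word, would form one BINet node.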
With the BINet representation, we propose a simple yet effective network decipherment algorithm for aligning cross-lingual text streams, which can take advantage of the co-burst characteristic of cross-lingual text streams and easily incorporate prior knowledge and rich clues for fast and accurate network decipherment.

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2496–2506, Brussels, Belgium, October 31 – November 4, 2018. © 2018 Association for Computational Linguistics

Figure 2: (A part of) Burst Information Networks built from Chinese and English news streams.

For example, in Figure 2, each node in a BINet is a bursty word with one of its burst periods, representing an important information unit in a text stream. To decipher the Chinese BINet, our approach first focuses on the nodes in the English BINet in Figure 2 as the candidates because they co-burst with the Chinese nodes.
Then, we decipher some nodes based on prior knowledge (the green node), the pronunciation similarity clue (the orange nodes) or the literal translation similarity clue (the blue node). These deciphered nodes will serve as neighbor clues to decipher their adjacent nodes (the red node), which will then be used for further decipherment (e.g., deciphering the yellow node) through knowledge propagation across the network, as the dashed arrows in Figure 2 show.

Experiments on Chinese-English coordinated news streams show that our approach can accurately align nodes across the cross-lingual BINets and derive various knowledge, and that with more streaming data provided, we can harvest more high-quality alignments and thus derive more knowledge. By aligning endless text streams, our approach is promising for never-ending language knowledge mining, which can not only complement language resources but also benefit some NLP applications.

The main contributions of this paper are:

• We propose a promising framework to mine knowledge from inexhaustible coordinated cross-lingual text streams through fine-grained alignment, exploring a paradigm for language knowledge acquisition.
• We propose a network decipherment approach for text stream alignment, which can work in both low- and rich-resource settings and outperforms previous approaches.
• We release our data (annotations) and systems to guarantee reproducibility and help future work improve on this task.

2 Burst Information Network

A Burst Information Network (BINet) is a graph-based text stream representation and has proven effective for multiple text stream mining tasks (Ge et al., 2016a,b,c). In contrast to many information networks (e.g., (Ji, 2009; Li et al., 2014)), BINets are designed specifically for text streams. They focus on the burst information units, which are usually related to important events or trending topics in text streams, and illustrate their connections.

A BINet is originally defined as G = ⟨V, E, ω⟩ in (Ge et al., 2016a). Each node v ∈ V is a burst element, defined as a burst word¹ during one of its burst periods: ⟨w, P⟩, where w denotes a word and P denotes one consecutive burst period of w, as Figure 2 shows. Each edge in E indicates the connection between two burst elements, with the weight ω defined as the number of documents where the two burst elements co-occur in the text stream.
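Concretely, the base definition G = ⟨V, E, ω⟩ can be sketched as a small data structure. This is our own minimal illustration (the class and method names are not from the paper): nodes are ⟨word, burst period⟩ pairs, and each edge weight counts the documents in which its two burst elements co-occur.

```python
from collections import defaultdict
from itertools import combinations

class BINet:
    """Sketch of G = <V, E, w>: nodes are burst elements (word, (start, end));
    an edge's weight is the number of documents in which its two burst
    elements co-occur."""

    def __init__(self):
        self.nodes = set()              # V
        self.weight = defaultdict(int)  # weighted E; keys are sorted node pairs

    def add_document(self, burst_elements):
        """Update the network with the burst elements found in one document."""
        elems = sorted(set(burst_elements))
        self.nodes.update(elems)
        for u, v in combinations(elems, 2):
            self.weight[(u, v)] += 1    # one more co-occurring document
```

Feeding every document of the stream through `add_document` yields the weighted network; the binary indicator π of the extended definition would additionally track how often two words are adjacent as a bigram.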
In this paper, we extend the BINet definition to G = ⟨V, E, ω, π⟩ by adding a binary indicator π that indicates whether two nodes (i.e., burst elements) are frequently (more than 5 times) adjacent (as a bigram) in text, for mining knowledge such as the entity phrases in Figure 1(d).

¹ Burst words and their corresponding burst periods can be detected with the Kleinberg burst detection algorithm (Kleinberg, 2003), as (Ge et al., 2016a) did.

3 Decipherment

After constructing a BINet from a foreign language (we use Chinese as the foreign language in this paper), we can decipher it by consulting an English BINet constructed from its coordinated English text stream. We define Gc = ⟨Vc, Ec, ωc, πc⟩ and Ge = ⟨Ve, Ee, ωe, πe⟩ as the Chinese BINet and the English BINet respectively. For people who do not know Chinese, Gc is a network of ciphers. We design a novel BINet decipherment procedure to decipher Gc by aligning as many nodes in Gc as possible to Ge. The decipherment process is defined as finding e ∈ Ve for a node c ∈ Vc so that e is c's counterpart in the English text stream.²

3.1 Starting Point

To decipher the Chinese BINet, we need a few

in Figure 2 bursts between January 25 and January 31, 2010. We only need to look for its counterpart from the nodes in the English BINet whose burst period overlaps with this period. Formally, for a node c ∈ Vc in the Chinese BINet, its candidate nodes in the English BINet can be derived as:

Cand(c) = {e | P(e) ∩ P(c) ≠ ∅}

where e ∈ Ve, and P(c) and P(e) are the burst periods of c and e respectively.

3.3 Candidate Verification

For the candidate list of c (i.e., Cand(c)), we need to verify each node e ∈ Cand(c) and choose the most probable one as c's counterpart. Formally, we define Score(c, e) as the credibility score of e being the correct counterpart of c and propose the following novel clues for verification.

Pronunciation

Inspired by previous work on name translation mining (e.g., (Schafer III, 2006; Sproat et al., 2006; Ji, 2009)), for a node e ∈ Cand(c), if its pronunciation is similar to that of c, then e is likely to be the translation of c.
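The two steps above — candidate generation by burst-period overlap and pronunciation-based verification — can be sketched as follows. This is a rough illustration only: difflib's string-similarity ratio over a precomputed romanization (e.g., pinyin) stands in for a real pronunciation model, and the example nodes, loosely based on Figure 2, are invented.

```python
from difflib import SequenceMatcher

def overlaps(p, q):
    """Burst periods as (start, end) time indices; True if they intersect."""
    return max(p[0], q[0]) <= min(p[1], q[1])

def candidates(c, english_nodes):
    """Cand(c) = {e | P(e) ∩ P(c) ≠ ∅}: English nodes co-bursting with c."""
    _, period_c = c
    return [e for e in english_nodes if overlaps(e[1], period_c)]

def pronunciation_score(c_romanized, e):
    """Crude stand-in for the pronunciation clue: string similarity between
    the romanized Chinese word and the English word."""
    return SequenceMatcher(None, c_romanized.lower(), e[0].lower()).ratio()

# Invented example: a Chinese node with burst period (21, 29), romanized
# as "li na", checked against three English nodes.
eng = [("li", (22, 29)), ("wildcard", (24, 27)), ("seed", (1, 5))]
cand = candidates(("李娜", (21, 29)), eng)   # drops the non-co-bursting "seed"
best = max(cand, key=lambda e: pronunciation_score("li na", e))
```

A full Score(c, e) in the paper combines this with the further clues described in the introduction (prior knowledge, literal translation similarity, and neighbor clues), which this sketch omits.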