TextFlow: A Text Similarity Measure Based on Continuous Sequences
Yassine Mrabet, Halil Kilicoglu, Dina Demner-Fushman
[email protected], [email protected], [email protected]
Lister Hill National Center for Biomedical Communications
U.S. National Library of Medicine
8600 Rockville Pike, 20894, Bethesda, MD, USA

Abstract

Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking, and recognition of paraphrases and textual entailment. While recent advances in deep learning have further highlighted the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity measures include n-gram and skip-gram overlap, which rely on distinct slices of the input texts. In this paper we present a novel text similarity measure inspired by a common representation in DNA sequence alignment algorithms. The new measure, called TextFlow, represents input text pairs as continuous curves and uses both the actual positions of the words and sequence matching to compute the similarity value. Our experiments on eight different datasets show very encouraging results in paraphrase detection, textual entailment recognition and relevance ranking.

1 Background

The number of pages required to print the content of the World Wide Web was estimated at 305 billion in a 2015 article (http://goo.gl/p9lt7V). While a big part of this content consists of visual information such as pictures and videos, text also continues to grow at a very high pace. A recent study shows that the average webpage weighs 1,200 KB, with plain text accounting for up to 16% of that size (http://goo.gl/c41wpa).

While efficient distribution of textual data and computation is key to dealing with the increasing scale of textual search, similarity measures still play an important role in refining search results for more specific needs such as the recognition of paraphrases and textual entailment, plagiarism detection, and fine-grained ranking of information. These tasks are also often performed on dedicated document collections for domain-specific applications, where text similarity measures can be applied directly.

Finding relevant approaches to compute text similarity has motivated a lot of research over the last decades (Sahami and Heilman, 2006; Hatzivassiloglou et al., 1999), and more recently with deep learning methods (Socher et al., 2011; Yih et al., 2011; Severyn and Moschitti, 2015). However, most of the recent advances have focused on designing high-performance classification methods, trained and tested for specific tasks, and did not offer a standalone similarity measure that could be applied (i) regardless of the application domain and (ii) without requiring training corpora.

For instance, Yih and Meek (2007) presented an approach to improve text similarity measures for web search queries ranging in length from one word to short sequences of words. The proposed method is tailored to this specific task, and the trained models are not expected to perform well on different kinds of data such as sentences, questions or paragraphs. In a more general study, Achananuparp et al. (2008) compared several text similarity measures for paraphrase recognition, textual entailment, and the TREC 9 question variants task. In their experiments the best performance was obtained with a linear combination of semantic and lexical similarities, including the word order similarity proposed by Li et al. (2006). This word order similarity is computed by first constructing two vectors representing the common words between the two given sentences, using their respective positions in the sentences as term weights, and then subtracting the two vectors and taking the absolute value.
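As a rough illustration, the sketch below follows only the description given here, not Li et al.'s exact formulation (which also normalizes the result); reading the final subtraction as a sum of absolute position differences is our interpretation, and the names are illustrative.

```python
def word_order_distance(s1, s2):
    """Word order distance over the common words of two tokenized
    sentences, using word positions as term weights, as described above."""
    common = [w for w in s1 if w in s2]
    v1 = [s1.index(w) + 1 for w in common]  # positions in s1
    v2 = [s2.index(w) + 1 for w in common]  # positions in s2
    # "subtracting the two vectors and taking the absolute value"
    return sum(abs(p1 - p2) for p1, p2 in zip(v1, v2))

print(word_order_distance("a b c d".split(), "a c b d".split()))  # 2
```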
While such a representation takes into account the actual positions of the words, it does not allow detecting sub-sequence matches, and it takes missing words into account only by omission.

More generally, existing standalone (or traditional) text similarity measures rely on the intersections between token sets and/or on text sizes and frequencies, including measures such as cosine similarity, Euclidean distance, Levenshtein (Sankoff and Kruskal, 1983), Jaccard (Jain and Dubes, 1988) and Jaro (Jaro, 1989). The sequential nature of natural language is taken into account mostly through word n-grams and skip-grams, which capture distinct slices of the analysed texts but do not preserve the order in which they appear.

[Figure 1: Dot matrix example for 2 DNA sequences (Mount, 2004)]

In this paper, we use intuitions from a common representation in DNA sequence alignment to design a new standalone similarity measure called TextFlow (XF). The proposed measure uses both the full sequence of the input texts, through a natural sub-sequence matching approach, and individual token matches and mismatches. Our contributions can be detailed further as follows:

• A novel standalone similarity measure which:
  – exploits the full sequence of words in the compared texts.
  – is asymmetric in a way that allows it to provide the best performance on different tasks (e.g., paraphrase detection, textual entailment and ranking).
  – when required, can be trained with a small set of parameters controlling the impact of sub-sequence matching, position gaps and unmatched words.
  – provides consistent high performance across tasks and datasets compared to traditional similarity measures.
• A neural network architecture to train TextFlow parameters for specific tasks.
• An empirical study on both performance consistency and standard evaluation measures, performed with eight datasets from three different tasks.
• A new evaluation measure, called CORE, used to better show the consistency of a system at high performance, using both its rank average and rank variance when compared to competing systems over a set of datasets.

2 The TextFlow Similarity

XF is inspired by the dot matrix representation commonly used in pairwise DNA sequence alignment (cf. Figure 1). We use a similar dot matrix representation for text pairs and draw a curve oscillating around the diagonal (cf. Figure 2). The area under the curve is considered to be the distance between the two texts, which is then normalized by the matrix surface. For practical computation, we transform this first intuitive representation into one based on the delta of positions, as in Figure 3. In this setting, the Y axis is the delta of the positions of a word occurring in the two texts being compared. If the word does not occur in the target text, the delta is set to a maximum reference value (l in Figure 2). The semantics are: the bigger the area under the curve, the lower the similarity between the compared texts.

[Figure 2: Illustration of TextFlow Intuition]
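To make this practical representation concrete, here is a minimal sketch of the delta-of-positions transformation. Two details are assumptions not spelled out above: positions are 1-based, and a repeated word is matched to its first occurrence in the target text.

```python
def position_deltas(x_tokens, y_tokens):
    """For each word of X, the delta of its positions in X and Y; a word
    missing from the target text Y gets the maximum reference value m."""
    m = len(y_tokens)
    deltas = []
    for i, word in enumerate(x_tokens, start=1):
        if word in y_tokens:
            # assumption: repeated words match their first occurrence in Y
            deltas.append(abs(i - (y_tokens.index(word) + 1)))
        else:
            deltas.append(m)  # missing word: maximum reference value
    return deltas

# The "curve" for a near-paraphrase stays close to the zero line, with a
# single peak at the unmatched word:
X = "a cat sat on my mat".split()
Y = "a cat rested on my mat".split()
print(position_deltas(X, Y))  # [0, 0, 6, 0, 0, 0]
```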
XF values are real numbers in the [0,1] interval, with 1 indicating a perfect match and 0 indicating that the compared texts do not have any common tokens. With this representation, we are able to take into account all matched words and sub-sequences at the same time. The exact value of the XF similarity between two texts X = {x₁, x₂, …, xₙ} and Y = {y₁, y₂, …, yₘ} is therefore computed as:

$$XF(X, Y) = 1 - \frac{1}{nm}\sum_{i=2}^{n}\frac{1}{S_i}\,T_{i,i-1}(X, Y) - \frac{1}{nm}\sum_{i=2}^{n}\frac{1}{S_i}\,R_{i,i-1}(X, Y) \quad (1)$$

[Figure 3: Practical TextFlow Computation]

With Tᵢ,ᵢ₋₁(X, Y) corresponding to the triangular area in the [i−1, i] step (cf. Figure 3) and Rᵢ,ᵢ₋₁(X, Y) corresponding to the rectangular component. They are expressed as:

$$T_{i,i-1}(X, Y) = \frac{|\Delta P(x_i, X, Y) - \Delta P(x_{i-1}, X, Y)|}{2} \quad (2)$$

and:

$$R_{i,i-1}(X, Y) = \min(\Delta P(x_i, X, Y),\, \Delta P(x_{i-1}, X, Y)) \quad (3)$$

Where:
• ΔP(xᵢ, X, Y) is the delta of the positions of xᵢ in X and Y if xᵢ ∈ X ∩ Y; otherwise it is set to the maximum reference value equal to m (i.e., the cost of a missing word is set by default to the length of the target text), and:
• Sᵢ is the length of the longest matching sequence between X and Y including the word xᵢ, if xᵢ ∈ X ∩ Y, or 1 otherwise.

XF computation is performed in O(nm) in the worst case, where we have to check all tokens in the target text Y for all tokens in the input text X. XF is an asymmetric similarity measure. This asymmetry has interesting semantic applications, as we show in the example below (cf. Figure 2): the minimum value of XF (over the two directions) provided the best differentiation between positive and negative text pairs when looking for semantic equivalence (i.e., paraphrases), while the maximum value was among the top three for the textual entailment example. We conduct this comparison at a larger scale in the evaluation section.

We add three parameters to XF in order to represent the importance that should be given to position deltas (position factor α), missing words (sensitivity factor β) and sub-sequence matching (sequence factor γ), such that:

$$XF_{\alpha,\beta,\gamma}(X, Y) = 1 - \frac{\alpha}{\beta nm}\sum_{i=2}^{n}\frac{1}{S_i^{\gamma}}\,T^{\beta}_{i,i-1}(X, Y) - \frac{\alpha}{\beta nm}\sum_{i=2}^{n}\frac{1}{S_i^{\gamma}}\,R^{\beta}_{i,i-1}(X, Y) \quad (4)$$

With:

$$T^{\beta}_{i,i-1}(X, Y) = \frac{|\Delta_{\beta}P(x_i, X, Y) - \Delta_{\beta}P(x_{i-1}, X, Y)|}{2} \quad (5)$$

and:

$$R^{\beta}_{i,i-1}(X, Y) = \min(\Delta_{\beta}P(x_i, X, Y),\, \Delta_{\beta}P(x_{i-1}, X, Y)) \quad (6)$$

and:
• Δ_βP(xᵢ, X, Y) = βm, if xᵢ ∉ X ∩ Y
• α < β: forces missing words to always cost more than matched words.
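To make the definitions concrete, the following is a minimal runnable sketch of parameterized TextFlow following equations (4)-(6) as reconstructed above; with α = β = γ = 1 it reduces to equation (1). Three details are assumptions rather than statements from the text: repeated words are matched to their first occurrence in the target text, Δ_βP for a matched word is taken to be the plain position delta (only the missing-word case, βm, is specified above), and Sᵢ is computed by brute force for clarity rather than within the O(nm) bound.

```python
def _delta(x, y, i, beta):
    """Delta_beta P(x_i, X, Y): position delta for matched words (an
    assumption here), beta * m for words missing from the target text."""
    m = len(y)
    if x[i] in y:
        return abs(i - y.index(x[i]))  # assumption: first-occurrence match
    return beta * m

def _longest_match(x, y, i):
    """S_i: length of the longest contiguous sub-sequence of X that also
    occurs in Y and includes position i; 1 if x_i is missing from Y.
    Brute force for clarity, not efficiency."""
    if x[i] not in y:
        return 1
    best = 1
    for start in range(i + 1):
        for end in range(i + 1, len(x) + 1):
            seq = x[start:end]
            if any(y[j:j + len(seq)] == seq
                   for j in range(len(y) - len(seq) + 1)):
                best = max(best, len(seq))
    return best

def textflow(x, y, alpha=1.0, beta=1.0, gamma=1.0):
    """XF_{alpha,beta,gamma}(X, Y) per equation (4); asymmetric in X and Y."""
    n, m = len(x), len(y)
    area = 0.0
    for i in range(1, n):  # steps [i-1, i], i.e., i = 2..n in the paper
        d_prev = _delta(x, y, i - 1, beta)
        d_curr = _delta(x, y, i, beta)
        t = abs(d_curr - d_prev) / 2.0  # equation (5): triangular component
        r = min(d_curr, d_prev)         # equation (6): rectangular component
        area += (t + r) / (_longest_match(x, y, i) ** gamma)
    return 1.0 - (alpha / (beta * n * m)) * area

# The asymmetry at work: every word of X flows through Y in order, so
# XF(X, Y) = 1.0, while XF(Y, X) is penalized for the word missing from X.
X = "a cat sat on my mat".split()
Y = "a cat sat on my mat today".split()
print(textflow(X, Y))  # 1.0
print(textflow(Y, X))  # approx. 0.93
```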