TextFlow: A Text Similarity Measure based on Continuous Sequences

Yassine Mrabet, Halil Kilicoglu, Dina Demner-Fushman
Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, 20894, Bethesda, MD, USA

Abstract

Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking and recognition of paraphrases and textual entailment. While recent advances in deep learning further highlighted the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity measures include n-gram and skip-gram overlap, which rely on distinct slices of the input texts. In this paper we present a novel text similarity measure inspired by a common representation in DNA sequence alignment algorithms.

ing scale of textual search, similarity measures still play an important role in refining search results to more specific needs such as the recognition of paraphrases and textual entailment, plagiarism detection and fine-grained ranking of information. These tasks are also often performed on dedicated document collections for domain-specific applications where text similarity measures can be directly applied.

Finding relevant approaches to compute text similarity has motivated a large body of research in the last decades (Sahami and Heilman, 2006; Hatzivassiloglou et al., 1999), and more recently with deep learning methods (Socher et al., 2011; Yih et al., 2011; Severyn and Moschitti, 2015). However, most of the recent advances focused on designing high-performance classification methods, trained