
SEMILAR: The Semantic Similarity Toolkit

Vasile Rus, Mihai Lintean, Rajendra Banjade, Nobal Niraula, and Dan Stefanescu
Department of Computer Science, The University of Memphis, Memphis, TN 38152
{vrus,rbanjade,mclinten,nbnraula,dstfnscu}@memphis.edu

Abstract

We present in this paper SEMILAR, the SEMantic simILARity toolkit. SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts. It is available as a Java library and as a Java standalone application offering GUI-based access to the implemented semantic similarity methods. Furthermore, it offers facilities for manual semantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool).

1 Introduction

We present in this paper the design and implementation of SEMILAR, the SEMantic simILARity toolkit. SEMILAR (www.semanticsimilarity.org) includes implementations of a number of algorithms proposed over the last decade or so to address various instances of the general problem of text-to-text semantic similarity. Semantic similarity is an approach to language understanding that is widely used in real applications. It is a practical alternative to the true understanding approach, which is intractable as it requires world knowledge, a yet-to-be-solved problem in Artificial Intelligence.

Generally, the problem of semantic similarity between two texts, denoted text A and text B, is defined as quantifying and identifying the presence of semantic relations between the two texts, e.g. to what extent text A has the same meaning as or is a paraphrase of text B (paraphrase relation; Dolan, Quirk, and Brockett, 2004). Other semantic relations that have been investigated systematically in the recent past are entailment, i.e. to what extent text A entails or logically infers text B (Dagan, Glickman, & Magnini, 2004), and elaboration, i.e. is text B an elaboration of text A? (McCarthy & McNamara, 2008).

The importance of semantic similarity in Natural Language Processing (NLP) is highlighted by the diversity of datasets and shared task evaluation campaigns (STECs) that have been proposed over the last decade (Dolan, Quirk, and Brockett, 2004; McCarthy & McNamara, 2008; Agirre et al., 2012). These datasets include instances from various applications. Indeed, there is a need to identify and quantify semantic relations between texts in many applications. For instance, paraphrase identification, an instance of the semantic similarity problem, is an important step in a number of applications including Natural Language Generation, Question Answering, and dialogue-based Intelligent Tutoring Systems. In Natural Language Generation, paraphrases are a method to increase the diversity of generated text (Iordanskaja et al., 1991). In Question Answering, multiple answers that are paraphrases of each other could be considered as evidence for the correctness of the answer (Ibrahim et al., 2003). In Intelligent Tutoring Systems (Rus et al., 2009; Lintean et al., 2010; Lintean, 2011), paraphrase identification is useful to assess whether students' articulated answers to deep questions (e.g. conceptual physics questions) are similar to, i.e. paraphrases of, ideal answers. Consider the pair of sentences below.

Text A: York had no problem with MTA's insisting the decision to shift funds had been within its legal rights.
Text B: York had no problem with MTA's saying the decision to shift funds was within its powers.

Given such two texts, the paraphrase identification task is about automatically assessing whether Text A is a paraphrase of, i.e. has the same meaning as, Text B. The example above is a positive instance, meaning that Text A is a paraphrase of Text B and vice versa.
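To make the task concrete, the sentence pair above can be compared with even the simplest measure discussed later in this paper, lexical overlap. The sketch below is a toy illustration and not part of the SEMILAR toolkit; the Jaccard formulation, the class name, and the tokenization are our own choices.

```java
import java.util.HashSet;
import java.util.Set;

public class OverlapDemo {
    // Jaccard word overlap between two texts: |A ∩ B| / |A ∪ B|.
    public static double overlap(String a, String b) {
        Set<String> wa = new HashSet<>();
        Set<String> wb = new HashSet<>();
        for (String w : a.toLowerCase().split("\\W+")) if (!w.isEmpty()) wa.add(w);
        for (String w : b.toLowerCase().split("\\W+")) if (!w.isEmpty()) wb.add(w);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        wa.retainAll(wb); // wa now holds the intersection
        return union.isEmpty() ? 0.0 : (double) wa.size() / union.size();
    }

    public static void main(String[] args) {
        String a = "York had no problem with MTA's insisting the decision to shift funds had been within its legal rights.";
        String b = "York had no problem with MTA's saying the decision to shift funds was within its powers.";
        // A high overlap score is (weak) evidence for the paraphrase relation.
        System.out.println(overlap(a, b));
    }
}
```

The example pair scores well above chance on this crude measure, which previews why lexical overlap is a surprisingly strong baseline (Section 4) while also showing its limits: it cannot tell "legal rights" from "powers".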

163
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 163-168, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics

Figure 1. Snapshot of SEMILAR. The Data View tab is shown.

Semantic similarity can be broadly construed between texts of any size. Depending on the granularity of the texts, we can talk about the following fundamental text-to-text similarity problems: word-to-word similarity, phrase-to-phrase similarity, sentence-to-sentence similarity, paragraph-to-paragraph similarity, or document-to-document similarity. Mixed combinations are also possible, such as assessing the similarity of a word to a sentence or a sentence to a paragraph. For instance, in summarization it might be useful to assess how well a sentence summarizes an entire paragraph.

2 Motivation

The problem of word-to-word similarity has been extensively studied over the past decades, and a word-to-word similarity library (WordNet Similarity) has been developed by Pedersen and colleagues (Pedersen, Patwardhan, & Michelizzi, 2004). Methods to assess the semantic similarity of larger texts, in particular sentences, have been proposed over the last decade (Corley and Mihalcea, 2005; Fernando & Stevenson, 2008; Rus, Lintean, Graesser, & McNamara, 2009). Androutsopoulos & Malakasiotis (2010) compiled a survey of methods for paraphrasing and entailment semantic relation identification at the sentence level. Despite all the proposed methods to assess semantic similarity between two texts, no semantic similarity library or toolkit, similar to the WordNet library for word-to-word similarity, exists for larger texts. Given the importance of semantic similarity, there is an acute need for such a library and toolkit. The SEMILAR library and toolkit presented here fulfill this need.

In particular, the development of the semantic similarity toolkit SEMILAR has been motivated by the need for an integrated environment that would provide:

 easy access to implementations of various semantic similarity approaches from the same user-friendly interface and/or library;
 easy access to semantic similarity methods that work at different levels of text granularity: word-to-word, sentence-to-sentence, paragraph-to-paragraph, document-to-document, or a combination (SEMILAR integrates word-to-word similarity measures);
 authoring methods for semantic similarity;
 a common environment that allows systematic and fair comparison of semantic similarity methods;
 facilities to manually annotate texts with semantic similarity relations using a graphical user interface that makes such annotations easier for experts (this component is called SEMILAT - a SEMantic simILarity Annotation Tool).

SEMILAR is thus a one-stop shop for investigating, annotating, and authoring methods for the semantic similarity of texts at any level of granularity.

3 SEMILAR: The Semantic Similarity Toolkit

The authors of the SEMILAR toolkit (see Figure 1) have been involved in assessing the semantic

similarity of texts for more than a decade. During this time, they have conducted a careful requirements analysis for an integrated software toolkit that would integrate various methods for semantic similarity assessment. The result of this effort is the prototype presented here. We briefly present the components of SEMILAR next and then describe in more detail the core component of SEMILAR, i.e. the set of semantic similarity methods that are currently available. It should be noted that we are continuously adding new semantic similarity methods and features to SEMILAR.

The SEMILAR toolkit includes the following components: project management; data viewing-browsing-visualization; preprocessing (e.g., collocation identification, part-of-speech tagging, phrase or dependency parsing, etc.); semantic similarity methods (word-level and sentence-level); classification components for qualitative decision making with respect to textual semantic relations (naïve Bayes, Decision Trees, Support Vector Machines, and Neural Networks); kernel-based methods (sequence kernels, word sequence kernels, and tree kernels; as of this writing, we are still implementing several other tree kernel methods); debugging and testing facilities for model selection; and annotation components (which allow domain experts to manually annotate texts with semantic relations using GUI-based facilities; Rus et al., 2012). For space reasons, we only detail next the main algorithms in the core component, i.e. the major text-to-text similarity algorithms currently available in SEMILAR.

4 The Semantic Similarity Methods Available in SEMILAR

The core component of SEMILAR is a set of text-to-text semantic similarity methods. We have implemented methods that handle both unidirectional and bidirectional similarity measures. For instance, the semantic relation of entailment between two texts is unidirectional (a text T logically entails a hypothesis text H, but H does not entail T), while the paraphrase relation is bidirectional (text A has the same meaning as text B and vice versa).

Lexical Overlap. Given two texts, the simplest method to assess their semantic similarity is to compute lexical overlap, i.e. how many words they have in common. There are many lexical overlap variations. Indeed, a closer look at lexical overlap reveals a number of parameters that turn the simple lexical overlap problem into a large space of possibilities. The parameters include preprocessing options (collocation detection, punctuation, stopword removal, etc.), filtering options (all words, content words, etc.), weighting schemes (global vs. local weighting, binary weighting, etc.), and normalization factors (largest text, weighted average, etc.). A total of 3,456 variants of lexical overlap can be generated by different parameter settings in SEMILAR. Lintean (2011) has shown that the performance of lexical overlap methods on the tasks of paraphrase identification and textual entailment can vary significantly depending on the selected parameters. Some lexical overlap variations lead to performance results rivaling more sophisticated, state-of-the-art methods.

It should be noted that the overlap category of methods can be extended to include N-gram overlap methods (see the N-gram overlap methods proposed by the machine translation community, such as BLEU and METEOR). SEMILAR offers such N-gram overlap methods, including the BLEU and METEOR scores.

A natural approach to text-to-text similarity methods is to rely on word-to-word similarity measures. Many of the methods presented next compute the similarity of larger texts using individual word similarities.

Mihalcea, Corley, & Strapparava (2006; MCS) proposed a greedy method based on word-to-word similarity measures. For each word in text A (or B), the maximum similarity score to any word in the other text B (or A) is used. An idf-weighted average is then computed as shown in the equation below.

sim(T_1, T_2) = \frac{1}{2}\left(\frac{\sum_{w \in T_1} \max Sim(w, T_2)\, idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} \max Sim(w, T_1)\, idf(w)}{\sum_{w \in T_2} idf(w)}\right)

The word-to-word similarity function Sim(w, T), where the maximum is taken over the words of the other text T, can be instantiated to any word-to-word similarity measure (e.g. WordNet similarities or Latent Semantic Analysis). The vast majority of word-to-word similarity measures that rely on WordNet are concept-to-concept measures, and to be able to use them one must map words in the input texts onto concepts in WordNet, i.e. word sense disambiguation (WSD) is needed. As of this writing, SEMILAR addresses the issue in two simple ways: (1) selecting the most frequent sense for each word, which is sense #1 in WordNet, and (2) using all the senses for each word and then taking the maximum (or average) of the relatedness scores for each pair of word senses. We label the former method as ONE (sense one), whereas the latter is labeled as ALL-MAX or ALL-AVG (all senses maximum score or all senses average score, respectively). Furthermore, most WordNet-based measures only work within a part-of-speech category, e.g. only between nouns. Other types of word-to-word measures, such as those based on Latent Semantic Analysis or Latent Dirichlet Allocation, do not have a word-sense disambiguation challenge.

Rus and Lintean (2012; Rus-Lintean-Optimal Matching or ROM) proposed an optimal solution for text-to-text similarity based on word-to-word similarity measures. The optimal lexical matching is based on the optimal assignment problem, a fundamental combinatorial optimization problem which consists of finding a maximum weight matching in a weighted bipartite graph.

Given a weighted complete bipartite graph over vertex sets X and Y, where edge (x, y) has weight w(x, y), the optimal assignment problem is to find a matching M from X to Y with maximum weight. A typical application is about assigning a group of workers, e.g. words in text A in our case, to a set of jobs (words in text B in our case) based on the expertise level, measured by w(x, y), of each worker at each job. By adding dummy workers or jobs we may assume that X and Y have the same size, n, and can be viewed as X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n}. In the semantic similarity case, the weight w(x, y) is the word-to-word similarity between a word x in text A and a word y in text B. The assignment problem can also be stated as finding a permutation π of {1, 2, 3, ..., n} for which \sum_{i=1}^{n} w(x_i, y_{\pi(i)}) is maximum. Such an assignment is called an optimum assignment. The Kuhn-Munkres algorithm (Kuhn, 1955) can find a solution to the optimum assignment problem in polynomial time.

Rus and colleagues (Rus et al., 2009; Rus & Graesser, 2006; Rus-Syntax-Negation or RSN) used a lexical overlap component combined with syntactic overlap and negation handling to compute a unidirectional subsumption score between two sentences, T (Text) and H (Hypothesis), in entailment recognition and student input assessment in Intelligent Tutoring Systems. Each text is regarded as a graph with words as nodes/vertices and syntactic dependencies as edges. The subsumption score reflects how much a text is subsumed or contained by another. The equation below provides the overall subsumption score, which can be averaged both ways to compute a similarity score, as opposed to just the subsumption score, between the two texts.

subsump(T, H) = \left(\alpha \frac{\sum_{V_h \in H_v} \max_{V_t \in T_v} match(V_h, V_t)}{|H_v|} + \beta \frac{\sum_{E_h \in H_e} \max_{E_t \in T_e} match(E_h, E_t)}{|H_e|}\right) \cdot \frac{1 + (-1)^{\#neg\_rel}}{2}

Here H_v and H_e denote the vertex and edge sets of H (T_v and T_e those of T), #neg_rel is the number of negation relations between the two texts, and α and β weight the lexical and syntactic components, respectively. The lexical component can be used by itself (given a weight of 1, with the syntactic component given a weight of 0), in which case the similarity between the two texts is just a compositional extension of word-to-word similarity measures. The match function in the equation can be any word-to-word similarity measure, including simple word match, WordNet similarity measures, LSA, or LDA-based similarity measures.

Fernando and Stevenson (FST; 2008) proposed a method in which similarities among all pairs of words are taken into account for computing the similarity of two texts. Each text is represented as a binary vector (1 - the word occurs in the text; 0 - the word does not occur in the text). They use a similarity matrix operator W that contains word-to-word similarities between any two words.

sim(\vec{a}, \vec{b}) = \frac{\vec{a}^{\,T} W \vec{b}}{|\vec{a}||\vec{b}|}

Each element w_{ij} represents the word-level semantic similarity between word a_i in text A and word b_j in text B. Any word-to-word semantic similarity measure can be used.

Lintean and Rus (2010; weighted-LSA or wLSA) extensively studied methods for semantic similarity based on Latent Semantic Analysis (LSA; Landauer et al., 2006). LSA represents words as vectors in a 300-500 dimensional LSA space. An LSA vector for larger texts can be derived by vector algebra, e.g. by summing up the individual words' vectors. The similarity of two texts A and B can be computed using the cosine (normalized dot product) of their LSA vectors. Alternatively, the individual word vectors can be combined through weighted sums. Lintean and Rus (2010) experimented with a combination of 3 local weights and 3 global weights. All these versions of LSA-based text-to-text similarity measures are available in SEMILAR.
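The basic composition scheme just described (sum the word vectors, then take the cosine of the resulting text vectors) can be sketched as follows. This is a toy illustration, not SEMILAR code: the three-dimensional vectors and the word space stand in for a real 300-500 dimensional LSA space learned from a corpus.

```java
import java.util.HashMap;
import java.util.Map;

public class LsaSketch {
    // Sum the (toy) vectors of a text's words; words missing from the space are skipped.
    public static double[] textVector(String text, Map<String, double[]> space, int dim) {
        double[] v = new double[dim];
        for (String w : text.toLowerCase().split("\\s+")) {
            double[] wv = space.get(w);
            if (wv != null) for (int i = 0; i < dim; i++) v[i] += wv[i];
        }
        return v;
    }

    // Cosine similarity: normalized dot product of the two text vectors.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, double[]> space = new HashMap<>(); // toy 3-d stand-in for an LSA space
        space.put("car",    new double[]{0.9, 0.1, 0.0});
        space.put("auto",   new double[]{0.8, 0.2, 0.1});
        space.put("flower", new double[]{0.0, 0.1, 0.9});
        double[] t1 = textVector("car", space, 3);
        double[] t2 = textVector("auto", space, 3);
        double[] t3 = textVector("flower", space, 3);
        System.out.println(cosine(t1, t2)); // high: related words
        System.out.println(cosine(t1, t3)); // low: unrelated words
    }
}
```

The wLSA variants replace the plain sum in textVector with weighted sums, where the weight of each word vector combines a local and a global weighting term.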

SEMILAR also includes a set of similarity measures based on the unsupervised method Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003; Rus, Banjade, & Niraula, 2013). LDA is a probabilistic generative model in which documents are viewed as distributions over a set of topics (θ_d - text d's distribution over topics) and topics are distributions over words (φ_t - topic t's distribution over words). That is, each word in a document is generated from a distribution over words that is specific to each topic.

A first LDA-based semantic similarity measure among words would then be defined as a dot product between the corresponding vectors representing the contributions of each word to a topic (φ_t(w) represents the probability of word w in topic t). It should be noted that the contributions of each word to the topics do not constitute a distribution, i.e. the sum of the contributions is not 1. Assuming a number of topics T, a simple word-to-word measure is defined by the formula below.

LDA\text{-}w2w(w, v) = \sum_{t=1}^{T} \varphi_t(w) \varphi_t(v)

More global text-to-text similarity measures could be defined in several ways, as detailed next.

Because in LDA a document is a distribution over topics, the similarity of two texts needs to be computed in terms of similarity of distributions. The Kullback-Leibler (KL) divergence defines a distance, i.e. how dissimilar two distributions p and q are, as in the formula below.

KL(p, q) = \sum_{i=1}^{T} p_i \log \frac{p_i}{q_i}

If we replace p with θ_d (text/document d's distribution over topics) and q with θ_c (text/document c's distribution over topics), we obtain the KL distance between two documents (documents d and c in our example). The KL distance has two major problems. In case q_i is zero, KL is not defined. In addition, KL is not symmetric. The Information Radius measure (IR) solves these problems by considering the average of p_i and q_i. Also, the IR can be transformed into a symmetric similarity measure as in the following (Dagan, Lee, & Pereira, 1997):

SIM(p, q) = 10^{-IR(p, q)}

The Hellinger and Manhattan distances between two distributions are two other options that avoid the shortcomings of the KL distance. Both options are implemented in SEMILAR.

LDA similarity measures between two documents or texts c and d can also include the similarity of topics. That is, the text-to-text similarity is obtained by multiplying the similarities between the distributions over topics (θ_d and θ_c) and the distributions over words (φ_t1 and φ_t2). The similarity of topics can be computed using the same methods illustrated above, as the topics are distributions over words (for all the details see Rus, Banjade, & Niraula, 2013).

The last semantic similarity method presented in this paper is based on the Quadratic Assignment Problem (QAP). The QAP method aims at finding an optimal assignment from words in text A to words in text B, based on individual word-to-word similarity measures, while simultaneously maximizing the match between the syntactic dependencies of the words.

The Koopmans-Beckmann (1957) formulation of the QAP problem best fits this purpose. The goal of the original QAP formulation, in the domain of economic activity, was to minimize the objective function shown below, where matrix F describes the flow between any two facilities, matrix D indicates the distances between locations, and matrix B provides the cost of locating facilities at specific locations. F, D, and B are symmetric and non-negative.

\min QAP(F, D, B) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j} d_{\pi(i)\pi(j)} + \sum_{i=1}^{n} b_{i,\pi(i)}

The f_{i,j} term denotes the flow between facilities i and j, which are placed at locations π(i) and π(j), respectively. The distance between these locations is d_{\pi(i)\pi(j)}. In our case, F and D describe dependencies between words in one sentence, while B captures the word-to-word similarity between words in the two sentences. Also, we have weighted each term in the above formulation and, instead of minimizing the sum, we are maximizing it, resulting in the formulation below.

\max QAP(F, D, B) = \alpha \sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j} d_{\pi(i)\pi(j)} + (1 - \alpha) \sum_{i=1}^{n} b_{i,\pi(i)}
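As an illustration of this maximization, the objective can be evaluated by brute force over all permutations π for tiny inputs (QAP is NP-hard, so a real implementation needs heuristics or specialized solvers). The sketch below is a toy example, not SEMILAR code; the matrices and α = 0.5 are arbitrary assumptions.

```java
public class QapSketch {
    // Objective for one assignment perm:
    // alpha * sum_ij F[i][j] * D[perm[i]][perm[j]] + (1 - alpha) * sum_i B[i][perm[i]]
    public static double score(double[][] F, double[][] D, double[][] B, int[] perm, double alpha) {
        double s = 0;
        int n = perm.length;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                s += alpha * F[i][j] * D[perm[i]][perm[j]];
        for (int i = 0; i < n; i++)
            s += (1 - alpha) * B[i][perm[i]];
        return s;
    }

    // Brute-force maximization over all n! permutations (viable only for small n).
    public static double best(double[][] F, double[][] D, double[][] B, double alpha) {
        int n = B.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        return permute(F, D, B, perm, 0, alpha);
    }

    private static double permute(double[][] F, double[][] D, double[][] B, int[] p, int k, double alpha) {
        if (k == p.length) return score(F, D, B, p, alpha);
        double max = Double.NEGATIVE_INFINITY;
        for (int i = k; i < p.length; i++) {
            int t = p[k]; p[k] = p[i]; p[i] = t;            // fix position k
            max = Math.max(max, permute(F, D, B, p, k + 1, alpha));
            t = p[k]; p[k] = p[i]; p[i] = t;                // undo the swap
        }
        return max;
    }

    public static void main(String[] args) {
        // Toy two-word sentences: F and D encode dependency structure, B word-to-word similarity.
        double[][] F = {{0, 1}, {1, 0}};
        double[][] D = {{0, 1}, {1, 0}};
        double[][] B = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(best(F, D, B, 0.5)); // identity assignment wins here
    }
}
```

In the toy run, the identity assignment scores 0.5·2 + 0.5·(0.9 + 0.8) = 1.85, beating the swapped assignment (1.15): the aligned words are both more similar and dependency-consistent.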

5 Discussion and Conclusions

The above methods were experimented with on various datasets for paraphrase, entailment, and elaboration. For paraphrase identification, the QAP method provides the best accuracy results (77.6%) on the test subset of the Microsoft Research Paraphrase corpus, one of the largest paraphrase datasets. Due to space constraints, we have not described all the features available in SEMILAR. For a complete list of features, the latest news, references, and updates of the SEMILAR toolkit, along with downloadable resources including software and data files, the reader can visit this link: www.semanticsimilarity.org.

Acknowledgments

This research was supported in part by the Institute for Education Sciences under award R305A100875. Any opinions, findings, and conclusions or recommendations expressed in this material are solely the authors' and do not necessarily reflect the views of the sponsoring agency.

References

Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, Canada, June 7-8, 2012.

Androutsopoulos, I. & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135-187.

Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.

Corley, C., & Mihalcea, R. (2005). Measuring the Semantic Similarity of Texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. Ann Arbor, MI.

Dagan, I., Glickman, O., & Magnini, B. (2004). The PASCAL Recognising Textual Entailment Challenge. In Quiñonero-Candela, J., Dagan, I., Magnini, B., & d'Alché-Buc, F. (Eds.), Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, pp. 177-190, Springer, 2006.

Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-2004), Geneva, Switzerland.

Fernando, S. & Stevenson, M. (2008). A semantic similarity approach to paraphrase detection. Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium.

Ibrahim, A., Katz, B., & Lin, J. (2003). Extracting structural paraphrases from aligned monolingual corpora. In Proceedings of the Second International Workshop on Paraphrasing (ACL 2003).

Iordanskaja, L., Kittredge, R., & Polguère, A. (1991). Lexical selection and paraphrase in a meaning-text generation model. In Natural Language Generation in Artificial Intelligence and Computational Linguistics. Kluwer Academic.

Lintean, M. (2011). Measuring Semantic Similarity: Representations and Methods. PhD Thesis, Department of Computer Science, The University of Memphis.

Lintean, M., Moldovan, C., Rus, V., & McNamara, D. (2010). The Role of Local and Global Weighting in Assessing the Semantic Similarity of Texts Using Latent Semantic Analysis. In Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference. Daytona Beach, FL.

McCarthy, P.M. & McNamara, D.S. (2008). User-Language Paraphrase Corpus Challenge. https://umdrive.memphis.edu/pmmccrth/public/ParaphraseCorpus/Paraphrase site.htm. Retrieved 2/20/2010.

Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pp. 1024-1025, July 25-29, 2004, San Jose, CA (Intelligent Systems Demonstration).

Rus, V., Banjade, R., & Niraula, N. (2013). Similarity Measures based on Latent Dirichlet Allocation. The 14th International Conference on Intelligent Text Processing and Computational Linguistics, March 24-30, 2013, Samos, Greece.

Rus, V., Lintean, M., Graesser, A.C., & McNamara, D.S. (2009). Assessing Student Paraphrases Using Lexical Semantics and Word Weighting. In Proceedings of the 14th International Conference on Artificial Intelligence in Education, Brighton, UK.

Rus, V., Lintean, M., Moldovan, C., Baggett, W., Niraula, N., & Morgan, B. (2012). The SIMILAR Corpus: A Resource to Foster the Qualitative Understanding of Semantic Similarity of Texts. In Semantic Relations II: Enhancing Resources and Applications, The 8th Language Resources and Evaluation Conference (LREC 2012), May 23-25, Istanbul, Turkey.
