MathLingBudapest: Concept Networks for Semantic Similarity

Gábor Recski
Research Institute for Linguistics
Hungarian Academy of Sciences
Benczúr u. 33
1068 Budapest, Hungary
[email protected]

Judit Ács
HAS Research Institute for Linguistics and
Dept. of Automation and Applied Informatics
Budapest U of Technology and Economics
Magyar tudósok krt. 2 (Bldg. Q.)
1117 Budapest, Hungary
[email protected]

Abstract

We present our approach to measuring semantic similarity of sentence pairs used in Semeval 2015 tasks 1 and 2. We adopt the sentence alignment framework of (Han et al., 2013) and experiment with several measures of word similarity. We hybridize the common vector-based models with definition graphs from the 4lang concept dictionary and devise a measure of graph similarity that yields good results on training data. We did not address the specific challenges posed by Twitter data, and this is reflected in placing 11th out of 30 in Task 1, but our systems perform fairly well on the generic datasets of Task 2, with the hybrid approach placing 11th among 78 runs.

1 Introduction

This paper describes the systems participating in Semeval-2015 Task 1 (Xu et al., 2015) and Task 2 (Agirre et al., 2015). To compute the semantic similarity of two sentences we use the architecture presented in (Han et al., 2013) to find, for each word, its counterpart in the other sentence that is semantically most similar to it. We implemented several methods for measuring word similarity, among them (i) a word embedding created by the method presented in (Mikolov et al., 2013) and (ii) a metric based on networks of concepts derived from the 4lang concept lexicon (Kornai and Makrai, 2013; Kornai et al., 2015) and definitions from the Longman Dictionary of Contemporary English (Bullon, 2003). A hybrid system exploiting both of these metrics yields the best results and placed 11th among 73 systems on Semeval Task 2a (Semantic Textual Similarity for English). All components of our system are available for download under an MIT license from GitHub (http://github.com/juditacs/semeval, http://github.com/kornai/pymachine). Section 2 describes the system architecture and points out the main differences between our system and that in (Han et al., 2013); section 3 outlines our word similarity metric derived from the 4lang concept lexicon. We present the evaluation of our systems on both tasks in section 4, and section 5 provides a brief conclusion.

2 Architecture

Our framework for determining the semantic similarity of two sentences is based on the system presented in (Han et al., 2013). Their architecture, Align and Penalize, involves computing an alignment score between two sentences based on some measure of word similarity. We chose to reimplement this system so that we can experiment with various notions of word similarity, including the one based on 4lang presented in section 3. Although we reimplemented virtually all rules and components described by (Han et al., 2013) for experimentation, we shall only describe those that ended up in at least one of the five configurations submitted to Semeval.

The core idea behind the Align and Penalize architecture is, given two sentences $S_1$ and $S_2$ and some measure of word similarity, to align each word of one sentence with some word of the other sentence so that the similarity of word pairs is maximized. The mapping need not be one-to-one and is calculated independently for words of $S_1$ (aligning them with words from $S_2$) and words of $S_2$ (aligning them with words from $S_1$). The score of an alignment is the sum of the similarities of each word pair in the alignment, normalized by sentence length. The final score assigned to a pair of sentences is the average of the alignment scores for each sentence.
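As a concrete illustration, the directed alignment and the final sentence score can be sketched as follows. This is a minimal sketch assuming some word-level similarity function is supplied from outside; the function names are ours and do not reflect the released implementation, which also integrates the OOV handling, boosts, and penalties described below.

```python
from typing import Callable, List

def align_score(source: List[str], target: List[str],
                word_sim: Callable[[str, str], float]) -> float:
    """Directed alignment: map each word of `source` to its most
    similar word in `target`; the score is the sum of word-pair
    similarities, normalized by sentence length."""
    if not source or not target:
        return 0.0
    total = sum(max(word_sim(w, t) for t in target) for w in source)
    return total / len(source)

def sentence_similarity(s1: List[str], s2: List[str],
                        word_sim: Callable[[str, str], float]) -> float:
    """Final score for a sentence pair: the average of the two
    directed alignment scores."""
    return 0.5 * (align_score(s1, s2, word_sim) +
                  align_score(s2, s1, word_sim))
```

Because each direction picks the best counterpart independently, the resulting mapping is many-to-one in general, exactly as described above.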
For out-of-vocabulary (OOV) words, i.e. those that are not covered by the component used for measuring word similarity, we use the Dice similarity over the sets of character 4-grams in each word. Additionally, we use simple rules to detect acronyms and compounds: if a word of one sentence that is a sequence of 2-5 characters (e.g. ABC) has a matching sequence of words in the other sentence (e.g. American Broadcasting Company), all words of the phrase are aligned with this word and receive an alignment score of 1. If a sentence contains a sequence of two words (e.g. long term or can not) that appears in the other sentence without a space and with or without a hyphen (e.g. long-term or cannot), these are also aligned with a score of 1.

The score returned by the word similarity component can be boosted based on WordNet (Miller, 1995), e.g. if one word is a hypernym of the other, if one appears frequently in glosses of the other, or if they are derivationally related. For the exact cases covered and a description of how the boost is calculated, the reader is referred to (Han et al., 2013). In our submissions we only used this boost on word similarity scores obtained from word embeddings.

The similarity score may be reduced by a variety of penalties, which we only enabled for Task 1 runs – they did not bring any improvement on Task 2 datasets in any of our early experiments. Of the penalties described in (Han et al., 2013) we only used the one which decreases alignment scores if the word similarity score for some word pair is very small (< 0.05). We also introduced two new types of penalties based on our observations of error types in Twitter data: if one sentence starts with a question word and the other one does not, or if one sentence contains a past-tense verb and the other does not, we reduce the overall score by $1/(L(S_1) + L(S_2))$, where $L(S_1)$ and $L(S_2)$ are the numbers of words in each sentence.
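The OOV fallback is simple enough to state directly; a minimal sketch (the names are ours, not those of the released code):

```python
def char_ngrams(word: str, n: int = 4) -> set:
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(w1: str, w2: str, n: int = 4) -> float:
    """Dice coefficient over the two words' character 4-gram sets.
    Words shorter than n characters have no 4-grams and score 0."""
    g1, g2 = char_ngrams(w1, n), char_ngrams(w2, n)
    if not g1 or not g2:
        return 0.0
    return 2 * len(g1 & g2) / (len(g1) + len(g2))
```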
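The two Twitter-specific penalties can likewise be sketched. The question-word list below is our own illustrative stand-in, and past-tense detection is left to an external POS tagger, since the paper does not spell out either detail:

```python
from typing import List

QUESTION_WORDS = {"what", "who", "when", "where", "why", "how", "which"}

def is_question(sentence: List[str]) -> bool:
    """True iff the sentence starts with a question word."""
    return bool(sentence) and sentence[0].lower() in QUESTION_WORDS

def penalized_score(score: float, s1: List[str], s2: List[str],
                    past1: bool, past2: bool) -> float:
    """Subtract 1/(L(S1)+L(S2)) once per mismatch: a question word
    opening exactly one sentence, or a past-tense verb occurring in
    exactly one sentence (past1/past2 come from a POS tagger)."""
    penalty = 1.0 / (len(s1) + len(s2))
    if is_question(s1) != is_question(s2):
        score -= penalty
    if past1 != past2:
        score -= penalty
    return score
```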
3 Similarity from Concept Networks

This section presents the word similarity measure based on principles of lexical semantics presented in (Kornai, 2010). The 4lang concept dictionary (Kornai and Makrai, 2013) contains 3500 definitions created manually. Because the Longman Defining Vocabulary (LDV) (Boguraev and Briscoe, 1989) is a subset of 4lang, we could automatically extend this manually created seed to every headword of the Longman Dictionary of Contemporary English (LDOCE) by processing their definitions with the Stanford Dependency Parser (Klein and Manning, 2003) and mapping dependency relations to sets of edges in the 4lang-style concept graph. Details of the mapping will be described elsewhere (Recski, 2015).

Since these definitions are essentially graphs of concepts, we have experimented with similarity functions over pairs of such graphs that capture semantic similarity of the concepts defined by each of them. There are two fundamentally different configurations present in 4lang graphs:

1. Two nodes may be connected via a 0-edge, which is a generalization over unary predication ($\texttt{dog} \xrightarrow{0} \texttt{bark}$), attribution ($\texttt{dog} \xrightarrow{0} \texttt{faithful}$), and hypernymy, or the IS_A relation ($\texttt{dog} \xrightarrow{0} \texttt{mammal}$).

2. Two nodes can be connected, via a 1-edge and a 2-edge respectively, to a third one representing a binary relation. Binaries include all transitive verbs (e.g. $\texttt{cat} \xleftarrow{1} \texttt{CATCH} \xrightarrow{2} \texttt{branch}$) and a handful of binary primitives (e.g. $\texttt{tree} \xleftarrow{1} \texttt{HAS} \xrightarrow{2} \texttt{branch}$).

We start from the intuition that similar concepts will overlap in the elementary configurations they take part in: they might share a 0-neighbor, e.g. $\texttt{train} \xrightarrow{0} \texttt{vehicle} \xleftarrow{0} \texttt{car}$, or they might be on the same path of 1- and 2-edges, e.g. $\texttt{park} \xleftarrow{1} \texttt{IN} \xrightarrow{2} \texttt{town}$ and $\texttt{street} \xleftarrow{1} \texttt{IN} \xrightarrow{2} \texttt{town}$. We define the predicates of a node as the set of such configurations it takes part in. For example, based on the definition graph in Figure 1, the predicates of the concept bird are $\{\texttt{vertebrate}, (\texttt{HAS}, \texttt{feather}), (\texttt{HAS}, \texttt{wing}), (\texttt{MAKE}, \texttt{egg})\}$.

Figure 1: 4lang definition of bird.

Our initial version of graph similarity is the Jaccard similarity of the sets of predicates of each concept, i.e.

$$S(w_1, w_2) = \frac{|P(w_1) \cap P(w_2)|}{|P(w_1) \cup P(w_2)|}$$

For all words that are not among the 3500 defined in 4lang we obtain definition graphs by automated parsing of Longman definitions and the application of a simple mapping from dependency relations to graph edges (Recski, 2015). By far the largest source of noise in these graphs is that currently there is no postprocessor component that recognizes common structures of dictionary definitions such as appositive relative clauses. For example, the word casualty is defined by LDOCE as someone who is hurt or killed in an accident or war, and we currently build the graph in Figure 2 instead of that in Figure 3. To mitigate the effects of these anomalies, we updated our definition of predicates: we let them be "inherited" via paths of 0-edges encoding the IS_A relationship.

Figure 2: Definition of casualty built from LDOCE.

Figure 3: Expected definition of casualty.

We have also experimented with similarity measures that take into account the sets of all nodes accessible from each concept in their respective definition graphs. This proved useful in establishing that two concepts which would otherwise be treated as entirely dissimilar are in fact somewhat related.
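To make the predicate-based measure concrete, here is a minimal sketch that assumes a definition graph encoded as a set of (source, label, target) triples with labels 0, 1, and 2; this encoding and all names are ours for illustration and need not match the released code:

```python
from typing import Set, Tuple, Union

Edge = Tuple[str, int, str]              # (source, label, target)
Predicate = Union[str, Tuple[str, str]]

def predicates(node: str, edges: Set[Edge]) -> Set[Predicate]:
    """Elementary configurations a node takes part in: its
    0-neighbors, plus a (B, x) pair for every binary B such that
    node <-1- B -2-> x."""
    preds: Set[Predicate] = {tgt for src, label, tgt in edges
                             if src == node and label == 0}
    # a binary is the source of both its 1-edge and its 2-edge
    binaries = {src for src, label, tgt in edges
                if label == 1 and tgt == node}
    preds |= {(src, tgt) for src, label, tgt in edges
              if label == 2 and src in binaries}
    return preds

def graph_similarity(w1: str, edges1: Set[Edge],
                     w2: str, edges2: Set[Edge]) -> float:
    """Jaccard similarity of the two predicate sets."""
    p1, p2 = predicates(w1, edges1), predicates(w2, edges2)
    return len(p1 & p2) / len(p1 | p2) if p1 | p2 else 0.0

# Encoding the Figure 1 graph for bird in this form reproduces the
# predicate set given above:
bird = {("bird", 0, "vertebrate"),
        ("HAS", 1, "bird"), ("HAS", 2, "feather"), ("HAS", 2, "wing"),
        ("MAKE", 1, "bird"), ("MAKE", 2, "egg")}
assert predicates("bird", bird) == {
    "vertebrate", ("HAS", "feather"), ("HAS", "wing"), ("MAKE", "egg")}
```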
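The inheritance fix can then be layered on top of predicates. This sketch reads "inherited via paths of 0-edges" as collecting the predicates of every concept reachable through 0-edges, which is our interpretation of the description above:

```python
def predicates_with_inheritance(node: str,
                                edges: Set[Edge]) -> Set[Predicate]:
    """Union of the predicates of the node and of every concept
    reachable from it via 0-edges (the IS_A relation)."""
    seen: Set[str] = set()
    stack = [node]
    preds: Set[Predicate] = set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        preds |= predicates(n, edges)
        # follow 0-edges upward, e.g. casualty -> someone
        stack.extend(tgt for src, label, tgt in edges
                     if src == n and label == 0)
    return preds
```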