L2F/INESC-ID at SemEval-2017 Tasks 1 and 2: Lexical and semantic features in word and textual similarity

Pedro Fialho, Hugo Rodrigues, Luísa Coheur and Paulo Quaresma
Spoken Language Systems Lab (L2F), INESC-ID
Rua Alves Redol 9, 1000-029 Lisbon, Portugal
[email protected]

Abstract

This paper describes our approach to the SemEval-2017 "Semantic Textual Similarity" and "Multilingual Word Similarity" tasks. In the former, we test our approach in both English and Spanish, and use a linguistically rich set of features, which move from lexical to semantic ones. In particular, we try to take advantage of the recent Abstract Meaning Representation and the SMATCH measure. Although without state-of-the-art results, we introduce semantic structures into textual similarity and analyze their impact. Regarding word similarity, we target the English language and combine WordNet information with Word Embeddings. Without matching the best systems, our approach proved to be simple and effective.

1 Introduction

In this paper we present two systems that competed in the SemEval-2017 tasks "Semantic Textual Similarity" and "Multilingual Word Similarity", using supervised and unsupervised techniques, respectively.

For the first task we used lexical features, as well as a semantic feature based on the Abstract Meaning Representation (AMR) and on the SMATCH measure. AMR is a semantic formalism, structured as a graph (Banarescu et al., 2013). SMATCH is a metric for the comparison of AMRs (Cai and Knight, 2013). To the best of our knowledge, neither had yet been applied to Semantic Textual Similarity. In this paper we focus on the contribution of the SMATCH score as a semantic feature for Semantic Textual Similarity, relative to a model based on lexical clues only.

For word similarity, we test semantic equivalence functions based on WordNet (Miller, 1995) and Word Embeddings (Mikolov et al., 2013). Experiments were performed on the test data provided in the SemEval-2017 tasks and yielded competitive results, although outperformed by other approaches in the official ranking.

The document is organized as follows: in Section 2 we briefly discuss related work; in Sections 3 and 4 we describe our systems for the "Semantic Textual Similarity" and "Multilingual Word Similarity" tasks, respectively; in Section 5 we present the main conclusions and point to future work.

2 Related work

The general architecture of our STS system is similar to that of Brychcín and Svoboda (2016), Potash et al. (2016) or Tian and Lan (2016), but we employ more lexical features and AMR semantics. Brychcín and Svoboda (2016) model feature dependence in Support Vector Machines by using the products between pairs of features as new features, while we rely on neural networks. Potash et al. (2016) conclude that feature-based systems perform better than structural learning with syntax trees; they employ a fully-connected neural network on hand-engineered features and on an ensemble of predictions from feature-based and structure-based systems. We also employ a similar neural network on hand-engineered features, but use semantic graphs to obtain one of those features.

For word similarity, our approach isolates the micro view seen in Tian and Lan (2016), where word embeddings are applied to measure the similarity of word pairs in an unsupervised manner. That work also describes supervised experiments on a macro/sentence view, which employ hand-engineered features and the Gradient Boosting algorithm, as in our STS system. Henry and Sands (2016) employ WordNet for their sentence and chunk similarity metric, as also occurs in our system for word similarity.

3 Task 1 - Semantic textual similarity

In this section we describe our participation in Task 1 of SemEval-2017 (Cer et al., 2017), aimed at assessing the ability of a system to quantify the semantic similarity between two sentences, using a continuous value from 0 to 5 where 5 means semantic equivalence. The task is defined for monolingual and cross-lingual pairs. We participated in the monolingual evaluation for English, and we also report results for Spanish, both with test sets composed of 250 pairs. Most of our lexical features are language independent, so we use the same model for both languages.

For a pair of sentences, our system collects the numeric output of metrics that assess their similarity in lexical or semantic terms. These features are supplied to a machine learning algorithm either to a) build a model, using pairs labeled with an equivalence value (compliant with the task), or to b) predict such a value, using the model.

3.1 Features

In our system, the similarity between two sentences is represented by multiple continuous values, obtained from metrics designed to leverage lexical or semantic analysis in the comparison of sequences or structures. Lexical features are also applied to alternative views of the input text, such as character or metaphone sequences (metaphones are symbols representing how a word sounds, according to the Double Metaphone algorithm). A total of 159 features was gathered, of which one relies on semantic representations.

Lexical features are obtained from INESC-ID@ASSIN (Fialho et al., 2016), such as TER, edit distance and 17 others. These are applied to 6 representations of an input pair, totaling 96 features, since not all representations are valid for all metrics (for instance, TER is not applicable to character trigrams). Its metrics and input representations rely on linguistic phenomena, such as the BLEU score on metaphones of the input sentences.

We also gather lexical features from HARRY (http://www.mlsec.org/harry/), where 21 similarity metrics are calculated for the bits, bytes and tokens of a sentence pair, except for the Spectrum kernel on bits (as it is not a valid combination), resulting in 62 of our 159 features.

The only semantic feature is the SMATCH score (Cai and Knight, 2013), which represents the similarity between two AMR graphs (Banarescu et al., 2013). The AMR for each sentence in a pair is generated with JAMR (https://github.com/jflanigan/jamr), and then supplied to SMATCH, which returns a numeric value between 0 and 1 denoting their similarity.

In SMATCH, an AMR is translated into triples that represent variable instances, their relations, and global attributes such as the start node and literals. The final SMATCH score is the maximum F-score of matching triples, according to various variable mappings, obtained by comparing their instance tokens. These are converted to lower case and then matched for exact equality.
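For illustration, the following is a minimal re-implementation of this triple-matching score, not the actual SMATCH tool: real SMATCH searches variable mappings by hill-climbing, whereas this sketch brute-forces all injective mappings and is therefore only tractable for toy-sized graphs. The function names and example triples are ours.

    from itertools import permutations

    def triple_f1(matched, size_a, size_b):
        # F-score from the number of matched triples and the two set sizes.
        if matched == 0:
            return 0.0
        precision, recall = matched / size_a, matched / size_b
        return 2 * precision * recall / (precision + recall)

    def smatch_like(triples_a, vars_a, triples_b, vars_b):
        # Triples are (relation, source, target); instance tokens are
        # lowercased and compared for exact equality, as described above.
        lower = lambda ts: {(r, s.lower(), t.lower()) for r, s, t in ts}
        a, b = lower(triples_a), lower(triples_b)
        best = 0.0
        # Try every injective mapping of A's variables into B's variables
        # (assumes len(vars_a) <= len(vars_b)).
        for image in permutations(vars_b, len(vars_a)):
            mapping = dict(zip(vars_a, image))
            renamed = {(r, mapping.get(s, s), mapping.get(t, t))
                       for r, s, t in a}
            best = max(best, triple_f1(len(renamed & b), len(a), len(b)))
        return best

    # (w / want-01 :ARG0 (b / boy)) versus the same graph with renamed
    # variables; the sketch finds the mapping w->x, b->y and scores 1.0.
    amr_a = [("instance", "w", "want-01"), ("instance", "b", "boy"),
             ("ARG0", "w", "b")]
    amr_b = [("instance", "x", "want-01"), ("instance", "y", "boy"),
             ("ARG0", "x", "y")]
    print(smatch_like(amr_a, ["w", "b"], amr_b, ["x", "y"]))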
3.2 Experimental setup

We applied all metrics to the train, test and trial examples of the SICK corpus (Marelli et al., 2014), and to the train and test examples from previous Semantic Textual Similarity tasks at SemEval, as compiled by Tan et al. (2015).

Our training dataset thus comprises 24623 vectors (9841 of them from SICK), each assigned a continuous value ranging from 0 to 5. Each vector contains our 159 feature values for the similarity between the sentences of an example pair.

We standardized the features by removing the mean and scaling to unit variance and norm. Then, machine learning algorithms were applied to the feature sets to train a model of our Semantic Textual Similarity representations. Namely, we employed ensemble learning by gradient boosting with decision trees, and feedforward neural networks (NN) with 1 and 2 fully connected hidden layers.

SMATCH is not available for Spanish, therefore this feature was left out when evaluating Spanish pairs (es-es). For English pairs (en-en), the scenarios include: a) only lexical features, or b) an ensemble with lexical features and the SMATCH score (without differentiation).

Gradient boosting was applied with the default configuration provided in scikit-learn (Pedregosa et al., 2011). The NN were configured with single and multiple hidden layers, both with a rectifier as activation function. The first layer combines the 159 input features (or 158 when not using SMATCH) into 270 neurons, which are either combined into a second layer with 100 neurons or fed directly to the output layer (with 1 neuron). Finally, we employed the mean squared error cost function and the ADAM optimizer (Kingma and Ba, 2014), and fit a model in 100 epochs with batches of 5.

Our experiments were run with Tensorflow 0.11 (Abadi et al., 2015), with NN implementations from the Keras framework. The gradient boosting implementation is from scikit-learn.
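For reference, the following is a minimal sketch of this setup, assuming a feature matrix X with 159 similarity values per pair and gold scores y. It uses current scikit-learn and Keras APIs rather than the 2017 versions cited above, and random placeholders stand in for the real feature vectors.

    import numpy as np
    from sklearn.preprocessing import StandardScaler, Normalizer
    from sklearn.ensemble import GradientBoostingRegressor
    from tensorflow import keras

    # Placeholders: 159 similarity features per pair, gold scores in
    # [0, 5] (the full training set has 24623 pairs).
    X = np.random.rand(1000, 159)
    y = np.random.uniform(0, 5, 1000)

    # Standardization as described: remove the mean, scale to unit
    # variance, then scale each feature vector to unit norm.
    X = StandardScaler().fit_transform(X)
    X = Normalizer().fit_transform(X)

    # Scenario 1: gradient boosting with decision trees, default config.
    gbr = GradientBoostingRegressor().fit(X, y)

    # Scenario 2: feedforward NN, 159 inputs -> 270 -> 100 -> 1, with
    # rectifier activations, mean squared error and ADAM, trained for
    # 100 epochs in batches of 5. Dropping the 100-neuron layer gives
    # the single-hidden-layer variant.
    nn = keras.Sequential([
        keras.Input(shape=(159,)),
        keras.layers.Dense(270, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1),
    ])
    nn.compile(loss="mse", optimizer="adam")
    nn.fit(X, y, epochs=100, batch_size=5, verbose=0)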
3.3 Results

System performance in the Semantic Textual Similarity task was measured with the Pearson coefficient. A selection of results is shown in Table 1, featuring our different scenarios/configurations, our official scores (in bold), and systems that achieved results similar to ours or are the best in the official ranking.

In order to evaluate the contribution of SMATCH, we analyzed some examples where SMATCH led to a lower deviation from the gold standard and, at the same time, a higher deviation from the runs without SMATCH.

On 15 pairs, SMATCH-based predictions were consistently closer to the gold standard, across all learning algorithms, with an average difference of 0.27 from non-SMATCH predictions. However, after analyzing the resulting AMR of some of these cases, we noticed that information was lost during the AMR conversion. For instance, consider the following examples, which led to the results presented in Table 2.

(A) The player shoots the winning points. / The basketball player is about to score points for his team., with a gold score of 2.8.
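As a closing illustration of the evaluation measure, the Pearson coefficient reported above can be computed directly from system output; in the following minimal sketch both score lists are invented for illustration.

    from scipy.stats import pearsonr

    # Gold similarity scores (0-5) and system predictions for the same
    # sentence pairs; the values are made up for this example.
    gold = [2.8, 4.0, 0.5, 3.2, 1.7]
    predicted = [2.5, 3.7, 1.0, 3.4, 1.2]

    r, p_value = pearsonr(gold, predicted)
    print(f"Pearson r = {r:.3f}")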