UMD-TTIC-UW at SemEval-2016 Task 1: Attention-Based Multi-Perspective Convolutional Neural Networks for Textual Similarity Measurement

Hua He¹, John Wieting², Kevin Gimpel², Jinfeng Rao¹, and Jimmy Lin³
¹ Department of Computer Science, University of Maryland, College Park
² Toyota Technological Institute at Chicago
³ David R. Cheriton School of Computer Science, University of Waterloo
{huah,jinfeng}@umd.edu, {jwieting,kgimpel}@ttic.edu, [email protected]

Abstract

We describe an attention-based convolutional neural network for the English semantic textual similarity (STS) task in the SemEval-2016 competition (Agirre et al., 2016). We develop an attention-based input interaction layer and incorporate it into our multi-perspective convolutional neural network (He et al., 2015), using the PARAGRAM-PHRASE word embeddings (Wieting et al., 2016) trained on paraphrase pairs. Without using any sparse features, our final model outperforms the winning entry of STS2015 when evaluated on the STS2015 data.

1 Introduction

Measuring the semantic textual similarity (STS) of two pieces of text remains a fundamental problem in language research. It lies at the core of many language processing tasks, including paraphrase detection (Xu et al., 2014), question answering (Lin, 2007), and query ranking (Duh, 2009).

The STS problem can be formalized as follows: given a query sentence S1 and a comparison sentence S2, compute their semantic similarity in terms of a similarity score sim(S1, S2). The SemEval Semantic Textual Similarity tasks (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015; Agirre et al., 2016) are a popular evaluation venue for the STS problem. Over the years the competitions have made more than 15,000 human-annotated sentence pairs publicly available, and have evaluated over 300 system runs.

Traditional approaches are based on hand-crafted feature engineering (Wan et al., 2006; Madnani et al., 2012; Fellbaum, 1998; Fern and Stevenson, 2008; Das and Smith, 2009; Guo and Diab, 2012; Sultan et al., 2014; Kashyap et al., 2014; Lynum et al., 2014). Competitive systems in recent years are mostly based on neural networks (He et al., 2015; Tai et al., 2015; Yin and Schütze, 2015; He and Lin, 2016), which can alleviate data sparseness with pre-training and distributed representations.

In this paper, we extend the multi-perspective convolutional neural network (MPCNN) of He et al. (2015). Most previous neural network models, including the MPCNN, treat input sentences separately and largely ignore context-sensitive interactions between them. We address this problem by utilizing an attention mechanism (Bahdanau et al., 2014) to develop an attention-based input interaction layer (Sec. 3). It converts the two independent input sentences into an inter-related sentence pair, which can help the model identify important input words for improved similarity measurement. We also use the strongly performing PARAGRAM-PHRASE word embeddings (Wieting et al., 2016) (Sec. 4), trained on phrase pairs from the Paraphrase Database (Ganitkevitch et al., 2013).

These components comprise our submission to the SemEval-2016 STS competition (shown in Figure 1): an attention-based multi-perspective convolutional neural network augmented with PARAGRAM-PHRASE word embeddings. We provide details of each component in the following sections. Unlike much previous work in the SemEval competitions (Šarić et al., 2012; Sultan et al., 2014), we do not use sparse features, syntactic parsers, or external resources like WordNet.

[Figure 1 (diagram): the pipeline runs from the Paragram-Phrase word embeddings of the two input sentences ("Cats Sit On The Mat" / "On The Mat There Sit Cats"), through the attention-based input interaction layer and the multi-perspective sentence model, to the structured similarity measurement layer and the output similarity score.]

Figure 1: Model overview. Input sentences are processed by the attention-based input interaction layer and the multi-perspective convolutional sentence model, then compared by the structured similarity measurement layer. The shaded components are our additions to the MPCNN model for the competition.

2 Base Model: Multi-Perspective Convolutional Neural Networks

We use the recently proposed multi-perspective convolutional neural network (MPCNN) of He et al. (2015) due to its competitive performance.[1] It consists of two major components:

1. A multi-perspective sentence model for converting a sentence into a representation. A convolutional neural network captures different granularities of information in each sentence using multiple types of convolutional filters, types of pooling, and window sizes.

2. A structured similarity measurement layer with multiple similarity metrics for comparing local regions of the sentence representations.

The MPCNN model has a Siamese structure (Bromley et al., 1993), with a multi-perspective sentence model for each of the two input sentences.

Multiple Convolutional Filters. The MPCNN model applies two types of convolutional filters: 1-d per-dimension filters and 2-d holistic filters. The holistic filters operate over sliding windows while considering the full dimensionality of the word embeddings, like typical temporal convolutional filters. The per-dimension filters focus on information at a finer granularity and operate over sliding windows of each dimension of the word embeddings. Per-dimension filters can find and extract information from individual dimensions, while holistic filters can discover broader patterns of contextual information. We use both kinds of filters for a richer representation of the input.

Multiple Window Sizes. The window size denotes how many words are matched by a filter. The MPCNN model uses filters with different window sizes ws in order to capture information at different n-gram lengths. We use filters with ws selected from {1, 2, 3}, so our filters can find unigrams, bigrams, and trigrams in the input sentences. In addition, to retain the raw information in the input, ws is also set to ∞, in which case the pooling layers are applied directly over the entire sentence embedding matrix, with no convolution layers in between.

Multiple Pooling Types. For each output vector of a convolutional filter, the MPCNN model converts it to a scalar via a pooling layer. Pooling helps a convolutional model retain the most prominent and prevalent features, which is helpful for robustness across examples. One widely adopted pooling layer is max pooling, which applies a max operation over the input vector and returns the maximum value. In addition to max pooling, the MPCNN model uses two other types of pooling, min and mean, to extract different aspects of the filter matches.

Similarity Measurement Layer. After the sentence models produce representations for each sentence, we use a module that performs comparisons between the two sentence representations to output a final similarity score. One simple way to do this would be to flatten each sentence representation into a vector and then apply a similarity function such as cosine similarity. However, this discards important information, because particular regions of the sentence representations come from different underlying sources. Therefore, the MPCNN model performs structured similarity measurements over particular local regions of the sentence representations.

The MPCNN model uses rules to identify local regions whose underlying components are related. These rules consider whether the local regions are: (1) from the same filter type; (2) from a convolutional filter with the same window size ws; (3) from the same pooling type; (4) from the same specific filter of the underlying convolutional filter type. Only feature vectors that share at least two of the above are compared. There are two algorithms, using three similarity metrics, for comparing local regions: one works on the output of holistic filters only, while the other uses the outputs of both the holistic and per-dimension filters.

On top of the structured similarity measurement layer, we stack two linear layers with a tanh activation layer in between, followed by a log-softmax layer. More details are provided in He et al. (2015).

[1] https://github.com/hohoCode
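To make the structure of the sentence model concrete, the following is a minimal NumPy sketch of the ideas above, not the released Torch implementation: the function names (holistic_conv, per_dim_conv, sentence_model) are ours, the filter weights are random placeholders rather than learned parameters, and the number of holistic filters is illustrative. It builds holistic and per-dimension filter outputs for window sizes {1, 2, 3}, applies max, min, and mean pooling, and keeps the results grouped by (filter type, window size, pooling type), which is what the structured similarity layer later compares.

import numpy as np

def holistic_conv(S, W):
    # Slide a 2-d holistic filter W (ws x d) over the sentence matrix S (length x d).
    ws = W.shape[0]
    return np.array([np.tanh(np.sum(S[j:j + ws] * W))
                     for j in range(S.shape[0] - ws + 1)])

def per_dim_conv(col, w):
    # Slide a 1-d per-dimension filter w (length ws) over one embedding dimension.
    ws = w.shape[0]
    return np.array([np.tanh(np.dot(col[j:j + ws], w))
                     for j in range(col.shape[0] - ws + 1)])

def sentence_model(S, n_holistic=8, window_sizes=(1, 2, 3), seed=0):
    # Return a dict mapping (filter_type, ws, pooling) to a pooled feature vector.
    rng = np.random.RandomState(seed)
    d = S.shape[1]
    pools = (("max", np.max), ("min", np.min), ("mean", np.mean))
    groups = {}
    for ws in window_sizes:
        holistic = [holistic_conv(S, rng.randn(ws, d)) for _ in range(n_holistic)]
        per_dim = [per_dim_conv(S[:, k], rng.randn(ws)) for k in range(d)]
        for name, pool in pools:
            groups[("holistic", ws, name)] = np.array([pool(o) for o in holistic])
            groups[("per_dim", ws, name)] = np.array([pool(o) for o in per_dim])
    # ws = "inf": pool directly over the raw sentence matrix, with no convolution.
    for name, pool in pools:
        groups[("holistic", "inf", name)] = pool(S, axis=0)
    return groups

# Example usage: a 5-word sentence with 10-dimensional embeddings.
# groups = sentence_model(np.random.randn(5, 10))

In the full model, feature vectors from compatible groups (sharing at least two of filter type, window size, and pooling type) are compared with multiple similarity metrics, rather than flattened into a single vector.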

3 Attention-Based Input Interaction Layer

The MPCNN model treats the input sentences separately with two neural networks in parallel, which ignores contextual interaction information between the inputs. We instead utilize an attention mechanism (Bahdanau et al., 2014) and develop an attention-based interaction layer that converts the two independent input sentences into an inter-related sentence pair. We incorporate this into the base MPCNN model as the first layer of our system. It is applied over the raw word embeddings of the input sentences to generate re-weighted word embeddings. The attention-based re-weightings can guide the focus of the MPCNN model onto important input words: words in one sentence that are more relevant to the other sentence receive higher weights.

We first define the input sentence representation $S_i \in \mathbb{R}^{\ell_i \times d}$ ($i \in \{0, 1\}$) to be a sequence of $\ell_i$ words, each with a $d$-dimensional word embedding vector. $S_i[a]$ denotes the embedding vector of the $a$-th word in $S_i$. We then define an attention matrix $D \in \mathbb{R}^{\ell_0 \times \ell_1}$, whose entry $(a, b)$ represents the pairwise word similarity score between the $a$-th word embedding of $S_0$ and the $b$-th word embedding of $S_1$, computed with cosine similarity:

$$D[a][b] = \mathrm{cosine}(S_0[a], S_1[b])$$

Given the attention matrix $D$, we generate an attention weight vector $A_i \in \mathbb{R}^{\ell_i}$ for each input sentence $S_i$ ($i \in \{0, 1\}$). Each entry $A_i[a]$ can be viewed as an attention-based relevance score of the word embedding $S_i[a]$ with respect to all word embeddings of the other sentence $S_{1-i}[:]$. The attention weights $A_i[:]$ sum to one due to the softmax normalization:

$$E_0[a] = \sum_b D[a][b], \qquad E_1[b] = \sum_a D[a][b], \qquad A_i = \mathrm{softmax}(E_i)$$

We finally define updated embeddings $\mathrm{attenEmb} \in \mathbb{R}^{2d}$ for each word as the concatenation of the original and attention-reweighted word embeddings:

$$\mathrm{attenEmb}_i[a] = \mathrm{concat}(S_i[a],\, A_i[a] \odot S_i[a])$$

where $\odot$ represents element-wise multiplication.

Our input interaction layer is inspired by recent work that incorporates attention mechanisms into neural networks (Bahdanau et al., 2014; Rush et al., 2015; Yin et al., 2015; Rocktäschel et al., 2016). Many of these add parameters and computational complexity to the model, whereas our attention-based input layer is simpler and more efficient. Moreover, we do not introduce any additional parameters, since we simply use cosine similarity to create the attention weights. Nevertheless, adding this attention layer improves performance, as we show in Section 5.
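The following is a direct NumPy rendering of the equations above; it is a sketch rather than the released implementation, and the function and variable names are ours. The inputs are the two sentence embedding matrices, and the outputs are the 2d-dimensional re-weighted representations that are fed to the MPCNN sentence models.

import numpy as np

def atten_interaction(S0, S1, eps=1e-8):
    # Attention-based input interaction layer (Sec. 3).
    # S0: (l0 x d) and S1: (l1 x d) word embedding matrices of the two sentences.
    n0 = S0 / (np.linalg.norm(S0, axis=1, keepdims=True) + eps)
    n1 = S1 / (np.linalg.norm(S1, axis=1, keepdims=True) + eps)
    D = n0 @ n1.T                          # D[a][b] = cosine(S0[a], S1[b])

    E0 = D.sum(axis=1)                     # relevance of each word in S0 w.r.t. S1
    E1 = D.sum(axis=0)                     # relevance of each word in S1 w.r.t. S0
    A0 = np.exp(E0 - E0.max()); A0 /= A0.sum()   # softmax normalization
    A1 = np.exp(E1 - E1.max()); A1 /= A1.sum()

    # attenEmb_i[a] = concat(S_i[a], A_i[a] * S_i[a]); output width is 2d.
    atten0 = np.concatenate([S0, A0[:, None] * S0], axis=1)
    atten1 = np.concatenate([S1, A1[:, None] * S1], axis=1)
    return atten0, atten1

Because the attention weights are derived purely from cosine similarities and a softmax, the layer adds no trainable parameters, which matches the efficiency argument made above.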
4 Word Embeddings

We compare several types of word embeddings for representing the initial sentence matrices ($S_i$). We use the PARAGRAM-SL999 embeddings from Wieting et al. (2015) and the PARAGRAM-PHRASE embeddings from Wieting et al. (2016). Both were constructed from the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) by training on noisy paraphrase pairs with a hinge-based loss and negative sampling, but they were trained on two different types of data.

The PARAGRAM-SL999 embeddings were trained on the lexical section of PPDB, which consists of word pairs only. The PARAGRAM-PHRASE embeddings were trained on the phrasal section of PPDB, which consists of phrase pairs. The representations for the phrases were created by simply averaging word embeddings, which was found to outperform more complicated compositional architectures like LSTMs (Hochreiter and Schmidhuber, 1997)[2] when evaluated on out-of-domain data. The resulting word embeddings yield sentence embeddings (via simple averaging) that perform well across STS tasks without task-specific tuning. Their performance is thought to be due in part to the fact that the vectors for less important words have smaller norms than those for information-bearing words.

[2] For in-domain evaluation, LSTMs outperformed averaging.
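To illustrate why these embeddings are attractive on their own, the sketch below computes sentence embeddings by simple averaging and scores a pair by cosine similarity. It assumes a hypothetical word_vectors dictionary mapping each token to its 300-dimensional PARAGRAM-PHRASE vector; in our full system the embeddings instead initialize the sentence matrices S_i that are fed to the network.

import numpy as np

def sentence_embedding(tokens, word_vectors, d=300):
    # Average the vectors of the in-vocabulary tokens; zero vector if none are found.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

def averaged_similarity(tokens_a, tokens_b, word_vectors):
    # Cosine similarity between the two averaged sentence embeddings.
    a = sentence_embedding(tokens_a, word_vectors)
    b = sentence_embedding(tokens_b, word_vectors)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0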

5 Experiments and Results

Datasets. The test data of the SemEval-2016 English STS competition consists of five datasets from different domains. We tokenize all data using Stanford CoreNLP (Manning et al., 2014). Each pair has a similarity score in [0, 5] that increases with similarity. We use training data from the previous STS competitions (2012 to 2015). Table 1 provides a brief description.

Table 1: Data statistics for STS2016.

STS2016           | Domain              | Pairs
answer-answer     | Q&A forums          | 254
headlines         | news headlines      | 249
plagiarism        | short answer corpus | 230
postediting       | MT postedits        | 244
question-question | Q&A forums          | 209
Test Total        | –                   | 1,186
Train Total       | STS2012–2015        | 14,342

Experimental Settings. We largely follow the same experimental settings as He et al. (2015); e.g., we perform optimization with stochastic gradient descent using a fixed learning rate of 0.01. We use the 300-dimensional PARAGRAM-PHRASE XXL word embeddings (d = 300).

Results on STS2016. We provide the results of three runs in Table 2. The three runs come from the same system, but use models from different training epochs.

Table 2: Pearson's r on all five test sets for our three submission runs.

STS2016           | 1st run | 2nd run | 3rd run
answer-answer     | 0.6607  | 0.6443  | 0.6432
headlines         | 0.7946  | 0.7871  | 0.7780
plagiarism        | 0.8154  | 0.7989  | 0.7816
postediting       | 0.8094  | 0.7934  | 0.7779
question-question | 0.6187  | 0.5947  | 0.5586
Wt. Mean          | 0.7420  | 0.7262  | 0.7111

Ablation Study on STS2015. Table 3 shows an ablation study on the STS2015 test sets, which consist of 3,000 sentence pairs from five domains. Our training data for the ablation study comes from the previous test sets of STS2012–2014, following the rules of the STS2015 competition (Agirre et al., 2015). We remove or replace one component at a time from the full system and perform re-training and re-testing.

Table 3: Ablation study on the STS2015 test data.

Ablation Study on STS2015                          | Pearson's r
Full System                                        | 0.8040
 − Remove the attention layer (Sec. 3)             | 0.7948
 − Replace PARAGRAM-PHRASE with GloVe (Sec. 4)     | 0.7622
 − Replace PARAGRAM-PHRASE with PARAGRAM-SL999     | 0.7721
Winning System of STS2015                          | 0.8015

We observe a significant drop when the attention-based input interaction layer (Sec. 3) is removed. We also find that the PARAGRAM-PHRASE word embeddings are highly beneficial, outperforming both the GloVe word embeddings (Pennington et al., 2014) and the PARAGRAM-SL999 embeddings of Wieting et al. (2015). Our full system also performs favorably compared to the winning system (Sultan et al., 2015) of the STS2015 SemEval competition.
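For reference, the sketch below shows one way the numbers in Table 2 could be reproduced: Pearson's r per dataset (here via scipy.stats.pearsonr), aggregated as a mean weighted by the number of pairs in each dataset. The weighting detail is our assumption about the "Wt. Mean" row rather than something specified above.

from scipy.stats import pearsonr

def dataset_score(predicted, gold):
    # Pearson's r between system scores and gold [0, 5] similarity annotations.
    r, _ = pearsonr(predicted, gold)
    return r

def weighted_mean(rs, sizes):
    # Aggregate per-dataset correlations, weighted by the number of pairs
    # (our assumption for the "Wt. Mean" row in Table 2).
    return sum(r * n for r, n in zip(rs, sizes)) / sum(sizes)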
6 Conclusion

Our submission to the SemEval-2016 STS competition uses our multi-perspective convolutional neural network as the base model. We develop an attention-based input interaction layer that guides the convolutional neural network to focus on the most important input words. We further improve performance by using the PARAGRAM-PHRASE word embeddings, yielding a result on the 2015 test data that surpasses that of the top system from STS2015.

Acknowledgments

This work was supported by NSF awards IIS-1218043 and CNS-1405688. The views expressed here are those of the authors and do not necessarily reflect those of the sponsor. We thank Mohit Bansal and Karen Livescu for helpful comments and NVIDIA for additional hardware support.

References

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 385–393.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 81–91.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation, pages 252–263.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity - monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "Siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):669–688.

Dipanjan Das and Noah A. Smith. 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics, pages 468–476.

Kevin K. Duh. 2009. Learning to Rank with Partially-Labeled Data. Ph.D. thesis, University of Washington.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Samuel Fern and Mark Stevenson. 2008. A semantic similarity approach to paraphrase detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special-Interest Group for Computational Linguistics, pages 45–52.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764.

Weiwei Guo and Mona Diab. 2012. Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 864–872.

Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi, and Tim Finin. 2014. Meerkat mafia: Multilingual and cross-level semantic textual similarity systems. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 416–423.

Jimmy Lin. 2007. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2):1–55.

André Lynum, Partha Pakray, Björn Gambäck, and Sergio Jimenez. 2014. NTNU: Measuring semantic similarity with sublexical feature representations and soft cardinality. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 448–453.

Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–190.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of the 4th International Conference on Learning Representations.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Frane Šarić, Goran Glavaš, Mladen Karan, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. TakeLab: Systems for measuring semantic text similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 441–448.

Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2014. DLS@CU: Sentence similarity from word alignment. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 241–246.

Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2015. DLS@CU: Sentence similarity from word alignment and semantic vector composition. In Proceedings of the 9th International Workshop on Semantic Evaluation, pages 148–153.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1556–1566.

Stephen Wan, Mark Dras, Robert Dale, and Cecile Paris. 2006. Using dependency-based features to take the "para-farce" out of paraphrase. In Australasian Language Technology Workshop, pages 131–138.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345–358.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proceedings of the 4th International Conference on Learning Representations.

Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan, and Yangfeng Ji. 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2:435–448.

Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2015. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. CoRR, abs/1512.05193.