Voltron: a Hybrid System for Answer Validation Based on Lexical and Distance Features
Ivan Zamanov (1), Nelly Hateva (1), Marina Kraeva (1), Ivana Yovcheva (1), Ivelina Nikolova (2), Galia Angelova (2)
1 FMI, Sofia University, Sofia, Bulgaria
2 IICT, Bulgarian Academy of Sciences, Sofia, Bulgaria

Abstract

The purpose of this paper is to describe our submission to the SemEval-2015 Task 3 on Answer Selection in Community Question Answering. We participated in subtask A, where systems had to classify community answers to a given question as definitely relevant, potentially useful, or irrelevant. For every question-answer pair in the training data we extract a vector with a variety of features. These vectors are then fed to a MaxEnt classifier for training. Given a question and an answer, the trained classifier outputs class probabilities for each of the three desired categories, and the one with the highest probability is chosen. Our system scores better than the average score in subtask A of Task 3.

1 Introduction

Nowadays, text analysis and semantic similarity are the subject of much research and experimentation due to the growing influence of social media, the increasing use of forums for finding solutions to commonly encountered problems, and the growth of the Web. As newcomers to the field of computational linguistics, we were very interested in these topics and found answer validation to be a good starting point. Our team chose to focus on subtask A of Task 3 in the SemEval-2015 workshop, namely answer selection in community question answering data. In order to achieve good results, we combined most of the techniques familiar to us. We process the data as question-answer pairs. The GATE framework (Cunningham et al., 2002) was used for preprocessing because it offers convenient natural language processing pipelines and has an API that allows for system integration. For classification we used the Maximum Entropy classifier provided by MALLET (McCallum and Kachites, 2002). We use a combination of surface, morphological, syntactic, and contextual features, as well as distance metrics between the question and the answer. The distance metrics are based on word2vec (Mikolov et al., 2013a) and DKPro Similarity (Bär et al.; de Castilho, 2014).

2 Related work

Several recent systems have been created and used for similar analyses. Although their applications differ in some respects from the system described in this paper, we consider them relevant because they deal with semantic similarity.

(Başkaya, 2014) uses Vector Space Models, which bear some similarity to our use of word2vec centroid metrics, with the difference that we do not organize the whole text according to the structure of the resulting matrix, as the VSMs do. Cosine similarity is common to both systems. The big difference is that we use only the input words, whereas his system also uses the words' likely synonyms according to a language model. We believe this contributes to the consistently higher scores of his system.

Another work, (Vilariño et al., 2014), also uses n-grams and cosine similarity, which it has in common with our system. Some differing features are the Jaccard coefficient, Latent Semantic Analysis, and Pointwise Mutual Information. Their results are very close to ours.

Most of the works dealing with semantic similarity use n-grams, metadata features, and stop words, as we do. Our scores are not among the highest in subtask A of Task 3, but they are substantially above the average score in this field.

3 Resources

The datasets we use to train our system are provided by the SemEval-2015 organizers. They consist of 2,600 training and 300 development questions, comprising 16,541 training and 1,645 development comments.

For the extraction of some features we also use pre-trained word and phrase vectors. They are trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.
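As a minimal sketch of this resource (assuming the gensim library and the standard GoogleNews-vectors-negative300.bin file, neither of which is specified in this paper), the pre-trained model can be loaded as follows:

```python
# Minimal sketch: loading the pre-trained Google News word2vec model.
# gensim and the file name are assumptions; the paper does not say
# which tool is used to read the vectors.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(vectors.vector_size)     # 300-dimensional vectors
print("question" in vectors)   # True if the token has a pre-trained vector
```

The loaded vectors are used by the word2vec distance features described in Section 4.1.6.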
4 Method

The task at hand is to measure how appropriate and/or informative a comment is with respect to a question. Our approach is to measure the relatedness of a comment to the question or, in other words, to measure whether a question-comment pair is consistent. We therefore attempt to classify each pair as Good, Potential, or Bad.

The main characteristic of a good comment is that it is related to the corresponding question. We also assume that, when answering a question, people tend to use the same words with which the question was asked, because that makes it easier for the question author to understand. Therefore, similar wording, and especially similar phrases, would be an indication of a more informative comment.

4.1 Features

We will call tokens that are not punctuation or stop words meaningful, as they carry some information regardless of exactly how a sentence is formulated.

4.1.1 Lexical Features

For every meaningful token, we extract its stem, lemma, and orthography.

4.1.2 N-gram Features

Bigrams and trigrams of tokens (including non-meaningful ones) are also extracted, since these should capture similar phrases used in the question-comment pair. We assume that n-grams of higher order could contribute as well; however, we believe that n = 2 and n = 3 carry the most information, while n ≥ 4 would impact training time adversely.

4.1.3 Bad-answer-specific Features

Bad comments often include a lot of punctuation, more than one question in the answer, questions followed by their obvious answer (when the expression or its synonyms can be found directly in the answer), more than two repeated letters next to each other (i.e. exclamations such as "ahaa"), greetings, chat abbreviations, more than one uppercase word, a lot of emoticons, exclamations, and other largely meaningless words. Emphasizing such tokens helps to distinguish bad comments specifically.

4.1.4 Structural Features

We include the comment's length in meaningful tokens, its length in sentences, and each sentence's length as features, since longer comments should include more information. Since named entities such as locations and organizations are especially indicative of the topic similarity between question and comment, we give them greater weight by additionally including the named entities recognized by GATE's built-in NER tools as features.

4.1.5 TF Vector Space Features

Another attempt to capture similar terms in the question and the comment is to convert each entry to a local term-frequency vector and compute the cosine similarity between the vectors for the question and the comment, rounded to 0.1 precision. Similar wording, regardless of term occurrence frequency, should lead to a higher cosine similarity. We use DKPro's implementation of cosine similarity to achieve this (Bär et al.). The term "local" refers to the fact that the TF vectors of distinct entries are not related, that is, the vector space is specific to a question-comment pair.

4.1.6 Word2vec Semantic Similarity

A good answer, however, does not necessarily use the exact same words. Therefore we need a way to capture the general "topic" of a question. We opted for the word2vec word vectors proposed by (Mikolov et al., 2013a; 2013b; 2013c). The general idea of word2vec is to represent each word as a real vector that captures the contexts in which the word occurs in a corpus. For a given question-comment pair, we extract word2vec vectors from a pre-trained set for all tokens for which one is available. We compute the centroids of the question and of the comment, then use the cosine between the two as a feature. The intention is to capture the similarity between different terms in the pair. The same procedure is then applied once more to noun phrase (NP) tokens only, because they carry more information about the topic than other parts of speech.
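To make the two distance features concrete, the following is a minimal sketch of the local TF cosine (Section 4.1.5) and the word2vec centroid cosine (Section 4.1.6) in plain Python. It is illustrative only: the system itself computes the TF cosine with DKPro Similarity inside a GATE pipeline, and the token lists below are assumed to be the preprocessed (meaningful) tokens.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def local_tf_cosine(question_tokens, comment_tokens):
    """Section 4.1.5: cosine over term-frequency vectors built in a
    vector space local to this question-comment pair, rounded to 0.1."""
    q_tf, c_tf = Counter(question_tokens), Counter(comment_tokens)
    vocab = sorted(set(q_tf) | set(c_tf))      # pair-local vocabulary
    q_vec = [q_tf[t] for t in vocab]
    c_vec = [c_tf[t] for t in vocab]
    return round(cosine(q_vec, c_vec), 1)

def centroid(tokens, vectors):
    """Mean of the word2vec vectors of all tokens that have one."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def word2vec_centroid_cosine(question_tokens, comment_tokens, vectors):
    """Section 4.1.6: cosine between the question centroid and the
    comment centroid in word2vec space."""
    q_cen = centroid(question_tokens, vectors)
    c_cen = centroid(comment_tokens, vectors)
    if q_cen is None or c_cen is None:
        return 0.0
    return cosine(q_cen, c_cen)
```

Both similarity values are added to the feature list of the question-comment pair, alongside the lexical, n-gram, bad-answer, and structural features described above.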
4.2 Classifier Model

After all the described features are extracted, they form a list of string values associated with each question-answer pair. As explained above, some of them are characteristic of bad answers, while others are mainly found in good ones. Therefore, it makes sense to consider the feature list for a given question-answer pair as a document in itself.

We focus on the coarse-grained evaluation over the three main classes (Good, Potential, Bad), since our system does not try to target the finer-grained classification.

We defined our baseline system as one that uses only the lexical and structural features described in the Method section, i.e. word tokens, sentence, question, and answer lengths, as well as the bigrams and trigrams of the question-answer pair. With only these features the system is very weak: the accuracy as reported by the scorer script against the gold standard is 44.18% and the F1 score is 24.05%.

Next, we included the features that rely on GATE gazetteers, such as the named entity features. This improved the system's performance by more than 1%, reaching an accuracy of 45.14% and an F1 score of 25.33%.

Another experiment was to add only the DKPro cosine similarity to the baseline system. This approach yielded a significant increase in the
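For completeness, here is a minimal sketch of the classification step of Section 4.2. The system uses MALLET's Maximum Entropy classifier; the code below substitutes scikit-learn's multinomial logistic regression (an equivalent model) purely for illustration, and the feature strings and labels are invented examples rather than task data.

```python
# Illustrative stand-in for the MALLET MaxEnt classifier of Section 4.2:
# the feature list of each question-answer pair is treated as a small
# "document" of feature strings, and multinomial logistic regression
# (the same model family as MaxEnt) is trained on the bag of features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical feature "documents": one space-joined feature list per pair.
train_docs = [
    "stem=repair lemma=repair bigram=car_repair tf_cos=0.7 w2v_cos=0.81",
    "many_emoticons repeated_letters tf_cos=0.0 w2v_cos=0.12",
    "stem=visa ne=Qatar tf_cos=0.3 w2v_cos=0.55",
]
train_labels = ["Good", "Bad", "Potential"]

# Each distinct feature string becomes one column of the design matrix.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(train_docs)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, train_labels)

# At prediction time the class with the highest probability is chosen.
test_doc = ["stem=repair bigram=car_repair tf_cos=0.6 w2v_cos=0.74"]
probs = clf.predict_proba(vectorizer.transform(test_doc))[0]
print(dict(zip(clf.classes_, probs)))
print(clf.classes_[probs.argmax()])   # predicted label
```

As stated in the abstract, the class with the highest predicted probability is the one assigned to the question-answer pair.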