A Multi-Resolution Word Embedding for Document Retrieval from Large Unstructured Knowledge Bases

Tolgahan Cakaloglu 1, Xiaowei Xu 2

1 Department of Computer Science, University of Arkansas, Little Rock, Arkansas, United States. 2 Department of Information Science, University of Arkansas, Little Rock, Arkansas, United States. Correspondence to: Tolgahan Cakaloglu <[email protected]>.

arXiv:1902.00663v7 [cs.IR] 22 May 2019

Abstract

Deep language models learning a hierarchical representation have proved to be a powerful tool for natural language processing, text mining, and information retrieval. However, representations that perform well for retrieval must capture semantic meaning at different levels of abstraction or context-scopes. In this paper, we propose a new method to generate multi-resolution word embeddings that represent documents at multiple resolutions in terms of context-scopes. In order to investigate its performance, we use the Stanford Question Answering Dataset (SQuAD) and the Question Answering by Search And Reading (QUASAR) dataset in an open-domain question-answering setting, where the first task is to find documents useful for answering a given question. To this end, we first compare the quality of various text-embedding methods for retrieval performance and give an extensive empirical comparison of various non-augmented base embeddings with and without multi-resolution representation. We argue that multi-resolution word embeddings are consistently superior to their original counterparts, and that deep residual neural models trained specifically for retrieval purposes can yield further significant gains when they are used to augment those embeddings.

1. Introduction

The goal of open-domain question answering is to answer questions posed in natural language, using a collection of unstructured natural language documents such as Wikipedia. Given the recent successes of increasingly sophisticated neural attention based question answering models (Yu et al., 2018), it is natural to break the task of answering a question into two subtasks, as suggested in Chen et al. (2017):

• Retrieval: retrieval of the document most likely to contain all the information needed to answer the question correctly.
• Extraction: utilizing one of the above question-answering models to extract the answer to the question from the retrieved document.

In our case, we use a collection of unstructured natural language documents as our knowledge base and try to answer the questions without knowing which documents they correspond to. Note that we do not benchmark the quality of the extraction phase; therefore, we do not study extracting the answer from the retrieved document, but rather compare the quality of retrieval methods and the feasibility of learning specialized neural models for retrieval purposes. Due to the complexity of natural languages, an optimal word embedding, one that represents natural language documents in a semantic vector space, is crucial for document retrieval. Traditional word embedding methods learn hierarchical representations of documents, where each layer gives a representation that is a higher-level abstraction of the representation from the previous layer. Most word embedding methods use only the highest layer, like Word2Vec by Mikolov et al. (2013), or an aggregated representation from the last few layers, such as ELMo by Peters et al. (2018), as the representation for information retrieval. In this paper, we present a new word embedding approach called multi-resolution word embedding, which consists of two steps, as shown in Figure 1. In the first step, we form a mixture of weighted representations across the whole hierarchy of a given word embedding model, so that all resolutions of the hierarchical representation are preserved for the next step.
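The paper details this mixture in Section 3; as a minimal sketch of the idea (the function name, the softmax weighting, and the use of NumPy are our assumptions, not the authors' implementation), a token's per-layer vectors from one pretrained model can be mixed with scalar weights:

```python
import numpy as np

def layer_mixture(layer_vecs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Step 1 (sketch): mix one model's hierarchy of representations.

    layer_vecs: (L, d) array, one d-dimensional vector per layer for a token.
    weights:    (L,) unnormalized mixture weights (e.g., learned or tuned scalars).
    Returns a single (d,) vector that preserves all resolutions.
    """
    w = np.exp(weights - weights.max())   # softmax-normalize the weights
    w /= w.sum()
    return w @ layer_vecs                 # weighted sum over the layers

# Hypothetical example: a 3-layer model with 4-dimensional vectors.
token_layers = np.random.rand(3, 4)
mixed = layer_mixture(token_layers, np.array([0.2, 0.5, 0.3]))
```

In the second step, described next, the mixtures produced by different models (e.g., a Word2Vec mixture and an ELMo mixture) would be combined into one ensemble vector, for instance by concatenation; the excerpt does not pin down the combination operator, so that choice is our assumption.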
As the second step, we combine all mixture representations from the various models into an ensemble representation for the document retrieval task. The proposed word embedding takes advantage of the multi-resolution power of individual word embedding models, where each model is trained with a complementary strength due to the diversity of models and corpora. Taking the example of "··· java ···" in Figure 1, different levels of representation of "java", including the word level (word sense) and the concept level (abstract meanings such as coffee, island, and programming), are aggregated to form a mixture of representations. In the second step, all of these mixture representations from different word embedding models are aggregated to form an ensemble representation, which takes advantage of the complementary strengths of the individual models and corpora. Consequently, our multi-resolution word embedding delivers the power of multi-resolution together with the strength of the individual models.

Figure 1. The illustration of the multi-resolution word embedding method, using an example of "··· java ···".

As another contribution of the paper, we improve the quality of the target document retrieval task by introducing a convolutional residual retrieval network (ConvRR) over the embedding vectors. The proposed ConvRR model further improves retrieval performance by employing triplet learning with (semi-)hard negative mining on the target corpus; a sketch of this training signal is given below.
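The ConvRR architecture itself is specified later in the paper; the excerpt above names only its loss. As a minimal sketch of triplet learning with semi-hard negative mining (a standard formulation; the PyTorch code and the function name are ours, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(query: torch.Tensor,
                           positive: torch.Tensor,
                           negatives: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    """query, positive: (d,) embeddings; negatives: (n, d) candidate embeddings.

    A semi-hard negative is farther from the query than the positive,
    but still within the margin; if none exists, fall back to the
    hardest (closest) negative.
    """
    d_pos = torch.norm(query - positive)           # scalar distance to positive
    d_neg = torch.norm(negatives - query, dim=1)   # (n,) distances to negatives
    semi_hard = d_neg[(d_neg > d_pos) & (d_neg < d_pos + margin)]
    d_n = semi_hard.min() if semi_hard.numel() > 0 else d_neg.min()
    return F.relu(d_pos - d_n + margin)            # standard triplet hinge
```

In training, the query embedding would presumably come from the ConvRR encoder over the question's multi-resolution embeddings, with positives and negatives drawn from encoded documents of the target corpus; mining is typically done per batch.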
Our paper is structured as follows. First, we review recent advances in text embedding in Section 2. In Section 3 we describe the details of our approach; more specifically, we describe our multi-resolution word embedding, followed by an introduction of a specific deep residual retrieval model that is used to augment text, using the proposed word embedding, for document retrieval. We present an empirical study comparing the proposed method to baselines that utilize non-augmented word embedding models. In Section 4, we provide a detailed description of our experiments, including datasets, evaluation metrics, and implementation. Results are reported in Section 5, and the paper is concluded with future work in Section 6.

2. Related work

In information retrieval, a numerical statistic is used to express the importance of a word or token to a document in a document collection. TF-IDF, by Salton & McGill (1986), stands for term frequency-inverse document frequency, which was proposed to calculate a weighting factor in searches for information retrieval, text mining, and user modeling. In parallel with advances in the field, new methods intended to understand natural language have been proposed. One of the major contributions is word embedding. There are various types of word embedding in the literature, well covered by Perone et al. (2018). The influential Word2Vec, by Mikolov et al. (2013), is one of the first popular neural-network-based approaches to word embedding, built upon the guiding work of Bengio et al. (2003) on the neural language model for distributed word representations. This type of implementation is able to conserve semantic relationships between words and their context, or in other terms, their surrounding neighboring words. Two different approaches are proposed in Word2Vec to compute word representations. One approach, called Skip-gram, predicts surrounding words given a target word. The other, called Continuous Bag-of-Words, predicts the target word using a bag-of-words context. Global Vectors (GloVe), by Pennington et al. (2014), aims to reduce some limitations of Word2Vec by focusing on the global context, instead of the surrounding words, for learning the representations. The global context is calculated by utilizing word co-occurrences in a corpus; during this calculation, a count-based approach is applied, unlike the prediction-based method in Word2Vec. More recently, fastText, by Mikolov et al. (2018), was announced. It is based on the same principles as the others, which focus on extracting word embeddings from a large corpus. fastText is very similar to Word2Vec, except that it trains high-quality word vector representations using a combination of known tricks that are, however, rarely used together, which enables fastText to learn representations more efficiently.

An important question still remains for extracting high-quality and more meaningful representations: how to seize the semantic, the syntactic, and the different meanings in different contexts. Embeddings from Language Models (ELMo), by Peters et al. (2018), was newly proposed in order to tackle that question. ELMo extracts representations from a bi-directional Long Short-Term Memory (LSTM), by Hochreiter & Schmidhuber (1997), that is trained with a language model (LM) objective on a very large text corpus. ELMo representations are a function of the internal layers of the bi-directional language model (biLM), which outputs good and diverse representations of the words/tokens (via a convolutional neural network over characters). ELMo also incorporates character n-grams, as in fastText, but there are some constitutional differences between ELMo and its predecessors. Likewise, BERT, by Devlin et al. (2018), is a method of pre-training language representations that is trained, using a masked language-model objective, on a large unlabeled text corpus.
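For reference, the count-based TF-IDF baseline discussed at the start of this section can be sketched in a few lines with scikit-learn (the toy corpus and query below are ours, chosen to echo the "java" example from Figure 1):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["java is a programming language",
        "java is an island in indonesia",
        "arabica coffee is often called java"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)               # (n_docs, vocab) sparse matrix
query_vec = vectorizer.transform(["learn java programming"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
print(scores.argsort()[::-1])                             # document indices ranked by relevance
```

Unlike the embedding-based methods above, such a retriever matches on exact term statistics only, which is precisely the limitation that motivates semantic representations like the multi-resolution embedding proposed here.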
