arXiv:1812.06280v4 [cs.CL] 26 Sep 2020
Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia

Ikuya Yamada¹,² Akari Asai³ Jin Sakuma⁴ Hiroyuki Shindo⁵,² Hideaki Takeda⁶ Yoshiyasu Takefuji⁷ Yuji Matsumoto²

¹Studio Ousia ²RIKEN AIP ³University of Washington ⁴The University of Tokyo ⁵Nara Institute of Science and Technology ⁶National Institute of Informatics ⁷Keio University

Abstract

The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io.

1 Introduction

Entity embeddings, i.e., vector representations of entities in a knowledge base (KB), have played a vital role in many recent models in natural language processing (NLP). These embeddings provide rich information (or knowledge) regarding entities available in a KB using fixed continuous vectors. They have been shown to be beneficial not only for tasks directly related to entities (e.g., entity linking (Yamada et al., 2016; Ganea and Hofmann, 2017)) but also for general NLP tasks (e.g., text classification (Yamada and Shindo, 2019), question answering (Poerner et al., 2019)). Notably, recent studies have also shown that these embeddings can be used to enhance the performance of state-of-the-art contextualized word embeddings (i.e., BERT (Devlin et al., 2019)) on downstream tasks (Zhang et al., 2019; Peters et al., 2019; Poerner et al., 2019).

In this work, we present Wikipedia2Vec, a Python-based open source tool for learning the embeddings of words and entities easily and efficiently from Wikipedia. Due to its scale, availability in a variety of languages, and constantly evolving nature, Wikipedia is commonly used as a KB to learn entity embeddings. Our proposed tool jointly learns the embeddings of words and entities, and places semantically similar words and entities close to one another in the vector space. In particular, our tool implements the word-based skip-gram model (Mikolov et al., 2013a,b) to learn word embeddings, and its extensions proposed in Yamada et al. (2016) to learn entity embeddings. Wikipedia2Vec enables users to train embeddings by simply running a single command with a Wikipedia dump file as an input. We highly optimized our implementation, which makes our implementation of the skip-gram model faster than the well-established implementations available in gensim (Řehůřek and Sojka, 2010) and fastText (Bojanowski et al., 2017).

Experimental results demonstrated that our tool achieved enhanced quality compared to the existing tools on several standard benchmarks. Notably, our tool achieved a state-of-the-art result on the entity relatedness task based on the KORE dataset. Due to its effectiveness and efficiency, our tool has been successfully used in various downstream NLP tasks, including entity linking (Yamada et al., 2016; Eshel et al., 2017; Chen et al., 2019), named entity recognition (Sato et al., 2017; Lara-Clares and Garcia-Serrano, 2019), question answering (Yamada et al., 2018b; Poerner et al., 2019), knowledge graph completion (Shah et al., 2019), paraphrase detection (Duong et al., 2019), fake news detection (Singh et al., 2019), and text classification (Yamada and Shindo, 2019).

We also introduce a web-based demonstration of our tool that visualizes the embeddings by plotting them onto a two- or three-dimensional space using dimensionality reduction algorithms. The demonstration also allows users to explore the embeddings by querying similar words and entities.

The source code has been tested on Linux, Windows, and macOS, and released under the Apache License 2.0. We also release the pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish).

The main contributions of this paper are summarized as follows:

• We present Wikipedia2Vec, a tool for learning the embeddings of words and entities easily and efficiently from Wikipedia.
• Our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and performed competitively on various benchmark datasets.
• We present a web-based demonstration that allows users to explore the learned embeddings.
• We publicize the code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io.

2 Related Work

Many studies have recently proposed methods to learn entity embeddings from a KB (Hu et al., 2015; Li et al., 2016; Tsai and Roth, 2016; Yamada et al., 2016, 2017, 2018a; Cao et al., 2017; Ganea and Hofmann, 2017). These embeddings are typically based on conventional word embedding models (e.g., skip-gram (Mikolov et al., 2013a)) trained with data retrieved from a KB. For example, Ristoski et al. (2018) proposed RDF2Vec, which learns entity embeddings using the skip-gram model with inputs generated by random walks over large knowledge graphs such as Wikidata and DBpedia. Furthermore, a simple method that has been widely used in various studies (Yaghoobzadeh and Schütze, 2015; Yamada et al., 2017, 2018a; Al-Badrashiny et al., 2017; Suzuki et al., 2018) trains entity embeddings by replacing the entity annotations in an input corpus with the unique identifier of their referent entities, and feeding the corpus into a word embedding model (e.g., skip-gram). Two open-source tools, namely Wiki2Vec¹ and Wikipedia Entity Vectors,² have implemented this method. Our proposed tool is based on Yamada et al. (2016), which extends this idea by using neighboring entities connected by internal hyperlinks of Wikipedia as additional contexts to train the model. Note that we used RDF2Vec and Wiki2Vec as baselines in our experiments, and achieved enhanced empirical performance over these tools on the KORE dataset. Additionally, various relational embedding models have been proposed (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015) that aim to learn entity representations that are particularly effective for knowledge graph completion tasks.

3 Overview

Wikipedia2Vec is an easy-to-use, optimized tool for learning embeddings from Wikipedia. This tool can be installed using Python's pip tool (pip install wikipedia2vec). Embeddings can be learned easily by running the wikipedia2vec train command with a Wikipedia dump file³ as an argument. Figure 1 shows the shell commands that download the latest English Wikipedia dump file and run training of the embeddings based on this dump using the default hyper-parameters.⁴ Furthermore, users can easily use the learned embeddings. Figure 2 shows the example Python code that loads the learned embedding file, and obtains the embeddings of an entity Scarlett Johansson and a word tokyo, as well as the most similar words and entities of an entity Python.

    $ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    $ wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Figure 1: Shell commands to train embeddings from the latest English Wikipedia dump.

    >>> from wikipedia2vec import Wikipedia2Vec
    >>> model = Wikipedia2Vec.load(MODEL_FILE)
    >>> model.get_entity_vector("Scarlett Johansson")
    memmap([-0.1979, 0.3086, ..., ], dtype=float32)
    >>> model.get_word_vector("tokyo")
    memmap([ 0.0161, -0.0332, ..., ], dtype=float32)
    >>> model.most_similar(model.get_entity("Python (programming language)"))[:3]
    [(<Word python>, 0.7265),
     (<Entity Ruby (programming language)>, 0.6856),
     (<Entity Perl>, 0.6794)]

Figure 2: An example that uses the Wikipedia2Vec embeddings on a Python interactive shell.

¹ https://github.com/idio/wiki2vec
² https://github.com/singletongue/WikiEntVec
³ The dump file can be downloaded at Wikimedia Downloads: https://dumps.wikimedia.org
⁴ The train command has many optional hyper-parameters that are described in detail in the documentation.

[Figure 3 panels, illustrated with the sentence "Aristotle was a philosopher" and entities such as Aristotle, Plato, Socrates, Avicenna, Philosophy, and Logic. Word-based skip-gram model: the neighboring words of each word are used as contexts. Anchor context model: the neighboring words of a hyperlink pointing to an entity are used as contexts. Link graph model: the neighboring entities of each entity in Wikipedia's link graph are used as contexts.]

Figure 3: Wikipedia2Vec learns embeddings by jointly optimizing word-based skip-gram, anchor context, and link graph models.
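To make the similarity query in Figure 2 more concrete, the following is a minimal sketch of a nearest-neighbor lookup over a small vocabulary in which words and entities share one vector space. It assumes cosine similarity as the ranking measure. The vectors, the item names, and the helper functions cosine and most_similar are hypothetical illustrations and are not taken from the Wikipedia2Vec implementation.

```python
import math

# Toy 3-dimensional embeddings with made-up values (not learned from Wikipedia).
# The "word:" and "entity:" prefixes are a hypothetical naming scheme used only
# to show that words and entities live in the same vector space.
EMBEDDINGS = {
    "word:python": [0.9, 0.1, 0.0],
    "entity:Python (programming language)": [1.0, 0.0, 0.0],
    "entity:Ruby (programming language)": [0.8, 0.3, 0.1],
    "word:tokyo": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Return the cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, count=3):
    """Rank every other item by cosine similarity to the query item.

    The query itself is excluded here; this is a choice made for the sketch.
    """
    query_vec = embeddings[query]
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


if __name__ == "__main__":
    neighbors = most_similar("entity:Python (programming language)", EMBEDDINGS)
    for name, score in neighbors:
        print(f"{name}\t{score:.4f}")
```

Because words and entities are stored in a single table, one query can return both kinds of neighbors, which mirrors the mixed word and entity results shown in Figure 2.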
3.1 Model Anchor Context Model This model aims to Wikipedia2Vec implements the conventional skip- place similar words and entities close to one an- gram model (Mikolov et al.,