arXiv:1307.1662v2 [cs.CL] 27 Jun 2014
Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou, Bryan Perozzi, Steven Skiena
Computer Science Dept., Stony Brook University
Stony Brook, NY 11794
fralrfou, bperozzi, [email protected]

Abstract

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-the-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.

1 Introduction

Building multilingual processing systems is a challenging task. Every NLP task involves different stages of preprocessing and calculating intermediate representations that will serve as features for later stages. These stages vary in complexity and requirements for each individual language. Despite recent momentum towards developing multilingual tools (Nivre et al., 2007; Hajič et al., 2009; Pradhan et al., 2012), most NLP research still focuses on resource-rich languages. Common NLP systems and tools rely heavily on English-specific features and are infrequently tested on multiple datasets. This makes them hard to port to new languages and tasks (Blitzer et al., 2006).

A serious bottleneck in the current approach for developing multilingual systems is the requirement of familiarity with each language under consideration. These systems are typically carefully tuned with hand-manufactured features designed by experts in a particular language. This approach can yield good performance, but tends to create complicated systems which have limited portability to new languages, in addition to being hard to enhance and maintain.

Recent advancements in unsupervised feature learning present an intriguing alternative. Instead of relying on expert knowledge, these approaches employ automatically generated task-independent features (or word embeddings) given large amounts of plain text. Recent developments have led to state-of-the-art performance in several NLP tasks such as language modeling (Bengio et al., 2006; Mikolov et al., 2010), and syntactic tasks such as sequence tagging (Collobert et al., 2011). These embeddings are generated as a result of training "deep" architectures, and it has been shown that such representations are well suited for domain adaptation tasks (Glorot et al., 2011; Chen et al., 2012).

We believe two problems have held back the research community's adoption of these methods. The first is that learning representations of words involves huge computational costs. The process usually involves processing billions of words over weeks. The second is that, so far, these systems have been built and tested mainly on English.

In this work we seek to remove these barriers to entry by generating word embeddings for over a hundred languages using state-of-the-art techniques. Specifically, our contributions include:

• Word embeddings - We will release word embeddings for the hundred and seventeen languages that have more than 10,000 articles on Wikipedia. Each language's vocabulary will contain up to 100,000 words. The embeddings will be publicly available at www.cs.stonybrook.edu/~dsl, for the research community to study their characteristics and build systems for new languages. We believe our embeddings represent a valuable resource because they contain a minimal amount of normalization. For example, we do not lowercase words for European languages as other studies have done for English. This preserves features of the underlying language.

• Quantitative analysis - We investigate the embeddings' performance on a part-of-speech (PoS) tagging task, and conduct a qualitative investigation of the syntactic and semantic features they capture. Our experiments represent a valuable chance to evaluate distributed word representations for NLP, as the experiments are conducted in a consistent manner and a large number of languages are covered. As the embeddings capture interesting linguistic features, we believe the multilingual resource we are providing gives researchers a chance to create multilingual comparative experiments.

• Efficient implementation - Training these models was made possible by our contributions to Theano, a machine learning library (Bergstra et al., 2010). These optimizations empower researchers to produce word embeddings under different settings or for corpora other than Wikipedia.

The rest of this paper is organized as follows. In Section 2, we give an overview of related work on semi-supervised learning and representation learning. We then describe, in Section 3, the network used to generate the word embeddings and its characteristics. Section 4 discusses the details of the corpus collection and preparation steps we performed. Next, in Section 5, we discuss our experimental setup and the training progress over time. In Section 6 we discuss the semantic features captured by the embeddings by showing examples of the word groupings in multiple languages. Finally, in Section 7 we demonstrate the quality of our learned features by training a PoS tagger on several languages and then conclude.

2 Related Work

There is a large body of work regarding semi-supervised techniques which integrate unsupervised feature learning with discriminative learning methods to improve the performance of NLP applications. Word clustering has been used to learn classes of words that have similar semantic features to improve language modeling (Brown et al., 1992) and knowledge transfer across languages (Täckström et al., 2012). Dependency parsing and other NLP tasks have been shown to benefit from large unannotated corpora (Koo et al., 2008), and a variety of unsupervised feature learning methods have been shown to unilaterally improve the performance of supervised learning tasks (Turian et al., 2010). Klementiev et al. (2012) induce distributed representations for a pair of languages jointly, where a learner can be trained on annotations present in one language and applied to test data in another.

Learning distributed word representations is a way to learn effective and meaningful information about words and their usages. These representations are usually generated as a side effect of training parametric language models as probabilistic neural networks. Training these models is slow and takes a significant amount of computational resources (Bengio et al., 2006; Dean et al., 2012). Several suggestions have been proposed to speed up the training procedure, either by changing the model architecture to exploit an algorithmic speedup (Mnih and Hinton, 2009; Morin and Bengio, 2005) or by estimating the error by sampling (Bengio and Senecal, 2008).

Collobert and Weston (2008) show that word embeddings can almost substitute for common NLP features on several tasks. The system they built, SENNA, offers part-of-speech tagging, chunking, named entity recognition, semantic role labeling and dependency parsing (Collobert, 2011). The system is built on top of word embeddings and performs competitively compared to state-of-the-art systems. In addition to pure performance, the system has a faster execution speed than comparable NLP pipelines (Al-Rfou' and Skiena, 2012).

To speed up the embedding generation process, SENNA embeddings are generated through a procedure that is different from language modeling. The representations are acquired through a model that distinguishes between phrases and corrupted versions of them. In doing this, the model avoids the need to normalize the scores across the vocabulary to infer probabilities. Chen et al. (2013) show that the embeddings generated by SENNA perform well in a variety of term-based evaluation tasks. Given the training speed and prior performance on NLP tasks in English, we generate our multilingual embeddings using a network architecture similar to the one SENNA used.

However, our work differs from SENNA in the following ways. First, we do not limit our models to English; we train embeddings for a hundred and seventeen languages. Next, we preserve linguistic features by avoiding excessive normalization of the text. For example, our English model places "Apple" closer to IT companies and "apple" closer to fruits. More examples of linguistic features preserved by our model are shown in Table 1. This gives us the chance to evaluate the embeddings' performance on PoS tagging without the need for manufactured features. Finally, we release the embeddings and the resources necessary to generate them to the community to eliminate any barriers.

Apple       apple    Bush       bush     corpora      dangerous
---------   ------   ---------  -------  -----------  -----------
Dell        tomato   Kennedy    jungle   notations    costly
Paramount   bean     Roosevelt  lobster  digraphs     chaotic
Mac         onion    Nixon      sponge   usages       bizarre
Flex        potato   Fisher     mud      derivations  destructive

Table 1: Nearest neighbors of words as they appear in the English embeddings.

In our work, we start from the example construction method outlined in (Bengio et al., 2009). They train a model by requiring it to distinguish between the original phrase and a corrupted version of the phrase. If it does not score the original one higher than the corrupted one (by a margin), the model will be penalized. More precisely, for a given sequence of words $S = [w_{i-n} \ldots w_i \ldots w_{i+n}]$ observed in the corpus $T$, we will construct another corrupted sequence $S'$ by replacing the word in the middle, $w_i$, with a word $w_j$ chosen randomly from the vocabulary. The neural network represents a function score that scores each phrase; the model is penalized through the hinge loss function $J(T)$ shown in Equation 1.

\[ J(T) = \frac{1}{|T|} \sum_{i \in T} \left| 1 - \mathrm{score}(S') + \mathrm{score}(S) \right|_{+} \qquad (1) \]

Figure 1 shows a neural network that takes a se-
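The example construction and penalty described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the training code used for the released embeddings: `toy_score` and its weights are hypothetical stand-ins for the neural network's score function, `corrupt` and `hinge_loss` are names chosen for this sketch, and `corrupt` excludes the original middle word when sampling the replacement, which the text does not require. The loss is written in the direction the prose describes, so a pair contributes nothing once the original phrase outscores its corrupted version by the margin of 1; treat that sign convention as an assumption of the sketch.

```python
import random

# Hypothetical stand-in for the network: sums hand-picked per-word weights.
TOY_WEIGHTS = {"the": 0.5, "cat": 2.0, "sat": 1.0, "mud": -1.0, "onion": -0.5}


def toy_score(window):
    """Score a phrase (list of words) with the toy weights above."""
    return sum(TOY_WEIGHTS.get(word, 0.0) for word in window)


def corrupt(window, vocab, rng):
    """Copy `window`, replacing its middle word with a random vocabulary word.

    Assumption of this sketch: the replacement is drawn from words other
    than the original middle word, so the corrupted phrase always differs.
    """
    mid = len(window) // 2
    candidates = [word for word in vocab if word != window[mid]]
    corrupted = list(window)
    corrupted[mid] = rng.choice(candidates)
    return corrupted


def hinge_loss(score, pairs, margin=1.0):
    """Average of max(0, margin - score(S) + score(S')) over (S, S') pairs."""
    total = 0.0
    for original, corrupted in pairs:
        total += max(0.0, margin - score(original) + score(corrupted))
    return total / len(pairs)


rng = random.Random(0)                      # fixed seed for repeatability
vocab = ["the", "cat", "sat", "mud", "onion"]
original = ["the", "cat", "sat"]            # window with n = 1
corrupted = corrupt(original, vocab, rng)
loss = hinge_loss(toy_score, [(original, corrupted)])
print(original, corrupted, loss)
```

With these toy weights, any replacement of "cat" lowers the score by at least the margin, so the pair above incurs no penalty; swapping the roles of the two phrases produces a positive loss.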