
Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou Bryan Perozzi Steven Skiena Computer Science Dept. Stony Brook University Stony Brook, NY 11794 {ralrfou, bperozzi, skiena}@cs.stonybrook.edu

arXiv:1307.1662v2 [cs.CL] 27 Jun 2014

Abstract

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.

1 Introduction

Building multilingual processing systems is a challenging task. Every NLP task involves different stages of preprocessing and calculating intermediate representations that will serve as features for later stages. These stages vary in complexity and requirements for each individual language. Despite recent momentum towards developing multilingual tools (Nivre et al., 2007; Hajič et al., 2009; Pradhan et al., 2012), most of NLP research still focuses on rich resource languages. Common NLP systems and tools rely heavily on English specific features and they are infrequently tested on multiple datasets. This makes them hard to port to new languages and tasks (Blitzer et al., 2006).

A serious bottleneck in the current approach for developing multilingual systems is the requirement of familiarity with each language under consideration. These systems are typically carefully tuned with hand-manufactured features designed by experts in a particular language. This approach can yield good performance, but tends to create complicated systems which have limited portability to new languages, in addition to being hard to enhance and maintain.

Recent advancements in unsupervised feature learning present an intriguing alternative. Instead of relying on expert knowledge, these approaches employ automatically generated task-independent features (or word embeddings) given large amounts of plain text. Recent developments have led to state-of-art performance in several NLP tasks such as language modeling (Bengio et al., 2006; Mikolov et al., 2010), and syntactic tasks such as sequence tagging (Collobert et al., 2011). These embeddings are generated as a result of training "deep" architectures, and it has been shown that such representations are well suited for domain adaptation tasks (Glorot et al., 2011; Chen et al., 2012).

We believe two problems have held back the research community's adoption of these methods. The first is that learning representations of words involves huge computational costs. The process usually involves processing billions of words over weeks. The second is that so far, these systems have been built and tested mainly on English.

In this work we seek to remove these barriers to entry by generating word embeddings for over a hundred languages using state-of-the-art techniques. Specifically, our contributions include:

• Word embeddings - We will release word embeddings for the hundred and seventeen languages that have more than 10,000 articles on Wikipedia. Each language's vocabulary will contain up to 100,000 words. The embeddings will be publicly available at (www.cs.stonybrook.edu/~dsl), for the research community to study their characteristics and build systems for new languages. We believe our embeddings represent a valuable resource because they contain a minimal amount of normalization. For example, we do not lower case words for European languages as other studies have done for English. This preserves features of the underlying language.

• Quantitative analysis - We investigate the embedding's performance on a part-of-speech (PoS) tagging task, and conduct qualitative investigation of the syntactic and semantic features they capture. Our experiments represent a valuable chance to evaluate distributed word representations for NLP as the experiments are conducted in a consistent manner and a large number of languages are covered. As the embeddings capture interesting linguistic features, we believe the multilingual resource we are providing gives researchers a chance to create multilingual comparative experiments.

• Efficient implementation - Training these models was made possible by our contributions to Theano (machine learning library (Bergstra et al., 2010)). These optimizations empower researchers to produce word embeddings under different settings or for different corpora than Wikipedia.

The rest of this paper is as follows. In Section 2, we give an overview of semi-supervised learning and learning representations related work. We then describe, in Section 3, the network used to generate the word embeddings and its characteristics. Section 4 discusses the details of the corpus collection and preparation steps we performed. Next, in Section 5, we discuss our experimental setup and the training progress over time. In Section 6 we discuss the semantic features captured by the embeddings by showing examples of the word groupings in multiple languages. Finally, in Section 7 we demonstrate the quality of our learned features by training a PoS tagger on several languages and then conclude.

2 Related Work

There is a large body of work regarding semi-supervised techniques which integrate unsupervised feature learning with discriminative learning methods to improve the performance of NLP applications. Word clustering has been used to learn classes of words that have similar semantic features to improve language modeling (Brown et al., 1992) and knowledge transfer across languages (Täckström et al., 2012). Dependency parsing and other NLP tasks have been shown to benefit from such a large unannotated corpus (Koo et al., 2008), and a variety of unsupervised feature learning methods have been shown to unilaterally improve the performance of supervised learning tasks (Turian et al., 2010). (Klementiev et al., 2012) induce distributed representations for a pair of languages jointly, where a learner can be trained on annotations present in one language and applied to test data in another.

Learning distributed word representations is a way to learn effective and meaningful information about words and their usages. They are usually generated as a side effect of training parametric language models as probabilistic neural networks. Training these models is slow and takes a significant amount of computational resources (Bengio et al., 2006; Dean et al., 2012). Several suggestions have been proposed to speed up the training procedure, either by changing the model architecture to exploit an algorithmic speedup (Mnih and Hinton, 2009; Morin and Bengio, 2005) or by estimating the error by sampling (Bengio and Senecal, 2008).

(Collobert and Weston, 2008) shows that word embeddings can almost substitute NLP common features on several tasks. The system they built, SENNA, offers part of speech tagging, chunking, named entity recognition, semantic role labeling and dependency parsing (Collobert, 2011). The system is built on top of word embeddings and performs competitively compared to state of art systems. In addition to pure performance, the system has a faster execution speed than comparable NLP pipelines (Al-Rfou' and Skiena, 2012).

To speed up the embedding generation process, SENNA embeddings are generated through a procedure that is different from language modeling. The representations are acquired through a model that distinguishes between phrases and corrupted versions of them. In doing this, the model avoids the need to normalize the scores across the vocabulary to infer probabilities. (Chen et al., 2013) shows that the embeddings generated by SENNA

Apple      apple   Bush       bush     corpora      dangerous
Dell       tomato  Kennedy    jungle   notations    costly
Paramount  bean    Roosevelt  lobster  digraphs     chaotic
Mac        onion   Nixon      sponge   usages       bizarre
Flex       potato  Fisher     mud      derivations  destructive

Table 1: Words' nearest neighbors as they appear in the English embeddings.

perform well in a variety of term-based evaluation tasks. Given the training speed and prior performance on NLP tasks in English, we generate our multilingual embeddings using a network architecture similar to the one SENNA used.

However, our work differs from SENNA in the following ways. First, we do not limit our models to English; we train embeddings for a hundred and seventeen languages. Next, we preserve linguistic features by avoiding excessive normalization of the text. For example, our English model places "Apple" closer to IT companies and "apple" closer to fruits. More examples of linguistic features preserved by our model are shown in Table 1. This gives us the chance to evaluate the embeddings' performance on PoS tagging without the need for manufactured features. Finally, we release the embeddings and the resources necessary to generate them to the community to eliminate any barriers.

Despite the progress made in creating distributed representations, combining them to produce meaning is still a challenging task. Several approaches have been proposed to address feature compositionality for semantic problems such as paraphrase detection (Socher et al., 2011), and sentiment analysis (Socher et al., 2012) using word embeddings.

3 Distributed Word Representation

Distributed word representations (word embeddings) map the index of a word in a dictionary to a feature vector in a high-dimensional space. Every dimension contributes to multiple concepts, and every concept is expressed by a combination of a subset of dimensions. Such a mapping is learned by back-propagating the error of a task through the model to update randomly initialized embeddings. The task is usually chosen such that examples can be automatically generated from unlabeled data (i.e., the task is unsupervised). In the case of language modeling, the task is to predict the last word of a phrase that consists of n words.

In our work, we start from the example construction method outlined in (Bengio et al., 2009). They train a model by requiring it to distinguish between the original phrase and a corrupted version of the phrase. If it does not score the original one higher than the corrupted one (by a margin), the model will be penalized. More precisely, for a given sequence of words $S = [w_{i-n} \dots w_i \dots w_{i+n}]$ observed in the corpus $T$, we construct a corrupted sequence $S'$ by replacing the word in the middle, $w_i$, with a word $w_j$ chosen randomly from the vocabulary. The neural network represents a function score that scores each phrase, and the model is penalized through the hinge loss function $J(T)$ shown in Equation 1, where $|x|_{+} = \max(0, x)$.

$$J(T) = \frac{1}{|T|} \sum_{i \in T} \left| 1 - \mathrm{score}(S) + \mathrm{score}(S') \right|_{+} \tag{1}$$

Figure 1 shows a neural network that takes a sequence of words with size $2n + 1$ to compute a score. First, each word is mapped through a vocabulary dictionary of size $|V|$ to an index that is used to index a shared matrix $C$ of size $|V| \times M$, where $M$ is the size of the vector representing the word. Once the vectors are retrieved, they are concatenated into one vector called the projection layer $P$, with size $(2n + 1) \cdot M$. The projection layer plays the role of an input to a hidden layer of size $H$, the activations $A$ of which are calculated according to Equation 2, where $W_1$, $b_1$ are the weights and bias of the hidden layer.

$$A = \tanh(W_1 P + b_1) \tag{2}$$

To calculate the phrase score, a linear combination of the hidden layer activations $A$ is computed using $W_2$ and $b_2$.

$$\mathrm{score}(P) = W_2 A + b_2 \tag{3}$$

Therefore, the five parameters that have to be learned are $W_1$, $W_2$, $b_1$, $b_2$, $C$, with a total number of parameters $(2n + 1) \cdot M \cdot H + H + H + 1 + |V| \cdot M \approx M \cdot (nH + |V|)$.

[Figure 1 diagram: the input window "Imagination is greater than detail" is looked up word by word in the shared matrix C (|V| × M), concatenated in the projection layer, passed through the hidden layer (size H), and reduced to a single score.]

Figure 1: Neural network architecture. Words are retrieved from the embeddings matrix C and concatenated at the projection layer as an input to compute the hidden layer activation. The score is the linear combination of the activation values of the hidden layer. The scores of two phrases are ranked according to hinge loss to distinguish the corrupted phrase from the original one.

4 Corpus Preparation

We have chosen to generate our word embeddings from Wikipedia. In addition to size, there are other desirable properties that we wish for the source of our language model to have:

• Size and variety of languages - As of this writing (April, 2013), 42 languages had more than 100,000 article pages, and 117 languages had more than 10,000 article pages.
• Well studied - Wikipedia is a prolific resource in the literature, and has been used for a variety of problems. Particularly, Wikipedia is well suited for multilingual applications (Navigli and Ponzetto, 2010).
• Quality - Wikipedians strive to write articles that are readable, accurate, and consist of good grammar.
• Openly accessible - Wikipedia is a resource available for free use by researchers.
• Growing - As technology becomes more accessible, the size and scope of the multilingual Wikipedia effort continues to expand.

To process Wikipedia markup, we first extract the text using a modified version of the Bliki engine [1]. Next we must tokenize the text. We rely on an OpenNLP probabilistic tokenizer whenever possible, and default to the Unicode text segmentation [2] algorithm offered by Lucene when we have no such OpenNLP model. After tokenization, we normalize the tokens to reduce their sparsity. We have two main normalization rules. The first replaces digits with the symbol #, so "1999" becomes "####". In the second, we remove hyphens and brackets that appear in the middle of a token. As an additional rule for English, we map non-Latin characters to their Unicode block groups.

In order to capture the syntactic and semantic features of words, we must observe each word several times in each of its valid contexts. This requirement, when combined with the Zipfian distribution of words in natural language, implies that learning a meaningful representation of a language requires a huge amount of unstructured text. In practice we deal with this limitation by restricting ourselves to the most frequently occurring tokens in each language.

Table 2 shows the size of each language corpus in terms of tokens, number of word types and coverage of text achieved by building a vocabulary out of the most frequent 100,000 tokens, |V|. Out of vocabulary (OOV) words are replaced with a special token ⟨UNK⟩.

While Wikipedia has 284 language specific encyclopedias, only five of them have more than a million articles. The size drops dramatically, such that the 42nd largest Wikipedia, Hindi, has slightly above 100,000 articles and the 100th, Tatar, has slightly over 16,000 articles [3].

The larger Wikipedias have a word coverage over 92%, except for German, Russian, Arabic and Czech, which shows the effect of the heavy usage of morphological forms in these languages on the word usage distribution.

The highest word coverage we achieve is unsurprisingly for Chinese. This is expected given the limited vocabulary size of the language: the number of entries in the Contemporary Chinese Dictionary is estimated to be 65 thousand words (Shuxiang, 2004).

[1] Java Wikipedia API (Bliki engine) - http://code.google.com/p/gwtwiki/
[2] http://www.unicode.org/reports/tr29/
[3] http://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias&oldid=5248228

Language     Tokens (×10^6)   Words (×10^3)   Coverage
English      1,888            12,125          96.30%
German       687              9,474           91.78%
French       473              4,675           95.78%
Spanish      399              3,978           96.07%
Russian      328              5,959           90.43%
Italian      322              3,642           95.52%
Portuguese   197              2,870           95.68%
Dutch        197              3,712           93.81%
Chinese      196              423             99.67%
Swedish      101              2,707           92.36%
Czech        80               2,081           91.84%
Arabic       52               1,834           91.78%
Danish       44               1,414           93.68%
Bulgarian    39               1,114           94.35%
Slovene      30               920             94.42%
Hindi        23               702             96.25%

Table 2: Statistics of a subset of the languages processed. The second column reports the number of tokens found in the corpus in millions, while the third column reports the word types found in thousands. The coverage indicates the percentage of the corpus that matches words in a vocabulary consisting of the most frequent 100 thousand words.

5 Training

For our experiments, we build a model like the one described in Section 3 using Theano (Bergstra et al., 2010). We choose the following parameters: context window size 2n + 1 = 5, vocabulary |V| = 100,000, word embedding size M = 64, and hidden layer size H = 32. The intuition here is to maximize the relative size of the embeddings compared to the rest of the network. This might force the model to store the necessary information in the embeddings matrix instead of the hidden layer. Another benefit is that we will avoid overfitting on the smaller Wikipedias. Increasing the window size or the embedding size slows down the training speed, making it harder to converge within a reasonable time.

The examples are generated by sweeping a window over sentences. For each sentence in the corpus, all unknown words are replaced with a special token ⟨UNK⟩ and sentences are padded with ⟨S⟩, ⟨/S⟩ tokens. In case the window exceeds the edges of a sentence, the missing slots are filled with our padding token, ⟨PAD⟩.

To train the model, we consider the data in mini-batches of size 16. Every 16 examples, we estimate the gradient using stochastic gradient descent (Bottou, 1991), and update the parameters which contributed to the error using backpropagation (Rumelhart et al., 2002). Calculating an exact gradient is prohibitive given that the dataset size is in millions of examples. We calculate the development error by randomly sampling 10,000 mini-batches from the development dataset.

For each language, we set the batch size to 16 examples, and the learning rate to 0.1. Following the advice of (Collobert et al., 2011), we divide each layer's learning rate by the fan-in of that layer, and we consider the embeddings layer to have a fan-in of 1. We divide the corpus into three sets, training, development and testing, with percentages of 90, 5 and 5 respectively.

Figure 2: Training and test errors of the French model after 23 days of training. We did not notice any overfitting while training the model. The error curves are smoother the larger the language corpus is.

One disadvantage of the approach used by (Collobert et al., 2011) is that there is no clear stopping criterion for the model training process. We have noticed that after a few weeks of training, the model's performance reaches the point where there is no significant decrease in the average loss over the development set, and when this occurs we manually stop the training. An interesting property of this model is that we did not notice any sign of overfitting for large Wikipedias. This could be explained by the infinite amount of examples we can generate by randomly choosing the re-

French                  Spanish                          English
Word       Translation  Word          Translation        Word        Word
rouge      red          dentista      dentist            Mumbai      Bombay
jaune      yellow       peluquero     barber             Chennai     Madras
rose       pink         ginecólogo    gynecologist       Bangalore   Shanghai
blanc      white        camionero     truck driver       Kolkata     Calcutta
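As a minimal sketch of the objective described in this section, the following pure-Python code scores a window with a one-hidden-layer network (embedding lookup, tanh hidden layer, linear output) and applies the margin ranking loss, which is zero once the original window outscores its corrupted copy by the margin, as the prose requires. The function names, the uniform initialization range, and the toy sizes are illustrative assumptions, not details of the authors' Theano implementation.

```python
import math
import random


def init_params(vocab_size, M, H, n, rng):
    """Small random parameters; the initialization range is an assumption."""
    def small():
        return rng.uniform(-0.1, 0.1)
    C = [[small() for _ in range(M)] for _ in range(vocab_size)]        # |V| x M
    W1 = [[small() for _ in range((2 * n + 1) * M)] for _ in range(H)]  # H x (2n+1)M
    b1 = [0.0] * H
    W2 = [small() for _ in range(H)]
    b2 = 0.0
    return C, W1, b1, W2, b2


def score(window, C, W1, b1, W2, b2):
    """Score a window of word indices."""
    # Projection layer: concatenate the embeddings of the window's words.
    P = [x for w in window for x in C[w]]
    # Hidden layer: A = tanh(W1 P + b1).
    A = [math.tanh(sum(wt * p for wt, p in zip(row, P)) + b)
         for row, b in zip(W1, b1)]
    # Output layer: a single scalar, W2 A + b2.
    return sum(wt * a for wt, a in zip(W2, A)) + b2


def hinge_loss(score_original, score_corrupted, margin=1.0):
    """Zero when the original outscores the corrupted window by the margin."""
    return max(0.0, margin - score_original + score_corrupted)


def count_parameters(n, M, H, vocab_size):
    """W1, b1, W2, b2 and the embedding matrix C."""
    return (2 * n + 1) * M * H + H + H + 1 + vocab_size * M


rng = random.Random(0)
params = init_params(vocab_size=10, M=4, H=3, n=2, rng=rng)
original = [1, 2, 3, 4, 5]    # toy window of word indices
corrupted = [1, 2, 7, 4, 5]   # same window with the middle word replaced
loss = hinge_loss(score(original, *params), score(corrupted, *params))
print(loss)
```

With the hyperparameters reported in Section 5 (n = 2, M = 64, H = 32, |V| = 100,000), `count_parameters` gives 6,410,305 parameters, of which 6,400,000 belong to the embedding matrix C.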
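The two general normalization rules and the frequency-capped vocabulary can be sketched as follows. This is an illustration under stated assumptions: the exact set of bracket characters, the `<UNK>` spelling, and the helper names are hypothetical, and the English-only mapping of non-Latin characters to Unicode block groups is not covered.

```python
import re
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the out-of-vocabulary token


def normalize(token):
    """Apply the two general normalization rules to one token."""
    # Rule 1: every digit becomes '#', so "1999" turns into "####".
    token = re.sub(r"\d", "#", token)
    # Rule 2: drop hyphens and brackets strictly inside the token
    # (the bracket set below is an assumption).
    if len(token) > 2:
        inner = re.sub(r"[-()\[\]{}]", "", token[1:-1])
        token = token[0] + inner + token[-1]
    return token


def build_vocab(tokens, max_size=100_000):
    """Keep only the most frequent tokens."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(max_size)}


def replace_oov(tokens, vocab):
    """Map every token outside the vocabulary to the UNK symbol."""
    return [t if t in vocab else UNK for t in tokens]


corpus = [normalize(t) for t in "a b a c a b in 1999".split()]
vocab = build_vocab(corpus, max_size=2)
print(replace_oov(corpus, vocab))
```

Capping the vocabulary by frequency rank is what the coverage column of Table 2 measures: the share of corpus tokens that survive `replace_oov` unchanged.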
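To make the mini-batch construction concrete, here is a hedged sketch of how windows of 2n + 1 = 5 tokens could be swept over a sentence, paired with corrupted copies, and grouped into batches of 16. The sentence markers and padding token follow the description in this section; centering a window on the sentence markers as well as on ordinary tokens, the seed handling, and all function names are assumptions made for illustration.

```python
import random

PAD, BOS, EOS = "<PAD>", "<S>", "</S>"


def windows(sentence, n=2):
    """Yield one window of 2n+1 tokens per position of the marked sentence.

    Slots that fall outside the sentence are filled with PAD.
    """
    marked = [BOS] + list(sentence) + [EOS]
    framed = [PAD] * n + marked + [PAD] * n
    for i in range(len(marked)):
        yield framed[i:i + 2 * n + 1]


def corrupted(window, vocab, rng):
    """Copy the window, replacing its middle word with a random vocabulary word."""
    out = list(window)
    out[len(out) // 2] = rng.choice(vocab)
    return out


def minibatches(sentences, vocab, batch_size=16, n=2, seed=0):
    """Group (original, corrupted) window pairs into mini-batches."""
    rng = random.Random(seed)
    batch = []
    for sentence in sentences:
        for window in windows(sentence, n):
            batch.append((window, corrupted(window, vocab, rng)))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly smaller batch
        yield batch


example = list(windows(["Imagination", "is", "greater"]))
batches = list(minibatches([["a", "b", "c"]] * 10, ["a", "b", "c", "d"]))
print(example[0], len(batches))
```

Because the replacement word is drawn afresh each time a window is visited, the same corpus yields a different set of corrupted examples on every pass.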

orange     orange       oftalmólogo   ophthalmologist    Cairo       Bangkok
bleu       blue         telegrafista  telegraphist       Hyderabad   Hyderabad

Arabic                   Arabic           German
Translation              Translation      Word               Translation
thanks                   two boys         Eisenbahnbetrieb   rail operations
and thanks               two sons         Fahrbetrieb        driving
greetings                two boys         Reisezugverkehr    passenger trains
thanks + diacritic       two children     Fährverkehr        ferries
and thanks + diacritic   two sons         Handelsverkehr     trade
hello                    two daughters    Schülerverkehr     students transport

Russian                      Chinese                              Italian
Word       Transliteration   Transliteration   Translation        Word        Translation
Путин      Putin                               Winter             papa        Pope
Янукович   Yanukovych                          Vernal             Papa        Pope
Троцкий    Trotsky           xiazhi            Summer solstice    pontefice   pontiff
Гитлер     Hitler                              Autumnal Equinox   basileus    basileus
Сталин     Stalin            ziye              Midnight           cardinale   cardinal
Медведев   Medvedev          chuxi             New Year's Eve     frate       friar

[The Arabic-script and Chinese-script words did not survive text extraction; only their translations and the surviving transliterations are shown.]

Table 3: Examples of the nearest five neighbors of every word in several languages. Translation is retrieved from http://translate.google.com.

placement word in the corrupted phrase. Figure 2 shows a typical learning curve of the training. As the number of examples seen so far increases, both the training error and the development error go down.

6 Qualitative Analysis

In order to understand how the embeddings space is organized, we examine the subtle information captured by the embeddings through investigating the proximity of word groups. This information has the potential to help researchers develop applications that use such semantic and syntactic information. The embeddings not only capture syntactic features, as we will demonstrate in Section 7, but also demonstrate the ability to capture interesting semantic information. Table 3 shows different words in several languages. For each word on top of each list, we rank the vocabulary according to Euclidean distance from that word and show the closest five neighboring words.

• French & Spanish - The expected groupings of colors and professions are clearly observed.
• English - The example shows how the embedding space is aware of the name change that happened to a group of Indian cities. "Mumbai" used to be called "Bombay", "Chennai" used to be called "Madras" and "Kolkata" used to be called "Calcutta". On the other hand, "Hyderabad" stayed at a similar distance from both names as they point to the same conceptual meaning.
• Arabic - The first example shows the word "Thanks". Despite not removing the diacritics from the text, the model learned that the two surface forms of the word mean similar things and, therefore, grouped them together. In Arabic, conjunction words do not get separated from the following word. Usually, "and thanks" serves as a letter signature, as "sincerely" is used in English. The model learned that both words {"and thanks", "thanks"} are similar, regardless of their different forms. The second example illustrates a specific syntactic morphological feature of Arabic, where enumeration of couples has its own form.
• German - The example demonstrates that the compositional semantics of multi-unit words are still preserved.
• Russian - The model learned to group Russian/Soviet leaders and other figures related to Soviet history together.
• Chinese - The list contains three solar terms that are part of the traditional East Asian lunisolar calendars. The remaining two terms correspond to traditional holidays that occur on the same dates as these solar terms.
• Italian - The model learned that the lower and upper case forms of the word have similar meaning.

7 Sequence Tagging

Here we analyze the quality of the models we have generated. To test the quantitative performance of the embeddings, we use them as the sole features for a well studied NLP task, part of speech tagging.

To demonstrate the capability of the learned dis-

                                                   Test
Language     Source                                Unknown   Known    All      TnT
German       Tiger† (Brants et al., 2002)          89.17%    98.60%   97.85%   98.10%
Bulgarian    BTB† (Simov et al., 2002)             75.74%    98.33%   96.33%   97.50%
Czech        PDT 2.5 (Bejček et al., 2012)         71.98%    99.15%   97.13%   99.10%
Danish       DDT† (Kromann, 2003)                  73.03%    98.07%   96.45%   96.40%
Dutch        Alpino† (Van der Beek et al., 2002)   73.47%    95.85%   93.86%   95.00%
English      PennTreebank (Marcus et al., 1993)    75.97%    97.74%   97.18%   96.80%
Portuguese   Sintá(c)tica† (Afonso et al., 2002)   75.36%    97.71%   95.95%   96.80%
Slovene      SDT† (Džeroski et al., 2006)          68.82%    95.17%   93.46%   94.60%
Swedish      Talbanken05† (Nivre et al., 2006)     83.54%    95.77%   94.68%   94.70%
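Neighbor lists of this kind can be produced with a brute-force search over the vocabulary. A minimal sketch, assuming the embeddings are available as a dictionary from word to vector; the toy two-dimensional vectors below are invented for illustration and are not taken from the released embeddings.

```python
import math


def nearest_neighbors(word, embeddings, k=5):
    """Return the k words closest to `word` by Euclidean distance."""
    query = embeddings[word]

    def distance(vector):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, vector)))

    # Rank every other word in the vocabulary by its distance to the query.
    ranked = sorted((distance(vec), other)
                    for other, vec in embeddings.items() if other != word)
    return [other for _, other in ranked[:k]]


# Hypothetical toy vectors, chosen so that colors and professions separate.
toy = {
    "rouge": [1.0, 0.0],
    "jaune": [0.9, 0.1],
    "bleu": [0.8, 0.0],
    "dentista": [0.0, 1.0],
    "peluquero": [0.1, 0.9],
}
print(nearest_neighbors("rouge", toy, k=2))
```

Note that the ranking is case-sensitive by construction: "Apple" and "apple" are separate keys with separate vectors, which is what lets the unnormalized embeddings keep them apart.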

Table 4: Results of our model against several PoS datasets. The performance is measured using accuracy over the test datasets. The All column reports the total accuracy of the tagger, while the Unknown and Known columns report the accuracy over OOV (unknown) words and known words. The results are compared to the TnT tagger results reported by (Petrov et al., 2012). † CoNLL 2006 dataset.

tributed representations in extracting useful word features, we train a PoS tagger over the subset of languages that we were able to acquire free annotated resources for. We choose our tagger for this task to be a neural network because it has a fast convergence rate based on our initial experiments.

The part of speech tagger has a similar architecture to the one used for training the embeddings. However, we have changed some of the network parameters; specifically, we use a hidden layer of size 300 and a learning rate of 0.3. The network is trained by minimizing the negative log likelihood of the labeled data. To tag a specific word $w_i$ we consider a window of size $2n + 1$, where $n$ in our experiment is equal to 2. Equation 4 shows how we construct a feature vector $F$ by concatenating ($\oplus$) the embeddings of the words that occur in the window, where $C$ is the matrix that contains the embeddings of the language vocabulary.

$$F = \bigoplus_{j=i-2}^{i+2} C[w_j] \tag{4}$$

The feature vector is fed to the network and the error is back-propagated to the embeddings.

The results of this experiment are presented in Table 4. We train and test our models on the universal tagset proposed by (Petrov et al., 2012). This universal tagset maps each original tag in a treebank to one out of twelve general PoS tags. This simplifies the comparison of classifier performance across languages. We compare our results to a similar experiment conducted in their work, where they trained a TnT tagger (Brants, 2000) on several treebanks. The TnT tagger is based on Markov models and depends on trigram counts observed in the labeled data. It was chosen for its fast speed and (near to) state-of-the-art accuracy, without language specific tuning.

The performance of the embeddings is competitive in general. Surprisingly, it is better than the TnT tagger in English and Danish. Moreover, our performance is very close in the case of Swedish. This task is hard for our tagger for two reasons. The first is that we do not add OOV words seen during training of the tagger to our vocabulary. The second is that all OOV words are substituted with one representation, ⟨UNK⟩, and there is no character level information used to inform the tagger about the characteristics of the OOV words.

On the other hand, the performance on the known words is strong and consistent, showing the value of the features learned about these words from the unsupervised stage. Although the word coverage of German and Czech is low in the original Wikipedia corpora (see Table 2), the features learned achieve great accuracy on the known words. They both achieve above 98.5% accuracy. It is noticeable that the Slovene model performs the worst, under both the known and unknown word categories. It achieves only 93.46% accuracy on the test dataset. Given that the Slovene embeddings were trained on the least amount of data among all the embeddings we test here, we expect the quality to go lower for the other smaller Wikipedias not tested here.

In Table 5, we present how well the vocabulary of each language's embeddings covered the part of speech datasets. The datasets come from a different domain than Wikipedia, and this is reflected in the results.

Language     % Token Coverage   % Word Coverage
Bulgarian    94.58              77.70
Czech        95.37              65.61
Danish       95.41              80.03
German       94.04              60.68
English      98.06              79.73
Dutch        96.25              77.76
Portuguese   94.09              72.66
Slovene      95.33              83.67
Swedish      95.87              73.92

Table 5: Coverage statistics of the embedding's vocabulary on the part of speech datasets after normalization. Token coverage is the raw percentage of words which were known, while the Word coverage ignores repeated words.

In Table 6, we present the results of training the same neural network part of speech tagger without using our embeddings as initializations. We found that the embeddings benefited all the languages we considered, and observed the greatest benefit in languages which had a small number of training examples. We believe that these results illustrate the performance [...]

Language     # Training Examples   Accuracy Drop
Bulgarian    200,049               -2.01%
Czech        1,239,687             -0.86%
Danish       96,581                -1.77%
German       735,826               -0.89%
English      950,561               -0.25%
Dutch        208,418               -1.37%
Portuguese   212,749               -0.91%
Slovene      27,284                -2.68%
Swedish      199,509               -0.82%

Table 6: Accuracy of randomly initialized tagger compared to our results. Using the embeddings was generally helpful, especially in languages where we did not have many training examples. The scores presented are the best we found for each language (languages with more resources could afford to train longer before overfitting).

8 Conclusion

Distributed word representations represent a valuable resource for any language, but particularly for resource-scarce languages. We have demonstrated how word embeddings can be used as an off-the-shelf solution to reach near state-of-art performance on a fundamental NLP task, and we believe that our embeddings will help researchers to develop tools in languages with which they have no expertise.

Moreover, we showed several examples of interesting semantic relations expressed in the embeddings space that we believe will lead to interesting applications and improve tasks such as semantic compositionality.

While we have only considered the properties of word embeddings as features in this work, it has been shown that using word embeddings in conjunction with traditional NLP features can significantly improve results on NLP tasks (Turian et al., 2010; Collobert et al., 2011). With this in mind, we believe that the entire research community can benefit from our release of word embeddings for over 100 languages.

We hope that these resources will advance the study of possible pair-wise mappings between embeddings of several languages and their relations. Our future work in this area includes improving the models by increasing the size of the context window and their domain adaptivity through incorporating other sources of data. We will be investigating better strategies for modeling OOV words. We see improvements to OOV word handling as essential to ensure robust performance of the embeddings on real-world tasks.

Acknowledgments

This research was partially supported by NSF Grants DBI-1060572 and IIS-1017181, with additional support from TexelTek.

References

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. "Floresta sintá(c)tica": a treebank for Portuguese. In Proc. of the Third Intern. Conf. on Language Resources and Evaluation (LREC), pages 1698–1703.

Rami Al-Rfou' and Steven Skiena. 2012. Speedread: A fast named entity recognition pipeline. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pages 53–61, Mumbai, India, December. Coling 2012 Organizing Committee.

Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, and Zdeněk Žabokrtský. 2012. Prague Dependency Treebank 2.5 – a revisited version of PDT 2.0. In Proceedings of COLING 2012, pages 231–246, Mumbai, India, December. The COLING 2012 Organizing Committee.

Yoshua Bengio and J-S Senecal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. Neural Networks, IEEE Transactions on, 19(4):713–722.

Y. Bengio, H. Schwenk, J.S. Senécal, F. Morin, and J.L. Gauvain. 2006. Neural probabilistic language models. Innovations in Machine Learning, pages 137–186.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. 2009. Curriculum learning. In International Conference on Machine Learning, ICML.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Conference on Empirical Methods in Natural Language Processing, Sydney, Australia.

Léon Bottou. 1991. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, Nimes, France. EC2.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, pages 24–41.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In Proceedings of the sixth conference on Applied natural language processing, pages 224–231. Association for Computational Linguistics.

Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.

[...] Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML '12, pages 767–774. ACM, New York, NY, USA, July.

Yanqing Chen, Bryan Perozzi, Rami Al-Rfou', and Steven Skiena. 2013. The expressive power of word embeddings. CoRR, abs/1301.3226.

R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In AISTATS.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc Le, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Ng. 2012. Large scale distributed deep networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1232–1240.

Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, and Andreja Žele. 2006. Towards a Slovene dependency treebank. In Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the Twenty-eight International Conference on Machine Learning (ICML'11), volume 27, pages 97–110, June.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), June 4-5, Boulder, Colorado, USA.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474, Mumbai, India, December. The COLING 2012 Organizing Committee.
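The window feature vector of Equation 4 amounts to looking up and concatenating five embedding rows. A minimal sketch, assuming positions outside the sentence are filled with a padding index; that padding choice for the tagger is an assumption, since the text describes padding only for embedding training, and the function name and toy matrix are illustrative.

```python
def window_features(sentence_ids, i, C, pad_id, n=2):
    """Concatenate the embeddings of the words at positions i-n .. i+n."""
    F = []
    for j in range(i - n, i + n + 1):
        # Positions before the start or after the end use the padding row.
        word_id = sentence_ids[j] if 0 <= j < len(sentence_ids) else pad_id
        F.extend(C[word_id])
    return F


# Toy embedding matrix with M = 2; row 0 plays the role of the padding row.
C = [[0.0, 0.0], [1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
sentence = [1, 2, 3]
print(window_features(sentence, 0, C, pad_id=0))
```

The resulting vector always has (2n + 1) · M entries, so with n = 2 and the 64-dimensional embeddings of Section 5 the tagger's input layer would receive 320 values per word.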
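The two coverage figures reported in Table 5 differ only in whether repeated words are counted. A small sketch of both measures, with a hypothetical helper name and toy data:

```python
def coverage(dataset_tokens, vocab):
    """Return (token coverage %, word-type coverage %) of a dataset."""
    # Token coverage: every occurrence counts.
    known_tokens = sum(1 for t in dataset_tokens if t in vocab)
    # Word coverage: each distinct word counts once.
    types = set(dataset_tokens)
    known_types = sum(1 for t in types if t in vocab)
    return (100.0 * known_tokens / len(dataset_tokens),
            100.0 * known_types / len(types))


tokens = ["the", "cat", "the", "dog", "the", "axolotl"]
token_cov, word_cov = coverage(tokens, vocab={"the", "cat", "dog"})
print(token_cov, word_cov)
```

Frequent words dominate the token count, which is why token coverage sits well above word coverage for every language in the table.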

Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Terry Koo, Xavier Carreras, and Michael Collins. Fei Sha. 2012. Marginalized denoising autoen- 2008. Simple semi-supervised dependency parsing. coders for domain adaptation. In John Langford and In In Proc. ACL/HLT. Matthias Trautner Kromann. 2003. The danish depen- Lu Shuxiang. 2004. The Contemporary Chinese Dic- dency treebank and the dtag treebank tool. In Pro- tionary (Xiandai Hanyu Cidian). Commercial Press. ceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT), page 217. Kiril Simov, Petya Osenova, Milena Slavcheva, Sia Kolkovska, Elisaveta Balabanova, Dimitar Mitchell P Marcus, Mary Ann Marcinkiewicz, and Doikoff, Krassimira Ivanova, Er Simov, and Milen Beatrice Santorini. 1993. Building a large anno- Kouylekov. 2002. Building a linguistically inter- tated corpus of english: The penn treebank. Compu- preted corpus of bulgarian: the bultreebank. In In: tational linguistics, 19(2):313–330. Proceedings of LREC 2002, Canary Islands.

T. Mikolov, M. Karafiat,´ L. Burget, J. Cernocky, and Richard Socher, Eric H. Huang, Jeffrey Pennington, S. Khudanpur. 2010. Recurrent neural network Andrew Y. Ng, and Christopher D. Manning. 2011. based language model. Proceedings of Interspeech. Dynamic pooling and unfolding recursive autoen- coders for paraphrase detection. In Advances in Andriy Mnih and Geoffrey E Hinton. 2009. A scalable Neural Information Processing Systems 24. hierarchical distributed language model. Advances in neural information processing systems, 21:1081– Richard Socher, Brody Huval, Christopher D. Man- 1088. ning, and Andrew Y. Ng. 2012. Semantic com- positionality through recursive matrix-vector spaces. Frederic Morin and Yoshua Bengio. 2005. Hierarchi- In Proceedings of the 2012 Conference on Em- cal probabilistic neural network language model. In pirical Methods in Natural Language Processing Proceedings of the international workshop on artifi- (EMNLP). cial intelligence and statistics, pages 246–252. Oscar Tackstr¨ om,¨ Ryan McDonald, and Jakob Uszko- Roberto Navigli and Simone Paolo Ponzetto. 2010. reit. 2012. Cross-lingual word clusters for direct Babelnet: Building a very large multilingual seman- transfer of linguistic structure. In Proceedings of the tic network. In Proceedings of the 48th annual meet- 2012 Conference of the North American Chapter of ing of the association for computational linguistics, the Association for Computational Linguistics: Hu- pages 216–225. Association for Computational Lin- man Language Technologies, pages 477–487. Asso- guistics. ciation for Computational Linguistics.

Joakim Nivre, Jens Nilsson, and Johan Hall. 2006. J. Turian, L. Ratinov, and Y. Bengio. 2010. Word rep- Talbanken05: A swedish treebank with phrase struc- resentations: a simple and general method for semi- ture and dependency annotation. In Proceedings of supervised learning. In Proceedings of the 48th An- the fifth International Conference on Language Re- nual Meeting of the Association for Computational sources and Evaluation (LREC), pages 1392–1395. Linguistics, pages 384–394. Association for Com- putational Linguistics. Joakim Nivre, Johan Hall, Sandra Kubler,¨ Ryan Mc- Donald, Jens Nilsson, Sebastian Riedel, and Deniz Leonoor Van der Beek, Gosse Bouma, Rob Malouf, Yuret. 2007. The CoNLL 2007 shared task on de- and Gertjan Van Noord. 2002. The alpino depen- pendency parsing. In Proceedings of the CoNLL dency treebank. Language and Computers, 45(1):8– Shared Task Session of EMNLP-CoNLL 2007, pages 22. 915–932, Prague, Czech Republic, June. Associa- tion for Computational Linguistics.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Nicoletta Cal- zolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur˘ Dogan,˘ Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, ed- itors, Proceedings of the Eight International Con- ference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, may. European Lan- guage Resources Association (ELRA).

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL- 2012 shared task: Modeling multilingual unre- stricted coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natu- ral Language Learning (CoNLL 2012), Jeju, Korea.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 2002. Learning representations by back- propagating errors. Cognitive modeling, 1:213.