arXiv:1712.09405v1 [cs.CL] 26 Dec 2017 Text Keywords: rarel s however current pa the are this outperform that that In tricks models known pre-trained available of Crawl. Web combination a and using Wikipedia by collections, news re as nowadays applications such Processing Language Natural Many t ftersligwr etr.W ou anyon mainly focus We standard vectors. word the resulting of the of modifications qual ity the improves several significantly pipeline training that show researchers. of community We wide potentially a them to useful making very ones, available currently the improvement over consistent show new provide that we vectors work, word this pre-trained In train themselves. over by models vectors the to word ing prefer pre-trained available practitioners publicly NLP many use and cumbersome Crawl, Common be like can sources, Training data massive problems. on NLP models represen- other such the to better transferring the at on, are trained tations expected, is be model a can data As more 2010). trai the Lenci, the and about (Baroni statistics corpora ing important capturing are suggest they problems im- new that Their to transfering performance. at their capability pressive improve to pipelines NLP used in widely been have representations word2vec Recently, amount massive data. to of scale to speed maintainin training repre- while high sufficiently tasks, word a many to quality transfer high to able produce sentations implementation to their optimized and been architectures to have These according predicted in is context. while word its source word, the source model, a cbow given the predicted are words nearby rtecniuu a-fwrs(bw rhtcue,as architectures, (cbow) in skip-gram bag-of-words implemented the continuous either the on is based or representations models word log-bilinear train learning to for approach standard of A sources vast and rich data. from stati possible much as as efficiently aspect information capture critical tical to A thus is 1990). training al., their of et of cor- (Deerwester data unlabeled amount text large of typicall from pus limited is gathered gen- information statistics on This the from derived 2011). learned improve al., et (Collobert typically models data that of These words, info eralization distributional about applications. provide mation Learning representations Machine pre-trained and Process- Language (NLP) Natural many ing of blocks become building have basic representations word continuous Pre-trained 1 https://fasttext.cc/ Bjnwk ta. 2017) al., et (Bojanowski oa ioo,EoadGae it oaosi Christian Bojanowski, Piotr Grave, Edouard Mikolov, Tomas atet odvc odvcos pre-trained vectors, word word2vec, fastText, dacsi r-riigDsrbtdWr Representation Word Distributed Pre-Training in Advances word2vec .Introduction 1. Mklve l,21a and 2013a) al., et (Mikolov 1 ntesi-rmmodel, skip-gram the In . { mklv gae oaosi phsh ajoulin cpuhrsch, bojanowski, egrave, tmikolov, aeokA Research AI Facebook aeo h r yalremri nanme ftasks. of number a on margin large a by art the of tate fast- yo r-rie odrpeettosetmtdfo larg from estimated representations word pre-trained on ly Abstract n- sdtgte.Temi euto u oki h e e fpu of set new the is work our of result main The together. used y s- r- y g s - - e,w hwhwt ri ihqaiywr etrrepresent vector word high-quality train to how show we per, ann l h urudn od.Mr rcsl,gvna given precisely, More of sequence words. surrounding the con- window all symmetric a taining its as learns defined to is (2013a) according context The word al. context. et a Mikolov predicting in by representations used word as model cbow model The cbow Standard 2.1. it representations. as word im- known richer model several learn cbow explain to then provements the and word2vec, describe in briefly used was we section, this In words answering ques- phrase-based 2017). question al., a et rare Chen Squad bench- in 2016; al., and on features et (Rajpurkar pipeline dataset as standard and answering semantic 2013b), 2013), al., tion on al., et et (Mikolov (Luong syntactic, dataset quality analogies their marks: measure of use We 2017). the al., and et (2013b) (Bojanowski al. information subword et Mikolov in rep- used phrase the resentations (2013), Kavukcuoglu and i features Mnih by dependent troduced position the together: used rarely are tha strategies pre-processing data and modifications known where i.e.: surrounding, probability their the given of words log-likelihood the of the maximize to is model w crn ucinbtenaword a between function scoring a size eoe by denoted orc odi ere ncnrs ihasto sampled of word set a of a prob bility conditional with the precisely, The contrast More in candidates. negative words. learned over is proba- word classifiers this correct binary replace independent to is by for alternative bility impractical however An words is vocabulary. and choice large context This a vocabulary. of the scores in the i 1 over A Eq. function in softmax probability a representations. conditional the or for vectors, candidate natural word the by parametrized t − c 2 w . . . , c C o o n easm htw aeacs to access have we that assume we on, now For . t stecneto the of context the is t − T s ( 1 ,C w, words w , .MdlDescription Model 2. w X t } t =1 T @fb.com +1 ie t context its given ) hssoigfnto ilb later be will function scoring This . log w , . . . , w urc,Amn Joulin Armand Puhrsch, 1 w , . . . , p ( w t + t c | T t C o otx idwof window context a for h betv ftecbow the of objective the , t od .. h words the e.g., word, -th t ) C , w nE.()i replaced is (1) Eq. in n t context its and s etcorpora text e ations blicly (1) C n- a- s t , by the following quantity: 2.3. Phrase representations The original cbow model is only based on unigrams, which log 1+ e−s(w,C) + log 1+ es(n,C) , (2)   X   is insensitive to the word order. We enrich this model with n∈NC word n-grams to capture richer information. Directly incor- porating the n-grams in the models is quite challenging as it where NC is a set of negative examples sampled from the vocabulary. The objective function maximized by the cbow clutters the models with uninformative content due to huge model is obtainedby replacingthe log probabilityin Eq. (1) increase of the number of the parameters. Instead, we fol- by the quantity defined in Eq. (2), i.e.: low the approach of Mikolov et al. (2013b) where n-grams are selected by iteratively applying a mutual information T criterion to . Then, in a data pre-processing step we  −s wt,Ct s n,Ct  log 1+ e ( ) + log 1+ e ( ) . X   X   merge the words in a selected n-gram into a single token. t=1  n∈NCt  For example, words with high mutual information like ”New York” are merged in a token, ”New York”. A naturalparametrizationfor this model is to represent each This pre-processing step is repeated several times to word w by a vector vw. Similarly, a context is represented ′ form longer n-gram tokens, like ”New York City” or by the average of word vectors ′ of each word in its uw w ”New York University”. In practice, we repeat this pro- window. The scoring function is simply the dot product cess 5 − 6 times to build tokens representing longer between these two quantities, i.e., ngrams. We used the word2phrase tool from the word2vec 2 1 T project . Note that unigrams with high mutual information s(w, C)= uw′ vw. (3) are merged only with a probability of 50%, thus we still |C| X′ w ∈C keep significant number of unigram occurrences. Interst- Note that different parametrizations are used for the words ingly, even if the phrase representations are not further used in a context and the predicted word. in an application, they effectively improve the quality of the Word subsampling. The word frequency distribution in word vectors, as is shown in the experimental section. a standard follows a Zipf distribution, which 2.4. Subword information implies that most of the words belongs to small subset of Standard word vectors ignore word internal structure that the entire vocabulary (Li, 1992). Considering all the oc- contains rich information. This information could be useful curences of words equally would lead to overfit the param- for computing representations of rare or mispelled words, eters of the model on the representation of the most fre- as well as for mophologically rich languages like Finnish quent words, while underfitting on the rest. A common or Turkish. A simple yet effective approach is to enrich the strategy introduced in Mikolov et al. (2013a) is to subsam- word vectors with a bag of character n-gram vectors that ple frequent words, with the following probability p of disc is either derived from the singular value decomposition of discarding a word: the co-occurence matrix (Sch¨utze, 1993) or directly learned fromalargecorpusof data(Bojanowski et al., 2017). In the pdisc(w)=1 − t/fw (4) p latter, each word is decomposed into its character n-grams where fw is the frequency of the word w, and t > 0 is a N and each n-gram n is represented by a vector xn. The parameter. word vector is then simply the sum of both representations, i.e.: 2.2. Position-dependent Weighting 1 The context vector described above is simply the average vw + xn. (6) |N| X of the word vectors contained in it. This representation is n∈N oblivious to the position of each word. Explicitly encod- In practice, the set of n-grams N is restricted to the n-grams ing a representation for both a word and its position would with 3 to 6 characters. Storing all of these additional vec- be impractical and prone to overfitting. A simple yet effec- tors is memory demanding. We use the hashing trick to tive solution introduced in the context of word representa- circumvent this issue (Weinberger et al., 2009). tion by Mnih and Kavukcuoglu (2013) is to learn position representations and use them to reweight the word vectors. 3. Training Data This position dependent weighting offers a richer context We used several sources of text data that are publicly avail- representation at a minimal computational cost. able and the Gigaword dataset, as described in Table 1. Each position p in a context window is associated with In particular, we used English Wikipedia from June 2017, a vector d . The context vector is then the average of p from which we used the meta pages archive which resulted the context words reweighted by their position vectors. in a text corpus with more than 9 billion words 3. Further, More precisely, denoting by P the set of relative positions we used all news datasets from statmt.org from years [−c, . . . , −1, 1,...,c] in the context window, the context 2007 - 2016, the UMBC corpus (Han et al., 2013), the En- vector v of the word w is: C t glish Gigaword, and Common Crawl from May 20174.

vC = dp ⊙ ut p, (5) X + 2 p∈P https://github.com/tmikolov/word2vec 3https://dumps.wikimedia.org/enwiki/latest/ where ⊙ is the pointwise multiplication of vectors. 4https://commoncrawl.org/2017/06 In case of the Common Crawl, we wrote a simple data ex- Model SemSynTot tractor based on a unigram language model that retrieves cbow+uniq 79 73 76 the documents written in English and discards low quality data. The same approach can be in fact used to extract text cbow+uniq+phrases 82 78 80 data for many other languages from Common Crawl. cbow+uniq+phrases+weighting 87 82 85 We decided to perform no complex data normalization or Table 2: Accuracies on semantic and syntactic analogy pre-processing, as we want the resulting word vectors to be datasets for models trained on Common Crawl (630B very easily used by a wide community (the text normaliza- tion can be done on top of the published word vectors as words). By performing sentence-level de-duplication, adding position-dependent weighting and phrases, the a post-processing step). We only used a publicly available model quality improves significantly. tokenizer.perl script from the Moses MT project5. We observed that de-duplicating large text training corpora, es- pecially Common Crawl, significantly improves the quality of the resulting word vectors. 87% accuracy on the word analogy tasks is to our knowl- Corpus Size [billion] edge the best published result so far by a large margin, and much better than existing GloVemodels which were trained Wikipedia meta-pages 9.2 on comparable corpora. We improved this result further to Statmt.org News 4.2 88.5% accuracy by adding the sub-word features. UMBCNews 3.2 We also report very strong performance on the Rare Gigaword 3.3 Words dataset (Luong et al., 2013), again outperforming CommonCrawl 630 GloVe models by a large margin. Finally, we replaced the GloVe pre-trained vectors with the new fastText vec- Table 1: Training corpora and their size in billions of words tors in a system trained on the Squad after tokenization and sentence de-duplication. dataset (Rajpurkar et al., 2016). In a setup that is further described in Chen et al. (2017), we did observe significant improvement of the accuracy. 4. Results Further we report results for models trained on either the Model Analogy RW Squad Common Crawl, or on a combination of the Wikipedia, Statmt News, UMBC and Gigaword. This is compara- GloVeWiki+news 72 0.38 77.7% ble to corpora that other models that attempted to improve fastText Wiki + news 87 0.50 78.8% upon word2vec were trained on, notably the GloVe model GloVeCrawl 75 0.52 78.9% fromthe StanfordNLP group(Pennington et al., 2014). Al- fastText Crawl 85 0.58 79.8% though a careful analysis performed in Levy et al. (2015) shows that the original word2vec is faster to train, produces Table 3: Results on Word Analogy, Rare Words and Squad more accurate models and takes significantly less mem- datasets with fastText models trained on various corpora ory than the GloVe algorithm, the availability of large pre- (see Table 1) or Common Crawl (see Table 2), and com- trained GloVe models proved to be a useful resource for parison to GloVe models trained on comparable datasets. many researchers who do not have time to train their own model on very large dataset like the Common Crawl. We used the cbow architecture described in Section 2.1. with window size 5 for the baseline models and win- The models trained on the Wikipedia and News cor- dow size 15 for the models that learn position-dependent pora, and on the Common Crawl, were published at the weights (described in Section 2.2.). We used 10 nega- fasttext.cc website and are available to the NLP re- tive examples for training with the negative sampling and searchers. Further, we did experiment with the phrase- threshold for subsampling frequent words set to t = 10−5. based analogydataset introducedin Mikolov et al. (2013b), In Table 2 we report results on the word analogies and achieved 88% accuracy using the model trained on from Mikolov et al. (2013a) using baseline cbow model Crawl, which again is to our knowledge the new state of trained on Common Crawl with de-duplicated sentences, the art result. We plan to release the model containing all with phrases (we used 6 iterations of building the the phrases in the near future. phrases by merging bigrams with high mutual informa- Finally in Table 4, we use a script provided by tion), and with the position-dependent weighting as used Conneauet al. (2017) to measure the influence of dif- in Mnih and Kavukcuoglu (2013). The training itself took ferent pre-trained word vector models on several text three days on a single multi-core machine. classification tasks (MRPC, MR CR, SUBJ, MPQA, In Table 3, we can see comparison between cbow as im- SST and TREC). We performed the classification us- plemented in the fastText library (Bojanowski et al., 2017) ing the standard fastText toolkit running in a supervised and the GloVe models trained on comparable corpora. The mode (Joulin et al., 2016), using the pre-trained models to initialize the classifier. Overall, the new fastText word vec- 5https://github.com/moses-smt tors result in superior text classification performance. Corpora MRPC MR CR SUBJ MPQA SST TREC Average GloVe Wiki+news 71.9/81.0 75.7 78.1 91.5 86.9 78.1 66.6 79.7 GloVe Crawl 72.0/80.7 78.0 79.6 91.8 88.0 80.0 84.2 82.0 fastText Wiki+news 72.9/81.6 77.8 80.3 92.2 88.3 81.1 85.0 82.5 fastText Crawl 73.4/81.6 78.2 81.1 92.5 87.8 82.0 84.0 82.7

Table 4: Comparison of different pre-trained models on supervised text classification tasks.

5. Discussion Joint Conference on Lexical and Computational Seman- In this work, we have focused on providing very high qual- tics. Association for Computational Linguistics, June. ity set of pre-trained word and phrase vector representa- Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. tions. Our findings indicate that improvements can be (2016). Bag of tricks for efficient text classification. achieved by training well-known algorithms on very large arXiv preprint arXiv:1607.01759. text datasets, and that using certain tricks can provide fur- Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving ther gains in quality. Notably, we have found it very im- distributional similarity with lessons learned from word portant to de-duplicate sentences in large corpora such as embeddings. Transactions of the Association for Com- the Common Crawl before training the models. Next, we putational Linguistics, 3:211–225. have used an algorithm for building the phrases in a pre- Li, W. (1992). Random texts exhibit zipf’s-law-like word processing step. Finally, adding the position-dependent frequency distribution. IEEE Transactions on informa- weights and subword features to the cbow model architec- tion theory, 38(6):1842–1845. ture gave us the final boost of accuracy. The models de- Luong, T., Socher, R., and Manning, C. D. (2013). Better scribed in this paper are freely available to researchers and word representations with recursive neural networks for engineers at the fastText webpage, and we hope that these morphology. In CoNLL, pages 104–113. will be useful in various projects that use textual data. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector 6. Acknowledgements space. arXiv preprint arXiv:1301.3781. We thank Marco Baroni and German Kruszewski for useful Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and discussions and suggestions, Adam Fisch for help with the Dean, J. (2013b). Distributed representations of words experiments on Squad dataset, and Qun Luo for suggesting and phrases and their compositionality. In Advances to use the position-dependent weighting at the word2vec in neural information processing systems, pages 3111– discussion forum. 3119. Mnih, A. and Kavukcuoglu, K. (2013). Learning word em- 7. Bibliographical References beddings efficiently with noise-contrastive estimation. Baroni, M. and Lenci, A. (2010). Distributional memory: In Advances in neural information processing systems, A general framework for corpus-based semantics. Com- pages 2265–2273. putational Linguistics, 36(4):673–721. Pennington, J., Socher, R., and Manning, C. (2014). Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Glove: Global vectors for word representation. In Pro- (2017). Enriching word vectors with subword informa- ceedings of the 2014 conference on empirical methods tion. Transactions of the Association for Computational in natural language processing (EMNLP), pages 1532– Linguistics, 5:135–146. 1543. Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017). Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Reading wikipedia to answer open-domain questions. Squad: 100,000+ questions for machine comprehension arXiv preprint arXiv:1704.00051. of text. arXiv preprint arXiv:1606.05250. Collobert, R., Weston, J., Bottou, L., Karlen, M., Sch¨utze, H. (1993). Word space. In Advances in neural Kavukcuoglu, K., and Kuksa, P. (2011). Natural lan- information processing systems, pages 895–902. guage processing (almost) from scratch. Journal of Ma- Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and chine Learning Research, 12(Aug):2493–2537. Attenberg, J. (2009). Feature hashing for large scale Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and multitask learning. In ICML. Bordes, A. (2017). of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent se- mantic analysis. Journal of the American society for in- formation science, 41(6):391. Han, L., Kashyap, A. L., Finin, T., Mayfield, J., and Weese, J. (2013). UMBC EBIQUITY-CORE: Semantic Tex- tual Similarity Systems. In Proceedings of the Second