
Learning about Word Vector Representations and Deep Learning through Implementing word2vec

David Jurgens
School of Information, University of Michigan
[email protected]

Abstract

Word vector representations are an essential part of an NLP curriculum. Here, we describe a homework that has students implement a popular method for learning word vectors, word2vec. Students implement the core parts of the method, including text preprocessing, negative sampling, and gradient descent. Starter code provides guidance and handles basic operations, which allows students to focus on the conceptually challenging aspects. After generating their vectors, students evaluate them using qualitative and quantitative tests.

1 Introduction

NLP curricula typically include content on word semantics, how semantics can be learned computationally through word vectors, and what the vectors' uses are. This document describes an assignment for having students implement word2vec (Mikolov et al., 2013a,b), a popular method that relies on a single-layer neural network. The homework is designed to introduce students to word vectors and simple neural networks by having them implement the network from scratch, without the use of deep-learning libraries. The assignment is appropriate for upper-division undergraduates or graduate students who are familiar with Python programming, have some experience with numpy (Harris et al., 2020), and have been exposed to concepts around machine learning and neural networks. Through implementing major portions of the word2vec software and using the learned vectors, students gain a deeper understanding of how networks are trained, how word vectors are learned, and how those vectors are used in downstream tasks.

2 Design and Learning Goals

This homework is designed to take place just before the middle stretch of the class, after lexical semantics and machine learning concepts have been introduced. The content is designed at the level of an NLP student who (1) has some technical background and at least one advanced course and (2) will implement or adapt new NLP methods. This level is deeper than what is needed for a purely Applied NLP setting but too shallow for a more focused NLP class, which would likely benefit from additional derivations and proofs to solidify understanding. The homework has typically been assigned over a three to four week period; many students complete it in the course of a week, but the longer time frame enables students with less background or programming experience to work through the steps. The material prepares students for advanced NLP topics around deep learning and pre-trained language models, and it provides intuition for the steps modern deep learning libraries perform.

The homework has three broad learning goals. First, the training portion of the homework helps deepen students' understanding of machine learning concepts and gradient descent while developing their ability to write complex NLP software. Central to this design is having students turn the equations in the homework and formal descriptions of word2vec into software operations. This step helps students learn to ground equations found in papers in the more familiar language of programming, while also building a stronger intuition for how gradient descent and model training work in practice.

Second, the process of implementing word2vec aids students in developing larger NLP software methods that involve end-to-end development. This goal includes seeing how different algorithmic software designs work and are implemented. The speed of training requires that students be moderately efficient in how they implement their software. For example, the use of for loops instead of vectorized numpy operations will lead to a significant slowdown. In-class instruction and tutorials detail how to write the relevant efficient numerical operations, which helps guide students to identify where and how to selectively optimize. However, slow code will still finish correctly, allowing students to debug their initial implementations for correctness. This need for performant code creates opportunities for students to practice their performance-optimization skills.
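To make the performance point concrete, the sketch below (illustrative code, not from the assignment's starter files) contrasts a Python-level loop with the equivalent vectorized numpy call for a recurring operation in training: scoring one center-word vector against a batch of context vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
center = rng.normal(size=100)             # one center-word vector (d = 100)
contexts = rng.normal(size=(5000, 100))   # 5,000 context/negative-sample vectors

def scores_loop(center, contexts):
    # One Python-level iteration (and one temporary value) per row.
    return np.array([np.dot(center, c) for c in contexts])

def scores_vectorized(center, contexts):
    # A single matrix-vector product handled inside numpy.
    return contexts @ center

assert np.allclose(scores_loop(center, contexts),
                   scores_vectorized(center, contexts))
```

On corpus-scale workloads the vectorized form is typically orders of magnitude faster, which is the gap students are asked to close.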
Third, the lexical semantics portion of the homework exposes students to the uses and limitations of word vectors. Through training the vectors, students understand how statistical regularities in co-occurrence can be used to learn meaning. Qualitative and quantitative evaluations show students what their model has learned (e.g., using vector analogies) and introduce them to concepts of polysemy, fostering a larger discussion on what can be captured in a vector representation.

3 Homework Description

The homework has students implement two core aspects of word2vec, using numpy for the numeric portions, and then evaluate the result on two downstream tasks. The first aspect has students perform the commonly used text preprocessing steps that turn a raw corpus into self-supervised training examples. This step includes removing low-frequency tokens and subsampling tokens based on their frequency; both steps are sketched below.
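As a sketch of the first step (the threshold and function name here are illustrative, not the assignment's), low-frequency filtering counts word types but drops individual tokens:

```python
from collections import Counter

def filter_low_frequency(tokens, min_count=5):
    """Drop every token whose word type occurs fewer than min_count times.

    Counting is done over types, but filtering is applied token by token,
    so the output is still a sequence of tokens.
    """
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] >= min_count]

corpus = ["the", "dog", "saw", "the", "dog", "and", "the", "cat"]
print(filter_low_frequency(corpus, min_count=2))  # ['the', 'dog', 'the', 'dog', 'the']
```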

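Frequency-based subsampling can follow the heuristic of Mikolov et al. (2013b), in which a token of type w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the type's relative frequency and t is a small threshold; the decision is made per token, not per type. A minimal sketch, again with illustrative names and defaults:

```python
import numpy as np
from collections import Counter

def subsample(tokens, t=1e-5, rng=None):
    """Randomly thin out tokens of very frequent types (Mikolov et al., 2013b).

    Each token is kept or dropped independently: frequent types such as
    "the" lose many of their tokens, while rare types are almost always kept.
    """
    rng = rng or np.random.default_rng()
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:                       # decision is made per token
        freq = counts[tok] / total           # relative frequency of the type
        p_discard = max(0.0, 1.0 - np.sqrt(t / freq))
        if rng.random() >= p_discard:
            kept.append(tok)
    return kept
```

A common bug (revisited in Section 4) is making one keep/drop decision per type and applying it to all of that type's tokens, which deletes frequent words outright instead of thinning them.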
The second aspect focuses on the core training procedure, including (i) negative sampling for generating negative examples of context words, (ii) performing gradient descent to update the two word vector matrices, and (iii) computing the negative log-likelihood. These tasks are broken into eight discrete steps that guide students through each aspect. The assignment document includes links to more in-depth descriptions of the method, including the extensive description of Rong (2014) and the recent chapter of Jurafsky and Martin (2021, ch. 6), to help students understand the math behind the training procedure.
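The assignment's starter code and exact function signatures are not reproduced here; the following is a minimal sketch, assuming plain SGD with a fixed learning rate, of the three pieces named above: drawing negatives from the unigram distribution raised to the 3/4 power, the negative log-likelihood of one (center, context) example, -log sigma(c_ctx . w) - sum_k log sigma(-c_neg_k . w), and the corresponding gradient updates to the two matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_negative_sampler(counts):
    """Sampler over word ids from the unigram distribution raised to the
    3/4 power (Mikolov et al., 2013b)."""
    probs = np.asarray(counts, dtype=float) ** 0.75
    probs /= probs.sum()
    return lambda k: rng.choice(len(probs), size=k, p=probs)

def sgns_step(W, C, center, context, negatives, lr=0.05):
    """One gradient-descent update for skip-gram with negative sampling.

    W and C are the |V| x d center-word and context-word matrices.
    Returns the example's negative log-likelihood before the update.
    (For simplicity the sketch ignores the rare case of repeated ids.)
    """
    w = W[center]                                  # (d,)
    ids = np.concatenate(([context], negatives))   # true context word first
    labels = np.zeros(len(ids))
    labels[0] = 1.0                                # 1 = observed, 0 = negative
    c = C[ids]                                     # (1 + K, d)

    scores = sigmoid(c @ w)                        # sigma(c_j . w) per sampled word
    nll = -np.log(scores[0]) - np.log(1.0 - scores[1:]).sum()

    err = scores - labels                          # d NLL / d (c_j . w)
    W[center] -= lr * (err @ c)                    # update the center vector
    C[ids] -= lr * np.outer(err, w)                # update context/negative vectors
    return nll

# Tiny usage example: 10-word vocabulary, 50-dimensional vectors.
vocab_size, dim = 10, 50
W = (rng.random((vocab_size, dim)) - 0.5) / dim
C = np.zeros((vocab_size, dim))
sample_negatives = make_negative_sampler(rng.integers(1, 100, size=vocab_size))
print(sgns_step(W, C, center=3, context=7, negatives=sample_negatives(5)))
```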
In the second part of the homework, students evaluate the learned vectors on two downstream tasks. The first task has students load their vectors with the gensim package (Rehurek and Sojka, 2010) and perform vector arithmetic operations to find word-pair analogies and to examine the nearest neighbors of words; this qualitative evaluation exposes students to what is or is not learned by the model.
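Assuming the vectors are written out in the plain-text word2vec format (the file name below is a placeholder), these qualitative checks can be run with gensim's KeyedVectors:

```python
from gensim.models import KeyedVectors

# "my_vectors.txt" is a placeholder: one header line "num_words dim",
# then one "word v1 v2 ... vd" line per word, written out from the
# trained numpy matrix.
vectors = KeyedVectors.load_word2vec_format("my_vectors.txt", binary=False)

# Nearest neighbors: which words ended up with the most similar vectors?
print(vectors.most_similar("january", topn=10))

# Analogy by vector arithmetic: king - man + woman ~ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```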
The second task is a quantitative evaluation that has students generate word-pair similarity scores for the subset of SimLex-999 (Hill et al., 2015) present in their training corpus. These scores are uploaded to Kaggle InClass (https://www.kaggle.com/c/about/inclass) to see how their vectors compare with others; this leaderboard helps students identify bugs in their code (via a low-scoring submission) and occasionally prompts students to think about how to improve or extend their code to attain a higher score.
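The leaderboard's exact metric and submission format are defined by the assignment; the sketch below only illustrates the usual shape of such an evaluation, with a hypothetical tab-separated file of word pairs and human ratings, scored by the Spearman correlation between human ratings and cosine similarities:

```python
import csv
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_simlex(vectors, pairs_path="simlex_subset.tsv"):
    """Spearman correlation between model and human word-pair similarities.

    `vectors` maps each word to its numpy vector; `pairs_path` is a
    hypothetical tab-separated file of (word1, word2, human_score) rows.
    Pairs with out-of-vocabulary words are skipped.
    """
    model_scores, human_scores = [], []
    with open(pairs_path) as f:
        for word1, word2, score in csv.reader(f, delimiter="\t"):
            if word1 in vectors and word2 in vectors:
                model_scores.append(cosine(vectors[word1], vectors[word2]))
                human_scores.append(float(score))
    return spearmanr(model_scores, human_scores).correlation
```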

Potential Extensions

The word2vec method has been extended in numerous ways in NLP to improve its vectors (e.g., Ling et al., 2015; Yu and Dredze, 2014; Tissier et al., 2017). This assignment includes descriptions of other possible extensions that students can explore, such as implementing dropout, adding decay, or making use of external knowledge during training. Typically, a single extension to word2vec is included as a part of the homework to help ground the extension in code without increasing the difficulty of the assignment. Students who are interested in deepening their understanding can use these as starting points to see how to develop their own NLP methods as part of a course project.

This assignment also provides multiple possibilities for examining the latent biases learned in word vectors. Prior work has established that pretrained vectors often encode gender and racial biases based on the corpora they are trained on (e.g., Caliskan et al., 2017; Manzini et al., 2019). In a future extension, this assignment could be adapted to use biographies as a base corpus and have students identify how occupations become more associated with gendered words during training (Garg et al., 2018). Once this is discovered, students can discuss various methods for mitigating the bias (e.g., Bolukbasi et al., 2016; Zhao et al., 2017) and how their method might be adapted to avoid other forms of bias. This extension can help students think critically about what is and is not being captured in pretrained vectors and models.

4 Reflection on Student Experiences

Student experiences with this homework have been very positive, with multiple students expressing a strong sense of satisfaction at completing the homework and being able to understand the algorithm and software behind word2vec. Several students reported that completing this assignment was a great confidence boost and that they were now more confident in their ability to understand NLP papers and connect their equations to code.

The majority of student difficulties come from two sources. First, the vast majority of bugs happen when implementing the gradient descent and calculating the negative log-likelihood (NLL). While only a few lines of code in total, this step requires translating the equations for word2vec (in the negative sampling case) into numpy code. This translation task appeared daunting at first for many students, though they found creating the eventual solution rewarding for being able to ground similar equations in NLP papers. Two key components for mitigating early frustration were (1) including built-in periodic reports of the NLL, which help students quickly spot whether there are numeric errors that lead to infinity or NaN values, and (2) adding early stopping and printing the nearest neighbors of instructor-provided words (e.g., "January"), which should be thematically coherent after only a few minutes of training. These components help students quickly identify the presence of a bug in the gradient descent; one possible shape for such checks is sketched below.
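The starter code's actual reporting hooks are not shown here; the sketch below is one way such checks could look, pairing a cosine nearest-neighbor probe with a periodic average-NLL printout inside the training loop (the loop fragment is shown as comments and assumes the sgns_step sketch from earlier):

```python
import numpy as np

def nearest_neighbors(W, word_to_id, id_to_word, query, k=5):
    """The k words whose center vectors are closest (by cosine) to query's."""
    q = W[word_to_id[query]]
    sims = (W @ q) / np.maximum(np.linalg.norm(W, axis=1) * np.linalg.norm(q), 1e-12)
    ranked = np.argsort(-sims)
    return [id_to_word[i] for i in ranked if id_to_word[i] != query][:k]

# Inside the training loop (sgns_step as sketched earlier):
#     running_nll += sgns_step(W, C, center, context, negatives)
#     if step % 10_000 == 0:
#         print(f"step {step}: avg NLL {running_nll / 10_000:.4f}")  # inf/NaN => bug
#         print("january ->", nearest_neighbors(W, word_to_id, id_to_word, "january"))
#         running_nll = 0.0
```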

The second student difficulty comes from the text preprocessing steps. The removal of low-frequency words and frequency-based subsampling require students to have a solid grasp of the type-versus-token distinction in practice in order to subsample tokens (versus types). I suspect that because many of these routine preprocessing steps are done for the student by common libraries (e.g., the CountVectorizer of scikit-learn (Pedregosa et al., 2011)), these steps feel unfamiliar. Common errors in this theme were subsampling types, or producing a sequence of word types (rather than tokens) to use for training.

References

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th Conference on Neural Information Processing Systems (NeurIPS).

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, et al. 2020. Array programming with NumPy. Nature, 585(7825):357–362.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Dan Jurafsky and James H. Martin. 2021. Speech and Language Processing, 3rd edition. Prentice Hall.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304.

Thomas Manzini, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 254–263.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 545–550.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989.