
Learning about Word Vector Representations and Deep Learning through Implementing word2vec

David Jurgens
School of Information, University of Michigan
[email protected]

Abstract

Word vector representations are an essential part of an NLP curriculum. Here, we describe a homework that has students implement a popular method for learning word vectors, word2vec. Students implement the core parts of the method, including text preprocessing, negative sampling, and gradient descent. Starter code provides guidance and handles basic operations, which allows students to focus on the conceptually challenging aspects. After generating their vectors, students evaluate them using qualitative and quantitative tests.

1 Introduction

NLP curricula typically include content on word semantics, how semantics can be learned computationally through word vectors, and what the vectors' uses are. This document describes an assignment for having students implement word2vec (Mikolov et al., 2013a,b), a popular method that relies on a single-layer neural network. The homework is designed to introduce students to word vectors and simple neural networks by having them implement the network from scratch, without the use of deep-learning libraries. The assignment is appropriate for upper-division undergraduates or graduate students who are familiar with Python programming, have some experience with numpy (Harris et al., 2020), and have been exposed to concepts around machine learning and neural networks. Through implementing major portions of the word2vec software and using the learned vectors, students gain a deeper understanding of how networks are trained, how word vectors are learned, and how those vectors are used in downstream tasks.

2 Design and Learning Goals

This homework is designed to take place just before the middle stretch of the class, after lexical semantics and machine learning concepts have been introduced. The content is designed at the level of an NLP student who (1) has some technical background and at least one advanced course and (2) will implement or adapt new NLP methods. This level is deeper than what is needed for a purely Applied NLP setting but too shallow for a more focused NLP class, which would likely benefit from additional derivations and proofs to solidify understanding. The homework has typically been assigned over a three to four week period; many students complete it in the course of a week, but the longer time frame enables students with less background or programming experience to work through the steps. The material prepares students for advanced NLP topics around deep learning and pre-trained language models, and it provides intuition for the steps modern deep learning libraries perform.

The homework has three broad learning goals. First, the training portion of the homework helps deepen students' understanding of machine learning concepts and gradient descent while developing their ability to write complex NLP software. Central to this design is having students turn the equations in the homework and formal descriptions of word2vec into software operations. This step helps students learn to ground equations found in papers in the more familiar language of programming, while also building a stronger intuition for how gradient descent and model training work in practice.

Second, the process of implementing word2vec aids students in developing larger NLP software methods that involve end-to-end development. This goal includes seeing how different algorithmic software designs work and are implemented. The speed of training requires that students be moderately efficient in how they implement their software. For example, the use of for loops instead of vectorized numpy operations will lead to a significant slowdown. In-class instruction and tutorials detail how to write the relevant efficient numerical operations, which helps guide students to identify where and how to selectively optimize. However, slow code will still finish correctly, allowing students to debug their initial implementations for correctness. This need for performant code creates opportunities for students to practice their performance-optimization skills.
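To make the performance point concrete, the sketch below (illustrative code, not from the assignment's starter files) contrasts a Python-level loop with the equivalent vectorized numpy call for a recurring operation in training: scoring one center-word vector against a batch of context vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
center = rng.normal(size=100)             # one center-word vector (d = 100)
contexts = rng.normal(size=(5000, 100))   # 5,000 context/negative-sample vectors

def scores_loop(center, contexts):
    # One Python-level iteration (and one temporary value) per row.
    return np.array([np.dot(center, c) for c in contexts])

def scores_vectorized(center, contexts):
    # A single matrix-vector product handled inside numpy.
    return contexts @ center

assert np.allclose(scores_loop(center, contexts),
                   scores_vectorized(center, contexts))
```

On corpus-scale workloads the vectorized form is typically orders of magnitude faster, which is the gap students are asked to close.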
Third, the lexical semantics portion of the homework exposes students to the uses and limitations of word vectors. Through training the vectors, students understand how statistical regularities in co-occurrence can be used to learn meaning. Qualitative and quantitative evaluations show students what their model has learned (e.g., using vector analogies) and introduce them to concepts of polysemy, fostering a larger discussion on what can be captured in a vector representation.

3 Homework Description

The homework has students implement two core aspects of word2vec, using numpy for the numeric portions, and then evaluate the result on two downstream tasks. The first aspect has students perform the commonly used text preprocessing steps that turn a raw corpus into self-supervised training examples. This step includes removing low-frequency tokens and subsampling tokens based on their frequency; both steps are sketched below.
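As a sketch of the first step (the threshold and function name here are illustrative, not the assignment's), low-frequency filtering counts word types but drops individual tokens:

```python
from collections import Counter

def filter_low_frequency(tokens, min_count=5):
    """Drop every token whose word type occurs fewer than min_count times.

    Counting is done over types, but filtering is applied token by token,
    so the output is still a sequence of tokens.
    """
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] >= min_count]

corpus = ["the", "dog", "saw", "the", "dog", "and", "the", "cat"]
print(filter_low_frequency(corpus, min_count=2))  # ['the', 'dog', 'the', 'dog', 'the']
```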

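Frequency-based subsampling can follow the heuristic of Mikolov et al. (2013b), in which a token of type w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the type's relative frequency and t is a small threshold; the decision is made per token, not per type. A minimal sketch, again with illustrative names and defaults:

```python
import numpy as np
from collections import Counter

def subsample(tokens, t=1e-5, rng=None):
    """Randomly thin out tokens of very frequent types (Mikolov et al., 2013b).

    Each token is kept or dropped independently: frequent types such as
    "the" lose many of their tokens, while rare types are almost always kept.
    """
    rng = rng or np.random.default_rng()
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:                       # decision is made per token
        freq = counts[tok] / total           # relative frequency of the type
        p_discard = max(0.0, 1.0 - np.sqrt(t / freq))
        if rng.random() >= p_discard:
            kept.append(tok)
    return kept
```

A common bug (revisited in Section 4) is making one keep/drop decision per type and applying it to all of that type's tokens, which deletes frequent words outright instead of thinning them.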
The second aspect focuses on the core training procedure, including (i) negative sampling for generating negative examples of context words, (ii) performing gradient descent to update the two word vector matrices, and (iii) computing the negative log-likelihood. These tasks are broken into eight discrete steps that guide students through each aspect. The assignment document includes links to more in-depth descriptions of the method, including the extensive description of Rong (2014) and the recent chapter of Jurafsky and Martin (2021, ch. 6), to help students understand the math behind the training procedure.
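The assignment's starter code and exact function signatures are not reproduced here; the following is a minimal sketch, assuming plain SGD with a fixed learning rate, of the three pieces named above: drawing negatives from the unigram distribution raised to the 3/4 power, the negative log-likelihood of one (center, context) example, -log sigma(c_ctx . w) - sum_k log sigma(-c_neg_k . w), and the corresponding gradient updates to the two matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_negative_sampler(counts):
    """Sampler over word ids from the unigram distribution raised to the
    3/4 power (Mikolov et al., 2013b)."""
    probs = np.asarray(counts, dtype=float) ** 0.75
    probs /= probs.sum()
    return lambda k: rng.choice(len(probs), size=k, p=probs)

def sgns_step(W, C, center, context, negatives, lr=0.05):
    """One gradient-descent update for skip-gram with negative sampling.

    W and C are the |V| x d center-word and context-word matrices.
    Returns the example's negative log-likelihood before the update.
    (For simplicity the sketch ignores the rare case of repeated ids.)
    """
    w = W[center]                                  # (d,)
    ids = np.concatenate(([context], negatives))   # true context word first
    labels = np.zeros(len(ids))
    labels[0] = 1.0                                # 1 = observed, 0 = negative
    c = C[ids]                                     # (1 + K, d)

    scores = sigmoid(c @ w)                        # sigma(c_j . w) per sampled word
    nll = -np.log(scores[0]) - np.log(1.0 - scores[1:]).sum()

    err = scores - labels                          # d NLL / d (c_j . w)
    W[center] -= lr * (err @ c)                    # update the center vector
    C[ids] -= lr * np.outer(err, w)                # update context/negative vectors
    return nll

# Tiny usage example: 10-word vocabulary, 50-dimensional vectors.
vocab_size, dim = 10, 50
W = (rng.random((vocab_size, dim)) - 0.5) / dim
C = np.zeros((vocab_size, dim))
sample_negatives = make_negative_sampler(rng.integers(1, 100, size=vocab_size))
print(sgns_step(W, C, center=3, context=7, negatives=sample_negatives(5)))
```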
In the second part of the homework, students evaluate the learned vectors on two downstream tasks. The first task has students load their vectors with the gensim package (Rehurek and Sojka, 2010) and perform vector arithmetic operations to find word-pair analogies and to examine the nearest neighbors of words; this qualitative evaluation exposes students to what is or is not learned by the model.
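Assuming the vectors are written out in the plain-text word2vec format (the file name below is a placeholder), these qualitative checks can be run with gensim's KeyedVectors:

```python
from gensim.models import KeyedVectors

# "my_vectors.txt" is a placeholder: one header line "num_words dim",
# then one "word v1 v2 ... vd" line per word, written out from the
# trained numpy matrix.
vectors = KeyedVectors.load_word2vec_format("my_vectors.txt", binary=False)

# Nearest neighbors: which words ended up with the most similar vectors?
print(vectors.most_similar("january", topn=10))

# Analogy by vector arithmetic: king - man + woman ~ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```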
The second task is a quantitative evaluation that has students generate word-pair similarity scores for the subset of SimLex-999 (Hill et al., 2015) present in their training corpus. These scores are uploaded to Kaggle InClass (https://www.kaggle.com/c/about/inclass) to see how their vectors compare with others; this leaderboard helps students identify bugs in their code (via a low-scoring submission) and occasionally prompts students to think about how to improve or extend their code to attain a higher score.
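The leaderboard's exact metric and submission format are defined by the assignment; the sketch below only illustrates the usual shape of such an evaluation, with a hypothetical tab-separated file of word pairs and human ratings, scored by the Spearman correlation between human ratings and cosine similarities:

```python
import csv
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_simlex(vectors, pairs_path="simlex_subset.tsv"):
    """Spearman correlation between model and human word-pair similarities.

    `vectors` maps each word to its numpy vector; `pairs_path` is a
    hypothetical tab-separated file of (word1, word2, human_score) rows.
    Pairs with out-of-vocabulary words are skipped.
    """
    model_scores, human_scores = [], []
    with open(pairs_path) as f:
        for word1, word2, score in csv.reader(f, delimiter="\t"):
            if word1 in vectors and word2 in vectors:
                model_scores.append(cosine(vectors[word1], vectors[word2]))
                human_scores.append(float(score))
    return spearmanr(model_scores, human_scores).correlation
```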

Potential Extensions

The word2vec method has been extended in numerous ways in NLP to improve its vectors (e.g., Ling et al., 2015; Yu and Dredze, 2014; Tissier et al., 2017). This assignment includes descriptions of other possible extensions that students can explore, such as implementing dropout, adding decay, or making use of external knowledge during training. Typically, a single extension to word2vec is included as a part of the homework to help ground the extension in code without increasing the difficulty of the assignment. Students who are interested in deepening their understanding can use these as starting points to see how to develop their own NLP methods as part of a course project.

This assignment also provides multiple possibilities for examining the latent biases learned in word vectors. Prior work has established that pretrained vectors often encode gender and racial biases based on the corpora they are trained on (e.g., Caliskan et al., 2017; Manzini et al., 2019). In a future extension, this assignment could be adapted to use biographies as a base corpus and have students identify how occupations become more associated with gendered words during training (Garg et al., 2018). Once this is discovered, students can discuss various methods for mitigating the bias (e.g., Bolukbasi et al., 2016; Zhao et al., 2017) and how their method might be adapted to avoid other forms of bias. This extension can help students think critically about what is and is not being captured in pretrained vectors and models.

4 Reflection on Student Experiences

Student experiences with this homework have been very positive, with multiple students expressing a strong sense of satisfaction at completing the homework and being able to understand the algorithm and software behind word2vec. Several students reported that completing this assignment was a great confidence boost and that they were now more confident in their ability to understand NLP papers and connect their equations to code.

The majority of student difficulties come from two sources. First, the vast majority of bugs happen when implementing the gradient descent and calculating the negative log-likelihood (NLL). While only a few lines of code in total, this step requires translating the equations for word2vec (in the negative sampling case) into numpy code. This translation task appeared daunting at first for many students, though they found creating the eventual solution rewarding for being able to ground similar equations in NLP papers. Two key components for mitigating early frustration were (1) including built-in periodic reports of the NLL, which help students quickly spot whether there are numeric errors that lead to infinity or NaN values, and (2) adding early stopping and printing the nearest neighbors of instructor-provided words (e.g., "January"), which should be thematically coherent after only a few minutes of training. These components help students quickly identify the presence of a bug in the gradient descent; one possible shape for such checks is sketched below.
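The starter code's actual reporting hooks are not shown here; the sketch below is one way such checks could look, pairing a cosine nearest-neighbor probe with a periodic average-NLL printout inside the training loop (the loop fragment is shown as comments and assumes the sgns_step sketch from earlier):

```python
import numpy as np

def nearest_neighbors(W, word_to_id, id_to_word, query, k=5):
    """The k words whose center vectors are closest (by cosine) to query's."""
    q = W[word_to_id[query]]
    sims = (W @ q) / np.maximum(np.linalg.norm(W, axis=1) * np.linalg.norm(q), 1e-12)
    ranked = np.argsort(-sims)
    return [id_to_word[i] for i in ranked if id_to_word[i] != query][:k]

# Inside the training loop (sgns_step as sketched earlier):
#     running_nll += sgns_step(W, C, center, context, negatives)
#     if step % 10_000 == 0:
#         print(f"step {step}: avg NLL {running_nll / 10_000:.4f}")  # inf/NaN => bug
#         print("january ->", nearest_neighbors(W, word_to_id, id_to_word, "january"))
#         running_nll = 0.0
```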

The second student difficulty comes from the text preprocessing steps. The removal of low-frequency words and frequency-based subsampling require students to have a solid grasp of the type-versus-token distinction in practice in order to subsample tokens (versus types). I suspect that because many of these routine preprocessing steps are done for the student by common libraries (e.g., the CountVectorizer of scikit-learn (Pedregosa et al., 2011)), these steps feel unfamiliar. Common errors in this theme were subsampling types, or producing a sequence of word types (rather than tokens) to use for training.

References

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th Conference on Neural Information Processing Systems (NeurIPS).

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, et al. 2020. Array programming with NumPy. Nature, 585(7825):357–362.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Dan Jurafsky and James H. Martin. 2021. Speech and Language Processing, 3rd edition. Prentice Hall.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304.

Thomas Manzini, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 254–263.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 545–550.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989.