
Word2vec and its application to examining the changes in word contexts over time

Taneli Saastamoinen

Master’s thesis, 10 November 2020
Social statistics, Faculty of Social Sciences, University of Helsinki

Faculty: Faculty of Social Sciences
Degree Programme: Social statistics

Author: Taneli Saastamoinen

Title: Word2vec and its application to examining the changes in word contexts over time

Subject/Study track: Statistics

Level: Master's thesis
Month and year: November 2020
Number of pages: 88

Abstract

Word2vec is a method for constructing so-called word embeddings, or word vectors, from natural text. Word embeddings are a compressed representation of word contexts, based on the original text. Such representations have many uses in natural language processing, as they contain a lot of contextual information for each word in a relatively compact and easily usable format. They can be used either for directly examining and comparing the contexts of words or as more informative representations of the original words themselves for various tasks.

In this thesis, I investigate the theoretical underpinnings of word2vec, how word2vec works in practice and how it can be used and its results evaluated, and how word2vec can be applied to examine changes in word contexts over time. I also list some other applications of word2vec and word embeddings and briefly touch on some related and newer algorithms that are used for similar tasks.

The word2vec algorithm, while mathematically fairly straightforward, involves several optimisations and engineering tricks that trade off theoretical accuracy against practical performance. These are described in detail and their impacts are considered. The end result is that word2vec is a very efficient algorithm whose results are nevertheless robust enough to be widely usable.

I describe the practicalities of training and evaluating word2vec models using the freely available, open source gensim library for the Python programming language. I train numerous models with different hyperparameter settings and perform various evaluations on the results to gauge the fit of the word2vec models. The source material for these models comes from two corpora of news articles in Finnish, from STT (years 1992–2018) and Yle (years 2011–2018). The practicalities of processing Finnish-language text with word2vec are considered as well.

Finally, I use word2vec to investigate the changes of word contexts over time. This is done by considering word2vec models that were trained from the Yle and STT corpora one year at a time, so that the context of a given word can be compared between two different years. The main word I consider is "tekoäly" (Finnish for "artificial intelligence"); some related words are examined as well. The result is a comparison of the nearest neighbours of "tekoäly" and related words in various years across the two corpora. From this it can be seen that the context of these words has changed noticeably during the time considered. If the meaning of a word is taken to be inseparable from its context, we can conclude that the word "tekoäly" has meant something different in different years. Word2vec, as a quantitative method, provides a measurable way to gauge such semantic change over time. This change can also be visualised, as I have done.

Word2vec is a stochastic method and as such its convergence properties deserve attention. As I note, the convergence of word2vec is by now well established, both through theoretical examination and through its very numerous successful practical applications. Although this is not usually done, I repeat my analysis in order to examine the stability and convergence of word2vec in this particular case, concluding that my results are robust.

Keywords: word2vec, word vectors, word embeddings, distributional semantics, natural language processing, semantic change, vector representations

Supervisor: Matti Nelimarkka

Where deposited: Helsinki University Library


Contents

1 Introduction
  1.1 Computational text analysis
  1.2 Topic models
  1.3 Word2vec and related methods

2 The word2vec algorithm
  2.1 Skip-gram
    2.1.1 Input transformation and negative sampling
    2.1.2 Neural network architecture
  2.2 Continuous bag-of-words
    2.2.1 Input transformation and negative sampling
    2.2.2 Neural network architecture
  2.3 Optimisations and adjustments
  2.4 Limitations
  2.5 Applications of word2vec
  2.6 After word2vec

3 Word2vec in practice
  3.1 Source material and processing
    3.1.1 Data processing pipeline
    3.1.2 Processing of Finnish-language text
  3.2 Word2vec parameters and model training
    3.2.1 Evaluation of word2vec models

4 Experimental results
  4.1 Method
  4.2 Results: Yle corpus
    4.2.1 Other interesting words
  4.3 Results: STT corpus
    4.3.1 Other interesting words
  4.4 Stability and repeatability of results
    4.4.1 Repeated analysis: Yle corpus
    4.4.2 Repeated analysis: STT corpus
  4.5 Summary

5 Discussion
  5.1 Word2vec and other algorithms
  5.2 Word2vec in practice
  5.3 Experiment: semantic change over time
  5.4 Limitations of word2vec

6 Conclusion

Bibliography

A Finnish analogy set

Chapter 1

Introduction

This thesis is about word2vec, a method of computational text analysis, and specifically its application to examining changes in the contexts of words over time. Word2vec [58, 60] is an algorithm for computing so-called word vectors, also known as word embeddings. These embeddings are useful in many different ways, as we will see. One thing they provide is a straightforward distance metric between words in the input text corpus, a distance based on word co-occurrence in the original context. This distance can then be used to examine changes in word context over time, which arguably can serve as a reasonable proxy for changes in meaning over time. Another concept related to context and distance is synonymity.

One famous aspect of word embeddings produced with word2vec is their capability to organise concepts found in the original source material, without any guidance other than the word co-occurrences, i.e. the word contexts, in the original text. An example of this is depicted in figure 1.1. The model has learned that the relationship between Kreikka (Greece) and Ateena (Athens) is the same, i.e. in the same vector direction, as that between Suomi (Finland) and Helsinki. This figure was produced by using the UMAP dimensionality reduction technique [56] on vectors from a 100-dimensional word2vec model trained on year 2015 of the Yle corpus of Finnish news articles. Full details can be found in chapter 3. The original inspiration for the figure is [60].
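
As a concrete illustration of this analogy relationship (not part of the analysis itself), such a query can be expressed with the gensim library, which is used later in this thesis. The model file name below is hypothetical, and the exact neighbours returned depend entirely on the trained model.

from gensim.models import Word2Vec

# Load a previously trained model; the file name is only an illustration.
model = Word2Vec.load("yle_2015_dim100.model")

# "Kreikka is to Ateena as Suomi is to ...?", i.e. v(Ateena) - v(Kreikka) + v(Suomi).
print(model.wv.most_similar(positive=["Ateena", "Suomi"], negative=["Kreikka"], topn=3))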

1.1 Computational text analysis

There has been a great deal of interest in computational analysis of text in recent years. According to Grimmer and Stewart [30], who focus on analysing political writings, the primary problem that computational analysis aims to solve is the large volume of source material. Automated methods can help by making very large-scale text analysis possible. Although, as Grimmer and Stewart note, such computational methods cannot at present replace competent human readers as the primary analysers of texts, they can nevertheless offer supporting data for proving and disproving human-crafted hypotheses regarding the source material.

Figure 1.1: Two-dimensional UMAP projection of certain 100-dimensional word vectors created with word2vec. Source material: Yle corpus, year 2015. See text for details.

While computational text analysis can be useful, one must be aware of its limitations, and care must be taken to achieve good results with it. Grimmer and Stewart [30] discuss four useful principles for utilising automated methods. The first principle is to acknowledge that computational models are never perfect. This is due to both the current lack of profound understanding of exactly how humans generate text and the fact that any mechanical method must in any case fall well short of human-level understanding and intuition. This leads to the second principle, which is that mechanical models are no substitute for human understanding, but can be a useful supplement to it.

The third principle Grimmer and Stewart discuss is that no currently known method is the best one across the board in text analysis; rather, which model is the most useful depends on the task at hand. Finally, as the fourth principle, the results of any model used must be carefully validated in order to ascertain how valid the results are. Validation is not a trivial task, as it requires both technical knowledge of the model being used and relevant subject matter expertise of the "all-human" kind.

Another, similar perspective on these matters is provided by DiMaggio [16], who discusses a need to better understand the boundaries between human understanding and automated methods. He highlights the need to understand the strengths and weaknesses of various analysis methods as they pertain to different types of corpus; the need to understand how to properly preprocess the data and the difference this can make to automated analysis; and the need to understand human bias and how this can affect the bias of automated techniques. Yet another discussion on the possibilities and dangers of computational text analysis is provided by Wilkerson and Casas [79]. They refer to various promising results obtained via automated methods, while warning of the difficulty in choosing a suitable model and validating the results.

An interesting case study in text analysis is Ylä-Anttila's dissertation [83], which brings together several papers of "classic" human analysis of political texts and one where automated methods, namely topic models, were used. Ylä-Anttila found the automated analysis useful and was able to use it as a basis for arguing for a possible interpretation of the data. In general, various researchers are looking into how to combine traditional qualitative methods with modern quantitative techniques of computational text analysis, as documented by e.g. Muller et al. [63]. This thesis, however, is mainly about the technical and practical details of word2vec, and such qualitative methods are outside our scope.

Turning to quantitative methods, we note that all current algorithmic natural language processing methods are computational and as such necessarily mechanical; they do not and cannot possess the kind of "intelligence" required to understand natural text in the same ways that humans do. This is why such automatic methods must always be accompanied by a qualitative human intelligence, and for this reason it is legitimate to ask how good such purely computational methods can possibly get at "understanding" human text. This topic has been widely studied since at least the 1980s [13, 18]. The following is a brief overview of some of the most well-known methods and the results achieved with them, with an eye towards how word2vec fits into this landscape.

1.2 Topic models

One of the first widely used natural language processing algorithms was latent semantic analysis, also known as latent semantic indexing (LSA, LSI) [13, 18]. Developed in the late 1980s, LSA belongs in the category of topic models, which has since been expanded with other algorithms and techniques.

The goal of a topic model is to classify a collection of documents by estimating which topic, or topics, each of the documents is about. In the mathematical formulation, a "document" is simply an unordered collection of word frequencies, and a topic is a probability distribution over all the words in all the documents. For example, after fitting the model, one topic might have a high probability for the words "robot", "algorithm" and "program" (and a very low probability for other words), and another topic might have high probability for words like "robot", "industry" and "factory". A given document might then be estimated to be a mixture of 80 % of the "robot" topic and 20 % of the "industrial" topic.

In topic models, each document is represented as a so-called bag of words, which is simply a vector of word frequencies in the document. To produce this vector, we count the occurrences of each individual word in the document. Obviously with such a model the ordering and context of the original words is lost. The idea is that the word frequencies are very simple and quick to compute, and the resulting vectors are much smaller than the original documents, while the frequencies will hopefully still retain enough information about the semantic meaning of the original text.

In the original LSA topic model the word frequency vectors are computed in accordance with the bag of words technique and then "compressed" by using singular value decomposition on the matrix containing all document vectors. As is well known, with singular value decomposition, a good approximation of the original word-document matrix can be obtained by retaining only a relatively small number of the largest singular values and the vectors associated with them. For example, in the original LSA study the various data sets had a total vocabulary size of between 3000 and 6000 [18], while the authors were able to get decent results from LSA by retaining only the K = 100 most meaningful vectors. In practice, the vocabulary size for natural text can often be in the tens of thousands or even hundreds of thousands, while for the value of K, something in the range of 100 to 500 is often sufficient. For instance, for another similar topic model called PLSA (probabilistic LSA), a value of K between 128 and 256 was found to be sufficient [37].

LSA's singular value decomposition is not the only way to tackle topic modelling. In the aforementioned PLSA [36, 37], the documents are handled quite differently – the bag-of-words frequency vectors are fitted with a latent-variable model. PLSA therefore models the topics directly using a probability distribution, whereas in LSA the estimated topic-to-word vectors are obtained as part of the singular value decomposition. The form of the result is the same in both cases, but PLSA provides topics which are easier to interpret, since it models topics as (non-negative) probability scores over words.

A newer topic model is latent Dirichlet allocation (LDA) from Blei, Ng and Jordan [4, 5]. LDA is an extension of PLSA into a full hierarchical Bayesian model, using the Dirichlet distribution as a prior for both the words and the topics, which are then modelled with multinomials as in PLSA.

Although fitting this model requires more complicated inference techniques, it gives better results than PLSA and suffers less from overfitting.

In general, the benefit of topic models is that they are computationally light and well understood, and their results, which are fairly straightforward to obtain, are often interesting. One caveat however is that the bag of words model is very crude and discards a lot of information, such as sentence-level context in its entirety. How much this matters depends on the application, but it is something to be aware of. The results of a topic model always require human interpretation and validation before they can be used. Ylä-Anttila ([83], article 4) provides a good description of a typical topic model workflow.
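
As an aside, the LSA pipeline sketched above (bag-of-words counts followed by a truncated singular value decomposition) is short to express in code. The following minimal sketch uses scikit-learn, which is not part of this thesis's toolchain; the toy documents and the choice of K = 2 components are arbitrary.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the robot runs an algorithm in a program",
    "the factory installed an industrial robot",
    "a new program for the industry",
]

# Bag of words: one row per document, one column per vocabulary word.
counts = CountVectorizer().fit_transform(docs)

# Keep only the K largest singular values; here K = 2, far fewer than the vocabulary size.
lsa = TruncatedSVD(n_components=2)
doc_vectors = lsa.fit_transform(counts)

print(doc_vectors.shape)  # (3, 2): a dense, low-dimensional vector per document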

1.3 Word2vec and related methods

Topic models are based on the traditional concept of the vector space model, which was introduced earlier [17, 71]. The central idea, as was discussed, is to construct a representation of given text documents as dense vectors. Topic models are one way of doing this, and they proceed from the initial modelling choice of the bag-of-words model, in which each document is first condensed into a document vector.

Word2vec and other related algorithms use a more granular representation of the source text for their vector construction, namely, they operate on the sentence level: rather than representing an entire document as one vector, each sentence (regardless of which document it is from) is handled separately. Of course, one could seek even greater granularity than this, for example with character-based generative models, for which RNN-based techniques [74] are commonly used. Such very granular models are outside the scope of our discussion here.

Given a vector representation, at whatever granularity, the next task of a model is usually to seek a compressed representation of it. While LSA, for example, uses matrix factorisation on the word-document matrix to accomplish this, word2vec takes a different approach, being an autoencoder, i.e. a shallow neural network, as we will see. Such neural network approaches for learning word vectors have been investigated for some time [3]; word2vec is one of the major breakthroughs in this area. A more detailed historical review of topic models, word2vec and related models can be found in e.g. chapter 6 of the upcoming third edition of Jurafsky & Martin's book [42].

In what follows, we will first look at the theoretical underpinnings of word2vec in chapter 2. This chapter also includes discussion on some of its applications and on later, more advanced algorithms. How word2vec works in practice is examined in detail in chapters 3 and 4. Chapter 3 is dedicated to the practicalities of fitting a word2vec model and evaluating its performance, while chapter 4 describes an examination of two text corpora using word2vec, as an example of the kind of study that can be done with word2vec. Finally, chapters 5 and 6 summarise and discuss.

Chapter 2

The word2vec algorithm

Word2vec [58, 60] is a modern method for computing so-called word embeddings, also known as word vectors. Word embeddings are a way of encoding natural-language words into relatively low-dimensional, dense vectors, based on contextual information extracted from short sentences and sentence fragments. The idea is that these word embeddings contain information derived from the contexts of each target word, i.e. from the words frequently occurring near each target word, and are therefore more informative than the plain words by themselves. Once computed, the embeddings can be used for various tasks such as clustering of words to detect semantic closeness and synonymy; they can also be used in more indirect ways, as we will see in section 2.5.

Word vectors and their efficient computation have been a topic of study since the 1980s [35]. One recent breakthrough was achieved by Bengio et al. [3], who proposed a neural network model for the computation. Word2vec is a continuation of this idea. Word vectors are an active topic of research; for example, another recent way of calculating word vectors is GloVe [66], which starts from a different viewpoint but arrives at a training objective similar to word2vec.
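
In practice such embeddings are rarely computed by hand; chapter 3 uses the gensim library for this. As a preview, the following minimal sketch trains and queries a word2vec model with gensim (assuming the gensim 4.x API); the toy sentences and all hyperparameter values shown are purely illustrative.

from gensim.models import Word2Vec

# Input: pre-tokenised sentences; in chapter 3 these come from a preprocessed news corpus.
sentences = [
    ["eating", "some", "delicious", "banana", "cake"],
    ["the", "banana", "cake", "was", "delicious"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # D: embedding dimensionality
    window=5,         # C: maximum context window size
    min_count=1,      # M: minimum word frequency (toy value; 3-10 is more typical)
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # K: negative sampling factor
    epochs=10,
)

vector = model.wv["banana"]                      # the 100-dimensional embedding of "banana"
print(model.wv.most_similar("banana", topn=3))   # nearest neighbours in the embedding space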

2.1 Skip-gram

There are two main variants of the word2vec algorithm, known as skip-gram and continuous bag of words (CBOW). In this section, I will first explain skip-gram in detail, and then explain how CBOW differs from it in the next section. The two variants are fairly similar in structure, and the hyperparameters of the algorithm (table 2.2; see below for details) are applicable to both variants.

To understand the computational objective of word2vec, it may help to consider the contexts of words in a more general setting. Given a corpus of natural text, one way to examine it is to consider the context in which each word appears, and one way to consider the context of each word is to construct a simple word-word co-occurrence matrix. An example is seen in table 2.1.

             aardvark   banana   cream   ...   eat   ...   zombie
aardvark                     0       0   ...    12   ...        0
banana              0               14   ...    38   ...        1
cream               0       14           ...    19   ...        2
...
eat                12       38      19   ...         ...       13
...
zombie              0        1       2   ...    13   ...

Table 2.1: An example of a word-word co-occurrence matrix.

Such a table records how many times each word in the vocabulary appeared in the same context, such as in the same sentence, as each other word. The context window can be taken to be the entire sentence, or, as is usually done, an upper limit can be set for the window size within each sentence.

Co-occurrence matrices are common in natural language processing [41]. They can be viewed as a generalisation of the classic bag of words representation, which was used in pioneering natural language models such as LSA [13, 18]. The bag of words representation is essentially a word-document co-occurrence matrix. To obtain higher granularity, one can construct a word-word co-occurrence matrix, as shown above. Word2vec, then, essentially computes a compressed representation of such a word-word co-occurrence matrix.
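
Tabulating such a matrix directly is straightforward. The sketch below is an illustration only (it is not how word2vec itself is implemented): it counts co-occurrences within a fixed window inside each sentence, with a toy sentence and window size.

from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words occurs within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    counts[(word, sentence[j])] += 1
    return counts

counts = cooccurrence_counts([["eating", "some", "delicious", "banana", "cake"]])
print(counts[("banana", "delicious")])  # 1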

2.1.1 Input transformation and negative sampling

Word2vec is an autoencoder. This means it is a shallow neural network that tries to compute a structured, simplified encoding of the given input, in such a way that this encoding can be used to reconstruct the original signal (to a reasonable level of accuracy). In other words, an autoencoder attempts to find a good way of compressing its input and thus achieve dimensionality reduction.

The input of word2vec is a collection of sentences, or sentence fragments. These are simply ordered lists of words. Given a collection of sentences, the algorithm considers each word in its context, which is defined to be the previous C words and the following C words. The integer C > 0 is a hyperparameter; typical values are 5 or 10. This and other hyperparameters are described in table 2.2.

It must be noted that the output of word2vec is not meant to be an exact replica of the input. Rather, the output is a predicted frequency table of the context words for each input word, while the input is handled one example at a time; the output is therefore an aggregated view of the input seen so far. In what follows, "output" refers to the output of the last layer of the word2vec neural network. The result of the algorithm that we are actually interested in are the encodings, or word vectors, which after training are read off from the weight matrix of the first hidden layer, rather than from the last layer. Therefore the "output" of the word2vec neural network is not the same as the "result" of the word2vec algorithm.

Q – The total number of words in the input corpus. Only counts words that are considered given the threshold M (see below).
V – The vocabulary, i.e. the set of all words that we consider in the source material.
N – The size of the vocabulary, i.e. the number of unique words: N = |V|. Often hundreds of thousands for real-life text corpora.
D – The dimensionality of the desired encodings, i.e. the length of each resulting embedding vector. Usually good results are obtained with values in the 100–600 range.
C – The maximum size of the window, in each direction from the word in the middle. That is, the maximum sentence length used for training is 2C + 1: the word in the middle and C words before and after. Usually this is set to 5–10.
M – The minimum number of occurrences required for a word to be included in the vocabulary. The purpose of this is to remove words which are too rare to be informative, before training the model. Common values are 3–10.
K – Negative sampling factor; see text for details. Common values are 5–10.

Table 2.2: The hyperparameters of word2vec.

As is seen from this, the total number of parameters in a word2vec model can be quite large. For example, a realistic real-life data set might have some 92,000 unique words; if the desired embedding dimensionality is then set to, say, 300, the model will have 92,000 × 300 = 27,600,000 total parameters in the embedding layer (the network structure that results in this simple formula is detailed below). The embedding dimensionality is word2vec's most important hyperparameter, and it can have a drastic effect on bias on one hand and possible overfitting on the other hand. This is investigated in more detail in chapter 3.

The input to the word2vec algorithm is a set of sentences, which are simply ordered sequences of words or tokens. We assume that all necessary preprocessing is done by this point. Common preprocessing steps are lowercasing, removing excess punctuation and splitting the input into sentences and the sentences into words or tokens. More information about preprocessing in practice can be found in chapter 3.

There is a preliminary pass through the entire input corpus to determine the vocabulary. This is done by counting, for each word, how many distinct sentences it appears in, and then culling words which appear too rarely in the corpus to be of much interest.

Figure 2.3: Example of word2vec skip-gram input transformation with C = 2, when processing the word "banana" in the sentence fragment "eating [ some delicious banana cake ]". Only the positive examples are depicted.

This culling threshold is a hyperparameter of the algorithm, which I call M (it is unnamed in the original paper); common values for M are 3–5. As is usual for such hyperparameters, there is no straightforward way to determine the "best" value of M; this depends on factors such as the total size of the corpus, and in a given situation one should experiment with various values of M to see which one gives the best results.

Once the vocabulary has been constructed, the word2vec model is trained on the corpus, which involves several epochs, i.e. several complete go-throughs of the entire corpus. The number of epochs is usually between 10 and 30. After a point, more epochs do not necessarily give better performance but may instead risk the model overfitting. This is elaborated on in chapter 3.

A full training epoch of the algorithm involves simply going through each input sentence and converting it into suitable input-output pairs, with which the neural network model is trained in the usual way, i.e. with gradient descent and backpropagation. The sentences can be processed in any order; indeed, a different random ordering for each epoch can be beneficial, as is often the case for neural network training (see e.g. [27], ch. 8). To speed up the training, the sentences, or the input-output pairs, may be collected into moderately-sized batches, for example of 500 sentences or 10,000 input-output pairs per batch. The optimal batch size depends on the implementation and the hardware used.

The method for converting each input sentence into input-output pairs is shown in listing 2.1. In order to more easily understand this procedure, let us consider an example sentence. Let our context window size be C = 2 and say we are considering the sentence

s = “eating some delicious banana cake”.

If we are currently at the word "banana", then with a context window width C = 2 we would consider the context words "some", "delicious" and "cake", producing the input-output pairs "banana" → "some", "banana" → "delicious" and "banana" → "cake", as depicted in figure 2.3. Note that with this context window size we end up missing the connection between "banana" and "eating", which demonstrates the effect that the selection of context window width C can have.

The positive examples mentioned here are one half of the required material for training the word2vec model. The other half are the so-called negative examples. The negative examples are produced with the method specified in listing 2.2. This part of the algorithm is, in the abstract, a softmax ([27], pp. 180–184).

Input: Ordered list of words W (the sentence); list of words V (the vocabulary); list of word frequencies F; integer C > 0 (the context window size); integer K > 0 (negative sampling factor)
Output: List of triplets (x, y, Z), where x is the input word, y is the desired output word (i.e. positive example), and Z = (z1, z2, ..., zK) are the negative examples

 1  result ← ()
 2  foreach w ∈ W do
 3      if w ∈ V then
 4          context ← (up to C words from W before w) + (up to C words from W after w)
 5          foreach c ∈ context do
 6              negatives ← NegativeSample(V, F, K, w, c)
 7              result ← result + (w, c, negatives)
 8          end
 9      end
10  end
11  return result

Algorithm 2.1: Word2vec input transformation, skip-gram version.

Input: List of words V (the vocabulary); list of word frequencies F; integer K > 0 (negative sampling factor); word w; word c
Output: List of negative examples Z = (z1, z2, ..., zK)

 1  result ← ()
 2  for i ← 1 to K do
 3      repeat
 4          negative ← sample a word from V according to word frequencies F
 5      until negative ∉ ({w, c} ∪ result)
 6      result ← result + negative
 7  end
 8  return result

Algorithm 2.2: Word2vec negative sampling.

Softmax, a common technique in machine learning, is used as the final output step of neural networks which perform multi-class classification, i.e. a classification task with more than two categories. Together with a corresponding loss function, the training objective is for the network to output the unique correct label for every input; all labels other than the one correct one are considered incorrect. The loss function most commonly used with softmax is the cross-entropy loss ([27], pp. 129–130).

With word2vec, of course, the task is not classification: instead of predicting which single word occurs in the context of the input word, our task is to provide a distribution of all the context words.

If we tried to use regular softmax and cross-entropy loss for this task, the positive and negative updates would work in opposite directions. In our example, with cross-entropy loss, the input-output pair "banana" → "delicious" would cause the weights to be adjusted so that the output "delicious" became more likely, but the standard cross-entropy adjustment would make any other pair less likely, including the pair "banana" → "cake", which is an actually-occurring case in our example. Likewise, for the next pair, the adjustment that makes "banana" → "cake" more likely would in turn make "banana" → "delicious" less likely.

Perhaps more importantly, regular softmax is not suitable for situations with a very high number of possible "correct answers", such as the one at hand. As was mentioned, many real-life text corpora have a unique word count that reaches hundreds of thousands. With cross-entropy loss there would then have to be one "positive" adjustment of the weights and, say, 92,000 "negative" adjustments. While softmax works well for a small to moderate number of possible classes, say ten classes, or a thousand, it would be computationally very heavy with hundreds of thousands.

This brings us to negative sampling, which is the second of the two essential components of word2vec input transformation. With negative sampling, as seen in listing 2.2, we simply sample a small, constant number of negative examples, rather than using the entire vocabulary. This neatly solves the problems discussed above. It is not obvious a priori how well such a sampling approach will work, but in practice it turns out to work quite well.

In more detail, we first choose how many negative examples we want to sample for each positive example; this is the hyperparameter K > 0, an integer. Then, each time we convert a part of a sentence into an input-output pair, we also sample K negative examples from the entire vocabulary. This sampling is done from the unigram distribution, i.e. from the distribution P(w_i) = U(w_i)/Q, where word w_i's probability is set to be the total number of its occurrences in the corpus, U(w_i), divided by the total number of words in the corpus, Q. While sampling, we need to be careful not to include either the current input word or the current output word (i.e. positive example); we also need to avoid duplicates. A simple way to avoid such duplicates is to repeat the draw. This works well since, given that the vocabulary in real-world applications is usually massive and no word has a dominatingly large frequency in it, duplicates rarely occur, and when they do, getting more than a few duplicates in a row is extremely unlikely.
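
To make algorithms 2.1 and 2.2 concrete, here is a minimal Python sketch of the same input transformation. It is an illustration only, not the gensim implementation used later in this thesis: the sampling uses the plain unigram frequencies passed in and simply redraws on collisions, exactly as described above.

import random

def negative_sample(vocab, freqs, K, w, c):
    """Algorithm 2.2: draw K distinct negative examples, avoiding the input word w and the positive example c."""
    negatives = []
    while len(negatives) < K:
        z = random.choices(vocab, weights=freqs, k=1)[0]
        if z != w and z != c and z not in negatives:
            negatives.append(z)
    return negatives

def skipgram_examples(sentence, vocab, freqs, C, K):
    """Algorithm 2.1: turn one sentence into (input word, context word, negatives) triplets."""
    vocab_set = set(vocab)
    result = []
    for i, w in enumerate(sentence):
        if w not in vocab_set:
            continue
        context = sentence[max(0, i - C):i] + sentence[i + 1:i + 1 + C]
        for c in context:
            result.append((w, c, negative_sample(vocab, freqs, K, w, c)))
    return result

sentence = ["eating", "some", "delicious", "banana", "cake"]
vocab = sorted(set(sentence))
freqs = [1] * len(vocab)  # toy uniform frequencies; a real corpus supplies unigram counts
print(skipgram_examples(sentence, vocab, freqs, C=2, K=2)[:2])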

2.1.2 Neural network architecture

In the previous section we saw how the input is transformed, in the skip-gram variant of word2vec, into raw material to be used for training the neural network. In this section we take a closer look at the neural network architecture. Both the input transformation and the neural network structure are different between the skip-gram and CBOW variants of the algorithm, but the basic building blocks are the same.

(a) $a^1 := x_k W^{[1]} = w^{[1]}_k$

The one-hot-encoded representation of the input word, the kth in the vocabulary, is first multiplied by the first-layer weight matrix $W^{[1]}$, which simply selects the kth weight vector, $w^{[1]}_k$, from that matrix.

(b) $z^2 := a^1 W^{[2]} = w^{[1]}_k W^{[2]} = [\, w^{[1]}_k \cdot w^{[2]}_1 \;\; \cdots \;\; w^{[1]}_k \cdot w^{[2]}_N \,]$

The vector $w^{[1]}_k$ is multiplied by the second-layer weight matrix $W^{[2]}$, which results in a vector $z^2$ of dot products between $w^{[1]}_k$ and each of the N column vectors in $W^{[2]}$.

(c) $a^2 := \mathrm{softmax}(z^2)$

The vector $z^2$ goes through softmax, resulting in the final output $a^2$.

Figure 2.4: Word2vec skip-gram architecture; the full version, i.e. without negative sampling.

Let us consider first how word2vec would work as a "straight up" traditional neural network, without the optimisation of negative sampling; I will call this traditional architecture the full network. After describing this in detail it will then be easy to see how negative sampling works and how it provides a large performance boost.

An unoptimised neural network version of word2vec is given individual input-output pairs, and the network is trained based on these. This process is depicted in figure 2.4. Note that I am describing here a network architecture of type "xW" as opposed to "Wx", i.e. one where the input vector x is a row vector which is multiplied on the right by the weight matrix W, rather than the weight matrix being multiplied from the left by a column vector. These two possibilities are functionally equivalent, and which one is used generally depends on the lower-level implementation.

In step 1, fig. 2.4a, the input word is one-hot encoded in the usual way, and the resulting input vector of size 1 × N is then multiplied by the first layer's weight matrix W^{[1]} of size N × D, producing a vector of size 1 × D. Because the input vector has the value 1 in only one position and is zero elsewhere, this operation

simply selects one of the row vectors in W^{[1]}, i.e. the D-dimensional embedding vector corresponding to the input word.

In step 2, fig. 2.4b, this embedding vector is multiplied by the second layer's weight matrix W^{[2]} of size D × N, producing a vector z^2 of size 1 × N. In the same way as the first layer's weight matrix W^{[1]} contains N embedding vectors each of length D, arranged into rows, the second layer's W^{[2]} contains another N embedding vectors of length D each, arranged into columns. I will call these secondary embeddings, as opposed to the primary embeddings in the first layer. The main point is that, given an embedding vector a^1 which was selected from W^{[1]}, the vector-matrix multiplication a^1 W^{[2]} computes the dot product a^1 · c_i for each column vector c_i in W^{[2]}, and stores these dot products in the resulting vector of length N. Let us call the result z^2.

In step 3, fig. 2.4c, we apply softmax ([27], pp. 180–184) to the vector z^2 to produce the output of the word2vec algorithm. The purpose of the softmax, as usual, is to convert the dot products into a predicted probability distribution; here the probability distribution represents our view of how likely any given word is to appear inside the same context window as the input word. Recall that in z^2 we have the dot product of the input word's embedding vector v(w) and another embedding vector v'(w_i), for all words w_i in the vocabulary; here v(.) denotes the primary embedding vectors, i.e. the ones in the first layer, and v'(.) refers to the secondary embedding vectors.

The softmax, then, concludes the forward step of the word2vec neural network. Next we do the backward step. For this, we need the "correct" answer for the original input word, which is provided by the input transformation. We calculate the value of the loss function, which depends on the difference between the model's current prediction and the expected answer, and use this to compute the values of the derivatives of the loss function, which are propagated backwards through the network in the usual way.

Let us elaborate on the mathematics at this point. The goal of the model is to estimate, or replicate, the probability distribution of the context words given an input word. We can represent this objective with the log likelihood function

$$\log \mathcal{L} = \sum_{(w,c) \in S} \log p(c \mid w), \qquad (2.1)$$

where S is the set of all word-context word pairs, i.e. "input-output" pairs, that we obtain from the input transformation. As mentioned above, we model the probabilities p(c | w) with a softmax involving the dot products between the respective embedding vectors; these embedding vectors are v(w) for the input word w, and v'(c) for each context word c, where v(.) and v'(.) refer to the primary and secondary embeddings respectively. We therefore arrive at the following representation:

$$p(c \mid w) = \frac{\exp(v(w) \cdot v'(c))}{\sum_{w' \in V} \exp(v(w) \cdot v'(w'))}, \qquad (2.2)$$

where the sum in the denominator is over all words in the vocabulary. To be clear, in the forward step we first compute v(w) · v'(w') for every possible context word w' in the vocabulary and store these in a length N vector; we then use 2.2 to transform each of these dot products in that vector into a softmax-normalised probability distribution. Given an ordinary softmax model such as this, the standard loss function to use is cross-entropy loss ([27], pp. 129–130). This takes the form

$$L(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log \hat{y}_i, \qquad (2.3)$$

where y is the one-hot-encoded expected answer (a length N vector), ŷ is the prediction of the model, i.e. the result of the softmax (also length N), and the sum is over all N positions in the two vectors. As is known, and as can be seen from this, the cross-entropy loss function expects the prediction to be "binary", in that a prediction of anything less than 1 in the "hot" position of the expected answer is penalised, as is a prediction of anything higher than 0 in the other positions. An important point here is that even though there are multiple "correct" answers – there are multiple context words that in fact occur – and thus we never want the model to end up giving its full weight to just one of them, the cross-entropy loss still works well in practice, since in machine learning the weight adjustments are done gradually with little "nudges", rather than all at once.

The weights in the model are adjusted little by little, one example at a time and over several training epochs. Each time the weights are adjusted, rather than a full adjustment, only a small fraction of the computed adjustment is used: this fraction is often in the range of $10^{-2}$–$10^{-5}$, or in any case much smaller than one. More details about model training and learning rates can be found in standard machine learning textbooks such as Goodfellow et al. [27].

Putting equations 2.1 and 2.2 together, we arrive at a possibly clearer form of the training objective:

$$\log \mathcal{L} = \sum_{(w,c) \in S} \left[ v(w) \cdot v'(c) - \log \sum_{w' \in V} \exp(v(w) \cdot v'(w')) \right]. \qquad (2.4)$$

From either 2.2 or 2.4 we see that the computational cost of training this full model is directly proportional to N, the total size of the vocabulary, since the sum in the objective is taken over all context words in the vocabulary. This vocabulary size is often very large in practical applications – tens of thousands, even hundreds of thousands – and thus ends up dominating the computational complexity of the model. Mikolov et al. [58] note several previous attempts to make this part of the computation more efficient and introduce their own solution: negative sampling.

As we saw before, negative sampling involves using randomly drawn words as negative examples. This is in contrast to the ordinary softmax, which uses every word in the vocabulary, except the correct answer, as negative examples.

Given the vocabulary sizes in real-world corpora, sampling a small, constant number of negative examples instead can obviously provide a large benefit; however, this method cannot be as accurate as calculating an update for each and every word, since under negative sampling we are simply ignoring a major fraction of the words. The question is, therefore, whether such a sampling strategy works and how large a sample is required. Mikolov et al. answered this in the affirmative and demonstrated good results with a negative sampling factor K between 2 and 20. Compared to a vocabulary of, say, a hundred thousand words, even the largest of these values, K = 20, means doing only 0.02 % of the work. This is a remarkable result which makes a massive difference to the efficiency of computing word embeddings with word2vec.

Above, we saw how negative sampling is implemented as a part of the input transformation. From the point of view of the neural network, negative sampling does not affect the architecture in any way, but simply provides a new way of computing the forward and backward steps in the second layer. Rather than computing the vector-matrix product of the embedding vector of the input word and the entire second-layer weight matrix W^{[2]}, we compute only the dot products between the input embedding and the positive and negative examples. As with the full softmax, we have the positive example and a number of negative examples, and we want to adjust these in opposite directions; unlike the softmax, we do not compute a fully normalised probability distribution. Instead, we simply adjust each of the dot products, positive and negative, using a simple sigmoid transformation. Although this is somewhat questionable from a probability modelling point of view, the same logic applies that was previously described for the full softmax: we are in any case only taking small, even tiny, steps towards each "correct" answer, rather than utilising all of the computed "correct" answer completely. It turns out that in practice it suffices, for each input word, to move towards the correct answer, away from a handful of incorrect answers, and to ignore everything else.

To be more specific, under negative sampling we calculate the likelihood of each input word as a combination of simple unnormalised pseudo-probabilities. For the positive example, given an input word w and a positive example c, we compute the dot product v(w) · v'(c) and apply the standard sigmoid function $\sigma(z) = \frac{1}{1+\exp(-z)}$ to it to get the model's view of the probability that the context word c occurs in the context window of the input word w:

$$p^+(c \mid w) = \sigma(v(w) \cdot v'(c)) = \frac{1}{1 + e^{-v(w) \cdot v'(c)}}. \qquad (2.5)$$

For the negative examples, we similarly calculate their probability as a combination of sigmoid-transformed dot products between v(w) and each negative example z's secondary embedding v'(z):

$$p^-(z \mid w) = \sigma(v(w) \cdot v'(z)) = \frac{1}{1 + e^{-v(w) \cdot v'(z)}}, \qquad (2.6)$$

where z is a single negative example. To obtain the likelihood of an input word, we want to maximise p^+ while minimising p^-; this is equivalent to maximising p^+ and maximising 1 - p^-. This leads to the log likelihood function, for a single input-output pair (w, c) and its associated negative examples Z = (z_1, z_2, ..., z_K),

$$\begin{aligned}
\log \mathcal{L}(c \mid w, Z) &= \log \left[ p^+(c \mid w) \prod_{z \in Z} \bigl(1 - p^-(z \mid w)\bigr) \right] \\
&= \log p^+(c \mid w) + \sum_{z \in Z} \log\bigl(1 - p^-(z \mid w)\bigr) \\
&= \log \sigma(v(w) \cdot v'(c)) + \sum_{z \in Z} \log\bigl(1 - \sigma(v(w) \cdot v'(z))\bigr) \\
&= \log \sigma(v(w) \cdot v'(c)) + \sum_{z \in Z} \log \sigma(-v(w) \cdot v'(z)) \qquad (2.7) \\
&= \log \frac{1}{1 + e^{-v(w) \cdot v'(c)}} + \sum_{z \in Z} \log \frac{1}{1 + e^{v(w) \cdot v'(z)}}, \qquad (2.8)
\end{aligned}$$

where we made use of the fact that $1 - \sigma(q) = \sigma(-q)$. The forms 2.7 and 2.8 are equivalent but are both written down here for clarity.

Finally, in machine learning it is common to minimise a loss function rather than maximising a log likelihood function. The loss function can be defined very simply, as the negative of the log likelihood function:

$$L(c \mid w, Z) = -\log \mathcal{L}(c \mid w, Z) = -\log \sigma(v(w) \cdot v'(c)) - \sum_{z \in Z} \log \sigma(-v(w) \cdot v'(z)). \qquad (2.9)$$

Whether one minimises the loss or maximises the log likelihood, the end result is of course the same. As is seen, it is straightforward enough to calculate the derivative of the loss function for the purpose of backpropagation. Further details of this calculation are omitted. More information on the derivations specific to word2vec can be found in e.g. [25] and [69], and standard textbooks such as [27] are a good general reference for loss functions and their derivatives as used in machine learning.
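
For illustration, the loss 2.9 and its gradient are small enough to write out directly. The sketch below (with made-up dimensions and learning rate, and not the implementation used in this thesis) performs one negative-sampling update for a single training triplet, adjusting the primary embedding of the input word and the secondary embeddings of the positive and negative examples.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_step(W1, W2, w_idx, c_idx, neg_idx, lr=0.025):
    """One gradient step on loss (2.9) for a single (input word, positive example, negatives) triplet.

    W1: N x D primary embeddings; W2: N x D secondary embeddings (stored as rows here).
    """
    v_w = W1[w_idx].copy()
    grad_v_w = np.zeros_like(v_w)
    # The positive example has target 1, each negative example has target 0.
    for idx, target in [(c_idx, 1.0)] + [(z, 0.0) for z in neg_idx]:
        u = W2[idx]
        g = sigmoid(np.dot(v_w, u)) - target  # derivative of (2.9) w.r.t. the dot product
        grad_v_w += g * u
        W2[idx] = u - lr * g * v_w            # nudge the secondary embedding
    W1[w_idx] -= lr * grad_v_w                # nudge the primary embedding

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))  # toy model: N = 10 words, D = 4
negative_sampling_step(W1, W2, w_idx=3, c_idx=5, neg_idx=[1, 7])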

2.2 Continuous bag-of-words

The other main variant of word2vec is continuous bag of words, or CBOW. As we saw, for skip-gram, the algorithm gets a single word as input and from it predicts the distribution of all other words in the context window. In CBOW, the algorithm gets as input all words in the context window other than the centre word, averaged together, and from this tries to predict the centre word.

2.2.1 Input transformation and negative sampling

A simple example of the CBOW input transformation is seen in figure 2.5. Note that all hyperparameters are the same as in the skip-gram variant, such as the context window size C and the negative sampling factor K; for the full list see table 2.2.

Figure 2.5: Example of word2vec CBOW input transformation with C = 2, when processing the word "banana" in the sentence fragment "eating [ some delicious banana cake ]". Only the positive example is depicted. The context words are collected into a list, which is the input, and the expected output is the centre word.

Input: Ordered list of words W (the sentence); list of words V (the vocabulary); list of word frequencies F; integer C > 0 (the context window size); integer K > 0 (negative sampling factor)
Output: List of triplets (X, y, Z), where X = (x1, x2, ..., x2C) are the (up to) 2C input words, y is the desired output word (i.e. centre word of the window), and Z = (z1, z2, ..., zK) are the negative examples

 1  result ← ()
 2  foreach w ∈ W do
 3      if w ∈ V then
 4          context ← (up to C words from W before w) + (up to C words from W after w)
 5          negatives ← NegativeSample(V, F, K, w, context)
 6          result ← result + (context, w, negatives)
 7      end
 8  end
 9  return result

Algorithm 2.3: Word2vec input transformation, CBOW version.

The input transformation algorithm for CBOW is shown in listing 2.3. This is very similar to algorithm 2.1, the main difference being that it returns a list of context words as input and the centre word as output, which is the opposite of the skip-gram version. This difference is seen on line 6 of algorithm 2.3, as compared to line 7 of algorithm 2.1. Note that the negative sampling routine is here assumed to accept a list of words as its fifth argument, rather than a single word; other than this the negative sampling logic is unchanged from the skip-gram version.
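
For comparison with the skip-gram sketch given earlier, a minimal Python illustration of the CBOW transformation is below; negative examples would be drawn exactly as before, except that the whole context list is excluded instead of a single positive example.

def cbow_examples(sentence, vocab, C):
    """Algorithm 2.3 without the sampling step: (context words, centre word) training pairs."""
    vocab_set = set(vocab)
    pairs = []
    for i, w in enumerate(sentence):
        if w not in vocab_set:
            continue
        # Up to C words on each side; the whole list is the input, the centre word is the output.
        context = sentence[max(0, i - C):i] + sentence[i + 1:i + 1 + C]
        pairs.append((context, w))
    return pairs

sentence = ["eating", "some", "delicious", "banana", "cake"]
print(cbow_examples(sentence, vocab=sentence, C=2))
# e.g. (["some", "delicious", "cake"], "banana") is one of the resulting pairs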

(a) $x_C := \frac{1}{|O|} \sum_{k \in O} x_k, \qquad a^1 := x_C W^{[1]} = \frac{1}{|O|} \sum_{k \in O} w^{[1]}_k =: w^{[1]}_C$

Let the set of one-hot-encoded context words be O; the size of this set is 1 ≤ |O| ≤ 2C. For each context word in O, the corresponding weight vector is selected from the first-layer weight matrix $W^{[1]}$, and these weight vectors are averaged together into $a^1$. Note that it does not matter whether the averaging is done for the one-hot-encoded word vectors, as depicted here, or after the weight vectors are extracted from $W^{[1]}$. The result is a 1 × N vector.

(b) $z^2 := a^1 W^{[2]} = w^{[1]}_C W^{[2]} = [\, w^{[1]}_C \cdot w^{[2]}_1 \;\; \cdots \;\; w^{[1]}_C \cdot w^{[2]}_N \,]$

The vector $w^{[1]}_C$ is multiplied by the second-layer weight matrix $W^{[2]}$, which results in a vector $z^2$ of dot products between it and each of the N column vectors in $W^{[2]}$.

(c) $a^2 := \mathrm{softmax}(z^2)$

The vector $z^2$ goes through softmax, resulting in the final output $a^2$.

Figure 2.6: Word2vec continuous bag-of-words (CBOW) architecture; the full version, i.e. without negative sampling.

2.2.2 Neural network architecture

The neural network architecture of the continuous bag of words edition of word2vec is depicted in figure 2.6. Note that this is the full, or unoptimised, version, i.e. one that uses a full softmax instead of negative sampling. In this version, the input is a list of context words, up to 2C of them, and the expected output is the centre word of the context window. In step 1 we first look up the embedding vectors corresponding to the input words; these embedding vectors are averaged to produce a single vector of length D, call it a^1. In step 2 we multiply a^1 by the second-layer weight matrix, i.e. the D × N matrix, producing a 1 × N vector; and in the final step, we apply softmax to this vector, to produce the model's estimate

of the probability distribution of the expected output, i.e. of the centre word of the original window. As is seen, this process shares many similarities with that of the skip-gram version, the main differences being the slightly different natures of the input and output. To be more precise about the input and output, in the CBOW case the log likelihood function of the model is

$$\log \mathcal{L} = \sum_{(w,C) \in S} \log p(w \mid C), \qquad (2.10)$$

where S is now the set of all pairs of centre word w and corresponding context words C = (c_1, c_2, ..., c_{2C}). Similarly, the model computes the probabilities p(w | C) using a straightforward softmax representation

$$p(w \mid C) = \frac{\exp(v(C) \cdot v'(w))}{\sum_{w' \in V} \exp(v(C) \cdot v'(w'))}, \qquad (2.11)$$

where v(C) is the average of the embedding vectors v(c) of the words in the context word list C. These equations are of course very similar to 2.1 and 2.2, with just the inputs and outputs changed. As with skip-gram, the full softmax is resource intensive to calculate also in the CBOW case: here too the sum in the denominator is taken across the entire vocabulary. Again, negative sampling can be used to reduce the computational load. Reasoning similarly as in the skip-gram case, we arrive at the likelihood function, for a single input-output pair (C, w) and its negative examples Z = (z_1, z_2, ..., z_K),

$$\log \mathcal{L}(w \mid C, Z) = \log \sigma(v(C) \cdot v'(w)) + \sum_{z \in Z} \log \sigma(-v(C) \cdot v'(z)), \qquad (2.12)$$

which is analogous to 2.7.

The primary benefit of the CBOW variant of word2vec over the skip-gram variant is that CBOW models are faster to train. This is because skip-gram transforms the input sentences into a relatively large number of input-output word pairs, whereas CBOW generates only one training example for each word in a given sentence. To be more precise, given a sentence of length L and a maximum context window size C, skip-gram transforms each of the L words into up to 2C input-output pairs, whereas CBOW transforms each of the L words into one input-output pair. Although the CBOW inputs are lists of up to 2C context words, and although the forward pass and backpropagation therefore do a similar amount of computation, the backpropagation uses the same correction for all context words in the CBOW case, rather than calculating separate corrections for each as with skip-gram. In practice this difference turns out to be noticeable, resulting in smaller training times for the CBOW model ([58], pp. 8–9).

A drawback of CBOW is that the faster computation time is gained at the expense of accuracy, precisely since the backpropagation updates are averaged over a number of words. It is not obvious exactly how large an effect this has. Mikolov comments informally [57] that skip-gram may have an advantage with smaller amounts of training data, since it can more effectively compute the updates for each individual word. CBOW on the other hand may have slightly better accuracy for the frequent words in the input corpus while also being faster to train. Mikolov cautions though that such judgements must be made on a case by case basis after experimenting with each option to see how well they end up working in practice, which is solid advice in general.
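
The CBOW forward pass of figure 2.6 is likewise compact enough to sketch in a few lines of numpy. The version below computes the full softmax of equation 2.11 for one context window; the dimensions and inputs are made up for illustration, and a practical implementation would use negative sampling instead.

import numpy as np

def cbow_forward(W1, W2, context_idx):
    """Full-softmax CBOW forward pass for one context window.

    W1: N x D primary embeddings; W2: D x N secondary embeddings.
    Returns a length-N probability distribution over the centre word.
    """
    a1 = W1[context_idx].mean(axis=0)     # average the context embeddings (length D)
    z2 = a1 @ W2                          # dot products with all N secondary embeddings
    z2 -= z2.max()                        # numerical stability
    return np.exp(z2) / np.exp(z2).sum()  # softmax, as in equation (2.11)

rng = np.random.default_rng(0)
N, D = 10, 4
W1, W2 = rng.normal(size=(N, D)), rng.normal(size=(D, N))
print(cbow_forward(W1, W2, context_idx=[2, 4, 5, 7]).sum())  # 1.0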

2.3 Optimisations and adjustments

While negative sampling is by far the most important optimisation technique introduced in the original word2vec papers [58, 60], there are several other tweaks and optimisations that the authors found make a noticeable difference.

First, in the skip-gram case, the hyperparameter C denotes the maximum size of the context window on each side. One adjustment to the algorithm is to, for each input sentence, draw the actual size R of the context window from the uniform distribution between 1 and C, i.e. R ∼ Unif(1, C). As this is done for each input sentence in turn, the algorithm ends up choosing words which are far away from the centre word less often, whereas words right next to the centre word are always included, since R ≥ 1. The authors do not report whether and how much this procedure helps with the accuracy of the model, but it does provide a boost in performance, since the context window sizes end up, on average, smaller.

Another somewhat heuristical optimisation is the subsampling of frequent words. While the removal of stop words is a commonly used preprocessing technique for natural text [22, 55], the word2vec authors prefer subsampling, which accomplishes a similar function. The idea is that, when processing an input sentence, each word in the context window is discarded with a probability that depends on its total frequency in the corpus. That is, when processing an input sentence, a context word w_i is discarded with probability

\[
P_d(w_i) = \max\!\left(0,\; 1 - \sqrt{\frac{t}{f(w_i)}}\right), \tag{2.13}
\]
where t is a constant and f(w_i) is the total frequency of the word w_i in the entire input corpus. It has to be noted that 2.13 is merely a heuristic aid, which the original authors have empirically found to be helpful, rather than a full probability distribution. For example, if t is chosen to be 10^{-5}, 2.13 will result in a rejection probability of zero for words whose frequency in the corpus is also 10^{-5}. For words less frequent than this, the rejection probability remains zero, while for words more frequent than the threshold t, the probability that the word is rejected climbs fairly rapidly towards 1. By choosing t in accordance with the characteristics of the input corpus, words that are “too common” in the input, which are presumably stop words, can

thereby be discriminated against so that not too much processing power is spent on them. It should be noted that setting t too high, i.e. so high that very few words in the corpus have a frequency close to t, should not massively affect the quality of the results, but merely make it so that the training takes longer, since subsampling will then occur relatively rarely. However, setting t very low compared to the corpus frequencies will cause many words to be rejected in the subsampling, which can affect the quality of the results. An important target for optimisation is the distribution used for negative sampling. The most natural distribution to use is the unigram distribution, in which each word’s sampling probability is set to its relative frequency, i.e. P(w_i) = U(w_i)/Q for a given word w_i, where Q is the total number of (in-vocabulary) words in the corpus and U(w_i) is the total number of occurrences of w_i in the corpus. The word2vec authors experimented with various other distributions and concluded that the best-performing one was a variant of the unigram distribution where the unigram frequency is raised to the power 3/4, i.e.

\[
P(w_i) = C \cdot \frac{U(w_i)^{3/4}}{Q}, \tag{2.14}
\]
where C is a normalising constant. It is of course straightforward to change the sampling distribution used in negative sampling, so this optimisation is easy to make. Finally, negative sampling is not the only thing that can be used to make word2vec more performant. The authors also describe the alternative of hierarchical softmax [60]. Hierarchical softmax is a better-performing version of softmax that utilises a binary tree, so that only O(log_2(N)) nodes need to be evaluated in order to obtain the full context distribution, rather than all N nodes as with ordinary softmax. In practice, it appears that although hierarchical softmax performs well and is fast to train, it is not very commonly used with word2vec: negative sampling is much simpler to implement and also gives excellent results.
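To illustrate equations 2.13 and 2.14, the following minimal Python sketch computes the subsampling rejection probability and the 3/4-power noise distribution from raw word counts. The toy corpus is made up purely for illustration; on such a tiny corpus every word is of course “frequent” relative to t, so the numbers themselves are not meaningful, only the shape of the computation.

import math
from collections import Counter

corpus = [["tekoäly", "muuttaa", "media", "ja", "media", "muuttuu"],
          ["tekoäly", "ja", "koneoppiminen", "yleistyä"]]

counts = Counter(word for sentence in corpus for word in sentence)
total = sum(counts.values())

def rejection_probability(word, t=1e-5):
    """Equation 2.13: probability of discarding an occurrence of `word`."""
    f = counts[word] / total              # relative frequency of the word
    return max(0.0, 1.0 - math.sqrt(t / f))

def noise_distribution(power=0.75):
    """Equation 2.14: unigram counts raised to the 3/4 power, then normalised."""
    weights = {word: count ** power for word, count in counts.items()}
    z = sum(weights.values())
    return {word: weight / z for word, weight in weights.items()}

print(rejection_probability("ja"))        # a frequent word gets a high rejection probability
print(noise_distribution()["tekoäly"])    # probability of drawing "tekoäly" as a negative example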

2.4 Limitations

Having discussed how word2vec works in detail, it is instructive to consider also what it cannot do. As with any other model, word2vec has some inherent limitations, which are due to the assumptions made in the specification of the model. The word2vec model is trained using a somewhat simplified view of the input sentences. As we saw before, in both main variants, skip-gram and continuous bag-of-words (CBOW), word2vec ignores the ordering of the words in the context window. This is fine for the main purpose of word2vec, which is concerned with constructing a useful representation of the context distribution; for this distribution the word order does not matter. However, this makes word2vec unsuitable for tasks such as natural language generation, for which more complex models such as RNN-based techniques [74] or the more recent BERT (Bidirectional Encoder Representations from Transformers) [14] are a better fit.

Word2vec also takes the words, or tokens, very “literally”, in the sense that it builds one representation for each unique token in the source material. This works fine for many use cases, but it means word2vec cannot handle polysemy (words that have more than one meaning) very well, since all meanings of a token have the same embedding vector [2]. To get around this, one approach is to utilise part-of-speech tagging, which can be applied to the source material first, after which the word2vec embeddings are learned from the now-tagged material [76]. Because of the one-to-one mapping between literal tokens and embedding vectors, word2vec has difficulties with languages such as Finnish, where it would be desirable to treat various inflected forms of words as the same word even though the literal forms differ. To deal with this, it can be helpful to first lemmatise the source material, as we will see in chapter 3. One interesting consideration is how well word2vec works with words that are rare in the source material. This is explored more in chapter 4, where a real-world example is considered and evaluated. As was said before, when the skip-gram variant of the algorithm is used, the algorithm can take full advantage of each individual occurrence of any given word. We will see that in this case word2vec can perform quite well even with low word frequencies. One important caveat related to the practical usage of word embeddings is that, since word embeddings are derived from natural text, they will also reflect any biases present in the original text, such as gender bias and ethnic bias. This was noted by Bolukbasi et al. [7] and has been further studied by various authors such as Caliskan et al. [9]; for example, [23] investigates the bias-related differences in word embeddings over time. Methods have since been developed to reduce or eliminate such bias, but this has seen mixed success [26]. There is current research to try to better define and measure such bias in word embeddings, and thereby to reduce it [65]; these efforts are ongoing. From a more technical point of view, the convergence properties of word2vec deserve some consideration. Word2vec, as a stochastic technique, is meant to converge eventually to the true distribution, or at any rate get closer to it as training progresses. The true distribution here is the context distribution, which is what word2vec approximates. However, the convergence properties of word2vec are not obvious. In general, the key questions with regard to convergence are bias (or lack thereof) and speed. These have been examined by various authors since word2vec was published. Word2vec’s central optimisation is negative sampling, which is essentially a simplified version of noise contrastive estimation (NCE) [31]. While the full NCE objective is well understood, Dyer [19] notes that word2vec’s negative sampling by itself does not have the same strong consistency guarantees as NCE’s objective. Other authors have pointed out that word2vec’s negative sampling objective can be understood as equivalent to a matrix factorisation technique [51, 52], and of these Li et al. [52] arrive at the conclusion that word2vec with negative sampling, when viewed as matrix factorisation, is indeed guaranteed to converge. As for the speed of convergence, direct analysis is again difficult, but empirical

evidence points to fairly rapid convergence, i.e. fast enough that it is not a problem in practice.

2.5 Applications of word2vec

Since word2vec, as a model, does not make predictions, it is not as directly applicable for practical tasks as models that do. Often word2vec is instead used as an intermediate step in a more involved machine learning pipeline. For example, in a task of natural language processing, one can use the word embeddings as input instead of the words themselves. Since word2vec’s word embeddings contain information not only about each word, but also about its most significant neighbours, the data that is fed to the actual prediction task is therefore richer than the raw words would be. The prediction task in question can then benefit from this additional information: if two words are in some sense close to one another, their corresponding embeddings will also be similar (and vice versa), and this information can be utilised by the actual prediction task. This section gives some examples of such applications for word2vec embeddings. One early example of a useful application of word2vec, by the original authors, is using the obtained vector representations to help with machine translation [59]. This is a good illustration of how helpful it is to have a method for building rich, informative vectors for each word in the input corpus: it enables one to not only compare words to other words, but also to compare words across languages. In chapter 4 we will examine a similar application, namely of comparing words to earlier versions of themselves, i.e. comparing word vectors based on a given year of material to the vectors obtained from earlier material. Another intuitive use for word2vec’s word embeddings is to use them to examine changes to word contexts over time. By training separate word2vec models for, say, each year of source material, one can gauge how the usage of a given word has changed over time. Such studies have been done by many authors [23, 32, 45, 47, 48, 81]. We will have much more to say on this topic in chapter 4, where our goal is to replicate such an experiment with different source material. As we saw, word2vec embeddings are constructed from word co-occurrence data in a given corpus, and as such, the embeddings lend themselves naturally to investigations related to such co-occurrences. For example, Leeuwenberg et al. [50] develop a method of extracting synonyms from text using word2vec embeddings in a minimally supervised way. For this they use a new metric, relative cosine similarity, which, given a target w, compares the cosine similarity of the embeddings of each of w’s nearest neighbours w′ to those of the other nearest neighbours. The authors are also able to improve their system by adding part of speech tagging, which is an example of how word2vec embeddings can be used alongside other techniques as part of a larger system. In another study of co-occurring words, Cordeiro et al. [11] have investigated whether word embeddings produced by word2vec, GloVe and other models are

usable for distinguishing between compound phrases which are literal and those that are proverbial, or idiomatic, answering the question in the affirmative. Examples of such phrases are “credit card”, clearly quite literal, and “ivory tower”, not literal at all. The authors found a high correlation between automatically generated judgements of literalness and judgements provided by human experts. Mitra et al. [61, 64] present some interesting findings on how to utilise word2vec for improving search engine results. As we described before, the word2vec architecture consists of two sets of embeddings, from which only the first embeddings are used as the result of the algorithm. However, the authors found it valuable to consider also the second set of embeddings, which they call “output” embeddings, as opposed to the first-layer “input” embeddings. While the primary embeddings are useful for a wide variety of tasks, it turns out that the secondary embeddings also encode useful contextual information, which can then be used for, for example, information retrieval and search engine queries. Some impressive recent results have been obtained by applying word2vec to scientific literature. Tshitoyan et al. [77] report on using word2vec for knowledge discovery in the field of materials science. In their case, a word2vec model trained on 3.3 million scientific abstracts related to materials research was able to find latent, implicit knowledge in the input, and the model could then be used to easily summarise and recover this latent knowledge. Another, earlier study is due to Gligorijevic et al. [24], who use a custom model based on word2vec to automatically analyse 35 million patient records in order to discover co-occurring diseases and disease-gene relations. There are many other similar efforts, e.g. [84] to mention just one. A survey of such novel machine learning approaches to biology and medicine, including word2vec and related models, can be found in [10].
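Returning briefly to the relative cosine similarity of Leeuwenberg et al. [50] mentioned above, the following sketch shows one way such a score could be computed from a trained gensim model. The file name is hypothetical, and the function is my reading of the idea rather than a faithful reimplementation of [50].

from gensim.models import KeyedVectors

# Hypothetical path to previously saved word vectors.
wv = KeyedVectors.load("yle_2018.kv")

def relative_cosine_similarity(wv, target, candidate, topn=10):
    """Cosine similarity of (target, candidate), normalised by the summed
    similarity of the target to its topn nearest neighbours."""
    neighbours = wv.most_similar(target, topn=topn)     # list of (word, similarity) pairs
    denominator = sum(similarity for _, similarity in neighbours)
    return wv.similarity(target, candidate) / denominator

# A score clearly above 1/topn means the candidate is more similar to the target
# than an average top neighbour, which makes it a synonym candidate.
print(relative_cosine_similarity(wv, "tekoäly", "keino-äly"))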

2.6 After word2vec

As noted previously in section 2.4, one significant drawback of word2vec is that it assigns precisely one embedding vector to each token in the input, which makes it difficult to handle languages with many inflected forms, e.g. agglutinative languages such as Turkish or Finnish. One idea for improvement then is to model each token, or word, with more than one embedding vector, in order to increase the information available for the model. This is what the authors of FastText have done [6, 40]. With the FastText method, the model learns representations of character n-grams, i.e. subwords, and represents the tokens of the source material as the sum of several such n-gram vectors. The authors report that this method gives better results than traditional word2vec on various tasks, such as word analogy tasks. Another direction of extending word2vec is to get the model to learn additional vectors related to each paragraph, i.e. to larger portions of text. This approach, called paragraph vectors [49], aims at building a more context-aware set of vectors

for the given input text. These vectors can then be used, for each paragraph, to predict the next word, given the word vectors of previous words as well as the relevant paragraph vector. The authors report some success in using this model for tasks such as sentiment analysis. Notably, their results are better than those obtained with more traditional methods such as bag-of-words models and support vector machines. A rival method of constructing word vectors is GloVe, which stands for Global Vectors [66]. GloVe starts from a viewpoint similar to that of word2vec, i.e. from the word co-occurrence matrix, but performs a different analysis. Importantly, GloVe takes the global co-occurrence counts of words into account explicitly and proceeds from there, whereas word2vec trains its model one example at a time in a more stochastic manner. The vectors produced by GloVe are used in the same way as word2vec embeddings. It is difficult to predict which ones would perform better in any given task, but the results seem to be more or less comparable, whereas the word2vec model has the advantage of being easier to train. In recent years there has been much interest in various algorithms and models that utilise embeddings, no doubt bolstered by the success of word2vec. For example, Wu et al. have introduced an open-source project called StarSpace [80], which is a generalised system for training many different models involving embeddings. While StarSpace can be used to train word2vec models, it also has much wider applicability. In a similar vein, Grbovic and Cheng [29] report on a modified word2vec algorithm that they use for creating embeddings based, not on documents and words, but rather user sessions and user clicks on the Airbnb website. These embeddings are then used for content ranking and personalisation. As well, Wang et al. report on another custom algorithm based on word2vec that is used to power a recommender system on the Chinese e-commerce website Taobao [78]. Again, there are many other examples of such novel systems, too numerous to list here. Finally, the current state of the art in many natural language processing (NLP) tasks is BERT, which stands for Bidirectional Encoder Representations from Transformers [14]. BERT is a complex model in which similarities to earlier works can be seen, as it incorporates sub-word embeddings, somewhat similar to FastText, as well as segment and position embeddings, somewhat similar to paragraph vectors, among other innovations. The result is a very versatile model for building accurate embeddings, which can then be used in many NLP tasks, such as question answering, sentiment analysis, semantic similarity, next sentence prediction and others. BERT, being a very large deep learning model with dozens of layers (the original paper describes two model architectures with 12 and 24 layers respectively), does not bear much resemblance to the earlier, simpler word2vec model. Both however are neural networks used to construct word embeddings based on word contexts, and in this sense an understanding of word2vec can perhaps be helpful for an intuition of the current bleeding edge of research as well.
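To make the subword idea behind FastText, mentioned at the start of this section, more concrete, the following sketch extracts the character n-grams that FastText associates with a token. The boundary markers and the default n-gram lengths of 3–6 follow the FastText paper; the function itself is only an illustration of mine, not FastText’s actual code.

def character_ngrams(token, n_min=3, n_max=6):
    """Character n-grams of a token, with '<' and '>' marking the word boundaries."""
    padded = f"<{token}>"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            ngrams.add(padded[i:i + n])
    return ngrams

# FastText represents "tekoäly" as the sum of the vectors of these n-grams,
# plus a vector for the whole token itself.
print(sorted(character_ngrams("tekoäly")))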

Chapter 3

Word2vec in practice

There are various ways to utilise the word vectors generated by word2vec. One application is to examine the evolution and change of word semantics over time, as was done by Kim et al. [45] and by Hamilton et al. [32] using Google Books data from between 1850–1999 and 1800–1999 respectively. Of course, similar studies could be done with other corpora and other languages. As an experiment, I sought to replicate what Hamilton et al. [32] had done, but using a large collection of news articles in Finnish as my source material; that is, I trained various word2vec models on Finnish-language corpora and examined the resulting word vectors to see what insights could be gained regarding word semantics. I used the open source library gensim [68] to train the word2vec model.

3.1 Source material and processing

As source material I used two corpora of Finnish-language news articles published on the internet: articles published by Yle, the Finnish public broadcaster, between 2011–2018 [82]; and articles published by STT, a Finnish news agency, between 1992–2018 [73]. These corpora are available from the Finnish Language Bank for non-commercial research use. The relevant sizes of the corpora used are depicted in figure 3.1 and table 3.2. Notice that there are various ways of calculating the word count of a corpus in the context of word2vec. The count used here only takes into account the effective words in each year’s subcorpus, i.e. only those words are counted that appear in at least M = 3 distinct sentences. This is because words more rare than the selected threshold M are not used in training. It is seen that, except for the relatively small size of the STT corpus between 1992 and 1997, the data sizes across years and across corpora are very similar, with an average of some 17–18 million words per year of material.

Number of words by year (1992–2018), STT corpus

Minimum      First quartile   Median       Average      Third quartile   Maximum
9,426,788    13,920,816       17,020,988   17,327,603   20,869,238       23,811,267

Number of words by year (2011–2018), Yle corpus

Minimum      First quartile   Median       Average      Third quartile   Maximum
15,354,338   16,892,120       18,822,041   18,705,345   20,113,790       22,038,975

Table 3.2: Word counts of corpora used.

For each corpus, I trained a separate word2vec model for each year, so that I could compare the results for certain words of interest across the years. This chapter gives a detailed description of the preprocessing and modelling as well as the results.

3.1.1 Data processing pipeline

I preprocessed the material by removing hyperlinks and dividing the text into sentences, using mainly full stops, and either removing all other punctuation or converting it into full stops. After this each sentence was lemmatised using the open-source LAS lemmatiser [62]. Finally, the resulting tokens were converted to lowercase. An example of preprocessing is given in tables 3.3, which shows the original formatting of the text, and 3.4, which shows the result of the preprocessing. It must be noted that lemmatisation is not ordinarily used as a preprocessing step with word2vec, at least not for English. However, the situation is different when the source material is in Finnish, due to the complexity of the Finnish language. The issues stemming from this complexity are described in detail in the following section. One common preprocessing technique is the removal of so-called stop words [22, 55]. In the original word2vec papers [60], as in gensim’s word2vec implementation [68], a straightforward subsampling of frequent words is used instead, as detailed in chapter 2. The advantage of subsampling is that it is an automated method which is based on the calculated frequencies of words in the corpus, i.e. a separate list of stop words need not be provided in advance. I did not therefore remove the stop words in preprocessing, relying instead on the subsampling to take care of it. After preprocessing, lemmatising and lowercasing the material, it was ready to be used for training word2vec models.

{ "text": "Sisäministeriön kansliapäällikkö Päivi Nerg [arvioi sunnuntaina](http://yle.fi/uutiset/8457147), että sisärajakontrollit saattavat palata Schengen-alueelle.", "type": "text" },

Table 3.3: An excerpt of the original Yle article data, in JSON [8] format.

Figure 3.1: Sizes of corpora used.

sisäministeriö kansliapäällikkö päivi nerg arvioida sunnuntai että sisä-rajakontrolli saattaa palata schengenalue

Table 3.4: The preprocessed and lemmatised sentence corresponding to the input data from table 3.3.
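As an illustration of this pipeline, the following minimal Python sketch performs the same steps on the excerpt above. The lemmatiser is passed in as a function, since the actual call to the LAS lemmatiser [62] goes through its own interface, which is not reproduced here; an identity function is used only so that the sketch runs on its own, and the regular expressions are my approximation of the steps described, not the exact code used.

import re

def preprocess(text, lemmatise):
    """Sketch of the preprocessing: strip hyperlinks, split into sentences at
    full stops, lemmatise each token, lowercase the result."""
    # Replace hyperlinks of the form [anchor](url) with just the anchor text.
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Convert sentence-ending punctuation to full stops, drop the rest.
    text = re.sub(r"[!?]", ".", text)
    text = re.sub(r"[^\w\s.-]", " ", text)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [[lemmatise(token).lower() for token in sentence.split()]
            for sentence in sentences]

example = ("Sisäministeriön kansliapäällikkö Päivi Nerg "
           "[arvioi sunnuntaina](http://yle.fi/uutiset/8457147), että "
           "sisärajakontrollit saattavat palata Schengen-alueelle.")
print(preprocess(example, lemmatise=lambda token: token))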

3.1.2 Processing of Finnish-language text

While substantial progress has been made in recent years (see e.g. [34]), the Finnish language remains difficult to process algorithmically due to its complexity. Luckily word2vec as a technique does not require precise computational understanding of the semantics of language, such as part-of-speech tagging. Indeed, even relatively straightforward and common preprocessing techniques such as stemming or lemmatisation are not necessarily recommended for word2vec. Word2vec is capable of detecting the semantic relationships between token pairs such as “quick” : “quickly” :: “slow” : “slowly” [58, 60], whereas stemming or lemmatisation would transform a token like “quickly” into its base form “quick”, making this impossible. However, the situation is somewhat more complicated when applying word2vec to Finnish. Finnish is an agglutinative language [39]. In practice this means that most Finnish words are routinely inflected, compounded and so on, which results in a great variety of different character strings or tokens that all nevertheless have the meaning of their base form. The sheer variety of these different “versions” of words makes automatic processing of Finnish more difficult than of, say, English,

Finnish word       Occurrences   Approximate English translation
suosittelee        8991          recommends
suositellaan       3456          is/are recommended
suositella         2892          to recommend (base form)
suositteli         2167          he/she/it recommended
suosittele         1582          does not/do not recommend
suosittelevat      1306          they recommend
suositellut        1232          he/she/it has recommended
suosittelen        1048          I recommend
suositeltavaa      854           it is recommended that
suosittelemaan     523           to recommend (e.g. in order to)
suositeltu         509           recommended
suositeltiin       460           was recommended
suositteleekin     411           he/she/it indeed recommends
suosittelivat      375           they recommended
suosittelemme      367           we recommend
suositelleet       345           they had recommended
suosittelisi       326           he/she/it would recommend
suositeltava       314           [a thing] to be recommended
suosittelisin      190           I would recommend
suosittelema       186           a recommended [thing]

Table 3.5: An example of inflected Finnish word forms and their approximate English translations. Corresponding unique English word forms are bolded.

since the number of superficially different tokens which yet have the same meaning is so large. The common solution to this problem is to preprocess the tokens in the source material by either stemming or lemmatising them. For Finnish, lemmatisation has been shown to give better results [46]. While lemmatisation substantially alters the tokens that are used as input for various language processing tasks, and thus potentially changes the results, it has been shown that in a statistical sense at least the underlying word frequency distributions are not majorly altered by lemmatisation, even for Finnish [12]. The following simple example illustrates the need for lemmatisation for Finnish, which arises from word inflection. I selected arbitrarily the word “suositella” (“to recommend”) and counted the total number of its occurrences across both corpora (STT and Yle). Table 3.5 lists the 20 most common inflected forms of this word in the material, along with a free-form English translation of each. The main word, “recommend”, occurs in only three different inflections in English, whereas the corresponding Finnish word “suositella” has hundreds of different inflected forms – too numerous to list them all. As was mentioned, lemmatisation provides a solution to this problem. For example, the open-source LAS lemmatiser [62], which I used, is able to lemmatise

each of the inflected forms in table 3.5 into the base form. It has to be noted though that LAS’s lemmatisation capability is not perfect; for example, the inflected compound word “tehdasvalmisteisille” (lit. factory-manufactured) is lemmatised into the nonsensical “tehdasvalmisteisä” (lit. factory-manufacture-father). The occasional inability to handle such edge cases is a well-known limitation of contemporary language tools for Finnish. Another related issue in processing Finnish text is compound words. As is known, compound words are very common in Finnish [38]; they basically consist of two (or more) words spelled together without spaces – often two nouns, but other combinations such as adjective-noun occur with reasonable frequency. Given that compound words are very common, and that it is possible and occasionally desirable to create new compounds ad hoc e.g. to create emphasis, these compound words may then end up frivolously enlarging the vocabulary, similar to the inflected words discussed above. To give a concrete example, consider the compound word “tietokonealgoritmi” (“computer algorithm”), occurring in the Yle material from 2011. This is literally “tietokone+algoritmi”, i.e. “computer+algorithm”, spelled as one word; however, the word “tietokone” (“computer”) is itself also a compound: “tieto+kone” (lit. “information+machine”). Now, if we handle the word “tietokonealgoritmi” as a single token, rather than as a synonym for “algoritmi”, then we end up with two separate word vectors for what are essentially the same concept, which is an inefficient use of the source data. We could try to break up, or segment, such compound words into their constituent tokens – which LAS can optionally do – but it is not obvious how well that would work in case of more complex compounds such as “tietokonealgoritmi”. If that were to be segmented into the three separate tokens “tieto”, “kone” and “algoritmi”, it would change the semantics – “computer algorithm” would become “information machine algorithm”, which is not the original meaning. I did not attempt to segment the compound words in the source material; it seems to me, from a cursory perusal of the corpora, that while compound words are common, potentially problematic compound words such as the above example are fairly rare. Word2vec, as a stochastic technique, should be able to cope with such relatively low levels of noise or nuisance words and still give good results for the more common words. Fortunately, this turns out to be the case, as we will see.

3.2 Word2vec parameters and model training

The word2vec model involves various hyperparameters as detailed in chapter 2. As usual, I came up with suitable values of these by experimentation, given the material. The final values, as well as the key characteristics of the corpora, are listed in table 3.6. The sizes are reported per year, as a range across all the years in both corpora. Note that the averages are reported across both corpora together.

Description                Hyperparameter   Value
Embedding dimensionality   D                100
Context window size        C                5
Minimum word occurrences   M                3
Negative sampling factor   K                5
Training epochs                             30

Data size                  Parameter        Value                    Average
Vocabulary size            N                120,000 – 220,000        171,000
Number of articles                          55,000 – 130,000         101,000
Number of sentences                         772,000 – 2,096,000      1,408,000
Number of words            Q                9,430,000 – 23,800,000   17,640,000

Table 3.6: Hyperparameters and data sizes. The data sizes are reported per year of material, across both corpora.
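A minimal gensim training call using the hyperparameters of table 3.6 would look roughly as follows. The toy corpus and the file name are placeholders of mine; the parameter names follow the current gensim releases (4.x), which name some of them differently than the 3.x releases available at the time (e.g. vector_size and epochs rather than size and iter), and the choice between skip-gram and CBOW via the sg flag is not fixed by the table.

from gensim.models import Word2Vec

# Stand-in for the preprocessed material: in the real pipeline this is the list
# of lemmatised, lowercased token lists produced from one year of articles.
sentences = [["tekoäly", "muuttaa", "media", "ja", "yhteiskunta"],
             ["tekoäly", "ja", "koneoppiminen", "yleistyä", "media"]] * 200

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # embedding dimensionality D
    window=5,          # context window size C
    min_count=3,       # minimum word occurrences M
    negative=5,        # negative sampling factor K
    epochs=30,         # training epochs
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
model.save("word2vec_yle_2015.model")
print(model.wv.most_similar("tekoäly", topn=5))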

3.2.1 Evaluation of word2vec models

What follows is a more detailed explanation of how the hyperparameters presented before were chosen. The standard method, which I followed, is of course to try various values for each hyperparameter, i.e. to train multiple models, one per hyperparameter combination, and see which model performs best. However, in the case of word2vec, measuring model performance is not completely straightforward. In order to know which hyperparameters are the “best”, we must first consider what exactly the word2vec model is doing and how its results can be reasonably verified and evaluated.

Verification during training

One performance measure that is immediately available is the value of the loss function during training. Usually the average loss over the training set is used. The loss function’s value by itself, however, does not and cannot tell us how suitable the final model will be for our purposes. It is more useful as a technical indicator to verify that the model training is proceeding as expected, that is, in the direction of smaller loss, and to alert us in case of any unintended behaviour such as overfitting. An example progression of loss is depicted in figure 3.7. I selected arbitrarily the year 2015 data set from the Yle corpus and plotted its training loss as a function of training epoch. All other years in each corpus produce a similar picture. This looks as expected, but does not yet tell us much about the quality of the resulting model. Another thing one can keep an eye on during training is the magnitude of the embedding vectors. As is seen from the structure of the model, the expected behaviour is that these vectors should continue to grow in magnitude as training proceeds, but in such a way that the gap between smallest and largest magnitudes

Figure 3.7: Average loss during word2vec model training. Depicted: 2015 data from Yle corpus, a typical case.

remains noticeable. The model attempts to increase the dot product between each pair of positive examples, which causes the absolute magnitude of the vectors involved to grow. At the same time, however, negative sampling causes the magnitudes of the vectors corresponding to the most common words to shrink, which, for these vectors, mitigates this growth in magnitude. It is straightforward to verify that this indeed happens as training proceeds: the magnitudes of the embedding vectors grow, and the growth seems proportional to the current learning rate as expected (gensim’s word2vec implementation reduces the learning rate linearly as training proceeds). Again it must be noted though that the mere magnitude of the embedding vectors tells us only that the training is proceeding in the expected

direction, but is not informative with respect to the actual goodness of fit or model performance. For supervised models, the standard technique of cross-validation [33] can be used to investigate the quality of a model during training. For this either loss values or other more relevant accuracy measures can be used. However, word2vec is an unsupervised method, and thus cross-validation cannot be used with it. Simply put, it does not make sense to ask what the performance of the model is on data it has not seen, since the word2vec algorithm does not in fact make predictions on its input data. As an autoencoder, word2vec’s final result is not the nominal output of its last layer, for which the accuracy measures would normally be calculated, but instead the first-layer weight vectors, which are used as word embeddings. We should therefore focus on verifying the quality of these weight vectors. To be clear, for this verification it does not directly matter whether the model that produced the embeddings was trained with the entire corpus or, say, 90 % of the corpus, and the performance evaluation likewise does not depend on the remaining 10 %.
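In gensim, the loss tracking described at the start of this subsection can be implemented with a small training callback, roughly as follows. This is a sketch only: the toy corpus stands in for a year of material, and since the loss value gensim reports is a running total over the whole training run, the per-epoch figure is obtained by differencing.

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    """Print the training loss accumulated during each epoch."""
    def __init__(self):
        self.previous_total = 0.0
        self.epoch = 0

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()   # running total over all epochs so far
        print(f"epoch {self.epoch}: loss {total - self.previous_total:.1f}")
        self.previous_total = total
        self.epoch += 1

sentences = [["tekoäly", "muuttaa", "media", "ja", "yhteiskunta"],
             ["tekoäly", "ja", "koneoppiminen", "yleistyä", "media"]] * 200

model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=3,
                 negative=5, epochs=30, compute_loss=True, callbacks=[LossLogger()])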

Direct numerical evaluation of the resulting embeddings

For many standard clustering methods, such as k-means, there are various techniques that can be used to check and ascertain the validity of the results. k-means [54] is a classic technique wherein one first picks the desired number of clusters, k – this can be, say, 3 or 5 or 10, or in general a small positive integer greater than one – and the given data points are then clustered into k distinct clusters. The clustering is done with a random initialisation, and it is not always clear which values of k are reasonable for a given data set; it is therefore useful to try to measure the validity of the resulting clustering. For algorithms such as k-means, silhouette coefficients [70] are one well-known validation method. These silhouette scores involve the mean distance between a data point and all other points in the same (model-assigned) cluster, as well as the mean distance between a data point and all other points in the next-nearest (model-assigned) cluster. Word2vec however does not partition its results, the embeddings, into clusters, and therefore such traditional validation methods cannot be used with it: it is unclear how silhouette scores would be calculated in the absence of clusters. Nevertheless, an evaluation of the “closeness” or “packed-togetherness” of the word2vec clustering can be attempted. It turns out that models trained with different dimensionality of embeddings, all other parameters and the data set being the same, end up with embeddings where the average similarity between each word and its nearest neighbour, as measured by cosine similarity, differs. More precisely, given a model with embedding vectors v_i, denote by v̂_i the embedding vector that is closest to v_i as measured by cosine similarity, i.e. v̂_i := argmax_{j ≠ i} cos(v_i, v_j). We examine how the distribution of the nearest-neighbour similarity cos(v_i, v̂_i) varies for models trained with different dimensionality of embeddings, the source material and all other hyperparameters being the same. For an arbitrarily selected year, namely Yle corpus 2014, this distribution of

35 Figure 3.8: Effect of embedding dimension on the distribution of nearest-neighbour similarity. The models were trained on year 2014 of the Yle corpus.

closest-neighbour similarity by embedding dimension is depicted in figure 3.8. The average and the central 95-percentile range (i.e. from 2.5 % to 97.5 %) of the nearest-neighbour similarity cos(v_i, v̂_i) are shown. Note that in the word2vec implementation, the closeness function uses normalised vectors, which means that the maximum cosine similarity is 1. It is interesting that while the 97.5th percentile of the nearest-neighbour similarity remains fairly high regardless of embedding dimension, it still decreases as the dimension grows, and the 2.5th percentile decreases very significantly, from 0.814 in the 25-dimensional case all the way to 0.409 for the 300-dimensional model. This means that the model is unable to pack the embeddings as tightly as dimensionality grows, which could indicate overfitting. However, it must be noted

that while interesting, this measure of spread does not directly inform us of how well each of the models depicted will perform on the actual task they were trained for.
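The nearest-neighbour similarities behind figure 3.8 can be computed directly from a trained model, for example as in the following sketch. A toy corpus stands in for a year of material, and the attribute names follow gensim 4.x.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for one year of preprocessed material.
sentences = [["tekoäly", "muuttaa", "media", "ja", "yhteiskunta"],
             ["tekoäly", "ja", "koneoppiminen", "yleistyä", "media"]] * 200
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=3)

# Cosine similarity of each vocabulary word to its single nearest neighbour.
# (For a real vocabulary of some 200,000 words this loop is slow but straightforward.)
nearest_similarities = np.array([
    model.wv.most_similar(word, topn=1)[0][1]
    for word in model.wv.index_to_key
])

print("mean:", nearest_similarities.mean())
print("central 95 % range:", np.percentile(nearest_similarities, [2.5, 97.5]))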

Extrinsic evaluation of resulting embeddings

The main accuracy measure the original authors use is the performance of the resulting word embeddings on a word analogy task [60]. In this task, the model is asked to predict the missing word in a word analogy such as “big is to bigger as small is to X”. Famously, the authors were able to use the word2vec embeddings directly as vectors to accomplish this. For example, the X in the previous analogy could be computed via v(X) = v(small) + v(bigger) − v(big), where v denotes the vector corresponding to the word; the model’s prediction for X would then be the word whose embedding vector is closest to v(X). In more detail, the way the analogies are used to gauge accuracy is as follows. First, for each category of analogy, a decent number of analogy pairs is written down, e.g. for the “countries and capitals” category we would have pairs such as Athens : Greece and Madrid : Spain. Then for each pair, we combine it with up to 40 other pairs to produce the quartets that are used for evaluation; this maximum number of rows with the given pair in front, 40, is here just an arbitrary constant to limit the size of the test set. Finally, each resulting quartet, such as Athens : Greece :: Madrid : Spain, is evaluated by omitting the last item, in this case Spain, and asking the model to predict what it is, in the manner described above. The model’s prediction is taken to be the nearest predicted word, i.e. “top 1” instead of looking at e.g. the top 5 possibilities, and the scoring is simply 1 point for a correct answer and 0 points for a wrong answer. Since my source material is in Finnish, the English-language analogies used by Mikolov et al. cannot be used as they are. Instead, I translated the original analogies into Finnish, with some modifications, and then used the translated analogies to attempt to gauge the performance of my models. It has to be noted that not all of the various types of analogy used by Mikolov et al. are applicable here; this is in part due to the lemmatisation, which is required due to the complexity of the Finnish language, and in part due to said complexity directly. For example, one subsection of the analogies concerns present participles such as “dance” : “dancing”. During my preprocessing, all words are lemmatised to their base forms, which makes evaluating analogies such as this impossible. But even if the words weren’t lemmatised, a single Finnish translation of “dancing” does not exist, but rather, due to the way the language works, “dancing” could be translated in myriad ways, such as “tanssiminen”, “tanssimassa”, “tanssien” etc. Therefore this kind of analogy is not really applicable for Finnish, lemmatisation or not. An overview of the analogies used and not used can be found in table 3.9. The entire set of analogies is described in appendix A. In the Finnish translation, as mentioned, many categories were necessarily omitted. In addition, the contents of some categories were changed; for example I modified the “common countries”

category to consist of countries physically close to Finland. I also omitted some smaller countries in the “world countries” category as well as some more obscure words in other categories. Since I was working with yearly batches of material, which are relatively small, these more obscure words and proper nouns did not occur frequently enough to justify their inclusion. I note that even if the translation were done as literally as possible, the results of the analogy test would not be comparable across languages, and as such the exact contents of the analogy test for a given language are somewhat arbitrary.

Original                        Finnish translation

:capital-common-countries
Athens Greece                   Ateena Kreikka
Madrid Spain                    Madrid Espanja

:capital-world
Amman Jordan                    Amman Jordania
Dublin Ireland                  Dublin Irlanti

:currency
Denmark krone                   Tanska kruunu
Russia ruble                    Venäjä rupla

:city-in-state
Chicago Illinois                -
Denver Colorado

:family
father mother                   isä äiti
he she                          -
uncle aunt                      setä täti

:gram1-adjective-to-adverb
calm calmly                     -
occasional occasionally

:gram2-opposite
comfortable uncomfortable       mukava epämukava
possible impossible             mahdollinen mahdoton

:gram3-comparative
big bigger                      -
good better

:gram4-superlative
bad worst                       -
cool coolest

:gram5-present-participle
dance dancing                   -
fly flying

:gram6-nationality-adjective
Japan Japanese                  Japani japanilainen
Sweden Swedish                  Ruotsi ruotsalainen

:gram7-past-tense
dancing danced                  -
flying flew

:gram8-plural
banana bananas                  -
child children

:gram9-plural-verbs
go goes                         -
listen listens

Table 3.9: Examples of the word pairs used in the analogy test and their Finnish translations.
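In practice, individual analogies and whole analogy files can be scored with gensim, roughly as follows. The model and analogy file names are hypothetical, and the file is assumed to be in the Mikolov et al. format used in table 3.9, i.e. “:category” header lines followed by four-word rows.

from gensim.models import Word2Vec

# Hypothetical path to a previously trained yearly model.
model = Word2Vec.load("word2vec_yle_2011_dim100.model")

# A single analogy, scored "top 1": Ateena : Kreikka :: Madrid : X.
# The corpus was lowercased in preprocessing, so the analogy words are too.
prediction = model.wv.most_similar(positive=["kreikka", "madrid"],
                                   negative=["ateena"], topn=1)
print(prediction)   # 1 point if the top word is "espanja", 0 points otherwise

# gensim can also score an entire analogy file in one call.
score, sections = model.wv.evaluate_word_analogies("analogiat_fi.txt")
print("total accuracy:", score)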

Of word2vec’s hyperparameters, dimensionality of the embeddings is the most important. In my testing, the other hyperparameters had much less effect on the quality of the resulting embeddings, and I therefore left them largely at the gensim implementation’s defaults, as described in table 3.6. The effect of embedding dimensionality on accuracy, as measured by performance on the analogy task, is depicted in figures 3.10 and 3.11. I selected arbitrarily a handful of years from each corpus, and for each year trained a series of models with different embedding dimension, all other parameters being the same. The accuracy of these five models is shown in figure 3.10. The average accuracy across the five models, by dimension, is depicted in figure 3.11. It appears from figure 3.10 that the model begins to overfit as dimensionality grows beyond 100–150. Overfitting, of course, is traditionally defined as related to the generalisation error of a model, i.e. a model’s ability to produce consistently decent predictions when faced with new data [43, 15]. As such, the concept of overfitting does not completely apply to word2vec, since the word2vec model is not used for predictions. However, overfitting refers more generally to a situation where a model of too high complexity is used, such that the model can perform well on the training data simply by memorising it. This can clearly occur with the word2vec model, which has a very large number of parameters, regardless of whether said model is then used for predictions or not. The non-predictive nature of the model, then, merely makes it more tricky to ascertain whether overfitting has occurred. A more detailed breakdown of model performance on the various subtasks of the analogy test, by embedding dimensionality, is shown in figure 3.12. The source data for the models depicted here is year 2011 of the Yle corpus; the picture is very similar for the other years I checked across both corpora. It is interesting to note that, for Finnish text, the opposites task is clearly the most difficult, and in general there are wide differences in model performance on the various subtasks. In the original word2vec paper [58] the authors do not report their accuracy results quite this granularly. One of their models (a skip-gram) achieves a total

Figure 3.10: Accuracy on analogy task by embedding dimension: five arbitrarily selected data sets.

accuracy of 53.3 % on their analogy task; this was for a 300-dimensional model trained on 783 million words. Another skip-gram model was trained on 6 billion words and achieved a total accuracy of 65.6 %. In comparison, the best average accuracy across the five models, as shown in figure 3.11, was 38.3 %; the word counts of the source material for these five models were between 13.2 million and 21.1 million. It is fair to say, therefore, that word2vec models can achieve decent accuracy on the analogy task even with relatively low word counts.

Figure 3.11: Accuracy on analogy task by embedding dimension: average across the five data sets.

Figure 3.12: Accuracy on analogy task by embedding dimension, broken down by subtask. The total accuracy is plotted in purple. Source material: Yle corpus, year 2011; a representative case.

Chapter 4

Experimental results

One direct application of word embeddings as produced by word2vec is to examine the changes in word meanings over time. This idea has been investigated by various authors [23, 32, 45, 47, 81]; a survey is provided by [48]. Such a study of word semantics over time can be done by training a series of word2vec models, e.g. one per year of source material, and then comparing the change in the nearest neighbours of each word of interest over time. With word2vec embeddings, such comparisons can be made in a measurable way, as we will see. In what follows I present first the results of an analysis made on the basis of the Yle corpus (years 2011–2018). After this I repeat the analysis for the STT corpus (years 1997–2018 only). I chose as my topic the current public discussion on artificial intelligence. This discussion encompasses many related concepts, such as smart algorithms, machine learning, intelligent robotics and so on. It is well known, and easily noticeable, that these topics have been receiving increasing amounts of coverage in mainstream media in the last few years [20, 67]. It can therefore be reasonably hypothesised that perhaps the context of certain words related to these topics has changed as well. Starting from Firth’s famous remark that “you shall know a word by the company it keeps” [21], i.e. that the meaning of a word is in its context, we can examine the contexts of various words of interest using word2vec, which computes approximations of these contexts in the form of word embeddings. Furthermore, the changes in these contexts – and therefore in the meanings of the words – can be examined by observing changes in word2vec models trained from the words. This can be done by training several word2vec models, e.g. one for each year of material, and observing the embeddings of selected words in various models. To be more concise, the hypothesis can be stated as follows.

Hypothesis. Assume that the meaning of a certain concept has changed over time. Then this change in meaning is reflected in changes in word contexts, for those words that are most closely related to said concept. Furthermore, with a computational method such as word2vec, one can attempt to approximate and quantify this change in word contexts, in order to better understand the change in meaning.

Figure 4.1: Occurrences of the word “tekoäly” (artificial intelligence) in the Yle corpus. Both overall word counts and the number of distinct articles containing the word are depicted.

With all this in mind, I had a look at the word “tekoäly” (artificial intelligence) and how its nearest neighbours change over time. The frequency of “tekoäly” in the Yle corpus between 2011 and 2018 is depicted in figure 4.1. The fact that in Finnish “tekoäly” is a compound word, i.e. a single word, makes the analysis somewhat more convenient; when working with phrases such as the English “artificial intelligence”, one would have to first preprocess the material in such a way that such multi-word phrases are treated as one token, but in my case this was not necessary. As is seen in figure 4.1, the number of occurrences of the word “tekoäly” in

the Yle corpus has grown exponentially in the few years up to 2018, whether one measures by overall word count or by the number of distinct articles containing the word. In 2011, there were 4 articles and a total of 5 mentions of “tekoäly”, while in 2018 these had risen to 230 and a round 1000 respectively. Based on this we can reasonably expect there to have also been a measure of change in the contexts in which the word is used, and it is this that we analyse next.

4.1 Method

In more detail, the method for analysing the target word, visualising the results, and discovering new, related target words for further analysis is as follows. This procedure is adapted from [32].

1. Find out appropriate values for the various word2vec hyperparameters through exploration (see section 3.2).

2. With appropriate hyperparameter values, train a word2vec model with the first year’s material from the chosen corpus.

3. Advance to the next year of material and retrain the previous word2vec model with it; i.e. use the current word embeddings as the starting point, and expand the vocabulary with any new words, initialising the corresponding new word embeddings randomly (a code sketch of this incremental retraining is given after the list).

4. Repeat the previous step for each year of material.

5. Select the years of interest for examining the word embeddings. For example, for the Yle corpus I chose the years 2011, 2014 and 2018 (the first year, a middle year and the last year respectively).

6. Examine the k nearest neighbours of the target word, for each year of interest (k can be, say, 10 or 20), and compare these sets of neighbours between years. For example, for the target word “tekoäly” in the Yle corpus, some of these yearly comparisons are depicted in detail in tables 4.2 and 4.3 (for details, see below).

7. For visualisation, there are two main steps: visualising the current (i.e. latest-model) positions of the target word and each of its neighbouring words, and visualising the older (i.e. earlier-model) positions of the target word. There are various ways to optimise and tweak the visualisation to obtain better results. Here, the overall method is presented, and the details of the tweaks are described below.

8. First, the target word being t, let n(t) denote the set of all neighbouring words that we selected for visualisation across all years of interest. We select the vectors corresponding to the words n(t) from the newest model, as the

purpose is to display the latest position of the neighbouring words. Next, we fit a dimensionality reduction model on these vectors. Suitable dimensionality reduction techniques include t-SNE [53] and the newer UMAP [56]; if desired, the standard preprocessing step of principal component analysis (PCA) can be used as well. Call the fitted dimensionality reduction model F. We can now use F to compute the two-dimensional embedding of the neighbours n(t) or any subset thereof. We also pick the newest model’s word vector for the target word t and compute its two-dimensional embedding with F.

9. Next, we visualise the position of the target word according to the earlier models. To do this, simply read the vector corresponding to the target word from each model corresponding to an earlier year of interest. These older target vectors can now also be two-dimensionalised with F .

10. After computing the two-dimensional representation of the desired word vectors, all that remains is drawing them. In some scenarios it may be desirable to compute the dimensionality reduction for a larger amount of word vectors than are actually required, in which case one would now draw only the desired vectors rather than all of them. For details, see below.
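The incremental, year-by-year retraining of steps 2–4 can be done in gensim roughly as follows. The toy per-year corpora are stand-ins of mine, and the hyperparameters are those of table 3.6.

from gensim.models import Word2Vec

# Toy stand-in for the real per-year corpora of preprocessed, lemmatised sentences.
sentences_by_year = {
    2011: [["tekoäly", "olla", "tietovisailu", "uutuus"]] * 100,
    2012: [["tekoäly", "olla", "algoritmi", "ja", "botti"]] * 100,
}
years = sorted(sentences_by_year)

model = Word2Vec(sentences=sentences_by_year[years[0]], vector_size=100,
                 window=5, min_count=3, negative=5, epochs=30)
model.save(f"model_{years[0]}.model")

for year in years[1:]:
    corpus = sentences_by_year[year]
    # Add any new words to the vocabulary; their vectors start out random,
    # while existing words keep their current embeddings as the starting point.
    model.build_vocab(corpus, update=True)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(f"model_{year}.model")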

Some notes on the visualisation procedure follow. First of all, there are two ways to choose which, or how many, neighbouring words to consider for visualisation. It can be useful to utilise a larger amount of neighbours for computing the visualisation, as this computation is done with a dimensionality reduction method that can benefit from the added stability provided by a relatively large number of word vectors; one can then choose, for clarity, to display only some of these two-dimensionalised word vectors. For example, for the word “tekoäly” in the Yle corpus, I experimented with values of k between 20 and 100 for computing the visualisation, ending up with k = 50 nearest neighbours per year; that is, I chose the top k = 50 neighbours from 2011, another k = 50 neighbours from 2014 and k = 50 neighbours from 2018, and computed the two-dimensional visualisation based on these 150 vectors. (Note that there is some overlap in the nearest neighbours; this is benign.) After the computation, I then drew only the nearest 10 neighbours per year, and from these I selected 4 per year for the final image. One can also select a smaller number of neighbours per year to begin with, say 3 or 4 out of the top 10, and compute the dimensionality reduction directly for this smaller set. Either method is valid, as the visualised words are in either case picked from the same models. The potential issue with computing the two-dimensional reduction from a smaller number of word vectors is that the results may not be satisfactory: it can happen that the dimensionality reduction places the two-dimensional images of the vectors in the “wrong order” from the point of view of their original cosine similarities, e.g. in “2-1-3” order rather than “1-2-3” or “3-2-1”. This can often be mitigated by providing a larger amount of original vectors to

the dimensionality reduction algorithm as described above, so that the algorithm has a better chance of figuring out the regularities in the cosine similarities of the data. In practice, one might have to try both of the methods described here and perhaps also tweak various method-specific parameters to obtain good results, as dimensionality reduction is an inherently tricky problem. Another issue to be considered is the alignment of the word vectors across years. While word2vec embeddings are of course consistent with one another within each yearly corpus, there is nothing that guarantees similar alignment of a given word’s embedding vector between two different models, such as the yearly-trained models in this scheme. In the procedure described above, each year’s model is trained starting from the previous year’s model, so the vectors – or at least, those whose words are shared between years – do retain at least some of their orientation from year to year. However, the vectors may still drift out of alignment over time, especially if the number of years and therefore training rounds is very large. To better align the vectors, an orthogonal Procrustes transformation [72] can be used. With this transformation, we find the rotation matrix that rotates model B’s vectors so that they are aligned, as closely as possible, with model A’s vectors, in such a way that the between-column alignment within either embedding matrix is preserved. There are two details about this procedure that warrant mentioning. First of all, exactly how the Procrustes alignment fits in with the rest of the model training depends on the word2vec implementation. In the gensim implementation that I used, it is not straightforward to manipulate the model vectors in such a way that the modified vectors can then be used to train the next model. In practice, this simply means that the models must be trained in sequence with no Procrustes alignments, and at the end one can then perform an additional step of aligning all the trained models with one another in sequence. This is straightforward, but care must be taken to use the rotated model vectors afterwards, as these will be in a different format than the ordinary gensim models. More importantly, Procrustes alignment is not straightforward to apply in case some words (and therefore word vectors) are not present in every model. The standard way of computing the orthogonal Procrustes alignment assumes that each of the two matrices being considered has the same number of columns and that the columns are in the same order. However, as an example, some of the nearest neighbours of “tekoäly” in 2018 are not present in all previous models: “koneoppiminen” (machine learning), for instance, first appears in 2014 in the Yle corpus, while said corpus goes back to 2011. In principle, it could be possible to create an extrapolated vector for such words for the years when they were not yet present, but it is not immediately clear how valid the results of this would be. The easiest solution to this dilemma is to use the original unaligned vectors, rather than the Procrustes-aligned ones, for any words that are not present throughout the timeline of the corpus. In practice, one must be ready to experiment with various solutions, both with Procrustes alignment and without, as well as with different parameters of

the underlying dimensionality reduction technique, in order to find a passable visualisation.
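The alignment and projection steps could be sketched as follows. This is a simplified illustration under several assumptions of mine: wv_2018 and wv_2011 stand for the KeyedVectors of two trained yearly models, neighbour_words stands for the pooled neighbour set n(t), and the older model is aligned directly to the newest one rather than in a year-by-year sequence.

import numpy as np
from scipy.linalg import orthogonal_procrustes
import umap

def align(base_wv, other_wv):
    """Rotate other_wv's vectors so that the words shared with base_wv
    line up as closely as possible (orthogonal Procrustes)."""
    shared = [w for w in other_wv.index_to_key if w in base_wv.key_to_index]
    A = np.vstack([base_wv[w] for w in shared])
    B = np.vstack([other_wv[w] for w in shared])
    R, _ = orthogonal_procrustes(B, A)       # rotation minimising ||B R - A||
    return {w: other_wv[w] @ R for w in other_wv.index_to_key}

aligned_2011 = align(wv_2018, wv_2011)

# Fit the dimensionality reduction F on the newest model's vectors for the
# pooled neighbours, then project both positions of the target word with it.
reducer = umap.UMAP(n_components=2, random_state=0)
reducer.fit(np.vstack([wv_2018[w] for w in neighbour_words]))
points = reducer.transform(np.vstack([wv_2018["tekoäly"], aligned_2011["tekoäly"]]))
print(points)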

4.2 Results: Yle corpus

The basis of the analysis is a straightforward comparison of the nearest neighbours of the target word, as computed by word2vec models trained on separate years of material. For the Yle corpus, I chose the years 2011, 2014 and 2018 as the years of interest; these are the first, middle and last years of the corpus. These nearest neighbours are then easy to read out directly once the models have been trained. To begin with, this has been done for the first two years, 2011 and 2014, in table 4.2 for the target word “tekoäly”. Notice that some of the words have been altered by the lemmatisation: some compound words have had dashes added, while other compound words are no longer grammatically correct, such as the lemmatised word “koneoppiminenmenetelmä”. In addition, the source material’s phrase “South by Southwest” (the name of a music and film festival) has resulted in the single, wrongly lemmatised token “southvesti”. These minor issues do not affect the results noticeably. From table 4.2 it is seen that the context of the word “tekoäly”, as computed by word2vec, changes somewhat between 2011 and 2014. In 2011 the word “tekoäly” referred to a variety of technological concepts, such as weapons technology and environmental and computer technology, as well as technical applications such as quizzes. In contrast, while in 2014 “tekoäly” is still strongly associated with various practical applications (Hong Kong metro system as implemented by MTR, facial recognition, mobile tickets), it has also come to be associated with more abstract computer science concepts such as algorithms, bots and computer programs in general. This change in context can also be verified by checking where the 2014 model places the words that were closest to “tekoäly” in the 2011 model, which is done in the last segment of table 4.2. We see that the top ten words from 2011 are now fairly distant in 2014: all of them are significantly further away than the top ten words from 2014, and some, such as weapons technology, have ended up very far in the 2014 model. Table 4.3 shows the ten words nearest to “tekoäly” in the model constructed from the year 2018 data, along with a look at 2011’s nearest words from the 2018 model. We can see that the change in context, and therefore arguably the meaning, is even more marked from 2011 to 2018 than it was between 2011 and 2014. In 2018, the word “tekoäly” is most closely associated with machine learning, deep learning and algorithms, i.e. specific computer science concepts. An interesting facet of this change in context is that the nearest neighbours in 2018 no longer include any specific applications, but are rather fairly generic, broad terms. Looking at the words that were the nearest neighbours in 2011, we see that in 2018 they are again more distant than 2018’s top ten words. Roughly half of these 2011 words have moved farther away since 2014, while the other half has moved slightly closer,

Neighbouring word  Distance  English description

tekoäly (2011)
  tietovisailu  0.6703  quiz; game show
  uutinenrintama  0.6248  news front
  dcnsn  0.6127  genitive of DCNS (defence company)
  androidkäyttöjärjestelmä  0.6021  Android operating system
  aseteollisuus  0.6011  weapons industry
  ympäristöteknologi  0.5982  environmental technology
  kehitystavoite  0.5948  development goal
  tietokonevalmistaja  0.5893  computer manufacturer
  teknologi  0.5864  technology
  aseteknologia  0.5855  weapons technology

tekoäly (2014)
  mtrn  0.6954  MTR (Hong Kong transport company)
  algoritmi  0.6165  algorithm
  tietokoneohjelma  0.6127  computer program
  botti  0.6085  bot (i.e. “software robot”)
  kasvontunnistus  0.5954  face recognition
  mobiililippu  0.595  mobile ticket
  southvesti  0.59  South by Southwest (festival)
  ohjelmoida  0.589  to program (verb)
  cortanasovellus  0.5889  Cortana application
  icf  0.5887  Intelligent Community Forum

tekoäly (2011 words in 2014 model)
  tietovisailu  0.4436  quiz; game show
  uutinenrintama  0.3574  news front
  dcnsn  0.4184  genitive of DCNS (defence company)
  androidkäyttöjärjestelmä  0.4234  Android operating system
  aseteollisuus  0.1979  weapons industry
  ympäristöteknologi  0.3481  environmental technology
  kehitystavoite  0.3739  development goal
  tietokonevalmistaja  0.5097  computer manufacturer
  teknologi  0.4362  technology
  aseteknologia  0.2844  weapons technology

Table 4.2: Nearest neighbours of “tekoäly” (artificial intelligence) in the Yle corpus, years 2011 & 2014.

All in all, these findings support the hypothesis that the usage, and therefore arguably the meaning, of the word “tekoäly” has changed between 2011 and 2018. We see that not only have the nearest words changed noticeably over time, but also the category of these nearest neighbours has changed: in 2011, “tekoäly” occurred most often near words which refer to specific applications, while in 2018 the word is most closely associated with broader, more abstract concepts.

Neighbouring word  Distance  English description

tekoäly (2018)
  koneoppiminen  0.7916  machine learning
  teknologia  0.7823  technology
  koneoppiminenmenetelmä  0.7471  machine learning method
  algoritmi  0.7468  algorithm
  teknologinen  0.729  technological
  syväoppiminen  0.7214  deep learning
  super-tekoäly  0.7114  super AI (artificial intelligence)
  robotti  0.7101  robot
  keino-äly  0.7077  constructed intelligence
  digitaaliteknologia  0.7048  digital technology

tekoäly (2011 words in 2018 model)
  tietovisailu  0.4287  quiz; game show
  uutinenrintama  0.3027  news front
  dcnsn  0.3712  genitive of DCNS (defence company)
  androidkäyttöjärjestelmä  0.5605  Android operating system
  aseteollisuus  0.3825  weapons industry
  ympäristöteknologi  0.5254  environmental technology
  kehitystavoite  0.3806  development goal
  tietokonevalmistaja  0.5181  computer manufacturer
  teknologi  0.6013  technology
  aseteknologia  0.3609  weapons technology

Table 4.3: Nearest neighbours of “tekoäly” (artificial intelligence) in the Yle corpus, years 2011 & 2018.

Using the method described in the previous section, this change in nearest neighbours between 2011 and 2018 can also be visualised, which is done in figure 4.4. For clarity, I have chosen to include only some of the nearest neighbours for each of the years 2011, 2014 and 2018. The red dots depict the words that were among the k = 10 nearest neighbours in 2011, 2014 or 2018. The blue arrow shows the “journey” of the word “tekoäly” itself between 2011 and 2018. This visualisation is, of course, just a different way of describing some of the changes that were detailed in tables 4.2 and 4.3. For this particular figure, I used UMAP [56] for the dimensionality reduction and did not need to use either PCA or Procrustes alignment for preprocessing.
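A sketch of how such a two-dimensional projection can be produced with the umap-learn package; the input matrix and the parameter values below are illustrative assumptions on my part, not the exact settings used for figure 4.4:

    import numpy as np
    import umap  # the umap-learn package

    def project_2d(vectors, n_neighbors=10, min_dist=0.3, seed=42):
        """Project the selected embedding vectors (one per row) onto two dimensions."""
        reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            metric="cosine", random_state=seed)
        return reducer.fit_transform(np.asarray(vectors))  # shape (n_words, 2)

Here `vectors` would hold the embedding vectors of the chosen neighbour words (plus “tekoäly” itself) collected from the yearly models, and the resulting coordinates are what gets plotted.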

Figure 4.4: A comparison of the nearest neighbours of the word “tekoäly” (artificial intelligence) between 2011 and 2018, Yle corpus. The red points indicate those words that were among “tekoäly”’s nearest neighbours in 2011, 2014 or 2018.

4.2.1 Other interesting words

Tables 4.2 and 4.3 list the nearest neighbours of our target word, “tekoäly”, and one avenue of further analysis is picking out some of these words for a closer look. For example, we can see that the words “algoritmi” (algorithm) and “robotti” (robot) are only close to “tekoäly” in the year 2018 model. In fact, digging a bit deeper reveals that for the 2011 model, the set of the k = 1000 words nearest “tekoäly” does not include either “algoritmi” or “robotti”. This is interesting, since both of these words are in the top ten neighbours of “tekoäly” for 2018. Furthermore, the word “robotti” seems like something that could reasonably be close to the apparent meaning of “tekoäly” in 2011, which included weapons, environmental and computer technology.
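The k = 1000 membership check mentioned above is again a direct gensim query; a minimal sketch, reusing the hypothetical model file name from the earlier example:

    from gensim.models import Word2Vec

    model_2011 = Word2Vec.load("yle_2011.model")  # hypothetical file name
    top_1000 = {word for word, _ in model_2011.wv.most_similar("tekoäly", topn=1000)}
    for candidate in ("algoritmi", "robotti"):
        print(candidate, candidate in top_1000)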

Neighbouring word  Distance  English description

algoritmi (2011)
  gpsvastaanotin  0.6818  GPS receiver
  ip-osoite  0.6748  IP address
  liikuntaharjoittelu  0.6734  healthy exercise
  alaryhmä  0.6715  subgroup
  androidkäyttöjärjestelmä  0.6666  Android operating system
  naturelehti  0.6658  the journal Nature
  soluviljelytutkimus  0.6612  cell culture research
  päättelykyky  0.6606  inference capability
  nyky-teknologia  0.6582  modern technology
  moni-mutkaistuminen  0.658  increasing complexity

algoritmi (2018)
  tekoäly  0.7468  artificial intelligence
  gsa  0.6985  GSA (EU satellite navigation agency)
  kryptografia  0.6866  cryptography
  koneoppiminen  0.6864  machine learning
  somemiljardööri  0.678  social media billionaire
  tweetdeck  0.6739  tweetdeck
  koneoppiminenmenetelmä  0.6737  machine learning method
  hakukonepalvelu  0.6719  search engine service
  google  0.6716  Google
  gesture  0.6696  gesture

Table 4.5: Nearest neighbours of “algoritmi” (algorithm) in the Yle corpus.

To gain more insight into word contexts, it is easy to repeat the analysis described above for other words of interest, which I have done in the following for the words “algoritmi” and “robotti”. Notice that in the following tables, some of the words have again been rendered grammatically incorrect by lemmatisation: some dashes are added, and most dramatically, a certain inflection of the Finnish word “painoton” (“weightless”) has been mis-lemmatised into the incorrect “paiottomassa” (correct: “painottomassa”). In addition, the foreign names Costigan and Tynker have been incorrectly lemmatised. As before, these issues do not affect the results.

As can be seen from table 4.5, in 2011 the word “algoritmi” was associated with various technical and scientific concepts: GPS receiver, Android operating system, biological research, etc. In 2018, “algoritmi” is again closely related to various applications such as cryptography and search engines, but it is also now close to artificial intelligence and machine learning. In other words, although the meaning has not completely changed based on this analysis – “algoritmi” is still closely associated with a collection of various practical applications – we now, in 2018, find it also to be close to AI and machine learning, where this was not the case in 2011. In this respect, the change of meaning is similar to that of “tekoäly”, which we looked at earlier.

Neighbouring word  Distance  English description

robotti (2011)
  petman  0.6707  PETMAN (name of a robot)
  vanhuskeskus  0.6368  senior citizen centre
  robottihylje  0.6071  robot seal
  paiottomassa  0.6058  in a weightless state
  ihmiskäsi  0.5952  human hand
  etä-käyttö  0.5789  remote usage
  sikiödiagnostiikkayksikkö  0.572  fetus diagnostics unit
  kuusijalkainen  0.5718  six-legged
  costiga  0.5717  Costigan (a name)
  tietokonetomografia  0.5593  computer tomography

robotti (2018)
  tekoäly  0.7101  artificial intelligence
  hirukawa  0.6948  Hirukawa (name)
  teknologia  0.6763  technology
  etä-kirurgia  0.6558  remote surgery
  tuntonäyttö  0.6543  touch display
  cyberdyne  0.6432  Cyberdyne (company)
  tynkeri  0.642  Tynker (game)
  teknoasia  0.6411  techno[logical] issue
  koneoppiminen  0.6379  machine learning
  gulzar  0.6373  Gulzar (name)

Table 4.6: Nearest neighbours of “robotti” (robot) in the Yle corpus.

In the same vein, “robotti”, as is seen in table 4.6, has in this corpus always referred to fairly traditional concepts of robots, but the difference in 2018 is that it has come to be associated also with artificial intelligence and machine learning. These concepts are more abstract than physical robots, and we can therefore argue that the meaning of “robotti” has widened to include also such “conceptual robots”, in addition to the traditional ones. This change of meaning is very similar to that of “algoritmi” and “tekoäly” analysed earlier.

As before, I have visualised this change in nearest neighbours between 2011 and 2018 in figures 4.7 and 4.8 for “algoritmi” and “robotti” respectively. Again only some of the nearest neighbours for both 2011 and 2018 are depicted. The green dots indicate the words that were among the k = 10 nearest neighbours in 2011, and similarly the red dots depict those from 2018. The blue arrows show the “journey” of the target word itself between 2011 and 2018.

Figure 4.7: A comparison of the nearest neighbours of the word “algoritmi” (algorithm) between 2011 and 2018. Source: Yle corpus. Green and red points indicate those words that were among the nearest neighbours in 2011 and 2018 respectively.


4.3 Results: STT corpus

As I had two distinct corpora available, Yle and STT, I repeated the analysis on the latter as well. First of all, I looked at the number of occurrences of the word “tekoäly” in the STT corpus. A graph of this is seen in figure 4.9. These occurrence counts are very similar to those of the Yle corpus, as seen in the previous figure 4.1, in that the number of mentions, whether measured by word count or the number of distinct articles that mention the word, is quite low until it undergoes an exponential growth in the years leading up to 2018.

Figure 4.8: A comparison of the nearest neighbours of the word “robotti” (robot) between 2011 and 2018. Source: Yle corpus. Green and red points indicate those words that were among the nearest neighbours in 2011 and 2018 respectively.

Figure 4.9: Occurrences of the word “tekoäly” (artificial intelligence) in the STT corpus, years 1992–2018. Both overall word counts and the number of distinct articles containing the word are depicted.

The yearly word count for “tekoäly” between 1992 and 2013 is between 1 and 13, with an average of just over 5, hitting zero in the years 1994, 2010 and 2012.

A closer look at the last eight years of the word counts for “tekoäly” is seen in figure 4.10. I have also drawn these word counts again for the Yle corpus for the same years; only the word counts are depicted, not the article counts. We can see that the word counts for “tekoäly” are very similar for both corpora.

Based on the similarity of these word counts, we may reasonably hypothesise that the behaviour of the word “tekoäly” in the STT corpus will be similar to its behaviour in the Yle corpus: in the years before 2014, it presumably refers to various, somewhat isolated practical applications, whereas by 2018, it can be predicted to have gained a more general status.

Figure 4.10: Occurrences of the word “tekoäly” (artificial intelligence) in both Yle and STT corpora, years 2011–2018 only.

Of course this is merely a guess, which we shall check in the following. The method of analysis is the same as before, i.e. that described in section 4.1: the analysis consists of a straightforward comparison of the nearest neighbours of the target word, across word2vec models which are trained on yearly batches of source material.

After training the yearly models from the STT corpus, I chose the years 1997, 2008 and 2018 as the years of interest. In this corpus, as we saw in figure 3.1, the word and article counts are somewhat low prior to 1997, in which year they reach their “normal” level.

Neighbouring word  Distance  English description

tekoäly (1997)
  diffraktiivinen  0.6644  diffractive
  membrane  0.6575  membrane
  näköjärjestelmä  0.6553  vision system
  lektiinihistokemiallinen  0.6518  lectin histochemical
  pintafysiikka  0.6476  surface physics
  neuraaliverkko  0.6473  neural network
  duplex  0.6389  duplex
  vesaluoma  0.6356  Vesaluoma (name)
  kuvaanalyysi  0.6356  image analysis
  epälineaarinen  0.6342  nonlinear

tekoäly (2008)
  automaatti–vaihteinen  0.6007  with automatic transmission
  taltioiva  0.599  recording (present participle)
  cagney  0.5984  Cagney (name)
  heardi  0.5971  Heard (name)
  goodrich  0.5947  Goodrich (company)
  kappleri  0.5944  Kappler (name)
  amerikanpitbullterrieri  0.5944  American pitbull terrier
  vuoro-pohjainen  0.5891  turn-based
  bondteema  0.5868  Bond [movie] theme
  pelikirja  0.5861  games book

tekoäly (1997 words in 2008 model)
  diffraktiivinen  0.3164  diffractive
  membrane  0.3255  membrane
  näköjärjestelmä  0.3579  vision system
  lektiinihistokemiallinen  0.4058  lectin histochemical
  pintafysiikka  0.3018  surface physics
  neuraaliverkko  0.3471  neural network
  duplex  0.3518  duplex
  vesaluoma  0.0854  Vesaluoma (name)
  kuvaanalyysi  0.2934  image analysis
  epälineaarinen  0.2119  nonlinear

Table 4.11: Nearest neighbours of “tekoäly” (artificial intelligence) in the STT corpus, years 1997 & 2008.

The year 2018 is of course the last year for which material was available, and 2008 is (roughly) in between. Table 4.11 lists, to begin with, the nearest neighbours of the word “tekoäly” in the STT corpus for the years 1997 and 2008, as well as the distance of each 1997 word in the 2008 model.

Some artifacts caused by lemmatisation again occur, such as extraneous or missing dashes. Note that some of the neighbouring words are in English in the source material, rather than Finnish; this is because STT publishes news bulletins from various universities, which include titles of talks, doctoral theses and scientific papers in English.

From table 4.11 we see that the context of the word “tekoäly” remains fairly similar between 1997 and 2008 in the STT material. In both cases, the nearest context words refer to fairly specific, niche concepts: scientific terms, names of various researchers, and popular culture. Only a minority of the nearest neighbours are more generic terms. An interesting case among these is the word “neuraaliverkko” (neural network). This is a very generic term, and we see that it is not retained in the nearest neighbours of “tekoäly” between 1997 and 2008. This is an example of how “tekoäly” back in 2008 had not yet come to mean anything very specific or common, as seen from its context, i.e. from its nearest neighbours. We can hypothesise that had “tekoäly” attained a more general, common status, then it would have more general terms as its neighbours. This is in fact what we found with the Yle material earlier, as seen in tables 4.2 and 4.3. As before, the table also shows the distances in 2008 of the nearest neighbours of 1997, in the last segment. We see that by 2008 the movement away from the previously nearest words was very significant across the board. This means that the context of “tekoäly” was by no means solid in 1997.

Table 4.12 depicts the nearest neighbours of “tekoäly” in 2018, as well as the distances of the words nearest in 1997 according to the 2018 model. We can see that this time there is a significant change towards more general terms, as opposed to specific or niche concepts. More than half of the ten nearest neighbours now refer to general concepts – digitalisation, machine learning and so on – compared to just one or two for both 1997 and 2008. This is very similar to what we observed earlier in the Yle corpus. In the Yle case all of the 10 nearest neighbours of “tekoäly” in 2018 were general terms, while here for STT only a majority of them are; but the trend towards more general terms over time is the same. In the last segment of table 4.12 we see how the words which were nearest in 1997 fare in the 2018 model. Interestingly, every one of these 1997 words is closer to “tekoäly” in the 2018 model than it was in the 2008 model. Perhaps the greater closeness could be a sign that the 2018 model has stabilised to a certain region, more so than the 2008 one.

Finally, a comparison with the 2018 Yle results from earlier, which were depicted in table 4.3, shows that a few of the nearest neighbours of “tekoäly” in 2018 are the same in both Yle and STT corpora, namely “koneoppiminen” (machine learning) and “teknologia” (technology). This is another similarity between the two sets of results obtained from the two corpora. Based on all this we can argue that, similar to what we saw for the Yle corpus before, the context of “tekoäly” has changed over time also for the STT corpus. Again, not only have the exact words changed, but also their category, from more specific words earlier on to more general ones in 2018. We can see that this change is not an isolated incident specific to the Yle corpus, but in fact occurs more broadly.

Neighbouring word  Distance  English description

tekoäly (2018)
  teknologia  0.7802  technology
  digitalisaatio  0.7677  digitalisation
  nfc-etätunniststeknologia  0.7648  NFC remote ID technology
  mydogdna-testauspalvelu  0.7401  MyDogDNA testing service
  digitaalinen  0.7352  digital
  pilvipalvelu  0.7245  cloud service
  koneoppiminen  0.7231  machine learning
  aivotieto  0.7212  brain data
  kasvuseulatyökalu  0.7166  Kasvuseula.fi tool
  materiaalikehitys  0.7131  research & development of materials

tekoäly (1997 words in 2018 model)
  diffraktiivinen  0.5149  diffractive
  membrane  0.3878  membrane
  näköjärjestelmä  0.5571  vision system
  lektiinihistokemiallinen  0.5318  lectin histochemical
  pintafysiikka  0.4634  surface physics
  neuraaliverkko  0.5763  neural network
  duplex  0.5657  duplex
  vesaluoma  0.1336  Vesaluoma (name)
  kuvaanalyysi  0.4382  image analysis
  epälineaarinen  0.4707  nonlinear

Table 4.12: Nearest neighbours of “tekoäly” (artificial intelligence) in the STT corpus, years 1997 & 2018.

I also computed a visualisation of the changes in the nearest neighbours, which is seen in figure 4.13. As before, I chose to only include some of the nearest neighbours for clarity. The red dots depict words that were among the k = 10 nearest neighbours of “tekoäly” in 1997, 2008 or 2018, and the blue arrow shows the location of the target word “tekoäly” itself over time. This picture was produced with UMAP as the dimensionality reduction technique and with neither PCA nor Procrustes alignment needed for preprocessing.

As before, I computed the visualisation with UMAP several times with varying parameters, and selected one of these to display. Interestingly, all of these visualisations ended up clustering the 1997 and 2018 vectors closer to each other than to the 2008 vectors. This mirrors the earlier observation that the 1997 words were in fact closer to “tekoäly” in the 2018 model than they were in the 2008 model. It can be speculated therefore that the 2008 model was not yet very stable. It has to be noted that the simplest explanation for the 1997 model being seemingly more “stable” in this sense than the 2008 one is simply random chance, rather than any real effect; the low word occurrence counts for both 1997 and 2008 (see figure 4.9) would cast doubt upon the latter assumption.

Figure 4.13: A comparison of the nearest neighbours of the word “tekoäly” (artificial intelligence) between 1997 and 2018, STT corpus. The red points indicate those words that were among “tekoäly”’s nearest neighbours in 1997, 2008 or 2018.


4.3.1 Other interesting words

I continued the analysis with a look at one of the words that was among the k = 10 nearest neighbours of “tekoäly” in 2018, namely “koneoppiminen” (machine learning); in the STT results this 2018 neighbour was the closest one to those in the Yle results earlier (see section 4.2).

Neighbouring word  Distance  English description

koneoppiminen (1997)
  totuuskaltaisuus  0.7425  verisimilitude
  optimointimalli  0.7416  optimisation model
  skyrmi  0.7413  Skyrme (name)
  häiriöteoria  0.738  perturbation theory
  retentioindeksistandardi  0.7377  retention index standard
  lääkeaineseulonta  0.736  drug screening
  merkkijonotietokanta  0.7329  sequence database
  laboratoriotesti  0.7275  laboratory test
  keskustlullinen  0.724  conversational
  ionisuihku  0.7239  ion shower

koneoppiminen (2018)
  koneoppiminenmenetelmä  0.7349  machine learning method
  hiilipinta  0.7335  carbon surface
  kiihtyvyysanturi-pohjainen  0.7262  accelerometer-based
  tekoäly  0.7231  artificial intelligence
  mallintaminen  0.7134  (technical) modelling
  aivotieto  0.7102  brain data
  viher-hoito  0.6953  green treatment
  genomilaajuinen  0.6936  genome-wide
  reunalaskenta  0.6923  edge computing
  markkinointiteknologia  0.687  marketing technology

Table 4.14: Nearest neighbours of “koneoppiminen” (machine learning) in the STT corpus, years 1997 & 2018.

To begin with, a comparison of the k = 10 nearest neighbours of “koneoppiminen” in 1997 and in 2018 can be found in table 4.14. Notice that as ever there are some minor errors introduced by lemmatisation: the first word in “Skyrmen malli” (Skyrme’s model, a physical theory) has become “skyrmi”, and “keskustelullinen” (conversational) has become the erroneous “keskustlullinen”. Again, these minor discrepancies do not affect the results.

We see that in 1997 the word “koneoppiminen” was associated with fairly specific, niche technical terms. In 2018 this was mostly still the case, but in addition we have a few more general terms among the nearest neighbours, most significantly “tekoäly”, but also “koneoppiminenmenetelmä” (machine learning method) and arguably “mallintaminen” (technical modelling). This result is very similar to what we saw for the term “tekoäly” itself earlier: in 1997, “koneoppiminen” was, judging by the context, a fairly niche technical concept, but by 2018 it had become somewhat more generic.

In figure 4.15 we can also see a visualisation of some of the nearest neighbours of “koneoppiminen” in 1997 and 2018.

Figure 4.15: A comparison of the nearest neighbours of the word “koneoppiminen” (machine learning) between 1997 and 2018, STT corpus. The red points indicate those words that were among “koneoppiminen”’s nearest neighbours in 1997 or 2018.

For clarity, only some of the k = 10 nearest neighbours for each of the two years are depicted. This picture was produced with UMAP for dimensionality reduction, and neither PCA nor Procrustes alignment was required for preprocessing.

4.4 Stability and repeatability of results

One potential issue I wanted to investigate is the repeatability of the results when using word2vec. As was detailed in chapter 2, word2vec is a neural network, and as such it is inherently stochastic. Each time one trains a word2vec model, the starting weights, i.e. the word embeddings the model starts out with, are initialised randomly; this is necessary when using neural networks, as elaborated in standard textbooks on the matter (e.g. [27]). In the case of word2vec, various optimisations which were described in chapter 2 are also stochastic in nature, starting with the most important optimisation, negative sampling. Subsampling of frequent words is another optimisation that is probabilistic in nature. Finally, in skip-gram mode the algorithm can be set to use a context window of random length R ∼ Unif(1, C) rather than C, which is yet another source of randomness.

Because of all the randomness, the final word embeddings will also exhibit some variance across different model training sessions, even with the same source material and the same hyperparameters. While one could use the same random seed for each initialisation, this would not provide any guarantees as to the results with any other seed, and thus one could not tell whether these particular results are an outlier or a common case. Instead, the main feasible way to assess the repeatability of results is to run the analysis multiple times and gauge how much variation there is in different runs with different random seeds.

The main downside of repeating the analysis multiple times is the time and labour required. While in my case training the word2vec models from the two corpora I used is relatively quick, the time required obviously depends on the size of the source material and the computational power available; there may also be monetary costs to be considered with respect to the computation. As well, the analysis of the results can be laborious to repeat.

As detailed in chapter 2, several authors have investigated the convergence properties of word2vec, and some have found that word2vec with negative sampling is guaranteed to converge under certain conditions [52]. However, even if convergence is assumed, it is not clear how quickly it will happen.
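A repeat run therefore amounts to training the same models again with a different random seed and comparing the resulting neighbour lists. A minimal sketch, assuming the gensim 4.x API, a variable `sentences` holding the lemmatised and tokenised material, and illustrative hyperparameters (these are not the exact settings used in this thesis):

    from gensim.models import Word2Vec

    def train(seed):
        # `sentences` is assumed to be a list of token lists for one year of material.
        return Word2Vec(sentences, vector_size=100, window=5, sg=1, negative=5,
                        min_count=5, seed=seed, workers=1)

    run_a = train(seed=1)
    run_b = train(seed=2)

    top_a = {w for w, _ in run_a.wv.most_similar("tekoäly", topn=10)}
    top_b = {w for w, _ in run_b.wv.most_similar("tekoäly", topn=10)}
    print(len(top_a & top_b), "of the 10 nearest neighbours are shared between the runs")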

4.4.1 Repeated analysis: Yle corpus

It turns out that in my case, for the two corpora I used, the results are in fact quite stable. I ran the previously detailed analysis again with the exact same source material and hyperparameters, so that the only thing that varied was the randomness present in the word2vec implementation itself. I then examined the nearest neighbours of the main target word, “tekoäly” (artificial intelligence), in the same way as before.

For the Yle corpus, the results of the rerun are depicted in table 4.16. Starting from the latest year, 2018, the results are remarkably similar to those from the first run, which were presented in table 4.3. In fact, out of the ten nearest neighbours of “tekoäly” in 2018, nine are the same across the two runs, while in the second run, the word “digitaaliteknologia” (digital technology) is missing and has been replaced in the top ten by “koneäly” (machine intelligence).

Neighbouring word  Distance  English description

tekoäly (2011)
  tietovisailu  0.6612  quiz; game show
  moottoritekniikka  0.5875  motor technology
  alkoholipolttoaine  0.5823  alcohol-based fuel
  kerroshampurilainen  0.5821  double hamburger
  maalaustaide  0.5819  painting [noun]
  vähä-päästö  0.5813  low-emission
  afrikkalaisuus  0.58  Africanness
  ekojalanjälki  0.5791  ecological footprint
  innovaatiopolitiikka  0.5788  innovation politics
  soitinrakentaminen  0.5758  musical instrument making

tekoäly (2014)
  mtrn  0.6824  MTR (Hong Kong transport company)
  algoritmi  0.6302  algorithm
  turinki  0.622  Turing (last name)
  icf  0.6137  Intelligent Community Forum
  botti  0.6127  bot (i.e. “software robot”)
  amatöörimainen  0.6012  amateurish
  tietokoneohjelma  0.5882  computer program
  gmes  0.5754  GMES/Copernicus (ESA programme)
  ohjelmoida  0.5748  to program
  internetjätti  0.5747  Internet giant

tekoäly (2018)
  teknologia  0.8091  technology
  syväoppiminen  0.7965  deep learning
  koneoppiminen  0.7721  machine learning
  algoritmi  0.7553  algorithm
  koneoppiminenmenetelmä  0.7457  machine learning method
  teknologinen  0.7341  technological
  koneäly  0.7234  machine intelligence
  robotti  0.7116  robot
  super-tekoäly  0.7055  super AI (artificial intelligence)
  keino-äly  0.7044  constructed intelligence

Table 4.16: Second run: nearest neighbours of “tekoäly” (artificial intelligence) in the Yle corpus, years 2011, 2014 & 2018.

The nearest neighbours are in different order and the distance of each neighbour from the target word “tekoäly” varies somewhat in the second run, but interestingly this variance is fairly small: in the first run, the minimum and maximum distances of the ten neighbours were 0.7048 and 0.7916 respectively, while for the second run they are 0.7044 and 0.8091, for a maximum difference of roughly 2 %.

The content of the second run results, for 2018, can therefore be said to be basically identical to those of the first run.

Looking at the results for 2014, we see that six out of ten nearest neighbours of “tekoäly” are preserved in the second run, when compared to the first run, which was documented in table 4.2. Again, the six shared words are in a different order and with slightly different distances, compared to the first run. The maximum difference between the minimum and maximum distances of the first run and the second run is again roughly 2 %. Considering the four new words which appear in the top ten of 2014 in the second run, three of them – “turinki” (a wrongly lemmatised Turing, as in Alan Turing), “amatöörimainen” (amateurish), and “gmes” (GMES, the former name of Copernicus, a European Space Agency earth monitoring programme) – are fairly specific concepts, while “internetjätti” (Internet giant) is somewhat more general. As all of the more general words from the first run are still present, it can be said that the balance between specific concepts and more general terms in the 2014 Yle results has not changed on the second run.

For 2011, the results of the rerun can be seen to differ significantly from the previous run’s results. While the difference in the minimum and maximum distances is less than 2 % this time, out of the ten nearest neighbours, nine have changed. Only “tietovisailu” (quiz; game show) is in the top ten for both runs; interestingly, it is also in the top spot for both. In any case, the larger changes here, compared to the other two years, are consistent with the fact that the target word, “tekoäly”, is simply not very common yet in the year 2011 source material, and the results for that word are not very stable. As we saw, the results become more stable for 2014 and much more stable for 2018. While the exact words differ greatly for 2011 across the two runs, the word categories remain the same: in both cases, “tekoäly” co-occurred almost exclusively with words referring to specific, relatively niche applications, rather than broader terms and concepts.

4.4.2 Repeated analysis: STT corpus

The results of the rerun on the STT corpus are shown in table 4.17. For 2018, the first run’s results for STT were depicted in table 4.12, and when comparing, we again see a large overlap between the two sets of results. Seven out of ten words for 2018 are the same in the second run, with a different order and a fairly small difference in the distances of less than 2 %. Interestingly, two of the three words that have dropped out of the top ten are fairly specific concepts – “aivotieto” (brain data) and “materiaalikehitys” (research & development of materials) – and two of the three new words that appear in the rerun are also fairly specific: “käyttötapaus” (use case) and “intensiivikoulutus” (intensive training). The remaining word that fell out in the second run, “pilvipalvelu” (cloud service), is more general, and so is the remaining new word, “digitalisoitua” (to become digitalised). Therefore the rerun results retain the same division into more general and more specific concepts as the first run did, where the majority of the nearest neighbours in 2018 are general terms, as opposed to specific concepts. Perhaps the greater stability of the more general words across reruns can be seen to strengthen the result that by 2018, the word “tekoäly” referred to more general concepts, as opposed to more specific ones as it did earlier.

Neighbouring word  Distance  English description

tekoäly (1997)
  epälineaarinen  0.6637  nonlinear
  neuraaliverkko  0.6441  neural network
  lektiinihistokemiallinen  0.6428  lectin histochemical
  joutsensalo  0.6339  Joutsensalo (name)
  bioreseptori  0.6305  bioreceptor
  näköjärjestelmä  0.6261  vision system
  hepatic  0.6252  hepatic
  jia-qing  0.624  Jia-Qing (name)
  molekylaarinen  0.6229  molecular
  koneoppiminen  0.6227  machine learning

tekoäly (2008)
  karkotin  0.6332  repellent
  malakian  0.6326  Malakian (name)
  saudinainen  0.6168  Saudi woman
  torjuntatyyli  0.6039  blocking style
  artico  0.5995  Artico (name)
  syöttäjätähti  0.5908  star pitcher
  jalkapallohenkilö  0.5889  football personality
  magneettiaisti  0.5884  magnet instinct
  kolmemiehinen  0.5883  three-man
  rocksävelmä  0.5881  rock tune

tekoäly (2018)
  nfc-etätunniststeknologia  0.7722  NFC remote ID technology
  teknologia  0.7714  technology
  digitalisaatio  0.7589  digitalisation
  koneoppiminen  0.7383  machine learning
  digitaalinen  0.7201  digital
  digitalisoitua  0.7144  to become digitalised
  mydogdna-testauspalvelu  0.7135  MyDogDNA testing service
  käyttötapaus  0.7081  use case
  intensiivikoulutus  0.7064  intensive training
  kasvuseulatyökalu  0.7041  Kasvuseula.fi tool

Table 4.17: Second run: nearest neighbours of “tekoäly” (artificial intelligence) in the STT corpus, years 1997, 2008 & 2018.

For 2008, the STT corpus gives very different results on the rerun. The previous run’s results for 2008 were depicted in table 4.11; on the rerun, all ten words are different and the distances differ by more than 5 %. Examining the 2008 words of the rerun, we see that all of them refer to specific, niche concepts, which is what happened also on the first run. In the original analysis, we saw that for STT the year 2008 was an outlier in terms of the closest neighbours: the words which were closest to “tekoäly” in 1997 were closer to “tekoäly” in the 2018 model than in the 2008 model. This second analysis seems to corroborate this finding: in 2008, the nearest neighbours are definitely not stable.

Examining the 1997 words, we see greater stability compared to 2008, but not that much stability overall. On the second run of STT 1997, four of the ten words have remained the same as in the first run. The maximum difference in the distances is less than 2 % between the runs, which is small compared to the previous rerun results we have discussed. Of the six new words, five are specific, niche terms, but the remaining one, “koneoppiminen” (machine learning), is clearly more generic. This word, however, is not enough to change the year’s results: these 1997 terms remain scattered and specific in the second run, with only a minority being more general terms. In this sense, the second run’s results match those of the first run.

In all, the rerun indicates that the results obtained from the analysis detailed in this chapter are genuine, in the sense that they are not due to any random variation in how word2vec works as an algorithm. Although there are differences, sometimes large differences, in the results between the two runs, the conclusions remain the same. Indeed, the differences we have noted seem to indicate which parts of our results are stable and which are less so. This kind of repeated analysis could offer a valuable way of gaining further confidence in the results of experiments done with word2vec.
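The degree of run-to-run stability could also be summarised numerically, for example as the fraction of shared words among the two runs' ten nearest neighbours for each year. A minimal sketch, where `models_run1` and `models_run2` are hypothetical dictionaries mapping a year to the corresponding trained model from each run:

    def topn_overlap(model_a, model_b, word, topn=10):
        """Fraction of shared words among the two models' top-n neighbours of `word`."""
        a = {w for w, _ in model_a.wv.most_similar(word, topn=topn)}
        b = {w for w, _ in model_b.wv.most_similar(word, topn=topn)}
        return len(a & b) / topn

    for year in (1997, 2008, 2018):
        print(year, topn_overlap(models_run1[year], models_run2[year], "tekoäly"))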

4.5 Summary

We arrive at the conclusion that the usage of the word “tekoäly”, as determined by its context, has noticeably changed by the year 2018, whether one begins the examination from 1997, 2008 or 2011. This change in usage has occurred in both the Yle and the STT corpus, and it remains largely the same in two different repeat analyses. We can therefore argue further that this observed change in context, and therefore the meaning, of the word “tekoäly” is neither a case of a single media company using the word differently, nor due to random variation in the word2vec algorithm, but rather a broader, genuine change.

It is interesting that the word2vec method is able to capture not only this general trend in both corpora examined, but also the differences between the corpora. Comparing tables 4.3 and 4.12 we can see that the overlap between the nearest neighbours of Yle and STT is quite small, if one goes by the exact specific words themselves. However, if we consider which types of word these neighbours are, we see a marked commonality between the corpora, yet with the difference that the STT corpus seems less “solidified” regarding the place of “tekoäly” in it: in the STT corpus, “tekoäly” has not, perhaps not yet, ended up as a fully generic, maximally broad concept, whereas this seems to have occurred, or in any case to a noticeably greater extent, in the Yle corpus.

In Hamilton et al.’s paper [32], which worked as an inspiration for this chapter, the findings are of course more robust. This is due to the much larger corpora used in their paper, both in overall volume and the number of years studied. It is interesting to note that such analysis can be undertaken even with smaller corpora – the results presented above were fairly clear, despite the much smaller word counts across many fewer years. This suggests that word2vec is a reasonably robust method of analysis for context change even on a smaller scale. We also see that word2vec can handle the Finnish language well, at least when the material is lemmatised first.

Chapter 5

Discussion

As was discussed in chapter 2, what word2vec computes is an approximation of the distribution of context words, given an input of sentences. While calculating this distribution is not in itself a complicated feat, making sense of it is where the real “added value” lies. The word vectors provided by word2vec are constructed so as to provide a distance metric between any two words, based on the similarity (or otherwise) of their context. This distance metric is an easy, practical way to estimate the similarity of any given words in a text corpus. One can also, as we did in chapter 4, examine how this similarity changes over time. Thus, word2vec is a usable method for examining, not the context distribution directly, but rather its impact.

In addition to this, it turns out that the word embeddings computed by word2vec are useful for many other tasks, as was detailed in section 2.5. This is because the embeddings contain a lot of information about the word contexts of the source material. This informativeness makes them a very widely applicable, versatile tool.
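As a reminder (the definition itself is given in chapter 2, so the notation here is my own), the similarity reported for a word pair, and listed as the distance in the tables of chapter 4, is the cosine similarity of the corresponding embedding vectors \(\mathbf{v}_a\) and \(\mathbf{v}_b\):

\[ \operatorname{sim}(w_a, w_b) \;=\; \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\lVert \mathbf{v}_a \rVert \, \lVert \mathbf{v}_b \rVert }. \]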

5.1 Word2vec and other algorithms

Word2vec has proven to be a powerful, yet relatively straightforward algorithm. It has been used for many different applications and has in turn inspired a great number of improved versions, hybrid “word2vec-plus-something” algorithms and novel feature learning algorithms.

In chapter 1 we discussed word2vec in the context of other models and algorithms used for natural language processing (NLP). We saw that word2vec is often used in an indirect way, which puts it in a somewhat different category than, say, topic models, which were designed for a specific task. Of course, the word embeddings produced by word2vec can also be examined directly, as was done in chapters 3 and 4.

Word2vec is by no means the only algorithm to take advantage of the context distribution; GloVe [66] is another notable example. Word embeddings are a very active topic of research due to their wide applicability. Whether proceeding from the context distribution, as GloVe does, or from a different direction such as word2vec or its successors (e.g. [59]), the theoretical foundation of word embeddings in general is currently being strengthened in various ways. In addition, the various uses and applications of word embeddings are under heavy investigation as well. While word embeddings are not the only NLP tool, due to their versatility they are one of the biggest, being used in many NLP tasks from examinations of evolving semantics (chapter 4) to more direct investigations of context [24, 77].

5.2 Word2vec in practice

One reason for word2vec’s popularity is undoubtedly its ease of use for practical tasks, which I have verified for myself. In chapter 3 we saw how word2vec can be used as part of a practical language processing pipeline. As natural language processing algorithms go, word2vec is very straightforward to use. This is thanks in no small part to the well-working gensim implementation [68], but also to the simplicity of word2vec in general. Word2vec does not require extensive preprocessing, beyond the standard preprocessing which most NLP methods generally require, such as lowercasing, removal of punctuation, possibly lemmatising etc. Training the models does not take onerously long, at least using the gensim implementation. I did not run into any significant issues in the practical usage of word2vec with the gensim implementation.

It is also relatively straightforward to write a custom word2vec implementation, which obviously makes research and experimentation easier. Indeed, some machine learning programming frameworks, such as TensorFlow [1], provide a tutorial on how to implement word2vec in the framework in question [75], as implementing word2vec will work as a reasonable introduction to learning how to accomplish useful tasks with the framework. Thus word2vec these days could be called a standard algorithm.

Due to its relative ease of use and effectiveness, as seen in my experiences which are documented in this thesis, I can recommend word2vec as an NLP tool for tasks which can benefit from examining the context distribution, such as an examination of the changing of the context distribution over time, which I documented in chapter 4. In general, word2vec can work well for any NLP task that operates on the sentence level of text or is related to the co-occurrences and contexts of words.

Many applications can also benefit from word embeddings in a more indirect way, as we saw in section 2.5. In such cases one should consider whether precomputed word embeddings could be used, as this will significantly lower the barrier to entry. The downside of precomputed embeddings is that they of course cannot be used to analyse custom material, i.e. material that they were not trained with. Because of this, precomputed embeddings answer a different question than that of a custom text analysis – they are useful for enriching fairly “standard”, ordinary text for further processing, rather than directly analysing said text itself.
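For the precomputed-embedding route mentioned above, gensim also ships a downloader for several publicly available embedding sets. A minimal sketch using one of the well-known English sets; this is an illustration of the mechanism only, not the Finnish models trained in this thesis:

    import gensim.downloader as api

    # Downloads (and caches) a publicly available set of precomputed embeddings.
    vectors = api.load("word2vec-google-news-300")  # returns a KeyedVectors object
    print(vectors.most_similar("computer", topn=5))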

5.3 Experiment: semantic change over time

In chapter 4 an experiment was presented where word2vec was used to attempt to gauge the change in meaning of a certain concept over time. To accomplish this, a sequence of word2vec models was trained, one for each year of the source material, to investigate the changes of word contexts over time, for words related to the target concept.

We concluded that the usage of the word “tekoäly” (Finnish for “artificial intelligence”) had changed noticeably over time, as examined via word2vec models trained on two different corpora of Finnish news articles. Word2vec was able to detect this change in context and readily provided concrete examples of the changes for each year of material. Furthermore, we were able to differentiate the two corpora with respect to the level of change they showed for the target word. In a repeat analysis, the results remained the same, meaning that word2vec performed in a stable way on this particular problem. Despite the relatively small size of each yearly batch of material, word2vec performed well and produced useful, understandable results.

The experiment described in chapter 4 was successful enough that, in my opinion, word2vec can be recommended as a quantitative method for other researchers as well: it is both easy to use and powerful. Indeed, looking at the bigger picture, some researchers have attempted to use such quantitative NLP techniques together with, and as a basis for, qualitative analysis [63]. This is the general approach taken by researchers such as Ylä-Anttila [83], who uses automated computational methods of text analysis as a building block of a larger analytical system. While Ylä-Anttila utilises topic modelling, others have instead used word2vec as a building block for hybrid systems that incorporate qualitative analysis into a system built on word2vec as well as novel algorithms [28, 44]. In general, word2vec is robust enough to qualify as a basic building block for such combined methods.

5.4 Limitations of word2vec

While word2vec is a useful and versatile algorithm, it has its limitations. These limitations were discussed in section 2.4, and later developments towards more advanced algorithms were discussed in section 2.6.

The main drawback of word2vec is that it assigns each unique token in the source material precisely one embedding vector, regardless of possible differences in word sense (polysemy), part of speech or inflection. This makes word2vec unsuitable for some specialised use cases such as natural language generation. Fortunately, for such more advanced problems, newer algorithms exist, such as fastText [6, 40] and BERT [14] to name just a few, as was discussed in section 2.6.

In section 4.4 we discussed another potential caveat of word2vec, namely the repeatability of its results. Word2vec is a stochastic algorithm and its results are therefore somewhat unpredictable. While the embeddings produced by word2vec do eventually converge to the context distribution, it is not clear a priori how repeatable this convergence is on repeat runs of the algorithm on the same material, or how rapidly the algorithm converges. However, in practice this is not a problem, as was demonstrated in chapter 4: word2vec was in this case found to converge both at a practical speed (i.e. fast enough, without needing excessively many training epochs) and repeatably. These experiences mirror those of other practitioners who have used word2vec. It can therefore be said, based on my experiences as documented in this thesis, that word2vec is a stable and robust algorithm.

Chapter 6

Conclusion

Word2vec is an algorithm for finding compressed representations of word contexts from natural-language text. As we saw, these compressed representations, word embeddings, have many applications and uses. The one we focused on in this thesis was a straightforward examination of the word embeddings to study the corresponding changes in word context over time.

Our hypothesis in chapter 4 was that if the meaning of a certain concept has changed over time, then this change will be reflected in corresponding changes in word contexts, for those words that are most closely related to that concept. While the word contexts are undoubtedly an imperfect proxy for the true meaning of the concept, however that is defined, they can nevertheless be a useful approximation. Much of this usefulness comes from the fact that the word contexts can be measured and analysed, and this can be done even for very large masses of source material. Word2vec is a suitable tool for such analysis.

In chapter 4 our conclusion was that the usage of our target word, “tekoäly” (artificial intelligence), has indeed changed over time in both source corpora. We were able to quantify this change, in a repeatable and robust manner, and could therefore argue further that this represents a true change in the meaning of the term “tekoäly”. Word2vec proved to be very capable of detecting and measuring such changes, being able to also differentiate the corpora from one another. A repeat analysis also allowed us to judge the stability of the word2vec results for each year studied. All in all, word2vec was proven to perform quite well on such context change tasks even with relatively small amounts of source material.

While word2vec is not perfect, it provides, in my experience as documented in this thesis, very good “value for money”, in the sense that it is both understandable and powerful. For this reason I can recommend obtaining at least basic familiarity with the ideas and techniques of word2vec, since they are useful in their own right and also shed light on the exact workings of later algorithms. Another point in word2vec’s favour is the fact that, as a well understood algorithm, it is both straightforward to use in practice, thanks to the well-functioning gensim implementation, and also simple enough to build a custom implementation for, should one be needed, or to learn more about this type of algorithm in general.

Bibliography

[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283).

[2] Arora, S., Li, Y., Liang, Y., Ma, T., & Risteski, A. (2018). Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6, 483–495.

[3] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137–1155.

[4] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3 (Jan), 993–1022.

[5] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55 (4), 77–84.

[6] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

[7] Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in neural information processing systems (pp. 4349–4357).

[8] Bray, T. (2017). RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format. Internet Engineering Task Force (IETF).

[9] Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

[10] Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., ... & Xie, W. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141), 20170387.

[11] Cordeiro, S., Ramisch, C., Idiart, M., & Villavicencio, A. (2016, August). Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1986–1997).

[12] Corral, A., Boleda, G., & Ferrer-i-Cancho, R. (2015). Zipf’s law for word frequencies: Word forms versus lemmas in long texts. PloS one, 10(7).

[13] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391–407.

[14] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[15] Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM computing surveys (CSUR), 27(3), 326–327.

[16] DiMaggio, P. (2015). Adapting computational text analysis to social science (and vice versa). Big Data & Society, 2(2), 2053951715602908.

[17] Dubin, D. (2004). The most influential paper Gerard Salton never wrote.

[18] Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988, May). Using latent semantic analysis to improve access to textual information. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 281–285). Acm.

[19] Dyer, C. (2014). Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251.

[20] Fast, E., & Horvitz, E. (2017, February). Long-term trends in the public perception of artificial intelligence. In Thirty-First AAAI Conference on Artificial Intelligence.

[21] Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. Studies in linguistic analysis.

[22] Flood, B. J. (1999). Historical note: the start of a stop list at Biological Abstracts. Journal of the Association for Information Science and Technology, 50(12), 1066.

[23] Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644.

[24] Gligorijevic, D., Stojanovic, J., Djuric, N., Radosavljevic, V., Grbovic, M., Kulathinal, R. J., & Obradovic, Z. (2016). Large-scale discovery of disease-disease and disease-gene associations. Scientific reports, 6(1), 1–12.

[25] Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

[26] Gonen, H., & Goldberg, Y. (2019). Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862.

[27] Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1, p. 2). Cambridge: MIT press.

[28] Gorro, K., Ancheta, J. R., Capao, K., Oco, N., Roxas, R. E., Sabellano, M. J., ... & Goldberg, K. (2017, December). Qualitative data analysis of disaster risk reduction suggestions assisted by topic modeling and word2vec. In 2017 International Conference on Asian Language Processing (IALP) (pp. 293–297). IEEE.

[29] Grbovic, M., & Cheng, H. (2018, July). Real-time personalization using embeddings for search ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 311–320).

[30] Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political analysis, 21 (3), 267–297.

[31] Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The journal of machine learning research, 13(1), 307–361.

[32] Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.

[33] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction, pp. 241–249. Springer Science & Business Media.

[34] Haverinen, K. Natural Language Processing Resources for Finnish (Doctoral dissertation, PhD thesis, University of Turku, 8 2014).

[35] Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1984). Distributed representations (pp. 1–127). Pittsburgh, PA: Carnegie-Mellon University.

[36] Hofmann, T. (1999, August). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50–57).

[37] Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine learning, 42(1-2), 177–196.

[38] Hyvärinen, I. (2019). Compounds and multi-word expressions in Finnish. Complex lexical units: Compounds and multi-word expressions, 307–335.

[39] Häkkinen, K. (1994). Kielitieteen perusteet (Vol. 133). Suomalaisen Kirjallisuuden Seura.

[40] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

[41] Jurafsky, D., & Martin, J. H. (2008). Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing. Upper Saddle River, NJ: Prentice Hall.

[42] Jurafsky, D., & Martin, J. H. (2020 [draft]). Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing (3rd Edition). Retrieved from https://web.stanford.edu/~jurafsky/slp3/.

[43] Kearns, M., Mansour, Y., Ng, A. Y., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27(1), 7–50.

[44] Kim, S., Park, H., & Lee, J. (2020). Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: a study on blockchain technology trend analysis. Expert Systems with Applications, 113401.

[45] Kim, Y., Chiu, Y. I., Hanaki, K., Hegde, D., & Petrov, S. (2014). Temporal analysis of language through neural language models. arXiv preprint arXiv:1405.3515.

[46] Korenius, T., Laurikkala, J., Järvelin, K., & Juhola, M. (2004, November). Stemming and lemmatization in the clustering of Finnish text documents. In Proceedings of the thirteenth ACM international conference on Information and knowledge management (pp. 625–633).

[47] Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S. (2015, May). Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web (pp. 625–635).

[48] Kutuzov, A., Øvrelid, L., Szymanski, T., & Velldal, E. (2018). Diachronic word embeddings and semantic shifts: a survey. arXiv preprint arXiv:1806.03537.

[49] Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188–1196).

[50] Leeuwenberg, A., Vela, M., Dehdari, J., & van Genabith, J. (2016). A minimally supervised approach for synonym extraction with word embeddings. The Prague Bulletin of Mathematical Linguistics, 105(1), 111–142.

[51] Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems (pp. 2177–2185).

[52] Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X., & Chen, E. (2015, June). Word embedding revisited: A new representation learning and explicit matrix factorization perspective. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

[53] Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9 (Nov), 2579–2605.

[54] MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, No. 14, pp. 281–297).

[55] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

[56] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

[57] Mikolov, T. (2013, 7 Oct). Re: de-obfuscated Python + question [online discussion comment]. Message posted to https://groups.google.com/forum/?#!topic/word2vec-toolkit/NLvYXU99cAM.

[58] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[59] Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

[60] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

[61] Mitra, B., Nalisnick, E., Craswell, N., & Caruana, R. (2016). A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137.

[62] Mäkelä, E. (2016). LAS: an integrated language analysis tool for multiple languages. J. Open Source Software, 1 (6), 35.

[63] Muller, M., Guha, S., Baumer, E. P., Mimno, D., & Shami, N. S. (2016, November). Machine learning and grounded theory method: Convergence, divergence, and combination. In Proceedings of the 19th International Conference on Supporting Group Work (pp. 3–8).

[64] Nalisnick, E., Mitra, B., Craswell, N., & Caruana, R. (2016, April). Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 83–84).

[65] Nissim, M., van Noord, R., & van der Goot, R. (2019). Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor. arXiv preprint arXiv:1905.09866.

[66] Pennington, J., Socher, R., & Manning, C. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

[67] Pentzold, C., Brantner, C., & Fölsche, L. (2019). Imagining big data: Illustrations of “big data” in US news articles, 2010–2016. New Media & Society, 21 (1), 139–167.

[68] Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

[69] Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

[70] Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20, 53–65.

[71] Salton, G. (Ed.). (1971). The SMART retrieval system: Experiments in auto- matic document processing. Englewood Cliffs, NJ: Prentice-Hall.

[72] Schönemann, P. H. (1966). A generalized solution of the orthogonal procrustes problem. Psychometrika, 31 (1), 1–10.

[73] STT. Finnish News Agency Archive 1992-2018, source [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2019041501

[74] Sutskever, I., Martens, J., & Hinton, G. E. (2011, January). Generating text with recurrent neural networks. In ICML.

[75] TensorFlow (2020, 24 Sep). Word embeddings | TensorFlow Core. TensorFlow. Retrieved from https://www.tensorflow.org/tutorials/text/word_embeddings.

[76] Trask, A., Michalak, P., & Liu, J. (2015). sense2vec – a fast and accurate method for word sense disambiguation in neural word embeddings. arXiv preprint arXiv:1511.06388.

[77] Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., ... & Jain, A. (2019). Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571 (7763), 95.

[78] Wang, J., Huang, P., Zhao, H., Zhang, Z., Zhao, B., & Lee, D. L. (2018, July). Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 839–848).

[79] Wilkerson, J., & Casas, A. (2017). Large-scale computerized text analysis in political science: Opportunities and challenges. Annual Review of Political Science, 20,529–544.

[80] Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., & Weston, J. (2017). Starspace: Embed all the things!. arXiv preprint arXiv:1709.03856.

[81] Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018, February). Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 673–681).

[82] Yleisradio. Yle Finnish News Archive 2011-2018, source [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2017070501

[83] Ylä-Anttila, T. S. S. (2017). The Populist Toolkit: Finnish Populism in Action 2007–2016.

[84] Zhang, Y., Chen, Q., Yang, Z., Lin, H., & Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific data, 6 (1), 1–9.

Appendix A

Finnish analogy set

This appendix details the set of Finnish-language analogies used to evaluate word2vec models; the evaluation process itself was described in detail in section 3.2.1. The following Python script documents both the analogy pairs used and the method by which these pairs were turned into four-word analogy questions for the evaluation.

# generate_analogies_finnish.py
#
# Writes a word2vec-style analogy question file from Finnish word pairs.
# Each section produces a ": section-name" header followed by questions
# of the form "a b c d", meaning "a is to b as c is to d".

questions_file_path = "/tmp/questions-words_finnish.txt"

pairs = [
    {
        "name": "capital-common-countries",
        "items": [
            ("Helsinki", "Suomi"), ("Tukholma", "Ruotsi"), ("Moskova", "Venäjä"),
            ("Oslo", "Norja"), ("Kööpenhamina", "Tanska"), ("Reykjavik", "Islanti"),
            ("Tallinna", "Viro"), ("Riika", "Latvia"), ("Vilna", "Liettua"),
            ("Varsova", "Puola"), ("Berliini", "Saksa"), ("Lontoo", "Englanti"),
            ("Madrid", "Espanja"), ("Pariisi", "Ranska"), ("Rooma", "Italia"),
            ("Kiova", "Ukraina"), ("Ankara", "Turkki"), ("Bukarest", "Romania"),
            ("Amsterdam", "Hollanti"), ("Bryssel", "Belgia"), ("Ateena", "Kreikka"),
            ("Bern", "Sveitsi"), ("Budapest", "Unkari"),
        ]
    },
    {
        "name": "capital-world",
        "items": [
            ("Abuja", "Nigeria"), ("Accra", "Ghana"), ("Alger", "Algeria"),
            ("Amman", "Jordania"), ("Ankara", "Turkki"), ("Antananarivo", "Madagaskar"),
            ("Apia", "Samoa"), ("Asgabat", "Turkmenistan"), ("Asmara", "Eritrea"),
            ("Nur-Sultan", "Kazakstan"), ("Ateena", "Kreikka"), ("Bagdad", "Irak"),
            ("Baku", "Azerbaidzan"), ("Bamako", "Mali"), ("Bangkok", "Thaimaa"),
            ("Banjul", "Gambia"), ("Peking", "Kiina"), ("Beirut", "Libanon"),
            ("Belgrad", "Serbia"), ("Belmopan", "Belize"), ("Berliini", "Saksa"),
            ("Bern", "Sveitsi"), ("Biskek", "Kirgisia"), ("Bratislava", "Slovakia"),
            ("Bryssel", "Belgia"), ("Bukarest", "Romania"), ("Budapest", "Unkari"),
            ("Gitega", "Burundi"), ("Kairo", "Egypti"), ("Canberra", "Australia"),
            ("Caracas", "Venezuela"), ("Chisinau", "Moldova"), ("Conakry", "Guinea"),
            ("Kööpenhamina", "Tanska"), ("Dakar", "Senegal"), ("Damaskos", "Syyria"),
            ("Dhaka", "Bangladesh"), ("Doha", "Qatar"), ("Dublin", "Irlanti"),
            ("Dusanbe", "Tadzikistan"), ("Funafuti", "Tuvalu"), ("Gaborone", "Botswana"),
            ("Georgetown", "Guyana"), ("Hanoi", "Vietnam"), ("Harare", "Zimbabwe"),
            ("Havanna", "Kuuba"), ("Helsinki", "Suomi"), ("Islamabad", "Pakistan"),
            ("Jakarta", "Indonesia"), ("Kabul", "Afganistan"), ("Kampala", "Uganda"),
            ("Kathmandu", "Nepal"), ("Khartum", "Sudan"), ("Kiova", "Ukraina"),
            ("Kigali", "Ruanda"), ("Kingston", "Jamaika"), ("Libreville", "Gabon"),
            ("Lilongwe", "Malawi"), ("Lima", "Peru"), ("Lissabon", "Portugali"),
            ("Ljubljana", "Slovenia"), ("Lontoo", "Englanti"), ("Luanda", "Angola"),
            ("Lusaka", "Sambia"), ("Madrid", "Espanja"), ("Managua", "Nicaragua"),
            ("Manama", "Bahrain"), ("Manila", "Filippiinit"), ("Maputo", "Mosambik"),
            ("Minsk", "Valko-Venäjä"), ("Mogadishu", "Somalia"), ("Monrovia", "Liberia"),
            ("Montevideo", "Uruguay"), ("Moskova", "Venäjä"), ("Masqat", "Oman"),
            ("Nairobi", "Kenia"), ("Nassau", "Bahama"), ("Niamey", "Niger"),
            ("Nikosia", "Kypros"), ("Nouakchott", "Mauritania"), ("Nuuk", "Grönlanti"),
            ("Oslo", "Norja"), ("Ottawa", "Kanada"), ("Paramaribo", "Suriname"),
            ("Pariisi", "Ranska"), ("Podgorica", "Montenegro"), ("Quito", "Ecuador"),
            ("Rabat", "Marokko"), ("Riika", "Latvia"), ("Rooma", "Italia"),
            ("Roseau", "Dominica"), ("Santiago", "Chile"), ("Skopje", "Makedonia"),
            ("Sofia", "Bulgaria"), ("Tukholma", "Ruotsi"), ("Suva", "Fidzi"),
            ("Taipei", "Taiwan"), ("Tallinna", "Viro"), ("Taskent", "Uzbekistan"),
            ("Tbilisi", "Georgia"), ("Tegucigalpa", "Honduras"), ("Teheran", "Iran"),
            ("Thimphu", "Bhutan"), ("Tirana", "Albania"), ("Tokio", "Japani"),
            ("Tripoli", "Libya"), ("Tunis", "Tunisia"), ("Vaduz", "Liechtenstein"),
            ("Valletta", "Malta"), ("Wien", "Itävalta"), ("Vientiane", "Laos"),
            ("Vilna", "Liettua"), ("Varsova", "Puola"), ("Windhoek", "Namibia"),
            ("Jerevan", "Armenia"), ("Zagreb", "Kroatia")
        ]
    },
    {
        "name": "currency",
        "items": [
            ("Algeria", "dinaari"), ("Angola", "kwanza"), ("Argentiina", "Peso"),
            ("Armenia", "dram"), ("Brasilia", "real"), ("Bulgaria", "leva"),
            ("Kambodza", "riel"), ("Kanada", "dollari"), ("Kroatia", "kuna"),
            ("Tanska", "kruunu"), ("EU", "euro"), ("Unkari", "forintti"),
            ("Intia", "rupia"), ("Iran", "rial"), ("Japani", "jeni"),
            ("Etelä-Korea", "won"), ("Makedonia", "denaari"), ("Malesia", "ringgit"),
            ("Meksiko", "peso"), ("Nigeria", "naira"), ("Puola", "zloty"),
            ("Romania", "leu"), ("Venäjä", "rupla"), ("Ruotsi", "kruunu"),
            ("Thaimaa", "baht"), ("Ukraina", "hryvnia"), ("Yhdysvallat", "dollari"),
            ("Vietnam", "dong"),
        ]
    },
    {
        "name": "family",
        "items": [
            ("poika", "tyttö"), ("veli", "sisko"), ("veli", "sisar"),
            ("isä", "äiti"), ("isoisä", "isoäiti"), ("sulhanen", "morsian"),
            ("mies", "vaimo"), ("kuningas", "kuningatar"), ("mies", "nainen"),
            ("prinssi", "prinsessa"), ("poika", "tytär"), ("setä", "täti"),
        ]
    },
    {
        "name": "gram2-opposite",
        "items": [
            ("varma", "epävarma"), ("selvä", "epäselvä"), ("mukava", "epämukava"),
            ("tehokas", "tehoton"), ("eettinen", "epäeettinen"), ("onnekas", "epäonnekas"),
            ("rehellinen", "epärehellinen"), ("tunnettu", "tuntematon"),
            ("looginen", "epälooginen"), ("miellyttävä", "epämiellyttävä"),
            ("mahdollinen", "mahdoton"), ("rationaalinen", "irrationaalinen"),
            ("vastuullinen", "vastuuton"), ("varma", "epävarma"),
        ]
    },
    {
        "name": "gram6-nationality-adjective",
        "items": [
            ("Suomi", "suomalainen"), ("Ruotsi", "ruotsalainen"), ("Venäjä", "venäläinen"),
            ("Norja", "norjalainen"), ("Tanska", "tanskalainen"), ("Islanti", "islantilainen"),
            ("Viro", "virolainen"), ("Latvia", "latvialainen"), ("Liettua", "liettualainen"),
            ("Puola", "puolalainen"), ("Saksa", "saksalainen"), ("Englanti", "englantilainen"),
            ("Espanja", "espanjalainen"), ("Ranska", "ranskalainen"), ("Italia", "italialainen"),
            ("Ukraina", "ukrainalainen"), ("Turkki", "turkkilainen"), ("Romania", "romanialainen"),
            ("Hollanti", "hollantilainen"), ("Belgia", "belgialainen"), ("Kreikka", "kreikkalainen"),
            ("Sveitsi", "sveitsiläinen"), ("Unkari", "unkarilainen"), ("Kiina", "kiinalainen"),
            ("Intia", "intialainen"), ("Yhdysvallat", "amerikkalainen"), ("Indonesia", "indonesialainen"),
            ("Pakistan", "pakistanilainen"), ("Brasilia", "brasilialainen"), ("Nigeria", "nigerialainen"),
            ("Bangladesh", "bangladeshilainen"), ("Meksiko", "meksikolainen"), ("Japani", "japanilainen")
        ]
    }
]


def gen():
    # For each section, emit a ": section-name" header line, then pair every
    # (a, b) with up to 40 of the other pairs (c, d) in the same section to
    # form four-word analogy questions.
    for section in pairs:
        items = sorted(section["items"])
        if len(items) > 1:
            yield f": {section['name']}"
            for i, (a, b) in enumerate(items):
                items_40 = (items[:i] + items[i+1:])[:40]
                for c, d in items_40:
                    yield f"{a} {b} {c} {d}"


if __name__ == "__main__":
    with open(questions_file_path, "w", encoding="utf-8") as f:
        for line in gen():
            f.write(line + "\n")
