Week 10: Classifying and Scaling Documents II POLS0010 Data Analysis

Julia de Romémont

Term 2 2021, UCL Department of Political Science

This Lecture

1. Classifying using Support Vector Machines (SVM)
2. Scaling using Cosine Similarity
3. Scaling using Wordscores
4. Wrap-Up: Measuring Stuff from Text

Remember

Two types of measurement schemes:

1. Classification of documents: involves categorical (often binary) measures
2. Scaling of documents: involves a continuous measure

⇒ Common goal: Assign a text to a particular category, or a particular position on a scale.

1. Classifying Using SVMs

I Last week, we first applied dictionary methods to classify a text as either negative or positive
I Another, often more effective, method we discussed is lasso regression, which is a type of regularisation technique
I Other classification techniques include, for example:
  I Ridge regression
  I Random Forest
  I k-nearest neighbour
  I etc.
I Another alternative method is called the Support Vector Machine (SVM)

1. Classifying Using SVMs

The Basic Idea of SVMs

I Draw a line between observations belonging to two categories, mapped as points in a p-dimensional¹ space, that maximises the width of the gap (margin) between the two categories
I This line is called a separating hyperplane
I New observations are classified according to which side of the hyperplane they fall on

¹Where p is the number of input ('independent') variables.

1. Classifying Using SVMs

The Separating Hyperplane

Suppose we have n observations of two variables $x_1$ and $x_2$, which fall into two classes $y \in \{-1, 1\}$. Then the separating hyperplane is defined by: $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$

I Therefore all observations that have $y = -1$ will have $\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0$ and thereby lie on one side of the hyperplane
I And all observations with $y = 1$ will have $\beta_0 + \beta_1 x_1 + \beta_2 x_2 > 0$ and lie on the other side

1. Classifying Using SVMs
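As a minimal sketch of this decision rule (the coefficients below are made up for illustration, not estimated from any data), a new observation is classified by the sign of the linear combination:

# hypothetical hyperplane coefficients, for illustration only
b0 <- -1; b1 <- 2; b2 <- 0.5

# classify a new observation by which side of the hyperplane it falls on
classify <- function(x1, x2) ifelse(b0 + b1 * x1 + b2 * x2 > 0, 1, -1)

classify(1, 0.5)  #  1.25 > 0, so this observation is assigned y = 1
classify(0, 0.2)  # -0.90 < 0, so this observation is assigned y = -1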

I We want to find the hyperplane that maximises the distance - the margin - between the two categories
I This means finding the line for which the closest observations are as far away from it as possible, compared to other candidate lines

[Figure from James et al. (2017: 342)]

1. Classifying Using SVMs

So what are support vectors?

I We want to draw a separating line between observations I Intuitively, observations far away from the boundary will not have any influence on the calculation of the hyperplane I Conversely, it is those observations closest to the boundary and within or on the margin on either side that will determine the location and therefore the equation of the hyperplane I These observations are the support vectors, as in they support the hyperplane 1. Classifying Using SVMs Special cases of SVMs

1. This is simple enough when we have perfectly linearly separable observations
  I This means that there exists a hyperplane where no observation is on the wrong side
  I In that case we use the maximal margin classifier
2. However, it is possible that no such separating hyperplane exists
  I The solution then is to find a hyperplane that almost separates the cases
  I This involves defining a threshold number of observations that can be on the wrong side of the hyperplane
  I Similarly to the λ parameter in lasso regression, the key question is how to choose the optimal threshold C (for which we can use k-fold cross-validation!)
  I This method is called the support vector classifier

1. Classifying Using SVMs

⇒ The most general case is when observations are neither perfectly nor linearly separable into classes, which is when we speak of Support Vector Machines²

I SVMs accommodate non-linear decision boundaries by adding another dimension which represents a measure of similarity of two observations (this is called the kernel trick)
I With this new dimension, it is then possible to find a linear hyperplane
I In the context of text analysis, using a linear kernel (i.e. a support vector classifier) is often sufficient
I The high number of features (words) makes it likely that there is at least one p-dimensional space where the documents are linearly separable

²Note that SVMs have quite a few similarities with logistic regression with lasso (or ridge) penalties and often perform similarly.

1. Classifying Using SVMs

Back to the spam email filter

emails_corpus <- corpus(emails$text)
tok_emails <- tokens_remove(
  tokens_tolower(tokens(emails_corpus, remove_numbers = T, remove_punct = T)),
  stopwords("en"))
dfm_emails <- dfm_weight(dfm(tok_emails), scheme = "prop")
doc_freq <- docfreq(dfm_emails)
dfm_emails <- dfm_emails[, doc_freq > 2]
dfm_emails <- dfm_remove(dfm_emails, "subject")

1. Classifying Using SVMs

I Let's prepare the data ahead of the modelling

df_emails <- convert(dfm_emails, to = "data.frame")
df_emails$doc_id <- NULL
df_emails <- cbind("y.var" = factor(emails$spam), df_emails)

I Now create some test and training data

set.seed(123)
cv.rows <- sample(nrow(df_emails), (nrow(df_emails) / 2))
cv.data <- df_emails[cv.rows, ]
test.data <- df_emails[-cv.rows, ]

1. Classifying Using SVMs

I Like last week, we will use k-fold cross-validation with 10 folds to 'fine-tune' the model and choose the best cost threshold C

library(e1071)
svm.emails <- tune.svm(y.var ~ ., data = cv.data, kernel = "linear",
                       cost = 2^(2:6), tunecontrol = tune.control(cross = 10))
svm.emails

## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost
##    32
##
## - best performance: 0.009778027

1. Classifying Using SVMs

I The best performance is achieved with a cost parameter set to 32, out of the choices we provided
I We can extract the 'best' model and then look at the confusion matrix for our training data

best.svm <- svm.emails$best.model
table(best.svm$fitted, cv.data[, 1])

Training Data

                    Actual Classification
                        0      1
SVM Prediction   0   2162      0
                 1      6    696

I Note that an SVM will have a training error rate dependent on the cost parameter C 1. Classifying Using SVMs

I We now look at the predictions for the test data to assess the model's performance

svm.preds <- predict(best.svm, test.data)
table(svm.preds, test.data[, 1])

Test Data

                    Actual Classification
                        0      1
SVM Prediction   0   2173     11
                 1     19    661

1. Classifying Using SVMs

I Remember the function we wrote in last week's seminar that gave us the error rate, sensitivity and specificity? Let's re-use it here and compare the SVM and lasso models

model.assessment(svm.preds, test.data[, 1]) # SVM

## Error rate Sensitivity Specificity
##       1.05       98.36       99.13

model.assessment(lasso.preds, test.data[, 1]) # Lasso

## Error rate Sensitivity Specificity
##       3.53       86.76       99.45
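For reference, here is a minimal sketch of what a function like model.assessment() might look like; the seminar version may differ in detail, but the logic is the same (error rate = share misclassified, sensitivity = share of true spam caught, specificity = share of true non-spam kept):

model.assessment <- function(preds, actual) {
  tab <- table(preds, actual)               # rows = predicted, columns = actual
  error       <- (tab["1", "0"] + tab["0", "1"]) / sum(tab)
  sensitivity <- tab["1", "1"] / sum(tab[, "1"])
  specificity <- tab["0", "0"] / sum(tab[, "0"])
  round(100 * c("Error rate"  = error,
                "Sensitivity" = sensitivity,
                "Specificity" = specificity), 2)
}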

I It looks like the SVM approach here is better at catching ‘true’ spam (higher sensitivity) with only very slightly less specificity 1. Classifying Using SVMs

I Similarly to lasso, an SVM identifies the words (= variables) that are most predictive of one class or the other, estimating coefficients according to their importance and setting the rest to zero

svm.coef <- matrix(coef(best.svm))
length(svm.coef[svm.coef != 0, ]) - 1 # Minus Intercept!

## [1] 9992

I Out of an original 13,623 unique words, our optimised SVM uses 9,992 words as variables I This is far from the 215 our lasso model from last week used for a comparable performance. . . 1. Classifying Using SVMs

I We can also look at how many support vectors (= observations that determine the hyperplane) the model uses

best.svm$tot.nSV # in total

## [1] 700

best.svm$nSV # no. of SVs on either side of the hyperplane

## [1] 361 339

I The model has 700 support vectors in total, and 361 on the non-spam side and 339 on the spam side of the hyperplane 1. Classifying Using SVMs

Advantages

I Model performance (i.e. predictive power) is often very high
I Very flexible, especially because of the kernel trick
I Quite intuitive: essentially a computational way of drawing lines between points

Disadvantages

I Computational time (the tune.svm command above took a couple of hours to run to test only 5 different cost parameters)
I When the kernel is not linear, even more parameters need to be tuned (see the sketch below)
I Less feature reduction than a logistic lasso approach, for instance
I Not probabilistic: a default SVM approach does not quantify uncertainty!

1. Classifying Using SVMs
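For instance, a radial (non-linear) kernel would add a gamma parameter to the grid search, multiplying the computation time. A sketch of what that tuning call could look like (not run in the lecture, and the gamma values are illustrative):

# hypothetical radial-kernel tuning: now both cost AND gamma must be tuned
svm.radial <- tune.svm(y.var ~ ., data = cv.data, kernel = "radial",
                       cost = 2^(2:6), gamma = 10^(-3:-1),
                       tunecontrol = tune.control(cross = 10))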

Good News

I This is not part of what you are expected to apply yourselves in either the seminars or assessment I The purpose of this was mainly to compare to a Lasso approach and highlight how any model choice involves trade-offs! Transition: Scaling

Two types of measurement schemes:

1. Classification of documents: involves categorical (often binary) measures
2. Scaling of documents: involves a continuous measure

⇒ Common goal: Assign a text to a particular category, or a particular position on a scale.

Transition: Scaling

Why scale instead of classifying?

I Even though classification can involve scoring the probability of being 1 or 0, ultimately all documents are classed as either 1 or 0 I However, in many situations, documents may fall into a grey zone: neither fully 1 nor fully 0 I We could use a dictionary-based scaling approach, e.g. count of positive minus negative words I This has a similar problem: forces all words in the dictionary to be either one class or another I Most words are used in both classes Transition: Scaling

Perry, P. & Benoit, K. (2017) "Scaling Text with the Class Affinity Model".

"Affinity Model" = Wordscores, essentially

2. Scaling Using Cosine Similarity

I Goal: systematically measure how "similar" two documents are
I The Cosine Similarity approach (and others) begins by conceiving of documents as vectors

doc1 <- c("Text analysis is really quite simple")
doc2 <- c("Doing text analysis in practice is easier than it looks")

As a document-term matrix without stopwords:

doc_id  text  analysis  really  quite  simple  practice  easier  looks
text1      1         1       1      1       1         0       0      0
text2      1         1       0      0       0         1       1      1
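The same document-term matrix can be produced in quanteda from the two toy documents; a sketch following the pre-processing style used elsewhere in this lecture:

library(quanteda)
toy_dfm <- dfm(tokens_remove(tokens_tolower(tokens(corpus(c(doc1, doc2)))),
                             stopwords("en")))
toy_dfm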

Vectors are simply the rows:

I Doc. 1: (1,1,1,1,1,0,0,0)
I Doc. 2: (1,1,0,0,0,1,1,1)

2. Scaling Using Cosine Similarity

I Two vectors are similar to each other if they 'point in the same direction'
I Cosine similarity measures the angle between two vectors
I The smaller the angle, the more similar they are

[Figure: the two document vectors in a two-dimensional space]

2. Scaling Using Cosine Similarity

The Cosine of an Angle

I The cosine of the angle θ is the length of the adjacent side divided by the length of the hypotenuse

I Cosine similarity (for positive vectors) is bounded between 0 and 1:
  I The smaller the angle, the closer cos(θ) is to 1
  I The larger the angle, the closer cos(θ) is to 0
I Thus similar documents have cosine similarity close to 1

2. Scaling Using Cosine Similarity
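For two document vectors, cos(θ) can be computed directly as the dot product divided by the product of the vector lengths. A quick manual check for the two toy vectors above (before turning to quanteda on the next slide):

v1 <- c(1, 1, 1, 1, 1, 0, 0, 0)  # Doc. 1
v2 <- c(1, 1, 0, 0, 0, 1, 1, 1)  # Doc. 2

# cosine similarity = dot product / (length of v1 * length of v2)
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
# 2 / (sqrt(5) * sqrt(5)) = 0.4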

In R

library(quanteda.textstats)
tokens_inaug <- tokens(data_corpus_inaugural, remove_numbers = T, remove_punct = T)
tokens_inaug <- tokens_remove(tokens_inaug, stopwords("en"))
dfm_inaug <- dfm(tokens_inaug, tolower = T)
cos_sim <- as.matrix(textstat_simil(dfm_inaug, method = "cosine"))

I Note that there is no need to work with proportions (or TF-IDF) even when documents are of different lengths: cosine similarity does not depend on document length

2. Scaling Using Cosine Similarity

cos_sim[1:10, 53:58]

                 1997-Clinton 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
1789-Washington          0.30      0.33      0.32       0.32       0.33       0.29
1793-Washington          0.12      0.13      0.15       0.14       0.17       0.19
1797-Adams               0.32      0.33      0.35       0.34       0.36       0.35
1801-Jefferson           0.38      0.35      0.34       0.39       0.38       0.25
1805-Jefferson           0.30      0.34      0.31       0.32       0.34       0.20
1809-Madison             0.20      0.26      0.29       0.27       0.26       0.20
1813-Madison             0.20      0.26      0.25       0.26       0.25       0.22
1817-Monroe              0.34      0.35      0.31       0.36       0.39       0.31
1821-Monroe              0.31      0.31      0.31       0.31       0.34       0.31
1825-Adams               0.30      0.30      0.32       0.29       0.32       0.25

2. Scaling Using Cosine Similarity

I Which inaugural speeches were the most similar to Obama’s in 2009?

library(ggplot2)
obama <- data.frame("sim" = cos_sim[, "2009-Obama"])
obama$pres <- rownames(obama)
obama <- obama[obama$pres != "2009-Obama", ]
ggplot(obama,
       aes(x = sim,
           y = reorder(pres, sim))) +
  geom_point(size = 2) +
  xlab("Cosine Similarity") +
  ylab("") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 17),
        axis.text.x = element_text(size = 15),
        axis.title.x = element_text(size = 23),
        plot.title = element_text(size = 30))

[Figure: inaugural speeches ranked by cosine similarity to Obama's 2009 address; 2013-Obama, 1997-Clinton, 2021-Biden and 1993-Clinton are the most similar, 1793-Washington the least. x-axis: Cosine Similarity (roughly 0.2 to 0.6)]

2. Scaling Using Cosine Similarity

Advantages

I A useful summary of how similar documents are to each other
I Intuitive and easy to understand
I Very useful for authorship detection tasks

Disadvantages

I It's really just a measure of linguistic similarity, not of how similar the speakers/writers are to each other intrinsically
I It's also unipolar: similarity to one single document
I In the social sciences, we tend to want to measure continuous pre-defined latent traits, or binary pre-defined classes
I Typically, it's not enough for the social sciences: we need to define more closely what it means for texts to be 'similar'

3. Scaling Using Wordscores

A Known Example: Scaling the Labour party in the 'New Labour' Era

[Figure: Labour MPs ranked by estimated Wordscores position, from John Hutton at one end of the scale to Eric Clarke at the other; x-axis: Estimated Wordscores Position (−1.0 to 1.0)]

O’Grady, T. (2019) ‘Careerists Versus Coal-Miners: Welfare Reforms and the Substantive Representation of Social Groups in the British Labour Party’, Comparative Political Studies, 52(4), pp. 544–578. 3. Scaling Using Wordscores Other Applications Positions of Environmental Interest Groups vs. EU Commission

Klüver, H. (2009). "Measuring Interest Group Influence Using Quantitative Text Analysis." European Union Politics 10 (4): 535-549.

3. Scaling Using Wordscores

Other Applications: Detecting Lobbying Influence on Tobacco Control Legislation

Costa et al. (2014). "Quantifying the influence of the tobacco industry on EU governance: automated content analysis of the EU Tobacco Products Directive." Tobacco Control 23: 473-478.

3. Scaling Using Wordscores

Other Applications: Factions in the German Bundestag

Bernauer, J. and Bräuninger, T. (2009). "Intra-Party Preference Heterogeneity and Faction Membership in the 15th German Bundestag: A Computational Text Analysis of Parliamentary Speeches." German Politics 18 (3): 385-402.

3. Scaling Using Wordscores

I O'Grady (2019) uses the "wordscores" technique to measure the ideological position of Labour MPs
I Klüver (2009) uses a similar approach to place EU environmental policies on a scale running from pro-environmental-control interest groups to anti-environmental-control interest groups, to assess who "wins" and who "loses"
I Costa et al. (2014) use wordscores to show how an EU policy directive on tobacco products shifted towards the tobacco industry's positions between the initial policy proposal and the final policy output
I Bernauer & Bräuninger (2009) use it to measure and identify differences in the ideological positions of Members of the German Parliament within each party faction

3. Scaling Using Wordscores

I Wordscores is a type of supervised scaling, meaning that we have some documents for which we already know the outcome variable, which we then use to build our model
I More concretely, it begins by dividing the corpus into two sets of documents:
  1. Reference Documents that can uncontroversially be scored in advance
  2. Virgin Documents whose scores are unknown
I Wordscores then positions the virgin documents on a scale according to their similarity to the reference documents (in terms of word counts)

[Diagram: a one-dimensional scale with a Reference Document at each end (Document 1 and Document 2) and the Virgin Documents in between]

I Note we can also have reference documents in the centre (or at any point on this scale)
I Must assign the reference documents a score in advance: can be arbitrary, e.g. (0,1)

3. Scaling Using Wordscores

Caroline Flint on welfare-to-work, December 1st, 1997: "In the past, taxpayers picked up the bill for mass dependency on benefit, persistent unemployment, huge subsidies for low pay and widespread fraud. It was truly a nation on benefit... overall benefit expenditure rose by £40 billion in real terms... the best form of welfare, and the one preferred by the great majority of people, is work... I am proud that New Labour is beginning to prioritise work over welfare and opportunity over waste."

3. Scaling Using Wordscores

3. Scaling Using Wordscores

From Speeches to Policy Positions - Laver, Benoit and Garry (2003)

[Diagram: (1) Training Set: Known 'Extremists' - Caroline Flint (Secretaries of State, 1997-07) as the pro-reform example and Dennis Skinner (Socialist Campaign Group, 1987-94) as the pro-welfare example; (2) Reference Documents built from their speeches; (3) Wordscores estimated for each word - words such as "work", "fraud", "measures", "reform", "independent", "employment", "modern", "radical" receive the highest (pro-reform) scores, while words such as "poor", "cuts", "society", "suffering", "hardship", "unemployed", "desperate", "needy" receive the lowest (pro-welfare) scores; (4) Scores for all Documents]

3. Scaling Using Wordscores

Mathematically

1. Pre-assign a score, $A_r$, to each reference document
2. Calculate the relative frequency of every word $w$ in each reference document $r$: $F_{wr}$
3. Calculate the probability that we are reading $r$, given that we are seeing $w$:
   $$P_{wr} = \frac{F_{wr}}{\sum_r F_{wr}}$$
4. Produce a score for each word:
   $$S_w = \sum_r (P_{wr} \times A_r)$$

5. Use the wordscores to score each virgin document $v$:
   $$S_v = \sum_w (F_{wv} \times S_w)$$

3. Scaling Using Wordscores

Mathematically: 3. The probability of reading r, given w

$$P_{wr} = \frac{F_{wr}}{\sum_r F_{wr}}$$

where $\sum_r F_{wr}$ is the sum of $w$'s relative frequencies across the reference documents.

Example:

I 10,000 words in each of two reference documents
I w is used 10 times in the first, and 30 times in the second
I $F_{wr}$ = 0.001 and 0.003 respectively
I If we encounter this word in a virgin document, there is a 75% (0.003/0.004) probability that we're reading (something close to) the second reference text, and a 25% (0.001/0.004) probability that we're reading the first

3. Scaling Using Wordscores

Mathematically 4. A score for each word w

$$S_w = \sum_r (P_{wr} \times A_r)$$

I The score for w is an average of the pre-assigned reference document scores, weighted by the probability that we are reading each one, given w
I Example: if the first reference document is assigned $A_r = 1$ and the second $A_r = -1$, the word w from the previous slide scores $0.25 \times 1 + 0.75 \times (-1) = -0.5$

3. Scaling Using Wordscores

Mathematically 5. Score the virgin documents

$$S_v = \sum_w (F_{wv} \times S_w)$$

where $F_{wv}$ is the relative frequency of word $w$ in virgin document $v$.

I Documents that use more of the words that are typical of the (-1) reference document will score closer to (-1), and vice-versa
I Note: words that appear in v but not in r are by definition ignored by the algorithm!

3. Scaling Using Wordscores
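Pulling steps 1-5 together, here is a minimal from-scratch sketch on made-up numbers (all frequencies and scores below are purely illustrative, not taken from any real corpus):

# toy relative frequencies F_wr: rows = words, columns = two reference documents
Fwr <- rbind(work  = c(0.003, 0.001),
             poor  = c(0.001, 0.004),
             today = c(0.002, 0.002))
A <- c(-1, 1)                        # pre-assigned reference document scores A_r

Pwr <- Fwr / rowSums(Fwr)            # step 3: P(reading r | seeing w)
Sw  <- Pwr %*% A                     # step 4: wordscores S_w
Sw                                   # work = -0.5, poor = 0.6, today = 0

Fwv <- c(work = 0.002, poor = 0.002, today = 0.004)  # a virgin document's F_wv
sum(Fwv * Sw[names(Fwv), ])          # step 5: virgin document score S_v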

I Importantly, and perhaps counterintuitively, the virgin texts will tend to be measured on a different scale than the reference documents
I Commonly-used words will appear in near-equal proportions across the reference texts
I These words' scores will simply be an average of the pre-assigned reference text scores
I If these words occur a lot in the virgin documents, virgin document scores will be pulled towards this mean
I Therefore, the dispersion of the virgin documents' scores will invariably be much smaller than the dispersion of the reference document scores
I This is not an issue if the reference documents are not of interest themselves and will not be included in subsequent analyses

3. Scaling Using Wordscores

Rescaling

I However, reference documents are often themselves of interest (e.g. Jeremy Corbyn's ideological position in the New Labour era)
I In that case, if the reference documents' scores are to be included in the analysis, the virgin documents can be re-scaled to place them on the same scale
I For a given document, the re-scaled score is:

$$S_v^* = (S_v - \bar{S}_v)\left(\frac{SD_r}{SD_v}\right) + \bar{S}_v$$

I $S_v$ is the original score for the document
I $\bar{S}_v$ is the mean score across virgin documents
I $SD_r$ and $SD_v$ are the standard deviations of the reference documents' and virgin documents' scores, respectively

3. Scaling Using Wordscores
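In code, the rescaling amounts to something like the following sketch (a hypothetical helper, assuming numeric vectors of virgin-document scores and reference-document scores):

# re-scale virgin document scores onto the metric of the reference scores
rescale_lbg <- function(Sv, Sr) {
  (Sv - mean(Sv)) * (sd(Sr) / sd(Sv)) + mean(Sv)
}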

Laplace Smoothing

I Reference documents can often contain words that do not feature at all in the other reference document(s)
I Such words can have a substantial influence on the estimated scale, but if a word is used in only one document, the word itself may just be rare and uninformative about position
I One solution is Laplace smoothing: add 1 to the count of every word
I Example:
  I Reference documents "Taxes are wasteful" (scored 1) and "Taxes are not wasteful" (scored −1)
  I Wordscore for "not" without smoothing: $\frac{1/4}{1/4 + 0/3} \times (-1) = -1$
  I Wordscore for "not" with smoothing: $\frac{1/7}{1/7 + 2/8} \times 1 + \frac{2/8}{1/7 + 2/8} \times (-1) = -0.27$

3. Scaling Using Wordscores
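The arithmetic above can be checked quickly in R:

# counts of "not": 0 of 3 words in reference doc 1, 1 of 4 words in reference doc 2;
# smoothing adds 1 per vocabulary type (4 types), so denominators become 3+4 and 4+4
F_not <- c(doc1 = (0 + 1) / (3 + 4), doc2 = (1 + 1) / (4 + 4))
A     <- c(doc1 = 1, doc2 = -1)          # pre-assigned reference scores
sum(F_not / sum(F_not) * A)              # -0.27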

Uncertainty

I We can produce standard errors, based on the variance of the word scores in a virgin document around the virgin document's mean score

$$SE_v = \sqrt{\frac{\sum_w F_{wv}(S_w - S_v)^2}{n_v}}$$

I Note that the deviation for each word is weighted by its frequency of occurrence Fwv I These can be used for 95% confidence intervals, just like any other standard error 3. Scaling Using Wordscores Another Known Example Using UN General Assembly Speeches to Measure Countries’ Support for Russia vs. the USA

Baturo, A., N. Dasandi and S. Mikhaylov (2017). “Understanding state preferences with text as data: Introducing the UN General Debate corpus.” Research & Politics 4(2): 1-9. 3. Scaling Using Wordscores Wordscores in R

I Let’s try re-doing this for USA vs. China in 2017

# assign reference scores to China and the USA
un_debates$ref <- NA
un_debates$ref[un_debates$country == "CHN"] <- -1
un_debates$ref[un_debates$country == "USA"] <- 1

# load data, turn into a DFM, remove rare words
speechCorpus <- corpus(un_debates$text, docvars = un_debates)
tok_un <- tokens_remove(
  tokens_tolower(
    tokens(speechCorpus, remove_punct = T, remove_numbers = T)),
  stopwords("en"))
dfm_un <- dfm(tok_un, tolower = T)
doc_freq <- docfreq(dfm_un)
dfm_un <- dfm_un[, doc_freq > 1]

3. Scaling Using Wordscores

I When using Laplace smoothing we need to be careful not to simply use the whole document-term matrix
I Why? Because adding 1 to all the words would also add 1 to words that did not appear in either reference document!

dfm_un_sub <- dfm_subset(dfm_un, country %in% c("USA", "CHN"))
dfm_un_sub <- dfm_trim(dfm_un_sub, min_termfreq = 1)

I Let's now estimate wordscores from the reference documents

library(quanteda.textmodels)
mod_ws <- textmodel_wordscores(dfm_un_sub, y = docvars(dfm_un_sub, "ref"), smooth = 1)

I And score the virgin documents, with standard errors

pred_ws <- predict(mod_ws, se.fit = TRUE, newdata = dfm_un)

3. Scaling Using Wordscores
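Since the standard errors can be used for 95% confidence intervals, here is a hedged sketch of how one might assemble them from the prediction object, assuming it returns the fitted positions in $fit and the standard errors in $se.fit:

# approximate 95% confidence intervals around each document's estimated position
ci_ws <- data.frame(country = docvars(dfm_un, "country"),
                    fit     = pred_ws$fit,
                    lower   = pred_ws$fit - 1.96 * pred_ws$se.fit,
                    upper   = pred_ws$fit + 1.96 * pred_ws$se.fit)
head(ci_ws)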

I Use coef() to extract the wordscores (e.g. the "top 10" American words)

coef(mod_ws)[order(coef(mod_ws), decreasing = T)][1:10]

##  citizens    regime   america  american     today    nation
## 0.8456909 0.8456909 0.8255694 0.8134035 0.7994133 0.7994133
## sovereign    strong     human   leaders
## 0.7994133 0.7831552 0.7640292 0.7412029

3. Scaling Using Wordscores

I We can visualise the results with the function textplot_scale1d() from the quanteda.textplots package
I Conveniently, this creates a ggplot object and is therefore customisable!

library(quanteda.textplots)
textplot_scale1d(pred_ws, doclabels = docvars(speechCorpus, "country"))

3. Scaling Using Wordscores

[Figure: estimated positions of all countries' 2017 UN General Debate speeches, ordered from the USA at one end of the scale to China (CHN) at the other; x-axis: Document Position (roughly −0.4 to 0.2)]

3. Scaling Using Wordscores

I For a few selected countries

[Figure: estimated document positions for a selection of countries, from ISR at the top to LAO at the bottom; x-axis: Document position (roughly −0.2 to 0.1)]

3. Scaling Using Wordscores

Potential Issues³

Wordscores relies on making a series of strong assumptions:

1. Wordscores assumes that we "know" the most extreme (reference) texts, which is a very strong assumption to make
2. The scale itself is not chosen inductively, but assumed in advance
  I There could be other scales that better explain variation across the documents, but Wordscores can't tell us about them
3. The method also assumes that all words are equally informative about a document's position
  I Words used commonly and equally across documents will be scored as "centrist" even if they have no ideological meaning
  I Must be very careful that the method isn't simply picking up stylistic differences: works poorly on documents that mainly differ in terms of style

³Lowe, W. (2008). "Understanding Wordscores." Political Analysis 16: 356-371.

3. Scaling Using Wordscores

I When there is too much overlap between reference documents, many commonly and equally used (and often uninformative) words will be given a similar, ‘centrist’ score I Look for instance at the wordscores from our text-model with Laplace smoothing and note the long horizontal line near the middle
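A plot like the ones below can be drawn directly from the estimated wordscores; a sketch (the exact plotting choices behind the lecture's figures are not shown, so sorting the scores is one assumption here):

# plot each word's estimated score, sorted, against an arbitrary word ID
ws <- sort(coef(mod_ws))
plot(seq_along(ws), ws, pch = 16, cex = 0.4,
     xlab = "Word ID", ylab = "Word Score")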

With Smoothing

[Figure: estimated wordscores (y-axis: Word Score, −1.0 to 0.5) plotted against Word ID (0 to 1500); with smoothing, a long horizontal run of words receives a similar score near the middle of the scale]

3. Scaling Using Wordscores

I When there is little overlap between the reference documents, many words are given the exact same score (of the one ref document they appear in) I This can be seen when calculating wordscores for the UN speeches without Laplace smoothing

Without Smoothing

[Figure: estimated wordscores (y-axis: Word Score, −1.0 to 1.0) plotted against Word ID (0 to 1500); without smoothing, many words sit at exactly −1 or 1, the score of the single reference document in which they appear]

3. Scaling Using Wordscores

I This means we need to choose large, linguistically diverse reference documents with mutual overlap between words I It’s better to use large, evenly-spaced documents (i.e. that provide substantive additional information) I We should include more than two reference documents where possible 3. Scaling Using Wordscores Choosing Reference Documents

I Remember from last week: the importance of the labelling stage for supervised model-based classification with pre-labelled documents
I Similarly, Wordscores crucially depends on the choice of reference documents:
  1. Need to be the same document type, using similar language, as the virgin documents (Note: this makes it difficult to use Wordscores over long periods of time)
  2. Need (at least) to be genuinely extreme, spanning the ends of the scale
  3. Best if they are large and linguistically diverse: they can be an amalgamation of documents
I Bottom line: best to choose reference documents that are quintessential, uncontroversial examples of extreme/centrist language

3. Scaling Using Wordscores

Interpreting and Validating Wordscores

I Nothing guarantees that the output of Wordscores will be 'correct' or even substantively meaningful
I Careful validation of results is vital:
  1. Do the words associated with each end of the scale make sense?
  2. Do the placements on the scale have high face validity?
  3. Do the placements on the scale correlate well with alternative measures of position (convergent validation) or with other concepts that should be causally related (construct validation)?

3. Scaling Using Wordscores

Advantages

I Intuitive and easy to understand
I Rather than just focusing on how similar two documents are (like Cosine Similarity), it creates a scale that gives meaning to differences/similarities
I Very useful for analysing political discourse, where relative word use most likely represents ideological differences

Disadvantages

I Requires strong (and often debatable) assumptions
I Some practical concerns about the measure itself: the different scale of virgin vs reference documents, and the fact that many words end up with a similar ('centrist') score
I The selection and scoring of reference documents is not easy and should be carefully justified

3. Scaling Using Wordscores

Making Wordscores Useful Wordscores is an easy and intuitive method for scaling, but. . .

1. Works best for documents that are pre-selected/processed to vary along a single dimension
2. Works best when documents contain mostly "ideological" language: relative word use must imply a spatial position
3. Can perform poorly on documents spanning long time periods, when language itself differs
4. Selection of reference documents is tricky but absolutely crucial
5. Requires large reference documents with substantial linguistic overlap
6. Ideally requires large documents
7. Must always be carefully validated

4. Wrap-Up: Measuring Stuff from Text

Two types of measurement schemes:

1. Classification of documents: involves categorical (often binary) measures
2. Scaling of documents: involves a continuous measure

⇒ Common goal: Assign a text to a particular category, or a particular position on a scale.

4. Wrap-Up: Measuring Stuff from Text

There are a number of possible methods that can be used to achieve this, which can, in the first instance, be split into:⁴

1. Supervised learning: involves training the model on data where the outcome is already known
2. Unsupervised learning: involves letting the model tell us what groups exist in the data, since the outcome is not yet known

⁴You will find the same terminology in other applications of machine learning.

4. Wrap-Up: Measuring Stuff from Text

                  Supervised methods       Unsupervised methods
Classification    Lasso regression         K-means clustering
                  Dictionary methods
                  SVM

Scaling           Wordscores               Cosine Similarity

Some (non-exhaustive) examples of QTA/machine learning methods 4. Wrap-Up: Measuring Stuff from Text

I As the focus is on measurement inference, the choice of the specific method depends on how useful the resulting classifier/measure is once applied to new data
I In other words: how well the model deals with the bias-variance trade-off
I However, the whole point of these approaches is that we often do not know the "true" values to compare the predicted values to - that's why we try to measure them!
I For supervised learning we can use validation techniques on test data to get a sense of model performance
I This is much more difficult for unsupervised methods! (You will learn more about this next year)

4. Wrap-Up: Measuring Stuff from Text

I There are many models that can help us learn about the world from text
I There is no single correct approach, and all have their advantages and disadvantages
I Good text analysis does not start and stop with choosing the best-"performing" method, but with:
  a. Accepting that all methods will yield approximations, and are thereby subject to measurement error
  b. Accepting that there is no guarantee that the model results make sense! Careful validation and checking that "things make sense" at all stages is crucial
  c. Accepting that no amount of automation can make up for learning what is in your texts by reading them!

Summary

Today we:

I Learned (and immediately decided to forget) about Support Vector Machines for text classification I Moved on to scaling documents, first with Cosine Similarity, then with Wordscores I Saw that, as always, the choice of the right technique is difficult to make and often depends on what we want to know!

Next time:

I Last lecture for this Module! I Introduction to automated data collection and web scraping I Spoiler Alert: it’s often much less fun in practice than it sounds. . . Thanks for watching!