Week 10: Classifying and Scaling Documents II POLS0010 Data Analysis

Julia de Romémont

Term 2 2021, UCL Department of Political Science

This Lecture

1. Classifying using Support Vector Machines (SVM)
2. Scaling using Cosine Similarity
3. Scaling using Wordscores
4. Wrap-Up: Measuring Stuff from Text

Remember

Two types of measurement schemes:

1. Classification of documents: involves categorical (often binary) measures
2. Scaling of documents: involves a continuous measure

⇒ Common goal: Assign a text to a particular category, or a particular position on a scale.

1. Classifying Using SVMs

I Last week, we first applied dictionary methods to classify a text as either negative or positive
I Another, often more effective, method we discussed is lasso regression, which is a type of regularisation technique
I Other classification techniques include, for example:
  I Ridge regression
  I Random Forest
  I k-nearest neighbour
  I etc.
I Another alternative method is called the Support Vector Machine (SVM)

1. Classifying Using SVMs

The Basic Idea of SVMs

I Draw a line between observations belonging to two categories, mapped as points in a p-dimensional¹ space, that maximises the width of the gap (margin) between the two categories
I This line is called a separating hyperplane
I New observations are classified according to which side of the hyperplane they fall on

¹Where p is the number of input ('independent') variables.

1. Classifying Using SVMs

The Separating Hyperplane

Suppose we have n observations of two variables $x_1$ and $x_2$, which fall into two classes $y \in \{-1, 1\}$. Then the separating hyperplane is defined by: $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$

I Therefore all observations that have $y = -1$ will have $\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0$ and thereby lie on one side of the hyperplane
I And all observations with $y = 1$ will have $\beta_0 + \beta_1 x_1 + \beta_2 x_2 > 0$ and lie on the other side

1. Classifying Using SVMs
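As a minimal sketch of this decision rule (the coefficients below are made up for illustration, not estimated from any data), a new observation is classified by the sign of the linear combination:

# hypothetical hyperplane coefficients, for illustration only
b0 <- -1; b1 <- 2; b2 <- 0.5

# classify a new observation by which side of the hyperplane it falls on
classify <- function(x1, x2) ifelse(b0 + b1 * x1 + b2 * x2 > 0, 1, -1)

classify(1, 0.5)  #  1.25 > 0, so this observation is assigned y = 1
classify(0, 0.2)  # -0.90 < 0, so this observation is assigned y = -1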

I We want to find the hyperplane that maximises the distance - the margin - between the two categories
I This means finding the line for which the closest observations are as far away from it as possible, compared to other candidate lines

[Figure from James et al. (2017: 342)]

1. Classifying Using SVMs

So what are support vectors?

I We want to draw a separating line between observations I Intuitively, observations far away from the boundary will not have any influence on the calculation of the hyperplane I Conversely, it is those observations closest to the boundary and within or on the margin on either side that will determine the location and therefore the equation of the hyperplane I These observations are the support vectors, as in they support the hyperplane 1. Classifying Using SVMs Special cases of SVMs

1. This is simple enough when we have perfectly linearly separable observations
  I This means that there exists a hyperplane where no observation is on the wrong side
  I In that case we use the maximal margin classifier
2. However, it is possible that no such separating hyperplane exists
  I The solution then is to find a hyperplane that almost separates the cases
  I This involves defining a threshold number of observations that can be on the wrong side of the hyperplane
  I Similarly to the λ parameter in lasso regression, the key question is how to choose the optimal threshold C (for which we can use k-fold cross-validation!)
  I This method is called the support vector classifier

1. Classifying Using SVMs

⇒ The most general case is when observations are neither perfectly nor linearly separable into classes, which is when we speak of Support Vector Machines²

I SVMs accommodate non-linear decision boundaries by adding another dimension which represents a measure of similarity of two observations (this is called the kernel trick)
I With this new dimension, it is then possible to find a linear hyperplane
I In the context of text analysis, using a linear kernel (i.e. a support vector classifier) is often sufficient
I The high number of features (words) makes it likely that there is at least one p-dimensional space where the documents are linearly separable

²Note that SVMs have quite a few similarities with logistic regression with lasso (or ridge) penalties and often perform similarly.

1. Classifying Using SVMs

Back to the spam email filter

emails_corpus <- corpus(emails$text)
tok_emails <- tokens_remove(
  tokens_tolower(tokens(emails_corpus, remove_numbers = T, remove_punct = T)),
  stopwords("en"))
dfm_emails <- dfm_weight(dfm(tok_emails), scheme = "prop")
doc_freq <- docfreq(dfm_emails)
dfm_emails <- dfm_emails[, doc_freq > 2]
dfm_emails <- dfm_remove(dfm_emails, "subject")

1. Classifying Using SVMs

I Let's prepare the data ahead of the modelling

df_emails <- convert(dfm_emails, to = "data.frame")
df_emails$doc_id <- NULL
df_emails <- cbind("y.var" = factor(emails$spam), df_emails)

I Now create some test and training data

set.seed(123)
cv.rows <- sample(nrow(df_emails), (nrow(df_emails) / 2))
cv.data <- df_emails[cv.rows, ]
test.data <- df_emails[-cv.rows, ]

1. Classifying Using SVMs

I Like last week, we will use k-fold cross-validation with 10 folds to 'fine-tune' the model and choose the best cost threshold C

library(e1071)
svm.emails <- tune.svm(y.var ~ ., data = cv.data, kernel = "linear",
                       cost = 2^(2:6), tunecontrol = tune.control(cross = 10))
svm.emails

## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost
##    32
##
## - best performance: 0.009778027

1. Classifying Using SVMs

I The best performance is achieved with a cost parameter set to 32, out of the choices we provided
I We can extract the 'best' model and then look at the confusion matrix for our training data

best.svm <- svm.emails$best.model
table(best.svm$fitted, cv.data[, 1])

Training Data

                    Actual Classification
                        0      1
SVM Prediction   0   2162      0
                 1      6    696

I Note that an SVM will have a training error rate dependent on the cost parameter C 1. Classifying Using SVMs

I We now look at the predictions for the test data to assess the model's performance

svm.preds <- predict(best.svm, test.data)
table(svm.preds, test.data[, 1])

Test Data

                    Actual Classification
                        0      1
SVM Prediction   0   2173     11
                 1     19    661

1. Classifying Using SVMs

I Remember the function we wrote in last week's seminar that gave us the error rate, sensitivity and specificity? Let's re-use it here and compare the SVM and lasso models

model.assessment(svm.preds, test.data[, 1]) # SVM

## Error rate Sensitivity Specificity
##       1.05       98.36       99.13

model.assessment(lasso.preds, test.data[, 1]) # Lasso

## Error rate Sensitivity Specificity
##       3.53       86.76       99.45
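For reference, here is a minimal sketch of what a function like model.assessment() might look like; the seminar version may differ in detail, but the logic is the same (error rate = share misclassified, sensitivity = share of true spam caught, specificity = share of true non-spam kept):

model.assessment <- function(preds, actual) {
  tab <- table(preds, actual)               # rows = predicted, columns = actual
  error       <- (tab["1", "0"] + tab["0", "1"]) / sum(tab)
  sensitivity <- tab["1", "1"] / sum(tab[, "1"])
  specificity <- tab["0", "0"] / sum(tab[, "0"])
  round(100 * c("Error rate"  = error,
                "Sensitivity" = sensitivity,
                "Specificity" = specificity), 2)
}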

I It looks like the SVM approach here is better at catching ‘true’ spam (higher sensitivity) with only very slightly less specificity 1. Classifying Using SVMs

I Similarly to lasso, an SVM identifies the words (= variables) that are most predictive of one class or the other, estimating coefficients according to their importance and setting the rest to zero

svm.coef <- matrix(coef(best.svm))
length(svm.coef[svm.coef != 0, ]) - 1 # Minus Intercept!

## [1] 9992

I Out of an original 13,623 unique words, our optimised SVM uses 9,992 words as variables I This is far from the 215 our lasso model from last week used for a comparable performance. . . 1. Classifying Using SVMs

I We can also look at how many support vectors (= observations that determine the hyperplane) the model uses

best.svm$tot.nSV # in total

## [1] 700

best.svm$nSV # no. of SVs on either side of the hyperplane

## [1] 361 339

I The model has 700 support vectors in total, and 361 on the non-spam side and 339 on the spam side of the hyperplane 1. Classifying Using SVMs

Advantages

I Model performance (i.e. predictive power) is often very high
I Very flexible, especially because of the kernel trick
I Quite intuitive: essentially a computational way of drawing lines between points

Disadvantages

I Computational time (the tune.svm command above took a couple of hours to run to test only 5 different cost parameters)
I When the kernel is not linear, even more parameters need to be tuned (see the sketch below)
I Less feature reduction than a logistic lasso approach, for instance
I Not probabilistic: a default SVM approach does not quantify uncertainty!

1. Classifying Using SVMs
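For instance, a radial (non-linear) kernel would add a gamma parameter to the grid search, multiplying the computation time. A sketch of what that tuning call could look like (not run in the lecture, and the gamma values are illustrative):

# hypothetical radial-kernel tuning: now both cost AND gamma must be tuned
svm.radial <- tune.svm(y.var ~ ., data = cv.data, kernel = "radial",
                       cost = 2^(2:6), gamma = 10^(-3:-1),
                       tunecontrol = tune.control(cross = 10))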

Good News

I This is not part of what you are expected to apply yourselves in either the seminars or assessment I The purpose of this was mainly to compare to a Lasso approach and highlight how any model choice involves trade-offs! Transition: Scaling

Two types of measurement schemes:

1. Classification of documents: involves categorical (often binary) measures
2. Scaling of documents: involves a continuous measure

⇒ Common goal: Assign a text to a particular category, or a particular position on a scale.

Transition: Scaling

Why scale instead of classifying?

I Even though classification can involve scoring the probability of being 1 or 0, ultimately all documents are classed as either 1 or 0 I However, in many situations, documents may fall into a grey zone: neither fully 1 nor fully 0 I We could use a dictionary-based scaling approach, e.g. count of positive minus negative words I This has a similar problem: forces all words in the dictionary to be either one class or another I Most words are used in both classes Transition: Scaling

Perry, P. & Benoit, K. (2017) "Scaling Text with the Class Affinity Model".

"Affinity Model" = Wordscores, essentially

2. Scaling Using Cosine Similarity

I Goal: systematically measure how "similar" two documents are
I The Cosine Similarity approach (and others) begins by conceiving of documents as vectors

doc1 <- c("Text analysis is really quite simple")
doc2 <- c("Doing text analysis in practice is easier than it looks")

As a document-term matrix without stopwords:

doc_id  text  analysis  really  quite  simple  practice  easier  looks
text1      1         1       1      1       1         0       0      0
text2      1         1       0      0       0         1       1      1
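The same document-term matrix can be produced in quanteda from the two toy documents; a sketch following the pre-processing style used elsewhere in this lecture:

library(quanteda)
toy_dfm <- dfm(tokens_remove(tokens_tolower(tokens(corpus(c(doc1, doc2)))),
                             stopwords("en")))
toy_dfm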

Vectors are simply the rows:

I Doc. 1: (1,1,1,1,1,0,0,0)
I Doc. 2: (1,1,0,0,0,1,1,1)

2. Scaling Using Cosine Similarity

I Two vectors are similar to each other if they 'point in the same direction'
I Cosine similarity measures the angle between two vectors
I The smaller the angle, the more similar they are

[Figure: the two document vectors in a two-dimensional space]

2. Scaling Using Cosine Similarity

The Cosine of an Angle

I The cosine of the angle θ is the length of the adjacent side divided by the length of the hypotenuse

I Cosine similarity (for positive vectors) is bounded between 0 and 1:
  I The smaller the angle, the closer cos(θ) is to 1
  I The larger the angle, the closer cos(θ) is to 0
I Thus similar documents have cosine similarity close to 1

2. Scaling Using Cosine Similarity
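For two document vectors, cos(θ) can be computed directly as the dot product divided by the product of the vector lengths. A quick manual check for the two toy vectors above (before turning to quanteda on the next slide):

v1 <- c(1, 1, 1, 1, 1, 0, 0, 0)  # Doc. 1
v2 <- c(1, 1, 0, 0, 0, 1, 1, 1)  # Doc. 2

# cosine similarity = dot product / (length of v1 * length of v2)
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
# 2 / (sqrt(5) * sqrt(5)) = 0.4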

In R

library(quanteda.textstats)
tokens_inaug <- tokens(data_corpus_inaugural, remove_numbers = T, remove_punct = T)
tokens_inaug <- tokens_remove(tokens_inaug, stopwords("en"))
dfm_inaug <- dfm(tokens_inaug, tolower = T)
cos_sim <- as.matrix(textstat_simil(dfm_inaug, method = "cosine"))

I Note that there is no need to work with proportions (or TF-IDF) even when documents are of different lengths: cosine similarity does not depend on document length

2. Scaling Using Cosine Similarity

cos_sim[1:10, 53:58]

                 1997-Clinton 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
1789-Washington          0.30      0.33      0.32       0.32       0.33       0.29
1793-Washington          0.12      0.13      0.15       0.14       0.17       0.19
1797-Adams               0.32      0.33      0.35       0.34       0.36       0.35
1801-Jefferson           0.38      0.35      0.34       0.39       0.38       0.25
1805-Jefferson           0.30      0.34      0.31       0.32       0.34       0.20
1809-Madison             0.20      0.26      0.29       0.27       0.26       0.20
1813-Madison             0.20      0.26      0.25       0.26       0.25       0.22
1817-Monroe              0.34      0.35      0.31       0.36       0.39       0.31
1821-Monroe              0.31      0.31      0.31       0.31       0.34       0.31
1825-Adams               0.30      0.30      0.32       0.29       0.32       0.25

2. Scaling Using Cosine Similarity

I Which inaugural speeches were the most similar to Obama’s in 2009?

library(ggplot2)
obama <- data.frame("sim" = cos_sim[, "2009-Obama"])
obama$pres <- rownames(obama)
obama <- obama[obama$pres != "2009-Obama", ]
ggplot(obama,
       aes(x = sim,
           y = reorder(pres, sim))) +
  geom_point(size = 2) +
  xlab("Cosine Similarity") +
  ylab("") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 17),
        axis.text.x = element_text(size = 15),
        axis.title.x = element_text(size = 23),
        plot.title = element_text(size = 30))

[Figure: inaugural speeches ranked by cosine similarity to Obama's 2009 address; 2013-Obama, 1997-Clinton, 2021-Biden and 1993-Clinton are the most similar, 1793-Washington the least. x-axis: Cosine Similarity (roughly 0.2 to 0.6)]

2. Scaling Using Cosine Similarity

Advantages

I A useful summary of how similar documents are to each other
I Intuitive and easy to understand
I Very useful for authorship detection tasks

Disadvantages

I It's really just a measure of linguistic similarity, not of how similar the speakers/writers are to each other intrinsically
I It's also unipolar: similarity to one single document
I In the social sciences, we tend to want to measure continuous pre-defined latent traits, or binary pre-defined classes
I Typically, it's not enough for the social sciences: we need to define more closely what it means for texts to be 'similar'

3. Scaling Using Wordscores

A Known Example: Scaling the Labour party in the 'New Labour' Era

[Figure: Labour MPs ranked by estimated Wordscores position, from John Hutton at one end of the scale to Eric Clarke at the other; x-axis: Estimated Wordscores Position (−1.0 to 1.0)]

O’Grady, T. (2019) ‘Careerists Versus Coal-Miners: Welfare Reforms and the Substantive Representation of Social Groups in the British Labour Party’, Comparative Political Studies, 52(4), pp. 544–578. 3. Scaling Using Wordscores Other Applications Positions of Environmental Interest Groups vs. EU Commission

Klüver, H. (2009). "Measuring Interest Group Influence Using Quantitative Text Analysis." European Union Politics 10 (4): 535-549.

3. Scaling Using Wordscores

Other Applications: Detecting Lobbying Influence on Tobacco Control Legislation

Costa et al. (2014). "Quantifying the influence of the tobacco industry on EU governance: automated content analysis of the EU Tobacco Products Directive." Tobacco Control 23: 473-478.

3. Scaling Using Wordscores

Other Applications: Factions in the German Bundestag

Bernauer, J. and Bräuninger, T. (2009). "Intra-Party Preference Heterogeneity and Faction Membership in the 15th German Bundestag: A Computational Text Analysis of Parliamentary Speeches." German Politics 18 (3): 385-402.

3. Scaling Using Wordscores

I O'Grady (2019) uses the "wordscores" technique to measure the ideological position of Labour MPs
I Klüver (2009) uses a similar approach to place EU environmental policies on a scale running from pro-environmental-control interest groups to anti-environmental-control interest groups, to assess who "wins" and who "loses"
I Costa et al. (2014) use wordscores to show how an EU policy directive on tobacco products shifted towards the tobacco industry's positions between the initial policy proposal and the final policy output
I Bernauer & Bräuninger (2009) use it to measure and identify differences in the ideological positions of Members of the German Parliament within each party faction

3. Scaling Using Wordscores

I Wordscores is a type of supervised scaling, meaning that we have some documents for which we already know the outcome variable, which we then use to build our model
I More concretely, it begins by dividing the corpus into two sets of documents:
  1. Reference Documents that can uncontroversially be scored in advance
  2. Virgin Documents whose scores are unknown
I Wordscores then positions the virgin documents on a scale according to their similarity to the reference documents (in terms of word counts)

[Diagram: a one-dimensional scale with a Reference Document at each end (Document 1 and Document 2) and the Virgin Documents in between]

I Note we can also have reference documents in the centre (or at any point on this scale)
I Must assign the reference documents a score in advance: can be arbitrary, e.g. (0,1)

3. Scaling Using Wordscores

Caroline Flint on welfare-to-work, December 1st, 1997: "In the past, taxpayers picked up the bill for mass dependency on benefit, persistent unemployment, huge subsidies for low pay and widespread fraud. It was truly a nation on benefit... overall benefit expenditure rose by £40 billion in real terms... the best form of welfare, and the one preferred by the great majority of people, is work... I am proud that New Labour is beginning to prioritise work over welfare and opportunity over waste."

3. Scaling Using Wordscores

3. Scaling Using Wordscores

From Speeches to Policy Positions - Laver, Benoit and Garry (2003)

[Diagram: (1) Training Set: Known 'Extremists' - Caroline Flint (Secretaries of State, 1997-07) as the pro-reform example and Dennis Skinner (Socialist Campaign Group, 1987-94) as the pro-welfare example; (2) Reference Documents built from their speeches; (3) Wordscores estimated for each word - words such as "work", "fraud", "measures", "reform", "independent", "employment", "modern", "radical" receive the highest (pro-reform) scores, while words such as "poor", "cuts", "society", "suffering", "hardship", "unemployed", "desperate", "needy" receive the lowest (pro-welfare) scores; (4) Scores for all Documents]

3. Scaling Using Wordscores

Mathematically

1. Pre-assign a score, $A_r$, to each reference document
2. Calculate the relative frequency of every word $w$ in each reference document $r$: $F_{wr}$
3. Calculate the probability that we are reading $r$, given that we are seeing $w$:
   $$P_{wr} = \frac{F_{wr}}{\sum_r F_{wr}}$$
4. Produce a score for each word:
   $$S_w = \sum_r (P_{wr} \times A_r)$$

5. Use the wordscores to score each virgin document $v$:
   $$S_v = \sum_w (F_{wv} \times S_w)$$

3. Scaling Using Wordscores

Mathematically: 3. The probability of reading r, given w

$$P_{wr} = \frac{F_{wr}}{\sum_r F_{wr}}$$

where $\sum_r F_{wr}$ is the sum of $w$'s relative frequencies across the reference documents.

Example:

I 10,000 words in each of two reference documents
I w is used 10 times in the first, and 30 times in the second
I $F_{wr}$ = 0.001 and 0.003 respectively
I If we encounter this word in a virgin document, there is a 75% (0.003/0.004) probability that we're reading (something close to) the second reference text, and a 25% (0.001/0.004) probability that we're reading the first

3. Scaling Using Wordscores

Mathematically 4. A score for each word w

$$S_w = \sum_r (P_{wr} \times A_r)$$

I The score for w is an average of the pre-assigned reference document scores, weighted by the probability that we are reading each one, given w
I Example: if the first reference document is assigned $A_r = 1$ and the second $A_r = -1$, the word w from the previous slide scores $0.25 \times 1 + 0.75 \times (-1) = -0.5$

3. Scaling Using Wordscores

Mathematically 5. Score the virgin documents

$$S_v = \sum_w (F_{wv} \times S_w)$$

where $F_{wv}$ is the relative frequency of word $w$ in virgin document $v$.

I Documents that use more of the words that are typical of the (-1) reference document will score closer to (-1), and vice-versa
I Note: words that appear in v but not in r are by definition ignored by the algorithm!

3. Scaling Using Wordscores
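Pulling steps 1-5 together, here is a minimal from-scratch sketch on made-up numbers (all frequencies and scores below are purely illustrative, not taken from any real corpus):

# toy relative frequencies F_wr: rows = words, columns = two reference documents
Fwr <- rbind(work  = c(0.003, 0.001),
             poor  = c(0.001, 0.004),
             today = c(0.002, 0.002))
A <- c(-1, 1)                        # pre-assigned reference document scores A_r

Pwr <- Fwr / rowSums(Fwr)            # step 3: P(reading r | seeing w)
Sw  <- Pwr %*% A                     # step 4: wordscores S_w
Sw                                   # work = -0.5, poor = 0.6, today = 0

Fwv <- c(work = 0.002, poor = 0.002, today = 0.004)  # a virgin document's F_wv
sum(Fwv * Sw[names(Fwv), ])          # step 5: virgin document score S_v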

I Importantly, and perhaps counterintuitively, the virgin texts will tend to be measured on a different scale than the reference documents
I Commonly-used words will appear in near-equal proportions across the reference texts
I These words' scores will simply be an average of the pre-assigned reference text scores
I If these words occur a lot in the virgin documents, virgin document scores will be pulled towards this mean
I Therefore, the dispersion of the virgin documents' scores will invariably be much smaller than the dispersion of the reference document scores
I This is not an issue if the reference documents are not of interest themselves and will not be included in subsequent analyses

3. Scaling Using Wordscores

Rescaling

I However, reference documents are often themselves of interest (e.g. Jeremy Corbyn's ideological position in the New Labour era)
I In that case, if the reference documents' scores are to be included in the analysis, the virgin documents can be re-scaled to place them on the same scale
I For a given document, the re-scaled score is:

$$S_v^* = (S_v - \bar{S}_v)\left(\frac{SD_r}{SD_v}\right) + \bar{S}_v$$

I $S_v$ is the original score for the document
I $\bar{S}_v$ is the mean score across virgin documents
I $SD_r$ and $SD_v$ are the standard deviations of the reference documents' and virgin documents' scores, respectively

3. Scaling Using Wordscores
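In code, the rescaling amounts to something like the following sketch (a hypothetical helper, assuming numeric vectors of virgin-document scores and reference-document scores):

# re-scale virgin document scores onto the metric of the reference scores
rescale_lbg <- function(Sv, Sr) {
  (Sv - mean(Sv)) * (sd(Sr) / sd(Sv)) + mean(Sv)
}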

Laplace Smoothing

I Reference documents can often contain words that do not feature at all in the other reference document(s)
I Such words can have a substantial influence on the estimated scale, but if a word is used in only one document, the word itself may just be rare and uninformative about position
I One solution is Laplace smoothing: add 1 to the count of every word
I Example:
  I Reference documents "Taxes are wasteful" (scored 1) and "Taxes are not wasteful" (scored −1)
  I Wordscore for "not" without smoothing: $\frac{1/4}{1/4 + 0/3} \times (-1) = -1$
  I Wordscore for "not" with smoothing: $\frac{1/7}{1/7 + 2/8} \times 1 + \frac{2/8}{1/7 + 2/8} \times (-1) = -0.27$

3. Scaling Using Wordscores
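The arithmetic above can be checked quickly in R:

# counts of "not": 0 of 3 words in reference doc 1, 1 of 4 words in reference doc 2;
# smoothing adds 1 per vocabulary type (4 types), so denominators become 3+4 and 4+4
F_not <- c(doc1 = (0 + 1) / (3 + 4), doc2 = (1 + 1) / (4 + 4))
A     <- c(doc1 = 1, doc2 = -1)          # pre-assigned reference scores
sum(F_not / sum(F_not) * A)              # -0.27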

Uncertainty

I We can produce standard errors, based on the variance of the word scores in a virgin document around the virgin document's mean score

$$SE_v = \sqrt{\frac{\sum_w F_{wv}(S_w - S_v)^2}{n_v}}$$

I Note that the deviation for each word is weighted by its frequency of occurrence Fwv I These can be used for 95% confidence intervals, just like any other standard error 3. Scaling Using Wordscores Another Known Example Using UN General Assembly Speeches to Measure Countries’ Support for Russia vs. the USA

Baturo, A., N. Dasandi and S. Mikhaylov (2017). “Understanding state preferences with text as data: Introducing the UN General Debate corpus.” Research & Politics 4(2): 1-9. 3. Scaling Using Wordscores Wordscores in R

I Let’s try re-doing this for USA vs. China in 2017

# assign reference scores to China and the USA
un_debates$ref <- NA
un_debates$ref[un_debates$country == "CHN"] <- -1
un_debates$ref[un_debates$country == "USA"] <- 1

# load data, turn into a DFM, remove rare words
speechCorpus <- corpus(un_debates$text, docvars = un_debates)
tok_un <- tokens_remove(
  tokens_tolower(
    tokens(speechCorpus, remove_punct = T, remove_numbers = T)),
  stopwords("en"))
dfm_un <- dfm(tok_un, tolower = T)
doc_freq <- docfreq(dfm_un)
dfm_un <- dfm_un[, doc_freq > 1]

3. Scaling Using Wordscores

I When using Laplace smoothing we need to be careful not to simply use the whole document-term matrix
I Why? Because adding 1 to all the words would also add 1 to words that did not appear in either reference document!

dfm_un_sub <- dfm_subset(dfm_un, country %in% c("USA", "CHN"))
dfm_un_sub <- dfm_trim(dfm_un_sub, min_termfreq = 1)

I Let's now estimate wordscores from the reference documents

library(quanteda.textmodels)
mod_ws <- textmodel_wordscores(dfm_un_sub, y = docvars(dfm_un_sub, "ref"), smooth = 1)

I And score the virgin documents, with standard errors

pred_ws <- predict(mod_ws, se.fit = TRUE, newdata = dfm_un)

3. Scaling Using Wordscores
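Since the standard errors can be used for 95% confidence intervals, here is a hedged sketch of how one might assemble them from the prediction object, assuming it returns the fitted positions in $fit and the standard errors in $se.fit:

# approximate 95% confidence intervals around each document's estimated position
ci_ws <- data.frame(country = docvars(dfm_un, "country"),
                    fit     = pred_ws$fit,
                    lower   = pred_ws$fit - 1.96 * pred_ws$se.fit,
                    upper   = pred_ws$fit + 1.96 * pred_ws$se.fit)
head(ci_ws)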

I Use coef() to extract the wordscores (e.g. the "top 10" American words)

coef(mod_ws)[order(coef(mod_ws), decreasing = T)][1:10]

##  citizens    regime   america  american     today    nation
## 0.8456909 0.8456909 0.8255694 0.8134035 0.7994133 0.7994133
## sovereign    strong     human   leaders
## 0.7994133 0.7831552 0.7640292 0.7412029

3. Scaling Using Wordscores

I We can visualise the results with the function textplot_scale1d() from the quanteda.textplots package
I Conveniently, this creates a ggplot object and is therefore customisable!

library(quanteda.textplots)
textplot_scale1d(pred_ws, doclabels = docvars(speechCorpus, "country"))

3. Scaling Using Wordscores

[Figure: estimated positions of all countries' 2017 UN General Debate speeches, ordered from the USA at one end of the scale to China (CHN) at the other; x-axis: Document Position (roughly −0.4 to 0.2)]

3. Scaling Using Wordscores

I For a few selected countries

[Figure: estimated document positions for a selection of countries, from ISR at the top to LAO at the bottom; x-axis: Document position (roughly −0.2 to 0.1)]

3. Scaling Using Wordscores

Potential Issues³

Wordscores relies on making a series of strong assumptions:

1. Wordscores assumes that we "know" the most extreme (reference) texts, which is a very strong assumption to make
2. The scale itself is not chosen inductively, but assumed in advance
  I There could be other scales that better explain variation across the documents, but Wordscores can't tell us about them
3. The method also assumes that all words are equally informative about a document's position
  I Words used commonly and equally across documents will be scored as "centrist" even if they have no ideological meaning
  I Must be very careful that the method isn't simply picking up stylistic differences: works poorly on documents that mainly differ in terms of style

³Lowe, W. (2008). "Understanding Wordscores." Political Analysis 16: 356-371.

3. Scaling Using Wordscores

I When there is too much overlap between reference documents, many commonly and equally used (and often uninformative) words will be given a similar, ‘centrist’ score I Look for instance at the wordscores from our text-model with Laplace smoothing and note the long horizontal line near the middle
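A plot like the ones below can be drawn directly from the estimated wordscores; a sketch (the exact plotting choices behind the lecture's figures are not shown, so sorting the scores is one assumption here):

# plot each word's estimated score, sorted, against an arbitrary word ID
ws <- sort(coef(mod_ws))
plot(seq_along(ws), ws, pch = 16, cex = 0.4,
     xlab = "Word ID", ylab = "Word Score")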

With Smoothing

[Figure: estimated wordscores (y-axis: Word Score, −1.0 to 0.5) plotted against Word ID (0 to 1500); with smoothing, a long horizontal run of words receives a similar score near the middle of the scale]

3. Scaling Using Wordscores

I When there is little overlap between the reference documents, many words are given the exact same score (of the one ref document they appear in) I This can be seen when calculating wordscores for the UN speeches without Laplace smoothing

Without Smoothing

[Figure: estimated wordscores (y-axis: Word Score, −1.0 to 1.0) plotted against Word ID (0 to 1500); without smoothing, many words sit at exactly −1 or 1, the score of the single reference document in which they appear]

3. Scaling Using Wordscores

I This means we need to choose large, linguistically diverse reference documents with mutual overlap between words I It’s better to use large, evenly-spaced documents (i.e. that provide substantive additional information) I We should include more than two reference documents where possible 3. Scaling Using Wordscores Choosing Reference Documents

I Remember from last week: the importance of the labelling stage for supervised model-based classification with pre-labelled documents
I Similarly, Wordscores crucially depends on the choice of reference documents:
  1. Need to be the same document type, using similar language, as the virgin documents (Note: this makes it difficult to use Wordscores over long periods of time)
  2. Need (at least) to be genuinely extreme, spanning the ends of the scale
  3. Best if they are large and linguistically diverse: they can be an amalgamation of documents
I Bottom line: best to choose reference documents that are quintessential, uncontroversial examples of extreme/centrist language

3. Scaling Using Wordscores

Interpreting and Validating Wordscores

I Nothing guarantees that the output of Wordscores will be 'correct' or even substantively meaningful
I Careful validation of results is vital:
  1. Do the words associated with each end of the scale make sense?
  2. Do the placements on the scale have high face validity?
  3. Do the placements on the scale correlate well with alternative measures of position (convergent validation) or with other concepts that should be causally related (construct validation)?

3. Scaling Using Wordscores

Advantages

I Intuitive and easy to understand
I Rather than just focusing on how similar two documents are (like Cosine Similarity), it creates a scale that gives meaning to differences/similarities
I Very useful for analysing political discourse, where relative word use most likely represents ideological differences

Disadvantages

I Requires strong (and often debatable) assumptions
I Some practical concerns about the measure itself: the different scale of virgin vs reference documents, and the fact that many words end up with a similar ('centrist') score
I The selection and scoring of reference documents is not easy and should be carefully justified

3. Scaling Using Wordscores

Making Wordscores Useful Wordscores is an easy and intuitive method for scaling, but. . .

1. Works best for documents that are pre-selected/processed to vary along a single dimension
2. Works best when documents contain mostly "ideological" language: relative word use must imply a spatial position
3. Can perform poorly on documents spanning long time periods, when language itself differs
4. Selection of reference documents is tricky but absolutely crucial
5. Requires large reference documents with substantial linguistic overlap
6. Ideally requires large documents
7. Must always be carefully validated

4. Wrap-Up: Measuring Stuff from Text

Two types of measurement schemes:

1. Classification of documents: involves categorical (often binary) measures
2. Scaling of documents: involves a continuous measure

⇒ Common goal: Assign a text to a particular category, or a particular position on a scale.

4. Wrap-Up: Measuring Stuff from Text

There are a number of possible methods that can be used to achieve this, which can, in the first instance, be split into:⁴

1. Supervised learning: involves training the model on data where the outcome is already known
2. Unsupervised learning: involves letting the model tell us what groups exist in the data, since the outcome is not yet known

⁴You will find the same terminology in other applications of machine learning.

4. Wrap-Up: Measuring Stuff from Text

                  Supervised methods       Unsupervised methods
Classification    Lasso regression         K-means clustering
                  Dictionary methods
                  SVM

Scaling           Wordscores               Cosine Similarity

Some (non-exhaustive) examples of QTA/machine learning methods 4. Wrap-Up: Measuring Stuff from Text

I As the focus is on measurement inference, the choice of the specific method depends on how useful the resulting classifier/measure is once applied to new data
I In other words: how well the model deals with the bias-variance trade-off
I However, the whole point of these approaches is that we often do not know the "true" values to compare the predicted values to - that's why we try to measure them!
I For supervised learning we can use validation techniques on test data to get a sense of model performance
I This is much more difficult for unsupervised methods! (You will learn more about this next year)

4. Wrap-Up: Measuring Stuff from Text

I There are many models that can help us learn about the world from text
I There is no single correct approach, and all have their advantages and disadvantages
I Good text analysis does not start and stop with choosing the best-"performing" method, but with:
  a. Accepting that all methods will yield approximations, and are thereby subject to measurement error
  b. Accepting that there is no guarantee that the model results make sense! Careful validation and checking that "things make sense" at all stages is crucial
  c. Accepting that no amount of automation can make up for learning what is in your texts by reading them!

Summary

Today we:

I Learned (and immediately decided to forget) about Support Vector Machines for text classification I Moved on to scaling documents, first with Cosine Similarity, then with Wordscores I Saw that, as always, the choice of the right technique is difficult to make and often depends on what we want to know!

Next time:

I Last lecture for this Module! I Introduction to automated data collection and web scraping I Spoiler Alert: it’s often much less fun in practice than it sounds. . . Thanks for watching!