
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Text Classification
Chris Manning and Pandu Nayak

Standing queries (Ch. 13)

§ The path from IR to text classification:
  § You have an information need to monitor, say:
    § Unrest in the Niger delta region
  § You want to rerun an appropriate query periodically to find new news items on this topic
  § You will be sent new documents that are found
  § I.e., it’s not ranking but classification (relevant vs. not relevant)
§ Such queries are called standing queries
  § Long used by “information professionals”
  § A modern mass instantiation is Google Alerts
§ Standing queries are (hand-written) text classifiers

Spam filtering: Another text classification task (Ch. 13)

From: "" Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down

Stop paying rent TODAY !

There is no need to spend hundreds or even thousands for similar courses

I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.

Change your life NOW !

=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================


Categorization/Classification (Sec. 13.1)

§ Given:
  § A representation of a document d
    § Issue: how to represent text documents
    § Usually some type of high-dimensional space – bag of words
  § A fixed set of classes: C = {c1, c2, …, cJ}
§ Determine:
  § The category of d: γ(d) ∈ C, where γ(d) is a classification function
§ We want to build classification functions (“classifiers”).

[Figure: example text classification task. A test document “planning language proof intelligence” must be assigned to one of the classes ML, Planning, Semantics, Garb.Coll., Multimedia, GUI (grouped under AI, Programming, and HCI); each class comes with training documents, e.g. “learning intelligence algorithm reinforcement network...” for ML and “garbage collection memory optimization region...” for Garb.Coll.]



Classification Methods (1) (Ch. 13)

§ Manual classification
  § Used by the original Yahoo! Directory
  § Looksmart, about.com, ODP, PubMed
  § Accurate when job is done by experts
  § Consistent when the problem size and team is small
  § Difficult and expensive to scale
    § Means we need automatic classification methods for big problems

Classification Methods (2) (Ch. 13)

§ Hand-coded rule-based classifiers
  § One technique used by news agencies, intelligence agencies, etc.
  § Widely deployed in government and enterprise
  § Vendors provide “IDE” for writing such rules

Classification Methods (2), continued (Ch. 13)

§ Hand-coded rule-based classifiers
  § Commercial systems have complex query languages
  § Accuracy can be high if a rule has been carefully refined over time by a subject expert
  § Building and maintaining these rules is expensive

A Verity topic: a complex classification rule for art (Ch. 13)

§ Note:
  § maintenance issues (author, etc.)
  § Hand-weighting of terms
§ [Verity was bought by Autonomy in 2005, which was bought by HP in 2011 – a mess; I think it no longer exists ...]
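For concreteness, a toy sketch of what a hand-coded rule-based classifier can look like; this is not Verity's query language, and the categories, patterns, weights, and threshold are all invented:

    import re

    # Hypothetical hand-written rules: each category maps to weighted
    # regex patterns.  A document gets every category whose accumulated
    # score clears the threshold.  All rules here are made up.
    RULES = {
        "art": [(r"\bart(ist|s)?\b", 1.0), (r"\bgallery\b", 0.7),
                (r"\bexhibit(ion)?\b", 0.7)],
        "finance": [(r"\binterest rates?\b", 1.0), (r"\bstocks?\b", 0.6)],
    }
    THRESHOLD = 1.0

    def classify(doc):
        text = doc.lower()
        return [cat for cat, patterns in RULES.items()
                if sum(w for pat, w in patterns if re.search(pat, text)) >= THRESHOLD]

    print(classify("The artist opened a new gallery exhibition downtown."))  # ['art']

Note how the maintenance worry from the slide shows up immediately: every new phrasing requires editing the rule set by hand.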

Classification Methods (3): Supervised learning (Sec. 13.1)

§ Given:
  § A document d
  § A fixed set of classes: C = {c1, c2, …, cJ}
  § A training set D of documents, each with a label in C
§ Determine:
  § A learning method or algorithm which will enable us to learn a classifier γ
  § For a test document d, we assign it the class γ(d) ∈ C
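As a concrete illustration of this Given/Determine setup (not from the slides), a minimal sketch using scikit-learn; the tiny training set is invented:

    # Sketch of supervised text classification: learn gamma from a labeled
    # training set D, then apply gamma(d) to a test document.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["interest rates rise", "bank cuts rates",
                  "striker scores goal", "team wins match"]
    train_labels = ["finance", "finance", "sports", "sports"]

    vectorizer = CountVectorizer()                # bag-of-words representation
    X = vectorizer.fit_transform(train_docs)
    gamma = MultinomialNB().fit(X, train_labels)  # the learned classifier

    d = vectorizer.transform(["rates fall after bank meeting"])
    print(gamma.predict(d)[0])                    # finance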



Classification Methods (3), continued (Ch. 13)

§ Supervised learning
  § Naive Bayes (simple, common) – see video, cs229
  § k-Nearest Neighbors (simple, powerful)
  § Support-vector machines (newer, generally more powerful)
  § Decision trees → random forests → gradient-boosted decision trees (e.g., xgboost)
  § … plus many other methods
§ No free lunch: need hand-classified training data
  § But data can be built up by amateurs
§ Many commercial systems use a mix of methods

Features

§ Supervised learning classifiers can use any sort of feature
  § URL, email address, punctuation, capitalization, dictionaries, network features
§ In the simplest bag of words view of documents
  § We use only word features
  § we use all of the words in the text (not a subset)


The bag of words representation

γ( “I love this movie! It’s sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I’ve seen it several times, and I’m always happy to see it again whenever I have a friend who hasn’t seen it yet.” ) = c

The same document as a bag of words:

γ( great 2, love 2, recommend 1, laugh 1, happy 1, ... ) = c
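A bag of words is just the multiset of a document's terms; a tiny sketch with a deliberately crude tokenizer:

    import re
    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on non-letters; real tokenizers are more careful.
        return Counter(re.findall(r"[a-z']+", text.lower()))

    doc = ("I love this movie! It's sweet, but with satirical humor. "
           "The dialogue is great and the adventure scenes are fun...")
    print(bag_of_words(doc).most_common(3))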


Feature Selection: Why? (Sec. 13.5)

§ Text collections have a large number of features
  § 10,000 – 1,000,000 unique words … and more
§ Selection may make a particular classifier feasible
  § Some classifiers can’t deal with 1,000,000 features
§ Reduces training time
  § Training time for some methods is quadratic or worse in the number of features
§ Makes runtime models smaller and faster
§ Can improve generalization (performance)
  § Eliminates noise features
  § Avoids overfitting

Feature Selection: Frequency

§ The simplest feature selection method:
  § Just use the commonest terms
  § No particular foundation
  § But it makes sense why this works
    § They’re the words that can be well-estimated and are most often available as evidence
§ In practice, this is often 90% as good as better methods
§ Smarter feature selection:
  § chi-squared, etc.
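The frequency method in a few lines (the toy collection is invented):

    from collections import Counter

    def select_by_frequency(docs, k):
        # Keep the k most frequent terms in the collection as the vocabulary.
        counts = Counter(w for d in docs for w in d.lower().split())
        return [term for term, _ in counts.most_common(k)]

    docs = ["the interest rate rose", "the rate cut boosts stocks",
            "the movie was great"]
    print(select_by_frequency(docs, 4))  # e.g. ['the', 'rate', ...]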


Naïve Bayes: See IIR 13 or cs124 lecture on Coursera or cs229

§ Classify based on prior weight of class and conditional parameter for what each word says:

  c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]

§ Training is done by counting and dividing:

  \hat{P}(c_j) \leftarrow \frac{N_{c_j}}{N} \qquad \hat{P}(x_k \mid c_j) \leftarrow \frac{T_{c_j x_k} + \alpha}{\sum_{x_i \in V} \left( T_{c_j x_i} + \alpha \right)}

§ Don’t forget to smooth

SpamAssassin

§ Naïve Bayes has found a home in spam filtering
  § Paul Graham’s A Plan for Spam
  § Widely used in spam filters
§ But many features beyond words:
  § black hole lists, etc.
  § particular hand-crafted text patterns
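The count-and-divide training and argmax classification above, as a short from-scratch sketch (toy spam/ham data; add-α smoothing with α = 1):

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, labels, alpha=1.0):
        vocab = {w for d in docs for w in d.split()}
        class_counts = Counter(labels)
        term_counts = defaultdict(Counter)   # term_counts[c][w] = T_{c,w}
        for d, c in zip(docs, labels):
            term_counts[c].update(d.split())
        log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
        log_cond = {}
        for c in class_counts:
            total = sum(term_counts[c].values()) + alpha * len(vocab)
            log_cond[c] = {w: math.log((term_counts[c][w] + alpha) / total)
                           for w in vocab}
        return log_prior, log_cond, vocab

    def classify_nb(doc, log_prior, log_cond, vocab):
        scores = {c: log_prior[c] + sum(log_cond[c][w] for w in doc.split()
                                        if w in vocab)
                  for c in log_prior}
        return max(scores, key=scores.get)

    docs = ["cheap viagra buy now", "meeting agenda attached", "buy cheap meds"]
    labels = ["spam", "ham", "spam"]
    model = train_nb(docs, labels)
    print(classify_nb("cheap meds now", *model))  # spam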


SpamAssassin Features:

§ Basic (Naïve) Bayes spam probability
§ Mentions: Generic Viagra
§ Regex: millions of (dollar) ((dollar) NN,NNN,NNN.NN)
§ Phrase: impress ... girl
§ Phrase: ‘Prestigious Non-Accredited Universities’
§ From: starts with many numbers
§ Subject is all capitals
§ HTML has a low ratio of text to image area
§ Relay in RBL, http://www.mail-abuse.com/enduserinfo_rbl.html
§ RCVD line looks faked
§ http://spamassassin.apache.org/tests_3_3_x.html

Naive Bayes is Not So Naive

§ Very fast learning and testing (basically just count words)
§ Low storage requirements
§ Very good in domains with many equally important features
§ More robust to irrelevant features than many learning methods
  § Irrelevant features cancel out without affecting results


Naive Bayes is Not So Naive (continued)

§ More robust to concept drift (changing class definition over time)
§ Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems
  § Goal: Financial services industry direct mail response prediction: predict if the recipient of mail will actually respond to the advertisement – 750,000 records.
§ A good dependable baseline for text classification (but not the best)!

Evaluating (Sec. 13.6)

§ Evaluation must be done on test data that are independent of the training data
  § Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
§ Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)



Evaluating Categorization (Sec. 13.6)

§ Measures: precision, recall, F1, classification accuracy
§ Classification accuracy: r/n where n is the total number of test docs and r is the number of test docs correctly classified

WebKB Experiment (1998) (Sec. 13.6)

§ Classify webpages from CS departments into:
  § student, faculty, course, project
§ Train on ~5,000 hand-labeled web pages
  § Cornell, Washington, U.Texas, Wisconsin
§ Crawl and classify a new site (CMU) using Naïve Bayes

§ Results (accuracy table not reproduced here)
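A sketch of the measures from the Evaluating Categorization slide above: classification accuracy r/n, plus one-vs-rest precision/recall/F1 for a single class (the toy gold/predicted labels are invented):

    def evaluate(gold, predicted, positive_class):
        # Classification accuracy: r correct out of n test docs.
        n = len(gold)
        r = sum(g == p for g, p in zip(gold, predicted))
        # Precision/recall/F1 for one class, one-vs-rest.
        tp = sum(g == p == positive_class for g, p in zip(gold, predicted))
        fp = sum(p == positive_class and g != positive_class
                 for g, p in zip(gold, predicted))
        fn = sum(g == positive_class and p != positive_class
                 for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return r / n, precision, recall, f1

    gold = ["student", "faculty", "course", "student"]
    pred = ["student", "course", "course", "faculty"]
    print(evaluate(gold, pred, "student"))  # (0.5, 1.0, 0.5, 0.666...)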


Remember: Vector Space Representation (Sec. 14.1)

§ Each document is a vector, one component for each term (= word).
§ Normally normalize vectors to unit length.
§ High-dimensional vector space:
  § Terms are axes
  § 10,000+ dimensions, or even 100,000+
  § Docs are vectors in this space

§ How can we do classification in this space?
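A small sketch of the unit-length normalization mentioned above (toy data):

    import math
    from collections import Counter

    def unit_vector(text):
        # Term-frequency vector normalized to unit (L2) length.
        counts = Counter(text.lower().split())
        norm = math.sqrt(sum(c * c for c in counts.values()))
        return {term: c / norm for term, c in counts.items()}

    print(unit_vector("rate rate interest"))
    # {'rate': 0.894..., 'interest': 0.447...}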



Classification Using Vector Spaces (Sec. 14.1)

§ In vector space classification, the training set corresponds to a labeled set of points (equivalently, vectors)
§ Premise 1: Documents in the same class form a contiguous region of space
§ Premise 2: Documents from different classes don’t overlap (much)
§ Learning a classifier: build surfaces to delineate classes in the space

Documents in a Vector Space (Sec. 14.1)

[Figure: labeled documents plotted in a 2-D vector space, clustered into three class regions: Government, Science, Arts]




Test Document: of what class? (Sec. 14.1)

[Figure: an unlabeled test document plotted in the Government/Science/Arts vector space]

Test Document = Government (Sec. 14.1)

[Figure: the same space; the test document falls within the Government region]

§ Our focus: how to find good separators


Definition of centroid (Sec. 14.2)

§ Rocchio forms a simple representative for each class: the centroid/prototype

  \vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)

§ where D_c is the set of all documents that belong to class c and \vec{v}(d) is the vector space representation of d.
§ Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.

Rocchio classification (Sec. 14.2)

§ Classification: nearest prototype/centroid
§ It does not guarantee that classifications are consistent with the given training data. Why not?
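A minimal nearest-centroid (Rocchio) sketch following the centroid definition above; the 2-D vectors and class names are invented toy data:

    import numpy as np

    def train_rocchio(X, labels):
        # X: (n_docs, n_terms) array; returns one centroid per class.
        return {c: X[np.array(labels) == c].mean(axis=0) for c in set(labels)}

    def classify_rocchio(x, centroids):
        # Assign the class whose centroid is nearest (Euclidean distance).
        return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

    X = np.array([[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]])
    labels = ["Government", "Government", "Science", "Science"]
    centroids = train_rocchio(X, labels)
    print(classify_rocchio(np.array([1.5, 0.5]), centroids))  # Government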



Two-class Rocchio as a linear classifier (Sec. 14.2)

§ Line or hyperplane defined by:

  \sum_{i=1}^{M} w_i d_i = \theta

§ For Rocchio, set:

  \vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)
  \theta = 0.5 \times \left( |\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2 \right)

Linear classifier: Example (Sec. 14.4)

§ Class: “interest” (as in interest rate)
§ Example features of a linear classifier:

  wi     ti            wi      ti
  0.70   prime         −0.71   dlrs
  0.67   rate          −0.35   world
  0.63   interest      −0.33   sees
  0.60   rates         −0.25   year
  0.46   discount      −0.24   group
  0.43   bundesbank    −0.24   dlr

§ To classify, find dot product of feature vector and weights
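A sketch of the two-class Rocchio reduction to a linear classifier, computing w and θ exactly as above (toy 2-D data):

    import numpy as np

    def rocchio_linear(X1, X2):
        # w = mu(c1) - mu(c2);  theta = 0.5 * (|mu(c1)|^2 - |mu(c2)|^2)
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        w = mu1 - mu2
        theta = 0.5 * (mu1 @ mu1 - mu2 @ mu2)
        return w, theta

    def predict(x, w, theta):
        # Dot product against the threshold: class 1 if w.x >= theta.
        return 1 if w @ x >= theta else 2

    X1 = np.array([[2.0, 0.2], [1.6, 0.4]])   # class-1 training vectors
    X2 = np.array([[0.3, 1.8], [0.5, 2.0]])   # class-2 training vectors
    w, theta = rocchio_linear(X1, X2)
    print(predict(np.array([1.4, 0.6]), w, theta))  # 1

The prediction agrees with nearest-centroid classification, since w·x − θ is half the difference of the squared distances to the two centroids.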



Rocchio classification (Sec. 14.2)

§ A simple form of Fisher’s linear discriminant
§ Little used outside text classification
  § It has been used quite effectively for text classification
  § But in general worse than Naïve Bayes
§ Again, cheap to train and test documents

k Nearest Neighbor Classification (Sec. 14.3)

§ kNN = k Nearest Neighbor
§ To classify a document d:
  § Define k-neighborhood as the k nearest neighbors of d
  § Pick the majority class label in the k-neighborhood
§ For larger k can roughly estimate P(c|d) as #(c)/k
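A minimal kNN sketch with majority voting over the k most similar training documents (toy data; similarities are plain dot products, which equal cosine similarity when vectors are unit length):

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, labels, k=3):
        sims = X_train @ x                   # similarity to every training doc
        top_k = np.argsort(sims)[::-1][:k]   # indices of the k most similar
        votes = Counter(labels[i] for i in top_k)
        return votes.most_common(1)[0][0]    # majority class label

    X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    labels = ["Government", "Government", "Science", "Science"]
    print(knn_classify(np.array([0.8, 0.2]), X_train, labels, k=3))  # Government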


Test Document = Science (Sec. 14.1)

[Figure: a test document in the Government/Science/Arts space classified as Science; the training points partition the space as a Voronoi diagram]

Nearest-Neighbor Learning (Sec. 14.3)

§ Learning: just store the labeled training examples D
§ Testing instance x (under 1NN):
  § Compute similarity between x and all examples in D.
  § Assign x the category of the most similar example in D.
§ Does not compute anything beyond storing the examples
§ Also called:
  § Case-based learning
  § Memory-based learning
  § Lazy learning
§ Rationale of kNN: contiguity hypothesis


k Nearest Neighbor (Sec. 14.3)

§ Using only the closest example (1NN) is subject to errors due to:
  § A single atypical example.
  § Noise (i.e., an error) in the category label of a single training example.
§ More robust: find the k examples and return the majority category of these k
§ k is typically odd to avoid ties; 3 and 5 are most common

Nearest Neighbor with Inverted Index (Sec. 14.3)

§ Naively finding nearest neighbors requires a linear search through |D| documents in the collection
§ But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
§ Use standard vector space inverted index methods to find the k nearest neighbors.
§ Testing time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears.
  § Typically B << |D|
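A toy sketch of kNN driven by an inverted index, so scoring touches only the postings of terms that occur in the test document (no length normalization; the class, documents, and labels are invented for illustration):

    import heapq
    from collections import defaultdict, Counter

    class InvertedIndexKNN:
        def __init__(self):
            self.postings = defaultdict(list)  # term -> [(doc_id, tf)]
            self.labels = {}

        def add(self, doc_id, text, label):
            self.labels[doc_id] = label
            for term, tf in Counter(text.lower().split()).items():
                self.postings[term].append((doc_id, tf))

        def classify(self, text, k=3):
            scores = Counter()
            for term, tf in Counter(text.lower().split()).items():
                for doc_id, doc_tf in self.postings.get(term, []):
                    scores[doc_id] += tf * doc_tf  # unnormalized dot product
            top_k = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
            return Counter(self.labels[d] for d, _ in top_k).most_common(1)[0][0]

    idx = InvertedIndexKNN()
    idx.add(1, "interest rate rises", "finance")
    idx.add(2, "bank cuts interest rate", "finance")
    idx.add(3, "new art gallery opens", "arts")
    print(idx.classify("interest rate cut", k=2))  # finance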



kNN: Discussion (Sec. 14.3)

§ No feature selection necessary
§ No training necessary
§ Scales well with large number of classes
  § Don’t need to train n classifiers for n classes
§ Classes can influence each other
  § Small changes to one class can have ripple effect
§ Done naively, very expensive at test time
§ In most cases it’s more accurate than NB or Rocchio
  § As the amount of data goes to infinity, it has to be a great classifier! – it’s “Bayes optimal”

Let’s test our intuition

§ Can a bag of words always be viewed as a vector space?
§ What about a bag of features?
§ Can we always view a standing query as a contiguous region in a vector space?
§ Do far away points influence classification in a kNN classifier? In a Rocchio classifier?
§ Can a Rocchio classifier handle disjunctive classes?
§ Why do linear classifiers actually work well for text?


Rocchio Anomaly (Sec. 14.2)

§ Prototype models have problems with polymorphic (disjunctive) categories.

3 Nearest Neighbor vs. Rocchio

§ Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.


Bias vs. capacity – notions and terminology (Sec. 14.6)

§ Consider asking a botanist: Is an object a tree?
  § Too much capacity, low bias
    § Botanist who memorizes
    § Will always say “no” to new object (e.g., different # of leaves)
  § Not enough capacity, high bias
    § Lazy botanist
    § Says “yes” if the object is green
  § You want the middle ground

(Example due to C. Burges)

kNN vs. Naive Bayes (Sec. 14.6)

§ Bias/Variance tradeoff
  § Variance ≈ Capacity
§ kNN has high variance and low bias.
  § Infinite memory
§ Rocchio/NB has low variance and high bias.
  § Linear decision surface between classes


Bias vs. variance: Choosing the correct model capacity (Sec. 14.6)

[Figure: illustration of model capacity; too little capacity underfits the training data, too much capacity overfits it]

Summary: Representation of Text Categorization Attributes

§ Representations of text are usually very high dimensional
  § “The curse of dimensionality”
§ High-bias algorithms should generally work best in high-dimensional space
  § They prevent overfitting
  § They generalize more
§ For most text categorization tasks, there are many relevant features & many irrelevant ones


Which classifier do I use for a given text classification problem?

§ Is there a learning method that is optimal for all text classification problems?
  § No, because there is a tradeoff between bias and variance.
§ Factors to take into account:
  § How much training data is available?
  § How simple/complex is the problem? (linear vs. nonlinear decision boundary)
  § How noisy is the data?
  § How stable is the problem over time?
    § For an unstable problem, it’s better to use a simple and robust classifier.
