Christine Hodges
SIMS 290, Fall 2004

Assignment 3: Text Classification

For the initial exploration part of this assignment, I made several models, each consisting of a set of features, a feature weighting strategy, and a choice of classification algorithm. Each model was trained on 80% of the available training data and tested on the remaining 20%. Thus, I used a training set and a held-out set to simulate the final testing environment. Each experiment involves training and testing one model on either the diverse set of newsgroup messages or the homogenous set (sci.*). In the next section (Description of Experiments), I discuss the features, weighting strategies, and WEKA classification algorithms I used in my initial experiments, along with some points of confusion I still have. Then, in the section "Results of Initial Experiments," I review and discuss the results of those experiments. For the final test (training on all the training data, testing on the test data), I chose the two models that performed best in the initial experiments: one performed best in the diverse newsgroups classification task and a different model performed best in the homogenous task. I review and discuss the final test results in the "Results of Final Test" section. Throughout this report, I offer new hypotheses, ideas for further experiments, and additional explorations inspired by this work.

Description of Experiments

Features

I looked at linguistic features of the data and took each newsgroup message in its entirety (i.e. keeping whatever header and signature information the messages contain). I take a fairly low-level, linguistically inspired approach: roughly, words and their meanings are what I'm interested in for these experiments. Thus, these experiments follow a unigram (bag-of-words) model of language. Concretely, a lower-level, linguistic approach means that the headers and signatures are nothing special; they are just collections of word features. Note: it can be argued that the subject line is a pragmatic/discourse aspect of the message, which is what I would mean by "higher-level linguistics."

Since this is a unigram model, the sets of features I chose for each experiment are based on the character strings in the messages. Newsgroup messages have plenty of character strings that are not words in any of the usual senses (e.g. both "dog" and "N.A.S.A" are usual words) or even in extended senses ("3,1-beta-galactase" is a biochemistry word, and "l33t"--pronounced "leet," based on "elite"--is a word from "l33tspeak," the dialect of a hacker-inspired subculture). Simple examples of character strings that are not words in even an extended sense are emoticons. Emoticons may still be valuable classification features, just as l33tspeak words are probably more likely to appear in the sci.crypt newsgroup than in rec.auto. (Related idea: emoticons should give us a clue to the register of speech in a communication. I expect they would be helpful features in register-of-speech classification :)

For the rest of this report, I refer to all character string tokens in the data as words (thus, including emoticons).

The different feature sets I looked at can be placed on a scale based on how much they "normalize" the words. Here I define "normalizing" as processing by some combination of taking out stopwords, lower-casing text, removing other symbols (cleaning "?", ".", etc.), and stemming with the Porter stemmer. (Note that the stemmer will not stem words carrying extra symbols such as quotes and e-mail quote marks ">".) All models discard features with frequency below 5 (as per the given weka.py). The experiments used the following feature sets:

Amount of Normalization    Feature Descriptions                                              Experiment IDs

less normalization         AsIs: stopwords in, no lowercasing                                11, 12
                           Baseline: stopwords out, lowercasing                              1, 2, 3, 4, 7, 8
more normalization         Porter+clean: strip symbols at the beginning and end of each
                           string (python string.strip), Porter stemmer, stopwords out,
                           lowercasing                                                       5, 6, 9, 10
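To make the amount of normalization concrete, here is a minimal sketch of the kind of "Porter+clean" pipeline described above. It assumes NLTK's Porter stemmer and a toy stopword list; the actual weka.py code differs in its details.

    # Hypothetical sketch of the "Porter+clean" normalization (not the actual weka.py code).
    # Assumes NLTK's Porter stemmer and a toy stopword list.
    from collections import Counter
    from nltk.stem.porter import PorterStemmer

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # toy list
    MIN_FREQ = 5  # features with corpus frequency below this are discarded
    stemmer = PorterStemmer()

    def normalize(token):
        token = token.lower()                  # lowercasing
        token = token.strip('.,!?"\'()<>:;')   # clean symbols at the ends of the string
        return stemmer.stem(token) if token else ""

    def extract_features(messages):
        counts = Counter()
        for text in messages:
            for tok in text.split():
                tok = normalize(tok)
                if tok and tok not in STOPWORDS:
                    counts[tok] += 1
        return {w: c for w, c in counts.items() if c >= MIN_FREQ}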

I also began preparations for considering part-of-speech tags. I had planned to exclude certain POS classes (nouns or verbs) in different experiments. But I was using Dan Klein's Java parser, and it took a long time to figure out how to interface it with the Python program. Then other problems came up (such as aligning the produced POS tags with the original sentence--there are occasionally empty tokens), and I realized it would take forever and a day to prepare features for all the messages (even though tagging itself is relatively fast). So I put that aside. I will probably share the interfacing steps I had to take on the newsgroup.

Feature Weighting

I'm a little confused about feature weighting. The assignment said: "These features are weighted by their DF values. You should consider these features and their weights as a baseline to compare different features/weighting approaches against." But to me it looks like the code passes to WEKA the feature frequency returned by extract_features_and_freqs() (which counts the frequency of each feature in the given document). Thus, it seems that the given weka.py prepares weights based on the count of a term (word) in a document (Term Frequency) rather than on the "number of documents a term appears in," as Document Frequency weighting does. Does WEKA do anything about this on its own? I didn't think of this until recently, so I haven't had a chance to ask, or to test whether counting the number of documents that contain a given feature at least once and passing that to WEKA works better. I refer to the weighting done by the given weka.py as the base weighting.
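To make the distinction I am worried about concrete, here is a small illustrative sketch (not the given weka.py code) contrasting the two counts:

    # Illustrative contrast between the two counts discussed above
    # (not the code from the given weka.py).
    from collections import Counter

    docs = [["the", "key", "is", "secret"],
            ["launch", "the", "shuttle"],
            ["key", "escrow", "key"]]

    # Term frequency: how often a term occurs within one document.
    tf_doc3 = Counter(docs[2])                                # {'key': 2, 'escrow': 1}

    # Document frequency: how many documents contain the term at least once.
    df = Counter(term for doc in docs for term in set(doc))   # df['key'] == 2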

This may be the reason I'm confused about something else, or I may just be confused in general: I performed (what I think is) TF.IDF weighting, as Preslav suggested in e-mail. But as will be seen in the results section, it didn't help; thus, I think either I did something wrong, or the baseline isn't correct (because the base weighting isn't doing Document Frequency weighting), or, less likely, it actually doesn't help in this case. This makes me doubt whether I understood TF.IDF... In TF = "frequency of term i in document j," what do we mean by "frequency"? Do we mean the simple count of instances of the term (as I thought, and as in weka.py), or the count of this term divided by the total count of all terms?

I also performed a weighting inspired by TF.IDF. The mission of IDF is to "make rare words across documents more important." I experimented with an "Inverse Class Frequency" measure that similarly uses the number of classes and the number of classes that contain a given word. I refer to the use of this measure as "TF.IClassF." IClassF yields sharper bins for the valuation of feature rarity. For example, there are only two classes in the diverse set, so the IClassF of a feature will be either log(2/2) or log(2/1). I didn't really have concrete hypotheses about what the effect of using this instead of IDF would be; I was just experimenting!
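For reference, here is a minimal sketch of the two weightings as I understand them, assuming TF is the raw count of the term in the document (the function names are my own, not the assignment code):

    # Sketch of TF.IDF and the TF.IClassF variant described above, assuming TF is
    # the raw count of the term in the document. Function names are hypothetical.
    import math

    def tf_idf(term_count_in_doc, n_docs, n_docs_containing_term):
        idf = math.log(n_docs / n_docs_containing_term)
        return term_count_in_doc * idf

    def tf_iclassf(term_count_in_doc, n_classes, n_classes_containing_term):
        # Same idea at the class level: with only two classes, the factor is
        # either log(2/2) = 0 (word appears in both classes) or log(2/1).
        iclassf = math.log(n_classes / n_classes_containing_term)
        return term_count_in_doc * iclassf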

The following experiments use these feature-weighting schemes:

Feature Weighting Method    Experiment IDs
Base weighting              1, 2, 5, 6, 11, 12
TF.IDF                      3, 4
TF.IClassF                  7, 8, 9, 10

Classification Algorithms

The experiments used the following WEKA classification algorithms. I have noted the linear/non-linear and binary/multiclass distinctions as we discussed in class. I do not know if the WEKA implementations entirely match these descriptions (e.g. what does using a kernel estimator involve? I decided to use it based on Preslav's mention of it in class). Also, I do not know what similarity measure the k-Nearest-Neighbours classifier uses.

Classifier Name                   Linear or Non-linear?   Binary or Multiclass?   Parameters
kNN3: kNearestNeighbours (IBk)    Non-linear              Multiclass              k=3, defaults
kNN5: kNearestNeighbours (IBk)    Non-linear              Multiclass              k=5, defaults
NB: Naive-Bayes                   Linear                  Multiclass              useKernelEstimator=True, defaults
SVM (SMO)                         Linear                  Binary                  defaults
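For completeness, here is a rough sketch of how these classifiers could be run from Python via the WEKA command line. The class names and options are the standard WEKA ones as I understand them (an assumption; check the WEKA documentation for your version), and the .arff file names are placeholders.

    # Rough sketch: running the WEKA classifiers used above from the command line.
    # Class names/flags are assumed from standard WEKA; file names are placeholders.
    import subprocess

    classifiers = [
        ("kNN3", "weka.classifiers.lazy.IBk", ["-K", "3"]),
        ("kNN5", "weka.classifiers.lazy.IBk", ["-K", "5"]),
        ("NB",   "weka.classifiers.bayes.NaiveBayes", ["-K"]),  # -K: kernel estimator
        ("SVM",  "weka.classifiers.functions.SMO", []),         # defaults
    ]

    for name, weka_class, options in classifiers:
        cmd = (["java", "-cp", "weka.jar", weka_class,
                "-t", "train.arff", "-T", "test.arff"] + options)
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(name)
        print(result.stdout)   # WEKA prints accuracy and a confusion matrix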

Results of Initial Experiments

The results of the initial experiments (train on 80% of available training data, test on remaining 20%) are given in Table 1. Results can be summarized as follows:

Diverse vs. Homogenous Task

Overall, all models performed better when trained and tested on the diverse set than on the homogenous set. Interesting cases of performance difference are discussed below.

Classification Algorithm

The Support Vector Machine classification algorithm performed best overall. It had the highest accuracy of all classifiers in all but one experiment (Exp. 8, discussed below).

The k-Nearest-Neighbour classifier performed the worst overall. Its highest accuracy (76.27%, Exp. ID 1, Base features and weighting) is only about 3% higher than the lowest accuracy from the SVM on the harder homogenous task (Exp. ID 8, base features, TF.IClassF weighting). k=3 was better than k=5 on the homogenous task and when stemming was used on the diverse task. k=5 was better than k=3 when the base weighting and TF.IDF weighting were used with base features but TF.IClassF weighting raised k=3 performance enough to equal that of k=5. We might say checking with more neighbours helps on the diverse task but not the homogenous. Perhaps k=5 gets lost in homogenous data because when the data is similar, there are more near neighbours of different classes. In class, we mentioned how sensitive kNN classification is to the similarity measure used. The cosine similarity measure was mentioned as a good measure for NLP tasks. I do not know what similarity measure WEKA uses (Euclidean distance? There are no settings for similarity measure that I see). Perhaps a different similarity measure would yield greater accuracy for the kNN classifier.
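As a reference point for why the similarity measure matters, here is a small illustrative sketch contrasting Euclidean distance with cosine similarity on term-count vectors (this is not necessarily what WEKA's IBk computes):

    # Illustrative: Euclidean distance vs. cosine similarity on term-count vectors.
    # Not necessarily what WEKA's IBk computes.
    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    short_msg = [1, 2, 0]    # a short message about some topic
    long_msg  = [10, 20, 0]  # a ten-times-longer message about the same topic

    print(euclidean(short_msg, long_msg))  # ~20.1: heavily penalizes the length difference
    print(cosine(short_msg, long_msg))     # 1.0: same direction, document length ignored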

The performance of the Naive-Bayes models is in between that of the SVM and kNN models.

Again, I don't know the details of the WEKA implementations of these classification algorithms. But if they follow the usual implementations, these classifiers can be categorized as in the table above. We might then notice that the linear classifiers performed better than the non-linear one, and that the binary classifier performed better than the multiclass ones. But we should try a non-linear, binary classifier (such as the kernel methods using SVMs discussed in class) before we come to even initial conclusions.

Features and Weighting

TF.IDF weighting of the base features appears to have no effect on some classifiers and to help others. This may be due to some error somewhere (see the previous discussion) or not. If it isn't an error, then on the diverse task the Naive-Bayes model and the kNN3 model performed better with TF.IDF weighting, while all other models were not affected. Accuracy on the homogenous task was no different from the baseline.

I now focus the feature/weighting discussion on the results using the SVM classifier.

On the diverse task, the SVM classifier performed better on features/weighting combinations which take features from Porter stemming and string cleaning (Exp. 5, 9) and/or use TF.IClassF weighting (Exp. 7, 9). Using both stemming and TF.IClassF weighting resulted in the best score on the diverse task out of all experiments (98.10%, Exp. 9). Further, just using TF.IClassF (Exp. 7) produced the second best score on the diverse task (thus beating stemming alone). But, surprisingly, it seems the SVM classifier does not work well with TF.IClassF weighting on the homogenous task. The lowest accuracy from the SVM came from using TF.IClassF with the base features (Exp. 8). In Experiment 8, the Naive-Bayes classifier performed better than the SVM--although I don't know if the difference (1.27%) is statistically significant. But the TF.IClassF SVM with stemmed features (Exp. 10) produced the second highest accuracy for the homogenous task! This interaction between TF.IClassF weighting and the task type is difficult to figure out. I don't really have any intuitions about why this might occur. Perhaps the sharp bins that class frequency weighting produces are less informative when the classes are closely related.

Based on these interactions and the SVM scores from stemming with the base weighting, it seems the SVM benefits from stemming. Using stemmed features and the base weighting produced the highest score of all experiments on the more difficult homogenous task (82.99%, Exp. 6). Stemming clearly yielded higher accuracy for the Naive-Bayes and SVM classifiers compared to not stemming. The effects of stemming on kNN classification were less obvious. It helped k = 3 on the diverse set but hurt k = 5. But I can’t make sense of that nor discern any other trends.

I considered why stemming might help the SVM models. Note that the "As Is" least-normalized feature models (Exp. 11, 12) yield the lowest accuracies of all (except on the homogenous task, because the SVM seems to hate TF.IClassF weighting, as discussed above). So there seems to be a trend in favour of normalizing the features for the SVM. I noticed that stemming did not reduce the number of features. The base 80% training data has 3870 features, while stemming created 4500 features and the As Is feature set has 4797 features. This suggests that accuracy here is not simply driven by the number of features. I would say that stemming produced more features than the base (which normalizes less) because cleaning and stemming allowed more words to reach the minimum frequency cutoff of 5.

To see what feature differences stemming produced, I used WEKA to choose the best features (using the Chi-square test) from the initial 80% training data. Table 2 shows the top features chosen for each feature set of the homogenous task (As Is: Exp. 12, Base: Exp. 2, Porter stemmed+cleaned: Exp. 6). One thing I notice is that "cryptography" is present as both "crypto" and "cryptography" in the base feature set, while both would become "crypto" in the stemmed set. Since "crypto" and "cryptography" strongly indicate sci.crypt, stemming is good here: the stemmed model can weight the merged "crypto" feature highly right away, whereas "cryptography" has a lower rank in the base feature set and isn't even in the top 35 of the As Is set. Stemming might be bad if stemmed forms often obscured different word senses (as discussed in class) and there were no weighting mechanism to cope with the ambiguity by devaluing that feature. Now that I think about it, I can see why many NLPists do use stemming: for most tasks, stems are not very ambiguous with respect to word sense.
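As a reference for what the Chi-square ranking is doing, here is a minimal sketch for a two-class case with binary present/absent features; WEKA's attribute evaluator computes something along these lines, though its exact implementation may differ.

    # Minimal sketch of chi-square feature scoring for two classes with binary
    # (present/absent) features. WEKA's implementation may differ in detail.
    def chi_square(n11, n10, n01, n00):
        """n11: class-1 docs containing the word, n10: class-1 docs without it,
        n01: class-2 docs containing it, n00: class-2 docs without it."""
        n = n11 + n10 + n01 + n00
        numerator = n * (n11 * n00 - n10 * n01) ** 2
        denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        return numerator / denominator if denominator else 0.0

    # A word appearing in most sci.crypt messages but few sci.space messages
    # gets a high score, and therefore a high rank.
    print(chi_square(n11=150, n10=50, n01=10, n00=190))   # roughly 204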

Given the performance of all the models on these initial experiments, I chose to run the final test on the best diverse classifier (stemmed features, TF.IClassF-weighted SVM, Exp. 9) and the best homogenous classifier (stemmed features, base-weighting SVM, Exp. 6), using the corresponding base models to establish the baseline.

Results of Final Test

SVM Model Accuracy on Final Test

                                   Accuracy (% correct classification)
SVM Model                          Diverse Task   Homogenous Task
Baseline                           92.000         79.375
Stemming (Porter+clean)            94.500         81.000
Stemming + TF.IClassF weighting    97.250         76.875

Diverse Task: Confusion Matrices of Baseline and Best

Baseline (rows: correct class, columns: classified as)
                   rec.motorcycles   sci.space
rec.motorcycles    184               16
sci.space          16                184

Stemming + TF.IClassF weighting (rows: correct class, columns: classified as)
                   rec.motorcycles   sci.space
rec.motorcycles    191               9
sci.space          2                 198

Homogenous Task: Confusion Matrices of Baseline and Best

Baseline (rows: correct class, columns: classified as)
                   sci.crypt   sci.med   sci.space   sci.electronics
sci.crypt          158         7         6           29
sci.med            2           160       5           33
sci.space          7           15        148         30
sci.electronics    5           14        12          169

Stemming (rows: correct class, columns: classified as)
                   sci.crypt   sci.med   sci.space   sci.electronics
sci.crypt          157         6         6           31
sci.med            2           166       9           23
sci.space          5           8         159         28
sci.electronics    5           11        18          166

The results of the final test follow most of the trends of the initial experiments and sharpen other possible trends:

• The classifiers were better at classifying the diverse newsgroups than the homogenous ones.
• The stemmed-features models were more accurate than the baseline, non-stemmed model on the diverse-groups data.
• TF.IClassF + stemming beat stemming alone on the diverse task but was much worse at classifying the homogenous data.
• TF.IClassF + stemming performed below the baseline on the homogenous task. In the initial experiments, it beat the base model by only about 0.5% (is that statistically significant? Perhaps not).

As before, I'm not sure why TF.IClassF leads to results like this, aside from the bin theory I mentioned before. But it is clearer now (with more training data) that the Support Vector Machine classifier doesn't get along with it.

The confusion matrices show that the baseline and best SVM classifiers confused the two classes of the diverse data fairly evenly in both directions. But on the homogenous data, they uniformly tended to mistake the correct class for sci.electronics; sci.electronics also has the lowest F-measure. Perhaps sci.electronics covers a lot of computers and gadgets, and there are plenty of opportunities to discuss these topics in the other, related newsgroups.
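For reference, the per-class precision, recall, and F-measure can be read directly off a confusion matrix like the ones above; here is a small sketch using the stemming model's homogenous-task matrix:

    # Per-class precision, recall, and F-measure from a confusion matrix
    # (rows = correct class, columns = predicted class); values taken from the
    # stemming model's homogenous-task matrix above.
    labels = ["sci.crypt", "sci.med", "sci.space", "sci.electronics"]
    matrix = [
        [157,   6,   6,  31],
        [  2, 166,   9,  23],
        [  5,   8, 159,  28],
        [  5,  11,  18, 166],
    ]

    for i, label in enumerate(labels):
        tp = matrix[i][i]
        precision = tp / sum(row[i] for row in matrix)   # column sum
        recall = tp / sum(matrix[i])                     # row sum
        f_measure = 2 * precision * recall / (precision + recall)
        print(f"{label}: P={precision:.2f}  R={recall:.2f}  F={f_measure:.2f}")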

This lower-level, linguistics-oriented approach yielded good results for the diverse data. I would next work to improve homogenous data classification. That may involve looking into higher-level linguistic effects (the subject line) and a different (not unigram) language model. I would also improve the stemming process by cleaning the data further so that the stemmer would stem more often (Table 2 shows that "keys" made it into the top 35 stemmed features; I believe this is an artefact of the cleaning process and not an error of the stemmer--otherwise, "keys" would have received a higher ranking). I would further like to see if stemming plus Chi-square feature reduction would do better; the top words of Table 2 look very promising. (We would have to come up with a way to create the sparse test files from the new features selected by WEKA.) After all this feature weighting and selection experimenting, I think part-of-speech pre-selection will not help; I think weighting is robust and selective enough.

Table 1: Results of Initial Experiments (Bolded experiments denote models that are tested on the final test set.)

                                                                                Accuracy (% correctly classified)
ID   Data Set          Experiment (features, weighting methods)                 NB        SVM       kNN3      kNN5
1    Diverse (Given)   Base weighting, stopwords-out, lowercase                 87.9747   93.3544   60.7595   76.2658
2    Homog. (Given)    Base weighting, stopwords-out, lowercase                 72.0191   78.5374   39.5866   35.6121
3    Diverse ?         TF.IDF, stopwords-out, lowercase                         88.2911   93.3544   64.2405   76.8987
4    Homog. ?          TF.IDF, stopwords-out, lowercase                         72.0191   78.5374   39.5866   35.6121
5    Diverse           Base weighting, Porter+clean, stopwords-out, lowercase   90.8228   94.6203   69.3038   62.9747
6    Homog.            Base weighting, Porter+clean, stopwords-out, lowercase   75.0397   82.9889   39.9046   32.2734
7    Diverse           TF.IClassF, stopwords-out, lowercase                     87.6582   96.8354   73.4177   73.4177
8    Homog.            TF.IClassF, stopwords-out, lowercase                     74.7218   73.4499   41.3355   31.6375
9    Diverse           TF.IClassF, Porter+clean, stopwords-out, lowercase       90.1899   98.1013   69.9367   64.8734
10   Homog.            TF.IClassF, Porter+clean, stopwords-out, lowercase       78.6963   79.0143   40.0636   32.9094
11   Diverse           Base weighting, "As Is": stopwords-IN, NO lowercasing    86.0759   91.7722   59.8101   59.8101
12   Homog.            Base weighting, "As Is": stopwords-IN, NO lowercasing    74.5628   78.5      44.5      41.5

Table 2: Chi-Squared Feature Ranking

Rank   As Is          Base           Porter
1      encryption     clipper        clipper
2      key            encryption     encrypt
3      Clipper        space          key
4      government     key            space
5      Space          chip           chip
6      space          government     secur
7      chip           crypto         govern
8      clipper        algorithm      orbit
9      algorithm      keys           secret
10     crypto         security       crypto
11     keys           escrow         algorithm
12     escrow         nsa            escrow
13     NSA            secret         nsa
14     medical        secure         launch
15     security       public         medic
16     secure         launch         doctor
17     code           medical        moon
18     NASA           moon           gordon
19     Chip           nasa           wiretap
20     launch         orbit          encryption
21     DES            encrypted      banks
22     encrypted      code           pgp
23     PGP            wiretap        public
24     Once           pgp            nasa
25     wiretap        chastity       circuit
26     announcement   disease        keys
27     doctor         shameful       earth
28     disease        surrender      code
29     privacy        shuttle        tapped
30     patients       announcement   patient
31     Banks          des            Bank
32     chastity       cryptography   treatment
33     moon           gordon         announc
34     orbit          lunar          chastiti
35     public         privacy        diseas
