Classification of Reddit Posts: Predicting "Not Safe For Work" Content
Anonymous Authors
University of British Columbia

Abstract

On average, six percent of posts from the social news website www.Reddit.com contain mature content and are tagged Not Safe For Work (NSFW). Given the inherent difference in content between Safe For Work (SFW) and NSFW posts, we set out to build an efficient classifier that predicts whether a post is NSFW given only the title of the post. We tested three different classifiers: Naive Bayes, Bernoulli Naive Bayes, and Stochastic Gradient Descent. We implemented a bag-of-words representation and tested different feature selection and extraction methods such as bigrams, tf-idf weighting, and χ² scoring. With a dataset of over 4 million posts, we succeeded in classifying the Reddit posts with a class-weighted F-score of 0.56 and a false negative rate of 40%.

Introduction

Reddit is a socially driven news and entertainment website. Registered users can post pictures or text, as well as comment on these posts, which are then voted up or down. The popularity of a post is typically measured by the number of up and down votes as well as the number of comments. Posts are created on a large variety of topics (see Figures 1 and 3), and posts which are considered to contain nudity, profanity, pornography, or anything generally offensive or only suitable to a mature audience are tagged as "Not Safe For Work" or NSFW.

Figure 1: A word cloud of the top subreddits of 10,000 posts randomly scraped from the front page of Reddit [1]

Two online tools currently exist to identify NSFW content given a link to a website or a Reddit post: www.isthatSFW.com and RedditSFWCheck, respectively [2]. These tools use the content of the website or the comments posted on Reddit to evaluate whether the website or post is NSFW. Similarly, we wanted to evaluate whether it is possible to classify a post as NSFW given only the title of the post.
In addition, we hypothesize that the difference in content between SFW and NSFW posts should be reflected in the titles of the posts. On the other hand, given the frequent occurrence of sarcasm and humour in post titles, many posts which seem to be "Safe For Work" (SFW) a priori turn out to be offensive and NSFW. For example, the title "I was chopping an onion when all of a sudden..." could link to a cartoon of a man crying from chopping the onion, or to a gory picture of a man with his finger cut off, some chopped onion, and blood everywhere.

We tackled this challenging classification task using three different benchmark classifiers: Naive Bayes, Bernoulli Naive Bayes, and Stochastic Gradient Descent. Following typical procedures and parameters used by natural language processing tools and document classifiers, we organized the training and testing data in a bag of words and implemented bigram feature extraction, tf-idf weighting, and χ² scoring and selection. Finally, we evaluated the different classifiers by calculating their F-score and their rate of misclassifying NSFW posts as SFW.

1 What is a NSFW Reddit post?

A typical Reddit post title is somewhere between 3 and 20 words long and is usually a sentence, statement, or question. The title of the post will usually link to a website, a picture, or text. The word cloud in Figure 2 (top) shows the most recurrent words found in a sample of 10,000 post titles from January 2013 which were not tagged as NSFW and are thus assumed to be SFW. The word cloud in Figure 1 shows the main subreddits where these posts were published.

Figure 2: Word clouds of the top words occurring in 1,000 NSFW post titles (bottom) and 10,000 SFW post titles (top) randomly scraped from the front page of Reddit [1]

Figures 2 (bottom) and 3 are word clouds showing the most frequent terms used in NSFW post titles and the names of the subreddits these posts came from.
Comparing the SFW and NSFW word clouds, we observe that most words occur mainly in one type of content and not the other, while other words are common to both types of post titles. For example, the words "Valentine's", "day", "boobs", "like", "today", "know", "anyone", and "time" are found in both word clouds. Therefore word occurrence is a good feature for our classifier but may not be sufficient to classify certain posts.

Figure 3: A word cloud of the top subreddits of 1,000 NSFW posts randomly scraped from the front page of Reddit [1]

2 Methods

This section describes the different methods implemented. Our classification pipeline includes the extraction of Reddit posts' metadata, the organisation of this data into a dataset, the selection of relevant features in our dataset, the classification of the training data, and the evaluation of the classifiers on the test data.

2.1 Scraping Reddit

The Reddit website is currently written in Python and is easy to scrape with a Python script. We extracted the metadata of 4 million posts from January 2013, including the title, date, number of votes (up and down votes), subreddit, user, NSFW binary tag, etc.

2.2 Creating an outlier-free dataset

There are two ways that a post can be tagged as NSFW on the Reddit website. Either the post was published in a subreddit with mature content where all posts are automatically tagged as NSFW (for example www.reddit.com/r/NSFWfunny), or the users who moderate the website tag the post as NSFW. The latter will naturally only occur if the post is popular enough that many users have viewed it and messaged the moderators that the post is in fact NSFW. Therefore, to prevent the collection of posts mistakenly left untagged, we scraped the January posts in March, ensuring that they had been on the website for about two months and thus giving ample time for the posts to be viewed by several viewers.
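The paper does not include its scraping script. As an illustration of the metadata extraction step, the sketch below parses one (fabricated) page in Reddit's public JSON listing format; the field names (`over_18` for the NSFW tag, `created_utc`, `ups`, `downs`) are assumptions based on Reddit's public API, not taken from the paper:

```python
import json

# A fabricated single-post listing in Reddit's public JSON layout.
SAMPLE_LISTING = json.loads("""
{"data": {"children": [
  {"data": {"title": "I was chopping an onion when all of a sudden...",
            "created_utc": 1357776000, "ups": 45, "downs": 3,
            "subreddit": "funny", "author": "someuser", "over_18": false}}
]}}
""")

def extract_posts(listing):
    """Pull out the per-post metadata fields used to build the dataset."""
    posts = []
    for child in listing["data"]["children"]:
        d = child["data"]
        posts.append({
            "title": d["title"],
            "date": d["created_utc"],
            "votes": d["ups"] + d["downs"],  # total votes cast on the post
            "subreddit": d["subreddit"],
            "user": d["author"],
            "nsfw": d["over_18"],            # the NSFW binary tag
        })
    return posts

posts = extract_posts(SAMPLE_LISTING)
print(posts[0]["votes"], posts[0]["nsfw"])  # 48 False
```

In practice the script would page through listings and accumulate the resulting dictionaries into the 4-million-post dataset.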
In addition, we eliminated unpopular posts from our dataset by filtering out the posts whose total number of votes was less than 30.

2.3 Bag of words and bigrams

Most document classifiers use a bag-of-words representation to organize the documents' contents into a set of tokens [3, 4, 5]. The vocabulary of a dataset is the set of all tokens occurring in the documents. The 300 most common English words such as "the", "and", "I", "because", better known as stopwords, are removed from the list of possible tokens. This representation ignores the structure of each document and characterizes a document by the number of occurrences of each word in the document. For example, let our dataset contain only two posts:

"The sky is blue and green"
"Oh look look! The beautiful sky!"

We first set the words to lower case and then extract all the words occurring in the title, which we denote as tokens. After removing stopwords, we obtain a vocabulary of six tokens: "sky, blue, green, oh, look, beautiful". We then build our feature matrix X where a row represents a post's title and a column represents a feature or token. The entries of the matrix denote the number of occurrences of the token in the post's title:

X = [ 1 1 1 0 0 0 ]
    [ 1 0 0 1 2 1 ]

The dimensions of the feature matrix are N × M, where N is the number of posts in our dataset and M is the number of tokens in the vocabulary.

In order to preserve some local structure when selecting the features, we chose to also store consecutive occurrences of words, or bigrams. In our example, the vocabulary of unigrams and bigrams is "sky, blue, green, oh, look, beautiful, sky blue, blue green, oh look, look look, look beautiful, beautiful sky".

2.4 Feature selection

2.4.1 Tf-idf weighting of tokens

The term frequency - inverse document frequency (tf-idf) method measures the local and global importance of a word or token in a collection of documents [5].
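The bag-of-words construction described in Section 2.3 can be sketched in a few lines of plain Python. The tiny stopword set and the tokenizer below are illustrative stand-ins for the 300-word stopword list the paper uses:

```python
import re

STOPWORDS = {"the", "is", "and"}  # tiny illustrative subset of the ~300 stopwords

def tokenize(title):
    """Lowercase a title, split on non-letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]

def bag_of_words(titles, bigrams=False):
    """Build the vocabulary and the N x M count matrix X."""
    docs = []
    for t in titles:
        toks = tokenize(t)
        if bigrams:  # append consecutive word pairs to preserve local structure
            toks = toks + [" ".join(p) for p in zip(toks, toks[1:])]
        docs.append(toks)
    vocab = []
    for d in docs:           # first-occurrence order, as in the worked example
        for tok in d:
            if tok not in vocab:
                vocab.append(tok)
    X = [[d.count(tok) for tok in vocab] for d in docs]
    return vocab, X

titles = ["The sky is blue and green", "Oh look look! The beautiful sky!"]
vocab, X = bag_of_words(titles)
print(vocab)  # ['sky', 'blue', 'green', 'oh', 'look', 'beautiful']
print(X)      # [[1, 1, 1, 0, 0, 0], [1, 0, 0, 1, 2, 1]]
```

Running it on the two example posts reproduces the matrix X given above; passing `bigrams=True` extends the vocabulary with the consecutive word pairs.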
It assigns a weight to each token using the frequency of the token in the dataset and the frequency of the token in each document. The tf-idf of a token is measured using the equation

tfidf(t, p) = tf(t, p) × idf(t)

where tf(t, p) is the term frequency per post: the number of times a token t occurs in the post p. Similarly, idf is the inverse document frequency of the token:

idf(t) = log(N / n_i)    (1)

where n_i is the number of posts in which the token occurs and N is the total number of posts in the dataset [5]. A token in a post with a high tf-idf is a token which appears many times in that post while being present in few other posts. In our example above, the word "sky" would have the lowest tf-idf and the word "look" the highest. We then obtain a new feature matrix X with weighted entries, which are normalized for each post using L2 (Euclidean) normalization [6].

2.4.2 χ² weighting and feature selection

The χ² statistic measures the association between the occurrence of a token t and a category c_i. For our two categories c_nsfw and c_sfw, the χ² measure becomes [4, 5]:

χ²(t, c_nsfw) = N × [P(t, c_nsfw) P(t̄, c_sfw) − P(t, c_sfw) P(t̄, c_nsfw)]² / [P(t) P(t̄) P(c_nsfw) P(c_sfw)]    (2)

where P(t, c_nsfw) is the probability that token t occurs in a post which is NSFW and P(t̄, c_sfw) is the probability that token t is absent from a post which is SFW.
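Equations (1) and (2) can be computed directly from counts; the sketch below uses the two-post example from Section 2.3 for tf-idf and estimates the χ² probabilities as count fractions from a 2×2 contingency table (the specific contingency counts at the end are illustrative, not from the paper):

```python
import math

def tfidf(token, post, posts):
    """Equation (1): tf(t, p) * log(N / n_i)."""
    tf = post.count(token)
    n_i = sum(1 for p in posts if token in p)  # posts containing the token
    return tf * math.log(len(posts) / n_i)

def chi2(t_nsfw, t_sfw, no_t_nsfw, no_t_sfw):
    """Equation (2), from a 2x2 contingency table of post counts:
    t_nsfw = NSFW posts containing token t, no_t_sfw = SFW posts without it, etc."""
    N = t_nsfw + t_sfw + no_t_nsfw + no_t_sfw
    p = lambda n: n / N  # probabilities estimated as count fractions
    num = N * (p(t_nsfw) * p(no_t_sfw) - p(t_sfw) * p(no_t_nsfw)) ** 2
    den = (p(t_nsfw + t_sfw) * p(no_t_nsfw + no_t_sfw)
           * p(t_nsfw + no_t_nsfw) * p(t_sfw + no_t_sfw))
    return num / den

# "sky" occurs in every post, so idf = log(2/2) = 0 and its tf-idf vanishes;
# "look" occurs twice in one post only, giving the highest weight.
posts = [["sky", "blue", "green"], ["oh", "look", "look", "beautiful", "sky"]]
print(tfidf("sky", posts[1], posts))   # 0.0
print(tfidf("look", posts[1], posts))  # 2 * log(2) ≈ 1.386

# A token seen in 30 of 40 NSFW posts but only 10 of 40 SFW posts:
print(chi2(30, 10, 10, 30))            # 20.0
```

A token distributed independently of the categories (equal counts in every cell) scores χ² = 0, while strongly category-specific tokens score high, which is what makes the statistic useful for feature selection.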