Content Categorization for Contextual Advertising Using Wikipedia
Total Page:16
File Type:pdf, Size:1020Kb
Content Categorization for Contextual Advertising Using Wikipedia Ingrid Grønlie Guren August 2, 2015 Content Categorization for Contextual Advertising Using Wikipedia Ingrid Grønlie Guren August 2, 2015 ii Abstract Automatic categorization of content is an important functionality in online ad- vertising and automated content recommendations, both for ensuring contextual relevancy of placements and for building up behavioral profiles for users that consume the content. Within the advertising domain, the taxonomy tree that content is classified into is defined with some commercial application in mind to somehow reflect the advertising platform’s ad inventory. The nature of the ad inventory and the language of the content might vary across brokers (i.e., the operator of the advertising platform), so it was of interest to develop a system that can easily bootstrap the development of a well-working classifier. We developed a dictionary-based classifier based on titles from Wikipedia articles where the titles represent entries in the dictionary. The idea of the dictionary-based classifier is so simple that it can be understood by users of the program, also those who lack technical experience. Further, it has the ad- vantage that its users easily can expand the dictionary with desirable words for specific advertisement purposes. The process of creating the classifier includes a processing of all Wikipedia article titles to a form more likely to occur in docu- ments, before each entry is graded to their most describing Wikipedia category path. The Wikipedia category paths are further mapped to categories based on the taxonomy of Interactive Advertising Bureau (IAB), which are categories relevant for advertising. The results of this process is a dictionary with entries connected to categories from the taxonomy, and forms the base of our classi- fier. Finally, we explored the possibilities of using Wikipedia’s internal links to translate the English classifier’s dictionary to a Norwegian dictionary. The evaluation of the classifier was performed on rappler.com for the En- glish classifier and adressa.no for the Norwegian classifier. The results of the classifiers were compared with a class tag within the url structure of published articles, and we could see that the classifiers were able to correctly categorize most articles. However, there is room for further improvement of the classifier in order to achieve higher evaluation scores. This is partly because our dictionary- based classifier is a one-to-many classifier, while we compare the results to a one-to-one classification. Overall, we found that we are able to create a varied and thorough dictionary by just exploring the titles of Wikipedia articles, and that the classifier gives a good indication of the content of articles. iii iv Acknowledgements This study has been a collaboration project between the Department of Infor- matics at the University of Oslo and Cxense (http://www.cxense.com). It was started in the Spring 2014 and finished in August 2015. First, I would like to express my gratitude to my supervisor Professor Alek- sander Øhrn for all his incredibly important feedback and for all the advice he has given me through the process. His ideas for this project have been in- credible valuable, and his comments have helped me every time I needed help with a problem. I am very grateful for his thorough scrutiny of the thesis. I would also like to thank his colleague Gisle Ytrestøl for all his help, including all the quick responses on emails, long discussions about implementations and his never-ending support and optimism about the project. A special thanks goes to my study friends, especially Sindre, Frida and Karine, for constantly reminding me how little time I had left, but still support- ing me with discussions and good company at the University. I would also like to thank my friend Elisabeth for motivating me every day and for never asking me to stop talking about my thesis. Your support has been beyond words. Finally I wish to thank my family for all help, comments, discussions, feed- back on what I’ve written, and most important; for supporting me every day. I could never have done this without any of you. Ingrid Grønlie Guren August 2, 2015 i ii Contents 1 Introduction1 1.1 Motivation..............................1 1.2 The Project..............................1 1.3 An Overview of Challenges.....................3 1.4 Thesis Outline............................5 2 Background Materials7 2.1 Automatic Content Analysis.....................7 2.1.1 What is Content Analysis?.................7 2.1.2 Content Analysis in Advertising..............8 2.2 Categorization............................9 2.3 Wikipedia............................... 11 2.3.1 Structure of Wikipedia.................... 11 2.3.2 Accessing Information from Wikipedia........... 12 2.4 Interactive Advertising Bureau (IAB)............... 13 2.5 Cxense................................. 16 3 Related Works 19 3.1 Similar Projects............................ 19 3.2 Wikipedia’s Category Structure................... 20 3.3 Wikipedia as Encyclopedic Knowledge............... 21 3.4 Classifiers Based on Wikipedia................... 22 3.4.1 Evaluation of the Classifiers................. 23 3.5 Disambiguation............................ 24 4 Methods 27 4.1 Finding the meaning of Wikipedia Articles............. 27 4.1.1 Representing the underlying structure........... 27 4.2 Grading Categories.......................... 29 4.2.1 Grading based on Inlinks and Outlinks........... 29 4.2.2 Normalized Grading based on Inlink and Outlink Numbers 30 4.2.3 Deciding Relevant Paths................... 30 4.3 Evaluation............................... 31 4.3.1 Evaluation of the Classifier................. 31 4.3.2 Optimizing the Classifier.................. 32 iii iv CONTENTS 5 Implementation 35 5.1 Finding Full Path of Articles.................... 35 5.1.1 Creating the Underlying Category Structure........ 35 5.1.2 Representing the Underlying Structure........... 39 5.1.3 Following Links Between Categories............ 40 5.1.4 Redirects........................... 41 5.2 Id Mapping.............................. 44 5.3 Grading of Categories........................ 45 5.3.1 Grading based on Inlink and Outlink Numbers...... 45 5.3.2 Normalized Grading Based on Inlinks and Outlinks.... 49 5.4 Mapping to Desirable Output Categories.............. 50 5.4.1 Mapping based on Wikipedia Categories.......... 50 5.4.2 Mapping based on Wikipedia Path Excerpts........ 51 5.4.3 Automatic Mapping..................... 51 5.4.4 Processing Titles....................... 52 5.5 Dictionary-based Classifier for Other Languages.......... 54 5.5.1 Creating a Norwegian Dictionary-based Classifier..... 54 5.5.2 Deploying the Results.................... 57 6 Results and Discussion 59 6.1 Evaluation of Category Mapping.................. 59 6.1.1 Mapping from Path Excerpts to Output Categories.... 60 6.2 Versions of the Dictionary-Based Classifier............. 68 6.2.1 IAB Dictionary-1 (iab-1).................. 68 6.2.2 IAB Dictionary-2 (iab-2).................. 71 6.2.3 IAB Dictionary-3 (iab-3).................. 71 6.2.4 IAB Dictionary-4 (iab-4).................. 71 6.2.5 IAB Dictionary-5 (iab-5).................. 71 6.2.6 IAB Dictionary-6 (iab-6).................. 72 6.2.7 Variation of Categories and Number of Entries for the Different Dictionary Versions................ 72 6.3 Results from the Classifier...................... 74 6.3.1 Retrieving Results from Cxense............... 75 6.3.2 Weight for Classification................... 76 6.3.3 Results for Sports...................... 79 6.3.4 Results for Arts & Entertainment.............. 80 6.3.5 Results for Technology & Computing............ 81 6.3.6 Global Evaluation of the Classifier............. 82 6.3.7 Comparison with Another Classifier............ 83 6.4 Evaluation of the Norwegian Classifier............... 84 6.5 Discussion of the Results....................... 85 7 Conclusion and Further Works 89 7.1 Conclusion.............................. 89 7.2 Further Works............................ 90 References 93 List of Figures 1.1 Illustration of the mapping between keywords and categories...2 1.2 Simplified illustration of the categorization process.........3 1.3 Example of disambiguation in Wikipedia..............5 2.1 Retargeting within Interest-based Advertising...........9 2.2 Categorization of keywords to Wikipedia categories........ 10 2.3 Categorzation process of the keywords............... 10 2.4 Subcategories of the category Astrid Lindgren........... 12 2.5 Categories for an Wikipedia article................. 12 2.6 Categories of the IAB Taxonomy.................. 15 2.7 The categorization process of the keywords............ 16 2.8 Illustration of entry matching process............... 16 4.1 Simplified illustration of the underlying structure of Wikipedia.. 27 4.2 The structure where each category knows its subcategories.... 28 4.3 The structure where each category knows the title of its articles. 28 4.4 Example of inlink number and outlink number for a category.. 29 4.5 Illustration of a perfect classifier................... 33 5.1 INSERT statement entry in enwiki-latest-categorylinks.sql.gz 36 5.2 Excerpt from enwiki-latest-categorylinks.sql.gz ...... 37 5.3 Insert statement for hidden category................ 38 5.4 Example path with hidden category................ 38 5.5 Example path without hidden category............... 38 5.6 INSERT statement with newline................... 39 5.7 Example of sortkey in Wikipedia.................