Privacy aware social information retrieval and spam filtering using folksonomies

Dissertation zur Erlangung des naturwissenschaftlichen Doktorgrades der Bayerischen Julius–Maximilians–Universität Würzburg

vorgelegt von

Beate Navarro Bullock

aus Nürnberg

Würzburg, 2015

Eingereicht am: 27.04.2015 bei der Fakultät für Mathematik und Informatik

1. Gutachter: Prof. Dr. Andreas Hotho
2. Gutachter: Prof. Dr. Gerd Stumme

Tag der mündlichen Prüfung: 20.07.2015

Abstract

Social interactions as introduced by Web 2.0 applications during the last decade have changed the way the Internet is used. Today, it is part of our daily lives to maintain contacts through social networks, to comment on the latest developments in microblogging services or to save and share information snippets such as photos or bookmarks online. Social bookmarking systems are part of this development. Users can share links to interesting web pages by publishing bookmarks and providing descriptive keywords for them. The structure which evolves from the collection of annotated bookmarks is called a folksonomy. The sharing of interesting and relevant posts enables new ways of retrieving information from the Web. Users can search or browse the folksonomy, looking at resources related to specific tags or users. Ranking methods known from search engines have been adjusted to facilitate retrieval in social bookmarking systems. Hence, social bookmarking systems have become an alternative or addendum to search engines. In order to better understand the commonalities and differences of social bookmarking systems and search engines, this thesis compares several aspects of the two systems’ structure, usage behaviour and content. This includes the use of tags and query terms, the composition of the document collections and the rankings of bookmarks and search engine URLs. Searchers (recorded via session ids), their search terms and the clicked URLs can be extracted from a search engine query logfile. They form links similar to those found in folksonomies, where a user annotates a resource with tags. We use this analogy to build a tripartite hypergraph from query logfiles (a logsonomy), and compare structural and semantic properties of log- and folksonomies. Overall, we have found similar behavioural, structural and semantic characteristics in both systems.
Driven by this insight, we investigate whether folksonomy data can be of use in web information retrieval in a similar way to query log data: we construct training data from query logs and a folksonomy to build models for a learning-to-rank algorithm. First experiments show a positive correlation of ranking results generated from the ranking models of both systems. The research is based on various data collections from the social bookmarking systems BibSonomy and Delicious, Microsoft’s search engine MSN (now Bing) and Google data. To maintain social bookmarking systems as a good source for information retrieval, providers need to fight spam. This thesis introduces and analyses different features derived from the specific characteristics of social bookmarking systems to be used in spam detection classification algorithms. Best results are achieved with a combination of profile, activity, semantic and location-based features. Based on the experiments, a spam detection framework which identifies and eliminates spam activities has been developed for the social bookmarking system BibSonomy. The storing and publication of user-related bookmarks and profile information raises questions about user data privacy. What kinds of personal information are collected and how do systems handle user-related items? In order to answer these questions, the thesis looks into the handling of data privacy in the social bookmarking system BibSonomy. Legal guidelines about how to deal with the private data collected and processed in social bookmarking systems are also presented. Experiments show that considering user data privacy in the process of feature design can be a first step towards strengthening data privacy.


Zusammenfassung

Soziale Interaktion, wie sie im letzten Jahrzehnt durch Web 2.0 Anwendungen eingeführt wurde, änderte die Art und Weise, wie wir das Internet nutzen. Heute gehört es zum Alltag, Kontakte in sozialen Netzwerken zu pflegen, die aktuellsten Entwicklungen in Mikroblogging-Anwendungen zu kommentieren oder interessante Informationen wie Fotos oder Weblinks digital zu speichern und zu teilen. Soziale Lesezeichensysteme sind ein Teil dieser Entwicklung. Nutzer können Links zu interessanten Webseiten teilen, indem sie diese mit aussagekräftigen Begriffen (Tags) versehen und veröffentlichen. Die Struktur, die aus der Sammlung von annotierten Lesezeichen entsteht, wird Folksonomy genannt. Nutzer können diese durchforsten und nach Links mit bestimmten Tags oder von bestimmten Nutzern suchen. Ranking-Methoden, die schon in Suchmaschinen implementiert wurden, wurden angepasst, um die Suche in sozialen Lesezeichensystemen zu erleichtern. So haben sich diese Systeme mittlerweile zu einer ernsthaften Alternative oder Ergänzung zu traditionellen Suchmaschinen entwickelt. Um Gemeinsamkeiten und Unterschiede in der Struktur, Nutzung und in den Inhalten von sozialen Lesezeichensystemen und Suchmaschinen besser zu verstehen, werden in dieser Arbeit die Verwendung von Tags und Suchbegriffen, die Zusammensetzung der Dokumentensammlungen und der Aufbau der Rankings verglichen und diskutiert. Aus den Suchmaschinennutzern eines Logfiles, ihren Anfragen und den geklickten Rankingergebnissen lässt sich eine ähnliche tripartite Struktur wie die der Folksonomy aufbauen. Die Häufigkeitsverteilungen sowie strukturellen Eigenschaften dieses Graphen werden mit der Struktur einer Folksonomy verglichen. Insgesamt lassen sich ein ähnliches Nutzerverhalten und ähnliche Strukturen aus beiden Ansätzen ableiten.
Diese Erkenntnis nutzend, werden im letzten Schritt der Untersuchung Trainings- und Testdaten aus Suchmaschinenlogfiles und Folksonomien generiert und ein Rankingalgorithmus trainiert. Erste Analysen ergeben, dass die Rankings, generiert aus implizitem Feedback von Suchmaschinen und Folksonomien, positiv korreliert sind. Die Untersuchungen basieren auf verschiedenen Datensammlungen aus den sozialen Lesezeichensystemen BibSonomy und Delicious und aus Daten der Suchmaschinen MSN (jetzt Bing) und Google. Damit soziale Lesezeichensysteme als qualitativ hochwertige Informationssysteme erhalten bleiben, müssen Anbieter den in den Systemen anfallenden Spam bekämpfen. In dieser Arbeit werden verschiedene Merkmale von legitimen und nicht legitimen Nutzern aus den Besonderheiten von Folksonomien abgeleitet und auf ihre Eignung zur Spamentdeckung getestet. Die besten Ergebnisse ergibt eine Kombination aus Profil-, Aktivitäts-, semantischen und ortsbezogenen Merkmalen. Basierend auf den Experimenten wird eine Spamentdeckungsanwendung entwickelt, mit deren Hilfe Spam im sozialen Lesezeichensystem BibSonomy erkannt und eliminiert wird. Mit der Speicherung und Veröffentlichung von benutzerbezogenen Daten ergibt sich die Frage, ob die persönlichen Daten eines Nutzers in sozialen Lesezeichensystemen noch genügend geschützt werden. Welche Arten persönlicher Daten werden in diesen Systemen gesammelt und wie gehen existierende Systeme mit diesen Daten um? Um diese Fragen zu beantworten, wird die Anwendung BibSonomy unter technischen und datenschutzrechtlichen Gesichtspunkten analysiert. Es werden Richtlinien erarbeitet, die als Leitfaden für den Umgang mit persönlichen Daten bei der Entwicklung und dem Betrieb von sozialen Lesezeichensystemen dienen sollen. Experimente zur Spamklassifikation zeigen, dass die Berücksichtigung von datenschutzrechtlichen Aspekten bei der Auswahl von Klassifikationsmerkmalen persönliche Daten schützen kann, ohne die Performanz des Systems bedeutend zu verringern.

Acknowledgements

This thesis would not have been possible without the help, expertise and patience of many people. First, I would like to thank my advisor Prof. Dr. Andreas Hotho for his support and encouragement over the years. From the first project in 2007 to the last feedback in 2015 he supported me in the realization of this thesis. During my time as a research assistant, he helped me with teaching and student mentoring and always willingly provided technical know-how. His contributions and encouragement helped turn the past years into an amazing journey and made it possible to finally reach the finish line. I am also very thankful to Prof. Dr. Gerd Stumme for his continuous support and for always having an open ear during my years in Kassel. In both teams, in Kassel and Würzburg, I found an enthusiastic learning atmosphere. Ideas were welcome, and time and resources were organized to realise them. I very much enjoyed this creative, optimistic environment. Thank you to Prof. Dr. Frank Puppe for supporting me by being the chairman of the board of examiners and for integrating me into the Würzburg team. The ideas, approaches and methods discussed originate from my work in the university teams at Kassel and Würzburg. Thank you to the coauthors and team members, especially to Christoph, Dominik, Elmar, Folke, Robert and Stephan from the Knowledge and Data Engineering Group in Kassel and Florian, Martin, Peter, Martina and the other team members from the University of Würzburg. Also, many thanks to Sven, Björn, Alexander, Monika and Petra for their support. Additionally, I would like to thank Hana Lerch and Prof. Dr. Alexander Roßnagel from the legal department at the University of Kassel for their collaboration in the project “Informationelle Selbstbestimmung im Web 2.0”. The legal discussion in this thesis is based on their input. Our discussions were stimulating and allowed me to see a new perspective on the handling of personal data. Finally, I thank my family.
My parents, who encouraged and helped me at every stage of my life. My husband Pablo, not only for his patience and humour, but also for his English thesis-guidance and so many Sunday mornings taking our daughters Laura and later Carmen to the playground while I stayed at home continuing to work step-by-step. Just the same, thank you for two curious, active and happy girls, who always make me smile when they are home again.


Contributions

The research presented in this thesis has been conducted together with other colleagues of the Knowledge and Data Engineering Team of the University of Kassel, the institute of law of the University of Kassel and the Applied Informatics and Artificial Intelligence Team of the University of Würzburg. The different findings therefore result from joint work within the research teams. For each of the three research fields presented in this thesis, the author’s contributions and the collaborations with other authors are listed below.

Social Search

Contributions

• The experiments to compare search systems and folksonomies have been conducted by the author. Andreas Hotho and Gerd Stumme supported the work with helpful ideas and the discussion of results. The comparison of rankings was computed by the author using an implementation of the FolkRank algorithm by Robert Jäschke.

• The idea of creating a folksonomy-like structure from logdata stems from Robert Jäschke. The author contributed to this work with the computational transformation of logdata into a logsonomy and the conduction of experiments on network properties. Parts of these experiments are based on the implementation and design of experiments by Christoph Schmitz.

• The semantic analysis of a logsonomy was done together with Dominik Benz and Praveen Kumar. The author’s input consisted of the compilation of data and the discussion of results.

• The idea of enriching implicit feedback data by using tagging information has been introduced by the author. While the experiments concerning the FolkRank algorithm have been conducted by Robert Jäschke, the other paper’s experiments have been computed and analysed by the author. Results were discussed with Robert Jäschke and Andreas Hotho.

Publications

• Beate Krause, Andreas Hotho, Gerd Stumme. (2008). A comparison of social bookmarking and traditional search. ECIR 2008.

• Beate Krause, Robert Jäschke, Andreas Hotho, Gerd Stumme. (2008). Logsonomy - Social Information Retrieval with Logdata. Hypertext 2008.

• Beate Navarro Bullock, Robert Jäschke, Andreas Hotho. (2011). Tagging data as implicit feedback for learning-to-rank. Poster at WebScience 2011, Koblenz, Germany.

• Dominik Benz, Beate Krause, G. Praveen Kumar, Andreas Hotho, Gerd Stumme. (2009). Characterizing Semantic Relatedness of Search Query Terms. Bled, Slovenia.

Spam Detection

Contributions

• The idea of detecting spam arose from the necessity to fight the spam in the social bookmarking system BibSonomy. The dataset was created by all administrators of the group. Most of the features were implemented by Christoph Schmitz. The additional implementation of features, and the analysis and comparison of different classifiers, were conducted by the author.

• The spam detection framework was implemented by Stephan Stützer in the scope of a master project. It was adjusted by the author to be used with BibSonomy. From the beginning of 2009 to mid-2011 it was run by the author.

• The idea, implementation and computation of the detection of local patterns in spam data was done by Florian Lemmerich, Martin Atzmüller and Andreas Hotho. The author compiled the spam data and contributed to the interpretation of results.

Publications

• Beate Krause, Christoph Schmitz, Andreas Hotho, Gerd Stumme. (2008). The Anti-Social Tagger - Detecting Spam in Social Bookmarking Systems. AIRWEB 2008.

• Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, Gerd Stumme. (2010). The social bookmark and publication management system BibSonomy. The VLDB Journal, Vol. 19, pp. 849-875. Berlin/Heidelberg: Springer.

• Martin Atzmueller, Florian Lemmerich, Beate Krause, Andreas Hotho. (2009). Spammer Discrimination: Discovering Local Patterns for Concept Characterization and Description. Bled, Slovenia.

Data Privacy

Contributions

• The analysis of privacy aspects in BibSonomy results from the cooperation with the data privacy group, in particular Hana Lerch and Alexander Roßnagel. The system analysis took place in extensive discussions at the project meetings, including Gerd Stumme, Andreas Hotho, Hana Lerch, Alexander Roßnagel and the author.

• The comparison of different features for spam detection and their qualification for data privacy aware data mining was conducted by Hana Lerch and the author. While the author contributed the implementation and computation of the different features, Hana Lerch provided the legal evaluation.

Publications

• Beate Krause, Hana Lerch, Andreas Hotho, Alexander Roßnagel, Gerd Stumme. (2010). Datenschutz im Web 2.0 am Beispiel des sozialen Tagging-Systems BibSonomy. Informatik-Spektrum.

• Beate Navarro Bullock, Hana Lerch, Andreas Hotho, Alexander Roßnagel, Gerd Stumme. (2011). Privacy-aware spam detection in social bookmarking systems. i-Know 2011, Graz, Austria.


Contents

1 Introduction
   1.1 Motivation
   1.2 Problem Statement
      1.2.1 Social Search
      1.2.2 Spam Detection
      1.2.3 Data Privacy
   1.3 Thesis Outline

I Foundations

2 Collaborative Tagging
   2.1 Characterizing Collaborative Tagging Systems
      2.1.1 Indexing Documents
      2.1.2 System Properties
      2.1.3 Tagging Properties
   2.2 Formal Model
   2.3 Tagging Dynamics
   2.4 Collaborative Tagging and the Semantic Web
   2.5 Mining Tagging Data
      2.5.1 Ranking
      2.5.2 Recommender Systems
      2.5.3 Community Detection
   2.6 Example Systems

3 Social Information Retrieval
   3.1 Characterizing Social Information Retrieval
   3.2 Exploiting Clickdata
      3.2.1 Query Log Basics
      3.2.2 Search Engine Evaluation with Query Logs
      3.2.3 Learning-to-Rank
      3.2.4 Clickdata as a tripartite Network: Logsonomies
   3.3 Exploiting Tagging Data
      3.3.1 Exploiting Bookmarks
      3.3.2 Exploiting Tags


4 Spam Detection
   4.1 Definition
   4.2 General Spam Detection Approaches
      4.2.1 Heuristic Approaches
      4.2.2 Machine Learning Approaches
      4.2.3 Evaluation Measures for Spam Filtering
   4.3 Characterizing
      4.3.1 Spam in Social Bookmarking Systems
      4.3.2 Spam Detection in other Social Media Applications

5 Data Privacy
   5.1 Basic Concepts
      5.1.1 Privacy as the Right to Informational Self-Determination
      5.1.2 Guidelines and Definitions
      5.1.3 Privacy Principles
   5.2 Legal Situation in Germany
      5.2.1 German Federal Data Protection Act
      5.2.2 German Federal Telemedia Act
   5.3 Data Privacy in the Social Web

II Methods

6 Social Information Retrieval in Folksonomies and Search Engines
   6.1 Introduction
   6.2 Datasets
      6.2.1 Overview of Datasets
      6.2.2 Construction of Folk- and Logsonomy Datasets
      6.2.3 Construction of User Feedback Datasets
   6.3 Comparison of Searching and Tagging
      6.3.1 Analysis of Search and Tagging Behaviour
      6.3.2 Analysis of Search and Tagging System Content
      6.3.3 Discussion
   6.4 Properties of Logsonomies
      6.4.1 Degree distribution
      6.4.2 Structural Properties
      6.4.3 Semantic Properties
      6.4.4 Discussion
   6.5 Exploiting User Feedback
      6.5.1 Implicit Feedback from Tagging Data for Learning-to-Rank
      6.5.2 Mapping click and tagging data to rankings
      6.5.3 Experimental Setup
      6.5.4 Discussion
   6.6 Summary


7 Spam Detection in Social Tagging Systems
   7.1 Introduction
   7.2 Datasets
      7.2.1 Dataset Creation
      7.2.2 Dataset Descriptions
   7.3 Feature Engineering
      7.3.1 Feature Description
      7.3.2 Experimental Setup
      7.3.3 Results
      7.3.4 Discussion
   7.4 ECML/PKDD Discovery Challenge 2008
      7.4.1 Task Description
      7.4.2 Methods
      7.4.3 Results
      7.4.4 Discussion
   7.5 Frequent Patterns
      7.5.1 Quality Functions for Discovering Frequent Spam Patterns
      7.5.2 Experimental Setup
      7.5.3 Results
      7.5.4 Discussion
   7.6 Summary

8 Data Privacy in Social Bookmarking Systems
   8.1 Introduction
   8.2 Legal Analysis of Spam Detection
   8.3 Privacy Aware Spam Experiments
      8.3.1 Experimental Setup
      8.3.2 Evaluation
      8.3.3 Results and Discussion
   8.4 Summary

III Applications

9 BibSonomy Spam Framework
   9.1 BibSonomy Spam Statistics
   9.2 Framework Processes and Architecture
      9.2.1 Generation of the Training Model
      9.2.2 Classifying New Instances
   9.3 Implementation Details
   9.4 Framework Interface
   9.5 Summary

10 Case Study of Data Privacy in BibSonomy
   10.1 Introduction
   10.2 Data Privacy Analysis
      10.2.1 Registration


      10.2.2 Spam Detection
      10.2.3 Storing Posts
      10.2.4 Storing and Processing Publication Metadata
      10.2.5 Search in BibSonomy
      10.2.6 Forwarding Data to a Third Party
      10.2.7 Membership Termination
      10.2.8 BibSonomy as a Research Project
   10.3 Discussion

11 Conclusion and Outlook
   11.1 Social Search
   11.2 Spam Detection
   11.3 Data Privacy

A Appendix

List of Figures

2.1 Elements of the folksonomy
2.2 Frequency distributions of Li et al. [2008] and Wetzker et al. [2008]

3.1 Example of two weight vectors generated by RankingSVM

4.1 Hyperplane which separates positive and negative examples in a multidimensional space
4.2 The ROC space and its interpretation

6.1 Distribution of items in Delicious and MSN on a log-log scale
6.2 Time series of two highly correlated items, “vista” and “iran”
6.3 Degree distribution of tags/query words/queries
6.4 Degree distribution of resource nodes
6.5 Degree distribution of user nodes
6.6 Average semantic distance, measured in WordNet, from the original to the most closely related one

7.1 Histogram of the number of digits in the username and e-mail address
7.2 ROC curves of the frequency and tf-idf tag features
7.3 ROC curves of the different classifiers considering all features
7.4 ROC curves of the different feature groups
7.5 ROC curves of the two semantic feature groups
7.6 AUC values of different submissions

9.1 Newly registered spammers and non-spammers having at least one post tracked over time
9.2 The three actors (user, framework and administrator) of the spam classification process
9.3 Basic components of the spam framework
9.4 Administrator interface used to flag BibSonomy users as spammers or legitimate users


List of Tables

1.1 Thesis outline

6.1 Input datasets from the social bookmarking system Delicious and the search engines MSN, AOL and Google
6.2 Datasets for the structural comparison of folk- and logsonomies
6.3 Datasets for the semantic comparison of folk- and logsonomies
6.4 Statistics of item frequencies in Delicious and MSN in May 2006
6.5 The top five items and their frequencies of Delicious and MSN in May 2006
6.6 Averages of overlapping URLs computed over 1,776 rankings
6.7 Average overlap of top 50 normalized URLs from 1,776 rankings
6.8 Average overlap with top 100/1,000 normalized Delicious URLs
6.9 Correlation values and number of significant correlations from rankings
6.10 Intersections and correlations for the top 10 correlations
6.11 Average shortest path lengths, cliquishness and connectedness of each dataset and the corresponding random graphs
6.12 Examples of most related tags for each of the presented measures
6.13 Example of query and tag mapping
6.14 Correlation of each ranking list with all other ranking lists
6.15 Prediction errors made by models derived from training data
6.16 Prediction errors obtained over all different test sets

7.1 Sizes of users, tags, resources and TAS of the SpamData07
7.2 User ratios of spam and non-spammers in the training and test dataset SpamData07
7.3 The tables of the training dataset SpamData08
7.4 Distribution of posts and users among bookmark/BIBTEX posts (SpamData08 training)
7.5 Distribution of posts and users among bookmark/BIBTEX posts (SpamData08 test)
7.6 Description of the profile features
7.7 Description of the location based features
7.8 Description of the activity based features
7.9 Description of the semantic features
7.10 Confusion matrix of the frequency tag baseline
7.11 Confusion matrix of the tf-idf tag baseline
7.12 Evaluation values of all features


7.13 Evaluation values of the different feature groups
7.14 Evaluation with a cost sensitive classifier
7.15 Example of a result file for the ECML/PKDD spam challenge
7.16 Summary of features describing spammers and non-spammers
7.17 Concept Characterization of non-spammers
7.18 Concept Discrimination of non-spammers
7.19 Concept Description using the F-Measure
7.20 Concept Characterization of spammers
7.21 Concept Discrimination of spammers

8.1 Description of feature groups
8.2 Order of feature groups in consideration of data privacy aspects
8.3 Performance overview of AUC values of different feature groups
8.4 Performance overview of AUC values for the best classifier of all features

9.1 Basic figures of BibSonomy and the spam framework until the end of 2011

A.1 Features used for computing rankings described in Section 6.5.3

Chapter 1

Introduction

1.1 Motivation

At the start of the Internet era, people mostly used the Web to search and read static digital content. A paradigm shift occurred with the advent of the Social Web, where user participation and the creation of social, virtual relationships became an integral part of Internet usage. The term Web 2.0, first introduced during a brainstorming session between O’Reilly and MediaLive International in 2004 [O’Reilly, 2005], has become accepted to describe Social Web applications integrating features which center around the participation of users. Such Web 2.0 applications comprise online marketplaces such as eBay 1 or Amazon 2, content sharing sites such as Flickr 3 or Delicious 4, online social networks such as Facebook 5, or the online encyclopedia Wikipedia 6. Nowadays, social services are part of many users’ day-to-day digital experience. Many activities of everyday life have moved partially or fully into the virtual world. Friendships are maintained via social networks, online shopping saves time and energy, and news or opinions are communicated via text messaging services. The growing popularity of Web 2.0 applications has led to a wealth of digital user data which can be integrated and analysed. The exploration of such data helps to gain knowledge about user interests and preferences as well as the connections between them. This information can be harnessed by applications, for example by providing personalized recommendations and personal search algorithms or by tailoring commercial advertisements to each user. One of the popular types of Web 2.0 applications are social bookmarking systems. In these systems users share online resources in the form of bookmarks. To make the bookmarks retrievable for themselves and others they add tags, i. e., descriptive keywords, to their resources. With many different users sharing bookmarks and tagging them, a common information structure is created which can be called a folksonomy.
The structure emerges over time influenced by the numerous users and the possibility to interact with each other. In general, it reflects a common knowledge, a form of collective intelligence, among the

1 http://www.ebay.com/
2 http://www.amazon.com/
3 http://www.flickr.com/
4 https://delicious.com/
5 http://www.facebook.com/
6 http://www.wikipedia.org/

system’s users. The integration of user created content and the collection of user generated data opens a wide field for exploration, as many different aspects of handling social data can be considered: from gaining an understanding of the evolving dynamic structures, to building user-tailored functionalities based on personal data, to analysing the collected data to solve specific system problems such as spam. This thesis concentrates on these research areas: The first part deals with search in social bookmarking systems compared to existing search engines. How do both systems differ and how can they benefit from each other? The term Social Search embraces all research done in this field. The use of social bookmarking systems (and Web 2.0 applications in general) is not always a win-win situation for a system’s provider and its users. Two major problems can be identified in this context: the identification of spam users and the protection of user privacy, which are the topics of the thesis’ second and third core areas, spam detection and the protection of data privacy in the Social Web. Each of the research fields just defined will be further described in the following paragraphs. Social Search. Depending on the number of users and their interests, social bookmarking systems provide large document collections concerning many different topics. Each bookmark is enhanced with metadata in the form of tags. The knowledge collected through sharing and describing bookmarks can be leveraged by information retrieval techniques. People can therefore use social bookmarking systems to find information – in a similar way as they use other existing online information retrieval systems such as search engines. One major difference between information retrieval in search engines and social bookmarking systems is the way the document collection is created [Krause et al., 2008a]. Search engine providers automatically crawl the Web.
New sources are retrieved by following the hyperlinks of an already processed website. In social bookmarking systems, the document collection is created manually by the system’s users, who post bookmarks they find interesting. In search engines, the retrieved documents are indexed using the document’s text and other available (meta-)data. Users can search through the document collection by entering search words in a simple user interface and click on the search results shown in a list sorted by relevance. Information retrieval in social bookmarking systems is primarily supported by the system’s specific structure. Since each post containing a bookmark has been submitted by a user and enriched with metadata in the form of tags, each document in the collection is linked to others by this information. Users can find new sources by browsing through all bookmarks with specific tags or by looking at posts of users they find interesting. The tripartite structure enables serendipitous browsing [Benz et al., 2010a]: users “stumble” on interesting sources by following the links provided by the system’s user interface. While relevance in search engines is the result of an algorithm taking different aspects such as the page’s link structure, content or the site’s refresh period into account, the relevance of social bookmarking entries is determined by the systems’ users. When they spend time to bookmark and describe a resource, they show that the resource is relevant for the selected tags. In recent years, search engines have integrated many social features into their searches. Rankings are influenced by (implicit) user feedback such as a click on a link shown in the result list of a specific search. Further, result pages and user accounts allow users to share their results, store searches or recommend resources. On the other hand, social bookmarking systems have adopted techniques from information retrieval to


enhance their search (one example is the FolkRank algorithm [Hotho et al., 2006a] based on Google’s PageRank algorithm [Brin and Page, 1998]). At the beginning of social bookmarking systems, no one could have predicted where Web 2.0 and social bookmarking systems were headed. Would they establish themselves as an alternative to online search systems or vanish as soon as the hype was over? What made those systems attractive for users? Could search engines generally benefit from social interactions as introduced in Web 2.0 applications? In order to gain a better understanding of the differences and similarities of social bookmarking systems and search engines, different aspects such as user behaviour and the systems’ usage or structure were analysed in several comparative studies (for example [Heymann et al., 2008; Kolay and Dasdan, 2009b; Noll and Meinel, 2008a]). The first part of this thesis deals with a comparative analysis of search in folksonomies and search in search engines. Similarities and differences between the usage of tags and query terms, the document collections and the structure of the two systems are presented and discussed.
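To make the structural analogy between folksonomies and search engine logs concrete, the following minimal sketch represents both as sets of triples. All data, names and the whitespace tokenisation are hypothetical simplifications for illustration, not the thesis’ actual datasets or preprocessing:

```python
# Illustrative sketch: a folksonomy and a logsonomy as sets of triples,
# making their shared tripartite (user, tag, resource) form visible.

def folksonomy_triples(posts):
    """Posts are (user, tags, resource); one triple per tag assignment."""
    return {(user, tag, resource)
            for user, tags, resource in posts
            for tag in tags}

def logsonomy_triples(log):
    """Log rows are (session_id, query, clicked_url); query terms act as
    tags and session ids play the role of users."""
    return {(session, term, url)
            for session, query, url in log
            for term in query.lower().split()}

# Hypothetical example data.
posts = [("alice", ["search", "web"], "https://example.org/ir"),
         ("bob", ["search"], "https://example.org/seo")]
log = [("session-42", "social search", "https://example.org/ir")]

print(sorted(folksonomy_triples(posts)))
print(sorted(logsonomy_triples(log)))
```

Real query log preprocessing (session splitting, term normalisation, click filtering) is of course considerably more involved; the point here is only that both structures reduce to the same kind of triple set.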

Spam Detection. The spam problem is well known to anyone using search engines or e-mail communication. Malicious users try to manipulate the ranking results for search queries in order to increase the traffic to their websites, be it for commercial or political interests, or simply to damage the service provided [Krause et al., 2008c]. For this, spammers use different techniques such as adding specific keywords to their websites or building link farms. Social bookmarking systems have also become an attractive place for spammers. All they need is an account; then they can freely post entries which bookmark the target spam website. Depending on the social bookmarking system's popularity, one or more posts with the spam bookmark improve the bookmark's ranking in search engines. Due to the growing percentage of spam entries, the quality of social bookmarking systems decreases. Instead of finding relevant and interesting documents, users are exposed to publicity, nonsense information or political and religious ideologies. Legitimate participants tend to lose interest in the system and switch to other tools providing a better service [Navarro Bullock et al., 2011b]. “In order to retain the original benefits of social bookmarking systems, techniques need to be developed which prevent spammers from posting in these systems, or at least from having their malicious posts published” [Navarro Bullock et al., 2011b]. A common method is to complicate the interaction with the system so that automatic spam attacks fail [Krause et al., 2008c]. The use of captchas is a common example of such an approach: in order to verify that inputs are carried out by a human user rather than a computer, users must enter a response that is most likely impossible for a computer to produce. While this may prevent bots from posting, the incentive of having a published bookmark seems to be high enough for human spammers to overcome this difficulty.
Furthermore, legitimate users might also be put off by the cumbersome handling of the system. Automatic spam-fighting methods, on the other hand, do not interfere with users and their activities: so-called spam filters identify malicious posts based on their features and previous experiences. Such posts can then be removed silently from the public parts of the social bookmarking system. Supervised learning methods such as classification algorithms are often used by spam filters. Many of them have been implemented for detecting spam in e-mail communication


(for example [Blanzieri and Bryl, 2008]). Based on a set of features computed for each message, the classification approaches learn how to differentiate between spam and legitimate messages. In order to transfer spam detection approaches to social bookmarking systems, classification methods need to be adjusted and the features to characterize spam and non-spam bookmarks need to be designed carefully. The second part of this thesis concerns the design of appropriate features and the selection of well-performing spam detection algorithms. The spam problem in social bookmarking systems will be analysed in general, experiments concerning the performance of different spam features will be presented, and a framework to actively fight spam in the social bookmarking system BibSonomy will be introduced.

Protection of Data Privacy. Users participating in online applications with social and interactive features face a dilemma: they benefit from the free service to present themselves and interact with others, but on the other hand they disclose sensitive data, which can be a risk. As soon as they publish their data on the Web, they lose control over who reads, stores and distributes this information. In addition to the exploration of user generated content, enhanced information technologies and the easy availability of storage make it possible for service providers to trace a user's path when interacting with the system. On top of that, data from different systems, organisations or websites can be compared and merged. By using intelligent data mining techniques, indirect information such as user relations, possible interests and profiles can be exploited and leveraged for purposes other than the users had in mind. In order to ensure a fair and conscientious collection and processing of personal information, privacy-enhancing technologies need to be integrated into Web 2.0 systems. Such techniques include the minimization of the personal data collected, the anonymisation of user data (for example user log files) or mechanisms to delete or unlink personal information. However, privacy issues cannot be solved only through the provision of an appropriate technical infrastructure: existing legal frameworks need to be kept up to date regarding user needs and privacy requirements [Fischer-Hübner et al., 2011] triggered by the technological changes. Since those changes happen much faster than the establishment of laws and the introduction of privacy-enhancing technologies, a more forward-looking perspective on privacy needs to be established among system designers, developers and users.
While data protection has been discussed in the context of social networks (for example in Eecke and Truyens [2010]; Krishnamurthy and Wills [2008]), there have been few considerations of data privacy issues in other Web 2.0 applications, especially in social tagging systems. However, such systems rely heavily on collecting and processing data by storing their users' public and private entries and profile information. In order to prevent users from losing control over their personal information published on the Web, these systems need to be scrutinized regarding the kinds of data they collect as well as their handling of possibly personal information. The third research topic of this thesis looks into the handling of data privacy in social bookmarking systems. Legal guidelines about how to deal with the private data collected and processed in social bookmarking systems are presented. Experiments will show that the consideration of user data privacy in the process of algorithm and feature design can be a first step towards strengthening data privacy.


1.2 Problem Statement

This thesis investigates three different topics in the context of social bookmarking systems: social search, spam detection and data privacy. For each of these areas, the specific research questions are summarized in this section.

1.2.1 Social Search

The main question asked is:

How does information retrieval in folksonomies compare to that of traditional search engines?

We search for similarities and differences between information retrieval on the Web and in social bookmarking systems. We hereby consider four different aspects of social search:

(1) Usage behaviour and system content: The processes of finding information by entering a search term and tagging information by adding keywords to a relevant resource are compared: How do tags differ from query terms? Do users tag and search by using the same vocabulary? Do users click on the same resources in a search result list as they would post in a social bookmarking system?

(2) Structure: A user clicking on a search result can be considered as implicit feedback indicating that the link might be relevant for the search request. A tripartite structure linking elements of users, clicked resources and query terms is similar to the structure consisting of connected elements of tags, resources and users in a folksonomy. Thus, in this part we consider whether there is an inherent folksonomy-like structure in query log files and how its properties compare to the properties found in a folksonomy.

(3) Semantics: The emergent semantics of folksonomies have been subject to several analyses, as can be found in Cattuto et al. [2008] or Körner et al. [2010]. The question arises, however, whether similar semantics evolve from query logs and tagging systems.

(4) Integration: Principles of traditional search such as the PageRank algorithm have been transferred to the structure of folksonomies (see Hotho et al. [2006a]). This thesis looks at how the information collected in tagging systems can be of use for traditional search.

These questions will be answered by analysing large datasets from the social bookmarking systems BibSonomy and Delicious (described in Section 2.6) and a query log file of the search engine MSN (see Section 6.2).
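The folksonomy-like structure of aspect (2) can be sketched in a few lines of code. The record layout below (session id, query string, clicked URL) is an illustrative assumption, not the actual schema of the MSN log file:

```python
def build_logsonomy(click_records):
    """Build a folksonomy-like tripartite relation from query-log clicks.

    click_records: iterable of (session_id, query, clicked_url) tuples
    (an invented schema for illustration). Each query term plays the
    role of a tag, the session id the role of a user, and the clicked
    URL the role of a resource.
    """
    Y = set()
    for session_id, query, url in click_records:
        for term in query.lower().split():
            Y.add((session_id, term, url))
    return Y

# Toy log: two sessions, three clicks.
log = [
    ("s1", "python tutorial", "http://example.org/py"),
    ("s1", "python", "http://example.org/py"),
    ("s2", "java tutorial", "http://example.org/java"),
]
Y = build_logsonomy(log)
```

The resulting set of (user, tag, resource) triples has exactly the shape of the tag assignments in a folksonomy, so structural measures defined on folksonomies can be applied to it directly.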

1.2.2 Spam Detection

The main research question asked is:

How can we best detect and eliminate spam in social bookmarking systems?

The first step will be an analysis of spam in social bookmarking systems. Can different categories of spammers be distinguished? If so, how do they differ from legitimate users?


We will hereby define appropriate characteristics to differentiate between spam and non-spam. These characteristics are then used as features for different classification algorithms. Using a comprehensive dataset of the social bookmarking system BibSonomy, we will identify the best algorithms considering accuracy and performance. Further analyses of the spam problem in social bookmarking systems have been conducted at the ECML/PKDD discovery challenge of 2008 [Hotho et al., 2008], where different participants classified spam using a dataset similar to the one used for this study. Finally, the main features allowing a successful detection of spam have been tested regarding classification accuracy on the one hand and the capability to preserve data privacy on the other. The results derived from the different experiments have been implemented in a spam detection framework used to detect spam in the social bookmarking system BibSonomy, thereby demonstrating the applicability in the real world.
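The idea of feature-based spam classification can be illustrated with a minimal sketch. The two features and all numbers below are invented for illustration; they stand in for the richer feature set and the classifiers (e.g., logistic regression, SVMs) evaluated later in the thesis. A nearest-centroid rule is used here only because it is the simplest possible classifier:

```python
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def classify(x, spam_centroid, ham_centroid):
    """Nearest-centroid rule: assign x to the closer class centroid."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return "spam" if dist(x, spam_centroid) < dist(x, ham_centroid) else "ham"

# Invented feature vectors per user:
#   [fraction of posts with typical spam tags,
#    fraction of posts pointing to suspicious domains]
spam_users = [[0.9, 0.8], [0.8, 0.9]]
ham_users = [[0.1, 0.2], [0.2, 0.1]]
spam_c, ham_c = centroid(spam_users), centroid(ham_users)
```

Given a new user's feature vector, `classify` assigns the label of the nearer class prototype; real systems replace both the features and the decision rule with the carefully designed alternatives studied in Chapter 7.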

1.2.3 Data Privacy

We will consider the following research question:

How can personal information be protected in social bookmarking systems?

At first sight, one might wonder about this question. Compared to online social networks such as Facebook, social bookmarking systems seem to be rather anonymous. People select a user name and place a few links in the system. Is there any personal information required? We will show such traces of personal information by analysing what kinds of data are processed in social bookmarking systems and how they are further used for typical bookmarking applications and add-ons. Together with legal experts, guidelines have been defined to help system designers, developers and users preserve data privacy in social bookmarking systems. Besides general considerations of data privacy in social bookmarking, a specific idea for data protection will be further investigated: privacy-preserving data mining. We thereby refrain from developing techniques to hide or distort information and instead present an approach for considering alternatives to personal data in data mining applications. Different privacy levels in which data can be classified are defined. With the help of an example (spam detection in BibSonomy), various spam features using different kinds of system data (inventory, usage and content data) are evaluated by comparing their applicability in terms of performance and level of privacy.

1.3 Thesis Outline

This thesis consists of three parts as pictured in Table 1.1. After the introduction in Chapter 1, the first part (Foundations) summarizes important concepts and related work. Chapter 2 introduces social tagging systems in general, outlines related work for data mining in such systems and describes two examples of social bookmarking systems, BibSonomy and Delicious, as their data is used for several experiments. The following three chapters present the relevant basics for each of the three main research fields:

• Chapter 3 explains important concepts in social search.


Table 1.1: Thesis outline with references of the different research areas to their backgrounds, experimental part and applications (if available).

              Social Search   Spam Detection   Data Privacy
Foundations   Chapter 3       Chapter 4        Chapter 5
Methods       Chapter 6       Chapter 7        Chapter 8
Applications  —               Chapter 9        Chapter 10

• Chapter 4 describes the spam problem in general including a definition of spam, important algorithms for spam classification and evaluation measures. Additionally, the state-of-the-art of spam detection in social media applications is discussed.

• Chapter 5 introduces the necessary terminology to understand the legal background of data privacy.

In the second part (Methods), the different experiments carried out are presented:

• Chapter 6 presents our findings regarding the comparative analysis of social bookmarking and search data with respect to user behaviour, structure, semantics and usability for implicit feedback algorithms.

• Chapter 7 identifies important spam features and analyses their performance, describes the spam detection task as well as the results of the ECML/PKDD discovery challenge 2008 and summarizes subgroup discovery for spam data.

• Chapter 8 investigates the combination of performance and data privacy aspects for the selection of features used in spam classification algorithms.

The spam application developed out of the findings in Methods as well as the data privacy analysis of the social bookmarking system BibSonomy will be discussed in the third part, Applications:

• Chapter 9 provides an overview of the BibSonomy spam detection framework.

• Chapter 10 explores the functions offered by the social bookmarking system BibSonomy with respect to the data used and the legal implications in case of the collection and processing of private data.

Finally, Chapter 11 will summarize the findings of the three research fields and provide an outlook on future research and developments.


Part I

Foundations


Chapter 2

Collaborative Tagging

The experiments presented in this thesis are based on data from social bookmarking systems, one form of collaborative tagging. In order to gain an understanding of these systems, this chapter will introduce social tagging and briefly summarize recent research concerning those systems. Why did they gain so much popularity? What can we learn from analysing the structure and the dynamics of such systems? How can we benefit from the annotated data collected and shared by the systems' users? Section 2.1 starts by introducing indexing systems in general. Different characteristics of social tagging systems (strengths and weaknesses, user motivation and a classification of tags) will then be listed. Section 2.2 presents the formal model of collaborative tagging, called folksonomy. Research and findings concerning the dynamics of tagging (Section 2.3), its relation to the Semantic Web (Section 2.4) and recommender systems (Section 2.5) will be presented. The chapter ends with a description of two social bookmarking systems: BibSonomy and Delicious (Section 2.6). Both systems provide data used for experiments in this thesis.

2.1 Characterizing Collaborative Tagging Systems

Social tagging systems enable users to store, organize and share resources such as bookmarks, publication metadata or photos over the Web. The resources can be labeled with any keyword (tag) a user can think of [O'Reilly, 2005]. The tags serve as a representation of the document, summarizing its content, describing the relation to the resource's owner or expressing some kind of emotion the resource owner feels. The process itself is coined tagging. The tagged resources can be used for “future navigation, filtering or search” [Golder and Huberman, 2005].

2.1.1 Indexing Documents

Adding keywords to digital resources (mostly to documents) existed before the arrival of tagging systems, either in the form of manual indexing or of automatically extracting keywords from text. The process of (manual) document indexing is twofold: a conceptual analysis of the item helps to understand its meaning and importance. This can be different for different people, depending on their background, interests or intentions. Translation then helps

to select the appropriate index terms to describe the resource considering the individual resource context [Lancaster, 2003]. In most classical document categorization and indexing schemes, resources have to be classified into predefined and firm categories. For example, the Yahoo!1 directory contains categorized websites. Human editors in the company had to assign the websites to appropriate categories [Angel, 2002]. Another manually created document index is the Open Directory Project (ODP)2, launched in 1998. It has been created by a community of volunteers from all over the world. As it provides an open content license, it can be used freely [Kapitzke and Bruce, 2006]. Documents can also be indexed based on ontologies. Users can describe a document by using a pre-defined markup which allows the definition of concepts and the linkage between documents by identifying relations and properties [Andrews et al., 2012]. Several vocabularies used to annotate web documents exist. For example, the Common Tag3 format provides a specification to identify different concepts of a document which are then linked via URIs to databases of structured content. On the Web, the standard way of organizing documents is by automatic indexing. Manual approaches such as Yahoo!'s directory became more and more difficult to maintain due to the Web's rapid growth and changing nature. Search engines such as Google4 or Bing5 automatically index documents. They create an inverted index where documents are ordered according to specific keywords automatically extracted from them. Such keywords can then be used in the retrieval process. With the advent of collaborative tagging systems, manual indexing became popular again. In contrast to pre-defined indexing schemes, tagging allows spontaneous annotation. Keywords can be selected as they come to one's mind without having to conform to predefined rules or standards.
Hence, not only field experts are able to annotate resources, but web participants themselves freely categorize content. The properties and architecture of tagging systems which enable the simple annotation of resources will be discussed in the next section.
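The automatic indexing performed by search engines, as described above, can be illustrated with a minimal inverted index. This is only a sketch; production engines add tokenization, stemming, ranking and index compression:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it.

    docs: dict of doc_id -> document text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Two toy documents.
docs = {
    "d1": "social tagging systems",
    "d2": "tagging and search engines",
}
index = build_inverted_index(docs)
```

A query term is then answered by a simple lookup (e.g., all documents containing "tagging"), and multi-term queries intersect the corresponding document sets.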

2.1.2 System Properties

Tagging systems differ in their design and the functionalities they offer. Marlow et al. [2006] proposed several dimensions which allow the classification of collaborative tagging systems. According to the authors, the applications vary according to the kinds of objects (web bookmarks, photos or videos) they provide storage space for. The source of such objects also differs: in user-centric systems such as BibSonomy or Flickr, users collect material. In other systems, the providers themselves present data which can then be annotated by its users (for example Last.fm, which provides music). Concerning the process of tagging, systems support and restrict their users in different ways. Many systems, for example, have implemented recommender systems to suggest tags and help users find appropriate vocabulary. In some systems users can annotate all resources (Delicious), while users in other systems can explicitly decide if they want other users to be able to

1 http://dir.yahoo.com/
2 http://www.dmoz.org/
3 www.commontag.org
4 http://www.google.com/
5 http://www.bing.com/

tag their resources. Also, the aggregation model of tag-resource assignments differs: while different users in BibSonomy are allowed to assign the same tag to a resource, Flickr prohibits the same tag-resource assignments among different users. Finally, many systems provide additional functionalities concerning social connections among users (for example: joining groups) and their resources (for example: organizing photos in an album). To understand the success of tagging systems, one needs to consider the strengths and weaknesses of social tagging. Several aspects were discussed (see [Golder and Huberman, 2006b; Marlow et al., 2006; Quintarelli, 2005; Wu et al., 2006b] and also the dissertations of Noll [2010]; Yeung [2009]). We will briefly summarize those characteristics in the following.

Strengths

Low cognitive cost and entry barriers While more formal classification systems such as catalogues or ontologies require knowledge of the domain and its specific vocabulary or rules, no prior knowledge or training is required when starting social tagging [Wu et al., 2006b].

Serendipity One of the fascinating features of tagging systems is their ability to guide users to unknown, unexpected, but nonetheless useful resources. This ability is triggered by the system’s small world property: with only a few clicks one can reach totally different resources, tags and users in the system.

Adaptability In contrast to top-down approaches, where a pre-defined classification system is given and experts or at least people knowing the system are required to classify resources according to the classification scheme, social tagging systems allow a bottom-up approach where users can add keywords without having to adhere to a pre-defined vocabulary, authority or fixed classification. The liberty of using arbitrary annotations allows a flexible adaptation to a changing environment where new terms and concepts are introduced [Spiteri, 2007; Wu et al., 2006a]. However, as the majority of users annotate their resources with similar or the same tags, a classification system can still evolve.

Long tail As everyone can participate and no pre-requisites have to be met, “every voice gains its space” [Quintarelli, 2005]. Consequently, the systems do not only contain mainstream contents, but also original and individual items which might turn into popular ideas.

Weaknesses

Ambiguity of tags The lack of control over the vocabulary used in tagging systems entails the typical challenges of a natural language environment: ambiguity, polysemy, synonyms, acronyms or basic level variation [Golder and Huberman, 2006a; Spiteri, 2007].

Multiple languages Since people with different cultural backgrounds use tagging systems, multiple terms from different languages with the same meaning can be encountered.


Syntax issues In most social tagging systems, users can enter different tags by using the space character as a delimiter. Problems arise when users want to add tags consisting of more than one term. Often, the underscore character or hyphens are used to combine such terms.

No formal structure Tags as they are entered into the system have no relation among each other. One needs to apply further techniques to discover inherent patterns.

2.1.3 Tagging Properties

User Motivation for Tagging

Why do people manually label items? Gupta et al. [2010] distinguished eight motives in their survey of tagging techniques. Most users annotate resources in order to facilitate their future retrieval. By making a resource public, categorizing it and sometimes even adding it to a specific interest community, the resource becomes available for a system's audience (contribution and sharing). Often, annotators use popular tags to make people aware of their resources (attracting attention) or they use tags to express some part of their identity (self presentation). The tag myown in the social tagging system BibSonomy, for example, states that the annotating user is an author of the referenced paper. With the help of tagging, users can demonstrate their opinion about certain resources (opinion expression). Tags such as funny, helpful or elicit are examples of such value judgements. Some tags reflect organisational purposes of the annotator. Often used examples are toread or jobsearch (task categorization). For some users, tagging is a game or competition, triggered by some pre-defined rules: playing the ESP game [von Ahn and Dabbish, 2004], one user needs to guess labels another user would also choose to describe images displayed to both users. Other users earn money (financial benefits): there are websites such as Squidoo6 which pay users a small amount of money for annotating resources.

Classification of Tags

In order to get a better understanding of the nature of tags, various authors [Al-Khalifa and Davis, 2007; Bischoff et al., 2009; Golder and Huberman, 2006a; Overell et al., 2009; Sen et al., 2006; Wartena, 2010] identified different types of tags. Most useful for information retrieval or data mining tasks are factual tags indicating facts about the resource such as concepts, time or people. Three kinds of factual tags can be listed:

• Content-based tags describe a resource’s actual content such as ranking-algorithm, java, subaru or parental-guide.

• Context-based tags can be used to identify an item's context, i.e., where it was created or can be located. Examples are San Francisco, christmas or www-conf.

• (Objective) attribute tags describe an object's specific characteristics which may not be explicitly mentioned with the object. For example, the blog of the hypertext conference can be tagged with hypertext-blog.

6 www.squidoo.com


Attribute tags can also be part of the category of personal tags when they serve to express an opinion or a specific feeling such as funny or interesting. Personal tags are often more difficult to use for inferring general, descriptive knowledge. They can be used, however, for specific tasks such as sentiment analysis. Personal tags include:

• Subjective tags state an annotator's opinion.

• Ownership tags express who owns the object, e. g., mypaper, myblog or mywork.

• Organisational tags denote “what to do with the specific resource” [Navarro Bullock et al., 2011a]. They are often time-sensitive. For example, the to-do, read or listen tags lose significance once the task has been carried out.

• Purpose tags describe a certain objective the user has in mind considering the specific resource. Often, this relates to information seeking tasks such as learning about java (learning_java) or collecting resources for a chapter of the dissertation (review_spam).

Authors (such as [Bischoff et al., 2009; Overell et al., 2009; Wartena, 2010]) intend to automatically identify tag types in order to better explore the semantic structure of a tagging system. Categories can then be used for tag recommendation, categorization, faceted browsing or information retrieval.

Types of Annotators

Another way to look at the annotation process is to describe the nature of taggers. Körner et al. [2010a,b] identified two types of taggers: categorizers and describers.

• A categorizer annotates a resource using terms of a systematic shared or personal vocabulary, often some kind of taxonomy. The size of their vocabulary is limited and terms are often reused. A categorizer aims at using tags for his or her personal retrieval [Körner et al., 2010a].

• Describers annotate resources having their later retrieval in mind. They consider tags as descriptive labels which characterize the resource and can be searched for. The size of a describer's vocabulary can be large, and tags are often not reused [Körner et al., 2010a].

In Körner et al. [2010a] it could be shown that the collaborative verbosity of describers is more useful for extracting semantic structure from tags. Most users show a mixed behaviour. Users who own many tags applied only once tend to be describers; additionally, a quickly growing vocabulary hints towards a describer. Categorizers can be identified by their low tag entropy, as they apply their tags in a balanced way. Körner et al. [2010a] restrict their findings to moderately verbose taggers, excluding spammers, which negatively influence the semantic accuracy.
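One of the signals mentioned above, the entropy of a user's tag usage distribution, can be computed as follows. This is only a sketch of the entropy measure itself; Körner et al. combine it with further indicators such as vocabulary growth:

```python
import math
from collections import Counter

def tag_entropy(assigned_tags):
    """Shannon entropy (in bits) of a user's tag usage distribution.

    assigned_tags: list with one entry per tag assignment of the user,
    e.g. ["web", "java", "web"].
    """
    counts = Counter(assigned_tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

A user who applies a single tag over and over has entropy 0, while a user whose assignments are spread uniformly over n distinct tags reaches the maximum of log2(n) bits.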

2.2 Formal Model

The structure of social tagging systems has been termed folksonomy. The term is a composition of two terms: folk and taxonomy.


Folk refers to people, i. e., the users of the tagging system. A taxonomy can be considered as a hierarchical structure used to classify different concepts [Knerr, 2006]. The hierarchy is built from “is-a” relationships, i. e., subsumptions going from general concepts to more specific ones. A concept A is subsumed by a concept B (A ≤ B) if the set of instances classified under A is constrained to be a subset of the instances classified under B [Sacco and Tzitzikas, 2009]. Taxonomies are normally designed by an expert of the domain. An example of a taxonomy is the Dewey Decimal Classification system [OCL], introduced by librarians to organize their collections [Breitman and Casanova, 2007]. Web documents have also been categorized with the help of taxonomies (examples here are the Yahoo! Directory7 and the Open Directory Project (ODP); see Section 2.1.1). The composition of folk and taxonomy describes the creation of a lightweight taxonomy which emerges from the fact that many people (“folk”) with a similar conceptual and linguistic background as well as common interests annotate and organize resources [Marlow et al., 2006]. Depending on the annotation rules, one can distinguish between narrow and broad folksonomies [Wal, 2005]. In broad folksonomies multiple users add tags to a resource, while in narrow folksonomies resources are normally tagged only by the person who owns the resource. Formal definitions of a folksonomy have been presented, i. a., by Halpin et al. [2007]; Hotho et al. [2006c]; Mika [2005]. They all have in common that they describe the connections between users, tags and resources. We follow the notion of Hotho et al. [2006c], which is depicted in Definition 2.2.1. The definitions further down (Definition 2.2.2 and Definition 2.2.3) are also based on Hotho et al. [2006c].

Definition 2.2.1 A folksonomy is a tuple F := (U, T, R, Y, ≺) where

• U, T , and R are finite sets, whose elements are called users, tags and resources, resp., and

• Y is a ternary relation between them, i. e., Y ⊆ U × T × R, whose elements are called tag assignments (TAS for short).

• ≺ is a user-specific subtag/supertag-relation, i. e., ≺ ⊆ U × T × T , called is-a relation.
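Definition 2.2.1 translates directly into a toy data structure. The element names are invented for illustration, and the is-a relation is omitted:

```python
# Toy folksonomy following Definition 2.2.1 (is-a relation omitted).
U = {"u1", "u2"}          # users
T = {"t1", "t2"}          # tags
R = {"r1", "r2"}          # resources
Y = {("u1", "t1", "r1"),  # tag assignments (TAS);
     ("u1", "t2", "r1"),  # Y is a subset of U x T x R
     ("u2", "t2", "r2")}

# Every TAS must indeed lie in the Cartesian product U x T x R:
assert Y <= {(u, t, r) for u in U for t in T for r in R}
```

Each triple in Y corresponds to one hyperedge of the tripartite structure discussed below.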

Figure 2.1 illustrates Definition 2.2.1. Elements of one of the three sets are connected to elements of the remaining sets through the ternary relation Y. For example, (u1, t1, r1) is a TAS of the depicted folksonomy. Users of a bookmarking system are normally identified by a unique name they selected when registering. Tags are arbitrary strings; in most of the systems, they are separated by spaces. A folksonomy's resources can vary from URLs (for example Delicious, BibSonomy) to photos (Flickr) or videos (YouTube). The is-a relation described in the definition classifies tags in the form of super-/sub-concept relationships [Hotho et al., 2006c]. Not all tagging systems realise this functionality. In BibSonomy, such relations can be defined by the system's users. Delicious allows the creation of so-called tag bundles: users can define a set of tags and assign a group name to them.

Figure 2.1: Elements of the three finite sets of users U, tags T and resources R are connected through the ternary relation Y. For example, user u1, tag t1 and resource r1 form the tag assignment tas1, represented by a hyperedge.

One can ignore the is-a relation and simply define a folksonomy as a quadruple F := (U, T, R, Y). Definition 2.2.2 describes a folksonomy for one user – a personomy. It basically considers only the tags and resources which the user in question submitted to the system. Figure 2.1 depicts two personomies. Both users have tagged one resource; however, the personomy of user u1 contains one more TAS since the user assigned two tags to the resource.

7http://dir.yahoo.com/

Definition 2.2.2 The personomy Pu of a given user u ∈ U is the restriction of F to u, i. e., Pu := (Tu, Ru, Iu, ≺u) with Iu := {(t, r) ∈ T × R | (u, t, r) ∈ Y}, Tu := π1(Iu), Ru := π2(Iu), and ≺u := {(t1, t2) ∈ T × T | (u, t1, t2) ∈ ≺}, where πi denotes the projection on the i-th dimension.

An important concept in the world of folksonomies is a post, which is presented in Definition 2.2.3. A post basically represents the set of tag assignments of one user for one resource. Figure 2.1 depicts two posts. The post of user u1 is composed of two tag assignments, while the post of user u2 contains one tag assignment.

Definition 2.2.3 The set P of all posts is defined as P := {(u, S, r) | u ∈ U, r ∈ R,S = tags(u, r),S ≠ ∅} where, for all u ∈ U and r ∈ R, tags(u, r) := {t ∈ T | (u, t, r) ∈ Y } denotes all tags the user u assigned to the resource r.
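The definitions above translate directly into simple data structures. The following sketch (not part of the thesis; all user, tag and resource names are made up) represents a folksonomy as a set of tag assignments and derives the personomy of Definition 2.2.2 and the posts of Definition 2.2.3 from it:

```python
# Sketch: a folksonomy as a set of (user, tag, resource) triples, with the
# derived personomy and post structures. Names and data are illustrative.

Y = {  # tag assignments (TAS): Y ⊆ U × T × R
    ("u1", "web", "r1"),
    ("u1", "search", "r1"),
    ("u2", "web", "r2"),
}

def personomy(Y, u):
    """Restriction of the folksonomy to one user: P_u = (T_u, R_u, I_u)."""
    I_u = {(t, r) for (user, t, r) in Y if user == u}
    T_u = {t for (t, r) in I_u}   # projection on the first dimension
    R_u = {r for (t, r) in I_u}   # projection on the second dimension
    return T_u, R_u, I_u

def posts(Y):
    """All posts (u, tags(u, r), r) with a non-empty tag set."""
    grouped = {}
    for (u, t, r) in Y:
        grouped.setdefault((u, r), set()).add(t)
    return [(u, tags, r) for (u, r), tags in grouped.items()]

T_u1, R_u1, I_u1 = personomy(Y, "u1")
print(sorted(T_u1))   # ['search', 'web']
print(len(posts(Y)))  # 2 posts: one per (user, resource) pair
```

As in Figure 2.1, user u1 contributes one post with two tag assignments, user u2 one post with a single tag assignment.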

Though we focus on a folksonomy with users, tags and resources as elements, it is possible to extend the structure to include more dimensions [Abel, 2011]. Wu et al. [2006b], for instance, consider a fourth set of elements: timestamps, which are assigned to tag-resource pairs in order to capture temporal aspects in their analysis.


2.3 Tagging Dynamics

From a network analysis perspective, "a folksonomy can be considered as a tripartite undirected hypergraph G = (V, E)" connecting users, tags and resources [Jäschke, 2011]. V = U∪˙ T ∪˙ R is the set of nodes, and E = {{u, t, r} | (u, t, r) ∈ Y } is the set of hyperedges. These hypergraphs show interesting properties which help to understand a folksonomy's structure. In order to provide a better understanding of the basic properties of complex networks, especially scale-free and small-world networks, this section presents the main concepts and characteristics.

Power Law Distributions and Scale-free Networks. A power law describes the phenomenon that highly connected nodes in a network are rare while less connected nodes are common [Adamic, 2002]. It states that the probability P(k) that a vertex in the network is connected to k other vertices decays according to P(k) ∼ k^−γ [Barabasi and Albert, 1999], where γ > 0 is a constant and ∼ means asymptotically proportional to as k → ∞. A power-law distribution is highly skewed with a long tail, which means that the probability of selecting a node with fewer connections than the average is high. According to Willinger et al. [2009], "most nodes have small degree, a few nodes have very high degree, with the result that the average node degree is essentially non-informative". Plotted on a log-log scale, a power-law distribution appears as a straight line with slope −γ.

Scale-free networks are networks with a power-law degree distribution. No matter how many nodes the network consists of, the characteristic constant γ does not change, which makes such networks "scale-free" [Newman, 2005]. Various network structures induced from folksonomies exhibit power-law distributions, which can be produced by scale-free networks. Especially the distributions of tag usage and tag co-occurrence have been carefully examined with respect to power laws.
Based on such findings, different statements about user behaviour and the overall network dynamics can be made.

• Quintarelli [2005] mentions the power law distribution of tag usage in broad folksonomies. He states that the "power law reveals that many people agree on using a few popular tags but also that smaller groups often prefer less known terms to describe their items of interest."

• Halpin et al. [2006] show that, for a small dataset of 100 URLs which were tagged at least 1000 times and their 25 most popular tags, the tag usage frequency follows a power law. The authors conclude that "the distribution of tag frequencies stabilizes [into power laws]", which indicates that users tend to agree about which tags optimally describe a particular resource.

• Cattuto et al. [2006] examine tag co-occurrence using some preselected tags. They found a power law decay of the frequency-rank curves for high ranks, while the curve for lower-ranked parts was flatter.

• Wetzker et al. [2008] and Li et al. [2008] analyse the distributions of tags per post, users per post and bookmarks per post in the social bookmarking system Delicious. Their results are similar to our results in Section 6.4.2, where we compare folksonomies to the tripartite structure of clickdata. While the occurrence of tags and URLs


follow a power law distribution, the distribution of users is less straight (see Figure 2.2). Still, one can observe that many users have only a few posts, while a few users hold many posts.

(a) Frequency distributions of Li et al. [2008]. The figures show the frequency distributions of URLs, users and tags.

(b) Frequency distributions of Wetzker et al. [2008] in a slightly different order. The figures show the frequency distributions of users, URLs and tags.

Figure 2.2: Frequency distributions of Li et al. [2008] and Wetzker et al. [2008]. Please note that, while the order of the figures and the scale of the datasets used for the experiments differ slightly, the distributions themselves are similar.
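The exponent γ of such distributions can be estimated from a degree or frequency sequence. The following sketch (purely illustrative; the samples are synthetic Pareto data, not taken from any of the datasets above) uses the maximum-likelihood estimator, which is more robust than fitting a straight line to the log-log plot:

```python
# Hypothetical sketch: estimating the exponent gamma of a power law
# P(k) ~ k^(-gamma) from samples. The data is synthetic (continuous Pareto
# with gamma = 2); real inputs would be degree or frequency sequences.
import math
import random

random.seed(42)
gamma = 2.0
n = 100_000
# Inverse-transform sampling: x = (1 - u)^(-1/(gamma - 1)) follows a Pareto
# distribution with density proportional to x^(-gamma) for x >= 1.
samples = [(1 - random.random()) ** (-1 / (gamma - 1)) for _ in range(n)]

# Maximum-likelihood estimate (continuous case, x_min = 1):
# gamma_hat = 1 + n / sum(ln x_i)
gamma_hat = 1 + n / sum(math.log(x) for x in samples)
print(f"estimated gamma: {gamma_hat:.2f}")
```

With 100,000 samples the estimate recovers the true exponent closely; on empirical folksonomy data the estimate is of course only as good as the power-law assumption itself.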

Small world networks. Small world networks are graphs in which most nodes can be reached from any other node by a small number of steps. Such networks have a short average path length between any two nodes and exhibit significantly higher clustering coefficients than random networks, where nodes are connected randomly [Watts and Strogatz, 1998]. The average shortest path length measures the average distance between any two nodes in the graph. The clustering coefficient quantifies the extent to which a node's immediate neighbourhood is completely connected. The local clustering coefficient of node i in an undirected graph is given by

Ci = 2Ei / (ki(ki − 1))    (2.1)

where Ei is the number of edges connecting direct neighbours of node i, ki is the degree of node i and ki(ki − 1)/2 is the number of possible edges among the immediate neighbours of node i.

Cattuto et al. [2007] analyse the network properties of folksonomies. They adjust various measures (i. e., the average shortest path length and the clustering coefficient defined above) to the tripartite structure of folksonomies and demonstrate small world characteristics in this type of graph. In Section 6.4.2 we will compare the properties of networks

inferred from folksonomies and data from log files based on the measures defined in [Cattuto et al., 2007].

Preferential attachment. What are the underlying models explaining the stabilization of tag distributions (into a power law)? Different assumptions are currently being discussed. The major one is the phenomenon of preferential attachment, where users imitate each other either directly by looking at each other's resources or indirectly by accepting tags suggested by tag recommenders. In this "rich get richer" approach [Barabasi and Bonabeau, 2003], a newly added node preferentially connects to a node which already has a high network degree. With respect to tagging systems, a tag that has already been used to describe a resource is more likely to be added again [Halpin et al., 2007].

Various tag generation models have since been introduced to describe the tagging process [for a survey see Gupta et al., 2010]. For example, users do not only imitate each other; they also seem to have a similar understanding of a resource due to a shared background. Based on this idea, Dellschaft and Staab [2008] present a generative tagging model integrating the two perspectives: a new tag is assigned to a resource either by imitating a previous tag assignment or by selecting a tag from the user's active vocabulary. When choosing tags, users can also be influenced by external factors. Lipczak and Milios [2010], for example, analysed the influence of a resource's title on tagging behaviour and found that tags which appear in a resource's title are used more often than other tags with the same meaning.

Research about network structure and tagging dynamics has helped to understand the creation of folksonomies. It can be used to improve mining algorithms such as tag recommender systems (see Section 2.5.2) and to develop methods to make the implicit knowledge and structure of a folksonomy explicit.
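One of the measures used in these structural analyses, the local clustering coefficient of Equation (2.1), is easy to compute directly. A minimal sketch (the graph and node names are made up for illustration):

```python
# Minimal sketch of Equation (2.1): the local clustering coefficient
# C_i = 2 E_i / (k_i (k_i - 1)) of node i in an undirected graph,
# represented here as adjacency sets. Graph data is illustrative.

graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def clustering_coefficient(graph, i):
    neighbours = graph[i]
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for degree < 2; 0 by convention
    # E_i: number of edges among the immediate neighbours of i
    # (each edge is seen from both endpoints, hence the division by 2)
    e = sum(1 for u in neighbours for v in graph[u] if v in neighbours) // 2
    return 2 * e / (k * (k - 1))

print(clustering_coefficient(graph, "a"))  # one edge among {b, c, d}: 2*1/(3*2)
```

Libraries such as NetworkX provide the same measure (and the folksonomy-specific variants of Cattuto et al. [2007] can be built on top of suitable projections of the hypergraph).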

2.4 Collaborative Tagging and the Semantic Web

Collaborative tagging systems are observed with interest by the Semantic Web community. Its members hope that the uncontrolled vocabulary of a folksonomy can be used to complement and enhance the creation of formal knowledge representations for the Semantic Web (for example [Mika, 2005]). While tagging systems based on folksonomies can be seen as a collection of terms with an inherent structure, the Semantic Web builds on a formal representation of knowledge by means of structured taxonomies.

The creation of such taxonomies can be tedious and awkward: their definition and use require domain experts, the possibility of categorizing the domain, stable and restricted entities and a clear coordination [Shirky, 2005]. In contrast, the process of collaborative tagging creates an emergent, non-static and non-hierarchical classification system consisting of "idiosyncratically personal categories and those that are widely agreed upon" [Golder and Huberman, 2005]. The question of interest for the Semantic Web community is whether or not the collective tagging effort resulting in a folksonomy can be used as a bottom-up approach to generate more formal knowledge representations.

The difficulty of inferring semantics from the emerging vocabulary of a tagging system lies in the linguistic variety and freedom of the tagging process (already mentioned in the description of system weaknesses in Section 2.1.2). Common problems are the resolution of polysemy (one word having many meanings) and synonymy (multiple words having the

same meaning). A variety of works analyse the vocabulary of social tagging systems and deal with linguistic challenges (for example Golder and Huberman [2005]; Halpin et al. [2007]; Marlow et al. [2006]).

The identification of related tags can be considered a first step towards the creation of more formal knowledge from folksonomies. The results can be used for synonymy detection, ontology learning and also related fields dependent on the underlying semantics of folksonomies, such as tag recommendation or query expansion. In Cattuto et al. [2008] the authors examine several distance measures based on co-occurrence, cosine similarity and the FolkRank algorithm to identify similar tags in a folksonomy. By comparing their results to the semantic distance of word pairs in WordNet [Fellbaum, 1998a], they can better characterize the different methods. For example, tag co-occurrence similarities tend to yield synonyms, while related tags computed from the FolkRank relatedness include more general tags. A further examination of tag relatedness measures is given in Markines et al. [2009b]. The authors evaluate several information-theoretic, statistical and practical similarity measures, using different aggregation methods to deal with the tripartite data. Similar to Cattuto et al. [2008], the evaluation is based on a comparison of word distances of the tags in WordNet and the Open Directory Project.

Further approaches concentrate on the creation of a hierarchical structure inferred from the tags of a folksonomy. Schmitz [2006] uses a machine learning approach to identify is-a relationships between tags in Flickr. In Schmitz et al. [2006] the authors create relations between tags by applying association rule mining. Their associations allow the probabilistic determination of further tags if a certain tag has been added to a resource. Zhou et al.
[2008b] apply a divisive hierarchical clustering algorithm to construct tag hierarchies in the tagging systems Delicious and Flickr. Sacco and Bothorel [2010] use a graph clustering technique to represent the hierarchical structure of tags and transform this structure into a semantic format. Strohmaier et al. [2012] evaluate three folksonomy induction algorithms creating a hierarchical structure from tagging data. They show that induction algorithms outperform traditional hierarchical clustering methods.

External resources such as other social media sites or existing ontologies can also help to infer semantics from tagging systems. For example, the authors of [Damme et al., 2007] propose the integration of multiple online lexical resources such as WordNet or Google, different methods of co-occurrence analysis and ontology mapping approaches. Specia and Motta [2007] use existing ontologies available on the Web to enhance the semantics of tags. Corcho et al. [2012] use DBpedia to define the meaning of a tag. External knowledge can also be introduced by users in order to structure the tags. For example, Garcia-Castro et al. [2009] enable users to assign tags to tags. A complete cycle, from (automatically) extracting semantic relations between tags, through users adding and refining relations, to enhancing the folksonomy with semantic assertions, is presented in Limpens et al. [2010]. Monachesi and Markus [2010] enrich existing ontologies with data extracted and processed from social media sites.
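One of the relatedness measures discussed above, cosine similarity over tag co-occurrence vectors, can be sketched in a few lines. The toy posts below are made up for illustration and do not come from any of the cited datasets:

```python
# Hedged sketch of co-occurrence-based tag relatedness: represent each tag
# by the counts of tags it co-occurs with in posts, then compare tags by
# cosine similarity. Toy data, illustrative only.
import math
from collections import Counter
from itertools import combinations

posts = [  # each post: the tag set one user assigned to one resource
    {"python", "programming", "tutorial"},
    {"python", "programming"},
    {"java", "programming"},
    {"python", "snake"},
]

# cooc[t][t2] = number of posts containing both t and t2
cooc = {}
for tags in posts:
    for t1, t2 in combinations(sorted(tags), 2):
        cooc.setdefault(t1, Counter())[t2] += 1
        cooc.setdefault(t2, Counter())[t1] += 1

def cosine(t1, t2):
    v1, v2 = cooc.get(t1, Counter()), cooc.get(t2, Counter())
    dot = sum(v1[k] * v2[k] for k in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) \
         * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

print(f"{cosine('python', 'java'):.2f}")  # related via co-occurrence with 'programming'
```

Note that 'python' and 'java' never co-occur directly, yet the measure finds them related because both co-occur with 'programming' – the distributional effect exploited by Cattuto et al. [2008].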

2.5 Mining Tagging Data

The following sections briefly review the state of the art in mining folksonomy data. What kinds of patterns can be discovered in such data, and how can they be used for improving a system's service? We consider three different mining applications: ranking, recommendation and community detection. A fourth – spam detection – could be part of this section as well. However, as this field is one of the major research parts of this thesis, it will be introduced in detail in Chapter 4.

2.5.1 Ranking

In order to find relevant information among the tremendous number of pages on the Web, search applications need to find related pages and bring order to the search results. At the heart of such applications are ranking algorithms which can cope with the amount of data on the Internet, its different formats and varying quality. Traditional IR techniques such as the vector space model [Manning et al., 2008] cannot handle these challenges. Due to their reliance on the occurrence of terms in the documents, they tend to "fall for" spam pages stuffed with keywords. Two algorithms (and many variations developed afterwards) proposed in the 1990s deal with the challenge of Web page ranking by considering the hyperlink structure of the Web: PageRank [Brin and Page, 1998] and the Hyperlink-Induced Topic Search (HITS) algorithm [Kleinberg, 1999a]. Both algorithms model the Web as a network where nodes correspond to web pages and a directed edge between two nodes exists if one page links to the other. Such direct links express importance: the more often a page is linked to by others, the more important it is.

The search algorithms developed for retrieving information from a folksonomy are based on PageRank and HITS. In this thesis, we use one prominent ranking algorithm for folksonomies: the FolkRank algorithm (see Chapter 6). As it is based on the PageRank algorithm, this section provides a more detailed view of both methods. To give an overall picture of ranking algorithms in folksonomies, further approaches are introduced briefly thereafter.

PageRank

The PageRank algorithm [Brin and Page, 1998; Page et al., 1999] is the foundation of the popular search engine Google8. The algorithm models the behaviour of a random Web surfer who randomly follows a link without showing any preference for specific pages. Consequently, all links on a page have equal probability of being followed. From time to time, the random surfer does not follow the offered links but jumps to a randomly selected page.

The random surfer model can be expressed by means of a directed graph G = (V, E). V corresponds to the set of nodes representing the web pages, and E represents the set of ordered pairs (i, j), called edges, corresponding to the links between the web pages: (i, j) ∈ E if node i links to node j. One can further define out-degree(i) as the number of edges outgoing from i. One can construct a row-stochastic adjacency matrix A (also called link matrix) by setting aij as follows:

aij = 1/out-degree(i) if (i, j) ∈ E, and aij = 0 otherwise.    (2.2)

The rank of a node in the graph can be calculated by means of the weight spreading computation

wt+1 ← d A^T wt + (1 − d) p,    (2.3)

8http://www.google.de

where w is a weight vector with one entry for each node in V and p is the random surfer vector, which follows a uniform distribution. d ∈ [0, 1] determines the strength of the influence of p. Page et al. [1999] suggest setting d to 0.85. By normalizing the vector p, one enforces the equality ||w||1 = ||p||1.
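The weight-spreading computation of Equation (2.3) can be sketched with a few lines of Python. The link graph below is made up for illustration; a real implementation would of course operate on sparse matrices:

```python
# Compact sketch of Equation (2.3): w <- d * A^T w + (1 - d) * p,
# iterated until convergence on a tiny illustrative link graph.

links = {  # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
nodes = sorted(links)
d = 0.85                                 # damping factor of Page et al. [1999]
p = {n: 1 / len(nodes) for n in nodes}   # uniform random-surfer vector
w = dict(p)

for _ in range(50):                      # iterate until (approximate) convergence
    new_w = {n: (1 - d) * p[n] for n in nodes}
    for i, targets in links.items():
        for j in targets:                # spread w[i] uniformly over out-links of i
            new_w[j] += d * w[i] / len(targets)
    w = new_w

print({n: round(v, 3) for n, v in sorted(w.items())})
# "c" collects weight from both "a" and "b" and ranks highest
```

Since A is row-stochastic and every node here has out-links, the total weight stays at ||w||1 = ||p||1 = 1 throughout the iteration.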

The FolkRank Algorithm

Several adaptations of the PageRank algorithm to the folksonomy structure exist. The one used in this thesis is the FolkRank algorithm as presented in [Benz et al., 2010a; Hotho et al., 2006a]. The algorithm operates on an undirected, tripartite graph GF = (V, E), where V = U ∪˙ T ∪˙ R and the set of edges E results from splitting each tag assignment into three undirected edges, i. e., E = {{u, t}, {t, r}, {u, r} | (u, t, r) ∈ Y }. The PageRank formula as introduced in Section 2.5.1 can then be iteratively applied to the folksonomy graph:

wt+1 = d A^T wt + (1 − d) p,

where p is the random surfer vector (which we use as preference vector) and d ∈ [0, 1] is a constant which controls the influence of the random surfer. A is the row-stochastic version of the adjacency matrix of GF. One can specify preference weights in the vector p in order to compute a ranking of tags, resources and/or users tailored to the preferred items. In the case of web search, the tags representing search terms receive a higher weight than the remaining items (i. e., remaining tags, all users and all resources), whose weights follow a uniform distribution. Overall, the equation ||w||1 = ||p||1 needs to hold. The algorithm is outlined in Algorithm 2.1.

Algorithm 2.1: FolkRank
Input: undirected, tripartite graph GF, a randomly chosen baseline vector w0 and a randomly chosen FolkRank vector w1.
1: Set preference vector p.
2: Compute baseline vector w0 with p = 1, where 1 = [1, . . . , 1]^T.
3: Compute topic-specific vector w1 with specific preference vector p.
4: w := w1 − w0 is the final weight vector.
Output: FolkRank vector w.

As can be seen in Algorithm 2.1, the computation consists of two runs: first, a baseline with a uniform preference vector is computed; the result of this iteration is the fixed point w0. Second, the fixed point w1 is computed with a specific preference vector. The final weight vector for a specific search term is then w := w1 − w0. The subtraction of the baseline reinforces items which are close to the preferred items, while it degrades items which are popular in general.
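The two-run scheme can be sketched on a miniature folksonomy. Everything below (the tag assignments, the preference weight of 10 on the query tag, the iteration count) is an illustrative assumption, not the thesis' actual configuration:

```python
# Hedged sketch of the FolkRank idea: two PageRank-style runs on the
# undirected folksonomy graph - uniform preference vs. preference on a
# query tag - and the difference of the resulting weights. Toy data.

Y = [("u1", "web", "r1"), ("u1", "search", "r1"), ("u2", "web", "r2")]

# Undirected tripartite graph: each TAS contributes edges {u,t}, {t,r}, {u,r}.
adj = {}
for u, t, r in Y:
    for a, b in ((u, t), (t, r), (u, r)):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
nodes = sorted(adj)

def spread(pref, d=0.85, iters=100):
    """Iterate w <- d * A^T w + (1 - d) * p with row-stochastic A."""
    total = sum(pref.values())
    p = {n: pref.get(n, 0) / total for n in nodes}
    w = dict(p)
    for _ in range(iters):
        new_w = {n: (1 - d) * p[n] for n in nodes}
        for i in nodes:
            for j in adj[i]:              # spread w[i] uniformly over neighbours
                new_w[j] += d * w[i] / len(adj[i])
        w = new_w
    return w

w0 = spread({n: 1.0 for n in nodes})                              # baseline run
w1 = spread({n: (10.0 if n == "web" else 1.0) for n in nodes})    # prefer "web"
folkrank = {n: w1[n] - w0[n] for n in nodes}
print(max(folkrank, key=folkrank.get))
```

The differential weights sum to zero: items close to the preferred tag gain weight, globally popular but unrelated items lose it.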

Social PageRank

The Social PageRank algorithm was introduced in Bao et al. [2007]. Both the FolkRank and the Social PageRank algorithms are based on spreading weights along the link structure

of the folksonomy graph. The difference concerns the paths a random surfer can follow. While FolkRank allows all sorts of paths through the tripartite network, SocialPageRank restricts possible paths to resource-user-tag-resource-tag-user combinations [Abel et al., 2008]. We use the notation of Abel et al. [2008] to present the algorithm.

Algorithm 2.2: Social PageRank
Input: association matrices ATR, ARU, AUT, and a randomly chosen SocialPageRank vector wr0.
until wri converges do:
   wui = ARU^T wri
   wti = AUT^T wui
   w′ri = ATR^T wti
   w′ti = ATR w′ri
   w′ui = AUT w′ti
   wri+1 = ARU w′ui
Output: SocialPageRank vector wr.
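The iteration of Algorithm 2.2 can be sketched with plain nested-list matrices. The tiny association matrices are made up, and a normalization step is added for numerical stability (an assumption; the original formulation does not spell this out):

```python
# Sketch of the SocialPageRank iteration (after Bao et al. [2007], notation
# of Abel et al. [2008]). A_TR: tags x resources, A_RU: resources x users,
# A_UT: users x tags. All data is illustrative.

def matvec(A, v):                       # A * v for a nested-list matrix
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

A_TR = [[2, 0], [1, 1]]                 # 2 tags x 2 resources
A_RU = [[1, 1], [0, 1]]                 # 2 resources x 2 users
A_UT = [[1, 1], [1, 1]]                 # 2 users x 2 tags

w_r = [0.5, 0.5]                        # initial resource weights
for _ in range(30):                     # iterate until convergence
    w_u = matvec(transpose(A_RU), w_r)  # resources -> users
    w_t = matvec(transpose(A_UT), w_u)  # users -> tags
    w_r2 = matvec(transpose(A_TR), w_t) # tags -> resources
    w_t2 = matvec(A_TR, w_r2)           # and back: resources -> tags
    w_u2 = matvec(A_UT, w_t2)           # tags -> users
    w_r = normalize(matvec(A_RU, w_u2)) # users -> resources, kept bounded

print([round(x, 3) for x in w_r])
```

The normalized iteration converges to the dominant eigendirection of the combined walk; here the first resource, which is more strongly tagged and reaches both users, receives the larger weight.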

The same authors also proposed the SocialSimRank algorithm [Bao et al., 2007], which is based on SimRank. This algorithm is used to calculate similarity between items based on the resources they were assigned to.

Adaptations of the HITS algorithm to folksonomies

The HITS algorithm [Kleinberg, 1999a] has also been adjusted to rank items in a folksonomy. Two versions exist, which Abel et al. [2008] call Naive HITS and Social HITS. The challenge of using HITS to rank resources in a folksonomy is the transformation of the undirected tripartite graph into a directed graph. The two algorithms are based on the transformation proposed by Wu et al. [2006b]: a tag assignment (u, t, r) ∈ Y is split into two edges u → t and t → r. The resulting structure is a directed graph where hubs are users and authorities are resources (as resources have no outgoing links, their hub weights become 0). While the Naive HITS implementation uses this structure, Social HITS extends the graph by allowing for authority users and hub resources. Given a tag assignment (u, t, r) ∈ Y, they derive two directed edges from user actions: u → t and u → r. Additionally, they create an edge uh → ua whenever an arbitrary user uh annotated a resource after it had already been tagged by user ua.

2.5.2 Recommender Systems

Recommender systems are concerned with the identification of items which match the interests of a specific user. To find those items, a variety of information sources related to both the user and the content items are considered, for example purchase histories, ratings, clickdata from logfiles or demographic information [Jäschke et al., 2009]. Typical domains for recommender systems are online shopping systems (recommendations of certain products such as books at Amazon.com [Linden et al., 2003]), social networks (suggestions of people one might know [Chen et al., 2009b]) or multimedia pages (music or movie recommendations).


One of the most prominent recommender approaches is collaborative filtering. The approach creates user profiles based on user preferences, behaviour or a user's demographic situation. In order to recommend appropriate items, user profiles are compared; items from the profiles most similar to that of the user for whom a recommendation shall be made are then selected.

A second prominent approach is content-based filtering, which processes the information about an item in order to find similar ones. Content-based filtering methods generate an item-item matrix showing the similarity between pairs of items. Again, the most similar ones can be used for recommendation. The two approaches (collaborative filtering and content-based filtering) can be combined in so-called hybrid approaches. The algorithms used for calculating the similarities vary from simple statistical calculations to graph-based computations and data mining procedures (see [Adomavicius and Tuzhilin, 2005] for an overview).

The three dimensions of tagging systems allow for multi-mode recommendations, e. g., finding resources, tags, or users [Marinho et al., 2011]. Similar to other recommendation settings, recommender methods in tagging systems need to deal with common problems such as the cold start problem, where it is difficult to propose items to new users, the sparsity of the data, and the need for real-time recommendations. For each of the three dimensions we will briefly summarize relevant research.
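A minimal user-based collaborative-filtering sketch of the profile-comparison idea described above (profiles, items and the choice of Jaccard similarity are illustrative assumptions):

```python
# Toy user-based collaborative filtering: compare user profiles, then
# recommend unseen items from the most similar profile. Data is made up.

profiles = {  # user -> set of items the user bookmarked
    "alice": {"r1", "r2", "r3"},
    "bob":   {"r1", "r2", "r4"},
    "carol": {"r5"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, profiles):
    own = profiles[user]
    # Most similar other user by Jaccard similarity of item sets.
    neighbour = max((u for u in profiles if u != user),
                    key=lambda u: jaccard(own, profiles[u]))
    return profiles[neighbour] - own   # their items the user does not know yet

print(recommend("alice", profiles))    # bob is most similar -> {'r4'}
```

Real systems use rating-aware similarities (Pearson, cosine) and aggregate over many neighbours rather than one, but the structure is the same.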

Tag Recommendations

In order to support users when assigning tags to a resource, many tagging systems recommend tags based on recent user and resource information. The input of tag recommender algorithms are pairs of users and resources. The output is a set of tags Tr: the top n tags resulting from computing a weight w(u, r, t) for each tag and a specific user-document pair [Gemmell et al., 2009].

In 2008 and 2009, two tag recommendation challenges were conducted at the ECML/PKDD conference [Eisterlehner et al., 2009; Hotho et al., 2008]. In 2009, a tag recommender framework was developed for BibSonomy. Besides a flexible integration of different recommendation strategies, the framework tracks all stages of the recommendation process and offers evaluation capabilities [Jäschke et al., 2009].

The best tag recommendation approach based on the test results of the challenges [Jäschke et al., 2007, 2008] is the FolkRank algorithm (see Section 2.5.1) adjusted to the tag recommendation scenario. It outperforms several baselines such as most-popular models and collaborative filtering algorithms. Its weakness is a slow prediction runtime, which makes realtime computation of tag recommendations difficult.

Many more efforts with respect to improving tag recommendation systems have been made since then. The authors of Zhang et al. [2011] give an overview of recent works in the field of network-based methods, tensor-based methods and topic-based methods. Furthermore, subsequent publications have presented results similar to or better than those of the FolkRank algorithm. For example, Cai et al. [2011] use low-order tensor decompositions to achieve slightly better experimental results. Another method is to better reflect human behaviour, for example by introducing a time-dependent forgetting process, as humans are influenced more by recent activities (i. e., used tags) than by activities carried out a while ago [Kowald et al., 2015].
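The w(u, r, t) setting can be illustrated with one of the most-popular baselines mentioned above: score each candidate tag by a linear mix of how often the user has used it and how often the resource carries it. This is a hypothetical baseline sketch, not the FolkRank recommender, and all data and the mixing parameter are made up:

```python
# Baseline sketch of the tag-recommendation setting: for a (user, resource)
# pair, compute w(u, r, t) as a mix of user and resource tag popularity and
# return the top-n tags. Illustrative data and parameters.
from collections import Counter

Y = [  # tag assignments (user, tag, resource)
    ("u1", "web", "r1"), ("u1", "python", "r1"),
    ("u2", "web", "r1"), ("u2", "web", "r2"),
    ("u3", "python", "r3"),
]

def recommend_tags(u, r, n=2, alpha=0.5):
    """Top-n tags by the mixed weight w(u, r, t)."""
    user_tags = Counter(t for (user, t, res) in Y if user == u)
    res_tags = Counter(t for (user, t, res) in Y if res == r)
    candidates = set(user_tags) | set(res_tags)
    # w(u, r, t): linear mix of the user's and the resource's tag counts
    w = {t: alpha * user_tags[t] + (1 - alpha) * res_tags[t] for t in candidates}
    # sort by weight (descending), break ties alphabetically
    return [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]

print(recommend_tags("u3", "r1"))  # mixes u3's vocabulary with r1's tags
```

Such "most popular by user/resource" mixes served as competitive baselines in the ECML/PKDD challenges; graph-based methods like FolkRank improve on them at higher computational cost.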
Finally, besides the testing platform of [Jäschke et al., 2009] several frameworks have been

introduced [Domínguez García et al., 2012; Kowald et al., 2014] in order to facilitate the implementation and evaluation of recommender algorithms.

Resource Recommendations

Resource recommendation in social tagging systems can be carried out by compiling a resource list based on profile information of a specific user, a query (tags), or both. The task is similar to ranking resources in a folksonomy (see Section 2.5.1). The different approaches proposed in this field are difficult to compare, as different datasets, evaluation methods and measures have been used.

For instance, Niwa et al. [2006] present a resource recommendation system based on user interests modeled with tags. Each user is assigned to a tag cluster. Resource propositions are generated by comparing the similarity of the resource's tags to the user's tag cluster. Gemmell et al. [2008] propose a hierarchical agglomerative clustering technique to compute resource recommendations tailored to a specific user. The recommendations are used to present the user a personalized list of resources after he or she has clicked on an arbitrary tag. Stoyanovich et al. [2008] explore various methods to produce "hotlists" – resource lists customized for users. As information sources they use tags as well as explicitly stated or derived social ties. Wetzker et al. [2009] use a probabilistic latent semantic analysis approach whereby a topic model is built from resource-tag and resource-user co-occurrences. Using the two distributions, the authors benefit from a more collaborative filtering based approach (user-resource distribution) as well as a content filtering based approach (tag-resource distribution). In Cantador et al. [2010], content-based recommendation algorithms based on the Vector Space and Okapi BM25 [Manning et al., 2008] retrieval models are built from user and item profiles of tags. Guan et al. [2010] develop a graph-based algorithm. They create a semantic space of users, resources and tags where related objects appear close to each other. The documents closest to the user and not tagged by him or her are then recommended. The authors of Doerfel et al.
[2012] recommend scientific resources given a user profile in BibSonomy. They extend the FolkRank algorithm (see the previous paragraph) by including a fourth dimension – user group information – and by manipulating the preference vector so that higher preference is given to similar users, recently posted resources or popular resources. A combination of different recommender algorithms is proposed by Gemmell et al. [2012]. They later extend their algorithm by partitioning users into different groups and computing different weights for each of these groups [Gemmell et al., 2014].

User Recommendations

User recommendations can be obtained either by considering the social relationships between users (i. e., if user A connects to users B and C, users B and C might be interested in each other as well) or by considering their shared resources or tags.

Symeonidis [2009] performs latent semantic analysis and dimensionality reduction using a 3-order tensor which models the three dimensions users, tags and resources. The algorithm is evaluated on part of the BibSonomy dataset by comparing the document similarities of users found by this method with those of a simple baseline which finds similar users based on shared tags. Zhou et al. [2010] present a user recommendation framework where user interests are modeled based on tag co-occurrence information. The profiles are represented by discrete topic distributions. Similarities between user topic distributions can then be measured by means of the Kullback-Leibler divergence. The framework's evaluation is conducted on a Delicious dataset. Manca et al. [2015] recommend users based on a combination of the Pearson correlation coefficient between tag vectors and the percentage of shared resources. The social bookmarking system BibSonomy recommends users based on the FolkRank algorithm (see Section 2.5.1). If a recommended user is of interest, he or she can be added to a special list: the follower list. The associated followers page then shows all recent posts of the follower list's users9.

The task of user recommendation can also be seen as a form of community detection: similar users are grouped together. A review of recent community detection approaches in folksonomies can be found below in Section 2.5.3.
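The topic-distribution comparison attributed to Zhou et al. [2010] above reduces to a few lines once the distributions are given. The distributions below are made up for illustration:

```python
# Illustrative sketch: users as discrete topic distributions, compared with
# the Kullback-Leibler divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i).
import math

user_topics = {  # user -> probability distribution over three topics
    "u1": [0.7, 0.2, 0.1],
    "u2": [0.6, 0.3, 0.1],
    "u3": [0.1, 0.1, 0.8],
}

def kl_divergence(p, q, eps=1e-12):
    # eps guards against division by zero for topics with probability 0 in q
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

d12 = kl_divergence(user_topics["u1"], user_topics["u2"])
d13 = kl_divergence(user_topics["u1"], user_topics["u3"])
print(f"D(u1||u2) = {d12:.3f}, D(u1||u3) = {d13:.3f}")  # u2 is closer to u1
```

Note that the KL divergence is asymmetric; symmetrized variants (e.g. averaging both directions) are common when a proper similarity score is needed.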

2.5.3 Community Detection

Detecting clusters or communities in large real-world graphs such as large social or information networks is a problem of considerable interest. Most approaches to detect communities either consider the tripartite hypergraph or projections of it. For instance, Nair and Dua [2012] build bipartite tag-resource graphs for a specific user, join them and transform them into uniform tag graphs. Liu and Murata [2011] and Neubauer and Obermayer [2011] present community detection algorithms suitable for the tripartite hypergraph of a folksonomy.

Most users of tagging systems cannot be fully assigned to one specific user group but have multiple topical interests shared with different communities. Several studies deal with such overlapping community memberships, i. e., nodes can be assigned to more than one cluster. For instance, Schifanella et al. [2010] group users according to their tagging behaviour. Wang et al. [2010] build a bipartite graph of users and tags and cluster the edges of this graph. Overlapping communities are then built by grouping all user nodes which are part of edges belonging to one cluster. Chakraborty et al. [2012] apply link clustering algorithms to the tripartite structure of a folksonomy graph by building a weighted line graph in which the hyperedges of the folksonomy graph are nodes, and two nodes are connected if the corresponding hyperedges share at least one common node in the folksonomy graph.

A challenge of community detection algorithms is their evaluation, since the size of the datasets makes it difficult for humans to create a 'ground truth' where each user is assigned a membership to one or more interest groups. One method is to use synthetic data [for example Chakraborty et al., 2012]. Another method to overcome this difficulty is to use the existing links between users of a social network.
The idea is to compute clusters built from the implicit social connections (for example the shared tags and resources) and then compare the extracted communities to explicit social links between users of the network [for example Chakraborty et al., 2012; Ghosh et al., 2011b; Wang et al., 2010]. Mitzlaff et al. [2011] propose a set of "evidence networks" reflecting typical user interactions in a tagging system. These networks can be used as an approximation of (explicit) user groups. In [Mitzlaff et al., 2014] the authors present experiments showing that users in such evidence networks tend to be semantically related. Neubauer and Obermayer [2011] present an interactive tool visualizing the clustering tree. Besides navigation, the tool can be used to compare results of

9http://www.bibsonomy.org/help_en/Followers

different clustering algorithms.

2.6 Example Systems

Prominent examples of collaborative tagging systems are Delicious 10 for bookmarks, Connotea 11 and CiteULike 12 for publication metadata, BibSonomy 13 for bookmarks and publication metadata, Flickr 14 for photos or YouTube 15 for videos. The process of tagging has been integrated into many websites; examples are Technorati 16 (weblog posts) or Twitter 17 (micromessaging posts). As the datasets used in this thesis have been generated from the two systems Delicious and BibSonomy, we will concentrate on a detailed description of these.

BibSonomy

BibSonomy was introduced in 2006 [Hotho et al., 2006b]. The social bookmarking system is hosted by the Knowledge Engineering Group at the University of Kassel and the Data Mining and Information Retrieval Group at the University of Würzburg. The target user group is university users, including students, teachers and scientists. As their work requires both the collection of relevant web links and the collection of relevant publications, BibSonomy combines the management of both types of resources. Hence, users can either post web links or publication references. As of April 2011 the system has about 6700 active users who share about 380,000 bookmarks and 580,000 publication metadata entries. Additionally, the system contains about one million publications and 20,000 homepages of research workshops or persons, which have been automatically copied from the computer science library DBLP 18. Further system features were developed to support researchers in their daily work, e. g., finding relevant information, storing and structuring information, managing references and creating publication lists, be it for a diploma thesis, a research paper or the website of the research group. BibSonomy also promotes social interactions between users by offering friend connections and the possibility to follow the posts of other users. A more complete description of the system’s features can be found in Benz et al. [2009a]; Hotho et al. [2006b]. In Benz et al. [2010a], BibSonomy was used as a research platform by the Knowledge and Data Engineering team of Kassel. The team conducted experiments concerning different aspects of data mining and analysis, including network properties, semantic characteristics, recommender systems, search and spam detection. Finally, BibSonomy offers system snapshots to other researchers in order to support investigations about tagging data. In two challenges (ECML/PKDD discovery challenge

10 http://del.icio.us
11 http://www.connotea.org/ - as of March 2013, the service stopped.
12 http://www.citeulike.org
13 http://www.bibsonomy.org
14 http://www.flickr.com
15 http://www.youtube.com
16 http://technorati.com/
17 https://twitter.com/
18 www.dblp.uni-trier.de


2008 and 2009) BibSonomy data was used (see Section 7.4 for a further description). Several papers were published using those datasets, among them Papadopoulos et al. [2010] exploring the semantics of tagging systems, Papadopoulos et al. [2011] analysing communities, Belém et al. [2014]; Djuana et al. [2014]; Jin et al. [2010]; Peng et al. [2010]; Rendle and Schmidt-Thieme [2010]; Yin et al. [2011] building recommender services, and Ignatov et al. [2014]; Markines et al. [2009a]; Neubauer and Obermayer [2009]; Neubauer et al. [2009]; Yazdani et al. [2012a] creating features and algorithms for spam detection. Several of the mentioned papers have been discussed in the context of tagging system research in this chapter or will be discussed in the following chapters (especially research about spam detection in Chapter 4).

Delicious

One of the first social bookmarking systems to become popular was Delicious. The system, founded by Joshua Schachter, went online in September 2003 19. It arose from a system called Memepool in which Schachter simply collected interesting bookmarks. Over time, users sent him more and more interesting links, so he wrote an application (Muxway) which allowed him to organise his links with short labels - tags. He then realised that not only he, but also other internet users might be interested in organizing and sharing their internet links with the help of tags - and rewrote Muxway so that it became the website Delicious. Soon, Delicious became very popular. From December 2005 it was operated by Yahoo! Inc. As the system did not provide financial benefits for the company, it was sold to the company AVOS, which was started by the founders of YouTube, Chad Hurley and Steve Chen 20, in 2011. Since then, several aspects of the system have been redesigned in order to introduce more social features into the system 21. In May 2014 they sold the system to Science Inc., a Californian technology investment and advisory firm 22.

19 http://www.technologyreview.com/tr35/profile.aspx?trid=432
20 http://techcrunch.com/2011/04/27/yahoo-sells-delicious-to-youtube-founders/
21 http://mashable.com/2012/10/04/delicious-redesign/
22 http://mashable.com/2014/05/08/delicious-acquired-science-inc/


Chapter 3

Social Information Retrieval

In recent online discussions and literature, the term social information retrieval has been used to describe approaches integrating user behaviour and user interactions into the search process. In this chapter, we want to analyze this trend. We will first discuss what social information retrieval stands for (Section 3.1). We will then depict specific topics of social information retrieval which are relevant as background information for the experiments concerning the comparison of social bookmarking and online search systems presented in Chapter 6. This includes a review of clickdata, its usage in learning-to-rank scenarios and folksonomy-like structures built from clickdata (Section 3.2), as well as an analysis of how tagging data can leverage information retrieval tasks (Section 3.3).

3.1 Characterizing Social Information Retrieval

Traditional information retrieval methods have focused on document-query matching approaches or — with the advent of hypertext — on link analysis. Up to this point, search has been viewed as a single-user action: people submit keywords to a search engine and get more or less relevant results. During recent years, interest broadened towards exploiting these user (inter)actions. Researchers and developers became aware of the fact that users actively participate in the search process, for example, by entering specific search queries which other people also entered or by clicking on specific search results. Furthermore, they form communities by sharing interests, interacting with each other and influencing one another. This social perspective of search has become an emerging research focus in recent years. A commonly agreed-upon definition does not exist, but different aspects are considered when referring to social search. A very general definition, including the most important aspects, is presented by Evans and Chi [2010]:

Social search is an umbrella term used to describe search acts that make use of social interactions with others. These interactions may be explicit or implicit, co-located or remote, synchronous or asynchronous.

Croft et al. [2010] mention similar characteristics: Social search involves “communities of users actively participating in the search process. It goes beyond classical search tasks”

by allowing “users to interact with the system” and to collaborate “with other users either implicitly or explicitly”. The application areas concerned with social information retrieval vary among different authors. The most prominent ones are collaborative searching, collaborative filtering, tagging, social network search, community search and question answering systems. In this thesis we focus on collaborative search. Morris and Horvitz [2007] classified collaborative search systems into different categories depending on their focus.

Sensemaking systems focus on processing and organizing information in order to make it understandable. Supporting features can be commenting functions or the combination of different text pieces. For example, Google Notebook 1, an online service, allows the creation of documents (“notebooks”) in which different information sources (text, links or images) can be brought together and enhanced with user-written text.

Multi-User Web Browsing allows several users to process online information by providing a collaborative interface. Groups can view other members’ navigation paths, set pointers to an important web page or comment on jointly-viewed web pages. For instance, the system SearchTogether 2 allows users to see the query terms other users in their group submitted or view a summary of all pages that have been rated or commented on by other group members.

Multi-User Search aims at enabling social activities in the search process itself. Such activities can vary from providing chatting possibilities during search to sharing retrieved websites within a group or commenting on other people’s searches. The client Windows Live Messenger (now discontinued [Windows, 2013]) allowed searching during a chat: chat participants could enter a query into a box, click on search, and the search results were shown to all chat members.

Social bookmarking systems as discussed in the previous chapter enable different users to store bookmarks and share them with other users. Collaboration takes place by creating a common index and tagging information with descriptive tags which can be used for search.

Passive collaboration systems incorporate implicit information inferred from the users’ interaction with the search engine. The most prominent examples of this group are search engines using query logs and clickthrough data to improve search.

In this thesis, we concentrate on analyses and methods for the last two categories: social bookmarking and passive collaboration systems. We are interested in the similarities and differences of both approaches and in how to combine the user knowledge of both systems (see Chapter 6).

1 After July 2012, Google Notebook has been integrated into Google Docs at https://drive.google.com/ob?usp=web_ww_intro
2 http://research.microsoft.com/en-us/um/redmond/projects/searchtogether/tutorial.html


3.2 Exploiting Clickdata

This section will review the exploitation of feedback data from passive collaboration systems, in our case from query log files of search engines. After introducing query logs, search engine evaluation with query logs will be discussed. This comprises an analysis of the quality of clickdata, the generation of training data for learning-to-rank algorithms and Ranking SVM, a learning-to-rank algorithm trained with feedback generated from clickfiles. This background is needed in Section 6.5, where we compare the feedback of query logs to the feedback of tagging systems. Finally, the information of clickdata will be used in another way: with the users, their queries and clicked resources extracted from logfiles, logsonomies can be built. Their structure should be similar to that of folksonomies. A comparison of folk- and logsonomies is presented in Section 6.4.

3.2.1 Query Log Basics

Search engine query logs record the interaction of users with web search engines. These interactions consist of communication exchanges which occur between searchers and search engines. Major search engines store millions of such exchanges per day. Different kinds of search data can be recorded, such as the “client computer’s Internet Protocol (IP) address, user query, search engine access time and referrer site, among other fields” [Jansen, 2006]. Typical transaction log formats are access logs, referrer logs or extended logs.
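As an illustration, a single logged interaction might be parsed as follows. The tab-separated field layout (IP address, timestamp, query, clicked URL, referrer) is a hypothetical format chosen for this sketch; real engines use proprietary schemas:

```python
from datetime import datetime

def parse_log_line(line):
    """Split one tab-separated query-log line into its fields.
    The field order used here is an assumption for illustration."""
    ip, ts, query, clicked_url, referrer = line.rstrip("\n").split("\t")
    return {
        "ip": ip,
        "time": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "query": query.split(),          # the query as a list of terms
        "clicked_url": clicked_url,
        "referrer": referrer,
    }

record = parse_log_line(
    "192.0.2.17\t2011-04-27 10:15:03\tsocial bookmarking systems"
    "\thttp://www.bibsonomy.org/\thttp://www.example.org/"
)
print(record["query"])  # ['social', 'bookmarking', 'systems']
```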

Characterizing queries

Similar to the classification of tags presented in Section 2.1.3, queries have been characterized. A query refers to a string of several terms submitted to a search engine [Jansen et al., 2000]. Searchers start with an initial query; depending on their search success they may submit modified queries. Broder [2002] introduced three different types of queries:

Navigational queries: a user wants to reach a particular website.

Transactional queries: a user wants to perform some web-mediated activity.

Informational queries: a user needs to acquire some information.

Broder [2002] manually classified queries of a log file of the web search engine AltaVista into these three categories and found that most of them were informational. Broder’s taxonomy is used or built upon by many subsequent query log studies. For instance, Jansen et al. [2008] identify characteristics of each query category and propose an automatic classification method to sort queries into the three categories.
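A crude heuristic classifier in the spirit of this taxonomy can be sketched as follows. The cue lists are invented for illustration; they are not the feature set of Jansen et al. [2008]:

```python
def classify_query(query):
    """Rule-of-thumb query classification into Broder's three types.
    The keyword cues below are illustrative assumptions only."""
    q = query.lower()
    navigational = (".com", ".org", ".de", "www", "homepage")
    transactional = ("download", "buy", "order", "lyrics", "software")
    if any(cue in q for cue in navigational):
        return "navigational"
    if any(cue in q for cue in transactional):
        return "transactional"
    # default: the most frequent class in Broder's AltaVista study
    return "informational"

print(classify_query("www.bibsonomy.org"))        # navigational
print(classify_query("download firefox"))         # transactional
print(classify_query("history of folksonomies"))  # informational
```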

3.2.2 Search Engine Evaluation with Query Logs

A practical usage of query logs is the evaluation and tuning of search algorithms. In order to decide whether an algorithm performs well, one needs a sound basis (“ground truth”), mostly generated from explicit user feedback such as ratings. Many search engines compute their rankings with the help of automatic machine learning techniques, which are based on a ranking model generated from training and test data. If clicks can be interpreted as preferences, such data can be automatically derived from query logs instead of having to create the dataset manually.


This section discusses the generation of training data from clickdata. First, the traditional approach (creating a gold standard) is presented. Section 3.2.2 reviews several studies dealing with quality issues of clickdata. Finally, different strategies to infer preferences from clickdata introduced by Joachims [2002] are assessed.

Traditional evaluation: The gold standard

A common approach to tune a search engine is to create a “gold standard” against which different ranking mechanisms can be evaluated. Such datasets are normally constructed by human annotators who rate a set of web pages retrieved for a specific query according to their perceived relevance. The first test collection available was the Cranfield collection. It consists of 1398 abstracts of aerodynamics journal articles, 225 one- or two-sentence topics and exhaustive relevance judgments of all (query, document) pairs [Manning et al., 2008]. As the collection is rather small, it is not used for evaluation purposes anymore. However, the systematic way of constructing a test collection and conducting comparable, reproducible evaluations has pioneered the evaluation of ranking systems. The approach is now referred to as the Cranfield paradigm. The Text REtrieval Conference (TREC) started in 1992 with the goal of creating a framework for evaluating retrieval tasks using large test collections. New collections and tasks are published annually. Anyone can participate by solving the tasks related to the different test collections. Results are then presented and discussed in the scope of an annual workshop. Over the years, different tasks were added, including filtering, question answering, web, Chinese or spoken document retrieval. In order to respond to the changing data requirements, the “classic” collection of about 5 GB of newswire or patent text was extended by the TREC web collections. Further test collections include the NIST test document collections 3, the NII Test Collections for IR Systems (NTCIR) 4, concentrating on East Asian languages and cross-language retrieval, or the Cross Language Evaluation Forum (CLEF) 5, which focuses on European languages and cross-language retrieval. The creation of a gold standard has several pitfalls.
• Different studies [Bailey et al., 2008; Voorhees, 1998] demonstrated that human annotators do not always agree with each other.

• Furthermore, the evaluation of web search algorithms requires large amounts of testing data. On top of that, the creation of labels for such data is expensive and laborious.

• Often, only few (or just one) judges are asked to label the data. The data may therefore be error-prone and not reflect the preference perception of the majority of web search engine users.

• The creation of a document corpus is a static process. Changes in the document collection or in the focus of popular searches cannot be considered [Dupret et al., 2007].

3 http://www.nist.gov/tac/data/data_desc.html
4 http://research.nii.ac.jp/ntcir/index-en.html
5 http://clef.isti.cnr.it/


Alternatives to a gold standard evaluation have been considered and are presented in the following section.

Quality of clickdata

During the last few years, a new form of search engine evaluation has gained researchers’ and practitioners’ attention: the automatic generation of labels inferred from the search engine’s logfiles. The approach avoids some of the difficulties related to the manual creation of preferences: by automatically collecting user click data, no human labels need to be generated. Furthermore, the continuous storage of user data allows keeping pace with changes in a collection. Several studies about the practicability of using clickdata for evaluation purposes exist. They either compare feedback from clickdata to explicit relevance judgments [Fox et al., 2005; Joachims et al., 2007] or they look at differences between search engine rankings based on clicks or human labels [Kamps et al., 2009; Macdonald and Ounis, 2009; Xu et al., 2010]. The authors of Fox et al. [2005] compare explicit and implicit feedback by developing models based on implicit feedback indicators to predict explicit ratings of user interest in web searches. They developed a browser which enabled the collection of various implicit relevance indicators, such as time and scrolling activities, which page was clicked or whether a page was added to a user’s favourites or printed. The browser also allowed storing explicit judgements of a web page’s relevance and a user’s degree of satisfaction with the entire search session. Based on the implicit indicators tracked for different users, Bayesian models and decision trees were built to predict the users’ explicit ratings. It could be shown that a combination of clickthrough, time spent on the search result and how a user exited a result or ended a search session performed best. The authors of Joachims et al. [2007] conducted an eyetracking user study to evaluate the expressiveness of clickdata. The authors asked 34 participants to answer 10 questions by using the search engine Google.
The questions required either navigational or informational searches. With the help of an eyetracker, the movement of the users’ eyes could be compared to their clicks. Users also had to give explicit relevance judgements. The key finding of Joachims et al. [2007] was that clicks cannot be seen as an unbiased assessment of the absolute importance of a web page. As users normally read search results from the top to the bottom of a result list, a click can rather be considered as an indication of relative importance: the clicked web page is considered more important than the unclicked results which appeared before it in the ranking [Silvestri, 2010]. Kamps et al. [2009] analysed the difference between clicks from query logs and explicit human judgements and compared rankings computed from the two approaches. They mapped queries from an MSN log file and a proxy log file to relevance judgments of an IR test collection of Wikipedia articles. One major difference they found was the number of relevant documents per topic (query): while topics derived from the query logs mostly have one to 13 relevant documents, topics from the manually labeled ad-hoc dataset contain, on average, 69 relevant documents. They also analysed rankings based on human and implicit judgements and found large differences between the two approaches. Macdonald and Ounis [2009] propose different sampling strategies to create training data from clickthrough files. The training data can then be used to learn the parameters of a ranking function, in their case the parameters of a field-based weighting model. The

learned model is compared to a baseline system which is trained using a mixed set of TREC Web task queries. It can be demonstrated that the model inferred from click training data usually performed as well as the model derived from human-assessed training data, sometimes even significantly better. The authors of Xu et al. [2010] test the quality of clickdata in a slightly different setting. They propose using clicks in order to automatically detect errors in training data for learning-to-rank algorithms and improve their quality. They develop two discriminative models to predict labels. If a predicted label differs from a label assigned by humans, the human label is considered as an error. Experiments demonstrate that correcting erroneous labels with the help of click data helps to improve the performance of ranking algorithms. Geng et al. [2012] deal with the handling of noise produced by training examples generated from clickdata. They propose a graphical model which differentiates between true labels and observed labels. Their experiments show good results.

Automatic generation of training data

Training or evaluation data for ranking algorithms consists of several queries and, for each query, a list of documents labeled with a specific rating. One of the most popular works on how to generate such labels from clickdata is that of Joachims [2002]. The author claims that clickdata can be interpreted as a form of relative feedback, whereby a clicked document indicates that it is more important than previous non-clicked documents. He based his proposition on a user study (see Section 3.2.2) which showed that search engine users tend to view search results from the top to the bottom of a result list. Based on the concept of relative importance, different strategies about how to infer labels from a list of search results and corresponding clicks were defined. We will use the example of Silvestri [2010] to explain the different strategies. Let q be a query returning result pages p1 to p7 and suppose a user clicks on pages p1, p2, p4, and p7. Marking each clicked page with a star, this can be denoted as: p1*; p2*; p3; p4*; p5; p6; p7. rel(·) is the function measuring the relevance of a page: rel(pi) > rel(pj) means pi is more relevant than pj. C denotes the set of ranks of the clicked-on links. Joachims et al. [2005] defined the following strategies for extracting feedback from such clicks. Some of them are adjusted to our experiments, where we compare the performance of learning-to-rank algorithms trained with feedback generated from clickdata to that trained with feedback generated from tagging data (see Section 6.5.1). The notation and definitions are taken from Joachims et al. [2005].

Strategy 1 (Click > Skip Above) For a ranking (p1; p2; ...) and a set C containing the ranks of the clicked-on links, extract a preference example rel(pi) > rel(pj) for all pairs 1 ≤ j < i with i ∈ C and j ∉ C. The strategy indicates that when users click on a document but ignore previous documents, those previous documents are not considered relevant. The preference examples we can infer from Strategy 1 are the following: rel(p4) > rel(p3), rel(p7) > rel(p5), rel(p7) > rel(p3), and rel(p7) > rel(p6).

Strategy 2 (Last Click > Skip Above) For a ranking (p1; p2; ...) and a set C containing the ranks of the clicked-on links, let i ∈ C be the rank of the link that was clicked temporally last. Extract a preference example rel(pi) > rel(pj) for all pairs 1 ≤ j < i with j ∉ C.


This strategy assumes that a searcher who sequentially clicks on several result pages has neither been satisfied with the previous pages he or she clicked on nor with the previous documents’ abstracts in the result list. The last clicked result therefore seems to be the most relevant one for the specific query. From the running example, using Strategy 2, the preferences rel(p7) > rel(p6), rel(p7) > rel(p5), and rel(p7) > rel(p3) are extracted.

Strategy 3 (Click > Earlier Click) For a ranking (p1; p2; ...) and a set C containing the ranks of the clicked-on links, let t(i), i ∈ C, be the time when the link at rank i was clicked. We extract a preference rel(pi) > rel(pj) for all pairs i and j with t(i) > t(j).

Let us assume that the pages of our example were clicked in this order: p4, p1, p2, p7. According to Strategy 3 the following preference relations can be extracted from this order: rel(p1) > rel(p4), rel(p2) > rel(p4), rel(p2) > rel(p1), rel(p7) > rel(p4), rel(p7) > rel(p1), and rel(p7) > rel(p2). Strategy 3 resembles the previous one in that later clicks are taken to indicate more relevant results than earlier ones. However, the current strategy does not consider the list’s position but the temporal dimension: documents clicked later in time are more relevant than earlier clicks.

Strategy 4 (Last Click > Skip Previous) For a ranking (p1; p2; ...) and a set C containing the ranks of the clicked-on links, extract a preference example rel(pi) > rel(pi−1) for all i ≥ 2 with i ∈ C and (i − 1) ∉ C.

The strategy assumes that abstracts which appear immediately above the clicked document have most probably also been evaluated (and not been considered relevant). Considering the running example, the preference pairs rel(p4) > rel(p3) and rel(p7) > rel(p6) are extracted.

Strategy 5 (Click > No-Click Next) For a ranking (p1; p2; ...) and a set C containing the ranks of the clicked-on links, extract a preference example rel(pi) > rel(pi+1) for all i ∈ C with (i + 1) ∉ C.
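The five strategies can be sketched in a few lines of code. The function names and the representation of results as 1-based ranks are our own choices for this sketch; applied to the running example (clicks on p1, p2, p4 and p7, in the temporal order p4, p1, p2, p7), the functions reproduce the preference pairs discussed in the text:

```python
def click_skip_above(clicks):
    """Strategy 1: a click is preferred over every earlier unclicked
    result.  Pairs are (more relevant rank, less relevant rank)."""
    return {(i, j) for i in clicks for j in range(1, i) if j not in clicks}

def last_click_skip_above(clicks, order):
    """Strategy 2: only the temporally last click generates pairs."""
    last = order[-1]
    return {(last, j) for j in range(1, last) if j not in clicks}

def click_earlier_click(order):
    """Strategy 3: a later click is preferred over every earlier click."""
    return {(order[b], order[a])
            for a in range(len(order)) for b in range(a + 1, len(order))}

def click_skip_previous(clicks):
    """Strategy 4: a click beats the directly preceding unclicked result."""
    return {(i, i - 1) for i in clicks if i >= 2 and i - 1 not in clicks}

def click_no_click_next(clicks, n):
    """Strategy 5: a click beats the directly following unclicked result."""
    return {(i, i + 1) for i in clicks if i < n and i + 1 not in clicks}

# Running example: seven results; clicks on p1, p2, p4 and p7,
# in the temporal order p4, p1, p2, p7.
clicks, order, n = {1, 2, 4, 7}, [4, 1, 2, 7], 7
print(sorted(click_skip_above(clicks)))  # [(4, 3), (7, 3), (7, 5), (7, 6)]
```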

Considering Strategy 5 for the last example, one can extract the preference pairs rel(p2) > rel(p3) and rel(p4) > rel(p5). In contrast to Joachims [2002], who considered the extraction of preferences from single users’ clickdata, Dou et al. [2008] propose the aggregation of user clicks and an extraction of pairwise preferences from such aggregated data. They verify their approach by analyzing the correlation between pairwise preferences extracted from clickthrough data and those extracted from data labeled by human judges. Their analysis shows that, generally, human judgements and click frequencies are only weakly correlated. They correlate more strongly when unclicked documents are included in the creation of preference pairs. Further, they can show that the correlation becomes better when a) the click difference between query-document pairs is considered and b) only queries with a small entropy (i. e., navigational queries) are analysed. Another work exploring the aggregation of user clicks is the one of Agichtein et al. [2006]. Their results confirm previous research showing that users are influenced in their click behaviour by the underlying ranking algorithm. Users tend to click more on the top search results even if those are not relevant. To correct this bias, they compute a prior background distribution, which is the expected clickthrough for a result at a given position, and subtract this from the observed clickthrough frequency at that position. Based on the


“cleaned” distributions, they define different heuristic strategies similar to the ones presented in Joachims et al. [2007]. Additionally, the authors propose to learn a ranking function to derive relevance preferences from features based on implicit feedback information. In order to build such a model, however, they need a labeled training set. Their evaluation using such a set indicates that the automatic extraction of relevance preferences performs significantly better than the heuristic strategies designed by humans.
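The position-bias correction just described can be illustrated with a small sketch; the click frequencies below are made up for illustration and are not from Agichtein et al. [2006]:

```python
def deviation_from_expected(observed, background):
    """Subtract the expected (position-bias) clickthrough from the
    observed clickthrough at each rank.  A positive value suggests a
    result attracted more clicks than its position alone explains."""
    return [o - b for o, b in zip(observed, background)]

# Expected click frequency per rank, averaged over many queries ...
background = [0.40, 0.20, 0.10, 0.07, 0.05]
# ... and the observed frequencies for one particular query.
observed = [0.35, 0.15, 0.30, 0.10, 0.05]

corrected = deviation_from_expected(observed, background)
# After correction, the result at rank 3 stands out despite its low position.
best = max(range(len(corrected)), key=lambda i: corrected[i]) + 1
print(best)  # 3
```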

3.2.3 Learning-to-Rank

Most machine learning techniques require a large training and test corpus. Because of this, it is difficult to apply such algorithms to ranking problems. As was shown in Section 3.2.2, only few official test corpora exist, and even for search engines it is laborious and costly to generate new corpora. The discovery of clickdata as input for the generation of training and test corpora finally made the usage of machine learning techniques in ranking problems more feasible. Applied to a ranking setting, these methods are summarized under the term learning-to-rank. As has been noted in blogs and literature, for example Macdonald et al. [2013]; Sullivan [2005], not only the research community has embraced learning-to-rank methods; search engines are also interested in these methods for delivering search results. Several datasets have been published by commercial search engines to support the development of ranking algorithms in this field. The LETOR (LEarning TO Rank) datasets have been released by Microsoft 6. Yahoo released a dataset in the scope of their Learning to Rank challenge 7. The basic concept of learning-to-rank methods resembles that of other machine learning areas such as classification or clustering. In a first step, features need to be identified which characterize the importance of a document with respect to a specific query. In a second step, a training corpus representing a subset of web documents needs to be generated. Relevance can either be decided manually or by using clickdata. Finally, the generated data is used to train a machine learning algorithm.

Definition

According to Liu [2011], learning-to-rank algorithms have two major properties: they are feature based and they are discriminative. A feature or feature vector is an n-dimensional vector x = Φ(d, q) where Φ represents a method to extract numeric characteristics from a document d associated with query q. Such features can be statistical information (query term frequencies in the document or the length of the document’s title), HITS [Kleinberg, 1999b] or PageRank [Page et al., 1999] computations, or computations from probabilistic retrieval models. A good overview of possible features is given in the description of the LETOR dataset, which is a popular dataset for evaluating learning-to-rank retrieval models [Liu et al., 2007]. Discriminative learning is characterized by the following four key components [Liu, 2011]:

Input space contains the objects under investigation; in learning-to-rank settings, these are mostly feature vectors.

6 http://research.microsoft.com/en-us/um/beijing/projects/letor/
7 http://learningtorankchallenge.yahoo.com/


Output space contains the learning target with respect to the input objects.

Hypothesis space defines the class of functions mapping the input space to the output space.

Loss function measures to what degree the prediction of the hypothesis is in accordance with the ground truth label.
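A toy feature map Φ(d, q) might look as follows; the concrete features (raw term frequency, a simple idf sum, document length, query coverage) are our own illustrative choices, loosely inspired by typical LETOR-style features:

```python
import math

def phi(query, document, collection_size, doc_freq):
    """A toy feature map Phi(d, q) returning a small feature vector.
    The feature set is illustrative, not that of the LETOR dataset."""
    q_terms = query.lower().split()
    d_terms = document.lower().split()
    tf = sum(d_terms.count(t) for t in q_terms)   # raw term frequency
    idf = sum(math.log(collection_size / doc_freq.get(t, 1))
              for t in q_terms)                   # simple idf sum
    coverage = sum(1 for t in set(q_terms) if t in d_terms) / len(set(q_terms))
    return [tf, idf, len(d_terms), coverage]

# Assumed document frequencies for the query terms (made-up numbers).
doc_freq = {"social": 120, "bookmarking": 15}
x = phi("social bookmarking",
        "social bookmarking systems support social search",
        collection_size=1000, doc_freq=doc_freq)
print(x[0])  # 3  (two occurrences of 'social', one of 'bookmarking')
```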

Methods

One can divide existing methods for learning-to-rank into three categories [Liu, 2009]: the pointwise approach, the pairwise approach and the listwise approach. Each category differs in the way it defines the input, output and hypothesis space as well as the loss function.

Pointwise approach

The pointwise approach considers single documents (i. e., their feature vectors) as inputs. The hypothesis space includes so-called scoring functions producing a specific relevance score or label for each document (the output space). The different documents can then be sorted into a ranked list according to their scores. Scoring functions are selected from three different machine learning areas: regression-based algorithms [Chu and Ghahramani, 2005; Cossock and Zhang, 2006], classification-based algorithms [Li et al., 2007; Nallapati, 2004] and ordinal regression-based algorithms [Gey, 1994]. The loss function considers each document and compares the ground truth label to the computed score/label.
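A minimal pointwise sketch: fit a regression from a single feature to graded relevance labels and rank unseen documents by their predicted scores. Real pointwise methods use many features and richer models; all numbers below are invented:

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of score = a*x + b on (feature, label) pairs.
    A one-feature stand-in for a pointwise scoring function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Training data: one feature value and a graded relevance label per document.
features = [0.1, 0.4, 0.5, 0.9]
labels = [0, 1, 1, 2]
a, b = fit_simple_regression(features, labels)

# Rank unseen documents by predicted score, descending.
docs = {"d1": 0.2, "d2": 0.8, "d3": 0.5}
ranking = sorted(docs, key=lambda d: a * docs[d] + b, reverse=True)
print(ranking)  # ['d2', 'd3', 'd1']
```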

Pairwise approach

The pairwise approach considers the feature vectors of document pairs as input for a specific query. The hypothesis space contains functions taking those input pairs and computing the relative order between them (the output space). The loss function measures the gap between the computed relative order of two documents and their ground truth order. Pairwise ranking has been introduced by Freund et al. [2003]; Herbrich et al. [1999]. It was further examined by Joachims [2002], who transformed classification support vector machines into ranking support vector machines (see the next section for more details). RankNet, which was used in Microsoft’s Live search engine, is based on a neural network approach [Burges et al., 2005].

Listwise approach Finally, listwise approaches consider a set of document feature vectors for a specific query q. The output is an ordering (permutation) of those input documents. Typically, a scoring function f computes a score for each of the documents which allows ranking them in a descending order. By considering ranked lists as inputs and outputs, loss functions of the listwise approach can be based on information retrieval evaluation measures [Li, 2011a], as they take the ranked list and a ground truth list into consideration. A detailed overview of algorithms using the listwise approach is presented in [Li, 2011a].
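To make the idea of a loss built on an evaluation measure concrete, the following sketch (our own illustration, not taken from the cited works) computes NDCG, a measure commonly used in listwise approaches; the relevance grades are invented toy data:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(ranked, ideal):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    best = dcg(sorted(ideal, reverse=True))
    return dcg(ranked) / best if best > 0 else 0.0

# A ranking that places the most relevant document (grade 2) below a
# less relevant one scores lower than the ideal ordering.
print(ndcg([2, 1, 0], [1, 2, 0]))        # 1.0 (ideal order achieved)
print(ndcg([1, 2, 0], [1, 2, 0]) < 1.0)  # True
```

One minus NDCG can then serve as a listwise loss: it is zero exactly when the scoring function reproduces the ideal ordering.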

CHAPTER 3: SOCIAL INFORMATION RETRIEVAL

In this work we use a pairwise approach – the Ranking SVM [Joachims, 2002] – as the learning model (see Section 6.5.3) for our comparisons of tagging and clickdata. Ranking SVM allows a direct comparison of preferences derived from clickthrough data and preferences derived from tagging feedback. Furthermore, it has been widely used in previous work.

Ranking SVM Support Vector Machines (SVM) can be applied for tasks such as classification, regression or ranking. The general idea is to transform input feature vectors into a vector space of higher dimension. Based on given training data, the algorithm constructs a hyperplane in the higher dimensional vector space which separates positive and negative examples. By means of a loss function, the hyperplane is optimized. SVMs for classification purposes are described briefly in Paragraph 4.2.2 in the context of spam detection. Ranking SVMs as introduced by Herbrich et al. [1999]; Joachims [2002] allow documents to be ranked by classifying the order of pairs of documents [Li, 2011b]. Let R∗ be a preference ranking of a set of documents with two document vectors di, dj ∈ R∗. Let f be a linear learning function and ≻ a relation indicating that di is preferred over dj. Then, we can define

di ≻ dj ⇒ f(di) > f(dj) (3.1)

Let Φ(q, d) be a function which maps documents onto features to characterize the association of document d to query q. The function f is associated with the weight vector w as follows: f(d) = w · Φ(q, d) with

f(di) > f(dj) ⇔ w · Φ(q, di) > w · Φ(q, dj) (3.2)

In order to allow some of the preference constraints to be violated, one can introduce non-negative slack variables ξij. The vector w can then be computed by solving the following optimization problem:

min_{w,ξ} g(w, ξ) = ½ ||w||² + C ∑_{i,j} ξij (3.3)

subject to ∀(q, di, dj) ∈ R∗ : w · Φ(q, di) ≥ w · Φ(q, dj) + 1 − ξij

∀(i, j) : ξij ≥ 0

where C is a parameter that balances the size of the margin against the training error [Joachims, 2002]. If we set all ξij to 0 (which assumes that the data is linearly separable), the data points can be ordered by their projections onto the weight vector w. The ranking (“support”) vectors are the vectors whose projections lie nearest to each other. In the example of Figure 3.1, for weight vector w2 the closest points would be d1 and d4, with a distance of δ2 between them. In order to generalize w, one needs to maximize the distance between the closest points, which can be computed as w(Φ(q, di) − Φ(q, dj)) / ||w||.
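The reduction from ranking to classifying pairs can be illustrated with a small sketch. Note that this is a simplified stand-in (a perceptron-style update instead of the quadratic optimization of Equation 3.3), intended only to show how preference pairs become difference vectors; the toy feature vectors are invented:

```python
def pairwise_examples(ranked_docs):
    """Turn a preference ranking d1 ≻ d2 ≻ ... into training examples:
    each preferred pair yields the difference vector Φ(di) − Φ(dj)."""
    pairs = []
    for i in range(len(ranked_docs)):
        for j in range(i + 1, len(ranked_docs)):
            diff = [a - b for a, b in zip(ranked_docs[i], ranked_docs[j])]
            pairs.append(diff)
    return pairs

def train_weights(diffs, epochs=50, lr=0.1):
    """Perceptron-style updates: push w · diff above 0 for every preference
    (a crude substitute for the margin maximization of a Ranking SVM)."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        for d in diffs:
            if sum(wi * di for wi, di in zip(w, d)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w

# Toy feature vectors Φ(q, d), listed in ground-truth preference order.
docs = [[3.0, 1.0], [2.0, 1.5], [1.0, 0.5]]
w = train_weights(pairwise_examples(docs))
scores = [sum(wi * di for wi, di in zip(w, d)) for d in docs]
print(scores == sorted(scores, reverse=True))  # True: w reproduces the order
```

The learned w plays the role of the scoring function: sorting documents by w · Φ(q, d) in descending order recovers the preference ranking.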

3.2.4 Clickdata as a tripartite Network: Logsonomies As logdata contain queries, clicks and session IDs, the classical dimensions of a folksonomy can be reflected: Queries or query words represent tags, session IDs correspond to


Figure 3.1: Example of two weight vectors w1 and w2 ranking four points. Source: Joachims [2002]

users, and the URLs clicked by users can be considered as the resources that they tagged with the query words. Search engine users can then browse this data along the well known folksonomy dimensions of tags, users, and resources. This structure can be referred to as a “logsonomy”. This section will define and characterize logsonomies based on the definitions and descriptions given in [Krause et al., 2008b]. A comparison of structural and semantic properties of logsonomies is presented in Section 6.4.

Logsonomy Construction

Let us consider the query log of a search engine. To map it to the three dimensions of a folksonomy, we set

• U to be the set of users of the search engine. Depending on how users in logs are tracked, a user is represented either by an anonymized user ID, or by a session ID.

• T to be the set of queries the users submitted to the search engine (where one query either results in one tag, or is split at whitespace into several tags).

• R to be the set of URLs which have been clicked on by the search engine users.

In a logsonomy, we assume an association between t, u and r when a user u clicked on a resource r of a result set after having submitted a query t (possibly together with other terms). The resulting relation Y ⊆ U × T × R corresponds to the tag assignments in a folksonomy. One can call the resulting structure a logsonomy, since it resembles the formal model of a folksonomy as described above. Additionally, the process of creating a logsonomy shows similarities to that of creating a folksonomy. The user describes an information need in terms of a query. He or she then restricts the result set of the search engine by clicking on those URLs whose snippets indicate that the website has some relation to the query. These querying and clicking combinations result in the logsonomy.
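A minimal sketch of this construction, assuming an invented toy query log (the session IDs and URLs are made up for illustration) and splitting queries at whitespace:

```python
# Each log record: (session_id, query, clicked_url). Splitting the query
# at whitespace yields one tag per term, as in the definition of T above.
log = [
    ("s1", "python tutorial", "https://docs.python.org"),
    ("s1", "python tutorial", "https://example.org/learn"),
    ("s2", "tutorial", "https://example.org/learn"),
]

Y = set()  # the ternary relation Y ⊆ U × T × R
for user, query, url in log:
    for tag in query.split():
        Y.add((user, tag, url))

U = {u for u, _, _ in Y}  # users (session IDs)
T = {t for _, t, _ in Y}  # tags (query terms)
R = {r for _, _, r in Y}  # resources (clicked URLs)
print(len(U), len(T), len(R), len(Y))  # 2 2 2 5
```

Since Y is a set, repeated clicks for the same query produce no additional tag assignments, reflecting the unweighted-TAS limitation discussed below.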


Major differences to a folksonomy Logsonomies differ from folksonomies in some important points, which may affect the resulting structure of the graph. These points need to be taken into consideration when we discuss the structural and semantic properties of logsonomies in Section 6.4.

• Users tend to click on the top results of a list. In query log analysis, these clicks are usually discounted. To construct a logsonomy, this bias may be integrated by introducing weights for the hyperedges.

• While tagging a specific resource can be seen as an indicator for relevance, users may click on a resource to check if the result is important and then decide that it is not important. However, in our case, the act of clicking already indicates an association between query and resource.

• Users might click on a link of a query result list because they find the resource interesting even though it does not match the query.

• A user may click on a resource several times in response to the same query when repeating a search several times. This information is lost when constructing the logsonomy, as tag assignments (TAS) are not weighted.

• In logsonomies, a tag is only created together with a search click. Composing queries is thus a different intentional, creative process of describing the underlying resources.

• Queries are processed by search engines, leaving open to what extent the individual terms influence the search results.

• When a resource never comes up in a search, it cannot be tagged.

• Session IDs (in the MSN case) do not reflect the various interests of a typical user. They are probably more coherent, as they contain the information needs of a restricted period of time.

3.3 Exploiting Tagging Data

The question about how tagging data can leverage information retrieval tasks has been studied by several authors. In this section we will summarize state-of-the-art studies considering bookmarking (Section 3.3.1) and tagging information (Section 3.3.2).

3.3.1 Exploiting Bookmarks First experiments investigating bookmarks for information retrieval were conducted by Heymann et al. [2008]. The authors created a dataset of the social bookmarking system Delicious to run different analyses considering the system's tags and bookmarks. The authors found that the set of social bookmarks contains URLs which are often updated and also appear to be prominent in the result lists of search engines. A weak point is the fact that URLs produced by social bookmarking systems are unlikely to be numerous enough to impact the crawl ordering of a major search engine.


Kolay and Dasdan [2009b] analyse the suitability of bookmarks for web search. They used Delicious as well as randomly selected URLs to feed a crawler. The authors found that the average external outdegree of Delicious URLs close to the seed was more than three times larger than that for the neighbors of random URL seeds. Based on this finding, they conclude that Delicious URLs are a good source for discovering new content. Furthermore, the clickability rate of Delicious URLs is higher compared to a random selection of examples, meaning that users tend to click on search results which have also been tagged in Delicious. This finding could be used for influencing the rank score of a page. Morrison [2008] performed a user study to compare rankings from social bookmarking sites against rankings of search engines and subject directories. Participants had to rate results from both systems after having submitted a query. The authors found that the search results of both systems overlap. Furthermore, hits appearing in both search lists have a higher probability of being relevant than those returned by only one of the two systems.

3.3.2 Exploiting Tags There are several studies examining tags as a source of metadata to describe web resources. Most of them compare tagging data to web search queries, anchor text or the content of web pages. Such investigations improve our knowledge about whether social annotations in the form of tags can be of help for improving search results and – the other way around – whether query log data may improve the recommendation of tags. Noll and Meinel [2008b] explore the characteristics of tags added by “readers of web documents” by comparing them to the “hyperlink anchor text provided by authors of web documents and search queries of users trying to find web documents”. They group their analysis according to five different aspects: length, novelty, diversity, similarity and classification. Adding the dimension of relevance, we use this scheme to categorize the different study results.

• Length The average length of all three metadata types lies between 2 and 3 terms. Noll and Meinel [2008b] suspect that users seem to select “only 2 or 3 terms per action even across different problem domains (social bookmarking, hyperlink creation, searching the Web)”.

• Relevance In general, tags are considered as relevant for capturing the intent and content of web documents [Heymann et al., 2008; Li et al., 2008]. Li et al. [2008] compare user-generated tags with web content. For instance, they compute the most important keywords of a web page using tf-idf -based weights and show that most of them have been used as a tag of the specific website. The authors conclude that “in general, user-generated tags are consistent with the web content they are attached to, while more concise and closer to the understanding and judgments of human users about the content. Thus, patterns of frequent co-occurrences of user tags can be used to characterize and capture topics of user interests”.

• Novelty Comparing the overlap of the different kinds of metadata to the content of documents, it can be shown that many of the tags used for annotating URLs can also be found in the document [Heymann et al., 2008; Noll and Meinel, 2008b] or in other metadata fields [Jeong, 2009]. According to Heymann et al. [2008], one in six


tags also appear in the title and one in two in the page's content. Additionally, tags are often mentioned in the URL's domain (for example “java” for “java.sun.com”). Noll and Meinel [2008b], calculating the overlap of social bookmarks, anchor texts and search queries, state that “the majority of available metadata [. . . ] add only a small amount of new information to web documents.” Jeong [2009], after examining tags and other metadata fields of the video sharing platform YouTube, points out that the overlap of tags and terms in other metadata fields such as title or description is very high.

• Diversity By measuring the entropy of tags and comparing it to the entropy of anchor terms and search queries per document, it can be shown that tags are less diverse than search queries, but more diverse than anchor terms [Noll and Meinel, 2008b]. This can be explained by the fact that searchers formulate their information need before looking at documents while tags are created knowing the document's content. Nevertheless, tag noise exists and needs to be handled – for example by restricting the number of tags per document [Cattuto et al., 2008].

• Similarity Comparing tags, search terms and anchor terms to each other and to the categories of the Open Directory Project (ODP) (mentioned in Section 2.1.1) using the cosine similarity, the results reveal that tags are more similar to the classification system than to search queries or anchor terms [Noll and Meinel, 2008b].

• Distributions A detailed comparison of query term and tag distributions is presented in Carman et al. [2009]. Similar to the findings of Noll and Meinel [2008b], comparing the plain vocabulary overlap, search terms seem to be more similar to page content than to tags. When considering frequency distributions, queries and tags resemble each other more than they resemble content terms. The authors suggest using tags for smoothing document content models and show in their first results that tagging data may be useful.
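The entropy-based diversity comparison mentioned above can be sketched as follows; the helper function is our own and the tag and query samples are invented:

```python
import math
from collections import Counter

def entropy(terms):
    """Shannon entropy (in bits) of a term frequency distribution."""
    counts = Counter(terms)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical annotations for one document: the tags repeat the dominant
# topic, while the queries that led to the document are more varied.
tags = ["python", "python", "tutorial", "python"]
queries = ["python", "learn", "tutorial", "snake"]
print(entropy(tags) < entropy(queries))  # True: tags are less diverse
```

A per-document comparison of these entropies across a corpus yields the kind of diversity ranking reported by Noll and Meinel [2008b].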

Besides the comparison of tags to the content of web documents and queries, the suitability of tags as a knowledge base for information retrieval tasks such as web search result disambiguation, classification, ranking or query expansion was analysed. A preliminary exploration of the suitability of tags for web search result disambiguation was conducted by Au Yeung et al. [2008]. Using four search terms as examples, they show how to identify different meanings of the terms using tags of a folksonomy and propose an algorithm to match tags to page content in order to find out the specific meaning of the page. In Noll and Meinel [2008a], the authors matched tags added by users to a specific website to the categorization scheme created by the editors of the Open Directory Project. The higher the hierarchy level of a specific category, the more matches could be found with the tags. This was especially true for web sites with high popularity. The authors conclude that tags are better suited for broad classification purposes, i. e., the classification of the entry pages of websites. A more narrow categorization, considering pages at a deeper level, would probably be better handled by content analysis. The authors of Zubiaga et al. [2009] did a similar study comparing not only tags, but different social annotations and their suitability for web page classification. Among the different annotations were tags, notes, highlights of content on a page, reviews and ratings. Their results show that classifying web sites with tags and comments performs better than

content-based classification. The combination of content-based approaches with social annotations yielded the best results, however. Ranking functions can be enhanced by social information either by re-ranking the documents of a result list or by personalizing a result list. Li et al. [2012] use tagging information to re-rank documents. They assume that documents with a high similarity between document terms and tags should receive a similar retrieval score. After a preliminary ranking, they compute similarities between documents in the ranking list using matrix factorization methods and utilize the similarity degree to re-rank documents. The authors of Lee et al. [2012] propose the construction of a social inverted index taking not only the document and its terms but also the user tagging the document and its tags into account. Bouadjenek et al. [2013] propose a linear weighting function which integrates a vector representing the social representation (i. e., tags) of a document into the Vector Space Model. Additionally, they account for a user's personal interests by computing the similarity between a user profile and the social document representation. Finally, several authors explore the use of social annotations for query expansion [Biancalana et al., 2013; Guo et al., 2012; Lin et al., 2011; Zhou et al., 2012]. They enhance existing expansion techniques with tagging information. For example, the authors of Guo et al. [2012] extend the co-occurrence matrix to measure how often tags and query terms appear together. Overall, the different studies show that tags serve as a knowledge base for information retrieval tasks.


Chapter 4

Spam Detection

With the growing popularity of social tagging systems, not only honest users started to organize their bookmarks, but malicious ones began to misuse collaborative tagging for their own benefit. In the social bookmarking system BibSonomy, for example, 90% of the bookmarks are spam-related. Wetzker et al. [2008] could also show that most of the highly active users in Delicious are spammers. While spam in Web 2.0 applications such as social bookmarking or social networks has become a research field only in recent years, spam detection techniques in applications such as e-mail or the Web have been discussed and refined for several years. Fortunately, many of the techniques developed can be adjusted and transferred to spam detection in social tagging systems (see Chapter 7 and Chapter 9 for the application of spam detection algorithms in this thesis). This chapter briefly reviews the field of spam detection. It starts with characterizing the task of spam detection in general, including a definition of spam and the presentation of existing spam detection methods. Finally, the peculiarities of spam in the context of the Social Web will be discussed and social spam fighting approaches presented.

4.1 Definition

In today's digital world, it might seem unnecessary to characterize spam. Everybody has dealt with it – be it via e-mail, web pages or SMS. While defining a spam filter for the BibSonomy project we found, however, that not every annotator felt the same about what can be regarded as spam. Different cultures, educational backgrounds and attitudes lead to different perceptions of the borderline cases of spam bookmarks. This is why we will briefly define spam in the first part of this chapter. Jezek and Hynek [2007] described spam in the context of digital information dissemination, a form of electronic publishing which refers to the electronic distribution of e-books, websites, blogs or e-mail:

If published and distributed properly, it “contributes to exchanging informa- tion on the Web, but used in a malicious way, it serves for broadcasting (mis)information to the general public.” [Jezek and Hynek, 2007]

Commonly known under the term of spam, this form of unsolicited messaging absorbs much time of system developers, administrators and users.


Normally, it is not difficult to distinguish spam messages from legitimate ones. However, there are cases where different classifiers (be they human or machine) would disagree. In order to establish a common understanding, Cormack [2008] identified four major characteristics. The author had e-mail applications in mind. Nevertheless, they serve as a good characterization for all kinds of information dissemination including spam in social media applications.

• Unwanted: The majority of a system’s users are not interested in the content pre- sented by the spammer.

• Indiscriminate: The content is not aimed at reaching a specific target group (for example scientific users), but could be posted in any kind of application.

• Disingenuous: In order not to be detected by spam filters, the postings need to be presented as legitimately and attractively as possible.

• Payload bearing: The payload refers to the message carried by the spam mail which relates to an eventual benefit for the spammer. Obvious messages may be product names, political slogans or addresses. Indirect messages can be a strange name or reference which tempts the recipients to search the Internet and be forwarded to the spammer's website [Cormack, 2008].

4.2 General Spam Detection Approaches

Spam detection approaches can be split into two major groups: heuristic techniques and machine learning techniques. The first group includes techniques which rely on human input — be it by the direct classification of users, traffic analysis, the design of positive or negative indicators or the creation of rules for spam detection. The second group considers mostly supervised or unsupervised algorithms. In the supervised setting, different kinds of models are learned from a given set of training instances which consist of labeled examples. In the unsupervised setting, algorithms such as clustering approaches find a solution without depending on labeled instances. The next paragraphs will briefly introduce different techniques from the heuristic and machine learning fields which have been employed in the experiments discussed in Chapter 7 and in Chapter 9. Further surveys and reviews on spam detection can be found in Blanzieri and Bryl [2008]; Guzella and Caminhas [2009].

4.2.1 Heuristic Approaches

Black- and whitelists

One of the earliest and simplest methods to identify spam is the creation of black- and whitelists which – depending on the application – explicitly state good or bad e-mail addresses, users or IPs. The lists are either updated manually or fed by algorithms identifying black- or whitelist candidates.


Collaborative spam filtering Collaborative spam filtering leverages the feedback of all users working with the system [Gray and Haahr, 2004]. Many Web 2.0 applications ask their users to report malicious content to the system provider. This is often realised with the help of a button users click when they want to classify something as spam. System providers analyse the incoming user requests and decide how to proceed with the rated items (see for example [Han et al., 2006]). Though collaborative spam filtering is an interesting opportunity to easily collect user feedback about possible spammers, it is difficult to rely on such filtering techniques alone. Often, users have a different perception of what is or is not spam. Furthermore, spammers might misuse the feedback mechanisms to declare legitimate entries as spam. Service providers need to consider these disadvantages when implementing such a filter.

Rule-based filtering Rule-based filtering techniques are content-based methods which include the specification of rules. Such rules often refer to scanning a list of words (for example typical spam terms or IP addresses) or checking against regular expressions (for example filtering e-mail addresses from universities). For a new item, multiple rules are checked to see whether they apply or not. For instance, the SpamAssassin1 application, an open source solution released under the Apache License 2.0, uses rules (among other techniques) for e-mail spam classification. Each rule has a positive (spam) or negative (non-spam) score assigned. For each message, the scores of all matching rules are summed up. Based on a threshold value, it can be decided if the message should be classified as spam. In general, the definition of rules is time consuming. Rules need to be defined carefully in order to achieve high accuracy. Also, as spammers adjust their methods to avoid being filtered, rules need to be updated regularly.
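A minimal sketch of such score-and-threshold filtering, loosely inspired by SpamAssassin's scheme; the rules, scores, threshold and addresses below are invented for illustration and are not actual SpamAssassin rules:

```python
import re

# Hypothetical rules: positive scores indicate spam evidence,
# negative scores indicate legitimacy (e.g. a trusted sender domain).
RULES = [
    (re.compile(r"viagra|casino", re.I), 3.0),
    (re.compile(r"click here now", re.I), 1.5),
    (re.compile(r"@uni-wuerzburg\.de"), -2.0),
]
THRESHOLD = 2.0  # classify as spam when the summed score reaches this value

def is_spam(message):
    score = sum(s for pattern, s in RULES if pattern.search(message))
    return score >= THRESHOLD

print(is_spam("Cheap VIAGRA, click here now"))                # True (4.5)
print(is_spam("Seminar slides from alice@uni-wuerzburg.de"))  # False (-2.0)
```

The threshold trades precision against recall: raising it demands more matching spam rules before a message is rejected.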

4.2.2 Machine Learning Approaches In the area of spam filtering, most machine learning approaches consist of classification methods. Such techniques use labeled training examples to infer a model which can then be used to assign labels to unknown instances. More formally, the task of classification can be defined as follows: Given a labeled training set of n examples {(xi, yi)} (i = 1, . . . , n), where xi ∈ R^m with xi = (xi1, xi2, . . . , xim) denoting the feature vector with m features, n the size of the training data set, and yi ∈ {−1, +1}, one seeks to find a function f (the “model” or “classifier”) that minimizes the expected loss of the mapping xi → {−1, +1}.

Feature Engineering The generation of classifiers is based on training data which consists of labeled examples. The latter are normally represented by means of feature vectors. The generation of such features is a key factor for a spam filter’s success. Depending on the data, some features

1http://spamassassin.apache.org/

are easy to extract, while others need to be carefully preprocessed (for example text or images). In the domains most similar to social spam applications (e-mail detection and web spam detection), these features are mostly generated from the message's or website's content, e. g., from text. Documents are represented as a bag of words, where each feature corresponds to a single term found in the document. Normally, terms are preprocessed by removing case information, punctuation, stop words and very infrequent words. Sometimes several morphological forms of a word are mapped to the same root term by applying a stemming algorithm which eliminates suffixes from words. A text document d is then represented by a set of terms T = {t1, t2, . . . , tm}. The values of the features of xi are given by a function which weighs the occurrence of ti in d. Functions can either consider binary weights (a word is present or not) or they account for the term's occurrence in the text and / or in the entire corpus. One of the most popular weighting measures is tf-idf [Salton and Buckley, 1988], which denotes a composition of raw term frequency (tf) and inverse document frequency (idf). tf counts the number of occurrences of a word in the document, with the intuition that the importance of each word is proportional to its occurrence in the document. idf considers the occurrence of a word in the entire document corpus. Words which rarely occur in the corpus but frequently appear in the specific document are more valuable for text classification than words occurring frequently in many documents. Thus, the importance of each word is inversely proportional to its occurrence in the whole corpus. tf-idf can then be expressed as:

tf-idf(d, w) = tf(d, w) × log(N/df(w)) (4.1)

where N denotes the number of documents in the corpus and df(w) the number of documents containing the word w.
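A small sketch of this weighting for a tokenized toy corpus (the helper name is our own, and the natural logarithm is used here):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf(d, w) = tf(d, w) × log(N / df(w)) for a tokenized corpus."""
    tf = Counter(doc)[term]                       # raw term frequency in d
    df = sum(1 for d in corpus if term in d)      # document frequency
    n = len(corpus)                               # corpus size N
    return tf * math.log(n / df) if df else 0.0

corpus = [
    ["cheap", "pills", "cheap"],
    ["meeting", "agenda"],
    ["pills", "agenda"],
]
# "cheap" occurs twice in doc 0 but in only one of three documents,
# so it outweighs the more widespread term "agenda" in doc 1.
print(tf_idf("cheap", corpus[0], corpus) > tf_idf("agenda", corpus[1], corpus))
```

In practice the weights are computed once per document to fill the feature vector xi described above.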

Summaries of further weighting functions considering term scores can be found in Blanzieri and Bryl [2008]; Guzella and Caminhas [2009]. A general problem of representing documents by their terms and training a classifier based on such documents is the difficulty of identifying spammers using new terms which have not been part of the training corpus. The filter needs to be re-trained so that unknown words, misspellings or word variations are acknowledged by the classification model. One way to avoid this conflict is to integrate more features not directly based on text. Selecting such features ranges from taking into account the IP address of the message transmitter, using black- or whitelists or Google's PageRank scores for websites, to considering temporal aspects and clickdata. The features we computed for detecting spam in social bookmarking systems are described in Section 7.3.1. Considering the wealth of features – especially in the case of text features – it can be useful to preselect certain features before training the classifier. Some classifiers, however, such as support vector machines (see Section 4.2.2), have the ability to handle high dimensional input spaces as their classification decision does not depend on the number of features but on the margin which separates the data [Joachims, 1998a].

Algorithms The selection of appropriate classification algorithms depends on the requirements the domain of spam filtering imposes. Such requirements include good classification performance, fast prediction of new entries, fast adaptation to the changes of spammers and robustness to high dimensionality, as feature vectors tend to have many dimensions, especially when using text features. In the context of online updateable algorithms, Sculley [2008] also mentions the requirement of scalable updates, where the costs of updating a


model should not depend on the amount of training data. The following paragraphs briefly present the most prominent classifiers used in spam detection tasks, which we also used for our spam experiments in Sections 7.3 and 8.3.1.

Decision Tree Learning Decision tree learning allows the classification of new examples by traversing a decision tree [Mitchell, 1997]:

• The nodes in the tree represent a test for a certain instance attribute.

• The classification process starts with the root node.

• Depending on the test outcome, a branch is selected to move down the tree to the next node.

• The resulting leaf node reflects the classification decision.

A prominent algorithm to construct decision trees is the ID3 algorithm and its variants [Quinlan, 1986]. The family of ID3 algorithms infers a decision tree by building it from the root downward, greedily selecting the next attribute for each new decision branch added to the tree. The decision about which attribute is best can be made by using different quality measures such as the information gain (see [Quinlan, 1986] for a definition). One issue in decision tree learning is the problem of overfitting the data, i. e., a model is learned which perfectly classifies the training examples but increases the test data error. Several strategies exist in order to prevent overfitting. For example, one can stop the tree construction before it reaches the point where it fits all the examples of the training data (pre-pruning). Post-pruning methods, in contrast, reduce the tree after it has been entirely built. Several improvements with respect to noise handling, missing features, better splitting criteria or better computing efficiency have been introduced to the ID3 algorithm since then. In 1993, Quinlan released the C4.5 algorithm, an extension of the ID3 algorithm, which also addresses these issues [Quinlan, 1993].
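The information gain criterion used by ID3 to select split attributes can be sketched as follows; the bookmark-like examples and attribute names are invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy reduction achieved by splitting on one attribute.
    Each example is a pair (feature_dict, label)."""
    labels = [y for _, y in examples]
    before = entropy(labels)
    after = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

# Hypothetical examples: a perfectly separating attribute ("has_ads")
# yields full gain, while an uninformative one ("lang") yields none.
data = [({"has_ads": 1, "lang": "en"}, "spam"),
        ({"has_ads": 1, "lang": "de"}, "spam"),
        ({"has_ads": 0, "lang": "en"}, "ham"),
        ({"has_ads": 0, "lang": "de"}, "ham")]
print(information_gain(data, "has_ads"))  # 1.0
print(information_gain(data, "lang"))     # 0.0
```

ID3 would pick `has_ads` for the root node here, then recurse on the resulting subsets.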

Naive Bayes Naive Bayes classifiers were first proposed for e-mail spam filtering by Sahami et al. [1998]. Since then, they have been widely used. Their attractiveness stems from their efficiency (training time is linear in the number of training examples and storage is linear in the number of features), simplicity and comparable performance to other classification algorithms. The Bayes classifier calculates the probability that a given feature representation x = (x1, x2, . . . , xm) belongs to a class yk by applying Bayes' theorem, which states that the posterior probability P(yk | x) of a target value can be calculated from the prior probability P(yk), the class-conditional probability P(x | yk) and the evidence P(x):

P(yk | x) = (P(yk) × P(x | yk)) / P(x) (4.2)


Naive Bayes classifiers rely on the assumption that the components xj, j = 1, 2, . . . , m are conditionally independent given the class, so that P(x | yk) can be written as:

P(x | yk) = ∏_{j=1}^{m} P(xj | yk) (4.3)

Therefore, Equation 4.2 can be transformed into:

P(yk | x) = (P(yk) × ∏_{j=1}^{m} P(xj | yk)) / P(x) (4.4)

Bayes classifiers then assign the most probable target value given the feature values for a new instance:

yNB = argmax_{yk ∈ Y} P(yk | x) (4.5)

P(xj | yk) can be computed by counting the frequency of various data combinations within the training examples [Mitchell, 1997]. One can distinguish two event models which include the naive Bayes assumption as described above [McCallum and Nigam, 1998]. In the multi-variate Bernoulli event model, a document is represented by a vector of binary attributes indicating which words occur or do not occur in the document. Probabilities for a document are computed by multiplying the probabilities of all word occurrences, including the probabilities of non-occurring terms. In the multinomial event model, a document is represented by the set of word occurrences from the document. As in the multi-variate Bernoulli model, the order of words is lost. Nevertheless, the information about how many times a term occurred in the document is retained. Probabilities are then calculated by multiplying the probabilities of the words which occur in the document. Different comparisons show that the multinomial model performs better in the case of large vocabulary sizes [McCallum and Nigam, 1998].
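A minimal multinomial naive Bayes sketch following Equations 4.2–4.5; the toy documents are invented, and Laplace smoothing (not discussed above) is added to avoid zero probabilities for unseen terms:

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns log-priors and Laplace-smoothed
    per-class log term probabilities."""
    classes = Counter(label for _, label in docs)
    vocab = {t for tokens, _ in docs for t in tokens}
    word_counts = {c: Counter() for c in classes}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in classes.items()}
    log_prob = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)  # +|V| for smoothing
        log_prob[c] = {t: math.log((counts[t] + 1) / total) for t in vocab}
    return log_prior, log_prob

def predict(tokens, log_prior, log_prob):
    """y_NB = argmax_k  log P(y_k) + Σ_j log P(x_j | y_k)."""
    def score(c):
        return log_prior[c] + sum(log_prob[c].get(t, 0.0) for t in tokens)
    return max(log_prior, key=score)

train = [(["cheap", "pills", "buy"], "spam"),
         (["buy", "cheap", "cheap"], "spam"),
         (["meeting", "agenda"], "ham")]
prior, prob = train_multinomial_nb(train)
print(predict(["cheap", "buy"], prior, prob))  # spam
print(predict(["meeting"], prior, prob))       # ham
```

Working in log space replaces the product of Equation 4.4 by a sum, which avoids numerical underflow for long documents.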

Logistic Regression

Logistic regression algorithms belong to the discriminative supervised machine learning methods. The basic goal is to find the weights of the linear model which can then be used to classify a new example as either a positive or a negative instance. In contrast to other discriminative methods, the logistic regression classifier computes the probability of a specific feature vector x being part of the positive class y_k = 1. This is done by applying the logistic function, which maps an input value in the range of −∞ to ∞ to the output range [0, 1]. The logistic function is defined as
\[
\sigma(t) = \frac{1}{1 + e^{-t}} \tag{4.6}
\]
Applying a linear function of x, one can express t as
\[
t = w^{T}x + b. \tag{4.7}
\]

52 4.2 GENERAL SPAM DETECTION APPROACHES

Figure 4.1: Hyperplane which separates positive and negative examples in a multidimen- sional space

The probability that an input vector x belongs to class y_k = 1 is then
\[
f(x) = P(y_k = 1 \mid x) = \frac{1}{1 + e^{-(w^{T}x + b)}} \tag{4.8}
\]
The positive label is predicted if the probability exceeds a certain threshold τ, i.e., f(x) > τ. More information, especially on how the weights are computed in the software toolkit Weka (the tool used for the experiments in Sections 7.3 and 8.3.1), can be found in le Cessie and van Houwelingen [1992].
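Assuming a weight vector has already been learned, prediction with Equations 4.6–4.8 reduces to a few lines. The weights, feature values and threshold below are made up for illustration; this is not the Weka implementation used in the experiments:

```python
import math

def sigma(t):
    """Logistic function (Eq. 4.6): maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(x, w, b):
    """f(x) = P(y = 1 | x) = sigma(w^T x + b)  (Eq. 4.8)."""
    t = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear score t = w^T x + b (Eq. 4.7)
    return sigma(t)

def predict(x, w, b, tau=0.5):
    """Predict the positive (e.g. spam) class if f(x) exceeds the threshold tau."""
    return 1 if predict_proba(x, w, b) > tau else 0
```

Raising τ above 0.5 makes the filter more conservative, trading false positives (ham classified as spam) for false negatives.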

Support Vector Machines

Support Vector Machines (SVMs) [Cortes and Vapnik, 1995] have received great attention in recent years as they out-performed other learning algorithms, offering good generalization, a global solution, a small number of training parameters and a solid theoretical background [Amayri and Bouguila, 2010]. An introduction to SVMs for ranking has been presented in Section 3.2.3. A linear classifier of the form f(x) = w^T x + b is employed to assign a class to a new example. The weight vector w is derived by finding a hyperplane which separates positive and negative examples with the maximum possible margin. This turns out to be a quadratic programming problem of the form
\[
\min_{w,b,\xi} \; g(w, \xi) = \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \xi_i \tag{4.9}
\]
\[
\text{subject to } \forall \{(x_i, y_i)\}: \quad y_i(w^{T}x_i + b) \geq 1 - \xi_i
\]

\[
\forall i: \quad \xi_i \geq 0
\]
where n is the number of training examples and ξ_i is called a slack variable. Such variables allow for handling data which are not linearly separable by permitting a certain amount of

error. This error increases the further the misclassified point lies from the margin's boundary. C controls the trade-off between minimizing the errors made on the training data and minimizing the term \(\frac{1}{2}\|w\|^{2}\), which corresponds to maximizing the margin's size. Figure 4.1 shows an example of a hyperplane separating positive and negative examples. In Joachims [1998b], SVMs have been shown to work well for classifying text examples. Compared to other classifiers solving the task of spam detection, they demonstrated high accuracy rates [Drucker et al., 1999]. Results can be improved even further using string kernels and distance-based kernels different from the classical ones [Amayri and Bouguila, 2010]. Another line of research around SVMs is their adjustment to the online setting of spam filtering, where classifiers need to be updated continuously [Sculley and Wachman, 2007].
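The slack variables and the objective of Equation 4.9 can be evaluated directly for any candidate hyperplane. The sketch below (with invented toy data; it evaluates the objective rather than solving the quadratic program) shows how ξ_i and g(w, ξ) are computed:

```python
def slack(x, y, w, b):
    """Slack xi_i = max(0, 1 - y_i (w^T x_i + b)).

    Zero iff the example lies on the correct side with full margin; values
    above 1 indicate a misclassified point beyond the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def svm_objective(data, w, b, C):
    """g(w, xi) = 1/2 ||w||^2 + C * sum_i xi_i  (Eq. 4.9).

    The first term corresponds to maximizing the margin, the second
    penalizes margin violations; C balances the two."""
    reg = 0.5 * sum(wi * wi for wi in w)
    errors = sum(slack(x, y, w, b) for x, y in data)
    return reg + C * errors
```

A solver would search for the (w, b) minimizing this value; here, a perfectly separating hyperplane with unit weight vector yields zero slack and an objective of ½||w||².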

4.2.3 Evaluation Measures for Spam Filtering Several evaluation measures have been used to assess spam filtering applications. Sculley [2008] describes the requirements for such a measure as follows:

Clearly, we would like to maximize the number of True Positives (TPs), which are actual spam messages correctly predicted to be spam, and the number of True Negatives (TNs), which are actual ham messages correctly predicted as such. Furthermore, we would like to minimize the number of False Positives (FPs), which are good ham messages wrongly predicted to be spam, and to minimize the number of False Negatives (FNs), which are spam messages predicted to be ham.

Precision and recall are two simple and well-known measures for evaluating classification algorithms. Precision denotes the fraction of positive examples correctly classified among all examples which have been classified as positive:

\[
Precision = \frac{TP}{TP + FP} \tag{4.10}
\]
Recall is the fraction of positive examples correctly classified among all possible positive examples, i.e., true positives and false negatives:

\[
Recall = \frac{TP}{TP + FN} \tag{4.11}
\]

The F1-measure combines precision and recall using the harmonic mean, which better reflects the understanding of “average” with respect to ratios than the arithmetic mean.

\[
F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{4.12}
\]
By introducing the parameter β, the balance between precision and recall can be controlled. With β = 1, the F-measure becomes the harmonic mean; if β > 1, it puts more weight on recall; if β < 1, it is more precision-oriented (F_0 = Precision).

\[
F_{\beta} = (1 + \beta^{2}) \cdot \frac{Precision \cdot Recall}{\beta^{2} \cdot Precision + Recall} \tag{4.13}
\]
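Equations 4.10–4.13 translate directly into code operating on the four confusion counts (TP, FP, FN); the following sketch is a straightforward transcription, not taken from any evaluation toolkit:

```python
def precision(tp, fp):
    """Eq. 4.10: fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. 4.11: fraction of actual positives that were found."""
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta=1.0):
    """Eq. 4.13; beta = 1 yields the harmonic mean F1 of Eq. 4.12,
    beta = 0 collapses to precision, large beta approaches recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

For example, a spam filter with 8 true positives, 2 false positives and 2 false negatives has precision, recall and F1 all equal to 0.8.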


[Plot: the ROC space with false positive rate (x-axis) against true positive rate (y-axis); labelled elements: the “Random” diagonal, “Perfect Classification” in the top left corner, “Better” above and “Worse” below the diagonal.]

Figure 4.2: The ROC space and its interpretation. The ROC curve of a classifier randomly guessing lies somewhere along the line from (0,0) to (1,1). All values above reflect a classifier better than random.

Accuracy denotes the fraction of true classification results (either correctly classified positives or correctly classified negatives) among all possible examples.
\[
Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.14}
\]

Precision, recall, F_β and accuracy optimize some of the requirements just described. However, it is difficult to apply these measures to highly skewed classes, i.e., datasets where one or very few classes dominate the other classes significantly in terms of the number of training (and test) examples. In the spam domain, many applications have far more spam elements than legitimate elements. The problem with the “traditional” measures is that they depend on both portions, i.e., on the positive and the negative class fractions. For example, precision relates the number of true positives to the number of all examples identified as positive, which includes false positives and thus examples of the negative class. The AUC value, in contrast, is based on a composition of two evaluation measures which consider either only the positive or only the negative class, namely the TP rate and the FP rate. The TP rate corresponds to the recall defined above (see Equation 4.11):
\[
TP\ rate = \frac{TP}{TP + FN} \tag{4.15}
\]
The FP rate is defined as
\[
FP\ rate = \frac{FP}{FP + TN} \tag{4.16}
\]
A so-called ROC (receiver operating characteristic) curve depicts the TP rate on the y-axis and the FP rate on the x-axis. Many classifiers (for example the ones described above) not only predict a new instance's class, but yield a probability value showing how confident they are about their decision. One can order the instances by such confidence scores. By traversing the list of ordered instances, for each instance one can produce a

point in the ROC space by computing the TP rate and the FP rate at that position. The points can then be connected and form the so-called ROC curve. The area under such a curve is the AUC value, which is also called the ROCA (ROC Area). As each axis is scaled in the range of [0, 1], forming a unit square, the AUC value also lies in the range of [0, 1], where a higher AUC value reflects a better classifier. AUC values can be interpreted as a probability indicating how probable it is that “a classifier will rank a randomly collected positive instance higher than a randomly collected negative instance” [Fawcett, 2004]. Figure 4.2 illustrates this interpretation. The line from (0,0) to (1,1) represents a random classifier. All values above reflect a classification better than random, all values below are worse than random. Classifiers with AUC values below 0.5 should invert their classification and predict the negative class. The steeper the ROC curve is at the beginning, the bigger the area under the ROC curve, leading to better AUC values. Therefore, classifiers whose lists reflect a better ordering of positive and negative instances receive a better grade.
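The curve traversal and the ranking interpretation of the AUC can be sketched as follows. The scores and labels are made up for illustration, and the pairwise AUC computation is a naive O(|P|·|N|) variant chosen for clarity, not the implementation used for the experiments:

```python
def roc_points(scores, labels):
    """Traverse instances by decreasing confidence score and emit the
    (FP rate, TP rate) point reached after each instance (plus the origin)."""
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    P = sum(labels)            # number of positive (spam) instances
    N = len(labels) - P        # number of negative (ham) instances
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))  # Eq. 4.16 and Eq. 4.15 at this cut-off
    return points

def auc(scores, labels):
    """AUC as the probability that a randomly drawn positive instance is
    scored higher than a randomly drawn negative one (ties count half),
    matching the interpretation of Fawcett [2004]."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks all positives above all negatives reaches the point (0, 1) and obtains an AUC of 1.0; reversing the ranking yields 0.0.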

4.3 Characterizing Social Spam

At first glance, spam in social media applications does not differ significantly from spam in the rest of the digital world. The motivation for distributing spam in social media applications is the same, and the content (text, graphics) is often similar. Many of the classification methods applied in other domains can therefore be used to detect spammers in the Web 2.0. The difference from other spam domains lies in the way of distribution, the shortness of the spam messages these systems allow and the additional features social media applications offer to detect spam. In the following, we will review the particularities of spam in social media applications and present state-of-the-art approaches to detect spamming attempts in different social media applications.

4.3.1 Spam in Social Bookmarking Systems

Spam has become a major problem for tagging systems. As technical barriers are low and the chances of reaching a big audience are high, a large percentage of new posts in social tagging systems consists of spam. Depending on the system, these posts can be bookmarks (social bookmarking systems), videos (video sharing sites) or photos. In this section, we focus on spam in social bookmarking systems, as our spam detection experiments in Chapter 7 are conducted on data of a social bookmarking system and most of the relevant related projects on spam detection in social tagging systems are based on a social bookmarking system. An overview of related work is also presented in Ivanov et al. [2012].

Spamming incentives The major motivation for spamming is financial [Markines et al., 2009a]. Spammers want to attract as many users as possible to a specific website, photo or video promoting mainly commercial, but also political or religious content. Since most internet users employ search engines to find information on the Web, spammers aim at getting high rankings

for specific query terms in search engines [Chen et al., 2009a]. An important factor search engines consider in their ranking is the link structure of the specific website, e.g., how well a specific site is connected. The more websites link to it, the more important it is. This principle has been explored in the PageRank algorithm [Page et al., 1999] (see Section 2.5.1), which is employed by search engines such as Google. Social tagging systems offer a cheap and easy way to create such links. As tagging systems are crawled by search engines, spammers post links at marginal cost, hoping to gain a better ranking in search engines [Krause et al., 2008c]. By posting spam content, not only search engines are led to web spam sites, but also other participants of the social tagging system. Users often browse and search the system's content by clicking on tags they added to their own posts, on tags of the tag cloud, or by simply following other users in the system. When spammers use popular or very general tags, their chances of leading many visitors to their sites therefore increase.

Consequences for social media applications

Besides allocating network and storage resources which could be used for legitimate purposes, spam destroys the system's usability and quality. Social web applications make their living from the interaction and participation of their users. If those participants are distracted by spamming activities, they might stop using the system. Also, the speciality of tagging systems – their inherent semantics, which can be leveraged for all kinds of applications – can be destroyed if no spam filtering activities are employed.

Spam Filtering Methods Heymann et al. [2007] categorized existing methods for fighting spam into three main groups: prevention-based, demotion-based and detection-based.

Prevention-based Methods Prevention-based methods make the posting of spam entries on a system more difficult. Such methods are mainly small features implemented into the system, with which both legitimate users and spammers are confronted when interacting with it. The objective is to exclude spam robots. The difficulty, however, is to not make interaction too cumbersome for loyal, legitimate users, who might otherwise stop using the system. Prominent examples of prevention-based methods are:

• CAPTCHAs
• Hiding / personalization of interfaces

Heymann et al. [2007] mentioned further financial barriers such as the introduction of account fees or the obligation to pay per action. As most social media systems offer free services to enable as many users as possible to participate, these measures are also counterproductive for legitimate users.

Demotion-based Methods Demotion-based methods refer to the devaluation of spam en- tries compared to entries of legitimate users. For example, the order of ranking results according to a specific query can be influenced by degrading those results which are possible spam entries.


Detection-based Methods Detection-based methods focus on identifying spam and – de- pending on the system – deleting, hiding or marking it as spam. The core of those systems is a classifier which estimates whether a post entry or user is spam or not with the help of available information. The range of possibilities to identify spam is huge - from manual approaches to the implementation of machine learning tech- niques. The approaches relevant for spam detection in this thesis have been intro- duced in Section 4.2.

Spam classification methods

Among the first authors to mention the problem of spam in social tagging systems were Cattuto et al. [2007], who detected anomalies in their study of network properties in folksonomies. Heymann et al. [2007]; Koutrika et al. [2007] were the first to deal with spam in tagging systems explicitly. The authors identified anti-spam strategies for tagging systems and constructed and evaluated models of different tagging behaviour. In contrast to their approach using an artificial dataset, this thesis presents a concrete study using machine learning techniques to combat spam on a real-world dataset (see Chapter 7.3 and [Krause et al., 2008c]). After the publication of a BibSonomy dataset in the scope of the ECML/PKDD discovery challenge 2008 [Hotho et al., 2008], further studies were published. The experiments of the participants of the ECML/PKDD discovery challenge 2008 are introduced in Section 7.4.4. Follow-up publications using the dataset include Bogers and Van den Bosch [2009]; Neubauer and Obermayer [2009]; Sung et al. [2010]; Yang and Lee [2011a]. The results, however, are still difficult to compare, as the pre-processing of the dataset differs between studies. For example, Sung et al. [2010] reduce the number of tags from about 400,000 to about 20,000 by excluding terms that appeared only once in a post or by removing “noisy” tags such as numbers or tags composed of only two letters. Several publications deal with the exploration of appropriate features to describe social bookmarking system users and thus better distinguish between spammers and legitimate users. Markines et al. [2009a] construct six features which address tag-, content- and user-based properties. They evaluate their approach on a modified version of the SpamData08 dataset, changing the class distribution so that the spam and non-spam class proportions are less skewed. Yazdani et al. [2012b] introduce 16 features based on tag popularity and user activities.
Their most prominent feature computes the probability that the tags applied by a user are used only by legitimate users. The authors of M.Gargari and Oguducu [2012] introduce a new set of features which consider different temporal aspects. They observed that spammers show a different posting behaviour than legitimate users: for instance, they register several user accounts in a short period of time and post the same resource under different accounts. Such short-range resource bombardments can be found by analysing the timestamps of each post. Users are classified as spammers when they exhibit high bombardment activity. Using the SpamData08 dataset, they show that their method outperforms previous approaches, especially in terms of reducing the false positive rate. While most spam detection approaches operate on a user level, i.e., users are classified based on their posts, Liu et al. [2009]; Sung et al. [2010]; Yang and Lee [2011a] propose methods to classify spam on a post level. In Liu et al. [2009], for each post an information value is computed, which is the average of the information values of its tags. A tag's information value is the proportion of the frequency with which the tag is assigned to a resource by different

users, to the sum of the frequencies of all tags assigned by different users to this resource. The authors of Sung et al. [2010] compute scores for the tags of a post. Based on these scores, a post is considered spam or not. The scores are derived from tag usage and co-occurrence information of tags applied in posts of spammers and legitimate users. The authors experiment with different combinations of tag scores. For example, they identify so-called “white tag” scores. Such tags are frequently used by non-spammers. Spammers pick them up to make their posts appear as legitimate posts. Yang and Lee [2011a] measure the semantic similarity between a tag and a web page. They relate the keywords of the web page to the tags of a post using self-organizing maps. Some works do not focus on developing spam classification methods, but find other effective approaches or present interesting insights when investigating data from social tagging systems for other tasks. In Sakakura et al. [2012], the authors use a supervised learning scenario, but cluster users based on their sharing of bookmarks. As in M.Gargari and Oguducu [2012], the authors claim to better detect users registering several accounts and bookmarking the same resource. Noll et al. [2009] explore the identification of experts in social tagging systems. They show that their algorithm is more resistant to spammers than more traditional methods such as the HITS algorithm [Kleinberg, 1999a]. When analysing tagging behaviour in social bookmarking systems, Dellschaft and Staab [2010] show that spamming behaviour deviates significantly from the behaviour of legitimate users. In Navarro Bullock et al. [2011b] we analyse different spam features with respect to their degree of privacy conformance and accuracy (see Section 8.3).

4.3.2 Spam Detection in other Social Media Applications

Microblogs

One of the most popular micro-blogging services in the world is Twitter 2. Founded in 2006, it enables users to publish short messages of up to 140 characters, called tweets. The messages are read by followers or retrieved through search systems. With its popularity, Twitter is one of the most useful systems to receive real-time information about current events, opinions or news. Spammers take advantage of Twitter in many different ways. The authors of Thomas et al. [2011] performed an analysis of typical spam activities in Twitter on a dataset containing 1.8 billion tweets, of which 80 million were published by spammers. Using one or more accounts, spammers publish malicious links (for example, links to websites containing malware) or hijack popular topics. The authors also mention the growing market for (illegal) spammer-operated software selling Twitter accounts, or URL shorteners to disguise spam. Another study [Almaatouq et al., 2014] distinguishes two major classes of spamming activities: the first mainly contains fraudulent accounts, which tend to follow other users rather than being followed; the second class of spammers shows more similarities to legitimate users. The study's authors assume that such accounts may be compromised. The hijacking of user accounts is also analysed in Thomas et al. [2014]. Based on a dataset containing 13 million hijacked accounts, they study the likelihood of becoming compromised, the dominant ways of hijacking accounts and the social consequences for users whose accounts have been compromised. One of the major challenges in reducing micro-blogging spam activities is to capture the specific spam behaviour so that spam classifiers can work properly. Several studies analyse

2http://www.twitter.com

the performance of spam detection methods. Kwak et al. [2010] found that a simple filter based on excluding users who have used Twitter for less than a day or tweets which contain three or more popular topics helps to distinguish between spammers and non-spammers. In Benevenuto et al. [2010] the authors used tweets concerning three popular topics from 2009. They manually identified spam and non-spam users and extracted various features, among them content attributes such as the number of numeric characters appearing in a tweet, the number of popular spam words or the fraction of tweets which are reply messages. Additionally, they considered user behaviour attributes such as the number of followers per number of followees or the age of the user account. Similarly, Wang [2010] presents an automatic spam detection approach using graph-based and content-based characteristics of Twitter spammers. Ghosh et al. [2011a] study the dynamics of re-tweeting activities. They establish two features based on the time intervals between successive tweets and the number of times a user retweets a certain message. Other studies investigate link farming and how to prevent such farms [Ghosh et al., 2012] or look at correlations of URL redirect chains [Lee and Kim, 2013].

Blogs

Blog spam (also known as splogs) can be seen as a type of web spam where the author of the blog is a spammer. Such splogs are often created to attract traffic from search engines [Zhu et al., 2011]. First spam detection approaches in this field include Mishne et al. [2005], who identified comment spam in blogs by applying a language model, and Kolari et al. [2006a], who derive textual features from the blog text, anchors and URLs and use an SVM to evaluate them. They extended their work, showing that such features are more successful for classifying spam blogs than link-based features such as incoming or outgoing links [Kolari et al., 2006b]. More features were suggested by Lin et al. [2007], including temporal, content and link self-similarity properties. Yoshinaka et al. [2010] present a personalized blog filter which helps to handle instances of the “gray” zone where system participants differ in their evaluation. Zhu et al. [2011] monitor online search results and filter blogs from those results by analyzing their temporal behaviour.

Wikipedia

Wikipedia is a popular website to publish all kinds of information. It can be used by any participant and has no entry barrier. Spamming attacks (often called vandalism in Wikipedia) can be defined as “any addition, removal, or change of content in a deliberate attempt to compromise the integrity of Wikipedia” [Javanmardi et al., 2011]. According to West et al. [2011], “common forms of attack on Wikipedia include the insertion of obscenities and the deletion of content”. The authors of Geiger and Halfaker [2013] describe Wikipedia's vandalism detection network as a multi-layered system. Most malicious edits are removed by autonomous bots. More subtle vandalism is handled by humans (partly tool-assisted). Finally, specific scripts play a role in the fight against spam. Most research projects focus on improving bots by applying standard machine learning approaches [see for example Potthast et al., 2008; Smets et al., 2008]. In the scope of the PAN 2010 vandalism detection competition [Potthast et al., 2010] several authors proposed automatic detection mechanisms. A further

study outperforms the winning team on the same dataset by integrating a variety of features which they categorize as either metadata, text, language characteristics or reputation [Adler et al., 2011]. Among the malicious edits are link spam edits of the same kind as introduced in the context of social bookmarking spam (see Section 4.3.1). Such links are often removed by Wikipedia's existing spam detection mechanisms. West et al. [2011], however, present a spam model to detect spam links more quickly by concentrating on popular articles, an eye-catching presentation of links, automated mechanisms of entry generation and the distribution of hosts.


Chapter 5

Data Privacy

In order to better understand data privacy concepts and the legal situation in Europe, especially in Germany, this chapter briefly reviews basic privacy terminology, privacy principles and the European and German legislation. Section 5.3 then presents current research around data privacy in the Social Web. In Chapter 8 we will build on this foundation to discuss data privacy in social bookmarking systems. In the light of current events such as the extensive surveillance of citizens by the National Security Agency (NSA) [Marcel Rosenbach, 2014], data privacy in the digital world has come to the fore in media, politics and legislation. Current events in the context of our research will therefore be briefly discussed in the conclusion in Section 11.3.

5.1 Basic Concepts

The term privacy generally refers to the protection of personal data (in Germany the so-called “Datenschutz”). More precise notions are data privacy or information privacy, though the two composites do not adequately capture the subject of privacy. According to Roßnagel [2007], it is not the data itself that needs to be protected, but rather the person the data relates to. This concept of privacy has been clarified in the right to informational self-determination, which will be briefly discussed in the following section. Afterwards, basic terms of privacy laws will be defined and principal guidelines on how to implement privacy (laws) will be presented.

5.1.1 Privacy as the Right to Informational Self-Determination

Westin [1970] defined privacy in his fundamental work as the

claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.

This concept of privacy was taken up by the German Federal Constitutional Court. In 1983, the Court formulated the right to informational self-determination as a fundamental, personal right when canceling the Population Census Act, which had been set up by the German government a year before. The so-called Census Decision (“Volkszählungsurteil”),

with the introduction of the basic right of informational self-determination, can be considered a fundamental decision in the history of German data protection [Hornung and Schnabel, 2009]. Briefly, the right to informational self-determination is an individual's right to “determine what personal data is disclosed, to whom, and for what purposes it is used” [Fischer-Hübner et al., 2011]. As soon as a third party – be it a public authority or a private company – wants to process an individual's personal data without being authorized by this person or by exceptional cases explicitly stated in the law, the right to informational self-determination is violated.

5.1.2 Guidelines and Definitions

Guidelines by different organizations have been published with the goal of encouraging their member states to implement privacy laws conforming to specific principles. Among them are the OECD's Guidelines on the Protection of Privacy and Transborder Flows of Personal Data from 1980 [OECD, 1980] and the UN Guidelines Concerning Computerized Personal Data Files from 1990 [Assembly, 1990]. In the European Union, the Data Protection Directive [Directive, 1995] (officially called the Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data) is the foundation for regulations with respect to data protection. As the Directive is a framework law, it needs to be implemented in the EU Member States by means of national legislation. In Germany, the (main) implementation is the German Federal Data Protection Act, which is further discussed in Section 5.2. In the following, we will introduce basic privacy terms in order to provide a common understanding. Thereby, we will follow the definitions of the Data Protection Directive.

Personal Data In general, the goal of the Data Protection Directive is to protect the right to privacy of natural persons, relating in particular to the processing of their personal data. Personal data comprises “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity (95/46/EC Article 2 a)” [Navarro Bullock et al., 2011b].
Processing Processing personal data is further explained in Article 2 of the EU Data Protection Directive: processing refers to “any operation or set of operations which is performed upon personal data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation, or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction”.

Controller The Directive differentiates between the data controller and the data processor. The data controller is “the natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes and means of the processing of personal data” (95/46/EC Article 2 d).

Processor The data processor relates to “a natural or legal person, public authority, agency or any other body which processes personal data on behalf of the controller”.


While the controlling entity determines the purpose of data processing, the actual processing can be done by another entity (95/46/EC Article 2 e). However, the processor acts on behalf of the controller and needs to “implement appropriate technical and organizational measures to protect the data against loss, destruction, unauthorized disclosure or access, and other unlawful processing” [Noorda et al., 2010].

5.1.3 Privacy Principles

Privacy principles have been published by several organizations to trigger the discussion about privacy and privacy requirements. In many countries, including Germany, they serve as normative instructions which are implemented in different laws to preserve data privacy. The list below summarizes the essential principles. Similar summaries can be found in [Fischer-Hübner, 2001; Gutwirth et al., 2009; Kosta et al., 2011].

Principle of fair and lawful processing The data must be processed in a way that is transparent for the data subject and only as allowed by law or by consent of the person concerned. This principle can be seen as a primary requirement from which the other principles of data protection laws derive.

Principle of purpose specification and purpose binding Personal data should only be collected and used for the purposes specified. Such purposes can either be legally provided or agreed upon in advance by the data controller and the person concerned. The permission for personal data processing only applies to the specified purpose. When the purpose changes, the legal requirements must be checked again [Navarro Bullock et al., 2011b].

Principle of consent The processing of data is only permitted if the person concerned or a legislative authority approves its processing. The permission only holds for a specific purpose and set of data.

Principle of transparency Individuals should be provided with detailed information about what personal data is being collected, by which means it is collected, how it is processed, who the recipients are or how the data collector assures the confidentiality and quality of the data collected. The goal of providing transparency can be realised by information and notification rights and the possibility to change, delete or block one’s data in case it is incorrect or illegally stored [Fischer-Hübner, 2001].

Principle of data minimization “As little personal data as possible are to be collected and processed”, and the data has to be deleted or anonymized at the earliest opportunity [Navarro Bullock et al., 2011b].

Principle of information quality and information security Providing information quality refers to keeping personal data correct, relevant and up to date. The data processor needs to implement security mechanisms which assure the confidentiality, availability and integrity of personal data [Fischer-Hübner, 2001].


5.2 Legal Situation in Germany

Concrete requirements about how to deal with personal data can be found in the German Federal Data Protection Act (Bundesdatenschutzgesetz, BDSG) and in further laws handling specific data privacy regulations. Regulations for online services are specified in the German Telemedia Act (Telemediengesetz, TMG). Therefore, when discussing privacy regulations for online services such as Web 2.0 applications, both acts need to be considered. Depending on the service provider’s company seat, further state (Landes-) laws may apply; for example, a service provider in Hesse needs to conform to Hessian state law as well. In 2001, a major adaptation of the BDSG was made in order to conform to European standards, and in 2009 three further amendments were introduced. The following subsection concentrates on the basic concepts of the German Federal Data Protection Act necessary to understand the analyses of Chapters 8 and 10.

5.2.1 German Federal Data Protection Act

The BDSG defines personal data as “any information concerning the personal or material circumstances of an identified or identifiable natural person (‘data subject’)” (§ 3 Abs. 1 BDSG). Such data includes information directly referring to a natural person, such as the name, but also information which can be linked to factual circumstances relating to the person, such as a comment or a photograph [Commission et al., 2010]. In order to decide whether data should be classified as personal, one needs to determine whether it is possible to identify a human being from the data, or from the data in conjunction with other information the data controller might be aware of.

Personal data can be divided into three categories: content data, inventory data and usage data (“Inhaltsdaten, Bestands- und Nutzungsdaten”). While content data is regulated in the BDSG, inventory and usage data are considered in the TMG (German Federal Telemedia Act), which will briefly be explained in the next paragraph.

Similar to the European Data Protection Directive, the BDSG differentiates between the collection, processing and use of personal data (“Erhebung, Verarbeitung und Nutzung”, § 11 Abs. 1 BDSG). Processing comprises activities such as recording, altering, disclosing, blocking and erasing of data (§ 3 Abs. 4 BDSG). The legitimacy of dealing with personal data needs to be checked for each of those activities.

The BDSG applies to the public and the private sector. Public authorities comprise institutions responsible for federal issues. Authorities which take care of federal state administration need to consider individual federal state (Länder) data protection legislation [Fischer-Hübner, 2001]. Public and private sectors are not treated the same: some of the rules in the BDSG apply only to the public or private sector, while some impose different requirements on the two [Commission et al., 2010]. Additionally, special exceptions can be found.
For example, § 40 BDSG defines special rules for processing personal data for scientific purposes. The BDSG and further specific laws define basic principles which conform to the ones discussed in the previous Section 5.1.3. In general, the collection, processing and use of personal data is prohibited. The only exceptions are an explicit legal permission or the data subject’s consent to the processing of his or her personal data. The data must be obtained directly from the data subject rather than from third parties (“Direkterhebung”, § 4 Abs. 2 BDSG). Only if a legal provision exists or the collection of data from the data subject involves a “disproportionate effort” can the data be obtained from other sources.

One of the core concepts is the principle of purpose limitation (“Zweckbindung”, § 14 BDSG), which implies that each handling of data needs a specific purpose which, ideally, has been documented in advance. Furthermore, the principle of data avoidance and data minimisation of § 3a was introduced as an amendment in 2000. It refers to “the requirement that in any context, no more than the minimum amount of personal data may be collected; and that, whenever possible, personal data must be anonymized or pseudonymized” [Commission et al., 2010].

Anonymization and pseudonymization are defined in § 3 Abs. 6 BDSG. Data can be seen as anonymized when they can no longer, or only with a disproportionate effort with respect to time, costs and labour, be linked to a data subject. Pseudonymization refers to the replacement of names or other identifiers by some other kind of attribute which prevents or significantly complicates the identification of the data owner.

The principle of transparency refers to an individual’s right to know who collects the data, as well as where, when and for what purpose (§ 4 Abs. 3 BDSG). Transparency requires comprehensive information and access rights. Finally, the principle of necessity only allows the processing of personal data if it is absolutely necessary and no other (fair) means exist to achieve the intended purpose.
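To illustrate the pseudonymization concept of § 3 Abs. 6 BDSG, the following sketch replaces an identifier with a keyed hash. This is an illustration only, not a procedure prescribed by the BDSG; the key value, record layout and truncation length are illustrative assumptions.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately
# from the data and rotated regularly.
SECRET_KEY = b"store-separately-and-rotate"

def pseudonymize(identifier: str) -> str:
    """Replace an identifier by a keyed hash (HMAC-SHA256).

    Without the key, linking the pseudonym back to the original
    identifier requires disproportionate effort; destroying the key
    moves the data towards anonymization.
    """
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Illustrative record from a social bookmarking log.
record = {"user": "alice@example.org", "tag": "privacy", "url": "http://example.org"}
record["user"] = pseudonymize(record["user"])
```

Deterministic hashing keeps pseudonyms linkable across records, which is what makes data mining on pseudonymized usage data possible, while hiding the identifier itself.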

5.2.2 German Federal Telemedia Act

The Telemedia Act applies to all service providers which “operate under German law and offer services via the Internet, but exclude telecommunication or related services” [Sideridis and Patrikakis, 2010] (i. e., services which consist of signal distribution via telecommunications networks and broadcasting). The main concepts in the Telemedia Act are:

Inventory data is regulated in § 14, TMG:

The service provider may collect and use the personal data of a recipient of a service only if it is needed to establish, carry out or amend a contractual relationship between the service provider and the recipient on the use of telemedia (inventory data).

Such data can be the name, address, phone number, user name, password, e-mail, birthday or credit card number.

Usage data is regulated in § 15, TMG:

The service provider may collect and use the personal data of a recipient of a service only to the extent necessary to enable and invoice the use of telemedia (data on usage). Data on usage are in particular

1. characteristics to identify the recipient of the service,

2. details concerning the beginning, end and scope of the respective usage,

3. details of the telemedia used by the recipient of the service.


Similar to the BDSG, the TMG requires a service provider not to process any personal information beyond what is necessary for service delivery unless it is explicitly allowed by the TMG or another provision considering telemedia services or a user has authorized the processing (§ 12 Abs. 1 TMG). The authorisation is bound to a specific purpose. A change of purpose needs to be legitimized by law or again by the user’s consent. Data retention is therefore not legitimate. A user’s approval can be asked for electronically when the approval process meets the following conditions (§ 13 Abs. 2 TMG):

1. the user has consciously and unambiguously given his authorisation

2. a record of the approval is kept

3. the recipient of the service can access the content of the approval at any time

4. the recipient of the service can revoke the approval at any time with effect for the future.

The purpose-binding and data minimisation principles have to be applied to inventory and usage data as well. This means that, though it is very easy for service providers to collect such data through forms and log files, as little personal inventory and usage data as possible should be collected and stored. Furthermore, after use, the data needs to be deleted or anonymized. In practice, when developing information systems, such legal limitations should be considered in advance in order to avoid expensive amendments. A provider should either choose technical solutions that use as little data as possible, or inform their users in detail about the collection and processing of their personal information. Users can then decide in a self-determined way what kind of data processing they are comfortable with.

5.3 Data Privacy in the Social Web

With the growth of Web 2.0 systems, especially social networking services, the violation of privacy has come to the attention of researchers. In contrast to former privacy issues, where people refused to share sensitive data with a company or the government, many Internet users publish their private information voluntarily. Most of them are aware of the privacy risk, but do not apply this concern to their own usage; in the literature, this has been termed the privacy paradox [Barnes, 2006]. Several works explore the motivation and extent of information disclosure and discuss the risks of this revelation [see for example Gross and Acquisti, 2005; Krishnamurthy and Wills, 2008; Krishnamurthy et al., 2008; Schrammel et al., 2009a,b; Stutzman, 2006]. In general, sensitive data is shared widely without restricting the default privacy settings (which are often tailored to publishing data). For example, Gross and Acquisti [2005] analysed the information revealed by 4,000 Carnegie Mellon University students on Facebook and found that 61 percent of the disclosed profile pictures could be used for identifying a user. In 21 percent of the cases, the information available (including published phone numbers and addresses) would make it possible to stalk someone. A few years later, Farahbakhsh et al. [2013] analysed 479K Facebook profiles in terms of the kinds of profile attributes users tend to publish. On average, users make about four attributes publicly available, with the friend list being the most often published attribute. Yang et al. [2012] show that web

users can be identified from only small pieces of information published online by using state-of-the-art search techniques and online repositories.

Schrammel et al. [2009b] explored the information disclosure behaviour of different online communities by analysing demographic, context and usage variables and their correlation with the willingness to disclose information. They assume that the “actual usage purpose and goal of a user when interacting with a community is the main driving factor behind the information disclosure behaviour.” Taddicken [2014] found that “the higher the people rate the social relevance of the Social Web as important, and the more they focus on the use of specific Social Web applications, the more information they disclose. Social Web users tend to self-disclose more personal and sensitive information when their friends and acquaintances also use it.”

Gürses and Berendt [2010] argue that viewing data privacy solely as a concept of confidentiality is not enough. The authors highlight different aspects of privacy, including the construction of identities as a result of constant negotiations about what data to disclose or hide. Similar to us (see Chapter 8), they argue for a combination of legal and technical measures to preserve privacy.

Different approaches considering data privacy issues in online social networks have been proposed. These include privacy enhancing technologies such as anonymization techniques [Bhagat et al., 2009; Zhou et al., 2008a] and privacy-aware engineering such as the design of privacy policies (for example [Danezis, 2009; Fang and LeFevre, 2010]). A first analysis of privacy in social bookmarking systems considering German law has been conducted by Lerch et al. [2010] and Krause et al. [2010]. A detailed analysis of the legal situation of private data handling in social bookmarking systems is given in Doerfel et al. [2013].
This includes considerations about social peer review, spam detection and the responsibilities of service providers. Eecke and Truyens [2010] analyse legal issues raised by the application of the EU Data Protection Directive to social networks. Hoeren [2010] and Stadler [2005] deal with problems with respect to the processing of usage data in order to fight misuse. The analysis of privacy in social bookmarking systems [Krause et al., 2010; Lerch et al., 2010] will be presented as part of this thesis in Chapter 10. This includes an interdisciplinary analysis of what kind of user data from a social network is necessary to efficiently conduct data mining and what kind of data handling protects the privacy rights of users. The methodology and experiments will be presented in Chapter 8.


Part II

Methods


Chapter 6

Social Information Retrieval in Folksonomies and Search Engines

6.1 Introduction

In Chapter 3 we characterized social information retrieval in general. One social approach to retrieving digital information is the use of social bookmarking systems. Over the last years, a significant number of resources has been collected in these systems, offering a new form of searching and exploring the Web [Krause et al., 2008a]. To many folksonomy users, this personalized, community-driven search has become an alternative to search engine information retrieval.

The major differences between a folksonomy and a search engine concern the interface and content-creation aspects. Folksonomies allow users to organize and share web content. By contrast, classical search engines index the Web and offer a straightforward user interface to retrieve information from this index [Benz et al., 2009b]. The index itself is created by automatically crawling the Web, while the content of a folksonomy emerges from the explicit tagging by its users. As a consequence, users, not an algorithm, decide about relevance in a folksonomy. The perception of users can be integrated into search engine rankings as well; one prominent example is the integration of user feedback extracted from log files of users’ click history in order to improve rankings (see Section 3.2.2).

In this chapter, we search for similarities and differences between information retrieval on the Web and in social bookmarking systems. We hereby consider four different aspects of social search as identified in the introduction in Section 1.2.1:

Usage behaviour and system content: We will conduct an analysis of the systems as they are, taking a look at user interactions and content in folksonomies and search engines. We will show that, indeed, search engines and folksonomies are used in a similar way and also cover similar content.
Structure: We will construct a folksonomy-like structure (a logsonomy, see definition in Section 3.2.4) out of search engine click data and compare its topological and semantic properties to those of the well-known folksonomy Delicious. By looking at a logsonomy graph’s components (degree distribution, disconnected components, shortest path length, clustering coefficient), we find that logsonomies break down into more disconnected components than folksonomies. By contrast, small-world properties concerning the shortest path length and the clustering coefficient, as compared to random graphs and Delicious, are present.

Semantics: We explore whether similar semantic structures evolve from query logs and tagging systems. Considering relatedness measures (co-occurrence, tag context, resource context), logsonomies show slightly different characteristics. For example, the co-occurrence measure tends to reconstruct compound expressions.

Integration: We present an initial analysis of how folksonomy data can be of use for search engines. We concentrate on the fact that folksonomies provide explicit feedback on what users find relevant for specific topics. Posts in a social bookmarking system are therefore interpreted as a form of feedback (this resource is relevant for a specific tag). Different strategies to manipulate rankings based on such feedback are tested and evaluated for a learning-to-rank scenario. Overall, the strategy using the FolkRank algorithm (see Section 6.5) shows promising results.

The chapter is organized as follows. Section 6.2 provides a description of all datasets used in the different experiments. Section 6.3 presents the analysis of the search and tagging behaviour of users and compares the content of tagging systems and ranking results. Section 6.4 deals with structural and semantic aspects of search systems and folksonomies. Section 6.5 presents a first approach to integrating data of folksonomies and search engines by using tagging data as implicit feedback to create training data for learning-to-rank algorithms. Finally, Section 6.6 provides a summary of all findings. The chapter is based on work published in Benz et al. [2009b, 2010a,b]; Krause et al. [2008a,b].

6.2 Datasets

A variety of datasets were used to compare social bookmarking systems and search en- gines. This section describes these datasets.

6.2.1 Overview of Datasets

Delicious complete In November 2006, the research team of the Knowledge Engineering Group of the University of Kassel crawled Delicious to obtain a comprehensive social bookmarking dataset with tag assignments from the start of the system up to October 2006 [Cattuto et al., 2007].

Delicious May only Based on the time stamps of the tag assignments, it is possible to produce snapshots. We use a snapshot from May 2006 for the comparison of tagging behaviour in social bookmarking systems to the query behaviour in search engines (Section 6.3.1).

Delicious 2005 This snapshot contains all posts which were created before July 31st, 2005. The first 40,000 tags of this dataset served as queries in our search engine crawls. This dataset was also used to represent folksonomy data for the topological comparison of folk- and logsonomies (Section 6.4.2).

Delicious 2008 In Section 6.5, a Delicious dataset provided by Wetzker et al. [2008] is used for inferring implicit feedback from tagging data. The dataset contains the public bookmarks of about 980,000 users retrieved between Sep. 2007 and Jan. 2008. The dataset is similar to the Delicious complete dataset, but URLs were retrieved


Table 6.1: Input datasets from the social bookmarking system Delicious and the search engines MSN, AOL and Google.

Dataset Name         Date               Words/Tags   URLs
Delicious 2005       until July 2005    430,526      2,913,354
Delicious May only   May 2006           377,515      1,612,405
Delicious complete   until Oct. 2006    2,741,198    17,796,405
Delicious 2008       until Dec. 2007    6,933,179    54,401,067
MSN click data       May 2006           2,224,550    4,970,635
MSN crawl            Oct. 2006          29,777       19,215,855
AOL click data       March - May 2006   1,483,186    1,620,034
Google crawl         Jan. 2007          34,220       2,783,734

until a later time. The retrieval process resulted in about 142 million bookmarks or 450 million tag assignments that were posted between Sep. 2003 and Dec. 2007.

MSN crawl and Google crawl A crawl from each of MSN and Google is used to compare the content of search engines with social bookmarking systems in Section 6.3.1. For both systems, we submitted the 40,000 most popular tags of the Delicious dataset as queries. We retrieved 1,000 URLs for each query in the MSN data set, and 100 URLs for each query in Google.

MSN click data We obtained a click dataset from Microsoft for the period of May 2006. The MSN dataset consists of about 15 million queries submitted in 7,470,915 different sessions, tracked from MSN search engine users in the United States in May 2006. The dataset was provided as part of the award “Microsoft Live Labs: Accelerating Search in Academic Research” in 2006.1 The data was used to analyse behavioural aspects (Section 6.3.1), to construct the logsonomy data for the analysis of structural and semantic aspects in Section 6.4, and as a source of implicit feedback in the comparison of implicit feedback strategies in Section 6.5.

MSN ranking data We obtained a ranking dataset from Microsoft collected in May 2006.2 The dataset consists of about 1.6 million rankings with up to 50 ranked URLs each. The rankings were used for the learning-to-rank scenario in Section 6.5.

AOL click data The data was collected from March 1st to May 31st, 2006. The dataset consists of 657,426 unique user IDs, 10,154,742 unique queries, and 19,442,629 click-through events [Pass et al., 2006]. It was used in the experiments of Section 6.3.1 and Section 6.4.

To make the click datasets (MSN and AOL click data) comparable to tags, we decomposed each query into single query words and removed stop words and words containing only one letter. All query words and all tags were converted to lowercase. The different datasets are summarized in Table 6.1.
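The query normalization just described can be sketched as follows. The stopword list here is a tiny illustrative stand-in; the thesis does not specify which list was actually used.

```python
# Sketch of the normalization applied to the click datasets: split each
# query into single terms, lowercase them, and drop stopwords and
# one-letter words.
STOPWORDS = {"the", "a", "of", "and", "in", "for"}  # illustrative subset

def query_to_terms(query: str) -> list[str]:
    """Decompose a query into normalized single terms."""
    terms = []
    for term in query.lower().split():
        if len(term) > 1 and term not in STOPWORDS:
            terms.append(term)
    return terms

query_to_terms("The County State Bank")  # -> ["county", "state", "bank"]
```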

1http://research.microsoft.com/ur/us/fundingopps/RFPs/Search_2006_RFP.aspx
2http://research.microsoft.com/ur/us/fundingopps/RFPs/Search_2006_RFP.aspx


Table 6.2: Folksonomy and logsonomy datasets for the structural comparison of the tripartite structures built from social bookmarking systems and search engines.

Dataset Name               |T|         |U|         |R|         |Y|
Delicious host only URLs   430,526     81,992      934,575     14,730,683
Delicious complete URLs    430,526     81,992      2,913,354   16,217,222
AOL complete queries       4,811,436   519,250     1,620,034   14,427,759
AOL split queries          1,074,640   519,203     1,619,871   34,500,590
MSN complete queries       3,545,310   5,680,615   1,861,010   10,880,140
MSN split queries          902,210     5,679,240   1,860,728   24,204,125

6.2.2 Construction of Folk- and Logsonomy Datasets

For the experiments comparing folksonomies and logsonomies (Section 6.4.2), the two click datasets MSN click data and AOL click data had to be transformed to represent the tripartite structure of a logsonomy (see Section 6.4 for a more detailed description of a logsonomy’s structure). In the dataset MSN complete queries, the set of tags is the set of complete queries, the set of users is the set of sessions and the set of resources is the set of clicked URLs. For the second dataset, MSN split queries, we decomposed each query t at whitespace positions into single terms (t1, ..., tk) and collected the triples (u, ti, r) (for i ∈ {1, ..., k}) in Y instead of (u, t, r). This splitting better resembles the tags added to resources in folksonomies, which typically are single words. The AOL data was transformed into the two datasets AOL complete queries and AOL split queries analogously to the MSN datasets. We used unique user IDs for the AOL dataset because session IDs were not included there.

The AOL data was available with truncated URLs only. To make the MSN (and the Delicious) data comparable to the AOL data, we reduced the URLs to host-only URLs, i. e., we removed the path of each URL, leaving only the host name. Then we transformed both the MSN and the AOL data to two logsonomies each as described above, resulting in the logsonomies described in the four last lines of Table 6.2. As we removed stopwords for the ‘split queries’ datasets, a minor fraction of users (1,665) and URLs (97) disappeared in the MSN case because of their relation to a query consisting only of stopwords. Folksonomy data was represented by the Delicious 2005 dataset (see Table 6.1), containing posts from 81,992 users up to July 31st, 2005.
From this we created two datasets: one consisting of full URLs, to be comparable to prior work on folksonomies, and one reduced to only the host part of the URL, to be comparable to the logsonomy datasets. The sizes of the created folksonomies are presented in the first two lines of Table 6.2. The frequency distributions of the generated folk- and logsonomy datasets will be shown in Section 6.4.1. In Section 6.4.3, in order to study semantic aspects of a logsonomy, we used the Delicious host only URLs dataset. We restricted the dataset to the 10,000 most frequent tags and to the resources/users that have been associated with at least one of those tags. The dataset was also used in the analysis of tag relatedness [Cattuto et al., 2008]. For the logsonomy representation, we used the click dataset AOL split queries. Once again, we constructed a logsonomy, this time with the restriction of only using the 10,000 most frequent query words of the dataset. The resulting sizes of the datasets are shown in Table 6.3.
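The construction of the ‘split queries’ logsonomies described in this subsection can be sketched as follows. The input format, function names and stopword list are illustrative assumptions, not taken from the original implementation.

```python
from urllib.parse import urlparse

STOPWORDS = {"the", "a", "of", "and"}  # illustrative subset

def host_only(url: str) -> str:
    """Reduce a URL to its host part, as done to align with the AOL data."""
    return urlparse(url).netloc or url

def split_query_triples(clicks):
    """Turn click events (session, query, url) into logsonomy triples
    (u, t_i, r), splitting each query at whitespace into single terms."""
    triples = set()
    for session, query, url in clicks:
        r = host_only(url)
        for term in query.lower().split():
            if term not in STOPWORDS:
                triples.add((session, term, r))
    return triples

clicks = [("s1", "artificial intelligence", "http://example.org/page")]
split_query_triples(clicks)
# -> {("s1", "artificial", "example.org"), ("s1", "intelligence", "example.org")}
```

A session whose queries consist only of stopwords produces no triples, which mirrors the small loss of users and URLs reported above.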


Table 6.3: Folksonomy and logsonomy datasets for the comparison of semantic aspects of the tripartite structures built from social bookmarking and search engines.

dataset                     |T|      |U|       |R|          |Y|
Delicious reduced           10,000   476,378   12,660,470   101,491,722
AOL split queries reduced   10,000   463,380   1,284,724    26,227,550

Taking only the 10,000 most frequent tags might raise the question whether information is lost in the process: more rarely used tags might provide a higher information content. However, the sparsity of such tags makes them less useful for the study of the semantic measures applied [Cattuto et al., 2008].

6.2.3 Construction of User Feedback Datasets

For our experiments in Section 6.5, we combine three different kinds of data: ranking data (MSN ranking data), click data (MSN click data) and social bookmarking data (Delicious 2008). For about 700,000 queries from the MSN click dataset, we have the same queries with a set of ranked URLs in the MSN ranking dataset and with at least one URL clicked in the MSN click dataset. To be comparable to the ranking and click datasets, for Delicious 2008 we only consider posts before the end of May 2006.3 Tags are normalized by splitting them into single, lower-case terms; all characters except letters and numbers are removed. Furthermore, only those posts are kept which match a query-document pair in the click dataset. We therefore normalize the queries in the same manner as the tags and keep the posts which have the same URL and contain at least all query terms as tags. Overall, we get 36,830 queries with rankings where at least one URL in the ranking has been tagged in Delicious together with the corresponding query terms. This results in 263,171 users, 11,264,441 resources and 1,390,878 tags.
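The post filtering described above can be sketched as follows. Function and variable names are illustrative, not taken from the original implementation.

```python
import re

def normalize(text: str) -> list[str]:
    """Split into lowercase terms and strip all characters except
    letters and digits, as described for tags and queries above."""
    return [re.sub(r"[^a-z0-9]", "", t) for t in text.lower().split()]

def post_matches(post_url, post_tags, query, clicked_url):
    """Keep a post only if it bookmarks the clicked URL and its tags
    contain at least all normalized query terms."""
    tags = {t for tag in post_tags for t in normalize(tag)}
    terms = {t for t in normalize(query) if t}
    return post_url == clicked_url and terms <= tags

post_matches("http://example.org", ["machine", "learning", "ai"],
             "Machine Learning", "http://example.org")  # -> True
```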

6.3 Comparison of Searching and Tagging

In this section, we study the user behaviour and content of search and tagging systems. We concentrate on three aspects: Are query words and tags used in a similar way (6.3.1)? Are tagging and search behaviour correlated over time (6.3.1)? And is the content of both systems similar (6.3.1)?

6.3.1 Analysis of Search and Tagging Behaviour

First, we will compare the behaviour of searchers and taggers. Search behaviour is described by the query terms submitted to a search engine. We use the number of occurrences of a term in the queries over a certain period of time as an indicator for the users’

3The experiments could also have been conducted with the dataset Delicious complete. Since even the reduced Delicious 2008 was bigger than Delicious complete, we decided to use Delicious 2008 to get as much overlap as possible between the query and tag terms.


interests. The interests of taggers in social bookmarking systems, on the other hand, are described by the tags they assign to resources over a certain period of time. We start with a comparison of the overlap of the set of all query terms in the MSN click data with the set of all tags in the dataset Delicious May only (see Section 6.2). This comparison is followed by an analysis of the correlation of search and tagging behaviour in both systems over time. Query log files were not available for the bookmarking systems; hence we only study the tagging (and not the search) behaviour.

Table 6.4: Statistics of item frequencies in Delicious and MSN in May 2006

                                                        MSN          Delicious   MSN - Del.
items                                                   31,535,050   9,076,899   —
distinct items                                          2,040,207    375,041     96,988
average frequency                                       15.46        24.20       —
frequent items (≥ 10 occurrences)                       115,966      39,281      18,541
frequent items containing “_”                           90           1,840       1
frequent items containing “-”                           1,643        1,603       145
frequent items cont. “www.”, “.com”, “.net” or “.org”   17,695       136         30

Query Term and Tag Usage Analysis

By comparing the distribution of tags and query terms, we gain first insights into the usage of both systems. The overlap of the set of query terms with the set of tags is an indicator of the similarity of the usage of both systems. We use Delicious May only to represent social bookmarking systems and the MSN click data to represent search engines. Table 6.4 shows statistics about the usage of query terms in MSN and tags in Delicious. The first row reflects the total number of queried terms and the total number of tags used in Delicious. The following row shows the number of distinct items in both systems. As can be seen, both the total number of terms and the number of distinct terms are significantly larger in MSN when compared to the total number of tags and the number of distinct tags in Delicious. Interestingly, the average frequency of an item is quite similar in both systems (see third row). These numbers indicate that Delicious users focus on fewer topics than search engine users, but that each topic is, on average, equally often addressed.

Figure 6.1 shows the distribution of items in both systems on a log-log scale. The x-axis denotes the count of items in the data set, while the y-axis describes the number of terms/tags that correspond to the term/tag occurrence number. We observe a power law in both distributions. A power law in this case means that the vast majority of terms appears only once or very few times, while only a few terms are used frequently (see details in Section 2.3). This effect also explains the relatively small overlap between the MSN query terms and the Delicious terms, which is given in the 2nd row/3rd column of Table 6.4. In order to analyse the overlap for the more frequent terms, we restricted both sets to query terms/tags that showed up in the respective system at least ten times.4 The resulting frequencies are given in the first line of the second part of Table 6.4.
It can be seen that the sizes of the reduced MSN and Delicious datasets thereby become more equal and that the relative overlap increases.

4The restriction to a minimum of 5 or 20 occurrences provided similar results.
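The frequency-threshold overlap underlying Table 6.4 can be sketched as follows; the toy inputs are illustrative stand-ins for the MSN query terms and Delicious tags.

```python
from collections import Counter

def frequent_items(items, min_count=10):
    """Return the items occurring at least min_count times, matching the
    'frequent items' restriction applied for Table 6.4."""
    counts = Counter(items)
    return {item for item, n in counts.items() if n >= min_count}

# Toy data; the real inputs are millions of query terms and tags.
msn = ["free"] * 12 + ["yahoo"] * 11 + ["rare"]
delicious = ["design"] * 15 + ["free"] * 10 + ["once"]

overlap = frequent_items(msn) & frequent_items(delicious)
# -> {"free"}
```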


[Figure 6.1 (log-log plot) omitted; axes: number of query/tag occurrences (x) vs. frequency of queries/tags (y), with curves for Delicious tags and MSN terms.]

Figure 6.1: Distribution of items in Delicious and MSN on a log-log scale. The x-axis denotes the count of items in the data set, the y-axis describes the number of tags that correspond to the term/tag occurrence number. We observe a power law in both distributions.

When browsing both reduced data sets, we observed that the non-overlapping parts largely result from the different usages of both systems. In social bookmarking systems, for instance, people frequently encode multi-word lexemes by connecting the words with either underscores, hyphens, dots, or no symbol at all. (For instance, all of the terms ‘artificial_intelligence’, ‘artificial-intelligence’, ‘artificial.intelligence’ and ‘artificialintelligence’ show up at least ten times in Delicious.) This behaviour is reflected by the second and third last rows in Table 6.4. Underscores are basically only used for such multi-word lexemes, whereas hyphens also occur in expressions like ‘e-learning’ or ‘t-shirt’. Only in the latter form do they show up in the MSN data. A large part of the query terms in MSN that are not Delicious tags are URLs or parts of URLs (see the last row of Table 6.4). This indicates that users of social bookmarking systems prefer tags that are closer to natural language, and thus easier to remember, while users of search engines (have to) anticipate the syntactic appearance of what they are looking for.

The top five tags of Delicious and the top five terms of MSN in May 2006 are shown in Table 6.5 together with their frequencies. One can see that Delicious has a strong bias towards computer science related terms: eleven of the 20 top tags are computer terms (such as web, programming, ajax or linux). The top terms in MSN are more difficult to interpret. “yahoo” and “google” may be used when people have the MSN search interface as a starting point in Internet Explorer, or when they leave Microsoft-related programs such as Hotmail and want to use another search engine. “county” is often part of a composed query such as “Ashtabula county school employees credit union” or “county state bank”. We lack a good explanation for the high frequency of this term; it might result from the way Microsoft extracted the sample (which is unknown to us).
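The delimiter variants of multi-word lexemes noted above could be conflated with a simple normalization. This is an illustration, not a step performed in the thesis, and it has the side effect of also conflating legitimate hyphenated terms such as ‘e-learning’ with ‘elearning’:

```python
import re

def collapse_variants(tag: str) -> str:
    """Map delimiter variants of a multi-word tag to one canonical form
    by dropping underscores, hyphens and dots."""
    return re.sub(r"[_.\-]", "", tag.lower())

variants = ["artificial_intelligence", "artificial-intelligence",
            "artificial.intelligence", "artificialintelligence"]
{collapse_variants(v) for v in variants}
# -> {"artificialintelligence"}
```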

CHAPTER 6: SOCIAL INFORMATION RETRIEVAL IN FOLKSONOMIES AND SEARCH ENGINES

Table 6.5: The top five items of Delicious and MSN in May 2006 with their frequencies. Delicious has a strong bias towards computer science related terms; for MSN no specific topical focus can be found.

Delicious tag   Frequency     MSN query term   Frequency
design            119,580     yahoo              181,137
blog              102,728     google             166,110
software          100,873     free               118,628
web                97,495     county             118,002
reference          92,078     myspace            107,316

Correlation of Search and Tagging Behaviour over Time

Up to now we have considered both data collections as static. In the next section we analyse if and how search and tagging behaviour are correlated over time. Again we use the MSN query data and the Delicious data of May 2006. Each data set has been separated into 24-hour bins, one for each day of May 2006. As the unit of analysis we selected those tags from Delicious that also appeared as a query term in the MSN click data. In order to reduce sparse time series, we excluded time series which had fewer than five daily query or tagging events. In total, 1,003 items remained. For each item i, we define two time series. The Delicious time series is given by X^d_i = (x^d_{i,1}, ..., x^d_{i,31}), where x^d_{i,t} is the number of assignments of tag i to some bookmark during day t ∈ {1, ..., 31}. For MSN, we define X^m_i = (x^m_{i,1}, ..., x^m_{i,31}), where x^m_{i,t} is the number of times this term was part of a query on day t according to the MSN data. The data was normalized in order to reduce seasonal effects. We chose an additive model for the removal of seasonal variation, i. e., we estimated the seasonal effect for a particular weekday (e. g., Monday) as the average of each weekday observation minus the corresponding weekly average, and subtracted this seasonal component from the original data [Chatfield, 2004]. The model rests on the assumption that no substantial (i. e., long-term) trend exists which would otherwise lead to increasing or decreasing averages over time. As our time period is short, we assume that long-term trends do not influence averages. We also smoothed the data using simple average sine smoothing [Ivorix, 2007] with a smoothing window of three days to reduce random variation. Other smoothing techniques were also tested but delivered similar results.
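The additive seasonal adjustment and window smoothing described above can be sketched as follows. This is a simplified illustration only (it uses the overall mean instead of weekly means, and a plain centred moving average instead of the sine smoothing), not the exact procedure used in the experiments:

```python
from statistics import mean

def deseasonalize(series, first_weekday=0):
    # Estimate an additive weekday effect (weekday mean minus overall
    # mean) and subtract it from each observation.
    overall = mean(series)
    by_weekday = {}
    for t, x in enumerate(series):
        by_weekday.setdefault((first_weekday + t) % 7, []).append(x)
    effect = {wd: mean(xs) - overall for wd, xs in by_weekday.items()}
    return [x - effect[(first_weekday + t) % 7]
            for t, x in enumerate(series)]

def smooth(series, window=3):
    # Centred moving average; the window shrinks at the boundaries.
    half = window // 2
    return [mean(series[max(0, t - half):t + half + 1])
            for t in range(len(series))]
```

Applied to a series with a pure weekday pattern, `deseasonalize` yields a constant series, which is the intended behaviour of the adjustment.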
In order to find out about the similarity of the two time series of an item i, we used the correlation coefficient between the two random variables x^d_{i,t} and x^m_{i,t}, which is defined as

r = ∑_t (x^d_{i,t} − µ(X^d_i)) (x^m_{i,t} − µ(X^m_i)) / (σ(X^d_i) σ(X^m_i)),

where µ(X^d_i) and µ(X^m_i) are the expected values and σ(X^d_i) and σ(X^m_i) are the standard deviations. We applied the t-test for testing significance using the conventional probability criterion of .05. For 307 out of 1,003 items we observed a significant correlation. We take this as an indication that tagging and searching behaviour are indeed triggered by similar motivations. The highest correlation has the item ‘schedule’ (r = 0.93), followed by ‘vista’ (r = 0.91), ‘driver’, ‘player’ and ‘films’. While both ‘schedule’ time series are almost constant, the item ‘vista’ has a higher variance, since a beta 2 version of Microsoft’s Vista operating system was released in May 2006 and drew the attention of searchers and taggers. The ‘vista’ time series are shown in the left of Figure 6.2. Another example where the peaks in the time series were triggered by an information need after a certain event is “iran” (r = 0.80), which has the 19th highest correlation of all tags. The peaks show up shortly after the confirmation by the United States White House that Iran’s president had sent a letter to the president of the USA on May 08, 2006. The two curves are strongly correlated. A similar peak for ‘iran’ can be observed in Google Trends5, which shows Google’s search patterns in May 2006. These examples support the hypothesis that popular events trigger both search and tagging during the time period close to the event.

6.3 COMPARISON OF SEARCHING AND TAGGING

Figure 6.2: Time series of two highly correlated items, “vista” and “iran”. The straight line shows Delicious, the dashed line MSN. For each day in May 2006, the normalized occurrences of the item are shown.
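The correlation test described above can be sketched in plain Python as follows (illustrative helper names; the published experiments may have used different tooling):

```python
from math import sqrt

def pearson(xs, ys):
    # r = sum((x - mx)(y - my)) / (std(x) * std(y)), computed over
    # two equally long time series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def t_statistic(r, n):
    # t value for testing r against zero with n - 2 degrees of freedom;
    # compare it to the critical value of the t distribution at alpha = .05.
    return r * sqrt((n - 2) / (1 - r ** 2))
```

For two 31-day series, the statistic would be compared against the two-sided .05 critical value of the t distribution with 29 degrees of freedom.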

Coverage of Delicious with MSN, Google and AOL

In this section we shift our focus from query terms and tags to the underlying resources, i. e., the URLs. Considering the size of the Web, both search engines (in particular the part we can crawl) and folksonomies constitute only a small fraction of the Web. An interesting question is thus whether there is any significant overlap between the URLs provided by the two kinds of systems. To compare the coverage of the different data sets, we compute the overlaps between the MSN crawl, the Google crawl, the AOL click data and the Delicious complete data sets (see Section 6.2). As we had no access to the indexes of the search engines, we crawled all search engines with 1,776 queries to obtain comparable datasets. These queries were determined by taking the 2,000 most popular tags of the Delicious 2005 dataset and intersecting them with the set of all AOL items. In order to see whether Delicious contains those URLs that were considered relevant by the traditional search engines, we computed a kind of “recall” of folksonomy URLs on the other data sets as follows. First we cut each of the 1,776 rankings of each search data set after the first 25, 50, 75 and 100 URLs. For each ranking size we computed the intersection with all Delicious URLs. As the AOL log data consist of domain names only (and not of full URLs), we also pruned the URLs of the other systems in a second step to only include the domain names. Table 6.6 shows the results. The first number in each cell is the average number of overlaps for the original URLs, the second for the pruned URLs.

5http://www.google.com/trends?q=Iran&geo=all&date=2006-5

Table 6.6: Averages of overlapping URLs, computed over 1,776 rankings, of all Delicious URLs from the Delicious complete dataset with the search datasets, i. e., the MSN crawl, the Google crawl and the AOL click data. The rankings were cut after 25, 50, 75 and 100 URLs. The first number in each cell is the average number of overlaps for the original URLs, the second for the pruned URLs. Google shows the highest overlap with Delicious.

Dataset    top 25          top 50          top 75          top 100
Google     19.91 / 24.17   37.61 / 47.83   54.00 / 71.15   69.21 / 85.23
MSN        12.86 / 20.20   22.38 / 38.62   30.93 / 56.47   39.09 / 74.14
AOL        —     / 19.61   —     / 35.57   —     / 48.00   —     / 57.48

Google shows the highest overlap with Delicious, followed by MSN and then AOL. For all systems, the overlap is rather high. This indicates that, for each query, both traditional search engines and folksonomies focus on basically the same subset of the Web. The values in Table 6.6 will serve as upper bounds for the comparison of ranking overlaps in folksonomies and search engines in Section 6.3.2. Furthermore, the top of the rankings shows better coverage: while on average 24.17 of the top 25 Google URLs (97 %) are represented in Delicious, only 85.23 of the top 100 URLs (85 %) are. This indicates that the top entries of search engine rankings are, compared to the medium-ranked entries, also those which are judged more relevant by the Delicious users.
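The “recall” computation described above can be sketched as follows; `avg_overlap` and `host_only` are illustrative names, not the original implementation:

```python
from urllib.parse import urlparse

def host_only(url):
    # Prune a URL to its host part (the normalization needed for AOL).
    return urlparse(url).netloc or url

def avg_overlap(rankings, delicious_urls, k, prune=False):
    # Average size of the intersection between the top-k of each
    # ranking and the set of all Delicious URLs.
    ref = {host_only(u) for u in delicious_urls} if prune else set(delicious_urls)
    total = 0
    for ranking in rankings:
        top = ranking[:k]
        if prune:
            top = [host_only(u) for u in top]
        total += len(set(top) & ref)
    return total / len(rankings)
```

Averaging this value over the 1,776 query rankings, once with and once without pruning, would produce the two numbers per cell shown in Table 6.6.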

Conclusions of Section 6.3.1

The collection of all Delicious tags is only about a quarter of the size of the set of MSN query terms; both systems contain a very high number of very infrequent items (Section 6.3.1, Table 6.4). Once the sets are reduced to the frequent items, the relative overlap is higher. The remaining differences are due to different usage, e. g., the joining of multi-word lexemes into single terms in Delicious, and the use of (parts of) URLs as query terms in MSN. We have seen that, for a relatively high number of items, the search and tagging time series were significantly correlated. We have also observed that important events trigger both search and tagging without significant time delay, and that this behaviour is correlated over time. Considering the fact that both the available search engine data and the folksonomy data cover only a minor part of the WWW, the overlaps of the sets of URLs of the different systems (as discussed in Section 6.3.1) are rather high, indicating that users of social bookmarking systems are likely to tag web pages that are also ranked highly by traditional search engines. The URLs of the social bookmarking system over-proportionally match the top results of the search engine rankings. A likely explanation is that taggers use search engines to find interesting bookmarks.

6.3.2 Analysis of Search and Tagging System Content

In the previous section we compared the user interaction in social bookmarking systems and search engines and the coverage of URLs from folksonomies in search engines. In this section we focus on ranking algorithms. Are overlapping results different when we introduce a ranking on the folksonomy structure? Are important URLs in search engines the same as important URLs in social bookmarking systems? Is the ranking order within the overlap the same? These questions will be answered below. For the commercial search engines, we rely on our crawls and the data they provided, as the details of their ranking algorithms are not published (besides early papers like Page et al. [1998]). To rank URLs in social bookmarking systems we used two well-known ranking approaches: the traditional vector space approach with TF-IDF weighting and cosine similarity, and FolkRank [Hotho et al., 2006a], a link-based ranking algorithm similar to PageRank [Page et al., 1998], which ranks users, resources or tags based on the tripartite hypergraph of the folksonomy (a description of these algorithms can be found in Section 2.5.1).
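The vector space baseline can be illustrated with a minimal TF-IDF/cosine ranking sketch. This is the generic textbook scheme, not necessarily the exact weighting variant used in the experiments:

```python
from collections import Counter
from math import log, sqrt

def tfidf_rank(query, docs):
    # docs: list of term lists; returns document indices, best first.
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: log(n / df[t]) for t in df}

    def vec(terms):
        # Term frequency times inverse document frequency.
        tf = Counter(terms)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(v, w):
        dot = sum(v[t] * w.get(t, 0.0) for t in v)
        norm = (sqrt(sum(x * x for x in v.values()))
                * sqrt(sum(x * x for x in w.values())))
        return dot / norm if norm else 0.0

    q = vec(query)
    return sorted(range(n), key=lambda i: cosine(q, vec(docs[i])),
                  reverse=True)
```

In the folksonomy setting, a “document” would be the set of tags assigned to a resource, and the query a set of tags.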

Overlap of ranking results

To compare the overlap of rankings we start with an overview of the average intersection of the top 50 URLs calculated for all datasets. In this case we based the analysis on the normalized URLs of the same datasets as used in Section 6.3.1. Table 6.7 contains the average overlap calculated for the sets of normalized URLs and the TF, TF-IDF and FolkRank rankings of the Delicious data. We see that the overlap of Delicious Oct. 2006 with the result sets of the three commercial search engines is low. The average overlap of the MSN and Google crawl rankings, however, is considerably bigger (11.79), also when compared to the AOL results, which are in a similar range as the Delicious data. The two major search engines therefore seem to have more in common with each other than folksonomies have with search engines. The TF and TF-IDF based rankings show a surprisingly low overlap with Google, MSN and AOL, but also with the FolkRank rankings for Delicious. This indicates that, for web search, graph-based rankings provide a view on social bookmarking systems that is fundamentally different from pure frequency-based rankings. Even though the graph-based ranking on Delicious has a higher overlap with the search engine rankings than TF-IDF, it is still very low when compared to the potential values one could reach with a ‘perfect’ folksonomy ranking, e. g., an average overlap of 47.83 with the Google ranking, as shown in Table 6.6. The remaining items are thus contained in the Delicious data, but FolkRank ranked them beyond the top 50. To investigate this overlap further, we extended the Delicious result sets to the top 100 and top 1,000 URLs, respectively. Table 6.8 shows the average overlap of the top 100 and the top 1,000 normalized URLs from the FolkRank computations on the Delicious data from Oct. 2006 with the top 50 normalized URLs in the Google crawl, MSN crawl and AOL log data. The corresponding top 50 values can be found in the middle column of Table 6.7.
For Google, for instance, this means that the relative average overlap is 6.65/50 ≈ 0.13 for the top 50, 9.59/100 ≈ 0.10 for the top 100, and only 22.72/1000 ≈ 0.02 for the top 1,000. This supports our finding from Section 6.3.1 that the similarity between the FolkRank ranking on Delicious and the Google ranking on the Web is higher for the top than for the lower parts of the ranking.


Table 6.7: Average overlap of the top 50 normalized URLs from 1,776 rankings of MSN, AOL and Google with the corresponding normalized URLs from the TF, TF-IDF and FolkRank rankings of the Delicious data. We see that the overlap of Delicious rankings with the result sets of the three commercial search engines is low. The average overlap of the MSN and Google crawl rankings is considerably bigger (11.79). The two major search engines therefore seem to have more in common with each other than folksonomies have with search engines.

              Google    MSN    Del FolkRank   Del TF-IDF   Del TF
AOL             2.39    1.61       2.30           0.30       0.21
Google                 11.79       6.65           1.60       1.37
MSN                                3.78           1.20       1.02
Del FolkRank                                      1.46       1.79
Del TF-IDF                                                  49.53

Table 6.8: Average overlap with top 100/1,000 normalized Delicious URLs

           Google top 50   MSN top 50   AOL top 50
Del 100         9.59           5.00         1.65
Del 1000       22.72          13.43         5.16

Correlation of rankings

After determining the coverage of folksonomy rankings in search engines, one further question remains: are the rankings obtained by link analysis (FolkRank) and term frequencies / document frequencies (TF-IDF) correlated with the search engine rankings? Again, we use the rankings of the 1,776 common items from Section 6.3.1. As we do not have interval-scaled data, we select the Spearman correlation coefficient r_s = 1 − 6 ∑ d² / (n(n² − 1)), where d denotes the difference of ranking positions of a specific URL and n the size of the overlap.6 In Section 6.3.2 we showed that the overlap of the rankings is generally low. We therefore only compared those rankings having at least 20 URLs in common. For each such item, the Spearman coefficient is computed on the overlap of the rankings. Table 6.9 shows the results. The AOL comparisons with Delicious (using the link-based method as well as TF-IDF) do not show sufficient overlap for further consideration. The Google and MSN comparisons with the link-based FolkRank ranking on Delicious yield the highest number of ranking intersections containing more than 20 URLs (Google 361, MSN 112). Both Google and MSN show a large number of positive correlations. For instance, for Google we have 326 positive correlations, of which 176 are significant. This confirms our findings from Section 6.3.2. From the results above we derive that, where overlap exists, a large number of rankings computed with FolkRank are positively correlated with the corresponding search engine rankings. In order to find out for which topics the correlation is high, we extracted the top ten correlations of the Delicious FolkRank with Google and MSN, respectively (see Table 6.10). We

6In Bar-Ilan et al. [2006], enhancements to Kendall’s tau and Spearman are discussed to compare rankings with different URLs. These metrics are heavily influenced if the intersection between the rankings is small. Because of this we stick to the Spearman correlation coefficient.


Table 6.9: Correlation values and number of significant correlations for rankings having at least 20 URLs in common. The Google and MSN comparisons with the FolkRank ranking on Delicious yield the highest number of ranking intersections containing more than 20 URLs (Google 361, MSN 112).

Datasets          # overlaps   Avg.          Avg. of significant   # correlated   # significantly
                  (> 20)       correlation   correlations          rankings       correlated rankings
                                             pos/neg               pos/neg        pos/neg
Google/FolkRank   361           0.26         0.4 / -0.17           326/37         176/3
Google/TF-IDF      17           0.17         0.34 / 0               15/2            5/0
MSN/FolkRank      112           0.25         0.42 / -0.01           99/13          47/1
MSN/TF-IDF          6          -0.21         - / -                   2/4            0/0
AOL/FolkRank        1           0.25         - / -                   1/0            0/0
AOL/TF-IDF          1           0.38         0.38 / -                1/0            1/0

Table 6.10: Intersections and correlations for the top 10 correlations of the Delicious FolkRank rankings with the Google rankings (left) and the MSN rankings (right). The rankings contained 100 URLs. Most items in this set are IT related.

Google                                  MSN
Item          Inters.  Correlation      Item          Inters.  Correlation
technorati      34       0.80           validator       21       0.64
greasemonkey    34       0.73           subversion      22       0.60
validator       34       0.71           furl            23       0.59
tweaks          22       0.68           parser          27       0.58
metafilter      24       0.67           favicon         28       0.57
torrent         29       0.65           google          25       0.57
blender         22       0.62           —               21       0.56
torrents        30       0.62           jazz            26       0.56
dictionaries    21       0.62           svg             23       0.55
timeline        21       0.62           lyrics          25       0.54

found that most items in this set are IT related. As a major part of Delicious consists of IT related content, we conclude that link-based rankings for topics that are specific and sufficiently represented in a folksonomy yield results similar to search engine rankings.
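On the overlap of two rankings, the Spearman coefficient can be computed as in the following sketch (illustrative code, assuming at least two common URLs; positions within the overlap are re-ranked to 1..n before applying the formula):

```python
def spearman(rank_a, rank_b):
    # Both arguments are URL lists ordered from best to worst.
    common = set(rank_a) & set(rank_b)
    pos_a = {u: i for i, u in enumerate(rank_a)}
    pos_b = {u: i for i, u in enumerate(rank_b)}
    # Re-rank the shared URLs so both sides use positions 1..n.
    ra = {u: i + 1 for i, u in enumerate(sorted(common, key=pos_a.get))}
    rb = {u: i + 1 for i, u in enumerate(sorted(common, key=pos_b.get))}
    n = len(common)
    d2 = sum((ra[u] - rb[u]) ** 2 for u in common)
    # r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical orderings yield r_s = 1, completely reversed orderings r_s = -1.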

Conclusions of Section 6.3.2

In Section 6.3.2 we have seen that a comparison of rankings is difficult due to the sparse overlaps of the data sets. It turned out that the top hits of the rankings produced by FolkRank are closer to the top hits of the search engines than the top hits of the vector-based methods. Furthermore, we could observe that the overlap between Delicious and the search engine results is larger in the top parts of the search engine rankings. We also observed that the folksonomy rankings are more strongly correlated with the Google rankings than with those of MSN and AOL, whereby the graph-based FolkRank is closer to the Google rankings than TF and TF-IDF. Again, we assume that taggers preferably use search engines (and most of all Google) to find information they then proceed to tag. A qualitative analysis showed that the correlations were higher for specific IT topics, where Delicious has a particularly good coverage.

6.3.3 Discussion

In this section we conducted an exploratory study to compare social bookmarking systems with search engines. We concentrated on information retrieval aspects by analysing search and tagging behaviour as well as ranking structures. We were able to discover both similar and diverging behaviour in the two kinds of systems, as summarized in the conclusions of Sections 6.3.1 and 6.3.2. A question still open is whether, with more data available, the correlation and overlap analyses could be set up on a broader basis. However, a key question to be answered first is: what is to be considered a success? Is it desirable that social search tries to approximate traditional web search? Is Google the measure of all things? Computing overlaps and comparing correlations helped us find out about the similarities between the systems. However, we have no information about which approach offers more relevant results from a user’s perspective. A user study in which users create a benchmark ranking, together with suitable performance measures, would be of benefit. Further investigation also has to include a deeper analysis of where URLs show up earlier, as well as of the characteristics of those URLs in both systems that are not part of the overlap.

6.4 Properties of Logsonomies

In this section, we look at structural and semantic characteristics of the tripartite structure of click data, i. e., logsonomies, and compare it to the tripartite structure of folksonomies. First, we look at the degree distribution of users, tags and resources in folk- and logsonomies. Afterwards, in Section 6.4.2, we discuss the topological structure of logsonomies. This section is part of the work presented in Krause et al. [2008b]. Section 6.4.3 summarizes semantic aspects; a complete analysis can be found in Benz et al. [2009b].

6.4.1 Degree distribution

In Section 6.3.1, we have seen that the tag distribution of the Delicious folksonomy follows a power law, and that the MSN terms follow a similar distribution. We now analyse whether this also holds for the logsonomies. We consider all three dimensions of the folksonomies, i. e., tags, resources, and users. When considering a folk-/logsonomy as a hypergraph, the count of an item equals its degree in the graph. We will therefore also use the notion of degree distribution. The distributions of tags, resources, and users are plotted in Figures 6.3, 6.4, and 6.5, respectively. These plots are similar to the one in Figure 6.1, with the following differences:

• As we now consider logsonomies, we restrict ourselves in the search engine data to those resources which have actually been clicked by some user, and to the corresponding query words and users. The plot in Figure 6.1, on the other hand, is based on all tags, as our aim in Section 6.3.1 was to compare the content of the systems rather than the user behaviour.

• The y-axis now displays relative rather than absolute values, to make the curves better comparable. In all plots, all curves thus start in the upper left corner (which states the trivial observation that 100 % (= 10^0) of all items in all datasets have a degree of ≥ 1).

Figure 6.3: Degree distribution of tags/query words/queries. The degree k is plotted against the fraction of nodes with at least this degree.

• The plot now shows the cumulated degree distribution. While in Figure 6.1 the y-axis showed the number of items with a degree of exactly k, it now shows the number of items with a degree ≥ k. The non-cumulated version has the advantage that each data point refers to exactly one item (e. g., the query word ‘yahoo’ being the rightmost box in Figure 6.1), but the very few items at the right end of the curve ‘mess up’ the diagram, making it more difficult to read. The cumulated plot smooths the right end of the curves without losing any information.7

• For MSN, we now use the ‘host only’ version of the URLs, to be comparable with the AOL data, as discussed in Section 6.2.2. For Delicious, only the ‘host only’ curve is relevant for comparison with the other systems. Its ‘complete URLs’ curve was plotted only to verify that the restriction to the host part does not bias the results.
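The cumulated distribution itself is straightforward to compute from a list of node degrees; a sketch:

```python
from collections import Counter

def cumulative_degree_distribution(degrees):
    # Return (k, fraction of nodes with degree >= k) for each distinct
    # degree k; the first fraction is always 1.0 (the upper left corner
    # of the plots).
    n = len(degrees)
    counts = Counter(degrees)
    pairs, remaining = [], n
    for k in sorted(counts):
        pairs.append((k, remaining / n))
        remaining -= counts[k]
    return pairs
```

Plotting these pairs on doubly logarithmic axes reproduces the kind of curves shown in Figures 6.3 to 6.5.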

Term Distributions

The distributions of the terms (i. e., tags of Delicious and query words/whole queries of the search engines) are plotted in Figure 6.3. We observe that all datasets except the AOL and MSN complete queries datasets behave very similarly, and differ only for the frequently used terms (i. e., those with very high degree). The stronger decrease of the distributions for the latter two results from the fact that it is less likely for different users to share full queries than to share single query words, and that very many queries are submitted only once. We attribute the relatively higher amount of frequent queries in MSN compared to AOL to the larger size of the MSN sample, which makes repeated occurrences of the same queries more likely. The frequency of usage of query words in search engines8 and of tags in Delicious is very similar, as the other distributions show. For the small and medium degrees, they are remarkably similar, and differ only for the very frequent terms. While the Delicious distribution follows a power law, the AOL and MSN curves are concave, indicating that they have fewer very frequent terms than expected. We conclude that the agreement on the most central terms is less strong among the search engine users, but that the distribution of moderately used terms is very similar. The very small difference between the two Delicious curves indicates that we do not disturb the distribution by considering only the host part of the URLs, which was necessary to be comparable with the AOL data.

7In the cumulated version, the power law is called Pareto law [Adamic, 2002]. There is a 1:1 correspondence between the non-cumulated and the cumulated version. The only difference is that the slope of the line is increased by 1, as the cumulated version results from the non-cumulated one by integration, which increases the exponent of the power by 1.
8Note that in a logsonomy ‘usage of a query word’ means the frequency of clicking on hits that were returned for that query word.

Figure 6.4: Degree distribution of resource nodes. The x-axis shows the degree k, the y-axis the fraction of resources with at least this degree.

Resource Distributions

For all datasets (except the non-competitive Delicious complete URLs dataset), the resource distributions are surprisingly similar to each other (cf. Figure 6.4). Not only do they all form straight lines and thus clearly follow a power law, but they additionally all show the same slope. We consider this a strong indicator that all datasets reflect the same underlying distribution of the interests of Web users, even though the coverage of the search engines is broader than the coverage of Delicious. N.B.: The strong deviation of the distribution for the ‘Delicious complete URLs’ dataset was to be expected, since reducing URLs to their hosts aggregates their frequencies to the frequency of the host name, which is thus expected to be higher.
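A common quick check for the shared slope is a least-squares fit on the log-log points; the sketch below estimates the exponent this way (note that for rigorous power-law fitting, maximum-likelihood estimators are usually preferred over a plain regression):

```python
from math import log

def loglog_slope(points):
    # Least-squares slope of log(y) against log(x); for y ~ x^(-alpha)
    # this estimates -alpha.
    xs = [log(x) for x, _ in points]
    ys = [log(y) for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Applied to the (degree, fraction) pairs of each dataset, similar slopes would indicate the shared underlying distribution discussed above.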

User Distributions

Figure 6.5 shows the distributions of users for the different datasets. They are all concave, and thus have more moderately active users and fewer very active users than expected for a power law. Although they still follow the overall scheme (the majority of users is very inactive, while only very few users are very active), one cannot claim the existence of a power law any more. This indicates that models which generate power law distributions, like preferential attachment [Barabasi and Albert, 1999], are adequate neither for describing the user activity in folksonomies nor in logsonomies. The slopes of the user distributions have a high variance. Delicious shows the highest relative amount of very frequent users, followed by AOL and, with a significant distance, MSN. The latter is likely to be due to the nature of the sessions representing the users in this dataset: though long-term cookies to track users exist in MSN, sessions have a shorter life time, as opposed to the unique, timeless user IDs present in the AOL and Delicious datasets. The probability of being strongly interlinked is therefore lower. The difference between AOL and Delicious indicates that the activity of the very active users is significantly higher in the bookmarking system. For both AOL and MSN, the curves for the complete queries are below the curves for the split queries. This difference is systematic, because it is less likely to reuse complete queries than to reuse single query words. N.B.: The very small difference between the two Delicious curves indicates again that we do not disturb the distribution by considering only the host part of the URLs.

Figure 6.5: Degree distribution of user nodes. The x-axis shows the degree k, the y-axis the fraction of users with at least this degree.

6.4.2 Structural Properties

Folksonomies exhibit small world characteristics (see Section 2.3): their graph topology has a clustering coefficient that is higher than that of a random graph, while the average shortest path length is almost as small as that of a random graph [Watts and Strogatz, 1998]. These characteristics are one explanation for the popularity of social bookmarking systems: on the one hand, resources fulfilling a specific information need are clustered together in the folksonomy, while on the other hand, users can reach most of the contents within a few clicks. In the following sections we investigate to which extent these characteristics hold for logsonomies. We created six folk- and logsonomy datasets from the available Delicious, MSN and AOL data as described in Section 6.2.2. We followed the experiments of Cattuto et al. [2007], who compared folksonomies to random graphs, in order to be comparable to former findings regarding folksonomy properties. In these experiments, binomial and shuffled (hyper-)graphs of the same size as the original folksonomy were used to compare the original graph to random graphs. For a given folksonomy (U, T, R, Y), a binomial random graph is a folksonomy (U, T, R, Ŷ) where Ŷ consists of |Y| randomly drawn tuples from U × T × R. A shuffled random graph is a folksonomy (U, T, R, Y̌) where Y̌ is derived from Y by randomly shuffling all occurrences of tags in Y, followed by shuffling all occurrences of the resources. (For a complete shuffling, it is sufficient to shuffle any two of the three dimensions.) The binomial graph thus has the same number of tag assignments as the original graph, while the shuffled graph additionally has the same degree distribution.
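A shuffled random graph as described above can be generated by permuting two of the three columns of Y independently; a sketch (the triple order (user, tag, resource) is our own convention):

```python
import random

def shuffled_folksonomy(Y, seed=42):
    # Shuffle the tag and resource columns of the tag assignments
    # independently; all three degree distributions are preserved.
    rng = random.Random(seed)
    users = [u for u, t, r in Y]
    tags = [t for u, t, r in Y]
    resources = [r for u, t, r in Y]
    rng.shuffle(tags)
    rng.shuffle(resources)
    return list(zip(users, tags, resources))
```

A binomial graph would instead draw |Y| triples uniformly from U × T × R, which preserves only the number of tag assignments, not the degree distribution.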


Table 6.11: Average shortest path lengths, cliquishness and connectedness of each dataset and the corresponding random graphs.

                                   ASPL                      cliquishness               connectedness
dataset                     raw   shuffled  binomial    raw   shuffled  binomial   raw   shuffled  binomial
Delicious, complete URLs    3.59    3.08      3.99      0.86    0.55      0.20     0.85    0.37      0.00
Delicious, host only URLs   3.48    3.06      3.67      0.75    0.51      0.05     0.83    0.32      0.00
AOL, complete queries       4.11    3.81      5.76      0.85    0.66      0.32     0.33    0.03      0.00
AOL, split queries          3.62    3.20      3.90      0.70    0.43      0.04     0.66    0.10      0.00
MSN, complete queries       5.43    4.10      8.78      0.87    0.75      0.47     0.42    0.03      0.00
MSN, split queries          3.94    3.42      5.48      0.85    0.50      0.23     0.70    0.11      0.00

Average Shortest Path Length

The average shortest path length (ASPL) denotes the mean shortest distance between any two nodes in the graph. In a tripartite hypergraph, a path between any two nodes is a sequence of hyperedges that lie between them; the shortest path is a path with the minimum number of hyperedges connecting the two nodes. For complexity reasons, we approximated the average shortest path length as follows: for each of the datasets, we randomly selected 4,000 nodes and calculated the shortest path length from each of those nodes to all other nodes in its connected component. Table 6.11 shows the average shortest path length of each dataset together with the values for the corresponding random graphs. Comparing the two Delicious datasets, the average shortest path length does not vary greatly when considering host only URLs (3.48 for the host-only graph versus 3.59 for the graph with complete URLs). The average shortest path lengths of the AOL and MSN datasets with split queries are smaller than those of the datasets with complete queries. This can be explained by the higher overlap which occurs when splitting the queries. As a side effect, this also leads to a mixing of contents; e. g., the word java in java programming language and java island will link to different topics. However, such wording issues also exist in folksonomies. Compared to Delicious, all four datasets from MSN and AOL show larger path lengths. In terms of serendipitous browsing, this means it takes longer to reach other queries, users, or URLs within a logsonomy than it takes to jump between tags, users and resources in a folksonomy. In particular, the high values for MSN are likely to result from the fact that a user cannot bridge between different topics if he searched for them in different sessions. Small world properties are still confirmed by the shortest path lengths: when comparing each logsonomy to the corresponding binomial and shuffled random graphs, the path lengths differ only slightly.
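On an ordinary adjacency-list graph, this sampling approximation looks as follows (a simplified sketch: real folk-/logsonomies are tripartite hypergraphs, where a step traverses a hyperedge rather than a binary edge):

```python
import random
from collections import deque

def approx_aspl(adj, samples=4000, seed=0):
    # BFS from a random node sample; average the distances to all
    # reachable nodes, mirroring the 4,000-node samples used in the text.
    rng = random.Random(seed)
    sample = rng.sample(list(adj), min(samples, len(adj)))
    total = count = 0
    for start in sample:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count
```

With the sample covering all nodes, this reduces to the exact ASPL of the connected components.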

Clustering Coefficient

The clustering coefficient characterizes the density of connections in the environment of a node. It describes the cliquishness (i. e., whether neighbor nodes of a node are also connected among each other) and the connectedness of a node (i. e., whether neighbor nodes would stay acquainted if the node was removed). In a tripartite graph, one needs to consider these two characteristics separately. In Cattuto et al. [2007], two measures were proposed, which are summarized in the following paragraphs.

90 6.4 PROPERTIES OF LOGSONOMIES

Cliquishness. Consider a resource r. Then the following sets of tags Tr and users Ur are said to be connected to r:

Tr = {t ∈ T | ∃u ∈ U : (t, u, r) ∈ Y },  Ur = {u ∈ U | ∃t ∈ T : (t, u, r) ∈ Y }.

Furthermore, let tur := {(t, u) ∈ T × U | (t, u, r) ∈ Y }, i. e., the (tag, user) pairs occurring with r. If the neighborhood of r was maximally cliquish, all of the pairs from Tr × Ur would occur in tur. So we define the cliquishness coefficient γcl(r) as:

γcl(r) = |tur| / (|Tr| · |Ur|) ∈ [0, 1] .  (6.1)

The cliquishness is defined likewise for tags and users.

Connectedness. Consider a resource r. Let t̃ur := {(t, u) ∈ tur | ∃ r̃ ≠ r : (t, u, r̃) ∈ Y }, i. e., the (tag, user) pairs from tur that also occur with some other resource than r. Then we define:

γco(r) := |t̃ur| / |tur| ∈ [0, 1] .  (6.2)

γco is thus the fraction of r’s neighbor pairs that would remain connected if r were deleted. It indicates to what extent the surroundings of the resource r contain “singleton” combinations, i. e., (tag, user) pairs that occur only once. The connectedness is defined likewise for tags and users. The results in Table 6.11 show that the average cliquishness and connectedness coefficients of the original AOL, MSN and Delicious graphs are in general higher than those of the corresponding random graphs. This indicates that there is some systematic aspect in the search behaviour which is destroyed in the randomized versions. Comparing the two logsonomies to the folksonomy, however, one can conclude that the clustering coefficients of the folksonomy exceed those of the logsonomies. This is probably due to the higher variety of topics in the logsonomy datasets – whereas Delicious is very focused on computer related terms. Additionally, users in folksonomies tend to add similar tags to a resource, while resources in web search engines will be retrieved by many different queries. This relates to the issues described in Section 3.2.4: the process in which tags are created in logsonomies and folksonomies is different. Overall, logsonomies show a tendency towards the connectedness and small world properties which have been found in folksonomies. Due to their diverse topical structure and the different process of gathering the data (from different search rankings and their clicks), however, the tripartite network of click data is less connected and cliquish than the one built from tagging data.
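Equations (6.1) and (6.2) can be computed directly from the triple set Y. The following sketch (the helper names are our own; Y is assumed to be a set of (tag, user, resource) triples) illustrates both coefficients for a resource:

```python
def cliquishness(Y, r):
    """γ_cl(r): fraction of (tag, user) pairs from T_r × U_r that
    actually co-occur with resource r (Eq. 6.1)."""
    Tr = {t for (t, u, res) in Y if res == r}
    Ur = {u for (t, u, res) in Y if res == r}
    tur = {(t, u) for (t, u, res) in Y if res == r}
    return len(tur) / (len(Tr) * len(Ur))

def connectedness(Y, r):
    """γ_co(r): fraction of r's (tag, user) pairs that also occur with
    some other resource, i.e. that stay connected if r is removed
    (Eq. 6.2)."""
    tur = {(t, u) for (t, u, res) in Y if res == r}
    survivors = {(t, u) for (t, u, res) in Y
                 if res != r and (t, u) in tur}
    return len(survivors) / len(tur)
```

For example, with Y = {(t1, u1, r1), (t2, u1, r1), (t1, u2, r1), (t1, u1, r2)}, resource r1 has γcl = 3/4 (the pair (t2, u2) is missing) and γco = 1/3 (only (t1, u1) also occurs with r2).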

6.4.3 Semantic Properties

The previous section revealed that folksonomies and logsonomies show similar structural characteristics, e. g., small world properties. These findings support the idea of enabling some kind of browsing facility in search engines. Another exciting property of folksonomies is the inherent semantics which emerge with the process of tagging. We now present experiments from Benz et al. [2009b], who investigate to which extent a logsonomy allows the extraction of semantics emerging from the “collaborative” process of searching for similar information and being interested in the same resources. Again, results are compared to results of the same experiments conducted on folksonomy data. The analysis is based on the datasets Delicious reduced and AOL split queries reduced as presented in Section 6.2.2.

91 CHAPTER 6: SOCIAL INFORMATION RETRIEVAL IN FOLKSONOMIES AND SEARCH ENGINES

Relatedness Measures

In order to find semantic similarities, one needs to define measures which allow for extracting related items. Several relatedness measures have been introduced in Cattuto et al. [2008] to extract semantic similarities between query parts from folksonomies. They can be applied to logsonomies in a similar way: Given a logsonomy (U, T, R, Y ), one can define the query word co-occurrence graph as a weighted, undirected graph whose set of vertices is the set T of query words. For all users u ∈ U and resources r ∈ R, let Tur be the set of query words within one query, i. e., Tur := {t ∈ T | (u, t, r) ∈ Y }. Two query words t1 and t2 are connected by an edge iff there is at least one query (u, Tur, r) with t1, t2 ∈ Tur. The weight of this edge is given by the number of queries that contain both t1 and t2, i. e.,

w(t1, t2) := card{(u, r) ∈ U × R | t1, t2 ∈ Tur} . (6.3)

Co-occurrence relatedness (Co-Occ) between query words is given directly by the edge weights. For a given query word t ∈ T, the tags that are most related to it are thus all the tags t′ ∈ T with t′ ≠ t such that w(t, t′) is maximal.

Three distributional measures of query word relatedness, based on three different vector space representations of query words, have been introduced in Benz et al. [2009b]. The difference between the representations is the feature space used to describe the tags, which varies over the three dimensions of the logsonomy. Specifically, for X ∈ {U, T, R} we consider the vector space R^X, where each query word t is represented by a vector vt ∈ R^X as described below.

The Tag Context Similarity (TagCont) is computed in the vector space R^T, where, for a tag t, the entries of the vector vt ∈ R^T are defined by vtt′ := w(t, t′) for t′ ≠ t ∈ T, where w is the co-occurrence weight defined above, and vtt := 0. The reason for giving a zero weight between a node and itself is that two tags should be considered related when they occur in a similar context, not when they occur together.

The Resource Context Similarity (ResCont) is computed in the vector space R^R. For a tag t, the vector vt ∈ R^R is constructed by counting how often a tag t is used to annotate a certain resource r ∈ R: vtr := card{u ∈ U | (u, t, r) ∈ Y }.

The User Context Similarity (UserCont) is built similarly to ResCont by swapping the roles of the sets R and U: For a tag t, the vector vt ∈ R^U is defined as vtu := card{r ∈ R | (u, t, r) ∈ Y }.

In all three representations, we measure vector similarity by using the cosine measure [Singhal, 2001], as is customary in Information Retrieval. The FolkRank algorithm (see Chapter 2.5.1) allows for computing a ranking of the most similar tags to a specific term t. To compute a tag ranking in the logsonomy, we assigned high weights to a specific query term t in the random surfer vector.
The final outcome of FolkRank is then, among others, a ranked list of tags which FolkRank judges as related to t.
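As an illustration of the co-occurrence weights of Equation (6.3) and the tag context similarity, the following sketch (the function names are our own; posts are assumed to be given as a mapping from (user, resource) pairs to query-word sets) computes both measures:

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_weights(posts):
    """Edge weights w(t1, t2) of the query-word co-occurrence graph:
    the number of (user, resource) posts containing both words (Eq. 6.3).
    `posts` maps (user, resource) -> set of query words T_ur."""
    w = defaultdict(int)
    for words in posts.values():
        ws = sorted(words)
        for i, t1 in enumerate(ws):
            for t2 in ws[i + 1:]:
                w[(t1, t2)] += 1
                w[(t2, t1)] += 1
    return w

def tag_context_similarity(w, tags, t1, t2):
    """TagCont: cosine similarity of the co-occurrence vectors of t1
    and t2, with each vector's own entry set to zero."""
    v1 = {t: (0 if t == t1 else w.get((t1, t), 0)) for t in tags}
    v2 = {t: (0 if t == t2 else w.get((t2, t), 0)) for t in tags}
    dot = sum(v1[t] * v2[t] for t in tags)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

On toy data where java and python never occur in the same query but both co-occur with programming, their co-occurrence weight is 0 while their tag context similarity is high: the two words occur in a similar context, which is exactly the effect the zeroed self-entry is meant to capture.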

First Insights

Table 6.12 provides a few examples of the related tags returned by the measures under study. A first observation is that the co-occurrence relatedness often seems to “restore” compound expressions like news channel, guitar tabs or brain tumor. This can be attributed


Table 6.12: Examples of most related tags for each of the presented measures.

rank  tag     measure            1          2           3           4            5
37    news    co-occurrence      channel    daily       fox         paper        newport
              folkrank           channel    fox         daily       newspaper    county
              tag context        news.com   newspaper   weather     obituaries   newspapers
              resource context   news.com   arrested    killed      accident     local
              user context       county     center      edging      state        city
399   guitar  co-occurrence      tabs       chords      tab         free         bass
              folkrank           tabs       chords      lyrics      tab          music
              tag context        banjo      drum        piano       acoustic     bass
              resource context   tabs       tab         tablature   chords       acoustic
              user context       chords     tabs        tab         guitars      chord
474   gun     co-occurrence      smoking    paintball   parts       laws         control
              folkrank           guns       rifle       paintball   parts        sale
              tag context        guns       pistol      rifles      rifle        handgun
              resource context   smoking    pistol      rifle       handgun      guns
              user context       safes      guns        pistol      holsters     pellet
910   brain   co-occurrence      tumor      stem        injury      symptoms     tumors
              folkrank           cancer     symptoms    tumor       blood        disease
              tag context        pancreas   intestinal  liver       thyroid      lungs
              resource context   tumor      tumors      syndrome    damage       complications
              user context       stem       feline      tumor       acute        urinary

to the way the logsonomy was constructed, namely by splitting queries (and consequently also compound expressions) using whitespace as a delimiter. Another observation which is identical to the folksonomy data is that co-occurrence and FolkRank relatedness often seem to return the same related tags. The tag context relatedness seems to yield substantially different tags. Our experience from folksonomy data (where this measure preferentially discovered synonym or sibling tags) seems to also hold true for logsonomy data: the most similar tags by tag context similarity often refer to a type of synonym9 (e. g., gun – guns, news – news.com), whereas the remaining tags can be regarded as “siblings”: for example, for the tag brain it gives other organs of the body, whereas for the tag guitar it gives other music instruments.
When we talk about “siblings”, we mean that these tags could be subsumed under a common parent in some suitable concept hierarchy; in this case, e. g., under organs and music instruments, respectively. In our folksonomy analysis, this effect was even stronger for the resource context relatedness – a finding which does not seem to hold for logsonomy data based on this first inspection. The resource context relatedness does exhibit some similarity to the tag context relatedness; in general, however, it gives a mixed picture. User context relatedness is even more blurred – an observation which is again in line with the folksonomy side. These first results suggest that, despite the reported differences, especially the tag context in a logsonomy seems to hold semantic information similar to the one we found in folksonomy data.

Semantic Grounding

Next, we look up the tags in an external, structured dictionary of word meanings. Within these structured knowledge representations, there are often well-defined metrics of semantic similarity. Based on these, one can infer which type of semantic relationship holds between the original and the related tags.

9Please note that we do not use the term ‘synonym’ in a linguistically precise way; we regard two words as being synonyms when they basically refer to the same concept. This also includes e. g., singular / plural forms of a noun.

We use WordNet [Fellbaum, 1998b], a semantic lexicon of the English language. The core structure we exploit is its built-in taxonomy of words, grouped into synsets, which represent distinct concepts. Each synset consists of one or more words and is connected to other synsets via the is-a relation. The resulting directed acyclic graph connects hyponyms (more specific synsets) to hypernyms (more general synsets).

Based on this semantic graph structure, several metrics of semantic similarity have been proposed [Budanitsky and Hirst, 2006]. The simplest is to count the number of nodes one has to traverse from one synset to another. We adopted this taxonomic shortest-path length for our experiments. In addition, we use a measure of semantic distance introduced by Jiang and Conrath [1997], which combines the taxonomic path length with an information-theoretic similarity measure. The choice of this measure was guided by work of Budanitsky and Hirst [2006], who showed by means of a user study that the Jiang-Conrath distance comes closest to what humans perceive as semantically related.

Following the pattern proposed in Cattuto et al. [2008], we carried out a first assessment of our measures of relatedness by measuring – in WordNet – the average semantic distance between a tag and the corresponding most-closely-related-tag according to each of the relatedness measures under consideration. For each tag of our logsonomy, we find its most closely related tag using one of our measures. If we can map this pair to WordNet (i. e., if both tags are present), we measure the semantic distance between the two synsets containing these two tags. If any of the two tags occurs in more than one synset, we use the pair of synsets which minimizes the path length.
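The grounding procedure (mapping a tag pair to synsets and taking the taxonomic shortest path) can be illustrated on a hand-made miniature taxonomy. The synset names below are invented for illustration and are not taken from WordNet:

```python
from collections import defaultdict, deque

# Hypothetical miniature is-a taxonomy (child -> list of parents).
IS_A = {
    "guitar": ["instrument"], "banjo": ["instrument"],
    "brain": ["organ"], "liver": ["organ"],
    "instrument": ["artifact"], "organ": ["body_part"],
    "artifact": ["entity"], "body_part": ["entity"],
}

def taxonomic_path_length(a, b):
    """Smallest number of is-a edges between two synsets, traversing
    the taxonomy both upwards and downwards (WordNet-style taxonomic
    path length), computed by breadth-first search."""
    graph = defaultdict(set)
    for child, parents in IS_A.items():
        for p in parents:
            graph[child].add(p)
            graph[p].add(child)
    dist = {a: 0}
    queue = deque([a])
    while queue:
        v = queue.popleft()
        if v == b:
            return dist[v]
        for nxt in graph[v]:
            if nxt not in dist:
                dist[nxt] = dist[v] + 1
                queue.append(nxt)
    return None  # not connected
```

In this toy taxonomy, guitar and banjo (siblings under instrument) are 2 edges apart, while guitar and brain are 6 edges apart, mirroring how close siblings score much better than unrelated terms.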

Figure 6.6 reports the average semantic distance between the original tag and the most related one, computed in WordNet by using both the taxonomic path length and the Jiang-Conrath distance. Overall, the diagrams are quite similar with respect to structure and scale. In both cases, the random relatedness (where we associated a given tag with a randomly chosen one) constitutes the worst case scenario.

Similar to our prior results for folksonomies (i. e., those shown in Figure 6.6(a)), for the logsonomy, the tag and resource context relatedness measures yield the semantically most closely related tags. In the logsonomy case, the distances between related tags for the resource context relatedness are larger than in the folksonomy case. We attribute this to the way the logsonomy is built: when users implicitly tag a certain URL by clicking on it, they are probably not as aware of the actual content of this page as a user who explicitly tags this URL in a social bookmarking system.

Another remarkable difference compared to the folksonomy data is that the co-occurrence relatedness yields tags whose meanings are comparatively distant from that of the original tag. This can be attributed to the fact that co-occurrence often “reconstructs” compound expressions, as already mentioned in Section 6.4.3. The finding is a natural consequence of splitting queries, and consequently compound expressions, as we did. Our results confirm the intuitive assumption that the isolated parts of a compound expression usually complement each other semantically rather than being similar.


[Figure 6.6: two bar-chart panels, (a) Delicious folksonomy and (b) AOL logsonomy, plotting the taxonomic path length and the Jiang-Conrath distance for ResCont, TagCont, Co-Occ, FolkRank, UserCont and Random.]

Figure 6.6: Average semantic distance, measured in WordNet, from the original tag to the most closely related one. The distance is reported for each of the measures of tag similarity discussed in the main text (see Section 6.4.3). The corresponding labels are on the left. Grey bars (bottom) show the taxonomic path length in WordNet. Black bars (top) show the Jiang-Conrath measure of semantic distance.

6.4.4 Discussion

In this section we analysed the graph structure and semantic aspects of the logsonomies MSN and AOL. We found similar user, resource and tag distributions, whereby the split query datasets are closer to the original folksonomy than the complete query datasets. We could show that both graph structures have small world properties in that they exhibit relatively short average path lengths and high clustering coefficients. In general, the differences between the folksonomy and logsonomy model mentioned in Section 3.2.4 did not affect the graph structure of the logsonomies. Minor differences are triggered by the session IDs, which do not have the same thematic overlap as user IDs. Also, full queries show less inherent semantics than the split datasets do.


To analyse semantic aspects, we used different relatedness measures and WordNet. Due to the fact that queries were split up into single terms, we found that the co-occurrence based measures mostly restored compound expressions. Interestingly, applying the resource context relatedness to logsonomies is much less precise for discovering semantically close terms than in a folksonomy. We attribute this mainly to the users’ incomplete knowledge about the content of a page link they click on, leading e. g., to “erroneous” clicks. The behaviour of the tag context measure is more similar to the folksonomy case, which recommends it as a candidate for synonym and “sibling” term identification. In future work, a more thorough analysis of these differences could lead to a better understanding of why they occur. Would the inclusion of full URLs lead to different structural results? How can we avoid the splitting of compound expressions? Overall, the results support our vision of merging the search engine and folksonomy worlds into one system. While some search engines already allow the storage of search results, they do not provide folksonomy-like navigation or the possibility to add or change tags. From a practical point of view, the following considerations are further arguments for a logsonomy implementation and its combination with a folksonomy system. Some of those points have recently been introduced into search engines.

• Users could enrich visited URLs with their own tags (besides the automatically added words from the query) and the search engine could use these tags to consider such URLs for later queries — also from other users. Thus, those tags could improve the general quality of the search engine’s results.

• Search engines typically have the problem of finding new, unlinked web pages. Assuming users store new pages in the folksonomy, the search engine could better direct its crawlers to new pages. Additionally, those URLs would already have been annotated with the users’ tags. Therefore, even without crawling the pages it would be possible to present them in result sets.

• Folksonomies can help spot trends in society. Many social bookmarking users can be viewed as trend setters or early adopters of innovative ideas — their data is valuable for improving a search engine’s diversity and novelty.

• Bookmarked URLs of the user may include pages the search engine cannot reach (intranet, password-protected pages, etc.). These pages can then be integrated into personalized search results.

However, privacy issues are very important when talking about search engine logs. They provide details of a user’s life and often allow the identification of the users themselves [Adar, 2007]. Certainly, this issue requires attention when implementing a logsonomy system.

6.5 Exploiting User Feedback

The last part of this chapter explores how the information collected in tagging systems can be of use for traditional search. One possibility is to use tagging data as a source of implicit feedback for learning-to-rank algorithms. Section 6.5.1 introduces this general idea and describes the different possibilities used in the experiments presented in Section 6.5.3 to

derive information from tagging data. Results are summarized in Section 6.5.4. The work has been presented in Navarro Bullock et al. [2011a].

6.5.1 Implicit Feedback from Tagging Data for Learning-to-Rank

In Section 3.2.3 the general concept of learning-to-rank was introduced. The idea is to learn a ranking function in order to sort search results. The training data consists of queries and documents matching the query, ordered according to their relevance to the query. A query-document pair is usually represented by a feature vector with features such as the frequency of the query term in the document’s summary or the length of the document. Different approaches exist for solving the learning-to-rank task: pointwise, pairwise and listwise approaches [Dong et al., 2009]. In the experiments presented here we focus on the pairwise approach: a binary classifier is learned which classifies one of two documents as more relevant than the other (Section ?? presents details about the algorithm). The training data consists of relevance scores assigned to query-document pairs. The scores for the training data are either derived by exploiting user search behaviour such as click data (for example in Dou et al. [2008]; Joachims [2002]; Macdonald and Ounis [2009]) or by asking experts to manually assess search results. The human evaluation of ranking results gives explicit relevance scores but is expensive to obtain. Clickdata can be logged from the user interaction with a search engine, but the feedback is noisy, as a click does not always indicate relevance. Sometimes, people are not satisfied with the clicked result, reformulate their information need or click on another resource. The process of storing and annotating web links in a social bookmarking system can also be seen as an expression of relevance: by tagging a specific URL, a user judges this resource to be of importance. The resource’s tags indicate what it is relevant for. Mostly, tags describe a topic, the resource’s context or the user’s reason for tagging the resource.
While search queries express a specific information need and there is no evidence as to whether a clicked URL fulfills this need, tags serve as a description or categorization of the specific resource. Tagging data could therefore be a useful further source of implicit feedback when generating training and test data for learning-to-rank. In order to explore the practicability of this approach, we compare implicit feedback generated from tagging data to implicit feedback generated from clickdata. Given a search query and the ranking of a search engine, we match the query and URLs with tags and resources of a social bookmarking system. We thereby assume that a URL in the ranking list is important if it has been tagged with the query terms (or similar tags). At the same time, we assume that the URL is relevant if it has been clicked on after the submission of the specific query. In Section 6.5.2 we present different strategies for modeling implicit feedback. To compare the feedback types’ performance, ranking models are learned using training data where the relevance scores are generated from a specific feedback type (for example from tagging data). The models are tested by predicting relevance scores generated from other feedback types (for example click data). The experimental results, described in Section 6.5.3, show that ranking models generated from both tagging and click data are comparable in terms of the rankings they produce.


Table 6.13: Example of a mapping for the MSN search query Social Web, user clicks in the MSN ranking and resources tagged with Social Web in Delicious.

query: Social Web

MSN ranking                                              MSN click   Delicious resource
http://www.socialweb.net/                                    x               x
http://de.wikipedia.org/wiki/Web_2.0
http://www.socialweb.siteblob.com                            x
http://www.extensions.joomla.org/extensions/social-web                       x

6.5.2 Mapping click and tagging data to rankings

Given a search query and the ranking of a search engine, one can match the query and ranked URLs with tags and resources from the social bookmarking system. Different possibilities exist to derive such a match: a link in the social bookmarking system can be seen as an indicator of relevance if it contains one of the query terms as a tag, all query terms as tags, or tags that are similar to the query terms. The experiments presented here consider a social bookmarking link as relevant when it contains all query terms as tags. Table 6.13 depicts an example of such a match. The query submitted to the search engine is “Social Web”. The first column of marks shows which URLs in the ranking were clicked by MSN users; the second shows which URLs in the ranking also appear in the folksonomy tagged with “Social Web”.

Preference Strategies

By mapping click- or tagging data to query rankings, we know which URLs of a ranking have been clicked on or tagged with the same tag as the query in the social bookmarking system. One can then extract preference pairs from the different lists lq in the form of tuples {(q, ri, rj) | ri ≻ rj}, which means that document ri is more relevant than document rj for the query q. Different strategies can be defined to extract the relevance pairs, considering the order of resources, the number of times a resource was clicked on, or the co-occurrence of resources. In the following paragraphs, strategy names with the suffix tag refer to a strategy based on tagging data, names without the suffix to a strategy based on clickdata.

Binary relevance (binary, binarytag): A preference pair (q, ri, rj) is extracted if an arbitrary user clicked on or tagged a resource ri while no user clicked on or tagged the resource rj. The click pairs in Table 6.13 would then be {(r1, r2), (r1, r4), (r3, r2), (r3, r4)}.

Heuristic rules: In [Joachims, 2002] different heuristic rules were proposed to infer preference statements from clickdata (see Section 3.2.2).

• Skip above pair extraction (sa, safull, satag, satagfull): For the ordered ranking list and a URL at position i which was either tagged or clicked on, all unclicked/untagged results ranked above i are predicted to be less relevant than the result at position i. URLs after i are either ignored (sa, satag) or also considered non-relevant (safull, satagfull). Example pairs in Table 6.13 for satag are: {(r4, r2), (r4, r3)}; for satagfull: {(r4, r2), (r4, r3), (r1, r2), (r1, r3)}.

• Skip above reverse pair extraction (sar, sarfull, sartag, sartagfull): Search engine users normally scan a result list from top to bottom. After clicking on one document, their information need is either satisfied or they go back to the list and continue scanning. Resources clicked at later positions have therefore been seen after the clicks at earlier positions, and can be considered more likely to be relevant than the previously clicked documents. Example pairs in Table 6.13 for sar are: {(r3, r1)}; for sarfull: {(r3, r2), (r3, r1)}. Example pairs for sartag are: {(r4, r1)}; for sartagfull: {(r4, r1), (r4, r2), (r4, r3)}.

Popularity information (popularity, poptag): When aggregating click-through or tagging data over different users, one can get information about the popularity of a resource. The more often a resource was clicked on or tagged, the more relevant it might be for the tags/queries in question. Hence, we can extract a preference pair if ri was more often clicked on or tagged than rj, by counting clicks over all sessions of the query log and all posts of the folksonomy. This strategy does not consider multiple clicks for the same query within one session, as they are often caused by spammers. In folksonomies, this is not a problem, as users can tag a resource only once.

FolkRank (folkrank): In Hotho et al. [2006c], the well-known PageRank algorithm was adapted to folksonomies (see Section 2.5.1). Resources which appear in the folksonomy ranking as well as in the search engine’s ranking list are preferred over those which were not ranked highly by the FolkRank algorithm.

Based on the described strategies, pairwise preferences can be extracted and ordered into (partial) ranking lists. For example, if three URLs of a ranked list have been clicked on, those URLs would be ordered according to their popularity by means of the popularity strategy, while all other URLs in the ranked list would be set to non-relevant (e. g., receive a rank score of 0).
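Two of the extraction strategies (binary relevance and skip above) can be sketched as follows; the function names are our own, and the sketch reproduces the example pairs given for Table 6.13:

```python
def binary_pairs(ranking, marked):
    """Binary relevance: every marked (clicked or tagged) URL is
    preferred over every unmarked URL in the ranking."""
    return [(ri, rj) for ri in ranking if ri in marked
                     for rj in ranking if rj not in marked]

def skip_above_pairs(ranking, marked, full=False):
    """Skip above: a marked URL at position i is preferred over all
    unmarked URLs ranked above it; with full=True, unmarked URLs
    ranked below i are also treated as non-relevant."""
    pairs = []
    for i, ri in enumerate(ranking):
        if ri not in marked:
            continue
        for j in range(i):                      # unmarked results above i
            if ranking[j] not in marked:
                pairs.append((ri, ranking[j]))
        if full:                                # unmarked results below i
            for j in range(i + 1, len(ranking)):
                if ranking[j] not in marked:
                    pairs.append((ri, ranking[j]))
    return pairs
```

With the ranking (r1, r2, r3, r4) from Table 6.13, clicks on {r1, r3} yield the binary pairs {(r1, r2), (r1, r4), (r3, r2), (r3, r4)}, and tags on {r1, r4} yield {(r4, r2), (r4, r3)} for satag and additionally {(r1, r2), (r1, r3)} for satagfull.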

6.5.3 Experimental Setup

This section describes the experiments’ datasets and the general experimental setting.

Setting

Since no ground truth for the rankings exists, we compare the performance of the different strategies against each other. The basic idea is to learn a ranking model using training examples with preference scores generated from one of the strategies (see Section 6.5.2) and to predict new rankings with this model. The predicted rankings can then be compared to preference scores derived from other strategies. If click- and tagging-based strategies generate similar results, they can both be considered valuable sources of implicit feedback. We construct training examples using the information (document title, summary, URL) given by the MSN ranking dataset (see Section 6.2.3). Features similar to those proposed in the LETOR 4.0 benchmark dataset10 are computed to represent a query-document pair. The features include the term frequency, the length, tf-idf values, BM25 and different language model scores of the document fields (body, anchor, title, URL or entire document). As the MSN dataset offers only summary snippets as descriptions of a website, we use those snippets as body text and skip all features based on anchor text.

10http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0


Overall, each query-document pair is represented by 46 features (see Appendix A.1 for a full list). The relevance scores of each query-document pair are inferred from one of the strategies proposed in Section 6.5.2. For example, the binarytag strategy would mark all query-document pairs as relevant which have been tagged with the appropriate query terms. In Dou et al. [2008], the click entropy was proposed to classify queries. Queries with lower click entropies are navigational queries (for example “yahoo” or “facebook”). As the result lists of those queries are normally less diverse and easier to predict, we filter out queries with high click entropies. The measure itself is defined as

ClickEntropy(q) = − ∑_{d ∈ D(q)} P(d|q) · log2 P(d|q)  (6.4)

where ClickEntropy(q) is the click entropy of query q. In our setting, we have either clicks or posts including the tagged resources and the query as their tags. D(q) is then the collection of documents clicked on or tagged for query q. P(d|q) is the percentage of clicks for document d among all clicks for q, or the percentage of posts with document d among all posts with q as tags. In our experiments, we consider queries with a click or tag entropy lower than 0.5. Furthermore, query rankings with fewer than five clicks or five posts are not considered. We use ranking SVM [Joachims, 2002] to learn the different models (see Section 3.2.3).
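Equation (6.4) is straightforward to compute from aggregated click (or post) counts per document; a minimal sketch, with our own function name:

```python
from math import log2

def click_entropy(click_counts):
    """Click entropy of a query (Eq. 6.4). `click_counts` maps each
    clicked (or tagged) document to its number of clicks (or posts);
    P(d|q) is the document's share of the total count."""
    total = sum(click_counts.values())
    return -sum((c / total) * log2(c / total)
                for c in click_counts.values() if c > 0)
```

A purely navigational query, with all clicks on a single document, has entropy 0, while clicks split evenly over two documents give entropy 1; the 0.5 threshold above keeps only queries close to the navigational end.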

Comparison of different feedback methods

We first compare the similarity of ranking lists derived from the different strategies in Section 6.5.2 by means of a correlation analysis. In a second step, the performance of models derived from different preference scores is compared in Section 6.5.3.

Correlations

The similarity of ranking lists derived from the different strategies can be compared by analyzing their correlations. As a correlation measure, the Kendall tau-b (τb) is used, which measures the degree of correlation between rankings by considering the number of concordant and discordant pairs. In contrast to Kendall tau, the measure additionally considers ties [Dou et al., 2008]. A Kendall tau-b of 1 indicates a perfect correlation, while a value of -1 reveals an inverse ranking. Table 6.14 shows the correlations of each ranking list with all other ranking lists. Each ranking strategy is perfectly correlated with itself. Reverse strategies such as sa and sar are perfectly inversely correlated. The strategies sarfull and safull correlate almost perfectly with the original ranking (rank). This is due to the fact that the two strategies also consider pairs derived from non-clicked documents and are therefore very similar to it. The feedback generated from satag correlates strongly with rank and sa, as no reordering takes place. The correlation of the folkrank and poptag strategies with feedback generated from clickdata (for example, sa, sar) is mostly positive but low. The popularity strategy yields the highest correlation (0.20 / 0.279). Similar feedback ranking lists seem to be generated from the folkrank and poptag strategies (0.88). Overall, one can find a positive correlation between feedback lists generated from click and tagging data. However, the correlation is not as high as between feedback lists generated from strategies using the same feedback data type.
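Kendall tau-b can be sketched in a few lines by explicitly counting concordant, discordant and tied pairs. This is a naive O(n²) version for illustration; a statistics library or an O(n log n) implementation would be used in practice:

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall tau-b between two score lists over the same items:
    (C - D) / sqrt((C + D + T_x) * (C + D + T_y)), where T_x and T_y
    count pairs tied only in x or only in y."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                 # tied in both lists: ignored
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

Identical rankings yield 1.0 and exactly reversed rankings yield -1.0, matching the interpretation given above.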

100 6.5 EXPLOITING USER FEEDBACK

Table 6.14: Correlation of each ranking list with all other ranking lists derived from the different strategies presented in Section 6.5.2. Each ranking strategy is per- fectly correlated with itself. Reverse strategies such as sa and sar are perfectly inversely correlated. There is a positive correlation between feedback lists generated of click and tagging data, but the correlation is low.

              rank  safull  sa     sarfull  sar     popularity  safulltag  satag   sarfulltag  sartag  popularitytag  folkrank
rank          1.0   0.984   1.0    0.982    -1.0    0.263       0.971      1.0     0.965       -1.0    0.150          0.130
safull              1.0     1.0    0.998    -1.0    0.263       0.952      0.886   0.947       -0.886  0.182          0.140
sa                          1.0    -1.0     -1.0    0.263       0.803      1.0     0.570       -1.0    0.226          0.158
sarfull                            1.0      1.0     -0.263      0.949      0.794   0.944       -0.794  0.171          0.136
sar                                         1.0     -0.263      -0.803     -1.0    -0.570      1.0     -0.226         -0.158
popularity                                          1.0         0.285      0.483   0.191       -0.483  0.279          0.203
safulltag                                                       1.0        1.0     0.994       -1.0    0.150          0.350
satag                                                                      1.0     -1.0        -1.0    0.150          0.134
sarfulltag                                                                         1.0         1.0     -0.150         0.301
sartag                                                                                         1.0     -0.150         -0.138
popularitytag                                                                                          1.0            0.880
folkrank                                                                                                              1.0

Table 6.15: Prediction errors made by models derived from training data using different strategies (rows) and tested on test sets of a specific strategy and a corresponding random test set. The error of models tested on those random test examples is close to 0.5. The error of models tested on examples with preference scores derived from the corresponding strategy is smaller.

training / test  binary  rand_binary  binarytag  rand_binarytag  folkrank  rand_folkrank
binary           0.26    0.48         0.37       0.49            0.33      0.49
binarytag        0.31    0.51         0.33       0.50            0.31      0.50
folkrank         0.26    0.50         0.33       0.51            0.34      0.51
popularity       0.26    0.50         0.37       0.50            0.33      0.51
popularitytag    0.31    0.48         0.33       0.50            0.31      0.49
rank             0.29    0.51         0.41       0.50            0.36      0.51
sa               0.26    0.49         0.37       0.49            0.33      0.49
safull           0.28    0.51         0.40       0.51            0.35      0.52
safulltag        0.30    0.48         0.36       0.49            0.32      0.49

Error

As a performance measure we consider the training error, which is defined as the number of misclassified pairs divided by the total number of pairs. Table 6.15 depicts the errors made by models derived from training data using different strategies (rows) and tested on test sets of a specific strategy and a corresponding random test set. The random test sets contain the same number of preference pairs as the test set from a specific strategy, but preferences are uniformly sampled from all possible preference pairs. The error of models tested on those random test examples is close to 0.5, which is the probability that a randomly selected pair is swapped. The error of models tested on examples with preference scores derived from the corresponding strategy is smaller. For example, the model generated from the folkrank strategy (3rd row) has an error of 0.26 when tested on test preference pairs generated from the binary strategy, while the random test set (rand_binary) reveals an error of 0.5. The error difference between non-random and random models demonstrates that non-random models approximately predict document orders according to the relevance of the documents. Table 6.16 shows the prediction errors obtained over all different test sets (without random


Table 6.16: Prediction errors obtained over all different test sets (without random coun- terparts). Again, the rows indicate the strategy used for generating preference scores for the training examples, while the columns represent the strategies for the preference scores used for testing. The best-performing models for each column are marked in bold.

               binary  binarytag  folkrank  popularity  poptag  rank  sa    safull  safulltag
binary         0.26    0.37       0.33      0.27        0.37    0.43  0.27  0.42    0.45
binarytag      0.31    0.33       0.31      0.31        0.33    0.45  0.32  0.44    0.45
folkrank       0.26    0.33       0.34      0.27        0.34    0.42  0.27  0.41    0.43
popularity     0.26    0.37       0.33      0.27        0.37    0.43  0.27  0.42    0.45
popularitytag  0.31    0.33       0.31      0.31        0.34    0.45  0.31  0.44    0.45
rank           0.29    0.41       0.36      0.29        0.41    0.42  0.29  0.41    0.45
sa             0.26    0.37       0.33      0.27        0.38    0.43  0.27  0.42    0.45
safull         0.28    0.40       0.35      0.28        0.40    0.42  0.28  0.41    0.45
safulltag      0.30    0.36       0.32      0.30        0.36    0.43  0.30  0.42    0.43

counterparts). Again, the rows indicate the strategy used for generating preference scores for the training examples, while the columns represent the strategies for the preference scores used for testing. The best-performing models for each column are marked in bold. Although the model learned from a strategy often performs best when tested against the rankings inferred from the specific strategy (for example binary or binarytag), this is not always the case (for example poptag). The rank strategy, derived from the original MSN ranking, performs worse than the other strategies in the majority of cases. Comparing strategies derived from the clickdata and those derived from the tagging dataset, one can find that clickdata-derived strategies perform better on the click data and tagging-data-derived strategies on tagging data. Only models derived from the folkrank strategy perform well on both kinds of data. As the folkrank strategy does not match query terms to tags, but rather computes a ranking using the entire folksonomy, the results can be seen as an indicator to test more elaborate matching approaches than matching folksonomy resources to rankings only when the query terms appear as tags. However, further experiments are required to better understand the results and to reduce noise.
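The pairwise error used throughout Tables 6.15 and 6.16 can be sketched as follows (hypothetical document ids and model scores; the helper name is ours):

```python
def pairwise_error(scores, preference_pairs):
    """Fraction of preference pairs (a, b), meaning 'a should rank above b',
    that the model's scores put in the wrong order.
    scores: dict mapping document id -> predicted ranking score."""
    wrong = sum(1 for a, b in preference_pairs if scores[a] <= scores[b])
    return wrong / len(preference_pairs)

model_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.7}
# Preferences derived from some feedback strategy: d1 > d2, d1 > d3, d2 > d3.
pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]
print(pairwise_error(model_scores, pairs))  # 1/3: only d2 > d3 is violated
```

On uniformly sampled random pairs this measure approaches 0.5, which matches the behaviour of the rand_* test sets above.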

6.5.4 Discussion

The last section presented a comparison of implicit feedback strategies for a learning-to-rank scenario. Analogously to previous works proposing strategies for extracting preference scores from clickdata, we proposed different methods to infer feedback from tagging systems. By learning models with training examples from one of the strategies and predicting the outcome of other strategies, we could analyse similarities and differences between click and tagging data. While the folkrank strategy predicts feedback from both types of data reasonably well, the other strategies perform better when predicting feedback of examples generated from their corresponding dataset. In future work it would be interesting to develop more sophisticated strategies by considering the time of a post and the activity and specific interests of users, as well as by matching not only posts containing the same tags as query terms, but also similar tags. As our evaluation method only compares strategies but cannot show a preference for one of them, we need to create a ground truth dataset by manually labeling examples. Furthermore, transfer learning methods can be an interesting future research direction, as the transfer of a model

learned on a specific dataset to another dataset allows the evaluation of the strategies on existing datasets (such as the LeToR datasets).

6.6 Summary

This chapter presented experiments and analyses answering the thesis research question about how information retrieval in folksonomies compares to traditional search engines. Four different aspects were considered:

• Usage behaviour and system content: How do tags differ from query terms? Do users tag and search in the same way? Do users click on the same resources as they tag?

• Structure & Semantics: Is there a folksonomy like structure inherent in query log files? Can we detect similar semantic connections in logsonomies?

• Integration: How can folksonomies be of use for traditional search?

Concerning user behaviour, it could be shown that both tagging and query systems exhibit a long tail of infrequent items which reduces the overlap between both systems. Considering only frequent items shows a higher overlap. Major differences arise from the different usage of tags and search terms: e. g., the composition of multi-word lexemes into single terms in Delicious and the use of (parts of) URLs as query terms in MSN (see Section 6.3.1). It could be shown, however, that queries and tags are correlated over time (see Section 6.3.1), suggesting that ongoing, relevant events and topics are reflected in both systems. As overlaps between the search and folksonomy system were rather high when compared to the size of the Internet and the limited dataset we used, it seems likely that users of social bookmarking systems tag web pages ranked highly by traditional search engines. Ranking comparisons resulted in the observation that folksonomy rankings based on the FolkRank algorithm correlated best with Google rankings, especially for specific IT topics. This also indicates that taggers prefer to use search engines (most of all Google) to find information, i. e., the prominent resources in both systems overlap. In order to compare the structure and semantics of folksonomies and search engines, we transformed the click data file of a search engine (MSN) into a folksonomy-like structure, a logsonomy. Using shortest path lengths and clustering coefficients to compare small-world properties, it could be demonstrated that logsonomies do present a folksonomy-like structure. Differences consist in the notion of user: while folksonomies store bookmarks from registered users, logsonomies track interests in the form of session IDs, which are not as coherent (see Section 6.4.2). In terms of emergent semantics as found in folksonomy systems [Cattuto et al., 2008], logsonomies show slightly different characteristics.
As tags in a logsonomy consist of split query terms, co-occurrence measures reconstructed compound expressions. While tag context measures show similar results (the detection of synonyms and siblings), the resource context relatedness measures return less precise semantically related terms, which can be explained by the different process of search: users do not know in advance whether the retrieved page they click on really reflects their search need. One possibility of integrating the user knowledge inherent in folksonomies into search is to use folksonomy data to derive implicit feedback and use this to improve rankings. A

comparison of different strategies to infer implicit feedback for a learning-to-rank scenario has shown that strategies tend to perform better when the same data (either tagging or click data) is used to generate feedback and to predict feedback. The best results when mixing tagging and click data for learning and evaluation are obtained from the strategy based on the FolkRank algorithm (see Section 6.5). The analysis of this chapter contributes to an understanding of differences and similarities in the usage and structure of folksonomy and search engine systems. The observed similarities in click and tagging behaviour suggest that a combination of both systems could be used to enhance a user's search experience. Search engine companies have started to follow the trend of integrating users into the search process. The search engine Google, for example, released its own social platform where users can register and connect to friends and other associates. The ranking results of a specific search in this system also include content published or liked by a user's friends [Heymans, 2009]. Social bookmarking systems, on the other hand, profit from the technologies and methods of search algorithms (for example the FolkRank algorithm).

Chapter 7

Spam Detection in Social Tagging Systems

7.1 Introduction

This chapter deals with the research question of how best to detect and eliminate spam in social bookmarking systems (see Section 1.2.2). The success of social bookmarking systems and other tools of the Social Web depends on powerful spam fighting mechanisms. Without them, the systems would be invaded by malicious posts and lose their legitimate users. In social bookmarking systems, manual spam fighting approaches such as the provision of captchas do not prevent human spammers from posting. Therefore, system providers need spam fighting methods which automatically identify malicious posts in the system. As discussed in Section 4.2, the problem can be considered as a binary classification task. Based on different features which describe users and their posts, a model is built from training data to classify unknown examples (on a post or user level) either as "spam" or "non-spam" (ham). As we consider "social" systems in which legitimate users mainly interact with each other because they can benefit from other users' content and present themselves (or their interests and knowledge), the exclusion of a non-spammer from publishing is a severe error which might prevent the user from further participation. Therefore, similar to other spam detection settings, the problem of generating too many false positive errors needs to be carefully considered when implementing spam detection algorithms. The adaptation of existing classification algorithms to the task of detecting spam in social bookmarking systems consists of two major steps. The first one is the selection of features which describe the users as accurately as possible. The second step is the selection of an appropriate classifier. In this chapter we will discuss both steps. First, we will present different features which can be used for spam classification (Section 7.3.1). These features are then evaluated with well-known classifiers (SVM, Naive Bayes, J48 and logistic regression).
A deeper understanding of the features describing spam and non-spam groups is provided by an analysis of local patterns. Which feature values and their combinations describe typical spam patterns? The analysis of local patterns in the BibSonomy spam dataset is a collaborative work with the Data Mining and Information Retrieval Group of the University of Würzburg. Algorithm adaptation and tuning will be briefly discussed by presenting the results of the ECML/PKDD discovery challenge 2008. In this event, a dataset of the social bookmarking

system BibSonomy was published and used by the challenge's participants to design and test spam classification algorithms. Several approaches concentrating on algorithm tuning as well as feature engineering competed to be the challenge's winner. This chapter is organized as follows. In Section 7.2 the datasets used are introduced. Section 7.3 presents and evaluates possible features for spam detection in social bookmarking systems. Section 7.4.4 describes different spam classification approaches developed in the scope of the ECML/PKDD discovery challenge. Finally, spam features are considered in the light of local patterns. Section 7.6 summarizes the results.

7.2 Datasets

The datasets used for the spam experiments of this chapter have been created in the course of the years 2007 and 2008 from the social bookmarking system BibSonomy (see Sec- tion 2.6 for a description of the system). The process of how spammers were identified is briefly described in the next section. The resulting datasets are presented in Section 7.2.2.

7.2.1 Dataset Creation

In order to prevent spammers from publishing, the system administrators created a simple interface which allowed authorized users (mainly the system administrators and some researchers) to flag users as spammers. The interface will be presented in more detail in Section 9.4. The flagging of spammers by different evaluators is a very subjective process. There were no official guidelines, just common sense as to what distinguishes legitimate users from spammers based on the content of their posts. To narrow down the set of potential spammers, the evaluators normally looked at a user's profile (e. g., name, e-mail address) and the composition of posts (e. g., the semantics of tags, the number of tags) before assessing the content of the bookmarked websites. Borderline cases were handled from a practical point of view. BibSonomy intends to attract users from research, library and scholarly institutions. Therefore, entries referring to commercial advertisements, Google Ad clusters, or the introduction of specific companies are considered as spam. The marked spammers are shown on the administration interface and can be unflagged by all authorized users. Evaluators, however, rarely cross-checked the evaluations. A certain amount of noise in the classifications is therefore probable. If users are flagged as spammers, their posts are no longer visible to the other users. As a consequence, general pages frequently visited by BibSonomy users, such as the home page or the page showing popular entries, are cleaned of malicious posts. Spammers, however, can still see and manage their own posts on their own user page, but are not able to use the API BibSonomy offers. New users having the same IP address as a spammer cannot register.

7.2.2 Dataset Descriptions

Two datasets have been used for investigating spam in social bookmarking systems. The first one, created in 2007, is the predecessor of the second one, the official ECML/PKDD discovery challenge dataset. Both datasets are presented in the following section.

106 7.2 DATASETS

Table 7.1: Sizes of users, tags, resources and TAS of the SpamData07. The training data contains all TAS until end of November 2007, the test data contains all TAS of December 2007.

         |U|     |T|      |R|      |Y|
Overall  20,092  306,993  920,176  8,717,510
Train    17,202  282,473  774,678  7,904,735
Test     2,890   49,644   153,512  804,682

Table 7.2: User ratios of spam and non-spammers in SpamData07 training and test dataset. The non-spam ratio in the test set is slightly smaller than the non-spam ratio in the training dataset.

                Overall           Train             Test
Overall Users   20,092 (100.0 %)  17,202 (100.0 %)  2,890 (100.0 %)
Spam Users      18,681 (92.98 %)  15,891 (92.38 %)  2,790 (96.54 %)
Non-Spam Users  1,411 (7.02 %)    1,311 (7.62 %)    100 (3.46 %)

Spam Dataset 2007 (SpamData07)

The Spam Dataset 2007, in the following referred to as SpamData07, comprises users, tags, resources and the user profile information of all BibSonomy users until the end of 2007. The different sizes of users, tags and resources of spammers and non-spammers are shown in Table 7.1. As at that time nearly all spammers posted bookmarks, the dataset disregards the publication posts of users. Considering only bookmarks, the dataset consists of 1,411 legitimate users and 18,681 users who were flagged as spammers. The above data was used to generate a training and test set. Thereby, instances were split chronologically so that a prediction of spam for the next month/week/day could be evaluated. The training set comprises all instances until the end of November 2007, the test set all instances of the month December 2007 (see Table 7.2).

Spam Dataset 2008 (SpamData08)

The Spam Dataset 2008 (in the following named SpamData08) was created in the scope of the ECML/PKDD challenge 2008. The public dataset 1 has been used by the challenge participants and various other researchers to explore features and algorithms for detecting social bookmarking spam (see Section 7.4). Analogously to SpamData07, it contains information about users, tags and resources. Overall, seven files list users as well as spam and non-spam entries of TAS, BIBTEX and bookmarks (see Table 7.3). The files represent tables of the BibSonomy database. They have been created with the mysqldump program which dumps tables into simple text files which can be easily imported again into a MySQL database by using the LOAD DATA INFILE command.

1http://www.kde.cs.uni-kassel.de/ws/rsdc08/dataset.html

107 CHAPTER 7: SPAM DETECTIONIN SOCIAL TAGGING SYSTEMS

Table 7.3: The tables of the SpamData08 training dataset. Each table is stored as a MySQL dump in a file which has the same name as the table.

table          description                                           #rows
tas            Tag Assignments: fact table; who attached which tag   816,197
               to which resource (non-spam)
tas_spam       Tag Assignments: fact table; who attached which tag   13,258,759
               to which resource (spam)
bookmark       dimension table for bookmark data (non-spam)          176,147
bookmark_spam  dimension table for bookmark data (spam)              1,626,560
bibtex         dimension table for BIBTEX data (non-spam)            92,545
bibtex_spam    dimension table for BIBTEX data (spam)                245
user           mapping of non-spammer/spammer for each user          31,715

Table 7.4: The distribution of spam/non-spam posts and users among bookmark/BIBTEX posts in the SpamData08 training dataset. Note that for the users the sum of the bookmark and BIBTEX columns is not equal to the overall number of users, since users can have both types of resources.

                overall              bookmark             BIBTEX
#posts          1,895,497 (100.0 %)  1,802,707 (95.10 %)  92,790 (4.90 %)
#regular posts  268,692 (14.18 %)    176,147 (9.29 %)     92,545 (4.88 %)
#spam posts     1,626,805 (85.82 %)  1,626,560 (85.81 %)  245 (0.01 %)
#users          31,715 (100.0 %)     31,033 (97.85 %)     1,329 (4.19 %)
#regular users  2,467 (7.78 %)       1,811 (5.71 %)       1,211 (3.82 %)
#spam users     29,248 (92.22 %)     29,222 (92.14 %)     118 (0.37 %)

In contrast to the SpamData07 dataset, information which might help to identify a user in the dataset is hidden. Usernames have been replaced by numbers and all user profile information (such as e-mail addresses or full names) has been excluded. The dump for the training dataset includes all posts from BibSonomy up to and including March 31st 2008, but excludes 1,017,162 posts from the user dblp2 since this user is a representation of all publication metadata available from the DBLP computer science bibliography.3 Table 7.4 shows the number of posts in the dataset. By separating bookmark and BIBTEX posts and spam/non-spam posts, one can see that mainly the bookmark posts are affected by spam: the majority of posts (more than 85 %) are spam. The test data for the spam detection task consists of all posts which were stored between May 16th 2008 and June 30th 2008 (46 days). Both datasets, SpamData07 and SpamData08, exhibit a highly skewed class distribution, i. e., there are many more spammers than non-spammers. This needs to be taken into account when selecting an appropriate evaluation method (see Section 7.3.2).

2http://www.bibsonomy.org/user/dblp 3http://www.informatik.uni-trier.de/~ley/db/

108 7.3 FEATURE ENGINEERING

Table 7.5: The distribution of spam/non-spam posts and users among bookmark/BIBTEX posts in the SpamData08 test data.

                overall            bookmark            BIBTEX
#posts          207,012 (100.0 %)  141,173 (68.20 %)   65,839 (31.80 %)
#regular posts  67,191 (32.46 %)   1,399 (0.68 %)      65,792 (31.78 %)
#spam posts     139,821 (67.54 %)  139,774 (67.52 %)   47 (0.02 %)
#users          7,205 (100.0 %)    7,124 (98.88 %)     135 (1.87 %)
#regular users  171 (2.37 %)       102 (1.42 %)        99 (1.37 %)
#spam users     7,034 (97.63 %)    7,022 (97.46 %)     36 (0.50 %)

7.3 Feature Engineering

In this section we describe our experiments to identify and evaluate appropriate spam features for spam classification. The work was published in Krause et al. [2008c] and conducted on the dataset SpamData07.

7.3.1 Feature Description

An automatic classification of spammers requires features describing the user, so that legitimate users can be distinguished from malicious ones. In this section we describe the features we have chosen in detail. Overall we considered 25 features which can be classified into four different feature categories. Tables 7.6–7.9 summarize all features.

Profile features comprise all information in the user's profile.

Location based features refer to the location a user publishes bookmarks from, or which is given as the domain in his or her e-mail address.

Activity based features concern the interaction of the user with the system.

Semantic features consider characteristics hidden in the choice and usage of tags.

A user instance in the training or test set consists of a vector where each entry corresponds to a feature value. Each feature is normalized over the total set of users by subtracting the minimum value of this specific feature from a user's feature value and dividing by the difference between the maximum and the minimum value.
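This normalization step can be sketched as follows (a minimal min-max scaler over a hypothetical user-feature matrix; the helper name is ours):

```python
def min_max_normalize(rows):
    """Scale each feature column to [0, 1] over all users:
    (value - min) / (max - min), computed per feature."""
    columns = list(zip(*rows))
    lo = [min(c) for c in columns]
    hi = [max(c) for c in columns]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

# Three users, two features (e.g. namelen and tascount):
users = [[2, 100], [4, 300], [6, 200]]
print(min_max_normalize(users))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Constant columns (max equals min) are mapped to 0.0 here; how the thesis handles that degenerate case is not specified.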

Profile features

The profile features are extracted from a user's data, which is revealed when they request an account in BibSonomy. Table 7.6 shows the features corresponding to a user's profile. Most of the fields to fill in at registration are not obligatory; however, users need to indicate a name and a valid e-mail address. Spammers often differ from normal users in that they use names or e-mail addresses with many numbers. For instance, typical spam names could be "styris888" or "painrelief2". Figure 7.1 shows the histogram of the spam/non-spam distribution over the number of digits in the username and the e-mail address (namedigit, maildigit). As can be seen, besides the peak at 0 digits, spammers show a further peak at the two-digit category. The namelen,


Table 7.6: Description of the profile features

Feature name   Description
namedigit      name contains digits
namelen        length of name
maildigit      e-mail address contains digits
maillen        length of e-mail address
realnamelen    length of realname
realnamedigit  realname contains digits
realname2      two realnames
realname3      three realnames

Figure 7.1: Histogram of the number of digits in the username and e-mail address of spam vs. non-spam users. The x-axis shows the number of digits in either the username or the e-mail address of spam and non-spammers. The y-axis shows the fraction of spam vs. non-spam users containing the specific number of digits. Most users have no digits. Spammers, in general, use more digits than non-spammers.

maillen and realnamelen features refer to the length of the usernames, e-mail addresses and realnames. The realname2 and realname3 features are binary and set to one if the user has indicated two or three names. The features were derived from the observation that legitimate users often register with their full names.
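The profile features of Table 7.6 can be sketched as follows (the exact definitions — e.g. digit counts vs. binary indicators — are our reading of the feature descriptions, not code from the thesis; the example users are hypothetical):

```python
def profile_features(username, email, realname):
    """Sketch of the profile features of Table 7.6 for one user."""
    names = realname.split()
    return {
        "namedigit": sum(ch.isdigit() for ch in username),
        "namelen": len(username),
        "maildigit": sum(ch.isdigit() for ch in email),
        "maillen": len(email),
        "realnamelen": len(realname),
        "realnamedigit": sum(ch.isdigit() for ch in realname),
        "realname2": int(len(names) == 2),  # user indicated two names
        "realname3": int(len(names) == 3),  # user indicated three names
    }

print(profile_features("styris888", "styris888@example.com", ""))
print(profile_features("jdoe", "jdoe@uni.example", "John Doe"))
```

A typical spam account like "styris888" scores high on namedigit and maildigit, while a legitimate user registering with a full name triggers realname2.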

Location based features

Location based features describe the user's location and domain. Table 7.7 summarizes them.

Table 7.7: Description of the location based features

Feature name  Description
domaincount   number of users in the same domain
tldcount      number of users in the same top level domain
spamip        number of spam users with this IP

Often, the same spammer uses several accounts to publish the same content. These accounts show the same IP address when they are registered. Thus, if one user with a specific IP is already marked as a spammer, the probability that other users with the same IP are also spammers is higher (spamip). When considering the users in the training dataset, 6,637 of them have at least one IP address in common with a spammer. Out of these, 6,614 users are marked as spammers. The same phenomenon holds for users of specific domains (domaincount, tldcount). The probability that a user who is from a rare domain which hosts many spammers is also a spammer is higher than average (and vice versa). For instance, 16 users have registered with the domain "spambob.com" and 137 with the domain "rhinowebmail", all of which were classified as spammers.

Activity based features

Activity based properties (Table 7.8) consider different kinds of user interactions with the social bookmarking system. While normal users tend to interact with the system directly after their registration (e. g., by posting a bookmark), spam users often wait a certain time before they submit their first post. This time lag can be considered when characterizing spam (datediff). Furthermore, the number of tags per post varies (tasperpost). Spammers often add many different tags to a resource, be it to show up more often in searches or to confuse spam detection mechanisms by including "good" tags. Considering the BibSonomy dataset, spammers add on average eight tags to a post while non-spammers add four. The average number of TAS (see Definition 2.2.1) is 470 for spammers and 334 for legitimate users (tascount).

Table 7.8: Description of the activity based features

Feature name  Description
datediff      difference between registration and first post
tasperpost    number of tags per post
tascount      number of total tags added to all posts of this account
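The activity features above can be sketched as follows (hypothetical timestamps and posts; simple `datetime` arithmetic stands in for the actual database queries):

```python
from datetime import datetime

def activity_features(registered_at, posts):
    """Sketch of the activity features of Table 7.8 for one user.
    posts: list of (timestamp, tags) pairs, one per post."""
    tas_count = sum(len(tags) for _ts, tags in posts)
    first_post = min(ts for ts, _tags in posts)
    return {
        "datediff": (first_post - registered_at).days,
        "tasperpost": tas_count / len(posts),
        "tascount": tas_count,
    }

feats = activity_features(
    datetime(2007, 11, 1),
    [(datetime(2007, 11, 20), ["cheap", "pills", "buy"]),
     (datetime(2007, 11, 21), ["casino"])])
print(feats)  # {'datediff': 19, 'tasperpost': 2.0, 'tascount': 4}
```

A long datediff combined with a high tasperpost is the pattern the text attributes to spammers.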

111 CHAPTER 7: SPAM DETECTIONIN SOCIAL TAGGING SYSTEMS

Table 7.9: Description of the semantic features

Feature name       Description
co(no)spamr        user co-occurrences (related to resources) with (non-)spammers
co(no)spamt        user co-occurrences (related to tags) with (non-)spammers
co(no)spamtr       user co-occurrences (related to tag-resource pairs) with (non-)spammers
spamratio(r/t/rt)  ratios of spam/non-spam co-occurrences
grouptag           number of times 'group=public' was used
spamtag            ratio of spam tags to all tags of a user

Semantic features

Semantic features (Table 7.9) relate to the usage and content of the tags which serve as annotations for a bookmark. There are several "simple" properties which were found when manually cleaning the system from spam. For instance, 1,916 users added "$group=public" as a tag or part of a tag for a resource. Out of these, 1,914 users are spammers (grouptag). This specific tag is used by a software tool to generate spam in social bookmarking systems. We also have a blacklist of tags which contains keywords that are very likely to describe a spam post. For instance, "pornostars", "jewelry" or "gifts" are contained in this list. One feature calculates the ratio of such spam tags to all tags published by a specific user (spamtag). Another set of features is based on co-occurrence information considering the sharing of tags and resources between users. Such information can be extracted by building different weighted undirected graphs whose set of vertices is the set of users U, where two users u1 and u2 are connected by an edge if 1. they share at least one tag-resource combination, i. e., there are at least two tag assignments (u, t, r) ∈ Y with u1, u2 ∈ Utr (cospamtr / conospamtr). The edge weights are then computed as follows:

w(u1, u2) := |{(r, t) ∈ R × T | u1, u2 ∈ Utr}|. (7.1)

2. they share at least one tag, i. e., there are at least two tag assignments (u, t, r) ∈ Y with u1, u2 ∈ Ut (cospamt / conospamt). The edge weights are then computed as follows:

w(u1, u2) := |{t ∈ T | u1, u2 ∈ Ut}|. (7.2)

3. they share at least one resource, i. e., there are at least two tag assignments (u, t, r) ∈ Y with u1, u2 ∈ Ur (cospamr / conospamr). The edge weights are then computed as follows:

w(u1, u2) := |{r ∈ R | u1, u2 ∈ Ur}|. (7.3)

For our feature calculation, we considered each graph (the resource, tag and tag-resource co-occurrence graphs) twice: in the spam case, a link between u1 and u2 is only set if u2 is a spammer; in the non-spam case, a link is set if u2 has been marked as a non-spammer.


The final measure, called co(no)spam(r/t/tr), for a user ui is then computed as the sum of the edge weights over all n users uj connected to user ui. For instance, cospamr(ui) is defined as:

cospamr(ui) = Σ_{j=1}^{n} w(ui, uj)    (7.4)

The assumption is that spammers show high values in the spammer co-occurrence graphs, as they use the same vocabulary and resources other spammers submit; non-spammers show higher values in the non-spammer case. We also computed the ratio of each spam and non-spam pair (spamratiot, spamratior, spamratiotr).

spamratior(ui) = cospamr(ui) / (cospamr(ui) + conospamr(ui))
spamratiot(ui) = cospamt(ui) / (cospamt(ui) + conospamt(ui))    (7.5)
spamratiotr(ui) = cospamtr(ui) / (cospamtr(ui) + conospamtr(ui))
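The resource-based co-occurrence features (Equations 7.3–7.5) can be sketched as follows (toy tag assignments and spam flags; the tag and tag-resource variants follow the same pattern over shared tags or shared tag-resource pairs):

```python
from collections import defaultdict
from itertools import combinations

def cospam_conospam_r(tas, spammers):
    """For each user, sum the resource co-occurrence edge weights (Eq. 7.3/7.4)
    towards flagged spammers (cospamr) and towards non-spammers (conospamr).
    tas: iterable of (user, tag, resource) assignments; spammers: set of users."""
    users_per_resource = defaultdict(set)
    for user, _tag, resource in tas:
        users_per_resource[resource].add(user)
    cospam, conospam = defaultdict(int), defaultdict(int)
    for users in users_per_resource.values():
        for u1, u2 in combinations(users, 2):
            # each shared resource adds 1 to the edge weight in both directions
            for a, b in ((u1, u2), (u2, u1)):
                (cospam if b in spammers else conospam)[a] += 1
    return dict(cospam), dict(conospam)

tas = [("alice", "web", "r1"), ("eve", "pills", "r1"), ("mallory", "pills", "r1")]
cospam, conospam = cospam_conospam_r(tas, {"eve", "mallory"})
# spamratior for alice (Eq. 7.5):
ratio = cospam["alice"] / (cospam["alice"] + conospam.get("alice", 0))
print(cospam, ratio)  # alice shares r1 with two spammers -> cospamr = 2, ratio 1.0
```

Here "alice" only co-occurs with flagged spammers, so her spamratior is 1.0; in the thesis' assumption such high ratios indicate likely spammers.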

7.3.2 Experimental Setup

For our evaluation, we consider the F-measure and the area under a ROC curve (AUC). Both are defined in Section 4.2.3. An advantage of using ROC curves for evaluation is that these curves are independent of the underlying class distribution. Therefore, the skewed class distribution in our dataset does not distort the evaluation. Another, more practical reason for considering the ROC curve is that we want to turn the obvious decisions over to the classifier while controlling the suggested classification of the borderline cases before finalizing the decision. The firm cases (the classifier's decision) are those at the beginning of the ROC curve. The steeper the curve starts, the fewer misclassifications occur. Once the curve becomes flatter, we have to control the outcome of the classifier. The simplest baseline we can consider is to always predict the majority class in the data, in our case "spammer". In our skewed dataset, this would yield a precision of 0.965 and an F-measure of 0.982 (for the spam class). However, all non-spammers would also be classified as spammers. A more substantial baseline is to consider the tags used to describe a resource as features and use a classifier that has been shown to deliver good results for text classification, such as Naive Bayes. Each user u can then be represented as a vector u where each dimension corresponds to a unique tag t. Each component of u is then assigned a weight. We consider two different settings. In the first case, the weight corresponds to the absolute frequency with which the tag t occurs for the user u. In the second case, each tag is assigned a tf-idf value.
The tf-idf value for a specific tag ti considering all posts Pu of user u is defined as

    tfidf(i, u) = (tf_{i,Pu} / max_j{tf_{j,Pu}}) · log(|P| / |Pi|)    (7.6)

where tf_{i,Pu} denotes the tag frequency of the tag ti in all tag assignments of the personomy Pu, max_j{tf_{j,Pu}} is the frequency of the most frequent tag tj in this personomy, |P| is the total number of posts, and |Pi| the number of posts which contain the tag ti.
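Eq. 7.6 can be sketched directly in code, assuming the personomy is given as a tag-frequency map. The parameter names (`tag_counts`, `n_posts`, `n_posts_with_tag`) are illustrative.

```python
import math

# Sketch of the tf-idf weight from Eq. 7.6 for a single user's personomy.
# tag_counts: tag -> frequency in the user's tag assignments (the personomy);
# n_posts: total number of posts |P|; n_posts_with_tag: |P_i|.

def tfidf(tag, tag_counts, n_posts, n_posts_with_tag):
    tf = tag_counts.get(tag, 0)
    max_tf = max(tag_counts.values())          # frequency of the most frequent tag
    return (tf / max_tf) * math.log(n_posts / n_posts_with_tag)
```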

CHAPTER 7: SPAM DETECTION IN SOCIAL TAGGING SYSTEMS

Table 7.10: Confusion matrix of the baseline with all tags as features (frequency). The ROC area value for the frequency baseline is 0.801 and the F-measure is 0.286.

                   classified Spam   classified Non-Spam
actual Spam              466                2324
actual Non-Spam            0                 100

Table 7.11: Confusion matrix of the baseline with all tags as features (tf-idf). The misclassification of spammers slightly improves, so that more spammers are identified. The ROC area value is 0.794, while the F-measure is 0.319.

                   classified Spam   classified Non-Spam
actual Spam              530                2260
actual Non-Spam            0                  99

Tables 7.10 and 7.11 show the TP, FP, FN and TN values for the absolute frequencies and the tf-idf values. When computing the baseline with the tf-idf measure, the misclassification of spammers slightly improves, so that more spammers are identified. The ROC area value for the frequency baseline is 0.801 and the F-measure is 0.286. For the tf-idf baseline, the ROC area value is 0.794, while the F-measure is 0.319. Figure 7.2 shows the ROC curve progression of the two baselines. The curves are similar at the beginning. The tf-idf baseline curve starts more steeply, but is later exceeded by the frequency baseline.
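The reported baseline numbers can be recomputed directly from the confusion-matrix counts. The following stdlib sketch does so for the majority-class baseline and the frequency baseline of Table 7.10; the helper name is ours, not from the thesis.

```python
# Recompute spam-class precision and F-measure from confusion-matrix counts.

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Majority baseline: every user predicted as spammer
# (2790 spammers, 100 non-spammers in the test set of Table 7.10).
p, r, f1 = precision_recall_f1(tp=2790, fp=100, fn=0)

# Frequency baseline (Table 7.10): TP=466, FP=0, FN=2324.
_, _, f1_freq = precision_recall_f1(tp=466, fp=0, fn=2324)
```

Rounded to three digits this reproduces the precision of 0.965 and F-measures of 0.982 and 0.286 stated above.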


Figure 7.2: ROC curves of the frequency and tf-idf tag features. The x-axis shows the FP Rate while the y-axis shows the TP Rate. The tf-idf-baseline curve shows a steeper progression, but is later exceeded by the frequency-baseline.


Table 7.12: Evaluation values for all features. The best classifier with an AUC of 0.936 is the SVM, followed by the logistic regression classifier. Even though the progression of the SVM's ROC shows that the false positive instances are the ones with less probability, 53 out of 100 non-spammers are ranked as spammers.

Classifier            ROC area   F1      FP   FN
Naive Bayes           0.906      0.876   14   603
SVM                   0.936      0.986   53   23
Logistic Regression   0.918      0.968   30   144
J48                   0.692      0.749   11   1112

7.3.3 Results

We selected different classification techniques to evaluate the features we introduced in the previous section. For the first three algorithms we used the Weka implementation [Hall et al., 2009], while for the SVM we used the LibSVM package [Chang and Lin, 2011]. Details about the algorithms used are presented in Section 4.2.2. We tested different scenarios: classification combining all features, classification of the feature groups, and classification with costs. The results will be described in the following paragraphs together with the evaluation outcomes.

Classification combining all features

Table 7.12 shows the ROC area, F1 measure, and the absolute false positive and false negative values for all algorithms based on all features. Figure 7.3 depicts the ROC curves for all classifiers.4 The best classifier, with an AUC of 0.936, is the SVM, followed by the logistic regression classifier. Even though the progression of the SVM's ROC curve shows that the false positive instances are the ones with lower probability, 53 out of 100 non-spammers are ranked as spammers. Section 7.3.3 therefore introduces costs for misclassifying non-spammers. The AUCs of the two baselines (0.801 and 0.794) are clearly lower.
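The AUC used throughout this evaluation can be computed directly from classifier scores via its rank-statistic interpretation: the probability that a randomly drawn spammer is scored above a randomly drawn non-spammer. A minimal sketch follows; the data and names are illustrative, not from the thesis experiments.

```python
# Minimal AUC computation (Mann-Whitney formulation).
# labels: 1 = spam, 0 = non-spam; scores: higher = more likely spam.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairs where a positive outranks a negative; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0, a random one about 0.5, matching the straight diagonal line mentioned in the caption of Figure 7.3.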

Feature groups

In order to find out about the contribution of the different features, we analysed each feature group separately. The semantic features were thereby split into two subgroups: co-occurrence features (and the ratios) and the spamtag/grouptag features. Figures 7.4(a)–7.5(b) present the ROC curves and evaluation values for the different feature groups. The best results are given by the co-occurrence features (co(no)spamr, co(no)spamt, co(no)spamrt, spamratior, spamratiot, spamratiotr). Table 7.13 shows, for each feature group, the evaluation values of the algorithm which optimizes the ROC area. Interestingly, there is no single algorithm which performs best on all feature groups.

4We only included one baseline (tf-idf) to reduce the number of curves.



Figure 7.3: ROC curves of the different classifiers considering all features. The x-axis shows the false positive rate and the y-axis the true positive rate. A random classifier would be a straight line from (0,0) to (1,1). The SVM classifier shows the steepest progression.

Table 7.13: Evaluation values of the feature groups computed by the algorithm which optimizes the ROC area. With respect to this measure, location features perform worst, while the best results are obtained from co-occurrence information.

Features                             ROC area   F1
Profile features (log. reg.)         0.77       0.982
Location features (SVM)              0.698      0.407
Activity features (SVM)              0.752      0.982
Semantic features (J48)              0.815      0.981
Co-occurrence features (log. reg.)   0.927      0.985

Overall, none of the feature groups reaches the classification performance obtained when combining the features. This shows that in our setting, no dominant type of spam indicator exists. A combination of different kinds of information is helpful. The co-occurrence features describing the usage of a similar vocabulary and resources are most promising.

Costs

The ROC curves inherently introduce costs, as they order instances according to classification probabilities. However, most classifiers do not use cost information when building their models. As seen above, the SVM for the combination of all features nearly perfectly separates 40% of spammers from non-spammers. However, over half of the non-spammers are classified as spammers in the final result. In order to penalize the wrong classification of non-spammers, we introduced cost-sensitive learning [Hall et al., 2009]. Before a model is learned on the training data, the data is reweighted to increase the sensitivity to non-spam cases (i. e., the data consists of more non-spam classified instances than before). We experimented with different cost options and found that a penalty ten times higher than the neutral value (one) delivered good results for the SVM. We also recalculated the other classifiers using cost options.
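The reweighting step described above can be sketched as follows: each non-spam instance is replicated according to a cost factor before training, so that misclassifying non-spammers becomes more expensive. This is a simplification of Weka's cost-sensitive meta-learning; the function and parameter names are illustrative.

```python
# Sketch of cost-sensitive learning by reweighting the training data:
# non-spam instances (label 0) are replicated `nonspam_cost` times before
# the classifier is trained, making it more sensitive to that class.

def reweight(instances, labels, nonspam_cost=10):
    out_x, out_y = [], []
    for x, y in zip(instances, labels):
        copies = nonspam_cost if y == 0 else 1
        out_x.extend([x] * copies)
        out_y.extend([y] * copies)
    return out_x, out_y
```

Any classifier trained on the reweighted data then implicitly optimizes the cost-adjusted error, which is how the reduced false positive rates of Table 7.14 come about.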


Figure 7.4: ROC curves of the different feature groups: (a) profile, (b) location and (c) activity features. In each figure, the steepest curve is the baseline. No specific classifier performs better than the others in all scenarios.

Table 7.14 shows an overview of the changed F1, false positive rate and AUC values of classification using all features. As can be seen, cost-sensitive learning on all features with logistic regression returns the best results among the different classifiers. Except for the J48 classifier, the F1 degrades,5 while the false positive rate changes for the better. This shows that introducing costs helps to reduce false positives at the expense of a worse performance in general.

7.3.4 Discussion

This section introduced a variety of features to fight spam in social bookmarking systems. The features were evaluated with well-known machine learning methods. Combining all features shows promising results, exceeding the AUC and F1 measure of the selected baseline. Considering the different feature groups, co-occurrence features show the best ROC curves. Our results support the claim of Heymann et al. [2007] that the problem can be solved with classical machine learning techniques — although not perfectly. The difference to web spam classification lies in the features applied: on the one hand, more information (e. g.,

5The same holds for the training dataset. For instance, the F1 measure for logistic regression is reduced from 0.991 to 0.041.


Figure 7.5: ROC curves of the two semantic feature groups: (a) grouptag and spamtag features, (b) co-occurrence features and ratios. The best results are yielded by the semantic features based on co-occurrence computations.

Table 7.14: Evaluation with a cost-sensitive classifier using the test dataset. The values are compared to the F1, FP-rate and AUC values of the non-cost-sensitive classifiers from Table 7.12. The cost-sensitive SVM, logistic regression and Naive Bayes classifiers show a reduced false positive rate, while their F1 measure deteriorates. The AUC value remains the same or increases slightly.

Classifier               SVM     J48     Logistic Regression   Naive Bayes
F1                       0.924   0.794   0.927                 0.855
F1 without costs         0.986   0.749   0.968                 0.876
FP-rate                  0.15    0.11    0.12                  0.11
FP-rate without costs    0.53    0.11    0.30                  0.14
ROC area                 0.936   0.835   0.932                 0.905
ROC area without costs   0.936   0.692   0.918                 0.906

e-mail, tags) is given; on the other hand, spammers reveal their identity by using a similar vocabulary and resources. This is why co-occurrence features tackle the problem very well. Several issues concerning our approach need to be discussed.

• A switch from the user level to the post level would be an interesting consideration. This would also facilitate the handling of borderline cases, as users could still use the system even though some of their posts were flagged as spam.

• A consideration of a multiclass classification, introducing classes in between "spam" and "non-spam", or a ranking of classified instances may also help to identify those borderline cases a moderator needs to classify manually today.

• A further issue concerns the evaluation method chosen. It would be interesting to consider more than one chronologically separated training/test set and to track the changes of the spam/non-spam ratio and the amount of user registrations over time.


• The large ratio between spam and non-spam users could be reduced by identifying spammers who have created several user accounts and are therefore currently counted several times.

Overall, our contribution represents a first step towards the elimination of spam in social bookmarking systems using machine learning approaches. The spam detection framework presented in Chapter 9 uses many of the results of this section to classify spam in BibSonomy. Various papers of the ECML/PKDD 2008 discovery challenge presented in the following section build on the ideas of feature creation, algorithms and evaluation methods we introduced here and present new insights in one or more of these areas.

7.4 ECML/PKDD Discovery Challenge 2008

The ECML/PKDD discovery challenge 2008 [Hotho et al., 2008] offered participants two different competitions: spam detection or tag recommendation in social bookmarking systems. Both tasks asked for algorithms which had to be trained and tested on a dataset of BibSonomy. The presentation of the results took place at the ECML/PKDD discovery challenge workshop, where the top teams were invited to present their approaches and results. The website with information about the dataset, the competition and results is still online.6 In this section, we will present those results. Though the SpamData08 dataset consists of slightly different data (longer time period, no publication of registration information) than the dataset used for the experiments conducted in Section 7.3.3 (SpamData07), the results can be seen as comparable, as spam behaviour did not change during the period between them.

7.4.1 Task Description

The challenge's goal was to classify users in BibSonomy as spammers or non-spammers. In order to eliminate malicious users as early as possible, the model should be able to accurately distinguish spammers after the submission of only a few posts. The dataset, called SpamData08, is described in Section 7.2.2. All participants could use the training dataset to build a model. The dataset contained flags identifying users as spammers or non-spammers. The test dataset with the same format could be downloaded two days before the end of the competition. Test users were those who registered and submitted posts in May 2008. All participants had to send a sorted file containing one line for each user, composed of the user number and a confidence value. The higher the confidence value, the higher the probability that the user was a spammer. The highest confidence had to be listed first. An example of a result file is depicted in Table 7.15. If no prediction was assigned to a user, it was assumed that the user was not a spammer. The evaluation criterion was the AUC (the Area under the ROC Curve) value (see Section 4.2.3). The submitted test user predictions of the participants were compared to the manual classifications of the BibSonomy administrators.

6http://www.kde.cs.uni-kassel.de/ws/rsdc08/


Table 7.15: Example of a result file for the ECML/PKDD spam challenge. Users are ordered according to their confidence values. The higher the confidence of the classifier, the more likely an instance can be categorized as spam and therefore the higher its position in the list.

user   spam
1234   1
1235   0.987
1236   0.765
1239   0
...    ...
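As a sketch, a result file in this format could be produced as follows: predictions are sorted by confidence in descending order and written one user per line. The function name and example ids are illustrative, not part of the challenge specification.

```python
# Sketch: format challenge predictions as result-file lines, highest
# spam confidence first (file layout as in Table 7.15).

def format_result_lines(predictions):
    """predictions: dict mapping user id -> spam confidence in [0, 1]."""
    ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    return ["user spam"] + [f"{uid} {conf:g}" for uid, conf in ranked]
```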

7.4.2 Methods

Thirteen solutions were submitted and evaluated by computing the AUC value. The proposed approaches of the challenge varied between those heavily reliant on feature engineering and approaches focusing on tuning machine learning methods (among them kNN, SVM and Neural Networks).
The winners, A. Gkanogiannis and T. Kalamboukis from Athens University, applied a linear classifier to classify users. The model was continuously refined by using a Rocchio-like relevance feedback technique. As classifier input, the authors unified all available information for a post, e. g., title, tags and description, and considered it as a text document. The second team, J. F. Chevalier and P. Gramme from Vadis Consulting, designed a set of features characterizing tags and resources. Examples of automatically derived features include the number of tags or the number of bookmarks per user. Examples of manually derived features are the main tag category of a user or the total number of categories used. C. Kim and K.-B. Hwang from Soongsil University ranked third by using a Naive Bayes classifier on a selected set of tags. The selection process was driven by mutual information and a restriction of tags to known tags from the test dataset. Bogers and Bosch from Tilburg University, the Netherlands, assumed that similar system participants use a similar language, which allows for a distinction between spammers and non-spammers. By means of language modeling, they identified the k most similar users or posts. Depending on the k nearest neighbours' status (spam or non-spam), a new user received a score indicating the probability of being a spammer. Krestel and Chen from the L3S Research Center at the University of Hannover focused on building co-occurrence networks of tags and resources. Kyriakopoulou and Kalamboukis from Athens University, Greece, used text clustering to compute a feature set which is used for text classification. Madkour et al. from the IBM Cairo Technology Development Center investigated the applicability of semantic features similar to the ones introduced in Section 7.3. Neubauer and Obermayer from the Technical University of Berlin used co-occurrence, network and text features to set up an SVM model. All the approaches demonstrated the applicability of machine learning methods to solve the spam prediction task.



Figure 7.6: AUC values of the different submissions. The x-axis shows the challenge's participants ordered by rank, the y-axis shows the AUC value. The scores range from 0.98 to 0.71.

7.4.3 Results

The AUC values of all participants are shown in Figure 7.6. As can be seen, most of the teams achieved an AUC score higher than 0.8. The best team, A. Gkanogiannis and T. Kalamboukis from Athens University, reached an AUC value of 0.98, followed by Gramme and Chevalier with 0.97 and Kim and Hwang with 0.94. Figures 7.7(a) and 7.7(b) depict the ROC curves of the five teams which ranked highest. As seen in Figure 7.7(a), the winning team has the steepest curve (at the beginning). As the ROC curve is plotted from a list ordered by confidence scores, one can see that their classifier accurately predicts spam instances. Only with instances having lower confidence scores does the classifier make mistakes. In contrast, the second and following teams have less steep curves, allowing for more misclassifications among their higher ranked instances.

Figure 7.7: (a) ROC curves of the best and second-best teams; (b) ROC curves of the teams ranked third to fifth.

7.4.4 Discussion

In order to tackle the classification problem algorithmically, well-established approaches such as SVMs or language modelling have been selected and tuned to solve the problem. It seems that the key to obtaining good classification results is the design of appropriate features. As the challenge's winning team focused on text classification, we can conclude

that interpreting a post as a text document and classifying such text snippets leads to reliable results. Other features, such as co-occurrence features, performed slightly worse, but are still helpful for classification. In the experiments of Section 7.3.3, the best AUC value achieved with such an approach was 0.936, which would have earned a fourth place in the challenge. Though the dataset is a slightly different one considering the time periods for training and testing, the results are still comparable. Personal information as used in this setting, such as e-mail addresses or the IP address, does not necessarily lead to a better classification. This conclusion is affirmed in Chapter 8, where we explore different features and their suitability for preserving a user's privacy during the classification task.

7.5 Frequent Patterns

In the previous sections we have seen how important a good characterization of a legitimate social bookmarking system user is. Such characterizations can be used by a classification algorithm to derive appropriate features to distinguish between non-spammers and spammers. Another way to get a better understanding of users is the application of techniques from descriptive data mining, such as local pattern discovery. Descriptive data mining – in contrast to predictive data mining – aims at finding "interesting, understandable and interpretable knowledge in order to discover hidden dependencies and characteristics of the data" [Atzmüller, 2007]. The idea of concept description is to better describe a specific population by finding descriptive patterns (also referred to as subgroups) within that population. One can distinguish between two tasks in concept description: concept characterization, which summarizes a given target population in terms of typical or characteristic features, and concept/class discrimination, which generates descriptions comparing the target population to one or more contrasting populations. In this way, both techniques describe the target population in complementary ways: Concept discrimination focuses on the differences between classes, while concept characterization focuses on the common or typical features of a certain class. The two tasks can be realised by using techniques from subgroup discovery. Its goal is to identify relations between a dependent (target) variable, which represents the overall characteristic of a population, and usually many independent variables describing the target appropriately. The relations are expressed by patterns describing a subgroup of the target variable in question. Instances whose feature values conform to the pattern are part of the subgroup.
Examples of such patterns – in the light of spam detection – could be variable combinations such as "users whose first post occurs long after registration and who have no middle name indicate spammers [target variable]" or, accordingly, "users with a low number of tags and an IP in range X are usually non-spammers". Such findings could be further used for spam classification or help to classify borderline cases manually. The algorithms of subgroup discovery search for patterns in a given population (i. e., social bookmarking users) by optimizing a user-definable quality function. In the following section, we will discuss quality functions which can be applied to the spam detection task. The study results from a cooperation with Andreas Hotho, Florian Lemmerich and Martin


Atzmüller from the University of Würzburg. While they focused on the definition of the quality functions and conducted the experiments, the thesis' author prepared the dataset and helped to interpret the results. The next paragraphs will briefly introduce the quality functions and present the experiment's results. They contain excerpts previously published in Atzmueller et al. [2009]; however, these have been edited to fit this text.

7.5.1 Quality Functions for Discovering Frequent Spam Patterns

This section serves as background information in order to understand the experiments of Section 7.5.3. More details can be found in Atzmueller et al. [2009]. Frequent patterns of a dataset can be identified by combining different feature values (independent variables) and selecting those combinations which score highest with respect to the target variable and the applied quality function.
Let ΩA be the set of all attributes. For each attribute a ∈ ΩA, a range dom(a) of values is defined. Let VA be the (universal) set of attribute values of the form (a = v), where a ∈ ΩA is an attribute and v ∈ dom(a) is an assignable value. A subgroup description sd = e1, e2, . . . , en contains n individual selectors ei = (ai, Vi), where ai ∈ ΩA is the selected attribute and Vi ∈ dom(ai) the attribute's value. ΩE contains all possible selection expressions and Ωsd is the set of all possible subgroup descriptions. A quality function with respect to a particular target variable t ∈ ΩE can then be defined as q : Ωsd × ΩE → R. Subgroup descriptions sd ∈ Ωsd can be sorted according to their quality scores: subgroups which receive low scores can be filtered by setting a threshold or by only accepting a certain number (e. g., the top ten) of patterns. Thus, the definition of appropriate quality functions is essential in order to extract relevant patterns (i. e., variable combinations) for a target variable.
In the literature, a variety of quality functions have been introduced to evaluate subgroup descriptions. Many of them consider the target share p, a subgroup's size n and the size of the total population. According to Atzmueller et al. [2009], the choice of an appropriate quality function, especially one applied to the discrimination setting, "is significantly dependent on the user requirements and the parameters of interest that are to be included into the quality function". Since the previous experiments of this chapter already considered measures from information retrieval for evaluation purposes (precision together with recall), a quality function based on these measures appeared to be a good choice. In the scope of discovering spam with local patterns, the quality functions are based on the contingency table which has also been used for defining evaluation measures in Section 4.2:

The category of true positives (tp) consists of all patterns which correctly pre- dict the target variable, whereas the false positives (fp) represent all patterns for which the prediction is incorrect, and equivalently the false negatives (fn) and true negatives for the ‘negation’ of the rule, i. e., for the complement of the prediction [Atzmueller et al., 2009].

The first quality function defined is a pattern's recall, which describes the relation of the true positive users of a group to all existing positives, with Pos = tp + fn:

    qTPR = tp / Pos = tp / (tp + fn)    (7.7)


The quality function recall, or equivalently the true positive rate (TPR), measures how many users of a specific group are actually covered by the pattern. It can therefore be used for concept characterization, i. e., for describing a subgroup (for example, all spammers of a dataset) as clearly as possible: The better the recall, the more targeted users are contained in the subgroup. Subgroups having a large overlap with the target class instances are then selected; however, the (potentially large) overlap with non-target instances is not considered. The second quality function defined is a pattern's precision, which measures the number of correctly extracted instances with respect to the size of the subgroup:

    qPREC = tp / (tp + fp)    (7.8)

In local pattern analysis, this function is order-equivalent to the relative gain quality function qRG, where p = tp / (tp + fp) and p0 is the relative frequency of the target variable in the total population [Atzmüller, 2007]:

    qRG = (p − p0) / (p0 · (1 − p0)),   n ≥ τCov    (7.9)

The function does not account for a subgroup's size (n). Therefore, a minimum coverage threshold τCov is introduced to make sure that a subgroup with a good qRG covers a certain part of the target group. The function can be applied to the discriminative task, which aims at distinguishing a given subgroup from the rest of the population. The higher the value of qRG, the more precisely the subgroup covers the target population. The F-measure allows the combination of the pattern's recall and precision. It therefore enables the integration of concept characterization and discrimination. The F-measure in the scope of pattern analysis combines the two quality functions for characterization (qc) and for discrimination (qd). It is defined as follows:

    F(qc, qd) = ((1 + β²) × qc × qd) / (β² × qc + qd)    (7.10)

Similarly to the F-Measure applied in IR experiments, "F(qc, qd) measures the effectiveness of the concept description with respect to a user who attaches β times as much importance to qc (characterization) as to qd (discrimination)" [Atzmueller et al., 2009]. With β = 1, characterization and discrimination are equally weighted. The parameter provides "a convenient option for adaptations and for shifting the focus between characteristic and discriminative patterns" [Atzmueller et al., 2009].
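For illustration, the quality functions of Eqs. 7.7–7.10 can be written down directly from the contingency counts of a subgroup. This is a sketch with hypothetical counts, not the reference implementation used in the experiments; the symbol names follow the text.

```python
# Subgroup quality functions from Eqs. 7.7-7.10.
# tp, fp, fn are the subgroup's contingency counts; p0 is the default
# target share of the total population.

def q_tpr(tp, fn):
    return tp / (tp + fn)                     # recall / TPR, Eq. 7.7

def q_prec(tp, fp):
    return tp / (tp + fp)                     # precision, Eq. 7.8

def q_rg(tp, fp, p0):
    p = tp / (tp + fp)                        # target share of the subgroup
    return (p - p0) / (p0 * (1 - p0))         # relative gain, Eq. 7.9

def f_measure(qc, qd, beta=1.0):
    return (1 + beta**2) * qc * qd / (beta**2 * qc + qd)   # Eq. 7.10
```

With β = 1 this reduces to the harmonic mean of the characterizing and the discriminating quality, mirroring the standard F1 measure.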

7.5.2 Experimental Setup

For the case study we use the ECML/PKDD discovery challenge dataset SpamData08 described in Section 7.2.2. The original dataset contains 31715 cases (instances). After removing instances with missing values, the applied dataset contained 31034 instances. The distribution of the classes in the applied dataset is highly unbalanced, with 1812 non-spammers and 29222 spammers, yielding default target shares of 5.8% non-spammers and 94.2% spammers. In the following, we will discuss both classes, i. e., spammers and non-spammers, as our target concepts using both characteristic and discriminative features/subgroups.


As attributes we selected 15 of the features already presented in Section 7.3.1. We thereby focused on socio-demographic features excluding semantic information such as co-occurrences. Table 7.16 summarizes the features used.

Feature Class    Feature Name    Description
Profile based    namedigit       name contains digits
                 namelen         length of name
                 maildigit       e-mail address contains digits
                 maillen         length of mail address
                 realnamelen     length of realname
                 realnamedigit   realname contains digits
                 realname2       two realnames
                 realname3       three realnames
Location based   tld             top level domain
                 domaincount     number of users in the same domain
                 tldcount        number of users in the same top level domain
Activity based   datediff        difference between registration and first post
                 grouppub        number of times '$group=public' was used
                 tasperpost      number of tags per post
                 tascount        number of total tags added to all posts of this account

Table 7.16: Summary of features describing spammers and non-spammers

For the spammer/non-spammer case study, we applied the qTPR quality function measuring the true positive rate, or the recall, of the patterns. For the discriminative setting we applied the relative gain quality function qRG, which is order-equivalent to precision. Finally, for assessing the F-Measure, we utilized the classical recall and precision measures. We adjusted the measures slightly in order to control the simplicity of the patterns by favoring patterns with shorter descriptions (see Atzmueller et al. [2009] for more details).

7.5.3 Results

In this section we present the results of using the quality functions defined above for concept characterization and discrimination of spammers and non-spammers.

Describing Non-Spammers When comparing the attributes included in the patterns for concept characterization and discrimination of non-spammers, we find that date_diff, grouppub, maildigit, maillen, namedigit, realname2, realname3, realnamelen, tld, and tldcount are mainly used for characterization, while date_diff, domaincount, maillen, namelen, realnamelen, tascount, tasperpost, tasperpost, tld, and tldcount are mainly discriminative. The used value ranges for the features are not always exclusive, which is explained by the general observation that characterizing features are also often observed for another class, while this is not true for the discriminative features. In general, profile information seems more important for characterization, while activity-based features seem more important for discrimination. Figure 7.17 shows the results of applying concept characterization: The discovered sub- groups are relatively large and (by construction) large areas of the target space are covered

125 CHAPTER 7: SPAM DETECTIONIN SOCIAL TAGGING SYSTEMS

Subgroup Description Quality Size TP Precision Recall grouppub=0 0.999 29095 1811 6.2% 99.9% realname3=0 0.971 30712 1759 5.7% 97.1% namedigit=0 0.855 19057 1550 8.1% 85.5% maildigit=0 0.842 20956 1526 7.3% 84.2% maillen=>17 0.754 26845 1366 5.1% 75.4% realname2=0 0.611 15569 1107 7.1% 61.1% grouppub=0 AND realname3=0 0.485 28792 1758 6.1% 97.0% tld=com 0.462 24753 838 3.4% 46.3% tldcount=>15092 0.462 24760 838 3.4% 46.3% grouppub=0 AND namedigit=0 0.428 18044 1550 8.6% 85.5% grouppub=0 AND maildigit=0 0.421 19708 1525 7.7% 84.2% date_diff=>1104 0.417 20641 755 3.7% 41.7% namedigit=0 AND realname3=0 0.415 18828 1504 8.0% 83.0% realnamelen=0 0.408 8696 740 8.5% 40.8% maildigit=0 AND realname3=0 0.407 20707 1476 7.1% 81.5% realname2=>0 0.389 15465 705 4.6% 38.9% maildigit=0 AND namedigit=0 0.386 16170 1398 8.7% 77.2% grouppub=0 AND maillen=>17 0.377 25145 1365 5.4% 75.3% maillen=>17 AND realname3=0 0.364 26569 1320 5.0% 72.9% date_diff=8-1104 0.352 9486 638 6.7% 35.2%

Table 7.17: Concept Characterization of non-spammers. The table shows the top 20 sub- group descriptions for the target concept class = non-spammer; Size denotes the subgroup size, Quality is measured by the characteristic relative gain qual- ity function qTPR, Precision denotes the target share of the subgroup (preci- sion), TP the number of true positives in the subgroup, and Recall the recall value of the subgroup pattern. by the individual patterns. The most important attributes are grouppub, realname3, and namedigit, which characterize the non-spammer class with a value of zero very well. Es- pecially the namedigit attribute makes intuitive sense: most spammers have numbers in their user names while non-spammers tend to select short user names without numbers. Figure 7.18 shows the results of the discrimination task: As expected, the discrimina- tive setting focuses on relatively small sections of the target space with high precision (target share), which is in contrast to the concept characterization results. The resulting patterns are significantly discriminative for the target class (non-spammer) and are readily available, e. g., for classification or explanation. One of the discriminative characteristics observable in the result list is the number of tags per post. The most discriminative values for this attribute are 2 and 3. This confirms the intuition that a non-spammer adds a smaller number of tags to a resource than a spammer. However, 4 and 5 is still a discriminative number and appears again in a few patterns. Another important attribute is date_diff with a value smaller than 7. Typically, non-spammers seem to register and submit their first post with only a small time difference, while spammers tend to register in BibSonomy and wait until they start to use the service they ‘recently’ discovered for their purposes. The tld=de pattern is also a discriminative feature. 
This can be explained by the fact that the system is very popular among legitimate users in Germany. In general, while the concept characterization task produces descriptions which focus on demographic features such as the selection of the user name, a distinction between different groups of non-spammers can be made with a combination of demographic and activity features. From this, we can learn that a good indicator for non-spammers is already given in the data they provide when registering; however, in order to reliably classify spammers we need further information about their system interaction.

Subgroup Description                          Quality    Size    TP   Precision   Recall
tldcount=61-70                                 12.58       40    30       75.0%     1.7%
tldcount=116-123                               11.747      71    50       70.4%     2.8%
domaincount=89-90                               9.13      116    65       56.0%     3.6%
tasperpost=4-5 AND tldcount=116-123             8.168      23    22       95.7%     1.2%
date_diff=0-7 AND tascount=0-1                  8.044      35    33       94.3%     1.8%
tascount=2 AND tld=de                           7.936      29    27       93.1%     1.5%
tldcount=1009-1312                              7.874     916   450       49.1%    24.8%
namelen=0-3 AND tld=de                          7.784      35    32       91.4%     1.8%
date_diff=0-7 AND domaincount=140-168           7.773      23    21       91.3%     1.2%
date_diff=0-7 AND tld=de                        7.72      151   137       90.7%     7.6%
date_diff=0-7 AND namelen=0-3                   7.589      28    25       89.3%     1.4%
date_diff=0-7                                   7.395     899   418       46.5%    23.1%
date_diff=0-7 AND domaincount=89-90             7.377      23    20       87.0%     1.1%
tasperpost=2 AND tldcount=1009-1312             7.303     101    87       86.1%     4.8%
tasperpost=0-1 AND tld=de                       7.29      100    86       86.0%     4.8%
tascount=0-1 AND tld=de                         7.264      42    36       85.7%     2.0%
namelen=0-3                                     7.239     149    68       45.6%     3.8%
date_diff=0-7 AND maillen=0-13                  7.048      24    20       83.3%     1.1%
realnamelen=0 AND tldcount=116-123              7.048      24    20       83.3%     1.1%
date_diff=0-7 AND tascount=3-5                  6.977      86    71       82.6%     3.9%

Table 7.18: Concept Discrimination of non-spammers. The table shows the 20 best subgroup descriptions for the target concept class = non-spammer using the discrimination setting. We applied the quality function qRG, i.e., the relative gain quality function; for a description of the remaining parameters see Table 7.17.

Finally, Table 7.19 shows the results of applying the F-Measure, which captures both concept characterization and discrimination. Since the F-Measure combines both criteria, the results strike a balance between the two previous result tables: the focus of the patterns shifts towards more 'precise' but also more typical features. Considering the selected attributes, date_diff and tldcount appear most frequently. The attribute tasperpost is still important; however, it only appears with a small number of TAS per post (≤ 2), while the subgroups implied by the condition TAS > 2 seem to form a poor general description.
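For reference, both quality measures can be computed directly from a subgroup's counts. The sketch below assumes the standard definitions of the true positive rate quality function and one common form of the relative gain quality function; the dataset totals used in the check are rough figures inferred from the TP and recall columns of the tables, not exact values from the thesis.

```python
def q_tpr(tp, positives):
    """True positive rate quality: share of all target-class members
    (e.g. non-spammers) that the subgroup covers."""
    return tp / positives

def q_rg(tp, size, positives, n):
    """Relative gain quality: lift of the subgroup's target share p over
    the overall target share p0, normalised by p0 * (1 - p0)."""
    p = tp / size        # target share (precision) inside the subgroup
    p0 = positives / n   # target share in the whole dataset
    return (p - p0) / (p0 * (1 - p0))

# Illustrative check against Table 7.17 (grouppub=0: TP=1811, quality 0.999),
# assuming roughly 1813 non-spammers in the evaluation data.
print(round(q_tpr(1811, 1813), 3))  # -> 0.999
```

With roughly 31,029 users in total, a subgroup of size 40 containing 30 non-spammers yields a relative gain near the 12.58 reported for tldcount=61-70 in Table 7.18, which suggests the definitions above match the values in the tables.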

Describing Spammers

Considering the attributes used for concept discrimination, we see that specific values of domaincount, grouppub, maildigit, namedigit, namelen, realname2, realnamelen, tasperpost, tldcount, and certain top-level domains (tld) are very good indicators for spammers.

CHAPTER 7: SPAM DETECTION IN SOCIAL TAGGING SYSTEMS

Subgroup Description                      Quality    Size     TP   Precision   Recall
tldcount=1009-1312                          0.33      916    450       49.1%    24.8%
date_diff=0-7                               0.308     899    418       46.5%    23.1%
domaincount=0-3                             0.215    3645    586       16.1%    32.3%
tasperpost=0-1                              0.192    1588    326       20.5%    18.0%
tasperpost=2                                0.17     2511    368       14.7%    20.3%
grouppub=0 AND tldcount=1009-1312           0.167     890    450       50.6%    24.8%
namedigit=0 AND tldcount=1009-1312          0.165     704    414       58.8%    22.9%
maildigit=0 AND tldcount=1009-1312          0.162     803    424       52.8%    23.4%
realname3=0 AND tldcount=1009-1312          0.161     901    436       48.4%    24.1%
tascount=3-5                                0.156    3352    402       12.0%    22.2%
date_diff=0-7 AND grouppub=0                0.155     877    418       47.7%    23.1%
date_diff=0-7 AND namedigit=0               0.15      717    380       53.0%    21.0%
date_diff=0-7 AND realname3=0               0.15      882    404       45.8%    22.3%
namedigit=0                                 0.149   19057   1550        8.1%    85.5%
date_diff=0-7 AND maildigit=0               0.148     713    374       52.5%    20.6%
realnamelen=0                               0.141    8696    740        8.5%    40.8%
maildigit=0                                 0.134   20956   1526        7.3%    84.2%
tascount=0-1                                0.13      664    161       24.3%     8.9%
maillen=>17 AND tld=de                      0.13      604    314       52.0%    17.3%
realname2=0                                 0.127   15569   1107        7.1%    61.1%

Table 7.19: Concept Description using the F-Measure. The table shows the top 20 subgroup descriptions for the target concept class = non-spammer (combined concept description setting).

For characterization, attributes like date_diff, domaincount, grouppub, maildigit, maillen, namedigit, realnameX, tascount, tasperpost, tld and tldcount are important, which is similar to the discriminative setting, but as expected, the value sets are often more general than the specific patterns used for discrimination. Table 7.20 shows the results of the characterization of spammers7: While grouppub > 0 is a perfect feature for discrimination, grouppub = 0 is also a good feature for characterization, since there is also a large number of spammers with grouppub=0. As expected, most spammers do not enter multiple names (realname3=0); however, they tend to choose long names (namelen ≥ 9) and long e-mail addresses (maillen ≥ 17). Additionally, spammers often use digits in their names and e-mail addresses (namedigit, maildigit). Our assumption is that they tend to number the accounts they create at different sites. As a further characteristic, they often come from the .com domain, and use many TAS (tascount ≥ 33) and TAS per post (tasperpost = 5-11). Table 7.21 shows the results of the discriminative description of spammers: While date_diff is not as important for discriminating spammers as for discriminating non-spammers, the tasperpost attribute is again very important. As expected, and as also shown

7These results are also similar to the F-Measure results, since spammers form the majority class and therefore recall seems to dominate in this setting. We therefore do not provide a detailed discussion of the F-Measure results.


Subgroup Description                    Quality    Size      TP   Precision   Recall
realname3=0                               0.991   30712   28953       94.3%    99.1%
grouppub=0                                0.934   29095   27284       93.8%    93.4%
maillen=>17                               0.872   26845   25479       94.9%    87.2%
tldcount=>15092                           0.819   24760   23922       96.6%    81.9%
tld=com                                   0.818   24753   23915       96.6%    81.8%
date_diff=>1104                           0.681   20641   19886       96.3%    68.1%
maildigit=0                               0.665   20956   19430       92.7%    66.5%
namedigit=0                               0.599   19057   17507       91.9%    59.9%
realname2=>0                              0.505   15465   14760       95.4%    50.5%
realname2=0                               0.495   15569   14462       92.9%    49.5%
tascount=>33                              0.48    14447   14017       97.0%    48.0%
namelen=>9                                0.466   14087   13621       96.7%    46.6%
grouppub=0 AND realname3=0                0.463   28792   27034       93.9%    92.5%
maillen=>17 AND realname3=0               0.432   26569   25249       95.0%    86.4%
grouppub=0 AND maillen=>17                0.407   25145   23780       94.6%    81.4%
realname3=0 AND tldcount=>15092           0.405   24507   23689       96.7%    81.1%
realname3=0 AND tld=com                   0.405   24500   23682       96.7%    81.0%
namedigit=>0                              0.401   11977   11715       97.8%    40.1%
tasperpost=5-11                           0.378   11380   11055       97.1%    37.8%
domaincount=>4473                         0.367   11200   10726       95.8%    36.7%

Table 7.20: Concept Characterization of spammers. The table shows the 20 best subgroup descriptions for the target concept class = spammer. As for the non-spammer characterization, we applied the true positive rate quality function qTPR. For a description of the parameters see Table 7.17.

by the characterization findings, realnamelen, maildigit and namedigit provide typical features for spammers, who usually have digits in their names and use longer names. Another very discriminative feature is tld. Spammers seem to rely heavily on domains such as th, us, info, and biz in addition to the already mentioned com domain. This complements the patterns observed for the non-spammers.

7.5.4 Discussion

The experiments above evaluated an approach for concept characterization and discrimination using local patterns that were discovered by applying subgroup discovery techniques. Suitable quality functions for the characterization and discrimination of spam and non-spam user groups could be found by relying on existing measures from the field of information retrieval. The patterns provide interesting insights into the characteristics used to uncover spammers in social bookmarking. For example, it could be shown that the number of tags per post or the time lag between registration and the first post helps to distinguish spammers from non-spammers. The patterns make intuitive sense and help to better understand different user groups in the system. In future work, it would be of interest to explore how such patterns can be used to improve spam classification.


Subgroup Description                          Quality    Size      TP   Precision   Recall
tld=th                                          1.062      29      29      100.0%     0.1%
grouppub=>0                                     1.053    1939    1938      100.0%     6.6%
tld=us                                          1.011     708     706       99.7%     2.4%
tasperpost=>11                                  0.956    5515    5483       99.4%    18.8%
tldcount=666-1008                               0.95     1464    1455       99.4%     5.0%
domaincount=91-139                              0.894     651     645       99.1%     2.2%
tld=info                                        0.893     755     748       99.1%     2.6%
domaincount=2174-4473                           0.716    6311    6191       98.1%    21.2%
tld=biz                                         0.716     105     103       98.1%     0.4%
namedigit=>0                                    0.664   11977   11715       97.8%    40.1%
domaincount=60-88                               0.585     381     371       97.4%     1.3%
maildigit=>0                                    0.546   10078    9792       97.2%    33.5%
tasperpost=5-11                                 0.543   11380   11055       97.1%    37.8%
domaincount=169-2173 AND realnamelen=1-3        0.531      81      81      100.0%     0.3%
namelen=5 AND realnamelen=1-3                   0.531      34      34      100.0%     0.1%
realname2=>0 AND realnamelen=1-3                0.531      83      83      100.0%     0.3%
realnamelen=1-3 AND tasperpost=>11              0.531     140     140      100.0%     0.5%
domaincount=4-54 AND tld=biz                    0.531      25      25      100.0%     0.1%
realnamelen=13-15 AND tld=biz                   0.531      27      27      100.0%     0.1%
tasperpost=>11 AND tld=biz                      0.531      23      23      100.0%     0.1%

Table 7.21: Concept Discrimination of spammers. The table shows the top 20 subgroup descriptions for the target concept class = spammer, using the discrimination setting. We applied the quality function qRG, i.e., the relative gain quality function. For a description of the remaining parameters see Table 7.18.

7.6 Summary

This chapter analysed features and methods to detect spam in social bookmarking systems. First, a categorization of possible features and their suitability for spam classification was explored. It could be shown that semantic features are of great help to identify spammers. A combination of all features delivered the best results. From an algorithmic perspective, SVMs and logistic regression delivered good results, though no classification algorithm significantly outperformed the others. Second, the ECML/PKDD discovery challenge 2008 was summarized. The results of the different participants were briefly introduced and compared to our previous work. The best results were obtained by treating a post as a text document and conducting text classification on the dataset. Finally, spam and non-spam patterns were discovered by means of subgroup discovery. The patterns help to understand the differences in user characteristics and behaviour. Overall, the classification algorithms and features introduced in this chapter are suitable for spam detection. However, there is room for improvement. Over time, most social bookmarking systems have implemented more functionality, so that more user information is available. For example, BibSonomy allows users to build a social network by either explicitly creating friend links or by following other users. These social networks could be used to extract features such as the number of spam / non-spam friends or the

number of users a participant is following. An important piece of information not considered in this work is the temporal perspective. Do spammers change over time? Are the same features which helped to detect spammers in the beginning still important for detecting the new wave of spammers? Does a preselection of training instances tailored to a specific period help to improve spam detection results? The last question was analysed in the scope of a bachelor thesis [Borchert, 2011]. Interestingly, it could be shown that there is a difference regarding how to treat spam and non-spam examples. While non-spam examples should always be included in a training dataset, spam examples from recent classification activities are sufficient to characterize spammers. This is also due to the skewed dataset, which contains only a very small number of non-spammers. The bachelor thesis also analysed active learning techniques to better select spam examples. However, there was no significant improvement over selecting examples by chance. From this work we concluded that, as long as a certain number of spam examples is contained in the training set, no other preselection methods are necessary. However, it is important to increase the number of non-spam examples relative to the number of spam examples. While we focused on analysing different features, a deeper investigation of algorithmic details might help to better tackle the problem. For example, a kernel tailored to the spam problem might improve spam classification. Keeping in mind that the best classifier of the ECML/PKDD discovery challenge used textual classification, string kernels [Amayri and Bouguila, 2010; Sonnenburg et al., 2007] could be interesting to look at. Also, the performance of SVM algorithms could be improved by implementing online updateable techniques [Sculley and Wachman, 2007].
Such approaches are especially useful for advancing the BibSonomy spam classification framework (introduced in Chapter 9) and making it more adaptable to ongoing changes.
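The training-set selection strategy suggested by the findings above (always keep non-spam examples, restrict spam examples to a recent window) can be sketched as follows; the record fields and the 90-day window are illustrative assumptions, not the setup of the cited bachelor thesis.

```python
from datetime import date, timedelta

def select_training_set(users, now, spam_window_days=90):
    """Time-aware training-set selection: keep all non-spam examples,
    but only spam examples classified within a recent window."""
    cutoff = now - timedelta(days=spam_window_days)
    keep = []
    for u in users:
        if not u["spammer"]:
            keep.append(u)                  # non-spam: always include
        elif u["classified_at"] >= cutoff:
            keep.append(u)                  # spam: only recent examples
    return keep

users = [
    {"name": "alice", "spammer": False, "classified_at": date(2008, 1, 5)},
    {"name": "seo99", "spammer": True,  "classified_at": date(2008, 2, 1)},
    {"name": "buy4u", "spammer": True,  "classified_at": date(2008, 7, 20)},
]
train = select_training_set(users, now=date(2008, 8, 1))
print([u["name"] for u in train])  # -> ['alice', 'buy4u']
```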


Chapter 8

Data Privacy in Social Bookmarking Systems

8.1 Introduction

The problem of identifying spam in social bookmarking systems was discussed in the previous chapter. We showed how classical machine learning techniques can be used for this task. Most of these methods create a mathematical model from positive and negative user examples. New system users can then be classified with the generated model. The training examples consist of different features describing a user. The more descriptive the features, the better the performance of the classifier. Hence, careful feature engineering is an important task when building efficient and effective spam classifiers. Feature engineering involves the collection and storage of user data. Although the data may not be published or forwarded to other companies, the fact that user data is processed can be seen as an invasion of user data privacy. In the light of a growing awareness among internet users regarding the storage and usage of their private data, the protection of user privacy can be seen as a system's seal of quality and become a competitive advantage that is not to be underestimated. Depending on the system provider's country, such data collection can even be illegal. Thus, from a legal and user-friendliness point of view, a system provider should favour features built from non-private and publicly available data. In order to balance performance and data privacy aspects in social spam detection, this chapter presents a privacy-aware feature engineering approach as published in Navarro Bullock et al. [2011b]. The paper is joint work by authors with technological and legal backgrounds and combines computing aspects with legal expertise. In contrast to the creation of privacy-enhancing technologies, including approaches such as data anonymization or the definition of user privacy policies, we start our privacy-aware consideration at the beginning of the data mining process, when data is collected and spam classification features are generated.
Our contribution consists of an integrative pattern demonstrating how to evaluate classification features according to performance and privacy conditions. In order to show the practicability of our approach, we will conduct extensive classification experiments using features generated from different data sources of the social bookmarking system BibSonomy. The experiments will show that performance and privacy aspects need not be mutually exclusive. The technical background necessary for the experiments has been introduced in Chapter 4.


Chapter 5 explained basic legal concepts. In the following sections, the legal background is extended with a discussion of spam and privacy concepts (Section 8.2). Section 8.3 provides details about the experiments, discusses data privacy levels and presents results. Section 8.4 summarizes our findings.

8.2 Legal Analysis of Spam Detection

Building models for spam detection requires the collection and processing of user data such as IP addresses, content and log data. Looking at the plain data, it is possible to associate the data with a natural person (the user). As a consequence, data privacy regulations and laws must be considered when implementing and running spam detection systems. The relevant data privacy laws in Europe are the national implementation acts of the European Data Protection Directive 95/46/EC [Krause et al., 2010]. As already mentioned in Section 5.1.2, Article 2 a) of 95/46/EC defines personal data as

any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.

In order to identify a person, the controller (in our case, the system provider) of a social bookmarking system has a variety of information available: profile data such as age, sex or contact details, content data such as his or her posts, and the usage behaviour. Even if the majority of data sources do not reveal a user's identity, the combination of facts often allows an identification. As soon as a person is identified, the rest of the user's data becomes personal as well. Thus, even if users post under a pseudonym, their data is private data as soon as they reveal some kind of personal information (for example, when adding the tag "myown" to a publication with only one author in BibSonomy). It is therefore difficult (probably impossible) for service providers processing social bookmarking data to eliminate personal data from their database. Thus, all data must be considered personal data. The collection and processing of personal data for the task of spam detection can be justified legally.
Article 7 b) of the Data Protection Directive states that "personal data may be processed only if processing is necessary for the performance of a contract to which the data subject is party [...]." The main task in order to fulfil the contract between user and social bookmarking service provider is to ensure that users can store their posts and use the system in the agreed-upon way. Is spam detection part of this contract? In Navarro Bullock et al. [2011b] we conclude:

This data has been originally stored for another purpose, such as the technical realisation of the usage or publication of contents. The usage of data for spam detection constitutes a change of purpose, so the lawfulness of the processing needs to be re-examined. The purpose, spam detection in a social bookmarking system, enables the fulfilment of a contract, namely the realisation of the service the operator offers the user in the context of the usage relationship. The lack of spam detection measures in systems without regulated access, like BibSonomy, quickly results in a degree of spam infestation that renders


the system inoperative for the legitimate user. The user relationship, therefore, includes a certain minimum usability which cannot be provided without an effective spam detection scheme.

The design and provision of the service, however, must then conform to the privacy regulations. Such data privacy laws specify the conditions under which personal data can be processed. In Section 5.1.3 we introduced privacy principles providing the basis for all regulations, including national data privacy laws. Article 6 (1) a) 95/46/EC reflects those principles [Navarro Bullock et al., 2011b]:

According to [the article], data processing must be carried out fairly and lawfully, i.e., in a transparent way for the data subject and solely on grounds of law or by consent of the concerned. The principles of purpose limitation, relevance and necessity apply, as well as the principle of data reduction. This means that personal data is solely to be collected and used in accordance with either specific, legally permissible or consented purposes, and only as long as it is relevant and necessary. In case of a change of purpose – meaning data collected for a specific purpose are to be processed for another one – the legal requirements have to be fulfilled anew for each processing stage. As little personal data as possible is to be collected and processed and must be erased or anonymized as soon as possible. Often the law calls for a weighting of the respective interests of the data controller and of the data subject, which only allows data processing if and when it falls in favour of the controller. Transparency of all the data processing stages for the concerned user is the prerequisite to exercising his rights.

If part of the available user data (be it a single piece of information such as the user name, or a combination of different facts such as the click history) enables the categorization of users into spammers and non-spammers, such pieces of information can be seen as relevant for the purpose of spam detection. If no other data with the same effect is available, it becomes necessary. According to Navarro Bullock et al. [2011b], "the operator, however, is not allowed to process any kind of data to reach 100% accuracy. Processing must be appropriate in every individual case, as it constitutes an interference with a fundamental right." Considering typical data mining experiments where different features and algorithms are available, a reasonable guideline is to select the alternative which performs well but requires the fewest personal data.

8.3 Privacy Aware Spam Experiments

8.3.1 Experimental Setup

This section describes the experimental setup considering performance and data privacy conditions and presents the experiments' results. The feature engineering and classification experiments are conducted using the SpamData08 dataset of the social bookmarking system BibSonomy, described in detail in Section 7.2.2. The dataset contains about 2,500 active users who registered until mid-2008, and (despite the implementation of captchas for the registration of new users) more

than 25,000 spammers. It consists of all public posts with URL, title, date and tags. Additionally, we added data from the BibSonomy system to include non-publicly available registration information, such as the users' full names, e-mail addresses or affiliations, and log data. The features which can be derived from the public information include post and (implicit) social network information as described in Section 7.3.1 and Chapter 9. The additional non-publicly available data sources contain registration and log information. The majority of features we derived from the data are introduced in Chapter 8. To get a better understanding of what kinds of features perform best, we categorized them into different groups (personal, behavioural, textual, network, location). Table 8.1 contains a more detailed description of the features, including the feature groups, the data sources needed, a feature definition and short examples to illustrate them. Unfortunately, the features marked with * are not available in our dataset, as they were implemented in BibSonomy at a later time. Using the SpamData08 dataset, however, allows the comparison to experiments conducted by other researchers (described in Section 7.4.4).

Table 8.1: Description of feature groups

Personal (registration information)
    maildigit, namedigit, realnamedigit;
    maillength, namelength, realnamelength
        Digits contained in / length of the user name, e-mail address and
        real name of the user. Example: "[email protected]" versus
        "[email protected]"; digits in e-mails: 3 versus 0; length of e-mails:
        15 versus 23.
    realname2, realname3
        Number of separate names in the real name. Example: "John Ferdinand
        Doe" versus "John123": 3 versus 1.

Behavioural (registration information, posts)
    datediff
        Time between registration date and first post date. Example: a user
        registered on December 1st and submitted her first post a few
        minutes later; some spammers, however, wait a few days or weeks
        until they become active.
    tasperpost, spammertag
        Min, max and average number of tags per post; specific keywords
        used in posts. Example: a spam user adds about 4 tags to a resource
        while a legal user adds 3; spam users tend to use typical tags such
        as "money", "free" or "seo".

Behavioural (logging information)
    userlogins*, numclicks*
        Number of times a user logs into the system; number of clicks on
        (spam / non-spam) entries. Example: spammers tend to click on their
        own entries more often than other users do.

Textual (posts)
    tagposts, bibtexposts, bookmarkposts, allposts
        Terms used in posts (either bookmarks, BibTeX entries or both),
        considering title, URL, tags and description. Example: bag of words
        of all terms contained in the posts of the user.

Network (implicit co-occurrence networks)
    co(no)spamr, co(no)spamt, co(no)spamtr
        Implicit links: co-occurrences of tags, resources or both. Example:
        two users are linked because they use the same tags, the same
        resources, or write the same tags to the same resources; the
        features count how many users share tags and resources with other
        spam or non-spam users.

Network (social network information)
    (no)spamfriends*, (no)spamfollowers*
        Explicit links: number of (spam / legal) friends; number of (spam /
        legal) followers. Example: friends can be all other users in the
        system; followers are users who are interested in the content of
        the posts and like to watch new entries of the followed users
        regularly.

Location (log information + registration information)
    spamip, tld
        Location of the IP address; number of spammers with the same
        provider in the e-mail address. Example: users with an IP address
        from a specific country tend to be spammers; users of the provider
        "webxxx.de" tend to be spammers.
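To illustrate how features of this kind are derived, here is a minimal sketch. The record layout (dictionary keys) and the example values are hypothetical; the thesis computes these features from BibSonomy's registration and posting data.

```python
from datetime import date

def registration_features(user):
    """Derive 'personal' and 'behavioural' features of the kind listed
    in Table 8.1 from a single registration record (hypothetical layout)."""
    name, mail, real = user["name"], user["mail"], user["realname"]
    return {
        "namedigit":      sum(c.isdigit() for c in name),
        "maildigit":      sum(c.isdigit() for c in mail),
        "realnamedigit":  sum(c.isdigit() for c in real),
        "namelength":     len(name),
        "maillength":     len(mail),
        "realnamelength": len(real),
        # number of separate names in the real name (cf. realname2/realname3)
        "realnameparts":  len(real.split()),
        # days between registration and first post (cf. datediff)
        "datediff":       (user["first_post"] - user["registered"]).days,
    }

u = {"name": "john123", "mail": "john.doe123@example.org",
     "realname": "John Ferdinand Doe",
     "registered": date(2008, 12, 1), "first_post": date(2008, 12, 15)}
f = registration_features(u)
print(f["namedigit"], f["realnameparts"], f["datediff"])  # -> 3 3 14
```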

We use the toolkit Weka1 to compute different classification algorithms such as (multinomial) Naive Bayes, logistic regression or decision trees. Following the challenge's task, the models are inferred from examples generated from the training data and tested on examples generated from the test data.

1http://www.cs.waikato.ac.nz/ml/weka/
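The experiments themselves run in Weka. As a self-contained illustration of the multinomial Naive Bayes setup applied to post text, the following sketch trains on toy token lists (the documents and vocabulary are invented for illustration and are not from the dataset):

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs):
    """Train a multinomial Naive Bayes model on (tokens, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_mnb(model, tokens):
    """Return the class with the highest log-posterior for the tokens."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for label in class_docs:
        total = sum(word_counts[label].values())
        lp = math.log(class_docs[label] / n_docs)
        for t in tokens:
            # Laplace smoothing over the joint vocabulary
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [(["free", "money", "seo"], "spam"),
         (["cheap", "seo", "money"], "spam"),
         (["folksonomy", "tagging", "research"], "nonspam"),
         (["bookmark", "research", "web"], "nonspam")]
model = train_mnb(train)
print(predict_mnb(model, ["free", "seo"]))     # -> spam
print(predict_mnb(model, ["tagging", "web"]))  # -> nonspam
```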

136 8.3 PRIVACY AWARE SPAM EXPERIMENTS

Table 8.2: Order of feature groups in consideration of data privacy aspects

1. Anonymised data: All user data that the operator cannot associate with a single user after having removed all features which allow an identification; generally impossible with posts, as they can easily be re-associated by comparison with the permanently saved and published information.

2. Publicly available data: Posts marked as public by the user, including tags, keywords, resources, published contact and profile information, friend and follower links, and even registration information such as the e-mail address, real name or user name if published. (Preferably processed in pseudonymised form by a department without access to the identification key.)

3. Registration information: All registration data not explicitly published, such as e-mail address, real name, user name. (Pseudonymised processing as above.)

4. Logging information: IP address, time information of registration and posts, number of times a user logs into the system, number of clicks on (spam / non-spam) entries. (Pseudonymised processing as above.)

5. Explicitly not published data: Posts, contact and profile information marked as private by the user. (Pseudonymised processing as above.)

8.3.2 Evaluation

In order to evaluate the performance, we use the AUC (Area Under the ROC Curve) measure as described in Section 4.2. To evaluate the privacy-friendliness of the different features, we briefly analyse the legal conditions and present a simple-to-use categorization of data privacy levels.
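For reference, the AUC equals the probability that a randomly drawn positive example is scored above a randomly drawn negative one (ties counting half). A minimal sketch with illustrative scores:

```python
def auc(scores_pos, scores_neg):
    """AUC via pairwise comparison of positive and negative scores
    (equivalent to the normalised Mann-Whitney U statistic)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Classifier scores for spammers (positives) and non-spammers (negatives):
print(auc([0.9, 0.4], [0.1, 0.7]))  # -> 0.75
```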

Privacy-friendliness of features2

Service providers should always opt for the most privacy-friendly alternative when choosing features and methods for spam detection. First of all, as little personal data as possible should be used. Non-personal or non-identifiable data are not subject to data protection laws and can be used without restrictions for any kind of data processing. This also applies to anonymised data which the operator can no longer associate with a user and thus with the natural person of the user. Such an anonymisation might be

2This section contains the legal analysis provided by the co-author Hana Lerch in Navarro Bullock et al. [2011b].


practically impossible, given the countless data combination possibilities as well as the multiple ways to identify single persons – [which is] often desired by the user – through public posts and other linkable user information. [...] This way, only the classification of single posts as spam (and not of the entire user account) can be carried out. In order to lower the risk of improper usage of data for other purposes, a separate in-house department could be assigned with the data processing for spam detection after the pseudonymisation of the data to be used [takes place]. For that purpose, the operator [first of all] removes all attributes that render the identification possible and replaces them with a key by means of which he can later re-associate the data with a user account in [case it has] been positively identified as spam. This is[, however,] not an anonymisation, since the operator remains capable of identifying the data subject [...]. Alternatively, the data mining process for spam detection can be transferred to an external provider [...] [after the data has been pseudonymised so that] the processing entity cannot identify the data subject. The external provider only communicates the hits to the operator, who then re-assigns this data to the respective user account to take further measures. The weak point in the design might be, as in the above mentioned anonymisation case, the possibility of identification via data from public posts etc., which can hardly be excluded. The distinction between data published by the user (public posts including tags, user name etc.) and non-public data generated by the user (registry information, utilisation data such as information on the activity of a user, her IP address, etc.) is another criterion for the selection of spam features.
For the above mentioned reasons, the purpose of spam detection generally legitimates the use of both public and non-public data, provided the respective data is [truly] indispensable for an efficient spam detection. Public data is often personal as well and thus not usable without any data protection restrictions. The [difficulty] caused by the usage of non-public data, however, is more severe. Users do not publish data [explicitly] for the purpose of spam detection, but at the same time [they] indicate that this information shall be visible for everyone, and thus abandon the higher protection they expect for the data they deliberately mark as private [...]. Similarly, the utilisation data[,] whose generation and storage the user might not even be aware of, can offer an equally high [amount of information] concerning her interests and personality as public data do, especially [when combined] with the latter. Due to its lesser [restrictions], the use of public features is preferable. If the exclusive usage of public data delivers satisfying results in spam detection, non-public data should not be used. A minimum accuracy value cannot be given, as further developments in spam detection research and the adaptability of spammers will influence this value. However, a minor increase in precision cannot justify the usage of far more sensitive features. In any case, a success rate of 100% is hard to achieve due to technical reasons. [Table 8.2] offers a rough orientation for the selection of spam features by designers of spam detection systems. In specific, well-founded exceptions, variations may be justified. The categories are arranged according to [...] their data privacy


Table 8.3: Performance overview of AUC values for the best classifier of each feature group and feature computed using the SpamData08 dataset. Textual features perform best, followed by network features.

Feature Group       Privacy Category                Feature Name                              AUC    Best Classifier

Personal            (3) registration information    maildigit, namedigit, realnamedigit       0.68   Naive Bayes
                                                    maillength, namelength, realnamelength    0.68
                                                    unimail                                   0.553
                                                    All                                       0.776

Textual             (2) publicly available data     tagposts                                  0.919  Multinomial Naive Bayes
                                                    bookmarkposts                             0.951
                                                    bibtexposts                               0.696
                                                    All                                       0.956

Network (implicit)  (2) publicly available data     cospamr, cospamt, cospamtr                0.718  Logistic Regression
                                                    conospamr, conospamt, conospamtr          0.653
                                                    All                                       0.903

Behavioural         (2) publicly available data,    datediff                                  0.512  Logistic Regression
                    (4) logging information         tasperpost                                0.747
                                                    numbibtexposts                            0.787
                                                    spammertags                               0.754  J48 (pruned)
                                                    All                                       0.86

Location            (3) registration information,   domain                                    0.709  J48 (pruned)
                    (4) logging information         spamip                                    0.569
                                                    tld                                       0.753
                                                    All                                       0.798

level, with the first category being more privacy-friendly than the following ones.
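The pseudonymisation workflow described in the legal quote above (the operator strips identifying attributes, keeps a re-identification key, and hands only the cleaned records to the spam-detecting unit) can be sketched in a few lines. This is an illustrative sketch only; the record fields and key format are assumptions, not part of the legal study or of BibSonomy's implementation:

```python
import secrets

def pseudonymise(records, identifying_attrs):
    """Replace identifying attributes with a random key. The
    key-to-identity mapping stays in a separate table that only the
    operator can access; the cleaned records go to spam detection."""
    mapping = {}          # key -> identifying attributes (operator only)
    pseudonymised = []    # records handed to the processing entity
    for record in records:
        key = secrets.token_hex(8)
        mapping[key] = {a: record[a] for a in identifying_attrs}
        cleaned = {a: v for a, v in record.items() if a not in identifying_attrs}
        cleaned["key"] = key
        pseudonymised.append(cleaned)
    return pseudonymised, mapping

def reidentify(key, mapping):
    """After a positive spam hit, the operator re-associates the key
    with the original user account to take further measures."""
    return mapping[key]
```

As the quote notes, this is not anonymisation: whoever holds the mapping can still identify the data subject, which is why the mapping must stay with the operator (or be withheld from an external provider).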

8.3.3 Results and Discussion

Table 8.3 presents the AUC value of the best classifier for each specific feature and for the feature groups. As can be seen, textual features perform best, followed by network features. Information derived from personal data, such as the registration category, does not perform as well. Combining different features of one category improves the classification results; for example, the single network features perform rather poorly, while their combination achieves the second best result.

To get an insight into the performance of a combination of different feature categories, we combined all categories except the text features. Table 8.4 shows the classification results. The AUC value obtained by simply mixing all features (0.863) is lower than that of the top two feature categories, possibly because some contradictory features dilute the classifier. By preselecting certain features (using the SVM attribute preselection method in Weka), a better performance can be achieved (0.941, last line of Table 8.4). Only the text feature category, with a performance of 0.956 (Table 8.3), attains better results.

Considering privacy conditions, the classification results show that features derived from the most privacy-friendly data sources in Table 8.2 generally perform better. Features in the first privacy category, anonymised data, are difficult to create in BibSonomy, as the data (for example the different posts) can be compared with the publicly available entries and the pseudonymised name matched to the real user name. Textual and network features as computed in this study (without explicit relationship information such as friends) fall into the second-best category, publicly available data; those features perform best in the case study. Critical features derived from registration or log information perform worse and, from a legal perspective, should not be used.
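All results in this section are reported as AUC values. As a reminder of what these numbers mean, the following sketch computes the AUC of a binary classifier directly from its scores, using the pairwise-ranking interpretation. The actual experiments used Weka's implementations; this is purely illustrative:

```python
def auc(scores_spam, scores_ham):
    """AUC = probability that a randomly drawn spammer receives a
    higher classifier score than a randomly drawn legitimate user
    (ties count as 0.5). 1.0 is perfect separation, 0.5 is chance."""
    wins = 0.0
    for s in scores_spam:
        for h in scores_ham:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(scores_spam) * len(scores_ham))
```

Read this way, the textual feature group's 0.956 means that in roughly 96 of 100 random spammer/non-spammer pairs the spammer is ranked higher.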
Consciously marked private entries have not been used in any of the features, so the most critical privacy data category is not considered. A simple combination of all features does not automatically lead to better


Table 8.4: Performance overview of AUC values for the best classifier of all features com- puted using the SpamData08 dataset. By preselecting certain features (using the SVM attribute preselection method in Weka), the best combination of features can be achieved (0.941). Only the text feature category with a performance of 0.956 attains better results.

Feature Group                                        AUC    Best Classifier  Privacy Category

All features of the above, without textual           0.863  Naive Bayes      (2), (3), (4)
features

Feature combination (bibtex, conospamr,              0.941  Naive Bayes      (2), (3), (4)
cospamt, conospamt, conospamtr, unimail,
cospamr, spamip), without textual features

classification results. However, as shown with the combination of a selection of features, better results can be obtained by mixing different feature classes. Two features of this combination, unimail and spamip (described in Table 8.1), have been generated from critical registration and log data sources. In this case, service providers need to consider privacy risks, either by not using these specific critical features or by informing users about the necessity and purpose of collecting this information.

8.4 Summary

In this chapter, we introduced a privacy-aware feature engineering approach for spam detection in social bookmarking systems. It consists of defining privacy categories for different data sources used to generate classification features. Features can then be evaluated not only by their classification performance, but also by their capability to protect the data privacy of users.

We evaluated our approach using a dataset of the social bookmarking system BibSonomy. It could be shown that textual features derived from posts perform best, followed by network features. Both feature groups use publicly available data, the second of five identified privacy levels. For this case study, we can conclude that effective spam detection can be conducted without significant performance losses. Features complying with the privacy categories introduced in this chapter should therefore be preferred over those that do not.

One issue needs to be considered for practical spam detection applications: spammers need to be identified as soon as possible. Often, only registration and personal information is available after a spammer's registration. Thus, one either has to wait for the first posts before spammers can be marked, or privacy concerns need to be given less importance in favour of a "clean" system.

While the legal categorisation of spam features can easily be transferred to other systems,

the evaluation of the effectiveness of spam features is difficult to generalise. Spammers will react to the countermeasures introduced by the system provider by changing their behaviour. Therefore, other features and methods might become necessary in response to the adaptation of spammers. Future work should therefore analyse the changing behaviour of spammers in social bookmarking systems in order to further generalise the results. The different privacy categories, however, can be used as an orientation scheme for other data mining applications, such as recommendation and ranking services.


Part III

Applications


Chapter 9

BibSonomy Spam Framework

After a few years of running BibSonomy, the manual classification of spammers by the system's administrators was no longer possible due to the overwhelming number of new registrations, most of them from spammers. In 2009, system administrators were labeling about 200 spammers a day. At the end of 2009, the BibSonomy spam framework was introduced to reduce the manual labeling effort. This framework automatically classifies users based on a model learned regularly from BibSonomy training data. The features and classification methods of the framework were first evaluated in the experiments introduced in Section 7.3 and then slightly adjusted to fit a real-life scenario. All user and classification data is collected in order to enable further research on spam classification. In the following sections we will first summarize the relevant facts about the framework and then describe the framework's processes and architecture in detail.

9.1 BibSonomy Spam Statistics

In order to get an overview of the framework's workload, we present some figures about the number of users and spammers, their usage of BibSonomy and the basic performance of the framework. The time period ranges from 2009 until the end of 2011 1. Table 9.1 presents basic statistics aggregated over the three years. As can be seen in the first line, BibSonomy had nearly 730,000 registrations during that time period, of which roughly two thirds were identified as spam. The rest are either non-spammers or "empty" registrations, i. e., users who never submitted a post and therefore have never been classified. When only considering users with at least one post (second line of the table), the numbers are smaller, and the proportion of spam versus non-spam users is even more skewed: the system then contains about 500,000 spammers and 5,000 non-spammers. Most of the registrations were handled by the spam framework: about 410,000 users were classified by the framework without being changed by the administrators, while about 80,000 user flags have been touched by administrators, as the last line shows. Please note that the fourth and fifth lines do not sum up to the number of total registered spammers, as they only account for users who were updated during this period; users updated at a later time have not been considered.

1The period during which I was an active researcher in the BibSonomy team.


Table 9.1: Basic figures of BibSonomy and the spam framework until the end of 2011

                                                  Spammer   Non-Spammer       All
Number of total registrations in BibSonomy        493,234       229,759   722,993
Number of users with no posts                         996       224,564   225,560
Number of users with at least one post            492,238         5,195   497,433
Number of users classified by the framework       407,254         1,698   408,952
and not changed by admins
Number of users classified manually                83,527         2,044    85,571

[Figure: two panels, (c) Newly Registered Spammers (y-axis up to 100,000) and (d) Newly Registered Non-Spammers (y-axis up to 1,000), each plotting the number of new users per quarter from January 2009 to December 2011.]

Figure 9.1: Newly registered spammers and non-spammers with at least one post, tracked over time. Please note that the scales of the y-axes differ, as the number of spammers exceeds the number of non-spammers many times over in BibSonomy.

Figure 9.1 depicts the number of spammers and non-spammers who registered during each quarter between 2009 and 2011. The number of new non-spam registrations is decreasing in general, which can be attributed to the fact that many potentially interested users have already registered or use another social bookmarking system. The peaks are mostly due to events where BibSonomy was used; for example, in October 2010, many users registered with BibSonomy during the KDML conference in Kassel 2. The spam curve shows a first peak around June 2010; a second, more extreme peak appears in the fourth quarter of 2011. The actual causes of the different peaks are not clear. One possibility is that those peaks represent spam attacks where a few spammers register many times (under different names).

9.2 Framework Processes and Architecture

Figure 9.2 depicts the spam classification process by means of a BPMN collaboration diagram3. The collaboration represents the interactions between the three participants involved: the user, the framework and the administrator. For each of them, the activities they conduct

2http://www.kde.cs.uni-kassel.de/conf/lwa10/kdml 3http://www.omg.org/spec/BPMN/2.0.2/PDF/

along the spam classification process are clustered in an individual process, i. e., the user's is "Initial Login", the administrator's is "Check User" and the framework's is "Classify new Users". Each process consists of one or more starting events (the round circle at the beginning), several tasks and an ending event (the circle at the end). The different items are connected by directed solid lines showing the sequence flow. The tasks describe the specific activities conducted by the actor in the process. The data inputs and outputs which persist beyond the process are depicted by data stores. The focus of the figure, however, is on the description of the process, not on a model of the data. The following list describes each participant and its process.

Figure 9.2: The three actors (user, framework and administrator) of the spam classification process and their activities relevant to spam detection.

• A user (process at the bottom of Figure 9.2) initially signs up by entering standard registration information such as a name, username, e-mail address and a password. After having verified the e-mail address, users are able to log in to the system by entering their username, password and a captcha. Only after that can they


submit their first posts. User and post data are stored in a database.

• The framework (process in the middle of Figure 9.2) regularly (by default every 3 minutes) loads the data of all new users who have submitted at least one post. First, a user's e-mail address is checked: if it is a scholarly address, the user is marked as a non-spammer. Otherwise, users are classified by means of a classification model built from previous spam data. The classification itself is further described below (see Section 9.2.2). The classification results are stored in the database. They are shown in the admin interface and can only be changed by the administrators.

• Administrators (process at the top of Figure 9.2) start checking registered users either after having identified a suspicious user themselves (for example by viewing the BibSonomy entry page 4), when controlling the spam framework's classification, or after being asked by a user (normally by mail) who has been classified as a spammer but claims not to be one. Checking a user involves looking at the user's registration information in the administration interface and checking their posts (tags, titles, bookmarks). If, after having looked at those items, it is still not clear whether a user is a spammer or not, the administrator opens the original websites linked by the posted bookmarks and checks their content; a decision can normally then be taken. Depending on the classifier's decision, three actions can be identified. If the user is a spammer but has been classified as a non-spammer, the user is flagged as a spammer. If the user has been classified as a spammer but is not, the spam flag is removed. The framework also shows the degree of confidence; if the confidence value is below a certain threshold, the user is shown in a so-called "unsure" tab, and the administrator confirms the classifier's decision so that the user is removed from the set of uncertain users. If the classifier's decision was right and the confidence value exceeds the threshold, no action is required.

Figure 9.3 shows the framework classification in more detail in two lines from top to bottom. One can distinguish between two different “paths”: the generation of the training model (left side) and the classification of new instances (right side). We will describe both parts in the following section.

9.2.1 Generation of the Training Model

In order to generate a model, training instances are loaded from the database into the system. Basically, the training instances consist of all instances classified so far. Not all instances are selected for training, however. Usually, training examples where non-spammers own an extensive amount of publication posts are excluded, as the trained model would become biased by such examples. Additionally, it is possible to select instances within a certain time period (for example, all instances from the year 2011). After the selection of instances, features are extracted for each instance. The features to be used for classification are marked in an option file. The use of a feature in the training process mainly depends on its contribution to accurate classification results, but also on its computation effort. The features implemented correspond to the ones described in Section 7.3.1.

4http://www.bibsonomy.org
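To illustrate the kind of features involved, the following sketch derives registration-based features following the feature names of Table 8.3 (digit counts, field lengths, university e-mail). The exact definitions are those of Section 7.3.1 and the framework's Java implementation; the field names and the academic-domain list here are illustrative assumptions:

```python
def registration_features(user):
    """Sketch of registration-based features named in Table 8.3:
    digit counts and lengths of the user name, real name and e-mail
    address, plus a simple university-address check (unimail)."""
    def digits(s):
        return sum(c.isdigit() for c in s)
    # illustrative academic suffixes; the real white list is larger
    academic = (".edu", ".ac.uk", "uni-kassel.de")
    return {
        "namedigit": digits(user["name"]),
        "realnamedigit": digits(user["realname"]),
        "maildigit": digits(user["mail"]),
        "namelength": len(user["name"]),
        "realnamelength": len(user["realname"]),
        "maillength": len(user["mail"]),
        "unimail": int(user["mail"].endswith(academic)),
    }
```

Such cheap, registration-time features matter in practice because they are available before a user has submitted any post.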


Figure 9.3: Basic components of the spam framework

For classification, one can select between different methods such as logistic regression, SVM, Naive Bayes and a decision tree learner. Logistic regression was used most of the time, as it allows fast model generation, and its classification results were similar to those of other algorithms such as SVMs. The Weka framework 5 was integrated into the spam framework in order to use its standard classification algorithms. The output is a serialized model written to a file, as indicated by the arrow from the model to the instance classification in Figure 9.3.

9.2.2 Classifying New Instances

The input to the classification of new users consists of all instances which have not been classified before and which have at least one post. A white list filters out users with e-mail addresses from universities. From the remaining instances, the same features as those selected for the model generation are extracted. Each instance is then classified using the model. As mentioned in the description of the spam classification process (see Figure 9.2), the classifier does not only decide between spammer and non-spammer, but also computes a confidence score. With the help of this score it is possible to distinguish between certain and uncertain decisions. Overall, there are four options:

• Secure Spammer: The classifier has high confidence that the user is a spammer.

• Unsure Spammer: The classifier has flagged the user as a spammer but with a low

5http://www.cs.waikato.ac.nz/ml/weka/downloading.html


confidence score. An administrator needs to decide whether the user is truly a spam- mer or not.

• Unsure Non-Spammer: The classifier has flagged the user as a legitimate user but with a low confidence score. An administrator needs to decide whether the user is truly a spammer or not.

• Secure Non-Spammer: The classifier has high confidence that the user is not a spam- mer.

An unsure spammer or non-spammer classification needs to be reviewed by system administrators. The differentiation between secure and unsure classification results therefore helps administrators in their daily work: in order to check the automatic spam classification, they do not need to review the entire classification lists, but can concentrate on the unsure classification results. The classification results for different instances can be seen in Figure 9.4 (see Section 9.4).
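The four-way split described above can be expressed as a simple thresholding rule on the confidence score. The threshold value below is an illustrative assumption; in the framework it is a configurable setting:

```python
def triage(is_spammer, confidence, threshold=0.8):
    """Map a classifier decision and its confidence score to one of
    the four review categories (threshold value is illustrative)."""
    sure = confidence >= threshold
    if is_spammer:
        return "secure spammer" if sure else "unsure spammer"
    return "secure non-spammer" if sure else "unsure non-spammer"

def needs_review(category):
    # only the unsure decisions are shown to administrators
    return category.startswith("unsure")
```

Raising the threshold sends more users to the "unsure" tabs, trading administrator effort against the risk of silently misclassifying legitimate users.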

9.3 Implementation Details

The framework, implemented in the programming language Java 6, runs independently of the BibSonomy application. It is distributed as a compressed JAR (Java ARchive) file. The application accesses the BibSonomy database in read-only mode and selects training users as well as new, not yet classified users together with their related information such as tags, bookmarks and resources. Framework-related information such as specific settings, which can be submitted via the administration interface, is stored in a MySQL database7. The interaction between the databases and the application is managed by iBATIS 8 (which has since been migrated to MyBatis 9). The classification algorithms applied are provided by the open source machine learning software Weka (Waikato Environment for Knowledge Analysis) 10. The Weka library can be accessed directly from the Java application. Classification results are transmitted to BibSonomy using the system's API 11. If a user is a non-spammer, the spammer field in the BibSonomy database is set to 0; if a user is a spammer, it is set to 1. Additionally, each post has a specific number associated with it, indicating whether it is a private or public post. If a user is identified as a spammer, this number is converted into a specific spam public or private number. When BibSonomy displays a page, such as the most recent entries, the application collects only those posts which do not have a spam number. Spam posts are therefore hidden from public view.
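The post-visibility mechanism described above can be sketched as follows. The numeric group codes are placeholders; the actual values (and the Java/SQL implementation) are internal to the BibSonomy database:

```python
# illustrative visibility codes; the real values are internal to BibSonomy
PUBLIC, PRIVATE, SPAM_PUBLIC, SPAM_PRIVATE = 0, 1, 10, 11

def flag_spammer(posts):
    """When a user is flagged as a spammer, convert each post's
    visibility code to its spam counterpart so that public pages
    (e.g. the most recent entries) skip it."""
    shift = {PUBLIC: SPAM_PUBLIC, PRIVATE: SPAM_PRIVATE}
    for post in posts:
        post["group"] = shift.get(post["group"], post["group"])

def visible_on_public_pages(posts):
    # public pages collect only posts without a spam code
    return [p for p in posts if p["group"] == PUBLIC]
```

Keeping spam posts in the database under a shifted code, rather than deleting them, makes the decision reversible when an administrator removes the spam flag.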


Figure 9.4: Administrator interface used to flag BibSonomy users as spammers or legitimate users. One can see the different tabs; data is shown from the tab "Classifier: Spammer (U)", i. e., the unsure classification results of spammers.

9.4 Framework Interface

The main interface is shown in Figure 9.4. The interface offers different tabs summarizing information about different BibSonomy user groups. For example, instances of the four classification categories defined in Section 9.2.2 can be viewed in individual tabs. Furthermore, users who registered recently or users who submitted BIBTEX posts can be viewed separately. For each user appearing in one of the tabs, the username, name, IP address, e-mail address and registration date are shown. This information is normally sufficient to get a feeling for whether a user is a spammer or not. Additionally, information about the framework's classification is given: administrators can view the type of classification algorithm, the classifier's confidence and the date when the classification status was last updated. Clicking on the small icons to the right of the username reveals more details: a popup presents the last four posts of the particular user and the number of BIBTEX entries. Sometimes, administrators require specific spam information about a user; they can type the username into the User info box and get the same user information as shown in the main interface. Finally, if administrators want to change specific settings, for example the classification algorithm, they can specify the information in the Settings box on the right-hand side of the user interface.

6http://www.java.com 7http://www.mysql.com 8http://ibatis.apache.org/ 9http://code.google.com/p/mybatis/ 10http://www.cs.waikato.ac.nz/ml/weka/ 11http://www.bibsonomy.org/help/doc/api.html


9.5 Summary

The BibSonomy spam framework automatically classifies users by extracting different user features and applying a classification algorithm from the machine learning software Weka, establishing a simple workflow to classify spam. The framework offers administrators a simple interface to flag or unflag users. Due to the amount of spam in BibSonomy, the framework is an essential part of the BibSonomy application. Further development of the framework has been carried out based on the methods explored in Chapter 7, i. e., if better features and algorithms are found, the framework can be improved. However, a few aspects need special consideration:

• Not only is the improvement of accuracy important; the computation of features for new users and their classification also need to perform at run time. Long classification times would lead to a slower flagging of new users, and the BibSonomy home page showing the latest posts would be heavily spammed if the identification of spammers were not performed quickly enough.

• The amount of information about users available in the framework is not the same as the information comprised in the dataset. The decision whether a user is a spammer needs to be taken as soon as possible in order to prevent the publication of spam posts. Therefore, user characteristics need to be based on features exploiting data from the beginning of a user's interaction with the system (such as registration data).

• Though most of the evaluation measures and experiments try to minimize the problem of false positives, the false classification of legitimate users remains a problem. There are, however, some mechanisms to cross-check users: for example, users who post BIBTEX entries after having been classified as spammers are considered again in the classification. Nevertheless, a real-time improvement of the false positive rate is critical to gaining new system users.

Besides classification performance, future work needs to consider the implementation of further framework features. For example, a website evaluating the classifier online would help to track performance. Mechanisms to reclassify users (and ways to select the ones which should be reclassified) could be explored further. Finally, the usability can be improved by transferring the framework's administration to the interface (currently an option file needs to be edited) or by exploring new GUI techniques to better present a user's characteristics.

Chapter 10

Case Study of Data Privacy in BibSonomy

In Chapter 8 we analysed the performance and privacy aspects of applications detecting spammers in social bookmarking systems, using the system BibSonomy. In this chapter we extend the discussion of handling private data to social tagging systems in general. The chapter is a slightly modified translation of a study published in German in [Krause et al., 2012], the result of a research collaboration of Andreas Hotho, Gerd Stumme and the thesis' author (technical side) and Hana Lerch and Alexander Roßnagel (legal side).

10.1 Introduction

The nature of social tagging systems, namely the annotation and sharing of resources, generates huge amounts of user-generated data which is publicly available. Depending on the tagging system, such data includes profile data, where people publish their names, biographical information, opinions and interests. The main kind of data, however, are annotated resources such as bookmarks, photos or publications. These items often describe user interests, opinions and personal relationships. Because of the public sharing of such entries, they are available to anyone accessing the Internet.

Discussions exist as to what extent such data can be used for improving search results, personalized advertisements or the creation of profiles of Internet users (see for example [Heymann et al., 2008; Kolay and Dasdan, 2009a]). However, such analyses often omit a careful consideration of the personal sphere of the users, providers and third parties involved. The development, operation and usage of social tagging systems touches the data privacy of individuals and institutions. Such infringements lead to legal questions, especially in the area of data privacy, but also in the fields of copyright, competition regulations, protection of minors and criminal law. In many cases, a professional knowledge exchange between technical providers and legal practitioners, resulting in a common design of new applications (in our case social media applications), is missing.

The following study is an example of how this gap can be filled for social tagging applications, by conducting a technical and legal examination of typical social tagging functionalities with reference to the existing social bookmarking application BibSonomy (see Chapter 2.6). All fundamental aspects, from the registration of a user, over the different possibilities of using social tagging systems, to the termination of membership, will be discussed with respect to legal data privacy issues which may arise by taking part in such services.

10.2 Data Privacy Analysis

The different functionalities and legal issues discussed concern the registration process, the storing of posts and publication metadata, the search of system content, the forwarding of data to third parties and the termination of a user's membership. As BibSonomy is operated by a research institution (see Chapter 2.6 for background information about the system), the specific legal situation of such systems is briefly discussed at the end of this chapter. The legal discussion is based on the privacy terms and the German privacy law as introduced in Chapter 5. Each of the following paragraphs first describes the functionality in question, then analyses the data used with respect to data privacy issues, and ends with a recommendation on how to respect privacy regulations while implementing and offering the functionality.

10.2.1 Registration

System Description

The first step towards being able to use BibSonomy is the creation of a user account. A registration form asks for the desired user name, the real name, an e-mail address, an optional existing homepage and a password. Users also need to enter a captcha to successfully submit their information. Required data fields (e. g., user name, e-mail address and password) are marked with a star. Additionally, users have to acknowledge that they have read the general terms and conditions of BibSonomy and its privacy statement. Once this is acknowledged, the user's e-mail address is verified and access to the system is given. The profile data collected during the registration process can be administered by means of a user's settings page. Users can decide which data is public, and they can delete, change or add information. For instance, users can publish their real name and a link to their homepage so that they become visible to other users in the form of a digital curriculum vitae (a functionality offered by BibSonomy especially for scientific users) or in their public profile.

Legal Situation

According to § 12 Abs. 1 TMG (see Section 5.2.2), a service provider is only allowed to collect and use private data to offer some kind of telemedia service if this is permitted by an explicit regulation or if the user has agreed to the collection and usage of this data beforehand. The fact that users voluntarily submit information about themselves, for example via a registration form, cannot be considered as written consent. Without an explicit user agreement, the usage of private data needs to conform to § 14 Abs. 1 TMG: "The service provider is only allowed to acquire and use the personal data of a user as far as it is required for the establishment, content-related design or change of a contractual relationship between the service provider and the recipient concerning the usage of telemedia".
Consequently, only user-related inventory data which is required for the creation, design and modification of the contractual relationship concerning the usage of telemedia is allowed to be collected by the service provider, and only for this purpose. The decision whether personal data is required to establish the contractual

relationship depends on the situation. For instance, a real name and an account number need to be collected in order to register for a non-free service. In case a service specializes in providing personalized features to a user, the information required to enable such personalization can be collected and used for this purpose. Usage for another purpose is only allowed if the user explicitly permits it. If a system does not offer personalization, only the data required to create a password and user account may be acquired. The collection and usage of a real name in order to publish it in the scope of an agreed-upon functionality, such as the digital curriculum vitae, is therefore allowed. If a user has agreed to certain e-mail-based services, such as being informed about news in the system, technical disruptions or information about one's user account, the collection and use of a user's e-mail address is also allowed. However, the e-mail address cannot be used for any other purpose without explicit permission.

Guidelines

According to the principles of data minimization and avoidance, providers are supposed to collect and store nothing more than the details absolutely essential to provide the service. For each particular case, one has to analyse which data is necessary to provide the service agreed upon between service provider and user. Only the required data can be collected and used, and only for the specified purpose. Depending on the degree of personalization, one service may require more information about a user than another. Data which is not strictly necessary for the provided service can only be collected and processed when an explicit user permission has been obtained.

10.2.2 Spam Detection

System Description: Another reason why service providers may collect and process inventory data is the detection of user accounts created for the purpose of spamming (see Chapter 8 for an analysis of registration features supporting spam detection). Users posting spam entries need to be identified as soon as possible in order to remove their posts from publicly available web pages. The spam filter applied by BibSonomy (see Chapter 8 and Chapter 9) uses features created from data of the registration process such as the e-mail address or name. Names containing numbers tend to point to a spammer, while accounts where a full real name (first name and family name) has been submitted via the registration form are more often accounts of legitimate users. Spammers often add many tags to their posts in order to make them easily retrievable for other users. Furthermore, spam users use a specific vocabulary which differentiates them from legitimate users. Features inferred from such characteristics help to detect spammers automatically, especially if a new user has submitted only a few posts. Moreover, the probability that a user who registers with an e-mail address from a university is a legitimate user is very high; normally, no further spam features need to be analysed in this case.

Legal Situation: A service provider does not have to accept that the provided system is impeded or even disrupted by the destructive actions of spammers. In order to maintain the system, it is justifiable that the provider conducts appropriate actions with user data for the purpose of identifying disruptive users. However, the fight against spammers needs to take place within boundaries in order to protect the rights of legitimate users. Whether a specific kind of inventory data can be stored and used for spam detection depends on its contribution to the spam detection process. The more it helps to distinguish spammers

155 CHAPTER 10: CASE STUDY OF DATA PRIVACY IN BIBSONOMY

from legitimate users, the more necessary it becomes for the process. Service providers should rely on relevant research results where they exist. This thesis introduced an analysis of the contribution of features based on different kinds of data, including inventory data, in Chapter 8.
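The contribution of such registration features can be illustrated with a small sketch. The feature names, the heuristic list of university domain markers and the example inputs below are illustrative assumptions, not the actual BibSonomy filter:

```python
import re

# Assumed heuristic markers for university e-mail domains; illustrative only.
UNIVERSITY_DOMAIN_MARKERS = (".edu", ".ac.uk", "uni-")

def registration_features(name: str, email: str) -> dict:
    """Derive simple spam-indicative features from registration data."""
    _, _, domain = email.partition("@")
    return {
        # Names containing digits tend to point to a spammer.
        "name_contains_digit": any(ch.isdigit() for ch in name),
        # A full real name (first name plus family name) suggests a legitimate user.
        "has_full_name": len(name.split()) >= 2 and not re.search(r"\d", name),
        # A university e-mail address strongly indicates a legitimate user.
        "university_email": any(m in domain for m in UNIVERSITY_DOMAIN_MARKERS),
    }

print(registration_features("john83", "john83@spamhost.biz"))
print(registration_features("Jane Doe", "doe@informatik.uni-kassel.de"))
```

In a real filter, such binary indicators would be fed, together with tag- and vocabulary-based features, into a classifier as analysed in Chapter 8.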

10.2.3 Storing Posts

System Description: One of the first interactions of registered users with BibSonomy is the storing of resources in the form of bookmarks for web pages or of publication metadata. In the first case, the provider stores a bookmark for a web page and the user adds descriptive tags to it. Different possibilities exist to store information about a publication: for example, the relevant data can be extracted automatically from an external web page where the reference is marked, or scrapers automatically extract publication metadata from a website with a predefined format (often the case with digital libraries). On top of the extracted metadata such as author, title, publication medium or the digital object identifier (DOI), additional user-relevant information such as tags, descriptions or comments can be added. The service provider stores the submitted data in order to make it accessible to the posting user and for general public access on the Internet.

Legal Situation: Since the storing of bookmarks and publication metadata (i. e., content data) can be considered the main purpose of using the social bookmarking and publication sharing system BibSonomy, and the transmission of the required data by means of the Internet is necessary for the collection and storage of this data, the service provider is allowed to store and process the data for this purpose (§ 28 Abs. 1 S. 1 Nr. 1 BDSG). In social bookmarking systems, however, content data as described above is often used by data mining applications such as recommendation services or spam filtering. The analysis of content data in order to detect possible spammers can be seen as acceptable by law. The detection of spam helps maintain a high-quality service for legitimate users. It preserves storage capacity which can in turn be used for non-spam posts.
Hence, the processing of content data to build features for spam detection is necessary in order to offer the service a user expects from the provider and has agreed to when submitting his or her posts. In order to decide whether content data can be processed for data mining applications other than the automatic detection of spam, one needs to determine whether the usage of such data serves to establish the functionality the user has previously agreed to. If this is the case, the usage of content data for (internal) data mining is allowed. Depending on the contractual relationship, such data mining applications include recommendation algorithms for the suggestion of tags, resources or other interesting users. The assumption, however, that data mining services based on content data are of interest to the concerned user is not enough. Instead, a user has to be explicitly aware of the specific functionality when selecting the service. Considering § 28 Abs. 1 S. 1 Nr. 2 and Nr. 3 BDSG, usage of data for another (not agreed upon) service can be allowed when there is no conflict with legitimate user interests. However, this depends on the particular situation and needs to be checked on a case-by-case basis. One cannot globally justify the usage of content data for improving the functionality of online applications. The creation of user profiles with the help of content data in order to provide personalized advertisements therefore normally requires the explicit agreement of the user (according to § 28 Abs. 3 BDSG).


Guidelines: Functionalities of a social bookmarking system requiring content-related data stored by users have to be explicitly agreed upon by the user when the contractual relationship between user and provider is formed. This could happen through a listwise enumeration of functionalities and data used during the registration process. Otherwise, users or providers might encounter uncertainties regarding non-specified functionalities. In the worst case, certain functions could be deemed illegitimate regarding the scope of the service, meaning that the providers have to obtain explicit permission in order to process user-related data for the functionality in question. This applies irrespective of the fact that data usage has to be defined in the scope of the data privacy statement.

10.2.4 Storing and Processing Publication Metadata

System Description: Sometimes parties complain about posts of other users in which they are mentioned (for example as authors). Mostly, such complaints are submitted to the service provider via e-mail. Typical examples are

• misspelled author names

• wrongly mentioned or missing author names

• wrong name of the publishing media (the journal or conference).

In most cases, the complaining party asks the service provider to correct the (in their opinion) wrong information. Though this is technically possible, it contradicts the nature of self-administering social bookmarking systems: normally, only users, not the providers, are allowed to update user posts.

Legal Situation: Metadata can be considered content-related data of the system's user. At the same time, it can be seen as a description of the personal situation of the referenced authors and possibly of further individuals such as co-authors or publishers. Provided that such individuals can be identified through the published data, alone or in combination with further information, the data needs to be considered personal data. Often, search engines help identify people using such metadata. According to § 35 Abs. 1 BDSG, incorrect personal data needs to be corrected. If metadata is incorrect, it has to be revised by the entity which is responsible. § 3 Abs. 7 BDSG characterizes this entity as the person or institution that collects, processes or uses privacy-related data for its own purposes or for third parties. In the context of "user-related content" it is not yet clear which party is responsible: the content-posting user or the system provider. There is good reason, however, to consider both responsible. The content-posting user is responsible because he or she selects the data and expects the provider to publish and share it. At the same time, the service provider has technical access to the data and provides the infrastructure to make it publicly available to other people. Providers cannot hold the user exclusively responsible for erroneous data, but need to correct it once they are aware of the problem. However, they are not required to intervene if the effort to find the correct facts is disproportionately high. Often, a search engine request already helps to find the correct information.
Incorrect names or missing authors might be clarified by checking an identity card of the author in question. If providers still have doubts about a correction request after having checked it with their available means,

they need to prevent the data in question from being publicly available until the problem is solved.

Guidelines: Since it has not yet been clarified who is responsible for correcting private data, a provider should try to correct such data in his or her own interest. A first step would be to inform the user who posted the incorrect data. If he or she does not react or refuses to help, providers should block the publication of the incorrect data or change the data themselves. If it is not clear whether a request to change certain data is justified, the provider needs to clarify the issue. Again, if the issue in question cannot be clarified, the provider should block the data.

10.2.5 Search in BibSonomy

System Description: Similar to search engines, social tagging systems help their users to find interesting and relevant information. The process of finding information is enabled by a specific navigation structure, which helps the user to easily browse through the data. Tags added to a certain post as well as the post's users are linked. Beginning with a post of one's own or with a general page (for example the most popular or the most recent posts), users can click on the tags and users related to the post and find other related information. Cattuto et al. [2007] experimentally showed small-world characteristics for folksonomies: a user can reach a totally different topic than the starting one within a few clicks. In BibSonomy, the number of steps to reach another topic is about three [Cattuto et al., 2007]. In addition, strongly connected points are mostly related in their content. As a consequence, users can find interesting information by following the links of a specific entry. In order to find specific information, social tagging systems offer a search function which lists posts related to a search request. Often the list is ordered according to relevance.

By sharing information, browsing and searching the system, users leave traces. For example, users click on a specific link, enter a search term or copy a publicly available post to store it in their own profile. Such interactions with the system can be recorded. Technical means for this are the protocols of the web server containing IP addresses, request dates or the referrer, JavaScript functionality to keep track of user interactions with the system, and cookies providing an identification of user sessions.
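The recording of such traces can be sketched by parsing a web server log line. The line below is a made-up example in the widely used "combined" log format; the field layout follows that convention, not BibSonomy's actual server configuration:

```python
import re

# Regular expression for the common "combined" web server log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Invented example line; IP, date and referrer are the kinds of usage data
# mentioned in the text.
line = ('10.0.0.1 - - [27/Apr/2015:10:00:00 +0200] '
        '"GET /tag/folksonomy HTTP/1.1" 200 512 '
        '"http://www.bibsonomy.org/" "Mozilla/5.0"')

m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("date"), m.group("referrer"))
```

Session cookies and client-side JavaScript events would add further records of the same kind; all of them fall under the usage-data rules discussed below.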
In contrast to traditional search engines, the information collected within a session (considered usage data) can not only be assigned to a specific cookie id, but often to a registered user who was logged into the system when generating the usage data (see the description of click and tagging data in Section 6.4). Such assignments between usage data and user names allow the construction of user profiles, which can in turn be used to improve the system service (for example by providing a personalized ranking for the profile's owner). Provided the users agree to it, the data can also be forwarded to commercial companies, which provide personal advertisements based on such profiles. In BibSonomy, usage data is not made available to commercial companies.

Legal Situation: Usage data and usage profiles can be seen as private data if a service provider can assign such data to a real person. For example, this is the case when a user's registration data contains a real name or an e-mail address with the full name. Another way of identifying users is to examine their posts (especially publication posts); often those posts contain information which identifies a user. If a provider does not technically prevent such identifications, all usage data of the provider needs to be considered private data.


Even if single user accounts do not allow the identification of the user, the data remains private data, as it is impossible for the provider to distinguish between identifiable and non-identifiable accounts. According to § 15 Abs. 1 S. 1 TMG, privacy-related usage data can only be collected and processed without the explicit consent of the user if it is required for using the service. Such data includes the IP address of the user's computer or the start and target pages of navigation requests, as this data is technically necessary. A cookie can be considered necessary for certain functionalities, for example to maintain certain user settings within a session.

The creation of user profiles or the use of data mining techniques with usage data in order to realise individual recommendations or rankings is only permitted if such features have been agreed upon in the service contract. In particular cases, it can be difficult to decide whether a certain functionality is part of such a service contract or not. As long as a functionality is not part of the basic functions an average user would expect, the functionality needs to be mentioned in the registration process so that users can decide for themselves if they are willing to provide data for it. Data that is not automatically accumulated through system usage or is not required for the interaction with the system (such as data of a clickdata analysis) is not included in the regulation just mentioned and therefore cannot be processed without the explicit permission of the user. Only if the processor can technically assure that data cannot be allocated to a person or a user account can such data be processed. Usage data which is legally collected can be used for advertisement or for developing a user-tailored service according to § 15 Abs. 3 TMG, as long as it is possible to pseudonymize the data so that profile data and identification data cannot be recombined.
Regardless of the pseudonymization, providers creating user profiles still need to inform their users in advance about their intention and about the user's right to object. If the concerned user objects, the provider is not allowed to create profiles.

Guidelines: The service provider is only allowed to collect and process private data if the data is required for successfully operating the service as agreed upon between user and provider in advance. Features which exceed basic system operations need to be presented to the user during registration. The data can be used for the creation of user profiles or for advertisement purposes as long as the data is pseudonymized and the user does not object; the user needs to be informed about his or her right to object. Other data – such as clickdata to improve ranking algorithms – can only be used if the user agrees to such functionalities or if the provider can technically assure that the processing is anonymous, i. e., no data can be traced back to a particular user.

10.2.6 Forwarding Data to a Third Party

System Description: BibSonomy offers an application programming interface (API) which allows other systems to request BibSonomy data and process it individually. The API follows the REST architectural style [Fielding and Taylor, 2002], providing the typical HTTP verbs GET, PUT, POST and DELETE to conduct different actions on the concerned URLs. For example, one can request a list of all tags of the system by visiting the URL http://www.bibsonomy.org/api/tags. The URL http://www.bibsonomy.org/api/posts?resourcetype=bookmark&search=folksonomy would deliver entries

which contain the term folksonomy in the title, in the list of tags or in the description of the post. The response in XML format possibly contains user-related data, such as a user name, which can be assigned to a real person. As such information is publicly available, the API is not the only way to retrieve it: one can simply obtain user names and other data by opening the system's interface in a browser (choosing an arbitrary format such as XML) and exporting the data.

Legal Situation: The release of system-related data to (registered) third parties by means of an API can be considered data transmission. Such a transfer of partly-personal-related data is acceptable if it is part of the agreed functionality between user and provider. This is the case if the transmission has either been explicitly agreed upon or if such functionality is typical for the specific type of system and an average user knows about it. Systems whose users do not naturally expect the transmission of their data via an API need to explicitly reach an agreement with the concerned users. APIs can be seen as a central part of Web 2.0 systems, among which tagging systems can be counted. If such systems additionally include the provision of publicly available content in their scope of operation (for example, BibSonomy offers different export functionalities to create publication reference lists), the transmission of such data by means of an API is permitted.

Guidelines: If an API can be considered part of the agreed-upon system functionality, it can be provided without legal restrictions. Apart from that, service providers need an explicit user agreement or have to apply techniques to anonymize the data to be transferred beforehand, so that no link between a user account and the data transferred via the API can be made.
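A minimal client request against the API described above could be built as follows. The endpoint path and the parameters resourcetype and search are taken from the example URLs in the text; any authentication the real API may require is omitted here:

```python
from urllib.parse import urlencode

# Build the example query from the text: bookmark posts matching "folksonomy".
base = "http://www.bibsonomy.org/api/posts"
params = urlencode({"resourcetype": "bookmark", "search": "folksonomy"})
url = base + "?" + params

# The response would be XML and may contain user-related data such as
# user names. Fetching it (network access required) would look like:
#
#     from urllib.request import urlopen
#     with urlopen(url) as response:
#         xml_payload = response.read().decode("utf-8")

print(url)
```

Exactly because such a GET request is all that is needed, the transferred XML must either be covered by the agreed-upon functionality or be anonymized beforehand, as the guidelines above state.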

10.2.7 Membership Termination

System Description: If users want to terminate their system membership, they can delete their account via the settings interface. BibSonomy then disables the account so that posts can no longer be viewed by other system members or by the registered user, and the user can no longer log in to post new entries.

Legal Situation: With the membership termination, the main purpose for storing and processing personal data no longer applies. Inventory data therefore needs to be deleted. Aside from individual cases where specific data (for example photos or videos) need to be deleted for copyright reasons, the further usage of content-related data has to be explicitly approved by the user. Otherwise a service provider has to demonstrate justifiable interests according to § 28 BDSG and that no contradicting prevailing interests of the user exist. However, the deletion of a user account can be seen as evidence that the user wants to end his or her "personal" relationship with the provider without leaving personal data for free use. Unless it can be guaranteed organizationally and technically that no personal relation between data and user can be made, content data needs to be deleted. Possible user profiles related to a pseudonym also have to be deleted or irrevocably anonymized, as the termination of membership can be considered the execution of the user's right to object. In case of doubt, a user will not assume that a further statement is necessary.

Guidelines: After deleting a user account, no personal data is allowed to remain in the system: it has to be deleted or – if possible – anonymized. This means that identifying attributes such as the real name, the e-mail address or, because of its controversial classification, the IP address need to be deleted. The rest can be stored individually or under

a pseudonym which cannot be traced back to the user. However, if the remaining profile still offers the possibility of identifying a user, the data has to be deleted.
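The deletion guideline can be sketched as follows. The record field names and the choice of a random token as pseudonym are illustrative assumptions, not BibSonomy's actual schema:

```python
import secrets

# Identifying attributes named in the guidelines: real name, e-mail address
# and (because of its controversial classification) the IP address.
IDENTIFYING_FIELDS = ("real_name", "email", "ip_address")

def anonymize_account(account: dict) -> dict:
    """Strip identifying attributes; keep the rest under an untraceable pseudonym."""
    cleaned = {k: v for k, v in account.items() if k not in IDENTIFYING_FIELDS}
    # A random token, unlike a hash of the user name, cannot be recomputed
    # from the original identifier and thus cannot be traced back to the user.
    cleaned["user"] = secrets.token_hex(8)
    return cleaned

account = {"user": "jdoe", "real_name": "Jane Doe", "email": "jdoe@example.org",
           "ip_address": "10.0.0.1", "registered": "2015-04-27"}
print(anonymize_account(account))
```

If the remaining fields still allowed re-identification (the case raised at the end of the guidelines), the record would have to be deleted outright rather than pseudonymized.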

10.2.8 BibSonomy as a Research Project

System Description: The social bookmarking system BibSonomy is operated by a research institution in order to put scientific findings into practice. A major research interest in tagging systems is the exploration of their specific lightweight knowledge representation, the folksonomy structure: each user describes resources individually with arbitrary tags. By means of overlapping resources and tags, new relationships evolve between tags, users and resources. The analysis and exploitation of such relationships is essential for a number of features such as spam detection (presented in Chapter 7) or ranking (presented in Section 6.5). Further research areas concern the structuring of tagging vocabulary (for example via the automatic extraction of synonym/hyponym relations), recommendation systems to recommend tags or resources, and clustering to find user communities (an overview of research in this field is given in Chapter 2).

Algorithms for the above-mentioned functionalities are developed in parallel to the further implementation of the system. Often it is not obvious beforehand which data will be useful for which functionalities. For example, only after some time did it become clear that the differences between spam and legitimate e-mail addresses contribute significantly to the fast exclusion of spammers. It is likely that commercial providers experiment with data and algorithms in a similar way. Due to various requests from other research institutions, the BibSonomy team decided to publish a benchmark dataset consisting of the public posts (i. e., content data). Such a dataset enables external researchers to use BibSonomy data for their research purposes. The user data was pseudonymized by assigning an identification number to each user name. Consequently, a direct connection from the data to a user in BibSonomy is impossible.
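This pseudonymization step, replacing each user name by an identification number, can be sketched as follows; the post fields are illustrative:

```python
def pseudonymize(posts):
    """Replace user names by stable identification numbers, assigned in order."""
    ids = {}
    for post in posts:
        # Assign the next free number on first sight; reuse it afterwards.
        user_id = ids.setdefault(post["user"], len(ids) + 1)
        yield {**post, "user": user_id}

posts = [
    {"user": "alice", "tags": ["folksonomy", "search"]},
    {"user": "bob", "tags": ["spam", "svm"]},
    {"user": "alice", "tags": ["ranking"]},
]
print(list(pseudonymize(posts)))
```

Note that this step alone does not address the re-identification risk discussed next: a user with very distinctive tags or posts may still be recognizable by matching the dataset against the public system.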
In some constellations, however, a system user can be identified by comparing the posts in the dataset with the published posts in BibSonomy. This is especially simple if a user has very specific tags or posts. Without appropriate techniques which alter the data in such a way that it cannot be related to system users without disproportionate effort, all data in the dataset needs to be considered personal data. Researchers interested in the benchmark data need to sign a licence agreement which, among other things, ensures that the dataset is only used for the purpose of research and not further published by the licensee.

Legal Situation: The operation of BibSonomy as a research project of a Hessian university is subject to the Hessian privacy law (HDSG), especially § 33 HDSG. This law replaces the privacy regulations of the less specific TMG for usage and inventory data, because the operation of the telemedia (the BibSonomy system) is in itself a research object and the data required from its users is analysed and processed for scientific purposes. The scientific purpose extends the scope for which inventory and usage data can be collected and processed, in contrast to the scope defined by the regulations of the TMG that non-scientific service providers need to follow. The public entries of users (content data) do not fall under any privacy-related restrictions according to § 3 Abs. 4 HDSG and can be collected and processed without restrictions as long as they are published or planned to be published by the user. As a matter of principle, the data should, if possible, also be anonymized and pseudonymized in the scientific field. The collection of data without

a specific purpose is prohibited in the scientific world as well. The reason for obtaining the data needs to be determined before the data is collected. It is the nature of research, however, that single research questions might change or be extended over time.

10.3 Discussion

Citizens are protected by the right to informational self-determination. Concrete regulations are defined for content data in the BDSG and for inventory and usage data in the TMG. Especially with new Internet developments (in our case the voluntary publication of partly-personal-related data in social tagging systems), it is not always clear whether and how to protect the data and their users. This chapter demonstrated, by means of an interdisciplinary analysis of the social bookmarking system BibSonomy, that data privacy which considers the basic legal conditions is possible in social tagging systems.

Our findings show that content data can be processed without problems if the processing is necessary for the system's service (the contractual relationship between service provider and service user). Since tagging systems are based on the social interaction of many users, the publication of user entries is an essential system function. It is questionable whether functionalities such as the detection of spam or the improvement of the search process are part of the main features of a social tagging system. Likewise, the handling of inventory and usage data needs to be considered carefully. Inventory data is used for the creation and design of a contractual relationship, while usage data enables the usage of the telemedia service. Generally, the collection, storage and further processing and usage of such data is only permitted for those purposes, and only in case it is required. This has to be verified for each of the considered data types. Different options can be considered apart from the simplest method: abstaining from collecting the data. A service provider can provide transparency by informing the user about the different functionalities offered and the scope and purpose of data collection and usage. If data is used beyond the scope and purpose contained in the contractual relationship, a provider needs to obtain the user's written consent.
One idea could be to allow users to select the functionalities for which they are willing to provide their personal information. In any case, data should be anonymized and – if no longer required – deleted.

Contemporary legal and technical activities do not consider data voluntarily and systematically published by users in Web 2.0 applications. It is difficult to foresee the consequences of such disclosures. In order to protect the right to self-determination in Social Web applications, one needs to consider a long-term technical and legal system design which discourages users from thoughtlessly publishing data or which enables users to delete publicly available data. Different possibilities are already being discussed: the use of different privacy levels could allow data to be only selectively visible and not generally available to all users. An automatic "clean-up" function which deletes user entries after a certain time and in agreement with the user could help to remove or hide neglected data. The incentive to implement such features, however, depends on the demand of system users as well as on legal guidelines, which need to be aligned with today's Social Web.

Chapter 11

Conclusion and Outlook

This thesis presented three research topics, i. e., social search, spam detection and data privacy in social bookmarking systems. Regarding social search, we compared tagging systems and search engines with respect to structure, content and search behaviour. Methods and applications to prevent spam in social bookmarking systems were analysed and evaluated. Finally, data privacy aspects in social bookmarking systems were analysed in detail and a solution considering privacy issues in a data mining application was presented. Each of the three research fields offers various topics for further investigation. Summaries of each topic and the main ideas for further activities are given in the following sections.

11.1 Social Search

Social tagging systems and search engines stem from different paradigms, namely the publishing and processing of user-generated content (social bookmarking systems) versus the publishing and processing of automatically generated indices (search engines). In this thesis we compared the two systems with respect to user behaviour, structure and semantics. We could show that similarities between both systems exist. However, due to the different processes of searching and tagging, the systems are not equivalent. The main findings of the research are:

• Frequently used tags and queries are correlated over time. People's interests are therefore reflected in the tagged resources as well as in the searched resources. Current events and trends show up in both systems.

• Though frequent terms overlap, one can still distinguish differences in the usage of tags and search terms. For example, search terms often include URLs, while tags reflect personal classification systems, which are difficult to use for formulating a more general information need (for instance, tags such as "funny" or "myown" are often used).

• Prominent resources in both systems overlap. In which system popular web resources show up first has not been analysed.

• Clickdata exhibits a folksonomy-like structure (a "logsonomy"), which shows similar structural characteristics as folksonomies do. As a consequence, many of the

163 CHAPTER 11: CONCLUSION AND OUTLOOK

algorithms tuned to make use of folksonomy data can be applied to logsonomies as well. Search engine providers can use such information to improve search and other services. The study has also shown, however, that logsonomies are not as coherent as folksonomies. This can be explained by the fact that a user in the logsonomy is identified by a session ID, while users of a folksonomy have a registered account and therefore a unique username [Benz et al., 2010b]. Furthermore, no semantic analysis of query terms forming compound words or URLs has been conducted. Instead, query terms have been treated like tags insofar as they were split into single terms using whitespace as the delimiter. Finally, clickdata is more prone to errors than folksonomy data: users might click a resource but then find out that the information given is irrelevant to their information need.

• Logsonomies built from clickdata provide inherent semantics which can be captured by appropriate data mining techniques. However, the challenges differ slightly from those of folksonomy data due to the different interaction with the search system. In search engines, the user first formulates an information need and afterwards receives a list of search results. In a folksonomy, a resource is described using the specific tags which best represent the resource's content and context.

As our study was conducted a few years ago, a repetition with today's system features and present-day data could yield further answers to some questions: Do the results change? Is the overlap of resources in rankings even bigger? Which of the two systems can better cover long-tail queries? Do certain features introduced in social bookmarking systems (for example recommender services) and search engines (for example the query expansion tool) influence user behaviour or system semantics in a similar way?

The last part of our analysis of search engines and social bookmarking systems (see Section 6.5) focused on integrative aspects: how can both systems benefit from the data produced in each system? One concrete scenario we considered was the usage of tagging data for implicit feedback generation in order to evaluate search algorithms. The preliminary results raise new questions which are to be explored in the future:

• It has been shown that other learning-to-rank algorithms outperform Ranking SVMs. Other methods, such as listwise approaches, need to be applied to see if the performance can be improved.

• The question concerning the overlap between the two systems needs to be examined in more detail. In the experiments carried out in this thesis, a URL was classified as being of interest if it contained a tag corresponding to a query term. However, as tags and query terms differ, those differences need to be integrated into the overlap computation, for instance by finding similar tags.

• Another interesting question is whether folksonomy data can be used to improve search engine evaluations. Can we better detect errors in human-labeled datasets when we use the feedback of folksonomy data?

• A user study manually comparing feedback derived from click data and feedback derived from folksonomy data would help reveal further qualitative differences between the two approaches. Such studies could include further methods for deriving feedback data; for example, relevance feedback has been generated by crowdsourcing activities, where unknown individuals produce labels mostly in exchange for micro-payments [Kazai et al., 2013].

• In Doerfel et al. [2014] the authors present an analysis of resource and tag sharing behaviour and system usage based on a log file derived from the social bookmarking system BibSonomy. In the light of logsonomies, a comparison of log data from click files and from social bookmarking systems would be of interest: do properties derived from the log files of a folksonomy reflect more the properties of a logsonomy or those of the folksonomy the click data is based on? Can the clicks provided by the folksonomy log file be of use for improving search algorithms?
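One way to find similar tags, as suggested above, is to compare the contexts in which they occur. The following sketch (toy posts and hypothetical helper names, not the measures actually evaluated in the thesis) scores two tags by the cosine similarity of their co-occurrence count vectors.

```python
import math
from collections import Counter

def cooccurrence_vectors(posts):
    """Map each tag to a Counter of the tags it co-occurs with."""
    vectors = {}
    for tags in posts:
        for tag in tags:
            vectors.setdefault(tag, Counter()).update(tags - {tag})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Toy posts: each set is the tag assignment of one bookmark.
posts = [{"python", "programming"}, {"python", "coding"},
         {"programming", "coding"}]
vecs = cooccurrence_vectors(posts)
# "programming" and "coding" share the context tag "python".
print(cosine(vecs["programming"], vecs["coding"]))
```

With such a score, a URL could be counted as relevant to a query term not only on an exact tag match, but also when one of its tags is sufficiently similar to the term.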

Overall, one can conclude from the findings described above that both search engines and social bookmarking systems provide a helpful way to retrieve information. While search engines allow a focused search and filter relevant information based on many factors including social and personal interests, users searching and browsing folksonomies benefit from a social network reflecting short paths and a certain degree of serendipity. Considering the overlap of searched and tagged topics as well as of relevant bookmarks in rankings, one can see that both systems represent current and relevant information. As both systems provide a platform for information retrieval and sharing, they can benefit from each other. Social bookmarking systems could be improved by integrating search aspects to better find information. This includes adopting standard algorithms from information retrieval and adjusting them to the folksonomy structure, as has been done with the FolkRank algorithm (see Section 2.5.1). Search engines, on the other hand, could benefit from integrating user interactions into the search process. This has been realised by integrating social bookmarking resources into the process of creating indices and rankings, or by personalizing search results based on similar users who liked or tagged similar resources. Google, for instance, realised the importance of social networks and user interaction for its search business and introduced the social network Google+1 in 2011. Among other things, Google uses the service to personalize search: for instance, it adds social annotations (a picture and name) to search results, showing that a result has been shared, reviewed or submitted by one of the searcher's online acquaintances [Maureen Heymans, 2009]. Nevertheless, social activities are not limited to the posting of bookmarks.
Social search today includes friendship connections on social networks such as Facebook and user-generated content such as Twitter messages or Amazon reviews. It is enhanced by the integration of mobile applications such as messaging or GPS data. The social component of information retrieval can no longer be ignored. With it, however, challenges such as the detection of spam or the protection of a user's privacy need to be taken into account.

11.2 Spam Detection

In order to find spam in social bookmarking systems, we introduced different features which can be induced from social bookmarking data, such as registration information,

1https://plus.google.com

user posts, or the social network connecting users, bookmarks and tags. To get a better understanding of these features, each was assigned to a category (profile, location, activity and semantic) and their performance using standard classification algorithms (SVM, Naive Bayes, logistic regression) was compared. The best category on its own was the one containing semantic features based on the co-occurrence network of a folksonomy. The combination of all features, however, delivered the best results, as it took all available information into account. Considering classification algorithms, no significant differences could be found: support vector machines and logistic regression delivered the best results. As the BibSonomy dataset was part of the ECML/PKDD Discovery Challenge 2008, the participants' ideas and the results obtained on the BibSonomy dataset were introduced. The most successful idea considered all information around a post (bookmark, tags, description) as a text document and used a text classification setting to identify spam. The understanding of typical spam and non-spam behaviour was deepened by means of an analysis of local patterns: what kind of descriptions can be found for the two groups, spammers and non-spammers? The patterns analysed confirmed many of our previous assumptions, which administrators often unconsciously applied when manually flagging spammers. For instance, spammers do not submit compound names during registration, and they tend to choose long names and relatively long e-mail addresses. The results of the experiments carried out have been implemented and used in the spam framework in BibSonomy.
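To illustrate how such binary user features can be fed into one of the mentioned classifiers, here is a minimal, self-contained Bernoulli naive Bayes sketch; the feature names and toy data are invented for illustration and are not the thesis's actual feature set.

```python
import math

def train_nb(X, y):
    """Bernoulli naive Bayes with Laplace smoothing on binary features."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = math.log(len(rows) / len(y))
        # smoothed P(feature_j = 1 | class c)
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(X[0]))]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    def log_posterior(c):
        prior, probs = model[c]
        return prior + sum(math.log(p if xi else 1 - p)
                           for p, xi in zip(probs, x))
    return max(model, key=log_posterior)

# Toy binary features per user, e.g. [no_compound_name, long_email,
# many_posts_per_day] -- hypothetical, chosen only for illustration.
X = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
y = ["spammer", "spammer", "non-spammer", "non-spammer"]
model = train_nb(X, y)
print(predict_nb(model, [1, 1, 0]))
```

Restricting X to the columns of a single category allows exactly the kind of per-category comparison summarized above; concatenating all columns corresponds to the combined feature set.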
In order to improve spam detection and thereby facilitate the work of the BibSonomy administrators, various interesting data mining and social web ideas can be considered for the future:

• Spam detection on the post level instead of on the user level is an interesting option to improve the accuracy of spam detection algorithms. This would mean that only individual posts are marked as spam, whereas currently all or none of the posts of a user are marked as such. “The justification for detecting spammers on a post level is that users either have a malicious intention and use non-spam posts to hide their motivations or are legitimate users.” [Krause et al., 2008c] First experiments in this direction have been conducted by Yang and Lee [2011b], who measure the similarity between the tags of a post and the description of the referenced website. However, they use the BibSonomy ECML/PKDD dataset introduced in Section 7.2, in which spam has been marked on the user level, so that differences between user posts indicating spam and non-spam are not considered.

• Using online-updateable algorithms can help improve performance and accuracy. Different variants of this approach exist. One idea is to treat spam data as a stream: as soon as a user is wrongly classified by the classification algorithm, the classification model is updated to account for the mistake. Sculley [2008], for example, introduced an SVM variant called Relaxed Online SVM to filter spam e-mail messages online.

• The set of features used for spam detection could be enhanced with features derived from social network data and user participation. This includes the results of the user button for flagging spam, the exploration of the “friends” and “followers” networks, or the number of bookmark clicks from spammers and non-spammers.

• While we focused on analyzing different features, a deeper investigation of algorithmic details might help to better tackle the problem. For example, the implementation of a kernel adjusted to the problem of spam might improve spam classification. Keeping in mind that the best classifier of the ECML/PKDD challenge used textual classification, string kernels [Amayri and Bouguila, 2010; Sonnenburg et al., 2007] could be of great interest.

• Considering the growing size of the BibSonomy data, it will become necessary to pre-select helpful training examples. Pre-selection techniques can be built on simple assumptions, such as that the most recent examples are the most informative ones or that wrongly classified examples should be considered. Another strategy is to pre-select training examples using machine learning techniques. For instance, active learning approaches (see for example Settles [2009]) choose uncertain examples, which are then manually labeled and used for building a classification model.
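The uncertainty-sampling step of such active learning approaches can be sketched as follows; the user names and scores are invented, and prob_spam stands in for any previously trained probabilistic classifier.

```python
def most_uncertain(examples, prob_spam, k=2):
    """Uncertainty sampling: pick the k examples whose predicted
    spam probability is closest to 0.5 (least confident)."""
    return sorted(examples, key=lambda e: abs(prob_spam(e) - 0.5))[:k]

# Hypothetical spam probabilities from a previously trained classifier.
scores = {"userA": 0.97, "userB": 0.52, "userC": 0.08, "userD": 0.44}
picked = most_uncertain(scores, scores.get)
print(picked)  # the two users the model is least sure about
```

The selected examples would then be handed to an administrator for labeling and added to the training set before retraining.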

Beyond spam detection in social bookmarking systems, the identification and detection of spam in all kinds of Social Web and mobile activities will become an essential task in the future. The type of spam is unlikely to change very much (we will still be offered loans, diet products or adult content); the way of distribution, however, will be simplified: the wealth of social applications offers a variety of distribution channels. As most of these applications only become profitable once they reach a certain number of (active) users, entry barriers to using them will remain low. This also holds true for spam users. Finally, as many Web 2.0 applications are connected, spammers benefit from the fact that spamming in one system increases their visibility in other systems. For example, spammers build so-called link farms in social networks by attracting as many followers as possible to their spam site or tweet; this also helps improve the rank of their fraudulent page in search engines. Thus, the design of appropriate features and the further development of algorithms to detect spam will remain critical in order to preserve the most important aspect of Web 2.0 and future applications: social user interaction based on trust.

11.3 Data Privacy

With the arrival of Web 2.0 applications, the volume of stored information about internet users grew faster and faster. While this seems to be a big chance to improve data mining and profiling applications, many internet users and researchers worry about the consequences of publishing and collecting data, especially personal data. This concern was discussed in the third part of this thesis. In Germany, the “right to determine in principle the disclosure and use of personal data” [Fischer-Hübner et al., 2011] has been established in the context of the German constitutional ruling as the right to informational self-determination. Based on this principle, regulations in Germany and Europe (other parts of the world are not discussed in this thesis) have been formulated to protect a user's data privacy. In the light of today's digital world, where many free Web 2.0 services including search engines and social networks benefit from personal user information, an imbalance between data protection and data usage can be observed. The legal framework formulated in Germany and the EU does not fully take into account how online applications today implement the collection and usage of private information. Also, companies and other institutions collecting personal data in an international and digital environment ignore the basic principles of protecting their users' data privacy.


This thesis analysed the collection, processing and further circulation of potentially private data in the social bookmarking system BibSonomy. Different system features were assessed according to German law (as of 2010), and recommendations on how to improve and maintain user privacy protection were given (Chapter 10). The design of features for data mining services respecting data privacy laws was investigated in Chapter 8. Features used for the detection of spam were assigned different privacy levels. The results showed that the identification of spammers using privacy-friendly levels is possible when the user account shares enough “public” information (such as public posts) in the social bookmarking system. The discussion on how to preserve a user's right to informational self-determination and protect user data privacy in the Social Web needs to continue. Collecting and linking user data from many different sources such as the Web, mobile applications or radio-frequency identification is more and more common, so that the topic of data privacy has become even more relevant. Stock market-listed companies such as Amazon or Facebook rely on the profiles generated from their user databases for advertising purposes. They are keen on finding out as much as they can about their users' preferences and interests. However, not only commercial interests trigger the collection, processing and distribution of user data. The recently exposed actions by several nations' intelligence services show that private user data has become a target for use beyond marketing and sales activities. In order to counter the digital monitoring of citizens by companies, governments and other interested parties, adjustments to the current legislation in Europe are being discussed. On January 25th 2012, the EU Commission proposed an overhaul of the Data Protection Directive 95/46/EC [Hornung, 2012].
Among other things, the reform could strengthen the rights of individuals. For instance, the draft includes the “right to be forgotten” (Art. 17), whereby institutions collecting personal data need to delete a concerned person's private data upon his or her request and to inform third parties which process the data that the person requested the deletion of “any links to, or copy or replication of that personal data” [European Commission, 2012]. However, the example shows how difficult it is to put such rights into practice: the wording contains vague expressions such as “reasonable steps” concerning how third parties need to be advised of the data erasure: “To ensure this information, the controller should take all reasonable steps, including technical measures, in relation to data for the publication of which the controller is responsible [European Commission, 2012].” This leaves room for interpretation. Considering today's decentralized digital infrastructure, it will be technically difficult to control the distribution of published data. Additionally, the erasure of such data itself leads to technical challenges, such as dealing with database backups. In 2014, however, the European Court confirmed the “right to be forgotten”, stating that search engines have to remove data which is “inaccurate, inadequate, irrelevant or excessive for the purposes of the data processing [European Commission, 2015]” from their search result lists when requested by concerned users. The search engine Google has introduced a service which deletes links upon justified request [Google Support, 2015]. Another aspect of the discussion regarding privacy in digital applications is the usability of privacy-preserving features. The rise of the mobile application “WhatsApp”2 is an example. In order to connect as many users as possible to the system and enable a simple start, the application collects phone entries from the mobile device's address

2http://www.whatsapp.com/

book and links those numbers to other user numbers already stored on the “WhatsApp” servers. The advantage is that a new user gets easily connected to acquaintances who have already registered. The disadvantage is that WhatsApp collects phone numbers even if their owners do not use the application. In contrast, similar applications such as Kik3 do not require their users to transmit phone numbers. As users of such privacy-aware applications need to enter new contacts manually, the effort of becoming linked is much higher. Membership figures show4 that the simple way still seems to attract more users – though they have to give up part of their privacy, namely their phone contacts. As with every technical advancement of society, there are positive and negative aspects. Mostly, the benefits outweigh the downsides. This is definitely the case for Social Web applications. Those applications have become part of our daily lives. It is difficult to imagine a world without digital social networks, tagging systems or news from microblogging systems. Now it is time to educate users regarding the use of the Web, starting with school children. Only the users have the power to force the implementation of better data privacy laws and processes.

3http://kik.com/about/
4100 million users registered with Kik, half of them based in the United States [Silcoff, 2013], versus 400 million active users each month in WhatsApp [Newton, 2013]


Bibliography

Fabian Abel. Contextualization, user modeling and personalization in the social web: from social tagging via context to cross-system user modeling and personalization. PhD thesis, University of Hanover, 2011. URL http://edok01.tib.uni-hannover.de/edoks/e01dh11/660718537.pdf. http://d-nb.info/1014252423.

Fabian Abel, Nicola Henze, and Daniel Krause. Ranking in Folksonomy Systems: can context help? In James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi, and Abdur Chowdhury, editors, Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008, pages 1429–1430. ACM, 2008. ISBN 978-1-59593-991-3.

Lada Adamic. Zipf, power-laws, and pareto – a ranking tutorial, 2002. URL http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.

Eytan Adar. User 4xxxxx9: Anonymizing query logs. In Query Logs Workshop at WWW2006, 2007.

B. Thomas Adler, Luca de Alfaro, Santiago M. Mola-Velasco, Paolo Rosso, and Andrew G. West. Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features. In Alexander F. Gelbukh, editor, CICLing (2), volume 6609 of Lecture Notes in Computer Science, pages 277–288, Tokyo, Japan, February 2011. Springer. ISBN 978-3-642-19436-8. URL http://repository.upenn.edu/cgi/viewcontent.cgi?article=1494&context=cis_papers.

Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng., 17:734–749, June 2005. ISSN 1041-4347. doi: 10.1109/TKDE.2005.99. URL http://dx.doi.org/10.1109/TKDE.2005.99.

Eugene Agichtein, Eric Brill, Susan Dumais, and Robert Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–10, New York, NY, USA, 2006. ACM. ISBN 1-59593-369-7. doi: 10.1145/1148170.1148175. URL http://portal.acm.org/citation.cfm?id=1148175.


Hend S. Al-Khalifa and Hugh C. Davis. Towards better understanding of folksonomic patterns. In Proceedings of the eighteenth conference on Hypertext and hypermedia, HT '07, pages 163–166, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-820-6. doi: 10.1145/1286240.1286288. URL http://doi.acm.org/10.1145/1286240.1286288.

Abdullah Almaatouq, Ahmad Alabdulkareem, Mariam Nouh, Erez Shmueli, Mansour Alsaleh, Vivek K. Singh, Abdulrahman Alarifi, Anas Alfaris, and Alex (Sandy) Pentland. Twitter: Who gets caught? observed trends in social micro-blogging spam. In Proceedings of the 2014 ACM Conference on Web Science, WebSci '14, pages 33–41, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2622-3. doi: 10.1145/2615569.2615688. URL http://doi.acm.org/10.1145/2615569.2615688.

Ola Amayri and Nizar Bouguila. A study of spam filtering using support vector machines. Artificial Intelligence Review, 34:73–108, 2010. ISSN 0269-2821. doi: 10.1007/s10462-010-9166-x. URL http://dx.doi.org/10.1007/s10462-010-9166-x.

Pierre Andrews, Ilya Zaihrayeu, and Juan Pane. A classification of semantic annotation systems. Semant. web, 3(3):223–248, August 2012. ISSN 1570-0844. doi: 10.3233/SW-2011-0056. URL http://dx.doi.org/10.3233/SW-2011-0056.

Karen Angel. Inside Yahoo!: Reinvention and the Road Ahead. John Wiley & Sons, Inc., Berlin – Heidelberg, 2002.

UN General Assembly. Guidelines for the regulation of computerized personal data files, December 1990. URL http://www.unhcr.org/refworld/docid/3ddcafaac.html.

Martin Atzmüller. Knowledge-intensive subgroup mining: techniques for automatic and interactive discovery. PhD thesis, Berlin; Amsterdam, 2007. URL http://www.amazon.com/Knowledge-Intensive-Subgroup-Mining-Dissertations-Diski-Dissertations/dp/1586037269.

Martin Atzmueller. Knowledge-Intensive Subgroup Mining – Techniques for Automatic and Interactive Discovery, volume 307 of Dissertations in Artificial Intelligence-Infix (Diski). IOS Press, March 2007.

Martin Atzmueller, Florian Lemmerich, Beate Krause, and Andreas Hotho. Spammer discrimination: Discovering local patterns for concept characterization and description. In Johannes Fürnkranz and Arno Knobbe, editors, Proc. LeGo-09: From Local Patterns to Global Models, Workshop at the 2009 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Bled, Slovenia, September 2009.

Ching Man Au Yeung, Nicholas Gibbins, and Nigel Shadbolt. Web search disambiguation by collaborative tagging. In Proceedings of the Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2008), co-located with ECIR 2008, Glasgow, United Kingdom, 31 March, 2008, pages 48–61, 2008. URL http://eprints.ecs.soton.ac.uk/15393/.


Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries, and Emine Yilmaz. Relevance assessment: are judges exchangeable and does it matter. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 667–674, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-164-4. doi: 10.1145/1390334.1390447. URL http://doi.acm.org/10.1145/1390334.1390447.

Shenghua Bao, Guirong Xue, Xiaoyuan Wu, Yong Yu, Ben Fei, and Zhong Su. Optimizing web search using social annotations. In Proc. WWW ’07, pages 501–510, Banff, Canada, 2007.

Judit Bar-Ilan, Mazlita Mat-Hassan, and Mark Levene. Methods for comparing rankings of search engine results. Comput. Networks, 50(10):1448–1463, 2006. URL http://portal.acm.org/citation.cfm?id=1148375.

Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.

Albert-László Barabási and Eric Bonabeau. Scale-free networks. Scientific American, 288:60–69, 2003. URL http://www.nd.edu/~networks/PDF/Scale-Free%20Sci%20Amer%20May03.pdf.

Susan B. Barnes. A privacy paradox: Social networking in the United States. First Monday, 11(9), 2006. ISSN 13960466. URL http://firstmonday.org/ojs/index.php/fm/article/view/1394.

Fabiano M. Belém, Eder F. Martins, Jussara M. Almeida, and Marcos A. Gonçalves. Personalized and object-centered tag recommendation methods for web 2.0 applications. Information Processing & Management, 50(4):524–553, 2014. ISSN 0306-4573. doi: 10.1016/j.ipm.2014.03.002. URL http://www.sciencedirect.com/science/article/pii/S0306457314000181.

Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. Detecting spammers on Twitter. In Proceedings of the Seventh Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), July 2010.

Dominik Benz, Folke Eisterlehner, Andreas Hotho, Robert Jäschke, Beate Krause, and Gerd Stumme. Managing publications and bookmarks with bibsonomy. In Ciro Cattuto, Giancarlo Ruffo, and Filippo Menczer, editors, HT '09: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, pages 323–324, New York, NY, USA, June 2009a. ACM. ISBN 978-1-60558-486-7. doi: 10.1145/1557914.1557969. URL http://portal.acm.org/citation.cfm?doid=1557914.1557969.

Dominik Benz, Beate Krause, G. Praveen Kumar, Andreas Hotho, and Gerd Stumme. Characterizing semantic relatedness of search query terms. In Proceedings of the 1st Workshop on Explorative Analytics of Information Networks (EIN2009), Bled, Slovenia, September 2009b.

Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, and Gerd Stumme. The social bookmark and publication management system bibsonomy. The VLDB Journal, 19(6):849–875, December 2010a. ISSN 1066-8888. doi: 10.1007/s00778-010-0208-4. URL http://www.kde.cs.uni-kassel.de/pub/pdf/benz2010social.pdf.

Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, and Gerd Stumme. Query logs as folksonomies. Datenbank-Spektrum, 10(1):15–24, 2010b. ISSN 1618-2162. doi: 10.1007/s13222-010-0004-8. URL http://dx.doi.org/10.1007/s13222-010-0004-8.

Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, and Divesh Srivastava. Class-based graph anonymization for social network data. Proc. VLDB Endow., 2:766–777, August 2009. ISSN 2150-8097. URL http://portal.acm.org/citation.cfm?id=1687627.1687714.

Claudio Biancalana, Fabio Gasparetti, Alessandro Micarelli, and Giuseppe Sansonetti. Social semantic query expansion. ACM Trans. Intell. Syst. Technol., 4(4):60:1–60:43, October 2013. ISSN 2157-6904. doi: 10.1145/2508037.2508041. URL http://doi.acm.org/10.1145/2508037.2508041.

Kerstin Bischoff, Claudiu S. Firan, Cristina Kadar, Wolfgang Nejdl, and Raluca Paiu. Automatically identifying tag types. In Proceedings of the 5th International Conference on Advanced Data Mining and Applications, ADMA '09, pages 31–42, Berlin, Heidelberg, 2009. Springer-Verlag. ISBN 978-3-642-03347-6. doi: 10.1007/978-3-642-03348-3_7. URL http://dx.doi.org/10.1007/978-3-642-03348-3_7.

Enrico Blanzieri and Anton Bryl. A survey of learning-based techniques of email spam filtering. Artif. Intell. Rev., 29:63–92, March 2008. ISSN 0269-2821. doi: 10.1007/s10462-009-9109-6. URL http://portal.acm.org/citation.cfm?id=1612711.1612715.

Toine Bogers and Antal Van den Bosch. Using language modeling for spam detection in social reference manager websites. In Robin Aly, Claudia Hauff, I. den Hamer, Djoerd Hiemstra, Theo Huibers, and Franciska de Jong, editors, Proceedings of the 9th Belgian-Dutch Information Retrieval Workshop (DIR 2009), pages 87–94, Enschede, February 2009.

Katrin Borchert. Active learning for spam detection in the social bookmarking system Bibsonomy. Master’s thesis, University of Würzburg, 2011.

Mohamed Reda Bouadjenek, Hakim Hacid, and Mokrane Bouzeghoub. Sopra: A new social personalized ranking function for improving web search. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 861–864, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2034-4. doi: 10.1145/2484028.2484131. URL http://doi.acm.org/10.1145/2484028.2484131.

Karin Breitman and Marco Antonio Casanova. Semantic Web: Concepts, Technologies and Applications. Springer-Verlag London Limited, New York, 2007. ISBN 9781846285813. URL http://www.worldcat.org/search?qt=worldcat_org_all&q=9781846285813.


Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998. URL http://citeseer.ist.psu.edu/brin98anatomy.html.

Andrei Broder. A taxonomy of web search. SIGIR Forum, 36:3–10, September 2002. ISSN 0163-5840. doi: 10.1145/792550.792552. URL http://doi.acm.org/10.1145/792550.792552.

Alexander Budanitsky and Graeme Hirst. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 89–96, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102363. URL http://portal.acm.org/citation.cfm?id=1102351.1102363.

Yuanzhe Cai, Miao Zhang, Dijun Luo, Chris Ding, and Sharma Chakravarthy. Low-order tensor decompositions for social tagging recommendation. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 695–704, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0493-1. doi: 10.1145/1935826.1935920. URL http://doi.acm.org/10.1145/1935826.1935920.

Iván Cantador, Alejandro Bellogín, and David Vallet. Content-based recommendation in social tagging systems. In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10, pages 237–240, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0. doi: 10.1145/1864708.1864756. URL http://doi.acm.org/10.1145/1864708.1864756.

Mark J. Carman, Mark Baillie, Robert Gwadera, and Fabio Crestani. A statistical comparison of tag and query logs. In SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 123–130, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-483-6. doi: 10.1145/1571941.1571965. URL http://portal.acm.org/citation.cfm?id=1571965.

Ciro Cattuto, Vittorio Loreto, and Luciano Pietronero. Collaborative tagging and semiotic dynamics. CoRR, abs/cs/0605015, 2006. URL http://dblp.uni-trier.de/db/journals/corr/corr0605.html#abs-cs-0605015.

Ciro Cattuto, Christoph Schmitz, Andrea Baldassarri, Vito D. P. Servedio, Vittorio Loreto, Andreas Hotho, Miranda Grahl, and Gerd Stumme. Network properties of folksonomies. AI Communications Journal, Special Issue on “Network Analysis in Natural Sciences and Engineering”, 20(4):245–262, 2007. ISSN 0921-7126. URL http://www.kde.cs.uni-kassel.de/stumme/papers/2007/cattuto2007network.pdf.


Ciro Cattuto, Dominik Benz, Andreas Hotho, and Gerd Stumme. Semantic grounding of tag relatedness in social bookmarking systems. In Amit P. Sheth, Steffen Staab, Mike Dean, Massimo Paolucci, Diana Maynard, Timothy W. Finin, and Krishnaprasad Thirunarayan, editors, The Semantic Web – ISWC 2008, volume 5318 of Lecture Notes in Computer Science, pages 615–631, Berlin/Heidelberg, 2008. Springer. ISBN 978-3-540-88563-4. doi: 10.1007/978-3-540-88564-1_39. URL http://cxnets.googlepages.com/cattuto_iswc2008.pdf.

Abhijnan Chakraborty, Saptarshi Ghosh, and Niloy Ganguly. Detecting overlapping communities in folksonomies. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media, HT '12, pages 213–218, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1335-3. doi: 10.1145/2309996.2310032. URL http://doi.acm.org/10.1145/2309996.2310032.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chris Chatfield. The analysis of time series: an introduction. CRC Press, Florida, US, 6th edition, 2004.

Feilong Chen, Pang-Ning Tan, and Anil K. Jain. A co-classification framework for detecting web spam and spammers in social media web sites. In David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, and Jimmy J. Lin, editors, CIKM, pages 1807–1810. ACM, 2009a. ISBN 978-1-60558-512-3. URL http://dblp.uni-trier.de/db/conf/cikm/cikm2009.html#ChenTJ09.

Jilin Chen, Werner Geyer, Casey Dugan, Michael Muller, and Ido Guy. Make new friends, but keep the old: recommending people on social networking sites. In Proceedings of the 27th international conference on Human factors in computing systems, CHI ’09, pages 201–210, New York, NY, USA, 2009b. ACM. ISBN 978-1-60558-246-7. doi: 10.1145/1518701.1518735. URL http://doi.acm.org/10.1145/1518701.1518735.

Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, ICML ’05, pages 137–144, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: http://doi.acm.org/10.1145/1102351.1102369. URL http://doi.acm.org/10.1145/1102351.1102369.

European Commission, Directorate General Justice, Freedom and Security. Comparative study on different approaches to new privacy challenges, in particular in the light of technological developments, 2010.

Oscar Corcho, Andrés García-Silva, and Iván Cantador. Enabling folksonomies for knowledge extraction: A semantic grounding approach. Int. J. Semant. Web Inf. Syst., 8(3):24–41, July 2012. ISSN 1552-6283. doi: 10.4018/jswis.2012070102. URL http://dx.doi.org/10.4018/jswis.2012070102.

Gordon V. Cormack. Email spam filtering: A systematic review. Found. Trends Inf. Retr., 1:335–455, April 2008. URL http://portal.acm.org/citation.cfm?id=1454707.1454708.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995. ISSN 0885-6125. doi: 10.1023/A:1022627411411. URL http://dx.doi.org/10.1023/A:1022627411411.

David Cossock and Tong Zhang. Subset ranking using regression. In Gabor Lugosi and Hans Simon, editors, Learning Theory, volume 4005 of Lecture Notes in Computer Science, pages 605–619. Springer, Berlin / Heidelberg, 2006. ISBN 978-3-540-35294-5. doi: 10.1007/11776420_44. URL http://dx.doi.org/10.1007/11776420_44.

W.B. Croft, D. Metzler, and T. Strohman. Search Engines: Information Retrieval in Practice. Pearson, London, England, 2010.

Céline Van Damme, Martin Hepp, and Katharina Siorpaes. Folksontology: An integrated approach for turning folksonomies into ontologies. In Bridging the Gap between Semantic Web and Web 2.0 (SemNet 2007), pages 57–70, 2007. URL http://www.kde.cs.uni-kassel.de/ws/eswc2007/proc/FolksOntology.pdf.

George Danezis. Inferring privacy policies for social networking services. In Proceedings of the 2nd ACM workshop on Security and artificial intelligence, AISec ’09, pages 5–10, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-781-3. doi: 10.1145/1654988.1654991. URL http://doi.acm.org/10.1145/1654988.1654991.

Klaas Dellschaft and Steffen Staab. An epistemic dynamic model for tagging systems. In Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, HT ’08, pages 71–80, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-985-2. doi: 10.1145/1379092.1379109. URL http://doi.acm.org/10.1145/1379092.1379109.

Klaas Dellschaft and Steffen Staab. On differences in the tagging behavior of spammers and regular users. In Proceedings of the Web Science Conference 2010, 2010.

E. U. Directive. 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of such Data. Official Journal of the EC, 23, 1995.

Endang Djuana, Yue Xu, Yuefeng Li, and Audun Jøsang. A combined method for mitigating sparsity problem in tag recommendation. In 47th Hawaii International Conference on System Sciences, HICSS 2014, Waikoloa, HI, USA, January 6-9, 2014, pages 906–915, 2014. doi: 10.1109/HICSS.2014.120. URL http://dx.doi.org/10.1109/HICSS.2014.120.

Stephan Doerfel, Robert Jäschke, Andreas Hotho, and Gerd Stumme. Leveraging publication metadata and social data into folkrank for scientific publication recommendation. In Proceedings of the 4th ACM RecSys Workshop on Recommender Systems and the Social Web, RSWeb ’12, pages 9–16, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1638-5. doi: 10.1145/2365934.2365937. URL http://doi.acm.org/10.1145/2365934.2365937.

Stephan Doerfel, Andreas Hotho, Aliye Kartal-Aydemir, Alexander Roßnagel, and Gerd Stumme. Informationelle Selbstbestimmung im Web 2.0 — Chancen und Risiken sozialer Verschlagwortungssysteme. Vieweg + Teubner Verlag, 2013. ISBN 978-3-642-38055-6. URL http://www.worldcat.org/search?qt=worldcat_org_all&q=9783642380556.

Stephan Doerfel, Daniel Zöller, Philipp Singer, Thomas Niebler, Andreas Hotho, and Markus Strohmaier. Of course we share! testing assumptions about social tagging systems. CoRR, abs/1401.0629, 2014. URL http://arxiv.org/abs/1401.0629.

Renato Domínguez García, Matthias Bender, Mojisola Anjorin, Christoph Rensing, and Ralf Steinmetz. FReSET: an evaluation framework for folksonomy-based recommender systems. In Proceedings of the 4th ACM RecSys workshop on Recommender systems and the social web, RSWeb ’12, pages 25–28, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1638-5. doi: 10.1145/2365934.2365939. URL http://doi.acm.org/10.1145/2365934.2365939.

Xishuang Dong, Xiaodong Chen, Yi Guan, Zhiming Yu, and Sheng Li. An overview of learning to rank for information retrieval. In Mark Burgin, Masud H. Chowdhury, Chan H. Ham, Simone A. Ludwig, Weilian Su, and Sumanth Yenduri, editors, CSIE (3), pages 600–606. IEEE Computer Society, 2009. ISBN 978-0-7695-3507-4. URL http://dblp.uni-trier.de/db/conf/csie/csie2009-3.html#DongCGYL09.

Zhicheng Dou, Ruihua Song, Xiaojie Yuan, and Ji-Rong Wen. Are click-through data adequate for learning web search rankings? In CIKM ’08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 73–82, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-991-3. doi: 10.1145/1458082.1458095.

Harris Drucker, Donghui Wu, and Vladimir Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.

Georges Dupret, Vanessa Murdock, and Benjamin Piwowarski. Web search engine evaluation using clickthrough data and a user model. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, Banff, Canada, 2007.

Patrick Van Eecke and Maarten Truyens. Privacy and social networks. Computer Law & Security Review, 26(5):535–546, 2010. URL http://www.sciencedirect.com/science/article/B6VB3-5148KHG-7/2/3f622c9ebdda28400b2be86d8212d15c.

Folke Eisterlehner, Andreas Hotho, and Robert Jäschke, editors. ECML PKDD Discovery Challenge 2009 (DC09), volume 497 of CEUR-WS.org, September 2009. URL http://ceur-ws.org/Vol-497.

European Commission. REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation), 2012.

European Commission. Factsheet on the “right to be forgotten” ruling (c-131/12), 2015. Viewed: 08.04.2015, Available: http://ec.europa.eu/justice/data-protection/files/factsheets/factsheet_data_protection_en.pdf.

Brynn M. Evans and Ed H. Chi. An elaborated model of social search. Inf. Process. Manage., 46:656–678, November 2010. ISSN 0306-4573. doi: http://dx.doi.org/10.1016/j.ipm.2009.10.012. URL http://dx.doi.org/10.1016/j.ipm.2009.10.012.

Lujun Fang and Kristen LeFevre. Privacy wizards for social networking sites. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 351–360, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772727. URL http://doi.acm.org/10.1145/1772690.1772727.

Reza Farahbakhsh, Xiao Han, Ángel Cuevas, and Noël Crespi. Analysis of publicly disclosed information in facebook profiles. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, pages 699–705, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2240-9. doi: 10.1145/2492517.2492625. URL http://doi.acm.org/10.1145/2492517.2492625.

T. Fawcett. ROC graphs: Notes and practical considerations for researchers, 2004. URL citeseer.ist.psu.edu/fawcett04roc.html.

Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998a. ISBN 978-0-262-06197-1.

Christiane Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, 1998b.

Roy T. Fielding and Richard N. Taylor. Principled design of the modern web architecture. ACM Trans. Internet Technol., 2(2):115–150, May 2002. ISSN 1533-5399. doi: 10.1145/514183.514185. URL http://doi.acm.org/10.1145/514183.514185.

Simone Fischer-Hübner. IT-Security and Privacy - Design and Use of Privacy-Enhancing Security Mechanisms, volume 1958 of Lecture Notes in Computer Science. Springer, 2001. ISBN 3-540-42142-4.

Simone Fischer-Hübner, Chris Jay Hoofnagle, Ioannis Krontiris, Kai Rannenberg, and Michael Waidner. Online privacy: Towards informational self-determination on the internet (dagstuhl perspectives workshop 11061). Dagstuhl Manifestos, 1(1):1–20, 2011. URL http://dblp.uni-trier.de/db/journals/dagstuhl-manifestos/dagstuhl-manifestos1.html#Fischer-HubnerHKRW11.

Steve Fox, Kuldeep Karnawat, Mark Mydland, Susan Dumais, and Thomas White. Evaluating implicit measures to improve web search. ACM Trans. Inf. Syst., 23:147–168, April 2005. ISSN 1046-8188. doi: 10.1145/1059981.1059982. URL http://doi.acm.org/10.1145/1059981.1059982.

Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

Leyla Jael Garcia-Castro, Martin Hepp, and Alexander Garcia. Tags4tags: Using tagging to consolidate tags. In Sourav S. Bhowmick, Josef Küng, and Roland Wagner, editors, Database and Expert Systems Applications, volume 5690 of Lecture Notes in Computer Science, pages 619–628. Springer Berlin Heidelberg, 2009. ISBN 978-3-642-03572-2. doi: 10.1007/978-3-642-03573-9_52. URL http://dx.doi.org/10.1007/978-3-642-03573-9_52.

R. Stuart Geiger and Aaron Halfaker. When the levee breaks: Without bots, what happens to Wikipedia’s quality control processes? In Proceedings of the 9th International Symposium on Open Collaboration, WikiSym ’13, pages 6:1–6:6, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1852-5. doi: 10.1145/2491055.2491061. URL http://doi.acm.org/10.1145/2491055.2491061.

Jonathan Gemmell, Andriy Shepitsen, Bamshad Mobasher, and Robin Burke. Personalizing navigation in folksonomies using hierarchical tag clustering. In Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery, DaWaK ’08, pages 196–205, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-85835-5. doi: 10.1007/978-3-540-85836-2_19. URL http://dx.doi.org/10.1007/978-3-540-85836-2_19.

Jonathan Gemmell, Thomas R. Schimoler, Laura Christiansen, and Bamshad Mobasher. Improving folkrank with item-based collaborative filtering. In ACM RecSys’09 Workshop on Recommender Systems and the Social Web, New York, USA, October 2009.

Jonathan Gemmell, Thomas Schimoler, Bamshad Mobasher, and Robin Burke. Resource recommendation in social annotation systems: A linear-weighted hybrid approach. J. Comput. Syst. Sci., 78(4):1160–1174, July 2012. ISSN 0022-0000. doi: 10.1016/j.jcss.2011.10.006. URL http://dx.doi.org/10.1016/j.jcss.2011.10.006.

Jonathan Gemmell, Bamshad Mobasher, and Robin Burke. Resource recommendation in social annotation systems based on user partitioning. In Martin Hepp and Yigal Hoffner, editors, E-Commerce and Web Technologies, volume 188 of Lecture Notes in Business Information Processing, pages 101–112. Springer International Publishing, 2014. ISBN 978-3-319-10490-4. doi: 10.1007/978-3-319-10491-1_11. URL http://dx.doi.org/10.1007/978-3-319-10491-1_11.

Xiubo Geng, Tao Qin, Tie-Yan Liu, and Xue-Qi Cheng. A noise-tolerant graphical model for ranking. Inf. Process. Manage., 48(2):374–383, March 2012. ISSN 0306-4573. doi: 10.1016/j.ipm.2011.11.003. URL http://dx.doi.org/10.1016/j.ipm.2011.11.003.

Fredric C. Gey. Inferring probability of relevance using the method of logistic regression. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’94, pages 222–231, New York, NY, USA, 1994. Springer-Verlag New York, Inc. ISBN 0-387-19889-X. URL http://dl.acm.org/citation.cfm?id=188490.188560.

Rumi Ghosh, Tawan Surachawala, and Kristina Lerman. Entropy-based classification of ‘retweeting’ activity on twitter. CoRR, abs/1106.0346, 2011a. URL http://dblp.uni-trier.de/db/journals/corr/corr1106.html#abs-1106-0346.

Saptarshi Ghosh, Pushkar Kane, and Niloy Ganguly. Identifying overlapping communities in folksonomies or tripartite hypergraphs. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11, pages 39–40, New York, NY, USA, 2011b. ACM. ISBN 978-1-4503-0637-9. doi: 10.1145/1963192.1963213. URL http://doi.acm.org/10.1145/1963192.1963213.

Saptarshi Ghosh, Bimal Viswanath, Farshad Kooti, Naveen Kumar Sharma, Gautam Korlam, Fabricio Benevenuto, Niloy Ganguly, and Krishna Phani Gummadi. Understanding and combating link farming in the twitter social network. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 61–70, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1229-5. doi: 10.1145/2187836.2187846. URL http://doi.acm.org/10.1145/2187836.2187846.

Scott Golder and Bernardo A. Huberman. The structure of collaborative tagging systems, August 2005.

Scott Golder and Bernardo A. Huberman. The structure of collaborative tagging systems. Journal of Information Sciences, 32(2):198–208, April 2006a. URL http://www.hpl.hp.com/research/idl/papers/tags/index.html.

Scott A. Golder and Bernardo A. Huberman. Usage patterns of collaborative tagging systems. J. Inf. Sci., 32:198–208, April 2006b. ISSN 0165-5515. doi: 10.1177/0165551506062337. URL http://dl.acm.org/citation.cfm?id=1119738.1119747.

Google Support. Remove information from google, 2015. Viewed: 08.04.2015, Available: https://support.google.com/websearch/troubleshooter/3111061?hl=en.

Alan Gray and Mads Haahr. Personalised, collaborative spam filtering. In Proceedings of the First Conference on Email and Anti-Spam (CEAS), 2004.

Ralph Gross and Alessandro Acquisti. Information revelation and privacy in online social networks. In Proceedings of the 2005 ACM workshop on Privacy in the electronic society, WPES ’05, pages 71–80, New York, NY, USA, 2005. ACM. ISBN 1-59593-228-3. doi: 10.1145/1102199.1102214. URL http://doi.acm.org/10.1145/1102199.1102214.

Seda Gürses and Bettina Berendt. The social web and privacy: Practices, reciprocity and conflict detection in social networks. In Francesco Bonchi and Elena Ferrari, editors, Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques. Chapman and Hall/CRC Press, 2010.

Ziyu Guan, Can Wang, Jiajun Bu, Chun Chen, Kun Yang, Deng Cai, and Xiaofei He. Document recommendation in social tagging services. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 391–400, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772731. URL http://doi.acm.org/10.1145/1772690.1772731.

Qing Guo, Wenfei Liu, Yuan Lin, and Hongfei Lin. Query expansion based on user quality in folksonomy. In Yuexian Hou, Jian-Yun Nie, Le Sun, Bo Wang, and Peng Zhang, editors, Information Retrieval Technology, volume 7675 of Lecture Notes in Computer Science, pages 396–405. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-35340-6. doi: 10.1007/978-3-642-35341-3_35. URL http://dx.doi.org/10.1007/978-3-642-35341-3_35.

Manish Gupta, Rui Li, Zhijun Yin, and Jiawei Han. Survey on social tagging techniques. SIGKDD Explor. Newsl., 12:58–72, November 2010. ISSN 1931-0145. doi: 10.1145/1882471.1882480. URL http://doi.acm.org/10.1145/1882471.1882480.

Serge Gutwirth, Sjaak Nouwt, Yves Poullet, Paul Hert, and Cécile Terwangne. Reinventing data protection?, 2009. URL http://public.eblib.com/EBLPublic/PublicView.do?ptiID=450651.

Thiago S. Guzella and Walmir M. Caminhas. Review: A review of machine learning approaches to spam filtering. Expert Syst. Appl., 36:10206–10222, September 2009. ISSN 0957-4174. doi: 10.1016/j.eswa.2009.02.037. URL http://dl.acm.org/citation.cfm?id=1539049.1539460.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://doi.acm.org/10.1145/1656274.1656278.

Harry Halpin, Valentin Robu, and Hana Shepherd. The dynamics and semantics of collaborative tagging. In Proceedings of the 1st Semantic Authoring and Annotation Workshop (SAAW’06), pages 211–220, 2006.

Harry Halpin, Valentin Robu, and Hana Shepherd. The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 211–220, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-654-7. doi: 10.1145/1242572.1242602. URL http://dx.doi.org/10.1145/1242572.1242602.

Seungyeop Han, Yong Y. Ahn, Sue Moon, and Hawoong Jeong. Collaborative blog spam filtering using adaptive percolation search. In WWW 2006 Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks, pages 97–102, 1999.

P. Heymann, Georgia Koutrika, and Hector Garcia-Molina. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11:36–45, November 2007. URL http://portal.acm.org/citation.cfm?id=1304062.1304547.

Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Can social bookmarking improve web search? In WSDM ’08: Proceedings of the international conference on Web search and web data mining, pages 195–206, Palo Alto, California, USA, 2008. ACM. ISBN 978-1-59593-927-9. URL http://dx.doi.org/10.1145/1341531.1341558.

Maureen Heymans. Introducing google social search: I finally found my friend’s new york blog!, 2009. URL http://googleblog.blogspot.de/2009/10/introducing-google-social-search-i.html.

Thomas Hoeren. Internetrecht, 2010. URL http://www.uni-muenster.de/Jura.itm/hoeren/materialien/Skript/Skript_Internetrecht_September202010.pdf. P. 419 et seq.

Gerrit Hornung. Eine Datenschutz-Grundverordnung für Europa? Zeitschrift für Datenschutz, 2(3):99–106, 2012.

Gerrit Hornung and Christoph Schnabel. Data protection in Germany I: The population census decision and the right to informational self-determination. Computer Law & Security Review, 25(1):84–88, 2009. ISSN 0267-3649. doi: 10.1016/j.clsr.2008.11.002. URL http://www.sciencedirect.com/science/article/pii/S0267364908001660.

Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. Information retrieval in folksonomies: Search and ranking. In York Sure and John Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of LNAI, pages 411–426, Heidelberg, 2006a. Springer. URL http://www.kde.cs.uni-kassel.de/hotho.

Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. BibSonomy: A Social Bookmark and Publication Sharing System. In Aldo de Moor, Simon Polovina, and Harry Delugach, editors, Proceedings of the Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, Aalborg, Denmark, July 2006b. Aalborg University Press.

Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. Information retrieval in folksonomies: Search and ranking. In York Sure and John Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of Lecture Notes in Computer Science, pages 411–426, Heidelberg, June 2006c. Springer.

Andreas Hotho, Dominik Benz, Robert Jäschke, and Beate Krause, editors. ECML PKDD Discovery Challenge 2008 (RSDC’08). Workshop at 18th Europ. Conf. on Machine Learning (ECML’08) / 11th Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’08), 2008. URL http://www.kde.cs.uni-kassel.de/ws/rsdc08/pdf/all_rsdc_v2.pdf.

D.I. Ignatov, R. Zhuk, and N. Konstantinova. Learning hypotheses from triadic labeled data. In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on, volume 2, pages 474–480, Aug 2014. doi: 10.1109/WI-IAT.2014.136.

Ivan Ivanov, Peter Vajda, Jong-Seok Lee, and Touradj Ebrahimi. In tags we trust: Trust modeling in social tagging of multimedia content. IEEE Signal Process. Mag., 29(2):98–107, 2012. doi: 10.1109/MSP.2011.942345. URL http://dx.doi.org/10.1109/MSP.2011.942345.

Ivorix. Sine-weighted moving average, 2007. URL http://www.ivorix.com/en/ products/tech/smooth/smooth.html.

Bernard J. Jansen. Search log analysis: What it is, what’s been done, how to do it. Library & Information Science Research, 28(3):407–432, 2006. URL http://dx.doi.org/10.1016/j.lisr.2006.06.005.

Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. Real life, real users, and real needs: a study and analysis of user queries on the web. Inf. Process. Manage., 36:207–227, January 2000. ISSN 0306-4573. doi: 10.1016/S0306-4573(99)00056-4. URL http://dl.acm.org/citation.cfm?id=342495.342498.

Bernard J. Jansen, Danielle L. Booth, and Amanda Spink. Determining the informational, navigational, and transactional intent of web queries. Inf. Process. Manage., 44:1251–1266, May 2008. ISSN 0306-4573. doi: 10.1016/j.ipm.2007.07.015. URL http://portal.acm.org/citation.cfm?id=1351187.1351372.

Sara Javanmardi, David W. McDonald, and Cristina V. Lopes. Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration, WikiSym ’11, pages 82–90, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0909-7. doi: 10.1145/2038558.2038573. URL http://doi.acm.org/10.1145/2038558.2038573.

Wooseob Jeong. Is tagging effective?: overlapping ratios with other metadata fields. In Proceedings of the 2009 International Conference on Dublin Core and Metadata Applications, pages 31–39. Dublin Core Metadata Initiative, 2009. URL http://dl.acm.org/citation.cfm?id=1670638.1670643.

Karel Jezek and Jiri Hynek. The fight against spam - a machine learning approach. In Leslie Chan and Bob Martens, editors, ELPUB, pages 381–392, 2007. ISBN 978-3-85437-292-9. URL http://dblp.uni-trier.de/db/conf/elpub/elpub2007.html#JezekH07.

Jay J. Jiang and David W. Conrath. Semantic Similarity based on Corpus Statistics and Lexical Taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING), pages 19–33. Taiwan, 1997.

Yan’an Jin, Ruixuan Li, Yi Cai, Qing Li, Ali Daud, and Yuhua Li. Semantic grounding of hybridization for tag recommendation. In Proceedings of the 11th international conference on Web-age information management, WAIM’10, pages 139–150, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-14245-1, 978-3-642-14245-1. URL http://dl.acm.org/citation.cfm?id=1884017.1884037.

T. Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 133–142, 2002.

Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML ’98, pages 137–142, London, UK, 1998a. Springer-Verlag. ISBN 3-540-64417-2. URL http://dl.acm.org/citation.cfm?id=645326.649721.

Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137–142, Chemnitz, DE, 1998b. Springer Verlag, Heidelberg, DE.

Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’05, pages 154–161, New York, NY, USA, 2005. ACM. ISBN 1-59593-034-5. doi: 10.1145/1076034.1076063. URL http://doi.acm.org/10.1145/1076034.1076063.

Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inf. Syst., 25, April 2007. ISSN 1046-8188. doi: 10.1145/1229179.1229181. URL http://doi.acm.org/10.1145/1229179.1229181.

Robert Jäschke. Formal concept analysis and tag recommendations in collaborative tagging systems. PhD thesis, Heidelberg, 2011. URL http://d-nb.info/1010227467/04. Zugl.: Kassel, Univ., Diss. 2010.

Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and Gerd Stumme. Tag recommendations in folksonomies. In Joost Kok, Jacek Koronacki, Ramon Lopez de Mantaras, Stan Matwin, Dunja Mladenic, and Andrzej Skowron, editors, Knowledge Discovery in Databases: PKDD 2007, volume 4702 of Lecture Notes in Computer Science, pages 506–514. Springer, Berlin / Heidelberg, 2007. ISBN 978-3-540-74975-2. doi: 10.1007/978-3-540-74976-9_52. URL http://dx.doi.org/10.1007/978-3-540-74976-9_52.

Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt-Thieme, and Gerd Stumme. Tag recommendations in social bookmarking systems. AI Communications, 21(4):231–247, 2008. ISSN 0921-7126. doi: 10.3233/AIC-2008-0438. URL http://dx.doi.org/10.3233/AIC-2008-0438.

Robert Jäschke, Folke Eisterlehner, Andreas Hotho, and Gerd Stumme. Testing and evaluating tag recommenders in a live system. In RecSys ’09: Proceedings of the third ACM Conference on Recommender Systems, pages 369–372, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. doi: 10.1145/1639714.1639790. URL http://www.kde.cs.uni-kassel.de/pub/pdf/jaeschke2009testing.pdf.

Jaap Kamps, Marijn Koolen, and Andrew Trotman. Comparative analysis of clicks and judgments for ir evaluation. In Proceedings of the 2009 workshop on Web Search Click Data, WSCD ’09, pages 80–87, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-434-8. doi: 10.1145/1507509.1507522. URL http://doi.acm.org/10.1145/1507509.1507522.

Cushla Kapitzke and Bertram C. Bruce, editors. Libr@ries : changing information space and practice. Lawrence Erlbaum Associates, Mahwah, New Jersey, 2006. URL http://eprints.qut.edu.au/33025/.

Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Information Retrieval, 16(2):138–178, 2013. ISSN 1386-4564. doi: 10.1007/s10791-012-9205-0. URL http://dx.doi.org/10.1007/s10791-012-9205-0.

Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46: 604–632, September 1999a. ISSN 0004-5411. doi: 10.1145/324133.324140. URL http://doi.acm.org/10.1145/324133.324140.

Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999b. URL http://citeseer.ist.psu.edu/kleinberg99authoritative.html.

Thomas Knerr. Tagging ontology - towards a common ontology for folksonomies, 2006. URL http://code.google.com/p/tagont/; http://tagont.googlecode.com/files/TagOntPaper.pdf.

Pranam Kolari, Tim Finin, and Anupam Joshi. SVMs for the Blogosphere: Blog Identification and Splog Detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006a. Also available as technical report TR-CS-05-13.

Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, and Anupam Joshi. Detecting spam blogs: a machine learning approach. In Proceedings of the 21st national conference on Artificial intelligence - Volume 2, AAAI’06, pages 1351–1356. AAAI Press, 2006b. ISBN 978-1-57735-281-5. URL http://dl.acm.org/citation.cfm?id=1597348.1597403.

Santanu Kolay and Ali Dasdan. The value of socially tagged urls for a search engine. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pages 1203–1204, New York, NY, USA, 2009a. ACM. ISBN 978-1-60558-487-4. doi: 10.1145/1526709.1526929. URL http://doi.acm.org/10.1145/1526709.1526929.

Santanu Kolay and Ali Dasdan. The value of socially tagged urls for a search engine. In Juan Quemada, Gonzalo Leon, Yoelle S. Maarek, and Wolfgang Nejdl, editors, WWW, pages 1203–1204. ACM, 2009b. ISBN 978-1-60558-487-4. URL http://dblp.uni-trier.de/db/conf/www/www2009.html#KolayD09.

Christian Körner, Dominik Benz, Andreas Hotho, Markus Strohmaier, and Gerd Stumme. Stop thinking, start tagging: tag semantics emerge from collaborative verbosity. In Proceedings of the 19th international conference on World wide web, pages 521–530. ACM, 2010.

Eleni Kosta, Aleksandra Kuczerawy, Ronald Leenes, and Jos Dumortier. Regulating identity management. In Digital Privacy - PRIME, pages 73–89. 2011.

Georgia Koutrika, Frans Adjie Effendi, Zoltán Gyöngyi, Paul Heymann, and Hector Garcia-Molina. Combating spam in tagging systems. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, AIRWeb ’07, pages 57–64, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-732-2. doi: 10.1145/1244408.1244420. URL http://doi.acm.org/10.1145/1244408.1244420.

Dominik Kowald, Emanuel Lacic, and Christoph Trattner. Tagrec: Towards a standardized tag recommender benchmarking framework. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, HT ’14, pages 305–307, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2954-5. doi: 10.1145/2631775.2631781. URL http://doi.acm.org/10.1145/2631775.2631781.

Dominik Kowald, Paul Seitlinger, Simone Kopeinik, Tobias Ley, and Christoph Trattner. Forgetting the words but remembering the meaning: Modeling forgetting in a verbal and semantic tag recommender. In Martin Atzmueller, Alvin Chin, Christoph Scholz, and Christoph Trattner, editors, Mining, Modeling, and Recommending ’Things’ in Social Media, Lecture Notes in Computer Science, pages 75–95. Springer International Publishing, 2015. ISBN 978-3-319-14722-2. doi: 10.1007/978-3-319-14723-9_5. URL http://dx.doi.org/10.1007/978-3-319-14723-9_5.

Beate Krause, Andreas Hotho, and Gerd Stumme. A comparison of social bookmarking with traditional search. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, 30th European Conference on IR Research, ECIR 2008, volume 4956 of Lecture Notes in Computer Science, pages 101–113, Glasgow, UK, April 2008a. Springer.

Beate Krause, Robert Jäschke, Andreas Hotho, and Gerd Stumme. Logsonomy - social information retrieval with logdata. In HT ’08: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, pages 157–166, New York, NY, USA, 2008b. ACM. ISBN 978-1-59593-985-2. doi: 10.1145/1379092.1379123.

Beate Krause, Christoph Schmitz, Andreas Hotho, and Gerd Stumme. The anti-social tagger: detecting spam in social bookmarking systems. In AIRWeb ’08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web, pages 61–68, New York, NY, USA, 2008c. ACM. ISBN 978-1-60558-159-0. doi: http://doi.acm.org/10.1145/1451983.1451998.

Beate Krause, Hana Lerch, Andreas Hotho, Alexander Roßnagel, and Gerd Stumme. Datenschutz im Web 2.0 am Beispiel des sozialen Tagging-Systems BibSonomy. Informatik-Spektrum, pages 1–12, 2010. ISSN 0170-6012. doi: 10.1007/s00287-010-0485-8. URL http://dx.doi.org/10.1007/s00287-010-0485-8.

Beate Krause, Hana Lerch, Andreas Hotho, Alexander Roßnagel, and Gerd Stumme. Datenschutz im Web 2.0 am Beispiel des sozialen Tagging-Systems BibSonomy. Informatik Spektrum, 35(1):12–23, 2012. URL http://dblp.uni-trier.de/db/journals/insk/insk35.html#KrauseLHRS12.

Balachander Krishnamurthy and Craig E. Wills. Characterizing privacy in online social networks. In Proceedings of the first workshop on Online social networks, WOSN ’08, pages 37–42, New York, NY, USA, 2008. ACM. URL http://doi.acm.org/10.1145/1397735.1397744.

Balachander Krishnamurthy, Phillipa Gill, and Martin Arlitt. A few chirps about twitter. In Proceedings of the first workshop on Online social networks, WOSN ’08, pages 19–24, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-182-8. doi: 10.1145/1397735.1397741. URL http://doi.acm.org/10.1145/1397735.1397741.

Christian Körner, Dominik Benz, Markus Strohmaier, Andreas Hotho, and Gerd Stumme. Stop thinking, start tagging - tag semantics emerge from collaborative verbosity. In Proceedings of the 19th International World Wide Web Conference (WWW 2010), Raleigh, NC, USA, April 2010a. ACM. URL http://www.kde.cs.uni-kassel.de/benz/papers/2010/koerner2010thinking.pdf.

Christian Körner, Roman Kern, Hans-Peter Grahsl, and Markus Strohmaier. Of categorizers and describers: an evaluation of quantitative measures for tagging motivation. In HT ’10: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, pages 157–166, New York, NY, USA, 2010b. ACM. ISBN 978-1-4503-0041-4. doi: 10.1145/1810617.1810645. URL http://portal.acm.org/citation.cfm?id=1810617.1810645.

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 591–600, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772751. URL http://doi.acm.org/10.1145/1772690.1772751.

Frederick Wilfrid Lancaster. Indexing and abstracting in theory and practice. Univ. of Illinois, Graduate School of Library and Information Science, 2003.

S. le Cessie and J.C. van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201, 1992.

Kang-Pyo Lee, Hong-Gee Kim, and Hyoung-Joo Kim. A social inverted index for social-tagging-based information retrieval. J. Inf. Sci., 38(4):313–332, August 2012. ISSN 0165-5515. doi: 10.1177/0165551512438357. URL http://dx.doi.org/10.1177/0165551512438357.

Sangho Lee and Jong Kim. WarningBird: A near real-time detection system for suspicious URLs in Twitter stream. IEEE Trans. Dependable Secur. Comput., 10(3):183–195, May 2013. ISSN 1545-5971. doi: 10.1109/TDSC.2013.3. URL http://dx.doi.org/10.1109/TDSC.2013.3.

Hana Lerch, Beate Krause, Andreas Hotho, Alexander Roßnagel, and Gerd Stumme. Social Bookmarking-Systeme – die unerkannten Datensammler – Ungewollte personenbezogene Datenverarbeitung? MultiMedia und Recht, 7:454–458, 2010.

Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2011a.

Hang Li. A short introduction to learning to rank. IEICE Transactions, 94-D(10):1854–1862, 2011b. URL http://dblp.uni-trier.de/db/journals/ieicet/ieicet94d.html#Li11.

Peng Li, Jian-Yun Nie, Bin Wang, and Jing He. Document re-ranking using partial social tagging. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on, volume 1, pages 274–281, December 2012. doi: 10.1109/WI-IAT.2012.124. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6511896.

Ping Li, Christopher J. C. Burges, and Qiang Wu. McRank: Learning to rank using multiple classification and gradient boosting. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. MIT Press, 2007. URL http://dblp.uni-trier.de/db/conf/nips/nips2007.html#LiBW07.

Xin Li, Lei Guo, and Yihong Eric Zhao. Tag-based social interest discovery. In Proceedings of the 17th international conference on World Wide Web, WWW ’08, pages 675–684, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2. doi: 10.1145/1367497.1367589. URL http://doi.acm.org/10.1145/1367497.1367589.

Freddy Limpens, Fabien Gandon, and Michel Buffa. Helping online communities to semantically enrich folksonomies. Web Science Conference 2010, 2010.

Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, and Belle L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, AIRWeb ’07, pages 1–8, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-732-2. doi: 10.1145/1244408.1244410. URL http://doi.acm.org/10.1145/1244408.1244410.

Yuan Lin, Hongfei Lin, Song Jin, and Zheng Ye. Social annotation in query expansion: A machine learning approach. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 405–414, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0757-4. doi: 10.1145/2009916.2009972. URL http://doi.acm.org/10.1145/2009916.2009972.

G. Linden, B. Smith, and J. York. Amazon.com recommendations - item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.

Marek Lipczak and Evangelos Milios. The impact of resource title on tags in collaborative tagging systems. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, HT ’10, pages 179–188, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0041-4. doi: 10.1145/1810617.1810648. URL http://doi.acm.org/10.1145/1810617.1810648.

Kaipeng Liu, Binxing Fang, and Yu Zhang. Detecting tag spam in social tagging systems with collaborative knowledge. In FSKD’09: Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery, pages 427–431, Piscataway, NJ, USA, 2009. IEEE Press. ISBN 978-1-4244-4545-5. URL http://portal.acm.org/citation.cfm?id=1802229.

Tie Y. Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In SIGIR ’07: Proceedings of the Learning to Rank workshop in the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007. URL http://research.microsoft.com/~junxu/papers/SGIR2007-LR4IR%20workshop-LETOR.pdf.

Tie-Yan Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3:225–331, March 2009. ISSN 1554-0669. doi: 10.1561/1500000016. URL http://portal.acm.org/citation.cfm?id=1618303.1618304.

Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 2011. ISBN 978-3-642-14266-6.
Xin Liu and Tsuyoshi Murata. Detecting communities in k-partite k-uniform (hyper)networks. J. Comput. Sci. Technol., 26(5):778–791, September 2011. ISSN 1000-9000. doi: 10.1007/s11390-011-0177-0. URL http://dx.doi.org/10.1007/s11390-011-0177-0.

Craig Macdonald and Iadh Ounis. Usefulness of quality click-through data for training. In WSCD ’09: Proceedings of the 2009 workshop on Web Search Click Data, pages 75–79, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-434-8. doi: 10.1145/1507509.1507521.

Craig Macdonald, Rodrygo L. Santos, and Iadh Ounis. The whens and hows of learning to rank for web search. Inf. Retr., 16(5):584–628, October 2013. ISSN 1386-4564. doi: 10.1007/s10791-012-9209-9. URL http://dx.doi.org/10.1007/s10791-012-9209-9.

Matteo Manca, Ludovico Boratto, and Salvatore Carta. Friend recommendation in a social bookmarking system: Design and architecture guidelines. In Kohei Arai, Supriya Kapoor, and Rahul Bhatia, editors, Intelligent Systems in Science and Information 2014, volume 591 of Studies in Computational Intelligence, pages 227–242. Springer International Publishing, 2015. ISBN 978-3-319-14653-9. doi: 10.1007/978-3-319-14654-6_14. URL http://dx.doi.org/10.1007/978-3-319-14654-6_14.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, USA, 2008.

Marcel Rosenbach and Holger Stark. Der NSA-Komplex: Edward Snowden und der Weg in die totale Überwachung. Deutsche Verlags-Anstalt, München, 2014. ISBN 978-3-641-14150-9.

Leandro Balby Marinho, Alexandros Nanopoulos, Lars Schmidt-Thieme, Robert Jäschke, Andreas Hotho, Gerd Stumme, and Panagiotis Symeonidis. Social tagging recommender systems. In Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors, Recommender Systems Handbook, pages 615–644. Springer US, 2011. ISBN 978-0-387-85820-3. doi: 10.1007/978-0-387-85820-3_19. URL http://dx.doi.org/10.1007/978-0-387-85820-3_19.

Benjamin Markines, Ciro Cattuto, and Filippo Menczer. Social spam detection. In Dennis Fetterly and Zoltán Gyöngyi, editors, Proc. 5th International Workshop on Adversarial Information Retrieval on the Web AIRWeb, ACM International Conference Proceeding Series, pages 41–48, 2009a. ISBN 978-1-60558-438-6. URL http://dblp.uni-trier.de/db/conf/airweb/airweb2009.html#MarkinesCM09.

Benjamin Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho, and Gerd Stumme. Evaluating similarity measures for emergent semantics of social tagging. In 18th International World Wide Web Conference, pages 641–641, April 2009b. URL http://www.kde.cs.uni-kassel.de/pub/pdf/markines2009evaluating.pdf.

Cameron Marlow, Mor Naaman, Danah Boyd, and Marc Davis. Position Paper, Tagging, Taxonomy, Flickr, Article, ToRead. In Proceedings of the Collaborative Web Tagging Workshop at the WWW 2006, Edinburgh, Scotland, May 2006. URL http://rawsugar.com/www2006/cfp.html.

Maureen Heymans and Murali Viswanathan. Introducing Google Social Search: I finally found my friend’s New York blog!, 2009. URL http://googleblog.blogspot.de/2009/10/introducing-google-social-search-i.html. Accessed: 2014-02-17.

Andrew McCallum and Kamal Nigam. A comparison of event models for naive bayes text classification. In Learning for Text Categorization: Papers from the 1998 AAAI Workshop, pages 41–48. AAAI Press, 1998.

Soghra M. Gargari and Sule Gunduz Oguducu. A novel framework for spammer detection in social bookmarking systems. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ASONAM ’12, pages 827–834, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4799-2. doi: 10.1109/ASONAM.2012.150. URL http://dx.doi.org/10.1109/ASONAM.2012.150.

Peter Mika. Ontologies are us: A unified model of social networks and semantics. In Proceedings of the Fourth International Semantic Web Conference (ISWC 2005), LNCS, pages 522–536. Springer, 2005. URL http://www.cs.vu.nl/~pmika/research/papers/ISWC-folksonomy.pdf.

Gilad Mishne, David Carmel, and Ronny Lempel. Blocking blog spam with language model disagreement. In AIRWeb, pages 1–6, 2005. URL http://dblp.uni-trier.de/db/conf/airweb/airweb2005.html#MishneCL05.

Thomas Mitchell. Machine Learning. McGraw-Hill Education (ISE Editions), October 1997. ISBN 0071154671.

Folke Mitzlaff, Martin Atzmueller, Dominik Benz, Andreas Hotho, and Gerd Stumme. Community assessment using evidence networks. In Martin Atzmueller, Andreas Hotho, Markus Strohmaier, and Alvin Chin, editors, Analysis of Social Media and Ubiquitous Data, volume 6904 of Lecture Notes in Computer Science, pages 79–98. Springer Berlin Heidelberg, 2011. ISBN 978-3-642-23598-6. doi: 10.1007/978-3-642-23599-3_5. URL http://dx.doi.org/10.1007/978-3-642-23599-3_5.

Folke Mitzlaff, Martin Atzmueller, Andreas Hotho, and Gerd Stumme. The social distributional hypothesis: a pragmatic proxy for homophily in online social networks. Social Network Analysis and Mining, 4(1):216, 2014. ISSN 1869-5450. doi: 10.1007/s13278-014-0216-2. URL http://dx.doi.org/10.1007/s13278-014-0216-2.

Paola Monachesi and Thomas Markus. Using social media for ontology enrichment. In Proceedings of the 7th International Conference on The Semantic Web: Research and Applications - Volume Part II, ESWC’10, pages 166–180, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-13488-2, 978-3-642-13488-3. doi: 10.1007/978-3-642-13489-0_12. URL http://dx.doi.org/10.1007/978-3-642-13489-0_12.

Meredith Ringel Morris and Eric Horvitz. Searchtogether: an interface for collaborative web search. In Proceedings of the 20th annual ACM symposium on User interface software and technology, UIST ’07, pages 3–12, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-679-0. doi: 10.1145/1294211.1294215. URL http://doi.acm.org/10.1145/1294211.1294215.

P. Jason Morrison. Tagging and searching: Search retrieval effectiveness of folksonomies on the world wide web. Inf. Process. Manage., 44:1562–1579, July 2008. ISSN 0306-4573. doi: 10.1016/j.ipm.2007.12.010. URL http://portal.acm.org/citation.cfm?id=1377061.1377474.

Vasanth Nair and Sumeet Dua. Folksonomy-based ad hoc community detection in online social networks. Social Network Analysis and Mining, 2(4):305–328, 2012. ISSN 1869-5450. doi: 10.1007/s13278-012-0081-9. URL http://dx.doi.org/10.1007/s13278-012-0081-9.

Ramesh Nallapati. Discriminative models for information retrieval. In SIGIR ’04: Pro- ceedings of the 27th annual international ACM SIGIR conference on Research and de- velopment in information retrieval, pages 64–71, New York, NY, USA, 2004. ACM. ISBN 1-58113-881-4. doi: 10.1145/1008992.1009006.

Beate Navarro Bullock, Robert Jäschke, and Andreas Hotho. Tagging data as implicit feedback for learning-to-rank. In Proceedings of the ACM WebSci’11, June 2011a. URL http://journal.webscience.org/463/.

Beate Navarro Bullock, Hana Lerch, Alexander Rossnagel, Andreas Hotho, and Gerd Stumme. Privacy-aware spam detection in social bookmarking systems. In Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies, i-KNOW ’11, pages 15:1–15:8, New York, NY, USA, 2011b. ACM. ISBN 978-1-4503-0732-1. doi: 10.1145/2024288.2024306. URL http://doi.acm.org/10.1145/2024288.2024306.

Nicolas Neubauer and Klaus Obermayer. Hyperincident connected components of tagging networks. SIGWEB Newsl., pages 4:1–4:10, September 2009. ISSN 1931-1745. doi: 10.1145/1592394.1592398. URL http://doi.acm.org/10.1145/1592394.1592398.

Nicolas Neubauer and Klaus Obermayer. Tripartite community structure in social bookmarking data. New Rev. Hypermedia Multimedia, 17(3):267–294, December 2011. ISSN 1361-4568. doi: 10.1080/13614568.2011.598952. URL http://dx.doi.org/10.1080/13614568.2011.598952.

Nicolas Neubauer, Robert Wetzker, and Klaus Obermayer. Tag spam creates large non-giant connected components. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb ’09, pages 49–52, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-438-6. doi: 10.1145/1531914.1531925. URL http://doi.acm.org/10.1145/1531914.1531925.

M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):323–351, 2005. doi: 10.1080/00107510500052444. URL http://www.tandfonline.com/doi/abs/10.1080/00107510500052444.

Casey Newton. WhatsApp now has over 400 million monthly users, 2013. URL http://www.theverge.com/2013/12/19/5228656/whatsapp-now-has-over-400-million-monthly-users. Accessed: 2014-02-16.

Satoshi Niwa, Takuo Doi, and Shinichi Honiden. Web page recommender system based on folksonomy mining for ITNG ’06 submissions. In Proceedings of the Third International Conference on Information Technology: New Generations, pages 388–393, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2497-4. doi: 10.1109/ITNG.2006.140. URL http://dl.acm.org/citation.cfm?id=1128011.1128138.

Michael G. Noll. Understanding and Leveraging the Social Web for Information Retrieval. PhD thesis, Universität Potsdam, April 2010.

Michael G. Noll and Christoph Meinel. Exploring social annotations for web document classification. In Proceedings of the 2008 ACM symposium on Applied computing, SAC ’08, pages 2315–2320, New York, NY, USA, 2008a. ACM. ISBN 978-1-59593-753-7. doi: 10.1145/1363686.1364235. URL http://doi.acm.org/10.1145/1363686.1364235.

Michael G. Noll and Christoph Meinel. The metadata triumvirate: Social annotations, anchor texts and search queries. In Web Intelligence, pages 640–647. IEEE, 2008b. URL http://dblp.uni-trier.de/db/conf/webi/webi2008.html#NollM08.

Michael G. Noll, Ching man Au Yeung, Nicholas Gibbins, Christoph Meinel, and Nigel Shadbolt. Telling experts from spammers: expertise ranking in folksonomies. In SIGIR ’09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 612–619, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-483-6. doi: 10.1145/1571941.1572046. URL http://portal.acm.org/citation.cfm?id=1571941.1572046.

C. Noorda and S. Hanloser. E-Discovery and Data Privacy: A Practical Guide. Kluwer Law International, 2010. ISBN 9789041133458. URL http://books.google.de/books?id=o95_SqirU5wC.

OECD. Guidelines on the protection of privacy and transborder flows of personal data., 1980. URL http://www.oecd.org/document/18/0,2340,en_2649_34255_1815186_1_1_1_1,00.html.

Tim O’Reilly. What is web 2.0. Design patterns and business models for the next generation of software, September 2005. URL http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html. Accessed: 2009-05-12.

Simon Overell, Börkur Sigurbjörnsson, and Roelof van Zwol. Classifying tags using open content resources. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 64–73, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-390-7. doi: 10.1145/1498759.1498810. URL http://doi.acm.org/10.1145/1498759.1498810.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. URL http://citeseer.ist.psu.edu/page98pagerank.html.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.

Symeon Papadopoulos, Yiannis Kompatsiaris, and Athena Vakali. A graph-based clustering scheme for identifying related tags in folksonomies. In Proceedings of the 12th international conference on Data warehousing and knowledge discovery, DaWaK’10, pages 65–76, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-15104-3, 978-3-642-15104-0. URL http://dl.acm.org/citation.cfm?id=1881923.1881931.

Symeon Papadopoulos, Athena Vakali, and Yiannis Kompatsiaris. Community detection in collaborative tagging systems. In Eric Pardede, editor, Community-Built Databases, pages 107–131. Springer Berlin Heidelberg, 2011. ISBN 978-3-642-19046-9. doi: 10.1007/978-3-642-19047-6_5. URL http://dx.doi.org/10.1007/978-3-642-19047-6_5.

G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In Proc. 1st Intl. Conf. on Scalable Information Systems, page 1. ACM Press, New York, NY, USA, 2006.

Jing Peng, Daniel Dajun Zeng, Huimin Zhao, and Fei-yue Wang. Collaborative filtering in social tagging systems based on joint item-tag recommendations. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM ’10, pages 809–818, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0099-5. doi: 10.1145/1871437.1871541. URL http://doi.acm.org/10.1145/1871437.1871541.

Martin Potthast, Benno Stein, and Robert Gerling. Automatic vandalism detection in wikipedia. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Advances in Information Retrieval, volume 4956 of Lecture Notes in Computer Science, pages 663–668. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-78645-0. doi: 10.1007/978-3-540-78646-7_75. URL http://dx.doi.org/10.1007/978-3-540-78646-7_75.

Martin Potthast, Benno Stein, and Teresa Holfeld. Overview of the 1st International Competition on Wikipedia Vandalism Detection. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF (Notebook Papers/LABs/Workshops), 2010. ISBN 978-88-904810-0-0. URL http://dblp.uni-trier.de/db/conf/clef/clef2010w.html#PotthastSH10.

J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986. ISSN 0885-6125. doi: 10.1023/A:1022643204877. URL http://dx.doi.org/10.1023/A:1022643204877.

J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1-55860-238-0.

Emanuele Quintarelli. Folksonomies: power to the people, June 2005. URL http://www-dimat.unipv.it/biblio/isko/doc/folksonomies.htm.

Steffen Rendle and Lars Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In Brian D. Davison, Torsten Suel, Nick Craswell, and Bing Liu, editors, WSDM, pages 81–90. ACM, 2010. ISBN 978-1-60558-889-6. URL http://dblp.uni-trier.de/db/conf/wsdm/wsdm2010.html#RendleS10.

195 BIBLIOGRAPHY

Alexander Roßnagel. Datenschutz in einem informatisierten Alltag. Studie für die Friedrich-Ebert-Stiftung, Berlin, 2007.

Giovanni Maria Sacco and Yannis Tzitzikas. Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience. Springer, New York, 2009. ISBN 9783642023583. URL http://www.worldcat.org/search?qt=worldcat_org_all&q=9783642023583.

Owen Sacco and Cécile Bothorel. Exploiting semantic web techniques for representing and utilising folksonomies. In Proceedings of the International Workshop on Modeling Social Media, MSM ’10, pages 9:1–9:8, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0229-6. doi: 10.1145/1835980.1835989. URL http://doi.acm.org/10.1145/1835980.1835989.

Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian approach to filtering junk E-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.

Yuta Sakakura, Toshiyuki Amagasa, and Hiroyuki Kitagawa. Detecting social bookmark spams using multiple user accounts. In ASONAM, pages 1153–1158. IEEE Computer Society, 2012. ISBN 978-0-7695-4799-2. URL http://dblp.uni-trier.de/db/conf/asunam/asonam2012.html#SakakuraAK12.

Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523, August 1988. ISSN 0306-4573. doi: 10.1016/0306-4573(88)90021-0. URL http://dx.doi.org/10.1016/0306-4573(88)90021-0.

Rossano Schifanella, Alain Barrat, Ciro Cattuto, Benjamin Markines, and Filippo Menczer. Folks in folksonomies: Social link prediction from shared metadata. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, pages 271–280, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-889-6. doi: 10.1145/1718487.1718521. URL http://doi.acm.org/10.1145/1718487.1718521.

Christoph Schmitz, Andreas Hotho, Robert Jäschke, and Gerd Stumme. Mining association rules in folksonomies. In V. Batagelj, H.-H. Bock, A. Ferligoj, and A. Žiberna, editors, Data Science and Classification. Proceedings of the 10th IFCS Conf., Studies in Classification, Data Analysis, and Knowledge Organization, pages 261–270, Heidelberg, July 2006. Springer.

Patrick Schmitz. Inducing ontology from flickr tags. In Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland, May 2006.

Johann Schrammel, Christina Köffel, and Manfred Tscheligi. How much do you tell?: information disclosure behaviour in different types of online communities. In Proceedings of the fourth international conference on Communities and technologies, C&T ’09, pages 275–284, New York, NY, USA, 2009a. ACM. ISBN 978-1-60558-713-4. doi: 10.1145/1556460.1556500. URL http://doi.acm.org/10.1145/1556460.1556500.

Johann Schrammel, Christina Köffel, and Manfred Tscheligi. Personality traits, usage patterns and information disclosure in online communities. In Proceedings of the 23rd British HCI Group Annual Conference on People and Computers: Celebrating People and Technology, BCS-HCI ’09, pages 169–174, Swinton, UK, 2009b. British Computer Society. URL http://portal.acm.org/citation.cfm?id=1671011.1671031.

D. Sculley. Advances in Online Learning-based Spam Filtering. PhD thesis, Tufts University, Boston, MA, August 2008. URL http://www.ceas.cc/2007/papers/paper-61.pdf.

David Sculley and Gabriel M. Wachman. Relaxed online svms for spam filtering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 415–422, New York, NY, USA, 2007. ACM. URL http://doi.acm.org/10.1145/1277741.1277813.

Shilad Sen, Shyong K. Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski, Jeremy Osterhouse, F. Maxwell Harper, and John Riedl. tagging, communities, vocabulary, evolution. In CSCW ’06: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, pages 181–190, New York, NY, USA, 2006. ACM. ISBN 1-59593-249-6. doi: 10.1145/1180875.1180904. URL http://portal.acm.org/citation.cfm?id=1180904.

Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. URL http://axon.cs.byu.edu/~martinez/classes/778/Papers/settles.activelearning.pdf.

Clay Shirky. Ontology is overrated: Categories, links, and tags, 2005. URL http://www.shirky.com/writings/ontology_overrated.html.

Alexander B. Sideridis and Charalampos Z. Patrikakis, editors. Next Generation Society. Technological and Legal Issues - Third International Conference, e-Democracy 2009, Athens, Greece, September 23-25, 2009, Revised Selected Papers, volume 26 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2010. Springer. ISBN 978-3-642-11629-2.

Sean Silcoff. Upstart Canadian chat service Kik logs 100 million users, 2013. URL http://www.theglobeandmail.com/technology/tech-news/upstart-canadian-chat-service-kik-logs-100-million-users/article15940135/. Accessed: 2014-02-16.

Fabrizio Silvestri. Mining query logs: Turning search usage data into knowledge. Found. Trends Inf. Retr., 4:1–174, January 2010. ISSN 1554-0669. doi: 10.1561/1500000013. URL http://dx.doi.org/10.1561/1500000013.

Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001. URL http://dblp.uni-trier.de/db/journals/debu/debu24.html#Singhal01.

K. Smets, B. Goethals, and B. Verdonk. Automatic vandalism detection in wikipedia: Towards a machine learning approach. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy (WikiAI08), pages 43–48. AAAI Press, 2008.

Sören Sonnenburg, Gunnar Rätsch, and Konrad Rieck. Large scale learning with string kernels. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 73–103. MIT Press, Cambridge, MA., 2007.

Lucia Specia and Enrico Motta. Integrating folksonomies with the semantic web. In Proceedings of the 4th European conference on The Semantic Web: Research and Applications, ESWC ’07, pages 624–639, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72666-1. doi: 10.1007/978-3-540-72667-8_44. URL http://dx.doi.org/10.1007/978-3-540-72667-8_44.

Louise Spiteri. Structure and form of folksonomy tags: The road to the public library catalogue. Webology, 4(2), 2007.

Tobias Stadler. Schutz vor Spam durch Greylisting - Eine rechtsadäquate Handlungsoption? Datenschutz und Datensicherheit, 6:433–438, 2005.

Julia Stoyanovich, Sihem Amer-Yahia, Cameron Marlow, and Cong Yu. Leveraging tagging to model user interests in del.icio.us. In AAAI’08: Proceedings of the 2008 AAAI Social Information Spring Symposium, 2008.

Markus Strohmaier, Denis Helic, Dominik Benz, Christian Körner, and Roman Kern. Evaluation of folksonomy induction algorithms. ACM Trans. Intell. Syst. Technol., 3(4):74:1–74:22, September 2012. ISSN 2157-6904. doi: 10.1145/2337542.2337559. URL http://doi.acm.org/10.1145/2337542.2337559.

Frederic Stutzman. An evaluation of identity-sharing behavior in social network communities. iDMAa Journal, 3(1), 2006. URL http://www.ibiblio.org/fred/pubs/stutzman_pub4.pdf.

Danny Sullivan. MSN Search gets Neural Net (RankNet) technology & (potentially) awesome new search commands, 2005. URL http://searchenginewatch.com/article/2061776/. Accessed: 2014-02-16.

Kyoung-Jun Sung, Soo-Cheol Kim, and Sung Kwon Kim. Tag quantification for spam detection in social bookmarking system. In Advanced Information Management and Service (IMS), 2010 6th International Conference on, pages 297–303, Nov 2010.

Panagiotis Symeonidis. User recommendations based on tensor dimensionality reduction. In Lazaros S. Iliadis, Ilias Maglogiannis, Grigorios Tsoumakas, Ioannis P. Vlahavas, and Max Bramer, editors, AIAI, volume 296 of IFIP Advances in Information and Communication Technology, pages 331–340. Springer, 2009. ISBN 978-1-4419-0220-7. URL http://dblp.uni-trier.de/db/conf/ifip12/aiai2009.html#Symeonidis09.

Monika Taddicken. The ‘privacy paradox’ in the social web: The impact of privacy concerns, individual characteristics, and the perceived social relevance on different forms of self-disclosure. Journal of Computer-Mediated Communication, 19(2):248–273, 2014. ISSN 1083-6101. doi: 10.1111/jcc4.12052. URL http://dx.doi.org/10.1111/jcc4.12052.

Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson. Suspended accounts in retrospect: An analysis of twitter spam. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC ’11, pages 243–258, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-1013-0. doi: 10.1145/2068816.2068840. URL http://doi.acm.org/10.1145/2068816.2068840.

Kurt Thomas, Frank Li, Chris Grier, and Vern Paxson. Consequences of connectivity: Characterizing account hijacking on twitter. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS ’14, pages 489–500, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2957-6. doi: 10.1145/2660267.2660282. URL http://doi.acm.org/10.1145/2660267.2660282.

Luis von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’04, pages 319–326, New York, NY, USA, 2004. ACM. ISBN 1-58113-702-8. doi: 10.1145/985692.985733. URL http://doi.acm.org/10.1145/985692.985733.

Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’98, pages 315–323, New York, NY, USA, 1998. ACM. ISBN 1-58113-015-5. doi: 10.1145/290941.291017. URL http://doi.acm.org/10.1145/290941.291017.

Thomas Vander Wal. Explaining and showing broad and narrow folksonomies. Blog post, February 2005. URL http://www.personalinfocloud.com/2005/02/explaining_and_.html.

A.H. Wang. Don’t follow me: Spam detection in twitter. In Security and Cryptography (SECRYPT), Proceedings of the 2010 International Conference on, pages 1–10. IEEE, 2010.

Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu. Discovering overlapping groups in social media. In Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, and Xindong Wu, editors, ICDM, pages 569–578. IEEE Computer Society, 2010. URL http://dblp.uni-trier.de/db/conf/icdm/icdm2010.html#WangTGL10.

Christian Wartena. Automatic classification of social tags. In Mounia Lalmas, Joemon Jose, Andreas Rauber, Fabrizio Sebastiani, and Ingo Frommholz, editors, Research and Advanced Technology for Digital Libraries, volume 6273 of Lecture Notes in Computer Science, pages 176–183. Springer, Berlin / Heidelberg, 2010. ISBN 978-3-642-15463-8. doi: 10.1007/978-3-642-15464-5_19. URL http://dx.doi.org/10.1007/978-3-642-15464-5_19.

Duncan J. Watts and Steven Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393:440–442, June 1998.

Andrew G. West, Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee. Link spamming wikipedia for profit. In Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, CEAS ’11, pages 152–161, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0788-8. doi: 10.1145/2030376.2030394. URL http://doi.acm.org/10.1145/2030376.2030394.

Alan F. Westin. Privacy and freedom. Atheneum, New York, 1970. URL http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+267571232&sourceid=fbw_bibsonomy.

Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. Analyzing social bookmarking systems: A del.icio.us cookbook. In Mining Social Data (MSoDa) Workshop Proceedings, pages 26–30. ECAI 2008, July 2008.

Robert Wetzker, Winfried Umbrath, and Alan Said. A hybrid approach to item recommendation in folksonomies. In Proceedings of the WSDM ’09 Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR ’09, pages 25–29, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-430-0. doi: 10.1145/1506250.1506255. URL http://doi.acm.org/10.1145/1506250.1506255.

Walter Willinger, David Alderson, and John C. Doyle. Mathematics and the internet: A source of enormous confusion and great potential. Notices of the AMS, 56(5), May 2009.

Microsoft Windows. Messenger has moved to Skype, 2013. URL http://windows.microsoft.com/en-us/messenger/messenger-to-skype. Accessed: 2014-05-04.

Harris Wu, Mohammad Zubair, and Kurt Maly. Harvesting social knowledge from folksonomies. In Proceedings of the seventeenth conference on Hypertext and hypermedia, HYPERTEXT ’06, pages 111–114, New York, NY, USA, 2006a. ACM. ISBN 1-59593-417-0. doi: 10.1145/1149941.1149962. URL http://doi.acm.org/10.1145/1149941.1149962.

Xian Wu, Lei Zhang, and Yong Yu. Exploring social annotations for the semantic web. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 417–426, New York, NY, USA, 2006b. ACM Press. URL http://doi.acm.org/10.1145/1135777.1135839.

Jingfang Xu, Chuanliang Chen, Gu Xu, Hang Li, and Elbio Renato Torres Abib. Improving quality of training data for learning to rank using click-through data. In WSDM ’10: Proceedings of the third ACM international conference on Web search and data mining, pages 171–180, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-889-6. doi: 10.1145/1718487.1718509.

Hsin-Chang Yang and Chung-Hong Lee. Post-level spam detection for social bookmarking web sites. In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, pages 180–185, July 2011a. doi: 10.1109/ASONAM.2011.81.

Hsin-Chang Yang and Chung-Hong Lee. Post-level spam detection for social bookmarking web sites. In Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’11, pages 180–185, Washington, DC, USA, 2011b. IEEE Computer Society. ISBN 978-0-7695-4375-8. doi: 10.1109/ASONAM.2011.81. URL http://dx.doi.org/10.1109/ASONAM.2011.81.

Yuhao Yang, Jonathan Lutes, Fengjun Li, Bo Luo, and Peng Liu. Stalking online: On user privacy in social networks. In Proceedings of the Second ACM Conference on Data and Application Security and Privacy, CODASPY ’12, pages 37–48, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1091-8. doi: 10.1145/2133601.2133607. URL http://doi.acm.org/10.1145/2133601.2133607.

Sasan Yazdani, Ivan Ivanov, Morteza AnaLoui, Reza Berangi, and Touradj Ebrahimi. Spam fighting in social tagging systems. In Karl Aberer, Andreas Flache, Wander Jager, Ling Liu, Jie Tang, and Christophe Guéret, editors, Social Informatics, volume 7710 of Lecture Notes in Computer Science, pages 448–461. Springer Berlin Heidelberg, 2012a. ISBN 978-3-642-35385-7. doi: 10.1007/978-3-642-35386-4_33. URL http://dx.doi.org/10.1007/978-3-642-35386-4_33.

Sasan Yazdani, Ivan Ivanov, Morteza AnaLoui, Reza Berangi, and Touradj Ebrahimi. Spam fighting in social tagging systems. In Karl Aberer, Andreas Flache, Wander Jager, Ling Liu, Jie Tang, and Christophe Guéret, editors, SocInfo, volume 7710 of Lecture Notes in Computer Science, pages 448–461. Springer, 2012b. ISBN 978-3-642-35385-7. URL http://dblp.uni-trier.de/db/conf/socinfo/socinfo2012.html#YazdaniIABE12.

Ching Man Au Yeung. From User Behaviours to Collective Semantics. PhD thesis, Uni- versity of Southampton, 2009.

Dawei Yin, Liangjie Hong, and Brian D. Davison. Exploiting session-like behaviors in tag prediction. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11, pages 167–168, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0637-9. doi: 10.1145/1963192.1963277. URL http://doi.acm.org/10.1145/1963192.1963277.

Takayuki Yoshinaka, Soichi Ishii, Tomohiro Fukuhara, Hidetaka Masuda, and Hiroshi Nakagawa. A user-oriented splog filtering based on a machine learning. In Proceedings of the 2008/2009 international conference on recent trends and developments in social software, BlogTalk’08/09, pages 88–99, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-16580-X, 978-3-642-16580-1. URL http://dl.acm.org/citation.cfm?id=1929285.1929294.

Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 334–342, New York, NY, USA, 2001. ACM. ISBN 1-58113-331-6. doi: 10.1145/383952.384019. URL http://doi.acm.org/10.1145/383952.384019.

Zi-Ke Zhang, Tao Zhou, and Yi-Cheng Zhang. Tag-aware recommender systems: A state-of-the-art survey. Journal of Computer Science and Technology, 26(5):767–777, 2011. ISSN 1000-9000. doi: 10.1007/s11390-011-0176-1. URL http://dx.doi.org/10.1007/s11390-011-0176-1.

Bin Zhou, Jian Pei, and WoShun Luk. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor. Newsl., 10:12–22, December 2008a. ISSN 1931-0145. doi: 10.1145/1540276.1540279. URL http://doi.acm.org/10.1145/1540276.1540279.

Dong Zhou, Séamus Lawless, and Vincent Wade. Improving search via personalized query expansion using social media. Inf. Retr., 15(3-4):218–242, June 2012. ISSN 1386-4564. doi: 10.1007/s10791-012-9191-2. URL http://dx.doi.org/10.1007/s10791-012-9191-2.

Mianwei Zhou, Shenghua Bao, Xian Wu, and Yong Yu. An unsupervised model for exploring hierarchical semantics from social annotations. pages 680–693, 2008b. URL http://dx.doi.org/10.1007/978-3-540-76298-0_49.

Tom Chao Zhou, Hao Ma, Michael R. Lyu, and Irwin King. UserRec: A user recommendation framework in social tagging systems. In Maria Fox and David Poole, editors, AAAI. AAAI Press, 2010.

Linhong Zhu, Aixin Sun, and Byron Choi. Detecting spam blogs from blog search results. Inf. Process. Manage., 47(2):246–262, March 2011. ISSN 0306-4573. doi: 10.1016/j.ipm.2010.03.006. URL http://dx.doi.org/10.1016/j.ipm.2010.03.006.

Arkaitz Zubiaga, Raquel Martinez, and Victor Fresno. Getting the most out of social annotations for web page classification. In Uwe M. Borghoff and Boris Chidlovskii, editors, ACM Symposium on Document Engineering, pages 74–83. ACM, 2009. ISBN 978-1-60558-575-8. URL http://dblp.uni-trier.de/db/conf/doceng/doceng2009.html#ZubiagaMF09.

Appendix A

Appendix


Table A.1: Features used for computing rankings described in Section 6.5.3. C is the document collection, q the query term or tag, d represents a part of the document, which can be a title, summary, URL or the concatenation of all three. The document frequency df(qi) is the number of documents containing qi. count(qi, d) counts the number of occurrences of qi in document d.

Feature(s) — Description — Formula / References

tftitle, tfsummary, tfurltok, tfall — term frequency (tf) of title, summary, URL tokens and all text available — tf(q, d) = Σ_{qi ∈ q ∩ d} count(qi, d)

idfsummary, idftitle, idfurltok, idfall — inverse document frequency (idf) of summary, title, URL tokens and all text available — idf = Σ_{qi ∈ q ∩ d} log((|C| − df(qi) + 0.5) / (df(qi) + 0.5))

tf-idfsummary, tf-idftitle, tf-idfurltok, tf-idfall — tf * idf of summary, title, URL tokens and all text available

bm25urltok, bm25title, bm25summary, bm25all — BM25 of URL tokens, title, summary and all text available — bm25 = idf * (tf * (k1 + 1)) / (k1 * (1 − b + b * (docLength / averageLength)) + tf)

lmirjmtitle, lmirjmsummary, lmirjmurltok, lmirjmall — language model with Jelinek-Mercer smoothing — Zhai and Lafferty [2001]

lmirdirtitle, lmirdirsummary, lmirdirurltok, lmirdirall — language model with Bayesian smoothing using Dirichlet priors — Zhai and Lafferty [2001]

lmirabstitle, lmirabssummary, lmirabsurltok, lmirabsall — language model with absolute discounting smoothing — Zhai and Lafferty [2001]

queryinhost — query is contained in the host
queryishost — query is the host
pathexists — number of “/” in the path
pathlength — length of the path
fileexists — true if a file is contained in the URL
filelength — length of the file
urllength — length of the URL
queryinurl — number of times the query occurs in the URL
urltilde — a tilde is contained in the URL
titlelength — length of the document title
homeintitle — title contains the keyword “home”
homeinsummary — summary contains the keyword “home”
homeinurltok — URL contains the keyword “home”
homeinall — any text part contains the keyword “home”
dlsummary — document length of the summary
dltitle — document length of the title
dlurl — document length of the URL
dlall — document length of all text available
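The lexical features in Table A.1 can be illustrated in a few lines of code. The following is a minimal sketch, not the thesis implementation: it assumes pre-tokenised document parts (title, summary or URL tokens), a precomputed document-frequency map covering the query terms, and the common default BM25 parameters k1 = 1.2 and b = 0.75, which the table itself does not fix.

```python
import math

def tf(query_terms, doc_terms):
    """tf(q, d) = sum over qi in q ∩ d of count(qi, d), as in Table A.1."""
    return sum(doc_terms.count(qi) for qi in set(query_terms) if qi in doc_terms)

def idf(query_terms, doc_terms, df, num_docs):
    """idf = sum over qi in q ∩ d of log((|C| - df(qi) + 0.5) / (df(qi) + 0.5))."""
    return sum(
        math.log((num_docs - df[qi] + 0.5) / (df[qi] + 0.5))
        for qi in set(query_terms) if qi in doc_terms
    )

def bm25(query_terms, doc_terms, df, num_docs, avg_len, k1=1.2, b=0.75):
    """BM25 with document-length normalisation; df must cover the query terms."""
    score = 0.0
    for qi in set(query_terms):
        f = doc_terms.count(qi)  # term frequency of qi in this document part
        if f == 0:
            continue
        term_idf = math.log((num_docs - df[qi] + 0.5) / (df[qi] + 0.5))
        norm = k1 * (1 - b + b * (len(doc_terms) / avg_len)) + f
        score += term_idf * f * (k1 + 1) / norm
    return score
```

Each function is evaluated once per document part (title, summary, URL tokens, all text), yielding the four variants of every feature listed above.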
