USER PROFILING BASED ON FOLKSONOMY INFORMATION IN WEB 2.0 FOR PERSONALISED RECOMMENDER SYSTEMS

Huizhi Liang

Submitted in fulfilment of the requirements for the degree of

Doctor of Philosophy

Faculty of Science and Technology

Queensland University of Technology

April 2011

To my dear father Weiwang Liang and mother Tengying Liang

Keywords

User Profiling, Recommender Systems, Folksonomy, Tags, Personalisation, Web 2.0

Abstract

Information overload has become a serious issue for web users. Personalisation can provide effective solutions to this problem, and recommender systems are one popular personalisation tool that helps users deal with it. As the basis of personalisation, the accuracy and efficiency of web user profiling greatly affect the performance of recommender systems and other personalisation systems.

In Web 2.0, the emerging user information provides new possible solutions to profile users. Folksonomy, or tag information, is a typical kind of Web 2.0 information. Folksonomy implies users’ topic interests and opinion information. It has become another important source of user information for profiling users and making recommendations. However, since tags are arbitrary words given by users, folksonomy contains a lot of noise such as tag synonyms, semantic ambiguities and personal tags. Such noise makes it difficult to profile users accurately or to make quality recommendations.

This thesis investigates the distinctive features and multiple relationships of folksonomy and explores novel approaches to solve the tag quality problem and profile users accurately. Harvesting the wisdom of crowds and experts, three new user profiling approaches are proposed: a folksonomy based user profiling approach, a taxonomy based user profiling approach, and a hybrid user profiling approach based on folksonomy and taxonomy. The proposed user profiling approaches are applied to recommender systems to improve their performance. Based on the generated user profiles, user based and item based collaborative filtering approaches, combined with content filtering methods, are proposed to make recommendations.

The proposed user profiling and recommendation approaches have been evaluated through extensive experiments. The effectiveness evaluation experiments were conducted on two real world datasets collected from the Amazon.com and CiteULike websites. The experimental results demonstrate that the proposed user profiling and recommendation approaches outperform related state-of-the-art approaches. In addition, this thesis proposes a parallel, scalable user profiling implementation approach based on advanced cloud computing techniques such as Hadoop, MapReduce and Cascading. The scalability evaluation experiments were conducted on a large scale dataset collected from the Del.icio.us website.

This thesis contributes to the effective use of the wisdom of crowds and experts to help users overcome information overload by providing more accurate, effective and efficient user profiling and recommendation approaches. It also contributes to better use of the taxonomy information given by experts and the folksonomy information contributed by users in Web 2.0.

Table of Contents

Keywords
Abstract
Table of Contents
List of Figures
List of Tables
Statement of Original Authorship
1 INTRODUCTION
1.1 Overview
1.2 Research Problem and Objectives
1.2.1 Research Problem
1.2.2 Research Objectives
1.3 Research Significance and Contributions
1.4 Research Methodology
1.5 Thesis Outline
2 LITERATURE REVIEW
2.1 User Profiling
2.1.1 Web Personalisation
2.1.2 User Profiling Approaches
2.1.2.1 User Information Collection
2.1.2.2 User Profile Representation
2.1.3 User Profiling in Web 2.0
2.1.3.1 User Profiling Based on Tags
2.1.3.2 User Profiling Based on Other Web 2.0 User Information
2.1.3.3 Hybrid User Profiling Based on Tags and Other Information
2.2 Recommender Systems
2.2.1 Recommendation Tasks and Evaluation Approaches
2.2.2 Recommendation Approaches
2.2.2.1 Content Based Filtering
2.2.2.2 Collaborative Filtering Approaches
2.2.2.3 Hybrid Approaches
2.2.3 Recommender Systems Based on Taxonomy
2.2.4 Recommender Systems Based on Folksonomy
2.3 Chapter Summary
3 USER PROFILING BASED ON FOLKSONOMY
3.1 Notations
3.2 The Relationship Modelling of Folksonomy
3.3 Tag Representation Based on Folksonomy
3.3.1 The Two Conditional Probabilities
3.3.2 The Relevance of Two Tags in Terms of an Individual User
3.4 Item Representation Based on Folksonomy
3.5 User Profile Generation Based on Folksonomy
3.6 A Framework of User Profiling Based on Folksonomy
3.7 Chapter Summary
4 USER PROFILING BASED ON TAXONOMY
4.1 Notations
4.2 Taxonomy Based User Profiling
4.2.1 Item Representation Based on Taxonomy
4.2.2 Tag Representation Based on Taxonomy
4.2.3 User Representation Based on Taxonomy
4.2.4 A Framework of User Profiling Based on Taxonomy
4.3 Hybrid User Profiling Based on Folksonomy and Taxonomy
4.3.1 Hybrid Tag Representation
4.3.2 Hybrid Item Representation
4.3.3 Hybrid User Representation
4.3.4 A Framework of Hybrid User Profiling Based on Folksonomy and Taxonomy
4.4 Chapter Summary
5 PERSONALISED ITEM RECOMMENDATION MAKING
5.1 Problem Definition
5.2 Neighbourhood Formation
5.2.1 User Based K-Nearest-Neighbourhood Formation
5.2.2 Item Based K-Nearest Neighbourhood Formation
5.3 Recommendation Generation
5.3.1 User Based Recommendation Generation
5.3.2 Item Based Recommendation Generation
5.4 Frameworks of the Proposed Recommendation Systems
5.5 Chapter Summary
6 EXPERIMENTS AND RESULTS
6.1 Experiments Design
6.2 Evaluation Methods
6.2.1 Datasets
6.2.2 Evaluation Metrics
6.2.3 Experiment Setup
6.2.4 Experiment Environment and Framework
6.3 Results and Discussions
6.3.1 The Influence of Tags to the Standard Collaborative Filtering Recommendation Approaches
6.3.2 Results of Folksonomy Model
6.3.2.1 Parameterisation
6.3.2.2 Experimental Results
6.3.2.3 Discussions
6.3.3 Results of Taxonomy Model
6.3.3.1 Parameterisation
6.3.3.2 Experimental Results
6.3.3.3 Discussions
6.3.4 Hybrid Models vs. Single Models
6.3.4.1 Parameterisation
6.3.4.2 Experimental Results
6.3.4.3 Folksonomy Models vs. Taxonomy Models
6.3.4.4 Discussions
6.4 Parallel User Profiling for Large Scaled Recommender Systems
6.4.1 Related Work
6.4.1.1 Large Scaled User Profiling and Recommender Systems
6.4.1.2 Cloud Computing
6.4.2 Large Scale Implementation
6.4.2.1 Parallel User Profiling
6.4.2.2 Parallel Neighbourhood Formation
6.4.2.3 Parallel Recommendation Making
6.4.2.4 The Parallel of the Above Three Steps
6.4.3 Experiments and Results
6.4.3.1 Dataset Preparation
6.4.3.2 Experiment Setup
6.4.3.3 Experimental Results
6.4.4 Discussions
6.5 Chapter Summary
7 CONCLUSION AND FUTURE WORK
7.1 Conclusions
7.2 Contributions
7.3 Limitations and Future Work
7.3.1 Limitations
7.3.2 Future Work
APPENDIX A: EXAMPLE FOLKSONOMY TAGS
APPENDIX B: PARALLEL IMPLEMENTATION BASED ON CASCADING MAPREDUCE
BIBLIOGRAPHY
ACKNOWLEDGEMENTS

List of Figures

Figure 1.1 Examples of Tag Cloud, Item Folksonomy and Taxonomy of Amazon.com
Figure 1.2 The Proposed Research Method for This Thesis
Figure 3.1 An Example of a Tagging Graph
Figure 3.2 Tag Representation-Folksonomy
Figure 3.3 Item Representation-Folksonomy
Figure 3.4 User Representation-Folksonomy
Figure 3.5 A Framework of User Profiling Based on Folksonomy
Figure 4.1 An Example of Item Taxonomy
Figure 4.2 Item Representation-Taxonomy
Figure 4.3 Tag Representation-Taxonomy
Figure 4.4 User Representation-Taxonomy
Figure 4.5 A Framework of User Profiling Based on Taxonomy
Figure 4.6 Hybrid Representations Based on Folksonomy and Taxonomy
Figure 4.7 A Framework of Hybrid User Profiling Based on Folksonomy and Taxonomy
Figure 5.1 A General Framework of the Proposed Recommender Systems
Figure 5.2 The Framework of Folksonomy Based Recommendation Making
Figure 5.3 The Framework of Taxonomy Based Recommendation Making
Figure 5.4 The Framework of Hybrid Recommendation Making
Figure 6.1 The Distributions of Tags in Dataset D1 and D2
Figure 6.2 The Distributions of Items in Dataset D1 and D2
Figure 6.3 Visualisation of the 5-fold Cross-Validation Experiment Setup
Figure 6.4 The Experiment Framework of the Accuracy Evaluation
Figure 6.5 Top-3 Precision of the Standard CF Approaches on Datasets D1 and D2
Figure 6.6 Top 10 Precision Results of Folksonomy Model of Dataset D1 (Comparison 1)
Figure 6.7 Top 10 Recall Results of Folksonomy Model of Dataset D1 (Comparison 1)
Figure 6.8 Top 10 F1 Measure Results of Folksonomy Model of Dataset D1 (Comparison 1)
Figure 6.9 Top 10 Precision Results of Folksonomy Model of Dataset D1 (Comparison 2)
Figure 6.10 Top 10 Recall Results of Folksonomy Model of Dataset D1 (Comparison 2)
Figure 6.11 Top 10 F1 Measure Results of Folksonomy Model of Dataset D1 (Comparison 2)
Figure 6.12 Top 10 Precision Results of Folksonomy Model of Dataset D2 (Comparison 2)
Figure 6.13 Top 10 Recall Results of Folksonomy Model of Dataset D2 (Comparison 2)
Figure 6.14 Top 10 F1 Measure Results of Folksonomy Model of Dataset D2 (Comparison 2)
Figure 6.15 Top 10 Precision Results of Taxonomy Model of Dataset D1
Figure 6.16 Top 10 Recall Results of Taxonomy Model of Dataset D1
Figure 6.17 Top 10 F1 Results of Taxonomy Model of Dataset D1
Figure 6.18 Top 3 Precision Results of Hybrid Model of Dataset D1
Figure 6.19 Top 3 Recall Results of Hybrid Model of Dataset D1
Figure 6.20 The Number of Tags with Different Values
Figure 6.21 Top-3 Precision Results with Different Values
Figure 6.22 A Cascading Flow and MapReduce Jobs
Figure 6.23 Parallel Data Flow of User Profiling
Figure 6.24 Parallel Data Flow of Neighborhood Forming
Figure 6.25 Parallel Data Flow of Recommendation Making
Figure 6.26 The Running Time Comparison of Job 2
Figure 6.27 The Running Time Comparison of Job 3

List of Tables

Table 3.1 The Algorithm of Folksonomy Based Tag Representation
Table 3.2 The Algorithm of Folksonomy Based Item Representation
Table 3.3 The Algorithm of Folksonomy Based User Representation
Table 4.1 The Algorithm of Taxonomy Based Item Representation
Table 4.2 The Algorithm of Taxonomy Based Tag Representation
Table 4.3 The Algorithm of Taxonomy Based User Representation
Table 5.1 The Algorithm of User Based K-Nearest-Neighbourhood Formation Approach
Table 5.2 The Algorithm of Item Based K-Nearest-Neighbourhood Formation Approach
Table 5.3 The Algorithm of User Based Recommendation Generation Approach
Table 5.4 The Algorithm of Item Based Recommendation Generation Approach

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature: ______

Date: ______

Chapter 1

1 Introduction

1.1 OVERVIEW

The Internet has revolutionised the way people gather, process, and use information. Great changes have occurred in business, education, medicine, publishing, research, and other aspects of our daily life. These changes have been described as making the world flat (Friedman, 2005).

The enormous amount of web data brings much more information and choices to people. However, it also increases the burden of retrieving useful information and making decisions, known as information overload. To be able to cope with huge, diverse, and dynamic web data, people need the assistance of intelligent software agents for finding, sorting, and filtering the available information (Etzioni, 1996). Driven by this, the field of personalisation has grown dramatically in the past few years.

“Personalisation is the ability to provide content and services tailored to individuals based on knowledge about their preferences and behaviour” (Hagen, 1999). It attempts to help users solve the information overload issue. User profiling, the process of creating and constructing user profiles, is the foundation of web personalisation. The accuracy and efficiency of web user profiling greatly affect the performance of personalisation systems. Explicit ratings are typical user information. However, since not all users are willing to provide explicit profiles, the available explicit ratings are usually not sufficient to profile users accurately. Currently, research on web user profiling mainly focuses on the analysis of web log data, which includes web usage information such as a user’s click streams and navigation patterns, and the content or structural information of visited web pages. However, as the log data is generally huge in size and contains various kinds of noise, it is difficult to profile users accurately and efficiently. Consequently, exploring alternative, less intrusive, better quality, and widely available user information sources has become a key issue for user profiling and web personalisation systems.

Nowadays, the Web is moving towards its next generation, called Web 2.0 or the Social Web (Chi, 2008). The term Web 2.0 was coined by O’Reilly Media in 2004 (Pilgrim, 2008). Rather than being an enormous reading resource repository, the new generation web has become a platform for users to conduct online participation, collaboration and interaction. Popular Web 2.0 applications include collaborative writing in Wikipedia (http://www.wikipedia.com), building social networks in Facebook (http://www.facebook.com), sharing information resources through collaborative tagging in Del.icio.us (http://www.delicious.com), sharing videos in YouTube (http://www.youtube.com), sharing and spreading short messages or micro-blogs (e.g., tweets) in Twitter (http://www.twitter.com), sharing photos in Flickr (http://www.flickr.com), and writing reviews and blogs on Epinions (http://www.epinions.com/).

The online information currently contributed by users includes textual content information, multimedia content information and network information. Popular textual content information includes folksonomy (i.e., tags), reviews, blogs, micro-blogs, Wikipedia articles and others. Video clips, audio clips, and photos are popular multimedia content information. The information contributed by web users in this new generation of the read-and-write web has become an important information resource, alongside the information provided by website authorities or organisations. These user information sources provide new possible solutions to harvest the wisdom of crowds

(Surowiecki, 2004) to conduct (Zaphiris, 2009) for information searching, gathering, filtering, organising, and recommendation making (Segaran, 2007).

Therefore, how to use or incorporate the emerging Web 2.0 user information in personalisation applications becomes an important and urgent research topic.

A very prominent feature of Web 2.0 is that many social web sites allow users to annotate, describe, or tag online content items. Tagging, also called collaborative tagging or social tagging, refers to the behaviour of ordinary Web users assigning freely chosen descriptive keywords to online content. Originally promoted by the site Del.icio.us in 2005, tagging has gained popularity ever since. It is widely used in various kinds of websites such as Flickr (http://www.flickr.com), LibraryThing (http://www.librarything.com), CiteULike (http://www.citeulike.org), Bibsonomy (http://www.bibsonomy.org), Digg.com (http://www.digg.com), and last.fm (http://www.last.fm). Tagging has also been applied to traditional e-commerce and content provider websites such as Amazon.com (http://www.amazon.com) and EI Village (http://www.engineeringvillage.com), and within enterprise organisations (Millen et al., 2006; Muller, 2007).

These descriptive keywords are commonly referred to as tags or folksonomy. Folksonomy, a portmanteau of folk and taxonomy, is a term coined by Thomas Vander Wal in 2004 (Lamere, 2008). Traditionally, items are described or classified by taxonomy information. An item taxonomy is a set of terms or topics designed by experts to describe or classify items. The advantages of taxonomy include the following: the vocabulary is standard and controlled, relationship information among concepts exists, the information is given by authorities and reflects common knowledge, and the vocabulary is independent of user communities. One limitation is that taxonomy does not reflect users’ personal viewpoints or preferences. In addition, item taxonomy is not always up-to-date and its maintenance is usually costly (Van Damme et al., 2007; Gruber, 2007).

Unlike item taxonomy, folksonomy is contributed by users and contains rich personal information. Folksonomy has the distinctive advantages of being given by users explicitly and proactively, reflecting users’ topic preferences and personal viewpoints on item descriptions or classifications, and having multiple functions such as item organising and sharing, network building, and explicit opinion expressing (Sen et al., 2006). Thanks to their simplicity and multiple functions, social tagging and the resulting folksonomies have become a staple of many Web 2.0 websites and services (Golder & Huberman, 2006).

Collaborative tagging and the resulting folksonomy attract the attention of researchers as they provide new opportunities to extend existing research in the areas of information retrieval and Web search (Heymann et al., 2008; Bao et al., 2007), computational linguistics (Cattuto et al., 2008), the Semantic Web (Specia & Motta, 2007), user profiling (Shepitsen et al., 2008; Au Yeung et al., 2008; Šimko & Bielikova, 2010), and recommender systems (Niwa et al., 2006; Shepitsen et al., 2008; Jaschke et al., 2007; Zhang et al., 2010).

Figure 1.1 shows examples of a tag cloud, a user’s tagging records, and the folksonomy tags and taxonomy categories of the book “The World Is Flat”. These pictures are adapted from screenshots of Amazon.com. Figure 1.1 (a) shows a small part of the Amazon tag cloud. A tag cloud illustrates the distribution and frequency of tags used by all users of a website. Typically, more frequently used tags appear larger and more recently used tags appear darker. The term “globalization” is one tag in the tag cloud shown in Figure 1.1 (a). Figure 1.1 (b) shows the tagging records of an example user, Lawrence, who is one of the users who used the tag “globalization”. Lawrence used this tag to collect the book “The World Is Flat” and other items. All the tags that users have assigned to this item form the folksonomy of this item. Figure 1.1 (c) shows only part of the folksonomy of this item, together with the number of users of each tag. The taxonomy categories assigned by Amazon.com experts are also shown in Figure 1.1 (c).
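The three views described above (the site-wide tag cloud, one user's tagging records, and the folksonomy of one item) can be illustrated with a minimal sketch. It assumes, purely for illustration, that tagging records are stored as (user, tag, item) triples; the sample records and the function names are hypothetical and are not taken from Amazon.com or from the approaches proposed in later chapters.

```python
from collections import Counter, defaultdict

# Hypothetical tagging records as (user, tag, item) triples.
RECORDS = [
    ("lawrence", "globalization", "The World Is Flat"),
    ("lawrence", "economics", "The World Is Flat"),
    ("alice", "globalization", "The World Is Flat"),
    ("alice", "outsourcing", "The World Is Flat"),
    ("bob", "globalization", "Another Book"),
]


def tag_cloud(records):
    """Count how often each tag is used across all users and items."""
    return Counter(tag for _user, tag, _item in records)


def user_tagging_records(records, user, tag):
    """List the items that `user` collected under `tag`."""
    return sorted({item for u, t, item in records if u == user and t == tag})


def item_folksonomy(records, item):
    """Map each tag assigned to `item` to the number of distinct users who used it."""
    users_by_tag = defaultdict(set)
    for user, tag, tagged_item in records:
        if tagged_item == item:
            users_by_tag[tag].add(user)
    return {tag: len(users) for tag, users in users_by_tag.items()}
```

In this sketch the tag cloud corresponds to Figure 1.1 (a), the per-user lookup to Figure 1.1 (b), and the per-item tag counts to the folksonomy part of Figure 1.1 (c).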

(a) Tag Cloud

(b) The Tagging Information of the Tag “globalization” for User Lawrence

(c) The Folksonomy Tags and Taxonomic Categories of the Item “The World Is Flat”

Figure 1.1 Examples of Tag Cloud, Item Folksonomy and Taxonomy of Amazon.com

Against this background, this thesis explores how to utilise a typical kind of Web 2.0 information, folksonomy, to profile users accurately and to enhance personalised recommendation making approaches that help users overcome information overload.

1.2 RESEARCH PROBLEM AND OBJECTIVES

1.2.1 Research Problem

Recommender systems are among the most effective and popular personalisation applications. They help to solve the information overload issue by suggesting which information is most relevant to a target user. A typical user profile in a recommender system consists of a set of items rated or preferred by the user together with the ratings on these items (Bloedorn et al., 1996; Wang et al., 2006). However, not all users like to be involved in explicit rating, so explicit ratings are not always available or applicable in real life applications (Oard & Kim, 1998). Currently, inaccurate user profiles are regarded as the bottleneck of making quality recommendations (Adomavicius & Tuzhilin, 2005). Therefore, in order to construct satisfactory user profiles, alternative user data sources need to be explored. How to make recommendations based on other user information sources has become an important research direction in this area (Adomavicius & Tuzhilin, 2005).

As mentioned before, in Web 2.0, tags are given by users explicitly and proactively to describe the contents or classifications of items, therefore reflecting users’ topic preferences and personal viewpoints on item descriptions or classifications (Sen et al., 2006). Consequently, folksonomy has recently been identified as another important source to profile users (Au Yeung et al., 2008; Diederich & Iofciu, 2006; Mislove et al., 2010).

Tag information has the following distinct advantages:

• Compared with explicit rating information, tags are less intrusive and have multiple functions. These functions include organising and sharing items, building networks, and expressing explicit opinions and topic interests (Sen et al., 2006). Because of the simplicity and direct benefits of tagging, more and more people are being attracted to it. According to a report on the usage of collaborative tagging published in 2007 (Rainie, 2007), about 28% of American users have engaged in some form of tagging activity.

• Compared with other commonly used implicit rating information sources such as click streams or web logs, tags are lightweight, human-understandable, easy to process, and provide more reliable evidence of positive associations between users and tagged items. A user is more likely to be interested in an item that he/she has tagged than in one that he/she has only browsed or clicked on. Tagging can therefore be considered a kind of implicit rating behaviour (Sen et al., 2006), and folksonomy thus becomes another important implicit rating information source.

• Unlike item taxonomy/ontology or other kinds of content or information, tags are contributed by general web users and contain rich personal topic interests and opinion information of online community users. Folksonomies are also believed to provide categorisation schemes that are more flexible, dynamic, up-to-date, and community dependent or specific.

• Tag information is domain free and is applicable to various kinds of items such as videos, audio, documents, web pages, images, and e-commerce products. Moreover, it can be used not only as a standalone information source but also in combination with other information sources such as videos, reviews, blogs, micro-blogs and others.
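The idea that tagging can be read as implicit rating behaviour can be sketched as follows. This is only an illustration under stated assumptions: tagging records are assumed to be (user, tag, item) triples, every tagged item is assumed to receive a binary positive rating of 1, and the sample data and the function name are hypothetical. It is not the profiling approach proposed in this thesis.

```python
# Hypothetical tagging records as (user, tag, item) triples.
SAMPLE_RECORDS = [
    ("u1", "globalization", "book_a"),
    ("u1", "economics", "book_a"),
    ("u1", "travel", "book_b"),
    ("u2", "globalization", "book_a"),
]


def implicit_ratings(records):
    """Treat every tagged (user, item) pair as a positive implicit rating of 1.

    Items a user has never tagged are simply absent, which mirrors the fact
    that tagging only provides evidence of positive associations.
    """
    ratings = {}
    for user, _tag, item in records:
        ratings.setdefault(user, {})[item] = 1
    return ratings
```

A user who tags the same item several times still contributes a single rating in this sketch; weighting by the number of tags would be one possible variation.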

Because of these advantages, users’ tag information is commonly available. Besides explicit ratings, it has become another typical and prevalent kind of user information for profiling users and making recommendations. However, since there is no restriction or boundary on selecting words for tagging items, the tags used by users are free-formed and have problems such as semantic ambiguity and tag synonyms. Semantic ambiguity means that the same tag name has different meanings for different users. Tag synonymy means that different tags actually have the same meaning. Another concern related to tags is that nearly 60% of tags are personal tags that are only used by one user (Bischoff et al., 2008). These disadvantages make it challenging to use tags effectively to profile users and describe items.
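The notion of a personal tag, a tag used by only one user, can be made concrete with a small sketch that measures the share of such tags in a set of tagging records. The (user, tag, item) record layout, the sample data, and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical tagging records as (user, tag, item) triples.
NOISY_RECORDS = [
    ("u1", "globalization", "i1"),
    ("u2", "globalization", "i1"),
    ("u1", "economics", "i1"),
    ("u2", "economics", "i2"),
    ("u1", "to-read", "i2"),
    ("u3", "gift-for-dad", "i1"),
]


def personal_tag_share(records):
    """Return the fraction of distinct tags that are used by exactly one user."""
    users_by_tag = defaultdict(set)
    for user, tag, _item in records:
        users_by_tag[tag].add(user)
    if not users_by_tag:
        return 0.0
    personal = sum(1 for users in users_by_tag.values() if len(users) == 1)
    return personal / len(users_by_tag)
```

In the sample data, "to-read" and "gift-for-dad" are personal tags, so two of the four distinct tags carry no information shared between users.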

Most current user profiling and recommendation approaches treat tags as textual information, fail to effectively reduce the noise of tags, and ignore the distinctive features of tags and the rich personal information contained in them. How to reduce the noise and use the unique features and the rich personal information of folksonomy to profile users, and therefore enhance recommendation making, needs to be further explored. Moreover, very few works so far have discussed how to benefit from both the folksonomy information contributed by users and the taxonomy information given by experts in order to find more accurate user profiles and improve personalised recommendation making.

The Research Problem of this thesis is to explore effective and efficient approaches to reduce the noise of folksonomy and make use of the unique features of folksonomy to improve the performance of user profiling and recommendation making.

1.2.2 Research Objectives

Targeting the research problem, the primary Research Objectives of this thesis are:

• Objective 1: Generating user profiles based on folksonomy information.

Users’ tagging information contains rich information about users’ topic preferences and about how users understand items. It is a valuable data source for obtaining users’ interests, preferences and tastes. This research objective is to investigate the distinctive features of folksonomy information and propose novel approaches to reduce the noise of folksonomy and generate accurate user profiles.

• Objective 2: Generating user profiles based on folksonomy and taxonomy information.

Folksonomy and taxonomy information each have their own unique advantages and limitations. This research objective is to investigate how to benefit from both information sources, and propose novel approaches to integrate item folksonomy and taxonomy to further reduce the noise of folksonomy and generate more accurate user profiles.

• Objective 3: Utilising the generated user profiles for recommender systems.

To verify the effectiveness of the proposed user profiling approaches, the generated user profiles will be used in recommender systems. Recommendation making approaches that incorporate the generated user profiles will be investigated. The proposed user profiling and recommendation making approaches will be evaluated experimentally. The performance of the proposed recommender systems incorporating the proposed user profiles will be compared with that of related state-of-the-art models. If the accuracy of the recommendation making approaches is improved, the effectiveness of the proposed user profiling approaches will be verified. In addition, the scalability of the proposed user profiling approaches for large scale recommender systems will be investigated.

1.3 RESEARCH SIGNIFICANCE AND CONTRIBUTIONS

Theoretically, this research contributes to more accurate and better user profiling approaches. It improves the performance of personalisation and provides better solutions to the information overload issue. Personalisation will increase user satisfaction and enhance customer service and e-commerce sales. This research makes important contributions to personalisation by exploring and exploiting new user data sources and providing principles and novel approaches to construct user profiles based on folksonomy information in Web 2.0.

Moreover, this research makes practical contributions to the development of recommender systems. The proposed user profiling approaches are applied to recommender systems to improve personalised item recommendations. Typically, users’ explicit rating information is used to make recommendations. However, since explicit ratings are not always available in real life applications, how to make recommendations based on implicit rating information becomes very important (Adomavicius & Tuzhilin, 2005). This thesis effectively utilises the new Web 2.0 user information sources in recommender systems. It contributes to providing new solutions for profiling users accurately and making quality recommendations.

This research also contributes to the research of Web 2.0. Folksonomy is a typical Web 2.0 information source. This research investigates the unique features of folksonomy and contributes to a better understanding as well as better usage of folksonomy. Moreover, folksonomy has features in common with other user created textual information such as blogs, micro-blogs (e.g., tweets), and reviews. The common features include having a direct relationship with the items described, implying users’ interests and preferences, being human-understandable, being lightweight, and having free-formed user vocabularies. Although only folksonomy information is investigated, this research also contributes to better understanding and use of other user created information in Web 2.0.

1.4 RESEARCH METHODOLOGY

Various research approaches have been used in the user profiling and recommender system fields. These methods include surveys, case studies, prototyping and experimenting (Sarwar et al., 2000; Herlocker et al., 2004; Felden, 2007). Because this research focuses on the development of new systems or techniques for recommender systems, and the soundness of these systems, techniques or proposed strategies must be supported by the results of experiments and evaluations, the experimental approach integrated with the standard information system research cycle was chosen as the research method. The process of the research approach used in this research is illustrated in Figure 1.2.

[Figure 1.2: flowchart of the research process. Recoverable stages: Define Problem; Literature Review; Develop Solutions/Systems/Techniques; Experimental Dataset Selection; Experiment Design; Experiment; Evaluation; Reflection.]

Figure 1.2 The Proposed Research Method for This Thesis

1.5 THESIS OUTLINE

The rest of this thesis is outlined as follows:

Chapter 2: This chapter is a critical and comprehensive review of the existing research in related research areas of user profiling and recommender systems. It identifies and justifies the research context and gap from which the research questions were derived, and pinpoints the weaknesses of the existing related works.

Chapter 3: This chapter discusses the proposed folksonomy based user profiling approach. This chapter will firstly model the multiple relationships of folksonomy. To solve the tag quality problem, this chapter proposes to find the personal semantic meaning of each tag. A set of related tags for each tag in terms of each individual user will be determined. This chapter then suggests approaches to profile and represent users and items based on the generated correlations among tags.

How the proposed folksonomy based user profiling approach is used to enhance recommender systems will be discussed in Chapter 5. The relevant publications on the proposed folksonomy based user profiling and recommender systems are listed below:

• Liang, H., Xu, Y., Li, Y., & Nayak, R. (2010). Connecting Users and Items with Weighted Tags for Personalized Item Recommendations. Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, 51-60

• Liang, H., Xu, Y., Li, Y., Nayak, R., & Shaw, G. (2010). A Hybrid Recommender System based on Weighted Tags. Proceedings of the 8th Workshop on Text Mining of the 10th SIAM International Conference on Data Mining, 2010


• Liang, H., Xu, Y., Li, Y., & Nayak, R. (2009). Collaborative Filtering Recommender Systems based on Popular Tags. Proceedings of the 14th Australasian Document Computing Symposium, 3-10

• Liang, H., Xu, Y., Li, Y., & Nayak, R. (2009). Tag Based Collaborative Filtering for Recommender Systems. Proceedings of the 4th International Conference on Rough Sets and Knowledge Technology, 666-673

• Liang, H., Xu, Y., Li, Y., & Nayak, R. (2008). Collaborative Filtering Recommender Systems Using Tag Information. Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03, 59-62

Chapter 4: This chapter firstly discusses the proposed taxonomy based user profiling approach. With the purpose of finding the personalised semantic meaning of each tag, this chapter proposes to find a set of related taxonomic topics of each tag for each individual user to solve the tag quality problem. The approaches that represent each user and each item with item taxonomy information will then be presented. Moreover, the hybrid approach that combines the folksonomy and taxonomy based user profiling approaches will be proposed in this chapter.

The recommender systems based on the proposed taxonomy based and hybrid user profiling approaches will be discussed in Chapter 5. The relevant publications of the proposed taxonomy based user profiling and recommender systems are as below:

• Liang, H., Xu, Y., Li, Y., Nayak, R., & Weng, L. (2009). Personalized Recommender Systems Integrating Social tags and Item Taxonomy. Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, 540-547

• Bhuiyan, T., Xu, Y., Jøsang, A., & Liang, H. (2010). Developing Trust Networks Based on User Tagging Information for Recommendation Making. Proceedings of the 11th International Conference on Web Information System Engineering, 357-364

The relevant publications of the proposed hybrid user profiling and recommendation approaches are as below:

• Liang, H., Xu, Y., Li, Y., & Nayak, R. (2010). Personalized Recommender System Based on Item Taxonomy and Folksonomy. Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 1641-1644

• Liang, H., Xu, Y., & Li, Y. (2010). Mining Users' Opinions based on Item Folksonomy and Taxonomy for Personalized Recommender Systems. Proceedings of IEEE International Conference on Data Mining (ICDM) 2010 Topic Feature Discovery and Opinion Mining workshop, 1128-1135

Chapter 5: This chapter will discuss how to utilise the user profiling approaches proposed in Chapter 3 and Chapter 4 to make recommendations. Based on the proposed folksonomy, taxonomy and hybrid user profiles, both the user and item based personalised item recommendation making approaches will be proposed in this chapter.

Chapter 6: This chapter will discuss the evaluation of the proposed user profiling and recommendation approaches. The evaluation mainly concerns the effectiveness of the proposed approaches. In addition, a parallel user profiling and recommendation approach based on cloud computing techniques will be put forward to evaluate the scalability of the proposed approaches. The relevant publications from this part of the work are listed below.

• Liang, H., Hogan, J., & Xu, Y. (2010). Parallel User profiling based on folksonomy for Large Scaled Recommender Systems-An implementation of Cascading MapReduce. Proceedings of IEEE International Conference on Data Mining (ICDM) 2010 Knowledge Discovery Using Cloud and Distributed Computing Platforms workshop, 154-161

(This section of work was conducted under the supervision of Associate Professor Jim Hogan, when the author was an intern of the Cloud Computing Internship & Competition Project of the Australian Academy of Technological Sciences and Engineering, January-March, 2010.)

Chapter 7: This chapter concludes this thesis and outlines directions for future work.


Chapter 2

2 Literature Review

This chapter provides a critical review of the literature that led to the research questions this thesis attempts to answer. The review establishes the background of the area of study and provides arguments to support the research in this thesis. The aim of this chapter is to set up the research questions and the related research methodology. In Section 2.1, the related user profiling approaches will be briefly reviewed. Then, in Section 2.2, the state-of-the-art recommender systems will be outlined. In Section 2.3, a summary of current user profiling approaches for recommender systems will be given.

2.1 USER PROFILING

2.1.1 Web Personalisation

The research area of personalisation attempts to provide solutions to help users solve the information overload issue. Typical web personalisation applications include recommender systems, personalised web searching systems, personalised teaching or learning environment applications, and other personalised services. Among these applications, recommender systems are one of the popular personalisation tools and play an important role in people's lives.


Personalization constitutes an iterative process that can be defined by the three stages of the understand-deliver-measure cycle:

Stage 1: Understand consumers by collecting comprehensive information about them and converting it into actionable knowledge stored in the form of consumer profiles. This stage is also called user profiling.

Stage 2: Deliver personalized offerings based on the knowledge about each consumer, as stored in the consumer profiles. The personalization engine must be able to find the most relevant offerings and deliver them to the consumer.

Stage 3: Measure personalization impact by determining how much the consumer is satisfied (and dissatisfied) with the delivered offerings. When it completes one cycle of the process, it sets the stage for the next cycle in which improved personalization techniques are able to make better personalization decisions.

The purpose of user profiling is to understand users, which is an essential part of personalisation. The accuracy and effectiveness of user profiling approaches greatly affect the performance of recommender systems and other personalisation systems. The following sections will review the current user profiling approaches and recommender systems respectively.

2.1.2 User Profiling Approaches

User profiling is the basis of personalisation (Mobasher et al., 2002). Most personalisation systems are based on some type of user profile (Mobasher et al., 2002; Zhang & Ghorbani, 2007). A user profile is a formal representation of a collection of information about a user, including demographic information, usage information, and interests or goals (either explicitly stated by the user or implicitly derived by the system) (Fawcett et al., 1996; Kilfoil, Ghorbani & Xing, 2003; Zigoris & Zhang, 2006; Arapakis et al., 2010). User profiles can be divided into individual user profiles and group user profiles. An individual user profile describes only one user's interests and other information. A group user profile describes a group of users' common interests or goals.
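As a rough illustration of the definition above, the sketch below models an individual user profile as a small record and derives a group profile from several of them. The class `UserProfile`, its field names, and the rule that a group profile keeps only the topics shared by every member (with averaged weights) are assumptions made for this example, not a representation prescribed by the cited work.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical container for the kinds of information held in a user profile."""
    user_id: str
    demographics: dict = field(default_factory=dict)  # e.g. {"occupation": "student"}
    usage: list = field(default_factory=list)         # e.g. ids of visited or purchased items
    interests: dict = field(default_factory=dict)     # topic -> interest weight


def group_profile(profiles):
    """Return the topics shared by all members, each with its mean weight."""
    if not profiles:
        return {}
    shared = set(profiles[0].interests)
    for profile in profiles[1:]:
        shared &= set(profile.interests)
    return {
        topic: sum(p.interests[topic] for p in profiles) / len(profiles)
        for topic in shared
    }


alice = UserProfile("alice", interests={"jazz": 0.8, "film": 0.4})
bob = UserProfile("bob", interests={"jazz": 0.6, "travel": 0.9})
common = group_profile([alice, bob])  # only "jazz" is shared by both users
```

Other aggregation rules (for example, the union of topics, or a weighted vote) would be equally plausible; the intersection is used here only because it matches the phrase "common interests" most literally.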

User profiling or user modelling is the process of constructing user models, and creating, updating or deleting user profiles. Research on user profiling is ongoing in the fields of information retrieval and filtering, artificial intelligence, data mining and others. Its history can be traced back to 1979, when Elaine Rich reported on her GRUNDY system (Rich, 1979). Grundy builds models of its users with the aid of stereotypes, and then exploits those models to guide it in its task of suggesting novels that people may find interesting.

The user profiling process generally consists of two main phases. The first phase is data collection, which gathers raw information about the user explicitly or implicitly. Different types of user data are extracted to create user profiles using different analysis or mining techniques. Then, in the profile construction phase, the user profile can be represented in different ways, which will also affect the accuracy of user profiles.

Because of the complexity of human beings, the research into how to improve the accuracy and efficiency of user profiling still remains open.

In the following subsections, the commonly used user information collection and representation approaches will be reviewed.


2.1.2.1 User Information Collection

The first phase of user profiling is to collect information about individual users.

Usually, information may be input explicitly by users or implicitly gathered by software tools.

1) Explicit User Information Collection

In the explicit user information collection approach, users are required to provide the information needed to profile themselves, such as demographic information (e.g., gender, education background, nationality, occupation), interests and preferences information (e.g., topic interests, taste, font preferences), opinion information (e.g., reviews, comments, and feedback), and friends and network information. Explicit rating information is commonly used to profile users' interests or opinions.

Some websites work effectively based on explicit ratings, such as the movie rating site NetFlix (http://www.netflix.com), the e-commerce website Amazon.com (http://www.amazon.com) and ePinions (http://www.epinions.com), a website dedicated to collecting and sharing consumer reviews, ratings and comments.

Although these kinds of explicit user feedback are easy and effective, there are some drawbacks. Firstly, explicit feedback places an additional burden on the user and requires the user's willingness to participate. With no explicit or direct benefits, very few users are willing to provide profiling information proactively and explicitly. In real life applications, explicit user information is therefore usually very sparse, which makes it difficult to use the resulting user profiles for personalisation. Encouraging users to provide sufficient information is a challenging task. Furthermore, users may not accurately report their own interests or other data.

2) Implicit User Information Collection

Collecting information implicitly is another effective way to construct user profiles. The majority of recent research has focused on this area. Various web mining, information retrieval and filtering, and artificial intelligence technologies are involved.

The main advantage of implicit information collection is that it does not require any additional intervention by the user during the process of constructing profiles.

The traditional implicit information includes users' behaviour information (e.g., click streams, browsing history, and purchase records), and the content or structural information of the visited web pages or items. After collecting the user information, the next step is to analyse the collected information to construct user profiles. Based on different information sources, the user profiling approaches can be classified into the following categories:

• User profiling based on the content and link structure information of items.

Web content mining (Cooley et al., 1997; Kosala & Blockeel, 2000; Mobasher, 2007; Holub & Bielikova, 2010) and web structure mining (Kosala & Blockeel, 2000; Mobasher, 2007) approaches have been commonly used to mine users' topic interests.

Web content mining describes the discovery of useful information from Web contents, data and documents. Usually a set of terms/keywords, phrases or patterns extracted from the clicked or visited items is used to represent the users' topic interests. Term weighting approaches such as tf-idf are widely used to measure the importance of terms (Salton & Buckley, 1988).

Web structure mining (Li & Zaïane, 2004) is the process of using graph theory to analyse the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds: a) extracting patterns from hyperlinks in the web; and b) mining the document structure, which is the analysis of the tree-like structure of pages to describe HTML or XML tag usage.

Structure mining can be used to rank items by quality and relevance. Popular algorithms include HITS (Kleinberg, 1999) and PageRank (Brin & Page, 1998).

Web pages and textual documents are typical item objects for these kinds of user profiling approaches. These approaches are domain dependent and are not applicable to items that have limited or no textual content and link information (Adomavicius & Tuzhilin, 2005).

• User profiling based on item taxonomy or ontology information.

The taxonomic/ontological topics of clicked or rated items are commonly used to represent users' topic interests (Middleton et al., 2004; Felden, 2007; Ziegler et al., 2004; Weng et al., 2008). Ontology is the formal description and explicit specification of a shared conceptualisation (Choi et al., 2000). Item taxonomy is the classification of items, which can be seen as a kind of simplified domain ontology (Choi et al., 2000). In a taxonomy only the "is-a" relation is used, while in ontologies many more relations are employed. Both are provided by experts and reflect the common understanding of topics or classification of items.

Various approaches have been proposed to profile users by way of taxonomic/ontological topics. For example, Middleton et al. (2004) proposed an approach that uses an ontology to inductively learn the topics a user is interested in. Ziegler et al. (2004) proposed to use product taxonomic topics to represent users' topic interests. These user profiling approaches have been used in recommender systems, which will be further reviewed in Section 2.2.3.

One important advantage of item taxonomy/ontology information is that it is domain free and can be used for various kinds of items. Another advantage is that it has semantic meaning and can be used to describe users' topic interests at a conceptual level, the so-called "intelligent" part of the world knowledge model possessed by human beings (Li & Zhong, 2006).

• User profiling based on web logs.

Web logs capture the browsing history of individual users at a given website. This is an important traditional information source used to profile users implicitly. Web log analysis and usage mining based on users' usage information (Wang et al., 2006; Barla et al., 2009) or navigated web content information (Cooley et al., 1997; Borges & Levene, 2000) are commonly used to mine user profiles implicitly. In addition, text mining and natural language processing are used to find users' interests and opinions from the content of visited web pages or users' textual online information.
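The link-based ranking mentioned under the first category above can be made concrete with a small example. The following is a minimal sketch of PageRank computed by power iteration, not the exact formulation of Brin and Page (1998); the function name `pagerank`, the damping value 0.85, the fixed iteration count and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "c" is linked from both "a" and "b", so it should rank highest.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_rank = pagerank(toy_links)
```

In a profiling setting, such scores could be used to weight the items a user has visited, although how the scores enter the profile is left open here.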

According to several evaluation studies (White et al., 2001; Wærn, 2004; Teevan et al., 2005), there is no clear answer as to whether implicitly created profiles are more or less accurate than explicitly created profiles. Since implicit feedback places less burden on the user and is updated automatically as the user interacts with the system, it seems to be the preferable method of collecting information about users. One drawback of implicit feedback techniques is that they can typically only capture positive feedback.

2.1.2.2 User Profile Representation

After collecting each user's information explicitly or implicitly, the next step is to construct user profiles with appropriate representations. Based on different representations, user profiles can be classified as rating-based, keyword-based, graph-based, hierarchical structure-based and rule-based user profiles. Typically, in recommender systems, users' ratings for items are used to represent users' item preferences. These are called rating-based user profiles.

Besides explicit or implicit rating-based user profiles, the most common representations for user profiles are sets of keywords/terms or taxonomic/ontological topics that reflect users' topic preferences or interests. Those keywords can be provided by users explicitly or extracted from the content or taxonomy/ontology information of items (Adomavicius & Tuzhilin, 2001). Carreira et al. (2004) proposed that keywords could be associated with multiple weights, each of which indicated the degree of user interest in a particular subject containing the keywords.

In order to address the polysemy problem inherent in keyword-based profiles, Minio and Tasso (1996) explored an approach based on a weighted semantic network in which each node represented a concept. This kind of approach has been used in InfoWeb (Gentili et al., 2003), WIFS (Micarelli & Sciarrone, 2004) and the system proposed by Gasparetti and Micarelli (2005).

Concept hierarchies were initially used to represent the content of Web pages (Guha et al., 2003), but have more recently been used to represent user profiles. Bloedorn et al. (1996) suggested using hierarchical concepts, rather than a flat set of concepts, because this enabled the system to make generalisations. The simplest concept hierarchy based profiles are constructed from a reference taxonomy or . More complex profiles may be constructed from reference ontologies. Some projects explored the use of these richer ontologies for improved search and recommendation results (Guha et al., 2003; Middleton et al., 2004).

Users' behaviours, an important part of user profiles, can be represented by a set of rules such as association or classification rules (Adomavicius & Tuzhilin, 2001). They can also be represented by a set of patterns, such as frequent patterns or sequential patterns. Sometimes a behavioural profile can be represented as a model, such as a Bayesian belief network, neural network, or graph model.

Nowadays, with the development of Web 2.0, new kinds of user information such as tags, reviews, blogs, tweets, pictures, videos, and friends and network information are available. Because they imply users' interests and preferences, these new kinds of user information can be used as alternative or supplementary information sources to construct and represent user profiles. The following subsection will review the current user profiling approaches based on these typical kinds of user information in Web 2.0.

2.1.3 User Profiling in Web 2.0

Web 2.0 applications attract users to provide, create, and share more information online. The online information currently contributed by users can be classified into textual content information, multimedia content information, and friends/network information.

Popular user-contributed textual content includes tags, blogs, reviews, micro-blogs, comments, posts, documents, Wikipedia articles and others. Besides textual content, user-contributed video clips, audio clips, and photos are also widely available on the web now. In addition, users' network relationship information, e.g., 'friends', 'follower', 'following', and 'trust', is becoming more and more popular. Compared with web logs, these new kinds of user information have the advantages of being lightweight, small in size, and explicitly and proactively provided by users. They have become important new information sources for profiling users.


2.1.3.1 User Profiling Based on Tags

Tag or folksonomy information is a typical kind of Web 2.0 information. Recently, social tags have become an important research focus. Because they imply users' explicit topic interests, social tags can be used to improve searching (Bao et al., 2007; Begelman et al., 2006; Bindelli et al., 2008), clustering (Begelman et al., 2006; Shepitsen et al., 2008), and recommendation making (Niwa et al., 2006; Shepitsen et al., 2008; Jaschke et al., 2007; Zhang et al., 2010).

Research on tags mainly focuses on how to build better collaborative tagging systems (Scott et al., 2006), item navigation and organisation (Bindelli et al., 2008), semantic cognitive modelling in social tagging systems (Fu et al., 2010), personalising searches using tag information (Bao et al., 2007), and recommending tags (Marinho & Schmidt-Thieme, 2007) and items (Tso-Sutter et al., 2008) to users. In real life, collaborative tagging systems are used not only in social sharing websites such as Del.icio.us and e-commerce websites such as Amazon.com, but also in other traditional websites or organisations such as libraries and enterprises (Millen et al., 2006; Millen, 2007).

Mainly, tags are used to profile users' topic interests or preferences. Tagging behaviour is a kind of implicit rating behaviour (Sen et al., 2009). Based on tag information, user profiles can be represented by a set of tags or a set of tags with corresponding weights. In early work, only the original tag set of a user was used to profile the topic interests of that user (Bogers & Bosch, 2008). The binary weighting approach (Bogers & Bosch, 2008) and the tf-idf weighting approach (Salton & Buckley, 1988) borrowed from text mining are commonly used to assign different weights to tags. These approaches profile users with their own tags directly.
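The two weighting schemes can be sketched as follows. This is a minimal illustration under stated assumptions rather than the formulation of any cited work: each user's tag list is treated as one "document", term frequency is normalised by the number of tag assignments of the user, and the inverse frequency is the natural logarithm of the number of users divided by the number of users who used the tag. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def binary_profile(user_tags):
    """Binary weighting: every tag the user has used gets weight 1."""
    return {tag: 1.0 for tag in set(user_tags)}


def tfidf_profiles(tag_assignments):
    """tf-idf weighting; tag_assignments maps user -> list of tags (with repeats)."""
    n_users = len(tag_assignments)
    # Number of distinct users who used each tag at least once.
    users_per_tag = Counter(
        tag for tags in tag_assignments.values() for tag in set(tags)
    )
    profiles = {}
    for user, tags in tag_assignments.items():
        counts = Counter(tags)
        total = sum(counts.values())
        profiles[user] = {
            tag: (count / total) * math.log(n_users / users_per_tag[tag])
            for tag, count in counts.items()
        }
    return profiles


sample = {
    "u1": ["python", "python", "web"],
    "u2": ["web", "music"],
    "u3": ["web"],
}
weighted = tfidf_profiles(sample)
```

In this sample the tag "web" is used by every user, so its tf-idf weight is zero for all of them, while "python" receives a positive weight in the profile of "u1". The binary scheme cannot make this distinction.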

Tso-Sutter et al. (2008) proposed a tag expansion method to profile a user with tags and items after converting the three-dimensional user-tag-item relationship into an expanded user-item implicit rating matrix. The tags of each user were regarded as special items to expand each user's item set, while the tags of each item were regarded as special users to expand each item's user set. This approach profiles users with their own tags and items.
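The expansion step described above can be sketched with plain sets. The sketch covers only the conversion of user-item-tag records into expanded user and item sets; it does not reproduce any later prediction step of the cited method. The function name `expand_with_tags`, the `("item", ...)`, `("tag", ...)` and `("user", ...)` prefixes used to keep identifiers apart, and the sample records are assumptions made for illustration.

```python
def expand_with_tags(records):
    """records: iterable of (user, item, tag) triples.

    Returns two dicts:
      user_sets: user -> the user's items plus the user's tags as pseudo items
      item_sets: item -> the item's users plus the item's tags as pseudo users
    """
    user_sets = {}
    item_sets = {}
    for user, item, tag in records:
        user_sets.setdefault(user, set()).update({("item", item), ("tag", tag)})
        item_sets.setdefault(item, set()).update({("user", user), ("tag", tag)})
    return user_sets, item_sets


records = [
    ("alice", "book1", "python"),
    ("alice", "book2", "web"),
    ("bob", "book1", "code"),
]
user_sets, item_sets = expand_with_tags(records)
```

Each set corresponds to the non-zero entries of one row (for users) or one column (for items) of the expanded binary matrix.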

Because of the noise of tags, some user profiling approaches proposed to use related or expanded tags to profile each user or describe each item. In the work of Niwa et al. (2006) and Shepitsen et al. (2008), tags of the same cluster were used to expand the tag based user profiles. Au Yeung et al. (2008) proposed to find a cluster of tags to profile the topic interests of each user, called personomy. All the popular tags of the collected items of a user were used to profile the topic interests for that user. Moreover, association rule mining techniques were used to find the associated tags to profile users (Heymann et al., 2008). Wetzker et al. (2010) proposed a probability model to translate a user's tag vocabulary into another user's tag vocabulary. These approaches profiled users not only with their own tags but also with a set of related or expanded tags.

Targeting the tag quality problem, some latent semantic models were proposed to process tags. The tag based latent topics were used to profile users. Wetzker et al. (2009) proposed a hybrid probabilistic latent semantic model based on the binary user-item and tag-item matrices. Siersdorfer and Sizov (2009) proposed a Latent Dirichlet Allocation model to find the latent topics of tags. Therefore, instead of using tags directly, these approaches used the latent topics of tags to profile users.

Aside from the above user profiling approaches, some approaches proposed to use tensor, tube, or tri-party graphs to profile users' tagging behaviour (Rendle et al., 2010; Zhang et al., 2010). In doing so, the three-dimensional relationship among users, items and tags could be profiled directly. Rendle et al. (2010) proposed to use a three-dimensional tensor to profile users. The work of Zhang et al. (2010) proposed approaches to rank the weights of tags in the tri-party graphs to represent users' tagging behaviour.
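To illustrate what profiling the three-dimensional relationship directly means, the following sketch stores tagging records in a binary user-item-tag tensor built from nested lists. It is only a data-structure illustration under assumed names (`build_tensor` and the toy records); it does not implement the factorisation or ranking models of the cited approaches.

```python
def build_tensor(records):
    """Build a binary tensor T with T[u][i][t] = 1 if user u tagged item i with tag t.

    records: iterable of (user, item, tag) triples.
    Returns the tensor and the sorted user, item and tag lists that index it.
    """
    records = list(records)
    users = sorted({user for user, _, _ in records})
    items = sorted({item for _, item, _ in records})
    tags = sorted({tag for _, _, tag in records})
    user_index = {user: k for k, user in enumerate(users)}
    item_index = {item: k for k, item in enumerate(items)}
    tag_index = {tag: k for k, tag in enumerate(tags)}
    tensor = [[[0] * len(tags) for _ in items] for _ in users]
    for user, item, tag in records:
        tensor[user_index[user]][item_index[item]][tag_index[tag]] = 1
    return tensor, users, items, tags


tagging = [
    ("alice", "book1", "python"),
    ("alice", "book2", "web"),
    ("bob", "book1", "code"),
]
tensor, users, items, tags = build_tensor(tagging)
alice_slice = tensor[users.index("alice")]  # an item-by-tag matrix for one user
```

The slice for one user keeps the information about which tag was applied to which item, which is lost when the data are projected onto separate user-tag and user-item matrices.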

In addition to being used as a standalone information source, tags are also popularly used in combination with other information sources to profile users. In the next two subsections, firstly the user profiling approaches based on other popular Web 2.0 information sources will be briefly reviewed. Then, the hybrid user profiling approaches based on tags and other information sources will be reviewed.

2.1.3.2 User Profiling Based on Other Web 2.0 User Information

Some approaches were proposed to find the content or semantic meanings of multimedia information (Weinberger et al., 2008; Tingle et al., 2010). Because it is usually difficult to obtain the content information or semantic meanings of multimedia content automatically, some approaches (Chen et al., 2010; Xu et al., 2010; Arapakis et al., 2010) proposed to find users' interests based on descriptive textual information such as the title, abstract and description of photos, videos or audio. In addition, some approaches proposed to combine the content of multimedia information with the textual descriptive information to find users' interests (Arapakis et al., 2010).

Users' explicit network information has become an important information source to profile users and make recommendations. Massa and Avesani (1997) proposed to integrate users' explicit trust network with the inferred similarity neighbourhood to profile users' item preferences and make recommendations. Some approaches suggested using social networking information to mine target users' profiles (Bonhard & Sasse, 2006; Mislove et al., 2010; Mitzlaff et al., 2010). The work of Mislove et al. (2010) proposed to use the items or topics that friends are interested in to refine target users' topic interests and make recommendations.

User-created textual content is the major source of user-contributed information available on the web now. Research on blogs mainly focuses on the formation and mining of communities, blog-specific search engines (Chen et al., 2007), opinion mining (Zhang et al., 2007) and sentiment analysis (Mei et al., 2007). Research on reviews and comments mainly focuses on opinion mining, such as extracting products' rateable features (Popescu & Etzioni 2005; Titov & McDonald 2008), customer opinion summarisation (Zhuang et al., 2006; Aciar et al., 2006), and sentiment analysis of user reviews (Ding et al., 2008; Jakob et al., 2009). Wikipedia categories can be used as an ontology to find users' topic or subject interests (Hu et al., 2009). The structure and content of forum posts are used to find users' interests and opinions in order to recommend the best answer to a question to a user (Chen & Nayak, 2008).

More recently, research on micro-blogs such as tweets has begun. The current approaches focus on opinion mining based on tweets (Jansen et al., 2009; Diakopoulos & Shamma, 2010) and on how to rank users (Weng et al., 2010) and recommend content (Chen et al., 2010) and news (Phelan et al., 2009).

2.1.3.3 Hybrid User Profiling Based on Tags and Other Information

Tags are also widely used in combination with other information sources to profile users. Sen et al. (2009) proposed to combine tags, explicit ratings, implicit ratings, click streams and search logs to profile users and make personalised item recommendations. Heymann et al. (2008) and Gemmis (2008) proposed to combine tags with the textual content of tagged items to find users' topic interests. Other work combined tags with blogs (Hayes & Avesani, 2007; Qu et al., 2008) to find users' opinions and topic interests. Recently, some approaches combined tags with images (Xu et al., 2010) and videos (Chen et al., 2010) to mine the semantic meaning of multimedia items. User profiling approaches that combine tags and micro-blogs such as tweets (e.g., hash tags) are a new research focus, demonstrated in the work of Huang et al. (2010) and Efron (2010).

In conclusion, user profiling is an ongoing research area. The new user information provides new solutions to profile users. At the same time, the noise and the new features of these kinds of user information bring challenges to the current user profiling approaches. Reducing the noise and making use of the unique features and rich personal information of these kinds of user-contributed information are vital to profiling users accurately.


One important personalisation application area of user profiling is recommender systems. The current related recommendation approaches will be reviewed in Section 2.2.

2.2 RECOMMENDER SYSTEMS

Recommender systems have been an active research area for more than a decade. Typically, users' explicit numeric ratings of items (e.g., 1-10) are used to profile users and make recommendations. Recommendation approaches based on explicit ratings are the major focus of this area and have been intensively explored.

Since the explicit ratings are not always available, how to profile users based on other information sources and make recommendations is another important research direction of this area. In addition to explicit ratings, the input information sources of recommender systems include binary implicit ratings, demographic data, content information of items, item taxonomy/ontology, click streams, web log data, new user information in Web 2.0 (e.g., tags, reviews, blogs, social friends, tweets) and others. Each kind of information source can be used solely or in combination with one or more other information sources to profile users and make recommendations.

In this section, the recommendation tasks and evaluation approaches will be introduced first. Then, the popularly used recommendation approaches will be outlined. Finally, the recommendation approaches based on item taxonomy and tags will be reviewed in detail.


2.2.1 Recommendation Tasks and Evaluation Approaches

The tasks of recommender systems include rating prediction and Top N recommendation. The rating prediction task is to predict the rating value that a user will give to an item. The Top N recommendation task is to recommend a set of unrated/new items to the target user (Deshpande & Karypis, 2004). For explicit ratings, both tasks are applicable. Since binary values are used for implicit ratings, the rating prediction task is usually not applicable to implicit ratings, while Top N recommendation is more applicable (Adomavicius & Tuzhilin, 2005).

The recommendation approaches can be evaluated from the aspects of effectiveness and efficiency. Accuracy metrics are popularly used to measure the effectiveness of recommendation approaches. The Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are widely used to measure the accuracy of the rating prediction task (Herlocker et al., 2004). The differences between the predicted user ratings and the true user ratings for a given set of items are calculated to measure the accuracy of this kind of recommendation task.
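Both error measures follow directly from this description: MAE averages the absolute differences between predicted and true ratings, and RMSE is the square root of the averaged squared differences. The sketch below uses hypothetical function names and made-up ratings purely for illustration.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Squared Error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("rating lists must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


predicted_ratings = [4.0, 3.0, 5.0]
true_ratings = [5.0, 3.0, 3.0]
example_mae = mae(predicted_ratings, true_ratings)    # (1 + 0 + 2) / 3
example_rmse = rmse(predicted_ratings, true_ratings)  # sqrt((1 + 0 + 4) / 3)
```

Because the errors are squared before averaging, RMSE penalises the single large error (2 rating points) more heavily than MAE does.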

Precision and recall are commonly used to evaluate the Top N recommendation task. Unlike MAE or RMSE, these metrics measure the ability of a recommendation approach to select high quality or most potentially interesting items from the whole item set for a given target user (Herlocker et al., 2004; Montaner et al., 2003; Ziegler et al., 2004). Combining precision and recall, the F1 measure is another popular metric for evaluating the accuracy of recommendation approaches.
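For a single target user, these metrics can be computed as in the sketch below, using the standard definitions: precision is the fraction of recommended items that are relevant, recall is the fraction of relevant items that are recommended, and F1 is their harmonic mean. The function name, the convention of returning 0.0 for empty inputs, and the sample item identifiers are assumptions made for illustration; the recommended list is assumed to contain no duplicates.

```python
def precision_recall_f1(recommended, relevant):
    """Return (precision, recall, F1) for one user's Top N list."""
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


top_n = ["a", "b", "c", "d"]   # items recommended to the user
held_out = {"a", "c", "e"}     # items the user actually liked in the test set
p, r, f1 = precision_recall_f1(top_n, held_out)
```

Here two of the four recommended items are relevant and two of the three relevant items are found. In an experiment these per-user values would typically be averaged over all test users.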

Computational efficiency is another important aspect in evaluating the scalability of recommendation approaches. A common approach to evaluating the computational efficiency of a recommendation approach is to measure the amount of time it requires to generate a single recommendation. However, since the efficiency bottleneck can be solved by other non-algorithmic approaches (such as employing higher performance hardware and parallel computing), the effectiveness evaluation plays a more important role than the efficiency evaluation of a recommendation approach.

More recently, advanced cloud computing techniques and services have provided new solutions to the scalability issue of recommendation approaches. Implementing a recommendation approach in a highly parallel way, so that it benefits from these cloud computing techniques, would make the approach scalable and allow it to run efficiently at a large scale.

Beyond accuracy and scalability, other facets such as Coverage, Novelty and Interestingness are also used to evaluate recommendation approaches (Adomavicius & Tuzhilin, 2005). However, these metrics have some limitations. The major drawback is that they are subjective, and so far there is no consensus on the definitions or the evaluation process of these facets. Therefore, they are neither well recognised nor popularly used (Adomavicius & Tuzhilin, 2005). Developing and studying better evaluation metrics constitutes an interesting and important research topic. Most current research on recommender systems uses accuracy to measure the quality of systems (Koren, 2008; Zhang et al., 2010).

2.2.2 Recommendation Approaches

Over the decades, many different recommendation techniques and systems with distinct strengths have been developed. Recommender systems can be broadly classified into three categories: content-based filtering, collaborative filtering (CF), and hybrid approaches (Adomavicius & Tuzhilin, 2005).

2.2.2.1 Content Based Filtering

Conventional techniques dealing with information overload typically make use of content-based filtering techniques. Content-based filtering, also called cognitive filtering (Malone et al., 1987), relies on characterising the content of an item and the information needs of potential users, and then using these representations to intelligently match items to users. In other words, content-based filtering techniques recommend items with similar contents to the items preferred by target users (Jian et al., 2005; Pazzani and Billsus, 2007; Malone et al., 1987). Content-based systems are mostly used for recommendation across text-based items/documents (Adomavicius & Tuzhilin, 2005).

The input information sources of content-based approaches usually include users' explicit or implicit ratings of items and the content related information of items such as extracted keywords, taxonomic/ontological topics, categories/genres and others. Besides explicit or implicit ratings that reflect users' preferences for items, users' preferences or interests in topics are also widely used for content-based approaches. The topic preferences of a user are usually represented by keywords or terms generated implicitly from the content related information of items that the user rated, clicked, bought, or downloaded. Each keyword or term can have different levels of importance or 'informativeness' as determined by a weighting measure (Adomavicius & Tuzhilin, 2005).
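The term weighting and matching steps can be sketched as follows. The sketch assumes tf-idf as the weighting measure and cosine similarity as the matching function, which are common choices rather than the only ones; the item identifiers and keywords are hypothetical, and every item is assumed to have at least one keyword.

```python
import math
from collections import Counter

def tf_idf(item_terms):
    """item_terms: {item: list of keywords}. Returns {item: {term: weight}}."""
    n_items = len(item_terms)
    # Document frequency: number of items in which each term occurs.
    df = Counter(t for terms in item_terms.values() for t in set(terms))
    weights = {}
    for item, terms in item_terms.items():
        tf = Counter(terms)
        weights[item] = {
            t: (count / len(terms)) * math.log(n_items / df[t])
            for t, count in tf.items()
        }
    return weights

def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical items described by extracted keywords.
items = {
    "doc_a": ["python", "programming", "tutorial"],
    "doc_b": ["python", "snake", "reptile"],
    "doc_c": ["programming", "java", "tutorial"],
}
weights = tf_idf(items)
# Assume the target user preferred doc_a, so its vector serves as the topic profile.
user_profile = weights["doc_a"]
scores = {i: cosine(user_profile, w) for i, w in weights.items() if i != "doc_a"}
print(sorted(scores, key=scores.get, reverse=True))
```

Terms that occur in fewer items receive a higher weight, so rare keywords contribute more to the match than common ones.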

Besides the term vector model, latent semantic topic models such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Indexing (PLSI) are popularly used to process large textual corpora to recommend the most relevant items to users (Adomavicius & Tuzhilin, 2005).

The pure content-based approaches have several drawbacks. Firstly, they have restricted application areas. They are applicable only to those application areas with textual items (e.g., web pages, documents) or those items which have rich content descriptions. They have difficulty recommending items with no textual content (e.g., video and audio clips, images) or limited textual descriptions (e.g., e-commerce products). Secondly, they have the recommendation novelty issue or over-specialisation problem. The user is restricted to seeing only items that are very similar to what they have already rated. Also, content-based systems ignore community information and thus do not take any information from other users into account when making recommendations (Popescul et al., 2001). Content-based systems also have the new-user problem (Adomavicius & Tuzhilin, 2005).

2.2.2.2 Collaborative Filtering Approaches

Collaborative filtering, or social filtering (Malone et al., 1987; Shardanand & Maes, 1995), is the most popularly used technique in recommender systems. It is best known for its use on popular e-commerce websites such as Amazon.com and other websites (Linden et al., 2003). Essentially, a collaborative filtering based recommender system automates the 'word-of-mouth' paradigm: it makes recommendations to a target user by consulting the opinions or preferences of the users with similar tastes to the target user (Sarwar et al., 2000).

Explicit ratings are the typical input information of collaborative filtering approaches. The collaborative filtering approaches based on implicit ratings are attracting more and more attention. A very important advantage of collaborative filtering approaches is that they are applicable to various kinds of application areas and are entirely independent of the content types of items. Other advantages of the collaborative filtering approach include having higher recommendation novelty and taking the community users' subjective opinions and preferences into consideration.

Despite their popularity, pure collaborative filtering based recommender systems usually suffer from the following problems. One challenge commonly encountered by collaborative filtering based recommenders is the cold-start problem. When a new system has insufficient user profiles, or a new user has little or no rating information, collaborative filtering based approaches perform poorly (Adomavicius & Tuzhilin, 2005).

Another problem is the 'sparsity problem'. When there are too many items in the system, there might be many users with few or no common items shared with others. How to form a proper neighbourhood with few common item ratings is another challenge for those application areas with a large number of items (e.g., books). Collaborative filtering approaches also have the new item problem (Adomavicius & Tuzhilin, 2005).

Scalability is another major challenge for collaborative filtering based recommenders. Collaborative filtering based recommenders require data from a large number of users before being effective, in addition to requiring a large amount of data from each user, while limiting their recommendations to the exact items specified by those users. The computational cost of collaborative filtering grows with both the number of users and the number of items (Papagelis et al., 2005). The numbers of users and items in e-commerce websites might increase dynamically (most of them have over several million). Consequently, the recommenders will inevitably encounter severe performance and scaling issues (Sarwar et al., 2000; Papagelis et al., 2005).

The collaborative filtering approach can be classified into model based and memory based collaborative filtering approaches.

1) Model Based CF

Model based collaborative filtering algorithms use the collection of ratings to learn a model, which is then used to make rating predictions (Adomavicius & Tuzhilin, 2005). Explicit ratings are the typical information gathered to profile users for model based CF approaches. Some statistical and machine learning techniques are used for this kind of approach, for example, probabilistic models based on Bayesian networks, the statistical model based on K-means clustering (Begelman et al., 2006; Shepitsen et al., 2008), the restricted Boltzmann Machine model (Salakhutdinov et al., 2007), and latent factor models based on matrix factorisation techniques (Koren, 2008).

More recently, model based CF approaches such as latent factor models (Koren, 2008) have achieved higher accuracy for the rating prediction task on large scale explicit rating datasets, such as the Netflix dataset (http://www.netflixprize.com/). However, how to use matrix factorisation approaches to recommend Top N unrated items to the target user, and how to apply them to implicit ratings, still remain open research questions (Koren, 2008).
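To make the idea of a latent factor model concrete, the sketch below trains a plain regularised matrix factorisation with stochastic gradient descent on explicit ratings. It is a generic textbook formulation, not the specific model of Koren (2008); the hyperparameters, function names and sample ratings are assumptions chosen for illustration.

```python
import math
import random

def predict(p, q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(p[user], q[item]))

def train_mf(ratings, n_factors=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """ratings: list of (user, item, rating). Returns (user_factors, item_factors)."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error with L2 regularisation.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q

def training_rmse(p, q, ratings):
    squared = sum((r - predict(p, q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(squared / len(ratings))

# Hypothetical explicit ratings on a 1-5 scale.
ratings = [
    ("u1", "p1", 5.0), ("u1", "p2", 3.0),
    ("u2", "p1", 4.0), ("u2", "p3", 1.0),
    ("u3", "p2", 2.0), ("u3", "p3", 5.0),
]
before = training_rmse(*train_mf(ratings, epochs=0), ratings)
after = training_rmse(*train_mf(ratings), ratings)
print(before, after)
```

The sketch only fits observed explicit ratings, which is why applying it directly to binary implicit ratings or to Top N ranking is not straightforward.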

2) Memory Based CF

Memory based algorithms are also called heuristic-based algorithms. They make recommendations based on the entire collection of previously rated items by the users. The recommendation process of this type of approach includes user profiling, neighbourhood formation, and recommendation generation.

The memory based CF approaches can be classified into user based and item based approaches. For standard approaches, users are profiled with explicit ratings. Then, K-Nearest Neighbour (KNN) approaches are used to form the neighbourhood of each user or each item (Adomavicius & Tuzhilin, 2005). Cosine similarity and Pearson correlation are popularly used to calculate the similarity of two users or two items (Adomavicius & Tuzhilin, 2005). After that, the recommendation generation algorithms first predict how much the target user may be interested in each candidate item that is derived from the neighbourhood. Then, they rank the items and generate the recommended item list for each target user.
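The three steps (user profiling, neighbourhood formation, recommendation generation) can be sketched for the user based case as follows. For brevity the sketch assumes binary implicit ratings, so each user profile is simply a set of items and cosine similarity reduces to a set overlap formula; the candidate score (sum of neighbour similarities) is one simple choice among several, and all names and data are hypothetical.

```python
import math

def cosine_sets(a, b):
    """Cosine similarity of two binary profiles represented as item sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend_top_n(target, user_items, k=2, n=3):
    """User based KNN: rank unseen items by the summed similarity of neighbours."""
    # Neighbourhood formation: the k most similar users with non-zero similarity.
    sims = [
        (cosine_sets(user_items[target], items), user)
        for user, items in user_items.items()
        if user != target
    ]
    neighbours = sorted((s for s in sims if s[0] > 0), reverse=True)[:k]
    # Recommendation generation: score candidate items from the neighbourhood.
    scores = {}
    for sim, user in neighbours:
        for item in user_items[user] - user_items[target]:
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

# Hypothetical user profiles (user profiling step).
user_items = {
    "u1": {"p1", "p2", "p3"},
    "u2": {"p1", "p2", "p4"},
    "u3": {"p2", "p3", "p4", "p5"},
    "u4": {"p6"},
}
print(recommend_top_n("u1", user_items))
```

An item based variant follows the same pattern with the roles of users and items swapped.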

The advantages of memory based CF approaches include being easy to understand and implement, and being easy to extend and combine with other information sources, among others (Adomavicius & Tuzhilin, 2005). Besides explicit ratings, users can be profiled with implicit ratings and other information sources from various kinds of features, for example, users' topic interests, browsing behaviours, and tagging behaviours. The similarity of two users can be calculated by the similarity of their user profiles. The similarity of two items can be calculated by the similarity of their item descriptions, or of representations that are formed by users' ratings or other behaviours.

Compared with model based CF approaches that mainly focus on the rating prediction task, memory based CF approaches are applicable for both rating prediction and Top N recommendation tasks. For implicit ratings and Top N recommendation tasks, memory based CF approaches are still widely used.

2.2.2.3 Hybrid Approaches

From the recommendation techniques described in previous sections, it can be observed that different techniques have their own strengths and limitations, and no technique alone is the single best solution for all users in all situations (Adomavicius & Tuzhilin, 2005). A hybrid recommendation system is composed of two or more diverse recommendation techniques. The basic rationale for forming it is to gain better performance with fewer of the drawbacks of any individual technique, as well as to incorporate various input datasets in order to produce recommendations with higher accuracy and quality (Burke, 2002).

Burke (2002) has classified the hybrid recommendation approaches into seven categories: 'weighted', 'mixed', 'switching', 'feature combination', 'cascade', 'feature augmentation' and 'meta-level'. The combination of content-based and collaborative filtering recommendation approaches is widely used to make better recommendations.

Linear combination is commonly used to mix two or more methods. The benefits of this type of hybridisation method include low implementation effort and cost, and the ability to adjust the hybrid weighting.
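A linear combination of two recommenders can be sketched as below. The sketch assumes that both methods produce scores on a comparable scale (for example, normalised to the range 0 to 1) and that an item missing from one method scores zero there; the weight `alpha` and the sample scores are hypothetical.

```python
def linear_hybrid(cf_scores, content_scores, alpha=0.5):
    """Combine two score dictionaries as alpha * CF + (1 - alpha) * content."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    items = set(cf_scores) | set(content_scores)
    return {
        item: alpha * cf_scores.get(item, 0.0)
        + (1.0 - alpha) * content_scores.get(item, 0.0)
        for item in items
    }

# Hypothetical normalised scores from a CF method and a content-based method.
cf = {"p1": 0.9, "p2": 0.2}
content = {"p2": 0.8, "p3": 0.6}
combined = linear_hybrid(cf, content, alpha=0.5)
ranked = sorted(combined, key=lambda item: (-combined[item], item))
print(ranked)
```

Adjusting `alpha` shifts the balance between the two methods without changing either of them, which is the main reason this form of hybridisation is cheap to implement.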

The central idea of hybrid recommendation techniques is that they combine the strengths of various recommendation techniques. However, this also means that they might inherit the limitations of those techniques. Moreover, hybrid techniques are usually more resource intensive (in terms of computation and memory usage) than standalone techniques, as their resource requirements are accumulated from multiple recommendation techniques. For example, a 'collaboration via content' hybrid (Pazzani, 2007) might need to process both item content information and user rating data to generate recommendations, and therefore requires more CPU cycles and memory than any single content-based filtering or collaborative filtering technique.

2.2.3 Recommender Systems Based on Taxonomy

How to profile users' preferences or interests in topics is very important. Item taxonomy is one important traditional information source given by experts to find users' topic preferences (Adomavicius & Tuzhilin, 2005). The structural information of taxonomy can be used to find the semantically related, more general, or more specific taxonomic topics and improve the accuracy and novelty of recommendations. For example, for a target user interested in 'flower', taxonomy-based techniques might also consider those items related to taxonomic topics 'rose', 'garden', etc.

There are some studies that utilise item taxonomical or ontological information to assist recommender systems. Middleton et al. (2004) used ontology to inductively learn the topics that users are interested in for recommending research papers to users. Based on the set of user-interested topics, the recommendation list could be efficiently generated by weeding out those research papers that did not fall into the preferred topics. The CHIP Demonstrator (Aroyo et al., 2007) also makes semantics-driven recommendations by allowing users to explicitly rate a set of predefined semantic attributes of the items.

The important recommendation approaches based on item taxonomy include the work of Ziegler et al. (2004) and Weng et al. (2008). Ziegler et al. (2004) proposed a taxonomy-driven product recommender, which utilises a general tree structured product taxonomy to enhance its recommendations. This approach linearly combined the content-based filtering and memory based collaborative filtering approaches to make recommendations. The taxonomy information is used to find users' topic interests and items' relevant topics. Those items that were popularly used by neighbourhood users and matched the target user's topic interests would be recommended to the target user.

The approach of Ziegler et al. (2004) decayed the weight of the taxonomic topic node, based on the number of children nodes of the taxonomic topic node in the item taxonomy tree and the length of the taxonomic topic node. Weng et al. (2008) proposed to combine the implicit and explicit item preferences with the topic preferences that were generated based on the taxonomic topic weighting approach (Ziegler et al., 2004) to make item recommendations.
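The general idea of decaying a topic weight along a taxonomy tree can be sketched as follows. The decay rule used here (multiply by a constant factor and divide by the number of children of the ancestor at each step up the tree) is an illustrative assumption only and is not the exact formula of Ziegler et al. (2004); the taxonomy, the `decay` parameter and the function name are hypothetical.

```python
def propagate_topic_score(leaf, score, parent, decay=0.75):
    """Propagate a topic score from a leaf topic to all of its ancestors.

    parent: {child_topic: parent_topic} describing a tree-structured taxonomy.
    Assumed rule: an ancestor with many children is less specific, so it
    receives a smaller share of the score of the node below it.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, set()).add(child)
    scores = {leaf: score}
    node = leaf
    while node in parent:
        ancestor = parent[node]
        scores[ancestor] = decay * scores[node] / len(children[ancestor])
        node = ancestor
    return scores

# Hypothetical taxonomy: plant -> {flower, tree}, flower -> {rose, tulip}.
parent = {"rose": "flower", "tulip": "flower", "flower": "plant", "tree": "plant"}
scores = propagate_topic_score("rose", 1.0, parent)
print(scores)
```

With such a rule, a user who prefers an item about 'rose' also receives a smaller weight for the more general topics 'flower' and 'plant'.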

However, the taxonomic topic weighting approach (Ziegler et al., 2004) did not consider the popularity of each taxonomic topic node. Because taxonomic information is sophisticated and information rich, there are still many promising ways to utilise it in the applications of recommender systems. How to use the structural information of item taxonomy and other unique features of taxonomic topics (e.g., standard, well-recognised, and user community independent) to enhance recommender systems still needs to be explored.

2.2.4 Recommender Systems Based on Folksonomy

Since 2006, as a type of typical new user information in Web 2.0, tags or item folksonomy information has become an important additional information source to profile users and make recommendations. Based on the types of recommended objects, the recommender systems based on folksonomy or tags can be classified into user recommendations, tag recommendations and item recommendations.

User recommendations based on tags have attracted less attention. The research on tag based recommender systems mainly focuses on how to recommend tags to users (Sen et al., 2009). The problem of tag recommendation can be described as: given a target user and a set of items, how to recommend tags for the user to assign to those items (Jaschke et al., 2007). Approaches that have been proposed include using the co-occurrence of tags (Li et al., 2008), association rules (Heymann et al., 2008), FolkRank (Jaschke et al., 2007), tensor factorisation (Rendle et al., 2009), graph based approaches (Guan et al., 2009; Zhang et al., 2010), a probabilistic model (Wetzker et al., 2009), Latent Dirichlet Allocation (LDA) (Lu et al., 2010), and link networks (Au Yeung, 2009).

At present, little work has been done on item recommendations based on tags. The problem of item recommendation can be described as: given a target user, how to recommend a set of new or untagged items to the target user. Since recommending a tag to a user to label an item is different from recommending an item to a user, the tag recommendation approaches usually cannot be used to recommend items directly (Sen et al., 2009).

In typical tagging communities, few or no explicit ratings are available. As tagging is a kind of implicit rating behaviour (Sen et al., 2009) and tags are pieces of textual information describing the content of items, the memory based CF and content-based approaches are mainly used. Earlier work did not consider the tag quality problem (Tso-Sutter et al., 2008; Diederich et al., 2006). Diederich et al. (2006) proposed an exploratory user based CF approach using tag based user profiles. The tf-iuf weighting approach, similar to the tf-idf approach in text mining, was used for each user's tags. The work of Tso-Sutter et al. (2008) extended the binary user-item matrix to a binary user-item-tag matrix and used the Jaccard similarity measure to find neighbours. It was claimed that, because of the tag quality problem, tag information failed to be very useful for improving the accuracy of memory based CF approaches (Tso-Sutter et al., 2008).
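The two ideas mentioned above can be sketched as follows. The tf-iuf weight is written here by analogy with tf-idf (tag frequency within a user's profile times the log of the inverse user frequency of the tag); the exact formula used by Diederich et al. (2006) may differ. The Jaccard example treats each user as a set of items and tags, in the spirit of a binary user-item-tag matrix. All names and data are hypothetical.

```python
import math
from collections import Counter

def tf_iuf_profiles(user_tags):
    """user_tags: {user: list of tag assignments, repeats allowed}.

    Returns {user: {tag: weight}} with weight = tf * log(|users| / uf),
    where uf is the number of users who used the tag at least once.
    """
    n_users = len(user_tags)
    uf = Counter(t for tags in user_tags.values() for t in set(tags))
    return {
        user: {t: count * math.log(n_users / uf[t])
               for t, count in Counter(tags).items()}
        for user, tags in user_tags.items()
    }

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical tag assignments of three users.
user_tags = {
    "u1": ["python", "python", "web"],
    "u2": ["python", "garden"],
    "u3": ["garden", "flowers"],
}
profiles = tf_iuf_profiles(user_tags)
print(profiles["u1"])

# Hypothetical binary user vectors extended with tags (prefixed with "t:").
u1_vector = {"p1", "p2", "t:python"}
u2_vector = {"p2", "t:python", "t:garden"}
print(jaccard(u1_vector, u2_vector))
```

A tag used by every user receives a weight of zero under this scheme, since it does not help to distinguish one user's interests from another's.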

More recently, the noise or quality of tags (Sen et al., 2009; Gemmell et al., 2009) and the usefulness of tags (Bischoff et al., 2008) have attracted attention. Some content-based approaches that dealt with the noise of textual contents were proposed. In the work of Niwa et al. (2006) and Shepitsen et al. (2008), clustering approaches were used to find the item and tag clusters based on the tag based tf-idf content representations. The mapping between users' tags and the representative tags of item clusters was used to make content-based web page recommendations. Latent semantic approaches based on Probabilistic Latent Semantic Indexing (PLSI) (Gemmis et al., 2008) and LDA (Siersdorfer et al., 2009) have been proposed to remove the noise of tags and build latent semantic topic models to recommend items to users.

Besides these memory based CF approaches and content filtering models, in the work of Sen et al. (2009), a special tag rating function was used to infer users' tag preferences. Along with the inferred tag preferences, the click streams and tag search history of each user were used to infer the user's preferences for items. The various kinds of extra information and special functions make the work of Sen et al. (2009) difficult to compare with other approaches and restrict its applications. More recently, Zhang et al. (2010) proposed to integrate the user-tag-item tripartite graph to rank items for the purpose of recommending unrated items to users. In that work, the user-tag-item graph was divided into user-tag and tag-item graphs, while the three-dimensional relationships reflecting the personal tagging relationships were ignored. Zhang et al. (2010) proposed a graph based ranking approach to recommend documents to users. Zhen et al. (2009) proposed to integrate tag information and explicit ratings to improve the accuracy of rating predictions of a model based CF approach.

The current approaches mainly treat tags as textual information and ignore a distinctive feature of tags: tags are given by users directly and thus form a three-dimensional relationship between users, items and tags. This three-dimensional relationship records the personal tagging information of each user. How to use the rich personal information of folksonomy to profile users accurately and improve the accuracy of item recommendations still needs to be further investigated.

The combination of standard item taxonomy and user contributed folksonomy is another important research topic. Some pioneering work discussed how to combine taxonomy and folksonomy for knowledge organisation (Schmitz, 2006; Damme et al., 2007; Gruber, 2007; Eda et al., 2009) and navigation (Bindelli et al., 2008). Currently, the existing recommender systems only use one of the two information sources. Very few works have discussed how to benefit from both information sources to generate more accurate user profiles and recommender systems.

2.3 CHAPTER SUMMARY

This chapter reviewed the current related user profiling approaches and recommender systems. The new user information in Web 2.0 provides possible new solutions to profile users. Besides explicit ratings, folksonomy or tag information becomes another important information source to profile users. Its advantages include being contributed by users explicitly and proactively, and being less intrusive, domain free, humanly understandable, lightweight, and easy to integrate with other information sources. However, it also contains a lot of noise such as tag synonyms, semantic ambiguity, and personal tags.

The current user profiling and recommendation approaches based on folksonomy information still need to be improved. How to reduce the noise of tags and use the unique features and the rich personal information of folksonomy to profile users needs to be further explored. Moreover, very few works discuss how to benefit from both folksonomy information contributed by users and taxonomy information given by experts to generate more accurate user profiles and recommendation making approaches.

The following chapters will extend the existing work through proposing effective approaches to make use of the unique features and the rich personal information of folksonomy, and by combining the standard item taxonomy information to further improve the performances of user profiling and recommendation making.

Chapter 3

User Profiling Based on Folksonomy

In Web 2.0, the folksonomy or tag information becomes an important information source to profile users. It reflects users' opinions on item classifications and descriptions as well as users' interests in or preferences for the conceptual topics or categories of items. Moreover, tags are given by users directly and can be used to describe many types of items including videos, photos, web pages, audio, documents and others. Because of their simplicity and multiple functions, tags are more and more popularly used in various application areas. This chapter will discuss how to profile users based on folksonomy information.

3.1 NOTATIONS

In this subsection, the concepts and entities involved in this research are formally defined. These definitions will also be used in subsequent chapters.

• Users: the set of all users in an online community who have used tags to organise items.

• Folksonomy (i.e., Tags, Social Tags): the set of all tags used by these users. A tag is a piece of textual information given by users to describe or classify items.

• Items (i.e., Products, Resources): the set of all items tagged by these users. Items could be any type of online information resources or products in an online community, such as web pages, video clips, music tracks, photos, academic papers, movies, books etc. We assume that each item can be described by a set of tags contributed by users and a set of world knowledge ontological/taxonomic topics assigned by experts.

• Tagging: the basic tagging behaviour is defined over combinations of a user, a tag and an item. It records whether or not a given user collected a given item with a given tag.

• User profile: A user profile is a collection of information about a user, such as demographic information, interests or preferences, opinions, friends or other network information. A user's interests or preferences for items and topics are typical information that is used to profile a user for the purpose of item recommendations.

• Item preferences: record a user's preference for items. Item preferences can be classified into explicit item preferences and implicit item preferences.

o Explicit item preferences: a set of numeric values (e.g., ranging from 0 to 10) that are explicitly assigned by a user to a set of items to express how much the user likes the items. They are also called explicit ratings.

o Implicit item preferences: a set of binary values (i.e., 0, 1) that are automatically inferred and collected from the user's non-rating relevant actions (e.g., history of purchases, navigation history and tagging behaviour) to indicate whether a user is interested in the items. They are also called implicit ratings. Unlike explicit item preferences, they usually imply the users' possible item interests rather than clear indications of subjective item preferences (i.e., whether the users like or dislike the items) (Montaner et al., 2003). Similarly, for those items that are not implicitly preferred by a user, it can only be concluded that these items are unseen by or do not interest the user (rather than being disliked by the user).

In general, implicit ratings are far more obtainable and accessible in online communities or websites than explicit ratings. In this work, the implicit ratings are obtained from a user's tagging behaviour. If a user tagged an item, then the item is implicitly rated by the user (Sen et al., 2009).

• Topic preferences: record a user's preferences on topics. Topic preferences can also be divided into explicit topic preferences and implicit topic preferences.

o Explicit topic preferences: a set of keywords and terms explicitly and proactively given by a user to express this user's interests in or preferences for topics. For example, search queries, topic interests in a user profile, and tags are commonly used explicit topic preferences.

o Implicit topic preferences: a set of keywords or terms that are extracted from the content information or ontological/taxonomic topics of the items that a user clicked, bought, tagged or rated. Instead of being given by users explicitly, implicit topic preferences are usually generated automatically based on users' actions or behaviours.

In this thesis, the explicit topic preferences are obtained from a user's tag information while the implicit topic preferences are generated from the ontological/taxonomic information of the tagged items.
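One possible way to hold these notations in code is sketched below. The sketch assumes that tagging behaviour is stored as a set of (user, tag, item) records, from which the implicit item preferences and the explicit topic preferences of a user are derived; the class name, function names and identifiers such as `u1`, `t1` and `p1` are hypothetical and do not reproduce the example in Figure 3.1.

```python
from typing import NamedTuple

class Tagging(NamedTuple):
    """One tagging record: a user collected an item with a tag."""
    user: str
    tag: str
    item: str

def has_tagged(taggings, user, tag, item):
    """Whether the user collected the item with the tag."""
    return Tagging(user, tag, item) in taggings

def implicit_item_preferences(taggings, user):
    """Items implicitly rated by the user, i.e. items the user has tagged."""
    return {t.item for t in taggings if t.user == user}

def explicit_topic_preferences(taggings, user):
    """Tags that the user has given explicitly."""
    return {t.tag for t in taggings if t.user == user}

# Hypothetical tagging community.
taggings = {
    Tagging("u1", "t1", "p1"),
    Tagging("u1", "t2", "p2"),
    Tagging("u1", "t2", "p3"),
    Tagging("u2", "t2", "p3"),
}
print(implicit_item_preferences(taggings, "u1"))
print(explicit_topic_preferences(taggings, "u1"))
```

Implicit topic preferences are not covered by the sketch, since they additionally require the ontological/taxonomic topics of each tagged item.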

Figure 3.1 illustrates an example of a tagging graph. In the example tagging community, four users used five tags ("garden", "apple", "globalization", "internet" and "0403") and tagged six items. Each user's tagging behaviour is illustrated in the graph. For example, one user has two tags: this user has tagged one item with the first tag and two items with the second tag. Another user has used one tag and tagged two items with it. The example graph will also be used in subsequent chapters.

[Figure 3.1: a graph whose nodes are users, tags and items, with each user's tagging shown separately.]

Figure 3.1 An Example of a Tagging Graph

3.2 THE RELATIONSHIP MODELLING OF FOLKSONOMY

Typically, there are two kinds of entities, users and items, in a recommender system. The implicit or explicit ratings are used to profile users. The ratings form a two-dimensional relationship between users and items. Unlike standard recommender systems, a tagging community has three kinds of entities: users, items and tags. Users, items and tags are connected or related to each other, forming multiple relationships among them. An important distinctive feature of folksonomy is that tags are given by users directly to classify and organise items. Folksonomy itself is flat and does not contain structural information because users usually use tags to organise items without specifying the relationship between tags. However, the opinions of users on item classifications and descriptions, reflected by using tags, provide valuable information from which rich relationships among users, items, and tags can be generated. Since similar or closely related items, users and tags can be determined based on the relationships among them, it is very important to extract and model the relationships contained in folksonomy.

Firstly, the two-dimensional relationship between users and items can be obtained by considering users and items only. By taking tags into consideration, two further types of two-dimensional relationships can be formed: the relationship between users and tags and the relationship between items and tags. Each relationship between two entities can be modelled by two mappings, one in each direction: the first describes the relationship of one entity to the other, and the second describes the reverse. These relationships are formally defined below:

• User-Item Relationship: the relationship between users and items. It records the implicit ratings that are derived from the tagged items of each user and the users of each item. It includes the User-Item Mapping and the Item-User Mapping, which are defined below:

1) User-Item Mapping: maps a user to the set of items that the user has collected.

2) Item-User Mapping: maps an item to the set of users who have collected the item.

• User-Tag Relationship: the relationship between users and tags. It records each user's own tags and the user group of each tag. It includes the User-Tag Mapping and the Tag-User Mapping, which are defined below:

3) User-Tag Mapping: maps a user to the set of tags that are used by the user.

4) Tag-User Mapping: maps a tag to the set of users who have used the tag.

• Item-Tag Relationship: the relationship between items and tags. It records the tags of each item and the aggregated items of each tag. Similarly, it includes the following two mappings:

5) Item-Tag Mapping: maps an item to the set of tags that are used by some users to label the item.

6) Tag-Item Mapping: maps a tag to the set of items that are collected by some users with the tag.
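The six two-dimensional mappings can all be derived from the raw tagging records in a single pass, as sketched below. The sketch assumes that tagging behaviour is available as (user, tag, item) tuples; the dictionary keys, the function name and the sample data are hypothetical and are not taken from Figure 3.1.

```python
from collections import defaultdict

def build_two_dimensional_mappings(taggings):
    """Derive the six aggregated mappings from (user, tag, item) tuples."""
    names = ("user_items", "item_users", "user_tags",
             "tag_users", "item_tags", "tag_items")
    mappings = {name: defaultdict(set) for name in names}
    for user, tag, item in taggings:
        mappings["user_items"][user].add(item)   # 1) User-Item Mapping
        mappings["item_users"][item].add(user)   # 2) Item-User Mapping
        mappings["user_tags"][user].add(tag)     # 3) User-Tag Mapping
        mappings["tag_users"][tag].add(user)     # 4) Tag-User Mapping
        mappings["item_tags"][item].add(tag)     # 5) Item-Tag Mapping
        mappings["tag_items"][tag].add(item)     # 6) Tag-Item Mapping
    return {name: dict(mapping) for name, mapping in mappings.items()}

# Hypothetical tagging records.
taggings = [
    ("u1", "t1", "p1"),
    ("u1", "t2", "p2"),
    ("u1", "t2", "p3"),
    ("u2", "t2", "p3"),
]
m = build_two_dimensional_mappings(taggings)
print(m["user_items"]["u1"])
print(m["tag_items"]["t2"])
```

Each mapping is an aggregated view: for instance, the Tag-Item Mapping no longer records which user placed an item under a tag.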

Moreover, the three entities form a three-dimensional relationship: the User-Tag-Item Relationship. Unlike the two-dimensional relationships, which take an aggregated perspective, it records the personal tagging behaviour of each user. It is defined below.

 User-Tag-Item Relationship: is the relationship among users, items and

tags. It records each user‘s personal tagging behaviour. It includes three pairs

of mappings, which are (Item×Tag)-User Mapping and User-(Item×Tag)

Mapping, (User×Tag)-Item Mapping and Item-(User×Tag) Mapping, and,

(User×Item)-Tag Mapping and Tag-(User×Item) Mapping. Since only the

Page 55

(User×Tag)-Item Mapping and Item-(User×Tag) Mapping are used in this

thesis, their formal definitions are given as below:

7) (User×Tag)-Item Mapping ,

. It maps a user-tag pair to a set of items that are collected

under the tag by the user. is used to stand for . For example,

in Figure 3.1, .

8) Item-(User×Tag) Mapping

. It maps an item to its user-tag pairs. is

used to stand for . For example, in Figure 3.1,

.
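The mappings defined above can be illustrated with a minimal Python sketch that indexes raw tagging records. The records, the dictionary keys and the function name `build_mappings` are hypothetical and chosen only for illustration; the data is not that of Figure 3.1, and the numbering of mappings 1) and 2) is assumed from the numbering of mappings 3) to 8).

```python
from collections import defaultdict

# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"), ("u1", "p2", "apple"),
    ("u2", "p1", "garden"), ("u2", "p2", "garden"),
    ("u3", "p3", "apple"), ("u3", "p4", "apple"),
    ("u4", "p3", "globalization"), ("u4", "p4", "internet"),
]


def build_mappings(triples):
    """Index tagging records into the eight mappings, each a dict of sets."""
    names = ("user_item", "item_user", "user_tag", "tag_user",
             "item_tag", "tag_item", "usertag_item", "item_usertag")
    m = {name: defaultdict(set) for name in names}
    for user, item, tag in triples:
        m["user_item"][user].add(item)            # 1) User-Item Mapping
        m["item_user"][item].add(user)            # 2) Item-User Mapping
        m["user_tag"][user].add(tag)              # 3) User-Tag Mapping
        m["tag_user"][tag].add(user)              # 4) Tag-User Mapping
        m["item_tag"][item].add(tag)              # 5) Item-Tag Mapping
        m["tag_item"][tag].add(item)              # 6) Tag-Item Mapping
        m["usertag_item"][(user, tag)].add(item)  # 7) (User x Tag)-Item Mapping
        m["item_usertag"][item].add((user, tag))  # 8) Item-(User x Tag) Mapping
    return m


mappings = build_mappings(TRIPLES)
```

In this toy data the tag "apple" is shared by two users, but the (User×Tag)-Item Mapping keeps their item collections apart, which is the personal information that the aggregated two-dimensional mappings lose.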

The three-dimensional relationship records each individual user's personal tagging behaviour. It also reflects each user's opinion on the similarity of items. Based on the (User×Tag)-Item Mapping, we can see that a set of items labelled with a tag $t_j$ is collected and grouped together according to the viewpoint of the user $u_i$. For this user, the collected items are similar or closely related to each other in some way; otherwise the user would not have put them together and labelled them with the same tag. Based on this observation, we make an important assumption about the similarity of two items.

[Assumption 1]. For any two items $p_a$ and $p_b$, if they are put together under the same tag $t_j$ by the same user $u_i$, then these items are similar and closely related to each other from the perspective of user $u_i$.


3.3 TAG REPRESENTATION BASED ON FOLKSONOMY

Typically, the standard recommendation approaches (Adomavicius & Tuzhilin, 2005) are based on the User-Item Relationship. The tag information is discarded even though it has become available in the Web 2.0 environment. For example, the similarity of users is calculated based on the overlap of the tagged item sets obtained by the User-Item Mapping, while the similarity of items is calculated based on the overlap of the user sets obtained by the Item-User Mapping.

Therefore, for standard collaborative filtering recommendation approaches, the user profiles only contain item preferences and fail to capture users' preferences for tags. In fact, the relationships based on tags can be used to generate user profiles and make recommendations.

Some approaches have been proposed that use the two-dimensional User-Tag Relationship and Item-Tag Relationship to profile each user with the user's own tags and each item with the set of tags directly assigned to it (Diederich & Iofciu, 2006; Tso-Sutter et al., 2008). However, since there is no restriction on the words that can be chosen for tagging items, the tags used by users are free-form and contain a lot of noise such as semantic ambiguity, tag synonyms and personal tags (Sen et al., 2006). This tag quality problem makes user profiling based on folksonomy difficult (Au Yeung et al., 2007). For users who have used personal tags and items that are described by personal tags, it is difficult to find any similar users or items. Moreover, the semantic ambiguity of tags and tag synonyms cause inaccurate user profiling and item topic description. As a consequence, the noise of tags results in inaccurate neighbourhood formation and item recommendations.

For example, in Figure 3.1, since the tag "0403" is a personal tag, user cannot find any similar users based on the similarity of users' tag sets obtained by the User-Tag Mapping. User and user will be considered similar users since they have the same tag "apple", even though for user "apple" may mean a kind of fruit while for user it may mean a brand of computer product.

By nature, a tag is given by a user to organise or classify the person's own items. Thus, from the perspective of individual users, a tag is a textual entity dependent on its user. If we can find the actual semantic meaning or related topics of a tag for each individual user, then the noise of tags can be reduced.

From the (User×Tag)-Item Mapping, we can see that a set of items labelled with a tag $t_j$ is collected and grouped together according to the viewpoint of the user $u_i$. Based on Assumption 1, the collected items are similar or closely related in the opinion of user $u_i$. As tags reflect the conceptual topics of items, the relevant tags of these items can be considered to be closely related from the viewpoint of this user. Since the directly relevant tags of each item are recorded in the Item-Tag Mapping, combining the (User×Tag)-Item Mapping with the Item-Tag Mapping yields a set of closely related tags for each tag in terms of an individual user.

For example, as shown in Figure 3.1, based on the (User×Tag)-Item Mapping, the different item sets collected under the tag "apple" by two users can be obtained. Then, based on the Item-Tag Mapping, we can get the directly relevant tags of each item. Thus, for one user the tag "apple" is related to the tag "garden", but for the other user it is related to the tags "globalization" and "internet".

Therefore, different meanings of the same tag for different users can be determined. Because the tags of each item can be interpreted as the topics of each item (Li et al., 2008), the process of finding the related tags for each tag, with respect to each individual user, can be interpreted as finding the personalised semantic meanings or related topics of each tag for each user, which is called tag representation. The definition of tag representation in terms of folksonomy is given below.

[Definition 3.1] (Tag Representation-Folksonomy): represents the relevance of

each tag to each tag with respect to user . Let denote how strong is related to with respect to user , the relationship between a tag and other tags with respect to a user can be defined as the mapping , such that

. is called the folksonomy based representation of tag with respect to user .

Therefore, through finding the most personally related tags for each tag for each user, tag representation can help to remove the noise of tags. Based on the tag representations, more accurate item descriptions and user profiles can be generated, which will be discussed in Section 3.4 and 3.5 respectively.

How to calculate the relevance weight that measures how strongly two tags are related in terms of a user is very important. Before discussing how to calculate this weight, two important conditional probabilities are defined: the probability of a tag $t_x$ being used to tag an item $p_k$, given the item $p_k$, and the probability of a tag $t_j$ being used by a user $u_i$, given the user $u_i$.

3.3.1 The Two Conditional Probabilities $\Pr(t_x \mid p_k)$ and $\Pr(t_j \mid u_i)$

For an item $p_k$, let $|U_{p_k}|$ be the number of users that have tagged item $p_k$ and $|U|$ be the total number of users. The probability of item $p_k$ being tagged by users using any tags, denoted as $\Pr(p_k)$, can be defined as the ratio between the number of users who tagged item $p_k$ and the total number of users.

$$\Pr(p_k) = \frac{|U_{p_k}|}{|U|} \tag{3.1}$$

The probability $\Pr(p_k)$ is 0 if no user has tagged $p_k$ and 1 if all users have tagged $p_k$. Moreover, let $|U_{p_k,t_x}|$ denote the number of users who tagged item $p_k$ with tag $t_x$. The probability of item $p_k$ being tagged by users using the specific tag $t_x$ can be measured by the ratio between the number of users who tagged the item using tag $t_x$ and the total number of users. It is defined as:

$$\Pr(p_k, t_x) = \frac{|U_{p_k,t_x}|}{|U|} \tag{3.2}$$

Based on these two probabilities, an important conditional probability of tag $t_x$ being used to tag item $p_k$, given the item $p_k$, can be defined as follows:

$$\Pr(t_x \mid p_k) = \frac{\Pr(p_k, t_x)}{\Pr(p_k)} = \frac{|U_{p_k,t_x}|}{|U_{p_k}|} \tag{3.3}$$

$\Pr(t_x \mid p_k)$ is the probability of $t_x$ being used to tag item $p_k$, given the item $p_k$. The probability indicates how commonly the tag $t_x$ has been used by users to describe or classify a given item $p_k$. It reflects the "wisdom of crowds" in terms of the classification of the item. Reflecting the common viewpoint of users, the higher the probability, the more likely the tag represents a major topic of the item, or in other words, the more likely the item will be found under the tag.
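As a minimal sketch of this conditional probability, the function below divides the number of users who tagged an item with a given tag by the number of users who tagged the item at all, which is what dividing Equation 3.2 by Equation 3.1 amounts to. The records and the name `prob_tag_given_item` are hypothetical, and returning 0 for an item that nobody has tagged is an assumption made here to avoid a division by zero.

```python
# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"),
    ("u2", "p1", "globalization"),
    ("u3", "p1", "globalization"),
    ("u3", "p2", "internet"),
]


def prob_tag_given_item(triples, tag, item):
    """Share of the users who tagged `item` that used `tag` on it."""
    users_of_item = {u for u, p, _ in triples if p == item}
    if not users_of_item:
        return 0.0  # assumption: an untagged item gives probability 0
    users_with_tag = {u for u, p, t in triples if p == item and t == tag}
    return len(users_with_tag) / len(users_of_item)


p_apple = prob_tag_given_item(TRIPLES, "apple", "p1")
p_globalization = prob_tag_given_item(TRIPLES, "globalization", "p1")
```

Here two of the three users of `p1` chose "globalization", so it would be read as the major topic of the item and "apple" as a minor one.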

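For a user $u_i$, the probability of user $u_i$ tagging items can be defined as the ratio between the number of items that are tagged by user $u_i$ and the total number of items.

$$\Pr(u_i) = \frac{|P_{u_i}|}{|P|} \tag{3.4}$$

where $|P_{u_i}|$ is the number of items that user $u_i$ has tagged, and $|P|$ is the total number of items. $\Pr(u_i) = 0$ if the user has never tagged any items, and $\Pr(u_i) = 1$ if the user has tagged all the items. The probability of user $u_i$ tagging items using a specific tag $t_j$, denoted as $\Pr(u_i, t_j)$, is defined as the ratio between the number of items that are tagged by the user using tag $t_j$ and the total number of items.

$$\Pr(u_i, t_j) = \frac{|P_{u_i,t_j}|}{|P|} \tag{3.5}$$

where $P_{u_i,t_j}$ denotes the items that are tagged by user $u_i$ with the tag $t_j$, $P_{u_i,t_j} \subseteq P_{u_i}$. Based on the two probabilities $\Pr(u_i)$ and $\Pr(u_i, t_j)$, another important conditional probability can be defined.

$$\Pr(t_j \mid u_i) = \frac{\Pr(u_i, t_j)}{\Pr(u_i)} = \frac{|P_{u_i,t_j}|}{|P_{u_i}|} \tag{3.6}$$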

The conditional probability $\Pr(t_j \mid u_i)$ represents the probability of tag $t_j$ being used by user $u_i$, given the user $u_i$. It reflects the interest of $u_i$ in, or preference for, tag $t_j$. The higher the value, the more likely the user $u_i$ is interested in tag $t_j$.

[Example 3.1] In Figure 3.1, the item has the tag and . , . With a higher value, the tag "globalization" can be considered a major topic of the item while the tag "apple" represents a minor topic.

User has two tags and . , . User is more interested in tag . User only has a personal tag , .
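The user-side probability can be sketched in the same way: the share of a user's tagged items that carry a given tag. The records and the name `prob_tag_given_user` below are hypothetical and are not taken from Figure 3.1.

```python
# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"), ("u1", "p2", "apple"), ("u1", "p3", "garden"),
    ("u2", "p4", "0403"),
]


def prob_tag_given_user(triples, tag, user):
    """Share of the items tagged by `user` that the user labelled with `tag`."""
    items_of_user = {p for u, p, _ in triples if u == user}
    if not items_of_user:
        return 0.0  # assumption: a user without tagged items gives probability 0
    items_with_tag = {p for u, p, t in triples if u == user and t == tag}
    return len(items_with_tag) / len(items_of_user)


p_apple_u1 = prob_tag_given_user(TRIPLES, "apple", "u1")
p_personal_u2 = prob_tag_given_user(TRIPLES, "0403", "u2")
```

A user who only ever uses one personal tag, like `u2` here, gets probability 1 for that tag, which is why the raw tag alone says little about the user's topics.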

3.3.2 The Relevance of Two Tags in Terms of an Individual User

As discussed in Section 3.3.1, the probability $\Pr(t_x \mid p_k)$ measures how important the tag $t_x$ is for representing the topics of the item $p_k$. For a given user $u_i$ and a tag $t_j$, the relevance strength of a tag $t_x$ to the tag $t_j$ for the user $u_i$ can be estimated based on the probabilities of $t_x$ being used to tag the items collected in the tag $t_j$ of the user $u_i$ (i.e., the probabilities $\Pr(t_x \mid p_k)$ for all the items $p_k$ in tag $t_j$ of user $u_i$), because those probabilities measure the possibilities that other users use $t_x$ to tag the items in $t_j$ of the user $u_i$.

The items collected in tag $t_j$ of user $u_i$ can be obtained by the (User×Tag)-Item Mapping. Let this item set be $P_{u_i,t_j} = \{p_1, \ldots, p_n\}$. We could use any of $\Pr(t_x \mid p_1)$, …, $\Pr(t_x \mid p_n)$ to estimate the relevance of tag $t_x$ to tag $t_j$ for user $u_i$. This thesis proposes to use the expectation of $\Pr(t_x \mid p_1)$, …, $\Pr(t_x \mid p_n)$ to estimate the relevance of tag $t_x$ to tag $t_j$. Assuming that $p_1$, …, $p_n$ are equally important to the user $u_i$ in calculating the relevance of tag $t_x$ to tag $t_j$, the expectation is the average value of $\Pr(t_x \mid p_1)$, …, $\Pr(t_x \mid p_n)$. The value of $w_{u_i,t_j}(t_x)$ that measures the relevance weight of a tag $t_x$ to a tag $t_j$ for user $u_i$ can be calculated as:

$$w_{u_i,t_j}(t_x) = \frac{1}{n} \sum_{k=1}^{n} \Pr(t_x \mid p_k) \tag{3.7}$$

$w_{u_i,t_j}(t_x)$ represents how strongly tag $t_x$ is related to tag $t_j$ with respect to user $u_i$, and $\sum_{t_x \in T} w_{u_i,t_j}(t_x) = 1$. Since different items may be collected with tag $t_j$ and tag $t_x$ by user $u_i$, the relevance measure is usually not symmetric (i.e., $w_{u_i,t_j}(t_x) \neq w_{u_i,t_x}(t_j)$).

In summary, let be a tag used by user , the folksonomy based representation of tag consists of a set of related tags, and their corresponding weights that measure how strong these tags are related to the tag . Since the differences of individual vocabularies are considered and the meanings or related topics of each tag are obtained, the problems of tag synonyms, tag semantic ambiguity, and spelling variations can be effectively solved.
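The tag representation summarised above can be sketched as follows: for one user and one of the user's tags, average over the collected items the probability that each candidate tag is used on the item. The records and function names are hypothetical, and the data is deliberately small and is not that of Figure 3.1.

```python
# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"), ("u1", "p2", "apple"),
    ("u2", "p1", "garden"), ("u2", "p2", "garden"),
    ("u3", "p3", "apple"), ("u3", "p4", "apple"),
    ("u4", "p3", "globalization"), ("u4", "p4", "internet"),
]


def prob_tag_given_item(triples, tag, item):
    """Share of the users who tagged `item` that used `tag` on it."""
    users_of_item = {u for u, p, _ in triples if p == item}
    users_with_tag = {u for u, p, t in triples if p == item and t == tag}
    return len(users_with_tag) / len(users_of_item) if users_of_item else 0.0


def tag_representation(triples, user, tag):
    """Relevance weight of every tag to `tag` for `user` (average over collected items)."""
    collected = {p for u, p, t in triples if u == user and t == tag}
    all_tags = sorted({t for _, _, t in triples})
    if not collected:
        return {t_x: 0.0 for t_x in all_tags}
    return {
        t_x: sum(prob_tag_given_item(triples, t_x, p) for p in collected) / len(collected)
        for t_x in all_tags
    }


rep_u1 = tag_representation(TRIPLES, "u1", "apple")
rep_u3 = tag_representation(TRIPLES, "u3", "apple")
```

In this toy data the same tag "apple" is represented by "garden" for `u1` and by "globalization" and "internet" for `u3`, which is the kind of per-user disambiguation the section describes.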


[Figure: the tag "apple" linked, for two users, to the tags "garden", "apple", "globalization", "internet" and "0403" with the relevance weights listed in Example 3.2]

Figure 3.2 Tag Representation-Folksonomy

[Example 3.2] (Tag Representation-Folksonomy) Figure 3.2 shows an example of folksonomy based tag representations of tag "apple" for user and user . The relevance weight of any two tags in terms of each individual user can be calculated with Equation 3.7.

For example, the relevance weight of tag "apple" and tag "garden" in terms of user can be calculated as follows. Since collected item and with tag ,

. Based on the tagging graph Figure 3.1, the following

conditional probabilities can be obtained: =0.5, =0.5,

=1.0, . The relevance weight of tag and tag in terms of

user can be calculated as

.

The relevance weight of tag "apple" and tag "garden" in terms of user

can be calculated as follows. Since collected item and with tag ,


. From Figure 3.1, =0.33, =0.67,

= =0.33, =0.33, =0.33. Since =0

and =0,

.

Similarly, the relevance weight of tag "apple" and tag "globalization" in

terms of user is =0. While the

relevance of tag and tag in terms of user is

. With the same calculation process, the

relevance of any two tags in terms of each individual user can be calculated. The

folksonomy based representation of tag for user is ( , )= {( , 0.25), ( ,

0.75), ( , 0.0), ( , 0.0), ( , 0.0)}, while the representation of for user is ( ,

)= {( , 0.0), ( , 0.16), ( , 0.5), ( , 0.34), ( , 0.0)}.

Therefore, for user , the tag "apple" is mainly related to the tag "garden". For user , the tag "apple" is related to the tags "globalization" and "internet". It is more likely that "apple" means a kind of fruit for user but a kind of computer product for user . Since the personalised semantic meanings of the tag are generated for different users, the semantic ambiguity of this tag can be reduced. Similarly, we can acquire the related tags of each personal tag (e.g., "0403"). Moreover, it is easy to find tag synonyms by comparing their tag representations.

Also, we can see that after representing a tag with a set of related tags, the relevance weight of a tag to itself is reduced. For example, in Figure 3.2, after the representation, the relevance weight of the tag "apple" to itself for user is reduced from the original value 1 to 0.16, and the remainder is distributed to the other related tags "globalization" and "internet".

Based on the tag representations, we can find more accurate item topic descriptions/classifications and users' topic preferences, which will be discussed in Section 3.4 and Section 3.5 respectively. The algorithm of folksonomy based tag representation is shown in Table 3.1.

Table 3.1 The Algorithm of Folksonomy Based Tag Representation

Algorithm 3.1. TR-Folksonomy ( , )

Input: is a given user, is a given tag

Output: Folksonomy based tag representation

Method:

Begin

1: ← {} //initialisation

2: for each tag {

← 0 //initialisation

}//end for

3: ← // Get the items collected in tag by user .

4: for each tag {

sum← 0

for each item {

sum← sum+

}//end for


← sum

}//end for

5: for each tag {

}//end for

6: Return

End

Let denote the total number of users in a tagging community, , denote the total number of items, , denote the total number of tags, ,

denote the total number of collected items in a tag by a user, , the computation complexity of Algorithm 3.1 is .

However, if , then . Therefore, the efficiency of this algorithm can be improved if we only calculate the relevance of those tags that have

. Let denote the number of related tags of those items in tag collected by user , , the improved computation complexity of

Algorithm 3.1 is . Usually, , .
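The efficiency improvement described above, visiting only the tags that actually occur on the collected items instead of looping over the whole tag vocabulary, might be sketched as follows. The index layout and the name `tag_representation_sparse` are assumptions made for illustration; the thesis states only the idea and the resulting complexity.

```python
from collections import defaultdict

# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"), ("u1", "p2", "apple"),
    ("u2", "p1", "garden"), ("u2", "p2", "garden"),
    ("u3", "p3", "apple"), ("u3", "p4", "apple"),
    ("u4", "p3", "globalization"), ("u4", "p4", "internet"),
]


def tag_representation_sparse(triples, user, tag):
    """Return only the non-zero relevance weights of tags to `tag` for `user`."""
    tag_users = defaultdict(lambda: defaultdict(set))  # item -> tag -> users
    collected = set()                                   # items under `tag` for `user`
    for u, p, t in triples:
        tag_users[p][t].add(u)
        if u == user and t == tag:
            collected.add(p)
    weights = defaultdict(float)
    for p in collected:
        n_users = len(set().union(*tag_users[p].values()))  # users who tagged p
        for t_x, users in tag_users[p].items():             # only tags present on p
            weights[t_x] += len(users) / n_users
    return {t_x: w / len(collected) for t_x, w in weights.items()}


sparse_u3 = tag_representation_sparse(TRIPLES, "u3", "apple")
```

Tags that never appear on the collected items are simply absent from the result, so they can be treated as having weight 0 without being computed.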

3.4 ITEM REPRESENTATION BASED ON FOLKSONOMY

Reflecting the wisdom of crowds, folksonomy has become a popular information source to classify and describe items, in addition to the traditional content or taxonomy/ontology information given by experts. However, since the tagging of items follows a power law distribution (Heymann et al., 2008), a large number of items are described by a very small number of tags. The resulting content descriptions are very short and incomplete, which makes content matching or filtering based on tags difficult (Rendle et al., 2009). Moreover, tag quality problems such as personal tags, tag synonyms and semantic ambiguity also cause inaccurate item descriptions, improper item neighbourhood formation and mismatches in content matching. Therefore, it is necessary to remove the noise of tags and find the relevant tags of each item in order to describe the topics of items accurately. The process of determining the related tags of each item and representing each item with a set of relevant tags is called folksonomy based item representation. It is defined below.

[Definition 3.2] (Item Representations-Folksonomy): represents the relevance

of each item to each tag . Let denote the weight of how much the item

is relevant to the tag , the relationship between an item and a set of tags can be

defined as the mapping , such that

. is called the folksonomy based representation of item .

How to calculate the relevance weight $w_{p_k}(t_x)$ of an item $p_k$ to a tag $t_x$ is the major focus of this subsection. The relevance weight $w_{u_i,t_j}(t_x)$ proposed in Section 3.3.2 estimates the relevance of a tag $t_x$ to a tag $t_j$ with respect to a user $u_i$. According to Assumption 1, the items collected in tag $t_j$ must have something in common; otherwise the user would not have put them together in the same tag. The related tag $t_x$ should reflect some topics of the items collected in tag $t_j$. Therefore, if an item $p_k$ is collected by user $u_i$ under a tag $t_j$, we can use the relevance weight of tag $t_x$ to tag $t_j$ to estimate the relevance of tag $t_x$ to the item $p_k$.

For a given item $p_k$, the total number of times that the item has been tagged by users is the total number of user-tag pairs of item $p_k$, denoted as $M$. The pairs are obtained by the Item-(User×Tag) Mapping defined in Section 3.2. That means there are $M$ values $w_{u_i,t_j}(t_x)$, one for each user-tag pair $(u_i, t_j)$ of the item, to estimate the relevance of $t_x$ to the item $p_k$. It is assumed that all the values are equally important in estimating the relevance of tag $t_x$ to item $p_k$. The average of the values of the item's user-tag pairs is used to estimate the relevance of tag $t_x$ to item $p_k$.

However, if a tag is widely used to describe items, it is not a distinctive tag to represent this item. Similar to the idf weighting approach in text mining, we should also take the popularity of a tag over all items into consideration to measure the general importance of a tag to a specific item. Let $t_x$ be a tag and $|P|$ be the total number of items; $iif(t_x)$ is defined as the inverse item frequency of tag $t_x$. Usually, $iif(t_x) = \log(|P| / n_{t_x})$, where $n_{t_x}$ is the number of items that have been described by tag $t_x$, and the value of $n_{t_x}$ is calculated after the tag expansion for the whole item set $P$. To get a value between 0 and 1 to facilitate comparison, in this thesis $iif(t_x)$ is set equal to $1 / \ln(e + n_{t_x})$, where $e$ is an irrational constant approximately equal to 2.72 and $0 < iif(t_x) \le 1$. By taking the inverse item frequency of each tag into consideration, the weight $w_{p_k}(t_x)$ that measures the relevance of item $p_k$ to tag $t_x$ can be calculated with the following equation, where the sum runs over the $M$ user-tag pairs $(u_i, t_j)$ of item $p_k$.

$$w_{p_k}(t_x) = iif(t_x) \cdot \frac{1}{M} \sum_{(u_i, t_j)} w_{u_i,t_j}(t_x) \tag{3.8}$$
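The item weighting just described can be sketched in two passes: first average the tag representations of each item's user-tag pairs, then scale every tag by an inverse item frequency counted on the expanded representations. This is a sketch under stated assumptions rather than the thesis implementation: the data and function names are hypothetical, and the inverse item frequency is taken to be 1/ln(e + n), with n the number of items that have a non-zero weight for the tag after expansion.

```python
import math
from collections import defaultdict

# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"), ("u1", "p2", "apple"),
    ("u2", "p1", "garden"), ("u2", "p2", "garden"),
    ("u3", "p3", "apple"), ("u3", "p4", "apple"),
    ("u4", "p3", "globalization"), ("u4", "p4", "internet"),
]


def tag_representation(triples, user, tag):
    """Non-zero relevance weights of tags to `tag` for `user` (Section 3.3.2)."""
    tag_users = defaultdict(lambda: defaultdict(set))  # item -> tag -> users
    collected = set()
    for u, p, t in triples:
        tag_users[p][t].add(u)
        if u == user and t == tag:
            collected.add(p)
    weights = defaultdict(float)
    for p in collected:
        n_users = len(set().union(*tag_users[p].values()))
        for t_x, users in tag_users[p].items():
            weights[t_x] += len(users) / n_users / len(collected)
    return dict(weights)


def item_representations(triples):
    """Weight of every item to every related tag, scaled by inverse item frequency."""
    pairs_of_item = defaultdict(set)  # Item-(User x Tag) Mapping
    for u, p, t in triples:
        pairs_of_item[p].add((u, t))
    averaged = {}
    for p, pairs in pairs_of_item.items():
        acc = defaultdict(float)
        for u, t_j in pairs:  # average over the M user-tag pairs of the item
            for t_x, w in tag_representation(triples, u, t_j).items():
                acc[t_x] += w / len(pairs)
        averaged[p] = acc
    n_items = defaultdict(int)  # items per tag, counted after the expansion
    for acc in averaged.values():
        for t_x, w in acc.items():
            if w > 0:
                n_items[t_x] += 1
    return {p: {t_x: w / math.log(math.e + n_items[t_x]) for t_x, w in acc.items()}
            for p, acc in averaged.items()}


item_reps = item_representations(TRIPLES)
```

In this toy data item `p3` gains a small weight for "internet", a tag nobody assigned to it directly, because user `u3` grouped it with `p4` under the same tag.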


Since the mapping can be viewed as a vector

for tags < >, each item can be described by a

|T|-sized vector . The values in the vector can be calculated by Equation 3.8.

[Figure: an item linked to the tags "garden", "apple", "globalization", "internet" and "0403" with the relevance weights listed in Example 3.3]

Figure 3.3 Item Representation-Folksonomy

[Example 3.3] (Item Representation-Folksonomy) Figure 3.3 shows an example of folksonomy based item representations of item The weight value of how much an

item is relevant to a tag can be calculated with Equation 3.8.

Based on the tagging graph shown in Figure 3.1,

. The relevance weight of item to tag

is shown as follows: =

) . Based on Equation 3.7 discussed in Section 3.3,

=0, = , and =0. After the representations of all items, not

only and but also and are relevant to . Thus, and


. Therefore, = =0.028. With the same calculation

process, the relevance of item to other tags , , , and can be obtained. The

folksonomy based item representation of is: { , 0.0), ( , 0.059), ( ,

0.31), ( , 0.077), ( , 0.028)}.

With the item representation, item is relevant to tags , , and , and tag "globalization" is still a major topic of while is slightly relevant to tag . For an item , not only the tags that are directly assigned to it by users, but also the tags that are derived from the similar items in the opinion of the users of this item, are used to describe the topics of item . Thus, the relevant tags can not only remove the noise of tags, but also help to find similar items. The algorithm of folksonomy based item representation is shown in Table 3.2.

Table 3.2 The Algorithm of Folksonomy Based Item Representation

Algorithm 3.2. IR-Folksonomy ( )

Input: is a given item

Output: Folksonomy based item representation

Method:

Begin

1: ← {} //initialisation

2: for each tag {

← 0 //initialisation

}//end for

3: ← // Get the user-tag pairs of item


4: for each tag {

sum← 0, ← 0

for each user tag pair {

← +1

←TR-Folksonomy ( ) // Algorithm 3.1 in Table 3.1

// =

sum← sum+

}//end for

← sum

}//end for

5: for each tag {

Get

}//end for

6: for each tag {

}//end for

7: Return

End

The computation complexity of Algorithm 3.2 is , where is the total

number of tags, , is the number of user-tag pairs, . However, if

, then . Thus, the efficiency of this algorithm can be improved if we only calculate the relevance weights of those tags


that have non-zero values. Let =| |, the improved

computation complexity of Algorithm 3.2 is . Usually, .

3.5 USER PROFILE GENERATION BASED ON FOLKSONOMY

A user profile describes a user's interests and preferences. Users' preferences for items, also called item preferences, supply important information to profile users for recommender systems. Usually, explicit ratings are used to express the strength of users' preferences for items. In typical tagging communities, there is no explicit rating information available. Tagging can be regarded as a kind of implicit rating behaviour (Sen et al., 2006). In this thesis, the binary implicit ratings derived from the User-Item Mapping are used to profile each user's item preferences.

Besides item preferences, users' interests in or preferences for the topics of items, also called topic preferences, are another important kind of information to profile users (Adomavicius & Tuzhilin, 2005). As mentioned before, the tag quality problem causes inaccurate user profiling if only the original tags that are directly given by a user are used to profile this user's topic preferences. One of the purposes of user profiling is to be able to recommend items to a user that have never been seen or tagged by this user. Therefore, if we can estimate a user's preferences for other tags that are used by other users, we will be able to estimate the interest of this user in those items that have not been tagged by this user, but may have been collected under different tags by other users.

For example, in Figure 3.1, "0403" is a tag used by user . Item was not collected by user . The tag "0403" has not been assigned to item and user does not have the tags "apple" and "globalization" that have been assigned to item . If we know how much the user is interested in the tags "apple" and "globalization", which were assigned to item by other users, we will be able to estimate whether user will be interested in the item . Therefore, we need not only to profile users' interests in their own tags, but also to estimate users' interests in other tags. The process of finding the tags that a user is interested in is called folksonomy based user representation. It is defined below.

[Definition 3.3] (User Representation-Folksonomy) represents each user

‘s preferences to each tag . Let denote the weight of how much the user is interested in the tag , the relationship between a user and a set of tags can be

defined as the mapping , such that

. is called the folksonomy based representation of user .

For each tag $t_x$, if we know how much the user $u_i$ is interested in a tag $t_j$ and the relevance weight of $t_x$ to $t_j$ for the user $u_i$, then we can estimate how much user $u_i$ will be interested in tag $t_x$. Based on the User-Tag Mapping, the probability $\Pr(t_j \mid u_i)$ of user $u_i$ using tag $t_j$, given user $u_i$, can be calculated with Equation 3.6. With Equation 3.7, the relevance weight $w_{u_i,t_j}(t_x)$ of tag $t_x$ to tag $t_j$ with respect to user $u_i$ can be measured. Therefore, the product of the two weights can be used to estimate how much user $u_i$ will be interested in the tag $t_x$.

Similar to item representation, the occurrence of a tag (i.e., tag popularity) over all users should be taken into consideration to measure the general importance of a tag in the identification of the topic preferences of a user. Let $t_x$ be a tag; $iuf(t_x)$ is defined as the inverse user frequency of tag $t_x$. Similar to $iif(t_x)$, $iuf(t_x) = 1 / \ln(e + m_{t_x})$, where $m_{t_x}$ is the number of users of tag $t_x$, counted after the user representation. By taking the inverse user frequency of a tag into consideration, the value of $w_{u_i}(t_x)$ that measures the topic preference of user $u_i$ for tag $t_x$ can be calculated with the equation below, where $T_{u_i}$ is the set of tags used by user $u_i$.

$$w_{u_i}(t_x) = iuf(t_x) \cdot \sum_{t_j \in T_{u_i}} \Pr(t_j \mid u_i) \, w_{u_i,t_j}(t_x) \tag{3.9}$$
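The user weighting can be sketched in the same two-pass style as the item weighting: accumulate, over the user's own tags, the product of the tag usage probability and the tag relevance weight, then scale by an inverse user frequency. The data and names are hypothetical, and the inverse user frequency is assumed to be 1/ln(e + m), with m the number of users that have a non-zero preference for the tag after representation.

```python
import math
from collections import defaultdict

# Hypothetical tagging records (user, item, tag); this is not the data of Figure 3.1.
TRIPLES = [
    ("u1", "p1", "apple"), ("u1", "p2", "apple"),
    ("u2", "p1", "garden"), ("u2", "p2", "garden"),
    ("u3", "p3", "0403"), ("u3", "p4", "0403"),
    ("u4", "p3", "globalization"), ("u4", "p4", "internet"),
]


def tag_representation(triples, user, tag):
    """Non-zero relevance weights of tags to `tag` for `user` (Section 3.3.2)."""
    tag_users = defaultdict(lambda: defaultdict(set))  # item -> tag -> users
    collected = set()
    for u, p, t in triples:
        tag_users[p][t].add(u)
        if u == user and t == tag:
            collected.add(p)
    weights = defaultdict(float)
    for p in collected:
        n_users = len(set().union(*tag_users[p].values()))
        for t_x, users in tag_users[p].items():
            weights[t_x] += len(users) / n_users / len(collected)
    return dict(weights)


def user_representations(triples):
    """Topic preference of every user for every related tag."""
    items_of_user = defaultdict(set)
    items_of_user_tag = defaultdict(set)
    for u, p, t in triples:
        items_of_user[u].add(p)
        items_of_user_tag[(u, t)].add(p)
    raw = {u: defaultdict(float) for u in items_of_user}
    for (u, t_j), items in items_of_user_tag.items():
        p_tag_given_user = len(items) / len(items_of_user[u])
        for t_x, w in tag_representation(triples, u, t_j).items():
            raw[u][t_x] += p_tag_given_user * w
    n_users = defaultdict(int)  # users per tag, counted after the representation
    for acc in raw.values():
        for t_x, w in acc.items():
            if w > 0:
                n_users[t_x] += 1
    return {u: {t_x: w / math.log(math.e + n_users[t_x]) for t_x, w in acc.items()}
            for u, acc in raw.items()}


user_reps = user_representations(TRIPLES)
```

User `u3` only ever used the personal tag "0403", yet the representation assigns this user non-zero preferences for "globalization" and "internet", the tags other users gave to the same items.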

The mapping can be viewed as a vector:

for tags < >. Thus, each user can be profiled by two vectors: and

, where is a |P|-sized binary item vector representing ‘s item

preferences and is a |T|-sized tag value vector. If user has tagged item (i.e.,

), then the value of this item in vector is 1, otherwise is 0. The values of

vector can be calculated by Equation 3.9.
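The two-part profile described above, a binary item vector plus a weighted tag vector, could be laid out as follows. The item and tag orderings, the weights and the name `profile_vectors` are all hypothetical; in practice the tag weights would come from Equation 3.9.

```python
def profile_vectors(tagged_items, tag_weights, all_items, all_tags):
    """Return the binary item vector and the tag weight vector of one user."""
    item_vector = [1 if p in tagged_items else 0 for p in all_items]
    tag_vector = [tag_weights.get(t, 0.0) for t in all_tags]
    return item_vector, tag_vector


# Hypothetical vocabulary and one hypothetical user.
ALL_ITEMS = ["p1", "p2", "p3", "p4"]
ALL_TAGS = ["garden", "apple", "globalization", "internet", "0403"]
item_vec, tag_vec = profile_vectors(
    tagged_items={"p3", "p4"},
    tag_weights={"globalization": 0.1, "internet": 0.1, "0403": 0.3},  # illustrative values
    all_items=ALL_ITEMS,
    all_tags=ALL_TAGS,
)
```

Keeping both vectors in a fixed order makes profiles of different users directly comparable, for example with cosine similarity.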

[Figure: a user linked to the tags "garden", "apple", "globalization", "internet" and "0403" with the preference weights discussed in Example 3.4]

Figure 3.4 User Representation-Folksonomy


[Example 3.4] (User Representation-Folksonomy) Figure 3.4 shows an example of folksonomy based user representations of user . The weight value of a user's preference for a tag can be calculated based on Equation 3.9.

For example, the calculation of user 's preference for the tag "internet" is shown as follows.

. Based on the tagging graph in Figure 3.1, =1. Moreover,

based on Equation 3.7, = . After the representation of each user, not only

user has preference on tag , but also user and user have preferences on .

Therefore, , . Thus, =0.14.

With the same calculation process, the weight values of user 's preferences for the other tags , , , and can be measured. The folksonomy based user representation of user is: ={( , 0.14), ( , 0.14), ( , 0.285)}. Although user used a personal tag "0403" to represent the topic preferences, this user is actually interested in the tag "globalization" and the tag "internet". The algorithm of folksonomy based user representation is shown below.

Table 3.3 The Algorithm of Folksonomy Based User Representation

Algorithm 3.3. UR-Folksonomy ( )

Input: is a given user

Output: Folksonomy based user representation

Method:

Begin

1: ← {} //initialisation


2: for each tag {

←0 //initialisation

}//end for

3: for each tag {

sum ← 0

for each tag {

← TR-Folksonomy ( ) // Algorithm 3.1 in Table 3.1

// =

sum←sum+

}//end for

← sum

}//end for

4: for each tag {

get

}//end for

5: for each tag {

}//end for

6: Return

End

The computation complexity of Algorithm 3.3 is , where is the total

number of tags, . However, if , then . Thus, the


efficiency of this algorithm can be improved if we only calculate the weights of those tags

that have or . Let denote the number of tags that user

has tagged, , the improved computation complexity of Algorithm 3.3 is

, where =| |. Usually, , .

3.6 A FRAMEWORK OF USER PROFILING BASED ON FOLKSONOMY

Figure 3.5 shows a framework of user profiling based on folksonomy. It includes Tag Representation-Folksonomy, Item Representation-Folksonomy, and User Representation-Folksonomy. The input of this framework is users' folksonomy/tagging information. The outputs include folksonomy based item representations and user profiles. Each user profile contains two parts: implicit item preferences and tag based topic preferences.

[Figure: Folksonomy as input; Tag Representation-Folksonomy, Item Representation-Folksonomy and User Representation-Folksonomy as components of User Profiling-Folksonomy; folksonomy based item representations and folksonomy based user profiles as outputs]

Figure 3.5 A Framework of User Profiling Based on Folksonomy


3.7 CHAPTER SUMMARY

This chapter discussed how to profile users based on folksonomy information. Folksonomy information reflects users' viewpoints on item classifications and descriptions. It contains rich personal information and forms multiple relationships among users, items and tags. The multiple relationships of tagging, including three two-dimensional relationships and one three-dimensional relationship, were modelled in this chapter. Targeting the tag quality problem, a tag representation approach that represents each tag with a set of related tags for each individual user was proposed. Moreover, based on the tag representation, a set of relevant tags was determined to describe the topics of each item as well as the topic interests of each user. This chapter also proposed tag weighting approaches based on the multi-dimensional relationships among users, items and tags, and on the popularity of tags. The user profiles and item descriptions represented by users' tag vocabulary were generated in this chapter. The utilisation of the generated folksonomy based user profiles in personalised recommender systems will be discussed in Chapter 5.


Chapter 4

User Profiling Based on Taxonomy

Folksonomy information is widely used to profile users and describe the topics of items. Folksonomy implies users' explicit topic preferences, but contains a lot of noise such as semantic ambiguity, tag synonyms and personal tags. To solve the tag quality problem, Chapter 3 proposed an approach to find the related tags to represent the personalised semantic meanings of each tag.

On the other hand, each item can be associated with a set of standard taxonomic topics given by experts. Item taxonomy is a set of controlled standard vocabulary terms or topics designed to describe or classify items. It reflects experts' opinions on item classifications and descriptions; however, it does not reflect any individual user's personal viewpoints or preferences. Compared with folksonomy information, taxonomy has the advantages of having a standard and well controlled vocabulary, containing explicitly structured relationships among concepts, being authorised and well recognised as common knowledge, and being independent of user communities.

Item taxonomy/ontology is widely available for various domains. Some typical item taxonomies include the product classification taxonomy of Amazon.com (http://www.amazon.com), ACM Computing Reviews (http://www.reviews.com), Google Directory (http://directory.google.com), and Yahoo Directory (http://www.yahoo.com). The Library of Congress Subject Headings (http://www.loc.gov) and WordNet (http://wordnet.princeton.edu) are widely used world knowledge ontologies.

This chapter proposes approaches that use standard taxonomy information to reduce the noise of tags caused by the free-form vocabularies of folksonomy. Instead of using the tag vocabularies, the proposed approaches use the standard taxonomy vocabulary to find and represent the semantic meanings of each tag, the relevant topics of each item and the topic interests of each user. Moreover, to answer the question of whether item folksonomy and taxonomy can be integrated to profile users' interests more accurately and benefit from both, a hybrid user profiling approach based on both folksonomy and taxonomy information is discussed in this chapter.

4.1 NOTATIONS

Section 3.1 defined some important concepts and entities involved in this thesis.

Formal definitions of some other concepts and entities relating to item taxonomy are listed below:

• Item Taxonomic Topics: is a set of topics or categories given by experts to describe or classify items.

• Item Taxonomy: is a set of item taxonomic topics and is a set of relations between any and . If =∅, then is only a set of taxonomic topics and no relationships are considered. In this work, only the typical hierarchical relationship is used, which contains the "is-a" relationship. This thesis uses "sub topic of" to represent the "is-a" relationship. is defined as , where is a "sub topic of" relationship. For any two taxonomic topics , if , then is a sub topic of . The taxonomic topic expresses general or broader concepts, whereas expresses specific and narrow concepts. The taxonomy tree has exactly one root topic that represents the most general topic. The leaf topics that do not have any direct sub topics represent the most specific topics.

• Item Taxonomic Descriptors: In order to describe and classify items, every item is associated with a set of item taxonomic descriptors . A taxonomic descriptor is a sequence of ordered taxonomic topics, denoted by where , , , is the root topic, is a leaf topic with no sub topics, and . An item can have multiple descriptors because the item might possess a broad range of concepts. Strictly categorising the item under one single concept might be imprecise.

Figure 4.1 shows an example of an item taxonomy tree. The taxonomy tree has nine taxonomic topics and . Within the item taxonomy depicted in the figure, "book" is the root topic covering the broadest concept while "programming" and "flowers" are two of the leaf topics expressing the most specific concepts. There are six unique item taxonomic topic descriptors in the taxonomy tree: , , , , , .


[Example 4.1] Suppose that the items in Figure 3.1 are described by the taxonomic topics shown in Figure 4.1. The taxonomic topic descriptors of the items ,

, , and in Figure 3.1 are defined as: , ,

, , . For example, item is associated with

taxonomic topic descriptors and , , where ,

. Then, the item is described by taxonomic topics {"book", "computers", "programming"} and {"book", "computers", "networks"}.

[Figure 4.1: taxonomy tree rooted at "book"; topics shown include "garden", "computers", "romance", "flowers", "fruit", "programming" and "networks"; legend: a taxonomic topic]

Figure 4.1 An Example of Item Taxonomy

4.2 TAXONOMY BASED USER PROFILING

Folksonomy, or tags, are free-form and lack standardisation. As uncontrolled vocabularies, tags suffer from many difficulties such as ambiguity in the meaning of and differences between terms, a proliferation of synonyms, varying levels of specificity, lack of guidance on syntax, and slight variations of spelling and phrasing (Li et al., 2008). The tag quality problem causes inaccurate user profiling.

Chapter 3 proposed approaches that use tag vocabulary and users' viewpoints on item classifications and descriptions to find and represent the personalised semantic meanings of each tag, the relevant topics of each item, and the topic preferences of each user. This section will use the standard taxonomic topics and experts' viewpoints to represent each tag, each item and each user.

As discussed in Section 3.2, the User-Tag-Item relationship records the personal tagging information of each individual user. Based on the (User×Tag)-Item Mapping

, labelled with tag , a set of items are collected and grouped together according to the user ‘s viewpoint. Based on Assumption 1, these items are similar and closely related in the viewpoint of user . Therefore, the taxonomic topics of these items can be used to estimate the related topics or the semantic meanings of the tag in terms of user .

Before discussing how to measure the relevance strength of a tag to a taxonomic topic with respect to a user, the measurement of the relevance of an item to a taxonomic topic is discussed first.

4.2.1 Item Representation Based on Taxonomy

Each item can be described with a set of taxonomic topics given by experts. More specifically, each item is associated with a set of item taxonomic descriptors . An item taxonomic descriptor is an ordered sequence of taxonomic topics. It is a full path from the root topic node to a leaf topic node. With the "sub topic of" relationship, the taxonomic topics are structured hierarchically. The hierarchical structure of the taxonomy provides important information for finding the relevance of an item to a taxonomic topic. The process of determining the related taxonomic topics of each item and representing each item with taxonomic topics is called taxonomy based item representation. It is defined below.

[Definition 4.1] (Item Representation-Taxonomy): represents the relevance of

each item ‘s to each taxonomic topic . Let denote the weight of how much the item is relevant to the taxonomic topic , the relationship between an item and a set of taxonomic topics can be defined as the mapping , such that

. is called the taxonomy based item representation.

The key part of generating item representations is to calculate the value of that measures the relevance weight of a taxonomic topic and an item . A commonly used approach is to use the frequency of each taxonomic topic in item taxonomic descriptors to measure the weight of a taxonomic topic. However, taxonomic topic nodes at higher levels of the taxonomy tree, which reflect general concepts, usually appear more frequently in item taxonomic descriptors than those at lower levels, which reflect specific concepts.

Therefore, the structural information of the whole taxonomy tree should be taken into consideration when calculating the weight value of a taxonomic topic. The following factors should be considered to determine the weight of a taxonomic topic for the representation of an item :

 The frequency of a taxonomic topic.

If taxonomic topic appears more frequently in the descriptors of item

than other taxonomic topics, then should have a higher weight value than the

taxonomic topics which occur less frequently in .

 The concept coverage of a taxonomic topic.

The taxonomic topics that express more specific concepts are more useful for

identifying the feature of an item. If taxonomic topic expresses specific

concepts compared to other taxonomic topics, then should have a higher

weight value than taxonomic topics expressing general concepts.

 The structural information between one taxonomic topic and another.

Based on the specified direct "sub topic of" relationship among taxonomic topics, the inferred relationship among taxonomic topic nodes includes "child", "parent", "sibling", "grandparent", "grandchild", "ancestor" and others. The number of

children, the number of siblings and the number of ancestors of in the

taxonomy tree can affect the importance of taxonomic topic for the feature

representation of item (Weng, 2009).

By taking these three factors into consideration, Ziegler et al. (2004) proposed to decay the weight of a taxonomic topic node based on the number of children of a taxonomic topic node in the item taxonomy tree and the length of an item taxonomic descriptor. Inspired by the approach of Ziegler et al. (2004), this thesis proposes an approach that takes the structural information of item taxonomy into consideration when calculating the weight of a taxonomic topic for an item. The weight computation is conducted in a bottom-up manner. It is discussed below.

Let be an item taxonomic descriptor of item , denote the weight of taxonomic topic for item taxonomic descriptor . Suppose the item descriptor and .

 The calculation of non-leaf taxonomic topics.

For the non-leaf taxonomic topic in the example descriptor given

above, can be calculated as:

(4.1)

Where taxonomic topic is the parent node of taxonomic topic in

item descriptor , is the number of child nodes of taxonomic topic .

If is not a taxonomic topic in item taxonomic descriptor , then

.

 The calculation of leaf taxonomic topics.

The total weight of all topics in one item taxonomic descriptor can be set to a

positive number (Ziegler et al., 2004). To facilitate comparison, the total weight of

all the topics in is set to . Let x be the weight of the leaf node of the

example descriptor , the following equation can be obtained:

1

(4.2)

After solving Equation 4.2, the value of (i.e., ) can be calculated.

Based on the leaf node weight and Equation 4.1, the weight value of each non- leaf topic in can be calculated.
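The bottom-up computation can be sketched in Python as follows. This is a sketch under one reading of Equations 4.1 and 4.2, namely that each parent receives the weight of its child in the descriptor divided by the parent's number of child nodes, and that the weights within one descriptor sum to 1. The function name `descriptor_weights` and the child counts passed in are illustrative assumptions.

```python
def descriptor_weights(descriptor, n_children, total=1.0):
    """Weights of the topics in one root-to-leaf descriptor.

    descriptor: topics ordered from root to leaf.
    n_children: number of child nodes of each non-leaf topic in the tree.
    Assumed rule: weight(parent) = weight(child) / n_children[parent],
    with all weights in the descriptor summing to `total`.
    """
    # Express every weight as a multiple of the unknown leaf weight x.
    factors = [1.0]                           # the leaf itself
    for topic in reversed(descriptor[:-1]):   # walk from the leaf's parent up to the root
        factors.append(factors[-1] / n_children[topic])
    x = total / sum(factors)                  # solve x * sum(factors) = total
    weights = [f * x for f in reversed(factors)]
    return dict(zip(descriptor, weights))

# Child counts assumed for the example tree of Figure 4.1.
n_children = {"book": 3, "computers": 3}
w = descriptor_weights(("book", "computers", "programming"), n_children)
```

With three child nodes assumed for both "book" and "computers", the leaf "programming" receives 9/13, "computers" 3/13 and "book" 1/13, so the leaf carries the largest share of the descriptor's weight.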

With Equation 4.1 and Equation 4.2, leaf nodes receive higher weight values than non-leaf nodes. However, if a taxonomic topic is widely used to describe items, it is not a distinctive topic for representing an item. As with folksonomy based item representation, the popularity of a taxonomic topic across all items should be considered. Let denote the inverse item frequency of taxonomic topic ,

, where is the number of items that have been described with in the item set , .

Assuming each descriptor is equally important for the topic classification and description of item , this thesis uses the average value of of all item

descriptors in to measure the overall relevance weight of item to taxonomic topic

. Let denote the number of descriptors of item , the relevance weight can be calculated as:

(4.3)

Since the mapping can be viewed as a vector:

for taxonomic topics , each item

can be described by a |C|-sized taxonomic topic vector . The values can be calculated by Equation 4.3.
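A minimal sketch of the averaging step, assuming Equation 4.3 multiplies the inverse item frequency of a topic by the mean of its descriptor weights over all descriptors of the item, where a topic absent from a descriptor contributes 0. The descriptor weights and the inverse item frequency are taken as given inputs; the function name `item_topic_weight` is hypothetical.

```python
def item_topic_weight(topic, descriptor_weight_maps, iif):
    """Relevance of an item to `topic` under the assumed reading of Eq. 4.3.

    descriptor_weight_maps: one {topic: weight} dict per descriptor of the item.
    iif: inverse item frequency of `topic`, supplied by the caller.
    """
    if not descriptor_weight_maps:
        return 0.0
    total = sum(m.get(topic, 0.0) for m in descriptor_weight_maps)
    return iif * total / len(descriptor_weight_maps)

# Two descriptors of one item; the weights follow the 1/13, 3/13, 9/13 pattern
# obtained when "book" and "computers" each have three child nodes (assumed).
d1 = {"book": 1 / 13, "computers": 3 / 13, "programming": 9 / 13}
d2 = {"book": 1 / 13, "computers": 3 / 13, "networks": 9 / 13}
weight = item_topic_weight("programming", [d1, d2], iif=0.57)
```

With these inputs the result rounds to 0.197, which agrees with the weight reported for this topic in Example 4.2.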

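[Figure 4.2: taxonomy based representation of an item, linking it to taxonomic topics with weights 0.035, 0.12, 0.197 and 0.26; legend topics: "book", "computers", "programming", "garden", "flowers", "networks", "fruit", "databases", "romance"]

Figure 4.2 Item Representation-Taxonomy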

[Example 4.2] (Item Representation-Taxonomy) Figure 4.2 shows an example of the taxonomy based item representations of . The weight that measures the relevance of an item to a taxonomic topic can be calculated with Equation 4.3.

For example, the relevance weight of item to taxonomic topic , is shown as follows. As defined in Example 4.1, there are two descriptors and for

item , .

. Shown in the taxonomy tree in Figure 4.1,

. If , then based on Equation 4.2, . After

solving this equation, . Since descriptor does not contain taxonomic topic ,

. As has been used to describe items , and ,

=0.57. Therefore, .

The taxonomy based item representation of is: {( , 0.035), ( , 0.12), ( , 0.197), ( , 0.26)}. Item is related to taxonomic topics "computers", "programming" and "networks". The algorithm of taxonomy based item representation is shown below.

Table 4.1 The Algorithm of Taxonomy Based Item Representation

Algorithm 4.1. IR-Taxonomy ( )

Input: is a given item

Output: taxonomy based item representation

Method:

Begin

1: ← {} //initialisation

2: for each taxonomic topic {

← 0 //initialisation

}//end for

3: for each taxonomic topic {

sum← 0

for each item descriptor {

sum←sum+

}//end for

← sum

}//end for

4: for each taxonomic topic {

Get

}//end for

5: for each taxonomic topic {

}//end for

6: Return

End

Let denote the total number of taxonomic topics, , denote the number of

descriptors of item , , the computation complexity of Algorithm 4.1 is

. However, for each item descriptor , if , then = 0.

The efficiency of this algorithm can be improved if we only calculate the weights of those taxonomic topics that are contained in the descriptors of this item. If we let indicate the number of taxonomic topics contained in the item descriptors of item , the improved computational complexity of Algorithm 4.1 is . Usually,

.

4.2.2 Tag Representation Based on Taxonomy

For a given user and a tag , the relevance strength of a taxonomic topic being related to a tag for the user can be estimated based on the relevance weight of

to the items collected in the tag of the user . Let denote the relevance weight

of a taxonomic topic to item , denote the set of items

collected in by user , we could use any of …, to estimate the relevance of

to for user . The process of finding the related taxonomic topics of each tag for each user is called taxonomy based tag representation. It is defined below.

[Definition 4.2] (Tag Representation-Taxonomy): represents the relevance of

each tag to each taxonomic topic with respect to a user . Let denote how strong is related to with respect to user , the relationship between a tag and a set of taxonomic topics with respect to a user can be defined as the mapping

, such that

. is called the taxonomy based representation of tag with respect to the user .

Assuming that ,…, are equally important to the user when calculating the relevance of to , this thesis uses the average value of ,…, to estimate the relevance of to . The value of that denotes the relevance weight of taxonomic topic to tag in terms of can be calculated as:

(4.4)
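The averaging described for Equation 4.4 can be sketched as below, assuming the taxonomy based item representations are already available as dictionaries. The function name `tag_topic_weight` is hypothetical, and the second item's weight is an illustrative value only.

```python
def tag_topic_weight(topic, item_representations):
    """Mean relevance of `topic` over the items a user collected under one tag.

    item_representations: one {topic: weight} dict per collected item.
    """
    if not item_representations:
        return 0.0
    total = sum(rep.get(topic, 0.0) for rep in item_representations)
    return total / len(item_representations)

# Two items collected under one tag: one relevant to "programming", one not.
items = [{"programming": 0.197}, {"flowers": 0.3}]
weight = tag_topic_weight("programming", items)
```

With item weights 0.197 and 0 the average is 0.0985, which is consistent with the 0.099 that appears in Example 4.3.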

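[Figure 4.3: taxonomy based representations of the tag "apple" for two users; one user's topic weights are 0.035, 0.147, 0.175 and 0.175, the other's are 0.035, 0.12, 0.099, 0.13 and 0.22; legend topics: "book", "computers", "programming", "garden", "flowers", "networks", "fruit", "databases", "romance"; legend tag: "apple"]

Figure 4.3 Tag Representation-Taxonomy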

[Example 4.3] (Tag Representation-Taxonomy) Figure 4.3 shows an example of the taxonomy based representations of tag ―apple‖ for user and user . The relevance weight of tag and taxonomic topic in terms of an individual user ,

denoted as , can be calculated with Equation 4.4.

For example, the relevance of tag and taxonomic topic in terms of user can be calculated as: . Based

on Equation 4.3, , 0. As a result, = = 0.099. The

relevance weight of tag and taxonomic topic in terms of user can be calculated

as: =0.

The taxonomy based tag representations of tag for user and are:

, 0.035), ( , 0.147), ( , 0.175), ( , 0.175)}, ,

0.035), ( , 0.12), ( , 0.099), ( , 0.13), ( , 0.22)}. For user , the tag "apple" is related to taxonomic topics "garden", "flowers" and "fruit", whereas for user , it is related to "computers", "programming", "networks" and "databases".

We can see that personalised semantic meanings of a tag can be generated for different users (e.g., has different representations for users and ) and the semantic ambiguity can be reduced. The related taxonomic topics of each personal tag (e.g., "0403") can also be determined. Furthermore, it is easy to find the tag synonyms by comparing their taxonomy based tag representations. The algorithm of taxonomy based tag representation is shown below.

Table 4.2 The Algorithm of Taxonomy Based Tag Representation

Algorithm 4.2. TR-Taxonomy ( , )

Input: is a given user, is a given tag

Output: taxonomy based tag representation

Method:

Begin

1: ← {} //initialisation

2: for each taxonomic topic {

←0 //initialisation

}//end for

3: ← // Get collected items

4: for each taxonomic topic {

sum← 0

for each item {

← IR-Taxonomy ( ) //Algorithm 4.1 in Table 4.1

//

sum← sum+

}//end for

← sum

}//end for

5: for each taxonomic topic {

}//end for

6: Return

End

The computation complexity of Algorithm 4.2 is , where is the total number of taxonomic topics, , and is the total number of collected items in a

tag by a user, . The computational complexity can be improved if we only calculate the weights of taxonomic topics that have . If indicates the number of taxonomic topics that are contained in the item descriptors of all the collected items in a tag by a user, the improved computational complexity of Algorithm 4.2 will be . Usually, , .

4.2.3 User Representation Based on Taxonomy

As discussed in Chapter 3, each user can be profiled with implicit item preferences derived from the User-Item Mapping and with folksonomy based topic preferences. Each item can be associated with a set of taxonomic topics. The taxonomic topics of the collected items and the relevance weights of tags to taxonomic topics can be used to estimate each user's preferences for taxonomic topics. The process of finding users' interests or preferences on taxonomic topics is called taxonomy based user representation. It is defined below.

[Definition 4.3] (User Representation-Taxonomy): represents each user

‘s preferences to each taxonomic topic . Let denote the weight of how much the user is interested in the taxonomic topic , the relationship between a user and a set of taxonomic topics can be defined as the mapping , such that

. is called the taxonomy based user representation.

For each taxonomic topic and a given tag , if we know how much a user is interested in tag and the relevance weight of taxonomic topic to tag for the user , then we can estimate how much the user will be interested in the taxonomic topic . As discussed in Section 3.3.1, based on the User-Tag Mapping , the probability of user using tag given user , can be calculated with Equation 3.6.

Using Equation 4.4, the relevance weight of taxonomic topic to tag with respect to user can be measured. Therefore, the product of the two weights can be used to estimate how much user will be interested in taxonomic topic .

Similar to the taxonomy based item representations, we should take the occurrence of a taxonomic topic for all users into consideration. Let denote the

inverse user frequency of topic , , where is the number of users that have implicit topic preferences on in the user set ,

. By taking the inverse user frequency of a taxonomic topic into

consideration, the value of that measures user ‘s topic preferences to taxonomic topic can be calculated with the equation below.

(4.5)

The mapping can be viewed as a vector: for taxonomic topics . Therefore, each user can be profiled by two

vectors: and , where is a |P|-sized binary item vector representing

‘s item preferences and is a |C|-sized taxonomic topic value vector. The values

of vector can be calculated by Equation 4.5.
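The user-level weight can be sketched in Python as below. The sketch assumes that Equation 4.5 sums, over the user's tags, the product of the tag probability and the tag-to-topic relevance weight, and then multiplies the sum by the inverse user frequency of the topic. The function name `user_topic_weight` is hypothetical, and the inverse user frequency is passed in rather than computed.

```python
def user_topic_weight(topic, tag_probs, tag_representations, iuf):
    """User preference for `topic` under the assumed reading of Eq. 4.5.

    tag_probs: {tag: probability that the user uses the tag}.
    tag_representations: {tag: {topic: weight}} for this user.
    iuf: inverse user frequency of `topic`, supplied by the caller.
    """
    total = 0.0
    for tag, prob in tag_probs.items():
        total += prob * tag_representations.get(tag, {}).get(topic, 0.0)
    return iuf * total

# A user with a single personal tag, using the values quoted in Example 4.4.
weight = user_topic_weight(
    "programming",
    tag_probs={"0403": 1.0},
    tag_representations={"0403": {"programming": 0.32}},
    iuf=0.57,
)
```

With a tag probability of 1, a tag-to-topic weight of 0.32 and an inverse user frequency of 0.57, the sketch yields about 0.18, which is in line with the rounded 0.2 reported in Example 4.4.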

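[Figure 4.4: taxonomy based representation of a user, linking the user through the personal tag "0403" to weighted taxonomic topics; legend topics: "book", "computers", "programming", "garden", "flowers", "networks", "fruit", "databases", "romance"]

Figure 4.4 User Representation-Taxonomy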

[Example 4.4] (User Representation-Taxonomy) Figure 4.4 shows an example of taxonomy based user representations of user . The weight value of a user's preferences to a taxonomic topic can be calculated with Equation 4.5.

For example, the calculation of user 's preferences to taxonomic topic "programming" is shown as follows:

. =1. Based on Equation

4.4, =0.32. According to the tagging graph in Figure 3.1, the item taxonomy shown in Figure 4.1 and the description defined in Example 4.1, the user set of

contains , , and . Therefore, = =0.57 and = =0.2.

The taxonomy based user representation of is: ={( , 0.018), ( ,

0.07), ( , 0.2), ( , 0.06)}. Although user used a personal tag "0403" to represent this user's topic preferences, the taxonomy based user representation shows that is actually interested in taxonomic topics "programming" and "computers". The algorithm of taxonomy based user representation is shown below.

Table 4.3 The Algorithm of Taxonomy Based User Representation

Algorithm 4.3. UR-Taxonomy ( )

Input: is a given user

Output: taxonomy based user representation

Method:

Begin

1: ← {} //initialisation

2: for each taxonomic topic {

← 0 //initialisation

}//end for

3: for each taxonomic topic {

sum← 0

for each tag {

←TR-Taxonomy ( , ) //Algorithm 4.2 in Table 4.2

//

sum←sum+

}//end for

← sum

}//end for

4: for each taxonomic topic {

get

}//end for

5: for each taxonomic topic {

}//end for

6: Return

End

The computation complexity of Algorithm 4.3 is , where is the total number of taxonomic topics, , and is the total number of collected items in a tag by a user, . This can be improved if we only consider taxonomic topics that have

or tags that have . If indicates the number of taxonomic topics that are contained in the item descriptors of all the collected items of user , the improved computational complexity of Algorithm 4.3 is . is the

number of tags that a user has tagged, . Usually, .

4.2.4 A Framework of User Profiling Based on Taxonomy

Figure 4.5 shows a framework of user profiling based on taxonomy. It includes Tag Representation-Taxonomy, Item Representation-Taxonomy and User Representation-Taxonomy. The inputs of the framework are users' folksonomy and taxonomy information. The outputs include taxonomy based item representations and user profiles. Each user profile contains implicit item preferences and taxonomic topic based topic preferences.

[Figure 4.5: framework diagram. Inputs: Taxonomy and Folksonomy. Components of User Profiling-Taxonomy: Item Representation-Taxonomy, Tag Representation-Taxonomy and User Representation-Taxonomy. Outputs: Taxonomy Based Item Representations and Taxonomy Based User Profiles]

Figure 4.5 A Framework of User Profiling Based on Taxonomy

4.3 HYBRID USER PROFILING BASED ON FOLKSONOMY AND TAXONOMY

Reflecting users' opinions on item classification and description, folksonomy contains rich information about users' explicit personal topic preferences. As mentioned before, one limitation of folksonomy is its free-form vocabulary, which can cause inaccurate user profiling. Item taxonomy is standard, authorised, and user-independent. However, it reflects only the opinions of experts, without considering users' personal viewpoints or preferences.

Can we integrate item folksonomy and the standard item taxonomy and benefit from both information sources to further improve the accuracy of user profiling and recommendation making? To answer this question, this section proposes a hybrid approach to combine folksonomy information that reflects the wisdom of crowds and standard taxonomy information that reflects the viewpoint of experts. The following subsections discuss the hybrid tag representation, item representation and user representation approaches based on both information sources.

4.3.1 Hybrid Tag Representation

The process of finding the relevant tags and taxonomic topics of each tag in terms of each user is called hybrid tag representation.

[Definition 4.4] (Tag Representation-Hybrid): represents the relevance of each tag and each taxonomic topic to a given tag with respect to user

. The hybrid tag representation of tag with respect to user is

. is the folksonomy based representation of tag

with respect to user , defined in Section 3.3. is the taxonomy based representation of tag with respect to user , defined in Section 4.2.2.

Therefore, each tag can be represented by a set of tags and a set of taxonomic topics for an individual user.

[Figure 4.6: three panels, (a) Tag Representation-Hybrid, (b) Item Representation-Hybrid and (c) User Representation-Hybrid, each linking a tag, item or user to weighted tags and weighted taxonomic topics; legend tags: "globalization", "internet", "garden", "apple", "0403"; legend taxonomic topics: "book", "computers", "programming", "garden", "flowers", "networks", "fruit", "databases", "romance"]

Figure 4.6 Hybrid Representations Based on Folksonomy and Taxonomy

[Example 4.5] (Tag Representation-Hybrid) Figure 4.6 (a) shows an example of the hybrid representations of tag for user and user . The hybrid tag

representation of tag in terms of user is: , 0.035), ( , 0.147),

( , 0.175), ( , 0.175)}, {( , 0.25), ( , 0.75)}}. The hybrid tag representation of in

terms of user is: , 0.035), ( ,

0.12), ( , 0.099), ( , 0.13), ( , 0.22)}}.

For user , the tag "apple" is related to taxonomic topics "garden", "flowers" and "fruit" in the taxonomy based representation and to tags "garden" and "apple" in the folksonomy based representation. For user , tag "apple" is mainly related to taxonomic topics "computers", "programming", "networks" and "databases", and to tags "globalization" and "internet". Therefore, personalised semantic meanings of tag , represented by tags and taxonomic topics, are generated for different users, and the semantic ambiguity can be reduced. We can also obtain the related taxonomic topics and tags of each personal tag. By comparing their hybrid tag representations, tag synonyms can be detected. Some related tags may coincide with taxonomic topics. For example, in Figure 4.6, tag "garden" coincides with taxonomic topic "garden". Since they belong to different vocabulary systems (i.e., folksonomy and taxonomy), both are retained in the representations.
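One way to keep the two vocabularies apart is to store a hybrid representation as two separate mappings, so that a tag and a taxonomic topic with the same label never overwrite each other. The following Python sketch is illustrative only: the class name `HybridRepresentation` and its fields are hypothetical, and the weights are sample values.

```python
from dataclasses import dataclass, field

@dataclass
class HybridRepresentation:
    """Tag weights and taxonomic topic weights kept in separate namespaces."""
    tags: dict = field(default_factory=dict)      # folksonomy vocabulary
    topics: dict = field(default_factory=dict)    # taxonomy vocabulary

# Illustrative weights: "garden" occurs both as a tag and as a taxonomic topic.
rep = HybridRepresentation(
    tags={"garden": 0.25, "apple": 0.75},
    topics={"garden": 0.147, "flowers": 0.175, "fruit": 0.175},
)
shared_labels = sorted(set(rep.tags) & set(rep.topics))
```

Here `shared_labels` contains only "garden", and both of its weights remain available because each lives in its own mapping.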

4.3.2 Hybrid Item Representation

The process of finding the relevant tags and taxonomic topics of each item is called hybrid item representation.

[Definition 4.5] (Item Representation-Hybrid): represents the relevance of each tag and each taxonomic topic to a given item . The hybrid item

representation of item is defined as . is the

folksonomy based representation of item , defined in Section 3.4. is the taxonomy based representation of item , defined in Section 4.2.1. Consequently, each item is described by a set of relevant taxonomic topics given by experts and a set of relevant tags contributed by users.

[Example 4.6] (Item Representation-Hybrid) Figure 4.6 (b) shows an example of the hybrid item representations of item . The hybrid item representation of is:

{{( , 0.059), ( , 0.31), ( , 0.077), ( , 0.028)}, {( , 0.035), ( , 0.12), ( , 0.197), ( , 0.26)}}. Therefore, item is related to taxonomic topics "computers", "programming" and "networks" and tags "globalization" and "internet".

4.3.3 Hybrid User Representation

The process of finding the tags and taxonomic topics that each user is interested in is called hybrid user representation.

[Definition 4.6] (User Representation-Hybrid): represents the preferences of a user ‘s to each tag and each taxonomic topic . The hybrid user

representation of is defined as . is the

folksonomy based representation of user , defined in Section 3.5. is the taxonomy based representation of user , defined in Section 4.2.3.

As a result, a set of tags with their weights as well as a set of taxonomic topics with their weights are used to profile each user‘s topic preferences.

[Example 4.7] (User Representation-Hybrid) Figure 4.6 (c) shows an example of the hybrid user representations of user . The hybrid user representation of is:

={{( , 0.14), ( , 0.14), ( , 0.285)},{( , 0.018), ( , 0.07), ( , 0.2), ( ,

0.06)}}. Although used a personal tag "0403" to represent his/her topic preferences, user is also interested in topics "programming" and "computers". In the folksonomy of this user community, is also interested in tags "globalization" and "internet".

4.3.4 A Framework of Hybrid User Profiling Based on Folksonomy and Taxonomy

Figure 4.7 shows a framework of hybrid user profiling based on folksonomy and taxonomy information. It includes Tag Representation-Folksonomy, Item Representation-Folksonomy, User Representation-Folksonomy, Tag Representation-Taxonomy, Item Representation-Taxonomy, and User Representation-Taxonomy. The inputs of this framework are users' folksonomy and taxonomy information. The outputs include hybrid item representations and user profiles. Each hybrid item representation contains both folksonomy and taxonomy based item representations. Each hybrid user profile contains implicit item preferences and topic preferences based on tags and taxonomic topics.

[Figure 4.7: framework diagram. Inputs: Folksonomy and Taxonomy. Components of User Profiling-Hybrid: Tag Representation-Folksonomy, Item Representation-Folksonomy, User Representation-Folksonomy, Tag Representation-Taxonomy, Item Representation-Taxonomy and User Representation-Taxonomy. Outputs: Hybrid Item Representations and Hybrid User Profiles]

Figure 4.7 A Framework of Hybrid User Profiling Based on Folksonomy and Taxonomy

4.4 CHAPTER SUMMARY

This chapter focused on how to profile users, based on item taxonomy given by experts and folksonomy information contributed by users.

Firstly, the taxonomy based user profiling approach was proposed. To reduce the noise of tags, the item taxonomy information was used to determine the personally related taxonomic topics of each tag for each individual user. An improved weighting approach that considers the structural information of taxonomy and the popularity of taxonomic topics was proposed to measure the weights of taxonomic topics. The user profiles and item descriptions represented by the standard and controlled taxonomy vocabulary were discussed in this chapter. Furthermore, a hybrid user profiling approach was proposed whereby user profiles and item descriptions are represented by tag vocabulary and standard taxonomy vocabulary.

The generated taxonomy based user profiles and hybrid user profiles can be used in personalisation applications. Chapter 5 will focus on the utilisation of proposed user profiles in personalised recommender systems.

Chapter 5

Personalised Item Recommendation Making

Recommender systems are popularly used personalisation tools. The accuracy and effectiveness of user profiling greatly affect the performance of recommender systems and other personalisation applications. Based on different user profiles, different similar users or items can be found and the items can be ranked differently for recommendations.

With folksonomy information in Web 2.0, not only the content or taxonomy information of items, but also the vocabularies of community users and their viewpoints on the classifications and descriptions of items will affect whether an item will be recommended to a user. Chapter 3 and Chapter 4 discussed the proposed user profiling approaches based on folksonomy information and taxonomy information. This chapter will discuss how to utilise the proposed folksonomy based, taxonomy based and hybrid user profiles to make personalised item recommendations.

5.1 PROBLEM DEFINITION

The tasks of recommendation making include predicting item ratings and generating Top N recommendations (Adomavicius & Tuzhilin, 2005). Since there are no explicit ratings available in typical tagging communities, this thesis focuses on the Top N recommendation task. The Top N recommendation task can be further classified as user recommendations, item recommendations and tag recommendations (Milicevic et al., 2010). Let be a target user, be the item set that the user already has, , be a candidate item, be the predicted score of how much the user would be interested in the item , the problem of item recommendation is defined as generating a set of ordered items to the user , where .
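The Top N task defined above can be sketched as a ranking over candidate items that the target user does not already have. This is an illustrative Python sketch; the function name `top_n`, the item identifiers and the scores are hypothetical, and how the prediction scores are obtained is left open.

```python
def top_n(scores, owned_items, n):
    """Return the n highest scoring candidate items, best first.

    scores: {item: predicted interest score} for one target user.
    owned_items: items the user already has; these are never recommended.
    """
    candidates = [(item, s) for item, s in scores.items() if item not in owned_items]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by item id
    return [item for item, _ in candidates[:n]]

recommended = top_n({"p1": 0.9, "p2": 0.4, "p3": 0.7, "p4": 0.1}, {"p1"}, n=2)
```

In this toy run the user already has "p1", so the two recommended items are "p3" followed by "p2".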

The memory based Collaborative Filtering approaches are popularly used for recommendation making based on implicit ratings (Adomavicius & Tuzhilin, 2005). This thesis uses memory based Collaborative Filtering approaches to make recommendations.

Figure 5.1 shows a general framework of the proposed recommender systems based on folksonomy and taxonomy. The inputs are folksonomy and taxonomy information. The output is a list of recommended items for each target user. Similar to other memory based Collaborative Filtering recommender systems, the recommendation process includes three steps. The first step is to profile users' interests and preferences and to represent the relevant topics of each item. As discussed in Chapter 3 and Chapter 4, each user and each item can be profiled based on folksonomy, taxonomy, or both information sources. Then, based on the users' profiles and item representations, a set of similar users or items is determined. This step forms the neighbourhood of users and items. After that, a set of items that are popularly used or tagged by neighbour users is recommended to each target user.

The following sections of this chapter will discuss in detail how to form neighbourhoods and generate personalised recommendation lists based on the three kinds of proposed user profiles.

[Figure 5.1: framework diagram. Inputs: a target user, Folksonomy and Taxonomy. Recommender System stages: User Profiling, Neighbourhood Formation and Recommendation Generation. Output: Recommended Item Lists]

Figure 5.1 A General Framework of the Proposed Recommender Systems

5.2 NEIGHBOURHOOD FORMATION

Neighbourhood formation is to generate a set of like-minded peers for a target user or a set of similar peer items for an item . This thesis adopts the "K-Nearest-Neighbours" technique to find the neighbourhood for a user or an item. The user based K-Nearest-Neighbourhood formation approach selects the top K neighbour users with the shortest distances to a user by computing the distances between user and all other users, while the item based K-Nearest-Neighbourhood formation approach selects the top K neighbour items with the shortest distances to an item by calculating the distances between item and all other items. The distance or similarity measure can be calculated through various kinds of proximity computing approaches such as cosine similarity and Pearson correlation. The more accurate a user profile or item representation is, the more similar the neighbour users or items found will be.
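Neighbourhood formation with a pluggable similarity measure can be sketched as follows. The function name `k_nearest_neighbours`, the user identifiers and the toy similarity table are hypothetical; any measure where a larger value means a closer neighbour could be passed in.

```python
def k_nearest_neighbours(target, candidates, similarity, k):
    """Select the k candidates most similar to `target`.

    candidates: iterable of user (or item) identifiers.
    similarity: function (a, b) -> float, larger means closer.
    """
    scored = [(similarity(target, c), c) for c in candidates if c != target]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

# Toy similarity table (illustrative values only).
SIM = {("u1", "u2"): 0.8, ("u1", "u3"): 0.1, ("u1", "u4"): 0.5}
neighbours = k_nearest_neighbours(
    "u1", ["u1", "u2", "u3", "u4"], lambda a, b: SIM.get((a, b), 0.0), k=2
)
```

The same function serves the user based and the item based variants, since only the identifiers and the similarity function change.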

5.2.1 User Based K-Nearest-Neighbourhood Formation

As discussed in Chapter 3 and Chapter 4, each user can be profiled with implicit item preferences and topic preferences. The topic preferences of each user can be represented by vocabularies of folksonomy, taxonomy, or both. Thus, each user can be

profiled by no more than three vectors: a |P|-sized binary item vector that

represents user ‘s item preferences, a |T|-sized tag vector and a |C|-sized

taxonomic topic vector . The different combinations of the three vectors form

different types of user profiles. For example, the combination of and

form the folksonomy based user profile. The combination of and form

the taxonomy based user profile, while the combination of , and form the hybrid user profile of user .

The similarity of two users and can be measured by the similarity of their user profiles. The similarity measure of the three different parts of user profiles can be calculated as below.

• The similarity of item preferences of two users.

To measure the similarity of item preferences with implicit binary ratings, a simple approach is to count the overlap of commonly rated items between two users (Breese et al., 1998). Since the approach of weighting each commonly rated item with its inverse user frequency, or iuf (Breese et al., 1998), takes the user frequency of each item into account, it performs better for binary ratings in many cases (Breese et al., 1998). This thesis uses this iuf approach to calculate the similarity of item preferences of two users $u_i$ and $u_j$, which is denoted as $sim^{P}(u_i, u_j)$.

$$sim^{P}(u_i, u_j) = \frac{\sum_{p_k \in P_{u_i} \cap P_{u_j}} iuf(p_k)}{\sqrt{|P_{u_i}|} \cdot \sqrt{|P_{u_j}|}} \tag{5.1}$$

Where $|P_{u_i}|$ is the number of items that user $u_i$ has tagged and $|P_{u_j}|$ is the number of items that user $u_j$ has tagged, $iuf(p_k) = \log(|U| / |U_{p_k}|)$, where $|U|$ is the total number of users and $|U_{p_k}|$ is the number of users that have tagged item $p_k$, $p_k \in P_{u_i} \cap P_{u_j}$.

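• The similarity of tag based topic preferences of two users.

Cosine similarity is popularly used to calculate the similarity of two vectors (Adomavicius & Tuzhilin, 2005; Christopher et al., 2008). For any two |V|-sized vectors $\vec{a}$ and $\vec{b}$, the cosine similarity is defined as:

$$\cos(\vec{a}, \vec{b}) = \frac{\sum_{k=1}^{|V|} a_k b_k}{\sqrt{\sum_{k=1}^{|V|} a_k^2} \cdot \sqrt{\sum_{k=1}^{|V|} b_k^2}} \tag{5.2}$$

This thesis uses the cosine similarity to measure the similarity of tag based topic preferences as well as taxonomic topic based topic preferences of two users. The similarity of tag based topic preferences of two users $u_i$ and $u_j$, whose tag vectors are $\vec{T}_{u_i}$ and $\vec{T}_{u_j}$, is denoted as $sim^{T}(u_i, u_j)$. It is defined as:

$$sim^{T}(u_i, u_j) = \cos(\vec{T}_{u_i}, \vec{T}_{u_j}) \tag{5.3}$$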

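• The similarity of taxonomic topic based topic preferences of two users.

The similarity of taxonomic topic based topic preferences of two users $u_i$ and $u_j$, whose taxonomic topic vectors are $\vec{C}_{u_i}$ and $\vec{C}_{u_j}$, is denoted as $sim^{C}(u_i, u_j)$:

$$sim^{C}(u_i, u_j) = \cos(\vec{C}_{u_i}, \vec{C}_{u_j}) \tag{5.4}$$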

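As discussed before, different combinations of the three parts form different types of user profiles. Accordingly, the similarity of two users can be measured by different combinations of the similarities of the three parts. Linear combination is popularly used to hybridise different recommendation techniques (Burke, 2002). This thesis uses linear combination to measure the similarity of two users. The similarity of two users $u_i$ and $u_j$ based on hybrid user profiles that consider all three parts is calculated as:

$$sim(u_i, u_j) = \omega_P \cdot sim^{P}(u_i, u_j) + \omega_T \cdot sim^{T}(u_i, u_j) + \omega_C \cdot sim^{C}(u_i, u_j) \tag{5.5}$$

Where $\omega_P$, $\omega_T$ and $\omega_C$ are combination parameters that weight the similarities of item preferences, tag based topic preferences and taxonomic topic based topic preferences respectively, $0 \le \omega_P, \omega_T, \omega_C \le 1$ and $\omega_P + \omega_T + \omega_C = 1$.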

With different settings of the parameter values, Equation 5.5 can be used to measure the similarity of two users based on different types of user profiles. Let $\omega_P$, $\omega_T$ and $\omega_C$ be the weights of the item preference, tag based and taxonomic topic based similarities. With $\omega_C = 0$ and $\omega_P + \omega_T = 1$, Equation 5.5 linearly combines the similarities of item preferences and tag based topic preferences, which measures the similarity of two users with folksonomy based user profiles. Similarly, with $\omega_T = 0$ and $\omega_P + \omega_C = 1$, it linearly combines the similarities of item preferences and taxonomic topic based topic preferences, which measures the similarity of two users with taxonomy based user profiles.

Using this similarity measure, we can generate the neighbourhood of the target user $u_i$, which includes the K nearest neighbour users whose user profiles are similar to that of user $u_i$. With different user profiles, different parameters will be set in Equation 5.5 and different similar users will be found. Generally, the neighbourhood of user $u_i$ is denoted as $N(u_i) = \mathrm{maxK}_{u_j \in U, u_j \ne u_i}\{sim(u_i, u_j)\}$, where the function maxK{} returns the top K most similar users to $u_i$. The algorithm of the user based K-Nearest-Neighbourhood formation approach is shown below.

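Table 5.1 The Algorithm of User Based K-Nearest-Neighbourhood Formation Approach

Algorithm 5.1. KNN-User ($u_i$, K, $\omega_P$, $\omega_T$, $\omega_C$)

Input: $u_i$ is a given user, K is the number of neighbours, $\omega_P$, $\omega_T$ and $\omega_C$ are the parameters that linearly combine the similarities of item preferences, tag based topic preferences and taxonomic topic preferences, $0 \le \omega_P, \omega_T, \omega_C \le 1$, $\omega_P + \omega_T + \omega_C = 1$. The parameter settings for folksonomy based similarity calculation are $\omega_C = 0$ and $\omega_P + \omega_T = 1$. The parameter settings for taxonomy based similarity calculation are $\omega_T = 0$ and $\omega_P + \omega_C = 1$.

Output: $N(u_i)$, a set of neighbour users for user $u_i$

Method:

Begin

1: $u_i$'s K nearest user set $N(u_i)$ ← Ф //initialisation

2: for each user $u_j \in U$ {

    $\vec{T}_{u_j}$ ← UR-Folksonomy ($u_j$) //Algorithm 3.3 in Table 3.3

    $\vec{C}_{u_j}$ ← UR-Taxonomy ($u_j$) //Algorithm 4.3 in Table 4.3

}//end for

3: for each user $u_j \in U$, $u_j \ne u_i$ {

    sim1 ← $sim^{P}(u_i, u_j)$ // Equation 5.1

    sim2 ← $sim^{T}(u_i, u_j)$ // Equation 5.3

    sim3 ← $sim^{C}(u_i, u_j)$ // Equation 5.4

    $sim(u_i, u_j)$ ← $\omega_P$·sim1 + $\omega_T$·sim2 + $\omega_C$·sim3

}//end for

4: $N(u_i)$ ← maxK{$sim(u_i, u_j)$} // top K users with highest similarity scores

5: Return $N(u_i)$

End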

The computation complexity of this algorithm is , where is the number of users in a tagging community.
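The neighbourhood formation steps above can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the profile layout (a dict with `items`, `tags` and `topics`), the function names and the weight names `w_item`, `w_tag`, `w_topic` are all assumptions, and the iuf-weighted overlap is normalised by the geometric mean of the two item-set sizes, which is also an assumption.

```python
import math


def iuf_overlap(items_a, items_b, item_users, n_users):
    """Overlap of two item sets; each shared item is weighted by log(|U| / |U_p|)."""
    if not items_a or not items_b:
        return 0.0
    shared = items_a & items_b
    weight = sum(math.log(n_users / len(item_users[p])) for p in shared)
    # Assumed normalisation: geometric mean of the two item-set sizes.
    return weight / math.sqrt(len(items_a) * len(items_b))


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * vec_b.get(k, 0.0) for k, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_user(target, profiles, k, w_item, w_tag, w_topic):
    """Return the k users most similar to `target` under the weighted sum."""
    assert abs(w_item + w_tag + w_topic - 1.0) < 1e-9, "weights must sum to 1"
    # Invert the user -> items mapping so that iuf can be computed.
    item_users = {}
    for user, prof in profiles.items():
        for p in prof["items"]:
            item_users.setdefault(p, set()).add(user)
    me = profiles[target]
    scores = {}
    for user, prof in profiles.items():
        if user == target:
            continue
        scores[user] = (
            w_item * iuf_overlap(me["items"], prof["items"], item_users, len(profiles))
            + w_tag * cosine(me["tags"], prof["tags"])
            + w_topic * cosine(me["topics"], prof["topics"])
        )
    # Highest score first; ties are broken by user id for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))[:k]


profiles = {
    "u1": {"items": {"p1", "p2"}, "tags": {"python": 1.0, "web": 0.5}, "topics": {"computing": 1.0}},
    "u2": {"items": {"p1", "p3"}, "tags": {"python": 1.0}, "topics": {"computing": 1.0}},
    "u3": {"items": {"p4"}, "tags": {"cooking": 1.0}, "topics": {"food": 1.0}},
}
# Folksonomy based setting: the taxonomic topic weight is zero.
neighbours = knn_user("u1", profiles, k=1, w_item=0.5, w_tag=0.5, w_topic=0.0)
```

Setting `w_topic` to zero gives the folksonomy based neighbourhood, and setting `w_tag` to zero gives the taxonomy based one, mirroring the parameter settings described for Algorithm 5.1.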

5.2.2 Item Based K-Nearest-Neighbourhood Formation

Each item $p_k$ can be represented by two vectors: a |T|-sized tag vector $\vec{T}_{p_k}$ and a |C|-sized taxonomic topic vector $\vec{C}_{p_k}$. The similarity of two items $p_k$ and $p_l$ can be measured by the similarity of their item representations. Cosine similarity is used to measure the similarity of two tag vectors as well as two taxonomic topic vectors.

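• The similarity of folksonomy based representations of two items.

The similarity of the relevant tags of two items $p_k$ and $p_l$, whose tag vectors are $\vec{T}_{p_k}$ and $\vec{T}_{p_l}$, is denoted as $sim^{T}(p_k, p_l)$.

$$sim^{T}(p_k, p_l) = \cos(\vec{T}_{p_k}, \vec{T}_{p_l}) \tag{5.6}$$

• The similarity of taxonomy based representations of two items.

The similarity of the relevant taxonomic topics of two items $p_k$ and $p_l$, whose taxonomic topic vectors are $\vec{C}_{p_k}$ and $\vec{C}_{p_l}$, is denoted as $sim^{C}(p_k, p_l)$.

$$sim^{C}(p_k, p_l) = \cos(\vec{C}_{p_k}, \vec{C}_{p_l}) \tag{5.7}$$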

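The hybrid item representation combines both folksonomy and taxonomy based item representations. As with users, linear combination is used to combine the two parts. The similarity of two items $p_k$ and $p_l$ based on hybrid item representations is calculated as:

$$sim(p_k, p_l) = \lambda_T \cdot sim^{T}(p_k, p_l) + \lambda_C \cdot sim^{C}(p_k, p_l) \tag{5.8}$$

Where $\lambda_T$ and $\lambda_C$ are linear combination parameters that weight the tag based and taxonomic topic based item similarities respectively, $0 \le \lambda_T, \lambda_C \le 1$ and $\lambda_T + \lambda_C = 1$.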

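With $\lambda_T = 1$ and $\lambda_C = 0$, where $\lambda_T$ and $\lambda_C$ are the weights of the tag based and taxonomic topic based item similarities, Equation 5.8 measures the similarity of two items that are represented only by folksonomy information, while with $\lambda_T = 0$ and $\lambda_C = 1$, it measures the similarity of two items that are represented only by taxonomy information.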

Similarly, we can generate the top K nearest neighbour items of each item with the proposed similarity calculation approaches. With different item representations, different parameters will be set in Equation 5.8. As a consequence, different similar items will be determined. Generally, the top K neighbour items of each item $p_k$ are denoted as $N(p_k) = \mathrm{maxK}_{p_l \in P, p_l \ne p_k}\{sim(p_k, p_l)\}$. The algorithm of the item based K-Nearest-Neighbourhood formation approach is shown below.

Table 5.2 The Algorithm of Item Based K-Nearest-Neighbourhood Formation Approach

Algorithm 5.2. KNN-Item ($p_k$, K, $\lambda_T$, $\lambda_C$)

Input: $p_k$ is a given item, K is the number of neighbours, $\lambda_T$ and $\lambda_C$ are the parameters that linearly combine the similarities of relevant tag based topics and taxonomic topics, $0 \le \lambda_T, \lambda_C \le 1$, $\lambda_T + \lambda_C = 1$. The parameter settings for folksonomy based similarity calculation are $\lambda_T = 1$ and $\lambda_C = 0$. The parameter settings for taxonomy based similarity calculation are $\lambda_T = 0$ and $\lambda_C = 1$.

Output: $N(p_k)$, a set of similar items of item $p_k$

Method:

Begin

1: Item $p_k$'s K nearest item set $N(p_k)$ ← Ф //initialisation

2: for each item $p_l \in P$ {

    $\vec{T}_{p_l}$ ← IR-Folksonomy ($p_l$) //Algorithm 3.2 in Table 3.2

    $\vec{C}_{p_l}$ ← IR-Taxonomy ($p_l$) //Algorithm 4.1 in Table 4.1

}//end for

3: for each item $p_l \in P$, $p_l \ne p_k$ {

    sim1 ← $sim^{T}(p_k, p_l)$ //Equation 5.6

    sim2 ← $sim^{C}(p_k, p_l)$ //Equation 5.7

    $sim(p_k, p_l)$ ← $\lambda_T$·sim1 + $\lambda_C$·sim2

}//end for

4: $N(p_k)$ ← maxK{$sim(p_k, p_l)$} // top K items with highest similarity scores

5: Return $N(p_k)$

End

The computation complexity of this algorithm is , where is the number of items in a tagging community.
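A small Python sketch of the item based neighbourhood formation may help to show how the two weights switch between the folksonomy based, taxonomy based and hybrid item similarities. The item representation layout and the names `knn_item`, `w_tag` and `w_topic` are illustrative assumptions, not taken from the thesis implementation.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * vec_b.get(k, 0.0) for k, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def item_similarity(rep_a, rep_b, w_tag, w_topic):
    """Linear combination of tag based and taxonomic topic based similarity."""
    return (w_tag * cosine(rep_a["tags"], rep_b["tags"])
            + w_topic * cosine(rep_a["topics"], rep_b["topics"]))


def knn_item(target, reps, k, w_tag, w_topic):
    """Return the k items most similar to `target`."""
    scores = {
        item: item_similarity(reps[target], rep, w_tag, w_topic)
        for item, rep in reps.items()
        if item != target
    }
    # Highest score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]


reps = {
    "p1": {"tags": {"python": 1.0}, "topics": {"computing": 1.0}},
    "p2": {"tags": {"web": 1.0}, "topics": {"computing": 1.0}},
    "p3": {"tags": {"python": 1.0}, "topics": {"food": 1.0}},
}
folksonomy_neighbours = knn_item("p1", reps, k=1, w_tag=1.0, w_topic=0.0)
taxonomy_neighbours = knn_item("p1", reps, k=1, w_tag=0.0, w_topic=1.0)
```

In this toy data the two settings pick different neighbours for the same item, which is the effect the paragraph on parameter settings describes.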

5.3 RECOMMENDATION GENERATION

Typically, based on the generated neighbourhood, the items that are most frequently rated or tagged by the neighbour users of the target user, or that are most similar to the target user's tagged items, will be recommended to the target user. Since the topics of items and the topic preferences of users can be represented by a set of tags and a set of taxonomic topics, the topic matching between the target user and a candidate item can be used to improve the accuracy of recommendations by selecting those items that are not only rated or tagged by the neighbour users but also have content topics similar to the target user's topic preferences.

This thesis combines the Collaborative Filtering and content matching approaches to make recommendations. The user based and item based Collaborative Filtering approaches, each combined with the content-based filtering/matching methods, are discussed in turn.

5.3.1 User Based Recommendation Generation

For each target user $u_i$, a set of candidate items can be generated from the items tagged by user $u_i$'s neighbour users, who are found based on the similarity of user profiles. The candidate item set of user $u_i$ is denoted as $CP(u_i)$, $CP(u_i) = \{p_k \mid p_k \in P_{u_j}, u_j \in N(u_i), p_k \notin P_{u_i}\}$, where $N(u_i)$ is the neighbourhood of $u_i$ and $P_{u_j}$ is the set of items tagged by user $u_j$. For each candidate item $p_k \in CP(u_i)$, $N(u_i, p_k)$ is the subset of users in $N(u_i)$ who have tagged the item $p_k$. The prediction score of how much $u_i$ may be interested in $p_k$ is calculated by considering the similarities between user $u_i$ and those users who are $u_i$'s neighbours and have tagged the item $p_k$, and the content matching between the item $p_k$'s topics and user $u_i$'s topic preferences. This thesis uses the cosine similarity to calculate the content matching between a target user $u_i$ and a candidate item $p_k$.

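• The calculation of tag based content matching. The tag based content matching between a target user $u_i$ and a candidate item $p_k$, whose tag vectors are $\vec{T}_{u_i}$ and $\vec{T}_{p_k}$, is denoted as $sim^{T}(u_i, p_k)$.

$$sim^{T}(u_i, p_k) = \cos(\vec{T}_{u_i}, \vec{T}_{p_k}) \tag{5.9}$$

• The calculation of taxonomic topic based content matching. The taxonomic topic based content matching between a target user $u_i$ and a candidate item $p_k$, whose taxonomic topic vectors are $\vec{C}_{u_i}$ and $\vec{C}_{p_k}$, is defined as:

$$sim^{C}(u_i, p_k) = \cos(\vec{C}_{u_i}, \vec{C}_{p_k}) \tag{5.10}$$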

Similar to other popular hybrid recommendation approaches (Burke, 2002), this thesis uses linear combination to integrate the Collaborative Filtering and content filtering/matching approaches.

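Based on the hybrid user profiles and item representations, the hybrid user based recommendation generation approach linearly combines the user similarity, the tag based content matching and the taxonomic topic based content matching to predict how much a user will be interested in a candidate item. The prediction score for each candidate item $p_k$, denoted as $A(u_i, p_k)$, can be calculated as below:

$$A(u_i, p_k) = \sum_{u_j \in N(u_i, p_k)} \left( \theta_S \cdot sim(u_i, u_j) + \theta_T \cdot sim^{T}(u_i, p_k) + \theta_C \cdot sim^{C}(u_i, p_k) \right) \tag{5.11}$$

Where $N(u_i, p_k)$ is the set of $u_i$'s neighbour users who have tagged $p_k$, and $\theta_S$, $\theta_T$ and $\theta_C$ are combination parameters that weight the user similarity, the tag based content matching and the taxonomic topic based content matching respectively, $0 \le \theta_S, \theta_T, \theta_C \le 1$ and $\theta_S + \theta_T + \theta_C = 1$.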

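Let $\theta_S$, $\theta_T$ and $\theta_C$ be the weights of the user similarity, the tag based content matching and the taxonomic topic based content matching in Equation 5.11, and let $\omega_P$, $\omega_T$ and $\omega_C$ be the weights of the item preference, tag based and taxonomic topic based similarities in Equation 5.5. With $\theta_C = 0$ and $\theta_S + \theta_T = 1$, Equation 5.11 linearly combines the user similarity and the tag based content matching. It can be used to make recommendations based on folksonomy based user profiles and item representations. With $\theta_T = 0$ and $\theta_S + \theta_C = 1$, Equation 5.11 linearly combines the user similarity and the taxonomic topic based content matching. It can be used to make recommendations based on the taxonomy based user profiles and item representations. The Top N items with the highest prediction scores will be recommended to the target user $u_i$.

For folksonomy based user profiles, the settings of the parameters of both neighbourhood formation and recommendation generation are: $\omega_P + \omega_T = 1$, $\omega_C = 0$, $\theta_C = 0$ and $\theta_S + \theta_T = 1$. For taxonomy based user profiles, the settings of the parameters are: $\omega_P + \omega_C = 1$, $\omega_T = 0$, $\theta_T = 0$ and $\theta_S + \theta_C = 1$. The algorithm of the proposed user based recommendation generation approach is shown below.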

Table 5.3 The Algorithm of User Based Recommendation Generation Approach

Algorithm 5.3. User based Recommendation: Recommender-User ($u_i$, N, $\theta_S$, $\theta_T$, $\theta_C$)

Input: $u_i$ is a given target user, N is the number of items to be recommended, $\theta_S$, $\theta_T$ and $\theta_C$ are the parameters that linearly combine the collaborative filtering and the content matching based on folksonomy and taxonomy, $0 \le \theta_S, \theta_T, \theta_C \le 1$, $\theta_S + \theta_T + \theta_C = 1$. The parameter settings of the combination of collaborative filtering and folksonomy based content matching are $\theta_C = 0$ and $\theta_S + \theta_T = 1$. The parameter settings of the combination of collaborative filtering and taxonomy based content matching are $\theta_T = 0$ and $\theta_S + \theta_C = 1$.

Output: a list of items recommended for $u_i$

Method:

Begin

/*Step1: get the K nearest neighbours of user $u_i$ */

1: $N(u_i)$ ← KNN-User ($u_i$, K, $\omega_P$, $\omega_T$, $\omega_C$) // Algorithm 5.1 in Table 5.1

/*Step2: get candidate items from neighbour users */

2: candidate item set $CP(u_i)$ ← Ф //initialisation

3: for each user $u_j \in N(u_i)$ {

    $CP(u_i)$ ← $CP(u_i) \cup (P_{u_j} \setminus P_{u_i})$

}//end for

/*Step3: recommendation generation*/

4: for each candidate item $p_k \in CP(u_i)$ {

    $A(u_i, p_k)$ ← 0, sim ← 0 //initialisation

    for each user $u_j \in N(u_i, p_k)$ {

        sim1 ← $sim(u_i, u_j)$ //Equation 5.5

        sim2 ← $sim^{T}(u_i, p_k)$ //Equation 5.9

        sim3 ← $sim^{C}(u_i, p_k)$ //Equation 5.10

        sim ← sim + $\theta_S$·sim1 + $\theta_T$·sim2 + $\theta_C$·sim3

    }//end for

    $A(u_i, p_k)$ ← sim

}//end for

5: Return the Top N items with highest scores to $u_i$

End

The computation complexity of this algorithm is , is the number of candidate items, , K is the number of selected top K neighbour users.

Usually, .
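The scoring loop of the user based recommendation generation can be sketched as below. This is an illustrative sketch under stated assumptions: neighbour similarities and content matching scores are taken as precomputed dicts, and the function name `recommend_user_based` and the weight names `w_sim`, `w_tag`, `w_topic` are hypothetical.

```python
def recommend_user_based(neighbour_sims, neighbour_items, own_items,
                         tag_match, topic_match, w_sim, w_tag, w_topic, top_n):
    """Rank candidate items for one target user.

    neighbour_sims:  {neighbour: similarity to the target user}
    neighbour_items: {neighbour: set of items the neighbour has tagged}
    own_items:       items already tagged by the target user (never recommended)
    tag_match / topic_match: {item: content matching score with the target user}
    """
    scores = {}
    for neighbour, items in neighbour_items.items():
        for item in items - own_items:
            # Each neighbour who tagged the item contributes one weighted term,
            # so items tagged by several similar neighbours accumulate more.
            scores[item] = scores.get(item, 0.0) + (
                w_sim * neighbour_sims[neighbour]
                + w_tag * tag_match.get(item, 0.0)
                + w_topic * topic_match.get(item, 0.0)
            )
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


ranked = recommend_user_based(
    neighbour_sims={"u2": 0.6, "u3": 0.2},
    neighbour_items={"u2": {"p1", "p3"}, "u3": {"p3", "p4"}},
    own_items={"p1"},
    tag_match={"p3": 0.5, "p4": 0.9},
    topic_match={},
    w_sim=0.5, w_tag=0.5, w_topic=0.0,
    top_n=2,
)
```

Here `p3` outranks `p4` even though `p4` matches the user's tags better, because two neighbours tagged `p3` and their contributions are summed.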

5.3.2 Item Based Recommendation Generation

For the item based approach, the candidate item set can be the whole item set except for those items that are already rated or tagged by the target user. To avoid unnecessary computation over item pairs, the top K most similar items of each rated or tagged item of the target user $u_i$ can be aggregated together as the candidate item set, which is denoted as $CP(u_i)$, $CP(u_i) = \{p_k \mid p_k \in N(p_l), p_l \in P_{u_i}, p_k \notin P_{u_i}\}$, where $P_{u_i}$ is the set of items tagged by $u_i$ and $N(p_l)$ is the set of top K neighbour items of item $p_l$. For each candidate item $p_k \in CP(u_i)$, the prediction score is usually calculated as the sum or average similarity of the candidate item with all rated or tagged items of the target user $u_i$. Since the user's topic preferences are obtained from the related tags or taxonomic topics of all the items that the user has, the similarity of the candidate item with the user's topic preferences actually measures the average or total similarity of the candidate item with all tagged items of the target user. Thus, if a candidate item has the highest similarity score with one of the user's tagged items, and its topics are the most similar to the user's topic preferences, then this item will have a higher prediction score than other items. Therefore, this work proposes to calculate the prediction score of a candidate item as the maximum score of the linear combination of the item similarity with each tagged or rated item and the content matching with the target user's topic preferences.

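Based on the hybrid user profiles and item representations, the item based recommendation generation approach linearly combines the item similarity, the tag based content matching and the taxonomic topic based content matching to predict how much a user will be interested in a candidate item. The prediction score for each candidate item $p_k$, denoted as $A(u_i, p_k)$, can be calculated as below:

$$A(u_i, p_k) = \max_{p_l \in P_{u_i}} \left( \eta_S \cdot sim(p_k, p_l) + \eta_T \cdot sim^{T}(u_i, p_k) + \eta_C \cdot sim^{C}(u_i, p_k) \right) \tag{5.12}$$

Where $P_{u_i}$ is the set of items tagged by $u_i$, and $\eta_S$, $\eta_T$ and $\eta_C$ are combination parameters that weight the item similarity, the tag based content matching and the taxonomic topic based content matching respectively, $0 \le \eta_S, \eta_T, \eta_C \le 1$ and $\eta_S + \eta_T + \eta_C = 1$.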

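Let $\eta_S$, $\eta_T$ and $\eta_C$ be the weights of the item similarity, the tag based content matching and the taxonomic topic based content matching in Equation 5.12, and let $\lambda_T$ and $\lambda_C$ be the weights of the tag based and taxonomic topic based item similarities in Equation 5.8. With $\eta_C = 0$ and $\eta_S + \eta_T = 1$, Equation 5.12 linearly combines the item similarity and the tag based content matching. It can be used to make recommendations based on folksonomy based user profiles and item representations. With $\eta_T = 0$ and $\eta_S + \eta_C = 1$, Equation 5.12 linearly combines the item similarity and the taxonomic topic based content matching. It can be used to make recommendations based on taxonomy based user profiles and item representations. The Top N items with the highest prediction scores will be recommended to the target user $u_i$.

For folksonomy based user profiles, the settings of the parameters of both neighbourhood formation and recommendation generation are: $\lambda_T = 1$, $\lambda_C = 0$, $\eta_C = 0$ and $\eta_S + \eta_T = 1$. For taxonomy based user profiles, the settings of the parameters are: $\lambda_T = 0$, $\lambda_C = 1$, $\eta_T = 0$ and $\eta_S + \eta_C = 1$. The algorithm of the proposed item based recommendation generation approach is shown below.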

Table 5.4 The Algorithm of Item Based Recommendation Generation Approach

Algorithm 5.4. Item based Recommendation: Recommender-Item ($u_i$, N, $\eta_S$, $\eta_T$, $\eta_C$)

Input: $u_i$ is a given target user, N is the number of items to be recommended, $\eta_S$, $\eta_T$ and $\eta_C$ are the parameters that linearly combine the collaborative filtering and the content matching based on folksonomy and taxonomy, $0 \le \eta_S, \eta_T, \eta_C \le 1$, $\eta_S + \eta_T + \eta_C = 1$. The parameter settings of the combination of collaborative filtering and folksonomy based content matching are $\eta_C = 0$ and $\eta_S + \eta_T = 1$. The parameter settings of the combination of collaborative filtering and taxonomy based content matching are $\eta_T = 0$ and $\eta_S + \eta_C = 1$.

Output: a list of items recommended for $u_i$

Method:

Begin

/*Step1: get candidate items from neighbour items */

1: user $u_i$'s candidate item set $CP(u_i)$ ← Ф //initialisation

2: for each collected item $p_l \in P_{u_i}$ {

    /* Get the K nearest neighbours of item $p_l$ */

    $N(p_l)$ ← KNN-Item ($p_l$, K, $\lambda_T$, $\lambda_C$) //Algorithm 5.2 in Table 5.2

    $CP(u_i)$ ← $CP(u_i) \cup (N(p_l) \setminus P_{u_i})$

}//end for

/*Step2: recommendation generation*/

3: for each candidate item $p_k \in CP(u_i)$ {

    max ← 0 //initialisation

    for each collected item $p_l \in P_{u_i}$ {

        sim1 ← $sim(p_k, p_l)$ //Equation 5.8

        sim2 ← $sim^{T}(u_i, p_k)$ //Equation 5.9

        sim3 ← $sim^{C}(u_i, p_k)$ //Equation 5.10

        sim ← $\eta_S$·sim1 + $\eta_T$·sim2 + $\eta_C$·sim3

        If sim > max Then max ← sim

    }//end for

    $A(u_i, p_k)$ ← max

}//end for

4: Return the Top N items with highest scores to $u_i$

End

The computation complexity of this algorithm is , where is the

number of items that a user has tagged, , is the number of candidate items,

. Usually, , .
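The item based scoring differs from the user based one in that it keeps the maximum, not the sum, over the target user's own items. A minimal Python sketch follows, assuming item similarities are precomputed in a dict keyed by `(candidate, own_item)` pairs; the function and parameter names are hypothetical.

```python
def recommend_item_based(candidates, own_items, item_sim,
                         tag_match, topic_match, w_sim, w_tag, w_topic, top_n):
    """Rank candidate items by their best match against the user's own items.

    item_sim: {(candidate, own_item): similarity}; missing pairs count as 0.
    tag_match / topic_match: {item: content matching score with the target user}
    """
    scores = {}
    for cand in candidates:
        # The content matching term does not depend on the own item.
        content = (w_tag * tag_match.get(cand, 0.0)
                   + w_topic * topic_match.get(cand, 0.0))
        best = 0.0
        for own in own_items:
            best = max(best, w_sim * item_sim.get((cand, own), 0.0) + content)
        scores[cand] = best
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


ranked = recommend_item_based(
    candidates=["p3", "p4"],
    own_items={"p1", "p2"},
    item_sim={("p3", "p1"): 0.8, ("p3", "p2"): 0.2, ("p4", "p1"): 0.4},
    tag_match={"p3": 0.5, "p4": 1.0},
    topic_match={},
    w_sim=0.5, w_tag=0.5, w_topic=0.0,
    top_n=2,
)
```

In this toy case `p4` is ranked first: its item similarity is lower than that of `p3`, but its stronger match with the user's tag preferences lifts the combined score.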


5.4 FRAMEWORKS OF THE PROPOSED RECOMMENDATION SYSTEMS

Based on the different input information sources and user profiles, the proposed recommender systems can be classified into three kinds: folksonomy based, taxonomy based and hybrid recommender systems. Each kind includes user based and item based recommendation approaches.

Figure 5.2 shows the framework of the proposed folksonomy based recommender systems. The input is users' folksonomy information. The outputs of UserProfiling-Folksonomy illustrated in Figure 3.5 are used to form neighbours and generate recommendation lists based on either user based or item based approaches.

[Figure 5.2 diagram: Folksonomy and A Target User feed UserProfiling-Folksonomy, whose outputs feed KNN-User and KNN-Item, followed by Recommender-User and Recommender-Item, which produce the Recommended Item Lists.]

Figure 5.2 The Framework of Folksonomy Based Recommendation Making


The framework of the taxonomy based recommendation making approach is shown in Figure 5.3. The inputs include folksonomy and taxonomy information. Unlike Figure 5.2, the module UserProfiling-Taxonomy illustrated in Figure 4.5 is adopted. Moreover, the parameter settings of neighbourhood formation and recommendation generation are different.

[Figure 5.3 diagram: Folksonomy, Taxonomy and A Target User feed UserProfiling-Taxonomy, whose outputs feed KNN-User and KNN-Item, followed by Recommender-User and Recommender-Item, which produce the Recommended Item Lists.]

Figure 5.3 The Framework of Taxonomy Based Recommendation Making

Figure 5.4 shows the framework of the hybrid recommendation making based on folksonomy and taxonomy. The input information sources include folksonomy and taxonomy information. The UserProfiling-Hybrid module illustrated in Figure 4.7 is adopted to generate hybrid user profiles and item representations. Based on the outputs of UserProfiling-Hybrid, the recommended item lists are generated with either user based or item based neighbourhood formation and recommendation generation approaches.

[Figure 5.4 diagram: Folksonomy, Taxonomy and A Target User feed UserProfiling-Hybrid, whose outputs feed KNN-User and KNN-Item, followed by Recommender-User and Recommender-Item, which produce the Recommended Item Lists.]

Figure 5.4 The Framework of Hybrid Recommendation Making

5.5 CHAPTER SUMMARY

This chapter discussed how to utilise the folksonomy based, taxonomy based and hybrid user profiles proposed in Chapter 3 and Chapter 4 in recommender systems to make Top N item recommendations. Neighbourhood formation and item ranking approaches that incorporate the folksonomy and taxonomy information were proposed. Based on the proposed user profiles, user based and item based collaborative filtering approaches combined with content filtering methods were proposed to generate Top N recommended item lists.

The evaluation of the performances of the proposed user profiling and recommendation approaches will be discussed in Chapter 6.


Chapter 6

6 Experiments and Results

This chapter focuses on the evaluation of the proposed user profiling models and recommendation approaches. The experiment design, the analysis of the data collections, the selected evaluation metrics and the baseline models are discussed first. Then, the results of the experiments are presented. Finally, the results are analysed and discussed.

6.1 EXPERIMENTS DESIGN

The major objective of the experiments was to show how the proposed user profiling models can improve the performance of recommender systems effectively and efficiently. In order to give a comprehensive investigation, the experiments were conducted in terms of the following hypotheses:

• [Hypothesis 1]: The proposed folksonomy based user profiling approach can effectively improve the recommendation accuracy.

• [Hypothesis 2]: The proposed taxonomy based user profiling approach can effectively improve the recommendation accuracy.

• [Hypothesis 3]: The proposed hybrid user profiling approach is more accurate than the folksonomy or taxonomy based user profiling approaches.

• [Hypothesis 4]: The proposed user profiling approaches are scalable and can be used for large scale recommender systems.

To verify Hypothesis 1, Hypothesis 2, and Hypothesis 3, experiments were conducted to evaluate the effectiveness of the proposed approaches on two real world datasets collected from Amazon.com and CiteULike.com. The effectiveness evaluation is the major focus of this chapter. It will be discussed in Section 6.2 and Section 6.3.

Experiments were also conducted to verify Hypothesis 4. The efficiency comparison experiments that evaluate the scalability of the proposed approaches were conducted on a large scale dataset collected from Del.icio.us. This will be examined in Section 6.4.

6.2 EVALUATION METHODS

6.2.1 Datasets

The effectiveness evaluation experiments were conducted on two real world datasets collected from Amazon.com and CiteULike.com.

1) Dataset D1: Amazon.com dataset. This dataset was collected from Amazon.com in April 2008. The items of the dataset are books. To avoid excessive sparsity, we only selected users that had at least 5 items and items that had been used by at least 3 users. The final dataset consisted of 4,112 users, 34,201 tags, and 30,467 items. We also extracted the taxonomic descriptors (Ziegler et al., 2004; Weng et al., 2008) of each item from Amazon.com. The taxonomy formed by the descriptors was tree-structured and contained 9,919 unique topics.

2) Dataset D2: CiteULike dataset. The "Who-posted-what" dataset (http://static.citeulike.org/data/current.bz2) that contains the basic tagging information was used. The items of this dataset were research papers. The original dataset contained 50,926 users, 346,084 tags and 1,681,089 items. This thesis selected users that had at least 5 items and items that had been used by at least 2 users. The final dataset comprised 7,103 users, 78,414 tags and 117,279 items.

The distributions of tags for Datasets D1 and D2 are shown in Figure 6.1. It illustrates that tags follow a power law distribution in both datasets. A large number of tags are used by only a small number of users, while only a small number of tags are popularly used by many users. Appendix A shows some example popular and personal tags of both datasets.

In Dataset D1, the maximum number of users a tag had was 405. Only 4.8% of tags (i.e., 1,648) were used by at least 10 users, while nearly 89% of tags (i.e., 30,417) were used by at most 5 users. In addition, nearly 67% of tags (i.e., 22,903) were used by only one user; these are called personal tags.

In Dataset D2, the maximum number of users a tag had was 1,211. Nearly 70% of tags (i.e., 55,184) were personal tags, and nearly 90% of tags (i.e., 70,185) had at most 5 users. Only 5.2% of tags (i.e., 4,131) were used by at least 10 users in Dataset D2.
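Statistics of this kind can be derived from raw (user, tag) assignments with a few lines of Python. The sketch below is illustrative only: the sample assignments are made up and the function names are hypothetical.

```python
from collections import defaultdict


def users_per_tag(assignments):
    """Count the distinct users of each tag from (user, tag) pairs."""
    tag_users = defaultdict(set)
    for user, tag in assignments:
        tag_users[tag].add(user)
    return {tag: len(users) for tag, users in tag_users.items()}


def share_of_tags(counts, predicate):
    """Fraction of tags whose user count satisfies `predicate`."""
    return sum(1 for c in counts.values() if predicate(c)) / len(counts)


# Made-up sample: "python" is shared, "toread" and "ml" are personal tags.
assignments = [
    ("u1", "python"), ("u2", "python"), ("u1", "python"),
    ("u1", "toread"), ("u3", "ml"),
]
counts = users_per_tag(assignments)
personal_share = share_of_tags(counts, lambda c: c == 1)
popular_share = share_of_tags(counts, lambda c: c >= 2)
```

A tag applied several times by the same user still counts as one user, which is why a set is kept per tag.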

[Figure 6.1 plot: Number of Tags (y-axis, 0 to 4500) against Number of Users (x-axis, 0 to 1200), with one series each for D1 and D2.]

Figure 6.1 The Distributions of Tags in Dataset D1 and D2

The distributions of items for Datasets D1 and D2 are shown in Figure 6.2. Similarly, items also follow a power law distribution in both datasets. Figure 6.2 demonstrates that most tagged or collected items were unpopular items in the long tail, while only a small number of items were popularly tagged by many users.

[Figure 6.2 plot: Number of Items (y-axis, 0 to 12000) against Number of Users (x-axis, 0 to 140), with one series each for D1 and D2.]

Figure 6.2 The Distributions of Items in Dataset D1 and D2


6.2.2 Evaluation Metrics

In this thesis, the well recognised Precision, Recall and F1 metrics are used to evaluate the accuracy of the recommendations of the proposed approaches.

• Precision and Recall

Precision and recall are the most popular metrics for evaluating information retrieval systems. As the key metrics, they were proposed by Cleverdon et al. in 1966 (Cleverdon et al., 1966), and are still used today.

Precision is defined as the ratio between the number of selected relevant items (denoted as $N_{rs}$) and the number of selected items (denoted as $N_s$), as shown in Equation 6.1. Precision represents the probability that a selected item is relevant. It can be seen as a measure of exactness or fidelity.

$$Precision = \frac{N_{rs}}{N_s} \tag{6.1}$$

Recall is defined as the ratio between the number of selected relevant items (denoted as $N_{rs}$) and the number of relevant items available (denoted as $N_r$), as shown in Equation 6.2. Recall represents the probability that a relevant item will be selected. It can be seen as a measure of completeness.

$$Recall = \frac{N_{rs}}{N_r} \tag{6.2}$$

• F1 Measure

In order to provide a general overview of the overall performance, the F1 metric is used to combine the results of Precision and Recall. It is defined below.

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{6.3}$$

6.2.3 Experiment Setup

The performance of learning algorithms is typically influenced by several parameters. One way of optimising parameters is by maximising performance on a given dataset. However, such tuning tends to overestimate the expected performance of the system. To prevent this kind of overfitting, 5-fold cross-validation (Weiss & Kulikowski, 1991) was used to evaluate the effectiveness of the proposed approaches.

Each of the two datasets was randomly partitioned into 5 sub datasets. Of the 5 sub datasets, one single sub dataset (i.e., 20% of users) was retained as the validation data for testing (i.e., test data). The remaining 4 sub datasets (i.e., 80% of users) were used as training data. For each test user, a random 20% of the items of this user were hidden as the test/answer item set, and the remaining 80% of the user's items were used as this user's training item set. Therefore, for a user in the training set, the user's profile was generated from the folksonomy or taxonomy information of all the tagged items of the user. For a user in the test set, the user's profile was generated from the folksonomy or taxonomy information of the training items of the test user only. For each test user, the recommender system generated a list of ordered items that the test user did not collect or tag. The top N items with the highest prediction scores were recommended to the user. If an item in the recommendation list was also in the test user's hidden test item list, then the item was counted as a hit.
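The split protocol can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `user_fold_split`, the fixed random seed and the rule of hiding at least one item per test user are illustrative choices, not details given in the thesis.

```python
import random


def user_fold_split(user_items, fold, n_folds=5, hidden_ratio=0.2, seed=0):
    """Split users into test/training folds and hide part of each test user's items.

    Returns (train, hidden): `train` maps every user to the items visible for
    profiling; `hidden` maps each test user to the held-out answer items.
    """
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)  # same seed gives the same user order for every fold
    test_users = set(users[fold::n_folds])
    train, hidden = {}, {}
    for user in users:
        items = sorted(user_items[user])
        if user in test_users:
            rng.shuffle(items)
            # Assumption: hide at least one item so every test user has an answer set.
            n_hidden = max(1, round(len(items) * hidden_ratio))
            hidden[user] = set(items[:n_hidden])
            train[user] = set(items[n_hidden:])
        else:
            train[user] = set(items)
    return train, hidden


# Ten made-up users with five items each.
user_items = {f"u{i}": {f"p{j}" for j in range(i, i + 5)} for i in range(10)}
train, hidden = user_fold_split(user_items, fold=0)
```

Calling the function with `fold` set to 0 through 4 yields the five runs, with every user serving as a test user exactly once.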

For the Top N item recommendation task, the number of selected items $N_s$ in Equation 6.1 is the number of recommended items. Therefore, $N_s = N$. For a target user $u_i$, let $H_{u_i}$ denote the hidden test item set of $u_i$, and let $R_{u_i}$ denote the set of Top N items recommended to $u_i$. The number of selected relevant items in Equation 6.1 can be defined as $N_{rs} = |H_{u_i} \cap R_{u_i}|$. The total number of relevant items in Equation 6.2 is the number of test items of $u_i$, which is defined as $N_r = |H_{u_i}|$. The precision and recall metrics specifically for the Top N recommendation problem are defined as Equation 6.4 and Equation 6.5.

$$Precision = \frac{|H_{u_i} \cap R_{u_i}|}{N} \tag{6.4}$$

$$Recall = \frac{|H_{u_i} \cap R_{u_i}|}{|H_{u_i}|} \tag{6.5}$$

The average precision, recall and F1 measure over all test users of one partitioned validation sub dataset were recorded as one run of the results. The average precision and recall values of the 5 runs were used to measure the accuracy of the recommendations. Figure 6.3 visualises this experimental setup.
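The hit-based metrics and their averaging over test users can be sketched as below. The function names are hypothetical, and precision is divided by the length of the recommendation list, which equals N whenever a full Top N list is produced.

```python
def top_n_metrics(recommended, hidden):
    """Precision, recall and F1 of one user's Top N list against the hidden items."""
    hits = len(set(recommended) & hidden)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(hidden) if hidden else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def average_metrics(recs_by_user, hidden_by_user):
    """Average the per-user metrics over all test users of one run."""
    rows = [top_n_metrics(recs_by_user[u], hidden_by_user[u]) for u in hidden_by_user]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


# Made-up Top 3 lists for two test users.
recs = {"u1": ["p1", "p2", "p3"], "u2": ["p4", "p5", "p6"]}
answers = {"u1": {"p2", "p9"}, "u2": {"p7"}}
avg_precision, avg_recall, avg_f1 = average_metrics(recs, answers)
```

Averaging is done per user first and across runs afterwards, so users with long hidden lists do not dominate the result.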

[Figure 6.3 diagram: the whole dataset is split 5-fold into training data and test data; the items of each test user are split into a training item set and a test item set.]

Figure 6.3 Visualisation of the 5-fold Cross-Validation Experiment Setup

6.2.4 Experiment Environment and Framework

To evaluate the effectiveness of the proposed approaches, this thesis implemented the proposed user profiling and recommendation approaches and other related state-of-the-art baseline models. Java was used as the programming language. The experiments were mainly conducted on a High Performance Computer (HPC) provided by the High Performance Computing Group of Queensland University of Technology, Australia. The experiments can also be conducted on a Personal Computer equipped with an Intel Pentium IV 3.0 GHz CPU and 2 GB of memory running the Windows XP operating system. However, the experiments would run more slowly than on the HPC, which has 4 GB to 6 GB of memory and faster CPUs.


The proposed approaches include:

• FM-User and FM-Item: These are the proposed recommendation approaches based on folksonomy information. The folksonomy based user profiling approach is adopted. Each tag, user and item is represented with a set of tags. For simplicity, FM-User and FM-Item are called Folksonomy Models. The former is the proposed user based Folksonomy Model while the latter is the proposed item based Folksonomy Model.

• TM-User and TM-Item: These are the proposed recommendation approaches based on taxonomy information. The taxonomy based user profiling approach is adopted. Each tag, user and item is represented with a set of taxonomic topics. For simplicity, the two approaches are called Taxonomy Models. TM-User is the proposed user based Taxonomy Model. TM-Item is the proposed item based Taxonomy Model.

• FTM-User and FTM-Item: These are the proposed hybrid recommendation approaches based on folksonomy and taxonomy information. The hybrid user profiling approach is adopted. Each tag, user and item is represented with a set of tags and a set of taxonomic topics. FTM-User is the user based hybrid recommendation approach. FTM-Item is the item based hybrid recommendation approach. They are called Hybrid Models.

For each test user, this thesis compared the recommendation results produced by the above proposed approaches with other related state-of-the-art baseline models. Figure 6.4 illustrates the framework of the evaluation experiments.

[Figure 6.4 diagram: for a test user, the training set (including the training set of the test user) feeds the Folksonomy Models, Taxonomy Models, Hybrid Models and baseline models I to n; their Recommended Item Lists are evaluated against the test item set of the test user (the test set) to give the Evaluation Results.]

Figure 6.4 The Experiment Framework of the Accuracy Evaluation

6.3 RESULTS AND DISCUSSIONS

This section examines the results of the Folksonomy Model, the results of the Taxonomy Model, and the comparison between the Hybrid Model and each single model (i.e., the Folksonomy Model and the Taxonomy Model). First, however, we analyse the influence of tags on the recommendation accuracy of the standard Collaborative Filtering recommendation approaches.

6.3.1 The Influence of Tags to the Standard CF Recommendation Approaches

The standard collaborative filtering (CF) approaches are popular benchmark baseline models. The purpose of this subset of experiments is to show that simply using User-Tag relationships to make recommendations cannot improve the recommendation accuracy of the standard CF approaches. In this subset of experiments, we compared the Top-3 (N=3) precision results of the following approaches:

• CF-Item: This is the standard item based CF approach that is based on the User-Item relationship or the binary user-item matrix. The similarity of two items was calculated based on the overlap of their user sets (i.e., the Item-User mapping). In our experiments, an advanced version of CF that takes the inverse item frequency (iif) value of each user into consideration to measure the similarity of two items was implemented, as suggested by Breese et al. (1998).

• CF-User: This is the standard user based CF approach that is based on the User-Item relationship. The similarity of two users was calculated based on the overlap of their item sets (i.e., the User-Item mapping). The inverse user frequency (iuf) value of each item was taken into consideration when measuring the similarity of two users (Breese et al., 1998).

• TCF-User: This is the user based CF approach that is based on the User-Tag relationship. The similarity of two users was calculated based on the overlap of their tag sets (i.e., the User-Tag mapping). The inverse user frequency value of each tag was taken into account when measuring the similarity of two users.

The Top-3 (N=3) Precision results of Dataset D1 and D2 are shown in Figure 6.5.

[Figure 6.5 bar chart: Top-3 Precision (y-axis, 0 to 0.18) of CF-Item, CF-User and TCF-User on Datasets D1 and D2.]

Figure 6.5 Top-3 Precision of the Standard CF Approaches on Datasets D1 and D2

Discussions:

As shown in Figure 6.5, the TCF-User approach had the worst precision results on both Dataset D1 and Dataset D2. In this approach, the similarity of two users was measured based on the overlap of their tag sets. The noise contained in tags, such as semantic ambiguity, tag synonyms and personal tags, caused inaccurate user profiling and improper neighbourhood formation. As a consequence, the tag information failed to improve the recommendation accuracy. In fact, it decreased the recommendation performance.

Therefore, it is necessary to reduce the noise of tags and make use of the unique features of tags to make recommendations. The following sub sections will examine the experimental results of the approaches proposed in this thesis.


6.3.2 Results of Folksonomy Model

The objective of this set of experiments was to verify the following hypothesis:

[Hypothesis 1]: The proposed Folksonomy based user profiling approach can

effectively improve the recommendation accuracy.

The parameterisation of the proposed Folksonomy Model will be reviewed first.

Secondly, the comparison with existing widely used approaches to remove tag noise will be presented. Thirdly and more importantly, the comparison of state-of-the-art baseline models will be discussed.

6.3.2.1 Parameterisation

As discussed in Chapter 5, the settings of the parameters for FM-User, the proposed user based recommendation approach based on folksonomy information, were

, 0 and . We conducted the experiments by setting the values of and from 0.0 to 1.0 increasing by 0.1, while the values of and were set from 1.0 to 0.0 decreasing by 0.1. The results indicated that with the value ranging from 0.8 to 1.0 and the value ranging from 0.4 to 0.5, the proposed user based approach achieved the best performance on the two datasets.

The settings of the parameters for FM-Item, the proposed item based recommendation approach based on folksonomy information, were 1, 0,

0 and . In the experiments, we set the value of from 0.0 to 1.0 increasing by 0.1 while was set from 1.0 to 0.0 decreasing by 0.1. The results indicated


that with the value ranging from 0.4 to 0.5, the proposed item based approach achieved the best results on the two datasets.

The values of the best settings of the parameters indicate that item preferences play a more important role than the topic preferences in user based neighbourhood formation. Collaborative filtering and content matching are equally important for the recommendation generations of these two datasets.
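The coupled sweep described above, in which one weight rises from 0.0 to 1.0 in steps of 0.1 while its partner falls from 1.0 to 0.0, can be sketched as follows. The parameter symbols are not reproduced in this section, so `w_item` and `w_topic` are placeholder names, and `toy_precision` is a hypothetical stand-in for a real evaluation run rather than a reported result.

```python
def sweep_complementary_weights(evaluate, step=0.1):
    """Try w = 0.0, 0.1, ..., 1.0 with the paired weight 1 - w.

    Returns the list of (w, 1 - w, score) tuples and the best tuple.
    """
    n_steps = round(1.0 / step)
    results = []
    for k in range(n_steps + 1):
        w = round(k * step, 10)  # avoid drift from repeated float addition
        results.append((w, round(1.0 - w, 10), evaluate(w, 1.0 - w)))
    best = max(results, key=lambda r: r[2])
    return results, best

# Hypothetical stand-in for an evaluation run (e.g. Top-3 precision on a test split).
def toy_precision(w_item, w_topic):
    return 0.3 - (w_item - 0.8) ** 2

results, best = sweep_complementary_weights(toy_precision)
```

Tying the two weights together keeps the search one-dimensional per pair, which is why eleven settings per pair are enough to cover the grid.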

For fair comparisons, the parameters of baseline models were also tuned to the best settings if applicable. The following discussions are given on the basis of the best settings of the parameters.

6.3.2.2 Experimental Results

1) Comparison 1: the Comparisons with Related Tag Noise Removing Approaches.

Targeting the tag quality problem, this thesis proposes to find a set of personally related tags for each tag with respect to each individual user. To evaluate the effectiveness of the proposed Folksonomy Models in terms of removing the noise of tags, this thesis compared the recommendation accuracies of the proposed folksonomy based approaches FM-User and FM-Item with those of the following methods:

• Clustering: This approach was used in the work of Niwa et al. (2006) and Shepitsen et al. (2008). Items were clustered based on their tf-iuf weighted tag profiles. Treating a user’s tags as queries, the most relevant items were recommended.

• ARTE: The association rule approach is widely used to expand the tags of users/items with a set of associated tags in order to recommend tags (Li et al., 2008; Heymann et al., 2008). Inspired by the work of Shaw et al. (2009), we used association rules to expand the tags for the purpose of item recommendations. As in the approach of Heymann et al. (2008), each item’s tag set was used as one transaction record in the whole transaction set. Based on the transaction set, a set of association rules with given confidence and support values was generated.

• LDA: This is the Latent Dirichlet Allocation (LDA) approach proposed by Siersdorfer et al. (2009) for item recommendations. The LDA model was used to find the hidden semantic topics of tags and, therefore, remove the noise.
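The tag expansion step of the ARTE baseline can be illustrated with a minimal sketch. It assumes rules with a single tag on each side, mined from item tag sets with minimum support and confidence thresholds; the function names, thresholds and toy transactions are hypothetical, and the cited works may use richer rule forms.

```python
from collections import Counter
from itertools import permutations

def mine_pair_rules(transactions, min_support, min_confidence):
    """Return {antecedent: {consequent, ...}} for single-tag rules a -> b."""
    n = len(transactions)
    single = Counter()
    pair = Counter()
    for tags in transactions:
        single.update(tags)
        pair.update(permutations(sorted(tags), 2))  # ordered pairs (a, b), a != b
    rules = {}
    for (a, b), count in pair.items():
        support = count / n
        confidence = count / single[a]
        if support >= min_support and confidence >= min_confidence:
            rules.setdefault(a, set()).add(b)
    return rules

def expand_tags(tags, rules):
    """Add every consequent whose antecedent is already in the tag set."""
    expanded = set(tags)
    for tag in tags:
        expanded |= rules.get(tag, set())
    return expanded

# Toy transactions (hypothetical): one tag set per item.
item_tags = [
    {"python", "programming"},
    {"python", "programming", "tutorial"},
    {"python", "programming"},
    {"travel", "0403"},
]
rules = mine_pair_rules(item_tags, min_support=0.5, min_confidence=0.8)
expanded_popular = expand_tags({"python"}, rules)
expanded_personal = expand_tags({"0403"}, rules)
```

In this toy run only the frequent tags "python" and "programming" obtain rules, while the rare tag "0403" falls below the support threshold and is returned unexpanded.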

The Top 10 (N=1,…,10) Precision, Recall and F1 measure results of these approaches on Dataset D1 are shown in Figure 6.6, Figure 6.7 and Figure 6.8 respectively.

[Chart: Precision (y-axis, 0 to 0.4) against Top N (x-axis, N = 1 to 10) for FM-User, FM-Item, Clustering, ARTE and LDA]

Figure 6.6 Top 10 Precision Results of Folksonomy Model of Dataset D1 (Comparison 1)

[Chart: Recall (y-axis, 0 to 0.12) against Top N (x-axis, N = 1 to 10) for FM-User, FM-Item, Clustering, ARTE and LDA]

Figure 6.7 Top 10 Recall Results of Folksonomy Model of Dataset D1 (Comparison 1)

[Chart: F1 (y-axis, 0 to 0.16) against Top N (x-axis, N = 1 to 10) for FM-User, FM-Item, Clustering, ARTE and LDA]

Figure 6.8 Top 10 F1 Measure Results of Folksonomy Model of Dataset D1 (Comparison 1)

Discussions:

As shown in the above figures, the proposed user based approach FM-User performed slightly better than the proposed item based approach FM-Item. They both


performed better than the other approaches. The results suggest that the proposed folksonomy based tag representation approach is effective.

The LDA approach had the worst performance. It only processed tags as common textual information. Usually, LDA works well for large and dense textual corpora (e.g., a web page corpus). The short and sparse tag based content representations weakened the performance of LDA. As a result, it was even outperformed by the standard item based CF approach CF-Item. The Top 10 accuracy results of CF-Item of Dataset D1 are shown in Figure 6.9, Figure 6.10, and Figure 6.11.

The experimental results of the association rules based tag expansion approach (ARTE) were unsatisfactory. Since the antecedents and the consequents of each association rule must occur frequently in the transaction dataset, the personal tags that need to be expanded cannot be matched with associated tags. Only the frequent or popular tags were expanded with a set of associated popular tags. This kind of tag expansion can improve the accuracy of tag recommendations because the popular tags have more chances to be used.

However, for item recommendations, the popular tags are usually not so useful in identifying the tag preferences or the relevant topics/tags of items. As a result, ARTE did not achieve a satisfactory level of performance. In addition, the association rule based tag expansion is not a personalised approach. The occurrences of tags are calculated based on the tag names. The same set of associated tags was used to expand the tags of different users if they used the same tag names. Consequently, most of the tag noise could not be detected.

The Clustering approach was a content filtering approach and did not use collaborative filtering. The tags of items were expanded based on the clustering approach. However, only the frequent tags in a cluster were selected to expand the user’s topics. Such frequent tags were unable to identify the most similar items or users on many occasions. As a result, the Clustering approach was outperformed by the proposed approaches.

2) Comparison 2: the Comparisons with the State-of-the-art Baseline Models.

The objective of this set of comparison experiments was to evaluate the overall effectiveness of the proposed recommendation approaches. This was achieved by comparing the proposed approach with state-of-the-art item recommendation approaches based on implicit ratings and tag information. This thesis compared the recommendation accuracies of FM-User and CF-Item with those of the following approaches:

• Graph Rank: This approach was published by Zhang et al. (2010). It examines item recommendation using tagging information. An integrated diffusion-based algorithm making use of both the user-item graph and the item-tag graph was proposed to make personalised item ranks for each user.

• Tag tf-iuf: This approach was proposed by Diederich et al. (2006). The tf-idf tag profiles are used to represent users’ topic preferences. This approach neither considered the noise of tags nor was combined with content filtering methods.

• Tso-Sutter’s approach: This approach was proposed by Tso-Sutter et al. (2008). It uses binary user-item-tag matrices to make recommendations. This is an extended collaborative filtering approach.


The Top 10 Precision, Recall, and F1 results of these approaches of Dataset D1 are shown in Figure 6.9, Figure 6.10, and Figure 6.11 respectively.

[Chart: Precision (y-axis, 0.1 to 0.4) against Top N (x-axis, N = 1 to 10) for FM-User, Graph Rank, Tag tf-iuf, Tso-Sutter’s approach and CF-Item]

Figure 6.9 Top 10 Precision Results of Folksonomy Model of Dataset D1 (Comparison 2)

[Chart: Recall (y-axis, 0 to 0.12) against Top N (x-axis, N = 1 to 10) for FM-User, Graph Rank, Tag tf-iuf, Tso-Sutter’s approach and CF-Item]

Figure 6.10 Top 10 Recall Results of Folksonomy Model of Dataset D1 (Comparison 2)

[Chart: F1 (y-axis, 0 to 0.16) against Top N (x-axis, N = 1 to 10) for FM-User, Graph Rank, Tag tf-iuf, Tso-Sutter’s approach and CF-Item]

Figure 6.11 Top 10 F1 Measure Results of Folksonomy Model of Dataset D1 (Comparison 2)

The Top 10 Precision, Recall, and F1 evaluation results of Dataset D2 for FM-User, Graph Rank, Clustering and CF-Item are shown in Figure 6.12, Figure 6.13, and Figure 6.14 respectively.

[Chart: Precision (y-axis, 0 to 0.3) against Top N (x-axis, N = 1 to 10) for FM-User, Graph Rank, Clustering and CF-Item]

Figure 6.12 Top 10 Precision Results of Folksonomy Model of Dataset D2 (Comparison 2)

[Chart: Recall (y-axis, 0 to 0.16) against Top N (x-axis, N = 1 to 10) for FM-User, Graph Rank, Clustering and CF-Item]

Figure 6.13 Top 10 Recall Results of Folksonomy Model of Dataset D2 (Comparison 2)

[Chart: F1 (y-axis, 0 to 0.14) against Top N (x-axis, N = 1 to 10) for FM-User, Graph Rank, Clustering and CF-Item]

Figure 6.14 Top 10 F1 Measure Results of Folksonomy Model of Dataset D2 (Comparison 2)

Discussions:

The results from Figure 6.9 to Figure 6.14 show that the proposed Folksonomy

Models FM-User and FM-Item outperformed the baseline models for both datasets.


As shown in Figure 6.9 to Figure 6.11, Tso-Sutter’s approach performed slightly better than CF-Item. Tso-Sutter’s approach did not use content filtering or any weighting approaches and ignored personal tagging information. Therefore, this approach failed to improve the accuracy of recommendations substantially.

Although the Tag tf-iuf approach used the tf-idf weighting approach, it did not consider the tag quality problem. It simply removed those tags that were used by fewer than a certain number of users (e.g., no more than 5 users) in the experiments. In addition, it was not combined with the content filtering approach. As a result, the Tag tf-iuf approach did not significantly improve the accuracy of recommendations.

Although the Graph Rank approach performed better than the CF-Item, it performed worse than the proposed approaches. The Graph Rank approach relies on the relationships of users, items and tags. However, it divides the three-dimensional tagging graph into user-tag and tag-item bipartite graphs. Therefore, the three-dimensional relationships reflecting the personal tagging relationships of each individual user were ignored.

6.3.2.3 Discussions

The experimental results of Section 6.3.2.2 demonstrated the effectiveness of the proposed Folksonomy Models. The best performances of the proposed Folksonomy

Models suggest that the proposed Folksonomy based user profiling approach can effectively solve the tag quality problem and improve the recommendation accuracy.

Hypothesis 1 is valid.


The overall precision and recall values were relatively low, mainly because the datasets were not dense. The proposed approaches had the best performance. They relied on both two-dimensional and three-dimensional relationships among tags, users and items to find the personalised semantic meaning of each tag for a user. The proposed approaches also eliminated the noise of tags, profiled a user’s tag preferences and extracted items’ relevant topics/tags accurately. In addition, though no content information of items was used, the proposed approaches benefited from combining the memory based collaborative filtering approaches with the content filtering approach based on the content information given by users (i.e., tags).
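For reference, the Top-N measures reported in this section can be sketched as below. The sketch assumes the usual per-user definitions (precision as hits over N, recall as hits over the number of held-out items, and F1 as their harmonic mean); the function name and the toy lists are hypothetical.

```python
def top_n_metrics(recommended, relevant, n):
    """Precision, recall and F1 of the first n recommended items for one user."""
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    precision = hits / n if n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy ranking and held-out test items for one user (hypothetical).
ranked = ["i7", "i2", "i9", "i4"]
held_out = {"i2", "i4", "i5", "i8"}
p3, r3, f3 = top_n_metrics(ranked, held_out, 3)
```

With one hit among the top three items and four held-out items, the sketch gives a precision of 1/3, a recall of 1/4 and an F1 of 2/7. Dataset-level figures would average these values over all test users.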

6.3.3 Results of Taxonomy Model

The objective of this set of experiments is to verify the following hypothesis:

[Hypothesis 2]: The proposed taxonomy based user profiling approach can

effectively improve the recommendation accuracy.

To verify this hypothesis, the parameterisation of the proposed Taxonomy Models will be discussed first, followed by the comparison with related baseline models. Since

Dataset D2 does not contain taxonomy information, only Dataset D1 is used in this set of experiments.


6.3.3.1 Parameterisation

The settings of the parameters for TM-User, the proposed user based recommendation approach based on taxonomy information, were , ,

0 and . We conducted the set of experiments by setting the values of

and from 0.0 to 1.0 increasing by 0.1 and the values of and were set from 1.0 to 0.0 decreasing by 0.1. The results indicated that with the value ranging from 0.8 to

1.0 and the value ranging from 0.4 to 0.5, the proposed user based approach achieved the best performance on the Dataset D1.

The settings of the parameters for TM-Item, the proposed item based recommendation approach based on taxonomy information, were 0, 1,

0 and . We set the value of from 0.0 to 1.0 increasing by 0.1 and was set from 1.0 to 0.0 decreasing by 0.1. The experimental results indicate that with the value ranging from 0.4 to 0.5, the proposed item based approach achieved the best results on the Dataset D1.

The values of the best settings of the parameters indicate that item preferences played a more important role than topic preferences in user based neighbourhood formation. Collaborative filtering and content matching played equally important roles for the recommendation generations for Dataset D1. This is similar to Folksonomy Models.

The following discussions were given on the basis of the best settings of the parameters of the referred approaches.


6.3.3.2 Experimental Results

To evaluate the effectiveness of the proposed Taxonomy Models, Ziegler’s approach (Ziegler et al., 2004) was chosen as the baseline model to compare with the proposed TM-User and TM-Item models. Both approaches make recommendations based on users’ preferences on taxonomic topics. However, the proposed TM-User and TM-Item models generate users’ taxonomic preferences based on tag information, whereas Ziegler’s approach generates users’ taxonomic preferences from users’ implicit item ratings.

• TPR: This is the Taxonomy-driven Product Recommender proposed by Ziegler et al. (2004). It used implicit ratings but not tag information to generate user profiles. The proposed TM-User and TM-Item models use users’ taxonomic preferences and item preferences to make recommendations. In order to have a fair comparison with the proposed approaches, the TPR was implemented by combining the item preferences and taxonomic topic preferences. This is an extension to Ziegler’s approach (2004).

The Top 10 (N=1,…,10) Precision, Recall and F1 measure evaluation results of

Dataset D1 are shown in Figure 6.15, Figure 6.16, and Figure 6.17.

[Chart: Precision (y-axis, 0 to 0.35) against Top N (x-axis, N = 1 to 10) for TM-User, TM-Item and TPR]

Figure 6.15 Top 10 Precision Results of Taxonomy Model of Dataset D1

[Chart: Recall (y-axis, 0 to 0.08) against Top N (x-axis, N = 1 to 10) for TM-User, TM-Item and TPR]

Figure 6.16 Top 10 Recall Results of Taxonomy Model of Dataset D1

[Chart: F1 (y-axis, 0 to 0.12) against Top N (x-axis, N = 1 to 10) for TM-User, TM-Item and TPR]

Figure 6.17 Top 10 F1 Results of Taxonomy Model of Dataset D1

6.3.3.3 Discussions

When comparing the performance of TPR with the baseline model CF-Item discussed in Section 6.3.1, TPR performed better than CF-Item. This indicates that taxonomy information can effectively improve recommendation accuracy, as claimed in Ziegler’s paper.

As shown in the above figures, the proposed user based approach TM-User performed slightly better than the proposed item based approach TM-Item. They performed better than the TPR approach based on Ziegler’s taxonomic topic weighting approach (Ziegler et al., 2004). This improvement suggests that after considering tagging information and the popularity of taxonomic topics, the accuracy of item recommendations based on item taxonomy can be further improved.


Compared with tag noise removing approaches such as Clustering, ARTE and LDA discussed in Section 6.3.2.2, the proposed TM-User and TM-Item performed better.

The results suggest that taxonomy can be a high quality, standard, well recognised, and independent external vocabulary to annotate or represent the semantic meaning of each tag and reduce the noise of tags caused by the free-formed vocabularies of users.

Therefore, the experimental results in Section 6.3.3.2 suggest that the proposed taxonomy based tag representation approach is effective. The proposed Taxonomy Models benefit from using the standard taxonomy vocabulary and experts’ viewpoints on item classifications and descriptions to reduce the noise of tags. Unlike other taxonomy weighting approaches, the proposed taxonomic topic weighting approach takes the structural information of taxonomy and the popularity of each taxonomic topic into consideration. Moreover, the weighting approach is easy to understand and implement.

The better performances of the proposed Taxonomy Models indicate that the proposed taxonomy based user profiling approach can effectively reduce the noise of tags and improve the recommendation accuracy. Hypothesis 2 is valid.

The comparison of the performances of the proposed Folksonomy Models and the proposed Taxonomy Models will be further discussed in the following sub section.

6.3.4 Hybrid Models vs. Single Models

The purpose of this set of experiments was to verify the following hypothesis:


[Hypothesis 3]: The proposed hybrid user profiling approach performs better

than the proposed user profiling approaches based on only folksonomy or

taxonomy information.

To verify the above hypothesis, the parameterisation of the proposed Hybrid

Models will be discussed firstly. Then, the comparisons of the recommendation accuracies of the proposed Hybrid Models, with those of the proposed Folksonomy

Models and Taxonomy Models, will be discussed. The comparisons of the recommendation accuracies of the Folksonomy Models and Taxonomy Models will also be discussed in detail.

6.3.4.1 Parameterisation

The parameters for FTM-User, the proposed user based hybrid approach, include

, , and that are used to form the user neighbourhood and , , and that are used to make hybrid recommendations. The parameters for the proposed item based hybrid approach FTM-Item include and . The results indicated that with

0.8, = 0.1, =0.1, =0.3, 0.5, 0.2, the proposed user based approach achieved the best performances for Dataset D1. With =0.3, = 0.3, =0.4, =0.3, the proposed item based approach achieved the best performances for Dataset D2.

The following discussions are given on the basis of the best settings of the parameters.


6.3.4.2 Experimental Results

The Top 3 (N=1,2,3) Precision and Recall evaluation results of the hybrid approaches FTM-User and FTM-Item, together with those of the proposed user based Folksonomy Model FM-User and the proposed user based Taxonomy Model TM-User, for Dataset D1 are shown in Figure 6.18 and Figure 6.19.

[Chart: Precision (y-axis, 0.18 to 0.43) against Top N (x-axis, N = 1 to 3) for FTM-User, FTM-Item, FM-User and TM-User]

Figure 6.18 Top 3 Precision Results of Hybrid Model of Dataset D1

[Chart: Recall (y-axis, 0 to 0.07) against Top N (x-axis, N = 1 to 3) for FTM-User, FTM-Item, FM-User and TM-User]

Figure 6.19 Top 3 Recall Results of Hybrid Model of Dataset D1

Discussions:

As shown in the above two figures, the proposed user based hybrid approach

FTM-User performed slightly better than the proposed item based hybrid approach

FTM-Item. The two Hybrid Models performed better than the proposed single models

FM-User and TM-User. The results indicated that after combining the item taxonomy and folksonomy information, the accuracy of item recommendations can be further improved.

Another important finding is that the proposed Folksonomy Model FM-User performed better than the proposed Taxonomy Model TM-User. The following sub section will further discuss and analyse the comparisons of the two models.

6.3.4.3 Folksonomy Models vs. Taxonomy Models

To compare the proposed Folksonomy Model and Taxonomy Model, this thesis will discuss the influences of personal tags for the proposed Folksonomy Model. Then it will discuss the comparison of the two models.

1) The influence of personal tags.

As discussed in Section 6.2.1, the distribution of tags of the two datasets follows a power law distribution, which is similar to other tagging communities (Heymann et al., 2008). This power law distribution suggests that the majority of the tags existing in tagging communities were personal tags. They were usually meaningless to other users and useless for finding neighbours (e.g., “0403” in Figure 3.1). In many approaches (Diederich et al., 2006; Niwa et al., 2006; Tso-Sutter et al., 2008), the personal tags or tags with low popularity were removed in pre-processing. For the proposed approaches, a tag is represented with a set of other tags or a set of taxonomic topics. With the tag representation, a personal tag, which might be meaningless to other users, becomes meaningful. More importantly, the personal tags actually play an important role in improving the accuracy of recommendations.

Let θ denote the popularity of a tag (i.e., the number of users of the tag). The number of tags used by at least θ users in each dataset is plotted in Figure 6.20, where θ was set from 1 to 10 incrementally. Different from the tag distribution graph in Figure 6.1, Figure 6.20 is a more detailed tag distribution graph that only considers those tags with no more than 10 users.
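The counting behind this kind of distribution can be sketched as follows, assuming the tagging data is available as (user, tag) assignments and that popularity counts distinct users. The function name and toy data are hypothetical.

```python
from collections import defaultdict

def tags_with_at_least(assignments, thetas):
    """Map each theta to the number of tags used by at least theta distinct users."""
    users_of = defaultdict(set)
    for user, tag in assignments:
        users_of[tag].add(user)
    popularity = [len(users) for users in users_of.values()]
    return {theta: sum(1 for p in popularity if p >= theta) for theta in thetas}

# Toy (user, tag) assignments (hypothetical).
assignments = [
    ("u1", "python"), ("u2", "python"), ("u3", "python"),
    ("u1", "web"), ("u2", "web"),
    ("u1", "0403"), ("u1", "0403"),  # repeated use by the same user counts once
]
counts = tags_with_at_least(assignments, range(1, 4))
```

Here the personal tag "0403" is counted only at a threshold of 1, which mirrors how personal tags drop out of the distribution as soon as the threshold is raised.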

[Chart: Number of Tags (y-axis, log scale from 1,000 to 100,000) against θ (x-axis, 1 to 10) for Datasets D1 and D2. Data labels:]

θ     1      2      3      4      5     6     7     8     9     10
D1    34201  11298  6906   4897   3784  3051  2544  2155  1884  1648
D2    78414  23229  14045  10299  8229  6839  5837  5124  4573  4131

Figure 6.20 The Number of Tags with Different θ Values

To evaluate the influence of the personal tags on the performance of the proposed approaches, a set of tags whose popularity is larger than or equal to θ was first selected. Then, only those selected tags were retained in the folksonomy based user and item representations. The number of tags used by at least θ users in each dataset and the Top-3 (N=3) precision values of the proposed Folksonomy Model FM-User with different θ values are plotted in Figure 6.21, where θ was set from 1 to 10 incrementally.

As shown in Figure 6.21, with θ = 1, all the tags (i.e., 34,201) were retained in the folksonomy based item and user representations, and the proposed Folksonomy Model FM-User achieved its best precision value of 0.31 for Dataset D1. Similarly, the best precision value for Dataset D2 was achieved with the proposed FM-User approach when all the tags were retained (i.e., 78,414).

The chart suggested that the personal tags can improve the precision results from 0.28 to 0.31 when θ changed from 2 to 1 for Dataset D1. Similarly, the personal tags can improve the precision results from 0.19 to 0.21 when θ changed from 2 to 1 for Dataset D2. Moreover, the graph indicated that although keeping more tags did not necessarily improve the precision values, the precision values decreased dramatically when a large number (i.e., 90%) of tags with lower θ values (i.e., 5) were removed.
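The share of tags removed can be checked against the data labels of Figure 6.20 with a short calculation. The sketch assumes that the threshold in question corresponds to retaining only tags used by at least 5 users; the variable names are hypothetical.

```python
# Tag counts read from the data labels of Figure 6.20.
all_tags = {"D1": 34201, "D2": 78414}   # theta = 1: every tag retained
kept_at_5 = {"D1": 3784, "D2": 8229}    # tags used by at least 5 users

removed_share = {d: 1 - kept_at_5[d] / all_tags[d] for d in all_tags}
for dataset, share in removed_share.items():
    print(f"{dataset}: {share:.1%} of tags removed at theta = 5")
```

Under this assumption roughly 89% of the tags are removed in both datasets, which is in line with the figure of about 90% quoted above.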

[Chart: Top-3 Precision (y-axis, 0.1 to 0.35) against θ (x-axis, 1 to 10). Data labels are (number of retained tags, precision) pairs, listed here for θ = 1 to 10:]

FM-User, D1: (34201, 0.31), (11298, 0.28), (6906, 0.27), (4897, 0.24), (3784, 0.23), (3051, 0.232), (2544, 0.236), (2155, 0.228), (1887, 0.218), (1648, 0.2)

FM-User, D2: (78414, 0.21), (23229, 0.19), (14045, 0.189), (10299, 0.18), (8229, 0.17), (6839, 0.169), (5837, 0.166), (5124, 0.162), (4573, 0.161), (4131, 0.16)

TM-User, D1: (9919, 0.24)

Figure 6.21 Top-3 Precision Results with Different θ Values

2) The comparison of Folksonomy Models and Taxonomy Models.

This thesis compared the Top-3 (N=3) Precision values of the proposed Folksonomy Model FM-User with those of the proposed Taxonomy Model TM-User on Dataset D1. The Top-3 (N=3) precision value of the proposed Taxonomy Model TM-User was 0.24 and there were 9,919 unique taxonomic topics in Dataset D1. As shown in Figure 6.21, the proposed Folksonomy Model FM-User performed worse than the Taxonomy Model TM-User (i.e., Precision 0.24) only when fewer than 4,897 tags were selected with θ > 4.

The comparison results suggested that, after reducing the noise of tags and making use of the rich personal information of tags with the proposed approaches, folksonomy can be used as a quality information source to find users’ topic interests. It can even be used to provide more accurate personalised item recommendations than taxonomy.

6.3.4.4 Discussions

The experimental results of Section 6.3.4.2 demonstrated the effectiveness of the proposed Hybrid Models. The best performances of the proposed Hybrid Models suggest that the proposed hybrid user profiling approach can generate more accurate user, tag and item representations than the proposed user profiling approaches based on only folksonomy or taxonomy information. Therefore, the accuracies of item recommendations can be further improved. Hypothesis 3 is valid.

As the Folksonomy Model contained rich personal information, it had better recommendation accuracy than the proposed Taxonomy Models. The proposed Hybrid Models had the best performances. They benefited from integrating the standard item taxonomy vocabulary and users’ personal vocabularies, as well as considering the viewpoints of both experts and users on item descriptions/classifications.

6.4 PARALLEL USER PROFILING FOR LARGE SCALED RECOMMENDER SYSTEMS

[Hypothesis 4]: The proposed user profiling approaches are scalable and can be

used for large scaled recommender systems.

To verify this hypothesis, this sub section will present a parallel user profiling approach and a scalable recommender system. The current advanced cloud computing


techniques including Hadoop, MapReduce and Cascading are employed to implement the approaches proposed in Chapter 3 in a parallel way. The experiments were conducted on

Amazon EC2 Elastic MapReduce and S3 with a real world large scale dataset from

Del.icio.us website.


In Section 6.4.1, related work will be briefly reviewed. In Section 6.4.2, the large scaled implementation using cloud computing techniques, will be presented. The experiments and associated discussion will be given in Section 6.4.3 and Section 6.4.4.

6.4.1 Related Work

In this subsection, the related large scaled user profiling and recommender systems will be firstly reviewed. Then, a very brief review of the advanced cloud computing techniques will be presented.

6.4.1.1 Large Scaled User Profiling and Recommender Systems

The scalability problem is an important issue for recommender systems (Adomavicius & Tuzhilin, 2005). The large numbers of users, items and other information in real life online communities bring challenges to recommender systems. Some research has focused on scalable neighbourhood based approaches (Papagelis et al., 2005; Takács et al., 2009) and model based recommendation approaches (Rashid et al., 2006).

Mahout (http://lucene.apache.org/mahout/) is a scalable open source recommender system, which is implemented based on Hadoop and MapReduce. However, it only uses users’ explicit ratings to make recommendations rather than using implicit ratings or other useful emerging user information in Web 2.0 such as tags.

More recently, some parallel data mining and knowledge discovery approaches that are based on cloud computing techniques were proposed, such as parallel similarity calculation (Elsayed et al., 2008), clustering (Böse et al., 2010) and random walk (Chiang et al., 2010). Shmueli-Scheuer et al. (2010) explored how to extract user profiles from large scale data using MapReduce. However, this work did not use tag information to profile users.

6.4.1.2 Cloud Computing

Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers (Ghemawat et al., 2003). Hadoop is a top-level Apache project. It is built and used by a community of contributors from all over the world, such as Amazon, Google and Yahoo.

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured). The MapReduce framework consists of two parts: Map and Reduce. For the Map part, the master node takes the input, breaks it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node. In terms of the Reduce part, the master node takes the answers to all the sub-problems and combines them to produce the output, that is, the answer to the problem it was originally trying to solve.
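The Map and Reduce parts can be illustrated with a single-process sketch. It does not use Hadoop; the helper functions, the "user item tag" line format and the toy input are assumptions made only to show the data flow of mapping, grouping by key and reducing.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Combine the values of each key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Example job: number of distinct users per tag from "user item tag" lines.
def tag_user_mapper(line):
    user, _item, tag = line.split()
    return [(tag, user)]

def distinct_count_reducer(_tag, users):
    return len(set(users))

lines = ["u1 i1 python", "u2 i1 python", "u1 i2 python", "u2 i3 web"]
tag_users = reduce_phase(shuffle(map_phase(lines, tag_user_mapper)),
                         distinct_count_reducer)
```

On a real cluster the map calls and the per-key reduce calls would run on different worker nodes; here they run sequentially, but the mapper and reducer functions would keep the same shape.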

Cascading is a feature rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster. The processing API lets the developer quickly assemble complex distributed processes without having to

"think" in MapReduce and to efficiently schedule the processes based on their dependencies and other available meta-data. The Cascading processing model is based on a "pipes and filters" metaphor. The developer uses the Cascading API to assemble pipelines that split, merge, group, or join streams of data while applying operations to each data record or groups of records. As it is simple and user friendly, Cascading has been officially supported by Amazon Elastic MapReduce.

According to the Cascading user guide <http://www.cascading.org/userguide/pdf/userguide.pdf> (2010), the key concepts of Cascading include Flow, Pipe, Tap, Cascade, and others. A pipeline is called a pipe assembly. Before a pipe assembly can be executed, it must be bound to data sources and data sinks, called Taps. The process of binding pipe assemblies to sources and sinks results in a Flow. Flows can be executed on a data cluster like Hadoop. A collection of Flows is called a Cascade. The Pipe types are Pipe, Each, GroupBy, CoGroup, Every, and SubAssembly. The definitions and more detailed information are in the Cascading user guide.

The Cascading flow will be converted into MapReduce jobs that can be executed on a Hadoop cluster with an inner MapReduce Job Planner. Figure 6.22 shows how a reasonably normal Flow would be partitioned into MapReduce jobs.

Figure 6.22 A Cascading Flow and MapReduce Jobs

(This graph is from the Cascading user guide book)

6.4.2 Large Scale Implementation

This sub section focuses on how to parallelise the folksonomy based user profiling approach. It discusses a simplified parallel user based collaborative filtering recommendation approach. The proposed parallel approach can be easily applied to other user profiling approaches and recommendation approaches.

In this sub section, the parameters in Equation 5.5 were set to

The parameters in Equation 5.11 were set to .

The parallel implementation includes four steps: 1) parallel user profiling, 2) parallel neighbourhood formation, 3) parallel recommendation making, and 4) the parallel execution of these three steps.

6.4.2.1 Parallel User Profiling

Figure 6.23 illustrates the proposed parallel user profiling process. The calculation of each user’s preference weight for each tag, shown in Algorithm 3.3 in Table 3.3, can be implemented by Flow fUserProfiling. The two Taps tSource and tSink1 specify the input and output of Flow fUserProfiling. The Flow includes six Pipes: pLine, pItemTag, pUserItem, pUserTag, pTagIuf and pUserProfile. Some Pipes can be processed in parallel, while others need the output of another Pipe as input data. The program fragment that implements this Flow is given in Table B.1 of Appendix B.

[Figure 6.23: data flow diagram of Flow fUserProfiling. Tap tSource (inputPath/inputFiles) feeds Pipe pLine (Split); Pipes pItemTag, pUserItem, pUserTag, pTagIuf and pUserProfile apply GroupBy (GB) and CoGroupBy (CoG) operations; the result is written to Tap tSink1 (outputPath1/UserProfiles).]

Figure 6.23 Parallel Data Flow of User Profiling

The Pipe pLine splits each line of each input file. The Pipe pItemTag calculates the probability of each tag for each item. It can run in parallel with the Pipe pUserItem, which counts the number of tagged items of each user. The outputs of these two Pipes are used as the input data of the Pipe pUserTag, which calculates the average probability of each tag being used to tag the items of the user. The Pipe pUserTag can run in parallel with the Pipe pTagIuf, which calculates the inverse user frequency of each tag. The Pipe pUserProfile multiplies the outputs of pUserTag and pTagIuf to obtain the weight that indicates the degree of each user’s preference for a tag. The results are stored as files in the output path and can be used as input in the next step to calculate the similarity of users.
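The chain of Pipes described above can be illustrated on a single machine with plain Java collections. The following is only a minimal sketch and not the Cascading implementation of Table B.1. It assumes the "user::item::tag" line format suggested by the RegexSplitter in Table B.1, and it further assumes that the probability of a tag for an item is the share of that item's tag assignments using the tag, and that the inverse user frequency is log(number of users / number of users who used the tag). The exact definitions are those of Algorithm 3.3. The class and method names are hypothetical.

```java
import java.util.*;

final class TagProfileSketch {

    /** Parses "user::item::tag" lines and returns user -> (tag -> preference weight). */
    static Map<String, Map<String, Double>> profile(List<String> lines) {
        Map<String, Map<String, Integer>> itemTagCount = new HashMap<>(); // role of pItemTag
        Map<String, Set<String>> userItems = new HashMap<>();             // role of pUserItem
        Map<String, Set<String>> tagUsers = new HashMap<>();              // role of pTagIuf

        for (String line : lines) {                                       // role of pLine
            String[] f = line.split("::");
            String user = f[0], item = f[1], tag = f[2];
            itemTagCount.computeIfAbsent(item, k -> new HashMap<>()).merge(tag, 1, Integer::sum);
            userItems.computeIfAbsent(user, k -> new TreeSet<>()).add(item);
            tagUsers.computeIfAbsent(tag, k -> new HashSet<>()).add(user);
        }

        int userCount = userItems.size();
        Map<String, Map<String, Double>> profiles = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : userItems.entrySet()) {
            Map<String, Double> weights = new TreeMap<>();
            int itemCount = e.getValue().size();
            // Role of pUserTag: average tag probability over the user's items.
            for (String item : e.getValue()) {
                Map<String, Integer> counts = itemTagCount.get(item);
                int total = counts.values().stream().mapToInt(Integer::intValue).sum();
                for (Map.Entry<String, Integer> c : counts.entrySet()) {
                    double p = (double) c.getValue() / total; // assumed tag probability
                    weights.merge(c.getKey(), p / itemCount, Double::sum);
                }
            }
            // Role of pUserProfile: multiply by the (assumed) inverse user frequency.
            for (Map.Entry<String, Double> w : weights.entrySet()) {
                double iuf = Math.log((double) userCount / tagUsers.get(w.getKey()).size());
                w.setValue(w.getValue() * iuf);
            }
            profiles.put(e.getKey(), weights);
        }
        return profiles;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "u1::i1::java", "u2::i1::java", "u2::i1::code", "u2::i2::hadoop");
        System.out.println(profile(lines));
    }
}
```

In this toy input the tag "java" is used by every user, so its assumed inverse user frequency, and therefore its weight, is zero, while tags used by fewer users keep a positive weight.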

6.4.2.2 Parallel Neighbourhood Formation

The key job of this step is to calculate the similarity of each user pair. To facilitate the implementation of a complete recommender system in Cascading, we discuss how to parallelise the cosine similarity calculation in Cascading. Figure 6.24 illustrates the proposed parallel cosine similarity calculation approach. The program fragment that implements this Flow is given in Table B.2 of Appendix B.

It parallelises the calculation of the accumulated sum of the squared elements of each vector and that of the products of the non-zero elements of two vectors. The Flow fNeighborhoodForming also includes six Pipes: pUserI, pUserJ, pPowerI, pPowerJ, pMultiplyIJ, and pSimilarityIJ.

[Figure 6.24: data flow diagram of Flow fNeighborhoodForming. Each profileLine is split by Pipes pUserI and pUserJ; Pipes pPowerI and pPowerJ (power), pMultiplyIJ (CoG, multiply) and pSimilarityIJ (CoG, sim) follow. Labelled Taps and paths: tSink2, tSink3, outputPath1/UserProfiles, outputPath2/SimFiles. Legend: GB = GroupBy, CoG = CoGroupBy.]

Figure 6.24 Parallel Data Flow of Neighborhood Forming

The input of this Flow is the output user profiles of Flow fUserProfiling. Two Pipes, pUserI and pUserJ, are used to split the input lines. The Pipes pPowerI and pPowerJ calculate the accumulated sum of the squared elements of each of the two tag vectors. They run in parallel with the Pipe pMultiplyIJ, which calculates the accumulated sum of the products of the non-zero elements of the two tag vectors. The Pipe pSimilarityIJ first calculates the square root of the product of the outputs of pPowerI and pPowerJ. The output of pMultiplyIJ is then divided by this value to obtain the cosine similarity of the two users. The output files are stored in another output path. The output of this Flow can also be used to generate recommendations directly.
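The decomposition of cosine similarity into three independent sums can be sketched in plain Java as follows. This is a single-machine illustration under the assumption that each user profile is a sparse map from tag to weight. It is not the Cascading code of Table B.2, and the class and method names are hypothetical.

```java
import java.util.*;

final class CosineSketch {

    /** Sum of squared weights of one profile (the role of pPowerI and pPowerJ). */
    static double sumOfSquares(Map<String, Double> profile) {
        double sum = 0.0;
        for (double w : profile.values()) {
            sum += w * w;
        }
        return sum;
    }

    /** Sum of products over tags present in both profiles (the role of pMultiplyIJ). */
    static double dotProduct(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Dot product divided by the square root of the product of the two sums of squares. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double norm = Math.sqrt(sumOfSquares(a) * sumOfSquares(b));
        return norm == 0.0 ? 0.0 : dotProduct(a, b) / norm;
    }

    public static void main(String[] args) {
        Map<String, Double> ui = new HashMap<>();
        ui.put("java", 3.0);
        ui.put("code", 4.0);
        Map<String, Double> uj = new HashMap<>();
        uj.put("java", 3.0);
        uj.put("hadoop", 4.0);
        System.out.println(cosine(ui, uj));
    }
}
```

Because the two sums of squares depend on one user each and the dot product only on shared tags, the three quantities have no mutual dependency, which is why the corresponding Pipes can run in parallel before the final division.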

6.4.2.3 Parallel Recommendation Making

Figure 6.25 illustrates the proposed parallel recommendation making approach. The Flow fRecommending includes two Pipes: pCandidateItem and pRecommender. The input of pRecommender is the output of Flow fNeighborhoodForming. After the items already tagged by the target user have been filtered out, the items tagged by the top K neighbours are selected as candidate items. The Pipe pRecommender calculates the prediction score of each candidate item for each user. After the candidate items have been ranked by prediction score, the final results are stored as files in an output path.
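The candidate selection and ranking steps can be sketched for a single target user as follows. This is a minimal illustration rather than the thesis implementation. In particular, the prediction score used here (the sum of the similarities of the top K neighbours who tagged the item) is an assumption made for the example, and the class and method names are hypothetical.

```java
import java.util.*;

final class RecommendSketch {

    /** Returns the top-n candidate items for the target user, ranked by score. */
    static List<String> recommend(String target,
                                  Map<String, Double> similarityToTarget,
                                  Map<String, Set<String>> userItems,
                                  int k, int n) {
        // Select the top K neighbours by descending similarity (ties broken by name).
        List<Map.Entry<String, Double>> neighbours =
            new ArrayList<>(similarityToTarget.entrySet());
        neighbours.sort((x, y) -> {
            int c = Double.compare(y.getValue(), x.getValue());
            return c != 0 ? c : x.getKey().compareTo(y.getKey());
        });
        int topK = Math.min(k, neighbours.size());

        // Filter out items already tagged by the target user and score the rest.
        Set<String> seen = userItems.getOrDefault(target, Collections.emptySet());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> nb : neighbours.subList(0, topK)) {
            for (String item : userItems.getOrDefault(nb.getKey(), Collections.emptySet())) {
                if (!seen.contains(item)) {
                    scores.merge(item, nb.getValue(), Double::sum); // assumed score
                }
            }
        }

        // Rank the candidate items by descending score (ties broken by name).
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> {
            int c = Double.compare(scores.get(y), scores.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }

    public static void main(String[] args) {
        Map<String, Double> sims = new HashMap<>();
        sims.put("u2", 0.9);
        sims.put("u3", 0.5);
        sims.put("u4", 0.1);
        Map<String, Set<String>> items = new HashMap<>();
        items.put("u1", new HashSet<>(Arrays.asList("a")));
        items.put("u2", new HashSet<>(Arrays.asList("a", "b")));
        items.put("u3", new HashSet<>(Arrays.asList("b", "c")));
        items.put("u4", new HashSet<>(Arrays.asList("d")));
        System.out.println(recommend("u1", sims, items, 2, 2));
    }
}
```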

[Figure 6.25: data flow diagram of Flow fRecommending. Input B passes through topK and CoG into Pipe pRecommender (rec); input A passes through filter into Pipe pCandidateItem; the result is written to Tap tSink4 (outputPath3/RecFiles). Legend: GB = GroupBy, CoG = CoGroupBy.]

Figure 6.25 Parallel Data Flow of Recommendation Making

6.4.2.4 The Parallel of the Above Three Steps

The above three steps have an obvious dependency relationship. Because Cascading checks the actual dependencies among Flow instances at run time, the three Flows can be submitted together as one Cascade and are scheduled according to those dependencies. For example, if two or more Flow instances have no dependencies on each other, they are submitted together so that they can execute in parallel. The program fragment that runs the three steps together with Cascading is shown in Table B.3 in Appendix B.
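The idea of scheduling Flows by their dependencies can be illustrated with a small plain-Java sketch. It does not reproduce Cascading's own planner. It only assumes that a Flow depends on another Flow when it reads a path that the other Flow writes, and it groups Flows into waves whose members could run in parallel. The FlowSpec class, the path names and the assumption that fRecommending also reads the original input are hypothetical.

```java
import java.util.*;

final class FlowOrderSketch {

    /** Hypothetical description of a flow: the paths it reads and the path it writes. */
    static final class FlowSpec {
        final String name;
        final String sink;
        final Set<String> sources;

        FlowSpec(String name, String sink, String... sources) {
            this.name = name;
            this.sink = sink;
            this.sources = new HashSet<>(Arrays.asList(sources));
        }
    }

    /** Groups flows into waves; the flows of one wave do not depend on each other. */
    static List<List<String>> waves(List<FlowSpec> flows) {
        Set<String> produced = new HashSet<>();  // paths that some flow will write
        for (FlowSpec f : flows) {
            produced.add(f.sink);
        }
        Set<String> available = new HashSet<>(); // paths that have already been written
        List<FlowSpec> pending = new ArrayList<>(flows);
        List<List<String>> result = new ArrayList<>();

        while (!pending.isEmpty()) {
            List<FlowSpec> ready = new ArrayList<>();
            for (FlowSpec f : pending) {
                boolean ok = true;
                for (String s : f.sources) {
                    // A source blocks the flow only if another flow still has to write it.
                    if (produced.contains(s) && !available.contains(s)) {
                        ok = false;
                        break;
                    }
                }
                if (ok) {
                    ready.add(f);
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("cyclic dependency between flows");
            }
            List<String> names = new ArrayList<>();
            for (FlowSpec f : ready) {
                names.add(f.name);
                available.add(f.sink);
            }
            pending.removeAll(ready);
            result.add(names);
        }
        return result;
    }

    public static void main(String[] args) {
        List<FlowSpec> flows = Arrays.asList(
            new FlowSpec("fUserProfiling", "outputPath1", "inputPath"),
            new FlowSpec("fNeighborhoodForming", "outputPath2", "outputPath1"),
            new FlowSpec("fRecommending", "outputPath3", "outputPath2", "inputPath"));
        System.out.println(waves(flows));
    }
}
```

For the three Flows of this section the sketch yields three waves of one Flow each, reflecting their chain of dependencies. A Flow that reads only the original input would join the first wave.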

6.4.3 Experiments and Results

This subsection discusses the evaluation experiments and results of the proposed parallel user profiling approach implemented with Cascading MapReduce.

6.4.3.1 Dataset Preparation

We conducted the experiments with the Del.icio.us dataset downloaded from the web site http://robertwetzker.com/2009/06/25/delicious-dataset-for-download/. The dataset contains all public bookmarks of about 950,000 users retrieved from delicious.com between September 2003 and April 2008. The retrieval process resulted in about 132 million bookmarks or 420 million tag assignments. The full corpus is about 7GB of compressed data. It is one of the largest folksonomy datasets used in research to date. The details of the dataset are discussed in (Wetzker et al., 2008).

6.4.3.2 Experiment Setup

The experimental environment includes the development environment and the computation environment. A local desktop machine with internet access was used as the development environment. Since its operating system was Windows XP, the latest version of Cygwin (http://www.cygwin.com/) was installed as the shell to run Linux/Unix commands. Java was used as the programming language and JDK 1.6.0 was installed. Eclipse 3.4.1 (http://www.eclipse.org/) was installed as the programming and building tool. Hadoop 0.18.3 and Cascading 1.0.18 were installed. The standalone mode of Hadoop was used to debug the programs.

The computation platform was Amazon EC2 Elastic MapReduce clouds (i.e., Amazon EMR, http://aws.amazon.com/elasticmapreduce/). Clients can select the type and size of the clouds and pay according to the actual running time. Amazon EMR supports Cascading and Hadoop MapReduce and can be operated from the AWS console, which makes it easy and convenient to submit and run custom applications and jobs on Amazon EC2. Amazon S3 was used to store the input and output data. CloudBerry Explorer for Amazon S3 (http://cloudberrylab.com/?page=cloudberry-explorer-amazon-s3) was installed on the local machine to transfer data between the local machine and Amazon S3.

The setup and configuration of the experimental environment were as follows:

1) The setup of the development environment. The first step was to set up a development environment in which to develop the custom application and make sure that its syntax and business logic were correct. As mentioned in 4.2.1, JDK 1.6.0, Eclipse 3.4.1, Hadoop 0.18.3 and Cascading 1.0.18 were installed. After JAVA_HOME has been set for Hadoop, users can run the WordCount example of Hadoop in the standalone mode to test whether the development environment has been configured successfully.

2) The implementation of the custom application. Once the development environment had been set up successfully, we implemented the custom application and built the executable jar file with Eclipse.

3) The next step was to run and debug the custom executable jar file in Cygwin or Linux/Unix with Hadoop.

4) The final step was to run the custom executable jar file in Amazon EC2 clouds.

After the custom application has run successfully in the standalone mode of Hadoop, users can submit it to real clouds such as Amazon EC2. Before submitting the job, users need an Amazon EC2 account and need to sign up for the Amazon Elastic MapReduce and Amazon S3 services. S3 is used to store the input and output data while EMR is used for computation. Although users can use Hadoop commands to download and upload data, data transfer software makes it more convenient to transfer data between the local machine and S3. After installing CloudBerry Explorer for Amazon S3 on the local machine, users can use this software to connect to Amazon S3 with their security credentials.

Then, users need to create a bucket in S3. Users can create folders in the bucket and then upload the custom jar file and input data to it. After uploading the jar and input data to S3, users can run the jar in EC2. Since the AWS console supports job submission and job management through a web browser, it is easy to submit and run custom jobs in EC2. In the Elastic MapReduce section of the AWS Management Console (http://aws.amazon.com/console/), users can create a new work flow and select custom jar as the job type. Then users need to specify the Jar Location and Arguments. The path of Jar Location is /path/to/jar while the input data path and output data path are s3n:///path/to/input/ and s3n:///path/to/output. An “input path of file doesn’t exist” error occurs if the path setting is not correct. After the size, CPU and memory type of the clouds have been selected, the job can be run in Amazon EMR. With the console, users can easily debug and manage the submitted jobs. The output data is stored in the created S3 bucket and can be downloaded to the local machine with CloudBerry Explorer for Amazon S3.

6.4.3.3 Experimental Results

To evaluate the effectiveness of the proposed parallel user profiling and recommendation approaches, we created three Jobs to run on Amazon EC2 EMR clouds and conducted two comparison experiments on both the clouds and a local desktop machine.

Job 1 was to conduct user profiling for the whole Del.icio.us dataset with the proposed parallel user profiling approach. The clouds were configured with 10 nodes of the high performance CPU type at medium scale. It took 11 hours and 18 minutes to finish the job.

One month of Del.icio.us data was used to conduct efficiency comparison experiments. Job 2 was to conduct parallel user profiling with one month of data (May 2004) on small-scale clouds of Amazon EC2 EMR. It took 24 minutes to process this month of the Delicious dataset on a 4-node small-scale cloud. To evaluate the efficiency of the parallel user profiling approach implemented in Cascading MapReduce, we also implemented the user profiling Algorithm 1 in Java in a non-parallel way. It was run on a local Dell desktop machine with 1 GB of memory and a 3 GHz CPU. It took 1 hour and 10 minutes to process the same data. The running times of Job 2 are shown in Figure 6.26.

[Figure 6.26: bar chart of running time in minutes (axis from 0 to 80) for ParallelUserProfiling and UserProfiling.]

Figure 6.26 The Running Time Comparison of Job 2

Job 3 was to conduct the proposed parallel recommendation making process with the same one month of data on clouds of the same scale as Job 2. It took 2 hours and 14 minutes to process the data on a 4-node small-scale cloud. As with Job 2, we also implemented the recommendation approach Algorithm 2 in Java in a non-parallel way and ran it on the same local desktop machine. It took 7 hours to process the same data. The running times of Job 3 are shown in Figure 6.27.

[Figure 6.27: bar chart of running time in minutes (axis from 0 to 450) for parallelRecsys and Recsys.]

Figure 6.27 The Running Time Comparison of Job 3

6.4.4 Discussions

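The results in Figure 6.26 and Figure 6.27 show that the parallel user profiling and recommendation making approaches can effectively improve the efficiency of large-scale data processing: the parallel implementations took 24 minutes instead of 1 hour and 10 minutes for Job 2, and 2 hours and 14 minutes instead of 7 hours for Job 3. The proposed user profiling approaches are scalable and can be used for large-scale recommender systems. Hypothesis 4 is therefore supported.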

The experimental results also suggested that Cascading MapReduce and Cloud computing services can provide possible new solutions to help users to solve the scalability problem of user profiling and recommender systems.
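The running times reported for Job 2 and Job 3 can be turned into speedup ratios with a few lines of Java. This is only a worked calculation on the figures given above. The ratio compares a 4-node cloud with a single desktop machine of different hardware, so it is an end-to-end comparison and not a measure of parallel efficiency. The class and method names are hypothetical.

```java
final class SpeedupSketch {

    /** Ratio of the non-parallel running time to the parallel running time. */
    static double speedup(double serialMinutes, double parallelMinutes) {
        return serialMinutes / parallelMinutes;
    }

    public static void main(String[] args) {
        // Job 2: 1 hour 10 minutes on the desktop versus 24 minutes on the cloud.
        System.out.println("Job 2 speedup: " + speedup(70, 24));
        // Job 3: 7 hours on the desktop versus 2 hours 14 minutes on the cloud.
        System.out.println("Job 3 speedup: " + speedup(7 * 60, 2 * 60 + 14));
    }
}
```

Both ratios come out at roughly three (about 2.9 for Job 2 and about 3.1 for Job 3).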

6.5 CHAPTER SUMMARY

This chapter evaluated the effectiveness and scalability of the proposed user profiling approaches and recommendation approaches. Two real world datasets collected from Amazon.com and CiteULike were used to evaluate the effectiveness of the proposed approaches. The experiments conducted on the two datasets suggested that the proposed approaches performed better than other state-of-the-art work and can effectively enhance recommendation accuracy. The experimental results show that personal tags can improve the recommendation accuracy. The results also suggested that after the noise of tags was reduced and the rich personal information of tags was taken into account, folksonomy can be used as a quality information source to profile users. The proposed hybrid approaches achieved the best accuracy, which indicated that the integration of the vocabulary and viewpoints of crowds and experts can further improve the accuracy of user profiling and item recommendations.

A large-scale dataset collected from the Del.icio.us website was used to evaluate the scalability of the proposed user profiling approaches. Advanced cloud computing techniques such as Hadoop, MapReduce and Cascading were employed to implement the proposed folksonomy based user profiling approach in a parallel way. The experimental results suggested that the parallel user profiling implementation is effective and that the proposed user profiling approaches are scalable and can be used for large-scale recommender systems.

Chapter 7

Conclusion and Future Work

7.1 CONCLUSIONS

The emerging user information in Web 2.0 provides new possible solutions to profile users and make recommendations. Folksonomy or tag information is a typical kind of Web 2.0 information and offers many advantages. It is simple, domain free and less intrusive, implies rich personal topic interests and opinion information, has multiple functions and is understandable by humans. Due to these advantages, folksonomy has become another important information source to profile users and make recommendations.

This thesis investigated the distinctive features of folksonomy and modelled the multiple relationships of folksonomy. To solve the tag quality problem and to generate more accurate user profiles and item descriptions, this thesis proposed three user profiling approaches:

• Folksonomy based user profiling approach. To reduce the noise of tags, this approach was proposed to find the personally related tags of each tag for each individual user. The tag vocabulary and viewpoints of community users on item classifications and descriptions were used to profile users’ topic preferences and items’ relevant topics. Tag weighting approaches based on the multi-dimensional relationships among users, items and tags and on the popularity of tags were proposed. The user profiles and item descriptions represented by users’ tag vocabulary were generated.

• Taxonomy based user profiling approach. In this approach, the item taxonomy information, reflecting the wisdom of experts and common understanding on item classifications and descriptions, was used to determine the personally related taxonomic topics of each tag for each individual user. The purpose was to reduce the noise of tags. An improved weighting approach that considers the structural information of taxonomy and the popularity of taxonomic topics was proposed to measure the weights of taxonomic topics. The user profiles and item descriptions represented by the standard and controlled taxonomy vocabulary were generated.

• Hybrid user profiling approach based on folksonomy and taxonomy. This hybrid approach integrated the wisdom of both crowds and experts to represent tags, to profile users and to describe items. The user profiles and item descriptions represented by both tag vocabulary and the standard taxonomy vocabulary were generated.

Moreover, this thesis explored how to utilise the proposed user profiling approaches to enhance the top-N item recommendation task. Based on different user profiles, different neighbourhoods can be formed and different item ranking values can be predicted. Based on the generated user profiles, user and item based collaborative filtering approaches, combined with content filtering methods, were proposed. The proposed recommendation approaches considered both objective item taxonomy information and subjective user opinions on item classifications and descriptions to form neighbourhoods and rank items.

This thesis also conducted extensive effectiveness evaluation experiments on two real world datasets collected from Amazon.com and CiteULike. The comparison results suggested that the proposed user profiling and recommendation approaches performed better than other state-of-the-art work. The proposed folksonomy based user profiling and recommendation approaches performed better than the proposed taxonomy based user profiling and recommendation approaches. This suggested that after we reduced the noise of tags and made use of the rich personal information of tags, folksonomy can be used as a quality information source to find users’ topic interests. The proposed hybrid user profiling and recommendation approaches achieved the best accuracy. This indicated that the integration of the vocabulary and viewpoints of crowds and experts can further improve the accuracy of user profiling and item recommendation making.

In addition, to make the proposed user profiling approaches scalable and useful for large-scale recommender systems, this thesis proposed a parallel user profiling implementation based on advanced cloud computing techniques such as Hadoop, MapReduce and Cascading. The scalability evaluation experiments were conducted on a large-scale dataset collected from the Del.icio.us website. The efficiency comparison results suggested that the parallel user profiling implementation is effective and that the proposed user profiling approaches are scalable.

7.2 CONTRIBUTIONS

This thesis helps users deal with the information overload issue by proposing more accurate, effective and efficient user profiling and recommendation approaches based on folksonomy information in Web 2.0. Specifically, this thesis makes a number of contributions to web personalisation, recommender systems, and Web 2.0 or the Social Web.

• This thesis contributes to user profiling and web personalisation.

A key issue of web personalisation is the lack of quality, sufficient, and less intrusive information about users’ interests or preferences. The large amount of new information contributed by users in Web 2.0 provides new sources to profile users. This thesis focuses on how to profile users accurately based on folksonomy, a typical kind of Web 2.0 information. The distinctive advantages and disadvantages of folksonomy information have been investigated. Although folksonomy contains users’ explicit topic interests and preference information, the noise contained in folksonomy, such as tag synonyms, semantic ambiguity, and personal tags, makes it difficult to profile users. In some research, folksonomy information was believed to be less useful or even to decrease the accuracy of user profiling and personalisation. The three user profiling approaches proposed in this thesis can effectively reduce the noise of folksonomy and generate more accurate user profiles. After the noise has been reduced with the proposed approaches, folksonomy can be used as a quality information source to profile users and enhance personalisation applications.

The proposed tag representation approaches that reduce the noise of user vocabularies can also be applied to other application areas such as query expansion and searching. This research also contributes to natural language and text processing areas.

• This thesis contributes to effectively utilising the new Web 2.0 user information source in recommender systems.

Compared with explicit ratings and other implicit rating information such as web logs and click streams, folksonomy information is less intrusive, humanly understandable, and lightweight. Besides profiling users with folksonomy information, this thesis also proposed approaches to describe and represent items based on folksonomy or taxonomy information. Usually, items are represented by objective features such as the keywords extracted from the content information of items, taxonomy topics and the metadata given by experts. With folksonomy information, on the other hand, users’ opinions on item classifications and descriptions form another kind of feature that relates to the social usage of items. Those features are subjective and dependent on users. They influence the determination of peer neighbour users or items. The proposed recommendation approaches incorporate the two kinds of features, which reflect the opinions of users and experts, to measure the similarity of users and items and to rank candidate items. This is a new contribution to recommender systems.

• This thesis contributes to the research of Web 2.0.

This thesis modelled the multiple relationships among users, items and tags. It is the first work to formally model the relationships of folksonomy. Eight kinds of relationships, including six two-dimensional mappings and two three-dimensional mappings in folksonomy, were defined in this thesis. The three-dimensional relationship that records personal tagging behaviour, which is ignored by other approaches, plays a very important role in reducing the noise of tags and making personalised recommendations. This is a new contribution to the research and usage of folksonomy information. The proposed relationship modelling approach can also be applied to other user created online information such as blogs, reviews, and message posts (e.g., tweets).

Also, the analysis of the influences of personal tags and popular tags contributes to the design of collaborative tagging systems. Besides better usage of folksonomy, this thesis also contributes to better usage of taxonomy information. Moreover, it contributes to effectively integrating the wisdom of crowds and experts to improve the accuracy of user profiling and item recommendations. This helps to bridge the usage gap between Web 1.0 and Web 2.0 information.

• This thesis contributes to solving the scalability issue in user profiling.

This thesis proposed a parallel user profiling implementation based on the current advanced cloud computing techniques. It contributes to providing a new low cost and user friendly solution to help people solve the scalability issue of user profiling and recommender systems. This part of the work can also be used as an example or a case study to help the readers to create and run their own applications with cloud computing techniques quickly.

Being domain free and language free are two other important advantages of the proposed user profiling and recommendation approaches. Since folksonomy can be used to describe any type of item, the proposed approaches can be used to recommend any type of item, such as textual content items and multimedia items. The proposed approaches do not require any pre-processing or textual processing of tags, such as segmentation or stemming. Although English tag words are used as examples in this thesis, the proposed approaches can be used for any folksonomy in any language.

The work proposed in this thesis requires less user information than other hybrid user profiling approaches that combine tags with other resources, including blogs, videos, texts or tweets. It is applicable in any situation where users’ folksonomy information is available, while the other hybrid approaches require extra user information. Moreover, the approaches proposed in this thesis can be used to further improve the accuracy of user profiling and recommendation for other hybrid user profiling approaches that combine tags and other resources.

7.3 LIMITATIONS AND FUTURE WORK

7.3.1 Limitations

This thesis has the following limitations:

• The thesis work focuses on the research of folksonomy information. The proposed approaches are not applicable to situations where there is no folksonomy information. The effectiveness of the proposed approaches may be weakened when there is very little tagging information available.

• This thesis does not conduct sentiment analysis of tags. Users’ sentimental orientations contained in some tags are not considered in the proposed approaches. This may affect the accuracy and completeness of user profiling to some degree.

• The integration of tags and other popular user information such as explicit ratings, reviews, and tweets is not in the scope of this thesis. This is an area that could be explored in future work.

7.3.2 Future Work

Future research related to this study could extend in the following directions.

Sentiment analysis of tags could be explored. Opinion mining techniques could be used to profile users’ sentimental orientations, further improving the accuracy of user profiling and of recommendations based on folksonomy information. Another area of future work is to explore effective user profiling and recommendation making approaches for situations where little tagging information is available.

Since the content information is generated by the collaborative tagging of users, the proposed approaches, although they combine collaborative filtering and content based approaches, still have drawbacks similar to those of other collaborative filtering approaches, such as the cold start, data sparsity, new item and new user problems that arise when a user has tagged very few items or an item has been tagged by very few users. Solutions to these problems will be further investigated.

The user profiling and recommendation making approaches proposed in this thesis could be further extended to combine different folksonomies to make cross-folksonomy item recommendations. The proposed taxonomy based representations could also be used to compare, bridge and integrate different folksonomies.

This research could be extended to integrate other user information in order to generate more accurate and complete user profiles. Rather than being used as standalone information, tags are often used together with other user created online information, such as blogs, reviews, videos, and tweets. How to further improve user profiling accuracy by combining tags with these kinds of user information is an important research question that needs to be explored in the future.

In addition, how to use folksonomy information and explicit ratings to improve the performance of rating prediction tasks in recommender systems requires further research. The combination of tags with other social relationship information such as friends, followers, and trust networks forms another important area of future work.

Appendix A: Example Folksonomy Tags

• Example Popular Tags of Amazon.com Dataset

fiction vampire self-help memoir love science humor thriller historical fiction comics fantasy history business horror philosophy poetry travel mystery science fiction psychology art suspense spirituality book romance adventure women religion literature paranormal romance politics magic inspirational historical romance historical cookbook biography buddhism depression theology africa classics diet ethics physics ghosts anatomy jk rowling dating freedom life lies yoga angels folklore nature intelligence education leadership medieval epic fantasy crime peace travel guide satire gardening world war ii sports grief photography statistics essays contemporary romance horses detective dragon law of attraction war witchcraft drugs dystopia paranormal personal growth sex marriage propaganda law nutrition language death god shakespeare gay fun mysteries and thrillers cia management drama cats great fiction america recipes medicine friendship mythology espionage economics supernatural relationships anthropology historical novel sales erotic football fashion money ecology beauty love story cold war faith end of the world networking china new age parenting happiness hope

• Example Personal Tags of Amazon.com Dataset

dvd-dislikes the restaurant at the end of the universe stewardson commando sarah camburn gammell pure sci fi imajinn books louise l hay bellydance workout music wilson american paanda cake bretdougherty fixer-uppers winston brothers alpinism ted hughes learn photoshop carl sagan donald gerds 12-12-06 jake logan lucinda williams laurence g boldt martian die ufa story

• Example Popular Tags of CiteULike Dataset

protein microarray method communication research algorithms systems tagging memory alignment dynamics science algorithm optimization methodology information learning no-tag education search selection architecture control programming human context community classification database evaluation regulation networks collaboration history interaction modelling data statistics complexity web graph svm rna philosophy language expression hci security psychology dna management network software cognition economics attention genetics book simulation genome prediction ontology genomics internet bibtex-import design development biology mobile evolution visualization clustering technology knowledge semantic yeast modeling vision culture theory semantics methods gene brain analysis cancer folksonomy bayesian structure social bioinformatics fmri experimental framework activity monads synthesis ecoli systems-biology depression service diversity africa society resource adaptive pagerank cloud ranking hippocampus plant kinetics workflow breast face motion behavior sequencing obesity reasoning variation estimation fitness behaviour splicing diabetes applications visualisation lisp life profile inhibition epistemology markov democracy datamining monte-carlo usability cells mutation spatial optics metadata informatics clinical guidelines nature biodiversity practice geography intelligence fluorescence plants rnai pathways media chromatin family robot ethnography recombination empirical social-networks recommendation

• Example Personal Tags of CiteULike Dataset

anastasio_ma presynaptique reprograming_factors cs-learning fossil_fuel fm-h modern_behaviour hidden_terminal myosin-xi bicamerality _www epsetg0516 technology_management gernot riemann-surface dirac_operator csc_nays tremula interest_point envelopes brewster informatik irf-1 protein-quantitation- complete_stx h-ii 3561 base_calling tola08

Appendix B: Parallel Implementation Based on Cascading MapReduce

• Table B.1 Program Fragment of Flow fUserProfiling

Program Fragment 1: The Implementation of Flow fUserProfiling

// The settings of flow fUserProfiling
String inputPath = args[0];
String outputPath1 = args[1];
Tap tSource = new Hfs(new TextLine(), inputPath);
Tap tSink1 = new Hfs(new TextLine(), outputPath1 + "/", true);

// The implementation of Pipe pLine: split each line into user, item and tag on "::"
Pipe pLine = new Each("pline", new Fields("line"),
    new RegexSplitter(new Fields("user1", "item1", "tag1"), "::"));

// The implementation of Pipe pUserItem
/* Count is a built-in aggregator of Cascading that calculates the number of
   items in the current group */
Pipe pUserItem = new GroupBy("pUserItem", pLine,
    new Fields("user1", "item1"), new Fields("user1"));
pUserItem = new Every(pUserItem, new Fields("user1", "item1"),
    new Count(new Fields("count1")));

// The implementations of other pipes are omitted
…
Map source_1 = Cascades.tapsMap(Pipe.pipes(pLine), Tap.taps(tSource));

Flow flow_userprofile = flowConnector.connect("userProfiling algorithm",
    source_1, tSink1, pUserProfile);
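
For readers unfamiliar with Cascading, the first two pipes of Program Fragment 1 can be mimicked on a single machine in plain Java. The sketch below is only an illustration of what the split, group and count steps compute. The class name UserItemTagCount, the method name and the sample records are hypothetical, and skipping malformed lines is an assumption that the fragment itself does not state.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone illustration; not part of the thesis code base.
class UserItemTagCount {

    // Splits each "user::item::tag" line (the role of RegexSplitter) and counts
    // the tag assignments per (user, item) pair (the role of GroupBy + Count).
    static Map<String, Long> countTagsPerUserItem(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("::");
            if (parts.length != 3) {
                continue; // assumption: malformed lines are ignored
            }
            String userItemKey = parts[0] + "::" + parts[1];
            counts.merge(userItemKey, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample records are made up for the example.
        List<String> lines = Arrays.asList(
            "u1::i1::web", "u1::i1::tagging", "u1::i2::java", "u2::i1::web");
        System.out.println(countTagsPerUserItem(lines));
    }
}
```

In the MapReduce setting the same grouping is performed by the shuffle phase, so each reducer sees all tuples of one (user, item) key.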

 Table B.2 Program Fragment of Flow fNeighbourhoodForming

Program Fragment 2: The Implementation of Flow fNeighbourhoodForming

…
// The implementation of Pipe pUserI: split each profile line into user, tag and weight
Pipe pUserI = new Each("pUserI", new Fields("profileLine"),
    new RegexSplitter(new Fields("u1", "t1", "w1")));

// The implementation of Pipe pPowerI
/* power is a user-defined aggregator class */
Pipe pPowerI = new GroupBy("pPowerI", pUserI, new Fields("u1"), new Fields("u1"));
pPowerI = new Every(pPowerI, new Fields("w1"),
    new power(new Fields("power1")), new Fields("u1", "power1"));
…
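
The body of the user-defined power aggregator is not shown in Program Fragment 2. The following plain-Java sketch assumes that it accumulates the sum of squared tag weights of each user profile, as would be needed for the denominator of a cosine similarity. This reading is an assumption, not something the fragment confirms. The class name ProfilePower, the method name and the sample weights are hypothetical. The tab delimiter mirrors the default delimiter of RegexSplitter when no pattern is given.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone illustration; not part of the thesis code base.
class ProfilePower {

    // Assumption: "power" is the sum of squared tag weights of one user profile.
    // Each profile line has the form user<TAB>tag<TAB>weight.
    static Map<String, Double> sumOfSquaredWeights(List<String> profileLines) {
        Map<String, Double> powerByUser = new TreeMap<>();
        for (String line : profileLines) {
            String[] fields = line.split("\t");
            double weight = Double.parseDouble(fields[2]);
            powerByUser.merge(fields[0], weight * weight, Double::sum);
        }
        return powerByUser;
    }

    public static void main(String[] args) {
        // Sample weights are made up for the example.
        List<String> profile = Arrays.asList(
            "u1\tweb\t3.0", "u1\tjava\t4.0", "u2\tweb\t2.0");
        System.out.println(sumOfSquaredWeights(profile));
    }
}
```

Under this assumption, taking the square root of each value would give the length of the user's profile vector.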

 Table B.3 Program Fragment of the Parallel Implementation of the Three Steps

Program Fragment 3: The Parallel Implementation of the Three Steps

/* parallelRecsys.class is the name of the self defined class file */
Properties properties = new Properties();
properties.setProperty("hadoop.job.ugi", "hadoop,hadoop");
FlowConnector.setApplicationJarClass(properties, parallelRecsys.class);
FlowConnector flowConnector = new FlowConnector(properties);
CascadeConnector connector = new CascadeConnector();
Cascade cascade = connector.connect(fUserProfiling, fNeighborhoodForming, fRecommending);
cascade.complete();
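
A cascade runs its flows in dependency order: a flow starts only after the flows that write its input have finished. The plain-Java sketch below illustrates that ordering idea with a simple readiness loop. It is not Cascading's implementation. The class FlowOrdering, the FlowSpec type and all path names are hypothetical, and the assumption that fRecommending reads the outputs of both earlier flows is made only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of dependency ordering; not Cascading code.
class FlowOrdering {

    // Minimal stand-in for a flow: its name, the path it writes and the paths it reads.
    static final class FlowSpec {
        final String name;
        final String sink;
        final Set<String> sources;

        FlowSpec(String name, String sink, String... sources) {
            this.name = name;
            this.sink = sink;
            this.sources = new HashSet<>(Arrays.asList(sources));
        }
    }

    // Returns the flow names so that every flow comes after the producers of its inputs.
    static List<String> executionOrder(List<FlowSpec> flows) {
        Set<String> producedBySomeFlow = new HashSet<>();
        for (FlowSpec flow : flows) {
            producedBySomeFlow.add(flow.sink);
        }
        Set<String> finished = new HashSet<>();
        List<FlowSpec> pending = new ArrayList<>(flows);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            boolean progressed = false;
            Iterator<FlowSpec> it = pending.iterator();
            while (it.hasNext()) {
                FlowSpec flow = it.next();
                boolean ready = true;
                for (String source : flow.sources) {
                    // An input written by another flow must already exist.
                    if (producedBySomeFlow.contains(source) && !finished.contains(source)) {
                        ready = false;
                        break;
                    }
                }
                if (ready) {
                    order.add(flow.name);
                    finished.add(flow.sink);
                    it.remove();
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("Cyclic dependency between flows");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<FlowSpec> flows = Arrays.asList(
            new FlowSpec("fRecommending", "out/recs", "out/profiles", "out/neighbours"),
            new FlowSpec("fNeighborhoodForming", "out/neighbours", "out/profiles"),
            new FlowSpec("fUserProfiling", "out/profiles", "in/tags"));
        System.out.println(executionOrder(flows));
    }
}
```

Flows that do not depend on each other are both ready in the same pass of the loop, which is where a scheduler could run them concurrently.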

Bibliography

Aciar, S., Zhang, D., Simoff, S. & Debenham, J. (2006). Recommender System Based on Consumer Product Reviews. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, 719-723

Adomavicius, G., & Tuzhilin, A. (2005). Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749

Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 207-216.

Antoniou, G., & Harmelen, F. v. (2004). A Semantic Web Primer. Cambridge: The MIT Press.

Arapakis, I., Moshfeghi, Y., Joho, H., Ren, R., Hannah, D., & Jose, J.M. (2010). Enriching User Profiling with Affective Features for the Improvement of a Multimodal Recommender System. Proceeding of the 2010 ACM International Conference on Image and Video Retrieval, 1–8

Aroyo, L., Stash, N., Wang, Y., Gorgels, P., & Rutledge, A.L. (2007). CHIP Demonstrator: Semantics-driven Recommendations and Museum Tour Generation. Lecture Notes in Computer Science, 2007, Volume 4825/2007, 879-886

Au Yeung, C.M., Gibbins, N., & Shadbolt, N. (2009). Contextualising Tags in Collaborative Tagging Systems. Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, 251-260

Au Yeung, C.M., Gibbins, N., & Shadbolt, N. (2008). A Study of User Profile Generation from Folksonomies. Proceedings of the Workshop on Social Web and Knowledge Management of WWW2008, 1-8

Au Yeung, C.M., Gibbins, N., & Shadbolt, N. (2007). Understanding the Semantics of Ambiguous Tags in Folksonomies. Proceedings of the International Workshop on Emergent Semantics and Ontology Evolution at ISWC/ASWC 2007, 108-121

Balabanović, M., & Shoham, Y. (1997). Fab: Content-Based, Collaborative Recommendation. Communications of the ACM, 40(3), 66-72

Bao, S., Wu, X., Fei, B., Xue, G., Su, Z., & Yu, Y. (2007). Optimizing Web Search Using Social Annotations. Proceedings of the 16th International Conference on World Wide Web, 501-510

Barla, M., & Bielikova, M. (2009). On Deriving Tagsonomies: Keyword Relations Coming from the Crowd. Proceedings of the 2009 International Conferences on Computational Collective Intelligence, 309–320

Barla, M., Tvarozek, M., & Bielikova, M. (2009). Rule-based User Characteristics Acquisition from Logs with Semantics for Personalized Web-based Systems. Computing and Informatics, Vol. 28, No. 4, 2009, 399–427

Begelman, G., Keller, P., & Smadja, F. (2006). Automated Tag Clustering: Improving Search and Exploration in the Tag Space. Proceedings of the Collaborative Web Tagging Workshop of WWW 2006.

Bernstein, M.S., Tan, D., Smith, G., Czerwinski, M., & Horvitz, E. (2010). Personalization via Friendsourcing. ACM Transactions on Computer-Human Interaction, Volume 17, Issue 2, 2010, Article No. 6

Bhuiyan, T., Xu, Y., Jøsang, A., & Liang, H. (2010). Developing Trust Networks Based on User Tagging Information for Recommendation Making. Proceedings of the 11th International Conference on Web Information System Engineering, 357-364

Billsus, D., & Pazzani, M.J. (2000). User Modeling for Adaptive News Access. User Modeling & User-Adapted Interaction, 10(2-3), 147-180

Bindelli, S., Criscione, C., Curino, C., Drago, M.L., Eynard, D., & Orsi, G. (2008). Improving Search and Navigation by Combining Ontologies and Social Tags. Proceedings of the 2008 OTM Confederated International Workshops, 76-85

Bischoff, K., Firan, C. S., Nejdl, W., & Paiu, R. (2008). Can All Tags be Used for Search? Proceedings of the 17th ACM international Conference on Information and Knowledge Management, 193-202

Bloedorn, E., Mani, I., & MacMillan, T.R. (1996). Machine Learning of User Profiles: Representational Issues. Proceedings of AAAI 1996, 433-438

Bonhard, P., & Sasse, M.A. (2006). 'Knowing me, knowing you' - Using Profiles and Social Networking to Improve Recommender Systems. BT Technology Journal, Volume 24, Issue 3, 2006, 84-98

Borges, J., & Levene, M. (2000). Data Mining of User Navigation Patterns. Lecture Notes in Computer Science, 2000, Volume 1836/2000, 92-112

Bogers, T., & Bosch, A. (2008). Recommending Scientific Articles Using CiteULike. Proceedings of the 2008 ACM Conference on Recommender Systems, 287–290

Brin, S., & Page, L. (1998). The Anatomy of a Large-scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, Volume 30, Issue 1-7, 1998, 107-117

Böse, J., Andrzejak, A., & Högqvist, M. (2010). Online Aggregation: Parallel and Incremental Data Mining with Online Map-Reduce. Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud of WWW 2010.

Breese, J.S., Heckerman, D., & Kadie, C. (1998). Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proceedings of Conference on Uncertainty in Artificial Intelligence, 43-52

Burke, R. (2002). Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction, 12(2002), 331-370

Burke, R. (2000). Knowledge-Based Recommender Systems, Encyclopedia of Library and Information Systems, vol. 69, Supplement 32, Marcel Dekker, 2000.

Caro, L.D., Candan, K.S., Sapino, M.L. (2008). Using tagflake for Condensing Navigable Tag Hierarchies from Tag Clouds. Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1069-1072

Cattuto, C., Schmitz, C., Baldassarri, A., Servedio, V.D.P., Loreto, V., Hotho, A., Grahl, M., & Stumme, G. (2007). Network Properties of Folksonomies. AI Communications Journal, Special Issue on "Network Analysis in Natural Sciences and Engineering", Volume 20, Issue 4, 2007, 245-262

Chen, L., & Nayak, R. (2008). Expertise Analysis in a Question Answer Portal for Author Ranking. Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, 134-140

Chen, Y., Tsai, F.S., & Chan, K.L. (2007). Blog Search and Mining in the Business Domain. Proceedings of the 2007 International Workshop on Domain Driven Data Mining, 55-60

Chen, J., Nairn, R., Nelson, L., Bernstein, M., & Chi, E. (2010). Short and Tweet: Experiments on Recommending Content from Information Streams. Proceedings of the 28th International Conference on Human Factors in Computing Systems, 1185-1194

Chen, Z., Cao, J., Song, Y., Guo, J., Zhang, Y., & Li, J. (2010). Context-oriented Web Video Tag Recommendation. Proceedings of the 19th International Conference on World Wide Web, 1079-1080

Chi, E.H. (2008). The social web: Research and Opportunities. Computer, 41(9):88-91, 2008.

Chiang, M., Wang, T., & Peng, W. (2010). Parallelizing Random Walk with Restart for Large-Scale Query Recommendation. Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud of WWW 2010.

Choi, K.S., Lee, C.H., & Rhee, P.K. (2000). Document Ontology Based Personalized Filtering System. Proceedings of the 8th ACM International Conference on Multimedia, 362–364

Christopher D.M., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, 2008.

Christopher D.M., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, 1999.

Clements, M., Vries, A.P., & Reinders, M.J.T. (2008). Detecting Synonyms in Social Tagging Systems to Improve Content Retrieval. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 739-740

Cooley, R., Srivastava, J., & Mobasher, B. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. Proceedings of the 9th IEEE International Conference of Tools with Artificial Intelligence, 558–567

Cleverdon, C.W., Mills, J., Keen, M. (1966). Factors Determining the Performance of Indexing Systems. ASLIB Cranfield project, Cranfield.

Deshpande, M., & Karypis, G. (2004). Item-Based Top-N Recommendation Algorithms. ACM Trans. Information Systems, vol. 22, no. 1, 143-177, 2004

Diakopoulos, N.A., & Shamma, D.A. (2010). Characterizing Debate Performance via Aggregated Twitter Sentiment. Proceedings of the 28th International Conference on Human Factors in Computing Systems, 1195-1198

Diederich, J. & Iofciu, T. (2006). Finding Communities of Practice from User Profiles Based On Folksonomies. Proceedings of the 1st International Workshop on Building Technology Learning Solutions for Communities of Practice.

Eda, T., Yoshikawa, M., Uchiyama, T., & Uchiyama, T. (2009). The Effectiveness of Latent Semantic Analysis for Building Up a Bottom-up Taxonomy from Folksonomy Tags. Proceedings of the 18th International Conference on World Wide Web, 421 – 440

Efron, M. (2010). Hashtag Retrieval in a Microblogging Environment. Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 787-788

Elsayed, T., Lin, J., & Oard, D. (2008). Pairwise Document Similarity in Large Collections with MapReduce. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, 265–268

Etzioni, O. (1996). The World Wide Web: Quagmire or Gold Mine? Communications of the ACM, 39, 1996, 65-68

Farrell, S., & Lau, T. (2006). Fringe Contacts: People-Tagging for the Enterprise. Proceedings of the WWW '06 Collaborative Web Tagging Workshop, 2006.

Fawcett, T., & Provost, F. (1996). Combining Data Mining and Machine Learning for Efficient User Profiling, Proceedings of the 2nd International Conference of Knowledge Discovery and Data Mining, 1996, 8-13

Felden, C. (2007). Ontology-Based User Profiling. Business Information Systems. Springer.

Friedman, T. L. (2005). The World Is Flat: A Brief History of the Twenty-First Century. Farrar, Straus and Giroux.

Fu, W., Kannampallil, T., Kang, R., & He, J. (2010). Semantic Imitation in Social Tagging. ACM Transactions on Computer-Human Interaction, Vol. 17, No. 3, Article 12, 12:1- 12:37

Gasparetti, F., & Micarelli, A. (2005). User Profile Generation Based on a Memory Retrieval Theory. The 1st International Workshop on Web Personalization, Recommender Systems and Intelligent User Interfaces, 66-75

Gemmis, M. de, Lops, P., Semeraro, G., & Basile, P. (2008). Integrating tags in a semantic content-based recommender. Proceedings of the 2008 ACM Conference on Recommender systems, 163-170

Ghemawat, S., Gobioff, H., & Leung, S. (2003). The Google File System. ACM SIGOPS Operating Systems Review Volume 37, Issue 5, 2003, 29 - 43

Goel, S., Broder, A., Gabrilovich, E., & Pang, B. (2010). Anatomy of the Long Tail: Ordinary People with Extraordinary Tastes. Proceedings of the 3rd ACM International Conference on Web Search and Data mining, 201–210

Guan, Z., Bu, J., Mei, Q., Chen, C., & Wang, C. (2009). Personalized Tag Recommendation Using Graph-based Ranking on Multi-type Interrelated Objects. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 540-547

Guha, R., McCool, R., & Miller, E. (2003). Semantic Search. Proceedings of the 2003 World Wide Web, 700-709

Gemmell, J., Ramezani, M., Schimoler, T., Christiansen, L., & Mobasher, B. (2009). The Impact of Ambiguity and Redundancy on Tag Recommendation in Folksonomies. Proceedings of the 3rd ACM Conference on Recommender Systems, 45-52

Gentili, G., Micarelli, A., & Sciarrone, F. (2003). Infoweb: An Adaptive Information Filtering System for the Cultural Heritage Domain. Applied Artificial Intelligence 17(8-9), 715-744

Golder, S.A., & Huberman, B.A. (2006). Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2):198–208, 2006.

Gruber, T. (2007). Ontology of Folksonomy: A Mash-up of Apples and Oranges. International Journal on Semantic Web and Information Systems, Vol. 3, No. 2, 2007, 1-11

Guan, Z., Wang, C., Bu, J., Chen, C., Yang, K., Cai, D., & He, X. (2010). Document Recommendation in Social Tagging Services. Proceedings of the 19th International Conference on World Wide Web, 391-400

Guy, I., Zwerdling, N., Ronen, I., Carmel, D., Uziel, E. (2010). Social Media Recommendation Based on People and Tags. Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 194-201

Hagen, P., Manning, H., & Souza, R. (1999). Smart Personalization. Cambridge: Forrester Research.

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Berlin: Springer, 2009.

Hayes, C. & Avesani, P. (2007). Using Tags and Clustering to Identify Topic-Relevant Blogs. Proceedings of the 1st International Conference on Weblogs and Social Media.

Herlocker, J.L., Konstan, J.A., Terveen, L.G., & Riedl, J.T. (2004). Evaluating Collaborative Filtering Recommender Systems, ACM Trans. Information Systems, vol. 22, no. 1, 2004, 5-53

Heymann, P., Ramage, D., & Garcia-Molina, H. (2008). Social Tag Prediction. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 531-538

Hofmann, T. (2004). Latent Semantic Models for Collaborative Filtering. ACM Transactions on Information Systems, 22(1):89–115, 2004.

Holub, M., Bielikova, M. (2010). Estimation of User Interest in Visited Web Page. Proceedings of International Conferences on World Wide Web 2010, 1111-1112

Hsieh, N.C. (2004). An Integrated Data Mining and Behavioural Scoring Model for Analysing Bank Customers. Expert Systems with Applications, Vol. 27, No. 4, 2004, 623-633

Hu, X., Zhang, X., Lu, C., Park, E.K., & Zhou, X. (2009). Exploiting Wikipedia as External Knowledge for Document Clustering. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 389-396

Huang, J., Thornton, K. M., & Efthimiadis, E. N. (2010). Conversational Tagging in Twitter. Proceedings of the 21st ACM conference on Hypertext and Hypermedia, 173-178

Jakob, N., Weber, S., Müller, M., & Gurevych, I. (2009). Beyond the Stars: Exploiting Free-Text User Reviews for Improving the Accuracy of Movie Recommendations. Proceeding of the 1st International CIKM Workshop on Topic-sentiment Analysis for Mass Opinion, 57-64

Jansen, B.J., Zhang, M., Sobel, K., & Chowdury, A. (2009). Micro-blogging as Online Word of Mouth Branding. Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems, 3859-3864

Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., & Stumme, G. (2007). Tag Recommendations in Folksonomies. Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, 506-514

Jian, C., Jian, Y., & Jin, H. (2005). Automatic Content-based Recommendation in e-commerce. Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'05), 748-753

Joachims, T. (2002). Optimizing Search Engines Using Clickthrough data. Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 133-142

Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). Accurately Interpreting Clickthrough Data as Implicit Feedback. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 154-161

Karypis, G. (2001). Evaluation of Item-Based Top-N Recommendation Algorithms. Proceedings of the Tenth International Conference on Information and Knowledge Management, 247–254

Kautz, H., Selman, B., & Shah, M. (1997). Referral Web: Combining Social Networks and Collaborative Filtering. Communications of the ACM, 40(3):63–65, 1997.

Kazienko, P., & Pilarczyk, M. (2006). Hyperlink Assessment Based on Web Usage Mining. Proceedings of the 17th Conference on Hypertext and Hypermedia, 85-88

Kleinberg, J.M. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Volume 46, Issue 5, 1999, 604–632

Koren, Y. (2008). Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 426–434

Kosala, R., & Blockeel, H. (2000). Web Mining Research: A Survey. SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, 2:1–15

Lamere, P. (2008). Social Tagging and Music Information Retrieval, Journal of New Music Research, 2008, Vol. 37, No. 2, 101–114

Li, J., & Zaïane, O.R. (2004). Combining Usage, Content, and Structure Data to Improve Web Site Recommendation, Proceedings of the 5th International Conference of Electronic Commerce and Web Technologies, 2004, 305-315

Li, X., Guo, L., & Zhao, Y.E. (2008). Tag-based Social Interest Discovery. Proceeding of the 17th International Conference on World Wide Web, 675-684

Li, Y., & Zhong, N. (2006). Mining Ontology for Automatically Acquiring Web User Information Needs. IEEE Transactions on Knowledge and Data Engineering, 18(4):554–568.

Liang, H., Hogan, J., & Xu, Y. (2010). Parallel User Profiling Based on Folksonomy for Large Scaled Recommender Systems - An Implementation of Cascading MapReduce. Proceedings of IEEE International Conference on Data Mining (ICDM) 2010 Knowledge Discovery Using Cloud and Distributed Computing Platforms workshop, 154-161

Liang, H., Xu, Y., & Li, Y. (2010). Mining Users' Opinions based on Item Folksonomy and Taxonomy for Personalized Recommender Systems. Proceedings of IEEE International Conference on Data Mining (ICDM) 2010 Topic Feature Discovery and Opinion Mining workshop, 1128-1135

Liang, H., Xu, Y., Li, Y., & Nayak, R. (2010). Personalized Recommender System Based on Item Taxonomy and Folksonomy. Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 1641-1644.

Liang, H., Xu, Y., Li, Y., & Nayak, R. (2010). Connecting Users and Items with Weighted Tags for Personalized Item Recommendations. Proceedings of the 21st ACM conference on Hypertext and Hypermedia, 51-60

Liang, H., Xu, Y., Li, Y., Nayak, R., & Weng, L. (2009). Personalized Recommender Systems Integrating Social tags and Item Taxonomy. Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, 540-547

Liang, H., Xu, Y., Li, Y., Nayak, R., & Shaw, G. (2010). A Hybrid Recommender System based on Weighted Tags. Proceedings of the 8th Workshop on Text Mining of the 10th SIAM International Conference on Data Mining, 2010

Liang, H., Xu, Y., Li, Y., & Nayak, R. (2009). Tag Based Collaborative Filtering for Recommender Systems. Proceedings of the 4th International Conference on Rough Sets and Knowledge Technology, 666-673

Liang, H., Xu, Y., Li, Y., & Nayak, R. (2009). Collaborative Filtering Recommender Systems Based on Popular Tags. Proceedings of the 14th Australasian Document Computing Symposium, 3-10

Liang, H., Xu, Y., Li, Y., & Nayak, R. (2008). Collaborative Filtering Recommender Systems Using Tag Information. Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03, 59-62

Linden, G., Smith, B., & York, J. (2003). Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Internet Computing, Volume 7, Issue 1, 2003, 76-80

Liu, B. (2010). Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, Second Edition, (editors: N. Indurkhya and F. J. Damerau), 2010.

Liu, B. (2007). Web data mining: exploring hyperlinks, contents, and usage data. Berlin: Springer, 2007.

Lu, C., Hu, X., Chen, X., Park, J., He, T., & Li, Z. (2010). The Topic-perspective Model for Social Tagging Systems. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 683-692

Malone, T.W., Grant, K.R., Turbak, F.A., Brobst, S.A., & Cohen, M.D. (1987). Intelligent Information-sharing Systems. Communications of the ACM, Volume 30, Issue 5, 1987, 390-402

Marlin, B. (2003). Modeling User Rating Profiles for Collaborative Filtering, Proceedings of the 17th Annual Conference of Neural Information Processing Systems, 2003.

Massa, P., & Avesani, P. (2007). Trust-aware Recommender Systems. Proceedings of the 2007 ACM Conference on Recommender Systems, 17-24

Mei, Q., Ling, X., Wondra, M., Su, H., & Zhai, C. (2007). Topic Sentiment Mixture: Modelling Facets and Opinions in Weblogs. Proceedings of the 2007 World Wide Web, 171-180

Melville, P., Mooney, R. J., & Nagarajan, R. (2002). Content-Boosted Collaborative Filtering for Improved Recommendations. Proceedings of the 18th National Conference on Artificial Intelligence (AAAI'02), 187-192

Micarelli, A., & Sciarrone, F. (2004). Anatomy and Empirical Evaluation of an Adaptive Web-Based Information Filtering System. User Modelling and User-Adapted Interaction, 14 (2-3), 2004, 159-200

Middleton, S.E., Shadbolt, N.R., & Roure, D.C. (2004). Ontological User Profiling in Recommender Systems, ACM Transaction of Information Systems, 22 (1), 2004, 54-88

Milicevic, A.K., Nanopoulos, A., & Ivanovic, M. (2010). Social Tagging in Recommender Systems: a Survey of the State-of-the-art and Possible Extensions. Artificial Intelligence Review. Springer Netherlands, 2010, 187-209.

Millen, D. R., Feinberg, J., & Kerr, B. (2006). Dogear: Social Bookmarking in the Enterprise. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 111-120

Minio, M., Tasso, C. (1996). User Modelling for Information Filtering on INTERNET Services: Exploiting an Extended Version of the UMT Shell. UM96 Workshop on User Modelling for Information Filtering on the World Wide Web.

Mislove, A., Viswanath, B., Gummadi, K. P., & Druschel, P. (2010). You are Who You Know: Inferring User Profiles in Online Social Networks. Proceedings of the 3rd ACM International Conference of Web Search and Data Mining, 251-260

Mitzlaff, F., Benz, D., Stumme, G., & Hotho, A. (2010). Visit Me, Click Me, Be My Friend: An Analysis of Evidence Networks of User Relationships in Bibsonomy. Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, 265-270

Mobasher, B. (2007). Data Mining for Web Personalization. In The Adaptive Web: Methods and Strategies of Web Personalization, P. Brusilovsky, A. Kobsa, and W. Nejdl, Eds. Lecture Notes in Computer Science, vol. 4321. Springer-Verlag.

Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002). Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization, Data Mining and Knowledge Discovery, vol. 6, no. 1, 2002, 61-82

Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic Personalization based on Web Usage Mining. Communications of the ACM, 43 (8), 2000, 142-151

Muller, M. J. (2007). Comparing Tagging Vocabularies Among Four Enterprise Tag-based Services. Proceedings of the 2007 International ACM Conference on Supporting Group Work, 341-350

Niwa, S., Doi, T., & Honiden, S. (2006). Web Page Recommender System Based on Folksonomy Mining. Transactions of Information Processing Society of Japan, 47(5), 2006, 1382–1392

Oard, D.W. & Kim, J. (1998). Implicit Feedback for Recommender Systems. Proc. Recommender Systems. Papers from 1998 Workshop, Technical Report WS-98-08, 1998.

Pang, B., & Lee L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, Vol. 2: No 1–2, 2008, 1-135.

Papagelis, M., Rousidis, I., Plexousakis, D., & Theoharopoulos, E. (2005). Incremental Collaborative Filtering for Highly-Scalable Recommendation Algorithms. Foundations of Intelligent Systems, Lecture Notes in Computer Science, 2005, Volume 3488/2005, 553-561

Pazzani, M. J. & Billsus, D. (2007). Content-based Recommender Systems. Lecture Notes In Computer Science, The Adaptive Web: Methods and Strategies of Web Personalization, 2007, 325-341

Phelan, O., McCarthy, K., & Smyth, B. (2009). Using Twitter to Recommend Real-time Topical News. Proceedings of the third ACM Conference on Recommender Systems, 385-388

Pilgrim, C. (2008). Improving the Usability of Web 2.0 Applications. Proceedings of the 19th ACM conference on Hypertext and Hypermedia, 239-240

Plangprasopchok, A., Lerman, K., & Getoor, L. (2010). Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 949-958

Popescu, A. & Etzioni, O. (2005). Extracting Product Features and Opinions from Reviews. Proceedings of the 2005 Conference on Empirical Methods in Natural Language Processing (EMNLP), 339 - 346

Popescul, A., Ungar, L., Pennock, D. & Lawrence, S. (2001). Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments. Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 437-444

Qu, L., Müller, C., & Gurevych, I. (2008). Using Tag Semantic Network for Key Phrase Extraction in Blogs. Proceeding of the 17th ACM Conference on Information and Knowledge Management, 1381-1382

Rainie, L. (2007). 28% of Online Americans have Used the Internet to Tag Content. Technical report, Pew Internet and American Life Project, 2007.

Rashid, A.M., Lam, S.K., Karypis, G., & Riedl, J. (2006). ClustKNN: a Highly Scalable Hybrid Model-& Memory-based CF Algorithm. Proceedings of KDD Workshop on Web Mining and Web Usage Analysis, 2006.

Rendle, S., Marinho, L.B., Nanopoulos, A., & Schmidt-Thieme, L. (2010). Learning Optimal Ranking with Tensor Factorization for Tag Recommendation. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 727-736

Said, A., Wetzker, R., Umbrath, W., Hennig, L. (2009). A Hybrid PLSA Approach for Warmer Cold Start in Folksonomy Recommendation. Proceedings of Workshop on Recommender Systems and the Social Web of the 3rd ACM Conference on Recommender Systems.

Salakhutdinov, R., Mnih, A., & Hinton, G. (2007). Restricted Boltzmann Machines for Collaborative Filtering, Proceedings of the 24th International Conference on Machine Learning, 791-798

Salton, G. & Buckley, C. (1988). Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5):513–523, 1988.

Sarwar, B., Karypis, G., Konstan, J. & Riedl, J. (2001). Item-Based Collaborative Filtering Recommendation Algorithms, Proceedings of the 10th International Conference on World Wide Web, 285 - 295

Schifanella, R., Barrat, A., Cattuto, C., Markines, B., & Menczer, F. (2010). Folks in Folksonomies: Social Link Prediction from Shared Metadata. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, 271–280

Schmitz, P. (2006). Inducing Ontology from Flickr Tags. Proceedings of the Collaborative Web Tagging Workshop at WWW2006.

Schmitz, C., Hotho, A., Jäschke, R., & Stumme, G. (2006). Mining Association Rules in Folksonomies. Data Science and Classification: Proceedings of the 10th IFCS Conference, Studies in Classification, Data Analysis, and Knowledge Organization, 2006, 261-270

Segaran, T. (2007). Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly Media, 2007.

Sen, S., Vig, J., & Riedl, J. (2009). Learning to Recognize Valuable Tags. Proceedings of the 13th International Conference on Intelligent User Interfaces, 87-96

Sen, S., Lam, S., Rashid, A., Cosley, D., Frankowski, D., Osterhouse, J., Harper, M., & Riedl, J. (2006). Tagging, Communities, Vocabulary, Evolution. Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, 181-190

Sen, S., Vig, J., & Riedl, J. (2009). Tagommenders: Connecting Users to Items through Tags. Proceedings of the 18th International Conference on World Wide Web, 671-680

Seo, Y.W., Zhang, B.T. (2000). Learning User's Preferences by Analysing Web Browsing Behaviours. Proceedings of the 4th International Conference on Autonomous Agents, 381-387

Shahabi, C., & Chen, Y.S. (2003). An Adaptive Recommendation System without Explicit Acquisition of User Relevance Feedback. Distributed and Parallel Databases, Volume 14, Number 2, 2003, 173-192

Shardanand, U. & Maes, P. (1995). Social Information Filtering: Algorithms for Automating 'Word of Mouth'. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1995, 210-217

Shaw, G., Xu, Y., & Geva S. (2009). Investigating the Use of Association Rules in Improving Recommender systems. Proceedings of the 14th Australasian Document Computing Symposium, 2009, 106-109

Shepitsen, A., Gemmell, J., Mobasher, B., & Burke, R. (2008). Personalized Recommendation in Social Tagging Systems Using Hierarchical Clustering. Proceedings of the 2008 ACM Conference on Recommender Systems, 259-266

Shmueli-Scheuer, M., Roitman, H., Carmel, D., Mass, Y., & Konopnicki, D. (2010). Extracting User Profiles from Large Scale Data. Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud of WWW 2010.

Siersdorfer, S., & Sizov, S. (2009). Social Recommender Systems for Web 2.0 Folksonomies. Proceedings of the 20th ACM conference on Hypertext and Hypermedia, 261–270

Šimko, M., & Bielikova, M. (2010). User Modeling Based on Emergent Domain Semantics. Proceedings of User Modeling, Adaptation, and Personalization 2010, 411-412

Specia, L., & Motta, E. (2007). Integrating Folksonomies with the Semantic Web. Proceedings of the 4th European Conference on The Semantic Web, 624-639

Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday, 2004.

Suchanek, F.M., Vojnović, M., & Gunawardena, D. (2008). Social Tags: Meaning and Suggestions. Proceeding of the 17th ACM Conference on Information and Knowledge Management, 223-232

Szomszor, M. N., Cantador, I., & Alani, H. (2008). Correlating User Profiles from Multiple Folksonomies. Proceedings of the 19th ACM Conference on Hypertext and Hypermedia, 33–42

Takács, G., Pilászy, I., Németh, B., & Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. The Journal of Machine Learning Research, Volume 10, 2009, 623-656

Tingle, D., Kim, Y. E., & Turnbull, D. (2010). Exploring Automatic Music Annotation with "acoustically-objective" Tags. Proceedings of the International Conference on Multimedia Information Retrieval 2010, 55-62

Titov, I., & McDonald, R. (2008). Modelling Online Reviews with Multi-grain Topic Models. Proceeding of the 17th International Conference on World Wide Web, 111-120

Tso-Sutter, K.H.L., Marinho, L.B., & Schmidt-Thieme, L. (2008). Tag-aware Recommender Systems by Fusion of Collaborative Filtering Algorithms. Proceedings of the 2008 ACM Symposium on Applied Computing, 1995-1999

Teevan, J., Dumais, S., & Horvitz, E. (2005). Personalizing Search via Automated Analysis of Interests and Activities. Proceedings of 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 449-456

Van Damme, C., Hepp, M., & Siorpaes, K. (2007). Folksontology: An Integrated Approach for Turning Folksonomies into Ontologies. Proceedings of the ESWC 2007 Workshop Bridging the Gap between Semantic Web and Web 2.0, 71-84

Vig, J., Sen, S., & Riedl, J. (2009). Tagsplanations: Explaining Recommendations Using Tags. Proceedings of the 13th International Conference on Intelligent User Interfaces, 47-56

Wang, J., De Vries, A. P., & Reinders, M.J.T. (2006). A User-Item Relevance Model for Log-based Collaborative Filtering. Proceedings of the European Conference on IR Research, 37-48

Weinberger, K. Q., Slaney, M., & Zwol, R. V. (2008). Resolving Tag Ambiguity. Proceeding of the 16th ACM International Conference on Multimedia, 111-120

Weiss, S.M., & Kulikowski, C.A. (1991). Computer Systems That Learn. Morgan Kaufmann, 1991.

Weng, L.T., Xu, Y., Li, Y., & Nayak, R. (2008). Web Information Recommendation Making based on Item Taxonomy. Proceedings of the Tenth International Conference on Enterprise Information Systems, 20-28

Weng, J., Lim, E.P., Jiang, J., & He, Q. (2010). TwitterRank: Finding Topic-sensitive Influential Twitterers. Proceedings of the third ACM International Conference on Web Search and Data Mining, 261-270

Wetzker, R., Said, A., & Zimmermann, C. (2009). Understanding the User: Personomy Translation for Tag Recommendation. Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009, 275-284

Wetzker, R., Zimmermann, C., & Bauckhage, C. (2008). Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. Proceedings of Mining Social Data Workshop of ECAI 2008, 26-30

Wetzker, R., Zimmermann, C., Bauckhage, C., & Albayrak, S. (2010). I tag, You tag, Translating Tags for Advanced User Models. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, 71-80

White, R.W., Jose, J.M., & Ruthven, I. (2001). Comparing Explicit and Implicit Feedback Techniques for Web Retrieval: TREC-10 Interactive Track Report. Proceedings of the 10th Text Retrieval Conference, 534-538

Wetzker, R., Umbrath, W., & Said, A. (2009). A Hybrid Approach to Item Recommendation in Folksonomies. Proceedings of the WSDM’09 Workshop on Exploiting Semantic Annotations in Information Retrieval, 25–29

Wu, X., Zhang, L., & Yu, Y. (2006). Exploring Social Annotations for the Semantic Web. Proceedings of the 15th International Conference on World Wide Web, 417-426

Wærn, A. (2004). User Involvement in Automatic Filtering: An Experimental Study. User Modeling and User-Adapted Interaction, 14 (2-3), 201-237

Xu, H., Zhou, X., Wang, M., Xiang, Y., & Shi, B. (2010). Exploring Flickr's Related Tags for Semantic Annotation of Web Images. Proceeding of the ACM International Conference on Image and Video Retrieval 2010.

Xu, Z., Fu, Y., Mao, J., & Su, D. (2006). Towards the semantic web: Collaborative Tag Suggestions. Proceedings of the Collaborative Web Tagging Workshop of WWW 2006.

Zaphiris, P., & Ang, C.S. (2009). Social Computing and Behavioral Modeling. Springer, 2009.

Zhang, W., Yu, C., & Meng, W. (2007). Opinion Retrieval from Blogs. Proceedings of the 16th ACM Conference on Information and Knowledge Management, 831-840

Zhang, Z., Zhou, T., & Zhang, Y. (2010). Personalized Recommendation via Integrated Diffusion on User-item-tag Tripartite Graphs. Physica A 389 (2010), Elsevier, 2010, 179-186

Zhen, Y., Li, W., & Yeung, D. (2009). TagiCoFi: Tag Informed Collaborative Filtering. Proceedings of the third ACM conference on Recommender systems, 69-76.

Zhou, T.C., Ma, H., Lyu, M.R., & King, I. (2010). UserRec: A User Recommendation Framework in Social Tagging Systems. Proceedings of the 24th AAAI Conference on Artificial Intelligence, 1486-1491

Zhu, J., Wang, C., He, X., Bu, J., Chen, C., Shang, S., Qu, M., & Lu, G. (2009). Tag-oriented Document Summarization. Proceedings of the 18th International Conference on World Wide Web, 1195-1196

Zhuang, L., Jing, F., & Zhu, X. (2006). Movie Review Mining and Summarization. Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 43-50

Ziegler, C. N., Lausen, G., & Schmidt-Thieme, L. (2004). Taxonomy-driven Computation of Product Recommendations. The 13th ACM International Conference on Information and Knowledge Management, 406-415

Zigoris, P., & Zhang, Y. (2006). Bayesian Adaptive User Profiling with Explicit & Implicit Feedback. Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 397-404

Acknowledgements

I would like to express my sincere gratitude to my principal supervisor, Dr. Yue Xu, for her invaluable expertise, encouragement, trust, tolerance, and helpful guidance throughout the journey of this research project. I am so lucky to have her as my mentor since she is so nice, gentle, charming, modest, and talented. Many thanks also go to my associate supervisors, Professor Yuefeng Li and Dr. Richi Nayak, for their generous support and comments on my work during this candidature.

I would like to thank the HPC group of QUT, especially Dr. Neil Kelson, for their great support, and Associate Professor Jim Hogan for his help with the experiments and cloud computing implementations. I want to thank all the anonymous reviewers of my papers and thesis for their valuable comments and all the members of the RecSys gmail group for the discussions about recommender systems. Without the inspiration of the example code of Liang Xiang, I would not have known the secret of scientific computing. I also would like to acknowledge the help of Li-Tung Weng and Gavin Shaw.

I wish to thank all my fellow students and friends for their friendship and company in this tough and lonely journey. Thanks to them for listening, arguing, criticizing, reminding and understanding, especially Yan Lou, Chaofeng Lin, Xiaohui Tao, Xujuan Zhou, Abdulmohsen, and Touhid. Thanks to all the people and friends I met at conferences and seminars. Thanks for the great books, movies and music that make me think, doubt, believe, laugh, and weep. Many thanks also go to the friendly people, nice sunny days and beautiful beaches in Queensland.

Finally, I wish to extend my deepest appreciation to my family members (my dear parents, sisters and brother, and lovely nephews) for their endless love and support. I wish I had had more time to stay with them to share all the joys and sadness, ups and downs in our lives. My parents had only very limited schooling because of what happened in China in the 1950s and 1960s. They have always encouraged me to get a better education. This thesis is submitted for the highest degree, but I know I have already got the best education in the world from them.
