
Social Tagging

Kristina Lerman, USC Information Sciences Institute
Thanks to Anon Plangprasopchok for providing material for this lecture.

Social Web

[Slide showing logos of Social Web sites: Delicious, Essembly, Bugzilla]

The Social Web is a platform for people to create, organize, and share information.

Create Information

• People create content (resources)
  • Text posts: …
  • Images: Picasa, …
  • Videos: YouTube, Vimeo, …
  • News stories: Digg, Slashdot, …
  • Bookmarks: Delicious, CiteULike, Bibsonomy, …
  • Personal profiles: MySpace, …
  • Maps: OpenStreetMap, …
  • Locations: Foursquare, …

Organize Information

• People organize resources
  • Annotate with metadata
    • Tags: descriptive labels
    • Geotags: geographic coordinates
  • Add to folders: organize content within personal hierarchies
    • E.g., sets and collections on Flickr
• Other types of metadata may include
  • Discussions, comments, reviews
  • Ratings, votes, …

• Social tagging is the most popular form of annotation

Social Tagging: Delicious

[Screenshot of a Delicious bookmark page, with labels pointing out the content (webpage), the user, and the tags]

Social Tagging: Flickr

[Screenshot of a Flickr photo page: the submitter adds the photo to personal albums, both private (e.g., "Mackay May 2008", "Birds") and public, and to group pools (e.g., "Birds", "Canberra", "Field Guide: Birds of the World", "Birds, Birds, Birds", "BIRDPIX (3/day)", "Australian Birds", "Birds – Kingfishers, Pittas, and Bee-eaters", "Birds of Queensland"); the photo carries the tags "Rainbow bee-eater", "Merops ornatus", "Australia", "Queensland", "Mackay Gardens", and has a discussion thread]

Share Information

• People share resources
  • Social networks: broadcast to social connections
    • Friends on Facebook, …
    • Fans/followers on Twitter, Digg, …
  • Group affiliations
  • Hotlists: emerge from collective activity
    • E.g., Digg front page, Flickr Explore, Flickr Trends, …

Social Networks: Facebook

Social Networks: Flickr

Harvesting Knowledge from Social Tagging

[Diagram of the graphs that arise in social tagging systems:
• RR graph – links between resources (web pages), the basis of PageRank
• UU graph – links between users, the subject of social network analysis
• RUT hypergraph – links among resources, users, and tags, the basis for harvesting knowledge from social tagging]

Overview

Harvesting knowledge from social tagging
• “Structure of Collaborative Tagging Systems”
  • Statistics of tagging activity
  • Consensus about the meaning of a document quickly emerges from the opinions of many users
• “Exploiting Social Annotation for Automatic Resource Discovery”
  • Learn hidden topics in a collection of tagged documents
  • Use hidden topics to find relevant documents

Social Tagging

• Tags are labels attached to content
  • Chosen from an uncontrolled personal vocabulary
  • Help users to more efficiently browse, filter, and search information
• Collaborative/social tagging
  • Anyone can attach labels to resources (not only experts or producers of content)
  • Collectively, tags represent a semantic annotation of a resource (an alternative to the Semantic Web)

Tagging and Taxonomies

• Taxonomy – hierarchical, exclusive organization of objects
  • Linnaean classification: felidae → panthera → tiger; felidae → felis → cat
  • File system: articles about cats in Africa go in c:\articles\cats, c:\articles\africa, c:\articles\africa\cats, or c:\articles\cats\africa
  • Must search multiple folders to find all relevant content
• Tagging – non-hierarchical, inclusive organization of objects
  • Articles tagged ‘cat’, ‘africa’
  • Retrieve with ‘cats’ AND ‘africa’ (see the sketch below)
  • But, will not find articles tagged only with ‘cheetah’
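To make the contrast concrete, here is a minimal sketch of tag-based retrieval as set intersection; the article names and tag assignments below are made up for illustration.

```python
# Minimal sketch of tag-based retrieval: tags map inclusively onto items,
# so an article can be found under every tag it carries
# (hypothetical article names).
articles_by_tag = {
    "cats":    {"serval-notes", "lion-survey", "cats-of-kenya"},
    "africa":  {"cats-of-kenya", "sahara-travel"},
    "cheetah": {"cheetah-speed"},
}

# 'cats' AND 'africa': intersect the tag sets instead of searching folders.
print(articles_by_tag["cats"] & articles_by_tag["africa"])   # {'cats-of-kenya'}

# Limitation noted on the slide: an article tagged only 'cheetah' is missed.
print(articles_by_tag["cats"] & articles_by_tag["cheetah"])  # set()
```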

Kinds of Tags

• What content is about (topic) – identify who or what the document is about: ‘cat’, ‘africa’
• What it is – what kind of thing it is: ‘article’, ‘book’
• Who owns it – who owns/created the content: ‘nikographer’
• Refining categories – refine or qualify other categories, especially numbers
• Identifying qualities or characteristics – express opinion: ‘funny’, ‘interesting’
• Self-reference – ‘mystuff’
• Task organizing – ‘toread’, ‘jobsearch’

Social Tagging Dimensions

• Tagging rights: who can tag?
  • Self-tagging – only the resource owner (blog posts; Flickr by convention)
  • Free-for-all – anyone can tag a resource (Delicious)
• Consolidation: assisted tag generation?
  • Blind tagging – user enters tags independently of other users
  • Suggestive tagging – system suggests tags based on annotations of other users
• Resource type
  • Text – web pages, blog posts, bibliographic material, …
  • Multimedia – images, videos, …
• Source of content
  • User-owned – e.g., images on Flickr
  • Scavenged from the Web – e.g., Delicious
• Connectivity: links between users
  • Reciprocity – undirected links (Facebook) vs directed links (Flickr, Delicious)
  • Link type – friend relationship vs contact (on Flickr) indicates degree of trust

User Motivations

What are users’ motivations to tag?
• Organizational
  • Mark items for future personal retrieval
• Social
  • Mark items for others to find, e.g., concert photos on Flickr
  • Can result in spamming
  • Express opinion, e.g., “funny” tag on a video

Collective value emerges from the tagging decisions of individual users
• How can users be incentivized to contribute high-quality annotations?

Social Tagging on del.icio.us

• del.icio.us is a social bookmarking site
• Users can tag any Web page (URL)
• Delicious suggests tags based on existing tags for the URL
• Delicious aggregates popular tags
• Anyone can see the bookmarks of others
• Users can create social links

• Value of social tagging
  • Users bookmark for their own benefit
    • Organization
    • Retrieval
  • Useful public good emerges
    • Tag suggestions
    • Lists of popular URLs and tags (hotlists)

Tagging on del.icio.us

[Screenshot of a Delicious bookmark page, again showing the content (webpage), the user, and the tags]

Dynamics of del.icio.us

• Delicious dynamics [Golder & Huberman]
  • User activity
  • Tag vocabulary growth
• Datasets
  • Bookmarks collected over 4 days in June 2005
  • Sample of users who posted bookmarks in this period

Dynamics of User Interests

• Tags reflect how a user’s interests and knowledge change over time
  • Tag1 and Tag2 are consistent interests of the user
  • Tag3 is a new interest
    • Or a new way to differentiate between concepts/interests

[Plot: number of times each tag has been used vs. bookmark number, for tag1, tag2, and tag3 of one user]

Stable Patterns in Tagging

• Consider a single URL
  • As it is tagged by more users
  • Each tag’s proportion represents the combined description of the URL by many users
  • After ~100 bookmarks, the relative frequency of each tag is fixed

[Plot: tag proportion (with respect to all tags) vs. number of bookmarks for the URL]
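As a rough illustration of how the stabilizing proportions in the plot could be computed, here is a small sketch over a hypothetical bookmark stream for one URL; the real analysis uses the Delicious data described earlier.

```python
# Running tag proportions for one URL as bookmarks accumulate.
# Each bookmark is the list of tags one user assigned to the URL
# (hypothetical data).
from collections import Counter

bookmarks = [
    ["python", "programming"],
    ["python", "tutorial"],
    ["programming", "python", "code"],
    ["python", "reference"],
]

counts = Counter()
total = 0
for i, tags in enumerate(bookmarks, start=1):
    counts.update(tags)
    total += len(tags)
    proportions = {tag: round(n / total, 2) for tag, n in counts.items()}
    print(f"after {i} bookmarks:", proportions)
# With enough bookmarks (~100 in the Golder & Huberman data),
# each tag's proportion stabilizes.
```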

Findings

• Consensus about a URL’s topics
  • Emerges quickly – after ~100 users bookmark it
  • URLs do not have to become popular for tags to be useful
  • Minority opinions can stably coexist with popular ones
  • Can be used to categorize/organize URLs
• Reasons for consensus
  • Imitation – users imitate the tag selections of others
    • But stable patterns also exist for less common tags (not shown to users)
  • Shared knowledge
    • Can we learn it?

Learning from Social Tagging/Annotation

• Annotations by an individual user may be inaccurate and incomplete…

• Annotations from many different users may complement each other, making them meaningful in aggregate

Goal: Learn concepts from social annotations created by many users

Learning Concepts from Tags

[Two photos tagged “Jaguar”, one by sparky2000 and one by A. Rohrs]

“Jaguar” = ? Animal or car?

Goal of Learning Algorithm

[Diagram: resources linked to their tags, to be grouped into concepts such as “Animal”, “Car”, “Flower”]

• Group semantically related tags and resources
• A group ~ a concept

Challenges in Learning from Annotations

• Sparse data – 4–7 tags per bookmark; 3.74 tags per photo [Rattenbury07+]
• Ambiguity – jaguar: car vs. animal
• Polysemy – window: hole in a wall vs. glass pane that resides in it
• Synonymy – kid vs. child
• Disagreement – cats\africa vs. africa\cats
• Different levels of specificity – dog vs. beagle
• Multiple facets – a bird may be tagged by appearance, location, scientific or colloquial name

Document Modeling Approaches

• ‘Bag of words’ – tf-idf
  • Document as a vector of word frequencies
  • Small reduction in document description length
  • Does not handle synonymy and polysemy
• Latent semantic indexing – LSI
  • Identifies the subspace of tf-idf features that captures most of the variance in a corpus
  • Reduction in document description length (# principal components)
  • Handles polysemy and synonymy
• Topic modeling – pLSI, LDA
  • Documents as random mixtures over (hidden) topics, where each topic is a distribution over words
  • Large reduction in description length (# topics)
  • Inference: given a document corpus, estimate the parameters of the model
    • Compute the distribution of hidden topics given a document
  • See the sketch below
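A small sketch contrasting the three representations on a toy corpus, assuming scikit-learn is available; the documents and parameter choices are illustrative only, not those used in the lecture.

```python
# Minimal sketch comparing tf-idf, LSI, and LDA on a toy corpus
# (hypothetical documents; assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = [
    "jaguar car engine speed",
    "jaguar animal jungle cat",
    "cheetah cat africa animal",
    "car engine wheels speed",
]

# 1) Bag of words with tf-idf: one weighted term vector per document.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)            # shape: (n_docs, n_terms)

# 2) LSI: project tf-idf vectors onto a low-dimensional latent subspace.
lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X_tfidf)             # shape: (n_docs, 2)

# 3) LDA: model each document as a mixture over hidden topics;
#    LDA works on raw term counts rather than tf-idf weights.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)         # per-document topic distributions
print(doc_topics.round(2))                     # rows sum to ~1
```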

A Stochastic Process of Word Generation

pLSI (Hofmann 1999); LDA (Blei et al. 2003)

[Diagram of the generative process: each document (r) has a distribution over possible topics (z); each topic has a distribution over possible words; the observed words (t) are generated by first drawing a topic and then drawing a word from that topic]
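A toy sketch of the generative story in the diagram, with a made-up vocabulary and hand-set topic–word distributions (assumes numpy); it is an illustration of the process, not the inference procedure.

```python
# Toy sketch of the LDA generative story (made-up vocabulary and topics).
import numpy as np

rng = np.random.default_rng(0)

vocab = ["travel", "flight", "map", "video", "torrent", "airline"]
# Each topic is a distribution over the vocabulary (phi); rows sum to 1.
phi = np.array([
    [0.40, 0.35, 0.05, 0.05, 0.05, 0.10],   # "travel" topic
    [0.05, 0.05, 0.05, 0.45, 0.35, 0.05],   # "video/p2p" topic
])
alpha = [0.5, 0.5]                            # Dirichlet prior over topics

def generate_document(n_words=8):
    theta = rng.dirichlet(alpha)              # document's mixture over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)   # draw a topic for this word
        t = rng.choice(len(vocab), p=phi[z])  # draw a word from that topic
        words.append(vocab[t])
    return theta, words

theta, words = generate_document()
print(theta.round(2), words)
```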

Learned Topics

High-probability words in each learned topic:
• travel, flights, airline, flight, airlines, guide, aviation, …
• map, maps, world, earth, latitude, longitude, directions, address, geography, distance, zip, usa, gmaps, atlas, …
• video, download, bittorrent, p2p, youtube, media, torrent, torrents, movies, …

Apply LDA to Tagging

[Diagram: each resource (document) is represented by its bag of tags (words); LDA groups the tags into concepts such as “Animal”, “Car”, “Flower”]
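A sketch of this idea: treat each resource's tag list as a document and run LDA over it. The tag data below is invented and scikit-learn is assumed; the lecture's own experiments use the Delicious corpus described later.

```python
# Treat each resource's tag set as a "document" and its tags as "words",
# then run LDA (hypothetical tag data; assumes a recent scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tagged_resources = [
    ["jaguar", "animal", "zoo", "cat"],
    ["jaguar", "car", "engine", "speed"],
    ["tulip", "flower", "garden", "spring"],
    ["lion", "animal", "safari", "cat"],
]

# Each document is already a list of tags, so the analyzer is the identity.
vec = CountVectorizer(analyzer=lambda tags: tags)
X = vec.fit_transform(tagged_resources)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
resource_topics = lda.fit_transform(X)          # per-resource concept mixture

# Show the top tags for each learned concept.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"concept {k}:", [terms[i] for i in top])
```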

Application to Resource Discovery

• Resource discovery
  • Given a seed source, find other data sources that provide the same functionality
  • E.g., find geocoders like http://geocoder.us, which returns the geographic coordinates of a specified US address
• Benefits
  • Increase robustness of information integration (II) applications
    • If http://geocoder.us fails, substitute another source
  • Increase coverage of II applications
    • http://geocoder.ca geocodes US AND Canadian addresses

Source Discovery and Modeling [Ambite et al., 2009]

[Pipeline diagram: starting from a seed URL (e.g., the unisys weather source) and background knowledge (sample input values such as zip code “90254”, definitions of known sources, domain types, patterns), the system performs discovery and extraction (find candidate sources such as http://wunderground.com and invoke them with sample inputs), semantic typing of the extracted values, e.g., unisys(Zip, Temp, Humidity, …), and source modeling, learning a definition such as unisys(Zip, Temp, …) :- weather(Zip, …, Temp, Hi, Lo)]

Exploiting Social Annotation for Resource Discovery

Approach: Use topic modeling of social annotation obtained from Delicious to find sources similar to a given seed URL

[Pipeline diagram: Seed URL → obtain an annotation corpus (URLs, tags, users) from Delicious → probabilistic learning (LDA) to learn a model of concepts → compute each candidate URL’s distribution over concepts → compute similarity to the seed → rank candidate URLs by similarity]

Corpus of Annotated Resources

• Crawling strategy (sketched below)
  • For each seed, retrieve the 20 most popular tags
  • For each tag, retrieve the sources annotated with that tag
  • For each source, retrieve all of its tags
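A sketch of this crawling loop. The `client` object and its methods are hypothetical stand-ins for the (now defunct) Delicious API, shown only to make the three retrieval steps explicit.

```python
# Sketch of the corpus-crawling strategy (hypothetical client functions;
# the API shown here is illustrative, not the real Delicious service).
def build_corpus(seed_url, client):
    corpus = {}                                          # url -> list of tags
    for tag in client.popular_tags(seed_url, limit=20):  # 20 popular tags of the seed
        for url in client.urls_for_tag(tag):             # sources annotated with that tag
            if url not in corpus:
                corpus[url] = client.tags_for_url(url)   # all tags of each source
    return corpus
```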

Topic Modeling of Social Annotations

• Use LDA to learn 80 topics in each corpus
• The distribution over topics is used to compute the similarity of a target URL to the seed (see the sketch below)
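A sketch of the ranking step: compare each candidate URL's topic distribution to the seed's and sort by similarity. Cosine similarity is used here as one simple choice of measure, and the data is made up; the lecture does not prescribe this exact function.

```python
# Rank candidate URLs by the similarity of their LDA topic distributions
# to the seed's (hypothetical data; assumes numpy).
import numpy as np

def cosine(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def rank_candidates(seed_topics, candidate_topics):
    """candidate_topics: dict mapping URL -> topic distribution."""
    scored = [(cosine(seed_topics, dist), url)
              for url, dist in candidate_topics.items()]
    return [url for _, url in sorted(scored, reverse=True)]

seed = [0.7, 0.2, 0.1]
candidates = {"http://geocoder.ca": [0.6, 0.3, 0.1],
              "http://example.org": [0.1, 0.1, 0.8]}
print(rank_candidates(seed, candidates))
```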

Source Discovery Results

• Manually label the top 100 URLs ranked by similarity to the seed URL
• Compare to Google’s “find similar URLs” functionality

[Charts comparing the labeled results of the two approaches]

Discussion

• Users express their knowledge through the tags they create while annotating content
• Apply document modeling techniques to social annotation data
  • Infer hidden topics in annotated data
  • Use topics for the source discovery task
  • Outperforms standard Web search
• Next – extract more complex types of knowledge from social annotations
  • Sentiment
  • …