?? o – p. 1/ Department of Computer Science — University of San Francisc University of San Francisco Department of Computer Science Chris Brooks and Nancy Montanez

15th International Conference Autotagging and Hierarchical Clustering Improved Annotation of the via Tags

• Taggging has recently become a popular method for annotating and organizing entries. • Allows users to attach keywords to blog entries, and share these annotations with others. • Easy to use and intuitive. • But what tasks are tags useful for? • More specifically, do tags help as an information retrieval mechanism?

Department of Computer Science — University of San Francisco – p. 2/?? Shared Tags and

• Tags have (at least) three clear uses: ◦ Individual organization ◦ Shared annotation of articles into categories ◦ Shared annotation as an aid to searching • We are more interested in tags as a mechanism for sharing information. • : the meaning associated with a will evolve and coalesce through community usage.

Department of Computer Science — University of San Francisco – p. 3/?? Popular tags

About Me, Acne News, Actualite, Actualites, Actualites et politique, Advertising, Allmant, All Posts, amazon, Amigos, amor, Amusement, Anime, An- nouncements, Articles/News, Asides, Asterisk, audio, Babes, Babes On Flickr, Baby, Baseball, Blogging, , book, books, Business, Car, Car Insurance, Cars, category, Cell Phones, China, Cinema, Cine cinema, Comics, Computadores e a Internet, Computer, Computers, Computers and Internet, Computers en internet, Computing, CSS, Curiosidades, Current events, Data Recovery, days, Development, diario, Directory, Divertissement, Dogs, dreams, Entertainment, Entretenimento, Entretenimiento, Environment, etc, Europe, Event, EveryDay, Everything, F1, fAcTs, Family, fashion, Feeling, Feelings, FF11, FFXI, Film, Firefox, Flash, Flickr, Flutes, Food and Drink, Football, foreign-exchange, Foreign Exchange, Fotos, Friends, Fun, Funny, general, Game, Games, Gaming, Generale, General news, General Posting, General webmaster threads, Geral, Golf, Google, gossip, Hardware, Health and wellness, Health Insurance, History, hobbies, Hobby, Home, Humor, Hurricane Katrina, Info, Informatica e Internet, Interna- tional, Internet, In The News, Intrattenimento, Java, jeux, Jewelry, jogos,Journal, , Juegos, kat-tun, Katrina, Knitting, Law, Legislation, libros, Life, Links, Live, Livres, Livros, London, Love, Love Poems, Lyrics, Musica, Macintosh, Marketing, MassCops Recent Topics, Me, Media, meme, memes, memo, metblogs, metroblogging, Military, Misc, Misc., miscellaneous, MobLog, Mood, Movie, Movies, murmur, Music, Musica, Musik, Musings, Musique, Muziek, My blog, Nature, News and politics, Noticias e politica, Opinion, Ordinateurs et Internet, Organizacoes, Organizaciones, Organiza- tions, others, Pasatiempos, Passatempos, PC, Pensamentos, Pensamientos, People, Personal, Philosophy, photo, Pictures, , Poem, poemas, Poesia, Poker, police headlines, Politik, Projects, Quotes, Radio, Ramblings, random, Randomness, Random thoughts, Rant, Real Estate, Recipes, reflexiones, reizen, Relationships, Research, Resources, Review, RO, RSS, Saude e bem-estar, Salud y bienestar, Sante et bien-etre, School, Science, Search, Sex, sexy, Shopping, Site news, Society, software, Spam, Stories, stuff, Tech News, technology, Television, Terrorism, test, Tips, Tools, Travel, Updates, USA, Viagens, Viajes, Video, Videos, VoIP, Votes, Voyages, War, Weather, Weblog, Website, weight loss, Whatever, Windows, Wireless, wordpress, words, Work, World news, Writing The 250 most popular tags on Technorati, as of October 6, 2005 • Things to notice: ◦ Tags tend to be general terms ◦ Synonyms and related concepts are repeated ◦ Misspellings, and different cases ◦ Jargon, slang, spam, and Non-English words ◦ Non-useful tags (everything, etc, random, test)

Department of Computer Science — University of San Francisco – p. 4/?? Representational Power

• A tag is a label that is applied to a set of blog entries. • There is no way to specify relationships between tags ◦ Opposite, more general/specific, synonym, etc • In logical terms, tags are a propositional mechanism. ◦ This should set off some alarms amongst the AI people in the audience! • We see users trying to use tags more expressively ◦ e.g. “San Francisco, California” ◦ This can’t be decomposed, or related to the tag “San Francisco” or the tag “California” ◦ Maybe tags are not quite so easy to use ...

Department of Computer Science — University of San Francisco – p. 5/?? Tags as an Information Retrieval Mechanism

• In this paper, we tried to determine whether tags were useful as an information retrieval mechanism. ◦ Specifically, can tags help with a search task? • How similar are articles that are assigned the same tags? ◦ Hypothesis: Rarer tags are better at describing articles than more specific tags.

Department of Computer Science — University of San Francisco – p. 6/?? Tags as an Information Retrieval Mechanism

• Retrieved the top 350 tags from Technorati, and then the 250 most recent articles for each tag. • Articles are converted into weighted vectors, using TFIDF to assign weights to each word. • All articles that share a tag are assigned to a tag cluster • The size of a tag cluster is measured using the average pairwise cosine similarity. ◦ Note: the actual content of the documents is what is evaluated.

Department of Computer Science — University of San Francisco – p. 7/?? How Similar are Tag Clusters?

• Articles with the same tag Similarity Amoung Blogs using Popular Tags 250 Blogs per Tag 1 are somewhat similar. 0.9 • Small spike amongst highly 0.8 popular tags. (game, games, 0.7

0.6 vote) 0.5 • Contrary to expectations, 0.4 Cosine Similarity

0.3 articles with rare tags are not

0.2 more similar than articles 0.1 with common tags.

0 0 50 100 150 200 250 300 350 Tag Rank (or Popularity) • But how similar are these clusters?

Department of Computer Science — University of San Francisco – p. 8/?? Baselines

• Tagging clusters articles Similarity Amoung Random Clusters of Blogs 2,500 Blogs Total 1 better than random 0.9

0.8 selection, but worse than

0.7

0.6 Google News.

0.5

0.4 Cosine Similarity • Tagging seems most

0.3

0.2 effective at grouping 0.1 articles into broad topical 0 0 5 10 15 20 25 30 35 40 45 50 Index of Clusters (50 Blogs per Cluster) bins. Similarity Amoung Documents from Google News 10 Different News Topics, 30 Articles per Topic 1 • Not very effective as a 0.9

0.8 mechanism for locating 0.7 particular articles. 0.6

0.5

0.4 Cosine Similarity

0.3

0.2

0.1

0 1 2 3 4 5 6 7 8 9 10 Index of News Topic Clusters (Roughly 30 Articles per Topic)

Department of Computer Science — University of San Francisco – p. 9/?? Autotagging

• Perhaps users are not very good at choosing tags for search - can automated methods do better? • Autotagging is the process of automatically assigning tags based on the content of an article. • Hypothesis: To determine what an article is about, look at the article itself! • Assign TFIDF scores to all words and extract the highest-scoring words.

Department of Computer Science — University of San Francisco – p. 10/?? Autotagging

• We also extracted the top Similarity Amoung Blogs with the Same Top TFIDF Keywords 500 Blogs Total 1 three highest-scoring words 0.9 from each article and 0.8

0.7 assigned them as tags.

0.6

0.5 • Clusters formed using these

0.4 Cosine Similarity words were smaller and 0.3

0.2 much more similar than 0.1 clusters using user-chosen 0 0 20 40 60 80 100 120 Index of TFIDF Keyword Clusters keywords. Pairwise similarity of clusters of articles shar- • Tags extracted from user ing a highly-scored word. text are more helpful in creating specific categories than user-selected tags are.

Department of Computer Science — University of San Francisco – p. 11/?? Generating Hierarchies of Tags

• Tags are unable to express related concepts. • Do related articles have tags judged as similar by a human? • To address this question, we use agglomerative clustering to construct a tag hierarchy. • Goal: identify and group tags that are similar or related.

Department of Computer Science — University of San Francisco – p. 12/?? Agglomerative Clustering

• The agglomerative clustering algorithm is very straightforward: • Find the two closest tag clusters and merge them into a single abstract cluster. Repeat until one cluster containing all tags remains. • This yields a dendrogram showing tag similarities.

• Tags = {t1, t2, ..., tn} • while |Tags| > 1 :

◦ find ti, tj s.t. sim(ti, tj ) >= sim(ti, tk)∀k 6= i, k 6= j ◦ tnew = ti ∪ tj

◦ tags = tags − {ti, tj} ◦ tags = tags ∪ tnew

Department of Computer Science — University of San Francisco – p. 13/?? Positive results: locating related tags t t p e c o n ro o c . . . T u i s e n iT p s k J n u B d f o E F I L o y r I A t o C a I e c n a r u s u n e c n a r r u s n N s w e i l t h I s s e w o g e c n a r u s n i t i p P n o r c s e r R f e c n e e r e h t H l a e L i l i t i P F i l g e n s a o e n r g o o y c i c t i l P o s h C d i W e s e e n a e n d i r n g ga e n d d P d d G r n ga e n o s n r a n a e • Clustering is able to G d Hd o r n a n a e e m construct groups of l P a n o s r e tags that might be l i D a s w e n y characterized as J k o s e b t A u o e m i h A t g y n n “related” by a human. M t o S y y r i F l s g e n e i F l e g e n d B m e r o o i t s n o m o e M d o o M d a y y y a s s e P m e o t P e y r o t i l y r a r e i f l n s o s e r e p x s e

Department of Computer Science — University of San Francisco – p. 14/?? Negative results: shared vocabulary

• Using vectors of single words to represent documents can produce anomalies ◦ Both politics and games talk about scores, opponents, and winning. Department of Computer Science — University of San Francisco – p. 15/?? Negative results: syntactic problems

• “diary” and “dairy” are seen as closely related. ◦ Misspelling in the tag. ◦ Illustrates a problem with the current representational power of tags. ◦ Your tags are only as good as your users! ◦ (aside: many community blogs have frequent discussions

about “appropriate” tagging vocabularies)Department of Computer Science — University of San Francisco – p. 16/?? Conclusions

• Tags are very attractive due to their simplicity and ease of use. • Limited representational power makes them most useful for grouping into large categories. • By themselves, tags do not seem very effective as a search mechanism. • Tags can be grouped using clustering techniques, which indicates that relationships can be induced automatically. • Needed: tools for increasing expressivity without sacrificing ease of use. ◦ Expressing relationships, suggesting appropriate tags, catching misspellings, automatically grouping tags.

Department of Computer Science — University of San Francisco – p. 17/?? Future Work

• Current experiments only provide an approximate picture of cluster similarity. ◦ Phrase extraction would produce more precise results. ◦ Other metrics should be evaluated. • Currently developing tools that suggest tags based on article similarity and hierarchy. • Question: do authors and readers use the same tag vocabulary? • Thanks to Technorati for the use of their data.

Department of Computer Science — University of San Francisco – p. 18/??