Improved Annotation of the Blogosphere Via Autotagging and Hierarchical Clustering
Total Page:16
File Type:pdf, Size:1020Kb
Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering Christopher H. Brooks Nancy Montanez Computer Science Department Computer Science Department University of San Francisco University of San Francisco 2130 Fulton St. 2130 Fulton St. San Francisco, CA 94117-1080 San Francisco, CA 94117-1080 [email protected] [email protected] ABSTRACT of tagging. Tags are collections of keywords that are attached to Tags have recently become popular as a means of annotating and blog entries, ostensibly to help describe the entry. While tagging organizing Web pages and blog entries. Advocates of tagging argue has become very popular, and tags can be found on many popular that the use of tags produces a ’folksonomy’, a system in which the blogs, there has not been (to our knowledge) much analysis devoted meaning of a tag is determined by its use among the community as to the question of whether tags are an effective organizational tool a whole. We analyze the effectiveness of tags for classifying blog for blogs, what functions tags are well suited for, or the broader entries by gathering the top 350 tags from Technorati and measur- question of how tags can benefit authors and readers of blogs. ing the similarity of all articles that share a tag. We find that tags are In this paper, we discuss some initial experiments that aim to useful for grouping articles into broad categories, but less effective determine what tasks are suitable for tags, how blog authors are in indicating the particular content of an article. We then show that using tags, and whether tags are effective as an information re- automatically extracting words deemed to be highly relevant can trieval mechanism. We examine blog entries indexed by Techno- produce a more focused categorization of articles. We also show rati and compare the similarity of articles that share tags to de- that clustering algorithms can be used to reconstruct a topical hier- termine whether articles that have the same tags actually contain archy among tags, and suggest that these approaches may be used similar content. We compare the similarity of articles that share to address some of the weaknesses in current tagging systems. tags to clusters of randomly-selected articles and also to clusters of articles that share most-relevant keywords, as determined using TFIDF. We find that tagging seems to be most effective at placing Categories and Subject Descriptors articles into broad categories, but is less effective as a tool for indi- H.3.1 [Information Systems]: Content Analysis and Indexing cating an article’s specific content. We speculate that this is in part ; H.3.3 [Information Systems]: Search and Retrieval due to tags’ relatively weak representational power. We then show how clustering algorithms can be used with existing tags to con- General Terms struct a hierarchy of tags that looks very much like those created by a human, and suggest that this may be a solution to the argu- Experimentation, Algorithms ment between advocates of handbuilt taxonomies and supporters of folksonomies and their emergent meaning. We then conclude with Keywords a discussion of future work, focusing on increasing the expressivity blogs, tagging, automated annotation, hierarchical clustering of tags without losing their ease of use. 1. INTRODUCTION 2. BACKGROUND In the past few years, weblogs (or, more colloquially, blogs) have Tags are keywords that can be assigned to a document or object emerged as a means of decentralized publishing; they have success- as a simple form of metadata. Typically, users do not have the fully combined the accessibility of the Web with an ease-of-use that ability to specify relations between tags. Instead, they serve as a has made it possible for large numbers of people to quickly and set of atomic symbols that are associated with a document. Unlike easily disseminate their opinions to a wide audience. Blogs have the sorts of schemes traditionally used in libraries, in which users quickly developed a large and wide-reaching impact, from leaking select keywords from a predefined list, a user can choose any string the details of upcoming products, games, and TV shows to helping to use as a tag. shape policy to influencing U.S. Presidential elections. The idea of tagging is not new; photo-organizing tools have had As with any new source of information, as more people begin this for years, and HTML has had the ability to allow META key- blogging, tools are needed to help users organize and make sense words to describe a document since HTML 2.0 [3] in 1996. How- of all of the blogs, bloggers and blog entries in the blogosphere (the ever, the idea of using tags to annotate entries has recently become most commonly-used term for the space of blogs as a whole). One quite popular within the blogging community, with sites like Tech- recently popular phenomenon in the blogosphere (and in the Web norati1 indexing blogs according to tags, and sites like Furl2 and more generally) that addresses this issue has been the introduction Delicious 3 providing users with the ability to assign tags to web Copyright is held by the International World Wide Web Conference Com- 1 mittee (IW3C2). Distribution of these papers is limited to classroom use, http://www.technorati.com 2 and personal use by others. http://www.furl.com WWW 2006, May 23–26, 2006, Edinburgh, Scotland. 3http://del.icio.us ACM 1-59593-323-9/06/0005. pages and, most importantly, to share these tags with each other. to their results. As of yet, no such function exists in the blogo- Tags have also proved to be very popular in the photo-sharing com- sphere. There is also a risk that the discussion can become balka- munity, with Flickr4 being the most notable example. nized; authors within the blogosphere may be more likely to ref- This idea of sharing tags leads to a concept known as “folkson- erence other blogs or online works, whereas authors of journal or omy” [11, 9], which is intended to capture the notion that the proper conference papers will be more inclined to cite papers with “tra- usage or accepted meaning of a tag is determined by the practicing ditional” academic credentials. One goal of this paper is begin to community, as opposed to being decreed by a committee. Advo- create a bridge between these communities and move some of the cates of folksonomy argue that allowing the meaning of a tag to discussion regarding tagging and folksonomies into the traditional emerge through collective usage produces a more accurate mean- academic publishing venues. ing than if it was defined by a single person or body. Advocates Tags, in the sense that they are used in Technorati and Delicious, of folksonomies as an organizational tool, such as Quintarelli [9], are propositional entities; that is, they are symbols with no mean- argue that, since the creation of content is decentralized, the de- ing in the context of the system apart from their relation to the scription of that content should also be decentralized. They also documents they annotate. It is not possible in these systems to de- argue that centrally-defined, hierarchical classification schemes are scribe relationships between tags (such as ‘opposite’, ‘similar’, or too inflexible and rigid for application to classifying Web data (in ‘superset’) or to specify semantics for a tag, apart from the fact particular blogs), and that a better approach is to allow the “mean- that it has been assigned to a group of articles. While this would ing” of a tag to be defined through its usage by the tagging commu- seem like a very weak language for describing documents, tagging nity. This, it is argued, provides a degree of flexibility and fluidity advocates point to its ease of use as a factor in the adoption of tag- that is not possible with an agreed-upon hierarchical structure, such ging. Tagging advocates argue that a less-powerful but widely used as that provided by the Library of Congress’ system for cataloging system such as tagging provides this collaborative determination books. of meaning, and is superior to more powerful but less-widely-used To some extent, the idea of folksonomy (which is an argument systems (the RDF vision of the Semantic Web is typically offered for subjectivity in meaning that has existed in the linguistics com- as an alternative, often in a straw-man sort of way). Tagging ad- munity for years) is distinct from the particular choice of tags as vocates also argue that any externally-imposed set of hierarchical a representational structure, although in practice the concepts are definitions will be too limiting for some users, and that the lack often conflated. There’s no a priori reason why a folksonomy must of structure provides users with the ability to use tags to fit their consist entirely of a flat space of atomic symbols, but this point needs. This presents a question: what needs are tags well-suited to is typically contested by tagging advocates. Quintarelli [9] and address? Mathes [7] both argue that a hierarchical representation of topics In this paper, we focus on the issue of tags as a means of annotat- does not reflect the associative nature of information, and that a ing and categorizing blog entries. Blog entries are a very different hand-built taxonomy will likely be too brittle to accurately repre- domain from photos or even webpages. Blog entries are more like sent users’ interests. As to the first point, we would argue that there “traditional” documents than webpages; they typically have a nar- is no conflict between indicating subclass/superclass relationships rative structure, few hyperlinks, and a more “flat” organization, as and allowing for associative relations, as can be seen in representa- opposed to web pages, which often contain navigational elements, tion schemes such as Description Logic [1].